Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-04-01 Thread Alex Williamson
On Mon, 1 Apr 2019 04:40:03 -0400
Yan Zhao  wrote:

> On Mon, Apr 01, 2019 at 04:14:30PM +0800, Cornelia Huck wrote:
> > On Wed, 27 Mar 2019 16:10:20 -0600
> > Alex Williamson  wrote:
> >   
> > > On Wed, 27 Mar 2019 20:18:54 +
> > > "Dr. David Alan Gilbert"  wrote:
> > >   
> > > > * Zhao Yan (yan.y.z...@intel.com) wrote:
> > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong
> > > > > > > > > > device or version of device?  Or say to a device with older
> > > > > > > > > > firmware, or perhaps a device that has less device memory?
> > > > > > > > > Actually it's still an open issue for VFIO migration. We need
> > > > > > > > > to think about whether it's better to check that in libvirt or
> > > > > > > > > qemu (like a device magic along with a version?).
> > > > > > > 
> > > > > > > We must keep the hardware generation the same within one POD of
> > > > > > > public cloud providers. But we are still thinking about live
> > > > > > > migration from a lower generation of hardware to a higher generation.
> > > > > > 
> > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > support.
> > > > > > 
> > > > > > But regardless of that, I think we need to make sure that 
> > > > > > incompatible
> > > > > > devices/versions fail directly instead of failing in a subtle, hard 
> > > > > > to
> > > > > > debug way. Might be useful to do some initial sanity checks in 
> > > > > > libvirt
> > > > > > as well.
> > > > > > 
> > > > > > How easy is it to obtain that information in a form that can be
> > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > What about some kind of revision?  
> > > > > hi Alex and Cornelia
> > > > > for device compatibility, do you think it's a good idea to use 
> > > > > "version"
> > > > > and "device version" fields?
> > > > > 
> > > > > version field: identifies the live migration interface's version. It
> > > > > can have a sort of backward compatibility, like target machine's
> > > > > version >= source machine's version, something like that.
> > > 
> > > Don't we essentially already have this via the device specific region?
> > > The struct vfio_info_cap_header includes id and version fields, so we
> > > can declare a migration id and increment the version for any
> > > incompatible changes to the protocol.
> > >   
> > > > > 
> > > > > device_version field consists of two parts:
> > > > > 1. vendor id: it takes 32 bits, e.g. 0x8086.
> > > 
> > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > suggest we use a bit to flag it as such so we can reserve that portion
> > > of the 32bit address space.  See for example:
> > > 
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
> > > 
> > > For vendor specific regions.  
> > 
> > Just browsing through the thread... if I don't misunderstand, we could
> > use a vfio-ccw region type id here for ccw, couldn't we? Just to make
> > sure that this is not pci-specific.  
> CCW could use another bit other than bit 31?
> e.g.
> #define VFIO_REGION_TYPE_CCW_VENDOR_TYPE (1 << 30)
> then a ccw device uses (VFIO_REGION_TYPE_CCW_VENDOR_TYPE | vendor id) as the
> first 32 bits of its device version string.
> 
> But as Alex said, we'll not provide an extra region to get the device
> version, and the device version is only exported in sysfs, so probably we
> should define them as below:
> #define VFIO_DEVICE_VERSION_TYPE_PCI (1<<31)
> #define VFIO_DEVICE_VERSION_TYPE_CCW (1<<30)
> 
> Do you think it's ok?

We already had this discussion for device specific regions and decided
that CCW doesn't have enough vendors to justify a full subset of the
available address space.  Also, this doesn't need to imply the device
interface; we're simply specifying a vendor registrar such that we can
give each vendor their own namespace, so I don't think it would be a
problem for a CCW device to specify a namespace using a PCI vendor ID.
Finally, if we do have such a need in the future (I'm not sure where we
stand with this in the current proposals), maybe we should make use of
an IEEE OUI rather than the PCI database to avoid this sort of
confusion and mis-association.  Thanks,

Alex
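
For illustration only, the OUI idea could look something like the below;
the flag bit and layout are hypothetical, not from any proposal:

/* Hypothetical layout: a 24-bit IEEE OUI in the low bits of the 32-bit
 * field, with a dedicated flag bit marking the namespace as OUI-based. */
#define VFIO_DEVICE_VERSION_TYPE_OUI  (1 << 29)
#define VFIO_DEVICE_VERSION_OUI_MASK  (0xffffff)

/* __u32 ns = VFIO_DEVICE_VERSION_TYPE_OUI | vendor_oui; */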



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-04-01 Thread Yan Zhao
On Mon, Apr 01, 2019 at 04:14:30PM +0800, Cornelia Huck wrote:
> On Wed, 27 Mar 2019 16:10:20 -0600
> Alex Williamson  wrote:
> 
> > On Wed, 27 Mar 2019 20:18:54 +
> > "Dr. David Alan Gilbert"  wrote:
> > 
> > > * Zhao Yan (yan.y.z...@intel.com) wrote:  
> > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > > > > >   b) How do we detect if we're migrating from/to the wrong
> > > > > > > > > device or version of device?  Or say to a device with older
> > > > > > > > > firmware, or perhaps a device that has less device memory?
> > > > > > > > Actually it's still an open issue for VFIO migration. We need to
> > > > > > > > think about whether it's better to check that in libvirt or qemu
> > > > > > > > (like a device magic along with a version?).
> > > > > > 
> > > > > > We must keep the hardware generation the same within one POD of
> > > > > > public cloud providers. But we are still thinking about live
> > > > > > migration from a lower generation of hardware to a higher generation.
> > > > > 
> > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > support.
> > > > > 
> > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > as well.
> > > > > 
> > > > > How easy is it to obtain that information in a form that can be
> > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > What about some kind of revision?
> > > > hi Alex and Cornelia
> > > > for device compatibility, do you think it's a good idea to use "version"
> > > > and "device version" fields?
> > > > 
> > > > version field: identifies the live migration interface's version. It
> > > > can have a sort of backward compatibility, like target machine's
> > > > version >= source machine's version, something like that.
> > 
> > Don't we essentially already have this via the device specific region?
> > The struct vfio_info_cap_header includes id and version fields, so we
> > can declare a migration id and increment the version for any
> > incompatible changes to the protocol.
> > 
> > > > 
> > > > device_version field consists of two parts:
> > > > 1. vendor id: it takes 32 bits, e.g. 0x8086.
> > 
> > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > suggest we use a bit to flag it as such so we can reserve that portion
> > of the 32bit address space.  See for example:
> > 
> > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
> > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
> > 
> > For vendor specific regions.
> 
> Just browsing through the thread... if I don't misunderstand, we could
> use a vfio-ccw region type id here for ccw, couldn't we? Just to make
> sure that this is not pci-specific.
CCW could use another bit other than bit 31?
e.g.
#define VFIO_REGION_TYPE_CCW_VENDOR_TYPE (1 << 30)
then a ccw device uses (VFIO_REGION_TYPE_CCW_VENDOR_TYPE | vendor id) as the
first 32 bits of its device version string.

But as Alex said, we'll not provide an extra region to get the device
version, and the device version is only exported in sysfs, so probably we
should define them as below:
#define VFIO_DEVICE_VERSION_TYPE_PCI (1<<31)
#define VFIO_DEVICE_VERSION_TYPE_CCW (1<<30)

Do you think it's ok?
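
As a minimal sketch, assuming the defines above, composing that leading
32-bit prefix could look like this (the helper name is illustrative only):

#include <stdint.h>

#define VFIO_DEVICE_VERSION_TYPE_PCI (1u << 31)

/* Illustrative: build the proposed leading 32 bits of a device
 * version string from the PCI namespace flag and a PCI vendor ID. */
static inline uint32_t device_version_prefix(uint16_t pci_vendor_id)
{
        return VFIO_DEVICE_VERSION_TYPE_PCI | pci_vendor_id;
}

/* e.g. device_version_prefix(0x8086) for an Intel device */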

Thanks
Yan



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-04-01 Thread Cornelia Huck
On Wed, 27 Mar 2019 16:10:20 -0600
Alex Williamson  wrote:

> On Wed, 27 Mar 2019 20:18:54 +
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Zhao Yan (yan.y.z...@intel.com) wrote:  
> > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > > > >   b) How do we detect if we're migrating from/to the wrong
> > > > > > > > device or version of device?  Or say to a device with older
> > > > > > > > firmware, or perhaps a device that has less device memory?
> > > > > > > Actually it's still an open issue for VFIO migration. We need to
> > > > > > > think about whether it's better to check that in libvirt or qemu
> > > > > > > (like a device magic along with a version?).
> > > > > 
> > > > > We must keep the hardware generation the same within one POD of
> > > > > public cloud providers. But we are still thinking about live
> > > > > migration from a lower generation of hardware to a higher generation.
> > > > 
> > > > Agreed, lower->higher is the one direction that might make sense to
> > > > support.
> > > > 
> > > > But regardless of that, I think we need to make sure that incompatible
> > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > as well.
> > > > 
> > > > How easy is it to obtain that information in a form that can be
> > > > consumed by higher layers? Can we find out the device type at least?
> > > > What about some kind of revision?
> > > hi Alex and Cornelia
> > > for device compatibility, do you think it's a good idea to use "version"
> > > and "device version" fields?
> > > 
> > > version field: identifies the live migration interface's version. It
> > > can have a sort of backward compatibility, like target machine's
> > > version >= source machine's version, something like that.
> 
> Don't we essentially already have this via the device specific region?
> The struct vfio_info_cap_header includes id and version fields, so we
> can declare a migration id and increment the version for any
> incompatible changes to the protocol.
> 
> > > 
> > > device_version field consists of two parts:
> > > 1. vendor id: it takes 32 bits, e.g. 0x8086.
> 
> Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> suggest we use a bit to flag it as such so we can reserve that portion
> of the 32bit address space.  See for example:
> 
> #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
> 
> For vendor specific regions.

Just browsing through the thread... if I don't misunderstand, we could
use a vfio-ccw region type id here for ccw, couldn't we? Just to make
sure that this is not pci-specific.



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-31 Thread Zhao Yan
On Sat, Mar 30, 2019 at 10:14:07PM +0800, Alex Williamson wrote:
> On Fri, 29 Mar 2019 19:10:50 -0400
> Zhao Yan  wrote:
> 
> > On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> > > On Thu, 28 Mar 2019 22:47:04 -0400
> > > Zhao Yan  wrote:
> > >   
> > > > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:  
> > > > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > > > Erik Skultety  wrote:
> > > > > 
> > > > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> > > > > > > hi Alex and Dave,
> > > > > > > Thanks for your replies.
> > > > > > > Please see my comments inline.
> > > > > > >
> > > > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:  
> > > > > > > 
> > > > > > > > On Wed, 27 Mar 2019 20:18:54 +
> > > > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > > > >  
> > > > > > > > > * Zhao Yan (yan.y.z...@intel.com) wrote:  
> > > > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck 
> > > > > > > > > > wrote:  
> > > > > > > > > > > > > > >   b) How do we detect if we're migrating from/to
> > > > > > > > > > > > > > > the wrong device or version of device?  Or say to
> > > > > > > > > > > > > > > a device with older firmware, or perhaps a device
> > > > > > > > > > > > > > > that has less device memory?
> > > > > > > > > > > > > > Actually it's still an open issue for VFIO migration.
> > > > > > > > > > > > > > We need to think about whether it's better to check
> > > > > > > > > > > > > > that in libvirt or qemu (like a device magic along
> > > > > > > > > > > > > > with a version?).
> > > > > > > > > > > >
> > > > > > > > > > > > We must keep the hardware generation the same within
> > > > > > > > > > > > one POD of public cloud providers. But we are still
> > > > > > > > > > > > thinking about live migration from a lower generation
> > > > > > > > > > > > of hardware to a higher generation.
> > > > > > > > > > >
> > > > > > > > > > > Agreed, lower->higher is the one direction that might 
> > > > > > > > > > > make sense to
> > > > > > > > > > > support.
> > > > > > > > > > >
> > > > > > > > > > > But regardless of that, I think we need to make sure that 
> > > > > > > > > > > incompatible
> > > > > > > > > > > devices/versions fail directly instead of failing in a 
> > > > > > > > > > > subtle, hard to
> > > > > > > > > > > debug way. Might be useful to do some initial sanity 
> > > > > > > > > > > checks in libvirt
> > > > > > > > > > > as well.
> > > > > > > > > > >
> > > > > > > > > > > How easy is it to obtain that information in a form that 
> > > > > > > > > > > can be
> > > > > > > > > > > consumed by higher layers? Can we find out the device 
> > > > > > > > > > > type at least?
> > > > > > > > > > > What about some kind of revision?  
> > > > > > > > > > hi Alex and Cornelia
> > > > > > > > > > for device compatibility, do you think it's a good idea to 
> > > > > > > > > > use "version"
> > > > > > > > > > and "device version" fields?
> > > > > > > > > >
> > > > > > > > > > version field: identifies the live migration interface's
> > > > > > > > > > version. It can have a sort of backward compatibility, like
> > > > > > > > > > target machine's version >= source machine's version,
> > > > > > > > > > something like that.
> > > > > > > >
> > > > > > > > Don't we essentially already have this via the device specific 
> > > > > > > > region?
> > > > > > > > The struct vfio_info_cap_header includes id and version fields, 
> > > > > > > > so we
> > > > > > > > can declare a migration id and increment the version for any
> > > > > > > > incompatible changes to the protocol.  
> > > > > > > yes, good idea!
> > > > > > > so, what about declaring the new cap below?
> > > > > > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > > > > struct vfio_region_info_cap_migration {
> > > > > > >         struct vfio_info_cap_header header;
> > > > > > >         __u32 device_version_len;
> > > > > > >         __u8  device_version[];
> > > > > > > };
> > > > > 
> > > > > I'm not sure why we need a new region for everything, it seems this
> > > > > could fit within the protocol of a single region.  This could simply 
> > > > > be
> > > > > a new action to retrieve the version where the protocol would report
> > > > > the number of bytes available, just like the migration stream itself. 
> > > > >
> > > > so, to get the version of the VFIO live migration device state
> > > > interface (simply call it the migration interface?),
> > > > a new cap looks like this:
> > > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > it contains struct vfio_info_cap_header only.
> > > > when getting region info of the migration region, we query this cap
> > > > and get the migration interface's version. right?
> > > > 
> > > > or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-30 Thread Alex Williamson
On Fri, 29 Mar 2019 19:10:50 -0400
Zhao Yan  wrote:

> On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> > On Thu, 28 Mar 2019 22:47:04 -0400
> > Zhao Yan  wrote:
> >   
> > > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:  
> > > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > > Erik Skultety  wrote:
> > > > 
> > > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> > > > > > hi Alex and Dave,
> > > > > > Thanks for your replies.
> > > > > > Please see my comments inline.
> > > > > >
> > > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> > > > > >   
> > > > > > > On Wed, 27 Mar 2019 20:18:54 +
> > > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > > >  
> > > > > > > > * Zhao Yan (yan.y.z...@intel.com) wrote:  
> > > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck 
> > > > > > > > > wrote:  
> > > > > > > > > > > > > >   b) How do we detect if we're migrating from/to
> > > > > > > > > > > > > > the wrong device or version of device?  Or say to a
> > > > > > > > > > > > > > device with older firmware, or perhaps a device
> > > > > > > > > > > > > > that has less device memory?
> > > > > > > > > > > > > Actually it's still an open issue for VFIO migration.
> > > > > > > > > > > > > We need to think about whether it's better to check
> > > > > > > > > > > > > that in libvirt or qemu (like a device magic along
> > > > > > > > > > > > > with a version?).
> > > > > > > > > > >
> > > > > > > > > > > We must keep the hardware generation the same within one
> > > > > > > > > > > POD of public cloud providers. But we are still thinking
> > > > > > > > > > > about live migration from a lower generation of hardware
> > > > > > > > > > > to a higher generation.
> > > > > > > > > >
> > > > > > > > > > Agreed, lower->higher is the one direction that might make 
> > > > > > > > > > sense to
> > > > > > > > > > support.
> > > > > > > > > >
> > > > > > > > > > But regardless of that, I think we need to make sure that 
> > > > > > > > > > incompatible
> > > > > > > > > > devices/versions fail directly instead of failing in a 
> > > > > > > > > > subtle, hard to
> > > > > > > > > > debug way. Might be useful to do some initial sanity checks 
> > > > > > > > > > in libvirt
> > > > > > > > > > as well.
> > > > > > > > > >
> > > > > > > > > > How easy is it to obtain that information in a form that 
> > > > > > > > > > can be
> > > > > > > > > > consumed by higher layers? Can we find out the device type 
> > > > > > > > > > at least?
> > > > > > > > > > What about some kind of revision?  
> > > > > > > > > hi Alex and Cornelia
> > > > > > > > > for device compatibility, do you think it's a good idea to 
> > > > > > > > > use "version"
> > > > > > > > > and "device version" fields?
> > > > > > > > >
> > > > > > > > > version field: identifies the live migration interface's
> > > > > > > > > version. It can have a sort of backward compatibility, like
> > > > > > > > > target machine's version >= source machine's version,
> > > > > > > > > something like that.
> > > > > > >
> > > > > > > Don't we essentially already have this via the device specific 
> > > > > > > region?
> > > > > > > The struct vfio_info_cap_header includes id and version fields, 
> > > > > > > so we
> > > > > > > can declare a migration id and increment the version for any
> > > > > > > incompatible changes to the protocol.  
> > > > > > yes, good idea!
> > > > > > so, what about declaring the new cap below?
> > > > > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > > > struct vfio_region_info_cap_migration {
> > > > > >         struct vfio_info_cap_header header;
> > > > > >         __u32 device_version_len;
> > > > > >         __u8  device_version[];
> > > > > > };
> > > > 
> > > > I'm not sure why we need a new region for everything, it seems this
> > > > could fit within the protocol of a single region.  This could simply be
> > > > a new action to retrieve the version where the protocol would report
> > > > the number of bytes available, just like the migration stream itself.   
> > > >  
> > > so, to get the version of the VFIO live migration device state interface
> > > (simply call it the migration interface?),
> > > a new cap looks like this:
> > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > it contains struct vfio_info_cap_header only.
> > > when getting region info of the migration region, we query this cap and
> > > get the migration interface's version. right?
> > > 
> > > or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?  
> > 
> > Again, why a new region?  I'm imagining we have one region and this is
> > just asking for a slightly different thing from it.  But TBH, I'm not
> > sure we need it at all vs the sysfs interface.
> >   
> > > > > > > > > device_version field 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-29 Thread Zhao Yan
On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> On Thu, 28 Mar 2019 22:47:04 -0400
> Zhao Yan  wrote:
> 
> > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > Erik Skultety  wrote:
> > >   
> > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:  
> > > > > hi Alex and Dave,
> > > > > Thanks for your replies.
> > > > > Please see my comments inline.
> > > > >
> > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> > > > > > On Wed, 27 Mar 2019 20:18:54 +
> > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > >
> > > > > > > * Zhao Yan (yan.y.z...@intel.com) wrote:
> > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > >   
> > > > > > > > > > > > >   b) How do we detect if we're migrating from/to the
> > > > > > > > > > > > > wrong device or version of device?  Or say to a
> > > > > > > > > > > > > device with older firmware, or perhaps a device that
> > > > > > > > > > > > > has less device memory?
> > > > > > > > > > > > Actually it's still an open issue for VFIO migration.
> > > > > > > > > > > > We need to think about whether it's better to check
> > > > > > > > > > > > that in libvirt or qemu (like a device magic along
> > > > > > > > > > > > with a version?).
> > > > > > > > > >
> > > > > > > > > > We must keep the hardware generation the same within one
> > > > > > > > > > POD of public cloud providers. But we are still thinking
> > > > > > > > > > about live migration from a lower generation of hardware
> > > > > > > > > > to a higher generation.
> > > > > > > > >
> > > > > > > > > Agreed, lower->higher is the one direction that might make 
> > > > > > > > > sense to
> > > > > > > > > support.
> > > > > > > > >
> > > > > > > > > But regardless of that, I think we need to make sure that 
> > > > > > > > > incompatible
> > > > > > > > > devices/versions fail directly instead of failing in a 
> > > > > > > > > subtle, hard to
> > > > > > > > > debug way. Might be useful to do some initial sanity checks 
> > > > > > > > > in libvirt
> > > > > > > > > as well.
> > > > > > > > >
> > > > > > > > > How easy is it to obtain that information in a form that can 
> > > > > > > > > be
> > > > > > > > > consumed by higher layers? Can we find out the device type at 
> > > > > > > > > least?
> > > > > > > > > What about some kind of revision?
> > > > > > > > hi Alex and Cornelia
> > > > > > > > for device compatibility, do you think it's a good idea to use 
> > > > > > > > "version"
> > > > > > > > and "device version" fields?
> > > > > > > >
> > > > > > > > version field: identifies the live migration interface's
> > > > > > > > version. It can have a sort of backward compatibility, like
> > > > > > > > target machine's version >= source machine's version,
> > > > > > > > something like that.
> > > > > >
> > > > > > Don't we essentially already have this via the device specific 
> > > > > > region?
> > > > > > The struct vfio_info_cap_header includes id and version fields, so 
> > > > > > we
> > > > > > can declare a migration id and increment the version for any
> > > > > > incompatible changes to the protocol.
> > > > > yes, good idea!
> > > > > so, what about declaring the new cap below?
> > > > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > > struct vfio_region_info_cap_migration {
> > > > >         struct vfio_info_cap_header header;
> > > > >         __u32 device_version_len;
> > > > >         __u8  device_version[];
> > > > > };
> > > 
> > > I'm not sure why we need a new region for everything, it seems this
> > > could fit within the protocol of a single region.  This could simply be
> > > a new action to retrieve the version where the protocol would report
> > > the number of bytes available, just like the migration stream itself.  
> > so, to get the version of the VFIO live migration device state interface
> > (simply call it the migration interface?),
> > a new cap looks like this:
> > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > it contains struct vfio_info_cap_header only.
> > when getting region info of the migration region, we query this cap and
> > get the migration interface's version. right?
> > 
> > or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?
> 
> Again, why a new region?  I'm imagining we have one region and this is
> just asking for a slightly different thing from it.  But TBH, I'm not
> sure we need it at all vs the sysfs interface.
> 
> > > > > > > > device_version field consists of two parts:
> > > > > > > > 1. vendor id: it takes 32 bits, e.g. 0x8086.
> > > > > >
> > > > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > > > suggest we use a bit to flag it as such so we can reserve that 
> > > > > > portion
> > > > > > of the 32bit address 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-29 Thread Alex Williamson
On Thu, 28 Mar 2019 22:47:04 -0400
Zhao Yan  wrote:

> On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> > On Thu, 28 Mar 2019 10:21:38 +0100
> > Erik Skultety  wrote:
> >   
> > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:  
> > > > hi Alex and Dave,
> > > > Thanks for your replies.
> > > > Please see my comments inline.
> > > >
> > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> > > > > On Wed, 27 Mar 2019 20:18:54 +
> > > > > "Dr. David Alan Gilbert"  wrote:
> > > > >
> > > > > > * Zhao Yan (yan.y.z...@intel.com) wrote:
> > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > > > > > > > >   b) How do we detect if we're migrating from/to the
> > > > > > > > > > > > wrong device or version of device?  Or say to a device
> > > > > > > > > > > > with older firmware, or perhaps a device that has less
> > > > > > > > > > > > device memory?
> > > > > > > > > > > Actually it's still an open issue for VFIO migration. We
> > > > > > > > > > > need to think about whether it's better to check that in
> > > > > > > > > > > libvirt or qemu (like a device magic along with a
> > > > > > > > > > > version?).
> > > > > > > > >
> > > > > > > > > We must keep the hardware generation the same within one POD
> > > > > > > > > of public cloud providers. But we are still thinking about
> > > > > > > > > live migration from a lower generation of hardware to a
> > > > > > > > > higher generation.
> > > > > > > >
> > > > > > > > Agreed, lower->higher is the one direction that might make 
> > > > > > > > sense to
> > > > > > > > support.
> > > > > > > >
> > > > > > > > But regardless of that, I think we need to make sure that 
> > > > > > > > incompatible
> > > > > > > > devices/versions fail directly instead of failing in a subtle, 
> > > > > > > > hard to
> > > > > > > > debug way. Might be useful to do some initial sanity checks in 
> > > > > > > > libvirt
> > > > > > > > as well.
> > > > > > > >
> > > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > > consumed by higher layers? Can we find out the device type at 
> > > > > > > > least?
> > > > > > > > What about some kind of revision?
> > > > > > > hi Alex and Cornelia
> > > > > > > for device compatibility, do you think it's a good idea to use 
> > > > > > > "version"
> > > > > > > and "device version" fields?
> > > > > > >
> > > > > > > version field: identifies the live migration interface's version.
> > > > > > > It can have a sort of backward compatibility, like target
> > > > > > > machine's version >= source machine's version, something like that.
> > > > >
> > > > > Don't we essentially already have this via the device specific region?
> > > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > > can declare a migration id and increment the version for any
> > > > > incompatible changes to the protocol.
> > > > yes, good idea!
> > > > so, what about declaring the new cap below?
> > > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > struct vfio_region_info_cap_migration {
> > > >         struct vfio_info_cap_header header;
> > > >         __u32 device_version_len;
> > > >         __u8  device_version[];
> > > > };
> > 
> > I'm not sure why we need a new region for everything, it seems this
> > could fit within the protocol of a single region.  This could simply be
> > a new action to retrieve the version where the protocol would report
> > the number of bytes available, just like the migration stream itself.  
> so, to get the version of the VFIO live migration device state interface
> (simply call it the migration interface?),
> a new cap looks like this:
> #define VFIO_REGION_INFO_CAP_MIGRATION 4
> it contains struct vfio_info_cap_header only.
> when getting region info of the migration region, we query this cap and get
> the migration interface's version. right?
> 
> or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?

Again, why a new region?  I'm imagining we have one region and this is
just asking for a slightly different thing from it.  But TBH, I'm not
sure we need it at all vs the sysfs interface.

> > > > > > > device_version field consists of two parts:
> > > > > > > 1. vendor id: it takes 32 bits, e.g. 0x8086.
> > > > >
> > > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > > suggest we use a bit to flag it as such so we can reserve that portion
> > > > > of the 32bit address space.  See for example:
> > > > >
> > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
> > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
> > > > >
> > > > > For vendor specific regions.
> > > > Yes, use PCI vendor ID.
> > > > you are right, we need to use the highest bit
> > > > (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > > to 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-28 Thread Zhao Yan
On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> On Thu, 28 Mar 2019 10:21:38 +0100
> Erik Skultety  wrote:
> 
> > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> > > hi Alex and Dave,
> > > Thanks for your replies.
> > > Please see my comments inline.
> > >
> > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:  
> > > > On Wed, 27 Mar 2019 20:18:54 +
> > > > "Dr. David Alan Gilbert"  wrote:
> > > >  
> > > > > * Zhao Yan (yan.y.z...@intel.com) wrote:  
> > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > > > > >   b) How do we detect if we're migrating from/to the
> > > > > > > > > > > wrong device or version of device?  Or say to a device
> > > > > > > > > > > with older firmware, or perhaps a device that has less
> > > > > > > > > > > device memory?
> > > > > > > > > > Actually it's still an open issue for VFIO migration. We
> > > > > > > > > > need to think about whether it's better to check that in
> > > > > > > > > > libvirt or qemu (like a device magic along with a
> > > > > > > > > > version?).
> > > > > > > >
> > > > > > > > We must keep the hardware generation the same within one POD
> > > > > > > > of public cloud providers. But we are still thinking about
> > > > > > > > live migration from a lower generation of hardware to a
> > > > > > > > higher generation.
> > > > > > >
> > > > > > > Agreed, lower->higher is the one direction that might make sense 
> > > > > > > to
> > > > > > > support.
> > > > > > >
> > > > > > > But regardless of that, I think we need to make sure that 
> > > > > > > incompatible
> > > > > > > devices/versions fail directly instead of failing in a subtle, 
> > > > > > > hard to
> > > > > > > debug way. Might be useful to do some initial sanity checks in 
> > > > > > > libvirt
> > > > > > > as well.
> > > > > > >
> > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > consumed by higher layers? Can we find out the device type at 
> > > > > > > least?
> > > > > > > What about some kind of revision?  
> > > > > > hi Alex and Cornelia
> > > > > > for device compatibility, do you think it's a good idea to use 
> > > > > > "version"
> > > > > > and "device version" fields?
> > > > > >
> > > > > > version field: identifies the live migration interface's version.
> > > > > > It can have a sort of backward compatibility, like target machine's
> > > > > > version >= source machine's version, something like that.
> > > >
> > > > Don't we essentially already have this via the device specific region?
> > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > can declare a migration id and increment the version for any
> > > > incompatible changes to the protocol.  
> > > yes, good idea!
> > > so, what about declaring the new cap below?
> > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > struct vfio_region_info_cap_migration {
> > >         struct vfio_info_cap_header header;
> > >         __u32 device_version_len;
> > >         __u8  device_version[];
> > > };
> 
> I'm not sure why we need a new region for everything, it seems this
> could fit within the protocol of a single region.  This could simply be
> a new action to retrieve the version where the protocol would report
> the number of bytes available, just like the migration stream itself.
so, to get the version of the VFIO live migration device state interface
(simply call it the migration interface?),
a new cap looks like this:
#define VFIO_REGION_INFO_CAP_MIGRATION 4
it contains struct vfio_info_cap_header only.
when getting region info of the migration region, we query this cap and get
the migration interface's version. right?

or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?


> > > > > > device_version field consists of two parts:
> > > > > > 1. vendor id: it takes 32 bits, e.g. 0x8086.
> > > >
> > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > suggest we use a bit to flag it as such so we can reserve that portion
> > > > of the 32bit address space.  See for example:
> > > >
> > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
> > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
> > > >
> > > > For vendor specific regions.  
> > > Yes, use PCI vendor ID.
> > > you are right, we need to use the highest bit
> > > (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > to identify it's a PCI ID.
> > > Thanks for pointing it out.
> > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > > why is it 0xffff? I searched QEMU and kernel code and did not find
> > > anywhere that uses it.
> 
> PCI vendor IDs are 16 bits; it's just indicating that when the
> PCI_VENDOR_TYPE bit is set, the valid data is the lower 16 bits.

thanks:)

> > > > > > 2. vendor proprietary string: it can be any string that a vendor 
> > > > > > driver
> 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-28 Thread Alex Williamson
On Thu, 28 Mar 2019 10:21:38 +0100
Erik Skultety  wrote:

> On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> > hi Alex and Dave,
> > Thanks for your replies.
> > Please see my comments inline.
> >
> > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:  
> > > On Wed, 27 Mar 2019 20:18:54 +
> > > "Dr. David Alan Gilbert"  wrote:
> > >  
> > > > * Zhao Yan (yan.y.z...@intel.com) wrote:  
> > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong
> > > > > > > > > > device or version of device?  Or say to a device with older
> > > > > > > > > > firmware, or perhaps a device that has less device memory?
> > > > > > > > > Actually it's still an open issue for VFIO migration. We need
> > > > > > > > > to think about whether it's better to check that in libvirt or
> > > > > > > > > qemu (like a device magic along with a version?).
> > > > > > >
> > > > > > > We must keep the hardware generation the same within one POD of
> > > > > > > public cloud providers. But we are still thinking about live
> > > > > > > migration from a lower generation of hardware to a higher generation.
> > > > > >
> > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > support.
> > > > > >
> > > > > > But regardless of that, I think we need to make sure that 
> > > > > > incompatible
> > > > > > devices/versions fail directly instead of failing in a subtle, hard 
> > > > > > to
> > > > > > debug way. Might be useful to do some initial sanity checks in 
> > > > > > libvirt
> > > > > > as well.
> > > > > >
> > > > > > How easy is it to obtain that information in a form that can be
> > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > What about some kind of revision?  
> > > > > hi Alex and Cornelia
> > > > > for device compatibility, do you think it's a good idea to use 
> > > > > "version"
> > > > > and "device version" fields?
> > > > >
> > > > > version field: identifies the live migration interface's version. It
> > > > > can have a sort of backward compatibility, like target machine's
> > > > > version >= source machine's version, something like that.
> > >
> > > Don't we essentially already have this via the device specific region?
> > > The struct vfio_info_cap_header includes id and version fields, so we
> > > can declare a migration id and increment the version for any
> > > incompatible changes to the protocol.  
> > yes, good idea!
> > so, what about declaring the new cap below?
> > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > struct vfio_region_info_cap_migration {
> >         struct vfio_info_cap_header header;
> >         __u32 device_version_len;
> >         __u8  device_version[];
> > };

I'm not sure why we need a new region for everything, it seems this
could fit within the protocol of a single region.  This could simply be
a new action to retrieve the version where the protocol would report
the number of bytes available, just like the migration stream itself.

> > > > > device_version field consists of two parts:
> > > > > 1. vendor id: it takes 32 bits, e.g. 0x8086.
> > >
> > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > suggest we use a bit to flag it as such so we can reserve that portion
> > > of the 32bit address space.  See for example:
> > >
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
> > >
> > > For vendor specific regions.  
> > Yes, use PCI vendor ID.
> > you are right, we need to use the highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > to identify it's a PCI ID.
> > Thanks for pointing it out.
> > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > why is it 0xffff? I searched QEMU and kernel code and did not find
> > anywhere that uses it.

PCI vendor IDs are 16 bits; it's just indicating that when the
PCI_VENDOR_TYPE bit is set, the valid data is the lower 16 bits.
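
A minimal sketch of that flag-plus-mask convention (the defines are from
the vfio UAPI header; the helper names are illustrative only):

#include <stdint.h>

#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1u << 31)
#define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)

/* Illustrative: tag a 16-bit PCI vendor ID as a PCI-vendor namespace. */
static inline uint32_t pci_vendor_type(uint16_t vendor)
{
        return VFIO_REGION_TYPE_PCI_VENDOR_TYPE | vendor;
}

/* Illustrative: recover the vendor ID when the flag bit is set. */
static inline uint16_t pci_vendor_id(uint32_t type)
{
        return (uint16_t)(type & VFIO_REGION_TYPE_PCI_VENDOR_MASK);
}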

> > > > > 2. vendor proprietary string: it can be any string that a vendor 
> > > > > driver
> > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > >
> > > > >
> > > > > > struct vfio_device_state_ctl {
> > > > > >         __u32 version;          /* ro */
> > > > > >         __u8 device_version[MAX_DEVICE_VERSION_LEN];    /* ro */
> > > > > >         struct {
> > > > > >                 __u32 action;   /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE */
> > > > > >                 ...
> > > > > >         } data;
> > > > > >         ...
> > > > > > };
> > >
> > > We have a buffer area where we can read and write data from the vendor
> > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > device version 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-28 Thread Erik Skultety
On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> hi Alex and Dave,
> Thanks for your replies.
> Please see my comments inline.
>
> On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> > On Wed, 27 Mar 2019 20:18:54 +
> > "Dr. David Alan Gilbert"  wrote:
> >
> > > * Zhao Yan (yan.y.z...@intel.com) wrote:
> > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > > > > >   b) How do we detect if we're migrating from/to the wrong
> > > > > > > > > device or version of device?  Or say to a device with older
> > > > > > > > > firmware, or perhaps a device that has less device memory?
> > > > > > > > Actually it's still an open issue for VFIO migration. We need to
> > > > > > > > think about whether it's better to check that in libvirt or qemu
> > > > > > > > (like a device magic along with a version?).
> > > > > >
> > > > > > We must keep the hardware generation the same within one POD of
> > > > > > public cloud providers. But we are still thinking about live
> > > > > > migration from a lower generation of hardware to a higher generation.
> > > > >
> > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > support.
> > > > >
> > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > as well.
> > > > >
> > > > > How easy is it to obtain that information in a form that can be
> > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > What about some kind of revision?
> > > > hi Alex and Cornelia
> > > > for device compatibility, do you think it's a good idea to use "version"
> > > > and "device version" fields?
> > > >
> > > > version field: identifies the live migration interface's version. It
> > > > can have a sort of backward compatibility, like target machine's
> > > > version >= source machine's version, something like that.
> >
> > Don't we essentially already have this via the device specific region?
> > The struct vfio_info_cap_header includes id and version fields, so we
> > can declare a migration id and increment the version for any
> > incompatible changes to the protocol.
> yes, good idea!
> so, what about declaring the new cap below?
> #define VFIO_REGION_INFO_CAP_MIGRATION 4
> struct vfio_region_info_cap_migration {
>         struct vfio_info_cap_header header;
>         __u32 device_version_len;
>         __u8  device_version[];
> };
>
>
> > > >
> > > > device_version field consists of two parts:
> > > > 1. vendor id: it takes 32 bits, e.g. 0x8086.
> >
> > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > suggest we use a bit to flag it as such so we can reserve that portion
> > of the 32bit address space.  See for example:
> >
> > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
> > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
> >
> > For vendor specific regions.
> Yes, use PCI vendor ID.
> you are right, we need to use the highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> to identify it's a PCI ID.
> Thanks for pointing it out.
> But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> why is it 0xffff? I searched QEMU and kernel code and did not find
> anywhere that uses it.
>
>
> > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > >
> > > >
> > > > struct vfio_device_state_ctl {
> > > >         __u32 version;          /* ro */
> > > >         __u8 device_version[MAX_DEVICE_VERSION_LEN];    /* ro */
> > > >         struct {
> > > >                 __u32 action;   /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE */
> > > >                 ...
> > > >         } data;
> > > >         ...
> > > > };
> >
> > We have a buffer area where we can read and write data from the vendor
> > driver, why would we have this IS_COMPATIBLE protocol to write the
> > device version string but use a static fixed length version string in
> > the control header to read it?  IOW, let's use GET_VERSION,
> > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > specific version information length.
> you are right, such a static fixed-length version string is bad :)
> To get the device version, which approach below do you think is better?
> 1. use GET_VERSION action, and read from region buffer
> 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
>
> it seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is
> only for checking the migration interface's version?
>
> > > >
> > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > >
> > > > The flow to figure 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-28 Thread Zhao Yan
hi Alex and Dave,
Thanks for your replies.
Please see my comments inline.

On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> On Wed, 27 Mar 2019 20:18:54 +
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Zhao Yan (yan.y.z...@intel.com) wrote:
> > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > >   b) How do we detect if we're migrating from/to the wrong
> > > > > > > > device or version of device?  Or say to a device with older
> > > > > > > > firmware, or perhaps a device that has less device memory?
> > > > > > > Actually it's still an open issue for VFIO migration. We need to
> > > > > > > think about whether it's better to check that in libvirt or qemu
> > > > > > > (like a device magic along with a version?).
> > > > > 
> > > > > We must keep the hardware generation the same within one POD of
> > > > > public cloud providers. But we are still thinking about live
> > > > > migration from a lower generation of hardware to a higher generation.
> > > > 
> > > > Agreed, lower->higher is the one direction that might make sense to
> > > > support.
> > > > 
> > > > But regardless of that, I think we need to make sure that incompatible
> > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > as well.
> > > > 
> > > > How easy is it to obtain that information in a form that can be
> > > > consumed by higher layers? Can we find out the device type at least?
> > > > What about some kind of revision?  
> > > hi Alex and Cornelia
> > > for device compatibility, do you think it's a good idea to use "version"
> > > and "device version" fields?
> > > 
> > > version field: identifies the live migration interface's version. It
> > > can have a sort of backward compatibility, like target machine's
> > > version >= source machine's version, something like that.
> 
> Don't we essentially already have this via the device specific region?
> The struct vfio_info_cap_header includes id and version fields, so we
> can declare a migration id and increment the version for any
> incompatible changes to the protocol.
yes, good idea!
so, what about declaring the new cap below?
#define VFIO_REGION_INFO_CAP_MIGRATION 4
struct vfio_region_info_cap_migration {
        struct vfio_info_cap_header header;
        __u32 device_version_len;
        __u8  device_version[];
};
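
For context, userspace would discover such a cap by walking the capability
chain that VFIO_DEVICE_GET_REGION_INFO returns. A minimal sketch, assuming
the proposed cap above (error handling omitted):

#include <linux/vfio.h>

/* Walk the cap chain appended to a vfio_region_info buffer (valid when
 * info->flags has VFIO_REGION_INFO_FLAG_CAPS set and argsz was large
 * enough) and return the header matching 'id', or NULL. */
static struct vfio_info_cap_header *
find_region_cap(struct vfio_region_info *info, __u16 id)
{
        __u32 off = info->cap_offset;

        while (off) {
                struct vfio_info_cap_header *hdr =
                        (struct vfio_info_cap_header *)((char *)info + off);

                if (hdr->id == id)
                        return hdr;
                off = hdr->next;
        }
        return NULL;
}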


> > > 
> > > device_version field consists of two parts:
> > > 1. vendor id: it takes 32 bits, e.g. 0x8086.
> 
> Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> suggest we use a bit to flag it as such so we can reserve that portion
> of the 32bit address space.  See for example:
> 
> #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
> 
> For vendor specific regions.
Yes, use PCI vendor ID.
you are right, we need to use the highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
to identify it's a PCI ID.
Thanks for pointing it out.
But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
why is it 0xffff? I searched QEMU and kernel code and did not find
anywhere that uses it.


> > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > thinks can identify a source device. e.g. pciid + mdev type.
> > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > 
> > > 
> > > struct vfio_device_state_ctl {
> > >         __u32 version;          /* ro */
> > >         __u8 device_version[MAX_DEVICE_VERSION_LEN];    /* ro */
> > >         struct {
> > >                 __u32 action;   /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE */
> > >                 ...
> > >         } data;
> > >         ...
> > > };
> 
> We have a buffer area where we can read and write data from the vendor
> driver, why would we have this IS_COMPATIBLE protocol to write the
> device version string but use a static fixed length version string in
> the control header to read it?  IOW, let's use GET_VERSION,
> CHECK_VERSION actions that make use of the buffer area and allow vendor
> specific version information length.
you are right, such a static fixed-length version string is bad :)
To get the device version, which approach below do you think is better?
1. use GET_VERSION action, and read from region buffer
2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION

it seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is
only for checking the migration interface's version?

> > > 
> > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > 
> > > The flow to figure out whether a source device is migratable to the
> > > target device is like this:
> > > 1. in source side's .save_setup, save source device's device_version 
> > > string
> > > 2. in target side's .load_state, load source device's device version 
> > > 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-27 Thread Alex Williamson
On Wed, 27 Mar 2019 20:18:54 +
"Dr. David Alan Gilbert"  wrote:

> * Zhao Yan (yan.y.z...@intel.com) wrote:
> > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > >   b) How do we detect if we're migrating from/to the wrong
> > > > > > > device or version of device?  Or say to a device with older
> > > > > > > firmware, or perhaps a device that has less device memory?
> > > > > > Actually it's still an open issue for VFIO migration. We need to
> > > > > > think about whether it's better to check that in libvirt or qemu
> > > > > > (like a device magic along with a version?).
> > > > 
> > > > We must keep the hardware generation the same within one POD of
> > > > public cloud providers. But we are still thinking about live
> > > > migration from a lower generation of hardware to a higher generation.
> > > 
> > > Agreed, lower->higher is the one direction that might make sense to
> > > support.
> > > 
> > > But regardless of that, I think we need to make sure that incompatible
> > > devices/versions fail directly instead of failing in a subtle, hard to
> > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > as well.
> > > 
> > > How easy is it to obtain that information in a form that can be
> > > consumed by higher layers? Can we find out the device type at least?
> > > What about some kind of revision?  
> > hi Alex and Cornelia
> > for device compatibility, do you think it's a good idea to use "version"
> > and "device version" fields?
> > 
> > version field: identifies the live migration interface's version. It
> > can have a sort of backward compatibility, like target machine's
> > version >= source machine's version, something like that.

Don't we essentially already have this via the device specific region?
The struct vfio_info_cap_header includes id and version fields, so we
can declare a migration id and increment the version for any
incompatible changes to the protocol.
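
For reference, that header as defined in the kernel's
include/uapi/linux/vfio.h:

struct vfio_info_cap_header {
        __u16   id;             /* Identifies capability */
        __u16   version;        /* Version specific to the capability ID */
        __u32   next;           /* Offset of next capability */
};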

> > 
> > device_version field consists of two parts:
> > 1. vendor id: it takes 32 bits, e.g. 0x8086.

Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
suggest we use a bit to flag it as such so we can reserve that portion
of the 32bit address space.  See for example:

#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
#define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)

For vendor specific regions.

> > 2. vendor proprietary string: it can be any string that a vendor driver
> > thinks can identify a source device. e.g. pciid + mdev type.
> > "vendor id" is to avoid overlap of "vendor proprietary string".
> > 
> > 
> > struct vfio_device_state_ctl {
> >         __u32 version;          /* ro */
> >         __u8 device_version[MAX_DEVICE_VERSION_LEN];    /* ro */
> >         struct {
> >                 __u32 action;   /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE */
> >                 ...
> >         } data;
> >         ...
> > };

We have a buffer area where we can read and write data from the vendor
driver, why would we have this IS_COMPATIBLE protocol to write the
device version string but use a static fixed length version string in
the control header to read it?  IOW, let's use GET_VERSION,
CHECK_VERSION actions that make use of the buffer area and allow vendor
specific version information length.
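
A rough sketch of what a GET_VERSION action could then look like from
userspace; every offset and action value below is hypothetical (no such
UAPI exists at this point):

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define ACTION_GET_VERSION  5       /* hypothetical action value */
#define CTL_ACTION_OFFSET   0x08    /* hypothetical offset of data.action */
#define CTL_SIZE_OFFSET     0x0c    /* hypothetical offset of byte count */
#define CTL_BUFFER_OFFSET   0x1000  /* hypothetical buffer area offset */

/* Returns a malloc'ed, NUL-terminated vendor version string read from
 * the migration region, or NULL on error. */
static char *get_device_version(int region_fd)
{
        uint32_t action = ACTION_GET_VERSION, len = 0;
        char *buf;

        if (pwrite(region_fd, &action, sizeof(action), CTL_ACTION_OFFSET) < 0)
                return NULL;
        /* the vendor driver reports the number of bytes available */
        if (pread(region_fd, &len, sizeof(len), CTL_SIZE_OFFSET) < 0)
                return NULL;
        buf = malloc(len + 1);
        if (!buf)
                return NULL;
        if (pread(region_fd, buf, len, CTL_BUFFER_OFFSET) != (ssize_t)len) {
                free(buf);
                return NULL;
        }
        buf[len] = '\0';
        return buf;
}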

> > 
> > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > 
> > The flow to figure out whether a source device is migratable to the
> > target device is like this:
> > 1. in the source side's .save_setup, save the source device's
> > device_version string
> > 2. in the target side's .load_state, load the source device's device
> > version string, write it to the data region, and call the IS_COMPATIBLE
> > action to ask the vendor driver to check whether the source device is
> > compatible with it.
> > 
> > The advantage of adding an IS_COMPATIBLE action is that the vendor
> > driver can maintain a compatibility table and decide whether the source
> > device is compatible with the target device according to its
> > proprietary table.
> > In the device_version string, the vendor driver only has to describe
> > the source device as elaborately as possible, and resort to the vendor
> > driver on the target side to figure out whether they are compatible.

I agree, it's too complicated and restrictive to try to create an
interface for the user to determine compatibility; let the driver
declare it compatible or not.

> It would also be good if the 'IS_COMPATIBLE' check was somehow callable
> externally - so we would be able to answer a question like 'can we
> migrate this VM to this host' - from the management layer before it
> actually starts the migration.

I think we'd need to mirror this capability in sysfs to support that,
or create a qmp interface through QEMU so that the device owner could make
the request on behalf of the management layer.  Getting access to the
vfio device requires an iommu context that's already in use by the
device 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-27 Thread Dr. David Alan Gilbert
* Zhao Yan (yan.y.z...@intel.com) wrote:
> On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > >   b) How do we detect if we're migrating from/to the wrong device
> > > > > > or version of device?  Or say to a device with older firmware, or
> > > > > > perhaps a device that has less device memory?
> > > > > Actually it's still an open issue for VFIO migration. We need to
> > > > > think about whether it's better to check that in libvirt or qemu
> > > > > (like a device magic along with a version?).
> > > 
> > > We must keep the hardware generation the same within one POD of public
> > > cloud providers. But we are still thinking about live migration from a
> > > lower generation of hardware to a higher generation.
> > 
> > Agreed, lower->higher is the one direction that might make sense to
> > support.
> > 
> > But regardless of that, I think we need to make sure that incompatible
> > devices/versions fail directly instead of failing in a subtle, hard to
> > debug way. Might be useful to do some initial sanity checks in libvirt
> > as well.
> > 
> > How easy is it to obtain that information in a form that can be
> > consumed by higher layers? Can we find out the device type at least?
> > What about some kind of revision?
> hi Alex and Cornelia
> for device compatibility, do you think it's a good idea to use "version"
> and "device version" fields?
> 
> version field: identifies the live migration interface's version. it can
> have a sort of backward compatibility, like target machine's version >=
> source machine's version, something like that.
> 
> device_version field consists of two parts:
> 1. vendor id: it takes 32 bits, e.g. 0x8086.
> 2. vendor proprietary string: it can be any string that a vendor driver
> thinks can identify a source device, e.g. pciid + mdev type.
> The "vendor id" is there to avoid collisions between different vendors'
> proprietary strings.
> 
> 
> struct vfio_device_state_ctl {
>  __u32 version; /* ro */
>  __u8 device_version[MAX_DEVICE_VERSION_LEN]; /* ro */
>  struct {
>   __u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE */
>   ...
>  } data;
>  ...
> };
> 
> Then, an action IS_COMPATIBLE is added to check device compatibility.
> 
> The flow to figure out whether a source device is migratable to a target
> device is as follows:
> 1. in the source side's .save_setup, save the source device's
> device_version string
> 2. in the target side's .load_state, load the source device's
> device_version string, write it to the data region, and call the
> IS_COMPATIBLE action to ask the vendor driver to check whether the source
> device is compatible with it.
> 
> The advantage of adding an IS_COMPATIBLE action is that the vendor driver
> can maintain a compatibility table and decide whether the source device is
> compatible with the target device according to its proprietary table.
> In the device_version string, the vendor driver only has to describe the
> source device as precisely as possible; it is then up to the vendor driver
> on the target side to figure out whether the two are compatible.

It would also be good if the 'IS_COMPATIBLE' was somehow callable
externally - so we would be able to answer a question like 'can we
migrate this VM to this host' - from the management layer before it
actually starts the migration.

Dave

> Thanks
> Yan
> 
> 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-27 Thread Zhao Yan
On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > a device that has less device memory ?  
> > > > Actually it's still an open issue for VFIO migration. We need to think
> > > > about whether it's better to check that in libvirt or qemu (like a
> > > > device magic along with version?).
> > 
> > We must keep the hardware generation the same within one POD of a public
> > cloud provider. But we are still thinking about live migration from a
> > lower generation of hardware to a higher generation.
> 
> Agreed, lower->higher is the one direction that might make sense to
> support.
> 
> But regardless of that, I think we need to make sure that incompatible
> devices/versions fail directly instead of failing in a subtle, hard to
> debug way. Might be useful to do some initial sanity checks in libvirt
> as well.
> 
> How easy is it to obtain that information in a form that can be
> consumed by higher layers? Can we find out the device type at least?
> What about some kind of revision?
hi Alex and Cornelia
for device compatibility, do you think it's a good idea to use "version"
and "device version" fields?

version field: identifies the live migration interface's version. it can
have a sort of backward compatibility, like target machine's version >=
source machine's version, something like that.

device_version field consists of two parts:
1. vendor id: it takes 32 bits, e.g. 0x8086.
2. vendor proprietary string: it can be any string that a vendor driver
thinks can identify a source device, e.g. pciid + mdev type.
The "vendor id" is there to avoid collisions between different vendors'
proprietary strings.


struct vfio_device_state_ctl {
 __u32 version; /* ro */
 __u8 device_version[MAX_DEVICE_VERSION_LEN]; /* ro */
 struct {
  __u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE */
  ...
 } data;
 ...
};

Then, an action IS_COMPATIBLE is added to check device compatibility.

The flow to figure out whether a source device is migratable to a target
device is as follows:
1. in the source side's .save_setup, save the source device's
device_version string
2. in the target side's .load_state, load the source device's
device_version string, write it to the data region, and call the
IS_COMPATIBLE action to ask the vendor driver to check whether the source
device is compatible with it.

The advantage of adding an IS_COMPATIBLE action is that the vendor driver
can maintain a compatibility table and decide whether the source device is
compatible with the target device according to its proprietary table.
In the device_version string, the vendor driver only has to describe the
source device as precisely as possible; it is then up to the vendor driver
on the target side to figure out whether the two are compatible.

Thanks
Yan






Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-17 Thread Zhao Yan
On Mon, Mar 18, 2019 at 11:09:04AM +0800, Alex Williamson wrote:
> On Sun, 17 Mar 2019 22:51:27 -0400
> Zhao Yan  wrote:
> 
> > On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> > > On Thu, 14 Mar 2019 19:05:06 -0400
> > > Zhao Yan  wrote:
> > >   
> > > > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:  
> > > > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > > > Zhao Yan  wrote:
> > > > > 
> > > > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > > > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > > > Zhao Yan  wrote:
> > > > > > >   
> > > > > > > > hi Alex
> > > > > > > > Any comments to the sequence below?
> > > > > > > > 
> > > > > > > > Actually we have some concerns and suggestions about
> > > > > > > > userspace-opaque migration data.
> > > > > > > > 
> > > > > > > > 1. if data is opaque to userspace, kernel interface must be 
> > > > > > > > tightly bound to
> > > > > > > > migration. 
> > > > > > > >e.g. vendor driver has to know state (running + not logging) 
> > > > > > > > should not
> > > > > > > >return any data, and state (running + logging) should return 
> > > > > > > > whole
> > > > > > > >snapshot first and dirty later. it also has to know qemu 
> > > > > > > > migration will
> > > > > > > >not call GET_BUFFER in state (running + not logging), 
> > > > > > > > otherwise, it has
> > > > > > > >to adjust its behavior.  
> > > > > > > 
> > > > > > > This all just sounds like defining the protocol we expect with the
> > > > > > > interface.  For instance if we define a session as beginning when
> > > > > > > logging is enabled and ending when the device is stopped and the
> > > > > > > interface reports no more data is available, then we can state 
> > > > > > > that any
> > > > > > > partial accumulation of data is incomplete relative to migration. 
> > > > > > >  If
> > > > > > > userspace wants to initiate a new migration stream, they can 
> > > > > > > simply
> > > > > > > toggle logging.  How the vendor driver provides the data during 
> > > > > > > the
> > > > > > > session is not defined, but beginning the session with a snapshot
> > > > > > > followed by repeated iterations of dirtied data is certainly a 
> > > > > > > valid
> > > > > > > approach.
> > > > > > >   
> > > > > > > > 2. vendor driver cannot ensure userspace gets all the data it
> > > > > > > > intends to save in pre-copy phase.
> > > > > > > >   e.g. in stop-and-copy phase, vendor driver has to first check 
> > > > > > > > and send
> > > > > > > >   data in previous phase.  
> > > > > > > 
> > > > > > > First, I don't think the device has control of when QEMU switches 
> > > > > > > from
> > > > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > > > transition at any point.  However, it seems a simple data 
> > > > > > > available
> > > > > > > counter provides an indication of when it might be optimal to 
> > > > > > > make such
> > > > > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > > > > available data counter would indicate a large value, the entire 
> > > > > > > initial
> > > > > > > snapshot of the device.  As the migration continues and pages are
> > > > > > > dirtied, the device would reach a steady state amount of data
> > > > > > > available, depending on the guest activity.  This could indicate 
> > > > > > > to the
> > > > > > > user to stop the device.  The migration stream would not be 
> > > > > > > considered
> > > > > > > completed until the available data counter reaches zero while the
> > > > > > > device is in the stopped|logging state.
> > > > > > >   
> > > > > > > > 3. if all the sequence is tightly bound to live migration, can 
> > > > > > > > we remove the
> > > > > > > > logging state? what about adding two states migrate-in and 
> > > > > > > > migrate-out?
> > > > > > > > so there are four states: running, stopped, migrate-in, 
> > > > > > > > migrate-out.
> > > > > > > >migrate-out is for source side when migration starts. 
> > > > > > > > together with
> > > > > > > >state running and stopped, it can substitute state logging.
> > > > > > > >migrate-in is for target side.  
> > > > > > > 
> > > > > > > In fact, Kirti's implementation specifies a data direction, but I 
> > > > > > > think
> > > > > > > we still need logging to indicate sessions.  I'd also assume that
> > > > > > > logging implies some overhead for the vendor driver.
> > > > > > >  
> > > > > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > > > > migrate-out are more universal against hardware requirement changes.
> > > > > > 
> > > > > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:  
> > > > > > > > > hi Alex
> > > > > > > > > thanks for your reply.
> > > > > > > > > 
> > > > > > > > > So, if we choose migration data to be userspace opaque, do you
> > > > > > > > > think below sequence is the right behavior for vendor driver to
> > > > > > > > > follow:
> > > > > > > > > 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-17 Thread Alex Williamson
On Sun, 17 Mar 2019 22:51:27 -0400
Zhao Yan  wrote:

> On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> > On Thu, 14 Mar 2019 19:05:06 -0400
> > Zhao Yan  wrote:
> >   
> > > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:  
> > > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > > Zhao Yan  wrote:
> > > > 
> > > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > > Zhao Yan  wrote:
> > > > > >   
> > > > > > > hi Alex
> > > > > > > Any comments to the sequence below?
> > > > > > > 
> > > > > > > Actually we have some concerns and suggestions about
> > > > > > > userspace-opaque migration data.
> > > > > > > 
> > > > > > > 1. if data is opaque to userspace, kernel interface must be 
> > > > > > > tightly bound to
> > > > > > > migration. 
> > > > > > >e.g. vendor driver has to know state (running + not logging) 
> > > > > > > should not
> > > > > > >return any data, and state (running + logging) should return 
> > > > > > > whole
> > > > > > >snapshot first and dirty later. it also has to know qemu 
> > > > > > > migration will
> > > > > > >not call GET_BUFFER in state (running + not logging), 
> > > > > > > otherwise, it has
> > > > > > >to adjust its behavior.  
> > > > > > 
> > > > > > This all just sounds like defining the protocol we expect with the
> > > > > > interface.  For instance if we define a session as beginning when
> > > > > > logging is enabled and ending when the device is stopped and the
> > > > > > interface reports no more data is available, then we can state that 
> > > > > > any
> > > > > > partial accumulation of data is incomplete relative to migration.  
> > > > > > If
> > > > > > userspace wants to initiate a new migration stream, they can simply
> > > > > > toggle logging.  How the vendor driver provides the data during the
> > > > > > session is not defined, but beginning the session with a snapshot
> > > > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > > > approach.
> > > > > >   
> > > > > > > 2. vendor driver cannot ensure userspace gets all the data it
> > > > > > > intends to save in pre-copy phase.
> > > > > > >   e.g. in stop-and-copy phase, vendor driver has to first check 
> > > > > > > and send
> > > > > > >   data in previous phase.  
> > > > > > 
> > > > > > First, I don't think the device has control of when QEMU switches 
> > > > > > from
> > > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > > transition at any point.  However, it seems a simple data available
> > > > > > counter provides an indication of when it might be optimal to make 
> > > > > > such
> > > > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > > > available data counter would indicate a large value, the entire 
> > > > > > initial
> > > > > > snapshot of the device.  As the migration continues and pages are
> > > > > > dirtied, the device would reach a steady state amount of data
> > > > > > available, depending on the guest activity.  This could indicate to 
> > > > > > the
> > > > > > user to stop the device.  The migration stream would not be 
> > > > > > considered
> > > > > > completed until the available data counter reaches zero while the
> > > > > > device is in the stopped|logging state.
> > > > > >   
> > > > > > > 3. if all the sequence is tightly bound to live migration, can we 
> > > > > > > remove the
> > > > > > > logging state? what about adding two states migrate-in and 
> > > > > > > migrate-out?
> > > > > > > so there are four states: running, stopped, migrate-in, 
> > > > > > > migrate-out.
> > > > > > >migrate-out is for source side when migration starts. together 
> > > > > > > with
> > > > > > >state running and stopped, it can substitute state logging.
> > > > > > >migrate-in is for target side.  
> > > > > > 
> > > > > > In fact, Kirti's implementation specifies a data direction, but I 
> > > > > > think
> > > > > > we still need logging to indicate sessions.  I'd also assume that
> > > > > > logging implies some overhead for the vendor driver.
> > > > > >  
> > > > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > > > migrate-out are more universal against hardware requirement changes.
> > > > > 
> > > > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:  
> > > > > > > > hi Alex
> > > > > > > > thanks for your reply.
> > > > > > > > 
> > > > > > > > So, if we choose migration data to be userspace opaque, do you 
> > > > > > > > think below
> > > > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > > > 
> > > > > > > > 1. initially LOGGING state is not set. If userspace calls 
> > > > > > > > GET_BUFFER to
> > > > > > > > vendor driver,  vendor driver should reject and return 0.  
> > > > > > 
> > > > > > 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-17 Thread Zhao Yan
On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> On Thu, 14 Mar 2019 19:05:06 -0400
> Zhao Yan  wrote:
> 
> > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > Zhao Yan  wrote:
> > >   
> > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:  
> > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > Zhao Yan  wrote:
> > > > > 
> > > > > > hi Alex
> > > > > > Any comments to the sequence below?
> > > > > > 
> > > > > > Actually we have some concerns and suggestions about
> > > > > > userspace-opaque migration data.
> > > > > > 
> > > > > > 1. if data is opaque to userspace, kernel interface must be tightly 
> > > > > > bound to
> > > > > > migration. 
> > > > > >e.g. vendor driver has to know state (running + not logging) 
> > > > > > should not
> > > > > >return any data, and state (running + logging) should return 
> > > > > > whole
> > > > > >snapshot first and dirty later. it also has to know qemu 
> > > > > > migration will
> > > > > >not call GET_BUFFER in state (running + not logging), otherwise, 
> > > > > > it has
> > > > > >to adjust its behavior.
> > > > > 
> > > > > This all just sounds like defining the protocol we expect with the
> > > > > interface.  For instance if we define a session as beginning when
> > > > > logging is enabled and ending when the device is stopped and the
> > > > > interface reports no more data is available, then we can state that 
> > > > > any
> > > > > partial accumulation of data is incomplete relative to migration.  If
> > > > > userspace wants to initiate a new migration stream, they can simply
> > > > > toggle logging.  How the vendor driver provides the data during the
> > > > > session is not defined, but beginning the session with a snapshot
> > > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > > approach.
> > > > > 
> > > > > > 2. vendor driver cannot ensure userspace gets all the data it
> > > > > > intends to save in pre-copy phase.
> > > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and 
> > > > > > send
> > > > > >   data in previous phase.
> > > > > 
> > > > > First, I don't think the device has control of when QEMU switches from
> > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > transition at any point.  However, it seems a simple data available
> > > > > counter provides an indication of when it might be optimal to make 
> > > > > such
> > > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > > available data counter would indicate a large value, the entire 
> > > > > initial
> > > > > snapshot of the device.  As the migration continues and pages are
> > > > > dirtied, the device would reach a steady state amount of data
> > > > > available, depending on the guest activity.  This could indicate to 
> > > > > the
> > > > > user to stop the device.  The migration stream would not be considered
> > > > > completed until the available data counter reaches zero while the
> > > > > device is in the stopped|logging state.
> > > > > 
> > > > > > 3. if all the sequence is tightly bound to live migration, can we 
> > > > > > remove the
> > > > > > logging state? what about adding two states migrate-in and 
> > > > > > migrate-out?
> > > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > > >migrate-out is for source side when migration starts. together 
> > > > > > with
> > > > > >state running and stopped, it can substitute state logging.
> > > > > >migrate-in is for target side.
> > > > > 
> > > > > In fact, Kirti's implementation specifies a data direction, but I 
> > > > > think
> > > > > we still need logging to indicate sessions.  I'd also assume that
> > > > > logging implies some overhead for the vendor driver.
> > > > >
> > > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > > migrate-out are more universal against hardware requirement changes.
> > > >   
> > > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> > > > > > > hi Alex
> > > > > > > thanks for your reply.
> > > > > > > 
> > > > > > > So, if we choose migration data to be userspace opaque, do you 
> > > > > > > think below
> > > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > > 
> > > > > > > 1. initially LOGGING state is not set. If userspace calls 
> > > > > > > GET_BUFFER to
> > > > > > > vendor driver,  vendor driver should reject and return 0.
> > > > > 
> > > > > What would this state mean otherwise?  If we're not logging then it
> > > > > should not be expected that we can construct dirtied data from a
> > > > > previous read of the state before logging was enabled (it would be
> > > > > outside of the "session").  So at best this is an incomplete segment 
> > > > > of
> > > > > the initial snapshot of the device, but that presumes how the vendor
> > > > > driver constructs the data.

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-14 Thread Alex Williamson
On Thu, 14 Mar 2019 19:05:06 -0400
Zhao Yan  wrote:

> On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> > On Wed, 13 Mar 2019 21:12:22 -0400
> > Zhao Yan  wrote:
> >   
> > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:  
> > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > Zhao Yan  wrote:
> > > > 
> > > > > hi Alex
> > > > > Any comments to the sequence below?
> > > > > 
> > > > > Actually we have some concerns and suggestions about
> > > > > userspace-opaque migration data.
> > > > > 
> > > > > 1. if data is opaque to userspace, kernel interface must be tightly 
> > > > > bound to
> > > > > migration. 
> > > > >e.g. vendor driver has to know state (running + not logging) 
> > > > > should not
> > > > >return any data, and state (running + logging) should return whole
> > > > >snapshot first and dirty later. it also has to know qemu migration 
> > > > > will
> > > > >not call GET_BUFFER in state (running + not logging), otherwise, 
> > > > > it has
> > > > >to adjust its behavior.
> > > > 
> > > > This all just sounds like defining the protocol we expect with the
> > > > interface.  For instance if we define a session as beginning when
> > > > logging is enabled and ending when the device is stopped and the
> > > > interface reports no more data is available, then we can state that any
> > > > partial accumulation of data is incomplete relative to migration.  If
> > > > userspace wants to initiate a new migration stream, they can simply
> > > > toggle logging.  How the vendor driver provides the data during the
> > > > session is not defined, but beginning the session with a snapshot
> > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > approach.
> > > > 
> > > > > 2. vendor driver cannot ensure userspace gets all the data it
> > > > > intends to save in pre-copy phase.
> > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and 
> > > > > send
> > > > >   data in previous phase.
> > > > 
> > > > First, I don't think the device has control of when QEMU switches from
> > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > transition at any point.  However, it seems a simple data available
> > > > counter provides an indication of when it might be optimal to make such
> > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > available data counter would indicate a large value, the entire initial
> > > > snapshot of the device.  As the migration continues and pages are
> > > > dirtied, the device would reach a steady state amount of data
> > > > available, depending on the guest activity.  This could indicate to the
> > > > user to stop the device.  The migration stream would not be considered
> > > > completed until the available data counter reaches zero while the
> > > > device is in the stopped|logging state.
> > > > 
> > > > > 3. if all the sequence is tightly bound to live migration, can we 
> > > > > remove the
> > > > > logging state? what about adding two states migrate-in and 
> > > > > migrate-out?
> > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > >migrate-out is for source side when migration starts. together with
> > > > >state running and stopped, it can substitute state logging.
> > > > >migrate-in is for target side.
> > > > 
> > > > In fact, Kirti's implementation specifies a data direction, but I think
> > > > we still need logging to indicate sessions.  I'd also assume that
> > > > logging implies some overhead for the vendor driver.
> > > >
> > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > migrate-out are more universal against hardware requirement changes.
> > >   
> > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> > > > > > hi Alex
> > > > > > thanks for your reply.
> > > > > > 
> > > > > > So, if we choose migration data to be userspace opaque, do you 
> > > > > > think below
> > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > 
> > > > > > 1. initially LOGGING state is not set. If userspace calls 
> > > > > > GET_BUFFER to
> > > > > > vendor driver,  vendor driver should reject and return 0.
> > > > 
> > > > What would this state mean otherwise?  If we're not logging then it
> > > > should not be expected that we can construct dirtied data from a
> > > > previous read of the state before logging was enabled (it would be
> > > > outside of the "session").  So at best this is an incomplete segment of
> > > > the initial snapshot of the device, but that presumes how the vendor
> > > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > > driver reject it, but I think we should consider it undefined and
> > > > vendor specific relative to the migration interface.
> > > > 
> > > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to
> > > > > > vendor driver,

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-14 Thread Zhao Yan
On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> On Wed, 13 Mar 2019 21:12:22 -0400
> Zhao Yan  wrote:
> 
> > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > Zhao Yan  wrote:
> > >   
> > > > hi Alex
> > > > Any comments to the sequence below?
> > > > 
> > > > Actually we have some concerns and suggestions about
> > > > userspace-opaque migration data.
> > > > 
> > > > 1. if data is opaque to userspace, kernel interface must be tightly 
> > > > bound to
> > > > migration. 
> > > >e.g. vendor driver has to know state (running + not logging) should 
> > > > not
> > > >return any data, and state (running + logging) should return whole
> > > >snapshot first and dirty later. it also has to know qemu migration 
> > > > will
> > > >not call GET_BUFFER in state (running + not logging), otherwise, it 
> > > > has
> > > >to adjust its behavior.  
> > > 
> > > This all just sounds like defining the protocol we expect with the
> > > interface.  For instance if we define a session as beginning when
> > > logging is enabled and ending when the device is stopped and the
> > > interface reports no more data is available, then we can state that any
> > > partial accumulation of data is incomplete relative to migration.  If
> > > userspace wants to initiate a new migration stream, they can simply
> > > toggle logging.  How the vendor driver provides the data during the
> > > session is not defined, but beginning the session with a snapshot
> > > followed by repeated iterations of dirtied data is certainly a valid
> > > approach.
> > >   
> > > > 2. vendor driver cannot ensure userspace gets all the data it intends to
> > > > save in pre-copy phase.
> > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > >   data in previous phase.  
> > > 
> > > First, I don't think the device has control of when QEMU switches from
> > > pre-copy to stop-and-copy, the protocol needs to support that
> > > transition at any point.  However, it seems a simple data available
> > > counter provides an indication of when it might be optimal to make such
> > > a transition.  If a vendor driver follows a scheme as above, the
> > > available data counter would indicate a large value, the entire initial
> > > snapshot of the device.  As the migration continues and pages are
> > > dirtied, the device would reach a steady state amount of data
> > > available, depending on the guest activity.  This could indicate to the
> > > user to stop the device.  The migration stream would not be considered
> > > completed until the available data counter reaches zero while the
> > > device is in the stopped|logging state.
> > >   
> > > > 3. if all the sequence is tightly bound to live migration, can we 
> > > > remove the
> > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > >migrate-out is for source side when migration starts. together with
> > > >state running and stopped, it can substitute state logging.
> > > >migrate-in is for target side.  
> > > 
> > > In fact, Kirti's implementation specifies a data direction, but I think
> > > we still need logging to indicate sessions.  I'd also assume that
> > > logging implies some overhead for the vendor driver.
> > >  
> > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > migrate-out are more universal against hardware requirement changes.
> > 
> > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:  
> > > > > hi Alex
> > > > > thanks for your reply.
> > > > > 
> > > > > So, if we choose migration data to be userspace opaque, do you think 
> > > > > below
> > > > > sequence is the right behavior for vendor driver to follow:
> > > > > 
> > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER 
> > > > > to
> > > > > vendor driver,  vendor driver should reject and return 0.  
> > > 
> > > What would this state mean otherwise?  If we're not logging then it
> > > should not be expected that we can construct dirtied data from a
> > > previous read of the state before logging was enabled (it would be
> > > outside of the "session").  So at best this is an incomplete segment of
> > > the initial snapshot of the device, but that presumes how the vendor
> > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > driver reject it, but I think we should consider it undefined and
> > > vendor specific relative to the migration interface.
> > >   
> > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > driver,
> > > > >a. vendor driver should first query a whole snapshot of device 
> > > > > memory
> > > > >(let's use this term to represent device's standalone memory for 
> > > > > now),
> > > > >b. vendor driver returns a chunk of data just queried to userspace,
> > > > >while recording current pos in data.

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-14 Thread Alex Williamson
On Wed, 13 Mar 2019 21:12:22 -0400
Zhao Yan  wrote:

> On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > On Tue, 12 Mar 2019 21:13:01 -0400
> > Zhao Yan  wrote:
> >   
> > > hi Alex
> > > Any comments to the sequence below?
> > > 
> > > Actually we have some concerns and suggestions about
> > > userspace-opaque migration data.
> > > 
> > > 1. if data is opaque to userspace, kernel interface must be tightly bound 
> > > to
> > > migration. 
> > >e.g. vendor driver has to know state (running + not logging) should not
> > >return any data, and state (running + logging) should return whole
> > >snapshot first and dirty later. it also has to know qemu migration will
> > >not call GET_BUFFER in state (running + not logging), otherwise, it has
> > >to adjust its behavior.  
> > 
> > This all just sounds like defining the protocol we expect with the
> > interface.  For instance if we define a session as beginning when
> > logging is enabled and ending when the device is stopped and the
> > interface reports no more data is available, then we can state that any
> > partial accumulation of data is incomplete relative to migration.  If
> > userspace wants to initiate a new migration stream, they can simply
> > toggle logging.  How the vendor driver provides the data during the
> > session is not defined, but beginning the session with a snapshot
> > followed by repeated iterations of dirtied data is certainly a valid
> > approach.
> >   
> > > 2. vendor driver cannot ensure userspace gets all the data it intends to
> > > save in pre-copy phase.
> > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > >   data in previous phase.  
> > 
> > First, I don't think the device has control of when QEMU switches from
> > pre-copy to stop-and-copy, the protocol needs to support that
> > transition at any point.  However, it seems a simple data available
> > counter provides an indication of when it might be optimal to make such
> > a transition.  If a vendor driver follows a scheme as above, the
> > available data counter would indicate a large value, the entire initial
> > snapshot of the device.  As the migration continues and pages are
> > dirtied, the device would reach a steady state amount of data
> > available, depending on the guest activity.  This could indicate to the
> > user to stop the device.  The migration stream would not be considered
> > completed until the available data counter reaches zero while the
> > device is in the stopped|logging state.
> >   
> > > 3. if all the sequence is tightly bound to live migration, can we remove 
> > > the
> > > logging state? what about adding two states migrate-in and migrate-out?
> > > so there are four states: running, stopped, migrate-in, migrate-out.
> > >migrate-out is for source side when migration starts. together with
> > >state running and stopped, it can substitute state logging.
> > >migrate-in is for target side.  
> > 
> > In fact, Kirti's implementation specifies a data direction, but I think
> > we still need logging to indicate sessions.  I'd also assume that
> > logging implies some overhead for the vendor driver.
> >  
> ok. If you prefer logging, I'm ok with it. just found migrate-in and
> migrate-out are more universal against hardware requirement changes.
> 
> > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:  
> > > > hi Alex
> > > > thanks for your reply.
> > > > 
> > > > So, if we choose migration data to be userspace opaque, do you think 
> > > > below
> > > > sequence is the right behavior for vendor driver to follow:
> > > > 
> > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > vendor driver,  vendor driver should reject and return 0.  
> > 
> > What would this state mean otherwise?  If we're not logging then it
> > should not be expected that we can construct dirtied data from a
> > previous read of the state before logging was enabled (it would be
> > outside of the "session").  So at best this is an incomplete segment of
> > the initial snapshot of the device, but that presumes how the vendor
> > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > driver reject it, but I think we should consider it undefined and
> > vendor specific relative to the migration interface.
> >   
> > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > driver,
> > > >a. vendor driver should first query a whole snapshot of device memory
> > > >(let's use this term to represent device's standalone memory for 
> > > > now),
> > > >b. vendor driver returns a chunk of data just queried to userspace,
> > > >while recording current pos in data.
> > > >c. vendor driver finds all data just queried is finished 
> > > > transmitting to
> > > >userspace, and queries only dirty data in device memory now.
> > > >d. vendor driver returns a chunk of data just queried (this time is
> > > >dirty data) to userspace while recording current pos in data

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-13 Thread Zhao Yan
On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> On Tue, 12 Mar 2019 21:13:01 -0400
> Zhao Yan  wrote:
> 
> > hi Alex
> > Any comments to the sequence below?
> > 
> > Actually we have some concerns and suggestions about userspace-opaque
> > migration data.
> > 
> > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > migration. 
> >e.g. vendor driver has to know state (running + not logging) should not
> >return any data, and state (running + logging) should return whole
> >snapshot first and dirty later. it also has to know qemu migration will
> >not call GET_BUFFER in state (running + not logging), otherwise, it has
> >to adjust its behavior.
> 
> This all just sounds like defining the protocol we expect with the
> interface.  For instance if we define a session as beginning when
> logging is enabled and ending when the device is stopped and the
> interface reports no more data is available, then we can state that any
> partial accumulation of data is incomplete relative to migration.  If
> userspace wants to initiate a new migration stream, they can simply
> toggle logging.  How the vendor driver provides the data during the
> session is not defined, but beginning the session with a snapshot
> followed by repeated iterations of dirtied data is certainly a valid
> approach.
> 
> > 2. vendor driver cannot ensure userspace gets all the data it intends to
> > save in pre-copy phase.
> >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> >   data in previous phase.
> 
> First, I don't think the device has control of when QEMU switches from
> pre-copy to stop-and-copy, the protocol needs to support that
> transition at any point.  However, it seems a simple data available
> counter provides an indication of when it might be optimal to make such
> a transition.  If a vendor driver follows a scheme as above, the
> available data counter would indicate a large value, the entire initial
> snapshot of the device.  As the migration continues and pages are
> dirtied, the device would reach a steady state amount of data
> available, depending on the guest activity.  This could indicate to the
> user to stop the device.  The migration stream would not be considered
> completed until the available data counter reaches zero while the
> device is in the stopped|logging state.
> 
> > 3. if all the sequence is tightly bound to live migration, can we remove the
> > logging state? what about adding two states migrate-in and migrate-out?
> > so there are four states: running, stopped, migrate-in, migrate-out.
> >migrate-out is for source side when migration starts. together with
> >state running and stopped, it can substitute state logging.
> >migrate-in is for target side.
> 
> In fact, Kirti's implementation specifies a data direction, but I think
> we still need logging to indicate sessions.  I'd also assume that
> logging implies some overhead for the vendor driver.
>
ok. If you prefer logging, I'm ok with it. just found migrate-in and
migrate-out are more universal against hardware requirement changes.

> > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> > > hi Alex
> > > thanks for your reply.
> > > 
> > > So, if we choose migration data to be userspace opaque, do you think below
> > > sequence is the right behavior for vendor driver to follow:
> > > 
> > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > vendor driver,  vendor driver should reject and return 0.
> 
> What would this state mean otherwise?  If we're not logging then it
> should not be expected that we can construct dirtied data from a
> previous read of the state before logging was enabled (it would be
> outside of the "session").  So at best this is an incomplete segment of
> the initial snapshot of the device, but that presumes how the vendor
> driver constructs the data.  I wouldn't necessarily mandate the vendor
> driver reject it, but I think we should consider it undefined and
> vendor specific relative to the migration interface.
> 
> > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > driver,
> > >a. vendor driver should first query a whole snapshot of device memory
> > >(let's use this term to represent device's standalone memory for now),
> > >b. vendor driver returns a chunk of data just queried to userspace,
> > >while recording current pos in data.
> > >c. vendor driver finds all data just queried is finished transmitting 
> > > to
> > >userspace, and queries only dirty data in device memory now.
> > >d. vendor driver returns a chunk of data just queried (this time is
> > >dirty data) to userspace while recording current pos in data
> > >e. if all data is transmitted to userspace and still GET_BUFFERs come 
> > > from
> > >userspace, vendor driver starts another round of dirty data query.
> 
> This is a valid vendor driver approach, but it's outside the scope of
> the interface definition.

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-13 Thread Alex Williamson
On Tue, 12 Mar 2019 02:48:39 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Tuesday, March 12, 2019 4:19 AM
> > On Mon, 11 Mar 2019 02:33:11 +
> > "Tian, Kevin"  wrote:
> >   
> [...]
> > 
> > I think I've fully conceded any notion of security by obscurity towards
> > opaque data already, but segregating types of device data still seems
> > to unnecessarily impose a usage model on the vendor driver that I think
> > we should try to avoid.  The migration interface should define the
> > protocol through which userspace can save and restore the device state,
> > not impose how the vendor driver exposes or manages that state.  Also, I
> > got the impression (perhaps incorrectly) that you were trying to mmap
> > live data to userspace, which would allow not only saving the state,
> > but also unchecked state modification by userspace. I think we want
> > more of a producer/consumer model of the state where consuming state
> > also involves at least some degree of sanity or consistency checking.
> > Let's not forget too that we're obviously dealing with non-open source
> > drivers in the mdev universe as well.  
> 
> OK. I think for this part we are in agreement - as long as the goal of
> this interface is clearly defined as such way. :-)
> 
> [...]
> > > But I still didn't get your exact concern about security part. For
> > > version yes we still haven't worked out a sane way to represent
> > > vendor-specific compatibility requirement. But allowing user
> > > space to modify data through this interface has really no difference
> > > from allowing guest to modify data through trapped MMIO interface.
> > > mdev driver should guarantee that operations through both interfaces
> > > can modify only the state associated with the said mdev instance,
> > > w/o breaking the isolation boundary. Then the former becomes just
> > > a batch of operations to be verified in the same way as if they are
> > > done individually through the latter interface.  
> > 
> > It seems like you're assuming a working model for the vendor driver and
> > the data entering and exiting through this interface.  The vendor
> > drivers can expose data any way that they want.  All we need to do is
> > imagine that the migration data stream includes an array index count
> > somewhere which the user could modify to trigger the in-kernel vendor
> > driver to allocate an absurd array size and DoS the target.  This is
> > probably the most simplistic attack, possibly knowing the state machine
> > of the vendor driver a malicious user could trick it into providing
> > host kernel data.  We're not necessarily only conveying state that the
> > user already has access to via this interface, the vendor driver may
> > include non-visible internal state as well.  Even the state that is
> > user accessible is being pushed into the vendor driver via an alternate
> > path from the user mediation we have on the existing paths.  
> 
> Then I don't know how this concern can be effectively addressed,
> since you assume vendor drivers are not trusted here. And why do
> you trust vendor drivers to mediate the existing path but not this
> alternative one? Non-visible internal states just mean more stuff
> to be carefully scrutinized, which does not essentially cause a
> conceptual difference in trust level.
> 
> Or can this concern be partially mitigated if we create some 
> test cases which poke random data through the new interface,
> and mark vendor drivers which pass such tests as trusted? Then
> there is also an open question of who should be in charge of such a
> certification process...

The vendor driver is necessarily trusted, it lives in the kernel, it
works in the kernel address space.  Unfortunately that's also the risk
with passing data from userspace into the vendor driver, the vendor
driver needs to take every precaution in sanitizing and validating that
data.  I wish we had a common way to perform that checking, but it
seems that each vendor driver is going to need to define their own
protocol and battle their own bugs and exploits in the code
implementing that protocol.  For open source drivers we can continue to
rely on review and openness, for closed drivers... the user has already
accepted the risk to trust the driver themselves.  Perhaps all I can do
is raise the visibility that there are potential security issues here
and vendor drivers need to own that risk.
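To make that concrete, the kind of check a vendor driver would need when
parsing an untrusted stream might look like the following (the header layout
and the limit are invented for illustration):

#include <linux/err.h>
#include <linux/overflow.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Invented example: a header a vendor driver might place in its opaque
 * migration stream; on the target it is entirely user-controlled. */
struct mig_array_hdr {
    __u32 count;        /* number of entries that follow */
    __u32 entry_size;   /* size of each entry in bytes */
};

#define MIG_MAX_ENTRIES 4096    /* driver-defined sanity limit */

/* Refuse absurd user-controlled sizes before allocating; an unchecked
 * count * entry_size multiplication is exactly the DoS described above. */
static void *mig_alloc_entries(const struct mig_array_hdr *hdr)
{
    size_t bytes;

    if (hdr->count == 0 || hdr->count > MIG_MAX_ENTRIES)
        return ERR_PTR(-EINVAL);
    if (check_mul_overflow((size_t)hdr->count,
                           (size_t)hdr->entry_size, &bytes))
        return ERR_PTR(-EINVAL);

    return kzalloc(bytes, GFP_KERNEL);
}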

A fuzzing test would be great; we could at least validate whether a
vendor driver implements some sort of CRC test, but I don't think we
can create a certification process around that.  Thanks,

Alex



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-13 Thread Alex Williamson
On Tue, 12 Mar 2019 21:13:01 -0400
Zhao Yan  wrote:

> hi Alex
> Any comments to the sequence below?
> 
> Actually we have some concerns and suggestions about userspace-opaque
> migration data.
> 
> 1. if data is opaque to userspace, kernel interface must be tightly bound to
> migration. 
>e.g. vendor driver has to know state (running + not logging) should not
>return any data, and state (running + logging) should return whole
>snapshot first and dirty later. it also has to know qemu migration will
>not call GET_BUFFER in state (running + not logging), otherwise, it has
>to adjust its behavior.

This all just sounds like defining the protocol we expect with the
interface.  For instance if we define a session as beginning when
logging is enabled and ending when the device is stopped and the
interface reports no more data is available, then we can state that any
partial accumulation of data is incomplete relative to migration.  If
userspace wants to initiate a new migration stream, they can simply
toggle logging.  How the vendor driver provides the data during the
session is not defined, but beginning the session with a snapshot
followed by repeated iterations of dirtied data is certainly a valid
approach.

> 2. vendor driver cannot ensure userspace gets all the data it intends to
> save in pre-copy phase.
>   e.g. in stop-and-copy phase, vendor driver has to first check and send
>   data in previous phase.

First, I don't think the device has control of when QEMU switches from
pre-copy to stop-and-copy, the protocol needs to support that
transition at any point.  However, it seems a simple data available
counter provides an indication of when it might be optimal to make such
a transition.  If a vendor driver follows a scheme as above, the
available data counter would indicate a large value, the entire initial
snapshot of the device.  As the migration continues and pages are
dirtied, the device would reach a steady state amount of data
available, depending on the guest activity.  This could indicate to the
user to stop the device.  The migration stream would not be considered
completed until the available data counter reaches zero while the
device is in the stopped|logging state.
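As an illustration of how userspace might drive that (a toy model: the
function names and the counter's behavior here are invented, not part of any
proposal):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>

/* Toy stand-in for the vendor driver; names and counter behavior invented. */
static uint64_t pending = 8u << 20;        /* initial snapshot: 8 MiB */
static bool running = true, logging = false;

static void dev_set_state(bool run, bool log) { running = run; logging = log; }
static uint64_t dev_pending_bytes(void) { return pending; }

static size_t dev_get_buffer(char *buf, size_t n)
{
    size_t take = (size_t)(pending < n ? pending : n);

    memset(buf, 0, take);                  /* stand-in for real device data */
    pending -= take;
    if (running && logging)
        pending += 512;                    /* running guest keeps dirtying */
    return take;
}

#define THRESHOLD (64u << 10)              /* stop the device below 64 KiB */

int main(void)
{
    char buf[4096];
    uint64_t sent = 0;

    dev_set_state(true, true);             /* logging on: session begins */
    while (dev_pending_bytes() > THRESHOLD)    /* pre-copy, device running */
        sent += dev_get_buffer(buf, sizeof(buf));

    dev_set_state(false, true);            /* stopped|logging */
    while (dev_pending_bytes() > 0)        /* complete only at counter == 0 */
        sent += dev_get_buffer(buf, sizeof(buf));

    printf("migrated %llu bytes\n", (unsigned long long)sent);
    return 0;
}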

> 3. if all the sequence is tightly bound to live migration, can we remove the
> logging state? what about adding two states migrate-in and migrate-out?
> so there are four states: running, stopped, migrate-in, migrate-out.
>migrate-out is for source side when migration starts. together with
>state running and stopped, it can substitute state logging.
>migrate-in is for target side.

In fact, Kirti's implementation specifies a data direction, but I think
we still need logging to indicate sessions.  I'd also assume that
logging implies some overhead for the vendor driver.

> On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> > hi Alex
> > thanks for your reply.
> > 
> > So, if we choose migration data to be userspace opaque, do you think below
> > sequence is the right behavior for vendor driver to follow:
> > 
> > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > vendor driver,  vendor driver should reject and return 0.

What would this state mean otherwise?  If we're not logging then it
should not be expected that we can construct dirtied data from a
previous read of the state before logging was enabled (it would be
outside of the "session").  So at best this is an incomplete segment of
the initial snapshot of the device, but that presumes how the vendor
driver constructs the data.  I wouldn't necessarily mandate the vendor
driver reject it, but I think we should consider it undefined and
vendor specific relative to the migration interface.

> > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > driver,
> >a. vendor driver should first query a whole snapshot of device memory
> >(let's use this term to represent device's standalone memory for now),
> >b. vendor driver returns a chunk of data just queried to userspace,
> >while recording current pos in data.
> >c. vendor driver finds all data just queried is finished transmitting to
> >userspace, and queries only dirty data in device memory now.
> >d. vendor driver returns a chunk of data just queried (this time is dirty
> >data) to userspace while recording current pos in data
> >e. if all data is transmitted to userspace and still GET_BUFFERs come from
> >userspace, vendor driver starts another round of dirty data query.

This is a valid vendor driver approach, but it's outside the scope of
the interface definition.  A vendor driver could also decide to not
provide any data until both stopped and logging are set and then
provide a fixed, final snapshot.  The interface supports either
approach by defining the protocol to interact with it.

> > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > driver,
> >a. if vendor driver finds there's previously untransmitted data, returns
> >them until all transmitted.

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-12 Thread Zhao Yan
hi Alex
Any comments to the sequence below?

Actually we have some concerns and suggestions about userspace-opaque
migration data.

1. if data is opaque to userspace, kernel interface must be tightly bound to
migration. 
   e.g. vendor driver has to know state (running + not logging) should not
   return any data, and state (running + logging) should return whole
   snapshot first and dirty later. it also has to know qemu migration will
   not call GET_BUFFER in state (running + not logging), otherwise, it has
   to adjust its behavior.

2. vendor driver cannot ensure userspace gets all the data it intends to
save in pre-copy phase.
  e.g. in stop-and-copy phase, vendor driver has to first check and send
  data in previous phase.
 

3. if all the sequence is tightly bound to live migration, can we remove the
logging state? what about adding two states migrate-in and migrate-out?
so there are four states: running, stopped, migrate-in, migrate-out.
   migrate-out is for source side when migration starts. together with
   state running and stopped, it can substitute state logging.
   migrate-in is for target side.


Thanks
Yan

On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> hi Alex
> thanks for your reply.
> 
> So, if we choose migration data to be userspace opaque, do you think below
> sequence is the right behavior for vendor driver to follow:
> 
> 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> vendor driver,  vendor driver should reject and return 0.
> 
> 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> driver,
>a. vendor driver should first query a whole snapshot of device memory
>(let's use this term to represent device's standalone memory for now),
>b. vendor driver returns a chunk of data just queried to userspace,
>while recording current pos in data.
>c. vendor driver finds all data just queried is finished transmitting to
>userspace, and queries only dirty data in device memory now.
>d. vendor driver returns a chunk of data just queried (this time is dirty
>data) to userspace while recording current pos in data
>e. if all data is transmitted to userspace and still GET_BUFFERs come from
>userspace, vendor driver starts another round of dirty data query.
> 
> 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> driver,
>a. if vendor driver finds there's previously untransmitted data, returns
>them until all transmitted.
>b. vendor driver then queries dirty data again and transmits them.
>c. at last, vendor driver queries device config data (which has to be
>queried at last and sent once) and transmits them.
> 
> 
> for the 1st bullet, if LOGGING state is set first and migration aborts
> then, vendor driver has to be able to detect that condition. so seemingly,
> vendor driver has to know more about qemu's migration state, like whether
> migration was called and failed. Do you think that's acceptable?
> 
> 
> Thanks
> Yan
> 
> 



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-11 Thread Zhao Yan
hi Alex
thanks for your reply.

So, if we choose migration data to be userspace opaque, do you think below
sequence is the right behavior for vendor driver to follow:

1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
vendor driver,  vendor driver should reject and return 0.

2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
driver,
   a. vendor driver should first query a whole snapshot of device memory
   (let's use this term to represent device's standalone memory for now),
   b. vendor driver returns a chunk of data just queried to userspace,
   while recording current pos in data.
   c. vendor driver finds all data just queried is finished transmitting to
   userspace, and queries only dirty data in device memory now.
   d. vendor driver returns a chunk of data just queried (this time is dirty
   data) to userspace while recording current pos in data
   e. if all data is transmitted to userspace and still GET_BUFFERs come from
   userspace, vendor driver starts another round of dirty data query.

3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
driver,
   a. if vendor driver finds there's previously untransmitted data, returns
   them until all transmitted.
   b. vendor driver then queries dirty data again and transmits them.
   c. at last, vendor driver queries device config data (which has to be
   queried at last and sent once) and transmits them.


for the 1st bullet, if LOGGING state is set first and migration aborts
then, vendor driver has to be able to detect that condition. so seemingly,
vendor driver has to know more about qemu's migration state, like whether
migration was called and failed. Do you think that's acceptable?
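For what it's worth, the phase tracking this sequence implies inside a vendor
driver might look roughly like the sketch below (the query helpers are
imaginary, and this is just one possible policy, not something the interface
would mandate):

#include <linux/types.h>

/* Imaginary phases matching the sequence above; one possible vendor policy. */
enum mig_phase {
    MIG_PHASE_IDLE,       /* logging unset: GET_BUFFER returns 0 */
    MIG_PHASE_SNAPSHOT,   /* draining the initial device memory snapshot */
    MIG_PHASE_DIRTY,      /* rounds of dirty-data queries */
    MIG_PHASE_CONFIG,     /* device config data: queried last, sent once */
};

struct mig_ctx {
    enum mig_phase phase;
    bool logging;
    bool stopped;
    size_t pos;           /* current position in the queried data */
};

/* Imaginary vendor-specific helpers; each returns bytes placed in buf. */
size_t query_snapshot_chunk(struct mig_ctx *c, void *buf, size_t n);
size_t query_dirty_chunk(struct mig_ctx *c, void *buf, size_t n);
size_t query_config_data(struct mig_ctx *c, void *buf, size_t n);

static size_t mig_get_buffer(struct mig_ctx *c, void *buf, size_t n)
{
    size_t ret;

    if (!c->logging && !c->stopped)
        return 0;                         /* bullet 1: reject */

    if (c->phase == MIG_PHASE_IDLE)
        c->phase = MIG_PHASE_SNAPSHOT;    /* 2a: whole snapshot first */

    if (c->phase == MIG_PHASE_SNAPSHOT) {
        ret = query_snapshot_chunk(c, buf, n);  /* 2b: chunk, record pos */
        if (ret)
            return ret;
        c->phase = MIG_PHASE_DIRTY;       /* 2c: snapshot drained */
    }

    if (c->phase == MIG_PHASE_DIRTY) {
        ret = query_dirty_chunk(c, buf, n);     /* 2d/2e and 3a/3b */
        if (ret || !c->stopped)
            return ret;
        c->phase = MIG_PHASE_CONFIG;      /* 3c: config data last */
    }

    return query_config_data(c, buf, n);
}

The interface itself would stay agnostic; only the ordering of the helpers
encodes the policy described above.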


Thanks
Yan





Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-11 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Tuesday, March 12, 2019 4:19 AM
> On Mon, 11 Mar 2019 02:33:11 +
> "Tian, Kevin"  wrote:
> 
[...]
> 
> I think I've fully conceded any notion of security by obscurity towards
> opaque data already, but segregating types of device data still seems
> to unnecessarily impose a usage model on the vendor driver that I think
> we should try to avoid.  The migration interface should define the
> protocol through which userspace can save and restore the device state,
> not impose how the vendor driver exposes or manages that state.  Also, I
> got the impression (perhaps incorrectly) that you were trying to mmap
> live data to userspace, which would allow not only saving the state,
> but also unchecked state modification by userspace. I think we want
> more of a producer/consumer model of the state where consuming state
> also involves at least some degree of sanity or consistency checking.
> Let's not forget too that we're obviously dealing with non-open source
> drivers in the mdev universe as well.

OK. I think for this part we are in agreement - as long as the goal of
this interface is clearly defined as such way. :-)

[...]
> > But I still didn't get your exact concern about security part. For
> > version yes we still haven't worked out a sane way to represent
> > vendor-specific compatibility requirement. But allowing user
> > space to modify data through this interface has really no difference
> > from allowing guest to modify data through trapped MMIO interface.
> > mdev driver should guarantee that operations through both interfaces
> > can modify only the state associated with the said mdev instance,
> > w/o breaking the isolation boundary. Then the former becomes just
> > a batch of operations to be verified in the same way as if they are
> > done individually through the latter interface.
> 
> It seems like you're assuming a working model for the vendor driver and
> the data entering and exiting through this interface.  The vendor
> drivers can expose data any way that they want.  All we need to do is
> imagine that the migration data stream includes an array index count
> somewhere which the user could modify to trigger the in-kernel vendor
> driver to allocate an absurd array size and DoS the target.  This is
> probably the most simplistic attack, possibly knowing the state machine
> of the vendor driver a malicious user could trick it into providing
> host kernel data.  We're not necessarily only conveying state that the
> user already has access to via this interface, the vendor driver may
> include non-visible internal state as well.  Even the state that is
> user accessible is being pushed into the vendor driver via an alternate
> path from the user mediation we have on the existing paths.

Then I don't know how this concern can be effectively addressed,
since you assume vendor drivers are not trusted here. And why do
you trust vendor drivers to mediate the existing paths but not this
alternative one? Non-visible internal state just means more stuff
to be carefully scrutinized, which does not essentially cause a
conceptual difference in trust level.

Or can this concern be partially mitigated if we create some
test cases which poke random data through the new interface,
and mark vendor drivers which pass such tests as trusted? Then
there is also the open question of who should be in charge of such a
certification process...
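
For what it's worth, the mitigation for the array-count example Alex gave
upthread is plain restore-side bounds checking before anything
user-controlled drives an allocation. A sketch, where the magic, version
and limit constants are all hypothetical:

/*
 * Restore-side validation sketch.  The header layout is made up; the
 * point is only that nr_entries is user-controlled and must be checked
 * before it reaches an allocator.
 */
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/types.h>

#define VENDOR_BLOB_MAGIC   0x564d4947u  /* hypothetical */
#define VENDOR_BLOB_VERSION 1
#define VENDOR_MAX_ENTRIES  4096         /* assumed device-specific limit */

struct vendor_blob_hdr {
    u32 magic;
    u32 version;
    u32 nr_entries;      /* user-controlled: never trust it */
};

static int vendor_restore_blob(const struct vendor_blob_hdr *hdr)
{
    u64 *entries;

    if (hdr->magic != VENDOR_BLOB_MAGIC ||
        hdr->version != VENDOR_BLOB_VERSION)
        return -EINVAL;
    /* the DoS case: a huge count driving an in-kernel allocation */
    if (hdr->nr_entries == 0 || hdr->nr_entries > VENDOR_MAX_ENTRIES)
        return -EINVAL;
    entries = kcalloc(hdr->nr_entries, sizeof(*entries), GFP_KERNEL);
    if (!entries)
        return -ENOMEM;
    /* ... then sanity-check each entry before applying it ... */
    kfree(entries);
    return 0;
}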

Thanks
Kevin



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-11 Thread Alex Williamson
On Mon, 11 Mar 2019 02:33:11 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson
> > Sent: Saturday, March 9, 2019 6:03 AM
> > 
> > On Fri, 8 Mar 2019 16:21:46 +
> > "Dr. David Alan Gilbert"  wrote:
> >   
> > > * Alex Williamson (alex.william...@redhat.com) wrote:  
> > > > On Thu, 7 Mar 2019 23:20:36 +
> > > > "Tian, Kevin"  wrote:
> > > >  
> > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > Sent: Friday, March 8, 2019 1:44 AM  
> > > > > > > >  
> > > > > > > > > This kind of data needs to be saved / loaded in 
> > > > > > > > > pre-copy and
> > > > > > > > > stop-and-copy phase.
> > > > > > > > > The data of device memory is held in device memory
> > > > > > > > > region.
> > > > > > > > > Size of device memory is usually larger than that of the
> > > > > > > > > device memory region. QEMU needs to save/load it in chunks
> > > > > > > > > of the size of the device memory region.
> > > > > > > > > Not all devices have device memory. Like IGD only uses
> > > > > > > > > system memory.
> > > > > > > >
> > > > > > > > It seems a little gratuitous to me that this is a separate 
> > > > > > > > region or
> > > > > > > > that this data is handled separately.  All of this data is 
> > > > > > > > opaque to
> > > > > > > > QEMU, so why do we need to separate it?  
> > > > > > > hi Alex,
> > > > > > > as the device state interfaces are provided by kernel, it is 
> > > > > > > expected to
> > > > > > > meet as general needs as possible. So, do you think there are
> > > > > > > such use cases from user space that user space knows well of the
> > > > > > > device, and
> > > > > > > it wants kernel to return desired data back to it.
> > > > > > > E.g. It just wants to get whole device config data including all 
> > > > > > > mmios,
> > > > > > > page tables, pci config data...
> > > > > > > or, It just wants to get current device memory snapshot, not
> > > > > > > including any dirty data.
> > > > > > > Or, It just needs the dirty pages in device memory or system 
> > > > > > > memory.
> > > > > > > With all this accurate query, quite a lot of useful features can 
> > > > > > > be
> > > > > > > developed in user space.
> > > > > > >
> > > > > > > If all of this data is opaque to user app, seems the only use 
> > > > > > > case is
> > > > > > > for live migration.  
> > > > > >
> > > > > > I can certainly appreciate a more versatile interface, but I think
> > > > > > we're also trying to create the most simple interface we can, with 
> > > > > > the
> > > > > > primary target being live migration.  As soon as we start defining 
> > > > > > this
> > > > > > type of device memory and that type of device memory, we're going to
> > > > > > have another device come along that needs yet another because they
> > > > > > have a slightly different requirement.  Even without that, we're going to
> > > > > > have vendor drivers implement it differently, so what works for one
> > > > > > device for a more targeted approach may not work for all devices.  
> > > > > > Can
> > > > > > you enumerate some specific examples of the use cases you imagine
> > > > > > your design to enable?
> > > > > >  
> > > > >
> > > > > Do we want to consider a use case where user space would like to
> > > > > selectively introspect a portion of the device state (including
> > > > > implicit state which is not available through PCI regions), and may
> > > > > ask for the capability of direct mapping of a selected portion for
> > > > > scanning (e.g. device memory) instead of always turning on dirty
> > > > > logging on all device state?
> > > >
> > > > I don't see that a migration interface necessarily lends itself to this
> > > > use case.  A migration data stream has no requirement to be user
> > > > consumable as anything other than opaque data, there's also no
> > > > requirement that it expose state in a form that directly represents the
> > > > internal state of the device.  In fact I'm not sure we want to encourage
> > > > introspection via this data stream.  If a user knows how to interpret
> > > > the data, what prevents them from modifying the data in-flight?  I've
> > > > raised the question previously regarding how the vendor driver can
> > > > validate the integrity of the migration data stream.  Using the
> > > > migration interface to introspect the device certainly suggests an
> > > > interface ripe for exploiting any potential weakness in the vendor
> > > > driver reassembling that migration stream.  If the user has an mmap to
> > > > the actual live working state of the vendor driver, protection in the
> > > > hardware seems like the only way you could protect against a malicious
> > > > user.  Please be defensive in what is directly exposed to the user and
> > > > what safeguards are in place within the vendor driver for validating
> > > > incoming data.  Thanks,

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-10 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Saturday, March 9, 2019 6:03 AM
> 
> On Fri, 8 Mar 2019 16:21:46 +
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Alex Williamson (alex.william...@redhat.com) wrote:
> > > On Thu, 7 Mar 2019 23:20:36 +
> > > "Tian, Kevin"  wrote:
> > >
> > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > Sent: Friday, March 8, 2019 1:44 AM
> > > > > > >
> > > > > > > > This kind of data needs to be saved / loaded in 
> > > > > > > > pre-copy and
> > > > > > > > stop-and-copy phase.
> > > > > > > > The data of device memory is held in device memory 
> > > > > > > > region.
> > > > > > > > Size of device memory is usually larger than that of the
> > > > > > > > device memory region. QEMU needs to save/load it in chunks of
> > > > > > > > the size of the device memory region.
> > > > > > > > Not all devices have device memory. Like IGD only uses system
> > > > > > > > memory.
> > > > > > >
> > > > > > > It seems a little gratuitous to me that this is a separate region 
> > > > > > > or
> > > > > > > that this data is handled separately.  All of this data is opaque 
> > > > > > > to
> > > > > > > QEMU, so why do we need to separate it?
> > > > > > hi Alex,
> > > > > > as the device state interfaces are provided by kernel, it is 
> > > > > > expected to
> > > > > > meet as general needs as possible. So, do you think there are such
> > > > > > use cases from user space that user space knows well of the device, and
> > > > > > it wants kernel to return desired data back to it.
> > > > > > E.g. It just wants to get whole device config data including all 
> > > > > > mmios,
> > > > > > page tables, pci config data...
> > > > > > or, It just wants to get current device memory snapshot, not
> > > > > > including any dirty data.
> > > > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > > > With all this accurate query, quite a lot of useful features can be
> > > > > > developed in user space.
> > > > > >
> > > > > > If all of this data is opaque to user app, seems the only use case 
> > > > > > is
> > > > > > for live migration.
> > > > >
> > > > > I can certainly appreciate a more versatile interface, but I think
> > > > > we're also trying to create the most simple interface we can, with the
> > > > > primary target being live migration.  As soon as we start defining 
> > > > > this
> > > > > type of device memory and that type of device memory, we're going to
> > > > > have another device come along that needs yet another because they
> > > > > have a slightly different requirement.  Even without that, we're going to
> > > > > have vendor drivers implement it differently, so what works for one
> > > > > device for a more targeted approach may not work for all devices.  Can
> > > > > you enumerate some specific examples of the use cases you imagine
> > > > > your design to enable?
> > > > >
> > > >
> > > > Do we want to consider a use case where user space would like to
> > > > selectively introspect a portion of the device state (including implicit
> > > > state which is not available through PCI regions), and may ask for the
> > > > capability of direct mapping of a selected portion for scanning (e.g.
> > > > device memory) instead of always turning on dirty logging on all
> > > > device state?
> > >
> > > I don't see that a migration interface necessarily lends itself to this
> > > use case.  A migration data stream has no requirement to be user
> > > consumable as anything other than opaque data, there's also no
> > > requirement that it expose state in a form that directly represents the
> > > internal state of the device.  In fact I'm not sure we want to encourage
> > > introspection via this data stream.  If a user knows how to interpret
> > > the data, what prevents them from modifying the data in-flight?  I've
> > > raised the question previously regarding how the vendor driver can
> > > validate the integrity of the migration data stream.  Using the
> > > migration interface to introspect the device certainly suggests an
> > > interface ripe for exploiting any potential weakness in the vendor
> > > driver reassembling that migration stream.  If the user has an mmap to
> > > the actual live working state of the vendor driver, protection in the
> > > hardware seems like the only way you could protect against a malicious
> > > user.  Please be defensive in what is directly exposed to the user and
> > > what safeguards are in place within the vendor driver for validating
> > > incoming data.  Thanks,
> >
> > Hmm; that sounds like a security-by-obscurity answer!
> 
> Yup, that's fair.  I won't deny that in-kernel vendor driver state
> passing through userspace from source to target systems scares me quite
> a bit, but defining device introspection as a use case for the
> migration interface imposes requirements on the vendor drivers that
> 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-08 Thread Alex Williamson
On Fri, 8 Mar 2019 16:21:46 +
"Dr. David Alan Gilbert"  wrote:

> * Alex Williamson (alex.william...@redhat.com) wrote:
> > On Thu, 7 Mar 2019 23:20:36 +
> > "Tian, Kevin"  wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Friday, March 8, 2019 1:44 AM
> > > > > >
> > > > > > > This kind of data needs to be saved / loaded in pre-copy 
> > > > > > > and
> > > > > > > stop-and-copy phase.
> > > > > > > The data of device memory is held in device memory region.
> > > > > > > Size of device memory is usually larger than that of the device
> > > > > > > memory region. QEMU needs to save/load it in chunks of the size
> > > > > > > of the device memory region.
> > > > > > > Not all devices have device memory. Like IGD only uses system
> > > > > > > memory.
> > > > > >
> > > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > > that this data is handled separately.  All of this data is opaque to
> > > > > > QEMU, so why do we need to separate it?
> > > > > hi Alex,
> > > > > as the device state interfaces are provided by kernel, it is expected 
> > > > > to
> > > > > meet as general needs as possible. So, do you think there are such use
> > > > > cases from user space that user space knows well of the device, and
> > > > > it wants kernel to return desired data back to it.
> > > > > E.g. It just wants to get whole device config data including all 
> > > > > mmios,
> > > > > page tables, pci config data...
> > > > > or, It just wants to get current device memory snapshot, not 
> > > > > including any
> > > > > dirty data.
> > > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > > With all this accurate query, quite a lot of useful features can be
> > > > > developed in user space.
> > > > >
> > > > > If all of this data is opaque to user app, seems the only use case is
> > > > > for live migration.
> > > > 
> > > > I can certainly appreciate a more versatile interface, but I think
> > > > we're also trying to create the most simple interface we can, with the
> > > > primary target being live migration.  As soon as we start defining this
> > > > type of device memory and that type of device memory, we're going to
> > > > have another device come along that needs yet another because they have
> > > > a slightly different requirement.  Even without that, we're going to
> > > > have vendor drivers implement it differently, so what works for one
> > > > device for a more targeted approach may not work for all devices.  Can
> > > > you enumerate some specific examples of the use cases you imagine your
> > > > design to enable?
> > > > 
> > > 
> > > Do we want to consider a use case where user space would like to
> > > selectively introspect a portion of the device state (including implicit
> > > state which is not available through PCI regions), and may ask for the
> > > capability of direct mapping of a selected portion for scanning (e.g.
> > > device memory) instead of always turning on dirty logging on all
> > > device state?
> > 
> > I don't see that a migration interface necessarily lends itself to this
> > use case.  A migration data stream has no requirement to be user
> > consumable as anything other than opaque data, there's also no
> > requirement that it expose state in a form that directly represents the
> > internal state of the device.  In fact I'm not sure we want to encourage
> > introspection via this data stream.  If a user knows how to interpret
> > the data, what prevents them from modifying the data in-flight?  I've
> > raised the question previously regarding how the vendor driver can
> > validate the integrity of the migration data stream.  Using the
> > migration interface to introspect the device certainly suggests an
> > interface ripe for exploiting any potential weakness in the vendor
> > driver reassembling that migration stream.  If the user has an mmap to
> > the actual live working state of the vendor driver, protection in the
> > hardware seems like the only way you could protect against a malicious
> > user.  Please be defensive in what is directly exposed to the user and
> > what safeguards are in place within the vendor driver for validating
> > incoming data.  Thanks,  
> 
> Hmm; that sounds like a security-by-obscurity answer!

Yup, that's fair.  I won't deny that in-kernel vendor driver state
passing through userspace from source to target systems scares me quite
a bit, but defining device introspection as a use case for the
migration interface imposes requirements on the vendor drivers that
don't otherwise exist.  Mdev vendor specific utilities could always be
written to interpret the migration stream to deduce the internal state,
but I think that imposing segregated device memory vs device config
regions with the expectation that internal state can be directly
tracked is 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-08 Thread Dr. David Alan Gilbert
* Alex Williamson (alex.william...@redhat.com) wrote:
> On Thu, 7 Mar 2019 23:20:36 +
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > Sent: Friday, March 8, 2019 1:44 AM  
> > > > >  
> > > > > > This kind of data needs to be saved / loaded in pre-copy and
> > > > > > stop-and-copy phase.
> > > > > > The data of device memory is held in device memory region.
> > > > > > Size of device memory is usually larger than that of the device
> > > > > > memory region. QEMU needs to save/load it in chunks of the size of
> > > > > > the device memory region.
> > > > > > Not all devices have device memory. Like IGD only uses system
> > > > > > memory.
> > > > >
> > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > that this data is handled separately.  All of this data is opaque to
> > > > > QEMU, so why do we need to separate it?  
> > > > hi Alex,
> > > > as the device state interfaces are provided by kernel, it is expected to
> > > > meet as general needs as possible. So, do you think there are such use
> > > > cases from user space that user space knows well of the device, and
> > > > it wants kernel to return desired data back to it.
> > > > E.g. It just wants to get whole device config data including all mmios,
> > > > page tables, pci config data...
> > > > or, It just wants to get current device memory snapshot, not including 
> > > > any
> > > > dirty data.
> > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > With all this accurate query, quite a lot of useful features can be
> > > > developed in user space.
> > > >
> > > > If all of this data is opaque to user app, seems the only use case is
> > > > for live migration.  
> > > 
> > > I can certainly appreciate a more versatile interface, but I think
> > > we're also trying to create the most simple interface we can, with the
> > > primary target being live migration.  As soon as we start defining this
> > > type of device memory and that type of device memory, we're going to
> > > have another device come along that needs yet another because they have
> > > a slightly different requirement.  Even without that, we're going to
> > > have vendor drivers implement it differently, so what works for one
> > > device for a more targeted approach may not work for all devices.  Can
> > > you enumerate some specific examples of the use cases you imagine your
> > > design to enable?
> > >   
> > 
> > Do we want to consider a use case where user space would like to
> > selectively introspect a portion of the device state (including implicit
> > state which is not available through PCI regions), and may ask for the
> > capability of direct mapping of a selected portion for scanning (e.g.
> > device memory) instead of always turning on dirty logging on all
> > device state?
> 
> I don't see that a migration interface necessarily lends itself to this
> use case.  A migration data stream has no requirement to be user
> consumable as anything other than opaque data, there's also no
> requirement that it expose state in a form that directly represents the
> internal state of the device.  In fact I'm not sure we want to encourage
> introspection via this data stream.  If a user knows how to interpret
> the data, what prevents them from modifying the data in-flight?  I've
> raised the question previously regarding how the vendor driver can
> validate the integrity of the migration data stream.  Using the
> migration interface to introspect the device certainly suggests an
> interface ripe for exploiting any potential weakness in the vendor
> driver reassembling that migration stream.  If the user has an mmap to
> the actual live working state of the vendor driver, protection in the
> hardware seems like the only way you could protect against a malicious
> user.  Please be defensive in what is directly exposed to the user and
> what safeguards are in place within the vendor driver for validating
> incoming data.  Thanks,

Hmm; that sounds like a security-by-obscurity answer!

The scripts/analyze-migration.py script will actually dump the
migration stream data in an almost readable format.
So if you properly define the VMState definitions, it should be almost
readable; it's occasionally been useful.

I agree that you should be very very careful to validate the incoming
migration stream against:
  a) Corruption
  b) Wrong driver versions
  c) Malicious intent
c.1) Especially by the guest
c.2) Or by someone trying to feed you a duff stream
  d) Someone trying to load the VFIO stream into completely the wrong
device.
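
Concretely, (a), (b) and (d) can be caught up front with a small header
check before any payload is consumed; the layout and constants below are
hypothetical, since the stream format is vendor-defined, and (c) still
requires per-field validation of the payload itself:

#include <errno.h>
#include <stdint.h>

/* hypothetical stream header; the real layout is vendor-defined */
struct mig_stream_hdr {
    uint32_t magic;     /* (a)/(d): not even our kind of stream */
    uint32_t flavor;    /* (d): e.g. vendor/device id of the source */
    uint32_t version;   /* (b): source driver version */
    uint32_t csum;      /* (a)/(c.2): checksum over the payload */
};

#define MIG_MAGIC 0x5646494fu   /* hypothetical constant */

static int check_stream_hdr(const struct mig_stream_hdr *h,
                            uint32_t my_flavor, uint32_t my_version,
                            uint32_t payload_csum)
{
    if (h->magic != MIG_MAGIC)
        return -EINVAL;    /* corrupt or completely wrong stream */
    if (h->flavor != my_flavor)
        return -ENODEV;    /* stream from the wrong device type */
    if (h->version > my_version)
        return -ENOTSUP;   /* source newer than we understand */
    if (h->csum != payload_csum)
        return -EILSEQ;    /* corruption, or someone feeding a duff stream */
    return 0;
}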

Whether the migration interface is the right thing to use for that
inspection, hmm; well, it might be - if you're trying to debug
your device and need a dump of its state, then why not?
(I guess you end up with something not dissimilar to what things
like intel_reg_snapshot in 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-08 Thread Alex Williamson
On Thu, 7 Mar 2019 23:20:36 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Friday, March 8, 2019 1:44 AM  
> > > >  
> > > > > This kind of data needs to be saved / loaded in pre-copy and
> > > > > stop-and-copy phase.
> > > > > The data of device memory is held in device memory region.
> > > > > Size of device memory is usually larger than that of the device
> > > > > memory region. QEMU needs to save/load it in chunks of the size of
> > > > > the device memory region.
> > > > > Not all devices have device memory. Like IGD only uses system
> > > > > memory.
> > > >
> > > > It seems a little gratuitous to me that this is a separate region or
> > > > that this data is handled separately.  All of this data is opaque to
> > > > QEMU, so why do we need to separate it?  
> > > hi Alex,
> > > as the device state interfaces are provided by kernel, it is expected to
> > > meet as general needs as possible. So, do you think there are such use
> > > cases from user space that user space knows well of the device, and
> > > it wants kernel to return desired data back to it.
> > > E.g. It just wants to get whole device config data including all mmios,
> > > page tables, pci config data...
> > > or, It just wants to get current device memory snapshot, not including any
> > > dirty data.
> > > Or, It just needs the dirty pages in device memory or system memory.
> > > With all this accurate query, quite a lot of useful features can be
> > > developed in user space.
> > >
> > > If all of this data is opaque to user app, seems the only use case is
> > > for live migration.  
> > 
> > I can certainly appreciate a more versatile interface, but I think
> > we're also trying to create the most simple interface we can, with the
> > primary target being live migration.  As soon as we start defining this
> > type of device memory and that type of device memory, we're going to
> > have another device come along that needs yet another because they have
> > a slightly different requirement.  Even without that, we're going to
> > have vendor drivers implement it differently, so what works for one
> > device for a more targeted approach may not work for all devices.  Can
> > you enumerate some specific examples of the use cases you imagine your
> > design to enable?
> >   
> 
> Do we want to consider a use case where user space would like to
> selectively introspect a portion of the device state (including implicit
> state which is not available through PCI regions), and may ask for the
> capability of direct mapping of a selected portion for scanning (e.g.
> device memory) instead of always turning on dirty logging on all
> device state?

I don't see that a migration interface necessarily lends itself to this
use case.  A migration data stream has no requirement to be user
consumable as anything other than opaque data, there's also no
requirement that it expose state in a form that directly represents the
internal state of the device.  In fact I'm not sure we want to encourage
introspection via this data stream.  If a user knows how to interpret
the data, what prevents them from modifying the data in-flight?  I've
raised the question previously regarding how the vendor driver can
validate the integrity of the migration data stream.  Using the
migration interface to introspect the device certainly suggests an
interface ripe for exploiting any potential weakness in the vendor
driver reassembling that migration stream.  If the user has an mmap to
the actual live working state of the vendor driver, protection in the
hardware seems like the only way you could protect against a malicious
user.  Please be defensive in what is directly exposed to the user and
what safeguards are in place within the vendor driver for validating
incoming data.  Thanks,

Alex



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-07 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Friday, March 8, 2019 1:44 AM
> > >
> > > > This kind of data needs to be saved / loaded in pre-copy and
> > > > stop-and-copy phase.
> > > > The data of device memory is held in device memory region.
> > > > Size of device memory is usually larger than that of the device
> > > > memory region. QEMU needs to save/load it in chunks of the size of
> > > > the device memory region.
> > > > Not all devices have device memory. Like IGD only uses system
> > > > memory.
> > >
> > > It seems a little gratuitous to me that this is a separate region or
> > > that this data is handled separately.  All of this data is opaque to
> > > QEMU, so why do we need to separate it?
> > hi Alex,
> > as the device state interfaces are provided by kernel, it is expected to
> > meet as general needs as possible. So, do you think there are such use
> > cases from user space that user space knows well of the device, and
> > it wants kernel to return desired data back to it.
> > E.g. It just wants to get whole device config data including all mmios,
> > page tables, pci config data...
> > or, It just wants to get current device memory snapshot, not including any
> > dirty data.
> > Or, It just needs the dirty pages in device memory or system memory.
> > With all this accurate query, quite a lot of useful features can be
> > developed in user space.
> >
> > If all of this data is opaque to user app, seems the only use case is
> > for live migration.
> 
> I can certainly appreciate a more versatile interface, but I think
> we're also trying to create the most simple interface we can, with the
> primary target being live migration.  As soon as we start defining this
> type of device memory and that type of device memory, we're going to
> have another device come along that needs yet another because they have
> a slightly different requirement.  Even without that, we're going to
> have vendor drivers implement it differently, so what works for one
> device for a more targeted approach may not work for all devices.  Can
> you enumerate some specific examples of the use cases you imagine your
> design to enable?
> 

Do we want to consider a use case where user space would like to
selectively introspect a portion of the device state (including implicit
state which is not available through PCI regions), and may ask for the
capability of direct mapping of a selected portion for scanning (e.g.
device memory) instead of always turning on dirty logging on all
device state?

Thanks
Kevin



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-07 Thread Alex Williamson
Hi Yan,

Sorry for the delay, I've been on PTO...

On Sun, 24 Feb 2019 21:22:56 -0500
Zhao Yan  wrote:

> On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> > Hi Yan,
> > 
> > Thanks for working on this!
> > 
> > On Tue, 19 Feb 2019 16:50:54 +0800
> > Yan Zhao  wrote:
> >   
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > > 
> > > Device Data
> > > ---
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > > Every device is supposed to possess device config data.
> > >   Usually device config's size is small (no bigger than 10M), and it
> > >  
> > 
> > I'm not sure how we can really impose a limit here, it is what it is
> > for a device.  A smaller state is obviously desirable to reduce
> > downtime, but some devices could have very large states.
> >   
> > > needs to be loaded in certain strict order.
> > > Therefore, device config only needs to be saved/loaded in
> > > stop-and-copy phase.
> > > The data of device config is held in device config region.
> > > Size of device config data is smaller than or equal to that of
> > > device config region.  
> > 
> > So the intention here is that this is the last data read from the
> > device and it's done in one pass, so the region needs to be large
> > enough to expose all config data at once.  On restore it's the last
> > data written before switching the device to the run state.
> >   
> > > 
> > > Device Memory: device's internal memory, standalone and outside system  
> > 
> > s/system/VM/
> >   
> > > memory. It is usually very big.  
> > 
> > Or it doesn't exist.  Not sure we should be setting expectations since
> > it will vary per device.
> >   
> > > This kind of data needs to be saved / loaded in pre-copy and
> > > stop-and-copy phase.
> > > The data of device memory is held in device memory region.
> > > Size of device memory is usually larger than that of the device
> > > memory region. QEMU needs to save/load it in chunks of the size of
> > > the device memory region.
> > > Not all devices have device memory. Like IGD only uses system
> > > memory.
> > 
> > It seems a little gratuitous to me that this is a separate region or
> > that this data is handled separately.  All of this data is opaque to
> > QEMU, so why do we need to separate it?  
> hi Alex,
> as the device state interfaces are provided by kernel, it is expected to
> meet as general needs as possible. So, do you think there are such use
> cases from user space that user space knows well of the device, and
> it wants kernel to return desired data back to it.
> E.g. It just wants to get whole device config data including all mmios,
> page tables, pci config data...
> or, It just wants to get current device memory snapshot, not including any
> dirty data.
> Or, It just needs the dirty pages in device memory or system memory.
> With all this accurate query, quite a lot of useful features can be
> developed in user space.
> 
> If all of this data is opaque to user app, seems the only use case is
> for live migration.

I can certainly appreciate a more versatile interface, but I think
we're also trying to create the most simple interface we can, with the
primary target being live migration.  As soon as we start defining this
type of device memory and that type of device memory, we're going to
have another device come along that needs yet another because they have
a slightly different requirement.  Even without that, we're going to
have vendor drivers implement it differently, so what works for one
device for a more targeted approach may not work for all devices.  Can
you enumerate some specific examples of the use cases you imagine your
design to enable?

> From another aspect, if the final solution is to let the data be opaque to
> user space, like what NV did, the kernel side's implementation will be more
> complicated, and actually a little challenging for the vendor driver.
> in that case, in pre-copy phase,
> 1. in not LOGGING state, vendor driver first returns full data including
> full device memory snapshot

When we're not LOGGING, does the vendor driver need to return
anything?  It seems that LOGGING could be considered an enable switch
for the interface.

> 2. user space reads some data (you can't expect it to finish reading all
> data)
> 3. then userspace set the device state to LOGGING to start dirty data
> logging
> 4. vendor driver starts dirty data logging, and appends the dirty data to
> the tail of remaining unread full data and increase the pending data size?

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-05 Thread Zhao Yan
hi Alex
we still have some open questions below. could you kindly help review them? :)

Thanks
Yan

On Mon, Feb 25, 2019 at 10:22:56AM +0800, Zhao Yan wrote:
> On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> > Hi Yan,
> > 
> > Thanks for working on this!
> > 
> > On Tue, 19 Feb 2019 16:50:54 +0800
> > Yan Zhao  wrote:
> > 
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > > 
> > > Device Data
> > > ---
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > > Every device is supposed to possess device config data.
> > >   Usually device config's size is small (no bigger than 10M), and it
> > 
> > I'm not sure how we can really impose a limit here, it is what it is
> > for a device.  A smaller state is obviously desirable to reduce
> > downtime, but some devices could have very large states.
> > 
> > > needs to be loaded in certain strict order.
> > > Therefore, device config only needs to be saved/loaded in
> > > stop-and-copy phase.
> > > The data of device config is held in device config region.
> > > Size of device config data is smaller than or equal to that of
> > > device config region.
> > 
> > So the intention here is that this is the last data read from the
> > device and it's done in one pass, so the region needs to be large
> > enough to expose all config data at once.  On restore it's the last
> > data written before switching the device to the run state.
> > 
> > > 
> > > Device Memory: device's internal memory, standalone and outside system
> > 
> > s/system/VM/
> > 
> > > memory. It is usually very big.
> > 
> > Or it doesn't exist.  Not sure we should be setting expectations since
> > it will vary per device.
> > 
> > > This kind of data needs to be saved / loaded in pre-copy and
> > > stop-and-copy phase.
> > > The data of device memory is held in device memory region.
> > > Size of device memory is usually larger than that of the device
> > > memory region. QEMU needs to save/load it in chunks of the size of
> > > the device memory region.
> > > Not all devices have device memory. Like IGD only uses system
> > > memory.
> > 
> > It seems a little gratuitous to me that this is a separate region or
> > that this data is handled separately.  All of this data is opaque to
> > QEMU, so why do we need to separate it?
> hi Alex,
> as the device state interfaces are provided by kernel, it is expected to
> meet as general needs as possible. So, do you think there are such use
> cases from user space that user space knows well of the device, and
> it wants kernel to return desired data back to it.
> E.g. It just wants to get whole device config data including all mmios,
> page tables, pci config data...
> or, It just wants to get current device memory snapshot, not including any
> dirty data.
> Or, It just needs the dirty pages in device memory or system memory.
> With all this accurate query, quite a lot of useful features can be
> developed in user space.
> 
> If all of this data is opaque to user app, seems the only use case is
> for live migration.
> 
> From another aspect, if the final solution is to let the data be opaque to
> user space, like what NV did, the kernel side's implementation will be more
> complicated, and actually a little challenging for the vendor driver.
> in that case, in pre-copy phase,
> 1. in not LOGGING state, vendor driver first returns full data including
> full device memory snapshot
> 2. user space reads some data (you can't expect it to finish reading all
> data)
> 3. then userspace set the device state to LOGGING to start dirty data
> logging
> 4. vendor driver starts dirty data logging, and appends the dirty data to
> the tail of remaining unread full data and increase the pending data size?
> 5. user space keeps reading data.
> 6. vendor driver keeps appending new dirty data to the tail of remaining
> unread full data/dirty data and increase the pending data size?
> 
> in stop-and-copy phase
> 1. user space sets device state to exit LOGGING state,
> 2. vendor driver stops data logging. it has to append device config
>data at the tail of remaining dirty data unread by userspace.
> 
> during this flow, when should the vendor driver get dirty data? just keep
> logging and appending to the tail? how to ensure dirty data is fresh before
> the LOGGING state exits? how does the vendor driver know whether certain
> dirty data has been copied or not?
> 
> I've no idea how NVidia handles this problem, and they don't open their
> kernel side code.
> just feel it's a bit hard for other vendor drivers to follow:)

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-24 Thread Zhao Yan
On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> Hi Yan,
> 
> Thanks for working on this!
> 
> On Tue, 19 Feb 2019 16:50:54 +0800
> Yan Zhao  wrote:
> 
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> > 
> > Device Data
> > ---
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> > Every device is supposed to possess device config data.
> > Usually device config's size is small (no bigger than 10M), and it
> 
> I'm not sure how we can really impose a limit here, it is what it is
> for a device.  A smaller state is obviously desirable to reduce
> downtime, but some devices could have very large states.
> 
> > needs to be loaded in certain strict order.
> > Therefore, device config only needs to be saved/loaded in
> > stop-and-copy phase.
> > The data of device config is held in device config region.
> > Size of device config data is smaller than or equal to that of
> > device config region.
> 
> So the intention here is that this is the last data read from the
> device and it's done in one pass, so the region needs to be large
> enough to expose all config data at once.  On restore it's the last
> data written before switching the device to the run state.
> 
> > 
> > Device Memory: device's internal memory, standalone and outside system
> 
> s/system/VM/
> 
> > memory. It is usually very big.
> 
> Or it doesn't exist.  Not sure we should be setting expectations since
> it will vary per device.
> 
> > This kind of data needs to be saved / loaded in pre-copy and
> > stop-and-copy phase.
> > The data of device memory is held in device memory region.
> > Size of device memory is usually larger than that of the device
> > memory region. QEMU needs to save/load it in chunks of the size of
> > the device memory region.
> > Not all devices have device memory. Like IGD only uses system memory.
> 
> It seems a little gratuitous to me that this is a separate region or
> that this data is handled separately.  All of this data is opaque to
> QEMU, so why do we need to separate it?
hi Alex,
as the device state interfaces are provided by kernel, it is expected to
meet as general needs as possible. So, do you think there are such use
cases from user space that user space knows well of the device, and
it wants kernel to return desired data back to it.
E.g. It just wants to get whole device config data including all mmios,
page tables, pci config data...
or, It just wants to get current device memory snapshot, not including any
dirty data.
Or, It just needs the dirty pages in device memory or system memory.
With all this accurate query, quite a lot of useful features can be
developed in user space.

If all of this data is opaque to user app, seems the only use case is
for live migration.

From another aspect, if the final solution is to let the data be opaque to
user space, like what NV did, the kernel side's implementation will be more
complicated, and actually a little challenging for the vendor driver.
in that case, in pre-copy phase,
1. in not LOGGING state, vendor driver first returns full data including
full device memory snapshot
2. user space reads some data (you can't expect it to finish reading all
data)
3. then userspace set the device state to LOGGING to start dirty data
logging
4. vendor driver starts dirty data logging, and appends the dirty data to
the tail of remaining unread full data and increase the pending data size?
5. user space keeps reading data.
6. vendor driver keeps appending new dirty data to the tail of remaining
unread full data/dirty data and increase the pending data size?

in stop-and-copy phase
1. user space sets device state to exit LOGGING state,
2. vendor driver stops data logging. it has to append device config
   data at the tail of remaining dirty data unread by userspace.

during this flow, when should the vendor driver get dirty data? just keep
logging and appending to the tail? how to ensure dirty data is fresh before
the LOGGING state exits? how does the vendor driver know whether certain
dirty data has been copied or not?

I've no idea how NVidia handles this problem, and they don't open their
kernel side code.
just feel it's a bit hard for other vendor drivers to follow :)
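
To make those open questions concrete, the append-to-tail flow above implies
bookkeeping roughly like the sketch below (one staging buffer, plus a pending
count reported to userspace); all names are hypothetical:

/*
 * Append-to-tail accounting sketch for the pre-copy flow above.
 * The unanswered question is visible here: without knowing which
 * dirty data userspace has already consumed, "pending" may never
 * converge if the same pages keep being re-appended.
 */
#include <stddef.h>
#include <string.h>

struct stage_buf {
    char  *data;
    size_t head;    /* next byte userspace will read */
    size_t tail;    /* end of valid data */
    size_t cap;
};

static size_t pending(const struct stage_buf *b)
{
    return b->tail - b->head;   /* the "pending data size" in steps 4/6 */
}

static int append_dirty(struct stage_buf *b, const void *d, size_t n)
{
    if (pending(b) + n > b->cap)
        return -1;                   /* backpressure: let userspace drain */
    memmove(b->data, b->data + b->head, pending(b));   /* compact first */
    b->tail -= b->head;
    b->head = 0;
    memcpy(b->data + b->tail, d, n); /* steps 4/6: append behind unread data */
    b->tail += n;
    return 0;
}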

> > System memory dirty pages: If a device produces dirty pages in system
> > memory, it is able to get dirty bitmap for certain range of system
> > memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > callback, dirty pages in system memory will be save/loaded by ram's
> > live migration code.

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-21 Thread Alex Williamson
Hi Yan,

Thanks for working on this!

On Tue, 19 Feb 2019 16:50:54 +0800
Yan Zhao  wrote:

> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.
> 
> Device Data
> ---
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
> Every device is supposed to possess device config data.
>   Usually device config's size is small (no bigger than 10M), and it

I'm not sure how we can really impose a limit here, it is what it is
for a device.  A smaller state is obviously desirable to reduce
downtime, but some devices could have very large states.

> needs to be loaded in certain strict order.
> Therefore, device config only needs to be saved/loaded in
> stop-and-copy phase.
> The data of device config is held in device config region.
> Size of device config data is smaller than or equal to that of
> device config region.

So the intention here is that this is the last data read from the
device and it's done in one pass, so the region needs to be large
enough to expose all config data at once.  On restore it's the last
data written before switching the device to the run state.

> 
> Device Memory: device's internal memory, standalone and outside system

s/system/VM/

> memory. It is usually very big.

Or it doesn't exist.  Not sure we should be setting expectations since
it will vary per device.

> This kind of data needs to be saved / loaded in pre-copy and
> stop-and-copy phase.
> The data of device memory is held in device memory region.
> Size of device memory is usually larger than that of the device
> memory region. QEMU needs to save/load it in chunks of the size of
> the device memory region.
> Not all devices have device memory. Like IGD only uses system memory.

It seems a little gratuitous to me that this is a separate region or
that this data is handled separately.  All of this data is opaque to
QEMU, so why do we need to separate it?

> System memory dirty pages: If a device produces dirty pages in system
> memory, it is able to get dirty bitmap for certain range of system
> memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> phase in .log_sync callback. By setting dirty bitmap in .log_sync
> callback, dirty pages in system memory will be save/loaded by ram's
> live migration code.
> The dirty bitmap of system memory is held in dirty bitmap region.
> If system memory range is larger than that dirty bitmap region can
> hold, qemu will cut it into several chunks and get dirty bitmap in
> succession.
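
A sketch of the chunked query this describes, on the QEMU side; the region
size, the bitmap_region pointer and vendor_query_dirty() are hypothetical,
while cpu_physical_memory_set_dirty_lebitmap() is QEMU's real entry point
for feeding a dirty bitmap to the RAM migration code:

#include "qemu/osdep.h"
#include "exec/ram_addr.h"

#define BITMAP_REGION_BYTES (1024 * 1024)               /* assumed size */
#define PAGES_PER_CHUNK     (BITMAP_REGION_BYTES * 8)   /* one bit per page */

extern unsigned long *bitmap_region;   /* mmap of the dirty bitmap region */
extern void vendor_query_dirty(uint64_t first_page, uint64_t nr_pages);

static void vfio_sync_dirty(uint64_t first_page, uint64_t nr_pages)
{
    uint64_t done = 0;

    while (done < nr_pages) {      /* cut the range into region-sized chunks */
        uint64_t n = MIN(nr_pages - done, (uint64_t)PAGES_PER_CHUNK);

        vendor_query_dirty(first_page + done, n);
        cpu_physical_memory_set_dirty_lebitmap(bitmap_region,
                                               (first_page + done) *
                                               TARGET_PAGE_SIZE,
                                               n);
        done += n;
    }
}
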
> 
> 
> Device State Regions
> 
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
> Get access via read/write system call.
> Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.

Is mmap mandatory?  I would think this would be defined by the mdev
device what access they want to support per region.  We don't want to
impose a more complicated interface if the device doesn't require it.

> device config region: mandatory, holding data of device config
> device memory region: optional, holding data of device memory
> dirty bitmap region: optional, holding bitmap of system memory
> dirty pages
> 
> (The reason why four separate regions are defined is that the unit of mmap
> system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seems better than one big region
> padded and sparse mmaped).

It's not obvious to me how this is better, a big region isn't padded,
there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
gap in a file really of any consequence?  Each region beyond the header
is more than likely larger than PAGE_SIZE, therefore they can be nicely
aligned together.  We still need fields to tell us how much data is
available in each area, so another to tell us the start of each area is
a minor detail.  And I think we still want to allow drivers to specify
which parts of which areas support mmap, so I don't think we're getting
away from sparse mmap support.
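
For reference, the sparse mmap support mentioned here already exists in the
VFIO uapi (include/uapi/linux/vfio.h): a region can advertise exactly which
of its sub-ranges are mmap-able, so a single large region does not force
mmap of the read/write-only parts:

#define VFIO_REGION_INFO_CAP_SPARSE_MMAP 1

struct vfio_region_sparse_mmap_area {
    __u64 offset;  /* Offset of mmap'able area within region */
    __u64 size;    /* Size of mmap'able area */
};

struct vfio_region_info_cap_sparse_mmap {
    struct vfio_info_cap_header header;
    __u32 nr_areas;
    __u32 reserved;
    struct vfio_region_sparse_mmap_area areas[];
};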

> kernel device state interface [1]
> --
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

If we were to go 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-21 Thread Dr. David Alan Gilbert
* Zhao Yan (yan.y.z...@intel.com) wrote:
> On Wed, Feb 20, 2019 at 11:01:43AM +, Dr. David Alan Gilbert wrote:
> > * Zhao Yan (yan.y.z...@intel.com) wrote:
> > > On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:
> > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > This patchset enables VFIO devices to have live migration capability.
> > > > > Currently it does not support post-copy phase.
> > > > > 
> > > > > It follows Alex's comments on last version of VFIO live migration 
> > > > > patches,
> > > > > including device states, VFIO device state region layout, dirty 
> > > > > bitmap's
> > > > > query.
> > > > 
> > > > Hi,
> > > >   I've sent minor comments to later patches; but some minor general
> > > > comments:
> > > > 
> > > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > > > so check when you can.
> > > hi Dave
> > > Thanks for this suggestion. I'll add more checks for migration streams.
> > > 
> > > 
> > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > version of device?  Or say to a device with older firmware or perhaps
> > > > a device that has less device memory ?
> > > Actually it's still an open question for VFIO migration. Need to think about
> > > whether it's better to check that in libvirt or qemu (like a device magic
> > > along with version?).
> > > This patchset is intended to settle down the main device state interfaces
> > > for VFIO migration. So that we can work on that and improve it.
> > > 
> > > 
> > > >   c) Consider using the trace_ mechanism - it's really useful to
> > > > add to loops writing/reading data so that you can see when it fails.
> > > > 
> > > > Dave
> > > >
> > > Got it. many thanks~~
> > > 
> > > 
> > > > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > > > 'migrtion')
> > > 
> > > sorry :)
> > 
> > No problem.
> > 
> > Given the mails, I'm guessing you've mostly tested this on graphics
> > devices?  Have you also checked with VFIO network cards?
> > 
> yes, I tested it on Intel's graphics devices which do not have device
> memory. so the cap of device-memory is off.
> I believe this patchset can work well on VFIO network cards as well,
> because Gonglei once said their NIC can work well on our previous code
> (i.e. device-memory cap off).

It would be great if you could find some Intel NIC people to help test
it out.

> 
> > Also see the mail I sent in reply to Kirti's series; we need to boil
> > these down to one solution.
> >
> Maybe Kirti can merge their implementation into the code for device-memory
> cap (like in my patch 5 for device-memory).

It would be great to come up with one patchset between yourself and
Kirti that was tested for Intel and Nvidia GPUs and Intel NICs
(and anyone else who wants to jump on board!).

Dave

> > Dave
> > 
> > > > 
> > > > > Device Data
> > > > > ---
> > > > > Device data is divided into three types: device memory, device config,
> > > > > and system memory dirty pages produced by device.
> > > > > 
> > > > > Device config: data like MMIOs, page tables...
> > > > > Every device is supposed to possess device config data.
> > > > >   Usually device config's size is small (no bigger than 10M), and it
> > > > > needs to be loaded in certain strict order.
> > > > > Therefore, device config only needs to be saved/loaded in
> > > > > stop-and-copy phase.
> > > > > The data of device config is held in device config region.
> > > > > Size of device config data is smaller than or equal to that of
> > > > > device config region.
> > > > > 
> > > > > Device Memory: device's internal memory, standalone and outside system
> > > > > memory. It is usually very big.
> > > > > This kind of data needs to be saved / loaded in pre-copy and
> > > > > stop-and-copy phase.
> > > > > The data of device memory is held in device memory region.
> > > > > Size of device memory is usually larger than that of the device
> > > > > memory region. QEMU needs to save/load it in chunks of the size of
> > > > > the device memory region.
> > > > > Not all devices have device memory. Like IGD only uses system
> > > > > memory.
> > > > > 
> > > > > System memory dirty pages: If a device produces dirty pages in system
> > > > > memory, it is able to get dirty bitmap for certain range of 
> > > > > system
> > > > > memory. This dirty bitmap is queried in pre-copy and 
> > > > > stop-and-copy
> > > > > phase in .log_sync callback. By setting dirty bitmap in 
> > > > > .log_sync
> > > > > callback, dirty pages in system memory will be save/loaded by 
> > > > > ram's
> > > > > live migration code.
> > > > > The dirty bitmap of system memory is held in dirty bitmap 
> > > > > region.
> > > > > If system memory range is larger than that dirty bitmap 
> > > > > region can
> > > > > hold, qemu will cut it into several chunks and get dirty bitmap in
> > > > > succession.

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)
> >
> > > -Original Message-
> > > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > > Sent: Thursday, February 21, 2019 10:05 AM
> > > To: Gonglei (Arei) 
> > > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > coh...@redhat.com; shuangtai@alibaba-inc.com;
> dgilb...@redhat.com;
> > > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > > jonathan.dav...@nutanix.com; changpeng@intel.com;
> ken@amd.com;
> > > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > > k...@vger.kernel.org
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > > >
> > > > > > 5) About log sync, why not register log_global_start/stop in
> > > > > vfio_memory_listener?
> > > > > >
> > > > > >
> > > > > seems log_global_start/stop cannot be iteratively called in pre-copy
> > > > > phase?
> > > > > for dirty pages in system memory, it's better to transfer dirty data
> > > > > iteratively to reduce down time, right?
> > > > >
> > > >
> > > > We just need to invoke them only once to start and stop logging. Why do
> > > > we need to call them iteratively? See the memory_listener of vhost.
> > > >
> > > the dirty pages produced in system memory by the device are incremental.
> > > if they can be got iteratively, the dirty pages in the stop-and-copy phase
> > > can be minimal.
> > > :)
> > >
> > I mean starting or stopping the capability of logging, not log sync.
> >
> > We register the below callbacks:
> >
> > .log_sync = vfio_log_sync,
> > .log_global_start = vfio_log_global_start,
> > .log_global_stop = vfio_log_global_stop,
> >
> .log_global_start is also a good point to notify logging state.
> But if we notify in the .save_setup handler, we can do fine-grained
> control of when to notify of logging start, together with the get_buffer
> operation.
> Is there any special benefit to registering .log_global_start/stop?
> 

Performance benefit when one VM has multiple identical vfio devices.
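
For context, registering these hooks looks like the sketch below; the
MemoryListener callbacks are QEMU's real API, while vfio_notify_logging()
stands in for however the vendor driver is told to start/stop logging.
Because log_global_start/stop fire once per listener rather than once per
device save_setup, the logging switch is flipped a single time even with
several identical vfio devices in one VM:

static void vfio_log_global_start(MemoryListener *listener)
{
    VFIOContainer *container = container_of(listener, VFIOContainer, listener);

    vfio_notify_logging(container, true);   /* hypothetical helper */
}

static void vfio_log_global_stop(MemoryListener *listener)
{
    VFIOContainer *container = container_of(listener, VFIOContainer, listener);

    vfio_notify_logging(container, false);  /* hypothetical helper */
}

static const MemoryListener vfio_memory_listener = {
    .region_add       = vfio_listener_region_add,
    .region_del       = vfio_listener_region_del,
    .log_sync         = vfio_log_sync,
    .log_global_start = vfio_log_global_start,
    .log_global_stop  = vfio_log_global_stop,
};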


Regards,
-Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)







> -Original Message-
> From: Zhao Yan [mailto:yan.y.z...@intel.com]
> Sent: Thursday, February 21, 2019 12:08 PM
> To: Gonglei (Arei) 
> Cc: c...@nvidia.com; k...@vger.kernel.org; a...@ozlabs.ru;
> zhengxiao...@alibaba-inc.com; shuangtai@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankh...@nvidia.com; eau...@redhat.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> mlevi...@redhat.com; pa...@linux.ibm.com; fel...@nutanix.com;
> ken@amd.com; kevin.t...@intel.com; dgilb...@redhat.com;
> alex.william...@redhat.com; intel-gvt-...@lists.freedesktop.org;
> changpeng@intel.com; coh...@redhat.com; zhi.a.w...@intel.com;
> jonathan.dav...@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Thu, Feb 21, 2019 at 03:33:24AM +, Gonglei (Arei) wrote:
> >
> > > -Original Message-
> > > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > > Sent: Thursday, February 21, 2019 9:59 AM
> > > To: Gonglei (Arei) 
> > > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > coh...@redhat.com; shuangtai@alibaba-inc.com;
> dgilb...@redhat.com;
> > > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > > jonathan.dav...@nutanix.com; changpeng@intel.com;
> ken@amd.com;
> > > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > > k...@vger.kernel.org
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > On Thu, Feb 21, 2019 at 01:35:43AM +, Gonglei (Arei) wrote:
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > > > > Sent: Thursday, February 21, 2019 8:25 AM
> > > > > To: Gonglei (Arei) 
> > > > > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > > > > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > > > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > > > coh...@redhat.com; shuangtai@alibaba-inc.com;
> > > dgilb...@redhat.com;
> > > > > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > > > > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > > > > jonathan.dav...@nutanix.com; changpeng@intel.com;
> > > ken@amd.com;
> > > > > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > > > > k...@vger.kernel.org
> > > > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > > > >
> > > > > On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> > > > > > Hi yan,
> > > > > >
> > > > > > Thanks for your work.
> > > > > >
> > > > > > I have some suggestions or questions:
> > > > > >
> > > > > > 1) Would you add msix mode support,? if not, pls add a check in
> > > > > vfio_pci_save_config(), likes Nvidia's solution.
> > > > > ok.
> > > > >
> > > > > > 2) We should start vfio devices before the vcpus resume, so we can't
> > > > > > rely on the vm start change handler completely.
> > > > > vfio devices are by default set to the running state.
> > > > > In the target machine, the state transition flow is
> > > > > running->stop->running.
> > > >
> > > > That's confusing. We should start vfio devices after vfio_load_state;
> > > > otherwise, how can you keep the devices' information the same between the
> > > > source side and the destination side?
> > > >
> > > So you mean we should set the device state to running in the first call to
> > > vfio_load_state?
> > >
> > No, we should start devices after vfio_load_state and before the vcpus resume.
> >
> 
> What about setting the device state to running in the load_cleanup handler?
> 

The timing is fine, but you should also think about whether to set the device
state to running in the failure branches that call the load_cleanup handler.
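
A rough sketch of that idea (the helper and field names are assumed, not
from the patchset):

static int vfio_load_cleanup(void *opaque)
{
    VFIODevice *vbasedev = opaque;

    /* load_cleanup runs on both success and failure paths, so decide the
     * final state here: resume the device only if the stream was loaded
     * successfully; on failure leave it stopped for a possible rollback. */
    if (!vbasedev->load_failed) {
        vfio_set_device_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
    }
    return 0;
}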

Regards,
-Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
On Thu, Feb 21, 2019 at 03:16:45AM +, Gonglei (Arei) wrote:
> 
> 
> 
> > -Original Message-
> > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > Sent: Thursday, February 21, 2019 10:05 AM
> > To: Gonglei (Arei) 
> > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > k...@vger.kernel.org
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > > >
> > > > > 5) About log sync, why not register log_global_start/stop in
> > > > vfio_memory_listener?
> > > > >
> > > > >
> > > > It seems log_global_start/stop cannot be called iteratively in the
> > > > pre-copy phase? For dirty pages in system memory, it's better to transfer
> > > > dirty data iteratively to reduce downtime, right?
> > > >
> > >
> > > We just need to invoke start and stop logging once. Why do we need to
> > > call them iteratively? See the memory_listener of vhost.
> > >
> > The dirty pages in system memory produced by the device are incremental.
> > If they can be fetched iteratively, the dirty pages left for the
> > stop-and-copy phase can be minimal.
> > :)
> > 
> I mean starting or stopping the capability of logging, not log sync. 
> 
> We register the below callbacks:
> 
> .log_sync = vfio_log_sync,
> .log_global_start = vfio_log_global_start,
> .log_global_stop = vfio_log_global_stop,
>
.log_global_start is also a good point to notify the logging state.
But if we notify in the .save_setup handler, we can do fine-grained
control of when to notify of logging start, together with the get_buffer
operation.
Is there any special benefit to registering .log_global_start/stop?


> Regards,
> -Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
On Thu, Feb 21, 2019 at 03:33:24AM +, Gonglei (Arei) wrote:
> 
> > -Original Message-
> > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > Sent: Thursday, February 21, 2019 9:59 AM
> > To: Gonglei (Arei) 
> > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > k...@vger.kernel.org
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > On Thu, Feb 21, 2019 at 01:35:43AM +, Gonglei (Arei) wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > > > Sent: Thursday, February 21, 2019 8:25 AM
> > > > To: Gonglei (Arei) 
> > > > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > > > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > > coh...@redhat.com; shuangtai@alibaba-inc.com;
> > dgilb...@redhat.com;
> > > > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > > > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > > > jonathan.dav...@nutanix.com; changpeng@intel.com;
> > ken@amd.com;
> > > > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > > > k...@vger.kernel.org
> > > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > > >
> > > > On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> > > > > Hi yan,
> > > > >
> > > > > Thanks for your work.
> > > > >
> > > > > I have some suggestions or questions:
> > > > >
> > > > > 1) Would you add MSI-X mode support? If not, please add a check in
> > > > vfio_pci_save_config(), like Nvidia's solution.
> > > > ok.
> > > >
> > > > > 2) We should start vfio devices before the vcpus resume, so we can't
> > > > > rely on the vm start change handler completely.
> > > > vfio devices are by default set to the running state.
> > > > In the target machine, the state transition flow is
> > > > running->stop->running.
> > >
> > > That's confusing. We should start vfio devices after vfio_load_state;
> > > otherwise, how can you keep the devices' information the same between the
> > > source side and the destination side?
> > >
> > So you mean we should set the device state to running in the first call to
> > vfio_load_state?
> > 
> No, we should start devices after vfio_load_state and before the vcpus resume.
>

What about setting the device state to running in the load_cleanup handler?

> > > > so, maybe you can ignore the stop notification in the kernel?
> > > > > 3) We'd better support live migration rollback since there are many
> > > > > failure scenarios; registering a migration notifier is a good choice.
> > > > I think this patchset can also handle the failure case well.
> > > > If migration failure or cancellation happens, the LOGGING state is
> > > > cleared in the cleanup handler; the device state (running or stopped)
> > > > is kept as it is.
> > >
> > > IIRC there are many failure paths that don't call the cleanup handler.
> > >
> > Could you give an example?
> 
> Never mind, that's another bug I think. 
> 
> > > > Then,
> > > > if the vm switches back to running, the device state will be set to running;
> > > > if the vm stays stopped, the device state also stays stopped (there is
> > > > no point leaving it in the running state).
> > > > Do you think so?
> > > >
> > > If the underlying state machine is complicated,
> > > we should proactively tell the vendor driver about the canceling state.
> > >
> > That makes sense.
> > 
> > > > > 4) Four memory regions for live migration are too complicated IMHO.
> > > > One big region requires the sub-regions to be well padded.
> > > > For example, the first control fields have to be padded to 4K,
> > > > and the same goes for the other data fields.
> > > > Otherwise, mmap simply fails, because the start offset and size for mmap
> > > > both need to be PAGE aligned.
> > > >
> > > But if we don't need to use mmap for the control fields and device state,
> > > they are basically small.
> > > The performance of pread/pwrite is enough.
> > >
> > We don't mmap the control fields. But if the data fields follow immediately
> > after the control fields (e.g. at just 64 bytes), we can't mmap the data
> > fields successfully because their start offset is 64. Therefore the control
> > fields have to be padded to 4K to let the data fields start at 4K.
> > That's the drawback of one big region holding both control and data fields.
> > 
> > > > Also, 4 regions are clearer in my view :)
> > > >
> > > > > 5) About log sync, why not register log_global_start/stop in
> > > > vfio_memory_listener?
> > > > >
> > > > >
> > > > seems 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)


> -Original Message-
> From: Zhao Yan [mailto:yan.y.z...@intel.com]
> Sent: Thursday, February 21, 2019 9:59 AM
> To: Gonglei (Arei) 
> Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> k...@vger.kernel.org
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Thu, Feb 21, 2019 at 01:35:43AM +, Gonglei (Arei) wrote:
> >
> >
> > > -Original Message-
> > > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > > Sent: Thursday, February 21, 2019 8:25 AM
> > > To: Gonglei (Arei) 
> > > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > coh...@redhat.com; shuangtai@alibaba-inc.com;
> dgilb...@redhat.com;
> > > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > > jonathan.dav...@nutanix.com; changpeng@intel.com;
> ken@amd.com;
> > > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > > k...@vger.kernel.org
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> > > > Hi yan,
> > > >
> > > > Thanks for your work.
> > > >
> > > > I have some suggestions or questions:
> > > >
> > > > 1) Would you add MSI-X mode support? If not, please add a check in
> > > vfio_pci_save_config(), like Nvidia's solution.
> > > ok.
> > >
> > > > 2) We should start vfio devices before the vcpus resume, so we can't rely
> > > > on the vm start change handler completely.
> > > vfio devices are by default set to the running state.
> > > In the target machine, the state transition flow is
> > > running->stop->running.
> >
> > That's confusing. We should start vfio devices after vfio_load_state;
> > otherwise, how can you keep the devices' information the same between the
> > source side and the destination side?
> >
> So you mean we should set the device state to running in the first call to
> vfio_load_state?
> 
No, we should start devices after vfio_load_state and before the vcpus resume.

> > > so, maybe you can ignore the stop notification in the kernel?
> > > > 3) We'd better support live migration rollback since there are many
> > > > failure scenarios; registering a migration notifier is a good choice.
> > > I think this patchset can also handle the failure case well.
> > > If migration failure or cancellation happens, the LOGGING state is
> > > cleared in the cleanup handler; the device state (running or stopped)
> > > is kept as it is.
> >
> > IIRC there are many failure paths that don't call the cleanup handler.
> >
> Could you give an example?

Never mind, that's another bug I think. 

> > > Then,
> > > if the vm switches back to running, the device state will be set to running;
> > > if the vm stays stopped, the device state also stays stopped (there is
> > > no point leaving it in the running state).
> > > Do you think so?
> > >
> > If the underlying state machine is complicated,
> > we should proactively tell the vendor driver about the canceling state.
> >
> That makes sense.
> 
> > > > 4) Four memory regions for live migration are too complicated IMHO.
> > > One big region requires the sub-regions to be well padded.
> > > For example, the first control fields have to be padded to 4K,
> > > and the same goes for the other data fields.
> > > Otherwise, mmap simply fails, because the start offset and size for mmap
> > > both need to be PAGE aligned.
> > >
> > But if we don't need to use mmap for the control fields and device state,
> > they are basically small.
> > The performance of pread/pwrite is enough.
> >
> We don't mmap the control fields. But if the data fields follow immediately
> after the control fields (e.g. at just 64 bytes), we can't mmap the data
> fields successfully because their start offset is 64. Therefore the control
> fields have to be padded to 4K to let the data fields start at 4K.
> That's the drawback of one big region holding both control and data fields.
> 
> > > Also, 4 regions are clearer in my view :)
> > >
> > > > 5) About log sync, why not register log_global_start/stop in
> > > vfio_memory_listener?
> > > >
> > > >
> > > It seems log_global_start/stop cannot be called iteratively in the pre-copy
> > > phase? For dirty pages in system memory, it's better to transfer dirty data
> > > iteratively to reduce downtime, right?
> > >
> >
> > We just need to invoke start and stop logging once. Why do we need to
> > call them iteratively? See the memory_listener of vhost.
> >
> 
> 
> 
> > Regards,
> > -Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)




> -Original Message-
> From: Zhao Yan [mailto:yan.y.z...@intel.com]
> Sent: Thursday, February 21, 2019 10:05 AM
> To: Gonglei (Arei) 
> Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> k...@vger.kernel.org
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> > >
> > > > 5) About log sync, why not register log_global_start/stop in
> > > vfio_memory_listener?
> > > >
> > > >
> > > It seems log_global_start/stop cannot be called iteratively in the pre-copy
> > > phase? For dirty pages in system memory, it's better to transfer dirty data
> > > iteratively to reduce downtime, right?
> > >
> >
> > We just need to invoke start and stop logging once. Why do we need to
> > call them iteratively? See the memory_listener of vhost.
> >
> The dirty pages in system memory produced by the device are incremental.
> If they can be fetched iteratively, the dirty pages left for the
> stop-and-copy phase can be minimal.
> :)
> 
I mean starting or stopping the capability of logging, not log sync. 

We register the below callbacks:

.log_sync = vfio_log_sync,
.log_global_start = vfio_log_global_start,
.log_global_stop = vfio_log_global_stop,

Regards,
-Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
> > 
> > > 5) About log sync, why not register log_global_start/stop in
> > vfio_memory_listener?
> > >
> > >
> > It seems log_global_start/stop cannot be called iteratively in the pre-copy
> > phase? For dirty pages in system memory, it's better to transfer dirty data
> > iteratively to reduce downtime, right?
> > 
> 
> We just need to invoke start and stop logging once. Why do we need to
> call them iteratively? See the memory_listener of vhost.
>
The dirty pages in system memory produced by the device are incremental.
If they can be fetched iteratively, the dirty pages left for the
stop-and-copy phase can be minimal.
:)

> Regards,
> -Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
On Thu, Feb 21, 2019 at 01:35:43AM +, Gonglei (Arei) wrote:
> 
> 
> > -Original Message-
> > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > Sent: Thursday, February 21, 2019 8:25 AM
> > To: Gonglei (Arei) 
> > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > k...@vger.kernel.org
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> > > Hi yan,
> > >
> > > Thanks for your work.
> > >
> > > I have some suggestions or questions:
> > >
> > > > 1) Would you add MSI-X mode support? If not, please add a check in
> > vfio_pci_save_config(), like Nvidia's solution.
> > ok.
> > 
> > > 2) We should start vfio devices before the vcpus resume, so we can't rely
> > > on the vm start change handler completely.
> > vfio devices are by default set to the running state.
> > In the target machine, the state transition flow is running->stop->running.
> 
> That's confusing. We should start vfio devices after vfio_load_state;
> otherwise, how can you keep the devices' information the same between the
> source side and the destination side?
>
So you mean we should set the device state to running in the first call to
vfio_load_state?

> > so, maybe you can ignore the stop notification in the kernel?
> > > 3) We'd better support live migration rollback since there are many
> > > failure scenarios; registering a migration notifier is a good choice.
> > I think this patchset can also handle the failure case well.
> > If migration failure or cancellation happens, the LOGGING state is
> > cleared in the cleanup handler; the device state (running or stopped)
> > is kept as it is.
> 
> IIRC there are many failure paths that don't call the cleanup handler.
>
Could you give an example?
> > Then,
> > if the vm switches back to running, the device state will be set to running;
> > if the vm stays stopped, the device state also stays stopped (there is
> > no point leaving it in the running state).
> > Do you think so?
> > 
> If the underlying state machine is complicated,
> we should proactively tell the vendor driver about the canceling state.
> 
That makes sense.
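
A sketch of the notifier approach (the vfio-side names are assumed;
add_migration_state_change_notifier() is the existing QEMU hook):

static void vfio_migration_state_notifier(Notifier *notifier, void *data)
{
    MigrationState *s = data;

    if (migration_has_failed(s)) {
        /* migration was canceled or failed: tell the vendor driver
         * proactively instead of relying only on the cleanup handler */
        vfio_notify_cancel(notifier);   /* assumed helper */
    }
}

static Notifier vfio_migration_state = {
    .notify = vfio_migration_state_notifier,
};

/* registered once at device realize time:
 *     add_migration_state_change_notifier(&vfio_migration_state);
 */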

> > > 4) Four memory regions for live migration are too complicated IMHO.
> > One big region requires the sub-regions to be well padded.
> > For example, the first control fields have to be padded to 4K,
> > and the same goes for the other data fields.
> > Otherwise, mmap simply fails, because the start offset and size for mmap
> > both need to be PAGE aligned.
> > 
> But if we don't need to use mmap for the control fields and device state,
> they are basically small.
> The performance of pread/pwrite is enough.
> 
We don't mmap the control fields. But if the data fields follow immediately
after the control fields (e.g. at just 64 bytes), we can't mmap the data
fields successfully because their start offset is 64. Therefore the control
fields have to be padded to 4K to let the data fields start at 4K.
That's the drawback of one big region holding both control and data fields.
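
To make the alignment constraint concrete, a small standalone sketch (plain
C, not taken from the patchset):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* mmap requires the file offset to be a multiple of the page size, so a
 * data field placed at offset 64 cannot be mapped; the offset has to be
 * padded up to the next page boundary first. */
int main(void)
{
    long page = sysconf(_SC_PAGESIZE);               /* typically 4096 */
    off_t data_offset = 64;                          /* mmap at 64 fails with EINVAL */
    off_t padded = (data_offset + page - 1) & ~(off_t)(page - 1);

    printf("unaligned offset: %lld, padded offset: %lld\n",
           (long long)data_offset, (long long)padded); /* 64 -> 4096 */
    return 0;
}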

> > Also, 4 regions are clearer in my view :)
> > 
> > > 5) About log sync, why not register log_global_start/stop in
> > vfio_memory_listener?
> > >
> > >
> > It seems log_global_start/stop cannot be called iteratively in the pre-copy
> > phase? For dirty pages in system memory, it's better to transfer dirty data
> > iteratively to reduce downtime, right?
> > 
> 
> We just need to invoke start and stop logging once. Why do we need to
> call them iteratively? See the memory_listener of vhost.
> 



> Regards,
> -Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)



> -Original Message-
> From: Zhao Yan [mailto:yan.y.z...@intel.com]
> Sent: Thursday, February 21, 2019 8:25 AM
> To: Gonglei (Arei) 
> Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> k...@vger.kernel.org
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> > Hi yan,
> >
> > Thanks for your work.
> >
> > I have some suggestions or questions:
> >
> > 1) Would you add MSI-X mode support? If not, please add a check in
> vfio_pci_save_config(), like Nvidia's solution.
> ok.
> 
> > 2) We should start vfio devices before the vcpus resume, so we can't rely on
> > the vm start change handler completely.
> vfio devices are by default set to the running state.
> In the target machine, the state transition flow is running->stop->running.

That's confusing. We should start vfio devices after vfio_load_state;
otherwise, how can you keep the devices' information the same between the
source side and the destination side?

> so, maybe you can ignore the stop notification in the kernel?
> > 3) We'd better support live migration rollback since there are many
> > failure scenarios; registering a migration notifier is a good choice.
> I think this patchset can also handle the failure case well.
> If migration failure or cancellation happens, the LOGGING state is
> cleared in the cleanup handler; the device state (running or stopped)
> is kept as it is.

IIRC there are many failure paths that don't call the cleanup handler.

> Then,
> if the vm switches back to running, the device state will be set to running;
> if the vm stays stopped, the device state also stays stopped (there is
> no point leaving it in the running state).
> Do you think so?
> 
If the underlying state machine is complicated,
we should proactively tell the vendor driver about the canceling state.

> > 4) Four memory regions for live migration are too complicated IMHO.
> One big region requires the sub-regions to be well padded.
> For example, the first control fields have to be padded to 4K,
> and the same goes for the other data fields.
> Otherwise, mmap simply fails, because the start offset and size for mmap
> both need to be PAGE aligned.
> 
But if we don't need to use mmap for the control fields and device state,
they are basically small.
The performance of pread/pwrite is enough.

> Also, 4 regions are clearer in my view :)
> 
> > 5) About log sync, why not register log_global_start/stop in
> vfio_memory_listener?
> >
> >
> It seems log_global_start/stop cannot be called iteratively in the pre-copy
> phase? For dirty pages in system memory, it's better to transfer dirty data
> iteratively to reduce downtime, right?
> 

We just need to invoke start and stop logging once. Why do we need to call
them iteratively? See the memory_listener of vhost.

Regards,
-Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
On Wed, Feb 20, 2019 at 11:01:43AM +, Dr. David Alan Gilbert wrote:
> * Zhao Yan (yan.y.z...@intel.com) wrote:
> > On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > This patchset enables VFIO devices to have live migration capability.
> > > > Currently it does not support post-copy phase.
> > > > 
> > > > It follows Alex's comments on last version of VFIO live migration 
> > > > patches,
> > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > query.
> > > 
> > > Hi,
> > >   I've sent minor comments to later patches; but some minor general
> > > comments:
> > > 
> > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > > so check when you can.
> > hi Dave
> > Thanks for this suggestion. I'll add more checks for migration streams.
> > 
> > 
> > >   b) How do we detect if we're migrating from/to the wrong device or
> > > version of device?  Or say to a device with older firmware or perhaps
> > > a device that has less device memory ?
> > Actually it's still an open question for VFIO migration. We need to think
> > about whether it's better to check that in libvirt or qemu (like a device
> > magic along with a version?).
> > This patchset is intended to settle down the main device state interfaces
> > for VFIO migration, so that we can work on that and improve it.
> > 
> > 
> > >   c) Consider using the trace_ mechanism - it's really useful to
> > > add to loops writing/reading data so that you can see when it fails.
> > > 
> > > Dave
> > >
> > Got it. many thanks~~
> > 
> > 
> > > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > > 'migrtion'.)
> > 
> > sorry :)
> 
> No problem.
> 
> Given the mails, I'm guessing you've mostly tested this on graphics
> devices?  Have you also checked with VFIO network cards?
> 
Yes, I tested it on Intel's graphics devices, which do not have device
memory, so the device-memory cap is off.
I believe this patchset can work well on VFIO network cards as well,
because Gonglei once said their NIC worked well on our previous code
(i.e. with the device-memory cap off).


> Also see the mail I sent in reply to Kirti's series; we need to boil
> these down to one solution.
>
Maybe Kirti can merge their implementation into the code for the device-memory
cap (as in my patch 5 for device memory).

> Dave
> 
> > > 
> > > > Device Data
> > > > ---
> > > > Device data is divided into three types: device memory, device config,
> > > > and system memory dirty pages produced by device.
> > > > 
> > > > Device config: data like MMIOs, page tables...
> > > > Every device is supposed to possess device config data.
> > > > Usually device config's size is small (no bigger than 10M), and it
> > > > needs to be loaded in a certain strict order.
> > > > Therefore, device config only needs to be saved/loaded in
> > > > stop-and-copy phase.
> > > > The data of device config is held in device config region.
> > > > Size of device config data is smaller than or equal to that of
> > > > device config region.
> > > > 
> > > > Device Memory: device's internal memory, standalone and outside system
> > > > memory. It is usually very big.
> > > > This kind of data needs to be saved / loaded in pre-copy and
> > > > stop-and-copy phase.
> > > > The data of device memory is held in device memory region.
> > > > Size of device memory is usually larger than that of device
> > > > memory region. qemu needs to save/load it in chunks of size of
> > > > device memory region.
> > > > Not all devices have device memory. For example, IGD only uses system
> > > > memory.
> > > > 
> > > > System memory dirty pages: If a device produces dirty pages in system
> > > > memory, it is able to get dirty bitmap for certain range of 
> > > > system
> > > > memory. This dirty bitmap is queried in pre-copy and 
> > > > stop-and-copy
> > > > phase in .log_sync callback. By setting dirty bitmap in 
> > > > .log_sync
> > > > callback, dirty pages in system memory will be save/loaded by 
> > > > ram's
> > > > live migration code.
> > > > The dirty bitmap of system memory is held in dirty bitmap 
> > > > region.
> > > > If system memory range is larger than that dirty bitmap region 
> > > > can
> > > > hold, qemu will cut it into several chunks and get dirty bitmap 
> > > > in
> > > > succession.
> > > > 
> > > > 
> > > > Device State Regions
> > > > 
> > > > Vendor driver is required to expose two mandatory regions and another 
> > > > two
> > > > optional regions if it plans to support device state management.
> > > > 
> > > > So, there are up to four regions in total.
> > > > One control region: mandatory.
> > > > Get access via read/write system call.
> > > > Its layout is 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> Hi yan,
> 
> Thanks for your work.
> 
> I have some suggestions or questions:
> 
> 1) Would you add MSI-X mode support? If not, please add a check in
> vfio_pci_save_config(), like Nvidia's solution.
ok.

> 2) We should start vfio devices before the vcpus resume, so we can't rely on
> the vm start change handler completely.
vfio devices are by default set to the running state.
In the target machine, the state transition flow is running->stop->running.
so, maybe you can ignore the stop notification in the kernel?
> 3) We'd better support live migration rollback since there are many failure
> scenarios; registering a migration notifier is a good choice.
I think this patchset can also handle the failure case well.
If migration failure or cancellation happens, the LOGGING state is
cleared in the cleanup handler; the device state (running or stopped)
is kept as it is.
Then,
if the vm switches back to running, the device state will be set to running;
if the vm stays stopped, the device state also stays stopped (there is
no point leaving it in the running state).
Do you think so?

> 4) Four memory regions for live migration are too complicated IMHO.
One big region requires the sub-regions to be well padded.
For example, the first control fields have to be padded to 4K,
and the same goes for the other data fields.
Otherwise, mmap simply fails, because the start offset and size for mmap
both need to be PAGE aligned.

Also, 4 regions are clearer in my view :)

> 5) About log sync, why not register log_global_start/stop in 
> vfio_memory_listener?
> 
> 
It seems log_global_start/stop cannot be called iteratively in the pre-copy
phase? For dirty pages in system memory, it's better to transfer dirty data
iteratively to reduce downtime, right?


> Regards,
> -Gonglei
> 
> 
> > -Original Message-
> > From: Yan Zhao [mailto:yan.y.z...@intel.com]
> > Sent: Tuesday, February 19, 2019 4:51 PM
> > To: alex.william...@redhat.com; qemu-devel@nongnu.org
> > Cc: intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com; Gonglei (Arei)
> > ; k...@vger.kernel.org; Yan Zhao
> > 
> > Subject: [PATCH 0/5] QEMU VFIO live migration
> > 
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> > 
> > Device Data
> > ---
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> > Every device is supposed to possess device config data.
> > Usually device config's size is small (no bigger than 10M), and it
> > needs to be loaded in a certain strict order.
> > Therefore, device config only needs to be saved/loaded in
> > stop-and-copy phase.
> > The data of device config is held in device config region.
> > Size of device config data is smaller than or equal to that of
> > device config region.
> > 
> > Device Memory: device's internal memory, standalone and outside system
> > memory. It is usually very big.
> > This kind of data needs to be saved / loaded in pre-copy and
> > stop-and-copy phase.
> > The data of device memory is held in device memory region.
> > Size of device memory is usually larger than that of device
> > memory region. qemu needs to save/load it in chunks of size of
> > device memory region.
> > Not all devices have device memory. For example, IGD only uses system memory.
> > 
> > System memory dirty pages: If a device produces dirty pages in system
> > memory, it is able to get dirty bitmap for certain range of system
> > memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > callback, dirty pages in system memory will be save/loaded by ram's
> > live migration code.
> > The dirty bitmap of system memory is held in dirty bitmap region.
> > If system memory range is larger than that dirty bitmap region can
> > hold, qemu will cut it into several chunks and get dirty bitmap in
> > succession.
> > 
> > 
> > Device State Regions
> > 
> > Vendor driver is required to expose two mandatory regions and another two
> > optional regions if it plans to support 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)



> -Original Message-
> From: Cornelia Huck [mailto:coh...@redhat.com]
> Sent: Wednesday, February 20, 2019 7:43 PM
> To: Gonglei (Arei) 
> Cc: Dr. David Alan Gilbert ; Zhao Yan
> ; c...@nvidia.com; k...@vger.kernel.org;
> a...@ozlabs.ru; zhengxiao...@alibaba-inc.com; shuangtai@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankh...@nvidia.com; eau...@redhat.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> mlevi...@redhat.com; pa...@linux.ibm.com; fel...@nutanix.com;
> ken@amd.com; kevin.t...@intel.com; alex.william...@redhat.com;
> intel-gvt-...@lists.freedesktop.org; changpeng@intel.com;
> zhi.a.w...@intel.com; jonathan.dav...@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Wed, 20 Feb 2019 11:28:46 +
> "Gonglei (Arei)"  wrote:
> 
> > > -Original Message-
> > > From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
> > > Sent: Wednesday, February 20, 2019 7:02 PM
> > > To: Zhao Yan 
> > > Cc: c...@nvidia.com; k...@vger.kernel.org; a...@ozlabs.ru;
> > > zhengxiao...@alibaba-inc.com; shuangtai@alibaba-inc.com;
> > > qemu-devel@nongnu.org; kwankh...@nvidia.com; eau...@redhat.com;
> > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > mlevi...@redhat.com; pa...@linux.ibm.com; Gonglei (Arei)
> > > ; fel...@nutanix.com; ken@amd.com;
> > > kevin.t...@intel.com; alex.william...@redhat.com;
> > > intel-gvt-...@lists.freedesktop.org; changpeng@intel.com;
> > > coh...@redhat.com; zhi.a.w...@intel.com;
> jonathan.dav...@nutanix.com
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > * Zhao Yan (yan.y.z...@intel.com) wrote:
> > > > On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:
> > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > This patchset enables VFIO devices to have live migration 
> > > > > > capability.
> > > > > > Currently it does not support post-copy phase.
> > > > > >
> > > > > > It follows Alex's comments on last version of VFIO live migration
> patches,
> > > > > > including device states, VFIO device state region layout, dirty 
> > > > > > bitmap's
> > > > > > query.
> 
> > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > a device that has less device memory ?
> > > > Actually it's still an open question for VFIO migration. We need to think
> > > > about whether it's better to check that in libvirt or qemu (like a device
> > > > magic along with a version?).
> >
> > We must keep the hardware generation the same within one POD of public
> > cloud providers. But we are still thinking about live migration from a
> > lower generation of hardware to a higher generation.
> 
> Agreed, lower->higher is the one direction that might make sense to
> support.
> 
> But regardless of that, I think we need to make sure that incompatible
> devices/versions fail directly instead of failing in a subtle, hard to
> debug way. Might be useful to do some initial sanity checks in libvirt
> as well.
> 
> How easy is it to obtain that information in a form that can be
> consumed by higher layers? Can we find out the device type at least?
> What about some kind of revision?

We can provide an interface to query whether the VM supports live migration
in the prepare phase of libvirt.

Can we get the revision_id from the vendor driver before invoking

register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
                     revision_id,
                     _vfio_handlers,
                     vdev);

and then limit live migration from higher generations to lower generations?
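
As an illustration of how that check could work (purely a sketch: qemu's
generic savevm code already rejects an incoming stream whose version_id is
higher than the one registered on the destination, and vdev->revision_id
here is an assumed field):

static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
{
    VFIOPCIDevice *vdev = opaque;

    /* version_id is the source's revision_id recorded in the stream;
     * higher-generation sources were already refused generically, so
     * only equal or older (lower-generation) streams reach this point. */
    if (version_id < vdev->revision_id) {
        /* apply compatibility fixups for the older generation here */
    }

    /* ... load device config/memory as usual ... */
    return 0;
}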

Regards,
-Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)
Hi yan,

Thanks for your work.

I have some suggestions or questions:

1) Would you add msix mode support,? if not, pls add a check in 
vfio_pci_save_config(), likes Nvidia's solution.
2) We should start vfio devices before vcpu resumes, so we can't rely on vm 
start change handler completely.
3) We'd better support live migration rollback since have many failure 
scenarios,
 register a migration notifier is a good choice.
4) Four memory region for live migration is too complicated IMHO. 
5) About log sync, why not register log_global_start/stop in 
vfio_memory_listener?


Regards,
-Gonglei


> -Original Message-
> From: Yan Zhao [mailto:yan.y.z...@intel.com]
> Sent: Tuesday, February 19, 2019 4:51 PM
> To: alex.william...@redhat.com; qemu-devel@nongnu.org
> Cc: intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com; Gonglei (Arei)
> ; k...@vger.kernel.org; Yan Zhao
> 
> Subject: [PATCH 0/5] QEMU VFIO live migration
> 
> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.
> 
> Device Data
> ---
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
> Every device is supposed to possess device config data.
>   Usually device config's size is small (no bigger than 10M), and it
> needs to be loaded in a certain strict order.
> Therefore, device config only needs to be saved/loaded in
> stop-and-copy phase.
> The data of device config is held in device config region.
> Size of device config data is smaller than or equal to that of
> device config region.
> 
> Device Memory: device's internal memory, standalone and outside system
> memory. It is usually very big.
> This kind of data needs to be saved / loaded in pre-copy and
> stop-and-copy phase.
> The data of device memory is held in device memory region.
> Size of device memory is usually larger than that of device
> memory region. qemu needs to save/load it in chunks of size of
> device memory region.
> Not all devices have device memory. For example, IGD only uses system memory.
> 
> System memory dirty pages: If a device produces dirty pages in system
> memory, it is able to get dirty bitmap for certain range of system
> memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> phase in .log_sync callback. By setting dirty bitmap in .log_sync
> callback, dirty pages in system memory will be save/loaded by ram's
> live migration code.
> The dirty bitmap of system memory is held in dirty bitmap region.
> If system memory range is larger than that dirty bitmap region can
> hold, qemu will cut it into several chunks and get dirty bitmap in
> succession.
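
A minimal sketch of that chunked bitmap query (all names here are assumed,
not taken from the patchset):

static void vfio_sync_dirty_bitmap(VFIOMigration *mig,
                                   uint64_t start_pfn, uint64_t npages)
{
    /* one pass of the dirty bitmap region describes this many pages */
    uint64_t chunk = mig->bitmap_region_size * 8;
    uint64_t pfn;

    for (pfn = start_pfn; pfn < start_pfn + npages; pfn += chunk) {
        uint64_t count = MIN(chunk, start_pfn + npages - pfn);

        /* write start/count to the control region, trigger
         * VFIO_DEVICE_DATA_ACTION_GET_BITMAP, then read the region */
        vfio_query_dirty_bitmap(mig, pfn, count);
        /* feed the returned bitmap into ram's dirty log */
        vfio_bitmap_to_dirty_log(mig, pfn, count);
    }
}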
> 
> 
> Device State Regions
> 
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
> Get access via read/write system call.
> Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.
> device config region: mandatory, holding data of device config
> device memory region: optional, holding data of device memory
> dirty bitmap region: optional, holding bitmap of system memory
> dirty pages
> 
> (The reason why four separate regions are defined is that the unit of the mmap
> system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seems better than one big region
> padded and sparse mmaped.)
> 
> 
> kernel device state interface [1]
> --
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> #define VFIO_DEVICE_STATE_RUNNING 0
> #define VFIO_DEVICE_STATE_STOP 1
> #define VFIO_DEVICE_STATE_LOGGING 2
> 
> #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> 
> struct vfio_device_state_ctl {
>   __u32 
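
As a rough illustration of the read/write protocol on the control region
(device_state is an assumed field of struct vfio_device_state_ctl, and the
helper is a sketch, not from the patchset):

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Sketch: set the device state by writing the assumed device_state field
 * of struct vfio_device_state_ctl at the control region's file offset. */
static int vfio_set_device_state(int device_fd, off_t ctl_offset,
                                 unsigned int state)
{
    off_t off = ctl_offset +
                offsetof(struct vfio_device_state_ctl, device_state);

    if (pwrite(device_fd, &state, sizeof(state), off) != sizeof(state)) {
        return -errno;
    }
    return 0;
}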

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Cornelia Huck
On Wed, 20 Feb 2019 11:28:46 +
"Gonglei (Arei)"  wrote:

> > -Original Message-
> > From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
> > Sent: Wednesday, February 20, 2019 7:02 PM
> > To: Zhao Yan 
> > Cc: c...@nvidia.com; k...@vger.kernel.org; a...@ozlabs.ru;
> > zhengxiao...@alibaba-inc.com; shuangtai@alibaba-inc.com;
> > qemu-devel@nongnu.org; kwankh...@nvidia.com; eau...@redhat.com;
> > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > mlevi...@redhat.com; pa...@linux.ibm.com; Gonglei (Arei)
> > ; fel...@nutanix.com; ken@amd.com;
> > kevin.t...@intel.com; alex.william...@redhat.com;
> > intel-gvt-...@lists.freedesktop.org; changpeng@intel.com;
> > coh...@redhat.com; zhi.a.w...@intel.com; jonathan.dav...@nutanix.com
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > * Zhao Yan (yan.y.z...@intel.com) wrote:  
> > > On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:  
> > > > * Yan Zhao (yan.y.z...@intel.com) wrote:  
> > > > > This patchset enables VFIO devices to have live migration capability.
> > > > > Currently it does not support post-copy phase.
> > > > >
> > > > > It follows Alex's comments on last version of VFIO live migration 
> > > > > patches,
> > > > > including device states, VFIO device state region layout, dirty 
> > > > > bitmap's
> > > > > query.  

> > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > version of device?  Or say to a device with older firmware or perhaps
> > > > a device that has less device memory ?  
> > > Actually it's still an open question for VFIO migration. We need to think
> > > about whether it's better to check that in libvirt or qemu (like a device
> > > magic along with a version?).
> 
> We must keep the hardware generation the same within one POD of public cloud
> providers. But we are still thinking about live migration from a lower
> generation of hardware to a higher generation.

Agreed, lower->higher is the one direction that might make sense to
support.

But regardless of that, I think we need to make sure that incompatible
devices/versions fail directly instead of failing in a subtle, hard to
debug way. Might be useful to do some initial sanity checks in libvirt
as well.

How easy is it to obtain that information in a form that can be
consumed by higher layers? Can we find out the device type at least?
What about some kind of revision?



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)


> -Original Message-
> From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
> Sent: Wednesday, February 20, 2019 7:02 PM
> To: Zhao Yan 
> Cc: c...@nvidia.com; k...@vger.kernel.org; a...@ozlabs.ru;
> zhengxiao...@alibaba-inc.com; shuangtai@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankh...@nvidia.com; eau...@redhat.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> mlevi...@redhat.com; pa...@linux.ibm.com; Gonglei (Arei)
> ; fel...@nutanix.com; ken@amd.com;
> kevin.t...@intel.com; alex.william...@redhat.com;
> intel-gvt-...@lists.freedesktop.org; changpeng@intel.com;
> coh...@redhat.com; zhi.a.w...@intel.com; jonathan.dav...@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> * Zhao Yan (yan.y.z...@intel.com) wrote:
> > On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > This patchset enables VFIO devices to have live migration capability.
> > > > Currently it does not support post-copy phase.
> > > >
> > > > It follows Alex's comments on last version of VFIO live migration 
> > > > patches,
> > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > query.
> > >
> > > Hi,
> > >   I've sent minor comments to later patches; but some minor general
> > > comments:
> > >
> > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > > so check when you can.
> > hi Dave
> > Thanks for this suggestion. I'll add more checks for migration streams.
> >
> >
> > >   b) How do we detect if we're migrating from/to the wrong device or
> > > version of device?  Or say to a device with older firmware or perhaps
> > > a device that has less device memory ?
> > Actually it's still an open question for VFIO migration. We need to think
> > about whether it's better to check that in libvirt or qemu (like a device
> > magic along with a version?).

We must keep the hardware generation the same within one POD of public cloud
providers. But we are still thinking about live migration from a lower
generation of hardware to a higher generation.

> > This patchset is intended to settle down the main device state interfaces
> > for VFIO migration, so that we can work on that and improve it.
> >
> >
> > >   c) Consider using the trace_ mechanism - it's really useful to
> > > add to loops writing/reading data so that you can see when it fails.
> > >
> > > Dave
> > >
> > Got it. many thanks~~
> >
> >
> > > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > > 'migrtion'.)
> >
> > sorry :)
> 
> No problem.
> 
> Given the mails, I'm guessing you've mostly tested this on graphics
> devices?  Have you also checked with VFIO network cards?
> 
> Also see the mail I sent in reply to Kirti's series; we need to boil
> these down to one solution.
> 
> Dave
> 
> > >
> > > > Device Data
> > > > ---
> > > > Device data is divided into three types: device memory, device config,
> > > > and system memory dirty pages produced by device.
> > > >
> > > > Device config: data like MMIOs, page tables...
> > > > Every device is supposed to possess device config data.
> > > > Usually device config's size is small (no bigger than 10M), and it
> > > > needs to be loaded in a certain strict order.
> > > > Therefore, device config only needs to be saved/loaded in
> > > > stop-and-copy phase.
> > > > The data of device config is held in device config region.
> > > > Size of device config data is smaller than or equal to that of
> > > > device config region.
> > > >
> > > > Device Memory: device's internal memory, standalone and outside
> system
> > > > memory. It is usually very big.
> > > > This kind of data needs to be saved / loaded in pre-copy and
> > > > stop-and-copy phase.
> > > > The data of device memory is held in device memory region.
> > > > Size of device memory is usually larger than that of device
> > > > memory region. qemu needs to save/load it in chunks of size of
> > > > device memory region.
> > > > Not all devices have device memory. For example, IGD only uses system
> > > > memory.
> > > >
> > > > System memory dirty pages: If a device produces dirty pages in system
> > > > memory, it is able to get dirty bitmap for certain range of
> system
> > > > memory. This dirty bitmap is queried in pre-copy and
> stop-and-copy
> > > > phase in .log_sync callback. By setting dirty bitmap in 
> > > > .log_sync
> > > > callback, dirty pages in system memory will be save/loaded by
> ram's
> > > > live migration code.
> > > > The dirty bitmap of system memory is held in dirty bitmap
> region.
> > > > If system memory range is larger than that dirty bitmap region
> can
> > > > hold, qemu will cut it into several chunks and get dirty bitmap 
> > > > in
> > > > 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Dr. David Alan Gilbert
* Zhao Yan (yan.y.z...@intel.com) wrote:
> On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > 
> > Hi,
> >   I've sent minor comments to later patches; but some minor general
> > comments:
> > 
> >   a) Never trust the incoming migrations stream - it might be corrupt,
> > so check when you can.
> hi Dave
> Thanks for this suggestion. I'll add more checks for migration streams.
> 
> 
> >   b) How do we detect if we're migrating from/to the wrong device or
> > version of device?  Or say to a device with older firmware or perhaps
> > a device that has less device memory ?
> Actually it's still an open question for VFIO migration. We need to think
> about whether it's better to check that in libvirt or qemu (like a device
> magic along with a version?).
> This patchset is intended to settle down the main device state interfaces
> for VFIO migration, so that we can work on that and improve it.
> 
> 
> >   c) Consider using the trace_ mechanism - it's really useful to
> > add to loops writing/reading data so that you can see when it fails.
> > 
> > Dave
> >
> Got it. many thanks~~
> 
> 
> > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > 'migrtion'.)
> 
> sorry :)

No problem.

Given the mails, I'm guessing you've mostly tested this on graphics
devices?  Have you also checked with VFIO network cards?

Also see the mail I sent in reply to Kirti's series; we need to boil
these down to one solution.

Dave

> > 
> > > Device Data
> > > ---
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > > Every device is supposed to possess device config data.
> > >   Usually device config's size is small (no bigger than 10M), and it
> > > needs to be loaded in a certain strict order.
> > > Therefore, device config only needs to be saved/loaded in
> > > stop-and-copy phase.
> > > The data of device config is held in device config region.
> > > Size of device config data is smaller than or equal to that of
> > > device config region.
> > > 
> > > Device Memory: device's internal memory, standalone and outside system
> > > memory. It is usually very big.
> > > This kind of data needs to be saved / loaded in pre-copy and
> > > stop-and-copy phase.
> > > The data of device memory is held in device memory region.
> > > Size of device memory is usually larger than that of device
> > > memory region. qemu needs to save/load it in chunks of size of
> > > device memory region.
> > > Not all devices have device memory. For example, IGD only uses system
> > > memory.
> > > 
> > > System memory dirty pages: If a device produces dirty pages in system
> > > memory, it is able to get dirty bitmap for certain range of system
> > > memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > > phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > > callback, dirty pages in system memory will be save/loaded by 
> > > ram's
> > > live migration code.
> > > The dirty bitmap of system memory is held in dirty bitmap region.
> > > If system memory range is larger than that dirty bitmap region can
> > > hold, qemu will cut it into several chunks and get dirty bitmap in
> > > succession.
> > > 
> > > 
> > > Device State Regions
> > > 
> > > Vendor driver is required to expose two mandatory regions and another two
> > > optional regions if it plans to support device state management.
> > > 
> > > So, there are up to four regions in total.
> > > One control region: mandatory.
> > > Get access via read/write system call.
> > > Its layout is defined in struct vfio_device_state_ctl
> > > Three data regions: mmaped into qemu.
> > > device config region: mandatory, holding data of device config
> > > device memory region: optional, holding data of device memory
> > > dirty bitmap region: optional, holding bitmap of system memory
> > > dirty pages
> > > 
> > > (The reason why four separate regions are defined is that the unit of the
> > > mmap system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > control and three mmaped regions for data seems better than one big region
> > > padded and sparse mmaped.)
> > > 
> > > 
> > > kernel device state interface [1]
> > > --
> > > #define 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-19 Thread Zhao Yan
On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> 
> Hi,
>   I've sent minor comments to later patches; but some minor general
> comments:
> 
>   a) Never trust the incoming migrations stream - it might be corrupt,
> so check when you can.
hi Dave
Thanks for this suggestion. I'll add more checks for migration streams.


>   b) How do we detect if we're migrating from/to the wrong device or
> version of device?  Or say to a device with older firmware or perhaps
> a device that has less device memory ?
Actually it's still an open question for VFIO migration. We need to think
about whether it's better to check that in libvirt or qemu (like a device
magic along with a version?).
This patchset is intended to settle down the main device state interfaces
for VFIO migration, so that we can work on that and improve it.


>   c) Consider using the trace_ mechanism - it's really useful to
> add to loops writing/reading data so that you can see when it fails.
> 
> Dave
>
Got it. many thanks~~


> (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> 'migrtion'.)

sorry :)
> 
> > Device Data
> > ---
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> > Every device is supposed to possess device config data.
> > Usually device config's size is small (no bigger than 10M), and it
> > needs to be loaded in a certain strict order.
> > Therefore, device config only needs to be saved/loaded in
> > stop-and-copy phase.
> > The data of device config is held in device config region.
> > Size of device config data is smaller than or equal to that of
> > device config region.
> > 
> > Device Memory: device's internal memory, standalone and outside system
> > memory. It is usually very big.
> > This kind of data needs to be saved / loaded in pre-copy and
> > stop-and-copy phase.
> > The data of device memory is held in device memory region.
> > Size of device memory is usually larger than that of device
> > memory region. qemu needs to save/load it in chunks of size of
> > device memory region.
> > Not all devices have device memory. For example, IGD only uses system memory.
> > 
> > System memory dirty pages: If a device produces dirty pages in system
> > memory, a dirty bitmap for a certain range of system memory can be
> > queried. This dirty bitmap is queried in the pre-copy and stop-and-copy
> > phases in the .log_sync callback. By setting the dirty bitmap in the
> > .log_sync callback, dirty pages in system memory will be saved/loaded
> > by ram's live migration code.
> > The dirty bitmap of system memory is held in the dirty bitmap region.
> > If the system memory range is larger than the dirty bitmap region can
> > cover, qemu will cut it into several chunks and get the dirty bitmap in
> > succession (see the sketch after this paragraph).
> > 
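The chunked bitmap query could look roughly like the sketch below; query()
is an assumed helper that writes start_addr/page_nr to the control region,
triggers GET_BITMAP, and leaves the result in the mmapped dirty bitmap
region:

#include <stdint.h>

#define PAGE_SHIFT 12   /* assuming 4k pages, as elsewhere in the proposal */

static int sync_dirty_bitmap(uint64_t start_addr, uint64_t page_nr,
                             uint64_t bitmap_region_bytes, const void *bitmap,
                             int (*query)(uint64_t addr, uint64_t pages),
                             void (*mark_dirty)(uint64_t addr, uint64_t pages,
                                                const void *bitmap))
{
    /* one bit per page, so one query covers at most this many pages */
    uint64_t max_pages = bitmap_region_bytes * 8;

    while (page_nr) {
        uint64_t n = page_nr < max_pages ? page_nr : max_pages;

        if (query(start_addr, n))
            return -1;
        mark_dirty(start_addr, n, bitmap);  /* feed qemu's ram migration */
        start_addr += n << PAGE_SHIFT;
        page_nr -= n;
    }
    return 0;
}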
> > 
> > Device State Regions
> > --------------------
> > The vendor driver is required to expose two mandatory regions, plus two
> > optional regions if it plans to support device state management.
> > 
> > So there are up to four regions in total.
> > One control region: mandatory.
> > Get access via read/write system call.
> > Its layout is defined in struct vfio_device_state_ctl
> > Three data regions: mmaped into qemu.
> > device config region: mandatory, holding data of device config
> > device memory region: optional, holding data of device memory
> > dirty bitmap region: optional, holding bitmap of system memory
> > dirty pages
> > 
> > (The reason why four separate regions are defined is that the unit of the
> > mmap system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > control and three mmapped regions for data seems better than one big,
> > padded and sparsely mmapped region.)
> > 
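As a user-space illustration of that layout choice (offsets and sizes would
really come from VFIO_DEVICE_GET_REGION_INFO; the struct below is a
stand-in): the page-aligned data regions can be mmapped wholesale, while the
small control region is touched field by field with pread/pwrite.

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct region_info { uint64_t offset; uint64_t size; }; /* stand-in type */

static void *map_data_region(int device_fd, const struct region_info *r)
{
    /* data regions are padded to page size, so a plain mmap is enough;
     * no sparse-mmap capability chasing is needed */
    void *p = mmap(NULL, r->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   device_fd, (off_t)r->offset);
    return p == MAP_FAILED ? NULL : p;
}

static int read_ctl_field(int device_fd, const struct region_info *ctl,
                          uint64_t field_off, void *val, size_t len)
{
    /* the control region stays read/write; each field access is a syscall */
    if (pread(device_fd, val, len, (off_t)(ctl->offset + field_off))
        != (ssize_t)len)
        return -1;
    return 0;
}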
> > 
> > kernel device state interface [1]
> > ---------------------------------
> > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > 
> > #define VFIO_DEVICE_STATE_RUNNING 0 
> > #define VFIO_DEVICE_STATE_STOP 1
> > #define VFIO_DEVICE_STATE_LOGGING 2
> > 
> > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > 
> > struct vfio_device_state_ctl {
> >         __u32 version;            /* ro */
> >         __u32 device_state;       /* VFIO device state, wo */
> >         __u32 caps;               /* ro */
> >         struct {
> >                 __u32 action;     /* wo, GET_BUFFER or SET_BUFFER */
> >                 __u64 size;       /* rw */
> >         } device_config;
> >         struct {
> >                 __u32 action;     /* wo, GET_BUFFER or SET_BUFFER */
> >                 __u64 size;       /* rw */
> >                 __u64 pos;        /* offset within the total device memory buffer */
> >         } device_memory;
> >         struct {
> >                 __u64 start_addr; /* wo */
> >                 __u64 page_nr;    /* wo */
> >         } system_memory;
> > };
> > 
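Tying this back to point (b) above: with the version and caps fields, a
destination could fail fast before any state is transferred. A hedged
sketch, with field offsets mirroring the struct layout (version at 0, caps
after the version and device_state words), treating the two CAP defines as
flag bits:

#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

static int check_migration_interface(int device_fd, off_t ctl_offset,
                                     int *has_dev_mem, int *has_sys_mem)
{
    uint32_t version, caps;

    if (pread(device_fd, &version, sizeof(version), ctl_offset)
        != sizeof(version))
        return -1;
    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
        /* fail directly instead of failing mid-migration */
        fprintf(stderr, "unsupported device state interface v%u\n", version);
        return -1;
    }
    if (pread(device_fd, &caps, sizeof(caps), ctl_offset + 8) != sizeof(caps))
        return -1;
    *has_dev_mem = !!(caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
    *has_sys_mem = !!(caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
    return 0;
}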

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-19 Thread Dr. David Alan Gilbert
* Yan Zhao (yan.y.z...@intel.com) wrote:
> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.

Hi,
  I've sent minor comments to later patches; but some minor general
comments:

  a) Never trust the incoming migration stream - it might be corrupt,
so check when you can.
  b) How do we detect if we're migrating from/to the wrong device or
version of device?  Or say to a device with older firmware or perhaps
a device that has less device memory ?
  c) Consider using the trace_ mechanism - it's really useful to
add to loops writing/reading data so that you can see when it fails.

Dave

(P.S. You have a few typos; grep your code for 'devcie', 'devie' and
'migrtion'.)

> Device Data
> -----------
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by the device.
> 
> Device config: data like MMIOs, page tables...
> Every device is supposed to possess device config data.
> Usually device config's size is small (no bigger than 10M), and it
> needs to be loaded in a certain strict order.
> Therefore, device config only needs to be saved/loaded in the
> stop-and-copy phase.
> The data of device config is held in the device config region.
> The size of the device config data is smaller than or equal to that of
> the device config region.
> 
> Device Memory: the device's internal memory, standalone and outside system
> memory. It is usually very big.
> This kind of data needs to be saved/loaded in both the pre-copy and
> stop-and-copy phases.
> The data of device memory is held in the device memory region.
> The size of device memory is usually larger than that of the device
> memory region, so qemu needs to save/load it in chunks the size of the
> device memory region (a load-side sketch follows this paragraph).
> Not all devices have device memory; IGD, for example, only uses system memory.
> 
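For the destination, the inverse path might look like this minimal sketch;
set_buffer() is an assumed helper that writes pos and size to the control
region's device_memory fields and then issues SET_BUFFER:

#include <stdint.h>
#include <string.h>

static int load_device_memory_chunk(void *mem_region, uint64_t region_size,
                                    const void *chunk, uint64_t chunk_size,
                                    uint64_t pos,
                                    int (*set_buffer)(uint64_t pos,
                                                      uint64_t size))
{
    if (chunk_size > region_size)
        return -1;              /* never overflow the data region */
    memcpy(mem_region, chunk, chunk_size);   /* stage chunk in the region */
    return set_buffer(pos, chunk_size);      /* hand it to the vendor driver */
}
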
> System memory dirty pages: If a device produces dirty pages in system
> memory, a dirty bitmap for a certain range of system memory can be
> queried. This dirty bitmap is queried in the pre-copy and stop-and-copy
> phases in the .log_sync callback. By setting the dirty bitmap in the
> .log_sync callback, dirty pages in system memory will be saved/loaded
> by ram's live migration code.
> The dirty bitmap of system memory is held in the dirty bitmap region.
> If the system memory range is larger than the dirty bitmap region can
> cover, qemu will cut it into several chunks and get the dirty bitmap in
> succession.
> 
> 
> Device State Regions
> --------------------
> The vendor driver is required to expose two mandatory regions, plus two
> optional regions if it plans to support device state management.
> 
> So there are up to four regions in total.
> One control region: mandatory.
> Accessed via the read/write system calls.
> Its layout is defined in struct vfio_device_state_ctl.
> Three data regions: mmapped into qemu.
> device config region: mandatory, holding data of device config
> device memory region: optional, holding data of device memory
> dirty bitmap region: optional, holding bitmap of system memory
> dirty pages
> 
> (The reason why four separate regions are defined is that the unit of the
> mmap system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmapped regions for data seems better than one big,
> padded and sparsely mmapped region.)
> 
> 
> kernel device state interface [1]
> ---------------------------------
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> #define VFIO_DEVICE_STATE_RUNNING 0 
> #define VFIO_DEVICE_STATE_STOP 1
> #define VFIO_DEVICE_STATE_LOGGING 2
> 
> #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> 
> struct vfio_device_state_ctl {
>         __u32 version;            /* ro */
>         __u32 device_state;       /* VFIO device state, wo */
>         __u32 caps;               /* ro */
>         struct {
>                 __u32 action;     /* wo, GET_BUFFER or SET_BUFFER */
>                 __u64 size;       /* rw */
>         } device_config;
>         struct {
>                 __u32 action;     /* wo, GET_BUFFER or SET_BUFFER */
>                 __u64 size;       /* rw */
>                 __u64 pos;        /* offset within the total device memory buffer */
>         } device_memory;
>         struct {
>                 __u64 start_addr; /* wo */
>                 __u64 page_nr;    /* wo */
>         } system_memory;
> };
> 
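Before the state walk-through below, one plausible reading of how qemu would
drive device_state across the phases, using the defines above (whether
LOGGING combines with RUNNING as a flag was still under discussion in this
thread, and the control region offset here is hypothetical):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP    1
#define VFIO_DEVICE_STATE_LOGGING 2

#define CTL_OFFSET           0  /* hypothetical control region file offset */
#define CTL_DEVICE_STATE_OFF 4  /* device_state follows the version word */

static int set_device_state(int device_fd, uint32_t state)
{
    return pwrite(device_fd, &state, sizeof(state),
                  CTL_OFFSET + CTL_DEVICE_STATE_OFF) == sizeof(state) ? 0 : -1;
}

static int precopy_start(int device_fd)
{
    /* device keeps running; vendor driver starts tracking its state */
    return set_device_state(device_fd, VFIO_DEVICE_STATE_LOGGING);
}

static int stop_and_copy_start(int device_fd)
{
    /* quiesce the device, then read out the remaining state */
    return set_device_state(device_fd, VFIO_DEVICE_STATE_STOP);
}

static int resume_on_destination(int device_fd)
{
    /* once all state is loaded, let the device run again */
    return set_device_state(device_fd, VFIO_DEVICE_STATE_RUNNING);
}
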
> Device States
> -------------
> After migration is initialized, it will set device