Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-05-10 Thread Juan Quintela
Yulei Zhang  wrote:
> Instead of using vm state description, add SaveVMHandlers for VFIO
> device to support live migration.
>
> Introduce new Ioctl VFIO_DEVICE_GET_DIRTY_BITMAP to fetch the memory
> bitmap that dirtied by vfio device during the iterative precopy stage
> to shorten the system downtime afterward.
>
> For vfio pci device status migrate, during the system downtime, it will
> save the following states
> 1. pci configuration space addr0~addr5
> 2. pci configuration space msi_addr msi_data
> 3. pci device status fetch from device driver
>
> And on the target side the vfio_load will restore the same states
> 1. re-setup the pci bar configuration
> 2. re-setup the pci device msi configuration
> 3. restore the pci device status

[...]

> +static uint64_t vfio_dirty_log_sync(VFIOPCIDevice *vdev)
> +{
> +RAMBlock *block;
> +struct vfio_device_get_dirty_bitmap *d;
> +uint64_t page = 0;
> +ram_addr_t size;
> +unsigned long nr, bitmap;
> +
> +RAMBLOCK_FOREACH(block) {

BTW, you are asking to sync all blocks of memory?  Does vfio need it?
Things like acpi tables, or similar weird blocks, look strange, no?

> +size = block->used_length;
> +nr = size >> TARGET_PAGE_BITS;
> +bitmap = (BITS_TO_LONGS(nr) + 1) * sizeof(unsigned long);
> +d = g_malloc0(sizeof(*d) +  bitmap);
> +d->start_addr = block->offset;
> +d->page_nr = nr;
> +if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_DIRTY_BITMAP, d)) {
> +error_report("vfio: Failed to get device dirty bitmap");
> +g_free(d);
> +goto exit;
> +}
> +
> +if (d->page_nr) {
> +cpu_physical_memory_set_dirty_lebitmap(
> + (unsigned long *)&d->dirty_bitmap,
> + d->start_addr, d->page_nr);
> +page += d->page_nr;

Are you sure this is the number of dirty pages?  It looks to me like
the total number of pages in the region, no?  (See the sketch after the
quoted function for one way to count only the set bits.)

> +}
> +g_free(d);
> +}
> +
> +exit:
> +return page;
> +}
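
A minimal sketch of counting only the set bits in that loop instead of
accumulating page_nr (assuming d->dirty_bitmap really is a little-endian
bitmap with one bit per page, and using ctpopl() from qemu/host-utils.h):

        if (d->page_nr) {
            unsigned long *bmap = (unsigned long *)&d->dirty_bitmap;
            unsigned long i;

            cpu_physical_memory_set_dirty_lebitmap(bmap, d->start_addr,
                                                   d->page_nr);
            /* count only the pages the device actually dirtied */
            for (i = 0; i < BITS_TO_LONGS(d->page_nr); i++) {
                page += ctpopl(bmap[i]);
            }
        }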

You walk the whole of RAM on each bitmap sync.  Could it be possible that the
driver knows _which_ memory regions it can have dirty pages in?


> +static void vfio_save_live_pending(QEMUFile *f, void *opaque, uint64_t 
> max_size,
> +   uint64_t *non_postcopiable_pending,
> +   uint64_t *postcopiable_pending)
> +{
> +VFIOPCIDevice *vdev = opaque;
> +uint64_t pending;
> +
> +qemu_mutex_lock_iothread();
> +rcu_read_lock();
> +pending = vfio_dirty_log_sync(vdev);
> +rcu_read_unlock();
> +qemu_mutex_unlock_iothread();
> +*non_postcopiable_pending += pending;
> +}

As said in other emails, this is not how iterative migration is meant to
work; see how we do it for ram (I have simplified it a bit):

static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
 uint64_t *res_precopy_only,
 uint64_t *res_compatible,
 uint64_t *res_postcopy_only)
{
RAMState **temp = opaque;
RAMState *rs = *temp;
uint64_t remaining_size;

remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;

if (remaining_size < max_size) {
qemu_mutex_lock_iothread();
rcu_read_lock();
migration_bitmap_sync(rs);
rcu_read_unlock();
qemu_mutex_unlock_iothread();
remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
}

*res_precopy_only += remaining_size;
}

I.e. we only do the sync stage once the number of dirty pages still on
the bitmap has dropped below max_size.
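
A rough sketch of the same pattern for vfio (migration_dirty_pages here is a
hypothetical counter the device code would have to keep, not an existing
VFIOPCIDevice field, and it assumes the sync returns dirty pages, not total
pages):

static void vfio_save_live_pending(QEMUFile *f, void *opaque, uint64_t max_size,
                                   uint64_t *non_postcopiable_pending,
                                   uint64_t *postcopiable_pending)
{
    VFIOPCIDevice *vdev = opaque;
    uint64_t remaining = vdev->migration_dirty_pages * TARGET_PAGE_SIZE;

    if (remaining < max_size) {
        /* only pay for a full bitmap sync when we look nearly done */
        qemu_mutex_lock_iothread();
        rcu_read_lock();
        vdev->migration_dirty_pages = vfio_dirty_log_sync(vdev);
        rcu_read_unlock();
        qemu_mutex_unlock_iothread();
        remaining = vdev->migration_dirty_pages * TARGET_PAGE_SIZE;
    }
    *non_postcopiable_pending += remaining;
}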

BTW, while we are here, and independently of this, perhaps we should
change the "sync" to take a function per ramblock instead of working on
the whole of ram.

> +static int vfio_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +VFIOPCIDevice *vdev = opaque;
> +PCIDevice *pdev = &vdev->pdev;
> +int sz = vdev->device_state.size - VFIO_DEVICE_STATE_OFFSET;
> +uint8_t *buf = NULL;
> +uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +bool msi_64bit;
> +
> +if (qemu_get_byte(f) == VFIO_SAVE_FLAG_SETUP) {
> +goto exit;
> +}

As said before, I would suggest that you add several fields:
- VERSION: so you can make incompatible changes
- FLAGS: for compatible changes
- SIZE of the region?  The rest of QEMU doesn't carry it, which I consider an
  error.  Having it makes it much easier to analyze the stream and search for errors.
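
For example, the save side could begin each device section with something
like this (just a sketch; the version constant and the vfio_put_state_header
helper name are made up for illustration):

#define VFIO_DEV_STATE_VERSION 1

static void vfio_put_state_header(QEMUFile *f, uint32_t flags,
                                  uint64_t dev_state_size)
{
    qemu_put_be32(f, VFIO_DEV_STATE_VERSION);  /* bump on incompatible changes */
    qemu_put_be32(f, flags);                   /* optional/compatible features */
    qemu_put_be64(f, dev_state_size);          /* size of the blob that follows */
}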

> +/* retore pci bar configuration */
> +ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +vfio_pci_write_config(pdev, PCI_COMMAND,
> +  ctl & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);

Please forgive my pci ignorance, but do we really want to restore the pci
configuration every time that we receive an iteration stage?

> +for (i = 0; i < PCI_ROM_SLOT; i++) {

Is this PCI_ROM_SLOT fixed by some spec or should we transfer it?

> +bar_cfg = 

Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-04-18 Thread Dr. David Alan Gilbert
* Zhang, Yulei (yulei.zh...@intel.com) wrote:
> 
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Tuesday, April 17, 2018 10:35 PM
> > To: Zhang, Yulei 
> > Cc: qemu-devel@nongnu.org; Tian, Kevin ;
> > joonas.lahti...@linux.intel.com; zhen...@linux.intel.com;
> > kwankh...@nvidia.com; Wang, Zhi A ;
> > dgilb...@redhat.com; quint...@redhat.com
> > Subject: Re: [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device
> > to support live migration
> > 
> > On Tue, 17 Apr 2018 08:11:12 +
> > "Zhang, Yulei"  wrote:
> > 
> > > > -Original Message-
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Tuesday, April 17, 2018 4:38 AM
> > > > To: Zhang, Yulei 
> > > > Cc: qemu-devel@nongnu.org; Tian, Kevin ;
> > > > joonas.lahti...@linux.intel.com; zhen...@linux.intel.com;
> > > > kwankh...@nvidia.com; Wang, Zhi A ;
> > > > dgilb...@redhat.com; quint...@redhat.com
> > > > Subject: Re: [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO
> > device
> > > > to support live migration
> > > >
> > > > On Tue, 10 Apr 2018 14:03:13 +0800
> > > > Yulei Zhang  wrote:
> > > >
> > > > > Instead of using vm state description, add SaveVMHandlers for VFIO
> > > > > device to support live migration.
> > > > >
> > > > > Introduce new Ioctl VFIO_DEVICE_GET_DIRTY_BITMAP to fetch the
> > > > memory
> > > > > bitmap that dirtied by vfio device during the iterative precopy stage
> > > > > to shorten the system downtime afterward.
> > > > >
> > > > > For vfio pci device status migrate, during the system downtime, it
> > > > > will save the following states 1. pci configuration space addr0~addr5
> > > > > 2. pci configuration space msi_addr msi_data 3. pci device status
> > > > > fetch from device driver
> > > > >
> > > > > And on the target side the vfio_load will restore the same states 1.
> > > > > re-setup the pci bar configuration 2. re-setup the pci device msi
> > > > > configuration 3. restore the pci device status
> > > >
> > > > Interrupts are configured via ioctl, but I don't see any code here to
> > configure
> > > > the device into the correct interrupt state.  How do we know the target
> > > > device is compatible with the source device?  Do we rely on the vendor
> > > > driver to implicitly include some kind of device and version information
> > and
> > > > fail at the very end of the migration?  It seems like we need to somehow
> > > > front-load that sort of device compatibility checking since a vfio-pci
> > device
> > > > can be anything (ex. what happens if a user tries to migrate a GVT-g
> > vGPU to
> > > > an NVIDIA vGPU?).  Thanks,
> > > >
> > > > Alex
> > >
> > > We emulate the write to the pci configure space in vfio_load() which will
> > help
> > > setup the interrupt state.
> > 
> > But you're only doing that for MSI, not MSI-X, we cannot simply say
> > that we don't have an MSI-X devices right now and add it later or we'll
> > end up with incompatible vmstate, we need to plan for how we'll support
> > it within the save state stream now.
> > 
> Agree with u. 
> 
> > > For compatibility I think currently the vendor driver would put version
> > number
> > > or device specific info in the device state region, so during the pre-copy
> > stage
> > > the target side will discover the difference and call off the migration.
> > 
> > Those sorts of things should be built into the device state region, we
> > shouldn't rely on vendor drivers to make these kinds of considerations,
> > we should build it into the API, which also allows QEMU to check state
> > compatibility before attempting a migration.  Thanks,
> > 
> > Alex
> 
> Not sure about how to check the compatibility before attempting a migration,
> to my understanding, qemu doesn't know the configuration on target side,
> target may call off the migration when it finds the input device state isn't 
> compatible,
> and source vm will resume.

There are a few parts to this:
   a) Tools, like libvirt etc, need to be able to detect that the two
configurations are actually compatible before they try - so there should
be some way of exposing enough detail about the host/drivers for
something somewhere to be able to say 'yes it should work' before even
starting the migration.

   b) The migration stream should contain enough information to be able
to *cleanly* detect an incompatibility so that the failure is obvious;
so make sure you have enough information in there (preferably in the
'setup' part if it really is iterative).  Then when loading check that
data and print a clear reason why it's incompatible.

   c) We've got to try and keep that compatibility across versions - so
migrations to a newer QEMU, or newer drivers (or depending on the
hardware) to newer hardware should be able to work if 

Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-04-18 Thread Zhang, Yulei


> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Tuesday, April 17, 2018 10:35 PM
> To: Zhang, Yulei 
> Cc: qemu-devel@nongnu.org; Tian, Kevin ;
> joonas.lahti...@linux.intel.com; zhen...@linux.intel.com;
> kwankh...@nvidia.com; Wang, Zhi A ;
> dgilb...@redhat.com; quint...@redhat.com
> Subject: Re: [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device
> to support live migration
> 
> On Tue, 17 Apr 2018 08:11:12 +
> "Zhang, Yulei"  wrote:
> 
> > > -Original Message-
> > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > Sent: Tuesday, April 17, 2018 4:38 AM
> > > To: Zhang, Yulei 
> > > Cc: qemu-devel@nongnu.org; Tian, Kevin ;
> > > joonas.lahti...@linux.intel.com; zhen...@linux.intel.com;
> > > kwankh...@nvidia.com; Wang, Zhi A ;
> > > dgilb...@redhat.com; quint...@redhat.com
> > > Subject: Re: [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO
> device
> > > to support live migration
> > >
> > > On Tue, 10 Apr 2018 14:03:13 +0800
> > > Yulei Zhang  wrote:
> > >
> > > > Instead of using vm state description, add SaveVMHandlers for VFIO
> > > > device to support live migration.
> > > >
> > > > Introduce new Ioctl VFIO_DEVICE_GET_DIRTY_BITMAP to fetch the
> > > memory
> > > > bitmap that dirtied by vfio device during the iterative precopy stage
> > > > to shorten the system downtime afterward.
> > > >
> > > > For vfio pci device status migrate, during the system downtime, it
> > > > will save the following states 1. pci configuration space addr0~addr5
> > > > 2. pci configuration space msi_addr msi_data 3. pci device status
> > > > fetch from device driver
> > > >
> > > > And on the target side the vfio_load will restore the same states 1.
> > > > re-setup the pci bar configuration 2. re-setup the pci device msi
> > > > configuration 3. restore the pci device status
> > >
> > > Interrupts are configured via ioctl, but I don't see any code here to
> configure
> > > the device into the correct interrupt state.  How do we know the target
> > > device is compatible with the source device?  Do we rely on the vendor
> > > driver to implicitly include some kind of device and version information
> and
> > > fail at the very end of the migration?  It seems like we need to somehow
> > > front-load that sort of device compatibility checking since a vfio-pci
> device
> > > can be anything (ex. what happens if a user tries to migrate a GVT-g
> vGPU to
> > > an NVIDIA vGPU?).  Thanks,
> > >
> > > Alex
> >
> > We emulate the write to the pci configure space in vfio_load() which will
> help
> > setup the interrupt state.
> 
> But you're only doing that for MSI, not MSI-X, we cannot simply say
> that we don't have an MSI-X devices right now and add it later or we'll
> end up with incompatible vmstate, we need to plan for how we'll support
> it within the save state stream now.
> 
Agree with you.

> > For compatibility I think currently the vendor driver would put version
> number
> > or device specific info in the device state region, so during the pre-copy
> stage
> > the target side will discover the difference and call off the migration.
> 
> Those sorts of things should be built into the device state region, we
> shouldn't rely on vendor drivers to make these kinds of considerations,
> we should build it into the API, which also allows QEMU to check state
> compatibility before attempting a migration.  Thanks,
> 
> Alex

Not sure about how to check the compatibility before attempting a migration.
To my understanding, qemu doesn't know the configuration on the target side;
the target may call off the migration when it finds the incoming device state
isn't compatible, and the source vm will resume.



Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-04-17 Thread Kirti Wankhede


On 4/17/2018 1:31 PM, Zhang, Yulei wrote:

>>> +static SaveVMHandlers savevm_vfio_handlers = {
>>> +.save_setup = vfio_save_setup,
>>> +.save_live_pending = vfio_save_live_pending,
>>> +.save_live_complete_precopy = vfio_save_complete,
>>> +.load_state = vfio_load,
>>> +};
>>> +
>>
>> Isn't .is_active, .save_live_iterate and .cleanup required?
>> What if the vendor driver has a large amount of data in the device's memory
>> which the vendor driver is aware of? And the vendor driver would require
>> multiple iterations to send that data to QEMU to save the complete state of
>> the device?
>>
> I suppose the vendor driver will copy the device's memory to the device
> region iteratively, and let qemu read from the device region and transfer
> the data to the target side in pre-copy stage, isn't it? 
> 


As Dave mentioned in another mail in this thread, not all data will be
copied in the pre-copy phase. Some static data would be copied in
pre-copy, but there could be a significant amount of data in the
stop-and-copy phase where iterations would be required. .is_active and
.save_live_iterate would be required for those iterations.

.cleanup is required to provide an indication to the vendor driver that
migration is complete, so the vendor driver can clean up all the extra
allocations done for migration.
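
Roughly (a sketch only; vfio_save_iterate, vfio_is_active and vfio_cleanup
do not exist in this patch and would have to be written, and the cleanup hook
is assumed to be the .save_cleanup callback):

static SaveVMHandlers savevm_vfio_handlers = {
    .save_setup = vfio_save_setup,
    .save_live_pending = vfio_save_live_pending,
    .save_live_iterate = vfio_save_iterate,           /* per-iteration device data */
    .save_live_complete_precopy = vfio_save_complete,
    .is_active = vfio_is_active,                      /* skip when nothing to migrate */
    .load_state = vfio_load,
    .save_cleanup = vfio_cleanup,                     /* free migration-only buffers */
};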

Thanks,
Kirti



Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-04-17 Thread Eric Blake
On 04/16/2018 09:44 AM, Kirti Wankhede wrote:
> 
> 
> On 4/10/2018 11:33 AM, Yulei Zhang wrote:
>> Instead of using vm state description, add SaveVMHandlers for VFIO
>> device to support live migration.

In the subject line: s/Hanlders/Handlers/

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-04-17 Thread Dr. David Alan Gilbert
* Yulei Zhang (yulei.zh...@intel.com) wrote:
> Instead of using vm state description, add SaveVMHandlers for VFIO
> device to support live migration.
> 
> Introduce new Ioctl VFIO_DEVICE_GET_DIRTY_BITMAP to fetch the memory
> bitmap that dirtied by vfio device during the iterative precopy stage
> to shorten the system downtime afterward.
> 
> For vfio pci device status migrate, during the system downtime, it will
> save the following states
> 1. pci configuration space addr0~addr5
> 2. pci configuration space msi_addr msi_data
> 3. pci device status fetch from device driver
> 
> And on the target side the vfio_load will restore the same states
> 1. re-setup the pci bar configuration
> 2. re-setup the pci device msi configuration
> 3. restore the pci device status
> 
> Signed-off-by: Yulei Zhang 
> ---
>  hw/vfio/pci.c  | 195 
> +++--
>  linux-headers/linux/vfio.h |  14 
>  2 files changed, 204 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 13d8c73..ac6a9c7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -33,9 +33,14 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
> +#include "migration/register.h"
> +#include "exec/ram_addr.h"
>  
>  #define MSIX_CAP_LENGTH 12
>  
> +#define VFIO_SAVE_FLAG_SETUP 0
> +#define VFIO_SAVE_FLAG_DEV_STATE 1
> +
>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>  static void vfio_vm_change_state_handler(void *pv, int running, RunState 
> state);
> @@ -2639,6 +2644,190 @@ static void 
> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>  vdev->req_enabled = false;
>  }
>  
> +static uint64_t vfio_dirty_log_sync(VFIOPCIDevice *vdev)
> +{
> +RAMBlock *block;
> +struct vfio_device_get_dirty_bitmap *d;
> +uint64_t page = 0;
> +ram_addr_t size;
> +unsigned long nr, bitmap;
> +
> +RAMBLOCK_FOREACH(block) {
> +size = block->used_length;
> +nr = size >> TARGET_PAGE_BITS;
> +bitmap = (BITS_TO_LONGS(nr) + 1) * sizeof(unsigned long);
> +d = g_malloc0(sizeof(*d) +  bitmap);
> +d->start_addr = block->offset;
> +d->page_nr = nr;
> +if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_DIRTY_BITMAP, d)) {
> +error_report("vfio: Failed to get device dirty bitmap");
> +g_free(d);
> +goto exit;
> +}
> +
> +if (d->page_nr) {
> +cpu_physical_memory_set_dirty_lebitmap(
> + (unsigned long *)&d->dirty_bitmap,
> + d->start_addr, d->page_nr);
> +page += d->page_nr;
> +}
> +g_free(d);
> +}
> +
> +exit:
> +return page;
> +}
> +
> +static void vfio_save_live_pending(QEMUFile *f, void *opaque, uint64_t 
> max_size,
> +   uint64_t *non_postcopiable_pending,
> +   uint64_t *postcopiable_pending)
> +{
> +VFIOPCIDevice *vdev = opaque;
> +uint64_t pending;
> +
> +qemu_mutex_lock_iothread();
> +rcu_read_lock();
> +pending = vfio_dirty_log_sync(vdev);
> +rcu_read_unlock();
> +qemu_mutex_unlock_iothread();
> +*non_postcopiable_pending += pending;
> +}
> +
> +static int vfio_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +VFIOPCIDevice *vdev = opaque;
> +PCIDevice *pdev = &vdev->pdev;
> +int sz = vdev->device_state.size - VFIO_DEVICE_STATE_OFFSET;
> +uint8_t *buf = NULL;
> +uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +bool msi_64bit;
> +
> +if (qemu_get_byte(f) == VFIO_SAVE_FLAG_SETUP) {
> +goto exit;
> +}

If you're building something complex like this, you might want to add
some version flags at the start and a canary at the end to detect
corruption.

Also note that the migration could fail at any point, so calling
qemu_file_get_error is good practice before acting on data you've read
with qemu_get_be*, since it could be bogus if the stream has already
failed.
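
For instance (sketch only; VFIO_SAVE_FLAG_END is an invented end-of-state
canary, not something the patch defines):

    bar_cfg = qemu_get_be32(f);
    if (qemu_file_get_error(f)) {
        error_report("vfio: migration stream error before BAR restore");
        return -EINVAL;
    }
    vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);

    /* ... and at the very end of vfio_load() ... */
    if (qemu_get_be32(f) != VFIO_SAVE_FLAG_END || qemu_file_get_error(f)) {
        error_report("vfio: device state stream truncated or corrupted");
        return -EINVAL;
    }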

> +/* retore pci bar configuration */
> +ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +vfio_pci_write_config(pdev, PCI_COMMAND,
> +  ctl & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> +for (i = 0; i < PCI_ROM_SLOT; i++) {
> +bar_cfg = qemu_get_be32(f);
> +vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> +}
> +vfio_pci_write_config(pdev, PCI_COMMAND,
> +  ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> +
> +/* restore msi configuration */
> +ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> +
> +vfio_pci_write_config(&vdev->pdev,
> +  pdev->msi_cap + PCI_MSI_FLAGS,
> +  ctl & (!PCI_MSI_FLAGS_ENABLE), 2);
> +
> +msi_lo = 

Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-04-17 Thread Alex Williamson
On Tue, 17 Apr 2018 08:11:12 +
"Zhang, Yulei"  wrote:

> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Tuesday, April 17, 2018 4:38 AM
> > To: Zhang, Yulei 
> > Cc: qemu-devel@nongnu.org; Tian, Kevin ;
> > joonas.lahti...@linux.intel.com; zhen...@linux.intel.com;
> > kwankh...@nvidia.com; Wang, Zhi A ;
> > dgilb...@redhat.com; quint...@redhat.com
> > Subject: Re: [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device
> > to support live migration
> > 
> > On Tue, 10 Apr 2018 14:03:13 +0800
> > Yulei Zhang  wrote:
> >   
> > > Instead of using vm state description, add SaveVMHandlers for VFIO
> > > device to support live migration.
> > >
> > > Introduce new Ioctl VFIO_DEVICE_GET_DIRTY_BITMAP to fetch the  
> > memory  
> > > bitmap that dirtied by vfio device during the iterative precopy stage
> > > to shorten the system downtime afterward.
> > >
> > > For vfio pci device status migrate, during the system downtime, it
> > > will save the following states 1. pci configuration space addr0~addr5
> > > 2. pci configuration space msi_addr msi_data 3. pci device status
> > > fetch from device driver
> > >
> > > And on the target side the vfio_load will restore the same states 1.
> > > re-setup the pci bar configuration 2. re-setup the pci device msi
> > > configuration 3. restore the pci device status  
> > 
> > Interrupts are configured via ioctl, but I don't see any code here to 
> > configure
> > the device into the correct interrupt state.  How do we know the target
> > device is compatible with the source device?  Do we rely on the vendor
> > driver to implicitly include some kind of device and version information and
> > fail at the very end of the migration?  It seems like we need to somehow
> > front-load that sort of device compatibility checking since a vfio-pci 
> > device
> > can be anything (ex. what happens if a user tries to migrate a GVT-g vGPU to
> > an NVIDIA vGPU?).  Thanks,
> > 
> > Alex  
> 
> We emulate the write to the pci configure space in vfio_load() which will help
> setup the interrupt state. 

But you're only doing that for MSI, not MSI-X. We cannot simply say that
we don't have any MSI-X devices right now and add it later, or we'll end
up with incompatible vmstate; we need to plan for how we'll support it
within the save state stream now.

> For compatibility I think currently the vendor driver would put version 
> number 
> or device specific info in the device state region, so during the pre-copy 
> stage
> the target side will discover the difference and call off the migration.

Those sorts of things should be built into the device state region; we
shouldn't rely on vendor drivers to make these kinds of considerations.
We should build it into the API, which also allows QEMU to check state
compatibility before attempting a migration.  Thanks,

Alex



Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-04-17 Thread Zhang, Yulei

> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Tuesday, April 17, 2018 4:38 AM
> To: Zhang, Yulei 
> Cc: qemu-devel@nongnu.org; Tian, Kevin ;
> joonas.lahti...@linux.intel.com; zhen...@linux.intel.com;
> kwankh...@nvidia.com; Wang, Zhi A ;
> dgilb...@redhat.com; quint...@redhat.com
> Subject: Re: [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device
> to support live migration
> 
> On Tue, 10 Apr 2018 14:03:13 +0800
> Yulei Zhang  wrote:
> 
> > Instead of using vm state description, add SaveVMHandlers for VFIO
> > device to support live migration.
> >
> > Introduce new Ioctl VFIO_DEVICE_GET_DIRTY_BITMAP to fetch the
> memory
> > bitmap that dirtied by vfio device during the iterative precopy stage
> > to shorten the system downtime afterward.
> >
> > For vfio pci device status migrate, during the system downtime, it
> > will save the following states 1. pci configuration space addr0~addr5
> > 2. pci configuration space msi_addr msi_data 3. pci device status
> > fetch from device driver
> >
> > And on the target side the vfio_load will restore the same states 1.
> > re-setup the pci bar configuration 2. re-setup the pci device msi
> > configuration 3. restore the pci device status
> 
> Interrupts are configured via ioctl, but I don't see any code here to 
> configure
> the device into the correct interrupt state.  How do we know the target
> device is compatible with the source device?  Do we rely on the vendor
> driver to implicitly include some kind of device and version information and
> fail at the very end of the migration?  It seems like we need to somehow
> front-load that sort of device compatibility checking since a vfio-pci device
> can be anything (ex. what happens if a user tries to migrate a GVT-g vGPU to
> an NVIDIA vGPU?).  Thanks,
> 
> Alex

We emulate the write to the pci configuration space in vfio_load(), which will
help set up the interrupt state.
For compatibility, I think currently the vendor driver would put a version number
or device-specific info in the device state region, so during the pre-copy stage
the target side will discover the difference and call off the migration.



Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-04-17 Thread Zhang, Yulei


> -Original Message-
> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> Sent: Monday, April 16, 2018 10:45 PM
> To: Zhang, Yulei ; qemu-devel@nongnu.org
> Cc: Tian, Kevin ; joonas.lahti...@linux.intel.com;
> zhen...@linux.intel.com; Wang, Zhi A ;
> alex.william...@redhat.com; dgilb...@redhat.com; quint...@redhat.com
> Subject: Re: [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device
> to support live migration
> 
> 
> 
> On 4/10/2018 11:33 AM, Yulei Zhang wrote:
> > Instead of using vm state description, add SaveVMHandlers for VFIO
> > device to support live migration.
> >
> > Introduce new Ioctl VFIO_DEVICE_GET_DIRTY_BITMAP to fetch the
> memory
> > bitmap that dirtied by vfio device during the iterative precopy stage
> > to shorten the system downtime afterward.
> >
> > For vfio pci device status migrate, during the system downtime, it
> > will save the following states 1. pci configuration space addr0~addr5
> > 2. pci configuration space msi_addr msi_data 3. pci device status
> > fetch from device driver
> >
> > And on the target side the vfio_load will restore the same states 1.
> > re-setup the pci bar configuration 2. re-setup the pci device msi
> > configuration 3. restore the pci device status
> >
> 
> Can #1 and #2 be setup by vendor driver? Vendor driver knows capabilities
> of device and vendor driver can setup rather than have the restore code in
> common place and then handle all the capability cases here.
> 
We need vfio to help set up the interrupt path.

> 
> > Signed-off-by: Yulei Zhang 
> > ---
> >  hw/vfio/pci.c  | 195
> +++--
> >  linux-headers/linux/vfio.h |  14 
> >  2 files changed, 204 insertions(+), 5 deletions(-)
> >
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 13d8c73..ac6a9c7
> > 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -33,9 +33,14 @@
> >  #include "trace.h"
> >  #include "qapi/error.h"
> >  #include "migration/blocker.h"
> > +#include "migration/register.h"
> > +#include "exec/ram_addr.h"
> >
> >  #define MSIX_CAP_LENGTH 12
> >
> > +#define VFIO_SAVE_FLAG_SETUP 0
> > +#define VFIO_SAVE_FLAG_DEV_STATE 1
> > +
> >  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);  static
> > void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);  static
> > void vfio_vm_change_state_handler(void *pv, int running, RunState
> > state); @@ -2639,6 +2644,190 @@ static void
> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
> >  vdev->req_enabled = false;
> >  }
> >
> > +static uint64_t vfio_dirty_log_sync(VFIOPCIDevice *vdev) {
> > +RAMBlock *block;
> > +struct vfio_device_get_dirty_bitmap *d;
> > +uint64_t page = 0;
> > +ram_addr_t size;
> > +unsigned long nr, bitmap;
> > +
> > +RAMBLOCK_FOREACH(block) {
> > +size = block->used_length;
> > +nr = size >> TARGET_PAGE_BITS;
> > +bitmap = (BITS_TO_LONGS(nr) + 1) * sizeof(unsigned long);
> > +d = g_malloc0(sizeof(*d) +  bitmap);
> > +d->start_addr = block->offset;
> > +d->page_nr = nr;
> > +if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_DIRTY_BITMAP, d)) {
> > +error_report("vfio: Failed to get device dirty bitmap");
> > +g_free(d);
> > +goto exit;
> > +}
> > +
> > +if (d->page_nr) {
> > +cpu_physical_memory_set_dirty_lebitmap(
> > + (unsigned long *)&d->dirty_bitmap,
> > + d->start_addr, d->page_nr);
> > +page += d->page_nr;
> > +}
> > +g_free(d);
> > +}
> > +
> > +exit:
> > +return page;
> > +}
> > +
> > +static void vfio_save_live_pending(QEMUFile *f, void *opaque, uint64_t
> max_size,
> > +   uint64_t *non_postcopiable_pending,
> > +   uint64_t *postcopiable_pending) {
> > +VFIOPCIDevice *vdev = opaque;
> > +uint64_t pending;
> > +
> > +qemu_mutex_lock_iothread();
> > +rcu_read_lock();
> > +pending = vfio_dirty_log_sync(vdev);
> > +rcu_read_unlock();
> > +qemu_mutex_unlock_iothread();
> > +*non_postcopiable_pending += pending; }
> > +
> 
> This doesn't provide a way to read device's state during pre-copy phase.
> device_state region can be used to read device specific data from vendor
> driver during pre-copy phase. That could be done by mmap device_state
> region during init and then ioctl vendor driver here to query size of data
> copied to device_state region and pending data size.
> 

Maybe the size could be placed in the device state region instead of adding an ioctl.
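
E.g. a small header at the start of the device_state region could carry that
(purely illustrative layout, not an existing ABI):

struct vfio_device_state_hdr {
    __u32 version;       /* vendor driver state format */
    __u32 flags;
    __u64 data_size;     /* bytes of valid state behind the header */
    __u64 pending_size;  /* estimate of data still to be produced */
};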
> 
> > +static int vfio_load(QEMUFile *f, void *opaque, int version_id) {
> > +VFIOPCIDevice *vdev = opaque;
> > +PCIDevice *pdev = &vdev->pdev;
> > +int sz = vdev->device_state.size - VFIO_DEVICE_STATE_OFFSET;
> > +uint8_t *buf = NULL;
> > +uint32_t ctl, 

Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-04-16 Thread Alex Williamson
On Tue, 10 Apr 2018 14:03:13 +0800
Yulei Zhang  wrote:

> Instead of using vm state description, add SaveVMHandlers for VFIO
> device to support live migration.
> 
> Introduce new Ioctl VFIO_DEVICE_GET_DIRTY_BITMAP to fetch the memory
> bitmap that dirtied by vfio device during the iterative precopy stage
> to shorten the system downtime afterward.
> 
> For vfio pci device status migrate, during the system downtime, it will
> save the following states
> 1. pci configuration space addr0~addr5
> 2. pci configuration space msi_addr msi_data
> 3. pci device status fetch from device driver
> 
> And on the target side the vfio_load will restore the same states
> 1. re-setup the pci bar configuration
> 2. re-setup the pci device msi configuration
> 3. restore the pci device status

Interrupts are configured via ioctl, but I don't see any code here to
configure the device into the correct interrupt state.  How do we know
the target device is compatible with the source device?  Do we rely on
the vendor driver to implicitly include some kind of device and version
information and fail at the very end of the migration?  It seems like
we need to somehow front-load that sort of device compatibility
checking since a vfio-pci device can be anything (ex. what happens if
a user tries to migrate a GVT-g vGPU to an NVIDIA vGPU?).  Thanks,

Alex



Re: [Qemu-devel] [RFC PATCH V4 3/4] vfio: Add SaveVMHanlders for VFIO device to support live migration

2018-04-16 Thread Kirti Wankhede


On 4/10/2018 11:33 AM, Yulei Zhang wrote:
> Instead of using vm state description, add SaveVMHandlers for VFIO
> device to support live migration.
> 
> Introduce new Ioctl VFIO_DEVICE_GET_DIRTY_BITMAP to fetch the memory
> bitmap that dirtied by vfio device during the iterative precopy stage
> to shorten the system downtime afterward.
> 
> For vfio pci device status migrate, during the system downtime, it will
> save the following states
> 1. pci configuration space addr0~addr5
> 2. pci configuration space msi_addr msi_data
> 3. pci device status fetch from device driver
> 
> And on the target side the vfio_load will restore the same states
> 1. re-setup the pci bar configuration
> 2. re-setup the pci device msi configuration
> 3. restore the pci device status
> 

Can #1 and #2 be set up by the vendor driver? The vendor driver knows the
capabilities of the device and can do the setup itself, rather than having the
restore code in a common place that then handles all the capability cases here.


> Signed-off-by: Yulei Zhang 
> ---
>  hw/vfio/pci.c  | 195 
> +++--
>  linux-headers/linux/vfio.h |  14 
>  2 files changed, 204 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 13d8c73..ac6a9c7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -33,9 +33,14 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
> +#include "migration/register.h"
> +#include "exec/ram_addr.h"
>  
>  #define MSIX_CAP_LENGTH 12
>  
> +#define VFIO_SAVE_FLAG_SETUP 0
> +#define VFIO_SAVE_FLAG_DEV_STATE 1
> +
>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>  static void vfio_vm_change_state_handler(void *pv, int running, RunState 
> state);
> @@ -2639,6 +2644,190 @@ static void 
> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>  vdev->req_enabled = false;
>  }
>  
> +static uint64_t vfio_dirty_log_sync(VFIOPCIDevice *vdev)
> +{
> +RAMBlock *block;
> +struct vfio_device_get_dirty_bitmap *d;
> +uint64_t page = 0;
> +ram_addr_t size;
> +unsigned long nr, bitmap;
> +
> +RAMBLOCK_FOREACH(block) {
> +size = block->used_length;
> +nr = size >> TARGET_PAGE_BITS;
> +bitmap = (BITS_TO_LONGS(nr) + 1) * sizeof(unsigned long);
> +d = g_malloc0(sizeof(*d) +  bitmap);
> +d->start_addr = block->offset;
> +d->page_nr = nr;
> +if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_DIRTY_BITMAP, d)) {
> +error_report("vfio: Failed to get device dirty bitmap");
> +g_free(d);
> +goto exit;
> +}
> +
> +if (d->page_nr) {
> +cpu_physical_memory_set_dirty_lebitmap(
> + (unsigned long *)&d->dirty_bitmap,
> + d->start_addr, d->page_nr);
> +page += d->page_nr;
> +}
> +g_free(d);
> +}
> +
> +exit:
> +return page;
> +}
> +
> +static void vfio_save_live_pending(QEMUFile *f, void *opaque, uint64_t 
> max_size,
> +   uint64_t *non_postcopiable_pending,
> +   uint64_t *postcopiable_pending)
> +{
> +VFIOPCIDevice *vdev = opaque;
> +uint64_t pending;
> +
> +qemu_mutex_lock_iothread();
> +rcu_read_lock();
> +pending = vfio_dirty_log_sync(vdev);
> +rcu_read_unlock();
> +qemu_mutex_unlock_iothread();
> +*non_postcopiable_pending += pending;
> +}
> +

This doesn't provide a way to read the device's state during the pre-copy phase.
The device_state region can be used to read device-specific data from the vendor
driver during pre-copy. That could be done by mmapping the device_state region
during init and then using an ioctl to the vendor driver here to query the size
of the data copied to the device_state region and the pending data size.
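
A rough sketch of what such an iterate callback could look like
(VFIO_DEVICE_GET_STATE_SIZE and the device_state.mmap pointer are assumptions
for illustration, not existing interfaces):

static int vfio_save_iterate(QEMUFile *f, void *opaque)
{
    VFIOPCIDevice *vdev = opaque;
    uint64_t data_size;

    /* hypothetical query: how much state the vendor driver has staged */
    if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_STATE_SIZE, &data_size)) {
        return -errno;
    }
    if (!data_size) {
        return 1;   /* >0: nothing more to send in this round */
    }

    qemu_put_byte(f, VFIO_SAVE_FLAG_DEV_STATE);
    qemu_put_be64(f, data_size);
    /* device_state region assumed mmapped once at init time */
    qemu_put_buffer(f, vdev->device_state.mmap, data_size);
    return 0;
}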


> +static int vfio_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +VFIOPCIDevice *vdev = opaque;
> +PCIDevice *pdev = &vdev->pdev;
> +int sz = vdev->device_state.size - VFIO_DEVICE_STATE_OFFSET;
> +uint8_t *buf = NULL;
> +uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +bool msi_64bit;
> +
> +if (qemu_get_byte(f) == VFIO_SAVE_FLAG_SETUP) {
> +goto exit;
> +}
> +
> +/* retore pci bar configuration */
> +ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +vfio_pci_write_config(pdev, PCI_COMMAND,
> +  ctl & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> +for (i = 0; i < PCI_ROM_SLOT; i++) {
> +bar_cfg = qemu_get_be32(f);
> +vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> +}
> +vfio_pci_write_config(pdev, PCI_COMMAND,
> +  ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> +
> +/* restore msi configuration */
> +ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +msi_64bit = !!(ctl &