RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit
Hi Jiquian,

> From: Chen, Jiqian 
> Sent: Wednesday, September 20, 2023 1:24 PM
> 
> Hi Lingshan,
> Please reply to your own email thread, below are not related to my patches.
> Thanks a lot.

They are related to your patch.
Both patches have overlapping functionality.

You probably missed my previous response at [1], which explains this.

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00255.html



RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 1:17 PM

> > This is not live or device migration. This is restoring the device context
> initiated by the driver owning the device.
> restore the device context should be done by the hypervisor before setting
> DRIVER_OK and waking up the guest, not a concern here, out of spec

The framework is generic for PCI devices, so there may not be any hypervisor at 
all. Hence the restore operation has to be triggered by the DRIVER_OK setting 
when the device was previously suspended.

Since we already agreed in the previous email that the driver re-reads until the 
device sets DRIVER_OK, this is fine with me. It covers the above restore condition.

Thanks.



RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 1:16 PM

[..]
> > In my view, setting the DRIVER_OK is the signal regardless of hypervisor or
> physical device.
> > Hence the re-read is must.
> Yes, as I said below, should verify by re-read.
> >
Thanks.


RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 1:00 PM
> 
> On 9/20/2023 3:24 PM, Chen, Jiqian wrote:
> > Hi Lingshan,
> > It seems you reply to the wrong email thread. They are not related to my
> patch.
> These reply to Parva's comments.
> @Parva, if you want to discuss more about live migration, please reply in my
> thread, lets don't flood here.
You made the point that "this is not live migration".
I am not discussing live migration in this thread.

I replied to the point that restoring a device from suspend, for a physical or 
hypervisor-based device, is not 40 nsec worth of work done by just a register write.
This is not live or device migration. This is restoring the device context, 
initiated by the driver owning the device.


RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit


> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 12:58 PM
> 
> On 9/20/2023 3:10 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan 
> >> Sent: Wednesday, September 20, 2023 12:37 PM
> >>> The problem to overcome in [1] is, resume operation needs to be
> >>> synchronous
> >> as it involves large part of context to resume back, and hence just
> >> asynchronously setting DRIVER_OK is not enough.
> >>> The sw must verify back that device has resumed the operation and
> >>> ready to
> >> answer requests.
> >> this is not live migration, all device status and other information
> >> still stay in the device, no need to "resume" context, just resume running.
> >>
> > I am aware that it is not live migration. :)
> >
> > "Just resuming" involves lot of device setup task. The device implementation
> does not know for how long a device is suspended.
> > So for example, a VM is suspended for 6 hours, hence the device context
> could be saved in a slow disk.
> > Hence, when the resume is done, it needs to setup things again and driver 
> > got
> to verify before accessing more from the device.
> The restore procedures should perform by the hypervisor and done before set
> DRIVER_OK and wake up the guest.
What is the signal to trigger the restore? What is the trigger in a physical 
device when there is no hypervisor?

In my view, setting DRIVER_OK is the signal, regardless of hypervisor or 
physical device.
Hence the re-read is a must.
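
A minimal sketch of this "set DRIVER_OK, then re-read until the device reports 
it" scheme; the struct and the read_status()/write_status() accessors are 
hypothetical stand-ins, not real kernel API:

    #include <linux/types.h>
    #include <linux/delay.h>
    #include <linux/virtio_config.h>   /* VIRTIO_CONFIG_S_DRIVER_OK */

    struct hypothetical_vdev;          /* stand-in for the transport's device struct */

    /* Hypothetical accessors for the device_status field. */
    extern u8 read_status(struct hypothetical_vdev *vdev);
    extern void write_status(struct hypothetical_vdev *vdev, u8 status);

    static void resume_and_verify(struct hypothetical_vdev *vdev)
    {
        /* Asynchronously request resume by setting DRIVER_OK... */
        write_status(vdev, read_status(vdev) | VIRTIO_CONFIG_S_DRIVER_OK);

        /* ...then re-read until the device reports it, since restoring a
         * large, previously suspended context may take milliseconds. */
        while (!(read_status(vdev) & VIRTIO_CONFIG_S_DRIVER_OK))
            usleep_range(100, 200);
    }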

> And the hypervisor/driver needs to check the device status by re-reading.
> >
> >> Like resume from a failed LM.
> >>> This is slightly different flow than setting the DRIVER_OK for the
> >>> first time
> >> device initialization sequence as it does not involve large restoration.
> >>> So, to merge two ideas, instead of doing DRIVER_OK to resume, the
> >>> driver
> >> should clear the SUSPEND bit and verify that it is out of SUSPEND.
> >>> Because driver is still in _OK_ driving the device flipping the SUSPEND 
> >>> bit.
> >> Please read the spec, it says:
> >> The driver MUST NOT clear a device status bit
> >>
> > Yes, this is why either DRIER_OK validation by the driver is needed or 
> > Jiqian's
> synchronous new register..
> so re-read
Yes, re-read until set. Thanks.



RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 12:37 PM

> > The problem to overcome in [1] is, resume operation needs to be synchronous
> as it involves large part of context to resume back, and hence just
> asynchronously setting DRIVER_OK is not enough.
> > The sw must verify back that device has resumed the operation and ready to
> answer requests.
> this is not live migration, all device status and other information still 
> stay in the
> device, no need to "resume" context, just resume running.
> 
I am aware that it is not live migration. :)

"Just resuming" involves a lot of device setup tasks. The device implementation 
does not know for how long a device was suspended.
For example, a VM may be suspended for 6 hours, and the device context could 
have been saved on a slow disk.
Hence, when the resume is done, the device needs to set things up again and the 
driver has to verify that before accessing the device further.
 
> Like resume from a failed LM.
> >
> > This is slightly different flow than setting the DRIVER_OK for the first 
> > time
> device initialization sequence as it does not involve large restoration.
> >
> > So, to merge two ideas, instead of doing DRIVER_OK to resume, the driver
> should clear the SUSPEND bit and verify that it is out of SUSPEND.
> >
> > Because driver is still in _OK_ driving the device flipping the SUSPEND bit.
> Please read the spec, it says:
> The driver MUST NOT clear a device status bit
> 
Yes, this is why either DRIVER_OK validation by the driver is needed or Jiqian's 
synchronous new register.



RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Wednesday, September 20, 2023 12:03 PM

> If driver write 0 to reset device, can the SUSPEND bit be cleared?
It must; the reset operation resets everything else, so it resets the suspend state too.

> (pci_pm_resume->virtio_pci_restore->virtio_device_restore-
> >virtio_reset_device)
> If SUSPEND is cleared, then during the reset process in Qemu, I can't judge if
> the reset request is from guest restore process or not, and then I can't 
> change
> the reset behavior.
Reset should not be influenced by suspend.
Suspend should do the work of suspend, and reset should do the reset.

The problem to overcome in [1] is that the resume operation needs to be 
synchronous, as it involves a large part of the context to resume; hence just 
asynchronously setting DRIVER_OK is not enough.
The sw must verify back that the device has resumed operation and is ready to 
answer requests.

This is a slightly different flow from setting DRIVER_OK during the first-time 
device initialization sequence, which does not involve a large restoration.

So, to merge the two ideas, instead of using DRIVER_OK to resume, the driver should 
clear the SUSPEND bit and verify that the device is out of SUSPEND.

Because the driver is still in _OK_, driving the device while flipping the SUSPEND bit.


RE: [virtio-dev] RE: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-19 Thread Parav Pandit


> From: Chen, Jiqian 
> Sent: Wednesday, September 20, 2023 9:28 AM

> >> For above purpose, we need a mechanism that allows guests and QEMU to
> >> negotiate their reset behavior. So this patch add a new parameter
> >> named
> > Freeze != reset. :)
> > Please fix it to say freeze or suspend.
> But in my virtio-gpu scene, I want to prevent Qemu destroying resources when
> Guest do resuming(pci_pm_resume-> virtio_pci_restore->
> virtio_device_restore-> virtio_reset_device-> vp_modern_set_status->Qemu
> virtio_pci_reset->virtio_gpu_gl_reset-> virtio_gpu_reset). And I add check in
> virtio_gpu_gl_reset and virtio_gpu_reset, if freeze_mode was set to FREEZE_S3
> during Guest suspending, Qemu will not destroy resources. So the reason why I
> add this mechanism is to affect the reset behavior. And I think this also can 
> help
> other virtio devices to affect their behavior, like the issue of virtio-video 
> which
> Mikhail Golubev-Ciuchea encountered.
>
The point is that when the driver tells the device to freeze, it is a freeze 
command and not a reset.
So resume() should not invoke device_reset() when FREEZE+RESUME is supported.
 
> >
> >> freeze_mode to struct virtio_pci_common_cfg. And when guest suspends,
> >> it can write freeze_mode to be FREEZE_S3, and then virtio devices can
> >> change their reset behavior on Qemu side according to freeze_mode.
> >> What's more,
> > Not reset, but suspend behavior.
> The same reason as above.
>
Reset should not be done by the guest driver when the device supports unfreeze.
 
> >
> >> freeze_mode can be used for all virtio devices to affect the behavior
> >> of Qemu, not just virtio gpu device.
> >>
> >> Signed-off-by: Jiqian Chen 
> >> ---
> >>  transport-pci.tex | 7 +++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/transport-pci.tex b/transport-pci.tex index
> >> a5c6719..2543536 100644
> >> --- a/transport-pci.tex
> >> +++ b/transport-pci.tex
> >> @@ -319,6 +319,7 @@ \subsubsection{Common configuration structure
> >> layout}\label{sec:Virtio Transport
> >>  le64 queue_desc;/* read-write */
> >>  le64 queue_driver;  /* read-write */
> >>  le64 queue_device;  /* read-write */
> >> +le16 freeze_mode;   /* read-write */
> >>  le16 queue_notif_config_data;   /* read-only for driver */
> >>  le16 queue_reset;   /* read-write */
> >>
> > The new field cannot be in the middle of the structure.
> > Otherwise, the location of the queue_notif_config_data depends on
> completely unrelated feature bit, breaking the backward compatibility.
> > So please move it at the end.
> I have confused about this. I found in latest kernel code(master branch):
> struct virtio_pci_common_cfg {
>   /* About the whole device. */
>   __le32 device_feature_select;   /* read-write */
>   __le32 device_feature;  /* read-only */
>   __le32 guest_feature_select;/* read-write */
>   __le32 guest_feature;   /* read-write */
>   __le16 msix_config; /* read-write */
>   __le16 num_queues;  /* read-only */
>   __u8 device_status; /* read-write */
>   __u8 config_generation; /* read-only */
> 
>   /* About a specific virtqueue. */
>   __le16 queue_select;/* read-write */
>   __le16 queue_size;  /* read-write, power of 2. */
>   __le16 queue_msix_vector;   /* read-write */
>   __le16 queue_enable;/* read-write */
>   __le16 queue_notify_off;/* read-only */
>   __le32 queue_desc_lo;   /* read-write */
>   __le32 queue_desc_hi;   /* read-write */
>   __le32 queue_avail_lo;  /* read-write */
>   __le32 queue_avail_hi;  /* read-write */
>   __le32 queue_used_lo;   /* read-write */
>   __le32 queue_used_hi;   /* read-write */
> 
>   __le16 freeze_mode; /* read-write */
> };
> There is no queue_notif_config_data or queue_reset, and freeze_mode I added
> is at the end. Why is it different from virtio-spec?
>
Because notify data may not be used by the Linux driver, its structure may be shorter.
I haven't dug into the code yet.
 
> >
> >> @@ -393,6 +394,12 @@ \subsubsection{Common configuration structure
> >> layout}\label{sec:Virtio Transport  \item[\field{queue_device}]
> >>  The driver writes the physical address of Device Area here.
> >> See section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues}.
> >>
> >> +\item[\field{freeze_mode}]
> >> +The driver writes this to set the freeze mode of virtio pci.
> >> +VIRTIO_PCI_FREEZE_MODE_UNFREEZE - virtio-pci is running;
> >> +VIRTIO_PCI_FREEZE_MODE_FREEZE_S3 - guest vm is doing S3, and
> >> +virtio-
> > For above names, please define the actual values in the spec.
> Ok, I will add them.
> 
> >
> >> pci enters S3 suspension;
> >> +Other values are reserved for future use, like S4, etc.

RE: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-19 Thread Parav Pandit
Hi Jiqian,

> From: Jiqian Chen 
> Sent: Tuesday, September 19, 2023 5:13 PM
> 
> When guest vm does S3, Qemu will reset and clear some things of virtio
> devices, but guest can't aware that, so that may cause some problems.
It is not true that the guest VM is not aware of it.
As you show in your kernel patch, it is the freeze/unfreeze in the guest VM PCI PM 
driver callbacks. So please update the commit log.

> For excample, Qemu calls virtio_reset->virtio_gpu_gl_reset when guest
s/excample/example

> resume, that function will destroy render resources of virtio-gpu. As a 
> result,
> after guest resume, the display can't come back and we only saw a black
> screen. Due to guest can't re-create all the resources, so we need to let Qemu
> not to destroy them when S3.
The above QEMU-specific details should go in the cover letter instead of the 
commit log, but I have no strong opinion.
Explaining the use case is good.

> 
> For above purpose, we need a mechanism that allows guests and QEMU to
> negotiate their reset behavior. So this patch add a new parameter named
Freeze != reset. :)
Please fix it to say freeze or suspend.

> freeze_mode to struct virtio_pci_common_cfg. And when guest suspends, it can
> write freeze_mode to be FREEZE_S3, and then virtio devices can change their
> reset behavior on Qemu side according to freeze_mode. What's more,
Not reset, but suspend behavior.

> freeze_mode can be used for all virtio devices to affect the behavior of Qemu,
> not just virtio gpu device.
> 
> Signed-off-by: Jiqian Chen 
> ---
>  transport-pci.tex | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/transport-pci.tex b/transport-pci.tex index a5c6719..2543536 
> 100644
> --- a/transport-pci.tex
> +++ b/transport-pci.tex
> @@ -319,6 +319,7 @@ \subsubsection{Common configuration structure
> layout}\label{sec:Virtio Transport
>  le64 queue_desc;/* read-write */
>  le64 queue_driver;  /* read-write */
>  le64 queue_device;  /* read-write */
> +le16 freeze_mode;   /* read-write */
>  le16 queue_notif_config_data;   /* read-only for driver */
>  le16 queue_reset;   /* read-write */
> 
The new field cannot be in the middle of the structure.
Otherwise, the location of queue_notif_config_data depends on a completely 
unrelated feature bit, breaking backward compatibility.
So please move it to the end, as sketched below.
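
For clarity, a sketch of the tail of the layout being asked for, with the new 
field appended after the existing ones (field names and annotations as in the 
current spec; only the last field is from this patch):

    struct virtio_pci_common_cfg {
            /* ... existing fields unchanged ... */
            le16 queue_notif_config_data;   /* read-only for driver */
            le16 queue_reset;               /* read-write */
            le16 freeze_mode;               /* read-write */
    };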

> @@ -393,6 +394,12 @@ \subsubsection{Common configuration structure
> layout}\label{sec:Virtio Transport  \item[\field{queue_device}]
>  The driver writes the physical address of Device Area here.  See 
> section
> \ref{sec:Basic Facilities of a Virtio Device / Virtqueues}.
> 
> +\item[\field{freeze_mode}]
> +The driver writes this to set the freeze mode of virtio pci.
> +VIRTIO_PCI_FREEZE_MODE_UNFREEZE - virtio-pci is running;
> +VIRTIO_PCI_FREEZE_MODE_FREEZE_S3 - guest vm is doing S3, and virtio-
For above names, please define the actual values in the spec.

> pci enters S3 suspension;
> +Other values are reserved for future use, like S4, etc.
> +
It cannot be just one-way communication from driver to device, as freezing a 
device with a few hundred MB to GBs of GPU memory or other device memory can take 
several msec.
Hence the driver must poll to get an acknowledgement from the device that the 
freeze functionality has completed.

Please refer to the queue_reset register definition for achieving such a scheme and 
reframe the wording accordingly.
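
A sketch of such a queue_reset-like handshake for freeze_mode; the value 
encodings and the accessor helpers below are assumptions for illustration, not 
part of the patch or the spec:

    #include <linux/types.h>
    #include <linux/delay.h>

    #define VIRTIO_PCI_FREEZE_MODE_UNFREEZE   0  /* assumed encoding, not final */
    #define VIRTIO_PCI_FREEZE_MODE_FREEZE_S3  1  /* assumed encoding, not final */

    /* Hypothetical MMIO accessors for the le16 freeze_mode field. */
    extern u16 read_freeze_mode(void __iomem *reg);
    extern void write_freeze_mode(void __iomem *reg, u16 val);

    /* Driver writes the mode, then polls until the device reflects it back,
     * acknowledging that the (possibly long) freeze has completed. */
    static void virtio_pci_freeze_s3(void __iomem *freeze_mode_reg)
    {
        write_freeze_mode(freeze_mode_reg, VIRTIO_PCI_FREEZE_MODE_FREEZE_S3);
        while (read_freeze_mode(freeze_mode_reg) != VIRTIO_PCI_FREEZE_MODE_FREEZE_S3)
            usleep_range(100, 200);  /* freezing GBs of device memory takes time */
    }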

Also kindly add the device and driver normatives describing how/when this register is 
accessed.

Also please fix the description to not talk about the guest VM. That term largely 
belongs only in theory-of-operation text.

You need to describe what exactly should happen in the device when it is frozen.
Please refer to my series that adds infrastructure for device migration, where the 
FREEZE mode behavior is defined.
It is similar to what you define, but there it is a management-plane operation 
controlled outside of the guest VM.
Still, it is a good direction in terms of what to define in spec language.
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#u

You are missing the feature bit to indicate to the driver that the device supports 
this functionality.
Please add one.

>  \item[\field{queue_notif_config_data}]
>  This field exists only if VIRTIO_F_NOTIF_CONFIG_DATA has been
> negotiated.
>  The driver will use this value when driver sends available buffer
> --
> 2.34.1




RE: [RFC PATCH for 8.0 10/13] virtio-net: Migrate vhost inflight descriptors

2023-01-16 Thread Parav Pandit

> From: Jason Wang 
> Sent: Wednesday, January 11, 2023 12:51 AM
> 
> On Wed, Jan 11, 2023 at 12:40 PM Parav Pandit  wrote:
> >
> >
> > > From: Jason Wang 
> > > Sent: Tuesday, January 10, 2023 11:35 PM
> > >
> > > On Tue, Jan 10, 2023 at 11:02 AM Parav Pandit  wrote:
> > > >
> > > > Hi Jason,
> > > >
> > > > > From: Jason Wang 
> > > > > Sent: Monday, December 5, 2022 10:25 PM
> > > >
> > > > >
> > > > > A dumb question, any reason we need bother with virtio-net? It
> > > > > looks to me it's not a must and would complicate migration
> compatibility.
> > > >
> > > > Virtio net vdpa device is processing the descriptors out of order.
> > > > This vdpa device doesn’t offer IN_ORDER flag.
> > > >
> > > > And when a VQ is suspended it cannot complete these descriptors as
> > > > some
> > > dummy zero length completions.
> > > > The guest VM is flooded with [1].
> > >
> > > Yes, but any reason for the device to do out-of-order for RX?
> > >
> > For some devices it is more optimal to process them out of order.
> > And its not limited to RX.
> 
> TX should be fine, since the device can anyhow pretend to send all packets, so
> we won't have any in-flight descriptors.
> 
> >
> > > >
> > > > So it is needed for the devices that doesn’t offer IN_ORDER feature.
> > > >
> > > > [1]
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > > > /tre
> > > > e/drivers/net/virtio_net.c?h=v6.2-rc3#n1252
> > >
> > > It is only enabled in a debug kernel which should be harmless?
> > it is KERN_DEBUG log level. Its is not debug kernel, just the debug log 
> > level.
> 
> Ok, but the production environment should not use that level anyhow.
> 
> > And regardless, generating zero length packets for debug kernel is even more
> confusing.
> 
> Note that it is allowed in the virtio-spec[1] (we probably can fix that in the
> driver) and we have pr_debug() all over this drivers and other places. It 
> doesn't
> cause any side effects except for the debugging purpose.
> 
> So I think having inflight tracking is useful, but I'm not sure it's worth 
> bothering
> with virtio-net (or worth to bothering now):
> 
> - zero length is allowed
This isn't explicitly called out. It may be worth adding to the spec.

> - it only helps for debugging
> - may cause issues for migration compatibility
We have had this missing for a long time regardless of this feature. So let's not mix 
it in here.

The mlx5 vdpa device can do eventual in-order completion before/at the time of 
suspend, so I think we can wait for now until more advanced hw arrives.

> - requires new infrastructure to be invented
> 
> Thanks
> 
> [1] spec said
> 
This doesn't say that it is a zero-length completion. len is a mandatory field that 
tells how many bytes the device wrote.
> "
> Note: len is particularly useful for drivers using untrusted buffers:
> if a driver does not know exactly how much has been written by the device, the
> driver would have to zero the buffer in advance to ensure no data leakage
> occurs.
> "



RE: [RFC PATCH for 8.0 10/13] virtio-net: Migrate vhost inflight descriptors

2023-01-10 Thread Parav Pandit

> From: Jason Wang 
> Sent: Tuesday, January 10, 2023 11:35 PM
> 
> On Tue, Jan 10, 2023 at 11:02 AM Parav Pandit  wrote:
> >
> > Hi Jason,
> >
> > > From: Jason Wang 
> > > Sent: Monday, December 5, 2022 10:25 PM
> >
> > >
> > > A dumb question, any reason we need bother with virtio-net? It looks
> > > to me it's not a must and would complicate migration compatibility.
> >
> > Virtio net vdpa device is processing the descriptors out of order.
> > This vdpa device doesn’t offer IN_ORDER flag.
> >
> > And when a VQ is suspended it cannot complete these descriptors as some
> dummy zero length completions.
> > The guest VM is flooded with [1].
> 
> Yes, but any reason for the device to do out-of-order for RX?
>
For some devices it is more optimal to process them out of order.
And it is not limited to RX.
 
> >
> > So it is needed for the devices that doesn’t offer IN_ORDER feature.
> >
> > [1]
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> > e/drivers/net/virtio_net.c?h=v6.2-rc3#n1252
> 
> It is only enabled in a debug kernel which should be harmless?
It is the KERN_DEBUG log level. It is not a debug kernel, just the debug log level.
And regardless, generating zero-length packets for a debug kernel is even more 
confusing.


RE: [RFC PATCH for 8.0 10/13] virtio-net: Migrate vhost inflight descriptors

2023-01-09 Thread Parav Pandit
Hi Jason,

> From: Jason Wang 
> Sent: Monday, December 5, 2022 10:25 PM

> 
> A dumb question, any reason we need bother with virtio-net? It looks to me 
> it's
> not a must and would complicate migration compatibility.

The virtio-net vdpa device processes the descriptors out of order.
This vdpa device doesn't offer the IN_ORDER flag.

And when a VQ is suspended, it cannot complete these descriptors as dummy 
zero-length completions.
The guest VM is flooded with [1].

So it is needed for devices that don't offer the IN_ORDER feature.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/virtio_net.c?h=v6.2-rc3#n1252



RE: [RFC PATCH for 8.0 10/13] virtio-net: Migrate vhost inflight descriptors

2022-12-05 Thread Parav Pandit

> From: Eugenio Pérez 
> Sent: Monday, December 5, 2022 12:05 PM
> 
> There is currently no data to be migrated, since nothing populates or read
> the fields on virtio-net.
> 
> The migration of in-flight descriptors is modelled after the migration of
> requests in virtio-blk. With some differences:
> * virtio-blk migrates queue number on each request. Here we only add a
>   vq if it has descriptors to migrate, and then we make all descriptors
>   in an array.
> * Use of QTAILQ since it works similar to signal the end of the inflight
>   descriptors: 1 for more data, 0 if end. But do it for each vq instead
>   of for each descriptor.
> * Usage of VMState macros.
> 
> The fields of descriptors would be way more complicated if we use the
> VirtQueueElements directly, since there would be a few levels of
> indirections. Using VirtQueueElementOld for the moment, and migrate to
> VirtQueueElement for the final patch.
> 
> TODO: Proper migration versioning
> TODO: Do not embed vhost-vdpa structs
> TODO: Migrate the VirtQueueElement, not VirtQueueElementOld.
> 
> Signed-off-by: Eugenio Pérez 
> ---
>  include/hw/virtio/virtio-net.h |   2 +
>  include/migration/vmstate.h|  11 +++
>  hw/net/virtio-net.c| 129 +
>  3 files changed, 142 insertions(+)
> 
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index ef234ffe7e..ae7c017ef0 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -151,9 +151,11 @@ typedef struct VirtIONetQueue {
>  QEMUTimer *tx_timer;
>  QEMUBH *tx_bh;
>  uint32_t tx_waiting;
> +uint32_t tx_inflight_num, rx_inflight_num;
>  struct {
>  VirtQueueElement *elem;
>  } async_tx;
> +VirtQueueElement **tx_inflight, **rx_inflight;
>  struct VirtIONet *n;
>  } VirtIONetQueue;
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 9726d2d09e..9e0dfef9ee 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -626,6 +626,17 @@ extern const VMStateInfo vmstate_info_qlist;
>  .offset = vmstate_offset_varray(_state, _field, _type),  \
>  }
> 
> +#define VMSTATE_STRUCT_VARRAY_ALLOC_UINT16(_field, _state, _field_num,    \
> +   _version, _vmsd, _type) {                                              \
> +.name   = (stringify(_field)),                                            \
> +.version_id = (_version),                                                 \
> +.vmsd   = &(_vmsd),                                                       \
> +.num_offset = vmstate_offset_value(_state, _field_num, uint16_t),         \
> +.size   = sizeof(_type),                                                  \
> +.flags  = VMS_STRUCT | VMS_VARRAY_UINT16 | VMS_ALLOC | VMS_POINTER,       \
> +.offset = vmstate_offset_pointer(_state, _field, _type),                  \
> +}
> +
>  #define VMSTATE_STRUCT_VARRAY_ALLOC(_field, _state, _field_num, _version, _vmsd, _type) {\
>  .name   = (stringify(_field)),   \
>  .version_id = (_version),\
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index
> aba12759d5..ffd7bf1fc7 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -3077,6 +3077,13 @@ static bool mac_table_doesnt_fit(void *opaque,
> int version_id)
>  return !mac_table_fits(opaque, version_id);  }
> 
> +typedef struct VirtIONetInflightQueue {
> +uint16_t idx;
> +uint16_t num;
> +QTAILQ_ENTRY(VirtIONetInflightQueue) entry;
> +VirtQueueElementOld *elems;
> +} VirtIONetInflightQueue;
> +
>  /* This temporary type is shared by all the WITH_TMP methods
>   * although only some fields are used by each.
>   */
> @@ -3086,6 +3093,7 @@ struct VirtIONetMigTmp {
>  uint16_tcurr_queue_pairs_1;
>  uint8_t has_ufo;
>  uint32_thas_vnet_hdr;
> +QTAILQ_HEAD(, VirtIONetInflightQueue) queues_inflight;
>  };
> 
>  /* The 2nd and subsequent tx_waiting flags are loaded later than @@ -
> 3231,6 +3239,124 @@ static const VMStateDescription
> vmstate_virtio_net_rss = {
>  },
>  };
> 
> +static const VMStateDescription vmstate_virtio_net_inflight_queue = {
> +.name  = "virtio-net-device/inflight/queue",
> +.fields = (VMStateField[]) {
> +VMSTATE_UINT16(idx, VirtIONetInflightQueue),
> +VMSTATE_UINT16(num, VirtIONetInflightQueue),
> +
> +VMSTATE_STRUCT_VARRAY_ALLOC_UINT16(elems, VirtIONetInflightQueue, num,
> +   0, vmstate_virtqueue_element_old,
> +   VirtQueueElementOld),
> +VMSTATE_END_OF_LIST()
> +},
> +};
> +
> +static int virtio_net_inflight_init(void *opaque) {
> +struct VirtIONetMigTmp *tmp = opaque;
> +
> +QTAILQ_INIT(>queues_inflight);
> +return 0;
> +}
> +
> +static 

RE: About restoring the state in vhost-vdpa device

2022-05-18 Thread Parav Pandit

> From: Eugenio Perez Martin 
> Sent: Tuesday, May 17, 2022 4:12 AM
 
> > 2. Each VQ enablement one at a time, requires constant steering update
> > for the VQ While this information is something already known. Trying to
> reuse brings a callback result in this in-efficiency.
> > So better to start with more reusable APIs that fits the LM flow.
> 
> We can change to that model later. Since the model proposed by us does not
> add any burden, we can discard it down the road if something better arises.
> The proposed behavior should already work for all
> devices: It comes for free regarding kernel / vdpa code.
It is not for free.
It comes with higher LM downtime.
And that makes it unusable as the queues scale.

> 
> I think that doing at vhost/vDPA level is going to cause the same problem as
> VRING_SET_BASE: We will need to maintain two ways of performing the
> same, and the code will need to synchronize them. I'm not *against* adding
> it by itself, I'm just considering it an optimization that needs to be 
> balanced
> against what already enables the device to perform state restoring.

We only need to change the sequencing of how we restore, and abstract out how to 
restore in the vdpa layer.
CVQ or something else is a choice internal to the vdpa vendor driver.


RE: About restoring the state in vhost-vdpa device

2022-05-18 Thread Parav Pandit

> From: Jason Wang 
> Sent: Monday, May 16, 2022 11:05 PM
> >> Although it's a longer route, I'd very much prefer an in-band virtio
> >> way to perform it rather than a linux/vdpa specific. It's one of the
> >> reasons I prefer the CVQ behavior over a vdpa specific ioctl.
> >>
> > What is the in-band method to set last_avail_idx?
> > In-band virtio method doesn't exist.
> 
> 
> Right, but it's part of the vhost API which was there for more than 10 years.
> This should be supported by all the vDPA vendors.
Sure. My point to Eugenio was that vdpa doesn't have to be limited by the virtio spec.
Plumbing exists to make vdpa work without the virtio spec.
Hence, an additional ioctl can be OK.

> >> layers of the stack need to maintain more state.
> > Mostly not. A complete virtio device state arrived from source vdpa device
> can be given to destination vdpa device without anyone else looking in the
> middle. If this format is known/well defined.
> 
> 
> That's fine, and it seems the virtio spec is a better place for this,
> then we won't duplicate efforts?
> 
Yes. For the vDPA kernel, setting parameters doesn't need a virtio spec update.
It is similar to setting the avail index.

> 
> >
> >>  From the guest point of view, to enable all the queues with
> >> VHOST_VDPA_SET_VRING_ENABLE and don't send DRIVER_OK is the
> same
> >> as send DRIVER_OK and not to enable any data queue with
> >> VHOST_VDPA_SET_VRING_ENABLE.
> > Enabling SET_VRING_ENABLE after DRIVER_OK has two basic things
> broken.
> 
> 
> It looks to me the spec:
> 
> 1) For PCI it doesn't forbid the driver to set queue_enable to 1 after
> DRIVER_OK.
The device init sequence hints that VQ setup should be done before DRIVER_OK, in 
the snippet below.

"Perform device-specific setup, including discovery of virtqueues for the 
device, optional per-bus setup,
reading and possibly writing the device’s virtio configuration space, and 
population of virtqueues."

Even if we assume for a moment that a queue can be enabled after DRIVER_OK, traffic 
ends up going to the incorrect queue,
because the queue it is supposed to go to is not enabled and its RSS is not set up.

So in the restore flow it is desirable to set the needed config before doing DRIVER_OK.

> 2) For MMIO, it even allows the driver to disable a queue after DRIVER_OK
> 
> 
> > 1. supplied RSS config and VQ config is not honored for several tens of
> hundreds of milliseconds
> > It will be purely dependent on how/when this ioctl are made.
> > Due to this behavior packet supposed to arrive in X VQ, arrives in Y VQ.
> 
> 
> I don't get why we end up with this situation.
> 
> 1) enable cvq
> 2) set driver_ok
> 3) set RSS
> 4) enable TX/RX
> 
> vs
> 
> 1) set RSS
> 2) enable cvq
> 3) enable TX/RX
> 4) set driver_ok
> 
> Is the latter faster?
> 
Yes, because the latter sequence can set up the steering config once.
In contrast, the first sequence needs to incrementally update the RSS setting on 
every new queue addition in step #4.

> 
> >
> > 2. Each VQ enablement one at a time, requires constant steering update
> for the VQ
> > While this information is something already known. Trying to reuse brings a
> callback result in this in-efficiency.
> > So better to start with more reusable APIs that fits the LM flow.
> 
> 
> I agree, but the method proposed in the mail seems to be the only way
> that can work with the all the major vDPA vendors.
> 
> E.g the new API requires the device has the ability to receive device
> state other than the control virtqueue which might not be supported the
> hardware. (The device might expects a trap and emulate model rather than
> save and restore).
> 
How a given vendor returns the values is up to the vendor-specific vdpa driver, 
just like avail_index, which does not come through the CVQ.

>  From qemu point of view, it might need to support both models.
> 
> If the device can't do save and restore:
> 
> 1.1) enable cvq
> 1.2) set driver_ok
> 1.3) set device state (MQ, RSS) via control vq
> 1.4) enable TX/RX
> 
> If the device can do save and restore:
> 
> 2.1) set device state (new API for setting MQ,RSS)
> 2.2) enable cvq
> 2.3) enable TX?RX
> 2.4) set driver_ok
> 
> We can start from 1 since it works for all device and then adding
> support for 2?
> 

How about:
3.1) create cvq for the supported device
The CVQ is not exposed to user space; it stays in the kernel. The vdpa driver creates it.

3.2) set device state (MQ, RSS) comes via user->kernel ioctl()
The vdpa driver internally decides whether to use the CVQ or something else (like the 
avail index).

3.3) enable tx/rx
3.4) set driver_ok


RE: About restoring the state in vhost-vdpa device

2022-05-16 Thread Parav Pandit


> From: Eugenio Perez Martin 
> Sent: Monday, May 16, 2022 4:51 AM
> 
> On Fri, May 13, 2022 at 8:25 PM Parav Pandit  wrote:
> >
> > Hi Gautam,
> >
> > Please fix your email client to have right response format.
> > Otherwise, it will be confusing for the rest and us to follow the
> conversation.
> >
> > More below.
> >
> > > From: Gautam Dawar 
> > > Sent: Friday, May 13, 2022 1:48 PM
> >
> > > > Our proposal diverge in step 7: Instead of enabling *all* the
> > > > virtqueues, only enable the CVQ.
> > > Just to double check, VQ 0 and 1 of the net are also not enabled, correct?
> > > [GD>>] Yes, that's my understanding as well.
> > >
> 
> That's correct. We can say that for a queue to be enabled three things must
> happen:
> * DRIVER_OK (Still not send)
> * VHOST_VDPA_SET_VRING_ENABLE sent for that queue (Only sent for
> CVQ)
> * If queue is not in first data queue pair and not cvq: send
> VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET with a queue pair that include it.
> 
These if-conditions, especially the last one, are not good as they require 
device-type knowledge, which in most cases is not needed.
Especially for new code.

> > > > After that, send the DRIVER_OK and queue all the control commands
> > > > to restore the device status (MQ, RSS, ...). Once all of them have
> > > > been acknowledged ("device", or emulated cvq in host vdpa backend
> > > > driver, has used all cvq buffers, enable (SET_VRING_ENABLE,
> > > > set_vq_ready) all other queues.
> > > >
> > > What is special about doing DRIVER_OK and enqueuing the control
> > > commands?
> > > Why other configuration cannot be applied before DRIVER_OK?
> 
> There is nothing special beyond "they have a method to be set that way, so
> reusing it avoids having to maintain many ways to set the same things,
> simplifying implementations".
> 
> I'm not saying "it has been like that forever so we cannot change it":
> I'm very open to the change but priority-wise we should first achieve a
> working LM with packed, in_order, or even indirect.
> 
> > > [GD>>] For the device to be live (and any queue to be able to pass
> > > traffic), DRIVER_OK is a must.
> > This applies only to the vdpa device implemented over virtio device.
> > For such use case/implementation anyway a proper virtio spec extension is
> needed for it be efficient.
> > And what that happens this scheme will still work.
> >
> 
> Although it's a longer route, I'd very much prefer an in-band virtio way to
> perform it rather than a linux/vdpa specific. It's one of the reasons I prefer
> the CVQ behavior over a vdpa specific ioctl.
> 
What is the in-band method to set last_avail_idx?
An in-band virtio method doesn't exist.

> > Other vdpa devices doesn’t have to live with this limitation at this moment.
> >
> > > So, any configuration can be passed over the CVQ only after it is
> > > started (vring configuration + DRIVER_OK). For an emulated queue, if
> > > the order is reversed, I think the enqueued commands will remain
> > > buffered and device should be able to service them when it goes live.
> > I likely didn’t understand what you describe above.
> >
> > Vq avail index etc is programmed before doing DRIVER_OK anyway.
> >
> > Sequence is very straight forward at destination from user to kernel.
> > 1. Set config space fields (such as virtio_net_config/virtio_blk_config).
> > 2. Set other device attributes (max vqs, current num vqs) 3. Set net
> > specific config (RSS, vlan, mac and more filters) 4. Set VQ fields
> > (enable, msix, addr, avail indx) 5. Set DRIVER_OK, device resumes from
> > where it left off
> >
> > Steps #1 to #4 can be done multiple times in pre-warm device up case in
> future.
> 
> That requires creating a way to set all the parameters enumerated there to
> vdpa devices. Each time a new feature is added to virtio-net that modifies
> the guest-visible fronted device we would need to update it. 
For any guest-visible feature exposed by the vdpa device on the source side, a 
migration agent anyway needs to verify it and make the destination device capable 
of supporting it. Without that, a device may be migrated, but it won't perform at 
the same level as the source.

> And all the
> layers of the stack need to maintain more state.
Mostly not. A complete virtio device state arriving from the source vdpa device can 
be given to the destination vdpa device without anyone else looking at it in the 
middle, if this format is known/well defined.

> 
> From the guest point of view, to enable all the queu

RE: About restoring the state in vhost-vdpa device

2022-05-13 Thread Parav Pandit
Hi Gautam,

Please fix your email client to use the right response format.
Otherwise, it will be confusing for the rest of us to follow the conversation.

More below.

> From: Gautam Dawar 
> Sent: Friday, May 13, 2022 1:48 PM

> > Our proposal diverge in step 7: Instead of enabling *all* the
> > virtqueues, only enable the CVQ.
> Just to double check, VQ 0 and 1 of the net are also not enabled, correct?
> [GD>>] Yes, that's my understanding as well.
> 
> > After that, send the DRIVER_OK and queue all the control commands to
> > restore the device status (MQ, RSS, ...). Once all of them have been
> > acknowledged ("device", or emulated cvq in host vdpa backend driver,
> > has used all cvq buffers, enable (SET_VRING_ENABLE, set_vq_ready) all
> > other queues.
> >
> What is special about doing DRIVER_OK and enqueuing the control
> commands?
> Why other configuration cannot be applied before DRIVER_OK?
> [GD>>] For the device to be live (and any queue to be able to pass traffic),
> DRIVER_OK is a must. 
This applies only to a vdpa device implemented over a virtio device.
For such a use case/implementation, a proper virtio spec extension is anyway needed 
for it to be efficient.
And when that happens, this scheme will still work.

Other vdpa devices don't have to live with this limitation at this moment.

> So, any configuration can be passed over the CVQ only
> after it is started (vring configuration + DRIVER_OK). For an emulated queue,
> if the order is reversed, I think the enqueued commands will remain
> buffered and device should be able to service them when it goes live.
I likely didn't understand what you described above.

The VQ avail index etc. is programmed before DRIVER_OK anyway.

The sequence is very straightforward at the destination, from user to kernel 
(sketched below):
1. Set config space fields (such as virtio_net_config/virtio_blk_config).
2. Set other device attributes (max vqs, current num vqs)
3. Set net specific config (RSS, vlan, mac and more filters)
4. Set VQ fields (enable, msix, addr, avail indx)
5. Set DRIVER_OK, device resumes from where it left off

Steps #1 to #4 can be done multiple times in a pre-warm device bring-up case in 
the future.
For now, they can be done once to get things started.
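
To make the ordering concrete, here is a rough user-space sketch against the 
existing vhost-vdpa ioctls (error handling omitted; step #3 is only a comment 
because the net-specific config ioctl proposed here does not exist yet):

    #include <linux/vhost.h>
    #include <sys/ioctl.h>
    #include <stdint.h>

    static void restore_and_resume(int vdpa_fd, unsigned int num_vqs,
                                   struct vhost_vring_addr *addrs,
                                   const uint16_t *avail_idx)
    {
        /* Steps #1/#2: config space and device attributes would be set
         * here (e.g. VHOST_VDPA_SET_CONFIG).                          */
        /* Step #3: RSS/vlan/mac filters via the proposed new ioctl(). */

        for (unsigned int i = 0; i < num_vqs; i++) {          /* step #4 */
            struct vhost_vring_state base = { .index = i, .num = avail_idx[i] };
            struct vhost_vring_state enable = { .index = i, .num = 1 };

            ioctl(vdpa_fd, VHOST_SET_VRING_ADDR, &addrs[i]);
            ioctl(vdpa_fd, VHOST_SET_VRING_BASE, &base);
            ioctl(vdpa_fd, VHOST_VDPA_SET_VRING_ENABLE, &enable);
        }

        /* Step #5: DRIVER_OK last, so the device resumes with the full
         * configuration already in place. */
        uint8_t status = 0x0f; /* ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK */
        ioctl(vdpa_fd, VHOST_VDPA_SET_STATUS, &status);
    }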


RE: About restoring the state in vhost-vdpa device

2022-05-13 Thread Parav Pandit

> From: Eugenio Perez Martin 
> Sent: Wednesday, May 11, 2022 3:44 PM
> 
> This is a proposal to restore the state of the vhost-vdpa device at the
> destination after a live migration. It uses as many available features both
> from the device and from qemu as possible so we keep the communication
> simple and speed up the merging process.
> 
> # Initializing a vhost-vdpa device.
> 
> Without the context of live migration, the steps to initialize the device from
> vhost-vdpa at qemu starting are:
> 1) [vhost] Open the vdpa device, Using simply open()
> 2) [vhost+virtio] Get device features. These are expected not to change in
> the device's lifetime, so we can save them. Qemu issues a
> VHOST_GET_FEATURES ioctl and vdpa forwards to the backend driver using
> get_device_features() callback.
> 3) [vhost+virtio] Get its max_queue_pairs if _F_MQ and _F_CTRL_VQ.
This should soon be replaced with a more generic num_vq interface, as 
max_queue_pairs doesn't work beyond net.
There is no need to carry forward an ancient interface convention for a newly built 
vdpa stack.

> These are obtained using VHOST_VDPA_GET_CONFIG, and that request is
> forwarded to the device using get_config. QEMU expects the device to not
> change it in its lifetime.
> 4) [vhost] Vdpa set status (_S_ACKNOLEDGE, _S_DRIVER). Still no
> FEATURES_OK or DRIVER_OK. The ioctl is VHOST_VDPA_SET_STATUS, and
> the vdpa backend driver callback is set_status.
> 
> These are the steps used to initialize the device in qemu terminology, taking
> away some redundancies to make it simpler.
> 
> Now the driver sends the FEATURES_OK and the DRIVER_OK, and qemu
> detects it, so it *starts* the device.
> 
> # Starting a vhost-vdpa device
> 
> At virtio_net_vhost_status we have two important variables here:
> int cvq = _F_CTRL_VQ ? 1 : 0;
> int queue_pairs = _F_CTRL_VQ && _F_MQ ? (max_queue_pairs of step 3) :
> 0;
> 
> Now identification of the cvq index. Qemu *know* that the device will
> expose it at the last queue (max_queue_pairs*2) if _F_MQ has been
> acknowledged by the guest's driver or 2 if not. It cannot depend on any data
> sent to the device via cvq, because we couldn't get its command status on a
> change.
> 
> Now we start the vhost device. The workflow is currently:
> 
> 5) [virtio+vhost] The first step is to send the acknowledgement of the Virtio
> features and vhost/vdpa backend features to the device, so it knows how to
> configure itself. This is done using the same calls as step 4 with these 
> feature
> bits added.
> 6) [virtio] Set the size, base, addr, kick and call fd for each queue
> (SET_VRING_ADDR, SET_VRING_NUM, ...; and forwarded with
> set_vq_address, set_vq_state, ...)
> 7) [vdpa] Send host notifiers and *send SET_VRING_ENABLE = 1* for each
> queue. This is done using ioctl VHOST_VDPA_SET_VRING_ENABLE, and
> forwarded to the vdpa backend using set_vq_ready callback.
> 8) [virtio + vdpa] Send memory translations & set DRIVER_OK.
> 
So MQ and all-VQ setup should be done before step 8.

> If we follow the current workflow, the device is allowed now to start
> receiving only on vq pair 0, since we've still not set the multi queue pair. 
> This
> could cause the guest to receive packets in unexpected queues, breaking
> RSS.
> 
> # Proposal
> 
> Our proposal diverge in step 7: Instead of enabling *all* the virtqueues, only
> enable the CVQ. 
Just to double check, VQ 0 and 1 of the net are also not enabled, correct?

> After that, send the DRIVER_OK and queue all the control
> commands to restore the device status (MQ, RSS, ...). Once all of them have
> been acknowledged ("device", or emulated cvq in host vdpa backend driver,
> has used all cvq buffers, enable (SET_VRING_ENABLE, set_vq_ready) all
> other queues.
> 
What is special about doing DRIVER_OK and then enqueuing the control commands?
Why can't the other configuration be applied before DRIVER_OK?

In other words,
step 7 already sets up the necessary VQ-related fields.

Before doing DRIVER_OK, what is needed is to set up any other device fields and 
features. For net this includes RSS, VLAN and MAC filters.
So, a new vdpa ioctl() should be able to set these values.
This is the ioctl() between user and kernel.
After this ioctl(), DRIVER_OK should be issued, resuming the device.

The device has a full view of the config now.

This node-local device setup change should not require a migration protocol 
change.

This scheme will also work for non-net virtio devices too.

> Everything needed for this is already implemented in the kernel as far as I
> see, there is only a small modification in qemu needed. Thus achieving the
> restoring of the device state without creating maintenance burden.
> 
> A lot of optimizations can be applied on top without the need to add stuff to
> the migration protocol or vDPA uAPI, like the pre-warming of the vdpa
> queues or adding more capabilities to the emulated CVQ.
The above ioctl() will enable the vdpa subsystem to apply this setting one or more 
times in the pre-warming stage before DRIVER_OK.

> 
> 

RE: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Parav Pandit


> From: Jason Wang 
> Sent: Wednesday, August 19, 2020 12:19 PM
> 
> 
> On 2020/8/19 下午1:26, Parav Pandit wrote:
> >
> >> From: Jason Wang 
> >> Sent: Wednesday, August 19, 2020 8:16 AM
> >
> >> On 2020/8/18 下午5:32, Parav Pandit wrote:
> >>> Hi Jason,
> >>>
> >>> From: Jason Wang 
> >>> Sent: Tuesday, August 18, 2020 2:32 PM
> >>>
> >>>
> >>> On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> >>> On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> >>> On 2020/8/14 下午1:16, Yan Zhao wrote:
> >>> On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> >>> On 2020/8/10 下午3:46, Yan Zhao wrote:
> >>> driver is it handled by?
> >>> It looks that the devlink is for network device specific, and in
> >>> devlink.h, it says include/uapi/linux/devlink.h - Network physical
> >>> device Netlink interface, Actually not, I think there used to have
> >>> some discussion last year and the conclusion is to remove this
> >>> comment.
> >>>
> >>> [...]
> >>>
> >>>> Yes, but it could be hard. E.g vDPA will chose to use devlink
> >>>> (there's a long
> >> debate on sysfs vs devlink). So if we go with sysfs, at least two
> >> APIs needs to be supported ...
> >>> We had internal discussion and proposal on this topic.
> >>> I wanted Eli Cohen to be back from vacation on Wed 8/19, but since
> >>> this is
> >> active discussion right now, I will share the thoughts anyway.
> >>> Here are the initial round of thoughts and proposal.
> >>>
> >>> User requirements:
> >>> ---
> >>> 1. User might want to create one or more vdpa devices per PCI PF/VF/SF.
> >>> 2. User might want to create one or more vdpa devices of type
> >>> net/blk or
> >> other type.
> >>> 3. User needs to look and dump at the health of the queues for debug
> purpose.
> >>> 4. During vdpa net device creation time, user may have to provide a
> >>> MAC
> >> address and/or VLAN.
> >>> 5. User should be able to set/query some of the attributes for
> >>> debug/compatibility check 6. When user wants to create vdpa device,
> >>> it needs
> >> to know which device supports creation.
> >>> 7. User should be able to see the queue statistics of doorbells,
> >>> wqes etc regardless of class type
> >>
> >> Note that wqes is probably not something common in all of the vendors.
> > Yes. I virtq descriptors stats is better to monitor the virtqueues.
> >
> >>
> >>> To address above requirements, there is a need of vendor agnostic
> >>> tool, so
> >> that user can create/config/delete vdpa device(s) regardless of the vendor.
> >>> Hence,
> >>> We should have a tool that lets user do it.
> >>>
> >>> Examples:
> >>> -
> >>> (a) List parent devices which supports creating vdpa devices.
> >>> It also shows which class types supported by this parent device.
> >>> In below command two parent devices support vdpa device creation.
> >>> First is PCI VF whose bdf is 03.00:5.
> >>> Second is PCI SF whose name is mlx5_sf.1
> >>>
> >>> $ vdpa list pd
> >>
> >> What did "pd" mean?
> >>
> > Parent device which support creation of one or more vdpa devices.
> > In a system there can be multiple parent devices which may be support vdpa
> creation.
> > User should be able to know which devices support it, and when user creates 
> > a
> vdpa device, it tells which parent device to use for creation as done in below
> vdpa dev add example.
> >>> pci/:03.00:5
> >>> class_supports
> >>>   net vdpa
> >>> virtbus/mlx5_sf.1
> >>
> >> So creating mlx5_sf.1 is the charge of devlink?
> >>
> > Yes.
> > But here vdpa tool is working at the parent device identifier {bus+name}
> instead of devlink identifier.
> >
> >
> >>> class_supports
> >>>   net
> >>>
> >>> (b) Now add a vdpa device and show the device.
> >>> $ vdpa dev add pci/:03.00:5 type net
> >>
> >> So if you want to create devices types other than vdpa on
> >> pci/:03.00:5 it needs some synchronization with devlink?
> > Pl

RE: device compatibility interface for live migration with assigned devices

2020-08-18 Thread Parav Pandit


> From: Yan Zhao 
> Sent: Wednesday, August 19, 2020 9:01 AM

> On Tue, Aug 18, 2020 at 09:39:24AM +0000, Parav Pandit wrote:

> > Please refer to my previous email which has more example and details.
> hi Parav,
> the example is based on a new vdpa tool running over netlink, not based on
> devlink, right?
Right.

> For vfio migration compatibility, we have to deal with both mdev and physical
> pci devices, I don't think it's a good idea to write a new tool for it, given 
> we are
> able to retrieve the same info from sysfs and there's already an mdevctl from
The mdev attribute should be visible in the mdev's sysfs tree.
I do not propose writing a new mdev tool over netlink. I am sorry if I implied 
that with my suggestion of a vdpa tool.

If the underlying device is vdpa, mdev might be able to understand the vdpa device, 
query it, and populate the mdev sysfs tree.

The vdpa tool I propose is usable even without mdevs.
The vdpa tool's role is to create one or more vdpa devices and place them on the 
"vdpa" bus, which is the lowest layer here.
Additionally, this tool lets the user query virtqueue stats and doorbell stats.
When a user creates a vdpa net device, the user may need to configure features of 
the vdpa device such as VIRTIO_NET_F_MAC or a default VIRTIO_NET_F_MTU.
These are vdpa-level features and attributes; mdev is a layer above it.

> Alex
> (https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.
> com%2Fmdevctl%2Fmdevctldata=02%7C01%7Cparav%40nvidia.com%7C
> 0c2691d430304f5ea11308d843f2d84e%7C43083d15727340c1b7db39efd9ccc17
> a%7C0%7C0%7C637334057571911357sdata=KxH7PwxmKyy9JODut8BWr
> LQyOBylW00%2Fyzc4rEvjUvA%3Dreserved=0).
>
Sorry for above link mangling. Our mail server is still transitioning due to 
company acquisition.

I am not familiar enough with the points below to comment.

> hi All,
> could we decide that sysfs is the interface that every VFIO vendor driver 
> needs
> to provide in order to support vfio live migration, otherwise the userspace
> management tool would not list the device into the compatible list?
> 
> if that's true, let's move to the standardizing of the sysfs interface.
> (1) content
> common part: (must)
>- software_version: (in major.minor.bugfix scheme)
>- device_api: vfio-pci or vfio-ccw ...
>- type: mdev type for mdev device or
>a signature for physical device which is a counterpart for
>  mdev type.
> 
> device api specific part: (must)
>   - pci id: pci id of mdev parent device or pci id of physical pci
> device (device_api is vfio-pci)
>   - subchannel_type (device_api is vfio-ccw)
> 
> vendor driver specific part: (optional)
>   - aggregator
>   - chpid_type
>   - remote_url
> 
> NOTE: vendors are free to add attributes in this part with a restriction that 
> this
> attribute is able to be configured with the same name in sysfs too. e.g.
> for aggregator, there must be a sysfs attribute in device node
> /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-
> 078a62063ab1/intel_vgpu/aggregator,
> so that the userspace tool is able to configure the target device according to
> source device's aggregator attribute.
> 
> 
> (2) where and structure
> proposal 1:
> |- [path to device]
>   |--- migration
>   | |--- self
>   | ||-software_version
>   | ||-device_api
>   | ||-type
>   | ||-[pci_id or subchannel_type]
>   | ||-
>   | |--- compatible
>   | ||-software_version
>   | ||-device_api
>   | ||-type
>   | ||-[pci_id or subchannel_type]
>   | ||-
> multiple compatible is allowed.
> attributes should be ASCII text files, preferably with only one value per 
> file.
> 
> 
> proposal 2: use bin_attribute.
> |- [path to device]
>   |--- migration
>   | |--- self
>   | |--- compatible
> 
> so we can continue use multiline format. e.g.
> cat compatible
>   software_version=0.1.0
>   device_api=vfio_pci
>   type=i915-GVTg_V5_{val1:int:1,2,4,8}
>   pci_id=80865963
>   aggregator={val1}/2
> 
> Thanks
> Yan


RE: device compatibility interface for live migration with assigned devices

2020-08-18 Thread Parav Pandit


> From: Jason Wang 
> Sent: Wednesday, August 19, 2020 8:16 AM


> On 2020/8/18 下午5:32, Parav Pandit wrote:
> > Hi Jason,
> >
> > From: Jason Wang 
> > Sent: Tuesday, August 18, 2020 2:32 PM
> >
> >
> > On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > On 2020/8/14 下午1:16, Yan Zhao wrote:
> > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > On 2020/8/10 下午3:46, Yan Zhao wrote:
> > driver is it handled by?
> > It looks that the devlink is for network device specific, and in
> > devlink.h, it says include/uapi/linux/devlink.h - Network physical
> > device Netlink interface, Actually not, I think there used to have
> > some discussion last year and the conclusion is to remove this
> > comment.
> >
> > [...]
> >
> >> Yes, but it could be hard. E.g vDPA will chose to use devlink (there's a 
> >> long
> debate on sysfs vs devlink). So if we go with sysfs, at least two APIs needs 
> to be
> supported ...
> > We had internal discussion and proposal on this topic.
> > I wanted Eli Cohen to be back from vacation on Wed 8/19, but since this is
> active discussion right now, I will share the thoughts anyway.
> >
> > Here are the initial round of thoughts and proposal.
> >
> > User requirements:
> > ---
> > 1. User might want to create one or more vdpa devices per PCI PF/VF/SF.
> > 2. User might want to create one or more vdpa devices of type net/blk or
> other type.
> > 3. User needs to look and dump at the health of the queues for debug 
> > purpose.
> > 4. During vdpa net device creation time, user may have to provide a MAC
> address and/or VLAN.
> > 5. User should be able to set/query some of the attributes for
> > debug/compatibility check 6. When user wants to create vdpa device, it needs
> to know which device supports creation.
> > 7. User should be able to see the queue statistics of doorbells, wqes
> > etc regardless of class type
> 
> 
> Note that wqes is probably not something common in all of the vendors.
Yes. I think virtq descriptor stats are better for monitoring the virtqueues.

> 
> 
> >
> > To address above requirements, there is a need of vendor agnostic tool, so
> that user can create/config/delete vdpa device(s) regardless of the vendor.
> >
> > Hence,
> > We should have a tool that lets user do it.
> >
> > Examples:
> > -
> > (a) List parent devices which supports creating vdpa devices.
> > It also shows which class types supported by this parent device.
> > In below command two parent devices support vdpa device creation.
> > First is PCI VF whose bdf is 03.00:5.
> > Second is PCI SF whose name is mlx5_sf.1
> >
> > $ vdpa list pd
> 
> 
> What did "pd" mean?
> 
A parent device supports creation of one or more vdpa devices.
In a system there can be multiple parent devices which may support vdpa 
creation.
The user should be able to know which devices support it, and when the user creates a 
vdpa device, they tell which parent device to use for creation, as done in the 
vdpa dev add example below.
> 
> > pci/:03.00:5
> >class_supports
> >  net vdpa
> > virtbus/mlx5_sf.1
> 
> 
> So creating mlx5_sf.1 is the charge of devlink?
> 
Yes.
But here the vdpa tool works at the parent device identifier {bus+name} 
instead of the devlink identifier.


> 
> >class_supports
> >  net
> >
> > (b) Now add a vdpa device and show the device.
> > $ vdpa dev add pci/:03.00:5 type net
> 
> 
> So if you want to create devices types other than vdpa on
> pci/:03.00:5 it needs some synchronization with devlink?
Please refer to FAQ-1: a new tool is not linked to devlink because vdpa will
evolve with time and devlink will fall short.
So no, it doesn't need any synchronization with devlink.
As long as the parent device exists, the user can create it.
All synchronization will be within drivers/vdpa/vdpa.c.
This user interface is exposed via a new netlink family, registered by calling
genl_register_family() with the new family name "vdpa" in drivers/vdpa/vdpa.c.
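
As a rough, illustrative sketch (the command numbers and the handler body below
are made up for illustration; only genl_register_family(), struct
genl_family/genl_ops and the extack error-message helper are the actual kernel
interfaces being referred to), the registration could look like this:

#include <linux/module.h>
#include <net/genetlink.h>

/* Illustrative command numbers; not a final uAPI. */
enum {
        VDPA_CMD_UNSPEC,
        VDPA_CMD_DEV_NEW,       /* "vdpa dev add ..." */
        VDPA_CMD_DEV_DEL,       /* "vdpa dev del ..." */
};

static int vdpa_nl_dev_new_doit(struct sk_buff *skb, struct genl_info *info)
{
        /* Parse attributes, find the parent device, create the vdpa device.
         * Netlink lets us return a meaningful error string in addition to
         * the error code:
         */
        NL_SET_ERR_MSG_MOD(info->extack, "parent device does not exist");
        return -ENODEV;
}

static const struct genl_ops vdpa_nl_ops[] = {
        { .cmd = VDPA_CMD_DEV_NEW, .doit = vdpa_nl_dev_new_doit },
};

static struct genl_family vdpa_nl_family __ro_after_init = {
        .name    = "vdpa",
        .version = 1,
        .ops     = vdpa_nl_ops,
        .n_ops   = ARRAY_SIZE(vdpa_nl_ops),
        .module  = THIS_MODULE,
};

static int __init vdpa_nl_init(void)
{
        return genl_register_family(&vdpa_nl_family);
}
module_init(vdpa_nl_init);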

> 
> 
> > $ vdpa dev show
> > vdpa0@pci/:03.00:5 type net state inactive maxqueues 8 curqueues 4
> >
> > (c) vdpa dev show features vdpa0
> > iommu platform
> > version 1
> >
> > (d) dump vdpa statistics
> > $ vdpa dev stats show vdpa0
> > kickdoorbells 10
> > wqes 100
> >
> > (e) Now delete a vdpa device previously created.
> > $ vdpa dev del vdpa0
> >
> > Design overview:
> > 

RE: device compatibility interface for live migration with assigned devices

2020-08-18 Thread Parav Pandit
Hi Cornelia,

> From: Cornelia Huck 
> Sent: Tuesday, August 18, 2020 3:07 PM
> To: Daniel P. Berrangé 
> Cc: Jason Wang ; Yan Zhao
> ; k...@vger.kernel.org; libvir-l...@redhat.com;
> qemu-devel@nongnu.org; Kirti Wankhede ;
> eau...@redhat.com; xin-ran.w...@intel.com; cor...@lwn.net; openstack-
> disc...@lists.openstack.org; shaohe.f...@intel.com; kevin.t...@intel.com;
> Parav Pandit ; jian-feng.d...@intel.com;
> dgilb...@redhat.com; zhen...@linux.intel.com; hejie...@intel.com;
> bao.yum...@zte.com.cn; Alex Williamson ;
> eskul...@redhat.com; smoo...@redhat.com; intel-gvt-
> d...@lists.freedesktop.org; Jiri Pirko ;
> dinec...@redhat.com; de...@ovirt.org
> Subject: Re: device compatibility interface for live migration with assigned
> devices
> 
> On Tue, 18 Aug 2020 10:16:28 +0100
> Daniel P. Berrangé  wrote:
> 
> > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
> > >
> > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > >
> > >  On 2020/8/14 1:16 PM, Yan Zhao wrote:
> > >
> > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > >
> > >  On 2020/8/10 3:46 PM, Yan Zhao wrote:
> >
> > >  we actually can also retrieve the same information through sysfs,
> > > .e.g
> > >
> > >  |- [path to device]
> > > |--- migration
> > > | |--- self
> > > | |   |---device_api
> > > ||   |---mdev_type
> > > ||   |---software_version
> > > ||   |---device_id
> > > ||   |---aggregator
> > > | |--- compatible
> > > | |   |---device_api
> > > ||   |---mdev_type
> > > ||   |---software_version
> > > ||   |---device_id
> > > ||   |---aggregator
> > >
> > >
> > >  Yes but:
> > >
> > >  - You need one file per attribute (one syscall for one attribute)
> > >  - Attribute is coupled with kobject
> 
> Is that really that bad? You have the device with an embedded kobject
> anyway, and you can just put things into an attribute group?
> 
> [Also, I think that self/compatible split in the example makes things
> needlessly complex. Shouldn't semantic versioning and matching already
> cover nearly everything? I would expect very few cases that are more
> complex than that. Maybe the aggregation stuff, but I don't think we need
> that self/compatible split for that, either.]
> 
> > >
> > >  All of above seems unnecessary.
> > >
> > >  Another point, as we discussed in another thread, it's really hard
> > > to make  sure the above API work for all types of devices and
> > > frameworks. So having a  vendor specific API looks much better.
> > >
> > >  From the POV of userspace mgmt apps doing device compat checking /
> > > migration,  we certainly do NOT want to use different vendor
> > > specific APIs. We want to  have an API that can be used / controlled in a
> standard manner across vendors.
> > >
> > > Yes, but it could be hard. E.g vDPA will chose to use devlink (there's a
> > > long debate on sysfs vs devlink). So if we go with sysfs, at least two
> > > APIs needs to be supported ...
> >
> > NB, I was not questioning devlink vs sysfs directly. If devlink is
> > related to netlink, I can't say I'm enthusiastic as IMHO sysfs is
> > easier to deal with. I don't know enough about devlink to have much of an
> > opinion though.
> > The key point was that I don't want the userspace APIs we need to deal
> > with to be vendor specific.
> 
> From what I've seen of devlink, it seems quite nice; but I understand why
> sysfs might be easier to deal with (especially as there's likely already a 
> lot of
> code using it.)
> 
> I understand that some users would like devlink because it is already widely
> used for network drivers (and some others), but I don't think the majority of
> devices used with vfio are network (although certainly a lot of them are.)
> 
> >
> > What I care about is that we have a *standard* userspace API for
> > performing device compatibility checking / state migration, for use by
> > QEMU/libvirt/ OpenStack, such that we can write code without countless
> > vendor specific code paths.
> >
> > If there is vendor specific stuff on the side, that's fine as we can
> > ignore that, but the core functionality for device compat / migration
> > needs to be standardized.
> 
> To summarize:
> - choose one of sysfs or devlink
> - have a common interface, with a standardized way to add
>   vendor-specific attributes
> ?

Please refer to my previous email, which has more examples and details.


RE: device compatibility interface for live migration with assigned devices

2020-08-18 Thread Parav Pandit
Hi Jason,

From: Jason Wang  
Sent: Tuesday, August 18, 2020 2:32 PM


On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 1:16 PM, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 3:46 PM, Yan Zhao wrote:
driver is it handled by?
It looks that the devlink is for network device specific, and in
devlink.h, it says
include/uapi/linux/devlink.h - Network physical device Netlink
interface,
Actually not, I think there used to have some discussion last year and the
conclusion is to remove this comment.

[...]

> Yes, but it could be hard. E.g vDPA will chose to use devlink (there's a long 
> debate on sysfs vs devlink). So if we go with sysfs, at least two APIs needs 
> to be supported ...

We had an internal discussion and a proposal on this topic.
I wanted to wait for Eli Cohen to be back from vacation on Wed 8/19, but since
this is an active discussion right now, I will share the thoughts anyway.

Here is the initial round of thoughts and the proposal.

User requirements:
---
1. User might want to create one or more vdpa devices per PCI PF/VF/SF.
2. User might want to create one or more vdpa devices of type net/blk or
another type.
3. User needs to look at and dump the health of the queues for debugging
purposes.
4. During vdpa net device creation time, the user may have to provide a MAC
address and/or VLAN.
5. User should be able to set/query some of the attributes for
debug/compatibility checks.
6. When the user wants to create a vdpa device, they need to know which devices
support creation.
7. User should be able to see the queue statistics of doorbells, WQEs, etc.,
regardless of class type.

To address the above requirements, there is a need for a vendor-agnostic tool,
so that the user can create/configure/delete vdpa device(s) regardless of the
vendor.

Hence,
we should have a tool that lets the user do it.

Examples:
-
(a) List parent devices which support creating vdpa devices.
It also shows which class types are supported by each parent device.
In the below command, two parent devices support vdpa device creation.
The first is a PCI VF whose bdf is 03.00:5.
The second is a PCI SF whose name is mlx5_sf.1.

$ vdpa list pd
pci/:03.00:5
  class_supports
net vdpa
virtbus/mlx5_sf.1
  class_supports
net

(b) Now add a vdpa device and show the device.
$ vdpa dev add pci/:03.00:5 type net
$ vdpa dev show
vdpa0@pci/:03.00:5 type net state inactive maxqueues 8 curqueues 4

(c) vdpa dev show features vdpa0
iommu platform
version 1

(d) dump vdpa statistics
$ vdpa dev stats show vdpa0
kickdoorbells 10
wqes 100

(e) Now delete a vdpa device previously created.
$ vdpa dev del vdpa0

Design overview:
---
1. The above example tool runs over a netlink socket interface.
2. This enables the kernel to return meaningful error strings in addition to
error codes, so that the user is better informed.
This is often missing in ioctl()/configfs/sysfs interfaces.
3. Running this tool over netlink enables syzkaller tests to be more usable,
like in other subsystems, to keep the kernel robust.
4. This provides a vendor-agnostic view of all vdpa-capable parent devices and
vdpa devices.

5. Each driver which supports vdpa device creation registers the parent device
along with its supported classes (a rough, hypothetical sketch follows this
list).
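
Purely as a hypothetical sketch of what such a driver-side registration could
look like (none of these structure or function names exist; vdpa_parent_dev,
vdpa_parent_ops and vdpa_parent_register() are placeholders for the driver API
we will post later):

/* Hypothetical sketch only; these names are placeholders, not an existing API. */
#include <linux/device.h>
#include <linux/virtio_ids.h>

struct vdpa_parent_dev;

struct vdpa_parent_ops {
        /* Invoked when the user runs "vdpa dev add <parent> type net". */
        int (*dev_add)(struct vdpa_parent_dev *pd, const char *name);
        /* Invoked when the user runs "vdpa dev del <name>". */
        void (*dev_del)(struct vdpa_parent_dev *pd, const char *name);
};

struct vdpa_parent_dev {
        struct device *dev;              /* backing PCI PF/VF/SF device */
        unsigned long supported_classes; /* e.g. BIT(VIRTIO_ID_NET)     */
        const struct vdpa_parent_ops *ops;
};

/* A vendor driver fills in the structure from its probe() routine and
 * registers it, so that "vdpa list pd" can enumerate it. */
int vdpa_parent_register(struct vdpa_parent_dev *pd);
void vdpa_parent_unregister(struct vdpa_parent_dev *pd);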

FAQs:

1. Why not use devlink?
Ans: Because as the vdpa ecosystem grows, devlink will fall short of extending
vdpa-specific params, attributes and stats.

2. Why not use sysfs?
Ans:
(a) Because the syzkaller infrastructure runs well over netlink sockets, as it
already does for several subsystems.
(b) sysfs lacks the ability to return error messages; doing it via the kernel
log just doesn't work.
(c) Why not some ioctl()? That would reinvent the wheel of netlink, which
already has TLV formats for several attributes.

3. Why not configfs?
It has the same limitations as sysfs.
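
As a small illustration of FAQ 2(c): with netlink, every attribute is a typed
TLV that the core validates against a policy, so the kernel does not have to
hand-parse strings. The attribute numbers below are illustrative only, not a
final uAPI:

#include <net/netlink.h>

/* Illustrative attribute numbers; not a final uAPI. */
enum {
        VDPA_ATTR_UNSPEC,
        VDPA_ATTR_PARENT_NAME,  /* e.g. a parent like virtbus/mlx5_sf.1 */
        VDPA_ATTR_DEV_NAME,     /* e.g. "vdpa0" */
        __VDPA_ATTR_MAX,
};
#define VDPA_ATTR_MAX (__VDPA_ATTR_MAX - 1)

static const struct nla_policy vdpa_nl_policy[VDPA_ATTR_MAX + 1] = {
        [VDPA_ATTR_PARENT_NAME] = { .type = NLA_NUL_STRING },
        [VDPA_ATTR_DEV_NAME]    = { .type = NLA_NUL_STRING },
};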

Low level design and driver APIs:

Will post once we discuss this further.


RE: [PATCH v7 0/11] add failover feature for assigned network devices

2019-11-04 Thread Parav Pandit
Hi Jens,


> -Original Message-
> From: Jens Freimann 
> Sent: Tuesday, October 29, 2019 6:49 AM
> To: qemu-devel@nongnu.org
> Cc: ehabk...@redhat.com; m...@redhat.com; berra...@redhat.com;
> la...@redhat.com; aa...@redhat.com; ai...@redhat.com; Parav Pandit
> ; dgilb...@redhat.com; alex.william...@redhat.com;
> arm...@redhat.com; ebl...@redhat.com; jasow...@redhat.com;
> quint...@redhat.com; pbonz...@redhat.com
> Subject: [PATCH v7 0/11] add failover feature for assigned network devices
> 
> This is implementing the host side of the net_failover concept
> (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> 
> Changes since v6:
> * reword patch description of 06/11 (Markus)
> * have qemu_savevm_state_guest_unplug_pending() return true when at
>   least one device is still not unplugged (Dave)
> 
> The general idea is that we have a pair of devices, a vfio-pci and a
> virtio-net device. Before migration the vfio device is unplugged and data
> flows to the virtio-net device, on the target side another vfio-pci device is
> plugged in to take over the data-path. In the guest the net_failover module
> will pair net devices with the same MAC address.
> 
> * Patch 1 adds the infrastructure to hide the device for the qbus and qdev 
> APIs
> 
> * Patch 2 adds checks to PCIDevice for only allowing ethernet devices as
>   failover primary and only PCIExpress capable devices
> 
> * Patch 3 sets a new flag for PCIDevice 'partially_hotplugged' which we
>   use to skip the unrealize code path when doing a unplug of the primary
>   device
> 
> * Patch 4 sets the pending_deleted_event before triggering the guest
>   unplug request
> 
> * Patch 5 and 6 add new qmp events, one sends the device id of a device
>   that was just requested to be unplugged from the guest and another one
>   to let libvirt know if VIRTIO_NET_F_STANDBY was negotiated
> 
> * Patch 7 make sure that we can unplug the vfio-device before
>   migration starts
> 
> * Patch 8 adds a new migration state that is entered while we wait for
>   devices to be unplugged by guest OS
> 
> * Patch 9 just adds the new migration state to a check in libqos code
> 
> * Patch 10 In the second patch the virtio-net uses the API to defer adding
>   the vfio device until the VIRTIO_NET_F_STANDBY feature is acked. It also
>   implements the migration handler to unplug the device from the guest and
>   re-plug in case of migration failure
> 
> * Patch 11 allows migration for failover vfio-pci devices
> 

It is great to see these patches enabling the failover capability on the
netdevice.
However, it's very difficult to test without libvirt support.
Can we please have the necessary libvirt enhancements?
I will be able to test it with a Mellanox ConnectX-5 device + qemu + libvirt.

[..]




RE: [PATCH v5 02/11] pci: add option for net failover

2019-10-24 Thread Parav Pandit



> -Original Message-
> From: Jens Freimann 
> Sent: Thursday, October 24, 2019 4:38 AM
> To: Parav Pandit 
> Cc: qemu-devel@nongnu.org; ehabk...@redhat.com; m...@redhat.com;
> berra...@redhat.com; pkre...@redhat.com; la...@redhat.com;
> aa...@redhat.com; ai...@redhat.com; dgilb...@redhat.com;
> alex.william...@redhat.com
> Subject: Re: [PATCH v5 02/11] pci: add option for net failover
> 
> On Thu, Oct 24, 2019 at 05:03:46AM +, Parav Pandit wrote:
> >> @@ -2101,6 +2104,20 @@ static void pci_qdev_realize(DeviceState
> >> *qdev, Error **errp)
> >>  }
> >>  }
> >>
> >> +if (pci_dev->net_failover_pair_id) {
> >> +if (!pci_is_express(pci_dev)) {
> >
> >I am testing and integrating this piece with mlx5 devices.
> >I see that pci_is_express() return true only for first PCI function.
> >Didn't yet dig the API.
> >Commenting out this check and below class check progresses further.
> 
> First of all, thanks for testing this!
> Could you share your commandline please? I can't reproduce it.
> >
I added debug prints to find the difference between VF1 and VF2 behavior.
What I see is that the below code in vfio_populate_device() is activated for
VF2, where QEMU claims that it is not a PCIe device.

    vdev->config_size = reg_info->size;
    if (vdev->config_size == PCI_CONFIG_SPACE_SIZE) {
        vdev->pdev.cap_present &= ~QEMU_PCI_CAP_EXPRESS;
        printf("%s clearing QEMU PCI bits\n", __func__);
    }

Command line:
/usr/local/bin/qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
    -machine q35,usb=off,vmport=off,dump-guest-core=off -cpu Haswell-noTSX-IBRS \
    -net none \
    -qmp unix:/tmp/qmp.socket,server,nowait \
    -monitor telnet:127.0.0.1:5556,server,nowait \
    -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
    -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
    -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
    -netdev tap,id=hostnet1,fd=4 4<>/dev/tap49 \
    -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:02:02:02,bus=root2,failover=on \
    -device vfio-pci,id=hostdev0,host=05:00.2,bus=root1,net_failover_pair_id=net1 \
    /var/lib/libvirt/images/sriov-lm-02.qcow2

> >While reviewing, I realized that we shouldn't have this check for below
> reasons.
> >
> >1. It is user's responsibility to pass networking device.
> >But its ok to check the class, if PCI Device is passed.
> >So class comparison should be inside the pci_check().
> 
> I'm not sure I understand this point, could you please elaborate?
> You're suggesting to move the check for the class into the check for
> pci_is_express?
> 
No. Below is the suggestion.

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 8fbf32d68c..8004309973 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2105,17 +2105,14 @@ static void pci_qdev_realize(DeviceState *qdev, Error **errp)
     }
 
     if (pci_dev->net_failover_pair_id) {
-        if (!pci_is_express(pci_dev)) {
-            error_setg(errp, "failover device is not a PCIExpress device");
-            error_propagate(errp, local_err);
-            return;
-        }
-        class_id = pci_get_word(pci_dev->config + PCI_CLASS_DEVICE);
-        if (class_id != PCI_CLASS_NETWORK_ETHERNET) {
-            error_setg(errp, "failover device is not an Ethernet device");
-            error_propagate(errp, local_err);
-            return;
-        }
+        if (pci_is_express(pci_dev)) {
+            class_id = pci_get_word(pci_dev->config + PCI_CLASS_DEVICE);
+            if (class_id != PCI_CLASS_NETWORK_ETHERNET) {
+                error_setg(errp, "failover device is not an Ethernet device");
+                error_propagate(errp, local_err);
+                return;
+            }
+        }

This will allow mapping a non-PCIe device as a failover device too.
After writing the above hunk, I think that by the time the code reaches the
if (pci_dev->net_failover_pair_id) check, it has already gone through
do_pci_register_device().
There should not be any need to check pci_is_express() again, should there?


> >2. It is limiting to only consider PCI devices.
> >Automated and regression tests should be able validate this feature without
> PCI Device.
> >This will enhance the stability of feature in long run.
> >
> >3. net failover driver doesn't limit it to have it over only PCI device.
> >So similarly hypervisor should be limiting.
> 
> I agree that we don't have to limit it to PCI(e) forever. But for this first 
> shot I
> think we should and then extend it continually. There are more things we can
> support in the future like other hotplug types etc.
> 
OK. But shouldn't the net_failover_pair_id field be in DeviceState instead of
PCIDevice, at minimum?
Or do you want to refactor it later?



RE: [PATCH v5 02/11] pci: add option for net failover

2019-10-24 Thread Parav Pandit



> -Original Message-
> From: Jens Freimann 
> Sent: Wednesday, October 23, 2019 3:27 AM
> To: qemu-devel@nongnu.org
> Cc: ehabk...@redhat.com; m...@redhat.com; berra...@redhat.com;
> pkre...@redhat.com; la...@redhat.com; aa...@redhat.com;
> ai...@redhat.com; Parav Pandit ;
> dgilb...@redhat.com; alex.william...@redhat.com
> Subject: [PATCH v5 02/11] pci: add option for net failover
> 
> This patch adds a net_failover_pair_id property to PCIDev which is used to
> link the primary device in a failover pair (the PCI dev) to a standby (a 
> virtio-
> net-pci) device.
> 
> It only supports ethernet devices. Also currently it only supports PCIe
> devices. QEMU will exit with an error message otherwise.
> 
> Signed-off-by: Jens Freimann 
> ---
>  hw/pci/pci.c | 17 +
>  include/hw/pci/pci.h |  3 +++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c index aa05c2b9b2..fa9b5219f8 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -75,6 +75,8 @@ static Property pci_props[] = {
>  QEMU_PCIE_LNKSTA_DLLLA_BITNR, true),
>  DEFINE_PROP_BIT("x-pcie-extcap-init", PCIDevice, cap_present,
>  QEMU_PCIE_EXTCAP_INIT_BITNR, true),
> +DEFINE_PROP_STRING("net_failover_pair_id", PCIDevice,
> +net_failover_pair_id),
>  DEFINE_PROP_END_OF_LIST()
>  };
> 
> @@ -2077,6 +2079,7 @@ static void pci_qdev_realize(DeviceState *qdev,
> Error **errp)
>  ObjectClass *klass = OBJECT_CLASS(pc);
>  Error *local_err = NULL;
>  bool is_default_rom;
> +uint16_t class_id;
> 
>  /* initialize cap_present for pci_is_express() and pci_config_size(),
>   * Note that hybrid PCIs are not set automatically and need to manage
> @@ -2101,6 +2104,20 @@ static void pci_qdev_realize(DeviceState *qdev,
> Error **errp)
>  }
>  }
> 
> +if (pci_dev->net_failover_pair_id) {
> +if (!pci_is_express(pci_dev)) {

I am testing and integrating this piece with mlx5 devices.
I see that pci_is_express() returns true only for the first PCI function.
I haven't dug into the API yet.
Commenting out this check and the below class check progresses further.

While reviewing, I realized that we shouldn't have this check, for the below
reasons.

1. It is the user's responsibility to pass a networking device.
But it's OK to check the class if a PCI device is passed.
So the class comparison should be inside the pci_check().

2. It is limiting to only consider PCI devices.
Automated and regression tests should be able to validate this feature without
a PCI device.
This will enhance the stability of the feature in the long run.

3. The net_failover driver doesn't limit itself to running over only a PCI
device.
So, similarly, the hypervisor should not be limiting.

> +error_setg(errp, "failover device is not a PCIExpress device");
> +error_propagate(errp, local_err);
> +return;
> +}
> +class_id = pci_get_word(pci_dev->config + PCI_CLASS_DEVICE);
> +if (class_id != PCI_CLASS_NETWORK_ETHERNET) {
> +error_setg(errp, "failover device is not an Ethernet device");
> +error_propagate(errp, local_err);
> +return;
> +}
> +}
> +
>  /* rom loading */
>  is_default_rom = false;
>  if (pci_dev->romfile == NULL && pc->romfile != NULL) { diff --git
> a/include/hw/pci/pci.h b/include/hw/pci/pci.h index f3f0ffd5fb..def5435685
> 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -352,6 +352,9 @@ struct PCIDevice {
>  MSIVectorUseNotifier msix_vector_use_notifier;
>  MSIVectorReleaseNotifier msix_vector_release_notifier;
>  MSIVectorPollNotifier msix_vector_poll_notifier;
> +
> +/* ID of standby device in net_failover pair */
> +char *net_failover_pair_id;
>  };
> 
>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> --
> 2.21.0