Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object

2023-10-17 Thread Alex Williamson
On Tue, 17 Oct 2023 12:28:30 -0300
Jason Gunthorpe  wrote:

> On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:
> 
> > Do we therefore need some programmatic means for the kernel driver to
> > expose the node configuration to userspace?  What interfaces would
> > libvirt like to see here?  Is there an opportunity that this could
> > begin to define flavors or profiles for variant devices like we have
> > types for mdev devices where the node configuration would be
> > encompassed in a device profile?   
> 
> I don't think we should shift this mess into the kernel..
> 
> We have a wide range of things now that the orchestration must do in
> order to prepare that are fairly device specific. I understand in K8S
> configurations the preference is using operators (aka user space
> drivers) to trigger these things.
> 
> Supplying a few extra qemu command line options seems minor compared
> to all the profile and provisioning work that has to happen for other
> device types.

This seems to be a growing problem for things like mlx5-vfio-pci where
there's non-trivial device configuration necessary to enable migration
support.  It's not super clear to me how those devices are actually
expected to be used in practice with that configuration burden.

Are we simply saying here that it's implicit knowledge that the
orchestration must possess that when assigning devices exactly matching
10de:2342 or 10de:2345 when bound to the nvgrace-gpu-vfio-pci driver
that 8 additional NUMA nodes should be added to the VM and an ACPI
generic initiator object created linking those additional nodes to the
assigned GPU?
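
For illustration, the "implicit knowledge" in question boils down to a
hard-coded table in the orchestration stack, roughly like the hypothetical
C sketch below. The device IDs, driver name, and node count are the ones
cited above; the structure and function are illustrative only, not a real
interface.

#include <stdint.h>
#include <stddef.h>

struct vfio_numa_profile {
    uint16_t vendor;    /* PCI vendor ID */
    uint16_t device;    /* PCI device ID */
    const char *driver; /* required variant driver */
    unsigned nodes;     /* extra NUMA nodes to add to the VM */
};

/* Knowledge the orchestration must carry for every supported device. */
static const struct vfio_numa_profile profiles[] = {
    { 0x10de, 0x2342, "nvgrace-gpu-vfio-pci", 8 },
    { 0x10de, 0x2345, "nvgrace-gpu-vfio-pci", 8 },
};

static unsigned extra_nodes_for(uint16_t vendor, uint16_t device)
{
    for (size_t i = 0; i < sizeof(profiles) / sizeof(profiles[0]); i++)
        if (profiles[i].vendor == vendor && profiles[i].device == device)
            return profiles[i].nodes;
    return 0; /* unknown device: no way to guess, which is the problem */
}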

Is libvirt ok with that specification or are we simply going to bubble
this up as a user problem? Thanks,

Alex



Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object

2023-10-17 Thread Alex Williamson
On Tue, 17 Oct 2023 14:00:54 +0000
Ankit Agrawal  wrote:

> >> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> >> -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8
> >
> > Why didn't we just implement start and count in the base object (or a
> > list)? It seems like this gives the nvidia-acpi-generic-initiator two
> > different ways to set gi->node, either node= of the parent or
> > numa-node-start= here.  Once we expose the implicit node count in the
> > base object, I'm not sure what purpose this object serves.  I would have
> > thought it was for keying the build of the NVIDIA-specific _DSD, but
> > that's not implemented in this version.  
> 
> Agree, allowing a list of nodes to be provided to the acpi-generic-initiator
> will remove the need for the nvidia-acpi-generic-initiator object. 

And what happened to the _DSD?  Is it no longer needed?  Why?

> > I also don't see any programatic means for management tools to know how
> > many nodes to create.  For example what happens if there's a MIGv2 that
> > supports 16 partitions by default and makes use of the same vfio-pci
> > variant driver?  Thanks,  
> 
> It is supposed to stay at 8 for all the G+H devices. Maybe this can be managed
> through proper documentation in the user manual?

I thought the intention here was that a management tool would
automatically configure the VM with these nodes and GI object in
support of the device.  Planning only for Grace-Hopper isn't looking
too far into the future, and it's difficult to make software that
references a user manual.  This leads to a higher maintenance burden
where the management tool needs to recognize not only the driver, but
also the device bound to the driver, and update as new devices are released.
The management tool will never automatically support new devices without
making an assumption about the node configuration.

Do we therefore need some programmatic means for the kernel driver to
expose the node configuration to userspace?  What interfaces would
libvirt like to see here?  Is there an opportunity that this could
begin to define flavors or profiles for variant devices like we have
types for mdev devices where the node configuration would be
encompassed in a device profile?  Thanks,

Alex



Re: [PATCH v5] vfio/pci: Propagate ACPI notifications to user-space via eventfd

2023-06-15 Thread Alex Williamson
On Fri,  9 Jun 2023 13:39:50 +0000
Grzegorz Jaszczyk  wrote:

> To allow pass-through devices to receive ACPI notifications, permit
> registering an ACPI notify handler (via VFIO_DEVICE_SET_IRQS) for a
> given device. The handler's role is to receive such ACPI notifications
> and propagate them to user-space through the user-provided eventfd. This
> allows the VMM to receive them and propagate them further to the VM,
> where the actual driver for the pass-through device resides and can
> react to device-specific notifications accordingly.
> 
> The eventfd usage ensures VMM and device isolation: it allows using a
> dedicated channel associated with the device for such events, such that
> the VMM has direct access.
> 
> Since the eventfd counter is used as the ACPI notification value
> placeholder, the eventfd signaling needs to be serialized in order to
> not end up with notification values being coalesced. Therefore ACPI
> notification values are buffered and signaled one by one, once the
> previous notification value has been consumed.
> 
> Signed-off-by: Grzegorz Jaszczyk 
> ---
> Changelog v4..v5
> Address Alex Williamson's feedback:
> - s/vfio_acpi_notify.{c,o}/acpi_notify.{c,o}
> - Do not put acpi_notify to its own module but fold it into main
>   vfio.ko. Additionally select it from VFIO_PCI_CORE instead of VFIO_PCI.
> - Cleanup acpi notify under igate mutex (in vfio_pci_core_close_device).
> - Add extra check for ACPI companion in vfio_pci_get_irq_count and
>   extend vfio_pci_ioctl_get_irq_info.
> - Drop acpi.h include - linux/vfio_acpi_notify.h includes it already.
> - Send a device check notification value for DATA_NONE with non-zero count
>   and for DATA_BOOL with non-zero count (as for loopback testing).
> - Drop some redundant !acpi_notify->acpi_notify_trigger checks.
> - Move some common code to new helper functions:
>   1) acpi_notification_dequeue
>   2) vfio_acpi_notify_cleanup and rename previous
>  vfio_acpi_notify_cleanup into vfio_remove_acpi_notify which uses it
> - Add rate limited logging for dropped notifications.
> - Move vdev->acpi_notification pointer cleanup to the
>   vfio_acpi_notify_cleanup function; this also fixes two bigger issues
>   caught by Alex.
> - Allow the eventfd to be swapped.
> - s/GFP_KERNEL/GFP_KERNEL_ACCOUNT.
> - s/VFIO_PCI_ACPI_NTFY_IRQ_INDEX/VFIO_PCI_ACPI_IRQ_INDEX.
> - Add header protection for multiple includes.
> - v4: 
> https://patchwork.kernel.org/project/kvm/patch/20230522165811.123417-1-...@semihalf.com/
> 
> Changelog v3..v4
> Address Alex Williamson feedback:
> - Instead of introducing new ioctl used for eventfd registration, take
>   advantage of VFIO_DEVICE_SET_IRQS which already supports virtual IRQs
>   for things like error notification and device release requests.
> - Introduced mechanism preventing creation of large queues.
> Other:
> - Move the implementation into the newly introduced VFIO_ACPI_NOTIFY
>   helper module. It is not actually bound to VFIO_PCI, but VFIO_PCI
>   enables it whenever ACPI support is enabled. This change is introduced
>   since ACPI notifications are not limited to PCI devices; making the
>   code PCI-independent will allow reusing it for other VFIO_* support,
>   e.g. VFIO_PLATFORM, in the future if needed. Moving it out of
>   drivers/vfio/pci/ was also suggested offline.
> - s/notify_val_next/node
> - v3: 
> https://patchwork.kernel.org/project/kvm/patch/20230502132700.654528-1-jaszc...@google.com/
> 
> Changelog v2..v3:
> - Fix compilation warnings when building with "W=1"
> 
> Changelog v1..v2:
> - The v2 implementation is actually completely different from v1:
>   instead of using acpi netlink events for propagating ACPI
>   notifications to user space, take advantage of eventfd, which can
>   provide better VMM and device isolation: it allows using a dedicated
>   channel associated with the device for such events, such that the VMM
>   has direct access.
> - Using eventfd counter as notification value placeholder was suggested
>   in v1 and requires additional serialization logic introduced in v2.
> - Since vfio-pci supports non-ACPI platforms, address the !CONFIG_ACPI
>   case.
> - v1 discussion: 
> https://patchwork.kernel.org/project/kvm/patch/20230307220553.631069-1-...@semihalf.com/
> ---
>  drivers/vfio/Kconfig  |   5 +
>  drivers/vfio/Makefile |   1 +
>  drivers/vfio/acpi_notify.c| 249 ++
>  drivers/vfio/pci/Kconfig  |   1 +
>  drivers/vfio/pci/vfio_pci_core.c  |  13 ++
>  drivers/vfio/pci/vfio_pci_intrs.c |  85 ++
>  include/linux/vfio_acpi_notify.h  |  45 ++
>  include/linux/vfio_pci_core.h |   1 +
>  include/uapi/linux/vfio.h |

Re: [PATCH v4] vfio/pci: Propagate ACPI notifications to user-space via eventfd

2023-06-07 Thread Alex Williamson
On Wed, 7 Jun 2023 22:22:12 +0200
Grzegorz Jaszczyk  wrote:
> > >
> > > Can we drop the NTFY and just use VFIO_PCI_ACPI_IRQ_INDEX?  
> >
> > ACPI_IRQ at first glance could be confused with SCI, which is e.g.
> > registered as the "acpi" irq seen in /proc/interrupts; maybe it is
> > worth keeping NTFY here to emphasise the "Notify" part?  
> 
> Please let me know if you prefer VFIO_PCI_ACPI_IRQ_INDEX or
> VFIO_PCI_ACPI_NTFY_IRQ_INDEX taking into account the above.

This is a device-level ACPI interrupt, so it doesn't seem like it would
be confused with SCI.  What other ACPI-related interrupts would a
device have?  I'm still partial to dropping the NTFY, but if you're
attached to it, let's not abbreviate it; make it NOTIFY and do the same
for function names.

...
> > > > + } else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
> > > > + u32 notification_val;
> > > > +
> > > > + if (!count)
> > > > + return -EINVAL;
> > > > +
> > > > + notification_val = *(u32 *)data;  
> > >
> > > DATA_BOOL is defined as a u8, and of course also as a bool, so we
> > > expect only zero/non-zero.  I think a valid interpretation would be any
> > > non-zero value generates a device check notification value.  
> >
> > Maybe it would be helpful and would ease testing if we could use u8 as a
> > notification value placeholder so it would be more flexible?
> > Notification values from 0x80 to 0xBF are device-specific, 0xC0 and
> > above are reserved for definition by hardware vendors for hardware-
> > specific notifications. BTW, in practice I didn't see notification
> > values that do not fit in a u8, but even if they exist we can limit to
> > u8 and gain some flexibility anyway. Please let me know what you think.  
> 
> Does the above seem ok for you?

The data type is only a u8 for practicality; it's still labeled as a
bool, which suggests it's interpreted as either zero or non-zero.  We
also need to reconcile DATA_NONE, which should trigger the interrupt,
but with an implicit notification value.  I see the utility in what
you're proposing, but it logically implies an extension of the SET_IRQS
ioctl for a new data type which has hardly any practical value.  Thanks,

Alex
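
As a concrete reading of those semantics, here is a minimal userspace
sketch of a loopback "device check" trigger through VFIO_DEVICE_SET_IRQS
with DATA_NONE. The index macro is taken from the unmerged series under
discussion and is an assumption, not upstream uAPI.

#include <sys/ioctl.h>
#include <linux/vfio.h>

#define VFIO_PCI_ACPI_NOTIFY_IRQ_INDEX 5 /* hypothetical, not upstream */

/* DATA_NONE with non-zero count injects the implicit notification
 * value (device check), per the interpretation discussed above. */
static int trigger_device_check(int device_fd)
{
    struct vfio_irq_set set = {
        .argsz = sizeof(set),
        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
        .index = VFIO_PCI_ACPI_NOTIFY_IRQ_INDEX,
        .start = 0,
        .count = 1,
    };

    return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &set);
}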



Re: [PATCH v2] util: basic support for VFIO variant drivers

2023-05-31 Thread Alex Williamson
On Wed, 31 May 2023 17:46:50 -0300
Jason Gunthorpe  wrote:

> On Wed, May 31, 2023 at 02:40:01PM -0600, Alex Williamson wrote:
> 
> > Also note that we're saying "vfio" not "vfio-pci".  Only the mdev
> > interface has the device_api attribute to indicate the exported vfio
> > device interface.  The "vfio_pci:" match in modalias indicates a vfio
> > PCI driver, not necessarily a driver that provides a vfio-pci API.
> 
> modalias was designed so you take the /sys/.../modalias file, prepend
> vfio_ then do a standard modalias search on that string. The matching
> module should be loaded and the module name bound to the device as the
> driver name.
> 
> There should be no bus type dependencies in any of this in management
> code.

For example, modalias of a random wifi adapter:

pci:v00008086d00002723sv00008086sd00000084bc02sc80i00

The bus name is prepended because the encoding is bus specific.  Yes,
code doesn't really need to interpret that, it simply adds "vfio_" to
the beginning of the string and finds the matching driver with the
fewest number of wildcards in modules.alias.
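
A minimal sketch of that lookup in C, assuming the usual modules.alias
location (the kernel version in the path is a placeholder for uname -r
output). kmod's real matcher is more involved; this just illustrates the
prepend-and-glob-match idea with fnmatch(3).

#include <fnmatch.h>
#include <stdio.h>

/* Find the vfio variant driver module for a device modalias, picking
 * the matching alias pattern with the fewest '*' wildcards. */
static int vfio_variant_for(const char *dev_modalias,
                            char *module, size_t len)
{
    char vfio_alias[512], pattern[512], mod[128], line[1024];
    int best = -1;
    FILE *f = fopen("/lib/modules/6.1.0/modules.alias", "r"); /* uname -r */

    if (!f)
        return -1;
    snprintf(vfio_alias, sizeof(vfio_alias), "vfio_%s", dev_modalias);

    while (fgets(line, sizeof(line), f)) {
        int wild = 0;

        if (sscanf(line, "alias %511s %127s", pattern, mod) != 2)
            continue;
        if (fnmatch(pattern, vfio_alias, 0) != 0)
            continue;
        for (const char *p = pattern; *p; p++)
            wild += (*p == '*');
        if (best < 0 || wild < best) { /* fewest wildcards wins */
            best = wild;
            snprintf(module, len, "%s", mod);
        }
    }
    fclose(f);
    return best < 0 ? -1 : 0;
}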

We are not code, we have a vfio_pci driver, a vfio-pci device API, and
a vfio_pci: modalias prefix, it's easy to get them confused and infer
information that isn't intended.  All I'm trying (poorly) to clarify is
that a vfio_pci: modalias prefix only indicates a vfio driver for a PCI
device.  It does not guarantee the vfio device API exposed to userspace
is vfio-pci.  Therefore management tools should be cautious about making
assumptions regarding the type of device the VM will see, even though
we've got vfio-pci written all over the place.  Thanks,

Alex



Re: [PATCH v2] util: basic support for VFIO variant drivers

2023-05-31 Thread Alex Williamson
On Wed, 31 May 2023 14:30:52 -0300
Jason Gunthorpe  wrote:

> On Wed, May 31, 2023 at 01:18:44PM -0400, Laine Stump wrote:
> > On 5/31/23 10:31 AM, Jason Gunthorpe wrote:  
> > > On Wed, May 31, 2023 at 03:18:17PM +0100, Joao Martins wrote:  
> > > > Hey Laine,
> > > > 
> > > > On 23/08/2022 15:11, Laine Stump wrote:  
> > > > > ping.
> > > > > 
> > > > > I have a different version of this patch where I do read the 
> > > > > modules.alias file
> > > > > rather than just checking the name of the driver, but that also 
> > > > > requires "double
> > > > > mocking" open() in the unit test, which wasn't working properly, and 
> > > > > I'd rather
> > > > > not spend the time figuring it out if it's not going to be needed. 
> > > > > (Alex prefers
> > > > > that version because it is more correct than just checking the name, 
> > > > > and he's
> > > > > concerned that the new sysfs-based API may take longer than we're 
> > > > > thinking to
> > > > > get into downstream distros, but the version in this patch does 
> > > > > satisfy both
> > > > > Jason and Daniel's suggested implementations). Anyway, I can post the 
> > > > > other
> > > > > patch if anyone is interested.
> > > > >   
> > > > [sorry for the thread necromancy]  
> > 
> > Heh. I had actually dug out this same thread last week and started a mail to
> > ask Jason if the planned sysfs stuff had ever been pushed, but then forgot
> > to hit "send" :-)
> > 
> > Now that there are multiple vfio variant drivers available (for igb, e1000e,
> > and mlx5 that I know of),  
> 
> Oh I haven't seen those intel ones posted yet?
> 
> > Jason, can you point me at the information for this patch in an ELI5 manner
> > for a non-kernel person? (including what upstream kernel it's in, and what
> > it is that I need to look at to determine if a driver is a vfio
> > variant).  
> 
> It is this patch:
> 
> commit 3c28a76124b25882411f005924be73795b6ef078
> Author: Yi Liu 
> Date:   Wed Sep 21 18:44:01 2022 +0800
> 
> vfio: Add struct device to vfio_device
> 
> and replace kref. With it a 'vfio-dev/vfioX' node is created under the
> sysfs path of the parent, indicating the device is bound to a vfio
> driver, e.g.:
> 
> /sys/devices/pci0000\:6f/0000\:6f\:01.0/vfio-dev/vfio0
> 
> It is also a preparatory step toward adding cdev for supporting future
> device-oriented uAPI.
> 
> Add Documentation/ABI/testing/sysfs-devices-vfio-dev.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Yi Liu 
> Signed-off-by: Kevin Tian 
> Reviewed-by: Jason Gunthorpe 
> Link: 
> https://lore.kernel.org/r/20220921104401.38898-16-kevin.t...@intel.com
> Signed-off-by: Alex Williamson 
> 
> $ git describe --contains 3c28a76124b25882411f005924be73795b6ef078
> v6.1-rc1~42^2~35
> 
> So it is in v6.1-rc1
> 
> libvirt should start thinking about determining the vfioX number for
> the device; we will need that for iommufd enablement eventually
> 
> so, stat for a directory like this:
> 
> /sys/devices/pci0000\:6f/0000\:6f\:01.0/vfio-dev
> 
> Confirms vfio
>
> Then scan it to get 'vfioX' which will eventually be the /dev/ node
> libvirt will have to open.
> 
> And the other part is something in the stack should use the modalias
> mechanism to find, load, and bind the correct variant driver.

I'd forgotten about this as well, so after binding a driver to a device
we can tell if that driver presents a vfio interface by looking for
this sysfs directory.  Prior to binding to a device, we can only know
if a driver provides a vfio interface through modalias.
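
A minimal C sketch of the post-binding check, following the vfio-dev
sysfs layout from the commit quoted above; error handling is
intentionally thin.

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Returns 0 and fills 'name' with "vfioX" if the device identified by
 * its PCI address (e.g. "0000:6f:01.0") is bound to a vfio driver. */
static int vfio_dev_node(const char *bdf, char *name, size_t len)
{
    char path[256];
    struct stat st;
    struct dirent *d;
    DIR *dir;
    int ret = -1;

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/vfio-dev", bdf);
    if (stat(path, &st) || !S_ISDIR(st.st_mode))
        return -1; /* not bound to a vfio driver */

    dir = opendir(path);
    if (!dir)
        return -1;
    while ((d = readdir(dir))) {
        if (!strncmp(d->d_name, "vfio", 4)) { /* the eventual /dev node name */
            snprintf(name, len, "%s", d->d_name);
            ret = 0;
            break;
        }
    }
    closedir(dir);
    return ret;
}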

Also note that we're saying "vfio" not "vfio-pci".  Only the mdev
interface has the device_api attribute to indicate the exported vfio
device interface.  The "vfio_pci:" match in modalias indicates a vfio
PCI driver, not necessarily a driver that provides a vfio-pci API.  We
have no current examples to the contrary, but for instance I wouldn't
recommend validating whether mode='subsystem' type='pci' is appropriate
based on that information.  Thanks,

Alex



Re: [PATCH v4] vfio/pci: Propagate ACPI notifications to user-space via eventfd

2023-05-25 Thread Alex Williamson
On Mon, 22 May 2023 16:58:11 +0000
Grzegorz Jaszczyk  wrote:

> To allow pass-through devices to receive ACPI notifications, permit
> registering an ACPI notify handler (via VFIO_DEVICE_SET_IRQS) for a
> given device. The handler's role is to receive such ACPI notifications
> and propagate them to user-space through the user-provided eventfd. This
> allows the VMM to receive them and propagate them further to the VM,
> where the actual driver for the pass-through device resides and can
> react to device-specific notifications accordingly.
> 
> The eventfd usage ensures VMM and device isolation: it allows using a
> dedicated channel associated with the device for such events, such that
> the VMM has direct access.
> 
> Since the eventfd counter is used as the ACPI notification value
> placeholder, the eventfd signaling needs to be serialized in order to
> not end up with notification values being coalesced. Therefore ACPI
> notification values are buffered and signaled one by one, once the
> previous notification value has been consumed.
> 
> Signed-off-by: Grzegorz Jaszczyk 
> ---
> Changelog v3..v4
> Address Alex Williamson feedback:
> - Instead of introducing new ioctl used for eventfd registration, take
>   advantage of VFIO_DEVICE_SET_IRQS which already supports virtual IRQs
>   for things like error notification and device release requests.
> - Introduced mechanism preventing creation of large queues.
> Other:
> - Move the implementation into the newly introduced VFIO_ACPI_NOTIFY
>   helper module. It is not actually bound to VFIO_PCI, but VFIO_PCI
>   enables it whenever ACPI support is enabled. This change is introduced
>   since ACPI notifications are not limited to PCI devices; making the
>   code PCI-independent will allow reusing it for other VFIO_* support,
>   e.g. VFIO_PLATFORM, in the future if needed. Moving it out of
>   drivers/vfio/pci/ was also suggested offline.

We don't require a separate module for such re-use, see for instance
vfio's virqfd code, which was previously a helper module like this but
the argument for e2d55709398e ("vfio: Fold vfio_virqfd.ko into
vfio.ko") was that the code size doesn't warrant a separate module and
we can still optionally include it as part of vfio.ko via Kconfig.

> - s/notify_val_next/node
> - v3: 
> https://patchwork.kernel.org/project/kvm/patch/20230502132700.654528-1-jaszc...@google.com/
> 
> Changelog v2..v3:
> - Fix compilation warnings when building with "W=1"
> 
> Changelog v1..v2:
> - The v2 implementation is actually completely different from v1:
>   instead of using acpi netlink events for propagating ACPI
>   notifications to user space, take advantage of eventfd, which can
>   provide better VMM and device isolation: it allows using a dedicated
>   channel associated with the device for such events, such that the VMM
>   has direct access.
> - Using eventfd counter as notification value placeholder was suggested
>   in v1 and requires additional serialization logic introduced in v2.
> - Since vfio-pci supports non-ACPI platforms, address the !CONFIG_ACPI
>   case.
> - v1 discussion: 
> https://patchwork.kernel.org/project/kvm/patch/20230307220553.631069-1-...@semihalf.com/
> ---
> ---
>  drivers/vfio/Kconfig  |   5 +
>  drivers/vfio/Makefile |   1 +
>  drivers/vfio/pci/Kconfig  |   1 +
>  drivers/vfio/pci/vfio_pci_core.c  |   9 ++
>  drivers/vfio/pci/vfio_pci_intrs.c |  73 ++
>  drivers/vfio/vfio_acpi_notify.c   | 219 ++
>  include/linux/vfio_acpi_notify.h  |  40 ++
>  include/linux/vfio_pci_core.h |   1 +
>  include/uapi/linux/vfio.h |   1 +
>  9 files changed, 350 insertions(+)
>  create mode 100644 drivers/vfio/vfio_acpi_notify.c
>  create mode 100644 include/linux/vfio_acpi_notify.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 89e06c981e43..7822b0d8e7b1 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -12,6 +12,11 @@ menuconfig VFIO
> If you don't know what to do here, say N.
>  
>  if VFIO
> +config VFIO_ACPI_NOTIFY
> + tristate
> + depends on ACPI
> + default n
> +
>  config VFIO_CONTAINER
>   bool "Support for the VFIO container /dev/vfio/vfio"
>   select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64)
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 70e7dcb302ef..129c121b503d 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -14,3 +14,4 @@ obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
>  obj-$(CONFIG_VFIO_MDEV) += mdev/
>  obj-$(CONFIG_VFIO_FSL_MC) += fsl-mc/
> +obj-$(CONFIG_VFIO_ACPI_NOTIFY)

Re: [PATCH v3] vfio/pci: Propagate ACPI notifications to user-space via eventfd

2023-05-15 Thread Alex Williamson
On Tue,  2 May 2023 13:27:00 +0000
Grzegorz Jaszczyk  wrote:

> From: Grzegorz Jaszczyk 
> 
> To allow pass-through devices to receive ACPI notifications, permit
> registering an ACPI notify handler (via a newly introduced ioctl) for a given

This shouldn't require a new ioctl, it should fit within the
abstraction of the VFIO_DEVICE_SET_IRQS ioctl which already supports
virtual IRQs for things like error notification and device release
requests.  Support for this IRQ index on a given device should also be
discoverable via VFIO_DEVICE_GET_IRQ_INFO that way.
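
Sketching what that would look like from userspace, under the assumption
of a hypothetical ACPI notify index (the macro below is from the unmerged
series, not upstream): probe VFIO_DEVICE_GET_IRQ_INFO for discoverability,
then register an eventfd via VFIO_DEVICE_SET_IRQS.

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define VFIO_PCI_ACPI_NOTIFY_IRQ_INDEX 5 /* assumption for this sketch */

static int register_acpi_notify(int device_fd)
{
    struct vfio_irq_info info = {
        .argsz = sizeof(info),
        .index = VFIO_PCI_ACPI_NOTIFY_IRQ_INDEX,
    };
    char buf[sizeof(struct vfio_irq_set) + sizeof(int)];
    struct vfio_irq_set *set = (struct vfio_irq_set *)buf;
    int efd;

    /* Discoverability: if the index is absent, or count is zero (e.g.
     * no ACPI companion), the feature is not available. */
    if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info) || !info.count)
        return -1;

    efd = eventfd(0, EFD_CLOEXEC);
    if (efd < 0)
        return -1;

    /* Register the eventfd, as done for the ERR/REQ virtual IRQs. */
    memset(buf, 0, sizeof(buf));
    set->argsz = sizeof(buf);
    set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    set->index = VFIO_PCI_ACPI_NOTIFY_IRQ_INDEX;
    set->start = 0;
    set->count = 1;
    memcpy(set->data, &efd, sizeof(efd));

    return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
}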

> device. The handler's role is to receive such ACPI notifications and
> propagate them to user-space through the user-provided eventfd. This
> allows the VMM to receive and propagate them further to the VM, where
> the actual driver for the pass-through device resides and can react to
> device-specific notifications accordingly.
> 
> The eventfd usage ensures VMM and device isolation: it allows using a
> dedicated channel associated with the device for such events, such that
> the VMM has direct access.
> 
> Since the eventfd counter is used as the ACPI notification value
> placeholder, the eventfd signaling needs to be serialized in order to
> not end up with notification values being coalesced. Therefore ACPI
> notification values are buffered and signaled one by one, once the
> previous notification value has been consumed.

I don't see anything that prevents this queuing mechanism from creating
an arbitrarily large queue, don't we need to drop events at some point
to avoid introducing an exploit vector?  Aren't these notifications
often for things like "device check", where queuing duplicate entries
doesn't make sense and perhaps the most recent notification is the only
relevant value otherwise?  If we only need to avoid calling
eventfd_signal() while a non-zero value is pending, couldn't we call
eventfd_ctx_do_read() ourselves to clear the old value rather than
queuing?  Thanks,

Alex
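
A rough kernel-side sketch of that overwrite-latest alternative, assuming
some lock already serializes notifiers, and using the two-argument
eventfd_signal() of kernels from this era; this illustrates the idea
only, it is not the posted patch.

#include <linux/eventfd.h>
#include <linux/mutex.h>

static void vfio_acpi_notify_latest(struct eventfd_ctx *efdctx,
                                    struct mutex *lock, u32 event)
{
    u64 stale;

    mutex_lock(lock);
    /* Drop any pending, unconsumed value so values never coalesce
     * and no queue can build up; only the newest event survives. */
    eventfd_ctx_do_read(efdctx, &stale);
    /* Counter is now zero, so userspace reads back 'event' verbatim. */
    eventfd_signal(efdctx, event);
    mutex_unlock(lock);
}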
 
> Signed-off-by: Grzegorz Jaszczyk 
> ---
> Changelog v2..v3:
> - Fix compilation warnings when building with "W=1"
> Changelog v1..v2:
> - The v2 implementation is actually completely different from v1:
>   instead of using acpi netlink events for propagating ACPI
>   notifications to user space, take advantage of eventfd, which can
>   provide better VMM and device isolation: it allows using a dedicated
>   channel associated with the device for such events, such that the VMM
>   has direct access.
> - Using eventfd counter as notification value placeholder was suggested
>   in v1 and requires additional serialization logic introduced in v2.
> - Since vfio-pci supports non-ACPI platforms, address the !CONFIG_ACPI
>   case.
> - v1 discussion: 
> https://patchwork.kernel.org/project/kvm/patch/20230307220553.631069-1-...@semihalf.com/
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 214 +++
>  include/linux/vfio_pci_core.h|  11 ++
>  include/uapi/linux/vfio.h|  15 +++
>  3 files changed, 240 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c 
> b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..2d6101e89fde 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -10,6 +10,7 @@
>  
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  
> +#include <linux/acpi.h>
>  #include 
>  #include 
>  #include 
> @@ -679,6 +680,70 @@ void vfio_pci_core_disable(struct vfio_pci_core_device 
> *vdev)
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>  
> +struct notification_queue {
> + int notification_val;
> + struct list_head notify_val_next;
> +};
> +
> +#if IS_ENABLED(CONFIG_ACPI)
> +static void vfio_pci_core_acpi_notify(acpi_handle handle, u32 event, void 
> *data)
> +{
> + struct vfio_pci_core_device *vdev = (struct vfio_pci_core_device *)data;
> + struct vfio_acpi_notification *acpi_notify = vdev->acpi_notification;
> + struct notification_queue *entry;
> +
> + entry = kmalloc(sizeof(*entry), GFP_KERNEL);
> + if (!entry)
> + return;
> +
> + entry->notification_val = event;
> + INIT_LIST_HEAD(&entry->notify_val_next);
> +
> + mutex_lock(&acpi_notify->notification_list_lock);
> + list_add_tail(&entry->notify_val_next, &acpi_notify->notification_list);
> + mutex_unlock(&acpi_notify->notification_list_lock);
> +
> + schedule_work(&acpi_notify->acpi_notification_work);
> +}
> +
> +static void vfio_pci_acpi_notify_close_device(struct vfio_pci_core_device 
> *vdev)
> +{
> + struct vfio_acpi_notification *acpi_notify = vdev->acpi_notification;
> + struct pci_dev *pdev = vdev->pdev;
> + struct acpi_device *adev = ACPI_COMPANION(&pdev->dev);
> + struct notification_queue *entry, *entry_tmp;
> + u64 cnt;
> +
> + if (!acpi_notify || !acpi_notify->acpi_notify_trigger)
> + return;
> +
> + acpi_remove_notify_handler(adev->handle, ACPI_DEVICE_NOTIFY,
> +vfio_pci_core_acpi_notify);
> +
> + 

Re: [PATCH] vfio/pci: Propagate ACPI notifications to the user-space

2023-03-23 Thread Alex Williamson
On Thu, 9 Mar 2023 14:41:23 +0100
Grzegorz Jaszczyk  wrote:

> On Thu, 9 Mar 2023 at 00:38 Alex Williamson 
> wrote:
> >
> > On Wed, 8 Mar 2023 14:44:28 -0800
> > Dominik Behr  wrote:
> >  
> > > On Wed, Mar 8, 2023 at 12:06 PM Alex Williamson
> > >  wrote:  
> > > >
> > > > On Wed, 8 Mar 2023 10:45:51 -0800
> > > > Dominik Behr  wrote:
> > > >  
> > > > It is the same interface by which other ACPI events, like AC adapter,
> > > > LID, etc., are forwarded to user-space.
> > > > ACPI events are not particularly high frequency like interrupts.  
> > > >
> > > > I'm not sure that's relevant; these interfaces don't proclaim to
> > > > provide isolation among host processes which manage behavior relative
> > > > to accessories.  These are effectively system level services.  It's only
> > > > a very, very specialized use case that places a VMM as a peer among these
> > > > processes.  Generally we don't want to grant a VMM any privileges beyond
> > > > what it absolutely needs, so a VMM managing an assigned NIC
> > > > really ought not to be able to snoop host events related to anything
> > > > other than the NIC.  
> > > How is that related to the fact that we are forwarding VFIO-PCI events
> > > to netlink? The kernel does not grant any privileges to the VMM.
> > > There are already other ACPI events on netlink. The implementer of the
> > > VMM can choose to allow VMM to snoop them or not.
> > > In our case our VMM (crosvm) does already snoop LID, battery and AC
> > > adapter events so the guest can adjust its behavior accordingly.
> > > This change just adds another class of ACPI events that are forwarded
> > > to netlink.  
> >
> > That's true, it is the VMM choice whether to allow snooping netlink,
> > but this is being proposed as THE solution to allow VMMs to receive
> > ACPI events related to vfio assigned devices.  If the solution
> > inherently requires escalating the VMM privileges to see all netlink
> > events, that's a weakness in the proposal.  As noted previously,
> > there's also no introspection here, the VMM can't know whether it
> > should listen to netlink for ACPI events or include AML related to a
> > GPE for the device.  It cannot determine if either the kernel supports
> > this feature or if the device has an ACPI companion that can generate
> > these events.  
> 
> To be precise, the VMM doesn't listen to all netlink events: it listens
> only to the "acpi_event" family and the ACPI-related multicast group, which
> means it listens to all events generated through
> acpi_bus_generate_netlink_event.
> 
> Before sending this patch I thought about using eventfd instead of
> netlink, which would actually provide a channel associated with a given
> device, and therefore such notifications would be received only by the
> VMM associated with that device. Nevertheless, it seems like eventfd
> allows signaling that an event happened (notify on a given device)
> but is not capable of sending any payload, so in our case there is no
> room for propagating the notification value via eventfd. Maybe there is
> another eventfd-like mechanism which would allow achieving the above?

Reading an eventfd returns an 8-byte value; we generally only use it
as a counter, but it's been discussed previously and IIRC, it's possible
to use that value as a notification value.
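
The coalescing hazard that motivates the serialization is easy to
demonstrate from userspace: in the self-contained example below, two
back-to-back notification values are summed by a single read() and both
are lost.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    int fd = eventfd(0, 0);
    uint64_t v;

    v = 0x81; write(fd, &v, sizeof(v)); /* e.g. Information Change */
    v = 0x01; write(fd, &v, sizeof(v)); /* e.g. Device Check */

    read(fd, &v, sizeof(v));            /* counter is read and reset */
    printf("0x%llx\n", (unsigned long long)v); /* prints 0x82: both lost */

    close(fd);
    return 0;
}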

> If there is no such mechanism, maybe instead of using the existing acpi
> netlink events, which are associated with the "acpi_event" netlink family
> and acpi multicast group, we could create a different netlink family per
> vfio-pci device, or probably reuse the "acpi_event" family but use a
> different multicast group, so each device would have a dedicated channel.
> Does that seem reasonable?
> 
> >  
> > > >  
> > > > > > > > What sort of ACPI events are we expecting to see here and what 
> > > > > > > > does user space do with them?  
> > > > > The use we are looking at right now are D-notifier events about the
> > > > > GPU power available to mobile discrete GPUs.
> > > > > The firmware notifies the GPU driver and resource daemon to
> > > > > dynamically adjust the amount of power that can be used by the GPU.
> > > > >  
> > > > > > The proposed interface really has no introspection, how does the VMM
> > > > > > know which devices need ACPI tables added "upfront"?  How do these
> > > > > > events factor into h

Re: [PATCH] vfio/pci: Propagate ACPI notifications to the user-space

2023-03-08 Thread Alex Williamson
On Wed, 8 Mar 2023 14:44:28 -0800
Dominik Behr  wrote:

> On Wed, Mar 8, 2023 at 12:06 PM Alex Williamson
>  wrote:
> >
> > On Wed, 8 Mar 2023 10:45:51 -0800
> > Dominik Behr  wrote:
> >  
> > > It is the same interface by which other ACPI events, like AC adapter,
> > > LID, etc., are forwarded to user-space.
> > > ACPI events are not particularly high frequency like interrupts.  
> >
> > I'm not sure that's relevant; these interfaces don't proclaim to
> > provide isolation among host processes which manage behavior relative
> > to accessories.  These are effectively system level services.  It's only
> > a very, very specialized use case that places a VMM as a peer among these
> > processes.  Generally we don't want to grant a VMM any privileges beyond
> > what it absolutely needs, so a VMM managing an assigned NIC
> > really ought not to be able to snoop host events related to anything
> > other than the NIC.  
> How is that related to the fact that we are forwarding VFIO-PCI events
> to netlink? The kernel does not grant any privileges to the VMM.
> There are already other ACPI events on netlink. The implementer of the
> VMM can choose to allow VMM to snoop them or not.
> In our case our VMM (crosvm) does already snoop LID, battery and AC
> adapter events so the guest can adjust its behavior accordingly.
> This change just adds another class of ACPI events that are forwarded
> to netlink.

That's true, it is the VMM choice whether to allow snooping netlink,
but this is being proposed as THE solution to allow VMMs to receive
ACPI events related to vfio assigned devices.  If the solution
inherently requires escalating the VMM privileges to see all netlink
events, that's a weakness in the proposal.  As noted previously,
there's also no introspection here, the VMM can't know whether it
should listen to netlink for ACPI events or include AML related to a
GPE for the device.  It cannot determine if either the kernel supports
this feature or if the device has an ACPI companion that can generate
these events.

> >  
> > > > > > What sort of ACPI events are we expecting to see here and what does 
> > > > > > user space do with them?  
> > > The use we are looking at right now are D-notifier events about the
> > > GPU power available to mobile discrete GPUs.
> > > The firmware notifies the GPU driver and resource daemon to
> > > dynamically adjust the amount of power that can be used by the GPU.
> > >  
> > > > The proposed interface really has no introspection, how does the VMM
> > > > know which devices need ACPI tables added "upfront"?  How do these
> > > > events factor into hotplug device support, where we may not be able to
> > > > dynamically inject ACPI code into the VM?  
> > >
> > > The VMM can examine PCI IDs and the associated firmware node of the
> > > PCI device to figure out what events to expect and what ACPI table to
> > > generate to support it but that should not be necessary.  
> >
> > I'm not entirely sure where your VMM is drawing the line between the VM
> > and management tools, but I think this is another case where the
> > hypervisor itself should not have privileges to examine the host
> > firmware tables to build its own.  Something like libvirt would be
> > responsible for that.  
> Yes, but that depends on the design of the hypervisor and VMM and is not
> related to this patch.

It is very much related to this patch if it proposes an interface to
solve a problem which is likely not compatible with the security model
of other VMMs.  We need a single solution to support all VMMs.

> >  
> > > A generic GPE based ACPI event forwarder as Grzegorz proposed can be
> > > injected at VM init time and handle any notification that comes later,
> > > even from hotplug devices.  
> >
> > It appears that forwarder is sending the notify to a specific ACPI
> > device node, so it's unclear to me how that becomes boilerplate AML
> > added to all VMs.  We'll need to notify different devices based on
> > different events, right?  
> Valid point. The notifications have a "scope" ACPI path.
> In my experience these events are consumed without looking at where they
> came from, but I believe the patch can be extended to provide the ACPI
> path, in your example "_SB.PCI0.GPP0.PEGP" instead of the generic
> vfio_pci, which the VMM could use to translate to an equivalent ACPI
> path in the guest and pass it to a generic ACPI GPE based notifier via
> shared memory. Grzegorz, could you chime in on whether that would be
> possible?

So effectively we're imposing the host ACPI namespace on the 

Re: [PATCH] vfio/pci: Propagate ACPI notifications to the user-space

2023-03-08 Thread Alex Williamson
On Wed, 8 Mar 2023 10:45:51 -0800
Dominik Behr  wrote:

> On Wed, Mar 8, 2023 at 9:49 AM Alex Williamson
>  wrote:
> 
> > Adding libvirt folks.  This intentionally designs the interface in a
> > way that requires a privileged intermediary to monitor netlink on the
> > host, associate messages to VMs based on an attached device, and
> > re-inject the event to the VMM.  Why wouldn't we use a channel
> > associated with the device for such events, such that the VMM has
> > direct access?  The netlink path seems like it has more moving pieces,
> > possibly scalability issues, and maybe security issues?  
> 
> It is the same interface by which other ACPI events, like AC adapter,
> LID, etc., are forwarded to user-space.
> ACPI events are not particularly high frequency like interrupts.

I'm not sure that's relevant; these interfaces don't proclaim to
provide isolation among host processes which manage behavior relative
to accessories.  These are effectively system level services.  It's only
a very, very specialized use case that places a VMM as a peer among these
processes.  Generally we don't want to grant a VMM any privileges beyond
what it absolutely needs, so a VMM managing an assigned NIC
really ought not to be able to snoop host events related to anything
other than the NIC.

> > > > What sort of ACPI events are we expecting to see here and what does 
> > > > user space do with them?  
> The use we are looking at right now are D-notifier events about the
> GPU power available to mobile discrete GPUs.
> The firmware notifies the GPU driver and resource daemon to
> dynamically adjust the amount of power that can be used by the GPU.
> 
> > The proposed interface really has no introspection, how does the VMM
> > know which devices need ACPI tables added "upfront"?  How do these
> > events factor into hotplug device support, where we may not be able to
> > dynamically inject ACPI code into the VM?  
> 
> The VMM can examine PCI IDs and the associated firmware node of the
> PCI device to figure out what events to expect and what ACPI table to
> generate to support it but that should not be necessary.

I'm not entirely sure where your VMM is drawing the line between the VM
and management tools, but I think this is another case where the
hypervisor itself should not have privileges to examine the host
firmware tables to build its own.  Something like libvirt would be
responsible for that.

> A generic GPE based ACPI event forwarder as Grzegorz proposed can be
> injected at VM init time and handle any notification that comes later,
> even from hotplug devices.

It appears that forwarder is sending the notify to a specific ACPI
device node, so it's unclear to me how that becomes boilerplate AML
added to all VMs.  We'll need to notify different devices based on
different events, right?
 
> > The acpi_bus_generate_netlink_event() below really only seems to form a
> > u8 event type from the u32 event.  Is this something that could be
> > provided directly from the vfio device uAPI with an ioeventfd, thus
> > providing introspection that a device supports ACPI event notifications
> > and the ability for the VMM to exclusively monitor those events, and
> > only those events for the device, without additional privileges?  
> 
> From what I can see these events are 8-bit as they come from ACPI.
> They also do not carry any payload, and it is up to the receiving
> driver to query any additional context/state from the device.
> This will work the same in the VM, where the driver can query the same
> information from the passed-through PCI device.
> There are multiple other netlink-based ACPI event forwarders which
> do exactly the same thing for other devices, like AC adapter, lid/power
> button, ACPI thermal notifications, etc.
> They all use the same mechanism and can be received by user-space
> programs whether VMMs or others.

But again, those other receivers are potentially system services, not
an isolated VM instance operating in a limited privilege environment.
IMO, it's very different if the host display server has access to lid
or power events than it is to allow some arbitrary VM that happens to
have an unrelated assigned device that same privilege.

On my laptop, I see multiple _GPE scopes, each apparently very unique
to the devices:

Scope (_GPE)
{
Method (_L0C, 0, Serialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
{
Notify (\_SB.PCI0.GPP0.PEGP, 0x81) // Information Change
}   

Method (_L0D, 0, Serialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
{
Notify (\_SB.PCI0.GPP0.PEGP, 0x81) // Information Change
}

Method (_L0F, 0, Serialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
{   

Re: [PATCH] vfio/pci: Propagate ACPI notifications to the user-space

2023-03-08 Thread Alex Williamson
[Cc +libvir-list]

On Wed, 8 Mar 2023 12:41:24 +0100
Grzegorz Jaszczyk  wrote:

> On Wed, 8 Mar 2023 at 00:42 Alex Williamson  
> wrote:
> >
> > On Tue,  7 Mar 2023 22:05:53 +0000
> > Grzegorz Jaszczyk  wrote:
> >  
> > > From: Dominik Behr 
> > >
> > > Hitherto there was no support for propagating ACPI notifications to the
> > > guest drivers. In order to provide such support, install a handler for
> > > notifications on an ACPI device during vfio-pci device registration. The
> > > handler role is to propagate such ACPI notifications to the user-space
> > > via acpi netlink events, which allows VMM to receive and propagate them
> > > further to the VMs.
> > >
> > > Thanks to the above, the actual driver for the pass-through device,
> > > which belongs to the guest, can receive and react to device specific
> > > notifications.  
> 
> > What consumes these events?  
> 
> Those events are consumed by the VMM, which can have a built-in ACPI
> event listener.
> 
> > Has this been proposed to any VM management tools like libvirt?  
> 
> This patch was evaluated and tested with crosvm VMM (but since the
> kernel part is not in the tree the implementation is marked as WIP).

Adding libvirt folks.  This intentionally designs the interface in a
way that requires a privileged intermediary to monitor netlink on the
host, associate messages to VMs based on an attached device, and
re-inject the event to the VMM.  Why wouldn't we use a channel
associated with the device for such events, such that the VMM has
direct access?  The netlink path seems like it has more moving pieces,
possibly scalability issues, and maybe security issues?

> > What sort of ACPI events are we expecting to see here and what does user 
> > space do with them?  
> 
> With this patch we are expecting to see and propagate any device
> specific notifications, which are aimed to notify the proper device
> (driver) which belongs to the guest.
> 
> Here is the description how propagating such notification could be
> implemented by VMM:
> 
> 1) The VMM could generate upfront a proper virtual ACPI description for
> the guest per vfio-pci device (more precisely, it could be e.g. an ACPI
> GPE handler whose aim is only to notify the relevant device):

The proposed interface really has no introspection, how does the VMM
know which devices need ACPI tables added "upfront"?  How do these
events factor into hotplug device support, where we may not be able to
dynamically inject ACPI code into the VM?

> 
> Scope (_GPE)
> {
> Method (_E00, 0, NotSerialized)  // _Exx: Edge-Triggered
> GPE, xx=0x00-0xFF
> {
> Local0 = \_SB.PC00.PE08.NOTY
> Notify (\_SB.PC00.PE08, Local0)
> }
> }
> 
> 2) Now, when the VMM receives an ACPI netlink event (thanks to the VMM's
> builtin ACPI event listener, which is able to receive any event
> generated through acpi_bus_generate_netlink_event), the VMM classifies it
> based on device_class ("vfio_pci" in this case) and parses it further
> to get the device name and the notification value for it. This
> notification value is stored in a virtual register and the VMM triggers
> the GPE associated with the vfio-pci device.

Each VMM is listening for netlink events and sees all the netlink
traffic from the host, including events destined for other VMMs?  This
doesn't seem terribly acceptable from a security perspective.
 
> 3) The guest kernel, upon handling the GPE, thanks to the generated AML
> (see 1.), triggers Notify on the required pass-through device and
> therefore replicates the ACPI notification on the guest side (accessing
> \_SB.PC00.PE08.NOTY from the above example results in a trap to the VMM,
> which returns the previously stored notify value).

The acpi_bus_generate_netlink_event() below really only seems to form a
u8 event type from the u32 event.  Is this something that could be
provided directly from the vfio device uAPI with an ioeventfd, thus
providing introspection that a device supports ACPI event notifications
and the ability for the VMM to exclusively monitor those events, and
only those events for the device, without additional privileges?
Thanks,

Alex
 
> With the above, the ACPI notifications are actually replicated on the
> guest side, and from a guest driver's perspective they don't differ from
> native ones.
> 
> >  
> > > Signed-off-by: Dominik Behr 
> > > Co-developed-by: Grzegorz Jaszczyk 
> > > Signed-off-by: Grzegorz Jaszczyk 
> > > ---
> > >  drivers/vfio/pci/vfio_pci_core.c | 33 
> > >  1 file changed, 33 insertions(+)
> > >
> > > diff --git a/dr

Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface

2022-09-23 Thread Alex Williamson
On Fri, 23 Sep 2022 10:29:41 -0300
Jason Gunthorpe  wrote:

> On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
> 
> > Yes, we use cgroups extensively already.  
> 
> Ok, I will try to see about this
> 
> Can you also tell me if the selinux/seccomp will prevent qemu from
> opening more than one /dev/vfio/vfio ? I suppose the answer is no?

QEMU manages the container:group association with legacy vfio, so it
can't be restricted from creating multiple containers.  Thanks,

Alex



Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface

2022-09-21 Thread Alex Williamson
[Cc+ Steve, libvirt, Daniel, Laine]

On Tue, 20 Sep 2022 16:56:42 -0300
Jason Gunthorpe  wrote:

> On Tue, Sep 13, 2022 at 09:28:18AM +0200, Eric Auger wrote:
> > Hi,
> > 
> > On 9/13/22 03:55, Tian, Kevin wrote:  
> > > We didn't close the open of how to get this merged in LPC due to the
> > > audio issue. Then let's use mails.
> > >
> > > Overall there are three options on the table:
> > >
> > > 1) Require vfio-compat to be 100% compatible with vfio-type1
> > >
> > >Probably not a good choice given the amount of work to fix the 
> > > remaining
> > >gaps. And this will block support of new IOMMU features for a longer 
> > > time.
> > >
> > > 2) Leave vfio-compat as what it is in this series
> > >
> > >Treat it as a vehicle to validate the iommufd logic instead of 
> > > immediately
> > >replacing vfio-type1. Functionally most vfio applications can work w/o
> > >change if putting aside the difference on locked mm accounting, p2p, 
> > > etc.
> > >
> > >Then work on new features and 100% vfio-type1 compat. in parallel.
> > >
> > > 3) Focus on iommufd native uAPI first
> > >
> > >Require vfio_device cdev and adoption in Qemu. Only for new vfio app.
> > >
> > >Then work on new features and vfio-compat in parallel.
> > >
> > > I'm fine with either 2) or 3). Per a quick chat with Alex he prefers to 
> > > 3).  
> > 
> > I am also inclined to pursue 3) as this was Jason's initial guidance
> > and a pre-requisite to integrate new features. In the past we concluded
> > vfio-compat would mostly be used for testing purposes. Our QEMU
> > integration is fully based on the device-based API.  
> 
> There are some poor chicken and egg problems here.
> 
> I had some assumptions:
>  a - the vfio cdev model is going to be iommufd only
>  b - any uAPI we add as we go along should be generally useful going
>  forward
>  c - we should try to minimize the 'minimally viable iommufd' series
> 
> The compat as it stands now (eg #2) is threading this needle. Since it
> can exist without cdev it means (c) is made smaller, to two series.
> 
> Since we add something useful to some use cases, eg DPDK is deployable
> that way, (b) is OK.
> 
> If we focus on a strict path with 3, and avoid adding non-useful code,
> then we have to have two more (unwritten!) series beyond where we are
> now - vfio group compartmentalization, and cdev integration, and the
> initial (c) will increase.
> 
> 3 also has us merging something that currently has no usable
> userspace, which I also dislike a lot.
> 
> I still think the compat gaps are small. I've realized that
> VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it
> can deadlock the kernel I propose we purge it completely.

Steve won't be happy to hear that, QEMU support exists but isn't yet
merged.
 
> P2P is ongoing.
> 
> That really just leaves the accounting, and I'm still not convinced
> that this must be a critical thing. Linus's latest remarks reported in lwn
> at the maintainer summit on tracepoints/BPF as ABI seem to support
> this. Let's see an actual deployed production configuration that would
> be impacted, and we won't find that unless we move forward.

I'll try to summarize the proposed change so that we can get better
advice from libvirt folks, or potentially anyone else managing locked
memory limits for device assignment VMs.

Background: when a DMA range, ex. guest RAM, is mapped to a vfio device,
we use the system IOMMU to provide GPA to HPA translation for assigned
devices. Unlike CPU page tables, we don't generally have a means to
demand fault these translations, therefore the memory target of the
translation is pinned so that it cannot be swapped or
relocated, i.e. to guarantee the translation is always valid.

The issue is where we account these pinned pages, where accounting is
necessary such that a user cannot lock an arbitrary number of pages
into RAM to generate a DoS attack.  Duplicate accounting should be
resolved by iommufd, but is outside the scope of this discussion.

Currently, vfio tests against the mm_struct.locked_vm relative to
rlimit(RLIMIT_MEMLOCK), which reads task->signal->rlim[limit].rlim_cur,
where task is the current process.  This is the same limit set via the
setrlimit syscall used by prlimit(1) and reported via 'ulimit -l'.

Note that in both cases above, we're dealing with a task, or process
limit and both prlimit and ulimit man pages describe them as such.

iommufd supposes instead, and references existing kernel
implementations, that despite the descriptions above these limits are
actually meant to be user limits and therefore instead charges pinned
pages against user_struct.locked_vm and also marks them in
mm_struct.pinned_vm.

The proposed algorithm is to read the _task_ locked memory limit, then
attempt to charge the _user_ locked_vm, such that user_struct.locked_vm
cannot exceed the task locked memory limit.
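
In condensed, kernel-style C the proposed algorithm reads roughly as
below. This is a simplification for illustration (real code must also
keep the user_struct referenced until the charge is undone); it is not
iommufd's actual implementation.

#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/sched/user.h>

static int charge_user_locked_vm(struct user_struct *user,
                                 unsigned long npages)
{
    /* Task limit: the same value prlimit(1)/'ulimit -l' manipulate. */
    unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

    /* ...but the charge lands on the user, not the task. */
    if (atomic_long_add_return(npages, &user->locked_vm) > limit &&
        !capable(CAP_IPC_LOCK)) {
        atomic_long_sub(npages, &user->locked_vm);
        return -ENOMEM;
    }

    /* Additionally note the pin in the mm (informational, not a limit). */
    atomic64_add(npages, &current->mm->pinned_vm);
    return 0;
}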

This obviously has implications.  AFAICT, any management tool that
doesn't 

Re: [PATCH v2] util: basic support for VFIO variant drivers

2022-08-23 Thread Alex Williamson
On Tue, 23 Aug 2022 10:11:32 -0400
Laine Stump  wrote:

> ping.
> 
> I have a different version of this patch where I do read the 
> modules.alias file rather than just checking the name of the driver, but 
> that also requires "double mocking" open() in the unit test, which 
> wasn't working properly, and I'd rather not spend the time figuring it 
> out if it's not going to be needed. (Alex prefers that version because 
> it is more correct than just checking the name, and he's concerned that 
> the new sysfs-based API may take longer than we're thinking to get into 
> downstream distros, but the version in this patch does satisfy both 
> Jason and Daniel's suggested implementations). Anyway, I can post the 
> other patch if anyone is interested.

Yeah, I'm still not a fan of this approach.  We're essentially
inventing a requirement in libvirt for a kernel driver naming
convention, because it happens to work.  For now.  Hacky temporary
solutions have been known to be longer lived than anticipated.  This
eventually deteriorates into managing a list of drivers that don't meet
the convention, frustrating developers unaware of this arbitrary
requirement and/or delaying usability through libvirt.  Thanks,

Alex



Re: [PATCH] util: basic support for vendor-specific vfio drivers

2022-08-05 Thread Alex Williamson
On Fri, 5 Aug 2022 15:20:24 -0300
Jason Gunthorpe  wrote:

> On Fri, Aug 05, 2022 at 11:24:08AM -0600, Alex Williamson wrote:
> > On Thu, 4 Aug 2022 21:11:07 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Thu, Aug 04, 2022 at 01:36:24PM -0600, Alex Williamson wrote:
> > >   
> > > > > > That is reasonable, but I'd say those three kernels only have two
> > > > > > drivers and they both have vfio as a substring in their name - so 
> > > > > > the
> > > > > > simple thing of just substring searching 'vfio' would get us over 
> > > > > > that
> > > > > > gap.  
> > > > > 
> > > > > Looking at the aliases for exactly "vfio_pci" isn't that much more 
> > > > > complicated, and "feels" a lot more reliable than just doing a 
> > > > > substring 
> > > > > search for "vfio" in the driver's name. (It would be, uh,  "not 
> > > > > smart" to name a driver "vfio" if it wasn't actually a vfio 
> > > > > variant driver (or the opposite), but I could imagine it happening; 
> > > > > :-/)
> > > 
> > > This is still pretty hacky. I'm worried about what happens to the
> > > kernel if this becomes some crazy unintended uAPI that we never really
> > > thought about carefully... This was not a use case when we designed
> > > the modules.alias stuff at least.
> > > 
> > > BTW - why not do things the normal way?
> > > 
> > > 1. readlink /sys/bus/pci/devices/XX/iommu_group
> > > 2. Compute basename of #1
> > > 3. Check if /dev/vfio/#2 exists (or /sys/class/vfio/#2)
> > > 
> > > It has a small edge case where a multi-device group might give a false
> > > positive for an undrivered device, but for the purposes of libvirt
> > > that seems pretty obscure.. (while the above has false negative
> > > issues, obviously)  
> > 
> > This is not a small edge case; it's extremely common.  We have a *lot*
> > of users assigning desktop GPUs and other consumer grade hardware, which
> > are usually multi-function devices without isolation exposed via ACS or
> > quirks.  
> 
> The edge case is that the user has created a multi-device group,
> manually assigned device 1 in the group to VFIO, left device 2 with no
> driver and then told libvirt to manually use device 2. With the above
> approach libvirt won't detect this misconfiguration and qemu will
> fail.
> 
> > The vfio group exists if any devices in the group are bound to a vfio
> > driver, but the device is not accessible from the group unless the
> > viability test passes.  That means QEMU may not be able to get access
> > to the device because the device we want isn't actually bound to a vfio
> > driver or another device in the group is not in a viable state.  Thanks,  
> 
> This is a different misconfiguration that libvirt also won't detect,
> right? In this case ownership claiming in the kernel will fail and
> qemu will fail too, like above.
> 
> This, and the above, could be handled by having libvirt also open the
> group FD and get the device. It would prove both correct binding and
> viability.

libvirt cannot do this in the group model because the group must be
isolated in a container before the device can be accessed and libvirt
cannot presume the QEMU container configuration.  For direct device
access, this certainly becomes a possibility and I've been trying to
steer things in that direction, libvirt has the option to pass an fd for
the iommufd and can then pass fds for each of the devices in the new
uAPI paradigm.

> I had understood the point of this logic was to give better error
> reporting to users so that common misconfigurations could be diagnosed
> earlier. When I say 'small edge case' I mean it seems like an unlikely
> misconfiguration that someone would know to setup VFIO but then use
> the wrong BDFs to do it - arguably less likely than someone would know
> to setup VFIO but forget to unbind the other drivers in the group?

I'm not sure how much testing libvirt does of other devices in a group,
Laine?

AIUI here, libvirt has a managed='yes|no' option per device.  In the
'yes' case libvirt will unbind devices from their host driver and bind
them to vfio-pci.  In the 'no' case, I believe libvirt is still doing a
sanity test on the driver, but only knows about vfio-pci.

The initial step is to then enlighten libvirt that other drivers can be
compatible for the 'no' case and later we can make smarter choices
about which driver to use or allow the user to specify (ie. a user
should be able to use vfio-pci rather than a variant driver if they
choose) in the 'yes' case.

If libvirt is currently testing that only the target device is bound to
vfio-pci, then maybe we do have gaps for the ancillary devices in the
group, but that gap changes if instead we only test that a vfio group
exists relative to the iommu group of the target device.  Thanks,

Alex
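
A rough sketch of such a group-wide sanity test, using the crude "vfio"
name-substring heuristic discussed earlier in these threads; the paths
are real sysfs conventions, the heuristic is illustrative.

#include <dirent.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* True if every device in bdf's iommu group is either driverless or
 * bound to a driver with "vfio" in its name. */
static bool group_looks_viable(const char *bdf /* e.g. "0000:01:00.0" */)
{
    char path[PATH_MAX], link[PATH_MAX];
    struct dirent *d;
    bool viable = true;
    DIR *dir;

    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/iommu_group/devices", bdf);
    dir = opendir(path);
    if (!dir)
        return false;

    while ((d = readdir(dir))) {
        ssize_t n;

        if (d->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/driver", d->d_name);
        n = readlink(path, link, sizeof(link) - 1);
        if (n < 0)
            continue; /* no driver bound; doesn't block viability */
        link[n] = '\0';
        if (!strstr(link, "vfio")) { /* non-vfio driver in the group */
            viable = false;
            break;
        }
    }
    closedir(dir);
    return viable;
}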



Re: [PATCH] util: basic support for vendor-specific vfio drivers

2022-08-05 Thread Alex Williamson
On Thu, 4 Aug 2022 21:11:07 -0300
Jason Gunthorpe  wrote:

> On Thu, Aug 04, 2022 at 01:36:24PM -0600, Alex Williamson wrote:
> 
> > > > That is reasonable, but I'd say those three kernels only have two
> > > > drivers and they both have vfio as a substring in their name - so the
> > > > simple thing of just substring searching 'vfio' would get us over that
> > > > gap.
> > > 
> > > Looking at the aliases for exactly "vfio_pci" isn't that much more 
> > > complicated, and "feels" a lot more reliable than just doing a substring 
> > > search for "vfio" in the driver's name. (It would be, uh,  "not 
> > > smart" to name a driver "vfio" if it wasn't actually a vfio 
> > > variant driver (or the opposite), but I could imagine it happening; :-/)  
> 
> This is still pretty hacky. I'm worried about what happens to the
> kernel if this becames some crazy unintended uAPI that we never really
> thought about carefully... This was not a use case when we designed
> the modules.alias stuff at least.
> 
> BTW - why not do things the normal way?
> 
> 1. readlink /sys/bus/pci/devices/XX/iommu_group
> 2. Compute basename of #1
> 3. Check if /dev/vfio/#2 exists (or /sys/class/vfio/#2)
> 
> It has a small edge case where a multi-device group might give a false
> positive for an undrivered device, but for the purposes of libvirt
> that seems pretty obscure.. (while the above has false negative
> issues, obviously)

This is not a small edge case, it's extremely common.  We have a *lot*
of users assigning desktop GPUs and other consumer grade hardware, which
are usually multi-function devices without isolation exposed via ACS or
quirks.

The vfio group exists if any devices in the group are bound to a vfio
driver, but the device is not accessible from the group unless the
viability test passes.  That means QEMU may not be able to get access
to the device because the device we want isn't actually bound to a vfio
driver or another device in the group is not in a viable state.  Thanks,

Alex



Re: [PATCH] util: basic support for vendor-specific vfio drivers

2022-08-04 Thread Alex Williamson
On Thu, 4 Aug 2022 15:11:07 -0400
Laine Stump  wrote:

> On 8/4/22 2:36 PM, Jason Gunthorpe wrote:
> > On Thu, Aug 04, 2022 at 12:18:26PM -0600, Alex Williamson wrote:  
> >> On Thu, 4 Aug 2022 13:51:20 -0300
> >> Jason Gunthorpe  wrote:
> >>  
> >>> On Mon, Aug 01, 2022 at 09:49:28AM -0600, Alex Williamson wrote:
> >>>  
> >>>>>>>> Fortunately these new vendor/device-specific drivers can be easily
> >>>>>>>> identified as being "vfio-pci + extra stuff" - all that's needed is 
> >>>>>>>> to
> >>>>>>>> look at the output of the "modinfo $driver_name" command to see if
> >>>>>>>> "vfio_pci" is in the alias list for the driver.  
> >>>
> >>> We are moving in a direction on the kernel side to expose a sysfs
> >>> under the PCI device that definitively says it is VFIO enabled, eg
> >>> something like
> >>>
> >>>   /sys/devices/pci0000:00/0000:00:1f.6/vfio/
> >>>
> >>> Which is how every other subsystem in the kernel works. When this
> >>> lands libvirt can simply stat the vfio directory and confirm that the
> >>> device handle it is looking at is vfio enabled, for all things that
> >>> vfio support.
> >>>
> >>> My thinking had been to do the above work a bit later, but if libvirt
> >>> needs it right now then lets do it right away so we don't have to
> >>> worry about this hacky modprobe stuff down the road?  
> >>
> >> That seems like a pretty long gap, there are vfio-pci variant drivers
> >> since v5.18 and this hasn't even been proposed for v6.0 (aka v5.20)
> >> midway through the merge window.  We therefore have at least 3 kernels
> >> exposing devices in a way that libvirt can't make use of simply due to
> >> a driver matching test.  
> > 
> > That is reasonable, but I'd say those three kernels only have two
> > drivers and they both have vfio as a substring in their name - so the
> > simple thing of just substring searching 'vfio' would get us over that
> > gap.  
> 
> Looking at the aliases for exactly "vfio_pci" isn't that much more 
> complicated, and "feels" a lot more reliable than just doing a substring 
> search for "vfio" in the driver's name. (It would be, uh,  "not 
> smart" to name a driver "vfio" if it wasn't actually a vfio 
> variant driver (or the opposite), but I could imagine it happening; :-/)
> 
> >   
> >> might be leveraged for managed='yes' with variant drivers.  Once vfio
> >> devices expose a chardev themselves, libvirt might order the tests as:  
> > 
> > I wasn't thinking to include the chardev part if we are to expedite
> > this. The struct device bit alone is enough and it doesn't have the
> > complex bits needed to make the cdev.
> > 
> > If you say you want to do it we'll do it for v6.1..  
> 
> Since we already need to do something else as a stop-gap for the interim 
> (in order to avoid making driver developers wait any longer if for no 
> other reason), my opinion would be to not spend extra time splitting up 
> patches just to give us this functionality slightly sooner; we'll anyway 
> have something at least workable in place.

We also need to be careful in adding things piecemeal that libvirt can
determine when new functionality, such as vfio device chardevs, are
actually available and not simply a placeholder to fill a gap
elsewhere.  Thanks,

Alex



Re: [PATCH] util: basic support for vendor-specific vfio drivers

2022-08-04 Thread Alex Williamson
On Thu, 4 Aug 2022 13:51:20 -0300
Jason Gunthorpe  wrote:

> On Mon, Aug 01, 2022 at 09:49:28AM -0600, Alex Williamson wrote:
> 
> > > > > > Fortunately these new vendor/device-specific drivers can be easily
> > > > > > identified as being "vfio-pci + extra stuff" - all that's needed is 
> > > > > > to
> > > > > > look at the output of the "modinfo $driver_name" command to see if
> > > > > > "vfio_pci" is in the alias list for the driver.  
> 
> We are moving in a direction on the kernel side to expose a sysfs
> under the PCI device that definitively says it is VFIO enabled, eg
> something like
> 
>  /sys/devices/pci0000:00/0000:00:1f.6/vfio/
> 
> Which is how every other subsystem in the kernel works. When this
> lands libvirt can simply stat the vfio directory and confirm that the
> device handle it is looking at is vfio enabled, for all things that
> vfio support.
> 
> My thinking had been to do the above work a bit later, but if libvirt
> needs it right now then lets do it right away so we don't have to
> worry about this hacky modprobe stuff down the road?

That seems like a pretty long gap, there are vfio-pci variant drivers
since v5.18 and this hasn't even been proposed for v6.0 (aka v5.20)
midway through the merge window.  We therefore have at least 3 kernels
exposing devices in a way that libvirt can't make use of simply due to
a driver matching test.

Libvirt needs backwards compatibility, so we'll need it to look for the
vfio-pci driver through some long deprecation period.  In the interim,
it can look at module aliases, support for which will be necessary and
might be leveraged for managed='yes' with variant drivers.  Once vfio
devices expose a chardev themselves, libvirt might order the tests as:

 a) vfio device chardev present
 b) driver is a vfio-pci modalias
 c) driver is vfio-pci

The current state of the world though is that variant driver exist and
libvirt can't make use of them.  Thanks,

Alex
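
A sketch of the ordered tests above in shell, assuming the driver name
matches its module name; test a) is hypothetical here, since the sysfs
layout for vfio device chardevs had not landed yet and the directory
name below is only a guess:

  dev=0000:01:00.0                                   # placeholder device
  drv=$(basename "$(readlink "/sys/bus/pci/devices/$dev/driver")")
  if [ -d "/sys/bus/pci/devices/$dev/vfio-dev" ]; then     # assumed name
      echo "a) vfio device chardev present"
  elif modinfo -F alias "$drv" 2>/dev/null | grep -q '^vfio_pci:'; then
      echo "b) driver is a vfio-pci modalias"
  elif [ "$drv" = "vfio-pci" ]; then
      echo "c) driver is vfio-pci"
  fi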



Re: [PATCH] util: basic support for vendor-specific vfio drivers

2022-08-01 Thread Alex Williamson
On Mon, 1 Aug 2022 16:02:05 +0200
Erik Skultety  wrote:

> Putting Alex on CC since I don't see him there:
> +alex.william...@redhat.com

Hmm, Laine cc'd me on the initial post but it seems it got dropped
somewhere.
 
> On Mon, Aug 01, 2022 at 09:30:38AM -0400, Laine Stump wrote:
> > On 8/1/22 7:58 AM, Erik Skultety wrote:  
> > > On Mon, Aug 01, 2022 at 12:02:22AM -0400, Laine Stump wrote:  
> > > > Before a PCI device can be assigned to a guest with VFIO, that device
> > > > must be bound to the vfio-pci driver rather than to the device's
> > > > normal driver. The vfio-pci driver provides APIs that permit QEMU to
> > > > perform all the necessary operations to make the device accessible to
> > > > the guest.
> > > > 
> > > > There has been kernel work recently to support vendor/device-specific
> > > > VFIO drivers that provide the basic vfio-pci driver functionality
> > > > while adding support for device-specific operations (for example these
> > > > device-specific drivers are planned to support live migration of
> > > > certain devices). All that will be needed to make this functionality
> > > > available will be to bind the new vendor-specific driver to the device
> > > > (rather than the generic vfio-pci driver, which will continue to work
> > > > just without the extra functionality).
> > > > 
> > > > But until now libvirt has required that all PCI devices being assigned
> > > > to a guest with VFIO specifically have the "vfio-pci" driver bound to
> > > > the device. So even if the user manually binds a shiny new
> > > > vendor-specific driver to the device (and puts "managed='no'" in the
> > > > config to prevent libvirt from changing that), libvirt will just fail
> > > > during startup of the guest (or during hotplug) because the driver
> > > > bound to the device isn't named exactly "vfio-pci".
> > > > 
> > > > Fortunately these new vendor/device-specific drivers can be easily
> > > > identified as being "vfio-pci + extra stuff" - all that's needed is to
> > > > look at the output of the "modinfo $driver_name" command to see if
> > > > "vfio_pci" is in the alias list for the driver.
> > > > 
> > > > That's what this patch does. When libvirt checks the driver bound to a
> > > > device (either to decide if it needs to bind to a different driver or
> > > > perform some other operation, or if the current driver is acceptable
> > > > as-is), if the driver isn't specifically "vfio-pci", then it will look
> > > > at the output of modinfo for the driver that *is* bound to the device;
> > > > if modinfo shows vfio_pci as an alias for that device, then we'll
> > > > behave as if the driver was exactly "vfio-pci".  
> > > 
> > > Since you say these are vendor/device-specific drivers: does each of them
> > > implement the base vfio-pci functionality, or do they simply call into the
> > > base driver? The reason why I'm asking is that if each of the vendor-specific
> > > drivers depend on the vfio-pci module to be loaded as well, then reading
> > > /proc/modules should suffice as vfio-pci should be listed right next to 
> > > the
> > > vendor-specific one. What am I missing?  
> > I don't know the definitive answer to that, as I have no example of a
> > working vendor-specific driver to look at and only know about the kernel
> > work going on second-hand from Alex. It looks like even the vfio_pci driver
> > itself depends on other presumably lower level vfio-* modules (it directly
> > uses vfio_pci_core, which in turn uses vfio and vfio_virqfd), so possibly
> > these new drivers would be depending on one or more of those lower level
> > modules rather than vfio_pci. Also I would imagine it would be possible for
> > other drivers to also depend on the vfio-pci driver while not themselves
> > being a vfio driver.

A module dependency on vfio-pci (actually vfio-pci-core) is a pretty
loose requirement, *any* symbol dependency generates such a linkage,
without necessarily exposing a vfio-pci uAPI.  The alias support
introduced to the kernel is intended to allow userspace to determine
the most appropriate vfio-pci driver for a device, whether that's
vfio-pci itself or a variant driver that augments device specific
features.  See the upstream commit here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cc6711b0bf36de068b10490198d05ac168377989

All we're doing here is extending libvirt to say that if the driver is
vfio-pci or the modalias for the driver is prefixed with vfio-pci, then
the driver exposes a vfio-pci compatible uAPI.  I expect in the future
libvirt, or some other utility, may take on the role as described in
the above commit log to not only detect that a driver supports a
vfio-pci uAPI, but also to identify the most appropriate driver for the
device which exposes a vfio-uAPI.
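
For illustration, the lookup described in that commit log can be
approximated from userspace by rewriting the device's own modalias with
a vfio_ prefix and resolving it (the BDF is a placeholder):

  dev=0000:01:00.0                                   # placeholder device
  alias=$(sed 's/^pci:/vfio_pci:/' "/sys/bus/pci/devices/$dev/modalias")
  modprobe --resolve-alias "$alias"    # lists matching vfio-pci variant modules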

> > > The 'alias' field is optional so do we have any support guarantees from 
> > > the
> > > vendors that it will always be filled in correctly? I mean you surely
> > > 

Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-28 Thread Alex Williamson
On Thu, 28 Apr 2022 03:21:45 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson 
> > Sent: Wednesday, April 27, 2022 12:22 AM  
> > > >
> > > > My expectation would be that libvirt uses:
> > > >
> > > >  -object iommufd,id=iommufd0,fd=NNN
> > > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > > >
> > > > Whereas simple QEMU command line would be:
> > > >
> > > >  -object iommufd,id=iommufd0
> > > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > > >
> > > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > > implicit iommufd object is somewhat problematic because one of the
> > > > things I forgot to highlight in my previous description is that the
> > > > iommufd object is meant to be shared across not only various vfio
> > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > > > vdpa.  
> > >
> > > Out of curiosity - in concept one iommufd is sufficient to support all
> > > ioas requirements across subsystems while having multiple iommufd's
> > > instead lose the benefit of centralized accounting. The latter will also
> > > cause some trouble when we start virtualizing ENQCMD which requires
> > > VM-wide PASID virtualization thus further needs to share that
> > > information across iommufd's. Not unsolvable but really no gain by
> > > adding such complexity. So I'm curious whether Qemu provide
> > > a way to restrict that certain object type can only have one instance
> > > to discourage such multi-iommufd attempt?  
> > 
> > I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
> > philosophy seems to be to let users create whatever configuration they
> > want.  For libvirt though, the assumption would be that a single
> > iommufd object can be used across subsystems, so libvirt would never
> > automatically create multiple objects.  
> 
> I like the flexibility what the objection approach gives in your proposal.
> But with the said complexity in mind (with no foreseen benefit), I wonder

What's the actual complexity?  Front-end/backend splits are very common
in QEMU.  We're making the object connection via name, why is it
significantly more complicated to allow multiple iommufd objects?  On
the contrary, it seems to me that we'd need to go out of our way to add
code to block multiple iommufd objects.

> whether an alternative approach which treats iommufd as a global
> property instead of an object is acceptable in Qemu, i.e.:
> 
> -iommufd on/off
> -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
> 
> All devices with iommufd specified then implicitly share a single iommufd
> object within Qemu.

QEMU requires key-value pairs AFAIK, so the above doesn't work, then
we're just back to the iommufd=on/off.
 
> This still allows vfio devices to be specified via fd but just requires 
> Libvirt
> to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be
> considered or just not a typical way in Qemu philosophy e.g. any object
> associated with a device must be explicitly specified?

Avoiding QEMU opening files was a significant focus of my alternate
proposal.  Also note that we must be able to support hotplug, so we
need to be able to dynamically add and remove the iommufd object, I
don't see that a global property allows for that.  Implicit
associations of devices to shared resources doesn't seem particularly
desirable to me.  Thanks,

Alex



Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-26 Thread Alex Williamson
On Tue, 26 Apr 2022 13:42:17 -0300
Jason Gunthorpe  wrote:

> On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> > We also need to be able to advise libvirt as to how each iommufd object
> > or user of that object factors into the VM locked memory requirement.
> > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> > to set the locked memory limit to the size of VM RAM per iommufd,
> > regardless of the number of devices using a given iommufd.  However, I
> > don't know if all users of iommufd will be exclusively mapping VM RAM.
> > Combinations of devices where some map VM RAM and others map QEMU
> > buffer space could still require some incremental increase per device
> > (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> > will still be involved even after iommufd solves the per-device
> > vfio-pci locked memory limit issue.  Thanks,  
> 
> If the model is to pass the FD, how about we put a limit on the FD
> itself instead of abusing the locked memory limit?
> 
> We could have a no-way-out ioctl that directly limits the # of PFNs
> covered by iopt_pages inside an iommufd.

FD passing would likely only be the standard for libvirt invoked VMs.
The QEMU vfio-pci device would still parse a host= or sysfsdev= option
when invoked by mortals and associate to use the legacy vfio group
interface or the new vfio device interface based on whether an iommufd
is specified.

Does that rule out your suggestion?  I don't know, please reveal more
about the mechanics of putting a limit on the FD itself and this
no-way-out ioctl.  The latter name suggests to me that I should also
note that we need to support memory hotplug with these devices.  Thanks,

Alex
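
For reference, the kind of limit manipulation under discussion can be
expressed with util-linux prlimit (all values are placeholders; today
libvirt sizes the memlock limit to VM RAM plus some headroom per
assigned device):

  qemu_pid=12345                                  # placeholder QEMU pid
  bytes=$((16 * 1024 * 1024 * 1024))              # e.g. a 16G guest
  prlimit --pid "$qemu_pid" --memlock="$bytes:$bytes"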



Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-26 Thread Alex Williamson
On Tue, 26 Apr 2022 08:37:41 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson 
> > Sent: Monday, April 25, 2022 10:38 PM
> > 
> > On Mon, 25 Apr 2022 11:10:14 +0100
> > Daniel P. Berrangé  wrote:
> >   
> > > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:  
> > > > [Cc +libvirt folks]
> > > >
> > > > On Thu, 14 Apr 2022 03:46:52 -0700
> > > > Yi Liu  wrote:
> > > >  
> > > > > With the introduction of iommufd[1], the linux kernel provides a  
> > generic  
> > > > > interface for userspace drivers to propagate their DMA mappings to  
> > kernel  
> > > > > for assigned devices. This series does the porting of the VFIO devices
> > > > > onto the /dev/iommu uapi and let it coexist with the legacy  
> > implementation.  
> > > > > Other devices like vdpa, vfio mdev, etc. are not considered yet.  
> > >
> > > snip
> > >  
> > > > > The selection of the backend is made on a device basis using the new
> > > > > iommufd option (on/off/auto). By default the iommufd backend is  
> > selected  
> > > > > if supported by the host and by QEMU (iommufd KConfig). This option  
> > is  
> > > > > currently available only for the vfio-pci device. For other types of
> > > > > devices, it does not yet exist and the legacy BE is chosen by 
> > > > > default.  
> > > >
> > > > I've discussed this a bit with Eric, but let me propose a different
> > > > command line interface.  Libvirt generally likes to pass file
> > > > descriptors to QEMU rather than grant it access to those files
> > > > directly.  This was problematic with vfio-pci because libvirt can't
> > > > easily know when QEMU will want to grab another /dev/vfio/vfio
> > > > container.  Therefore we abandoned this approach and instead libvirt
> > > > grants file permissions.
> > > >
> > > > However, with iommufd there's no reason that QEMU ever needs more  
> > than  
> > > > a single instance of /dev/iommufd and we're using per device vfio file
> > > > descriptors, so it seems like a good time to revisit this.  
> > >
> > > I assume access to '/dev/iommufd' gives the process somewhat elevated
> > > privileges, such that you don't want to unconditionally give QEMU
> > > access to this device ?  
> > 
> > It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged
> > interface which should have limited scope for abuse, but more so here
> > the goal would be to de-privilege QEMU that one step further that it
> > cannot open the device file itself.
> >   
> > > > The interface I was considering would be to add an iommufd object to
> > > > QEMU, so we might have a:
> > > >
> > > > -device iommufd[,fd=#][,id=foo]
> > > >
> > > > For non-libvirt usage this would have the ability to open /dev/iommufd
> > > > itself if an fd is not provided.  This object could be shared with
> > > > other iommufd users in the VM and maybe we'd allow multiple instances
> > > > for more esoteric use cases.  [NB, maybe this should be a -object 
> > > > rather  
> > than  
> > > > -device since the iommufd is not a guest visible device?]  
> > >
> > > Yes,  -object would be the right answer for something that's purely
> > > a host side backend impl selector.
> > >  
> > > > The vfio-pci device might then become:
> > > >
> > > > -device vfio-  
> > pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f
> > oo]  
> > > >
> > > > So essentially we can specify the device via host, sysfsdev, or passing
> > > > an fd to the vfio device file.  When an iommufd object is specified,
> > > > "foo" in the example above, each of those options would use the
> > > > vfio-device access mechanism, essentially the same as iommufd=on in
> > > > your example.  With the fd passing option, an iommufd object would be
> > > > required and necessarily use device level access.
> > > >
> > > > In your example, the iommufd=auto seems especially troublesome for
> > > > libvirt because QEMU is going to have different locked memory
> > > > requirements based on whether we're using type1 or iommufd, where  
> > the  
> > > > latter resolves the

Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-25 Thread Alex Williamson
On Mon, 25 Apr 2022 22:23:05 +0200
Eric Auger  wrote:

> Hi Alex,
> 
> On 4/23/22 12:09 AM, Alex Williamson wrote:
> > [Cc +libvirt folks]
> >
> > On Thu, 14 Apr 2022 03:46:52 -0700
> > Yi Liu  wrote:
> >  
> >> With the introduction of iommufd[1], the linux kernel provides a generic
> >> interface for userspace drivers to propagate their DMA mappings to kernel
> >> for assigned devices. This series does the porting of the VFIO devices
> >> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> >> Other devices like vdpa, vfio mdev, etc. are not considered yet.
> >>
> >> For vfio devices, the new interface is tied with device fd and iommufd
> >> as the iommufd solution is device-centric. This is different from legacy
> >> vfio which is group-centric. To support both interfaces in QEMU, this
> >> series introduces the iommu backend concept in the form of different
> >> container classes. The existing vfio container is named legacy container
> >> (equivalent with legacy iommu backend in this series), while the new
> >> iommufd based container is named as iommufd container (may also be 
> >> mentioned
> >> as iommufd backend in this series). The two backend types have their own
> >> way to setup secure context and dma management interface. Below diagram
> >> shows how it looks like with both BEs.
> >>
> >>                  VFIO                        AddressSpace/Memory
> >> +-------+  +----------+  +-----+  +-----+
> >> |  pci  |  | platform |  |  ap |  | ccw |
> >> +---+---+  +----+-----+  +--+--+  +--+--+    +------------------+
> >>     |           |           |        |       |   AddressSpace   |
> >>     |           |           |        |       +---------+--------+
> >> +---V-----------V-----------V--------V----+           /
> >> |            VFIOAddressSpace             | <---------+
> >> |                    |                    |  MemoryListener
> >> |           VFIOContainer list            |
> >> +-------------+------------------------+--+
> >>               |                        |
> >>               |                        |
> >>        +------V------+         +-------V-------+
> >>        |   iommufd   |         |  vfio legacy  |
> >>        |  container  |         |   container   |
> >>        +------+------+         +-------+-------+
> >>               |                        |
> >>               | /dev/iommu             | /dev/vfio/vfio
> >>               | /dev/vfio/devices/vfioX| /dev/vfio/$group_id
> >>  Userspace    |                        |
> >>  =============+========================+============================
> >>  Kernel       |  device fd             |
> >>               +--------+               | group/container fd
> >>               |        | (BIND_IOMMUFD | (SET_CONTAINER/SET_IOMMU)
> >>               |        |  ATTACH_IOAS) | device fd
> >>               |        |               |
> >>               |   +----V---------------V-------+
> >>  iommufd      |   |             vfio           |
> >>  (map/unmap   |   +----+-----------------+-----+
> >>   ioas_copy)  |        |                 | map/unmap
> >>               |        |                 |
> >>  +------------V+  +----V-------+ +------V--------+
> >>  | iommfd core |  |   device   | |  vfio iommu   |
> >>  +-------------+  +------------+ +---------------+
> >>
> >> [Secure Context setup]
> >> - iommufd BE: uses device fd and iommufd to setup secure context
> >>   (bind_iommufd, attach_ioas)
> >> - vfio legacy BE: uses group fd and container fd to setup secure context
> >>   (set_container, set_iommu)
> >> [Device access]
> >> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> >> - vfio legacy BE: device fd is retrieved from group fd ioctl
> >> [DMA Mapping flow]
> >> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> >> - VFIO populates DMA map/unmap via the container BEs
> >>   *) iommufd BE: uses iommufd
> >>   *) vfio legacy BE: uses container fd
> >>
> >> This series qomifies the VFIOContainer object which acts as a base class
> >> for a container. This base 

Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-25 Thread Alex Williamson
On Mon, 25 Apr 2022 11:10:14 +0100
Daniel P. Berrangé  wrote:

> On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
> > [Cc +libvirt folks]
> > 
> > On Thu, 14 Apr 2022 03:46:52 -0700
> > Yi Liu  wrote:
> >   
> > > With the introduction of iommufd[1], the linux kernel provides a generic
> > > interface for userspace drivers to propagate their DMA mappings to kernel
> > > for assigned devices. This series does the porting of the VFIO devices
> > > onto the /dev/iommu uapi and let it coexist with the legacy 
> > > implementation.
> > > Other devices like vdpa, vfio mdev, etc. are not considered yet.  
> 
> snip
> 
> > > The selection of the backend is made on a device basis using the new
> > > iommufd option (on/off/auto). By default the iommufd backend is selected
> > > if supported by the host and by QEMU (iommufd KConfig). This option is
> > > currently available only for the vfio-pci device. For other types of
> > > devices, it does not yet exist and the legacy BE is chosen by default.  
> > 
> > I've discussed this a bit with Eric, but let me propose a different
> > command line interface.  Libvirt generally likes to pass file
> > descriptors to QEMU rather than grant it access to those files
> > directly.  This was problematic with vfio-pci because libvirt can't
> > easily know when QEMU will want to grab another /dev/vfio/vfio
> > container.  Therefore we abandoned this approach and instead libvirt
> > grants file permissions.
> > 
> > However, with iommufd there's no reason that QEMU ever needs more than
> > a single instance of /dev/iommufd and we're using per device vfio file
> > descriptors, so it seems like a good time to revisit this.  
> 
> I assume access to '/dev/iommufd' gives the process somewhat elevated
> privileges, such that you don't want to unconditionally give QEMU
> access to this device ?

It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged
interface which should have limited scope for abuse, but more so here
the goal would be to de-privilege QEMU that one step further that it
cannot open the device file itself.

> > The interface I was considering would be to add an iommufd object to
> > QEMU, so we might have a:
> > 
> > -device iommufd[,fd=#][,id=foo]
> > 
> > For non-libvirt usage this would have the ability to open /dev/iommufd
> > itself if an fd is not provided.  This object could be shared with
> > other iommufd users in the VM and maybe we'd allow multiple instances
> > for more esoteric use cases.  [NB, maybe this should be a -object rather 
> > than
> > -device since the iommufd is not a guest visible device?]  
> 
> Yes,  -object would be the right answer for something that's purely
> a host side backend impl selector.
> 
> > The vfio-pci device might then become:
> > 
> > -device 
> > vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> > 
> > So essentially we can specify the device via host, sysfsdev, or passing
> > an fd to the vfio device file.  When an iommufd object is specified,
> > "foo" in the example above, each of those options would use the
> > vfio-device access mechanism, essentially the same as iommufd=on in
> > your example.  With the fd passing option, an iommufd object would be
> > required and necessarily use device level access.
> > 
> > In your example, the iommufd=auto seems especially troublesome for
> > libvirt because QEMU is going to have different locked memory
> > requirements based on whether we're using type1 or iommufd, where the
> > latter resolves the duplicate accounting issues.  libvirt needs to know
> > deterministically which backed is being used, which this proposal seems
> > to provide, while at the same time bringing us more in line with fd
> > passing.  Thoughts?  Thanks,  
> 
> Yep, I agree that libvirt needs to have more direct control over this.
> This is also even more important if there are notable feature differences
> in the 2 backends.
> 
> I wonder if anyone has considered an even more distinct impl, whereby
> we have a completely different device type on the backend, eg
> 
>   -device 
> vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> 
> If a vendor wants to fully remove the legacy impl, they can then use the
> Kconfig mechanism to disable the build of the legacy impl device, while
> keeping the iommu impl (or vica-verca if the new iommu impl isn't considered
> reliable enough for them to support yet).
> 
> Libvirt would use
> 
>-obj

Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-22 Thread Alex Williamson
[Cc +libvirt folks]

On Thu, 14 Apr 2022 03:46:52 -0700
Yi Liu  wrote:

> With the introduction of iommufd[1], the linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> Other devices like vdpa, vfio mdev, etc. are not considered yet.
> 
> For vfio devices, the new interface is tied with device fd and iommufd
> as the iommufd solution is device-centric. This is different from legacy
> vfio which is group-centric. To support both interfaces in QEMU, this
> series introduces the iommu backend concept in the form of different
> container classes. The existing vfio container is named legacy container
> (equivalent with legacy iommu backend in this series), while the new
> iommufd based container is named as iommufd container (may also be mentioned
> as iommufd backend in this series). The two backend types have their own
> way to setup secure context and dma management interface. Below diagram
> shows how it looks with both BEs.
> 
>                  VFIO                        AddressSpace/Memory
> +-------+  +----------+  +-----+  +-----+
> |  pci  |  | platform |  |  ap |  | ccw |
> +---+---+  +----+-----+  +--+--+  +--+--+    +------------------+
>     |           |           |        |       |   AddressSpace   |
>     |           |           |        |       +---------+--------+
> +---V-----------V-----------V--------V----+           /
> |            VFIOAddressSpace             | <---------+
> |                    |                    |  MemoryListener
> |           VFIOContainer list            |
> +-------------+------------------------+--+
>               |                        |
>               |                        |
>        +------V------+         +-------V-------+
>        |   iommufd   |         |  vfio legacy  |
>        |  container  |         |   container   |
>        +------+------+         +-------+-------+
>               |                        |
>               | /dev/iommu             | /dev/vfio/vfio
>               | /dev/vfio/devices/vfioX| /dev/vfio/$group_id
>  Userspace    |                        |
>  =============+========================+============================
>  Kernel       |  device fd             |
>               +--------+               | group/container fd
>               |        | (BIND_IOMMUFD | (SET_CONTAINER/SET_IOMMU)
>               |        |  ATTACH_IOAS) | device fd
>               |        |               |
>               |   +----V---------------V-------+
>  iommufd      |   |             vfio           |
>  (map/unmap   |   +----+-----------------+-----+
>   ioas_copy)  |        |                 | map/unmap
>               |        |                 |
>  +------------V+  +----V-------+ +------V--------+
>  | iommfd core |  |   device   | |  vfio iommu   |
>  +-------------+  +------------+ +---------------+
> 
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>   (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>   (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> - VFIO populates DMA map/unmap via the container BEs
>   *) iommufd BE: uses iommufd
>   *) vfio legacy BE: uses container fd
> 
> This series qomifies the VFIOContainer object which acts as a base class
> for a container. This base class is derived into the legacy VFIO container
> and the new iommufd based container. The base class implements generic code
> such as code related to memory_listener and address space management whereas
> the derived class implements callbacks that depend on the kernel user space
> being used.
> 
> The selection of the backend is made on a device basis using the new
> iommufd option (on/off/auto). By default the iommufd backend is selected
> if supported by the host and by QEMU (iommufd KConfig). This option is
> currently available only for the vfio-pci device. For other types of
> devices, it does not yet exist and the legacy BE is chosen by default.

I've discussed this a bit with Eric, but let me propose a different
command line interface.  Libvirt generally likes to pass file
descriptors to QEMU rather than grant it access to those files
directly.  This was problematic with vfio-pci because libvirt can't
easily know when QEMU will want to grab another /dev/vfio/vfio
container.  Therefore we abandoned 

Re: Add options to device xml to skip reattach of pci passthrough devices.

2021-06-18 Thread Alex Williamson
On Fri, 18 Jun 2021 10:43:07 -0400
Laine Stump  wrote:

> On 6/16/21 4:15 PM, Daniel Henrique Barboza wrote:
> > 
> > 
> > On 6/9/21 4:38 PM, Manish Mishra wrote:  
> >> Hi Everyone,
> >>
> >> We want to add extra options to device xml to skip reattach of pci 
> >> passthrough devices. Following is xml format for pci passthrough 
> >> devices added to domain as of now.
> >>
> >> <hostdev mode='subsystem' type='pci' managed='yes'>
> >>
> >>    <source>
> >>
> >>    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
> >>
> >>    </source>
> >>
> >> </hostdev>
> >>
> >> When we pass managed=’yes’ flag through xml, libvirt takes 
> >> responsibility of detaching device on domain(guest VM) start and 
> >> reattaching on domain shutdown. We observed some issues where guest VM 
> >> shutdown may take long time, blocked for reattach operation on pci 
> >> passthrough device. As domain lock is held during this time it also 
> >> makes libvirt mostly inactive as it blocks even basic operations like 
> >> (virsh list). Reattaching of device to host can block due to reasons 
> >> like buggy driver or initialization of device itself can take long 
> >> time in some cases.  
> > 
> > I am more interested in hearing about the problem with this faulty buggy
> > driver holding domain lock during device reattach and compromising 'virsh'
> > operations, and see if there's something to do to mitigate that, instead
> > of creating a XML workaround for a driver problem.
> >   
> >>
> >> We want to pass following extra options to resolve this:
> >>
> >>  1. *skipReAttach* (optional flag)
> >>
> >> In some cases we do not need to reattach device to host as it may be 
> >> reserved only for guests, with this flag we can skip reattach 
> >> operation on host.  We do not want to modify managed flag to avoid 
> >> regression, so thinking of adding new optional flag.
> >>
> >>  2. *reAttachDriverName* (optional flag)
> >>
> >> Name of driver to which we want to attach instead of default, to avoid 
> >> reattaching to buggy driver. Currently libvirt asks host to auto 
> >> selects driver for device.
> >>
> >> Yes we can use managed=’no’ but in that case user has to take 
> >> responsibility of detaching device before starting domain which we do 
> >> not want. Please let us know your views on this.  
> > 
> > The case you mentioned above, "we do not need to reattach device to host
> > as it may be reserved only for guests", is one of the most common uses
> > we have for managed='no' AFAIK. The user/sysadm must detach the device
> > from the host, but it's only one time. After that the device can remain
> > detached from the host, and guests can use it freely as long as you
> > don't reboot the host (or reattach the device back). This scenario
> > you described fit the managed='no' mechanics fine IMO.
> > 
> > If you want to automate the detach process, you can use a Libvirt QEMU
> > hook (/etc/libvirt/hooks/qemu) to make the device detach when starting
> > the domain, in case the device isn't already detached. Note that
> > this has the same effect of the "skipReAttach" option you proposed.
> > 
> > Making a design around faulty drivers isn't ideal. If the driver you're
> > using starts to have problems with the detach operation as well, 
> > 'skipReAttach'
> > will do you no good. You'll have to fall back to 'managed=no' to circumvent
> > that.
> > 
> > Even if we discard the motivation, I'm not sure about the utility of having
> > more forms of PCI assignment management (e.g 
> > managed=yes|no|detach|reattach).
> > managed=yes|no seems to cover most use cases where the device driver works
> > properly.
> > 
> > 
> > Laine, what do you think?  
> 
> 
> I have a vague memory of someone (may even have been me) proposing 
> something similar several years ago, and the idea was shot down. I don't 
> remember the exact context or naming, but the general idea was to have 
> something like managed='detach-only' in order to have the advantage of 
> all configuration being within libvirt, but eliminating the potential 
> bad behavior associated with repeated re-binding of devices to drivers. 
> I unfortunately also don't recall the reason the idea was nixed. Dan or 
> Alex - do either of you have any memory of this?

Sounds very vaguely familiar, but not enough to remember the arguments.

> As for myself, 1) I agree with Daniel's suggestion that it is important 
> to find the root cause of the long delay rather than just covering it up 
> with another obscure option that will need to be re-discovered by anyone 
> who encounters the problem, and 2) every new bit that we add in makes 
> the code more complex and so more prone to errors, and also makes 
> configuration more complex and also more prone to errors. So while new 
> options like this could be useful, they could also be a net loss (or 
> not, it's hard to know without actually doing it, but once it's done it 
> can't be "un-done" :-))

AFAIK, the general recommendation for "devices used only for
assignment" is to use driverctl to automatically attach these devices
to vfio-pci on boot/attach and 
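
A sketch of that driverctl approach (the BDF is a placeholder; the
override is stored by driverctl and re-applied via udev on boot):

  driverctl set-override 0000:01:00.0 vfio-pci    # bind now and on every boot
  driverctl unset-override 0000:01:00.0           # revert to the default driver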

Re: [libvirt PATCH v3 00/21] Add support for persistent mediated devices

2021-02-01 Thread Alex Williamson
On Mon, 1 Feb 2021 16:57:44 -0600
Jonathon Jongsma  wrote:

> On Mon, 1 Feb 2021 11:33:08 +0100
> Erik Skultety  wrote:
> 
> > On Mon, Feb 01, 2021 at 09:48:11AM +, Daniel P. Berrangé wrote:  
> > > On Mon, Feb 01, 2021 at 10:45:43AM +0100, Erik Skultety wrote:
> > > > On Mon, Feb 01, 2021 at 09:42:32AM +, Daniel P. Berrangé
> > > > wrote:
> > > > > On Fri, Jan 29, 2021 at 05:34:36PM -0600, Jonathon Jongsma
> > > > > wrote:
> > > > > > On Thu, 7 Jan 2021 17:43:54 +0100
> > > > > > Erik Skultety  wrote:
> > > > > > 
> > > > > > > > Tested with v6.10.0-283-g1948d4e61e.
> > > > > > > > 
> > > > > > > > 1.Can define/start/destroy mdev device successfully;
> > > > > > > > 
> > > > > > > > 2.'virsh nodedev-list' has no '--active' option, which is
> > > > > > > > inconsistent with the description in the patch:
> > > > > > > > # virsh nodedev-list --active
> > > > > > > > error: command 'nodedev-list' doesn't support option
> > > > > > > > --active
> > > > > > > > 
> > > > > > > > 3.virsh client hang when trying to destroy a mdev device
> > > > > > > > which is using by a vm, and after that all 'virsh nodev*'
> > > > > > > > cmds will hang. If restarting llibvirtd after that,
> > > > > > > > libvirtd will hang.  
> > > > > > > 
> > > > > > > It hangs because underneath a write to the 'remove' sysfs
> > > > > > > attribute is now blocking for some reason and since we're
> > > > > > > relying on mdevctl to do it for us, hence "it hangs". I'm
> > > > > > > not trying to make an excuse, it's plain wrong. I'd love to
> > > > > > > rely on such a basic functionality, but it looks like we'll
> > > > > > > have to go with a extremely ugly workaround and try to get
> > > > > > > the list of active domains from the nodedev driver and see
> > > > > > > whether any of them has the device assigned before we try
> > > > > > > to destroy the mdev via the nodedev driver.
> > > > > > 
> > > > > > So, I've been trying to figure out a way to do this, but as
> > > > > > far as I know, there's no way to get a list of active domains
> > > > > > from within the nodedev driver, and I can't think of any
> > > > > > better ways to handle it. Any ideas?
> > > > > 
> > > > > Correct, the nodedev driver isn't permitted to talk to any of
> > > > > the virt drivers.
> > > > 
> > > > Oh, not even via secondary connection? What makes nodedev so
> > > > special, since we can open a secondary connection from e.g. the
> > > > storage driver?
> > > 
> > > It is technically possible, but it should never be done, because it
> > > introduces a bi-directional dependancy between the daemons which
> > > introduces the danger of deadlocking them. None of the secondary
> > > drivers should connect to the hypervisor drivers.
> > > 
> > > > > Is there anything in sysfs which reports whether the device is
> > > > > in use ?
> > > > 
> > > > Nothing that I know of, the way it used to work was that you
> > > > tried to write to sysfs and kernel returned a write error with
> > > > "device in use" or something like that, but that has changed
> > > > since :(.
> > 
> > Without having tried this and since mdevctl is just a Bash script,
> > can we bypass mdevctl on destroys a little bit by constructing the
> > path to the sysfs attribute ourselves and perform a non-blocking
> > write of zero bytes to see if we get an error? If so, we'll skip
> > mdevctl and report an error. If we don't, we'll invoke mdevctl to
> > remove the device in order to remain consistent. Would that be an
> > acceptable workaround (provided it would work)?  
> 
> As far as I can tell, this doesn't work. According to my
> tests, attempting to write zero bytes to $(mdev_sysfs_path)/remove
> doesn't result in an error if the mdev is in use by a vm. It just
> "successfully" writes zero bytes. Adding Alex to cc in case he has an
> idea for a workaround here.

[Cc +Connie]

I'm not really sure why mdevs are unique here.  When we write to
remove, the first step is to release the device from the driver, so
it's really the same as an unbind for a vfio-pci device.  PCI drivers
cannot return an error, an unbind is handled not as a request, but a
directive, so when the device is in use we block until the unbind can
complete.  With vfio-pci (and added upstream to the mdev core -
depending on vendor support), the driver remove callback triggers a
virtual interrupt to the user asking to cooperatively return the device
(triggering a hot-unplug in QEMU).  Has this really worked so well in
vfio-pci that we've forgotten that an unbind can block there too or are
we better about tracking something with PCI devices vs mdevs?

One idea for a solution would be that vfio only allows a single open of a
group at a time, so if libvirt were to open the group it could know
that it's unused.  If you can manage to close the group once you've
already triggered the remove/unbind, then I'd think the completion of
the write would be deterministic.  If the group is in use 
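
A minimal sketch of that single-open probe (the group number is a
placeholder): the group chardev permits only one open at a time, so a
successful open implies no other user, modulo the races noted above:

  group=26                                        # placeholder group number
  if ( : <>"/dev/vfio/$group" ) 2>/dev/null; then
      echo "group $group is not in use"
  else
      echo "group $group is busy (or inaccessible)"
  fi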

Re: [PATCH] Support x-vga=on for PCI host passthrough devices

2020-10-07 Thread Alex Williamson
On Wed, 07 Oct 2020 19:59:36 +0100
steve  wrote:

> 
> On Wed, 07 Oct 2020 14:20:21 +0100
> Steven Newbury  wrote:
> 
> > On Wed, 2020-10-07 at 15:07 +0200, Peter Krempa wrote:
> > > On Wed, Oct 07, 2020 at 13:59:35 +0100, Steven Newbury wrote: 
> > > > When using a passthrough GPU with libvirt there is no option to
> > > > pass "x-vga=on" to the device specification.  This means legacy 
> > >
> > > Please note that we don't add support for experimental qemu features
> > > (prefixed with "x-") until they are deemed stable and the x- is
> > > removed, so this patch can't be accepted in this form.
> > >  
> > Okay, so should I bug qemu to promote the feature to stable?  It's been
> > like that forever, it's certainly not a new feature:
> >
> > https://github.com/qemu/qemu/commit/f15689c7e4422d5453ae45628df5b83a53e518ed
> >
> > So it's been that way for 8 years!
> 
> It's that way upstream because VGA routing is a nightmare, it's
> essentially broken on any system with Intel graphics because the device
> always has VGA regions routed to itself.  That problem is not getting
> better, but the demand for VGA is getting less, so there's very little
> incentive to work on the problem rather than just letting it die out
> once nobody cares about VGA.  Any 600 series or newer GeForce card
> should have a UEFI ROM available.  Also note that VGA support can be
> configured both as a module option of vfio-pci and a build option, so
> there's no guarantee that a display class device will have the VGA
> regions available for QEMU to use this option.  Thanks,
> 
> I'm still going to fix up my patch even if it's not going to get
> committed.  It's very useful for me, my old nvidia card doesn't work
> without it, and I also need it for seabios/dos compatibility with my
> ATIs. In fairness, I haven't tried it with my Intel, it's too old to
> support the new vGPU stuff and I need to see what I'm typing! ;-)
> 
> For what it's worth, I had no trouble with GPU passthrough once I
> realised it needed x-vga=on. VGA arbitration was automatically
> switched. It's been rock solid stability wise, although the 8800 has
> horrible VGA/DOS performance. Windows10 runs like on bare metal.

Are you aware that you can add arbitrary options to devices using the
 support in libvirt?  For example:


  ...
  
...

  

  
  
  

...
  
  


  


This is generally how users make use of unsupportable or
not-yet-supported QEMU features while still using libvirt to manage the
VM.  Thanks,

Alex



Re: [PATCH] Support x-vga=on for PCI host passthrough devices

2020-10-07 Thread Alex Williamson
On Wed, 07 Oct 2020 14:20:21 +0100
Steven Newbury  wrote:

> On Wed, 2020-10-07 at 15:07 +0200, Peter Krempa wrote:
> > On Wed, Oct 07, 2020 at 13:59:35 +0100, Steven Newbury wrote:  
> > > When using a passthrough GPU with libvirt there is no option to
> > > pass "x-vga=on" to the device specification.  This means legacy  
> > 
> > Please note that we don't add support for experimental qemu features
> > (prefixed with "x-") until they are deemed stable and the x- is
> > removed, so this patch can't be accepted in this form.
> >   
> Okay, so should I bug qemu to promote the feature to stable?  It's been
> like that forever, it's certainly not a new feature:
> 
> https://github.com/qemu/qemu/commit/f15689c7e4422d5453ae45628df5b83a53e518ed
> 
> So it's been that way for 8 years!

It's that way upstream because VGA routing is a nightmare, it's
essentially broken on any system with Intel graphics because the device
always has VGA regions routed to itself.  That problem is not getting
better, but the demand for VGA is getting less, so there's very little
incentive to work on the problem rather than just letting it die out
once nobody cares about VGA.  Any 600 series or newer GeForce card
should have a UEFI ROM available.  Also note that VGA support can be
configured both as a module option of vfio-pci and a build option, so
there's no guarantee that a display class device will have the VGA
regions available for QEMU to use this option.  Thanks,

Alex
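
For instance, whether the host's vfio-pci currently exposes VGA regions
can be checked via its module parameter (assuming the module was built
with VGA support, otherwise the parameter is absent):

  cat /sys/module/vfio_pci/parameters/disable_vga   # "N" => VGA regions exposed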



Re: device compatibility interface for live migration with assigned devices

2020-09-14 Thread Alex Williamson
On Mon, 14 Sep 2020 13:48:43 +
"Zeng, Xin"  wrote:

> On Saturday, September 12, 2020 12:52 AM
> Alex Williamson  wrote:
> > 
> > On Fri, 11 Sep 2020 08:56:00 +0800
> > Yan Zhao  wrote:
> >   
> > > On Thu, Sep 10, 2020 at 12:02:44PM -0600, Alex Williamson wrote:  
> > > > On Thu, 10 Sep 2020 13:50:11 +0100
> > > > Sean Mooney  wrote:
> > > >  
> > > > > On Thu, 2020-09-10 at 14:38 +0200, Cornelia Huck wrote:  
> > > > > > On Wed, 9 Sep 2020 10:13:09 +0800
> > > > > > Yan Zhao  wrote:
> > > > > >  
> > > > > > > > > still, I'd like to put it more explicitly to make sure it's not
> > > > > > > > > not  
> > missed:  
> > > > > > > > > the reason we want to specify compatible_type as a trait and  
> > check  
> > > > > > > > > whether target compatible_type is the superset of source
> > > > > > > > > compatible_type is for the consideration of backward  
> > compatibility.  
> > > > > > > > > e.g.
> > > > > > > > > an old generation device may have a mdev type xxx-v4-yyy,  
> > while a newer  
> > > > > > > > > generation  device may be of mdev type xxx-v5-yyy.
> > > > > > > > > with the compatible_type traits, the old generation device is 
> > > > > > > > > still
> > > > > > > > > able to be regarded as compatible to newer generation device  
> > even their  
> > > > > > > > > mdev types are not equal.  
> > > > > > > >
> > > > > > > > If you want to support migration from v4 to v5, can't the  
> > (presumably  
> > > > > > > > newer) driver that supports v5 simply register the v4 type as 
> > > > > > > > well,  
> > so  
> > > > > > > > that the mdev can be created as v4? (Just like QEMU versioned  
> > machine  
> > > > > > > > types work.)  
> > > > > > >
> > > > > > > yes, it should work in some conditions.
> > > > > > > but it may not be that good in some cases when v5 and v4 in the  
> > name string  
> > > > > > > of mdev type identify hardware generation (e.g. v4 for gen8, and 
> > > > > > > v5  
> > for  
> > > > > > > gen9)
> > > > > > >
> > > > > > > e.g.
> > > > > > > (1). when src mdev type is v4 and target mdev type is v5 as
> > > > > > > software does not support it initially, and v4 and v5 identify  
> > hardware  
> > > > > > > differences.  
> > > > > >
> > > > > > My first hunch here is: Don't introduce types that may be compatible
> > > > > > later. Either make them compatible, or make them distinct by design,
> > > > > > and possibly add a different, compatible type later.
> > > > > >  
> > > > > > > then after software upgrade, v5 is now compatible to v4, should 
> > > > > > > the
> > > > > > > software now downgrade mdev type from v5 to v4?
> > > > > > > not sure if moving hardware generation info into a separate  
> > attribute  
> > > > > > > from mdev type name is better. e.g. remove v4, v5 in mdev type,  
> > while use  
> > > > > > > compatible_pci_ids to identify compatibility.  
> > > > > >
> > > > > > If the generations are compatible, don't mention it in the mdev 
> > > > > > type.
> > > > > > If they aren'

Re: device compatibility interface for live migration with assigned devices

2020-09-11 Thread Alex Williamson
On Fri, 11 Sep 2020 08:56:00 +0800
Yan Zhao  wrote:

> On Thu, Sep 10, 2020 at 12:02:44PM -0600, Alex Williamson wrote:
> > On Thu, 10 Sep 2020 13:50:11 +0100
> > Sean Mooney  wrote:
> >   
> > > On Thu, 2020-09-10 at 14:38 +0200, Cornelia Huck wrote:  
> > > > On Wed, 9 Sep 2020 10:13:09 +0800
> > > > Yan Zhao  wrote:
> > > > 
> > > > > > > still, I'd like to put it more explicitly to make sure it's not 
> > > > > > > missed:
> > > > > > > the reason we want to specify compatible_type as a trait and check
> > > > > > > whether target compatible_type is the superset of source
> > > > > > > compatible_type is for the consideration of backward 
> > > > > > > compatibility.
> > > > > > > e.g.
> > > > > > > an old generation device may have a mdev type xxx-v4-yyy, while a 
> > > > > > > newer
> > > > > > > generation  device may be of mdev type xxx-v5-yyy.
> > > > > > > with the compatible_type traits, the old generation device is 
> > > > > > > still
> > > > > > > able to be regarded as compatible to newer generation device even 
> > > > > > > their
> > > > > > > mdev types are not equal.  
> > > > > > 
> > > > > > If you want to support migration from v4 to v5, can't the 
> > > > > > (presumably
> > > > > > newer) driver that supports v5 simply register the v4 type as well, 
> > > > > > so
> > > > > > that the mdev can be created as v4? (Just like QEMU versioned 
> > > > > > machine
> > > > > > types work.)  
> > > > > 
> > > > > yes, it should work in some conditions.
> > > > > but it may not be that good in some cases when v5 and v4 in the name 
> > > > > string
> > > > > of mdev type identify hardware generation (e.g. v4 for gen8, and v5 
> > > > > for
> > > > > gen9)
> > > > > 
> > > > > e.g.
> > > > > (1). when src mdev type is v4 and target mdev type is v5 as
> > > > > software does not support it initially, and v4 and v5 identify 
> > > > > hardware
> > > > > differences.
> > > > 
> > > > My first hunch here is: Don't introduce types that may be compatible
> > > > later. Either make them compatible, or make them distinct by design,
> > > > and possibly add a different, compatible type later.
> > > > 
> > > > > then after software upgrade, v5 is now compatible to v4, should the
> > > > > software now downgrade mdev type from v5 to v4?
> > > > > not sure if moving hardware generation info into a separate attribute
> > > > > from mdev type name is better. e.g. remove v4, v5 in mdev type, while 
> > > > > use
> > > > > compatible_pci_ids to identify compatibility.
> > > > 
> > > > If the generations are compatible, don't mention it in the mdev type.
> > > > If they aren't, use distinct types, so that management software doesn't
> > > > have to guess. At least that would be my naive approach here.
> > > yep that is what i would prefer to see too.  
> > > > 
> > > > > 
> > > > > (2) name string of mdev type is composed by "driver_name + type_name".
> > > > > in some devices, e.g. qat, different generations of devices are 
> > > > > binding to
> > > > > drivers of different names, e.g. "qat-v4", "qat-v5".
> > > > > then though type_name is equal, mdev type is not equal. e.g.
> > > > > "qat-v4-type1", "qat-v5-type1".
> > > > 
> > > > I guess that shows a shortcoming of that "driver_name + type_name"
> > > > approach? Or maybe I'm just confused.
> > > yes i really don't like having the version in the mdev-type name;
> > > i would strongly prefer just qat-type-1 where qat is just there as a way
> > > of namespacing.
> > > although symmetric-crypto, asymmetric-crypto and compression would be a
> > > better name than type-1, type-2, type-3 if
> > > that is what they would end up mapping to. e.g. qat-compression or
> > > qat-aes is a much better name than type-1
> > > higher 

Re: device compatibility interface for live migration with assigned devices

2020-09-10 Thread Alex Williamson
On Thu, 10 Sep 2020 13:50:11 +0100
Sean Mooney  wrote:

> On Thu, 2020-09-10 at 14:38 +0200, Cornelia Huck wrote:
> > On Wed, 9 Sep 2020 10:13:09 +0800
> > Yan Zhao  wrote:
> >   
> > > > > still, I'd like to put it more explicitly to make sure it's not 
> > > > > missed:
> > > > > the reason we want to specify compatible_type as a trait and check
> > > > > whether target compatible_type is the superset of source
> > > > > compatible_type is for the consideration of backward compatibility.
> > > > > e.g.
> > > > > an old generation device may have a mdev type xxx-v4-yyy, while a 
> > > > > newer
> > > > > generation  device may be of mdev type xxx-v5-yyy.
> > > > > with the compatible_type traits, the old generation device is still
> > > > > able to be regarded as compatible to newer generation device even 
> > > > > their
> > > > > mdev types are not equal.
> > > > 
> > > > If you want to support migration from v4 to v5, can't the (presumably
> > > > newer) driver that supports v5 simply register the v4 type as well, so
> > > > that the mdev can be created as v4? (Just like QEMU versioned machine
> > > > types work.)
> > > 
> > > yes, it should work under some conditions.
> > > but it may not work well in cases where v5 and v4 in the mdev type's
> > > name string identify the hardware generation (e.g. v4 for gen8, and v5
> > > for gen9)
> > > 
> > > e.g.
> > > (1) when the src mdev type is v4 and the target mdev type is v5, the
> > > software initially does not support migrating between them, as v4 and v5
> > > identify hardware differences.
> > 
> > My first hunch here is: Don't introduce types that may be compatible
> > later. Either make them compatible, or make them distinct by design,
> > and possibly add a different, compatible type later.
> >   
> > > then after a software upgrade, v5 is now compatible with v4; should the
> > > software now downgrade the mdev type from v5 to v4?
> > > not sure if moving the hardware generation info into a separate attribute
> > > from the mdev type name is better. e.g. remove v4, v5 from the mdev type,
> > > and use compatible_pci_ids to identify compatibility.
> > 
> > If the generations are compatible, don't mention it in the mdev type.
> > If they aren't, use distinct types, so that management software doesn't
> > have to guess. At least that would be my naive approach here.  
> yep that is what i would prefer to see too.
> >   
> > > 
> > > (2) the name string of an mdev type is composed of "driver_name + type_name".
> > > in some devices, e.g. qat, different generations of devices are bound to
> > > drivers of different names, e.g. "qat-v4", "qat-v5".
> > > so even though the type_name is equal, the mdev types are not equal. e.g.
> > > "qat-v4-type1", "qat-v5-type1".
> > 
> > I guess that shows a shortcoming of that "driver_name + type_name"
> > approach? Or maybe I'm just confused.  
> yes, I really don't like having the version in the mdev-type name.
> I would strongly prefer just qat-type-1, where qat is just there as a way of
> namespacing.
> although symmetric-crypto, asymmetric-crypto and compression would be better
> names than type-1, type-2, type-3 if
> that is what they would end up mapping to. e.g. qat-compression or qat-aes
> is a much better name than type-1.
> higher layers of software are unlikely to parse the mdev names, but as a human
> looking at them it's much easier to
> understand if the names are meaningful. the qat prefix I think is important,
> however, to make sure that your mdev-types
> don't collide with other vendors' mdev types. so I would encourage all vendors
> to prefix their mdev types with either the
> device name or the vendor.

+1 to all this, the mdev type is meant to indicate a software
compatible interface, if different hardware versions can be software
compatible, then don't make the job of finding a compatible device
harder.  The full type is a combination of the vendor driver name plus
the vendor provided type name specifically in order to provide a type
namespace per vendor driver.  That's done at the mdev core level.
Thanks,

Alex
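
To make the namespacing above concrete: a management tool can enumerate the
vendor-prefixed types a parent device exposes directly from sysfs. Below is a
minimal sketch (Python) against the mdev_supported_types layout from the
kernel's mediated-device documentation; the parent device address is a
hypothetical example:

    from pathlib import Path

    def list_mdev_types(parent="/sys/bus/pci/devices/0000:00:02.0"):
        # Each directory under mdev_supported_types is named driver-type,
        # e.g. i915-GVTg_V5_4, which is the per-vendor-driver namespace the
        # mdev core provides as described above.
        types = {}
        for d in (Path(parent) / "mdev_supported_types").iterdir():
            types[d.name] = {
                "device_api": (d / "device_api").read_text().strip(),
                "available_instances": int((d / "available_instances").read_text()),
            }
        return types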



Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Alex Williamson
On Thu, 20 Aug 2020 08:39:22 +0800
Yan Zhao  wrote:

> On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > On Tue, 18 Aug 2020 10:16:28 +0100
> > Daniel P. Berrangé  wrote:
> >   
> > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:  
> > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
> > > > 
> > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > 
> > > >  On 2020/8/14 1:16 PM, Yan Zhao wrote:
> > > > 
> > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > 
> > > >  On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > >   
> > > >  we actually can also retrieve the same information through sysfs, e.g.
> > > > 
> > > >  |- [path to device]
> > > >    |--- migration
> > > >    | |--- self
> > > >    | |   |--- device_api
> > > >    | |   |--- mdev_type
> > > >    | |   |--- software_version
> > > >    | |   |--- device_id
> > > >    | |   |--- aggregator
> > > >    | |--- compatible
> > > >    | |   |--- device_api
> > > >    | |   |--- mdev_type
> > > >    | |   |--- software_version
> > > >    | |   |--- device_id
> > > >    | |   |--- aggregator
> > > > 
> > > > 
> > > >  Yes but:
> > > > 
> > > >  - You need one file per attribute (one syscall for one attribute)
> > > >  - Attribute is coupled with kobject  
> > 
> > Is that really that bad? You have the device with an embedded kobject
> > anyway, and you can just put things into an attribute group?
> > 
> > [Also, I think that self/compatible split in the example makes things
> > needlessly complex. Shouldn't semantic versioning and matching already
> > cover nearly everything? I would expect very few cases that are more
> > complex than that. Maybe the aggregation stuff, but I don't think we
> > need that self/compatible split for that, either.]  
> Hi Cornelia,
> 
> The reason I want to declare a compatible list of attributes is that
> sometimes it's not a simple 1:1 matching of source attributes and target
> attributes,
> as I demonstrated below:
> a source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with
> target mdevs of (mdev_type i915-GVTg_V5_4 + aggregator 2),
>(mdev_type i915-GVTg_V5_8 + aggregator 4)
> 
> and aggregator may be just one such example where 1:1 matching does not
> fit.

If you're suggesting that we need a new 'compatible' set for every
aggregation, haven't we lost the purpose of aggregation?  For example,
rather than having N mdev types to represent all the possible
aggregation values, we have a single mdev type with N compatible
migration entries, one for each possible aggregation value.  BTW, how do
we have multiple compatible directories?  compatible0001,
compatible0002? Thanks,

Alex
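
Yan's example above encodes a simple equivalence: i915-GVTg_V5_2 with
aggregator 1 matches i915-GVTg_V5_4 with aggregator 2 and i915-GVTg_V5_8 with
aggregator 4. A sketch of the single parametrized rule that avoids enumerating
every pair (Python; reading 1/N of the parent's resources out of the type name
is an assumption drawn from the example, not a documented rule):

    import re

    def total_share(mdev_type, aggregator):
        # Assume i915-GVTg_V5_N denotes 1/N of the parent device, so a
        # configuration's total share is aggregator/N; equal shares are
        # treated as migration candidates.
        n = int(re.fullmatch(r"i915-GVTg_V5_(\d+)", mdev_type).group(1))
        return aggregator / n

    assert total_share("i915-GVTg_V5_2", 1) == \
           total_share("i915-GVTg_V5_4", 2) == \
           total_share("i915-GVTg_V5_8", 4)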



Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Alex Williamson
On Thu, 20 Aug 2020 08:18:10 +0800
Yan Zhao  wrote:

> On Wed, Aug 19, 2020 at 11:50:21AM -0600, Alex Williamson wrote:
> <...>
> > > > > > What I care about is that we have a *standard* userspace API for
> > > > > > performing device compatibility checking / state migration, for use 
> > > > > > by
> > > > > > QEMU/libvirt/ OpenStack, such that we can write code without 
> > > > > > countless
> > > > > > vendor specific code paths.
> > > > > >
> > > > > > If there is vendor specific stuff on the side, that's fine as we can
> > > > > > ignore that, but the core functionality for device compat / 
> > > > > > migration
> > > > > > needs to be standardized.
> > > > > 
> > > > > To summarize:
> > > > > - choose one of sysfs or devlink
> > > > > - have a common interface, with a standardized way to add
> > > > >   vendor-specific attributes
> > > > > ?
> > > > 
> > > > Please refer to my previous email, which has more examples and details.
> > > >  
> > > hi Parav,
> > > the example is based on a new vdpa tool running over netlink, not based
> > > on devlink, right?
> > > For vfio migration compatibility, we have to deal with both mdev and 
> > > physical
> > > pci devices, I don't think it's a good idea to write a new tool for it, 
> > > given
> > > we are able to retrieve the same info from sysfs and there's already an
> > > mdevctl from Alex (https://github.com/mdevctl/mdevctl).
> > > 
> > > hi All,
> > > could we decide that sysfs is the interface that every VFIO vendor driver
> > > needs to provide in order to support vfio live migration, otherwise the
> > > userspace management tool would not list the device in the compatible
> > > list?
> > > 
> > > if that's true, let's move to the standardizing of the sysfs interface.
> > > (1) content
> > > common part: (must)
> > >- software_version: (in major.minor.bugfix scheme)
> > >- device_api: vfio-pci or vfio-ccw ...
> > >- type: mdev type for mdev device or
> > >a signature for physical device which is a counterpart for
> > >  mdev type.
> > > 
> > > device api specific part: (must)
> > >   - pci id: pci id of mdev parent device or pci id of physical pci
> > > device (device_api is vfio-pci)  
> > 
> > As noted previously, the parent PCI ID should not matter for an mdev
> > device, if a vendor has a dependency on matching the parent device PCI
> > ID, that's a vendor specific restriction.  An mdev device can also
> > expose a vfio-pci device API without the parent device being PCI.  For
> > a physical PCI device, shouldn't the PCI ID be encompassed in the
> > signature?  Thanks,
> >   
> you are right. I need to put the PCI ID as a vendor specific field.
> I didn't do that because I wanted all fields in the vendor-specific part to be
> configurable by management tools, so they can configure the target device
> according to the value of a vendor-specific field even if they don't know
> the meaning of the field.
> But maybe they can just ignore the field when they can't find a matching
> writable field to configure the target.


If fields can be ignored, what's the point of reporting them?  Seems
it's no longer a requirement.  Thanks,

Alex


> > >   - subchannel_type (device_api is vfio-ccw) 
> > >  
> > > vendor driver specific part: (optional)
> > >   - aggregator
> > >   - chpid_type
> > >   - remote_url
> > > 
> > > NOTE: vendors are free to add attributes in this part with a
> > > restriction that this attribute is able to be configured with the same
> > > name in sysfs too. e.g.
> > > for aggregator, there must be a sysfs attribute in device node
> > > /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator,
> > > so that the userspace tool is able to configure the target device
> > > according to source device's aggregator attribute.
> > > 
> > > 
> > > (2) where and structure
> > > proposal 1:
> > > |- [path to device]
> > >   |--- migration
> > >   | |--- self
> > >   | |   |--- software_version
> > >   | |   |--- device_api
> > >   | |   |--- type
> > >   | |   |--- [pci_id or subchannel_type]
> > >   | |   |--- 
> > >   | |--- compatible
> > >   | |   |--- software_version
> > >   | |   |--- device_api
> > >   | |   |--- type
> > >   | |   |--- [pci_id or subchannel_type]
> > >   | |   |--- 
> > > multiple compatible is allowed.
> > > attributes should be ASCII text files, preferably with only one value
> > > per file.
> > > 
> > > 
> > > proposal 2: use bin_attribute.
> > > |- [path to device]
> > >   |--- migration
> > >   | |--- self
> > >   | |--- compatible
> > > 
> > > so we can continue to use the multiline format. e.g.
> > > cat compatible
> > >   software_version=0.1.0
> > >   device_api=vfio_pci
> > >   type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > >   pci_id=80865963
> > >   aggregator={val1}/2
> > > 
> > > Thanks
> > > Yan
> > >   
> >   
> 
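
As a sketch of what consuming proposal 1 would look like for a management
tool (Python; the directory layout is the proposal above, not an existing
kernel ABI, and the naive 1:1 field match is exactly the limitation Yan
raises elsewhere in the thread):

    from pathlib import Path

    COMMON = ("software_version", "device_api", "type")

    def read_attrs(dev, side):
        # side is "self" or "compatible"; one ASCII value per file, as the
        # proposal requires.
        d = Path(dev) / "migration" / side
        return {f.name: f.read_text().strip() for f in d.iterdir() if f.is_file()}

    def is_candidate(src_dev, dst_dev):
        src = read_attrs(src_dev, "self")
        dst = read_attrs(dst_dev, "compatible")
        # Vendor-specific fields such as aggregator would additionally have
        # to be written into the target's configuration, as discussed above.
        return all(src.get(k) == dst.get(k) for k in COMMON)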



Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Alex Williamson
On Wed, 19 Aug 2020 11:30:35 +0800
Yan Zhao  wrote:

> On Tue, Aug 18, 2020 at 09:39:24AM +, Parav Pandit wrote:
> > Hi Cornelia,
> >   
> > > From: Cornelia Huck 
> > > Sent: Tuesday, August 18, 2020 3:07 PM
> > > To: Daniel P. Berrangé 
> > > Cc: Jason Wang ; Yan Zhao
> > > ; k...@vger.kernel.org; libvir-list@redhat.com;
> > > qemu-de...@nongnu.org; Kirti Wankhede ;
> > > eau...@redhat.com; xin-ran.w...@intel.com; cor...@lwn.net; openstack-
> > > disc...@lists.openstack.org; shaohe.f...@intel.com; kevin.t...@intel.com;
> > > Parav Pandit ; jian-feng.d...@intel.com;
> > > dgilb...@redhat.com; zhen...@linux.intel.com; hejie...@intel.com;
> > > bao.yum...@zte.com.cn; Alex Williamson ;
> > > eskul...@redhat.com; smoo...@redhat.com; intel-gvt-
> > > d...@lists.freedesktop.org; Jiri Pirko ;
> > > dinec...@redhat.com; de...@ovirt.org
> > > Subject: Re: device compatibility interface for live migration with 
> > > assigned
> > > devices
> > > 
> > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > Daniel P. Berrangé  wrote:
> > >   
> > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:  
> > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
> > > > >
> > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > >
> > > > >  On 2020/8/14 1:16 PM, Yan Zhao wrote:
> > > > >
> > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > >
> > > > >  On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > > >  
> > > > >  we actually can also retrieve the same information through sysfs,
> > > > > e.g.
> > > > >
> > > > >  |- [path to device]
> > > > >    |--- migration
> > > > >    | |--- self
> > > > >    | |   |--- device_api
> > > > >    | |   |--- mdev_type
> > > > >    | |   |--- software_version
> > > > >    | |   |--- device_id
> > > > >    | |   |--- aggregator
> > > > >    | |--- compatible
> > > > >    | |   |--- device_api
> > > > >    | |   |--- mdev_type
> > > > >    | |   |--- software_version
> > > > >    | |   |--- device_id
> > > > >    | |   |--- aggregator
> > > > >
> > > > >
> > > > >  Yes but:
> > > > >
> > > > >  - You need one file per attribute (one syscall for one attribute)
> > > > >  - Attribute is coupled with kobject  
> > > 
> > > Is that really that bad? You have the device with an embedded kobject
> > > anyway, and you can just put things into an attribute group?
> > > 
> > > [Also, I think that self/compatible split in the example makes things
> > > needlessly complex. Shouldn't semantic versioning and matching already
> > > cover nearly everything? I would expect very few cases that are more
> > > complex than that. Maybe the aggregation stuff, but I don't think we need
> > > that self/compatible split for that, either.]
> > >   
> > > > >
> > > > >  All of above seems unnecessary.
> > > > >
> > > > >  Another point, as we discussed in another thread, it's really hard
> > > > > to make  sure the above API work for all types of devices and
> > > > > frameworks. So having a  vendor specific API looks much better.
> > > > >
> > > > >  From the POV of userspace mgmt apps doing device compat checking /
> > > > > migration,  we certainly do NOT want to use different vendor
> > > > > specific APIs. We want to  have an API that can be used / controlled 
> > > > > in a  
> > > standard manner across vendors.  
> > > > >
> > > > >Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's
> > > > >a long debate on sysfs vs devlink). So if we go with sysfs, at least two
> > > > >APIs need to be supported ...
> > > >
> > > > NB, I was not questioning devlink vs sysfs directly. If devlink is
> > > > related to netlink, I can't say I'm enthusiastic as IMHO sysfs is
> > > > easier to deal with. I don't know enough about devlink to have much of
> > > > an opinion.

Re: device compatibility interface for live migration with assigned devices

2020-07-30 Thread Alex Williamson
On Thu, 30 Jul 2020 11:41:04 +0800
Yan Zhao  wrote:

> On Wed, Jul 29, 2020 at 01:12:55PM -0600, Alex Williamson wrote:
> > On Wed, 29 Jul 2020 12:28:46 +0100
> > Sean Mooney  wrote:
> >   
> > > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:  
> > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > > > As you indicate, the vendor driver is responsible for checking 
> > > > > > > > version
> > > > > > > > information embedded within the migration stream.  Therefore a
> > > > > > > > migration should fail early if the devices are incompatible.  
> > > > > > > > Is it  
> > > > > > > 
> > > > > > > but as far as I know, in the current VFIO migration protocol we have
> > > > > > > no way to get the vendor-specific compatibility checking string in the
> > > > > > > migration setup stage (i.e. the .save_setup stage) before the device is
> > > > > > > set to the _SAVING state.
> > > > > > > As a result, for devices that do not save device data in the precopy
> > > > > > > stage, the migration compatibility checking happens as late as the
> > > > > > > stop-and-copy stage, which is too late.
> > > > > > > do you think we need to add getting/checking of the vendor-specific
> > > > > > > compatibility string early, in the save_setup stage?
> > > > > > >  
> > > > > > 
> > > > > > hi Alex,
> > > > > > after an offline discussion with Kevin, I realized that it may not be a
> > > > > > problem if the migration compatibility check in the vendor driver occurs
> > > > > > late, in the stop-and-copy phase, for some devices, because if we report
> > > > > > device compatibility attributes clearly in an interface, the chances for
> > > > > > libvirt/openstack to make a wrong decision are small.
> > > > > 
> > > > > I think it would be wise for a vendor driver to implement a pre-copy
> > > > > phase, even if only to send version information and verify it at the
> > > > > target.  Deciding you have no device state to send during pre-copy 
> > > > > does
> > > > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > > > we've defined that we can enter stop-and-copy at any point, including
> > > > > without a pre-copy phase, so I would recommend that vendor drivers
> > > > > validate compatibility at the start of both the pre-copy and the
> > > > > stop-and-copy phases.
> > > > > 
> > > > 
> > > > ok. got it!
> > > > 
> > > > > > so, do you think we are now arriving at an agreement that we'll give
> > > > > > up the read-and-test scheme and start defining one interface (perhaps
> > > > > > in json format), from which libvirt/openstack is able to parse and find
> > > > > > out the compatibility list of a source mdev/physical device?
> > > > > 
> > > > > Based on the feedback we've received, the previously proposed 
> > > > > interface
> > > > > is not viable.  I think there's agreement that the user needs to be
> > > > > able to parse and interpret the version information.  Using json seems
> > > > > viable, but I don't know if it's the best option.  Is there any
> > > > > precedent of markup strings returned via sysfs we could follow?
> > > > 
> > > > I found some examples of using formatted string under /sys, mostly under
> > > > tracing. maybe we can do a similar implementation.
> > > > 
> > > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > > > 
> > > > name: kvm_mmio
> > > > ID: 32
> > > > format:
> > > > field:

Re: device compatibility interface for live migration with assigned devices

2020-07-29 Thread Alex Williamson
On Wed, 29 Jul 2020 12:28:46 +0100
Sean Mooney  wrote:

> On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:  
> > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > Yan Zhao  wrote:
> > >   
> > > > > > As you indicate, the vendor driver is responsible for checking 
> > > > > > version
> > > > > > information embedded within the migration stream.  Therefore a
> > > > > > migration should fail early if the devices are incompatible.  Is it 
> > > > > >
> > > > > 
> > > > > but as far as I know, in the current VFIO migration protocol we have no
> > > > > way to get the vendor-specific compatibility checking string in the
> > > > > migration setup stage (i.e. the .save_setup stage) before the device is
> > > > > set to the _SAVING state.
> > > > > As a result, for devices that do not save device data in the precopy
> > > > > stage, the migration compatibility checking happens as late as the
> > > > > stop-and-copy stage, which is too late.
> > > > > do you think we need to add getting/checking of the vendor-specific
> > > > > compatibility string early, in the save_setup stage?
> > > > >
> > > > 
> > > > hi Alex,
> > > > after an offline discussion with Kevin, I realized that it may not be a
> > > > problem if the migration compatibility check in the vendor driver occurs
> > > > late, in the stop-and-copy phase, for some devices, because if we report
> > > > device compatibility attributes clearly in an interface, the chances for
> > > > libvirt/openstack to make a wrong decision are small.
> > > 
> > > I think it would be wise for a vendor driver to implement a pre-copy
> > > phase, even if only to send version information and verify it at the
> > > target.  Deciding you have no device state to send during pre-copy does
> > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > we've defined that we can enter stop-and-copy at any point, including
> > > without a pre-copy phase, so I would recommend that vendor drivers
> > > validate compatibility at the start of both the pre-copy and the
> > > stop-and-copy phases.
> > >   
> > 
> > ok. got it!
> >   
> > > > so, do you think we are now arriving at an agreement that we'll give up
> > > > the read-and-test scheme and start defining one interface (perhaps in
> > > > json format), from which libvirt/openstack is able to parse and find out
> > > > the compatibility list of a source mdev/physical device?
> > > 
> > > Based on the feedback we've received, the previously proposed interface
> > > is not viable.  I think there's agreement that the user needs to be
> > > able to parse and interpret the version information.  Using json seems
> > > viable, but I don't know if it's the best option.  Is there any
> > > precedent of markup strings returned via sysfs we could follow?  
> > 
> > I found some examples of using formatted string under /sys, mostly under
> > tracing. maybe we can do a similar implementation.
> > 
> > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > 
> > name: kvm_mmio
> > ID: 32
> > format:
> > field:unsigned short common_type; offset:0; size:2; signed:0;
> > field:unsigned char common_flags; offset:2; size:1; signed:0;
> > field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> > field:int common_pid;   offset:4;   size:4; signed:1;
> > 
> > field:u32 type; offset:8;   size:4; signed:0;
> > field:u32 len;  offset:12;  size:4; signed:0;
> > field:u64 gpa;  offset:16;  size:8; signed:0;
> > field:u64 val;  offset:24;  size:8; signed:0;
> > 
> > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", 
> > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read"
> > }, { 2, "write" }), REC->len, REC->gpa, REC->val
> >   
> this is not json format and it's not super friendly to parse.
> > 
> > #cat /sys/devices/pci:00/:00:02.0/uevent
> > DRIVER=vfio-pci
> > PCI_CLASS=3
> > PCI_ID=8086:591D
> > PCI_SUBSYS_I

Re: device compatibility interface for live migration with assigned devices

2020-07-27 Thread Alex Williamson
On Mon, 27 Jul 2020 15:24:40 +0800
Yan Zhao  wrote:

> > > As you indicate, the vendor driver is responsible for checking version
> > > information embedded within the migration stream.  Therefore a
> > > migration should fail early if the devices are incompatible.  Is it  
> > but as far as I know, in the current VFIO migration protocol we have no way
> > to get the vendor-specific compatibility checking string in the migration
> > setup stage (i.e. the .save_setup stage) before the device is set to the
> > _SAVING state.
> > As a result, for devices that do not save device data in the precopy stage,
> > the migration compatibility checking happens as late as the stop-and-copy
> > stage, which is too late.
> > do you think we need to add getting/checking of the vendor-specific
> > compatibility string early, in the save_setup stage?
> >  
> hi Alex,
> after an offline discussion with Kevin, I realized that it may not be a
> problem if the migration compatibility check in the vendor driver occurs
> late, in the stop-and-copy phase, for some devices, because if we report
> device compatibility attributes clearly in an interface, the chances for
> libvirt/openstack to make a wrong decision are small.

I think it would be wise for a vendor driver to implement a pre-copy
phase, even if only to send version information and verify it at the
target.  Deciding you have no device state to send during pre-copy does
not mean your vendor driver needs to opt-out of the pre-copy phase
entirely.  Please also note that pre-copy is at the user's discretion,
we've defined that we can enter stop-and-copy at any point, including
without a pre-copy phase, so I would recommend that vendor drivers
validate compatibility at the start of both the pre-copy and the
stop-and-copy phases.

> so, do you think we are now arriving at an agreement that we'll give up
> the read-and-test scheme and start defining one interface (perhaps in
> json format), from which libvirt/openstack is able to parse and find out
> the compatibility list of a source mdev/physical device?

Based on the feedback we've received, the previously proposed interface
is not viable.  I think there's agreement that the user needs to be
able to parse and interpret the version information.  Using json seems
viable, but I don't know if it's the best option.  Is there any
precedent of markup strings returned via sysfs we could follow?

Your idea of having both a "self" object and an array of "compatible"
objects is perhaps something we can build on, but we must not assume
PCI devices at the root level of the object.  Providing both the
mdev-type and the driver is a bit redundant, since the former includes
the latter.  We can't have vendor specific versioning schemes though,
ie. gvt-version. We need to agree on a common scheme and decide which
fields the version is relative to, ex. just the mdev type?

I had also proposed fields that provide information to create a
compatible type, for example to create a type_x2 device from a type_x1
mdev type, they need to know to apply an aggregation attribute.  If we
need to explicitly list every aggregation value and the resulting type,
I think we run aground of what aggregation was trying to avoid anyway,
so we might need to pick a language that defines variable substitution
or some kind of tagging.  For example if we could define ${aggr} as an
integer within a specified range, then we might be able to define a type
relative to that value (type_x${aggr}) which requires an aggregation
attribute using the same value.  I dunno, just spit balling.  Thanks,

Alex
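
To make the variable-substitution idea concrete, a sketch of expanding a
templated type into the concrete types and attribute values a management tool
could match against (Python; the ${aggr} syntax and the value range are the
illustrative assumptions from the paragraph above, not a settled format):

    import re
    from itertools import product

    def expand(template, variables):
        # variables maps each tag to its allowed integers,
        # e.g. {"aggr": range(1, 5)} for "type_x${aggr}".
        names = re.findall(r"\$\{(\w+)\}", template)
        for values in product(*(variables[n] for n in names)):
            env = dict(zip(names, values))
            concrete = re.sub(r"\$\{(\w+)\}",
                              lambda m: str(env[m.group(1)]), template)
            # env carries the attributes to apply on create, e.g. the
            # aggregation value matching the expanded type.
            yield concrete, env

    for t, attrs in expand("type_x${aggr}", {"aggr": range(1, 5)}):
        print(t, attrs)   # type_x1 {'aggr': 1} ... type_x4 {'aggr': 4}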



Re: device compatibility interface for live migration with assigned devices

2020-07-17 Thread Alex Williamson
On Fri, 17 Jul 2020 19:03:44 +0100
"Dr. David Alan Gilbert"  wrote:

> * Alex Williamson (alex.william...@redhat.com) wrote:
> > On Wed, 15 Jul 2020 16:20:41 +0800
> > Yan Zhao  wrote:
> >   
> > > On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:  
> > > > On Tue, 14 Jul 2020 18:19:46 +0100
> > > > "Dr. David Alan Gilbert"  wrote:
> > > > 
> > > > > * Alex Williamson (alex.william...@redhat.com) wrote:
> > > > > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > > > > Daniel P. Berrangé  wrote:
> > > > > >   
> > > > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > > > > > hi folks,
> > > > > > > > we are defining a device migration compatibility interface that 
> > > > > > > > helps upper
> > > > > > > > layer stack like openstack/ovirt/libvirt to check if two 
> > > > > > > > devices are
> > > > > > > > live migration compatible.
> > > > > > > > The "devices" here could be MDEVs, physical devices, or hybrid 
> > > > > > > > of the two.
> > > > > > > > e.g. we could use it to check whether
> > > > > > > > - a src MDEV can migrate to a target MDEV,
> > > > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > > > > - a src MDEV can migrate to a target VF in SRIOV.
> > > > > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > > > > 
> > > > > > > > The upper layer stack could use this interface as the last step 
> > > > > > > > to check
> > > > > > > > if one device is able to migrate to another device before 
> > > > > > > > triggering a real
> > > > > > > > live migration procedure.
> > > > > > > > we are not sure if this interface is of value or help to you. 
> > > > > > > > please don't
> > > > > > > > hesitate to drop your valuable comments.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > (1) interface definition
> > > > > > > > The interface is defined in below way:
> > > > > > > > 
> > > > > > > >               userspace
> > > > > > > >              /         \
> > > > > > > >         read/           \write
> > > > > > > >            /             v
> > > > > > > >   -------------------   -------------------
> > > > > > > >  | migration_version | | migration_version | --> check migration
> > > > > > > >   -------------------   -------------------      compatibility
> > > > > > > >        device A              device B
> > > > > > > > 
> > > > > > > > 
> > > > > > > > a device attribute named migration_version is defined under 
> > > > > > > > each device's
> > > > > > > > sysfs node. e.g. 
> > > > > > > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
> > > > > > > > userspace tools read the migration_version as a string from the 
> > > > > > > > source device,
> > > > > > > > and write it to the migration_version sysfs attribute in the 
> > > > > > > > target device.
> > > > > > > > 
> > > > > > > > The userspace should treat ANY of below conditions as two 
> > > > > > > > devices not compatible:
> > > > > > > > - any one of the two devices does not have a migration_version 
> > > > > > > > attribute
> > > > > > > > - error when reading from migration_version attribute of one 
> > > > > > > > device
> > > > > > > > - error when writing migration_version string of one device to
> > > > > > > >   migration_version attribute of the other device
> > > > > > > > 
> > > > > > >

Re: device compatibility interface for live migration with assigned devices

2020-07-17 Thread Alex Williamson
On Thu, 16 Jul 2020 16:32:30 +0800
Yan Zhao  wrote:

> On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
> > 
> > On 2020/7/14 7:29 AM, Yan Zhao wrote:
> > > hi folks,
> > > we are defining a device migration compatibility interface that helps 
> > > upper
> > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > live migration compatible.
> > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > e.g. we could use it to check whether
> > > - a src MDEV can migrate to a target MDEV,
> > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > - a src MDEV can migrate to a target VF in SRIOV.
> > >(e.g. SIOV/SRIOV backward compatibility case)
> > > 
> > > The upper layer stack could use this interface as the last step to check
> > > if one device is able to migrate to another device before triggering a 
> > > real
> > > live migration procedure.
> > > we are not sure if this interface is of value or help to you. please don't
> > > hesitate to drop your valuable comments.
> > > 
> > > 
> > > (1) interface definition
> > > The interface is defined in below way:
> > > 
> > >   __userspace
> > >/\  \
> > >   / \write
> > >  / read  \
> > > /__   ___\|/_
> > >| migration_version | | migration_version |-->check migration
> > >- -   compatibility
> > >   device Adevice B
> > > 
> > > 
> > > a device attribute named migration_version is defined under each device's
> > > sysfs node. e.g. 
> > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).  
> > 
> > 
> > Are you aware of the devlink based device management interface that is
> > proposed upstream? I think it has many advantages over sysfs; have you
> > considered switching to that?


Advantages, such as?


> not familiar with the devlink. will do some research of it.
> > 
> >   
> > > userspace tools read the migration_version as a string from the source 
> > > device,
> > > and write it to the migration_version sysfs attribute in the target 
> > > device.
> > > 
> > > The userspace should treat ANY of below conditions as two devices not 
> > > compatible:
> > > - any one of the two devices does not have a migration_version attribute
> > > - error when reading from migration_version attribute of one device
> > > - error when writing migration_version string of one device to
> > >migration_version attribute of the other device
> > > 
> > > The string read from migration_version attribute is defined by device 
> > > vendor
> > > driver and is completely opaque to the userspace.  
> > 
> > 
> > My understanding is that something opaque to userspace is not the 
> > philosophy  
> 
> but the VFIO live migration in itself is essentially a big opaque stream to 
> userspace.
> 
> > of Linux. Instead of having a generic API with an opaque value, why not do it
> > in a vendor-specific way, like:
> > 
> > 1) exposing the device capability in a vendor specific way via sysfs/devlink
> > or other API
> > 2) management read capability in both src and dst and determine whether we
> > can do the migration
> > 
> > This is the way we plan to do with vDPA.
> >  
> yes, in another reply, Alex proposed to use an interface in json format.
> I guess we can define something like
> 
> { "self" :
>   [
> { "pciid" : "8086591d",
>   "driver" : "i915",
>   "gvt-version" : "v1",
>   "mdev_type"   : "i915-GVTg_V5_2",
>   "aggregator"  : "1",
>   "pv-mode" : "none",
> }
>   ],
>   "compatible" :
>   [
> { "pciid" : "8086591d",
>   "driver" : "i915",
>   "gvt-version" : "v1",
>   "mdev_type"   : "i915-GVTg_V5_2",
>   "aggregator"  : "1"
>   "pv-mode" : "none",
> },
> { "pciid" : "8086591d",
>   "driver" : "i915",
>   "gvt-version" : "v1",
>   "mdev_type"   : "i915-GVTg_V5_4",
>   "aggregator"  : "2"
>   "pv-mode" : "none",
> },
> { "pciid" : "8086591d",
>   "driver" : "i915",
>   "gvt-version" : "v2",
>   "mdev_type"   : "i915-GVTg_V5_4",
>   "aggregator"  : "2"
>   "pv-mode" : "none, ppgtt, context",
> }
> ...
>   ]
> }
> 
> But as those fields are mostly vendor specific, the userspace can
> only do simple string comparison, and I guess the list would be very long as
> it needs to enumerate all possible targets.


This ignores so much of what I tried to achieve in my example :(


> also, in some fields like "gvt-version", is there a simple way to express
> things like v2+?


That's not a reasonable thing to express anyway, how can you be certain
that v3 won't break compatibility with v2?  Sean proposed a versioning
scheme that accounts for this, using an x.y.z version expressing the
major, minor, and bugfix versions, where there is no compatibility
across major versions, 
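
A sketch of that x.y.z rule (Python): the thread only fixes "no compatibility
across major versions", so the equal-or-newer-minor acceptance below is an
assumption a vendor driver would have to confirm:

    def version_compatible(src, dst):
        # No compatibility across major versions; within one major, assume a
        # target with an equal or newer minor can accept the source, and
        # ignore the bugfix level for the check.
        s_major, s_minor, _ = (int(p) for p in src.split("."))
        d_major, d_minor, _ = (int(p) for p in dst.split("."))
        return s_major == d_major and d_minor >= s_minor

    assert version_compatible("2.1.0", "2.3.5")
    assert not version_compatible("2.1.0", "3.0.0")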

Re: device compatibility interface for live migration with assigned devices

2020-07-17 Thread Alex Williamson
On Wed, 15 Jul 2020 15:37:19 +0800
Alex Xu  wrote:

> Alex Williamson wrote on Wed, Jul 15, 2020 at 5:00 AM:
> 
> > On Tue, 14 Jul 2020 18:19:46 +0100
> > "Dr. David Alan Gilbert"  wrote:
> >  
> > > * Alex Williamson (alex.william...@redhat.com) wrote:  
> > > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > > Daniel P. Berrangé  wrote:
> > > >  
> > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > > > hi folks,
> > > > > > we are defining a device migration compatibility interface that  
> > helps upper  
> > > > > > layer stack like openstack/ovirt/libvirt to check if two devices  
> > are  
> > > > > > live migration compatible.
> > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of  
> > the two.  
> > > > > > e.g. we could use it to check whether
> > > > > > - a src MDEV can migrate to a target MDEV,
> > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > > - a src MDEV can migrate to a target VF in SRIOV.
> > > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > >
> > > > > > The upper layer stack could use this interface as the last step to  
> > check  
> > > > > > if one device is able to migrate to another device before  
> > triggering a real  
> > > > > > live migration procedure.
> > > > > > we are not sure if this interface is of value or help to you.  
> > please don't  
> > > > > > hesitate to drop your valuable comments.
> > > > > >
> > > > > >
> > > > > > (1) interface definition
> > > > > > The interface is defined in below way:
> > > > > >
> > > > > >               userspace
> > > > > >              /         \
> > > > > >         read/           \write
> > > > > >            /             v
> > > > > >   -------------------   -------------------
> > > > > >  | migration_version | | migration_version | --> check migration
> > > > > >   -------------------   -------------------      compatibility
> > > > > >        device A              device B
> > > > > >
> > > > > >
> > > > > > a device attribute named migration_version is defined under each  
> > device's  
> > > > > > sysfs node. e.g.  
> > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).  
> > > > > > userspace tools read the migration_version as a string from the  
> > source device,  
> > > > > > and write it to the migration_version sysfs attribute in the  
> > target device.  
> > > > > >
> > > > > > The userspace should treat ANY of below conditions as two devices  
> > not compatible:  
> > > > > > - any one of the two devices does not have a migration_version  
> > attribute  
> > > > > > - error when reading from migration_version attribute of one device
> > > > > > - error when writing migration_version string of one device to
> > > > > >   migration_version attribute of the other device
> > > > > >
> > > > > > The string read from migration_version attribute is defined by  
> > device vendor  
> > > > > > driver and is completely opaque to the userspace.
> > > > > > for an Intel vGPU, the string format can be defined like
> > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" +  
> > "aggregator count".  
> > > > > >
> > > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > >
> > > > > > for a QAT VF, it may be
> > > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > >
> > > > > > (to avoid namespace conflicts between vendors, we may prefix a
> > > > > > driver name to
> > > > > > each migration_version string. e.g.
> > > > > > i915-v1-8086-591d-i915-GVTg_V5_8-1)
> > > >
> > >

Re: device compatibility interface for live migration with assigned devices

2020-07-17 Thread Alex Williamson
On Wed, 15 Jul 2020 16:20:41 +0800
Yan Zhao  wrote:

> On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:
> > On Tue, 14 Jul 2020 18:19:46 +0100
> > "Dr. David Alan Gilbert"  wrote:
> >   
> > > * Alex Williamson (alex.william...@redhat.com) wrote:  
> > > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > > Daniel P. Berrangé  wrote:
> > > > 
> > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
> > > > > > hi folks,
> > > > > > we are defining a device migration compatibility interface that 
> > > > > > helps upper
> > > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > > live migration compatible.
> > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of 
> > > > > > the two.
> > > > > > e.g. we could use it to check whether
> > > > > > - a src MDEV can migrate to a target MDEV,
> > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > > - a src MDEV can migrate to a target VF in SRIOV.
> > > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > > 
> > > > > > The upper layer stack could use this interface as the last step to 
> > > > > > check
> > > > > > if one device is able to migrate to another device before 
> > > > > > triggering a real
> > > > > > live migration procedure.
> > > > > > we are not sure if this interface is of value or help to you. 
> > > > > > please don't
> > > > > > hesitate to drop your valuable comments.
> > > > > > 
> > > > > > 
> > > > > > (1) interface definition
> > > > > > The interface is defined in below way:
> > > > > > 
> > > > > >               userspace
> > > > > >              /         \
> > > > > >         read/           \write
> > > > > >            /             v
> > > > > >   -------------------   -------------------
> > > > > >  | migration_version | | migration_version | --> check migration
> > > > > >   -------------------   -------------------      compatibility
> > > > > >        device A              device B
> > > > > > 
> > > > > > 
> > > > > > a device attribute named migration_version is defined under each 
> > > > > > device's
> > > > > > sysfs node. e.g. 
> > > > > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
> > > > > > userspace tools read the migration_version as a string from the 
> > > > > > source device,
> > > > > > and write it to the migration_version sysfs attribute in the target 
> > > > > > device.
> > > > > > 
> > > > > > The userspace should treat ANY of below conditions as two devices 
> > > > > > not compatible:
> > > > > > - any one of the two devices does not have a migration_version 
> > > > > > attribute
> > > > > > - error when reading from migration_version attribute of one device
> > > > > > - error when writing migration_version string of one device to
> > > > > >   migration_version attribute of the other device
> > > > > > 
> > > > > > The string read from migration_version attribute is defined by 
> > > > > > device vendor
> > > > > > driver and is completely opaque to the userspace.
> > > > > > for an Intel vGPU, the string format can be defined like
> > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + 
> > > > > > "aggregator count".
> > > > > > 
> > > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > > 
> > > > > > for a QAT VF, it may be
> > > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > > 
> > > > > > (

Re: device compatibility interface for live migration with assigned devices

2020-07-14 Thread Alex Williamson
On Tue, 14 Jul 2020 18:19:46 +0100
"Dr. David Alan Gilbert"  wrote:

> * Alex Williamson (alex.william...@redhat.com) wrote:
> > On Tue, 14 Jul 2020 11:21:29 +0100
> > Daniel P. Berrangé  wrote:
> >   
> > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > hi folks,
> > > > we are defining a device migration compatibility interface that helps 
> > > > upper
> > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > live migration compatible.
> > > > The "devices" here could be MDEVs, physical devices, or hybrid of the 
> > > > two.
> > > > e.g. we could use it to check whether
> > > > - a src MDEV can migrate to a target MDEV,
> > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > - a src MDEV can migrate to a target VF in SRIOV.
> > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > 
> > > > The upper layer stack could use this interface as the last step to check
> > > > if one device is able to migrate to another device before triggering a 
> > > > real
> > > > live migration procedure.
> > > > we are not sure if this interface is of value or help to you. please 
> > > > don't
> > > > hesitate to drop your valuable comments.
> > > > 
> > > > 
> > > > (1) interface definition
> > > > The interface is defined in below way:
> > > > 
> > > >               userspace
> > > >              /         \
> > > >         read/           \write
> > > >            /             v
> > > >   -------------------   -------------------
> > > >  | migration_version | | migration_version | --> check migration
> > > >   -------------------   -------------------      compatibility
> > > >        device A              device B
> > > > 
> > > > 
> > > > a device attribute named migration_version is defined under each 
> > > > device's
> > > > sysfs node. e.g. 
> > > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
> > > > userspace tools read the migration_version as a string from the source 
> > > > device,
> > > > and write it to the migration_version sysfs attribute in the target 
> > > > device.
> > > > 
> > > > The userspace should treat ANY of below conditions as two devices not 
> > > > compatible:
> > > > - any one of the two devices does not have a migration_version attribute
> > > > - error when reading from migration_version attribute of one device
> > > > - error when writing migration_version string of one device to
> > > >   migration_version attribute of the other device
> > > > 
> > > > The string read from migration_version attribute is defined by device 
> > > > vendor
> > > > driver and is completely opaque to the userspace.
> > > > for an Intel vGPU, the string format can be defined like
> > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + 
> > > > "aggregator count".
> > > > 
> > > > for an NVMe VF connecting to a remote storage. it could be
> > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > 
> > > > for a QAT VF, it may be
> > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > 
> > > > (to avoid namespace conflicts between vendors, we may prefix a 
> > > > driver name to
> > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1) 
> > > >  
> > 
> > It's very strange to define it as opaque and then proceed to describe
> > the contents of that opaque string.  The point is that its contents
> > are defined by the vendor driver to describe the device, driver version,
> > and possibly metadata about the configuration of the device.  One
> > instance of a device might generate a different string from another.
> > The string that a device produces is not necessarily the only string
> > the vendor driver will accept, for example the driver might support
> > backwards compatible migrations.  
> 
> (As I've said in the previous discussion, off one of the patch series)
> 
> My vie

Re: device compatibility interface for live migration with assigned devices

2020-07-14 Thread Alex Williamson
On Tue, 14 Jul 2020 17:47:22 +0100
Daniel P. Berrangé  wrote:

> On Tue, Jul 14, 2020 at 10:16:16AM -0600, Alex Williamson wrote:
> > On Tue, 14 Jul 2020 11:21:29 +0100
> > Daniel P. Berrangé  wrote:
> >   
> > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > 
> > > > The string read from migration_version attribute is defined by device 
> > > > vendor
> > > > driver and is completely opaque to the userspace.
> > > > for an Intel vGPU, the string format can be defined like
> > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + 
> > > > "aggregator count".
> > > > 
> > > > for an NVMe VF connecting to a remote storage. it could be
> > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > 
> > > > for a QAT VF, it may be
> > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > 
> > > > (to avoid namespace conflicts between vendors, we may prefix a 
> > > > driver name to
> > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1) 
> > > >  
> > 
> > It's very strange to define it as opaque and then proceed to describe
> > the contents of that opaque string.  The point is that its contents
> > are defined by the vendor driver to describe the device, driver version,
> > and possibly metadata about the configuration of the device.  One
> > instance of a device might generate a different string from another.
> > The string that a device produces is not necessarily the only string
> > the vendor driver will accept, for example the driver might support
> > backwards compatible migrations.  
> 
> 
> > > IMHO there needs to be a mechanism for the kernel to report via sysfs
> > > what versions are supported on a given device. This puts the job of
> > > reporting compatible versions directly under the responsibility of the
> > > vendor who writes the kernel driver for it. They are the ones with the
> > > best knowledge of the hardware they've built and the rules around its
> > > compatibility.  
> > 
> > The version string discussed previously is the version string that
> > represents a given device, possibly including driver information,
> > configuration, etc.  I think what you're asking for here is an
> > enumeration of every possible version string that a given device could
> > accept as an incoming migration stream.  If we consider the string as
> > opaque, that means the vendor driver needs to generate a separate
> > string for every possible version it could accept, for every possible
> > configuration option.  That potentially becomes an excessive amount of
> > data to either generate or manage.
> > 
> > Am I overestimating how vendors intend to use the version string?  
> 
> If I'm interpreting your reply & the quoted text correctly, the version
> string isn't really a version string in any normal sense of the word
> "version".
> 
> Instead it sounds like string encoding a set of features in some arbitrary
> vendor specific format, which they parse and do compatibility checks on
> individual pieces ? One or more parts may contain a version number, but
> its much more than just a version.
> 
> If that's correct, then I'd prefer we didn't call it a version string,
> instead call it a "capability string" to make it clear it is expressing
> a much more general concept, but...

I'd agree with that.  The intent of the previous proposal was to
provide an interface for reading a string and writing a string back,
where the result of that write indicated migration compatibility with
the device.  So yes, "version" is not the right term.
 
> > We'd also need to consider devices that we could create, for instance
> > providing the same interface enumeration prior to creating an mdev
> > device to have a confidence level that the new device would be a valid
> > target.
> > 
> > We defined the string as opaque to allow vendor flexibility and because
> > defining a common format is hard.  Do we need to revisit this part of
> > the discussion to define the version string as non-opaque with parsing
> > rules, probably with separate incoming vs outgoing interfaces?  Thanks,  
> 
> ..even if the huge amount of flexibility is technically relevant from the
> POV of the hardware/drivers, we should consider whether management apps
> actually want, or can use, that level of fl

Re: device compatibility interface for live migration with assigned devices

2020-07-14 Thread Alex Williamson
On Tue, 14 Jul 2020 13:33:24 +0100
Sean Mooney  wrote:

> On Tue, 2020-07-14 at 11:21 +0100, Daniel P. Berrangé wrote:
> > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > hi folks,
> > > we are defining a device migration compatibility interface that helps 
> > > upper
> > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > live migration compatible.
> > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > e.g. we could use it to check whether
> > > - a src MDEV can migrate to a target MDEV,  
> mdev live migration is completely possible to do, but I agree with Dan
> Berrangé's comments:
> from the point of view of openstack integration I don't see calling out to a
> vendor-specific
> tool as an acceptable

As I replied to Dan, I'm hoping Yan was referring more to vendor
specific knowledge rather than actual tools.

> solution for device compatibility checking. the sysfs filesystem
> that describes the mdevs that can be created should also
> contain the relevant information such
> that nova could integrate it via the libvirt xml representation or directly
> retrieve the info from
> sysfs.
> > > - a src VF in SRIOV can migrate to a target VF in SRIOV,  
> so vf to vf migration is not possible in the general case as there is no
> standardised
> way to transfer the device state as part of the sriov specs produced by the
> pci-sig;
> as such there is no vendor-neutral way to support sriov live migration.

We're not talking about a general case, we're talking about physical
devices which have vfio wrappers or hooks with device specific
knowledge in order to support the vfio migration interface.  The point
is that a discussion around vfio device migration cannot be limited to
mdev devices.

> > > - a src MDEV can migration to a target VF in SRIOV.  
> that also makes this unviable
> > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > 
> > > The upper layer stack could use this interface as the last step to check
> > > if one device is able to migrate to another device before triggering a 
> > > real
> > > live migration procedure.  
> well actually that is already too late really. ideally we would want to do
> this compatibility
> check much sooner to avoid the migration failing. in an openstack environment,
> at least
> by the time we invoke libvirt (assuming you're using the libvirt driver) to do
> the migration we have already
> finished scheduling the instance to the new host. if we do the
> compatibility check at this point
> and it fails then the live migration is aborted and will not be retried.
> These types of late checks lead to a
> poor user experience as, unless you check the migration detail, it basically
> looks like the migration was ignored,
> as it starts to migrate and then continues running on the original host.
> 
> when using generic pci passthrough with openstack, the pci alias is intended
> to reference a single vendor id/product
> id so you will have 1+ aliases for each type of device. that allows openstack
> to schedule based on the availability of a
> compatible device because we track inventories of pci devices and can query
> that when selecting a host.
> 
> if we were to support mdev live migration in the future we would want to take
> the same declarative approach:
> 1 introspect the capabilities of the devices we manage
> 2 create inventories of the allocatable devices and their capabilities
> 3 schedule the instance to a host based on the device-type/capabilities and
> claim it atomically to prevent races
> 4 have the lower level hypervisors do additional validation if needed,
> pre-live-migration.
> 
> this proposal seems to be targeting extending step 4, whereas ideally we
> should focus on providing the info that would
> be relevant in step 1, preferably in a vendor-neutral way via a kernel
> interface like /sys.

I think this is reading a whole lot into the phrase "last step".  We
want to make the information available for a management engine to
consume as needed to make informed decisions regarding likely
compatible target devices.
 
> > > we are not sure if this interface is of value or help to you. please don't
> > > hesitate to drop your valuable comments.
> > > 
> > > 
> > > (1) interface definition
> > > The interface is defined in below way:
> > > 
> > >               userspace
> > >              /         \
> > >         read/           \write
> > >            /             v
> > >   -------------------   -------------------
> > >  | migration_version | | migration_version | --> check migration
> > >   -------------------   -------------------      compatibility
> > >        device A              device B
> > > 
> > > 
> > > a device attribute named migration_version is defined under each device's
> > > sysfs node. e.g. 
> > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).  
> this might be useful as we could tag the inventory with the 

Re: device compatibility interface for live migration with assigned devices

2020-07-14 Thread Alex Williamson
On Tue, 14 Jul 2020 11:21:29 +0100
Daniel P. Berrangé  wrote:

> On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
> > hi folks,
> > we are defining a device migration compatibility interface that helps upper
> > layer stack like openstack/ovirt/libvirt to check if two devices are
> > live migration compatible.
> > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > e.g. we could use it to check whether
> > - a src MDEV can migrate to a target MDEV,
> > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > - a src MDEV can migrate to a target VF in SRIOV.
> >   (e.g. SIOV/SRIOV backward compatibility case)
> > 
> > The upper layer stack could use this interface as the last step to check
> > if one device is able to migrate to another device before triggering a real
> > live migration procedure.
> > we are not sure if this interface is of value or help to you. please don't
> > hesitate to drop your valuable comments.
> > 
> > 
> > (1) interface definition
> > The interface is defined in below way:
> > 
> >               userspace
> >              /         \
> >         read/           \write
> >            /             v
> >   -------------------   -------------------
> >  | migration_version | | migration_version | --> check migration
> >   -------------------   -------------------      compatibility
> >        device A              device B
> > 
> > 
> > a device attribute named migration_version is defined under each device's
> > sysfs node. e.g. 
> > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
> > userspace tools read the migration_version as a string from the source 
> > device,
> > and write it to the migration_version sysfs attribute in the target device.
> > 
> > The userspace should treat ANY of below conditions as two devices not 
> > compatible:
> > - any one of the two devices does not have a migration_version attribute
> > - error when reading from migration_version attribute of one device
> > - error when writing migration_version string of one device to
> >   migration_version attribute of the other device
> > 
> > The string read from migration_version attribute is defined by device vendor
> > driver and is completely opaque to the userspace.
> > for an Intel vGPU, the string format can be defined like
> > "parent device PCI ID" + "version of gvt driver" + "mdev type" + 
> > "aggregator count".
> > 
> > for an NVMe VF connecting to a remote storage. it could be
> > "PCI ID" + "driver version" + "configured remote storage URL"
> > 
> > for a QAT VF, it may be
> > "PCI ID" + "driver version" + "supported encryption set".
> > 
> > (to avoid namespace conflicts between vendors, we may prefix a driver 
> > name to
> > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)

It's very strange to define it as opaque and then proceed to describe
the contents of that opaque string.  The point is that its contents
are defined by the vendor driver to describe the device, driver version,
and possibly metadata about the configuration of the device.  One
instance of a device might generate a different string from another.
The string that a device produces is not necessarily the only string
the vendor driver will accept, for example the driver might support
backwards compatible migrations.
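
To make the proposed flow concrete, the userspace side could look roughly
like the sketch below.  This is not from any posted patch; the function
name, the buffer size and the attribute paths are illustrative only.  Note
that the string is never parsed by userspace, matching the opacity
requirement:

#include <fcntl.h>
#include <unistd.h>

/* Returns 1 only if every step of the read/write handshake succeeds. */
static int migration_compatible(const char *src_attr, const char *dst_attr)
{
    char version[512];
    ssize_t len;
    int fd, ok;

    /* missing attribute or read error: not compatible */
    if ((fd = open(src_attr, O_RDONLY)) < 0)
        return 0;
    len = read(fd, version, sizeof(version) - 1);
    close(fd);
    if (len <= 0)
        return 0;
    version[len] = '\0';

    /* the target's vendor driver decides: a failed write means
     * "not compatible" */
    if ((fd = open(dst_attr, O_WRONLY)) < 0)
        return 0;
    ok = write(fd, version, (size_t)len) == len;
    close(fd);
    return ok;
}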

> > (2) backgrounds
> > 
> > The reason we hope the migration_version string is opaque to userspace
> > is that it is hard to generalize standard comparison fields and methods
> > for different devices from different vendors.
> > Though userspace could still do a simple string compare to check whether
> > two devices are compatible, and the result would be correct, that is
> > too limited, as it excludes possible candidates whose migration_version
> > strings happen not to be equal.
> > e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
> > with another MDEV with mdev_type_3, aggregator count 1, even though their
> > migration_version strings are not equal
> > (assuming mdev_type_3 provides three times the resources of mdev_type_1).
> > 
> > Besides that, the driver version and the configured resources are all
> > elements that need to be taken into account.
> > 
> > So, we hope to leave the freedom to the vendor driver and let it make the
> > final decision, via a simple read on the source side and a test write on
> > the target side.
> > 
> > 
> > we then think the device compatibility issues for live migration with 
> > assigned
> > devices can be divided into two steps:
> > a. management tools filter out possible migration target devices.
> >Tags could be created according to info from product specification.
> >we think openstack/ovirt may have vendor proprietary components to create
> >those customized tags for each product from each vendor.  
> 
> >for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to
> >search target vGPU are like:
> 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-19 Thread Alex Williamson
On Tue, 9 Jun 2020 20:37:31 -0400
Yan Zhao  wrote:

> On Fri, Jun 05, 2020 at 03:39:50PM +0100, Dr. David Alan Gilbert wrote:
> > > > > I tried to simplify the problem a bit, but we keep going backwards.  
> > > > > If
> > > > > the requirement is that potentially any source device can migrate to 
> > > > > any
> > > > > target device and we cannot provide any means other than writing an
> > > > > opaque source string into a version attribute on the target and
> > > > > evaluating the result to determine compatibility, then we're requiring
> > > > > userspace to do an exhaustive search to find a potential match.  That
> > > > > sucks. 
> > > >  
> hi Alex and Dave,
> do you think it's good for us to put aside physical devices and mdev 
> aggregation
> for the moment, and use Alex's original idea that
> 
> +  Userspace should regard two mdev devices compatible when ALL of below
> +  conditions are met:
> +  (0) The mdev devices are of the same type
> +  (1) success when reading from migration_version attribute of one mdev 
> device.
> +  (2) success when writing migration_version string of one mdev device to
> +  migration_version attribute of the other mdev device.

I think Pandora's box is already opened, if we can't articulate how
this solution would evolve to support features that we know are coming,
why should we proceed with this approach?  We've already seen interest
in breaking rule (0) in this thread, so we can't focus the solution on
mdev devices.

Maybe the best we can do is to compare one instance of a device to
another instance of a device, without any capability to predict
compatibility prior to creating devices, in the case of mdev.  The
string would need to include not only the device and vendor driver
compatibility, but also anything that has modified the state of the
device, such as creation time or post-creation time configuration.  The
user is left on their own for creating a compatible device, or
filtering devices to determine which might be, or which might generate,
compatible devices.  It's not much of a solution, I wonder if anyone
would even use it.
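
As an aside, condition (0) from the quoted proposal is cheap for userspace
to verify by itself.  A minimal sketch, assuming the standard mdev sysfs
layout where each device directory carries an mdev_type symlink to its
type directory:

#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static const char *last_comp(const char *path)
{
    const char *s = strrchr(path, '/');
    return s ? s + 1 : path;
}

/* dev_a/dev_b: device directories, e.g. /sys/bus/mdev/devices/<uuid> */
static int same_mdev_type(const char *dev_a, const char *dev_b)
{
    char la[PATH_MAX], lb[PATH_MAX], ta[PATH_MAX], tb[PATH_MAX];
    ssize_t na, nb;

    snprintf(la, sizeof(la), "%s/mdev_type", dev_a);
    snprintf(lb, sizeof(lb), "%s/mdev_type", dev_b);

    na = readlink(la, ta, sizeof(ta) - 1);
    nb = readlink(lb, tb, sizeof(tb) - 1);
    if (na < 0 || nb < 0)
        return 0;                 /* not an mdev, or no such link */
    ta[na] = '\0';
    tb[nb] = '\0';

    /* the type name is the last component, e.g. i915-GVTg_V5_8 */
    return strcmp(last_comp(ta), last_comp(tb)) == 0;
}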

> and what about adding another sysfs attribute for vendors to put
> recommended migration compatible device type. e.g.
> #cat 
> /sys/bus/pci/devices/:00:02.0/mdev_supported_types/i915-GVTg_V5_8/migration_compatible_devices
> parent id: 8086 591d
> mdev_type: i915-GVTg_V5_8
> 
> vendors are free to define the format and content of this 
> migration_compatible_devices
> and it need not even be a full list.
> 
> before libvirt or user to do live migration, they have to read and test
> migration_version attributes of src/target devices to check migration 
> compatibility.

AFAICT, free-form, vendor defined attributes are useless to libvirt.
Vendors could already put this information in the description attribute
and have it ignored by userspace tools due to the lack of defined
format.  It's also not clear what value this provides when it's
necessarily incomplete, a driver written today cannot know what future
drivers might be compatible with its migration data.  Thanks,

Alex



Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-05 Thread Alex Williamson
On Fri, 5 Jun 2020 11:22:24 +0100
"Dr. David Alan Gilbert"  wrote:

> * Alex Williamson (alex.william...@redhat.com) wrote:
> > On Wed, 3 Jun 2020 01:24:43 -0400
> > Yan Zhao  wrote:
> >   
> > > On Tue, Jun 02, 2020 at 09:55:28PM -0600, Alex Williamson wrote:  
> > > > On Tue, 2 Jun 2020 23:19:48 -0400
> > > > Yan Zhao  wrote:
> > > > 
> > > > > On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:
> > > > > > On Wed, 29 Apr 2020 20:39:50 -0400
> > > > > > Yan Zhao  wrote:
> > > > > >   
> > > > > > > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert 
> > > > > > > wrote:
> > > > > > >   
> > > > > > > > > > > > > > > > > > > > An mdev type is meant to define a 
> > > > > > > > > > > > > > > > > > > > software compatible interface, so in
> > > > > > > > > > > > > > > > > > > > the case of mdev->mdev migration, 
> > > > > > > > > > > > > > > > > > > > doesn't migrating to a different type
> > > > > > > > > > > > > > > > > > > > fail the most basic of compatibility 
> > > > > > > > > > > > > > > > > > > > tests that we expect userspace to
> > > > > > > > > > > > > > > > > > > > perform?  IOW, if two mdev types are 
> > > > > > > > > > > > > > > > > > > > migration compatible, it seems a
> > > > > > > > > > > > > > > > > > > > prerequisite to that is that they 
> > > > > > > > > > > > > > > > > > > > provide the same software interface,
> > > > > > > > > > > > > > > > > > > > which means they should be the same 
> > > > > > > > > > > > > > > > > > > > mdev type.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or 
> > > > > > > > > > > > > > > > > > > > phys->mdev, how does a
> > > > > > > > > > > > > > > > > > > management
> > > > > > > > > > > > > > > > > > > > tool begin to even guess what might be 
> > > > > > > > > > > > > > > > > > > > compatible?  Are we expecting
> > > > > > > > > > > > > > > > > > > > libvirt to probe ever device with this 
> > > > > > > > > > > > > > > > > > > > attribute in the system?  Is
> > > > > > > > > > > > > > > > > > > > there going to be a new class hierarchy 
> > > > > > > > > > > > > > > > > > > > created to enumerate all
> > > > > > > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > yes, management tool needs to guess and 
> > > > > > > > > > > > > > > > > > > test migration compatible
> > > > > > > > > > > > > > > > > > > between two devices. But I think it's not 
> > > > > > > > > > > > > > > > > > > the problem only for
> > > > > > > > > > > > > > > > > > > mdev->phys or phys->mdev. even for 
> > > > > > > > > > > > > > > > > > > mdev->mdev, management tool needs
> > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-03 Thread Alex Williamson
On Wed, 3 Jun 2020 01:24:43 -0400
Yan Zhao  wrote:

> On Tue, Jun 02, 2020 at 09:55:28PM -0600, Alex Williamson wrote:
> > On Tue, 2 Jun 2020 23:19:48 -0400
> > Yan Zhao  wrote:
> >   
> > > On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:  
> > > > On Wed, 29 Apr 2020 20:39:50 -0400
> > > > Yan Zhao  wrote:
> > > > 
> > > > > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert 
> > > > > wrote:
> > > > > 
> > > > > > > > > > > > > > > > > > An mdev type is meant to define a software 
> > > > > > > > > > > > > > > > > > compatible interface, so in
> > > > > > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't 
> > > > > > > > > > > > > > > > > > migrating to a different type
> > > > > > > > > > > > > > > > > > fail the most basic of compatibility tests 
> > > > > > > > > > > > > > > > > > that we expect userspace to
> > > > > > > > > > > > > > > > > > perform?  IOW, if two mdev types are 
> > > > > > > > > > > > > > > > > > migration compatible, it seems a
> > > > > > > > > > > > > > > > > > prerequisite to that is that they provide 
> > > > > > > > > > > > > > > > > > the same software interface,
> > > > > > > > > > > > > > > > > > which means they should be the same mdev 
> > > > > > > > > > > > > > > > > > type.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or 
> > > > > > > > > > > > > > > > > > phys->mdev, how does a  
> > > > > > > > > > > > > > > > > management  
> > > > > > > > > > > > > > > > > > tool begin to even guess what might be 
> > > > > > > > > > > > > > > > > > compatible?  Are we expecting
> > > > > > > > > > > > > > > > > > libvirt to probe ever device with this 
> > > > > > > > > > > > > > > > > > attribute in the system?  Is
> > > > > > > > > > > > > > > > > > there going to be a new class hierarchy 
> > > > > > > > > > > > > > > > > > created to enumerate all
> > > > > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > > yes, management tool needs to guess and test 
> > > > > > > > > > > > > > > > > migration compatible
> > > > > > > > > > > > > > > > > between two devices. But I think it's not the 
> > > > > > > > > > > > > > > > > problem only for
> > > > > > > > > > > > > > > > > mdev->phys or phys->mdev. even for 
> > > > > > > > > > > > > > > > > mdev->mdev, management tool needs
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > first assume that the two mdevs have the same 
> > > > > > > > > > > > > > > > > type of parent devices
> > > > > > > > > > > > > > > > > (e.g.their pciids are equal). otherwise, it's 
> > > > > > > > > > > > > > > > > still enumerating
> > > > > > > > > > > > > > > > > possibilities.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-02 Thread Alex Williamson
On Tue, 2 Jun 2020 23:19:48 -0400
Yan Zhao  wrote:

> On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:
> > On Wed, 29 Apr 2020 20:39:50 -0400
> > Yan Zhao  wrote:
> >   
> > > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert wrote:
> > >   
> > > > > > > > > > > > > > > > An mdev type is meant to define a software 
> > > > > > > > > > > > > > > > compatible interface, so in
> > > > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't 
> > > > > > > > > > > > > > > > migrating to a different type
> > > > > > > > > > > > > > > > fail the most basic of compatibility tests that 
> > > > > > > > > > > > > > > > we expect userspace to
> > > > > > > > > > > > > > > > perform?  IOW, if two mdev types are migration 
> > > > > > > > > > > > > > > > compatible, it seems a
> > > > > > > > > > > > > > > > prerequisite to that is that they provide the 
> > > > > > > > > > > > > > > > same software interface,
> > > > > > > > > > > > > > > > which means they should be the same mdev type.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or 
> > > > > > > > > > > > > > > > phys->mdev, how does a
> > > > > > > > > > > > > > > management
> > > > > > > > > > > > > > > > tool begin to even guess what might be 
> > > > > > > > > > > > > > > > compatible?  Are we expecting
> > > > > > > > > > > > > > > > libvirt to probe ever device with this 
> > > > > > > > > > > > > > > > attribute in the system?  Is
> > > > > > > > > > > > > > > > there going to be a new class hierarchy created 
> > > > > > > > > > > > > > > > to enumerate all
> > > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > yes, management tool needs to guess and test 
> > > > > > > > > > > > > > > migration compatible
> > > > > > > > > > > > > > > between two devices. But I think it's not the 
> > > > > > > > > > > > > > > problem only for
> > > > > > > > > > > > > > > mdev->phys or phys->mdev. even for mdev->mdev, 
> > > > > > > > > > > > > > > management tool needs
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > first assume that the two mdevs have the same 
> > > > > > > > > > > > > > > type of parent devices
> > > > > > > > > > > > > > > (e.g.their pciids are equal). otherwise, it's 
> > > > > > > > > > > > > > > still enumerating
> > > > > > > > > > > > > > > possibilities.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > on the other hand, for two mdevs,
> > > > > > > > > > > > > > > mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> > > > > > > > > > > > > > > mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> > > > > > > > > > > > > > > if pdev2 is exactly 2 times of pdev1, why not 
> > > > > > > > > > > > > > > allow migration between
> > > > > > > > > > > > > > > mdev1 <-> mdev2.
> > > > > > > > > > > > > > 
> > > > 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-02 Thread Alex Williamson
On Wed, 29 Apr 2020 20:39:50 -0400
Yan Zhao  wrote:

> On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert wrote:
> 
> > > > > > > > > > > > > > An mdev type is meant to define a software 
> > > > > > > > > > > > > > compatible interface, so in
> > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't migrating 
> > > > > > > > > > > > > > to a different type
> > > > > > > > > > > > > > fail the most basic of compatibility tests that we 
> > > > > > > > > > > > > > expect userspace to
> > > > > > > > > > > > > > perform?  IOW, if two mdev types are migration 
> > > > > > > > > > > > > > compatible, it seems a
> > > > > > > > > > > > > > prerequisite to that is that they provide the same 
> > > > > > > > > > > > > > software interface,
> > > > > > > > > > > > > > which means they should be the same mdev type.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In the hybrid cases of mdev->phys or phys->mdev, 
> > > > > > > > > > > > > > how does a  
> > > > > > > > > > > > > management  
> > > > > > > > > > > > > > tool begin to even guess what might be compatible?  
> > > > > > > > > > > > > > Are we expecting
> > > > > > > > > > > > > > libvirt to probe ever device with this attribute in 
> > > > > > > > > > > > > > the system?  Is
> > > > > > > > > > > > > > there going to be a new class hierarchy created to 
> > > > > > > > > > > > > > enumerate all
> > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > yes, management tool needs to guess and test 
> > > > > > > > > > > > > migration compatible
> > > > > > > > > > > > > between two devices. But I think it's not the problem 
> > > > > > > > > > > > > only for
> > > > > > > > > > > > > mdev->phys or phys->mdev. even for mdev->mdev, 
> > > > > > > > > > > > > management tool needs
> > > > > > > > > > > > > to
> > > > > > > > > > > > > first assume that the two mdevs have the same type of 
> > > > > > > > > > > > > parent devices
> > > > > > > > > > > > > (e.g.their pciids are equal). otherwise, it's still 
> > > > > > > > > > > > > enumerating
> > > > > > > > > > > > > possibilities.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > on the other hand, for two mdevs,
> > > > > > > > > > > > > mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> > > > > > > > > > > > > mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> > > > > > > > > > > > > if pdev2 is exactly 2 times of pdev1, why not allow 
> > > > > > > > > > > > > migration between
> > > > > > > > > > > > > mdev1 <-> mdev2.  
> > > > > > > > > > > > 
> > > > > > > > > > > > How could the manage tool figure out that 1/2 of pdev1 
> > > > > > > > > > > > is equivalent 
> > > > > > > > > > > > to 1/4 of pdev2? If we really want to allow such thing 
> > > > > > > > > > > > happen, the best
> > > > > > > > > > > > choice is to report the same mdev type on both pdev1 
> > > > > > > > > > > > and pdev2.  
> > > > > > > > > > > I think that's exactly the value of this 
> > > > > > > > > > > migration_version interface.
> > > > > > > > > > > the management tool can take advantage of this interface 
> > > > > > > > > > > to know if two
> > > > > > > > > > > devices are migration compatible, no matter they are 
> > > > > > > > > > > mdevs, non-mdevs,
> > > > > > > > > > > or mix.
> > > > > > > > > > > 
> > > > > > > > > > > as I know, (please correct me if not right), current 
> > > > > > > > > > > libvirt still
> > > > > > > > > > > requires manually generating mdev devices, and it just 
> > > > > > > > > > > duplicates src vm
> > > > > > > > > > > configuration to the target vm.
> > > > > > > > > > > for libvirt, currently it's always phys->phys and 
> > > > > > > > > > > mdev->mdev (and of the
> > > > > > > > > > > same mdev type).
> > > > > > > > > > > But it does not justify that hybrid cases should not be 
> > > > > > > > > > > allowed. otherwise,
> > > > > > > > > > > why do we need to introduce this migration_version 
> > > > > > > > > > > interface and leave
> > > > > > > > > > > the judgement of migration compatibility to vendor 
> > > > > > > > > > > driver? why not simply
> > > > > > > > > > > set the criteria to something like "pciids of parent 
> > > > > > > > > > > devices are equal,
> > > > > > > > > > > and mdev types are equal" ?
> > > > > > > > > > > 
> > > > > > > > > > >   
> > > > > > > > > > > > btw mdev<->phys just brings trouble to upper stack as 
> > > > > > > > > > > > Alex pointed out.   
> > > > > > > > > > > could you help me understand why it will bring trouble to 
> > > > > > > > > > > upper stack?
> > > > > > > > > > > 
> > > > > > > > > > > I think it just needs to read src migration_version under 
> > > > > > > > > > > src dev node,
> > > > > > > > > > > and test it in target migration version under target dev 
> > > > > > > > > > > node. 
> > > > > > > > > > > 
> > > > > > > > > > > after all, through this interface we just help the upper 
> > > > > > > > > > > layer
> > > > > > > > > > > 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-20 Thread Alex Williamson
On Sun, 19 Apr 2020 21:24:57 -0400
Yan Zhao  wrote:

> On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > On Fri, 17 Apr 2020 05:52:02 -0400
> > Yan Zhao  wrote:
> >   
> > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:  
> > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > Yan Zhao  wrote:
> > > > 
> > > > > This patchset introduces a migration_version attribute under sysfs of 
> > > > > VFIO
> > > > > Mediated devices.
> > > > > 
> > > > > This migration_version attribute is used to check migration 
> > > > > compatibility
> > > > > between two mdev devices.
> > > > > 
> > > > > Currently, it has two locations:
> > > > > (1) under mdev_type node,
> > > > > which can be used even before device creation, but only for mdev
> > > > > devices of the same mdev type.
> > > > > (2) under mdev device node,
> > > > > which can only be used after the mdev devices are created, but 
> > > > > the src
> > > > > and target mdev devices are not necessarily be of the same mdev 
> > > > > type
> > > > > (The second location is newly added in v5, in order to keep consistent
> > > > > with the migration_version node for migratable pass-though devices)   
> > > > >  
> > > > 
> > > > What is the relationship between those two attributes?
> > > > 
> > > (1) is for mdev devices specifically, and (2) is provided to keep the same
> > > sysfs interface as with non-mdev cases. so (2) is for both mdev devices 
> > > and
> > > non-mdev devices.
> > > 
> > > in future, if we enable vfio-pci vendor ops (i.e. a non-mdev device
> > > is bound to vfio-pci, but is able to register a migration region and do
> > > migration transactions from a vendor provided affiliate driver),
> > > the vendor driver would export (2) directly, under device node.
> > > It is not able to provide (1) as there're no mdev devices involved.  
> > 
> > Ok, creating an alternate attribute for non-mdev devices makes sense.
> > However, wouldn't that rather be a case (3)? The change here only
> > refers to mdev devices.
> >  
> as you pointed below, (3) and (2) serve the same purpose. 
> and I think a possible usage is to migrate between a non-mdev device and
> an mdev device. so I think it's better for them both to use (2) rather
> than creating (3).

An mdev type is meant to define a software compatible interface, so in
the case of mdev->mdev migration, doesn't migrating to a different type
fail the most basic of compatibility tests that we expect userspace to
perform?  IOW, if two mdev types are migration compatible, it seems a
prerequisite to that is that they provide the same software interface,
which means they should be the same mdev type.

In the hybrid cases of mdev->phys or phys->mdev, how does a management
tool begin to even guess what might be compatible?  Are we expecting
libvirt to probe every device with this attribute in the system?  Is
there going to be a new class hierarchy created to enumerate all
possible migrate-able devices?

I agree that there was a gap in the previous proposal for non-mdev
devices, but I think this bring a lot of questions that we need to
puzzle through and libvirt will need to re-evaluate how they might
decide to pick a migration target device.  For example, I'm sure
libvirt would reject any policy decisions regarding picking a physical
device versus an mdev device.  Had we previously left it that only a
layer above libvirt would select a target device and libvirt only tests
compatibility to that target device?

We also need to consider that this expands the namespace.  If we no
longer require matching types as the first level of comparison, then
vendor migration strings can theoretically collide.  How do we
coordinate that can't happen?  Thanks,

Alex

> > > > Is existence (and compatibility) of (1) a pre-req for possible
> > > > existence (and compatibility) of (2)?
> > > >
> > > > no. (2) does not rely on (1).  
> > 
> > Hm. Non-existence of (1) seems to imply "this type does not support
> > migration". If an mdev created for such a type suddenly does support
> > migration, it feels a bit odd.
> >   
> yes. but I think if the condition happens, it should be reported as a bug
> to the vendor driver.
> should I add a line in the doc like "vendor driver should ensure that the
> migration compatibility from migration_version under mdev_type should be
> consistent with that from migration_version under device node" ?
> 
> > (It obviously cannot be a prereq for what I called (3) above.)
> >   
> > >   
> > > > Does userspace need to check (1) or can it completely rely on (2), if
> > > > it so chooses?
> > > >
> > > I think it can completely rely on (2) if a compatibility check before
> > > mdev creation is not required.
> > >   
> > > > If devices with a different mdev type are indeed compatible, it seems
> > > > userspace can only find out after the devices have actually been
> > > > created, as (1) does not apply?
> > > yes, I think so.   
> > 
> > How useful would 

Re: [PATCH v4 0/2] introduction of migration_version attribute for VFIO live migration

2020-03-24 Thread Alex Williamson
On Tue, 24 Mar 2020 09:23:31 +
"Dr. David Alan Gilbert"  wrote:

> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > On Tue, Mar 24, 2020 at 05:29:59AM +0800, Alex Williamson wrote:  
> > > On Mon, 3 Jun 2019 20:34:22 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > On Tue, Jun 04, 2019 at 03:29:32AM +0800, Alex Williamson wrote:  
> > > > > On Thu, 30 May 2019 20:44:38 -0400
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > This patchset introduces a migration_version attribute under sysfs 
> > > > > > of VFIO
> > > > > > Mediated devices.
> > > > > > 
> > > > > > This migration_version attribute is used to check migration 
> > > > > > compatibility
> > > > > > between two mdev devices of the same mdev type.
> > > > > > 
> > > > > > Patch 1 defines migration_version attribute in
> > > > > > Documentation/vfio-mediated-device.txt
> > > > > > 
> > > > > > Patch 2 uses GVT as an example to show how to expose 
> > > > > > migration_version
> > > > > > attribute and check migration compatibility in vendor driver.
> > > > > 
> > > > > Thanks for iterating through this, it looks like we've settled on
> > > > > something reasonable, but now what?  This is one piece of the puzzle 
> > > > > to
> > > > > supporting mdev migration, but I don't think it makes sense to commit
> > > > > this upstream on its own without also defining the remainder of how we
> > > > > actually do migration, preferably with more than one working
> > > > > implementation and at least prototyped, if not final, QEMU support.  I
> > > > > hope that was the intent, and maybe it's now time to look at the next
> > > > > piece of the puzzle.  Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > Got it. 
> > > > Also thank you and all for discussing and guiding all along:)
> > > > We'll move to the next episode now.  
> > > 
> > > Hi Yan,
> > > 
> > > As we're hopefully moving towards a migration API, would it make sense
> > > to refresh this series at the same time?  I think we're still expecting
> > > a vendor driver implementing Kirti's migration API to also implement
> > > this sysfs interface for compatibility verification.  Thanks,
> > >  
> > Hi Alex
> > Got it!
> > Thanks for reminding of this. And as now we have vfio-pci implementing
> > vendor ops to allow live migration of pass-through devices, is it
> > necessary to implement similar sysfs node for those devices?
> > or do you think just PCI IDs of those devices are enough for libvirt to
> > know device compatibility ?  
> 
> Wasn't the problem that we'd have to know how to check for things like:
>   a) Whether different firmware versions in the device were actually
> compatible
>   b) Whether minor hardware differences were compatible - e.g. some
> hardware might let you migrate to the next version of hardware up.

Yes, minor changes in hardware or firmware that may not be represented
in the device ID or hardware revision.  Also the version is as much for
indicating the compatibility of the vendor defined migration protocol
as it is for the hardware itself.  I certainly wouldn't be so bold as
to create a protocol that is guaranteed compatible forever.  We'll need
to expose the same sysfs attribute in some standard location for
non-mdev devices.  I assume vfio-pci would provide the vendor ops some
mechanism to expose these in a standard namespace of sysfs attributes
under the device itself.  Perhaps that indicates we need to link the
mdev type version under the mdev device as well to make this
transparent to userspace tools like libvirt.  Thanks,

Alex



Re: [PATCH v4 0/2] introduction of migration_version attribute for VFIO live migration

2020-03-23 Thread Alex Williamson
On Mon, 3 Jun 2019 20:34:22 -0400
Yan Zhao  wrote:

> On Tue, Jun 04, 2019 at 03:29:32AM +0800, Alex Williamson wrote:
> > On Thu, 30 May 2019 20:44:38 -0400
> > Yan Zhao  wrote:
> >   
> > > This patchset introduces a migration_version attribute under sysfs of VFIO
> > > Mediated devices.
> > > 
> > > This migration_version attribute is used to check migration compatibility
> > > between two mdev devices of the same mdev type.
> > > 
> > > Patch 1 defines migration_version attribute in
> > > Documentation/vfio-mediated-device.txt
> > > 
> > > Patch 2 uses GVT as an example to show how to expose migration_version
> > > attribute and check migration compatibility in vendor driver.  
> > 
> > Thanks for iterating through this, it looks like we've settled on
> > something reasonable, but now what?  This is one piece of the puzzle to
> > supporting mdev migration, but I don't think it makes sense to commit
> > this upstream on its own without also defining the remainder of how we
> > actually do migration, preferably with more than one working
> > implementation and at least prototyped, if not final, QEMU support.  I
> > hope that was the intent, and maybe it's now time to look at the next
> > piece of the puzzle.  Thanks,
> > 
> > Alex  
> 
> Got it. 
> Also thank you and all for discussing and guiding all along:)
> We'll move to the next episode now.

Hi Yan,

As we're hopefully moving towards a migration API, would it make sense
to refresh this series at the same time?  I think we're still expecting
a vendor driver implementing Kirti's migration API to also implement
this sysfs interface for compatibility verification.  Thanks,

Alex



Re: [libvirt] [PATCH v6 0/4] PCI hostdev partial assignment support

2019-12-17 Thread Alex Williamson
On Tue, 17 Dec 2019 17:35:01 -0300
Daniel Henrique Barboza  wrote:

> changes from previous version 5 [1]:
> - changes in the commit message of patch 1 and the
> documentation included in patches 3 and 4, all of them
> suggested/hinted by Alex Williamson.


Seems conceptually sound and I believe the descriptions are accurate now,
I'll leave it to libvirt folks to determine that it does what it says
in a reasonable way ;)  Thanks,

Alex

 
> Daniel Henrique Barboza (4):
>   Introducing new address type='unassigned' for PCI hostdevs
>   qemu: handle unassigned PCI hostdevs in command line
>   formatdomain.html.in: document <address type='unassigned'/>
>   news.xml: add address type='unassigned' entry
> 
> 
> [1] https://www.redhat.com/archives/libvir-list/2019-December/msg01097.html
> 
>  docs/formatdomain.html.in | 10 
>  docs/news.xml | 14 +
>  docs/schemas/domaincommon.rng |  5 ++
>  src/conf/device_conf.c|  2 +
>  src/conf/device_conf.h|  1 +
>  src/conf/domain_conf.c|  7 ++-
>  src/qemu/qemu_command.c   |  5 ++
>  src/qemu/qemu_domain.c|  1 +
>  src/qemu/qemu_domain_address.c|  5 ++
>  .../hostdev-pci-address-unassigned.args   | 31 ++
>  .../hostdev-pci-address-unassigned.xml| 42 ++
>  tests/qemuxml2argvtest.c  |  4 ++
>  .../hostdev-pci-address-unassigned.xml| 58 +++
>  tests/qemuxml2xmltest.c   |  1 +
>  14 files changed, 185 insertions(+), 1 deletion(-)
>  create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args
>  create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml
>  create mode 100644 
> tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml
> 

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [PATCH v5 4/4] news.xml: add address type='unassigned' entry

2019-12-17 Thread Alex Williamson
On Tue, 17 Dec 2019 16:06:47 -0300
Daniel Henrique Barboza  wrote:

> Signed-off-by: Daniel Henrique Barboza 
> ---
>  docs/news.xml | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/docs/news.xml b/docs/news.xml
> index 2a25b6ca49..febda970f6 100644
> --- a/docs/news.xml
> +++ b/docs/news.xml
> @@ -54,6 +54,20 @@
>written in the RST as an alternative to HTML.
>  
>
> +  
> +
> +  new PCI hostdev address type: unassigned
> +
> +
> +  A new PCI hostdev address type 'unassigned' is introduced,
> +  giving users the option to choose which PCI hostdevs
> +  within the same IOMMU group will not be assigned to the
> +  guest. PCI hostdevs that shouldn't be used by the guest
> +  can be classified as <address type='unassigned'/>.
> +  Libvirt will still be able to manage the device as a
> +  regular PCI hostdev.
> +

This is rather convoluted.  Users have always had the choice of which
devices NOT to assign.  I would present this as a mechanism for libvirt
to manage the driver binding of devices, as done for managed PCI hostdev
devices, without actually attaching the devices to the VM, thereby
facilitating device assignment when only a subset of the devices within
an IOMMU group are intended to be used by the guest.

BTW, this should also work with the hostdev  option if
the user wanted to bind the device to pci-stub rather than vfio-pci to
provide even further isolation from the user being able to access the
device.  Thanks,

Alex
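
For illustration, the kind of domain XML fragment being described would
look something like the following (the host address is hypothetical; the
device is managed by libvirt and bound away from its host driver, but not
attached to the guest):

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
  </source>
  <address type='unassigned'/>
</hostdev>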

> +  
>  
>  
>

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [PATCH v5 3/4] formatdomain.html.in: document <address type='unassigned'/>

2019-12-17 Thread Alex Williamson
On Tue, 17 Dec 2019 16:06:46 -0300
Daniel Henrique Barboza  wrote:

> Signed-off-by: Daniel Henrique Barboza 
> ---
>  docs/formatdomain.html.in | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
> index e06cf2061b..7a5ebdd67e 100644
> --- a/docs/formatdomain.html.in
> +++ b/docs/formatdomain.html.in
> @@ -4203,6 +4203,19 @@
>  attributes: iobase and irq.
>  Since 1.2.1
>
> +  unassigned
> +  For PCI hostdevs, <address type='unassigned'/>
> +allows the admin to include a PCI hostdev in the domain XML 
> definition,
> +without making it available for the guest. This allows for 
> configurations
> +in which Libvirt manages the device as a regular PCI hostdev,
> +regardless of whether the guest will have access to it. This is
> +an alternative to scenarios in which the admin might be compelled to 
> use
> +an ACS patch to remove the device from the guest while Libvirt
> +retains control of the PCI device.

The ACS patch is really orthogonal to the goal here, so I don't think
it should be included in the discussion.  A user can just as easily
pre-bind other devices to vfio-pci to make the configuration viable
rather than patch their kernel to change the viability constraints,
which this series doesn't accomplish either.  Thanks,

Alex

> +<address type='unassigned'/> is an invalid address
> +type for all other device types.
> +Since 6.0.0
> +  
>  
>  
>  Virtio-related options

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [PATCH v5 1/4] Introducing new address type='unassigned' for PCI hostdevs

2019-12-17 Thread Alex Williamson
On Tue, 17 Dec 2019 16:06:44 -0300
Daniel Henrique Barboza  wrote:

> Today, to use a PCI hostdev "A" in a domain, all PCI devices
> that belong to the same IOMMU group must also be declared in
> the domain XML, meaning that all IOMMU devices are detached
> from the host and all of them are visible to the guest.

This is not accurate.  All endpoint devices in the IOMMU group need to
be "viable" (ie. bound to vfio-pci, pci-stub, unbound) for any device
within the group to be used through vfio.  There is absolutely no
requirement that they all be declared in the XML or assigned to the
guest.  Also please clarify "IOMMU devices".  Is that all devices
within the IOMMU group?
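
For reference, a minimal userspace sketch of that viability test follows.
It is simplified in that it treats every group member as an endpoint,
whereas a real check would skip bridges and other non-endpoint devices:

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* bdf: PCI address such as "0000:01:00.0" */
static int group_is_viable(const char *bdf)
{
    char grp[PATH_MAX], drv[PATH_MAX], tgt[PATH_MAX];
    struct dirent *ent;
    DIR *dir;
    int viable = 1;

    snprintf(grp, sizeof(grp),
             "/sys/bus/pci/devices/%s/iommu_group/devices", bdf);
    if (!(dir = opendir(grp)))
        return 0;

    while (viable && (ent = readdir(dir))) {
        const char *name;
        ssize_t len;

        if (ent->d_name[0] == '.')
            continue;
        snprintf(drv, sizeof(drv), "%s/%s/driver", grp, ent->d_name);
        len = readlink(drv, tgt, sizeof(tgt) - 1);
        if (len < 0)
            continue;            /* unbound device: viable */
        tgt[len] = '\0';
        name = strrchr(tgt, '/');
        name = name ? name + 1 : tgt;
        if (strcmp(name, "vfio-pci") != 0 && strcmp(name, "pci-stub") != 0)
            viable = 0;          /* bound to a host driver */
    }
    closedir(dir);
    return viable;
}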
 
> The result is that the guest will have access to all devices,
> but this is not necessarily what the administrator wanted. If only
> the hostdev "A" was intended for guest usage, but hostdevs "B" and
> "C" happened to be in the same IOMMU group of "A", the guest will
> gain access to all 3 devices. This makes the administrator rely
> on alternative solutions, such as use all hostdevs with un-managed
> mode and detached all the IOMMU before the guest starts. If
> use un-managed mode is not an option, the only alternative left is
> an ACS patch to deny guest access to "B" and "C".

Also not accurate.  The libvirt hooks interface can be used to manage
the additional devices ad-hoc, the additional devices might not need
management if they're not endpoints, they might be statically bound to
vfio-pci at boot time, etc.  I don't think the scenario is nearly as
dire as presented here.  "detached all the IOMMU" doesn't make sense.

> This patch introduces a new address type called "unassigned" to
> handle this situation where a hostdev will be owned by a domain, but
> not visible to the guest OS. This allows the administrator to
> declare all the IOMMU while also choosing which hostdevs will be

"declare all the IOMMU" doesn't make sense.

> usable by the guest. This new mechanic applies to all PCI hostdevs,
> regardless of whether they are a PCI multifunction hostdev or not.
> Using <address type='unassigned'/> in any case other than a PCI
> hostdev will result in an error.

Regardless of my comments above, I think this support is worthwhile,
but let's not pretend there aren't solutions, this just makes it easier
to accomplish in a supported and dynamic way.  Thanks,

Alex

> Next patch will use this new address type in the QEMU driver to
> avoid adding unassigned devices to the QEMU launch command line.
> 
> Signed-off-by: Daniel Henrique Barboza 
> ---
>  docs/schemas/domaincommon.rng |  5 ++
>  src/conf/device_conf.c|  2 +
>  src/conf/device_conf.h|  1 +
>  src/conf/domain_conf.c|  7 ++-
>  src/qemu/qemu_command.c   |  1 +
>  src/qemu/qemu_domain.c|  1 +
>  src/qemu/qemu_domain_address.c|  5 ++
>  .../hostdev-pci-address-unassigned.xml| 42 ++
>  .../hostdev-pci-address-unassigned.xml| 58 +++
>  tests/qemuxml2xmltest.c   |  1 +
>  10 files changed, 122 insertions(+), 1 deletion(-)
>  create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml
>  create mode 100644 
> tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml
> 
> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
> index e964773f5e..5f1d4a34a4 100644
> --- a/docs/schemas/domaincommon.rng
> +++ b/docs/schemas/domaincommon.rng
> @@ -5502,6 +5502,11 @@
>
>
>  
> +
> +  
> +unassigned
> +  
> +
>
>  
>
> diff --git a/src/conf/device_conf.c b/src/conf/device_conf.c
> index 4c57f0995f..4dbd5c1ac9 100644
> --- a/src/conf/device_conf.c
> +++ b/src/conf/device_conf.c
> @@ -45,6 +45,7 @@ VIR_ENUM_IMPL(virDomainDeviceAddress,
>"virtio-mmio",
>"isa",
>"dimm",
> +  "unassigned",
>  );
>  
>  static int
> @@ -120,6 +121,7 @@ virDomainDeviceInfoAddressIsEqual(const 
> virDomainDeviceInfo *a,
>  /* address types below don't have any specific data */
>  case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO:
>  case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_S390:
> +case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED:
>  break;
>  
>  case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI:
> diff --git a/src/conf/device_conf.h b/src/conf/device_conf.h
> index d98fae775c..e091d7cfe2 100644
> --- a/src/conf/device_conf.h
> +++ b/src/conf/device_conf.h
> @@ -45,6 +45,7 @@ typedef enum {
>  VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO,
>  VIR_DOMAIN_DEVICE_ADDRESS_TYPE_ISA,
>  VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DIMM,
> +VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED,
>  
>  VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST
>  } virDomainDeviceAddressType;
> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
> index 

Re: [libvirt] [PATCH v4 0/5] PCI hostdev partial assignment support

2019-12-17 Thread Alex Williamson
On Tue, 17 Dec 2019 13:43:14 -0300
Daniel Henrique Barboza  wrote:

> On 12/17/19 1:32 PM, Alex Williamson wrote:
> > On Tue, 17 Dec 2019 11:25:38 -0500
> > Laine Stump  wrote:
> >   
> >> On 12/16/19 6:03 PM, Daniel Henrique Barboza wrote:  
> >>> About breaking existing configurations, there is the possibility of not
> >>> going forward with patch 03, which is enforcing this rule of declaring
> >>> all the
> >>> IOMMU group. Existing domains will keep working as usual, the option to
> >>> unassign devices will still be present, but the user will have to deal
> >>> with
> >>> the potential QEMU errors if not all PCI devices were detached from
> >>> the host.
> >>>
> >>> In this case, the 'unassigned' type will become more of a ON/OFF
> >>> switch to
> >>> add/remove the PCI hostdev from the guest without removing it from the
> >>> domain XML. It is still useful, but we lose the idea of all the IOMMU
> >>> devices being described in the domain XML, which is something Laine
> >>> mentioned it would be desirable in one of the RFCs.  
> >>
> >>
> >> I don't actually recall saying that :-). I haven't looked in the list
> >> archives, but what I *can* imagine myself saying is that only devices
> >> mentioned in the XML should be manipulated in any way by libvirt. So,  
> > 
> > +1
> > 
> >> for example, you shouldn't unbind device X from its host driver if there
> >> is nothing in the XML telling you to do that. But if a device isn't
> >> mentioned in the XML, and is already bound to some driver that is
> >> acceptable to the VFIO subsystem (e.g. vfio-pci, pci-stub or no driver
> >> at all (? is that right Alex?)) then that should not create any problem.  
> > 
> > Yes, that's right.
> >   
> >> Doing otherwise would break too many existing configs. (For example, my
> >> own assigned-GPU config, which assumes that all the devices are already
> >> bound to the proper driver, and uses "managed='no'")  
> 
> 
> This is the new approach of the series I implemented today and plan to
> to send for review today/tomorrow. I realized, after all the discussions
> yesterday with Alex, that the patch series would be best just sticking with
> what we want fixed (managed=yes and partial assignment) and leaving
> unmanaged (managed=no) configurations alone. If the user has an existing,
> working unmanaged setup, this means that the user chose to manage device
> detach/re-attach manually and shouldn't be bothered with a change that's
> aimed to managed devices.

There are of course existing, working managed=yes configurations where
the set of assigned devices is only a subset of the IOMMU group, and
the user has configured other means to make the group viable relative
to vfio.  The statement above doesn't convince me that the next
iteration isn't simply going to restrict its manipulation of other
devices.  As Laine says above and I said in my previous reply, libvirt
should not manipulate the driver binding of any devices not explicitly
listed in the domain XMl.  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [PATCH v4 0/5] PCI hostdev partial assignment support

2019-12-17 Thread Alex Williamson
On Tue, 17 Dec 2019 11:25:38 -0500
Laine Stump  wrote:

> On 12/16/19 6:03 PM, Daniel Henrique Barboza wrote:
> >
> >
> > On 12/16/19 7:28 PM, Cole Robinson wrote:  
> >> On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:  
> >>> changes from version 3 [1]:
> >>> - removed last 2 patches that made function 0 of
> >>> PCI multifunction devices mandatory
> >>> - new patch: news.xml update
> >>> - changed 'since' version to 6.0.0 in patch 4
> >>> - unassigned hostdevs are now getting qemu aliases
> >>>
> >>> [1] 
> >>> https://www.redhat.com/archives/libvir-list/2019-November/msg01263.html
> >>>
> >>> Daniel Henrique Barboza (5):
> >>>    Introducing new address type='unassigned' for PCI hostdevs
> >>>    qemu: handle unassigned PCI hostdevs in command line
> >>>    virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices
> >>>    formatdomain.html.in: document <address type='unassigned'/>
> >>>    news.xml: add address type='unassigned' entry
> >>>  
> >>
> >> Codewise it looks fine now. But I'm looking more closely at patch #3 and
> >> realizing that it can explicitly reject a previously accepted VM config.
> >> And indeed, now that I give it a test with my GPU passthrough setup, it
> >> is rejecting my previosly working config.
> >>
> >> error: Requested operation is not valid: All devices of the same IOMMU
> >> group 1 of the PCI device :01:00.0 must belong to domain win10
> >>
> >> I've attached the nodedev XML for the three devices with iommuGroup 1.
> >> Only the two nvidia devices are assigned to my VM, but not the PCIe
> >> controller device.
> >>
> >> Is the libvirt heuristic missing something? Or is this acting as 
> >> expected?  
> >
> > You mentioned that you declared 3 devices of IOMMU group 1. Unless the 
> > code in
> > patch 3 has a bug, there are more PCI hostdevs in IOMMU group 1 that 
> > were left
> > out of the domain XML.
> >
> >  
> >>
> >> I didn't quite gather that this is a change to reject previously
> >> accepted configurations, so I will defer to Laine and Alex as to whether
> >> this should be committed.  
> >
> >
> > I mentioned in the commit msg of patch 03 that this would break working
> > configurations that didn't comply with the new 'all devices of the IOMMU
> > group must be included in the domain XML' directive. Perhaps this is 
> > worth
> > mentioning in the 'news' page to warn users about it.  
> 
> 
> No, this shouldn't be a requirement at all. In my mind the purpose of 
> these patches is to make something work (in a safe manner) that failed 
> before, *not* to add new restrictions that break things that already 
> work. (Sorry I wasn't paying more attention to the patches earlier).

+1

> > About breaking existing configurations, there is the possibility of not
> > going forward with patch 03, which is enforcing this rule of declaring 
> > all the
> > IOMMU group. Existing domains will keep working as usual, the option to
> > unassign devices will still be present, but the user will have to deal 
> > with
> > the potential QEMU errors if not all PCI devices were detached from 
> > the host.
> >
> > In this case, the 'unassigned' type will become more of a ON/OFF 
> > switch to
> > add/remove the PCI hostdev from the guest without removing it from the
> > domain XML. It is still useful, but we lose the idea of all the IOMMU
> > devices being described in the domain XML, which is something Laine
> > mentioned it would be desirable in one of the RFCs.  
> 
> 
> I don't actually recall saying that :-). I haven't looked in the list 
> archives, but what I *can* imagine myself saying is that only devices 
> mentioned in the XML should be manipulated in any way by libvirt. So,

+1
 
> for example, you shouldn't unbind device X from its host driver if there 
> is nothing in the XML telling you to do that. But if a device isn't 
> mentioned in the XML, and is already bound to some driver that is 
> acceptable to the VFIO subsystem (e.g. vfio-pci, pci-stub or no driver 
> at all (? is that right Alex?)) then that should not create any problem.

Yes, that's right.

> Doing otherwise would break too many existing configs. (For example, my 
> own assigned-GPU config, which assumes that all the devices are already 
> bound to the proper driver, and uses "managed='no'")

Effectively anyone assigning PFs with a PCIe root port that does not
support ACS would be broken by this series.  That's a significant
portion of vfio users.  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH v4 0/5] PCI hostdev partial assignment support

2019-12-16 Thread Alex Williamson
On Mon, 16 Dec 2019 21:09:20 -0300
Daniel Henrique Barboza  wrote:

> On 12/16/19 8:43 PM, Alex Williamson wrote:
> > On Mon, 16 Dec 2019 20:24:56 -0300
> > Daniel Henrique Barboza  wrote:
> >   
> >>
> >> The code isn't forcing a device to be assigned to the guest. It is forcing
> >> all the IOMMU devices to be declared in the domain XML to be detached from
> >> the host.  
> > 
> > Detached from the host by unbinding from host drivers and binding to
> > vfio-pci and "partially" assigned to the guest?  That's wrong, we can't
> > do that.  Not only will vfio-pci not bind to anything but endpoints,
> > you'll break the host binding bridges that might be part of the group,
> > and there are valid use cases for sequestering a device with pci-stub
> > rather than vfio-pci to add another barrier to the user getting access
> > to the device.
> > 
> >> What I did was to extend a verification Libvirt already does, to check for
> >> PCI devices of the same IOMMU X being used by other domains, to check the
> >> the host as well. Guest start fails if there is any device left in IOMMU X
> >> that's not present in the domain.  
> > 
> > Yep, can't do that.  
> 
> 
> Thanks for the info.
> 
> To keep the discussion focused, this is the error I'm trying to dodge:
> 
> error: internal error: qemu unexpectedly closed the monitor: 
> 2019-10-04T12:39:41.091312Z qemu-system-ppc64: -device 
> vfio-pci,host=0001:09:00.3,id=hostdev0,bus=pci.2.0,addr=0x1.0x3:
> vfio 0001:09:00.3: group 1 is not viable
> Please ensure all devices within the iommu_group are bound to their vfio bus 
> driver.
> 
> This happens when not all PCI devices from IOMMU group 1 are bound to 
> vfio_pci, regardless
> of whether QEMU is going to use all of them in the guest. Binding all the 
> IOMMU
> devices to vfio-pci makes QEMU satisfied, in this particular case.
> 
> What is the minimal condition to avoid this error? What Libvirt is doing ATM 
> is not enough
> (it will fail to launch with this QEMU error above), and what I'm proposing 
> is wrong.
> Can we say that all PCI endpoints of the same IOMMU must be assigned to 
> vfio-pci?

Yes, but libvirt should not assume that it can manipulate the bindings
of adjacent devices without being explicitly directed to do so.  The
error may be a hindrance to you, but it might also prevent, for
example, the only other NIC in the system being detached from the host
driver.  Is it worth making the VM run without explicitly listing all
devices to assign at the cost of disrupting host services or subverting
the additional isolation a user might be attempting to configure with
having unused devices bound to vfio-pci.  This seems like a bad idea,
the VM should be configured to explicitly list every device it needs to
have assigned or partially assigned.  Thanks,

Alex
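
For completeness, the explicit per-device preparation a management layer
can perform for each device it is directed to assign is the standard
driver_override sequence.  A minimal sketch; error paths and unbinding any
currently bound driver are omitted:

#include <stdio.h>

/* bdf: PCI address such as "0000:01:00.1" */
static int bind_to_vfio_pci(const char *bdf)
{
    char path[256];
    FILE *f;

    /* tell the PCI core which driver may claim this device ... */
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/driver_override", bdf);
    if (!(f = fopen(path, "w")))
        return -1;
    fputs("vfio-pci", f);
    fclose(f);

    /* ... then ask for the device to be (re-)probed */
    if (!(f = fopen("/sys/bus/pci/drivers_probe", "w")))
        return -1;
    fputs(bdf, f);
    fclose(f);
    return 0;
}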

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [PATCH v4 0/5] PCI hostdev partial assignment support

2019-12-16 Thread Alex Williamson
On Mon, 16 Dec 2019 20:24:56 -0300
Daniel Henrique Barboza  wrote:

> On 12/16/19 8:06 PM, Alex Williamson wrote:
> > On Mon, 16 Dec 2019 17:28:28 -0500
> > Cole Robinson  wrote:
> >   
> >> On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:  
> >>> changes from version 3 [1]:  
> > 
> > Thanks for catching this!  PCIe root ports or bridges being part of an
> > IOMMU group is part of the nature of the grouping.  However, only
> > endpoint devices can be bound to vfio-pci and thus participate in this
> > "partial assignment".  If the code is trying to force all other devices
> > in the IOMMU group that aren't assigned into this partial assignment
> > mode, that's just fundamentally broken.  Thanks,  
> 
> The code isn't forcing a device to be assigned to the guest. It is forcing
> all the IOMMU devices to be declared in the domain XML to be detached from
> the host.

Detached from the host by unbinding from host drivers and binding to
vfio-pci and "partially" assigned to the guest?  That's wrong, we can't
do that.  Not only will vfio-pci not bind to anything but endpoints,
you'll break the host binding bridges that might be part of the group,
and there are valid use cases for sequestering a device with pci-stub
rather than vfio-pci to add another barrier to the user getting access
to the device.
 
> What I did was to extend a verification Libvirt already does, to check for
> PCI devices of the same IOMMU X being used by other domains, to check the
> the host as well. Guest start fails if there is any device left in IOMMU X
> that's not present in the domain.

Yep, can't do that.

> In short, the code is implying that all IOMMU devices must be detached from
> the host, regardless of whether they're going to be used in the guest,
> regardless of whether they are PCI root ports or bridges. Is this assumption
> correct, considering kernel/QEMU?

Nope, please don't do this.  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [PATCH v4 0/5] PCI hostdev partial assignment support

2019-12-16 Thread Alex Williamson
On Mon, 16 Dec 2019 17:28:28 -0500
Cole Robinson  wrote:

> On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:
> > changes from version 3 [1]:
> > - removed last 2 patches that made function 0 of 
> > PCI multifunction devices mandatory
> > - new patch: news.xml update
> > - changed 'since' version to 6.0.0 in patch 4
> > - unassigned hostdevs are now getting qemu aliases
> > 
> > [1] https://www.redhat.com/archives/libvir-list/2019-November/msg01263.html
> > 
> > Daniel Henrique Barboza (5):
> >   Introducing new address type='unassigned' for PCI hostdevs
> >   qemu: handle unassigned PCI hostdevs in command line
> >   virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices
> >   formatdomain.html.in: document <address type='unassigned'/>
> >   news.xml: add address type='unassigned' entry
> >   
> 
> Codewise it looks fine now. But I'm looking more closely at patch #3 and
> realizing that it can explicitly reject a previously accepted VM config.
> And indeed, now that I give it a test with my GPU passthrough setup, it
> is rejecting my previously working config.
> 
> error: Requested operation is not valid: All devices of the same IOMMU
> group 1 of the PCI device :01:00.0 must belong to domain win10
> 
> I've attached the nodedev XML for the three devices with iommuGroup 1.
> Only the two nvidia devices are assigned to my VM, but not the PCIe
> controller device.
> 
> Is the libvirt heuristic missing something? Or is this acting as expected?
> 
> I didn't quite gather that this is a change to reject previously
> accepted configurations, so I will defer to Laine and Alex as to whether
> this should be committed.

Thanks for catching this!  PCIe root ports or bridges being part of an
IOMMU group is part of the nature of the grouping.  However, only
endpoint devices can be bound to vfio-pci and thus participate in this
"partial assignment".  If the code is trying to force all other devices
in the IOMMU group that aren't assigned into this partial assignment
mode, that's just fundamentally broken.  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-12 Thread Alex Williamson
On Thu, 12 Dec 2019 12:09:48 +0800
Jason Wang  wrote:

> On 2019/12/7 上午1:42, Alex Williamson wrote:
> > On Fri, 6 Dec 2019 17:40:02 +0800
> > Jason Wang  wrote:
> >  
> >> On 2019/12/6 下午4:22, Yan Zhao wrote:  
> >>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:  
> >>>> On 2019/12/5 下午4:51, Yan Zhao wrote:  
> >>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:  
> >>>>>> Hi:
> >>>>>>
> >>>>>> On 2019/12/5 上午11:24, Yan Zhao wrote:  
> >>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without 
> >>>>>>> host
> >>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>>>>>> dynamic host mediation is required to  (1) get device states, (2) get
> >>>>>>> dirty pages. Since device states as well as other critical information
> >>>>>>> required for dirty page tracking for VFs are usually retrieved from 
> >>>>>>> PFs,
> >>>>>>> it is handy to provide an extension in PF driver to centralizingly 
> >>>>>>> control
> >>>>>>> VFs' migration.
> >>>>>>>
> >>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>>>> dynamically trap VFs' bars for dirty page tracking and  
> >>>>>> A silly question, what's the reason for doing this, is this a must for 
> >>>>>> dirty
> >>>>>> page tracking?
> >>>>>> 
> >>>>> For performance consideration. VFs' bars should be passthoughed at
> >>>>> normal time and only enter into trap state on need.  
> >>>> Right, but how does this matter for the case of dirty page tracking?
> >>>> 
> >>> Take NIC as an example, to trap its VF dirty pages, software way is
> >>> required to trap every write of ring tail that resides in BAR0.  
> >>
> >> Interesting, but it looks like we need:
> >> - decode the instruction
> >> - mediate all access to BAR0
> >> All of which seems a great burden for the VF driver. I wonder whether or
> >> not doing interrupt relay and tracking head is better in this case.  
> > This sounds like a NIC specific solution, I believe the goal here is to
> > allow any device type to implement a partial mediation solution, in
> > this case to sufficiently track the device while in the migration
> > saving state.  
> 
> 
> I suspect there's a solution that can work for any device type. E.g for 
> virtio, the avail index (head) doesn't belong to any BAR and the device may 
> decide to disable the doorbell from the guest. The same goes for interrupt relay, since the 
> driver may choose to disable interrupts from the device. In this case, the 
> only way to track dirty pages correctly is to switch to software datapath.
> 
> 
> >  
> >>> There's
> >>> still no IOMMU Dirty bit available.  
> >>>>>>>  (3) centralizing
> >>>>>>> VF critical states retrieving and VF controls into one driver, we 
> >>>>>>> propose
> >>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>>>>>
> >>>>>>>
> >>>>>>>                                    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>>>>  __________   register mediate ops |   ___________     ___________   |
> >>>>>>> |          |<----------------------|--|           |   |           |  |
> >>>>>>> | vfio-pci |                       |  | VF        |   | PF driver |  |
> >>>>>>> |__________|---------------------->|  | mediate   |   |___________|  |
> >>>>>>>      |           open(pdev)        |  | driver    |         |        |
> >>>>>>>      |                             |   -----------          |        |
> >>>>>>>      |                             |_ _ _ _ _ _ _ _ _ _ _ _ | _ _ _ _|
> >>>>>>>     \|/                                                    \|/
> >>>>>>>  ----------                                            ----------
> >>>>>>>  |   VF    |                                           |   PF    |
> >>>>>>>  ----------                                            ----------

Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-11 Thread Alex Williamson
On Wed, 11 Dec 2019 21:02:40 -0500
Yan Zhao  wrote:

> On Thu, Dec 12, 2019 at 02:56:55AM +0800, Alex Williamson wrote:
> > On Wed, 11 Dec 2019 01:25:55 -0500
> > Yan Zhao  wrote:
> >   
> > > On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:  
> > > > On Tue, 10 Dec 2019 02:44:44 -0500
> > > > Yan Zhao  wrote:
> > > > 
> > > > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > > > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > > > Yan Zhao  wrote:
> > > > > >   
> > > > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:  
> > > > > > >     
> > > > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > > > Yan Zhao  wrote:
> > > > > > > > 
> > > > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson 
> > > > > > > > > wrote:
> > > > > > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > >   
> > > > > > > > > > > Dynamic trap bar info region is a channel for QEMU and 
> > > > > > > > > > > vendor driver to
> > > > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > > > 
> > > > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > > > When QEMU detects a device region of this type, it will 
> > > > > > > > > > > create an
> > > > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > > > When vendor driver signals this eventfd, QEMU reads trap 
> > > > > > > > > > > field of this
> > > > > > > > > > > info region.
> > > > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > > > regions and disable all the sparse mmaped subregions (if 
> > > > > > > > > > > the sparse
> > > > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > > > > 
> > > > > > > > > > > A typical usage is
> > > > > > > > > > > 1. vendor driver first cuts its bar 0 into several 
> > > > > > > > > > > sections, all in a
> > > > > > > > > > > sparse mmap array. So initially, all its bar 0 are 
> > > > > > > > > > > passthroughed.
> > > > > > > > > > > 2. vendor driver specifies part of bar 0 sections to be 
> > > > > > > > > > > disablable.
> > > > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and 
> > > > > > > > > > > set trap to true
> > > > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable 
> > > > > > > > > > > flags on.
> > > > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor 
> > > > > > > > > > > driver be able
> > > > > > > > > > > to trap access of bar 0 registers and make dirty page 
> > > > > > > > > > > tracking possible.
> > > > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to 
> > > > > > > > > > > QEMU again.
> > > > > > > > > > > QEMU reads trap field of this info region which is false 
> > > > > > > > > > > and QEMU
> > > > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > > > > 

Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-11 Thread Alex Williamson
On Wed, 11 Dec 2019 01:25:55 -0500
Yan Zhao  wrote:

> On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> > On Tue, 10 Dec 2019 02:44:44 -0500
> > Yan Zhao  wrote:
> >   
> > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:  
> > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > Yan Zhao  wrote:
> > > > 
> > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > Yan Zhao  wrote:
> > > > > >   
> > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:  
> > > > > > > 
> > > > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > > > Yan Zhao  wrote:
> > > > > > > > 
> > > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor 
> > > > > > > > > driver to
> > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > 
> > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > When QEMU detects a device region of this type, it will 
> > > > > > > > > create an
> > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > When vendor driver signals this eventfd, QEMU reads trap 
> > > > > > > > > field of this
> > > > > > > > > info region.
> > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > regions and disable all the sparse mmaped subregions (if the 
> > > > > > > > > sparse
> > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > > 
> > > > > > > > > A typical usage is
> > > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, 
> > > > > > > > > all in a
> > > > > > > > > sparse mmap array. So initially, all its bar 0 are 
> > > > > > > > > passthroughed.
> > > > > > > > > 2. vendor driver specifies part of bar 0 sections to be 
> > > > > > > > > disablable.
> > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set 
> > > > > > > > > trap to true
> > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable 
> > > > > > > > > flags on.
> > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor 
> > > > > > > > > driver be able
> > > > > > > > > to trap access of bar 0 registers and make dirty page 
> > > > > > > > > tracking possible.
> > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU 
> > > > > > > > > again.
> > > > > > > > > QEMU reads trap field of this info region which is false and 
> > > > > > > > > QEMU
> > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > > 
> > > > > > > > > Vendor driver specifies whether it supports 
> > > > > > > > > dynamic-trap-bar-info region
> > > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > > 
> > > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with 
> > > > > > > > > region len=0
> > > > > > > > > and region->ops=null.
> > > > > > > > > Vendor driver should override this region's len, flags, rw, 
> > > > 

Re: [libvirt] [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-10 Thread Alex Williamson
On Mon, 9 Dec 2019 21:44:23 -0500
Yan Zhao  wrote:

> > > > > Currently, yes, i40e has build dependency on vfio-pci.
> > > > > It's like this: if i40e decides to support SRIOV and compiles in VF-
> > > > > related code that depends on vfio-pci, it will also have a build
> > > > > dependency on vfio-pci. Isn't it natural?
> > > > 
> > > > No, this is not natural.  There are certainly i40e VF use cases that
> > > > have no interest in vfio and having dependencies between the two
> > > > modules is unacceptable.  I think you probably want to modularize the
> > > > i40e vfio support code and then perhaps register a table in vfio-pci
> > > > that the vfio-pci code can perform a module request when using a
> > > > compatible device.  Just an idea, there might be better options.  I
> > > > will not accept a solution that requires unloading the i40e driver in
> > > > order to unload the vfio-pci driver.  It's inconvenient with just one
> > > > NIC driver, imagine how poorly that scales.
> > > > 
> > > what about this way:
> > > mediate driver registers a module notifier and every time when
> > > vfio_pci is loaded, register to vfio_pci its mediate ops?
> > > (Just like in below sample code)
> > > This way vfio-pci is free to unload and this registering only gives
> > > vfio-pci a name of what module to request.
> > > After that,
> > > in vfio_pci_open(), vfio-pci requests the mediate driver. (or puts
> > > the mediate driver when mediate driver does not support mediating the
> > > device)
> > > in vfio_pci_release(), vfio-pci puts the mediate driver.
> > > 
> > > static void register_mediate_ops(void)
> > > {
> > > int (*func)(struct vfio_pci_mediate_ops *ops) = NULL;
> > > 
> > > func = symbol_get(vfio_pci_register_mediate_ops);
> > > 
> > > if (func) {
> > > func(&igd_dt_ops);
> > > symbol_put(vfio_pci_register_mediate_ops);
> > > }
> > > }
> > > 
> > > static int igd_module_notify(struct notifier_block *self,
> > >   unsigned long val, void *data)
> > > {
> > > struct module *mod = data;
> > > int ret = 0;
> > > 
> > > switch (val) {
> > > case MODULE_STATE_LIVE:
> > > if (!strcmp(mod->name, "vfio_pci"))
> > > register_mediate_ops();
> > > break;
> > > case MODULE_STATE_GOING:
> > > break;
> > > default:
> > > break;
> > > }
> > > return ret;
> > > }
> > > 
> > > static struct notifier_block igd_module_nb = {
> > > .notifier_call = igd_module_notify,
> > > .priority = 0,
> > > };
> > > 
> > > 
> > > 
> > > static int __init igd_dt_init(void)
> > > {
> > >   ...
> > >   register_mediate_ops();
> > >   register_module_notifier(&igd_module_nb);
> > >   ...
> > >   return 0;
> > > }  
> > 
> > 
> > No, this is bad.  Please look at MODULE_ALIAS() and request_module() as
> > used in the vfio-platform for loading reset driver modules.  I think
> > the correct approach is that vfio-pci should perform a request_module()
> > based on the device being probed.  Having the mediation provider
> > listening for vfio-pci and registering itself regardless of whether we
> > intend to use it assumes that we will want to use it and assumes that
> > the mediation provider module is already loaded.  We should be able to
> > support demand loading of modules that may serve no other purpose than
> > providing this mediation.  Thanks,  
> hi Alex
> Thanks for this message.
> So is it good to create a separate module as mediation provider driver,
> and alias its module name to "vfio-pci-mediate-vid-did".
> Then when vfio-pci probes the device, it requests a module of that name?

I think this would give us an option to have the mediator as a separate
module, but not require it.  Maybe rather than a request_module(),
where if we follow the platform reset example we'd then expect the init
code for the module to register into a list, we could do a
symbol_request().  AIUI, this would give us a reference to the symbol
if the module providing it is already loaded, and request a module
(perhaps via an alias) if it's not already loaded.  Thanks,

Alex
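
To make the suggested pattern concrete, here is a minimal sketch of
alias-based demand loading, modeled on the vfio-platform reset modules;
the alias format and function below are illustrative, not from the
series:

#include <linux/module.h>
#include <linux/kmod.h>
#include <linux/pci.h>

/* In the mediation provider: advertise the devices it can mediate so
 * the module can be demand-loaded; this alias format is hypothetical. */
MODULE_ALIAS("vfio-pci-mediate-8086:1515");

/* In vfio-pci's open path: load (or simply reference) a provider for
 * the device being opened; a harmless no-op if none exists. */
static void vfio_pci_request_mediator(struct pci_dev *pdev)
{
	request_module("vfio-pci-mediate-%04x:%04x",
		       pdev->vendor, pdev->device);
}

A symbol_request() on the provider's registration symbol would combine
this with taking a reference when the module is already loaded.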




Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-10 Thread Alex Williamson
On Tue, 10 Dec 2019 02:44:44 -0500
Yan Zhao  wrote:

> On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > On Mon, 9 Dec 2019 01:22:12 -0500
> > Yan Zhao  wrote:
> >   
> > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:  
> > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > Yan Zhao  wrote:
> > > > 
> > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > Yan Zhao  wrote:
> > > > > >   
> > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor 
> > > > > > > driver to
> > > > > > > communicate dynamic trap info. It is of type
> > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > 
> > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > When QEMU detects a device region of this type, it will create an
> > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > When vendor driver signals this eventfd, QEMU reads trap field of 
> > > > > > > this
> > > > > > > info region.
> > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > regions and disable all the sparse mmaped subregions (if the 
> > > > > > > sparse
> > > > > > > mmaped subregion is disablable).
> > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > 
> > > > > > > A typical usage is
> > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all 
> > > > > > > in a
> > > > > > > sparse mmap array. So initially, all its bar 0 are passthroughed.
> > > > > > > 2. vendor driver specifies part of bar 0 sections to be 
> > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap 
> > > > > > > to true
> > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags 
> > > > > > > on.
> > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver 
> > > > > > > be able
> > > > > > > to trap access of bar 0 registers and make dirty page tracking 
> > > > > > > possible.
> > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU 
> > > > > > > again.
> > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > 
> > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info 
> > > > > > > region
> > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > 
> > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with 
> > > > > > > region len=0
> > > > > > > and region->ops=null.
> > > > > > > Vendor driver should override this region's len, flags, rw, mmap 
> > > > > > > in its
> > > > > > > vfio_pci_mediate_ops.  
> > > > > > 
> > > > > > TBH, I don't like this interface at all.  Userspace doesn't pass 
> > > > > > data
> > > > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > > > configuring user signaling with eventfds.  I think we only need to
> > > > > > define an IRQ type that tells the user to re-evaluate the sparse 
> > > > > > mmap
> > > > > > information for a region.  The user would enumerate the device IRQs 
> > > > > > via
> > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > indicate which region(s) should be re-evaluated on signaling.  The 
> > > > > > user
> > > > > > would e

Re: [libvirt] [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-09 Thread Alex Williamson
On Sun, 8 Dec 2019 22:42:25 -0500
Yan Zhao  wrote:

> On Sat, Dec 07, 2019 at 05:22:26AM +0800, Alex Williamson wrote:
> > On Fri, 6 Dec 2019 02:56:55 -0500
> > Yan Zhao  wrote:
> >   
> > > On Fri, Dec 06, 2019 at 07:55:19AM +0800, Alex Williamson wrote:  
> > > > On Wed,  4 Dec 2019 22:25:36 -0500
> > > > Yan Zhao  wrote:
> > > > 
> > > > > when vfio-pci is bound to a physical device, almost all the hardware
> > > > > resources are passthroughed.
> > > > > Sometimes, vendor driver of this physical device may want to mediate 
> > > > > some
> > > > > hardware resource access for a short period of time, e.g. dirty page
> > > > > tracking during live migration.
> > > > > 
> > > > > Here we introduce mediate ops in vfio-pci for this purpose.
> > > > > 
> > > > > Vendor driver can register a mediate ops to vfio-pci.
> > > > > But rather than directly bind to the passthroughed device, the
> > > > > vendor driver is now either a module that does not bind to any device 
> > > > > or
> > > > > a module binds to other device.
> > > > > E.g. when passing through a VF device that is bound to vfio-pci 
> > > > > modules,
> > > > > PF driver that binds to PF device can register to vfio-pci to mediate
> > > > > VF's regions, hence supporting VF live migration.
> > > > > 
> > > > > The sequence goes like this:
> > > > > 1. Vendor driver register its vfio_pci_mediate_ops to vfio-pci driver
> > > > > 
> > > > > 2. vfio-pci maintains a list of those registered vfio_pci_mediate_ops
> > > > > 
> > > > > 3. Whenever vfio-pci opens a device, it searches the list and calls
> > > > > vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> > > > > mediating this device.
> > > > > Upon a successful return value from vfio_pci_mediate_ops->open(),
> > > > > vfio-pci will stop list searching and store a mediate handle to
> > > > > represent this open into vendor driver.
> > > > > (so if multiple vendor drivers support mediating a device through
> > > > > vfio_pci_mediate_ops, only one will win, depending on their 
> > > > > registering
> > > > > sequence)
> > > > > 
> > > > > 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in 
> > > > > vfio-pci
> > > > > ops, it will chain into vfio_pci_mediate_ops->get_region_info(), so 
> > > > > that
> > > > > vendor driver is able to override a region's default flags and caps,
> > > > > e.g. adding a sparse mmap cap to passthrough only sub-regions of a 
> > > > > whole
> > > > > region.
> > > > > 
> > > > > 5. vfio_pci_rw()/vfio_pci_mmap() first calls into
> > > > > vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmaps().
> > > > > if pt=true is returned, vfio_pci_rw()/vfio_pci_mmap() will further
> > > > > passthrough this read/write/mmap to physical device, otherwise it just
> > > > > returns without touching the physical device.
> > > > > 
> > > > > 6. When vfio-pci closes a device, vfio_pci_release() chains into
> > > > > vfio_pci_mediate_ops->release() to close the reference in vendor 
> > > > > driver.
> > > > > 
> > > > > 7. Vendor driver unregisters its vfio_pci_mediate_ops when the driver exits
> > > > > 
> > > > > Cc: Kevin Tian 
> > > > > 
> > > > > Signed-off-by: Yan Zhao 
> > > > > ---
> > > > >  drivers/vfio/pci/vfio_pci.c         | 146 ++++++++++++++++++++++++++++
> > > > >  drivers/vfio/pci/vfio_pci_private.h |   2 +
> > > > >  include/linux/vfio.h                |  16 +++
> > > > >  3 files changed, 164 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > > > > index 02206162eaa9..55080ff29495 100644
> > > > > --- a/drivers/vfio/pci/vfio_pci.c
> > > > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > > > @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | 
> > > > > S_IWUSR);
> > > > >  MODULE_PARM_DESC(disable_idle_d3,
> 

Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-09 Thread Alex Williamson
On Mon, 9 Dec 2019 01:22:12 -0500
Yan Zhao  wrote:

> On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > On Fri, 6 Dec 2019 01:04:07 -0500
> > Yan Zhao  wrote:
> >   
> > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:  
> > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > Yan Zhao  wrote:
> > > > 
> > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver 
> > > > > to
> > > > > communicate dynamic trap info. It is of type
> > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > 
> > > > > This region has two fields: dt_fd and trap.
> > > > > When QEMU detects a device region of this type, it will create an
> > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > When vendor driver signals this eventfd, QEMU reads trap field of this
> > > > > info region.
> > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > mmaped subregion is disablable).
> > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > 
> > > > > A typical usage is
> > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > sparse mmap array. So initially, all its bar 0 are passthroughed.
> > > > > 2. vendor driver specifies part of bar 0 sections to be disablable.
> > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to 
> > > > > true
> > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be 
> > > > > able
> > > > > to trap access of bar 0 registers and make dirty page tracking 
> > > > > possible.
> > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > re-passthrough the whole bar 0 region.
> > > > > 
> > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info 
> > > > > region
> > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > vfio_pci_mediate_ops->open().
> > > > > 
> > > > > If vfio-pci detects this cap, it will create a default
> > > > > dynamic_trap_bar_info region on behalf of vendor driver with region 
> > > > > len=0
> > > > > and region->ops=null.
> > > > > Vendor driver should override this region's len, flags, rw, mmap in 
> > > > > its
> > > > > vfio_pci_mediate_ops.
> > > > 
> > > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > configuring user signaling with eventfds.  I think we only need to
> > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > information for a region.  The user would enumerate the device IRQs via
> > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > > would enable that signaling via SET_IRQS and simply re-evaluate the
> > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > >   
> > > > sparse mmap capability for the associated regions when signaled.
> > > 
> > > Do you like the "disablable" flag of sparse mmap?
> > > I think it's a lightweight way for user to switch mmap state of a whole 
> > > region,
> > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > region might be too heavy.  
> > 
> > No, I don't like the disable-able flag.  At what frequency do we expect
> > regions to change?  It seems like we'd only change when switching into
> > and out of the _SAVING state, which is rare.  It seems easy for
> > userspace, at least QEMU, to drop the entire mmap configuration and  
> ok. I'll try this way.
> 
> > re-read it.  Another concern here is how do we synchronize the event?
> > Are we assuming that this event woul

Re: [libvirt] [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-06 Thread Alex Williamson
On Fri, 6 Dec 2019 02:56:55 -0500
Yan Zhao  wrote:

> On Fri, Dec 06, 2019 at 07:55:19AM +0800, Alex Williamson wrote:
> > On Wed,  4 Dec 2019 22:25:36 -0500
> > Yan Zhao  wrote:
> >   
> > > when vfio-pci is bound to a physical device, almost all the hardware
> > > resources are passthroughed.
> > > Sometimes, vendor driver of this physical device may want to mediate some
> > > hardware resource access for a short period of time, e.g. dirty page
> > > tracking during live migration.
> > > 
> > > Here we introduce mediate ops in vfio-pci for this purpose.
> > > 
> > > Vendor driver can register a mediate ops to vfio-pci.
> > > But rather than directly bind to the passthroughed device, the
> > > vendor driver is now either a module that does not bind to any device or
> > > a module binds to other device.
> > > E.g. when passing through a VF device that is bound to vfio-pci modules,
> > > PF driver that binds to PF device can register to vfio-pci to mediate
> > > VF's regions, hence supporting VF live migration.
> > > 
> > > The sequence goes like this:
> > > 1. Vendor driver register its vfio_pci_mediate_ops to vfio-pci driver
> > > 
> > > 2. vfio-pci maintains a list of those registered vfio_pci_mediate_ops
> > > 
> > > 3. Whenever vfio-pci opens a device, it searches the list and calls
> > > vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> > > mediating this device.
> > > Upon a successful return value from vfio_pci_mediate_ops->open(),
> > > vfio-pci will stop list searching and store a mediate handle to
> > > represent this open into vendor driver.
> > > (so if multiple vendor drivers support mediating a device through
> > > vfio_pci_mediate_ops, only one will win, depending on their registering
> > > sequence)
> > > 
> > > 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
> > > ops, it will chain into vfio_pci_mediate_ops->get_region_info(), so that
> > > vendor driver is able to override a region's default flags and caps,
> > > e.g. adding a sparse mmap cap to passthrough only sub-regions of a whole
> > > region.
> > > 
> > > 5. vfio_pci_rw()/vfio_pci_mmap() first calls into
> > > vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmaps().
> > > if pt=true is returned, vfio_pci_rw()/vfio_pci_mmap() will further
> > > passthrough this read/write/mmap to physical device, otherwise it just
> > > returns without touching the physical device.
> > > 
> > > 6. When vfio-pci closes a device, vfio_pci_release() chains into
> > > vfio_pci_mediate_ops->release() to close the reference in vendor driver.
> > > 
> > > 7. Vendor driver unregisters its vfio_pci_mediate_ops when the driver exits
> > > 
> > > Cc: Kevin Tian 
> > > 
> > > Signed-off-by: Yan Zhao 
> > > ---
> > >  drivers/vfio/pci/vfio_pci.c         | 146 ++++++++++++++++++++++++++++
> > >  drivers/vfio/pci/vfio_pci_private.h |   2 +
> > >  include/linux/vfio.h                |  16 +++
> > >  3 files changed, 164 insertions(+)
> > > 
> > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > > index 02206162eaa9..55080ff29495 100644
> > > --- a/drivers/vfio/pci/vfio_pci.c
> > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > >  MODULE_PARM_DESC(disable_idle_d3,
> > >"Disable using the PCI D3 low power state for idle, unused 
> > > devices");
> > >  
> > > +static LIST_HEAD(mediate_ops_list);
> > > +static DEFINE_MUTEX(mediate_ops_list_lock);
> > > +struct vfio_pci_mediate_ops_list_entry {
> > > + struct vfio_pci_mediate_ops *ops;
> > > + int refcnt;
> > > + struct list_head        next;
> > > +};
> > > +
> > >  static inline bool vfio_vga_disabled(void)
> > >  {
> > >  #ifdef CONFIG_VFIO_PCI_VGA
> > > @@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
> > >   if (!(--vdev->refcnt)) {
> > >   vfio_spapr_pci_eeh_release(vdev->pdev);
> > >   vfio_pci_disable(vdev);
> > > + if (vdev->mediate_ops && vdev->mediate_ops->release) {
> > > + vdev->mediate_ops->release(vdev->mediate_handle);
> > > + 

Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-06 Thread Alex Williamson
On Fri, 6 Dec 2019 17:40:02 +0800
Jason Wang  wrote:

> On 2019/12/6 4:22 PM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:  
> >> On 2019/12/5 4:51 PM, Yan Zhao wrote:
> >>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:  
>  Hi:
> 
>  On 2019/12/5 11:24 AM, Yan Zhao wrote:
> > For SRIOV devices, VFs are passthroughed into guest directly without 
> > host
> > driver mediation. However, when VMs migrating with passthroughed VFs,
> > dynamic host mediation is required to  (1) get device states, (2) get
> > dirty pages. Since device states as well as other critical information
> > required for dirty page tracking for VFs are usually retrieved from PFs,
> > it is handy to provide an extension in the PF driver to centrally
> > control VFs' migration.
> >
> > Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> > dynamically trap VFs' bars for dirty page tracking and  
>  A silly question, what's the reason for doing this, is this a must for 
>  dirty
>  page tracking?
>   
> >>> For performance consideration. VFs' bars should be passthroughed at
> >>> normal time and only enter into trap state on need.  
> >>
> >> Right, but how does this matter for the case of dirty page tracking?
> >>  
> > Take a NIC as an example: to trap its VF dirty pages, a software approach is
> > required to trap every write of the ring tail that resides in BAR0.
> 
> 
> Interesting, but it looks like we need:
> - decode the instruction
> - mediate all access to BAR0
> All of which seems a great burden for the VF driver. I wonder whether or 
> not doing interrupt relay and tracking head is better in this case.

This sounds like a NIC specific solution, I believe the goal here is to
allow any device type to implement a partial mediation solution, in
this case to sufficiently track the device while in the migration
saving state.

> >   There's
> > still no IOMMU Dirty bit available.  
> > (3) centralizing
> > VF critical states retrieving and VF controls into one driver, we 
> > propose
> > to introduce mediate ops on top of current vfio-pci device driver.
> >
> >
> >                                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >  __________   register mediate ops |   ___________     ___________   |
> > |          |<----------------------|--|           |   |           |  |
> > | vfio-pci |                       |  | VF        |   | PF driver |  |
> > |__________|---------------------->|  | mediate   |   |___________|  |
> >      |           open(pdev)        |  | driver    |         |        |
> >      |                             |   -----------          |        |
> >      |                             |_ _ _ _ _ _ _ _ _ _ _ _ | _ _ _ _|
> >     \|/                                                    \|/
> >  ----------                                            ----------
> >  |   VF    |                                           |   PF    |
> >  ----------                                            ----------
> >
> >
> > VF mediate driver could be a standalone driver that does not bind to
> > any devices (as in demo code in patches 5-6) or it could be a built-in
> > extension of PF driver (as in patches 7-9).
> >
> > Rather than directly binding to the VF, the VF mediate driver registers a mediate
> > ops into vfio-pci in driver init. vfio-pci maintains a list of such
> > mediate ops.
> > (Note that: VF mediate driver can register mediate ops into vfio-pci
> > before vfio-pci binding to any devices. And VF mediate driver can
> > support mediating multiple devices.)
> >
> > When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> > list and calls each vfio_pci_mediate_ops->open() with pdev of the 
> > opening
> > device as a parameter.
> > VF mediate driver should return success or failure depending on whether it
> > supports the pdev or not.
> > E.g. VF mediate driver would compare its supported VF devfn with the
> > devfn of the passed-in pdev.
> > Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> > stop querying other mediate ops and bind the opening device with this
> > mediate ops using the returned mediate handle.
> >
> > Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on 
> > the
> > VF will be intercepted into VF mediate driver as
> > vfio_pci_mediate_ops->get_region_info(),
> > vfio_pci_mediate_ops->rw,
> > vfio_pci_mediate_ops->mmap, and get customized.
> > For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> > further return 'pt' to indicate whether vfio-pci should further
> > passthrough data to hw.
> >
> > when vfio-pci closes the VF, it calls its 
> > vfio_pci_mediate_ops->release()
> > with a mediate 

Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-06 Thread Alex Williamson
On Fri, 6 Dec 2019 01:04:07 -0500
Yan Zhao  wrote:

> On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > On Wed,  4 Dec 2019 22:26:50 -0500
> > Yan Zhao  wrote:
> >   
> > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > communicate dynamic trap info. It is of type
> > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > 
> > > This region has two fields: dt_fd and trap.
> > > When QEMU detects a device region of this type, it will create an
> > > eventfd and write its eventfd id to dt_fd field.
> > > When vendor driver signals this eventfd, QEMU reads trap field of this
> > > info region.
> > > - If trap is true, QEMU would search the device's PCI BAR
> > > regions and disable all the sparse mmaped subregions (if the sparse
> > > mmaped subregion is disablable).
> > > - If trap is false, QEMU would re-enable those subregions.
> > > 
> > > A typical usage is
> > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > sparse mmap array. So initially, all its bar 0 are passthroughed.
> > > 2. vendor driver specifies part of bar 0 sections to be disablable.
> > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > QEMU reads trap field of this info region which is false and QEMU
> > > re-passthrough the whole bar 0 region.
> > > 
> > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > vfio_pci_mediate_ops->open().
> > > 
> > > If vfio-pci detects this cap, it will create a default
> > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > and region->ops=null.
> > > Vendor driver should override this region's len, flags, rw, mmap in its
> > > vfio_pci_mediate_ops.  
> > 
> > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > configuring user signaling with eventfds.  I think we only need to
> > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > information for a region.  The user would enumerate the device IRQs via
> > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > indicate which region(s) should be re-evaluated on signaling.  The user
> > would enable that signaling via SET_IRQS and simply re-evaluate the  
> ok. I'll try to switch to this way. Thanks for this suggestion.
> 
> > sparse mmap capability for the associated regions when signaled.  
> 
> Do you like the "disablable" flag of sparse mmap?
> I think it's a lightweight way for user to switch mmap state of a whole 
> region,
> otherwise going through a complete flow of GET_REGION_INFO and re-setup
> region might be too heavy.

No, I don't like the disable-able flag.  At what frequency do we expect
regions to change?  It seems like we'd only change when switching into
and out of the _SAVING state, which is rare.  It seems easy for
userspace, at least QEMU, to drop the entire mmap configuration and
re-read it.  Another concern here is how do we synchronize the event?
Are we assuming that this event would occur when a user switch to
_SAVING mode on the device?  That operation is synchronous, the device
must be in saving mode after the write to device state completes, but
it seems like this might be trying to add an asynchronous dependency.
Will the write to device_state only complete once the user handles the
eventfd?  How would the kernel know when the mmap re-evaluation is
complete.  It seems like there are gaps here that the vendor driver
could miss traps required for migration because the user hasn't
completed the mmap transition yet.  Thanks,

Alex
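
To make that flow concrete, a rough sketch of the userspace side
against the existing uapi; the "re-evaluate sparse mmap" IRQ itself is
the proposal here, so the irq_index discovery (via GET_IRQ_INFO) and
its semantics are assumptions, not current uapi:

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Wire an eventfd to the (proposed) "re-evaluate sparse mmap" IRQ;
 * irq_index would come from enumerating the device's IRQ infos. */
static int watch_sparse_mmap(int device_fd, __u32 irq_index)
{
	char buf[sizeof(struct vfio_irq_set) + sizeof(int)];
	struct vfio_irq_set *set = (struct vfio_irq_set *)buf;
	int efd = eventfd(0, 0);

	if (efd < 0)
		return -1;

	set->argsz = sizeof(buf);
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = irq_index;
	set->start = 0;
	set->count = 1;
	memcpy(set->data, &efd, sizeof(int));

	if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set))
		return -1;

	/* When efd fires: drop the existing mmaps, re-issue
	 * VFIO_DEVICE_GET_REGION_INFO on the indicated region(s), walk the
	 * sparse mmap capability chain, and re-establish mmaps. */
	return efd;
}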




Re: [libvirt] [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-05 Thread Alex Williamson
On Wed,  4 Dec 2019 22:25:36 -0500
Yan Zhao  wrote:

> when vfio-pci is bound to a physical device, almost all the hardware
> resources are passthroughed.
> Sometimes, vendor driver of this physical device may want to mediate some
> hardware resource access for a short period of time, e.g. dirty page
> tracking during live migration.
> 
> Here we introduce mediate ops in vfio-pci for this purpose.
> 
> Vendor driver can register a mediate ops to vfio-pci.
> But rather than directly bind to the passthroughed device, the
> vendor driver is now either a module that does not bind to any device or
> a module binds to other device.
> E.g. when passing through a VF device that is bound to vfio-pci modules,
> PF driver that binds to PF device can register to vfio-pci to mediate
> VF's regions, hence supporting VF live migration.
> 
> The sequence goes like this:
> 1. Vendor driver register its vfio_pci_mediate_ops to vfio-pci driver
> 
> 2. vfio-pci maintains a list of those registered vfio_pci_mediate_ops
> 
> 3. Whenever vfio-pci opens a device, it searches the list and calls
> vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> mediating this device.
> Upon a successful return value from vfio_pci_mediate_ops->open(),
> vfio-pci will stop list searching and store a mediate handle to
> represent this open into vendor driver.
> (so if multiple vendor drivers support mediating a device through
> vfio_pci_mediate_ops, only one will win, depending on their registering
> sequence)
> 
> 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
> ops, it will chain into vfio_pci_mediate_ops->get_region_info(), so that
> vendor driver is able to override a region's default flags and caps,
> e.g. adding a sparse mmap cap to passthrough only sub-regions of a whole
> region.
> 
> 5. vfio_pci_rw()/vfio_pci_mmap() first calls into
> vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmaps().
> if pt=true is returned, vfio_pci_rw()/vfio_pci_mmap() will further
> passthrough this read/write/mmap to physical device, otherwise it just
> returns without touching the physical device.
> 
> 6. When vfio-pci closes a device, vfio_pci_release() chains into
> vfio_pci_mediate_ops->release() to close the reference in vendor driver.
> 
> 7. Vendor driver unregisters its vfio_pci_mediate_ops when the driver exits
> 
> Cc: Kevin Tian 
> 
> Signed-off-by: Yan Zhao 
> ---
>  drivers/vfio/pci/vfio_pci.c         | 146 ++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |   2 +
>  include/linux/vfio.h                |  16 +++
>  3 files changed, 164 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 02206162eaa9..55080ff29495 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(disable_idle_d3,
>"Disable using the PCI D3 low power state for idle, unused 
> devices");
>  
> +static LIST_HEAD(mediate_ops_list);
> +static DEFINE_MUTEX(mediate_ops_list_lock);
> +struct vfio_pci_mediate_ops_list_entry {
> + struct vfio_pci_mediate_ops *ops;
> + int refcnt;
> + struct list_head        next;
> +};
> +
>  static inline bool vfio_vga_disabled(void)
>  {
>  #ifdef CONFIG_VFIO_PCI_VGA
> @@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
>   if (!(--vdev->refcnt)) {
>   vfio_spapr_pci_eeh_release(vdev->pdev);
>   vfio_pci_disable(vdev);
> + if (vdev->mediate_ops && vdev->mediate_ops->release) {
> + vdev->mediate_ops->release(vdev->mediate_handle);
> + vdev->mediate_ops = NULL;
> + }
>   }
>  
>   mutex_unlock(&vdev->reflck->lock);
> @@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
>  {
>   struct vfio_pci_device *vdev = device_data;
>   int ret = 0;
> + struct vfio_pci_mediate_ops_list_entry *mentry;
>  
>   if (!try_module_get(THIS_MODULE))
>   return -ENODEV;
> @@ -495,6 +508,30 @@ static int vfio_pci_open(void *device_data)
>   goto error;
>  
>   vfio_spapr_pci_eeh_open(vdev->pdev);
> + mutex_lock(&mediate_ops_list_lock);
> + list_for_each_entry(mentry, &mediate_ops_list, next) {
> + u64 caps;
> + u32 handle;

Wouldn't it seem likely that the ops provider might use this handle as
a pointer, so we'd want it to be an opaque void*?

> +
> + memset(&caps, 0, sizeof(caps));

@caps has no purpose here, add it if/when we do something with it.
It's also a standard type, why are we memset'ing it rather than just
=0??

> + ret = mentry->ops->open(vdev->pdev, &caps, &handle);
> + if (!ret)  {
> + vdev->mediate_ops = mentry->ops;
> + vdev->mediate_handle = handle;
> 

Re: [libvirt] [RFC PATCH 3/9] vfio/pci: register a default migration region

2019-12-05 Thread Alex Williamson
On Wed,  4 Dec 2019 22:26:38 -0500
Yan Zhao  wrote:

> Vendor driver specifies when to support a migration region through cap
> VFIO_PCI_DEVICE_CAP_MIGRATION in vfio_pci_mediate_ops->open().
> 
> If vfio-pci detects this cap, it creates a default migration region on
> behalf of vendor driver with region len=0 and region->ops=null.
> Vendor driver should override this region's len, flags, rw, mmap in
> its vfio_pci_mediate_ops.
> 
> This migration region definition is aligned to QEMU vfio migration code v8:
> (https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html)
> 
> Cc: Kevin Tian 
> 
> Signed-off-by: Yan Zhao 
> ---
>  drivers/vfio/pci/vfio_pci.c |  15 ++++
>  include/linux/vfio.h        |   1 +
>  include/uapi/linux/vfio.h   | 149 +++++++++++++++++++++++++++++++++
>  3 files changed, 165 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index f3730252ee82..059660328be2 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -115,6 +115,18 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>   return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>  }
>  
> +/**
> + * init a region to hold migration ctl & data
> + */
> +void init_migration_region(struct vfio_pci_device *vdev)
> +{
> + vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
> + VFIO_REGION_SUBTYPE_MIGRATION,
> + NULL, 0,
> + VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> + NULL);
> +}
> +
>  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  {
>   struct resource *res;
> @@ -523,6 +535,9 @@ static int vfio_pci_open(void *device_data)
>   vdev->mediate_ops = mentry->ops;
>   vdev->mediate_handle = handle;
>  
> + if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
> + init_migration_region(vdev);

No.  We're not going to add a cap flag for every region the mediation
driver wants to add.  The mediation driver should have the ability to
add regions and irqs to the device itself.  Thanks,

Alex

> +
>   pr_info("vfio pci found mediate_ops %s, 
> caps=%llx, handle=%x for %x:%x\n",
>   vdev->mediate_ops->name, caps,
>   handle, vdev->pdev->vendor,




Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-05 Thread Alex Williamson
On Wed,  4 Dec 2019 22:26:50 -0500
Yan Zhao  wrote:

> Dynamic trap bar info region is a channel for QEMU and vendor driver to
> communicate dynamic trap info. It is of type
> VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> 
> This region has two fields: dt_fd and trap.
> When QEMU detects a device region of this type, it will create an
> eventfd and write its eventfd id to dt_fd field.
> When vendor driver signals this eventfd, QEMU reads trap field of this
> info region.
> - If trap is true, QEMU would search the device's PCI BAR
> regions and disable all the sparse mmaped subregions (if the sparse
> mmaped subregion is disablable).
> - If trap is false, QEMU would re-enable those subregions.
> 
> A typical usage is
> 1. vendor driver first cuts its bar 0 into several sections, all in a
> sparse mmap array. So initially, all its bar 0 are passthroughed.
> 2. vendor driver specifies part of bar 0 sections to be disablable.
> 3. on migration starts, vendor driver signals dt_fd and set trap to true
> to notify QEMU disabling the bar 0 sections of disablable flags on.
> 4. QEMU disables those bar 0 section and hence let vendor driver be able
> to trap access of bar 0 registers and make dirty page tracking possible.
> 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> QEMU reads trap field of this info region which is false and QEMU
> re-passthrough the whole bar 0 region.
> 
> Vendor driver specifies whether it supports dynamic-trap-bar-info region
> through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> vfio_pci_mediate_ops->open().
> 
> If vfio-pci detects this cap, it will create a default
> dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> and region->ops=null.
> Vendor driver should override this region's len, flags, rw, mmap in its
> vfio_pci_mediate_ops.

TBH, I don't like this interface at all.  Userspace doesn't pass data
to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
configuring user signaling with eventfds.  I think we only need to
define an IRQ type that tells the user to re-evaluate the sparse mmap
information for a region.  The user would enumerate the device IRQs via
GET_IRQ_INFO, find one of this type where the IRQ info would also
indicate which region(s) should be re-evaluated on signaling.  The user
would enable that signaling via SET_IRQS and simply re-evaluate the
sparse mmap capability for the associated regions when signaled.
Thanks,

Alex

> 
> Cc: Kevin Tian 
> 
> Signed-off-by: Yan Zhao 
> ---
>  drivers/vfio/pci/vfio_pci.c | 16 ++++++++++++++++
>  include/linux/vfio.h        |  3 ++-
>  include/uapi/linux/vfio.h   | 11 +++++++++++
>  3 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 059660328be2..62b811ca43e4 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -127,6 +127,19 @@ void init_migration_region(struct vfio_pci_device *vdev)
>   NULL);
>  }
>  
> +/**
> + * register a region to hold info for dynamically trap bar regions
> + */
> +void init_dynamic_trap_bar_info_region(struct vfio_pci_device *vdev)
> +{
> + vfio_pci_register_dev_region(vdev,
> + VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
> + VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
> + NULL, 0,
> + VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> + NULL);
> +}
> +
>  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  {
>   struct resource *res;
> @@ -538,6 +551,9 @@ static int vfio_pci_open(void *device_data)
>   if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
>   init_migration_region(vdev);
>  
> + if (caps & VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR)
> + init_dynamic_trap_bar_info_region(vdev);
> +
>   pr_info("vfio pci found mediate_ops %s, 
> caps=%llx, handle=%x for %x:%x\n",
>   vdev->mediate_ops->name, caps,
>   handle, vdev->pdev->vendor,
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index cddea8e9dcb2..cf8ecf687bee 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -197,7 +197,8 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
>  
>  struct vfio_pci_mediate_ops {
>   char*name;
> -#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
> +#define VFIO_PCI_DEVICE_CAP_MIGRATION    (0x01)
> +#define VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR (0x02)
>   int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
>   void(*release)(int handle);
>   void(*get_region_info)(int handle,
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index caf8845a67a6..74a2d0b57741 100644
> --- 

Re: [libvirt] [PATCH v2 0/4] PCI hostdev partial assignment support

2019-11-21 Thread Alex Williamson
On Thu, 21 Nov 2019 19:19:13 -0300
Daniel Henrique Barboza  wrote:

> Changes from previous version [1], all of them result of
> feedback from Alex Williamson and Abdulla Bubshait:
> - use  instead of creating a new subsys
> attribute;
> - expand the change to all PCI hostdevs. Former patch 01 was
> discarded because we don't need the PCI Multifunction helpers
> for now; 
> - series changed name to reflect what it's being done
> - new patch 04: add documentation to formatdomain.html.in
> 
> To avoid a huge wall of text please refer to [1] for context
> about the road up to here. Commit msgs of the first 3 patches
> tells the story as well.
> 
> [1] https://www.redhat.com/archives/libvir-list/2019-October/msg00298.html  
> 
> What I want to discuss here instead is a caveat that I've found
> while testing this work, since its first version. This test was
> done in a Power 8 system with a Broadcom BCM5719 PCIe Multifunction
> card, with 4 virtual functions. This series enables Libvirt to
> declare PCI hostdevs that will not be visible to the guest using
> address type='none'. During the tests I faced a scenario that I
> expected to fail, but it didn't. This is the relevant XML excerpt:
> 
> <hostdev mode='subsystem' type='pci' managed='yes'>
>   <source>
>     <address domain='...' bus='...' slot='...' function='0x0'/>
>   </source>
>   <address type='none'/>
> </hostdev>
> <hostdev mode='subsystem' type='pci' managed='yes'>
>   <source>
>     <address domain='...' bus='...' slot='...' function='0x1'/>
>   </source>
>   <address type='none'/>
> </hostdev>
> <hostdev mode='subsystem' type='pci' managed='yes'>
>   <source>
>     <address domain='...' bus='...' slot='...' function='0x2'/>
>   </source>
>   <address type='pci' domain='...' bus='...' slot='...'
>            function='0x2'/>
> </hostdev>
> <hostdev mode='subsystem' type='pci' managed='yes'>
>   <source>
>     <address domain='...' bus='...' slot='...' function='0x3'/>
>   </source>
>   <address type='none'/>
> </hostdev>
> 
> 
> I'm declaring all the BCM5719 functions in the XML, but I am making
> functions 0, 1 and 3 unassignable by the guest using address type='none'.
> This test was meant to fail, but it didn't. To my surprise the guest
> booted and the device is functional:
> 
> $ lspci
> 0000:00:01.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
> 0000:00:03.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01)
> 0000:00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
> 0001:00:01.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit 
> Ethernet PCIe (rev 01)
> $ 
> 
> I've talked with Michael Roth (QEMU PPC64 developer that worked with
> the PCI multifunction hotplug/unplug support in the PPC64 machine)
> about this. He mentioned that this is intended. I'll quote here what he
> had to say about it:
> 
> 
> "The current logic is that we only emit the hotplug event when function
> 0 is attached, but if some other function is attached at boot-time the
> guest will still see it on the bus, and whether that works or not I
> think is up to the device/driver"
> 
> 
> This explains why this test didn't fail as I expected. At least for
> the PPC64 machine, depending on the device support, this setup is
> allowed. PPC64 machine uses function 0 hotplug as a signal of
> 'plug all the queued functions and function 0', but function 0
> isn't required at boot time.  I would like to hear other opinions
> in this because I can't say whether this is allowed in x86.

I would expect that on x86 a Linux guest would need to be booted with
the pci=pcie_scan_all kernel option to find this device.  The PCI-core
will only scan for devfn 00.0 behind a downstream port and then only
scan non-zero functions if that device indicates multifunction support.
I'd expect more archs to behave this way than what you see on PPC where
it "just works".

> I am mentioning all this now because this had a direct impact on the
> design of this work since the previous version, and I failed
> to bring it up back then. I am *not* checking for the assignment of
> function 0 at guest boot time in Libvirt, leaving the user free to
> decide what to do. I am aware that this will be inconsistent to the
> logic of the PCI multifunction hotplug/unplug support, where
> function 0 is required. This also puts a lot of faith in the user,
> relying that the user is fully aware of the capabilities of the
> hardware.
> 
> My question is: should Libvirt force function 0 to be present in
> boot time as well, regardless of whether the PPC64 guest or some
> cards are able to boot without it?

In my reading of the PCI 3.0 spec, 3.2.2.3.4 indicates that
multifunction devices are required to implement function 0 and that
configuration software is under no obligation to scan for higher number
functions if function 0 is not present and does not have the
multifunction bit set in the header type register.  With PCIe, where
link information, payload size, ARI, VC, etc are exposed in config
space, many of these are only valid or have specific meaning only for
function 0.
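
(For reference, that's bit 7 of the Header Type register at config
offset 0x0e; a quick host-side check, with an illustrative address:

$ setpci -s 0000:01:00.0 HEADER_TYPE
80

A value with bit 7 set, e.g. 0x80, advertises a multifunction device on
function 0.)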

What you're seeing on PPC is, I think, more typical of paravirtualized
PCI enumeration, where you're only seei

Re: [libvirt] [PATCH 0/6] VFIO mdev aggregated resources handling

2019-11-19 Thread Alex Williamson
On Fri, 15 Nov 2019 04:24:35 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson
> > Sent: Thursday, November 7, 2019 2:45 AM
> > 
> > On Wed, 6 Nov 2019 12:20:31 +0800
> > Zhenyu Wang  wrote:
> >   
> > > On 2019.11.05 14:10:42 -0700, Alex Williamson wrote:  
> > > > On Thu, 24 Oct 2019 13:08:23 +0800
> > > > Zhenyu Wang  wrote:
> > > >  
> > > > > Hi,
> > > > >
> > > > > This is a refresh for previous send of this series. I got impression that
> > > > > some SIOV drivers would still deploy their own create and config method so
> > > > > stopped effort on this. But seems this would still be useful for some other
> > > > > SIOV driver which may simply want capability to aggregate resources. So
> > > > > here's refreshed series.
> > > > >
> > > > > Current mdev device create interface depends on fixed mdev type, which gets
> > > > > uuid from user to create instance of mdev device. If user wants to use
> > > > > customized number of resource for mdev device, then only can create new
> > > > > mdev type for that which may not be flexible. This requirement comes not
> > > > > only from to be able to allocate flexible resources for KVMGT, but also
> > > > > from Intel scalable IO virtualization which would use vfio/mdev to be able
> > > > > to allocate arbitrary resources on mdev instance. More info on [1] [2] [3].
> > > > >
> > > > > To allow to create user defined resources for mdev, it tries to extend mdev
> > > > > create interface by adding new "aggregate=xxx" parameter following UUID, for
> > > > > target mdev type if aggregation is supported, it can create new mdev device
> > > > > which contains resources combined by number of instances, e.g
> > > > >
> > > > > echo "<uuid>,aggregate=10" > create
> > > > >
> > > > > VM manager e.g libvirt can check mdev type with "aggregation" attribute
> > > > > which can support this setting. If no "aggregation" attribute found for
> > > > > mdev type, previous behavior is still kept for one instance allocation.
> > > > > And new sysfs attribute "aggregated_instances" is created for each mdev
> > > > > device to show allocated number.
> > > >
> > > > Given discussions we've had recently around libvirt interacting with
> > > > mdev, I think that libvirt would rather have an abstract interface via
> > > > mdevctl[1].  Therefore can you evaluate how mdevctl would support this
> > > > creation extension?  It seems like it would fit within the existing
> > > > mdev and mdevctl framework if aggregation were simply a sysfs attribute
> > > > for the device.  For example, the mdevctl steps might look like this:
> > > >
> > > > mdevctl define -u UUID -p PARENT -t TYPE
> > > > mdevctl modify -u UUID --addattr=mdev/aggregation --value=2
> > > > mdevctl start -u UUID
> 
> Hi, Alex, can you elaborate why a sysfs attribute is more friendly
> to mdevctl? what is the complexity if having mdevctl to pass
> additional parameter at creation time as this series originally 
> proposed? Just want to clearly understand the limitation of the
> parameter way. :-)

We could also flip this question around: vfio-ap already uses sysfs to
finish composing a device after it's created, therefore why shouldn't
aggregation use this existing mechanism.  Extending the creation
interface is a more fundamental change than simply standardizing an
optional sysfs namespace entry.
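
To put the two models side by side, roughly (sysfs paths, type names,
and values below are illustrative):

# (a) creation-time parameter, as proposed in this series:
echo "$UUID,aggregate=10" > \
    /sys/class/mdev_bus/$PARENT/mdev_supported_types/$TYPE/create

# (b) post-create sysfs attribute, which the mdevctl steps above would drive:
echo "$UUID" > /sys/class/mdev_bus/$PARENT/mdev_supported_types/$TYPE/create
echo 10 > /sys/bus/mdev/devices/$UUID/mdev/aggregation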

> > > >
> > > > When mdevctl starts the mdev, it will first create it using the
> > > > existing mechanism, then apply aggregation attribute, which can consume
> > > > the necessary additional instances from the parent device, or return an
> > > > error, which would unwind and return a failure code to the caller
> > > > (libvirt).  I think the vendor driver would then have freedom to decide
> > > > when the attribute could be modified, for instance it would be entirely
> > > > reasonable to return -EBUSY if the user attempts to modify the
> > > > attribute while the mdev device is in-use.

[libvirt] libvirt mdev migration, mdevctl integration

2019-11-18 Thread Alex Williamson
Hey folks,

We had some discussions at KVM Forum around mdev live migration and
what that might mean for libvirt handling of mdev devices and
potential libvirt/mdevctl[1] flows.  I believe the current situation is
that libvirt knows nothing about an mdev beyond the UUID in the XML.
It expects the mdev to exist on the system prior to starting the VM.
The intention is for mdevctl to step in here by providing persistence
for mdev devices such that these pre-defined mdevs are potentially not
just ephemeral, for example, we can tag specific mdevs for automatic
startup on each boot.

It seems the next step in this journey is to figure out if libvirt can
interact with mdevctl to "manage" a device.  I believe we've avoided
defining managed='yes' behavior for mdev hostdevs up to this point
because creating an mdev device involves policy decisions.  For
example, which parent device hosts the mdev, are there optimal NUMA
considerations, are there performance versus power considerations, what
is the nature of the mdev, etc.  mdevctl doesn't necessarily want to
make placement decisions either, but it does understand how to create
and remove an mdev, what its type is, associate it to a fixed
parent, apply attributes, etc.  So would it be reasonable that for a
managed='yes' mdev hostdev device, libvirt might attempt to use mdevctl
to start an mdev by UUID and stop it when the VM is shutdown?  This
assumes the mdev referenced by the UUID is already defined and known to
mdevctl.  I'd expect semantics much like managed='yes' around vfio-pci
binding, ex. start/stop if it doesn't exist, leave it alone if it
already exists.

If that much seems reasonable, and someone is willing to invest some
development time to support it, what are then the next steps to enable
migration?

AIUI, libvirt blindly assumes hostdev devices cannot be migrated.  This
may already be getting some work due to Jens' network failover support
where the attached hostdev doesn't really migrate, but it allows the
migration to proceed in a partially detached state so that it can jump
back into action should the migration fail.  Long term we expect that
not only some mdev hostdevs might be migratable, but possibly some
regular vfio-pci hostdevs as well.  I think libvirt will need to remove
any assumptions around hostdev migration and rather rely on
introspection of the QEMU process to determine if any devices hold
migration blockers (or simply try the migration and let QEMU fail
quickly if there are blockers).

So assuming we now have a VM with a managed='yes' mdev hostdev device,
what do we need to do to reproduce that device at the migration target?
mdevctl can dump a device in a json format, where libvirt could use
this to define and start an equivalent device on the migration target
(potentially this json is extended by mdevctl to include the migration
compatibility vendor string).  Part of our discussion at the Forum was
around the extent to which libvirt would want to consider this json
opaque.  For instance, while libvirt doesn't currently support localhost
migration, libvirt might want to use an alternate UUID for the mdev
device on the migration target so as not to introduce additional
barriers to such migrations.  Potentially mdevctl could accept the json
from the source system as a template and allow parameters such as UUID
to be overwritten by commandline options. This might allow libvirt to
consider the json as opaque.
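
Concretely, the flow might look something like this on the command
line; note that --dumpjson and --jsonfile are only proposed options
(see the mdevctl threads later in this archive), so the exact spelling
is an assumption:

# on the source host: dump the device definition
mdevctl list -u UUID --dumpjson > mdev.json

# on the target host: define an equivalent device, overriding
# host-specific parameters such as the UUID and the parent
mdevctl define -u NEW_UUID -p NEW_PARENT --jsonfile mdev.json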

An issue here though is that the json will also include the parent
device, which we obviously cannot assume is the same (particularly the
bus address) on the migration target.  We can allow commandline
overrides for the parent just as we do above for the UUID when defining
the mdev device from json, but it's an open issue who is going to be
smart enough (perhaps dumb enough) to claim this responsibility.  It
would be interesting to understand how libvirt handles other host
specific information during migration, for instance if node or processor
affinities are part of the VM XML, how is that translated to the
target?  I could imagine that we could introduce a simple "first
available" placement in mdevctl, but maybe there should minimally be a
node allocation preference with optional enforcement (similar to
numactl), or maybe something above libvirt needs to take this
responsibility to prepare the target before we get ourselves into
trouble.

Anyway, I hope this captures some of what was discussed at KVM Forum
and that we can continue that discussion here to map out the design and
tasks to enable vfio/mdev hostdev migration in libvirt.  Thanks,

Alex

[1] https://github.com/mdevctl/mdevctl

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [PATCH 0/6] VFIO mdev aggregated resources handling

2019-11-06 Thread Alex Williamson
On Wed, 6 Nov 2019 12:20:31 +0800
Zhenyu Wang  wrote:

> On 2019.11.05 14:10:42 -0700, Alex Williamson wrote:
> > On Thu, 24 Oct 2019 13:08:23 +0800
> > Zhenyu Wang  wrote:
> >   
> > > Hi,
> > > 
> > > This is a refresh for previous send of this series. I got impression that
> > > some SIOV drivers would still deploy their own create and config method so
> > > stopped effort on this. But seems this would still be useful for some 
> > > other
> > > SIOV driver which may simply want capability to aggregate resources. So 
> > > here's
> > > refreshed series.
> > > 
> > > Current mdev device create interface depends on fixed mdev type, which 
> > > get uuid
> > > from user to create instance of mdev device. If user wants to use 
> > > customized
> > > number of resource for mdev device, then only can create new mdev type 
> > > for that
> > > which may not be flexible. This requirement comes not only from to be 
> > > able to
> > > allocate flexible resources for KVMGT, but also from Intel scalable IO
> > > virtualization which would use vfio/mdev to be able to allocate arbitrary
> > > resources on mdev instance. More info on [1] [2] [3].
> > > 
> > > To allow to create user defined resources for mdev, it tries to extend mdev
> > > create interface by adding new "aggregate=xxx" parameter following UUID, 
> > > for
> > > target mdev type if aggregation is supported, it can create new mdev 
> > > device
> > > which contains resources combined by number of instances, e.g
> > > 
> > > echo ",aggregate=10" > create
> > > 
> > > VM manager e.g libvirt can check mdev type with "aggregation" attribute 
> > > which
> > > can support this setting. If no "aggregation" attribute found for mdev 
> > > type,
> > > previous behavior is still kept for one instance allocation. And new sysfs
> > > attribute "aggregated_instances" is created for each mdev device to show 
> > > allocated number.  
> > 
> > Given discussions we've had recently around libvirt interacting with
> > mdev, I think that libvirt would rather have an abstract interface via
> > mdevctl[1].  Therefore can you evaluate how mdevctl would support this
> > creation extension?  It seems like it would fit within the existing
> > mdev and mdevctl framework if aggregation were simply a sysfs attribute
> > for the device.  For example, the mdevctl steps might look like this:
> > 
> > mdevctl define -u UUID -p PARENT -t TYPE
> > mdevctl modify -u UUID --addattr=mdev/aggregation --value=2
> > mdevctl start -u UUID
> > 
> > When mdevctl starts the mdev, it will first create it using the
> > existing mechanism, then apply aggregation attribute, which can consume
> > the necessary additional instances from the parent device, or return an
> > error, which would unwind and return a failure code to the caller
> > (libvirt).  I think the vendor driver would then have freedom to decide
> > when the attribute could be modified, for instance it would be entirely
> > reasonable to return -EBUSY if the user attempts to modify the
> > attribute while the mdev device is in-use.  Effectively aggregation
> > simply becomes a standardized attribute with common meaning.  Thoughts?
> > [cc libvirt folks for their impression] Thanks,  
> 
> I think one problem is that before mdevctl starts to create the mdev you
> don't know what the vendor attributes are, as we apply mdev attributes
> after create. You may need some lookup depending on the parent. I think
> making aggregation like any other vendor attribute for mdev might be the
> simplest way, but do we want to define its behavior formally? e.g. as
> previously discussed it should show maximum instances for aggregation, etc.

Yes, we'd still want to standardize how we enable and discover
aggregation since we expect multiple users.  Even if libvirt were to
use mdevctl as its mdev interface, higher level tools should have an
introspection mechanism available.  Possibly the sysfs interfaces
proposed in this series remains largely the same, but I think perhaps
the implementation of them moves out to the vendor driver.  In fact,
perhaps the only change to mdev core is to define the standard.  For
example, the "aggregation" attribute on the type is potentially simply
a defined, optional, per type attribute, similar to "name" and
"description".  For "aggregated_instances" we already have the
mdev_attr_groups of the mdev_parent_ops, we could define an

Re: [libvirt] [PATCH 0/6] VFIO mdev aggregated resources handling

2019-11-05 Thread Alex Williamson
On Thu, 24 Oct 2019 13:08:23 +0800
Zhenyu Wang  wrote:

> Hi,
> 
> This is a refresh for previous send of this series. I got impression that
> some SIOV drivers would still deploy their own create and config method so
> stopped effort on this. But seems this would still be useful for some other
> SIOV driver which may simply want capability to aggregate resources. So here's
> refreshed series.
> 
> Current mdev device create interface depends on fixed mdev type, which get 
> uuid
> from user to create instance of mdev device. If user wants to use customized
> number of resource for mdev device, then only can create new mdev type for 
> that
> which may not be flexible. This requirement comes not only from to be able to
> allocate flexible resources for KVMGT, but also from Intel scalable IO
> virtualization which would use vfio/mdev to be able to allocate arbitrary
> resources on mdev instance. More info on [1] [2] [3].
> 
> To allow to create user defined resources for mdev, it tries to extend mdev
> create interface by adding new "aggregate=xxx" parameter following UUID, for
> target mdev type if aggregation is supported, it can create new mdev device
> which contains resources combined by number of instances, e.g
> 
> echo ",aggregate=10" > create
> 
> VM manager e.g libvirt can check mdev type with "aggregation" attribute which
> can support this setting. If no "aggregation" attribute found for mdev type,
> previous behavior is still kept for one instance allocation. And new sysfs
> attribute "aggregated_instances" is created for each mdev device to show 
> allocated number.

Given discussions we've had recently around libvirt interacting with
mdev, I think that libvirt would rather have an abstract interface via
mdevctl[1].  Therefore can you evaluate how mdevctl would support this
creation extension?  It seems like it would fit within the existing
mdev and mdevctl framework if aggregation were simply a sysfs attribute
for the device.  For example, the mdevctl steps might look like this:

mdevctl define -u UUID -p PARENT -t TYPE
mdevctl modify -u UUID --addattr=mdev/aggregation --value=2
mdevctl start -u UUID

When mdevctl starts the mdev, it will first create it using the
existing mechanism, then apply aggregation attribute, which can consume
the necessary additional instances from the parent device, or return an
error, which would unwind and return a failure code to the caller
(libvirt).  I think the vendor driver would then have freedom to decide
when the attribute could be modified, for instance it would be entirely
reasonable to return -EBUSY if the user attempts to modify the
attribute while the mdev device is in-use.  Effectively aggregation
simply becomes a standardized attribute with common meaning.  Thoughts?
[cc libvirt folks for their impression] Thanks,

Alex

[1] https://github.com/mdevctl/mdevctl

> References:
> [1] 
> https://software.intel.com/en-us/download/intel-virtualization-technology-for-directed-io-architecture-specification
> [2] 
> https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
> [3] https://schd.ws/hosted_files/lc32018/00/LC3-SIOV-final.pdf
> 
> Zhenyu Wang (6):
>   vfio/mdev: Add new "aggregate" parameter for mdev create
>   vfio/mdev: Add "aggregation" attribute for supported mdev type
>   vfio/mdev: Add "aggregated_instances" attribute for supported mdev
> device
>   Documentation/driver-api/vfio-mediated-device.rst: Update for
> vfio/mdev aggregation support
>   Documentation/ABI/testing/sysfs-bus-vfio-mdev: Update for vfio/mdev
> aggregation support
>   drm/i915/gvt: Add new type with aggregation support
> 
>  Documentation/ABI/testing/sysfs-bus-vfio-mdev | 24 ++
>  .../driver-api/vfio-mediated-device.rst   | 23 ++
>  drivers/gpu/drm/i915/gvt/gvt.c|  4 +-
>  drivers/gpu/drm/i915/gvt/gvt.h| 11 ++-
>  drivers/gpu/drm/i915/gvt/kvmgt.c  | 53 -
>  drivers/gpu/drm/i915/gvt/vgpu.c   | 56 -
>  drivers/vfio/mdev/mdev_core.c | 36 -
>  drivers/vfio/mdev/mdev_private.h  |  6 +-
>  drivers/vfio/mdev/mdev_sysfs.c| 79 ++-
>  include/linux/mdev.h  | 19 +
>  10 files changed, 294 insertions(+), 17 deletions(-)
> 

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [PATCH 0/4] PCI multifunction partial assignment support

2019-10-07 Thread Alex Williamson
On Mon,  7 Oct 2019 18:11:32 -0300
Daniel Henrique Barboza  wrote:

> (--- long post warning ---)
> 
> This is a work that derived from the discussions I had with
> Laine Stump and Alex Williamson in [1]. I'll provide a quick
> gist below.
> 
> --
> 
> Today, Libvirt does not have proper support for partial
> assignment of functions of passed-through PCI multifunction
> devices (hostdev with VFIO-PCI).  By partial assignment I mean
> the guest being able to use just some, not all, virtual functions
> of the device. Even if the functions themselves became useless in
> the host, some functions might not be safe to be used
> by the guest, thus the user should be able to limit it.

Not safe in what way?  Patch 2/4 says some devices might be "security
sensitive", but the fact that this patch is necessary implies that the
host kernel already considers the devices non-isolated.  They must be
in the same iommu group to have this issue.  Is there a concrete
example of a device where a user would want this configuration?  The
case I can think of is not a security issue, but a functional one
where GPU and audio functions are grouped together and maybe the audio
function doesn't work well when assigned, or maybe we just want the
guest to default to another audio device and it's easier if we just
don't expose this on-card audio.
 
> I mentioned 'proper' because today it is possible to get this
> done in Libvirt if we use 'managed=no' in the hostdevs. If the
> user makes the proper setup (i.e. detaching all IOMMU devices),
> and use managed='no', Libvirt will launch the guest just with the
> functions declared in the XML. The technical reason for this is
> simple: in virHostdevPreparePCIDevices() we do not take into account
> that multifunction PCI devices require the whole IOMMU group to be
> detached, not just the devices being declared in def->hostdevs.
> In this case, managed='yes' will not work in this scenario, causing
> errors in QEMU launch.
> 
> The discussion I've started in [1] was motivated by my attempt
> of automatically detaching the IOMMU inside the prepare function
> with managed='yes' devices. Laine discarded this idea, arguing
> that the concept of partial assignment will cause user confusion
> if Libvirt starts to handle things without the user being fully
> aware. In [1] we discussed the possibility of declaring the
> functions that won't be assigned to the guest in the XML, forcing
> the user to be aware that these functions will be lost in the host,
> as a possible approach for a solution.
> 
> ---
> 
> This series tries to solve the partial assignment of multifunction
> hostdev PCI devices by introducing a new hostdev attribute called
> 'assigned'. This is how it works:
> 
> - it is a boolean value that will be effective just for
> multifunction hostdev PCI devices, since there's no other
> occurrence for this kind of use in Libvirt. Trying to
> declare assigned='yes|no' in any other PCI hostdev device
> will cause parse errors;
> 
> - default value if the attribute is not present is
> 'assigned=yes';
> 
> - <address> element will be forbidden if the hostdev is declared
> with assigned='no'. This is to make more evident to the user
> that this is a function that the guest will NOT be using, with
> a bonus that we will not need to calculate an address that
> won't be used;

It seems more intuitive to me to use the guest <address> element to
expose this.  libvirt often makes use of 'none' to declare empty
devices, so maybe <address type='none'/> would be more in line with
precedent.  Thanks,

Alex
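
For illustration, the two shapes being compared might look roughly like
this in the domain XML (a sketch; the assigned attribute is the series'
proposal and <address type='none'/> is the alternative suggested above,
so neither is settled libvirt syntax):

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
  </source>
  <!-- proposed: assigned='no' on the hostdev, or an empty guest
       address element such as the following -->
  <address type='none'/>
</hostdev>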

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [RFC] handling hostdev save/load net config for non SR-IOV devices

2019-07-18 Thread Alex Williamson
On Thu, 18 Jul 2019 17:08:23 -0400
Laine Stump  wrote:

> On 7/18/19 2:58 PM, Daniel Henrique Barboza wrote:
> >
> > On 7/18/19 2:18 PM, Laine Stump wrote:
> >  
> >> But to back up a bit - what is it about managed='yes' that makes you 
> >> want to do it that way instead of managed='no'? Do you really ever 
> >> need the devices to be bound to the host driver? Or are you just 
> >> using managed='yes' because there's not a standard/convenient place 
> >> to configure devices to be permanently bound to vfio-pci immediately 
> >> when the host boots?

driverctl can do this fwiw.
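
For reference, the persistent override flow with driverctl looks like
this (the device address is illustrative):

# bind 0000:01:00.0 to vfio-pci now and on subsequent boots
driverctl set-override 0000:01:00.0 vfio-pci
# inspect and undo
driverctl list-overrides
driverctl unset-override 0000:01:00.0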

> >
> > It's a case of user convenience for devices that have mixed usage, at
> > least in my opinion.
> >
> > For example, I can say from personal experience dealing with devices
> > that will never be used directly by the host, such as NVIDIA GPUs that are
> > used only as hostdevs of guests, that this code I'm developing is
> > pointless. In that setup the GPUs are bound to vfio-pci right after the
> > host boots using a /etc/rc.d script (or something equivalent). Not sure
> > if this is the standard way of binding a device to vfio-pci, but it works
> > for that environment.   
> 
> 
> Yeah, the problem is that there really isn't a standard "this is *the 
> one correct way* to configure this" place for this config, so everybody 
> just does what works for them, making it difficult to provide a 
> "recommended solution" in the libvirt documentation that you (i.e. "I" 
> :-)) have a high level of confidence in.

I think driverctl is trying to be that solution, but as soon as you say
for example "NVIDIA GPUs only work this way", there are immediately a
dozen users explaining fifteen different ways that they bind their
NVIDIA GPU only to vfio-pci while the VM is running and return it back
when it stops.  There are cases where users try to fight the kernel to
get devices bound to vfio-pci exclusively and before everything else
and it never changes, cases where we can let the kernel do its thing
and grab it for vfio-pci later, and cases where we bounce devices
between drivers depending on the current use case.


> > The problem is with devices that the user expects to use both in guests
> > and in the host. In that case, the user will need either to handle the 
> > nodedev
> > detach/reattach manually or to use managed=true and let Libvirt re-attach
> > the devices every time the guest is destroyed.  Even if the device is 
> > going to
> > be used in the same or another guest right after (meaning that we 
> > re-attached
> > the device back simply to detach it again right after), using managed=true
> > is convenient because the user doesn't need to think about the state of
> > the device.  
> 
> 
> Yeah, I agree that there are uses for managed='yes' and it's a good 
> thing to have. It's just that I think most of the time it's being used 
> when it isn't needed (and thus shouldn't be used).

We can't guess what the user is trying to accomplish, managed=true is a
more user friendly default.  Unfortunately NetworkManager trying to
DHCP any new NIC that appears is also a more user friendly default and
the combination of those is less than desirable.

> >> I think we should spend more time making it easier to have devices 
> >> "pre-binded" to vfio-pci at boot time, so that we could discourage 
> >> use of managed='yes'. (not "instead of" what you're doing, but "in 
> >> addition to" it).

driverctl, and soon hopefully mdevctl.

> > I think managed=true use can be called a 'bad user habit' in that 
> > sense. I can
> > think of some ways to alleviate it:
> >
> > - an user setting in an conf file that changes how managed=true works. 
> > Instead
> > of detach/re-attach the device, Libvirt will only detach the device, 
> > leaving the
> > device bound to vfio-pci even after guest destroy  
> 
> >
> > - same idea, but with a (yet another) XML attribute "re-attach=false" 
> > in the
> > hostdev definition. I like this idea better because you can set customized
> > behavior for each device/guest instead of changing the managed mechanics
> > for everyone  
> 
> 
> I posted a patch to support that (with a new managed mode called 
> "detach", which would automatically bind the device to vfio-pci at geust 
> startup, and leave it binded to vfio-pci when the guest released it) a 
> couple years ago, and it was rejected upstream (after a lot of discussion):
> 
> 
> https://www.redhat.com/archives/libvir-list/2016-March/msg00593.html
> 
> 
> I believe the main reason was that it was "giving the consumer yet 
> another option that they probably wouldn't understand, and would make 
> the wrong choice on", or something like that...
> 
> 
> I still like the idea, but it wasn't worth spending more time on the debate

If we have driverctl to bind a device to a user preferred driver at
start, doesn't managed=true become a no-op? (or is that what this patch
was trying to do?)

> > - for boot time (daemon start time), one way I can think of is an XML
> > file with the 

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-07-01 Thread Alex Williamson
On Mon, 1 Jul 2019 10:20:43 +0200
Cornelia Huck  wrote:

> On Fri, 28 Jun 2019 11:05:46 -0600
> Alex Williamson  wrote:
> 
> > On Fri, 28 Jun 2019 11:06:48 +0200
> > Cornelia Huck  wrote:  
> 
> > > What do you think of a way to specify JSON for the attributes directly
> > > on the command line? Or would it be better to just edit the config
> > > files directly?
> > 
> > Supplying json on the command line seems difficult, even doing so
> > with jq requires escaping quotes.  It's not a very friendly
> > experience.  Maybe something more like how virsh allows snippets of xml
> > to be included, we could use jq to validate a json snippet provided
> > as a file and add it to the attributes... of course if we need to allow
> > libvirt to modify the json config files directly, the user could do
> > that as well.  Is there a use case you're thinking of?  Maybe we could
> > augment the 'list' command to take a --uuid and --dumpjson option and
> > the 'define' command to accept a --jsonfile.  Maybe the 'start' command
> > could accept the same, so a transient device could define attributes
> > w/o excessive command line options.  Thanks,
> > 
> > Alex  
> 
> I was mostly thinking about complex configurations where writing a JSON
> config would be simpler than adding a lot of command line options.
> Something like dumping a JSON file and allowing to refer to a JSON file
> as you suggested could be useful; but then, those very complex use
> cases are probably already covered by editing the config file directly.
> Not sure if it is worth the effort; maybe just leave it as it is for
> now.

Well, I already did it.  It seems useful for creating transient devices
with attribute specifications.  If it's too ugly we can drop it.
Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-28 Thread Alex Williamson
On Fri, 28 Jun 2019 11:06:48 +0200
Cornelia Huck  wrote:

> On Thu, 27 Jun 2019 19:57:04 -0600
> Alex Williamson  wrote:
> 
> > On Thu, 27 Jun 2019 15:15:02 -0600
> > Alex Williamson  wrote:
> >   
> > > On Thu, 27 Jun 2019 09:38:32 -0600
> > > Alex Williamson  wrote:
> > > > > On 6/27/19 8:26 AM, Cornelia Huck wrote:
> > > > > > 
> > > > > > {
> > > > > >   "foo": "1",
> > > > > >   "bar": "42",
> > > > > >   "baz": {
> > > > > > "depends": ["foo", "bar"],
> > > > > > "value": "plahh"
> > > > > >   }
> > > > > > }
> > > > > > 
> > > > > > Something like that?
> > > > 
> > > > I'm not sure yet.  I think we need to look at what's feasible (and
> > > > easy) with jq.  Thanks,  
> > > 
> > > I think it's not too much trouble to remove and insert into arrays, so
> > > what if we were to define the config as:
> > > 
> > > {
> > >   "mdev_type":"vendor-type",
> > >   "start":"auto",
> > >   "attrs": [
> > >   {"attrX":["Xvalue1","Xvalue2"]},
> > >   {"dir/attrY": "Yvalue1"},
> > >   {"attrX": "Xvalue3"}
> > > ]
> > > }
> > > 
> > > "attr" here would define sysfs attributes under the device.  The array
> > > would be processed in order, so in the above example we'd do the
> > > following:
> > > 
> > >  1. echo Xvalue1 > attrX
> > >  2. echo Xvalue2 > attrX
> > >  3. echo Yvalue1 > dir/attrY
> > >  4. echo Xvalue3 > attrX
> > > 
> > > When starting the device mdevctl would simply walk the array, if the
> > > attribute key exists write the value(s).  If a write fails or the
> > > attribute doesn't exist, remove the device and report error.  
> 
> Yes, I think it makes sense to fail the startup of a device where we
> cannot set all attributes to the requested values.
> 
> > > 
> > > I think it's easiest with jq to manipulate arrays by removing and
> > > inserting by index.  Also if we end up with something like above, it's
> > > ambiguous if we reference the "attrX" key.  So perhaps we add the
> > > following options to the modify command:
> > > 
> > > --addattr=ATTRIBUTE --delattr --index=INDEX --value=VALUE1[,VALUE2]
> > > 
> > > We could handle it like a stack, so if --index is not supplied, add to
> > > the end or remove from the end.  If --index is provided, delete that
> > > index or add the attribute at that index.  So if you had the above and
> > > wanted to remove Xvalue1 but keep the ordering, you'd do:
> > > 
> > > --delattr --index=0
> > > --addattr=attrX --index=0 --value=Xvalue2
> > > 
> > > Which should result in:
> > > 
> > >   "attrs": [
> > >   {"attrX": "Xvalue2"},
> > >   {"dir/attrY": "Yvalue1"},
> > >   {"attrX": "Xvalue3"}
> > > ]  
> 
> Modifying by index looks reasonable; I just sent a pull request to
> print the index of an attribute out as well, so it is easier to specify
> the right attribute to modify.

Pulled. I had initially separated these per line and interpreted them,
but it felt too verbose, so I went in the other direction of putting
them on a single line and using the compact json representation.  Maybe
this is a reasonable compromise.

> > > If we want to modify a running device, I'm thinking we probably want a
> > > new command and options --attr=ATTRIBUTE --value=VALUE might suffice.
> > > 
> > > Do we need to support something like this for the 'start' command or
> > > should we leave that for simple devices and require a sequence of:
> > > 
> > > # mdevctl define ...
> > > # mdevctl modify --addattr...
> > > ...
> > > # mdevctl start
> > > # mdevctl undefine
> > > 
> > > This is effectively the long way to get a transient device.  Otherwise
> > > we'd need to figure out how to have --attr --value appear multiple
> > > times on the start command line.  Thanks,
> 
> What do you think of a way to specify JSON 

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-27 Thread Alex Williamson
On Thu, 27 Jun 2019 15:15:02 -0600
Alex Williamson  wrote:

> On Thu, 27 Jun 2019 09:38:32 -0600
> Alex Williamson  wrote:
> > > On 6/27/19 8:26 AM, Cornelia Huck wrote:
> > > > 
> > > > {
> > > >   "foo": "1",
> > > >   "bar": "42",
> > > >   "baz": {
> > > > "depends": ["foo", "bar"],
> > > > "value": "plahh"
> > > >   }
> > > > }
> > > > 
> > > > Something like that?
> > 
> > I'm not sure yet.  I think we need to look at what's feasible (and
> > easy) with jq.  Thanks,  
> 
> I think it's not too much trouble to remove and insert into arrays, so
> what if we were to define the config as:
> 
> {
>   "mdev_type":"vendor-type",
>   "start":"auto",
>   "attrs": [
>   {"attrX":["Xvalue1","Xvalue2"]},
>   {"dir/attrY": "Yvalue1"},
>   {"attrX": "Xvalue3"}
> ]
> }
> 
> "attr" here would define sysfs attributes under the device.  The array
> would be processed in order, so in the above example we'd do the
> following:
> 
>  1. echo Xvalue1 > attrX
>  2. echo Xvalue2 > attrX
>  3. echo Yvalue1 > dir/attrY
>  4. echo Xvalue3 > attrX
> 
> When starting the device mdevctl would simply walk the array, if the
> attribute key exists write the value(s).  If a write fails or the
> attribute doesn't exist, remove the device and report error.
> 
> I think it's easiest with jq to manipulate arrays by removing and
> inserting by index.  Also if we end up with something like above, it's
> ambiguous if we reference the "attrX" key.  So perhaps we add the
> following options to the modify command:
> 
> --addattr=ATTRIBUTE --delattr --index=INDEX --value=VALUE1[,VALUE2]
> 
> We could handle it like a stack, so if --index is not supplied, add to
> the end or remove from the end.  If --index is provided, delete that
> index or add the attribute at that index.  So if you had the above and
> wanted to remove Xvalue1 but keep the ordering, you'd do:
> 
> --delattr --index=0
> --addattr=attrX --index=0 --value=Xvalue2
> 
> Which should result in:
> 
>   "attrs": [
>   {"attrX": "Xvalue2"},
>   {"dir/attrY": "Yvalue1"},
>   {"attrX": "Xvalue3"}
> ]
> 
> If we want to modify a running device, I'm thinking we probably want a
> new command and options --attr=ATTRIBUTE --value=VALUE might suffice.
> 
> Do we need to support something like this for the 'start' command or
> should we leave that for simple devices and require a sequence of:
> 
> # mdevctl define ...
> # mdevctl modify --addattr...
> ...
> # mdevctl start
> # mdevctl undefine
> 
> This is effectively the long way to get a transient device.  Otherwise
> we'd need to figure out how to have --attr --value appear multiple
> times on the start command line.  Thanks,

This is now implemented, and yes you can specify '--addattr remove
--value 1' and mdevctl will immediately remove the device after it's
created (more power to the admin).  Listing defined devices also lists
any attributes defined for easy inspection.  It is also possible to
override the conversion of comma separated values into an array by
encoding and escaping the comma.  It's a little cumbersome, but
possible in case a driver isn't fully on board with the one attribute,
one value rule of sysfs.  Does this work for vfio-ap?  I also still
need to check if this allows an NVIDIA vGPU mdev to be configured such
that the framerate limiter can be automatically controlled.  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-27 Thread Alex Williamson
On Thu, 27 Jun 2019 09:38:32 -0600
Alex Williamson  wrote:
> > On 6/27/19 8:26 AM, Cornelia Huck wrote:  
> > > 
> > > {
> > >   "foo": "1",
> > >   "bar": "42",
> > >   "baz": {
> > > "depends": ["foo", "bar"],
> > > "value": "plahh"
> > >   }
> > > }
> > > 
> > > Something like that?  
> 
> I'm not sure yet.  I think we need to look at what's feasible (and
> easy) with jq.  Thanks,

I think it's not too much trouble to remove and insert into arrays, so
what if we were to define the config as:

{
  "mdev_type":"vendor-type",
  "start":"auto",
  "attrs": [
  {"attrX":["Xvalue1","Xvalue2"]},
  {"dir/attrY": "Yvalue1"},
  {"attrX": "Xvalue3"}
]
}

"attr" here would define sysfs attributes under the device.  The array
would be processed in order, so in the above example we'd do the
following:

 1. echo Xvalue1 > attrX
 2. echo Xvalue2 > attrX
 3. echo Yvalue1 > dir/attrY
 4. echo Xvalue3 > attrX

When starting the device mdevctl would simply walk the array, if the
attribute key exists write the value(s).  If a write fails or the
attribute doesn't exist, remove the device and report error.
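
A minimal sketch of that walk, assuming jq and using illustrative shell
variables ($config for the JSON file, $mdev_sysfs for the device's
sysfs directory); this is not mdevctl's actual code:

jq -r '.attrs[]? | to_entries[] | .key as $k |
       (.value | if type == "array" then .[] else . end) |
       "\($k) \(.)"' "$config" |
while read -r attr value; do
    # write each value in array order; on failure, remove the
    # device and report the error as described above
    if ! echo "$value" > "$mdev_sysfs/$attr" 2>/dev/null; then
        echo "failed to write $attr" >&2
        echo 1 > "$mdev_sysfs/remove"
        exit 1
    fi
done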

I think it's easiest with jq to manipulate arrays by removing and
inserting by index.  Also if we end up with something like above, it's
ambiguous if we reference the "attrX" key.  So perhaps we add the
following options to the modify command:

--addattr=ATTRIBUTE --delattr --index=INDEX --value=VALUE1[,VALUE2]

We could handle it like a stack, so if --index is not supplied, add to
the end or remove from the end.  If --index is provided, delete that
index or add the attribute at that index.  So if you had the above and
wanted to remove Xvalue1 but keep the ordering, you'd do:

--delattr --index=0
--addattr=attrX --index=0 --value=Xvalue2

Which should result in:

  "attrs": [
  {"attrX": "Xvalue2"},
  {"dir/attrY": "Yvalue1"},
  {"attrX": "Xvalue3"}
]

If we want to modify a running device, I'm thinking we probably want a
new command and options --attr=ATTRIBUTE --value=VALUE might suffice.

Do we need to support something like this for the 'start' command or
should we leave that for simple devices and require a sequence of:

# mdevctl define ...
# mdevctl modify --addattr...
...
# mdevctl start
# mdevctl undefine

This is effectively the long way to get a transient device.  Otherwise
we'd need to figure out how to have --attr --value appear multiple
times on the start command line.  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-27 Thread Alex Williamson
On Thu, 27 Jun 2019 11:00:31 -0400
Matthew Rosato  wrote:

> On 6/27/19 8:26 AM, Cornelia Huck wrote:
> > On Wed, 26 Jun 2019 19:53:50 -0600
> > Alex Williamson  wrote:
> >   
> >> On Wed, 26 Jun 2019 08:37:20 -0600
> >> Alex Williamson  wrote:
> >>  
> >>> On Wed, 26 Jun 2019 11:58:06 +0200
> >>> Cornelia Huck  wrote:
> >>> 
> >>>> On Tue, 25 Jun 2019 16:52:51 -0600
> >>>> Alex Williamson  wrote:
> >>>>   
> >>>>> Hi,
> >>>>>
> >>>>> Based on the discussions we've had, I've rewritten the bulk of
> >>>>> mdevctl.  I think it largely does everything we want now, modulo
> >>>>> devices that will need some sort of 1:N values per key for
> >>>>> configuration in the config file versus the 1:1 key:value setup we
> >>>>> currently have (so don't consider the format final just yet).
> >>>>
> >>>> We might want to factor out that config format handling while we're
> >>>> trying to finalize it.
> >>>>
> >>>> cc:ing Matt for his awareness. I'm currently not quite sure how to
> >>>> handle those vfio-ap "write several values to an attribute one at a
> >>>> time" requirements. Maybe 1:N key:value is the way to go; maybe we
> >>>> need/want JSON or something like that.  
> >>>
> >>> Maybe we should just do JSON for future flexibility.  I assume there
> >>> are lots of helpers that should make it easy even from a bash script.
> >>> I'll look at that next.
> >>
> >> Done.  Throw away any old mdev config files, we use JSON now.   
> > 
> > The code changes look quite straightforward, thanks.
> >   
> >> The per
> >> mdev config now looks like this:
> >>
> >> {
> >>   "mdev_type": "i915-GVTg_V4_8",
> >>   "start": "auto"
> >> }
> >>
> >> My expectation, and what I've already pre-enabled support for in set_key
> >> and get_key functions, is that we'd use arrays for values, so we might
> >> have:
> >>
> >>   "new_key": ["value1", "value2"]
> >>
> >> set_key will automatically convert a comma separated list of values
> >> into such an array, so I'm thinking this would be specified by the user
> >> as:
> >>
> >> # mdevctl modify -u UUID --key=new_key --value=value1,value2  
> > 
> > Looks sensible.
> > 
> > For vfio-ap, we'd probably end up with something like the following:
> > 
> > {
> >   "mdev_type": "vfio_ap-passthrough",
> >   "start": "auto",
> >   "assign_adapter": ["5", "6"],
> >   "assign_domain": ["4", "0xab"]
> > }
> > 
> > (following the Guest1 example in the kernel documentation)
> > 
> >  > ["6", "7"]? Remove 5, add 7? Remove all values, then set the new ones?  
> 
> IMO remove 5, add 7 would make the most sense.  I'm not sure that doing
> an unassign of all adapters (effectively removing all APQNs) followed by
> an assign of the new ones would work nicely with Tony's vfio-ap dynamic
> configuration patches.

Are we conflating operating on the config file versus operating on the
device?  I was thinking that setting a new key value replaces the
existing key, because anything else adds unnecessary complication to
the code and command line.  So in the above example, if the user
specified:

  mdevctl modify -u UUID --key=assign_adapter --value=6,7

The new value is simply ["6", "7"].  This would take effect the next
time the device is started.  We haven't yet considered how to change
running devices, but I think the semantics we have since the respin of
mdevctl separate saved config vs running devices in order to generalize
the support of transient devices.
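
In config-file terms that replacement could be a single jq filter; a
sketch, with config.json standing in for the device's config file:

jq '.assign_adapter = ["6","7"]' config.json > config.json.tmp && \
    mv config.json.tmp config.json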

> > Similar for deleting the "assign_adapter" key. We have an
> > "unassign_adapter" attribute, but this is not something we can infer
> > automatically; we need to know that we're dealing with a vfio-ap
> > matrix device...
> >   
> >>
> >> We should think about whether ordering is important and maybe
> >> incorporate that into key naming conventions or come up with some
> >> syntax for specifying startup blocks.  Thanks,
> >>
> >> Alex  
> > 
> > Hm...
> > 
> > {
> >   "foo": "1",
> >   "bar": "42",
> >   "baz": {
> > "depends": ["foo", "bar"],
> > "value": "plahh"
> >   }
> > }
> > 
> > Something like that?

I'm not sure yet.  I think we need to look at what's feasible (and
easy) with jq.  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-26 Thread Alex Williamson
On Wed, 26 Jun 2019 08:37:20 -0600
Alex Williamson  wrote:

> On Wed, 26 Jun 2019 11:58:06 +0200
> Cornelia Huck  wrote:
> 
> > On Tue, 25 Jun 2019 16:52:51 -0600
> > Alex Williamson  wrote:
> >   
> > > Hi,
> > > 
> > > Based on the discussions we've had, I've rewritten the bulk of
> > > mdevctl.  I think it largely does everything we want now, modulo
> > > devices that will need some sort of 1:N values per key for
> > > configuration in the config file versus the 1:1 key:value setup we
> > > currently have (so don't consider the format final just yet).
> > 
> > We might want to factor out that config format handling while we're
> > trying to finalize it.
> > 
> > cc:ing Matt for his awareness. I'm currently not quite sure how to
> > handle those vfio-ap "write several values to an attribute one at a
> > time" requirements. Maybe 1:N key:value is the way to go; maybe we
> > need/want JSON or something like that.  
> 
> Maybe we should just do JSON for future flexibility.  I assume there
> are lots of helpers that should make it easy even from a bash script.
> I'll look at that next.

Done.  Throw away any old mdev config files, we use JSON now.  The per
mdev config now looks like this:

{
  "mdev_type": "i915-GVTg_V4_8",
  "start": "auto"
}

My expectation, and what I've already pre-enabled support for in set_key
and get_key functions, is that we'd use arrays for values, so we might
have:

  "new_key": ["value1", "value2"]

set_key will automatically convert a comma separated list of values
into such an array, so I'm thinking this would be specified by the user
as:

# mdevctl modify -u UUID --key=new_key --value=value1,value2
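
The comma-to-array conversion itself is a jq one-liner; a sketch of the
idea rather than the actual set_key implementation:

value="value1,value2"
jq -n --arg v "$value" '$v | split(",")'
# => ["value1","value2"]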

We should think about whether ordering is important and maybe
incorporate that into key naming conventions or come up with some
syntax for specifying startup blocks.  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-26 Thread Alex Williamson
On Wed, 26 Jun 2019 11:58:06 +0200
Cornelia Huck  wrote:

> On Tue, 25 Jun 2019 16:52:51 -0600
> Alex Williamson  wrote:
> 
> > Hi,
> > 
> > Based on the discussions we've had, I've rewritten the bulk of
> > mdevctl.  I think it largely does everything we want now, modulo
> > devices that will need some sort of 1:N values per key for
> > configuration in the config file versus the 1:1 key:value setup we
> > currently have (so don't consider the format final just yet).  
> 
> We might want to factor out that config format handling while we're
> trying to finalize it.
> 
> cc:ing Matt for his awareness. I'm currently not quite sure how to
> handle those vfio-ap "write several values to an attribute one at a
> time" requirements. Maybe 1:N key:value is the way to go; maybe we
> need/want JSON or something like that.

Maybe we should just do JSON for future flexibility.  I assume there
are lots of helpers that should make it easy even from a bash script.
I'll look at that next.

> > We now support "transient" devices and there's no distinction or
> > difference in handling of such devices whether they're created by
> > mdevctl or externally.  All devices will also have systemd management,
> > though systemd is no longer required, it's used if available.  The
> > instance name used for systemd device units has also changed to allow
> > us to use BindsTo= such that services are not only created, but are
> > also removed if the device is removed.  Unfortunately it's not a simple
> > UUID via the systemd route any longer.  
> 
> That's a bit unfortunate; however, making it workable without systemd
> certainly is a good thing :)

The "decoder ring" is simply that the instance value takes the systemd
device path of the mdev device itself.  The mdev device is named by the
uuid and is created under the parent device, so we just need to get the
realpath of the parent, append the uuid, and encode it with
systemd-escape.  It's not magic, but it's  not trivial on the command
line either.  We could add a command to mdevctl to print this, but it
doesn't make much sense to call into mdevctl for that and not simply
control the device via mdevctl.
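
As a sketch of the derivation (the paths, uuid, and unit template name
are all illustrative):

parent=0000:00:02.0
uuid=83c32df7-0000-0000-0000-000000000000
path=$(realpath /sys/bus/pci/devices/$parent)/$uuid
instance=$(systemd-escape "$path")
systemctl status "mdev@$instance.service"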

> > Since the original posting, the project has moved from my personal
> > github to here:
> > 
> > https://github.com/mdevctl/mdevctl
> > 
> > Please see the README there for overview of the new commands and
> > example of their usage.  There is no attempt to maintain backwards
> > compatibility with previous versions, this tool is in its infancy.
> > Also since the original posting, RPM packaging is included, so simply
> > run 'make rpm' and install the resulting package.  
> 
> Nice.
> 
> > 
> > Highlights of this version include proper argument parsing via getopts
> > so that options can be provided in any order.  I'm still using the
> > format 'mdevctl {command} [options]' but now it's consistent that all
> > the options come after the command, in any order.  I think this is
> > relatively consistent with a variety of other tools.  
> 
> Parsing via getopts is also very nice.
> 
> > 
> > Devices are no longer automatically persisted, we handle them as
> > transient, but we also can promote them to persistent through the
> > 'define' command.  The define, undefine, and modify commands all
> > operate only on the config file, so that we can define separate from
> > creating.  When promoting from a transient to defined device, we can
> > use the existing device to create the config.  Both the type and the
> > startup of a device can be modified in the config, without affecting
> > the running device.
> > 
> > Starting an mdev device no longer relies exclusively on a saved config,
> > the device can be fully specified via options to create a transient
> > device.  Specifying only a uuid is also sufficient for a defined
> > device.  Some consideration has also been given to uuid collisions.
> > The mdev interface in the kernel prevents multiple mdevs with the same
> > uuid running concurrently, but mdevctl allows mdevs to be defined with
> > the same uuid under separate parent devices.  Some options therefore
> > allow both a uuid and parent to be specified and require this if the
> > uuid alone is ambiguous.  Clearly starting two such devices at the same
> > time will fail and is left to higher level tools to manage, just like
> > the ability to define more devices than there are available instances on
> > the host system.  
> 
> I still have to look into the details of this.
> 
> > 
> > The stop and list commands are largely the same ideas as previous
> &

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-25 Thread Alex Williamson
Hi,

Based on the discussions we've had, I've rewritten the bulk of
mdevctl.  I think it largely does everything we want now, modulo
devices that will need some sort of 1:N values per key for
configuration in the config file versus the 1:1 key:value setup we
currently have (so don't consider the format final just yet).

We now support "transient" devices and there's no distinction or
difference in handling of such devices whether they're created by
mdevctl or externally.  All devices will also have systemd management,
though systemd is no longer required, it's used if available.  The
instance name used for systemd device units has also changed to allow
us to use BindsTo= such that services are not only created, but are
also removed if the device is removed.  Unfortunately it's not a simple
UUID via the systemd route any longer.

Since the original posting, the project has moved from my personal
github to here:

https://github.com/mdevctl/mdevctl

Please see the README there for overview of the new commands and
example of their usage.  There is no attempt to maintain backwards
compatibility with previous versions, this tool is in its infancy.
Also since the original posting, RPM packaging is included, so simply
run 'make rpm' and install the resulting package.

Highlights of this version include proper argument parsing via getopts
so that options can be provided in any order.  I'm still using the
format 'mdevctl {command} [options]' but now it's consistent that all
the options come after the command, in any order.  I think this is
relatively consistent with a variety of other tools.

Devices are no longer automatically persisted, we handle them as
transient, but we also can promote them to persistent through the
'define' command.  The define, undefine, and modify commands all
operate only on the config file, so that we can define separate from
creating.  When promoting from a transient to defined device, we can
use the existing device to create the config.  Both the type and the
startup of a device can be modified in the config, without affecting
the running device.

Starting an mdev device no longer relies exclusively on a saved config,
the device can be fully specified via options to create a transient
device.  Specifying only a uuid is also sufficient for a defined
device.  Some consideration has also been given to uuid collisions.
The mdev interface in the kernel prevents multiple mdevs with the same
uuid running concurrently, but mdevctl allows mdevs to be defined with
the same uuid under separate parent devices.  Some options therefore
allow both a uuid and parent to be specified and require this if the
uuid alone is ambiguous.  Clearly starting two such devices at the same
time will fail and is left to higher level tools to manage, just like
the ability to define more devices than there are available instances on
the host system.

The stop and list commands are largely the same ideas as previous
though the semantics are completely different.  Listing running devices
now notes which are defined versus transient.  Perhaps it might also be
useful when listing defined devices to note which are running.

The sbin/libexec split of mdevctl has been squashed.  There are some
commands in the script that are currently only intended to be used from
udev or systemd; these are simply excluded from the help.  It's
possible we may want to promote the start-parent-mdevs command out of
this class, but the rest are specifically systemd helpers.

I'll include the current help text below for further semantic
details, but please have a look-see, or better yet give it a try.
Thanks,

Alex

PS - I'm looking at adding udev change events when a device registers
or unregisters with the mdev core, which should help us know when to
trigger creation of persistent, auto started devices.  That support is
included here with the MDEV_STATE="registered|unregistered" environment
values.  Particularly, kvmgt now supports dynamic loading and unloading,
so as long as the enable_gvt=1 option is provided to the i915 driver
mdev support can come and go independent of the parent device.  The
change uevents are necessary to trigger on that, so I'd appreciate any
feedback on those as well.  Until then, the persistence of mdevctl
really depends on mdev support on the parent device being _completely_
setup prior to processing the udev rules.
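
A udev rule keying on those events might look like the following
sketch; the rule body is illustrative and start-parent-mdevs is the
helper command mentioned above:

# e.g. /etc/udev/rules.d/99-mdevctl.rules
ACTION=="change", ENV{MDEV_STATE}=="registered", \
    RUN+="/usr/sbin/mdevctl start-parent-mdevs %k"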

# mdevctl
Usage: mdevctl {COMMAND} [options...]

Available commands:
define  Define a config for an mdev device.  Options:
<-u|--uuid=UUID> [<-p|--parent=PARENT> <-t|--type=TYPE>] [-a|--auto]
If the device specified by the UUID currently exists, parent
and type may be omitted to use the existing values. The auto
option marks the device to start on parent availability.
Running devices are unaffected by this command.
undefineUndefine, or remove a config for an mdev device.  Options:
<-u|--uuid=UUID> 

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-19 Thread Alex Williamson
On Wed, 19 Jun 2019 11:04:15 +0200
Sylvain Bauza  wrote:

> On Wed, Jun 19, 2019 at 12:27 AM Alex Williamson 
> wrote:
> 
> > On Tue, 18 Jun 2019 14:48:11 +0200
> > Sylvain Bauza  wrote:
> >  
> > > On Tue, Jun 18, 2019 at 1:01 PM Cornelia Huck  wrote:
> > >  
> > > > On Mon, 17 Jun 2019 11:05:17 -0600
> > > > Alex Williamson  wrote:
> > > >  
> > > > > On Mon, 17 Jun 2019 16:10:30 +0100
> > > > > Daniel P. Berrangé  wrote:
> > > > >  
> > > > > > On Mon, Jun 17, 2019 at 08:54:38AM -0600, Alex Williamson wrote:  
> > > > > > > On Mon, 17 Jun 2019 15:00:00 +0100
> > > > > > > Daniel P. Berrangé  wrote:
> > > > > > >  
> > > > > > > > On Thu, May 23, 2019 at 05:20:01PM -0600, Alex Williamson wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > Currently mediated device management, much like SR-IOV VF management,
> > > > > > > > > is largely left as an exercise for the user.  This is an attempt to
> > > > > > > > > provide something and see where it goes.  I doubt we'll solve
> > > > > > > > > everyone's needs on the first pass, but maybe we'll solve enough and
> > > > > > > > > provide helpers for the rest.  Without further ado, I'll point to what
> > > > > > > > > I have so far:
> > > > > > > > >
> > > > > > > > > https://github.com/awilliam/mdevctl
> > > > > > > > >
> > > > > > > > > This is inspired by driverctl, which is also a bash utility.  mdevctl
> > > > > > > > > uses udev and systemd to record and recreate mdev devices for
> > > > > > > > > persistence and provides a command line utility for querying, listing,
> > > > > > > > > starting, stopping, adding, and removing mdev devices.  Currently, for
> > > > > > > > > better or worse, it considers anything created to be persistent.  I can
> > > > > > > > > imagine a global configuration option that might disable this and
> > > > > > > > > perhaps an autostart flag per mdev device, such that mdevctl might
> > > > > > > > > simply "know" about some mdevs but not attempt to create them
> > > > > > > > > automatically.  Clearly command line usage help, man pages, and
> > > > > > > > > packaging are lacking as well, release early, release often, plus this
> > > > > > > > > is a discussion starter to see if perhaps this is sufficient to meet
> > > > > > > > > some needs.
> > > > > > > >
> > > > > > > > I think from libvirt's POV, we would *not* want devices to be made
> > > > > > > > unconditionally persistent. We usually wish to expose a choice to
> > > > > > > > applications whether to have resources be transient or persistent.
> > > > > > > >
> > > > > > > > So from that POV, a global config option to turn off persistence
> > > > > > > > is not workable either. We would want control per-device, with
> > > > > > > > autostart control per device too.
> > > > > > >
> > > > > > > The code has progressed somewhat in the past 3+ weeks, we still persist
> > > > > > > all devices, but the start-up mode can be selected per device or with a
> > > > > > > global default mode.  Devices configured with 'auto' start-up
> > > > > > > automatically while 'manual' devices are simply known and available to
> > > > > > > be started.  I imagine we could add a 'transient' mode where we purge

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-19 Thread Alex Williamson
On Wed, 19 Jun 2019 11:46:59 +0200
Cornelia Huck  wrote:

> On Wed, 19 Jun 2019 08:28:02 +0100
> Daniel P. Berrangé  wrote:
> 
> > On Tue, Jun 18, 2019 at 04:12:10PM -0600, Alex Williamson wrote:  
> > > On Tue, 18 Jun 2019 14:48:11 +0200
> > > Sylvain Bauza  wrote:
> > > 
> > > > On Tue, Jun 18, 2019 at 1:01 PM Cornelia Huck  
> > > > wrote:  
> 
> > > > > I think we need to reach consensus about the actual scope of the
> > > > > mdevctl tool.
> > > > >
> > > > >  
> > > > Thanks Cornelia, my thoughts:
> > > > 
> > > > > - Is it supposed to be responsible for managing *all* mdev devices in
> > > >  
> > > > >   the system, or is it more supposed to be a convenience helper for
> > > > >   users/software wanting to manage mdevs?
> > > > >  
> > > > 
> > > > The latter. If an operator (or some software) wants to create mdevs by 
> > > > not
> > > > using mdevctl (and rather directly calling the sysfs), I think it's OK.
> > > > That said, mdevs created by mdevctl would be supported by systemctl, 
> > > > while
> > > > the others would not, but I think it's okay.
> > > 
> > > I agree (sort of), and I'm hearing that we should drop any sort of
> > > automatic persistence of mdevs created outside of mdevctl.  The problem
> > > comes when we try to draw the line between unmanaged and managed
> > > devices.  For instance, if we have a command to list mdevs it would
> > > feel incomplete if it didn't list all mdevs, both those managed by
> > > mdevctl and those created elsewhere.  For managed devices, I expect
> > > we'll also have commands that allow the mode of the device to be
> > > switched between transient, saved, and persistent.  Should a user then  
> 
> Hm, what's the difference between 'saved' and 'persistent'? That
> 'saved' devices are not necessarily present?

It seems like we're coming up with the following classes:

1) transient
  a) mdevctl created
  b) foreign
2) defined
  a) automatic start-up
  b) manual start-up

I was using persistent for 2b), but that's probably not a good name
because devices can still be stopped, so they're not really
persistently available even in this class.

mdevctl today only has defined devices, transient needs to be
implemented, which should lead to a conclusion on whether 1a) and 1b)
are really the same.
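
To map the classes onto commands, here is a rough sketch; the subcommand
names and the --start/--transient knobs below are illustrative guesses,
not mdevctl's settled CLI:

    # 2a) defined, automatic start-up: recreated and started with the parent
    mdevctl define-mdev <uuid> <parent> <type> --start=auto      # hypothetical
    # 2b) defined, manual start-up: known to mdevctl, started on demand
    mdevctl define-mdev <uuid> <parent> <type> --start=manual    # hypothetical
    mdevctl start-mdev <uuid>
    # 1a) transient, mdevctl-created: no config survives removal or reboot
    mdevctl create-mdev --transient <uuid> <parent> <type>       # hypothetical
    # 1b) foreign: created directly through sysfs, outside mdevctl entirely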

> > > be allowed to promote an unmanaged device to one of these modes via the
> > > same command?  Should they be allowed to stop an unmanaged device
> > > through driverctl?  Through systemctl?  These all seem like reasonable
> > > things to do, so what then is the difference between transient and
> > > unmanaged mdev and is mdevctl therefore managing all mdevs, not just
> > > those it has created?
> > 
> > To my mind there shouldn't really need to be a difference between
> > transient mdevs created by mdevctl and mdevs created by a user
> > directly using sysfs. Both are mdevs on the running system with
> > no config file that you have to enumerate by looking at sysfs.
> > This ties back to my belief that we shouldn't need to have any
> > config on disk for a transient mdev, just discover them all
> > dynamically when required.  
> 
> So mdevctl can potentially interact with any mdev device on the system,
> it just has to be instructed by a user or software to do so? I think we
> can work with that.

Some TBDs around systemd/init support for transient devices and how
transient devices can be promoted to defined.  For instance if a
vfio-ap device requires matrix programming after instantiation, can we
glean that programming from sysfs or is there metadata irrecoverably
lost if no config file is created for a transient device?  This would
also imply that a 1b) foreign device could not be promoted to 2x)
defined device.
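
For reference, the 1b) foreign path is just the kernel's raw mdev sysfs
interface; a minimal sketch, with $PARENT and $TYPE as placeholders:

    # create an mdev directly via sysfs, bypassing mdevctl entirely
    uuid=$(uuidgen)
    echo "$uuid" > "/sys/class/mdev_bus/$PARENT/mdev_supported_types/$TYPE/create"
    # vfio-ap would then program the matrix through further sysfs writes
    # (e.g. the assign_adapter/assign_domain attributes); that post-creation
    # state is exactly what a config-less transient device would lose
    echo 1 > "/sys/bus/mdev/devices/$uuid/remove"    # tear it down again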

> > > > > - Do we want mdevctl to manage config files for individual mdevs, or
> > > > >   are they supposed to be in a common format that can also be managed
> > > > >   by e.g. libvirt?
> > > > >  
> > > > 
> > > > Unless I misunderstand, I think mdevctl just helps to create mdevs for
> > > > being used by guests created either by libvirt or QEMU or even others.
> > > > How a guest would allocate an mdev (i.e. saying "I'll use this
> > > > specific mdev UUID") is IMHO not something for mdevctl.
> > > 
> > > Right, mdevctl isn't concerned with how a specific md

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-18 Thread Alex Williamson
On Tue, 18 Jun 2019 14:48:11 +0200
Sylvain Bauza  wrote:

> On Tue, Jun 18, 2019 at 1:01 PM Cornelia Huck  wrote:
> 
> > On Mon, 17 Jun 2019 11:05:17 -0600
> > Alex Williamson  wrote:
> >  
> > > On Mon, 17 Jun 2019 16:10:30 +0100
> > > Daniel P. Berrangé  wrote:
> > >  
> > > > On Mon, Jun 17, 2019 at 08:54:38AM -0600, Alex Williamson wrote:  
> > > > > On Mon, 17 Jun 2019 15:00:00 +0100
> > > > > Daniel P. Berrangé  wrote:
> > > > >  
> > > > > > On Thu, May 23, 2019 at 05:20:01PM -0600, Alex Williamson wrote:  
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Currently mediated device management, much like SR-IOV VF management,
> > > > > > > is largely left as an exercise for the user.  This is an attempt to
> > > > > > > provide something and see where it goes.  I doubt we'll solve
> > > > > > > everyone's needs on the first pass, but maybe we'll solve enough and
> > > > > > > provide helpers for the rest.  Without further ado, I'll point to what
> > > > > > > I have so far:
> > > > > > >
> > > > > > > https://github.com/awilliam/mdevctl
> > > > > > >
> > > > > > > This is inspired by driverctl, which is also a bash utility.  mdevctl
> > > > > > > uses udev and systemd to record and recreate mdev devices for
> > > > > > > persistence and provides a command line utility for querying, listing,
> > > > > > > starting, stopping, adding, and removing mdev devices.  Currently, for
> > > > > > > better or worse, it considers anything created to be persistent.  I can
> > > > > > > imagine a global configuration option that might disable this and
> > > > > > > perhaps an autostart flag per mdev device, such that mdevctl might
> > > > > > > simply "know" about some mdevs but not attempt to create them
> > > > > > > automatically.  Clearly command line usage help, man pages, and
> > > > > > > packaging are lacking as well, release early, release often, plus this
> > > > > > > is a discussion starter to see if perhaps this is sufficient to meet
> > > > > > > some needs.
> > > > > >
> > > > > > I think from libvirt's POV, we would *not* want devices to be made
> > > > > > unconditionally persistent. We usually wish to expose a choice to
> > > > > > applications whether to have resources be transient or persistent.
> > > > > >
> > > > > > So from that POV, a global config option to turn off persistence
> > > > > > is not workable either. We would want control per-device, with
> > > > > > autostart control per device too.
> > > > >
> > > > > The code has progressed somewhat in the past 3+ weeks, we still persist
> > > > > all devices, but the start-up mode can be selected per device or with a
> > > > > global default mode.  Devices configured with 'auto' start-up
> > > > > automatically while 'manual' devices are simply known and available to
> > > > > be started.  I imagine we could add a 'transient' mode where we purge
> > > > > the information about the device when it is removed or the next time
> > > > > the parent device is added.
> > > >
> > > > Having a persistent config written out & then purged later is still
> > > > problematic. If the host crashes, nothing will purge the config file,
> > > > so it will become a persistent device. Also when listing devices we
> > > > want to be able to report whether it is persistent or transient. The
> > > > obvious way to do that is to simply look if a config file exists or
> > > > not.  
> > >
> > > I was thinking that the config file would identify the device as
> > > transient, therefore if the system crashed we'd have the opportunity to
> > > purge those entries on the next boot as we're processing the entries
> > > for that parent device.  Clearly it has yet to be implemented, but I
> > > expect there are some advantages to tracking devices via a tran

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-17 Thread Alex Williamson
On Mon, 17 Jun 2019 16:10:30 +0100
Daniel P. Berrangé  wrote:

> On Mon, Jun 17, 2019 at 08:54:38AM -0600, Alex Williamson wrote:
> > On Mon, 17 Jun 2019 15:00:00 +0100
> > Daniel P. Berrangé  wrote:
> >   
> > > On Thu, May 23, 2019 at 05:20:01PM -0600, Alex Williamson wrote:  
> > > > Hi,
> > > > 
> > > > Currently mediated device management, much like SR-IOV VF management,
> > > > is largely left as an exercise for the user.  This is an attempt to
> > > > provide something and see where it goes.  I doubt we'll solve
> > > > everyone's needs on the first pass, but maybe we'll solve enough and
> > > > provide helpers for the rest.  Without further ado, I'll point to what
> > > > I have so far:
> > > > 
> > > > https://github.com/awilliam/mdevctl
> > > > 
> > > > This is inspired by driverctl, which is also a bash utility.  mdevctl
> > > > uses udev and systemd to record and recreate mdev devices for
> > > > persistence and provides a command line utility for querying, listing,
> > > > starting, stopping, adding, and removing mdev devices.  Currently, for
> > > > better or worse, it considers anything created to be persistent.  I can
> > > > imagine a global configuration option that might disable this and
> > > > perhaps an autostart flag per mdev device, such that mdevctl might
> > > > simply "know" about some mdevs but not attempt to create them
> > > > automatically.  Clearly command line usage help, man pages, and
> > > > packaging are lacking as well, release early, release often, plus this
> > > > is a discussion starter to see if perhaps this is sufficient to meet
> > > > some needs.
> > > 
> > > I think from libvirt's POV, we would *not* want devices to be made
> > > unconditionally persistent. We usually wish to expose a choice to
> > > applications whether to have resources be transient or persistent.
> > > 
> > > So from that POV, a global config option to turn off persistence
> > > is not workable either. We would want control per-device, with
> > > autostart control per device too.  
> > 
> > The code has progressed somewhat in the past 3+ weeks, we still persist
> > all devices, but the start-up mode can be selected per device or with a
> > global default mode.  Devices configured with 'auto' start-up
> > automatically while 'manual' devices are simply known and available to
> > be started.  I imagine we could add a 'transient' mode where we purge
> > the information about the device when it is removed or the next time
> > the parent device is added.  
> 
> Having a persistent config written out & then purged later is still
> problematic. If the host crashes, nothing will purge the config file,
> so it will become a persistent device. Also when listing devices we
> want to be able to report whether it is persistent or transient. The
> obvious way to do that is to simply look if a config file exists or
> not.

I was thinking that the config file would identify the device as
transient, therefore if the system crashed we'd have the opportunity to
purge those entries on the next boot as we're processing the entries
for that parent device.  Clearly it has yet to be implemented, but I
expect there are some advantages to tracking devices via a transient
config entry or else we're constantly re-discovering foreign mdevs.
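
As a minimal sketch of that purge, assuming a hypothetical
one-file-per-mdev layout where transient entries carry a marker:

    # hypothetical layout: /etc/mdevctl.d/<parent>/<uuid> holds each config;
    # drop stale transient entries while processing a parent at boot
    for cfg in /etc/mdevctl.d/"$parent"/*; do
        [ -e "$cfg" ] || continue                      # no configs for this parent
        grep -qx 'start=transient' "$cfg" && rm -f "$cfg"
    done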

> > > I would simply get rid of the udev rule that magically persists
> > > stuff. Any person/tool using sysfs right now expects devices to
> > > be transient. If they want to have persistence they can stop using
> > > sysfs & use higher level tools directly.  
> > 
> > I think it's an interesting feature, but it's easy enough to control
> > via a global option in sysconfig with the default off if it's seen as
> > overstepping.  
> 
> A global option is really not desirable, as it means that the behaviour
> of the system that libvirt sees can silently change at any time. IMHO
> this udev hook is intermixing the two layers in the stack - keep the
> low level sysfs layer completely separate from the higher level mgmt
> concepts provided by this mdevctl.

It seems like it just means that libvirt needs to be explicit such that
it doesn't rely on a global preference, ex. always using a --transient
option.
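
For example (the --transient flag itself is still hypothetical at this
point):

    # libvirt states its intent explicitly, regardless of any global default
    mdevctl create-mdev --transient "$uuid" "$parent" "$type"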

> > > > Originally I thought about making a utility to manage both mdev and
> > > > SR-IOV VFs all in one, but it seemed more natural to start here
> > > > (besides, I couldn't think of a good name for the 

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-17 Thread Alex Williamson
On Mon, 17 Jun 2019 15:00:00 +0100
Daniel P. Berrangé  wrote:

> On Thu, May 23, 2019 at 05:20:01PM -0600, Alex Williamson wrote:
> > Hi,
> > 
> > Currently mediated device management, much like SR-IOV VF management,
> > is largely left as an exercise for the user.  This is an attempt to
> > provide something and see where it goes.  I doubt we'll solve
> > everyone's needs on the first pass, but maybe we'll solve enough and
> > provide helpers for the rest.  Without further ado, I'll point to what
> > I have so far:
> > 
> > https://github.com/awilliam/mdevctl
> > 
> > This is inspired by driverctl, which is also a bash utility.  mdevctl
> > uses udev and systemd to record and recreate mdev devices for
> > persistence and provides a command line utility for querying, listing,
> > starting, stopping, adding, and removing mdev devices.  Currently, for
> > better or worse, it considers anything created to be persistent.  I can
> > imagine a global configuration option that might disable this and
> > perhaps an autostart flag per mdev device, such that mdevctl might
> > simply "know" about some mdevs but not attempt to create them
> > automatically.  Clearly command line usage help, man pages, and
> > packaging are lacking as well, release early, release often, plus this
> > is a discussion starter to see if perhaps this is sufficient to meet
> > some needs.  
> 
> I think from libvirt's POV, we would *not* want devices to be made
> unconditionally persistent. We usually wish to expose a choice to
> applications whether to have resources be transient or persistent.
> 
> So from that POV, a global config option to turn off persistence
> is not workable either. We would want control per-device, with
> autostart control per device too.

The code has progressed somewhat in the past 3+ weeks, we still persist
all devices, but the start-up mode can be selected per device or with a
global default mode.  Devices configured with 'auto' start-up
automatically while 'manual' devices are simply known and available to
be started.  I imagine we could add a 'transient' mode where we purge
the information about the device when it is removed or the next time
the parent device is added.
 
> I would simply get rid of the udev rule that magically persists
> stuff. Any person/tool using sysfs right now expects devices to
> be transient. If they want to have persistence they can stop using
> sysfs & use higher level tools directly.

I think it's an interesting feature, but it's easy enough to control
via a global option in sysconfig with the default off if it's seen as
overstepping.

> > Originally I thought about making a utility to manage both mdev and
> > SR-IOV VFs all in one, but it seemed more natural to start here
> > (besides, I couldn't think of a good name for the combined utility).
> > If this seems useful, maybe I'll start on a vfctl for SR-IOV and we'll
> > see whether they have enough synergy to become one.  
> 
> [snip]
> 
> > I'm also curious how or if libvirt or openstack might use this.  If
> > nothing else, it makes libvirt hook scripts easier to write, especially
> > if we add an option not to autostart mdevs, or if users don't mind
> > persistent mdevs, maybe there's nothing more to do.  
> 
> We currently have an API for creating host devices in libvirt which
> we use for NPIV devices only, which is where we'd like to put mdev
> creation support.  This API is for creating transient devices
> though, so we don't want anything created this way to magically
> become persistent.
> 
> For persistence we'd create a new API in libvirt allowing you to
> define & undefine the persistent config for a devices, and another
> set of APIs to create/delete from the persistent config.
> 
> As a general rule, libvirt would prefer to use an API rather than
> spawning external commands, but can live with either.
> 
> There's also the question of systemd integration - so far everything
> in libvirt still works on non-systemd based distros, but this new
> tool looks like it requires systemd.  Personally I'm not too bothered
> by this but others might be more concerned.

Yes, Pavel brought up this issue offline as well and it needs more
consideration.  The systemd support still needs work as well, I've
discovered it gets very confused when the mdev device is removed
outside of mdevctl, but I haven't yet been able to concoct a BindsTo=
line that can handle the hyphens in the uuid device name.  I'd say
mdevctl is not intentionally systemd specific, it's simply a byproduct
of the systems it was developed on.  Also, if libvirt were to focus
only on transient devices, then startup via systemctl doesn't make
sense, which proba
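
As an aside, the hyphen trouble comes from systemd's unit-name escaping,
where '-' becomes '\x2d'; systemd-escape shows the mangled form a
BindsTo= line would need (the uuid below is just an example):

    $ systemd-escape 83c32df7-90ed-40fd-a58e-3cb3da07e3a3
    83c32df7\x2d90ed\x2d40fd\x2da58e\x2d3cb3da07e3a3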

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-14 Thread Alex Williamson
On Fri, 14 Jun 2019 17:06:15 +0200
Christophe de Dinechin  wrote:

> > On 14 Jun 2019, at 16:23, Alex Williamson  
> > wrote:
> > 
> > On Fri, 14 Jun 2019 11:54:42 +0200
> > Christophe de Dinechin  wrote:
> >   
> >> That is true irrespective of the usage, isn’t it? In other words, when you
> >> invoke `mdevctl create-mdev`, you assert “I own that specific parent/type”.
> >> At least, that’s how I read the way the script behaves today. Whether you
> >> invoke uuidgen inside or outside the script does not change that assertion
> >> (at least with today’s code).  
> > 
> > What gives you this impression?  
> 
> That your code does nothing to avoid any race today?
> 
> Maybe I was confused with the existing `uuidgen` example in you README,
> but it looks to me like the usage model involves much more than just
> create-mdev, and that any race that might exist is not in create-mdev itself
> (or in uuidgen for that matter).

I believe I mentioned this was an early release, error handling is
lacking.  Still, I think the races are minimal, they largely involve
uuid collisions.  Separate users can create mdevs under the same parent
concurrently, they would need to have a uuid collision to conflict.
Otherwise there's the resource issue on the parent, but that's left to
the kernel to manage.  If an mdev fails to be created at the kernel,
mdevctl should unwind, but we're not going to pretend that we have a
lock on the parent's sysfs mdev interfaces.

> > Where is the parent/type ownership implied?  
> 
> I did not imply it, but I read some concern about ownership
> on your part in "they need to guess that an mdev device
> with the same parent and type is *theirs*.” (emphasis mine)
> 
> I personally see no change on the “need to guess” implied
> by the fact that you run uuidgen inside the script, so
> that’s why I tried to guess what you meant.

As I noted in the reply to the pull request, putting `uuidgen` inline
was probably a bad example.  However, the difference is that the user
has imposed the race on themselves if they invoke mdevctl like this,
they've provided a uuid but they didn't record what it is.  This is the
user's problem.  Pushing uuid selection into mdevctl makes it mdevctl's
problem because the interface is fundamentally broken.
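
Put differently, the non-broken pattern is trivial for a caller to get
right, for example:

    # generate and record the uuid first, then hand it to mdevctl, so there
    # is never any doubt which device is ours
    uuid=$(uuidgen)
    echo "$uuid" >> my-mdevs.list                    # illustrative bookkeeping
    mdevctl create-mdev "$uuid" "$parent" "$type"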

> > The intended semantics are
> > "try to create this type of device under this parent”.  
> 
> Agreed. Which is why I don’t see why trying to create
> with some new UUID introduces any race (as long as
> the script prints out that UUID, which I admit my patch
> entirely failed to do)

And that's the piece that makes it fundamentally broken.  Beyond that,
it seems unnecessary.  I don't see this as the primary invocation of
mdevctl and the functionality it adds is trivially accomplished in a
wrapper, so what's the value?

> >>> How do you resolve two instances of this happening in parallel and both
> >>> coming to the same conclusion about which is their device?  If a user wants
> >>> this sort of headache they can call mdevctl with `uuidgen` but I don't
> >>> think we should encourage it further.
> >> 
> >> I agree there is a race, but if anything, having a usage where you don’t
> >> pass the UUID on the command line is a step in the right direction.
> >> It leaves the door open for the create-mdev script to do smarter things,
> >> like deferring the allocation of the mdevs to an entity that has slightly
> >> more knowledge of the global system state than uuidgen.  
> > 
> > A user might (likely) require a specific uuid to match their VM
> > configuration.  I can only think of very niche use cases where a user
> > doesn't care what uuid they get.  
> 
> They do care. But I typically copy-paste my UUIDs, and then
> 
> 1. copy-pasting at the end is always faster than between
> the command and other arguments (3-args case). 
> 
> 2. copy-pasting the output of the previous command is faster
> than having one extra step where I need to copy the same thing twice
> (2-args case).
> 
> So to me, if the script is intended to be used by humans, my
> proposal makes it slightly more comfortable to use. Nothing more.

This is your preference, but I wouldn't call it universal.  Specifying
the uuid last seems backwards to me, we're creating an object so let's
first name that object.  We then specify where that object should be
created and what type it has.  This seems very logical to me, besides,
it's also the exact same order we use when listing mdevs :P

Clearly there's personal preference here, so let's not arbitrarily pick
a different preference.  If copy/paste order is more important to you
then submit a patch to give mdevctl real argument pro

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-14 Thread Alex Williamson
On Fri, 14 Jun 2019 11:54:42 +0200
Christophe de Dinechin  wrote:

> > On 13 Jun 2019, at 18:35, Alex Williamson  
> > wrote:
> > 
> > On Thu, 13 Jun 2019 18:17:53 +0200
> > Christophe de Dinechin  wrote:
> >   
> >>> On 24 May 2019, at 01:20, Alex Williamson  
> >>> wrote:
> >>> 
> >>> Hi,
> >>> 
> >>> Currently mediated device management, much like SR-IOV VF management,
> >>> is largely left as an exercise for the user.  This is an attempt to
> >>> provide something and see where it goes.  I doubt we'll solve
> >>> everyone's needs on the first pass, but maybe we'll solve enough and
> >>> provide helpers for the rest.  Without further ado, I'll point to what
> >>> I have so far:
> >>> 
> >>> https://github.com/awilliam/mdevctl
> >> 
> >> While it’s still early, what about :
> >> 
> >>    mdevctl create-mdev <parent> <type> [<mdev-uuid>]
> >> 
> >> where if the mdev-uuid is missing, you just run uuidgen within the script?
> >> 
> >> I sent a small PR in case you think it makes sense.  
> > 
> > It sounds racy.  If the user doesn't provide the UUID then they need to
> > guess that an mdev device with the same parent and type is theirs.  
> 
> That is true irrespective of the usage, isn’t it? In other words, when you
> invoke `mdevctl create-mdev`, you assert “I own that specific parent/type”.
> At least, that’s how I read the way the script behaves today. Whether you
> invoke uuidgen inside or outside the script does not change that assertion
> (at least with today’s code).

What gives you this impression?  Where is the parent/type ownership
implied?  The intended semantics are "try to create this type of device
under this parent".
 
> >  How do you resolve two instances of this happening in parallel and both
> > coming to the same conclusion about which is their device?  If a user wants
> > this sort of headache they can call mdevctl with `uuidgen` but I don't
> > think we should encourage it further.  
> 
> I agree there is a race, but if anything, having a usage where you don’t
> pass the UUID on the command line is a step in the right direction.
> It leaves the door open for the create-mdev script to do smarter things,
> like deferring the allocation of the mdevs to an entity that has slightly
> more knowledge of the global system state than uuidgen.

A user might (likely) require a specific uuid to match their VM
configuration.  I can only think of very niche use cases where a user
doesn't care what uuid they get.

> In other words, in my mind, `mdevctl create-mdev parent type` does not
> imply “this will use uuidgen” but rather, if anything, implies “this will do 
> the
> right thing to prevent the race in the future, even if that’s more complex
> than just calling uuidgen”.

What race are you trying to prevent, uuid collision?

> However, I believe that this means we should reorder the args further.
> I would suggest something like:
> 
>   mdevctl create-mdev <type> [<parent> [<uuid>]]
> 
> where

Absolutely not, now you've required mdevctl to implement policy in mdev
placement.  mdevctl follows the unix standard, do one thing and do it
well.  If someone wants to layer placement policy on top of mdevctl,
great, but let's not impose that within mdevctl.
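
Such a policy layer is only a few lines of shell on top of mdevctl, e.g.
picking the first parent with free capacity (a sketch; the type name is
an example):

    type=nvidia-63    # example mdev type name
    for parent in /sys/class/mdev_bus/*; do
        avail="$parent/mdev_supported_types/$type/available_instances"
        if [ -r "$avail" ] && [ "$(cat "$avail")" -gt 0 ]; then
            mdevctl create-mdev "$(uuidgen)" "$(basename "$parent")" "$type"
            break
        fi
    done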

> 1 arg means you let mdevctl choose the parent device for you (future)
>(e.g. I want a VGPU of this type, I don’t really care where it comes from)
> 2 args mean you want that specific type/parent combination
> 3 args mean you assert you own that device
> 
> That also implies that mdevctl create-mdev should output what it allocated
> so that some higher-level software can tell “OK, that’s the instance I got”.

I don't think we're aligned on what mdevctl is attempting to provide.
Maybe you're describing a layer you'd like to see above mdevctl?
Thanks,

Alex

> > BTW, I've moved the project to https://github.com/mdevctl/mdevctl, the
> > latest commit in the tree above makes that change, I've also updated
> > the description on my repo to point to the new location.  Thanks,  
> 
> Done.
> 
> > 
> > Alex
> > 

Re: [libvirt] mdevctl: A shoestring mediated device management and persistence utility

2019-06-13 Thread Alex Williamson
On Thu, 13 Jun 2019 18:17:53 +0200
Christophe de Dinechin  wrote:

> > On 24 May 2019, at 01:20, Alex Williamson  
> > wrote:
> > 
> > Hi,
> > 
> > Currently mediated device management, much like SR-IOV VF management,
> > is largely left as an exercise for the user.  This is an attempt to
> > provide something and see where it goes.  I doubt we'll solve
> > everyone's needs on the first pass, but maybe we'll solve enough and
> > provide helpers for the rest.  Without further ado, I'll point to what
> > I have so far:
> > 
> > https://github.com/awilliam/mdevctl  
> 
> While it’s still early, what about :
> 
>   mdevctl create-mdev <parent> <type> [<mdev-uuid>]
> 
> where if the mdev-uuid is missing, you just run uuidgen within the script?
> 
> I sent a small PR in case you think it makes sense.

It sounds racy.  If the user doesn't provide the UUID then they need to
guess that an mdev device with the same parent and type is theirs.  How
do you resolve two instances of this happening in parallel and both
coming to the same conclusion about which is their device?  If a user wants
this sort of headache they can call mdevctl with `uuidgen` but I don't
think we should encourage it further.

BTW, I've moved the project to https://github.com/mdevctl/mdevctl, the
latest commit in the tree above makes that change, I've also updated
the description on my repo to point to the new location.  Thanks,

Alex

