Re: [PATCH v3 00/19] Add a host IOMMU device abstraction to check with vIOMMU

2024-05-07 Thread Jason Gunthorpe
On Tue, May 07, 2024 at 02:24:30AM +, Duan, Zhenzhong wrote:
> >On Mon, May 06, 2024 at 02:30:47AM +, Duan, Zhenzhong wrote:
> >
> >> I'm not clear how useful multiple iommufd instances support are.
> >> One possible benefit is for security? It may bring a slightly fine-grained
> >> isolation in kernel.
> >
> >No. I don't think there is any usecase, it is only harmful.
> 
> OK, so we need to limit QEMU to only one iommufd instance.

I don't know about limit, but you don't need to do extra stuff to make
it work.

The main issue will be to get all the viommu instances to share the
same iommufd IOAS for the guest physical mapping. Otherwise each
viommu should be largely unaware of the others sharing (or not) an
iommufd.

If you can structure things properly it probably doesn't need a hard
limit, it will just work worse.
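To make that concrete, here is a minimal sketch of the single-FD model: one /dev/iommu open, one IOAS carrying the guest physical mapping, and every VFIO device attached to that same IOAS regardless of which vIOMMU it sits behind. The ioctls are the iommufd/VFIO cdev uAPI as I understand it; error handling is elided and struct layouts should be verified against <linux/iommufd.h> and <linux/vfio.h>.

/* Sketch: one iommufd FD and one IOAS, shared by every VFIO device. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>
#include <linux/vfio.h>

static int attach_to_ioas(int devfd, int iommufd, uint32_t ioas_id)
{
    struct vfio_device_bind_iommufd bind = {
        .argsz = sizeof(bind),
        .iommufd = iommufd,
    };
    struct vfio_device_attach_iommufd_pt attach = {
        .argsz = sizeof(attach),
        .pt_id = ioas_id,
    };

    if (ioctl(devfd, VFIO_DEVICE_BIND_IOMMUFD, &bind) < 0) {
        return -1;
    }
    /* every device translates through the same guest physical mapping */
    return ioctl(devfd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
}

int main(void)
{
    int iommufd = open("/dev/iommu", O_RDWR);
    int dev0 = open("/dev/vfio/devices/vfio0", O_RDWR);  /* paths illustrative */
    int dev1 = open("/dev/vfio/devices/vfio1", O_RDWR);
    struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };

    ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);   /* the single IOAS for the guest */
    attach_to_ioas(dev0, iommufd, alloc.out_ioas_id);
    attach_to_ioas(dev1, iommufd, alloc.out_ioas_id);
    return 0;
}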

Jason



Re: [PATCH v3 00/19] Add a host IOMMU device abstraction to check with vIOMMU

2024-05-06 Thread Jason Gunthorpe
On Mon, May 06, 2024 at 02:30:47AM +, Duan, Zhenzhong wrote:

> I'm not clear how useful multiple iommufd instances support are.
> One possible benefit is for security? It may bring a slightly fine-grained
> isolation in kernel.

No. I don't think there is any usecase, it is only harmful.

Jason



Re: [PATCH v3 00/19] Add a host IOMMU device abstraction to check with vIOMMU

2024-05-03 Thread Jason Gunthorpe
On Fri, May 03, 2024 at 04:04:25PM +0200, Cédric Le Goater wrote:
> However, have you considered another/complementary approach which
> would be to create an host IOMMU (iommufd) backend object and a vIOMMU
> device object together for each vfio-pci device being plugged in the
> machine ?
> 
> Something like,
> -device pcie-root-port,port=23,chassis=8,id=pci.8,bus=pcie.0 \
> -object iommufd,id=iommufd1 \
> -device intel-iommu,intremap=on,device-iotlb=on,caching-mode=on,iommufd=iommufd1 \
> -device vfio-pci,host=:08:10.0,bus=pci.1,iommufd=iommufd0

? The main point of this is to have a single iommufd FD open in
qemu, not multiple. Would you achieve this with both an iommufd0 and an
iommufd1?

Jason



Re: [PATCH RFCv2 6/8] backends/iommufd: Add ability to disable hugepages

2024-02-12 Thread Jason Gunthorpe
On Mon, Feb 12, 2024 at 01:56:41PM +, Joao Martins wrote:
> Allow disabling hugepages so that dirty tracking is done at base page
> granularity, in a similar vein to vfio_type1_iommu.disable_hugepages,
> but per IOAS.

No objection to this, but I just wanted to observe I didn't imagine
using this option for this purpose. It should work OK but it is a
pretty big and inefficient hammer :)
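For reference, the per-IOAS knob this patch wraps is the iommufd IOMMU_OPTION_HUGE_PAGES option; a rough sketch of the userspace side follows, with field names per my reading of <linux/iommufd.h>, so treat them as assumptions to verify.

/* Sketch: disable hugepage use for one IOAS so dirty tracking reports at
 * base page granularity (per-IOAS, unlike the global
 * vfio_type1_iommu.disable_hugepages module parameter).
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int ioas_disable_hugepages(int iommufd, uint32_t ioas_id)
{
    struct iommu_option opt = {
        .size = sizeof(opt),
        .option_id = IOMMU_OPTION_HUGE_PAGES,
        .op = IOMMU_OPTION_OP_SET,
        .object_id = ioas_id,
        .val64 = 0,             /* 0: do not use hugepages for this IOAS */
    };

    return ioctl(iommufd, IOMMU_OPTION, &opt);
}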

Jason



Re: [PATCH RFCv2 2/8] vfio/iommufd: Introduce auto domain creation

2024-02-12 Thread Jason Gunthorpe
On Mon, Feb 12, 2024 at 01:56:37PM +, Joao Martins wrote:
> There's generally two modes of operation for IOMMUFD:
> 
> * The simple user API which intends to perform relatively simple things
> with IOMMUs e.g. DPDK. It generally creates an IOAS, attaches it to VFIO,
> and mainly performs IOAS_MAP and UNMAP.
> 
> * The native IOMMUFD API where you have fine grained control of the
> IOMMU domain and model it accordingly. This is where most new features
> are being steered to.
> 
> For dirty tracking the latter (native API) is required, as it needs to ensure that
> the stage-2/parent IOMMU domain will only attach devices
> that support dirty tracking (so far it is all homogeneous in x86, likely
> not the case for smmuv3). Such invariant on dirty tracking provides a
> useful guarantee to VMMs that will refuse incompatible device
> attachments for IOMMU domains.
> 
> For dirty tracking such property is enabled/enforced via HWPT_ALLOC,
> which is responsible for creating an IOMMU domain. This is in contrast to
> the 'simple API' where the IOMMU domain is created by IOMMUFD
> automatically when it attaches to VFIO (usually referred to as autodomains).
> 
> To support dirty tracking with the advanced IOMMUFD API, it needs
> similar logic, where IOMMU domains are created and devices attached to
> compatible domains. Essentially mimicking the kernel's
> iommufd_device_auto_get_domain(). If this fails (e.g. for mdevs) it falls back
> to IOAS attach.
> 
> Signed-off-by: Joao Martins 
> ---
> Right now the only alternative to a userspace autodomains implementation
> is to mirror all the flags being added to HWPT_ALLOC into the VFIO
> IOAS attach path.

It was my expectation that VMM userspace would implement direct hwpt
support. This is needed in all kinds of cases going forward because
hwpt allocation is not uniform across iommu instances and managing
this in the kernel is only feasible for simpler cases. Dirty tracking
would be one of them.

> +int iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id,
> +                               uint32_t pt_id, uint32_t flags,
> +                               uint32_t data_type, uint32_t data_len,
> +                               void *data_ptr, uint32_t *out_hwpt)
> +{
> +    int ret;
> +    struct iommu_hwpt_alloc alloc_hwpt = {
> +        .size = sizeof(struct iommu_hwpt_alloc),
> +        .flags = flags,
> +        .dev_id = dev_id,
> +        .pt_id = pt_id,
> +        .data_type = data_type,
> +        .data_len = data_len,
> +        .data_uptr = (uint64_t)data_ptr,
> +        .__reserved = 0,

Unless the coding style requires it, it is not necessary to explicitly
zero __reserved within a bracketed initializer.
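The reason the explicit assignment is redundant: a C designated initializer implicitly zero-initializes every member that is not named, so the two forms below are equivalent.

/* Equivalent initializations: unnamed members are implicitly zeroed. */
struct iommu_hwpt_alloc a = {
    .size = sizeof(a),
    .flags = flags,
    .__reserved = 0,    /* explicit, but not needed */
};
struct iommu_hwpt_alloc b = {
    .size = sizeof(b),
    .flags = flags,     /* __reserved and all other members become 0 */
};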

> +static int iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> +                                        VFIOIOMMUFDContainer *container,
> +                                        Error **errp)
> +{
> +    int iommufd = vbasedev->iommufd_dev.iommufd->fd;
> +    VFIOIOASHwpt *hwpt;
> +    Error *err = NULL;
> +    int ret = -EINVAL;
> +    uint32_t hwpt_id;
> +
> +    /* Try to find a domain */
> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, &err);
> +        if (ret) {
> +            /* -EINVAL means the domain is incompatible with the device. */
> +            if (ret == -EINVAL) {

Please send a kernel kdoc patch to document this..

The approach looks good to me

The nesting patches surely need this too?

Jason
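Put together, the userspace auto-domain flow under review amounts to roughly the loop below. This is a sketch inside a function like iommufd_cdev_autodomains_get() above; iommufd_cdev_attach_ioas_hwpt() and iommufd_backend_alloc_hwpt() are the helpers from the patch, the IOAS-attach fallback helper name is hypothetical, and error/cleanup paths are trimmed.

/* 1) Try to reuse an existing compatible domain. */
QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, &err);
    if (!ret) {
        return 0;           /* attached to an existing domain */
    }
    if (ret != -EINVAL) {
        return ret;         /* hard attach failure */
    }
    /* -EINVAL: this domain is incompatible with the device, keep looking */
}

/* 2) None compatible: allocate a fresh domain with the wanted flags
 *    (e.g. dirty-tracking enforcement). */
ret = iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid, container->ioas_id,
                                 flags, IOMMU_HWPT_DATA_NONE, 0, NULL,
                                 &hwpt_id);
if (ret) {
    /* 3) e.g. mdevs have no IOMMU-backed domain: fall back to IOAS attach
     *    (helper name hypothetical). */
    return iommufd_cdev_attach_ioas(vbasedev, container->ioas_id, errp);
}
return iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, &err);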



Re: [PATCH v6 1/2] qom: new object to associate device to numa node

2024-01-09 Thread Jason Gunthorpe
On Tue, Jan 09, 2024 at 11:36:03AM -0800, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Tue, Jan 09, 2024 at 06:02:03PM +0100, David Hildenbrand wrote:
> > > > Given that, an alternative proposal that I think would work
> > > > for you would be to add a 'placeholder' memory node definition
> > > > in SRAT (so allow 0 size explicitly - might need a new SRAT
> > > > entry to avoid backwards compat issues).
> > > 
> > > > Putting all the PCI/GI/... complexity aside, I'll just raise again that
> > > > for virtio-mem something simple like that might be helpful as well, IIUC.
> > > 
> > >   -numa node,nodeid=2 \
> > >   ...
> > >   -device virtio-mem-pci,node=2,... \
> > > 
> > > All we need is the OS to prepare for an empty node that will get populated
> > > with memory later.
> > 
> > That is all this is doing too, the NUMA relationship of the actual
> > memory is desribed already by the PCI device since it is a BAR on the
> > device.
> > 
> > The only purpose is to get the empty nodes into Linux :(
> > 
> > > So if that's what a "placeholder" node definition in srat could achieve as
> > > well, even without all of the other acpi-generic-initiator stuff, that
> > > would be great.
> > 
> > Seems like there are two quite similar use cases.. virtio-mem is going
> > to be calling the same family of kernel API I suspect :)
> 
> It seems sad that we, as an industry, went through all of this trouble
> to define a dynamically enumerable CXL device model only to turn around
> and require static ACPI tables to tell us how to enumerate it.
> 
> A similar problem exists on the memory target side and the approach
> taken there was to have Linux statically reserve at least enough numa
> node numbers for all the platform CXL memory ranges (defined in the
> ACPI.CEDT.CFMWS), but with the promise to come back and broach the
> dynamic node creation problem "if the need arises".
> 
> This initiator-node enumeration case seems like that occasion where the
> need has arisen to get Linux out of the mode of needing to declare all
> possible numa nodes early in boot. Allow for nodes to be discoverable
> post NUMA-init.
> 
> One strawman scheme that comes to mind is instead of "add nodes early" in
> boot, "delete unused nodes late" in boot after the device topology has
> been enumerated. Otherwise, requiring static ACPI tables to further
> enumerate an industry-standard dynamically enumerated bus seems to be
> going in the wrong direction.

Fully agree, and I think this will get increasingly painful as we go
down the CXL road.

Jason



Re: [PATCH v6 1/2] qom: new object to associate device to numa node

2024-01-09 Thread Jason Gunthorpe
On Tue, Jan 09, 2024 at 06:02:03PM +0100, David Hildenbrand wrote:
> > Given that, an alternative proposal that I think would work
> > for you would be to add a 'placeholder' memory node definition
> > in SRAT (so allow 0 size explicitly - might need a new SRAT
> > entry to avoid backwards compat issues).
> 
> Putting all the PCI/GI/... complexity aside, I'll just raise again that for
> virtio-mem something simple like that might be helpful as well, IIUC.
> 
>   -numa node,nodeid=2 \
>   ...
>   -device virtio-mem-pci,node=2,... \
> 
> All we need is the OS to prepare for an empty node that will get populated
> with memory later.

That is all this is doing too, the NUMA relationship of the actual
memory is described already by the PCI device since it is a BAR on the
device.

The only purpose is to get the empty nodes into Linux :(

> So if that's what a "placeholder" node definition in srat could achieve as
> well, even without all of the other acpi-generic-initiator stuff, that would
> be great.

Seems like there are two quite similar use cases.. virtio-mem is going
to be calling the same family of kernel API I suspect :)

Jason



Re: [PATCH v4 28/41] vfio/iommufd: Implement the iommufd backend

2023-11-09 Thread Jason Gunthorpe
On Thu, Nov 09, 2023 at 01:21:59PM +, Joao Martins wrote:
> On 09/11/2023 13:09, Jason Gunthorpe wrote:
> > On Thu, Nov 09, 2023 at 01:03:02PM +, Joao Martins wrote:
> > 
> >>> I am not talking about mdevs; but rather the regular (non mdev) case not
> >>> being able to use dirty tracking with autodomains hwpt allocation.
> >>
> >> ... without any vIOMMU.
> > 
> > Ah, well, that is troublesome isn't it..
> > 
> > So do we teach autodomains to be more featured in the kernel or do we
> > teach the generic qemu code to effectively implement autodomains in
> > userspace?
> 
> The latter is actually what we have been doing. Well, I wouldn't call it
> autodomains in qemu, but rather just allocating a hwpt instead of attaching
> the IOAS directly. But mdevs don't have domains and we overlooked that.
> I would treat the exception as an exception rather than making it the norm;
> it doesn't look to be much complexity added?

Autodomains are complex because of things like mdev and iommu
non-uniformities. Qemu can't just allocate a single HWPT, it needs to
be annoyingly managed.

> What I last recollect is that autodomains represents the 'simple users' that
> don't care much beyond the basics of IOMMU features (I recall the example was
> DPDK apps and the like). You could say that for current needs IOMMU
> autodomains suffices for qemu.

Yes, that was my intention. Aside from that it primarily exists to
support vfio compatibility

> Connecting autodomains to this enforcing on the hwpt is relatively simple btw,
> it just needs to connect the dirty tracking flag with same semantic of
> hwpt-alloc equivalent and pass the hwpt flags into the domain allocation.

Yes

> It's more a question of what the expectations to the user should be when
> using ATTACH_HWPT with an IOAS_ID versus direct manipulation of HWPT. I am
> wondering if dirty tracking is alone here or whether there are more features
> that start to muddy the simplicity of autodomains and approximate it to
> hwpt-alloc.

This is why I had been thinking of a pure HWPT based scheme

So it seems we cannot have a simple model where the generic qemu layer
just works in IOAS :( It might as well always work in HWPT and
understand all the auto domains complexity itself.

Jason



Re: [PATCH v4 28/41] vfio/iommufd: Implement the iommufd backend

2023-11-09 Thread Jason Gunthorpe
On Thu, Nov 09, 2023 at 01:03:02PM +, Joao Martins wrote:

> > I am not talking about mdevs; but rather the regular (non mdev) case not
> > being able to use dirty tracking with autodomains hwpt allocation.
> 
> ... without any vIOMMU.

Ah, well, that is troublesome isn't it..

So do we teach autodomains to be more featured in the kernel or do we
teach the generic qemu code to effectively implement autodomains in
userspace?

Jason



Re: [PATCH v4 28/41] vfio/iommufd: Implement the iommufd backend

2023-11-09 Thread Jason Gunthorpe
On Thu, Nov 09, 2023 at 12:17:35PM +, Joao Martins wrote:
> 
> 
> On 08/11/2023 12:48, Jason Gunthorpe wrote:
> > On Wed, Nov 08, 2023 at 07:16:52AM +, Duan, Zhenzhong wrote:
> > 
> >>>> +    ret = iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
> >>>> +                                     container->ioas_id, &hwpt_id);
> >>>> +
> >>>> +    if (ret) {
> >>>> +        error_setg_errno(errp, errno, "error alloc shadow hwpt");
> >>>> +        return ret;
> >>>> +    }
> >>>
> >>> The above alloc_hwpt fails for mdevs (at least, it fails for me
> >>> attempting to use the iommufd backend with vfio-ccw and vfio-ap on
> >>> s390).  The ioctl is failing in the kernel because it can't find an
> >>> IOMMUFD_OBJ_DEVICE.
> >>>
> >>> AFAIU that's because the mdevs are meant to instead use kernel access via
> >>> vfio_iommufd_emulated_attach_ioas, not hwpt.  That's how mdevs behave when
> >>> looking at the kernel vfio compat container.
> >>>
> >>> As a test, I was able to get vfio-ccw and vfio-ap working using the
> >>> iommufd backend by just skipping this alloc_hwpt above and instead
> >>> passing container->ioas_id into the iommufd_cdev_attach_hwpt below.
> >>> That triggers the vfio_iommufd_emulated_attach_ioas call in the kernel.
> >>
> >> Thanks for the help with testing and investigation.
> >> I was only focusing on real devices and missed the mdev particularity,
> >> sorry. You are right, there is no hwpt support for mdev, not even an
> >> emulated hwpt. I'll dig into this and see how to distinguish mdev from
> >> real device in this low level function.
> > 
> > I was expecting that hwpt manipulation would be done exclusively
> > inside the device-specific vIOMMU userspace driver. Generic code paths
> > that don't have that knowledge should use the IOAS for everything
> 
> I am probably late into noticing this given Zhenzhong v5; but arent' we
> forgetting the enforcing of dirty tracking in HWPT is done /via/
> ALLOC_HWPT ?

The underlying viommu driver supporting mdev cannot support dirty
tracking via the hwpt flag, so it doesn't matter.

The entire point is that a mdev doesn't have a hwpt or any of the hwpt
linked features including dirty tracking.

Jason



Re: [PATCH v4 28/41] vfio/iommufd: Implement the iommufd backend

2023-11-08 Thread Jason Gunthorpe
On Wed, Nov 08, 2023 at 01:25:34PM +, Duan, Zhenzhong wrote:

> >I was expecting that hwpt manipulation would be done exclusively
> >inside the device-specific vIOMMU userspace driver. Generic code paths
> >that don't have that knowledge should use the IOAS for everything
> 
> Yes, this way we don't need to distinguish between mdev and real device,
> just attach to IOAS. But lose the benefit that same hwpt could be passed
> into vIOMMU to be used as S2 hwpt in nesting.

If you have a nesting capable vIOMMU driver then it should be
creating the HWPTs and managing them in its layer. Maybe the core code
provides some helpers.

Obviously you can't link a mdev to a nesting vIOMMU driver in the
first place. Mdev should be connected to a different IOMMU driver that
doesn't use HWPT at all.

I think it will make a lot of trouble to put the hwpt in the wrong
layer as there shouldn't really be much generic code touching it.
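A rough sketch of that layering, using the extended alloc helper from Joao's patch quoted earlier in this digest; the flag and data-type constants are the iommufd uAPI as I recall it, and guest_data/guest_data_type stand in for the vendor-specific (VT-d/SMMUv3) stage-1 configuration.

uint32_t s2_hwpt_id, s1_hwpt_id;
int ret;

/* Stage 2 (GPA->HPA), populated from the shared IOAS and marked nestable. */
ret = iommufd_backend_alloc_hwpt(iommufd, devid, ioas_id,
                                 IOMMU_HWPT_ALLOC_NEST_PARENT,
                                 IOMMU_HWPT_DATA_NONE, 0, NULL, &s2_hwpt_id);

/* Stage 1 (guest IOVA->GPA), parented on the stage-2 HWPT; the data blob
 * carries the guest's vendor-specific configuration. */
ret = iommufd_backend_alloc_hwpt(iommufd, devid, s2_hwpt_id, 0,
                                 guest_data_type, sizeof(guest_data),
                                 &guest_data, &s1_hwpt_id);

All of this lives in the vIOMMU-specific backend; generic VFIO code only ever sees the IOAS, which is why an mdev (no HWPT at all) does not disturb it.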

Jason



Re: [PATCH v4 28/41] vfio/iommufd: Implement the iommufd backend

2023-11-08 Thread Jason Gunthorpe
On Wed, Nov 08, 2023 at 07:16:52AM +, Duan, Zhenzhong wrote:

> >> +    ret = iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
> >> +                                     container->ioas_id, &hwpt_id);
> >> +
> >> +    if (ret) {
> >> +        error_setg_errno(errp, errno, "error alloc shadow hwpt");
> >> +        return ret;
> >> +    }
> >
> >The above alloc_hwpt fails for mdevs (at least, it fails for me attempting
> >to use the iommufd backend with vfio-ccw and vfio-ap on s390).  The ioctl is
> >failing in the kernel because it can't find an IOMMUFD_OBJ_DEVICE.
> >
> >AFAIU that's because the mdevs are meant to instead use kernel access via
> >vfio_iommufd_emulated_attach_ioas, not hwpt.  That's how mdevs behave when
> >looking at the kernel vfio compat container.
> >
> >As a test, I was able to get vfio-ccw and vfio-ap working using the iommufd
> >backend by just skipping this alloc_hwpt above and instead passing
> >container->ioas_id into the iommufd_cdev_attach_hwpt below.  That triggers
> >the vfio_iommufd_emulated_attach_ioas call in the kernel.
> 
> Thanks for the help with testing and investigation.
> I was only focusing on real devices and missed the mdev particularity, sorry.
> You are right, there is no hwpt support for mdev, not even an emulated hwpt.
> I'll dig into this and see how to distinguish mdev from real device in
> this low level function.

I was expecting that hwpt manipulation would be done exclusively
inside the device-specific vIOMMU userspace driver. Generic code paths
that don't have that knowledge should use the IOAS for everything

Jason



Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object

2023-10-17 Thread Jason Gunthorpe
On Tue, Oct 17, 2023 at 10:54:19AM -0600, Alex Williamson wrote:
> On Tue, 17 Oct 2023 12:28:30 -0300
> Jason Gunthorpe  wrote:
> 
> > On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:
> > 
> > > Do we therefore need some programatic means for the kernel driver to
> > > expose the node configuration to userspace?  What interfaces would
> > > libvirt like to see here?  Is there an opportunity that this could
> > > begin to define flavors or profiles for variant devices like we have
> > > types for mdev devices where the node configuration would be
> > > encompassed in a device profile?   
> > 
> > I don't think we should shift this mess into the kernel..
> > 
> > We have a wide range of things now that the orchestration must do in
> > order to prepare that are fairly device specific. I understand in K8S
> > configurations the preference is using operators (aka user space
> > drivers) to trigger these things.
> > 
> > Supplying a few extra qemu command line options seems minor compared
> > to all the profile and provisioning work that has to happen for other
> > device types.
> 
> This seems to be a growing problem for things like mlx5-vfio-pci where
> there's non-trivial device configuration necessary to enable migration
> support.  It's not super clear to me how those devices are actually
> expected to be used in practice with that configuration burden.

Yes, it is the nature of the situation. There is lots and lots of
stuff in the background here. We can nibble at some things, but I
don't see a way to be completely free of a userspace driver providing
the orchestration piece for every device type.

Maybe someone who knows more about the k8s stuff works can explain
more?

> Are we simply saying here that it's implicit knowledge that the
> orchestration must posses that when assigning devices exactly matching
> 10de:2342 or 10de:2345 when bound to the nvgrace-gpu-vfio-pci driver
> that 8 additional NUMA nodes should be added to the VM and an ACPI
> generic initiator object created linking those additional nodes to the
> assigned GPU?

What I'm trying to say is that orchestration should try to pull in a
userspace driver to provide the non-generic pieces.

But, it isn't clear to me what that driver is generically. Something
like this case is pretty stand alone, but mlx5 needs to interact with
the networking control plane to fully provision the PCI
function. Storage needs a different control plane. Few PCI devices are
so stand alone they can be provisioned without complicated help. 

Even things like IDXD need orchestration to sort of uniquely
understand how to spawn their SIOV functions.

I'm not sure I see a clear vision here from libvirt side how all these
parts interact in the libvirt world, or if the answer is "use
openshift and the operators".

Jason



Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object

2023-10-17 Thread Jason Gunthorpe
On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:

> Do we therefore need some programatic means for the kernel driver to
> expose the node configuration to userspace?  What interfaces would
> libvirt like to see here?  Is there an opportunity that this could
> begin to define flavors or profiles for variant devices like we have
> types for mdev devices where the node configuration would be
> encompassed in a device profile? 

I don't think we should shift this mess into the kernel..

We have a wide range of things now that the orchestration must do in
order to prepare that are fairly device specific. I understand in K8S
configurations the preference is using operators (aka user space
drivers) to trigger these things.

Supplying a few extra qemu command line options seems minor compared
to all the profile and provisioning work that has to happen for other
device types.

Jason



Re: [PATCH 39/52] migration/rdma: Convert qemu_rdma_write_one() to Error

2023-10-16 Thread Jason Gunthorpe
On Sat, Oct 07, 2023 at 03:40:50AM +, Zhijian Li (Fujitsu) wrote:
> +rdma-core
> 
> 
> Is global variable *errno* reliable when the documentation only states
> "returns 0 on success, or the value of errno on failure (which indicates the 
> failure reason)."

I think the intention of this language was that an errno constant is
returned, the caller should not assume it is stored in errno.

errno is difficult, many things overwrite it, you can lose its value
during error handling.
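A common defensive pattern for this is to snapshot errno into a local immediately after the failing call, before any cleanup can clobber it, for example:

/* Snapshot errno before error handling can overwrite it. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static int write_and_report(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) < 0) {
        int saved_errno = errno;    /* close()/fprintf() may clobber errno */

        close(fd);
        fprintf(stderr, "write failed\n");
        return -saved_errno;        /* callers see the original reason */
    }
    return 0;
}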

Jason



Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory

2023-09-27 Thread Jason Gunthorpe
On Wed, Sep 27, 2023 at 03:03:09PM +, Vikram Sethi wrote:
> > From: Alex Williamson
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> >
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe  wrote:
> > 
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easily
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hacking - Linux can't dynamically create NUMA
> > > nodes.
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,

I haven't looked myself, but I've been told this is very
hard. Probably the NUMA concept needs to be split a bit so that the
memory allocator pool handle is not tied to the ACPI.

> Even if Linux could create NUMA nodes dynamically for coherent CXL
> or CXL type devices, there is additional information FW knows that
> the kernel doesn't. For example, what the distance/latency between
> CPU and the device NUMA node is.

But that should just be the existing normal PCIy stuff to define
affinity of the PCI function. Since the memory is logically linked to
the PCI function we have no issue from a topology perspective.

> While CXL devices present a CDAT table which gives latency type
> attributes within the device, there would still be some guesswork
> needed in the kernel as to what the end to end latency/distance is.

Is it non-uniform per CXL function?

> knows this information and should provide it.  Similarly, QEMU
> should pass along this information to VMs for the best outcomes.

Well yes, ideally qemu would pass vCPU affinity for the vPCI functions
so existing NUMA aware allocations in PCI drivers are optimized. eg we
put queues in memory close to the PCI function already.
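The existing NUMA-aware allocation referred to here looks roughly like this in a Linux PCI driver, which is why the affinity of the vPCI function is all the generic path really needs:

/* Linux kernel sketch: allocate a queue on the node local to the PCI function. */
#include <linux/pci.h>
#include <linux/slab.h>

static void *alloc_queue_local(struct pci_dev *pdev, size_t bytes)
{
    int node = dev_to_node(&pdev->dev);     /* NUMA node of the PCI function */

    return kzalloc_node(bytes, GFP_KERNEL, node);
}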

I think it is important to keep separate the need to know the
topology/affinity/etc and the need for the driver to have a Linux NUMA
node handle to operate the memory allocator pool APIs.

Regardless, it is what it is for the foreseeable future :(

Jason



Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory

2023-09-27 Thread Jason Gunthorpe
On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:

> CXL accelerators / GPUs etc are a different question but who has one
> of those anyway? :)

That's exactly what I mean when I say CXL will need it too. I keep
describing this current Grace & Hopper as pre-CXL HW. You can easily
imagine draping CXL around it. CXL doesn't solve the problem that
motivates all this hacking - Linux can't dynamically create NUMA
nodes.

So even with CXL baremetal FW will have to create these nodes so Linux
can consume them. Hackity hack hack..

Broadly when you start to look at all of CXL, any device with coherent
memory and the ability to create SRIOV VF like things is going to have
a problem that Linux driver software will want a NUMA node for every
possible VF.

So we can view this as some generic NVIDIA hack, but really it is a
"coherent memory and SRIOV" hack that should be needed forany device
of this class.

I was hoping the CXL world would fix this stuff, but I was told they
also doubled down and are creating unnecessary ACPI subtlety to avoid
needing dynamic NUMA nodes in Linux.. Sigh.

Jason



Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes

2023-09-25 Thread Jason Gunthorpe
On Mon, Sep 25, 2023 at 03:53:51PM +0100, Jonathan Cameron wrote:
> On Mon, 25 Sep 2023 11:03:28 -0300
> Jason Gunthorpe  wrote:
> 
> > On Mon, Sep 25, 2023 at 02:54:40PM +0100, Jonathan Cameron wrote:
> > 
> > > Possible the ASWG folk would say this is fine and I'm reading too much
> > > into the spec, but I'd definitely suggest asking them via the appropriate
> > > path, or throwing in a code first proposal for a comment on this special
> > > case and see what response you get - my guess is it will be 'fix Linux' :(
> > 
> > The goal here is for qemu to emulate what the bare metal environment
> > is doing.
> > 
> > There may be a legitimate question if what the bare metal FW has done
> > is legitimate (though let's face it, there are lots of creative ACPI
> > things around), but I don't quite see how this is a qemu question?
> > 
> > Unless you are taking the position that qemu should not emulate this
> > HW?
> 
> Ok. I'd failed to register that the bare metal code was doing this though
> with hindsight I guess that is obvious. Though without more info or
> a bare metal example being given its also possible the BIOS was doing
> enumeration etc (like CXL does for all < 2.0 devices) and hence was
> building SRAT with the necessary memory ranges in place - even if the
> driver then does the hot add dance later.

Ankit, maybe you can share some relevant ACPI dumps from the physical
hardware and explain how this compares?

> That's dubious and likely to break at some point unless the spec
> comprehends this use case, but meh, so are lots of other things and
> the hardware vendor gets to pick up the pieces and deal with grumpy
> customers.

Yes.

> I don't currently see this as a safe solution for the proposed other
> use cases however that are virtualization only.

So, how should that translate into a command line experience? Sounds
like the broad concept is general but this actual specific HW is not?

Thanks,
Jason



Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes

2023-09-25 Thread Jason Gunthorpe
On Mon, Sep 25, 2023 at 02:54:40PM +0100, Jonathan Cameron wrote:

> Possible the ASWG folk would say this is fine and I'm reading too much into
> the spec, but I'd definitely suggest asking them via the appropriate path,
> or throwing in a code first proposal for a comment on this special case and
> see what response you get - my guess is it will be 'fix Linux' :(

The goal here is for qemu to emulate what the bare metal environment
is doing.

There may be a legitimate question if what the bare metal FW has done
is legitimate (though let's face it, there are lots of creative ACPI
things around), but I don't quite see how this is a qemu question?

Unless you are taking the position that qemu should not emulate this
HW?

Jason



Re: [PATCH v1 15/22] Add iommufd configure option

2023-09-20 Thread Jason Gunthorpe
On Wed, Sep 20, 2023 at 12:17:24PM -0600, Alex Williamson wrote:

> > The iommufd design requires one open of the /dev/iommu to be shared
> > across all the vfios.
> 
> "requires"?  It's certainly of limited value to have multiple iommufd
> instances rather than create multiple address spaces within a single
> iommufd, but what exactly precludes an iommufd per device if QEMU, or
> any other userspace so desired?  Thanks,

From the kernel side, "requires" is too strong I suppose

Not sure about these qemu patches though?

Jason



Re: [PATCH v1 15/22] Add iommufd configure option

2023-09-20 Thread Jason Gunthorpe
On Wed, Sep 20, 2023 at 12:01:42PM -0600, Alex Williamson wrote:
> On Wed, 20 Sep 2023 03:42:20 +
> "Duan, Zhenzhong"  wrote:
> 
> > >-Original Message-
> > >From: Cédric Le Goater 
> > >Sent: Wednesday, September 20, 2023 1:08 AM
> > >Subject: Re: [PATCH v1 15/22] Add iommufd configure option
> > >
> > >On 8/30/23 12:37, Zhenzhong Duan wrote:  
> > >> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
> > >> iommufd support, enabled by default.  
> > >
> > >Why would someone want to disable support at compile time ? It might  
> > 
> > For those users who only want to support legacy container feature?
> > Let me know if you still prefer to drop this patch, I'm fine with that.
> > 
> > >have been useful for dev but now QEMU should self-adjust at runtime
> > >depending only on the host capabilities AFAIUI. Am I missing something ?  
> > 
> > IOMMUFD doesn't support all features of legacy container, so QEMU
> > doesn't self-adjust at runtime by checking if host supports IOMMUFD.
> > We need to specify it explicitly to use IOMMUFD as below:
> > 
> > -object iommufd,id=iommufd0
> > -device vfio-pci,host=:02:00.0,iommufd=iommufd0
> 
> There's an important point here that maybe we've let slip for too long.
> Laine had asked in an internal forum whether the switch to IOMMUFD was
> visible to the guest.  I replied that it wasn't, but this note about
> IOMMUFD vs container features jogged my memory that I think we still
> lack p2p support with IOMMUFD, ie. IOMMU mapping of device MMIO.  It
> seemed like there was something else too, but I don't recall without
> some research.

I think p2p is the only guest visible one.

I still expect to solve it :\

> Ideally we'd have feature parity and libvirt could simply use the
> native IOMMUFD interface whenever both the kernel and QEMU support it.
> 
> Without that parity, when does libvirt decide to use IOMMUFD?
> 
> How would libvirt know if some future IOMMUFD does have parity?

At this point I think it is reasonable that iommufd is explicitly
opted into.

The next step would be automatic for single PCI device VMs (p2p is not
relevant)

The final step would be automatic if kernel supports P2P. I expect
libvirt will be able to detect this from an open'd /dev/iommu.

Jason



Re: [PATCH v1 15/22] Add iommufd configure option

2023-09-20 Thread Jason Gunthorpe
On Wed, Sep 20, 2023 at 07:37:53PM +0200, Eric Auger wrote:

> >> qemu will typically not be able to
> >> self-open /dev/iommufd as it is root-only.
> >
> > I don't understand, we open multiple fds to KVM devices. This is the
> > same.
> Actually qemu opens the /dev/iommu in case no fd is passed along with
> the iommufd object. This is done in
> [PATCH v1 16/22] backends/iommufd: Introduce the iommufd object, in
> 
> iommufd_backend_connect(). I don't understand either.

The char dev node is root-only so this automatic behavior is fine
but not useful if qemu is running in a sandbox.

I'm not sure what "multiple fds to KVM devices" means, I don't know
anything about kvm devices..

The iommufd design requires one open of the /dev/iommu to be shared
across all the vfios.

Jason



Re: [PATCH v1 17/22] util/char_dev: Add open_cdev()

2023-09-20 Thread Jason Gunthorpe
On Wed, Sep 20, 2023 at 01:39:02PM +0100, Daniel P. Berrangé wrote:

> > diff --git a/util/chardev_open.c b/util/chardev_open.c
> > new file mode 100644
> > index 00..d03e415131
> > --- /dev/null
> > +++ b/util/chardev_open.c
> > @@ -0,0 +1,61 @@
> > +/*
> > + * Copyright (C) 2023 Intel Corporation.
> > + * Copyright (c) 2019, Mellanox Technologies. All rights reserved.
> > + *
> > + * Authors: Yi Liu 
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2.  See
> > + * the COPYING file in the top-level directory.
> > + *
> > + * Copied from
> > + * https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
> > + *
> > + */
> 
> Since this is GPL-2.0-only, IMHO it would be preferrable to keep it
> out of the util/ directory, as we're aiming to not add further 2.0
> only code, except for specific subdirs. This only appears to be used
> by code under hw/vfio/, whcih is one of the dirs still permitting
> 2.0-only code. So I think better to keep this file where it is used.

The copyright comment above is not fully accurate.

The original code is under the "OpenIB" dual license, you can choose
to take it using the OpenIB BSD license text:

 *  Redistribution and use in source and binary forms, with or
 *  without modification, are permitted provided that the following
 *  conditions are met:
 *
 *  - Redistributions of source code must retain the above
 *copyright notice, this list of conditions and the following
 *disclaimer.
 *
 *  - Redistributions in binary form must reproduce the above
 *copyright notice, this list of conditions and the following
 *disclaimer in the documentation and/or other materials
 *provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.

And drop reference to GPL if that is what qemu desires.

Jason



Re: [PATCH v1 15/22] Add iommufd configure option

2023-09-20 Thread Jason Gunthorpe
On Wed, Sep 20, 2023 at 02:19:42PM +0200, Cédric Le Goater wrote:
> On 9/20/23 05:42, Duan, Zhenzhong wrote:
> > 
> > 
> > > -Original Message-
> > > From: Cédric Le Goater 
> > > Sent: Wednesday, September 20, 2023 1:08 AM
> > > Subject: Re: [PATCH v1 15/22] Add iommufd configure option
> > > 
> > > On 8/30/23 12:37, Zhenzhong Duan wrote:
> > > > This adds "--enable-iommufd/--disable-iommufd" to enable or disable
> > > > iommufd support, enabled by default.
> > > 
> > > Why would someone want to disable support at compile time ? It might
> > 
> > For those users who only want to support legacy container feature?
> > Let me know if you still prefer to drop this patch, I'm fine with that.
> 
> I think it is too early.
> 
> > > have been useful for dev but now QEMU should self-adjust at runtime
> > > depending only on the host capabilities AFAIUI. Am I missing something ?
> > 
> > IOMMUFD doesn't support all features of legacy container, so QEMU
> > doesn't self-adjust at runtime by checking if host supports IOMMUFD.
> > We need to specify it explicitly to use IOMMUFD as below:
> > 
> >  -object iommufd,id=iommufd0
> >  -device vfio-pci,host=:02:00.0,iommufd=iommufd0
> 
> OK. I am not sure this is the correct interface yet. At first glance,
> I wouldn't introduce a new object for a simple backend depending on a
> kernel interface. I would tend to prefer a "iommu-something" property
> of the vfio-pci device with string values: "legacy", "iommufd", "default"
> and define the various interfaces (the ops you proposed) for each
> depending on the user preference and the capabilities of the host and
> possibly the device.

I think the idea came from Alex? The major point is to be able to have
libvirt open /dev/iommufd and FD pass it into qemu and then share that
single FD across all VFIOs. qemu will typically not be able to
self-open /dev/iommufd as it is root-only.

So the object is not exactly for the backend, the object is for the
file descriptor.

Adding a legacy/iommufd option to the vfio-pci device string doesn't
address these needs.
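Concretely, the intended usage is libvirt opening /dev/iommu itself and passing the descriptor in, with every vfio-pci device pointing at the same iommufd object; something along these lines (the fd number and BDFs are illustrative):

  -object iommufd,id=iommufd0,fd=23 \
  -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0 \
  -device vfio-pci,host=0000:02:00.1,iommufd=iommufd0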

Jason



Re: [PATCH v1 15/22] Add iommufd configure option

2023-09-20 Thread Jason Gunthorpe
On Wed, Sep 20, 2023 at 02:01:39PM +0100, Daniel P. Berrangé wrote:

> Assuming we must have the exact same FD used for all vfio-pci devices,
> then using -object iommufd is the least worst way to get that FD
> injected into QEMU from libvirt. 

Yes, same FD. It is a shared resource.

Jason



Re: [PATCH v1 00/22] vfio: Adopt iommufd

2023-09-18 Thread Jason Gunthorpe
On Mon, Sep 18, 2023 at 02:23:48PM +0200, Cédric Le Goater wrote:
> On 9/18/23 13:51, Jason Gunthorpe wrote:
> > On Fri, Sep 15, 2023 at 02:42:48PM +0200, Cédric Le Goater wrote:
> > > On 8/30/23 12:37, Zhenzhong Duan wrote:
> > > > Hi All,
> > > > 
> > > > As the kernel side iommufd cdev and hot reset feature have been queued,
> > > > also hwpt alloc has been added in Jason's for_next branch [1], I'd like
> > > > to update a new version matching kernel side update and with rfc flag
> > > > removed. Qemu code can be found at [2], look forward to more comments!
> > > 
> > > FYI, I have started cleaning up the VFIO support in QEMU PPC. First
> > > is the removal of nvlink2, which was dropped from the kernel 2.5 years
> > > ago. Next is probably removal of all the PPC bits in VFIO. Code is
> > > bitrotting and AFAICT VFIO has been broken on these platforms since
> > > 5.18 or so.
> > 
> > It was fixed since then - at least one company (not IBM) still cares
> > about vfio on ppc, though I think it is for a DPDK use case not VFIO.
> 
> Indeed.
> I just checked on a POWER9 box running a debian sid (6.4) and device
> assignment of a simple NIC (e1000e) in a ubuntu 23.04 guest worked
> correctly. Using a 6.6-rc1 on the host worked also. One improvement
> would be to reflect in the Kconfig files that CONFIG_IOMMUFD is not
> supported on PPC so that it can not be selected.

When we did this I thought there were other iommu drivers on Power
that did work with VFIO (fsl_pamu specifically), but it turns out that
ppc iommu driver doesn't support VFIO and the VFIO FSL stuff is for
ARM only.

So it could be done...

These days I believe we have the capacity to do the PPC stuff without
making it so special - it would be a lot of work but the road is pretty
clear. At least if qemu wants to remove PPC VFIO support I would not
object.

Jason



Re: [PATCH v1 00/22] vfio: Adopt iommufd

2023-09-18 Thread Jason Gunthorpe
On Fri, Sep 15, 2023 at 02:42:48PM +0200, Cédric Le Goater wrote:
> On 8/30/23 12:37, Zhenzhong Duan wrote:
> > Hi All,
> > 
> > As the kernel side iommufd cdev and hot reset feature have been queued,
> > also hwpt alloc has been added in Jason's for_next branch [1], I'd like
> > to update a new version matching kernel side update and with rfc flag
> > removed. Qemu code can be found at [2], look forward to more comments!
> 
> FYI, I have started cleaning up the VFIO support in QEMU PPC. First
> is the removal of nvlink2, which was dropped from the kernel 2.5 years
> ago. Next is probably removal of all the PPC bits in VFIO. Code is
> bitrotting and AFAICT VFIO has been broken on these platforms since
> 5.18 or so.

It was fixed since then - at least one company (not IBM) still cares
about vfio on ppc, though I think it is for a DPDK use case not VFIO.

Jason



Re: [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend

2023-09-06 Thread Jason Gunthorpe
On Wed, Sep 06, 2023 at 01:09:26PM -0600, Alex Williamson wrote:
> On Wed, 6 Sep 2023 15:10:39 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Aug 30, 2023 at 06:37:53PM +0800, Zhenzhong Duan wrote:
> > > Note the /dev/iommu device may have been pre-opened by a
> > > management tool such as libvirt. This mode is no longer considered
> > > for the legacy backend. So let's remove the "TODO" comment.
> > 
> > Can you show an example of that syntax too?
> 
> Unless you're just looking for something in the commit log, 

Yeah, I was thinking the commit log

> patch 16/ added the following to the qemu help output:
> 
> +#ifdef CONFIG_IOMMUFD
> +``-object iommufd,id=id[,fd=fd]``
> +Creates an iommufd backend which allows control of DMA mapping
> +through the /dev/iommu device.
> +
> +The ``id`` parameter is a unique ID which frontends (such as
> +vfio-pci or vdpa) will use to connect with the iommufd backend.
> +
> +The ``fd`` parameter is an optional pre-opened file descriptor
> +resulting from /dev/iommu opening. Usually the iommufd is shared
> +across all subsystems, bringing the benefit of centralized
> +reference counting.
> +#endif
>  
> > Also, the vfio device should be openable externally as well
> 
> Appears to be added in the very next patch in the series.  Thanks,

Indeed, I got confused because this removed the TODO - that could
reasonably be pushed to the next patch and include a bit more detail
in the commit message

Jason



Re: [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend

2023-09-06 Thread Jason Gunthorpe
On Wed, Aug 30, 2023 at 06:37:53PM +0800, Zhenzhong Duan wrote:
> Note the /dev/iommu device may have been pre-opened by a
> management tool such as libvirt. This mode is no longer considered
> for the legacy backend. So let's remove the "TODO" comment.

Can you show an example of that syntax too?

Also, the vfio device should be openable externally as well

Jason



Re: [PATCH for-8.2 v3 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup()

2023-08-08 Thread Jason Gunthorpe
On Tue, Aug 08, 2023 at 09:23:09AM +0300, Avihai Horon wrote:
> 
> On 07/08/2023 18:53, Cédric Le Goater wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > [ Adding Juan and Peter for their awareness ]
> > 
> > On 8/2/23 10:14, Avihai Horon wrote:
> > > Changing the device state from STOP_COPY to STOP can take time as the
> > > device may need to free resources and do other operations as part of the
> > > transition. Currently, this is done in vfio_save_complete_precopy() and
> > > therefore it is counted in the migration downtime.
> > > 
> > > To avoid this, change the device state from STOP_COPY to STOP in
> > > vfio_save_cleanup(), which is called after migration has completed and
> > > thus is not part of migration downtime.
> > 
> > What bothers me is that this looks like a device specific optimization
> 
> True, currently it helps mlx5, but this change is based on the assumption
> that, in general, VFIO devices are likely to free resources when
> transitioning from STOP_COPY to STOP.
> So I think this is a good change to have in any case.

Yes, I agree with this idea. Kernel drivers should be designed to take
advantage of things like this and defer time consuming work to the
stop arc.

> > and we are losing the error part.
> 
> I don't think we lose the error part.
> AFAIU, the crucial part is transitioning to STOP_COPY and sending the final
> data.
> If that's done successfully, then migration is successful.

Yes.

> The STOP_COPY->STOP transition is done as part of the cleanup flow, after
> the migration is completed -- i.e., failure in it does not affect the
> success of migration.
> Further more, if there is an error in the STOP_COPY->STOP transition, then
> it's reported by vfio_migration_set_state().

If STOP_COPY->STOP fails then the migration should succeed anyhow and
qemu has to FLR the VFIO device to recover it. This probably only
matters if the migration is aborted for some other reason and qemu has
to resume the VM. It will not be able to restore the device to running
until it has been FLRd.

However, qemu can probably ignore this error and eventually if it
tries to go to RUNNING the kernel will report failure and it can do
the FLR at that point. Otherwise on the migration success path qemu
should simply close the VFIO device.
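In QEMU terms the tolerant behaviour described here is roughly the following; this is a sketch around the vfio_migration_set_state() helper mentioned above, with its signature approximated, where the transition failure is reported but not propagated since migration has already completed.

/* Sketch: the STOP_COPY -> STOP transition happens in the cleanup path,
 * where a failure cannot affect the already-completed migration.
 */
static void vfio_save_cleanup(void *opaque)
{
    VFIODevice *vbasedev = opaque;

    /* If this fails the device is left needing a reset (FLR) before it
     * can enter RUNNING again; the error is reported, not propagated. */
    vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
                             VFIO_DEVICE_STATE_ERROR);

    /* free the migration data buffer, trace, etc. (elided) */
}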

Jason



Re: [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration

2023-07-18 Thread Jason Gunthorpe
On Sun, Jul 16, 2023 at 11:15:35AM +0300, Avihai Horon wrote:
> Hi all,
> 
> The first patch in this series adds a small optimization to VFIO
> migration by moving the STOP_COPY->STOP transition to
> vfio_save_cleanup(). Testing with a ConnectX-7 VFIO device showed that
> this can reduce downtime by up to 6%.
> 
> The rest of the series adds P2P support for VFIO migration.
> 
> VFIO migration uAPI defines an optional intermediate P2P quiescent
> state. While in the P2P quiescent state, P2P DMA transactions cannot be
> initiated by the device, but the device can respond to incoming ones.
> Additionally, all outstanding P2P transactions are guaranteed to have
> been completed by the time the device enters this state.

For clarity it is "all outstanding DMA transactions are guaranteed to
have been completed" - the device has no idea if something is P2P or
not.

> The purpose of this state is to support migration of multiple devices
> that are doing P2P transactions between themselves.

s/are/might/

Jason



Re: [PATCH 3/3] vfio/migration: Make VFIO migration non-experimental

2023-06-27 Thread Jason Gunthorpe
On Tue, Jun 27, 2023 at 02:21:55PM +0200, Cédric Le Goater wrote:

> We have a way to run and migrate a machine with a device not supporting
> dirty tracking. Only Hisilicon is in that case today. May be there are
> plans to add dirty tracking support ?

Hisilicon will eventually use Joao's work for IOMMU based tracking,
this is what their HW was designed to do.

Jason



Re: [PATCH 3/3] vfio/migration: Make VFIO migration non-experimental

2023-06-26 Thread Jason Gunthorpe
On Mon, Jun 26, 2023 at 05:26:42PM +0200, Cédric Le Goater wrote:

> Since dirty tracking is a must-have to implement migration support
> for any existing and future VFIO PCI variant driver, anything else
> would be experimental code and we are trying to remove the flag !
> Please correct me if I am wrong.

I don't think we should call it experimental - it is like other qemu
options where we can choose to run qemu in a suboptimal way.

Jason



Re: [PATCH v3 5/5] intel_iommu: Optimize out some unnecessary UNMAP calls

2023-06-08 Thread Jason Gunthorpe
On Thu, Jun 08, 2023 at 03:53:23PM -0400, Peter Xu wrote:
> Though that does look slightly special, because the whole empty UNMAP
> region can be seen as a hole too; not sure when that -ENOENT will be useful
> if qemu should always bypass it anyway.  Indeed not a problem to qemu.

It sounds like it might be good to have a flag to unmap the whole
range regardless of contiguity

Jason



Re: [PATCH v3 5/5] intel_iommu: Optimize out some unnecessary UNMAP calls

2023-06-08 Thread Jason Gunthorpe
On Thu, Jun 08, 2023 at 11:40:55AM -0400, Peter Xu wrote:

> > But if there is the proper locks to prevent a map/unmap race, then
> > there should also be the proper locks to check that there is no map in
> > the first place and avoid the kernel call..
> 
> The problem is IIRC guest iommu driver can do smart things like batching
> invalidations, it means when QEMU gets it from the guest OS it may already
> not matching one mapped objects.

qemu has to fix it. The kernel API is object based, not page
based. You cannot unmap portions of a prior mapping.

I assume for this kind of emulation it is doing 4k objects because
it has no idea what size of mapping the client will use?

> We can definitely lookup every single object and explicitly unmap, but it
> loses partial of the point of batching that guest OS does.  

You don't need every single object, but it would be faster to check
where things are mapped and then call the kernel correctly instead of
trying to iterate with the unmapped results.

Jason



Re: [PATCH v3 5/5] intel_iommu: Optimize out some unnecessary UNMAP calls

2023-06-08 Thread Jason Gunthorpe
On Thu, Jun 08, 2023 at 10:05:08AM -0400, Peter Xu wrote:

> IIUC what VFIO does here is it returns success if unmapping over nothing rather
> than failing like iommufd.  Curious (like JasonW) on why that retval?  I'd
> assume for returning "how much unmapped" we can at least still return 0 for
> nothing.

In iommufd maps are objects, you can only map or unmap entire
objects. The ability to batch unmap objects by specifying a range
that spans many is something that was easy to do and that VFIO had,
but I'm not sure it is actually useful..

So asking to unmap an object that is already known not to be mapped is
actually possibly racy, especially if you consider iommufd's support
for kernel-side IOVA allocation. It should not be done, or if it is
done, with user space locking to protect it.

For VFIO, long long ago, VFIO could unmap IOVA page at a time - ie it
wasn't objects. In this world it made some sense that the unmap would
'succeed' as the end result was unmapped.

> Are you probably suggesting that we can probably handle that in QEMU side
> on -ENOENT here for iommufd only (a question to Yi?).

Yes, this can be done, ENOENT is reliably returned and qemu doesn't
use the kernel-side IOVA allocator.

But if there is the proper locks to prevent a map/unmap race, then
there should also be the proper locks to check that there is no map in
the first place and avoid the kernel call..
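One way to do that is for the vIOMMU emulation to keep its own record of what it actually mapped and only issue ioctls for those objects; a sketch with hypothetical tracking structures (MappedEntry, mapped_list) and an assumed backend unmap helper name:

/* Sketch with hypothetical tracking structures: only unmap objects the
 * emulation knows it mapped, so batched guest invalidations over unmapped
 * IOVA cost no ioctls. */
static void vtd_unmap_known_range(VTDAddressSpace *as, hwaddr iova, hwaddr size)
{
    MappedEntry *m, *m_next;

    QLIST_FOREACH_SAFE(m, &as->mapped_list, next, m_next) {
        if (m->iova >= iova && m->iova + m->size <= iova + size) {
            /* unmap exactly the object that was mapped earlier;
             * iommufd_backend_unmap_dma() name assumed from the backend series */
            iommufd_backend_unmap_dma(as->iommufd, as->ioas_id,
                                      m->iova, m->size);
            QLIST_REMOVE(m, next);
            g_free(m);
        }
    }
}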

Jason



Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls

2023-05-29 Thread Jason Gunthorpe
On Fri, May 26, 2023 at 08:44:29AM +, Liu, Yi L wrote:

> > > >> In fact, the other purpose of this patch is to eliminate noisy error
> > > >> log when we work with IOMMUFD. It looks like the duplicate UNMAP call
> > > >> will fail with IOMMUFD while it always succeeds with the legacy
> > > >> container. This behavior difference leads to the below error log for
> > > >> IOMMUFD:
> > 
> > A dumb question: should IOMMUFD stick to the same behaviour as the legacy
> > container?
> 
> May need to hear from JasonG.  Should IOMMU_IOAS_UNMAP return error or
> success if the iova is not found?

The native iommufd functions will return failure if they could not
unmap anything.

Otherwise they return the number of consecutive bytes unmapped.
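At the uAPI level that contract looks roughly like the following; this reflects my reading of the iommufd interface (the kernel writes the unmapped byte count back into .length and returns ENOENT when nothing in the range was mapped), so verify against <linux/iommufd.h>.

/* Sketch: unmap a range and distinguish "nothing mapped" from real errors. */
struct iommu_ioas_unmap unmap = {
    .size = sizeof(unmap),
    .ioas_id = ioas_id,
    .iova = iova,
    .length = len,
};

if (ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap)) {
    if (errno == ENOENT) {
        /* no mapping covered the requested range */
    }
    return -errno;
}
/* on success, unmap.length holds the number of bytes actually unmapped */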

The VFIO emulation functions should do whatever VFIO does, is there a
mistake there?

Jason



Re: Multiple vIOMMU instance support in QEMU?

2023-05-18 Thread Jason Gunthorpe
On Thu, May 18, 2023 at 03:45:24PM -0400, Peter Xu wrote:
> On Thu, May 18, 2023 at 11:56:46AM -0300, Jason Gunthorpe wrote:
> > On Thu, May 18, 2023 at 10:16:24AM -0400, Peter Xu wrote:
> > 
> > > What you mentioned above makes sense to me from the POV that 1 vIOMMU may
> > > not suffice, but that's at least totally new area to me because I never
> > > used >1 IOMMUs even bare metal (excluding the case where I'm aware that
> > > e.g. a GPU could have its own IOMMU-like dma translator).
> > 
> > Even x86 systems are multi-iommu, one iommu per physical CPU socket.
> 
> I tried to look at a 2-node system on hand and I indeed got two dmars:
> 
> [4.444788] DMAR: dmar0: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
> [4.459673] DMAR: dmar1: reg_base_addr c7ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
> 
> Though they do not seem to be all parallel on attaching devices.  E.g.,
> most of the devices on this host are attached to dmar1, while there're only
> two devices attached to dmar0:

Yeah, I expect it has to do with physical topology. PCIe devices
physically connected to each socket should use the socket local iommu
and the socket local caches.

ie it would be foolish to take an IO in socket A and then forward it to
socket B to perform IOMMU translation, then forward it back to socket A
to land in memory.

> > I'm not sure how they model this though - Kevin do you know? Do we get
> > multiple iommu instances in Linux or is all the broadcasting of
> > invalidates and sharing of tables hidden?
> > 
> > > What's the system layout of your multi-vIOMMU world?  Is there still a
> > > centric vIOMMU, or multi-vIOMMUs can run fully in parallel, so that e.g.
> > > we can have DEV1,DEV2 under vIOMMU1 and DEV3,DEV4 under vIOMMU2?
> > 
> > Just like physical, each viommu is parallel and independent. Each has
> > its own caches, ASIDs, DIDs/etc and thus invalidation domains.
> > 
The separated caches are the motivating reason to do this as something
> > like vCMDQ is a direct command channel for invalidations to only the
> > caches of a single IOMMU block.
> 
> From cache invalidation pov, shouldn't the best be per-device granule (like
> dev-iotlb in VT-d? No idea for ARM)?

There are many caches and different cache tag schemes in an iommu. All
of them are local to the IOMMU block.

Consider where we might have a single vDID but the devices using that
DID are spread across two physical IOMMUs. When the VM asks to
invalidate the vDID the system has to generate two physical pDID
invalidations.

This can't be done without a software mediation layer in the VMM.

The better solution is to make the pDID and vDID 1:1 so the VM itself
replicates the invalidations. The VM has better knowledge of when
replication is needed so it is overall more efficient.

> I see that Intel is already copied here (at least Yi and Kevin) so I assume
> there're already some kind of synchronizations on multi-vIOMMU vs recent
> works on Intel side, which is definitely nice and can avoid work conflicts.

I actually don't know that.. Intel sees multiple DMAR blocks in SW and
they have kernel level replication of invalidation.. Intel doesn't
have a HW fast path yet so they can rely on mediation to fix it. Thus
I expect there is no HW replication of invalidations here. Kevin?

Remember the VFIO API hides all of this, when you change the VFIO
container it automatically generates all requires invalidations in the
kernel.

I also heard AMD has a HW fast path and also multi-iommu but I don't really
know the details.

Jason



Re: Multiple vIOMMU instance support in QEMU?

2023-05-18 Thread Jason Gunthorpe
On Thu, May 18, 2023 at 10:16:24AM -0400, Peter Xu wrote:

> What you mentioned above makes sense to me from the POV that 1 vIOMMU may
> not suffice, but that's at least totally new area to me because I never
> used >1 IOMMUs even bare metal (excluding the case where I'm aware that
> e.g. a GPU could have its own IOMMU-like dma translator).

Even x86 systems are multi-iommu, one iommu per physical CPU socket.

I'm not sure how they model this though - Kevin do you know? Do we get
multiple iommu instances in Linux or is all the broadcasting of
invalidates and sharing of tables hidden?

> What's the system layout of your multi-vIOMMU world?  Is there still a
> centric vIOMMU, or multi-vIOMMUs can run fully in parallel, so that e.g. we
> can have DEV1,DEV2 under vIOMMU1 and DEV3,DEV4 under vIOMMU2?

Just like physical, each viommu is parallel and independent. Each has
its own caches, ASIDs, DIDs/etc and thus invalidation domains.

The separated caches are the motivating reason to do this as something
like vCMDQ is a direct command channel for invalidations to only the
caches of a single IOMMU block.

> Is it a common hardware layout or nVidia specific?

I think it is pretty normal, you have multiple copies of the IOMMU and
its caches for physical reasons.

The only choice is if the platform HW somehow routes invalidations to
all IOMMUs or requires SW to route/replicate invalidates.

ARM's IP seems to be designed toward the latter so I expect it is
going to be common on ARM.

Jason



Re: [PATCH v11 05/11] vfio/migration: Block multiple devices migration

2023-05-16 Thread Jason Gunthorpe
On Tue, May 16, 2023 at 02:35:21PM +, Shameerali Kolothum Thodi wrote:

> Ok. Got it. So it depends on what SMMU does for that mapping and is not
> related to migration per se and has the potential to crash the system if 
> SMMU go ahead with that memory access. Isn't it a more generic problem
> then when we have multiple devices attached to the VM? 

It is, and I am hoping to solve it when I get P2P support into iommufd

> I need to check if there is anything in SMMU spec that forbids this
> access.

There isn't, it is up to the integration how it handles 'hairpin'
traffic. A good implementation will work correctly as Intel and other
CPUs do. A very bad implementation will crash. A medium one is to return
error TLPs.

Jason



Re: [PATCH v11 05/11] vfio/migration: Block multiple devices migration

2023-05-16 Thread Jason Gunthorpe
On Tue, May 16, 2023 at 01:57:22PM +, Shameerali Kolothum Thodi wrote:

> > What happens on your platform if a guest tries to do P2P? Does the
> > platform crash?
> 
> I am not sure. Since the devices are behind SMMU, I was under the assumption
> that we do have the guarantee of isolation here(grouping).

That is not the case. The SMMU will contain an IOPTE with the CPU
address of the P2P memory and when the SMMU executes that operation
the system does .. ??

Hopefully it responds back with error and keeps going, or it
crashes..

Jason



Re: [PATCH v11 05/11] vfio/migration: Block multiple devices migration

2023-05-16 Thread Jason Gunthorpe
On Tue, May 16, 2023 at 10:03:54AM +, Shameerali Kolothum Thodi wrote:

> > Currently VFIO migration doesn't implement some kind of intermediate
> > quiescent state in which P2P DMAs are quiesced before stopping or
> > running the device. This can cause problems in multi-device migration
> > where the devices are doing P2P DMAs, since the devices are not stopped
> > together at the same time.
> > 
> > Until such support is added, block migration of multiple devices.
> 
> Missed this one. Currently this blocks even if the attached devices are not
> capable of P2P DMAs. eg; HiSilicon ACC devices. These are integrated end point
> devices without any P2P capability between them. Is it Ok to check for
> VFIO_MIGRATION_P2P flag and allow if the devices are not supporting that?

Lacking VFIO_MIGRATION_P2P doesn't mean the device is incapable of
P2P, it means the migration can't support P2P.

We'd need some kind of new flag to check and such devices should be
blocked from creating P2P mappings. Basically we don't currently
fully support devices that are incapable of P2P operations.

What happens on your platform if a guest tries to do P2P? Does the
platform crash?

Jason



Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support

2023-03-06 Thread Jason Gunthorpe
On Wed, Mar 01, 2023 at 03:39:17PM -0700, Alex Williamson wrote:
> On Wed, 1 Mar 2023 17:12:51 -0400
> Jason Gunthorpe  wrote:
> 
> > On Wed, Mar 01, 2023 at 12:55:59PM -0700, Alex Williamson wrote:
> > 
> > > So it seems like what we need here is both a preface buffer size and a
> > > target device latency.  The QEMU pre-copy algorithm should factor both
> > > the remaining data size and the device latency into deciding when to
> > > transition to stop-copy, thereby allowing the device to feed actually
> > > relevant data into the algorithm rather than dictate its behavior.  
> > 
> > I don't know that we can realistically estimate startup latency,
> > especially have the sender estimate latency on the receiver..
> 
> Knowing that the target device is compatible with the source is a point
> towards making an educated guess.
> 
> > I feel like trying to overlap the device start up with the STOP phase
> > is an unnecessary optimization? How do you see it benefiting?
> 
> If we can't guarantee that there's some time difference between sending
> initial bytes immediately at the end of pre-copy vs immediately at the
> beginning of stop-copy, does that mean any handling of initial bytes is
> an unnecessary optimization?

Sure if the device doesn't implement an initial_bytes startup phase
then it is all pointless, but probably those devices should return 0
for initial_bytes? If we see initial_bytes and assume it indicates a
startup phase, why not do it?

> I'm imagining that completing initial bytes triggers some
> initialization sequence in the target host driver which runs in
> parallel to the remaining data stream, so in practice, even if sent at
> the beginning of stop-copy, the target device gets a head start.

It isn't parallel in mlx5. The load operation of the initial bytes on
the receiver will execute the load command and that command will take
some amount of time sort of proportional to how much data is in the
device. IIRC the mlx5 VFIO driver will block the read until this finishes.

It is convoluted but it ultimately is allocating (potentially a lot of)
pages in the hypervisor kernel so the time predictability is not very
good.

Other device types we are looking at might do network connections at
this step - eg a storage device might open a network connection to its back
end. This could be unpredictably long in degenerate cases.

> > I've been thinking of this from the perspective that we should always
> > ensure device startup is completed, it is time that has to be paid,
> > why pay it during STOP?
> 
> Creating a policy for QEMU to send initial bytes in a given phase
> doesn't ensure startup is complete.  There's no guaranteed time
> difference between sending that data and the beginning of stop-copy.

As I've said, to really do a good job here we want to have the sender
wait until the receiver completes startup, and not just treat it as a
unidirectional byte-stream. That isn't this patch..

> QEMU is trying to achieve a downtime goal, where it estimates network
> bandwidth to get a data size threshold, and then polls devices for
> remaining data.  That downtime goal might exceed the startup latency of
> the target device anyway, where it's then the operators choice to pay
> that time in stop-copy, or stalled on the target.

If you are saying there should be a policy flag ('optimize for total
migration time' vs 'optimize for minimum downtime') that seems
reasonable, though I wonder who would pick the first option.
 
> But if we actually want to ensure startup of the target is complete,
> then drivers should be able to return both data size and estimated time
> for the target device to initialize.  That time estimate should be
> updated by the driver based on if/when initial_bytes is drained.  The
> decision whether to continue iterating pre-copy would then be based on
> both the maximum remaining device startup time and the calculated time
> based on remaining data size.

That seems complicated. Why not just wait for the other side to
acknowledge it has started the device? Then we aren't trying to guess.

AFAIK this sort of happens implicitly in this patch because once
initial bytes is pushed the next data that follows it will block on
the pending load and the single socket will backpressure until the
load is done. Horrible, yes, but it is where qemu is at. multi-fd is
really important :)

Jason



Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support

2023-03-01 Thread Jason Gunthorpe
On Wed, Mar 01, 2023 at 12:55:59PM -0700, Alex Williamson wrote:

> So it seems like what we need here is both a preface buffer size and a
> target device latency.  The QEMU pre-copy algorithm should factor both
> the remaining data size and the device latency into deciding when to
> transition to stop-copy, thereby allowing the device to feed actually
> relevant data into the algorithm rather than dictate its behavior.

I don't know that we can realistically estimate startup latency,
especially have the sender estimate latency on the receiver..

I feel like trying to overlap the device start up with the STOP phase
is an unnecessary optimization? How do you see it benefiting?

I've been thinking of this from the perspective that we should always
ensure device startup is completed, it is time that has to be paid,
why pay it during STOP?

Jason



Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support

2023-02-27 Thread Jason Gunthorpe
On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:

> But we have no requirement to send all init_bytes before stop-copy.
> This is a hack to achieve a theoretical benefit that a driver might be
> able to improve the latency on the target by completing another
> iteration.

I think this is another half-step at this point..

The goal is to not stop the VM until the target VFIO driver has
completed loading initial_bytes.

This signals that the time consuming pre-setup is completed in the
device and we don't have to use downtime to do that work.

We've measured this in our devices and the time-shift can be
significant, like whole seconds of time removed from the downtime
period.

Stopping the VM before this pre-setup is done is simply extending the
stopped VM downtime.

Really what we want is to have the far side acknowledge that
initial_bytes has completed loading.

As a reminder, what mlx5 is doing here with precopy is time-shifting work,
not data. We want to put expensive work (ie time) into the period when
the VM is still running and have less downtime.

This challenges the assumption built into qemu that all data has equal
time and it can estimate downtime simply by scaling the estimated
data. We have a data-size independent time component to deal with as
well.

Jason



Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU

2023-02-24 Thread Jason Gunthorpe
On Fri, Feb 24, 2023 at 12:53:26PM +, Joao Martins wrote:
> > But reading the code this ::bypass_iommu (new to me) apparently tells that
> > vIOMMU is bypassed or not for the PCI devices all the way to avoiding
> > enumerating in the IVRS/DMAR ACPI tables. And I see VFIO double-checks 
> > whether
> > PCI device is within the IOMMU address space (or bypassed) prior to DMA 
> > maps and
> > such.
> > 
> > You can see from the other email that all of the other options in my head 
> > were
> > either bit inconvenient or risky. I wasn't aware of this option for what is
> > worth -- much simpler, should work!
> >
> 
> I say *should*, but on a second thought interrupt remapping may still be
> required to one of these devices that are IOMMU-bypassed. Say to put affinities
> to vcpus above 255? I was trying this out with more than 255 vcpus with a couple
> VFs and at a first glance these VFs fail to probe (these are CX6 VFs).

It is pretty bizarre, but the Intel iommu driver is responsible for
installing the interrupt remapping irq driver on the devices.

So if there is no iommu driver bound then there won't be any interrupt
remapping capability for the device even if the interrupt remapping HW
is otherwise setup.

The only reason Avihai is touching this is to try and keep the
interrupt remapping emulation usable, we could certainly punt on that
for now if it looks too ugly.

Jason



Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU

2023-02-23 Thread Jason Gunthorpe
On Thu, Feb 23, 2023 at 03:33:09PM -0700, Alex Williamson wrote:
> On Thu, 23 Feb 2023 16:55:54 -0400
> Jason Gunthorpe  wrote:
> 
> > On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
> > > > #2 is the presumption that the guest is using an identity map.  
> > > 
> > > This is a dangerous assumption.
> > >   
> > > > > I'd think the only viable fallback if the vIOMMU doesn't report its 
> > > > > max
> > > > > IOVA is the full 64-bit address space, otherwise it seems like we need
> > > > > to add a migration blocker.
> > > > 
> > > > This is basically saying vIOMMU doesn't work with migration, and we've
> > > > heard that this isn't OK. There are cases where vIOMMU is on but the
> > > > guest always uses identity maps. eg for virtual interrupt remapping.  
> > > 
> > > Yes, the vIOMMU can be automatically added to a VM when we exceed 255
> > > vCPUs, but I don't see how we can therefore deduce anything about the
> > > usage mode of the vIOMMU.
> > 
> > We just lose optimizations. Any mappings that are established outside
> > the dirty tracking range are permanently dirty. So at worst the guest
> > can block migration by establishing bad mappings. It is not exactly
> > production quality but it is still useful for a closed environment
> > with known guest configurations.
> 
> That doesn't seem to be what happens in this series, 

Seems like something is missed then

> nor does it really make sense to me that userspace would simply
> decide to truncate the dirty tracking ranges array.

Who else would do it?

> > No, 2**64 is too big a number to be reasonable.
> 
> So what are the actual restrictions were dealing with here?  I think it
> would help us collaborate on a solution if we didn't have these device
> specific restrictions sprinkled through the base implementation.

Hmm? It was always like this, the driver gets to decide if it accepts
the proposed tracking ranges or not. Given how the implementation has
to work there is no device that could do 2**64...

At least for mlx5 it is in the multi-TB range. Enough for physical
memory on any real server.

> > Ideally we'd work it the other way and tell the vIOMMU that the vHW
> > only supports a limited number of address bits for the translation, eg
> > through the ACPI tables. Then the dirty tracking could safely cover
> > the larger of all system memory or the limited IOVA address space.
> 
> Why can't we do that?  Hotplug is an obvious issue, but maybe it's not
> vHW telling the vIOMMU a restriction, maybe it's a QEMU machine or
> vIOMMU option and if it's not set to something the device can support,
> migration is blocked.

I don't know, maybe we should if we can.

> > Or even better figure out how to get interrupt remapping without IOMMU
> > support :\
> 
> -machine q35,default_bus_bypass_iommu=on,kernel-irqchip=split \
> -device intel-iommu,caching-mode=on,intremap=on

Joao?

If this works lets just block migration if the vIOMMU is turned on..

Jason



Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU

2023-02-23 Thread Jason Gunthorpe
On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
> > #2 is the presumption that the guest is using an identity map.
> 
> This is a dangerous assumption.
> 
> > > I'd think the only viable fallback if the vIOMMU doesn't report its max
> > > IOVA is the full 64-bit address space, otherwise it seems like we need
> > > to add a migration blocker.  
> > 
> > This is basically saying vIOMMU doesn't work with migration, and we've
> > heard that this isn't OK. There are cases where vIOMMU is on but the
> > guest always uses identity maps. eg for virtual interrupt remapping.
> 
> Yes, the vIOMMU can be automatically added to a VM when we exceed 255
> vCPUs, but I don't see how we can therefore deduce anything about the
> usage mode of the vIOMMU.  

We just lose optimizations. Any mappings that are established outside
the dirty tracking range are permanently dirty. So at worst the guest
can block migration by establishing bad mappings. It is not exactly
production quality but it is still useful for a closed environment
with known guest configurations.

> nested assignment, ie. userspace drivers running within the guest,
> where making assumptions about the IOVA extents of the userspace driver
> seems dangerous.
>
> Let's backup though, if a device doesn't support the full address width
> of the platform, it's the responsibility of the device driver to
> implement a DMA mask such that the device is never asked to DMA outside
> of its address space support.  Therefore how could a device ever dirty
> pages outside of its own limitations?

The device always supports the full address space. We can't enforce
any kind of limit on the VM

It just can't dirty track it all.

> Isn't it reasonable to require that a device support dirty tracking for
> the entire extent if its DMA address width in order to support this
> feature?

No, 2**64 is too big a number to be reasonable.

Ideally we'd work it the other way and tell the vIOMMU that the vHW
only supports a limited number of address bits for the translation, eg
through the ACPI tables. Then the dirty tracking could safely cover
the larger of all system memory or the limited IOVA address space.

Or even better figure out how to get interrupt remapping without IOMMU
support :\

Jason



Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop

2023-02-23 Thread Jason Gunthorpe
On Thu, Feb 23, 2023 at 01:16:40PM -0700, Alex Williamson wrote:
> On Thu, 23 Feb 2023 15:30:28 -0400
> Jason Gunthorpe  wrote:
> 
> > On Thu, Feb 23, 2023 at 12:27:23PM -0700, Alex Williamson wrote:
> > > So again, I think I'm just looking for a better comment that doesn't
> > > add FUD to the reasoning behind switching to a single range,   
> > 
> > It isn't a single range, it is a single page of ranges, right?
> 
> Exceeding a single page of ranges is the inflection point at which we
> switch to a single range.

Oh, that isn't what it should do - it should cut it back to fit in a
page..

Jason



Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop

2023-02-23 Thread Jason Gunthorpe
On Thu, Feb 23, 2023 at 12:27:23PM -0700, Alex Williamson wrote:
> So again, I think I'm just looking for a better comment that doesn't
> add FUD to the reasoning behind switching to a single range, 

It isn't a single range, it is a single page of ranges, right?

The comment should say

"Keep the implementation simple and use at most a PAGE_SIZE of ranges
because the kernel is guaranteed to be able to parse that"
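
As a concrete illustration, here is a minimal sketch (not the actual QEMU
patch; the types and the smallest-gap merge policy are only assumptions) of
what "cut it back to fit in a page" could look like: merge neighbouring
ranges until the array fits in one host page.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct dirty_range {
    uint64_t iova;
    uint64_t length;
};

/*
 * Reduce a sorted, non-overlapping range list so the array fits in one
 * host page, which is all the kernel guarantees it can parse.  Always
 * merge the two neighbours with the smallest gap so the merged set covers
 * as little extra IOVA space as possible.
 */
static size_t fit_ranges_in_page(struct dirty_range *r, size_t nr,
                                 size_t page_size)
{
    size_t max_ranges = page_size / sizeof(*r);

    while (nr > max_ranges) {
        size_t best = 0;
        uint64_t best_gap = UINT64_MAX;

        for (size_t i = 0; i + 1 < nr; i++) {
            uint64_t gap = r[i + 1].iova - (r[i].iova + r[i].length);

            if (gap < best_gap) {
                best_gap = gap;
                best = i;
            }
        }

        /* Merge r[best] and r[best + 1] into one covering range */
        r[best].length = r[best + 1].iova + r[best + 1].length - r[best].iova;
        memmove(&r[best + 1], &r[best + 2], (nr - best - 2) * sizeof(*r));
        nr--;
    }
    return nr;
}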

Jason



Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU

2023-02-22 Thread Jason Gunthorpe
On Wed, Feb 22, 2023 at 04:34:39PM -0700, Alex Williamson wrote:
> > +/*
> > + * With vIOMMU we try to track the entire IOVA space. As the IOVA space can
> > + * be rather big, devices might not be able to track it due to HW
> > + * limitations. In that case:
> > + * (1) Retry tracking a smaller part of the IOVA space.
> > + * (2) Retry tracking a range in the size of the physical memory.
> 
> This looks really sketchy, why do we think there's a "good enough"
> value here?  If we get it wrong, the device potentially has access to
> IOVA space that we're not tracking, right?

The idea was the untracked range becomes permanently dirty, so at
worst this means the migration never converges.
 
#2 is the presumption that the guest is using an identity map.
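
Spelled out, a rough sketch of that retry order (the track callback and the
halving step are illustrative stand-ins, not the actual patch or any kernel
API):

#include <stdint.h>

typedef int (*track_fn)(uint64_t iova, uint64_t length);

/*
 * Illustrative only: try progressively smaller tracking ranges.  Whatever
 * ends up outside the tracked range is simply treated as permanently
 * dirty, so at worst the migration never converges.
 */
static int start_tracking_with_fallback(track_fn track,
                                        uint64_t viommu_iova_top,
                                        uint64_t guest_phys_top)
{
    /* (0) The whole vIOMMU IOVA space */
    if (track(0, viommu_iova_top) == 0) {
        return 0;
    }
    /* (1) Retry a smaller part of the IOVA space */
    if (track(0, viommu_iova_top / 2) == 0) {
        return 0;
    }
    /*
     * (2) Retry a range the size of guest physical memory, betting on an
     *     identity-mapped guest (e.g. vIOMMU present only for intremap)
     */
    return track(0, guest_phys_top);
}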

> I'd think the only viable fallback if the vIOMMU doesn't report its max
> IOVA is the full 64-bit address space, otherwise it seems like we need
> to add a migration blocker.

This is basically saying vIOMMU doesn't work with migration, and we've
heard that this isn't OK. There are cases where vIOMMU is on but the
guest always uses identity maps. eg for virtual interrupt remapping.

We also have a future problem in that nested translation is incompatible
with device dirty tracking..

Jason



Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop

2023-02-22 Thread Jason Gunthorpe
On Wed, Feb 22, 2023 at 03:40:43PM -0700, Alex Williamson wrote:
> > +/*
> > + * DMA logging uAPI guarantees to support at least num_ranges that fits into
> > + * a single host kernel page. To be on the safe side, use this as a limit
> > + * from which to merge to a single range.
> > + */
> > +max_ranges = qemu_real_host_page_size() / sizeof(*ranges);
> > +cur_ranges = iova_tree_nnodes(container->mappings);
> > +control->num_ranges = (cur_ranges <= max_ranges) ? cur_ranges : 1;
> 
> This makes me suspicious that we're implementing to the characteristics
> of a specific device rather than strictly to the vfio migration API.
> Are we just trying to avoid the error handling to support the try and
> fall back to a single range behavior?

This was what we agreed to when making the kernel patches. Userspace
is restricted to send one page of range list to the kernel, and the
kernel will always adjust that to whatever smaller list the device needs.

We added this limit only because we don't want to have a way for
userspace to consume a lot of kernel memory.

See LOG_MAX_RANGES in vfio_main.c

If qemu is in viommu mode and it has a huge number of ranges then it must
cut it down before passing things to the kernel.

Jason



Re: [PATCH v10 03/12] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support

2023-02-15 Thread Jason Gunthorpe
On Wed, Feb 15, 2023 at 01:14:35PM -0700, Alex Williamson wrote:

> We'll need to consider whether we want to keep "dumb" dirty tracking,
> or even any form of dirty tracking in the type1 uAPI, under an
> experimental opt-in.  Thanks,

I was expecting we'd delete the kernel code for type 1 dirty tracking
once the v2 parts are merged to qemu since we don't and won't have any
kernel implementation of it..

The big point of this is to allow qemu to continue on with a future
kernel that no longer reports it supports this.

Jason



Re: [RFC v3 18/18] vfio/as: Allow the selection of a given iommu backend

2023-02-03 Thread Jason Gunthorpe
On Fri, Feb 03, 2023 at 06:57:02PM +0100, Eric Auger wrote:
> Hi Jason,
> 
> On 2/3/23 13:51, Jason Gunthorpe wrote:
> > On Tue, Jan 31, 2023 at 09:53:05PM +0100, Eric Auger wrote:
> >> Now we support two types of iommu backends, let's add the capability
> >> to select one of them. This depends on whether an iommufd object has
> >> been linked with the vfio-pci device:
> >>
> >> if the user wants to use the legacy backend, it shall not
> >> link the vfio-pci device with any iommufd object:
> >>
> >> -device vfio-pci,host=:02:00.0
> >>
> >> This is called the legacy mode/backend.
> >>
> >> If the user wants to use the iommufd backend (/dev/iommu) it
> >> shall pass an iommufd object id in the vfio-pci device options:
> >>
> >>  -object iommufd,id=iommufd0
> >>  -device vfio-pci,host=:02:00.0,iommufd=iommufd0
> >>
> >> Note the /dev/iommu device may have been pre-opened by a
> >> management tool such as libvirt. This mode is no more considered
> >> for the legacy backend. So let's remove the "TODO" comment.
> > The vfio cdev should also be pre-openable like iommufd?
> 
> where does the requirement come from?

I would expect it helps sandbox security.

We couldn't do this with the original VFIO model, but now we can have
libvirt open the vfio with privilege and FD pass it to qemu.

This way qemu never needs to have privilege to open a VFIO or iommu
cdev node.

Jason



Re: [RFC v3 00/18] vfio: Adopt iommufd

2023-02-03 Thread Jason Gunthorpe
On Tue, Jan 31, 2023 at 09:52:47PM +0100, Eric Auger wrote:
> Given some iommufd kernel limitations, the iommufd backend is
> not yuet fully on par with the legacy backend w.r.t. features like:
> - p2p mappings (you will see related error traces)
> - coherency tracking

You said this was a qemu side limitation?

> - live migration

The vfio kernel interfaces are deprecated; Avihai's series here adds
live migration support:

https://lore.kernel.org/qemu-devel/20230126184948.10478-1-avih...@nvidia.com/

And there will be another series for iommufd system iommu based live
migration

> - vfio pci device hot reset

What is needed here?

Jason



Re: [RFC v3 18/18] vfio/as: Allow the selection of a given iommu backend

2023-02-03 Thread Jason Gunthorpe
On Tue, Jan 31, 2023 at 09:53:05PM +0100, Eric Auger wrote:
> Now we support two types of iommu backends, let's add the capability
> to select one of them. This depends on whether an iommufd object has
> been linked with the vfio-pci device:
> 
> if the user wants to use the legacy backend, it shall not
> link the vfio-pci device with any iommufd object:
> 
> -device vfio-pci,host=:02:00.0
> 
> This is called the legacy mode/backend.
> 
> If the user wants to use the iommufd backend (/dev/iommu) it
> shall pass an iommufd object id in the vfio-pci device options:
> 
>  -object iommufd,id=iommufd0
>  -device vfio-pci,host=:02:00.0,iommufd=iommufd0
> 
> Note the /dev/iommu device may have been pre-opened by a
> management tool such as libvirt. This mode is no more considered
> for the legacy backend. So let's remove the "TODO" comment.

The vfio cdev should also be pre-openable like iommufd?

Jason



Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support

2023-02-01 Thread Jason Gunthorpe
On Wed, Feb 01, 2023 at 11:42:46AM -0700, Alex Williamson wrote:

> > 'p2p off' is a valuable option in its own right because this stuff
> > doesn't work reliably and is actively dangerous. Did you know you can
> > hard crash the bare metal from a guest on some platforms with P2P
> > operations? Yikes. If you don't need to use it turn it off and don't
> > take the risk.
> 
> If we're honest, there are a number of cases of non-exceptional faults
> that an assigned device can generate that the platform might escalate
> to fatal errors.

What I understand is that this is true on some commodity hardware, but
systems engineered to run as cloud hypervisors have these problems
solved and VFIO is made safe.

Unfortunately there is no way to know if you have a safe or unsafe
system from the OS.

> > Arguably for this reason 'p2p off' should trend toward the default as
> > it is much safer.
> 
> Safety in the hands of the userspace to protect the host though?
> Shouldn't the opt-in be at the kernel level whether to allow p2p
> mappings?  

I haven't seen anyone interested in doing this kind of work. The
expectation seems to be that places seriously concerned about security
either don't include VFIO at all in their environments or have
engineered their platforms to make it safe.

Where this leaves the enterprise space, I don't know. I think they end
up with systems that functionally work but possibly have DOS problems.

So, given this landscape I think a user option in qemu is the best we
can do at the moment.

> I don't have an issue if QEMU were to mirror this by
> creating a RAM-only AddressSpace for devices which would be used when
> p2p is disable (it'd save us some headaches for various unaligned
> devices as well), but we shouldn't pretend that actually protects the
> host.  OTOH, QEMU could feel confident supporting migration of devices
> w/o support of the migration P2P states with that restriction.

It protects the host from a hostile VM, it does not fully protect the
host from a compromised qemu. That is still an improvement.

> > I think multi-device will likely have some use cases, so I'd like to
> > see a path to have support for them. For this series I think it is
> > probably fine since it is already 18 patches.
> 
> It might be fine for this series because it hasn't yet proposed to make
> migration non-experimental, but it's unclear where the goal post is
> that we can actually make that transition.

IMHO non-experimental just means the solution works with whatever
configuration limitations it comes with. It shouldn't mean every
device or every configuration combination works.

So if you want to do single device, or just hard require P2P for now,
those are both reasonable temporary choices IMHO.

But they are temporary and we should come with a remedy to allow
non-P2P migration devices to work as well.

Given we merged a non-P2P kernel driver I prefer the single device
option as it is more logically consistent with the kernel situation.

Jason



Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support

2023-02-01 Thread Jason Gunthorpe
On Tue, Jan 31, 2023 at 09:15:03PM -0700, Alex Williamson wrote:

> > IMHO this is generally the way forward to do multi-device as well,
> > remove the MMIO from all the address maps: VFIO, SW access, all of
> > them. Nothing can touch MMIO except for the vCPU.
> 
> Are you suggesting this relative to migration or in general?  

I would suggest a general qemu p2p on/off option.

> P2P is a feature with tangible benefits and real use cases.  Real
> systems seem to be moving towards making P2P work better, so it
> would seem short sighted to move to and enforce only configurations
> w/o P2P in QEMU generally.  

qemu needs to support it, but it should be a user option.

Every system I've been involved with for enabling P2P into a VM has
been a total nightmare. This is not something you just turn on and it
works great :\ The whole thing was carefully engineered right down to
the BIOS to be able to work safely.

P2P in baremetal is much easier compared to P2P inside a VM.

> Besides, this would require some sort of deprecation, so are we
> intending to make users choose between migration and P2P?

Give qemu an option 'p2p on/p2p off' and default it to on for
backwards compatibility.

If p2p is on and migration devices don't support P2P states then
migration is disabled. The user made this choice when they bought
incapable HW.

Log warnings to make it more discoverable. I think with the cdev
patches we can make it so libvirt can query the device FD for
capabilities to be even cleaner.

If user sets 'p2p off' then migration works with all HW.

p2p on/off is a global switch. With p2p off nothing, no HW or SW or
hybrid device, can touch the MMIO memory.

'p2p off' is a valuable option in its own right because this stuff
doesn't work reliably and is actively dangerous. Did you know you can
hard crash the bare metal from a guest on some platforms with P2P
operations? Yikes. If you don't need to use it turn it off and don't
take the risk.

Arguably for this reason 'p2p off' should trend toward the default as
it is much safer.

> Are we obliged to start with that hardware?  I'm just trying to think
> about whether a single device restriction is sufficient to prevent any
> possible P2P or whether there might be an easier starting point for
> more capable hardware.  There's no shortage of hardware that could
> support migration given sufficient effort.  Thanks,

I think multi-device will likely have some use cases, so I'd like to
see a path to have support for them. For this series I think it is
probably fine since it is already 18 patches.

Jason



Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support

2023-01-31 Thread Jason Gunthorpe
On Tue, Jan 31, 2023 at 03:43:01PM -0700, Alex Williamson wrote:

> How does this affect our path towards supported migration?  I'm
> thinking about a user experience where QEMU supports migration if
> device A OR device B are attached, but not devices A and B attached to
> the same VM.  We might have a device C where QEMU supports migration
> with B AND C, but not A AND C, nor A AND B AND C.  This would be the
> case if device B and device C both supported P2P states, but device A
> did not. The user has no observability of this feature, so all of this
> looks effectively random to the user.

I think qemu should just log if it encounters a device without P2P
support.

> Even in the single device case, we need to make an assumption that a
> device that does not support P2P migration states (or when QEMU doesn't
> make use of P2P states) cannot be a DMA target, or otherwise have its
> MMIO space accessed while in a STOP state.  Can we guarantee that when
> other devices have not yet transitioned to STOP?

You mean the software devices created by qemu?

> We could disable the direct map MemoryRegions when we move to a STOP
> state, which would give QEMU visibility to those accesses, but besides
> pulling an abort should such an access occur, could we queue them in
> software, add them to the migration stream, and replay them after the
> device moves to the RUNNING state?  We'd need to account for the lack of
> RESUMING_P2P states as well, trapping and queue accesses from devices
> already RUNNING to those still in RESUMING (not _P2P).

I think any internal SW devices should just fail all accesses to the
P2P space, all the time.

qemu simply acts like a real system that doesn't support P2P.

IMHO this is generally the way forward to do multi-device as well,
remove the MMIO from all the address maps: VFIO, SW access, all of
them. Nothing can touch MMIO except for the vCPU.

> This all looks complicated.  Is it better to start with requiring P2P
> state support?  Thanks,

People have built HW without it, so I don't see this as so good..

Jason



Re: [RFC v3 16/18] vfio/iommufd: Implement the iommufd backend

2023-01-31 Thread Jason Gunthorpe
On Tue, Jan 31, 2023 at 09:53:03PM +0100, Eric Auger wrote:
> From: Yi Liu 
> 
> Add the iommufd backend. The IOMMUFD container class is implemented
> based on the new /dev/iommu user API. This backend obviously depends
> on CONFIG_IOMMUFD.
> 
> So far, the iommufd backend doesn't support live migration and
> cache coherency yet due to missing support in the host kernel meaning
> that only a subset of the container class callbacks is implemented.

What is missing for cache coherency? I spent lots of time on that
already, I thought I got everything..

Jason



Re: [PATCH v5 10/14] vfio/migration: Implement VFIO migration protocol v2

2023-01-09 Thread Jason Gunthorpe
On Mon, Jan 09, 2023 at 06:27:21PM +0100, Cédric Le Goater wrote:
> also, in vfio_migration_query_flags() :
> 
>   +static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
>   +{
>   +uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>   +  sizeof(struct vfio_device_feature_migration),
>   +  sizeof(uint64_t))] = {};
>   +struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
>   +struct vfio_device_feature_migration *mig =
>   +(struct vfio_device_feature_migration *)feature->data;
>   +
>   +feature->argsz = sizeof(buf);
>   +feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
>   +if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
>   +return -EOPNOTSUPP;
>   +}
>   +
>   +*mig_flags = mig->flags;
>   +
>   +return 0;
>   +}
> 
> 
> The code is using any possible error returned by the VFIO_DEVICE_FEATURE
> ioctl to distinguish protocol v1 from v2.

I'm comfortable with that from a kernel perspective.

There is no such thing as this API failing in the kernel but userspace
should continue on, no matter what the error code. So always failing
here is correct.

About the only thing you might want to do is convert anything other
than ENOTTY into a hard qemu failure similar to failing to open
/dev/vfio or something - it means something has gone really
wrong.. But that is pretty obscure stuff
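
For illustration, a standalone sketch of that ENOTTY-versus-everything-else
distinction (not the merged QEMU code; it just redoes the probe quoted
above with the extra errno check):

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Probe for v2 migration support: ENOTTY means the kernel has no v2
 * interface at all (just run without migration); any other errno means
 * something else has gone badly wrong and should be a hard failure.
 */
static int vfio_probe_migration(int device_fd, uint64_t *mig_flags)
{
    uint64_t buf[(sizeof(struct vfio_device_feature) +
                  sizeof(struct vfio_device_feature_migration) +
                  sizeof(uint64_t) - 1) / sizeof(uint64_t)];
    struct vfio_device_feature *feature = (void *)buf;
    struct vfio_device_feature_migration *mig = (void *)feature->data;

    memset(buf, 0, sizeof(buf));
    feature->argsz = sizeof(buf);
    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;

    if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
        if (errno == ENOTTY) {
            return -ENOTSUP;   /* old kernel: run without migration */
        }
        return -errno;         /* anything else is a hard error */
    }

    *mig_flags = mig->flags;
    return 0;
}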

Jason



Re: [PATCH v5 00/14] vfio/migration: Implement VFIO migration protocol v2

2023-01-06 Thread Jason Gunthorpe
On Fri, Jan 06, 2023 at 04:36:09PM -0700, Alex Williamson wrote:

> Missing from the series is the all important question of what happens
> to "x-enable-migration" now.  We have two in-kernel drivers supporting
> v2 migration, so while hardware and firmware may still be difficult to
> bring together, it does seem possible for the upstream community to
> test and maintain this.

My post-break memory is a bit hazy, but don't we still need a qemu
series for the new dirty tracking uAPI? I suggest that is the right
spot to declare victory on this, as it is actually production usable
and testable.

I'm also hopeful we can see the system iommu dirty tracking

> To declare this supported and not to impose any additional requirements
> on management tools, I think migration needs to be enabled by default
> for devices that support it.

At least for mlx5 there will be a switch that causes the VF to not
support migration, and that will probably be the default.

> Is there any utility to keeping around
> some sort of device option to force it ON/OFF?  

I think not at the qemu level. Even for testing purposes it is easy
to disable live migration by not loading the variant vfio driver.

Jason



Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2

2022-11-28 Thread Jason Gunthorpe
On Mon, Nov 28, 2022 at 01:36:30PM -0700, Alex Williamson wrote:
> On Mon, 28 Nov 2022 15:40:23 -0400
> Jason Gunthorpe  wrote:
> 
> > On Mon, Nov 28, 2022 at 11:50:03AM -0700, Alex Williamson wrote:
> > 
> > > There's a claim here about added complexity that I'm not really seeing.
> > > It looks like we simply make an ioctl call here and scale our buffer
> > > based on the minimum of the returned device estimate or our upper
> > > bound.  
> > 
> > I'm not keen on this, for something like mlx5 that has a small precopy
> > size and large post-copy size it risks running with an under allocated
> > buffer, which is harmful to performance.
> 
> I'm trying to weed out whether there are device assumptions in the
> implementation, seems like maybe we found one.  

I don't think there are assumptions. Any correct kernel driver should
be able to do this transfer out of the FD byte-at-a-time.

This buffer size is just a random selection for now until we get
multi-fd and can sit down, benchmark and optimize this properly.

The ideal realization of this has no buffer at all.

> MIG_DATA_SIZE specifies that it's an estimated data size for
> stop-copy, so shouldn't that provide the buffer size you're looking
> for? 

Yes, it should, and it should be OK for mlx5

Jason



Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2

2022-11-28 Thread Jason Gunthorpe
On Mon, Nov 28, 2022 at 11:50:03AM -0700, Alex Williamson wrote:

> There's a claim here about added complexity that I'm not really seeing.
> It looks like we simply make an ioctl call here and scale our buffer
> based on the minimum of the returned device estimate or our upper
> bound.

I'm not keen on this, for something like mlx5 that has a small precopy
size and large post-copy size it risks running with an under allocated
buffer, which is harmful to performance.

It is an mmap space; if we don't touch the pages they don't get
allocated from the OS. I think this is micro-optimizing.

Jason



Re: [PATCH v2] vfio/pci: Verify each MSI vector to avoid invalid MSI vectors

2022-11-28 Thread Jason Gunthorpe
On Sat, Nov 26, 2022 at 11:15:14AM +, Marc Zyngier wrote:

> > Physical hardware doesn't do this, virtual emulation shouldn't either.
> 
> If you want to fix VFIO, be my guest. My rambling about the sorry
> state of this has been in the kernel for 5 years (ed8703a506a8).

We are talking about things. Stuff we want to do doesn't work, or is
completely insane right now.

> > People are taking too many liberties with trapping the PCI MSI
> > registers through VFIO. :(
> 
> Do you really want to leave access to the MSI BAR to userspace? The
> number of ways this can go wrong is mind-boggling. 

Yeah, actually I do. This is basically mandatory to do something like
IMS, SIOV, etc.

> Starting with having to rebuild the interrupt translation tables on
> the host side to follow what the guest does, instead of keeping the
> two independent.

At least on x86 most of the discussion has been about teaching the
interrupt controller to go to the hypervisor to get help when
establishing interrupts. The hypervisor can tell the guest what the
real MSI data is.

This is following the example of hyperv which plugs in a hypercall to
HVCALL_MAP_DEVICE_INTERRUPT in its remapping irq_chip. This allows the
hypervisor to tell the guest a real addr/data pair and the hypervisor
does not have to involve itself in the device programming.

We haven't reached a point of thinking in detail about ARM, but I would
guess the general theme would still apply.

Jason



Re: [PATCH v2] vfio/pci: Verify each MSI vector to avoid invalid MSI vectors

2022-11-24 Thread Jason Gunthorpe
On Wed, Nov 23, 2022 at 09:42:36AM +0800, chenxiang via wrote:
> From: Xiang Chen 
> 
> Currently the number of MSI vectors comes from register PCI_MSI_FLAGS
> which should be power-of-2 in qemu, in some scenaries it is not the same as
> the number that driver requires in guest, for example, a PCI driver wants
> to allocate 6 MSI vecotrs in guest, but as the limitation, it will allocate
> 8 MSI vectors. So it requires 8 MSI vectors in qemu while the driver in
> guest only wants to allocate 6 MSI vectors.
> 
> When GICv4.1 is enabled, it iterates over all possible MSIs and enable the
> forwarding while the guest has only created some of mappings in the virtual
> ITS, so some calls fail. The exception print is as following:
> vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails:66311

With Thomas's series to make MSI more dynamic this could spell future
problems, as future kernels might have different ordering.

It is just architecturally wrong to tie the MSI programming at the PCI
level with the current state of the guest's virtual interrupt
controller.

Physical hardware doesn't do this, virtual emulation shouldn't either.

People are taking too many liberties with trapping the PCI MSI
registers through VFIO. :(

Jason



Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2

2022-11-17 Thread Jason Gunthorpe
On Thu, Nov 17, 2022 at 07:07:10PM +0200, Avihai Horon wrote:
> > > +}
> > > +
> > > +if (mig_state->data_fd != -1) {
> > > +if (migration->data_fd != -1) {
> > > +/*
> > > + * This can happen if the device is asynchronously reset and
> > > + * terminates a data transfer.
> > > + */
> > > +error_report("%s: data_fd out of sync", vbasedev->name);
> > > +close(mig_state->data_fd);
> > > +
> > > +return -1;
> > Should we go to recover_state here?  Is migration->device_state
> > invalid?  -EBADF?
> 
> Yes, we should.
> Although VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE ioctl above succeeded, setting
> the device state didn't *really* succeed, as the data_fd went out of sync.
> So we should go to recover_state and return -EBADF.

The state did succeed and it is now "new_state". Getting an
unexpected data_fd means it did something like RUNNING->PRE_COPY_P2P
when the code was expecting PRE_COPY->PRE_COPY_P2P.

It is actually in PRE_COPY_P2P but the in-progress migration must be
stopped and the kernel would have made the migration->data_fd
permanently return some error when it went async to RUNNING.

The recovery is to restart the migration (of this device?) from the
start.

Jason



Re: [RFC 7/7] migration: call qemu_savevm_state_pending_exact() with the guest stopped

2022-10-18 Thread Jason Gunthorpe
On Fri, Oct 14, 2022 at 01:29:51PM +0100, Joao Martins wrote:
> On 14/10/2022 12:28, Juan Quintela wrote:
> > Joao Martins  wrote:
> >> On 13/10/2022 17:08, Juan Quintela wrote:
> >>> Oops.  My understanding was that once the guest is stopped you can say
> >>> how big is it. 
> > 
> > Hi
> > 
> >> It's worth keeping in mind that conceptually a VF won't stop (e.g. DMA) 
> >> until
> >> really told through VFIO. So, stopping CPUs (guest) just means that guest 
> >> code
> >> does not arm the VF for more I/O but still is a weak guarantee as VF still 
> >> has
> >> outstanding I/O to deal with until VFIO tells it to quiesce DMA (for 
> >> devices
> >> that support it).
> > 
> > How can we make sure about that?
> > 
> > i.e. I know I have a vfio device.  I need two things:
> > - in the iterative stage, I eed to check the size, but a estimate is ok.
> >   for example with RAM, we use whatever is the size of the dirty bitmap
> >   at that moment.
> >   If the estimated size is smaller than the theshold, we
> >* stop the guest
> >* sync dirty bitmap
> >* now we test the (real) dirty bitmap size
> > 
> > How can we do something like that with a vfio device.
> > 
> You would have an extra intermediate step that stops the VF prior to asking
> the device state length. What I am not sure is whether stopping
> vCPUs can be skipped as an optimization.

It cannot; if you want to stop the VFIO device you must also stop the
vCPUs because the device is not required to respond properly to MMIO
operations when stopped.

> > My understanding from NVidia folks was that newer firmware have an ioctl
> > to return than information.
> 
> Maybe there's something new. I was thinking from this here:

Juan is talking about the ioctl we had in the pre-copy series.

I expect it to come into some different form to support this RFC.

Jason



Re: [RFC 7/7] migration: call qemu_savevm_state_pending_exact() with the guest stopped

2022-10-14 Thread Jason Gunthorpe
On Thu, Oct 13, 2022 at 01:25:10PM +0100, Joao Martins wrote:

> It would allow supporting both the (current UAPI) case where you need to
> transfer the state to get device state size (so checking against 
> threshold_size
> pending_pre constantly would allow to not violate the SLA) as well as any 
> other
> UAPI improvement to fseek()/data_size that lets you fail even earlier.

We should not get fixated on missing parts of the current kernel
support; if we need a new query or something to make qemu happy then
we will add it. We don't need to worry about supporting the kernel-with-no-query
case in qemu.

Jason



Re: [PATCH v2 07/11] vfio/migration: Implement VFIO migration protocol v2

2022-07-18 Thread Jason Gunthorpe
On Mon, May 30, 2022 at 08:07:35PM +0300, Avihai Horon wrote:

> +/* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
> +static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
> +{
> +ssize_t data_size;
> +
> +data_size = read(migration->data_fd, migration->data_buffer,
> + migration->data_buffer_size);
> +if (data_size < 0) {
> +return -1;
> +}
> +if (data_size == 0) {
> +return 1;
> +}
> +
> +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +qemu_put_be64(f, data_size);
> +qemu_put_buffer_async(f, migration->data_buffer, data_size, false);
> +qemu_fflush(f);
> +bytes_transferred += data_size;
> +
> +trace_vfio_save_block(migration->vbasedev->name, data_size);
> +
> +return qemu_file_get_error(f);
> +}

We looked at this with an eye to "how much data is transferred" per
callback.

The above function is the basic data mover, and
'migration->data_buffer_size' is set to 1MB at the moment.

So, we produce up to 1MB VFIO_MIG_FLAG_DEV_DATA_STATE sections.

This series does not include the precopy support, but that will
include a precopy 'save_live_iterate' function like this:

static int vfio_save_iterate(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
int ret;

ret = vfio_save_block(f, migration);
if (ret < 0) {
return ret;
}
if (ret == 1) {
return 1;
}
qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
return 0;
}

Thus, during precopy this will never do more than 1MB per callback.

> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +VFIODevice *vbasedev = opaque;
> +enum vfio_device_mig_state recover_state;
> +int ret;
> +
> +/* We reach here with device state STOP or STOP_COPY only */
> +recover_state = VFIO_DEVICE_STATE_STOP;
> +ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> +   recover_state);
> +if (ret) {
> +return ret;
> +}
> +
> +do {
> +ret = vfio_save_block(f, vbasedev->migration);
> +if (ret < 0) {
> +return ret;
> +}
> +} while (!ret);

This seems to be the main problem, where we chain together 1MB blocks
until the entire complete-precopy data is transferred. The above is
hooked to 'save_live_complete_precopy'

So, if we want to break the above up into some 'save_iterate' like
function, do you have some advice on how to do it? The above do/while
must happen after the VFIO_DEVICE_STATE_STOP_COPY.

For mlx5 the above loop will often be ~10MB for small VMs and
~100MB for big VMs (big meaning making extensive use of RDMA
functionality), and this will not change whether or not pre-copy is supported.

Is it still a problem?

For other devices, like a GPU, I would imagine pre-copy support is
implemented and this will be a smaller post-precopy residual.

Jason



Re: [PATCH v2 00/11] vfio/migration: Implement VFIO migration protocol v2

2022-06-23 Thread Jason Gunthorpe
On Fri, Jun 17, 2022 at 03:51:29PM -0600, Alex Williamson wrote:

> It's ok by me if QEMU vfio is the one that marks all mapped pages dirty
> if the host interface provides no way to do so.  Would we toggle that
> based on whether the device has bus-master enabled?

I don't think so, that is a very niche optimization, it would only
happen if a device is plugged in but never used.

If a device truly doesn't have bus master capability at all then its
VFIO migration driver should implement dirty reporting and report no
dirties.

> Regarding SPAPR, I'd tend to think that if we're dirtying in QEMU then
> nothing prevents us from implementing the same there, but also I'm not
> going to stand in the way of simply disabling migration for that IOMMU
> backend unless someone speaks up that they think it deserves parity.

If the VFIO device internal tracker is being used it should work with
SPAPR too.

The full algorithm should be to try to find a dirty tracker for each
VFIO migration device and if none is found then always dirty
everything at STOP_COPY.

iommufd will provide the only global dirty tracker, so if SPAPR or
legacy VFIO type1 is used without a device internal tracker then it
should do the all-dirties.
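
A sketch of that selection policy, with placeholder names (none of these
are real QEMU functions, and the precedence between the two trackers shown
here is arbitrary):

#include <stdbool.h>

/*
 * Each VFIO migration device needs *some* dirty tracker; if it has none,
 * report all mapped memory as dirty at STOP_COPY so migration stays
 * correct, just without any convergence help.
 */
enum dirty_method {
    DIRTY_BY_CONTAINER,      /* global tracker, i.e. iommufd */
    DIRTY_BY_DEVICE,         /* device internal tracker, e.g. mlx5 */
    DIRTY_ALL_AT_STOP_COPY,  /* SPAPR / type1 with no device tracker */
};

static enum dirty_method pick_dirty_method(bool container_has_tracker,
                                           bool device_has_tracker)
{
    if (device_has_tracker) {
        return DIRTY_BY_DEVICE;
    }
    if (container_has_tracker) {
        return DIRTY_BY_CONTAINER;
    }
    return DIRTY_ALL_AT_STOP_COPY;
}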

Jason



Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported

2022-05-18 Thread Jason Gunthorpe
On Tue, May 17, 2022 at 09:46:56PM -0600, Alex Williamson wrote:

> The current solution is obviously non-optimal, it was mainly
> meant for backwards compatibility, but this seems like a fail faster
> solution, with less useless work, but also providing less indication
> how to configure the migration to succeed.

I don't think this is a production configuration. We should not be
expecting that the user is going to carefully fine-tune some SLA
parameter. It may be that the "SLA check" we are missing is simply whether
an SLA exists or not.

> > It is not intended to be a useful configuration, this is just covering
> > off backwards compat issues - so I'm not keen to do a bunch of
> > management work to support it.
> 
> Are we then deemphasizing the vfio compatibility support in iommufd?

I'm viewing iommufd compatibility with VFIO type 1 as only extending
to implemented features.. We are deleting NESTED for instance. As
there is no in-kernel implementation of the type 1 tracker I would
expect to eventually delete it as well once we have consensus this is
what we plan to do.

We discussed this in Joao's series and that was the consensus.

> If we really don't want to put effort towards backwards compatibility,
> should migration support only be enabled with native iommufd
> support?

I'm expecting that the dirty tracking will require native iommufd only
for the system iommu tracker, but the mlx5 on-device tracker will be
fine with either iommu back end.

It is still useful currently for testing the VFIO device part as it
will be some time until the other parts are ready, so I'd rather not
block it completely in qemu.

The goal is for qemu to do something sensible so that when we delete kernel
support for type 1 dirty tracking there is no back compat concern
for existing non-experimental qemu.

Jason



Re: [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd()

2022-05-18 Thread Jason Gunthorpe
On Wed, May 18, 2022 at 05:00:26PM +0100, Daniel P. Berrangé wrote:
> On Wed, May 18, 2022 at 12:42:37PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 18, 2022 at 01:54:34PM +0200, Juan Quintela wrote:
> > 
> > > >> Is there a really performance difference to just use:
> > > >>
> > > >> uint8_t buffer[size];
> > > >>
> > > >> qemu_get_buffer(f, buffer, size);
> > > >> write(fd, buffer, size);
> > > >>
> > > >> Or telling it otherwise, what sizes are we talking here?
> > > >
> > > > It depends on the device, but It can range from a few MBs to several
> > > > GBs AFAIK.
> > > 
> > > a few MB is ok.
> > > 
> > > Several GB on the main migration channel without a single
> > > header/whatever?
> > 
> > IIRC it iterates in multi-megabyte chunks each which gets a header.
> > 
> > The chunking size is set by the size of the buffer mmap
> > 
> > The overall point is that memcpying GB's is going to be taxing so we
> > want to eliminate copies on this path, especially copies that result
> > in more system calls.
> > 
> > We are expecting to look into further optimization down the road here
> > because even this is still too slow.
> 
> Considering the possibility of future optimization, IMHO adding this
> kind of API at the QEMUFile level is too high. We'd be better pushing
> the impl down into the QIOChannel API level.
> 
>int64_t qio_channel_copy_range(QIOCHannel *srcioc,
>   QIOChannel *tgtioc,
> size_t len);
> 
> The QIOChannel impl can do pretty much what you showed in the general
> case, but in special cases it could have the option to offload to the
> kernel copy_range() syscall to avoid the context sitches.

This is probably something to do down the road when we figure out
exactly what is best.

Currently we don't have kernel support for optimized copy_file_range()
(ie fops splice_read) inside the VFIO drivers so copy_file_range()
will just fail.

I didn't look closely again but IIRC the challenge is that the
QIOChannel doesn't have a ready large temporary buffer to use for a
non-splice fallback path.
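
For illustration, a sketch of such a fallback (standalone, not QEMU's
QIOChannel code; the errno list is a guess at which failures mean
"unsupported"):

#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Copy 'len' bytes from src_fd to dst_fd.  Try copy_file_range() first to
 * avoid user-space copies; when the fds don't support it (as VFIO fds
 * don't today, lacking a splice_read implementation) fall back to a
 * bounce buffer supplied by the caller.
 */
static ssize_t copy_range_with_fallback(int src_fd, int dst_fd, size_t len,
                                        void *bounce, size_t bounce_size)
{
    size_t done = 0;
    bool try_cfr = true;

    while (done < len) {
        if (try_cfr) {
            ssize_t n = copy_file_range(src_fd, NULL, dst_fd, NULL,
                                        len - done, 0);
            if (n > 0) {
                done += n;
                continue;
            }
            if (n == 0) {
                break;                          /* EOF on source */
            }
            if (errno != EXDEV && errno != ENOSYS &&
                errno != EOPNOTSUPP && errno != EINVAL) {
                return -errno;
            }
            try_cfr = false;                    /* switch to bounce path */
        }

        size_t chunk = len - done < bounce_size ? len - done : bounce_size;
        ssize_t r = read(src_fd, bounce, chunk);
        if (r < 0) {
            return -errno;
        }
        if (r == 0) {
            break;                              /* EOF on source */
        }
        for (ssize_t off = 0; off < r; ) {
            ssize_t w = write(dst_fd, (char *)bounce + off, r - off);
            if (w < 0) {
                return -errno;
            }
            off += w;
        }
        done += r;
    }
    return done;
}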

Jason



Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported

2022-05-18 Thread Jason Gunthorpe
On Wed, May 18, 2022 at 01:39:31PM +0200, Juan Quintela wrote:

> > That does seem like a defect in this patch, any SLA constraints should
> > still all be checked under the assumption all ram is dirty.
> 
> And how are we going to:
> - detect the network link speed
> - to be sure that we are inside downtime limit
> 
> I think that it is not possible, so basically we are skiping the precopy
> stage and praying that the other bits are going to be ok.

Like I keep saying, this is not a real use case, we expect dirty
tracking to be available in any real configuration. This is just
trying to make qemu work in some reasonable way if dirty tracking is
not available but a VFIO migration device is plugged in.

Just pick something simple that makes sense. Like if any SLA is set
then just refuse to even start. If no SLA then go directly to
STOP_COPY.
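
A sketch of that policy with made-up names (not actual QEMU code): refuse
when an SLA is configured but dirty tracking is unavailable, otherwise skip
pre-copy entirely.

#include <stdbool.h>

enum mig_policy {
    MIG_REFUSE,
    MIG_SKIP_PRECOPY,
    MIG_NORMAL,
};

static enum mig_policy pick_migration_policy(bool dirty_tracking_supported,
                                             bool sla_configured)
{
    if (dirty_tracking_supported) {
        return MIG_NORMAL;
    }
    if (sla_configured) {
        return MIG_REFUSE;       /* cannot honour downtime limits */
    }
    return MIG_SKIP_PRECOPY;     /* go directly to STOP_COPY */
}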

> >> It seems like a better solution would be to expose to management
> >> tools that the VM contains a device that does not support the
> >> pre-copy phase so that downtime expectations can be adjusted.
> >
> > I don't expect this to be a real use case though..
> >
> > Remember, you asked for this patch when you wanted qemu to have good
> > behavior when kernel support for legacy dirty tracking is removed
> > before we merge v2 support.
> 
> I am an ignorant on the subject.  Can I ask how the dirty memory is
> tracked on this v2?

These two RFCs are the current proposal to enable it for the system
iommu:

https://lore.kernel.org/kvm/20220428210933.3583-1-joao.m.mart...@oracle.com

And for device internal trackers:

https://lore.kernel.org/kvm/20220501123301.127279-1-yish...@nvidia.com/ 

Regards,
Jason



Re: [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd()

2022-05-18 Thread Jason Gunthorpe
On Wed, May 18, 2022 at 01:54:34PM +0200, Juan Quintela wrote:

> >> Is there a really performance difference to just use:
> >>
> >> uint8_t buffer[size];
> >>
> >> qemu_get_buffer(f, buffer, size);
> >> write(fd, buffer, size);
> >>
> >> Or telling it otherwise, what sizes are we talking here?
> >
> > It depends on the device, but It can range from a few MBs to several
> > GBs AFAIK.
> 
> a few MB is ok.
> 
> Several GB on the main migration channel without a single
> header/whatever?

IIRC it iterates in multi-megabyte chunks each which gets a header.

The chunking size is set by the size of the buffer mmap

The overall point is that memcpying GB's is going to be taxing so we
want to eliminate copies on this path, especially copies that result
in more system calls.

We are expecting to look into further optimization down the road here
because even this is still too slow.

Jason



Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported

2022-05-17 Thread Jason Gunthorpe
On Tue, May 17, 2022 at 11:22:32AM -0600, Alex Williamson wrote:

> > > It seems like a better solution would be to expose to management
> > > tools that the VM contains a device that does not support the
> > > pre-copy phase so that downtime expectations can be adjusted.  
> > 
> > I don't expect this to be a real use case though..
> > 
> > Remember, you asked for this patch when you wanted qemu to have good
> > behavior when kernel support for legacy dirty tracking is removed
> > before we merge v2 support.
> 
> Is wanting good behavior a controversial point?  Did we define this as
> the desired good behavior?  Ref?  

As I said, this was intended as a NOP, which is what I thought we
agreed to. Missing the SLA checking that existed before seems like
something to fix in this patch. This is the discussion thread:

https://lore.kernel.org/kvm/20220324231159.ga11...@nvidia.com/

 "I guess I was assuming that enabling v2 migration in QEMU was dependent
  on the existing type1 dirty tracking because it's the only means we
  have to tell QEMU that all memory is perpetually dirty when we have a
  DMA device.  Is that not correct?"

The only point was to prepare qemu for kernel's that don't support the
legacy dirty tracking API but do have a v2 migration interface. IIRC
something undesired happens if you do that right now.

We could also just dirty all memory in qemu and keep it exactly the
same so every SLA detail works. Or completely block pre-copy based
flows.

It is not intended to be a useful configuration, this is just covering
off backwards compat issues - so I'm not keen to do a bunch of
management work to support it.

Thanks,
Jason



Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported

2022-05-17 Thread Jason Gunthorpe
On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:

> > This is really intended to be a NOP from where things are now, as if
> > you use mlx5 live migration without a patch like this then it causes a
> > botched pre-copy since everything just ends up permanently dirty.
> > 
> > If it makes more sense we could abort the pre-copy too - in the end
> > there will be dirty tracking so I don't know if I'd invest in a big
> > adventure to fully define non-dirty tracking migration.
> 
> How is pre-copy currently "botched" without a patch like this?  If it's
> simply that the pre-copy doesn't converge and the downtime constraints
> don't allow the VM to enter stop-and-copy, that's the expected behavior
> AIUI, and supports backwards compatibility with existing SLAs.

It means it always fails - that certainly isn't working live
migration. There is no point in trying to converge something that we
already know will never converge.

> I'm assuming that by setting this new skip_precopy flag that we're
> forcing the VM to move to stop-and-copy, regardless of any other SLA
> constraints placed on the migration.  

That does seem like a defect in this patch, any SLA constraints should
still all be checked under the assumption all ram is dirty.

> It seems like a better solution would be to expose to management
> tools that the VM contains a device that does not support the
> pre-copy phase so that downtime expectations can be adjusted.

I don't expect this to be a real use case though..

Remember, you asked for this patch when you wanted qemu to have good
behavior when kernel support for legacy dirty tracking is removed
before we merge v2 support.

Thanks,
Jason



Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported

2022-05-16 Thread Jason Gunthorpe
On Mon, May 16, 2022 at 02:22:00PM -0600, Alex Williamson wrote:
> On Mon, 16 May 2022 13:22:14 +0200
> Juan Quintela  wrote:
> 
> > Avihai Horon  wrote:
> > > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > > tracking, migration is blocked completely. This is because a DMA-able
> > > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > > breaking the migration.
> > >
> > > However, this doesn't mean that migration can't be done at all. If
> > > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > > chance to dirty RAM pages that have been migrated already, thus
> > > eliminating the problem previously mentioned.
> > >
> > > Hence, in such case allow migration but skip pre-copy phase.
> > >
> > > Signed-off-by: Avihai Horon   
> > 
> > I don't know (TM).
> > Several issues:
> > - Patch is ugly as hell (ok, that depends on taste)
> > - It changes migration_iteration_run() instead of directly
> >   migration_thread.
> > - There is already another case where we skip the sending of RAM
> >   (localhost migration with shared memory)
> > 
> > In migration/ram.c:
> > 
> > static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > {
> > PageSearchStatus pss;
> > int pages = 0;
> > bool again, found;
> > 
> > /* No dirty page as there is zero RAM */
> > if (!ram_bytes_total()) {
> > return pages;
> > }
> > 
> > This is the other place where we _don't_ send any RAM at all.
> > 
> > I don't have a great idea about how to make things clear at a higher
> > level, I have to think about this.
> 
> It seems like if we have devices dictating what type of migrations can
> be performed then there probably needs to be a switch to restrict use of
> such devices just as we have the -only-migratable switch now to prevent
> attaching devices that don't support migration.  I'd guess that we need
> the switch to opt-in to allowing such devices to maintain
> compatibility.  There's probably a whole pile of qapi things missing to
> expose this to management tools as well.  Thanks,

This is really intended to be a NOP from where things are now, as if
you use mlx5 live migration without a patch like this then it causes a
botched pre-copy since everything just ends up permanently dirty.

If it makes more sense we could abort the pre-copy too - in the end
there will be dirty tracking so I don't know if I'd invest in a big
adventure to fully define non-dirty tracking migration.

Jason



Re: [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation

2022-05-12 Thread Jason Gunthorpe
On Thu, May 12, 2022 at 03:11:40PM -0600, Alex Williamson wrote:
> On Thu, 12 May 2022 15:25:32 -0300
> Jason Gunthorpe  wrote:
> 
> > On Thu, May 12, 2022 at 11:57:10AM -0600, Alex Williamson wrote:
> > > > @@ -767,9 +767,10 @@ static void vfio_migration_state_notifier(Notifier 
> > > > *notifier, void *data)
> > > >  case MIGRATION_STATUS_CANCELLED:
> > > >  case MIGRATION_STATUS_FAILED:
> > > >  bytes_transferred = 0;
> > > > -ret = vfio_migration_set_state(vbasedev,
> > > > -  ~(VFIO_DEVICE_STATE_SAVING | 
> > > > VFIO_DEVICE_STATE_RESUMING),
> > > > -  VFIO_DEVICE_STATE_RUNNING);
> > > > +ret = vfio_migration_set_state(
> > > > +vbasedev,
> > > > +~(VFIO_DEVICE_STATE_V1_SAVING | 
> > > > VFIO_DEVICE_STATE_V1_RESUMING),
> > > > +VFIO_DEVICE_STATE_V1_RUNNING);  
> > > 
> > > Yikes!  Please follow the line wrapping used elsewhere.  There's no need
> > > to put the first arg on a new line and subsequent wrapped lines should
> > > be indented to match the previous line, or at least to avoid wrapping
> > > itself.  Here we can use something like:  
> > 
> > This is generated by clang-format with one of the qemu styles, it
> > follows the documented guide:
> > 
> >  In case of function, there are several variants:
> > 
> >  - 4 spaces indent from the beginning
> >  - align the secondary lines just after the opening parenthesis of the
> >first
> > 
> > clang-format selected the first option due to its optimization
> > algorithm.
> > 
> > Knowing nothing about qemu, I am confused??
> 
> Maybe someone needs to throw more AI models at clang-format so that it
> considers the more readable option?  QEMU does a lot wrong with style
> imo, and maybe it's technically compliant as written, but I think what
> I proposed is also compliant, as well as more readable and more
> consistent with the existing file.  Thanks,

Let Avihai know about any indenting you don't like and he will fix it.

IIRC clang scores line-breaking an expression as worse than going to
the smaller indent. Personally I would agree with this.
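
For reference, the two compliant layouts being weighed here (reconstructed
for illustration, not the exact hunk from the patch):

    /* Variant 1: 4-space indent from the beginning - what clang-format
     * picked, which keeps the ~(... | ...) expression on a single line */
    ret = vfio_migration_set_state(
        vbasedev,
        ~(VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING),
        VFIO_DEVICE_STATE_V1_RUNNING);

    /* Variant 2: align continuation lines after the opening parenthesis,
     * which here forces the expression itself to be broken to stay within
     * the line-length limit */
    ret = vfio_migration_set_state(vbasedev,
                                   ~(VFIO_DEVICE_STATE_V1_SAVING |
                                     VFIO_DEVICE_STATE_V1_RESUMING),
                                   VFIO_DEVICE_STATE_V1_RUNNING);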

Thanks,
Jason



Re: [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation

2022-05-12 Thread Jason Gunthorpe
On Thu, May 12, 2022 at 11:57:10AM -0600, Alex Williamson wrote:
> > @@ -767,9 +767,10 @@ static void vfio_migration_state_notifier(Notifier 
> > *notifier, void *data)
> >  case MIGRATION_STATUS_CANCELLED:
> >  case MIGRATION_STATUS_FAILED:
> >  bytes_transferred = 0;
> > -ret = vfio_migration_set_state(vbasedev,
> > -  ~(VFIO_DEVICE_STATE_SAVING | 
> > VFIO_DEVICE_STATE_RESUMING),
> > -  VFIO_DEVICE_STATE_RUNNING);
> > +ret = vfio_migration_set_state(
> > +vbasedev,
> > +~(VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING),
> > +VFIO_DEVICE_STATE_V1_RUNNING);
> 
> Yikes!  Please follow the line wrapping used elsewhere.  There's no need
> to put the first arg on a new line and subsequent wrapped lines should
> be indented to match the previous line, or at least to avoid wrapping
> itself.  Here we can use something like:

This is generated by clang-format with one of the qemu styles, it
follows the documented guide:

 In case of function, there are several variants:

 - 4 spaces indent from the beginning
 - align the secondary lines just after the opening parenthesis of the
   first

clang-format selected the first option due to its optimization
algorithm.

Knowing nothing about qemu, I am confused??

Jason



Re: [RFC 00/18] vfio: Adopt iommufd

2022-05-10 Thread Jason Gunthorpe
On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
> Thanks Yi and Eric,
> Then will wait for the updated iommufd kernel for the PCI MMIO region.
> 
> Another question,
> How to get the iommu_domain in the ioctl.

The ID of the iommu_domain (called the hwpt) it should be returned by
the vfio attach ioctl.

Jason



Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-26 Thread Jason Gunthorpe
On Tue, Apr 26, 2022 at 02:59:31PM -0600, Alex Williamson wrote:

> > The best you could do is make a dummy IOAS then attach the device,
> > read the mappings, detach, and then do your unmaps.
> 
> Right, the same thing the kernel does currently.
> 
> > I'm imagining something like IOMMUFD_DEVICE_GET_RANGES that can be
> > called prior to attaching on the device ID.
> 
> Something like /sys/kernel/iommu_groups/$GROUP/reserved_regions?

If we do the above ioctl with iommufd I would want to include the domain
aperture too, but yes.

> > > We must be absolutely certain that there is no DMA to that range
> > > before doing so.  
> > 
> > Yes, but at the same time if the VM thinks it can DMA to that memory
> > then it is quite likely to DMA to it with the new device that doesn't
> > have it mapped in the first place.
> 
> Sorry, this assertion doesn't make sense to me.  We can't assume a
> vIOMMU on x86, so QEMU typically maps the entire VM address space (ie.
> device address space == system memory).  Some of those mappings are
> likely DMA targets (RAM), but only a tiny fraction of the address space
> may actually be used for DMA.  Some of those mappings are exceedingly
> unlikely P2P DMA targets (device memory), so we don't consider mapping
> failures to be fatal to attaching the device.

> If we have a case where a range failed for one device but worked for a
> previous, we're in the latter scenario, because we should have failed
> the device attach otherwise.  Your assertion would require that there
> are existing devices (plural) making use of this mapping and that the
> new device is also likely to make use of this mapping.  I have a hard
> time believing that evidence exists to support that statement.

This is quite normal, we often have multiple NICs and GPUs in the same
system/VM and the expectation is that P2P between the MMIO regions of
all the NICs and all the GPUs will work. Hotplugging in a NIC or GPU
and having it be excluded from P2P maps would be fatal to the VM.

So, while I think it is vanishingly unlikely that a reserved region
conflict would cause a problem, my preference is that this stuff is
deterministic. Either hotplug fails or hotplug configures it to the
same state it would be if the VM was started with this configuration.

Perhaps this just suggests that qemu should be told by the operator
what kind of P2P to export from a device 'never/auto/always' with auto
being today's behavior.
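
A sketch of what such a knob could look like on the vfio-pci device (the
"x-p2p" name and the p2p_export field are made up for illustration),
mapping never/auto/always onto QEMU's usual on/off/auto property type:

    static Property vfio_pci_dev_properties[] = {
        /* ... existing vfio-pci properties ... */
        DEFINE_PROP_ON_OFF_AUTO("x-p2p", VFIOPCIDevice, p2p_export,
                                ON_OFF_AUTO_AUTO),
        DEFINE_PROP_END_OF_LIST(),
    };

An operator would then pass -device vfio-pci,host=...,x-p2p=off to
guarantee the device's BARs are never placed into the IOAS.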

> P2P use cases are sufficiently rare that this hasn't been an issue.  I
> think there's also still a sufficient healthy dose of FUD whether a
> system supports P2P that drivers do some validation before relying on
> it.

I'm not sure what you mean here, the P2P capability discovery is a
complete mess and never did get standardized. Linux has the
expectation that drivers will use pci_p2pdma_distance() before doing
P2P which weeds out only some of the worst non-working cases.

> > This is why I find it bit strange that qemu doesn't check the
> > ranges. eg I would expect that anything declared as memory in the E820
> > map has to be mappable to the iommu_domain or the device should not
> > attach at all.
> 
> You have some interesting assumptions around associating
> MemoryRegionSegments from the device AddressSpace to something like an
> x86 specific E820 table.  

I'm thinking about it from an OS perspective in the VM, not from qemu
internals. OSes do not randomly DMA everywhere; the firmware tables/etc
do make it predictable where DMA will happen.

> > The P2P is a bit trickier, and I know we don't have a good story
> > because we lack ACPI description, but I would have expected the same
> > kind of thing. Anything P2Pable should be in the iommu_domain or the
> > device should not attach. As with system memory there are only certain
> > parts of the E820 map that an OS would use for P2P.
> > 
> > (ideally ACPI would indicate exactly what combinations of devices are
> > P2Pable and then qemu would use that drive the mandatory address
> > ranges in the IOAS)
> 
> How exactly does ACPI indicate that devices can do P2P?  How can we
> rely on ACPI for a problem that's not unique to platforms that
> implement ACPI?

I am trying to say this never did get standardized. It was talked about
when the pci_p2pdma_distance() was merged and I thought some folks
were going to go off and take care of an ACPI query for it to use. It
would be useful here at least.
 
> > > > > yeah. qemu can filter the P2P BAR mapping and just stop it in qemu. We
> > > > > haven't added it as it is something you will add in future. so didn't
> > > > > add it in this RFC. :-) Please let me know if it feels better to 
> > > > > filter
> > > > > it from today.
> > > > 
> > > > I currently hope it will use a different map API entirely and not rely
> > > > on discovering the P2P via the VMA. eg using a DMABUF FD or something.
> > > > 
> > > > So blocking it in qemu feels like the right thing to do.  
> > > 
> > > Wait a sec, so legacy 

Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-26 Thread Jason Gunthorpe
On Tue, Apr 26, 2022 at 12:45:41PM -0600, Alex Williamson wrote:
> On Tue, 26 Apr 2022 11:11:56 -0300
> Jason Gunthorpe  wrote:
> 
> > On Tue, Apr 26, 2022 at 10:08:30PM +0800, Yi Liu wrote:
> > 
> > > > I think it is strange that the allowed DMA a guest can do depends on
> > > > the order how devices are plugged into the guest, and varys from
> > > > device to device?
> > > > 
> > > > IMHO it would be nicer if qemu would be able to read the new reserved
> > > > regions and unmap the conflicts before hot plugging the new device. We
> > > > don't have a kernel API to do this, maybe we should have one?  
> > > 
> > > For userspace drivers, it is fine to do it. For QEMU, it's not quite easy
> > > since the IOVA is GPA which is determined per the e820 table.  
> > 
> > Sure, that is why I said we may need a new API to get this data back
> > so userspace can fix the address map before attempting to attach the
> > new device. Currently that is not possible at all, the device attach
> > fails and userspace has no way to learn what addresses are causing
> > problems.
> 
> We have APIs to get the IOVA ranges, both with legacy vfio and the
> iommufd RFC, QEMU could compare these, but deciding to remove an
> existing mapping is not something to be done lightly. 

Not quite, you can get the IOVA ranges after you attach the device,
but device attach will fail if the new range restrictions intersect
with the existing mappings. So we don't have an easy way to learn the
new range restriction in a way that lets userspace ensure an attach
will not fail due to reserved ranges overlapping with mappings.

The best you could do is make a dummy IOAS then attach the device,
read the mappings, detach, and then do your unmaps.

I'm imagining something like IOMMUFD_DEVICE_GET_RANGES that can be
called prior to attaching on the device ID.

> We must be absolutely certain that there is no DMA to that range
> before doing so.

Yes, but at the same time if the VM thinks it can DMA to that memory
then it is quite likely to DMA to it with the new device that doesn't
have it mapped in the first place.

It is also a bit odd that the behavior depends on the order the
devices are installed: if you plug the narrower device first then
the next device will happily use the narrower ranges, but the
reverse order will get a different result.

This is why I find it bit strange that qemu doesn't check the
ranges. eg I would expect that anything declared as memory in the E820
map has to be mappable to the iommu_domain or the device should not
attach at all.

The P2P is a bit trickier, and I know we don't have a good story
because we lack ACPI description, but I would have expected the same
kind of thing. Anything P2Pable should be in the iommu_domain or the
device should not attach. As with system memory there are only certain
parts of the E820 map that an OS would use for P2P.

(ideally ACPI would indicate exactly what combinations of devices are
P2Pable and then qemu would use that drive the mandatory address
ranges in the IOAS)

> > > yeah. qemu can filter the P2P BAR mapping and just stop it in qemu. We
> > > haven't added it as it is something you will add in future. so didn't
> > > add it in this RFC. :-) Please let me know if it feels better to filter
> > > it from today.  
> > 
> > I currently hope it will use a different map API entirely and not rely
> > on discovering the P2P via the VMA. eg using a DMABUF FD or something.
> > 
> > So blocking it in qemu feels like the right thing to do.
> 
> Wait a sec, so legacy vfio supports p2p between devices, which has at
> least a couple known use cases, primarily involving GPUs for at least
> one of the peers, and we're not going to make equivalent support a
> feature requirement for iommufd?  

I said "different map API" - something like IOMMU_FD_MAP_DMABUF
perhaps.

The trouble with taking in a user pointer to MMIO memory is that it
becomes quite annoying to go from a VMA back to the actual owner
object so we can establish proper refcounting and lifetime of struct-page-less
memory. Requiring userspace to make that connection via a FD
simplifies and generalizes this.

So, qemu would say 'oh this memory is exported by VFIO, I will do
VFIO_EXPORT_DMA_BUF, then do IOMMU_FD_MAP_DMABUF, then close the FD'

For vfio_compat we'd have to build some hacky compat approach to
discover the dmabuf for vfio-pci from the VMA.

But if qemu is going this way with a new implementation I would prefer
the new implementation use the new way, when we decide what it should
be.

As I mentioned before I would like to use DMABUF since I already have
a use-case to expose DMABUF from vfio-pci to connect to RDMA. I will
post the vfio DMABUF patch I have already.

Jason



Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-26 Thread Jason Gunthorpe
On Tue, Apr 26, 2022 at 01:24:35PM -0600, Alex Williamson wrote:
> On Tue, 26 Apr 2022 13:42:17 -0300
> Jason Gunthorpe  wrote:
> 
> > On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> > > We also need to be able to advise libvirt as to how each iommufd object
> > > or user of that object factors into the VM locked memory requirement.
> > > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> > > to set the locked memory limit to the size of VM RAM per iommufd,
> > > regardless of the number of devices using a given iommufd.  However, I
> > > don't know if all users of iommufd will be exclusively mapping VM RAM.
> > > Combinations of devices where some map VM RAM and others map QEMU
> > > buffer space could still require some incremental increase per device
> > > (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> > > will still be involved even after iommufd solves the per-device
> > > vfio-pci locked memory limit issue.  Thanks,  
> > 
> > If the model is to pass the FD, how about we put a limit on the FD
> > itself instead of abusing the locked memory limit?
> > 
> > We could have a no-way-out ioctl that directly limits the # of PFNs
> > covered by iopt_pages inside an iommufd.
> 
> FD passing would likely only be the standard for libvirt invoked VMs.
> The QEMU vfio-pci device would still parse a host= or sysfsdev= option
> when invoked by mortals and associate to use the legacy vfio group
> interface or the new vfio device interface based on whether an iommufd
> is specified.

Yes, but perhaps we don't need resource limits in the mortals case..

> Does that rule out your suggestion?  I don't know, please reveal more
> about the mechanics of putting a limit on the FD itself and this
> no-way-out ioctl.  The latter name suggests to me that I should also
> note that we need to support memory hotplug with these devices.  Thanks,

So libvirt uses CAP_SYS_RESOURCE and prlimit to adjust things in
realtime today?
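
(For context, that mechanism is just prlimit(2): a CAP_SYS_RESOURCE-capable
management process can adjust a running QEMU's RLIMIT_MEMLOCK on the fly -
minimal sketch:)

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/resource.h>
    #include <stdio.h>

    /* Raise a running process's locked-memory limit, as libvirt can do
     * today for a QEMU it manages. */
    static int raise_memlock(pid_t qemu_pid, rlim_t bytes)
    {
        struct rlimit new_limit = { .rlim_cur = bytes, .rlim_max = bytes };

        if (prlimit(qemu_pid, RLIMIT_MEMLOCK, &new_limit, NULL) < 0) {
            perror("prlimit");
            return -1;
        }
        return 0;
    }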

It could still work; instead of being no-way-out, iommufd would have to
check for CAP_SYS_RESOURCE before allowing the limit to be raised.

It is a pretty simple idea, we just attach a resource limit to the FD
and every PFN that gets mapped into the iommufd counts against that
limit, regardless if it is pinned or not. An ioctl on the FD would set
the limit, defaulting to unlimited.

To me this has the appeal that what is being resourced controlled is
strictly defined - address space mapped into an iommufd - which has a
bunch of nice additional consequences like partially bounding the
amount of kernel memory an iommufd can consume and so forth.

Doesn't interact with iouring or rdma however.

Though we could certainly consider allowing RDMA to consume an iommufd
to access pinned pages much like a vfio-mdev would - I'm not sure what
is ideal for the qemu usage of RDMA for migration..

Jason



Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-26 Thread Jason Gunthorpe
On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> We also need to be able to advise libvirt as to how each iommufd object
> or user of that object factors into the VM locked memory requirement.
> When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> to set the locked memory limit to the size of VM RAM per iommufd,
> regardless of the number of devices using a given iommufd.  However, I
> don't know if all users of iommufd will be exclusively mapping VM RAM.
> Combinations of devices where some map VM RAM and others map QEMU
> buffer space could still require some incremental increase per device
> (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> will still be involved even after iommufd solves the per-device
> vfio-pci locked memory limit issue.  Thanks,

If the model is to pass the FD, how about we put a limit on the FD
itself instead of abusing the locked memory limit?

We could have a no-way-out ioctl that directly limits the # of PFNs
covered by iopt_pages inside an iommufd.

Jason



Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-26 Thread Jason Gunthorpe
On Tue, Apr 26, 2022 at 10:08:30PM +0800, Yi Liu wrote:

> > I think it is strange that the allowed DMA a guest can do depends on
> > the order in which devices are plugged into the guest, and varies from
> > device to device?
> > 
> > IMHO it would be nicer if qemu would be able to read the new reserved
> > regions and unmap the conflicts before hot plugging the new device. We
> > don't have a kernel API to do this, maybe we should have one?
> 
> For userspace drivers, it is fine to do it. For QEMU, it's not quite easy
> since the IOVA is GPA which is determined per the e820 table.

Sure, that is why I said we may need a new API to get this data back
so userspace can fix the address map before attempting to attach the
new device. Currently that is not possible at all, the device attach
fails and userspace has no way to learn what addresses are causing
problems.

> > eg currently I see the log messages that it is passing P2P BAR memory
> > into iommufd map, this should be prevented inside qemu because it is
> > not reliable right now if iommufd will correctly reject it.
> 
> yeah. qemu can filter the P2P BAR mapping and just stop it in qemu. We
> haven't added it as it is something you will add in future. so didn't
> add it in this RFC. :-) Please let me know if it feels better to filter
> it from today.

I currently hope it will use a different map API entirely and not rely
on discovering the P2P via the VMA. eg using a DMABUF FD or something.

So blocking it in qemu feels like the right thing to do.

Jason



Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-26 Thread Jason Gunthorpe
On Tue, Apr 26, 2022 at 05:55:29PM +0800, Yi Liu wrote:
> > I also suggest falling back to using "/dev/char/%u:%u" if the above
> > does not exist which prevents "vfio/devices/vfio" from turning into
> > ABI.
> 
> do you mean there is no matched file under /dev/vfio/devices/? Is this
> possible?

The layout of /dev/ depends on udev rules, so it is possible. I only
suggested it to avoid creating ABI here.

> > It would be a good idea to make a general open_cdev function that does
> > all this work once the sysfs is found and cdev read out of it, all the
> > other vfio places can use it too.
> 
> hmmm, it's good to have a general open_cdev() function. But I guess this
> is the only place in VFIO to open the device cdev. Do you mean the vdpa
> stuff?

Any place that starts from a sysfs name would be interested - I don't
know what vdpa does

Jason



Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-26 Thread Jason Gunthorpe
On Tue, Apr 26, 2022 at 10:41:01AM +, Tian, Kevin wrote:

> That's one case of incompatibility, but the IOMMU attach group callback
> can fail in a variety of ways.  One that we've seen that is not
> uncommon is that we might have an mdev container with various  mappings  
> to other devices.  None of those mappings are validated until the mdev
> driver tries to pin something, where it's generally unlikely that
> they'd pin those particular mappings.  Then QEMU hot-adds a regular
> IOMMU backed device, we allocate a domain for the device and replay the
> mappings from the container, but now they get validated and potentially
> fail.  The kernel returns a failure for the SET_IOMMU ioctl, QEMU
> creates a new container and fills it from the same AddressSpace, where
> now QEMU can determine which mappings can be safely skipped.  

I think it is strange that the allowed DMA a guest can do depends on
the order in which devices are plugged into the guest, and varies from
device to device?

IMHO it would be nicer if qemu would be able to read the new reserved
regions and unmap the conflicts before hot plugging the new device. We
don't have a kernel API to do this, maybe we should have one?

> A: 
> QEMU sets up a MemoryListener for the device AddressSpace and attempts
> to map anything that triggers that listener, which includes not only VM
> RAM which is our primary mapping goal, but also miscellaneous devices,
> unaligned regions, and other device regions, ex. BARs.  Some of these
> we filter out in QEMU with broad generalizations that unaligned ranges
> aren't anything we can deal with, but other device regions covers
> anything that's mmap'd in QEMU, ie. it has an associated KVM memory
> slot.  IIRC, in the case I'm thinking of, the mapping that triggered
> the replay failure was the BAR for an mdev device.  No attempt was made
> to use gup or PFNMAP to resolve the mapping when only the mdev device
> was present and the mdev host driver didn't attempt to pin pages within
> its own BAR, but neither of these methods worked for the replay (I
> don't recall further specifics). 

This feels sort of like a bug in iommufd, or perhaps qemu..

With iommufd only normal GUP'able memory should be passed to
map. Special memory will have to go through some other API. This is
different from vfio containers.

We could possibly check the VMAs in iommufd during map to enforce
normal memory.. However I'm also a bit surprised that qemu can't ID
the underlying memory source and avoid this?

eg currently I see the log messages that it is passing P2P BAR memory
into iommufd map, this should be prevented inside qemu because it is
not reliable right now if iommufd will correctly reject it.
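
Something along these lines in the listener would be enough - a sketch
using QEMU's memory API, with an illustrative helper name:

    static bool iommufd_mappable_section(MemoryRegionSection *section)
    {
        MemoryRegion *mr = section->mr;

        /* Plain guest RAM only: skip MMIO/BAR regions ("ram_device") and
         * ROM devices, which are not GUP-able host memory */
        return memory_region_is_ram(mr) &&
               !memory_region_is_ram_device(mr) &&
               !memory_region_is_romd(mr);
    }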

IMHO multi-container should be avoided because it does force creating
multiple iommu_domains which does have a memory/performance cost.

Though, it is not so important that it is urgent (and copy makes it
work better anyhow), qemu can stay as it is.

Jason



Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-26 Thread Jason Gunthorpe
On Tue, Apr 26, 2022 at 08:37:41AM +, Tian, Kevin wrote:

> Based on current plan there is probably a transition window between the
> point where the first vfio device type (vfio-pci) gaining iommufd support
> and the point where all vfio types supporting iommufd. 

I am still hoping to do all in one shot, lets see :) 

Jason



Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-25 Thread Jason Gunthorpe
On Mon, Apr 25, 2022 at 11:10:14AM +0100, Daniel P. Berrangé wrote:

> > However, with iommufd there's no reason that QEMU ever needs more than
> > a single instance of /dev/iommufd and we're using per device vfio file
> > descriptors, so it seems like a good time to revisit this.
> 
> I assume access to '/dev/iommufd' gives the process somewhat elevated
> privileges, such that you don't want to unconditionally give QEMU
> access to this device ?

It doesn't give much; at worst it allows userspace to allocate kernel
memory and pin pages, which can already be done through all sorts of
other interfaces qemu already has access to..

Jason



Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-22 Thread Jason Gunthorpe
On Thu, Apr 14, 2022 at 03:47:07AM -0700, Yi Liu wrote:

> +static int vfio_get_devicefd(const char *sysfs_path, Error **errp)
> +{
> +long int vfio_id = -1, ret = -ENOTTY;
> +char *path, *tmp = NULL;
> +DIR *dir;
> +struct dirent *dent;
> +struct stat st;
> +gchar *contents;
> +gsize length;
> +int major, minor;
> +dev_t vfio_devt;
> +
> +path = g_strdup_printf("%s/vfio-device", sysfs_path);
> +if (stat(path, ) < 0) {
> +error_setg_errno(errp, errno, "no such host device");
> +goto out;
> +}
> +
> +dir = opendir(path);
> +if (!dir) {
> +error_setg_errno(errp, errno, "couldn't open dirrectory %s", path);
> +goto out;
> +}
> +
> +while ((dent = readdir(dir))) {
> +const char *end_name;
> +
> +if (!strncmp(dent->d_name, "vfio", 4)) {
> +ret = qemu_strtol(dent->d_name + 4, _name, 10, _id);
> +if (ret) {
> +error_setg(errp, "suspicious vfio* file in %s", path);
> +goto out;
> +}

Userspace shouldn't explode if there are different files here down the
road. Just search for the first match of vfio\d+ and there is no need
to parse out the vfio_id from the string. Only fail if no match is
found.

> +tmp = g_strdup_printf("/dev/vfio/devices/vfio%ld", vfio_id);
> +if (stat(tmp, ) < 0) {
> +error_setg_errno(errp, errno, "no such vfio device");
> +goto out;
> +}

And simply pass the string directly here, no need to parse out
vfio_id.

I also suggest falling back to using "/dev/char/%u:%u" if the above
does not exist which prevents "vfio/devices/vfio" from turning into
ABI.

It would be a good idea to make a general open_cdev function that does
all this work once the sysfs is found and cdev read out of it, all the
other vfio places can use it too.
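
Put together, the suggested helper could look roughly like this (a sketch
following the review comments above, not the posted patch; the
"vfio-device" sysfs directory name is taken from the RFC):

    static int vfio_open_cdev(const char *sysfs_path)
    {
        g_autofree char *dirpath = g_strdup_printf("%s/vfio-device", sysfs_path);
        DIR *dir = opendir(dirpath);
        struct dirent *dent;
        int fd = -1;

        if (!dir) {
            return -errno;
        }

        while ((dent = readdir(dir))) {
            /* take the first vfio* entry; no need to parse an index out of it */
            if (strncmp(dent->d_name, "vfio", 4)) {
                continue;
            }

            g_autofree char *devpath =
                g_strdup_printf("/dev/vfio/devices/%s", dent->d_name);
            fd = open(devpath, O_RDWR);
            if (fd < 0) {
                /* fall back to the stable /dev/char/major:minor name so the
                 * /dev/vfio/devices/ layout never becomes ABI */
                g_autofree char *sys_dev =
                    g_strdup_printf("%s/%s/dev", dirpath, dent->d_name);
                g_autofree char *contents = NULL;
                unsigned int maj, min;

                if (g_file_get_contents(sys_dev, &contents, NULL, NULL) &&
                    sscanf(contents, "%u:%u", &maj, &min) == 2) {
                    g_autofree char *chrpath =
                        g_strdup_printf("/dev/char/%u:%u", maj, min);
                    fd = open(chrpath, O_RDWR);
                }
            }
            break;
        }
        closedir(dir);
        return fd;
    }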

> +static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as,
> + Error **errp)
> +{
> +VFIOContainer *bcontainer;
> +VFIOIOMMUFDContainer *container;
> +VFIOAddressSpace *space;
> +struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
> +int ret, devfd, iommufd;
> +uint32_t ioas_id;
> +Error *err = NULL;
> +
> +devfd = vfio_get_devicefd(vbasedev->sysfsdev, errp);
> +if (devfd < 0) {
> +return devfd;
> +}
> +vbasedev->fd = devfd;
> +
> +space = vfio_get_address_space(as);
> +
> +/* try to attach to an existing container in this space */
> +QLIST_FOREACH(bcontainer, >containers, next) {
> +if (!object_dynamic_cast(OBJECT(bcontainer),
> + TYPE_VFIO_IOMMUFD_CONTAINER)) {
> +continue;
> +}
> +container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
> +if (vfio_device_attach_container(vbasedev, container, )) {
> +const char *msg = error_get_pretty(err);
> +
> +trace_vfio_iommufd_fail_attach_existing_container(msg);
> +error_free(err);
> +err = NULL;
> +} else {
> +ret = vfio_ram_block_discard_disable(true);
> +if (ret) {
> +vfio_device_detach_container(vbasedev, container, );
> +error_propagate(errp, err);
> +vfio_put_address_space(space);
> +close(vbasedev->fd);
> +error_prepend(errp,
> +  "Cannot set discarding of RAM broken (%d)", 
> ret);
> +return ret;
> +}
> +goto out;
> +}
> +}

?? this logic shouldn't be necessary, a single ioas always supports
all devices, userspace should never need to juggle multiple ioas's
unless it wants to have different address maps.

Something I would like to see confirmed here in qemu is that qemu can
track the hw pagetable id for each device it binds because we will
need that later to do dirty tracking and other things.

> +/*
> + * TODO: for now iommufd BE is on par with vfio iommu type1, so it's
> + * fine to add the whole range as window. For SPAPR, below code
> + * should be updated.
> + */
> +vfio_host_win_add(bcontainer, 0, (hwaddr)-1, 4096);

? Not sure what this is, but I don't expect any changes for SPAPR -
someday IOMMU_IOAS_IOVA_RANGES should be able to accurately report its
configuration.

I don't see IOMMU_IOAS_IOVA_RANGES called at all, that seems like a
problem..

(and note that IOVA_RANGES changes with every device attached to the IOAS)
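
For completeness, the query itself is small - a sketch assuming the uAPI
shape that eventually landed in <linux/iommufd.h>; the struct and field
names may differ in this RFC:

    #include <sys/ioctl.h>
    #include <stdio.h>
    #include <errno.h>
    #include <linux/iommufd.h>

    static int ioas_print_iova_ranges(int iommufd, uint32_t ioas_id)
    {
        struct iommu_iova_range ranges[32];
        struct iommu_ioas_iova_ranges cmd = {
            .size = sizeof(cmd),
            .ioas_id = ioas_id,
            .num_iovas = sizeof(ranges) / sizeof(ranges[0]),
            .allowed_iovas = (uintptr_t)ranges,
        };

        if (ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, &cmd)) {
            return -errno;
        }
        for (uint32_t i = 0; i < cmd.num_iovas; i++) {
            printf("usable iova [0x%llx, 0x%llx]\n",
                   (unsigned long long)ranges[i].start,
                   (unsigned long long)ranges[i].last);
        }
        return 0;
    }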

Jason



Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

2022-04-13 Thread Jason Gunthorpe
On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote:
> On 12.04.22 16:36, Jason Gunthorpe wrote:
> > On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
> > 
> >> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
> >> past already with secretmem, it's not 100% that good of a fit (unmovable
> >> is worse than mlocked). But it gets the job done for now at least.
> > 
> > No, it doesn't. There are too many different interpretations of how
> > MEMLOCK is supposed to work
> > 
> > eg VFIO accounts per-process so hostile users can just fork to go past
> > it.
> > 
> > RDMA is per-process but uses a different counter, so you can double up
> > 
> > iouring is per-user and uses a 3rd counter, so it can triple up on
> > the above two
> 
> Thanks for that summary, very helpful.

I kicked off a big discussion when I suggested to change vfio to use
the same as io_uring

We may still end up trying it, but the major concern is that libvirt
sets the RLIMIT_MEMLOCK and if we touch anything here - including
fixing RDMA, or anything really, it becomes a uAPI break for libvirt..

> >> So I'm open for alternatives to limit the amount of unmovable memory we
> >> might allocate for user space, and then we could convert secretmem as well.
> > 
> > I think it has to be cgroup based considering where we are now :\
> 
> Most probably. I think the important lessons we learned are that
> 
> * mlocked != unmovable.
> * RLIMIT_MEMLOCK should most probably never have been abused for
>   unmovable memory (especially, long-term pinning)

The trouble is I'm not sure how anything can correctly/meaningfully
set a limit.

Consider qemu where we might have 3 different things all pinning the
same page (rdma, iouring, vfio) - should the cgroup give 3x the limit?
What use is that really?

IMHO there are only two meaningful scenarios - either you are unpriv
and limited to a very small number for your user/cgroup - or you are
priv and you can do whatever you want.

The idea we can fine tune this to exactly the right amount for a
workload does not seem realistic and ends up exporting internal kernel
decisions into a uAPI..

Jason



Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

2022-04-12 Thread Jason Gunthorpe
On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:

> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
> past already with secretmem, it's not 100% that good of a fit (unmovable
> is worse than mlocked). But it gets the job done for now at least.

No, it doesn't. There are too many different interpretations of how
MEMLOCK is supposed to work

eg VFIO accounts per-process so hostile users can just fork to go past
it.

RDMA is per-process but uses a different counter, so you can double up

iouring is per-user and uses a 3rd counter, so it can triple up on
the above two

> So I'm open for alternatives to limit the amount of unmovable memory we
> might allocate for user space, and then we could convert secretmem as well.

I think it has to be cgroup based considering where we are now :\

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-23 Thread Jason Gunthorpe
On Tue, Nov 23, 2021 at 10:06:02AM +0100, Paolo Bonzini wrote:

> I think it's great that memfd hooks are usable by more than one subsystem,
> OTOH it's fair that whoever needs it does the work---and VFIO does not need
> it for confidential VMs, yet, so it should be fine for now to have a single
> user.

I think adding a new interface to a core kernel subsystem should come
with a greater requirement to work out something generally useful and
not be overly wedded to a single use case (eg F_SEAL_GUEST)

Especially if something like 'single user' is not just a small
implementation artifact but a key design tenet of the whole eventual
solution.

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread Jason Gunthorpe
On Mon, Nov 22, 2021 at 03:57:17PM +0100, David Hildenbrand wrote:
> On 22.11.21 15:01, Jason Gunthorpe wrote:
> > On Mon, Nov 22, 2021 at 02:35:49PM +0100, David Hildenbrand wrote:
> >> On 22.11.21 14:31, Jason Gunthorpe wrote:
> >>> On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
> >>>
> >>>> I do wonder if we want to support sharing such memfds between processes
> >>>> in all cases ... we most certainly don't want to be able to share
> >>>> encrypted memory between VMs (I heard that the kernel has to forbid
> >>>> that). It would make sense in the use case you describe, though.
> >>>
> >>> If there is a F_SEAL_XX that blocks every kind of new access, who
> >>> cares if userspace passes the FD around or not?
> >> I was imagining that you actually would want to do some kind of "change
> >> ownership". But yeah, the intended semantics and all use cases we have
> >> in mind are not fully clear to me yet. If it's really "no new access"
> >> (side note: is "access" the right word?) then sure, we can pass the fd
> >> around.
> > 
> > What is "ownership" in a world with kvm and iommu are reading pages
> > out of the same fd?
> 
> In the world of encrypted memory / TDX, KVM somewhat "owns" that memory
> IMHO (for example, only it can migrate or swap out these pages; it
> might be debatable if the TDX module or KVM actually "own" these pages).

Sounds like it is a swap provider more than an owner?

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread Jason Gunthorpe
On Mon, Nov 22, 2021 at 02:35:49PM +0100, David Hildenbrand wrote:
> On 22.11.21 14:31, Jason Gunthorpe wrote:
> > On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
> > 
> >> I do wonder if we want to support sharing such memfds between processes
> >> in all cases ... we most certainly don't want to be able to share
> >> encrypted memory between VMs (I heard that the kernel has to forbid
> >> that). It would make sense in the use case you describe, though.
> > 
> > If there is a F_SEAL_XX that blocks every kind of new access, who
> > cares if userspace passes the FD around or not?
> I was imagining that you actually would want to do some kind of "change
> ownership". But yeah, the intended semantics and all use cases we have
> in mind are not fully clear to me yet. If it's really "no new access"
> (side note: is "access" the right word?) then sure, we can pass the fd
> around.

What is "ownership" in a world with kvm and iommu are reading pages
out of the same fd?

"no new access" makes sense to me, we have access through
read/write/mmap/splice/etc and access to pages through the private in
kernel interface (kvm, iommu)

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread Jason Gunthorpe
On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:

> I do wonder if we want to support sharing such memfds between processes
> in all cases ... we most certainly don't want to be able to share
> encrypted memory between VMs (I heard that the kernel has to forbid
> that). It would make sense in the use case you describe, though.

If there is a F_SEAL_XX that blocks every kind of new access, who
cares if userspace passes the FD around or not?

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-20 Thread Jason Gunthorpe
On Sat, Nov 20, 2021 at 01:23:16AM +, Sean Christopherson wrote:
> On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > On Fri, Nov 19, 2021 at 10:21:39PM +, Sean Christopherson wrote:
> > > On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > > > On Fri, Nov 19, 2021 at 07:18:00PM +, Sean Christopherson wrote:
> > > > > No ideas for the kernel API, but that's also less concerning since
> > > > > it's not set in stone.  I'm also not sure that dedicated APIs for
> > > > > each high-ish level use case would be a bad thing, as the semantics
> > > > > are unlikely to be different to some extent.  E.g. for the KVM use
> > > > > case, there can be at most one guest associated with the fd, but
> > > > > there can be any number of VFIO devices attached to the fd.
> > > > 
> > > > Even the kvm thing is not a hard restriction when you take away
> > > > confidential compute.
> > > > 
> > > > Why can't we have multiple KVMs linked to the same FD if the memory
> > > > isn't encrypted? Sure it isn't actually useful but it should work
> > > > fine.
> > > 
> > > Hmm, true, but I want the KVM semantics to be 1:1 even if memory
> > > isn't encrypted.
> > 
> > That is policy and it doesn't belong hardwired into the kernel.
> 
> Agreed.  I had a blurb typed up about that policy just being an "exclusive" 
> flag
> in the kernel API that KVM would set when creating a confidential
> VM,

I still think that is policy in the kernel, what is wrong with
userspace doing it?

> > Your explanation makes me think that the F_SEAL_XX isn't defined
> > properly. It should be a userspace trap door to prevent any new
> > external accesses, including establishing new kvms, iommu's, rdmas,
> > mmaps, read/write, etc.
> 
> Hmm, the way I was thinking of it is that it the F_SEAL_XX itself would 
> prevent
> mapping/accessing it from userspace, and that any policy beyond that would be
> done via kernel APIs and thus handled by whatever in-kernel agent can access 
> the
> memory.  E.g. in the confidential VM case, without support for trusted 
> devices,
> KVM would require that it be the sole owner of the file.

And how would kvm know if there is support for trusted devices?
Again seems like policy choices that should be left in userspace.

Especially for what could be a general in-kernel mechanism with many
users and not tightly linked to KVM as imagined here.

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread Jason Gunthorpe
On Fri, Nov 19, 2021 at 10:21:39PM +, Sean Christopherson wrote:
> On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > On Fri, Nov 19, 2021 at 07:18:00PM +, Sean Christopherson wrote:
> > > No ideas for the kernel API, but that's also less concerning since
> > > it's not set in stone.  I'm also not sure that dedicated APIs for
> > > each high-ish level use case would be a bad thing, as the semantics
> > > are unlikely to be different to some extent.  E.g. for the KVM use
> > > case, there can be at most one guest associated with the fd, but
> > > there can be any number of VFIO devices attached to the fd.
> > 
> > Even the kvm thing is not a hard restriction when you take away
> > confidential compute.
> > 
> > Why can't we have multiple KVMs linked to the same FD if the memory
> > isn't encrypted? Sure it isn't actually useful but it should work
> > fine.
> 
> Hmm, true, but I want the KVM semantics to be 1:1 even if memory
> isn't encrypted.

That is policy and it doesn't belong hardwired into the kernel.

Your explanation makes me think that the F_SEAL_XX isn't defined
properly. It should be a userspace trap door to prevent any new
external accesses, including establishing new kvms, iommu's, rdmas,
mmaps, read/write, etc.
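
To make the trap-door idea concrete with today's memfd seals (the
guest/inaccessible seal itself is still hypothetical at this point): once a
seal is added it can never be removed, so passing the fd around cannot
re-enable a blocked kind of access.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int make_sealed_guest_memfd(size_t size)
    {
        int fd = memfd_create("guest-ram", MFD_ALLOW_SEALING);

        if (fd < 0 || ftruncate(fd, size) < 0) {
            return -1;
        }
        /* Existing seals: block resizing and any future write() access.
         * A hypothetical F_SEAL_GUEST / "inaccessible" seal would go further
         * and block *all* new userspace access (mmap, read, splice, ...). */
        if (fcntl(fd, F_ADD_SEALS,
                  F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_WRITE | F_SEAL_SEAL) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }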

> It's not just avoiding the linked list, there's a trust element as
> well.  E.g. in the scenario where a device can access a confidential
> VM's encrypted private memory, the guest is still the "owner" of the
> memory and needs to explicitly grant access to a third party,
> e.g. the device or perhaps another VM.

Authorization is some other issue - the internal kAPI should be able
to indicate it is secured memory and the API user should do whatever
dance to gain access to it. Eg for VFIO ask the realm manager to
associate the pci_device with the owner realm.

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread Jason Gunthorpe
On Fri, Nov 19, 2021 at 07:18:00PM +, Sean Christopherson wrote:
> On Fri, Nov 19, 2021, David Hildenbrand wrote:
> > On 19.11.21 16:19, Jason Gunthorpe wrote:
> > > As designed the above looks useful to import a memfd to a VFIO
> > > container but could you consider some more generic naming than calling
> > > this 'guest' ?
> > 
> > +1 the guest terminology is somewhat sob-optimal.
> 
> For the F_SEAL part, maybe F_SEAL_UNMAPPABLE?

Perhaps INACCESSIBLE?

> No ideas for the kernel API, but that's also less concerning since
> it's not set in stone.  I'm also not sure that dedicated APIs for
> each high-ish level use case would be a bad thing, as the semantics
> are unlikely to be different to some extent.  E.g. for the KVM use
> case, there can be at most one guest associated with the fd, but
> there can be any number of VFIO devices attached to the fd.

Even the kvm thing is not a hard restriction when you take away
confidential compute.

Why can't we have multiple KVMs linked to the same FD if the memory
isn't encrypted? Sure it isn't actually useful but it should work
fine.

Supporting only one thing is just a way to avoid having a linked list
of clients to broadcast invalidations to - for instance by using a
standard notifier block...
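
A rough kernel-side sketch of what that would mean - the names are
illustrative, not an existing API: each consumer (KVM, an IOMMU driver,
...) registers a notifier_block and receives invalidations, so nothing
structurally limits the fd to a single user.

    #include <linux/notifier.h>

    struct guest_mem_range {            /* illustrative */
        unsigned long start, end;
    };

    static BLOCKING_NOTIFIER_HEAD(guest_mem_notifiers);

    int guest_mem_register_client(struct notifier_block *nb)
    {
        return blocking_notifier_chain_register(&guest_mem_notifiers, nb);
    }

    void guest_mem_invalidate_range(unsigned long start, unsigned long end)
    {
        struct guest_mem_range r = { .start = start, .end = end };

        blocking_notifier_call_chain(&guest_mem_notifiers, 0, &r);
    }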

Also, how does dirty tracking work on this memory?

Jason



  1   2   >