Re: [PATCH v19 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-22 Thread Alex Williamson
On Tue, 20 Feb 2024 17:20:52 +0530
 wrote:

> From: Ankit Agrawal 
> 
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
> 
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the
> chip-to-chip interconnect that is exposed as a contiguous physically
> addressable region on the host. Since the device memory is cache
> coherent with the CPU, it can be mmapped into the user VMA with a
> cacheable mapping and used like regular RAM. The device memory is
> not added to the host kernel, but mapped directly as this reduces
> memory wastage due to struct pages.
> 
> There is also a requirement of a minimum reserved 1G uncached region
> (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1].
> This is to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets
> the Qemu VMA page properties (pgprot) as NORMAL_NC.
> 
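For illustration, a minimal sketch of the write-combined mapping pattern described
above (this is not code from the series; the function and parameter names are
hypothetical placeholders for the carved-out range):

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Sketch only: hand a physical carve-out (resmem) to userspace with a
 * write-combined mapping, i.e. NORMAL_NC on arm64, instead of the
 * cacheable mapping used for the rest of the device memory.
 */
static int resmem_sketch_mmap(struct vm_area_struct *vma,
			      phys_addr_t resmem_physaddr, size_t resmem_len)
{
	unsigned long req_len = vma->vm_end - vma->vm_start;

	if (req_len > resmem_len)
		return -EINVAL;

	/* pgprot_writecombine() yields NORMAL_NC page properties on arm64 */
	vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

	return remap_pfn_range(vma, vma->vm_start,
			       resmem_physaddr >> PAGE_SHIFT,
			       req_len, vma->vm_page_prot);
}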
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
> 
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem) as
> separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory region.
> The decision for a different approach was taken in view of the fact that
> it would necessitate additional code in Qemu to discover and insert those
> regions in the VM IPA, along with the additional VM ACPI DSDT changes to
> communicate the device memory region IPA to the VM workloads. Moreover,
> this behavior would have to be added to a variety of emulators (beyond
> top of tree Qemu) out there desiring grace hopper support.
> 
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved-out region to the next available PCI BAR (i.e.
> comprising regions 2 and 3). The cached device memory aperture is
> assigned BAR region 4 and 5. Qemu will then naturally generate a PCI
> device in the VM with the uncached aperture reported as BAR2 region,
> the cacheable as BAR4. The variant driver provides emulation for these
> fake BARs' PCI config space offset registers.
> 
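As a rough sketch of what emulating such a fake 64-bit BAR offset register can
look like (not taken from the patch; the shadow-value scheme and the prefetchable
flag are assumptions made here purely for illustration):

#include <linux/pci_regs.h>
#include <linux/types.h>

/*
 * Sketch only: a fake memory BAR is backed by a shadow value written by
 * the guest; config space reads return that value masked to the
 * power-of-2 BAR size, plus the 64-bit memory type bits.
 */
static u64 fake_bar_sketch_read(u64 shadow_val, u64 bar_size)
{
	u64 addr = shadow_val & ~(bar_size - 1);

	return addr | PCI_BASE_ADDRESS_MEM_TYPE_64 |
	       PCI_BASE_ADDRESS_MEM_PREFETCH;
}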
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 reads
> and dropped writes on such access. So there is no need to support the
> disablement/enablement of BAR through PCI_COMMAND config space register.
> 
> The memory layout on the host looks like the following:
>                   devmem (memlength)
> |-----------------------------------------------|
> |-----------------cached-----------------|--NC--|
> |                                        |
> usemem.memphys                           resmem.memphys
> 
> PCI BARs need to be aligned to a power of 2, but the actual memory on the
> device may not be. A read or write access to the physical address from the
> last device PFN up to the next power-of-2 aligned physical address
> results in reading ~0 and dropped writes. Note that the GPU device
> driver [6] is capable of knowing the exact device memory size through
> separate means. The device memory size is primarily kept in the system
> ACPI tables for use by the VFIO PCI variant module.
> 
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size memblock
> (MEMBLK_SIZE) aligned. This is a hardwired ABI value between the GPU FW and
> VFIO driver. The VM device driver makes use of the same value for its
> calculation to determine the USEMEM size.
> 
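A sketch of the two size adjustments described above, with hypothetical names
(whether the usable size is rounded up or down to the memblock boundary is an
assumption here, not something stated in this cover letter):

#include <linux/align.h>
#include <linux/log2.h>
#include <linux/types.h>

/*
 * Sketch only: the usable (usemem) size is aligned to the memblock
 * granularity expected by the VM driver, and the size reported through
 * the fake BAR is then rounded up to a power of two.  Reads in the gap
 * between the real size and the BAR size return ~0, writes are dropped.
 */
static u64 usemem_bar_size_sketch(u64 memlength, u64 memblk_size)
{
	u64 usable = ALIGN(memlength, memblk_size);	/* rounding up assumed */

	return roundup_pow_of_two(usable);
}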
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
> As previously mentioned, resmem is mapped pgprot_writecombine(), that
> sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
> proposed changes in [3] and [4], KVM marks the region with
> MemAttr[2:0]=0b101 in S2.
> 
> If the device memory properties are not present, the driver registers the
> vfio-pci-core function pointers. Since there are no ACPI memory properties
> generated for the VM, the 

Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Alex Williamson
On Fri, 9 Feb 2024 13:19:03 -0400
Jason Gunthorpe  wrote:

> On Fri, Feb 09, 2024 at 08:55:31AM -0700, Alex Williamson wrote:
> > I think Kevin's point is also relative to this latter scenario, in the
> > L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
> > cachable, but in the L2 instance of the driver where we only use the
> > vfio-pci-core ops nothing maintains that cachable mapping.  Is that a
> > problem?  An uncached mapping on top of a cachable mapping is often
> > prone to problems.
> 
> On these CPUs the ARM architecture won't permit it, the L0 level
> blocks uncachable using FWB and page table attributes. The VM, no
> matter what it does, cannot make the cachable memory uncachable.

Great, thanks,

Alex




Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Alex Williamson
On Fri, 9 Feb 2024 09:20:22 +
Ankit Agrawal  wrote:

> Thanks Kevin for the review. Comments inline.
> 
> >>
> >> Note that the usemem memory is added by the VM Nvidia device driver [5]
> >> to the VM kernel as memblocks. Hence make the usable memory size
> >> memblock
> >> aligned.  
> >
> > Is memblock size defined in spec or purely a guest implementation choice?  
> 
> The MEMBLOCK value is a hardwired, constant ABI value between the GPU
> FW and VFIO driver.
> 
> >>
> >> If the bare metal properties are not present, the driver registers the
> >> vfio-pci-core function pointers.  
> >
> > so if qemu doesn't generate such property the variant driver running
> > inside guest will always go to use core functions and guest vfio userspace
> > will observe both resmem and usemem bars. But then there is nothing
> > in field to prohibit mapping resmem bar as cacheable.
> >
> > should this driver check the presence of either ACPI property or
> > resmem/usemem bars to enable variant function pointers?  
> 
> Maybe I am missing something here; but if the ACPI property is absent,
> the real physical BARs present on the device will be exposed by the
> vfio-pci-core functions to the VM. So I think if the variant driver is run
> within the VM, it should not see the fake usemem and resmem BARs.

There are two possibilities here, either we're assigning the pure
physical device from a host that does not have the ACPI properties or
we're performing a nested assignment.  In the former case we're simply
passing along the unmodified physical BARs.  In the latter case we're
actually passing through the fake BARs, the virtualization of the
device has already happened in the level 1 assignment.

I think Kevin's point is also relative to this latter scenario, in the
L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
cachable, but in the L2 instance of the driver where we only use the
vfio-pci-core ops nothing maintains that cachable mapping.  Is that a
problem?  An uncached mapping on top of a cachable mapping is often
prone to problems.  Thanks,

Alex




Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-08 Thread Alex Williamson
On Thu, 8 Feb 2024 07:21:40 +
"Tian, Kevin"  wrote:

> > From: Ankit Agrawal 
> > Sent: Thursday, February 8, 2024 3:13 PM  
> > >> > +    * Determine how many bytes to be actually read from the
> > >> > device memory.
> > >> > +    * Read request beyond the actual device memory size is
> > >> > filled with ~0,
> > >> > +    * while those beyond the actual reported size is skipped.
> > >> > +    */
> > >> > +   if (offset >= memregion->memlength)
> > >> > +   mem_count = 0;  
> > >>
> > >> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
> > >> necessary.  
> > >
> > > Harmless, other than the possibly unnecessary call through to
> > > nvgrace_gpu_map_device_mem().  Maybe both  
> > nvgrace_gpu_map_and_read()  
> > > and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> > > first operation when !mem_count.  Thanks,
> > >
> > >Alex  
> > 
> > IMO, this seems like adding too much code to reduce the call length for a
> > very specific case. If there aren't any strong opinions on this, I'm 
> > planning to
> > leave this code as it is.  
> 
> a slight difference: if mem_count==0 the result should always succeed
> no matter whether nvgrace_gpu_map_device_mem() succeeds or not. Of course,
> if it fails it's already a big problem and probably nobody cares about the subtle
> difference when reading a non-existent range.
> 
> but regarding readability it's still clearer:
> 
> if (mem_count)
>   nvgrace_gpu_map_and_read();
> 

The below has better flow imo vs conditionalizing the call to
map_and_read/write and subsequent error handling, but I don't think
either adds too much code.  Thanks,

Alex

--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -429,6 +429,9 @@ nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
int ret;
 
+   if (!mem_count)
+   return 0;
+
/*
 * Handle read on the BAR regions. Map to the target device memory
 * physical address and copy to the request read buffer.
@@ -547,6 +550,9 @@ nvgrace_gpu_map_and_write(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
int ret;
 
+   if (!mem_count)
+   return 0;
+
ret = nvgrace_gpu_map_device_mem(index, nvdev);
if (ret)
return ret;




Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Alex Williamson
On Tue, 6 Feb 2024 04:31:23 +0530
 wrote:

> From: Ankit Agrawal 
> 
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
> 
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the chip-to-chip
> interconnect that is exposed as a contiguous physically addressable
> region on the host. This device memory aperture can be obtained from host
> ACPI table using device_property_read_u64(), according to the FW
> specification. Since the device memory is cache coherent with the CPU,
> it can be mmapped into the user VMA with a cacheable mapping using
> remap_pfn_range() and used like regular RAM. The device memory
> is not added to the host kernel, but mapped directly as this reduces
> memory wastage due to struct pages.
> 
> There is also a requirement of a reserved 1G uncached region (termed as
> resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is
> to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> Qemu VMA page properties (pgprot) as NORMAL_NC.
> 
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
> 
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem) as
> separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory region.
> The decision for a different approach was taken in view of the fact that
> it would necessitate additional code in Qemu to discover and insert those
> regions in the VM IPA, along with the additional VM ACPI DSDT changes to
> communicate the device memory region IPA to the VM workloads. Moreover,
> this behavior would have to be added to a variety of emulators (beyond
> top of tree Qemu) out there desiring grace hopper support.
> 
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved-out region to the next available PCI BAR (i.e.
> comprising regions 2 and 3). The cached device memory aperture is
> assigned BAR region 4 and 5. Qemu will then naturally generate a PCI
> device in the VM with the uncached aperture reported as BAR2 region,
> the cacheable as BAR4. The variant driver provides emulation for these
> fake BARs' PCI config space offset registers.
> 
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 reads
> and dropped writes on such access. So there is no need to support the
> disablement/enablement of BAR through PCI_COMMAND config space register.
> 
> The memory layout on the host looks like the following:
>                   devmem (memlength)
> |-----------------------------------------------|
> |-----------------cached-----------------|--NC--|
> |                                        |
> usemem.phys/memphys                      resmem.phys
> 
> PCI BARs need to be aligned to a power of 2, but the actual memory on the
> device may not be. A read or write access to the physical address from the
> last device PFN up to the next power-of-2 aligned physical address
> results in reading ~0 and dropped writes. Note that the GPU device
> driver [6] is capable of knowing the exact device memory size through
> separate means. The device memory size is primarily kept in the system
> ACPI tables for use by the VFIO PCI variant module.
> 
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size memblock
> aligned.
> 
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
> As previously mentioned, resmem is mapped pgprot_writecombine(), that
> sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
> proposed changes in [4] and [3], KVM marks the region with
> MemAttr[2:0]=0b101 in S2.
> 
> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.
> 
> This goes along with a qemu series [6] to provides the necessary
> implementation of the Grace Hopper 

Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Alex Williamson
On Thu, 8 Feb 2024 00:32:10 +0200
Zhi Wang  wrote:

> On Tue, 6 Feb 2024 04:31:23 +0530
>  wrote:
> 
> > From: Ankit Agrawal 
> > 
> > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > for the on-chip GPU that is the logical OS representation of the
> > internal proprietary chip-to-chip cache coherent interconnect.
> > 
> > The device is peculiar compared to a real PCI device in that whilst
> > there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> > device, it is not used to access device memory once the faster
> > chip-to-chip interconnect is initialized (occurs at the time of host
> > system boot). The device memory is accessed instead using the
> > chip-to-chip interconnect that is exposed as a contiguous physically
> > addressable region on the host. This device memory aperture can be
> > obtained from host ACPI table using device_property_read_u64(),
> > according to the FW specification. Since the device memory is cache
> > coherent with the CPU, it can be mmapped into the user VMA with a
> > cacheable mapping using remap_pfn_range() and used like regular
> > RAM. The device memory is not added to the host kernel, but mapped
> > directly as this reduces memory wastage due to struct pages.
> > 
> > There is also a requirement of a reserved 1G uncached region (termed
> > as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This
> > is to work around a HW defect. Based on [2], the requisite properties
> > (uncached, unaligned access) can be achieved through a VM mapping (S1)
> > of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> > a different non-cached property to the reserved 1G region, it needs to
> > be carved out from the device memory and mapped as a separate region
> > in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> > Qemu VMA page properties (pgprot) as NORMAL_NC.
> > 
> > Provide a VFIO PCI variant driver that adapts the unique device memory
> > representation into a more standard PCI representation facing
> > userspace.
> > 
> > The variant driver exposes these two regions - the non-cached reserved
> > (resmem) and the cached rest of the device memory (termed as usemem)
> > as separate VFIO 64b BAR regions. This is divergent from the baremetal
> > approach, where the device memory is exposed as a device memory
> > region. The decision for a different approach was taken in view of
> > the fact that it would necessitate additional code in Qemu to discover
> > and insert those regions in the VM IPA, along with the additional VM
> > ACPI DSDT changes to communicate the device memory region IPA to the
> > VM workloads. Moreover, this behavior would have to be added to a
> > variety of emulators (beyond top of tree Qemu) out there desiring
> > grace hopper support.
> > 
> > Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> > maps the uncached carved-out region to the next available PCI BAR
> > (i.e. comprising regions 2 and 3). The cached device memory
> > aperture is assigned BAR region 4 and 5. Qemu will then naturally
> > generate a PCI device in the VM with the uncached aperture reported
> > as BAR2 region, the cacheable as BAR4. The variant driver provides
> > emulation for these fake BARs' PCI config space offset registers.
> > 
> > The hardware ensures that the system does not crash when the memory
> > is accessed with the memory enable turned off. It synthesizes ~0 reads
> > and dropped writes on such access. So there is no need to support the
> > disablement/enablement of BAR through PCI_COMMAND config space
> > register.
> > 
> > The memory layout on the host looks like the following:
> >                   devmem (memlength)
> > |-----------------------------------------------|
> > |-----------------cached-----------------|--NC--|
> > |                                        |
> > usemem.phys/memphys                      resmem.phys
> > 
> > PCI BARs need to be aligned to a power of 2, but the actual memory
> > on the device may not be. A read or write access to the physical address
> > from the last device PFN up to the next power-of-2 aligned physical
> > address results in reading ~0 and dropped writes. Note that the GPU
> > device driver [6] is capable of knowing the exact device memory size
> > through separate means. The device memory size is primarily kept in
> > the system ACPI tables for use by the VFIO PCI variant module.
> > 
> > Note that the usemem memory is added by the VM Nvidia device driver
> > [5] to the VM kernel as memblocks. Hence make the usable memory size
> > memblock aligned.
> > 
> > Currently there is no provision in KVM for a S2 mapping with
> > MemAttr[2:0]=0b101, but there is an ongoing effort to provide the
> > same [3]. As previously mentioned, resmem is mapped
> > pgprot_writecombine(), that sets the Qemu VMA page properties
> > (pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM
> > marks the region with 

Re: [PATCH] vfio: fix virtio-pci dependency

2024-01-10 Thread Alex Williamson
On Tue,  9 Jan 2024 08:57:19 +0100
Arnd Bergmann  wrote:

> From: Arnd Bergmann 
> 
> The new vfio-virtio driver already has a dependency on VIRTIO_PCI_ADMIN_LEGACY,
> but that is a bool symbol and allows vfio-virtio to be built-in even if
> virtio-pci itself is a loadable module. This leads to a link failure:
> 
> aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_probe':
> main.c:(.text+0xec): undefined reference to `virtio_pci_admin_has_legacy_io'
> aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_init_device':
> main.c:(.text+0x260): undefined reference to `virtio_pci_admin_legacy_io_notify_info'
> aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_bar0_rw':
> main.c:(.text+0x6ec): undefined reference to `virtio_pci_admin_legacy_common_io_read'
> aarch64-linux-ld: main.c:(.text+0x6f4): undefined reference to `virtio_pci_admin_legacy_device_io_read'
> aarch64-linux-ld: main.c:(.text+0x7f0): undefined reference to `virtio_pci_admin_legacy_common_io_write'
> aarch64-linux-ld: main.c:(.text+0x7f8): undefined reference to `virtio_pci_admin_legacy_device_io_write'
> 
> Add another explicit dependency on the tristate symbol.
> 
> Fixes: eb61eca0e8c3 ("vfio/virtio: Introduce a vfio driver over virtio devices")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/vfio/pci/virtio/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> index fc3a0be9d8d4..bd80eca4a196 100644
> --- a/drivers/vfio/pci/virtio/Kconfig
> +++ b/drivers/vfio/pci/virtio/Kconfig
> @@ -1,7 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  config VIRTIO_VFIO_PCI
>  tristate "VFIO support for VIRTIO NET PCI devices"
> -depends on VIRTIO_PCI_ADMIN_LEGACY
> +depends on VIRTIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
>  select VFIO_PCI_CORE
>  help
>This provides support for exposing VIRTIO NET VF devices which 
> support

Applied to vfio next branch for v6.8.  Thanks!

Alex




Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio

2021-04-20 Thread Alex Williamson
On Tue, 20 Apr 2021 09:59:57 -0300
Jason Gunthorpe  wrote:

> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
> > On 2021/4/19 20:33, Jason Gunthorpe wrote:  
> > > On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
> > >   
> > >>> I'm also confused how this works securely at all, as a general rule a
> > >>> VFIO PCI driver cannot access the MMIO memory of the function it is
> > >>> planning to assign to the guest. There is a lot of danger that the
> > >>> guest could access that MMIO space one way or another.  
> > >>
> > >> VF's MMIO memory is divided into two parts, one is the guest part,
> > >> and the other is the live migration part. They do not affect each other,
> > >> so there is no security problem.  
> > > 
> > > AFAIK there are several scenarios where a guest can access this MMIO
> > > memory using DMA even if it is not mapped into the guest for CPU
> > > access.
> > >   
> > The hardware divides VF's MMIO memory into two parts. The live migration
> > driver in the host uses the live migration part, and the device driver in
> > the guest uses the guest part. They obtain the address of VF's MMIO memory
> > in their respective drivers. Although these two parts of the memory are
> > contiguous on the hardware device, due to the needs of the driver function
> > they will not perform operations on the other part of the memory, and the
> > device hardware also independently responds to the operation commands of
> > the two parts.  
> 
> It doesn't matter, the memory is still under the same PCI BDF and VFIO
> supports scenarios where devices in the same IOMMU group are not
> isolated from each other.
> 
> This is why the granual of isolation is a PCI BDF - VFIO directly
> blocks kernel drivers from attaching to PCI BDFs that are not
> completely isolated from VFIO BDF.
> 
> Bypassing this prevention and attaching a kernel driver directly to
> the same BDF being exposed to the guest breaks that isolation model.
> 
> > So, I still don't understand what the security risk you are talking about 
> > is,
> > and what do you think the security design should look like?
> > Can you elaborate on it?  
> 
> Each security domain must have its own PCI BDF.
> 
> The migration control registers must be on a different VF from the VF
> being plugged into a guest and the two VFs have to be in different
> IOMMU groups to ensure they are isolated from each other.

I think that's a solution, I don't know if it's the only solution.
AIUI, the issue here is that we have a device specific kernel driver
extending vfio-pci with migration support for this device by using an
MMIO region of the same device.  This is susceptible to DMA
manipulation by the user device.   Whether that's a security issue or
not depends on how the user can break the device.  If the scope is
limited to breaking their own device, they can do that any number of
ways and it's not very interesting.  If the user can manipulate device
state in order to trigger an exploit of the host-side kernel driver,
that's obviously more of a problem.

The other side of this is that if migration support can be implemented
entirely within the VF using this portion of the device MMIO space, why
do we need the host kernel to support this rather than implementing it
in userspace?  For example, QEMU could know about this device,
manipulate the BAR size to expose only the operational portion of MMIO
to the VM and use the remainder to support migration itself.  I'm
afraid that just like mdev, the vfio migration uAPI is going to be used
as an excuse to create kernel drivers simply to be able to make use of
that uAPI.  I haven't looked at this driver to know if it has some
other reason to exist beyond what could be done through vfio-pci and
userspace migration support.  Thanks,

Alex



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Alex Williamson
On Fri, 16 Apr 2021 06:12:58 -0700
Jacob Pan  wrote:

> Hi Jason,
> 
> On Thu, 15 Apr 2021 20:07:32 -0300, Jason Gunthorpe  wrote:
> 
> > On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:  
> > > Hi Jason,
> > > 
> > > On 4/1/21 6:03 PM, Jason Gunthorpe wrote:
> > > > On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> > > > 
> > > >> DMA page faults are delivered to root-complex via page request
> > > >> message and it is per-device according to PCIe spec. Page request
> > > >> handling flow is:
> > > >>
> > > >> 1) iommu driver receives a page request from device
> > > >> 2) iommu driver parses the page request message. Get the RID,PASID,
> > > >> faulted page and requested permissions etc.
> > > >> 3) iommu driver triggers fault handler registered by device driver
> > > >> with iommu_report_device_fault()
> > > > 
> > > > This seems confused.
> > > > 
> > > > The PASID should define how to handle the page fault, not the driver.
> > > >
> > > 
> > > In my series I don't use PASID at all. I am just enabling nested stage
> > > and the guest uses a single context. I don't allocate any user PASID at
> > > any point.
> > > 
> > > When there is a fault at physical level (a stage 1 fault that concerns
> > > the guest), this latter needs to be reported and injected into the
> > > guest. The vfio pci driver registers a fault handler to the iommu layer
> > > and in that fault handler it fills a circ buffer and triggers an eventfd
> > > that is listened to by the VFIO-PCI QEMU device. This latter retrieves
> > > the fault from the mmapped circ buffer, it knows which vIOMMU it is
> > > attached to, and passes the fault to the vIOMMU.
> > > Then the vIOMMU triggers an IRQ in the guest.
> > > 
> > > We are reusing the existing concepts from VFIO, region, IRQ to do that.
> > > 
> > > For that use case, would you also use /dev/ioasid?
> > 
> > /dev/ioasid could do all the things you described vfio-pci as doing,
> > it can even do them the same way you just described.
> > 
> > Stated another way, do you plan to duplicate all of this code someday
> > for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
> > platform devices, right?
> > 
> > I feel what you guys are struggling with is some choice in the iommu
> > kernel APIs that cause the events to be delivered to the pci_device
> > owner, not the PASID owner.
> > 
> > That feels solvable.
> >   
> Perhaps more of a philosophical question for you and Alex. There is no
> doubt that the direction you guided for /dev/ioasid is a much cleaner one,
> especially after VDPA emerged as another IOMMU backed framework.

I think this statement answers all your remaining questions ;)

> The question is what do we do with the nested translation features that have
> been targeting the existing VFIO-IOMMU for the last three years? That
> predates VDPA. Shall we put a stop marker *after* nested support and say no
> more extensions for VFIO-IOMMU, new features must be built on this new
> interface?
>
> If we were to close a checkout line for some unforeseen reasons, should we
> honor the customers already in line for a long time?
> 
> This is not a tactic or excuse for not working on the new /dev/ioasid
> interface. In fact, I believe we can benefit from the lessons learned while
> completing the existing. This will give confidence to the new
> interface. Thoughts?

I understand a big part of Jason's argument is that we shouldn't be in
the habit of creating duplicate interfaces, we should create one well
designed interface to share among multiple subsystems.  As new users
have emerged, our solution needs to change to a common one rather than
a VFIO specific one.  The IOMMU uAPI provides an abstraction, but at
the wrong level, requiring userspace interfaces for each subsystem.

Luckily the IOMMU uAPI is not really exposed as an actual uAPI, but
that changes if we proceed to enable the interfaces to tunnel it
through VFIO.

The logical answer would therefore be that we don't make that
commitment to the IOMMU uAPI if we believe now that it's fundamentally
flawed.

Ideally this new /dev/ioasid interface, and making use of it as a VFIO
IOMMU backend, should replace type1.  Type1 will live on until that
interface gets to parity, at which point we may deprecate type1, but it
wouldn't make sense to continue to expand type1 in the same direction
as we intend /dev/ioasid to take over in the meantime, especially if it
means maintaining an otherwise dead uAPI.  Thanks,

Alex



Re: [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver

2021-04-15 Thread Alex Williamson
[Cc+ NVIDIA folks both from migration and vfio-pci-core discussion]

On Tue, 13 Apr 2021 11:36:20 +0800
Longfang Liu  wrote:

> The live migration solution relies on the vfio_device_migration_info protocol.
> The structure vfio_device_migration_info is placed at the 0th offset of
> the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
> migration information. Field accesses from this structure are only supported
> at their native width and alignment. Otherwise, the result is undefined and
> vendor drivers should return an error.
> 
> (1).The driver framework is based on vfio_pci_register_dev_region() of 
> vfio-pci,
> and then a new live migration region is added, and the live migration is
> realized through the ops of this region.
> 
> (2).In order to ensure the compatibility of the devices before and after the
> migration, the device compatibility information check will be performed in
> the Pre-copy stage. If the check fails, an error will be returned and the
> source VM will exit the migration function.
> 
> (3).After the compatibility check is passed, it will enter the Stop-and-copy
> stage. At this time, all the live migration data will be copied, and then
> saved to the VF device of the destination, and then the VF device of the
> destination will be started and the VM of the source will be exited.
> 
> Longfang Liu (3):
>   vfio/hisilicon: add acc live migration driver
>   vfio/hisilicon: register the driver to vfio
>   vfio/hisilicom: add debugfs for driver
> 
>  drivers/vfio/pci/Kconfig  |8 +
>  drivers/vfio/pci/Makefile |1 +
>  drivers/vfio/pci/hisilicon/acc_vf_migration.c | 1337 +
>  drivers/vfio/pci/hisilicon/acc_vf_migration.h |  170 
>  drivers/vfio/pci/vfio_pci.c   |   11 +
>  drivers/vfio/pci/vfio_pci_private.h   |9 +
>  6 files changed, 1536 insertions(+)
>  create mode 100644 drivers/vfio/pci/hisilicon/acc_vf_migration.c
>  create mode 100644 drivers/vfio/pci/hisilicon/acc_vf_migration.h
> 



Re: [PATCH 3/3] vfio/iommu_type1: Add support for manual dirty log clear

2021-04-15 Thread Alex Williamson
On Tue, 13 Apr 2021 17:14:45 +0800
Keqian Zhu  wrote:

> From: Kunkun Jiang 
> 
> In the past, we clear dirty log immediately after sync dirty
> log to userspace. This may cause redundant dirty handling if
> userspace handles dirty log iteratively:
> 
> After vfio clears dirty log, new dirty log starts to generate.
> These new dirty log will be reported to userspace even if they
> are generated before userspace handles the same dirty page.
> 
> That's to say, we should minimize the time gap of dirty log
> clearing and dirty log handling. We can give userspace the
> interface to clear dirty log.

IIUC, a user would be expected to clear the bitmap before copying the
dirty pages, therefore you're trying to reduce that time gap between
clearing and copying, but it cannot be fully eliminated and importantly,
if the user clears after copying, they've introduced a race.  Correct?

What results do you have to show that this is a worthwhile optimization?

I really don't like the semantics that testing for an IOMMU capability
enables it.  It needs to be explicitly controllable feature, which
suggests to me that it might be a flag used in combination with _GET or
a separate _GET_NOCLEAR operations.  Thanks,

Alex
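One possible shape for such an explicit opt-in, sketched here only for
illustration (these flag names and bit positions are hypothetical, not existing
uAPI):

/*
 * Sketch only: hypothetical additions to vfio_iommu_type1_dirty_bitmap
 * flags.  _GET_BITMAP_NOCLEAR would report dirty bits without clearing
 * them; _CLEAR_BITMAP would let userspace clear a range explicitly,
 * immediately before it copies the corresponding pages.
 */
#define VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR	(1 << 3)	/* hypothetical */
#define VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP	(1 << 4)	/* hypothetical */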


> Co-developed-by: Keqian Zhu 
> Signed-off-by: Kunkun Jiang 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 100 ++--
>  include/uapi/linux/vfio.h   |  28 -
>  2 files changed, 123 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 77950e47f56f..d9c4a27b3c4e 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -78,6 +78,7 @@ struct vfio_iommu {
>   boolv2;
>   boolnesting;
>   booldirty_page_tracking;
> + booldirty_log_manual_clear;
>   boolpinned_page_dirty_scope;
>   boolcontainer_open;
>  };
> @@ -1242,6 +1243,78 @@ static int vfio_iommu_dirty_log_sync(struct vfio_iommu *iommu,
>   return ret;
>  }
>  
> +static int vfio_iova_dirty_log_clear(u64 __user *bitmap,
> +  struct vfio_iommu *iommu,
> +  dma_addr_t iova, size_t size,
> +  size_t pgsize)
> +{
> + struct vfio_dma *dma;
> + struct rb_node *n;
> + dma_addr_t start_iova, end_iova, riova;
> + unsigned long pgshift = __ffs(pgsize);
> + unsigned long bitmap_size;
> + unsigned long *bitmap_buffer = NULL;
> + bool clear_valid;
> + int rs, re, start, end, dma_offset;
> + int ret = 0;
> +
> + bitmap_size = DIRTY_BITMAP_BYTES(size >> pgshift);
> + bitmap_buffer = kvmalloc(bitmap_size, GFP_KERNEL);
> + if (!bitmap_buffer) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + if (copy_from_user(bitmap_buffer, bitmap, bitmap_size)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> + dma = rb_entry(n, struct vfio_dma, node);
> + if (!dma->iommu_mapped)
> + continue;
> + if ((dma->iova + dma->size - 1) < iova)
> + continue;
> + if (dma->iova > iova + size - 1)
> + break;
> +
> + start_iova = max(iova, dma->iova);
> + end_iova = min(iova + size, dma->iova + dma->size);
> +
> + /* Similar logic as the tail of vfio_iova_dirty_bitmap */
> +
> + clear_valid = false;
> + start = (start_iova - iova) >> pgshift;
> + end = (end_iova - iova) >> pgshift;
> + bitmap_for_each_set_region(bitmap_buffer, rs, re, start, end) {
> + clear_valid = true;
> + riova = iova + (rs << pgshift);
> + dma_offset = (riova - dma->iova) >> pgshift;
> + bitmap_clear(dma->bitmap, dma_offset, re - rs);
> + }
> +
> + if (clear_valid)
> + vfio_dma_populate_bitmap(dma, pgsize);
> +
> + if (clear_valid && !iommu->pinned_page_dirty_scope &&
> + dma->iommu_mapped && !iommu->num_non_hwdbm_groups) {
> + ret = vfio_iommu_dirty_log_clear(iommu, start_iova,
> + end_iova - start_iova,  bitmap_buffer,
> + iova, pgshift);
> + if (ret) {
> + pr_warn("dma dirty log clear failed!\n");
> + goto out;
> + }
> + }
> +
> + }
> +
> +out:
> + kvfree(bitmap_buffer);
> + return ret;
> +}
> +
>  static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
> struct vfio_dma *dma, dma_addr_t 

Re: linux-next: manual merge of the vfio tree with the drm tree

2021-04-15 Thread Alex Williamson
On Thu, 15 Apr 2021 10:08:55 -0300
Jason Gunthorpe  wrote:

> On Thu, Apr 15, 2021 at 04:47:34PM +1000, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Today's linux-next merge of the vfio tree got a conflict in:
> > 
> >   drivers/gpu/drm/i915/gvt/gvt.c
> > 
> > between commit:
> > 
> >   9ff06c385300 ("drm/i915/gvt: Remove references to struct drm_device.pdev")
> > 
> > from the drm tree and commit:
> > 
> >   383987fd15ba ("vfio/gvt: Use mdev_get_type_group_id()")
> > 
> > from the vfio tree.
> > 
> > I fixed it up (I used the latter version) and can carry the fix as
> > necessary.  
> 
> Yes that is right, thank you

Yep, thanks!

Alex



Re: QCA6174 pcie wifi: Add pci quirks

2021-04-15 Thread Alex Williamson
[cc +Pali]

On Thu, 15 Apr 2021 20:02:23 +0200
Ingmar Klein  wrote:

> First thanks to you both, Alex and Bjorn!
> I am in no way an expert on this topic, so I have to fully rely on your
> feedback, concerning this issue.
> 
> If you should have any other solution approach, in form of patch-set, I
> would be glad to test it out. Just let me know, what you think might
> make sense.
> I will wait for your further feedback on the issue. In the meantime I
> have my current workaround via quirk entry.
> 
> By the way, my layman's question:
> Do you think, that the following topic might also apply for the QCA6174?
> https://www.spinics.net/lists/linux-pci/msg106395.html
> Or in other words, should a similar approach be tried for the QCA6174
> and if yes, would it bring any benefit at all?
> I hope you can excuse me, in case the questions should not make too much
> sense.

If you run lspci -vvv on your device, what do LnkCap and LnkSta report
under the express capability?  I wonder if your device even supports
>Gen1 speeds, mine does not.

I would not expect that patch to be relevant to you based on your
report.  I understand it to resolve an issue during link retraining to a
higher speed on boot, not during a bus reset.  Pali can correct if I'm
wrong.  Thanks,

Alex

> Am 15.04.2021 um 04:36 schrieb Alex Williamson:
> > On Wed, 14 Apr 2021 16:03:50 -0500
> > Bjorn Helgaas  wrote:
> >  
> >> [+cc Alex]
> >>
> >> On Fri, Apr 09, 2021 at 11:26:33AM +0200, Ingmar Klein wrote:  
> >>> Edit: Retry, as I did not consider, that my mail-client would make this
> >>> partly html.
> >>>
> >>> Dear maintainers,
> >>> I recently encountered an issue on my Proxmox server system, that
> >>> includes a Qualcomm QCA6174 m.2 PCIe wifi module.
> >>> https://deviwiki.com/wiki/AIRETOS_AFX-QCA6174-NX
> >>>
> >>> On system boot and subsequent virtual machine start (with passed-through
> >>> QCA6174), the VM would just freeze/hang, at the point where the ath10k
> >>> driver loads.
> >>> Quick search in the proxmox related topics, brought me to the following
> >>> discussion, which suggested a PCI quirk entry for the QCA6174 in the 
> >>> kernel:
> >>> https://forum.proxmox.com/threads/pcie-passthrough-freezes-proxmox.27513/
> >>>
> >>> I then went ahead, got the Proxmox kernel source (v5.4.106) and applied
> >>> the attached patch.
> >>> Effect was as hoped, that the VM hangs are now gone. System boots and
> >>> runs as intended.
> >>>
> >>> Judging by the existing quirk entries for Atheros, I would think, that
> >>> my proposed "fix" could be included in the vanilla kernel.
> >>> As far as I saw, there is no entry yet, even in the latest kernel 
> >>> sources.  
> >> This would need a signed-off-by; see
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=v5.11#n361
> >>
> >> This is an old issue, and likely we'll end up just applying this as
> >> yet another quirk.  But looking at c3e59ee4e766 ("PCI: Mark Atheros
> >> AR93xx to avoid bus reset"), where it started, it seems to be
> >> connected to 425c1b223dac ("PCI: Add Virtual Channel to save/restore
> >> support").
> >>
> >> I'd like to dig into that a bit more to see if there are any clues.
> >> AFAIK Linux itself still doesn't use VC at all, and 425c1b223dac added
> >> a fair bit of code.  I wonder if we're restoring something out of
> >> order or making some simple mistake in the way to restore VC config.  
> > I don't really have any faith in that bisect report in commit
> > c3e59ee4e766.  To double check I dug out the card from that commit,
> > installed an old Fedora release so I could build kernel v3.13,
> > pre-dating 425c1b223dac and tested triggering a bus reset both via
> > setpci and by masking PM reset so that sysfs can trigger the bus reset
> > path with the kernel save/restore code.  Both result in the system
> > hanging when the device is accessed either restoring from the kernel
> > bus reset or reading from the device after the setpci reset.  Thanks,
> >
> > Alex
> >  
> 



Re: QCA6174 pcie wifi: Add pci quirks

2021-04-14 Thread Alex Williamson
On Wed, 14 Apr 2021 16:03:50 -0500
Bjorn Helgaas  wrote:

> [+cc Alex]
> 
> On Fri, Apr 09, 2021 at 11:26:33AM +0200, Ingmar Klein wrote:
> > Edit: Retry, as I did not consider, that my mail-client would make this
> > partly html.
> > 
> > Dear maintainers,
> > I recently encountered an issue on my Proxmox server system, that
> > includes a Qualcomm QCA6174 m.2 PCIe wifi module.
> > https://deviwiki.com/wiki/AIRETOS_AFX-QCA6174-NX
> > 
> > On system boot and subsequent virtual machine start (with passed-through
> > QCA6174), the VM would just freeze/hang, at the point where the ath10k
> > driver loads.
> > Quick search in the proxmox related topics, brought me to the following
> > discussion, which suggested a PCI quirk entry for the QCA6174 in the kernel:
> > https://forum.proxmox.com/threads/pcie-passthrough-freezes-proxmox.27513/
> > 
> > I then went ahead, got the Proxmox kernel source (v5.4.106) and applied
> > the attached patch.
> > Effect was as hoped, that the VM hangs are now gone. System boots and
> > runs as intended.
> > 
> > Judging by the existing quirk entries for Atheros, I would think, that
> > my proposed "fix" could be included in the vanilla kernel.
> > As far as I saw, there is no entry yet, even in the latest kernel sources.  
> 
> This would need a signed-off-by; see
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=v5.11#n361
> 
> This is an old issue, and likely we'll end up just applying this as
> yet another quirk.  But looking at c3e59ee4e766 ("PCI: Mark Atheros
> AR93xx to avoid bus reset"), where it started, it seems to be
> connected to 425c1b223dac ("PCI: Add Virtual Channel to save/restore
> support").
> 
> I'd like to dig into that a bit more to see if there are any clues.
> AFAIK Linux itself still doesn't use VC at all, and 425c1b223dac added
> a fair bit of code.  I wonder if we're restoring something out of
> order or making some simple mistake in the way to restore VC config.

I don't really have any faith in that bisect report in commit
c3e59ee4e766.  To double check I dug out the card from that commit,
installed an old Fedora release so I could build kernel v3.13,
pre-dating 425c1b223dac and tested triggering a bus reset both via
setpci and by masking PM reset so that sysfs can trigger the bus reset
path with the kernel save/restore code.  Both result in the system
hanging when the device is accessed either restoring from the kernel
bus reset or reading from the device after the setpci reset.  Thanks,

Alex



[GIT PULL] VFIO fix for v5.12-rc8/final

2021-04-14 Thread Alex Williamson
Hi Linus,

Sorry for the late request.

The following changes since commit d434405aaab7d0ebc516b68a8fc4100922d7f5ef:

  Linux 5.12-rc7 (2021-04-11 15:16:13 -0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.12-rc8

for you to fetch changes up to 909290786ea335366e21d7f1ed5812b90f2f0a92:

  vfio/pci: Add missing range check in vfio_pci_mmap (2021-04-13 08:29:16 -0600)


VFIO fix for v5.12-rc8/final

 - Verify mmap region within range (Christian A. Ehrhardt)


Christian A. Ehrhardt (1):
  vfio/pci: Add missing range check in vfio_pci_mmap

 drivers/vfio/pci/vfio_pci.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)



Re: [PATCH] vfio/iommu_type1: Remove unused pinned_page_dirty_scope in vfio_iommu

2021-04-13 Thread Alex Williamson
On Mon, 12 Apr 2021 10:44:15 +0800
Keqian Zhu  wrote:

> pinned_page_dirty_scope is optimized out by commit 010321565a7d
> ("vfio/iommu_type1: Mantain a counter for non_pinned_groups"),
> but appears again due to some issues during merging branches.
> We can safely remove it here.
> 
> Signed-off-by: Keqian Zhu 
> ---
> 
> However, I'm not clear about the root problem. Is there a bug in git?

Strange, clearly I broke something in merge commit 76adb20f924f, but
it's not evident to me how that line reappeared.  Thanks for spotting
it, I'll queue this for v5.13.  Thanks,

Alex

> ---
>  drivers/vfio/vfio_iommu_type1.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 45cbfd4879a5..4d1f10a33d74 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -77,7 +77,6 @@ struct vfio_iommu {
>   boolv2;
>   boolnesting;
>   booldirty_page_tracking;
> - boolpinned_page_dirty_scope;
>   boolcontainer_open;
>  };
>  



Re: Regression: gvt: vgpu 1: MI_LOAD_REGISTER_MEM handler error

2021-04-12 Thread Alex Williamson
On Mon, 12 Apr 2021 10:32:14 -0600
Alex Williamson  wrote:

> Running a Windows guest on a i915-GVTg_V4_2 from an HD 5500 IGD on
> v5.12-rc6 results in host logs:
> 
> gvt: vgpu 1: lrm access to register (20c0)
> gvt: vgpu 1: MI_LOAD_REGISTER_MEM handler error
> gvt: vgpu 1: cmd parser error
> 0x0 
> 0x29 
> 
> gvt: vgpu 1: scan wa ctx error
> gvt: vgpu 1: failed to submit desc 0
> gvt: vgpu 1: fail submit workload on ring rcs0
> gvt: vgpu 1: fail to emulate MMIO write 2230 len 4
> 
> The guest goes into a boot loop triggering this error before reaching
> the desktop and rebooting.  Guest using Intel driver 20.19.15.5171
> dated 11/4/2020 (from driver file 15.40.5171).
> 
> This VM works well with the same guest and userspace software stack on
> Fedora's kernel 5.11.11-200.fc33.x86_64.  Thanks,

Bisected to:

commit f18d417a57438498e0de481d3a0bc900c2b0e057
Author: Yan Zhao 
Date:   Wed Dec 23 11:45:08 2020 +0800

drm/i915/gvt: filter cmds "srm" and "lrm" in cmd_handler

do not allow "srm" and "lrm" except for GEN8_L3SQCREG4 and 0x21f0.

Cc: Colin Xu 
Cc: Kevin Tian 
Signed-off-by: Yan Zhao 
Signed-off-by: Zhenyu Wang 
Link: 
http://patchwork.freedesktop.org/patch/msgid/20201223034508.17031-1-yan.y.z...@intel.com
Reviewed-by: Zhenyu Wang 



Regression: gvt: vgpu 1: MI_LOAD_REGISTER_MEM handler error

2021-04-12 Thread Alex Williamson


Running a Windows guest on a i915-GVTg_V4_2 from an HD 5500 IGD on
v5.12-rc6 results in host logs:

gvt: vgpu 1: lrm access to register (20c0)
gvt: vgpu 1: MI_LOAD_REGISTER_MEM handler error
gvt: vgpu 1: cmd parser error
0x0 
0x29 

gvt: vgpu 1: scan wa ctx error
gvt: vgpu 1: failed to submit desc 0
gvt: vgpu 1: fail submit workload on ring rcs0
gvt: vgpu 1: fail to emulate MMIO write 2230 len 4

The guest goes into a boot loop triggering this error before reaching
the desktop and rebooting.  Guest using Intel driver 20.19.15.5171
dated 11/4/2020 (from driver file 15.40.5171).

This VM works well with the same guest and userspace software stack on
Fedora's kernel 5.11.11-200.fc33.x86_64.  Thanks,

Alex



Re: [PATCH 1/2] vfio/pci: remove vfio_pci_nvlink2

2021-04-12 Thread Alex Williamson
On Mon, 12 Apr 2021 19:41:41 +1000
Michael Ellerman  wrote:

> Alex Williamson  writes:
> > On Fri, 26 Mar 2021 07:13:10 +0100
> > Christoph Hellwig  wrote:
> >  
> >> This driver never had any open userspace (which for VFIO would include
> >> VM kernel drivers) that use it, and thus should never have been added
> >> by our normal userspace ABI rules.
> >> 
> >> Signed-off-by: Christoph Hellwig 
> >> Acked-by: Greg Kroah-Hartman 
> >> ---
> >>  drivers/vfio/pci/Kconfig|   6 -
> >>  drivers/vfio/pci/Makefile   |   1 -
> >>  drivers/vfio/pci/vfio_pci.c |  18 -
> >>  drivers/vfio/pci/vfio_pci_nvlink2.c | 490 
> >>  drivers/vfio/pci/vfio_pci_private.h |  14 -
> >>  include/uapi/linux/vfio.h   |  38 +--
> >>  6 files changed, 4 insertions(+), 563 deletions(-)
> >>  delete mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c  
> >
> > Hearing no objections, applied to vfio next branch for v5.13.  Thanks,  
> 
> Looks like you only took patch 1?
> 
> I can't take patch 2 on its own, that would break the build.
> 
> Do you want to take both patches? There's currently no conflicts against
> my tree. It's possible one could appear before the v5.13 merge window,
> though it would probably just be something minor.
> 
> Or I could apply both patches to my tree, which means patch 1 would
> appear as two commits in the git history, but that's not a big deal.

I've already got a conflict in my next branch with patch 1, so it's
best to go through my tree.  Seems like a shared branch would be
easiest to allow you to merge and manage potential conflicts against
patch 2, I've pushed a branch here:

https://github.com/awilliam/linux-vfio.git v5.13/vfio/nvlink

Thanks,
Alex



Re: [PATCH v1 01/14] vfio: Create vfio_fs_type with inode per device

2021-04-09 Thread Alex Williamson
On Fri, 9 Apr 2021 04:54:23 +
"Zengtao (B)"  wrote:

> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: March 9, 2021 5:47
> > To: alex.william...@redhat.com
> > Cc: coh...@redhat.com; k...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; j...@nvidia.com; pet...@redhat.com
> > Subject: [PATCH v1 01/14] vfio: Create vfio_fs_type with inode per device
> > 
> > By linking all the device fds we provide to userspace to an address space
> > through a new pseudo fs, we can use tools like
> > unmap_mapping_range() to zap all vmas associated with a device.
> > 
> > Suggested-by: Jason Gunthorpe 
> > Signed-off-by: Alex Williamson 
> > ---
> >  drivers/vfio/vfio.c |   54
> > +++
> >  1 file changed, 54 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > 38779e6fd80c..abdf8d52a911 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,11 +32,18 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include   
> Minor: keep the headers in alphabetical order.

They started out that way, but various tree-wide changes ignoring that,
and likely oversights on my part as well, has left us with numerous
breaks in that rule already.

> > 
> >  #define DRIVER_VERSION "0.3"
> >  #define DRIVER_AUTHOR  "Alex Williamson "
> >  #define DRIVER_DESC"VFIO - User Level meta-driver"
> > 
> > +#define VFIO_MAGIC 0x5646494f /* "VFIO" */  
> Move to include/uapi/linux/magic.h ? 

Hmm, yeah, I suppose it probably should go there.  Thanks.

FWIW, I'm still working on a next version of this series, currently
struggling how to handle an arbitrary number of vmas per user DMA
mapping.  Thanks,

Alex



Re: [PATCH 1/2] vfio/pci: remove vfio_pci_nvlink2

2021-04-06 Thread Alex Williamson
On Fri, 26 Mar 2021 07:13:10 +0100
Christoph Hellwig  wrote:

> This driver never had any open userspace (which for VFIO would include
> VM kernel drivers) that use it, and thus should never have been added
> by our normal userspace ABI rules.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Greg Kroah-Hartman 
> ---
>  drivers/vfio/pci/Kconfig|   6 -
>  drivers/vfio/pci/Makefile   |   1 -
>  drivers/vfio/pci/vfio_pci.c |  18 -
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 490 
>  drivers/vfio/pci/vfio_pci_private.h |  14 -
>  include/uapi/linux/vfio.h   |  38 +--
>  6 files changed, 4 insertions(+), 563 deletions(-)
>  delete mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

Hearing no objections, applied to vfio next branch for v5.13.  Thanks,

Alex



Re: [PATCH v1] vfio/type1: Remove the almost unused check in vfio_iommu_type1_unpin_pages

2021-04-06 Thread Alex Williamson
On Tue, 6 Apr 2021 21:50:09 +0800
Shenming Lu  wrote:

> The check i > npage at the end of vfio_iommu_type1_unpin_pages is unused
> unless npage < 0, but if npage < 0, this function will return npage, which
> should return -EINVAL instead. So let's just check the parameter npage at
> the start of the function. By the way, replace unpin_exit with break.
> 
> Signed-off-by: Shenming Lu 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 45cbfd4879a5..fd4213c41743 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -960,7 +960,7 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>   bool do_accounting;
>   int i;
>  
> - if (!iommu || !user_pfn)
> + if (!iommu || !user_pfn || npage <= 0)
>   return -EINVAL;
>  
>   /* Supported for v2 version only */
> @@ -977,13 +977,13 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>   iova = user_pfn[i] << PAGE_SHIFT;
>   dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>   if (!dma)
> - goto unpin_exit;
> + break;
> +
>   vfio_unpin_page_external(dma, iova, do_accounting);
>   }
>  
> -unpin_exit:
>   mutex_unlock(&iommu->lock);
> - return i > npage ? npage : (i > 0 ? i : -EINVAL);
> + return i > 0 ? i : -EINVAL;
>  }
>  
>  static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,

Very odd behavior previously.  Applied to vfio next branch for v5.13.
Thanks,

Alex



Re: [PATCH 0/4] vfio: fix a couple of spelling mistakes detected by codespell tool

2021-04-06 Thread Alex Williamson
On Fri, 26 Mar 2021 16:35:24 +0800
Zhen Lei  wrote:

> This detection and correction covers the entire drivers/vfio directory.
> 
> Zhen Lei (4):
>   vfio/type1: fix a couple of spelling mistakes
>   vfio/mdev: Fix spelling mistake "interal" -> "internal"
>   vfio/pci: fix a couple of spelling mistakes
>   vfio/platform: Fix spelling mistake "registe" -> "register"
> 
>  drivers/vfio/mdev/mdev_private.h | 2 +-
>  drivers/vfio/pci/vfio_pci.c  | 2 +-
>  drivers/vfio/pci/vfio_pci_config.c   | 2 +-
>  drivers/vfio/pci/vfio_pci_nvlink2.c  | 4 ++--
>  drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c | 2 +-
>  drivers/vfio/vfio_iommu_type1.c  | 6 +++---
>  6 files changed, 9 insertions(+), 9 deletions(-)
> 

Applied to vfio next branch for v5.13.  Thanks,

Alex



Re: [PATCH] vfio: pci: Spello fix in the file vfio_pci.c

2021-04-06 Thread Alex Williamson
On Sun, 14 Mar 2021 10:59:25 +0530
Bhaskar Chowdhury  wrote:

> s/permision/permission/
> 
> Signed-off-by: Bhaskar Chowdhury 
> ---
>  drivers/vfio/pci/vfio_pci.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 706de3ef94bb..62f137692a4f 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -2411,7 +2411,7 @@ static int __init vfio_pci_init(void)
>  {
>   int ret;
> 
> - /* Allocate shared config space permision data used by all devices */
> + /* Allocate shared config space permission data used by all devices */
>   ret = vfio_pci_init_perm_bits();
>   if (ret)
>   return ret;
> --
> 2.26.2
> 

Applied to vfio next branch for v5.13.  Thanks,

Alex



Re: [PATCH] PCI: merge slot and bus reset implementations

2021-04-06 Thread Alex Williamson
On Sun, 4 Apr 2021 11:04:32 +0300
Leon Romanovsky  wrote:

> On Thu, Apr 01, 2021 at 10:56:16AM -0600, Alex Williamson wrote:
> > On Thu, 1 Apr 2021 15:27:37 +0300
> > Leon Romanovsky  wrote:
> >   
> > > On Thu, Apr 01, 2021 at 05:37:16AM +, Raphael Norwitz wrote:  
> > > > Slot resets are bus resets with additional logic to prevent a device
> > > > from being removed during the reset. Currently slot and bus resets have
> > > > separate implementations in pci.c, complicating higher level logic. As
> > > > discussed on the mailing list, they should be combined into a generic
> > > > function which performs an SBR. This change adds a function,
> > > > pci_reset_bus_function(), which first attempts a slot reset and then
> > > > attempts a bus reset if -ENOTTY is returned, such that there is now a
> > > > single device agnostic function to perform an SBR.
> > > > 
> > > > This new function is also needed to add SBR reset quirks and therefore
> > > > is exposed in pci.h.
> > > > 
> > > > Link: https://lkml.org/lkml/2021/3/23/911
> > > > 
> > > > Suggested-by: Alex Williamson 
> > > > Signed-off-by: Amey Narkhede 
> > > > Signed-off-by: Raphael Norwitz 
> > > > ---
> > > >  drivers/pci/pci.c   | 17 +
> > > >  include/linux/pci.h |  1 +
> > > >  2 files changed, 10 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > > index 16a17215f633..12a91af2ade4 100644
> > > > --- a/drivers/pci/pci.c
> > > > +++ b/drivers/pci/pci.c
> > > > @@ -4982,6 +4982,13 @@ static int pci_dev_reset_slot_function(struct pci_dev *dev, int probe)
> > > > return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
> > > >  }
> > > >  
> > > > +int pci_reset_bus_function(struct pci_dev *dev, int probe)
> > > > +{
> > > > +   int rc = pci_dev_reset_slot_function(dev, probe);
> > > > +
> > > > +   return (rc == -ENOTTY) ? pci_parent_bus_reset(dev, probe) : rc; 
> > > >
> > > 
> > > The previous coding style is the preferable one in the Linux kernel.
> > > int rc = pci_dev_reset_slot_function(dev, probe);
> > > if (rc != -ENOTTY)
> > >   return rc;
> > > return pci_parent_bus_reset(dev, probe);  
> > 
> > 
> > That'd be news to me, do you have a reference?  I've never seen
> > complaints for ternaries previously.  Thanks,  
> 
> The complaint is not about ternaries, but about the function call as one of
> the parameters, which makes it harder to read.

Sorry, I don't find a function call as a parameter to a ternary to be
extraordinary, nor do I find it to be a discouraged usage model within
the kernel.  This seems like a pretty low bar for hard to read code.
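
For readers following the probe argument: in this era the PCI reset helpers
take an int probe, where a non-zero value only asks whether the method is
available and zero actually performs the reset.  A minimal sketch of a caller
using the pci_reset_bus_function() proposed above (the demo_ wrapper is
hypothetical):

/* Only attempt a secondary bus reset (SBR) if one is actually available. */
static int demo_try_sbr(struct pci_dev *pdev)
{
	int ret;

	/* probe = 1: report whether a slot or bus reset is possible */
	ret = pci_reset_bus_function(pdev, 1);
	if (ret)
		return ret;	/* typically -ENOTTY if neither is usable */

	/* probe = 0: perform the slot reset, falling back to a bus reset */
	return pci_reset_bus_function(pdev, 0);
}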



Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-04-01 Thread Alex Williamson
On Thu, 1 Apr 2021 10:12:27 -0300
Jason Gunthorpe  wrote:

> On Mon, Mar 29, 2021 at 05:10:53PM -0600, Alex Williamson wrote:
> > On Tue, 23 Mar 2021 16:32:13 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Mon, Mar 22, 2021 at 10:40:16AM -0600, Alex Williamson wrote:
> > >   
> > > > Of course if you start looking at features like migration support,
> > > > that's more than likely not simply an additional region with optional
> > > > information, it would need to interact with the actual state of the
> > > > device.  For those, I would very much support use of a specific
> > > > id_table.  That's not these.
> > > 
> > > What I don't understand is why do we need two different ways of
> > > inserting vendor code?  
> > 
> > Because a PCI id table only identifies the device, these drivers are
> > looking for a device in the context of firmware dependencies.  
> 
> The firmware dependencies only exist for a defined list of ID's, so I
> don't entirely agree with this statement. I agree with below though,
> so lets leave it be.
> 
> > > I understood he meant that NVIDIA GPUs *without* NVLINK can exist, but
> > > the ID table we have here is supposed to be the NVLINK compatible
> > > ID's.  
> > 
> > Those IDs are just for the SXM2 variants of the device that can
> > exist on a variety of platforms, only one of which includes the
> > firmware tables to activate the vfio support.  
> 
> AFAIK, SXM2 is a special physical form factor that has the nvlink
> physical connection - it is only for this specific generation of power
> servers that can accept the specific nvlink those cards have.

SXM2 is not unique to Power, there are various x86 systems that support
the interface, everything from NVIDIA's own line of DGX systems,
various vendor systems, all the way to VARs like Super Micro and
Gigabyte.

> > I think you're looking for a significant inflection in vendor's stated
> > support for vfio use cases, beyond the "best-effort, give it a try",
> > that we currently have.  
> 
> I see, so they don't want to. Lets leave it then.
> 
> Though if Xe breaks everything they need to add/maintain a proper ID
> table, not more hackery.

e4eccb853664 ("vfio/pci: Bypass IGD init in case of -ENODEV") is
supposed to enable Xe, where the IGD code is expected to return -ENODEV
and we go on with the base vfio-pci support.
 
> > > And again, I feel this is all a big tangent, especially now that
> > > HCH wants to delete the nvlink stuff we should just leave igd
> > > alone.  
> > 
> > Determining which things stay in vfio-pci-core and which things are
> > split to variant drivers and how those variant drivers can match the
> > devices they intend to support seems very inline with this series.
> >   
> 
> IMHO, the main litmus test for core is if variant drivers will need it
> or not.
> 
> No variant driver should be stacked on an igd device, or if it someday
> is, it should implement the special igd hackery internally (and have a
> proper ID table). So when we split it up igd goes into vfio_pci.ko as
> some special behavior vfio_pci.ko's universal driver provides for IGD.
> 
> Every variant driver will still need the zdev data to be exposed to
> userspace, and every PCI device on s390 has that extra information. So
> vdev goes to vfio_pci_core.ko
> 
> Future things going into vfio_pci.ko need a really good reason why
> they can't be variant drivers instead.

That sounds fair.  Thanks,

Alex
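
To make the id_table distinction concrete, here is a minimal sketch of the
two shapes being discussed: an all-match table for a "universal" driver
versus an explicit list for a variant driver.  The table names and device
IDs are illustrative, not taken from any real driver:

#include <linux/module.h>
#include <linux/pci.h>

/* Match-all table, as a "universal" vfio-pci style driver would carry. */
static const struct pci_device_id demo_universal_ids[] = {
	{ PCI_DEVICE(PCI_ANY_ID, PCI_ANY_ID) },
	{ }
};
MODULE_DEVICE_TABLE(pci, demo_universal_ids);

/* A variant driver instead enumerates only the devices it understands. */
static const struct pci_device_id demo_variant_ids[] = {
	{ PCI_DEVICE(0x1234, 0xabcd) },	/* hypothetical vendor/device */
	{ }
};
MODULE_DEVICE_TABLE(pci, demo_variant_ids);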



Re: [PATCH] PCI: merge slot and bus reset implementations

2021-04-01 Thread Alex Williamson
On Thu, 1 Apr 2021 15:27:37 +0300
Leon Romanovsky  wrote:

> On Thu, Apr 01, 2021 at 05:37:16AM +, Raphael Norwitz wrote:
> > Slot resets are bus resets with additional logic to prevent a device
> > from being removed during the reset. Currently slot and bus resets have
> > separate implementations in pci.c, complicating higher level logic. As
> > discussed on the mailing list, they should be combined into a generic
> > function which performs an SBR. This change adds a function,
> > pci_reset_bus_function(), which first attempts a slot reset and then
> > attempts a bus reset if -ENOTTY is returned, such that there is now a
> > single device agnostic function to perform an SBR.
> > 
> > This new function is also needed to add SBR reset quirks and therefore
> > is exposed in pci.h.
> > 
> > Link: https://lkml.org/lkml/2021/3/23/911
> > 
> > Suggested-by: Alex Williamson 
> > Signed-off-by: Amey Narkhede 
> > Signed-off-by: Raphael Norwitz 
> > ---
> >  drivers/pci/pci.c   | 17 +
> >  include/linux/pci.h |  1 +
> >  2 files changed, 10 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 16a17215f633..12a91af2ade4 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -4982,6 +4982,13 @@ static int pci_dev_reset_slot_function(struct pci_dev *dev, int probe)
> > return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
> >  }
> >  
> > +int pci_reset_bus_function(struct pci_dev *dev, int probe)
> > +{
> > +   int rc = pci_dev_reset_slot_function(dev, probe);
> > +
> > +   return (rc == -ENOTTY) ? pci_parent_bus_reset(dev, probe) : rc;  
> 
> The previous coding style is the preferable one in the Linux kernel.
> int rc = pci_dev_reset_slot_function(dev, probe);
> if (rc != -ENOTTY)
>   return rc;
> return pci_parent_bus_reset(dev, probe);


That'd be news to me, do you have a reference?  I've never seen
complaints for ternaries previously.  Thanks,

Alex



[GIT PULL] VFIO fixes for v5.12-rc6

2021-03-30 Thread Alex Williamson
Hi Linus,

The following changes since commit 0d02ec6b3136c73c09e7859f0d0e4e2c4c07b49b:

  Linux 5.12-rc4 (2021-03-21 14:56:43 -0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.12-rc6

for you to fetch changes up to e0146a108ce4d2c22b9510fd12268e3ee72a0161:

  vfio/nvlink: Add missing SPAPR_TCE_IOMMU depends (2021-03-29 14:48:00 -0600)


VFIO fixes for v5.12-rc6

 - Fix pfnmap batch carryover (Daniel Jordan)

 - Fix nvlink Kconfig dependency (Jason Gunthorpe)


Daniel Jordan (1):
  vfio/type1: Empty batch for pfnmap pages

Jason Gunthorpe (1):
  vfio/nvlink: Add missing SPAPR_TCE_IOMMU depends

 drivers/vfio/pci/Kconfig| 2 +-
 drivers/vfio/vfio_iommu_type1.c | 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)



Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-29 Thread Alex Williamson
On Tue, 23 Mar 2021 16:32:13 -0300
Jason Gunthorpe  wrote:

> On Mon, Mar 22, 2021 at 10:40:16AM -0600, Alex Williamson wrote:
> 
> > Of course if you start looking at features like migration support,
> > that's more than likely not simply an additional region with optional
> > information, it would need to interact with the actual state of the
> > device.  For those, I would very much support use of a specific
> > id_table.  That's not these.  
> 
> What I don't understand is why do we need two different ways of
> inserting vendor code?

Because a PCI id table only identifies the device, these drivers are
looking for a device in the context of firmware dependencies.

> > > new_id and driver_override should probably be in that disable list
> > > too..  
> > 
> > We don't have this other world yet, nor is it clear that we will have
> > it.  
> 
> We do today, it is obscure, but there is a whole set of config options
> designed to disable the unsafe kernel features. Kernels booted with
> secure boot and signed modules tend to enable a lot of them, for
> instance. The people working on the IMA stuff tend to enable a lot
> more as you can defeat the purpose of IMA if you can hijack the
> kernel.
> 
> > What sort of id_table is the base vfio-pci driver expected to use?  
> 
> If it has a match table it would be all match, this is why I called it
> a "universal driver"
> 
> If we have a flavour then the flavour controls the activation of
> VFIO, not new_id or driver_override, and in vfio flavour mode we can
> have an all match table, if we can resolve how to choose between two
> drivers with overlapping matches.
> 
> > > > > This is why I want to try for fine grained autoloading first. It
> > > > > really is the elegant solution if we can work it out.
> > > > 
> > > > I just don't see how we create a manageable change to userspace.
> > > 
> > > I'm not sure I understand. Even if we add a new sysfs to set some
> > > flavour then that is a pretty trivial change for userspace to move
> > > from driver_override?  
> > 
> > Perhaps for some definition of trivial that I'm not familiar with.
> > We're talking about changing libvirt and driverctl and every distro and
> > user that's created a custom script outside of those.  Even changing
> > from "vfio-pci" to "vfio-pci*" is a hurdle.  
> 
> Sure, but it isn't like a major architectural shift, nor is it
> mandatory unless you start using this new hardware class.
> 
> Userspace changes when we add kernel functionality.. The kernel just
> has to keep working the way it used to for old functionality.

Seems like we're bound to keep igd in the core as you propose below.

> > > Well, I read through the Intel GPU driver and this is how I felt it
> > > works. It doesn't even check the firmware bit unless certain PCI IDs
> > > are matched first.  
> > 
> > The IDs being only the PCI vendor ID and class code.
> 
> I don't mean how vfio works, I mean how the Intel GPU driver works.
> 
> eg:
> 
> psb_pci_probe()
>  psb_driver_load()
>   psb_intel_opregion_setup()
>if (memcmp(base, OPREGION_SIGNATURE, 16)) {
> 
> i915_pci_probe()
>  i915_driver_probe()
>   i915_driver_hw_probe()
>intel_opregion_setup()
>   if (memcmp(buf, OPREGION_SIGNATURE, 16)) {
> 
> All of these memcmp's are protected by exact id_tables hung off the
> pci_driver's id_table.
> 
> VFIO is the different case. In this case the ID match confirms that
> the config space has the ASLS dword at the fixed offset. If the ID
> doesn't match nothing should read the ASLS offset.
> 
> > > For NVIDIA GPU Max checked internally and we saw it looks very much
> > > like how Intel GPU works. Only some PCI IDs trigger checking on the
> > > feature the firmware thing is linked to.  
> > 
> > And as Alexey noted, the table came up incomplete.  But also those same
> > devices exist on platforms where this extension is completely
> > irrelevant.  
> 
> I understood he meant that NVIDIA GPUs *without* NVLINK can exist, but
> the ID table we have here is supposed to be the NVLINK compatible
> ID's.

Those IDs are just for the SXM2 variants of the device that can
exist on a variety of platforms, only one of which includes the
firmware tables to activate the vfio support.

> > So because we don't check for an Intel specific graphics firmware table
> > when binding to Realtek NIC, we can leap to the conclusion that there
> > must be a concise id_table we can create for IGD support?  
> 
> Concise? No, but we ca
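
For reference, the firmware check being weighed here is small.  A sketch of
the kind of test vfio-pci's IGD support performs, assuming the OpRegion
address is published in the ASLS config dword at 0xfc and begins with the
16-byte "IntelGraphicsMem" signature (the demo_ names are illustrative):

#include <linux/io.h>
#include <linux/pci.h>
#include <linux/string.h>

#define DEMO_ASLS_OFFSET	0xfc	/* IGD OpRegion address in config space */
#define DEMO_OPREGION_SIG	"IntelGraphicsMem"

/* Return true if the device appears to expose an IGD OpRegion. */
static bool demo_has_igd_opregion(struct pci_dev *pdev)
{
	void *base;
	bool ok;
	u32 addr;

	if (pci_read_config_dword(pdev, DEMO_ASLS_OFFSET, &addr) || !addr)
		return false;

	base = memremap(addr, 16, MEMREMAP_WB);
	if (!base)
		return false;

	ok = !memcmp(base, DEMO_OPREGION_SIG, 16);
	memunmap(base);
	return ok;
}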

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-26 Thread Alex Williamson
On Fri, 26 Mar 2021 09:40:30 +0300
Leon Romanovsky  wrote:

> On Thu, Mar 25, 2021 at 11:53:24AM -0600, Alex Williamson wrote:
> > On Thu, 25 Mar 2021 18:09:58 +0200
> > Leon Romanovsky  wrote:
> >   
> > > On Thu, Mar 25, 2021 at 08:55:04AM -0600, Alex Williamson wrote:  
> > > > On Thu, 25 Mar 2021 10:37:54 +0200
> > > > Leon Romanovsky  wrote:
> > > > 
> > > > > On Wed, Mar 24, 2021 at 11:17:29AM -0600, Alex Williamson wrote:
> > > > > > On Wed, 24 Mar 2021 17:13:56 +0200
> > > > > > Leon Romanovsky  wrote:  
> > > > > 
> > > > > <...>
> > > > > 
> > > > > > > Yes, and real testing/debugging almost always requires kernel 
> > > > > > > rebuild.
> > > > > > > Everything else is waste of time.  
> > > > > > 
> > > > > > Sorry, this is nonsense.  Allowing users to debug issues without a 
> > > > > > full
> > > > > > kernel rebuild is a good thing.  
> > > > > 
> > > > > It is far from debug, this interface doesn't give you any answers why
> > > > > the reset didn't work, it just helps you to find the one that works.
> > > > > 
> > > > > Unless you believe that this information will be enough to understand
> > > > > the root cause, you will need to ask from the user to perform extra
> > > > > tests, maybe try some quirk. All of that requires from the users to
> > > > > rebuild their kernel.
> > > > > 
> > > > > So no, it is not debug.
> > > > 
> > > > It allows a user to experiment to determine (a) my device doesn't work
> > > > in a given scenario with the default configuration, but (b) if I change
> > > > the reset to this other thing it does work.  That is a step in
> > > > debugging.
> > > > 
> > > > It's absurd to think that a sysfs attribute could provide root cause,
> > > > but it might be enough for someone to further help that user.  It would
> > > > be a useful clue for a bug report.  Yes, reaching root cause might
> > > > involve building a kernel, but that doesn't invalidate that having a
> > > > step towards debugging in the base kernel might be a useful tool.
> > > 
> > > Let's agree to do not agree.
> > >   
> > > > 
> > > > > > > > > > For policy preference, I already described how I've 
> > > > > > > > > > configured QEMU to
> > > > > > > > > > prefer a bus reset rather than a PM reset due to lack of 
> > > > > > > > > > specification
> > > > > > > > > > regarding the scope of a PM "soft reset".  This interface 
> > > > > > > > > > would allow a
> > > > > > > > > > system policy to do that same thing.
> > > > > > > > > > 
> > > > > > > > > > I don't think anyone is suggesting this as a means to avoid 
> > > > > > > > > > quirks that
> > > > > > > > > > would resolve reset issues and create the best default 
> > > > > > > > > > general behavior.
> > > > > > > > > > This provides a mechanism to test various reset methods, 
> > > > > > > > > > and thereby
> > > > > > > > > > identify broken methods, and set a policy.  Sure, that 
> > > > > > > > > > policy might be
> > > > > > > > > > to avoid a broken reset in the interim before it gets 
> > > > > > > > > > quirked and
> > > > > > > > > > there's potential for abuse there, but I think the benefits 
> > > > > > > > > > outweigh
> > > > > > > > > > the risks.  
> > > > > > > > > 
> > > > > > > > > This interface is proposed as first class citizen in the 
> > > > > > > > > general sysfs
> > > > > > > > > layout. Of course, it will be seen as a way to bypass the 
> > > > > > > > > kernel.
> > > > > > > > > 
> > > > > > > > > At least, put it under CON

Re: [PATCH] vfio/type1: Empty batch for pfnmap pages

2021-03-25 Thread Alex Williamson
On Wed, 24 Mar 2021 21:05:52 -0400
Daniel Jordan  wrote:

> When vfio_pin_pages_remote() returns with a partial batch consisting of
> a single VM_PFNMAP pfn, a subsequent call will unfortunately try
> restoring it from batch->pages, resulting in vfio mapping the wrong page
> and unbalancing the page refcount.
> 
> Prevent the function from returning with this kind of partial batch to
> avoid the issue.  There's no explicit check for a VM_PFNMAP pfn because
> it's awkward to do so, so infer it from characteristics of the batch
> instead.  This may result in occasional false positives but keeps the
> code simpler.
> 
> Fixes: 4d83de6da265 ("vfio/type1: Batch page pinning")
> Link: https://lkml.kernel.org/r/20210323133254.33ed9...@omen.home.shazbot.org/
> Reported-by: Alex Williamson 
> Suggested-by: Alex Williamson 
> Signed-off-by: Daniel Jordan 
> ---
> 
> Alex, I couldn't immediately find a way to trigger this bug, but I can
> run your test case if you like.
> 
> This is the minimal fix, but it should still protect all calls of
> vfio_batch_unpin() from this problem.

Thanks, applied to my for-linus branch for v5.12.  The attached unit
test triggers the issue, I don't have any real world examples and was
only just experimenting with this for another series earlier this week.
Thanks,

Alex
/*
 * Alternate pages of device memory and anonymous memory within a single DMA
 * mapping.
 *
 * Run with argv[1] as a fully specified PCI device already bound to vfio-pci.
 * ex. "alternate-pfnmap 0000:01:00.0"
 */
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

#include <linux/ioctl.h>
#include <linux/pci_regs.h>
#include <linux/vfio.h>

void *vaddr = (void *)0x1;
size_t map_size = 0;

int get_container(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);

	if (container < 0)
		fprintf(stderr, "Failed to open /dev/vfio/vfio, %d (%s)\n",
		   container, strerror(errno));

	return container;
}

int get_group(char *name)
{
	int seg, bus, slot, func;
	int ret, group, groupid;
	char path[50], iommu_group_path[50], *group_name;
	struct stat st;
	ssize_t len;
	struct vfio_group_status group_status = {
		.argsz = sizeof(group_status)
	};

	ret = sscanf(name, "%04x:%02x:%02x.%d", &seg, &bus, &slot, &func);
	if (ret != 4) {
		fprintf(stderr, "Invalid device\n");
		return -EINVAL;
	}

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/",
		 seg, bus, slot, func);

	ret = stat(path, &st);
	if (ret < 0) {
		fprintf(stderr, "No such device\n");
		return ret;
	}

	strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);

	len = readlink(path, iommu_group_path, sizeof(iommu_group_path));
	if (len <= 0) {
		fprintf(stderr, "No iommu_group for device\n");
		return -EINVAL;
	}

	iommu_group_path[len] = 0;
	group_name = basename(iommu_group_path);

	if (sscanf(group_name, "%d", &groupid) != 1) {
		fprintf(stderr, "Unknown group\n");
		return -EINVAL;
	}

	snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
	group = open(path, O_RDWR);
	if (group < 0) {
		fprintf(stderr, "Failed to open %s, %d (%s)\n",
		   path, group, strerror(errno));
		return group;
	}

	ret = ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
	if (ret) {
		fprintf(stderr, "ioctl(VFIO_GROUP_GET_STATUS) failed\n");
		return ret;
	}

	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
		fprintf(stderr,
			"Group not viable, all devices attached to vfio?\n");
		return -1;
	}

	return group;
}

int group_set_container(int group, int container)
{
	int ret = ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	if (ret)
		fprintf(stderr, "Failed to set group container\n");

	return ret;
}

int container_set_iommu(int container)
{
	int ret = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	if (ret)
		fprintf(stderr, "Failed to set IOMMU\n");

	return ret;
}

int group_get_device(int group, char *name)
{
	int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, name);

	if (device < 0)
		fprintf(stderr, "Failed to get device\n");

	return device;
}

void *mmap_device_page(int device, int prot)
{
	struct vfio_region_info config_info = {
		.argsz = sizeof(config_info),
		.index = VFIO_PCI_CONFIG_REGION_INDEX
	};
	struct vfio_region_info region_info = {
		.argsz = sizeof(region_info)
	};
	void *map = MAP_FAILED;
	unsigned int bar;
	int i, ret;

	ret = ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &config_info);
	if (ret) {
		fprintf(stderr, "Failed to get config space region info\n");
		return map;
	}

	for (i = 0; i < 6; i++) {
		if (pread(device, &bar, sizeof(bar), config_info.offset +
			  PCI_BASE_ADDRESS_0 + (4 * i)) != sizeof(bar)) {
			fprintf(stderr, "Error reading BAR%d\n", i);
			return map;
		}

		if (!(bar &a
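
The listing is cut off above before the DMA mapping step.  For orientation,
the single mapping over the assembled range would be issued roughly as
below; this is a sketch reusing the program's vaddr and map_size globals,
not the missing code itself:

/* Map the assembled range (device and anonymous pages interleaved) as one
 * DMA mapping so vfio_pin_pages_remote() sees the alternation. */
int map_dma(int container)
{
	struct vfio_iommu_type1_dma_map dma_map = {
		.argsz = sizeof(dma_map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (unsigned long)vaddr,
		.iova = 0,
		.size = map_size,
	};
	int ret = ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

	if (ret)
		fprintf(stderr, "Failed to map DMA, %d (%s)\n",
			ret, strerror(errno));

	return ret;
}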

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-25 Thread Alex Williamson
On Thu, 25 Mar 2021 18:09:58 +0200
Leon Romanovsky  wrote:

> On Thu, Mar 25, 2021 at 08:55:04AM -0600, Alex Williamson wrote:
> > On Thu, 25 Mar 2021 10:37:54 +0200
> > Leon Romanovsky  wrote:
> >   
> > > On Wed, Mar 24, 2021 at 11:17:29AM -0600, Alex Williamson wrote:  
> > > > On Wed, 24 Mar 2021 17:13:56 +0200
> > > > Leon Romanovsky  wrote:
> > > 
> > > <...>
> > >   
> > > > > Yes, and real testing/debugging almost always requires kernel rebuild.
> > > > > Everything else is waste of time.
> > > > 
> > > > Sorry, this is nonsense.  Allowing users to debug issues without a full
> > > > kernel rebuild is a good thing.
> > > 
> > > It is far from debug, this interface doesn't give you any answers why
> > > the reset didn't work, it just helps you to find the one that works.
> > > 
> > > Unless you believe that this information will be enough to understand
> > > the root cause, you will need to ask from the user to perform extra
> > > tests, maybe try some quirk. All of that requires from the users to
> > > rebuild their kernel.
> > > 
> > > So no, it is not debug.  
> > 
> > It allows a user to experiment to determine (a) my device doesn't work
> > in a given scenario with the default configuration, but (b) if I change
> > the reset to this other thing it does work.  That is a step in
> > debugging.
> > 
> > It's absurd to think that a sysfs attribute could provide root cause,
> > but it might be enough for someone to further help that user.  It would
> > be a useful clue for a bug report.  Yes, reaching root cause might
> > involve building a kernel, but that doesn't invalidate that having a
> > step towards debugging in the base kernel might be a useful tool.  
> 
> Let's agree to do not agree.
> 
> >   
> > > > > > > > For policy preference, I already described how I've configured 
> > > > > > > > QEMU to
> > > > > > > > prefer a bus reset rather than a PM reset due to lack of 
> > > > > > > > specification
> > > > > > > > regarding the scope of a PM "soft reset".  This interface would 
> > > > > > > > allow a
> > > > > > > > system policy to do that same thing.
> > > > > > > > 
> > > > > > > > I don't think anyone is suggesting this as a means to avoid 
> > > > > > > > quirks that
> > > > > > > > would resolve reset issues and create the best default general 
> > > > > > > > behavior.
> > > > > > > > This provides a mechanism to test various reset methods, and 
> > > > > > > > thereby
> > > > > > > > identify broken methods, and set a policy.  Sure, that policy 
> > > > > > > > might be
> > > > > > > > to avoid a broken reset in the interim before it gets quirked 
> > > > > > > > and
> > > > > > > > there's potential for abuse there, but I think the benefits 
> > > > > > > > outweigh
> > > > > > > > the risks.
> > > > > > > 
> > > > > > > This interface is proposed as first class citizen in the general 
> > > > > > > sysfs
> > > > > > > layout. Of course, it will be seen as a way to bypass the kernel.
> > > > > > > 
> > > > > > > At least, put it under CONFIG_EXPERT option, so no distro will 
> > > > > > > enable it
> > > > > > > by default.  
> > > > > > 
> > > > > > Of course we're proposing it to be accessible, it should also 
> > > > > > require
> > > > > > admin privileges to modify, sysfs has lots of such things.  If it's
> > > > > > relegated to non-default accessibility, it won't be used for testing
> > > > > > and it won't be available for system policy and it's pointless. 
> > > > > >  
> > > > > 
> > > > > We probably have difference in view of what testing is. I expect from
> > > > > the users who experience issues with reset to do extra steps and one 
> > > > > of
> > > > > them is to require from them to compile their kernel.
> > > > 
> > >

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-25 Thread Alex Williamson
On Thu, 25 Mar 2021 10:37:54 +0200
Leon Romanovsky  wrote:

> On Wed, Mar 24, 2021 at 11:17:29AM -0600, Alex Williamson wrote:
> > On Wed, 24 Mar 2021 17:13:56 +0200
> > Leon Romanovsky  wrote:  
> 
> <...>
> 
> > > Yes, and real testing/debugging almost always requires kernel rebuild.
> > > Everything else is waste of time.  
> > 
> > Sorry, this is nonsense.  Allowing users to debug issues without a full
> > kernel rebuild is a good thing.  
> 
> It is far from debug, this interface doesn't give you any answers why
> the reset didn't work, it just helps you to find the one that works.
> 
> Unless you believe that this information will be enough to understand
> the root cause, you will need to ask from the user to perform extra
> tests, maybe try some quirk. All of that requires from the users to
> rebuild their kernel.
> 
> So no, it is not debug.

It allows a user to experiment to determine (a) my device doesn't work
in a given scenario with the default configuration, but (b) if I change
the reset to this other thing it does work.  That is a step in
debugging.

It's absurd to think that a sysfs attribute could provide root cause,
but it might be enough for someone to further help that user.  It would
be a useful clue for a bug report.  Yes, reaching root cause might
involve building a kernel, but that doesn't invalidate that having a
step towards debugging in the base kernel might be a useful tool.

> > > > > > For policy preference, I already described how I've configured QEMU 
> > > > > > to
> > > > > > prefer a bus reset rather than a PM reset due to lack of 
> > > > > > specification
> > > > > > regarding the scope of a PM "soft reset".  This interface would 
> > > > > > allow a
> > > > > > system policy to do that same thing.
> > > > > > 
> > > > > > I don't think anyone is suggesting this as a means to avoid quirks 
> > > > > > that
> > > > > > would resolve reset issues and create the best default general 
> > > > > > behavior.
> > > > > > This provides a mechanism to test various reset methods, and thereby
> > > > > > identify broken methods, and set a policy.  Sure, that policy might 
> > > > > > be
> > > > > > to avoid a broken reset in the interim before it gets quirked and
> > > > > > there's potential for abuse there, but I think the benefits outweigh
> > > > > > the risks.  
> > > > > 
> > > > > This interface is proposed as first class citizen in the general sysfs
> > > > > layout. Of course, it will be seen as a way to bypass the kernel.
> > > > > 
> > > > > At least, put it under CONFIG_EXPERT option, so no distro will enable 
> > > > > it
> > > > > by default.
> > > > 
> > > > Of course we're proposing it to be accessible, it should also require
> > > > admin privileges to modify, sysfs has lots of such things.  If it's
> > > > relegated to non-default accessibility, it won't be used for testing
> > > > and it won't be available for system policy and it's pointless.
> > > 
> > > We probably have difference in view of what testing is. I expect from
> > > the users who experience issues with reset to do extra steps and one of
> > > them is to require from them to compile their kernel.  
> > 
> > I would define the ability to generate a CI test that can pick a
> > device, unbind it from its driver, and iterate reset methods as a
> > worthwhile improvement in testing.  
> 
> Who is going to run this CI? At least all kernel CIs (external and
> internal to HW vendors) that I'm familiar are building kernel themselves.
> 
> Distro kernel is too bloat to be really usable for CI.

At this point I'm suspicious you're trolling.  A distro kernel CI
certainly uses the kernel they intend to ship and support in their
environment. You're concerned about a bloated kernel, but the proposal
here adds 2-bytes per device to track reset methods and a trivial array
in text memory, meanwhile you're proposing multiple per-device memory
allocations to enhance the feature you think is too bloated for CI.

> > > The root permissions doesn't protect from anything, SO lovers will use
> > > root without even thinking twice.  
> > 
> > Yes, with great power comes great responsibility.  Many admins ignore
> > this.  That's far beyond the scope of this series.  
> 
> <...>
> 
> > > I'm 
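
For a concrete picture of the interface being debated: the proposal exposes
a per-device reset_method sysfs attribute that reports the enabled methods
and accepts a new selection.  A userspace sketch, assuming the attribute
path and the tokens discussed in this thread ("default", "none", or a
specific method name):

#include <stdio.h>

/* Show the enabled reset methods for a device, then select one. */
static int demo_set_reset_method(const char *bdf, const char *method)
{
	char path[128], buf[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/reset_method", bdf);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		printf("current: %s", buf);
	fclose(f);

	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", method);	/* e.g. "flr", "bus", or "default" */
	return fclose(f);
}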

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-24 Thread Alex Williamson
On Wed, 24 Mar 2021 17:13:56 +0200
Leon Romanovsky  wrote:

> On Wed, Mar 24, 2021 at 08:37:43AM -0600, Alex Williamson wrote:
> > On Wed, 24 Mar 2021 12:03:00 +0200
> > Leon Romanovsky  wrote:
> >   
> > > On Mon, Mar 22, 2021 at 11:10:03AM -0600, Alex Williamson wrote:  
> > > > On Sun, 21 Mar 2021 10:40:55 +0200
> > > > Leon Romanovsky  wrote:
> > > > 
> > > > > On Sat, Mar 20, 2021 at 08:59:42AM -0600, Alex Williamson wrote:
> > > > > > On Sat, 20 Mar 2021 11:10:08 +0200
> > > > > > Leon Romanovsky  wrote:  
> > > > > > > On Fri, Mar 19, 2021 at 10:23:13AM -0600, Alex Williamson wrote:  
> > > > > > >  
> > > > > > > > 
> > > > > > > > What if we taint the kernel or pci_warn() for cases where 
> > > > > > > > either all
> > > > > > > > the reset methods are disabled, ie. 'echo none > reset_method', 
> > > > > > > > or any
> > > > > > > > time a device specific method is disabled?
> > > > > > > 
> > > > > > > What does it mean "none"? Does it mean nothing supported? If yes, 
> > > > > > > I think that
> > > > > > > pci_warn() will be enough. At least for me, taint is usable 
> > > > > > > during debug stages,
> > > > > > > probably if device doesn't crash no one will look to see 
> > > > > > > /proc/sys/kernel/tainted.  
> > > > > > 
> > > > > > "none" as implemented in this patch, clearing the enabled function
> > > > > > reset methods.  
> > > > > 
> > > > > It is far from intuitive, the empty string will be easier to 
> > > > > understand,
> > > > > because "none" means no reset at all.
> > > > 
> > > > "No reset at all" is what "none" achieves, the
> > > > pci_dev.reset_methods_enabled bitmap is cleared.  We can use an empty
> > > > string, but I think we want a way to clear all enabled resets and a way
> > > > to return it to the default.  I could see arguments for an empty string
> > > > serving either purpose, so this version proposed explicitly using
> > > > "none" and "default", as included in the ABI update.
> > > 
> > > I will stick with "default" only and leave "none" for something else.  
> > 
> > Are you suggesting writing "default" restores the unmodified behavior
> > and writing an empty string clears all enabled reset methods?
> >
> > > > > > > > I'd almost go so far as to prevent disabling a device specific 
> > > > > > > > reset
> > > > > > > > altogether, but for example should a device specific reset that 
> > > > > > > > fixes
> > > > > > > > an aspect of FLR behavior prevent using a bus reset?  I'd 
> > > > > > > > prefer in that
> > > > > > > > case if direct FLR were disabled via a device flag introduced 
> > > > > > > > with the
> > > > > > > > quirk and the remaining resets can still be selected by 
> > > > > > > > preference.
> > > > > > > 
> > > > > > > I don't know enough to discuss the PCI details, but you raised 
> > > > > > > good point.
> > > > > > > This sysfs is user visible API that is presented as is from 
> > > > > > > device point
> > > > > > > of view. It can be easily run into problems if PCI/core doesn't 
> > > > > > > work with
> > > > > > > user's choice.
> > > > > > >   
> > > > > > > > 
> > > > > > > > Theoretically all the other reset methods work and are 
> > > > > > > > available, it's
> > > > > > > > only a policy decision which to use, right?
> > > > > > > 
> > > > > > > But this patch was presented as a way to overcome situations where
> > > > > > > supported != working and user magically knows which reset type to 
> > > > > > > set.  
> > > > > > 
> > > > > > It's not 

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-24 Thread Alex Williamson
On Wed, 24 Mar 2021 12:03:00 +0200
Leon Romanovsky  wrote:

> On Mon, Mar 22, 2021 at 11:10:03AM -0600, Alex Williamson wrote:
> > On Sun, 21 Mar 2021 10:40:55 +0200
> > Leon Romanovsky  wrote:
> >   
> > > On Sat, Mar 20, 2021 at 08:59:42AM -0600, Alex Williamson wrote:  
> > > > On Sat, 20 Mar 2021 11:10:08 +0200
> > > > Leon Romanovsky  wrote:    
> > > > > On Fri, Mar 19, 2021 at 10:23:13AM -0600, Alex Williamson wrote: 
> > > > > > 
> > > > > > What if we taint the kernel or pci_warn() for cases where either all
> > > > > > the reset methods are disabled, ie. 'echo none > reset_method', or 
> > > > > > any
> > > > > > time a device specific method is disabled?  
> > > > > 
> > > > > What does it mean "none"? Does it mean nothing supported? If yes, I 
> > > > > think that
> > > > > pci_warn() will be enough. At least for me, taint is usable during 
> > > > > debug stages,
> > > > > probably if device doesn't crash no one will look to see 
> > > > > /proc/sys/kernel/tainted.
> > > > 
> > > > "none" as implemented in this patch, clearing the enabled function
> > > > reset methods.
> > > 
> > > It is far from intuitive, the empty string will be easier to understand,
> > > because "none" means no reset at all.  
> > 
> > "No reset at all" is what "none" achieves, the
> > pci_dev.reset_methods_enabled bitmap is cleared.  We can use an empty
> > string, but I think we want a way to clear all enabled resets and a way
> > to return it to the default.  I could see arguments for an empty string
> > serving either purpose, so this version proposed explicitly using
> > "none" and "default", as included in the ABI update.  
> 
> I will stick with "default" only and leave "none" for something else.

Are you suggesting writing "default" restores the unmodified behavior
and writing an empty string clears all enabled reset methods?
 
> > > > > > I'd almost go so far as to prevent disabling a device specific reset
> > > > > > altogether, but for example should a device specific reset that 
> > > > > > fixes
> > > > > > an aspect of FLR behavior prevent using a bus reset?  I'd prefer in 
> > > > > > that
> > > > > > case if direct FLR were disabled via a device flag introduced with 
> > > > > > the
> > > > > > quirk and the remaining resets can still be selected by preference. 
> > > > > >  
> > > > > 
> > > > > I don't know enough to discuss the PCI details, but you raised good 
> > > > > point.
> > > > > This sysfs is user visible API that is presented as is from device 
> > > > > point
> > > > > of view. It can be easily run into problems if PCI/core doesn't work 
> > > > > with
> > > > > user's choice.
> > > > > 
> > > > > > 
> > > > > > Theoretically all the other reset methods work and are available, 
> > > > > > it's
> > > > > > only a policy decision which to use, right?  
> > > > > 
> > > > > But this patch was presented as a way to overcome situations where
> > > > > supported != working and user magically knows which reset type to 
> > > > > set.
> > > > 
> > > > It's not magic, the new sysfs attributes expose which resets are
> > > > enabled and the order that they're used, the user can simply select the
> > > > next one.  Being able to bypass a broken reset method is a helpful side
> > > > effect of getting to select a preferred reset method.
> > > 
> > > Magic in a sense that user has no idea what those resets mean, the
> > > expectation is that he will blindly iterate till something works.  
> > 
> > Which ought to actually be a safe thing to do.  We should have quirks to
> > exclude resets that are known broken but still probe as present and I'd
> > be perfectly fine if we issue a warning if the user disables all resets
> > for a given device.
> >
> > > > > If you want to take this patch to be policy decision tool,
> > > > > it will need to accept "reset_type1,reset_type2,..." sort of input,
> > > >

Re: [PATCH v2 3/3] vfio/type1: Batch page pinning

2021-03-23 Thread Alex Williamson
On Tue, 23 Mar 2021 18:25:45 -0400
Daniel Jordan  wrote:

> Hi Alex,
> 
> Alex Williamson  writes:
> > I've found a bug in this patch that we need to fix.  The diff is a
> > little difficult to follow,  
> 
> It was an awful diff, I remember...
> 
> > so I'll discuss it in the resulting function below...
> >
> > (1) Imagine the user has passed a vaddr range that alternates pfnmaps
> > and pinned memory per page.
> >
> >
> > static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> >   long npage, unsigned long *pfn_base,
> >   unsigned long limit, struct vfio_batch 
> > *batch)
> > {
> > unsigned long pfn;
> > struct mm_struct *mm = current->mm;
> > long ret, pinned = 0, lock_acct = 0;
> > bool rsvd;
> > dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
> >
> > /* This code path is only user initiated */
> > if (!mm)
> > return -ENODEV;
> >
> > if (batch->size) {
> > /* Leftover pages in batch from an earlier call. */
> > *pfn_base = page_to_pfn(batch->pages[batch->offset]);
> > pfn = *pfn_base;
> > rsvd = is_invalid_reserved_pfn(*pfn_base);
> >
> > (4) We're called again and fill our local variables from the batch.  The
> > batch only has one page, so we'll complete the inner loop below and 
> > refill.
> >
> > (6) We're called again, batch->size is 1, but it was for a pfnmap, the pages
> > array still contains the last pinned page, so we end up incorrectly 
> > using
> > this pfn again for the next entry.
> >
> > } else {
> > *pfn_base = 0;
> > }
> >
> > while (npage) {
> > if (!batch->size) {
> > /* Empty batch, so refill it. */
> > long req_pages = min_t(long, npage, 
> > batch->capacity);
> >
> > ret = vaddr_get_pfns(mm, vaddr, req_pages, 
> > dma->prot,
> >  &pfn, batch->pages);
> > if (ret < 0)
> > goto unpin_out;
> >
> > (2) Assume the 1st page is pfnmap, the 2nd is pinned memory  
> 
> Just to check we're on the same wavelength, I think you can hit this bug
> with one less call to vfio_pin_pages_remote() if the 1st page in the
> vaddr range is pinned memory and the 2nd is pfnmap.  Then you'd have the
> following sequence:
> 
> vfio_pin_pages_remote() call #1:
> 
>  - In the first batch refill, you'd get a size=1 batch with pinned
>memory and complete the inner loop, breaking at "if (!batch->size)".
>
>  - In the second batch refill, you'd get another size=1 batch with a
>pfnmap page, take the "goto unpin_out" in the inner loop, and return
>from the function with the batch still containing a single pfnmap
>page.
> 
> vfio_pin_pages_remote() call #2:
> 
>  - *pfn_base is set from the first element of the pages array, which
> unfortunately has the non-pfnmap pfn.
> 
> Did I follow you?

Yep, I should have simplified to skip the first mapping, but I was also
trying to make sure I made sense of the test case I was playing with
that triggered it.  The important transition is pinned memory to pfnmap
since that lets us return with non-zero batch size and stale data in
the pages array.

> >
> > batch->size = ret;
> > batch->offset = 0;
> >
> > if (!*pfn_base) {
> > *pfn_base = pfn;
> > rsvd = is_invalid_reserved_pfn(*pfn_base);
> > }
> > }
> >
> > /*
> >  * pfn is preset for the first iteration of this inner loop 
> > and
> >  * updated at the end to handle a VM_PFNMAP pfn.  In that 
> > case,
> >  * batch->pages isn't valid (there's no struct page), so 
> > allow
> >  * batch->pages to be touched only when there's more than 
> > one
> >  * pfn to check, which guarantees the pfns are from a
> >  * !VM_PFNMAP vma.
> >  */
> > while (true) {
> > if (pfn != *pfn_base + pin

Re: [PATCH v2 3/3] vfio/type1: Batch page pinning

2021-03-23 Thread Alex Williamson
Hi Daniel,

I've found a bug in this patch that we need to fix.  The diff is a
little difficult to follow, so I'll discuss it in the resulting
function below...

(1) Imagine the user has passed a vaddr range that alternates pfnmaps
and pinned memory per page.


static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
  long npage, unsigned long *pfn_base,
  unsigned long limit, struct vfio_batch *batch)
{
unsigned long pfn;
struct mm_struct *mm = current->mm;
long ret, pinned = 0, lock_acct = 0;
bool rsvd;
dma_addr_t iova = vaddr - dma->vaddr + dma->iova;

/* This code path is only user initiated */
if (!mm)
return -ENODEV;

if (batch->size) {
/* Leftover pages in batch from an earlier call. */
*pfn_base = page_to_pfn(batch->pages[batch->offset]);
pfn = *pfn_base;
rsvd = is_invalid_reserved_pfn(*pfn_base);

(4) We're called again and fill our local variables from the batch.  The
batch only has one page, so we'll complete the inner loop below and refill.

(6) We're called again, batch->size is 1, but it was for a pfnmap, the pages
array still contains the last pinned page, so we end up incorrectly using
this pfn again for the next entry.

} else {
*pfn_base = 0;
}

while (npage) {
if (!batch->size) {
/* Empty batch, so refill it. */
long req_pages = min_t(long, npage, batch->capacity);

ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot,
				 &pfn, batch->pages);
if (ret < 0)
goto unpin_out;

(2) Assume the 1st page is pfnmap, the 2nd is pinned memory

batch->size = ret;
batch->offset = 0;

if (!*pfn_base) {
*pfn_base = pfn;
rsvd = is_invalid_reserved_pfn(*pfn_base);
}
}

/*
 * pfn is preset for the first iteration of this inner loop and
 * updated at the end to handle a VM_PFNMAP pfn.  In that case,
 * batch->pages isn't valid (there's no struct page), so allow
 * batch->pages to be touched only when there's more than one
 * pfn to check, which guarantees the pfns are from a
 * !VM_PFNMAP vma.
 */
while (true) {
if (pfn != *pfn_base + pinned ||
rsvd != is_invalid_reserved_pfn(pfn))
goto out;

(3) On the 2nd page, both tests are probably true here, so we take this goto,
leaving the batch with the next page.

(5) Now we've refilled batch, but the next page is pfnmap, so likely both of the
above tests are true... but this is a pfnmap'ing!

(7) Do we add something like if (batch->size == 1 && !batch->offset) {
put_pfn(pfn, dma->prot); batch->size = 0; }?

/*
 * Reserved pages aren't counted against the user,
 * externally pinned pages are already counted against
 * the user.
 */
if (!rsvd && !vfio_find_vpfn(dma, iova)) {
if (!dma->lock_cap &&
mm->locked_vm + lock_acct + 1 > limit) {
				pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
					__func__, limit << PAGE_SHIFT);
ret = -ENOMEM;
goto unpin_out;
}
lock_acct++;
}

pinned++;
npage--;
vaddr += PAGE_SIZE;
iova += PAGE_SIZE;
batch->offset++;
batch->size--;

if (!batch->size)
break;

pfn = page_to_pfn(batch->pages[batch->offset]);
}

if (unlikely(disable_hugepages))
break;
}

out:
ret = vfio_lock_acct(dma, lock_acct, false);

unpin_out:
if (ret < 0) {
if (pinned && !rsvd) {
for (pfn = *pfn_base ; pinned ; pfn++, pinned--)
put_pfn(pfn, dma->prot);
}
vfio_batch_unpin(batch, dma);

(8) These calls to batch_unpin are rather precarious as well, any time 
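
Spelled out, the guard suggested in step (7) would sit in the mismatch path
so a lone leftover pfnmap entry is dropped rather than left behind in
batch->pages.  This is a sketch of that suggestion only, not necessarily the
fix that was ultimately merged:

		while (true) {
			if (pfn != *pfn_base + pinned ||
			    rsvd != is_invalid_reserved_pfn(pfn)) {
				/*
				 * If the freshly refilled batch holds a single
				 * (likely pfnmap) entry, release it and empty
				 * the batch so a later call cannot reuse the
				 * stale pages[] slot.
				 */
				if (batch->size == 1 && !batch->offset) {
					put_pfn(pfn, dma->prot);
					batch->size = 0;
				}
				goto out;
			}
			...
		}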

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-23 Thread Alex Williamson
On Tue, 23 Mar 2021 10:06:25 -0600
Alex Williamson  wrote:

> On Tue, 23 Mar 2021 21:02:21 +0530
> Amey Narkhede  wrote:
> 
> > On 21/03/23 08:44AM, Alex Williamson wrote:  
> > > On Tue, 23 Mar 2021 15:34:19 +0100
> > > Pali Rohár  wrote:
> > >
> > > > On Thursday 18 March 2021 20:01:55 Amey Narkhede wrote:
> > > > > On 21/03/17 09:13PM, Pali Rohár wrote:
> > > > > > On Wednesday 17 March 2021 14:00:20 Alex Williamson wrote:
> > > > > > > On Wed, 17 Mar 2021 20:40:24 +0100
> > > > > > > Pali Rohár  wrote:
> > > > > > >
> > > > > > > > On Wednesday 17 March 2021 13:32:45 Alex Williamson wrote:    
> > > > > > > > > On Wed, 17 Mar 2021 20:24:24 +0100
> > > > > > > > > Pali Rohár  wrote:
> > > > > > > > >
> > > > > > > > > > On Wednesday 17 March 2021 13:15:36 Alex Williamson wrote:  
> > > > > > > > > >   
> > > > > > > > > > > On Wed, 17 Mar 2021 20:02:06 +0100
> > > > > > > > > > > Pali Rohár  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Monday 15 March 2021 09:03:39 Alex Williamson wrote: 
> > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, 15 Mar 2021 15:52:38 +0100
> > > > > > > > > > > > > Pali Rohár  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Monday 15 March 2021 08:34:09 Alex Williamson 
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > On Mon, 15 Mar 2021 14:52:26 +0100
> > > > > > > > > > > > > > > Pali Rohár  wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Monday 15 March 2021 19:13:23 Amey Narkhede 
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > slot reset (pci_dev_reset_slot_function) and 
> > > > > > > > > > > > > > > > > secondary bus
> > > > > > > > > > > > > > > > > reset(pci_parent_bus_reset) which I think are 
> > > > > > > > > > > > > > > > > hot reset and
> > > > > > > > > > > > > > > > > warm reset respectively.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > No. PCI secondary bus reset = PCIe Hot Reset. 
> > > > > > > > > > > > > > > > Slot reset is just another
> > > > > > > > > > > > > > > > type of reset, which is currently implemented 
> > > > > > > > > > > > > > > > only for PCIe hot plug
> > > > > > > > > > > > > > > > bridges and for PowerPC PowerNV platform and it 
> > > > > > > > > > > > > > > > just call PCI secondary
> > > > > > > > > > > > > > > > bus reset with some other hook. PCIe Warm Reset 
> > > > > > > > > > > > > > > > does not have API in
> > > > > > > > > > > > > > > > kernel and therefore drivers do not export this 
> > > > > > > > > > > > > > > > type of reset via any
> > > > > > > > > > > > > > > > kernel function (yet).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Warm reset is beyond the scope of this series, 
> > > > > > > > > > > > > > > but could be implemented
> > > > > > > > > > > > > > > in a compatible way to fit within the 
> > > > > > > > > > > > > > > pci_reset_fn_methods[] array
> > > 

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-23 Thread Alex Williamson
On Tue, 23 Mar 2021 21:02:21 +0530
Amey Narkhede  wrote:

> On 21/03/23 08:44AM, Alex Williamson wrote:
> > On Tue, 23 Mar 2021 15:34:19 +0100
> > Pali Rohár  wrote:
> >  
> > > On Thursday 18 March 2021 20:01:55 Amey Narkhede wrote:  
> > > > On 21/03/17 09:13PM, Pali Rohár wrote:  
> > > > > On Wednesday 17 March 2021 14:00:20 Alex Williamson wrote:  
> > > > > > On Wed, 17 Mar 2021 20:40:24 +0100
> > > > > > Pali Rohár  wrote:
> > > > > >  
> > > > > > > On Wednesday 17 March 2021 13:32:45 Alex Williamson wrote:  
> > > > > > > > On Wed, 17 Mar 2021 20:24:24 +0100
> > > > > > > > Pali Rohár  wrote:
> > > > > > > >  
> > > > > > > > > On Wednesday 17 March 2021 13:15:36 Alex Williamson wrote:  
> > > > > > > > > > On Wed, 17 Mar 2021 20:02:06 +0100
> > > > > > > > > > Pali Rohár  wrote:
> > > > > > > > > >  
> > > > > > > > > > > On Monday 15 March 2021 09:03:39 Alex Williamson wrote:  
> > > > > > > > > > > > On Mon, 15 Mar 2021 15:52:38 +0100
> > > > > > > > > > > > Pali Rohár  wrote:
> > > > > > > > > > > >  
> > > > > > > > > > > > > On Monday 15 March 2021 08:34:09 Alex Williamson 
> > > > > > > > > > > > > wrote:  
> > > > > > > > > > > > > > On Mon, 15 Mar 2021 14:52:26 +0100
> > > > > > > > > > > > > > Pali Rohár  wrote:
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > On Monday 15 March 2021 19:13:23 Amey Narkhede 
> > > > > > > > > > > > > > > wrote:  
> > > > > > > > > > > > > > > > slot reset (pci_dev_reset_slot_function) and 
> > > > > > > > > > > > > > > > secondary bus
> > > > > > > > > > > > > > > > reset(pci_parent_bus_reset) which I think are 
> > > > > > > > > > > > > > > > hot reset and
> > > > > > > > > > > > > > > > warm reset respectively.  
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > No. PCI secondary bus reset = PCIe Hot Reset. 
> > > > > > > > > > > > > > > Slot reset is just another
> > > > > > > > > > > > > > > type of reset, which is currently implemented 
> > > > > > > > > > > > > > > only for PCIe hot plug
> > > > > > > > > > > > > > > bridges and for PowerPC PowerNV platform and it 
> > > > > > > > > > > > > > > just call PCI secondary
> > > > > > > > > > > > > > > bus reset with some other hook. PCIe Warm Reset 
> > > > > > > > > > > > > > > does not have API in
> > > > > > > > > > > > > > > kernel and therefore drivers do not export this 
> > > > > > > > > > > > > > > type of reset via any
> > > > > > > > > > > > > > > kernel function (yet).  
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Warm reset is beyond the scope of this series, but 
> > > > > > > > > > > > > > could be implemented
> > > > > > > > > > > > > > in a compatible way to fit within the 
> > > > > > > > > > > > > > pci_reset_fn_methods[] array
> > > > > > > > > > > > > > defined here.  
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ok!
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > Note that with this series the resets available 
> > > > > > > > > > > > > >

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-23 Thread Alex Williamson
On Tue, 23 Mar 2021 15:34:19 +0100
Pali Rohár  wrote:

> On Thursday 18 March 2021 20:01:55 Amey Narkhede wrote:
> > On 21/03/17 09:13PM, Pali Rohár wrote:  
> > > On Wednesday 17 March 2021 14:00:20 Alex Williamson wrote:  
> > > > On Wed, 17 Mar 2021 20:40:24 +0100
> > > > Pali Rohár  wrote:
> > > >  
> > > > > On Wednesday 17 March 2021 13:32:45 Alex Williamson wrote:  
> > > > > > On Wed, 17 Mar 2021 20:24:24 +0100
> > > > > > Pali Rohár  wrote:
> > > > > >  
> > > > > > > On Wednesday 17 March 2021 13:15:36 Alex Williamson wrote:  
> > > > > > > > On Wed, 17 Mar 2021 20:02:06 +0100
> > > > > > > > Pali Rohár  wrote:
> > > > > > > >  
> > > > > > > > > On Monday 15 March 2021 09:03:39 Alex Williamson wrote:  
> > > > > > > > > > On Mon, 15 Mar 2021 15:52:38 +0100
> > > > > > > > > > Pali Rohár  wrote:
> > > > > > > > > >  
> > > > > > > > > > > On Monday 15 March 2021 08:34:09 Alex Williamson wrote:  
> > > > > > > > > > > > On Mon, 15 Mar 2021 14:52:26 +0100
> > > > > > > > > > > > Pali Rohár  wrote:
> > > > > > > > > > > >  
> > > > > > > > > > > > > On Monday 15 March 2021 19:13:23 Amey Narkhede wrote: 
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > slot reset (pci_dev_reset_slot_function) and 
> > > > > > > > > > > > > > secondary bus
> > > > > > > > > > > > > > reset(pci_parent_bus_reset) which I think are hot 
> > > > > > > > > > > > > > reset and
> > > > > > > > > > > > > > warm reset respectively.  
> > > > > > > > > > > > >
> > > > > > > > > > > > > No. PCI secondary bus reset = PCIe Hot Reset. Slot 
> > > > > > > > > > > > > reset is just another
> > > > > > > > > > > > > type of reset, which is currently implemented only 
> > > > > > > > > > > > > for PCIe hot plug
> > > > > > > > > > > > > bridges and for PowerPC PowerNV platform and it just 
> > > > > > > > > > > > > call PCI secondary
> > > > > > > > > > > > > bus reset with some other hook. PCIe Warm Reset does 
> > > > > > > > > > > > > not have API in
> > > > > > > > > > > > > kernel and therefore drivers do not export this type 
> > > > > > > > > > > > > of reset via any
> > > > > > > > > > > > > kernel function (yet).  
> > > > > > > > > > > >
> > > > > > > > > > > > Warm reset is beyond the scope of this series, but 
> > > > > > > > > > > > could be implemented
> > > > > > > > > > > > in a compatible way to fit within the 
> > > > > > > > > > > > pci_reset_fn_methods[] array
> > > > > > > > > > > > defined here.  
> > > > > > > > > > >
> > > > > > > > > > > Ok!
> > > > > > > > > > >  
> > > > > > > > > > > > Note that with this series the resets available through
> > > > > > > > > > > > pci_reset_function() and the per device reset attribute 
> > > > > > > > > > > > is sysfs remain
> > > > > > > > > > > > exactly the same as they are currently.  The bus and 
> > > > > > > > > > > > slot reset
> > > > > > > > > > > > methods used here are limited to devices where only a 
> > > > > > > > > > > > single function is
> > > > > > > > > > > > affected by the reset, therefore it is not like the 
> > > > > > > > > > > > patch you proposed
> > > > > > > > > &g
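
For readers without the series in front of them, the pci_reset_fn_methods[]
array referenced here is an ordered table of named reset callbacks that the
core walks until one succeeds.  A sketch of that shape; the demo_ entries
are placeholders and the exact entries and probe type differ across
revisions of the series:

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/pci.h>

struct demo_reset_fn_method {
	int (*reset_fn)(struct pci_dev *pdev, int probe);
	const char *name;
};

/* Placeholder callbacks; the real table wires in FLR, AF-FLR, PM, slot and
 * bus reset helpers in priority order. */
static int demo_flr(struct pci_dev *pdev, int probe)       { return -ENOTTY; }
static int demo_pm_reset(struct pci_dev *pdev, int probe)  { return -ENOTTY; }
static int demo_bus_reset(struct pci_dev *pdev, int probe) { return -ENOTTY; }

static const struct demo_reset_fn_method demo_methods[] = {
	{ demo_flr,       "flr" },
	{ demo_pm_reset,  "pm"  },
	{ demo_bus_reset, "bus" },
};

/* Try each method in order, in the spirit of __pci_reset_function_locked(). */
static int demo_reset_function(struct pci_dev *pdev)
{
	int i, rc = -ENOTTY;

	for (i = 0; i < ARRAY_SIZE(demo_methods); i++) {
		rc = demo_methods[i].reset_fn(pdev, 0);
		if (rc != -ENOTTY)
			return rc;
	}
	return rc;
}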

Re: [PATCH 1/2] vfio/pci: remove vfio_pci_nvlink2

2021-03-22 Thread Alex Williamson
On Mon, 22 Mar 2021 16:01:54 +0100
Christoph Hellwig  wrote:
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8ce36c1d53ca11..db7e782419d5d9 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -332,19 +332,6 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG   (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG(3)
>  
> -/* 10de vendor PCI sub-types */
> -/*
> - * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
> - */
> -#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM   (1)
> -
> -/* 1014 vendor PCI sub-types */
> -/*
> - * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
> - * to do TLB invalidation on a GPU.
> - */
> -#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> -
>  /* sub-types for VFIO_REGION_TYPE_GFX */
>  #define VFIO_REGION_SUBTYPE_GFX_EDID(1)
>  
> @@ -637,33 +624,6 @@ struct vfio_device_migration_info {
>   */
>  #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE   3
>  
> -/*
> - * Capability with compressed real address (aka SSA - small system address)
> - * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing
> - * and by the userspace to associate a NVLink bridge with a GPU.
> - */
> -#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT  4
> -
> -struct vfio_region_info_cap_nvlink2_ssatgt {
> - struct vfio_info_cap_header header;
> - __u64 tgt;
> -};
> -
> -/*
> - * Capability with an NVLink link speed. The value is read by
> - * the NVlink2 bridge driver from the bridge's "ibm,nvlink-speed"
> - * property in the device tree. The value is fixed in the hardware
> - * and failing to provide the correct value results in the link
> - * not working with no indication from the driver why.
> - */
> -#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD  5
> -
> -struct vfio_region_info_cap_nvlink2_lnkspd {
> - struct vfio_info_cap_header header;
> - __u32 link_speed;
> - __u32 __pad;
> -};
> -
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>   *   struct vfio_irq_info)

I'll leave any attempt to defend keeping this code to Alexey, but
minimally these region sub-types and capability IDs should probably be
reserved to avoid breaking whatever userspace might exist to consume
these.  Our ID space is sufficiently large that we don't need to
recycle them any time soon.  Thanks,

Alex
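
Reserving the IDs rather than recycling them is a small header change; a
sketch of what the uapi comments could look like, based on the definitions
removed in the diff above (comment wording illustrative):

/* 10de vendor PCI sub-types */
/* Sub-type 1 was VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM; reserved, do not reuse. */

/* 1014 vendor PCI sub-types */
/* Sub-type 1 was VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD; reserved, do not reuse. */

/* Region info capability IDs 4 (NVLINK2_SSATGT) and 5 (NVLINK2_LNKSPD) are
 * likewise reserved rather than recycled. */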



Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-22 Thread Alex Williamson
On Sun, 21 Mar 2021 10:40:55 +0200
Leon Romanovsky  wrote:

> On Sat, Mar 20, 2021 at 08:59:42AM -0600, Alex Williamson wrote:
> > On Sat, 20 Mar 2021 11:10:08 +0200
> > Leon Romanovsky  wrote:  
> > > On Fri, Mar 19, 2021 at 10:23:13AM -0600, Alex Williamson wrote:   
> > > > 
> > > > What if we taint the kernel or pci_warn() for cases where either all
> > > > the reset methods are disabled, ie. 'echo none > reset_method', or any
> > > > time a device specific method is disabled?
> > > 
> > > What does it mean "none"? Does it mean nothing supported? If yes, I think 
> > > that
> > > pci_warn() will be enough. At least for me, taint is usable during debug 
> > > stages,
> > > probably if device doesn't crash no one will look to see 
> > > /proc/sys/kernel/tainted.  
> > 
> > "none" as implemented in this patch, clearing the enabled function
> > reset methods.  
> 
> It is far from intuitive, the empty string will be easier to understand,
> because "none" means no reset at all.

"No reset at all" is what "none" achieves, the
pci_dev.reset_methods_enabled bitmap is cleared.  We can use an empty
string, but I think we want a way to clear all enabled resets and a way
to return it to the default.  I could see arguments for an empty string
serving either purpose, so this version proposed explicitly using
"none" and "default", as included in the ABI update.

> > > > I'd almost go so far as to prevent disabling a device specific reset
> > > > altogether, but for example should a device specific reset that fixes
> > > > an aspect of FLR behavior prevent using a bus reset?  I'd prefer in that
> > > > case if direct FLR were disabled via a device flag introduced with the
> > > > quirk and the remaining resets can still be selected by preference.
> > > 
> > > I don't know enough to discuss the PCI details, but you raised good point.
> > > This sysfs is user visible API that is presented as is from device point
> > > of view. It can be easily run into problems if PCI/core doesn't work with
> > > user's choice.
> > >   
> > > > 
> > > > Theoretically all the other reset methods work and are available, it's
> > > > only a policy decision which to use, right?
> > > 
> > > But this patch was presented as a way to overcome situations where
> > > supported != working and user magically knows which reset type to set.  
> > 
> > It's not magic, the new sysfs attributes expose which resets are
> > enabled and the order that they're used, the user can simply select the
> > next one.  Being able to bypass a broken reset method is a helpful side
> > effect of getting to select a preferred reset method.  
> 
> Magic in a sense that user has no idea what those resets mean, the
> expectation is that he will blindly iterate till something works.

Which ought to actually be a safe thing to do.  We should have quirks to
exclude resets that are known broken but still probe as present and I'd
be perfectly fine if we issue a warning if the user disables all resets
for a given device.
 
> > > If you want to take this patch to be policy decision tool,
> > > it will need to accept "reset_type1,reset_type2,..." sort of input,
> > > so fallback will work natively.  
> > 
> > I don't see that as a requirement.  We have fall-through support in the
> > kernel, but for a given device we're really only ever going to make use
> > of one of those methods.  If a user knows enough about a device to have
> > a preference, I think it can be singular.  That also significantly
> > simplifies the interface and supporting code.  Thanks,  
> 
> I'm struggling to get requirements from this thread. You talked about
> policy decision to overtake fallback mechanism, Amey wanted to avoid
> quirks.
> 
> Do you have an example of such devices or we are talking about
> theoretical case?

Look at any device that already has a reset quirk and the process it
took to get there.  Those are more than just theoretical cases.

For policy preference, I already described how I've configured QEMU to
prefer a bus reset rather than a PM reset due to lack of specification
regarding the scope of a PM "soft reset".  This interface would allow a
system policy to do that same thing.

I don't think anyone is suggesting this as a means to avoid quirks that
would resolve reset issues and create the best default general behavior.
This provides a mechanism to test various reset methods, and thereby
identify broken methods,

Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-22 Thread Alex Williamson
On Sun, 21 Mar 2021 09:58:18 -0300
Jason Gunthorpe  wrote:

> On Fri, Mar 19, 2021 at 10:40:28PM -0600, Alex Williamson wrote:
> 
> > > Well, today we don't, but Max here adds id_table's to the special
> > > devices and a MODULE_DEVICE_TABLE would come too if we do the flavours
> > > thing below.  
> > 
> > I think the id_tables are the wrong approach for IGD and NVLink
> > variants.  
> 
> I really disagree with this. Checking for some random bits in firmware
> and assuming that every device made forever into the future works with
> this check is not a good way to do compatibility. Christoph made the
> same point.
> 
> We have good processes to maintain id tables, I don't see this as a
> problem.

The base driver we're discussing here is a meta-driver that binds to
any PCI endpoint as directed by the user.  There is no id_table.  There
can't be any id_table unless you're expecting every device vendor to
submit the exact subset of devices they have tested and condone usage
with this interface.  The IGD extensions here only extend that
interface by providing userspace read-only access to a few additional
pieces of information that we've found to be necessary for certain
userspace drivers.  The actual device interface is unchanged.  In the
case of the NVLink extensions, AIUI these are mostly extensions of a
firmware defined interface for managing aspects of the interconnect to
the device.  It is actually the "random bits in firmware" that we want
to expose, the ID of the device is somewhat tangential, we just only
look for those firmware extensions in association to certain vendor
devices.

Of course if you start looking at features like migration support,
that's more than likely not simply an additional region with optional
information, it would need to interact with the actual state of the
device.  For those, I would very much support use of a specific
id_table.  That's not these.

> > > As-is driver_override seems dangerous as overriding the matching table
> > > could surely allow root userspace to crash the machine. In situations
> > > with trusted boot/signed modules this shouldn't be.  
> > 
> > When we're dealing with meta-drivers that can bind to anything, we
> > shouldn't rely on the match, but should instead verify the driver is
> > appropriate in the probe callback.  Even without driver_override,
> > there's the new_id mechanism.  Either method allows the root user to
> > break driver binding.  Greg has previously stated something to the
> > effect that users get to keep all the pieces when they break something
> > by manipulating driver binding.  
> 
> Yes, but that is a view where root is allowed to break the kernel, we
> now have this optional other world where that is not allowed and root
> access to lots of dangerous things are now disabled.
> 
> new_id and driver_override should probably be in that disable list
> too..

We don't have this other world yet, nor is it clear that we will have
it.  What sort of id_table is the base vfio-pci driver expected to use?
There's always a risk that hardware doesn't adhere to the spec or that
platform firmware might escalate an error that we'd otherwise consider
mundane from a userspace driver.

> > > While that might not seem too bad with these simple drivers, at least
> > > the mlx5 migration driver will have a large dependency tree and pull
> > > in lots of other modules. Even Max's sample from v1 pulls in mlx5_core.ko
> > > and a bunch of other stuff in its orbit.  
> > 
> > Luckily the mlx5 driver doesn't need to be covered by compatibility
> > support, so we don't need to set a softdep for it and the module could
> > be named such that a wildcard driver_override of vfio_pci* shouldn't
> > logically include that driver.  Users can manually create their own
> > modprobe.d softdep entry if they'd like to include it.  Otherwise
> > userspace would need to know to bind to it specifically.  
> 
> But now you are giving up on the whole point, which was to
> automatically load the correct specific module without special admin
> involvement!

This series only exposed a temporary compatibility interface to provide
that anyway.  As I understood it, the long term solution was that
userspace would somehow learn which driver to use for which device.
That "somehow" isn't clear to me.

> > > This is why I want to try for fine grained autoloading first. It
> > > really is the elegant solution if we can work it out.  
> > 
> > I just don't see how we create a manageable change to userspace.  
> 
> I'm not sure I understand. Even if we add a new sysfs to set some
> flavour then that is a pretty trivial change for userspace to move
> from driver_ove

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-20 Thread Alex Williamson
On Sat, 20 Mar 2021 11:10:08 +0200
Leon Romanovsky  wrote:
> On Fri, Mar 19, 2021 at 10:23:13AM -0600, Alex Williamson wrote: 
> > 
> > What if we taint the kernel or pci_warn() for cases where either all
> > the reset methods are disabled, ie. 'echo none > reset_method', or any
> > time a device specific method is disabled?  
> 
> What does it mean "none"? Does it mean nothing supported? If yes, I think that
> pci_warn() will be enough. At least for me, taint is usable during debug 
> stages,
> probably if device doesn't crash no one will look to see 
> /proc/sys/kernel/tainted.

"none" as implemented in this patch, clearing the enabled function
reset methods.

> > I'd almost go so far as to prevent disabling a device specific reset
> > altogether, but for example should a device specific reset that fixes
> > an aspect of FLR behavior prevent using a bus reset?  I'd prefer in that
> > case if direct FLR were disabled via a device flag introduced with the
> > quirk and the remaining resets can still be selected by preference.  
> 
> I don't know enough to discuss the PCI details, but you raised good point.
> This sysfs is user visible API that is presented as is from device point
> of view. It can be easily run into problems if PCI/core doesn't work with
> user's choice.
> 
> > 
> > Theoretically all the other reset methods work and are available, it's
> > only a policy decision which to use, right?  
> 
> But this patch was presented as a way to overcome situations where
> supported != working and user magically knows which reset type to set.

It's not magic, the new sysfs attributes expose which resets are
enabled and the order that they're used, the user can simply select the
next one.  Being able to bypass a broken reset method is a helpful side
effect of getting to select a preferred reset method.

> If you want to take this patch to be policy decision tool,
> it will need to accept "reset_type1,reset_type2,..." sort of input,
> so fallback will work natively.

I don't see that as a requirement.  We have fall-through support in the
kernel, but for a given device we're really only ever going to make use
of one of those methods.  If a user knows enough about a device to have
a preference, I think it can be singular.  That also significantly
simplifies the interface and supporting code.  Thanks,

Alex



Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-19 Thread Alex Williamson
On Fri, 19 Mar 2021 19:59:43 -0300
Jason Gunthorpe  wrote:

> On Fri, Mar 19, 2021 at 03:08:09PM -0600, Alex Williamson wrote:
> > On Fri, 19 Mar 2021 17:07:49 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Fri, Mar 19, 2021 at 11:36:42AM -0600, Alex Williamson wrote:  
> > > > On Fri, 19 Mar 2021 17:34:49 +0100
> > > > Christoph Hellwig  wrote:
> > > > 
> > > > > On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:
> > > > > > The wrinkle I don't yet have an easy answer to is how to load 
> > > > > > vfio_pci
> > > > > > as a universal "default" within the driver core lazy bind scheme and
> > > > > > still have working module autoloading... I'm hoping to get some
> > > > > > research into this..  
> > > > 
> > > > What about using MODULE_SOFTDEP("pre: ...") in the vfio-pci base
> > > > driver, which would load all the known variants in order to influence
> > > > the match, and therefore probe ordering?
> > > 
> > > The way the driver core works is to first match against the already
> > > loaded driver list, then trigger an event for module loading and when
> > > new drivers are registered they bind to unbound devices.  
> > 
> > The former is based on id_tables, the latter on MODULE_DEVICE_TABLE, we
> > don't have either of those.  
> 
> Well, today we don't, but Max here adds id_table's to the special
> devices and a MODULE_DEVICE_TABLE would come too if we do the flavours
> thing below.

I think the id_tables are the wrong approach for IGD and NVLink
variants.
 
> My starting thinking is that everything should have these tables and
> they should work properly..

id_tables require ongoing maintenance whereas the existing variants
require only vendor + device class and some platform feature, like a
firmware or fdt table.  They're meant to only add extra regions to
vfio-pci base support, not extensively modify the device interface.
 
> > As noted to Christoph, the cases where we want a vfio driver to
> > bind to anything automatically is the exception.  
> 
> I agree vfio should not automatically claim devices, but once vfio is
> told to claim a device everything from there after should be
> automatic.
> 
> > > One answer is to have userspace udev have the "hook" here and when a
> > > vfio flavour mod alias is requested on a PCI device it swaps in
> > > vfio_pci if it can't find an alternative.
> > > 
> > > The dream would be a system with no vfio modules loaded could do some
> > > 
> > >  echo "vfio" > /sys/bus/pci/xxx/driver_flavour
> > > 
> > > And a module would be loaded and a struct vfio_device is created for
> > > that device. Very easy for the user.  
> > 
> > This is like switching a device to a parallel universe where we do
> > want vfio drivers to bind automatically to devices.  
> 
> Yes.
> 
> If we do this I'd probably suggest that driver_override be bumped down
> to some user compat and 'vfio > driver_override' would just set the
> flavour.
> 
> As-is driver_override seems dangerous as overriding the matching table
> could surely allow root userspace to crash the machine. In situations
> with trusted boot/signed modules this shouldn't be.

When we're dealing with meta-drivers that can bind to anything, we
shouldn't rely on the match, but should instead verify the driver is
appropriate in the probe callback.  Even without driver_override,
there's the new_id mechanism.  Either method allows the root user to
break driver binding.  Greg has previously stated something to the
effect that users get to keep all the pieces when they break something
by manipulating driver binding.

> > > > If we coupled that with wildcard support in driver_override, ex.
> > > > "vfio_pci*", and used consistent module naming, I think we'd only need
> > > > to teach userspace about this wildcard and binding to a specific module
> > > > would come for free.
> > > 
> > > What would the wildcard do?  
> > 
> > It allows a driver_override to match more than one driver, not too
> > dissimilar to your driver_flavor above.  In this case it would match
> > all driver names starting with "vfio_pci".  For example if we had:
> > 
> > softdep vfio-pci pre: vfio-pci-foo vfio-pci-bar
> >
> > Then we'd pre-seed the condition that drivers foo and bar precede the
> > base vfio-pci driver, each will match the device to the driver and have
> > an opportunity in the

Re: [PATCH v1 07/14] vfio: Add a device notifier interface

2021-03-19 Thread Alex Williamson
On Wed, 10 Mar 2021 07:56:39 +
Christoph Hellwig  wrote:

> On Mon, Mar 08, 2021 at 02:48:30PM -0700, Alex Williamson wrote:
> > Using a vfio device, a notifier block can be registered to receive
> > select device events.  Notifiers can only be registered for contained
> > devices, ie. they are available through a user context.  Registration
> > of a notifier increments the reference to that container context
> > therefore notifiers must minimally respond to the release event by
> > asynchronously removing notifiers.  
> 
> Notifiers generally are a horrible multiplexed API.  Can't we just
> add a proper method table for the intended communication channel?

I've been trying to figure out how, but I think not.  A user can have
multiple devices, each with entirely separate IOMMU contexts.  For each
device, the user can create an mmap of memory to that device and add it
to every other IOMMU context.  That enables peer to peer DMA between
all the devices, across all the IOMMU contexts.  But each individual
device has no direct reference to any IOMMU context other than its own.
A callback on the IOMMU can't reach those other contexts either, there's
no guarantee those other contexts are necessarily managed via the same
vfio IOMMU backend driver.  A notifier is the best I can come up with,
please suggest if you have other ideas.  Thanks,

Alex



Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-19 Thread Alex Williamson
On Fri, 19 Mar 2021 17:07:49 -0300
Jason Gunthorpe  wrote:

> On Fri, Mar 19, 2021 at 11:36:42AM -0600, Alex Williamson wrote:
> > On Fri, 19 Mar 2021 17:34:49 +0100
> > Christoph Hellwig  wrote:
> >   
> > > On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:  
> > > > The wrinkle I don't yet have an easy answer to is how to load vfio_pci
> > > > as a universal "default" within the driver core lazy bind scheme and
> > > > still have working module autoloading... I'm hoping to get some
> > > > research into this..
> > 
> > What about using MODULE_SOFTDEP("pre: ...") in the vfio-pci base
> > driver, which would load all the known variants in order to influence
> > the match, and therefore probe ordering?  
> 
> The way the driver core works is to first match against the already
> loaded driver list, then trigger an event for module loading and when
> new drivers are registered they bind to unbound devices.

The former is based on id_tables, the latter on MODULE_DEVICE_TABLE, we
don't have either of those.  As noted to Christoph, the cases where we
want a vfio driver to bind to anything automatically are the exception.
 
> So, the trouble is the event through userspace because the kernel
> can't just go on to use vfio_pci until it knows userspace has failed
> to satisfy the load request.

Given that we don't use MODULE_DEVICE_TABLE, vfio-pci doesn't autoload.
AFAIK, all tools like libvirt and driverctl that typically bind devices
to vfio-pci will manually load vfio-pci.  I think we can take advantage
of that.

> One answer is to have userspace udev have the "hook" here and when a
> vfio flavour mod alias is requested on a PCI device it swaps in
> vfio_pci if it can't find an alternative.
> 
> The dream would be a system with no vfio modules loaded could do some
> 
>  echo "vfio" > /sys/bus/pci/xxx/driver_flavour
> 
> And a module would be loaded and a struct vfio_device is created for
> that device. Very easy for the user.

This is like switching a device to a parallel universe where we do
want vfio drivers to bind automatically to devices.
 
> > If we coupled that with wildcard support in driver_override, ex.
> > "vfio_pci*", and used consistent module naming, I think we'd only need
> > to teach userspace about this wildcard and binding to a specific module
> > would come for free.  
> 
> What would the wildcard do?

It allows a driver_override to match more than one driver, not too
dissimilar to your driver_flavor above.  In this case it would match
all driver names starting with "vfio_pci".  For example if we had:

softdep vfio-pci pre: vfio-pci-foo vfio-pci-bar

Then we'd pre-seed the condition that drivers foo and bar precede the
base vfio-pci driver, each will match the device to the driver and have
an opportunity in their probe function to either claim or skip the
device.  Userspace could also set an exact driver_override, for
example if they want to force using the base vfio-pci driver or go
directly to a specific variant.
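
As a purely hypothetical sketch of the wildcard idea (not code from any
posted patch), the override comparison could treat a trailing '*' as a
prefix match, so "vfio_pci*" matches vfio_pci, vfio_pci_foo, and so on:

/* Hypothetical wildcard-aware driver_override comparison */
static bool driver_override_matches(const char *override, const char *drv_name)
{
        size_t len = strlen(override);

        if (len && override[len - 1] == '*')
                return strncmp(drv_name, override, len - 1) == 0;

        return strcmp(drv_name, override) == 0;
}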
 
> > This assumes we drop the per-variant id_table and use the probe
> > function to skip devices without the necessary requirements, either
> > wrong device or missing the tables we expect to expose.  
> 
> Without a module table how do we know which driver is which? 
> 
> Open coding a match table in probe() and returning failure feels hacky
> to me.

How's it any different than Max's get_foo_vfio_pci_driver() that calls
pci_match_id() with an internal match table?  It seems a better fit for
the existing use cases, for example the IGD variant can use a single
line table to exclude all except Intel VGA class devices in its probe
callback, then test availability of the extra regions we'd expose,
otherwise return -ENODEV.  The NVLink variant can use pci_match_id() in
the probe callback to filter out anything other than NVIDIA VGA or 3D
accelerator class devices, then check for associated FDT table, or
return -ENODEV.  We already use the vfio_pci probe function to exclude
devices in the deny-list and non-endpoint devices.  Many drivers
clearly place implicit trust in their id_table, others don't.  In the
case of meta drivers, I think it's fair to make use of the latter
approach.
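
A minimal sketch of that probe-side filtering, loosely following the
NVLink example above (the class-match table uses the standard PCI
macros, but the table name and the platform-table helper are
illustrative placeholders only):

/* Sketch: filter in probe() instead of trusting an id_table */
static const struct pci_device_id nvlink2_vfio_pci_ids[] = {
        { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
          PCI_CLASS_DISPLAY_VGA << 8, 0xffff00, 0 },
        { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
          PCI_CLASS_DISPLAY_3D << 8, 0xffff00, 0 },
        { }
};

static int nvlink2_vfio_pci_probe(struct pci_dev *pdev,
                                  const struct pci_device_id *id)
{
        if (!pci_match_id(nvlink2_vfio_pci_ids, pdev))
                return -ENODEV;         /* not an NVIDIA VGA/3D device */

        if (!nvlink2_platform_tables_present(pdev))
                return -ENODEV;         /* no FDT/firmware support to expose */

        /* ...otherwise continue with normal vfio-pci style registration */
        return 0;
}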

> > > Should we even load it by default?  One answer would be that the sysfs
> > > file to switch to vfio mode goes into the core PCI layer, and that core
> > > PCI code would contain a hack^H^H^H^Hhook to first load and bind vfio_pci
> > > for that device.  
> > 
> > Generally we don't want to be the default driver for anything (I think
> > mdev devices are the exception).  Assignment to userspace or VM is a
> > niche use case.  Thanks,  
> 
> By &qu

Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-19 Thread Alex Williamson
On Fri, 19 Mar 2021 17:34:49 +0100
Christoph Hellwig  wrote:

> On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:
> > The wrinkle I don't yet have an easy answer to is how to load vfio_pci
> > as a universal "default" within the driver core lazy bind scheme and
> > still have working module autoloading... I'm hoping to get some
> > research into this..  

What about using MODULE_SOFTDEP("pre: ...") in the vfio-pci base
driver, which would load all the known variants in order to influence
the match, and therefore probe ordering?

If we coupled that with wildcard support in driver_override, ex.
"vfio_pci*", and used consistent module naming, I think we'd only need
to teach userspace about this wildcard and binding to a specific module
would come for free.  This assumes we drop the per-variant id_table and
use the probe function to skip devices without the necessary
requirements, either wrong device or missing the tables we expect to
expose.

> Should we even load it by default?  One answer would be that the sysfs
> file to switch to vfio mode goes into the core PCI layer, and that core
> PCI code would contain a hack^H^H^H^Hhook to first load and bind vfio_pci
> for that device.

Generally we don't want to be the default driver for anything (I think
mdev devices are the exception).  Assignment to userspace or VM is a
niche use case.  Thanks,

Alex



Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-19 Thread Alex Williamson
On Fri, 19 Mar 2021 14:59:47 +0200
Leon Romanovsky  wrote:

> On Thu, Mar 18, 2021 at 07:34:56PM +0100, Enrico Weigelt, metux IT consult 
> wrote:
> > On 18.03.21 18:22, Leon Romanovsky wrote:
> >   
> > > Which email client do you use?
> > > Your responses are grouped as one huge block without any chance to respond
> > > to you on specific point or answer to your question.  
> > 
> > I'm reading this thread in Tbird, and threading / quoting all looks
> > nice.  
> 
> I'm not talking about threading or quoting but about response itself.
> See it here 
> https://lore.kernel.org/lkml/20210318103935.2ec32...@omen.home.shazbot.org/
> Alex's response is one big chunk without any separations to paragraphs.

I've never known paragraph breaks to be required to interject a reply.

Back on topic...

> >   
> > > I see your flow and understand your position, but will repeat my
> > > position. We need to make sure that vendors will have incentive to
> > > supply quirks.  

What if we taint the kernel or pci_warn() for cases where either all
the reset methods are disabled, ie. 'echo none > reset_method', or any
time a device specific method is disabled?

I'd almost go so far as to prevent disabling a device specific reset
altogether, but for example should a device specific reset that fixes
an aspect of FLR behavior prevent using a bus reset?  I'd prefer in that
case if direct FLR were disabled via a device flag introduced with the
quirk and the remaining resets can still be selected by preference.
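
Such a flag-based quirk is tiny; a sketch modeled on the existing
quirk_no_flr() pattern in drivers/pci/quirks.c (the vendor/device IDs
below are made up):

/* Sketch: mark a device whose FLR is known broken so the core skips
 * FLR and the remaining reset methods can still be used. */
static void quirk_broken_flr(struct pci_dev *dev)
{
        dev->dev_flags |= PCI_DEV_FLAGS_NO_FLR_RESET;
}
DECLARE_PCI_FIXUP_EARLY(0xabcd, 0x1234, quirk_broken_flr);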

Theoretically all the other reset methods work and are available, it's
only a policy decision which to use, right?

If a device probes for a reset that's broken and distros start
including systemd scripts to apply a preference to avoid it, (a) that
enables them to work with existing kernels, and (b) indicates to us to
add the trivial quirk to flag that reset as broken.

The other side of the argument that this discourages quirks is that
this interface actually makes it significantly easier to report specific
reset methods as broken for a given device.

Thanks,
Alex



Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-19 Thread Alex Williamson
On Wed, 10 Mar 2021 14:57:57 +0200
Max Gurtovoy  wrote:
> On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote:
> > On 09/03/2021 19:33, Max Gurtovoy wrote:  
> >> +static const struct pci_device_id nvlink2gpu_vfio_pci_table[] = {
> >> +    { PCI_VDEVICE(NVIDIA, 0x1DB1) }, /* GV100GL-A NVIDIA Tesla 
> >> V100-SXM2-16GB */
> >> +    { PCI_VDEVICE(NVIDIA, 0x1DB5) }, /* GV100GL-A NVIDIA Tesla 
> >> V100-SXM2-32GB */
> >> +    { PCI_VDEVICE(NVIDIA, 0x1DB8) }, /* GV100GL-A NVIDIA Tesla 
> >> V100-SXM3-32GB */
> >> +    { PCI_VDEVICE(NVIDIA, 0x1DF5) }, /* GV100GL-B NVIDIA Tesla 
> >> V100-SXM2-16GB */  
> >
> >
> > Where is this list from?
> >
> > Also, how is this supposed to work at the boot time? Will the kernel 
> > try binding let's say this one and nouveau? Which one is going to win?  
> 
> At boot time nouveau driver will win since the vfio drivers don't 
> declare MODULE_DEVICE_TABLE

This still seems troublesome, AIUI the MODULE_DEVICE_TABLE is
responsible for creating aliases so that kmod can figure out which
modules to load, but what happens if all these vfio-pci modules are
built into the kernel or the modules are already loaded?

In the former case, I think it boils down to link order while the
latter is generally considered even less deterministic since it depends
on module load order.  So if one of these vfio modules should get
loaded before the native driver, I think devices could bind here first.

Are there tricks/extensions we could use in driver overrides, for
example maybe a compatibility alias such that one of these vfio-pci
variants could match "vfio-pci"?  Perhaps that, along with some sort of
priority scheme to probe variants ahead of the base driver, though I'm
not sure how we'd get these variants loaded without something like
module aliases.  I know we're trying to avoid creating another level of
driver matching, but that's essentially what we have in the compat
option enabled here, and I'm not sure I see how userspace makes the
leap to understand what driver to use for a given device.  Thanks,

Alex



[GIT PULL] VFIO fixes for v5.12-rc4

2021-03-18 Thread Alex Williamson
Hi Linus,

The following changes since commit 1e28eed17697bcf343c6743f0028cc3b5dd88bf0:

  Linux 5.12-rc3 (2021-03-14 14:41:02 -0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.12-rc4

for you to fetch changes up to 4ab4fcfce5b540227d80eb32f1db45ab615f7c92:

  vfio/type1: fix vaddr_get_pfns() return in vfio_pin_page_external() 
(2021-03-16 10:39:29 -0600)


VFIO fixes for v5.12-rc4

 - Fix 32-bit issue with new unmap-all flag (Steve Sistare)

 - Various Kconfig changes for better coverage (Jason Gunthorpe)

 - Fix to batch pinning support (Daniel Jordan)


Daniel Jordan (1):
  vfio/type1: fix vaddr_get_pfns() return in vfio_pin_page_external()

Jason Gunthorpe (4):
  vfio: IOMMU_API should be selected
  vfio-platform: Add COMPILE_TEST to VFIO_PLATFORM
  ARM: amba: Allow some ARM_AMBA users to compile with COMPILE_TEST
  vfio: Depend on MMU

Steve Sistare (1):
  vfio/type1: fix unmap all on ILP32

 drivers/vfio/Kconfig            |  4 ++--
 drivers/vfio/platform/Kconfig   |  4 ++--
 drivers/vfio/vfio_iommu_type1.c | 20 ++++++++++++--------
 include/linux/amba/bus.h        | 11 +++++++++++
 4 files changed, 27 insertions(+), 12 deletions(-)



Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-18 Thread Alex Williamson
On Thu, 18 Mar 2021 11:09:34 +0200
Leon Romanovsky  wrote:

> On Wed, Mar 17, 2021 at 11:31:40AM -0600, Alex Williamson wrote:
> > On Wed, 17 Mar 2021 15:58:40 +0200
> > Leon Romanovsky  wrote:
> >  
> > > On Wed, Mar 17, 2021 at 06:47:18PM +0530, Amey Narkhede wrote:  
> > > > On 21/03/17 01:47PM, Leon Romanovsky wrote:  
> > > > > On Wed, Mar 17, 2021 at 04:53:09PM +0530, Amey Narkhede wrote:  
> > > > > > On 21/03/17 01:02PM, Leon Romanovsky wrote:  
> > > > > > > On Wed, Mar 17, 2021 at 03:54:47PM +0530, Amey Narkhede wrote:  
> > > > > > > > On 21/03/17 06:20AM, Leon Romanovsky wrote:  
> > > > > > > > > On Mon, Mar 15, 2021 at 06:32:32PM +, Raphael Norwitz 
> > > > > > > > > wrote:  
> > > > > > > > > > On Mon, Mar 15, 2021 at 10:29:50AM -0600, Alex Williamson 
> > > > > > > > > > wrote:  
> > > > > > > > > > > On Mon, 15 Mar 2021 21:03:41 +0530
> > > > > > > > > > > Amey Narkhede  wrote:
> > > > > > > > > > >  
> > > > > > > > > > > > On 21/03/15 05:07PM, Leon Romanovsky wrote:  
> > > > > > > > > > > > > On Mon, Mar 15, 2021 at 08:34:09AM -0600, Alex 
> > > > > > > > > > > > > Williamson wrote:  
> > > > > > > > > > > > > > On Mon, 15 Mar 2021 14:52:26 +0100
> > > > > > > > > > > > > > Pali Rohár  wrote:
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > On Monday 15 March 2021 19:13:23 Amey Narkhede 
> > > > > > > > > > > > > > > wrote:  
> > > > > > > > > > > > > > > > slot reset (pci_dev_reset_slot_function) and 
> > > > > > > > > > > > > > > > secondary bus
> > > > > > > > > > > > > > > > reset(pci_parent_bus_reset) which I think are 
> > > > > > > > > > > > > > > > hot reset and
> > > > > > > > > > > > > > > > warm reset respectively.  
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > No. PCI secondary bus reset = PCIe Hot Reset. 
> > > > > > > > > > > > > > > Slot reset is just another
> > > > > > > > > > > > > > > type of reset, which is currently implemented 
> > > > > > > > > > > > > > > only for PCIe hot plug
> > > > > > > > > > > > > > > bridges and for PowerPC PowerNV platform and it 
> > > > > > > > > > > > > > > just call PCI secondary
> > > > > > > > > > > > > > > bus reset with some other hook. PCIe Warm Reset 
> > > > > > > > > > > > > > > does not have API in
> > > > > > > > > > > > > > > kernel and therefore drivers do not export this 
> > > > > > > > > > > > > > > type of reset via any
> > > > > > > > > > > > > > > kernel function (yet).  
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Warm reset is beyond the scope of this series, but 
> > > > > > > > > > > > > > could be implemented
> > > > > > > > > > > > > > in a compatible way to fit within the 
> > > > > > > > > > > > > > pci_reset_fn_methods[] array
> > > > > > > > > > > > > > defined here.  Note that with this series the 
> > > > > > > > > > > > > > resets available through
> > > > > > > > > > > > > > pci_reset_function() and the per device reset 
> > > > > > > > > > > > > > attribute is sysfs remain
> > > > > > > > > > > > > > exactly the same as they are currently.  The bus 
> > > > > > > > > > > > > > and slot r

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-17 Thread Alex Williamson
On Wed, 17 Mar 2021 20:40:24 +0100
Pali Rohár  wrote:

> On Wednesday 17 March 2021 13:32:45 Alex Williamson wrote:
> > On Wed, 17 Mar 2021 20:24:24 +0100
> > Pali Rohár  wrote:
> >   
> > > On Wednesday 17 March 2021 13:15:36 Alex Williamson wrote:  
> > > > On Wed, 17 Mar 2021 20:02:06 +0100
> > > > Pali Rohár  wrote:
> > > > 
> > > > > On Monday 15 March 2021 09:03:39 Alex Williamson wrote:
> > > > > > On Mon, 15 Mar 2021 15:52:38 +0100
> > > > > > Pali Rohár  wrote:
> > > > > >   
> > > > > > > On Monday 15 March 2021 08:34:09 Alex Williamson wrote:  
> > > > > > > > On Mon, 15 Mar 2021 14:52:26 +0100
> > > > > > > > Pali Rohár  wrote:
> > > > > > > > 
> > > > > > > > > On Monday 15 March 2021 19:13:23 Amey Narkhede wrote:
> > > > > > > > > > slot reset (pci_dev_reset_slot_function) and secondary bus
> > > > > > > > > > reset(pci_parent_bus_reset) which I think are hot reset and
> > > > > > > > > > warm reset respectively.  
> > > > > > > > > 
> > > > > > > > > No. PCI secondary bus reset = PCIe Hot Reset. Slot reset is 
> > > > > > > > > just another
> > > > > > > > > type of reset, which is currently implemented only for PCIe 
> > > > > > > > > hot plug
> > > > > > > > > bridges and for PowerPC PowerNV platform and it just call PCI 
> > > > > > > > > secondary
> > > > > > > > > bus reset with some other hook. PCIe Warm Reset does not have 
> > > > > > > > > API in
> > > > > > > > > kernel and therefore drivers do not export this type of reset 
> > > > > > > > > via any
> > > > > > > > > kernel function (yet).
> > > > > > > > 
> > > > > > > > Warm reset is beyond the scope of this series, but could be 
> > > > > > > > implemented
> > > > > > > > in a compatible way to fit within the pci_reset_fn_methods[] 
> > > > > > > > array
> > > > > > > > defined here.
> > > > > > > 
> > > > > > > Ok!
> > > > > > >   
> > > > > > > > Note that with this series the resets available through
> > > > > > > > pci_reset_function() and the per device reset attribute is 
> > > > > > > > sysfs remain
> > > > > > > > exactly the same as they are currently.  The bus and slot reset
> > > > > > > > methods used here are limited to devices where only a single 
> > > > > > > > function is
> > > > > > > > affected by the reset, therefore it is not like the patch you 
> > > > > > > > proposed
> > > > > > > > which performed a reset irrespective of the downstream devices. 
> > > > > > > >  This
> > > > > > > > series only enables selection of the existing methods.  Thanks,
> > > > > > > > 
> > > > > > > > Alex
> > > > > > > > 
> > > > > > > 
> > > > > > > But with this patch series, there is still an issue with PCI 
> > > > > > > secondary
> > > > > > > bus reset mechanism as exported sysfs attribute does not do that
> > > > > > > remove-reset-rescan procedure. As discussed in other thread, this 
> > > > > > > reset
> > > > > > > let device in unconfigured / broken state.  
> > > > > > 
> > > > > > No, there's not:
> > > > > > 
> > > > > > int pci_reset_function(struct pci_dev *dev)
> > > > > > {
> > > > > > int rc;
> > > > > > 
> > > > > > if (!dev->reset_fn)
> > > > > > return -ENOTTY;
> > > > > > 
> > > > > > pci_dev_lock(dev);  
> > > > > > >>> pci_dev_save_and_disable(dev);  
> > > > > > 
> > > > > > 

Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-17 Thread Alex Williamson
On Wed, 17 Mar 2021 20:24:24 +0100
Pali Rohár  wrote:

> On Wednesday 17 March 2021 13:15:36 Alex Williamson wrote:
> > On Wed, 17 Mar 2021 20:02:06 +0100
> > Pali Rohár  wrote:
> >   
> > > On Monday 15 March 2021 09:03:39 Alex Williamson wrote:  
> > > > On Mon, 15 Mar 2021 15:52:38 +0100
> > > > Pali Rohár  wrote:
> > > > 
> > > > > On Monday 15 March 2021 08:34:09 Alex Williamson wrote:
> > > > > > On Mon, 15 Mar 2021 14:52:26 +0100
> > > > > > Pali Rohár  wrote:
> > > > > >   
> > > > > > > On Monday 15 March 2021 19:13:23 Amey Narkhede wrote:  
> > > > > > > > slot reset (pci_dev_reset_slot_function) and secondary bus
> > > > > > > > reset(pci_parent_bus_reset) which I think are hot reset and
> > > > > > > > warm reset respectively.
> > > > > > > 
> > > > > > > No. PCI secondary bus reset = PCIe Hot Reset. Slot reset is just 
> > > > > > > another
> > > > > > > type of reset, which is currently implemented only for PCIe hot 
> > > > > > > plug
> > > > > > > bridges and for PowerPC PowerNV platform and it just call PCI 
> > > > > > > secondary
> > > > > > > bus reset with some other hook. PCIe Warm Reset does not have API 
> > > > > > > in
> > > > > > > kernel and therefore drivers do not export this type of reset via 
> > > > > > > any
> > > > > > > kernel function (yet).  
> > > > > > 
> > > > > > Warm reset is beyond the scope of this series, but could be 
> > > > > > implemented
> > > > > > in a compatible way to fit within the pci_reset_fn_methods[] array
> > > > > > defined here.  
> > > > > 
> > > > > Ok!
> > > > > 
> > > > > > Note that with this series the resets available through
> > > > > > pci_reset_function() and the per device reset attribute is sysfs 
> > > > > > remain
> > > > > > exactly the same as they are currently.  The bus and slot reset
> > > > > > methods used here are limited to devices where only a single 
> > > > > > function is
> > > > > > affected by the reset, therefore it is not like the patch you 
> > > > > > proposed
> > > > > > which performed a reset irrespective of the downstream devices.  
> > > > > > This
> > > > > > series only enables selection of the existing methods.  Thanks,
> > > > > > 
> > > > > > Alex
> > > > > >   
> > > > > 
> > > > > But with this patch series, there is still an issue with PCI secondary
> > > > > bus reset mechanism as exported sysfs attribute does not do that
> > > > > remove-reset-rescan procedure. As discussed in other thread, this 
> > > > > reset
> > > > > let device in unconfigured / broken state.
> > > > 
> > > > No, there's not:
> > > > 
> > > > int pci_reset_function(struct pci_dev *dev)
> > > > {
> > > > int rc;
> > > > 
> > > > if (!dev->reset_fn)
> > > > return -ENOTTY;
> > > > 
> > > > pci_dev_lock(dev);
> > > > >>> pci_dev_save_and_disable(dev);
> > > > 
> > > > rc = __pci_reset_function_locked(dev);
> > > > 
> > > > >>> pci_dev_restore(dev);
> > > > pci_dev_unlock(dev);
> > > > 
> > > > return rc;
> > > > }
> > > > 
> > > > The remove/re-scan was discussed primarily because your patch performed
> > > > a bus reset regardless of what devices were affected by that reset and
> > > > it's difficult to manage the scope where multiple devices are affected.
> > > > Here, the bus and slot reset functions will fail unless the scope is
> > > > limited to the single device triggering this reset.  Thanks,
> > > > 
> > > > Alex
> > > > 
> > > 
> > > I was thinking a bit more about it and I'm really sure how it would
> > > behave with hotplugging PCIe bridge.
> > > 
> > > On aardvark PCIe controller I have already tested that secondary bus
> > > reset bit is triggering Hot Reset event and then also Link Down event.
> > > These events are not handled by aardvark driver yet (needs to
> > > implemented into kernel's emulated root bridge code).
> > > 
> > > But I'm not sure how it would behave on real HW PCIe hotplugging bridge.
> > > Kernel has already code which removes PCIe device if it changes presence
> > > bit (and inform via interrupt). And Link Down event triggers this
> > > change.  
> > 
> > This is the difference between slot and bus resets, the slot reset is
> > implemented by the hotplug controller and disables presence detection
> > around the bus reset.  Thanks,  
> 
> Yes, but I'm talking about bus reset, not about slot reset.
> 
> I mean: to use bus reset via sysfs on hardware which supports slots and
> hotplugging.
> 
> And if I'm reading code correctly, this combination is allowed, right?
> Via these new patches it is possible to disable slot reset and enable
> bus reset.

That's true, a slot reset is simply a bus reset wrapped around code
that prevents the device from getting ejected.  Maybe it would make
sense to combine the two as far as this interface is concerned, ie. a
single "bus" reset method that will always use slot reset when
available.  Thanks,

Alex
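
For the record, a rough sketch of combining the two as described above,
reusing the helper names mentioned earlier in this thread (only an
illustration of the ordering, not a posted patch):

/* Sketch: one "bus" method that prefers the slot reset, which masks
 * presence detect/hotplug around the reset, and only falls back to a
 * plain secondary bus reset when no slot reset is available. */
static int pci_reset_bus_function(struct pci_dev *dev, int probe)
{
        int rc = pci_dev_reset_slot_function(dev, probe);

        if (rc != -ENOTTY)
                return rc;

        return pci_parent_bus_reset(dev, probe);
}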



Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-17 Thread Alex Williamson
On Wed, 17 Mar 2021 20:02:06 +0100
Pali Rohár  wrote:

> On Monday 15 March 2021 09:03:39 Alex Williamson wrote:
> > On Mon, 15 Mar 2021 15:52:38 +0100
> > Pali Rohár  wrote:
> >   
> > > On Monday 15 March 2021 08:34:09 Alex Williamson wrote:  
> > > > On Mon, 15 Mar 2021 14:52:26 +0100
> > > > Pali Rohár  wrote:
> > > > 
> > > > > On Monday 15 March 2021 19:13:23 Amey Narkhede wrote:
> > > > > > slot reset (pci_dev_reset_slot_function) and secondary bus
> > > > > > reset(pci_parent_bus_reset) which I think are hot reset and
> > > > > > warm reset respectively.  
> > > > > 
> > > > > No. PCI secondary bus reset = PCIe Hot Reset. Slot reset is just 
> > > > > another
> > > > > type of reset, which is currently implemented only for PCIe hot plug
> > > > > bridges and for PowerPC PowerNV platform and it just call PCI 
> > > > > secondary
> > > > > bus reset with some other hook. PCIe Warm Reset does not have API in
> > > > > kernel and therefore drivers do not export this type of reset via any
> > > > > kernel function (yet).
> > > > 
> > > > Warm reset is beyond the scope of this series, but could be implemented
> > > > in a compatible way to fit within the pci_reset_fn_methods[] array
> > > > defined here.
> > > 
> > > Ok!
> > >   
> > > > Note that with this series the resets available through
> > > > pci_reset_function() and the per device reset attribute is sysfs remain
> > > > exactly the same as they are currently.  The bus and slot reset
> > > > methods used here are limited to devices where only a single function is
> > > > affected by the reset, therefore it is not like the patch you proposed
> > > > which performed a reset irrespective of the downstream devices.  This
> > > > series only enables selection of the existing methods.  Thanks,
> > > > 
> > > > Alex
> > > > 
> > > 
> > > But with this patch series, there is still an issue with PCI secondary
> > > bus reset mechanism as exported sysfs attribute does not do that
> > > remove-reset-rescan procedure. As discussed in other thread, this reset
> > > let device in unconfigured / broken state.  
> > 
> > No, there's not:
> > 
> > int pci_reset_function(struct pci_dev *dev)
> > {
> > int rc;
> > 
> > if (!dev->reset_fn)
> > return -ENOTTY;
> > 
> > pci_dev_lock(dev);  
> > >>> pci_dev_save_and_disable(dev);  
> > 
> > rc = __pci_reset_function_locked(dev);
> >   
> > >>> pci_dev_restore(dev);  
> > pci_dev_unlock(dev);
> > 
> > return rc;
> > }
> > 
> > The remove/re-scan was discussed primarily because your patch performed
> > a bus reset regardless of what devices were affected by that reset and
> > it's difficult to manage the scope where multiple devices are affected.
> > Here, the bus and slot reset functions will fail unless the scope is
> > limited to the single device triggering this reset.  Thanks,
> > 
> > Alex
> >   
> 
> I was thinking a bit more about it and I'm not really sure how it would
> behave with hotplugging PCIe bridge.
> 
> On aardvark PCIe controller I have already tested that secondary bus
> reset bit is triggering Hot Reset event and then also Link Down event.
> These events are not handled by aardvark driver yet (needs to be
> implemented into kernel's emulated root bridge code).
> 
> But I'm not sure how it would behave on real HW PCIe hotplugging bridge.
> Kernel already has code which removes a PCIe device if it changes the presence
> bit (and informs via interrupt). And a Link Down event triggers this
> change.

This is the difference between slot and bus resets, the slot reset is
implemented by the hotplug controller and disables presence detection
around the bus reset.  Thanks,

Alex



Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-17 Thread Alex Williamson
On Wed, 17 Mar 2021 15:58:40 +0200
Leon Romanovsky  wrote:

> On Wed, Mar 17, 2021 at 06:47:18PM +0530, Amey Narkhede wrote:
> > On 21/03/17 01:47PM, Leon Romanovsky wrote:  
> > > On Wed, Mar 17, 2021 at 04:53:09PM +0530, Amey Narkhede wrote:  
> > > > On 21/03/17 01:02PM, Leon Romanovsky wrote:  
> > > > > On Wed, Mar 17, 2021 at 03:54:47PM +0530, Amey Narkhede wrote:  
> > > > > > On 21/03/17 06:20AM, Leon Romanovsky wrote:  
> > > > > > > On Mon, Mar 15, 2021 at 06:32:32PM +, Raphael Norwitz wrote:  
> > > > > > > > On Mon, Mar 15, 2021 at 10:29:50AM -0600, Alex Williamson 
> > > > > > > > wrote:  
> > > > > > > > > On Mon, 15 Mar 2021 21:03:41 +0530
> > > > > > > > > Amey Narkhede  wrote:
> > > > > > > > >  
> > > > > > > > > > On 21/03/15 05:07PM, Leon Romanovsky wrote:  
> > > > > > > > > > > On Mon, Mar 15, 2021 at 08:34:09AM -0600, Alex Williamson 
> > > > > > > > > > > wrote:  
> > > > > > > > > > > > On Mon, 15 Mar 2021 14:52:26 +0100
> > > > > > > > > > > > Pali Rohár  wrote:
> > > > > > > > > > > >  
> > > > > > > > > > > > > On Monday 15 March 2021 19:13:23 Amey Narkhede wrote: 
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > slot reset (pci_dev_reset_slot_function) and 
> > > > > > > > > > > > > > secondary bus
> > > > > > > > > > > > > > reset(pci_parent_bus_reset) which I think are hot 
> > > > > > > > > > > > > > reset and
> > > > > > > > > > > > > > warm reset respectively.  
> > > > > > > > > > > > >
> > > > > > > > > > > > > No. PCI secondary bus reset = PCIe Hot Reset. Slot 
> > > > > > > > > > > > > reset is just another
> > > > > > > > > > > > > type of reset, which is currently implemented only 
> > > > > > > > > > > > > for PCIe hot plug
> > > > > > > > > > > > > bridges and for PowerPC PowerNV platform and it just 
> > > > > > > > > > > > > call PCI secondary
> > > > > > > > > > > > > bus reset with some other hook. PCIe Warm Reset does 
> > > > > > > > > > > > > not have API in
> > > > > > > > > > > > > kernel and therefore drivers do not export this type 
> > > > > > > > > > > > > of reset via any
> > > > > > > > > > > > > kernel function (yet).  
> > > > > > > > > > > >
> > > > > > > > > > > > Warm reset is beyond the scope of this series, but 
> > > > > > > > > > > > could be implemented
> > > > > > > > > > > > in a compatible way to fit within the 
> > > > > > > > > > > > pci_reset_fn_methods[] array
> > > > > > > > > > > > defined here.  Note that with this series the resets 
> > > > > > > > > > > > available through
> > > > > > > > > > > > pci_reset_function() and the per device reset attribute 
> > > > > > > > > > > > is sysfs remain
> > > > > > > > > > > > exactly the same as they are currently.  The bus and 
> > > > > > > > > > > > slot reset
> > > > > > > > > > > > methods used here are limited to devices where only a 
> > > > > > > > > > > > single function is
> > > > > > > > > > > > affected by the reset, therefore it is not like the 
> > > > > > > > > > > > patch you proposed
> > > > > > > > > > > > which performed a reset irrespective of the downstream 
> > > > > > > > > > > > devices.  This
> > > > > > > > > > > > series only enables selection of the existing

Re: A problem of Intel IOMMU hardware ?

2021-03-17 Thread Alex Williamson
On Wed, 17 Mar 2021 13:16:58 +0800
Lu Baolu  wrote:

> Hi Longpeng,
> 
> On 3/17/21 11:16 AM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.) wrote:
> > Hi guys,
> > 
> > We find the Intel iommu cache (i.e. iotlb) may work incorrectly in a special
> > situation, which can cause DMA failures or wrong data.
> > 
> > The reproducer (based on Alex's vfio testsuite[1]) is in attachment, it can
> > reproduce the problem with high probability (~50%).
> > 
> > The machine we used is:
> > processor   : 47
> > vendor_id   : GenuineIntel
> > cpu family  : 6
> > model   : 85
> > model name  : Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
> > stepping: 4
> > microcode   : 0x269
> > 
> > And the iommu capability reported is:
> > ver 1:0 cap 8d2078c106f0466 ecap f020df
> > (caching mode = 0 , page-selective invalidation = 1)
> > 
> > (The problem is also on 'Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz' and
> > 'Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz')
> > 
> > We run the reproducer on Linux 4.18 and it works as follow:
> > 
> > Step 1. alloc 4G *2M-hugetlb* memory (N.B. no problem with 4K-page mapping) 
> >  
> 
> I don't understand what 2M-hugetlb here means exactly. The IOMMU hardware
> supports both 2M and 1G super page. The mapping physical memory is 4G.
> Why couldn't it use 1G super page?
> 
> > Step 2. DMA Map 4G memory
> > Step 3.
> >  while (1) {
> >  {UNMAP, 0x0, 0xa},  (a)
> >  {UNMAP, 0xc, 0xbff4},  
> 
> Have these two ranges been mapped before? Does the IOMMU driver
> complain when you try to unmap a range which has never been
> mapped? The IOMMU driver implicitly assumes that mapping and
> unmapping are paired.
>
> >  {MAP,   0x0, 0xc000}, - (b)
> >  use GDB to pause at here, and then DMA read IOVA=0,  
> 
> IOVA 0 seems to be a special one. Have you verified with other addresses
> than IOVA 0?

It is???  That would be a problem.

> >  sometimes DMA success (as expected),
> >  but sometimes DMA error (report not-present).
> >  {UNMAP, 0x0, 0xc000}, - (c)
> >  {MAP,   0x0, 0xa},
> >  {MAP,   0xc, 0xbff4},
> >  }

The interesting thing about this test sequence seems to be how it will
implicitly switch between super pages and regular pages.  Also note
that the test is using the original vfio type1 API rather than the v2
API that's more commonly used today.  This older API allows unmaps to
split mappings, but we don't really know how much the IOMMU is
unmapping without reading the unmap.size field returned by the ioctl.
What I expect to happen is that the IOMMU will make use of superpages
when mapping the full range.  When we unmap {0-b}, that's likely
going to be covered by a 2M (or more) superpage, therefore the unmap
will actually unmap {0-1f}.  The subsequent unmap starting at
0xc might already have {a-1f} unmapped.  However, when we
then map {0 - b} the IOMMU will (should) switch back to 4K pages.
The mapping at 0xc should use 4K pages up through 0x1f, then
might switch to 2M or 1G pages depending on physical memory layout.  So
the {0-2MB} IOVA range could be switching back and forth between a
superpage mapping and 4K mapping, and I can certainly imagine that
could lead to page table, if not cache management bugs.  Thanks,

Alex
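
To make the unmap.size point concrete, a minimal userspace sketch of the
type1 UAPI involved (container setup omitted; the values passed in are
arbitrary examples, not the reproducer's):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch: with the original (v1) type1 semantics an unmap may be rounded
 * up to the granularity actually backing the IOVA (e.g. a 2M superpage),
 * so unmap.size is the only way to know what was really unmapped. */
static int unmap_subrange(int container_fd, __u64 iova, __u64 size)
{
        struct vfio_iommu_type1_dma_unmap unmap = {
                .argsz = sizeof(unmap),
                .iova  = iova,
                .size  = size,
        };

        if (ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap))
                return -1;

        printf("requested 0x%llx, actually unmapped 0x%llx\n",
               (unsigned long long)size, (unsigned long long)unmap.size);
        return 0;
}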


> > 
> > The DMA read operations sholud success between (b) and (c), it should NOT 
> > report
> > not-present at least!
> > 
> > After analysis the problem, we think maybe it's caused by the Intel iommu 
> > iotlb.
> > It seems the DMA Remapping hardware still uses the IOTLB or other caches of 
> > (a).
> > 
> > When do DMA unmap at (a), the iotlb will be flush:
> >  intel_iommu_unmap
> >  domain_unmap
> >  iommu_flush_iotlb_psi
> > 
> > When do DMA map at (b), no need to flush the iotlb according to the 
> > capability
> > of this iommu:
> >  intel_iommu_map
> >  domain_pfn_mapping
> >  domain_mapping
> >  __mapping_notify_one
> >  if (cap_caching_mode(iommu->cap)) // FALSE
> >  iommu_flush_iotlb_psi  
> 
> That's true. The iotlb flushing is not needed in case of PTE been
> changed from non-present to present unless caching mode.
> 
> > But the problem will disappear if we FORCE flush here. So we suspect the 
> > iommu
> > hardware.
> > 
> > Do you have any suggestion ?  
> 
> Best regards,
> baolu
> 



Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-15 Thread Alex Williamson
On Mon, 15 Mar 2021 21:03:41 +0530
Amey Narkhede  wrote:

> On 21/03/15 05:07PM, Leon Romanovsky wrote:
> > On Mon, Mar 15, 2021 at 08:34:09AM -0600, Alex Williamson wrote:  
> > > On Mon, 15 Mar 2021 14:52:26 +0100
> > > Pali Rohár  wrote:
> > >  
> > > > On Monday 15 March 2021 19:13:23 Amey Narkhede wrote:  
> > > > > slot reset (pci_dev_reset_slot_function) and secondary bus
> > > > > reset(pci_parent_bus_reset) which I think are hot reset and
> > > > > warm reset respectively.  
> > > >
> > > > No. PCI secondary bus reset = PCIe Hot Reset. Slot reset is just another
> > > > type of reset, which is currently implemented only for PCIe hot plug
> > > > bridges and for PowerPC PowerNV platform and it just call PCI secondary
> > > > bus reset with some other hook. PCIe Warm Reset does not have API in
> > > > kernel and therefore drivers do not export this type of reset via any
> > > > kernel function (yet).  
> > >
> > > Warm reset is beyond the scope of this series, but could be implemented
> > > in a compatible way to fit within the pci_reset_fn_methods[] array
> > > defined here.  Note that with this series the resets available through
> > > pci_reset_function() and the per device reset attribute is sysfs remain
> > > exactly the same as they are currently.  The bus and slot reset
> > > methods used here are limited to devices where only a single function is
> > > affected by the reset, therefore it is not like the patch you proposed
> > > which performed a reset irrespective of the downstream devices.  This
> > > series only enables selection of the existing methods.  Thanks,  
> >
> > Alex,
> >
> > I asked the patch author here [1], but didn't get any response, maybe
> > you can answer me. What is the use case scenario for this functionality?
> >
> > Thanks
> >
> > [1] https://lore.kernel.org/lkml/YE389lAqjJSeTolM@unreal
> >  
> Sorry for not responding immediately. There were some buggy wifi cards
> which needed FLR explicitly; not sure if that behavior is fixed in
> drivers. Also there is a use case at Nutanix, but the engineer who
> is involved is on PTO, which is why I did not respond immediately as
> I don't know the details yet.

And more generally, devices continue to have reset issues and we
impose a fixed priority in our ordering.  We can and probably should
continue to quirk devices when we find broken resets so that we have
the best default behavior, but it's currently not easy for an end user
to experiment, ie. this reset works, that one doesn't.  We might also
have platform issues where a given reset works better on a certain
platform.  Exposing a way to test these things might lead to better
quirks.  In the case I think Pali was looking for, they wanted a
mechanism to force a bus reset; if this was in reference to a single
function device, this could be accomplished by setting a priority for
that mechanism, which would translate to not only the sysfs reset
attribute, but also the reset mechanism used by vfio-pci.  Thanks,

Alex



Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-15 Thread Alex Williamson
On Mon, 15 Mar 2021 15:52:38 +0100
Pali Rohár  wrote:

> On Monday 15 March 2021 08:34:09 Alex Williamson wrote:
> > On Mon, 15 Mar 2021 14:52:26 +0100
> > Pali Rohár  wrote:
> >   
> > > On Monday 15 March 2021 19:13:23 Amey Narkhede wrote:  
> > > > slot reset (pci_dev_reset_slot_function) and secondary bus
> > > > reset(pci_parent_bus_reset) which I think are hot reset and
> > > > warm reset respectively.
> > > 
> > > No. PCI secondary bus reset = PCIe Hot Reset. Slot reset is just another
> > > type of reset, which is currently implemented only for PCIe hot plug
> > > bridges and for PowerPC PowerNV platform and it just call PCI secondary
> > > bus reset with some other hook. PCIe Warm Reset does not have API in
> > > kernel and therefore drivers do not export this type of reset via any
> > > kernel function (yet).  
> > 
> > Warm reset is beyond the scope of this series, but could be implemented
> > in a compatible way to fit within the pci_reset_fn_methods[] array
> > defined here.  
> 
> Ok!
> 
> > Note that with this series the resets available through
> > pci_reset_function() and the per device reset attribute is sysfs remain
> > exactly the same as they are currently.  The bus and slot reset
> > methods used here are limited to devices where only a single function is
> > affected by the reset, therefore it is not like the patch you proposed
> > which performed a reset irrespective of the downstream devices.  This
> > series only enables selection of the existing methods.  Thanks,
> > 
> > Alex
> >   
> 
> But with this patch series, there is still an issue with PCI secondary
> bus reset mechanism as exported sysfs attribute does not do that
> remove-reset-rescan procedure. As discussed in other thread, this reset
> let device in unconfigured / broken state.

No, there's not:

int pci_reset_function(struct pci_dev *dev)
{
	int rc;

	if (!dev->reset_fn)
		return -ENOTTY;

	pci_dev_lock(dev);
>>>	pci_dev_save_and_disable(dev);

	rc = __pci_reset_function_locked(dev);

>>>	pci_dev_restore(dev);
	pci_dev_unlock(dev);

	return rc;
}

The remove/re-scan was discussed primarily because your patch performed
a bus reset regardless of what devices were affected by that reset and
it's difficult to manage the scope where multiple devices are affected.
Here, the bus and slot reset functions will fail unless the scope is
limited to the single device triggering this reset.  Thanks,

Alex



Re: [PATCH 4/4] PCI/sysfs: Allow userspace to query and set device reset mechanism

2021-03-15 Thread Alex Williamson
On Mon, 15 Mar 2021 14:52:26 +0100
Pali Rohár  wrote:

> On Monday 15 March 2021 19:13:23 Amey Narkhede wrote:
> > slot reset (pci_dev_reset_slot_function) and secondary bus
> > reset (pci_parent_bus_reset), which I think are hot reset and
> > warm reset respectively.  
> 
> No. PCI secondary bus reset = PCIe Hot Reset. Slot reset is just another
> type of reset, which is currently implemented only for PCIe hot plug
> bridges and for the PowerPC PowerNV platform, and it just calls PCI secondary
> bus reset with some other hook. PCIe Warm Reset does not have an API in
> the kernel and therefore drivers do not export this type of reset via any
> kernel function (yet).

Warm reset is beyond the scope of this series, but could be implemented
in a compatible way to fit within the pci_reset_fn_methods[] array
defined here.  Note that with this series the resets available through
pci_reset_function() and the per device reset attribute is sysfs remain
exactly the same as they are currently.  The bus and slot reset
methods used here are limited to devices where only a single function is
affected by the reset, therefore it is not like the patch you proposed
which performed a reset irrespective of the downstream devices.  This
series only enables selection of the existing methods.  Thanks,

Alex
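
For reference, a rough sketch of how a warm reset could later slot into
such an array.  The entry and its helpers below are hypothetical, not part
of this series:

/* Hypothetical warm reset hook; PERST# handling is platform specific. */
static int pcie_warm_reset(struct pci_dev *dev, int probe)
{
	if (probe)
		return pcie_warm_reset_supported(dev) ? 0 : -ENOTTY;

	/* assert/deassert PERST# via a platform or bridge specific hook */
	return pcie_warm_reset_do(dev);
}

static const struct pci_reset_fn_method pci_reset_fn_methods[] = {
	/* ... existing device_specific/flr/pm/slot/bus entries ... */
	{ &pcie_warm_reset, "warm" },	/* hypothetical addition */
};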



Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-12 Thread Alex Williamson
On Fri, 12 Mar 2021 13:09:38 -0700
Alex Williamson  wrote:

> On Fri, 12 Mar 2021 15:41:47 -0400
> Jason Gunthorpe  wrote:
> 
> 
> ==
> WARNING: possible circular locking dependency detected
> 5.12.0-rc1+ #18 Not tainted
> --
> CPU 0/KVM/1406 is trying to acquire lock:
> a5a58d60 (fs_reclaim){+.+.}-{0:0}, at: fs_reclaim_acquire+0x83/0xd0
> 
> but task is already holding lock:
> 94c0f3e8fb08 (>i_mmap_rwsem){}-{3:3}, at: 
> vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (>i_mmap_rwsem){}-{3:3}:  
>down_write+0x3d/0x70
>dma_resv_lockdep+0x1b0/0x298
>do_one_initcall+0x5b/0x2d0
>kernel_init_freeable+0x251/0x298
>kernel_init+0xa/0x111
>ret_from_fork+0x22/0x30
> 
> -> #0 (fs_reclaim){+.+.}-{0:0}:  
>__lock_acquire+0x111f/0x1e10
>lock_acquire+0xb5/0x380
>fs_reclaim_acquire+0xa3/0xd0
>kmem_cache_alloc_trace+0x30/0x2c0
>memtype_reserve+0xc3/0x280
>reserve_pfn_range+0x86/0x160
>track_pfn_remap+0xa6/0xe0
>remap_pfn_range+0xa8/0x610
>vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
>vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
>vfio_basic_config_write+0x12d/0x230 [vfio_pci]
>vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
>vfs_write+0xea/0x390
>__x64_sys_pwrite64+0x72/0xb0
>do_syscall_64+0x33/0x40
>entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
..
> > Does current_gfp_context()/memalloc_nofs_save()/etc solve it?  

Yeah, we can indeed use memalloc_nofs_save/restore().  It seems we're
trying to allocate something for pfnmap tracking and that enables lots
of lockdep-specific tests.  Is it valid to wrap io_remap_pfn_range()
with this flag cleared, or am I just masking a bug?  Thanks,

Alex
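
For reference, the wrapping in question would look roughly like the
following.  This is only a sketch, assuming the offending allocation is the
kmalloc() under memtype_reserve() in the io_remap_pfn_range() path shown in
the trace above:

#include <linux/sched/mm.h>

/*
 * Suppress FS reclaim while inserting the BAR mappings so the allocation
 * under memtype_reserve() cannot take fs_reclaim while i_mmap_rwsem is
 * held.  Whether this masks a real dependency or only the lockdep
 * annotation is exactly the open question above.
 */
static int vfio_io_remap_nofs(struct vm_area_struct *vma, unsigned long pfn,
			      unsigned long size)
{
	unsigned int nofs_flags = memalloc_nofs_save();
	int ret;

	ret = io_remap_pfn_range(vma, vma->vm_start, pfn, size,
				 vma->vm_page_prot);
	memalloc_nofs_restore(nofs_flags);

	return ret;
}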



Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-12 Thread Alex Williamson
On Fri, 12 Mar 2021 15:41:47 -0400
Jason Gunthorpe  wrote:

> On Fri, Mar 12, 2021 at 12:16:11PM -0700, Alex Williamson wrote:
> > On Wed, 10 Mar 2021 14:40:11 -0400
> > Jason Gunthorpe  wrote:
> >   
> > > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> > >   
> > > > > I think after the address_space changes this should try to stick with
> > > > > a normal io_rmap_pfn_range() done outside the fault handler.
> > > > 
> > > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > > memory is enabled,
> > > 
> > > Yes, I think I saw Peter thinking along these lines too
> > > 
> > > Then fault just always causes SIGBUS if it gets called  
> > 
> > Trying to use the address_space approach because otherwise we'd just be
> > adding back vma list tracking, it looks like we can't call
> > io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
> > i_mmap_lock_write(), like done in unmap_mapping_range().  lockdep
> > identifies a circular lock order issue against fs_reclaim.  Minimally we
> > also need vma_interval_tree_iter_{first,next} exported in order to use
> > vma_interval_tree_foreach().  Suggestions?  Thanks,  
> 
> You are asking how to put the BAR back into every VMA when it is
> enabled again after it has been zap'd?

Exactly.
 
> What did the lockdep splat look like? Is it a memory allocation?


==
WARNING: possible circular locking dependency detected
5.12.0-rc1+ #18 Not tainted
--
CPU 0/KVM/1406 is trying to acquire lock:
a5a58d60 (fs_reclaim){+.+.}-{0:0}, at: fs_reclaim_acquire+0x83/0xd0

but task is already holding lock:
94c0f3e8fb08 (>i_mmap_rwsem){}-{3:3}, at: 
vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (>i_mmap_rwsem){}-{3:3}:
   down_write+0x3d/0x70
   dma_resv_lockdep+0x1b0/0x298
   do_one_initcall+0x5b/0x2d0
   kernel_init_freeable+0x251/0x298
   kernel_init+0xa/0x111
   ret_from_fork+0x22/0x30

-> #0 (fs_reclaim){+.+.}-{0:0}:
   __lock_acquire+0x111f/0x1e10
   lock_acquire+0xb5/0x380
   fs_reclaim_acquire+0xa3/0xd0
   kmem_cache_alloc_trace+0x30/0x2c0
   memtype_reserve+0xc3/0x280
   reserve_pfn_range+0x86/0x160
   track_pfn_remap+0xa6/0xe0
   remap_pfn_range+0xa8/0x610
   vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
   vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
   vfio_basic_config_write+0x12d/0x230 [vfio_pci]
   vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
   vfs_write+0xea/0x390
   __x64_sys_pwrite64+0x72/0xb0
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

 Possible unsafe locking scenario:

   CPU0CPU1
   
  lock(>i_mmap_rwsem);
   lock(fs_reclaim);
   lock(>i_mmap_rwsem);
  lock(fs_reclaim);

 *** DEADLOCK ***

2 locks held by CPU 0/KVM/1406:
 #0: 94c0f9c71ef0 (>memory_lock){}-{3:3}, at: 
vfio_basic_config_write+0x19a/0x230 [vfio_pci]
 #1: 94c0f3e8fb08 (>i_mmap_rwsem){}-{3:3}, at: 
vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]

stack backtrace:
CPU: 3 PID: 1406 Comm: CPU 0/KVM Not tainted 5.12.0-rc1+ #18
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 
04/27/2013
Call Trace:
 dump_stack+0x7f/0xa1
 check_noncircular+0xcf/0xf0
 __lock_acquire+0x111f/0x1e10
 lock_acquire+0xb5/0x380
 ? fs_reclaim_acquire+0x83/0xd0
 ? pat_enabled+0x10/0x10
 ? memtype_reserve+0xc3/0x280
 fs_reclaim_acquire+0xa3/0xd0
 ? fs_reclaim_acquire+0x83/0xd0
 kmem_cache_alloc_trace+0x30/0x2c0
 memtype_reserve+0xc3/0x280
 reserve_pfn_range+0x86/0x160
 track_pfn_remap+0xa6/0xe0
 remap_pfn_range+0xa8/0x610
 ? lock_acquire+0xb5/0x380
 ? vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]
 ? lock_is_held_type+0xa5/0x120
 vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
 vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
 vfio_basic_config_write+0x12d/0x230 [vfio_pci]
 vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
 vfs_write+0xea/0x390
 __x64_sys_pwrite64+0x72/0xb0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f80176152ff
Code: 08 89 3c 24 48 89 4c 24 18 e8 3d f3 ff ff 4c 8b 54 24 18 48 8b 54 24 10 
41 89 c0 48 8b 74 24 08 8b 3c 24 b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 
44 89 c7 48 89 04 24 e8 6d f3 ff ff 48 8b
RSP: 002b:7f7efa5f72f0 EFLAGS: 0293 ORIG_RAX: 0012
RAX: ffda RBX: 0

Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-12 Thread Alex Williamson
On Wed, 10 Mar 2021 14:40:11 -0400
Jason Gunthorpe  wrote:

> On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> 
> > > I think after the address_space changes this should try to stick with
> > > a normal io_rmap_pfn_range() done outside the fault handler.  
> > 
> > I assume you're suggesting calling io_remap_pfn_range() when device
> > memory is enabled,  
> 
> Yes, I think I saw Peter thinking along these lines too
> 
> Then fault just always causes SIGBUS if it gets called

Trying to use the address_space approach because otherwise we'd just be
adding back vma list tracking, it looks like we can't call
io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
i_mmap_lock_write(), like done in unmap_mapping_range().  lockdep
identifies a circular lock order issue against fs_reclaim.  Minimally we
also need vma_interval_tree_iter_{first,next} exported in order to use
vma_interval_tree_foreach().  Suggestions?  Thanks,

Alex
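
For reference, the shape being described is roughly the following.  The body
of vfio_device_io_remap_mapping_range() here is a guess modeled on
unmap_mapping_range() (the real signature and pfn math in the unposted
series likely differ); it shows both where the lockdep report comes from and
why the vma_interval_tree iterators would need to be exported:

static void vfio_device_io_remap_mapping_range(struct address_space *mapping,
					       loff_t start, loff_t len,
					       unsigned long base_pfn)
{
	pgoff_t first = start >> PAGE_SHIFT;
	pgoff_t last = (start + len - 1) >> PAGE_SHIFT;
	struct vm_area_struct *vma;

	i_mmap_lock_write(mapping);
	/* needs vma_interval_tree_iter_first/next exported for modules */
	vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
		/* simplified: assumes the vma maps linearly from base_pfn */
		unsigned long pfn = base_pfn + vma->vm_pgoff - first;

		/* this is the call that trips lockdep against fs_reclaim */
		io_remap_pfn_range(vma, vma->vm_start, pfn,
				   vma->vm_end - vma->vm_start,
				   vma->vm_page_prot);
	}
	i_mmap_unlock_write(mapping);
}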



[PATCH v2] vfio/pci: Handle concurrent vma faults

2021-03-10 Thread Alex Williamson
vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
from within a vm_ops fault handler.  This function will trigger a
BUG_ON if it encounters a populated pte within the remapped range,
where any fault is meant to populate the entire vma.  Concurrent
inflight faults to the same vma will therefore hit this issue,
triggering traces such as:

[ 1591.733256] kernel BUG at mm/memory.c:2177!
[ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio 
pv680_mii(O)
[ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1
[ 1591.770735] Hardware name:  , BIOS HiFPGA 1P B600 V121-1
[ 1591.778872] pstate: 4049 (nZcv daif +PAN -UAO -TCO BTYPE=--)
[ 1591.786134] pc : remap_pfn_range+0x214/0x340
[ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
[ 1591.799117] sp : 80001068bbd0
[ 1591.803476] x29: 80001068bbd0 x28: 042eff6f
[ 1591.810404] x27: 00110091 x26: 00130091
[ 1591.817457] x25: 00680fd3 x24: a92f1338e358
[ 1591.825144] x23: 00114000 x22: 0041
[ 1591.832506] x21: 00130091 x20: a92f141a4000
[ 1591.839520] x19: 001100a0 x18: 
[ 1591.846108] x17:  x16: a92f11844540
[ 1591.853570] x15:  x14: 
[ 1591.860768] x13: fc00 x12: 0880
[ 1591.868053] x11: 0821bf3d01d0 x10: 5ef2abd89000
[ 1591.875932] x9 : a92f12ab0064 x8 : a92f136471c0
[ 1591.883208] x7 : 00114091 x6 : 0002
[ 1591.890177] x5 : 0001 x4 : 0001
[ 1591.896656] x3 :  x2 : 016804400fd3
[ 1591.903215] x1 : 082126261880 x0 : fc2084989868
[ 1591.910234] Call trace:
[ 1591.914837]  remap_pfn_range+0x214/0x340
[ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
[ 1591.931200]  __do_fault+0x44/0x12c
[ 1591.937031]  handle_mm_fault+0xcc8/0x1230
[ 1591.942475]  do_page_fault+0x16c/0x484
[ 1591.948635]  do_translation_fault+0xbc/0xd8
[ 1591.954171]  do_mem_abort+0x4c/0xc0
[ 1591.960316]  el0_da+0x40/0x80
[ 1591.965585]  el0_sync_handler+0x168/0x1b0
[ 1591.971608]  el0_sync+0x174/0x180
[ 1591.978312] Code: eb1b027f 54c0 f9400022 b4fffe02 (d421)

Switch to using vmf_insert_pfn() to allow replacing mappings, and
include decrypted memory protection as formerly provided by
io_remap_pfn_range().  Tracking of vmas is also updated to
prevent duplicate entries.

Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
Reported-by: Zeng Tao 
Suggested-by: Zeng Tao 
Signed-off-by: Alex Williamson 
---

v2: Set decrypted pgprot in mmap, use non-_prot vmf_insert_pfn()
as suggested by Jason G.

 drivers/vfio/pci/vfio_pci.c |   30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b44578..73e125d73640 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device 
*vdev,
 {
struct vfio_pci_mmap_vma *mmap_vma;
 
+   list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+   if (mmap_vma->vma == vma)
+   return 0; /* Swallow the error, the vma is tracked */
+   }
+
mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
if (!mmap_vma)
return -ENOMEM;
@@ -1612,31 +1617,31 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault 
*vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct vfio_pci_device *vdev = vma->vm_private_data;
-   vm_fault_t ret = VM_FAULT_NOPAGE;
+   unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
 
mutex_lock(&vdev->vma_lock);
down_read(&vdev->memory_lock);
 
-   if (!__vfio_pci_memory_enabled(vdev)) {
-   ret = VM_FAULT_SIGBUS;
-   mutex_unlock(&vdev->vma_lock);
+   if (!__vfio_pci_memory_enabled(vdev))
goto up_out;
+
+   for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
+   ret = vmf_insert_pfn(vma, vaddr, pfn);
+   if (ret != VM_FAULT_NOPAGE) {
+   zap_vma_ptes(vma, vma->vm_start, vaddr - vma->vm_start);
+   goto up_out;
+   }
}
 
if (__vfio_pci_add_vma(vdev, vma)) {
ret = VM_FAULT_OOM;
-   mutex_unlock(&vdev->vma_lock);
-   goto up_out;
+   zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
}
 
-   mutex_unlock(&vdev->vma_lock);
-
-   if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-  vma->vm_end - vma->vm_start, vma->vm_page_prot))
-   ret = VM_FAULT_SIGBUS;
-
 up_out:
   

Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-10 Thread Alex Williamson
On Wed, 10 Mar 2021 14:14:46 -0400
Jason Gunthorpe  wrote:

> On Wed, Mar 10, 2021 at 10:53:29AM -0700, Alex Williamson wrote:
> 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 65e7e6b44578..ae723808e08b 100644
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device 
> > *vdev,
> >  {
> > struct vfio_pci_mmap_vma *mmap_vma;
> >  
> > +   list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> > +   if (mmap_vma->vma == vma)
> > +   return 0; /* Swallow the error, the vma is tracked */
> > +   }
> > +
> > mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> > if (!mmap_vma)
> > return -ENOMEM;
> > @@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct 
> > vm_fault *vmf)
> >  {
> > struct vm_area_struct *vma = vmf->vma;
> > struct vfio_pci_device *vdev = vma->vm_private_data;
> > -   vm_fault_t ret = VM_FAULT_NOPAGE;
> > +   unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
> > +   vm_fault_t ret = VM_FAULT_SIGBUS;
> >  
> > mutex_lock(&vdev->vma_lock);
> > down_read(&vdev->memory_lock);
> >  
> > -   if (!__vfio_pci_memory_enabled(vdev)) {
> > -   ret = VM_FAULT_SIGBUS;
> > -   mutex_unlock(&vdev->vma_lock);
> > +   if (!__vfio_pci_memory_enabled(vdev))
> > goto up_out;
> > +
> > +   for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
> > +   ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
> > + pgprot_decrypted(vma->vm_page_prot)); 
> >  
> 
> I investigated this, I think the above pgprot_decrypted() should be
> moved here:
> 
> static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> {
> vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +   vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
> 
> 
> And since:
> 
> vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>   unsigned long pfn)
> {
>   return vmf_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);
> 
> The above can just use vfm_insert_pfn()

Cool, easy enough.  Thanks for looking.
 
> The only thing that makes me nervous about this arrangment is loosing
> the track_pfn_remap() which was in remap_pfn_range() - I think it
> means we miss out on certain PAT manipulations.. I *suspect* this is
> not a problem for VFIO because it will rely on the MTRRs generally on
> x86 - but I also don't know this mechanim too well.

Yeah, for VM use cases the MTRRs generally override.

> I think after the address_space changes this should try to stick with
> a normal io_rmap_pfn_range() done outside the fault handler.

I assume you're suggesting calling io_remap_pfn_range() when device
memory is enabled, do you mean using vma_interval_tree_foreach() like
unmap_mapping_range() does to avoid re-adding a vma list?  Thanks,

Alex



[PATCH] vfio/pci: Handle concurrent vma faults

2021-03-10 Thread Alex Williamson
vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
from within a vm_ops fault handler.  This function will trigger a
BUG_ON if it encounters a populated pte within the remapped range,
where any fault is meant to populate the entire vma.  Concurrent
inflight faults to the same vma will therefore hit this issue,
triggering traces such as:

[ 1591.733256] kernel BUG at mm/memory.c:2177!
[ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio 
pv680_mii(O)
[ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1
[ 1591.770735] Hardware name:  , BIOS HiFPGA 1P B600 V121-1
[ 1591.778872] pstate: 4049 (nZcv daif +PAN -UAO -TCO BTYPE=--)
[ 1591.786134] pc : remap_pfn_range+0x214/0x340
[ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
[ 1591.799117] sp : 80001068bbd0
[ 1591.803476] x29: 80001068bbd0 x28: 042eff6f
[ 1591.810404] x27: 00110091 x26: 00130091
[ 1591.817457] x25: 00680fd3 x24: a92f1338e358
[ 1591.825144] x23: 00114000 x22: 0041
[ 1591.832506] x21: 00130091 x20: a92f141a4000
[ 1591.839520] x19: 001100a0 x18: 
[ 1591.846108] x17:  x16: a92f11844540
[ 1591.853570] x15:  x14: 
[ 1591.860768] x13: fc00 x12: 0880
[ 1591.868053] x11: 0821bf3d01d0 x10: 5ef2abd89000
[ 1591.875932] x9 : a92f12ab0064 x8 : a92f136471c0
[ 1591.883208] x7 : 00114091 x6 : 0002
[ 1591.890177] x5 : 0001 x4 : 0001
[ 1591.896656] x3 :  x2 : 016804400fd3
[ 1591.903215] x1 : 082126261880 x0 : fc2084989868
[ 1591.910234] Call trace:
[ 1591.914837]  remap_pfn_range+0x214/0x340
[ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
[ 1591.931200]  __do_fault+0x44/0x12c
[ 1591.937031]  handle_mm_fault+0xcc8/0x1230
[ 1591.942475]  do_page_fault+0x16c/0x484
[ 1591.948635]  do_translation_fault+0xbc/0xd8
[ 1591.954171]  do_mem_abort+0x4c/0xc0
[ 1591.960316]  el0_da+0x40/0x80
[ 1591.965585]  el0_sync_handler+0x168/0x1b0
[ 1591.971608]  el0_sync+0x174/0x180
[ 1591.978312] Code: eb1b027f 54c0 f9400022 b4fffe02 (d421)

Switch to using vmf_insert_pfn_prot() so that we can retain the
decrypted memory protection from io_remap_pfn_range(), but allow
concurrent page table updates.  Tracking of vmas is also updated to
prevent duplicate entries.

Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
Reported-by: Zeng Tao 
Suggested-by: Zeng Tao 
Signed-off-by: Alex Williamson 
---

Zeng Tao, I hope you don't mind me sending a new version to keep
this moving.  Testing and review appreciated, thanks!

 drivers/vfio/pci/vfio_pci.c |   30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b44578..ae723808e08b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device 
*vdev,
 {
struct vfio_pci_mmap_vma *mmap_vma;
 
+   list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+   if (mmap_vma->vma == vma)
+   return 0; /* Swallow the error, the vma is tracked */
+   }
+
mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
if (!mmap_vma)
return -ENOMEM;
@@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault 
*vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct vfio_pci_device *vdev = vma->vm_private_data;
-   vm_fault_t ret = VM_FAULT_NOPAGE;
+   unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
 
mutex_lock(&vdev->vma_lock);
down_read(&vdev->memory_lock);
 
-   if (!__vfio_pci_memory_enabled(vdev)) {
-   ret = VM_FAULT_SIGBUS;
-   mutex_unlock(&vdev->vma_lock);
+   if (!__vfio_pci_memory_enabled(vdev))
goto up_out;
+
+   for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
+   ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
+ pgprot_decrypted(vma->vm_page_prot));
+   if (ret != VM_FAULT_NOPAGE) {
+   zap_vma_ptes(vma, vma->vm_start, vaddr - vma->vm_start);
+   goto up_out;
+   }
}
 
if (__vfio_pci_add_vma(vdev, vma)) {
ret = VM_FAULT_OOM;
-   mutex_unlock(&vdev->vma_lock);
-   goto up_out;
+   zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
}
 
-   mutex_unlock(&vdev->vma_lock);
-
-   if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-   

Re: [PATCH v1 02/14] vfio: Update vfio_add_group_dev() API

2021-03-10 Thread Alex Williamson
On Wed, 10 Mar 2021 08:19:13 -0400
Jason Gunthorpe  wrote:

> On Wed, Mar 10, 2021 at 07:48:38AM +, Christoph Hellwig wrote:
> > On Mon, Mar 08, 2021 at 02:47:40PM -0700, Alex Williamson wrote:  
> > > Rather than an errno, return a pointer to the opaque vfio_device
> > > to allow the bus driver to call into vfio-core without additional
> > > lookups and references.  Note that bus drivers are still required
> > > to use vfio_del_group_dev() to teardown the vfio_device.
> > > 
> > > Signed-off-by: Alex Williamson   
> > 
> > This looks like it is superseded by the
> > 
> >   vfio: Split creation of a vfio_device into init and register ops  
> 
> Yes, that series puts vfio_device everywhere so APIs like Alex needs
> to build here become trivial.
> 
> The fact we both converged on this same requirement is good

You're ahead of me in catching up with reviews Christoph, but
considering stable backports and the motivations for each series, I'd
expect to initially make the minimal API change and build from there.
Thanks,

Alex
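
For reference, the bus driver side of the minimal API change would look
roughly like this.  Whether failure is reported via ERR_PTR() or NULL, and
where the driver stashes the returned pointer, are assumptions here; the
probe/remove names are illustrative:

static int my_vfio_pci_probe(struct pci_dev *pdev,
			     const struct pci_device_id *id)
{
	struct vfio_pci_device *vdev;
	struct vfio_device *device;

	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
	if (!vdev)
		return -ENOMEM;

	device = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
	if (IS_ERR(device)) {
		kfree(vdev);
		return PTR_ERR(device);
	}

	/* keep the pointer for later calls back into vfio-core */
	vdev->vfio_device = device;	/* hypothetical field */
	return 0;
}

static void my_vfio_pci_remove(struct pci_dev *pdev)
{
	/* teardown still goes through vfio_del_group_dev() */
	struct vfio_pci_device *vdev = vfio_del_group_dev(&pdev->dev);

	kfree(vdev);
}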



Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant

2021-03-09 Thread Alex Williamson
On Tue, 9 Mar 2021 19:45:03 -0400
Jason Gunthorpe  wrote:

> On Tue, Mar 09, 2021 at 12:56:39PM -0700, Alex Williamson wrote:
> 
> > And I think this is what we end up with for the current code base:  
> 
> Yeah, that looks Ok
>  
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 65e7e6b44578..2f247ab18c66 100644
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1568,19 +1568,24 @@ void vfio_pci_memory_unlock_and_restore(struct 
> > vfio_pci_device *vdev, u16 cmd)
> >  }
> >  
> >  /* Caller holds vma_lock */
> > -static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
> > - struct vm_area_struct *vma)
> > +struct vfio_pci_mmap_vma *__vfio_pci_add_vma(struct vfio_pci_device *vdev,
> > +struct vm_area_struct *vma)
> >  {
> > struct vfio_pci_mmap_vma *mmap_vma;
> >  
> > +   list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> > +   if (mmap_vma->vma == vma)
> > +   return ERR_PTR(-EEXIST);
> > +   }
> > +
> > mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> > if (!mmap_vma)
> > -   return -ENOMEM;
> > +   return ERR_PTR(-ENOMEM);
> >  
> > mmap_vma->vma = vma;
> > list_add(&mmap_vma->vma_next, &vdev->vma_list);
> >  
> > -   return 0;
> > +   return mmap_vma;
> >  }
> >  
> >  /*
> > @@ -1612,30 +1617,39 @@ static vm_fault_t vfio_pci_mmap_fault(struct 
> > vm_fault *vmf)
> >  {
> > struct vm_area_struct *vma = vmf->vma;
> > struct vfio_pci_device *vdev = vma->vm_private_data;
> > -   vm_fault_t ret = VM_FAULT_NOPAGE;
> > +   struct vfio_pci_mmap_vma *mmap_vma;
> > +   unsigned long vaddr, pfn;
> > +   vm_fault_t ret;
> >  
> > mutex_lock(&vdev->vma_lock);
> > down_read(&vdev->memory_lock);
> >  
> > if (!__vfio_pci_memory_enabled(vdev)) {
> > ret = VM_FAULT_SIGBUS;
> > -   mutex_unlock(&vdev->vma_lock);
> > goto up_out;
> > }
> >  
> > -   if (__vfio_pci_add_vma(vdev, vma)) {
> > -   ret = VM_FAULT_OOM;
> > -   mutex_unlock(&vdev->vma_lock);
> > +   mmap_vma = __vfio_pci_add_vma(vdev, vma);
> > +   if (IS_ERR(mmap_vma)) {
> > +   /* A concurrent fault might have already inserted the page */
> > +   ret = (PTR_ERR(mmap_vma) == -EEXIST) ? VM_FAULT_NOPAGE :
> > +  VM_FAULT_OOM;  
> 
> I think -EEIXST should not be an error, lets just go down to the
> vmf_insert_pfn() and let the MM resolve the race naturally.
> 
> I suspect returning VM_FAULT_NOPAGE will be averse to the userspace if
> it hits this race??

Given the serialization on vma_lock, if the vma_list entry exists then
the full vma should already be populated, so I don't see the NOPAGE
issue you're worried about.  However, if we wanted to be more similar
to what we expect the new version to do, we could proceed through
re-inserting the pages on -EEXIST.  Zeng Tao's re-ordering to add the
vma_list entry only after successfully inserting all the pfns might work
better for that.
 
> Also the _prot does look needed at least due to the SME, but possibly
> also to ensure NC gets set..

If we need more than pgprot_decrypted(vma->vm_page_prot), please let us
know, but that's all we were getting from io_remap_pfn_range() afaict.
Thanks,

Alex
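
For reference, a sketch of the "proceed through re-insertion on -EEXIST"
variant discussed above, based on the code earlier in this thread.  It
assumes pgprot_decrypted() has already been applied to vm_page_prot at mmap
time, so the plain vmf_insert_pfn() suffices; error unwinding is trimmed:

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_device *vdev = vma->vm_private_data;
	struct vfio_pci_mmap_vma *mmap_vma;
	unsigned long vaddr, pfn;
	vm_fault_t ret = VM_FAULT_SIGBUS;

	mutex_lock(&vdev->vma_lock);
	down_read(&vdev->memory_lock);

	if (!__vfio_pci_memory_enabled(vdev))
		goto out;

	mmap_vma = __vfio_pci_add_vma(vdev, vma);
	if (IS_ERR(mmap_vma) && PTR_ERR(mmap_vma) != -EEXIST) {
		ret = VM_FAULT_OOM;
		goto out;
	}

	/*
	 * -EEXIST means a concurrent fault already tracked (and populated or
	 * is populating) this vma; re-inserting the same pfns below is just
	 * harmless duplicate work.
	 */
	for (vaddr = vma->vm_start, pfn = vma->vm_pgoff;
	     vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
		ret = vmf_insert_pfn(vma, vaddr, pfn);
		if (ret != VM_FAULT_NOPAGE)
			break;
	}
out:
	up_read(&vdev->memory_lock);
	mutex_unlock(&vdev->vma_lock);

	return ret;
}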



Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant

2021-03-09 Thread Alex Williamson
On Tue, 9 Mar 2021 19:41:27 -0400
Jason Gunthorpe  wrote:

> On Tue, Mar 09, 2021 at 12:26:07PM -0700, Alex Williamson wrote:
> 
> > In the new series, I think the fault handler becomes (untested):
> > 
> > static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > struct vfio_pci_device *vdev = vma->vm_private_data;
> > unsigned long base_pfn, pgoff;
> > vm_fault_t ret = VM_FAULT_SIGBUS;
> > 
> > if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
> > return ret;
> > 
> > pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;  
> 
> I don't think this math is completely safe, it needs to parse the
> vm_pgoff..
> 
> I'm worried userspace could split/punch/mangle a VMA using
> munmap/mremap/etc/etc in a way that does update the pg_off but is
> incompatible with the above.

parsing vm_pgoff is done in:

static int vfio_pci_bar_vma_to_pfn(struct vm_area_struct *vma,
				   unsigned long *pfn)
{
	struct vfio_pci_device *vdev = vma->vm_private_data;
	struct pci_dev *pdev = vdev->pdev;
	int index;
	u64 pgoff;

	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

	if (index >= VFIO_PCI_ROM_REGION_INDEX ||
	    !vdev->bar_mmap_supported[index] || !vdev->barmap[index])
		return -EINVAL;

	pgoff = vma->vm_pgoff &
		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);

	*pfn = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

	return 0;
}

But given Peter's concern about faulting individual pages, I think the
fault handler becomes:

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_device *vdev = vma->vm_private_data;
	unsigned long vaddr, pfn;
	vm_fault_t ret = VM_FAULT_SIGBUS;

	if (vfio_pci_bar_vma_to_pfn(vma, &pfn))
		return ret;

	down_read(&vdev->memory_lock);

	if (__vfio_pci_memory_enabled(vdev)) {
		for (vaddr = vma->vm_start;
		     vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
			ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
					pgprot_decrypted(vma->vm_page_prot));
			if (ret != VM_FAULT_NOPAGE) {
				zap_vma_ptes(vma, vma->vm_start,
					     vaddr - vma->vm_start);
				break;
			}
		}
	}

	up_read(&vdev->memory_lock);

	return ret;
}

Thanks,
Alex
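
For context, the reason parsing vm_pgoff works at all: userspace maps a BAR
at the offset the kernel reports for that region, and vfio-pci encodes the
region index in the high bits of that offset (above VFIO_PCI_OFFSET_SHIFT),
which is exactly what vfio_pci_bar_vma_to_pfn() recovers.  A userspace
sketch (device fd handling elided):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

static void *mmap_bar(int device_fd, unsigned int index)
{
	struct vfio_region_info info;

	memset(&info, 0, sizeof(info));
	info.argsz = sizeof(info);
	info.index = index;

	/* the reported offset carries the region index in its high bits */
	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return MAP_FAILED;

	return mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    device_fd, info.offset);
}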



Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant

2021-03-09 Thread Alex Williamson
On Tue, 9 Mar 2021 16:00:36 -0500
Peter Xu  wrote:

> On Tue, Mar 09, 2021 at 01:11:04PM -0700, Alex Williamson wrote:
> > > It's just that the initial MMIO access delay would be spread to the 1st 
> > > access
> > > of each mmio page access rather than using the previous pre-fault scheme. 
> > >  I
> > > think an userspace cares the delay enough should pre-fault all pages 
> > > anyway,
> > > but just raise this up.  Otherwise looks sane.  
> > 
> > Yep, this is a concern.  Is it safe to have loops concurrently and fully
> > populating the same vma with vmf_insert_pfn()?  
> 
> AFAIU it's safe, and probably the (so far) best way for an userspace to 
> quickly
> populate a huge chunk of mmap()ed region for either MMIO or RAM.  Indeed from
> that pov vmf_insert_pfn() seems to be even more efficient on prefaulting since
> it can be threaded.

Ok, then we'll keep the loop and expect that a race might incur
duplicate work, but should be safe.

It also occurred to me that Jason was suggesting the _prot version of
vmf_insert_pfn(), which I think is necessary if we want to keep the
same semantics where the default io_remap_pfn_range() was applying
pgprot_decrypted() onto vma->vm_page_prot.  So if we don't want to
break SME use cases we better apply that ourselves.  Thanks,

Alex
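
For completeness, the userspace pre-faulting referred to above is trivial.
A sketch, assuming the BAR has already been mmap()ed; note that reads can
have side effects on some devices, so a real user would pick a safe offset
or use MAP_POPULATE instead:

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Touch one byte per page so every pte is populated up front, paying the
 * per-page fault cost at setup time instead of on first MMIO access. */
static void prefault_mmio(volatile uint8_t *map, size_t size)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t off;

	for (off = 0; off < size; off += page)
		(void)map[off];
}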



Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant

2021-03-09 Thread Alex Williamson
On Tue, 9 Mar 2021 14:48:24 -0500
Peter Xu  wrote:

> On Tue, Mar 09, 2021 at 12:26:07PM -0700, Alex Williamson wrote:
> > On Tue, 9 Mar 2021 13:47:39 -0500
> > Peter Xu  wrote:
> >   
> > > On Tue, Mar 09, 2021 at 12:40:04PM -0400, Jason Gunthorpe wrote:  
> > > > On Tue, Mar 09, 2021 at 08:29:51AM -0700, Alex Williamson wrote:
> > > > > On Tue, 9 Mar 2021 08:46:09 -0400
> > > > > Jason Gunthorpe  wrote:
> > > > > 
> > > > > > On Tue, Mar 09, 2021 at 03:49:09AM +, Zengtao (B) wrote:
> > > > > > > Hi guys:
> > > > > > > 
> > > > > > > Thanks for the helpful comments, after rethinking the issue, I 
> > > > > > > have proposed
> > > > > > >  the following change: 
> > > > > > > 1. follow_pte instead of follow_pfn.  
> > > > > > 
> > > > > > Still no on follow_pfn, you don't need it once you use 
> > > > > > vmf_insert_pfn
> > > > > 
> > > > > vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
> > > > > here to determine whether the translation is already present to avoid
> > > > > both duplicate work in inserting the translation and allocating a
> > > > > duplicate vma tracking structure.
> > > >  
> > > > Oh.. Doing something stateful in fault is not nice at all
> > > > 
> > > > I would rather see __vfio_pci_add_vma() search the vma_list for dups
> > > > than call follow_pfn/pte..
> > > 
> > > It seems to me that searching vma list is still the simplest way to fix 
> > > the
> > > problem for the current code base.  I see io_remap_pfn_range() is also 
> > > used in
> > > the new series - maybe that'll need to be moved to where 
> > > PCI_COMMAND_MEMORY got
> > > turned on/off in the new series (I just noticed remap_pfn_range modifies 
> > > vma
> > > flags..), as you suggested in the other email.  
> > 
> > 
> > In the new series, I think the fault handler becomes (untested):
> > 
> > static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > struct vfio_pci_device *vdev = vma->vm_private_data;
> > unsigned long base_pfn, pgoff;
> > vm_fault_t ret = VM_FAULT_SIGBUS;
> > 
> > if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
> > return ret;
> > 
> > pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> > 
> > down_read(&vdev->memory_lock);
> > 
> > if (__vfio_pci_memory_enabled(vdev))
> > ret = vmf_insert_pfn(vma, vmf->address, pgoff + base_pfn);
> > 
> > up_read(&vdev->memory_lock);
> > 
> > return ret;
> > }  
> 
> It's just that the initial MMIO access delay would be spread to the 1st access
> of each mmio page rather than using the previous pre-fault scheme.  I
> think a userspace that cares enough about the delay should pre-fault all pages
> anyway, but just raising this.  Otherwise looks sane.

Yep, this is a concern.  Is it safe to have concurrent loops fully
populating the same vma with vmf_insert_pfn()?  If it is then we could
just ignore that we're doing duplicate work when we hit this race
condition.  Otherwise we'd need to serialize again, perhaps via a lock
and flag stored in a struct linked from vm_private_data, along with
tracking to free that object :-\  Thanks,

Alex



Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant

2021-03-09 Thread Alex Williamson
On Tue, 9 Mar 2021 12:26:07 -0700
Alex Williamson  wrote:

> On Tue, 9 Mar 2021 13:47:39 -0500
> Peter Xu  wrote:
> 
> > On Tue, Mar 09, 2021 at 12:40:04PM -0400, Jason Gunthorpe wrote:  
> > > On Tue, Mar 09, 2021 at 08:29:51AM -0700, Alex Williamson wrote:
> > > > On Tue, 9 Mar 2021 08:46:09 -0400
> > > > Jason Gunthorpe  wrote:
> > > > 
> > > > > On Tue, Mar 09, 2021 at 03:49:09AM +, Zengtao (B) wrote:
> > > > > > Hi guys:
> > > > > > 
> > > > > > Thanks for the helpful comments, after rethinking the issue, I have 
> > > > > > proposed
> > > > > >  the following change: 
> > > > > > 1. follow_pte instead of follow_pfn.  
> > > > > 
> > > > > Still no on follow_pfn, you don't need it once you use vmf_insert_pfn 
> > > > >
> > > > 
> > > > vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
> > > > here to determine whether the translation is already present to avoid
> > > > both duplicate work in inserting the translation and allocating a
> > > > duplicate vma tracking structure.
> > >  
> > > Oh.. Doing something stateful in fault is not nice at all
> > > 
> > > I would rather see __vfio_pci_add_vma() search the vma_list for dups
> > > than call follow_pfn/pte..
> > 
> > It seems to me that searching vma list is still the simplest way to fix the
> > problem for the current code base.  I see io_remap_pfn_range() is also used 
> > in
> > the new series - maybe that'll need to be moved to where PCI_COMMAND_MEMORY 
> > got
> > turned on/off in the new series (I just noticed remap_pfn_range modifies vma
> > flags..), as you suggested in the other email.  
> 
> 
> In the new series, I think the fault handler becomes (untested):
> 
> static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> struct vfio_pci_device *vdev = vma->vm_private_data;
> unsigned long base_pfn, pgoff;
> vm_fault_t ret = VM_FAULT_SIGBUS;
> 
> if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
> return ret;
> 
> pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> 
> down_read(&vdev->memory_lock);
> 
> if (__vfio_pci_memory_enabled(vdev))
> ret = vmf_insert_pfn(vma, vmf->address, pgoff + base_pfn);
> 
> up_read(&vdev->memory_lock);
> 
> return ret;
> }

And I think this is what we end up with for the current code base:

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b44578..2f247ab18c66 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1568,19 +1568,24 @@ void vfio_pci_memory_unlock_and_restore(struct 
vfio_pci_device *vdev, u16 cmd)
 }
 
 /* Caller holds vma_lock */
-static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
- struct vm_area_struct *vma)
+struct vfio_pci_mmap_vma *__vfio_pci_add_vma(struct vfio_pci_device *vdev,
+struct vm_area_struct *vma)
 {
struct vfio_pci_mmap_vma *mmap_vma;
 
+   list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+   if (mmap_vma->vma == vma)
+   return ERR_PTR(-EEXIST);
+   }
+
mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
if (!mmap_vma)
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
 
mmap_vma->vma = vma;
list_add(&mmap_vma->vma_next, &vdev->vma_list);
 
-   return 0;
+   return mmap_vma;
 }
 
 /*
@@ -1612,30 +1617,39 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault 
*vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct vfio_pci_device *vdev = vma->vm_private_data;
-   vm_fault_t ret = VM_FAULT_NOPAGE;
+   struct vfio_pci_mmap_vma *mmap_vma;
+   unsigned long vaddr, pfn;
+   vm_fault_t ret;
 
mutex_lock(&vdev->vma_lock);
down_read(&vdev->memory_lock);
 
if (!__vfio_pci_memory_enabled(vdev)) {
ret = VM_FAULT_SIGBUS;
-   mutex_unlock(&vdev->vma_lock);
goto up_out;
}
 
-   if (__vfio_pci_add_vma(vdev, vma)) {
-   ret = VM_FAULT_OOM;
-   mutex_unlock(&vdev->vma_lock);
+   mmap_vma = __vfio_pci_add_vma(vdev, vma);
+   if (IS_ERR(mmap_vma)) {
+   /* A concurrent fault might have already inserted the page */
+   ret = (PTR_ERR(mmap_vma) == -EEXIST) ? VM_FAULT_NOPAGE :
+

Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant

2021-03-09 Thread Alex Williamson
On Tue, 9 Mar 2021 13:47:39 -0500
Peter Xu  wrote:

> On Tue, Mar 09, 2021 at 12:40:04PM -0400, Jason Gunthorpe wrote:
> > On Tue, Mar 09, 2021 at 08:29:51AM -0700, Alex Williamson wrote:  
> > > On Tue, 9 Mar 2021 08:46:09 -0400
> > > Jason Gunthorpe  wrote:
> > >   
> > > > On Tue, Mar 09, 2021 at 03:49:09AM +, Zengtao (B) wrote:  
> > > > > Hi guys:
> > > > > 
> > > > > Thanks for the helpful comments, after rethinking the issue, I have 
> > > > > proposed
> > > > >  the following change: 
> > > > > 1. follow_pte instead of follow_pfn.
> > > > 
> > > > Still no on follow_pfn, you don't need it once you use vmf_insert_pfn  
> > > 
> > > vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
> > > here to determine whether the translation is already present to avoid
> > > both duplicate work in inserting the translation and allocating a
> > > duplicate vma tracking structure.  
> >  
> > Oh.. Doing something stateful in fault is not nice at all
> > 
> > I would rather see __vfio_pci_add_vma() search the vma_list for dups
> > than call follow_pfn/pte..  
> 
> It seems to me that searching vma list is still the simplest way to fix the
> problem for the current code base.  I see io_remap_pfn_range() is also used in
> the new series - maybe that'll need to be moved to where PCI_COMMAND_MEMORY 
> got
> turned on/off in the new series (I just noticed remap_pfn_range modifies vma
> flags..), as you suggested in the other email.


In the new series, I think the fault handler becomes (untested):

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_device *vdev = vma->vm_private_data;
	unsigned long base_pfn, pgoff;
	vm_fault_t ret = VM_FAULT_SIGBUS;

	if (vfio_pci_bar_vma_to_pfn(vma, &base_pfn))
		return ret;

	pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;

	down_read(&vdev->memory_lock);

	if (__vfio_pci_memory_enabled(vdev))
		ret = vmf_insert_pfn(vma, vmf->address, pgoff + base_pfn);

	up_read(&vdev->memory_lock);

	return ret;
}

Thanks,
Alex



Re: [PATCH v1 07/14] vfio: Add a device notifier interface

2021-03-09 Thread Alex Williamson
On Mon, 8 Mar 2021 20:46:27 -0400
Jason Gunthorpe  wrote:

> On Mon, Mar 08, 2021 at 02:48:30PM -0700, Alex Williamson wrote:
> > Using a vfio device, a notifier block can be registered to receive
> > select device events.  Notifiers can only be registered for contained
> > devices, ie. they are available through a user context.  Registration
> > of a notifier increments the reference to that container context
> > therefore notifiers must minimally respond to the release event by
> > asynchronously removing notifiers.
> > 
> > Signed-off-by: Alex Williamson 
> >  drivers/vfio/Kconfig |1 +
> >  drivers/vfio/vfio.c  |   35 +++
> >  include/linux/vfio.h |9 +
> >  3 files changed, 45 insertions(+)
> > 
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> > index 90c0525b1e0c..9a67675c9b6c 100644
> > +++ b/drivers/vfio/Kconfig
> > @@ -23,6 +23,7 @@ menuconfig VFIO
> > tristate "VFIO Non-Privileged userspace driver framework"
> > select IOMMU_API
> > select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
> > +   select SRCU
> > help
> >   VFIO provides a framework for secure userspace device drivers.
> >   See Documentation/driver-api/vfio.rst for more details.
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c47895539a1a..7f6d00e54e83 100644
> > +++ b/drivers/vfio/vfio.c
> > @@ -105,6 +105,7 @@ struct vfio_device {
> > struct list_headgroup_next;
> > void*device_data;
> > struct inode*inode;
> > +   struct srcu_notifier_head   notifier;
> >  };
> >  
> >  #ifdef CONFIG_VFIO_NOIOMMU
> > @@ -601,6 +602,7 @@ struct vfio_device *vfio_group_create_device(struct 
> > vfio_group *group,
> > device->ops = ops;
> > device->device_data = device_data;
> > dev_set_drvdata(dev, device);
> > +   srcu_init_notifier_head(&device->notifier);
> >  
> > /* No need to get group_lock, caller has group reference */
> > vfio_group_get(group);
> > @@ -1785,6 +1787,39 @@ static const struct file_operations vfio_device_fops 
> > = {
> > .mmap   = vfio_device_fops_mmap,
> >  };
> >  
> > +int vfio_device_register_notifier(struct vfio_device *device,
> > + struct notifier_block *nb)
> > +{
> > +   int ret;
> > +
> > +   /* Container ref persists until unregister on success */
> > +   ret =  vfio_group_add_container_user(device->group);  
> 
> I'm having trouble guessing why we need to refcount the group to add a
> notifier to the device's notifier chain? 
> 
> I suppose it actually has to do with the MMIO mapping? But I don't
> know what the relation is between MMIO mappings in the IOMMU and the
> container? This could deserve a comment?

Sure, I can add a comment.  We want to make sure the device remains
within an IOMMU context so long as we have a DMA mapping to the device
MMIO, which could potentially manipulate the device.  IOMMU context is
managed at the group level.
 
> > +void vfio_device_unregister_notifier(struct vfio_device *device,
> > +   struct notifier_block *nb)
> > +{
> > +   if (!srcu_notifier_chain_unregister(&device->notifier, nb))
> > +   vfio_group_try_dissolve_container(device->group);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_device_unregister_notifier);  
> 
> Is the SRCU still needed with the new locking? With a cursory look I
> only noticed this called under the reflck->lock ?

When registering the notifier, the iommu->lock is held.  During the
callback, the same lock is acquired, so we'd have AB-BA otherwise.
Thanks,

Alex
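
For reference, the consumer side as shaped in this series would look roughly
like the following, a sketch based on the signatures in patch 07 and the
usage in patch 11 (names prefixed my_ are illustrative):

struct my_consumer {
	struct notifier_block	nb;
	struct vfio_device	*device;
	struct work_struct	work;
};

static void my_unregister_bg(struct work_struct *work)
{
	struct my_consumer *c = container_of(work, struct my_consumer, work);

	/* can't unregister from within the callback chain itself */
	vfio_device_unregister_notifier(c->device, &c->nb);
	vfio_device_put(c->device);
}

static int my_device_nb_cb(struct notifier_block *nb, unsigned long action,
			   void *unused)
{
	struct my_consumer *c = container_of(nb, struct my_consumer, nb);

	if (action == VFIO_DEVICE_RELEASE) {
		/* drop whatever mappings/references depend on the device... */
		INIT_WORK(&c->work, my_unregister_bg);
		schedule_work(&c->work);
	}

	return NOTIFY_OK;
}

/* Registration pins the device's container context until unregister. */
static int my_consumer_attach(struct my_consumer *c, struct vfio_device *device)
{
	c->device = device;
	c->nb.notifier_call = my_device_nb_cb;

	return vfio_device_register_notifier(device, &c->nb);
}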



Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant

2021-03-09 Thread Alex Williamson
On Tue, 9 Mar 2021 08:46:09 -0400
Jason Gunthorpe  wrote:

> On Tue, Mar 09, 2021 at 03:49:09AM +, Zengtao (B) wrote:
> > Hi guys:
> > 
> > Thanks for the helpful comments, after rethinking the issue, I have proposed
> >  the following change: 
> > 1. follow_pte instead of follow_pfn.  
> 
> Still no on follow_pfn, you don't need it once you use vmf_insert_pfn

vmf_insert_pfn() only solves the BUG_ON, follow_pte() is being used
here to determine whether the translation is already present to avoid
both duplicate work in inserting the translation and allocating a
duplicate vma tracking structure.

> > 2. vmf_insert_pfn loops instead of io_remap_pfn_range
> > 3. proper undos when some call fails.
> > 4. keep the bigger lock range to avoid unnecessary pte installs. 
> 
> Why do we need locks at all here?

For the vma tracking and testing whether the fault is already
populated.  Once we get rid of the vma list, maybe it makes sense to
only insert the faulting page rather than the entire vma, at which
point I think we'd have no reason to serialize.  Thanks,

Alex



[PATCH v1 14/14] vfio: Cleanup use of bare unsigned

2021-03-08 Thread Alex Williamson
Replace with 'unsigned int'.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci_intrs.c |   42 ++---
 drivers/vfio/pci/vfio_pci_private.h   |4 +-
 drivers/vfio/platform/vfio_platform_irq.c |   21 +++--
 drivers/vfio/platform/vfio_platform_private.h |4 +-
 4 files changed, 39 insertions(+), 32 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index 869dce5f134d..67de58d67908 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -364,8 +364,8 @@ static int vfio_msi_set_vector_signal(struct 
vfio_pci_device *vdev,
return 0;
 }
 
-static int vfio_msi_set_block(struct vfio_pci_device *vdev, unsigned start,
- unsigned count, int32_t *fds, bool msix)
+static int vfio_msi_set_block(struct vfio_pci_device *vdev, unsigned int start,
+ unsigned int count, int32_t *fds, bool msix)
 {
int i, j, ret = 0;
 
@@ -418,8 +418,9 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, 
bool msix)
  * IOCTL support
  */
 static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev,
-   unsigned index, unsigned start,
-   unsigned count, uint32_t flags, void *data)
+   unsigned int index, unsigned int start,
+   unsigned int count, uint32_t flags,
+   void *data)
 {
if (!is_intx(vdev) || start != 0 || count != 1)
return -EINVAL;
@@ -445,8 +446,9 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device 
*vdev,
 }
 
 static int vfio_pci_set_intx_mask(struct vfio_pci_device *vdev,
- unsigned index, unsigned start,
- unsigned count, uint32_t flags, void *data)
+ unsigned int index, unsigned int start,
+ unsigned int count, uint32_t flags,
+ void *data)
 {
if (!is_intx(vdev) || start != 0 || count != 1)
return -EINVAL;
@@ -465,8 +467,9 @@ static int vfio_pci_set_intx_mask(struct vfio_pci_device 
*vdev,
 }
 
 static int vfio_pci_set_intx_trigger(struct vfio_pci_device *vdev,
-unsigned index, unsigned start,
-unsigned count, uint32_t flags, void *data)
+unsigned int index, unsigned int start,
+unsigned int count, uint32_t flags,
+void *data)
 {
if (is_intx(vdev) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
vfio_intx_disable(vdev);
@@ -508,8 +511,9 @@ static int vfio_pci_set_intx_trigger(struct vfio_pci_device 
*vdev,
 }
 
 static int vfio_pci_set_msi_trigger(struct vfio_pci_device *vdev,
-   unsigned index, unsigned start,
-   unsigned count, uint32_t flags, void *data)
+   unsigned int index, unsigned int start,
+   unsigned int count, uint32_t flags,
+   void *data)
 {
int i;
bool msix = (index == VFIO_PCI_MSIX_IRQ_INDEX) ? true : false;
@@ -614,8 +618,9 @@ static int vfio_pci_set_ctx_trigger_single(struct 
eventfd_ctx **ctx,
 }
 
 static int vfio_pci_set_err_trigger(struct vfio_pci_device *vdev,
-   unsigned index, unsigned start,
-   unsigned count, uint32_t flags, void *data)
+   unsigned int index, unsigned int start,
+   unsigned int count, uint32_t flags,
+   void *data)
 {
if (index != VFIO_PCI_ERR_IRQ_INDEX || start != 0 || count > 1)
return -EINVAL;
@@ -625,8 +630,9 @@ static int vfio_pci_set_err_trigger(struct vfio_pci_device 
*vdev,
 }
 
 static int vfio_pci_set_req_trigger(struct vfio_pci_device *vdev,
-   unsigned index, unsigned start,
-   unsigned count, uint32_t flags, void *data)
+   unsigned int index, unsigned int start,
+   unsigned int count, uint32_t flags,
+   void *data)
 {
if (index != VFIO_PCI_REQ_IRQ_INDEX || start != 0 || count > 1)
return -EINVAL;
@@ -636,11 +642,11 @@ static int vfio_pci_set_req_trigger(struct 
vfio_pci_device *vdev,
 }
 
 int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
-   unsigned index, unsigned start, unsigned count,
-   void *data)
+ 

[PATCH v1 13/14] vfio: Remove extern from declarations across vfio

2021-03-08 Thread Alex Williamson
Cleanup disrecommended usage and docs.

Signed-off-by: Alex Williamson 
---
 Documentation/driver-api/vfio-mediated-device.rst |   19 ++-
 Documentation/driver-api/vfio.rst |4 -
 drivers/s390/cio/vfio_ccw_cp.h|   13 +-
 drivers/s390/cio/vfio_ccw_private.h   |   14 +-
 drivers/s390/crypto/vfio_ap_private.h |2 
 drivers/vfio/fsl-mc/vfio_fsl_mc_private.h |7 +
 drivers/vfio/pci/vfio_pci_private.h   |   66 +--
 drivers/vfio/platform/vfio_platform_private.h |   31 +++--
 include/linux/vfio.h  |  122 ++---
 9 files changed, 130 insertions(+), 148 deletions(-)

diff --git a/Documentation/driver-api/vfio-mediated-device.rst 
b/Documentation/driver-api/vfio-mediated-device.rst
index 25eb7d5b834b..7685ef582f7a 100644
--- a/Documentation/driver-api/vfio-mediated-device.rst
+++ b/Documentation/driver-api/vfio-mediated-device.rst
@@ -115,12 +115,11 @@ to register and unregister itself with the core driver:
 
 * Register::
 
-extern int  mdev_register_driver(struct mdev_driver *drv,
-  struct module *owner);
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
 
 * Unregister::
 
-extern void mdev_unregister_driver(struct mdev_driver *drv);
+void mdev_unregister_driver(struct mdev_driver *drv);
 
 The mediated bus driver is responsible for adding mediated devices to the VFIO
 group when devices are bound to the driver and removing mediated devices from
@@ -162,13 +161,13 @@ The callbacks in the mdev_parent_ops structure are as 
follows:
 A driver should use the mdev_parent_ops structure in the function call to
 register itself with the mdev core driver::
 
-   extern int  mdev_register_device(struct device *dev,
-const struct mdev_parent_ops *ops);
+   int  mdev_register_device(struct device *dev,
+ const struct mdev_parent_ops *ops);
 
 However, the mdev_parent_ops structure is not required in the function call
 that a driver should use to unregister itself with the mdev core driver::
 
-   extern void mdev_unregister_device(struct device *dev);
+   void mdev_unregister_device(struct device *dev);
 
 
 Mediated Device Management Interface Through sysfs
@@ -293,11 +292,11 @@ Translation APIs for Mediated Devices
 The following APIs are provided for translating user pfn to host pfn in a VFIO
 driver::
 
-   extern int vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
- int npage, int prot, unsigned long *phys_pfn);
+   int vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+  int npage, int prot, unsigned long *phys_pfn);
 
-   extern int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn,
-   int npage);
+   int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn,
+int npage);
 
 These functions call back into the back-end IOMMU module by using the pin_pages
 and unpin_pages callbacks of the struct vfio_iommu_driver_ops[4]. Currently
diff --git a/Documentation/driver-api/vfio.rst 
b/Documentation/driver-api/vfio.rst
index 03e978eb8ec7..e6ba42ca6346 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -252,11 +252,11 @@ into VFIO core.  When devices are bound and unbound to 
the driver,
 the driver should call vfio_add_group_dev() and vfio_del_group_dev()
 respectively::
 
-   extern struct vfio_device *vfio_add_group_dev(struct device *dev,
+   struct vfio_device *vfio_add_group_dev(struct device *dev,
const struct vfio_device_ops *ops,
void *device_data);
 
-   extern void *vfio_del_group_dev(struct device *dev);
+   void *vfio_del_group_dev(struct device *dev);
 
 vfio_add_group_dev() indicates to the core to begin tracking the
 iommu_group of the specified dev and register the dev as owned by
diff --git a/drivers/s390/cio/vfio_ccw_cp.h b/drivers/s390/cio/vfio_ccw_cp.h
index ba31240ce965..1ea81c4fe630 100644
--- a/drivers/s390/cio/vfio_ccw_cp.h
+++ b/drivers/s390/cio/vfio_ccw_cp.h
@@ -42,12 +42,11 @@ struct channel_program {
struct ccw1 *guest_cp;
 };
 
-extern int cp_init(struct channel_program *cp, struct device *mdev,
-  union orb *orb);
-extern void cp_free(struct channel_program *cp);
-extern int cp_prefetch(struct channel_program *cp);
-extern union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm);
-extern void cp_update_scsw(struct channel_program *cp, union scsw *scsw);
-extern bool cp_iova_pinned(struct channel_program *cp, u64 iova);
+int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb);
+void cp_free(struct channel_program *cp);
+int cp_prefetch

[PATCH v1 11/14] vfio/type1: Register device notifier

2021-03-08 Thread Alex Williamson
Impose a new default strict MMIO mapping mode where the vma for
a VM_PFNMAP mapping must be backed by a vfio device.  This allows
holding a reference to the device and registering a notifier for the
device, which additionally keeps the device in an IOMMU context for
the extent of the DMA mapping.  On notification of device release,
automatically drop the DMA mappings for it.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio_iommu_type1.c |  163 ---
 1 file changed, 116 insertions(+), 47 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index f22c07a40521..e89f11141dee 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -101,6 +101,20 @@ struct vfio_dma {
struct task_struct  *task;
struct rb_root  pfn_list;   /* Ex-user pinned pfn list */
unsigned long   *bitmap;
+   struct pfnmap_obj   *pfnmap;
+};
+
+/*
+ * Separate object used for tracking pfnmaps to allow reference release and
+ * unregistering notifier outside of callback chain.
+ */
+struct pfnmap_obj {
+   struct notifier_block   nb;
+   struct work_struct  work;
+   struct vfio_iommu   *iommu;
+   struct vfio_dma *dma;
+   struct vfio_device  *device;
+   unsigned long   base_pfn;
 };
 
 struct vfio_batch {
@@ -506,42 +520,6 @@ static void vfio_batch_fini(struct vfio_batch *batch)
free_page((unsigned long)batch->pages);
 }
 
-static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
-   unsigned long vaddr, unsigned long *pfn,
-   bool write_fault)
-{
-   pte_t *ptep;
-   spinlock_t *ptl;
-   int ret;
-
-   ret = follow_pte(vma->vm_mm, vaddr, &ptep, &ptl);
-   if (ret) {
-   bool unlocked = false;
-
-   ret = fixup_user_fault(mm, vaddr,
-  FAULT_FLAG_REMOTE |
-  (write_fault ?  FAULT_FLAG_WRITE : 0),
-  &unlocked);
-   if (unlocked)
-   return -EAGAIN;
-
-   if (ret)
-   return ret;
-
-   ret = follow_pte(vma->vm_mm, vaddr, &ptep, &ptl);
-   if (ret)
-   return ret;
-   }
-
-   if (write_fault && !pte_write(*ptep))
-   ret = -EFAULT;
-   else
-   *pfn = pte_pfn(*ptep);
-
-   pte_unmap_unlock(ptep, ptl);
-   return ret;
-}
-
 /* Return 1 if iommu->lock dropped and notified, 0 if done */
 static int unmap_dma_pfn_list(struct vfio_iommu *iommu, struct vfio_dma *dma,
  struct vfio_dma **dma_last, int *retries)
@@ -575,6 +553,52 @@ static int unmap_dma_pfn_list(struct vfio_iommu *iommu, 
struct vfio_dma *dma,
return 0;
 }
 
+static void unregister_device_bg(struct work_struct *work)
+{
+   struct pfnmap_obj *pfnmap = container_of(work, struct pfnmap_obj, work);
+
+   vfio_device_unregister_notifier(pfnmap->device, &pfnmap->nb);
+   vfio_device_put(pfnmap->device);
+   kfree(pfnmap);
+}
+
+static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma);
+
+static int vfio_device_nb_cb(struct notifier_block *nb,
+unsigned long action, void *unused)
+{
+   struct pfnmap_obj *pfnmap = container_of(nb, struct pfnmap_obj, nb);
+
+   switch (action) {
+   case VFIO_DEVICE_RELEASE:
+   {
+   struct vfio_dma *dma_last = NULL;
+   int retries = 0;
+again:
+   mutex_lock(&pfnmap->iommu->lock);
+   if (pfnmap->dma) {
+   struct vfio_dma *dma = pfnmap->dma;
+
+   if (unmap_dma_pfn_list(pfnmap->iommu, dma,
+  &dma_last, &retries))
+   goto again;
+
+   dma->pfnmap = NULL;
+   pfnmap->dma = NULL;
+   vfio_remove_dma(pfnmap->iommu, dma);
+   }
+   mutex_unlock(&pfnmap->iommu->lock);
+
+   /* Cannot unregister notifier from callback chain */
+   INIT_WORK(&pfnmap->work, unregister_device_bg);
+   schedule_work(&pfnmap->work);
+   break;
+   }
+   }
+
+   return NOTIFY_OK;
+}
+
 /*
  * Returns the positive number of pfns successfully obtained or a negative
  * error code.
@@ -601,21 +625,60 @@ static int vaddr_get_pfns(struct vfio_iommu *iommu, 
struct vfio_dma *dma,
 
vaddr = untagged_addr(vaddr);
 
-retry:
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
if (vma && vma->vm_flags & VM_PFNMAP) {
-   ret = follow_fault_pfn(vma, mm, vaddr, pfn,
-  dma->prot &am

[PATCH v1 12/14] vfio/type1: Support batching of device mappings

2021-03-08 Thread Alex Williamson
Populate the page array to the extent available to enable batching.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio_iommu_type1.c |   10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e89f11141dee..d499bccfbe3f 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -628,6 +628,8 @@ static int vaddr_get_pfns(struct vfio_iommu *iommu, struct 
vfio_dma *dma,
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
if (vma && vma->vm_flags & VM_PFNMAP) {
+   unsigned long count, i;
+
if ((dma->prot & IOMMU_WRITE && !(vma->vm_flags & VM_WRITE)) ||
(dma->prot & IOMMU_READ && !(vma->vm_flags & VM_READ))) {
ret = -EFAULT;
@@ -678,7 +680,13 @@ static int vaddr_get_pfns(struct vfio_iommu *iommu, struct 
vfio_dma *dma,
 
*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) +
dma->pfnmap->base_pfn;
-   ret = 1;
+   count = min_t(long,
+ (vma->vm_end - vaddr) >> PAGE_SHIFT, npages);
+
+   for (i = 0; i < count; i++)
+   pages[i] = pfn_to_page(*pfn + i);
+
+   ret = count;
}
 done:
mmap_read_unlock(mm);



[PATCH v1 09/14] vfio/type1: Refactor pfn_list clearing

2021-03-08 Thread Alex Williamson
Pull code out to a function for re-use.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio_iommu_type1.c |   57 +++
 1 file changed, 34 insertions(+), 23 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 12d9905b429f..f7d35a114354 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -542,6 +542,39 @@ static int follow_fault_pfn(struct vm_area_struct *vma, 
struct mm_struct *mm,
return ret;
 }
 
+/* Return 1 if iommu->lock dropped and notified, 0 if done */
+static int unmap_dma_pfn_list(struct vfio_iommu *iommu, struct vfio_dma *dma,
+ struct vfio_dma **dma_last, int *retries)
+{
+   if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
+   struct vfio_iommu_type1_dma_unmap nb_unmap;
+
+   if (*dma_last == dma) {
+   BUG_ON(++(*retries) > 10);
+   } else {
+   *dma_last = dma;
+   *retries = 0;
+   }
+
+   nb_unmap.iova = dma->iova;
+   nb_unmap.size = dma->size;
+
+   /*
+* Notify anyone (mdev vendor drivers) to invalidate and
+* unmap iovas within the range we're about to unmap.
+* Vendor drivers MUST unpin pages in response to an
+* invalidation.
+*/
+   mutex_unlock(&iommu->lock);
+   blocking_notifier_call_chain(&iommu->notifier,
+VFIO_IOMMU_NOTIFY_DMA_UNMAP,
+&nb_unmap);
+   return 1;
+   }
+
+   return 0;
+}
+
 /*
  * Returns the positive number of pfns successfully obtained or a negative
  * error code.
@@ -1397,29 +1430,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
continue;
}
 
-   if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
-   struct vfio_iommu_type1_dma_unmap nb_unmap;
-
-   if (dma_last == dma) {
-   BUG_ON(++retries > 10);
-   } else {
-   dma_last = dma;
-   retries = 0;
-   }
-
-   nb_unmap.iova = dma->iova;
-   nb_unmap.size = dma->size;
-
-   /*
-* Notify anyone (mdev vendor drivers) to invalidate and
-* unmap iovas within the range we're about to unmap.
-* Vendor drivers MUST unpin pages in response to an
-* invalidation.
-*/
-   mutex_unlock(&iommu->lock);
-   blocking_notifier_call_chain(&iommu->notifier,
-   VFIO_IOMMU_NOTIFY_DMA_UNMAP,
-   &nb_unmap);
+   if (unmap_dma_pfn_list(iommu, dma, &dma_last, &retries)) {
mutex_lock(&iommu->lock);
goto again;
}



[PATCH v1 10/14] vfio/type1: Pass iommu and dma objects through to vaddr_get_pfn

2021-03-08 Thread Alex Williamson
We'll need these to track vfio device mappings.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio_iommu_type1.c |   28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index f7d35a114354..f22c07a40521 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -579,15 +579,16 @@ static int unmap_dma_pfn_list(struct vfio_iommu *iommu, 
struct vfio_dma *dma,
  * Returns the positive number of pfns successfully obtained or a negative
  * error code.
  */
-static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
- long npages, int prot, unsigned long *pfn,
+static int vaddr_get_pfns(struct vfio_iommu *iommu, struct vfio_dma *dma,
+ struct mm_struct *mm, unsigned long vaddr,
+ long npages, unsigned long *pfn,
  struct page **pages)
 {
struct vm_area_struct *vma;
unsigned int flags = 0;
int ret;
 
-   if (prot & IOMMU_WRITE)
+   if (dma->prot & IOMMU_WRITE)
flags |= FOLL_WRITE;
 
mmap_read_lock(mm);
@@ -604,7 +605,8 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned 
long vaddr,
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
if (vma && vma->vm_flags & VM_PFNMAP) {
-   ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+   ret = follow_fault_pfn(vma, mm, vaddr, pfn,
+  dma->prot & IOMMU_WRITE);
if (ret == -EAGAIN)
goto retry;
 
@@ -680,7 +682,8 @@ static int vfio_wait_all_valid(struct vfio_iommu *iommu)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
+static long vfio_pin_pages_remote(struct vfio_iommu *iommu,
+ struct vfio_dma *dma, unsigned long vaddr,
  long npage, unsigned long *pfn_base,
  unsigned long limit, struct vfio_batch *batch)
 {
@@ -708,7 +711,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
/* Empty batch, so refill it. */
long req_pages = min_t(long, npage, batch->capacity);
 
-   ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot,
+   ret = vaddr_get_pfns(iommu, dma, mm, vaddr, req_pages,
 &pfn, batch->pages);
if (ret < 0)
goto unpin_out;
@@ -806,7 +809,8 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, 
dma_addr_t iova,
return unlocked;
 }
 
-static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
+static int vfio_pin_page_external(struct vfio_iommu *iommu,
+ struct vfio_dma *dma, unsigned long vaddr,
  unsigned long *pfn_base, bool do_accounting)
 {
struct page *pages[1];
@@ -817,7 +821,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, 
unsigned long vaddr,
if (!mm)
return -ENODEV;
 
-   ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, pages);
+   ret = vaddr_get_pfns(iommu, dma, mm, vaddr, 1, pfn_base, pages);
if (ret == 1 && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
ret = vfio_lock_acct(dma, 1, true);
if (ret) {
@@ -925,8 +929,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
}
 
remote_vaddr = dma->vaddr + (iova - dma->iova);
-   ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
-do_accounting);
+   ret = vfio_pin_page_external(iommu, dma, remote_vaddr,
+&phys_pfn[i], do_accounting);
if (ret)
goto pin_unwind;
 
@@ -1497,7 +1501,7 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, 
struct vfio_dma *dma,
 
while (size) {
/* Pin a contiguous chunk of memory */
-   npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
+   npage = vfio_pin_pages_remote(iommu, dma, vaddr + dma->size,
  size >> PAGE_SHIFT, &pfn, limit,
  &batch);
if (npage <= 0) {
@@ -1759,7 +1763,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
size_t n = dma->iova + dma->size - iova;
   

[PATCH v1 08/14] vfio/pci: Notify on device release

2021-03-08 Thread Alex Williamson
Trigger a release notifier call when open reference count is zero.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 585895970e9c..bee9318b46ed 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -560,6 +560,7 @@ static void vfio_pci_release(void *device_data)
mutex_lock(>reflck->lock);
 
if (!(--vdev->refcnt)) {
+   vfio_device_notifier_call(vdev->device, VFIO_DEVICE_RELEASE);
vfio_pci_vf_token_user_add(vdev, -1);
vfio_spapr_pci_eeh_release(vdev->pdev);
vfio_pci_disable(vdev);



[PATCH v1 06/14] vfio: Add vma to pfn callback

2021-03-08 Thread Alex Williamson
Add a new vfio_device_ops callback to allow the bus driver to
translate a vma mapping of a vfio device fd to a pfn.  Plumb through
vfio-core.  Implemented for vfio-pci.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |1 +
 drivers/vfio/vfio.c |   16 
 include/linux/vfio.h|3 +++
 3 files changed, 20 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 415b5109da9b..585895970e9c 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1756,6 +1756,7 @@ static const struct vfio_device_ops vfio_pci_ops = {
.mmap   = vfio_pci_mmap,
.request= vfio_pci_request,
.match  = vfio_pci_match,
+   .vma_to_pfn = vfio_pci_bar_vma_to_pfn,
 };
 
 static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 3a3e85a0dc3e..c47895539a1a 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -944,6 +944,22 @@ struct vfio_device *vfio_device_get_from_vma(struct 
vm_area_struct *vma)
 }
 EXPORT_SYMBOL_GPL(vfio_device_get_from_vma);
 
+int vfio_vma_to_pfn(struct vm_area_struct *vma, unsigned long *pfn)
+{
+   struct vfio_device *device;
+
+   if (!vma->vm_file || vma->vm_file->f_op != &vfio_device_fops)
+   return -EINVAL;
+
+   device = vma->vm_file->private_data;
+
+   if (unlikely(!device->ops->vma_to_pfn))
+   return -EINVAL;
+
+   return device->ops->vma_to_pfn(vma, pfn);
+}
+EXPORT_SYMBOL_GPL(vfio_vma_to_pfn);
+
 static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
 char *buf)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 660b8adf90a6..dbd90d0ba713 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -29,6 +29,7 @@
  * @match: Optional device name match callback (return: 0 for no-match, >0 for
  * match, -errno for abort (ex. match with insufficient or incorrect
  * additional args)
+ * @vma_to_pfn: Optional pfn from vma lookup against vma mapping device fd
  */
 struct vfio_device_ops {
char    *name;
@@ -43,6 +44,7 @@ struct vfio_device_ops {
int (*mmap)(void *device_data, struct vm_area_struct *vma);
void    (*request)(void *device_data, unsigned int count);
int (*match)(void *device_data, char *buf);
+   int (*vma_to_pfn)(struct vm_area_struct *vma, unsigned long *pfn);
 };
 
 extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
@@ -59,6 +61,7 @@ extern void *vfio_device_data(struct vfio_device *device);
 extern void vfio_device_unmap_mapping_range(struct vfio_device *device,
loff_t start, loff_t len);
 extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct 
*vma);
+extern int vfio_vma_to_pfn(struct vm_area_struct *vma, unsigned long *pfn);
 
 /* events for the backend driver notify callback */
 enum vfio_iommu_notify_type {



[PATCH v1 07/14] vfio: Add a device notifier interface

2021-03-08 Thread Alex Williamson
Using a vfio device, a notifier block can be registered to receive
select device events.  Notifiers can only be registered for contained
devices, ie. they are available through a user context.  Registration
of a notifier increments the reference to that container context
therefore notifiers must minimally respond to the release event by
asynchronously removing notifiers.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/Kconfig |1 +
 drivers/vfio/vfio.c  |   35 +++
 include/linux/vfio.h |9 +
 3 files changed, 45 insertions(+)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 90c0525b1e0c..9a67675c9b6c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -23,6 +23,7 @@ menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
select IOMMU_API
select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
+   select SRCU
help
  VFIO provides a framework for secure userspace device drivers.
  See Documentation/driver-api/vfio.rst for more details.
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c47895539a1a..7f6d00e54e83 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -105,6 +105,7 @@ struct vfio_device {
struct list_head    group_next;
void                *device_data;
struct inode        *inode;
+   struct srcu_notifier_head   notifier;
 };
 
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -601,6 +602,7 @@ struct vfio_device *vfio_group_create_device(struct 
vfio_group *group,
device->ops = ops;
device->device_data = device_data;
dev_set_drvdata(dev, device);
+   srcu_init_notifier_head(&device->notifier);
 
/* No need to get group_lock, caller has group reference */
vfio_group_get(group);
@@ -1785,6 +1787,39 @@ static const struct file_operations vfio_device_fops = {
.mmap   = vfio_device_fops_mmap,
 };
 
+int vfio_device_register_notifier(struct vfio_device *device,
+ struct notifier_block *nb)
+{
+   int ret;
+
+   /* Container ref persists until unregister on success */
+   ret =  vfio_group_add_container_user(device->group);
+   if (ret)
+   return ret;
+
+   ret = srcu_notifier_chain_register(&device->notifier, nb);
+   if (ret)
+   vfio_group_try_dissolve_container(device->group);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_device_register_notifier);
+
+void vfio_device_unregister_notifier(struct vfio_device *device,
+   struct notifier_block *nb)
+{
+   if (!srcu_notifier_chain_unregister(&device->notifier, nb))
+   vfio_group_try_dissolve_container(device->group);
+}
+EXPORT_SYMBOL_GPL(vfio_device_unregister_notifier);
+
+int vfio_device_notifier_call(struct vfio_device *device,
+ enum vfio_device_notify_type event)
+{
+   return srcu_notifier_call_chain(&device->notifier, event, NULL);
+}
+EXPORT_SYMBOL_GPL(vfio_device_notifier_call);
+
 /**
  * External user API, exported by symbols to be linked dynamically.
  *
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index dbd90d0ba713..c3ff36a7fa6f 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -62,6 +62,15 @@ extern void vfio_device_unmap_mapping_range(struct 
vfio_device *device,
loff_t start, loff_t len);
 extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct 
*vma);
 extern int vfio_vma_to_pfn(struct vm_area_struct *vma, unsigned long *pfn);
+extern int vfio_device_register_notifier(struct vfio_device *device,
+struct notifier_block *nb);
+extern void vfio_device_unregister_notifier(struct vfio_device *device,
+   struct notifier_block *nb);
+enum vfio_device_notify_type {
+   VFIO_DEVICE_RELEASE = 0,
+};
+int vfio_device_notifier_call(struct vfio_device *device,
+ enum vfio_device_notify_type event);
 
 /* events for the backend driver notify callback */
 enum vfio_iommu_notify_type {



[PATCH v1 05/14] vfio: Create a vfio_device from vma lookup

2021-03-08 Thread Alex Williamson
Creating an address space mapping onto our vfio pseudo fs for each
device file descriptor means that we can universally retrieve a
vfio_device from a vma mapping this file.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio.c  |   19 +--
 include/linux/vfio.h |1 +
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 3852e57b9e04..3a3e85a0dc3e 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -927,6 +927,23 @@ struct vfio_device *vfio_device_get_from_dev(struct device 
*dev)
 }
 EXPORT_SYMBOL_GPL(vfio_device_get_from_dev);
 
+static const struct file_operations vfio_device_fops;
+
+struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
+{
+   struct vfio_device *device;
+
+   if (!vma->vm_file || vma->vm_file->f_op != &vfio_device_fops)
+   return ERR_PTR(-ENODEV);
+
+   device = vma->vm_file->private_data;
+
+   vfio_device_get(device);
+
+   return device;
+}
+EXPORT_SYMBOL_GPL(vfio_device_get_from_vma);
+
 static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
 char *buf)
 {
@@ -1486,8 +1503,6 @@ static int vfio_group_add_container_user(struct 
vfio_group *group)
return 0;
 }
 
-static const struct file_operations vfio_device_fops;
-
 static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
 {
struct vfio_device *device;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index f435dfca15eb..660b8adf90a6 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -58,6 +58,7 @@ extern void vfio_device_put(struct vfio_device *device);
 extern void *vfio_device_data(struct vfio_device *device);
 extern void vfio_device_unmap_mapping_range(struct vfio_device *device,
loff_t start, loff_t len);
+extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct 
*vma);
 
 /* events for the backend driver notify callback */
 enum vfio_iommu_notify_type {



[PATCH v1 04/14] vfio/pci: Use vfio_device_unmap_mapping_range()

2021-03-08 Thread Alex Williamson
With the vfio device fd tied to the address space of the pseudo fs
inode, we can use the mm to track all vmas that might be mmap'ing
device BARs, which removes our vma_list and all the complicated
lock ordering necessary to manually zap each related vma.

Note that we can no longer store the pfn in vm_pgoff if we want to
use unmap_mapping_range() to zap a selective portion of the device
fd corresponding to BAR mappings.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |  228 +++
 drivers/vfio/pci/vfio_pci_private.h |3 
 2 files changed, 46 insertions(+), 185 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f0a1d05f0137..415b5109da9b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -225,7 +225,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device 
*vdev)
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
 static void vfio_pci_disable(struct vfio_pci_device *vdev);
-static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data);
+static int vfio_pci_mem_trylock_and_zap_cb(struct pci_dev *pdev, void *data);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -1168,7 +1168,7 @@ static long vfio_pci_ioctl(void *device_data,
struct vfio_pci_group_info info;
struct vfio_devices devs = { .cur_index = 0 };
bool slot = false;
-   int i, group_idx, mem_idx = 0, count = 0, ret = 0;
+   int i, group_idx, count = 0, ret = 0;
 
minsz = offsetofend(struct vfio_pci_hot_reset, count);
 
@@ -1268,32 +1268,16 @@ static long vfio_pci_ioctl(void *device_data,
}
 
/*
-* We need to get memory_lock for each device, but devices
-* can share mmap_lock, therefore we need to zap and hold
-* the vma_lock for each device, and only then get each
-* memory_lock.
+* Try to get the memory_lock write lock for all devices and
+* zap all BAR mmaps.
 */
ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-   vfio_pci_try_zap_and_vma_lock_cb,
+   vfio_pci_mem_trylock_and_zap_cb,
&devs, slot);
-   if (ret)
-   goto hot_reset_release;
-
-   for (; mem_idx < devs.cur_index; mem_idx++) {
-   struct vfio_pci_device *tmp;
-
-   tmp = vfio_device_data(devs.devices[mem_idx]);
-
-   ret = down_write_trylock(&tmp->memory_lock);
-   if (!ret) {
-   ret = -EBUSY;
-   goto hot_reset_release;
-   }
-   mutex_unlock(&tmp->vma_lock);
-   }
 
/* User has access, do the reset */
-   ret = pci_reset_bus(vdev->pdev);
+   if (!ret)
+   ret = pci_reset_bus(vdev->pdev);
 
 hot_reset_release:
for (i = 0; i < devs.cur_index; i++) {
@@ -1303,10 +1287,7 @@ static long vfio_pci_ioctl(void *device_data,
device = devs.devices[i];
tmp = vfio_device_data(device);
 
-   if (i < mem_idx)
-   up_write(&tmp->memory_lock);
-   else
-   mutex_unlock(&tmp->vma_lock);
+   up_write(&tmp->memory_lock);
vfio_device_put(device);
}
kfree(devs.devices);
@@ -1452,100 +1433,18 @@ static ssize_t vfio_pci_write(void *device_data, const 
char __user *buf,
return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
-/* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
-static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
+static void vfio_pci_zap_bars(struct vfio_pci_device *vdev)
 {
-   struct vfio_pci_mmap_vma *mmap_vma, *tmp;
-
-   /*
-* Lock ordering:
-* vma_lock is nested under mmap_lock for vm_ops callback paths.
-* The memory_lock semaphore is used by both code paths calling
-* into this function to zap vmas and the vm_ops.fault callback
-* to protect the memory enable state of the device.
-*
-* When zapping vmas we need to maintain the mmap_lock => vma_lock
-* ordering, which requires using vma_lock to walk vma_list to
-* acquire an mm, then dropping vma_lock to get the mmap_lock and
-* reacquiring vma_lock.  This logic is derived from similar
-* requirements in uverbs_user_mmap_disassociate().

[PATCH v1 00/14] vfio: Device memory DMA mapping improvements

2021-03-08 Thread Alex Williamson
The primary goal of this series is to better manage device memory
mappings, both with a much simplified scheme to zap CPU mappings of
device memory using unmap_mapping_range() and also to restrict IOMMU
mappings of PFNMAPs to vfio device memory and drop those mappings on
device release.  This series updates vfio-pci to include the necessary
vma-to-pfn interface, allowing the type1 IOMMU backend to recognize
vfio device memory.  If other bus drivers support peer-to-peer DMA,
they should be updated with a similar callback and trigger the device
notifier on release.
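
For a bus driver other than vfio-pci, the hookup amounts to implementing
the vma-to-pfn callback and firing the release notifier once the last open
reference drops.  A rough sketch against this series (the "foo" driver and
its single-region mmap layout are purely illustrative):

	/* Translate a vma mmap'ing this device fd into its base pfn */
	static int vfio_foo_vma_to_pfn(struct vm_area_struct *vma,
				       unsigned long *pfn)
	{
		/* vma_to_foo_device() is a hypothetical driver helper */
		struct vfio_foo_device *fdev = vma_to_foo_device(vma);

		/* assumes one mmap'able region starting at fdev->region_start */
		*pfn = (fdev->region_start >> PAGE_SHIFT) + vma->vm_pgoff;
		return 0;
	}

	static const struct vfio_device_ops vfio_foo_ops = {
		...
		.vma_to_pfn	= vfio_foo_vma_to_pfn,
	};

	/* in the driver's release path, when the open count reaches zero */
	vfio_device_notifier_call(fdev->device, VFIO_DEVICE_RELEASE);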

RFC->v1:

 - Fix some incorrect ERR handling
 - Fix use of vm_pgoff to be compatible with unmap_mapping_range()
 - Add vma-to-pfn interfaces
 - Generic device-from-vma handling
 - pfnmap obj directly maps back to vfio_dma obj
 - No bypass for strict MMIO handling
 - Batch PFNMAP handling
 - Follow-on patches to cleanup "extern" usage and bare unsigned

Works in my environment, further testing always appreciated.  This
will need to be merged with a solution for concurrent fault handling.
Thanks especially to Jason Gunthorpe for previous reviews and
suggestions.  Thanks,

Alex

RFC:https://lore.kernel.org/kvm/161401167013.16443.8389863523766611711.st...@gimli.home/

---

Alex Williamson (14):
  vfio: Create vfio_fs_type with inode per device
  vfio: Update vfio_add_group_dev() API
  vfio: Export unmap_mapping_range() wrapper
  vfio/pci: Use vfio_device_unmap_mapping_range()
  vfio: Create a vfio_device from vma lookup
  vfio: Add vma to pfn callback
  vfio: Add a device notifier interface
  vfio/pci: Notify on device release
  vfio/type1: Refactor pfn_list clearing
  vfio/type1: Pass iommu and dma objects through to vaddr_get_pfn
  vfio/type1: Register device notifier
  vfio/type1: Support batching of device mappings
  vfio: Remove extern from declarations across vfio
  vfio: Cleanup use of bare unsigned


 Documentation/driver-api/vfio-mediated-device.rst |   19 +-
 Documentation/driver-api/vfio.rst |8 -
 drivers/s390/cio/vfio_ccw_cp.h|   13 +
 drivers/s390/cio/vfio_ccw_private.h   |   14 +
 drivers/s390/crypto/vfio_ap_private.h |2 
 drivers/vfio/Kconfig  |1 
 drivers/vfio/fsl-mc/vfio_fsl_mc.c |6 -
 drivers/vfio/fsl-mc/vfio_fsl_mc_private.h |7 -
 drivers/vfio/mdev/vfio_mdev.c |5 
 drivers/vfio/pci/vfio_pci.c   |  229 -
 drivers/vfio/pci/vfio_pci_intrs.c |   42 ++--
 drivers/vfio/pci/vfio_pci_private.h   |   69 +++---
 drivers/vfio/platform/vfio_platform_common.c  |7 -
 drivers/vfio/platform/vfio_platform_irq.c |   21 +-
 drivers/vfio/platform/vfio_platform_private.h |   31 +--
 drivers/vfio/vfio.c   |  154 --
 drivers/vfio/vfio_iommu_type1.c   |  234 +++--
 include/linux/vfio.h  |  129 ++--
 18 files changed, 543 insertions(+), 448 deletions(-)



[PATCH v1 03/14] vfio: Export unmap_mapping_range() wrapper

2021-03-08 Thread Alex Williamson
Allow bus drivers to use vfio pseudo fs mapping to zap all mmaps
across a range of their device files.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio.c  |7 +++
 include/linux/vfio.h |2 ++
 2 files changed, 9 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 34d32f16246a..3852e57b9e04 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -565,6 +565,13 @@ static struct inode *vfio_fs_inode_new(void)
return inode;
 }
 
+void vfio_device_unmap_mapping_range(struct vfio_device *device,
+loff_t start, loff_t len)
+{
+   unmap_mapping_range(device->inode->i_mapping, start, len, true);
+}
+EXPORT_SYMBOL_GPL(vfio_device_unmap_mapping_range);
+
 /**
  * Device objects - create, release, get, put, search
  */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b784463000d4..f435dfca15eb 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -56,6 +56,8 @@ extern void *vfio_del_group_dev(struct device *dev);
 extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);
 extern void vfio_device_put(struct vfio_device *device);
 extern void *vfio_device_data(struct vfio_device *device);
+extern void vfio_device_unmap_mapping_range(struct vfio_device *device,
+   loff_t start, loff_t len);
 
 /* events for the backend driver notify callback */
 enum vfio_iommu_notify_type {



[PATCH v1 02/14] vfio: Update vfio_add_group_dev() API

2021-03-08 Thread Alex Williamson
Rather than an errno, return a pointer to the opaque vfio_device
to allow the bus driver to call into vfio-core without additional
lookups and references.  Note that bus drivers are still required
to use vfio_del_group_dev() to teardown the vfio_device.

Signed-off-by: Alex Williamson 
---
 Documentation/driver-api/vfio.rst|6 +++---
 drivers/vfio/fsl-mc/vfio_fsl_mc.c|6 --
 drivers/vfio/mdev/vfio_mdev.c|5 -
 drivers/vfio/pci/vfio_pci.c  |7 +--
 drivers/vfio/platform/vfio_platform_common.c |7 +--
 drivers/vfio/vfio.c  |   23 ++-
 include/linux/vfio.h |6 +++---
 7 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/Documentation/driver-api/vfio.rst 
b/Documentation/driver-api/vfio.rst
index f1a4d3c3ba0b..03e978eb8ec7 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -252,9 +252,9 @@ into VFIO core.  When devices are bound and unbound to the 
driver,
 the driver should call vfio_add_group_dev() and vfio_del_group_dev()
 respectively::
 
-   extern int vfio_add_group_dev(struct device *dev,
- const struct vfio_device_ops *ops,
- void *device_data);
+   extern struct vfio_device *vfio_add_group_dev(struct device *dev,
+   const struct vfio_device_ops *ops,
+   void *device_data);
 
extern void *vfio_del_group_dev(struct device *dev);
 
diff --git a/drivers/vfio/fsl-mc/vfio_fsl_mc.c 
b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
index f27e25112c40..a4c2d0b9cd51 100644
--- a/drivers/vfio/fsl-mc/vfio_fsl_mc.c
+++ b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
@@ -592,6 +592,7 @@ static int vfio_fsl_mc_probe(struct fsl_mc_device *mc_dev)
struct iommu_group *group;
struct vfio_fsl_mc_device *vdev;
struct device *dev = &mc_dev->dev;
+   struct vfio_device *device;
int ret;
 
group = vfio_iommu_group_get(dev);
@@ -608,8 +609,9 @@ static int vfio_fsl_mc_probe(struct fsl_mc_device *mc_dev)
 
vdev->mc_dev = mc_dev;
 
-   ret = vfio_add_group_dev(dev, &vfio_fsl_mc_ops, vdev);
-   if (ret) {
+   device = vfio_add_group_dev(dev, &vfio_fsl_mc_ops, vdev);
+   if (IS_ERR(device)) {
+   ret = PTR_ERR(device);
dev_err(dev, "VFIO_FSL_MC: Failed to add to vfio group\n");
goto out_group_put;
}
diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
index b52eea128549..ebae3871b155 100644
--- a/drivers/vfio/mdev/vfio_mdev.c
+++ b/drivers/vfio/mdev/vfio_mdev.c
@@ -124,8 +124,11 @@ static const struct vfio_device_ops vfio_mdev_dev_ops = {
 static int vfio_mdev_probe(struct device *dev)
 {
struct mdev_device *mdev = to_mdev_device(dev);
+   struct vfio_device *device;
 
-   return vfio_add_group_dev(dev, &vfio_mdev_dev_ops, mdev);
+   device = vfio_add_group_dev(dev, &vfio_mdev_dev_ops, mdev);
+
+   return PTR_ERR_OR_ZERO(device);
 }
 
 static void vfio_mdev_remove(struct device *dev)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b44578..f0a1d05f0137 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1926,6 +1926,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
 {
struct vfio_pci_device *vdev;
struct iommu_group *group;
+   struct vfio_device *device;
int ret;
 
if (vfio_pci_is_denylisted(pdev))
@@ -1968,9 +1969,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
INIT_LIST_HEAD(&vdev->vma_list);
init_rwsem(&vdev->memory_lock);

-   ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
-   if (ret)
+   device = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+   if (IS_ERR(device)) {
+   ret = PTR_ERR(device);
goto out_free;
+   }
 
ret = vfio_pci_reflck_attach(vdev);
if (ret)
diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index fb4b385191f2..ff41fe0b758e 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -657,6 +657,7 @@ int vfio_platform_probe_common(struct vfio_platform_device 
*vdev,
   struct device *dev)
 {
struct iommu_group *group;
+   struct vfio_device *device;
int ret;
 
if (!vdev)
@@ -685,9 +686,11 @@ int vfio_platform_probe_common(struct vfio_platform_device 
*vdev,
goto put_reset;
}
 
-   ret = vfio_add_group_dev(dev, &vfio_platform_ops, vdev);
-   if (ret)
+   device = vfio_add_group_dev(dev, &vfio_platform_ops, vdev);
+   if (IS_ERR(device)) {
+   ret = PTR_ERR(device);

[PATCH v1 01/14] vfio: Create vfio_fs_type with inode per device

2021-03-08 Thread Alex Williamson
By linking all the device fds we provide to userspace to an
address space through a new pseudo fs, we can use tools like
unmap_mapping_range() to zap all vmas associated with a device.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio.c |   54 +++
 1 file changed, 54 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 38779e6fd80c..abdf8d52a911 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -32,11 +32,18 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #define DRIVER_VERSION "0.3"
 #define DRIVER_AUTHOR  "Alex Williamson "
 #define DRIVER_DESC"VFIO - User Level meta-driver"
 
+#define VFIO_MAGIC 0x5646494f /* "VFIO" */
+
+static int vfio_fs_cnt;
+static struct vfsmount *vfio_fs_mnt;
+
 static struct vfio {
struct class        *class;
struct list_head    iommu_drivers_list;
@@ -97,6 +104,7 @@ struct vfio_device {
struct vfio_group   *group;
struct list_head    group_next;
void                *device_data;
+   struct inode        *inode;
 };
 
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -529,6 +537,34 @@ static struct vfio_group *vfio_group_get_from_dev(struct 
device *dev)
return group;
 }
 
+static int vfio_fs_init_fs_context(struct fs_context *fc)
+{
+   return init_pseudo(fc, VFIO_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type vfio_fs_type = {
+   .name = "vfio",
+   .owner = THIS_MODULE,
+   .init_fs_context = vfio_fs_init_fs_context,
+   .kill_sb = kill_anon_super,
+};
+
+static struct inode *vfio_fs_inode_new(void)
+{
+   struct inode *inode;
+   int ret;
+
+   ret = simple_pin_fs(&vfio_fs_type, &vfio_fs_mnt, &vfio_fs_cnt);
+   if (ret)
+   return ERR_PTR(ret);
+
+   inode = alloc_anon_inode(vfio_fs_mnt->mnt_sb);
+   if (IS_ERR(inode))
+   simple_release_fs(&vfio_fs_mnt, &vfio_fs_cnt);
+
+   return inode;
+}
+
 /**
  * Device objects - create, release, get, put, search
  */
@@ -539,11 +575,19 @@ struct vfio_device *vfio_group_create_device(struct 
vfio_group *group,
 void *device_data)
 {
struct vfio_device *device;
+   struct inode *inode;
 
device = kzalloc(sizeof(*device), GFP_KERNEL);
if (!device)
return ERR_PTR(-ENOMEM);
 
+   inode = vfio_fs_inode_new();
+   if (IS_ERR(inode)) {
+   kfree(device);
+   return ERR_CAST(inode);
+   }
+   device->inode = inode;
+
kref_init(&device->kref);
device->dev = dev;
device->group = group;
@@ -574,6 +618,9 @@ static void vfio_device_release(struct kref *kref)
 
dev_set_drvdata(device->dev, NULL);
 
+   iput(device->inode);
+   simple_release_fs(&vfio_fs_mnt, &vfio_fs_cnt);
+
kfree(device);
 
/* vfio_del_group_dev may be waiting for this device */
@@ -1488,6 +1535,13 @@ static int vfio_group_get_device_fd(struct vfio_group 
*group, char *buf)
 */
filep->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
 
+   /*
+* Use the pseudo fs inode on the device to link all mmaps
+* to the same address space, allowing us to unmap all vmas
+* associated to this device using unmap_mapping_range().
+*/
+   filep->f_mapping = device->inode->i_mapping;
+
atomic_inc(&group->container_users);
 
fd_install(ret, filep);



Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant

2021-03-08 Thread Alex Williamson
On Mon, 8 Mar 2021 19:11:26 +0800
Zeng Tao  wrote:

> We have met the following error when test with DPDK testpmd:
> [ 1591.733256] kernel BUG at mm/memory.c:2177!
> [ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio 
> pv680_mii(O)
> [ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ 
> #1
> [ 1591.770735] Hardware name:  , BIOS HiFPGA 1P B600 V121-1
> [ 1591.778872] pstate: 4049 (nZcv daif +PAN -UAO -TCO BTYPE=--)
> [ 1591.786134] pc : remap_pfn_range+0x214/0x340
> [ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
> [ 1591.799117] sp : 80001068bbd0
> [ 1591.803476] x29: 80001068bbd0 x28: 042eff6f
> [ 1591.810404] x27: 00110091 x26: 00130091
> [ 1591.817457] x25: 00680fd3 x24: a92f1338e358
> [ 1591.825144] x23: 00114000 x22: 0041
> [ 1591.832506] x21: 00130091 x20: a92f141a4000
> [ 1591.839520] x19: 001100a0 x18: 
> [ 1591.846108] x17:  x16: a92f11844540
> [ 1591.853570] x15:  x14: 
> [ 1591.860768] x13: fc00 x12: 0880
> [ 1591.868053] x11: 0821bf3d01d0 x10: 5ef2abd89000
> [ 1591.875932] x9 : a92f12ab0064 x8 : a92f136471c0
> [ 1591.883208] x7 : 00114091 x6 : 0002
> [ 1591.890177] x5 : 0001 x4 : 0001
> [ 1591.896656] x3 :  x2 : 016804400fd3
> [ 1591.903215] x1 : 082126261880 x0 : fc2084989868
> [ 1591.910234] Call trace:
> [ 1591.914837]  remap_pfn_range+0x214/0x340
> [ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
> [ 1591.931200]  __do_fault+0x44/0x12c
> [ 1591.937031]  handle_mm_fault+0xcc8/0x1230
> [ 1591.942475]  do_page_fault+0x16c/0x484
> [ 1591.948635]  do_translation_fault+0xbc/0xd8
> [ 1591.954171]  do_mem_abort+0x4c/0xc0
> [ 1591.960316]  el0_da+0x40/0x80
> [ 1591.965585]  el0_sync_handler+0x168/0x1b0
> [ 1591.971608]  el0_sync+0x174/0x180
> [ 1591.978312] Code: eb1b027f 54c0 f9400022 b4fffe02 (d421)
> 
> The cause is that the vfio_pci_mmap_fault function is not reentrant, if
> multiple threads access the same address which will lead to a page fault
> at the same time, we will have the above error.
> 
> Fix the issue by making the vfio_pci_mmap_fault reentrant, and there is
> another issue that when the io_remap_pfn_range fails, we need to undo
> the __vfio_pci_add_vma, fix it by moving the __vfio_pci_add_vma down
> after the io_remap_pfn_range.
> 
> Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
> Signed-off-by: Zeng Tao 
> ---
>  drivers/vfio/pci/vfio_pci.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b..6928c37 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1613,6 +1613,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault 
> *vmf)
>   struct vm_area_struct *vma = vmf->vma;
>   struct vfio_pci_device *vdev = vma->vm_private_data;
>   vm_fault_t ret = VM_FAULT_NOPAGE;
> + unsigned long pfn;
>  
>   mutex_lock(&vdev->vma_lock);
>   down_read(&vdev->memory_lock);
> @@ -1623,18 +1624,23 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault 
> *vmf)
>   goto up_out;
>   }
>  
> - if (__vfio_pci_add_vma(vdev, vma)) {
> - ret = VM_FAULT_OOM;
> + if (!follow_pfn(vma, vma->vm_start, &pfn)) {
>   mutex_unlock(&vdev->vma_lock);
>   goto up_out;
>   }
>  
> - mutex_unlock(&vdev->vma_lock);


If I understand correctly, I think you're using (perhaps slightly
abusing) the vma_lock to extend the serialization of the vma_list
manipulation to include io_remap_pfn_range() such that you can test
whether the pte has already been populated using follow_pfn().  In that
case we return VM_FAULT_NOPAGE without trying to repopulate the page
and therefore avoid the BUG_ON in remap_pte_range() triggered by trying
to overwrite an existing pte, and less importantly, a duplicate vma in
our list.  I wonder if use of follow_pfn() is still strongly
discouraged for this use case.
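
Roughly, the ordering being described would look like the following (a
sketch reconstructed from the truncated patch above, not the exact code;
the placement of __vfio_pci_add_vma() after the remap is part of the fix):

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_device *vdev = vma->vm_private_data;
	vm_fault_t ret = VM_FAULT_NOPAGE;
	unsigned long pfn;

	mutex_lock(&vdev->vma_lock);
	down_read(&vdev->memory_lock);

	if (!__vfio_pci_memory_enabled(vdev)) {
		ret = VM_FAULT_SIGBUS;
		goto unlock_out;
	}

	/* Another thread already populated this vma, nothing to do */
	if (!follow_pfn(vma, vma->vm_start, &pfn))
		goto unlock_out;

	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot))
		ret = VM_FAULT_SIGBUS;
	else if (__vfio_pci_add_vma(vdev, vma))
		ret = VM_FAULT_OOM;

unlock_out:
	mutex_unlock(&vdev->vma_lock);
	up_read(&vdev->memory_lock);
	return ret;
}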

I'm surprised that it's left to the fault handler to provide this
serialization; is this because we're filling the entire vma rather than
only the faulting page?

As we move to unmap_mapping_range()[1] we remove all of the complexity
of managing a list of vmas to zap based on whether device memory is
enabled, including the vma_lock.  Are we going to need to replace that
with another lock here, or is there a better approach to handling
concurrency of this fault handler?  Jason/Peter?  Thanks,

Alex

[1]https://lore.kernel.org/kvm/161401267316.16443.11184767955094847849.st...@gimli.home/

>  
>   if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> -vma->vm_end - 

Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup

2021-03-04 Thread Alex Williamson
On Thu, 4 Mar 2021 19:16:33 -0400
Jason Gunthorpe  wrote:

> On Thu, Mar 04, 2021 at 02:37:57PM -0700, Alex Williamson wrote:
> 
> > Therefore unless a bus driver opts-out by replacing vm_private_data, we
> > can identify participating vmas by the vm_ops and have flags indicating
> > if the vma maps device memory such that vfio_get_device_from_vma()
> > should produce a device reference.  The vfio IOMMU backends would also
> > consume this, ie. if they get a valid vfio_device from the vma, use the
> > pfn_base field directly.  vfio_vm_ops would wrap the bus driver
> > callbacks and provide reference counting on open/close to release this
> > object.  
> 
> > I'm not thrilled with a vfio_device_ops callback plumbed through
> > vfio-core to do vma-to-pfn translation, so I thought this might be a
> > better alternative.  Thanks,  
> 
> Maybe you could explain why, because I'm looking at this idea and
> thinking it looks very complicated compared to a simple driver op
> callback?

vfio-core needs to export a vfio_vma_to_pfn() which I think assumes the
caller has already used vfio_device_get_from_vma(), but should still
validate the vma is one from a vfio device before calling this new
vfio_device_ops callback.  vfio-pci needs to validate the vm_pgoff
value falls within a BAR region, mask off the index and get the
pci_resource_start() for the BAR index.
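
As a concrete illustration of that validation and translation, a vfio-pci
callback could look roughly like this (a sketch only, not the code in the
series; it assumes the device fd's private_data is the vfio_device, per
patch 05/14):

static int vfio_pci_bar_vma_to_pfn(struct vm_area_struct *vma,
				   unsigned long *pfn)
{
	struct vfio_device *device = vma->vm_file->private_data;
	struct vfio_pci_device *vdev = vfio_device_data(device);
	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
	unsigned long pgoff = vma->vm_pgoff &
		((1UL << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);

	/* Only BAR regions map device memory */
	if (index >= VFIO_PCI_ROM_REGION_INDEX ||
	    !pci_resource_start(vdev->pdev, index))
		return -EINVAL;

	*pfn = (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
	return 0;
}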

Then we need a solution for how vfio_device_get_from_vma() determines
whether to grant a device reference for a given vma, where that vma may
map something other than device memory.  Are you imagining that we hand
out device references independently and vfio_vma_to_pfn() would return
an errno for vm_pgoff values that don't map device memory and the IOMMU
driver would release the reference?

It all seems rather ad-hoc.
 
> The implementation of such an op for vfio_pci is one line trivial, why
> do we need allocated memory and a entire shim layer instead? 
> 
> Shim layers are bad.

The only thing here I'd consider a shim layer is overriding vm_ops,
which just seemed like a cleaner and simpler solution than exporting
open/close functions and validating the bus driver installs them, and
the error path should they not.

> We still need a driver op of some kind because only the driver can
> convert a pg_off into a PFN. Remember the big point here is to remove
> the sketchy follow_pte()...

The bus driver simply writes the base_pfn value in the example
structure I outlined in its .mmap callback.  I'm just looking for an
alternative place to store our former vm_pgoff in a way that doesn't
prevent using unmap_mapping_range().  The IOMMU backend, once it has a
vfio_device via vfio_device_get_from_vma() can know the format of
vm_private_data, cast it as a vfio_vma_private_data and directly use
base_pfn, accomplishing the big point.  They're all operating in the
agreed upon vm_private_data format.  Thanks,

Alex



Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup

2021-03-04 Thread Alex Williamson
On Thu, 25 Feb 2021 19:49:49 -0400
Jason Gunthorpe  wrote:

> On Thu, Feb 25, 2021 at 03:21:13PM -0700, Alex Williamson wrote:
> 
> > This is where it gets tricky.  The vm_pgoff we get from
> > file_operations.mmap is already essentially describing an offset from
> > the base of a specific resource.  We could convert that from an absolute
> > offset to a pfn offset, but it's only the bus driver code (ex.
> > vfio-pci) that knows how to get the base, assuming there is a single
> > base per region (we can't assume enough bits per region to store
> > absolute pfn).  Also note that you're suggesting that all vfio mmaps
> > would need to standardize on the vfio-pci implementation of region
> > layouts.  Not that most drivers haven't copied vfio-pci, but we've
> > specifically avoided exposing it as a fixed uAPI such that we could have
> > the flexibility for a bus driver to implement regions offsets however
> > they need.  
> 
> Okay, well the bus driver owns the address space and the bus driver is
> in control of the vm_pgoff. If it doesn't want to zap then it doesn't
> need to do anything
> 
> vfio-pci can consistently use the index encoding and be fine
>  
> > So I'm not really sure what this looks like.  Within vfio-pci we could
> > keep the index bits in place to allow unmmap_mapping_range() to
> > selectively zap matching vm_pgoffs but expanding that to a vfio
> > standard such that the IOMMU backend can also extract a pfn looks very
> > limiting, or ugly.  Thanks,  
> 
> Lets add a op to convert a vma into a PFN range. The map code will
> pass the vma to the op and get back a pfn (or failure).
> 
> pci will switch the vm_pgoff to an index, find the bar base and
> compute the pfn.
> 
> It is starting to look more and more like dma buf though

How terrible would it be if vfio-core used a shared vm_private_data
structure and inserted itself into the vm_ops call chain to reference
count that structure?  We already wrap the device file descriptor via
vfio_device_fops.mmap, so we could:

struct vfio_vma_private_data *priv;

priv = kzalloc(...

priv->device = device;
vma->vm_private_data = priv;

device->ops->mmap(device->device_data, vma);

if (vma->vm_private_data == priv) {
priv->vm_ops = vma->vm_ops;
vma->vm_ops = &vfio_vm_ops;
} else
kfree(priv);

Where:

struct vfio_vma_private_data {
struct vfio_device *device;
unsigned long pfn_base;
void *private_data; // maybe not needed
const struct vm_operations_struct *vm_ops;
struct kref kref;
unsigned int release_notification:1;
};

Therefore unless a bus driver opts-out by replacing vm_private_data, we
can identify participating vmas by the vm_ops and have flags indicating
if the vma maps device memory such that vfio_get_device_from_vma()
should produce a device reference.  The vfio IOMMU backends would also
consume this, ie. if they get a valid vfio_device from the vma, use the
pfn_base field directly.  vfio_vm_ops would wrap the bus driver
callbacks and provide reference counting on open/close to release this
object.
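
A minimal sketch of that wrapping (illustrative only, assuming the
vfio_vma_private_data layout above and a hypothetical
vfio_vma_private_release() kref release that drops the device reference
and frees the object):

static void vfio_vm_open(struct vm_area_struct *vma)
{
	struct vfio_vma_private_data *priv = vma->vm_private_data;

	kref_get(&priv->kref);
	if (priv->vm_ops && priv->vm_ops->open)
		priv->vm_ops->open(vma);
}

static void vfio_vm_close(struct vm_area_struct *vma)
{
	struct vfio_vma_private_data *priv = vma->vm_private_data;

	if (priv->vm_ops && priv->vm_ops->close)
		priv->vm_ops->close(vma);
	kref_put(&priv->kref, vfio_vma_private_release);
}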

I'm not thrilled with a vfio_device_ops callback plumbed through
vfio-core to do vma-to-pfn translation, so I thought this might be a
better alternative.  Thanks,

Alex



Re: RFC: sysfs node for Secondary PCI bus reset (PCIe Hot Reset)

2021-03-02 Thread Alex Williamson
On Mon, 1 Mar 2021 14:28:17 -0600
Bjorn Helgaas  wrote:

> [+cc Alex, reset expert]
> 
> On Mon, Mar 01, 2021 at 06:12:21PM +0100, Pali Rohár wrote:
> > Hello!
> > 
> > PCIe card can be reset via in-band Hot Reset signal which can be
> > triggered by PCIe bridge via Secondary Bus Reset bit in PCI config
> > space.
> > 
> > Kernel already exports sysfs node "reset" for triggering Functional
> > Reset of particular function of PCI device. But in some cases Functional
> > Reset is not enough and Hot Reset is required.
> > 
> > Following RFC patch exports sysfs node "reset_bus" for PCI bridges which
> > triggers Secondary Bus Reset and therefore for PCIe bridges it resets
> > connected PCIe card.
> > 
> > What do you think about it?
> > 
> > Currently there is userspace script which can trigger PCIe Hot Reset by
> > modifying PCI config space from userspace:
> > 
> > https://alexforencich.com/wiki/en/pcie/hot-reset-linux
> > 
> > But because kernel already provides way how to trigger Functional Reset
> > it could provide also way how to trigger PCIe Hot Reset.

What that script does and what this does, or what the existing reset
attribute does, are very different.  The script finds the upstream
bridge for a given device, removes the device (ignoring that more than
one device might be affected by the bus reset), uses setpci to trigger
a secondary bus reset, then rescans devices.  The below only triggers
the secondary bus reset, neither saving and restoring affected device
state like the existing function level reset attribute, nor removing
and rescanning as the script does.  It simply leaves an entire
hierarchy of PCI devices entirely un-programmed yet still has struct
pci_devs attached to them for untold future misery.

In fact, for the case of a single device affected by the bus reset, as
intended by the script, the existing reset attribute will already do
that if the device supports no other reset mechanism.  There's actually
a running LFX mentorship project that aims to allow the user to control
the type of reset performed by the existing reset attribute such that a
user could force the bus reset behavior over other reset methods.
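
For comparison, the in-kernel bus/slot reset path already handles saving,
disabling and restoring the affected devices; a sysfs handler that wanted
the script's semantics would presumably build on something like the
following rather than calling pci_bridge_secondary_bus_reset() directly
(a sketch, not a tested patch):

	/* inside a hypothetical reset_bus_store(), instead of a raw SBR */
	if (pci_probe_reset_bus(pdev->bus))
		return -ENOTTY;		/* bus reset not possible here */

	/* saves, disables, resets and restores devices on the bus/slot */
	result = pci_reset_bus(pdev);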

There might be some justification for an attribute that actually
implements the referenced script correctly, perhaps in kernel we could
avoid races with bus rescans, but simply triggering an SBR to quietly
de-program all downstream devices with no state restore or device
rescan is not it.  Any affected device would be unusable.  Was this
tested?  Thanks,

Alex

> > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > index 50fcb62d59b5..f5e11c589498 100644
> > --- a/drivers/pci/pci-sysfs.c
> > +++ b/drivers/pci/pci-sysfs.c
> > @@ -1321,6 +1321,30 @@ static ssize_t reset_store(struct device *dev, 
> > struct device_attribute *attr,
> >  
> >  static DEVICE_ATTR(reset, 0200, NULL, reset_store);
> >  
> > +static ssize_t reset_bus_store(struct device *dev, struct device_attribute 
> > *attr,
> > +  const char *buf, size_t count)
> > +{
> > +   struct pci_dev *pdev = to_pci_dev(dev);
> > +   unsigned long val;
> > +   ssize_t result = kstrtoul(buf, 0, &val);
> > +
> > +   if (result < 0)
> > +   return result;
> > +
> > +   if (val != 1)
> > +   return -EINVAL;
> > +
> > +   pm_runtime_get_sync(dev);
> > +   result = pci_bridge_secondary_bus_reset(pdev);
> > +   pm_runtime_put(dev);
> > +   if (result < 0)
> > +   return result;
> > +
> > +   return count;
> > +}
> > +
> > +static DEVICE_ATTR(reset_bus, 0200, NULL, reset_bus_store);
> > +
> >  static int pci_create_capabilities_sysfs(struct pci_dev *dev)
> >  {
> > int retval;
> > @@ -1332,8 +1356,15 @@ static int pci_create_capabilities_sysfs(struct 
> > pci_dev *dev)
> > if (retval)
> > goto error;
> > }
> > +   if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> > +   retval = device_create_file(&dev->dev, &dev_attr_reset_bus);
> > +   if (retval)
> > +   goto error_reset_bus;
> > +   }
> > return 0;
> >  
> > +error_reset_bus:
> > +   device_remove_file(&dev->dev, &dev_attr_reset);
> >  error:
> > pcie_vpd_remove_sysfs_dev_files(dev);
> > return retval;
> > @@ -1414,6 +1445,8 @@ static void pci_remove_capabilities_sysfs(struct 
> > pci_dev *dev)
> > device_remove_file(&dev->dev, &dev_attr_reset);
> > dev->reset_fn = 0;
> > }
> > +   if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
> > +   device_remove_file(&dev->dev, &dev_attr_reset_bus);
> >  }
> >  
> >  /**  
> 



Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup

2021-02-25 Thread Alex Williamson
On Wed, 24 Feb 2021 20:06:10 -0400
Jason Gunthorpe  wrote:

> On Wed, Feb 24, 2021 at 02:55:06PM -0700, Alex Williamson wrote:
> 
> > > The only use of the special ops would be if there are multiple types
> > > of mmap's going on, but for this narrow use case those would be safely
> > > distinguished by the vm_pgoff instead  
> > 
> > We potentially do have device specific regions which can support mmap,
> > for example the migration region.  We'll need to think about how we
> > could even know if portions of those regions map to a device.  We could
> > use the notifier to announce it and require the code supporting those
> > device specific regions manage it.  
> 
> So, the above basically says any VFIO VMA is allowed for VFIO to map
> to the IOMMU.
> 
> If there are places creating mmaps for VFIO that should not go to the
> IOMMU then they need to return NULL from this function.
> 
> > I'm not really clear what you're getting at with vm_pgoff though, could
> > you explain further?  
> 
> Ah, so I have to take a side discussion to explain what I ment.
> 
> The vm_pgoff is a bit confused because we change it here in vfio_pci:
> 
> vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> 
> But the address_space invalidation assumes it still has the region
> based encoding:
> 
> + vfio_device_unmap_mapping_range(vdev->device,
> + VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
> + VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
> + VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));
> 
> Those three indexes are in the vm_pgoff numberspace and so vm_pgoff
> must always be set to the same thing - either the
> VFIO_PCI_INDEX_TO_OFFSET() coding or the physical pfn. 

Aha, I hadn't made that connection.

> Since you say we need a limited invalidation this looks like a bug to
> me - and it must always be the VFIO_PCI_INDEX_TO_OFFSET coding.

Yes, this must have only worked in testing because I mmap'd BAR0 which
is at index/offset zero, so the pfn range overlapped the user offset.
I'm glad you caught that...

> So, the PCI vma needs to get switched to use the
> VFIO_PCI_INDEX_TO_OFFSET coding and then we can always extract the
> region number from the vm_pgoff and thus access any additional data,
> such as the base pfn or a flag saying this cannot be mapped to the
> IOMMU. Do the reverse of VFIO_PCI_INDEX_TO_OFFSET and consult
> information attached to that region ID.
> 
> All places creating vfio mmaps have to set the vm_pgoff to
> VFIO_PCI_INDEX_TO_OFFSET().

This is where it gets tricky.  The vm_pgoff we get from
file_operations.mmap is already essentially describing an offset from
the base of a specific resource.  We could convert that from an absolute
offset to a pfn offset, but it's only the bus driver code (ex.
vfio-pci) that knows how to get the base, assuming there is a single
base per region (we can't assume enough bits per region to store
absolute pfn).  Also note that you're suggesting that all vfio mmaps
would need to standardize on the vfio-pci implementation of region
layouts.  Not that most drivers haven't copied vfio-pci, but we've
specifically avoided exposing it as a fixed uAPI such that we could have
the flexibility for a bus driver to implement regions offsets however
they need.
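
For reference, the vfio-pci encoding being discussed packs the region
index into the upper bits of the device fd offset space; roughly, from
vfio_pci_private.h at the time:

	#define VFIO_PCI_OFFSET_SHIFT		40
	#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
	#define VFIO_PCI_OFFSET_TO_INDEX(off)	((off) >> VFIO_PCI_OFFSET_SHIFT)
	#define VFIO_PCI_OFFSET_MASK		(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)

A vm_pgoff kept in this form lets unmap_mapping_range() target a single
region, but recovering an absolute pfn from it still requires the bus
driver to supply the region's base.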

So I'm not really sure what this looks like.  Within vfio-pci we could
keep the index bits in place to allow unmap_mapping_range() to
selectively zap matching vm_pgoffs but expanding that to a vfio
standard such that the IOMMU backend can also extract a pfn looks very
limiting, or ugly.  Thanks,

Alex

> But we have these violations that need fixing:
> 
> drivers/vfio/fsl-mc/vfio_fsl_mc.c:  vma->vm_pgoff = (region.addr >> 
> PAGE_SHIFT) + pgoff;
> drivers/vfio/platform/vfio_platform_common.c:   vma->vm_pgoff = (region.addr 
> >> PAGE_SHIFT) + pgoff;
> 
> Couldn't see any purpose to this code, cargo cult copy? Just delete
> it.
> 
> drivers/vfio/pci/vfio_pci.c:vma->vm_pgoff = (pci_resource_start(pdev, 
> index) >> PAGE_SHIFT) + pgoff;
> 
> Used to implement fault() but we could get the region number and
> extract the pfn from the vfio_pci_device's data easy enough.
> 
> I manually checked that other parts of VFIO not under drivers/vfio are
> doing it OK, looks fine.
> 
> Jason
> 



Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup

2021-02-24 Thread Alex Williamson
On Mon, 22 Feb 2021 13:29:13 -0400
Jason Gunthorpe  wrote:

> On Mon, Feb 22, 2021 at 09:51:25AM -0700, Alex Williamson wrote:
> 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index da212425ab30..399c42b77fbb 100644
> > +++ b/drivers/vfio/vfio.c
> > @@ -572,6 +572,15 @@ void vfio_device_unmap_mapping_range(struct 
> > vfio_device *device,
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_device_unmap_mapping_range);
> >  
> > +/*
> > + * A VFIO bus driver using this open callback will provide a
> > + * struct vfio_device pointer in the vm_private_data field.  
> 
> The vfio_device pointer should be stored in the struct file
> 
> > +struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
> > +{
> > +   struct vfio_device *device;
> > +
> > +   if (vma->vm_ops->open != vfio_device_vma_open)
> > +   return ERR_PTR(-ENODEV);
> > +  
> 
> Having looked at VFIO alot more closely last week, this is even more
> trivial - VFIO only creates mmaps of memory we want to invalidate, so
> this is just very simple:
> 
> struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
> {
>if (!vma->vm_file || vma->vm_file->f_op != &vfio_device_fops)
>  return ERR_PTR(-ENODEV);
>return vma->vm_file->f_private;
> }

That's pretty clever.

> The only use of the special ops would be if there are multiple types
> of mmap's going on, but for this narrow use case those would be safely
> distinguished by the vm_pgoff instead

We potentially do have device specific regions which can support mmap,
for example the migration region.  We'll need to think about how we
could even know if portions of those regions map to a device.  We could
use the notifier to announce it and require the code supporting those
device specific regions manage it.

I'm not really clear what you're getting at with vm_pgoff though, could
you explain further?

> > +extern void vfio_device_vma_open(struct vm_area_struct *vma);
> > +extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct 
> > *vma);  
> 
> No externs on function prototypes in new code please, we've been
> slowly deleting them..

For now it's consistent with what we have there already, I'd prefer a
separate cleanup of that before or after rather than introducing
inconsistency here.  Thanks,

Alex



Re: [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range()

2021-02-24 Thread Alex Williamson
On Mon, 22 Feb 2021 13:22:30 -0400
Jason Gunthorpe  wrote:

> On Mon, Feb 22, 2021 at 09:51:13AM -0700, Alex Williamson wrote:
> 
> > +   vfio_device_unmap_mapping_range(vdev->device,
> > +   VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
> > +   VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
> > +   VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));  
> 
> Isn't this the same as invalidating everything? I see in
> vfio_pci_mmap():
> 
>   if (index >= VFIO_PCI_ROM_REGION_INDEX)
>   return -EINVAL;

No, immediately above that is:

if (index >= VFIO_PCI_NUM_REGIONS) {
int regnum = index - VFIO_PCI_NUM_REGIONS;
struct vfio_pci_region *region = vdev->region + regnum;

if (region && region->ops && region->ops->mmap &&
(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
return region->ops->mmap(vdev, region, vma);
return -EINVAL;
}

We can have device specific regions that can support mmap, but those
regions aren't necessarily on-device memory so we can't assume they're
tied to the memory bit in the command register.
 
> > @@ -2273,15 +2112,13 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct 
> > pci_dev *pdev, void *data)
> >  
> > vdev = vfio_device_data(device);
> >  
> > -   /*
> > -* Locking multiple devices is prone to deadlock, runaway and
> > -* unwind if we hit contention.
> > -*/
> > -   if (!vfio_pci_zap_and_vma_lock(vdev, true)) {
> > +   if (!down_write_trylock(&vdev->memory_lock)) {
> > vfio_device_put(device);
> > return -EBUSY;
> > }  
> 
> And this is only done as part of VFIO_DEVICE_PCI_HOT_RESET?

Yes.

> It looks like VFIO_DEVICE_PCI_HOT_RESET effects the entire slot?

Yes.

> How about putting the inode on the reflck structure, which is also
> per-slot, and then a single unmap_mapping_range() will take care of
> everything, no need to iterate over things in the driver core.
>
> Note the vm->pg_off space doesn't have any special meaning, it is
> fine that two struct vfio_pci_device's are sharing the same address
> space and using an incompatible overlapping pg_offs

Ok, but how does this really help us, unless you're also proposing some
redesign of the memory_lock semaphore?  Even if we're zapping all the
affected devices for a bus reset that doesn't eliminate that we still
require device level granularity for other events.  Maybe there's some
layering of the inodes that you're implying that allows both, but it
still feels like a minor optimization if we need to traverse devices
for the memory_lock.

> > diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> > b/drivers/vfio/pci/vfio_pci_private.h
> > index 9cd1882a05af..ba37f4eeefd0 100644
> > +++ b/drivers/vfio/pci/vfio_pci_private.h
> > @@ -101,6 +101,7 @@ struct vfio_pci_mmap_vma {
> >  
> >  struct vfio_pci_device {
> > struct pci_dev  *pdev;
> > +   struct vfio_device  *device;  
> 
> Ah, I did this too, but I didn't use a pointer :)

vfio_device is embedded in vfio.c, so that worries me.

> All the places trying to call vfio_device_put() when they really want
> a vfio_pci_device * become simpler now. Eg struct vfio_devices wants
> to have an array of vfio_pci_device, and get_pf_vdev() only needs to
> return one pointer.

Sure, that example would be a good simplification.  I'm not sure see
other cases where we're going out of our way to manage the vfio_device
versus vfio_pci_device objects though.  Thanks,

Alex



Re: [RFC PATCH 10/10] vfio/type1: Register device notifier

2021-02-24 Thread Alex Williamson
On Mon, 22 Feb 2021 13:55:23 -0400
Jason Gunthorpe  wrote:

> On Mon, Feb 22, 2021 at 09:52:32AM -0700, Alex Williamson wrote:
> > Introduce a new default strict MMIO mapping mode where the vma for
> > a VM_PFNMAP mapping must be backed by a vfio device.  This allows
> > holding a reference to the device and registering a notifier for the
> > device, which additionally keeps the device in an IOMMU context for
> > the extent of the DMA mapping.  On notification of device release,
> > automatically drop the DMA mappings for it.
> > 
> > Signed-off-by: Alex Williamson 
> >  drivers/vfio/vfio_iommu_type1.c |  124 +++
> >  1 file changed, 123 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index b34ee4b96a4a..2a16257bd5b6 100644
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -61,6 +61,11 @@ module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
> >  MODULE_PARM_DESC(dma_entry_limit,
> >  "Maximum number of user DMA mappings per container (65535).");
> >  
> > +static bool strict_mmio_maps = true;
> > +module_param_named(strict_mmio_maps, strict_mmio_maps, bool, 0644);
> > +MODULE_PARM_DESC(strict_mmio_maps,
> > +"Restrict to safe DMA mappings of device memory (true).");  
> 
> I think this should be a kconfig, historically we've required kconfig
> to opt-in to unsafe things that could violate kernel security. Someone
> building a secure boot trusted kernel system should not have an
> options for userspace to just turn off protections.

It could certainly be further protected, such that this option might
not exist at all depending on a Kconfig, but I think we're already
risking breaking some existing users and I'd rather allow it as an
opt-in (like we already do for lack of interrupt isolation), possibly
even with a kernel taint if used, if necessary.
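
[Editor's note: for the sake of discussion, combining the two ideas could look something like the sketch below; CONFIG_VFIO_IOMMU_STRICT_MMIO and the helper are made-up names for illustration, not part of either proposal.]

#include <linux/kernel.h>
#include <linux/module.h>

/* Default comes from a (hypothetical) Kconfig symbol rather than being
 * hard-coded, and disabling the protection at runtime taints. */
static bool strict_mmio_maps = IS_ENABLED(CONFIG_VFIO_IOMMU_STRICT_MMIO);
module_param_named(strict_mmio_maps, strict_mmio_maps, bool, 0644);
MODULE_PARM_DESC(strict_mmio_maps,
		 "Restrict to safe DMA mappings of device memory (true).");

static bool vfio_strict_mmio(void)
{
	if (!strict_mmio_maps)
		/* Leave a visible trace that the admin opted out. */
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	return strict_mmio_maps;
}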

> > +/* Req separate object for async removal from notifier vs dropping vfio_dma */
> > +struct pfnmap_obj {
> > +   struct notifier_block   nb;
> > +   struct work_struct  work;
> > +   struct vfio_iommu   *iommu;
> > +   struct vfio_device  *device;
> > +};  
> 
> So this is basically the dmabuf, I think it would be simple enough to
> go in here and change it down the road if someone had interest.
> 
> > +static void unregister_device_bg(struct work_struct *work)
> > +{
> > +   struct pfnmap_obj *pfnmap = container_of(work, struct pfnmap_obj, work);
> > +
> > +   vfio_device_unregister_notifier(pfnmap->device, &pfnmap->nb);
> > +   vfio_device_put(pfnmap->device);  
> 
> The device_put keeps the device from becoming unregistered, but what
> happens during the hot reset case? Is this what the cover letter
> was talking about? CPU access is revoked but P2P is still possible?

Yes, this only addresses cutting off DMA mappings to a released device
where we can safely drop the DMA mapping entirely.  I think we can
consider complete removal of the mapping object safe in this case
because the user has no ongoing access to the device and after
re-opening the device it would be reasonable to expect mappings would
need to be re-established.

Doing the same around disabling device memory or a reset has a much
greater potential to break userspace.  In some of these cases QEMU will
do the right thing, but reset with dependent devices gets into
scenarios that I can't be sure about.  Other userspace drivers also
exist and I can't verify them all.  So ideally we'd want to temporarily
remove the IOMMU mapping, leaving the mapping object, and restoring it
on the other side.  But I don't think we have a way to do that in the
IOMMU API currently, other than unmap and later remap.  So we might
need to support a zombie mode for the mapping object or enhance the
IOMMU API where we could leave the iotlb entry in place but drop r+w
permissions.

At this point we're also just trying to define which error we get,
perhaps an unsupported request if we do nothing or an IOMMU fault if we
invalidate or suspend the mapping.  It's not guaranteed that one of
these has better behavior on the system than the other.
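
[Editor's note: to make the "unmap and later remap" fallback concrete, a simplified sketch using only the existing IOMMU API follows. Exact signatures vary between kernel versions, the helpers are hypothetical, and a real version would walk the pinned pages of a vfio_dma rather than assume one contiguous physical range.]

#include <linux/iommu.h>

static void vfio_dma_suspend(struct iommu_domain *domain, struct vfio_dma *dma)
{
	/* Drop the IOTLB entries but keep the vfio_dma object alive. */
	iommu_unmap(domain, dma->iova, dma->size);
}

static int vfio_dma_resume(struct iommu_domain *domain, struct vfio_dma *dma,
			   phys_addr_t phys)
{
	/* Re-establish the translation once the device is usable again. */
	return iommu_map(domain, dma->iova, phys, dma->size, dma->prot);
}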
 
> > +static int vfio_device_nb_cb(struct notifier_block *nb,
> > +unsigned long action, void *unused)
> > +{
> > +   struct pfnmap_obj *pfnmap = container_of(nb, struct pfnmap_obj, nb);
> > +
> > +   switch (action) {
> > +   case VFIO_DEVICE_RELEASE:
> > +   {
> > +   struct vfio_dma *dma, *dma_last = NULL;
> > +   int retries = 0;
> > +again:
> > +   mutex_lock(&pfnmap->iommu->lock);
> > +   dma = pf

[GIT PULL] VFIO updates for v5.12-rc1

2021-02-24 Thread Alex Williamson
Hi Linus,

The following changes since commit 3e10585335b7967326ca7b4118cada0d2d00a2ab:

  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm 
(2021-02-21 13:31:43 -0800)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.12-rc1

for you to fetch changes up to 4d83de6da265cd84e74c19d876055fa5f261cde4:

  vfio/type1: Batch page pinning (2021-02-22 16:30:47 -0700)


VFIO updates for v5.12-rc1

 - Virtual address update handling (Steve Sistare)

 - s390/zpci fixes and cleanups (Max Gurtovoy)

 - Fixes for dirty bitmap handling, non-mdev page pinning,
   and improved pinned dirty scope tracking (Keqian Zhu)

 - Batched page pinning enhancement (Daniel Jordan)

 - Page access permission fix (Alex Williamson)


Alex Williamson (3):
  Merge branch 'v5.12/vfio/next-vaddr' into v5.12/vfio/next
  Merge commit '3e10585335b7967326ca7b4118cada0d2d00a2ab' into 
v5.12/vfio/next
  vfio/type1: Use follow_pte()

Daniel Jordan (3):
  vfio/type1: Change success value of vaddr_get_pfn()
  vfio/type1: Prepare for batched pinning with struct vfio_batch
  vfio/type1: Batch page pinning

Heiner Kallweit (1):
  vfio/pci: Fix handling of pci use accessor return codes

Keqian Zhu (3):
  vfio/iommu_type1: Populate full dirty when detach non-pinned group
  vfio/iommu_type1: Fix some sanity checks in detach group
  vfio/iommu_type1: Mantain a counter for non_pinned_groups

Max Gurtovoy (3):
  vfio-pci/zdev: remove unused vdev argument
  vfio-pci/zdev: fix possible segmentation fault issue
  vfio/pci: remove CONFIG_VFIO_PCI_ZDEV from Kconfig

Steve Sistare (9):
  vfio: option to unmap all
  vfio/type1: unmap cleanup
  vfio/type1: implement unmap all
  vfio: interfaces to update vaddr
  vfio/type1: massage unmap iteration
  vfio/type1: implement interfaces to update vaddr
  vfio: iommu driver notify callback
  vfio/type1: implement notify callback
  vfio/type1: block on invalid vaddr

Tian Tao (1):
  vfio/iommu_type1: Fix duplicate included kthread.h

 drivers/vfio/pci/Kconfig|  12 -
 drivers/vfio/pci/Makefile   |   2 +-
 drivers/vfio/pci/vfio_pci.c |  12 +-
 drivers/vfio/pci/vfio_pci_igd.c |  10 +-
 drivers/vfio/pci/vfio_pci_private.h |   2 +-
 drivers/vfio/pci/vfio_pci_zdev.c|  24 +-
 drivers/vfio/vfio.c |   5 +
 drivers/vfio/vfio_iommu_type1.c | 564 ++--
 include/linux/vfio.h|   7 +
 include/uapi/linux/vfio.h   |  27 ++
 10 files changed, 475 insertions(+), 190 deletions(-)



Re: [PATCH v2 0/3] vfio/type1: Batch page pinning

2021-02-23 Thread Alex Williamson
On Fri, 19 Feb 2021 11:13:02 -0500
Daniel Jordan  wrote:

> v2:
>  - Fixed missing error unwind in patch 3 (Alex).  After more thought,
>the ENODEV case is fine, so it stayed the same.
> 
>  - Rebased on linux-vfio.git/next (no conflicts).
> 
> ---
> 
> The VFIO type1 driver is calling pin_user_pages_remote() once per 4k page, so
> let's do it once per 512 4k pages to bring VFIO in line with other
> drivers such as IB and vDPA.
> 
> qemu guests with at least 2G memory start about 8% faster on a Xeon server,
> with more detailed results in the last changelog.
> 
> Thanks to Matthew, who first suggested the idea to me.
> 
> Daniel
> 
> 
> Test Cases
> --
> 
>  1) qemu passthrough with IOMMU-capable PCI device
> 
>  2) standalone program to hit
> vfio_pin_map_dma() -> vfio_pin_pages_remote()
> 
>  3) standalone program to hit
> vfio_iommu_replay() -> vfio_pin_pages_remote()
> 
> Each was run...
> 
>  - with varying sizes
>  - with/without disable_hugepages=1
>  - with/without LOCKED_VM exceeded
> 
> I didn't test vfio_pin_page_external() because there was no readily available
> hardware, but the changes there are pretty minimal.
> 
> Daniel Jordan (3):
>   vfio/type1: Change success value of vaddr_get_pfn()
>   vfio/type1: Prepare for batched pinning with struct vfio_batch
>   vfio/type1: Batch page pinning
> 
>  drivers/vfio/vfio_iommu_type1.c | 215 +++-
>  1 file changed, 155 insertions(+), 60 deletions(-)
> 
> base-commit: 76adb20f924f8d27ed50d02cd29cadedb59fd88f

Applied to vfio next branch for v5.12.  Thanks,

Alex
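
[Editor's note: as an illustration of the batching idea described in the cover letter above, not the actual code of the series; the pin_user_pages_remote() call follows the v5.12-era signature, which differs in newer kernels, and the macro/helper names are made up.]

#include <linux/mm.h>

#define VFIO_BATCH_CAPACITY	(PAGE_SIZE / sizeof(struct page *))	/* 512 with 4k pages */

/* Pin up to a batch worth of pages per call instead of making one
 * pin_user_pages_remote() call per 4k page. */
static long pin_batch(struct mm_struct *mm, unsigned long vaddr,
		      long npages, struct page **pages)
{
	long to_pin = min_t(long, npages, VFIO_BATCH_CAPACITY);

	return pin_user_pages_remote(mm, vaddr, to_pin,
				     FOLL_WRITE | FOLL_LONGTERM,
				     pages, NULL, NULL);
}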


