Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread David Gibson
On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote:
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> > 
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> > 
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> > 
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1 discussion
> >   and then decide how to incorporate it in v2.
> 
> I think the request was to include a start/end IO address hint when
> creating the ios. When the kernel creates it then it can return the
> actual geometry including any holes via a query.

So part of the point of specifying start/end addresses is that
explicitly querying holes shouldn't be necessary: if the requested
range crosses a hole, it should fail.  If you didn't really need all
that range, you shouldn't have asked for it.

Which means these aren't really "hints" but optionally supplied
constraints.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread David Gibson
On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, September 22, 2021 10:09 PM
> > 
> > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > >
> > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > > allocating an IOASID, userspace is expected to specify the type and
> > > > > format information for the target I/O page table.
> > > > >
> > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > > semantics. For this type the user should specify the addr_width of
> > > > > the I/O address space and whether the I/O page table is created in
> > > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > > as the false setting requires additional contract with KVM on handling
> > > > > WBINVD emulation, which can be added later.
> > > > >
> > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > > for what formats can be specified when allocating an IOASID.
> > > > >
> > > > > Open:
> > > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > > >   is a way to claim a specific iova range from a system-wide address space.
> > > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > > >   and then decide how to incorporate it in v2.
> > > >
> > > > I think the request was to include a start/end IO address hint when
> > > > creating the ios. When the kernel creates it then it can return the
> > >
> > > is the hint single-range or could be multiple-ranges?
> > 
> > David explained it here:
> > 
> > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> > 
> > qemu needs to be able to choose if it gets the 32 bit range or 64
> > bit range.
> > 
> > So a 'range hint' will do the job
> > 
> > David also suggested this:
> > 
> > https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/
> > 
> > So I like this better:
> > 
> > struct iommu_ioasid_alloc {
> > __u32   argsz;
> > 
> > __u32   flags;
> > #define IOMMU_IOASID_ENFORCE_SNOOP  (1 << 0)
> > #define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1)
> > 
> > __aligned_u64 max_iova_hint;
> > __aligned_u64 base_iova_hint; // Used only if IOMMU_IOASID_HINT_BASE_IOVA
> > 
> > // For creating nested page tables
> > __u32 parent_ios_id;
> > __u32 format;
> > #define IOMMU_FORMAT_KERNEL 0
> > #define IOMMU_FORMAT_PPC_XXX 2
> > #define IOMMU_FORMAT_[..]
> > __u32 format_flags; // Layout depends on format above
> > 
> > __aligned_u64 user_page_directory;  // Used if parent_ios_id != 0
> > };
> > 
> > Again 'type' as an overall API indicator should not exist, feature
> > flags need to have clear narrow meanings.
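
For illustration, a minimal userspace sketch of driving this proposed
allocation. The struct tracks the proposal quoted above; the
IOMMU_IOASID_ALLOC request code and the ioctl semantics are assumptions
for the sketch, not a settled API:

	/* Hypothetical sketch: the request code and struct layout follow
	 * the proposal quoted above, not any merged kernel uAPI. */
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/ioctl.h>

	struct iommu_ioasid_alloc {
		uint32_t argsz;
		uint32_t flags;
	#define IOMMU_IOASID_ENFORCE_SNOOP  (1 << 0)
	#define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1)
		uint64_t max_iova_hint;
		uint64_t base_iova_hint;	/* used only with HINT_BASE_IOVA */
		uint32_t parent_ios_id;		/* 0 = no nesting */
		uint32_t format;
	#define IOMMU_FORMAT_KERNEL 0
		uint32_t format_flags;
		uint64_t user_page_directory;	/* used if parent_ios_id != 0 */
	};

	#define IOMMU_IOASID_ALLOC _IO('I', 0)	/* placeholder request code */

	/* Ask for a kernel-managed, snooping IOAS covering the low 1G. */
	static int alloc_kernel_ioasid(int iommufd)
	{
		struct iommu_ioasid_alloc cmd = {
			.argsz = sizeof(cmd),
			.flags = IOMMU_IOASID_ENFORCE_SNOOP,
			.max_iova_hint = 1ULL << 30,
			.format = IOMMU_FORMAT_KERNEL,
		};

		return ioctl(iommufd, IOMMU_IOASID_ALLOC, &cmd);
	}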
> 
> currently the type is aimed to differentiate three usages:
> 
> - kernel-managed I/O page table
> - user-managed I/O page table
> - shared I/O page table (e.g. with mm, or ept)
> 
> we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> indicator? their difference is not about format.

To me "format" indicates how the IO translation information is
encoded.  We potentially have two different encodings: from userspace
to the kernel and from the kernel to the hardware.  But since this is
the userspace API, it's only the userspace to kernel one that matters
here.

In that sense, KERNEL, is a "format": we encode the translation
information as a series of IOMAP operations to the kernel, rather than
as an in-memory structure.
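
Concretely, with the KERNEL "format" each piece of translation
information reaches the kernel as an individual map operation; a
hypothetical, type1-style sketch of one such operation (the names and
layout are illustrative only, not part of the RFC):

	/* Hypothetical sketch: one userspace-to-kernel "encoding unit"
	 * for a kernel-managed IOAS.  The kernel turns each of these into
	 * entries in whatever hardware page-table format it chose.
	 * Requires <stdint.h> for the fixed-width types. */
	struct iommu_ioasid_map {
		uint32_t argsz;
		uint32_t ioasid;	/* target I/O address space */
		uint64_t iova;		/* I/O virtual address to map */
		uint64_t vaddr;		/* user virtual address backing it */
		uint64_t size;		/* length of the mapping in bytes */
		uint32_t perm;		/* read/write permission bits */
	};

A user-managed format would instead hand the kernel a page-directory
pointer (the user_page_directory field in the struct quoted earlier)
and encode the translations directly in memory.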

> > This does both of David's suggestions at once. If qemu wants the 1G
> > limited region it could specify max_iova_hint = 1G, if it wants the
> > extended 64bit region with the hole it can give either the high base or
> > a large max_iova_hint. format/format_flags allows a further
> 
> Dave's links didn't answer one puzzle from me. Does PPC need accurate
> range information, or is it ok with a large range including holes (then
> let the kernel figure out where the holes are located)?

I need more specifics to answer that.  Are you talking from a
userspace PoV, a guest kernel's or the host kernel's?  In general I
think requiring userspace to locate and work around holes is a bad
idea.  If userspace requests a range, it should get *all* of that
range.

The ppc case is further complicated because there are multiple ranges
and each range could have separate IO page tables.  In practice
non-kernel managed IO pagetables are likely to be hard on ppc (or a

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread David Gibson
On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > 
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > >   is a way to claim a specific iova range from a system-wide address space.
> > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > >   and then decide how to incorporate it in v2.
> > > 
> > > I think the request was to include a start/end IO address hint when
> > > creating the ios. When the kernel creates it then it can return the
> > 
> > is the hint single-range or could be multiple-ranges?
> 
> David explained it here:
> 
> https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/

Apparently not well enough.  I've attempted again in this thread.

> qemu needs to be able to choose if it gets the 32 bit range or 64
> bit range.

No. qemu needs to supply *both* the 32-bit and 64-bit range to its
guest, and therefore needs to request both from the host.

Or rather, it *might* need to supply both.  It will supply just the
32-bit range by default, but the guest can request the 64-bit range
and/or remove and resize the 32-bit range via hypercall interfaces.
Vaguely recent Linux guests certainly will request the 64-bit range in
addition to the default 32-bit range.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread David Gibson
On Thu, Sep 23, 2021 at 12:22:23PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, September 23, 2021 8:07 PM
> > 
> > On Thu, Sep 23, 2021 at 09:14:58AM +, Tian, Kevin wrote:
> > 
> > > currently the type is aimed to differentiate three usages:
> > >
> > > - kernel-managed I/O page table
> > > - user-managed I/O page table
> > > - shared I/O page table (e.g. with mm, or ept)
> > 
> > Creating a shared ios is something that should probably be a different
> > command.
> 
> why? I didn't understand the criteria here...
> 
> > 
> > > we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> > > indicator? their difference is not about format.
> > 
> > Format should be
> > 
> > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> 
> INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> 
> > 
> > > Dave's links didn't answer one puzzle from me. Does PPC need accurate
> > > range information, or is it ok with a large range including holes (then
> > > let the kernel figure out where the holes are located)?
> > 
> > My impression was it only needed a way to select between the two
> > different cases as they are exclusive. I'd see this API as being a
> > hint and userspace should query the exact ranges to learn what was
> > actually created.
> 
> yes, the user can query the permitted range using DEVICE_GET_INFO.
> But in the end if the user wants two separate regions, I'm afraid that 
> the underlying iommu driver wants to know the exact info. iirc PPC
> has one global system address space shared by all devices.

I think certain POWER models do this, yes: there's *protection*
between DMAs from different devices, but you can't translate the same
address to different places for different devices.  I *think* that's a
firmware/hypervisor convention rather than a hardware limitation, but
I'm not entirely sure.  We don't do things this way when emulating the
POWER vIOMMU in POWER, but PowerVM might, and we still have to deal
with that when running as a PowerVM guest.

> It is possible
> that the user may want to claim range-A and range-C, with range-B
> in-between but claimed by another user. Then simply using one hint
> range [A-lowend, C-highend] might not work.
> 
> > 
> > > > device-specific escape if more specific customization is needed and is
> > > > needed to specify user space page tables anyhow.
> > >
> > > and I didn't understand the 2nd link. How does user-managed page
> > > table jump into this range claim problem? I'm getting confused...
> > 
> > PPC could also model it using a FORMAT_KERNEL_PPC_X,
> > FORMAT_KERNEL_PPC_Y
> > though it is less nice..
> 
> yes PPC can use a different format, but I didn't understand why it is
> related to user-managed page tables, which further require nesting. sounds
> like disconnected topics here...
> 
> > 
> > > > Yes, ioas_id should always be the xarray index.
> > > >
> > > > PASID needs to be called out as PASID or as a generic "hw description"
> > > > blob.
> > >
> > > ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?
> > 
> > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> > and it MUST be exposed in that format to be programmed into the PCI
> > device itself.
> 
> In the entire discussion in previous design RFC, I kept an impression that
> ARM-equivalent PASID is called SSID. If we can use PASID as a general
> term in iommufd context, definitely it's much better!
> 
> > 
> > All of this should be able to support a userspace, like DPDK, creating
> > a PASID on its own without any special VFIO drivers.
> > 
> > - Open iommufd
> > - Attach the vfio device FD
> > - Request a PASID device id
> > - Create an ios against the pasid device id
> > - Query the ios for the PCI PASID #
> > - Program the HW to issue TLPs with the PASID
> 
> this all makes me very confused, and is completely different from what
> we agreed on in the previous v2 design proposal:
> 
> - open iommufd
> - create an ioas
> - attach vfio device to ioasid, with vPASID info
>   * vfio converts vPASID to pPASID and then calls
>     iommufd_device_attach_ioasid()
>   * the latter then installs the ioas to the IOMMU with RID/PASID
> 
> > 
> > > and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
> > > kernel. Do we want to clear this confusion? Or possibly it's fine because
> > > ioas_id is never used outside of iommufd and iommufd doesn't directly
> > > call ioasid_alloc() from ioasid.c?
> > 
> > As long as it is ioas_id and ioasid it is probably fine..
> 
> let's align with others in a few hours.
> 
> > 
> > > > kvm's API to program the vPASID translation table should probably take
> > > > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> > > > information using an in-kernel API. Userspace shouldn't have to
> > > > shuttle it around.
> > >
> > > the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI.
> > > when kvm calls iommufd with above tuple, vPASID->pPASID is
> > > returned to kvm. So we stil

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread David Gibson
On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, September 22, 2021 1:45 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > >
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > >
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > >
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > >   Per previous discussion they can also use vfio type1v2 as long as there
> > >   is a way to claim a specific iova range from a system-wide address space.
> > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > >   and then decide how to incorporate it in v2.
> > 
> > I think the request was to include a start/end IO address hint when
> > creating the ios. When the kernel creates it then it can return the
> 
> is the hint single-range or could be multiple-ranges?
> 
> > actual geometry including any holes via a query.
> 
> I'd like to see a detailed flow from David on how the uAPI works today with
> the existing spapr driver and what exact changes he'd like to make on this
> proposed interface. The above info is still insufficient for us to think
> about the right solution.
> 
> > 
> > > - Currently ioasid term has already been used in the kernel (drivers/iommu/
> > >   ioasid.c) to represent the hardware I/O address space ID in the wire. It
> > >   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
> > >   ID). We need to find a way to resolve the naming conflict between the
> > >   hardware ID and software handle. One option is to rename the existing
> > >   ioasid to be pasid or ssid, given their full names still sound generic.
> > >   Appreciate more thoughts on this open!
> > 
> > ioas works well here I think. Use ioas_id to refer to the xarray
> > index.
> 
> What about when introducing pasid to this uAPI? Then use ioas_id
> for the xarray index and ioasid to represent pasid/ssid?

This is probably obsoleted by Jason's other comments, but definitely
don't use "ioas_id" and "ioasid" to mean different things.  Having
meaningfully different things distinguished only by an underscore is
not a good idea.

> At this point
> the software handle and hardware id are mixed together, thus we need
> clear terminology to differentiate them.
> 
> 
> Thanks
> Kevin
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-30 Thread David Gibson
On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
> 
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enforce_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
> 
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
> 
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
>   Per previous discussion they can also use vfio type1v2 as long as there
>   is a way to claim a specific iova range from a system-wide address space.
>   This requirement doesn't sound PPC specific, as addr_width for pci devices
>   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
>   adopted this design yet. We hope to have formal alignment in v1 discussion
>   and then decide how to incorporate it in v2.

Ok, there are several things we need for ppc.  None of which are
inherently ppc specific and some of which will I think be useful for
most platforms.  So, starting from most general to most specific
here's basically what's needed:

1. We need to represent the fact that the IOMMU can only translate
   *some* IOVAs, not a full 64-bit range.  You have the addr_width
   already, but I'm not entirely sure the translatable range on ppc
   (or other platforms) is always a power-of-2 size.  It usually will
   be, of course, but I'm not sure that's a hard requirement.  So
   using a size/max rather than just a number of bits might be safer.

   I think basically every platform will need this.  Most platforms
   don't actually implement full 64-bit translation in any case, but
   rather some smaller number of bits that fits their page table
   format.

2. The translatable range of IOVAs may not begin at 0.  So we need to
   advertise to userspace what the base address is, as well as the
   size.  POWER's main IOVA range begins at 2^59 (at least on the
   models I know about).

   I think a number of platforms are likely to want this, though I
   couldn't name them apart from POWER.  Putting the translated IOVA
   window at some huge address is a pretty obvious approach to making
   an IOMMU which can translate a wide address range without colliding
   with any legacy PCI addresses down low (the IOMMU can check if this
   transaction is for it by just looking at some high bits in the
   address).

3. There might be multiple translatable ranges.  So, on POWER the
   IOMMU can typically translate IOVAs from 0..2GiB, and also from
   2^59..2^59+.  The two ranges have completely separate IO
   page tables, with (usually) different layouts.  (The low range will
   nearly always be a single-level page table with 4kiB or 64kiB
   entries, the high one will be multiple levels depending on the size
   of the range and pagesize).

   This may be less common, but I suspect POWER won't be the only
   platform to do something like this.  As above, using a high range
   is a pretty obvious approach, but clearly won't handle older
   devices which can't do 64-bit DMA.  So adding a smaller range for
   those devices is again a pretty obvious solution.  Any platform
   with an "IO hole" can be treated as having two ranges, one below
   the hole and one above it (although in that case they may well not
   have separate page tables).

4. The translatable ranges might not be fixed.  On ppc that 0..2GiB
   and 2^59..whatever ranges are kernel conventions, not specified by
   the hardware or firmware.  When running as a guest (which is the
   normal case on POWER), there are explicit hypercalls for
   configuring the allowed IOVA windows (along with pagesize, number
   of levels etc.).  At the moment it is fixed in hardware that there
   are only 2 windows, one starting at 0 and one at 2^59 but there's
   no inherent reason those couldn't also be configurable.

   This will probably be rarer, but I wouldn't be surprised if it
   appears on another platform.  If you were designing an IOMMU ASIC
   for use in a variety of platforms, making the base address and size
   of the translatable range(s) configurable in registers would make
   sense.
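
To make (1)-(3) concrete, here is a hypothetical sketch of how a single
translatable window could be described to userspace; the names and the
pgsize_bitmap field are illustrative assumptions, not part of the RFC:

	/* Hypothetical illustration only (needs <linux/types.h> for
	 * __aligned_u64).  A base/size pair covers points (1) and (2):
	 * the window need not start at 0 and need not be a power-of-2
	 * size.  Point (3) then becomes "an array of these", one element
	 * per window, e.g. 0..2GiB and the high 2^59 window on POWER. */
	struct iova_window {
		__aligned_u64 base;		/* lowest translatable IOVA */
		__aligned_u64 size;		/* window length in bytes */
		__aligned_u64 pgsize_bitmap;	/* supported IO page sizes */
	};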


Now, for (3) and (4), representing lists of windows explicitly in
ioctl()s is likely to be pretty ugly.  We might be able to avoid that,
for at least some of the interfaces, by using the nested IOAS stuff.
One way or another, though, the IOASes which are actually attached to
devices need to represent both windows.

e.g.
Create a "top-level" IOAS

Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry

2021-09-30 Thread Christoph Hellwig
On Tue, Sep 14, 2021 at 03:45:06PM +, Ioana Ciornei wrote:
> [  245.927020] fsl_dpaa2_eth dpni.3: scather-gather idx 0 P=20a732 N=20a7320 D=20a732 L=30 DMA_BIDIRECTIONAL dma map error check not applicable
> [  245.927048] fsl_dpaa2_eth dpni.3: scather-gather idx 1 P=20a7320030 N=20a7320 D=20a7320030 L=5a8 DMA_BIDIRECTIONAL dma map error check not applicable
> [  245.927062] DMA-API: cacheline tracking EEXIST, overlapping mappings aren't supported
> [  245.927062] DMA-API: cacheline tracking EEXIST, overlapping mappings 
> aren't supported
> 
> The first line is the dump of the dma_debug_entry which is already present
> in the radix tree and the second one is the entry which just triggered
> the EEXIST.
> 
> As we can see, they are not actually overlapping, at least from my
> understanding. The first one starts at 0x20a732 with a size 0x30
> and the second one at 0x20a7320030.

They overlap the cache lines.  Which means if you use this driver
on a system that is not dma coherent you will corrupt data.
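
To see why: with 64-byte cache lines, a mapping that ends at offset 0x30
and a mapping that starts at offset 0x30 both touch the line covering
offsets 0x00-0x3f.  A minimal standalone sketch of that check (the
0x20a7320000 base is inferred from the truncated log above, and the
64-byte line size is an assumption):

	#include <stdint.h>
	#include <stdio.h>

	#define CACHELINE_SIZE 64	/* assumed line size for the sketch */

	/* Two mappings overlap a cache line iff they share a line index. */
	static int lines_overlap(uint64_t a, uint64_t a_len,
				 uint64_t b, uint64_t b_len)
	{
		uint64_t a_first = a / CACHELINE_SIZE;
		uint64_t a_last = (a + a_len - 1) / CACHELINE_SIZE;
		uint64_t b_first = b / CACHELINE_SIZE;
		uint64_t b_last = (b + b_len - 1) / CACHELINE_SIZE;

		return a_first <= b_last && b_first <= a_last;
	}

	int main(void)
	{
		/* Addresses inferred from the log: 0x30 bytes at ...000,
		 * then 0x5a8 bytes at ...030 -- same 0x00-0x3f line. */
		printf("%d\n", lines_overlap(0x20a7320000ULL, 0x30,
					     0x20a7320030ULL, 0x5a8));
		return 0;	/* prints 1: the mappings share a line */
	}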


Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

2021-09-30 Thread David Gibson
On Thu, Sep 30, 2021 at 07:28:18PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 30, 2021 at 01:09:22PM +1000, David Gibson wrote:
> 
> > > The *admin* is the one responsible to understand the groups, not the
> > > applications. The admin has no idea what a group FD is - they should
> > > be looking at the sysfs and seeing the iommu_group directories.
> > 
> > Not just the admin.  If an app is given two devices in the same group
> > to use *both* it must understand that and act accordingly.
> 
> Yes, but this is true regardless of what the uAPI is,

Yes, but formerly it was explicit and now it is implicit.  Before we
said "attach this group to this container", which can reasonably be
expected to affect the whole group.  Now we say "attach this device to
this IOAS" and it silently also affects other devices.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-30 Thread Christoph Hellwig
On Thu, Sep 30, 2021 at 07:23:55PM -0300, Jason Gunthorpe wrote:
> > > The Intel functional issue is that Intel blocks the cache maintenance
> > > ops from the VM and the VM has no way to self-discover that the cache
> > > maintenance ops don't work.
> > 
> > the VM doesn't need to know whether the maintenance ops
> > actually work.
> 
> Which is the whole problem.
> 
> Intel has a design where the device driver tells the device to issue
> non-cachable TLPs.
> 
> The driver is supposed to know if it can issue the cache maintenance
> instructions - if it can then it should ask the device to issue
> no-snoop TLPs.

The driver should never issue them.  This whole idea that a driver
can just magically poke the cache directly is just one of these horrible
shortcuts that seem to happen in GPU land all the time but nowhere
else.

> > coherency and indirectly the underlying I/O page table format. 
> > If yes, then I don't see a reason why such decision should not be 
> > given to userspace for passthrough case.
> 
> The choice all comes down to if the other arches have cache
> maintenance instructions in the VM that *don't work*

Or have them at all.


Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-30 Thread Christoph Hellwig
On Thu, Sep 30, 2021 at 09:43:58PM +0800, Lu Baolu wrote:
> Here, we are discussing arch_sync_dma_for_cpu() and
> arch_sync_dma_for_device(). The x86 arch has clflush to sync dma buffer
> for device, but I can't see any instruction to sync dma buffer for cpu
> if the device is not cache coherent. Is that the reason why x86 can't
> have an implementation for arch_sync_dma_for_cpu(), hence all devices
> are marked coherent?

arch_sync_dma_for_cpu and arch_sync_dma_for_device are only used if
the device is marked non-coherent, that is if Linux knows the device
can't be part of the cache coherency protocol.  There are no known
x86 systems with entirely non-cache-coherent devices, so these helpers
won't be useful as-is.
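
For context, these arch hooks back the streaming DMA API on
non-coherent systems.  A minimal sketch of the driver-side pattern they
support (the rx_complete function and its arguments are illustrative):

	#include <linux/dma-mapping.h>

	/* On a non-coherent device, dma_sync_single_for_cpu() ends up in
	 * arch_sync_dma_for_cpu(); on x86, where every device is treated
	 * as coherent, both sync calls are effectively no-ops. */
	static void rx_complete(struct device *dev, void *buf,
				dma_addr_t handle, size_t len)
	{
		/* Make data the device wrote visible to the CPU. */
		dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

		/* ... CPU parses buf here ... */

		/* Hand the buffer back to the device for the next receive. */
		dma_sync_single_for_device(dev, handle, len, DMA_FROM_DEVICE);
	}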


Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

2021-09-30 Thread Jason Gunthorpe via iommu
On Thu, Sep 30, 2021 at 01:09:22PM +1000, David Gibson wrote:

> > The *admin* is the one responsible to understand the groups, not the
> > applications. The admin has no idea what a group FD is - they should
> > be looking at the sysfs and seeing the iommu_group directories.
> 
> Not just the admin.  If an app is given two devices in the same group
> to use *both* it must understand that and act accordingly.

Yes, but this is true regardless of what the uAPI is, and for common
app cases where we have a single IO Page table for all devices the app
still doesn't need to care about groups since it can just assign all
devices to the same IO page table and everything works out just fine.

For instance qemu without a vIOMMU does not need to care about
groups. It opens a single iommufd, creates a single IO page table that
maps the guest physical space and assigns every device to that IO page
table. No issue.

Only if qemu is creating a vIOMMU does it need to start to look at the
groups and ensure that the group becomes visible to the guest OS. Here
the group fd doesn't really help anything

Jason




Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-30 Thread Jason Gunthorpe via iommu
On Thu, Sep 30, 2021 at 09:35:45AM +, Tian, Kevin wrote:

> > The Intel functional issue is that Intel blocks the cache maintenance
> > ops from the VM and the VM has no way to self-discover that the cache
> > maintenance ops don't work.
> 
> the VM doesn't need to know whether the maintenance ops
> actually work.

Which is the whole problem.

Intel has a design where the device driver tells the device to issue
non-cachable TLPs.

The driver is supposed to know if it can issue the cache maintenance
instructions - if it can then it should ask the device to issue
no-snoop TLPs.

For instance the same PCI driver on non-x86 should never ask the
device to issue no-snoop TLPs because it has no idea how to restore
cache coherence on eg ARM.

Do you see the issue? This configuration where the hypervisor silently
makes wbsync a NOP breaks the x86 architecture because the guest has no
idea it can no longer use no-snoop features.

Using the IOMMU to forcibly prevent the device from issuing no-snoop
makes this whole issue of the broken wbsync moot.

It is important to be really clear on what this is about - this is not
some idealized nice iommu feature - it is working around a lot of
backwards compatibility baggage that is probably completely unique to
x86.

> > Other arches don't seem to have this specific problem...
> 
> I think the key is whether other archs allow driver to decide DMA
> coherency and indirectly the underlying I/O page table format. 
> If yes, then I don't see a reason why such decision should not be 
> given to userspace for passthrough case.

The choice all comes down to if the other arches have cache
maintenance instructions in the VM that *don't work*

Jason


Re: [PATCH v2 0/6] memblock: cleanup memblock_free interface

2021-09-30 Thread Mike Rapoport
On Thu, Sep 30, 2021 at 02:20:33PM -0700, Linus Torvalds wrote:
> On Thu, Sep 30, 2021 at 11:50 AM Mike Rapoport  wrote:
> >
> > The first patch is a cleanup of numa_distance allocation in arch_numa I've
> > spotted during the conversion.
> > The second patch is a fix for Xen memory freeing on some of the error
> > paths.
> 
> Well, at least patch 2 looks like something that should go into 5.15
> and be marked for stable.
> 
> Patch 1 looks like a trivial local cleanup, and could go in
> immediately. Patch 4 might be in that same category.
> 
> The rest look like "next merge window" to me, since they are spread
> out and neither bugfixes nor tiny localized cleanups (iow renaming
> functions, global resulting search-and-replace things).
> 
> So my gut feel is that two (maybe three) of these patches should go in
> asap, with three (maybe four) be left for 5.16.
> 
> IOW, not treat this as a single series.
> 
> Hmm?

Yes, why not :)
I'd keep patch 4 for the next merge window; it does not look urgent to me.

Andrew, can you please take care of this, or would you prefer me to resend
everything separately?
>  Linus

-- 
Sincerely yours,
Mike.


Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-30 Thread Jason Gunthorpe via iommu
On Thu, Sep 30, 2021 at 08:49:03AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, September 23, 2021 8:22 PM
> > 
> > > > These are different things and need different bits. Since the ARM path
> > > > has a lot more code supporting it, I'd suggest Intel should change
> > > > their code to use IOMMU_BLOCK_NO_SNOOP and abandon
> > IOMMU_CACHE.
> > >
> > > I didn't fully get this point. The end result is same, i.e. making the DMA
> > > cache-coherent when IOMMU_CACHE is set. Or if you help define the
> > > behavior of IOMMU_CACHE, what will you define now?
> > 
> > It is clearly specifying how the kernel API works:
> > 
> >  !IOMMU_CACHE -
> >    must call arch cache flushers
> >  IOMMU_CACHE -
> >    do not call arch cache flushers
> >  IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP -
> >    do not call arch cache flushers, and ignore the no-snoop bit.
> 
> Who will set IOMMU_BLOCK_NO_SNOOP?

Basically only qemu due to specialized x86 hypervisor knowledge.

The only purpose of this attribute is to support a specific
virtualization use case where a whole bunch of stuff is broken
together:
 - the cache maintenance instructions are not available to a guest
 - the guest isn't aware that the instructions don't work and tells
   the device to issue no-snoop TLPs
 - The device ignores the 'disable no-snoop' flag in the PCIe config
   space

Thus things become broken.

Jason


Re: [PATCH v2 0/6] memblock: cleanup memblock_free interface

2021-09-30 Thread Linus Torvalds
On Thu, Sep 30, 2021 at 11:50 AM Mike Rapoport  wrote:
>
> The first patch is a cleanup of numa_distance allocation in arch_numa I've
> spotted during the conversion.
> The second patch is a fix for Xen memory freeing on some of the error
> paths.

Well, at least patch 2 looks like something that should go into 5.15
and be marked for stable.

Patch 1 looks like a trivial local cleanup, and could go in
immediately. Patch 4 might be in that same category.

The rest look like "next merge window" to me, since they are spread
out and neither bugfixes nor tiny localized cleanups (iow renaming
functions, global resulting search-and-replace things).

So my gut feel is that two (maybe three) of these patches should go in
asap, with three (maybe four) be left for 5.16.

IOW, not treat this as a single series.

Hmm?

 Linus


[PATCH v2 6/6] memblock: use memblock_free for freeing virtual pointers

2021-09-30 Thread Mike Rapoport
From: Mike Rapoport 

Rename memblock_free_ptr() to memblock_free() and use memblock_free()
when freeing a virtual pointer, so that memblock_free() will be a
counterpart of memblock_alloc().

The callers are updated with the below semantic patch and manual addition
of (void *) casting to pointers that are represented by unsigned long
variables.

@@
identifier vaddr;
expression size;
@@
(
- memblock_phys_free(__pa(vaddr), size);
+ memblock_free(vaddr, size);
|
- memblock_free_ptr(vaddr, size);
+ memblock_free(vaddr, size);
)

Signed-off-by: Mike Rapoport 
---
 arch/alpha/kernel/core_irongate.c | 3 +--
 arch/mips/mm/init.c   | 2 +-
 arch/powerpc/kernel/dt_cpu_ftrs.c | 4 ++--
 arch/powerpc/kernel/setup-common.c| 2 +-
 arch/powerpc/kernel/setup_64.c| 2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
 arch/powerpc/platforms/pseries/svm.c  | 3 +--
 arch/riscv/kernel/setup.c | 5 ++---
 arch/sparc/kernel/smp_64.c| 2 +-
 arch/um/kernel/mem.c  | 2 +-
 arch/x86/kernel/setup_percpu.c| 2 +-
 arch/x86/mm/kasan_init_64.c   | 4 ++--
 arch/x86/mm/numa.c| 2 +-
 arch/x86/mm/numa_emulation.c  | 2 +-
 arch/x86/xen/mmu_pv.c | 2 +-
 arch/x86/xen/p2m.c| 2 +-
 drivers/base/arch_numa.c  | 4 ++--
 drivers/macintosh/smu.c   | 2 +-
 drivers/xen/swiotlb-xen.c | 2 +-
 include/linux/memblock.h  | 2 +-
 init/initramfs.c  | 2 +-
 init/main.c   | 2 +-
 kernel/dma/swiotlb.c  | 2 +-
 kernel/printk/printk.c| 4 ++--
 lib/bootconfig.c  | 2 +-
 lib/cpumask.c | 2 +-
 mm/memblock.c | 6 +++---
 mm/percpu.c   | 8 
 mm/sparse.c   | 2 +-
 29 files changed, 39 insertions(+), 42 deletions(-)

diff --git a/arch/alpha/kernel/core_irongate.c 
b/arch/alpha/kernel/core_irongate.c
index ee26dcc49418..6b8ed12936b6 100644
--- a/arch/alpha/kernel/core_irongate.c
+++ b/arch/alpha/kernel/core_irongate.c
@@ -233,8 +233,7 @@ albacore_init_arch(void)
unsigned long size;
 
size = initrd_end - initrd_start;
-   memblock_phys_free(__pa(initrd_start),
-  PAGE_ALIGN(size));
+   memblock_free((void *)initrd_start, PAGE_ALIGN(size));
if (!move_initrd(pci_mem))
printk("irongate_init_arch: initrd too big "
   "(%ldK)\ndisabling initrd\n",
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 3be1c29084fa..325e1552cbea 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -529,7 +529,7 @@ static void * __init pcpu_fc_alloc(unsigned int cpu, size_t 
size,
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
 {
-   memblock_phys_free(__pa(ptr), size);
+   memblock_free(ptr, size);
 }
 
 void __init setup_per_cpu_areas(void)
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 42839d6bd486..ba527fb52993 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -1095,8 +1095,8 @@ static int __init dt_cpu_ftrs_scan_callback(unsigned long 
node, const char
 
cpufeatures_setup_finished();
 
-   memblock_phys_free(__pa(dt_cpu_features),
-  sizeof(struct dt_cpu_feature) * nr_dt_cpu_features);
+   memblock_free(dt_cpu_features,
+ sizeof(struct dt_cpu_feature) * nr_dt_cpu_features);
 
return 0;
 }
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 5af8993a8e6d..6b1338db8779 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -825,7 +825,7 @@ static void __init smp_setup_pacas(void)
set_hard_smp_processor_id(cpu, cpu_to_phys_id[cpu]);
}
 
-   memblock_phys_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
+   memblock_free(cpu_to_phys_id, nr_cpu_ids * sizeof(u32));
cpu_to_phys_id = NULL;
 }
 #endif
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 75bc294ac40d..1777e992b20b 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -812,7 +812,7 @@ static void * __init pcpu_alloc_bootmem(unsigned int cpu, 
size_t size,
 
 static void __init pcpu_free_bootmem(void *ptr, size_t size)
 {
-   memblock_phys_free(__pa(ptr), size);
+   memblock_free(ptr, size);
 }
 
 static int pcpu_cpu_distance(unsigned int from, unsigned int to)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci

[PATCH v2 5/6] memblock: rename memblock_free to memblock_phys_free

2021-09-30 Thread Mike Rapoport
From: Mike Rapoport 

Since memblock_free() operates on a physical range, make its name reflect
it and rename it to memblock_phys_free(), so it will be a logical
counterpart to memblock_phys_alloc().

The callers are updated with the below semantic patch:

@@
expression addr;
expression size;
@@
- memblock_free(addr, size);
+ memblock_phys_free(addr, size);

Signed-off-by: Mike Rapoport 
---
 arch/alpha/kernel/core_irongate.c |  3 ++-
 arch/arc/mm/init.c|  2 +-
 arch/arm/mach-hisi/platmcpm.c |  2 +-
 arch/arm/mm/init.c|  2 +-
 arch/arm64/mm/mmu.c   |  4 ++--
 arch/mips/mm/init.c   |  2 +-
 arch/mips/sgi-ip30/ip30-setup.c   |  6 +++---
 arch/powerpc/kernel/dt_cpu_ftrs.c |  4 ++--
 arch/powerpc/kernel/paca.c|  8 
 arch/powerpc/kernel/setup-common.c|  2 +-
 arch/powerpc/kernel/setup_64.c|  2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  2 +-
 arch/powerpc/platforms/pseries/svm.c  |  3 ++-
 arch/riscv/kernel/setup.c |  5 +++--
 arch/s390/kernel/setup.c  |  8 
 arch/s390/kernel/smp.c|  4 ++--
 arch/s390/kernel/uv.c |  2 +-
 arch/s390/mm/kasan_init.c |  2 +-
 arch/sh/boards/mach-ap325rxa/setup.c  |  2 +-
 arch/sh/boards/mach-ecovec24/setup.c  |  4 ++--
 arch/sh/boards/mach-kfr2r09/setup.c   |  2 +-
 arch/sh/boards/mach-migor/setup.c |  2 +-
 arch/sh/boards/mach-se/7724/setup.c   |  4 ++--
 arch/sparc/kernel/smp_64.c|  2 +-
 arch/um/kernel/mem.c  |  2 +-
 arch/x86/kernel/setup.c   |  4 ++--
 arch/x86/mm/init.c|  2 +-
 arch/x86/xen/mmu_pv.c |  6 +++---
 arch/x86/xen/setup.c  |  6 +++---
 drivers/base/arch_numa.c  |  2 +-
 drivers/firmware/efi/memmap.c |  2 +-
 drivers/of/kexec.c|  3 +--
 drivers/of/of_reserved_mem.c  |  5 +++--
 drivers/s390/char/sclp_early.c|  2 +-
 drivers/usb/early/xhci-dbc.c  | 10 +-
 drivers/xen/swiotlb-xen.c |  2 +-
 include/linux/memblock.h  |  2 +-
 init/initramfs.c  |  2 +-
 kernel/dma/swiotlb.c  |  2 +-
 lib/cpumask.c |  2 +-
 mm/cma.c  |  2 +-
 mm/memblock.c |  8 
 mm/memory_hotplug.c   |  2 +-
 mm/percpu.c   |  8 
 mm/sparse.c   |  2 +-
 45 files changed, 79 insertions(+), 76 deletions(-)

diff --git a/arch/alpha/kernel/core_irongate.c 
b/arch/alpha/kernel/core_irongate.c
index 72af1e72d833..ee26dcc49418 100644
--- a/arch/alpha/kernel/core_irongate.c
+++ b/arch/alpha/kernel/core_irongate.c
@@ -233,7 +233,8 @@ albacore_init_arch(void)
unsigned long size;
 
size = initrd_end - initrd_start;
-   memblock_free(__pa(initrd_start), PAGE_ALIGN(size));
+   memblock_phys_free(__pa(initrd_start),
+  PAGE_ALIGN(size));
if (!move_initrd(pci_mem))
printk("irongate_init_arch: initrd too big "
   "(%ldK)\ndisabling initrd\n",
diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index 699ecf119641..59408f6a02d4 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -173,7 +173,7 @@ static void __init highmem_init(void)
 #ifdef CONFIG_HIGHMEM
unsigned long tmp;
 
-   memblock_free(high_mem_start, high_mem_sz);
+   memblock_phys_free(high_mem_start, high_mem_sz);
for (tmp = min_high_pfn; tmp < max_high_pfn; tmp++)
free_highmem_page(pfn_to_page(tmp));
 #endif
diff --git a/arch/arm/mach-hisi/platmcpm.c b/arch/arm/mach-hisi/platmcpm.c
index 96a484095194..258586e31333 100644
--- a/arch/arm/mach-hisi/platmcpm.c
+++ b/arch/arm/mach-hisi/platmcpm.c
@@ -339,7 +339,7 @@ static int __init hip04_smp_init(void)
 err_sysctrl:
iounmap(relocation);
 err_reloc:
-   memblock_free(hip04_boot_method[0], hip04_boot_method[1]);
+   memblock_phys_free(hip04_boot_method[0], hip04_boot_method[1]);
 err:
return ret;
 }
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 6162a070a410..6d0cb0f7bc54 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -158,7 +158,7 @@ phys_addr_t __init arm_memblock_steal(phys_addr_t size, 
phys_addr_t align)
panic("Failed to steal %pa bytes at %pS\n",
  &size, (void *)_RET_IP_);
 
-   memblock_free(phys, size);
+   memblock_phys_free(phys, size);
memblock_remove(phys, size);
 
return phys;
di

[PATCH v2 4/6] memblock: stop aliasing __memblock_free_late with memblock_free_late

2021-09-30 Thread Mike Rapoport
From: Mike Rapoport 

memblock_free_late() is a NOP wrapper for __memblock_free_late(), there is
no point to keep this indirection.

Drop the wrapper and rename __memblock_free_late() to memblock_free_late().

Signed-off-by: Mike Rapoport 
---
 include/linux/memblock.h | 7 +--
 mm/memblock.c| 8 
 2 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index fc8183be340c..e25f964fdd60 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -133,7 +133,7 @@ void __next_mem_range_rev(u64 *idx, int nid, enum 
memblock_flags flags,
  struct memblock_type *type_b, phys_addr_t *out_start,
  phys_addr_t *out_end, int *out_nid);
 
-void __memblock_free_late(phys_addr_t base, phys_addr_t size);
+void memblock_free_late(phys_addr_t base, phys_addr_t size);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
 static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
@@ -441,11 +441,6 @@ static inline void *memblock_alloc_node(phys_addr_t size,
  MEMBLOCK_ALLOC_ACCESSIBLE, nid);
 }
 
-static inline void memblock_free_late(phys_addr_t base, phys_addr_t size)
-{
-   __memblock_free_late(base, size);
-}
-
 /*
  * Set the allocation direction to bottom-up or top-down.
  */
diff --git a/mm/memblock.c b/mm/memblock.c
index 184dcd2e5d99..603f4a02be9b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -366,14 +366,14 @@ void __init memblock_discard(void)
addr = __pa(memblock.reserved.regions);
size = PAGE_ALIGN(sizeof(struct memblock_region) *
  memblock.reserved.max);
-   __memblock_free_late(addr, size);
+   memblock_free_late(addr, size);
}
 
if (memblock.memory.regions != memblock_memory_init_regions) {
addr = __pa(memblock.memory.regions);
size = PAGE_ALIGN(sizeof(struct memblock_region) *
  memblock.memory.max);
-   __memblock_free_late(addr, size);
+   memblock_free_late(addr, size);
}
 
memblock_memory = NULL;
@@ -1586,7 +1586,7 @@ void * __init memblock_alloc_try_nid(
 }
 
 /**
- * __memblock_free_late - free pages directly to buddy allocator
+ * memblock_free_late - free pages directly to buddy allocator
  * @base: phys starting address of the  boot memory block
  * @size: size of the boot memory block in bytes
  *
@@ -1594,7 +1594,7 @@ void * __init memblock_alloc_try_nid(
  * down, but we are still initializing the system.  Pages are released directly
  * to the buddy allocator.
  */
-void __init __memblock_free_late(phys_addr_t base, phys_addr_t size)
+void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
 {
phys_addr_t cursor, end;
 
-- 
2.28.0



[PATCH v2 3/6] memblock: drop memblock_free_early_nid() and memblock_free_early()

2021-09-30 Thread Mike Rapoport
From: Mike Rapoport 

memblock_free_early_nid() is unused and memblock_free_early() is an alias
for memblock_free().

Replace calls to memblock_free_early() with calls to memblock_free() and
remove memblock_free_early() and memblock_free_early_nid().

Signed-off-by: Mike Rapoport 
---
 arch/mips/mm/init.c  |  2 +-
 arch/powerpc/platforms/pseries/svm.c |  3 +--
 arch/s390/kernel/smp.c   |  2 +-
 drivers/base/arch_numa.c |  2 +-
 drivers/s390/char/sclp_early.c   |  2 +-
 include/linux/memblock.h | 12 
 kernel/dma/swiotlb.c |  2 +-
 lib/cpumask.c|  2 +-
 mm/percpu.c  |  8 
 mm/sparse.c  |  2 +-
 10 files changed, 12 insertions(+), 25 deletions(-)

diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 19347dc6bbf8..21a5a7ac0037 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -529,7 +529,7 @@ static void * __init pcpu_fc_alloc(unsigned int cpu, size_t 
size,
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
 {
-   memblock_free_early(__pa(ptr), size);
+   memblock_free(__pa(ptr), size);
 }
 
 void __init setup_per_cpu_areas(void)
diff --git a/arch/powerpc/platforms/pseries/svm.c 
b/arch/powerpc/platforms/pseries/svm.c
index 87f001b4c4e4..f12229ce7301 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -56,8 +56,7 @@ void __init svm_swiotlb_init(void)
return;
 
 
-   memblock_free_early(__pa(vstart),
-   PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
+   memblock_free(__pa(vstart), PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
panic("SVM: Cannot allocate SWIOTLB buffer");
 }
 
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 1a04e5bdf655..066efd6d9345 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -880,7 +880,7 @@ void __init smp_detect_cpus(void)
 
/* Add CPUs present at boot */
__smp_rescan_cpus(info, true);
-   memblock_free_early((unsigned long)info, sizeof(*info));
+   memblock_free((unsigned long)info, sizeof(*info));
 }
 
 /*
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index f6d0efd01188..e28d9dfe3c20 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -165,7 +165,7 @@ static void * __init pcpu_fc_alloc(unsigned int cpu, size_t 
size,
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
 {
-   memblock_free_early(__pa(ptr), size);
+   memblock_free(__pa(ptr), size);
 }
 
 void __init setup_per_cpu_areas(void)
diff --git a/drivers/s390/char/sclp_early.c b/drivers/s390/char/sclp_early.c
index f3d5c7f4c13d..f01d942e1c1d 100644
--- a/drivers/s390/char/sclp_early.c
+++ b/drivers/s390/char/sclp_early.c
@@ -139,7 +139,7 @@ int __init sclp_early_get_core_info(struct sclp_core_info 
*info)
}
sclp_fill_core_info(info, sccb);
 out:
-   memblock_free_early((unsigned long)sccb, length);
+   memblock_free((unsigned long)sccb, length);
return rc;
 }
 
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 34de69b3b8ba..fc8183be340c 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -441,18 +441,6 @@ static inline void *memblock_alloc_node(phys_addr_t size,
  MEMBLOCK_ALLOC_ACCESSIBLE, nid);
 }
 
-static inline void memblock_free_early(phys_addr_t base,
- phys_addr_t size)
-{
-   memblock_free(base, size);
-}
-
-static inline void memblock_free_early_nid(phys_addr_t base,
- phys_addr_t size, int nid)
-{
-   memblock_free(base, size);
-}
-
 static inline void memblock_free_late(phys_addr_t base, phys_addr_t size)
 {
__memblock_free_late(base, size);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 87c40517e822..430d2f78d540 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -247,7 +247,7 @@ swiotlb_init(int verbose)
return;
 
 fail_free_mem:
-   memblock_free_early(__pa(tlb), bytes);
+   memblock_free(__pa(tlb), bytes);
 fail:
pr_warn("Cannot allocate buffer");
 }
diff --git a/lib/cpumask.c b/lib/cpumask.c
index c3c76b833384..045779446a18 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -188,7 +188,7 @@ EXPORT_SYMBOL(free_cpumask_var);
  */
 void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 {
-   memblock_free_early(__pa(mask), cpumask_size());
+   memblock_free(__pa(mask), cpumask_size());
 }
 #endif
 
diff --git a/mm/percpu.c b/mm/percpu.c
index e0a986818903..f58318cb04c0 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2472,7 +2472,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int 
nr_groups,
  */
 void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)
 {
-   memblock_free_early(__pa(ai), ai->__ai_size);
+   mem

[PATCH v2 2/6] xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer

2021-09-30 Thread Mike Rapoport
From: Mike Rapoport 

free_p2m_page() wrongly passes a virtual pointer to memblock_free() that
treats it as a physical address.

Call memblock_free_ptr() instead that gets a virtual address to free the
memory.

Signed-off-by: Mike Rapoport 
Reviewed-by: Juergen Gross 
---
 arch/x86/xen/p2m.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 5e6e236977c7..141bb9dbd2fb 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -197,7 +197,7 @@ static void * __ref alloc_p2m_page(void)
 static void __ref free_p2m_page(void *p)
 {
if (unlikely(!slab_is_available())) {
-   memblock_free((unsigned long)p, PAGE_SIZE);
+   memblock_free_ptr(p, PAGE_SIZE);
return;
}
 
-- 
2.28.0



[PATCH v2 1/6] arch_numa: simplify numa_distance allocation

2021-09-30 Thread Mike Rapoport
From: Mike Rapoport 

Memory allocation of numa_distance uses memblock_phys_alloc_range() without
actual range limits, converts the returned physical address to virtual and
then only uses the virtual address for further initialization.

Simplify this by replacing memblock_phys_alloc_range() with
memblock_alloc().

Signed-off-by: Mike Rapoport 
---
 drivers/base/arch_numa.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 00fb4120a5b3..f6d0efd01188 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -275,15 +275,13 @@ void __init numa_free_distance(void)
 static int __init numa_alloc_distance(void)
 {
size_t size;
-   u64 phys;
int i, j;
 
size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]);
-   phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0, PFN_PHYS(max_pfn));
-   if (WARN_ON(!phys))
+   numa_distance = memblock_alloc(size, PAGE_SIZE);
+   if (WARN_ON(!numa_distance))
return -ENOMEM;
 
-   numa_distance = __va(phys);
numa_distance_cnt = nr_node_ids;
 
/* fill with the default distances */
-- 
2.28.0



[PATCH v2 0/6] memblock: cleanup memblock_free interface

2021-09-30 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

Following the discussion in [1], this is the fix for the memblock freeing
API mismatch.

The first patch is a cleanup of numa_distance allocation in arch_numa I've
spotted during the conversion.
The second patch is a fix for Xen memory freeing on some of the error
paths.

I agree with Christophe that doing step by step makes the thing easier to
review, so the patches 3-6 do the actual cleanup step by step.

This time I used stricter coccinelle scripts so that only straightforward
uses would get converted.

There are still a couple of (void *) castings for the cases when a virtual
address has unsigned long type rather than a pointer type, like e.g.
initrd_start.

Since scripts/get_maintainer.pl returned more than 100 addresses I've
trimmed the distribution list only to the relevant lists.

Juergen and Shahab, I didn't keep your Reviewed-by because the patches are
a bit different this time.

v2:
* split changes into several patches
* use stricter coccinelle scripts 

[1] 
https://lore.kernel.org/all/CAHk-=wj9k4LZTz+svCxLYs5Y1=+ykrbauarh1+ghyg3old8...@mail.gmail.com

Mike Rapoport (6):
  arch_numa: simplify numa_distance allocation
  xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer
  memblock: drop memblock_free_early_nid() and memblock_free_early()
  memblock: stop aliasing __memblock_free_late with memblock_free_late
  memblock: rename memblock_free to memblock_phys_free
  memblock: use memblock_free for freeing virtual pointers

 arch/alpha/kernel/core_irongate.c |  2 +-
 arch/arc/mm/init.c|  2 +-
 arch/arm/mach-hisi/platmcpm.c |  2 +-
 arch/arm/mm/init.c|  2 +-
 arch/arm64/mm/mmu.c   |  4 ++--
 arch/mips/mm/init.c   |  2 +-
 arch/mips/sgi-ip30/ip30-setup.c   |  6 +++---
 arch/powerpc/kernel/dt_cpu_ftrs.c |  4 ++--
 arch/powerpc/kernel/paca.c|  8 
 arch/powerpc/kernel/setup-common.c|  2 +-
 arch/powerpc/kernel/setup_64.c|  2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  2 +-
 arch/powerpc/platforms/pseries/svm.c  |  3 +--
 arch/riscv/kernel/setup.c |  4 ++--
 arch/s390/kernel/setup.c  |  8 
 arch/s390/kernel/smp.c|  4 ++--
 arch/s390/kernel/uv.c |  2 +-
 arch/s390/mm/kasan_init.c |  2 +-
 arch/sh/boards/mach-ap325rxa/setup.c  |  2 +-
 arch/sh/boards/mach-ecovec24/setup.c  |  4 ++--
 arch/sh/boards/mach-kfr2r09/setup.c   |  2 +-
 arch/sh/boards/mach-migor/setup.c |  2 +-
 arch/sh/boards/mach-se/7724/setup.c   |  4 ++--
 arch/sparc/kernel/smp_64.c|  2 +-
 arch/um/kernel/mem.c  |  2 +-
 arch/x86/kernel/setup.c   |  4 ++--
 arch/x86/kernel/setup_percpu.c|  2 +-
 arch/x86/mm/init.c|  2 +-
 arch/x86/mm/kasan_init_64.c   |  4 ++--
 arch/x86/mm/numa.c|  2 +-
 arch/x86/mm/numa_emulation.c  |  2 +-
 arch/x86/xen/mmu_pv.c |  6 +++---
 arch/x86/xen/p2m.c|  2 +-
 arch/x86/xen/setup.c  |  6 +++---
 drivers/base/arch_numa.c  | 10 --
 drivers/firmware/efi/memmap.c |  2 +-
 drivers/macintosh/smu.c   |  2 +-
 drivers/of/kexec.c|  3 +--
 drivers/of/of_reserved_mem.c  |  5 +++--
 drivers/s390/char/sclp_early.c|  2 +-
 drivers/usb/early/xhci-dbc.c  | 10 +-
 drivers/xen/swiotlb-xen.c |  2 +-
 include/linux/memblock.h  | 23 +++
 init/initramfs.c  |  2 +-
 init/main.c   |  2 +-
 kernel/dma/swiotlb.c  |  2 +-
 kernel/printk/printk.c|  4 ++--
 lib/bootconfig.c  |  2 +-
 lib/cpumask.c |  2 +-
 mm/cma.c  |  2 +-
 mm/memblock.c | 22 +++---
 mm/memory_hotplug.c   |  2 +-
 mm/percpu.c   |  8 
 mm/sparse.c   |  2 +-
 54 files changed, 99 insertions(+), 119 deletions(-)


base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29
-- 
2.28.0



Re: [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA

2021-09-30 Thread Chaitanya Kulkarni via iommu


>>
>> Is this new op only needed for the PCIe transport? Or do you have
>> follow-on patches that use this op for the other transports?
> 
> No, I don't think this will make sense for transports that are not based
> on PCI devices.
> 
>> If it is only needed for PCIe then we need to find a way to
>> not add it somehow...
> 
> I don't see how we can do that. The core code needs to know whether the
> transport supports this and must have an operation to query it.
> 

Okay.

> Logan
> 


Re: [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA

2021-09-30 Thread Logan Gunthorpe



On 2021-09-29 11:06 p.m., Chaitanya Kulkarni wrote:
> Logan,
> 
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 7efb31b87f37..916750a54f60 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3771,7 +3771,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
>>  blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
>>
>>  blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
>> -   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
>> +   if (ctrl->ops->supports_pci_p2pdma &&
>> +   ctrl->ops->supports_pci_p2pdma(ctrl))
>>  blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
>>
>>  ns->ctrl = ctrl;
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index 9871c0c9374c..fb9bfc52a6d7 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -477,7 +477,6 @@ struct nvme_ctrl_ops {
>>  unsigned int flags;
>>   #define NVME_F_FABRICS (1 << 0)
>>   #define NVME_F_METADATA_SUPPORTED  (1 << 1)
>> -#define NVME_F_PCI_P2PDMA  (1 << 2)
>>  int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
>>  int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
>>  int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
>> @@ -485,6 +484,7 @@ struct nvme_ctrl_ops {
>>  void (*submit_async_event)(struct nvme_ctrl *ctrl);
>>  void (*delete_ctrl)(struct nvme_ctrl *ctrl);
>>  int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
>> +   bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
>>   };
>>
> 
> Is this new op only needed for the PCIe transport? Or do you have
> follow-on patches that use this op for the other transports?

No, I don't think this will make sense for transports that are not based
on PCI devices.

> If it is only needed for PCIe then we need to find a way to
> not add it somehow...

I don't see how we can do that. The core code needs to know whether the
transport supports this and must have an operation to query it.
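
For illustration, the PCIe transport side might look roughly like the
sketch below; non-PCI transports would simply leave the callback NULL,
which the "ctrl->ops->supports_pci_p2pdma &&" check in nvme_alloc_ns()
already handles. (Helper names are assumptions here, not the posted
patch.)

	static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
	{
		struct nvme_dev *dev = to_nvme_dev(ctrl);

		/* defer to the DMA layer's view of the underlying device */
		return dma_pci_p2pdma_supported(dev->dev);
	}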

Logan



Re: [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2021-09-30 Thread Logan Gunthorpe



On 2021-09-29 10:47 p.m., Chaitanya Kulkarni wrote:
> Logan,
> 
>> +/*
>> + * bit 2 is the third free bit in the page_link on 64bit systems which
>> + * is used by dma_unmap_sg() to determine if the dma_address is a PCI
>> + * bus address when doing P2PDMA.
>> + * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
>> + */
>> +
>> +#ifdef CONFIG_PCI_P2PDMA
>> +#define SG_DMA_PCI_P2PDMA  0x04UL
>> +#else
>> +#define SG_DMA_PCI_P2PDMA  0x00UL
>> +#endif
>> +
>> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_PCI_P2PDMA)
>> +
> 
> You are doing two things in one patch :-
> 1. Introducing a new macro to replace the current macros.
> 2. Adding a new member to those macros.
> 
> shouldn't this be split into two patches, where you introduce a
> macro SG_PAGE_LINK_MASK (SG_CHAIN | SG_END) in a prep patch and then
> update SG_PAGE_LINK_MASK with SG_DMA_PCI_P2PDMA along with the related
> code?
> 

Ok, will split. I'll also add the static inline cleanup Jason suggested
in the first patch.
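
Something like the following is the kind of static inline helper being
discussed (rough sketch only; the name and placement are assumptions):

	#ifdef CONFIG_PCI_P2PDMA
	static inline bool sg_is_dma_pci_p2pdma(struct scatterlist *sg)
	{
		return sg->page_link & SG_DMA_PCI_P2PDMA;
	}
	#else
	static inline bool sg_is_dma_pci_p2pdma(struct scatterlist *sg)
	{
		return false;
	}
	#endif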

Logan


[RFC] iommu: Use put_pages_list

2021-09-30 Thread Matthew Wilcox (Oracle)
page->freelist is for the use of slab.  We already have the ability
to free a list of pages in the core mm, but it requires the use of a
list_head and for the pages to be chained together through page->lru.
Switch the iommu code over to using put_pages_list().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 drivers/iommu/amd/io_pgtable.c | 99 +++---
 drivers/iommu/dma-iommu.c  | 11 +---
 drivers/iommu/intel/iommu.c| 89 +++---
 include/linux/iommu.h  |  3 +-
 4 files changed, 77 insertions(+), 125 deletions(-)

diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 182c93a43efd..8dfa6ee58b76 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -74,49 +74,37 @@ static u64 *first_pte_l7(u64 *pte, unsigned long *page_size,
  *
  /
 
-static void free_page_list(struct page *freelist)
-{
-   while (freelist != NULL) {
-   unsigned long p = (unsigned long)page_address(freelist);
-
-   freelist = freelist->freelist;
-   free_page(p);
-   }
-}
-
-static struct page *free_pt_page(unsigned long pt, struct page *freelist)
+static void free_pt_page(unsigned long pt, struct list_head *list)
 {
struct page *p = virt_to_page((void *)pt);
 
-   p->freelist = freelist;
-
-   return p;
+   list_add_tail(&p->lru, list);
 }
 
 #define DEFINE_FREE_PT_FN(LVL, FN)						\
-static struct page *free_pt_##LVL (unsigned long __pt, struct page *freelist)	\
-{										\
-	unsigned long p;							\
-	u64 *pt;								\
-	int i;									\
-										\
-	pt = (u64 *)__pt;							\
-										\
-	for (i = 0; i < 512; ++i) {						\
-		/* PTE present? */						\
-		if (!IOMMU_PTE_PRESENT(pt[i]))					\
-			continue;						\
-										\
-		/* Large PTE? */						\
-		if (PM_PTE_LEVEL(pt[i]) == 0 ||					\
-		    PM_PTE_LEVEL(pt[i]) == 7)					\
-			continue;						\
-										\
-		p = (unsigned long)IOMMU_PTE_PAGE(pt[i]);			\
-		freelist = FN(p, freelist);					\
-	}									\
-										\
-	return free_pt_page((unsigned long)pt, freelist);			\
+static void free_pt_##LVL (unsigned long __pt, struct list_head *list) \
+{  \
+   unsigned long p;\
+   u64 *pt;\
+   int i;  \
+   \
+   pt = (u64 *)__pt;   \
+   \
+   for (i = 0; i < 512; ++i) { \
+   /* PTE present? */  \
+   if (!IOMMU_PTE_PRESENT(pt[i]))  \
+   continue;   \
+   \
+   /* Large PTE? */\
+   if (PM_PTE_LEVEL(pt[i]) == 0 || \
+   PM_PTE_LEVEL(pt[i]) == 7)   \
+   continue;   \
+   \
+   p = (unsigned long)IOMMU_PTE_PAGE(pt[i]);   \
+   FN(p, list);\
+   }   

Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-09-30 Thread Jacob Pan
Hi Mike,

On Thu, 30 Sep 2021 14:22:34 +, "Campin, Mike" 
wrote:

> I need support for mixed user PASID, kernel PASID and non-PASID use cases
> in the driver.
> 
This specific RFC is for kernel PASID only. Native user PASID use is
supported via the SVA lib kernel API and the /dev/uacce UAPI or a
driver-specific char dev. Guest PASID is being developed under the new
/dev/iommu framework. Non-PASID kernel use stays under the DMA API,
unchanged from the driver's POV. In fact, this proposal will map
non-PASID and PASID DMA identically.
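
To make the flow concrete, a kernel driver opting in would do something
like the sketch below. The helper name follows the pci_enable_pasid_dma()
idea discussed further down; my_hw_set_kernel_pasid() is a hypothetical
device-specific step, not a real API:

	u32 pasid;

	if (!pci_enable_pasid_dma(pdev, &pasid))
		/* tag in-kernel DMA with the PASID; the mappings
		 * themselves come from the regular DMA API, identical
		 * to the non-PASID (PASID 0) path */
		my_hw_set_kernel_pasid(pdev, pasid);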

Thanks,

Jacob

> -Original Message-
> From: Jason Gunthorpe  
> Sent: Wednesday, September 29, 2021 4:43 PM
> To: Jacob Pan 
> Cc: iommu@lists.linux-foundation.org; LKML
> ; Joerg Roedel ; Christoph
> Hellwig ; Tian, Kevin ; Luck,
> Tony ; Jiang, Dave ; Raj,
> Ashok ; Kumar, Sanjay K ;
> Campin, Mike ; Thomas Gleixner
>  Subject: Re: [RFC 0/7] Support in-kernel DMA with
> PASID and SVA
> 
> On Wed, Sep 29, 2021 at 03:57:20PM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 29 Sep 2021 16:39:53 -0300, Jason Gunthorpe 
> > wrote: 
> > > On Wed, Sep 29, 2021 at 12:37:19PM -0700, Jacob Pan wrote:
> > >
> > > > For #2, it seems we can store the kernel PASID in struct device. 
> > > > This will preserve the DMA API interface while making it PASID
> > > > capable. Essentially, each PASID capable device would have two
> > > > special global
> > > > PASIDs: 
> > > > - PASID 0 for DMA request w/o PASID, aka RID2PASID
> > > > - PASID 1 (randomly selected) for in-kernel DMA request w/
> > > > PASID  
> > > 
> > > This seems reasonable, I had the same thought. Basically just have 
> > > the driver issue some trivial call:
> > >   pci_enable_pasid_dma(pdev, &pasid)  
> > That would work, but I guess it needs to be an iommu_ call instead of
> > pci_?  
> 
> Whichever makes sense. The API should take in a struct pci_device and
> return a PCI PASID - at least as a wrapper around a more generic iommu API.
> 
> > I think your suggestion is more precise, in case the driver does not 
> > want to do DMA w/ PASID, we can do less IOTLB flush (PASID 0 only).  
> 
> Since it is odd, and it may create overhead, I would do it only when
> asked to do it
> 
> > > Having multiple RID's pointing at the same IO page table is 
> > > something we expect iommufd to require so the whole thing should 
> > > ideally fall out naturally.  
> 
> > That would be the equivalent of attaching multiple devices to the same 
> > IOMMU domain. right?  
> 
> Effectively..
> 
> Jason


Thanks,

Jacob


RE: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-09-30 Thread Campin, Mike
I need support for mixed user PASID, kernel PASID and non-PASID use cases in 
the driver.

-Original Message-
From: Jason Gunthorpe  
Sent: Wednesday, September 29, 2021 4:43 PM
To: Jacob Pan 
Cc: iommu@lists.linux-foundation.org; LKML ; 
Joerg Roedel ; Christoph Hellwig ; Tian, 
Kevin ; Luck, Tony ; Jiang, Dave 
; Raj, Ashok ; Kumar, Sanjay K 
; Campin, Mike ; Thomas 
Gleixner 
Subject: Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

On Wed, Sep 29, 2021 at 03:57:20PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 29 Sep 2021 16:39:53 -0300, Jason Gunthorpe  wrote:
> 
> > On Wed, Sep 29, 2021 at 12:37:19PM -0700, Jacob Pan wrote:
> >  
> > > For #2, it seems we can store the kernel PASID in struct device. 
> > > This will preserve the DMA API interface while making it PASID capable.
> > > Essentially, each PASID capable device would have two special 
> > > global
> > > PASIDs: 
> > >   - PASID 0 for DMA request w/o PASID, aka RID2PASID
> > >   - PASID 1 (randomly selected) for in-kernel DMA request w/ PASID
> > 
> > This seems reasonable, I had the same thought. Basically just have 
> > the driver issue some trivial call:
> >   pci_enable_pasid_dma(pdev, &pasid)
> That would work, but I guess it needs to be an iommu_ call instead of pci_?

Whichever makes sense. The API should take in a struct pci_device and return
a PCI PASID - at least as a wrapper around a more generic iommu API.

> I think your suggestion is more precise, in case the driver does not 
> want to do DMA w/ PASID, we can do less IOTLB flush (PASID 0 only).

Since it is odd, and it may create overhead, I would do it only when asked to 
do it

> > Having multiple RID's pointing at the same IO page table is 
> > something we expect iommufd to require so the whole thing should 
> > ideally fall out naturally.

> That would be the equivalent of attaching multiple devices to the same 
> IOMMU domain. right?

Effectively..

Jason


Re: [PATCH v1 2/2] iommu/vt-d: avoid duplicated removing in __domain_mapping

2021-09-30 Thread Lu Baolu

Hi Longpeng,

On 2021/9/15 23:21, Longpeng(Mike) wrote:

__domain_mapping() always removes the pages in the range from
'iov_pfn' to 'end_pfn', but the 'end_pfn' is always the last pfn
of the range that the caller wants to map.

This introduces many duplicated removals and makes the map
operation take too long, for example:

   Map iova=0x10,nr_pages=0x7d61800
 iov_pfn: 0x10, end_pfn: 0x7e617ff
 iov_pfn: 0x14, end_pfn: 0x7e617ff
 iov_pfn: 0x18, end_pfn: 0x7e617ff
 iov_pfn: 0x1c, end_pfn: 0x7e617ff
 iov_pfn: 0x20, end_pfn: 0x7e617ff
 ...
   it takes about 50ms in total.

We can reduce the cost by recalculating 'end_pfn' and limiting it
to the boundary of the end of the PTE page.

   Map iova=0x10,nr_pages=0x7d61800
 iov_pfn: 0x10, end_pfn: 0x13
 iov_pfn: 0x14, end_pfn: 0x17
 iov_pfn: 0x18, end_pfn: 0x1b
 iov_pfn: 0x1c, end_pfn: 0x1f
 iov_pfn: 0x20, end_pfn: 0x23
 ...
   it only needs 9ms now.

Signed-off-by: Longpeng(Mike) 


The 0day robot reports the below compile error when building a 32-bit
kernel (make W=1 ARCH=i386):

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/i915_drv.h:43,
                    from drivers/gpu/drm/i915/display/intel_display_types.h:47,
                    from drivers/gpu/drm/i915/display/intel_dsi.h:30,
                    from :
   include/linux/intel-iommu.h: In function 'nr_pte_to_next_page':
>> include/linux/intel-iommu.h:719:3: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]

 719 |   (struct dma_pte *)VTD_PAGE_ALIGN((unsigned long)pte) - pte;
 |   ^
   cc1: all warnings being treated as errors


vim +719 include/linux/intel-iommu.h

   715  
   716  static inline int nr_pte_to_next_page(struct dma_pte *pte)
   717  {
   718  return first_pte_in_page(pte) ? BIT_ULL(VTD_STRIDE_SHIFT) :
 > 719  		(struct dma_pte *)VTD_PAGE_ALIGN((unsigned long)pte) - pte;
   720  }
   721  

Can you please take a look at this?
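
One possible fix might be to keep the alignment arithmetic in unsigned
long, so the cast is always to the native pointer size (untested sketch):

	static inline int nr_pte_to_next_page(struct dma_pte *pte)
	{
		return first_pte_in_page(pte) ? BIT_ULL(VTD_STRIDE_SHIFT) :
			(struct dma_pte *)ALIGN((unsigned long)pte, VTD_PAGE_SIZE) - pte;
	}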

Best regards,
baolu


---
  drivers/iommu/intel/iommu.c | 12 +++-
  include/linux/intel-iommu.h |  6 ++
  2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d75f59a..87cbf34 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2354,12 +2354,18 @@ static void switch_to_super_page(struct dmar_domain *domain,
 			return -ENOMEM;
 		first_pte = pte;
 
+		lvl_pages = lvl_to_nr_pages(largepage_lvl);
+		BUG_ON(nr_pages < lvl_pages);
+
 		/* It is large page*/
 		if (largepage_lvl > 1) {
 			unsigned long end_pfn;
+			unsigned long pages_to_remove;
 
 			pteval |= DMA_PTE_LARGE_PAGE;
-			end_pfn = ((iov_pfn + nr_pages) & level_mask(largepage_lvl)) - 1;
+			pages_to_remove = min_t(unsigned long, nr_pages,
+						nr_pte_to_next_page(pte) * lvl_pages);
+			end_pfn = iov_pfn + pages_to_remove - 1;
 			switch_to_super_page(domain, iov_pfn, end_pfn, largepage_lvl);
 		} else {
 			pteval &= ~(uint64_t)DMA_PTE_LARGE_PAGE;
@@ -2381,10 +2387,6 @@ static void switch_to_super_page(struct dmar_domain *domain,
 			WARN_ON(1);
 		}
 
-		lvl_pages = lvl_to_nr_pages(largepage_lvl);
-
-		BUG_ON(nr_pages < lvl_pages);
-
 		nr_pages -= lvl_pages;
 		iov_pfn += lvl_pages;
 		phys_pfn += lvl_pages;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index a590b00..4bff70c 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -713,6 +713,12 @@ static inline bool first_pte_in_page(struct dma_pte *pte)
return !((unsigned long)pte & ~VTD_PAGE_MASK);
  }
  
+static inline int nr_pte_to_next_page(struct dma_pte *pte)
+{
+	return first_pte_in_page(pte) ? BIT_ULL(VTD_STRIDE_SHIFT) :
+		(struct dma_pte *)VTD_PAGE_ALIGN((unsigned long)pte) - pte;
+}
+
  extern struct dmar_drhd_unit * dmar_find_matched_drhd_unit(struct pci_dev *dev);
  extern int dmar_find_matched_atsr_unit(struct pci_dev *dev);
  




Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry

2021-09-30 Thread Karsten Graul
On 14/09/2021 17:45, Ioana Ciornei wrote:
> On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
>> +DPAA2, netdev maintainers
>> Hi,
>>
>> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
>>> Since overlapping mappings are not supported by the DMA API, we should
>>> report an error if active_cacheline_insert returns -EEXIST.
>>
>> It seems this patch found a victim. I was trying to run iperf3 on a
>> honeycomb (5.14.0, fedora 35) and the console is blasting this error message
>> at 100% cpu. So, I changed it to a WARN_ONCE() to get the call trace, which
>> is attached below.
>>
> 
> These frags are allocated by the stack, transformed into a scatterlist
> by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> dpaa2-eth's decision to use two fragments from the same page (that will
> also end un in the same cacheline) in two different in-flight skbs.
> 
> Is this behavior normal?
> 

We see the same problem here and it started with 5.15-rc2 in our nightly CI 
runs.
The CI has panic_on_warn enabled so we see the panic every day now.

It's always the same pattern: the SMC module calls dma_map_sg_attrs(), which ends
up in the EEXIST warning sooner or later.

It would be better to revert this patch now and start to better understand the 
checking logic for overlapping areas.

Thank you.


The call trace for reference:

[  864.189864] DMA-API: mlx5_core 0662:00:00.0: cacheline tracking EEXIST, 
overlapping mappings aren't supported
[  864.189883] WARNING: CPU: 0 PID: 33720 at kernel/dma/debug.c:570 
add_dma_entry+0x208/0x2c8
...
[  864.190747] CPU: 0 PID: 33720 Comm: smcapp Not tainted 
5.15.0-20210928.rc3.git0.a59bf04db7bb.300.fc34.s390x+debug #1
[  864.190758] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
[  864.190766] Krnl PSW : 0704d0018000 fa6239fc 
(add_dma_entry+0x20c/0x2c8)
[  864.190783]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 
RI:0 EA:3
[  864.190795] Krnl GPRS: c000bfff 8000 0061 

[  864.190804]0001 0001 0001 
0001
[  864.190813]0701 0020ff00  
8137b300
[  864.190822]20020100 0001 fa6239f8 
0380074536f8
[  864.190837] Krnl Code: fa6239ec: c020007a4964	larl	%r2,fb56ccb4
                          fa6239f2: c0e5005ef2ff	brasl	%r14,fb201ff0
                         #fa6239f8: af00		mc	0,0
                         >fa6239fc: ecb60057007c	cgij	%r11,0,6,fa623aaa
                          fa623a02: c01000866149	larl	%r1,fb6efc94
                          fa623a08: e3101012	lt	%r1,0(%r1)
                          fa623a0e: a774ff73	brc	7,fa6238f4
                          fa623a12: c010008a9227	larl	%r1,fb775e60
[  864.202949] Call Trace:
[  864.202959]  [] add_dma_entry+0x20c/0x2c8 
[  864.202971] ([] add_dma_entry+0x208/0x2c8)
[  864.202981]  [] debug_dma_map_sg+0x140/0x160 
[  864.202992]  [] __dma_map_sg_attrs+0x9c/0xd8 
[  864.203002]  [] dma_map_sg_attrs+0x22/0x40 
[  864.203012]  [<03ff80483bde>] smc_ib_buf_map_sg+0x5e/0x90 [smc] 
[  864.203036]  [<03ff80486b44>] smcr_buf_map_link.part.0+0x12c/0x1e8 [smc] 
[  864.203053]  [<03ff80486cb6>] _smcr_buf_map_lgr+0xb6/0xf8 [smc] 
[  864.203071]  [<03ff8048b91c>] smcr_buf_map_lgr+0x4c/0x90 [smc] 
[  864.211496]  [<03ff80490ac2>] smc_llc_cli_add_link+0x152/0x420 [smc] 
[  864.211522]  [<03ff8047acbc>] smcr_clnt_conf_first_link+0x124/0x1e0 
[smc] 
[  864.211537]  [<03ff8047bfb2>] smc_connect_rdma+0x25a/0x2e8 [smc] 
[  864.211551]  [<03ff8047da4a>] __smc_connect+0x38a/0x650 [smc] 
[  864.211566]  [<03ff8047de70>] smc_connect+0x160/0x190 [smc] 
[  864.211580]  [] __sys_connect+0x98/0xd0 
[  864.211592]  [] __do_sys_socketcall+0x16a/0x350 
[  864.211603]  [] __do_syscall+0x1c2/0x1f0 
[  864.211616]  [] system_call+0x78/0xa0 

-- 
Karsten


Re: [PATCH,RFC] iommu/amd: Recover from event log overflow

2021-09-30 Thread Lennert Buytenhek
On Tue, Sep 28, 2021 at 10:45:30AM +0200, Joerg Roedel wrote:

> Hi Lennert,

Hi Joerg,

Thanks for your feedback!


> > +/*
> > + * This function restarts event logging in case the IOMMU experienced
> > + * an event log buffer overflow.
> > + */
> > +void amd_iommu_restart_event_logging(struct amd_iommu *iommu)
> > +{
> > +   iommu_feature_disable(iommu, CONTROL_EVT_LOG_EN);
> > +   iommu_feature_enable(iommu, CONTROL_EVT_LOG_EN);
> > +}
> 
> A few more things need to happen here. First of all head and tail are
> likely equal when the event buffer overflows, so the events will not be
> polled and reported.

I applied the following change on top of my patch, to check the state
of the event log ring buffer when we enter the IRQ handler with an
overflow condition pending:

--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -752,6 +752,18 @@ irqreturn_t amd_iommu_int_thread(int irq, void *data)
struct amd_iommu *iommu = (struct amd_iommu *) data;
u32 status = readl(iommu->mmio_base + MMIO_STATUS_OFFSET);
 
+   if (status & MMIO_STATUS_EVT_OVERFLOW_INT_MASK) {
+   u32 events;
+
+   events = readl(iommu->mmio_base + MMIO_EVT_TAIL_OFFSET) -
+readl(iommu->mmio_base + MMIO_EVT_HEAD_OFFSET);
+   if (events & 0x8000)
+   events += EVT_BUFFER_SIZE;
+   events /= EVENT_ENTRY_SIZE;
+
+   pr_info("Overflow with %d events queued\n", events);
+   }
+
while (status & AMD_IOMMU_INT_MASK) {
/* Enable interrupt sources again */
writel(AMD_IOMMU_INT_MASK,

And that gives me:

[   33.304821] AMD-Vi: Overflow with 511 events queued
[   35.304812] AMD-Vi: Overflow with 511 events queued
[   39.304791] AMD-Vi: Overflow with 511 events queued
[   40.304792] AMD-Vi: Overflow with 511 events queued
[   41.304782] AMD-Vi: Overflow with 511 events queued
[   44.009920] AMD-Vi: Overflow with 511 events queued
[   45.304768] AMD-Vi: Overflow with 511 events queued
[   46.304782] AMD-Vi: Overflow with 511 events queued
[   46.385028] AMD-Vi: Overflow with 511 events queued
[   51.304733] AMD-Vi: Overflow with 511 events queued
[   53.318892] AMD-Vi: Overflow with 511 events queued
[   60.304681] AMD-Vi: Overflow with 511 events queued
[   63.304676] AMD-Vi: Overflow with 511 events queued

In other words, it seems that the hardware considers the event log to
be full when there's still one free entry in the ring buffer, and it
will not actually fill up the last free entry and make the head and
tail pointer equal by doing so.  This seems consistent across Ryzen
3700X, Ryzen 5950X, EPYC 3151, EPYC 3251, RX-421ND, RX-216TD.

The docs talk about "When the Event Log has overflowed", but they don't
define exactly when that happens (i.e. when tail becomes equal to head or
one entry before that), but I did find this phrase that talks about the
overflow condition:

The host software must make space in the event log after an
overflow by reading entries (by adjusting the head pointer) or
by resizing the log.  Event logging may then be restarted.

This seems to suggest that the hardware will never consume the last
entry in the ring buffer and thereby advance tail to be equal to head,
because if it would, then there would be no need for "reading entries
(by adjusting the head pointer)" to make space in the event log buffer,
because the 'empty' and 'full' conditions would be indistinguishable
in that case.
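
In other words, the hardware appears to follow the classic one-slot-open
ring buffer convention, which keeps at most N-1 entries so that "empty"
and "full" stay distinguishable (illustrative sketch only, not driver
code):

	/* head == tail                 -> log is empty
	 * tail one entry behind head   -> log is "full" at N-1 entries */
	static bool evtlog_full(u32 head, u32 tail)
	{
		return ((tail + EVENT_ENTRY_SIZE) % EVT_BUFFER_SIZE) == head;
	}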

If there _is_ some variation of the hardware out there that does
advance the tail pointer to coincide with the head pointer when there
are already N-1 entries in the log and it has one more entry to write,
then this patch would still work OK on such hardware -- we would just
lose a few more events in that case than we otherwise would, but the
point of the patch is to be able to recover after a burst of I/O page
faults, at which point we've very likely already lost some events, so
I think such hypothetical lossage would be acceptable, given that none
of the hardware I have access to seems to behave that way and from the
wording in the docs it is unlikely to behave that way.

In fact, thinking about it a bit more, by the time an event log
overflow condition is handled, it is actually possible for the event
log ring buffer to be empty and not full.  Imagine this scenario:

- We enter the IRQ handler, EVT status bit is set, the ring buffer is
  full but it hasn't overflowed yet, so OVERFLOW is not set.

- We read interrupt status and clear the EVT bit.

- Before we call iommu_poll_events(), another event comes in, which
  causes the OVERFLOW flag to be set, since we haven't advanced head yet.

- iommu_poll_events() is called, and it processes all events in the
  ring buffer, which is now empty (and will remain empty, since the
  overflow bit is set).

- The MMIO_STATUS_OFFSET re-reading at the end of the IRQ loop finds
  the OVERFLOW bit set and 

Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-30 Thread Lu Baolu

On 2021/9/30 16:49, Tian, Kevin wrote:

From: Jason Gunthorpe 
Sent: Thursday, September 23, 2021 8:22 PM


These are different things and need different bits. Since the ARM path
has a lot more code supporting it, I'd suggest Intel should change
their code to use IOMMU_BLOCK_NO_SNOOP and abandon IOMMU_CACHE.


I didn't fully get this point. The end result is the same, i.e. making the DMA
cache-coherent when IOMMU_CACHE is set. Or, if you were to define the
behavior of IOMMU_CACHE, what would you define it as now?


It is clearly specifying how the kernel API works:

  !IOMMU_CACHE
	must call arch cache flushers
  IOMMU_CACHE
	do not call arch cache flushers
  IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP
	do not call arch cache flushers, and ignore the no-snoop bit.


Who will set IOMMU_BLOCK_NO_SNOOP? I feel this is arch-specific
knowledge about how cache coherency is implemented, i.e.
when IOMMU_CACHE is set the intel-iommu driver just maps it to
blocking no-snoop. Does it necessarily have to be an attribute at
the same level as IOMMU_CACHE?



On Intel it should refuse to create a !IOMMU_CACHE since the HW can't
do that.


Agree. In reality I guess this is not hit because all devices are marked
coherent on Intel platforms...

Baolu, any insight here?


I am trying to follow the discussion here. Please guide me if I didn't
get the right context.

Here, we are discussing arch_sync_dma_for_cpu() and
arch_sync_dma_for_device(). The x86 arch has clflush to sync dma buffer
for device, but I can't see any instruction to sync dma buffer for cpu
if the device is not cache coherent. Is that the reason why x86 can't
have an implementation for arch_sync_dma_for_cpu(), hence all devices
are marked coherent?


Thanks
Kevin



Best regards,
baolu


Re: [PATCH v2 11/29] iommu/mediatek: Always pm_runtime_get while tlb flush

2021-09-30 Thread Dafna Hirschfeld




On 13.08.21 08:53, Yong Wu wrote:

Prepare for 2 HWs that sharing pgtable in different power-domains.

The previous SoC don't have PM. Only mt8192 has power-domain,
and it is display's power-domain which nearly always is enabled.


Hi, I see that in mt1873.dtsi many devices that use the iommu have the
'power-domains' property.



When there are 2 M4U HWs, there may be a problem.
In this function, we get the pm status via the m4u dev, but it doesn't
reflect the real power-domain status of the HW, since other HW may also
use that power-domain.

Currently we cannot get the real power-domain status, thus always
pm_runtime_get here.

Prepare for mt8195; thus, no fixes tags needed here.

This patch may reduce performance; we expect the user to call
pm_runtime_get_sync before dma_alloc_attrs, which needs TLB ops.



Could you explain this sentence a bit? Should the user call pm_runtime_get_sync
before calling dma_alloc_attrs?
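
(For reference, my reading of the commit message is a calling pattern
roughly like the sketch below; whether that really is the intended
contract is exactly the question:)

	/* keep the shared power domain resumed across the allocation,
	 * so the IOMMU is powered when its TLB gets flushed */
	pm_runtime_get_sync(dev);
	vaddr = dma_alloc_attrs(dev, size, &dma_handle, GFP_KERNEL, 0);
	/* ... use the buffer ... */
	pm_runtime_put(dev);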

Thanks,
Dafna


Signed-off-by: Yong Wu 
---
  drivers/iommu/mtk_iommu.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index add23a36a5e2..abc721a1da21 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -238,8 +238,11 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long 
iova, size_t size,
  
  	for_each_m4u(data, head) {

if (has_pm) {
-   if (pm_runtime_get_if_in_use(data->dev) <= 0)
+   ret = pm_runtime_resume_and_get(data->dev);
+   if (ret < 0) {
+				dev_err(data->dev, "tlb flush: pm get fail %d.\n", ret);
continue;
+   }
}
  
  		spin_lock_irqsave(&data->tlb_lock, flags);





Re: [PATCH v8 09/12] media: mtk-vcodec: Get rid of mtk_smi_larb_get/put

2021-09-30 Thread Dafna Hirschfeld




On 30.09.21 05:28, Yong Wu wrote:

Hi Dafna,

Thanks very much for the review.

On Wed, 2021-09-29 at 14:13 +0200, Dafna Hirschfeld wrote:


On 29.09.21 03:37, Yong Wu wrote:

MediaTek IOMMU has already added the device_link between the consumer
and the smi-larb device. If the vcodec device calls pm_runtime_get_sync,
the smi-larb's pm_runtime_get_sync is also called automatically.

CC: Tiffany Lin 
CC: Irui Wang 
Signed-off-by: Yong Wu 
Reviewed-by: Evan Green 
Acked-by: Tiffany Lin 
Reviewed-by: Dafna Hirschfeld 
---
   .../platform/mtk-vcodec/mtk_vcodec_dec_pm.c   | 37 +++-----
   .../platform/mtk-vcodec/mtk_vcodec_drv.h      |  3 --
   .../platform/mtk-vcodec/mtk_vcodec_enc.c      |  1 -
   .../platform/mtk-vcodec/mtk_vcodec_enc_pm.c   | 44 +++----
   4 files changed, 10 insertions(+), 75 deletions(-)

diff --git a/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
b/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
index 6038db96f71c..d0bf9aa3b29d 100644
--- a/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
+++ b/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
@@ -8,14 +8,12 @@
   #include 
   #include 
   #include 
-#include 
   
   #include "mtk_vcodec_dec_pm.h"

   #include "mtk_vcodec_util.h"
   
   int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev)

   {
-   struct device_node *node;
struct platform_device *pdev;
struct mtk_vcodec_pm *pm;
struct mtk_vcodec_clk *dec_clk;
@@ -26,18 +24,7 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev)
pm = &mtkdev->pm;
pm->mtkdev = mtkdev;
dec_clk = &pm->vdec_clk;
-   node = of_parse_phandle(pdev->dev.of_node, "mediatek,larb", 0);
-   if (!node) {
-   mtk_v4l2_err("of_parse_phandle mediatek,larb fail!");
-   return -1;
-   }
   
-	pdev = of_find_device_by_node(node);

-   of_node_put(node);
-   if (WARN_ON(!pdev)) {
-   return -1;
-   }
-   pm->larbvdec = &pdev->dev;
pdev = mtkdev->plat_dev;
pm->dev = &pdev->dev;
   
@@ -47,14 +34,11 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev)
dec_clk->clk_info = devm_kcalloc(&pdev->dev,
dec_clk->clk_num, sizeof(*clk_info),
GFP_KERNEL);
-   if (!dec_clk->clk_info) {
-   ret = -ENOMEM;
-   goto put_device;
-   }
+   if (!dec_clk->clk_info)
+   return -ENOMEM;
} else {
mtk_v4l2_err("Failed to get vdec clock count");
-   ret = -EINVAL;
-   goto put_device;
+   return -EINVAL;
}
   
   	for (i = 0; i < dec_clk->clk_num; i++) {

@@ -63,29 +47,24 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev)
"clock-names", i, &clk_info->clk_name);
if (ret) {
mtk_v4l2_err("Failed to get clock name id =
%d", i);
-   goto put_device;
+   return ret;
}
clk_info->vcodec_clk = devm_clk_get(&pdev->dev,
clk_info->clk_name);
if (IS_ERR(clk_info->vcodec_clk)) {
mtk_v4l2_err("devm_clk_get (%d)%s fail", i,
clk_info->clk_name);
-   ret = PTR_ERR(clk_info->vcodec_clk);
-   goto put_device;
+   return PTR_ERR(clk_info->vcodec_clk);
}
}
   
   	pm_runtime_enable(&pdev->dev);

return 0;
-put_device:
-   put_device(pm->larbvdec);
-   return ret;
   }
   
   void mtk_vcodec_release_dec_pm(struct mtk_vcodec_dev *dev)

   {
pm_runtime_disable(dev->pm.dev);
-   put_device(dev->pm.larbvdec);
   }


Now that this function only does 'pm_runtime_disable(dev->pm.dev);', it
would be more readable to remove the function mtk_vcodec_release_dec_pm
and replace it with pm_runtime_disable(dev->pm.dev);
Same for the 'enc' equivalent.


Makes sense. But it may not be proper to use pm_runtime_disable directly,
to keep the symmetry with mtk_vcodec_init_dec_pm in mtk_vcodec_probe.

Maybe we should move pm_runtime_enable out from mtk_vcodec_init_dec_pm
into mtk_vcodec_probe. I could do a new patch for this. Is this ok for
you?


Yes, there is also asymmetry in how pm_runtime* is called in general:
I see that in the decoder it is called from mtk_vcodec_dec_pm.c,
but in the encoder it is called from mtk_vcodec_enc.c.

I think all calls to pm_runtime* should be out of the *_pm.c files,
since for example 'mtk_vcodec_dec_pw_on' also does just one call to
pm_runtime_resume_and_get, so that function can also be removed.

thanks,
Dafna





Thanks,
Dafna


[snip]

Re: [PATCH RFC v1 01/11] uapi/virtio-iommu: Add page request grp-id and flags information

2021-09-30 Thread Vivek Kumar Gautam




On 9/30/21 2:19 PM, Jean-Philippe Brucker wrote:

On Thu, Sep 30, 2021 at 10:26:35AM +0530, Vivek Kumar Gautam wrote:

I'm not sure why we made it 32-bit in Linux UAPI, it's a little wasteful.
PCIe PRGI is 9-bits and SMMU STAG is 16-bits. Since the scope of the grpid
is the endpoint, 16-bit means 64k in-flight faults per endpoint, which
seems more than enough.


Right, I will update this to a 16-bit field. It won't be okay to update the
iommu uAPI now, right?


Since there is no UAPI transport for the fault request/response at the
moment, it should be possible to update it.


Righty! I will add a patch for that too then.



Thanks & regards

Vivek


RE: [PATCH v7 0/9] ACPI/IORT: Support for IORT RMR node

2021-09-30 Thread Shameerali Kolothum Thodi
Hi Eric,

> -Original Message-
> From: Eric Auger [mailto:eric.au...@redhat.com]
> Sent: 30 September 2021 10:48
> To: Shameerali Kolothum Thodi ;
> linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org;
> iommu@lists.linux-foundation.org; Jean-Philippe Brucker
> 
> Cc: Linuxarm ; lorenzo.pieral...@arm.com;
> j...@8bytes.org; robin.mur...@arm.com; w...@kernel.org; wanghuiqiang
> ; Guohanjun (Hanjun Guo)
> ; steven.pr...@arm.com; sami.muja...@arm.com;
> j...@solid-run.com; yangyicong 
> Subject: Re: [PATCH v7 0/9] ACPI/IORT: Support for IORT RMR node
> 
> Hi Shameer,
> 
> On 8/5/21 10:07 AM, Shameer Kolothum wrote:
> > Hi,
> >
> > The series adds support to IORT RMR nodes specified in IORT
> > Revision E.b -ARM DEN 0049E[0]. RMR nodes are used to describe
> > memory ranges that are used by endpoints and require a unity
> > mapping in SMMU.
> 
> I used your series and RMRs to force a guest iommu (vSMMUv3 nested stage
> use case) to have a flat mapping for IOVAs within [0x800, 0x810]
> (matching MSI_IOVA_BASE and MSI_IOVA_LENGTH) used by the host to map
> MSI
> physical doorbells.
> 
> That way when an assigned device protected by a vSMMUv3 implemented
> upon
> nested stage issues an MSI transaction, let's say using IOVA=0x800,
> we would get:
>                     S1 (guest)           S2 (host)
> 0x800            0x800            Physical DB
> 
> This method was suggested by Jean-Philippe (added in CC) and it
> simplifies the nested stage integration because we don't have to care
> about nested stage MSI bindings.
> 
> However if I understand correctly we cannot define a range of SIDs using
> the same RMR (due to the single mapping bit which must be set, Table 5
> flags format). This is a spec restriction and not an issue with your series.

Yes. The spec currently mandates single mapping bit to be set.

> 
> As VFIO devices can be hot-plugged we thus need to create as many RMR
> nodes as potential BDFs, leading to 256 * 6 = 1536 RMR nodes if you have
> 5 pcie root ports as it is usual in VMs. Then this causes some trouble
> at qemu level for instance, wrt migration. See [RFC]
> hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding.
> 
> Do you know if there is a plan to remove the single mapping limitation
> in the spec?

I would imagine so. In an earlier comment[1], Robin did mention a possible
relaxing of this in a future spec revision.

Thanks,
Shameer
1. 
https://lore.kernel.org/linux-arm-kernel/15c7fac0-11a8-0cdb-aac3-b5d8feb8f...@arm.com/

> Thanks
> 
> Eric
> >
> > We have faced issues with 3408iMR RAID controller cards which
> > fail to boot when SMMU is enabled. This is because these
> > controllers make use of host memory for various caching related
> > purposes and when SMMU is enabled the iMR firmware fails to
> > access these memory regions as there is no mapping for them.
> > IORT RMR provides a way for UEFI to describe and report these
> > memory regions so that the kernel can make a unity mapping for
> > these in SMMU.
> >
> > Change History:
> >
> > v6 --> v7
> >
> > The only change from v6 is the fix pointed out by Steve to
> > the SMMUv2 SMR bypass install in patch #8.
> >
> > Thanks to the Tested-by tags by Laurentiu with SMMUv2 and
> > Hanjun/Huiqiang with SMMUv3 for v6. I haven't added the tags
> > yet as the series still needs more review[1].
> >
> > Feedback and tests on this series is very much appreciated.
> >
> > v5 --> v6
> > - Addressed comments from Robin & Lorenzo.
> >   : Moved iort_parse_rmr() to acpi_iort_init() from
> > iort_init_platform_devices().
> >   : Removed use of struct iort_rmr_entry during the initial
> > parse. Using struct iommu_resv_region instead.
> >   : Report RMR address alignment and overlap errors, but continue.
> >   : Reworked arm_smmu_init_bypass_stes() (patch # 6).
> > - Updated SMMUv2 bypass SMR code. Thanks to Jon N (patch #8).
> > - Set IOMMU protection flags(IOMMU_CACHE, IOMMU_MMIO) based
> >   on Type of RMR region. Suggested by Jon N.
> >
> > Thanks,
> > Shameer
> > [0] https://developer.arm.com/documentation/den0049/latest/
> > [1]
> https://lore.kernel.org/linux-acpi/20210716083442.1708-1-shameerali.koloth
> um.th...@huawei.com/T/#m043c95b869973a834b2fd57f3e1ed0325c84f3b7
> > --
> > v4 --> v5
> >  -Added a fw_data union to struct iommu_resv_region and removed
> >   struct iommu_rmr (Based on comments from Joerg/Robin).
> >  -Added iommu_put_rmrs() to release mem.
> >  -Thanks to Steve for verifying on SMMUv2, but not added the Tested-by
> >   yet because of the above changes.
> >
> > v3 -->v4
> > -Included the SMMUv2 SMR bypass install changes suggested by
> >  Steve(patch #7)
> > -As per Robin's comments, RMR reserve implementation is now
> >  more generic  (patch #8) and dropped v3 patches 8 and 10.
> > -Rebase to 5.13-rc1
> >
> > RFC v2 --> v3
> >  -Dropped RFC tag as the ACPICA header changes are now ready to be
> >   part of 5.13[0]. But this series still has a depe

Re: [PATCH RFC v1 08/11] iommu/arm-smmu-v3: Implement shared context alloc and free ops

2021-09-30 Thread Vivek Kumar Gautam

Hi Jean,


On 9/21/21 9:37 PM, Jean-Philippe Brucker wrote:

On Fri, Apr 23, 2021 at 03:21:44PM +0530, Vivek Gautam wrote:

Implementing the alloc_shared_cd and free_shared_cd in cd-lib, and
start using them for arm-smmu-v3-sva implementation.

Signed-off-by: Vivek Gautam 
---
  .../arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c  | 71 
  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   | 83 ---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  1 -
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   | 14 
  4 files changed, 73 insertions(+), 96 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c
index 537b7c784d40..b87829796596 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-cd-lib.c
@@ -285,16 +285,14 @@ static bool arm_smmu_free_asid(struct xarray *xa, void 
*cookie_cd)
   * descriptor is using it, try to replace it.
   */
  static struct arm_smmu_ctx_desc *
-arm_smmu_share_asid(struct mm_struct *mm, u16 asid)
+arm_smmu_share_asid(struct iommu_pasid_table *tbl, struct mm_struct *mm,
+   struct xarray *xa, u16 asid, u32 asid_bits)


xa and asid_bits could be stored in some arch-specific section of the
iommu_pasid_table struct. Other table drivers wouldn't need those
arguments.


Okay, will move them to a separate structure section.
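
Something along these lines, perhaps (rough sketch; the wrapper struct
and field names are assumptions, not code from the series):

	/* SMMU-specific state wrapped around the generic descriptor,
	 * so the shared-ASID helpers no longer need extra arguments. */
	struct arm_smmu_pasid_table {
		struct iommu_pasid_table	table;
		struct xarray			*asid_xa;	/* shared ASID space */
		u32				asid_bits;
	};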



More a comment for the parent series: it may be clearer to give a
different prefix to functions in this file (arm_smmu_cd_, pst_arm_?).
Reading this patch I'm a little confused by what belongs in the IOMMU
driver and what is done by this library. (I also keep reading 'tbl' as
'tlb'. Maybe we could make it 'table' since that doesn't take a lot more
space)


Yea, this may be confusing. I will fix these namings in my next version.




  {
int ret;
u32 new_asid;
struct arm_smmu_ctx_desc *cd;
-   struct arm_smmu_device *smmu;
-   struct arm_smmu_domain *smmu_domain;
-   struct iommu_pasid_table *tbl;
  
-	cd = xa_load(&arm_smmu_asid_xa, asid);

+   cd = xa_load(xa, asid);
if (!cd)
return NULL;
  
@@ -306,12 +304,8 @@ arm_smmu_share_asid(struct mm_struct *mm, u16 asid)

return cd;
}
  
-	smmu_domain = container_of(cd, struct arm_smmu_domain, s1_cfg.cd);

-   smmu = smmu_domain->smmu;
-   tbl = smmu_domain->tbl;
-
-   ret = xa_alloc(&arm_smmu_asid_xa, &new_asid, cd,
-  XA_LIMIT(1, (1 << smmu->asid_bits) - 1), GFP_KERNEL);
+   ret = xa_alloc(xa, &new_asid, cd, XA_LIMIT(1, (1 << asid_bits) - 1),
+  GFP_KERNEL);
if (ret)
return ERR_PTR(-ENOSPC);
/*
@@ -325,48 +319,52 @@ arm_smmu_share_asid(struct mm_struct *mm, u16 asid)
 * be some overlap between use of both ASIDs, until we invalidate the
 * TLB.
 */
-   ret = iommu_psdtable_write(tbl, &tbl->cfg, 0, cd);
+   ret = arm_smmu_write_ctx_desc(&tbl->cfg, 0, cd);
if (ret)
return ERR_PTR(-ENOSYS);
  
  	/* Invalidate TLB entries previously associated with that context */

-   iommu_psdtable_flush_tlb(tbl, smmu_domain, asid);
+   iommu_psdtable_flush_tlb(tbl, tbl->cookie, asid);
  
-	xa_erase(&arm_smmu_asid_xa, asid);

+   xa_erase(xa, asid);
return NULL;
  }
  
-struct arm_smmu_ctx_desc *

-arm_smmu_alloc_shared_cd(struct iommu_pasid_table *tbl, struct mm_struct *mm)
+static struct iommu_psdtable_mmu_notifier *
+arm_smmu_alloc_shared_cd(struct iommu_pasid_table *tbl, struct mm_struct *mm,
+struct xarray *xa, u32 asid_bits)
  {
u16 asid;
int err = 0;
u64 tcr, par, reg;
struct arm_smmu_ctx_desc *cd;
struct arm_smmu_ctx_desc *ret = NULL;
+   struct iommu_psdtable_mmu_notifier *pst_mn;
  
  	asid = arm64_mm_context_get(mm);

if (!asid)
return ERR_PTR(-ESRCH);
  
+	pst_mn = kzalloc(sizeof(*pst_mn), GFP_KERNEL);

+   if (!pst_mn) {
+   err = -ENOMEM;
+   goto out_put_context;
+   }
+
cd = kzalloc(sizeof(*cd), GFP_KERNEL);
if (!cd) {
err = -ENOMEM;
-   goto out_put_context;
+   goto out_free_mn;
}
  
  	refcount_set(&cd->refs, 1);
  
-	mutex_lock(&arm_smmu_asid_lock);

-   ret = arm_smmu_share_asid(mm, asid);
+   ret = arm_smmu_share_asid(tbl, mm, xa, asid, asid_bits);
if (ret) {
-   mutex_unlock(&arm_smmu_asid_lock);
goto out_free_cd;
}
  
-	err = xa_insert(&arm_smmu_asid_xa, asid, cd, GFP_KERNEL);

-   mutex_unlock(&arm_smmu_asid_lock);
-
+   err = xa_insert(xa, asid, cd, GFP_KERNEL);
if (err)
goto out_free_asid;
  
@@ -406,21 +404,26 @@ arm_smmu_alloc_shared_cd(struct iommu_pasid_table *tbl, struct mm_struct *mm)

cd->asid = asid;
cd->

Re: [PATCH v7 0/9] ACPI/IORT: Support for IORT RMR node

2021-09-30 Thread Eric Auger
Hi Shameer,

On 8/5/21 10:07 AM, Shameer Kolothum wrote:
> Hi,
>
> The series adds support to IORT RMR nodes specified in IORT
> Revision E.b -ARM DEN 0049E[0]. RMR nodes are used to describe
> memory ranges that are used by endpoints and require a unity
> mapping in SMMU.

I used your series and RMRs to force a guest iommu (vSMMUv3 nested stage
use case) to have a flat mapping for IOVAs within [0x800, 0x810]
(matching MSI_IOVA_BASE and MSI_IOVA_LENGTH) used by the host to map MSI
physical doorbells.

That way when an assigned device protected by a vSMMUv3 implemented upon
nested stage issues an MSI transaction, let's say using IOVA=0x800,
we would get:
                    S1 (guest)           S2 (host)
0x800            0x800            Physical DB

This method was suggested by Jean-Philippe (added in CC) and it
simplifies the nested stage integration because we don't have to care
about nested stage MSI bindings.

However if I understand correctly we cannot define a range of SIDs using
the same RMR (due to the single mapping bit which must be set, Table 5
flags format). This is a spec restriction and not an issue with your series.

As VFIO devices can be hot-plugged we thus need to create as many RMR
nodes as potential BDFs, leading to 256 * 6 = 1536 RMR nodes if you have
5 pcie root ports as it is usual in VMs. Then this causes some trouble
at qemu level for instance, wrt migration. See [RFC]
hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding.

Do you know if there is a plan to remove the single mapping limitation
in the spec?

Thanks

Eric
>
> We have faced issues with 3408iMR RAID controller cards which
> fail to boot when SMMU is enabled. This is because these
> controllers make use of host memory for various caching related
> purposes and when SMMU is enabled the iMR firmware fails to 
> access these memory regions as there is no mapping for them.
> IORT RMR provides a way for UEFI to describe and report these
> memory regions so that the kernel can make a unity mapping for
> these in SMMU.
>
> Change History: 
>
> v6 --> v7
>
> The only change from v6 is the fix pointed out by Steve to
> the SMMUv2 SMR bypass install in patch #8.
>
> Thanks to the Tested-by tags by Laurentiu with SMMUv2 and
> Hanjun/Huiqiang with SMMUv3 for v6. I haven't added the tags
> yet as the series still needs more review[1].
>
> Feedback and tests on this series is very much appreciated.
>
> v5 --> v6
> - Addressed comments from Robin & Lorenzo.
>   : Moved iort_parse_rmr() to acpi_iort_init() from
> iort_init_platform_devices().
>   : Removed use of struct iort_rmr_entry during the initial
> parse. Using struct iommu_resv_region instead.
>   : Report RMR address alignment and overlap errors, but continue.
>   : Reworked arm_smmu_init_bypass_stes() (patch # 6).
> - Updated SMMUv2 bypass SMR code. Thanks to Jon N (patch #8).
> - Set IOMMU protection flags(IOMMU_CACHE, IOMMU_MMIO) based
>   on Type of RMR region. Suggested by Jon N.
>
> Thanks,
> Shameer
> [0] https://developer.arm.com/documentation/den0049/latest/
> [1] 
> https://lore.kernel.org/linux-acpi/20210716083442.1708-1-shameerali.kolothum.th...@huawei.com/T/#m043c95b869973a834b2fd57f3e1ed0325c84f3b7
> --
> v4 --> v5
>  -Added a fw_data union to struct iommu_resv_region and removed
>   struct iommu_rmr (Based on comments from Joerg/Robin).
>  -Added iommu_put_rmrs() to release mem.
>  -Thanks to Steve for verifying on SMMUv2, but not added the Tested-by
>   yet because of the above changes.
>
> v3 -->v4
> -Included the SMMUv2 SMR bypass install changes suggested by
>  Steve(patch #7)
> -As per Robin's comments, RMR reserve implementation is now
>  more generic  (patch #8) and dropped v3 patches 8 and 10.
> -Rebase to 5.13-rc1
>
> RFC v2 --> v3
>  -Dropped RFC tag as the ACPICA header changes are now ready to be
>   part of 5.13[0]. But this series still has a dependency on that patch.
>  -Added IORT E.b related changes(node flags, _DSM function 5 checks for
>   PCIe).
>  -Changed RMR to stream id mapping from M:N to M:1 as per the spec and
>   discussion here[1].
>  -Last two patches add support for SMMUv2(Thanks to Jon Nettleton!)
> --
>
> Jon Nettleton (1):
>   iommu/arm-smmu: Get associated RMR info and install bypass SMR
>
> Shameer Kolothum (8):
>   iommu: Introduce a union to struct iommu_resv_region
>   ACPI/IORT: Add support for RMR node parsing
>   iommu/dma: Introduce generic helper to retrieve RMR info
>   ACPI/IORT: Add a helper to retrieve RMR memory regions
>   iommu/arm-smmu-v3: Introduce strtab init helper
>   iommu/arm-smmu-v3: Refactor arm_smmu_init_bypass_stes() to force
> bypass
>   iommu/arm-smmu-v3: Get associated RMR info and install bypass STE
>   iommu/dma: Reserve any RMR regions associated with a dev
>
>  drivers/acpi/arm64/iort.c   | 172 +++-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  76 +++--
>  drivers/iommu/arm/arm-smmu/arm-smmu.c   

RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-30 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, September 23, 2021 7:42 PM
> 
> On Thu, Sep 23, 2021 at 03:38:10AM +, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Thursday, September 23, 2021 11:11 AM
> > >
> > > >
> > > > The required behavior for iommufd is to have the IOMMU ignore the
> > > > no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> > > > clearly documented for its exact purpose and if other arches also have
> > > > instructions that need to be disabled if snoop TLPs are allowed then
> > > > they can re-use this bit. It appears ARM does not have this issue and
> > > > does not need the bit.
> > >
> > > Disabling wbinvd is one purpose. imo the more important intention
> > > is that iommu vendor uses different PTE formats between snoop and
> > > !snoop. As long as we want allow userspace to opt in case of isoch
> > > performance requirement (unlike current vfio which always choose
> > > snoop format if available), such mechanism is required for all vendors.
> > >
> >
> > btw I'm not sure whether the wbinvd trick is Intel specific. All other
> > platforms (amd, arm, s390, etc.) currently always claim
> > IOMMU_CAP_CACHE_COHERENCY (the source of IOMMU_CACHE).
> 
> This only means they don't need to use the arch cache flush
> helpers. It has nothing to do with no-snoop on those platforms.
> 
> > They didn't hit this problem because vfio always sets IOMMU_CACHE to
> > force every DMA to snoop. Will they need to handle similar
> > wbinvd-like trick (plus necessary memory type virtualization) when
> > non-snoop format is enabled?  Or are their architectures highly
> > optimized to afford isoch traffic even with snoop (then fine to not
> > support user opt-in)?
> 
> In other arches the question is:
>  - Do they allow non-coherent DMA to exist in a VM?

And is coherency a static attribute per device, or can it be opted into
by the driver on such an arch? If the latter, then the same opt-in path from
userspace also sounds reasonable, since the driver is in userspace now.

>  - Can the VM issue cache maintenance ops to fix the decoherence?

As you listed the questions are all about non-coherent DMA, not
how non-coherent DMAs are implemented underlyingly. From this
angle focusing on coherent part as Alex suggested is more forward
looking than tying the uAPI to a specific coherency implementation
using snoop?

> 
> The Intel functional issue is that Intel blocks the cache maintenance
> ops from the VM and the VM has no way to self-discover that the cache
> maintenance ops don't work.

The VM doesn't need to know whether the maintenance ops
actually work. It just treats the device as if those ops are always
required. The hypervisor will figure out whether those ops should
be blocked, based on whether coherency is guaranteed by the iommu
via iommufd/vfio.
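
i.e. roughly the model x86 KVM already applies for assigned devices
(a sketch of the idea only; the iommufd wiring is hypothetical, though
kvm_arch_has_noncoherent_dma() exists today):

	/* on a guest WBINVD exit */
	if (kvm_arch_has_noncoherent_dma(kvm))
		wbinvd_on_all_cpus();	/* honour guest cache maintenance */
	/* else: all DMA is snooped, so WBINVD can safely be a no-op */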

> 
> Other arches don't seem to have this specific problem...

I think the key is whether other archs allow the driver to decide DMA
coherency, and indirectly the underlying I/O page table format.
If yes, then I don't see a reason why such a decision should not be
given to userspace for the passthrough case.

Thanks
Kevin


Re: [PATCH RFC v1 10/11] uapi/virtio-iommu: Add a new request type to send page response

2021-09-30 Thread Vivek Kumar Gautam

Hi Jean,


On 9/21/21 9:46 PM, Jean-Philippe Brucker wrote:

On Fri, Apr 23, 2021 at 03:21:46PM +0530, Vivek Gautam wrote:

Once the page faults are handled, the response has to be sent to
virtio-iommu backend, from where it can be sent to the host to
prepare the response to a generated io page fault by the device.
Add a new virt-queue request type to handle this.

Signed-off-by: Vivek Gautam 
---
  include/uapi/linux/virtio_iommu.h | 18 ++
  1 file changed, 18 insertions(+)

diff --git a/include/uapi/linux/virtio_iommu.h 
b/include/uapi/linux/virtio_iommu.h
index c12d9b6a7243..1b174b98663a 100644
--- a/include/uapi/linux/virtio_iommu.h
+++ b/include/uapi/linux/virtio_iommu.h
@@ -48,6 +48,7 @@ struct virtio_iommu_config {
  #define VIRTIO_IOMMU_T_PROBE  0x05
  #define VIRTIO_IOMMU_T_ATTACH_TABLE   0x06
  #define VIRTIO_IOMMU_T_INVALIDATE 0x07
+#define VIRTIO_IOMMU_T_PAGE_RESP   0x08
  
  /* Status types */

  #define VIRTIO_IOMMU_S_OK 0x00
@@ -70,6 +71,23 @@ struct virtio_iommu_req_tail {
__u8reserved[3];
  };
  
+struct virtio_iommu_req_page_resp {

+   struct virtio_iommu_req_headhead;
+   __le32  domain;


I don't think we need this field, since the fault report doesn't come with
a domain.


But here we are sending the response, which would ultimately be consumed
by vfio. In kvmtool, I am consuming this "virtio_iommu_req_page_resp"
request in the virtio/iommu driver, extracting the domain from it, and
using that to call the respective "page_response" ops from
"vfio_iommu_ops" in the vfio/core driver.


Is this an incorrect way of passing the page response back to the host
kernel?


But I think this will have to be worked out with the /dev/iommu framework.




+   __le32  endpoint;
+#define VIRTIO_IOMMU_PAGE_RESP_PASID_VALID (1 << 0)


To be consistent with the rest of the document this would be
VIRTIO_IOMMU_PAGE_RESP_F_PASID_VALID


Sure, will update this.




+   __le32  flags;
+   __le32  pasid;
+   __le32  grpid;
+#define VIRTIO_IOMMU_PAGE_RESP_SUCCESS (0x0)
+#define VIRTIO_IOMMU_PAGE_RESP_INVALID (0x1)
+#define VIRTIO_IOMMU_PAGE_RESP_FAILURE (0x2)
+   __le16  resp_code;
+   __u8pasid_valid;


This field isn't needed since there already is
VIRTIO_IOMMU_PAGE_RESP_PASID_VALID


Yes, sure will remove this field.




+   __u8reserved[9];
+   struct virtio_iommu_req_tailtail;
+};


I'd align the size of the struct to 16 bytes, but I don't think that's
strictly necessary.


Will fix this. Thanks a lot for reviewing.
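
Folding the review comments together, the request might end up roughly as
below (sketch only; the reserved-array size is a guess to pad the
structure to a 16-byte multiple):

	struct virtio_iommu_req_page_resp {
		struct virtio_iommu_req_head	head;
		__le32				endpoint;
	#define VIRTIO_IOMMU_PAGE_RESP_F_PASID_VALID	(1 << 0)
		__le32				flags;
		__le32				pasid;
		__le32				grpid;
		__le16				resp_code;
		__u8				reserved[6];
		struct virtio_iommu_req_tail	tail;
	};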

Best regards
Vivek



Thanks,
Jean


+
  struct virtio_iommu_req_attach {
struct virtio_iommu_req_headhead;
__le32  domain;
--
2.17.1




Re: [PATCH RFC v1 02/11] iommu/virtio: Maintain a list of endpoints served by viommu_dev

2021-09-30 Thread Vivek Kumar Gautam




On 9/21/21 9:29 PM, Jean-Philippe Brucker wrote:

On Fri, Apr 23, 2021 at 03:21:38PM +0530, Vivek Gautam wrote:

Keeping a record of the endpoints that are served by the virtio-iommu
device would help in redirecting page fault requests to the
correct endpoint device for handling.

Signed-off-by: Vivek Gautam 
---
  drivers/iommu/virtio-iommu.c | 21 +
  1 file changed, 21 insertions(+)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 50039070e2aa..c970f386f031 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -48,6 +48,7 @@ struct viommu_dev {
spinlock_t  request_lock;
struct list_headrequests;
void*evts;
+   struct list_headendpoints;


As we're going to search by ID, an xarray or rb_tree would be more
appropriate than a list


Sure, I will update this with a rb_tree.



  
  	/* Device configuration */

struct iommu_domain_geometrygeometry;
@@ -115,6 +116,12 @@ struct viommu_endpoint {
void*pgtf;
  };
  
+struct viommu_ep_entry {

+   u32 eid;
+   struct viommu_endpoint  *vdev;
+   struct list_headlist;
+};


No need for a separate struct, I think you can just add the list head and
id into viommu_endpoint.


Yea right. I will update it.




+
  struct viommu_request {
struct list_headlist;
void*writeback;
@@ -573,6 +580,7 @@ static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev)
size_t probe_len;
struct virtio_iommu_req_probe *probe;
struct virtio_iommu_probe_property *prop;
+   struct viommu_ep_entry *ep;
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct viommu_endpoint *vdev = dev_iommu_priv_get(dev);
  
@@ -640,6 +648,18 @@ static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev)

prop = (void *)probe->properties + cur;
type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK;
}
+   if (ret)
+   goto out_free;
+
+   ep = kzalloc(sizeof(*ep), GFP_KERNEL);
+   if (!ep) {
+   ret = -ENOMEM;
+   goto out_free;
+   }
+   ep->eid = probe->endpoint;
+   ep->vdev = vdev;
+
+   list_add(&ep->list, &viommu->endpoints);


This should be in viommu_probe_device() (viommu_probe_endpoint() is only
called if F_PROBE is negotiated). I think we need a lock for this
list/xarray


Sure, will fix this, and add the needed locking around.

Thanks & regards
Vivek



Thanks,
Jean

  
 out_free:
	kfree(probe);
@@ -1649,6 +1669,7 @@ static int viommu_probe(struct virtio_device *vdev)
viommu->dev = dev;
viommu->vdev = vdev;
INIT_LIST_HEAD(&viommu->requests);
+   INIT_LIST_HEAD(&viommu->endpoints);
  
	ret = viommu_init_vqs(viommu);
	if (ret)
--
2.17.1
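
Taking the two suggestions together (registration in viommu_probe_device()
and a searchable structure with locking), a rough sketch using an xarray.
The helper name and placement are assumptions, not the actual respin:

	/* struct viommu_dev would carry "struct xarray endpoints;",
	 * initialised with xa_init() in viommu_probe().
	 */
	static int viommu_register_endpoint(struct viommu_dev *viommu,
					    struct viommu_endpoint *vdev,
					    u32 eid)
	{
		vdev->eid = eid;
		/* xa_insert() takes the xarray's internal lock, so no
		 * extra lock is needed for the insert itself.
		 */
		return xa_insert(&viommu->endpoints, eid, vdev, GFP_KERNEL);
	}

The fault path would then look the endpoint up with
xa_load(&viommu->endpoints, eid) under RCU.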




Re: [PATCH 0/2] iommu/ipmmu-vmsa: Add support for r8a779a0

2021-09-30 Thread Geert Uytterhoeven
Hi Shimoda-san,

On Wed, Sep 1, 2021 at 12:27 PM Yoshihiro Shimoda
 wrote:
> This patch series adds support for r8a779a0 (R-Car V3U).
>
> Yoshihiro Shimoda (2):
>   dt-bindings: iommu: renesas,ipmmu-vmsa: add r8a779a0 support
>   iommu/ipmmu-vmsa: Add support for r8a779a0

Thanks to rcar-4.1.0.rc16 of the R-Car BSP, I was pointed to the fact
that the IPMMU modules on R-Car V3U have module clocks and resets,
unlike on other R-Car SoCs.
Probably they should be handled, too?

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
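
For reference, handling those clocks and resets in ipmmu_probe() might look
roughly like the sketch below. The mmu->clk and mmu->rstc members and the
error handling are assumptions, not tested code (linux/clk.h and
linux/reset.h provide the APIs):

	mmu->clk = devm_clk_get_optional(&pdev->dev, NULL);
	if (IS_ERR(mmu->clk))
		return PTR_ERR(mmu->clk);

	mmu->rstc = devm_reset_control_get_optional_exclusive(&pdev->dev, NULL);
	if (IS_ERR(mmu->rstc))
		return PTR_ERR(mmu->rstc);

	ret = clk_prepare_enable(mmu->clk);
	if (ret)
		return ret;

	ret = reset_control_deassert(mmu->rstc);
	if (ret) {
		clk_disable_unprepare(mmu->clk);
		return ret;
	}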


Re: [PATCH RFC v1 01/11] uapi/virtio-iommu: Add page request grp-id and flags information

2021-09-30 Thread Jean-Philippe Brucker
On Thu, Sep 30, 2021 at 10:26:35AM +0530, Vivek Kumar Gautam wrote:
> > I'm not sure why we made it 32-bit in Linux UAPI, it's a little wasteful.
> > PCIe PRGI is 9-bits and SMMU STAG is 16-bits. Since the scope of the grpid
> > is the endpoint, 16-bit means 64k in-flight faults per endpoint, which
> > seems more than enough.
> 
> Right, I will update this to a 16-bit field. It won't be okay to update the
> iommu uAPI now, right?

Since there is no UAPI transport for the fault request/response at the
moment, it should be possible to update it.

Thanks,
Jean
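
To make the width concrete, a purely hypothetical fault record carrying a
16-bit group ID. This is not the actual virtio-iommu or Linux UAPI layout,
just an illustration of the sizing argument:

	#include <linux/types.h>

	struct page_req_sketch {
		__le32	endpoint;
		__le32	pasid;
		__le64	address;
		__le16	grpid;	/* 16 bits covers PCIe PRGI (9) and SMMU STAG (16) */
		__le16	flags;
		__le32	reserved;
	};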


RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-30 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, September 23, 2021 8:22 PM
> 
> > > These are different things and need different bits. Since the ARM path
> > > has a lot more code supporting it, I'd suggest Intel should change
> > > their code to use IOMMU_BLOCK_NO_SNOOP and abandon
> IOMMU_CACHE.
> >
> > I didn't fully get this point. The end result is the same, i.e. making the
> > DMA cache-coherent when IOMMU_CACHE is set. Or if you were to define the
> > behavior of IOMMU_CACHE, what would you define now?
> 
> It is clearly specifying how the kernel API works:
> 
>  !IOMMU_CACHE
>must call arch cache flushers
>  IOMMU_CACHE -
>do not call arch cache flushers
>  IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP -
>do not call arch cache flushers, and ignore the no-snoop bit.

Who will set IOMMU_BLOCK_NO_SNOOP? I feel this is arch-specific
knowledge about how cache coherency is implemented, i.e.
when IOMMU_CACHE is set the intel-iommu driver just maps it to
blocking no-snoop. Does it really need to be an attribute at
the same level as IOMMU_CACHE?

> 
> On Intel it should refuse to create a !IOMMU_CACHE mapping since the
> HW can't do that.

Agree. In reality I guess this is not hit because all devices are marked
coherent on Intel platforms...

Baolu, any insight here?

Thanks
Kevin
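
As a reading aid, the contract described above sketched as pseudo-kernel
code. IOMMU_CACHE is the existing flag from linux/iommu.h;
IOMMU_BLOCK_NO_SNOOP is only the flag proposed in this thread, with a
made-up value:

	#define IOMMU_BLOCK_NO_SNOOP	(1 << 7)	/* proposed, not upstream */

	static int check_map_prot(unsigned long prot, bool hw_always_coherent)
	{
		if (!(prot & IOMMU_CACHE)) {
			/* Caller owns coherency and must use the arch cache
			 * flushers around DMA; hardware that is always
			 * coherent (e.g. Intel) would refuse this.
			 */
			if (hw_always_coherent)
				return -EINVAL;
			return 0;
		}
		/* With IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP the IOMMU would also
		 * override the device's no-snoop TLP attribute.
		 */
		return 0;
	}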


Re: [PATCH 1/2] dma-mapping: remove bogus test for pfn_valid from dma_map_resource

2021-09-30 Thread David Hildenbrand

On 30.09.21 03:30, Mike Rapoport wrote:

From: Mike Rapoport 

dma_map_resource() uses pfn_valid() to ensure the range is not RAM.
However, pfn_valid() only checks for availability of the memory map for a
PFN but it does not ensure that the PFN is actually backed by RAM.

As dma_map_resource() is the only method in DMA mapping APIs that has this
check, simply drop the pfn_valid() test from dma_map_resource().

Link: https://lore.kernel.org/all/20210824173741.gc...@arm.com/
Signed-off-by: Mike Rapoport 
---


Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb
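
For context, the pattern being removed is roughly the following (a
paraphrase of the check, not the exact hunk):

	/* pfn_valid() only says a memmap entry exists for the PFN, not
	 * that the PFN is backed by RAM, so this rejection was bogus:
	 */
	if (pfn_valid(PHYS_PFN(phys_addr)))
		return DMA_MAPPING_ERROR;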



Re: [PATCH v8 03/12] iommu/mediatek: Add probe_defer for smi-larb

2021-09-30 Thread Yong Wu
On Wed, 2021-09-29 at 18:33 +0200, Dafna Hirschfeld wrote:
> 
> On 29.09.21 03:37, Yong Wu wrote:
> > Prepare for adding device_link.
> > 
> > The iommu consumer should use device_link to connect with the
> > smi-larb (supplier), so that the smi-larb runs before the iommu
> > consumer. Here we delay the iommu driver until the smi driver is
> > ready, so that all the iommu consumers always probe after the smi
> > driver.
> > 
> > Without this patch, if some consumer drivers run before
> > smi-larb, the supplier link_status is DL_DEV_NO_DRIVER(0) in
> > device_link_add, and then device_links_driver_bound will use WARN_ON
> > to complain that the link_status of the supplier is not right.
> > 
> > device_is_bound may be more elegant here, but it is not allowed to
> > be exported, per https://lore.kernel.org/patchwork/patch/1334670/.
> > 
> > Signed-off-by: Yong Wu 
> > Tested-by: Frank Wunderlich  # BPI-
> > R2/MT7623
> > ---
> >   drivers/iommu/mtk_iommu.c| 2 +-
> >   drivers/iommu/mtk_iommu_v1.c | 2 +-
> >   2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> > index d837adfd1da5..d5848f78a677 100644
> > --- a/drivers/iommu/mtk_iommu.c
> > +++ b/drivers/iommu/mtk_iommu.c
> > @@ -844,7 +844,7 @@ static int mtk_iommu_probe(struct
> > platform_device *pdev)
> > id = i;
> >   
> > plarbdev = of_find_device_by_node(larbnode);
> > -   if (!plarbdev) {
> > +   if (!plarbdev || !plarbdev->dev.driver) {
> > of_node_put(larbnode);
> > return -EPROBE_DEFER;
> 
> If plarbdev is NULL, doesn't that mean that the device does not exist?

This is the probe function. Is it possible the platform device is not
ready at this time?

I checked the platform device should be created at:

of_platform_default_populate_init:  arch_initcall_sync
  ->of_platform_populate
  ->of_platform_device_create_pdata

Not sure if this may be delayed for some device. If not, it should be
ENODEV here.

> so we should return -ENODEV in that case?
> 
> thanks,
> Dafna
> 
> > }
> > diff --git a/drivers/iommu/mtk_iommu_v1.c
> > b/drivers/iommu/mtk_iommu_v1.c
> > index 1467ba1e4417..4d7809432239 100644
> > --- a/drivers/iommu/mtk_iommu_v1.c
> > +++ b/drivers/iommu/mtk_iommu_v1.c
> > @@ -602,7 +602,7 @@ static int mtk_iommu_probe(struct
> > platform_device *pdev)
> > }
> >   
> > plarbdev = of_find_device_by_node(larbnode);
> > -   if (!plarbdev) {
> > +   if (!plarbdev || !plarbdev->dev.driver) {
> > of_node_put(larbnode);
> > return -EPROBE_DEFER;
> > }
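
If the populate timing can be relied on, the distinction discussed above
could be made explicit. A sketch, valid only under the assumption that
of_platform_populate() has already created the platform device:

	plarbdev = of_find_device_by_node(larbnode);
	if (!plarbdev) {
		of_node_put(larbnode);
		return -ENODEV;		/* no such platform device */
	}
	if (!plarbdev->dev.driver) {
		of_node_put(larbnode);
		return -EPROBE_DEFER;	/* exists, but smi driver not bound yet */
	}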
> > 


Re: [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA

2021-09-30 Thread Chaitanya Kulkarni via iommu
Logan,

> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 7efb31b87f37..916750a54f60 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3771,7 +3771,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
> unsigned nsid,
>  blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
> 
>  blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
> -   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
> +   if (ctrl->ops->supports_pci_p2pdma &&
> +   ctrl->ops->supports_pci_p2pdma(ctrl))
>  blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
> 
>  ns->ctrl = ctrl;
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9871c0c9374c..fb9bfc52a6d7 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -477,7 +477,6 @@ struct nvme_ctrl_ops {
>  unsigned int flags;
>   #define NVME_F_FABRICS (1 << 0)
>   #define NVME_F_METADATA_SUPPORTED  (1 << 1)
> -#define NVME_F_PCI_P2PDMA  (1 << 2)
>  int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
>  int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
>  int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
> @@ -485,6 +484,7 @@ struct nvme_ctrl_ops {
>  void (*submit_async_event)(struct nvme_ctrl *ctrl);
>  void (*delete_ctrl)(struct nvme_ctrl *ctrl);
>  int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
> +   bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
>   };
> 

Is this new op only needed for the PCIe transport? Or do you have
follow-up patches that use this op for the other transports?

If it is only needed for PCIe then we need to find a way to
avoid adding it somehow...
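
For reference, a sketch of how the PCIe transport might implement the new
callback, assuming a dma-layer helper along the lines this series
introduces (the helper name is not verified here). Other transports would
leave the op NULL, which the core check above treats as "not supported":

	static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
	{
		struct nvme_dev *dev = to_nvme_dev(ctrl);

		return dma_pci_p2pdma_supported(dev->dev);
	}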



Re: [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2021-09-30 Thread Chaitanya Kulkarni via iommu
Logan,

> +/*
> + * bit 2 is the third free bit in the page_link on 64bit systems which
> + * is used by dma_unmap_sg() to determine if the dma_address is a PCI
> + * bus address when doing P2PDMA.
> + * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
> + */
> +
> +#ifdef CONFIG_PCI_P2PDMA
> +#define SG_DMA_PCI_P2PDMA  0x04UL
> +#else
> +#define SG_DMA_PCI_P2PDMA  0x00UL
> +#endif
> +
> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_PCI_P2PDMA)
> +

You are doing two things in one patch:
1. Introducing a new macro to replace the current macros.
2. Adding a new member to those macros.

Shouldn't this be split into two patches, where you introduce a
macro SG_PAGE_LINK_MASK (SG_CHAIN | SG_END) in a prep patch and
then update SG_PAGE_LINK_MASK with SG_DMA_PCI_P2PDMA and the related
code?

Or is there a reason why it is not split?

-ck
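
For orientation, the helpers this flag enables would look roughly like the
following sketch, consistent with the quoted mask but not the full patch:

	/* Mark/test a DMA-mapped SG entry as carrying a PCI bus address,
	 * so dma_unmap_sg() can treat P2PDMA segments specially.
	 */
	static inline void sg_dma_mark_pci_p2pdma(struct scatterlist *sg)
	{
		sg->page_link |= SG_DMA_PCI_P2PDMA;
	}

	static inline bool sg_dma_is_pci_p2pdma(struct scatterlist *sg)
	{
		return sg->page_link & SG_DMA_PCI_P2PDMA;
	}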


Re: [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2021-09-30 Thread Chaitanya Kulkarni via iommu
> +/**
> + * sg_unmark_pci_p2pdma - Unmark the scatterlist entry for PCI p2pdma
> + * @sg: SG entry
> + *
> + * Description:
> + *   Clears the PCI P2PDMA mark
> + **/
nit:- Probably want to add '.' above.
> +static inline void sg_dma_unmark_pci_p2pdma(struct scatterlist *sg)
> +{
> +   sg->page_link &= ~SG_DMA_PCI_P2PDMA;
> +}
> +
>   /**
>* sg_phys - Return physical address of an sg entry
>* @sg: SG entry
> --
> 2.30.2
> 

Either way, with or without the split, this looks good.

Reviewed-by: Chaitanya Kulkarni 
