Re: [PATCH V5 12/12] net: netvsc: Add Isolation VM support for netvsc driver

2021-09-29 Thread Christoph Hellwig
On Tue, Sep 28, 2021 at 05:23:31PM +0800, Tianyu Lan wrote:
>>
>>   - the bare memremap usage in swiotlb looks strange and I'd
>> definitively expect a well documented wrapper.
>
> OK. Should the wrapper in the DMA code? How about dma_map_decrypted() 
> introduced in the V4?

A mentioned then the name is a pretty bad choice as it touches the dma_map*
namespace that it is not related to.  I suspect just a little helper
in the swiotlb code that explains how it is used might be enogh for now.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/2] dma-mapping: remove bogus test for pfn_valid from dma_map_resource

2021-09-29 Thread Christoph Hellwig
On Thu, Sep 30, 2021 at 04:30:38AM +0300, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> dma_map_resource() uses pfn_valid() to ensure the range is not RAM.
> However, pfn_valid() only checks for availability of the memory map for a
> PFN but it does not ensure that the PFN is actually backed by RAM.
> 
> As dma_map_resource() is the only method in DMA mapping APIs that has this
> check, simply drop the pfn_valid() test from dma_map_resource().
> 
> Link: https://lore.kernel.org/all/20210824173741.gc...@arm.com/
> Signed-off-by: Mike Rapoport 

We'll lose a bit of a sanity check with this, but given the problems it
caused:

Reviewed-by: Christoph Hellwig 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH kernel] powerpc/iommu: Report the correct most efficient DMA mask for PCI devices

2021-09-29 Thread Christoph Hellwig
On Thu, Sep 30, 2021 at 01:44:54PM +1000, Alexey Kardashevskiy wrote:
> and returns the IOMMU table mask on the pseries platform which makes some
> drivers (mpt3sas is one example) choose 32bit DMA even though bypass is
> supported. The powernv platform sort of handles it by having a bigger
> default window with a mask >=40 but it only works as drivers choose
> 63/64bit if the required mask is >32 which is rather pointless.
> 
> This reintroduces the bypass capability check to let drivers make
> a better choice of the DMA mask.
> 
> Fixes: f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode")
> Signed-off-by: Alexey Kardashevskiy 

Looks good:

Reviewed-by: Christoph Hellwig 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH RFC v1 01/11] uapi/virtio-iommu: Add page request grp-id and flags information

2021-09-29 Thread Vivek Kumar Gautam

Hi Jean,


On 9/21/21 9:28 PM, Jean-Philippe Brucker wrote:

Hi Vivek,

Thanks a lot for your work on this


Thanks a lot for taking a look at it. I hope that most of the changes in 
this series and the nested page table series [1] are independent of the 
presently on-going /dev/iommu proposal, and can be separately reviewed.


Please find my comments inline below.



On Fri, Apr 23, 2021 at 03:21:37PM +0530, Vivek Gautam wrote:

Add fault information for group-id and necessary flags for page
request faults that can be handled by page fault handler in
virtio-iommu driver.

Signed-off-by: Vivek Gautam 
Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Jean-Philippe Brucker 
Cc: Eric Auger 
Cc: Alex Williamson 
Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Liu Yi L 
Cc: Lorenzo Pieralisi 
Cc: Shameerali Kolothum Thodi 
---
  include/uapi/linux/virtio_iommu.h | 13 +
  1 file changed, 13 insertions(+)

diff --git a/include/uapi/linux/virtio_iommu.h 
b/include/uapi/linux/virtio_iommu.h
index f8bf927a0689..accc3318ce46 100644
--- a/include/uapi/linux/virtio_iommu.h
+++ b/include/uapi/linux/virtio_iommu.h
@@ -307,14 +307,27 @@ struct virtio_iommu_req_invalidate {
  #define VIRTIO_IOMMU_FAULT_F_DMA_UNRECOV  1
  #define VIRTIO_IOMMU_FAULT_F_PAGE_REQ 2
  
+#define VIRTIO_IOMMU_FAULT_PRQ_F_PASID_VALID		(1 << 0)

+#define VIRTIO_IOMMU_FAULT_PRQ_F_LAST_PAGE (1 << 1)
+#define VIRTIO_IOMMU_FAULT_PRQ_F_PRIV_DATA (1 << 2)
+#define VIRTIO_IOMMU_FAULT_PRQ_F_NEEDS_PASID   (1 << 3)


I don't think this one is necessary here. The NEEDS_PASID flags added by
commit 970471914c67 ("iommu: Allow page responses without PASID") mainly
helps Linux keep track of things internally. It does tell the fault
handler whether to reply with PASID or not, but we don't need that here.
The virtio-iommu driver knows whether a PASID is required by looking at
the "PRG Response PASID Required" bit in the PCIe capability. For non-PCIe
faults (e.g. SMMU stall), I'm guessing we'll need a PROBE property to
declare that the endpoint supports recoverable faults anyway, so "PASID
required in response" can go through there as well.


Sure, I will remove this flag, and rather read the PCIe cap to find out 
if PASID is required or not. After this series, I will follow up with 
the non-PCIe fault handling.





+
+#define VIRTIO_IOMMU_FAULT_UNREC_F_PASID_VALID (1 << 0)
+#define VIRTIO_IOMMU_FAULT_UNREC_F_ADDR_VALID  (1 << 1)
+#define VIRTIO_IOMMU_FAULT_UNREC_F_FETCH_ADDR_VALID(1 << 2)
+
  struct virtio_iommu_fault {
__u8reason;
__u8reserved[3];
__le16  flt_type;
__u8reserved2[2];
+   /* flags is actually permission flags */


It's also used for declaring validity of fields.
VIRTIO_IOMMU_FAULT_F_ADDRESS already tells whether the address field is
valid, so all the other flags introduced by this patch can go in here.


Sure, will remove pr_evt_flags field, and move all the flags to 'flags'.




__le32  flags;
+   /* flags for PASID and Page request handling info */
+   __le32  pr_evt_flags;
__le32  endpoint;
__le32  pasid;
+   __le32  grpid;


I'm not sure why we made it 32-bit in Linux UAPI, it's a little wasteful.
PCIe PRGI is 9-bits and SMMU STAG is 16-bits. Since the scope of the grpid
is the endpoint, 16-bit means 64k in-flight faults per endpoint, which
seems more than enough.


Right, I will update this to 16-bits field. It won't be okay to update 
the iommu uAPI now, right?




New fields must be appended at the end of the struct, because old drivers
will expect to find the 'endpoint' field at this offset. You could remove
'reserved3' while adding 'grpid', to keep the struct layout.


Sure, will update this.




__u8reserved3[4];
__le64  address;
__u8reserved4[8];



So the base structure, currently in the spec, looks like this:

struct virtio_iommu_fault {
u8   reason;
u8   reserved[3];
le32 flags;
le32 endpoint;
le32 reserved1;
le64 address;
};

#define VIRTIO_IOMMU_FAULT_F_READ   (1 << 0)
#define VIRTIO_IOMMU_FAULT_F_WRITE  (1 << 1)
#define VIRTIO_IOMMU_FAULT_F_ADDRESS(1 << 8)

The extended struct could be:

struct virtio_iommu_fault {
u8   reason;
u8   reserved[3];
le32 flags;
le32 endpoint;
le32 pasid;
le64 address;
/* Page 

Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

2021-09-29 Thread David Gibson
On Wed, Sep 29, 2021 at 07:31:08AM +, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Wednesday, September 29, 2021 2:35 PM
> > 
> > On Wed, Sep 29, 2021 at 05:38:56AM +, Tian, Kevin wrote:
> > > > From: David Gibson 
> > > > Sent: Wednesday, September 29, 2021 12:56 PM
> > > >
> > > > >
> > > > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > > > as the contract to get security context established (and vice versa
> > > > > for unbinding). One additional requirement in iommufd is to manage
> > the
> > > > > switch between multiple security contexts due to decoupled
> > bind/attach:
> > > > >
> > > > > 1)  Open a device in "/dev/vfio/devices" with user access blocked;
> > > >
> > > > Probably worth clarifying that (1) must happen for *all* devices in
> > > > the group before (2) happens for any device in the group.
> > >
> > > No. User access is naturally blocked for other devices as long as they
> > > are not opened yet.
> > 
> > Uh... my point is that everything in the group has to be removed from
> > regular kernel drivers before we reach step (2).  Is the plan that you
> > must do that before you can even open them?  That's a reasonable
> > choice, but then I think you should show that step in this description
> > as well.
> 
> Agree. I think below proposal can meet above requirement and ensure
> it's not broken in the whole process when the group is operated by the
> userspace:
> 
> https://lore.kernel.org/kvm/20210928140712.gl964...@nvidia.com/
> 
> and definitely an updated description will be provided when sending out
> the new proposal.
> 
> > 
> > > > > 2)  Bind the device to an iommufd with an initial security context
> > > > > (an empty iommu domain which blocks dma) established for its
> > > > > group, with user access unblocked;
> > > > >
> > > > > 3)  Attach the device to a user-specified ioasid (shared by all 
> > > > > devices
> > > > > attached to this ioasid). Before attaching, the device should be 
> > > > > first
> > > > > detached from the initial context;
> > > >
> > > > So, this step can implicitly but observably change the behaviour for
> > > > other devices in the group as well.  I don't love that kind of
> > > > difficult to predict side effect, which is why I'm *still* not totally
> > > > convinced by the device-centric model.
> > >
> > > which side-effect is predicted here? The user anyway needs to be
> > > aware of such group restriction regardless whether it uses group
> > > or nongroup interface.
> > 
> > Yes, exactly.  And with a group interface it's obvious it has to
> > understand it.  With the non-group interface, you can get to this
> > stage in ignorance of groups.  It will even work as long as you are
> > lucky enough only to try with singleton-group devices.  Then you try
> > it with two devices in the one group and doing (3) on device A will
> > implicitly change the DMA environment of device B.
> 
> for non-group we can also document it obviously in uAPI that the user
> must understand group restriction and violating it will get failure
> when attaching to different IOAS's for devices in the same group.

Documenting limitations is always inferior to building them into the
actual API signatures.  Sometimes its the only option, but people
frequently don't read the docs, whereas they kind of have to look at
the API itself.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

2021-09-29 Thread David Gibson
On Wed, Sep 29, 2021 at 09:57:16AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 04:35:19PM +1000, David Gibson wrote:
> 
> > Yes, exactly.  And with a group interface it's obvious it has to
> > understand it.  With the non-group interface, you can get to this
> > stage in ignorance of groups.  It will even work as long as you are
> > lucky enough only to try with singleton-group devices.  Then you try
> > it with two devices in the one group and doing (3) on device A will
> > implicitly change the DMA environment of device B.
> 
> The security model here says this is fine.

I'm not making a statement about the security model, I'm making a
statement about surprisingness of the programming interface.  In your
program you have devices A & B, you perform an operation that
specifies only device A and device B changes behaviour.

> This idea to put the iommu code in charge of security is quite clean,
> as I said in the other mail drivers attached to 'struct devices *'
> tell the iommu layer what they are are doing:
> 
>iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
>iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
>iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)
> 
> And it decides if it is allowed.
> 
> If device A is allowed to go to userspace then security wise it is
> deemed fine that B is impacted. That is what we have defined already
> today.
> 
> This proposal does not free userpace from having to understand this!
> The iommu_group sysfs is still there and still must be understood.
> 
> The *admin* the one responsible to understand the groups, not the
> applications. The admin has no idea what a group FD is - they should
> be looking at the sysfs and seeing the iommu_group directories.

Not just the admin.  If an app is given two devices in the same group
to use *both* it must understand that and act accordingly.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

2021-09-29 Thread David Gibson
On Wed, Sep 29, 2021 at 09:24:57AM -0300, Jason Gunthorpe wrote:
65;6402;1c> On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:
> 
> > > +struct iommufd_device {
> > > + unsigned int id;
> > > + struct iommufd_ctx *ictx;
> > > + struct device *dev; /* always be the physical device */
> > > + u64 dev_cookie;
> > 
> > Why do you need both an 'id' and a 'dev_cookie'?  Since they're both
> > unique, couldn't you just use the cookie directly as the index into
> > the xarray?
> 
> ID is the kernel value in the xarray - xarray is much more efficient &
> safe with small kernel controlled values.
> 
> dev_cookie is a user assigned value that may not be unique. It's
> purpose is to allow userspace to receive and event and go back to its
> structure. Most likely userspace will store a pointer here, but it is
> also possible userspace could not use it.
> 
> It is a pretty normal pattern

Hm, ok.  Could you point me at an example?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD

2021-09-29 Thread David Gibson
On Wed, Sep 29, 2021 at 06:41:00AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Wednesday, September 29, 2021 2:01 PM
> > 
> > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the
> > vfio
> > > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is
> > provided
> > > because it's implicitly done when the device fd is closed.
> > >
> > > In concept a vfio device can be bound to multiple iommufds, each hosting
> > > a subset of I/O address spaces attached by this device.
> > 
> > I really feel like this many<->many mapping between devices is going
> > to be super-confusing, and therefore make it really hard to be
> > confident we have all the rules right for proper isolation.
> 
> Based on new discussion on group ownership part (patch06), I feel this
> many<->many relationship will disappear. The context fd (either container
> or iommufd) will uniquely mark the ownership on a physical device and
> its group. With this design it's impractical to have one device bound
> to multiple iommufds. Actually I don't think this is a compelling usage
> in reality. The previous rationale was that no need to impose such restriction
> if no special reason... and now we have a reason. 
> 
> Jason, are you OK with this simplification?
> 
> > 
> > That's why I was suggesting a concept like endpoints, to break this
> > into two many<->one relationships.  I'm ok if that isn't visible in
> > the user API, but I think this is going to be really hard to keep
> > track of if it isn't explicit somewhere in the internals.
> > 
> 
> I think this endpoint concept is represented by ioas_device_info in
> patch14:
> 
> +/*
> + * An ioas_device_info object is created per each successful attaching
> + * request. A list of objects are maintained per ioas when the address
> + * space is shared by multiple devices.
> + */
> +struct ioas_device_info {
> + struct iommufd_device *idev;
> + struct list_head next;
>  };
> 
> currently it's 1:1 mapping before this object and iommufd_device, 
> because no pasid support yet.

Ok, I haven't read that far in the series yet.

> We can rename it to struct ioas_endpoint if it makes you feel
> better.

Meh.  The concept is much more important than the name.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH kernel] powerpc/iommu: Report the correct most efficient DMA mask for PCI devices

2021-09-29 Thread Alexey Kardashevskiy
According to dma-api.rst, the dma_get_required_mask() helper should return
"the mask that the platform requires to operate efficiently". Which in
the case of PPC64 means the bypass mask and not a mask from an IOMMU table
which is shorter and slower to use due to map/unmap operations (especially
expensive on "pseries").

However the existing implementation ignores the possibility of bypassing
and returns the IOMMU table mask on the pseries platform which makes some
drivers (mpt3sas is one example) choose 32bit DMA even though bypass is
supported. The powernv platform sort of handles it by having a bigger
default window with a mask >=40 but it only works as drivers choose
63/64bit if the required mask is >32 which is rather pointless.

This reintroduces the bypass capability check to let drivers make
a better choice of the DMA mask.

Fixes: f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode")
Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/dma-iommu.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index 111249fd619d..d646077bcbcf 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -184,6 +184,14 @@ u64 dma_iommu_get_required_mask(struct device *dev)
struct iommu_table *tbl = get_iommu_table_base(dev);
u64 mask;
 
+   if (dev_is_pci(dev)) {
+   u64 bypass_mask = dma_direct_get_required_mask(dev);
+
+   if (dma_iommu_dma_supported(dev, bypass_mask)) {
+   dev_info(dev, "%s: returning bypass mask 0x%llx\n", 
__func__, bypass_mask);
+   return bypass_mask;
+   }
+   }
if (!tbl)
return 0;
 
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v8 09/12] media: mtk-vcodec: Get rid of mtk_smi_larb_get/put

2021-09-29 Thread Yong Wu
Hi Dafna,

Thanks very much for the review.

On Wed, 2021-09-29 at 14:13 +0200, Dafna Hirschfeld wrote:
> 
> On 29.09.21 03:37, Yong Wu wrote:
> > MediaTek IOMMU has already added the device_link between the
> > consumer
> > and smi-larb device. If the vcodec device call the
> > pm_runtime_get_sync,
> > the smi-larb's pm_runtime_get_sync also be called automatically.
> > 
> > CC: Tiffany Lin 
> > CC: Irui Wang 
> > Signed-off-by: Yong Wu 
> > Reviewed-by: Evan Green 
> > Acked-by: Tiffany Lin 
> > Reviewed-by: Dafna Hirschfeld 
> > ---
> >   .../platform/mtk-vcodec/mtk_vcodec_dec_pm.c   | 37 +++---
> > --
> >   .../platform/mtk-vcodec/mtk_vcodec_drv.h  |  3 --
> >   .../platform/mtk-vcodec/mtk_vcodec_enc.c  |  1 -
> >   .../platform/mtk-vcodec/mtk_vcodec_enc_pm.c   | 44 +++---
> > -
> >   4 files changed, 10 insertions(+), 75 deletions(-)
> > 
> > diff --git a/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
> > b/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
> > index 6038db96f71c..d0bf9aa3b29d 100644
> > --- a/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
> > +++ b/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
> > @@ -8,14 +8,12 @@
> >   #include 
> >   #include 
> >   #include 
> > -#include 
> >   
> >   #include "mtk_vcodec_dec_pm.h"
> >   #include "mtk_vcodec_util.h"
> >   
> >   int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev)
> >   {
> > -   struct device_node *node;
> > struct platform_device *pdev;
> > struct mtk_vcodec_pm *pm;
> > struct mtk_vcodec_clk *dec_clk;
> > @@ -26,18 +24,7 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev
> > *mtkdev)
> > pm = >pm;
> > pm->mtkdev = mtkdev;
> > dec_clk = >vdec_clk;
> > -   node = of_parse_phandle(pdev->dev.of_node, "mediatek,larb", 0);
> > -   if (!node) {
> > -   mtk_v4l2_err("of_parse_phandle mediatek,larb fail!");
> > -   return -1;
> > -   }
> >   
> > -   pdev = of_find_device_by_node(node);
> > -   of_node_put(node);
> > -   if (WARN_ON(!pdev)) {
> > -   return -1;
> > -   }
> > -   pm->larbvdec = >dev;
> > pdev = mtkdev->plat_dev;
> > pm->dev = >dev;
> >   
> > @@ -47,14 +34,11 @@ int mtk_vcodec_init_dec_pm(struct
> > mtk_vcodec_dev *mtkdev)
> > dec_clk->clk_info = devm_kcalloc(>dev,
> > dec_clk->clk_num, sizeof(*clk_info),
> > GFP_KERNEL);
> > -   if (!dec_clk->clk_info) {
> > -   ret = -ENOMEM;
> > -   goto put_device;
> > -   }
> > +   if (!dec_clk->clk_info)
> > +   return -ENOMEM;
> > } else {
> > mtk_v4l2_err("Failed to get vdec clock count");
> > -   ret = -EINVAL;
> > -   goto put_device;
> > +   return -EINVAL;
> > }
> >   
> > for (i = 0; i < dec_clk->clk_num; i++) {
> > @@ -63,29 +47,24 @@ int mtk_vcodec_init_dec_pm(struct
> > mtk_vcodec_dev *mtkdev)
> > "clock-names", i, _info->clk_name);
> > if (ret) {
> > mtk_v4l2_err("Failed to get clock name id =
> > %d", i);
> > -   goto put_device;
> > +   return ret;
> > }
> > clk_info->vcodec_clk = devm_clk_get(>dev,
> > clk_info->clk_name);
> > if (IS_ERR(clk_info->vcodec_clk)) {
> > mtk_v4l2_err("devm_clk_get (%d)%s fail", i,
> > clk_info->clk_name);
> > -   ret = PTR_ERR(clk_info->vcodec_clk);
> > -   goto put_device;
> > +   return PTR_ERR(clk_info->vcodec_clk);
> > }
> > }
> >   
> > pm_runtime_enable(>dev);
> > return 0;
> > -put_device:
> > -   put_device(pm->larbvdec);
> > -   return ret;
> >   }
> >   
> >   void mtk_vcodec_release_dec_pm(struct mtk_vcodec_dev *dev)
> >   {
> > pm_runtime_disable(dev->pm.dev);
> > -   put_device(dev->pm.larbvdec);
> >   }
> 
> Now that functions only do  'pm_runtime_disable(dev->pm.dev);' so it
> will be more
> readable to remove the function mtk_vcodec_release_dec_pm
> and replace with pm_runtime_disable(dev->pm.dev);
> Same for the 'enc' equivalent.

Make sense. But It may be not proper if using pm_runtime_disable
as the symmetry with mtk_vcodec_init_dec_pm in the mtk_vcodec_probe.

Maybe we should move pm_runtime_enable out from mtk_vcodec_init_dec_pm
into mtk_vcodec_probe. I could do a new patch for this. Is this ok for
you?

> 
> Thanks,
> Dafna

[snip]
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 03/20] vfio: Add vfio_[un]register_device()

2021-09-29 Thread da...@gibson.dropbear.id.au
On Wed, Sep 29, 2021 at 09:22:30AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 12:46:14PM +1000, da...@gibson.dropbear.id.au wrote:
> > On Tue, Sep 21, 2021 at 10:00:14PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 22, 2021 at 12:54:02AM +, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Wednesday, September 22, 2021 12:01 AM
> > > > > 
> > > > > >  One open about how to organize the device nodes under
> > > > > /dev/vfio/devices/.
> > > > > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > > > > devname
> > > > > > from all kinds of devices. The prerequisite of this model is that 
> > > > > > devnames
> > > > > > from different bus types are unique formats:
> > > > > 
> > > > > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > > > > 
> > > > > The userspace can learn the correct major/minor by inspecting the
> > > > > sysfs.
> > > > > 
> > > > > This whole concept should disappear into the prior patch that adds the
> > > > > struct device in the first place, and I think most of the code here
> > > > > can be deleted once the struct device is used properly.
> > > > > 
> > > > 
> > > > Can you help elaborate above flow? This is one area where we need
> > > > more guidance.
> > > > 
> > > > When Qemu accepts an option "-device vfio-pci,host=:BB:DD.F",
> > > > how does Qemu identify which vifo0/1/... is associated with the 
> > > > specified 
> > > > :BB:DD.F? 
> > > 
> > > When done properly in the kernel the file:
> > > 
> > > /sys/bus/pci/devices/:BB:DD.F/vfio/vfioX/dev
> > > 
> > > Will contain the major:minor of the VFIO device.
> > > 
> > > Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> > > that the major:minor matches.
> > > 
> > > in the above pattern "pci" and ":BB:DD.FF" are the arguments passed
> > > to qemu.
> > 
> > I thought part of the appeal of the device centric model was less
> > grovelling around in sysfs for information.  Using type/address
> > directly in /dev seems simpler than having to dig around matching
> > things here.
> 
> I would say more regular grovelling. Starting from a sysfs device
> directory and querying the VFIO cdev associated with it is much more
> normal than what happens today, which also includes passing sysfs
> information into an ioctl :\

Hm.. ok.  Clearly I'm unfamiliar with the things that do that.  Other
than current VFIO, the only model I've really seen is where you just
point your program at a device node.

> > Note that this doesn't have to be done in kernel: you could have the
> > kernel just call them /dev/vfio/devices/vfio0, ... but add udev rules
> > that create symlinks from say /dev/vfio/pci/:BB:SS.F - >
> > ../devices/vfioXX based on the sysfs information.
> 
> This is the right approach if people want to do this, but I'm not sure
> it is worth it given backwards compat requires the sysfs path as
> input.

You mean for userspace that needs to be able to go back to the old
VFIO interface as well?  It seems silly to force this sysfs mucking
about on new programs that depend on the new interface.

> We may as well stick with sysfs as the command line interface
> for userspace tools.

> And I certainly don't want to see userspace tools trying to reverse a
> sysfs path into a /dev/ symlink name when they can directly and
> reliably learn the correct cdev from the sysfspath.

Um.. sure.. but they can get the correct cdev from the sysfspath no
matter how we name the cdevs.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices

2021-09-29 Thread David Gibson
On Wed, Sep 29, 2021 at 01:05:21PM -0600, Alex Williamson wrote:
> On Wed, 29 Sep 2021 12:08:59 +1000
> David Gibson  wrote:
> 
> > On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> > > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> > > userspace to directly open a vfio device w/o relying on container/group
> > > (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> > > iommufd (more specifically in iommu core by this RFC) in a device-centric
> > > manner.
> > > 
> > > In case a device is exposed in both legacy and new interfaces (see next
> > > patch for how to decide it), this patch also ensures that when the device
> > > is already opened via one interface then the other one must be blocked.
> > > 
> > > Signed-off-by: Liu Yi L   
> > [snip]
> > 
> > > +static bool vfio_device_in_container(struct vfio_device *device)
> > > +{
> > > + return !!(device->group && device->group->container);  
> > 
> > You don't need !! here.  && is already a logical operation, so returns
> > a valid bool.
> > 
> > > +}
> > > +
> > >  static int vfio_device_fops_release(struct inode *inode, struct file 
> > > *filep)
> > >  {
> > >   struct vfio_device *device = filep->private_data;
> > > @@ -1560,7 +1691,16 @@ static int vfio_device_fops_release(struct inode 
> > > *inode, struct file *filep)
> > >  
> > >   module_put(device->dev->driver->owner);
> > >  
> > > - vfio_group_try_dissolve_container(device->group);
> > > + if (vfio_device_in_container(device)) {
> > > + vfio_group_try_dissolve_container(device->group);
> > > + } else {
> > > + atomic_dec(>opened);
> > > + if (device->group) {
> > > + mutex_lock(>group->opened_lock);
> > > + device->group->opened--;
> > > + mutex_unlock(>group->opened_lock);
> > > + }
> > > + }
> > >  
> > >   vfio_device_put(device);
> > >  
> > > @@ -1613,6 +1753,7 @@ static int vfio_device_fops_mmap(struct file 
> > > *filep, struct vm_area_struct *vma)
> > >  
> > >  static const struct file_operations vfio_device_fops = {
> > >   .owner  = THIS_MODULE,
> > > + .open   = vfio_device_fops_open,
> > >   .release= vfio_device_fops_release,
> > >   .read   = vfio_device_fops_read,
> > >   .write  = vfio_device_fops_write,
> > > @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
> > >   .mode = S_IRUGO | S_IWUGO,
> > >  };
> > >  
> > > +static char *vfio_device_devnode(struct device *dev, umode_t *mode)
> > > +{
> > > + return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));  
> > 
> > Others have pointed out some problems with the use of dev_name()
> > here.  I'll add that I think you'll make things much easier if instead
> > of using one huge "devices" subdir, you use a separate subdir for each
> > vfio sub-driver (so, one for PCI, one for each type of mdev, one for
> > platform, etc.).  That should make avoiding name conflicts a lot simpler.
> 
> It seems like this is unnecessary if we use the vfioX naming approach.
> Conflicts are trivial to ignore if we don't involve dev_name() and
> looking for the correct major:minor chardev in the correct subdirectory
> seems like a hassle for userspace.  Thanks,

Right.. it does sound like a hassle, but AFAICT that's *more*
necessary with /dev/vfio/vfioXX than with /dev/vfio/pci/:BB:SS.F,
since you have to look up a meaningful name in sysfs to find the right
devnode.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH 2/2] arm64/mm: drop HAVE_ARCH_PFN_VALID

2021-09-29 Thread Mike Rapoport
From: Anshuman Khandual 

CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64
platforms and free_unused_memmap() would just return without creating any
holes in the memmap mapping.  There is no need for any special handling in
pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped.  This also moves
the pfn upper bits sanity check into generic pfn_valid().

[rppt: rebased on v5.15-rc3]

Link: 
https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khand...@arm.com
Signed-off-by: Anshuman Khandual 
Acked-by: David Hildenbrand 
Acked-by: Mike Rapoport 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: David Hildenbrand 
Cc: Mike Rapoport 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Mike Rapoport 
---
 arch/arm64/Kconfig|  1 -
 arch/arm64/include/asm/page.h |  1 -
 arch/arm64/mm/init.c  | 37 ---
 3 files changed, 39 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5c7ae4c3954b..53dd23f305be 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -154,7 +154,6 @@ config ARM64
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
-   select HAVE_ARCH_PFN_VALID
select HAVE_ARCH_PREL32_RELOCATIONS
select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
select HAVE_ARCH_SECCOMP_FILTER
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index f98c91bbd7c1..993a27ea6f54 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -41,7 +41,6 @@ void tag_clear_highpage(struct page *to);
 
 typedef struct page *pgtable_t;
 
-int pfn_valid(unsigned long pfn);
 int pfn_is_map_memory(unsigned long pfn);
 
 #include 
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 37a81754d9b6..e60c6eb813b7 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -160,43 +160,6 @@ static void __init zone_sizes_init(unsigned long min, 
unsigned long max)
free_area_init(max_zone_pfns);
 }
 
-int pfn_valid(unsigned long pfn)
-{
-   phys_addr_t addr = PFN_PHYS(pfn);
-   struct mem_section *ms;
-
-   /*
-* Ensure the upper PAGE_SHIFT bits are clear in the
-* pfn. Else it might lead to false positives when
-* some of the upper bits are set, but the lower bits
-* match a valid pfn.
-*/
-   if (PHYS_PFN(addr) != pfn)
-   return 0;
-
-   if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
-   return 0;
-
-   ms = __pfn_to_section(pfn);
-   if (!valid_section(ms))
-   return 0;
-
-   /*
-* ZONE_DEVICE memory does not have the memblock entries.
-* memblock_is_map_memory() check for ZONE_DEVICE based
-* addresses will always fail. Even the normal hotplugged
-* memory will never have MEMBLOCK_NOMAP flag set in their
-* memblock entries. Skip memblock search for all non early
-* memory sections covering all of hotplug memory including
-* both normal and ZONE_DEVICE based.
-*/
-   if (!early_section(ms))
-   return pfn_section_valid(ms, pfn);
-
-   return memblock_is_memory(addr);
-}
-EXPORT_SYMBOL(pfn_valid);
-
 int pfn_is_map_memory(unsigned long pfn)
 {
phys_addr_t addr = PFN_PHYS(pfn);
-- 
2.28.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 1/2] dma-mapping: remove bogus test for pfn_valid from dma_map_resource

2021-09-29 Thread Mike Rapoport
From: Mike Rapoport 

dma_map_resource() uses pfn_valid() to ensure the range is not RAM.
However, pfn_valid() only checks for availability of the memory map for a
PFN but it does not ensure that the PFN is actually backed by RAM.

As dma_map_resource() is the only method in DMA mapping APIs that has this
check, simply drop the pfn_valid() test from dma_map_resource().

Link: https://lore.kernel.org/all/20210824173741.gc...@arm.com/
Signed-off-by: Mike Rapoport 
---
 kernel/dma/mapping.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 06fec5547e7c..dda8d8b84a55 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -296,10 +296,6 @@ dma_addr_t dma_map_resource(struct device *dev, 
phys_addr_t phys_addr,
if (WARN_ON_ONCE(!dev->dma_mask))
return DMA_MAPPING_ERROR;
 
-   /* Don't allow RAM to be mapped */
-   if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr
-   return DMA_MAPPING_ERROR;
-
if (dma_map_direct(dev, ops))
addr = dma_direct_map_resource(dev, phys_addr, size, dir, 
attrs);
else if (ops->map_resource)
-- 
2.28.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 0/2] arm64: retry dropping HAVE_ARCH_PFN_VALID

2021-09-29 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

This is a new attempt to drop HAVE_ARCH_PFN_VALID on arm64 and start using
the generic implementation of pfn_valid().

The first patch removes the check for pfn_valid() in dma_map_resource()
which is required to avoid false positives when there is memory map for
MMIO.

Anshuman Khandual (1):
  arm64/mm: drop HAVE_ARCH_PFN_VALID

Mike Rapoport (1):
  dma-mapping: remove bogus test for pfn_valid from dma_map_resource

 arch/arm64/Kconfig|  1 -
 arch/arm64/include/asm/page.h |  1 -
 arch/arm64/mm/init.c  | 37 ---
 kernel/dma/mapping.c  |  4 
 4 files changed, 43 deletions(-)


base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29
-- 
2.28.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Fenghua Yu
Hi, Baolu,

On Thu, Sep 23, 2021 at 01:43:32PM +0800, Lu Baolu wrote:
> On 9/21/21 3:23 AM, Fenghua Yu wrote:
> > +void pasid_put(struct task_struct *tsk, struct mm_struct *mm)
> > +{
> > +   if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> > +   return;
> > +
> > +   /*
> > +* Nothing to do if this task doesn't have a reference to the PASID.
> > +*/
> > +   if (tsk->has_valid_pasid) {
> > +   mutex_lock(_mutex);
> > +   /*
> > +* The PASID's reference was taken during fix up. Release it
> > +* now. If the reference count is 0, the PASID is freed.
> > +*/
> > +   iommu_sva_free_pasid(mm);
> > +   mutex_unlock(_mutex);
> > +   }
> > +}
> > 
> 
> It looks odd that both __fixup_pasid_exception() and pasid_put() are
> defined in the vendor IOMMU driver, but get called in the arch/x86
> code.
> 
> Is it feasible to move these two helpers to the files where they are
> called? The IA32_PASID MSR fixup and release are not part of the IOMMU
> implementation.

Sure. The functions will be removed in the next version. And new functions
will not be called in IOMMU driver.

Thanks.

-Fenghua
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions

2021-09-29 Thread Fenghua Yu
Hi, Baolu,

On Wed, Sep 29, 2021 at 03:34:51PM +0800, Lu Baolu wrote:
> On 2021/9/21 3:23, Fenghua Yu wrote:
> > update_pasid() and its call chain are currently unused in the tree because
> > Thomas disabled the ENQCMD feature. The feature will be re-enabled shortly
> > using a different approach and update_pasid() and its call chain will not
> > be used in the new approach.
> > 
> > Remove the useless functions.
> > 
> > Signed-off-by: Fenghua Yu 
> > Reviewed-by: Tony Luck 
> 
> Thanks for this cleanup. I have queued it for v5.16.

Thank you for pushing this patch to 5.16!

-Fenghua
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-09-29 Thread Jason Gunthorpe
On Wed, Sep 29, 2021 at 05:49:36PM -0600, Logan Gunthorpe wrote:

> Some of this seems out of date. Pretty sure the pages are not refcounted
> with vmf_insert_mixed() and vmf_insert_mixed() is currently the only way
> to use VM_MIXEDMAP mappings.

Hum.

vmf_insert_mixed() boils down to insert_pfn() which always sets the
special bit, so vm_normal_page() returns NULL and thus the pages are
not freed during zap.

So, if the pages are always special and not refcounted all the docs
seem really out of date - or rather they describe the situation
without the special bit, I think.

Why would DAX want to do this in the first place?? This means the
address space zap is much more important that just speeding up
destruction, it is essential for correctness since the PTEs are not
holding refcounts naturally...

Sigh.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices

2021-09-29 Thread Logan Gunthorpe



On 2021-09-29 5:36 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 05:28:38PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2021-09-29 5:21 p.m., Jason Gunthorpe wrote:
>>> On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote:


 On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
>> Hi,
>>
>> This patchset continues my work to add userspace P2PDMA access using
>> O_DIRECT NVMe devices. My last posting[1] just included the first 13
>> patches in this series, but the early P2PDMA cleanup and map_sg error
>> changes from that series have been merged into v5.15-rc1. To address
>> concerns that that series did not add any new functionality, I've added
>> back the userspcae functionality from the original RFC[2] (but improved
>> based on the original feedback).
>
> I really think this is the best series yet, it really looks nice
> overall. I know the sg flag was a bit of a debate at the start, but it
> serves an undeniable purpose and the resulting standard DMA APIs 'just
> working' is really clean.

 Actually, so far, nobody has said anything negative about using the SG 
 flag.

> There is more possible here, we could also pass the new GUP flag in the
> ib_umem code..

 Yes, that would be very useful.
>>>
>>> You might actually prefer to do that then the bio changes to get the
>>> infrastructur merged as it seems less "core"
>>
>> I'm a little bit more concerned about my patch set growing too large.
>> It's already at 20 patches and I think I'll need to add a couple more
>> based on the feedback you've already provided. So I'm leaning toward
>> pushing more functionality as future work.
> 
> I mean you could postpone the three block related patches and use a
> single ib_umem patch instead as the consumer.

I think that's not a very compelling use case given the only provider of
these VMAs is an NVMe block device. My patch set enables a real world
use (copying data between NVMe devices P2P through the CMB with O_DIRECT).

Being able to read or write a CMB with RDMA and only RDMA is not very
compelling.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-09-29 Thread Logan Gunthorpe



On 2021-09-29 5:35 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 05:27:22PM -0600, Logan Gunthorpe wrote:
> 
>>> finish_fault() should set the pte_devmap - eg by passing the
>>> PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or
>>> otherwise signaling do_set_pte() that it should set those PTE bits
>>> when it creates the entry.
>>>
>>> (or there should be a vmf_* helper for this special case, but using
>>> the vmf->page seems righter to me)
>>
>> I'm not opposed to this. Though I'm not sure what's best here.
>>
 If we don't set pte_devmap(), then every single page that GUP
 processes needs to check if it's a ZONE_DEVICE page and also if it's
 a P2PDMA page (thus dereferencing pgmap) in order to satisfy the
 requirements of FOLL_PCI_P2PDMA.
>>>
>>> Definately not suggesting not to set pte_devmap(), only that
>>> VM_MIXEDMAP should not be set on VMAs that only contain struct
>>> pages. That is an abuse of what it is intended for.
>>>
>>> At the very least there should be a big comment above the usage
>>> explaining that this is just working around a limitation in
>>> finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today.
>>
>> Is it? Documentation on vmf_insert_mixed() and VM_MIXEDMAP is not good
>> and the intention is not clear. I got the impression that mm people
>> wanted those interfaces used for users of pte_devmap().
> 
> I thought VM_MIXEDMAP was quite clear:
> 
> #define VM_MIXEDMAP   0x1000  /* Can contain "struct page" and pure 
> PFN pages */
> 
> This VMA does not include PFN pages, so it should not be tagged
> VM_MIXEDMAP.
> 
> Aside from enabling the special vmf_ API, it only controls some
> special behavior in vm_normal_page:
> 
>  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>  * page" backing, however the difference is that _all_ pages with a struct
>  * page (that is, those where pfn_valid is true) are refcounted and considered
>  * normal pages by the VM. The disadvantage is that pages are refcounted
>  * (which can be slower and simply not an option for some PFNMAP users). The
>  * advantage is that we don't have to follow the strict linearity rule of
>  * PFNMAP mappings in order to support COWable mappings.
> 
> Which again does not describe this case.

Some of this seems out of date. Pretty sure the pages are not refcounted
with vmf_insert_mixed() and vmf_insert_mixed() is currently the only way
to use VM_MIXEDMAP mappings.

>> device-dax uses these interfaces and as far as I can see it also only
>> contains struct pages (or at least  dev_dax_huge_fault() calls
>> pfn_to_page() on every page when VM_FAULT_NOPAGE happens).
> 
> hacky hacky :)
> 
> I think DAX probably did it that way for the same reason you are
> doing it that way - no other choice without changing something

Sure but if you look at other vmf_insert_mixed() (of which there are
few) you see similar patterns. Seems more like it was documented with
one thing in mind but then used in a completely different manner. Which
is why I suggested the documentation was not so good.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 03:57:20PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 29 Sep 2021 16:39:53 -0300, Jason Gunthorpe  wrote:
> 
> > On Wed, Sep 29, 2021 at 12:37:19PM -0700, Jacob Pan wrote:
> >  
> > > For #2, it seems we can store the kernel PASID in struct device. This
> > > will preserve the DMA API interface while making it PASID capable.
> > > Essentially, each PASID capable device would have two special global
> > > PASIDs: 
> > >   - PASID 0 for DMA request w/o PASID, aka RID2PASID
> > >   - PASID 1 (randomly selected) for in-kernel DMA request w/
> > > PASID  
> > 
> > This seems reasonable, I had the same thought. Basically just have the
> > driver issue some trivial call:
> >   pci_enable_pasid_dma(pdev, )
> That would work, but I guess it needs to be an iommu_ call instead of pci_?

Which ever makes sense..  The API should take in a struct pci_device
and return a PCI PASID - at least as a wrapper around a more generic
immu api.

> I think your suggestion is more precise, in case the driver does not want
> to do DMA w/ PASID, we can do less IOTLB flush (PASID 0 only).

Since it is odd, and it may create overhead, I would do it only when
asked to do it

> > Having multiple RID's pointing at the same IO page table is something
> > we expect iommufd to require so the whole thing should ideally fall
> > out naturally.

> That would be the equivalent of attaching multiple devices to the same
> IOMMU domain. right?

Effectively..

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 05:00:43PM -0600, Logan Gunthorpe wrote:
> 
> 
> 
> On 2021-09-29 4:46 p.m., Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 03:30:42PM -0600, Logan Gunthorpe wrote:
> >> On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote:
> >> No, that's not a correct reading of the code. Every time there is a new
> >> pagemap, this code calculates the mapping type and bus offset. If a page
> >> comes along with a different page map,f it recalculates. This just
> >> reduces the overhead so that the calculation is done only every time a
> >> page with a different pgmap comes along and not doing it for every
> >> single page.
> > 
> > Each 'struct scatterlist *sg' refers to a range of contiguous pfns
> > starting at page_to_pfn(sg_page()) and going for approx sg->length/PAGE_SIZE
> > pfns long.
> > 
> 
> Ugh, right. A bit contrived for consecutive pages to have different
> pgmaps and still be next to each other in a DMA transaction. But I guess
> it is technically possible and should be protected against.

I worry it is something a hostile userspace could cookup using mmap
and cause some kind of kernel integrity problem with.

> > @@ -470,7 +470,8 @@ int sg_alloc_append_table_from_pages(struct 
> > sg_append_table *sgt_append,
> >  
> > /* Merge contiguous pages into the last SG */
> > prv_len = sgt_append->prv->length;
> > -   while (n_pages && page_to_pfn(pages[0]) == paddr) {
> > +   while (n_pages && page_to_pfn(pages[0]) == paddr &&
> > +  sg_page(sgt_append->prv)->pgmap == pages[0]->pgmap) {
> 
> I don't think it's correct to use pgmap without first checking if it is
> a zone device page. But your point is taken. I'll try to address this.

Yes

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices

2021-09-29 Thread Jason Gunthorpe
On Wed, Sep 29, 2021 at 05:28:38PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2021-09-29 5:21 p.m., Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
> >>> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
>  Hi,
> 
>  This patchset continues my work to add userspace P2PDMA access using
>  O_DIRECT NVMe devices. My last posting[1] just included the first 13
>  patches in this series, but the early P2PDMA cleanup and map_sg error
>  changes from that series have been merged into v5.15-rc1. To address
>  concerns that that series did not add any new functionality, I've added
>  back the userspcae functionality from the original RFC[2] (but improved
>  based on the original feedback).
> >>>
> >>> I really think this is the best series yet, it really looks nice
> >>> overall. I know the sg flag was a bit of a debate at the start, but it
> >>> serves an undeniable purpose and the resulting standard DMA APIs 'just
> >>> working' is really clean.
> >>
> >> Actually, so far, nobody has said anything negative about using the SG 
> >> flag.
> >>
> >>> There is more possible here, we could also pass the new GUP flag in the
> >>> ib_umem code..
> >>
> >> Yes, that would be very useful.
> > 
> > You might actually prefer to do that then the bio changes to get the
> > infrastructur merged as it seems less "core"
> 
> I'm a little bit more concerned about my patch set growing too large.
> It's already at 20 patches and I think I'll need to add a couple more
> based on the feedback you've already provided. So I'm leaning toward
> pushing more functionality as future work.

I mean you could postpone the three block related patches and use a
single ib_umem patch instead as the consumer.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-09-29 Thread Jason Gunthorpe
On Wed, Sep 29, 2021 at 05:27:22PM -0600, Logan Gunthorpe wrote:

> > finish_fault() should set the pte_devmap - eg by passing the
> > PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or
> > otherwise signaling do_set_pte() that it should set those PTE bits
> > when it creates the entry.
> > 
> > (or there should be a vmf_* helper for this special case, but using
> > the vmf->page seems righter to me)
> 
> I'm not opposed to this. Though I'm not sure what's best here.
> 
> >> If we don't set pte_devmap(), then every single page that GUP
> >> processes needs to check if it's a ZONE_DEVICE page and also if it's
> >> a P2PDMA page (thus dereferencing pgmap) in order to satisfy the
> >> requirements of FOLL_PCI_P2PDMA.
> > 
> > Definately not suggesting not to set pte_devmap(), only that
> > VM_MIXEDMAP should not be set on VMAs that only contain struct
> > pages. That is an abuse of what it is intended for.
> > 
> > At the very least there should be a big comment above the usage
> > explaining that this is just working around a limitation in
> > finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today.
> 
> Is it? Documentation on vmf_insert_mixed() and VM_MIXEDMAP is not good
> and the intention is not clear. I got the impression that mm people
> wanted those interfaces used for users of pte_devmap().

I thought VM_MIXEDMAP was quite clear:

#define VM_MIXEDMAP 0x1000  /* Can contain "struct page" and pure 
PFN pages */

This VMA does not include PFN pages, so it should not be tagged
VM_MIXEDMAP.

Aside from enabling the special vmf_ API, it only controls some
special behavior in vm_normal_page:

 * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
 * page" backing, however the difference is that _all_ pages with a struct
 * page (that is, those where pfn_valid is true) are refcounted and considered
 * normal pages by the VM. The disadvantage is that pages are refcounted
 * (which can be slower and simply not an option for some PFNMAP users). The
 * advantage is that we don't have to follow the strict linearity rule of
 * PFNMAP mappings in order to support COWable mappings.

Which again does not describe this case.

> device-dax uses these interfaces and as far as I can see it also only
> contains struct pages (or at least  dev_dax_huge_fault() calls
> pfn_to_page() on every page when VM_FAULT_NOPAGE happens).

hacky hacky :)

I think DAX probably did it that way for the same reason you are
doing it that way - no other choice without changing something

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices

2021-09-29 Thread Logan Gunthorpe



On 2021-09-29 5:21 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
>>> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
 Hi,

 This patchset continues my work to add userspace P2PDMA access using
 O_DIRECT NVMe devices. My last posting[1] just included the first 13
 patches in this series, but the early P2PDMA cleanup and map_sg error
 changes from that series have been merged into v5.15-rc1. To address
 concerns that that series did not add any new functionality, I've added
 back the userspcae functionality from the original RFC[2] (but improved
 based on the original feedback).
>>>
>>> I really think this is the best series yet, it really looks nice
>>> overall. I know the sg flag was a bit of a debate at the start, but it
>>> serves an undeniable purpose and the resulting standard DMA APIs 'just
>>> working' is really clean.
>>
>> Actually, so far, nobody has said anything negative about using the SG flag.
>>
>>> There is more possible here, we could also pass the new GUP flag in the
>>> ib_umem code..
>>
>> Yes, that would be very useful.
> 
> You might actually prefer to do that then the bio changes to get the
> infrastructur merged as it seems less "core"

I'm a little bit more concerned about my patch set growing too large.
It's already at 20 patches and I think I'll need to add a couple more
based on the feedback you've already provided. So I'm leaning toward
pushing more functionality as future work.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-09-29 Thread Logan Gunthorpe




On 2021-09-29 5:05 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 03:42:00PM -0600, Logan Gunthorpe wrote:
> 
>> The main reason is probably this: if we don't use VM_MIXEDMAP, then we
>> can't set pte_devmap(). 
> 
> I think that is an API limitation in the fault routines..
> 
> finish_fault() should set the pte_devmap - eg by passing the
> PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or
> otherwise signaling do_set_pte() that it should set those PTE bits
> when it creates the entry.
> 
> (or there should be a vmf_* helper for this special case, but using
> the vmf->page seems righter to me)

I'm not opposed to this. Though I'm not sure what's best here.

>> If we don't set pte_devmap(), then every single page that GUP
>> processes needs to check if it's a ZONE_DEVICE page and also if it's
>> a P2PDMA page (thus dereferencing pgmap) in order to satisfy the
>> requirements of FOLL_PCI_P2PDMA.
> 
> Definately not suggesting not to set pte_devmap(), only that
> VM_MIXEDMAP should not be set on VMAs that only contain struct
> pages. That is an abuse of what it is intended for.
> 
> At the very least there should be a big comment above the usage
> explaining that this is just working around a limitation in
> finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today.

Is it? Documentation on vmf_insert_mixed() and VM_MIXEDMAP is not good
and the intention is not clear. I got the impression that mm people
wanted those interfaces used for users of pte_devmap().

device-dax uses these interfaces and as far as I can see it also only
contains struct pages (or at least  dev_dax_huge_fault() calls
pfn_to_page() on every page when VM_FAULT_NOPAGE happens).

So it would be nice to get some direction here from mm developers on
what they'd prefer.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices

2021-09-29 Thread Jason Gunthorpe
On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
> > On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
> >> Hi,
> >>
> >> This patchset continues my work to add userspace P2PDMA access using
> >> O_DIRECT NVMe devices. My last posting[1] just included the first 13
> >> patches in this series, but the early P2PDMA cleanup and map_sg error
> >> changes from that series have been merged into v5.15-rc1. To address
> >> concerns that that series did not add any new functionality, I've added
> >> back the userspcae functionality from the original RFC[2] (but improved
> >> based on the original feedback).
> > 
> > I really think this is the best series yet, it really looks nice
> > overall. I know the sg flag was a bit of a debate at the start, but it
> > serves an undeniable purpose and the resulting standard DMA APIs 'just
> > working' is really clean.
> 
> Actually, so far, nobody has said anything negative about using the SG flag.
> 
> > There is more possible here, we could also pass the new GUP flag in the
> > ib_umem code..
> 
> Yes, that would be very useful.

You might actually prefer to do that then the bio changes to get the
infrastructur merged as it seems less "core"

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-09-29 Thread Jason Gunthorpe
On Wed, Sep 29, 2021 at 03:42:00PM -0600, Logan Gunthorpe wrote:

> The main reason is probably this: if we don't use VM_MIXEDMAP, then we
> can't set pte_devmap(). 

I think that is an API limitation in the fault routines..

finish_fault() should set the pte_devmap - eg by passing the
PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or
otherwise signaling do_set_pte() that it should set those PTE bits
when it creates the entry.

(or there should be a vmf_* helper for this special case, but using
the vmf->page seems righter to me)

> If we don't set pte_devmap(), then every single page that GUP
> processes needs to check if it's a ZONE_DEVICE page and also if it's
> a P2PDMA page (thus dereferencing pgmap) in order to satisfy the
> requirements of FOLL_PCI_P2PDMA.

Definately not suggesting not to set pte_devmap(), only that
VM_MIXEDMAP should not be set on VMAs that only contain struct
pages. That is an abuse of what it is intended for.

At the very least there should be a big comment above the usage
explaining that this is just working around a limitation in
finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today.

Jason

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations

2021-09-29 Thread Logan Gunthorpe




On 2021-09-29 4:46 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 03:30:42PM -0600, Logan Gunthorpe wrote:
>> On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote:
>> No, that's not a correct reading of the code. Every time there is a new
>> pagemap, this code calculates the mapping type and bus offset. If a page
>> comes along with a different page map,f it recalculates. This just
>> reduces the overhead so that the calculation is done only every time a
>> page with a different pgmap comes along and not doing it for every
>> single page.
> 
> Each 'struct scatterlist *sg' refers to a range of contiguous pfns
> starting at page_to_pfn(sg_page()) and going for approx sg->length/PAGE_SIZE
> pfns long.
> 

Ugh, right. A bit contrived for consecutive pages to have different
pgmaps and still be next to each other in a DMA transaction. But I guess
it is technically possible and should be protected against.

> sg_page() returns the first page, but nothing says that sg_page()+1
> has the same pgmap.
> 
> The code in this patch does check the first page of each sg in a
> larger sgl.
> 
>>> At least sg_alloc_append_table_from_pages() and probably something in
>>> the block world should be updated to not combine struct pages with
>>> different pgmaps, and this should be documented in scatterlist.*
>>> someplace.
>>
>> There's no sane place to do this check. The code is designed to support
>> mappings with different pgmaps.
> 
> All places that generate compound sg's by aggregating multiple pages
> need to include this check along side the check for physical
> contiguity. There are not that many places but
> sg_alloc_append_table_from_pages() is one of them:

Yes. The block layer also does this. I believe a check in
page_is_mergable() will be sufficient there.

> @@ -470,7 +470,8 @@ int sg_alloc_append_table_from_pages(struct 
> sg_append_table *sgt_append,
>  
> /* Merge contiguous pages into the last SG */
> prv_len = sgt_append->prv->length;
> -   while (n_pages && page_to_pfn(pages[0]) == paddr) {
> +   while (n_pages && page_to_pfn(pages[0]) == paddr &&
> +  sg_page(sgt_append->prv)->pgmap == pages[0]->pgmap) {

I don't think it's correct to use pgmap without first checking if it is
a zone device page. But your point is taken. I'll try to address this.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 12/20] RDMA/rw: use dma_map_sgtable()

2021-09-29 Thread Logan Gunthorpe




On 2021-09-28 1:43 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:52PM -0600, Logan Gunthorpe wrote:
>> dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
>> is no longer necessary and may be dropped.
>>
>> Switch to the dma_map_sgtable() interface which will allow for better
>> error reporting if the P2PDMA pages are unsupported.
>>
>> The change to sgtable also appears to fix a couple subtle error path
>> bugs:
>>
>>   - In rdma_rw_ctx_init(), dma_unmap would be called with an sg
>> that could have been incremented from the original call, as
>> well as an nents that was not the original number of nents
>> called when mapped.
>>   - Similarly in rdma_rw_ctx_signature_init, both sg and prot_sg
>> were unmapped with the incorrect number of nents.
> 
> Those bugs should definately get fixed.. I might extract the sgtable
> conversion into a stand alone patch to do it.

Yes. I can try to split it off myself and send a patch later this week.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-09-29 Thread Jacob Pan
Hi Jason,

On Wed, 29 Sep 2021 16:39:53 -0300, Jason Gunthorpe  wrote:

> On Wed, Sep 29, 2021 at 12:37:19PM -0700, Jacob Pan wrote:
>  
> > For #2, it seems we can store the kernel PASID in struct device. This
> > will preserve the DMA API interface while making it PASID capable.
> > Essentially, each PASID capable device would have two special global
> > PASIDs: 
> > - PASID 0 for DMA request w/o PASID, aka RID2PASID
> > - PASID 1 (randomly selected) for in-kernel DMA request w/
> > PASID  
> 
> This seems reasonable, I had the same thought. Basically just have the
> driver issue some trivial call:
>   pci_enable_pasid_dma(pdev, )
That would work, but I guess it needs to be an iommu_ call instead of pci_?

Or, it can be done by the platform IOMMU code where system PASID is
automatically enabled for PASID capable devices during boot and stored in
struct device. Device drivers can retrieve the PASID from struct device.

I think your suggestion is more precise, in case the driver does not want
to do DMA w/ PASID, we can do less IOTLB flush (PASID 0 only).

> And then DMA tagged with the PASID will be handled equivilant to
> untagged DMA. Basically PASID and no PASID point to the exact same IO
> page table and the DMA API manipulates that single page table.
> 
> Having multiple RID's pointing at the same IO page table is something
> we expect iommufd to require so the whole thing should ideally fall
> out naturally.
That would be the equivalent of attaching multiple devices to the same
IOMMU domain. right?

> Jason


Thanks,

Jacob
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

2021-09-29 Thread Jason Gunthorpe
On Wed, Sep 29, 2021 at 03:34:22PM -0600, Logan Gunthorpe wrote:
> 
> 
> 
> On 2021-09-28 1:47 p.m., Jason Gunthorpe wrote:
> > On Thu, Sep 16, 2021 at 05:40:54PM -0600, Logan Gunthorpe wrote:
> >> Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
> >> allow obtaining P2PDMA pages. If a caller does not set this flag
> >> and tries to map P2PDMA pages it will fail.
> >>
> >> This is implemented by adding a flag and a check to get_dev_pagemap().
> > 
> > I would like to see the get_dev_pagemap() deleted from GUP in the
> > first place.
> > 
> > Why isn't this just a simple check of the page->pgmap type after
> > acquiring a valid page reference? See my prior note
> 
> It could be, but that will mean dereferencing the pgmap for every page
> to determine the type of page and then comparing with FOLL_PCI_P2PDMA.

It would be done under the pte devmap test and this is less expensive
than the xarray search.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 03:30:42PM -0600, Logan Gunthorpe wrote:
> 
> 
> 
> On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote:
> > On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote:
> > 
> >> +enum pci_p2pdma_map_type
> >> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device 
> >> *dev,
> >> + struct scatterlist *sg)
> >> +{
> >> +  if (state->pgmap != sg_page(sg)->pgmap) {
> >> +  state->pgmap = sg_page(sg)->pgmap;
> > 
> > This has built into it an assumption that every page in the sg element
> > has the same pgmap, but AFAIK nothing enforces this rule? There is no
> > requirement that the HW has pfn gaps between the pgmaps linux decides
> > to create over it.
> 
> No, that's not a correct reading of the code. Every time there is a new
> pagemap, this code calculates the mapping type and bus offset. If a page
> comes along with a different page map,f it recalculates. This just
> reduces the overhead so that the calculation is done only every time a
> page with a different pgmap comes along and not doing it for every
> single page.

Each 'struct scatterlist *sg' refers to a range of contiguous pfns
starting at page_to_pfn(sg_page()) and going for approx sg->length/PAGE_SIZE
pfns long.

sg_page() returns the first page, but nothing says that sg_page()+1
has the same pgmap.

The code in this patch does check the first page of each sg in a
larger sgl.

> > At least sg_alloc_append_table_from_pages() and probably something in
> > the block world should be updated to not combine struct pages with
> > different pgmaps, and this should be documented in scatterlist.*
> > someplace.
> 
> There's no sane place to do this check. The code is designed to support
> mappings with different pgmaps.

All places that generate compound sg's by aggregating multiple pages
need to include this check along side the check for physical
contiguity. There are not that many places but
sg_alloc_append_table_from_pages() is one of them:

@@ -470,7 +470,8 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
 
/* Merge contiguous pages into the last SG */
prv_len = sgt_append->prv->length;
-   while (n_pages && page_to_pfn(pages[0]) == paddr) {
+   while (n_pages && page_to_pfn(pages[0]) == paddr &&
+  sg_page(sgt_append->prv)->pgmap == pages[0]->pgmap) {
if (sgt_append->prv->length + PAGE_SIZE > max_segment)
break;
sgt_append->prv->length += PAGE_SIZE;
@@ -488,7 +489,8 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
for (i = 1; i < n_pages; i++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
-   page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) {
+   page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1 ||
+   pages[i]->pgmap != pages[i - 1]->pgmap) {
chunks++;
seg_len = 0;
}
@@ -505,9 +507,10 @@ int sg_alloc_append_table_from_pages(struct 
sg_append_table *sgt_append,
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
page_to_pfn(pages[j]) !=
-   page_to_pfn(pages[j - 1]) + 1)
+   page_to_pfn(pages[j - 1]) + 1 ||
+   pages[i]->pgmap != pages[i - 1]->pgmap) {
break;
-   }
+   }
 
/* Pass how many chunks might be left */
s = get_next_sg(sgt_append, s, chunks - i + left_pages,


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD

2021-09-29 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, September 29, 2021 8:28 PM
> 
> On Wed, Sep 29, 2021 at 06:41:00AM +, Tian, Kevin wrote:
> > > From: David Gibson 
> > > Sent: Wednesday, September 29, 2021 2:01 PM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > > > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind
> the
> > > vfio
> > > > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface
> is
> > > provided
> > > > because it's implicitly done when the device fd is closed.
> > > >
> > > > In concept a vfio device can be bound to multiple iommufds, each
> hosting
> > > > a subset of I/O address spaces attached by this device.
> > >
> > > I really feel like this many<->many mapping between devices is going
> > > to be super-confusing, and therefore make it really hard to be
> > > confident we have all the rules right for proper isolation.
> >
> > Based on new discussion on group ownership part (patch06), I feel this
> > many<->many relationship will disappear. The context fd (either container
> > or iommufd) will uniquely mark the ownership on a physical device and
> > its group. With this design it's impractical to have one device bound
> > to multiple iommufds.
> 
> That should be a requirement! We have no way to prove that two
> iommufds are the same security domain, so devices/groups cannot be
> shared.
> 
> That is why the API I suggested takes in a struct file to ID the user
> security context. A group is accessible only from that single struct
> file and no more.
> 
> If the first series goes the way I outlined then I think David's
> concern about security is strongly solved as the IOMMU layer is
> directly managing it with a very clear responsiblity and semantic.
> 

Yes, this is also my understanding now.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices

2021-09-29 Thread Logan Gunthorpe



On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
>> Hi,
>>
>> This patchset continues my work to add userspace P2PDMA access using
>> O_DIRECT NVMe devices. My last posting[1] just included the first 13
>> patches in this series, but the early P2PDMA cleanup and map_sg error
>> changes from that series have been merged into v5.15-rc1. To address
>> concerns that that series did not add any new functionality, I've added
>> back the userspcae functionality from the original RFC[2] (but improved
>> based on the original feedback).
> 
> I really think this is the best series yet, it really looks nice
> overall. I know the sg flag was a bit of a debate at the start, but it
> serves an undeniable purpose and the resulting standard DMA APIs 'just
> working' is really clean.

Actually, so far, nobody has said anything negative about using the SG flag.

> There is more possible here, we could also pass the new GUP flag in the
> ib_umem code..

Yes, that would be very useful.

> After this gets merged I would make a series to split out the CMD
> genalloc related stuff and try and probably get something like VFIO to
> export this kind of memory as well, then it would have pretty nice
> coverage.

Yup!

Thanks for the review. Anything I didn't respond to I've either made
changes for, or am still working on and will be addressed for subsequent
postings.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-09-29 Thread Logan Gunthorpe




On 2021-09-28 2:05 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:59PM -0600, Logan Gunthorpe wrote:
> 
>> +static void pci_p2pdma_unmap_mappings(void *data)
>> +{
>> +struct pci_dev *pdev = data;
>> +struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
>> +
>> +p2pdma->active = false;
>> +synchronize_rcu();
>> +unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
>> +pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
>> +}
> 
> If this is going to rely on unmap_mapping_range then GUP should also
> reject this memory for FOLL_LONGTERM..

Right, makes sense.

> 
> What along this control flow:
> 
>> +   error = devm_add_action_or_reset(>dev, 
>> pci_p2pdma_unmap_mappings,
>> +pdev);
> 
> Waits for all the page refcounts to go to zero?

That's already in the existing code as part of memunmap_pages() which
puts the original reference to all the pages and then waits for the
reference to go to zero.

This new action unmaps all the VMAs so that the subsequent call to
memunmap_pages() doesn't block on userspace processes.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-09-29 Thread Logan Gunthorpe



On 2021-09-28 1:55 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:59PM -0600, Logan Gunthorpe wrote:
>> +int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma)
>> +{
>> +struct pci_p2pdma_map *pmap;
>> +struct pci_p2pdma *p2pdma;
>> +int ret;
>> +
>> +/* prevent private mappings from being established */
>> +if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
>> +pci_info_ratelimited(pdev,
>> + "%s: fail, attempted private mapping\n",
>> + current->comm);
>> +return -EINVAL;
>> +}
>> +
>> +pmap = pci_p2pdma_map_alloc(pdev, vma->vm_end - vma->vm_start);
>> +if (!pmap)
>> +return -ENOMEM;
>> +
>> +rcu_read_lock();
>> +p2pdma = rcu_dereference(pdev->p2pdma);
>> +if (!p2pdma) {
>> +ret = -ENODEV;
>> +goto out;
>> +}
>> +
>> +ret = simple_pin_fs(_p2pdma_fs_type, _p2pdma_fs_mnt,
>> +_p2pdma_fs_cnt);
>> +if (ret)
>> +goto out;
>> +
>> +ihold(p2pdma->inode);
>> +pmap->inode = p2pdma->inode;
>> +rcu_read_unlock();
>> +
>> +vma->vm_flags |= VM_MIXEDMAP;
> 
> Why is this a VM_MIXEDMAP? Everything fault sticks in here has a
> struct page, right?

Yes. This decision is not so simple, I tried a few variations before
settling on this.

The main reason is probably this: if we don't use VM_MIXEDMAP, then we
can't set pte_devmap(). If we don't set pte_devmap(), then every single
page that GUP processes needs to check if it's a ZONE_DEVICE page and
also if it's a P2PDMA page (thus dereferencing pgmap) in order to
satisfy the requirements of FOLL_PCI_P2PDMA.

I didn't think other developers would go for that kind of performance
hit. With VM_MIXEDMAP we hide the performance penalty behind the
existing check. And with the current pgmap code as is, we only need to
do that check once for every new pgmap pointer.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

2021-09-29 Thread Logan Gunthorpe




On 2021-09-28 1:47 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:54PM -0600, Logan Gunthorpe wrote:
>> Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
>> allow obtaining P2PDMA pages. If a caller does not set this flag
>> and tries to map P2PDMA pages it will fail.
>>
>> This is implemented by adding a flag and a check to get_dev_pagemap().
> 
> I would like to see the get_dev_pagemap() deleted from GUP in the
> first place.
> 
> Why isn't this just a simple check of the page->pgmap type after
> acquiring a valid page reference? See my prior note

It could be, but that will mean dereferencing the pgmap for every page
to determine the type of page and then comparing with FOLL_PCI_P2PDMA.

Probably not terrible to go this way.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations

2021-09-29 Thread Logan Gunthorpe




On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote:
> 
>> +enum pci_p2pdma_map_type
>> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device 
>> *dev,
>> +   struct scatterlist *sg)
>> +{
>> +if (state->pgmap != sg_page(sg)->pgmap) {
>> +state->pgmap = sg_page(sg)->pgmap;
> 
> This has built into it an assumption that every page in the sg element
> has the same pgmap, but AFAIK nothing enforces this rule? There is no
> requirement that the HW has pfn gaps between the pgmaps linux decides
> to create over it.

No, that's not a correct reading of the code. Every time there is a new
pagemap, this code calculates the mapping type and bus offset. If a page
comes along with a different page map,f it recalculates. This just
reduces the overhead so that the calculation is done only every time a
page with a different pgmap comes along and not doing it for every
single page.

> At least sg_alloc_append_table_from_pages() and probably something in
> the block world should be updated to not combine struct pages with
> different pgmaps, and this should be documented in scatterlist.*
> someplace.

There's no sane place to do this check. The code is designed to support
mappings with different pgmaps.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations

2021-09-29 Thread Logan Gunthorpe



On 2021-09-28 12:55 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote:
>> Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
>> implementations. It takes an scatterlist segment that must point to a
>> pci_p2pdma struct page and will map it if the mapping requires a bus
>> address.
>>
>> The return value indicates whether the mapping required a bus address
>> or whether the caller still needs to map the segment normally. If the
>> segment should not be mapped, -EREMOTEIO is returned.
>>
>> This helper uses a state structure to track the changes to the
>> pgmap across calls and avoid needing to lookup into the xarray for
>> every page.
>>
>> Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
>> dma_map_sg() implementations where the sg segment containing the page
>> differs from the sg segment containing the DMA address.
>>
>> Signed-off-by: Logan Gunthorpe 
>>  drivers/pci/p2pdma.c   | 59 ++
>>  include/linux/pci-p2pdma.h | 21 ++
>>  2 files changed, 80 insertions(+)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index b656d8c801a7..58c34f1f1473 100644
>> +++ b/drivers/pci/p2pdma.c
>> @@ -943,6 +943,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, 
>> struct scatterlist *sg,
>>  }
>>  EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>>  
>> +/**
>> + * pci_p2pdma_map_segment - map an sg segment determining the mapping type
>> + * @state: State structure that should be declared outside of the 
>> for_each_sg()
>> + *  loop and initialized to zero.
>> + * @dev: DMA device that's doing the mapping operation
>> + * @sg: scatterlist segment to map
>> + *
>> + * This is a helper to be used by non-iommu dma_map_sg() implementations 
>> where
>> + * the sg segment is the same for the page_link and the dma_address.
>> + *
>> + * Attempt to map a single segment in an SGL with the PCI bus address.
>> + * The segment must point to a PCI P2PDMA page and thus must be
>> + * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
>> + *
>> + * Returns the type of mapping used and maps the page if the type is
>> + * PCI_P2PDMA_MAP_BUS_ADDR.
>> + */
>> +enum pci_p2pdma_map_type
>> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device 
>> *dev,
>> +   struct scatterlist *sg)
>> +{
>> +if (state->pgmap != sg_page(sg)->pgmap) {
>> +state->pgmap = sg_page(sg)->pgmap;
>> +state->map = pci_p2pdma_map_type(state->pgmap, dev);
>> +state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
>> +}
> 
> Is this safe? I was just talking with Joao about this,
> 
>  https://lore.kernel.org/r/20210928180150.gi3544...@ziepe.ca
> 

I agree that taking the extra reference on the pagemap seems
unnecessary. However, it was convenient for the purposes of this
patchset to not have to check the page type for every page and only on
every new page map. But if we need to add a check directly to gup,
that'd probably be fine too.

> API wise I absolutely think this should be safe as written, but is it
> really?
> 
> Does pgmap ensure that a positive refcount struct page always has a
> valid pgmap pointer (and thus the mess in gup can be deleted) or does
> this need to get the pgmap as well to keep it alive?

Yes, the P2PDMA code ensures that the pgmap will not be deleted if there
is a positive refcount on any struct page.

LOgan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2021-09-29 Thread Logan Gunthorpe


On 2021-09-28 12:32 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:41PM -0600, Logan Gunthorpe wrote:
>>  config PCI_P2PDMA
>>  bool "PCI peer-to-peer transfer support"
>> -depends on ZONE_DEVICE
>> +depends on ZONE_DEVICE && 64BIT
> 
> Perhaps a comment to explain what the 64bit is doing?

Added.

>>  select GENERIC_ALLOCATOR
>>  help
>>Enableѕ drivers to do PCI peer-to-peer transactions to and from
>> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
>> index 266754a55327..e62b1cf6386f 100644
>> +++ b/include/linux/scatterlist.h
>> @@ -64,6 +64,21 @@ struct sg_append_table {
>>  #define SG_CHAIN0x01UL
>>  #define SG_END  0x02UL
>>  
>> +/*
>> + * bit 2 is the third free bit in the page_link on 64bit systems which
>> + * is used by dma_unmap_sg() to determine if the dma_address is a PCI
>> + * bus address when doing P2PDMA.
>> + * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
>> + */
>> +
>> +#ifdef CONFIG_PCI_P2PDMA
>> +#define SG_DMA_PCI_P2PDMA   0x04UL
> 
> Add a 
>   static_assert(__alignof__(void *) == 8);
> 
> ?

Good idea. Though, I think your line isn't quite correct. I've added:

static_assert(__alignof__(struct page) >= 8);

>> +#define sg_is_dma_pci_p2pdma(sg) ((sg)->page_link & SG_DMA_PCI_P2PDMA)
> 
> I've been encouraging people to use static inlines more..

I also prefer static inlines, but I usually follow the style of the code
I'm changing. In any case, I've changed to static inlines similar to
your example.

>>  /**
>>   * sg_assign_page - Assign a given page to an SG entry
>> @@ -86,13 +103,13 @@ struct sg_append_table {
>>   **/
>>  static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
>>  {
>> -unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END);
>> +unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK;
> 
> I think this should just be '& SG_END', sg_assign_page() doesn't look
> like it should ever be used on a sg_chain entry, so this is just
> trying to preserve the end stamp.

Perhaps, but I'm not comfortable making that change in this patch or
series. Though, I've reverted this specific change in my patch so
sg_assign_page() will clear SG_DMA_PCI_P2PDMA.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Thomas Gleixner
On Wed, Sep 29 2021 at 11:31, Tony Luck wrote:
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..5a3c87fd65de 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -528,6 +528,32 @@ static enum kernel_gp_hint get_kernel_gp_address(struct 
> pt_regs *regs,
>  
>  #define GPFSTR "general protection fault"
>  
> +/*
> + * When a user executes the ENQCMD instruction it will #GP
> + * fault if the IA32_PASID MSR has not been set up with a
> + * valid PASID.
> + * So if the process has been allocated a PASID (mm->pasid)
> + * AND the IA32_PASID MSR has not been initialized, try to
> + * fix this #GP by initializing the IA32_PASID MSR.
> + * If the #GP was for some other reason, it will trigger
> + * again, but this routine will return false and the #GP
> + * will be processed.
> + */
> +static void try_fixup_pasid(void)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> + return false;
> +
> +#ifdef CONFIG_IOMMU_SUPPORT
> + if (current->mm->pasid && !current->pasid_activated) {
> + current->pasid_activated = 1;
> + wrmsrl(MSR_IA32_PASID, current->mm->pasid);
> + return true;
> + }
> +#endif
> + return false;
> +}
> +
>  DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  {
>   char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
> @@ -536,6 +562,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>   unsigned long gp_addr;
>   int ret;
>  
> + if (user_mode(regs) && try_fixup_pasid())
> + return;
> +
>   cond_local_irq_enable(regs);
>  
>   if (static_cpu_has(X86_FEATURE_UMIP)) {

Amen.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 12:37:19PM -0700, Jacob Pan wrote:
 
> For #2, it seems we can store the kernel PASID in struct device. This will
> preserve the DMA API interface while making it PASID capable. Essentially,
> each PASID capable device would have two special global PASIDs: 
>   - PASID 0 for DMA request w/o PASID, aka RID2PASID
>   - PASID 1 (randomly selected) for in-kernel DMA request w/ PASID

This seems reasonable, I had the same thought. Basically just have the
driver issue some trivial call:
  pci_enable_pasid_dma(pdev, )

And then DMA tagged with the PASID will be handled equivilant to
untagged DMA. Basically PASID and no PASID point to the exact same IO
page table and the DMA API manipulates that single page table.

Having multiple RID's pointing at the same IO page table is something
we expect iommufd to require so the whole thing should ideally fall
out naturally.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-09-29 Thread Jacob Pan
Hi,

Just to follow up on what we discussed during LPC VFIO/IOMMU/PCI MC.
https://linuxplumbersconf.org/event/11/contributions/1021/
The key takeaways are:
1. Addressing mode selections (PA, IOVA, and KVA) should be a policy
decision *not* to be made by device drivers. This implies that it is up to
the platform code or user (via some sysfs knobs) to decide what is the best
for each device. Drivers should not be aware of what addressing modes is
returned by DMA API.

2. DMA APIs can be extended to support DMA request with PASID. 

3. Performance benefit of using KVA (shared) should be demonstrated. Though
the saving of IOTLB flush over IOVA is conceivable.

#1 could be done in platform IOMMU code when devices are attached to their
default domains. E.g. if the device is trusted, it can operate at shared
KVA mode.

For #2, it seems we can store the kernel PASID in struct device. This will
preserve the DMA API interface while making it PASID capable. Essentially,
each PASID capable device would have two special global PASIDs: 
- PASID 0 for DMA request w/o PASID, aka RID2PASID
- PASID 1 (randomly selected) for in-kernel DMA request w/ PASID

Both PASID 0 and 1 will always point to the same page table. I.e. same
addressing mode, IOVA or KVA.

For devices does not support PASID, there is no change. For devices can do
both DMA w/ and w/o PASID, the IOTLB invalidation would include both PASIDs.

By embedding PASID in struct device, it also avoided changes in upper level
APIs. DMA engine API can continue to give out channels without knowing
whether PASID is used or not. The accelerator drivers that does work
submission can retrieve PASID from struct device.

Thoughts?

Thanks for the review and feedback at LPC!

Jacob

On Tue, 21 Sep 2021 13:29:34 -0700, Jacob Pan
 wrote:

> Hi Joerg/Jason/Christoph et all,
> 
> The current in-kernel supervisor PASID support is based on the SVM/SVA
> machinery in sva-lib. Kernel SVA is achieved by extending a special flag
> to indicate the binding of the device and a page table should be performed
> on init_mm instead of the mm of the current process.Page requests and
> other differences between user and kernel SVA are handled as special
> cases.
> 
> This unrestricted binding with the kernel page table is being challenged
> for security and the convention that in-kernel DMA must be compatible with
> DMA APIs.
> (https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/)
> There is also the lack of IOTLB synchronization upon kernel page table
> updates.
> 
> This patchset is trying to address these concerns by having an explicit
> DMA API compatible model while continue to support in-kernel use of DMA
> requests with PASID. Specifically, the following DMA-IOMMU APIs are
> introduced:
> 
> int iommu_dma_pasid_enable/disable(struct device *dev,
>  struct iommu_domain **domain,
>  enum iommu_dma_pasid_mode mode);
> int iommu_map/unmap_kva(struct iommu_domain *domain,
>   void *cpu_addr,size_t size, int prot);
> 
> The following three addressing modes are supported with example API usages
> by device drivers.
> 
> 1. Physical address (bypass) mode. Similar to DMA direct where trusted
> devices can DMA pass through IOMMU on a per PASID basis.
> Example:
>   pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_BYPASS);
>   /* Use the returning PASID and PA for work submission */
> 
> 2. IOVA mode. DMA API compatible. Map a supervisor PASID the same way as
> the PCI requester ID (RID)
> Example:
>   pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_IOVA);
>   /* Use the PASID and DMA API allocated IOVA for work submission */
> 
> 3. KVA mode. New kva map/unmap APIs. Support fast and strict sub-modes
> transparently based on device trustfulness.
> Example:
>   pasid = iommu_dma_pasid_enable(dev, , IOMMU_DMA_PASID_KVA);
>   iommu_map_kva(domain, , size, prot);
>   /* Use the returned PASID and KVA to submit work */
> Where:
>   Fast mode: Shared CPU page tables for trusted devices only
>   Strict mode: IOMMU domain returned for the untrusted device to
>   replicate KVA-PA mapping in IOMMU page tables.
> 
> On a per device basis, DMA address and performance modes are enabled by
> the device drivers. Platform information such as trustability, user
> command line input (not included in this set) could also be taken into
> consideration (not implemented in this RFC).
> 
> This RFC is intended to communicate the API directions. Little testing is
> done outside IDXD and DMA engine tests.
> 
> For PA and IOVA modes, the implementation is straightforward and tested
> with Intel IDXD driver. But several opens remain in KVA fast mode thus
> not tested: 1. Lack of IOTLB synchronization, kernel direct map alias can
> be updated as a result of module loading/eBPF load. Adding kernel mmu
> notifier? 2. The use of the auxiliary domain for 

Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices

2021-09-29 Thread Alex Williamson
On Wed, 29 Sep 2021 12:08:59 +1000
David Gibson  wrote:

> On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> > userspace to directly open a vfio device w/o relying on container/group
> > (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> > iommufd (more specifically in iommu core by this RFC) in a device-centric
> > manner.
> > 
> > In case a device is exposed in both legacy and new interfaces (see next
> > patch for how to decide it), this patch also ensures that when the device
> > is already opened via one interface then the other one must be blocked.
> > 
> > Signed-off-by: Liu Yi L   
> [snip]
> 
> > +static bool vfio_device_in_container(struct vfio_device *device)
> > +{
> > +   return !!(device->group && device->group->container);  
> 
> You don't need !! here.  && is already a logical operation, so returns
> a valid bool.
> 
> > +}
> > +
> >  static int vfio_device_fops_release(struct inode *inode, struct file 
> > *filep)
> >  {
> > struct vfio_device *device = filep->private_data;
> > @@ -1560,7 +1691,16 @@ static int vfio_device_fops_release(struct inode 
> > *inode, struct file *filep)
> >  
> > module_put(device->dev->driver->owner);
> >  
> > -   vfio_group_try_dissolve_container(device->group);
> > +   if (vfio_device_in_container(device)) {
> > +   vfio_group_try_dissolve_container(device->group);
> > +   } else {
> > +   atomic_dec(>opened);
> > +   if (device->group) {
> > +   mutex_lock(>group->opened_lock);
> > +   device->group->opened--;
> > +   mutex_unlock(>group->opened_lock);
> > +   }
> > +   }
> >  
> > vfio_device_put(device);
> >  
> > @@ -1613,6 +1753,7 @@ static int vfio_device_fops_mmap(struct file *filep, 
> > struct vm_area_struct *vma)
> >  
> >  static const struct file_operations vfio_device_fops = {
> > .owner  = THIS_MODULE,
> > +   .open   = vfio_device_fops_open,
> > .release= vfio_device_fops_release,
> > .read   = vfio_device_fops_read,
> > .write  = vfio_device_fops_write,
> > @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
> > .mode = S_IRUGO | S_IWUGO,
> >  };
> >  
> > +static char *vfio_device_devnode(struct device *dev, umode_t *mode)
> > +{
> > +   return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));  
> 
> Others have pointed out some problems with the use of dev_name()
> here.  I'll add that I think you'll make things much easier if instead
> of using one huge "devices" subdir, you use a separate subdir for each
> vfio sub-driver (so, one for PCI, one for each type of mdev, one for
> platform, etc.).  That should make avoiding name conflicts a lot simpler.

It seems like this is unnecessary if we use the vfioX naming approach.
Conflicts are trivial to ignore if we don't involve dev_name() and
looking for the correct major:minor chardev in the correct subdirectory
seems like a hassle for userspace.  Thanks,

Alex

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Luck, Tony
On Wed, Sep 29, 2021 at 06:07:22PM +, Fenghua Yu wrote:
> Add
> +#ifdef CONFIG_IOMMU_SUPPORT
> because mm->pasid and current-pasid_activated are defined only if
> CONFIG_IOMMU_SUPPORT is defined.
> 
> > +   if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) 
> > {
> 
> Maybe need to add "&& cpu_feature_enabled(X86_FEATURE_ENQCMD)" because
> the IA32_PASID MSR is only used when ENQCMD is enabled?
> 
> > +   current->pasid_activated = 1;
> > +   wrmsrl(MSR_IA32_PASID, current->mm->pasid | 
> > MSR_IA32_PASID_VALID);
> > +   return;
> > +   }
> > +
> 
> +endif

New version that addresses those issues.  Has ugly #ifdef in C
code :-(  If that's unacceptable, then could create some stub
functions, or add a call to __try_fixup_pasid() that's in a
file in the iommu code that is only built when CONFIG_IOMMU_SUPPORT
is set.  But either of those move the details far away from the
#GP handler so make extra work for anyone trying to follow along
with what is happening here.

-Tony

---

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..5a3c87fd65de 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -528,6 +528,32 @@ static enum kernel_gp_hint get_kernel_gp_address(struct 
pt_regs *regs,
 
 #define GPFSTR "general protection fault"
 
+/*
+ * When a user executes the ENQCMD instruction it will #GP
+ * fault if the IA32_PASID MSR has not been set up with a
+ * valid PASID.
+ * So if the process has been allocated a PASID (mm->pasid)
+ * AND the IA32_PASID MSR has not been initialized, try to
+ * fix this #GP by initializing the IA32_PASID MSR.
+ * If the #GP was for some other reason, it will trigger
+ * again, but this routine will return false and the #GP
+ * will be processed.
+ */
+static void try_fixup_pasid(void)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
+   return false;
+
+#ifdef CONFIG_IOMMU_SUPPORT
+   if (current->mm->pasid && !current->pasid_activated) {
+   current->pasid_activated = 1;
+   wrmsrl(MSR_IA32_PASID, current->mm->pasid);
+   return true;
+   }
+#endif
+   return false;
+}
+
 DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 {
char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
@@ -536,6 +562,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
unsigned long gp_addr;
int ret;
 
+   if (user_mode(regs) && try_fixup_pasid())
+   return;
+
cond_local_irq_enable(regs);
 
if (static_cpu_has(X86_FEATURE_UMIP)) {
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] [RFC] qcom_scm: hide Kconfig symbol

2021-09-29 Thread Arnd Bergmann
On Wed, Sep 29, 2021 at 4:46 PM Bjorn Andersson
 wrote:
>
> On Wed 29 Sep 05:04 CDT 2021, Arnd Bergmann wrote:
>
> > On Wed, Sep 29, 2021 at 11:51 AM Will Deacon  wrote:
> > > On Mon, Sep 27, 2021 at 05:22:13PM +0200, Arnd Bergmann wrote:
> > > >
> > > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > > > index 124c41adeca1..989c83acbfee 100644
> > > > --- a/drivers/iommu/Kconfig
> > > > +++ b/drivers/iommu/Kconfig
> > > > @@ -308,7 +308,7 @@ config APPLE_DART
> > > >  config ARM_SMMU
> > > >   tristate "ARM Ltd. System MMU (SMMU) Support"
> > > >   depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64)
> > > > - depends on QCOM_SCM || !QCOM_SCM #if QCOM_SCM=m this can't be =y
> > > > + select QCOM_SCM
> > > >   select IOMMU_API
> > > >   select IOMMU_IO_PGTABLE_LPAE
> > > >   select ARM_DMA_USE_IOMMU if ARM
> > >
> > > I don't want to get in the way of this patch because I'm also tired of the
> > > randconfig failures caused by QCOM_SCM. However, ARM_SMMU is applicable to
> > > a wide variety of (non-qcom) SoCs and so it seems a shame to require the
> > > QCOM_SCM code to be included for all of those when it's not strictly 
> > > needed
> > > at all.
> >
> > Good point, I agree that needs to be fixed. I think this additional
> > change should do the trick:
> >
>
> ARM_SMMU and QCOM_IOMMU are two separate implementations and both uses
> QCOM_SCM. So both of them should select QCOM_SCM.

Right, I figured that out later as well.

> "Unfortunately" the Qualcomm portion of ARM_SMMU is builtin
> unconditionally, so going with something like select QCOM_SCM if
> ARCH_QCOM would still require the stubs in qcom_scm.h.

Yes, sounds good. I also noticed that I still need one hack in there
if I do this:

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 55690af1b25d..36c304a8fc9b 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -427,6 +427,9 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct
arm_smmu_device *smmu)
 {
const struct device_node *np = smmu->dev->of_node;

+   if (!IS_ENABLED(CONFIG_QCOM_SCM))
+   return ERR_PTR(-ENXIO);
+
 #ifdef CONFIG_ACPI
if (np == NULL) {
/* Match platform for ACPI boot */


Otherwise it still breaks with ARM_SMMU=y and QCOM_SCM=m.

Splitting out the qualcomm portion of the arm_smmu driver using
a separate 'bool' symbol should also work, if  you prefer that
and can suggest a name and help text for that symbol. It would
look like

diff --git a/drivers/iommu/arm/arm-smmu/Makefile
b/drivers/iommu/arm/arm-smmu/Makefile
index e240a7bcf310..b0cc01aa20c9 100644
--- a/drivers/iommu/arm/arm-smmu/Makefile
+++ b/drivers/iommu/arm/arm-smmu/Makefile
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
 obj-$(CONFIG_ARM_SMMU) += arm_smmu.o
-arm_smmu-objs += arm-smmu.o arm-smmu-impl.o arm-smmu-nvidia.o arm-smmu-qcom.o
+arm_smmu-objs += arm-smmu.o arm-smmu-impl.o arm-smmu-nvidia.o
+arm_smmu-$(CONFIG_ARM_SMMU_QCOM) += arm-smmu-qcom.o
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c
b/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c
index 9f465e146799..2c25cce38060 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c
@@ -215,7 +215,8 @@ struct arm_smmu_device *arm_smmu_impl_init(struct
arm_smmu_device *smmu)
of_device_is_compatible(np, "nvidia,tegra186-smmu"))
return nvidia_smmu_impl_init(smmu);

-   smmu = qcom_smmu_impl_init(smmu);
+   if (IS_ENABLED(CONFIG_ARM_SMMU_QCOM))
+   smmu = qcom_smmu_impl_init(smmu);

if (of_device_is_compatible(np, "marvell,ap806-smmu-500"))
smmu->impl = _mmu500_impl;



   Arnd
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Fenghua Yu
Hi, Tony,

On Wed, Sep 29, 2021 at 10:41:42AM -0700, Luck, Tony wrote:
> On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote:
> > On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
> > > On 9/29/21 05:28, Thomas Gleixner wrote:
> > >> Looking at that patch again, none of this muck in fpu__pasid_write() is
> > >> required at all. The whole exception fixup is:
> > >> 
> > >>  if (!user_mode(regs))
> > >>   return false;
> > >> 
> > >>  if (!current->mm->pasid)
> > >>   return false;
> > >> 
> > >>  if (current->pasid_activated)
> > >>   return false;
> > >
> > > <-- preemption or BH here: kaboom.
> > 
> > Sigh, this had obviously to run in the early portion of #GP, i.e. before
> > enabling interrupts.
> 
> Like this? Obviously with some comment about why this is being done.
> 
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..a848a59291e7 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>   unsigned long gp_addr;
>   int ret;
>  

Add
+#ifdef CONFIG_IOMMU_SUPPORT
because mm->pasid and current-pasid_activated are defined only if
CONFIG_IOMMU_SUPPORT is defined.

> + if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) 
> {

Maybe need to add "&& cpu_feature_enabled(X86_FEATURE_ENQCMD)" because
the IA32_PASID MSR is only used when ENQCMD is enabled?

> + current->pasid_activated = 1;
> + wrmsrl(MSR_IA32_PASID, current->mm->pasid | 
> MSR_IA32_PASID_VALID);
> + return;
> + }
> +

+endif

>   cond_local_irq_enable(regs);
>  
>   if (static_cpu_has(X86_FEATURE_UMIP)) {

Thanks.

-Fenghua
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP

2021-09-29 Thread Andy Lutomirski



On Wed, Sep 29, 2021, at 10:07 AM, Luck, Tony wrote:
> On Wed, Sep 29, 2021 at 09:58:22AM -0700, Andy Lutomirski wrote:
>> On 9/28/21 16:10, Luck, Tony wrote:
>> > Moving beyond pseudo-code and into compiles-but-probably-broken-code.
>> > 
>> > 
>> > The intent of the functions below is that Fenghua should be able to
>> > do:
>> > 
>> > void fpu__pasid_write(u32 pasid)
>> > {
>> >u64 msr_val = pasid | MSR_IA32_PASID_VALID;
>> >struct ia32_pasid_state *addr;
>> > 
>> >addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true);
>> >addr->pasid = msr_val;
>> >finish_update_one_xsave_feature(current);
>> > }
>> > 
>> 
>> This gets gnarly because we would presumably like to optimize the case where
>> we can do the update directly in registers.  I wonder if we can do it with a
>> bit of macro magic in a somewhat generic way:
>
> Can we defere the optimizations until there is a use case that
> cares? The existing use case (fixing up the #GP fault by setting
> the PASID MSR) isn't performance critical.
>
> Let's just have something that is clear (or as clear as any xsave
> code can be) and correct.
>
> 

The goal would be to use the same code for CET and PKRU, I think.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Andy Lutomirski



On Wed, Sep 29, 2021, at 10:41 AM, Luck, Tony wrote:
> On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote:
>> On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
>> > On 9/29/21 05:28, Thomas Gleixner wrote:
>> >> Looking at that patch again, none of this muck in fpu__pasid_write() is
>> >> required at all. The whole exception fixup is:
>> >> 
>> >>  if (!user_mode(regs))
>> >>   return false;
>> >> 
>> >>  if (!current->mm->pasid)
>> >>   return false;
>> >> 
>> >>  if (current->pasid_activated)
>> >>return false;
>> >
>> > <-- preemption or BH here: kaboom.
>> 
>> Sigh, this had obviously to run in the early portion of #GP, i.e. before
>> enabling interrupts.
>
> Like this? Obviously with some comment about why this is being done.
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..a848a59291e7 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>   unsigned long gp_addr;
>   int ret;
> 
> + if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) 
> {
> + current->pasid_activated = 1;
> + wrmsrl(MSR_IA32_PASID, current->mm->pasid | 
> MSR_IA32_PASID_VALID);
> + return;
> + }
> +

This could do with a WARN_ON_ONCE(TIF_NEED_LOAD_FPU) imo.

Is instrumentation allowed to touch FPU state?

>   cond_local_irq_enable(regs);
> 
>   if (static_cpu_has(X86_FEATURE_UMIP)) {
>
> -Tony
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Luck, Tony
On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote:
> On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
> > On 9/29/21 05:28, Thomas Gleixner wrote:
> >> Looking at that patch again, none of this muck in fpu__pasid_write() is
> >> required at all. The whole exception fixup is:
> >> 
> >>  if (!user_mode(regs))
> >>   return false;
> >> 
> >>  if (!current->mm->pasid)
> >>   return false;
> >> 
> >>  if (current->pasid_activated)
> >> return false;
> >
> > <-- preemption or BH here: kaboom.
> 
> Sigh, this had obviously to run in the early portion of #GP, i.e. before
> enabling interrupts.

Like this? Obviously with some comment about why this is being done.

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..a848a59291e7 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
unsigned long gp_addr;
int ret;
 
+   if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) 
{
+   current->pasid_activated = 1;
+   wrmsrl(MSR_IA32_PASID, current->mm->pasid | 
MSR_IA32_PASID_VALID);
+   return;
+   }
+
cond_local_irq_enable(regs);
 
if (static_cpu_has(X86_FEATURE_UMIP)) {

-Tony
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Thomas Gleixner
On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
> On 9/29/21 05:28, Thomas Gleixner wrote:
>> Looking at that patch again, none of this muck in fpu__pasid_write() is
>> required at all. The whole exception fixup is:
>> 
>>  if (!user_mode(regs))
>>   return false;
>> 
>>  if (!current->mm->pasid)
>>   return false;
>> 
>>  if (current->pasid_activated)
>>   return false;
>
> <-- preemption or BH here: kaboom.

Sigh, this had obviously to run in the early portion of #GP, i.e. before
enabling interrupts.

For me that's more than obvious and I apologize that I failed to mention
it.

>> 
>>  wrmsrl(MSR_IA32_PASID, current->mm->pasid);
>
> This needs the actual sane fpstate writing helper -- see other email.

And with that there is no fpstate write helper required at all.

Stop this overengineering madness already.

Thanks,

tglx
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Fenghua Yu
Hi, Thomas,

On Wed, Sep 29, 2021 at 09:51:15AM -0700, Luck, Tony wrote:
> > There is zero requirement to look at TIF_NEED_FPU_LOAD or
> > fpregs_state_valid() simply because the #GP comes straight from user
> > space which means the FPU registers contain the current tasks user space
> > state.
> 
> Just to double confirm ... there is no point in the #GP handler up to this 
> point
> where pre-emption can occur?

Same question here. The fixup function is called after cond_local_irq_enable().
If an interrupt comes before fixup_pasid_exception(), the interrupt may
use FPU and call kernel_fpu_begin_mask()->set(TIF_NEED_FPU_LOAD)->
__cpu_invalidate_fpregs_state(). Then writing to the IA32_PASID MSR. When
exiting to user, the FPU states will be restored to the FPU regs including
the IA32_PASID MSR. So the MSR could be different from the value written in
fixup_pasid_execption(). Is it possible?

Or should fixup_pasid_exception() be called before cond_local_irq_enable()?

Thanks.

-Fenghua
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP

2021-09-29 Thread Luck, Tony
On Wed, Sep 29, 2021 at 09:58:22AM -0700, Andy Lutomirski wrote:
> On 9/28/21 16:10, Luck, Tony wrote:
> > Moving beyond pseudo-code and into compiles-but-probably-broken-code.
> > 
> > 
> > The intent of the functions below is that Fenghua should be able to
> > do:
> > 
> > void fpu__pasid_write(u32 pasid)
> > {
> > u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> > struct ia32_pasid_state *addr;
> > 
> > addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true);
> > addr->pasid = msr_val;
> > finish_update_one_xsave_feature(current);
> > }
> > 
> 
> This gets gnarly because we would presumably like to optimize the case where
> we can do the update directly in registers.  I wonder if we can do it with a
> bit of macro magic in a somewhat generic way:

Can we defere the optimizations until there is a use case that
cares? The existing use case (fixing up the #GP fault by setting
the PASID MSR) isn't performance critical.

Let's just have something that is clear (or as clear as any xsave
code can be) and correct.

-Tony
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Andy Lutomirski

On 9/29/21 05:28, Thomas Gleixner wrote:

On Wed, Sep 29 2021 at 11:54, Peter Zijlstra wrote:

On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:

I think the perfect and the good are a bit confused here. If we go for
"good", then we have an mm owning a PASID for its entire lifetime.  If
we want "perfect", then we should actually do it right: teach the
kernel to update an entire mm's PASID setting all at once.  This isn't
*that* hard -- it involves two things:

1. The context switch code needs to resync PASID.  Unfortunately, this
adds some overhead to every context switch, although a static_branch
could minimize it for non-PASID users.



2. A change to an mm's PASID needs to sent an IPI, but that IPI can't
touch FPU state.  So instead the IPI should use task_work_add() to
make sure PASID gets resynced.


What do we need 1 for? Any PASID change can be achieved using 2 no?

Basically, call task_work_add() on all relevant tasks [1], then IPI
spray the current running of those and presto.

[1] it is nigh on impossible to find all tasks sharing an mm in any sane
way due to CLONE_MM && !CLONE_THREAD.


Why would we want any of that at all?

   Process starts, no PASID assigned.

   bind to device -> PASID is allocated and assigned to the mm

   some task of the process issues ENQCMD -> #GP -> write PASID MSR

   After that the PASID is saved and restored as part of the XSTATE and
   there is no extra overhead in context switch or return to user space.

   All tasks of the process which did never use ENQCMD don't care and their
   PASID xstate is in init state.

There is absolutely no point in enforcing that all tasks of the process
have the PASID activated immediately when it is assigned. If they need
it they get it via the #GP fixup and everything just works.

Looking at that patch again, none of this muck in fpu__pasid_write() is
required at all. The whole exception fixup is:

 if (!user_mode(regs))
  return false;

 if (!current->mm->pasid)
  return false;

 if (current->pasid_activated)
 return false;


<-- preemption or BH here: kaboom.



 wrmsrl(MSR_IA32_PASID, current->mm->pasid);


This needs the actual sane fpstate writing helper -- see other email.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP

2021-09-29 Thread Andy Lutomirski

On 9/28/21 16:10, Luck, Tony wrote:

Moving beyond pseudo-code and into compiles-but-probably-broken-code.


The intent of the functions below is that Fenghua should be able to
do:

void fpu__pasid_write(u32 pasid)
{
u64 msr_val = pasid | MSR_IA32_PASID_VALID;
struct ia32_pasid_state *addr;

addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true);
addr->pasid = msr_val;
finish_update_one_xsave_feature(current);
}



This gets gnarly because we would presumably like to optimize the case 
where we can do the update directly in registers.  I wonder if we can do 
it with a bit of macro magic in a somewhat generic way:


typedef fpu__pasid_type u32;

static inline void fpu__set_pasid_in_register(const u32 *value)
{
wrmsr(...);
}

#define DEFINE_FPU_HELPER(name) \
static inline void fpu__set_##name(const fpu__##name##_type *val) \
{ \
fpregs_lock(); \
if (should write in memory) { \
->xfeatures |= XFEATURE_##name; \
ptr = get_xsave_addr(...); \
memcpy(ptr, val, sizeof(*val)); \
__fpu_invalidate_fpregs_state(...); \
} else { \
fpu__set_##name##_in_register(val); \
} \
}
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Luck, Tony
> There is zero requirement to look at TIF_NEED_FPU_LOAD or
> fpregs_state_valid() simply because the #GP comes straight from user
> space which means the FPU registers contain the current tasks user space
> state.

Just to double confirm ... there is no point in the #GP handler up to this point
where pre-emption can occur?

-Tony
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v8 03/12] iommu/mediatek: Add probe_defer for smi-larb

2021-09-29 Thread Dafna Hirschfeld




On 29.09.21 03:37, Yong Wu wrote:

Prepare for adding device_link.

The iommu consumer should use device_link to connect with the
smi-larb(supplier). then the smi-larb should run before the iommu
consumer. Here we delay the iommu driver until the smi driver is ready,
then all the iommu consumers always are after the smi driver.

When there is no this patch, if some consumer drivers run before
smi-larb, the supplier link_status is DL_DEV_NO_DRIVER(0) in the
device_link_add, then device_links_driver_bound will use WARN_ON
to complain that the link_status of supplier is not right.

device_is_bound may be more elegant here. but it is not allowed to
EXPORT from https://lore.kernel.org/patchwork/patch/1334670/.

Signed-off-by: Yong Wu 
Tested-by: Frank Wunderlich  # BPI-R2/MT7623
---
  drivers/iommu/mtk_iommu.c| 2 +-
  drivers/iommu/mtk_iommu_v1.c | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index d837adfd1da5..d5848f78a677 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -844,7 +844,7 @@ static int mtk_iommu_probe(struct platform_device *pdev)
id = i;
  
  		plarbdev = of_find_device_by_node(larbnode);

-   if (!plarbdev) {
+   if (!plarbdev || !plarbdev->dev.driver) {
of_node_put(larbnode);
return -EPROBE_DEFER;


if plarbdev is null doesn't that mean that the device does not exist?
so we should return -ENODEV in that case?

thanks,
Dafna


}
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 1467ba1e4417..4d7809432239 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -602,7 +602,7 @@ static int mtk_iommu_probe(struct platform_device *pdev)
}
  
  		plarbdev = of_find_device_by_node(larbnode);

-   if (!plarbdev) {
+   if (!plarbdev || !plarbdev->dev.driver) {
of_node_put(larbnode);
return -EPROBE_DEFER;
}


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/2] [v2] qcom_scm: hide Kconfig symbol

2021-09-29 Thread Bjorn Andersson
On Tue 28 Sep 02:50 CDT 2021, Arnd Bergmann wrote:

> From: Arnd Bergmann 
> 
> Now that SCM can be a loadable module, we have to add another
> dependency to avoid link failures when ipa or adreno-gpu are
> built-in:
> 
> aarch64-linux-ld: drivers/net/ipa/ipa_main.o: in function `ipa_probe':
> ipa_main.c:(.text+0xfc4): undefined reference to `qcom_scm_is_available'
> 
> ld.lld: error: undefined symbol: qcom_scm_is_available
> >>> referenced by adreno_gpu.c
> >>>   gpu/drm/msm/adreno/adreno_gpu.o:(adreno_zap_shader_load) in 
> >>> archive drivers/built-in.a
> 
> This can happen when CONFIG_ARCH_QCOM is disabled and we don't select
> QCOM_MDT_LOADER, but some other module selects QCOM_SCM. Ideally we'd
> use a similar dependency here to what we have for QCOM_RPROC_COMMON,
> but that causes dependency loops from other things selecting QCOM_SCM.
> 
> This appears to be an endless problem, so try something different this
> time:
> 
>  - CONFIG_QCOM_SCM becomes a hidden symbol that nothing 'depends on'
>but that is simply selected by all of its users
> 
>  - All the stubs in include/linux/qcom_scm.h can go away
> 
>  - arm-smccc.h needs to provide a stub for __arm_smccc_smc() to
>allow compile-testing QCOM_SCM on all architectures.
> 
>  - To avoid a circular dependency chain involving RESET_CONTROLLER
>and PINCTRL_SUNXI, drop the 'select RESET_CONTROLLER' statement.
>According to my testing this still builds fine, and the QCOM
>platform selects this symbol already.
> 
> Acked-by: Kalle Valo 
> Signed-off-by: Arnd Bergmann 
> ---
> Changes in v2:
>   - drop the 'select RESET_CONTROLLER' line, rather than adding
> more of the same
> ---
>  drivers/firmware/Kconfig|  5 +-
>  drivers/gpu/drm/msm/Kconfig |  4 +-
>  drivers/iommu/Kconfig   |  2 +-
>  drivers/media/platform/Kconfig  |  2 +-
>  drivers/mmc/host/Kconfig|  2 +-
>  drivers/net/ipa/Kconfig |  1 +
>  drivers/net/wireless/ath/ath10k/Kconfig |  2 +-
>  drivers/pinctrl/qcom/Kconfig|  3 +-
>  include/linux/arm-smccc.h   | 10 
>  include/linux/qcom_scm.h| 71 -
>  10 files changed, 20 insertions(+), 82 deletions(-)
> 
> diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
> index 220a58cf0a44..cda7d7162cbb 100644
> --- a/drivers/firmware/Kconfig
> +++ b/drivers/firmware/Kconfig
> @@ -203,10 +203,7 @@ config INTEL_STRATIX10_RSU
> Say Y here if you want Intel RSU support.
>  
>  config QCOM_SCM
> - tristate "Qcom SCM driver"
> - depends on ARM || ARM64
> - depends on HAVE_ARM_SMCCC
> - select RESET_CONTROLLER
> + tristate
>  
>  config QCOM_SCM_DOWNLOAD_MODE_DEFAULT
>   bool "Qualcomm download mode enabled by default"
> diff --git a/drivers/gpu/drm/msm/Kconfig b/drivers/gpu/drm/msm/Kconfig
> index e9c6af78b1d7..3ddf739a6f9b 100644
> --- a/drivers/gpu/drm/msm/Kconfig
> +++ b/drivers/gpu/drm/msm/Kconfig
> @@ -17,7 +17,7 @@ config DRM_MSM
>   select DRM_SCHED
>   select SHMEM
>   select TMPFS
> - select QCOM_SCM if ARCH_QCOM
> + select QCOM_SCM
>   select WANT_DEV_COREDUMP
>   select SND_SOC_HDMI_CODEC if SND_SOC
>   select SYNC_FILE
> @@ -55,7 +55,7 @@ config DRM_MSM_GPU_SUDO
>  
>  config DRM_MSM_HDMI_HDCP
>   bool "Enable HDMI HDCP support in MSM DRM driver"
> - depends on DRM_MSM && QCOM_SCM
> + depends on DRM_MSM
>   default y
>   help
> Choose this option to enable HDCP state machine
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 124c41adeca1..989c83acbfee 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -308,7 +308,7 @@ config APPLE_DART
>  config ARM_SMMU
>   tristate "ARM Ltd. System MMU (SMMU) Support"
>   depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64)
> - depends on QCOM_SCM || !QCOM_SCM #if QCOM_SCM=m this can't be =y
> + select QCOM_SCM
>   select IOMMU_API
>   select IOMMU_IO_PGTABLE_LPAE
>   select ARM_DMA_USE_IOMMU if ARM

As noted in the RFC, I think you also need to fix QCOM_IOMMU.

In particular (iirc) since all the users of the iommu might be modules,
which would prevent QCOM_IOMMU from being selected.

The rest looks good.

Regards,
Bjorn

> diff --git a/drivers/media/platform/Kconfig b/drivers/media/platform/Kconfig
> index 157c924686e4..80321e03809a 100644
> --- a/drivers/media/platform/Kconfig
> +++ b/drivers/media/platform/Kconfig
> @@ -565,7 +565,7 @@ config VIDEO_QCOM_VENUS
>   depends on VIDEO_DEV && VIDEO_V4L2 && QCOM_SMEM
>   depends on (ARCH_QCOM && IOMMU_DMA) || COMPILE_TEST
>   select QCOM_MDT_LOADER if ARCH_QCOM
> - select QCOM_SCM if ARCH_QCOM
> + select QCOM_SCM
>   select VIDEOBUF2_DMA_CONTIG
>   select V4L2_MEM2MEM_DEV
>   help
> diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
> index 

Re: [PATCH] [RFC] qcom_scm: hide Kconfig symbol

2021-09-29 Thread Bjorn Andersson
On Wed 29 Sep 05:04 CDT 2021, Arnd Bergmann wrote:

> On Wed, Sep 29, 2021 at 11:51 AM Will Deacon  wrote:
> > On Mon, Sep 27, 2021 at 05:22:13PM +0200, Arnd Bergmann wrote:
> > >
> > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > > index 124c41adeca1..989c83acbfee 100644
> > > --- a/drivers/iommu/Kconfig
> > > +++ b/drivers/iommu/Kconfig
> > > @@ -308,7 +308,7 @@ config APPLE_DART
> > >  config ARM_SMMU
> > >   tristate "ARM Ltd. System MMU (SMMU) Support"
> > >   depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64)
> > > - depends on QCOM_SCM || !QCOM_SCM #if QCOM_SCM=m this can't be =y
> > > + select QCOM_SCM
> > >   select IOMMU_API
> > >   select IOMMU_IO_PGTABLE_LPAE
> > >   select ARM_DMA_USE_IOMMU if ARM
> >
> > I don't want to get in the way of this patch because I'm also tired of the
> > randconfig failures caused by QCOM_SCM. However, ARM_SMMU is applicable to
> > a wide variety of (non-qcom) SoCs and so it seems a shame to require the
> > QCOM_SCM code to be included for all of those when it's not strictly needed
> > at all.
> 
> Good point, I agree that needs to be fixed. I think this additional
> change should do the trick:
> 

ARM_SMMU and QCOM_IOMMU are two separate implementations and both uses
QCOM_SCM. So both of them should select QCOM_SCM.

"Unfortunately" the Qualcomm portion of ARM_SMMU is builtin
unconditionally, so going with something like select QCOM_SCM if
ARCH_QCOM would still require the stubs in qcom_scm.h.

Regards,
Bjorn

> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -308,7 +308,6 @@ config APPLE_DART
>  config ARM_SMMU
> tristate "ARM Ltd. System MMU (SMMU) Support"
> depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64)
> -   select QCOM_SCM
> select IOMMU_API
> select IOMMU_IO_PGTABLE_LPAE
> select ARM_DMA_USE_IOMMU if ARM
> @@ -438,7 +437,7 @@ config QCOM_IOMMU
> # Note: iommu drivers cannot (yet?) be built as modules
> bool "Qualcomm IOMMU Support"
> depends on ARCH_QCOM || (COMPILE_TEST && !GENERIC_ATOMIC64)
> -   depends on QCOM_SCM=y
> +   select QCOM_SCM
> select IOMMU_API
> select IOMMU_IO_PGTABLE_LPAE
> select ARM_DMA_USE_IOMMU
> 
> I'll see if that causes any problems for the randconfig builds.
> 
>Arnd
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 12:38:35AM +, Tian, Kevin wrote:

> /* If set the driver must call iommu_XX as the first action in probe() or
>   * before it attempts to do DMA
>   */
>  bool suppress_dma_owner:1;

It is not "attempts to do DMA" but more "operates the physical device
in any away"

Not having ownership means another entity could be using user space
DMA to manipulate the device state and attack the integrity of the
kernel's programming of the device.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 04:35:19PM +1000, David Gibson wrote:

> Yes, exactly.  And with a group interface it's obvious it has to
> understand it.  With the non-group interface, you can get to this
> stage in ignorance of groups.  It will even work as long as you are
> lucky enough only to try with singleton-group devices.  Then you try
> it with two devices in the one group and doing (3) on device A will
> implicitly change the DMA environment of device B.

The security model here says this is fine.

This idea to put the iommu code in charge of security is quite clean,
as I said in the other mail drivers attached to 'struct devices *'
tell the iommu layer what they are are doing:

   iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
   iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
   iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)

And it decides if it is allowed.

If device A is allowed to go to userspace then security wise it is
deemed fine that B is impacted. That is what we have defined already
today.

This proposal does not free userpace from having to understand this!
The iommu_group sysfs is still there and still must be understood.

The *admin* the one responsible to understand the groups, not the
applications. The admin has no idea what a group FD is - they should
be looking at the sysfs and seeing the iommu_group directories.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 08:48:28AM +, Tian, Kevin wrote:

> ARM:
> - set to snoop format if IOMMU_CACHE
> - set to nonsnoop format if !IOMMU_CACHE
> (in both cases TLP snoop bit is ignored?)

Where do you see this? I couldn't even find this functionality in the
ARM HW manual??
 
What I saw is ARM linking the IOMMU_CACHE to a IO PTE bit that causes
the cache coherence to be disabled, which is not ignoring no snoop.

> I didn't identify the exact commit for above meaning change.
> 
> Robin, could you help share some thoughts here?

It is this:

static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
 unsigned long attrs)
{
int prot = coherent ? IOMMU_CACHE : 0;

Which sets IOMMU_CACHE based on:

static void *iommu_dma_alloc(struct device *dev, size_t size,
dma_addr_t *handle, gfp_t gfp, unsigned long attrs)
{
bool coherent = dev_is_dma_coherent(dev);
int ioprot = dma_info_to_prot(DMA_BIDIRECTIONAL, coherent, attrs); 

Driving IOMMU_CACHE from dev_is_dma_coherent() has *NOTHING* to do
with no-snoop TLPs and everything to do with the arch cache
maintenance API

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 17/20] iommu/iommufd: Report iova range to userspace

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 01:07:56PM +0100, Jean-Philippe Brucker wrote:

> Yes, so the guest knows the size of GPA it can write into the page table.
> For Arm SMMU the GPA size is determined by both the SMMU implementation
> and the host kernel configuration. But maybe that could also be
> vendor-specific, if other architectures don't need to communicate it. 

I think there should be a dedicated query to return HW specific
parmaters for a user page table format. Somehow I think there will be
a lot of these.

So 'user page table format arm smmu v1' can be queried to return its
own unique struct that has everything needed to operate that format of
page table.

Userspace already needs to know how to form that specific HW PTEs,
so processing a HW specific query is not a problem.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Thomas Gleixner
On Wed, Sep 29 2021 at 11:54, Peter Zijlstra wrote:
> On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
>> I think the perfect and the good are a bit confused here. If we go for
>> "good", then we have an mm owning a PASID for its entire lifetime.  If
>> we want "perfect", then we should actually do it right: teach the
>> kernel to update an entire mm's PASID setting all at once.  This isn't
>> *that* hard -- it involves two things:
>> 
>> 1. The context switch code needs to resync PASID.  Unfortunately, this
>> adds some overhead to every context switch, although a static_branch
>> could minimize it for non-PASID users.
>
>> 2. A change to an mm's PASID needs to sent an IPI, but that IPI can't
>> touch FPU state.  So instead the IPI should use task_work_add() to
>> make sure PASID gets resynced.
>
> What do we need 1 for? Any PASID change can be achieved using 2 no?
>
> Basically, call task_work_add() on all relevant tasks [1], then IPI
> spray the current running of those and presto.
>
> [1] it is nigh on impossible to find all tasks sharing an mm in any sane
> way due to CLONE_MM && !CLONE_THREAD.

Why would we want any of that at all?

  Process starts, no PASID assigned.

  bind to device -> PASID is allocated and assigned to the mm

  some task of the process issues ENQCMD -> #GP -> write PASID MSR

  After that the PASID is saved and restored as part of the XSTATE and
  there is no extra overhead in context switch or return to user space.

  All tasks of the process which did never use ENQCMD don't care and their
  PASID xstate is in init state.

There is absolutely no point in enforcing that all tasks of the process
have the PASID activated immediately when it is assigned. If they need
it they get it via the #GP fixup and everything just works.

Looking at that patch again, none of this muck in fpu__pasid_write() is
required at all. The whole exception fixup is:

if (!user_mode(regs))
 return false;

if (!current->mm->pasid)
 return false;

if (current->pasid_activated)
 return false;

wrmsrl(MSR_IA32_PASID, current->mm->pasid);
current->pasid_activated = true;
return true;

There is zero requirement to look at TIF_NEED_FPU_LOAD or
fpregs_state_valid() simply because the #GP comes straight from user
space which means the FPU registers contain the current tasks user space
state.

If TIF_NEED_FPU_LOAD would be set or fpregs_state_valid() would be false
after the user_mode() check then this would simply be a bug somewhere
else and has nothing to do with this PASID fixup.

So no need for magic update_one_xstate_feature() wrappers, no
concurrency concerns, nothing.

It's that simple, really. Anything more complex is just a purely
academic exercise which creates more problems than it solves.

Thanks,

tglx
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 06:41:00AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Wednesday, September 29, 2021 2:01 PM
> > 
> > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the
> > vfio
> > > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is
> > provided
> > > because it's implicitly done when the device fd is closed.
> > >
> > > In concept a vfio device can be bound to multiple iommufds, each hosting
> > > a subset of I/O address spaces attached by this device.
> > 
> > I really feel like this many<->many mapping between devices is going
> > to be super-confusing, and therefore make it really hard to be
> > confident we have all the rules right for proper isolation.
> 
> Based on new discussion on group ownership part (patch06), I feel this
> many<->many relationship will disappear. The context fd (either container
> or iommufd) will uniquely mark the ownership on a physical device and
> its group. With this design it's impractical to have one device bound
> to multiple iommufds. 

That should be a requirement! We have no way to prove that two
iommufds are the same security domain, so devices/groups cannot be
shared.

That is why the API I suggested takes in a struct file to ID the user
security context. A group is accessible only from that single struct
file and no more.

If the first series goes the way I outlined then I think David's
concern about security is strongly solved as the IOMMU layer is
directly managing it with a very clear responsiblity and semantic.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:

> > +struct iommufd_device {
> > +   unsigned int id;
> > +   struct iommufd_ctx *ictx;
> > +   struct device *dev; /* always be the physical device */
> > +   u64 dev_cookie;
> 
> Why do you need both an 'id' and a 'dev_cookie'?  Since they're both
> unique, couldn't you just use the cookie directly as the index into
> the xarray?

ID is the kernel value in the xarray - xarray is much more efficient &
safe with small kernel controlled values.

dev_cookie is a user assigned value that may not be unique. It's
purpose is to allow userspace to receive and event and go back to its
structure. Most likely userspace will store a pointer here, but it is
also possible userspace could not use it.

It is a pretty normal pattern

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 03/20] vfio: Add vfio_[un]register_device()

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 12:46:14PM +1000, da...@gibson.dropbear.id.au wrote:
> On Tue, Sep 21, 2021 at 10:00:14PM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 22, 2021 at 12:54:02AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Wednesday, September 22, 2021 12:01 AM
> > > > 
> > > > >  One open about how to organize the device nodes under
> > > > /dev/vfio/devices/.
> > > > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > > > devname
> > > > > from all kinds of devices. The prerequisite of this model is that 
> > > > > devnames
> > > > > from different bus types are unique formats:
> > > > 
> > > > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > > > 
> > > > The userspace can learn the correct major/minor by inspecting the
> > > > sysfs.
> > > > 
> > > > This whole concept should disappear into the prior patch that adds the
> > > > struct device in the first place, and I think most of the code here
> > > > can be deleted once the struct device is used properly.
> > > > 
> > > 
> > > Can you help elaborate above flow? This is one area where we need
> > > more guidance.
> > > 
> > > When Qemu accepts an option "-device vfio-pci,host=:BB:DD.F",
> > > how does Qemu identify which vifo0/1/... is associated with the specified 
> > > :BB:DD.F? 
> > 
> > When done properly in the kernel the file:
> > 
> > /sys/bus/pci/devices/:BB:DD.F/vfio/vfioX/dev
> > 
> > Will contain the major:minor of the VFIO device.
> > 
> > Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> > that the major:minor matches.
> > 
> > in the above pattern "pci" and ":BB:DD.FF" are the arguments passed
> > to qemu.
> 
> I thought part of the appeal of the device centric model was less
> grovelling around in sysfs for information.  Using type/address
> directly in /dev seems simpler than having to dig around matching
> things here.

I would say more regular grovelling. Starting from a sysfs device
directory and querying the VFIO cdev associated with it is much more
normal than what happens today, which also includes passing sysfs
information into an ioctl :\

> Note that this doesn't have to be done in kernel: you could have the
> kernel just call them /dev/vfio/devices/vfio0, ... but add udev rules
> that create symlinks from say /dev/vfio/pci/:BB:SS.F - >
> ../devices/vfioXX based on the sysfs information.

This is the right approach if people want to do this, but I'm not sure
it is worth it given backwards compat requires the sysfs path as
input. We may as well stick with sysfs as the command line interface
for userspace tools.

And I certainly don't want to see userspace tools trying to reverse a
sysfs path into a /dev/ symlink name when they can directly and
reliably learn the correct cdev from the sysfspath.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 03/20] vfio: Add vfio_[un]register_device()

2021-09-29 Thread Jason Gunthorpe via iommu
On Wed, Sep 29, 2021 at 09:08:25AM +0200, Cornelia Huck wrote:
> On Wed, Sep 29 2021, "Tian, Kevin"  wrote:
> 
> >> From: David Gibson 
> >> Sent: Wednesday, September 29, 2021 10:44 AM
> >> 
> >> > One alternative option is to arrange device nodes in sub-directories 
> >> > based
> >> > on the device type. But doing so also adds one trouble to userspace. The
> >> > current vfio uAPI is designed to have the user query device type via
> >> > VFIO_DEVICE_GET_INFO after opening the device. With this option the user
> >> > instead needs to figure out the device type before opening the device, to
> >> > identify the sub-directory.
> >> 
> >> Wouldn't this be up to the operator / configuration, rather than the
> >> actual software though?  I would assume that typically the VFIO
> >> program would be pointed at a specific vfio device node file to use,
> >> e.g.
> >>my-vfio-prog -d /dev/vfio/pci/:0a:03.1
> >> 
> >> Or more generally, if you're expecting userspace to know a name in a
> >> uniqu pattern, they can equally well know a "type/name" pair.
> >> 
> >
> > You are correct. Currently:
> >
> > -device, vfio-pci,host=:BB:DD.F
> > -device, vfio-pci,sysfdev=/sys/bus/pci/devices/ :BB:DD.F
> > -device, vfio-platform,sysdev=/sys/bus/platform/devices/PNP0103:00
> >
> > above is definitely type/name information to find the related node. 
> >
> > Actually even for Jason's proposal we still need such information to
> > identify the sysfs path.
> >
> > Then I feel type-based sub-directory does work. Adding another link
> > to sysfs sounds unnecessary now. But I'm not sure whether we still
> > want to create /dev/vfio/devices/vfio0 thing and related udev rule
> > thing that you pointed out in another mail.
> 
> Still reading through this whole thread, but type-based subdirectories
> also make the most sense to me. I don't really see userspace wanting to
> grab just any device and then figure out whether it is the device it was
> looking for, but rather immediately go to a specific device or at least
> a device of a specific type.

Even so the kernel should not be creating this, that is a job for
udev and some symlinks

> Sequentially-numbered devices tend to become really unwieldy in my
> experience if you are working on a system with loads of devices.

If the user experiance is always to refer to the sysfs node as Kevin
shows above then the user never sees the integer.

It is very much like how the group number works already, programs
always start at the sysfs, do the readlink thing on iommu_group and
then get the group number to go to /dev/vfio/X

So it is already the case that every piece of software can construct a
sysfs path to the device, we are just changing from
readlink(iommu_group) to readdir(vfio/vfio_device_XX)

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v8 09/12] media: mtk-vcodec: Get rid of mtk_smi_larb_get/put

2021-09-29 Thread Dafna Hirschfeld




On 29.09.21 03:37, Yong Wu wrote:

MediaTek IOMMU has already added the device_link between the consumer
and smi-larb device. If the vcodec device call the pm_runtime_get_sync,
the smi-larb's pm_runtime_get_sync also be called automatically.

CC: Tiffany Lin 
CC: Irui Wang 
Signed-off-by: Yong Wu 
Reviewed-by: Evan Green 
Acked-by: Tiffany Lin 
Reviewed-by: Dafna Hirschfeld 
---
  .../platform/mtk-vcodec/mtk_vcodec_dec_pm.c   | 37 +++-
  .../platform/mtk-vcodec/mtk_vcodec_drv.h  |  3 --
  .../platform/mtk-vcodec/mtk_vcodec_enc.c  |  1 -
  .../platform/mtk-vcodec/mtk_vcodec_enc_pm.c   | 44 +++
  4 files changed, 10 insertions(+), 75 deletions(-)

diff --git a/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c 
b/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
index 6038db96f71c..d0bf9aa3b29d 100644
--- a/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
+++ b/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c
@@ -8,14 +8,12 @@
  #include 
  #include 
  #include 
-#include 
  
  #include "mtk_vcodec_dec_pm.h"

  #include "mtk_vcodec_util.h"
  
  int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev)

  {
-   struct device_node *node;
struct platform_device *pdev;
struct mtk_vcodec_pm *pm;
struct mtk_vcodec_clk *dec_clk;
@@ -26,18 +24,7 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev)
pm = >pm;
pm->mtkdev = mtkdev;
dec_clk = >vdec_clk;
-   node = of_parse_phandle(pdev->dev.of_node, "mediatek,larb", 0);
-   if (!node) {
-   mtk_v4l2_err("of_parse_phandle mediatek,larb fail!");
-   return -1;
-   }
  
-	pdev = of_find_device_by_node(node);

-   of_node_put(node);
-   if (WARN_ON(!pdev)) {
-   return -1;
-   }
-   pm->larbvdec = >dev;
pdev = mtkdev->plat_dev;
pm->dev = >dev;
  
@@ -47,14 +34,11 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev)

dec_clk->clk_info = devm_kcalloc(>dev,
dec_clk->clk_num, sizeof(*clk_info),
GFP_KERNEL);
-   if (!dec_clk->clk_info) {
-   ret = -ENOMEM;
-   goto put_device;
-   }
+   if (!dec_clk->clk_info)
+   return -ENOMEM;
} else {
mtk_v4l2_err("Failed to get vdec clock count");
-   ret = -EINVAL;
-   goto put_device;
+   return -EINVAL;
}
  
  	for (i = 0; i < dec_clk->clk_num; i++) {

@@ -63,29 +47,24 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev)
"clock-names", i, _info->clk_name);
if (ret) {
mtk_v4l2_err("Failed to get clock name id = %d", i);
-   goto put_device;
+   return ret;
}
clk_info->vcodec_clk = devm_clk_get(>dev,
clk_info->clk_name);
if (IS_ERR(clk_info->vcodec_clk)) {
mtk_v4l2_err("devm_clk_get (%d)%s fail", i,
clk_info->clk_name);
-   ret = PTR_ERR(clk_info->vcodec_clk);
-   goto put_device;
+   return PTR_ERR(clk_info->vcodec_clk);
}
}
  
  	pm_runtime_enable(>dev);

return 0;
-put_device:
-   put_device(pm->larbvdec);
-   return ret;
  }
  
  void mtk_vcodec_release_dec_pm(struct mtk_vcodec_dev *dev)

  {
pm_runtime_disable(dev->pm.dev);
-   put_device(dev->pm.larbvdec);
  }


Now that functions only do  'pm_runtime_disable(dev->pm.dev);' so it will be 
more
readable to remove the function mtk_vcodec_release_dec_pm
and replace with pm_runtime_disable(dev->pm.dev);
Same for the 'enc' equivalent.

Thanks,
Dafna

  
  int mtk_vcodec_dec_pw_on(struct mtk_vcodec_pm *pm)

@@ -122,11 +101,6 @@ void mtk_vcodec_dec_clock_on(struct mtk_vcodec_pm *pm)
}
}
  
-	ret = mtk_smi_larb_get(pm->larbvdec);

-   if (ret) {
-   mtk_v4l2_err("mtk_smi_larb_get larbvdec fail %d", ret);
-   goto error;
-   }>   return;
  
  error:

@@ -139,7 +113,6 @@ void mtk_vcodec_dec_clock_off(struct mtk_vcodec_pm *pm)
struct mtk_vcodec_clk *dec_clk = >vdec_clk;
int i = 0;
  
-	mtk_smi_larb_put(pm->larbvdec);

for (i = dec_clk->clk_num - 1; i >= 0; i--)
clk_disable_unprepare(dec_clk->clk_info[i].vcodec_clk);
  }
diff --git a/drivers/media/platform/mtk-vcodec/mtk_vcodec_drv.h 
b/drivers/media/platform/mtk-vcodec/mtk_vcodec_drv.h
index c6c7672fecfb..64b73dd880ce 100644
--- a/drivers/media/platform/mtk-vcodec/mtk_vcodec_drv.h
+++ b/drivers/media/platform/mtk-vcodec/mtk_vcodec_drv.h
@@ -189,10 +189,7 @@ struct mtk_vcodec_clk {
   */
  struct mtk_vcodec_pm {
struct mtk_vcodec_clk   vdec_clk;
-   struct 

Re: [RFC 17/20] iommu/iommufd: Report iova range to userspace

2021-09-29 Thread Jean-Philippe Brucker
On Wed, Sep 29, 2021 at 10:44:01AM +, Liu, Yi L wrote:
> > From: Jean-Philippe Brucker 
> > Sent: Wednesday, September 22, 2021 10:49 PM
> > 
> > On Sun, Sep 19, 2021 at 02:38:45PM +0800, Liu Yi L wrote:
> > > [HACK. will fix in v2]
> > >
> > > IOVA range is critical info for userspace to manage DMA for an I/O address
> > > space. This patch reports the valid iova range info of a given device.
> > >
> > > Due to aforementioned hack, this info comes from the hacked vfio type1
> > > driver. To follow the same format in vfio, we also introduce a cap chain
> > > format in IOMMU_DEVICE_GET_INFO to carry the iova range info.
> > [...]
> > > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > > index 49731be71213..f408ad3c8ade 100644
> > > --- a/include/uapi/linux/iommu.h
> > > +++ b/include/uapi/linux/iommu.h
> > > @@ -68,6 +68,7 @@
> > >   *  +---++
> > >   *  ...
> > >   * @addr_width:the address width of supported I/O address spaces.
> > > + * @cap_offset: Offset within info struct of first cap
> > >   *
> > >   * Availability: after device is bound to iommufd
> > >   */
> > > @@ -77,9 +78,11 @@ struct iommu_device_info {
> > >  #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP  (1 << 0) /* IOMMU
> > enforced snoop */
> > >  #define IOMMU_DEVICE_INFO_PGSIZES(1 << 1) /* supported page
> > sizes */
> > >  #define IOMMU_DEVICE_INFO_ADDR_WIDTH (1 << 2) /*
> > addr_wdith field valid */
> > > +#define IOMMU_DEVICE_INFO_CAPS   (1 << 3) /* info
> > supports cap chain */
> > >   __u64   dev_cookie;
> > >   __u64   pgsize_bitmap;
> > >   __u32   addr_width;
> > > + __u32   cap_offset;
> > 
> > We can also add vendor-specific page table and PASID table properties as
> > capabilities, otherwise we'll need giant unions in the iommu_device_info
> > struct. That made me wonder whether pgsize and addr_width should also
> > be
> > separate capabilities for consistency, but this way might be good enough.
> > There won't be many more generic capabilities. I have "output address
> > width"
> 
> what do you mean by "output address width"? Is it the output address
> of stage-1 translation?

Yes, so the guest knows the size of GPA it can write into the page table.
For Arm SMMU the GPA size is determined by both the SMMU implementation
and the host kernel configuration. But maybe that could also be
vendor-specific, if other architectures don't need to communicate it. 

> >
> and "PASID width", the rest is specific to Arm and SMMU table
> > formats.
> 
> When coming to nested translation support, the stage-1 related info are
> likely to be vendor-specific, and will be reported in cap chain.

Agreed

Thanks,
Jean
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-09-29 Thread Liu, Yi L
> From: Jean-Philippe Brucker 
> Sent: Wednesday, September 22, 2021 9:45 PM
> 
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type
> (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enfore_snoop format. enforce_snoop must be true at this
> point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> 
> Is this the reason for passing addr_width to IOASID_ALLOC?  I didn't get
> what it's used for or why it's mandatory. But for PPC it sounds like it
> should be an address range instead of an upper limit?

yes, as this open described, it may need to be a range. But not sure
if PPC requires multiple ranges or just one range. Perhaps, David may
guide there.

Regards,
Yi Liu
 
> Thanks,
> Jean
> 
> >   This requirement doesn't sound PPC specific, as addr_width for pci
> devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1
> discussion
> >   and then decide how to incorporate it in v2.
> >
> > - Currently ioasid term has already been used in the kernel
> (drivers/iommu/
> >   ioasid.c) to represent the hardware I/O address space ID in the wire. It
> >   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-
> Stream
> >   ID). We need find a way to resolve the naming conflict between the
> hardware
> >   ID and software handle. One option is to rename the existing ioasid to be
> >   pasid or ssid, given their full names still sound generic. Appreciate more
> >   thoughts on this open!
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC 17/20] iommu/iommufd: Report iova range to userspace

2021-09-29 Thread Liu, Yi L
> From: Jean-Philippe Brucker 
> Sent: Wednesday, September 22, 2021 10:49 PM
> 
> On Sun, Sep 19, 2021 at 02:38:45PM +0800, Liu Yi L wrote:
> > [HACK. will fix in v2]
> >
> > IOVA range is critical info for userspace to manage DMA for an I/O address
> > space. This patch reports the valid iova range info of a given device.
> >
> > Due to aforementioned hack, this info comes from the hacked vfio type1
> > driver. To follow the same format in vfio, we also introduce a cap chain
> > format in IOMMU_DEVICE_GET_INFO to carry the iova range info.
> [...]
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 49731be71213..f408ad3c8ade 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -68,6 +68,7 @@
> >   *+---++
> >   *...
> >   * @addr_width:the address width of supported I/O address spaces.
> > + * @cap_offset:   Offset within info struct of first cap
> >   *
> >   * Availability: after device is bound to iommufd
> >   */
> > @@ -77,9 +78,11 @@ struct iommu_device_info {
> >  #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP(1 << 0) /* IOMMU
> enforced snoop */
> >  #define IOMMU_DEVICE_INFO_PGSIZES  (1 << 1) /* supported page
> sizes */
> >  #define IOMMU_DEVICE_INFO_ADDR_WIDTH   (1 << 2) /*
> addr_wdith field valid */
> > +#define IOMMU_DEVICE_INFO_CAPS (1 << 3) /* info
> supports cap chain */
> > __u64   dev_cookie;
> > __u64   pgsize_bitmap;
> > __u32   addr_width;
> > +   __u32   cap_offset;
> 
> We can also add vendor-specific page table and PASID table properties as
> capabilities, otherwise we'll need giant unions in the iommu_device_info
> struct. That made me wonder whether pgsize and addr_width should also
> be
> separate capabilities for consistency, but this way might be good enough.
> There won't be many more generic capabilities. I have "output address
> width"

what do you mean by "output address width"? Is it the output address
of stage-1 translation?

>
and "PASID width", the rest is specific to Arm and SMMU table
> formats.

When coming to nested translation support, the stage-1 related info are
likely to be vendor-specific, and will be reported in cap chain.

Regards,
Yi Liu

> Thanks,
> Jean
> 
> >  };
> >
> >  #define IOMMU_DEVICE_GET_INFO  _IO(IOMMU_TYPE, IOMMU_BASE +
> 1)
> > --
> > 2.25.1
> >
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] [RFC] qcom_scm: hide Kconfig symbol

2021-09-29 Thread Arnd Bergmann
On Wed, Sep 29, 2021 at 11:51 AM Will Deacon  wrote:
> On Mon, Sep 27, 2021 at 05:22:13PM +0200, Arnd Bergmann wrote:
> >
> > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > index 124c41adeca1..989c83acbfee 100644
> > --- a/drivers/iommu/Kconfig
> > +++ b/drivers/iommu/Kconfig
> > @@ -308,7 +308,7 @@ config APPLE_DART
> >  config ARM_SMMU
> >   tristate "ARM Ltd. System MMU (SMMU) Support"
> >   depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64)
> > - depends on QCOM_SCM || !QCOM_SCM #if QCOM_SCM=m this can't be =y
> > + select QCOM_SCM
> >   select IOMMU_API
> >   select IOMMU_IO_PGTABLE_LPAE
> >   select ARM_DMA_USE_IOMMU if ARM
>
> I don't want to get in the way of this patch because I'm also tired of the
> randconfig failures caused by QCOM_SCM. However, ARM_SMMU is applicable to
> a wide variety of (non-qcom) SoCs and so it seems a shame to require the
> QCOM_SCM code to be included for all of those when it's not strictly needed
> at all.

Good point, I agree that needs to be fixed. I think this additional
change should do the trick:

--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -308,7 +308,6 @@ config APPLE_DART
 config ARM_SMMU
tristate "ARM Ltd. System MMU (SMMU) Support"
depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64)
-   select QCOM_SCM
select IOMMU_API
select IOMMU_IO_PGTABLE_LPAE
select ARM_DMA_USE_IOMMU if ARM
@@ -438,7 +437,7 @@ config QCOM_IOMMU
# Note: iommu drivers cannot (yet?) be built as modules
bool "Qualcomm IOMMU Support"
depends on ARCH_QCOM || (COMPILE_TEST && !GENERIC_ATOMIC64)
-   depends on QCOM_SCM=y
+   select QCOM_SCM
select IOMMU_API
select IOMMU_IO_PGTABLE_LPAE
select ARM_DMA_USE_IOMMU

I'll see if that causes any problems for the randconfig builds.

   Arnd
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Peter Zijlstra
On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
> I think the perfect and the good are a bit confused here. If we go for
> "good", then we have an mm owning a PASID for its entire lifetime.  If
> we want "perfect", then we should actually do it right: teach the
> kernel to update an entire mm's PASID setting all at once.  This isn't
> *that* hard -- it involves two things:
> 
> 1. The context switch code needs to resync PASID.  Unfortunately, this
> adds some overhead to every context switch, although a static_branch
> could minimize it for non-PASID users.

> 2. A change to an mm's PASID needs to sent an IPI, but that IPI can't
> touch FPU state.  So instead the IPI should use task_work_add() to
> make sure PASID gets resynced.

What do we need 1 for? Any PASID change can be achieved using 2 no?

Basically, call task_work_add() on all relevant tasks [1], then IPI
spray the current running of those and presto.

[1] it is nigh on impossible to find all tasks sharing an mm in any sane
way due to CLONE_MM && !CLONE_THREAD.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] [RFC] qcom_scm: hide Kconfig symbol

2021-09-29 Thread Will Deacon
On Mon, Sep 27, 2021 at 05:22:13PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> Now that SCM can be a loadable module, we have to add another
> dependency to avoid link failures when ipa or adreno-gpu are
> built-in:
> 
> aarch64-linux-ld: drivers/net/ipa/ipa_main.o: in function `ipa_probe':
> ipa_main.c:(.text+0xfc4): undefined reference to `qcom_scm_is_available'
> 
> ld.lld: error: undefined symbol: qcom_scm_is_available
> >>> referenced by adreno_gpu.c
> >>>   gpu/drm/msm/adreno/adreno_gpu.o:(adreno_zap_shader_load) in 
> >>> archive drivers/built-in.a
> 
> This can happen when CONFIG_ARCH_QCOM is disabled and we don't select
> QCOM_MDT_LOADER, but some other module selects QCOM_SCM. Ideally we'd
> use a similar dependency here to what we have for QCOM_RPROC_COMMON,
> but that causes dependency loops from other things selecting QCOM_SCM.
> 
> This appears to be an endless problem, so try something different this
> time:
> 
>  - CONFIG_QCOM_SCM becomes a hidden symbol that nothing 'depends on'
>but that is simply selected by all of its users
> 
>  - All the stubs in include/linux/qcom_scm.h can go away
> 
>  - arm-smccc.h needs to provide a stub for __arm_smccc_smc() to
>allow compile-testing QCOM_SCM on all architectures.
> 
>  - To avoid a circular dependency chain involving RESET_CONTROLLER
>and PINCTRL_SUNXI, change the 'depends on RESET_CONTROLLER' in
>the latter one to 'select'.
> 
> The last bit is rather annoying, as drivers should generally never
> 'select' another subsystem, and about half the users of the reset
> controller interface do this anyway.
> 
> Nevertheless, this version seems to pass all my randconfig tests
> and is more robust than any of the prior versions.
> 
> Comments?
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/firmware/Kconfig|  4 +-
>  drivers/gpu/drm/msm/Kconfig |  4 +-
>  drivers/iommu/Kconfig   |  2 +-
>  drivers/media/platform/Kconfig  |  2 +-
>  drivers/mmc/host/Kconfig|  2 +-
>  drivers/net/ipa/Kconfig |  1 +
>  drivers/net/wireless/ath/ath10k/Kconfig |  2 +-
>  drivers/pinctrl/qcom/Kconfig|  3 +-
>  drivers/pinctrl/sunxi/Kconfig   |  6 +--
>  include/linux/arm-smccc.h   | 10 
>  include/linux/qcom_scm.h| 71 -
>  11 files changed, 23 insertions(+), 84 deletions(-)
> 
> diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
> index 220a58cf0a44..f7dd82ef0b9c 100644
> --- a/drivers/firmware/Kconfig
> +++ b/drivers/firmware/Kconfig
> @@ -203,9 +203,7 @@ config INTEL_STRATIX10_RSU
> Say Y here if you want Intel RSU support.
>  
>  config QCOM_SCM
> - tristate "Qcom SCM driver"
> - depends on ARM || ARM64
> - depends on HAVE_ARM_SMCCC
> + tristate
>   select RESET_CONTROLLER
>  
>  config QCOM_SCM_DOWNLOAD_MODE_DEFAULT
> diff --git a/drivers/gpu/drm/msm/Kconfig b/drivers/gpu/drm/msm/Kconfig
> index e9c6af78b1d7..3ddf739a6f9b 100644
> --- a/drivers/gpu/drm/msm/Kconfig
> +++ b/drivers/gpu/drm/msm/Kconfig
> @@ -17,7 +17,7 @@ config DRM_MSM
>   select DRM_SCHED
>   select SHMEM
>   select TMPFS
> - select QCOM_SCM if ARCH_QCOM
> + select QCOM_SCM
>   select WANT_DEV_COREDUMP
>   select SND_SOC_HDMI_CODEC if SND_SOC
>   select SYNC_FILE
> @@ -55,7 +55,7 @@ config DRM_MSM_GPU_SUDO
>  
>  config DRM_MSM_HDMI_HDCP
>   bool "Enable HDMI HDCP support in MSM DRM driver"
> - depends on DRM_MSM && QCOM_SCM
> + depends on DRM_MSM
>   default y
>   help
> Choose this option to enable HDCP state machine
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 124c41adeca1..989c83acbfee 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -308,7 +308,7 @@ config APPLE_DART
>  config ARM_SMMU
>   tristate "ARM Ltd. System MMU (SMMU) Support"
>   depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64)
> - depends on QCOM_SCM || !QCOM_SCM #if QCOM_SCM=m this can't be =y
> + select QCOM_SCM
>   select IOMMU_API
>   select IOMMU_IO_PGTABLE_LPAE
>   select ARM_DMA_USE_IOMMU if ARM

I don't want to get in the way of this patch because I'm also tired of the
randconfig failures caused by QCOM_SCM. However, ARM_SMMU is applicable to
a wide variety of (non-qcom) SoCs and so it seems a shame to require the
QCOM_SCM code to be included for all of those when it's not strictly needed
at all.

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 04/20] iommu: Add iommu_device_get_info interface

2021-09-29 Thread Lu Baolu

On 2021/9/29 17:25, Lu Baolu wrote:

Hi David,

On 2021/9/29 10:52, David Gibson wrote:

On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:

From: Lu Baolu

This provides an interface for upper layers to get the per-device iommu
attributes.

 int iommu_device_get_info(struct device *dev,
   enum iommu_devattr attr, void *data);

That fact that this interface doesn't let you know how to size the
data buffer, other than by just knowing the right size for each attr
concerns me.



We plan to address this by following the comments here.

https://lore.kernel.org/linux-iommu/20210921161930.gp327...@nvidia.com/


And Christoph gave another option as well.

https://lore.kernel.org/linux-iommu/20210922050746.ga12...@lst.de/

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 04/20] iommu: Add iommu_device_get_info interface

2021-09-29 Thread Lu Baolu

Hi David,

On 2021/9/29 10:52, David Gibson wrote:

On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:

From: Lu Baolu

This provides an interface for upper layers to get the per-device iommu
attributes.

 int iommu_device_get_info(struct device *dev,
   enum iommu_devattr attr, void *data);

That fact that this interface doesn't let you know how to size the
data buffer, other than by just knowing the right size for each attr
concerns me.



We plan to address this by following the comments here.

https://lore.kernel.org/linux-iommu/20210921161930.gp327...@nvidia.com/

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP

2021-09-29 Thread Peter Zijlstra
On Fri, Sep 24, 2021 at 08:39:24AM -0700, Luck, Tony wrote:

> If you have ctags installed then a ctrl-] on that
> __fixup_pasid_exception() will take you to the function with
> the comments. No electron microscope needed.

I to use ctags, but when reading the #GP handler, this is a whole
different file. Also, I don't find any of those comments explaining the
not-our-#GP-but-harmless-cycle issue.

The current->has_valid_pasid one comes close, but just misses it. But
really the place to put this is in the #GP handler itself so we don't
have to dig through every call there to figure out how it's supposed to
work.

> +
> +/*
> + * Try to figure out if there is a PASID MSR value to propagate to the
> + * thread taking the #GP.
> + */
> +bool __fixup_pasid_exception(void)
> +{
> +   u32 pasid;
> +
> +   /*
> +* This function is called only when this #GP was triggered from user
> +* space. So the mm cannot be NULL.
> +*/
> +   pasid = current->mm->pasid;
> +
> +   /* If no PASID is allocated, there is nothing to propagate. */
> +   if (pasid == PASID_DISABLED)
> +   return false;
> +
> +   /*
> +* If the current task already has a valid PASID MSR, then the #GP
> +* fault must be for some non-ENQCMD related reason.
> +*/
> +   if (current->has_valid_pasid)
> +   return false;
> +
> +   /* Fix up the MSR by the PASID in the mm. */
> +   fpu__pasid_write(pasid);
> +   current->has_valid_pasid = 1;
> +
> +   return true;
> +}
> 
> -Tony
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-29 Thread Tian, Kevin
+Robin.

> From: Jason Gunthorpe 
> Sent: Thursday, September 23, 2021 8:22 PM
> 
> On Thu, Sep 23, 2021 at 12:05:29PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, September 23, 2021 7:27 PM
> > >
> > > On Thu, Sep 23, 2021 at 11:15:24AM +0100, Jean-Philippe Brucker wrote:
> > >
> > > > So we can only tell userspace "No_snoop is not supported" (provided
> we
> > > > even want to allow them to enable No_snoop). Users in control of
> stage-1
> > > > tables can create non-cacheable mappings through MAIR attributes.
> > >
> > > My point is that ARM is using IOMMU_CACHE to control the overall
> > > cachability of the DMA
> > >
> > > ie not specifying IOMMU_CACHE requires using the arch specific DMA
> > > cache flushers.
> > >
> > > Intel never uses arch specifc DMA cache flushers, and instead is
> > > abusing IOMMU_CACHE to mean IOMMU_BLOCK_NO_SNOOP on DMA
> that
> > > is always
> > > cachable.
> >
> > it uses IOMMU_CACHE to force all DMAs to snoop, including those which
> > has non_snoop flag and wouldn't snoop cache if iommu is disabled.
> Nothing
> > is blocked.
> 
> I see it differently, on Intel the only way to bypass the cache with
> DMA is to specify the no-snoop bit in the TLP. The IOMMU PTE flag we
> are talking about tells the IOMMU to ignore the no snoop bit.
> 
> Again, Intel arch in the kernel does not support the DMA cache flush
> arch API and *DOES NOT* support incoherent DMA at all.
> 
> ARM *does* implement the DMA cache flush arch API and is using
> IOMMU_CACHE to control if the caller will, or will not call the cache
> flushes.

I still didn't fully understand this point after reading the code. Looking
at dma-iommu its cache flush functions are all coded with below as
the first check:

if (dev_is_dma_coherent(dev) && !dev_is_untrusted(dev))
return;

dev->dma_coherent is initialized upon firmware info, not decided by 
IOMMU_CACHE.

i.e. it's not IOMMU_CACHE to decide whether cache flushes should
be called.

Probably the confusion comes from __iommu_dma_alloc_noncontiguous:

if (!(ioprot & IOMMU_CACHE)) {
struct scatterlist *sg;
int i;

for_each_sg(sgt->sgl, sg, sgt->orig_nents, i)
arch_dma_prep_coherent(sg_page(sg), sg->length);
}

Here it makes more sense to be if (!coherent) {}.

with above being corrected, I think all iommu drivers do associate 
IOMMU_CACHE to the snoop aspect:

Intel:
- either force snooping by ignoring snoop bit in TLP (IOMMU_CACHE)
- or has snoop decided by TLP (!IOMMU_CACHE)

ARM:
- set to snoop format if IOMMU_CACHE
- set to nonsnoop format if !IOMMU_CACHE
(in both cases TLP snoop bit is ignored?)

Other archs
- ignore IOMMU_CACHE as cache is always snooped via their IOMMUs

> 
> This is fundamentally different from what Intel is using it for.
> 
> > but why do you call it abuse? IOMMU_CACHE was first introduced for
> > Intel platform:
> 
> IMHO ARM changed the meaning when Robin linked IOMMU_CACHE to
> dma_is_coherent stuff. At that point it became linked to 'do I need to
> call arch cache flushers or not'.
> 

I didn't identify the exact commit for above meaning change.

Robin, could you help share some thoughts here?

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions

2021-09-29 Thread Lu Baolu

Hi Fenghua,

On 2021/9/21 3:23, Fenghua Yu wrote:

update_pasid() and its call chain are currently unused in the tree because
Thomas disabled the ENQCMD feature. The feature will be re-enabled shortly
using a different approach and update_pasid() and its call chain will not
be used in the new approach.

Remove the useless functions.

Signed-off-by: Fenghua Yu 
Reviewed-by: Tony Luck 


Thanks for this cleanup. I have queued it for v5.16.

Best regards,
baolu


---
  arch/x86/include/asm/fpu/api.h |  2 --
  drivers/iommu/intel/svm.c  | 24 +---
  2 files changed, 1 insertion(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 23bef08a8388..ca4d0dee1ecd 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -106,6 +106,4 @@ extern int cpu_has_xfeatures(u64 xfeatures_mask, const char 
**feature_name);
   */
  #define PASID_DISABLED0
  
-static inline void update_pasid(void) { }

-
  #endif /* _ASM_X86_FPU_API_H */
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 0c228787704f..5b5d69b04fcc 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -505,21 +505,6 @@ int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
return ret;
  }
  
-static void _load_pasid(void *unused)

-{
-   update_pasid();
-}
-
-static void load_pasid(struct mm_struct *mm, u32 pasid)
-{
-   mutex_lock(>context.lock);
-
-   /* Update PASID MSR on all CPUs running the mm's tasks. */
-   on_each_cpu_mask(mm_cpumask(mm), _load_pasid, NULL, true);
-
-   mutex_unlock(>context.lock);
-}
-
  static int intel_svm_alloc_pasid(struct device *dev, struct mm_struct *mm,
 unsigned int flags)
  {
@@ -614,10 +599,6 @@ static struct iommu_sva *intel_svm_bind_mm(struct 
intel_iommu *iommu,
if (ret)
goto free_sdev;
  
-	/* The newly allocated pasid is loaded to the mm. */

-   if (!(flags & SVM_FLAG_SUPERVISOR_MODE) && list_empty(>devs))
-   load_pasid(mm, svm->pasid);
-
list_add_rcu(>list, >devs);
  success:
return >sva;
@@ -670,11 +651,8 @@ static int intel_svm_unbind_mm(struct device *dev, u32 
pasid)
kfree_rcu(sdev, rcu);
  
  			if (list_empty(>devs)) {

-   if (svm->notifier.ops) {
+   if (svm->notifier.ops)
mmu_notifier_unregister(>notifier, 
mm);
-   /* Clear mm's pasid. */
-   load_pasid(mm, PASID_DISABLED);
-   }
pasid_private_remove(svm->pasid);
/* We mandate that no page faults may be 
outstanding
 * for the PASID when intel_svm_unbind_mm() is 
called.


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

2021-09-29 Thread Tian, Kevin
> From: David Gibson
> Sent: Wednesday, September 29, 2021 2:35 PM
> 
> On Wed, Sep 29, 2021 at 05:38:56AM +, Tian, Kevin wrote:
> > > From: David Gibson 
> > > Sent: Wednesday, September 29, 2021 12:56 PM
> > >
> > > >
> > > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > > as the contract to get security context established (and vice versa
> > > > for unbinding). One additional requirement in iommufd is to manage
> the
> > > > switch between multiple security contexts due to decoupled
> bind/attach:
> > > >
> > > > 1)  Open a device in "/dev/vfio/devices" with user access blocked;
> > >
> > > Probably worth clarifying that (1) must happen for *all* devices in
> > > the group before (2) happens for any device in the group.
> >
> > No. User access is naturally blocked for other devices as long as they
> > are not opened yet.
> 
> Uh... my point is that everything in the group has to be removed from
> regular kernel drivers before we reach step (2).  Is the plan that you
> must do that before you can even open them?  That's a reasonable
> choice, but then I think you should show that step in this description
> as well.

Agree. I think below proposal can meet above requirement and ensure
it's not broken in the whole process when the group is operated by the
userspace:

https://lore.kernel.org/kvm/20210928140712.gl964...@nvidia.com/

and definitely an updated description will be provided when sending out
the new proposal.

> 
> > > > 2)  Bind the device to an iommufd with an initial security context
> > > > (an empty iommu domain which blocks dma) established for its
> > > > group, with user access unblocked;
> > > >
> > > > 3)  Attach the device to a user-specified ioasid (shared by all devices
> > > > attached to this ioasid). Before attaching, the device should be 
> > > > first
> > > > detached from the initial context;
> > >
> > > So, this step can implicitly but observably change the behaviour for
> > > other devices in the group as well.  I don't love that kind of
> > > difficult to predict side effect, which is why I'm *still* not totally
> > > convinced by the device-centric model.
> >
> > which side-effect is predicted here? The user anyway needs to be
> > aware of such group restriction regardless whether it uses group
> > or nongroup interface.
> 
> Yes, exactly.  And with a group interface it's obvious it has to
> understand it.  With the non-group interface, you can get to this
> stage in ignorance of groups.  It will even work as long as you are
> lucky enough only to try with singleton-group devices.  Then you try
> it with two devices in the one group and doing (3) on device A will
> implicitly change the DMA environment of device B.

for non-group we can also document it obviously in uAPI that the user
must understand group restriction and violating it will get failure
when attaching to different IOAS's for devices in the same group.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 1/1] iommu/vt-d: Delete dev_has_feat callback

2021-09-29 Thread Lu Baolu
The commit 262948f8ba573 ("iommu: Delete iommu_dev_has_feature()") has
deleted the iommu_dev_has_feature() interface. Remove the dev_has_feat
callback to avoid dead code.

Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 59 -
 1 file changed, 5 insertions(+), 54 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 01162b0b9712..dc733de9eea5 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5516,62 +5516,14 @@ static int intel_iommu_disable_sva(struct device *dev)
return ret;
 }
 
-/*
- * A PCI express designated vendor specific extended capability is defined
- * in the section 3.7 of Intel scalable I/O virtualization technical spec
- * for system software and tools to detect endpoint devices supporting the
- * Intel scalable IO virtualization without host driver dependency.
- *
- * Returns the address of the matching extended capability structure within
- * the device's PCI configuration space or 0 if the device does not support
- * it.
- */
-static int siov_find_pci_dvsec(struct pci_dev *pdev)
-{
-   int pos;
-   u16 vendor, id;
-
-   pos = pci_find_next_ext_capability(pdev, 0, 0x23);
-   while (pos) {
-   pci_read_config_word(pdev, pos + 4, );
-   pci_read_config_word(pdev, pos + 8, );
-   if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
-   return pos;
-
-   pos = pci_find_next_ext_capability(pdev, pos, 0x23);
-   }
-
-   return 0;
-}
-
-static bool
-intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
+static int intel_iommu_enable_iopf(struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
 
-   if (feat == IOMMU_DEV_FEAT_AUX) {
-   int ret;
-
-   if (!dev_is_pci(dev) || dmar_disabled ||
-   !scalable_mode_support() || !pasid_mode_support())
-   return false;
-
-   ret = pci_pasid_features(to_pci_dev(dev));
-   if (ret < 0)
-   return false;
-
-   return !!siov_find_pci_dvsec(to_pci_dev(dev));
-   }
-
-   if (feat == IOMMU_DEV_FEAT_IOPF)
-   return info && info->pri_supported;
-
-   if (feat == IOMMU_DEV_FEAT_SVA)
-   return info && (info->iommu->flags & VTD_FLAG_SVM_CAPABLE) &&
-   info->pasid_supported && info->pri_supported &&
-   info->ats_supported;
+   if (info && info->pri_supported)
+   return 0;
 
-   return false;
+   return -ENODEV;
 }
 
 static int
@@ -5582,7 +5534,7 @@ intel_iommu_dev_enable_feat(struct device *dev, enum 
iommu_dev_features feat)
return intel_iommu_enable_auxd(dev);
 
case IOMMU_DEV_FEAT_IOPF:
-   return intel_iommu_dev_has_feat(dev, feat) ? 0 : -ENODEV;
+   return intel_iommu_enable_iopf(dev);
 
case IOMMU_DEV_FEAT_SVA:
return intel_iommu_enable_sva(dev);
@@ -5708,7 +5660,6 @@ const struct iommu_ops intel_iommu_ops = {
.get_resv_regions   = intel_iommu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
.device_group   = intel_iommu_device_group,
-   .dev_has_feat   = intel_iommu_dev_has_feat,
.dev_feat_enabled   = intel_iommu_dev_feat_enabled,
.dev_enable_feat= intel_iommu_dev_enable_feat,
.dev_disable_feat   = intel_iommu_dev_disable_feat,
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

2021-09-29 Thread David Gibson
On Wed, Sep 29, 2021 at 05:38:56AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Wednesday, September 29, 2021 12:56 PM
> > 
> > >
> > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > as the contract to get security context established (and vice versa
> > > for unbinding). One additional requirement in iommufd is to manage the
> > > switch between multiple security contexts due to decoupled bind/attach:
> > >
> > > 1)  Open a device in "/dev/vfio/devices" with user access blocked;
> > 
> > Probably worth clarifying that (1) must happen for *all* devices in
> > the group before (2) happens for any device in the group.
> 
> No. User access is naturally blocked for other devices as long as they
> are not opened yet.

Uh... my point is that everything in the group has to be removed from
regular kernel drivers before we reach step (2).  Is the plan that you
must do that before you can even open them?  That's a reasonable
choice, but then I think you should show that step in this description
as well.

> > > 2)  Bind the device to an iommufd with an initial security context
> > > (an empty iommu domain which blocks dma) established for its
> > > group, with user access unblocked;
> > >
> > > 3)  Attach the device to a user-specified ioasid (shared by all devices
> > > attached to this ioasid). Before attaching, the device should be first
> > > detached from the initial context;
> > 
> > So, this step can implicitly but observably change the behaviour for
> > other devices in the group as well.  I don't love that kind of
> > difficult to predict side effect, which is why I'm *still* not totally
> > convinced by the device-centric model.
> 
> which side-effect is predicted here? The user anyway needs to be
> aware of such group restriction regardless whether it uses group
> or nongroup interface.

Yes, exactly.  And with a group interface it's obvious it has to
understand it.  With the non-group interface, you can get to this
stage in ignorance of groups.  It will even work as long as you are
lucky enough only to try with singleton-group devices.  Then you try
it with two devices in the one group and doing (3) on device A will
implicitly change the DMA environment of device B.

(or at least, it will if they share a group because they don't have
distinguishable RIDs.  That's not the only multi-device group case,
but it's one of them).

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [RFC 03/20] vfio: Add vfio_[un]register_device()

2021-09-29 Thread Cornelia Huck
On Wed, Sep 29 2021, "Tian, Kevin"  wrote:

>> From: David Gibson 
>> Sent: Wednesday, September 29, 2021 10:44 AM
>> 
>> > One alternative option is to arrange device nodes in sub-directories based
>> > on the device type. But doing so also adds one trouble to userspace. The
>> > current vfio uAPI is designed to have the user query device type via
>> > VFIO_DEVICE_GET_INFO after opening the device. With this option the user
>> > instead needs to figure out the device type before opening the device, to
>> > identify the sub-directory.
>> 
>> Wouldn't this be up to the operator / configuration, rather than the
>> actual software though?  I would assume that typically the VFIO
>> program would be pointed at a specific vfio device node file to use,
>> e.g.
>>  my-vfio-prog -d /dev/vfio/pci/:0a:03.1
>> 
>> Or more generally, if you're expecting userspace to know a name in a
>> uniqu pattern, they can equally well know a "type/name" pair.
>> 
>
> You are correct. Currently:
>
> -device, vfio-pci,host=:BB:DD.F
> -device, vfio-pci,sysfdev=/sys/bus/pci/devices/ :BB:DD.F
> -device, vfio-platform,sysdev=/sys/bus/platform/devices/PNP0103:00
>
> above is definitely type/name information to find the related node. 
>
> Actually even for Jason's proposal we still need such information to
> identify the sysfs path.
>
> Then I feel type-based sub-directory does work. Adding another link
> to sysfs sounds unnecessary now. But I'm not sure whether we still
> want to create /dev/vfio/devices/vfio0 thing and related udev rule
> thing that you pointed out in another mail.

Still reading through this whole thread, but type-based subdirectories
also make the most sense to me. I don't really see userspace wanting to
grab just any device and then figure out whether it is the device it was
looking for, but rather immediately go to a specific device or at least
a device of a specific type.

Sequentially-numbered devices tend to become really unwieldy in my
experience if you are working on a system with loads of devices.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD

2021-09-29 Thread Tian, Kevin
> From: David Gibson 
> Sent: Wednesday, September 29, 2021 2:01 PM
> 
> On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the
> vfio
> > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is
> provided
> > because it's implicitly done when the device fd is closed.
> >
> > In concept a vfio device can be bound to multiple iommufds, each hosting
> > a subset of I/O address spaces attached by this device.
> 
> I really feel like this many<->many mapping between devices is going
> to be super-confusing, and therefore make it really hard to be
> confident we have all the rules right for proper isolation.

Based on new discussion on group ownership part (patch06), I feel this
many<->many relationship will disappear. The context fd (either container
or iommufd) will uniquely mark the ownership on a physical device and
its group. With this design it's impractical to have one device bound
to multiple iommufds. Actually I don't think this is a compelling usage
in reality. The previous rationale was that no need to impose such restriction
if no special reason... and now we have a reason. 

Jason, are you OK with this simplification?

> 
> That's why I was suggesting a concept like endpoints, to break this
> into two many<->one relationships.  I'm ok if that isn't visible in
> the user API, but I think this is going to be really hard to keep
> track of if it isn't explicit somewhere in the internals.
> 

I think this endpoint concept is represented by ioas_device_info in
patch14:

+/*
+ * An ioas_device_info object is created per each successful attaching
+ * request. A list of objects are maintained per ioas when the address
+ * space is shared by multiple devices.
+ */
+struct ioas_device_info {
+   struct iommufd_device *idev;
+   struct list_head next;
 };

currently it's 1:1 mapping before this object and iommufd_device, 
because no pasid support yet.

We can rename it to struct ioas_endpoint if it makes you feel better.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

2021-09-29 Thread David Gibson
On Sun, Sep 19, 2021 at 02:38:35PM +0800, Liu Yi L wrote:
> Under the /dev/iommu model, iommufd provides the interface for I/O page
> tables management such as dma map/unmap. However, it cannot work
> independently since the device is still owned by the device-passthrough
> frameworks (VFIO, vDPA, etc.) and vice versa. Device-passthrough frameworks
> should build a connection between its device and the iommufd to delegate
> the I/O page table management affairs to iommufd.
> 
> This patch introduces iommufd_[un]bind_device() helpers for the device-
> passthrough framework to build such connection. The helper functions then
> invoke iommu core (iommu_device_init/exit_user_dma()) to establish/exit
> security context for the bound device. Each successfully bound device is
> internally tracked by an iommufd_device object. This object is returned
> to the caller for subsequent attaching operations on the device as well.
> 
> The caller should pass a user-provided cookie to mark the device in the
> iommufd. Later this cookie will be used to represent the device in iommufd
> uAPI, e.g. when querying device capabilities or handling per-device I/O
> page faults. One alternative is to have iommufd allocate a device label
> and return to the user. Either way works, but cookie is slightly preferred
> per earlier discussion as it may allow the user to inject faults slightly
> faster without ID->vRID lookup.
> 
> iommu_[un]bind_device() functions are only used for physical devices. Other
> variants will be introduced in the future, e.g.:
> 
> -  iommu_[un]bind_device_pasid() for mdev/subdev which requires pasid granular
>DMA isolation;
> -  iommu_[un]bind_sw_mdev() for sw mdev which relies on software measures
>instead of iommu to isolate DMA;
> 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/iommu/iommufd/iommufd.c | 160 +++-
>  include/linux/iommufd.h |  38 
>  2 files changed, 196 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/iommufd.h
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 710b7e62988b..e16ca21e4534 100644
> --- a/drivers/iommu/iommufd/iommufd.c
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -16,10 +16,30 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  
>  /* Per iommufd */
>  struct iommufd_ctx {
>   refcount_t refs;
> + struct mutex lock;
> + struct xarray device_xa; /* xarray of bound devices */
> +};
> +
> +/*
> + * A iommufd_device object represents the binding relationship
> + * between iommufd and device. It is created per a successful
> + * binding request from device driver. The bound device must be
> + * a physical device so far. Subdevice will be supported later
> + * (with additional PASID information). An user-assigned cookie
> + * is also recorded to mark the device in the /dev/iommu uAPI.
> + */
> +struct iommufd_device {
> + unsigned int id;
> + struct iommufd_ctx *ictx;
> + struct device *dev; /* always be the physical device */
> + u64 dev_cookie;

Why do you need both an 'id' and a 'dev_cookie'?  Since they're both
unique, couldn't you just use the cookie directly as the index into
the xarray?

>  };
>  
>  static int iommufd_fops_open(struct inode *inode, struct file *filep)
> @@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct 
> file *filep)
>   return -ENOMEM;
>  
>   refcount_set(>refs, 1);
> + mutex_init(>lock);
> + xa_init_flags(>device_xa, XA_FLAGS_ALLOC);
>   filep->private_data = ictx;
>  
>   return ret;
>  }
>  
> +static void iommufd_ctx_get(struct iommufd_ctx *ictx)
> +{
> + refcount_inc(>refs);
> +}
> +
> +static const struct file_operations iommufd_fops;
> +
> +/**
> + * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
> + * @fd: [in] iommufd file descriptor.
> + *
> + * Returns a pointer to the iommufd context, otherwise NULL;
> + *
> + */
> +static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
> +{
> + struct fd f = fdget(fd);
> + struct file *file = f.file;
> + struct iommufd_ctx *ictx;
> +
> + if (!file)
> + return NULL;
> +
> + if (file->f_op != _fops)
> + return NULL;
> +
> + ictx = file->private_data;
> + if (ictx)
> + iommufd_ctx_get(ictx);
> + fdput(f);
> + return ictx;
> +}
> +
> +/**
> + * iommufd_ctx_put - Releases a reference to the internal iommufd context.
> + * @ictx: [in] Pointer to iommufd context.
> + *
> + */
>  static void iommufd_ctx_put(struct iommufd_ctx *ictx)
>  {
> - if (refcount_dec_and_test(>refs))
> - kfree(ictx);
> + if (!refcount_dec_and_test(>refs))
> + return;
> +
> + WARN_ON(!xa_empty(>device_xa));
> + kfree(ictx);
>  }
>  
>  static int iommufd_fops_release(struct inode *inode, struct file *filep)
> @@ -86,6 +149,99 @@ static struct miscdevice iommu_misc_dev = {
>   

Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-29 Thread da...@gibson.dropbear.id.au
On Wed, Sep 22, 2021 at 09:41:50AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 22, 2021 at 03:30:09AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, September 22, 2021 1:41 AM
> > > 
> > > On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> > > > After a device is bound to the iommufd, userspace can use this interface
> > > > to query the underlying iommu capability and format info for this 
> > > > device.
> > > > Based on this information the user then creates I/O address space in a
> > > > compatible format with the to-be-attached devices.
> > > >
> > > > Device cookie which is registered at binding time is used to mark the
> > > > device which is being queried here.
> > > >
> > > > Signed-off-by: Liu Yi L 
> > > >  drivers/iommu/iommufd/iommufd.c | 68
> > > +
> > > >  include/uapi/linux/iommu.h  | 49 
> > > >  2 files changed, 117 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > > b/drivers/iommu/iommufd/iommufd.c
> > > > index e16ca21e4534..641f199f2d41 100644
> > > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > > @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode
> > > *inode, struct file *filep)
> > > > return 0;
> > > >  }
> > > >
> > > > +static struct device *
> > > > +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64
> > > dev_cookie)
> > > > +{
> > > 
> > > We have an xarray ID for the device, why are we allowing userspace to
> > > use the dev_cookie as input?
> > > 
> > > Userspace should always pass in the ID. The only place dev_cookie
> > > should appear is if the kernel generates an event back to
> > > userspace. Then the kernel should return both the ID and the
> > > dev_cookie in the event to allow userspace to correlate it.
> > > 
> > 
> > A little background.
> > 
> > In earlier design proposal we discussed two options. One is to return
> > an kernel-allocated ID (label) to userspace. The other is to have user
> > register a cookie and use it in iommufd uAPI. At that time the two
> > options were discussed exclusively and the cookie one is preferred.
> > 
> > Now you instead recommended a mixed option. We can follow it for
> > sure if nobody objects.
> 
> Either or for the return is fine, I'd return both just because it is
> more flexable
> 
> But the cookie should never be an input from userspace, and the kernel
> should never search for it. Locating the kernel object is what the ID
> and xarray is for.

Why do we need two IDs at all?  Can't we just use the cookie as the
sole ID?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD

2021-09-29 Thread David Gibson
On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the vfio
> device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is provided
> because it's implicitly done when the device fd is closed.
> 
> In concept a vfio device can be bound to multiple iommufds, each hosting
> a subset of I/O address spaces attached by this device.

I really feel like this many<->many mapping between devices is going
to be super-confusing, and therefore make it really hard to be
confident we have all the rules right for proper isolation.

That's why I was suggesting a concept like endpoints, to break this
into two many<->one relationships.  I'm ok if that isn't visible in
the user API, but I think this is going to be really hard to keep
track of if it isn't explicit somewhere in the internals.

> However as a
> starting point (matching current vfio), only one I/O address space is
> supported per vfio device. It implies one device can only be attached
> to one iommufd at this point.
> 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/pci/Kconfig|  1 +
>  drivers/vfio/pci/vfio_pci.c | 72 -
>  drivers/vfio/pci/vfio_pci_private.h |  8 
>  include/uapi/linux/vfio.h   | 30 
>  4 files changed, 110 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 5e2e1b9a9fd3..3abfb098b4dc 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -5,6 +5,7 @@ config VFIO_PCI
>   depends on MMU
>   select VFIO_VIRQFD
>   select IRQ_BYPASS_MANAGER
> + select IOMMUFD
>   help
> Support for the PCI VFIO bus driver.  This is required to make
> use of PCI drivers using the VFIO framework.
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 145addde983b..20006bb66430 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -552,6 +552,16 @@ static void vfio_pci_release(struct vfio_device 
> *core_vdev)
>   vdev->req_trigger = NULL;
>   }
>   mutex_unlock(>igate);
> +
> + mutex_lock(>videv_lock);
> + if (vdev->videv) {
> + struct vfio_iommufd_device *videv = vdev->videv;
> +
> + vdev->videv = NULL;
> + iommufd_unbind_device(videv->idev);
> + kfree(videv);
> + }
> + mutex_unlock(>videv_lock);
>   }
>  
>   mutex_unlock(>reflck->lock);
> @@ -780,7 +790,66 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
>   container_of(core_vdev, struct vfio_pci_device, vdev);
>   unsigned long minsz;
>  
> - if (cmd == VFIO_DEVICE_GET_INFO) {
> + if (cmd == VFIO_DEVICE_BIND_IOMMUFD) {
> + struct vfio_device_iommu_bind_data bind_data;
> + unsigned long minsz;
> + struct iommufd_device *idev;
> + struct vfio_iommufd_device *videv;
> +
> + /*
> +  * Reject the request if the device is already opened and
> +  * attached to a container.
> +  */
> + if (vfio_device_in_container(core_vdev))

Usually one would do argument sanity checks before checks that
actually depend on machine state.

> + return -ENOTTY;

This doesn't seem like the right error code.  It's a perfectly valid
operation for this device - just not available right now.

> +
> + minsz = offsetofend(struct vfio_device_iommu_bind_data, 
> dev_cookie);
> +
> + if (copy_from_user(_data, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (bind_data.argsz < minsz ||
> + bind_data.flags || bind_data.iommu_fd < 0)
> + return -EINVAL;
> +
> + mutex_lock(>videv_lock);
> + /*
> +  * Allow only one iommufd per device until multiple
> +  * address spaces (e.g. vSVA) support is introduced
> +  * in the future.
> +  */
> + if (vdev->videv) {
> + mutex_unlock(>videv_lock);
> + return -EBUSY;
> + }
> +
> + idev = iommufd_bind_device(bind_data.iommu_fd,
> +>pdev->dev,
> +bind_data.dev_cookie);
> + if (IS_ERR(idev)) {
> + mutex_unlock(>videv_lock);
> + return PTR_ERR(idev);
> + }
> +
> + videv = kzalloc(sizeof(*videv), GFP_KERNEL);
> + if (!videv) {
> + iommufd_unbind_device(idev);
> + mutex_unlock(>videv_lock);
> + return -ENOMEM;
> + }
> + videv->idev = idev;
> + videv->iommu_fd = 

Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-09-29 Thread David Gibson
On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> After a device is bound to the iommufd, userspace can use this interface
> to query the underlying iommu capability and format info for this device.
> Based on this information the user then creates I/O address space in a
> compatible format with the to-be-attached devices.
> 
> Device cookie which is registered at binding time is used to mark the
> device which is being queried here.
> 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/iommu/iommufd/iommufd.c | 68 +
>  include/uapi/linux/iommu.h  | 49 
>  2 files changed, 117 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index e16ca21e4534..641f199f2d41 100644
> --- a/drivers/iommu/iommufd/iommufd.c
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, 
> struct file *filep)
>   return 0;
>  }
>  
> +static struct device *
> +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
> +{
> + struct iommufd_device *idev;
> + struct device *dev = NULL;
> + unsigned long index;
> +
> + mutex_lock(>lock);
> + xa_for_each(>device_xa, index, idev) {
> + if (idev->dev_cookie == dev_cookie) {
> + dev = idev->dev;
> + break;
> + }
> + }
> + mutex_unlock(>lock);
> +
> + return dev;
> +}
> +
> +static void iommu_device_build_info(struct device *dev,
> + struct iommu_device_info *info)
> +{
> + bool snoop;
> + u64 awidth, pgsizes;
> +
> + if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, ))
> + info->flags |= snoop ? IOMMU_DEVICE_INFO_ENFORCE_SNOOP : 0;
> +
> + if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, )) {
> + info->pgsize_bitmap = pgsizes;
> + info->flags |= IOMMU_DEVICE_INFO_PGSIZES;
> + }
> +
> + if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, )) {
> + info->addr_width = awidth;
> + info->flags |= IOMMU_DEVICE_INFO_ADDR_WIDTH;
> + }
> +}
> +
> +static int iommufd_get_device_info(struct iommufd_ctx *ictx,
> +unsigned long arg)
> +{
> + struct iommu_device_info info;
> + unsigned long minsz;
> + struct device *dev;
> +
> + minsz = offsetofend(struct iommu_device_info, addr_width);
> +
> + if (copy_from_user(, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + info.flags = 0;
> +
> + dev = iommu_find_device_from_cookie(ictx, info.dev_cookie);
> + if (!dev)
> + return -EINVAL;
> +
> + iommu_device_build_info(dev, );
> +
> + return copy_to_user((void __user *)arg, , minsz) ? -EFAULT : 0;
> +}
> +
>  static long iommufd_fops_unl_ioctl(struct file *filep,
>  unsigned int cmd, unsigned long arg)
>  {
> @@ -127,6 +192,9 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
>   return ret;
>  
>   switch (cmd) {
> + case IOMMU_DEVICE_GET_INFO:
> + ret = iommufd_get_device_info(ictx, arg);
> + break;
>   default:
>   pr_err_ratelimited("unsupported cmd %u\n", cmd);
>   break;
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 59178fc229ca..76b71f9d6b34 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -7,6 +7,55 @@
>  #define _UAPI_IOMMU_H
>  
>  #include 
> +#include 
> +
> +/*  IOCTLs for IOMMU file descriptor (/dev/iommu)  */
> +
> +#define IOMMU_TYPE   (';')
> +#define IOMMU_BASE   100
> +
> +/*
> + * IOMMU_DEVICE_GET_INFO - _IOR(IOMMU_TYPE, IOMMU_BASE + 1,
> + *   struct iommu_device_info)
> + *
> + * Check IOMMU capabilities and format information on a bound device.
> + *
> + * The device is identified by device cookie (registered when binding
> + * this device).
> + *
> + * @argsz:  user filled size of this data.
> + * @flags:  tells userspace which capability info is available
> + * @dev_cookie: user assinged cookie.
> + * @pgsize_bitmap: Bitmap of supported page sizes. 1-setting of the
> + *  bit in pgsize_bitmap[63:12] indicates a supported
> + *  page size. Details as below table:
> + *
> + *  +===++
> + *  |  Bit[index]   |  Page Size |
> + *  +---++
> + *  |  12   |  4 KB  |
> + *  +---++
> + *  |  13   |  8 KB  |
> + *  +---++
> + *  |  14   |  16 KB |
> + *  +---++
> + *  ...
> + *