date:20220615

Re: [PATCH v9 2/3] iommu/mediatek: Rename MTK_IOMMU_TLB_ADDR to MTK_IOMMU_ADDR

2022-06-15 Thread yf.wang--- via iommu

On Wed, 2022-06-15 at 18:14 +0100, Robin Murphy wrote:
> On 2022-06-15 17:12, yf.wang--- via iommu wrote:
> > From: Yunfei Wang 
> > 
> > Rename MTK_IOMMU_TLB_ADDR to MTK_IOMMU_ADDR, and update
> > MTK_IOMMU_ADDR
> > definition for better generality.
> > 
> > Signed-off-by: Ning Li 
> > Signed-off-by: Yunfei Wang 
> > ---
> >   drivers/iommu/mtk_iommu.c | 8 
> >   1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> > index bb9dd92c9898..3d62399e8865 100644
> > --- a/drivers/iommu/mtk_iommu.c
> > +++ b/drivers/iommu/mtk_iommu.c
> > @@ -265,8 +265,8 @@ static const struct iommu_ops mtk_iommu_ops;
> >   
> >   static int mtk_iommu_hw_init(const struct mtk_iommu_data *data,
> > unsigned int bankid);
> >   
> > -#define MTK_IOMMU_TLB_ADDR(iova) ({
> > \
> > -   dma_addr_t _addr = iova;\
> > +#define MTK_IOMMU_ADDR(addr) ({
> > \
> > +   unsigned long long _addr = addr;\
> 
> If phys_addr_t is 64-bit, then dma_addr_t is also 64-bit, so there is
> no 
> loss of generality from using an appropriate type - IOVAs have to
> fit 
> into dma_addr_t for iommu-dma, after all. However, since IOVAs also
> have 
> to fit into unsigned long in the general IOMMU API, as "addr" is
> here, 
> then this is still just as broken for 32-bit LPAE as the existing
> code is.
> 
> Thanks,
> Robin.
> 

Hi Robin,
According to Path#1's suggestion, next version will keep ttbr to encoded 32 
bits,
then will don't need to modify it.

Thanks,
Yunfei.


> > ((lower_32_bits(_addr) & GENMASK(31, 12)) |
> > upper_32_bits(_addr));\
> >   })
> >   
> > @@ -381,8 +381,8 @@ static void
> > mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,
> > writel_relaxed(F_INVLD_EN1 | F_INVLD_EN0,
> >base + data->plat_data->inv_sel_reg);
> >   
> > -   writel_relaxed(MTK_IOMMU_TLB_ADDR(iova), base +
> > REG_MMU_INVLD_START_A);
> > -   writel_relaxed(MTK_IOMMU_TLB_ADDR(iova + size - 1),
> > +   writel_relaxed(MTK_IOMMU_ADDR(iova), base +
> > REG_MMU_INVLD_START_A);
> > +   writel_relaxed(MTK_IOMMU_ADDR(iova + size - 1),
> >base + REG_MMU_INVLD_END_A);
> > writel_relaxed(F_MMU_INV_RANGE, base +
> > REG_MMU_INVALIDATE);
> >
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [PATCH v2 4/5] vfio/iommu_type1: Clean up update_dirty_scope in detach_group()

2022-06-15 Thread Tian, Kevin

> From: Nicolin Chen
> Sent: Thursday, June 16, 2022 8:03 AM
> 
> All devices in emulated_iommu_groups have pinned_page_dirty_scope
> set, so the update_dirty_scope in the first list_for_each_entry
> is always false. Clean it up, and move the "if update_dirty_scope"
> part from the detach_group_done routine to the domain_list part.
> 
> Rename the "detach_group_done" goto label accordingly.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 

Reviewed-by: Kevin Tian , with one nit:

> @@ -2469,7 +2467,7 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
>   WARN_ON(iommu->notifier.head);
>   vfio_iommu_unmap_unpin_all(iommu);
>   }
> - goto detach_group_done;
> + goto out_unlock;
...
> +out_unlock:
>   mutex_unlock(&iommu->lock);
>  }
> 

I'd just replace the goto with a direct unlock and then return there. 
the readability is slightly better.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [PATCH v2 3/5] vfio/iommu_type1: Remove the domain->ops comparison

2022-06-15 Thread Tian, Kevin

> From: Nicolin Chen 
> Sent: Thursday, June 16, 2022 8:03 AM
> 
> The domain->ops validation was added, as a precaution, for mixed-driver
> systems. However, at this moment only one iommu driver is possible. So
> remove it.

It's true on a physical platform. But I'm not sure whether a virtual platform
is allowed to include multiple e.g. one virtio-iommu alongside a virtual VT-d
or a virtual smmu. It might be clearer to claim that (as Robin pointed out)
there is plenty more significant problems than this to solve instead of simply
saying that only one iommu driver is possible if we don't have explicit code
to reject such configuration. 😊

> 
> Per discussion with Robin, in future when many can be permitted we will
> rely on the IOMMU core code to check the domain->ops:
> https://lore.kernel.org/linux-iommu/6575de6d-94ba-c427-5b1e-
> 967750ddf...@arm.com/
> 
> Signed-off-by: Nicolin Chen 

Apart from that,

Reviewed-by: Kevin Tian 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v9 1/3] iommu/io-pgtable-arm-v7s: Add a quirk to allow pgtable PA up to 35bit

2022-06-15 Thread yf.wang--- via iommu

On Wed, 2022-06-15 at 18:03 +0100, Robin Murphy wrote:
> On 2022-06-15 17:12, yf.w...@mediatek.com wrote:
> > 
> >   static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
> >   struct io_pgtable_cfg *cfg)
> >   {
> > @@ -240,10 +245,17 @@ static void *__arm_v7s_alloc_table(int lvl,
> > gfp_t gfp,
> > dma_addr_t dma;
> > size_t size = ARM_V7S_TABLE_SIZE(lvl, cfg);
> > void *table = NULL;
> > +   gfp_t gfp_l1;
> > +
> > +   /*
> > +* ARM_MTK_TTBR_EXT extend the translation table base support
> > all
> > +* memory address.
> > +*/
> > +   gfp_l1 = cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT ?
> > +GFP_KERNEL : ARM_V7S_TABLE_GFP_DMA;
> >   
> > if (lvl == 1)
> > -   table = (void *)__get_free_pages(
> > -   __GFP_ZERO | ARM_V7S_TABLE_GFP_DMA,
> > get_order(size));
> > +   table = (void *)__get_free_pages(gfp_l1 | __GFP_ZERO,
> > get_order(size));
> > else if (lvl == 2)
> > table = kmem_cache_zalloc(data->l2_tables, gfp);
> >   
> > @@ -251,7 +263,8 @@ static void *__arm_v7s_alloc_table(int lvl,
> > gfp_t gfp,
> > return NULL;
> >   
> > phys = virt_to_phys(table);
> > -   if (phys != (arm_v7s_iopte)phys) {
> > +   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT ?
> > +   phys >= (1ULL << cfg->oas) : phys != (arm_v7s_iopte)phys) {
> 
> Given that the comment above says it supports all of memory, how
> would 
> phys >= (1ULL << cfg->oas) ever be true?
> 

Hi Robin,

Since Mediatek IOMMU hardware support at most 35bit PA in pgtable,
so add a quirk to allow the PA of pgtables support up to bit35,
but need to check oas do error hanlde.


> > /* Doesn't fit in PTE */
> > dev_err(dev, "Page table does not fit in PTE: %pa",
> > &phys);
> > goto out_free;
> > arm_v7s_install_table(arm_v7s_iopte *table,
> >arm_v7s_iopte curr,
> >struct io_pgtable_cfg *cfg)

...
 
> > diff --git a/include/linux/io-pgtable.h b/include/linux/io-
> > pgtable.h
> > index 86af6f0a00a2..c9189716f6bd 100644
> > --- a/include/linux/io-pgtable.h
> > +++ b/include/linux/io-pgtable.h
> > @@ -74,17 +74,22 @@ struct io_pgtable_cfg {
> >  *  to support up to 35 bits PA where the bit32, bit33 and
> > bit34 are
> >  *  encoded in the bit9, bit4 and bit5 of the PTE respectively.
> >  *
> > +* IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT: (ARM v7s format) MediaTek
> > IOMMUs
> > +*  extend the translation table base support up to 35 bits PA,
> > the
> > +*  encoding format is same with IO_PGTABLE_QUIRK_ARM_MTK_EXT.
> > +*
> >  * IO_PGTABLE_QUIRK_ARM_TTBR1: (ARM LPAE format) Configure the
> > table
> >  *  for use in the upper half of a split address space.
> >  *
> >  * IO_PGTABLE_QUIRK_ARM_OUTER_WBWA: Override the outer-
> > cacheability
> >  *  attributes set in the TCR for a non-coherent page-table
> > walker.
> >  */
> > -   #define IO_PGTABLE_QUIRK_ARM_NS BIT(0)
> > -   #define IO_PGTABLE_QUIRK_NO_PERMS   BIT(1)
> > -   #define IO_PGTABLE_QUIRK_ARM_MTK_EXTBIT(3)
> > -   #define IO_PGTABLE_QUIRK_ARM_TTBR1  BIT(5)
> > -   #define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
> > +   #define IO_PGTABLE_QUIRK_ARM_NS BIT(0)
> > +   #define IO_PGTABLE_QUIRK_NO_PERMS   BIT(1)
> > +   #define IO_PGTABLE_QUIRK_ARM_MTK_EXTBIT(3)
> > +   #define IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT   BIT(4)
> > +   #define IO_PGTABLE_QUIRK_ARM_TTBR1  BIT(5)
> > +   #define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
> > unsigned long   quirks;
> > unsigned long   pgsize_bitmap;
> > unsigned intias;
> > @@ -122,7 +127,7 @@ struct io_pgtable_cfg {
> > } arm_lpae_s2_cfg;
> >   
> > struct {
> > -   u32 ttbr;
> > +   u64 ttbr;
> 
> The point of this is to return an encoded TTBR register value, not a
> raw 
> base address. I see from the other patches that your register is
> still 
> 32 bits, so I'd prefer to follow the standard pattern and not need
> this 
> change.
> 
> Thanks,
> Robin.
> 

Hi Robin,
Thanks for your suggestion, next version will recovery ttbr to 32 bits,
will modify arm_v7s_alloc_pgtable to return an encoded TTBR, encoded
PA bits[34:32] to to lower bits, as follows:

/* TTBR */
phys_addr_t paddr;
paddr = virt_to_phys(data->pgd);
cfg->arm_v7s_cfg.ttbr = virt_to_phys(data->pgd) | ARM_V7S_TTBR_S |
(cfg->coherent_walk ? (ARM_V7S_TTBR_NOS |
 ARM_V7S_TTBR_IRGN_ATTR(ARM_V7S_RGN_WBWA) |
 ARM_V7S_TTBR_ORGN_ATTR(ARM_V7S_RGN_WBWA)) :
(ARM_V7S_TTBR_IRGN_ATTR(ARM_V7S_RGN_NC) |
 ARM_V7S_TTBR_ORGN_ATTR(ARM_V7S_RGN_NC)));

if (cfg->quirks & I

Re: [PATCH v3 6/6] iommu: mtk_iommu: Lookup phandle to retrieve syscon to pericfg

2022-06-15 Thread Yong Wu via iommu

On Mon, 2022-06-13 at 10:13 +0200, AngeloGioacchino Del Regno wrote:
> Il 13/06/22 07:32, Yong Wu ha scritto:
> > On Thu, 2022-06-09 at 12:08 +0200, AngeloGioacchino Del Regno
> > wrote:
> > > On some SoCs (of which only MT8195 is supported at the time of
> > > writing),
> > > the "R" and "W" (I/O) enable bits for the IOMMUs are in the
> > > pericfg_ao
> > > register space and not in the IOMMU space: as it happened already
> > > with
> > > infracfg, it is expected that this list will grow.
> > 
> > Currently I don't see the list will grow. As commented before, In
> > the
> > lastest SoC, The IOMMU enable bits for IOMMU will be in ATF, rather
> > than in this pericfg register region. In this case, Is this patch
> > unnecessary? or we could add this patch when there are 2 SoCs use
> > this
> > setting at least?  what's your opinion?
> > 
> 
> Perhaps I've misunderstood... besides, can you please check if
> there's any
> other SoC (not just chromebooks, also smartphone SoCs) that need this
> logic?

As far as I know, SmartPhone SoCs don't enable the infra iommu until
now. they don't have this logic. I don't object this patch, I think we
could add it when at least 2 SoCs need this.

Thanks very much for help improving here.

> 
> Thanks,
> Angelo
> 
> 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [PATCH v2 2/5] vfio/iommu_type1: Prefer to reuse domains vs match enforced cache coherency

2022-06-15 Thread Tian, Kevin

> From: Nicolin Chen 
> Sent: Thursday, June 16, 2022 8:03 AM
> 
> From: Jason Gunthorpe 
> 
> The KVM mechanism for controlling wbinvd is based on OR of the coherency
> property of all devices attached to a guest, no matter those devices are
> attached to a single domain or multiple domains.
> 
> So, there is no value in trying to push a device that could do enforced
> cache coherency to a dedicated domain vs re-using an existing domain
> which is non-coherent since KVM won't be able to take advantage of it.
> This just wastes domain memory.
> 
> Simplify this code and eliminate the test. This removes the only logic
> that needed to have a dummy domain attached prior to searching for a
> matching domain and simplifies the next patches.
> 
> It's unclear whether we want to further optimize the Intel driver to
> update the domain coherency after a device is detached from it, at
> least not before KVM can be verified to handle such dynamics in related
> emulation paths (wbinvd, vcpu load, write_cr0, ept, etc.). In reality
> we don't see an usage requiring such optimization as the only device
> which imposes such non-coherency is Intel GPU which even doesn't
> support hotplug/hot remove.
> 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 

Reviewed-by: Kevin Tian 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [PATCH v2 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-06-15 Thread Tian, Kevin

> From: Nicolin Chen 
> Sent: Thursday, June 16, 2022 8:03 AM
> 
> Cases like VFIO wish to attach a device to an existing domain that was
> not allocated specifically from the device. This raises a condition
> where the IOMMU driver can fail the domain attach because the domain and
> device are incompatible with each other.
> 
> This is a soft failure that can be resolved by using a different domain.
> 
> Provide a dedicated errno from the IOMMU driver during attach that the
> reason attached failed is because of domain incompatability. EMEDIUMTYPE
> is chosen because it is never used within the iommu subsystem today and
> evokes a sense that the 'medium' aka the domain is incompatible.
> 
> VFIO can use this to know attach is a soft failure and it should continue
> searching. Otherwise the attach will be a hard failure and VFIO will
> return the code to userspace.
> 
> Update all drivers to return EMEDIUMTYPE in their failure paths that are
> related to domain incompatability. Also turn adjacent error prints into
> debug prints, for these soft failures, to prevent a kernel log spam.
> 
> Add kdocs describing this behavior.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 

Reviewed-by: Kevin Tian 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 5/5] iommu/mediatek: Remove a unused "mapping" which is only for v1

2022-06-15 Thread Yong Wu via iommu

Just remove a unused variable that only is for mtk_iommu_v1.

Fixes: 9485a04a5bb9 ("iommu/mediatek: Separate mtk_iommu_data for v1 and v2")
Signed-off-by: Yong Wu 
---
 drivers/iommu/mtk_iommu.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 5e86fd48928a..e65e705d9fc1 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -221,10 +221,7 @@ struct mtk_iommu_data {
struct device   *smicomm_dev;
 
struct mtk_iommu_bank_data  *bank;
-
-   struct dma_iommu_mapping*mapping; /* For mtk_iommu_v1.c */
struct regmap   *pericfg;
-
struct mutexmutex; /* Protect m4u_group/m4u_dom 
above */
 
/*
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 4/5] iommu/mediatek: Improve safety for mediatek, smi property in larb nodes

2022-06-15 Thread Yong Wu via iommu

No functional change. Just improve safety from dts.

All the larbs that connect to one IOMMU must connect with the same
smi-common. This patch checks all the mediatek,smi property for each
larb, If their mediatek,smi are different, it will return fails.
Also avoid there is no available smi-larb nodes.

Suggested-by: Guenter Roeck 
Signed-off-by: Yong Wu 
---
 drivers/iommu/mtk_iommu.c | 49 ++-
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index a869d4aee7b3..5e86fd48928a 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -1044,7 +1044,7 @@ static const struct component_master_ops 
mtk_iommu_com_ops = {
 static int mtk_iommu_mm_dts_parse(struct device *dev, struct component_match 
**match,
  struct mtk_iommu_data *data)
 {
-   struct device_node *larbnode, *smicomm_node, *smi_subcomm_node;
+   struct device_node *larbnode, *frst_avail_smicomm_node = NULL;
struct platform_device *plarbdev;
struct device_link *link;
int i, larb_nr, ret;
@@ -1056,6 +1056,7 @@ static int mtk_iommu_mm_dts_parse(struct device *dev, 
struct component_match **m
return -EINVAL;
 
for (i = 0; i < larb_nr; i++) {
+   struct device_node *smicomm_node, *smi_subcomm_node;
u32 id;
 
larbnode = of_parse_phandle(dev->of_node, "mediatek,larbs", i);
@@ -1091,27 +1092,43 @@ static int mtk_iommu_mm_dts_parse(struct device *dev, 
struct component_match **m
}
data->larb_imu[id].dev = &plarbdev->dev;
 
+   /* Get smi-(sub)-common dev from the last larb. */
+   smi_subcomm_node = of_parse_phandle(larbnode, "mediatek,smi", 
0);
+   if (!smi_subcomm_node) {
+   ret = -EINVAL;
+   goto err_larbnode_put;
+   }
+
+   /*
+* It may have two level smi-common. the node is smi-sub-common 
if it
+* has a new mediatek,smi property. otherwise it is smi-commmon.
+*/
+   smicomm_node = of_parse_phandle(smi_subcomm_node, 
"mediatek,smi", 0);
+   if (smicomm_node)
+   of_node_put(smi_subcomm_node);
+   else
+   smicomm_node = smi_subcomm_node;
+
+   if (!frst_avail_smicomm_node) {
+   frst_avail_smicomm_node = smicomm_node;
+   } else if (frst_avail_smicomm_node != smicomm_node) {
+   dev_err(dev, "mediatek,smi is not right @larb%d.", id);
+   of_node_put(smicomm_node);
+   ret = -EINVAL;
+   goto err_larbnode_put;
+   } else {
+   of_node_put(smicomm_node);
+   }
+
component_match_add_release(dev, match, component_release_of,
component_compare_of, larbnode);
}
 
-   /* Get smi-(sub)-common dev from the last larb. */
-   smi_subcomm_node = of_parse_phandle(larbnode, "mediatek,smi", 0);
-   if (!smi_subcomm_node)
+   if (!frst_avail_smicomm_node)
return -EINVAL;
 
-   /*
-* It may have two level smi-common. the node is smi-sub-common if it
-* has a new mediatek,smi property. otherwise it is smi-commmon.
-*/
-   smicomm_node = of_parse_phandle(smi_subcomm_node, "mediatek,smi", 0);
-   if (smicomm_node)
-   of_node_put(smi_subcomm_node);
-   else
-   smicomm_node = smi_subcomm_node;
-
-   plarbdev = of_find_device_by_node(smicomm_node);
-   of_node_put(smicomm_node);
+   plarbdev = of_find_device_by_node(frst_avail_smicomm_node);
+   of_node_put(frst_avail_smicomm_node);
data->smicomm_dev = &plarbdev->dev;
 
link = device_link_add(data->smicomm_dev, dev,
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 3/5] iommu/mediatek: Validate number of phandles associated with "mediatek, larbs"

2022-06-15 Thread Yong Wu via iommu

From: Guenter Roeck 

Fix the smatch warnings:
drivers/iommu/mtk_iommu.c:878 mtk_iommu_mm_dts_parse() error: uninitialized
symbol 'larbnode'.

If someone abuse the dtsi node(Don't follow the definition of dt-binding),
for example "mediatek,larbs" is provided as boolean property, "larb_nr"
will be zero and cause abnormal.

To fix this problem and improve the code safety, add some checking
for the invalid input from dtsi, e.g. checking the larb_nr/larbid valid
range, and avoid "mediatek,larb-id" property conflicts in the smi-larb
nodes.

Fixes: d2e9a1102cfc ("iommu/mediatek: Contain MM IOMMU flow with the MM TYPE")
Reported-by: kernel test robot 
Reported-by: Dan Carpenter 
Signed-off-by: Guenter Roeck 
Signed-off-by: Yong Wu 
Reviewed-by: AngeloGioacchino Del Regno 

---
 drivers/iommu/mtk_iommu.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index ab24078938bf..a869d4aee7b3 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -1052,6 +1052,8 @@ static int mtk_iommu_mm_dts_parse(struct device *dev, 
struct component_match **m
larb_nr = of_count_phandle_with_args(dev->of_node, "mediatek,larbs", 
NULL);
if (larb_nr < 0)
return larb_nr;
+   if (larb_nr == 0 || larb_nr > MTK_LARB_NR_MAX)
+   return -EINVAL;
 
for (i = 0; i < larb_nr; i++) {
u32 id;
@@ -1068,6 +1070,10 @@ static int mtk_iommu_mm_dts_parse(struct device *dev, 
struct component_match **m
ret = of_property_read_u32(larbnode, "mediatek,larb-id", &id);
if (ret)/* The id is consecutive if there is no this property */
id = i;
+   if (id >= MTK_LARB_NR_MAX) {
+   ret = -EINVAL;
+   goto err_larbnode_put;
+   }
 
plarbdev = of_find_device_by_node(larbnode);
if (!plarbdev) {
@@ -1078,6 +1084,11 @@ static int mtk_iommu_mm_dts_parse(struct device *dev, 
struct component_match **m
ret = -EPROBE_DEFER;
goto err_larbnode_put;
}
+
+   if (data->larb_imu[id].dev) {
+   ret = -EEXIST;
+   goto err_larbnode_put;
+   }
data->larb_imu[id].dev = &plarbdev->dev;
 
component_match_add_release(dev, match, component_release_of,
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 2/5] iommu/mediatek: Add error path for loop of mm_dts_parse

2022-06-15 Thread Yong Wu via iommu

The mtk_iommu_mm_dts_parse will parse the smi larbs nodes. if the i+1
larb is parsed fail(return -EINVAL), we should of_node_put for the 0..i
larbs. In the fail path, one of_node_put matches with of_parse_phandle in
it.

Fixes: d2e9a1102cfc ("iommu/mediatek: Contain MM IOMMU flow with the MM TYPE")
Signed-off-by: Yong Wu 
---
 drivers/iommu/mtk_iommu.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 3b2489e8a6dd..ab24078938bf 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -1071,12 +1071,12 @@ static int mtk_iommu_mm_dts_parse(struct device *dev, 
struct component_match **m
 
plarbdev = of_find_device_by_node(larbnode);
if (!plarbdev) {
-   of_node_put(larbnode);
-   return -ENODEV;
+   ret = -ENODEV;
+   goto err_larbnode_put;
}
if (!plarbdev->dev.driver) {
-   of_node_put(larbnode);
-   return -EPROBE_DEFER;
+   ret = -EPROBE_DEFER;
+   goto err_larbnode_put;
}
data->larb_imu[id].dev = &plarbdev->dev;
 
@@ -1107,9 +1107,20 @@ static int mtk_iommu_mm_dts_parse(struct device *dev, 
struct component_match **m
   DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME);
if (!link) {
dev_err(dev, "Unable to link %s.\n", 
dev_name(data->smicomm_dev));
-   return -EINVAL;
+   ret = -EINVAL;
+   goto err_larbnode_put;
}
return 0;
+
+err_larbnode_put:
+   while (i--) {
+   larbnode = of_parse_phandle(dev->of_node, "mediatek,larbs", i);
+   if (larbnode && of_device_is_available(larbnode)) {
+   of_node_put(larbnode);
+   of_node_put(larbnode);
+   }
+   }
+   return ret;
 }
 
 static int mtk_iommu_probe(struct platform_device *pdev)
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 1/5] iommu/mediatek: Use dev_err_probe to mute probe_defer err log

2022-06-15 Thread Yong Wu via iommu

Mute the probe defer log:

[2.654806] mtk-iommu 14018000.iommu: mm dts parse fail(-517).
[2.656168] mtk-iommu 1c01f000.iommu: mm dts parse fail(-517).

Fixes: d2e9a1102cfc ("iommu/mediatek: Contain MM IOMMU flow with the MM TYPE")
Signed-off-by: Yong Wu 
Reviewed-by: AngeloGioacchino Del Regno 

Reviewed-by: Guenter Roeck 
---
 drivers/iommu/mtk_iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index bb9dd92c9898..3b2489e8a6dd 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -1204,7 +1204,7 @@ static int mtk_iommu_probe(struct platform_device *pdev)
if (MTK_IOMMU_IS_TYPE(data->plat_data, MTK_IOMMU_TYPE_MM)) {
ret = mtk_iommu_mm_dts_parse(dev, &match, data);
if (ret) {
-   dev_err(dev, "mm dts parse fail(%d).", ret);
+   dev_err_probe(dev, ret, "mm dts parse fail.");
goto out_runtime_disable;
}
} else if (MTK_IOMMU_IS_TYPE(data->plat_data, MTK_IOMMU_TYPE_INFRA) &&
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 0/5] iommu/mediatek: Improve safety from dts

2022-06-15 Thread Yong Wu via iommu

This patchset contains misc improve patches:
[1/5] When mt8195 v7, I added a error log for dts parse fail, but it
doesn't ignore probe_defer case.(v6 doesn't have this err log.)
[2/5] Add a error path for MM dts parse.

[3/5][4/5] To improve safety from dts. Base on this:
https://lore.kernel.org/linux-mediatek/20211210205704.1664928-1-li...@roeck-us.net/

Change notes:
v2: a) Rebase on v5.19-rc1.
b) Add a New patch [5/5] just remove a variable that only is for v1.

v1: 
https://lore.kernel.org/linux-mediatek/20220511064920.18455-1-yong...@mediatek.com/
Base on linux-next-20220510.

Guenter Roeck (1):
  iommu/mediatek: Validate number of phandles associated with "mediatek,
larbs"

Yong Wu (4):
  iommu/mediatek: Use dev_err_probe to mute probe_defer err log
  iommu/mediatek: Add error path for loop of mm_dts_parse
  iommu/mediatek: Improve safety for mediatek, smi property in larb
nodes
  iommu/mediatek: Remove a unused "mapping" which is only for v1

 drivers/iommu/mtk_iommu.c | 86 +++
 1 file changed, 61 insertions(+), 25 deletions(-)

-- 
2.18.0


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v9 3/3] iommu/mediatek: Allow page table PA up to 35bit

2022-06-15 Thread kernel test robot

Hi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on joro-iommu/next]
[also build test ERROR on linus/master v5.19-rc2 next-20220615]
[cannot apply to arm-perf/for-next/perf]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/intel-lab-lkp/linux/commits/yf-wang-mediatek-com/iommu-io-pgtable-arm-v7s-Add-a-quirk-to-allow-pgtable-PA-up-to-35bit/20220616-011227
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: arc-allyesconfig 
(https://download.01.org/0day-ci/archive/20220616/202206161233.wdjdwjgb-...@intel.com/config)
compiler: arceb-elf-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# 
https://github.com/intel-lab-lkp/linux/commit/0032fcce9c1ab50caec1ef5dd4089a8a61fcf15c
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review 
yf-wang-mediatek-com/iommu-io-pgtable-arm-v7s-Add-a-quirk-to-allow-pgtable-PA-up-to-35bit/20220616-011227
git checkout 0032fcce9c1ab50caec1ef5dd4089a8a61fcf15c
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 
O=build_dir ARCH=arc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

   In file included from include/linux/scatterlist.h:9,
from include/linux/dma-mapping.h:10,
from include/linux/dma-direct.h:9,
from drivers/iommu/mtk_iommu.c:11:
   drivers/iommu/mtk_iommu.c: In function 'mtk_iommu_attach_device':
>> drivers/iommu/mtk_iommu.c:693:49: error: 'struct mtk_iommu_data' has no 
>> member named 'base'
 693 | writel(bank->m4u_dom->ttbr, data->base + 
REG_MMU_PT_BASE_ADDR);
 | ^~
   arch/arc/include/asm/io.h:231:75: note: in definition of macro 
'writel_relaxed'
 231 | #define writel_relaxed(v,c) __raw_writel((__force u32) 
cpu_to_le32(v),c)
 |  
 ^
   drivers/iommu/mtk_iommu.c:693:17: note: in expansion of macro 'writel'
 693 | writel(bank->m4u_dom->ttbr, data->base + 
REG_MMU_PT_BASE_ADDR);
 | ^~


vim +693 drivers/iommu/mtk_iommu.c

   646  
   647  static int mtk_iommu_attach_device(struct iommu_domain *domain,
   648 struct device *dev)
   649  {
   650  struct mtk_iommu_data *data = dev_iommu_priv_get(dev), 
*frstdata;
   651  struct mtk_iommu_domain *dom = to_mtk_domain(domain);
   652  struct list_head *hw_list = data->hw_list;
   653  struct device *m4udev = data->dev;
   654  struct mtk_iommu_bank_data *bank;
   655  unsigned int bankid;
   656  int ret, region_id;
   657  
   658  region_id = mtk_iommu_get_iova_region_id(dev, data->plat_data);
   659  if (region_id < 0)
   660  return region_id;
   661  
   662  bankid = mtk_iommu_get_bank_id(dev, data->plat_data);
   663  mutex_lock(&dom->mutex);
   664  if (!dom->bank) {
   665  /* Data is in the frstdata in sharing pgtable case. */
   666  frstdata = mtk_iommu_get_frst_data(hw_list);
   667  
   668  ret = mtk_iommu_domain_finalise(dom, frstdata, 
region_id);
   669  if (ret) {
   670  mutex_unlock(&dom->mutex);
   671  return -ENODEV;
   672  }
   673  dom->bank = &data->bank[bankid];
   674  }
   675  mutex_unlock(&dom->mutex);
   676  
   677  mutex_lock(&data->mutex);
   678  bank = &data->bank[bankid];
   679  if (!bank->m4u_dom) { /* Initialize the M4U HW for each a BANK 
*/
   680  ret = pm_runtime_resume_and_get(m4udev);
   681  if (ret < 0) {
   682  dev_err(m4udev, "pm get fail(%d) in attach.\n", 
ret);
   683  goto err_unlock;
   684  }
   685  
   686  ret = mtk_iommu_hw_init(data, bankid);
   687  if (ret) {
   688  pm_runtime_put(m4udev);
   689  goto err_unlock;
   690  }
   691

Re: [PATCH] uacce: fix concurrency of fops_open and uacce_remove

2022-06-15 Thread Zhangfei Gao


Hi, Jean

On 2022/6/15 下午11:16, Jean-Philippe Brucker wrote:

Hi,

On Fri, Jun 10, 2022 at 08:34:23PM +0800, Zhangfei Gao wrote:

The uacce parent's module can be removed when uacce is working,
which may cause troubles.

If rmmod/uacce_remove happens just after fops_open: bind_queue,
the uacce_remove can not remove the bound queue since it is not
added to the queue list yet, which blocks the uacce_disable_sva.

Change queues_lock area to make sure the bound queue is added to
the list thereby can be searched in uacce_remove.

And uacce->parent->driver is checked immediately in case rmmod is
just happening.

Also the parent driver must always stop DMA before calling
uacce_remove.

Signed-off-by: Yang Shen 
Signed-off-by: Zhangfei Gao 
---
  drivers/misc/uacce/uacce.c | 19 +--
  1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index 281c54003edc..b6219c6bfb48 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -136,9 +136,16 @@ static int uacce_fops_open(struct inode *inode, struct 
file *filep)
if (!q)
return -ENOMEM;
  
+	mutex_lock(&uacce->queues_lock);

+
+   if (!uacce->parent->driver) {

I don't think this is useful, because the core clears parent->driver after
having run uacce_remove():

   rmmod hisi_zip   open()
...  uacce_fops_open()
__device_release_driver() ...
 pci_device_remove()
  hisi_zip_remove()
   hisi_qm_uninit()
uacce_remove()
 ...  ...
  mutex_lock(uacce->queues_lock)
 ...  if (!uacce->parent->driver)
 device_unbind_cleanup()  /* driver still valid, proceed */
  dev->driver = NULL


The check  if (!uacce->parent->driver) is required, otherwise NULL 
pointer may happen.

iommu_sva_bind_device
const struct iommu_ops *ops = dev_iommu_ops(dev);  -> 
dev->iommu->iommu_dev->ops


rmmod has no issue, but remove parent pci device has the issue.

Test:
sleep in fops_open before mutex.

estuary:/mnt$ ./work/a.out &
//sleep in fops_open

echo 1 > /sys/bus/pci/devices/:00:02.0/remove &
estuary:/mnt$ [   22.594348] uacce_remove!
[   22.594663] pci :00:02.0: Removing from iommu group 2
[   22.595073] iommu_release_device dev->iommu=0
[   22.595076] CPU: 2 PID: 229 Comm: ash Not tainted 
5.19.0-rc1-15071-gcbcf098c5257-dirty #633
[   22.595079] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 
02/06/2015

[   22.595080] Call trace:
[   22.595080]  dump_backtrace+0xe4/0xf0
[   22.595085]  show_stack+0x20/0x70
[   22.595086]  dump_stack_lvl+0x8c/0xb8
[   22.595089]  dump_stack+0x18/0x34
[   22.595091]  iommu_release_device+0x90/0x98
[   22.595095]  iommu_bus_notifier+0x58/0x68
[   22.595097]  blocking_notifier_call_chain+0x74/0xa8
[   22.595100]  device_del+0x268/0x3b0
[   22.595102]  pci_remove_bus_device+0x84/0x110
[   22.595106]  pci_stop_and_remove_bus_device_locked+0x30/0x60
...

estuary:/mnt$ [   31.466360] uacce: sleep end!
[   31.466362] uacce->parent->driver=0
[   31.466364] uacce->parent->iommu=0
[   31.466365] uacce_bind_queue!
[   31.466366] uacce_bind_queue call iommu_sva_bind_device!
[   31.466367] uacce->parent=d120d0
[   31.466371] Unable to handle kernel NULL pointer dereference at 
virtual address 0038

[   31.472870] Mem abort info:
[   31.473450]   ESR = 0x9604
[   31.474223]   EC = 0x25: DABT (current EL), IL = 32 bits
[   31.475390]   SET = 0, FnV = 0
[   31.476031]   EA = 0, S1PTW = 0
[   31.476680]   FSC = 0x04: level 0 translation fault
[   31.477687] Data abort info:
[   31.478294]   ISV = 0, ISS = 0x0004
[   31.479152]   CM = 0, WnR = 0
[   31.479785] user pgtable: 4k pages, 48-bit VAs, pgdp=714d8000
[   31.481144] [0038] pgd=, p4d=
[   31.482622] Internal error: Oops: 9604 [#1] PREEMPT SMP
[   31.483784] Modules linked in: hisi_zip
[   31.484590] CPU: 2 PID: 228 Comm: a.out Not tainted 
5.19.0-rc1-15071-gcbcf098c5257-dirty #633
[   31.486374] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 
02/06/2015
[   31.487862] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS 
BTYPE=--)

[   31.489390] pc : iommu_sva_bind_device+0x44/0xf4
[   31.490404] lr : uacce_fops_open+0x128/0x234



Since uacce_remove() disabled SVA, the following uacce_bind_queue() will
fail anyway. However, if uacce->flags does not have UACCE_DEV_SVA set,
we'll proceed further and call uacce->ops->get_queue(), which does not
exist anymore since the parent module is gone.

I think we need the global uacce_mutex to serialize uacce_remove() and
uacce_fops_open(). uacce_remove() would do everything, including
xa_erase(), while holding that mutex. And uacce_fops_open() would try to
obtain the uacce object from the xarray while holding the mutex, which
fails if the uacce object is being removed.


Since fops_o

RE: [PATCH v2 03/12] iommu/vt-d: Remove clearing translation data in disable_dmar_iommu()

2022-06-15 Thread Tian, Kevin

> From: Baolu Lu 
> Sent: Wednesday, June 15, 2022 9:10 PM
> 
> On 2022/6/15 14:22, Tian, Kevin wrote:
> >> From: Baolu Lu 
> >> Sent: Tuesday, June 14, 2022 3:21 PM
> >>
> >> On 2022/6/14 14:49, Tian, Kevin wrote:
>  From: Lu Baolu
>  Sent: Tuesday, June 14, 2022 10:51 AM
> 
>  The disable_dmar_iommu() is called when IOMMU initialization fails or
>  the IOMMU is hot-removed from the system. In both cases, there is no
>  need to clear the IOMMU translation data structures for devices.
> 
>  On the initialization path, the device probing only happens after the
>  IOMMU is initialized successfully, hence there're no translation data
>  structures.
> >>> Out of curiosity. With kexec the IOMMU may contain stale mappings
> >>> from the old kernel. Then is it meaningful to disable IOMMU after the
> >>> new kernel fails to initialize it properly?
> >>
> >> For kexec kernel, if the IOMMU is detected to be pre-enabled, the IOMMU
> >> driver will try to copy tables from the old kernel. If copying table
> >> fails, the IOMMU driver will disable IOMMU and do the normal
> >> initialization.
> >>
> >
> > What about an error occurred after copying table in the initialization
> > path? The new kernel will be in a state assuming iommu is disabled
> > but it is still enabled using an old mapping for certain devices...
> >
> 
> If copying table failed, the translation will be disabled and a clean
> root table will be used.
> 
> if (translation_pre_enabled(iommu)) {
>  pr_info("Translation already enabled - trying to copy
> translation structures\n");
> 
>  ret = copy_translation_tables(iommu);
>  if (ret) {
>  /*
>   * We found the IOMMU with translation
>   * enabled - but failed to copy over the
>   * old root-entry table. Try to proceed
>   * by disabling translation now and
>   * allocating a clean root-entry table.
>   * This might cause DMAR faults, but
>   * probably the dump will still succeed.
>   */
>  pr_err("Failed to copy translation tables from previous
> kernel for %s\n",
> iommu->name);
>  iommu_disable_translation(iommu);
>  clear_translation_pre_enabled(iommu);
>  } else {
>  pr_info("Copied translation tables from previous kernel
> for %s\n",
>  iommu->name);
>  }
> }
> 

I meant copying table succeeds but another error occurs in the
remaining path of initialization...
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-06-15 Thread Nicolin Chen via iommu

On Thu, Jun 16, 2022 at 10:09:49AM +0800, Baolu Lu wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 2022/6/16 08:03, Nicolin Chen wrote:
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 44016594831d..0dd13330fe12 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -4323,7 +4323,7 @@ static int prepare_domain_attach_device(struct 
> > iommu_domain *domain,
> >   return -ENODEV;
> > 
> >   if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
> > - return -EOPNOTSUPP;
> > + return -EMEDIUMTYPE;
> > 
> >   /* check if this iommu agaw is sufficient for max mapped address */
> >   addr_width = agaw_to_width(iommu->agaw);
> > @@ -4331,10 +4331,10 @@ static int prepare_domain_attach_device(struct 
> > iommu_domain *domain,
> >   addr_width = cap_mgaw(iommu->cap);
> > 
> >   if (dmar_domain->max_addr > (1LL << addr_width)) {
> > - dev_err(dev, "%s: iommu width (%d) is not "
> > + dev_dbg(dev, "%s: iommu width (%d) is not "
> >   "sufficient for the mapped address (%llx)\n",
> >   __func__, addr_width, dmar_domain->max_addr);
> > - return -EFAULT;
> > + return -EMEDIUMTYPE;
> >   }
> >   dmar_domain->gaw = addr_width;
> 
> Can we simply remove the dev_err()? As the return value has explicitly
> explained the failure reason, putting a print statement won't help much.

Yes. As long as no one has objection, I can remove that in the next
version.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCHES 1/2] iommu: Add RCU-protected page free support

2022-06-15 Thread Baolu Lu


On 2022/6/15 23:40, Jason Gunthorpe wrote:

On Fri, Jun 10, 2022 at 01:37:20PM +0800, Baolu Lu wrote:

On 2022/6/9 20:49, Jason Gunthorpe wrote:

+void iommu_free_pgtbl_pages(struct iommu_domain *domain,
+   struct list_head *pages)
+{
+   struct page *page, *next;
+
+   if (!domain->concurrent_traversal) {
+   put_pages_list(pages);
+   return;
+   }
+
+   list_for_each_entry_safe(page, next, pages, lru) {
+   list_del(&page->lru);
+   call_rcu(&page->rcu_head, pgtble_page_free_rcu);
+   }

It seems OK, but I wonder if there is benifit to using
put_pages_list() from the rcu callback


The price is that we need to allocate a "struct list_head" and free it
in the rcu callback as well. Currently the list_head is sitting in the
stack.


You'd have to use a different struct page layout so that the list_head
was in the struct page and didn't overlap with the rcu_head


Okay, let me head this direction in the next version.

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-06-15 Thread Baolu Lu


On 2022/6/16 08:03, Nicolin Chen wrote:

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 44016594831d..0dd13330fe12 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4323,7 +4323,7 @@ static int prepare_domain_attach_device(struct 
iommu_domain *domain,
return -ENODEV;
  
  	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))

-   return -EOPNOTSUPP;
+   return -EMEDIUMTYPE;
  
  	/* check if this iommu agaw is sufficient for max mapped address */

addr_width = agaw_to_width(iommu->agaw);
@@ -4331,10 +4331,10 @@ static int prepare_domain_attach_device(struct 
iommu_domain *domain,
addr_width = cap_mgaw(iommu->cap);
  
  	if (dmar_domain->max_addr > (1LL << addr_width)) {

-   dev_err(dev, "%s: iommu width (%d) is not "
+   dev_dbg(dev, "%s: iommu width (%d) is not "
"sufficient for the mapped address (%llx)\n",
__func__, addr_width, dmar_domain->max_addr);
-   return -EFAULT;
+   return -EMEDIUMTYPE;
}
dmar_domain->gaw = addr_width;


Can we simply remove the dev_err()? As the return value has explicitly
explained the failure reason, putting a print statement won't help much.

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 4/7] iommu/amd: Introduce function to check and enable SNP

2022-06-15 Thread Suravee Suthikulpanit via iommu

From: Brijesh Singh 

To support SNP, IOMMU needs to be enabled, and prohibits IOMMU
configurations where DTE[Mode]=0, which means it cannot be supported with
IOMMU passthrough domain (a.k.a IOMMU_DOMAIN_IDENTITY),
and when AMD IOMMU driver is configured to not use the IOMMU host (v1) page
table. Otherwise, RMP table initialization could cause the system to crash.

The request to enable SNP support in IOMMU must be done before PCI
initialization state of the IOMMU driver because enabling SNP affects
how IOMMU driver sets up IOMMU data structures (i.e. DTE).

Unlike other IOMMU features, SNP feature does not have an enable bit in
the IOMMU control register. Instead, the IOMMU driver introduces
an amd_iommu_snp_en variable to track enabling state of SNP.

Introduce amd_iommu_snp_enable() for other drivers to request enabling
the SNP support in IOMMU, which checks all prerequisites and determines
if the feature can be safely enabled.

Please see the IOMMU spec section 2.12 for further details.

Co-developed-by: Suravee Suthikulpanit 
Signed-off-by: Suravee Suthikulpanit 
Signed-off-by: Brijesh Singh 
---
 drivers/iommu/amd/amd_iommu_types.h |  5 
 drivers/iommu/amd/init.c| 45 +++--
 drivers/iommu/amd/iommu.c   |  4 +--
 include/linux/amd-iommu.h   |  6 
 4 files changed, 56 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu_types.h 
b/drivers/iommu/amd/amd_iommu_types.h
index 73b729be7410..ce4db2835b36 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -463,6 +463,9 @@ extern bool amd_iommu_irq_remap;
 /* kmem_cache to get tables with 128 byte alignement */
 extern struct kmem_cache *amd_iommu_irq_cache;
 
+/* SNP is enabled on the system? */
+extern bool amd_iommu_snp_en;
+
 #define PCI_SBDF_TO_SEGID(sbdf)(((sbdf) >> 16) & 0x)
 #define PCI_SBDF_TO_DEVID(sbdf)((sbdf) & 0x)
 #define PCI_SEG_DEVID_TO_SBDF(seg, devid)  u32)(seg) & 0x) << 16) 
| \
@@ -1013,4 +1016,6 @@ extern struct amd_irte_ops irte_32_ops;
 extern struct amd_irte_ops irte_128_ops;
 #endif
 
+extern struct iommu_ops amd_iommu_ops;
+
 #endif /* _ASM_X86_AMD_IOMMU_TYPES_H */
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 013c55e3c2f2..b5d3de327a5f 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -95,8 +95,6 @@
  * out of it.
  */
 
-extern const struct iommu_ops amd_iommu_ops;
-
 /*
  * structure describing one IOMMU in the ACPI table. Typically followed by one
  * or more ivhd_entrys.
@@ -168,6 +166,9 @@ static int amd_iommu_target_ivhd_type;
 
 static bool amd_iommu_snp_sup;
 
+bool amd_iommu_snp_en;
+EXPORT_SYMBOL(amd_iommu_snp_en);
+
 LIST_HEAD(amd_iommu_pci_seg_list); /* list of all PCI segments */
 LIST_HEAD(amd_iommu_list); /* list of all AMD IOMMUs in the
   system */
@@ -3549,3 +3550,43 @@ int amd_iommu_pc_set_reg(struct amd_iommu *iommu, u8 
bank, u8 cntr, u8 fxn, u64
 
return iommu_pc_get_set_reg(iommu, bank, cntr, fxn, value, true);
 }
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+int amd_iommu_snp_enable(void)
+{
+   /*
+* The SNP support requires that IOMMU must be enabled, and is
+* not configured in the passthrough mode.
+*/
+   if (no_iommu || iommu_default_passthrough()) {
+   pr_err("SNP: IOMMU is either disabled or configured in 
passthrough mode.\n");
+   return -EINVAL;
+   }
+
+   /*
+* Prevent enabling SNP after IOMMU_ENABLED state because this process
+* affect how IOMMU driver sets up data structures and configures
+* IOMMU hardware.
+*/
+   if (init_state > IOMMU_ENABLED) {
+   pr_err("SNP: Too late to enable SNP for IOMMU.\n");
+   return -EINVAL;
+   }
+
+   amd_iommu_snp_en = amd_iommu_snp_sup;
+   if (!amd_iommu_snp_en)
+   return -EINVAL;
+
+   pr_info("SNP enabled\n");
+
+   /* Enforce IOMMU v1 pagetable when SNP is enabled. */
+   if (amd_iommu_pgtable != AMD_IOMMU_V1) {
+   pr_warn("Force to using AMD IOMMU v1 page table due to SNP\n");
+   amd_iommu_pgtable = AMD_IOMMU_V1;
+   amd_iommu_ops.pgsize_bitmap = AMD_IOMMU_PGSIZES;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(amd_iommu_snp_enable);
+#endif
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 86045dc50a0f..0792cd618dba 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -71,7 +71,7 @@ LIST_HEAD(acpihid_map);
  * Domain for untranslated devices - only allocated
  * if iommu=pt passed on kernel cmd line.
  */
-const struct iommu_ops amd_iommu_ops;
+struct iommu_ops amd_iommu_ops;
 
 static ATOMIC_NOTIFIER_HEAD(ppr_notifier);
 int amd_iommu_max_glx_val = -1;
@@ -2412,7 +2412,7 @@ static int amd_iommu_def_domain_type(struct device *dev)
re

[PATCH v2 2/7] iommu/amd: Process all IVHDs before enabling IOMMU features

2022-06-15 Thread Suravee Suthikulpanit via iommu

The ACPI IVRS table can contain multiple IVHD blocks. Each block contains
information used to initialize each IOMMU instance.

Currently, init_iommu_all sequentially process IVHD block and initialize
IOMMU instance one-by-one. However, certain features require all IOMMUs
to be configured in the same way system-wide. In case certain IVHD blocks
contain inconsistent information (most likely FW bugs), the driver needs
to go through and try to revert settings on IOMMUs that have already been
configured.

A solution is to split IOMMU initialization into 3 phases:

Phase1 : Processes information of the IVRS table for all IOMMU instances.
This allow all IVHDs to be processed prior to enabling features.

Phase2 : Early feature support check on all IOMMUs (using information in
IVHD blocks.

Phase3 : Iterates through all IOMMU instances and enabling features.

Signed-off-by: Suravee Suthikulpanit 
---
 drivers/iommu/amd/init.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index b3e4551ce9dd..5f86e357dbaa 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1692,7 +1692,6 @@ static int __init init_iommu_one(struct amd_iommu *iommu, 
struct ivhd_header *h,
 struct acpi_table_header *ivrs_base)
 {
struct amd_iommu_pci_seg *pci_seg;
-   int ret;
 
pci_seg = get_pci_segment(h->pci_seg, ivrs_base);
if (pci_seg == NULL)
@@ -1773,6 +1772,13 @@ static int __init init_iommu_one(struct amd_iommu 
*iommu, struct ivhd_header *h,
if (!iommu->mmio_base)
return -ENOMEM;
 
+   return init_iommu_from_acpi(iommu, h);
+}
+
+static int __init init_iommu_one_late(struct amd_iommu *iommu)
+{
+   int ret;
+
if (alloc_cwwb_sem(iommu))
return -ENOMEM;
 
@@ -1794,10 +1800,6 @@ static int __init init_iommu_one(struct amd_iommu 
*iommu, struct ivhd_header *h,
if (amd_iommu_pre_enabled)
amd_iommu_pre_enabled = translation_pre_enabled(iommu);
 
-   ret = init_iommu_from_acpi(iommu, h);
-   if (ret)
-   return ret;
-
if (amd_iommu_irq_remap) {
ret = amd_iommu_create_irq_domain(iommu);
if (ret)
@@ -1808,7 +1810,7 @@ static int __init init_iommu_one(struct amd_iommu *iommu, 
struct ivhd_header *h,
 * Make sure IOMMU is not considered to translate itself. The IVRS
 * table tells us so, but this is a lie!
 */
-   pci_seg->rlookup_table[iommu->devid] = NULL;
+   iommu->pci_seg->rlookup_table[iommu->devid] = NULL;
 
return 0;
 }
@@ -1853,6 +1855,7 @@ static int __init init_iommu_all(struct acpi_table_header 
*table)
end += table->length;
p += IVRS_HEADER_LENGTH;
 
+   /* Phase 1: Process all IVHD blocks */
while (p < end) {
h = (struct ivhd_header *)p;
if (*p == amd_iommu_target_ivhd_type) {
@@ -1878,6 +1881,15 @@ static int __init init_iommu_all(struct 
acpi_table_header *table)
}
WARN_ON(p != end);
 
+   /* Phase 2 : Early feature support check */
+
+   /* Phase 3 : Enabling IOMMU features */
+   for_each_iommu(iommu) {
+   ret = init_iommu_one_late(iommu);
+   if (ret)
+   return ret;
+   }
+
return 0;
 }
 
-- 
2.32.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 7/7] iommu/amd: Do not support IOMMUv2 APIs when SNP is enabled

2022-06-15 Thread Suravee Suthikulpanit via iommu

The IOMMUv2 APIs (for supporting shared virtual memory with PASID)
configures the domain with IOMMU v2 page table, and sets DTE[Mode]=0.
This configuration cannot be supported on SNP-enabled system.

Signed-off-by: Suravee Suthikulpanit 
---
 drivers/iommu/amd/init.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index bc008a82c12c..780d6977a331 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -3448,7 +3448,12 @@ __setup("ivrs_acpihid",  parse_ivrs_acpihid);
 
 bool amd_iommu_v2_supported(void)
 {
-   return amd_iommu_v2_present;
+   /*
+* Since DTE[Mode]=0 is prohibited on SNP-enabled system
+* (i.e. EFR[SNPSup]=1), IOMMUv2 page table cannot be used without
+* setting up IOMMUv1 page table.
+*/
+   return amd_iommu_v2_present && !amd_iommu_snp_en;
 }
 EXPORT_SYMBOL(amd_iommu_v2_supported);
 
-- 
2.32.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 6/7] iommu/amd: Do not support IOMMU_DOMAIN_IDENTITY after SNP is enabled

2022-06-15 Thread Suravee Suthikulpanit via iommu

Once SNP is enabled (by executing SNP_INIT command), IOMMU can no longer
support the passthrough domain (i.e. IOMMU_DOMAIN_IDENTITY).

The SNP_INIT command is called early in the boot process, and would fail
if the kernel is configure to default to passthrough mode.

After the system is already booted, users can try to change IOMMU domain
type of a particular IOMMU group. In this case, the IOMMU driver needs to
check the SNP-enable status and return failure when requesting to change
domain type to identity.

Therefore, return failure when trying to allocate identity domain.

Signed-off-by: Suravee Suthikulpanit 
---
 drivers/iommu/amd/iommu.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 4f4571d3ff61..d8a6df423b90 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2119,6 +2119,15 @@ static struct iommu_domain 
*amd_iommu_domain_alloc(unsigned type)
 {
struct protection_domain *domain;
 
+   /*
+* Since DTE[Mode]=0 is prohibited on SNP-enabled system,
+* default to use IOMMU_DOMAIN_DMA[_FQ].
+*/
+   if (amd_iommu_snp_en && (type == IOMMU_DOMAIN_IDENTITY)) {
+   pr_warn("Cannot allocate identity domain due to SNP\n");
+   return NULL;
+   }
+
domain = protection_domain_alloc(type);
if (!domain)
return NULL;
-- 
2.32.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 3/7] iommu/amd: Introduce an iommu variable for tracking SNP support status

2022-06-15 Thread Suravee Suthikulpanit via iommu

EFR[SNPSup] needs to be checked early in the boot process, since it is
used to determine how IOMMU driver configures other IOMMU features
and data structures. This check can be done as soon as the IOMMU driver
finishes parsing IVHDs.

Introduce a variable for tracking the SNP support status, which is
initialized before enabling the rest of IOMMU features.

Also report IOMMU SNP support information for each IOMMU.

Signed-off-by: Suravee Suthikulpanit 
---
 drivers/iommu/amd/init.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 5f86e357dbaa..013c55e3c2f2 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -166,6 +166,8 @@ static bool amd_iommu_disabled __initdata;
 static bool amd_iommu_force_enable __initdata;
 static int amd_iommu_target_ivhd_type;
 
+static bool amd_iommu_snp_sup;
+
 LIST_HEAD(amd_iommu_pci_seg_list); /* list of all PCI segments */
 LIST_HEAD(amd_iommu_list); /* list of all AMD IOMMUs in the
   system */
@@ -260,7 +262,6 @@ int amd_iommu_get_num_iommus(void)
return amd_iommus_present;
 }
 
-#ifdef CONFIG_IRQ_REMAP
 /*
  * Iterate through all the IOMMUs to verify if the specified
  * EFR bitmask of IOMMU feature are set.
@@ -285,7 +286,6 @@ static bool check_feature_on_all_iommus(u64 mask)
}
return ret;
 }
-#endif
 
 /*
  * For IVHD type 0x11/0x40, EFR is also available via IVHD.
@@ -368,7 +368,7 @@ static void iommu_set_cwwb_range(struct amd_iommu *iommu)
u64 start = iommu_virt_to_phys((void *)iommu->cmd_sem);
u64 entry = start & PM_ADDR_MASK;
 
-   if (!iommu_feature(iommu, FEATURE_SNP))
+   if (!amd_iommu_snp_sup)
return;
 
/* Note:
@@ -783,7 +783,7 @@ static void *__init iommu_alloc_4k_pages(struct amd_iommu 
*iommu,
void *buf = (void *)__get_free_pages(gfp, order);
 
if (buf &&
-   iommu_feature(iommu, FEATURE_SNP) &&
+   amd_iommu_snp_sup &&
set_memory_4k((unsigned long)buf, (1 << order))) {
free_pages((unsigned long)buf, order);
buf = NULL;
@@ -1882,6 +1882,7 @@ static int __init init_iommu_all(struct acpi_table_header 
*table)
WARN_ON(p != end);
 
/* Phase 2 : Early feature support check */
+   amd_iommu_snp_sup = check_feature_on_all_iommus(FEATURE_SNP);
 
/* Phase 3 : Enabling IOMMU features */
for_each_iommu(iommu) {
@@ -2118,6 +2119,9 @@ static void print_iommu_info(void)
if (iommu->features & FEATURE_GAM_VAPIC)
pr_cont(" GA_vAPIC");
 
+   if (iommu->features & FEATURE_SNP)
+   pr_cont(" SNP");
+
pr_cont("\n");
}
}
-- 
2.32.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 5/7] iommu/amd: Set translation valid bit only when IO page tables are in use

2022-06-15 Thread Suravee Suthikulpanit via iommu

On AMD system with SNP enabled, IOMMU hardware checks the host translation
valid (TV) and guest translation valid (GV) bits in the device table entry
(DTE) before accessing the corresponded page tables.

However, current IOMMU driver sets the TV bit for all devices regardless
of whether the host page table is in use. This results in
ILLEGAL_DEV_TABLE_ENTRY event for devices, which do not the host page
table root pointer set up.

Thefore, when SNP is enabled, only set TV bit when DMA remapping is not
used, which is when domain ID in the AMD IOMMU device table entry (DTE)
is zero.

Signed-off-by: Suravee Suthikulpanit 
---
 drivers/iommu/amd/init.c  |  3 ++-
 drivers/iommu/amd/iommu.c | 15 +--
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index b5d3de327a5f..bc008a82c12c 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -2544,7 +2544,8 @@ static void init_device_table_dma(struct 
amd_iommu_pci_seg *pci_seg)
 
for (devid = 0; devid <= pci_seg->last_bdf; ++devid) {
__set_dev_entry_bit(dev_table, devid, DEV_ENTRY_VALID);
-   __set_dev_entry_bit(dev_table, devid, DEV_ENTRY_TRANSLATION);
+   if (!amd_iommu_snp_en)
+   __set_dev_entry_bit(dev_table, devid, 
DEV_ENTRY_TRANSLATION);
}
 }
 
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 0792cd618dba..4f4571d3ff61 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1563,7 +1563,14 @@ static void set_dte_entry(struct amd_iommu *iommu, u16 
devid,
(domain->flags & PD_GIOV_MASK))
pte_root |= DTE_FLAG_GIOV;
 
-   pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV;
+   pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V;
+
+   /*
+* When SNP is enabled, Only set TV bit when IOMMU
+* page translation is in use.
+*/
+   if (!amd_iommu_snp_en || (domain->id != 0))
+   pte_root |= DTE_FLAG_TV;
 
flags = dev_table[devid].data[1];
 
@@ -1625,7 +1632,11 @@ static void clear_dte_entry(struct amd_iommu *iommu, u16 
devid)
struct dev_table_entry *dev_table = get_dev_table(iommu);
 
/* remove entry from the device table seen by the hardware */
-   dev_table[devid].data[0]  = DTE_FLAG_V | DTE_FLAG_TV;
+   dev_table[devid].data[0]  = DTE_FLAG_V;
+
+   if (!amd_iommu_snp_en)
+   dev_table[devid].data[0] |= DTE_FLAG_TV;
+
dev_table[devid].data[1] &= DTE_FLAG_MASK;
 
amd_iommu_apply_erratum_63(iommu, devid);
-- 
2.32.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 0/7] iommu/amd: Enforce IOMMU restrictions for SNP-enabled system

2022-06-15 Thread Suravee Suthikulpanit via iommu

SNP-enabled system requires IOMMU v1 page table to be configured with
non-zero DTE[Mode] for DMA-capable devices. This effects a number of
usecases such as IOMMU pass-through mode and AMD IOMMUv2 APIs for
binding/unbinding pasid.

The series introduce a global variable to check SNP-enabled state
during driver initialization, and use it to enforce the SNP restrictions
during runtime.

Also, for non-DMA-capable devices such as IOAPIC, the recommendation
is to set DTE[TV] and DTE[Mode] to zero on SNP-enabled system.
Therefore, additinal checks is added before setting DTE[TV].

Testing:
  - Tested booting and verify dmesg.
  - Tested booting with iommu=pt
  - Tested loading amd_iommu_v2 driver
  - Tested changing the iommu domain at runtime
  - Tested booting SEV/SNP-enabled guest
  - Tested when CONFIG_AMD_MEM_ENCRYPT is not set

Pre-requisite:
  - [PATCH v3 00/35] iommu/amd: Add multiple PCI segments support

https://lore.kernel.org/linux-iommu/20220511072141.15485-29-vasant.he...@amd.com/T/

Chanages from V1:
(https://lore.kernel.org/linux-iommu/20220613012502.109918-1-suravee.suthikulpa...@amd.com/T/#t
 )
  - Remove the newly introduced domain_type_supported() callback.
  - Patch 1: Modify existing check_feature_on_all_iommus() instead of
 introducing another helper function to do similar check.
  - Patch 3: Modify to use check_feature_on_all_iommus().
  - Patch 4: Add IOMMU init_state check before enabling SNP.
 Also move the function declaration to include/linux/amd-iommu.h 
  - Patch 6: Modify amd_iommu_domain_alloc() to fail when allocating identity
 domain and SNP is enabled.

Best Regards,
Suravee

Brijesh Singh (1):
  iommu/amd: Introduce function to check and enable SNP

Suravee Suthikulpanit (6):
  iommu/amd: Warn when found inconsistency EFR mask
  iommu/amd: Process all IVHDs before enabling IOMMU features
  iommu/amd: Introduce an iommu variable for tracking SNP support status
  iommu/amd: Set translation valid bit only when IO page tables are in
use
  iommu/amd: Do not support IOMMU_DOMAIN_IDENTITY after SNP is enabled
  iommu/amd: Do not support IOMMUv2 APIs when SNP is enabled

 drivers/iommu/amd/amd_iommu_types.h |   5 ++
 drivers/iommu/amd/init.c| 110 +++-
 drivers/iommu/amd/iommu.c   |  28 ++-
 include/linux/amd-iommu.h   |   6 ++
 4 files changed, 127 insertions(+), 22 deletions(-)

-- 
2.32.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 1/7] iommu/amd: Warn when found inconsistency EFR mask

2022-06-15 Thread Suravee Suthikulpanit via iommu

The function check_feature_on_all_iommus() checks to ensure if an IOMMU
feature support bit is set on the Extended Feature Register (EFR).
Current logic iterates through all IOMMU, and returns false when it
found the first unset bit.

To provide more thorough checking, modify the logic to iterate through all
IOMMUs even when found that the bit is not set, and also throws a FW_BUG
warning if inconsistency is found.

Signed-off-by: Suravee Suthikulpanit 
---
 drivers/iommu/amd/init.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 3dd0f26039c7..b3e4551ce9dd 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -261,18 +261,29 @@ int amd_iommu_get_num_iommus(void)
 }
 
 #ifdef CONFIG_IRQ_REMAP
+/*
+ * Iterate through all the IOMMUs to verify if the specified
+ * EFR bitmask of IOMMU feature are set.
+ * Warn and return false if found inconsistency.
+ */
 static bool check_feature_on_all_iommus(u64 mask)
 {
bool ret = false;
struct amd_iommu *iommu;
 
for_each_iommu(iommu) {
-   ret = iommu_feature(iommu, mask);
-   if (!ret)
+   bool tmp = iommu_feature(iommu, mask);
+
+   if ((ret != tmp) &&
+   !list_is_first(&iommu->list, &amd_iommu_list)) {
+   pr_err(FW_BUG "Found inconsistent EFR mask (%#llx) on 
iommu%d (%04x:%02x:%02x.%01x).\n",
+  mask, iommu->index, iommu->pci_seg->id, 
PCI_BUS_NUM(iommu->devid),
+  PCI_SLOT(iommu->devid), PCI_FUNC(iommu->devid));
return false;
+   }
+   ret = tmp;
}
-
-   return true;
+   return ret;
 }
 #endif
 
-- 
2.32.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 5/5] vfio/iommu_type1: Simplify group attachment

2022-06-15 Thread Nicolin Chen via iommu

Un-inline the domain specific logic from the attach/detach_group ops into
two paired functions vfio_iommu_alloc_attach_domain() and
vfio_iommu_detach_destroy_domain() that strictly deal with creating and
destroying struct vfio_domains.

Add the logic to check for EMEDIUMTYPE return code of iommu_attach_group()
and avoid the extra domain allocations and attach/detach sequences of the
old code. This allows properly detecting an actual attach error, like
-ENOMEM, vs treating all attach errors as an incompatible domain.

Co-developed-by: Jason Gunthorpe 
Signed-off-by: Jason Gunthorpe 
Signed-off-by: Nicolin Chen 
---
 drivers/vfio/vfio_iommu_type1.c | 298 +---
 1 file changed, 163 insertions(+), 135 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 573caf320788..5986c68e59ee 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -86,6 +86,7 @@ struct vfio_domain {
struct list_headgroup_list;
boolfgsp : 1;   /* Fine-grained super pages */
boolenforce_cache_coherency : 1;
+   boolmsi_cookie : 1;
 };
 
 struct vfio_dma {
@@ -2153,12 +2154,163 @@ static void vfio_iommu_iova_insert_copy(struct 
vfio_iommu *iommu,
list_splice_tail(iova_copy, iova);
 }
 
+static struct vfio_domain *
+vfio_iommu_alloc_attach_domain(struct bus_type *bus, struct vfio_iommu *iommu,
+  struct vfio_iommu_group *group)
+{
+   struct iommu_domain *new_domain;
+   struct vfio_domain *domain;
+   int ret = 0;
+
+   /* Try to match an existing compatible domain */
+   list_for_each_entry (domain, &iommu->domain_list, next) {
+   ret = iommu_attach_group(domain->domain, group->iommu_group);
+   if (ret == -EMEDIUMTYPE)
+   continue;
+   if (ret)
+   return ERR_PTR(ret);
+   goto done;
+   }
+
+   new_domain = iommu_domain_alloc(bus);
+   if (!new_domain)
+   return ERR_PTR(-EIO);
+
+   if (iommu->nesting) {
+   ret = iommu_enable_nesting(new_domain);
+   if (ret)
+   goto out_free_iommu_domain;
+   }
+
+   ret = iommu_attach_group(new_domain, group->iommu_group);
+   if (ret)
+   goto out_free_iommu_domain;
+
+   domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+   if (!domain) {
+   ret = -ENOMEM;
+   goto out_detach;
+   }
+
+   domain->domain = new_domain;
+   vfio_test_domain_fgsp(domain);
+
+   /*
+* If the IOMMU can block non-coherent operations (ie PCIe TLPs with
+* no-snoop set) then VFIO always turns this feature on because on Intel
+* platforms it optimizes KVM to disable wbinvd emulation.
+*/
+   if (new_domain->ops->enforce_cache_coherency)
+   domain->enforce_cache_coherency =
+   new_domain->ops->enforce_cache_coherency(new_domain);
+
+   /* replay mappings on new domains */
+   ret = vfio_iommu_replay(iommu, domain);
+   if (ret)
+   goto out_free_domain;
+
+   INIT_LIST_HEAD(&domain->group_list);
+   list_add(&domain->next, &iommu->domain_list);
+   vfio_update_pgsize_bitmap(iommu);
+
+done:
+   list_add(&group->next, &domain->group_list);
+
+   /*
+* An iommu backed group can dirty memory directly and therefore
+* demotes the iommu scope until it declares itself dirty tracking
+* capable via the page pinning interface.
+*/
+   iommu->num_non_pinned_groups++;
+
+   return domain;
+
+out_free_domain:
+   kfree(domain);
+out_detach:
+   iommu_detach_group(new_domain, group->iommu_group);
+out_free_iommu_domain:
+   iommu_domain_free(new_domain);
+   return ERR_PTR(ret);
+}
+
+static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
+{
+   struct rb_node *node;
+
+   while ((node = rb_first(&iommu->dma_list)))
+   vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
+}
+
+static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
+{
+   struct rb_node *n, *p;
+
+   n = rb_first(&iommu->dma_list);
+   for (; n; n = rb_next(n)) {
+   struct vfio_dma *dma;
+   long locked = 0, unlocked = 0;
+
+   dma = rb_entry(n, struct vfio_dma, node);
+   unlocked += vfio_unmap_unpin(iommu, dma, false);
+   p = rb_first(&dma->pfn_list);
+   for (; p; p = rb_next(p)) {
+   struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn,
+node);
+
+   if (!is_invalid_reserved_pfn(vpfn->pfn))
+   locked++;
+   }
+   vfio_lock_acct(dma, l

[PATCH v2 3/5] vfio/iommu_type1: Remove the domain->ops comparison

2022-06-15 Thread Nicolin Chen via iommu

The domain->ops validation was added, as a precaution, for mixed-driver
systems. However, at this moment only one iommu driver is possible. So
remove it.

Per discussion with Robin, in future when many can be permitted we will
rely on the IOMMU core code to check the domain->ops:
https://lore.kernel.org/linux-iommu/6575de6d-94ba-c427-5b1e-967750ddf...@arm.com/

Signed-off-by: Nicolin Chen 
---
 drivers/vfio/vfio_iommu_type1.c | 32 +++-
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index f4e3b423a453..11be5f95580b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2277,29 +2277,19 @@ static int vfio_iommu_type1_attach_group(void 
*iommu_data,
domain->domain->ops->enforce_cache_coherency(
domain->domain);
 
-   /*
-* Try to match an existing compatible domain.  We don't want to
-* preclude an IOMMU driver supporting multiple bus_types and being
-* able to include different bus_types in the same IOMMU domain, so
-* we test whether the domains use the same iommu_ops rather than
-* testing if they're on the same bus_type.
-*/
+   /* Try to match an existing compatible domain */
list_for_each_entry(d, &iommu->domain_list, next) {
-   if (d->domain->ops == domain->domain->ops) {
-   iommu_detach_group(domain->domain, group->iommu_group);
-   if (!iommu_attach_group(d->domain,
-   group->iommu_group)) {
-   list_add(&group->next, &d->group_list);
-   iommu_domain_free(domain->domain);
-   kfree(domain);
-   goto done;
-   }
-
-   ret = iommu_attach_group(domain->domain,
-group->iommu_group);
-   if (ret)
-   goto out_domain;
+   iommu_detach_group(domain->domain, group->iommu_group);
+   if (!iommu_attach_group(d->domain, group->iommu_group)) {
+   list_add(&group->next, &d->group_list);
+   iommu_domain_free(domain->domain);
+   kfree(domain);
+   goto done;
}
+
+   ret = iommu_attach_group(domain->domain,  group->iommu_group);
+   if (ret)
+   goto out_domain;
}
 
vfio_test_domain_fgsp(domain);
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 4/5] vfio/iommu_type1: Clean up update_dirty_scope in detach_group()

2022-06-15 Thread Nicolin Chen via iommu

All devices in emulated_iommu_groups have pinned_page_dirty_scope
set, so the update_dirty_scope in the first list_for_each_entry
is always false. Clean it up, and move the "if update_dirty_scope"
part from the detach_group_done routine to the domain_list part.

Rename the "detach_group_done" goto label accordingly.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Nicolin Chen 
---
 drivers/vfio/vfio_iommu_type1.c | 27 ---
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 11be5f95580b..573caf320788 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2453,14 +2453,12 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
struct vfio_iommu *iommu = iommu_data;
struct vfio_domain *domain;
struct vfio_iommu_group *group;
-   bool update_dirty_scope = false;
LIST_HEAD(iova_copy);
 
mutex_lock(&iommu->lock);
list_for_each_entry(group, &iommu->emulated_iommu_groups, next) {
if (group->iommu_group != iommu_group)
continue;
-   update_dirty_scope = !group->pinned_page_dirty_scope;
list_del(&group->next);
kfree(group);
 
@@ -2469,7 +2467,7 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
WARN_ON(iommu->notifier.head);
vfio_iommu_unmap_unpin_all(iommu);
}
-   goto detach_group_done;
+   goto out_unlock;
}
 
/*
@@ -2485,9 +2483,7 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
continue;
 
iommu_detach_group(domain->domain, group->iommu_group);
-   update_dirty_scope = !group->pinned_page_dirty_scope;
list_del(&group->next);
-   kfree(group);
/*
 * Group ownership provides privilege, if the group list is
 * empty, the domain goes away. If it's the last domain with
@@ -2510,6 +2506,16 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
vfio_iommu_aper_expand(iommu, &iova_copy);
vfio_update_pgsize_bitmap(iommu);
}
+   /*
+* Removal of a group without dirty tracking may allow
+* the iommu scope to be promoted.
+*/
+   if (!group->pinned_page_dirty_scope) {
+   iommu->num_non_pinned_groups--;
+   if (iommu->dirty_page_tracking)
+   vfio_iommu_populate_bitmap_full(iommu);
+   }
+   kfree(group);
break;
}
 
@@ -2518,16 +2524,7 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
else
vfio_iommu_iova_free(&iova_copy);
 
-detach_group_done:
-   /*
-* Removal of a group without dirty tracking may allow the iommu scope
-* to be promoted.
-*/
-   if (update_dirty_scope) {
-   iommu->num_non_pinned_groups--;
-   if (iommu->dirty_page_tracking)
-   vfio_iommu_populate_bitmap_full(iommu);
-   }
+out_unlock:
mutex_unlock(&iommu->lock);
 }
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-06-15 Thread Nicolin Chen via iommu

Cases like VFIO wish to attach a device to an existing domain that was
not allocated specifically from the device. This raises a condition
where the IOMMU driver can fail the domain attach because the domain and
device are incompatible with each other.

This is a soft failure that can be resolved by using a different domain.

Provide a dedicated errno from the IOMMU driver during attach that the
reason attached failed is because of domain incompatability. EMEDIUMTYPE
is chosen because it is never used within the iommu subsystem today and
evokes a sense that the 'medium' aka the domain is incompatible.

VFIO can use this to know attach is a soft failure and it should continue
searching. Otherwise the attach will be a hard failure and VFIO will
return the code to userspace.

Update all drivers to return EMEDIUMTYPE in their failure paths that are
related to domain incompatability. Also turn adjacent error prints into
debug prints, for these soft failures, to prevent a kernel log spam.

Add kdocs describing this behavior.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Nicolin Chen 
---
 drivers/iommu/amd/iommu.c   |  2 +-
 drivers/iommu/apple-dart.c  |  4 +--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 12 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c   |  4 +--
 drivers/iommu/arm/arm-smmu/qcom_iommu.c |  4 +--
 drivers/iommu/intel/iommu.c |  6 ++---
 drivers/iommu/iommu.c   | 28 +
 drivers/iommu/ipmmu-vmsa.c  |  4 +--
 drivers/iommu/mtk_iommu_v1.c|  2 +-
 drivers/iommu/omap-iommu.c  |  4 +--
 drivers/iommu/s390-iommu.c  |  2 +-
 drivers/iommu/sprd-iommu.c  |  4 +--
 drivers/iommu/tegra-gart.c  |  2 +-
 drivers/iommu/virtio-iommu.c|  4 +--
 14 files changed, 55 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 840831d5d2ad..ad499658a6b6 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1662,7 +1662,7 @@ static int attach_device(struct device *dev,
if (domain->flags & PD_IOMMUV2_MASK) {
struct iommu_domain *def_domain = iommu_get_dma_domain(dev);
 
-   ret = -EINVAL;
+   ret = -EMEDIUMTYPE;
if (def_domain->type != IOMMU_DOMAIN_IDENTITY)
goto out;
 
diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 8af0242a90d9..e58dc310afd7 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -495,10 +495,10 @@ static int apple_dart_attach_dev(struct iommu_domain 
*domain,
 
if (cfg->stream_maps[0].dart->force_bypass &&
domain->type != IOMMU_DOMAIN_IDENTITY)
-   return -EINVAL;
+   return -EMEDIUMTYPE;
if (!cfg->stream_maps[0].dart->supports_bypass &&
domain->type == IOMMU_DOMAIN_IDENTITY)
-   return -EINVAL;
+   return -EMEDIUMTYPE;
 
ret = apple_dart_finalize_domain(domain, cfg);
if (ret)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 88817a3376ef..1c66e4b6d852 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2420,24 +2420,24 @@ static int arm_smmu_attach_dev(struct iommu_domain 
*domain, struct device *dev)
goto out_unlock;
}
} else if (smmu_domain->smmu != smmu) {
-   dev_err(dev,
+   dev_dbg(dev,
"cannot attach to SMMU %s (upstream of %s)\n",
dev_name(smmu_domain->smmu->dev),
dev_name(smmu->dev));
-   ret = -ENXIO;
+   ret = -EMEDIUMTYPE;
goto out_unlock;
} else if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1 &&
   master->ssid_bits != smmu_domain->s1_cfg.s1cdmax) {
-   dev_err(dev,
+   dev_dbg(dev,
"cannot attach to incompatible domain (%u SSID bits != 
%u)\n",
smmu_domain->s1_cfg.s1cdmax, master->ssid_bits);
-   ret = -EINVAL;
+   ret = -EMEDIUMTYPE;
goto out_unlock;
} else if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1 &&
   smmu_domain->stall_enabled != master->stall_enabled) {
-   dev_err(dev, "cannot attach to stall-%s domain\n",
+   dev_dbg(dev, "cannot attach to stall-%s domain\n",
smmu_domain->stall_enabled ? "enabled" : "disabled");
-   ret = -EINVAL;
+   ret = -EMEDIUMTYPE;
goto out_unlock;
}
 
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 2ed3594f384e..1d40023253d8 100644
--- a/drivers/i

[PATCH v2 2/5] vfio/iommu_type1: Prefer to reuse domains vs match enforced cache coherency

2022-06-15 Thread Nicolin Chen via iommu

From: Jason Gunthorpe 

The KVM mechanism for controlling wbinvd is based on OR of the coherency
property of all devices attached to a guest, no matter those devices are
attached to a single domain or multiple domains.

So, there is no value in trying to push a device that could do enforced
cache coherency to a dedicated domain vs re-using an existing domain
which is non-coherent since KVM won't be able to take advantage of it.
This just wastes domain memory.

Simplify this code and eliminate the test. This removes the only logic
that needed to have a dummy domain attached prior to searching for a
matching domain and simplifies the next patches.

It's unclear whether we want to further optimize the Intel driver to
update the domain coherency after a device is detached from it, at
least not before KVM can be verified to handle such dynamics in related
emulation paths (wbinvd, vcpu load, write_cr0, ept, etc.). In reality
we don't see an usage requiring such optimization as the only device
which imposes such non-coherency is Intel GPU which even doesn't
support hotplug/hot remove.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Nicolin Chen 
---
 drivers/vfio/vfio_iommu_type1.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c13b9290e357..f4e3b423a453 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2285,9 +2285,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 * testing if they're on the same bus_type.
 */
list_for_each_entry(d, &iommu->domain_list, next) {
-   if (d->domain->ops == domain->domain->ops &&
-   d->enforce_cache_coherency ==
-   domain->enforce_cache_coherency) {
+   if (d->domain->ops == domain->domain->ops) {
iommu_detach_group(domain->domain, group->iommu_group);
if (!iommu_attach_group(d->domain,
group->iommu_group)) {
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 0/5] Simplify vfio_iommu_type1 attach/detach routine

2022-06-15 Thread Nicolin Chen via iommu

This is a preparatory series for IOMMUFD v2 patches. It enforces error
code -EMEDIUMTYPE in iommu_attach_device() and iommu_attach_group() when
an IOMMU domain and a device/group are incompatible. It also drops the
useless domain->ops check since it won't fail in current environment.

These allow VFIO iommu code to simplify its group attachment routine, by
avoiding the extra IOMMU domain allocations and attach/detach sequences
of the old code.

Worths mentioning the exact match for enforce_cache_coherency is removed
with this series, since there's very less value in doing that since KVM
won't be able to take advantage of it -- this just wastes domain memory.
Instead, we rely on Intel IOMMU driver taking care of that internally.

This is on github: https://github.com/nicolinc/iommufd/commits/vfio_iommu_attach

Changelog
v2:
 * Added -EMEDIUMTYPE to more IOMMU drivers that fit the category.
 * Changed dev_err to dev_dbg for -EMEDIUMTYPE to avoid kernel log spam.
 * Dropped iommu_ops patch, and removed domain->ops in VFIO directly,
   since there's no mixed-driver use case that would fail the sanity.
 * Updated commit log of the patch removing enforce_cache_coherency.
 * Fixed a misplace of "num_non_pinned_groups--" in detach_group patch.
 * Moved "num_non_pinned_groups++" in PATCH-5 to the common path between
   domain-reusing and new-domain pathways, like the code previously did.
 * Fixed a typo in EMEDIUMTYPE patch.
v1: https://lore.kernel.org/kvm/20220606061927.26049-1-nicol...@nvidia.com/

Jason Gunthorpe (1):
  vfio/iommu_type1: Prefer to reuse domains vs match enforced cache
coherency

Nicolin Chen (4):
  iommu: Return -EMEDIUMTYPE for incompatible domain and device/group
  vfio/iommu_type1: Remove the domain->ops comparison
  vfio/iommu_type1: Clean up update_dirty_scope in detach_group()
  vfio/iommu_type1: Simplify group attachment

 drivers/iommu/amd/iommu.c   |   2 +-
 drivers/iommu/apple-dart.c  |   4 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  12 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.c   |   4 +-
 drivers/iommu/arm/arm-smmu/qcom_iommu.c |   4 +-
 drivers/iommu/intel/iommu.c |   6 +-
 drivers/iommu/iommu.c   |  28 ++
 drivers/iommu/ipmmu-vmsa.c  |   4 +-
 drivers/iommu/mtk_iommu_v1.c|   2 +-
 drivers/iommu/omap-iommu.c  |   4 +-
 drivers/iommu/s390-iommu.c  |   2 +-
 drivers/iommu/sprd-iommu.c  |   4 +-
 drivers/iommu/tegra-gart.c  |   2 +-
 drivers/iommu/virtio-iommu.c|   4 +-
 drivers/vfio/vfio_iommu_type1.c | 317 ++--
 15 files changed, 220 insertions(+), 179 deletions(-)

-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 3/5] vfio/iommu_type1: Prefer to reuse domains vs match enforced cache coherency

2022-06-15 Thread Nicolin Chen via iommu

On Wed, Jun 15, 2022 at 07:35:00AM +, Tian, Kevin wrote:
> External email: Use caution opening links or attachments
> 
> 
> > From: Nicolin Chen 
> > Sent: Wednesday, June 15, 2022 4:45 AM
> >
> > Hi Kevin,
> >
> > On Wed, Jun 08, 2022 at 11:48:27PM +, Tian, Kevin wrote:
> > > > > > The KVM mechanism for controlling wbinvd is only triggered during
> > > > > > kvm_vfio_group_add(), meaning it is a one-shot test done once the
> > > > devices
> > > > > > are setup.
> > > > >
> > > > > It's not one-shot. kvm_vfio_update_coherency() is called in both
> > > > > group_add() and group_del(). Then the coherency property is
> > > > > checked dynamically in wbinvd emulation:
> > > >
> > > > From the perspective of managing the domains that is still
> > > > one-shot. It doesn't get updated when individual devices are
> > > > added/removed to domains.
> > >
> > > It's unchanged per-domain but dynamic per-vm when multiple
> > > domains are added/removed (i.e. kvm->arch.noncoherent_dma_count).
> > > It's the latter being checked in the kvm.
> >
> > I am going to send a v2, yet not quite getting the point here.
> > Meanwhile, Jason is on leave.
> >
> > What, in your opinion, would be an accurate description here?
> >
> 
> Something like below:
> --
> The KVM mechanism for controlling wbinvd is based on OR of
> the coherency property of all devices attached to a guest, no matter
> those devices  are attached to a single domain or multiple domains.
> 
> So, there is no value in trying to push a device that could do enforced
> cache coherency to a dedicated domain vs re-using an existing domain
> which is non-coherent since KVM won't be able to take advantage of it.
> This just wastes domain memory.
> 
> Simplify this code and eliminate the test. This removes the only logic
> that needed to have a dummy domain attached prior to searching for a
> matching domain and simplifies the next patches.
> 
> It's unclear whether we want to further optimize the Intel driver to
> update the domain coherency after a device is detached from it, at
> least not before KVM can be verified to handle such dynamics in related
> emulation paths (wbinvd, vcpu load, write_cr0, ept, etc.). In reality
> we don't see an usage requiring such optimization as the only device
> which imposes such non-coherency is Intel GPU which even doesn't
> support hotplug/hot remove.

Thanks! I just updated that and will send v2.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v3] iommu/vt-d: Make DMAR_UNITS_SUPPORTED a config setting

2022-06-15 Thread Jerry Snitselaar

On Wed, Jun 15, 2022 at 01:36:50PM -0500, Steve Wahl wrote:
> To support up to 64 sockets with 10 DMAR units each (640), make the
> value of DMAR_UNITS_SUPPORTED adjustable by a config variable,
> CONFIG_DMAR_UNITS_SUPPORTED, and make it's default 1024 when MAXSMP is
> set.
> 
> If the available hardware exceeds DMAR_UNITS_SUPPORTED (previously set
> to MAX_IO_APICS, or 128), it causes these messages: "DMAR: Failed to
> allocate seq_id", "DMAR: Parse DMAR table failure.", and "x2apic: IRQ
> remapping doesn't support X2APIC mode x2apic disabled"; and the system
> fails to boot properly.
> 
> Signed-off-by: Steve Wahl 
> Reviewed-by: Kevin Tian 

Reviewed-by: Jerry Snitselaar 

> ---
> 
> Note that we could not find a reason for connecting
> DMAR_UNITS_SUPPORTED to MAX_IO_APICS as was done previously.  Perhaps
> it seemed like the two would continue to match on earlier processors.
> There doesn't appear to be kernel code that assumes that the value of
> one is related to the other.
> 
> v2: Make this value a config option, rather than a fixed constant.  The 
> default
> values should match previous configuration except in the MAXSMP case.  
> Keeping the
> value at a power of two was requested by Kevin Tian.
> 
> v3: Make the config option dependent upon DMAR_TABLE, as it is not used 
> without this.
> 
>  drivers/iommu/intel/Kconfig | 7 +++
>  include/linux/dmar.h| 6 +-
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
> index 39a06d245f12..07aaebcb581d 100644
> --- a/drivers/iommu/intel/Kconfig
> +++ b/drivers/iommu/intel/Kconfig
> @@ -9,6 +9,13 @@ config DMAR_PERF
>  config DMAR_DEBUG
>   bool
>  
> +config DMAR_UNITS_SUPPORTED
> + int "Number of DMA Remapping Units supported"
> + depends on DMAR_TABLE
> + default 1024 if MAXSMP
> + default 128  if X86_64
> + default 64
> +
>  config INTEL_IOMMU
>   bool "Support for Intel IOMMU using DMA Remapping Devices"
>   depends on PCI_MSI && ACPI && (X86 || IA64)
> diff --git a/include/linux/dmar.h b/include/linux/dmar.h
> index 45e903d84733..0c03c1845c23 100644
> --- a/include/linux/dmar.h
> +++ b/include/linux/dmar.h
> @@ -18,11 +18,7 @@
>  
>  struct acpi_dmar_header;
>  
> -#ifdef   CONFIG_X86
> -# define DMAR_UNITS_SUPPORTEDMAX_IO_APICS
> -#else
> -# define DMAR_UNITS_SUPPORTED64
> -#endif
> +#define  DMAR_UNITS_SUPPORTEDCONFIG_DMAR_UNITS_SUPPORTED
>  
>  /* DMAR Flags */
>  #define DMAR_INTR_REMAP  0x1
> -- 
> 2.26.2
> 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v3] iommu/vt-d: Make DMAR_UNITS_SUPPORTED a config setting

2022-06-15 Thread Steve Wahl

To support up to 64 sockets with 10 DMAR units each (640), make the
value of DMAR_UNITS_SUPPORTED adjustable by a config variable,
CONFIG_DMAR_UNITS_SUPPORTED, and make it's default 1024 when MAXSMP is
set.

If the available hardware exceeds DMAR_UNITS_SUPPORTED (previously set
to MAX_IO_APICS, or 128), it causes these messages: "DMAR: Failed to
allocate seq_id", "DMAR: Parse DMAR table failure.", and "x2apic: IRQ
remapping doesn't support X2APIC mode x2apic disabled"; and the system
fails to boot properly.

Signed-off-by: Steve Wahl 
Reviewed-by: Kevin Tian 
---

Note that we could not find a reason for connecting
DMAR_UNITS_SUPPORTED to MAX_IO_APICS as was done previously.  Perhaps
it seemed like the two would continue to match on earlier processors.
There doesn't appear to be kernel code that assumes that the value of
one is related to the other.

v2: Make this value a config option, rather than a fixed constant.  The default
values should match previous configuration except in the MAXSMP case.  Keeping 
the
value at a power of two was requested by Kevin Tian.

v3: Make the config option dependent upon DMAR_TABLE, as it is not used without 
this.

 drivers/iommu/intel/Kconfig | 7 +++
 include/linux/dmar.h| 6 +-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 39a06d245f12..07aaebcb581d 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -9,6 +9,13 @@ config DMAR_PERF
 config DMAR_DEBUG
bool
 
+config DMAR_UNITS_SUPPORTED
+   int "Number of DMA Remapping Units supported"
+   depends on DMAR_TABLE
+   default 1024 if MAXSMP
+   default 128  if X86_64
+   default 64
+
 config INTEL_IOMMU
bool "Support for Intel IOMMU using DMA Remapping Devices"
depends on PCI_MSI && ACPI && (X86 || IA64)
diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index 45e903d84733..0c03c1845c23 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -18,11 +18,7 @@
 
 struct acpi_dmar_header;
 
-#ifdef CONFIG_X86
-# define   DMAR_UNITS_SUPPORTEDMAX_IO_APICS
-#else
-# define   DMAR_UNITS_SUPPORTED64
-#endif
+#defineDMAR_UNITS_SUPPORTEDCONFIG_DMAR_UNITS_SUPPORTED
 
 /* DMAR Flags */
 #define DMAR_INTR_REMAP0x1
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v9 3/3] iommu/mediatek: Allow page table PA up to 35bit

2022-06-15 Thread Robin Murphy


On 2022-06-15 17:12, yf.wang--- via iommu wrote:

From: Yunfei Wang 

Single memory zone feature will remove ZONE_DMA32 and ZONE_DMA. So add
the quirk IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT to let level 1 and level 2
pgtable support at most 35bit PA.


I'm not sure how this works in practice, given that you don't seem to be 
setting the IOMMU's own DMA masks to more than 32 bits, so the DMA 
mapping in io-pgtable is going to fail if you ever do actually allocate 
a pagetable page above 4GB :/



Signed-off-by: Ning Li 
Signed-off-by: Yunfei Wang 
---
  drivers/iommu/mtk_iommu.c | 14 +-
  1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 3d62399e8865..4dbc33758711 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -138,6 +138,7 @@
  /* PM and clock always on. e.g. infra iommu */
  #define PM_CLK_AO BIT(15)
  #define IFA_IOMMU_PCIE_SUPPORTBIT(16)
+#define PGTABLE_PA_35_EN   BIT(17)
  
  #define MTK_IOMMU_HAS_FLAG_MASK(pdata, _x, mask)	\

pdata)->flags) & (mask)) == (_x))
@@ -240,6 +241,7 @@ struct mtk_iommu_data {
  struct mtk_iommu_domain {
struct io_pgtable_cfg   cfg;
struct io_pgtable_ops   *iop;
+   u32 ttbr;
  
  	struct mtk_iommu_bank_data	*bank;

struct iommu_domain domain;
@@ -596,6 +598,9 @@ static int mtk_iommu_domain_finalise(struct 
mtk_iommu_domain *dom,
.iommu_dev = data->dev,
};
  
+	if (MTK_IOMMU_HAS_FLAG(data->plat_data, PGTABLE_PA_35_EN))

+   dom->cfg.quirks |= IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT;
+
if (MTK_IOMMU_HAS_FLAG(data->plat_data, HAS_4GB_MODE))
dom->cfg.oas = data->enable_4GB ? 33 : 32;
else
@@ -684,8 +689,8 @@ static int mtk_iommu_attach_device(struct iommu_domain 
*domain,
goto err_unlock;
}
bank->m4u_dom = dom;
-   writel(dom->cfg.arm_v7s_cfg.ttbr & MMU_PT_ADDR_MASK,
-  bank->base + REG_MMU_PT_BASE_ADDR);
+   bank->m4u_dom->ttbr = MTK_IOMMU_ADDR(dom->cfg.arm_v7s_cfg.ttbr);
+   writel(bank->m4u_dom->ttbr, data->base + REG_MMU_PT_BASE_ADDR);


To add to my comment on patch #1, having to make this change here 
further indicates that you're using it the wrong way.


Thanks,
Robin.

  
  		pm_runtime_put(m4udev);

}
@@ -1366,8 +1371,7 @@ static int __maybe_unused mtk_iommu_runtime_resume(struct 
device *dev)
writel_relaxed(reg->int_control[i], base + 
REG_MMU_INT_CONTROL0);
writel_relaxed(reg->int_main_control[i], base + 
REG_MMU_INT_MAIN_CONTROL);
writel_relaxed(reg->ivrp_paddr[i], base + REG_MMU_IVRP_PADDR);
-   writel(m4u_dom->cfg.arm_v7s_cfg.ttbr & MMU_PT_ADDR_MASK,
-  base + REG_MMU_PT_BASE_ADDR);
+   writel(m4u_dom->ttbr, base + REG_MMU_PT_BASE_ADDR);
} while (++i < data->plat_data->banks_num);
  
  	/*

@@ -1401,7 +1405,7 @@ static const struct mtk_iommu_plat_data mt2712_data = {
  static const struct mtk_iommu_plat_data mt6779_data = {
.m4u_plat  = M4U_MT6779,
.flags = HAS_SUB_COMM_2BITS | OUT_ORDER_WR_EN | WR_THROT_EN |
-MTK_IOMMU_TYPE_MM,
+MTK_IOMMU_TYPE_MM | PGTABLE_PA_35_EN,
.inv_sel_reg   = REG_MMU_INV_SEL_GEN2,
.banks_num= 1,
.banks_enable = {true},

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v9 2/3] iommu/mediatek: Rename MTK_IOMMU_TLB_ADDR to MTK_IOMMU_ADDR

2022-06-15 Thread Robin Murphy


On 2022-06-15 17:12, yf.wang--- via iommu wrote:

From: Yunfei Wang 

Rename MTK_IOMMU_TLB_ADDR to MTK_IOMMU_ADDR, and update MTK_IOMMU_ADDR
definition for better generality.

Signed-off-by: Ning Li 
Signed-off-by: Yunfei Wang 
---
  drivers/iommu/mtk_iommu.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index bb9dd92c9898..3d62399e8865 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -265,8 +265,8 @@ static const struct iommu_ops mtk_iommu_ops;
  
  static int mtk_iommu_hw_init(const struct mtk_iommu_data *data, unsigned int bankid);
  
-#define MTK_IOMMU_TLB_ADDR(iova) ({	\

-   dma_addr_t _addr = iova;\
+#define MTK_IOMMU_ADDR(addr) ({
\
+   unsigned long long _addr = addr;\


If phys_addr_t is 64-bit, then dma_addr_t is also 64-bit, so there is no 
loss of generality from using an appropriate type - IOVAs have to fit 
into dma_addr_t for iommu-dma, after all. However, since IOVAs also have 
to fit into unsigned long in the general IOMMU API, as "addr" is here, 
then this is still just as broken for 32-bit LPAE as the existing code is.


Thanks,
Robin.


((lower_32_bits(_addr) & GENMASK(31, 12)) | upper_32_bits(_addr));\
  })
  
@@ -381,8 +381,8 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,

writel_relaxed(F_INVLD_EN1 | F_INVLD_EN0,
   base + data->plat_data->inv_sel_reg);
  
-		writel_relaxed(MTK_IOMMU_TLB_ADDR(iova), base + REG_MMU_INVLD_START_A);

-   writel_relaxed(MTK_IOMMU_TLB_ADDR(iova + size - 1),
+   writel_relaxed(MTK_IOMMU_ADDR(iova), base + 
REG_MMU_INVLD_START_A);
+   writel_relaxed(MTK_IOMMU_ADDR(iova + size - 1),
   base + REG_MMU_INVLD_END_A);
writel_relaxed(F_MMU_INV_RANGE, base + REG_MMU_INVALIDATE);
  

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v9 1/3] iommu/io-pgtable-arm-v7s: Add a quirk to allow pgtable PA up to 35bit

2022-06-15 Thread Robin Murphy


On 2022-06-15 17:12, yf.w...@mediatek.com wrote:

From: Yunfei Wang 

Single memory zone feature will remove ZONE_DMA32 and ZONE_DMA and
cause pgtable PA size larger than 32bit.

Since Mediatek IOMMU hardware support at most 35bit PA in pgtable,
so add a quirk to allow the PA of pgtables support up to bit35.

Signed-off-by: Ning Li 
Signed-off-by: Yunfei Wang 
---
  drivers/iommu/io-pgtable-arm-v7s.c | 58 +++---
  include/linux/io-pgtable.h | 17 +
  2 files changed, 56 insertions(+), 19 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c 
b/drivers/iommu/io-pgtable-arm-v7s.c
index be066c1503d3..39e5503ac75a 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -182,14 +182,8 @@ static bool arm_v7s_is_mtk_enabled(struct io_pgtable_cfg 
*cfg)
(cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_EXT);
  }
  
-static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,

-   struct io_pgtable_cfg *cfg)
+static arm_v7s_iopte to_mtk_iopte(phys_addr_t paddr, arm_v7s_iopte pte)
  {
-   arm_v7s_iopte pte = paddr & ARM_V7S_LVL_MASK(lvl);
-
-   if (!arm_v7s_is_mtk_enabled(cfg))
-   return pte;
-
if (paddr & BIT_ULL(32))
pte |= ARM_V7S_ATTR_MTK_PA_BIT32;
if (paddr & BIT_ULL(33))
@@ -199,6 +193,17 @@ static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int 
lvl,
return pte;
  }
  
+static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,

+   struct io_pgtable_cfg *cfg)
+{
+   arm_v7s_iopte pte = paddr & ARM_V7S_LVL_MASK(lvl);
+
+   if (arm_v7s_is_mtk_enabled(cfg))
+   return to_mtk_iopte(paddr, pte);
+
+   return pte;
+}
+
  static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
  struct io_pgtable_cfg *cfg)
  {
@@ -240,10 +245,17 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
dma_addr_t dma;
size_t size = ARM_V7S_TABLE_SIZE(lvl, cfg);
void *table = NULL;
+   gfp_t gfp_l1;
+
+   /*
+* ARM_MTK_TTBR_EXT extend the translation table base support all
+* memory address.
+*/
+   gfp_l1 = cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT ?
+GFP_KERNEL : ARM_V7S_TABLE_GFP_DMA;
  
  	if (lvl == 1)

-   table = (void *)__get_free_pages(
-   __GFP_ZERO | ARM_V7S_TABLE_GFP_DMA, get_order(size));
+   table = (void *)__get_free_pages(gfp_l1 | __GFP_ZERO, 
get_order(size));
else if (lvl == 2)
table = kmem_cache_zalloc(data->l2_tables, gfp);
  
@@ -251,7 +263,8 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,

return NULL;
  
  	phys = virt_to_phys(table);

-   if (phys != (arm_v7s_iopte)phys) {
+   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT ?
+   phys >= (1ULL << cfg->oas) : phys != (arm_v7s_iopte)phys) {


Given that the comment above says it supports all of memory, how would 
phys >= (1ULL << cfg->oas) ever be true?



/* Doesn't fit in PTE */
dev_err(dev, "Page table does not fit in PTE: %pa", &phys);
goto out_free;
@@ -457,9 +470,14 @@ static arm_v7s_iopte arm_v7s_install_table(arm_v7s_iopte 
*table,
   arm_v7s_iopte curr,
   struct io_pgtable_cfg *cfg)
  {
+   phys_addr_t phys = virt_to_phys(table);
arm_v7s_iopte old, new;
  
-	new = virt_to_phys(table) | ARM_V7S_PTE_TYPE_TABLE;

+   new = phys | ARM_V7S_PTE_TYPE_TABLE;
+
+   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT)
+   new = to_mtk_iopte(phys, new);
+
if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
new |= ARM_V7S_ATTR_NS_TABLE;
  
@@ -779,6 +797,7 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct io_pgtable_cfg *cfg,

void *cookie)
  {
struct arm_v7s_io_pgtable *data;
+   slab_flags_t slab_flag;
  
  	if (cfg->ias > (arm_v7s_is_mtk_enabled(cfg) ? 34 : ARM_V7S_ADDR_BITS))

return NULL;
@@ -788,7 +807,8 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct 
io_pgtable_cfg *cfg,
  
  	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |

IO_PGTABLE_QUIRK_NO_PERMS |
-   IO_PGTABLE_QUIRK_ARM_MTK_EXT))
+   IO_PGTABLE_QUIRK_ARM_MTK_EXT |
+   IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT))
return NULL;
  
  	/* If ARM_MTK_4GB is enabled, the NO_PERMS is also expected. */

@@ -796,15 +816,27 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct 
io_pgtable_cfg *cfg,
!(cfg->quirks & IO_PGTABLE_QUIRK_NO_PERMS))
return NULL;
  
+	if ((cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT) &&

+

[PATCH v9 1/3] iommu/io-pgtable-arm-v7s: Add a quirk to allow pgtable PA up to 35bit

2022-06-15 Thread yf.wang--- via iommu

From: Yunfei Wang 

Single memory zone feature will remove ZONE_DMA32 and ZONE_DMA and
cause pgtable PA size larger than 32bit.

Since Mediatek IOMMU hardware support at most 35bit PA in pgtable,
so add a quirk to allow the PA of pgtables support up to bit35.

Signed-off-by: Ning Li 
Signed-off-by: Yunfei Wang 
---
 drivers/iommu/io-pgtable-arm-v7s.c | 58 +++---
 include/linux/io-pgtable.h | 17 +
 2 files changed, 56 insertions(+), 19 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c 
b/drivers/iommu/io-pgtable-arm-v7s.c
index be066c1503d3..39e5503ac75a 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -182,14 +182,8 @@ static bool arm_v7s_is_mtk_enabled(struct io_pgtable_cfg 
*cfg)
(cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_EXT);
 }
 
-static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,
-   struct io_pgtable_cfg *cfg)
+static arm_v7s_iopte to_mtk_iopte(phys_addr_t paddr, arm_v7s_iopte pte)
 {
-   arm_v7s_iopte pte = paddr & ARM_V7S_LVL_MASK(lvl);
-
-   if (!arm_v7s_is_mtk_enabled(cfg))
-   return pte;
-
if (paddr & BIT_ULL(32))
pte |= ARM_V7S_ATTR_MTK_PA_BIT32;
if (paddr & BIT_ULL(33))
@@ -199,6 +193,17 @@ static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int 
lvl,
return pte;
 }
 
+static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,
+   struct io_pgtable_cfg *cfg)
+{
+   arm_v7s_iopte pte = paddr & ARM_V7S_LVL_MASK(lvl);
+
+   if (arm_v7s_is_mtk_enabled(cfg))
+   return to_mtk_iopte(paddr, pte);
+
+   return pte;
+}
+
 static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
  struct io_pgtable_cfg *cfg)
 {
@@ -240,10 +245,17 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
dma_addr_t dma;
size_t size = ARM_V7S_TABLE_SIZE(lvl, cfg);
void *table = NULL;
+   gfp_t gfp_l1;
+
+   /*
+* ARM_MTK_TTBR_EXT extend the translation table base support all
+* memory address.
+*/
+   gfp_l1 = cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT ?
+GFP_KERNEL : ARM_V7S_TABLE_GFP_DMA;
 
if (lvl == 1)
-   table = (void *)__get_free_pages(
-   __GFP_ZERO | ARM_V7S_TABLE_GFP_DMA, get_order(size));
+   table = (void *)__get_free_pages(gfp_l1 | __GFP_ZERO, 
get_order(size));
else if (lvl == 2)
table = kmem_cache_zalloc(data->l2_tables, gfp);
 
@@ -251,7 +263,8 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
return NULL;
 
phys = virt_to_phys(table);
-   if (phys != (arm_v7s_iopte)phys) {
+   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT ?
+   phys >= (1ULL << cfg->oas) : phys != (arm_v7s_iopte)phys) {
/* Doesn't fit in PTE */
dev_err(dev, "Page table does not fit in PTE: %pa", &phys);
goto out_free;
@@ -457,9 +470,14 @@ static arm_v7s_iopte arm_v7s_install_table(arm_v7s_iopte 
*table,
   arm_v7s_iopte curr,
   struct io_pgtable_cfg *cfg)
 {
+   phys_addr_t phys = virt_to_phys(table);
arm_v7s_iopte old, new;
 
-   new = virt_to_phys(table) | ARM_V7S_PTE_TYPE_TABLE;
+   new = phys | ARM_V7S_PTE_TYPE_TABLE;
+
+   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT)
+   new = to_mtk_iopte(phys, new);
+
if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
new |= ARM_V7S_ATTR_NS_TABLE;
 
@@ -779,6 +797,7 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct 
io_pgtable_cfg *cfg,
void *cookie)
 {
struct arm_v7s_io_pgtable *data;
+   slab_flags_t slab_flag;
 
if (cfg->ias > (arm_v7s_is_mtk_enabled(cfg) ? 34 : ARM_V7S_ADDR_BITS))
return NULL;
@@ -788,7 +807,8 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct 
io_pgtable_cfg *cfg,
 
if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
IO_PGTABLE_QUIRK_NO_PERMS |
-   IO_PGTABLE_QUIRK_ARM_MTK_EXT))
+   IO_PGTABLE_QUIRK_ARM_MTK_EXT |
+   IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT))
return NULL;
 
/* If ARM_MTK_4GB is enabled, the NO_PERMS is also expected. */
@@ -796,15 +816,27 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct 
io_pgtable_cfg *cfg,
!(cfg->quirks & IO_PGTABLE_QUIRK_NO_PERMS))
return NULL;
 
+   if ((cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT) &&
+   !arm_v7s_is_mtk_enabled(cfg))
+   return NULL;
+
data = kmalloc(sizeof(*data), GFP_KERNEL);
if (!data)
return NULL;

[PATCH v9 3/3] iommu/mediatek: Allow page table PA up to 35bit

2022-06-15 Thread yf.wang--- via iommu

From: Yunfei Wang 

Single memory zone feature will remove ZONE_DMA32 and ZONE_DMA. So add
the quirk IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT to let level 1 and level 2
pgtable support at most 35bit PA.

Signed-off-by: Ning Li 
Signed-off-by: Yunfei Wang 
---
 drivers/iommu/mtk_iommu.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 3d62399e8865..4dbc33758711 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -138,6 +138,7 @@
 /* PM and clock always on. e.g. infra iommu */
 #define PM_CLK_AO  BIT(15)
 #define IFA_IOMMU_PCIE_SUPPORT BIT(16)
+#define PGTABLE_PA_35_EN   BIT(17)
 
 #define MTK_IOMMU_HAS_FLAG_MASK(pdata, _x, mask)   \
pdata)->flags) & (mask)) == (_x))
@@ -240,6 +241,7 @@ struct mtk_iommu_data {
 struct mtk_iommu_domain {
struct io_pgtable_cfg   cfg;
struct io_pgtable_ops   *iop;
+   u32 ttbr;
 
struct mtk_iommu_bank_data  *bank;
struct iommu_domain domain;
@@ -596,6 +598,9 @@ static int mtk_iommu_domain_finalise(struct 
mtk_iommu_domain *dom,
.iommu_dev = data->dev,
};
 
+   if (MTK_IOMMU_HAS_FLAG(data->plat_data, PGTABLE_PA_35_EN))
+   dom->cfg.quirks |= IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT;
+
if (MTK_IOMMU_HAS_FLAG(data->plat_data, HAS_4GB_MODE))
dom->cfg.oas = data->enable_4GB ? 33 : 32;
else
@@ -684,8 +689,8 @@ static int mtk_iommu_attach_device(struct iommu_domain 
*domain,
goto err_unlock;
}
bank->m4u_dom = dom;
-   writel(dom->cfg.arm_v7s_cfg.ttbr & MMU_PT_ADDR_MASK,
-  bank->base + REG_MMU_PT_BASE_ADDR);
+   bank->m4u_dom->ttbr = MTK_IOMMU_ADDR(dom->cfg.arm_v7s_cfg.ttbr);
+   writel(bank->m4u_dom->ttbr, data->base + REG_MMU_PT_BASE_ADDR);
 
pm_runtime_put(m4udev);
}
@@ -1366,8 +1371,7 @@ static int __maybe_unused mtk_iommu_runtime_resume(struct 
device *dev)
writel_relaxed(reg->int_control[i], base + 
REG_MMU_INT_CONTROL0);
writel_relaxed(reg->int_main_control[i], base + 
REG_MMU_INT_MAIN_CONTROL);
writel_relaxed(reg->ivrp_paddr[i], base + REG_MMU_IVRP_PADDR);
-   writel(m4u_dom->cfg.arm_v7s_cfg.ttbr & MMU_PT_ADDR_MASK,
-  base + REG_MMU_PT_BASE_ADDR);
+   writel(m4u_dom->ttbr, base + REG_MMU_PT_BASE_ADDR);
} while (++i < data->plat_data->banks_num);
 
/*
@@ -1401,7 +1405,7 @@ static const struct mtk_iommu_plat_data mt2712_data = {
 static const struct mtk_iommu_plat_data mt6779_data = {
.m4u_plat  = M4U_MT6779,
.flags = HAS_SUB_COMM_2BITS | OUT_ORDER_WR_EN | WR_THROT_EN |
-MTK_IOMMU_TYPE_MM,
+MTK_IOMMU_TYPE_MM | PGTABLE_PA_35_EN,
.inv_sel_reg   = REG_MMU_INV_SEL_GEN2,
.banks_num= 1,
.banks_enable = {true},
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v9 2/3] iommu/mediatek: Rename MTK_IOMMU_TLB_ADDR to MTK_IOMMU_ADDR

2022-06-15 Thread yf.wang--- via iommu

From: Yunfei Wang 

Rename MTK_IOMMU_TLB_ADDR to MTK_IOMMU_ADDR, and update MTK_IOMMU_ADDR
definition for better generality.

Signed-off-by: Ning Li 
Signed-off-by: Yunfei Wang 
---
 drivers/iommu/mtk_iommu.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index bb9dd92c9898..3d62399e8865 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -265,8 +265,8 @@ static const struct iommu_ops mtk_iommu_ops;
 
 static int mtk_iommu_hw_init(const struct mtk_iommu_data *data, unsigned int 
bankid);
 
-#define MTK_IOMMU_TLB_ADDR(iova) ({\
-   dma_addr_t _addr = iova;\
+#define MTK_IOMMU_ADDR(addr) ({
\
+   unsigned long long _addr = addr;\
((lower_32_bits(_addr) & GENMASK(31, 12)) | upper_32_bits(_addr));\
 })
 
@@ -381,8 +381,8 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long 
iova, size_t size,
writel_relaxed(F_INVLD_EN1 | F_INVLD_EN0,
   base + data->plat_data->inv_sel_reg);
 
-   writel_relaxed(MTK_IOMMU_TLB_ADDR(iova), base + 
REG_MMU_INVLD_START_A);
-   writel_relaxed(MTK_IOMMU_TLB_ADDR(iova + size - 1),
+   writel_relaxed(MTK_IOMMU_ADDR(iova), base + 
REG_MMU_INVLD_START_A);
+   writel_relaxed(MTK_IOMMU_ADDR(iova + size - 1),
   base + REG_MMU_INVLD_END_A);
writel_relaxed(F_MMU_INV_RANGE, base + REG_MMU_INVALIDATE);
 
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 05/21] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers

2022-06-15 Thread Logan Gunthorpe

Add EREMOTEIO error return to dma_map_sgtable() which will be used
by .map_sg() implementations that detect P2PDMA pages that the
underlying DMA device cannot access.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 kernel/dma/mapping.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..9f65d1041638 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -197,7 +197,7 @@ static int __dma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
if (ents > 0)
debug_dma_map_sg(dev, sg, nents, ents, dir, attrs);
else if (WARN_ON_ONCE(ents != -EINVAL && ents != -ENOMEM &&
- ents != -EIO))
+ ents != -EIO && ents != -EREMOTEIO))
return -EIO;
 
return ents;
@@ -255,6 +255,8 @@ EXPORT_SYMBOL(dma_map_sg_attrs);
  * complete the mapping. Should succeed if retried later.
  *   -EIO  Legacy error code with an unknown meaning. eg. this is
  * returned if a lower level call returned DMA_MAPPING_ERROR.
+ *   -EREMOTEIOThe DMA device cannot access P2PDMA memory specified in
+ * the sg_table. This will not succeed if retried.
  */
 int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
enum dma_data_direction dir, unsigned long attrs)
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 15/21] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

2022-06-15 Thread Logan Gunthorpe

Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.

Signed-off-by: Logan Gunthorpe 
---
 include/linux/uio.h |  6 ++
 lib/iov_iter.c  | 25 +++--
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 739285fe5a2f..ddf9e4cf4a59 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -232,8 +232,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int 
direction, struct pipe_inode
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t 
count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray 
*xarray,
 loff_t start, size_t count);
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+   size_t maxsize, unsigned maxpages, size_t *start,
+   unsigned int gup_flags);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+   struct page ***pages, size_t maxsize, size_t *start,
+   unsigned int gup_flags);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330f7a99..9bf6e3af5120 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1515,9 +1515,9 @@ static struct page *first_bvec_segment(const struct 
iov_iter *i,
return page;
 }
 
-ssize_t iov_iter_get_pages(struct iov_iter *i,
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i,
   struct page **pages, size_t maxsize, unsigned maxpages,
-  size_t *start)
+  size_t *start, unsigned int gup_flags)
 {
size_t len;
int n, res;
@@ -1528,7 +1528,6 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return 0;
 
if (likely(iter_is_iovec(i))) {
-   unsigned int gup_flags = 0;
unsigned long addr;
 
if (iov_iter_rw(i) != WRITE)
@@ -1558,6 +1557,13 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return iter_xarray_get_pages(i, pages, maxsize, maxpages, 
start);
return -EFAULT;
 }
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_flags);
+
+ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
+  size_t maxsize, unsigned maxpages, size_t *start)
+{
+   return iov_iter_get_pages_flags(i, pages, maxsize, maxpages, start, 0);
+}
 EXPORT_SYMBOL(iov_iter_get_pages);
 
 static struct page **get_pages_array(size_t n)
@@ -1640,9 +1646,9 @@ static ssize_t iter_xarray_get_pages_alloc(struct 
iov_iter *i,
return actual;
 }
 
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
   struct page ***pages, size_t maxsize,
-  size_t *start)
+  size_t *start, unsigned int gup_flags)
 {
struct page **p;
size_t len;
@@ -1654,7 +1660,6 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return 0;
 
if (likely(iter_is_iovec(i))) {
-   unsigned int gup_flags = 0;
unsigned long addr;
 
if (iov_iter_rw(i) != WRITE)
@@ -1667,6 +1672,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
p = get_pages_array(n);
if (!p)
return -ENOMEM;
+
res = get_user_pages_fast(addr, n, gup_flags, p);
if (unlikely(res <= 0)) {
kvfree(p);
@@ -1694,6 +1700,13 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
return -EFAULT;
 }
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc_flags);
+
+ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
+size_t maxsize, size_t *start)
+{
+   return iov_iter_get_pages_alloc_flags(i, pages, maxsize, start, 0);
+}
 EXPORT_SYMBOL(iov_iter_get_pages_alloc);
 
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 03/21] PCI/P2PDMA: Expose pci_p2pdma_map_type()

2022-06-15 Thread Logan Gunthorpe

pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
implementation because it will need to determine the mapping type
ahead of actually doing the mapping to create the actual IOMMU mapping.

Prototypes for this helper are added to dma-map-ops.h as they are only
useful to dma map implementations and don't need to pollute the public
pci-p2pdma header

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/p2pdma.c| 25 +
 include/linux/dma-map-ops.h | 45 +
 2 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4e8bc457e29a..10b1d5c2b5de 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -10,6 +10,7 @@
 
 #define pr_fmt(fmt) "pci-p2pdma: " fmt
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -20,13 +21,6 @@
 #include 
 #include 
 
-enum pci_p2pdma_map_type {
-   PCI_P2PDMA_MAP_UNKNOWN = 0,
-   PCI_P2PDMA_MAP_NOT_SUPPORTED,
-   PCI_P2PDMA_MAP_BUS_ADDR,
-   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
@@ -847,8 +841,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-   struct device *dev)
+/**
+ * pci_p2pdma_map_type - return the type of mapping that should be used for
+ * a given device and pgmap
+ * @pgmap: the pagemap of a page to determine the mapping type for
+ * @dev: device that is mapping the page
+ *
+ * Returns one of:
+ * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
+ * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
+ * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done normally
+ * using the CPU physical address (in dma-direct) or an IOVA
+ * mapping for the IOMMU.
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev)
 {
enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..d693a0e33bac 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -379,4 +379,49 @@ static inline void debug_dma_dump_mappings(struct device 
*dev)
 
 extern const struct dma_map_ops dma_dummy_ops;
 
+enum pci_p2pdma_map_type {
+   /*
+* PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
+* type hasn't been calculated yet. Functions that return this enum
+* never return this value.
+*/
+   PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+   /*
+* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+* traverse the host bridge and the host bridge is not in the
+* allowlist. DMA Mapping routines should return an error when
+* this is returned.
+*/
+   PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+   /*
+* PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
+* each other directly through a PCI switch and the transaction will
+* not traverse the host bridge. Such a mapping should program
+* the DMA engine with PCI bus addresses.
+*/
+   PCI_P2PDMA_MAP_BUS_ADDR,
+
+   /*
+* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+* to each other, but the transaction traverses a host bridge on the
+* allowlist. In this case, a normal mapping either with CPU physical
+* addresses (in the case of dma-direct) or IOVA addresses (in the
+* case of IOMMUs) should be used to program the DMA engine.
+*/
+   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
+#ifdef CONFIG_PCI_P2PDMA
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev);
+#else /* CONFIG_PCI_P2PDMA */
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
+{
+   return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #endif /* _LINUX_DMA_MAP_OPS_H */
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 09/21] nvme-pci: check DMA ops when indicating support for PCI P2PDMA

2022-06-15 Thread Logan Gunthorpe

Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 11 +--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 24165daee3c8..d6e76f2dc293 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3981,7 +3981,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
unsigned nsid,
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+   if (ctrl->ops->supports_pci_p2pdma &&
+   ctrl->ops->supports_pci_p2pdma(ctrl))
blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9b72b6ecf33c..957f79420cf3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -495,7 +495,6 @@ struct nvme_ctrl_ops {
unsigned int flags;
 #define NVME_F_FABRICS (1 << 0)
 #define NVME_F_METADATA_SUPPORTED  (1 << 1)
-#define NVME_F_PCI_P2PDMA  (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -503,6 +502,7 @@ struct nvme_ctrl_ops {
void (*submit_async_event)(struct nvme_ctrl *ctrl);
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+   bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 48f4f6eb877b..e5e032ab1c71 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2976,17 +2976,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, 
char *buf, int size)
return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+   struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+   return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
-   .flags  = NVME_F_METADATA_SUPPORTED |
- NVME_F_PCI_P2PDMA,
+   .flags  = NVME_F_METADATA_SUPPORTED,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32= nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
.free_ctrl  = nvme_pci_free_ctrl,
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_address,
+   .supports_pci_p2pdma= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 10/21] nvme-pci: convert to using dma_map_sgtable()

2022-06-15 Thread Logan Gunthorpe

The dma_map operations now support P2PDMA pages directly. So remove
the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
to dma_map_sgtable().

dma_map_sgtable() returns more complete error codes than dma_map_sg()
and allows differentiating EREMOTEIO errors in case an unsupported
P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
so the request isn't retried.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/nvme/host/pci.c | 69 +
 1 file changed, 29 insertions(+), 40 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e5e032ab1c71..52b52a7efa9a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -230,11 +230,10 @@ struct nvme_iod {
bool use_sgl;
int aborted;
int npages; /* In the PRP list. 0 means small pool in use */
-   int nents;  /* Used in scatterlist */
dma_addr_t first_dma;
unsigned int dma_len;   /* length of single DMA segment mapping */
dma_addr_t meta_dma;
-   struct scatterlist *sg;
+   struct sg_table sgt;
 };
 
 static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -524,7 +523,7 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 static void **nvme_pci_iod_list(struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-   return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
+   return (void **)(iod->sgt.sgl + blk_rq_nr_phys_segments(req));
 }
 
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
@@ -576,17 +575,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct 
request *req)
}
 }
 
-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
-   struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-   if (is_pci_p2pdma_page(sg_page(iod->sg)))
-   pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
-   rq_dma_dir(req));
-   else
-   dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -597,9 +585,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
return;
}
 
-   WARN_ON_ONCE(!iod->nents);
+   WARN_ON_ONCE(!iod->sgt.nents);
+
+   dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
-   nvme_unmap_sg(dev, req);
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
  iod->first_dma);
@@ -607,7 +596,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
nvme_free_sgls(dev, req);
else
nvme_free_prps(dev, req);
-   mempool_free(iod->sg, dev->iod_mempool);
+   mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -630,7 +619,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;
int dma_len = sg_dma_len(sg);
u64 dma_addr = sg_dma_address(sg);
int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
@@ -703,16 +692,16 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
dma_len = sg_dma_len(sg);
}
 done:
-   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
 free_prps:
nvme_free_prps(dev, req);
return BLK_STS_RESOURCE;
 bad_sgl:
-   WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+   WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
"Invalid SGL for payload:%d nents:%d\n",
-   blk_rq_payload_bytes(req), iod->nents);
+   blk_rq_payload_bytes(req), iod->sgt.nents);
return BLK_STS_IOERR;
 }
 
@@ -738,12 +727,13 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc 
*sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-   struct request *req, struct nvme_rw_command *cmd, int entries)
+   struct request *req, struct nvme_rw_command *cmd)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;
+   unsigned int entries = iod->sgt.nents;
dma_addr_t sgl_dma;
int i = 0;

[PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

2022-06-15 Thread Logan Gunthorpe

When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through
 the root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally with an IOMMU IOVA.
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is set to
DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 68 +++
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..b01ca0c6a7ab 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev, struct 
scatterlist *sg, int nents,
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;
 
+   if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+   if (i > 0)
+   cur = sg_next(cur);
+
+   pci_p2pdma_map_bus_segment(s, cur);
+   count++;
+   cur_len = 0;
+   continue;
+   }
+
/*
 * Now fill in the real DMA data. If...
 * - there is a valid output segment to append to
@@ -1158,6 +1169,8 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+   struct dev_pagemap *pgmap = NULL;
+   enum pci_p2pdma_map_type map_type;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
@@ -1193,6 +1206,35 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
s_length = iova_align(iovad, s_length + s_iova_off);
s->length = s_length;
 
+   if (is_pci_p2pdma_page(sg_page(s))) {
+   if (sg_page(s)->pgmap != pgmap) {
+   pgmap = sg_page(s)->pgmap;
+   map_type = pci_p2pdma_map_type(pgmap, dev);
+   }
+
+   switch (map_type) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   /*
+* A zero length will be ignored by
+* iommu_map_sg() and then can be detected
+* in __finalise_sg() to actually map the
+* bus address.
+*/
+   s->length = 0;
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Mapping through host bridge should be
+* mapped with regular IOVAs, thus we
+* do nothing here and continue below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_restore_sg;
+   }
+   }
+
/*
 * Due to the alignment of our single IOVA allocation, we can
 * depend on these assumptions about the segment boundary mask:
@@ -1215,6 +1257,9 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
prev = s;
}
 
+   if (!iova_len)
+   return __finalise_sg(dev, sg, nents, 0);
+
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova) {
ret = -ENOMEM;
@@ -1236,7 +1281,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 out_restore_sg:
__invalidate_sg(sg, nents);
 out:
-   if (ret != -ENOMEM)
+   if (ret != -ENOMEM && ret != -EREMOTEIO)
return -EINVAL;
re

[PATCH v7 17/21] lib/scatterlist: add check when merging zone device pages

2022-06-15 Thread Logan Gunthorpe

Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. This helper returns true either if
both pages are not zone device pages or both pages are zone device
pages with the same pgmap.

Factor out the check for page mergability into a pages_are_mergable()
helper and add a check with zone_device_pages_are_mergeable().

Signed-off-by: Logan Gunthorpe 
---
 lib/scatterlist.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index d5e82e4a57ad..af53a0984f76 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -410,6 +410,15 @@ static struct scatterlist *get_next_sg(struct 
sg_append_table *table,
return new_sg;
 }
 
+static bool pages_are_mergeable(struct page *a, struct page *b)
+{
+   if (page_to_pfn(a) != page_to_pfn(b) + 1)
+   return false;
+   if (!zone_device_pages_have_same_pgmap(a, b))
+   return false;
+   return true;
+}
+
 /**
  * sg_alloc_append_table_from_pages - Allocate and initialize an append sg
  *table from an array of pages
@@ -447,6 +456,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
unsigned int chunks, cur_page, seg_len, i, prv_len = 0;
unsigned int added_nents = 0;
struct scatterlist *s = sgt_append->prv;
+   struct page *last_pg;
 
/*
 * The algorithm below requires max_segment to be aligned to PAGE_SIZE
@@ -460,21 +470,17 @@ int sg_alloc_append_table_from_pages(struct 
sg_append_table *sgt_append,
return -EOPNOTSUPP;
 
if (sgt_append->prv) {
-   unsigned long paddr =
-   (page_to_pfn(sg_page(sgt_append->prv)) * PAGE_SIZE +
-sgt_append->prv->offset + sgt_append->prv->length) /
-   PAGE_SIZE;
-
if (WARN_ON(offset))
return -EINVAL;
 
/* Merge contiguous pages into the last SG */
prv_len = sgt_append->prv->length;
-   while (n_pages && page_to_pfn(pages[0]) == paddr) {
+   last_pg = sg_page(sgt_append->prv);
+   while (n_pages && pages_are_mergeable(last_pg, pages[0])) {
if (sgt_append->prv->length + PAGE_SIZE > max_segment)
break;
sgt_append->prv->length += PAGE_SIZE;
-   paddr++;
+   last_pg = pages[0];
pages++;
n_pages--;
}
@@ -488,7 +494,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
for (i = 1; i < n_pages; i++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
-   page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) {
+   !pages_are_mergeable(pages[i], pages[i - 1])) {
chunks++;
seg_len = 0;
}
@@ -504,8 +510,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
for (j = cur_page + 1; j < n_pages; j++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
-   page_to_pfn(pages[j]) !=
-   page_to_pfn(pages[j - 1]) + 1)
+   !pages_are_mergeable(pages[j], pages[j - 1]))
break;
}
 
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 06/21] dma-direct: support PCI P2PDMA pages in dma-direct map_sg

2022-06-15 Thread Logan Gunthorpe

Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through the
 root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally, as though it were a CPU physical address
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

SGL segments that contain PCI bus addresses are marked with
sg_dma_mark_pci_p2pdma() and are ignored when unmapped.

P2PDMA mappings are also failed if swiotlb needs to be used on the
mapping.

Signed-off-by: Logan Gunthorpe 
---
 kernel/dma/direct.c | 43 +--
 kernel/dma/direct.h |  8 +++-
 2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index e978f36e6be8..133a4be2d3e5 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -454,29 +454,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_sync_dma_for_cpu_all();
 }
 
+/*
+ * Unmaps segments, except for ones marked as pci_p2pdma which do not
+ * require any further action as they contain a bus address.
+ */
 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir, unsigned long attrs)
 {
struct scatterlist *sg;
int i;
 
-   for_each_sg(sgl, sg, nents, i)
-   dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
-attrs);
+   for_each_sg(sgl,  sg, nents, i) {
+   if (sg_is_dma_bus_address(sg))
+   sg_dma_unmark_bus_address(sg);
+   else
+   dma_direct_unmap_page(dev, sg->dma_address,
+ sg_dma_len(sg), dir, attrs);
+   }
 }
 #endif
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
 {
-   int i;
+   struct pci_p2pdma_map_state p2pdma_state = {};
+   enum pci_p2pdma_map_type map;
struct scatterlist *sg;
+   int i, ret;
 
for_each_sg(sgl, sg, nents, i) {
+   if (is_pci_p2pdma_page(sg_page(sg))) {
+   map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
+   switch (map) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Any P2P mapping that traverses the PCI
+* host bridge must be mapped with CPU physical
+* address and not PCI bus addresses. This is
+* done with dma_direct_map_page() below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_unmap;
+   }
+   }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
-   if (sg->dma_address == DMA_MAPPING_ERROR)
+   if (sg->dma_address == DMA_MAPPING_ERROR) {
+   ret = -EIO;
goto out_unmap;
+   }
sg_dma_len(sg) = sg->length;
}
 
@@ -484,7 +515,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
 
 out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-   return -EIO;
+   return ret;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index a78c0ba70645..e38ffc5e6bdd 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -8,6 +8,7 @@
 #define _KERNEL_DMA_DIRECT_H
 
 #include 
+#include 
 
 int dma_direct_get_sgtable(struct device *dev, struct sg_table *sgt,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
@@ -87,10 +88,15 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
phys_addr_t phys = page_to_phys(page) + offset;
dma_addr_t dma_addr = phys_to_dma(dev, phys);
 
-   if (is_swiotlb_force_bounce(dev))
+   if (is_swiotlb_force_bounce(dev)) {
+   if (is_pci_p2pdma_page(page))
+   return DMA_MAPPING_ERROR;
return swiotlb_map(dev, phys, size, dir, attrs);
+   }
 
if (unlikely(!dma_capab

[PATCH v7 16/21] block: add check when merging zone device pages

2022-06-15 Thread Logan Gunthorpe

Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. This helper returns true either if
both pages are not zone device pages or both pages are zone device
pages with the same pgmap.

Add a helper to determine if zone device pages are mergeable and use
this helper in page_is_mergeable().

Signed-off-by: Logan Gunthorpe 
---
 block/bio.c|  2 ++
 include/linux/mm.h | 23 +++
 2 files changed, 25 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index f92d0223247b..a402a4760457 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -865,6 +865,8 @@ static inline bool page_is_mergeable(const struct bio_vec 
*bv,
return false;
if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
return false;
+   if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+   return false;
 
*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
if (*same_page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0bcb54ea503c..33b2f4d9fd0a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1108,6 +1108,24 @@ static inline bool is_zone_device_page(const struct page 
*page)
 {
return page_zonenum(page) == ZONE_DEVICE;
 }
+
+/*
+ * Consecutive zone device pages should not be merged into the same sgl
+ * or bvec segment with other types of pages or if they belong to different
+ * pgmaps. Otherwise getting the pgmap of a given segment is not possible
+ * without scanning the entire segment. This helper returns true either if
+ * both pages are not zone device pages or both pages are zone device pages
+ * with the same pgmap.
+ */
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+const struct page *b)
+{
+   if (is_zone_device_page(a) != is_zone_device_page(b))
+   return false;
+   if (!is_zone_device_page(a))
+   return true;
+   return a->pgmap == b->pgmap;
+}
 extern void memmap_init_zone_device(struct zone *, unsigned long,
unsigned long, struct dev_pagemap *);
 #else
@@ -1115,6 +1133,11 @@ static inline bool is_zone_device_page(const struct page 
*page)
 {
return false;
 }
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+const struct page *b)
+{
+   return true;
+}
 #endif
 
 static inline bool folio_is_zone_device(const struct folio *folio)
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2022-06-15 Thread Logan Gunthorpe

Make use of the third free LSB in scatterlist's page_link on 64bit systems.

The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
given SGL segments dma_address points to a PCI bus address.
dma_unmap_sg_p2pdma() will need to perform different cleanup when a
segment is marked as a bus address.

The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
majority of P2PDMA use cases are restricted to newer root complexes and
roughly require the extra address space for memory BARs used in the
transactions.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/Kconfig |  5 +
 include/linux/scatterlist.h | 44 -
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 133c73207782..5cc7cba1941f 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -164,6 +164,11 @@ config PCI_PASID
 config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
depends on ZONE_DEVICE
+   #
+   # The need for the scatterlist DMA bus address flag means PCI P2PDMA
+   # requires 64bit
+   #
+   depends on 64BIT
select GENERIC_ALLOCATOR
help
  Enableѕ drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 7ff9d6386c12..6561ca8aead8 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -64,12 +64,24 @@ struct sg_append_table {
 #define SG_CHAIN   0x01UL
 #define SG_END 0x02UL
 
+/*
+ * bit 2 is the third free bit in the page_link on 64bit systems which
+ * is used by dma_unmap_sg() to determine if the dma_address is a
+ * bus address when doing P2PDMA.
+ */
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_DMA_BUS_ADDRESS 0x04UL
+static_assert(__alignof__(struct page) >= 8);
+#else
+#define SG_DMA_BUS_ADDRESS 0x00UL
+#endif
+
 /*
  * We overload the LSB of the page pointer to indicate whether it's
  * a valid sg entry, or whether it points to the start of a new scatterlist.
  * Those low bits are there for everyone! (thanks mason :-)
  */
-#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END)
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_BUS_ADDRESS)
 
 static inline unsigned int __sg_flags(struct scatterlist *sg)
 {
@@ -91,6 +103,11 @@ static inline bool sg_is_last(struct scatterlist *sg)
return __sg_flags(sg) & SG_END;
 }
 
+static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
+{
+   return __sg_flags(sg) & SG_DMA_BUS_ADDRESS;
+}
+
 /**
  * sg_assign_page - Assign a given page to an SG entry
  * @sg:SG entry
@@ -245,6 +262,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
sg->page_link &= ~SG_END;
 }
 
+/**
+ * sg_dma_mark_bus address - Mark the scatterlist entry as a bus address
+ * @sg: SG entryScatterlist
+ *
+ * Description:
+ *   Marks the passed in sg entry to indicate that the dma_address is
+ *   a bus address and doesn't need to be unmapped.
+ **/
+static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
+{
+   sg->page_link |= SG_DMA_BUS_ADDRESS;
+}
+
+/**
+ * sg_unmark_pci_p2pdma - Unmark the scatterlist entry as a bus address
+ * @sg: SG entryScatterlist
+ *
+ * Description:
+ *   Clears the bus address mark.
+ **/
+static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
+{
+   sg->page_link &= ~SG_DMA_BUS_ADDRESS;
+}
+
 /**
  * sg_phys - Return physical address of an sg entry
  * @sg: SG entry
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 04/21] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations

2022-06-15 Thread Logan Gunthorpe

Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
implementations. It takes an scatterlist segment that must point to a
pci_p2pdma struct page and will map it if the mapping requires a bus
address.

The return value indicates whether the mapping required a bus address
or whether the caller still needs to map the segment normally. If the
segment should not be mapped, -EREMOTEIO is returned.

This helper uses a state structure to track the changes to the
pgmap across calls and avoid needing to lookup into the xarray for
every page.

Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
dma_map_sg() implementations where the sg segment containing the page
differs from the sg segment containing the DMA address.

Prototypes for these helpers are added to dma-map-ops.h as they are only
useful to dma map implementations and don't need to pollute the public
pci-p2pdma header.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 drivers/pci/p2pdma.c| 59 +
 include/linux/dma-map-ops.h | 21 +
 2 files changed, 80 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 10b1d5c2b5de..2fc0f4750a2e 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -951,6 +951,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
 
+/**
+ * pci_p2pdma_map_segment - map an sg segment determining the mapping type
+ * @state: State structure that should be declared outside of the for_each_sg()
+ * loop and initialized to zero.
+ * @dev: DMA device that's doing the mapping operation
+ * @sg: scatterlist segment to map
+ *
+ * This is a helper to be used by non-IOMMU dma_map_sg() implementations where
+ * the sg segment is the same for the page_link and the dma_address.
+ *
+ * Attempt to map a single segment in an SGL with the PCI bus address.
+ * The segment must point to a PCI P2PDMA page and thus must be
+ * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
+ *
+ * Returns the type of mapping used and maps the page if the type is
+ * PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg)
+{
+   if (state->pgmap != sg_page(sg)->pgmap) {
+   state->pgmap = sg_page(sg)->pgmap;
+   state->map = pci_p2pdma_map_type(state->pgmap, dev);
+   state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+   }
+
+   if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
+   sg->dma_address = sg_phys(sg) + state->bus_off;
+   sg_dma_len(sg) = sg->length;
+   sg_dma_mark_bus_address(sg);
+   }
+
+   return state->map;
+}
+
+/**
+ * pci_p2pdma_map_bus_segment - map an sg segment pre determined to
+ * be mapped with PCI_P2PDMA_MAP_BUS_ADDR
+ * @pg_sg: scatterlist segment with the page to map
+ * @dma_sg: scatterlist segment to assign a DMA address to
+ *
+ * This is a helper for iommu dma_map_sg() implementations when the
+ * segment for the DMA address differs from the segment containing the
+ * source page.
+ *
+ * pci_p2pdma_map_type() must have already been called on the pg_sg and
+ * returned PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg)
+{
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
+
+   dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
+   sg_dma_len(dma_sg) = pg_sg->length;
+   sg_dma_mark_bus_address(dma_sg);
+}
+
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  * to enable p2pdma
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index d693a0e33bac..752f91e5eb5d 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -413,15 +413,36 @@ enum pci_p2pdma_map_type {
PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
 };
 
+struct pci_p2pdma_map_state {
+   struct dev_pagemap *pgmap;
+   int map;
+   u64 bus_off;
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 struct device *dev);
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg);
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg);
 #else /* CONFIG_PCI_P2PDMA */
 static inline enum pci_p2pdma_map_type
 pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
 {
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *

[PATCH v7 19/21] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()

2022-06-15 Thread Logan Gunthorpe

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables the NVMe passthru requests to
use P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
---
 block/blk-map.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index df8b066cd548..1d6bcf193a42 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -236,6 +236,7 @@ static int bio_map_user_iov(struct request *rq, struct 
iov_iter *iter,
 {
unsigned int max_sectors = queue_max_hw_sectors(rq->q);
unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
+   unsigned int flags = 0;
struct bio *bio;
int ret;
int j;
@@ -248,13 +249,17 @@ static int bio_map_user_iov(struct request *rq, struct 
iov_iter *iter,
return -ENOMEM;
bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, req_op(rq));
 
+   if (blk_queue_pci_p2pdma(rq->q))
+   flags |= FOLL_PCI_P2PDMA;
+
while (iov_iter_count(iter)) {
struct page **pages;
ssize_t bytes;
size_t offs, added = 0;
int npages;
 
-   bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+   bytes = iov_iter_get_pages_alloc_flags(iter, &pages, LONG_MAX,
+  &offs, flags);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 14/21] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

2022-06-15 Thread Logan Gunthorpe

GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If GUP is called without the flag and a
P2PDMA page is found, it will return an error.

FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.

Signed-off-by: Logan Gunthorpe 
---
 include/linux/mm.h |  1 +
 mm/gup.c   | 22 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..0bcb54ea503c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2941,6 +2941,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
 #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */
 #define FOLL_PIN   0x4 /* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY 0x8 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA0x10 /* allow returning PCI P2PDMA pages */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 551264407624..f15f01d06a09 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -564,6 +564,12 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
goto out;
}
 
+   if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+is_pci_p2pdma_page(page))) {
+   page = ERR_PTR(-EREMOTEIO);
+   goto out;
+   }
+
VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
   !PageAnonExclusive(page), page);
 
@@ -994,6 +1000,9 @@ static int check_vma_flags(struct vm_area_struct *vma, 
unsigned long gup_flags)
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
return -EOPNOTSUPP;
 
+   if ((gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_PCI_P2PDMA))
+   return -EOPNOTSUPP;
+
if (vma_is_secretmem(vma))
return -EFAULT;
 
@@ -2289,6 +2298,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
 
+   if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+is_pci_p2pdma_page(page)))
+   goto pte_unmap;
+
folio = try_grab_folio(page, 1, flags);
if (!folio)
goto pte_unmap;
@@ -2368,6 +2381,12 @@ static int __gup_device_huge(unsigned long pfn, unsigned 
long addr,
undo_dev_pagemap(nr, nr_start, flags, pages);
break;
}
+
+   if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+   undo_dev_pagemap(nr, nr_start, flags, pages);
+   break;
+   }
+
SetPageReferenced(page);
pages[*nr] = page;
if (unlikely(!try_grab_page(page, flags))) {
@@ -2856,7 +2875,8 @@ static int internal_get_user_pages_fast(unsigned long 
start,
 
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
   FOLL_FORCE | FOLL_PIN | FOLL_GET |
-  FOLL_FAST_ONLY | FOLL_NOFAULT)))
+  FOLL_FAST_ONLY | FOLL_NOFAULT |
+  FOLL_PCI_P2PDMA)))
return -EINVAL;
 
if (gup_flags & FOLL_PIN)
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 13/21] PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

2022-06-15 Thread Logan Gunthorpe

This interface is superseded by support in dma_map_sg() which now supports
heterogeneous scatterlists. There are no longer any users, so remove it.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Max Gurtovoy 
---
 drivers/pci/p2pdma.c   | 66 --
 include/linux/pci-p2pdma.h | 27 
 2 files changed, 93 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 2fc0f4750a2e..d4e635012ffe 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -885,72 +885,6 @@ enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
return type;
 }
 
-static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
-   struct device *dev, struct scatterlist *sg, int nents)
-{
-   struct scatterlist *s;
-   int i;
-
-   for_each_sg(sg, s, nents, i) {
-   s->dma_address = sg_phys(s) + p2p_pgmap->bus_offset;
-   sg_dma_len(s) = s->length;
-   }
-
-   return nents;
-}
-
-/**
- * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: elements in the scatterlist
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_map_sg() (if called)
- *
- * Scatterlists mapped with this function should be unmapped using
- * pci_p2pdma_unmap_sg_attrs().
- *
- * Returns the number of SG entries mapped or 0 on error.
- */
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   struct pci_p2pdma_pagemap *p2p_pgmap =
-   to_p2p_pgmap(sg_page(sg)->pgmap);
-
-   switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
-   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-   return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
-   case PCI_P2PDMA_MAP_BUS_ADDR:
-   return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
-   default:
-   /* Mapping is not Supported */
-   return 0;
-   }
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
-
-/**
- * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was
- * mapped with pci_p2pdma_map_sg()
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: number of elements returned by pci_p2pdma_map_sg()
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_unmap_sg() (if called)
- */
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   enum pci_p2pdma_map_type map_type;
-
-   map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
-
-   if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
-   dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
-
 /**
  * pci_p2pdma_map_segment - map an sg segment determining the mapping type
  * @state: State structure that should be declared outside of the for_each_sg()
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..2c07aa6b7665 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -30,10 +30,6 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev 
*pdev,
 unsigned int *nents, u32 length);
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -83,17 +79,6 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
-static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-   return 0;
-}
-static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-}
 static inline int pci_p2pdma_enable_store(const char *page,
struct pci_dev **p2p_dev, bool *use_p2pdma)
 {
@@ -119,16 +104,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct 
device *client)
return pci_p2pmem_find_many(&client, 1);
 }
 
-static inline int pci_p2pdma_map_sg(struct device

[PATCH v7 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-06-15 Thread Logan Gunthorpe

Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
a hunk of p2pmem into userspace.

Pages are allocated from the genalloc in bulk with their reference
count set to one. They are returned to the genalloc when the page is put
through p2pdma_page_free() (the reference count is once again set to 1
in free_zone_device_page()).

The VMA does not take a reference to the pages when they are inserted
with vmf_insert_mixed() (which is necessary for zone device pages) so
the backing P2P memory is stored in a structures in vm_private_data.

A pseudo mount is used to allocate an inode for each PCI device. The
inode's address_space is used in the file doing the mmap so that all
VMAs are collected and can be unmapped if the PCI device is unbound.
After unmapping, the VMAs are iterated through and their pages are
put so the device can continue to be unbound. An active flag is used
to signal to VMAs not to allocate any further P2P memory once the
removal process starts. The flag is synchronized with concurrent
access with an RCU lock.

The VMAs and inode will survive after the unbind of the device, but no
pages will be present in the VMA and a subsequent access will result
in a SIGBUS error.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 drivers/pci/p2pdma.c   | 210 -
 include/linux/pci-p2pdma.h |  16 +++
 include/uapi/linux/magic.h |   1 +
 3 files changed, 225 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index d4e635012ffe..a6572069008b 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -17,14 +17,19 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
 
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
struct xarray map_types;
+   struct inode *inode;
+   bool active;
 };
 
 struct pci_p2pdma_pagemap {
@@ -101,6 +106,41 @@ static const struct attribute_group p2pmem_group = {
.name = "p2pmem",
 };
 
+/*
+ * P2PDMA internal mount
+ * Fake an internal VFS mount-point in order to allocate struct address_space
+ * mappings to remove VMAs on unbind events.
+ */
+static int pci_p2pdma_fs_cnt;
+static struct vfsmount *pci_p2pdma_fs_mnt;
+
+static int pci_p2pdma_fs_init_fs_context(struct fs_context *fc)
+{
+   return init_pseudo(fc, P2PDMA_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type pci_p2pdma_fs_type = {
+   .name = "p2dma",
+   .owner = THIS_MODULE,
+   .init_fs_context = pci_p2pdma_fs_init_fs_context,
+   .kill_sb = kill_anon_super,
+};
+
+static void p2pdma_page_free(struct page *page)
+{
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+   struct percpu_ref *ref;
+
+   gen_pool_free_owner(pgmap->provider->p2pdma->pool,
+   (uintptr_t)page_to_virt(page), PAGE_SIZE,
+   (void **)&ref);
+   percpu_ref_put(ref);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+   .page_free = p2pdma_page_free,
+};
+
 static void pci_p2pdma_release(void *data)
 {
struct pci_dev *pdev = data;
@@ -117,6 +157,9 @@ static void pci_p2pdma_release(void *data)
gen_pool_destroy(p2pdma->pool);
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
xa_destroy(&p2pdma->map_types);
+
+   iput(p2pdma->inode);
+   simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 }
 
 static int pci_p2pdma_setup(struct pci_dev *pdev)
@@ -134,17 +177,32 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
if (!p2p->pool)
goto out;
 
-   error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+   error = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
+ &pci_p2pdma_fs_cnt);
if (error)
goto out_pool_destroy;
 
+   p2p->inode = alloc_anon_inode(pci_p2pdma_fs_mnt->mnt_sb);
+   if (IS_ERR(p2p->inode)) {
+   error = -ENOMEM;
+   goto out_unpin_fs;
+   }
+
+   error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+   if (error)
+   goto out_put_inode;
+
error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
if (error)
-   goto out_pool_destroy;
+   goto out_put_inode;
 
rcu_assign_pointer(pdev->p2pdma, p2p);
return 0;
 
+out_put_inode:
+   iput(p2p->inode);
+out_unpin_fs:
+   simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 out_pool_destroy:
gen_pool_destroy(p2p->pool);
 out:
@@ -152,6 +210,18 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
return error;
 }
 
+static void pci_p2pdma_unmap_mappings(void *data)
+{
+   struct pci_dev *pdev = data;
+   struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+
+   /* Ensure no new pages can be allocated in mappings */
+

[PATCH v7 02/21] PCI/P2PDMA: Attempt to set map_type if it has not been set

2022-06-15 Thread Logan Gunthorpe

Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages are to come from userspace then there's no
way to ensure the path was checked ahead of time.

This change will calculate the mapping type if it hasn't pre-calculated
so it is no longer invalid to call pci_p2pdma_map_sg() before the mapping
type is calculated, so drop the WARN_ON when that is the case.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/p2pdma.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 462b429ad243..4e8bc457e29a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -854,6 +854,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
struct pci_dev *client;
struct pci_p2pdma *p2pdma;
+   int dist;
 
if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
@@ -870,6 +871,10 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
type = xa_to_value(xa_load(&p2pdma->map_types,
   map_types_idx(client)));
rcu_read_unlock();
+
+   if (type == PCI_P2PDMA_MAP_UNKNOWN)
+   return calc_map_type_and_dist(provider, client, &dist, true);
+
return type;
 }
 
@@ -912,7 +917,7 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
case PCI_P2PDMA_MAP_BUS_ADDR:
return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
default:
-   WARN_ON_ONCE(1);
+   /* Mapping is not Supported */
return 0;
}
 }
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 00/21] Userspace P2PDMA with O_DIRECT NVMe devices

2022-06-15 Thread Logan Gunthorpe

Hi,

This patchset continues my work to add userspace P2PDMA access using
O_DIRECT NVMe devices. This posting cleans up the way the pages are
stored in the VMA and relies on proper reference counting that was
fixed up recently in the kernel. The new method uses vm_insert_page()
in pci_mmap_p2pmem() so there are no longer an faults or other ops and
the pages are just freed sensibly when the VMA is removed. This simplifies
the VMA code significantly.

The previous posting was here[1].

This patch set enables userspace P2PDMA by allowing userspace to mmap()
allocated chunks of the CMB. The resulting VMA can be passed only
to O_DIRECT IO on NVMe backed files or block devices. A flag is added
to GUP() in Patch 14, then Patches 15 through 19 wire this flag up based
on whether the block queue indicates P2PDMA support. Patches 20
through 21 enable the CMB to be mapped into userspace by mmaping
the nvme char device.

This is relatively straightforward, however the one significant
problem is that, presently, pci_p2pdma_map_sg() requires a homogeneous
SGL with all P2PDMA pages or all regular pages. Enhancing GUP to
support enforcing this rule would require a huge hack that I don't
expect would be all that pallatable. So the first 13 patches add
support for P2PDMA pages to dma_map_sg[table]() to the dma-direct
and dma-iommu implementations. Thus, systems without an IOMMU plus
Intel and AMD IOMMUs are supported. (Other IOMMU implementations would
then be unsupported, notably ARM and PowerPC but support would be added
when they convert to dma-iommu).

dma_map_sgtable() is preferred when dealing with P2PDMA memory as it
will return -EREMOTEIO when the DMA device cannot map specific P2PDMA
pages based on the existing rules in calc_map_type_and_dist().

The other issue is dma_unmap_sg() needs a flag to determine whether a
given dma_addr_t was mapped regularly or as a PCI bus address. To allow
this, a third flag is added to the page_link field in struct
scatterlist. This effectively means support for P2PDMA will now depend
on CONFIG_64BIT.

Feedback welcome.

This series is based on v5.19-rc1. A git branch is available here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_user_cmb_v7

Thanks,

Logan

[1] https://lkml.kernel.org/r/20220407154717.7695-1-log...@deltatee.com

--

Changes since v6:
  - Rebase onto v5.19-rc1
  - Rework how the pages are stored in the VMA per Jason's suggestion

Changes since v5:
  - Rebased onto v5.18-rc1 which includes Christophs cleanup to
free_zone_device_page() (similar to Ralph's patch).
  - Fix bug with concurrent first calls to pci_p2pdma_vma_fault()
that caused a double allocation and lost p2p memory. Noticed
by Andrew Maier.
  - Collected a Reviewed-by tag from Chaitanya.
  - Numerous minor fixes to commit messages

Changes since v4:
  - Rebase onto v5.17-rc1.
  - Included Ralph Cambell's patches which removes the ZONE_DEVICE page
reference count offset. This is just to demonstrate that this
series is compatible with that direction.
  - Added a comment in pci_p2pdma_map_sg_attrs(), per Chaitanya and
included his Reviewed-by tags.
  - Patch 1 in the last series which cleaned up scatterlist.h
has been upstreamed.
  - Dropped NEED_SG_DMA_BUS_ADDR_FLAG seeing depends on doesn't
work with selected symbols, per Christoph.
  - Switched iov_iter_get_pages_[alloc_]flags to be exported with
EXPORT_SYMBOL_GPL, per Christoph.
  - Renamed zone_device_pages_are_mergeable() to
zone_device_pages_have_same_pgmap(), per Christoph.
  - Renamed .mmap_file_open operation in nvme_ctrl_ops to
cdev_file_open(), per Christoph.

Changes since v3:
  - Add some comment and commit message cleanups I had missed for v3,
also moved the prototypes for some of the p2pdma helpers to
dma-map-ops.h (which I missed in v3 and was suggested in v2).
  - Add separate cleanup patch for scatterlist.h and change the macros
to functions. (Suggested by Chaitanya and Jason, respectively)
  - Rename sg_dma_mark_pci_p2pdma() and sg_is_dma_pci_p2pdma() to
sg_dma_mark_bus_address() and sg_is_dma_bus_address() which
is a more generic name (As requested by Jason)
  - Fixes to some comments and commit messages as suggested by Bjorn
and Jason.
  - Ensure swiotlb is not used with P2PDMA pages. (Per Jason)
  - The sgtable coversion in RDMA was split out and sent upstream
separately, the new patch is only the removal. (Per Jason)
  - Moved the FOLL_PCI_P2PDMA check outside of get_dev_pagemap() as
Jason suggested this will be removed in the near term.
  - Add two patches to ensure that zone device pages with different
pgmaps are never merged in the block layer or
sg_alloc_append_table_from_pages() (Per Jason)
  - Ensure synchronize_rcu() or call_rcu() is used before returning
pages to the genalloc. (Jason pointed out that pages are not
gauranteed to be unused in all architectures until at least
after an RCU grace period, and that synchronize_r

[PATCH v7 07/21] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support

2022-06-15 Thread Logan Gunthorpe

Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 include/linux/dma-map-ops.h | 10 ++
 include/linux/dma-mapping.h |  5 +
 kernel/dma/mapping.c| 18 ++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 752f91e5eb5d..4d4161d58ce0 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -11,7 +11,17 @@
 
 struct cma;
 
+/*
+ * Values for struct dma_map_ops.flags:
+ *
+ * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can
+ * handle PCI P2PDMA pages in the map_sg/unmap_sg operation.
+ */
+#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
+
 struct dma_map_ops {
+   unsigned int flags;
+
void *(*alloc)(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..f7c61b2b4b5e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -140,6 +140,7 @@ int dma_mmap_attrs(struct device *dev, struct 
vm_area_struct *vma,
unsigned long attrs);
 bool dma_can_mmap(struct device *dev);
 int dma_supported(struct device *dev, u64 mask);
+bool dma_pci_p2pdma_supported(struct device *dev);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
@@ -250,6 +251,10 @@ static inline int dma_supported(struct device *dev, u64 
mask)
 {
return 0;
 }
+static inline bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   return false;
+}
 static inline int dma_set_mask(struct device *dev, u64 mask)
 {
return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 9f65d1041638..21793506fdb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -722,6 +722,24 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
+bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   /* if ops is not set, dma direct will be used which supports P2PDMA */
+   if (!ops)
+   return true;
+
+   /*
+* Note: dma_ops_bypass is not checked here because P2PDMA should
+* not be used with dma mapping ops that do not have support even
+* if the specific device is bypassing them.
+*/
+
+   return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
 void arch_dma_set_mask(struct device *dev, u64 mask);
 #else
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 11/21] RDMA/core: introduce ib_dma_pci_p2p_dma_supported()

2022-06-15 Thread Logan Gunthorpe

Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
if a given ib_device can be used in P2PDMA transfers. This ensures
the ib_device is not using virt_dma and also that the underlying
dma_device supports P2PDMA.

Use the new helper in nvme-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Max Gurtovoy 
---
 drivers/nvme/target/rdma.c |  2 +-
 include/rdma/ib_verbs.h| 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 09fdcac87d17..4597bca43a6d 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -415,7 +415,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device 
*ndev,
if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
goto out_free_rsp;
 
-   if (!ib_uses_virt_dma(ndev->device))
+   if (ib_dma_pci_p2p_dma_supported(ndev->device))
r->req.p2p_client = &ndev->device->dev;
r->send_sge.length = sizeof(*r->req.cqe);
r->send_sge.lkey = ndev->pd->local_dma_lkey;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9c6317cf80d5..523843d9ed6c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4013,6 +4013,17 @@ static inline bool ib_uses_virt_dma(struct ib_device 
*dev)
return IS_ENABLED(CONFIG_INFINIBAND_VIRT_DMA) && !dev->dma_device;
 }
 
+/*
+ * Check if a IB device's underlying DMA mapping supports P2PDMA transfers.
+ */
+static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
+{
+   if (ib_uses_virt_dma(dev))
+   return false;
+
+   return dma_pci_p2pdma_supported(dev->dma_device);
+}
+
 /**
  * ib_dma_mapping_error - check a DMA addr for error
  * @dev: The device for which the dma_addr was created
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 21/21] nvme-pci: allow mmaping the CMB in userspace

2022-06-15 Thread Logan Gunthorpe

Allow userspace to obtain CMB memory by mmaping the controller's
char device. The mmap call allocates and returns a hunk of CMB memory,
(the offset is ignored) so userspace does not have control over the
address within the CMB.

A VMA allocated in this way will only be usable by drivers that set
FOLL_PCI_P2PDMA when calling GUP. And inter-device support will be
checked the first time the pages are mapped for DMA.

Currently this is only supported by O_DIRECT to an PCI NVMe device
or through the NVMe passthrough IOCTL.

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/host/core.c | 35 +++
 drivers/nvme/host/nvme.h |  3 +++
 drivers/nvme/host/pci.c  | 23 +++
 3 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d6e76f2dc293..23fe4b544bf1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3166,6 +3166,7 @@ static int nvme_dev_open(struct inode *inode, struct file 
*file)
 {
struct nvme_ctrl *ctrl =
container_of(inode->i_cdev, struct nvme_ctrl, cdev);
+   int ret = -EINVAL;
 
switch (ctrl->state) {
case NVME_CTRL_LIVE:
@@ -3175,13 +3176,25 @@ static int nvme_dev_open(struct inode *inode, struct 
file *file)
}
 
nvme_get_ctrl(ctrl);
-   if (!try_module_get(ctrl->ops->module)) {
-   nvme_put_ctrl(ctrl);
-   return -EINVAL;
-   }
+   if (!try_module_get(ctrl->ops->module))
+   goto err_put_ctrl;
 
file->private_data = ctrl;
+
+   if (ctrl->ops->cdev_file_open) {
+   ret = ctrl->ops->cdev_file_open(ctrl, file);
+   if (ret)
+   goto err_put_mod;
+   }
+
return 0;
+
+err_put_mod:
+   module_put(ctrl->ops->module);
+err_put_ctrl:
+   nvme_put_ctrl(ctrl);
+   return ret;
+
 }
 
 static int nvme_dev_release(struct inode *inode, struct file *file)
@@ -3189,11 +3202,24 @@ static int nvme_dev_release(struct inode *inode, struct 
file *file)
struct nvme_ctrl *ctrl =
container_of(inode->i_cdev, struct nvme_ctrl, cdev);
 
+   if (ctrl->ops->cdev_file_release)
+   ctrl->ops->cdev_file_release(file);
+
module_put(ctrl->ops->module);
nvme_put_ctrl(ctrl);
return 0;
 }
 
+static int nvme_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct nvme_ctrl *ctrl = file->private_data;
+
+   if (!ctrl->ops->mmap_cmb)
+   return -ENODEV;
+
+   return ctrl->ops->mmap_cmb(ctrl, vma);
+}
+
 static const struct file_operations nvme_dev_fops = {
.owner  = THIS_MODULE,
.open   = nvme_dev_open,
@@ -3201,6 +3227,7 @@ static const struct file_operations nvme_dev_fops = {
.unlocked_ioctl = nvme_dev_ioctl,
.compat_ioctl   = compat_ptr_ioctl,
.uring_cmd  = nvme_dev_uring_cmd,
+   .mmap   = nvme_dev_mmap,
 };
 
 static ssize_t nvme_sysfs_reset(struct device *dev,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 957f79420cf3..44ff05d8e24d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -503,6 +503,9 @@ struct nvme_ctrl_ops {
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
+   int (*cdev_file_open)(struct nvme_ctrl *ctrl, struct file *file);
+   void (*cdev_file_release)(struct file *file);
+   int (*mmap_cmb)(struct nvme_ctrl *ctrl, struct vm_area_struct *vma);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 52b52a7efa9a..8ef3752b7ddb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2972,6 +2972,26 @@ static bool nvme_pci_supports_pci_p2pdma(struct 
nvme_ctrl *ctrl)
return dma_pci_p2pdma_supported(dev->dev);
 }
 
+static int nvme_pci_cdev_file_open(struct nvme_ctrl *ctrl, struct file *file)
+{
+   struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+   return pci_p2pdma_file_open(pdev, file);
+}
+
+static void nvme_pci_cdev_file_release(struct file *file)
+{
+   pci_p2pdma_file_release(file);
+}
+
+static int nvme_pci_mmap_cmb(struct nvme_ctrl *ctrl,
+struct vm_area_struct *vma)
+{
+   struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+   return pci_mmap_p2pmem(pdev, vma);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
@@ -2983,6 +3003,9 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_address,
.supports_pci_p2pdma= nvme_pci_supports_pci_p2pdma,
+   .cdev_file_open = nvme_pci_cdev_file_open

[PATCH v7 18/21] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()

2022-06-15 Thread Logan Gunthorpe

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace and enables the O_DIRECT path in iomap based filesystems
and direct to block devices.

Signed-off-by: Logan Gunthorpe 
---
 block/bio.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index a402a4760457..0d152da8938d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1180,6 +1180,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, 
struct iov_iter *iter)
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
bool same_page = false;
+   unsigned int flags = 0;
ssize_t size, left;
unsigned len, i;
size_t offset;
@@ -1192,7 +1193,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, 
struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-   size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+   if (bio->bi_bdev && bio->bi_bdev->bd_disk &&
+   blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+   flags |= FOLL_PCI_P2PDMA;
+
+   size = iov_iter_get_pages_flags(iter, pages, LONG_MAX, nr_pages,
+   &offset, flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
 
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v7 12/21] RDMA/rw: drop pci_p2pdma_[un]map_sg()

2022-06-15 Thread Logan Gunthorpe

dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped. This means the
rdma_rw_[un]map_sg() helpers are no longer necessary. Remove it all.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 drivers/infiniband/core/rw.c | 45 
 1 file changed, 9 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 4d98f931a13d..8367974b7998 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -274,33 +274,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return 1;
 }
 
-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
-u32 sg_cnt, enum dma_data_direction dir)
-{
-   if (is_pci_p2pdma_page(sg_page(sg)))
-   pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
-   else
-   ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sgtable(struct ib_device *dev, struct sg_table *sgt,
-  enum dma_data_direction dir)
-{
-   int nents;
-
-   if (is_pci_p2pdma_page(sg_page(sgt->sgl))) {
-   if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
-   return 0;
-   nents = pci_p2pdma_map_sg(dev->dma_device, sgt->sgl,
- sgt->orig_nents, dir);
-   if (!nents)
-   return -EIO;
-   sgt->nents = nents;
-   return 0;
-   }
-   return ib_dma_map_sgtable_attrs(dev, sgt, dir, 0);
-}
-
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:   context to initialize
@@ -327,7 +300,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u32 port_num,
};
int ret;
 
-   ret = rdma_rw_map_sgtable(dev, &sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, &sgt, dir, 0);
if (ret)
return ret;
sg_cnt = sgt.nents;
@@ -366,7 +339,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u32 port_num,
return ret;
 
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, &sgt, dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -414,12 +387,12 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return -EINVAL;
}
 
-   ret = rdma_rw_map_sgtable(dev, &sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, &sgt, dir, 0);
if (ret)
return ret;
 
if (prot_sg_cnt) {
-   ret = rdma_rw_map_sgtable(dev, &prot_sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, &prot_sgt, dir, 0);
if (ret)
goto out_unmap_sg;
}
@@ -486,9 +459,9 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
kfree(ctx->reg);
 out_unmap_prot_sg:
if (prot_sgt.nents)
-   rdma_rw_unmap_sg(dev, prot_sgt.sgl, prot_sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, &prot_sgt, dir, 0);
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, &sgt, dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_signature_init);
@@ -621,7 +594,7 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct 
ib_qp *qp,
break;
}
 
-   rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
@@ -649,8 +622,8 @@ void rdma_rw_ctx_destroy_signature(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
kfree(ctx->reg);
 
if (prot_sg_cnt)
-   rdma_rw_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
-   rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy_signature);
 
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCHES 1/2] iommu: Add RCU-protected page free support

2022-06-15 Thread Jason Gunthorpe via iommu

On Fri, Jun 10, 2022 at 01:37:20PM +0800, Baolu Lu wrote:
> On 2022/6/9 20:49, Jason Gunthorpe wrote:
> > > +void iommu_free_pgtbl_pages(struct iommu_domain *domain,
> > > + struct list_head *pages)
> > > +{
> > > + struct page *page, *next;
> > > +
> > > + if (!domain->concurrent_traversal) {
> > > + put_pages_list(pages);
> > > + return;
> > > + }
> > > +
> > > + list_for_each_entry_safe(page, next, pages, lru) {
> > > + list_del(&page->lru);
> > > + call_rcu(&page->rcu_head, pgtble_page_free_rcu);
> > > + }
> > It seems OK, but I wonder if there is benifit to using
> > put_pages_list() from the rcu callback
> 
> The price is that we need to allocate a "struct list_head" and free it
> in the rcu callback as well. Currently the list_head is sitting in the
> stack.

You'd have to use a different struct page layout so that the list_head
was in the struct page and didn't overlap with the rcu_head

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] uacce: fix concurrency of fops_open and uacce_remove

2022-06-15 Thread Jean-Philippe Brucker

Hi,

On Fri, Jun 10, 2022 at 08:34:23PM +0800, Zhangfei Gao wrote:
> The uacce parent's module can be removed when uacce is working,
> which may cause troubles.
> 
> If rmmod/uacce_remove happens just after fops_open: bind_queue,
> the uacce_remove can not remove the bound queue since it is not
> added to the queue list yet, which blocks the uacce_disable_sva.
> 
> Change queues_lock area to make sure the bound queue is added to
> the list thereby can be searched in uacce_remove.
> 
> And uacce->parent->driver is checked immediately in case rmmod is
> just happening.
> 
> Also the parent driver must always stop DMA before calling
> uacce_remove.
> 
> Signed-off-by: Yang Shen 
> Signed-off-by: Zhangfei Gao 
> ---
>  drivers/misc/uacce/uacce.c | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
> index 281c54003edc..b6219c6bfb48 100644
> --- a/drivers/misc/uacce/uacce.c
> +++ b/drivers/misc/uacce/uacce.c
> @@ -136,9 +136,16 @@ static int uacce_fops_open(struct inode *inode, struct 
> file *filep)
>   if (!q)
>   return -ENOMEM;
>  
> + mutex_lock(&uacce->queues_lock);
> +
> + if (!uacce->parent->driver) {

I don't think this is useful, because the core clears parent->driver after
having run uacce_remove():

  rmmod hisi_zipopen()
   ...   uacce_fops_open()
   __device_release_driver()  ...
pci_device_remove()
 hisi_zip_remove()
  hisi_qm_uninit()
   uacce_remove()
...   ...
  mutex_lock(uacce->queues_lock)
...   if (!uacce->parent->driver)
device_unbind_cleanup()   /* driver still valid, proceed */
 dev->driver = NULL

Since uacce_remove() disabled SVA, the following uacce_bind_queue() will
fail anyway. However, if uacce->flags does not have UACCE_DEV_SVA set,
we'll proceed further and call uacce->ops->get_queue(), which does not
exist anymore since the parent module is gone.

I think we need the global uacce_mutex to serialize uacce_remove() and
uacce_fops_open(). uacce_remove() would do everything, including
xa_erase(), while holding that mutex. And uacce_fops_open() would try to
obtain the uacce object from the xarray while holding the mutex, which
fails if the uacce object is being removed.

Thanks,
Jean

> + ret = -ENODEV;
> + goto out_with_lock;
> + }
> +
>   ret = uacce_bind_queue(uacce, q);
>   if (ret)
> - goto out_with_mem;
> + goto out_with_lock;
>  
>   q->uacce = uacce;
>  
> @@ -153,7 +160,6 @@ static int uacce_fops_open(struct inode *inode, struct 
> file *filep)
>   uacce->inode = inode;
>   q->state = UACCE_Q_INIT;
>  
> - mutex_lock(&uacce->queues_lock);
>   list_add(&q->list, &uacce->queues);
>   mutex_unlock(&uacce->queues_lock);
>  
> @@ -161,7 +167,8 @@ static int uacce_fops_open(struct inode *inode, struct 
> file *filep)
>  
>  out_with_bond:
>   uacce_unbind_queue(q);
> -out_with_mem:
> +out_with_lock:
> + mutex_unlock(&uacce->queues_lock);
>   kfree(q);
>   return ret;
>  }
> @@ -171,10 +178,10 @@ static int uacce_fops_release(struct inode *inode, 
> struct file *filep)
>   struct uacce_queue *q = filep->private_data;
>  
>   mutex_lock(&q->uacce->queues_lock);
> - list_del(&q->list);
> - mutex_unlock(&q->uacce->queues_lock);
>   uacce_put_queue(q);
>   uacce_unbind_queue(q);
> + list_del(&q->list);
> + mutex_unlock(&q->uacce->queues_lock);
>   kfree(q);
>  
>   return 0;
> @@ -513,10 +520,10 @@ void uacce_remove(struct uacce_device *uacce)
>   uacce_put_queue(q);
>   uacce_unbind_queue(q);
>   }
> - mutex_unlock(&uacce->queues_lock);
>  
>   /* disable sva now since no opened queues */
>   uacce_disable_sva(uacce);
> + mutex_unlock(&uacce->queues_lock);
>  
>   if (uacce->cdev)
>   cdev_device_del(uacce->cdev, &uacce->dev);
> -- 
> 2.36.1
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RESEND PATCH v8 01/11] iommu: Add DMA ownership management interfaces

2022-06-15 Thread Steven Price

On 15/06/2022 11:57, Robin Murphy wrote:
> On 2022-06-15 10:53, Steven Price wrote:
>> On 18/04/2022 01:49, Lu Baolu wrote:
>>> Multiple devices may be placed in the same IOMMU group because they
>>> cannot be isolated from each other. These devices must either be
>>> entirely under kernel control or userspace control, never a mixture.
>>>
>>> This adds dma ownership management in iommu core and exposes several
>>> interfaces for the device drivers and the device userspace assignment
>>> framework (i.e. VFIO), so that any conflict between user and kernel
>>> controlled dma could be detected at the beginning.
>>>
>>> The device driver oriented interfaces are,
>>>
>>> int iommu_device_use_default_domain(struct device *dev);
>>> void iommu_device_unuse_default_domain(struct device *dev);
>>>
>>> By calling iommu_device_use_default_domain(), the device driver tells
>>> the iommu layer that the device dma is handled through the kernel DMA
>>> APIs. The iommu layer will manage the IOVA and use the default domain
>>> for DMA address translation.
>>>
>>> The device user-space assignment framework oriented interfaces are,
>>>
>>> int iommu_group_claim_dma_owner(struct iommu_group *group,
>>>     void *owner);
>>> void iommu_group_release_dma_owner(struct iommu_group *group);
>>> bool iommu_group_dma_owner_claimed(struct iommu_group *group);
>>>
>>> The device userspace assignment must be disallowed if the DMA owner
>>> claiming interface returns failure.
>>>
>>> Signed-off-by: Jason Gunthorpe 
>>> Signed-off-by: Kevin Tian 
>>> Signed-off-by: Lu Baolu 
>>> Reviewed-by: Robin Murphy 
>>
>> I'm seeing a regression that I've bisected to this commit on a Firefly
>> RK3288 board. The display driver fails to probe properly because
>> __iommu_attach_group() returns -EBUSY. This causes long hangs and splats
>> as the display flips timeout.
>>
>> The call stack to __iommu_attach_group() is:
>>
>>   __iommu_attach_group from iommu_attach_device+0x64/0xb4
>>   iommu_attach_device from rockchip_drm_dma_attach_device+0x20/0x50
>>   rockchip_drm_dma_attach_device from vop_crtc_atomic_enable+0x10c/0xa64
>>   vop_crtc_atomic_enable from
>> drm_atomic_helper_commit_modeset_enables+0xa8/0x290
>>   drm_atomic_helper_commit_modeset_enables from
>> drm_atomic_helper_commit_tail_rpm+0x44/0x8c
>>   drm_atomic_helper_commit_tail_rpm from commit_tail+0x9c/0x180
>>   commit_tail from drm_atomic_helper_commit+0x164/0x18c
>>   drm_atomic_helper_commit from drm_atomic_commit+0xac/0xe4
>>   drm_atomic_commit from drm_client_modeset_commit_atomic+0x23c/0x284
>>   drm_client_modeset_commit_atomic from
>> drm_client_modeset_commit_locked+0x60/0x1c8
>>   drm_client_modeset_commit_locked from
>> drm_client_modeset_commit+0x24/0x40
>>   drm_client_modeset_commit from drm_fb_helper_set_par+0xb8/0xf8
>>   drm_fb_helper_set_par from drm_fb_helper_hotplug_event.part.0+0xa8/0xc0
>>   drm_fb_helper_hotplug_event.part.0 from output_poll_execute+0xb8/0x224
>>
>>> @@ -2109,7 +2115,7 @@ static int __iommu_attach_group(struct
>>> iommu_domain *domain,
>>>   {
>>>   int ret;
>>>   -    if (group->default_domain && group->domain !=
>>> group->default_domain)
>>> +    if (group->domain && group->domain != group->default_domain)
>>>   return -EBUSY;
>>>     ret = __iommu_group_for_each_dev(group, domain,
>>
>> Reverting this 'fixes' the problem for me. The follow up 0286300e6045
>> ("iommu: iommu_group_claim_dma_owner() must always assign a domain")
>> doesn't help.
>>
>> Adding some debug printks I can see that domain is a valid pointer, but
>> both default_domain and blocking_domain are NULL.
>>
>> I'm using the DTB from the kernel tree (rk3288-firefly.dtb).
>>
>> Any ideas?
> 
> Hmm, TBH I'm not sure how that worked previously... it'll be complaining
> because the ARM DMA domain is still attached, but even when the attach
> goes ahead and replaces the ARM domain with the driver's new one, it's
> not using the special arm_iommu_detach_device() interface anywhere so
> the device would still be left with the wrong DMA ops :/
> 
> I guess the most pragmatic option is probably to give rockchip-drm a
> similar bodge to exynos and tegra, to explicitly remove the ARM domain
> before attaching its own.

A bodge like below indeed 'fixes' the problem:

---8<---
diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_drv.c 
b/drivers/gpu/drm/rockchip/rockchip_drm_drv.c
index 67d38f53d3e5..cbc6a5121296 100644
--- a/drivers/gpu/drm/rockchip/rockchip_drm_drv.c
+++ b/drivers/gpu/drm/rockchip/rockchip_drm_drv.c
@@ -23,6 +23,14 @@
 #include 
 #include 
 
+#if defined(CONFIG_ARM_DMA_USE_IOMMU)
+#include 
+#else
+#define arm_iommu_detach_device(...)   ({ })
+#define arm_iommu_release_mapping(...) ({ })
+#define to_dma_iommu_mapping(dev) NULL
+#endif
+
 #include "rockchip_drm_drv.h"
 #include "rockchip_drm_fb.h"
 #include "rockchip_drm_gem.h"
@@ -49,6 +57,14 @@ int rockchip_drm_dma_attach_device(struct drm_device 
*drm_dev,
if (

Re: [PATCH v2] iommu/vt-d: Make DMAR_UNITS_SUPPORTED a config setting

2022-06-15 Thread Steve Wahl

On Wed, Jun 15, 2022 at 09:38:35AM +0800, Baolu Lu wrote:
> On 2022/6/15 05:12, Steve Wahl wrote:
> > On Tue, Jun 14, 2022 at 12:01:45PM -0700, Jerry Snitselaar wrote:
> > > On Tue, Jun 14, 2022 at 11:45:35AM -0500, Steve Wahl wrote:
> > > > On Tue, Jun 14, 2022 at 10:21:29AM +0800, Baolu Lu wrote:
> > > > > On 2022/6/14 09:54, Jerry Snitselaar wrote:
> > > > > > On Mon, Jun 13, 2022 at 6:51 PM Baolu Lu  
> > > > > > wrote:
> > > > > > > 
> > > > > > > On 2022/6/14 09:44, Jerry Snitselaar wrote:
> > > > > > > > On Mon, Jun 13, 2022 at 6:36 PM Baolu 
> > > > > > > > Lu  wrote:
> > > > > > > > > On 2022/6/14 04:57, Jerry Snitselaar wrote:
> > > > > > > > > > On Thu, May 12, 2022 at 10:13:09AM -0500, Steve Wahl wrote:
> > > > > > > > > > > To support up to 64 sockets with 10 DMAR units each 
> > > > > > > > > > > (640), make the
> > > > > > > > > > > value of DMAR_UNITS_SUPPORTED adjustable by a config 
> > > > > > > > > > > variable,
> > > > > > > > > > > CONFIG_DMAR_UNITS_SUPPORTED, and make it's default 1024 
> > > > > > > > > > > when MAXSMP is
> > > > > > > > > > > set.
> > > > > > > > > > > 
> > > > > > > > > > > If the available hardware exceeds DMAR_UNITS_SUPPORTED 
> > > > > > > > > > > (previously set
> > > > > > > > > > > to MAX_IO_APICS, or 128), it causes these messages: 
> > > > > > > > > > > "DMAR: Failed to
> > > > > > > > > > > allocate seq_id", "DMAR: Parse DMAR table failure.", and 
> > > > > > > > > > > "x2apic: IRQ
> > > > > > > > > > > remapping doesn't support X2APIC mode x2apic disabled"; 
> > > > > > > > > > > and the system
> > > > > > > > > > > fails to boot properly.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Steve Wahl
> > > > > > > > > > > ---
> > > > > > > > > > > 
> > > > > > > > > > > Note that we could not find a reason for connecting
> > > > > > > > > > > DMAR_UNITS_SUPPORTED to MAX_IO_APICS as was done 
> > > > > > > > > > > previously.  Perhaps
> > > > > > > > > > > it seemed like the two would continue to match on earlier 
> > > > > > > > > > > processors.
> > > > > > > > > > > There doesn't appear to be kernel code that assumes that 
> > > > > > > > > > > the value of
> > > > > > > > > > > one is related to the other.
> > > > > > > > > > > 
> > > > > > > > > > > v2: Make this value a config option, rather than a fixed 
> > > > > > > > > > > constant.  The default
> > > > > > > > > > > values should match previous configuration except in the 
> > > > > > > > > > > MAXSMP case.  Keeping the
> > > > > > > > > > > value at a power of two was requested by Kevin Tian.
> > > > > > > > > > > 
> > > > > > > > > > >  drivers/iommu/intel/Kconfig | 6 ++
> > > > > > > > > > >  include/linux/dmar.h| 6 +-
> > > > > > > > > > >  2 files changed, 7 insertions(+), 5 deletions(-)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/drivers/iommu/intel/Kconfig 
> > > > > > > > > > > b/drivers/iommu/intel/Kconfig
> > > > > > > > > > > index 247d0f2d5fdf..fdbda77ac21e 100644
> > > > > > > > > > > --- a/drivers/iommu/intel/Kconfig
> > > > > > > > > > > +++ b/drivers/iommu/intel/Kconfig
> > > > > > > > > > > @@ -9,6 +9,12 @@ config DMAR_PERF
> > > > > > > > > > >  config DMAR_DEBUG
> > > > > > > > > > > bool
> > > > > > > > > > > 
> > > > > > > > > > > +config DMAR_UNITS_SUPPORTED
> > > > > > > > > > > +int "Number of DMA Remapping Units supported"
> > > > > > > > > > Also, should there be a "depends on (X86 || IA64)" here?
> > > > > > > > > Do you have any compilation errors or warnings?
> > > > > > > > > 
> > > > > > > > > Best regards,
> > > > > > > > > baolu
> > > > > > > > > 
> > > > > > > > I think it is probably harmless since it doesn't get used 
> > > > > > > > elsewhere,
> > > > > > > > but our tooling was complaining to me because 
> > > > > > > > DMAR_UNITS_SUPPORTED was
> > > > > > > > being autogenerated into the configs for the non-x86 
> > > > > > > > architectures we
> > > > > > > > build (aarch64, s390x, ppcle64).
> > > > > > > > We have files corresponding to the config options that it looks 
> > > > > > > > at,
> > > > > > > > and I had one for x86 and not the others so it noticed the
> > > > > > > > discrepancy.
> > > > > > > 
> > > > > > > So with "depends on (X86 || IA64)", that tool doesn't complain 
> > > > > > > anymore,
> > > > > > > right?
> > > > > > > 
> > > > > > > Best regards,
> > > > > > > baolu
> > > > > > > 
> > > > > > 
> > > > > > Yes, with the depends it no longer happens.
> > > > > 
> > > > > The dmar code only exists on X86 and IA64 arch's. Adding this 
> > > > > depending
> > > > > makes sense to me. I will add it if no objections.
> > > > 
> > > > I think that works after Baolu's patchset that makes intel-iommu.h
> > > > private.  I'm pretty sure it wouldn't have worked before that.
> > > > 
> > > > No objections.
> > > > 
> > > 
> > > Yes, I think applying it with the depends prior to Baolu's change would
> > > still run into the issue from the KTR report if someone compiled without
> >

Re: helping with remapping vmem for dma

2022-06-15 Thread Frank Wunderlich

Am 15. Juni 2022 15:17:00 MESZ schrieb Christoph Hellwig :
>On Wed, Jun 15, 2022 at 02:15:33PM +0100, Robin Murphy wrote:
>> Put simply, if you want to call dma_map_single() on a buffer, then that 
>> buffer needs to be allocated with kmalloc() (or technically alloc_pages(), 
>> but then dma_map_page() would make more sense when dealing with entire 
>> pages.
>
>Yes.  It sounds like the memory here comes from the dma coherent
>allocator, in which case the code need to use the address returned
>by that and not create another mapping.

As i have not found position where memory is allocated (this is a very huge and 
dirty driver) is it maybe possible to check if buf is such "allready dma" 
memory (maybe is_vmalloc_addr) and call dma_single_map only if not (using 
original buf if yes)?

But i guess it should map only a part of available (pre-allocated) memory and 
other parts of this are used somewhere else. So i can ran into some issues 
caused by sharing this full block in different functions.
Hi,

Thanks for first suggestions. 
regards Frank
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 1/3] iommu/io-pgtable-arm-v7s: Add a quirk to allow pgtable PA up to 35bit

2022-06-15 Thread yf.wang--- via iommu

On Tue, 2022-06-14 at 13:56 +0100, Will Deacon wrote:
> Hi,
> 
> For some reason, this series has landed in my spam folder so
> apologies
> for the delay :/
> 
> 
> > +static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,
> > +   struct io_pgtable_cfg *cfg)
> > +{
> > +   arm_v7s_iopte pte = paddr & ARM_V7S_LVL_MASK(lvl);
> > +
> > +   if (!arm_v7s_is_mtk_enabled(cfg))
> > +   return pte;
> > +
> > +   return to_iopte_mtk(paddr, pte);
> 
> nit, but can we rename and rework this so it reads a bit better,
> please?
> Something like:
> 
> 
>   if (arm_v7s_is_mtk_enabled(cfg))
>   return to_mtk_iopte(paddr, pte);
> 
>   return pte;
> 
> 

Hi Will,
Thanks for your suggestion, PATCH v9 version will modify it.


> >  static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
> >   struct io_pgtable_cfg *cfg)
> >  {
> > @@ -234,6 +239,7 @@ static arm_v7s_iopte *iopte_deref(arm_v7s_iopte
> > pte, int lvl,
> >  static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
> >struct arm_v7s_io_pgtable *data)
> >  {
> > +   gfp_t gfp_l1 = __GFP_ZERO | ARM_V7S_TABLE_GFP_DMA;
> > struct io_pgtable_cfg *cfg = &data->iop.cfg;
> > struct device *dev = cfg->iommu_dev;
> > phys_addr_t phys;
> > @@ -241,9 +247,11 @@ static void *__arm_v7s_alloc_table(int lvl,
> > gfp_t gfp,
> > size_t size = ARM_V7S_TABLE_SIZE(lvl, cfg);
> > void *table = NULL;
> >  
> > +   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT)
> > +   gfp_l1 = GFP_KERNEL | __GFP_ZERO;
> 
> I think it's a bit grotty to override the flags inline like this
> (same for
> the slab flag later on). Something like this is a bit cleaner:
> 
> 
>   /*
>* Comment explaining why GFP_KERNEL is desirable here.
>* I'm assuming it's because the walker can address all of
> memory.
>*/
>   gfp_l1 = cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT ?
>GFP_KERNEL : ARM_V7S_TABLE_GFP_DMA;
> 
>   ...
> 
>   __get_free_pages(gfp_l1 | __GFP_ZERO, ...);
> 
> 
> and similar for the slab flag.
> 

Hi Will,
Thanks for your suggestion, PATCH v9 version will modify it.


> > if (lvl == 1)
> > -   table = (void *)__get_free_pages(
> > -   __GFP_ZERO | ARM_V7S_TABLE_GFP_DMA,
> > get_order(size));
> > +   table = (void *)__get_free_pages(gfp_l1,
> > get_order(size));
> > else if (lvl == 2)
> > table = kmem_cache_zalloc(data->l2_tables, gfp);
> >  
> > @@ -251,7 +259,8 @@ static void *__arm_v7s_alloc_table(int lvl,
> > gfp_t gfp,
> > return NULL;
> >  
> > phys = virt_to_phys(table);
> > -   if (phys != (arm_v7s_iopte)phys) {
> > +   if (phys != (arm_v7s_iopte)phys &&
> > +   !(cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT)) {
> > /* Doesn't fit in PTE */
> 
> Shouldn't we be checking that the address is within 35 bits here?
> Perhaps we
> should generate a mask from the oas instead of just using the cast.
> 

Hi Will,
Thanks for your suggestion, PATCH v9 version will add checking that the address 
is within 35 bits:

phys = virt_to_phys(table);
if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_TTBR_EXT ?
phys >= (1ULL << cfg->oas) : phys != (arm_v7s_iopte)phys) {
/* Doesn't fit in PTE */


Thanks,
Yunfei.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: helping with remapping vmem for dma

2022-06-15 Thread Christoph Hellwig

On Wed, Jun 15, 2022 at 02:15:33PM +0100, Robin Murphy wrote:
> Put simply, if you want to call dma_map_single() on a buffer, then that 
> buffer needs to be allocated with kmalloc() (or technically alloc_pages(), 
> but then dma_map_page() would make more sense when dealing with entire 
> pages.

Yes.  It sounds like the memory here comes from the dma coherent
allocator, in which case the code need to use the address returned
by that and not create another mapping.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: helping with remapping vmem for dma

2022-06-15 Thread Robin Murphy


On 2022-06-15 13:11, Frank Wunderlich wrote:

Hi,

i have upported a wifi-driver (mt6625l for armhf) for some time and fall now 
(at least 5.18) in the
"rejecting DMA map of vmalloc memory" error [1].

maybe anybody here can guide me on how to nail it down and maybe fix it.

as far as i have debugged it, it uses dma_map_single [2] to get dma memory from 
a previous
allocated memory region.

this function "kalDevPortRead" in [2] is used via macro HAL_PORT_RD [3] (used 
in HAL_READ_RX_PORT
and HAL_READ_INTR_STATUS in same hal.h file)

HAL_READ_INTR_STATUS is always called with an empty int array as buf which i 
guess is not the problem.
I think the issue is using the use with an preallocated prSDIOCtrl struct (have 
not completely traced
it back where it is allocated).


Put simply, if you want to call dma_map_single() on a buffer, then that 
buffer needs to be allocated with kmalloc() (or technically 
alloc_pages(), but then dma_map_page() would make more sense when 
dealing with entire pages.


Robin.


calls of HAL_PORT_RD/HAL_READ_RX_PORT are in nic{,_rx}.c (with sdio-struct) 
([4] as example)

maybe there is a simple way to get an address in preallocated memory as 
replacement for the dma_map_simple call (and the unmap of course).

regards Frank

[1] 
https://elixir.bootlin.com/linux/latest/source/include/linux/dma-mapping.h#L327
[2] 
https://github.com/frank-w/BPI-R2-4.14/blob/5.18-main/drivers/misc/mediatek/connectivity/wlan/gen2/os/linux/hif/ahb/ahb.c#L940
[3] 
https://github.com/frank-w/BPI-R2-4.14/blob/5.18-main/drivers/misc/mediatek/connectivity/wlan/gen2/include/nic/hal.h#L176
[4] 
https://github.com/frank-w/BPI-R2-4.14/blob/5.18-main/drivers/misc/mediatek/connectivity/wlan/gen2/nic/nic_rx.c#L3604

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 03/12] iommu/vt-d: Remove clearing translation data in disable_dmar_iommu()

2022-06-15 Thread Baolu Lu

On 2022/6/15 14:22, Tian, Kevin wrote:

From: Baolu Lu 
Sent: Tuesday, June 14, 2022 3:21 PM

On 2022/6/14 14:49, Tian, Kevin wrote:

From: Lu Baolu
Sent: Tuesday, June 14, 2022 10:51 AM

The disable_dmar_iommu() is called when IOMMU initialization fails or
the IOMMU is hot-removed from the system. In both cases, there is no
need to clear the IOMMU translation data structures for devices.

On the initialization path, the device probing only happens after the
IOMMU is initialized successfully, hence there're no translation data
structures.

Out of curiosity. With kexec the IOMMU may contain stale mappings
from the old kernel. Then is it meaningful to disable IOMMU after the
new kernel fails to initialize it properly?

For kexec kernel, if the IOMMU is detected to be pre-enabled, the IOMMU
driver will try to copy tables from the old kernel. If copying table
fails, the IOMMU driver will disable IOMMU and do the normal
initialization.

What about an error occurred after copying table in the initialization
path? The new kernel will be in a state assuming iommu is disabled
but it is still enabled using an old mapping for certain devices...

If copying table failed, the translation will be disabled and a clean
root table will be used.

if (translation_pre_enabled(iommu)) {
pr_info("Translation already enabled - trying to copy 
translation structures\n");

ret = copy_translation_tables(iommu);
if (ret) {
/*
 * We found the IOMMU with translation
 * enabled - but failed to copy over the
 * old root-entry table. Try to proceed
 * by disabling translation now and
 * allocating a clean root-entry table.
 * This might cause DMAR faults, but
 * probably the dump will still succeed.
 */
pr_err("Failed to copy translation tables from previous 
kernel for %s\n",

   iommu->name);
iommu_disable_translation(iommu);
clear_translation_pre_enabled(iommu);
} else {
pr_info("Copied translation tables from previous kernel 
for %s\n",

iommu->name);
}
}

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 01/12] iommu/vt-d: debugfs: Remove device_domain_lock usage

2022-06-15 Thread Baolu Lu

On 2022/6/15 14:13, Tian, Kevin wrote:

From: Baolu Lu
Sent: Wednesday, June 15, 2022 9:54 AM

On 2022/6/14 14:43, Tian, Kevin wrote:

From: Lu Baolu
Sent: Tuesday, June 14, 2022 10:51 AM

The domain_translation_struct debugfs node is used to dump the DMAR
page
tables for the PCI devices. It potentially races with setting domains to
devices. The existing code uses a global spinlock device_domain_lock to
avoid the races, but this is problematical as this lock is only used to
protect the device tracking lists of each domain.

is it really problematic at this point? Before following patches are applied
using device_domain_lock should have similar effect as holding the group
lock.

Here it might make more sense to just focus on removing the use of
device_domain_lock outside of iommu.c. Just that using group lock is
cleaner and more compatible to following cleanups.

and it's worth mentioning that racing with page table updates is out
of the scope of this series. Probably also add a comment in the code
to clarify this point.

Hi Kevin,

How do you like below updated patch?

Yes, this is better.

  From cecc9a0623780a11c4ea4d0a15aa6187f01541c4 Mon Sep 17 00:00:00
2001
From: Lu Baolu
Date: Sun, 29 May 2022 10:18:56 +0800
Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock
usage

The domain_translation_struct debugfs node is used to dump the DMAR
page
tables for the PCI devices. It potentially races with setting domains to
devices. The existing code uses the global spinlock device_domain_lock to
avoid the races.

This removes the use of device_domain_lock outside of iommu.c by replacing
it with the group mutex lock. Using the group mutex lock is cleaner and
more compatible to following cleanups.

Signed-off-by: Lu Baolu
---
   drivers/iommu/intel/debugfs.c | 42 +--
   drivers/iommu/intel/iommu.c   |  2 +-
   drivers/iommu/intel/iommu.h   |  1 -
   3 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
index d927ef10641b..f4acd8993f60 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -342,13 +342,13 @@ static void pgtable_walk_level(struct seq_file *m,
struct dma_pte *pde,
}
   }

-static int show_device_domain_translation(struct device *dev, void *data)
+static int __show_device_domain_translation(struct device *dev, void *data)
   {
-   struct device_domain_info *info = dev_iommu_priv_get(dev);
-   struct dmar_domain *domain = info->domain;
+   struct dmar_domain *domain;
struct seq_file *m = data;
u64 path[6] = { 0 };

+   domain = to_dmar_domain(iommu_get_domain_for_dev(dev));
if (!domain)
return 0;

@@ -359,20 +359,38 @@ static int show_device_domain_translation(struct
device *dev, void *data)
pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
seq_putc(m, '\n');

-   return 0;
+   return 1;
   }

-static int domain_translation_struct_show(struct seq_file *m, void *unused)
+static int show_device_domain_translation(struct device *dev, void *data)
   {
-   unsigned long flags;
-   int ret;
+   struct iommu_group *group;

-   spin_lock_irqsave(&device_domain_lock, flags);
-   ret = bus_for_each_dev(&pci_bus_type, NULL, m,
-  show_device_domain_translation);
-   spin_unlock_irqrestore(&device_domain_lock, flags);
+   group = iommu_group_get(dev);
+   if (group) {
+   /*
+* The group->mutex is held across the callback, which will
+* block calls to iommu_attach/detach_group/device. Hence,
+* the domain of the device will not change during traversal.
+*
+* All devices in an iommu group share a single domain,
hence
+* we only dump the domain of the first device. Even though,

bus_for_each_dev() will still lead to duplicated dump in the same group
but probably we can leave with it for a debug interface.

Yes. This is what it was. Ideally we could walk the iommu groups and
dump the device names belonging to the group and it's domain mappings,
but I was not willing to add any helpers in the iommu core just for a
debugfs use.

---
Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v3 6/6] iommu: mtk_iommu: Lookup phandle to retrieve syscon to pericfg

2022-06-15 Thread AngeloGioacchino Del Regno


Il 15/06/22 14:09, Matthias Brugger ha scritto:



On 09/06/2022 12:08, AngeloGioacchino Del Regno wrote:

On some SoCs (of which only MT8195 is supported at the time of writing),
the "R" and "W" (I/O) enable bits for the IOMMUs are in the pericfg_ao
register space and not in the IOMMU space: as it happened already with
infracfg, it is expected that this list will grow.

Instead of specifying pericfg compatibles on a per-SoC basis, following
what was done with infracfg, let's lookup the syscon by phandle instead.

Signed-off-by: AngeloGioacchino Del Regno 

---
  drivers/iommu/mtk_iommu.c | 23 +--
  1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 90685946fcbe..0ea0848581e9 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -138,6 +138,8 @@
  /* PM and clock always on. e.g. infra iommu */
  #define PM_CLK_AO    BIT(15)
  #define IFA_IOMMU_PCIE_SUPPORT    BIT(16)
+/* IOMMU I/O (r/w) is enabled using PERICFG_IOMMU_1 register */
+#define HAS_PERI_IOMMU1_REG    BIT(17)


 From what I can see MTK_IOMMU_TYPE_INFRA is only set in MT8195 which uses pericfg. 
So we don't need a new flag here. For me the flag name MTK_IOMMU_TYPE_INFRA was 
confusing as it has nothing to do with the use of infracfg. I'll hijack this patch 
to provide some feedback on the actual code, please see below.



  #define MTK_IOMMU_HAS_FLAG_MASK(pdata, _x, mask)    \
  pdata)->flags) & (mask)) == (_x))
@@ -187,7 +189,6 @@ struct mtk_iommu_plat_data {
  u32    flags;
  u32    inv_sel_reg;
-    char    *pericfg_comp_str;
  struct list_head    *hw_list;
  unsigned int    iova_region_nr;
  const struct mtk_iommu_iova_region    *iova_region;
@@ -1218,14 +1219,16 @@ static int mtk_iommu_probe(struct platform_device *pdev)
  goto out_runtime_disable;
  }
  } else if (MTK_IOMMU_IS_TYPE(data->plat_data, MTK_IOMMU_TYPE_INFRA) &&
-   data->plat_data->pericfg_comp_str) {


Check for pericfg_comp_str is not needed, we only have one platform that uses 
MTK_IOMMU_TYPE_INFRA.




Fair enough. I agree.

-    infracfg = 
syscon_regmap_lookup_by_compatible(data->plat_data->pericfg_comp_str);


We can do something like this to make the code clearer:
data->pericfg = 
syscon_regmap_lookup_by_compatible(data->plat_data->pericfg_comp_str);
     if (IS_ERR(data->pericfg)) {

Using infracfg variable here is confusing as it has nothing to do with infracfg 
used with HAS_4GB_MODE flag.


Yes Matthias, using the infracfg variable is confusing - that's why I changed 
that
already



Regards,
Matthias


-    if (IS_ERR(infracfg)) {
-    ret = PTR_ERR(infracfg);
-    goto out_runtime_disable;
+   MTK_IOMMU_HAS_FLAG(data->plat_data, HAS_PERI_IOMMU1_REG)) {




+    data->pericfg = syscon_regmap_lookup_by_phandle(dev->of_node, 
"mediatek,pericfg");


Here, where I'm assigning directly to data->pericfg :-P

By the way, since it was only about one platform, my intention was to remove the
pericfg_comp_str from struct iommu_plat_data (as you can see), but then, with 
the
current code, I had to assign .



+    if (IS_ERR(data->pericfg)) {
+    p = "mediatek,mt8195-pericfg_ao";


...the string to 'p', because otherwise it would go over 100 columns.

In any case, I just checked and, apparently, MT8195 is really the one and only 
SoC
that needs this pericfg register to be managed by Linux... even the latest and
greatest smartphone chip (Dimensity 9000, MT6983) doesn't need this (at least,
from what I can read on a downstream kernel).

On an afterthought, perhaps the best idea is to just leave this as it is and, as
you proposed, avoid using that confusing infracfg variable, without adding the
pericfg handle at all.

After all, it's just one single SoC.

I'll send a new version soon!

Cheers,
Angelo

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

helping with remapping vmem for dma

2022-06-15 Thread Frank Wunderlich

Hi,

i have upported a wifi-driver (mt6625l for armhf) for some time and fall now 
(at least 5.18) in the
"rejecting DMA map of vmalloc memory" error [1].

maybe anybody here can guide me on how to nail it down and maybe fix it.

as far as i have debugged it, it uses dma_map_single [2] to get dma memory from 
a previous
allocated memory region.

this function "kalDevPortRead" in [2] is used via macro HAL_PORT_RD [3] (used 
in HAL_READ_RX_PORT
and HAL_READ_INTR_STATUS in same hal.h file)

HAL_READ_INTR_STATUS is always called with an empty int array as buf which i 
guess is not the problem.
I think the issue is using the use with an preallocated prSDIOCtrl struct (have 
not completely traced
it back where it is allocated).

calls of HAL_PORT_RD/HAL_READ_RX_PORT are in nic{,_rx}.c (with sdio-struct) 
([4] as example)

maybe there is a simple way to get an address in preallocated memory as 
replacement for the dma_map_simple call (and the unmap of course).

regards Frank

[1] 
https://elixir.bootlin.com/linux/latest/source/include/linux/dma-mapping.h#L327
[2] 
https://github.com/frank-w/BPI-R2-4.14/blob/5.18-main/drivers/misc/mediatek/connectivity/wlan/gen2/os/linux/hif/ahb/ahb.c#L940
[3] 
https://github.com/frank-w/BPI-R2-4.14/blob/5.18-main/drivers/misc/mediatek/connectivity/wlan/gen2/include/nic/hal.h#L176
[4] 
https://github.com/frank-w/BPI-R2-4.14/blob/5.18-main/drivers/misc/mediatek/connectivity/wlan/gen2/nic/nic_rx.c#L3604
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v3 6/6] iommu: mtk_iommu: Lookup phandle to retrieve syscon to pericfg

2022-06-15 Thread Matthias Brugger





On 09/06/2022 12:08, AngeloGioacchino Del Regno wrote:

On some SoCs (of which only MT8195 is supported at the time of writing),
the "R" and "W" (I/O) enable bits for the IOMMUs are in the pericfg_ao
register space and not in the IOMMU space: as it happened already with
infracfg, it is expected that this list will grow.

Instead of specifying pericfg compatibles on a per-SoC basis, following
what was done with infracfg, let's lookup the syscon by phandle instead.

Signed-off-by: AngeloGioacchino Del Regno 

---
  drivers/iommu/mtk_iommu.c | 23 +--
  1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 90685946fcbe..0ea0848581e9 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -138,6 +138,8 @@
  /* PM and clock always on. e.g. infra iommu */
  #define PM_CLK_AO BIT(15)
  #define IFA_IOMMU_PCIE_SUPPORTBIT(16)
+/* IOMMU I/O (r/w) is enabled using PERICFG_IOMMU_1 register */
+#define HAS_PERI_IOMMU1_REGBIT(17)


From what I can see MTK_IOMMU_TYPE_INFRA is only set in MT8195 which uses 
pericfg. So we don't need a new flag here. For me the flag name 
MTK_IOMMU_TYPE_INFRA was confusing as it has nothing to do with the use of 
infracfg. I'll hijack this patch to provide some feedback on the actual code, 
please see below.


  
  #define MTK_IOMMU_HAS_FLAG_MASK(pdata, _x, mask)	\

pdata)->flags) & (mask)) == (_x))
@@ -187,7 +189,6 @@ struct mtk_iommu_plat_data {
u32 flags;
u32 inv_sel_reg;
  
-	char			*pericfg_comp_str;

struct list_head*hw_list;
unsigned intiova_region_nr;
const struct mtk_iommu_iova_region  *iova_region;
@@ -1218,14 +1219,16 @@ static int mtk_iommu_probe(struct platform_device *pdev)
goto out_runtime_disable;
}
} else if (MTK_IOMMU_IS_TYPE(data->plat_data, MTK_IOMMU_TYPE_INFRA) &&
-  data->plat_data->pericfg_comp_str) {


Check for pericfg_comp_str is not needed, we only have one platform that uses 
MTK_IOMMU_TYPE_INFRA.



-   infracfg = 
syscon_regmap_lookup_by_compatible(data->plat_data->pericfg_comp_str);


We can do something like this to make the code clearer:
data->pericfg = 
syscon_regmap_lookup_by_compatible(data->plat_data->pericfg_comp_str);

if (IS_ERR(data->pericfg)) {

Using infracfg variable here is confusing as it has nothing to do with infracfg 
used with HAS_4GB_MODE flag.


Regards,
Matthias


-   if (IS_ERR(infracfg)) {
-   ret = PTR_ERR(infracfg);
-   goto out_runtime_disable;
+  MTK_IOMMU_HAS_FLAG(data->plat_data, HAS_PERI_IOMMU1_REG)) {
+   data->pericfg = syscon_regmap_lookup_by_phandle(dev->of_node, 
"mediatek,pericfg");
+   if (IS_ERR(data->pericfg)) {
+   p = "mediatek,mt8195-pericfg_ao";
+   data->pericfg = syscon_regmap_lookup_by_compatible(p);
+   if (IS_ERR(data->pericfg)) {
+   ret = PTR_ERR(data->pericfg);
+   goto out_runtime_disable;
+   }
}
-
-   data->pericfg = infracfg;
}
  
  	platform_set_drvdata(pdev, data);

@@ -1484,8 +1487,8 @@ static const struct mtk_iommu_plat_data mt8192_data = {
  static const struct mtk_iommu_plat_data mt8195_data_infra = {
.m4u_plat = M4U_MT8195,
.flags= WR_THROT_EN | DCM_DISABLE | STD_AXI_MODE | 
PM_CLK_AO |
-   MTK_IOMMU_TYPE_INFRA | IFA_IOMMU_PCIE_SUPPORT,
-   .pericfg_comp_str = "mediatek,mt8195-pericfg_ao",
+   HAS_PERI_IOMMU1_REG | MTK_IOMMU_TYPE_INFRA |
+   IFA_IOMMU_PCIE_SUPPORT,
.inv_sel_reg  = REG_MMU_INV_SEL_GEN2,
.banks_num= 5,
.banks_enable = {true, false, false, false, true},

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RESEND PATCH v8 01/11] iommu: Add DMA ownership management interfaces

2022-06-15 Thread Robin Murphy


On 2022-06-15 10:53, Steven Price wrote:

On 18/04/2022 01:49, Lu Baolu wrote:

Multiple devices may be placed in the same IOMMU group because they
cannot be isolated from each other. These devices must either be
entirely under kernel control or userspace control, never a mixture.

This adds dma ownership management in iommu core and exposes several
interfaces for the device drivers and the device userspace assignment
framework (i.e. VFIO), so that any conflict between user and kernel
controlled dma could be detected at the beginning.

The device driver oriented interfaces are,

int iommu_device_use_default_domain(struct device *dev);
void iommu_device_unuse_default_domain(struct device *dev);

By calling iommu_device_use_default_domain(), the device driver tells
the iommu layer that the device dma is handled through the kernel DMA
APIs. The iommu layer will manage the IOVA and use the default domain
for DMA address translation.

The device user-space assignment framework oriented interfaces are,

int iommu_group_claim_dma_owner(struct iommu_group *group,
void *owner);
void iommu_group_release_dma_owner(struct iommu_group *group);
bool iommu_group_dma_owner_claimed(struct iommu_group *group);

The device userspace assignment must be disallowed if the DMA owner
claiming interface returns failure.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Kevin Tian 
Signed-off-by: Lu Baolu 
Reviewed-by: Robin Murphy 


I'm seeing a regression that I've bisected to this commit on a Firefly
RK3288 board. The display driver fails to probe properly because
__iommu_attach_group() returns -EBUSY. This causes long hangs and splats
as the display flips timeout.

The call stack to __iommu_attach_group() is:

  __iommu_attach_group from iommu_attach_device+0x64/0xb4
  iommu_attach_device from rockchip_drm_dma_attach_device+0x20/0x50
  rockchip_drm_dma_attach_device from vop_crtc_atomic_enable+0x10c/0xa64
  vop_crtc_atomic_enable from 
drm_atomic_helper_commit_modeset_enables+0xa8/0x290
  drm_atomic_helper_commit_modeset_enables from 
drm_atomic_helper_commit_tail_rpm+0x44/0x8c
  drm_atomic_helper_commit_tail_rpm from commit_tail+0x9c/0x180
  commit_tail from drm_atomic_helper_commit+0x164/0x18c
  drm_atomic_helper_commit from drm_atomic_commit+0xac/0xe4
  drm_atomic_commit from drm_client_modeset_commit_atomic+0x23c/0x284
  drm_client_modeset_commit_atomic from 
drm_client_modeset_commit_locked+0x60/0x1c8
  drm_client_modeset_commit_locked from drm_client_modeset_commit+0x24/0x40
  drm_client_modeset_commit from drm_fb_helper_set_par+0xb8/0xf8
  drm_fb_helper_set_par from drm_fb_helper_hotplug_event.part.0+0xa8/0xc0
  drm_fb_helper_hotplug_event.part.0 from output_poll_execute+0xb8/0x224


@@ -2109,7 +2115,7 @@ static int __iommu_attach_group(struct iommu_domain 
*domain,
  {
int ret;
  
-	if (group->default_domain && group->domain != group->default_domain)

+   if (group->domain && group->domain != group->default_domain)
return -EBUSY;
  
  	ret = __iommu_group_for_each_dev(group, domain,


Reverting this 'fixes' the problem for me. The follow up 0286300e6045
("iommu: iommu_group_claim_dma_owner() must always assign a domain")
doesn't help.

Adding some debug printks I can see that domain is a valid pointer, but
both default_domain and blocking_domain are NULL.

I'm using the DTB from the kernel tree (rk3288-firefly.dtb).

Any ideas?


Hmm, TBH I'm not sure how that worked previously... it'll be complaining 
because the ARM DMA domain is still attached, but even when the attach 
goes ahead and replaces the ARM domain with the driver's new one, it's 
not using the special arm_iommu_detach_device() interface anywhere so 
the device would still be left with the wrong DMA ops :/


I guess the most pragmatic option is probably to give rockchip-drm a 
similar bodge to exynos and tegra, to explicitly remove the ARM domain 
before attaching its own.


Thanks,
Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 1/2] iommu: arm-smmu-impl: Add 8250 display compatible to the client list.

2022-06-15 Thread Dmitry Baryshkov

On Wed, 15 Jun 2022 at 02:01, Emma Anholt  wrote:
>
> Required for turning on per-process page tables for the GPU.
>
> Signed-off-by: Emma Anholt 

Reviewed-by: Dmitry Baryshkov 

> ---
>
>  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c 
> b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index d8e1ef83c01b..bb9220937068 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -233,6 +233,7 @@ static const struct of_device_id 
> qcom_smmu_client_of_match[] __maybe_unused = {
> { .compatible = "qcom,sc7280-mdss" },
> { .compatible = "qcom,sc7280-mss-pil" },
> { .compatible = "qcom,sc8180x-mdss" },
> +   { .compatible = "qcom,sm8250-mdss" },
> { .compatible = "qcom,sdm845-mdss" },
> { .compatible = "qcom,sdm845-mss-pil" },
> { }
> --
> 2.36.1
>


-- 
With best wishes
Dmitry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 2/2] arm64: dts: qcom: sm8250: Enable per-process page tables.

2022-06-15 Thread Dmitry Baryshkov

On Wed, 15 Jun 2022 at 02:01, Emma Anholt  wrote:
>
> This is an SMMU for the adreno gpu, and adding this compatible lets
> the driver use per-fd page tables, which are required for security
> between GPU clients.
>
> Signed-off-by: Emma Anholt 
> ---
>
> Tested with a full deqp-vk run on RB5, which did involve some iommu faults.
>
>  arch/arm64/boot/dts/qcom/sm8250.dtsi | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/sm8250.dtsi 
> b/arch/arm64/boot/dts/qcom/sm8250.dtsi
> index a92230bec1dd..483c0e0f1d1a 100644
> --- a/arch/arm64/boot/dts/qcom/sm8250.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sm8250.dtsi
> @@ -2513,7 +2513,7 @@ gpucc: clock-controller@3d9 {
> };
>
> adreno_smmu: iommu@3da {
> -   compatible = "qcom,sm8250-smmu-500", "arm,mmu-500";
> +   compatible = "qcom,sm8250-smmu-500", "arm,mmu-500", 
> "qcom,adreno-smmu";

I see that other dtsi files use a bit different order for the
compatibility strings. They put "qcom,adreno-smmu" before
"arm,mmu-500". Can we please follow them?

With that fixed:
Reviewed-by: Dmitry Baryshkov 

> reg = <0 0x03da 0 0x1>;
> #iommu-cells = <2>;
> #global-interrupts = <2>;
> --
> 2.36.1
>


-- 
With best wishes
Dmitry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v13 9/9] iommu/arm-smmu: Get associated RMR info and install bypass SMR

2022-06-15 Thread Shameer Kolothum via iommu

From: Jon Nettleton 

Check if there is any RMR info associated with the devices behind
the SMMU and if any, install bypass SMRs for them. This is to
keep any ongoing traffic associated with these devices alive
when we enable/reset SMMU during probe().

Signed-off-by: Jon Nettleton 
Signed-off-by: Steven Price 
Tested-by: Steven Price 
Tested-by: Laurentiu Tudor 
Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 52 +++
 1 file changed, 52 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 2ed3594f384e..7ac4907235c3 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -2071,6 +2071,54 @@ err_reset_platform_ops: __maybe_unused;
return err;
 }
 
+static void arm_smmu_rmr_install_bypass_smr(struct arm_smmu_device *smmu)
+{
+   struct list_head rmr_list;
+   struct iommu_resv_region *e;
+   int idx, cnt = 0;
+   u32 reg;
+
+   INIT_LIST_HEAD(&rmr_list);
+   iort_get_rmr_sids(dev_fwnode(smmu->dev), &rmr_list);
+
+   /*
+* Rather than trying to look at existing mappings that
+* are setup by the firmware and then invalidate the ones
+* that do no have matching RMR entries, just disable the
+* SMMU until it gets enabled again in the reset routine.
+*/
+   reg = arm_smmu_gr0_read(smmu, ARM_SMMU_GR0_sCR0);
+   reg |= ARM_SMMU_sCR0_CLIENTPD;
+   arm_smmu_gr0_write(smmu, ARM_SMMU_GR0_sCR0, reg);
+
+   list_for_each_entry(e, &rmr_list, list) {
+   struct iommu_iort_rmr_data *rmr;
+   int i;
+
+   rmr = container_of(e, struct iommu_iort_rmr_data, rr);
+   for (i = 0; i < rmr->num_sids; i++) {
+   idx = arm_smmu_find_sme(smmu, rmr->sids[i], ~0);
+   if (idx < 0)
+   continue;
+
+   if (smmu->s2crs[idx].count == 0) {
+   smmu->smrs[idx].id = rmr->sids[i];
+   smmu->smrs[idx].mask = 0;
+   smmu->smrs[idx].valid = true;
+   }
+   smmu->s2crs[idx].count++;
+   smmu->s2crs[idx].type = S2CR_TYPE_BYPASS;
+   smmu->s2crs[idx].privcfg = S2CR_PRIVCFG_DEFAULT;
+
+   cnt++;
+   }
+   }
+
+   dev_notice(smmu->dev, "\tpreserved %d boot mapping%s\n", cnt,
+  cnt == 1 ? "" : "s");
+   iort_put_rmr_sids(dev_fwnode(smmu->dev), &rmr_list);
+}
+
 static int arm_smmu_device_probe(struct platform_device *pdev)
 {
struct resource *res;
@@ -2191,6 +2239,10 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
}
 
platform_set_drvdata(pdev, smmu);
+
+   /* Check for RMRs and install bypass SMRs if any */
+   arm_smmu_rmr_install_bypass_smr(smmu);
+
arm_smmu_device_reset(smmu);
arm_smmu_test_smr_masks(smmu);
 
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v13 8/9] iommu/arm-smmu-v3: Get associated RMR info and install bypass STE

2022-06-15 Thread Shameer Kolothum via iommu

Check if there is any RMR info associated with the devices behind
the SMMUv3 and if any, install bypass STEs for them. This is to
keep any ongoing traffic associated with these devices alive
when we enable/reset SMMUv3 during probe().

Tested-by: Hanjun Guo 
Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 09723861a08a..448e7b7ce0f2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3754,6 +3754,36 @@ static void __iomem *arm_smmu_ioremap(struct device 
*dev, resource_size_t start,
return devm_ioremap_resource(dev, &res);
 }
 
+static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu)
+{
+   struct list_head rmr_list;
+   struct iommu_resv_region *e;
+
+   INIT_LIST_HEAD(&rmr_list);
+   iort_get_rmr_sids(dev_fwnode(smmu->dev), &rmr_list);
+
+   list_for_each_entry(e, &rmr_list, list) {
+   __le64 *step;
+   struct iommu_iort_rmr_data *rmr;
+   int ret, i;
+
+   rmr = container_of(e, struct iommu_iort_rmr_data, rr);
+   for (i = 0; i < rmr->num_sids; i++) {
+   ret = arm_smmu_init_sid_strtab(smmu, rmr->sids[i]);
+   if (ret) {
+   dev_err(smmu->dev, "RMR SID(0x%x) bypass 
failed\n",
+   rmr->sids[i]);
+   continue;
+   }
+
+   step = arm_smmu_get_step_for_sid(smmu, rmr->sids[i]);
+   arm_smmu_init_bypass_stes(step, 1, true);
+   }
+   }
+
+   iort_put_rmr_sids(dev_fwnode(smmu->dev), &rmr_list);
+}
+
 static int arm_smmu_device_probe(struct platform_device *pdev)
 {
int irq, ret;
@@ -3837,6 +3867,9 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
/* Record our private device structure */
platform_set_drvdata(pdev, smmu);
 
+   /* Check for RMRs and install bypass STEs if any */
+   arm_smmu_rmr_install_bypass_ste(smmu);
+
/* Reset the device */
ret = arm_smmu_device_reset(smmu, bypass);
if (ret)
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v13 7/9] iommu/arm-smmu-v3: Refactor arm_smmu_init_bypass_stes() to force bypass

2022-06-15 Thread Shameer Kolothum via iommu

By default, disable_bypass flag is set and any dev without
an iommu domain installs STE with CFG_ABORT during
arm_smmu_init_bypass_stes(). Introduce a "force" flag and
move the STE update logic to arm_smmu_init_bypass_stes()
so that we can force it to install CFG_BYPASS STE for specific
SIDs.

This will be useful in a follow-up patch to install bypass
for IORT RMR SIDs.

Tested-by: Hanjun Guo 
Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 17d4f3432df2..09723861a08a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1380,12 +1380,21 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_master *master, u32 sid,
arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
 }
 
-static void arm_smmu_init_bypass_stes(__le64 *strtab, unsigned int nent)
+static void arm_smmu_init_bypass_stes(__le64 *strtab, unsigned int nent, bool 
force)
 {
unsigned int i;
+   u64 val = STRTAB_STE_0_V;
+
+   if (disable_bypass && !force)
+   val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
+   else
+   val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
 
for (i = 0; i < nent; ++i) {
-   arm_smmu_write_strtab_ent(NULL, -1, strtab);
+   strtab[0] = cpu_to_le64(val);
+   strtab[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
+  
STRTAB_STE_1_SHCFG_INCOMING));
+   strtab[2] = 0;
strtab += STRTAB_STE_DWORDS;
}
 }
@@ -1413,7 +1422,7 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device 
*smmu, u32 sid)
return -ENOMEM;
}
 
-   arm_smmu_init_bypass_stes(desc->l2ptr, 1 << STRTAB_SPLIT);
+   arm_smmu_init_bypass_stes(desc->l2ptr, 1 << STRTAB_SPLIT, false);
arm_smmu_write_strtab_l1_desc(strtab, desc);
return 0;
 }
@@ -3051,7 +3060,7 @@ static int arm_smmu_init_strtab_linear(struct 
arm_smmu_device *smmu)
reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
cfg->strtab_base_cfg = reg;
 
-   arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
+   arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents, false);
return 0;
 }
 
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v13 5/9] ACPI/IORT: Add a helper to retrieve RMR info directly

2022-06-15 Thread Shameer Kolothum via iommu

This will provide a way for SMMU drivers to retrieve StreamIDs
associated with IORT RMR nodes and use that to set bypass settings
for those IDs.

Tested-by: Steven Price 
Tested-by: Laurentiu Tudor 
Tested-by: Hanjun Guo 
Reviewed-by: Hanjun Guo 
Signed-off-by: Shameer Kolothum 
---
 drivers/acpi/arm64/iort.c | 28 
 include/linux/acpi_iort.h |  8 
 2 files changed, 36 insertions(+)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index b6273af316c6..cd1349d3544e 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1394,6 +1394,34 @@ int iort_dma_get_ranges(struct device *dev, u64 *size)
return nc_dma_get_range(dev, size);
 }
 
+/**
+ * iort_get_rmr_sids - Retrieve IORT RMR node reserved regions with
+ * associated StreamIDs information.
+ * @iommu_fwnode: fwnode associated with IOMMU
+ * @head: Resereved region list
+ */
+void iort_get_rmr_sids(struct fwnode_handle *iommu_fwnode,
+  struct list_head *head)
+{
+   iort_iommu_rmr_get_resv_regions(iommu_fwnode, NULL, head);
+}
+EXPORT_SYMBOL_GPL(iort_get_rmr_sids);
+
+/**
+ * iort_put_rmr_sids - Free memory allocated for RMR reserved regions.
+ * @iommu_fwnode: fwnode associated with IOMMU
+ * @head: Resereved region list
+ */
+void iort_put_rmr_sids(struct fwnode_handle *iommu_fwnode,
+  struct list_head *head)
+{
+   struct iommu_resv_region *entry, *next;
+
+   list_for_each_entry_safe(entry, next, head, list)
+   entry->free(NULL, entry);
+}
+EXPORT_SYMBOL_GPL(iort_put_rmr_sids);
+
 static void __init acpi_iort_register_irq(int hwirq, const char *name,
  int trigger,
  struct resource *res)
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index e5d2de9caf7f..b43be0987b19 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -33,6 +33,10 @@ struct irq_domain *iort_get_device_domain(struct device 
*dev, u32 id,
  enum irq_domain_bus_token bus_token);
 void acpi_configure_pmsi_domain(struct device *dev);
 int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
+void iort_get_rmr_sids(struct fwnode_handle *iommu_fwnode,
+  struct list_head *head);
+void iort_put_rmr_sids(struct fwnode_handle *iommu_fwnode,
+  struct list_head *head);
 /* IOMMU interface */
 int iort_dma_get_ranges(struct device *dev, u64 *size);
 int iort_iommu_configure_id(struct device *dev, const u32 *id_in);
@@ -46,6 +50,10 @@ static inline struct irq_domain *iort_get_device_domain(
struct device *dev, u32 id, enum irq_domain_bus_token bus_token)
 { return NULL; }
 static inline void acpi_configure_pmsi_domain(struct device *dev) { }
+static inline
+void iort_get_rmr_sids(struct fwnode_handle *iommu_fwnode, struct list_head 
*head) { }
+static inline
+void iort_put_rmr_sids(struct fwnode_handle *iommu_fwnode, struct list_head 
*head) { }
 /* IOMMU interface */
 static inline int iort_dma_get_ranges(struct device *dev, u64 *size)
 { return -ENODEV; }
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v13 6/9] iommu/arm-smmu-v3: Introduce strtab init helper

2022-06-15 Thread Shameer Kolothum via iommu

Introduce a helper to check the sid range and to init the l2 strtab
entries(bypass). This will be useful when we have to initialize the
l2 strtab with bypass for RMR SIDs.

Tested-by: Hanjun Guo 
Acked-by: Will Deacon 
Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 28 +++--
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 88817a3376ef..17d4f3432df2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2537,6 +2537,19 @@ static bool arm_smmu_sid_in_range(struct arm_smmu_device 
*smmu, u32 sid)
return sid < limit;
 }
 
+static int arm_smmu_init_sid_strtab(struct arm_smmu_device *smmu, u32 sid)
+{
+   /* Check the SIDs are in range of the SMMU and our stream table */
+   if (!arm_smmu_sid_in_range(smmu, sid))
+   return -ERANGE;
+
+   /* Ensure l2 strtab is initialised */
+   if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB)
+   return arm_smmu_init_l2_strtab(smmu, sid);
+
+   return 0;
+}
+
 static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
  struct arm_smmu_master *master)
 {
@@ -2560,20 +2573,9 @@ static int arm_smmu_insert_master(struct arm_smmu_device 
*smmu,
new_stream->id = sid;
new_stream->master = master;
 
-   /*
-* Check the SIDs are in range of the SMMU and our stream table
-*/
-   if (!arm_smmu_sid_in_range(smmu, sid)) {
-   ret = -ERANGE;
+   ret = arm_smmu_init_sid_strtab(smmu, sid);
+   if (ret)
break;
-   }
-
-   /* Ensure l2 strtab is initialised */
-   if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
-   ret = arm_smmu_init_l2_strtab(smmu, sid);
-   if (ret)
-   break;
-   }
 
/* Insert into SID tree */
new_node = &(smmu->streams.rb_node);
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v13 4/9] ACPI/IORT: Add support to retrieve IORT RMR reserved regions

2022-06-15 Thread Shameer Kolothum via iommu

Parse through the IORT RMR nodes and populate the reserve region list
corresponding to a given IOMMU and device(optional). Also, go through
the ID mappings of the RMR node and retrieve all the SIDs associated
with it.

Reviewed-by: Lorenzo Pieralisi 
Tested-by: Steven Price 
Tested-by: Laurentiu Tudor 
Tested-by: Hanjun Guo 
Reviewed-by: Hanjun Guo 
Signed-off-by: Shameer Kolothum 
---
 drivers/acpi/arm64/iort.c | 291 ++
 include/linux/iommu.h |   8 ++
 2 files changed, 299 insertions(+)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index cd5d1d7823cb..b6273af316c6 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -788,6 +788,294 @@ void acpi_configure_pmsi_domain(struct device *dev)
 }
 
 #ifdef CONFIG_IOMMU_API
+static void iort_rmr_free(struct device *dev,
+ struct iommu_resv_region *region)
+{
+   struct iommu_iort_rmr_data *rmr_data;
+
+   rmr_data = container_of(region, struct iommu_iort_rmr_data, rr);
+   kfree(rmr_data->sids);
+   kfree(rmr_data);
+}
+
+static struct iommu_iort_rmr_data *iort_rmr_alloc(
+   struct acpi_iort_rmr_desc *rmr_desc,
+   int prot, enum iommu_resv_type type,
+   u32 *sids, u32 num_sids)
+{
+   struct iommu_iort_rmr_data *rmr_data;
+   struct iommu_resv_region *region;
+   u32 *sids_copy;
+   u64 addr = rmr_desc->base_address, size = rmr_desc->length;
+
+   rmr_data = kmalloc(sizeof(*rmr_data), GFP_KERNEL);
+   if (!rmr_data)
+   return NULL;
+
+   /* Create a copy of SIDs array to associate with this rmr_data */
+   sids_copy = kmemdup(sids, num_sids * sizeof(*sids), GFP_KERNEL);
+   if (!sids_copy) {
+   kfree(rmr_data);
+   return NULL;
+   }
+   rmr_data->sids = sids_copy;
+   rmr_data->num_sids = num_sids;
+
+   if (!IS_ALIGNED(addr, SZ_64K) || !IS_ALIGNED(size, SZ_64K)) {
+   /* PAGE align base addr and size */
+   addr &= PAGE_MASK;
+   size = PAGE_ALIGN(size + 
offset_in_page(rmr_desc->base_address));
+
+   pr_err(FW_BUG "RMR descriptor[0x%llx - 0x%llx] not aligned to 
64K, continue with [0x%llx - 0x%llx]\n",
+  rmr_desc->base_address,
+  rmr_desc->base_address + rmr_desc->length - 1,
+  addr, addr + size - 1);
+   }
+
+   region = &rmr_data->rr;
+   INIT_LIST_HEAD(®ion->list);
+   region->start = addr;
+   region->length = size;
+   region->prot = prot;
+   region->type = type;
+   region->free = iort_rmr_free;
+
+   return rmr_data;
+}
+
+static void iort_rmr_desc_check_overlap(struct acpi_iort_rmr_desc *desc,
+   u32 count)
+{
+   int i, j;
+
+   for (i = 0; i < count; i++) {
+   u64 end, start = desc[i].base_address, length = desc[i].length;
+
+   if (!length) {
+   pr_err(FW_BUG "RMR descriptor[0x%llx] with zero length, 
continue anyway\n",
+  start);
+   continue;
+   }
+
+   end = start + length - 1;
+
+   /* Check for address overlap */
+   for (j = i + 1; j < count; j++) {
+   u64 e_start = desc[j].base_address;
+   u64 e_end = e_start + desc[j].length - 1;
+
+   if (start <= e_end && end >= e_start)
+   pr_err(FW_BUG "RMR descriptor[0x%llx - 0x%llx] 
overlaps, continue anyway\n",
+  start, end);
+   }
+   }
+}
+
+/*
+ * Please note, we will keep the already allocated RMR reserve
+ * regions in case of a memory allocation failure.
+ */
+static void iort_get_rmrs(struct acpi_iort_node *node,
+ struct acpi_iort_node *smmu,
+ u32 *sids, u32 num_sids,
+ struct list_head *head)
+{
+   struct acpi_iort_rmr *rmr = (struct acpi_iort_rmr *)node->node_data;
+   struct acpi_iort_rmr_desc *rmr_desc;
+   int i;
+
+   rmr_desc = ACPI_ADD_PTR(struct acpi_iort_rmr_desc, node,
+   rmr->rmr_offset);
+
+   iort_rmr_desc_check_overlap(rmr_desc, rmr->rmr_count);
+
+   for (i = 0; i < rmr->rmr_count; i++, rmr_desc++) {
+   struct iommu_iort_rmr_data *rmr_data;
+   enum iommu_resv_type type;
+   int prot = IOMMU_READ | IOMMU_WRITE;
+
+   if (rmr->flags & ACPI_IORT_RMR_REMAP_PERMITTED)
+   type = IOMMU_RESV_DIRECT_RELAXABLE;
+   else
+   type = IOMMU_RESV_DIRECT;
+
+   if (rmr->flags & ACPI_IORT_RMR_ACCESS_PRIVILEGE)
+   prot |= IOMMU_PRIV;
+
+

[PATCH v13 3/9] ACPI/IORT: Provide a generic helper to retrieve reserve regions

2022-06-15 Thread Shameer Kolothum via iommu

Currently IORT provides a helper to retrieve HW MSI reserve regions.
Change this to a generic helper to retrieve any IORT related reserve
regions. This will be useful when we add support for RMR nodes in
subsequent patches.

[Lorenzo: For ACPI IORT]
Reviewed-by: Lorenzo Pieralisi 
Reviewed-by: Christoph Hellwig 
Tested-by: Steven Price 
Tested-by: Laurentiu Tudor 
Tested-by: Hanjun Guo 
Reviewed-by: Hanjun Guo 
Signed-off-by: Shameer Kolothum 
---
 drivers/acpi/arm64/iort.c | 22 +++---
 drivers/iommu/dma-iommu.c |  2 +-
 include/linux/acpi_iort.h |  4 ++--
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 213f61cae176..cd5d1d7823cb 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -806,15 +806,13 @@ static struct acpi_iort_node 
*iort_get_msi_resv_iommu(struct device *dev)
return NULL;
 }
 
-/**
- * iort_iommu_msi_get_resv_regions - Reserved region driver helper
- * @dev: Device from iommu_get_resv_regions()
- * @head: Reserved region list from iommu_get_resv_regions()
- *
+/*
+ * Retrieve platform specific HW MSI reserve regions.
  * The ITS interrupt translation spaces (ITS_base + SZ_64K, SZ_64K)
  * associated with the device are the HW MSI reserved regions.
  */
-void iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head)
+static void iort_iommu_msi_get_resv_regions(struct device *dev,
+   struct list_head *head)
 {
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct acpi_iort_its_group *its;
@@ -863,6 +861,16 @@ void iort_iommu_msi_get_resv_regions(struct device *dev, 
struct list_head *head)
}
 }
 
+/**
+ * iort_iommu_get_resv_regions - Generic helper to retrieve reserved regions.
+ * @dev: Device from iommu_get_resv_regions()
+ * @head: Reserved region list from iommu_get_resv_regions()
+ */
+void iort_iommu_get_resv_regions(struct device *dev, struct list_head *head)
+{
+   iort_iommu_msi_get_resv_regions(dev, head);
+}
+
 static inline bool iort_iommu_driver_enabled(u8 type)
 {
switch (type) {
@@ -1027,7 +1035,7 @@ int iort_iommu_configure_id(struct device *dev, const u32 
*id_in)
 }
 
 #else
-void iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head)
+void iort_iommu_get_resv_regions(struct device *dev, struct list_head *head)
 { }
 int iort_iommu_configure_id(struct device *dev, const u32 *input_id)
 { return -ENODEV; }
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..970a2e018684 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -385,7 +385,7 @@ void iommu_dma_get_resv_regions(struct device *dev, struct 
list_head *list)
 {
 
if (!is_of_node(dev_iommu_fwspec_get(dev)->iommu_fwnode))
-   iort_iommu_msi_get_resv_regions(dev, list);
+   iort_iommu_get_resv_regions(dev, list);
 
 }
 EXPORT_SYMBOL(iommu_dma_get_resv_regions);
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index a8198b83753d..e5d2de9caf7f 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -36,7 +36,7 @@ int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
 /* IOMMU interface */
 int iort_dma_get_ranges(struct device *dev, u64 *size);
 int iort_iommu_configure_id(struct device *dev, const u32 *id_in);
-void iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head);
+void iort_iommu_get_resv_regions(struct device *dev, struct list_head *head);
 phys_addr_t acpi_iort_dma_get_max_cpu_address(void);
 #else
 static inline void acpi_iort_init(void) { }
@@ -52,7 +52,7 @@ static inline int iort_dma_get_ranges(struct device *dev, u64 
*size)
 static inline int iort_iommu_configure_id(struct device *dev, const u32 *id_in)
 { return -ENODEV; }
 static inline
-void iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head)
+void iort_iommu_get_resv_regions(struct device *dev, struct list_head *head)
 { }
 
 static inline phys_addr_t acpi_iort_dma_get_max_cpu_address(void)
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v13 2/9] ACPI/IORT: Make iort_iommu_msi_get_resv_regions() return void

2022-06-15 Thread Shameer Kolothum via iommu

At present iort_iommu_msi_get_resv_regions() returns the number of
MSI reserved regions on success and there are no users for this.
The reserved region list will get populated anyway for platforms
that require the HW MSI region reservation. Hence, change the
function to return void instead.

Reviewed-by: Christoph Hellwig 
Tested-by: Steven Price 
Tested-by: Laurentiu Tudor 
Reviewed-by: Hanjun Guo 
Signed-off-by: Shameer Kolothum 
---
 drivers/acpi/arm64/iort.c | 25 +
 include/linux/acpi_iort.h |  6 +++---
 2 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index f2f8f05662de..213f61cae176 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -811,22 +811,19 @@ static struct acpi_iort_node 
*iort_get_msi_resv_iommu(struct device *dev)
  * @dev: Device from iommu_get_resv_regions()
  * @head: Reserved region list from iommu_get_resv_regions()
  *
- * Returns: Number of msi reserved regions on success (0 if platform
- *  doesn't require the reservation or no associated msi regions),
- *  appropriate error value otherwise. The ITS interrupt translation
- *  spaces (ITS_base + SZ_64K, SZ_64K) associated with the device
- *  are the msi reserved regions.
+ * The ITS interrupt translation spaces (ITS_base + SZ_64K, SZ_64K)
+ * associated with the device are the HW MSI reserved regions.
  */
-int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
+void iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head)
 {
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct acpi_iort_its_group *its;
struct acpi_iort_node *iommu_node, *its_node = NULL;
-   int i, resv = 0;
+   int i;
 
iommu_node = iort_get_msi_resv_iommu(dev);
if (!iommu_node)
-   return 0;
+   return;
 
/*
 * Current logic to reserve ITS regions relies on HW topologies
@@ -846,7 +843,7 @@ int iort_iommu_msi_get_resv_regions(struct device *dev, 
struct list_head *head)
}
 
if (!its_node)
-   return 0;
+   return;
 
/* Move to ITS specific data */
its = (struct acpi_iort_its_group *)its_node->node_data;
@@ -860,14 +857,10 @@ int iort_iommu_msi_get_resv_regions(struct device *dev, 
struct list_head *head)
 
region = iommu_alloc_resv_region(base + SZ_64K, SZ_64K,
 prot, IOMMU_RESV_MSI);
-   if (region) {
+   if (region)
list_add_tail(®ion->list, head);
-   resv++;
-   }
}
}
-
-   return (resv == its->its_count) ? resv : -ENODEV;
 }
 
 static inline bool iort_iommu_driver_enabled(u8 type)
@@ -1034,8 +1027,8 @@ int iort_iommu_configure_id(struct device *dev, const u32 
*id_in)
 }
 
 #else
-int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
-{ return 0; }
+void iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head)
+{ }
 int iort_iommu_configure_id(struct device *dev, const u32 *input_id)
 { return -ENODEV; }
 #endif
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index f1f0842a2cb2..a8198b83753d 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -36,7 +36,7 @@ int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
 /* IOMMU interface */
 int iort_dma_get_ranges(struct device *dev, u64 *size);
 int iort_iommu_configure_id(struct device *dev, const u32 *id_in);
-int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head);
+void iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head);
 phys_addr_t acpi_iort_dma_get_max_cpu_address(void);
 #else
 static inline void acpi_iort_init(void) { }
@@ -52,8 +52,8 @@ static inline int iort_dma_get_ranges(struct device *dev, u64 
*size)
 static inline int iort_iommu_configure_id(struct device *dev, const u32 *id_in)
 { return -ENODEV; }
 static inline
-int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
-{ return 0; }
+void iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head)
+{ }
 
 static inline phys_addr_t acpi_iort_dma_get_max_cpu_address(void)
 { return PHYS_ADDR_MAX; }
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v13 1/9] iommu: Introduce a callback to struct iommu_resv_region

2022-06-15 Thread Shameer Kolothum via iommu

A callback is introduced to struct iommu_resv_region to free memory
allocations associated with the reserved region. This will be useful
when we introduce support for IORT RMR based reserved regions.

Reviewed-by: Christoph Hellwig 
Tested-by: Steven Price 
Tested-by: Laurentiu Tudor 
Tested-by: Hanjun Guo 
Signed-off-by: Shameer Kolothum 
---
 drivers/iommu/iommu.c | 16 +++-
 include/linux/iommu.h |  2 ++
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 847ad47a2dfd..298a8c060698 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2590,16 +2590,22 @@ void iommu_put_resv_regions(struct device *dev, struct 
list_head *list)
  * @list: reserved region list for device
  *
  * IOMMU drivers can use this to implement their .put_resv_regions() callback
- * for simple reservations. Memory allocated for each reserved region will be
- * freed. If an IOMMU driver allocates additional resources per region, it is
- * going to have to implement a custom callback.
+ * for simple reservations. If a per region callback is provided that will be
+ * used to free all memory allocations associated with the reserved region or
+ * else just free up the memory for the regions. If an IOMMU driver allocates
+ * additional resources per region, it is going to have to implement a custom
+ * callback.
  */
 void generic_iommu_put_resv_regions(struct device *dev, struct list_head *list)
 {
struct iommu_resv_region *entry, *next;
 
-   list_for_each_entry_safe(entry, next, list, list)
-   kfree(entry);
+   list_for_each_entry_safe(entry, next, list, list) {
+   if (entry->free)
+   entry->free(dev, entry);
+   else
+   kfree(entry);
+   }
 }
 EXPORT_SYMBOL(generic_iommu_put_resv_regions);
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 5e1afe169549..b22ffa6bc4a9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -135,6 +135,7 @@ enum iommu_resv_type {
  * @length: Length of the region in bytes
  * @prot: IOMMU Protection flags (READ/WRITE/...)
  * @type: Type of the reserved region
+ * @free: Callback to free associated memory allocations
  */
 struct iommu_resv_region {
struct list_headlist;
@@ -142,6 +143,7 @@ struct iommu_resv_region {
size_t  length;
int prot;
enum iommu_resv_typetype;
+   void (*free)(struct device *dev, struct iommu_resv_region *region);
 };
 
 /**
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v13 0/9] ACPI/IORT: Support for IORT RMR node

2022-06-15 Thread Shameer Kolothum via iommu

Hi

v12 --> v13
  -No changes. Rebased to 5.19-rc1.
  -Picked up tags received from Laurentiu, Hanjun and Will. Thanks!.

Thanks,
Shameer

From old:
We have faced issues with 3408iMR RAID controller cards which
fail to boot when SMMU is enabled. This is because these
controllers make use of host memory for various caching related
purposes and when SMMU is enabled the iMR firmware fails to
access these memory regions as there is no mapping for them.
IORT RMR provides a way for UEFI to describe and report these
memory regions so that the kernel can make a unity mapping for
these in SMMU.

Change History:

v11 --> v12
  -Minor fix in patch #4 to address the issue reported by the kernel test robot.
  -Added R-by tags by Christoph(patch #1) and Lorenzo(patch #4).
  -Added T-by from Steve to all relevant patches. Many thanks!.

v10 --> v11
 -Addressed Christoph's comments. We now have a  callback to 
  struct iommu_resv_region to free all related memory and also dropped
  the FW specific union and now has a container struct iommu_iort_rmr_data.
  See patches #1 & #4
 -Added R-by from Christoph.
 -Dropped R-by from Lorenzo for patches #4 & #5 due to the above changes.
 -Also dropped T-by from Steve and Laurentiu. Many thanks for your test
  efforts. I have done basic sanity testing on my platform but please
  do it again at your end.

v9 --> v10
 - Dropped patch #1 ("Add temporary RMR node flag definitions") since
   the ACPICA header updates patch is now in the mailing list
 - Based on the suggestion from Christoph, introduced a 
   resv_region_free_fw_data() callback in struct iommu_resv_region and
   used that to free RMR specific memory allocations.

v8 --> v9
 - Adressed comments from Robin on interfaces.
 - Addressed comments from Lorenzo.

v7 --> v8
  - Patch #1 has temp definitions for RMR related changes till
    the ACPICA header changes are part of kernel.
  - No early parsing of RMR node info and is only parsed at the
    time of use.
  - Changes to the RMR get/put API format compared to the
    previous version.
  - Support for RMR descriptor shared by multiple stream IDs.

v6 --> v7
 -fix pointed out by Steve to the SMMUv2 SMR bypass install in patch #8.

v5 --> v6
- Addressed comments from Robin & Lorenzo.
  : Moved iort_parse_rmr() to acpi_iort_init() from
    iort_init_platform_devices().
  : Removed use of struct iort_rmr_entry during the initial
    parse. Using struct iommu_resv_region instead.
  : Report RMR address alignment and overlap errors, but continue.
  : Reworked arm_smmu_init_bypass_stes() (patch # 6).
- Updated SMMUv2 bypass SMR code. Thanks to Jon N (patch #8).
- Set IOMMU protection flags(IOMMU_CACHE, IOMMU_MMIO) based
  on Type of RMR region. Suggested by Jon N.

v4 --> v5
 -Added a fw_data union to struct iommu_resv_region and removed
  struct iommu_rmr (Based on comments from Joerg/Robin).
 -Added iommu_put_rmrs() to release mem.
 -Thanks to Steve for verifying on SMMUv2, but not added the Tested-by
  yet because of the above changes.

v3 -->v4
-Included the SMMUv2 SMR bypass install changes suggested by
 Steve(patch #7)
-As per Robin's comments, RMR reserve implementation is now
 more generic  (patch #8) and dropped v3 patches 8 and 10.
-Rebase to 5.13-rc1

RFC v2 --> v3
 -Dropped RFC tag as the ACPICA header changes are now ready to be
  part of 5.13[0]. But this series still has a dependency on that patch.
 -Added IORT E.b related changes(node flags, _DSM function 5 checks for
  PCIe).
 -Changed RMR to stream id mapping from M:N to M:1 as per the spec and
  discussion here[1].
 -Last two patches add support for SMMUv2(Thanks to Jon Nettleton!)

Jon Nettleton (1):
  iommu/arm-smmu: Get associated RMR info and install bypass SMR

Shameer Kolothum (8):
  iommu: Introduce a callback to struct iommu_resv_region
  ACPI/IORT: Make iort_iommu_msi_get_resv_regions() return void
  ACPI/IORT: Provide a generic helper to retrieve reserve regions
  ACPI/IORT: Add support to retrieve IORT RMR reserved regions
  ACPI/IORT: Add a helper to retrieve RMR info directly
  iommu/arm-smmu-v3: Introduce strtab init helper
  iommu/arm-smmu-v3: Refactor arm_smmu_init_bypass_stes() to force
bypass
  iommu/arm-smmu-v3: Get associated RMR info and install bypass STE

 drivers/acpi/arm64/iort.c   | 360 ++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  78 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c   |  52 +++
 drivers/iommu/dma-iommu.c   |   2 +-
 drivers/iommu/iommu.c   |  16 +-
 include/linux/acpi_iort.h   |  14 +-
 include/linux/iommu.h   |  10 +
 7 files changed, 486 insertions(+), 46 deletions(-)

-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RESEND PATCH v8 01/11] iommu: Add DMA ownership management interfaces

2022-06-15 Thread Steven Price

On 18/04/2022 01:49, Lu Baolu wrote:
> Multiple devices may be placed in the same IOMMU group because they
> cannot be isolated from each other. These devices must either be
> entirely under kernel control or userspace control, never a mixture.
> 
> This adds dma ownership management in iommu core and exposes several
> interfaces for the device drivers and the device userspace assignment
> framework (i.e. VFIO), so that any conflict between user and kernel
> controlled dma could be detected at the beginning.
> 
> The device driver oriented interfaces are,
> 
>   int iommu_device_use_default_domain(struct device *dev);
>   void iommu_device_unuse_default_domain(struct device *dev);
> 
> By calling iommu_device_use_default_domain(), the device driver tells
> the iommu layer that the device dma is handled through the kernel DMA
> APIs. The iommu layer will manage the IOVA and use the default domain
> for DMA address translation.
> 
> The device user-space assignment framework oriented interfaces are,
> 
>   int iommu_group_claim_dma_owner(struct iommu_group *group,
>   void *owner);
>   void iommu_group_release_dma_owner(struct iommu_group *group);
>   bool iommu_group_dma_owner_claimed(struct iommu_group *group);
> 
> The device userspace assignment must be disallowed if the DMA owner
> claiming interface returns failure.
> 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Kevin Tian 
> Signed-off-by: Lu Baolu 
> Reviewed-by: Robin Murphy 

I'm seeing a regression that I've bisected to this commit on a Firefly
RK3288 board. The display driver fails to probe properly because
__iommu_attach_group() returns -EBUSY. This causes long hangs and splats
as the display flips timeout.

The call stack to __iommu_attach_group() is:

 __iommu_attach_group from iommu_attach_device+0x64/0xb4
 iommu_attach_device from rockchip_drm_dma_attach_device+0x20/0x50
 rockchip_drm_dma_attach_device from vop_crtc_atomic_enable+0x10c/0xa64
 vop_crtc_atomic_enable from drm_atomic_helper_commit_modeset_enables+0xa8/0x290
 drm_atomic_helper_commit_modeset_enables from 
drm_atomic_helper_commit_tail_rpm+0x44/0x8c
 drm_atomic_helper_commit_tail_rpm from commit_tail+0x9c/0x180
 commit_tail from drm_atomic_helper_commit+0x164/0x18c
 drm_atomic_helper_commit from drm_atomic_commit+0xac/0xe4
 drm_atomic_commit from drm_client_modeset_commit_atomic+0x23c/0x284
 drm_client_modeset_commit_atomic from 
drm_client_modeset_commit_locked+0x60/0x1c8
 drm_client_modeset_commit_locked from drm_client_modeset_commit+0x24/0x40
 drm_client_modeset_commit from drm_fb_helper_set_par+0xb8/0xf8
 drm_fb_helper_set_par from drm_fb_helper_hotplug_event.part.0+0xa8/0xc0
 drm_fb_helper_hotplug_event.part.0 from output_poll_execute+0xb8/0x224

> @@ -2109,7 +2115,7 @@ static int __iommu_attach_group(struct iommu_domain 
> *domain,
>  {
>   int ret;
>  
> - if (group->default_domain && group->domain != group->default_domain)
> + if (group->domain && group->domain != group->default_domain)
>   return -EBUSY;
>  
>   ret = __iommu_group_for_each_dev(group, domain,

Reverting this 'fixes' the problem for me. The follow up 0286300e6045
("iommu: iommu_group_claim_dma_owner() must always assign a domain")
doesn't help.

Adding some debug printks I can see that domain is a valid pointer, but
both default_domain and blocking_domain are NULL.

I'm using the DTB from the kernel tree (rk3288-firefly.dtb).

Any ideas?

Thanks,

Steve
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [PATCH 3/5] vfio/iommu_type1: Prefer to reuse domains vs match enforced cache coherency

2022-06-15 Thread Tian, Kevin

> From: Nicolin Chen 
> Sent: Wednesday, June 15, 2022 4:45 AM
> 
> Hi Kevin,
> 
> On Wed, Jun 08, 2022 at 11:48:27PM +, Tian, Kevin wrote:
> > > > > The KVM mechanism for controlling wbinvd is only triggered during
> > > > > kvm_vfio_group_add(), meaning it is a one-shot test done once the
> > > devices
> > > > > are setup.
> > > >
> > > > It's not one-shot. kvm_vfio_update_coherency() is called in both
> > > > group_add() and group_del(). Then the coherency property is
> > > > checked dynamically in wbinvd emulation:
> > >
> > > From the perspective of managing the domains that is still
> > > one-shot. It doesn't get updated when individual devices are
> > > added/removed to domains.
> >
> > It's unchanged per-domain but dynamic per-vm when multiple
> > domains are added/removed (i.e. kvm->arch.noncoherent_dma_count).
> > It's the latter being checked in the kvm.
> 
> I am going to send a v2, yet not quite getting the point here.
> Meanwhile, Jason is on leave.
> 
> What, in your opinion, would be an accurate description here?
> 

Something like below:
--
The KVM mechanism for controlling wbinvd is based on OR of
the coherency property of all devices attached to a guest, no matter
those devices  are attached to a single domain or multiple domains.

So, there is no value in trying to push a device that could do enforced
cache coherency to a dedicated domain vs re-using an existing domain
which is non-coherent since KVM won't be able to take advantage of it. 
This just wastes domain memory.

Simplify this code and eliminate the test. This removes the only logic
that needed to have a dummy domain attached prior to searching for a
matching domain and simplifies the next patches.

It's unclear whether we want to further optimize the Intel driver to
update the domain coherency after a device is detached from it, at
least not before KVM can be verified to handle such dynamics in related
emulation paths (wbinvd, vcpu load, write_cr0, ept, etc.). In reality
we don't see an usage requiring such optimization as the only device
which imposes such non-coherency is Intel GPU which even doesn't
support hotplug/hot remove.
--

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

92 matches

Mail list logo