Re: [PATCH] swiotlb: allocate memory in a cache-friendly way

2021-09-16 Thread Chao Gao
On Thu, Sep 16, 2021 at 11:49:39AM -0400, Konrad Rzeszutek Wilk wrote:
>On Wed, Sep 01, 2021 at 12:21:35PM +0800, Chao Gao wrote:
>> Currently, swiotlb uses a global index to indicate the starting point
>> of next search. The index increases from 0 to the number of slots - 1
>> and then wraps around. It is straightforward but not cache-friendly
>> because the "oldest" slot in swiotlb tends to be used first.
>> 
>> Freed slots are probably accessed right before being freed, especially
>> in VM's case (device backends access them in DMA_TO_DEVICE mode; guest
>> accesses them in other DMA modes). Thus those just-freed slots may
>> still reside in the cache, and reusing them can reduce cache misses.
>> 
>> To that end, maintain a free list of free slots: insert freed slots at
>> the head, and always start searching for free slots from the head.
>> 
>> With this optimization, network throughput of sending data from host to
>> guest, measured by iperf3, increases by 7%.
>
>Wow, that is pretty awesome!
>
>Are there any other benchmarks that you ran that showed a negative
>performance?

TBH, yes. Recently I ran fio tests with this patch. The impact of this patch
is as follows (+ means performance improvement; - means performance regression):

1-job fio:
randread: +6.7%
randwrite: -1.6%
read: +8.2%
write: +7.4%

8-job fio:
randread: -5.5%
randwrite: -12.6%
read: -24.8%
write: -45.5%

I haven't figured out why multi-job fio tests suffer. Will post v2 once
the issue gets resolved.

Thanks
Chao
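
A minimal sketch of the free-list idea described above (illustrative only;
the structure and helpers below are hypothetical, not the actual swiotlb
code, and locking is omitted). Freed slots are pushed at the head and the
next allocation pops from the head, so recently used, likely cache-hot
slots are reused first:

    #include <linux/list.h>

    struct io_tlb_slot {
        struct list_head node;          /* links this slot into the free list */
        /* ... per-slot bookkeeping ... */
    };

    static LIST_HEAD(free_slots);       /* head of the free list */

    static struct io_tlb_slot *alloc_slot(void)
    {
        struct io_tlb_slot *slot;

        if (list_empty(&free_slots))
            return NULL;
        /* most recently freed slot, probably still in the cache */
        slot = list_first_entry(&free_slots, struct io_tlb_slot, node);
        list_del(&slot->node);
        return slot;
    }

    static void free_slot(struct io_tlb_slot *slot)
    {
        /* insert at the head so the next allocation reuses it */
        list_add(&slot->node, &free_slots);
    }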


[PATCH v3 17/20] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()

2021-09-16 Thread Logan Gunthorpe
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_alloc_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables NVMe passthru requests to use
P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
---
 block/blk-map.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 4526adde0156..7508448e290c 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -234,6 +234,7 @@ static int bio_map_user_iov(struct request *rq, struct 
iov_iter *iter,
gfp_t gfp_mask)
 {
unsigned int max_sectors = queue_max_hw_sectors(rq->q);
+   unsigned int flags = 0;
struct bio *bio;
int ret;
int j;
@@ -246,13 +247,17 @@ static int bio_map_user_iov(struct request *rq, struct 
iov_iter *iter,
return -ENOMEM;
bio->bi_opf |= req_op(rq);
 
+   if (blk_queue_pci_p2pdma(rq->q))
+   flags |= FOLL_PCI_P2PDMA;
+
while (iov_iter_count(iter)) {
struct page **pages;
ssize_t bytes;
size_t offs, added = 0;
int npages;
 
-   bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+   bytes = iov_iter_get_pages_alloc_flags(iter, &pages, LONG_MAX,
+  &offs, flags);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
-- 
2.30.2



[PATCH v3 20/20] nvme-pci: allow mmaping the CMB in userspace

2021-09-16 Thread Logan Gunthorpe
Allow userspace to obtain CMB memory by mmaping the controller's
char device. The mmap call allocates and returns a hunk of CMB memory
(the offset is ignored), so userspace does not have control over the
address within the CMB.

A VMA allocated in this way will only be usable by drivers that set
FOLL_PCI_P2PDMA when calling GUP, and inter-device support will be
checked the first time the pages are mapped for DMA.

Currently this is only supported by O_DIRECT to a PCI NVMe device
or through the NVMe passthrough IOCTL.
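
Roughly how userspace is expected to use this (a sketch under the assumption
that the controller's char device is /dev/nvme0 and /mnt/nvme/file lives on
a namespace of a supporting controller; error handling trimmed):

    #define _GNU_SOURCE                 /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 2 * 1024 * 1024;
        int cfd = open("/dev/nvme0", O_RDWR);       /* controller char device */
        void *cmb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         cfd, 0);                   /* offset is ignored */
        int fd = open("/mnt/nvme/file", O_RDWR | O_DIRECT);

        if (cfd < 0 || cmb == MAP_FAILED || fd < 0)
            return 1;

        /* the CMB-backed buffer may only be used for O_DIRECT I/O to NVMe */
        if (pread(fd, cmb, len, 0) < 0)
            perror("pread");
        return 0;
    }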

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/host/core.c | 15 +++
 drivers/nvme/host/nvme.h |  2 ++
 drivers/nvme/host/pci.c  | 18 ++
 3 files changed, 35 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 916750a54f60..dfc18f0bdeee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3075,6 +3075,10 @@ static int nvme_dev_open(struct inode *inode, struct 
file *file)
}
 
file->private_data = ctrl;
+
+   if (ctrl->ops->mmap_file_open)
+   ctrl->ops->mmap_file_open(ctrl, file);
+
return 0;
 }
 
@@ -3088,12 +3092,23 @@ static int nvme_dev_release(struct inode *inode, struct 
file *file)
return 0;
 }
 
+static int nvme_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct nvme_ctrl *ctrl = file->private_data;
+
+   if (!ctrl->ops->mmap_cmb)
+   return -ENODEV;
+
+   return ctrl->ops->mmap_cmb(ctrl, vma);
+}
+
 static const struct file_operations nvme_dev_fops = {
.owner  = THIS_MODULE,
.open   = nvme_dev_open,
.release= nvme_dev_release,
.unlocked_ioctl = nvme_dev_ioctl,
.compat_ioctl   = compat_ptr_ioctl,
+   .mmap   = nvme_dev_mmap,
 };
 
 static ssize_t nvme_sysfs_reset(struct device *dev,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index fb9bfc52a6d7..1cc721290d4c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -485,6 +485,8 @@ struct nvme_ctrl_ops {
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
+   void (*mmap_file_open)(struct nvme_ctrl *ctrl, struct file *file);
+   int (*mmap_cmb)(struct nvme_ctrl *ctrl, struct vm_area_struct *vma);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e2cd73129a88..9d69e4a3d62e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2870,6 +2870,22 @@ static bool nvme_pci_supports_pci_p2pdma(struct 
nvme_ctrl *ctrl)
return dma_pci_p2pdma_supported(dev->dev);
 }
 
+static void nvme_pci_mmap_file_open(struct nvme_ctrl *ctrl,
+   struct file *file)
+{
+   struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+   pci_p2pdma_mmap_file_open(pdev, file);
+}
+
+static int nvme_pci_mmap_cmb(struct nvme_ctrl *ctrl,
+struct vm_area_struct *vma)
+{
+   struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+   return pci_mmap_p2pmem(pdev, vma);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
@@ -2881,6 +2897,8 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_address,
.supports_pci_p2pdma= nvme_pci_supports_pci_p2pdma,
+   .mmap_file_open = nvme_pci_mmap_file_open,
+   .mmap_cmb   = nvme_pci_mmap_cmb,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.30.2



[PATCH v3 18/20] mm: use custom page_free for P2PDMA pages

2021-09-16 Thread Logan Gunthorpe
When P2PDMA pages are passed to userspace, they will need to be
reference counted properly and returned to their genalloc after their
reference count returns to 1. This is accomplished with the existing
DEV_PAGEMAP_OPS and the .page_free() operation.

Change CONFIG_P2PDMA to select CONFIG_DEV_PAGEMAP_OPS and add
MEMORY_DEVICE_PCI_P2PDMA to page_is_devmap_managed(),
devmap_managed_enable_[put|get]() and free_devmap_managed_page().

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/Kconfig  |  1 +
 drivers/pci/p2pdma.c | 13 +
 include/linux/mm.h   |  1 +
 mm/memremap.c| 12 +---
 4 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 90b4bddb3300..b31d35259d3a 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -165,6 +165,7 @@ config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
depends on ZONE_DEVICE && 64BIT
select GENERIC_ALLOCATOR
+   select DEV_PAGEMAP_OPS
help
  Enableѕ drivers to do PCI peer-to-peer transactions to and from
  BARs that are exposed in other devices that are the part of
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4478633346bd..2422af5a529c 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -100,6 +100,18 @@ static const struct attribute_group p2pmem_group = {
.name = "p2pmem",
 };
 
+static void p2pdma_page_free(struct page *page)
+{
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+
+   gen_pool_free(pgmap->provider->p2pdma->pool,
+ (uintptr_t)page_to_virt(page), PAGE_SIZE);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+   .page_free = p2pdma_page_free,
+};
+
 static void pci_p2pdma_release(void *data)
 {
struct pci_dev *pdev = data;
@@ -197,6 +209,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, 
size_t size,
pgmap->range.end = pgmap->range.start + size - 1;
pgmap->nr_range = 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+   pgmap->ops = &p2pdma_pgmap_ops;
 
p2p_pgmap->provider = pdev;
p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6afdc09d0712..9a6ea00e5292 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1163,6 +1163,7 @@ static inline bool page_is_devmap_managed(struct page 
*page)
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_FS_DAX:
+   case MEMORY_DEVICE_PCI_P2PDMA:
return true;
default:
break;
diff --git a/mm/memremap.c b/mm/memremap.c
index ceebdb8a72bb..fbdc9991af0e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -44,14 +44,16 @@ EXPORT_SYMBOL(devmap_managed_key);
 static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 {
if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-   pgmap->type == MEMORY_DEVICE_FS_DAX)
+   pgmap->type == MEMORY_DEVICE_FS_DAX ||
+   pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
static_branch_dec(&devmap_managed_key);
 }
 
 static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-   pgmap->type == MEMORY_DEVICE_FS_DAX)
+   pgmap->type == MEMORY_DEVICE_FS_DAX ||
+   pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
static_branch_inc(&devmap_managed_key);
 }
 #else
@@ -355,6 +357,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
case MEMORY_DEVICE_GENERIC:
break;
case MEMORY_DEVICE_PCI_P2PDMA:
+   if (!pgmap->ops->page_free) {
+   WARN(1, "Missing page_free method\n");
+   return ERR_PTR(-EINVAL);
+   }
params.pgprot = pgprot_noncached(params.pgprot);
break;
default:
@@ -504,7 +510,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 void free_devmap_managed_page(struct page *page)
 {
/* notify page idle for dax */
-   if (!is_device_private_page(page)) {
+   if (!is_device_private_page(page) && !is_pci_p2pdma_page(page)) {
wake_up_var(&page->_refcount);
return;
}
-- 
2.30.2


[PATCH v3 08/20] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

2021-09-16 Thread Logan Gunthorpe
When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through
 the root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally with an IOMMU IOVA.
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is set to
DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 68 +++
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 896bea04c347..e7c658d04222 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -911,6 +912,16 @@ static int __finalise_sg(struct device *dev, struct 
scatterlist *sg, int nents,
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;
 
+   if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+   if (i > 0)
+   cur = sg_next(cur);
+
+   pci_p2pdma_map_bus_segment(s, cur);
+   count++;
+   cur_len = 0;
+   continue;
+   }
+
/*
 * Now fill in the real DMA data. If...
 * - there is a valid output segment to append to
@@ -1008,6 +1019,8 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+   struct dev_pagemap *pgmap = NULL;
+   enum pci_p2pdma_map_type map_type;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
@@ -1042,6 +1055,35 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
s_length = iova_align(iovad, s_length + s_iova_off);
s->length = s_length;
 
+   if (is_pci_p2pdma_page(sg_page(s))) {
+   if (sg_page(s)->pgmap != pgmap) {
+   pgmap = sg_page(s)->pgmap;
+   map_type = pci_p2pdma_map_type(pgmap, dev);
+   }
+
+   switch (map_type) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   /*
+* A zero length will be ignored by
+* iommu_map_sg() and then can be detected
+* in __finalise_sg() to actually map the
+* bus address.
+*/
+   s->length = 0;
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Mapping through host bridge should be
+* mapped with regular IOVAs, thus we
+* do nothing here and continue below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_restore_sg;
+   }
+   }
+
/*
 * Due to the alignment of our single IOVA allocation, we can
 * depend on these assumptions about the segment boundary mask:
@@ -1064,6 +1106,9 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
prev = s;
}
 
+   if (!iova_len)
+   return __finalise_sg(dev, sg, nents, 0);
+
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova) {
ret = -ENOMEM;
@@ -1085,7 +1130,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 out_restore_sg:
__invalidate_sg(sg, nents);
 out:
-   if (ret != -ENOMEM)
+   if (ret != -ENOMEM && ret != -EREMOTEIO)
return -EINVAL;
return ret;
 }
@@ -1093,7 +1138,7 

[PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-09-16 Thread Logan Gunthorpe
Introduce pci_mmap_p2pmem(), a helper to allocate and mmap a hunk
of p2pmem into userspace.

Pages are allocated from the genalloc in bulk and their reference count
incremented. They are returned to the genalloc when the page is put.

The VMA does not take a reference to the pages when they are inserted
with vmf_insert_mixed() (which is necessary for zone device pages), so
the backing P2P memory is stored in a structure in vm_private_data.

A pseudo mount is used to allocate an inode for each PCI device. The
inode's address_space is used in the file doing the mmap so that all
VMAs are collected and can be unmapped if the PCI device is unbound.
After unmapping, the VMAs are iterated through and their pages are
put so the device can continue to be unbound. An active flag is used
to signal to VMAs not to allocate any further P2P memory once the
removal process starts. The flag is synchronized against concurrent
access with RCU.

The VMAs and inode will survive after the unbind of the device, but no
pages will be present in the VMA and a subsequent access will result
in a SIGBUS error.
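
Roughly the shape of the active-flag check described above (a sketch of the
pattern, not the exact code from the patch):

    /* in the VMA fault path, before handing out more p2pmem */
    struct pci_p2pdma *p2pdma;

    rcu_read_lock();
    p2pdma = rcu_dereference(pdev->p2pdma);
    if (!p2pdma || !p2pdma->active) {
        rcu_read_unlock();
        return VM_FAULT_SIGBUS;         /* device is being unbound */
    }
    /* ... allocate a page from p2pdma->pool and insert it into the VMA ... */
    rcu_read_unlock();

    /* on unbind, before iterating and unmapping the VMAs: */
    p2pdma->active = false;
    synchronize_rcu();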

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c   | 263 -
 include/linux/pci-p2pdma.h |  11 ++
 include/uapi/linux/magic.h |   1 +
 3 files changed, 273 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 2422af5a529c..a5adf57af53a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -16,14 +16,19 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
 
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
struct xarray map_types;
+   struct inode *inode;
+   bool active;
 };
 
 struct pci_p2pdma_pagemap {
@@ -32,6 +37,14 @@ struct pci_p2pdma_pagemap {
u64 bus_offset;
 };
 
+struct pci_p2pdma_map {
+   struct kref ref;
+   struct pci_dev *pdev;
+   struct inode *inode;
+   void *kaddr;
+   size_t len;
+};
+
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
 {
return container_of(pgmap, struct pci_p2pdma_pagemap, pgmap);
@@ -100,6 +113,26 @@ static const struct attribute_group p2pmem_group = {
.name = "p2pmem",
 };
 
+/*
+ * P2PDMA internal mount
+ * Fake an internal VFS mount-point in order to allocate struct address_space
+ * mappings to remove VMAs on unbind events.
+ */
+static int pci_p2pdma_fs_cnt;
+static struct vfsmount *pci_p2pdma_fs_mnt;
+
+static int pci_p2pdma_fs_init_fs_context(struct fs_context *fc)
+{
+   return init_pseudo(fc, P2PDMA_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type pci_p2pdma_fs_type = {
+   .name = "p2dma",
+   .owner = THIS_MODULE,
+   .init_fs_context = pci_p2pdma_fs_init_fs_context,
+   .kill_sb = kill_anon_super,
+};
+
 static void p2pdma_page_free(struct page *page)
 {
struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
@@ -128,6 +161,9 @@ static void pci_p2pdma_release(void *data)
gen_pool_destroy(p2pdma->pool);
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
xa_destroy(&p2pdma->map_types);
+
+   iput(p2pdma->inode);
+   simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 }
 
 static int pci_p2pdma_setup(struct pci_dev *pdev)
@@ -145,17 +181,32 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
if (!p2p->pool)
goto out;
 
-   error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+   error = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
+ &pci_p2pdma_fs_cnt);
if (error)
goto out_pool_destroy;
 
+   p2p->inode = alloc_anon_inode(pci_p2pdma_fs_mnt->mnt_sb);
+   if (IS_ERR(p2p->inode)) {
+   error = -ENOMEM;
+   goto out_unpin_fs;
+   }
+
+   error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+   if (error)
+   goto out_put_inode;
+
error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
if (error)
-   goto out_pool_destroy;
+   goto out_put_inode;
 
rcu_assign_pointer(pdev->p2pdma, p2p);
return 0;
 
+out_put_inode:
+   iput(p2p->inode);
+out_unpin_fs:
+   simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 out_pool_destroy:
gen_pool_destroy(p2p->pool);
 out:
@@ -163,6 +214,45 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
return error;
 }
 
+static void pci_p2pdma_map_free_pages(struct pci_p2pdma_map *pmap)
+{
+   int i;
+
+   if (!pmap->kaddr)
+   return;
+
+   for (i = 0; i < pmap->len; i += PAGE_SIZE)
+   put_page(virt_to_page(pmap->kaddr + i));
+
+   pmap->kaddr = NULL;
+}
+
+static void pci_p2pdma_free_mappings(struct address_space *mapping)
+{
+   struct vm_area_struct *vma;
+
+   i_mmap_lock_wr

[PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA

2021-09-16 Thread Logan Gunthorpe
Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 11 +--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7efb31b87f37..916750a54f60 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3771,7 +3771,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
unsigned nsid,
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+   if (ctrl->ops->supports_pci_p2pdma &&
+   ctrl->ops->supports_pci_p2pdma(ctrl))
blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9871c0c9374c..fb9bfc52a6d7 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -477,7 +477,6 @@ struct nvme_ctrl_ops {
unsigned int flags;
 #define NVME_F_FABRICS (1 << 0)
 #define NVME_F_METADATA_SUPPORTED  (1 << 1)
-#define NVME_F_PCI_P2PDMA  (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -485,6 +484,7 @@ struct nvme_ctrl_ops {
void (*submit_async_event)(struct nvme_ctrl *ctrl);
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+   bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b82492cd7503..7d1ef66eac2e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2874,17 +2874,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, 
char *buf, int size)
return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+   struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+   return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
-   .flags  = NVME_F_METADATA_SUPPORTED |
- NVME_F_PCI_P2PDMA,
+   .flags  = NVME_F_METADATA_SUPPORTED,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32= nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
.free_ctrl  = nvme_pci_free_ctrl,
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_address,
+   .supports_pci_p2pdma= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.30.2



[PATCH v3 12/20] RDMA/rw: use dma_map_sgtable()

2021-09-16 Thread Logan Gunthorpe
dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped.

Switch to the dma_map_sgtable() interface which will allow for better
error reporting if the P2PDMA pages are unsupported.

The change to sgtable also appears to fix a couple of subtle error-path
bugs:

  - In rdma_rw_ctx_init(), dma_unmap would be called with an sg
that could have been incremented from the original call, as
well as an nents that was not the original number of nents
called when mapped.
  - Similarly in rdma_rw_ctx_signature_init, both sg and prot_sg
were unmapped with the incorrect number of nents.

Signed-off-by: Logan Gunthorpe 
---
 drivers/infiniband/core/rw.c | 75 +++-
 include/rdma/ib_verbs.h  | 19 +
 2 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 5221cce65675..1bdb56380764 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -273,26 +273,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return 1;
 }
 
-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
-u32 sg_cnt, enum dma_data_direction dir)
-{
-   if (is_pci_p2pdma_page(sg_page(sg)))
-   pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
-   else
-   ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
- u32 sg_cnt, enum dma_data_direction dir)
-{
-   if (is_pci_p2pdma_page(sg_page(sg))) {
-   if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
-   return 0;
-   return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
-   }
-   return ib_dma_map_sg(dev, sg, sg_cnt, dir);
-}
-
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:   context to initialize
@@ -313,12 +293,16 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct 
ib_qp *qp, u32 port_num,
u64 remote_addr, u32 rkey, enum dma_data_direction dir)
 {
struct ib_device *dev = qp->pd->device;
+   struct sg_table sgt = {
+   .sgl = sg,
+   .orig_nents = sg_cnt,
+   };
int ret;
 
-   ret = rdma_rw_map_sg(dev, sg, sg_cnt, dir);
-   if (!ret)
-   return -ENOMEM;
-   sg_cnt = ret;
+   ret = ib_dma_map_sgtable(dev, &sgt, dir, 0);
+   if (ret)
+   return ret;
+   sg_cnt = sgt.nents;
 
/*
 * Skip to the S/G entry that sg_offset falls into:
@@ -354,7 +338,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u32 port_num,
return ret;
 
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sg, sg_cnt, dir);
+   ib_dma_unmap_sgtable(dev, &sgt, dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -387,6 +371,14 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
qp->integrity_en);
struct ib_rdma_wr *rdma_wr;
int count = 0, ret;
+   struct sg_table sgt = {
+   .sgl = sg,
+   .orig_nents = sg_cnt,
+   };
+   struct sg_table prot_sgt = {
+   .sgl = prot_sg,
+   .orig_nents = prot_sg_cnt,
+   };
 
if (sg_cnt > pages_per_mr || prot_sg_cnt > pages_per_mr) {
pr_err("SG count too large: sg_cnt=%u, prot_sg_cnt=%u, 
pages_per_mr=%u\n",
@@ -394,18 +386,14 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return -EINVAL;
}
 
-   ret = rdma_rw_map_sg(dev, sg, sg_cnt, dir);
-   if (!ret)
-   return -ENOMEM;
-   sg_cnt = ret;
+   ret = ib_dma_map_sgtable(dev, &sgt, dir, 0);
+   if (ret)
+   return ret;
 
if (prot_sg_cnt) {
-   ret = rdma_rw_map_sg(dev, prot_sg, prot_sg_cnt, dir);
-   if (!ret) {
-   ret = -ENOMEM;
+   ret = ib_dma_map_sgtable(dev, &prot_sgt, dir, 0);
+   if (ret)
goto out_unmap_sg;
-   }
-   prot_sg_cnt = ret;
}
 
ctx->type = RDMA_RW_SIG_MR;
@@ -426,10 +414,11 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
 
memcpy(ctx->reg->mr->sig_attrs, sig_attrs, sizeof(struct ib_sig_attrs));
 
-   ret = ib_map_mr_sg_pi(ctx->reg->mr, sg, sg_cnt, NULL, prot_sg,
- prot_sg_cnt, NULL, SZ_4K);
+   ret = ib_map_mr_sg_pi(ctx->reg->mr, sg, sgt.nents, NULL, prot_sg,
+ prot_sgt.nents, NULL, SZ_4K);
if (unlikely(ret)) {
-   pr_err("failed to map PI sg (%u)\n", sg_cnt + prot_sg_cnt);
+   pr_err("failed to map PI sg (%u)\n",
+   

[PATCH v3 06/20] dma-direct: support PCI P2PDMA pages in dma-direct map_sg

2021-09-16 Thread Logan Gunthorpe
Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through the
 root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally, as though it were a CPU physical address
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

SGL segments that contain PCI bus addresses are marked with
sg_dma_mark_pci_p2pdma() and are ignored when unmapped.

Signed-off-by: Logan Gunthorpe 
---
 kernel/dma/direct.c | 44 ++--
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 4c6c5e0635e3..fa8317e8ff44 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "direct.h"
 
 /*
@@ -421,29 +422,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_sync_dma_for_cpu_all();
 }
 
+/*
+ * Unmaps segments, except for ones marked as pci_p2pdma which do not
+ * require any further action as they contain a bus address.
+ */
 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir, unsigned long attrs)
 {
struct scatterlist *sg;
int i;
 
-   for_each_sg(sgl, sg, nents, i)
-   dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
-attrs);
+   for_each_sg(sgl, sg, nents, i) {
+   if (sg_is_dma_pci_p2pdma(sg)) {
+   sg_dma_unmark_pci_p2pdma(sg);
+   } else  {
+   dma_direct_unmap_page(dev, sg->dma_address,
+ sg_dma_len(sg), dir, attrs);
+   }
+   }
 }
 #endif
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
 {
-   int i;
+   struct pci_p2pdma_map_state p2pdma_state = {};
+   enum pci_p2pdma_map_type map;
struct scatterlist *sg;
+   int i, ret;
 
for_each_sg(sgl, sg, nents, i) {
+   if (is_pci_p2pdma_page(sg_page(sg))) {
+   map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
+   switch (map) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Mapping through host bridge should be
+* mapped normally, thus we do nothing
+* and continue below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_unmap;
+   }
+   }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
-   if (sg->dma_address == DMA_MAPPING_ERROR)
+   if (sg->dma_address == DMA_MAPPING_ERROR) {
+   ret = -EIO;
goto out_unmap;
+   }
sg_dma_len(sg) = sg->length;
}
 
@@ -451,7 +483,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
 
 out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-   return -EIO;
+   return ret;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
-- 
2.30.2



[PATCH v3 05/20] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers

2021-09-16 Thread Logan Gunthorpe
Add EREMOTEIO error return to dma_map_sgtable() which will be used
by .map_sg() implementations that detect P2PDMA pages that the
underlying DMA device cannot access.
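
A caller-side sketch of how the new error code is meant to be handled
(illustrative; the nvme-pci conversion later in this series does essentially
this, mapping -EREMOTEIO to a non-retryable block status):

    ret = dma_map_sgtable(dev, &sgt, dir, 0);
    if (ret) {
        if (ret == -EREMOTEIO)
            /* P2PDMA not possible between these devices: don't retry */
            return BLK_STS_TARGET;
        return BLK_STS_RESOURCE;        /* e.g. -ENOMEM: may succeed later */
    }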

Signed-off-by: Logan Gunthorpe 
---
 kernel/dma/mapping.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 7ee5284bff58..7315ae31cf1d 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -197,7 +197,7 @@ static int __dma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
if (ents > 0)
debug_dma_map_sg(dev, sg, nents, ents, dir);
else if (WARN_ON_ONCE(ents != -EINVAL && ents != -ENOMEM &&
- ents != -EIO))
+ ents != -EIO && ents != -EREMOTEIO))
return -EIO;
 
return ents;
@@ -248,12 +248,14 @@ EXPORT_SYMBOL(dma_map_sg_attrs);
  * Returns 0 on success or a negative error code on error. The following
  * error codes are supported with the given meaning:
  *
- *   -EINVAL - An invalid argument, unaligned access or other error
- *in usage. Will not succeed if retried.
- *   -ENOMEM - Insufficient resources (like memory or IOVA space) to
- *complete the mapping. Should succeed if retried later.
- *   -EIO- Legacy error code with an unknown meaning. eg. this is
- *returned if a lower level call returned DMA_MAPPING_ERROR.
+ *   -EINVAL   - An invalid argument, unaligned access or other error
+ *   in usage. Will not succeed if retried.
+ *   -ENOMEM   - Insufficient resources (like memory or IOVA space) to
+ *   complete the mapping. Should succeed if retried later.
+ *   -EIO  - Legacy error code with an unknown meaning. eg. this is
+ *   returned if a lower level call returned DMA_MAPPING_ERROR.
+ *   -EREMOTEIO- The DMA device cannot access P2PDMA memory specified 
in
+ *   the sg_table. This will not succeed if retried.
  */
 int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
enum dma_data_direction dir, unsigned long attrs)
-- 
2.30.2



[PATCH v3 10/20] nvme-pci: convert to using dma_map_sgtable()

2021-09-16 Thread Logan Gunthorpe
The dma_map operations now support P2PDMA pages directly. So remove
the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
to dma_map_sgtable().

dma_map_sgtable() returns more complete error codes than dma_map_sg()
and allows differentiating EREMOTEIO errors in case an unsupported
P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
so the request isn't retried.

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/host/pci.c | 69 +
 1 file changed, 29 insertions(+), 40 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7d1ef66eac2e..e2cd73129a88 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -228,11 +228,10 @@ struct nvme_iod {
bool use_sgl;
int aborted;
int npages; /* In the PRP list. 0 means small pool in use */
-   int nents;  /* Used in scatterlist */
dma_addr_t first_dma;
unsigned int dma_len;   /* length of single DMA segment mapping */
dma_addr_t meta_dma;
-   struct scatterlist *sg;
+   struct sg_table sgt;
 };
 
 static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -523,7 +522,7 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 static void **nvme_pci_iod_list(struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-   return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
+   return (void **)(iod->sgt.sgl + blk_rq_nr_phys_segments(req));
 }
 
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
@@ -575,17 +574,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct 
request *req)
}
 }
 
-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
-   struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-   if (is_pci_p2pdma_page(sg_page(iod->sg)))
-   pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
-   rq_dma_dir(req));
-   else
-   dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -596,9 +584,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
return;
}
 
-   WARN_ON_ONCE(!iod->nents);
+   WARN_ON_ONCE(!iod->sgt.nents);
+
+   dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
-   nvme_unmap_sg(dev, req);
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
  iod->first_dma);
@@ -606,7 +595,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
nvme_free_sgls(dev, req);
else
nvme_free_prps(dev, req);
-   mempool_free(iod->sg, dev->iod_mempool);
+   mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -629,7 +618,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;
int dma_len = sg_dma_len(sg);
u64 dma_addr = sg_dma_address(sg);
int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
@@ -702,16 +691,16 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
dma_len = sg_dma_len(sg);
}
 done:
-   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
 free_prps:
nvme_free_prps(dev, req);
return BLK_STS_RESOURCE;
 bad_sgl:
-   WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+   WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
"Invalid SGL for payload:%d nents:%d\n",
-   blk_rq_payload_bytes(req), iod->nents);
+   blk_rq_payload_bytes(req), iod->sgt.nents);
return BLK_STS_IOERR;
 }
 
@@ -737,12 +726,13 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc 
*sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-   struct request *req, struct nvme_rw_command *cmd, int entries)
+   struct request *req, struct nvme_rw_command *cmd)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;
+   int entries = iod->sgt.nents;
dma_addr_t sgl_dma;
int i = 0;
 
@@ -840,7 +830,7 @@ static blk_status_t nvme_map_data(struct nvme_dev

[PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices

2021-09-16 Thread Logan Gunthorpe
Hi,

This patchset continues my work to add userspace P2PDMA access using
O_DIRECT NVMe devices. My last posting[1] just included the first 13
patches in this series, but the early P2PDMA cleanup and map_sg error
changes from that series have been merged into v5.15-rc1. To address
concerns that that series did not add any new functionality, I've added
back the userspace functionality from the original RFC[2] (but improved
based on the original feedback).

The patchset enables userspace P2PDMA by allowing userspace to mmap()
allocated chunks of the CMB. The resulting VMA can be passed only
to O_DIRECT IO on NVMe backed files or block devices. A flag is added
to GUP() in Patch 14, then Patches 15 through 17 wire this flag up based
on whether the block queue indicates P2PDMA support. Patches 18
through 20 enable the CMB to be mapped into userspace by mmaping
the nvme char device.

This is relatively straightforward; however, the one significant
problem is that, presently, pci_p2pdma_map_sg() requires a homogeneous
SGL with all P2PDMA pages or all regular pages. Enhancing GUP to
support enforcing this rule would require a huge hack that I don't
expect would be all that palatable. So the first 13 patches add
support for P2PDMA pages to dma_map_sg[table]() to the dma-direct
and dma-iommu implementations. Thus systems without an IOMMU plus
Intel and AMD IOMMUs are supported. (Other IOMMU implementations would
then be unsupported, notably ARM and PowerPC but support would be added
when they convert to dma-iommu).

dma_map_sgtable() is preferred when dealing with P2PDMA memory as it
will return -EREMOTEIO when the DMA device cannot map specific P2PDMA
pages based on the existing rules in calc_map_type_and_dist().

The other issue is dma_unmap_sg() needs a flag to determine whether a
given dma_addr_t was mapped regularly or as a PCI bus address. To allow
this, a third flag is added to the page_link field in struct
scatterlist. This effectively means support for P2PDMA will now depend
on CONFIG_64BIT.
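
For context, the low bits of page_link already carry two markers, so the new
flag is a third bit (a simplified sketch; the exact macro name used in the
patch may differ):

    #define SG_CHAIN            0x01UL  /* existing: entry chains to another SGL */
    #define SG_END              0x02UL  /* existing: last entry in the SGL */
    #define SG_DMA_PCI_P2PDMA   0x04UL  /* new: segment holds a PCI bus address */

    /*
     * struct page pointers are only guaranteed to be at least 8-byte
     * aligned on 64-bit, which is why stealing a third bit makes P2PDMA
     * support depend on CONFIG_64BIT.
     */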

Feedback welcome.

This series is based on v5.15-rc1. A git branch is available here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_user_cmb_v3

Thanks,

Logan

[1] https://lkml.kernel.org/r/20210513223203.5542-1-log...@deltatee.com
[2] https://lkml.kernel.org/r/20201106170036.18713-1-log...@deltatee.com

--

Logan Gunthorpe (20):
  lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  PCI/P2PDMA: attempt to set map_type if it has not been set
  PCI/P2PDMA: make pci_p2pdma_map_type() non-static
  PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
  dma-direct: support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
  iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
  nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  nvme-pci: convert to using dma_map_sgtable()
  RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
  RDMA/rw: use dma_map_sgtable()
  PCI/P2PDMA: remove pci_p2pdma_[un]map_sg()
  mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
  block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
  block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
  mm: use custom page_free for P2PDMA pages
  PCI/P2PDMA: introduce pci_mmap_p2pmem()
  nvme-pci: allow mmaping the CMB in userspace

 block/bio.c  |   8 +-
 block/blk-map.c  |   7 +-
 drivers/dax/super.c  |   7 +-
 drivers/infiniband/core/rw.c |  75 +++
 drivers/iommu/dma-iommu.c|  68 +-
 drivers/nvme/host/core.c |  18 +-
 drivers/nvme/host/nvme.h |   4 +-
 drivers/nvme/host/pci.c  |  98 +
 drivers/nvme/target/rdma.c   |   2 +-
 drivers/pci/Kconfig  |   3 +-
 drivers/pci/p2pdma.c | 402 +--
 include/linux/dma-map-ops.h  |  10 +
 include/linux/dma-mapping.h  |   5 +
 include/linux/memremap.h |   4 +-
 include/linux/mm.h   |   2 +
 include/linux/pci-p2pdma.h   |  92 ++--
 include/linux/scatterlist.h  |  50 -
 include/linux/uio.h  |  21 +-
 include/rdma/ib_verbs.h  |  30 +++
 include/uapi/linux/magic.h   |   1 +
 kernel/dma/direct.c  |  44 +++-
 kernel/dma/mapping.c |  34 ++-
 lib/iov_iter.c   |  28 +--
 mm/gup.c |  28 ++-
 mm/huge_memory.c |   8 +-
 mm/memory-failure.c  |   4 +-
 mm/memory_hotplug.c  |   2 +-
 mm/memremap.c|  26 ++-
 28 files changed, 834 insertions(+), 247 deletions(-)


base-commit: 6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f
--
2.30.2


[PATCH v3 07/20] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support

2021-09-16 Thread Logan Gunthorpe
Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.
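
A driver-side usage sketch (the nvme-pci patch in this series does
essentially this, split between the new supports_pci_p2pdma op and the
core code that sets the queue flag):

    /* "dev" here is the struct device doing the DMA */
    if (dma_pci_p2pdma_supported(dev))
        blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);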

Signed-off-by: Logan Gunthorpe 
---
 include/linux/dma-map-ops.h | 10 ++
 include/linux/dma-mapping.h |  5 +
 kernel/dma/mapping.c| 18 ++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..b60d6870c847 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -11,7 +11,17 @@
 
 struct cma;
 
+/*
+ * Values for struct dma_map_ops.flags:
+ *
+ * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can
+ * handle PCI P2PDMA pages in the map_sg/unmap_sg operation.
+ */
+#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
+
 struct dma_map_ops {
+   unsigned int flags;
+
void *(*alloc)(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..f7c61b2b4b5e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -140,6 +140,7 @@ int dma_mmap_attrs(struct device *dev, struct 
vm_area_struct *vma,
unsigned long attrs);
 bool dma_can_mmap(struct device *dev);
 int dma_supported(struct device *dev, u64 mask);
+bool dma_pci_p2pdma_supported(struct device *dev);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
@@ -250,6 +251,10 @@ static inline int dma_supported(struct device *dev, u64 
mask)
 {
return 0;
 }
+static inline bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   return false;
+}
 static inline int dma_set_mask(struct device *dev, u64 mask)
 {
return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 7315ae31cf1d..23a02fe1832a 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -727,6 +727,24 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
+bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   /* if ops is not set, dma direct will be used which supports P2PDMA */
+   if (!ops)
+   return true;
+
+   /*
+* Note: dma_ops_bypass is not checked here because P2PDMA should
+* not be used with dma mapping ops that do not have support even
+* if the specific device is bypassing them.
+*/
+
+   return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
 void arch_dma_set_mask(struct device *dev, u64 mask);
 #else
-- 
2.30.2



[PATCH v3 11/20] RDMA/core: introduce ib_dma_pci_p2p_dma_supported()

2021-09-16 Thread Logan Gunthorpe
Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
if a given ib_device can be used in P2PDMA transfers. This ensures
the ib_device is not using virt_dma and also that the underlying
dma_device supports P2PDMA.

Use the new helper in nvmet-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/target/rdma.c |  2 +-
 include/rdma/ib_verbs.h| 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 891174ccd44b..9ea212c187f2 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -414,7 +414,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device 
*ndev,
if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
goto out_free_rsp;
 
-   if (!ib_uses_virt_dma(ndev->device))
+   if (ib_dma_pci_p2p_dma_supported(ndev->device))
r->req.p2p_client = &ndev->device->dev;
r->send_sge.length = sizeof(*r->req.cqe);
r->send_sge.lkey = ndev->pd->local_dma_lkey;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 4b50d9a3018a..2b71c9ca2186 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3986,6 +3986,17 @@ static inline bool ib_uses_virt_dma(struct ib_device 
*dev)
return IS_ENABLED(CONFIG_INFINIBAND_VIRT_DMA) && !dev->dma_device;
 }
 
+/*
+ * Check if a IB device's underlying DMA mapping supports P2PDMA transfers.
+ */
+static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
+{
+   if (ib_uses_virt_dma(dev))
+   return false;
+
+   return dma_pci_p2pdma_supported(dev->dma_device);
+}
+
 /**
  * ib_dma_mapping_error - check a DMA addr for error
  * @dev: The device for which the dma_addr was created
-- 
2.30.2



[PATCH v3 16/20] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()

2021-09-16 Thread Logan Gunthorpe
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace and enables the O_DIRECT path in iomap-based filesystems
and in direct I/O to block devices.

Signed-off-by: Logan Gunthorpe 
---
 block/bio.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 5df3dd282e40..2436e83fe3b4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1088,6 +1088,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, 
struct iov_iter *iter)
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
bool same_page = false;
+   unsigned int flags = 0;
ssize_t size, left;
unsigned len, i;
size_t offset;
@@ -1100,7 +1101,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, 
struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-   size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+   if (bio->bi_bdev && bio->bi_bdev->bd_disk &&
+   blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+   flags |= FOLL_PCI_P2PDMA;
+
+   size = iov_iter_get_pages_flags(iter, pages, LONG_MAX, nr_pages,
+   &offset, flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
 
-- 
2.30.2



[PATCH v3 02/20] PCI/P2PDMA: attempt to set map_type if it has not been set

2021-09-16 Thread Logan Gunthorpe
Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages are to come from userspace then there's no
way to ensure the path was checked ahead of time.

With this change it's no longer invalid to call pci_p2pdma_map_sg()
before the mapping type is calculated so drop the WARN_ON when that
is the case.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 50cdde3e9a8b..1192c465ba6d 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -848,6 +848,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
struct pci_dev *client;
struct pci_p2pdma *p2pdma;
+   int dist;
 
if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
@@ -864,6 +865,10 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
type = xa_to_value(xa_load(&p2pdma->map_types,
   map_types_idx(client)));
rcu_read_unlock();
+
+   if (type == PCI_P2PDMA_MAP_UNKNOWN)
+   return calc_map_type_and_dist(provider, client, &dist, false);
+
return type;
 }
 
@@ -906,7 +911,6 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
case PCI_P2PDMA_MAP_BUS_ADDR:
return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
default:
-   WARN_ON_ONCE(1);
return 0;
}
 }
-- 
2.30.2



[PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

2021-09-16 Thread Logan Gunthorpe
Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If a caller does not set this flag
and tries to map P2PDMA pages, the mapping will fail.

This is implemented by adding a flag and a check to get_dev_pagemap().
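
A caller that can handle P2PDMA pages would opt in roughly like this (a
sketch; in this series the flag is actually plumbed through the iov_iter
and bio helpers rather than passed to GUP directly by drivers):

    struct page *pages[16];
    int nr;

    nr = get_user_pages_fast(addr, 16, FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
    if (nr < 0)
        return nr;      /* without the flag, P2PDMA mappings would fail here */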

Signed-off-by: Logan Gunthorpe 
---
 drivers/dax/super.c  |  7 ---
 include/linux/memremap.h |  4 ++--
 include/linux/mm.h   |  1 +
 mm/gup.c | 28 +---
 mm/huge_memory.c |  8 
 mm/memory-failure.c  |  4 ++--
 mm/memory_hotplug.c  |  2 +-
 mm/memremap.c| 14 ++
 8 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index fc89e91beea7..ffb6e57e65bb 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -180,9 +180,10 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
} else if (pfn_t_devmap(pfn) && pfn_t_devmap(end_pfn)) {
struct dev_pagemap *pgmap, *end_pgmap;
 
-   pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
-   end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL);
-   if (pgmap && pgmap == end_pgmap && pgmap->type == 
MEMORY_DEVICE_FS_DAX
+   pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL, false);
+   end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL, false);
+   if (!IS_ERR_OR_NULL(pgmap) && pgmap == end_pgmap
+   && pgmap->type == MEMORY_DEVICE_FS_DAX
&& pfn_t_to_page(pfn)->pgmap == pgmap
&& pfn_t_to_page(end_pfn)->pgmap == pgmap
&& pfn_t_to_pfn(pfn) == PHYS_PFN(__pa(kaddr))
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c0e9d35889e8..f10c332dac8b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -136,7 +136,7 @@ void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-   struct dev_pagemap *pgmap);
+   struct dev_pagemap *pgmap, bool allow_pci_p2pdma);
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
@@ -161,7 +161,7 @@ static inline void devm_memunmap_pages(struct device *dev,
 }
 
 static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-   struct dev_pagemap *pgmap)
+   struct dev_pagemap *pgmap, bool allow_pci_p2pdma)
 {
return NULL;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..6afdc09d0712 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2864,6 +2864,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
 #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */
 #define FOLL_PIN   0x4 /* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY 0x8 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA0x10 /* allow returning PCI P2PDMA pages */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..1a03b9200cd9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -522,11 +522,16 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
 * case since they are only valid while holding the pgmap
 * reference.
 */
-   *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
-   if (*pgmap)
+   *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap,
+flags & FOLL_PCI_P2PDMA);
+   if (IS_ERR(*pgmap)) {
+   page = ERR_CAST(*pgmap);
+   goto out;
+   } else if (*pgmap) {
page = pte_page(pte);
-   else
+   } else {
goto no_page;
+   }
} else if (unlikely(!page)) {
if (flags & FOLL_DUMP) {
/* Avoid special (like zero) pages in core dumps */
@@ -846,7 +851,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
return NULL;
 
page = follow_page_mask(vma, address, foll_flags, &ctx);
-   if (ctx.pgmap)
+   if (!IS_ERR_OR_NULL(ctx.pgmap))
put_dev_pagemap(ctx.pgmap);
return page;
 }
@@ -1199,7 +1204,7 @@ static long __get_user_pages(struct mm_struct *mm,
nr_pages -= page_increm;
} while (nr_pages);
 out:
-   if (ctx.pgmap)
+   if (!IS_ERR_OR_NULL(ctx.pgmap))
put_dev_pagemap(ctx.pgmap);
return i ? i : ret;
 }
@@ -2149,8 +2154,9 @@ static int 

[PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations

2021-09-16 Thread Logan Gunthorpe
Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
implementations. It takes a scatterlist segment that must point to a
pci_p2pdma struct page and will map it if the mapping requires a bus
address.

The return value indicates whether the mapping required a bus address
or whether the caller still needs to map the segment normally. If the
segment should not be mapped, -EREMOTEIO is returned.

This helper uses a state structure to track the changes to the
pgmap across calls and avoid needing to lookup into the xarray for
every page.

Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
dma_map_sg() implementations where the sg segment containing the page
differs from the sg segment containing the DMA address.
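
The intended calling pattern, roughly as used by the dma-direct map_sg
conversion in this series (sketch only; the normal-mapping step is elided):

    struct pci_p2pdma_map_state state = {};
    struct scatterlist *sg;
    int i;

    for_each_sg(sgl, sg, nents, i) {
        if (is_pci_p2pdma_page(sg_page(sg))) {
            switch (pci_p2pdma_map_segment(&state, dev, sg)) {
            case PCI_P2PDMA_MAP_BUS_ADDR:
                continue;       /* segment already holds a bus address */
            case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
                break;          /* fall through to a normal mapping */
            default:
                return -EREMOTEIO;
            }
        }
        /* ... map the segment normally ... */
    }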

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c   | 59 ++
 include/linux/pci-p2pdma.h | 21 ++
 2 files changed, 80 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index b656d8c801a7..58c34f1f1473 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -943,6 +943,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
 
+/**
+ * pci_p2pdma_map_segment - map an sg segment determining the mapping type
+ * @state: State structure that should be declared outside of the for_each_sg()
+ * loop and initialized to zero.
+ * @dev: DMA device that's doing the mapping operation
+ * @sg: scatterlist segment to map
+ *
+ * This is a helper to be used by non-iommu dma_map_sg() implementations where
+ * the sg segment is the same for the page_link and the dma_address.
+ *
+ * Attempt to map a single segment in an SGL with the PCI bus address.
+ * The segment must point to a PCI P2PDMA page and thus must be
+ * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
+ *
+ * Returns the type of mapping used and maps the page if the type is
+ * PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg)
+{
+   if (state->pgmap != sg_page(sg)->pgmap) {
+   state->pgmap = sg_page(sg)->pgmap;
+   state->map = pci_p2pdma_map_type(state->pgmap, dev);
+   state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+   }
+
+   if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
+   sg->dma_address = sg_phys(sg) + state->bus_off;
+   sg_dma_len(sg) = sg->length;
+   sg_dma_mark_pci_p2pdma(sg);
+   }
+
+   return state->map;
+}
+
+/**
+ * pci_p2pdma_map_bus_segment - map an sg segment pre determined to
+ * be mapped with PCI_P2PDMA_MAP_BUS_ADDR
+ * @pg_sg: scatterlist segment with the page to map
+ * @dma_sg: scatterlist segment to assign a dma address to
+ *
+ * This is a helper for iommu dma_map_sg() implementations when the
+ * segment for the dma address differs from the segment containing the
+ * source page.
+ *
+ * pci_p2pdma_map_type() must have already been called on the pg_sg and
+ * returned PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg)
+{
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
+
+   dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
+   sg_dma_len(dma_sg) = pg_sg->length;
+   sg_dma_mark_pci_p2pdma(dma_sg);
+}
+
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  * to enable p2pdma
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index caac2d023f8f..e5a8d5bc0f51 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -13,6 +13,12 @@
 
 #include 
 
+struct pci_p2pdma_map_state {
+   struct dev_pagemap *pgmap;
+   int map;
+   u64 bus_off;
+};
+
 struct block_device;
 struct scatterlist;
 
@@ -70,6 +76,11 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
 void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg);
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -135,6 +146,16 @@ static inline void pci_p2pdma_unmap_sg_attrs(struct device 
*dev,
unsigned long attrs)
 {
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_s

[PATCH v3 15/20] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

2021-09-16 Thread Logan Gunthorpe
Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.

Signed-off-by: Logan Gunthorpe 
---
 include/linux/uio.h | 21 +
 lib/iov_iter.c  | 28 
 2 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5265024e8b90..d4ce252db728 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -228,10 +228,23 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int 
direction, struct pipe_inode
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t 
count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray 
*xarray,
 loff_t start, size_t count);
-ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
-   size_t maxsize, unsigned maxpages, size_t *start);
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
-   size_t maxsize, size_t *start);
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+   size_t maxsize, unsigned maxpages, size_t *start,
+   unsigned int gup_flags);
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+   struct page ***pages, size_t maxsize, size_t *start,
+   unsigned int gup_flags);
+static inline ssize_t iov_iter_get_pages(struct iov_iter *i,
+   struct page **pages, size_t maxsize, unsigned maxpages,
+   size_t *start)
+{
+   return iov_iter_get_pages_flags(i, pages, maxsize, maxpages, start, 0);
+}
+static inline ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+   struct page ***pages, size_t maxsize, size_t *start)
+{
+   return iov_iter_get_pages_alloc_flags(i, pages, maxsize, start, 0);
+}
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f2d50d69a6c3..bbf0fb6736a9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1468,9 +1468,9 @@ static struct page *first_bvec_segment(const struct 
iov_iter *i,
return page;
 }
 
-ssize_t iov_iter_get_pages(struct iov_iter *i,
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i,
   struct page **pages, size_t maxsize, unsigned maxpages,
-  size_t *start)
+  size_t *start, unsigned int gup_flags)
 {
size_t len;
int n, res;
@@ -1485,9 +1485,11 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 
addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
n = DIV_ROUND_UP(len, PAGE_SIZE);
-   res = get_user_pages_fast(addr, n,
-   iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
-   pages);
+
+   if (iov_iter_rw(i) != WRITE)
+   gup_flags |= FOLL_WRITE;
+
+   res = get_user_pages_fast(addr, n, gup_flags, pages);
if (unlikely(res < 0))
return res;
return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1507,7 +1509,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return iter_xarray_get_pages(i, pages, maxsize, maxpages, 
start);
return -EFAULT;
 }
-EXPORT_SYMBOL(iov_iter_get_pages);
+EXPORT_SYMBOL(iov_iter_get_pages_flags);
 
 static struct page **get_pages_array(size_t n)
 {
@@ -1589,9 +1591,9 @@ static ssize_t iter_xarray_get_pages_alloc(struct 
iov_iter *i,
return actual;
 }
 
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
   struct page ***pages, size_t maxsize,
-  size_t *start)
+  size_t *start, unsigned int gup_flags)
 {
struct page **p;
size_t len;
@@ -1604,14 +1606,16 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 
if (likely(iter_is_iovec(i))) {
unsigned long addr;
-
addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
n = DIV_ROUND_UP(len, PAGE_SIZE);
p = get_pages_array(n);
if (!p)
return -ENOMEM;
-   res = get_user_pages_fast(addr, n,
-   iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
+
+   if (iov_iter_rw(i) != WRITE)
+   gup_flags |= FOLL_WRITE;
+
+   res = get_user_pages_fast(addr, n, gup_flags, p);
if (unlikely(res < 0)) {
kvfree(p);
return res;
@@ -1637,7 +1641,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return iter_xarray_get_pages_alloc(i, pa

[PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2021-09-16 Thread Logan Gunthorpe
Make use of the third free LSB in scatterlist's page_link on 64bit systems.

The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
given SGL segments dma_address points to a PCI bus address.
dma_unmap_sg_p2pdma() will need to perform different cleanup when a
segment is marked as P2PDMA.

Using this bit requires adding an additional dependency on CONFIG_64BIT to
CONFIG_PCI_P2PDMA. This should be acceptable as the majority of P2PDMA
use cases are restricted to newer root complexes and roughly require the
extra address space for memory BARs used in the transactions.
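
As a rough usage sketch (the loop variables and the regular teardown step
are assumptions, not part of this patch), an unmap path can key off the new
bit to skip segments whose dma_address is a PCI bus address:

  for_each_sg(sgl, s, nents, i) {
          if (sg_is_dma_pci_p2pdma(s))
                  continue;       /* bus address: no IOMMU/swiotlb teardown needed */
          /* ... regular dma_unmap_page()-style teardown for this segment ... */
  }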

Signed-off-by: Logan Gunthorpe 
Reviewed-by: John Hubbard 
---
 drivers/pci/Kconfig |  2 +-
 include/linux/scatterlist.h | 50 ++---
 2 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 0c473d75e625..90b4bddb3300 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -163,7 +163,7 @@ config PCI_PASID
 
 config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
-   depends on ZONE_DEVICE
+   depends on ZONE_DEVICE && 64BIT
select GENERIC_ALLOCATOR
help
	  Enables drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 266754a55327..e62b1cf6386f 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -64,6 +64,21 @@ struct sg_append_table {
 #define SG_CHAIN   0x01UL
 #define SG_END 0x02UL
 
+/*
+ * bit 2 is the third free bit in the page_link on 64bit systems which
+ * is used by dma_unmap_sg() to determine if the dma_address is a PCI
+ * bus address when doing P2PDMA.
+ * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
+ */
+
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_DMA_PCI_P2PDMA  0x04UL
+#else
+#define SG_DMA_PCI_P2PDMA  0x00UL
+#endif
+
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_PCI_P2PDMA)
+
 /*
  * We overload the LSB of the page pointer to indicate whether it's
  * a valid sg entry, or whether it points to the start of a new scatterlist.
@@ -72,7 +87,9 @@ struct sg_append_table {
 #define sg_is_chain(sg)((sg)->page_link & SG_CHAIN)
 #define sg_is_last(sg) ((sg)->page_link & SG_END)
 #define sg_chain_ptr(sg)   \
-   ((struct scatterlist *) ((sg)->page_link & ~(SG_CHAIN | SG_END)))
+   ((struct scatterlist *)((sg)->page_link & ~SG_PAGE_LINK_MASK))
+
+#define sg_is_dma_pci_p2pdma(sg) ((sg)->page_link & SG_DMA_PCI_P2PDMA)
 
 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -86,13 +103,13 @@ struct sg_append_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
-   unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END);
+   unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK;
 
/*
 * In order for the low bit stealing approach to work, pages
 * must be aligned at a 32-bit boundary as a minimum.
 */
-   BUG_ON((unsigned long) page & (SG_CHAIN | SG_END));
+   BUG_ON((unsigned long)page & SG_PAGE_LINK_MASK);
 #ifdef CONFIG_DEBUG_SG
BUG_ON(sg_is_chain(sg));
 #endif
@@ -126,7 +143,7 @@ static inline struct page *sg_page(struct scatterlist *sg)
 #ifdef CONFIG_DEBUG_SG
BUG_ON(sg_is_chain(sg));
 #endif
-   return (struct page *)((sg)->page_link & ~(SG_CHAIN | SG_END));
+   return (struct page *)((sg)->page_link & ~SG_PAGE_LINK_MASK);
 }
 
 /**
@@ -228,6 +245,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
sg->page_link &= ~SG_END;
 }
 
+/**
+ * sg_dma_mark_pci_p2pdma - Mark the scatterlist entry for PCI p2pdma
+ * @sg: SG entry
+ *
+ * Description:
+ *   Marks the passed in sg entry to indicate that the dma_address is
+ *   a PCI bus address.
+ **/
+static inline void sg_dma_mark_pci_p2pdma(struct scatterlist *sg)
+{
+   sg->page_link |= SG_DMA_PCI_P2PDMA;
+}
+
+/**
+ * sg_dma_unmark_pci_p2pdma - Unmark the scatterlist entry for PCI p2pdma
+ * @sg: SG entry
+ *
+ * Description:
+ *   Clears the PCI P2PDMA mark
+ **/
+static inline void sg_dma_unmark_pci_p2pdma(struct scatterlist *sg)
+{
+   sg->page_link &= ~SG_DMA_PCI_P2PDMA;
+}
+
 /**
  * sg_phys - Return physical address of an sg entry
  * @sg: SG entry
-- 
2.30.2


[PATCH v3 03/20] PCI/P2PDMA: make pci_p2pdma_map_type() non-static

2021-09-16 Thread Logan Gunthorpe
pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
implementation because it will need to determine the mapping type
ahead of actually creating the IOMMU mapping.
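
As a minimal sketch of that use (the surrounding code is an assumption, not
part of this patch), the mapping type can be queried before any IOVA is
allocated:

  enum pci_p2pdma_map_type map_type;

  map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
  if (map_type == PCI_P2PDMA_MAP_NOT_SUPPORTED)
          return -EREMOTEIO;      /* don't build an IOVA mapping for this SGL */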

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c   | 24 +-
 include/linux/pci-p2pdma.h | 41 ++
 2 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 1192c465ba6d..b656d8c801a7 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -20,13 +20,6 @@
 #include 
 #include 
 
-enum pci_p2pdma_map_type {
-   PCI_P2PDMA_MAP_UNKNOWN = 0,
-   PCI_P2PDMA_MAP_NOT_SUPPORTED,
-   PCI_P2PDMA_MAP_BUS_ADDR,
-   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
@@ -841,8 +834,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-   struct device *dev)
+/**
+ * pci_p2pdma_map_type - return the type of mapping that should be used for
+ * a given device and pgmap
+ * @pgmap: the pagemap of a page to determine the mapping type for
+ * @dev: device that is mapping the page
+ *
+ * Returns one of:
+ * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
+ * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
+ * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done normally
+ * using the CPU physical address (in dma-direct) or an IOVA
+ * mapping for the IOMMU.
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev)
 {
enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..caac2d023f8f 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -16,6 +16,40 @@
 struct block_device;
 struct scatterlist;
 
+enum pci_p2pdma_map_type {
+   /*
+* PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
+* type hasn't been calculated yet. Functions that return this enum
+* never return this value.
+*/
+   PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+   /*
+* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+* traverse the host bridge and the host bridge is not in the
+* whitelist. DMA Mapping routines should return an error when
+* this is returned.
+*/
+   PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+   /*
+* PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
+* each other directly through a PCI switch and the transaction will
+* not traverse the host bridge. Such a mapping should program
+* the DMA engine with PCI bus addresses.
+*/
+   PCI_P2PDMA_MAP_BUS_ADDR,
+
+   /*
+* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+* to each other, but the transaction traverses a host bridge on the
+* whitelist. In this case, a normal mapping either with CPU physical
+* addresses (in the case of dma-direct) or IOVA addresses (in the
+* case of IOMMUs) should be used to program the DMA engine.
+*/
+   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
@@ -30,6 +64,8 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
 unsigned int *nents, u32 length);
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev);
 int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
 void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
@@ -83,6 +119,11 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
+{
+   return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
 static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
struct scatterlist *sg, int nents, enum dma_data_direction dir,
unsigned long attrs)
-- 
2.30.2


[PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg()

2021-09-16 Thread Logan Gunthorpe
This interface is superseded by support in dma_map_sg() which now supports
heterogeneous scatterlists. There are no longer any users, so remove it.
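
For reference, with the rest of this series applied a driver maps a
heterogeneous SGL through the regular API; a minimal sketch, assuming an
sg_table has already been built:

  ret = dma_map_sgtable(dev, sgt, dir, 0);
  if (ret)
          return ret;     /* may be -EREMOTEIO if P2PDMA isn't supported on this path */

  /* ... issue the DMA ... */

  dma_unmap_sgtable(dev, sgt, dir, 0);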

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/p2pdma.c   | 65 --
 include/linux/pci-p2pdma.h | 27 
 2 files changed, 92 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 58c34f1f1473..4478633346bd 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -878,71 +878,6 @@ enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
return type;
 }
 
-static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
-   struct device *dev, struct scatterlist *sg, int nents)
-{
-   struct scatterlist *s;
-   int i;
-
-   for_each_sg(sg, s, nents, i) {
-   s->dma_address = sg_phys(s) - p2p_pgmap->bus_offset;
-   sg_dma_len(s) = s->length;
-   }
-
-   return nents;
-}
-
-/**
- * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: elements in the scatterlist
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_map_sg() (if called)
- *
- * Scatterlists mapped with this function should be unmapped using
- * pci_p2pdma_unmap_sg_attrs().
- *
- * Returns the number of SG entries mapped or 0 on error.
- */
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   struct pci_p2pdma_pagemap *p2p_pgmap =
-   to_p2p_pgmap(sg_page(sg)->pgmap);
-
-   switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
-   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-   return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
-   case PCI_P2PDMA_MAP_BUS_ADDR:
-   return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
-   default:
-   return 0;
-   }
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
-
-/**
- * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was
- * mapped with pci_p2pdma_map_sg()
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: number of elements returned by pci_p2pdma_map_sg()
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_unmap_sg() (if called)
- */
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   enum pci_p2pdma_map_type map_type;
-
-   map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
-
-   if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
-   dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
-
 /**
  * pci_p2pdma_map_segment - map an sg segment determining the mapping type
  * @state: State structure that should be declared outside of the for_each_sg()
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index e5a8d5bc0f51..0c33a40a86e7 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -72,10 +72,6 @@ void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct 
scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
 enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 struct device *dev);
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
 enum pci_p2pdma_map_type
 pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
   struct scatterlist *sg);
@@ -135,17 +131,6 @@ pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct 
device *dev)
 {
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 }
-static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-   return 0;
-}
-static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-}
 static inline enum pci_p2pdma_map_type
 pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
   struct scatterlist *sg)
@@ -181,16 +166,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct 
device *client)
return pci_p2pmem_find_many(&client, 1);
 }
 
-static inline int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir)
-{
-   return pci_p2pdma_map_sg_

Re: [PATCH v3 0/8] Implement generic cc_platform_has() helper function

2021-09-16 Thread Kuppuswamy, Sathyanarayanan




On 9/16/21 8:02 AM, Borislav Petkov wrote:

On Wed, Sep 15, 2021 at 10:26:06AM -0700, Kuppuswamy, Sathyanarayanan wrote:

I have an Intel variant patch (please check the following patch). But it includes
TDX changes as well. Shall I move TDX changes to a different patch and just
create a separate patch for adding intel_cc_platform_has()?


Yes, please, so that I can expedite that stuff separately and so that it
can go in early in order for future work to be based on top of it.


Sent it as part of the TDX patch series. Please check and cherry-pick it.

https://lore.kernel.org/lkml/20210916183550.15349-2-sathyanarayanan.kuppusw...@linux.intel.com/



Thx.



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [RFC][PATCH v2 00/13] iommu/arm-smmu-v3: Add NVIDIA implementation

2021-09-16 Thread Nicolin Chen via iommu
Hi Kevin,

On Thu, Sep 02, 2021 at 10:27:06PM +, Tian, Kevin wrote:

> > Indeed, this looks like a flavour of the accelerated invalidation
> > stuff we've talked about already.
> >
> > I would see it probably exposed as some HW specific IOCTL on the iommu
> > fd to get access to the accelerated invalidation for IOASID's in the
> > FD.
> >
> > Indeed, this seems like a further example of why /dev/iommu is looking
> > like a good idea as this RFC is very complicated to do something
> > fairly simple.
> >
> > Where are things on the /dev/iommu work these days?
> 
> We are actively working on the basic skeleton. Our original plan is to send
> out the 1st draft before LPC, with support of vfio type1 semantics and
> pci dev only (single-device group). But later we realized that adding
> multi-device group support is also necessary even in the 1st draft to avoid
> some dirty hacks and build the complete picture. This will add some time
> though. If things go well, we'll still try to hit the original plan. If not, it 
> will be
> soon after LPC.

As I also want to take a look at your work for this implementation,
would you please CC me when you send out your patches? Thank you!


Re: [PATCH] swiotlb: allocate memory in a cache-friendly way

2021-09-16 Thread Konrad Rzeszutek Wilk
On Wed, Sep 01, 2021 at 12:21:35PM +0800, Chao Gao wrote:
> Currently, swiotlb uses a global index to indicate the starting point
> of next search. The index increases from 0 to the number of slots - 1
> and then wraps around. It is straightforward but not cache-friendly
> because the "oldest" slot in swiotlb tends to be used first.
> 
> Freed slots are probably accessed right before being freed, especially
> in VM's case (device backends access them in DMA_TO_DEVICE mode; guest
> accesses them in other DMA modes). Thus those just freed slots may
> reside in cache. Then reusing those just freed slots can reduce cache
> misses.
> 
> To that end, maintain a free list for free slots and insert freed slots
> from the head and searching for free slots always starts from the head.
> 
> With this optimization, network throughput of sending data from host to
> guest, measured by iperf3, increases by 7%.

Wow, that is pretty awesome!

Are there any other benchmarks that you ran that showed a negative
performance?

Thank you.
> 
> A bad side effect of this patch is we cannot use a large stride to skip
> unaligned slots when there is an alignment requirement. Currently, a
> large stride is used when a) device has an alignment requirement, stride
> is calculated according to the requirement; b) the requested size is
> larger than PAGE_SIZE. For x86 with 4KB page size, stride is set to 2.
> 
> For case a), few devices have an alignment requirement; the impact is
> limited. For case b) this patch probably leads to one (or more if page size
> is larger than 4K) additional lookup; but as the "io_tlb_slot" struct of
> free slots are also accessed when freeing slots, they probably resides in
> CPU cache as well and then the overhead is almost negligible.
> 
> Suggested-by: Andi Kleen 
> Signed-off-by: Chao Gao 
> ---
>  include/linux/swiotlb.h | 15 --
>  kernel/dma/swiotlb.c| 43 +++--
>  2 files changed, 20 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index b0cb2a9973f4..8cafafd218af 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -63,6 +63,13 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t 
> phys,
>  #ifdef CONFIG_SWIOTLB
>  extern enum swiotlb_force swiotlb_force;
>  
> +struct io_tlb_slot {
> + phys_addr_t orig_addr;
> + size_t alloc_size;
> + unsigned int list;
> + struct list_head node;
> +};
> +
>  /**
>   * struct io_tlb_mem - IO TLB Memory Pool Descriptor
>   *
> @@ -93,17 +100,13 @@ struct io_tlb_mem {
>   phys_addr_t end;
>   unsigned long nslabs;
>   unsigned long used;
> - unsigned int index;
> + struct list_head free_slots;
>   spinlock_t lock;
>   struct dentry *debugfs;
>   bool late_alloc;
>   bool force_bounce;
>   bool for_alloc;
> - struct io_tlb_slot {
> - phys_addr_t orig_addr;
> - size_t alloc_size;
> - unsigned int list;
> - } *slots;
> + struct io_tlb_slot *slots;
>  };
>  extern struct io_tlb_mem io_tlb_default_mem;
>  
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 87c40517e822..12b5b8471e54 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -184,7 +184,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem 
> *mem, phys_addr_t start,
>   mem->nslabs = nslabs;
>   mem->start = start;
>   mem->end = mem->start + bytes;
> - mem->index = 0;
> + INIT_LIST_HEAD(&mem->free_slots);
>   mem->late_alloc = late_alloc;
>  
>   if (swiotlb_force == SWIOTLB_FORCE)
> @@ -195,6 +195,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem 
> *mem, phys_addr_t start,
>   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
>   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
>   mem->slots[i].alloc_size = 0;
> + list_add_tail(&mem->slots[i].node, &mem->free_slots);
>   }
>   memset(vaddr, 0, bytes);
>  }
> @@ -447,13 +448,6 @@ static inline unsigned long get_max_slots(unsigned long 
> boundary_mask)
>   return nr_slots(boundary_mask + 1);
>  }
>  
> -static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
> -{
> - if (index >= mem->nslabs)
> - return 0;
> - return index;
> -}
> -
>  /*
>   * Find a suitable number of IO TLB entries size that will fit this request 
> and
>   * allocate a buffer from that IO TLB pool.
> @@ -462,38 +456,29 @@ static int swiotlb_find_slots(struct device *dev, 
> phys_addr_t orig_addr,
> size_t alloc_size)
>  {
>   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> + struct io_tlb_slot *slot, *tmp;
>   unsigned long boundary_mask = dma_get_seg_boundary(dev);
>   dma_addr_t tbl_dma_addr =
>   phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
>   unsigned long max_slots = get_max_slots(boundary_mask);
>   

[PATCH v2 4/8] iommu/arm-smmu: Attach to host1x context device bus

2021-09-16 Thread Mikko Perttunen via iommu
Set itself as the IOMMU for the host1x context device bus, containing
"dummy" devices used for Host1x context isolation.

Signed-off-by: Mikko Perttunen 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 4bc75c4ce402..23082675d542 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -39,6 +39,7 @@
 
 #include 
 #include 
+#include 
 
 #include "arm-smmu.h"
 
@@ -2051,8 +2052,20 @@ static int arm_smmu_bus_init(struct iommu_ops *ops)
goto err_reset_pci_ops;
}
 #endif
+#ifdef CONFIG_TEGRA_HOST1X_CONTEXT_BUS
+   if (!iommu_present(&host1x_context_device_bus_type)) {
+   err = bus_set_iommu(&host1x_context_device_bus_type, ops);
+   if (err)
+   goto err_reset_fsl_mc_ops;
+   }
+#endif
+
return 0;
 
+err_reset_fsl_mc_ops: __maybe_unused;
+#ifdef CONFIG_FSL_MC_BUS
+   bus_set_iommu(&fsl_mc_bus_type, NULL);
+#endif
 err_reset_pci_ops: __maybe_unused;
 #ifdef CONFIG_PCI
bus_set_iommu(&pci_bus_type, NULL);
-- 
2.32.0



[PATCH v2 2/8] gpu: host1x: Add context device management code

2021-09-16 Thread Mikko Perttunen via iommu
Add code to register context devices from device tree, allocate them
out and manage their refcounts.

Signed-off-by: Mikko Perttunen 
---
v2:
* Directly set DMA mask instead of inheriting from Host1x.
* Use iommu-map instead of custom DT property.
---
 drivers/gpu/host1x/Makefile  |   1 +
 drivers/gpu/host1x/context.c | 174 +++
 drivers/gpu/host1x/context.h |  27 ++
 drivers/gpu/host1x/dev.c |  12 ++-
 drivers/gpu/host1x/dev.h |   2 +
 include/linux/host1x.h   |  17 
 6 files changed, 232 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/host1x/context.c
 create mode 100644 drivers/gpu/host1x/context.h

diff --git a/drivers/gpu/host1x/Makefile b/drivers/gpu/host1x/Makefile
index c891a3e33844..8a65e13d113a 100644
--- a/drivers/gpu/host1x/Makefile
+++ b/drivers/gpu/host1x/Makefile
@@ -10,6 +10,7 @@ host1x-y = \
debug.o \
mipi.o \
fence.o \
+   context.o \
hw/host1x01.o \
hw/host1x02.o \
hw/host1x04.o \
diff --git a/drivers/gpu/host1x/context.c b/drivers/gpu/host1x/context.c
new file mode 100644
index ..987c08a1e2f2
--- /dev/null
+++ b/drivers/gpu/host1x/context.c
@@ -0,0 +1,174 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021, NVIDIA Corporation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "context.h"
+#include "dev.h"
+
+/*
+ * Due to an issue with T194 NVENC, only 38 bits can be used.
+ * Anyway, 256GiB of IOVA ought to be enough for anyone.
+ */
+static dma_addr_t context_device_dma_mask = DMA_BIT_MASK(38);
+
+int host1x_context_list_init(struct host1x *host1x)
+{
+   struct host1x_context_list *cdl = &host1x->context_list;
+   struct host1x_context *ctx;
+   struct device_node *node;
+   int index;
+   int err;
+
+   node = of_get_child_by_name(host1x->dev->of_node, "memory-contexts");
+   if (!node)
+   return 0;
+
+   cdl->devs = NULL;
+   cdl->len = 0;
+   mutex_init(&cdl->lock);
+
+   err = of_property_count_u32_elems(node, "iommu-map");
+   if (err < 0) {
+   err = 0;
+   goto put_node;
+   }
+
+   cdl->devs = kcalloc(err, sizeof(*cdl->devs), GFP_KERNEL);
+   if (!cdl->devs) {
+   err = -ENOMEM;
+   goto put_node;
+   }
+   cdl->len = err / 4;
+
+   for (index = 0; index < cdl->len; index++) {
+   struct iommu_fwspec *fwspec;
+
+   ctx = &cdl->devs[index];
+
+   ctx->host = host1x;
+
+   device_initialize(&ctx->dev);
+
+   ctx->dev.dma_mask = &context_device_dma_mask;
+   ctx->dev.coherent_dma_mask = context_device_dma_mask;
+   dev_set_name(&ctx->dev, "host1x-ctx.%d", index);
+   ctx->dev.bus = &host1x_context_device_bus_type;
+   ctx->dev.parent = host1x->dev;
+
+   dma_set_max_seg_size(&ctx->dev, UINT_MAX);
+
+   err = device_add(&ctx->dev);
+   if (err) {
+   dev_err(host1x->dev, "could not add context device %d: 
%d\n", index, err);
+   goto del_devices;
+   }
+
+   err = of_dma_configure_id(&ctx->dev, node, true, &index);
+   if (err) {
+   dev_err(host1x->dev, "IOMMU configuration failed for 
context device %d: %d\n",
+   index, err);
+   device_del(&ctx->dev);
+   goto del_devices;
+   }
+
+   fwspec = dev_iommu_fwspec_get(&ctx->dev);
+   if (!fwspec) {
+   dev_err(host1x->dev, "Context device %d has no 
IOMMU!\n", index);
+   device_del(&ctx->dev);
+   goto del_devices;
+   }
+
+   ctx->stream_id = fwspec->ids[0] & 0x;
+   }
+
+   of_node_put(node);
+
+   return 0;
+
+del_devices:
+   while (--index >= 0)
+   device_del(&cdl->devs[index].dev);
+
+   kfree(cdl->devs);
+   cdl->len = 0;
+
+put_node:
+   of_node_put(node);
+
+   return err;
+}
+
+void host1x_context_list_free(struct host1x_context_list *cdl)
+{
+   int i;
+
+   for (i = 0; i < cdl->len; i++)
+   device_del(&cdl->devs[i].dev);
+
+   kfree(cdl->devs);
+   cdl->len = 0;
+}
+
+struct host1x_context *host1x_context_alloc(struct host1x *host1x,
+   struct pid *pid)
+{
+   struct host1x_context_list *cdl = &host1x->context_list;
+   struct host1x_context *free = NULL;
+   int i;
+
+   if (!cdl->len)
+   return ERR_PTR(-EOPNOTSUPP);
+
+   mutex_lock(&cdl->lock);
+
+   for (i = 0; i < cdl->len; i++) {
+   struct host1x_context *cd = &cdl->devs[i];
+
+   if (cd->owner == pid) {
+   refcount_inc(&cd->ref);
+ 

[PATCH v2 3/8] gpu: host1x: Program context stream ID on submission

2021-09-16 Thread Mikko Perttunen via iommu
Add code to do stream ID switching at the beginning of a job. The
stream ID is switched to the stream ID specified by the context
passed in the job structure.

Before switching the stream ID, an OP_DONE wait is done on the
channel's engine to ensure that there is no residual ongoing
work that might do DMA using the new stream ID.

Signed-off-by: Mikko Perttunen 
---
 drivers/gpu/host1x/hw/channel_hw.c| 52 +--
 drivers/gpu/host1x/hw/host1x06_hardware.h | 10 +
 drivers/gpu/host1x/hw/host1x07_hardware.h | 10 +
 include/linux/host1x.h|  4 ++
 4 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/host1x/hw/channel_hw.c 
b/drivers/gpu/host1x/hw/channel_hw.c
index 1999780a7203..d451f8437f62 100644
--- a/drivers/gpu/host1x/hw/channel_hw.c
+++ b/drivers/gpu/host1x/hw/channel_hw.c
@@ -159,6 +159,45 @@ static void host1x_channel_set_streamid(struct 
host1x_channel *channel)
 #endif
 }
 
+static void host1x_channel_program_engine_streamid(struct host1x_job *job)
+{
+#if HOST1X_HW >= 6
+   u32 fence;
+
+   if (!job->context)
+   return;
+
+   fence = host1x_syncpt_incr_max(job->syncpt, 1);
+
+   /* First, increment a syncpoint on OP_DONE condition.. */
+
+   host1x_cdma_push(&job->channel->cdma,
+   host1x_opcode_nonincr(HOST1X_UCLASS_INCR_SYNCPT, 1),
+   HOST1X_UCLASS_INCR_SYNCPT_INDX_F(job->syncpt->id) |
+   HOST1X_UCLASS_INCR_SYNCPT_COND_F(1));
+
+   /* Wait for syncpoint to increment */
+
+   host1x_cdma_push(&job->channel->cdma,
+   host1x_opcode_setclass(HOST1X_CLASS_HOST1X,
+   host1x_uclass_wait_syncpt_r(), 1),
+   host1x_class_host_wait_syncpt(job->syncpt->id, fence));
+
+   /*
+* Now that we know the engine is idle, return to class and
+* change stream ID.
+*/
+
+   host1x_cdma_push(&job->channel->cdma,
+   host1x_opcode_setclass(job->class, 0, 0),
+   HOST1X_OPCODE_NOP);
+
+   host1x_cdma_push(&job->channel->cdma,
+   host1x_opcode_setpayload(job->context->stream_id),
+   host1x_opcode_setstreamid(job->engine_streamid_offset / 4));
+#endif
+}
+
 static int channel_submit(struct host1x_job *job)
 {
struct host1x_channel *ch = job->channel;
@@ -214,18 +253,23 @@ static int channel_submit(struct host1x_job *job)
if (sp->base)
synchronize_syncpt_base(job);
 
-   syncval = host1x_syncpt_incr_max(sp, user_syncpt_incrs);
-
host1x_hw_syncpt_assign_to_channel(host, sp, ch);
 
-   job->syncpt_end = syncval;
-
/* add a setclass for modules that require it */
if (job->class)
host1x_cdma_push(&ch->cdma,
 host1x_opcode_setclass(job->class, 0, 0),
 HOST1X_OPCODE_NOP);
 
+   /*
+* Ensure engine DMA is idle and set new stream ID. May increment
+* syncpt max.
+*/
+   host1x_channel_program_engine_streamid(job);
+
+   syncval = host1x_syncpt_incr_max(sp, user_syncpt_incrs);
+   job->syncpt_end = syncval;
+
submit_gathers(job, syncval - user_syncpt_incrs);
 
/* end CDMA submit & stash pinned hMems into sync queue */
diff --git a/drivers/gpu/host1x/hw/host1x06_hardware.h 
b/drivers/gpu/host1x/hw/host1x06_hardware.h
index 01a142a09800..5d515745eee7 100644
--- a/drivers/gpu/host1x/hw/host1x06_hardware.h
+++ b/drivers/gpu/host1x/hw/host1x06_hardware.h
@@ -127,6 +127,16 @@ static inline u32 host1x_opcode_gather_incr(unsigned 
offset, unsigned count)
return (6 << 28) | (offset << 16) | BIT(15) | BIT(14) | count;
 }
 
+static inline u32 host1x_opcode_setstreamid(unsigned streamid)
+{
+   return (7 << 28) | streamid;
+}
+
+static inline u32 host1x_opcode_setpayload(unsigned payload)
+{
+   return (9 << 28) | payload;
+}
+
 static inline u32 host1x_opcode_gather_wide(unsigned count)
 {
return (12 << 28) | count;
diff --git a/drivers/gpu/host1x/hw/host1x07_hardware.h 
b/drivers/gpu/host1x/hw/host1x07_hardware.h
index e6582172ebfd..82c0cc9bb0b5 100644
--- a/drivers/gpu/host1x/hw/host1x07_hardware.h
+++ b/drivers/gpu/host1x/hw/host1x07_hardware.h
@@ -127,6 +127,16 @@ static inline u32 host1x_opcode_gather_incr(unsigned 
offset, unsigned count)
return (6 << 28) | (offset << 16) | BIT(15) | BIT(14) | count;
 }
 
+static inline u32 host1x_opcode_setstreamid(unsigned streamid)
+{
+   return (7 << 28) | streamid;
+}
+
+static inline u32 host1x_opcode_setpayload(unsigned payload)
+{
+   return (9 << 28) | payload;
+}
+
 static inline u32 host1x_opcode_gather_wide(unsigned count)
 {
return (12 << 28) | count;
diff --git a/include/linux/host1x.h b/include/linux/host1x.h
index f3073738564a..eb0ca304ca92 100644
--- a/include/linux/host1x.h
+++ b/include/linux/host1x.h
@@ -277,6 +277,10 @@ struct host1x_job {
 
/* 

[PATCH v2 6/8] drm/tegra: falcon: Set DMACTX field on DMA transactions

2021-09-16 Thread Mikko Perttunen via iommu
The DMACTX field determines which context, as specified in the
TRANSCFG register, is used. While during boot it doesn't matter
which is used, later on it matters and this value is reused by
the firmware.

Signed-off-by: Mikko Perttunen 
---
 drivers/gpu/drm/tegra/falcon.c | 8 
 drivers/gpu/drm/tegra/falcon.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/tegra/falcon.c b/drivers/gpu/drm/tegra/falcon.c
index 223ab2ceb7e6..8bdb72f08f58 100644
--- a/drivers/gpu/drm/tegra/falcon.c
+++ b/drivers/gpu/drm/tegra/falcon.c
@@ -48,6 +48,14 @@ static int falcon_copy_chunk(struct falcon *falcon,
if (target == FALCON_MEMORY_IMEM)
cmd |= FALCON_DMATRFCMD_IMEM;
 
+   /*
+* Use second DMA context (i.e. the one for firmware). Strictly
+* speaking, at this point both DMA contexts point to the firmware
+* stream ID, but this register's value will be reused by the firmware
+* for later DMA transactions, so we need to use the correct value.
+*/
+   cmd |= FALCON_DMATRFCMD_DMACTX(1);
+
falcon_writel(falcon, offset, FALCON_DMATRFMOFFS);
falcon_writel(falcon, base, FALCON_DMATRFFBOFFS);
falcon_writel(falcon, cmd, FALCON_DMATRFCMD);
diff --git a/drivers/gpu/drm/tegra/falcon.h b/drivers/gpu/drm/tegra/falcon.h
index c56ee32d92ee..1955cf11a8a6 100644
--- a/drivers/gpu/drm/tegra/falcon.h
+++ b/drivers/gpu/drm/tegra/falcon.h
@@ -50,6 +50,7 @@
 #define FALCON_DMATRFCMD_IDLE  (1 << 1)
 #define FALCON_DMATRFCMD_IMEM  (1 << 4)
 #define FALCON_DMATRFCMD_SIZE_256B (6 << 8)
+#define FALCON_DMATRFCMD_DMACTX(v) (((v) & 0x7) << 12)
 
 #define FALCON_DMATRFFBOFFS0x111c
 
-- 
2.32.0



[PATCH v2 7/8] drm/tegra: vic: Implement get_streamid_offset

2021-09-16 Thread Mikko Perttunen via iommu
Implement the get_streamid_offset required for supporting context
isolation. Since old firmware cannot support context isolation
without hacks that we don't want to implement, check the firmware
binary to see if context isolation should be enabled.

Signed-off-by: Mikko Perttunen 
---
 drivers/gpu/drm/tegra/vic.c | 38 +
 1 file changed, 38 insertions(+)

diff --git a/drivers/gpu/drm/tegra/vic.c b/drivers/gpu/drm/tegra/vic.c
index c02010ff2b7f..e7cc79906296 100644
--- a/drivers/gpu/drm/tegra/vic.c
+++ b/drivers/gpu/drm/tegra/vic.c
@@ -37,6 +37,8 @@ struct vic {
struct clk *clk;
struct reset_control *rst;
 
+   bool can_use_context;
+
/* Platform configuration */
const struct vic_config *config;
 };
@@ -216,6 +218,7 @@ static int vic_load_firmware(struct vic *vic)
 {
struct host1x_client *client = &vic->client.base;
struct tegra_drm *tegra = vic->client.drm;
+   u32 fce_bin_data_offset;
dma_addr_t iova;
size_t size;
void *virt;
@@ -264,6 +267,25 @@ static int vic_load_firmware(struct vic *vic)
vic->falcon.firmware.phys = phys;
}
 
+   /*
+* Check if firmware is new enough to not require mapping firmware
+* to data buffer domains.
+*/
+   fce_bin_data_offset = *(u32 *)(virt + VIC_UCODE_FCE_DATA_OFFSET);
+
+   if (!vic->config->supports_sid) {
+   vic->can_use_context = false;
+   } else if (fce_bin_data_offset != 0x0 && fce_bin_data_offset != 
0xa5a5a5a5) {
+   /*
+* Firmware will access FCE through STREAMID0, so context
+* isolation cannot be used.
+*/
+   vic->can_use_context = false;
+   dev_warn_once(vic->dev, "context isolation disabled due to old 
firmware\n");
+   } else {
+   vic->can_use_context = true;
+   }
+
return 0;
 
 cleanup:
@@ -353,10 +375,26 @@ static void vic_close_channel(struct tegra_drm_context 
*context)
pm_runtime_put(vic->dev);
 }
 
+static int vic_get_streamid_offset(struct tegra_drm_client *client)
+{
+   struct vic *vic = to_vic(client);
+   int err;
+
+   err = vic_load_firmware(vic);
+   if (err < 0)
+   return err;
+
+   if (vic->can_use_context)
+   return 0x30;
+   else
+   return -ENOTSUPP;
+}
+
 static const struct tegra_drm_client_ops vic_ops = {
.open_channel = vic_open_channel,
.close_channel = vic_close_channel,
.submit = tegra_drm_submit,
+   .get_streamid_offset = vic_get_streamid_offset,
 };
 
 #define NVIDIA_TEGRA_124_VIC_FIRMWARE "nvidia/tegra124/vic03_ucode.bin"
-- 
2.32.0



[PATCH v2 5/8] arm64: tegra: Add Host1x context stream IDs on Tegra186+

2021-09-16 Thread Mikko Perttunen via iommu
Add Host1x context stream IDs on systems that support Host1x context
isolation. Host1x and attached engines can use these stream IDs to
allow isolation between memory used by different processes.

The specified stream IDs must match those configured by the hypervisor,
if one is present.

Signed-off-by: Mikko Perttunen 
---
v2:
* Added context devices on T194.
* Use iommu-map instead of custom property.
---
 arch/arm64/boot/dts/nvidia/tegra186.dtsi | 12 
 arch/arm64/boot/dts/nvidia/tegra194.dtsi | 12 
 2 files changed, 24 insertions(+)

diff --git a/arch/arm64/boot/dts/nvidia/tegra186.dtsi 
b/arch/arm64/boot/dts/nvidia/tegra186.dtsi
index 065185bd65ed..71571c29c7ae 100644
--- a/arch/arm64/boot/dts/nvidia/tegra186.dtsi
+++ b/arch/arm64/boot/dts/nvidia/tegra186.dtsi
@@ -1270,6 +1270,18 @@ host1x@13e0 {
 
iommus = <&smmu TEGRA186_SID_HOST1X>;
 
+   memory-contexts {
+   iommu-map = <
+   0 &smmu TEGRA186_SID_HOST1X_CTX0 1
+   1 &smmu TEGRA186_SID_HOST1X_CTX1 1
+   2 &smmu TEGRA186_SID_HOST1X_CTX2 1
+   3 &smmu TEGRA186_SID_HOST1X_CTX3 1
+   4 &smmu TEGRA186_SID_HOST1X_CTX4 1
+   5 &smmu TEGRA186_SID_HOST1X_CTX5 1
+   6 &smmu TEGRA186_SID_HOST1X_CTX6 1
+   7 &smmu TEGRA186_SID_HOST1X_CTX7 1>;
+   };
+
dpaux1: dpaux@1504 {
compatible = "nvidia,tegra186-dpaux";
reg = <0x1504 0x1>;
diff --git a/arch/arm64/boot/dts/nvidia/tegra194.dtsi 
b/arch/arm64/boot/dts/nvidia/tegra194.dtsi
index 5788735ef968..abcdc42614a6 100644
--- a/arch/arm64/boot/dts/nvidia/tegra194.dtsi
+++ b/arch/arm64/boot/dts/nvidia/tegra194.dtsi
@@ -1412,6 +1412,18 @@ host1x@13e0 {
interconnect-names = "dma-mem";
iommus = <&smmu TEGRA194_SID_HOST1X>;
 
+   memory-contexts {
+   iommu-map = <
+   0 &smmu TEGRA194_SID_HOST1X_CTX0 1
+   1 &smmu TEGRA194_SID_HOST1X_CTX1 1
+   2 &smmu TEGRA194_SID_HOST1X_CTX2 1
+   3 &smmu TEGRA194_SID_HOST1X_CTX3 1
+   4 &smmu TEGRA194_SID_HOST1X_CTX4 1
+   5 &smmu TEGRA194_SID_HOST1X_CTX5 1
+   6 &smmu TEGRA194_SID_HOST1X_CTX6 1
+   7 &smmu TEGRA194_SID_HOST1X_CTX7 1>;
+   };
+
nvdec@1514 {
compatible = "nvidia,tegra194-nvdec";
reg = <0x1514 0x0004>;
-- 
2.32.0



[PATCH v2 8/8] drm/tegra: Support context isolation

2021-09-16 Thread Mikko Perttunen via iommu
For engines that support context isolation, allocate a context when
opening a channel, and set up stream ID offset and context fields
when submitting a job.

Signed-off-by: Mikko Perttunen 
---
 drivers/gpu/drm/tegra/drm.h|  2 ++
 drivers/gpu/drm/tegra/submit.c | 13 +
 drivers/gpu/drm/tegra/uapi.c   | 34 --
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/tegra/drm.h b/drivers/gpu/drm/tegra/drm.h
index fc0a19554eac..717e9f81ee1f 100644
--- a/drivers/gpu/drm/tegra/drm.h
+++ b/drivers/gpu/drm/tegra/drm.h
@@ -80,6 +80,7 @@ struct tegra_drm_context {
 
/* Only used by new UAPI. */
struct xarray mappings;
+   struct host1x_context *memory_context;
 };
 
 struct tegra_drm_client_ops {
@@ -91,6 +92,7 @@ struct tegra_drm_client_ops {
int (*submit)(struct tegra_drm_context *context,
  struct drm_tegra_submit *args, struct drm_device *drm,
  struct drm_file *file);
+   int (*get_streamid_offset)(struct tegra_drm_client *client);
 };
 
 int tegra_drm_submit(struct tegra_drm_context *context,
diff --git a/drivers/gpu/drm/tegra/submit.c b/drivers/gpu/drm/tegra/submit.c
index 776f825df52f..b6008246bebe 100644
--- a/drivers/gpu/drm/tegra/submit.c
+++ b/drivers/gpu/drm/tegra/submit.c
@@ -469,6 +469,9 @@ static void release_job(struct host1x_job *job)
struct tegra_drm_submit_data *job_data = job->user_data;
u32 i;
 
+   if (job->context)
+   host1x_context_put(job->context);
+
for (i = 0; i < job_data->num_used_mappings; i++)
tegra_drm_mapping_put(job_data->used_mappings[i].mapping);
 
@@ -572,6 +575,16 @@ int tegra_drm_ioctl_channel_submit(struct drm_device *drm, 
void *data,
job->release = release_job;
job->timeout = 1;
 
+   if (context->memory_context && 
context->client->ops->get_streamid_offset) {
+   int offset = 
context->client->ops->get_streamid_offset(context->client);
+
+   if (offset >= 0) {
+   job->context = context->memory_context;
+   job->engine_streamid_offset = offset;
+   host1x_context_get(job->context);
+   }
+   }
+
/*
 * job_data is now part of job reference counting, so don't release
 * it from here.
diff --git a/drivers/gpu/drm/tegra/uapi.c b/drivers/gpu/drm/tegra/uapi.c
index 690a339c52ec..bfded46b974a 100644
--- a/drivers/gpu/drm/tegra/uapi.c
+++ b/drivers/gpu/drm/tegra/uapi.c
@@ -37,6 +37,9 @@ static void tegra_drm_channel_context_close(struct 
tegra_drm_context *context)
struct tegra_drm_mapping *mapping;
unsigned long id;
 
+   if (context->memory_context)
+   host1x_context_put(context->memory_context);
+
xa_for_each(&context->mappings, id, mapping)
tegra_drm_mapping_put(mapping);
 
@@ -76,6 +79,7 @@ static struct tegra_drm_client *tegra_drm_find_client(struct 
tegra_drm *tegra, u
 
 int tegra_drm_ioctl_channel_open(struct drm_device *drm, void *data, struct 
drm_file *file)
 {
+   struct host1x *host = tegra_drm_to_host1x(drm->dev_private);
struct tegra_drm_file *fpriv = file->driver_priv;
struct tegra_drm *tegra = drm->dev_private;
struct drm_tegra_channel_open *args = data;
@@ -106,10 +110,29 @@ int tegra_drm_ioctl_channel_open(struct drm_device *drm, 
void *data, struct drm_
}
}
 
+   /* Only allocate context if the engine supports context isolation. */
+   if (client->ops->get_streamid_offset &&
+   client->ops->get_streamid_offset(client) >= 0) {
+   context->memory_context =
+   host1x_context_alloc(host, get_task_pid(current, 
PIDTYPE_TGID));
+   if (IS_ERR(context->memory_context)) {
+   if (PTR_ERR(context->memory_context) != -EOPNOTSUPP) {
+   err = PTR_ERR(context->memory_context);
+   goto put_channel;
+   } else {
+   /*
+* OK, HW does not support contexts or contexts
+* are disabled.
+*/
+   context->memory_context = NULL;
+   }
+   }
+   }
+
err = xa_alloc(&fpriv->contexts, &args->context, context, XA_LIMIT(1, 
U32_MAX),
   GFP_KERNEL);
if (err < 0)
-   goto put_channel;
+   goto put_memctx;
 
context->client = client;
xa_init_flags(&context->mappings, XA_FLAGS_ALLOC1);
@@ -122,6 +145,9 @@ int tegra_drm_ioctl_channel_open(struct drm_device *drm, 
void *data, struct drm_
 
return 0;
 
+put_memctx:
+   if (context->memory_context)
+   host1x_context_put(context->memory_context);
 put_ch

Re: [PATCH v3 0/8] Implement generic cc_platform_has() helper function

2021-09-16 Thread Borislav Petkov
On Wed, Sep 15, 2021 at 10:26:06AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> I have an Intel variant patch (please check the following patch). But it includes
> TDX changes as well. Shall I move TDX changes to a different patch and just
> create a separate patch for adding intel_cc_platform_has()?

Yes, please, so that I can expedite that stuff separately and so that it
can go in early in order for future work to be based on top of it.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


[PATCH v2 1/8] gpu: host1x: Add context bus

2021-09-16 Thread Mikko Perttunen via iommu
The context bus is a "dummy" bus that contains struct devices that
correspond to IOMMU contexts assigned through Host1x to processes.

Even when host1x itself is built as a module, the bus is registered
in built-in code so that the built-in ARM SMMU driver is able to
reference it.

Signed-off-by: Mikko Perttunen 
---
 drivers/gpu/Makefile   |  3 +--
 drivers/gpu/host1x/Kconfig |  5 +
 drivers/gpu/host1x/Makefile|  1 +
 drivers/gpu/host1x/context_bus.c   | 31 ++
 include/linux/host1x_context_bus.h | 15 +++
 5 files changed, 53 insertions(+), 2 deletions(-)
 create mode 100644 drivers/gpu/host1x/context_bus.c
 create mode 100644 include/linux/host1x_context_bus.h

diff --git a/drivers/gpu/Makefile b/drivers/gpu/Makefile
index 835c88318cec..8997f0096545 100644
--- a/drivers/gpu/Makefile
+++ b/drivers/gpu/Makefile
@@ -2,7 +2,6 @@
 # drm/tegra depends on host1x, so if both drivers are built-in care must be
 # taken to initialize them in the correct order. Link order is the only way
 # to ensure this currently.
-obj-$(CONFIG_TEGRA_HOST1X) += host1x/
-obj-y  += drm/ vga/
+obj-y  += host1x/ drm/ vga/
 obj-$(CONFIG_IMX_IPUV3_CORE)   += ipu-v3/
 obj-$(CONFIG_TRACE_GPU_MEM)+= trace/
diff --git a/drivers/gpu/host1x/Kconfig b/drivers/gpu/host1x/Kconfig
index 6dab94adf25e..8546dde3acc8 100644
--- a/drivers/gpu/host1x/Kconfig
+++ b/drivers/gpu/host1x/Kconfig
@@ -1,7 +1,12 @@
 # SPDX-License-Identifier: GPL-2.0-only
+
+config TEGRA_HOST1X_CONTEXT_BUS
+   bool
+
 config TEGRA_HOST1X
tristate "NVIDIA Tegra host1x driver"
depends on ARCH_TEGRA || (ARM && COMPILE_TEST)
+   select TEGRA_HOST1X_CONTEXT_BUS
select IOMMU_IOVA
help
  Driver for the NVIDIA Tegra host1x hardware.
diff --git a/drivers/gpu/host1x/Makefile b/drivers/gpu/host1x/Makefile
index d2b6f7de0498..c891a3e33844 100644
--- a/drivers/gpu/host1x/Makefile
+++ b/drivers/gpu/host1x/Makefile
@@ -18,3 +18,4 @@ host1x-y = \
hw/host1x07.o
 
 obj-$(CONFIG_TEGRA_HOST1X) += host1x.o
+obj-$(CONFIG_TEGRA_HOST1X_CONTEXT_BUS) += context_bus.o
diff --git a/drivers/gpu/host1x/context_bus.c b/drivers/gpu/host1x/context_bus.c
new file mode 100644
index ..2625914f3c7d
--- /dev/null
+++ b/drivers/gpu/host1x/context_bus.c
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021, NVIDIA Corporation.
+ */
+
+#include 
+#include 
+
+struct bus_type host1x_context_device_bus_type = {
+   .name = "host1x-context",
+};
+EXPORT_SYMBOL(host1x_context_device_bus_type);
+
+static int __init host1x_context_device_bus_init(void)
+{
+   int err;
+
+   if (!of_machine_is_compatible("nvidia,tegra186") &&
+   !of_machine_is_compatible("nvidia,tegra194") &&
+   !of_machine_is_compatible("nvidia,tegra234"))
+   return 0;
+
+   err = bus_register(&host1x_context_device_bus_type);
+   if (err < 0) {
+   pr_err("bus type registration failed: %d\n", err);
+   return err;
+   }
+
+   return 0;
+}
+postcore_initcall(host1x_context_device_bus_init);
diff --git a/include/linux/host1x_context_bus.h 
b/include/linux/host1x_context_bus.h
new file mode 100644
index ..72462737a6db
--- /dev/null
+++ b/include/linux/host1x_context_bus.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2021, NVIDIA Corporation. All rights reserved.
+ */
+
+#ifndef __LINUX_HOST1X_CONTEXT_BUS_H
+#define __LINUX_HOST1X_CONTEXT_BUS_H
+
+#include 
+
+#ifdef CONFIG_TEGRA_HOST1X_CONTEXT_BUS
+extern struct bus_type host1x_context_device_bus_type;
+#endif
+
+#endif
-- 
2.32.0



[PATCH v2 0/8] Host1x context isolation support

2021-09-16 Thread Mikko Perttunen via iommu
Hi all,

***
New in v2:

Added support for Tegra194
Use standard iommu-map property instead of custom mechanism
***

this series adds support for Host1x 'context isolation'. Since
when programming engines through Host1x, userspace can program in
any addresses it wants, we need some way to isolate the engines'
memory spaces. Traditionally this has either been done imperfectly
with a single shared IOMMU domain, or by copying and verifying the
programming command stream at submit time (Host1x firewall).

Since Tegra186 there is a privileged (only usable by kernel)
Host1x opcode that allows setting the stream ID sent by the engine
to the SMMU. So, by allocating a number of context banks and stream
IDs for this purpose, and using this opcode at the beginning of
each job, we can implement isolation. Due to the limited number of
context banks, each process gets its own context rather than each
channel getting one.

This feature also allows sharing engines among multiple VMs when
used with Host1x's hardware virtualization support - up to 8 VMs
can be configured with a subset of allowed stream IDs, enforced
at hardware level.

To implement this, this series adds a new host1x context bus, which
will contain the 'struct device's corresponding to each context
bank / stream ID, changes to device tree and SMMU code to allow
registering the devices and using the bus, as well as the Host1x
stream ID programming code and support in TegraDRM.

Device tree bindings are not updated yet pending consensus that the
proposed changes make sense.

Thanks,
Mikko

Mikko Perttunen (8):
  gpu: host1x: Add context bus
  gpu: host1x: Add context device management code
  gpu: host1x: Program context stream ID on submission
  iommu/arm-smmu: Attach to host1x context device bus
  arm64: tegra: Add Host1x context stream IDs on Tegra186+
  drm/tegra: falcon: Set DMACTX field on DMA transactions
  drm/tegra: vic: Implement get_streamid_offset
  drm/tegra: Support context isolation

 arch/arm64/boot/dts/nvidia/tegra186.dtsi  |  12 ++
 arch/arm64/boot/dts/nvidia/tegra194.dtsi  |  12 ++
 drivers/gpu/Makefile  |   3 +-
 drivers/gpu/drm/tegra/drm.h   |   2 +
 drivers/gpu/drm/tegra/falcon.c|   8 +
 drivers/gpu/drm/tegra/falcon.h|   1 +
 drivers/gpu/drm/tegra/submit.c|  13 ++
 drivers/gpu/drm/tegra/uapi.c  |  34 -
 drivers/gpu/drm/tegra/vic.c   |  38 +
 drivers/gpu/host1x/Kconfig|   5 +
 drivers/gpu/host1x/Makefile   |   2 +
 drivers/gpu/host1x/context.c  | 174 ++
 drivers/gpu/host1x/context.h  |  27 
 drivers/gpu/host1x/context_bus.c  |  31 
 drivers/gpu/host1x/dev.c  |  12 +-
 drivers/gpu/host1x/dev.h  |   2 +
 drivers/gpu/host1x/hw/channel_hw.c|  52 ++-
 drivers/gpu/host1x/hw/host1x06_hardware.h |  10 ++
 drivers/gpu/host1x/hw/host1x07_hardware.h |  10 ++
 drivers/iommu/arm/arm-smmu/arm-smmu.c |  13 ++
 include/linux/host1x.h|  21 +++
 include/linux/host1x_context_bus.h|  15 ++
 22 files changed, 488 insertions(+), 9 deletions(-)
 create mode 100644 drivers/gpu/host1x/context.c
 create mode 100644 drivers/gpu/host1x/context.h
 create mode 100644 drivers/gpu/host1x/context_bus.c
 create mode 100644 include/linux/host1x_context_bus.h

-- 
2.32.0



Re: [PATCH V5 12/12] net: netvsc: Add Isolation VM support for netvsc driver

2021-09-16 Thread Tianyu Lan

On 9/16/2021 12:21 AM, Michael Kelley wrote:

I think you are proposing this approach to allocating memory for the send
and receive buffers so that you can avoid having two virtual mappings for
the memory, per comments from Christoph Hellwig.  But overall, the approach
seems a bit complex and I wonder if it is worth it.  If allocating large 
contiguous
chunks of physical memory is successful, then there is some memory savings
in that the data structures needed to keep track of the physical pages is
smaller than the equivalent page tables might be.  But if you have to revert
to allocating individual pages, then the memory savings is reduced.



Yes, this version follows the idea from Christoph in the previous
discussion (https://lkml.org/lkml/2021/9/2/112).
This patch shows the implementation and checks whether this is the right
direction.



Ultimately, the list of actual PFNs has to be kept somewhere.  Another approach
would be to do the reverse of what hv_map_memory() from the v4 patch
series does.  I.e., you could do virt_to_phys() on each virtual address that
maps above VTOM, and subtract out the shared_gpa_boundary to get the
list of actual PFNs that need to be freed.


virt_to_phys() doesn't work for virtual addresses returned by
vmap()/vmap_pfn() (just as it doesn't work for addresses returned by
vmalloc()). The PFNs above vTOM don't have struct page backing, and
vmap_pfn() populates the PFNs directly in the PTEs (please see
vmap_pfn_apply()). So it's not easy to convert the VA back to a PA.
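
For reference, a minimal sketch of the vmap_pfn() pattern being discussed
(illustrative only -- the helper name is made up and this is not the posted
patch):

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Map pages above the shared_gpa_boundary (vTOM) with vmap_pfn().  The
 * returned VA has no struct page backing, so the caller must keep the
 * pages (or a pfn list) around in order to free the memory later;
 * virt_to_phys() on this VA would not give back a usable physical address.
 */
static void *map_above_vtom(struct page **pages, unsigned int count,
			    u64 shared_gpa_boundary)
{
	unsigned long *pfns;
	unsigned int i;
	void *vaddr;

	pfns = kcalloc(count, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return NULL;

	for (i = 0; i < count; i++)
		pfns[i] = page_to_pfn(pages[i]) +
			  (shared_gpa_boundary >> PAGE_SHIFT);

	vaddr = vmap_pfn(pfns, count, PAGE_KERNEL);
	kfree(pfns);

	return vaddr;
}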


  This way you don't have two copies
of the list of PFNs -- one with and one without the shared_gpa_boundary added.
But it comes at the cost of additional code so that may not be a great idea.

I think what you have here works, and I don't have a clearly better solution
at the moment except perhaps to revert to the v4 solution and just have two
virtual mappings.  I'll keep thinking about it.  Maybe Christoph has other
thoughts.




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V5 12/12] net: netvsc: Add Isolation VM support for netvsc driver

2021-09-16 Thread Tianyu Lan

On 9/16/2021 12:46 AM, Haiyang Zhang wrote:

+   memset(vmap_pages, 0,
+  sizeof(*vmap_pages) * vmap_page_index);
+   vmap_page_index = 0;
+
+   for (j = 0; j < i; j++)
+   __free_pages(pages[j], alloc_unit);
+
+   kfree(pages);
+   alloc_unit = 1;

This is the case where a large enough contiguous physical memory chunk
could not be found.  But rather than dropping all the way down to single
pages, would it make sense to try something smaller, but not 1?  For
example, cut the alloc_unit in half and try again.  But I'm not sure of
all the implications.

I had the same question. But probably gradually decrementing uses too much
time?



This version is meant to propose the solution. If this is the right
direction, we may optimize it to retry with smaller sizes before falling
back to single pages.
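
As a rough illustration of that kind of fallback (names are made up; this is
just a sketch, not the patch itself):

#include <linux/gfp.h>
#include <linux/mm_types.h>

/*
 * Try progressively smaller contiguous chunks instead of dropping straight
 * from the large alloc_unit to single pages.  "order" is the page order of
 * the chunk; it is decremented on failure so the chunk size is halved on
 * each retry.
 */
static struct page *alloc_contig_chunk(unsigned int *order)
{
	struct page *page;

	for (;;) {
		page = alloc_pages(GFP_KERNEL | __GFP_ZERO, *order);
		if (page || *order == 0)
			return page;
		(*order)--;	/* halve the chunk size and retry */
	}
}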


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 4/8] powerpc/pseries/svm: Add a powerpc version of cc_platform_has()

2021-09-16 Thread Michael Ellerman
Christoph Hellwig  writes:
> On Wed, Sep 15, 2021 at 07:18:34PM +0200, Christophe Leroy wrote:
>> Could you please provide a more explicit explanation of why inlining such a
>> helper is considered bad practice and messy?
>
> Because now we get architectures to all subtly differ.  Look at the mess
> for ioremap and the ioremap* variants.
>
> The only good reason to allow for inlines is if they are used in a hot
> path.  Which cc_platform_has is not, especially not on powerpc.

Yes I agree, it's not a hot path so it doesn't really matter, which is
why I Acked it.

I think it is possible to do both: share the declaration across arches
but also give arches the flexibility to use an inline if they prefer; see
the patch below.

I'm not suggesting we actually do that for this series now, but I think
it would solve the problem if we ever needed to in future.

cheers


diff --git a/arch/powerpc/platforms/pseries/cc_platform.c 
b/arch/powerpc/include/asm/cc_platform.h
similarity index 74%
rename from arch/powerpc/platforms/pseries/cc_platform.c
rename to arch/powerpc/include/asm/cc_platform.h
index e8021af83a19..6285c3c385a6 100644
--- a/arch/powerpc/platforms/pseries/cc_platform.c
+++ b/arch/powerpc/include/asm/cc_platform.h
@@ -7,13 +7,10 @@
  * Author: Tom Lendacky 
  */
 
-#include 
 #include 
-
-#include 
 #include 
 
-bool cc_platform_has(enum cc_attr attr)
+static inline bool arch_cc_platform_has(enum cc_attr attr)
 {
switch (attr) {
case CC_ATTR_MEM_ENCRYPT:
@@ -23,4 +20,3 @@ bool cc_platform_has(enum cc_attr attr)
return false;
}
 }
-EXPORT_SYMBOL_GPL(cc_platform_has);
diff --git a/arch/x86/include/asm/cc_platform.h 
b/arch/x86/include/asm/cc_platform.h
new file mode 100644
index ..0a4220697043
--- /dev/null
+++ b/arch/x86/include/asm/cc_platform.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CC_PLATFORM_H_
+#define _ASM_X86_CC_PLATFORM_H_
+
+bool arch_cc_platform_has(enum cc_attr attr);
+
+#endif // _ASM_X86_CC_PLATFORM_H_
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 3c9bacd3c3f3..77e8c3465979 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -11,11 +11,11 @@
 #include 
 #include 
 
-bool cc_platform_has(enum cc_attr attr)
+bool arch_cc_platform_has(enum cc_attr attr)
 {
if (sme_me_mask)
return amd_cc_platform_has(attr);
 
return false;
 }
-EXPORT_SYMBOL_GPL(cc_platform_has);
+EXPORT_SYMBOL_GPL(arch_cc_platform_has);
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 253f3ea66cd8..f3306647c5d9 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -65,6 +65,8 @@ enum cc_attr {
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
 
+#include 
+
 /**
  * cc_platform_has() - Checks if the specified cc_attr attribute is active
  * @attr: Confidential computing attribute to check
@@ -77,7 +79,10 @@ enum cc_attr {
  * * TRUE  - Specified Confidential Computing attribute is active
  * * FALSE - Specified Confidential Computing attribute is not active
  */
-bool cc_platform_has(enum cc_attr attr);
+static inline bool cc_platform_has(enum cc_attr attr)
+{
+   return arch_cc_platform_has(attr);
+}
 
 #else  /* !CONFIG_ARCH_HAS_CC_PLATFORM */
 


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 2/9] ACPI/IORT: Add support for RMR node parsing

2021-09-16 Thread Jon Nettleton
On Thu, Sep 16, 2021 at 10:26 AM Shameerali Kolothum Thodi
 wrote:
>
>
>
> > -Original Message-
> > From: Jon Nettleton [mailto:j...@solid-run.com]
> > Sent: 16 September 2021 08:52
> > To: Shameerali Kolothum Thodi 
> > Cc: Robin Murphy ; Lorenzo Pieralisi
> > ; Laurentiu Tudor ;
> > linux-arm-kernel ; ACPI Devel Maling
> > List ; Linux IOMMU
> > ; Joerg Roedel ; Will
> > Deacon ; wanghuiqiang ;
> > Guohanjun (Hanjun Guo) ; Steven Price
> > ; Sami Mujawar ; Eric
> > Auger ; yangyicong 
> > Subject: Re: [PATCH v7 2/9] ACPI/IORT: Add support for RMR node parsing
> >
> > On Thu, Sep 16, 2021 at 9:26 AM Shameerali Kolothum Thodi
> >  wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Jon Nettleton [mailto:j...@solid-run.com]
> > > > Sent: 06 September 2021 20:51
> > > > To: Robin Murphy 
> > > > Cc: Lorenzo Pieralisi ; Shameerali
> > > > Kolothum Thodi ; Laurentiu
> > > > Tudor ; linux-arm-kernel
> > > > ; ACPI Devel Maling List
> > > > ; Linux IOMMU
> > > > ; Linuxarm ;
> > > > Joerg Roedel ; Will Deacon ;
> > > > wanghuiqiang ; Guohanjun (Hanjun Guo)
> > > > ; Steven Price ; Sami
> > > > Mujawar ; Eric Auger
> > ;
> > > > yangyicong 
> > > > Subject: Re: [PATCH v7 2/9] ACPI/IORT: Add support for RMR node
> > > > parsing
> > > >
> > > [...]
> > >
> > > > > >
> > > > > > On the prot value assignment based on the remapping flag, I'd
> > > > > > like to hear Robin/Joerg's opinion, I'd avoid being in a
> > > > > > situation where "normally" this would work but then we have to quirk
> > > > > > it.
> > > > > >
> > > > > > Is this a valid assumption _always_ ?
> > > > >
> > > > > No. Certainly applying IOMMU_CACHE without reference to the
> > > > > device's _CCA attribute or how CPUs may be accessing a shared
> > > > > buffer could lead to a loss of coherency. At worst, applying
> > > > > IOMMU_MMIO to a device-private buffer *could* cause the device to
> > > > > lose coherency with itself if the memory underlying the RMR may
> > > > > have allocated into system caches. Note that the expected use for
> > > > > non-remappable RMRs is the device holding some sort of long-lived
> > > > > private data in system RAM - the MSI doorbell trick is far more of a 
> > > > > niche
> > > > > hack really.
> > > > >
> > > > > At the very least I think we need to refer to the device's memory
> > > > > access properties here.
> > > > >
> > > > > Jon, Laurentiu - how do RMRs correspond to the EFI memory map on
> > > > > your firmware? I'm starting to think that as long as the
> > > > > underlying memory is described appropriately there then we should
> > > > > be able to infer correct attributes from the EFI memory type and 
> > > > > flags.
> > > >
> > > > The devices are all cache coherent and marked as _CCA, 1.  The
> > > > Memory regions are in the virt table as
> > ARM_MEMORY_REGION_ATTRIBUTE_DEVICE.
> > > >
> > > > The current chicken and egg problem we have is that during the
> > > > fsl-mc-bus initialization we call
> > > >
> > > > error = acpi_dma_configure_id(&pdev->dev, DEV_DMA_COHERENT,
> > > >   &mc_stream_id);
> > > >
> > > > which gets deferred because the SMMU has not been initialized yet.
> > > > Then we initialize the RMR tables but there is no device reference
> > > > there to be able to query device properties, only the stream id.
> > > > After the IORT tables are parsed and the SMMU is setup, on the
> > > > second device probe we associate everything based on the stream id
> > > > and the fsl-mc-bus device is able to claim its 1-1 DMA mappings.
> > >
> > > Can we solve this order problem by delaying the
> > > iommu_alloc_resv_region() to the iommu_dma_get_rmr_resv_regions(dev,
> > > list) ? We could invoke
> > > device_get_dma_attr() from there which I believe will return the _CCA
> > > attribute.
> > >
> > > Or is that still early to invoke that?
> >
> > That looks like it should work. Do we then also need to parse through the
> > VirtualMemoryTable matching the start and end addresses to determine the
> > other memory attributes like MMIO?
>
> Yes. But that looks tricky as I can't find that readily available on Arm, 
> like the
> efi_mem_attributes(). I will take a look.
>
> Please let me know if there is one or any other easy way to retrieve it.

Maybe we don't need to.  Maybe it is enough to just move
iommu_alloc_resv_region() and then set the IOMMU_CACHE flag
if type == IOMMU_RESV_DIRECT_RELAXABLE and _CCA == 1?
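
Something along these lines, as a rough sketch only (the helper name and
placement are illustrative, not part of the posted series; the caller would
apply it when building the IOMMU_RESV_DIRECT_RELAXABLE region):

#include <linux/device.h>
#include <linux/iommu.h>
#include <linux/property.h>

/*
 * Derive the prot for an RMR reserved region from the device's coherency;
 * device_get_dma_attr() reports the _CCA attribute on ACPI systems.
 */
static int rmr_resv_region_prot(struct device *dev)
{
	int prot = IOMMU_READ | IOMMU_WRITE;

	if (device_get_dma_attr(dev) == DEV_DMA_COHERENT)
		prot |= IOMMU_CACHE;

	return prot;
}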

-Jon

>
> Thanks,
> Shameer
>
> >
> > -Jon
> >
> > >
> > > Thanks,
> > > Shameer
> > >
> > > > cat /sys/kernel/iommu_groups/0/reserved_regions
> > > > 0x0100 0x10ff direct-relaxable
> > > > 0x0800 0x080f msi
> > > > 0x00080c00 0x00081bff direct-relaxable
> > > > 0x001c 0x001c001f direct-relaxable
> > > > 0x00208000 0x00209fff direct-relaxable
> > > >
> > > > -Jon
> > > >
> > > > >
> > > > > Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH V5 09/12] x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM

2021-09-16 Thread Tianyu Lan

On 9/15/2021 11:42 PM, Michael Kelley wrote:

@@ -196,13 +199,34 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem 
*mem, phys_addr_t start,
mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
mem->slots[i].alloc_size = 0;
}
+
+   if (set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT))
+   return -EFAULT;
+
+   /*
+* Map memory in the unencrypted physical address space when requested
+* (e.g. for Hyper-V AMD SEV-SNP Isolation VMs).
+*/
+   if (swiotlb_unencrypted_base) {
+   phys_addr_t paddr = __pa(vaddr) + swiotlb_unencrypted_base;

Nit:  Use "start" instead of "__pa(vaddr)" since "start" is already the needed
physical address.



Yes, "start" should be used here.




@@ -304,7 +332,7 @@ int
  swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
  {
struct io_tlb_mem *mem = &io_tlb_default_mem;
-   unsigned long bytes = nslabs << IO_TLB_SHIFT;
+   int ret;

if (swiotlb_force == SWIOTLB_NO_FORCE)
return 0;
@@ -318,8 +346,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
if (!mem->slots)
return -ENOMEM;

-   set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
-   swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
+   ret = swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
+   if (ret)

Before returning the error, free the pages obtained from the earlier call
to __get_free_pages()?



Yes, will fix.
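
A rough sketch of the fix (illustrative only, not the final patch): free the
slots array obtained from the earlier __get_free_pages() before bailing out.

	ret = swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
	if (ret) {
		/* undo the earlier __get_free_pages() allocation */
		free_pages((unsigned long)mem->slots,
			   get_order(array_size(sizeof(*mem->slots), nslabs)));
		mem->slots = NULL;
		return ret;
	}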

Thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V5 07/12] Drivers: hv: vmbus: Add SNP support for VMbus channel initiate message

2021-09-16 Thread Tianyu Lan

On 9/15/2021 11:41 PM, Michael Kelley wrote:

diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 42f3d9d123a1..560cba916d1d 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -240,6 +240,8 @@ struct vmbus_connection {
 * is child->parent notification
 */
struct hv_monitor_page *monitor_pages[2];
+   void *monitor_pages_original[2];
+   unsigned long monitor_pages_pa[2];

The type of this field really should be phys_addr_t.  In addition to
just making semantic sense, then it will match the return type from
virt_to_phys() and the input arg to memremap() since resource_size_t
is typedef'ed as phys_addr_t.



OK. Will update in the next version.
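
i.e. roughly (a sketch of the field types only, not the full struct;
phys_addr_t matches virt_to_phys() and memremap()):

	struct hv_monitor_page *monitor_pages[2];
	void *monitor_pages_original[2];
	phys_addr_t monitor_pages_pa[2];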

Thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb: set IO TLB segment size via cmdline

2021-09-16 Thread Roman Skakun
Hi Stefano,

> Also, Option 1 listed in the webpage seems to be a lot better. Any
> reason you can't do that? Because that option both solves the problem
> and increases performance.

Yes, Option 1 is probably more efficient.
But I use another platform under Xen that doesn't have the DMA adjustment
functionality. I linked that webpage only as an example to describe a
similar problem to the one I had.

Cheers,
Roman

Wed, 15 Sept 2021 at 04:51, Stefano Stabellini :

>
> On Tue, 14 Sep 2021, Christoph Hellwig wrote:
> > On Tue, Sep 14, 2021 at 05:29:07PM +0200, Jan Beulich wrote:
> > > I'm not convinced the swiotlb use describe there falls under "intended
> > > use" - mapping a 1280x720 framebuffer in a single chunk? (As an aside,
> > > the bottom of this page is also confusing, as following "Then we can
> > > confirm the modified swiotlb size in the boot log:" there is a log
> > > fragment showing the same original size of 64Mb.
> >
> > It doesn't.  We also do not add hacks to the kernel for whacky out
> > of tree modules.
>
> Also, Option 1 listed in the webpage seems to be a lot better. Any
> reason you can't do that? Because that option both solves the problem
> and increases performance.



--
Best Regards, Roman.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [PATCH v7 2/9] ACPI/IORT: Add support for RMR node parsing

2021-09-16 Thread Shameerali Kolothum Thodi



> -Original Message-
> From: Jon Nettleton [mailto:j...@solid-run.com]
> Sent: 16 September 2021 08:52
> To: Shameerali Kolothum Thodi 
> Cc: Robin Murphy ; Lorenzo Pieralisi
> ; Laurentiu Tudor ;
> linux-arm-kernel ; ACPI Devel Maling
> List ; Linux IOMMU
> ; Joerg Roedel ; Will
> Deacon ; wanghuiqiang ;
> Guohanjun (Hanjun Guo) ; Steven Price
> ; Sami Mujawar ; Eric
> Auger ; yangyicong 
> Subject: Re: [PATCH v7 2/9] ACPI/IORT: Add support for RMR node parsing
> 
> On Thu, Sep 16, 2021 at 9:26 AM Shameerali Kolothum Thodi
>  wrote:
> >
> >
> >
> > > -Original Message-
> > > From: Jon Nettleton [mailto:j...@solid-run.com]
> > > Sent: 06 September 2021 20:51
> > > To: Robin Murphy 
> > > Cc: Lorenzo Pieralisi ; Shameerali
> > > Kolothum Thodi ; Laurentiu
> > > Tudor ; linux-arm-kernel
> > > ; ACPI Devel Maling List
> > > ; Linux IOMMU
> > > ; Linuxarm ;
> > > Joerg Roedel ; Will Deacon ;
> > > wanghuiqiang ; Guohanjun (Hanjun Guo)
> > > ; Steven Price ; Sami
> > > Mujawar ; Eric Auger
> ;
> > > yangyicong 
> > > Subject: Re: [PATCH v7 2/9] ACPI/IORT: Add support for RMR node
> > > parsing
> > >
> > [...]
> >
> > > > >
> > > > > On the prot value assignment based on the remapping flag, I'd
> > > > > like to hear Robin/Joerg's opinion, I'd avoid being in a
> > > > > situation where "normally" this would work but then we have to quirk
> > > > > it.
> > > > >
> > > > > Is this a valid assumption _always_ ?
> > > >
> > > > No. Certainly applying IOMMU_CACHE without reference to the
> > > > device's _CCA attribute or how CPUs may be accessing a shared
> > > > buffer could lead to a loss of coherency. At worst, applying
> > > > IOMMU_MMIO to a device-private buffer *could* cause the device to
> > > > lose coherency with itself if the memory underlying the RMR may
> > > > have allocated into system caches. Note that the expected use for
> > > > non-remappable RMRs is the device holding some sort of long-lived
> > > > private data in system RAM - the MSI doorbell trick is far more of a 
> > > > niche
> > > > hack really.
> > > >
> > > > At the very least I think we need to refer to the device's memory
> > > > access properties here.
> > > >
> > > > Jon, Laurentiu - how do RMRs correspond to the EFI memory map on
> > > > your firmware? I'm starting to think that as long as the
> > > > underlying memory is described appropriately there then we should
> > > > be able to infer correct attributes from the EFI memory type and flags.
> > >
> > > The devices are all cache coherent and marked as _CCA, 1.  The
> > > Memory regions are in the virt table as
> ARM_MEMORY_REGION_ATTRIBUTE_DEVICE.
> > >
> > > The current chicken and egg problem we have is that during the
> > > fsl-mc-bus initialization we call
> > >
> > > error = acpi_dma_configure_id(&pdev->dev, DEV_DMA_COHERENT,
> > >   &mc_stream_id);
> > >
> > > which gets deferred because the SMMU has not been initialized yet.
> > > Then we initialize the RMR tables but there is no device reference
> > > there to be able to query device properties, only the stream id.
> > > After the IORT tables are parsed and the SMMU is setup, on the
> > > second device probe we associate everything based on the stream id
> > > and the fsl-mc-bus device is able to claim its 1-1 DMA mappings.
> >
> > Can we solve this order problem by delaying the
> > iommu_alloc_resv_region() to the iommu_dma_get_rmr_resv_regions(dev,
> > list) ? We could invoke
> > device_get_dma_attr() from there which I believe will return the _CCA
> attribute.
> >
> > Or is that still early to invoke that?
> 
> That looks like it should work. Do we then also need to parse through the
> VirtualMemoryTable matching the start and end addresses to determine the
> other memory attributes like MMIO?

Yes. But that looks tricky, as I can't find anything readily available on Arm
like efi_mem_attributes(). I will take a look.

Please let me know if there is one or any other easy way to retrieve it.

Thanks,
Shameer

> 
> -Jon
> 
> >
> > Thanks,
> > Shameer
> >
> > > cat /sys/kernel/iommu_groups/0/reserved_regions
> > > 0x0100 0x10ff direct-relaxable
> > > 0x0800 0x080f msi
> > > 0x00080c00 0x00081bff direct-relaxable
> > > 0x001c 0x001c001f direct-relaxable
> > > 0x00208000 0x00209fff direct-relaxable
> > >
> > > -Jon
> > >
> > > >
> > > > Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 2/9] ACPI/IORT: Add support for RMR node parsing

2021-09-16 Thread Jon Nettleton
On Thu, Sep 16, 2021 at 9:26 AM Shameerali Kolothum Thodi
 wrote:
>
>
>
> > -Original Message-
> > From: Jon Nettleton [mailto:j...@solid-run.com]
> > Sent: 06 September 2021 20:51
> > To: Robin Murphy 
> > Cc: Lorenzo Pieralisi ; Shameerali Kolothum Thodi
> > ; Laurentiu Tudor
> > ; linux-arm-kernel
> > ; ACPI Devel Maling List
> > ; Linux IOMMU
> > ; Linuxarm ;
> > Joerg Roedel ; Will Deacon ;
> > wanghuiqiang ; Guohanjun (Hanjun Guo)
> > ; Steven Price ; Sami
> > Mujawar ; Eric Auger ;
> > yangyicong 
> > Subject: Re: [PATCH v7 2/9] ACPI/IORT: Add support for RMR node parsing
> >
> [...]
>
> > > >
> > > > On the prot value assignment based on the remapping flag, I'd like
> > > > to hear Robin/Joerg's opinion, I'd avoid being in a situation where
> > > > "normally" this would work but then we have to quirk it.
> > > >
> > > > Is this a valid assumption _always_ ?
> > >
> > > No. Certainly applying IOMMU_CACHE without reference to the device's
> > > _CCA attribute or how CPUs may be accessing a shared buffer could lead
> > > to a loss of coherency. At worst, applying IOMMU_MMIO to a
> > > device-private buffer *could* cause the device to lose coherency with
> > > itself if the memory underlying the RMR may have allocated into system
> > > caches. Note that the expected use for non-remappable RMRs is the
> > > device holding some sort of long-lived private data in system RAM -
> > > the MSI doorbell trick is far more of a niche hack really.
> > >
> > > At the very least I think we need to refer to the device's memory
> > > access properties here.
> > >
> > > Jon, Laurentiu - how do RMRs correspond to the EFI memory map on your
> > > firmware? I'm starting to think that as long as the underlying memory
> > > is described appropriately there then we should be able to infer
> > > correct attributes from the EFI memory type and flags.
> >
> > The devices are all cache coherent and marked as _CCA, 1.  The Memory
> > regions are in the virt table as ARM_MEMORY_REGION_ATTRIBUTE_DEVICE.
> >
> > The current chicken and egg problem we have is that during the fsl-mc-bus
> > initialization we call
> >
> > error = acpi_dma_configure_id(&pdev->dev, DEV_DMA_COHERENT,
> >   &mc_stream_id);
> >
> > which gets deferred because the SMMU has not been initialized yet. Then we
> > initialize the RMR tables but there is no device reference there to be able 
> > to
> > query device properties, only the stream id.  After the IORT tables are 
> > parsed
> > and the SMMU is setup, on the second device probe we associate everything
> > based on the stream id and the fsl-mc-bus device is able to claim its 1-1 
> > DMA
> > mappings.
>
> Can we solve this order problem by delaying the iommu_alloc_resv_region()
> to the iommu_dma_get_rmr_resv_regions(dev, list) ? We could invoke
> device_get_dma_attr() from there which I believe will return the _CCA 
> attribute.
>
> Or is that still early to invoke that?

That looks like it should work. Do we then also need to parse through the
VirtualMemoryTable matching the start and end addresses to determine the
other memory attributes like MMIO?

-Jon

>
> Thanks,
> Shameer
>
> > cat /sys/kernel/iommu_groups/0/reserved_regions
> > 0x0100 0x10ff direct-relaxable
> > 0x0800 0x080f msi
> > 0x00080c00 0x00081bff direct-relaxable
> > 0x001c 0x001c001f direct-relaxable
> > 0x00208000 0x00209fff direct-relaxable
> >
> > -Jon
> >
> > >
> > > Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 4/8] powerpc/pseries/svm: Add a powerpc version of cc_platform_has()

2021-09-16 Thread Christoph Hellwig
On Wed, Sep 15, 2021 at 07:18:34PM +0200, Christophe Leroy wrote:
> Could you please provide a more explicit explanation of why inlining such a
> helper is considered bad practice and messy?

Because now we get architectures to all subtly differ.  Look at the mess
for ioremap and the ioremap* variants.

The only good reason to allow for inlines is if they are used in a hot
path.  Which cc_platform_has is not, especially not on powerpc.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v7 2/9] ACPI/IORT: Add support for RMR node parsing

2021-09-16 Thread Shameerali Kolothum Thodi



> -Original Message-
> From: Jon Nettleton [mailto:j...@solid-run.com]
> Sent: 06 September 2021 20:51
> To: Robin Murphy 
> Cc: Lorenzo Pieralisi ; Shameerali Kolothum Thodi
> ; Laurentiu Tudor
> ; linux-arm-kernel
> ; ACPI Devel Maling List
> ; Linux IOMMU
> ; Linuxarm ;
> Joerg Roedel ; Will Deacon ;
> wanghuiqiang ; Guohanjun (Hanjun Guo)
> ; Steven Price ; Sami
> Mujawar ; Eric Auger ;
> yangyicong 
> Subject: Re: [PATCH v7 2/9] ACPI/IORT: Add support for RMR node parsing
> 
[...]

> > >
> > > On the prot value assignment based on the remapping flag, I'd like
> > > to hear Robin/Joerg's opinion, I'd avoid being in a situation where
> > > "normally" this would work but then we have to quirk it.
> > >
> > > Is this a valid assumption _always_ ?
> >
> > No. Certainly applying IOMMU_CACHE without reference to the device's
> > _CCA attribute or how CPUs may be accessing a shared buffer could lead
> > to a loss of coherency. At worst, applying IOMMU_MMIO to a
> > device-private buffer *could* cause the device to lose coherency with
> > itself if the memory underlying the RMR may have allocated into system
> > caches. Note that the expected use for non-remappable RMRs is the
> > device holding some sort of long-lived private data in system RAM -
> > the MSI doorbell trick is far more of a niche hack really.
> >
> > At the very least I think we need to refer to the device's memory
> > access properties here.
> >
> > Jon, Laurentiu - how do RMRs correspond to the EFI memory map on your
> > firmware? I'm starting to think that as long as the underlying memory
> > is described appropriately there then we should be able to infer
> > correct attributes from the EFI memory type and flags.
> 
> The devices are all cache coherent and marked as _CCA, 1.  The Memory
> regions are in the virt table as ARM_MEMORY_REGION_ATTRIBUTE_DEVICE.
> 
> The current chicken and egg problem we have is that during the fsl-mc-bus
> initialization we call
> 
> error = acpi_dma_configure_id(&pdev->dev, DEV_DMA_COHERENT,
>   &mc_stream_id);
> 
> which gets deferred because the SMMU has not been initialized yet. Then we
> initialize the RMR tables but there is no device reference there to be able to
> query device properties, only the stream id.  After the IORT tables are parsed
> and the SMMU is setup, on the second device probe we associate everything
> based on the stream id and the fsl-mc-bus device is able to claim its 1-1 DMA
> mappings.

Can we solve this order problem by delaying the iommu_alloc_resv_region()
to the iommu_dma_get_rmr_resv_regions(dev, list) ? We could invoke
device_get_dma_attr() from there which I believe will return the _CCA attribute.

Or is that still early to invoke that?

Thanks,
Shameer

> cat /sys/kernel/iommu_groups/0/reserved_regions
> 0x0100 0x10ff direct-relaxable
> 0x0800 0x080f msi
> 0x00080c00 0x00081bff direct-relaxable
> 0x001c 0x001c001f direct-relaxable
> 0x00208000 0x00209fff direct-relaxable
> 
> -Jon
> 
> >
> > Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu