[PATCH v2] iommu/arm-smmu-v3: limit reporting of MSI allocation failures

2018-01-20 Thread Nate Watterson
Currently, the arm-smmu-v3 driver expects to allocate MSIs for all SMMUs
with FEAT_MSI set. This results in unwarranted "failed to allocate MSIs"
warnings being printed on systems where the FW was either deliberately
configured to force the use of SMMU wired interrupts, or is altogether
incapable of describing the SMMU MSI topology (ACPI IORT prior to rev. C).

Remedy this by checking msi_domain before attempting to allocate SMMU
MSIs.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/iommu/arm-smmu-v3.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 744592d..00de028 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2328,10 +2328,15 @@ static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
if (!(smmu->features & ARM_SMMU_FEAT_MSI))
return;
 
+   if (!dev->msi_domain) {
+   dev_info(smmu->dev, "msi_domain absent - falling back to wired irqs\n");
+   return;
+   }
+
/* Allocate MSIs for evtq, gerror and priq. Ignore cmdq */
ret = platform_msi_domain_alloc_irqs(dev, nvec, arm_smmu_write_msi_msg);
if (ret) {
-   dev_warn(dev, "failed to allocate MSIs\n");
+   dev_warn(dev, "failed to allocate MSIs - falling back to wired irqs\n");
return;
}
 
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



[PATCH] iommu/arm-smmu-v3: suppress MSI allocation failure message

2018-01-17 Thread Nate Watterson
From: Sinan Kaya <ok...@codeaurora.org>

Even though QDF2400 supports MSI interrupts with SMMUv3, MSI support is
not enabled in the ACPI FW in order to preserve compatibility with older
kernel versions, so the code emits a warning message during boot.

This causes some test tools to record a false warning and creates
support issues.

Reduce the message level to info instead.

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 744592d..2118fda 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2331,7 +2331,7 @@ static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
/* Allocate MSIs for evtq, gerror and priq. Ignore cmdq */
ret = platform_msi_domain_alloc_irqs(dev, nvec, arm_smmu_write_msi_msg);
if (ret) {
-   dev_warn(dev, "failed to allocate MSIs\n");
+   dev_info(dev, "failed to allocate MSIs\n");
return;
}
 
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



Re: [PATCH v2 0/3] arm-smmu: performance optimization

2017-09-18 Thread Nate Watterson

Hi Leizhen,

On 9/12/2017 9:00 AM, Zhen Lei wrote:

v1 -> v2:
base on (add02cfdc9bc2 "iommu: Introduce Interface for IOMMU TLB Flushing")

Zhen Lei (3):
   iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock
 confliction
   iommu/arm-smmu-v3: add support for unmap an iova range with only one
 tlb sync


I tested these (2) patches on QDF2400 hardware and saw performance
improvements in line with those I reported when testing the original
series. I don't have any hardware close at hand to test the 3rd patch
in the series, so that will have to come from someone else.

Tested-by: Nate Watterson <nwatt...@codeaurora.org>

Thanks,
Nate


   iommu/arm-smmu: add support for unmap a memory range with only one tlb
 sync

  drivers/iommu/arm-smmu-v3.c| 52 ++
  drivers/iommu/arm-smmu.c   | 10 
  drivers/iommu/io-pgtable-arm-v7s.c | 32 +++
  drivers/iommu/io-pgtable-arm.c | 30 ++
  drivers/iommu/io-pgtable.h |  1 +
  5 files changed, 99 insertions(+), 26 deletions(-)



--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH 1/1] iommu/iova: Make rcache flush optional on IOVA allocation failure

2017-09-18 Thread Nate Watterson

Hi Tomasz,

On 9/18/2017 12:02 PM, Robin Murphy wrote:

Hi Tomasz,

On 18/09/17 11:56, Tomasz Nowicki wrote:

Since IOVA allocation failure is not unusual case we need to flush
CPUs' rcache in hope we will succeed in next round.

However, it is useful to decide whether we need rcache flush step because
of two reasons:
- Not scalability. On large system with ~100 CPUs iterating and flushing
   rcache for each CPU becomes serious bottleneck so we may want to deffer it.

s/deffer/defer


- free_cpu_cached_iovas() does not care about max PFN we are interested in.
   Thus we may flush our rcaches and still get no new IOVA like in the
   commonly used scenario:

 if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
 iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift);

 if (!iova)
 iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift);

1. First alloc_iova_fast() call is limited to DMA_BIT_MASK(32) to get
   PCI devices a SAC address
2. alloc_iova() fails due to full 32-bit space
3. rcaches contain PFNs out of 32-bit space so free_cpu_cached_iovas()
   throws entries away for nothing and alloc_iova() fails again
4. Next alloc_iova_fast() call cannot take advantage of rcache since we
   have just defeated caches. In this case we pick the slowest option
   to proceed.

This patch reworks flushed_rcache local flag to be additional function
argument instead and control rcache flush step. Also, it updates all users
to do the flush as the last chance.


Looks like you've run into the same thing Nate found[1] - I came up with
almost the exact same patch, only with separate alloc_iova_fast() and
alloc_iova_fast_noretry() wrapper functions, but on reflection, just
exposing the bool to callers is probably simpler. One nit, can you
document it in the kerneldoc comment too? With that:

Reviewed-by: Robin Murphy <robin.mur...@arm.com>

Thanks,
Robin.

[1]:https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg19758.html
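
(For illustration only -- roughly what the requested kerneldoc update
could look like; the wording here is a sketch, not the comment that was
actually merged:)

/**
 * alloc_iova_fast - allocate an iova range - rcache first
 * @iovad:	iova domain in question
 * @size:	size of the allocation, in pages
 * @limit_pfn:	highest pfn the allocation may use
 * @flush_rcache: whether to flush every CPU's rcache and retry once
 *	when the first allocation attempt fails
 */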

This patch completely resolves the issue I reported in [1]!!
Tested-by: Nate Watterson <nwatt...@codeaurora.org>

Thanks,
Nate




Signed-off-by: Tomasz Nowicki <tomasz.nowi...@caviumnetworks.com>
---
  drivers/iommu/amd_iommu.c   | 5 +++--
  drivers/iommu/dma-iommu.c   | 6 --
  drivers/iommu/intel-iommu.c | 5 +++--
  drivers/iommu/iova.c| 7 +++
  include/linux/iova.h| 5 +++--
  5 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 8d2ec60..ce68986 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1604,10 +1604,11 @@ static unsigned long dma_ops_alloc_iova(struct device *dev,
 
 	if (dma_mask > DMA_BIT_MASK(32))
 		pfn = alloc_iova_fast(&dma_dom->iovad, pages,
-				      IOVA_PFN(DMA_BIT_MASK(32)));
+				      IOVA_PFN(DMA_BIT_MASK(32)), false);
 
 	if (!pfn)
-		pfn = alloc_iova_fast(&dma_dom->iovad, pages, IOVA_PFN(dma_mask));
+		pfn = alloc_iova_fast(&dma_dom->iovad, pages,
+				      IOVA_PFN(dma_mask), true);
 
 	return (pfn << PAGE_SHIFT);
 }
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 191be9c..25914d3 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -370,10 +370,12 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 
 	/* Try to get PCI devices a SAC address */
 	if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
-		iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift);
+		iova = alloc_iova_fast(iovad, iova_len,
+				       DMA_BIT_MASK(32) >> shift, false);
 
 	if (!iova)
-		iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift);
+		iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
+				       true);
 
 	return (dma_addr_t)iova << shift;
 }
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 05c0c3a..75c8320 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3460,11 +3460,12 @@ static unsigned long intel_alloc_iova(struct device *dev,
 		 * from higher range
 		 */
 		iova_pfn = alloc_iova_fast(&domain->iovad, nrpages,
-					   IOVA_PFN(DMA_BIT_MASK(32)));
+					   IOVA_PFN(DMA_BIT_MASK(32)), false);
 		if (iova_pfn)
 			return iova_pfn;
 	}
-	iova_pfn = alloc_iova_fast(&domain->iovad, nrpages, IOVA_PFN(dma_mask));
+	iova_pfn = alloc_iova_fast(&domain->iovad, nrpages,
+				   IOVA_PFN(dma_mask), true);
 	if (unlikely(!iova_pfn)) {
 		pr_err("Allocating %ld-page iova for %s failed",

Re: [PATCH v3 0/4] Optimise 64-bit IOVA allocations

2017-08-25 Thread Nate Watterson

Hi Robin,

On 8/22/2017 11:17 AM, Robin Murphy wrote:

Hi all,

Just a quick repost of v2[1] with a small fix for the bug reported by Nate.

I tested the series and can confirm that the crash I reported on v2
no longer occurs with this version.


To recap, whilst this mostly only improves worst-case performance, those
worst-cases have a tendency to be pathologically bad:

Ard reports general desktop performance with Chromium on AMD Seattle going
from ~1-2FPS to perfectly usable.

Leizhen reports gigabit ethernet throughput going from ~6.5Mbit/s to line
speed.

I also inadvertently found that the HiSilicon hns_dsaf driver was taking ~35s
to probe simply because of the number of DMA buffers it maps on startup (perf
shows around 76% of that was spent under the lock in alloc_iova()). With this
series applied it takes a mere ~1s, mostly of unrelated mdelay()s, with
alloc_iova() entirely lost in the noise.


Are any of these cases PCI devices attached to domains that have run
out of 32-bit IOVAs and have to retry the allocation using dma_limit?

iommu_dma_alloc_iova() {
[...]
if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))  [<- TRUE]
iova = alloc_iova_fast(DMA_BIT_MASK(32)); [<- NULL]
if (!iova)
iova = alloc_iova_fast(dma_limit);[<- OK  ]
[...]   
}

I am asking because, when using 64k pages, the Mellanox CX4 adapter
exhausts the supply of 32-bit IOVAs simply by allocating per-cpu IOVA space
during 'ifconfig up', so the code path outlined above is taken for
nearly all subsequent allocations. Although I do see a notable (~2x)
performance improvement with this series, I would still characterize it
as "pathologically bad" at < 10% of iommu passthrough performance.

This was a bit surprising to me as I thought the iova_rcache would
have eliminated the need to walk the rbtree for runtime allocations.
Unfortunately, it looks like the failed attempt to allocate a 32-bit
IOVA actually drops the cached IOVAs that we could have used when
subsequently performing the allocation at dma_limit.

alloc_iova_fast() {
[...]
iova_pfn = iova_rcache_get(...); [<- Fail, no 32-bit IOVAs]
if (iova_pfn)
return iova_pfn;

retry:
new_iova = alloc_iova(...);  [<- Fail, no 32-bit IOVAs]
if (!new_iova) {
unsigned int cpu;

if (flushed_rcache)
return 0;

/* Try replenishing IOVAs by flushing rcache. */
flushed_rcache = true;
for_each_online_cpu(cpu)
free_cpu_cached_iovas(cpu, iovad); [<- :( ]
goto retry;
}
}

As an experiment, I added code to skip the rcache flushing/retry for
the 32-bit allocations. In this configuration, 100% of passthrough mode
performance was achieved. I made the same change in the baseline and
measured performance at ~95% of passthrough mode.
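
(For reference, a minimal sketch of that experiment, modelled on the
alloc_iova_fast() flow quoted above -- illustrative only, not the exact
change that was benchmarked:)

static unsigned long alloc_iova_fast_noretry(struct iova_domain *iovad,
					     unsigned long size,
					     unsigned long limit_pfn)
{
	struct iova *new_iova;
	unsigned long pfn;

	pfn = iova_rcache_get(iovad, size, limit_pfn);
	if (pfn)
		return pfn;

	/*
	 * rcache miss: try the rbtree once, but do not flush the
	 * per-cpu rcaches and retry; report failure instead so the
	 * caller can fall back to the full dma_limit allocation.
	 */
	new_iova = alloc_iova(iovad, size, limit_pfn, true);
	if (!new_iova)
		return 0;

	return new_iova->pfn_lo;
}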

I also got similar results by altogether removing the 32-bit allocation
from iommu_dma_alloc_iova() which makes me wonder why we even bother.
What (PCIe) workloads have been shown to actually benefit from it?

Tested-by: Nate Watterson <nwatt...@codeaurora.org>
-Nate



Robin.

[1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg19139.html

Robin Murphy (1):
   iommu/iova: Extend rbtree node caching

Zhen Lei (3):
   iommu/iova: Optimise rbtree searching
   iommu/iova: Optimise the padding calculation
   iommu/iova: Make dma_32bit_pfn implicit

  drivers/gpu/drm/tegra/drm.c  |   3 +-
  drivers/gpu/host1x/dev.c |   3 +-
  drivers/iommu/amd_iommu.c|   7 +--
  drivers/iommu/dma-iommu.c|  18 +--
  drivers/iommu/intel-iommu.c  |  11 ++--
  drivers/iommu/iova.c | 114 +--
  drivers/misc/mic/scif/scif_rma.c |   3 +-
  include/linux/iova.h |   8 +--
  8 files changed, 62 insertions(+), 105 deletions(-)



--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-08-03 Thread Nate Watterson

Hi Robin,

On 7/31/2017 7:42 AM, Robin Murphy wrote:

Hi Nate,

On 29/07/17 04:57, Nate Watterson wrote:

Hi Robin,
I am seeing a crash when performing very basic testing on this series
with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
patch is the culprit, but this rcache business is still mostly
witchcraft to me.

# ifconfig eth5 up
# ifconfig eth5 down
 Unable to handle kernel NULL pointer dereference at virtual address
0020
 user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00
 [0020] *pgd=0006efab0003, *pud=0006efab0003,
*pmd=0007d8720003, *pte=
 Internal error: Oops: 9607 [#1] SMP
 Modules linked in:
 CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
 task: 8007da1e5780 task.stack: 8007ddcb8000
 PC is at __cached_rbnode_delete_update+0x2c/0x58
 LR is at private_free_iova+0x2c/0x60
 pc : [] lr : [] pstate: 204001c5
 sp : 8007ddcbba00
 x29: 8007ddcbba00 x28: 8007c8350210
 x27: 8007d1a8 x26: 8007dcc20800
 x25: 0140 x24: 8007c98f0008
 x23: fe4e x22: 0140
 x21: 8007c98f0008 x20: 8007c9adb240
 x19: 8007c98f0018 x18: 0010
 x17:  x16: 
 x15: 4000 x14: 
 x13:  x12: 0001
 x11: dead0200 x10: 
 x9 :  x8 : 8007c9adb1c0
 x7 : 40002000 x6 : 00210d00
 x5 :  x4 : c57e
 x3 : ffcf x2 : ffcf
 x1 : 8007c9adb240 x0 : 
 [...]
 [] __cached_rbnode_delete_update+0x2c/0x58
 [] private_free_iova+0x2c/0x60
 [] iova_magazine_free_pfns+0x4c/0xa0
 [] free_iova_fast+0x1b0/0x230
 [] iommu_dma_free_iova+0x5c/0x80
 [] __iommu_dma_unmap+0x5c/0x98
 [] iommu_dma_unmap_resource+0x24/0x30
 [] iommu_dma_unmap_page+0xc/0x18
 [] __iommu_unmap_page+0x40/0x60
 [] mlx5e_page_release+0xbc/0x128
 [] mlx5e_dealloc_rx_wqe+0x30/0x40
 [] mlx5e_close_channel+0x70/0x1f8
 [] mlx5e_close_channels+0x2c/0x50
 [] mlx5e_close_locked+0x54/0x68
 [] mlx5e_close+0x30/0x58
 [...]

** Disassembly for __cached_rbnode_delete_update() near the fault **
   92|if (free->pfn_hi < iovad->dma_32bit_pfn)
0852C6C4|ldr x3,[x1,#0x18]; x3,[free,#24]
0852C6C8|ldr x2,[x0,#0x30]; x2,[iovad,#48]
0852C6CC|cmp x3,x2
0852C6D0|b.cs0x0852C708
 |curr = &iovad->cached32_node;
   94|if (!curr)
0852C6D4|addsx19,x0,#0x18 ; x19,iovad,#24
0852C6D8|b.eq0x0852C708
 |
 |cached_iova = rb_entry(*curr, struct iova, node);
 |
   99|if (free->pfn_lo >= cached_iova->pfn_lo)
0852C6DC|ldr x0,[x19] ; xiovad,[curr]
0852C6E0|ldr x2,[x1,#0x20]; x2,[free,#32]
0852C6E4|ldr x0,[x0,#0x20]; x0,[x0,#32]
Apparently cached_iova was NULL so the pfn_lo access faulted.

0852C6E8|cmp x2,x0
0852C6EC|b.cc0x0852C6FC
0852C6F0|mov x0,x1; x0,free
  100|*curr = rb_next(&free->node);
After instrumenting the code a bit, this seems to be the culprit. In the
previous call, free->pfn_lo was 0x_ which is actually the
dma_limit for the domain so rb_next() returns NULL.

Let me know if you have any questions or would like additional tests
run. I also applied your "DMA domain debug info" patches and dumped the
contents of the domain at each of the steps above in case that would be
useful. If nothing else, they reinforce how thirsty the CX4 NIC is
especially when using 64k pages and many CPUs.


Thanks for the report - I somehow managed to reason myself out of
keeping the "no cached node" check in __cached_rbnode_delete_update() on
the assumption that it must always be set by a previous allocation.
However, there is indeed just one case case for which that fails: when
you free any IOVA immediately after freeing the very topmost one. Which
is something that freeing an entire magazine's worth of IOVAs back to
the tree all at once has a very real chance of doing...

The obvious straightforward fix is inline below, but I'm now starting to
understand the appeal of reserving a sentinel node to ensure the tree
can never be empty, so I might have a quick go at that to see if it
results in nicer code overall.
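
(Robin's inline fix is cut off in this archive; reconstructed from the
source lines quoted in the disassembly above, the missing check would
look roughly like this -- a sketch, not the exact patch that was
posted:)

	if (free->pfn_hi < iovad->dma_32bit_pfn)
		curr = &iovad->cached32_node;
	if (!curr)
		return;

	/*
	 * The guard in question: if the topmost IOVA in the tree was
	 * the last one freed, *curr is NULL, so bail out instead of
	 * dereferencing the rb_entry() of a NULL node.
	 */
	if (!*curr)
		return;

	cached_iova = rb_entry(*curr, struct iova, node);
	if (free->pfn_lo >= cached_iova->pfn_lo)
		*curr = rb_next(&free->node);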


After applying your fix, the crash no longer occurs, but the performance
of this use case is only marginally less awful than it was before. I'll
start a

Re: [PATCH v3 0/5] ACPI: DMA ranges management

2017-08-03 Thread Nate Watterson

Hi Lorenzo,

On 8/3/2017 8:32 AM, Lorenzo Pieralisi wrote:

This patch series is v3 of a previous posting:

v2->v3:
- Fixed DMA masks computation
 - Fixed size computation overflow in acpi_dma_get_range()

v1->v2:
- Reworked acpi_dma_get_range() flow and logs
- Added IORT named component address limits
- Renamed acpi_dev_get_resources() helper function
- Rebased against v4.13-rc3

v2: http://lkml.kernel.org/r/20170731152323.32488-1-lorenzo.pieral...@arm.com
v1: http://lkml.kernel.org/r/20170720144517.32529-1-lorenzo.pieral...@arm.com

-- Original cover letter --

As reported in:

http://lkml.kernel.org/r/cal85gma_sscwm80tkdkzqee+s1bewzdevdki1kpkmutdrms...@mail.gmail.com

the bus connecting devices to an IOMMU bus can be smaller in size than
the IOMMU input address bits which results in devices DMA HW bugs in
particular related to IOVA allocation (ie chopping of higher address
bits owing to system bus HW capabilities mismatch with the IOMMU).

Fortunately this problem can be solved through an already present but never
used ACPI 6.2 firmware bindings (ie _DMA object) allowing to define the DMA
window for a specific bus in ACPI and therefore all upstream devices
connected to it.

This small patch series enables _DMA parsing in ACPI core code and
use it in ACPI IORT code in order to detect DMA ranges for devices and
update their data structures to make them work with their related DMA
addressing restrictions.

Cc: Will Deacon <will.dea...@arm.com>
Cc: Hanjun Guo <hanjun@linaro.org>
Cc: Feng Kan <f...@apm.com>
Cc: Jon Masters <j...@redhat.com>
Cc: Robert Moore <robert.mo...@intel.com>
Cc: Robin Murphy <robin.mur...@arm.com>
Cc: Zhang Rui <rui.zh...@intel.com>
Cc: "Rafael J. Wysocki" <r...@rjwysocki.net>

Lorenzo Pieralisi (5):
   ACPICA: resource_mgr: Allow _DMA method in walk resources
   ACPI: Make acpi_dev_get_resources() method agnostic
   ACPI: Introduce DMA ranges parsing
   ACPI: Make acpi_dma_configure() DMA regions aware
   ACPI/IORT: Add IORT named component memory address limits

  drivers/acpi/acpica/rsxface.c |  7 ++--
  drivers/acpi/arm64/iort.c | 57 ++-
  drivers/acpi/resource.c   | 82 +-
  drivers/acpi/scan.c   | 91 +++
  include/acpi/acnames.h|  1 +
  include/acpi/acpi_bus.h   |  2 +
  include/linux/acpi.h  |  8 
  include/linux/acpi_iort.h |  5 ++-
  8 files changed, 219 insertions(+), 34 deletions(-)



I tested with named components and with _DMA objects at a variety of
sizes and verified the configured masks matched the expected values.

Tested-by: Nate Watterson <nwatt...@codeaurora.org>

One general question I had while testing the patch is whether it is
possible to define a complete 64-bit range using the _DMA method. For
instance, to get a 64-bit dma_mask I had to use a non-zero MIN value
so that LEN would not overflow.

QWORDMemory(
    ResourceConsumer,
    PosDecode,          // _DEC
    MinFixed,           // _MIF
    MaxFixed,           // _MAF
    Prefetchable,       // _MEM
    ReadWrite,          // _RW
    0,                  // _GRA
    0x1000,             // _MIN
    0xffffffffffffffff, // _MAX
    0x0,                // _TRA
    0xfffffffffffff000, // _LEN
    ,
    ,
    ,
)
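
(For reference, the overflow constraint is easy to see numerically; a
small userspace illustration, not from the original mail:)

#include <stdio.h>
#include <stdint.h>

/*
 * ACPI computes _LEN = _MAX - _MIN + 1. With _MIN == 0 and
 * _MAX == 0xffffffffffffffff the length would be 2^64, which does
 * not fit in a QWORD; starting the window at 0x1000 keeps _LEN
 * representable at the cost of excluding the first 4KB.
 */
int main(void)
{
	uint64_t min = 0x1000;
	uint64_t max = 0xffffffffffffffffULL;
	uint64_t len = max - min + 1;	/* 0xfffffffffffff000 */

	printf("_LEN = %#llx\n", (unsigned long long)len);
	return 0;
}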

-Nate

--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v2 5/5] ACPI/IORT: Add IORT named component memory address limits

2017-08-02 Thread Nate Watterson

Hi Lorenzo,
I ran a quick test and this seems to work for memory_address_limit < 64,
however memory_address_limit == 64 yields a mask of 0x0.

-Nate
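
(A guess at the mechanism, not confirmed in this thread: with
memory_address_limit == 64, nc_dma_get_range() reports size = ~0ULL, so
iort_dma_setup()'s mask = __roundup_pow_of_two(dmaaddr + size) - 1 ends
up needing 1ULL << 64; that shift is undefined, and on arm64 the count
wraps mod 64, collapsing the mask to zero. A userspace illustration:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t size = ~0ULL;			/* memory_address_limit == 64 */
	unsigned int bits = 64;			/* fls64(0 + size - 1) */
	uint64_t rounded = 1ULL << (bits & 63);	/* what the UB shift lands on */
	uint64_t mask = rounded - 1;

	printf("mask = %#llx\n", (unsigned long long)mask);	/* 0x0 */
	return 0;
}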

On 7/31/2017 11:23 AM, Lorenzo Pieralisi wrote:

IORT named components provide firmware configuration describing
how many address bits a given device is capable of generating
to address memory.

Add code to the kernel to retrieve memory address limits
configuration for IORT named components and configure DMA masks
accordingly.

Signed-off-by: Lorenzo Pieralisi <lorenzo.pieral...@arm.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Robin Murphy <robin.mur...@arm.com>
Cc: Nate Watterson <nwatt...@codeaurora.org>
---
  drivers/acpi/arm64/iort.c | 40 ++--
  1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 67b85ae..b85d19f 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -680,6 +680,24 @@ static const struct iommu_ops *iort_iommu_xlate(struct device *dev,
 	return ret ? NULL : ops;
 }
 
+static int nc_dma_get_range(struct device *dev, u64 *size)
+{
+	struct acpi_iort_node *node;
+	struct acpi_iort_named_component *ncomp;
+
+	node = iort_scan_node(ACPI_IORT_NODE_NAMED_COMPONENT,
+			      iort_match_node_callback, dev);
+	if (!node)
+		return -ENODEV;
+
+	ncomp = (struct acpi_iort_named_component *)node->node_data;
+
+	*size = ncomp->memory_address_limit >= 64 ? ~0ULL :
+			1ULL << ncomp->memory_address_limit;
+
+	return 0;
+}
+
 /**
  * iort_dma_setup() - Set-up device DMA parameters.
  *
@@ -708,17 +726,19 @@ void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size)
 
 	size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1);
 
-	if (dev_is_pci(dev)) {
+	if (dev_is_pci(dev))
 		ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
-		if (!ret) {
-			mask = __roundup_pow_of_two(dmaaddr + size) - 1;
-			/*
-			 * Limit coherent and dma mask based on size
-			 * retrieved from firmware.
-			 */
-			dev->coherent_dma_mask = mask;
-			*dev->dma_mask = mask;
-		}
+	else
+		ret = nc_dma_get_range(dev, &size);
+
+	if (!ret) {
+		mask = __roundup_pow_of_two(dmaaddr + size) - 1;
+		/*
+		 * Limit coherent and dma mask based on size
+		 * retrieved from firmware.
+		 */
+		dev->coherent_dma_mask = mask;
+		*dev->dma_mask = mask;
+	}
 
 	*dma_addr = dmaaddr;




--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-07-28 Thread Nate Watterson

Hi Robin,
I am seeing a crash when performing very basic testing on this series
with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
patch is the culprit, but this rcache business is still mostly
witchcraft to me.

# ifconfig eth5 up
# ifconfig eth5 down
Unable to handle kernel NULL pointer dereference at virtual address 0020
user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00
[0020] *pgd=0006efab0003, *pud=0006efab0003, 
*pmd=0007d8720003, *pte=
Internal error: Oops: 9607 [#1] SMP
Modules linked in:
CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
task: 8007da1e5780 task.stack: 8007ddcb8000
PC is at __cached_rbnode_delete_update+0x2c/0x58
LR is at private_free_iova+0x2c/0x60
pc : [] lr : [] pstate: 204001c5
sp : 8007ddcbba00
x29: 8007ddcbba00 x28: 8007c8350210
x27: 8007d1a8 x26: 8007dcc20800
x25: 0140 x24: 8007c98f0008
x23: fe4e x22: 0140
x21: 8007c98f0008 x20: 8007c9adb240
x19: 8007c98f0018 x18: 0010
x17:  x16: 
x15: 4000 x14: 
x13:  x12: 0001
x11: dead0200 x10: 
x9 :  x8 : 8007c9adb1c0
x7 : 40002000 x6 : 00210d00
x5 :  x4 : c57e
x3 : ffcf x2 : ffcf
x1 : 8007c9adb240 x0 : 
[...]
[] __cached_rbnode_delete_update+0x2c/0x58
[] private_free_iova+0x2c/0x60
[] iova_magazine_free_pfns+0x4c/0xa0
[] free_iova_fast+0x1b0/0x230
[] iommu_dma_free_iova+0x5c/0x80
[] __iommu_dma_unmap+0x5c/0x98
[] iommu_dma_unmap_resource+0x24/0x30
[] iommu_dma_unmap_page+0xc/0x18
[] __iommu_unmap_page+0x40/0x60
[] mlx5e_page_release+0xbc/0x128
[] mlx5e_dealloc_rx_wqe+0x30/0x40
[] mlx5e_close_channel+0x70/0x1f8
[] mlx5e_close_channels+0x2c/0x50
[] mlx5e_close_locked+0x54/0x68
[] mlx5e_close+0x30/0x58
[...]

** Disassembly for __cached_rbnode_delete_update() near the fault **
  92|if (free->pfn_hi < iovad->dma_32bit_pfn)
0852C6C4|ldr x3,[x1,#0x18]; x3,[free,#24]
0852C6C8|ldr x2,[x0,#0x30]; x2,[iovad,#48]
0852C6CC|cmp x3,x2
0852C6D0|b.cs0x0852C708
|curr = &iovad->cached32_node;
  94|if (!curr)
0852C6D4|addsx19,x0,#0x18 ; x19,iovad,#24
0852C6D8|b.eq0x0852C708
|
|cached_iova = rb_entry(*curr, struct iova, node);
|
  99|if (free->pfn_lo >= cached_iova->pfn_lo)
0852C6DC|ldr x0,[x19] ; xiovad,[curr]
0852C6E0|ldr x2,[x1,#0x20]; x2,[free,#32]
0852C6E4|ldr x0,[x0,#0x20]; x0,[x0,#32]
Apparently cached_iova was NULL so the pfn_lo access faulted.

0852C6E8|cmp x2,x0
0852C6EC|b.cc0x0852C6FC
0852C6F0|mov x0,x1; x0,free
 100|*curr = rb_next(&free->node);
After instrumenting the code a bit, this seems to be the culprit. In the
previous call, free->pfn_lo was 0x_ which is actually the
dma_limit for the domain so rb_next() returns NULL.

Let me know if you have any questions or would like additional tests
run. I also applied your "DMA domain debug info" patches and dumped the
contents of the domain at each of the steps above in case that would be
useful. If nothing else, they reinforce how thirsty the CX4 NIC is
especially when using 64k pages and many CPUs.

-Nate

On 7/21/2017 7:42 AM, Robin Murphy wrote:

The cached node mechanism provides a significant performance benefit for
allocations using a 32-bit DMA mask, but in the case of non-PCI devices
or where the 32-bit space is full, the loss of this benefit can be
significant - on large systems there can be many thousands of entries in
the tree, such that traversing to the end then walking all the way down
to find free space every time becomes increasingly awful.

Maintain a similar cached node for the whole IOVA space as a superset of
the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei .

Tested-by: Ard Biesheuvel 
Tested-by: Zhen Lei 
Signed-off-by: Robin Murphy 
---

v2: No change

  drivers/iommu/iova.c | 59 +---
  include/linux/iova.h |  3 ++-
  2 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index d094d1ca8f23..f5809a2ee6c2 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -46,6 +46,7 @@ 

Re: [PATCH 0/4] ACPI: DMA ranges management

2017-07-26 Thread Nate Watterson



On 7/26/2017 11:35 AM, Lorenzo Pieralisi wrote:

On Wed, Jul 26, 2017 at 04:05:55PM +0100, Robin Murphy wrote:

Hi Nate,

On 26/07/17 15:46, Nate Watterson wrote:

Hi Lorenzo,

On 7/20/2017 10:45 AM, Lorenzo Pieralisi wrote:

As reported in:

http://lkml.kernel.org/r/cal85gma_sscwm80tkdkzqee+s1bewzdevdki1kpkmutdrms...@mail.gmail.com


the bus connecting devices to an IOMMU bus can be smaller in size than
the IOMMU input address bits which results in devices DMA HW bugs in
particular related to IOVA allocation (ie chopping of higher address
bits owing to system bus HW capabilities mismatch with the IOMMU).

Fortunately this problem can be solved through an already present but
never
used ACPI 6.2 firmware bindings (ie _DMA object) allowing to define
the DMA
window for a specific bus in ACPI and therefore all upstream devices
connected to it.

This small patch series enables _DMA parsing in ACPI core code and
use it in ACPI IORT code in order to detect DMA ranges for devices and
update their data structures to make them work with their related DMA
addressing restrictions.


I tested the patches and unfortunately it seems like the DMA addressing
restrictions are not really enforced for devices that attempt to set
their own dma_mask based on controller capabilities. For instance,
consider the following from the ahci_platform driver:

 if (hpriv->cap & HOST_CAP_64) {
 rc = dma_coerce_mask_and_coherent(dev, DMA_BIT_MASK(64));
 [...]
 }

Prior to the check, I can see that the device dma_mask respects the
limits enumerated in the _DMA object, however it is then clobbered by
the call to dma_coerce_mask_and_coherent(). Interestingly, if
HOST_CAP_64 was not set and the _DMA object for the device (or its
parent) indicated support for > 32-bit addrs, the host controller
could end up getting programmed with addresses beyond what it actually
supports. That is more a bug with the ahci_platform driver assuming a
default 32-bit dma_mask, but I would not be surprised to find other
drivers that rely on the same assumption.


Yup, you've hit upon the more general problem, which applies equally to
DT "dma-ranges" too. I'm working on arm64 DMA stuff at the moment, and
have the patch to actually enforce the firmware-described limit when
drivers update their masks, but that depends on everyone passing the
correct information to arch_setup_dma_ops() in the first place (I think
DT needs more fixing than ACPI does).


To ensure that dma_set_mask() and friends actually respect _DMA, would
you consider introducing a dma_supported() callback to check the input
dma_mask against the FW defined limits? This would end up aggressively
clipping the dma_mask to 32-bits for devices like the above if the _DMA
limit was less than 64-bits, but that is probably preferable to the
controller accessing unintended addresses.
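
(A hedged sketch of what such a callback could look like -- the helper
for retrieving the firmware limit is hypothetical, as nothing in this
thread defines where the _DMA-derived mask would be stashed:)

/* Hypothetical: returns the mask derived from _DMA / IORT limits,
 * e.g. stashed on the device at arch_setup_dma_ops() time. */
static u64 fw_dma_limit(struct device *dev);

static int acpi_fw_dma_supported(struct device *dev, u64 mask)
{
	/*
	 * Refuse masks wider than what firmware says the bus can
	 * carry, so dma_set_mask()/dma_coerce_mask_and_coherent()
	 * fails and the driver falls back, rather than programming
	 * the device with unreachable addresses.
	 */
	return mask <= fw_dma_limit(dev);
}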

Also, how would you feel about adding support for the IORT named_node
memory_address_limit field?


We will certainly need that for some platform devices, so if you fancy
giving it a go before Lorenzo or I get there, feel free!


I can do it for v2 but I would like to understand why using _DMA is
not good enough for named components - having two bindings describing
the same thing is not ideal and I'd rather avoid it - if there is
a reason I am happy to add the necessary code.


My primary reason for requesting this is that I had already configured
the memory_address_limit field for the address challenged platform
devices in our (QDF2400) IORT under the assumption it would eventually
be supported.

Tested-by: Nate Watterson <nwatt...@codeaurora.org>


Thanks,
Lorenzo


Robin.


-Nate


Cc: Will Deacon <will.dea...@arm.com>
Cc: Hanjun Guo <hanjun@linaro.org>
Cc: Feng Kan <f...@apm.com>
Cc: Jon Masters <j...@redhat.com>
Cc: Robert Moore <robert.mo...@intel.com>
Cc: Robin Murphy <robin.mur...@arm.com>
Cc: Zhang Rui <rui.zh...@intel.com>
Cc: "Rafael J. Wysocki" <r...@rjwysocki.net>

Lorenzo Pieralisi (4):
ACPI: Allow _DMA method in walk resources
ACPI: Make acpi_dev_get_resources() method agnostic
ACPI: Introduce DMA ranges parsing
ACPI: Make acpi_dma_configure() DMA regions aware

   drivers/acpi/acpica/rsxface.c |  7 ++--
   drivers/acpi/arm64/iort.c     | 27 +++-
   drivers/acpi/resource.c       | 83 -
   drivers/acpi/scan.c           | 95 +++
   include/acpi/acnames.h        |  1 +
   include/acpi/acpi_bus.h       |  2 +
   include/linux/acpi.h          |  8 
   include/linux/acpi_iort.h     |  5 ++-
   8 files changed, 194 insertions(+), 34 deletions(-)







--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH 0/4] ACPI: DMA ranges management

2017-07-26 Thread Nate Watterson

Hi Lorenzo,

On 7/20/2017 10:45 AM, Lorenzo Pieralisi wrote:

As reported in:

http://lkml.kernel.org/r/cal85gma_sscwm80tkdkzqee+s1bewzdevdki1kpkmutdrms...@mail.gmail.com

the bus connecting devices to an IOMMU bus can be smaller in size than
the IOMMU input address bits which results in devices DMA HW bugs in
particular related to IOVA allocation (ie chopping of higher address
bits owing to system bus HW capabilities mismatch with the IOMMU).

Fortunately this problem can be solved through an already present but never
used ACPI 6.2 firmware bindings (ie _DMA object) allowing to define the DMA
window for a specific bus in ACPI and therefore all upstream devices
connected to it.

This small patch series enables _DMA parsing in ACPI core code and
use it in ACPI IORT code in order to detect DMA ranges for devices and
update their data structures to make them work with their related DMA
addressing restrictions.


I tested the patches and unfortunately it seems like the DMA addressing
restrictions are not really enforced for devices that attempt to set
their own dma_mask based on controller capabilities. For instance,
consider the following from the ahci_platform driver:

	if (hpriv->cap & HOST_CAP_64) {
		rc = dma_coerce_mask_and_coherent(dev, DMA_BIT_MASK(64));
		[...]
	}

Prior to the check, I can see that the device dma_mask respects the
limits enumerated in the _DMA object, however it is then clobbered by
the call to dma_coerce_mask_and_coherent(). Interestingly, if
HOST_CAP_64 was not set and the _DMA object for the device (or its
parent) indicated support for > 32-bit addrs, the host controller
could end up getting programmed with addresses beyond what it actually
supports. That is more a bug with the ahci_platform driver assuming a
default 32-bit dma_mask, but I would not be surprised to find other
drivers that rely on the same assumption.

To ensure that dma_set_mask() and friends actually respect _DMA, would
you consider introducing a dma_supported() callback to check the input
dma_mask against the FW defined limits? This would end up aggressively
clipping the dma_mask to 32-bits for devices like the above if the _DMA
limit was less than 64-bits, but that is probably preferable to the
controller accessing unintended addresses.

Also, how would you feel about adding support for the IORT named_node
memory_address_limit field?

-Nate


Cc: Will Deacon 
Cc: Hanjun Guo 
Cc: Feng Kan 
Cc: Jon Masters 
Cc: Robert Moore 
Cc: Robin Murphy 
Cc: Zhang Rui 
Cc: "Rafael J. Wysocki" 

Lorenzo Pieralisi (4):
   ACPI: Allow _DMA method in walk resources
   ACPI: Make acpi_dev_get_resources() method agnostic
   ACPI: Introduce DMA ranges parsing
   ACPI: Make acpi_dma_configure() DMA regions aware

  drivers/acpi/acpica/rsxface.c |  7 ++--
  drivers/acpi/arm64/iort.c | 27 +++-
  drivers/acpi/resource.c   | 83 -
  drivers/acpi/scan.c   | 95 +++
  include/acpi/acnames.h|  1 +
  include/acpi/acpi_bus.h   |  2 +
  include/linux/acpi.h  |  8 
  include/linux/acpi_iort.h |  5 ++-
  8 files changed, 194 insertions(+), 34 deletions(-)



--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH] ata: ahci_platform: Add shutdown handler

2017-07-20 Thread Nate Watterson
The newly introduced ahci_platform_shutdown() method is called during
system shutdown to disable host controller DMA and interrupts in order
to avoid potentially corrupting or otherwise interfering with a new
kernel being started with kexec.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/ata/ahci_platform.c|  1 +
 drivers/ata/libahci_platform.c | 34 ++
 include/linux/ahci_platform.h  |  2 ++
 3 files changed, 37 insertions(+)

diff --git a/drivers/ata/ahci_platform.c b/drivers/ata/ahci_platform.c
index 62a04c8..99f9a89 100644
--- a/drivers/ata/ahci_platform.c
+++ b/drivers/ata/ahci_platform.c
@@ -93,6 +93,7 @@ static SIMPLE_DEV_PM_OPS(ahci_pm_ops, ahci_platform_suspend,
 static struct platform_driver ahci_driver = {
.probe = ahci_probe,
.remove = ata_platform_remove_one,
+   .shutdown = ahci_platform_shutdown,
.driver = {
.name = DRV_NAME,
.of_match_table = ahci_of_match,
diff --git a/drivers/ata/libahci_platform.c b/drivers/ata/libahci_platform.c
index cd2eab6..a270a11 100644
--- a/drivers/ata/libahci_platform.c
+++ b/drivers/ata/libahci_platform.c
@@ -602,6 +602,40 @@ static void ahci_host_stop(struct ata_host *host)
ahci_platform_disable_resources(hpriv);
 }
 
+/**
+ * ahci_platform_shutdown - Disable interrupts and stop DMA for host ports
+ * @dev: platform device pointer for the host
+ *
+ * This function is called during system shutdown and performs the minimal
+ * deconfiguration required to ensure that an ahci_platform host cannot
+ * corrupt or otherwise interfere with a new kernel being started with kexec.
+ */
+void ahci_platform_shutdown(struct platform_device *pdev)
+{
+   struct ata_host *host = platform_get_drvdata(pdev);
+   struct ahci_host_priv *hpriv = host->private_data;
+   void __iomem *mmio = hpriv->mmio;
+   int i;
+
+   for (i = 0; i < host->n_ports; i++) {
+   struct ata_port *ap = host->ports[i];
+
+   /* Disable port interrupts */
+   if (ap->ops->freeze)
+   ap->ops->freeze(ap);
+
+   /* Stop the port DMA engines */
+   if (ap->ops->port_stop)
+   ap->ops->port_stop(ap);
+   }
+
+   /* Disable and clear host interrupts */
+   writel(readl(mmio + HOST_CTL) & ~HOST_IRQ_EN, mmio + HOST_CTL);
+   readl(mmio + HOST_CTL); /* flush */
+   writel(GENMASK(host->n_ports, 0), mmio + HOST_IRQ_STAT);
+}
+EXPORT_SYMBOL_GPL(ahci_platform_shutdown);
+
 #ifdef CONFIG_PM_SLEEP
 /**
  * ahci_platform_suspend_host - Suspend an ahci-platform host
diff --git a/include/linux/ahci_platform.h b/include/linux/ahci_platform.h
index a270f25e..1b0a17b 100644
--- a/include/linux/ahci_platform.h
+++ b/include/linux/ahci_platform.h
@@ -36,6 +36,8 @@ int ahci_platform_init_host(struct platform_device *pdev,
const struct ata_port_info *pi_template,
struct scsi_host_template *sht);
 
+void ahci_platform_shutdown(struct platform_device *pdev);
+
 int ahci_platform_suspend_host(struct device *dev);
 int ahci_platform_resume_host(struct device *dev);
 int ahci_platform_suspend(struct device *dev);
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction

2017-07-20 Thread Nate Watterson

Hi Jonathan,

[...]
 

Hi All,

I'm a bit of a late entry to this discussion. I've just been running some
more detailed tests on our D05 boards and wanted to bring some more
numbers to the discussion.

All tests against 4.12 with the following additions:
* Robin's series removing the io-pgtable spinlock (and a few recent fixes)
* Cherry picked updates to the sas driver, merged prior to 4.13-rc1
* An additional HNS (network card) bug fix that will be upstreamed shortly.

I've broken the results down into this patch alone and this patch + the
remainder of the set. As Leizhen mentioned, we got a nice little
performance bump from Robin's series, so that was applied first (as it's
in mainline now).

SAS tests were fio with the noop scheduler, 4k block size, and various io
depths, 1 process per disk. Note this is probably a different setup to
Leizhen's original numbers.
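
A representative fio invocation for this kind of setup (the exact
command line is not preserved in the thread, so these parameters are
illustrative rather than the ones actually used):

	fio --name=smmu-test --filename=/dev/sdX --direct=1 \
	    --ioengine=libaio --rw=read --bs=4k --iodepth=32 \
	    --numjobs=1 --ioscheduler=noop --runtime=60 --time_based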

Percentages are relative to the performance seen with the SMMU disabled.
SAS
4.12 - none of this series.
SMMU disabled
read io-depth 32 -   384K IOPS (100%)
read io-depth 2048 - 950K IOPS (100%)
rw io-depth 32 - 166K IOPS (100%)
rw io-depth 2048 -   340K IOPS (100%)

SMMU enabled
read io-depth 32 -   201K IOPS (52%)
read io-depth 2048 - 306K IOPS (32%)
rw io-depth 32 - 99K  IOPS (60%)
rw io-depth 2048 -   150K IOPS (44%)

Robin's recent series with fixes as seen on list (now merged)
SMMU enabled.
read io-depth 32 -   208K IOPS (54%)
read io-depth 2048 - 335K IOPS (35%)
rw io-depth 32 - 105K IOPS (63%)
rw io-depth 2048 -   165K IOPS (49%)

4.12 + Robin's series + just this patch SMMU enabled

(iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction)

read io-depth 32 -   225K IOPS (59%)
read io-depth 2048 - 365K IOPS (38%)
rw io-depth 32 - 110K IOPS (66%)
rw io-depth 2048 -   179K IOPS (53%)

4.12 + Robin's series + Second part of this series

(iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction)
(iommu: add a new member unmap_tlb_sync into struct iommu_ops)
(iommu/arm-smmu-v3: add support for unmap an iova range with only one tlb sync)
(iommu/arm-smmu: add support for unmap of a memory range with only one tlb sync)

read io-depth 32 -   225K IOPS (59%)
read io-depth 2048 - 833K IOPS (88%)
rw io-depth 32 -     112K IOPS (67%)
rw io-depth 2048 -   220K IOPS (65%)

Robin's series gave us small gains across the board (3-5% recovered)
relative to the no-SMMU performance (which we are taking as the ideal
case).

This first patch gets us back another 2-5% of the no-SMMU performance.

The next few patches give us very little advantage at the small io-depths
but make a large difference at the larger io-depths - in particular the
read IOPS, which are over twice as fast as without the series.

For HNS it seems that we are less dependent on the SMMU performance and
can reach the non-SMMU speed.

Tests were run with iperf -t 30 -i 10 -c IPADDRESS -P 3; the last 10
seconds were taken to avoid any initial variability.

The server end of the link was always running with SMMUv3 disabled so as
to act as a fast sink for the data. Some variation was seen across
repeat runs.

Mainline v4.12 + network card fix
NO SMMU
9.42 GBits/sec

SMMU
4.36 GBits/sec (46%)

Robin's io-pgtable spinlock series

6.68 to 7.34 (71% - 78% variation across runs)

Just this patch SMMU enabled

(iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction)

7.96-8.8 GBits/sec (85% - 94%; some variation across runs)

Full series

(iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction)
(iommu: add a new member unmap_tlb_sync into struct iommu_ops)
(iommu/arm-smmu-v3: add support for unmap an iova range with only one tlb sync)
(iommu/arm-smmu: add support for unmap of a memory range with only one tlb sync)

9.42 GBits/Sec (100%)

So the HNS test shows a greater boost from Robin's series and this first
patch. This is most likely because the HNS test is not putting as high a
load on the SMMU and associated code as the SAS test.

In both cases however, this shows that both parts of this patch
series are beneficial.

So on to the questions ;)

Will, you mentioned that along with Robin and Nate you were working on
a somewhat related strategy to improve the performance.  Any ETA on that?


The strategy I was working on is basically equivalent to the second
part of the series. I will test your patches out sometime this week, and
I'll also try to have our performance team run it through their whole
suite.


Thanks, that's excellent.  Look forward to hearing how it goes.


I tested the patches with 4 NVME drives connected to a single SMMU and
the results seem to be in line with those you've reported.

FIO - 512k blocksize / io-depth 32 / 1 thread per drive
 Baseline 4.13-rc1 w/SMMU enabled: 25% of SMMU bypass performance
 Baseline + Patch 1  : 28%
 Baseline + Patches 2-5  : 86%
 Baseline + Complete series  : 100% [!!]

I saw performance improvements across all of the other FIO profiles I
tested, although not always as substantial as was seen in 

Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction

2017-07-17 Thread Nate Watterson

Hi Jonathan,

On 7/17/2017 10:23 AM, Jonathan Cameron wrote:

On Mon, 17 Jul 2017 14:06:42 +0100
John Garry  wrote:


+

On 29/06/2017 03:08, Leizhen (ThunderTown) wrote:



On 2017/6/28 17:32, Will Deacon wrote:

Hi Zhen Lei,

Nate (CC'd), Robin and I have been working on something very similar to
this series, but this patch is different to what we had planned. More below.

On Mon, Jun 26, 2017 at 09:38:46PM +0800, Zhen Lei wrote:

Because all TLBI commands should be followed by a SYNC command to make
sure that they have completely finished, we can just add the TLBI
commands into the queue and put off their execution until a SYNC or
other command arrives. To prevent a subsequent SYNC command from waiting
too long because too many commands have been delayed, restrict the
maximum number of delayed commands.

According to my test, I got the same performance data as when I replaced
writel with writel_relaxed in queue_inc_prod.

Signed-off-by: Zhen Lei 
---
  drivers/iommu/arm-smmu-v3.c | 42 +-
  1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 291da5f..4481123 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -337,6 +337,7 @@
  /* Command queue */
  #define CMDQ_ENT_DWORDS   2
  #define CMDQ_MAX_SZ_SHIFT 8
+#define CMDQ_MAX_DELAYED   32

  #define CMDQ_ERR_SHIFT24
  #define CMDQ_ERR_MASK 0x7f
@@ -472,6 +473,7 @@ struct arm_smmu_cmdq_ent {
};
} cfgi;

+   #define CMDQ_OP_TLBI_NH_ALL 0x10
#define CMDQ_OP_TLBI_NH_ASID0x11
#define CMDQ_OP_TLBI_NH_VA  0x12
#define CMDQ_OP_TLBI_EL2_ALL0x20
@@ -499,6 +501,7 @@ struct arm_smmu_cmdq_ent {

  struct arm_smmu_queue {
int irq; /* Wired interrupt */
+   u32 nr_delay;

__le64  *base;
dma_addr_t  base_dma;
@@ -722,11 +725,16 @@ static int queue_sync_prod(struct arm_smmu_queue *q)
return ret;
  }

-static void queue_inc_prod(struct arm_smmu_queue *q)
+static void queue_inc_swprod(struct arm_smmu_queue *q)
  {
-   u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1;
+   u32 prod = q->prod + 1;

q->prod = Q_OVF(q, q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
+}
+
+static void queue_inc_prod(struct arm_smmu_queue *q)
+{
+   queue_inc_swprod(q);
writel(q->prod, q->prod_reg);
  }

@@ -761,13 +769,24 @@ static void queue_write(__le64 *dst, u64 *src, size_t n_dwords)
*dst++ = cpu_to_le64(*src++);
  }

-static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent)
+static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent, int optimize)
  {
if (queue_full(q))
return -ENOSPC;

queue_write(Q_ENT(q, q->prod), ent, q->ent_dwords);
-   queue_inc_prod(q);
+
+   /*
+* We don't want too many commands to be delayed; that may cause a
+* subsequent sync command to wait for a long time.
+*/
+   if (optimize && (++q->nr_delay < CMDQ_MAX_DELAYED)) {
+   queue_inc_swprod(q);
+   } else {
+   queue_inc_prod(q);
+   q->nr_delay = 0;
+   }
+


So here, you're effectively putting invalidation commands into the command
queue without updating PROD. Do you actually see a performance advantage
from doing so? Another side of the argument would be that we should be

Yes, my SAS SSD performance test showed that it can improve throughput by
about 100-150K IOPS (the same as when I directly replaced writel with
writel_relaxed). And the average execution time of iommu_unmap (which is
called by iommu_dma_unmap_sg) dropped from 10us to 5us.
  

moving PROD as soon as we can, so that the SMMU can process invalidation
commands in the background and reduce the cost of the final SYNC operation
when the high-level unmap operation is complete.

It may be that __iowmb() is more expensive than waiting for the TLBI to
complete. Besides the time spent in __iowmb() itself, it is also
protected by a spinlock, so lock contention rises rapidly under stress.
__iowmb() costs 300-500ns on average (sorry, I forget the exact value).

In addition, after applying this patchset, Robin's v2, and my earlier
dma64 iova optimization patchset, our network performance test got the
same numbers as global bypass, but SAS SSD still dropped by more than
20%. Maybe we should still focus on map/unmap, because the average
execution time of iova alloc/free is only about 400ns.

By the way, patches 2-5 are more effective than this one; they can
improve throughput by more than 350K IOPS. With them, we got about a
100-150K IOPS improvement on top of Robin's v2; otherwise, I saw no
effect from Robin's v2. Sorry, I have not tested how this patch performs
without patches 2-5. Furthermore, I got the

[PATCH 2/2] ata: ahci_platform: Handle shutdown with ata_platform_shutdown_one()

2017-07-03 Thread Nate Watterson
To avoid interfering with a new kernel started using kexec, quiesce
host controller DMA and interrupts during driver shutdown with
ata_platform_shutdown_one().

Change-Id: Iedee7f7ba12172b3d34796a3e3b92dbb72d4ed9c
Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/ata/ahci_platform.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/ahci_platform.c b/drivers/ata/ahci_platform.c
index 62a04c8..ca25b0d 100644
--- a/drivers/ata/ahci_platform.c
+++ b/drivers/ata/ahci_platform.c
@@ -93,6 +93,7 @@ static SIMPLE_DEV_PM_OPS(ahci_pm_ops, ahci_platform_suspend,
 static struct platform_driver ahci_driver = {
.probe = ahci_probe,
.remove = ata_platform_remove_one,
+   .shutdown = ata_platform_shutdown_one,
.driver = {
.name = DRV_NAME,
.of_match_table = ahci_of_match,
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



[PATCH 1/2] libata: Introduce ata_platform_shutdown_one()

2017-07-03 Thread Nate Watterson
In similar fashion to how ata_platform_remove_one() is used, this
newly introduced method can be used by platform ata drivers to get
basic shutdown functionality (stopping host DMA and interrupts).

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/ata/libata-core.c | 20 
 include/linux/libata.h|  1 +
 2 files changed, 21 insertions(+)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index e157a0e..537932e 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6702,6 +6702,25 @@ int ata_platform_remove_one(struct platform_device *pdev)
return 0;
 }
 
+/**
+ * ata_platform_shutdown_one - Platform layer callback for device shutdown
+ * @pdev: Platform device being shutdown
+ *
+ * Platform layer indicates to libata via this hook that shutdown is
+ * in progress and the input device should be quiesced. Functionally this
+ * is equivalent to ata_platform_remove_one(); however, devres_release_all()
+ * is not called on the shutdown path as it is for remove, so releasing the
+ * resources associated with the device must instead be initiated directly.
+ */
+void ata_platform_shutdown_one(struct platform_device *pdev)
+{
+   struct ata_host *host = platform_get_drvdata(pdev);
+
+   ata_host_detach(host);
+   devres_release(&pdev->dev, ata_host_stop, NULL, NULL);
+   devres_release(&pdev->dev, ata_host_release, NULL, NULL);
+}
+
 static int __init ata_parse_force_one(char **cur,
  struct ata_force_ent *force_ent,
  const char **reason)
@@ -7222,6 +7241,7 @@ void ata_print_version(const struct device *dev, const char *version)
 #endif /* CONFIG_PCI */
 
 EXPORT_SYMBOL_GPL(ata_platform_remove_one);
+EXPORT_SYMBOL_GPL(ata_platform_shutdown_one);
 
 EXPORT_SYMBOL_GPL(__ata_ehi_push_desc);
 EXPORT_SYMBOL_GPL(ata_ehi_push_desc);
diff --git a/include/linux/libata.h b/include/linux/libata.h
index c9a69fc..6c01f23f 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -1240,6 +1240,7 @@ struct pci_bits {
 struct platform_device;
 
 extern int ata_platform_remove_one(struct platform_device *pdev);
+extern void ata_platform_shutdown_one(struct platform_device *pdev);
 
 /*
  * ACPI - drivers/ata/libata-acpi.c
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



Re: [PATCH v2] iommu/arm-smmu-v3: Implement shutdown method

2017-06-29 Thread Nate Watterson

I should have removed the '-v3' since this revision of the patch adds
shutdown to arm-smmu.c as well. I'll fix that in a subsequent version
after waiting to see if there are additional changes that need to be
made.

On 6/29/2017 6:18 PM, Nate Watterson wrote:

The shutdown method disables the SMMU to avoid corrupting a new kernel
started with kexec.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
  drivers/iommu/arm-smmu-v3.c | 7 +++
  drivers/iommu/arm-smmu.c| 6 ++
  2 files changed, 13 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 380969a..3d8ac29 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2765,9 +2765,15 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
  
  	arm_smmu_device_disable(smmu);

+
return 0;
  }
  
+static void arm_smmu_device_shutdown(struct platform_device *pdev)

+{
+   arm_smmu_device_remove(pdev);
+}
+
  static struct of_device_id arm_smmu_of_match[] = {
{ .compatible = "arm,smmu-v3", },
{ },
@@ -2781,6 +2787,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
},
.probe  = arm_smmu_device_probe,
.remove = arm_smmu_device_remove,
+   .shutdown = arm_smmu_device_shutdown,
  };
  module_platform_driver(arm_smmu_driver);
  
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c

index 7ec30b0..af50bab 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -2319,6 +2319,11 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
return 0;
  }
  
+static void arm_smmu_device_shutdown(struct platform_device *pdev)

+{
+   arm_smmu_device_remove(pdev);
+}
+
  static struct platform_driver arm_smmu_driver = {
.driver = {
.name   = "arm-smmu",
@@ -2326,6 +2331,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
},
.probe  = arm_smmu_device_probe,
.remove = arm_smmu_device_remove,
+   .shutdown = arm_smmu_device_shutdown,
  };
  module_platform_driver(arm_smmu_driver);
  



--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH v2] iommu/arm-smmu-v3: Implement shutdown method

2017-06-29 Thread Nate Watterson
The shutdown method disables the SMMU to avoid corrupting a new kernel
started with kexec.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/arm-smmu-v3.c | 7 +++
 drivers/iommu/arm-smmu.c| 6 ++
 2 files changed, 13 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 380969a..3d8ac29 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2765,9 +2765,15 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
 
arm_smmu_device_disable(smmu);
+
return 0;
 }
 
+static void arm_smmu_device_shutdown(struct platform_device *pdev)
+{
+   arm_smmu_device_remove(pdev);
+}
+
 static struct of_device_id arm_smmu_of_match[] = {
{ .compatible = "arm,smmu-v3", },
{ },
@@ -2781,6 +2787,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
},
.probe  = arm_smmu_device_probe,
.remove = arm_smmu_device_remove,
+   .shutdown = arm_smmu_device_shutdown,
 };
 module_platform_driver(arm_smmu_driver);
 
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 7ec30b0..af50bab 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -2319,6 +2319,11 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
return 0;
 }
 
+static void arm_smmu_device_shutdown(struct platform_device *pdev)
+{
+   arm_smmu_device_remove(pdev);
+}
+
 static struct platform_driver arm_smmu_driver = {
.driver = {
.name   = "arm-smmu",
@@ -2326,6 +2331,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
},
.probe  = arm_smmu_device_probe,
.remove = arm_smmu_device_remove,
+   .shutdown = arm_smmu_device_shutdown,
 };
 module_platform_driver(arm_smmu_driver);
 
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



Re: [PATCH] iommu/arm-smmu-v3: Implement shutdown method

2017-06-29 Thread Nate Watterson

On 6/29/2017 2:34 PM, Will Deacon wrote:

On Thu, Jun 29, 2017 at 01:40:15PM -0400, Nate Watterson wrote:

The shutdown method disables the SMMU and its interrupts to avoid
potentially corrupting a new kernel started with kexec.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
  drivers/iommu/arm-smmu-v3.c | 11 +++
  1 file changed, 11 insertions(+)


We should update arm-smmu.c as well.


diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 380969a..907d576 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2765,9 +2765,19 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
  
  	arm_smmu_device_disable(smmu);

+
+   /* Disable IRQs */
+   arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL,
+   ARM_SMMU_IRQ_CTRLACK);
+


Can you justify the need for this? If we actually need to disable
interrupts, then I'd like to understand why so that we can make sure we
get the ordering right with respect to disabling the device. Also, do we
need to clear the MSI registers too?


I have no justification. Based on the number of drivers that take care
to prevent their HW from generating an interrupt, I thought it would be
required, but I can't find any such requirement explicitly laid out in
the documentation.

When you mention the MSI registers do you mean, for instance,
SMMU_GERROR_IRQ_CFG0? It looks like those are always cleared while
initializing the SMMU so the case where an SMMU transitions from using
MSIs to using wired interrupts between kernels will be handled properly.



My understanding is that kexec will mask irqs at the GIC, so there's not
actually an issue here.

Will



--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH] iommu/arm-smmu-v3: Implement shutdown method

2017-06-29 Thread Nate Watterson
The shutdown method disables the SMMU and its interrupts to avoid
potentially corrupting a new kernel started with kexec.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/arm-smmu-v3.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 380969a..907d576 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2765,9 +2765,19 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
 
arm_smmu_device_disable(smmu);
+
+   /* Disable IRQs */
+   arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL,
+   ARM_SMMU_IRQ_CTRLACK);
+
return 0;
 }
 
+static void arm_smmu_device_shutdown(struct platform_device *pdev)
+{
+   arm_smmu_device_remove(pdev);
+}
+
 static struct of_device_id arm_smmu_of_match[] = {
{ .compatible = "arm,smmu-v3", },
{ },
@@ -2781,6 +2791,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
},
.probe  = arm_smmu_device_probe,
.remove = arm_smmu_device_remove,
+   .shutdown = arm_smmu_device_shutdown,
 };
 module_platform_driver(arm_smmu_driver);
 
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



Re: [PATCH V11 08/11] drivers: acpi: Handle IOMMU lookup failure with deferred probing or error

2017-05-23 Thread Nate Watterson

Hi Lorenzo,

On 5/23/2017 5:26 AM, Lorenzo Pieralisi wrote:

On Tue, May 23, 2017 at 02:31:17PM +0530, Sricharan R wrote:

Hi Lorenzo,

On 5/23/2017 2:22 PM, Lorenzo Pieralisi wrote:

On Tue, May 23, 2017 at 02:26:10AM -0400, Nate Watterson wrote:

Hi Sricharan,

On 4/10/2017 7:21 AM, Sricharan R wrote:

This is the equivalent of the DT's handling of the iommu master's probe
with deferred probing when the corresponding iommu is not probed yet.
The lack of a registered IOMMU can be caused by the lack of a driver for
the IOMMU, the IOMMU device probe not having been performed yet, having
been deferred, or having failed.

The first case occurs when the firmware describes the bus master and
IOMMU topology correctly but no device driver exists for the IOMMU yet
or the device driver has not been compiled in. Return NULL, the caller
will configure the device without an IOMMU.

The second and third cases are handled by deferring the probe of the bus
master device which will eventually get reprobed after the IOMMU.

The last case is currently handled by deferring the probe of the bus
master device as well. A mechanism to either configure the bus master
device without an IOMMU or to fail the bus master device probe depending
on whether the IOMMU is optional or mandatory would be a good
enhancement.

Tested-by: Hanjun Guo <hanjun@linaro.org>
Reviewed-by: Robin Murphy <robin.mur...@arm.com>
[Lorenzo: Added fixes for dma_coherent_mask overflow, acpi_dma_configure
   called multiple times for same device]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieral...@arm.com>
Signed-off-by: Sricharan R <sricha...@codeaurora.org>
---
  drivers/acpi/arm64/iort.c  | 33 -
  drivers/acpi/scan.c| 11 ---
  drivers/base/dma-mapping.c |  2 +-
  include/acpi/acpi_bus.h|  2 +-
  include/linux/acpi.h   |  7 +--
  5 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 3dd9ec3..e323ece 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -543,6 +543,14 @@ static const struct iommu_ops *iort_iommu_xlate(struct device *dev,
const struct iommu_ops *ops = NULL;
int ret = -ENODEV;
struct fwnode_handle *iort_fwnode;
+   struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+
+   /*
+* If we already translated the fwspec there
+* is nothing left to do, return the iommu_ops.
+*/
+   if (fwspec && fwspec->ops)
+   return fwspec->ops;


Is this logic strictly required? It breaks masters with multiple SIDs
as only the first SID is actually added to the master's fwspec.


My bad, that's indeed a silly bug I introduced. Please let me know if the
patch below fixes it, we will send it upstream shortly.



oops, i think emails crossed. Please let me know if you are ok to add
this to the other fixes.


No worries, yes I am ok thanks but please give Nate some time to report
back to make sure the diff I sent actually fixes the problem.


The patch you sent fixes the problem. Thanks for the quick turnaround.



Apologies for the breakage.

Lorenzo



Regards,
  Sricharan



--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH V11 08/11] drivers: acpi: Handle IOMMU lookup failure with deferred probing or error

2017-05-23 Thread Nate Watterson

Hi Sricharan,

On 4/10/2017 7:21 AM, Sricharan R wrote:

This is the equivalent of the DT's handling of the iommu master's probe
with deferred probing when the corresponding iommu is not probed yet.
The lack of a registered IOMMU can be caused by the lack of a driver for
the IOMMU, the IOMMU device probe not having been performed yet, having
been deferred, or having failed.

The first case occurs when the firmware describes the bus master and
IOMMU topology correctly but no device driver exists for the IOMMU yet
or the device driver has not been compiled in. Return NULL, the caller
will configure the device without an IOMMU.

The second and third cases are handled by deferring the probe of the bus
master device which will eventually get reprobed after the IOMMU.

The last case is currently handled by deferring the probe of the bus
master device as well. A mechanism to either configure the bus master
device without an IOMMU or to fail the bus master device probe depending
on whether the IOMMU is optional or mandatory would be a good
enhancement.
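
A condensed view of the four cases above (a hypothetical summary, not
part of the patch itself):

   /*
    * Outcome of the IOMMU ops look-up for a bus master:
    *   no IOMMU driver compiled in     -> NULL           (configure without IOMMU)
    *   driver built in, not probed yet -> -EPROBE_DEFER  (reprobe master later)
    *   IOMMU probe deferred or failed  -> -EPROBE_DEFER  (current behaviour)
    */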

Tested-by: Hanjun Guo 
Reviewed-by: Robin Murphy 
[Lorenzo: Added fixes for dma_coherent_mask overflow, acpi_dma_configure
   called multiple times for same device]
Signed-off-by: Lorenzo Pieralisi 
Signed-off-by: Sricharan R 
---
  drivers/acpi/arm64/iort.c  | 33 -
  drivers/acpi/scan.c| 11 ---
  drivers/base/dma-mapping.c |  2 +-
  include/acpi/acpi_bus.h|  2 +-
  include/linux/acpi.h   |  7 +--
  5 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 3dd9ec3..e323ece 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -543,6 +543,14 @@ static const struct iommu_ops *iort_iommu_xlate(struct device *dev,
const struct iommu_ops *ops = NULL;
int ret = -ENODEV;
struct fwnode_handle *iort_fwnode;
+   struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+
+   /*
+* If we already translated the fwspec there
+* is nothing left to do, return the iommu_ops.
+*/
+   if (fwspec && fwspec->ops)
+   return fwspec->ops;


Is this logic strictly required? It breaks masters with multiple SIDs
as only the first SID is actually added to the master's fwspec.

  
  	if (node) {

iort_fwnode = iort_get_fwnode(node);
@@ -550,8 +558,17 @@ static const struct iommu_ops *iort_iommu_xlate(struct device *dev,
return NULL;
  
  		ops = iommu_ops_from_fwnode(iort_fwnode);

+   /*
+* If the ops look-up fails, this means that either
+* the SMMU drivers have not been probed yet or that
+* the SMMU drivers are not built in the kernel;
+* Depending on whether the SMMU drivers are built-in
+* in the kernel or not, defer the IOMMU configuration
+* or just abort it.
+*/
if (!ops)
-   return NULL;
+   return iort_iommu_driver_enabled(node->type) ?
+  ERR_PTR(-EPROBE_DEFER) : NULL;
  
  		ret = arm_smmu_iort_xlate(dev, streamid, iort_fwnode, ops);

}
@@ -625,12 +642,26 @@ const struct iommu_ops *iort_iommu_configure(struct device *dev)
  
  		while (parent) {

ops = iort_iommu_xlate(dev, parent, streamid);
+   if (IS_ERR_OR_NULL(ops))
+   return ops;
  
  			parent = iort_node_get_id(node, ,

  IORT_IOMMU_TYPE, i++);
}
}
  
+	/*

+* If we have reason to believe the IOMMU driver missed the initial
+* add_device callback for dev, replay it to get things in order.
+*/
+   if (!IS_ERR_OR_NULL(ops) && ops->add_device &&
+   dev->bus && !dev->iommu_group) {
+   int err = ops->add_device(dev);
+
+   if (err)
+   ops = ERR_PTR(err);
+   }
+
return ops;
  }
  
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c

index 1926918..2a513cc 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -1373,20 +1373,25 @@ enum dev_dma_attr acpi_get_dma_attr(struct acpi_device *adev)
   * @dev: The pointer to the device
   * @attr: device dma attributes
   */
-void acpi_dma_configure(struct device *dev, enum dev_dma_attr attr)
+int acpi_dma_configure(struct device *dev, enum dev_dma_attr attr)
  {
const struct iommu_ops *iommu;
+   u64 size;
  
  	iort_set_dma_mask(dev);
  
  	iommu = iort_iommu_configure(dev);

+   if (IS_ERR(iommu))
+   return PTR_ERR(iommu);
  
+	size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1);

/*
 * Assume dma valid range starts at 0 and covers the whole
 * coherent_dma_mask.
 */
-   arch_setup_dma_ops(dev, 

Re: [PATCH] iommu/dma: Setup iova_domain granule for IOMMU_DMA_MSI cookies

2017-05-16 Thread Nate Watterson



On 5/16/2017 3:55 PM, Auger Eric wrote:

Hi,

On 13/04/2017 21:38, Nate Watterson wrote:

Hi Robin,

On 4/13/2017 7:21 AM, Robin Murphy wrote:

Hi Nate,

On 13/04/17 09:55, Nate Watterson wrote:

Currently, the __iommu_dma_{map/free} functions call iova_{offset/align}
making them unsuitable for use with iommu_domains having an
IOMMU_DMA_MSI
cookie since the cookie's iova_domain member, iovad, is uninitialized.

Now that iommu_dma_get_msi_page() calls __iommu_dma_map() regardless
of cookie type, failures are being seen when mapping MSI target
addresses for devices attached to UNMANAGED domains. To work around
this issue, the iova_domain granule for IOMMU_DMA_MSI cookies is
initialized to the value returned by cookie_msi_granule().


Oh bum. Thanks for the report.

However, I really don't like bodging around it with deliberate undefined
behaviour. Fixing things properly doesn't seem too hard:


I was not especially pleased with my solution, but I wanted to avoid
potentially missing any other spots in the code where granule was
used uninitialized. The compile time check made me feel a little
less dirty about inappropriately using the iova_domain with MSI
cookies.



->8-
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8348f366ddd1..62618e77bedc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -396,13 +396,13 @@ static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
  dma_addr_t iova, size_t size)
   {
   struct iova_domain *iovad = &cookie->iovad;
-   unsigned long shift = iova_shift(iovad);

  /* The MSI case is only ever cleaning up its most recent
allocation */
  if (cookie->type == IOMMU_DMA_MSI_COOKIE)
  cookie->msi_iova -= size;
  else
-   free_iova_fast(iovad, iova >> shift, size >> shift);
+   free_iova_fast(iovad, iova_pfn(iovad, iova),
+   size >> iova_shift(iovad));
   }

    static void __iommu_dma_unmap(struct iommu_domain *domain, dma_addr_t dma_addr,
@@ -617,11 +617,14 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
   {
  struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
  struct iommu_dma_cookie *cookie = domain->iova_cookie;
-   struct iova_domain *iovad = &cookie->iovad;
-   size_t iova_off = iova_offset(iovad, phys);
+   size_t iova_off = 0;
  dma_addr_t iova;

-   size = iova_align(iovad, size + iova_off);
+   if (cookie->type == IOMMU_DMA_IOVA_COOKIE) {
+   iova_off = iova_offset(&cookie->iovad, phys);
+   size = iova_align(&cookie->iovad, size + iova_off);
+   }
+
  iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev),
dev);
  if (!iova)
  return DMA_ERROR_CODE;
-8<-

Untested, and you'll probably want to double-check it anyway given that
the original oversight was mine in the first place ;)


This looks good to me. As Shanker has already mentioned, it does fix the
faults we were previously seeing with direct device assignment. I also
verified that there aren't any other obvious cases of a granule == 0
being used in the dma_iommu code by adding BUG_ON(!iovad->granule) to
iova_{mask/align/offset/...} and running a variety of tests without
issue.

Are you going to post the patch?


I also noticed PCIe passthrough/ARM is broken for me with 4.12-r1. I
tested Robin's patch as well, on Cavium ThunderX, and this fixes the
faults I have seen.

Has anyone sent a formal patch?


iommu/dma: Don't touch invalid iova_domain members



Thanks

Eric





Robin.


Fixes: a44e6657585b ("iommu/dma: Clean up MSI IOVA allocation")
Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
   drivers/iommu/dma-iommu.c | 10 ++
   1 file changed, 10 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8348f366..d7b0816 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -127,6 +127,16 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
 cookie->msi_iova = base;
   domain->iova_cookie = cookie;
+
+/*
+ * Setup granule for compatibility with __iommu_dma_{alloc/free}
and
+ * add a compile time check to ensure that writing granule won't
+ * clobber msi_iova.
+ */
+cookie->iovad.granule = cookie_msi_granule(cookie);
+BUILD_BUG_ON(offsetof(struct iova_domain, granule) <
+sizeof(cookie->msi_iova));
+
   return 0;
   }
   EXPORT_SYMBOL(iommu_get_msi_cookie);





--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH] iommu/dma: Setup iova_domain granule for IOMMU_DMA_MSI cookies

2017-04-13 Thread Nate Watterson

Hi Robin,

On 4/13/2017 7:21 AM, Robin Murphy wrote:

Hi Nate,

On 13/04/17 09:55, Nate Watterson wrote:

Currently, the __iommu_dma_{map/free} functions call iova_{offset/align}
making them unsuitable for use with iommu_domains having an IOMMU_DMA_MSI
cookie since the cookie's iova_domain member, iovad, is uninitialized.

Now that iommu_dma_get_msi_page() calls __iommu_dma_map() regardless
of cookie type, failures are being seen when mapping MSI target
addresses for devices attached to UNMANAGED domains. To work around
this issue, the iova_domain granule for IOMMU_DMA_MSI cookies is
initialized to the value returned by cookie_msi_granule().


Oh bum. Thanks for the report.

However, I really don't like bodging around it with deliberate undefined
behaviour. Fixing things properly doesn't seem too hard:


I was not especially pleased with my solution, but I wanted to avoid
potentially missing any other spots in the code where granule was
used uninitialized. The compile time check made me feel a little
less dirty about inappropriately using the iova_domain with MSI
cookies.



->8-
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8348f366ddd1..62618e77bedc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -396,13 +396,13 @@ static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
 dma_addr_t iova, size_t size)
  {
  struct iova_domain *iovad = &cookie->iovad;
-   unsigned long shift = iova_shift(iovad);

 /* The MSI case is only ever cleaning up its most recent
allocation */
 if (cookie->type == IOMMU_DMA_MSI_COOKIE)
 cookie->msi_iova -= size;
 else
-   free_iova_fast(iovad, iova >> shift, size >> shift);
+   free_iova_fast(iovad, iova_pfn(iovad, iova),
+   size >> iova_shift(iovad));
  }

   static void __iommu_dma_unmap(struct iommu_domain *domain, dma_addr_t dma_addr,
@@ -617,11 +617,14 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
  {
 struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
 struct iommu_dma_cookie *cookie = domain->iova_cookie;
-   struct iova_domain *iovad = &cookie->iovad;
-   size_t iova_off = iova_offset(iovad, phys);
+   size_t iova_off = 0;
 dma_addr_t iova;

-   size = iova_align(iovad, size + iova_off);
+   if (cookie->type == IOMMU_DMA_IOVA_COOKIE) {
+   iova_off = iova_offset(&cookie->iovad, phys);
+   size = iova_align(&cookie->iovad, size + iova_off);
+   }
+
 iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
 if (!iova)
 return DMA_ERROR_CODE;
-8<-

Untested, and you'll probably want to double-check it anyway given that
the original oversight was mine in the first place ;)


This looks good to me. As Shanker has already mentioned, it does fix the
faults we were previously seeing with direct device assignment. I also
verified that there aren't any other obvious cases of a granule == 0
being used in the dma_iommu code by adding BUG_ON(!iovad->granule) to
iova_{mask/align/offset/...} and running a variety of tests without
issue.

Are you going to post the patch?



Robin.


Fixes: a44e6657585b ("iommu/dma: Clean up MSI IOVA allocation")
Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
  drivers/iommu/dma-iommu.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8348f366..d7b0816 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -127,6 +127,16 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
  
  	cookie->msi_iova = base;

domain->iova_cookie = cookie;
+
+   /*
+* Setup granule for compatibility with __iommu_dma_{alloc/free} and
+* add a compile time check to ensure that writing granule won't
+* clobber msi_iova.
+*/
+   cookie->iovad.granule = cookie_msi_granule(cookie);
+   BUILD_BUG_ON(offsetof(struct iova_domain, granule) <
+   sizeof(cookie->msi_iova));
+
return 0;
  }
  EXPORT_SYMBOL(iommu_get_msi_cookie);




--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH] iommu/dma: Setup iova_domain granule for IOMMU_DMA_MSI cookies

2017-04-13 Thread Nate Watterson
Currently, the __iommu_dma_{map/free} functions call iova_{offset/align}
making them unsuitable for use with iommu_domains having an IOMMU_DMA_MSI
cookie since the cookie's iova_domain member, iovad, is uninitialized.

Now that iommu_dma_get_msi_page() calls __iommu_dma_map() regardless
of cookie type, failures are being seen when mapping MSI target
addresses for devices attached to UNMANAGED domains. To work around
this issue, the iova_domain granule for IOMMU_DMA_MSI cookies is
initialized to the value returned by cookie_msi_granule().
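
A minimal stand-alone sketch of why the zero granule misbehaves,
assuming the kernel's ALIGN() definition that iova_align() uses
(ALIGN(x, a) == ((x) + (a) - 1) & ~((a) - 1)); all names here are
hypothetical demo code, not the driver:

   #include <stdio.h>
   #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

   int main(void)
   {
           unsigned long size = 0x1000;

           /* Normal 4K granule: size rounds up as expected. */
           printf("%lx\n", ALIGN(size, 0x1000UL)); /* prints 1000 */

           /* Uninitialized granule == 0: the mask ~(0UL - 1) is 0, so
            * every requested size collapses to a zero-length mapping. */
           printf("%lx\n", ALIGN(size, 0UL));      /* prints 0 */
           return 0;
   }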

Fixes: a44e6657585b ("iommu/dma: Clean up MSI IOVA allocation")
Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/dma-iommu.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8348f366..d7b0816 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -127,6 +127,16 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
 
cookie->msi_iova = base;
domain->iova_cookie = cookie;
+
+   /*
+* Setup granule for compatibility with __iommu_dma_{alloc/free} and
+* add a compile time check to ensure that writing granule won't
+* clobber msi_iova.
+*/
+   cookie->iovad.granule = cookie_msi_granule(cookie);
+   BUILD_BUG_ON(offsetof(struct iova_domain, granule) <
+   sizeof(cookie->msi_iova));
+
return 0;
 }
 EXPORT_SYMBOL(iommu_get_msi_cookie);
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



[PATCH] iommu/iova: fix underflow bug in __alloc_and_insert_iova_range

2017-04-06 Thread Nate Watterson
Normally, calling alloc_iova() using an iova_domain with insufficient
pfns remaining between start_pfn and dma_limit will fail and return a
NULL pointer. Unexpectedly, if such a "full" iova_domain contains an
iova with pfn_lo == 0, the alloc_iova() call will instead succeed and
return an iova containing invalid pfns.

This is caused by an underflow bug in __alloc_and_insert_iova_range()
that occurs after walking the "full" iova tree when the search ends
at the iova with pfn_lo == 0 and limit_pfn is then adjusted to be just
below that (-1). This (now huge) limit_pfn gives the impression that a
vast amount of space is available between it and start_pfn and thus
a new iova is allocated with the invalid pfn_hi value, 0xFFF .

To remedy this, a check is introduced to ensure that adjustments to
limit_pfn will not underflow.

This issue has been observed in the wild, and is easily reproduced with
the following sample code.

struct iova_domain *iovad = kzalloc(sizeof(*iovad), GFP_KERNEL);
struct iova *rsvd_iova, *good_iova, *bad_iova;
unsigned long limit_pfn = 3;
unsigned long start_pfn = 1;
unsigned long va_size = 2;

init_iova_domain(iovad, SZ_4K, start_pfn, limit_pfn);
rsvd_iova = reserve_iova(iovad, 0, 0);
good_iova = alloc_iova(iovad, va_size, limit_pfn, true);
bad_iova = alloc_iova(iovad, va_size, limit_pfn, true);

Prior to the patch, this yielded:
*rsvd_iova == {0, 0}   /* Expected */
*good_iova == {2, 3}   /* Expected */
*bad_iova  == {-2, -1} /* Oh no... */

After the patch, bad_iova is NULL as expected since inadequate
space remains between limit_pfn and start_pfn after allocating
good_iova.
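
A stand-alone illustration of the wraparound itself (hypothetical demo
code, not the driver): on a 64-bit build, subtracting 1 from an
unsigned pfn_lo of 0 yields ULONG_MAX, which is exactly the bogus
limit_pfn described above.

   #include <stdio.h>

   int main(void)
   {
           unsigned long pfn_lo = 0;
           unsigned long limit_pfn = pfn_lo - 1; /* wraps to ULONG_MAX */

           printf("%lx\n", limit_pfn); /* prints ffffffffffffffff */
           return 0;
   }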

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/iova.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b7268a1..f6533e0 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -138,7 +138,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
break;  /* found a free slot */
}
 adjust_limit_pfn:
-   limit_pfn = curr_iova->pfn_lo - 1;
+   limit_pfn = curr_iova->pfn_lo ? (curr_iova->pfn_lo - 1) : 0;
 move_left:
prev = curr;
curr = rb_prev(curr);
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



[PATCH 2/2] arm64/dma-mapping: validate dma_masks against IORT defined limits

2017-01-31 Thread Nate Watterson
Some drivers set the dma_mask of client devices based solely on values
read from capability registers which may not account for platform
specific bus address width limitations. Fortunately, the ACPI IORT table
provides a way to report the effective number of address bits a device
can use to access memory. This information, when present, is used to
supplement the checks already being done in dma_supported() to avoid
setting overly generous dma_masks.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 arch/arm64/mm/dma-mapping.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index e040827..467fd23 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -19,6 +19,7 @@
 
 #include 
 #include 
+#include <linux/acpi_iort.h>
 #include 
 #include 
 #include 
@@ -347,6 +348,12 @@ static int __swiotlb_get_sgtable(struct device *dev, struct sg_table *sgt,
 
 static int __swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
+   int dma_limit;
+
+   dma_limit = iort_get_memory_address_limit(hwdev);
+   if (dma_limit >= 0 && DMA_BIT_MASK(dma_limit) < mask)
+   return 0;
+
if (swiotlb)
return swiotlb_dma_supported(hwdev, mask);
return 1;
@@ -784,6 +791,17 @@ static void __iommu_unmap_sg_attrs(struct device *dev,
iommu_dma_unmap_sg(dev, sgl, nelems, dir, attrs);
 }
 
+static int __iommu_dma_supported(struct device *hwdev, u64 mask)
+{
+   int dma_limit;
+
+   dma_limit = iort_get_memory_address_limit(hwdev);
+   if (dma_limit >= 0 && DMA_BIT_MASK(dma_limit) < mask)
+   return 0;
+
+   return iommu_dma_supported(hwdev, mask);
+}
+
 static struct dma_map_ops iommu_dma_ops = {
.alloc = __iommu_alloc_attrs,
.free = __iommu_free_attrs,
@@ -799,7 +817,7 @@ static void __iommu_unmap_sg_attrs(struct device *dev,
.sync_sg_for_device = __iommu_sync_sg_for_device,
.map_resource = iommu_dma_map_resource,
.unmap_resource = iommu_dma_unmap_resource,
-   .dma_supported = iommu_dma_supported,
+   .dma_supported = __iommu_dma_supported,
.mapping_error = iommu_dma_mapping_error,
 };
 
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



[PATCH 1/2] ACPI/IORT: add iort_get_memory_address_limit function

2017-01-31 Thread Nate Watterson
The memory_address_limit field present in IORT named component nodes
describes the effective number of address bits that a device can use
when accessing memory. Because this information has value when creating
or validating device dma_masks, the aforementioned accessor function is
introduced.
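
A caller-side sketch of the intended use, mirroring the check added in
the companion dma-mapping patch (the function name here is
hypothetical):

   static int example_dma_supported(struct device *hwdev, u64 mask)
   {
           int dma_limit = iort_get_memory_address_limit(hwdev);

           /* Reject masks wider than the bus can actually drive. */
           if (dma_limit >= 0 && DMA_BIT_MASK(dma_limit) < mask)
                   return 0;

           return 1;
   }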

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/acpi/arm64/iort.c | 25 +
 include/linux/acpi_iort.h |  3 +++
 2 files changed, 28 insertions(+)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index e0d2e6e..cd5d5f4 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -569,6 +569,31 @@ void iort_set_dma_mask(struct device *dev)
 }
 
 /**
+ * iort_get_memory_address_limit - If a named component node exists in the IORT
+ * for a device, get the number of address bits
+ * that the device can effectively use when
+ * accessing memory.
+ *
+ * @dev: The device
+ *
+ * Returns: ENOENT when no corresponding named component node found for dev
+ *  Named component memory_address_limit field otherwise
+ */
+int iort_get_memory_address_limit(struct device *dev)
+{
+   struct acpi_iort_node *node;
+   struct acpi_iort_named_component *ncomp;
+
+   node = iort_scan_node(ACPI_IORT_NODE_NAMED_COMPONENT,
+ iort_match_node_callback, dev);
+   if (!node)
+   return -ENOENT;
+
+   ncomp = (struct acpi_iort_named_component *)node->node_data;
+   return ncomp->memory_address_limit;
+}
+
+/**
  * iort_iommu_configure - Set-up IOMMU configuration for a device.
  *
  * @dev: device to configure
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index 77e0809..677b9c4 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -36,6 +36,7 @@
 struct irq_domain *iort_get_device_domain(struct device *dev, u32 req_id);
 /* IOMMU interface */
 void iort_set_dma_mask(struct device *dev);
+int iort_get_memory_address_limit(struct device *dev);
 const struct iommu_ops *iort_iommu_configure(struct device *dev);
 #else
 static inline void acpi_iort_init(void) { }
@@ -47,6 +48,8 @@ static inline struct irq_domain *iort_get_device_domain(struct device *dev,
 { return NULL; }
 /* IOMMU interface */
 static inline void iort_set_dma_mask(struct device *dev) { }
+static inline int iort_get_memory_address_limit(struct device *dev)
+{ return -ENOENT; }
 static inline
 const struct iommu_ops *iort_iommu_configure(struct device *dev)
 { return NULL; }
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



[PATCH] iommu/arm-smmu-v3: limit use of 2-level stream tables

2017-01-10 Thread Nate Watterson
In the current arm-smmu-v3 driver, all smmus that support 2-level
stream tables are being forced to use them. This is suboptimal for
smmus that support fewer stream ID bits than would fill a single
second-level table. This patch limits the use of 2-level tables to
smmus that both support the feature and whose first level table can
possibly contain more than a single entry.
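
A rough worked example, assuming STRTAB_SPLIT == 8 as in this version
of the driver (one L2 table resolves 2^8 stream IDs):

   /*
    * sid_bits == 7  -> all 2^7 stream IDs fit in one L2 table, so the
    *                   L1 would hold a single descriptor; a linear
    *                   table is used instead (feature bit cleared).
    * sid_bits == 16 -> the L1 needs 2^(16 - 8) == 256 descriptors, so
    *                   2-level tables remain worthwhile.
    */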

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/arm-smmu-v3.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4d6ec44..7d1a7e5 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1983,17 +1983,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
u32 size, l1size;
	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
 
-   /*
-* If we can resolve everything with a single L2 table, then we
-* just need a single L1 descriptor. Otherwise, calculate the L1
-* size, capped to the SIDSIZE.
-*/
-   if (smmu->sid_bits < STRTAB_SPLIT) {
-   size = 0;
-   } else {
-   size = STRTAB_L1_SZ_SHIFT - (ilog2(STRTAB_L1_DESC_DWORDS) + 3);
-   size = min(size, smmu->sid_bits - STRTAB_SPLIT);
-   }
+   /* Calculate the L1 size, capped to the SIDSIZE. */
+   size = STRTAB_L1_SZ_SHIFT - (ilog2(STRTAB_L1_DESC_DWORDS) + 3);
+   size = min(size, smmu->sid_bits - STRTAB_SPLIT);
cfg->num_l1_ents = 1 << size;
 
size += STRTAB_SPLIT;
@@ -2504,6 +2496,13 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
smmu->ssid_bits = reg >> IDR1_SSID_SHIFT & IDR1_SSID_MASK;
smmu->sid_bits = reg >> IDR1_SID_SHIFT & IDR1_SID_MASK;
 
+   /*
+* If the SMMU supports fewer bits than would fill a single L2 stream
+* table, use a linear table instead.
+*/
+   if (smmu->sid_bits <= STRTAB_SPLIT)
+   smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB;
+
/* IDR5 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);
 
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.
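
To make the sizing concrete, here is a standalone sketch of the arithmetic
above, using the driver's constants as of this patch (STRTAB_SPLIT = 8,
STRTAB_L1_SZ_SHIFT = 20, STRTAB_L1_DESC_DWORDS = 1). An SMMU with
sid_bits <= 8 is exactly the case the hw_probe hunk now demotes to a
linear table:

#include <stdio.h>

#define STRTAB_SPLIT		8
#define STRTAB_L1_SZ_SHIFT	20
#define STRTAB_L1_DESC_DWORDS	1

static unsigned int ilog2u(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

int main(void)
{
	unsigned int sid_bits;

	for (sid_bits = 4; sid_bits <= 24; sid_bits += 4) {
		unsigned int size;

		if (sid_bits <= STRTAB_SPLIT) {
			/* the new hw_probe check: fall back to linear */
			printf("sid_bits=%2u: linear, %u STEs\n",
			       sid_bits, 1u << sid_bits);
			continue;
		}
		size = STRTAB_L1_SZ_SHIFT -
		       (ilog2u(STRTAB_L1_DESC_DWORDS) + 3);
		if (size > sid_bits - STRTAB_SPLIT)
			size = sid_bits - STRTAB_SPLIT;
		printf("sid_bits=%2u: 2-level, %u L1 descriptors\n",
		       sid_bits, 1u << size);
	}
	return 0;
}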



[PATCH] iommu/arm-smmu-v3: Clear prior settings when updating STEs

2016-12-20 Thread Nate Watterson
To prevent corruption of the stage-1 context pointer field when
updating STEs, rebuild the entire containing dword instead of
clearing individual fields.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/arm-smmu-v3.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4d6ec44..94f305d 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1042,13 +1042,8 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
}
}
 
-   /* Nuke the existing Config, as we're going to rewrite it */
-   val &= ~(STRTAB_STE_0_CFG_MASK << STRTAB_STE_0_CFG_SHIFT);
-
-   if (ste->valid)
-   val |= STRTAB_STE_0_V;
-   else
-   val &= ~STRTAB_STE_0_V;
+   /* Nuke the existing STE_0 value, as we're going to rewrite it */
+   val = ste->valid ? STRTAB_STE_0_V : 0;
 
if (ste->bypass) {
val |= disable_bypass ? STRTAB_STE_0_CFG_ABORT
@@ -1083,7 +1078,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK
<< STRTAB_STE_0_S1CTXPTR_SHIFT) |
STRTAB_STE_0_CFG_S1_TRANS;
-
}
 
if (ste->s2_cfg) {
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.
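
The hazard being fixed is generic to packed descriptor words. The toy
program below (hypothetical field layout, not the real STE encoding)
shows how a read-modify-write that only clears CFG lets a stale pointer
field leak into the new value, while rebuilding from zero cannot:

#include <stdint.h>
#include <stdio.h>

#define DESC_V		(1ULL << 0)
#define DESC_CFG_SHIFT	1
#define DESC_CFG_MASK	0x7ULL
#define DESC_PTR_SHIFT	6

int main(void)
{
	/* old entry: valid, cfg = 5, some pointer installed */
	uint64_t val = DESC_V | (0x5ULL << DESC_CFG_SHIFT) |
		       (0xABCDULL << DESC_PTR_SHIFT);
	uint64_t rmw, rebuilt;

	/* buggy update: clears CFG only, stale PTR bits survive */
	rmw = val & ~(DESC_CFG_MASK << DESC_CFG_SHIFT);
	rmw |= DESC_V | (0x1ULL << DESC_CFG_SHIFT);

	/* this patch's approach: rebuild the whole dword */
	rebuilt = DESC_V | (0x1ULL << DESC_CFG_SHIFT);

	printf("rmw     = %#llx  <- stale PTR still present\n",
	       (unsigned long long)rmw);
	printf("rebuilt = %#llx\n", (unsigned long long)rebuilt);
	return 0;
}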



[PATCH] iommu/arm-smmu-v3: prevent corruption of ste stage-1 context ptr

2016-12-19 Thread Nate Watterson
To ensure that the stage-1 context ptr for an ste points to the
intended context descriptor, this patch adds code to clear away
the stale context ptr value prior to or'ing in the new one.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/arm-smmu-v3.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4d6ec44..093f9f1 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1080,6 +1080,8 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
if (smmu->features & ARM_SMMU_FEAT_STALLS)
dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
+   val &= ~(STRTAB_STE_0_S1CTXPTR_MASK <<
+STRTAB_STE_0_S1CTXPTR_SHIFT);
val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK
<< STRTAB_STE_0_S1CTXPTR_SHIFT) |
STRTAB_STE_0_CFG_S1_TRANS;
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.
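
The fix is an instance of the standard clear-then-OR idiom for replacing
one field of a packed word; in generic form (illustrative names only, not
taken from the driver) it reads:

#include <stdint.h>

static inline uint64_t field_replace(uint64_t word, uint64_t mask,
				     unsigned int shift, uint64_t newval)
{
	word &= ~(mask << shift);		/* clear the stale value */
	word |= (newval & mask) << shift;	/* then OR in the new one */
	return word;
}

Skipping the clear step is precisely how a stale S1CTXPTR value could
survive an update.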



[PATCH] iommu/arm-smmu-v3: avoid over allocating for l2 stream tables

2016-12-19 Thread Nate Watterson
Currently, all l2 stream tables are being allocated with space for
(1<<split) stes without regard to the number of sid bits the smmu
physically supports. To avoid allocating memory for inaccessible
stes, this patch limits the span of an l2 table to be no larger
than the sid size of the smmu to which it belongs.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/arm-smmu-v3.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4d6ec44..5dca671 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1129,6 +1129,7 @@ static void arm_smmu_init_bypass_stes(u64 *strtab, unsigned int nent)
 
 static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 {
+   u8 span;
size_t size;
void *strtab;
	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
@@ -1137,10 +1138,11 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
if (desc->l2ptr)
return 0;
 
-   size = 1 << (STRTAB_SPLIT + ilog2(STRTAB_STE_DWORDS) + 3);
+   span = (smmu->sid_bits < STRTAB_SPLIT) ? smmu->sid_bits : STRTAB_SPLIT;
+   size = 1 << (span + ilog2(STRTAB_STE_DWORDS) + 3);
	strtab = &cfg->strtab[(sid >> STRTAB_SPLIT) * STRTAB_L1_DESC_DWORDS];
 
-   desc->span = STRTAB_SPLIT + 1;
+   desc->span = span + 1;
	desc->l2ptr = dmam_alloc_coherent(smmu->dev, size, &desc->l2ptr_dma,
  GFP_KERNEL | __GFP_ZERO);
if (!desc->l2ptr) {
@@ -1150,7 +1152,7 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
return -ENOMEM;
}
 
-   arm_smmu_init_bypass_stes(desc->l2ptr, 1 << STRTAB_SPLIT);
+   arm_smmu_init_bypass_stes(desc->l2ptr, 1 << span);
arm_smmu_write_strtab_l1_desc(strtab, desc);
return 0;
 }
@@ -2001,6 +2003,8 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
dev_warn(smmu->dev,
 "2-level strtab only covers %u/%u bits of SID\n",
 size, smmu->sid_bits);
+   else if (smmu->sid_bits < size)
+   size = smmu->sid_bits;
 
l1size = cfg->num_l1_ents * (STRTAB_L1_DESC_DWORDS << 3);
	strtab = dmam_alloc_coherent(smmu->dev, l1size, &cfg->strtab_dma,
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.
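
Concretely, with the driver's STRTAB_STE_DWORDS = 8 (64 bytes per STE)
and STRTAB_SPLIT = 8, an unclamped L2 table is always
1 << (8 + 3 + 3) = 16 KiB. A standalone sketch of the clamped math:

#include <stdio.h>

#define STRTAB_SPLIT		8
#define ILOG2_STE_DWORDS	3	/* ilog2(STRTAB_STE_DWORDS = 8) */

int main(void)
{
	unsigned int sid_bits;

	for (sid_bits = 2; sid_bits <= 10; sid_bits += 2) {
		unsigned int span = sid_bits < STRTAB_SPLIT ?
				    sid_bits : STRTAB_SPLIT;
		unsigned long size = 1ul << (span + ILOG2_STE_DWORDS + 3);

		printf("sid_bits=%2u -> span=%u, %5lu bytes per L2 table\n",
		       sid_bits, span, size);
	}
	return 0;
}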



Re: [PATCH v5 14/14] drivers: acpi: iort: introduce iort_iommu_configure

2016-09-13 Thread Nate Watterson

On 2016-09-09 10:23, Lorenzo Pieralisi wrote:

DT based systems have a generic kernel API to configure IOMMUs
for devices (ie of_iommu_configure()).

On ARM based ACPI systems, the of_iommu_configure() equivalent can
be implemented atop ACPI IORT kernel API, with the corresponding
functions to map device identifiers to IOMMUs and retrieve the
corresponding IOMMU operations necessary for DMA operations set-up.

By relying on the iommu_fwspec generic kernel infrastructure,
implement the IORT based IOMMU configuration for ARM ACPI systems
and hook it up in the ACPI kernel layer that implements DMA
configuration for a device.

Signed-off-by: Lorenzo Pieralisi 
Cc: Hanjun Guo 
Cc: Tomasz Nowicki 
Cc: "Rafael J. Wysocki" 
---
 drivers/acpi/arm64/iort.c | 96 +++
 drivers/acpi/scan.c   |  7 +++-
 include/linux/acpi_iort.h |  6 +++
 3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 7c68eb4..55a4ae9 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -19,6 +19,7 @@
 #define pr_fmt(fmt)	"ACPI: IORT: " fmt

 #include 
+#include 
 #include 
 #include 
 #include 
@@ -27,6 +28,8 @@

 #define IORT_TYPE_MASK(type)   (1 << (type))
 #define IORT_MSI_TYPE  (1 << ACPI_IORT_NODE_ITS_GROUP)
+#define IORT_IOMMU_TYPE	((1 << ACPI_IORT_NODE_SMMU) | \
+   (1 << ACPI_IORT_NODE_SMMU_V3))

 struct iort_its_msi_chip {
	struct list_head	list;
@@ -467,6 +470,99 @@ struct irq_domain *iort_get_device_domain(struct device *dev, u32 req_id)
return irq_find_matching_fwnode(handle, DOMAIN_BUS_PCI_MSI);
 }

+static int __get_pci_rid(struct pci_dev *pdev, u16 alias, void *data)
+{
+   u32 *rid = data;
+
+   *rid = alias;
+   return 0;
+}
+
+static int arm_smmu_iort_xlate(struct device *dev, u32 streamid,
+  struct fwnode_handle *fwnode)
+{
+   int ret = iommu_fwspec_init(dev, fwnode);
+
+   if (!ret)
+   ret = iommu_fwspec_add_ids(dev, &streamid, 1);
+
+   return ret;
+}
+
+static const struct iommu_ops *iort_iommu_xlate(struct device *dev,
+   struct acpi_iort_node *node,
+   u32 streamid)
+{
+   struct fwnode_handle *iort_fwnode = NULL;
+   int ret;
+
+   if (node) {
+   iort_fwnode = iort_get_fwnode(node);
+   if (!iort_fwnode)
+   return NULL;
+
+   ret = arm_smmu_iort_xlate(dev, streamid,
+ iort_fwnode);
+   if (!ret)
+   return fwspec_iommu_get_ops(iort_fwnode);
+   }
+
+   return NULL;
+}
+
+/**
+ * iort_iommu_configure - Set-up IOMMU configuration for a device.
+ *
+ * @dev: device to configure
+ *
+ * Returns: iommu_ops pointer on configuration success
+ *  NULL on configuration failure
+ */
+const struct iommu_ops *iort_iommu_configure(struct device *dev)
+{
+   struct acpi_iort_node *node, *parent;
+   const struct iommu_ops *ops = NULL;
+   u32 streamid = 0;
+
+   if (dev_is_pci(dev)) {
+   struct pci_bus *bus = to_pci_dev(dev)->bus;
+   u32 rid;
+
+   pci_for_each_dma_alias(to_pci_dev(dev), __get_pci_rid,
+  &rid);
+
+   node = iort_scan_node(ACPI_IORT_NODE_PCI_ROOT_COMPLEX,
+ iort_match_node_callback, &bus->dev);
+   if (!node)
+   return NULL;
+
+   parent = iort_node_map_rid(node, rid, &streamid,
+  IORT_IOMMU_TYPE);
+
+   ops = iort_iommu_xlate(dev, parent, streamid);
+
+   } else {
+   int i = 0;
+
+   node = iort_scan_node(ACPI_IORT_NODE_NAMED_COMPONENT,
+ iort_match_node_callback, dev);
+   if (!node)
+   return NULL;
+


Nothing wrong with your code here, but wanted to warn you that there
appears to be a bug in iort_match_node_callback() for NAMED_COMPONENTS.

iort_match_node_callback() {
acpi_status status = AE_NOT_FOUND;
...
case ACPI_IORT_NODE_NAMED_COMPONENT: {
...
status = acpi_get_name(adev->handle, ACPI_FULL_PATHNAME, &buf);
if (ACPI_FAILURE(status)) {
dev_warn(dev, "Can't get device full path name\n");
break;
}

ncomp = (struct acpi_iort_named_component *)node->node_data;
if (!strcmp(ncomp->device_name, buf.pointer))
status = AE_OK;

acpi_os_free(buf.pointer);
break;
}
...
return status;
}

Notice how if strcmp() fails, status remains set to the status of the
call to acpi_get_name(), which at that point is AE_OK, so the callback
ends up reporting a match even for a named component whose device_name
does not actually match.

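A possible fix, sketched as a drop-in replacement for the quoted case
body (untested, and assuming the surrounding code is exactly as quoted):
derive status from the comparison itself rather than leaving it set to
acpi_get_name()'s return value.

status = acpi_get_name(adev->handle, ACPI_FULL_PATHNAME, &buf);
if (ACPI_FAILURE(status)) {
	dev_warn(dev, "Can't get device full path name\n");
	break;
}

ncomp = (struct acpi_iort_named_component *)node->node_data;
/* AE_OK only on an actual name match */
status = !strcmp(ncomp->device_name, buf.pointer) ? AE_OK : AE_NOT_FOUND;
acpi_os_free(buf.pointer);
break;

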
[PATCH] iommu/iova: validate iova_domain input to put_iova_domain

2016-07-13 Thread Nate Watterson
Passing a NULL or uninitialized iova_domain into put_iova_domain
will currently crash the kernel when the unconfigured iova_domain
data members are accessed. To prevent this from occurring, this patch
adds a check to make sure that the domain is non-NULL and that the
domain granule is non-zero. The granule can be used to check if the
domain was properly initialized because calling init_iova_domain
with a granule of zero would have already triggered a BUG statement
crashing the kernel.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/iova.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e23001b..3511a1c 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -459,6 +459,10 @@ void put_iova_domain(struct iova_domain *iovad)
struct rb_node *node;
unsigned long flags;
 
+   /* Only teardown properly initialized domains */
+   if (!iovad || !iovad->granule)
+   return;
+
free_iova_rcaches(iovad);
	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
	node = rb_first(&iovad->rbroot);
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.
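
A sketch of the error path this guards (hypothetical consumer,
illustrative names): a context whose iova_domain is zero-allocated but
whose probe fails before init_iova_domain() runs can now share a single
teardown path.

#include <linux/iova.h>

struct my_ctx {
	struct iova_domain iovad;	/* kzalloc'd, so granule == 0 */
};

static void my_ctx_teardown(struct my_ctx *ctx)
{
	/*
	 * Safe even if init_iova_domain() never ran: with this patch,
	 * put_iova_domain() returns early for an uninitialized domain.
	 */
	put_iova_domain(&ctx->iovad);
}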



[PATCH] iommu/dma-iommu: respect established iova region limits

2016-07-13 Thread Nate Watterson
In the current dma-iommu implementation, the upper limit used when
allocating iovas is based on the calling device's dma_mask without
considering the potentially more restrictive iova limits established
in iommu_dma_init_domain. To ensure that iovas are allocated within
the expected iova region, this patch adds logic in __alloc_iova to
clip input dma_limit values that are out of bounds.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/dma-iommu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index ea5a9eb..2066066 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -157,11 +157,14 @@ static struct iova *__alloc_iova(struct iova_domain *iovad, size_t size,
unsigned long shift = iova_shift(iovad);
unsigned long length = iova_align(iovad, size) >> shift;
 
+   /* Respect the upper limit established in iommu_dma_init_domain */
+   dma_limit = min_t(dma_addr_t, dma_limit >> shift, iovad->dma_32bit_pfn);
+
/*
 * Enforce size-alignment to be safe - there could perhaps be an
 * attribute to control this per-device, or at least per-domain...
 */
-   return alloc_iova(iovad, length, dma_limit >> shift, true);
+   return alloc_iova(iovad, length, dma_limit, true);
 }
 
 /* The IOVA allocator knows what we mapped, so just unmap whatever that was */
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.
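
Worked numbers for the clamp (illustrative values, not from the patch):
with a 4 KiB granule, shift = 12, so a 48-bit dma_mask yields a top pfn
of 2^36 - 1, which min_t() pulls down to the domain's dma_32bit_pfn
ceiling:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned int shift = 12;			/* 4 KiB granule */
	uint64_t dma_limit = (1ULL << 48) - 1;		/* device dma_mask */
	uint64_t dma_32bit_pfn = ((1ULL << 32) >> shift) - 1;
	uint64_t top_pfn = dma_limit >> shift;

	if (top_pfn > dma_32bit_pfn)			/* the min_t() clamp */
		top_pfn = dma_32bit_pfn;

	printf("uncapped top pfn: %#llx\n",
	       (unsigned long long)(dma_limit >> shift));
	printf("capped top pfn:   %#llx\n", (unsigned long long)top_pfn);
	return 0;
}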



[PATCH v2] iommu/arm-smmu-v3: limit use of 2-level stream tables

2016-07-12 Thread Nate Watterson
In the current arm-smmu-v3 driver, all smmus that support 2-level
stream tables are being forced to use them. This is suboptimal for
smmus that support fewer stream id bits than would fill a single
second-level table. This patch limits the use of 2-level tables to
smmus that both support the feature and whose first level table can
possibly contain more than a single entry.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/arm-smmu-v3.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 5f6b3bc..f27b8dc 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2033,17 +2033,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
u32 size, l1size;
	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
 
-   /*
-* If we can resolve everything with a single L2 table, then we
-* just need a single L1 descriptor. Otherwise, calculate the L1
-* size, capped to the SIDSIZE.
-*/
-   if (smmu->sid_bits < STRTAB_SPLIT) {
-   size = 0;
-   } else {
-   size = STRTAB_L1_SZ_SHIFT - (ilog2(STRTAB_L1_DESC_DWORDS) + 3);
-   size = min(size, smmu->sid_bits - STRTAB_SPLIT);
-   }
+   /* Calculate the L1 size, capped to the SIDSIZE. */
+   size = STRTAB_L1_SZ_SHIFT - (ilog2(STRTAB_L1_DESC_DWORDS) + 3);
+   size = min(size, smmu->sid_bits - STRTAB_SPLIT);
cfg->num_l1_ents = 1 << size;
 
size += STRTAB_SPLIT;
@@ -2531,6 +2523,13 @@ static int arm_smmu_device_probe(struct arm_smmu_device *smmu)
smmu->ssid_bits = reg >> IDR1_SSID_SHIFT & IDR1_SSID_MASK;
smmu->sid_bits = reg >> IDR1_SID_SHIFT & IDR1_SID_MASK;
 
+   /*
+* If the SMMU supports fewer bits than would fill a single L2 stream
+* table, use a linear table instead.
+*/
+   if (smmu->sid_bits <= STRTAB_SPLIT)
+   smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB;
+
/* IDR5 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);
 
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.



[PATCH] iommu/arm-smmu-v3: limit use of 2-level stream tables

2016-07-11 Thread Nate Watterson
In the current arm-smmu-v3 driver, all smmus that support 2-level
stream tables are being forced to use them. This is suboptimal for
smmus that support fewer stream id bits than would fill a single
second-level table. This patch limits the use of 2-level tables to
smmus that both support the feature and whose first level table can
possibly contain more than a single entry.

Signed-off-by: Nate Watterson <nwatt...@codeaurora.org>
---
 drivers/iommu/arm-smmu-v3.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 5f6b3bc..742254c 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2531,6 +2531,17 @@ static int arm_smmu_device_probe(struct arm_smmu_device *smmu)
smmu->ssid_bits = reg >> IDR1_SSID_SHIFT & IDR1_SSID_MASK;
smmu->sid_bits = reg >> IDR1_SID_SHIFT & IDR1_SID_MASK;
 
+   /*
+* If the SMMU supports fewer bits than would fill a single L2 stream
+* table, use a linear table instead.
+*/
+   if (smmu->sid_bits <= STRTAB_SPLIT &&
+   smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
+   smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB;
+   dev_info(smmu->dev, "SIDSIZE (%d) <= STRTAB_SPLIT (%d) : disabling 2-level stream tables\n",
+	     smmu->sid_bits, STRTAB_SPLIT);
+   }
+
/* IDR5 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);
 
-- 
Qualcomm Technologies, Inc. on behalf
of the Qualcomm Innovation Center, Inc.  The Qualcomm Innovation Center, Inc. 
is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.


