Re: [RFC v1 7/7] iommu/arm-smmu-v3: Enable ACPI based HiSilicon erratum 161010801

2017-05-17 Thread John Garry

On 16/05/2017 15:03, Shameerali Kolothum Thodi wrote:

> Lorenzo made a point that it might be relatively straightforward to just
> follow the IORT mapping for the SMMU through to the ITS MADT entry and
> pull the ITS geometry out of that. It would certainly be nicer to have
> such a helper abstracted away in the IORT code than have to go parsing
> vendor-specific tables directly in the SMMU driver. I reckon it might be
> worth taking that idea a bit further to see how it looks.

Ok. John has already mentioned this idea in our off-list discussion, and we
have one implementation where we go through the IORT node ID mappings
array to find the associated IORT ITS node. It then iterates over the ACPI
MADT GIC ITS table entries, looking for a match on the GIC ITS id, and
retrieves the base address of the matching ITS.

But as you said, it requires some helper from the IORT code, otherwise we
will end up adding all the ACPI table parsing code to the SMMUv3 driver. We
will take another look at this.

Thanks,
Shameer
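
For reference, the walk described above might look roughly like the sketch
below. This is illustrative only: the helper name and plumbing are
hypothetical, it assumes the SMMU node's first ID mapping leads to the ITS
group and takes the group's first ITS identifier, and error handling is
elided. The struct layouts are ACPICA's (include/acpi/actbl2.h).

#include <linux/acpi.h>
#include <linux/errno.h>

static u32 wanted_its_id;
static u64 its_base;

/* MADT callback: record the base address of the GIC ITS whose id matches */
static int match_its(struct acpi_subtable_header *header,
		     const unsigned long end)
{
	struct acpi_madt_generic_translator *its =
		(struct acpi_madt_generic_translator *)header;

	if (its->translation_id == wanted_its_id)
		its_base = its->base_address;
	return 0;
}

/* Hypothetical helper: SMMU IORT node -> ITS group -> MADT ITS base */
static int iort_find_its_base(struct acpi_table_header *iort,
			      struct acpi_iort_node *smmu_node, u64 *base)
{
	struct acpi_iort_id_mapping *map;
	struct acpi_iort_node *its_node;
	struct acpi_iort_its_group *its_group;

	/* Follow the SMMU node's first ID mapping; output_reference is an
	 * offset from the start of the IORT table. */
	map = ACPI_ADD_PTR(struct acpi_iort_id_mapping, smmu_node,
			   smmu_node->mapping_offset);
	its_node = ACPI_ADD_PTR(struct acpi_iort_node, iort,
				map->output_reference);

	/* Take the first GIC ITS id in the group... */
	its_group = (struct acpi_iort_its_group *)its_node->node_data;
	wanted_its_id = its_group->identifiers[0];

	/* ...and scan the MADT GIC ITS entries for a matching id */
	its_base = 0;
	acpi_table_parse_madt(ACPI_MADT_TYPE_GENERIC_TRANSLATOR,
			      match_its, 0);
	if (!its_base)
		return -ENODEV;

	*base = its_base;
	return 0;
}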



It could also be worth considering hard-coding the reserved address in 
the SMMUv3 driver for this model. This would not involve (more) reliance 
on the CSRT or IORT driver. However, note that hip07 has multiple SMMUs, and 
we only need to enable the quirk for the SMMU in front of the PCIe host 
controller.


John

BTW, FYI, hi161x is an alias for hip07/hip06




Re: [PATCH v7 3/3] iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #126

2017-06-06 Thread John Garry

On 30/05/2017 13:03, Geetha sowjanya wrote:

From: Geetha Sowjanya 

Cavium ThunderX2 SMMU doesn't support MSI and also doesn't have unique irq


separate irq lines


lines for gerror, eventq and cmdq-sync.

This patch addresses the issue by checking whether any interrupt sources
share the same irq number; if so, they are registered as shared irqs.

Signed-off-by: Geetha Sowjanya 
---
 Documentation/arm64/silicon-errata.txt |1 +
 drivers/iommu/arm-smmu-v3.c|   29 +
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/Documentation/arm64/silicon-errata.txt 
b/Documentation/arm64/silicon-errata.txt
index 4693a32..42422f6 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -63,6 +63,7 @@ stable kernels.
 | Cavium | ThunderX Core   | #27456  | CAVIUM_ERRATUM_27456|
 | Cavium | ThunderX SMMUv2 | #27704  | N/A |
 | Cavium | ThunderX2 SMMUv3| #74 | N/A |
+| Cavium | ThunderX2 SMMUv3| #126| N/A |
 || | | |
 | Freescale/NXP  | LS2080A/LS1043A | A-008585| FSL_ERRATUM_A008585 |
 || | | |
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4e80205..d2db01f 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2232,6 +2232,25 @@ static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
devm_add_action(dev, arm_smmu_free_msis, dev);
 }

+static int get_irq_flags(struct arm_smmu_device *smmu, int irq)
+{
+   int match_count = 0;
+
+   if (irq == smmu->evtq.q.irq)
+   match_count++;
+   if (irq == smmu->cmdq.q.irq)
+   match_count++;
+   if (irq == smmu->gerr_irq)
+   match_count++;
+   if (irq == smmu->priq.q.irq)
+   match_count++;
+
+   if (match_count > 1)
+   return IRQF_SHARED | IRQF_ONESHOT;
+
+   return IRQF_ONESHOT;
+}
+
 static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
 {
int ret, irq;
@@ -2252,7 +2271,7 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
if (irq) {
ret = devm_request_threaded_irq(smmu->dev, irq, NULL,
arm_smmu_evtq_thread,
-   IRQF_ONESHOT,
+   get_irq_flags(smmu, irq),
"arm-smmu-v3-evtq", smmu);
if (ret < 0)
dev_warn(smmu->dev, "failed to enable evtq irq\n");
@@ -2261,7 +2280,8 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
irq = smmu->cmdq.q.irq;
if (irq) {
ret = devm_request_irq(smmu->dev, irq,
-  arm_smmu_cmdq_sync_handler, 0,


If it is indeed ok to change this to having IRQF_ONESHOT set now 
(right?), then can you add a note to the commit message?



+  arm_smmu_cmdq_sync_handler,
+  get_irq_flags(smmu, irq),


indentation looks inconsistent


   "arm-smmu-v3-cmdq-sync", smmu);
if (ret < 0)
dev_warn(smmu->dev, "failed to enable cmdq-sync irq\n");
@@ -2270,7 +2290,8 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
irq = smmu->gerr_irq;
if (irq) {
ret = devm_request_irq(smmu->dev, irq, arm_smmu_gerror_handler,
-  0, "arm-smmu-v3-gerror", smmu);
+   get_irq_flags(smmu, irq),
+   "arm-smmu-v3-gerror", smmu);
if (ret < 0)
dev_warn(smmu->dev, "failed to enable gerror irq\n");
}
@@ -2280,7 +2301,7 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
if (irq) {
ret = devm_request_threaded_irq(smmu->dev, irq, NULL,
arm_smmu_priq_thread,
-   IRQF_ONESHOT,
+   get_irq_flags(smmu, irq),
"arm-smmu-v3-priq",
smmu);
if (ret < 0)






Re: [PATCH v3 0/2] acpi/iort, numa: Add numa node mapping for smmuv3 devices

2017-06-08 Thread John Garry

On 08/06/2017 05:44, Ganapatrao Kulkarni wrote:

ARM IORT specification(rev. C) has added  provision to define proximity
domain in SMMUv3 IORT table. Adding required code to parse Proximity
domain and set numa_node of smmv3 platform devices.

v3:
  - Addressed Lorenzo Pieralisi comment.

v2:
  - Changed as per Lorenzo Pieralisi and Hanjun Guo suggestions.

v1:
  - Initial patch

Ganapatrao Kulkarni (2):
  acpica: iort: Update SMMUv3 header for proximity domain mapping
  acpi/iort: numa: Add numa node mapping for smmuv3 devices

 drivers/acpi/arm64/iort.c | 28 ++--
 include/acpi/actbl2.h |  4 
 2 files changed, 30 insertions(+), 2 deletions(-)



We'll try and test this in the next day or so.

John



Re: [PATCH 0/8] io-pgtable lock removal

2017-06-15 Thread John Garry

On 15/06/2017 01:40, Ray Jui via iommu wrote:

Hi Robin,

wangzhou tested this patchset on our SMMUv3-based development board with 
a 10G PCI NIC card.


Currently we see a ~17% performance (throughput) drop when enabling the 
SMMU, but only a ~8% drop with your patchset.


FYI, for our integrated storage and network adapter, we see a big 
performance hit (maybe 40%) when enabling the SMMU with or without the 
patchset. Leizhen has been doing some investigation on this.


Thanks,
John


Hi Robin,

I have applied this patch series on top of v4.12-rc4, and ran various
Ethernet and NVMf target throughput tests on it.

To give you some background of my setup:

The system is a ARMv8 based system with 8 cores. It has various PCIe
root complexes that can be used to connect to PCIe endpoint devices
including NIC cards and NVMe SSDs.

I'm particularly interested in the performance of the PCIe root complex
that connects to the NIC card, and during my test, IOMMU is
enabled/disabled against that particular PCIe root complex. The root
complexes connected to NVMe SSDs remain unchanged (i.e., without IOMMU).

For the Ethernet throughput out of 50G link:

Note during the multiple TCP session test, each session will be spread
to different CPU cores for optimized performance

Without IOMMU:

TX TCP x1 - 29.7 Gbps
TX TCP x4 - 30.5 Gbps
TX TCP x8 - 28 Gbps

RX TCP x1 - 15 Gbps
RX TCP x4 - 33.7 Gbps
RX TCP x8 - 36 Gbps

With IOMMU, but without your latest patch:

TX TCP x1 - 15.2 Gbps
TX TCP x4 - 14.3 Gbps
TX TCP x8 - 13 Gbps

RX TCP x1 - 7.88 Gbps
RX TCP x4 - 13.2 Gbps
RX TCP x8 - 12.6 Gbps

With IOMMU and your latest patch:

TX TCP x1 - 21.4 Gbps
TX TCP x4 - 30.5 Gbps
TX TCP x8 - 21.3 Gbps

RX TCP x1 - 7.7 Gbps
RX TCP x4 - 20.1 Gbps
RX TCP x8 - 27.1 Gbps

With the NVMf target test with 4 SSDs (fio-based, random read, 4k,
8 jobs):

Without IOMMU:

IOPS = 1080K

With IOMMU, but without your latest patch:

IOPS = 520K

With IOMMU and your latest patch:

IOPS = 500K ~ 850K (a lot of variation observed during the same test run)

As you can see, performance has improved significantly with this patch
series! That is very impressive!

However, it is still off, compared to the test runs without the IOMMU.
I'm wondering if more improvement is expected.

In addition, a much larger throughput variation is observed in the tests
with these latest patches, when multiple CPUs are involved. I'm
wondering if that is caused by some remaining lock in the driver?

Also, in a few occasions, I observed the following message during the
test, when multiple cores are involved:

arm-smmu 6400.mmu: TLB sync timed out -- SMMU may be deadlocked

Thanks,

Ray

On 6/9/17 12:28 PM, Nate Watterson wrote:

Hi Robin,

On 6/8/2017 7:51 AM, Robin Murphy wrote:

Hi all,

Here's the cleaned up nominally-final version of the patches everybody's
keen to see. #1 is just a non-critical thing-I-spotted-in-passing fix,
#2-#4 do some preparatory work (and bid farewell to everyone's least
favourite bit of code, hooray!), and #5-#8 do the dirty deed itself.

The branch I've previously shared has been updated too:

   git://linux-arm.org/linux-rm  iommu/pgtable

All feedback welcome, as I'd really like to land this for 4.13.



I tested the series on a QDF2400 development platform and see notable
performance improvements particularly in workloads that make concurrent
accesses to a single iommu_domain.


Robin.


Robin Murphy (8):
   iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
   iommu/io-pgtable-arm: Improve split_blk_unmap
   iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
   iommu/io-pgtable: Introduce explicit coherency
   iommu/io-pgtable-arm: Support lockless operation
   iommu/io-pgtable-arm-v7s: Support lockless operation
   iommu/arm-smmu: Remove io-pgtable spinlock
   iommu/arm-smmu-v3: Remove io-pgtable spinlock

  drivers/iommu/arm-smmu-v3.c|  36 ++-
  drivers/iommu/arm-smmu.c   |  48 --
  drivers/iommu/io-pgtable-arm-v7s.c | 173 +
  drivers/iommu/io-pgtable-arm.c | 190 -
  drivers/iommu/io-pgtable.h |   6 ++
  5 files changed, 268 insertions(+), 185 deletions(-)






Re: [PATCH v2 0/8] io-pgtable lock removal

2017-06-23 Thread John Garry

On 22/06/2017 16:53, Robin Murphy wrote:

The feedback has been promising, so v2 is just a final update to cover
a handful of memory ordering and cosmetic tweaks that came up when Will
and I went through this offline.

Thanks,
Robin.


Hi Robin,

Is it worth us retesting this patchset?

If yes, as for a branch (for convenience), does this one now match v2:
git://linux-arm.org/linux-rm  iommu/pgtable

I saw it was rewritten recently.

Thanks,
John




Robin Murphy (8):
  iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
  iommu/io-pgtable-arm: Improve split_blk_unmap
  iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
  iommu/io-pgtable: Introduce explicit coherency
  iommu/io-pgtable-arm: Support lockless operation
  iommu/io-pgtable-arm-v7s: Support lockless operation
  iommu/arm-smmu: Remove io-pgtable spinlock
  iommu/arm-smmu-v3: Remove io-pgtable spinlock

 drivers/iommu/arm-smmu-v3.c|  36 ++-
 drivers/iommu/arm-smmu.c   |  48 --
 drivers/iommu/io-pgtable-arm-v7s.c | 179 +-
 drivers/iommu/io-pgtable-arm.c | 191 -
 drivers/iommu/io-pgtable.h |   6 ++
 5 files changed, 274 insertions(+), 186 deletions(-)






Re: [PATCH v2 0/8] io-pgtable lock removal

2017-06-26 Thread John Garry

On 23/06/2017 10:58, Robin Murphy wrote:

On 23/06/17 09:47, John Garry wrote:

On 22/06/2017 16:53, Robin Murphy wrote:

The feedback has been promising, so v2 is just a final update to cover
a handful of memory ordering and cosmetic tweaks that came up when Will
and I went through this offline.

Thanks,
Robin.


Hi Robin,

Is it worth us retesting this patchset?

If yes, as for a branch (for convenience), does this one now match v2:
git://linux-arm.org/linux-rm  iommu/pgtable

I saw it was rewritten recently.


Indeed, I pushed the branch out before posting. Functionally it ought to
be identical to v1, so this was more just a heads-up - feel free to
double-check I've not broken anything, but I wouldn't say you need to
throw the full test lab at it again.



I saw Will has already sent the pull request. But, FWIW, we are seeing 
roughly the same performance as the v1 patchset. For the PCI NIC, Zhou again 
found the performance drop goes from ~15% to ~8% with the SMMU enabled, and for 
the integrated storage controller [platform device], we still see a drop of 
about 50%, depending on data rates (Leizhen has been working on fixing this).


John


Thanks,
Robin.



Thanks,
John




Robin Murphy (8):
  iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
  iommu/io-pgtable-arm: Improve split_blk_unmap
  iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
  iommu/io-pgtable: Introduce explicit coherency
  iommu/io-pgtable-arm: Support lockless operation
  iommu/io-pgtable-arm-v7s: Support lockless operation
  iommu/arm-smmu: Remove io-pgtable spinlock
  iommu/arm-smmu-v3: Remove io-pgtable spinlock

 drivers/iommu/arm-smmu-v3.c|  36 ++-
 drivers/iommu/arm-smmu.c   |  48 --
 drivers/iommu/io-pgtable-arm-v7s.c | 179
+-
 drivers/iommu/io-pgtable-arm.c | 191
-
 drivers/iommu/io-pgtable.h |   6 ++
 5 files changed, 274 insertions(+), 186 deletions(-)













Re: [PATCH v2 0/8] io-pgtable lock removal

2017-06-26 Thread John Garry




I saw Will has already sent the pull request. But, FWIW, we are seeing
roughly the same performance as the v1 patchset. For the PCI NIC, Zhou again
found the performance drop goes from ~15% to ~8% with the SMMU enabled, and
for the integrated storage controller [platform device], we still see a drop
of about 50%, depending on data rates (Leizhen has been working on fixing
this).


Thanks for confirming. Following Joerg's suggestion that the storage
workloads may still depend on rbtree performance - it had slipped my
mind that even with small block sizes those could well be grouped into
scatterlists large enough to trigger a >64-page IOVA allocation - I've
taken the liberty of cooking up a simplified version of Leizhen's rbtree
optimisation series in the iommu/iova branch of my tree. I'll follow up
on that after the merge window, but if anyone wants to play with it in
the meantime feel free.


Just a reminder that we also saw poor performance with our 
integrated NIC on your v1 patchset (I can push for v2 patchset 
testing, but expect the same).


We might now be able to include an LSI 3108 PCI SAS card in our testing 
to give a broader set of results.


John



Robin.







Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction

2017-07-17 Thread John Garry


On 29/06/2017 03:08, Leizhen (ThunderTown) wrote:



On 2017/6/28 17:32, Will Deacon wrote:

Hi Zhen Lei,

Nate (CC'd), Robin and I have been working on something very similar to
this series, but this patch is different to what we had planned. More below.

On Mon, Jun 26, 2017 at 09:38:46PM +0800, Zhen Lei wrote:

All TLBI commands should be followed by a SYNC command, to make
sure that they have completely finished. So we can just add the TLBI
commands into the queue, and put off their execution until we meet a SYNC or
other command. To prevent the following SYNC command from waiting a long
time because too many commands have been delayed, restrict the max
delayed number.

According to my test, I got the same performance data as when I replaced writel
with writel_relaxed in queue_inc_prod.

Signed-off-by: Zhen Lei 
---
 drivers/iommu/arm-smmu-v3.c | 42 +-
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 291da5f..4481123 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -337,6 +337,7 @@
 /* Command queue */
 #define CMDQ_ENT_DWORDS2
 #define CMDQ_MAX_SZ_SHIFT  8
+#define CMDQ_MAX_DELAYED   32

 #define CMDQ_ERR_SHIFT 24
 #define CMDQ_ERR_MASK  0x7f
@@ -472,6 +473,7 @@ struct arm_smmu_cmdq_ent {
};
} cfgi;

+   #define CMDQ_OP_TLBI_NH_ALL 0x10
#define CMDQ_OP_TLBI_NH_ASID0x11
#define CMDQ_OP_TLBI_NH_VA  0x12
#define CMDQ_OP_TLBI_EL2_ALL0x20
@@ -499,6 +501,7 @@ struct arm_smmu_cmdq_ent {

 struct arm_smmu_queue {
int irq; /* Wired interrupt */
+   u32 nr_delay;

__le64  *base;
dma_addr_t  base_dma;
@@ -722,11 +725,16 @@ static int queue_sync_prod(struct arm_smmu_queue *q)
return ret;
 }

-static void queue_inc_prod(struct arm_smmu_queue *q)
+static void queue_inc_swprod(struct arm_smmu_queue *q)
 {
-   u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1;
+   u32 prod = q->prod + 1;

q->prod = Q_OVF(q, q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
+}
+
+static void queue_inc_prod(struct arm_smmu_queue *q)
+{
+   queue_inc_swprod(q);
writel(q->prod, q->prod_reg);
 }

@@ -761,13 +769,24 @@ static void queue_write(__le64 *dst, u64 *src, size_t n_dwords)
*dst++ = cpu_to_le64(*src++);
 }

-static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent)
+static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent, int optimize)
 {
if (queue_full(q))
return -ENOSPC;

queue_write(Q_ENT(q, q->prod), ent, q->ent_dwords);
-   queue_inc_prod(q);
+
+   /*
+* We don't want too many commands to be delayed, this may lead the
+* followed sync command to wait for a long time.
+*/
+   if (optimize && (++q->nr_delay < CMDQ_MAX_DELAYED)) {
+   queue_inc_swprod(q);
+   } else {
+   queue_inc_prod(q);
+   q->nr_delay = 0;
+   }
+


So here, you're effectively putting invalidation commands into the command
queue without updating PROD. Do you actually see a performance advantage
from doing so? Another side of the argument would be that we should be

Yes, my sas ssd performance test showed that it can improve about
100-150K/s (the same as when I directly replaced writel with
writel_relaxed). And the average execution time of iommu_unmap (which is
called by iommu_dma_unmap_sg) dropped from 10us to 5us.


moving PROD as soon as we can, so that the SMMU can process invalidation
commands in the background and reduce the cost of the final SYNC operation
when the high-level unmap operation is complete.

It may be that __iowmb() is more expensive than waiting for the tlbi to
complete. Besides the time of __iowmb() itself, it is also executed under the
spinlock, so lock contention will rise rapidly in the stress scene. __iowmb()
costs 300-500ns on average (sorry, I forget the exact value).

In addition, after applying this patchset, Robin's v2, and my earlier dma64
iova optimization patchset, our net performance test got the same data as
global bypass. But sas ssd still dropped by more than 20%. Maybe we should
still focus on map/unmap, because the average execution time of iova
alloc/free is only about 400ns.

By the way, patches 2-5 are more effective than this one: they can improve
throughput by more than 350K/s. And with them, we got about a further
100-150K/s improvement from Robin's v2; otherwise, I saw no effect from
Robin's v2. Sorry, I have not tested how this patch does without patches 2-5.
Furthermore, I got the same performance data as global bypass for the
traditional mechanical hard disk with only patches 2-5 (without this patch
or Robin's).



Will
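
To make the batching idea concrete, here is a minimal model in plain C of
what the patch does (a sketch, not the driver code): TLBI commands only
advance the software producer index, and the doorbell write - the writel
plus its barrier, which the measurements above show to be the expensive
part - happens on a SYNC/other command or once CMDQ_MAX_DELAYED TLBIs have
accumulated.

#define CMDQ_MAX_DELAYED	32

struct fake_cmdq {
	unsigned int sw_prod;	/* commands written into the ring buffer */
	unsigned int hw_prod;	/* last value published to the PROD register */
	unsigned int nr_delay;	/* TLBIs queued since the last doorbell */
};

static void ring_doorbell(struct fake_cmdq *q)
{
	q->hw_prod = q->sw_prod;  /* stands in for writel(q->prod, q->prod_reg) */
	q->nr_delay = 0;
}

static void insert_cmd(struct fake_cmdq *q, int is_tlbi)
{
	q->sw_prod++;		/* the command body is already in memory */
	if (is_tlbi && ++q->nr_delay < CMDQ_MAX_DELAYED)
		return;		/* defer the doorbell; a later SYNC flushes it */
	ring_doorbell(q);	/* SYNC/other command, or the cap was reached */
}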

Re: [PATCH 1/1] iommu/arm-smmu-v3: eliminate a potential memory corruption on Hi16xx soc

2018-10-15 Thread John Garry

On 15/10/2018 09:36, Zhen Lei wrote:

ITS translation register map:
0x-0x003C   Reserved
0x0040  GITS_TRANSLATER
0x0044-0xFFFC   Reserved



Can you add a better opening than the ITS translation register map?


The standard GITS_TRANSLATER register in ITS is only 4 bytes, but Hisilicon
expands the next 4 bytes to carry some IMPDEF information. That means 8 bytes
of data will be written to MSIAddress each time.

MSIAddr: |4bytes|4bytes|
 |MSIData   |IMPDEF|

There is no problem for ITS, because the next 4 bytes of space are reserved in ITS.
But it will overwrite the 4 bytes of memory following "sync_count". It's very


I think arm_smmu_device.sync_count is better, or "sync_count member in 
the smmu driver control struct".



luckly that the previous and the next neighbour of "sync_count" are both aligned


/s/luckly/luckily or fortunately/


by 8 bytes, so no problem is met now.

It's good to explicitly add a workaround:
1. Add gcc __attribute__((aligned(8))) to make sure that "sync_count" is always
   aligned by 8 bytes.
2. Add a "u64" union member to make sure the 4 bytes padding is always exist.

There is no functional change.

Signed-off-by: Zhen Lei 
---
 drivers/iommu/arm-smmu-v3.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 5059d09..a07bc0d 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -586,7 +586,10 @@ struct arm_smmu_device {

struct arm_smmu_strtab_cfg  strtab_cfg;

+   union {
+   u64 padding; /* workaround for Hisilicon */


I think that a more detailed comment is required.


u32 sync_count;


Can you indent these 2 members? However - as discussed internally - this 
may have an endianness issue, so it is better to declare a full 64b struct.


+   } __attribute__((aligned(8)));

/* IOMMU core code handle */
struct iommu_device iommu;


Thanks





Re: [PATCH v5 2/2] iommu/arm-smmu-v3: Reunify arm_smmu_cmdq_issue_sync()

2018-10-17 Thread John Garry

On 17/10/2018 14:56, Robin Murphy wrote:

Now that both sync methods are more or less the same shape, we can save
some code and levels of indirection by rolling them up together again,
with just a couple of simple conditionals to discriminate the MSI and
queue-polling specifics.


Hi Robin, Will,

I had been thinking of this other patch previously:

iommu/arm-smmu-v3: Stop rebuilding non-MSI CMD_SYNC commands

The contents of the non-MSI CMD_SYNC command are fixed. This patch 
offers a small optimisation by keeping a pre-built copy of this command in 
static data for re-use, thereby avoiding unnecessary re-building.


Signed-off-by: John Garry 

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 6947ccf..9d86c29 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -963,14 +963,16 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)


 static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
 {
-   u64 cmd[CMDQ_ENT_DWORDS];
+   static u64 cmd[CMDQ_ENT_DWORDS] = {
+  FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) |
+  FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV) |
+  FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH) |
+  FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB),
+   0};
unsigned long flags;
bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
-   struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
int ret;

-   arm_smmu_cmdq_build_cmd(cmd, &ent);
-
spin_lock_irqsave(&smmu->cmdq.lock, flags);
arm_smmu_cmdq_insert_cmd(smmu, cmd);
ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);

But it seems that combining the MSI and non-MSI methods would block 
this.


How do you feel about this?

Thanks,
John



Signed-off-by: Robin Murphy 
---
 drivers/iommu/arm-smmu-v3.c | 49 +
 1 file changed, 12 insertions(+), 37 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index da8a91d116bf..36db63e3afcf 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -933,7 +933,7 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
  * The difference between val and sync_idx is bounded by the maximum size of
  * a queue at 2^20 entries, so 32 bits is plenty for wrap-safe arithmetic.
  */
-static int __arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
+static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
 {
ktime_t timeout;
u32 val;
@@ -988,16 +988,17 @@ static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
return -ETIMEDOUT;
 }

-static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)
+static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
 {
u64 cmd[CMDQ_ENT_DWORDS];
unsigned long flags;
-   struct arm_smmu_cmdq_ent ent = {
-   .opcode = CMDQ_OP_CMD_SYNC,
-   .sync   = {
-   .msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count)),
-   },
-   };
+   bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
+  (smmu->features & ARM_SMMU_FEAT_COHERENCY);
+   struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
+   int ret, sync_idx, sync_gen;
+
+   if (msi)
+   ent.sync.msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count));

spin_lock_irqsave(&smmu->cmdq.lock, flags);

@@ -1009,39 +1010,13 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)
arm_smmu_cmdq_build_cmd(cmd, &ent);
arm_smmu_cmdq_insert_cmd(smmu, cmd);
}
-
-   spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
-
-   return __arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
-}
-
-static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
-{
-   u64 cmd[CMDQ_ENT_DWORDS];
-   unsigned long flags;
-   struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
-   int sync_idx, sync_gen;
-
-   arm_smmu_cmdq_build_cmd(cmd, &ent);
-
-   spin_lock_irqsave(&smmu->cmdq.lock, flags);
-   if (smmu->prev_cmd_opcode != CMDQ_OP_CMD_SYNC)
-   arm_smmu_cmdq_insert_cmd(smmu, cmd);
sync_idx = smmu->cmdq.q.prod;
sync_gen = READ_ONCE(smmu->cmdq_generation);
+
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);

-   return arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen);
-}
-
-static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
-{
-   int ret;
-   bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
-  (smmu->features & ARM_SMMU_FEAT_COHERENCY);
-
-   ret = msi ? __arm_smmu_cmdq_issue_sync_msi(smmu)
- : __arm_smmu_cmdq_issue_sync(smmu);
+   ret = msi ? arm_smmu_sync_poll_msi(smmu, ent.sync.msidata)
+ : arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen);
if (ret)
dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout\n");
 }
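
As an aside on the "32 bits is plenty for wrap-safe arithmetic" comment in
the hunk above, the signed-difference trick can be checked standalone
(userspace sketch, assuming the queue is at most 2^20 entries as that
comment states):

#include <assert.h>
#include <stdint.h>

/* "behind" in the sense of (int)(val - sync_idx) < 0: correct as long as
 * the true distance between the two counters is below 2^31. */
static int behind(uint32_t val, uint32_t sync_idx)
{
	return (int32_t)(val - sync_idx) < 0;
}

int main(void)
{
	assert(behind(5, 10));			/* plainly behind */
	assert(!behind(10, 5));			/* plainly ahead */
	assert(!behind(3, 0xFFFFFFF0u));	/* ahead, across the 32-bit wrap */
	assert(behind(0xFFFFFFF0u, 3));		/* behind, across the wrap */
	return 0;
}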

Re: [PATCH v5 1/2] iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock

2018-10-18 Thread John Garry

On 17/10/2018 14:56, Robin Murphy wrote:

Even without the MSI trick, we can still do a lot better than hogging
the entire queue while it drains. All we actually need to do for the
necessary guarantee of completion is wait for our particular command to
have been consumed - as long as we keep track of where we inserted it,
there is no need to block other CPUs from adding further commands in the
meantime. There is one theoretical (but incredibly unlikely) edge case
to avoid, where cons has wrapped twice to still appear 'behind' the sync
position - this is easily disambiguated by adding a generation count to
the queue to indicate when prod wraps, since cons cannot wrap twice
without prod having wrapped at least once.

This also makes it reasonable to separate the two conceptually different
modes of polling such that command insertion - which really wants to be
fair and have minimal latency - is not subject to exponential backoff,
and returns to its original implementation.

Signed-off-by: Robin Murphy 
---

v5:
 - Rework to incorporate the back-to-back sync elision.
 - Refactor the generation count slightly to preemptively help with
   the HiSilicon MSI workaround.
 - Split the cleanup into a separate patch for ease of review (it could
   happily be squashed when applied).

The fundamental logic is copied directly from v4, but I've dropped the
previous tested-by since there are a fair few subtle changes in how it's
integrated. Patches are based on Will's iommu/devel branch plus my "Fix
big-endian CMD_SYNC writes" patch.

Robin.

 drivers/iommu/arm-smmu-v3.c | 94 +++--
 1 file changed, 69 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 867ba548c2cc..da8a91d116bf 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -588,6 +588,7 @@ struct arm_smmu_device {
struct arm_smmu_strtab_cfg  strtab_cfg;

u32 sync_count;
+   int cmdq_generation;

/* IOMMU core code handle */
struct iommu_device iommu;
@@ -676,6 +677,17 @@ static bool queue_empty(struct arm_smmu_queue *q)
   Q_WRP(q, q->prod) == Q_WRP(q, q->cons);
 }

+static bool queue_behind(struct arm_smmu_queue *q, u32 idx)
+{
+   return Q_IDX(q, q->cons) < Q_IDX(q, idx);
+}
+
+static bool queue_ahead_not_wrapped(struct arm_smmu_queue *q, u32 idx)
+{
+   return Q_IDX(q, q->cons) >= Q_IDX(q, idx) &&
+  Q_WRP(q, q->cons) == Q_WRP(q, idx);
+}
+
 static void queue_sync_cons(struct arm_smmu_queue *q)
 {
q->cons = readl_relaxed(q->cons_reg);
@@ -709,33 +721,19 @@ static void queue_inc_prod(struct arm_smmu_queue *q)
writel(q->prod, q->prod_reg);
 }

-/*
- * Wait for the SMMU to consume items. If sync is true, wait until the queue
- * is empty. Otherwise, wait until there is at least one free slot.
- */
-static int queue_poll_cons(struct arm_smmu_queue *q, bool sync, bool wfe)
+static int queue_poll_cons(struct arm_smmu_queue *q, bool wfe)
 {
-   ktime_t timeout;
-   unsigned int delay = 1, spin_cnt = 0;
+   ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_POLL_TIMEOUT_US);

-   /* Wait longer if it's a CMD_SYNC */
-   timeout = ktime_add_us(ktime_get(), sync ?
-   ARM_SMMU_CMDQ_SYNC_TIMEOUT_US :
-   ARM_SMMU_POLL_TIMEOUT_US);
-
-   while (queue_sync_cons(q), (sync ? !queue_empty(q) : queue_full(q))) {
+   while (queue_sync_cons(q), queue_full(q)) {
if (ktime_compare(ktime_get(), timeout) > 0)
return -ETIMEDOUT;

if (wfe) {
wfe();
-   } else if (++spin_cnt < ARM_SMMU_CMDQ_SYNC_SPIN_COUNT) {
-   cpu_relax();
-   continue;
} else {
-   udelay(delay);
-   delay *= 2;
-   spin_cnt = 0;
+   cpu_relax();
+   udelay(1);
}
}

@@ -905,8 +903,11 @@ static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)

smmu->prev_cmd_opcode = FIELD_GET(CMDQ_0_OP, cmd[0]);

+   if (Q_IDX(q, q->prod + 1) == 0)
+   WRITE_ONCE(smmu->cmdq_generation, smmu->cmdq_generation + 1);
+
while (queue_insert_raw(q, cmd) == -ENOSPC) {
-   if (queue_poll_cons(q, false, wfe))
+   if (queue_poll_cons(q, wfe))
dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
}
 }
@@ -945,6 +946,48 @@ static int __arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0;
 }

+static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
+  int sync_gen)
+{
+   

Re: [PATCH v5 2/2] iommu/arm-smmu-v3: Reunify arm_smmu_cmdq_issue_sync()

2018-10-18 Thread John Garry

On 18/10/2018 12:18, Robin Murphy wrote:

Hi John,

On 17/10/18 15:38, John Garry wrote:

On 17/10/2018 14:56, Robin Murphy wrote:

Now that both sync methods are more or less the same shape, we can save
some code and levels of indirection by rolling them up together again,
with just a couple of simple conditionals to discriminate the MSI and
queue-polling specifics.


Hi Robin, Will,

I had been thinking of this other patch previously:

iommu/arm-smmu-v3: Stop rebuilding non-MSI CMD_SYNC commands

The contents of the non-MSI CMD_SYNC command are fixed. This patch
offers a small optimisation by keeping a pre-built copy of this command in
static data for re-use, thereby avoiding unnecessary re-building.




Hi Robin,


As far as I can tell, not counting the general call overhead which can
be solved in less-specific ways, this essentially saves two MOVs, a
MOVK, and an STP which in practice might not even reach L1 before it
gets forwarded back out of the store buffer. Do you have any numbers for
what difference this makes in terms of I/O performance or cache traffic?


I'll try to get some numbers. I'm not expecting anything big, but this 
just seemed like a cheap optimisation, maybe at the expense of slightly 
inconsistent code.


Thanks,
John



Robin.


Signed-off-by: John Garry 

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 6947ccf..9d86c29 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -963,14 +963,16 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)

  static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
  {
-   u64 cmd[CMDQ_ENT_DWORDS];
+   static u64 cmd[CMDQ_ENT_DWORDS] = {
+  FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) |
+  FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV) |
+  FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH) |
+  FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB),
+   0};
 unsigned long flags;
 bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
-   struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
 int ret;

-   arm_smmu_cmdq_build_cmd(cmd, &ent);
-
 spin_lock_irqsave(&smmu->cmdq.lock, flags);
 arm_smmu_cmdq_insert_cmd(smmu, cmd);
 ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);

But it seems that combining the MSI and non-MSI methods would
block this.

How do you feel about this?

Thanks,
John



Signed-off-by: Robin Murphy 
---
 drivers/iommu/arm-smmu-v3.c | 49 +
 1 file changed, 12 insertions(+), 37 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index da8a91d116bf..36db63e3afcf 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -933,7 +933,7 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
 * The difference between val and sync_idx is bounded by the maximum size of
 * a queue at 2^20 entries, so 32 bits is plenty for wrap-safe arithmetic.
  */
-static int __arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
+static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
 {
 ktime_t timeout;
 u32 val;
@@ -988,16 +988,17 @@ static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
 return -ETIMEDOUT;
 }

-static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)
+static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
 {
 u64 cmd[CMDQ_ENT_DWORDS];
 unsigned long flags;
-struct arm_smmu_cmdq_ent ent = {
-.opcode = CMDQ_OP_CMD_SYNC,
-.sync= {
-.msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count)),
-},
-};
+bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
+   (smmu->features & ARM_SMMU_FEAT_COHERENCY);
+struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
+int ret, sync_idx, sync_gen;
+
+if (msi)
+ent.sync.msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count));

 spin_lock_irqsave(&smmu->cmdq.lock, flags);

@@ -1009,39 +1010,13 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)
 arm_smmu_cmdq_build_cmd(cmd, &ent);
 arm_smmu_cmdq_insert_cmd(smmu, cmd);
 }
-
-spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
-
-return __arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
-}
-
-static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
-{
-u64 cmd[CMDQ_ENT_DWORDS];
-unsigned long flags;
-struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
-int sync_idx, sync_gen;
-
-arm_smmu_cmdq_build_cmd(cmd, &ent);
-
-spin_lock_irqsave(&smmu->cmdq.lock, flags);
-if (smmu->prev_cmd_opcode != CMDQ_OP_CMD_SYNC)
-arm_smmu_cmdq_insert_cmd(smmu, cmd);
sync_idx = smmu->cmdq.q.prod;

Re: [PATCH v5 1/2] iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock

2018-10-18 Thread John Garry

On 18/10/2018 12:19, Robin Murphy wrote:

On 18/10/18 11:55, John Garry wrote:
[...]

@@ -976,18 +1019,19 @@ static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
 {
 u64 cmd[CMDQ_ENT_DWORDS];
 unsigned long flags;
-bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
 struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
-int ret;
+int sync_idx, sync_gen;

 arm_smmu_cmdq_build_cmd(cmd, &ent);

 spin_lock_irqsave(&smmu->cmdq.lock, flags);
-arm_smmu_cmdq_insert_cmd(smmu, cmd);
-ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
+if (smmu->prev_cmd_opcode != CMDQ_OP_CMD_SYNC)
+arm_smmu_cmdq_insert_cmd(smmu, cmd);


Hi Robin,

If we did stop rebuilding the non-MSI command as I suggested, then we
would not have the case of building the command and then discarding
it, right?


I suppose so. But that build/discard case can also be avoided by
applying patch 2/2 of this series ;)


OK, I'll check it more detail. TBH, I found 2/2 hard to follow from just 
the diff.


Thanks,
John



Robin.



Thanks,
John


+sync_idx = smmu->cmdq.q.prod;
+sync_gen = READ_ONCE(smmu->cmdq_generation);
 spin_unlock_irqrestore(&smmu->cmdq.lock, flags);

-return ret;
+return arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen);
 }

 static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)












Re: [PATCH v5 1/2] iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock

2018-10-19 Thread John Garry

On 18/10/2018 12:48, John Garry wrote:

On 18/10/2018 12:19, Robin Murphy wrote:

On 18/10/18 11:55, John Garry wrote:
[...]

@@ -976,18 +1019,19 @@ static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
 {
 u64 cmd[CMDQ_ENT_DWORDS];
 unsigned long flags;
-bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
 struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
-int ret;
+int sync_idx, sync_gen;

 arm_smmu_cmdq_build_cmd(cmd, &ent);

 spin_lock_irqsave(&smmu->cmdq.lock, flags);
-arm_smmu_cmdq_insert_cmd(smmu, cmd);
-ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
+if (smmu->prev_cmd_opcode != CMDQ_OP_CMD_SYNC)
+arm_smmu_cmdq_insert_cmd(smmu, cmd);


Hi Robin,

If we did stop rebuilding the non-MSI command as I suggested, then we
would not have the case of building the command and then discarding
it, right?


I suppose so. But that build/discard case can also be avoided by
applying patch 2/2 of this series ;)




So we can avoid the build/discard but the command is built under the 
lock. Hopefully worth it.


However there is still scope (and - considering the locking - maybe more 
need) for a static non-MSI CMD_SYNC command :)


Anyway, we'll look to get some numbers.

Cheers,


OK, I'll check it more detail. TBH, I found 2/2 hard to follow from just
the diff.





Re: [PATCH v5 2/2] iommu/arm-smmu-v3: Reunify arm_smmu_cmdq_issue_sync()

2018-10-22 Thread John Garry

On 17/10/2018 14:56, Robin Murphy wrote:

Now that both sync methods are more or less the same shape, we can save
some code and levels of indirection by rolling them up together again,
with just a couple of simple conditionals to discriminate the MSI and
queue-polling specifics.

Signed-off-by: Robin Murphy 


Tested-by: John Garry 

I seem to be getting some boost in the scenarios I tested:
Storage controller: 746K IOPS (with) vs 730K (without)
NVMe disk:  471K IOPS (with) vs 420K IOPS (without)

Note that this is with strict mode set and without the CMD_SYNC 
optimisation I punted. And this is on D05, so no MSI-based CMD_SYNC support.


Thanks,
John

Ps. for anyone testing, use the v1 smmu "Fix Endian" patch to get this 
series to apply.



---
 drivers/iommu/arm-smmu-v3.c | 49 +
 1 file changed, 12 insertions(+), 37 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index da8a91d116bf..36db63e3afcf 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -933,7 +933,7 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
  * The difference between val and sync_idx is bounded by the maximum size of
  * a queue at 2^20 entries, so 32 bits is plenty for wrap-safe arithmetic.
  */
-static int __arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
+static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
 {
ktime_t timeout;
u32 val;
@@ -988,16 +988,17 @@ static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
return -ETIMEDOUT;
 }

-static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)
+static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
 {
u64 cmd[CMDQ_ENT_DWORDS];
unsigned long flags;
-   struct arm_smmu_cmdq_ent ent = {
-   .opcode = CMDQ_OP_CMD_SYNC,
-   .sync   = {
-   .msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count)),
-   },
-   };
+   bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
+  (smmu->features & ARM_SMMU_FEAT_COHERENCY);
+   struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
+   int ret, sync_idx, sync_gen;
+
+   if (msi)
+   ent.sync.msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count));

spin_lock_irqsave(&smmu->cmdq.lock, flags);

@@ -1009,39 +1010,13 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)
arm_smmu_cmdq_build_cmd(cmd, &ent);
arm_smmu_cmdq_insert_cmd(smmu, cmd);
}
-
-   spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
-
-   return __arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
-}
-
-static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
-{
-   u64 cmd[CMDQ_ENT_DWORDS];
-   unsigned long flags;
-   struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
-   int sync_idx, sync_gen;
-
-   arm_smmu_cmdq_build_cmd(cmd, &ent);
-
-   spin_lock_irqsave(&smmu->cmdq.lock, flags);
-   if (smmu->prev_cmd_opcode != CMDQ_OP_CMD_SYNC)
-   arm_smmu_cmdq_insert_cmd(smmu, cmd);
sync_idx = smmu->cmdq.q.prod;
sync_gen = READ_ONCE(smmu->cmdq_generation);
+
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);

-   return arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen);
-}
-
-static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
-{
-   int ret;
-   bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
-  (smmu->features & ARM_SMMU_FEAT_COHERENCY);
-
-   ret = msi ? __arm_smmu_cmdq_issue_sync_msi(smmu)
- : __arm_smmu_cmdq_issue_sync(smmu);
+   ret = msi ? arm_smmu_sync_poll_msi(smmu, ent.sync.msidata)
+ : arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen);
if (ret)
dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout\n");
 }






Re: [PATCH v2 1/1] iommu/arm-smmu-v3: eliminate a potential memory corruption on Hi16xx soc

2018-10-30 Thread John Garry

On 30/10/2018 01:52, Leizhen (ThunderTown) wrote:



On 2018/10/30 1:59, Will Deacon wrote:

On Sat, Oct 20, 2018 at 03:36:54PM +0800, Zhen Lei wrote:

The standard GITS_TRANSLATER register in ITS is only 4 bytes, but
Hisilicon expands the next 4 bytes to carry some IMPDEF information. That
means a total of 8 bytes of data will be written to MSIAddress each time.

MSIAddr: |4bytes|4bytes|
 |MSIData   |IMPDEF|

There is no problem for ITS, because the next 4 bytes space is reserved
in ITS. But it will overwrite the 4 bytes memory following "sync_count".
It's very fortunate that the previous and the next neighbour of the
"sync_count" are both aligned by 8 bytes, so no problem is met now.

It's good to explicitly add a workaround:
1. Add gcc __attribute__((aligned(8))) to make sure that "sync_count" is
   always aligned by 8 bytes.
2. Add a "int" struct member to make sure the 4 bytes padding is always
   exist.

There is no functional change.

Signed-off-by: Zhen Lei 
---
 drivers/iommu/arm-smmu-v3.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 5059d09..624fdd0 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -586,7 +586,20 @@ struct arm_smmu_device {

struct arm_smmu_strtab_cfg  strtab_cfg;

-   u32 sync_count;
+   /*
+* The alignment and padding is required by Hi16xx of Hisilicon.


Nit: I know that there is no functional change related to this bug which 
depends on the chip version, but how about "hi1620 and earlier"? hi16xx 
is very broad.


Thanks


+* Because the ITS hardware on Hi16xx will truncate the MSIAddress(Here
+* it's the address of "sync_count") to 8 bytes boundary first, then
+* write 32 bits MSIdata at offset 0, and 32 bits IMPDEF data at offset
+* 4. Without this workaround, the adjacent member may be overwritten.
+*
+*|---4bytes---|---4bytes---|
+* MSIAddress & (~0x7):   MSIdata  | IMPDEF data|
+*/
+   struct {
+   u32 sync_count;
+   int padding;
+   } __attribute__((aligned(8)));


I thought the conclusion after reviewing your original patch was to maintain
the union and drop the alignment directive? e.g.

union {
u32 sync_count;
u64 padding; /* Hi16xx writes an extra 32 bits of goodness */
};

OK, I will send v3.



Will
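
As a standalone illustration of why the union form works (a sketch, not the
driver code): the u64 member gives the pair 8-byte size and, on arm64,
8-byte alignment, so the extra 32-bit IMPDEF write always lands in padding
that the driver owns rather than in the next member.

#include <stddef.h>

struct demo {
	union {
		unsigned int sync_count;	/* the u32 the driver polls */
		unsigned long long padding;	/* u64: reserves 8 bytes and
						 * forces their alignment */
	};
	int next_member;	/* safely out of reach of the 8-byte write */
};

_Static_assert(offsetof(struct demo, next_member) >= 8,
	       "the 4 bytes after sync_count stay reserved");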









[PATCH] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-08 Thread John Garry
Change function __iommu_dma_alloc_pages() to allocate memory/pages
for DMA from respective device NUMA node.

Originally-from: Ganapatrao Kulkarni 
Signed-off-by: John Garry 
---

This patch was originally posted by Ganapatrao in [1] *.

However, after initial review, it was never reposted (due to lack of
cycles, I think). In addition, the functionality in its sibling patches
was merged through other patches, as mentioned in [2]; this also refers to a
discussion on device local allocations vs CPU local allocations for DMA
pool, and which is better [3].

However, as mentioned in [3], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

[1] https://lore.kernel.org/patchwork/patch/833004/
[2] https://lkml.org/lkml/2018/8/22/391
[3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html

* Authorship on this updated patch may need to be fixed - I did not want to
  add Ganapatrao's SOB without permission.

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..ada00bc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count)
kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(unsigned int count,
-   unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+   unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
struct page **pages;
-   unsigned int i = 0, array_size = count * sizeof(*pages);
+   unsigned int i = 0, nid = dev_to_node(dev);
 
order_mask &= (2U << MAX_ORDER) - 1;
if (!order_mask)
return NULL;
 
-   if (array_size <= PAGE_SIZE)
-   pages = kzalloc(array_size, GFP_KERNEL);
-   else
-   pages = vzalloc(array_size);
+   pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid);
if (!pages)
return NULL;
 
@@ -483,8 +480,10 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count,
unsigned int order = __fls(order_mask);
 
order_size = 1U << order;
-   page = alloc_pages((order_mask - order_size) ?
-  gfp | __GFP_NORETRY : gfp, order);
+   page = alloc_pages_node(nid,
+   (order_mask - order_size) ?
+   gfp | __GFP_NORETRY : gfp,
+   order);
if (!page)
continue;
if (!order)
@@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
alloc_sizes = min_size;
 
count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-   pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+   pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+   gfp);
if (!pages)
return NULL;
 
-- 
1.9.1
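
For anyone skimming the diff, the policy change boils down to the following
shape (an illustrative sketch of the same idea, not the exact driver code):
both the page-pointer array and the pages themselves come from the device's
NUMA node via dev_to_node().

#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>

static struct page **alloc_pages_on_dev_node(struct device *dev,
					     unsigned int count)
{
	int nid = dev_to_node(dev);  /* NUMA node the device is attached to */
	struct page **pages;
	unsigned int i;

	/* The bookkeeping array is also allocated node-local */
	pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid);
	if (!pages)
		return NULL;

	for (i = 0; i < count; i++) {
		/* Order-0 pages from the device's node (the order_mask
		 * handling in the real patch is elided here) */
		pages[i] = alloc_pages_node(nid, GFP_KERNEL, 0);
		if (!pages[i])
			goto out_free;
	}
	return pages;

out_free:
	while (i--)
		__free_pages(pages[i], 0);
	kvfree(pages);
	return NULL;
}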



Re: [PATCH v4 2/9] dmapool: remove checks for dev == NULL

2018-11-12 Thread John Garry

On 12/11/2018 15:42, Tony Battersby wrote:

dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without a device.  But nobody
ended up using dma pools without a device, so the current checks in
dmapool.c for pool->dev == NULL are both insufficient and causing bloat.
Remove them.



As an aside, is it right that dma_pool_create() does not actually reject 
dev==NULL and would crash from a NULL-pointer dereference?


Thanks,
John


Signed-off-by: Tony Battersby 
---
--- linux/mm/dmapool.c.orig 2018-08-03 16:12:23.0 -0400
+++ linux/mm/dmapool.c  2018-08-03 16:13:44.0 -0400
@@ -277,7 +277,7 @@ void dma_pool_destroy(struct dma_pool *p
mutex_lock(&pools_reg_lock);
mutex_lock(&pools_lock);
list_del(&pool->pools);
-   if (pool->dev && list_empty(&pool->dev->dma_pools))
+   if (list_empty(&pool->dev->dma_pools))
empty = true;
mutex_unlock(&pools_lock);
if (empty)
@@ -289,13 +289,9 @@ void dma_pool_destroy(struct dma_pool *p
page = list_entry(pool->page_list.next,
  struct dma_page, page_list);
if (is_page_busy(page)) {
-   if (pool->dev)
-   dev_err(pool->dev,
-   "dma_pool_destroy %s, %p busy\n",
-   pool->name, page->vaddr);
-   else
-   pr_err("dma_pool_destroy %s, %p busy\n",
-  pool->name, page->vaddr);
+   dev_err(pool->dev,
+   "dma_pool_destroy %s, %p busy\n",
+   pool->name, page->vaddr);
/* leak the still-in-use consistent memory */
list_del(&page->page_list);
kfree(page);
@@ -357,13 +353,9 @@ void *dma_pool_alloc(struct dma_pool *po
for (i = sizeof(page->offset); i < pool->size; i++) {
if (data[i] == POOL_POISON_FREED)
continue;
-   if (pool->dev)
-   dev_err(pool->dev,
-   "dma_pool_alloc %s, %p (corrupted)\n",
-   pool->name, retval);
-   else
-   pr_err("dma_pool_alloc %s, %p (corrupted)\n",
-   pool->name, retval);
+   dev_err(pool->dev,
+   "dma_pool_alloc %s, %p (corrupted)\n",
+   pool->name, retval);

/*
 * Dump the first 4 bytes even if they are not
@@ -418,13 +410,9 @@ void dma_pool_free(struct dma_pool *pool
page = pool_find_page(pool, dma);
if (!page) {
spin_unlock_irqrestore(&pool->lock, flags);
-   if (pool->dev)
-   dev_err(pool->dev,
-   "dma_pool_free %s, %p/%lx (bad dma)\n",
-   pool->name, vaddr, (unsigned long)dma);
-   else
-   pr_err("dma_pool_free %s, %p/%lx (bad dma)\n",
-  pool->name, vaddr, (unsigned long)dma);
+   dev_err(pool->dev,
+   "dma_pool_free %s, %p/%lx (bad dma)\n",
+   pool->name, vaddr, (unsigned long)dma);
return;
}

@@ -432,13 +420,9 @@ void dma_pool_free(struct dma_pool *pool
 #ifdef DMAPOOL_DEBUG
if ((dma - page->dma) != offset) {
spin_unlock_irqrestore(&pool->lock, flags);
-   if (pool->dev)
-   dev_err(pool->dev,
-   "dma_pool_free %s, %p (bad vaddr)/%pad\n",
-   pool->name, vaddr, &dma);
-   else
-   pr_err("dma_pool_free %s, %p (bad vaddr)/%pad\n",
-  pool->name, vaddr, &dma);
+   dev_err(pool->dev,
+   "dma_pool_free %s, %p (bad vaddr)/%pad\n",
+   pool->name, vaddr, &dma);
return;
}
{
@@ -449,12 +433,9 @@ void dma_pool_free(struct dma_pool *pool
continue;
}
spin_unlock_irqrestore(&pool->lock, flags);
-   if (pool->dev)
-   dev_err(pool->dev, "dma_pool_free %s, dma %pad already free\n",
-   pool->name, &dma);
-   else
-   pr_err("dma_pool_free %s, dma %pad already 
free\n",
-  pool->name, &dma);
+   dev_err(pool->dev,
+   "dm

Re: [PATCH] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-20 Thread John Garry

On 08/11/2018 17:55, John Garry wrote:

Change function __iommu_dma_alloc_pages() to allocate memory/pages
for DMA from respective device NUMA node.



Ping a friendly reminder on this patch.

Thanks


Originally-from: Ganapatrao Kulkarni 
Signed-off-by: John Garry 
---

This patch was originally posted by Ganapatrao in [1] *.

However, after initial review, it was never reposted (due to lack of
cycles, I think). In addition, the functionality in its sibling patches
were merged through patches, as mentioned in [2]; this also refers to a
discussion on device local allocations vs CPU local allocations for DMA
pool, and which is better [3].

However, as mentioned in [3], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

[1] https://lore.kernel.org/patchwork/patch/833004/
[2] https://lkml.org/lkml/2018/8/22/391
[3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html

* Authorship on this updated patch may need to be fixed - I did not want to
  add Ganapatrao's SOB without permission.

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..ada00bc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count)
kvfree(pages);
 }

-static struct page **__iommu_dma_alloc_pages(unsigned int count,
-   unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+   unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
struct page **pages;
-   unsigned int i = 0, array_size = count * sizeof(*pages);
+   unsigned int i = 0, nid = dev_to_node(dev);

order_mask &= (2U << MAX_ORDER) - 1;
if (!order_mask)
return NULL;

-   if (array_size <= PAGE_SIZE)
-   pages = kzalloc(array_size, GFP_KERNEL);
-   else
-   pages = vzalloc(array_size);
+   pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid);
if (!pages)
return NULL;

@@ -483,8 +480,10 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count,
unsigned int order = __fls(order_mask);

order_size = 1U << order;
-   page = alloc_pages((order_mask - order_size) ?
-  gfp | __GFP_NORETRY : gfp, order);
+   page = alloc_pages_node(nid,
+   (order_mask - order_size) ?
+   gfp | __GFP_NORETRY : gfp,
+   order);
if (!page)
continue;
if (!order)
@@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
alloc_sizes = min_size;

count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-   pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+   pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+   gfp);
if (!pages)
return NULL;







Re: [PATCH] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-20 Thread John Garry

On 20/11/2018 10:09, Ganapatrao Kulkarni wrote:

Hi John,

On Tue, Nov 20, 2018 at 3:35 PM John Garry  wrote:


On 08/11/2018 17:55, John Garry wrote:

Change function __iommu_dma_alloc_pages() to allocate memory/pages
for DMA from respective device NUMA node.



Ping a friendly reminder on this patch.

Thanks


Originally-from: Ganapatrao Kulkarni 
Signed-off-by: John Garry 
---

This patch was originally posted by Ganapatrao in [1] *.

However, after initial review, it was never reposted (due to lack of
cycles, I think). In addition, the functionality in its sibling patches
were merged through patches, as mentioned in [2]; this also refers to a
discussion on device local allocations vs CPU local allocations for DMA
pool, and which is better [3].

However, as mentioned in [3], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

[1] https://lore.kernel.org/patchwork/patch/833004/
[2] https://lkml.org/lkml/2018/8/22/391
[3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html

* Authorship on this updated patch may need to be fixed - I did not want to
  add Ganapatrao's SOB without permission.


thanks for taking this up. please feel free to add my SoB.


OK, I will also make you the author and repost.

Thanks,
John



diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..ada00bc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count)
  kvfree(pages);
 }

-static struct page **__iommu_dma_alloc_pages(unsigned int count,
- unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+ unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
  struct page **pages;
- unsigned int i = 0, array_size = count * sizeof(*pages);
+ unsigned int i = 0, nid = dev_to_node(dev);

  order_mask &= (2U << MAX_ORDER) - 1;
  if (!order_mask)
  return NULL;

- if (array_size <= PAGE_SIZE)
- pages = kzalloc(array_size, GFP_KERNEL);
- else
- pages = vzalloc(array_size);
+ pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid);
  if (!pages)
  return NULL;

@@ -483,8 +480,10 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count,
  unsigned int order = __fls(order_mask);

  order_size = 1U << order;
- page = alloc_pages((order_mask - order_size) ?
-gfp | __GFP_NORETRY : gfp, order);
+ page = alloc_pages_node(nid,
+ (order_mask - order_size) ?
+ gfp | __GFP_NORETRY : gfp,
+ order);
  if (!page)
  continue;
  if (!order)
@@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t 
size, gfp_t gfp,
  alloc_sizes = min_size;

  count = PAGE_ALIGN(size) >> PAGE_SHIFT;
- pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+ pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+ gfp);
  if (!pages)
  return NULL;







thanks
Ganapat







[PATCH v2] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-20 Thread John Garry
From: Ganapatrao Kulkarni 

Change function __iommu_dma_alloc_pages() to allocate memory/pages
for DMA from the respective device's NUMA node.

Signed-off-by: Ganapatrao Kulkarni 
[JPG: Modified to use kvzalloc() and fixed indentation]
Signed-off-by: John Garry 
---
Difference v1->v2:
- Add Ganapatrao's tag and change author

This patch was originally posted by Ganapatrao in [1].

However, after initial review, it was never reposted (due to lack of
cycles, I think). In addition, the functionality in its sibling patches
was merged through other patches, as mentioned in [2]; this also refers to a
discussion on device local allocations vs CPU local allocations for DMA
pool, and which is better [3].

However, as mentioned in [3], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

[1] https://lore.kernel.org/patchwork/patch/833004/
[2] https://lkml.org/lkml/2018/8/22/391
[3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html


diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..ada00bc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, 
int count)
kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(unsigned int count,
-   unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+   unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
struct page **pages;
-   unsigned int i = 0, array_size = count * sizeof(*pages);
+   unsigned int i = 0, nid = dev_to_node(dev);
 
order_mask &= (2U << MAX_ORDER) - 1;
if (!order_mask)
return NULL;
 
-   if (array_size <= PAGE_SIZE)
-   pages = kzalloc(array_size, GFP_KERNEL);
-   else
-   pages = vzalloc(array_size);
+   pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid);
if (!pages)
return NULL;
 
@@ -483,8 +480,10 @@ static struct page **__iommu_dma_alloc_pages(unsigned int 
count,
unsigned int order = __fls(order_mask);
 
order_size = 1U << order;
-   page = alloc_pages((order_mask - order_size) ?
-  gfp | __GFP_NORETRY : gfp, order);
+   page = alloc_pages_node(nid,
+   (order_mask - order_size) ?
+   gfp | __GFP_NORETRY : gfp,
+   order);
if (!page)
continue;
if (!order)
@@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t 
size, gfp_t gfp,
alloc_sizes = min_size;
 
count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-   pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+   pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+   gfp);
if (!pages)
return NULL;
 
-- 
1.9.1



Re: [PATCH v2] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-20 Thread John Garry

On 20/11/2018 14:20, Robin Murphy wrote:

On 20/11/2018 13:42, John Garry wrote:

From: Ganapatrao Kulkarni 

Change function __iommu_dma_alloc_pages() to allocate memory/pages
for DMA from the respective device's NUMA node.

Signed-off-by: Ganapatrao Kulkarni 
[JPG: Modified to use kvzalloc() and fixed indentation]
Signed-off-by: John Garry 
---
Difference v1->v2:
- Add Ganapatrao's tag and change author

This patch was originally posted by Ganapatrao in [1].

However, after initial review, it was never reposted (due to lack of
cycles, I think). In addition, the functionality in its sibling patches
was merged through other patches, as mentioned in [2]; this also refers to a
discussion on device local allocations vs CPU local allocations for DMA
pool, and which is better [3].

However, as mentioned in [3], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

[1] https://lore.kernel.org/patchwork/patch/833004/
[2] https://lkml.org/lkml/2018/8/22/391
[3]
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html


diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..ada00bc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page
**pages, int count)
  kvfree(pages);
  }
  -static struct page **__iommu_dma_alloc_pages(unsigned int count,
-unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+unsigned int count, unsigned long order_mask, gfp_t gfp)
  {
  struct page **pages;
-unsigned int i = 0, array_size = count * sizeof(*pages);
+unsigned int i = 0, nid = dev_to_node(dev);
order_mask &= (2U << MAX_ORDER) - 1;
  if (!order_mask)
  return NULL;
  -if (array_size <= PAGE_SIZE)
-pages = kzalloc(array_size, GFP_KERNEL);
-else
-pages = vzalloc(array_size);
+pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid);


The pages array is only accessed by the CPU servicing the
iommu_dma_alloc() call, and is usually freed again before that call even
returns. It's certainly never touched by the device, so forcing it to a
potentially non-local node doesn't make a great deal of sense.


Right, it seems sensible not to make this allocation include the 
device-locality requirement, so we can leave it as is. However, modifying 
it to use kvzalloc() would seem ok.
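
(To spell out the distinction being agreed here, as a sketch rather than
the final patch: the metadata array is only ever touched by the CPU, so a
plain allocation suffices, while the actual DMA pages get steered to the
device's node.)

	pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);	/* CPU-side bookkeeping */
	...
	page = alloc_pages_node(dev_to_node(dev), gfp, order);	/* device-visible buffer */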





  if (!pages)
  return NULL;
  @@ -483,8 +480,10 @@ static struct page
**__iommu_dma_alloc_pages(unsigned int count,
  unsigned int order = __fls(order_mask);
order_size = 1U << order;
-page = alloc_pages((order_mask - order_size) ?
-   gfp | __GFP_NORETRY : gfp, order);
+page = alloc_pages_node(nid,
+(order_mask - order_size) ?
+gfp | __GFP_NORETRY : gfp,
+order);


If we're touching this, can we sort out that horrendous ternary? FWIW I
found I have a local version of the original patch which I tweaked at
the time, and apparently I reworked this hunk as below, which does seem
somewhat nicer for the same diffstat.

Robin.


@@ -446,10 +443,12 @@ static struct page
**__iommu_dma_alloc_pages(unsigned int count,
for (order_mask &= (2U << __fls(count)) - 1;
 order_mask; order_mask &= ~order_size) {
unsigned int order = __fls(order_mask);
+   gfp_t alloc_flags = gfp;

order_size = 1U << order;
-   page = alloc_pages((order_mask - order_size) ?
-  gfp | __GFP_NORETRY : gfp,
order);
+   if (order_size < order_mask)
+   alloc_flags |= __GFP_NORETRY;


Sure, this can be included


+   page = alloc_pages_node(nid, alloc_flags, order);
if (!page)
continue;
if (!order)





Cheers,
John



[PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-21 Thread John Garry
From: Ganapatrao Kulkarni 

Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
the respective device's NUMA node. The ternary operator which would
otherwise be needed for alloc_pages_node() is tidied up along with this.

We also include a change to use kvzalloc() for the kzalloc()/vzalloc()
combination.

Signed-off-by: Ganapatrao Kulkarni 
[JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary 
operator]
Signed-off-by: John Garry 

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..4afb1a8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, 
int count)
kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(unsigned int count,
-   unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+   unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
struct page **pages;
-   unsigned int i = 0, array_size = count * sizeof(*pages);
+   unsigned int i = 0, nid = dev_to_node(dev);
 
order_mask &= (2U << MAX_ORDER) - 1;
if (!order_mask)
return NULL;
 
-   if (array_size <= PAGE_SIZE)
-   pages = kzalloc(array_size, GFP_KERNEL);
-   else
-   pages = vzalloc(array_size);
+   pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
if (!pages)
return NULL;
 
@@ -481,10 +478,12 @@ static struct page **__iommu_dma_alloc_pages(unsigned int 
count,
for (order_mask &= (2U << __fls(count)) - 1;
 order_mask; order_mask &= ~order_size) {
unsigned int order = __fls(order_mask);
+   gfp_t alloc_flags = gfp;
 
order_size = 1U << order;
-   page = alloc_pages((order_mask - order_size) ?
-  gfp | __GFP_NORETRY : gfp, order);
+   if (order_mask > order_size)
+   alloc_flags |= __GFP_NORETRY;
+   page = alloc_pages_node(nid, alloc_flags, order);
if (!page)
continue;
if (!order)
@@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t 
size, gfp_t gfp,
alloc_sizes = min_size;
 
count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-   pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+   pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+   gfp);
if (!pages)
return NULL;
 
-- 
1.9.1



Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-21 Thread John Garry

On 21/11/2018 16:07, Will Deacon wrote:

On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:

From: Ganapatrao Kulkarni 

Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
the respective device's NUMA node. The ternary operator which would
otherwise be needed for alloc_pages_node() is tidied up along with this.

We also include a change to use kvzalloc() for the kzalloc()/vzalloc()
combination.

Signed-off-by: Ganapatrao Kulkarni 
[JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary 
operator]
Signed-off-by: John Garry 


Weird, you're missing a diffstat here.

Anyway, the patch looks fine to me, but it would be nice if you could
justify the change with some numbers. Do you actually see an improvement
from this change?



Hi Will,

Ah, I missed adding my comments explaining the motivation. It would be 
better in the commit log. Anyway, here's the snippet:


" ... as mentioned in [3], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

[3] 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html";


I did have some numbers to show improvement in some scenarios when I 
tested this a while back which I'll dig out.


However, I would say that some scenarios will improve and others will 
regress with this change, considering the different conditions under which 
DMA memory may be used.


Cheers,
John


Will







Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-23 Thread John Garry

On 21/11/2018 16:57, Will Deacon wrote:

On Wed, Nov 21, 2018 at 04:47:48PM +, John Garry wrote:

On 21/11/2018 16:07, Will Deacon wrote:

On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:

From: Ganapatrao Kulkarni 

Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
the respective device's NUMA node. The ternary operator which would
otherwise be needed for alloc_pages_node() is tidied up along with this.

We also include a change to use kvzalloc() for the kzalloc()/vzalloc()
combination.

Signed-off-by: Ganapatrao Kulkarni 
[JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary 
operator]
Signed-off-by: John Garry 


Weird, you're missing a diffstat here.

Anyway, the patch looks fine to me, but it would be nice if you could
justify the change with some numbers. Do you actually see an improvement

from this change?




Hi Will,

Ah, I missed adding my comments explaining the motivation. It would be
better in the commit log. Anyway, here's the snippet:

" ... as mentioned in [3], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

[3]
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html";


Yes, please add this to the commit log.



Sure,


I did have some numbers to show improvement in some scenarios when I tested
this a while back which I'll dig out.

However, I would say that some scenarios will improve and others will
regress with this change, considering the different conditions under which
DMA memory may be used.


Well, if you can show that it's useful in some cases and not catastrophic in
others, then I think shooting for parity with direct DMA is a reasonable
justification for the change.



So I have done some more testing with our SoC crypto engine, using the 
tcrypt module. The reason I used this device is that we can utilise a 
local device per socket in the system, unlike other DMAing devices, which 
generally only exist on a single socket (for us, anyway).


Overall the results aren't brilliant - as expected - but they show a 
general marginal improvement. Here are some figures:


Summary:
Average diff	+0.9%
Max diff	+1.5%
Min diff	+0.2%

Test Ops/second before  after   diff %  

async ecb(aes) encrypt
test 0 (128 bit key, 16 byte blocks):   68717   69057   0.5
test 1 (128 bit key, 64 byte blocks):   72633   73163   0.7
test 2 (128 bit key, 256 byte blocks):  71475   72076   0.8
test 3 (128 bit key, 1024 byte blocks): 66819   67467   1.0
test 4 (128 bit key, 8192 byte blocks): 38237   38495   0.7
test 5 (192 bit key, 16 byte blocks):   70273   71079   1.2
test 6 (192 bit key, 64 byte blocks):   72455   73292   1.2
test 7 (192 bit key, 256 byte blocks):  71085   71876   1.1
test 8 (192 bit key, 1024 byte blocks): 65891   66576   1.0
test 9 (192 bit key, 8192 byte blocks): 34846   35061   0.6
test 10 (256 bit key, 16 byte blocks):  72927   73762   1.2
test 11 (256 bit key, 64 byte blocks):  72361   73207   1.2
test 12 (256 bit key, 256 byte blocks): 70907   71602   1.0
test 13 (256 bit key, 1024 byte blocks):65035   65653   1.0
test 14 (256 bit key, 8192 byte blocks):32835   32998   0.5
async ecb(aes) decrypt
test 0 (128 bit key, 16 byte blocks):   68384   69130   1.1
test 1 (128 bit key, 64 byte blocks):   72645   73133   0.7
test 2 (128 bit key, 256 byte blocks):  71523   71912   0.5
test 3 (128 bit key, 1024 byte blocks): 66902   67258   0.5
test 4 (128 bit key, 8192 byte blocks): 38260   38434   0.5
test 5 (192 bit key, 16 byte blocks):   70301   70816   0.7
test 6 (192 bit key, 64 byte blocks):   72473   73064   0.8
test 7 (192 bit key, 256 byte blocks):  71106   71663   0.8
test 8 (192 bit key, 1024 byte blocks): 65915   66363   0.7
test 9 (192 bit key, 8192 byte blocks): 34876   35006   0.4
test 10 (256 bit key, 16 byte blocks):  72969   73519   0.8
test 11 (256 bit key, 64 byte blocks):  72404   72925   0.7
test 12 (256 bit key, 256 byte blocks): 70861   71350   0.7
test 13 (256 bit key, 1024 byte blocks):65074   65485   0.6
test 14 (256 bit key, 8192 byte blocks):32861   32974   0.3
async cbc(aes) encrypt
test 0 (128 bit key, 16 byte blocks):   58306   59131   1.4
test 1 (128 bit key, 64 byte blocks):   61647   62565   1.5
test 2 (128 bit key, 256 byte blocks):  60841   61666   1.4
test 3 (128 bit key, 1024 byte blocks): 57503   58204   1.2
test 4 (128 bit key, 8192 byte blocks): 34760   35055   0.9
test 5 (192 bit key, 16 byte blocks):   59684   60452   1.3
test 6 (192 bit key, 64 byte blocks):   61705   62516   1.3
test 7 (192 bit key, 256 byte blocks):  60678   61426   1.2
test 8 (192 bit key, 1024 byte blocks): 56805   57487   1.2
test 9 (192 bit key, 8192 byte blocks): 31836   32093   0.8
test 10 (256 bit key, 16 byte blocks):  61961   62785   1.3
test 11 (256 bit key, 64 byte blocks):  61584   62427   1.4
test 12 (256 bit key, 256 byte blocks): 604

Re: [PATCH] dma-debug: hns_enet_drv could use more DMA entries

2018-11-30 Thread John Garry

+

Pasting original message at bottom.


On 30/11/2018 08:42, Christoph Hellwig wrote:

On Thu, Nov 29, 2018 at 10:54:56PM -0500, Qian Cai wrote:

 /* allow architectures to override this if absolutely required */
 #ifndef PREALLOC_DMA_DEBUG_ENTRIES
+/* amount of DMA mappings on this driver is huge. */
+#ifdef HNS_ENET
+#define PREALLOC_DMA_DEBUG_ENTRIES (1 << 17)
+#else
 #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)
 #endif
+#endif


How would this be defined in a header that leaks into this file?

I think we need to turn PREALLOC_DMA_DEBUG_ENTRIES into a user-selectable
config option, as I really don't want to collect hacks like this.
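
(For illustration, such a user-selectable option might look like the
sketch below; the symbol name and wording are hypothetical, not something
that has been merged.)

	config DMA_API_DEBUG_PREALLOC_COUNT
		int "Preallocated DMA-API debug entries"
		depends on DMA_API_DEBUG
		default 65536
		help
		  Number of dma_debug_entry structures preallocated at boot.
		  Increase this on systems whose drivers create very large
		  numbers of simultaneous DMA mappings.

kernel/dma/debug.c would then use CONFIG_DMA_API_DEBUG_PREALLOC_COUNT in
place of the hard-coded (1 << 16) fallback.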










The amount of DMA mappings from Hisilicon HNS ethernet devices is huge,
so it could trigger "DMA-API: debugging out of memory - disabling".

hnae_get_handle [1]
  hnae_init_queue
hnae_init_ring
  hnae_alloc_buffers [2]
debug_dma_map_page
  dma_entry_alloc

[1] for (i = 0; i < handle->q_num; i++)
[2] for (i = 0; i < ring->desc_num; i++)

On this Huawei TaiShan 2280 aarch64 server, it has reached the limit
already,

4 (ports) x 16 (handles) x 1024 (rings) = 65536

Signed-off-by: Qian Cai 
---
 kernel/dma/debug.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 231ca4628062..ae91689cc9ad 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -43,8 +43,13 @@

 /* allow architectures to override this if absolutely required */
 #ifndef PREALLOC_DMA_DEBUG_ENTRIES
+/* amount of DMA mappings on this driver is huge. */
+#ifdef HNS_ENET
+#define PREALLOC_DMA_DEBUG_ENTRIES (1 << 17)
+#else
 #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)
 #endif
+#endif

 enum {
dma_debug_single,




[PATCH v4] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-30 Thread John Garry
From: Ganapatrao Kulkarni 

Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
the respective device's NUMA node. The ternary operator which would
otherwise be needed for alloc_pages_node() is tidied up along with this.

The motivation for this change is to have a policy for page allocation
consistent with direct DMA mapping, which attempts to allocate pages local
to the device, as mentioned in [1].

In addition, a marginal performance improvement has been observed for
certain workloads: the patch yielded a 0.9% average throughput improvement
when running tcrypt with the HiSilicon crypto engine.

We also include a modification to use kvzalloc() for the
kzalloc()/vzalloc() combination.

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html

Signed-off-by: Ganapatrao Kulkarni 
[JPG: Added kvzalloc(), drop pages ** being device local, remove ternary 
operator, update message]
Signed-off-by: John Garry 
---
Difference:
v1->v2:
- Add Ganapatrao's tag and change author

v2->v3:
- removed ternary operator
- stopped making pages ** allocation local to device

v3->v4:
- Update commit message to include motivation for patch, including
  headline performance improvement for test.

Some notes:
This patch was originally posted by Ganapatrao in [2].

However, after initial review, it was never reposted (due to lack of
cycles, I think). In addition, the functionality in its sibling patches
was merged through other patches, as mentioned in [3]; this also refers to a
discussion on device local allocations vs CPU local allocations for DMA
pool, and which is better [1].

However, as mentioned in [1], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

And from some testing, mentioned in commit message, shows marginal
improvement.

[2] https://lore.kernel.org/patchwork/patch/833004/
[3] https://lkml.org/lkml/2018/8/22/391

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..4afb1a8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, 
int count)
kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(unsigned int count,
-   unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+   unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
struct page **pages;
-   unsigned int i = 0, array_size = count * sizeof(*pages);
+   unsigned int i = 0, nid = dev_to_node(dev);
 
order_mask &= (2U << MAX_ORDER) - 1;
if (!order_mask)
return NULL;
 
-   if (array_size <= PAGE_SIZE)
-   pages = kzalloc(array_size, GFP_KERNEL);
-   else
-   pages = vzalloc(array_size);
+   pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
if (!pages)
return NULL;
 
@@ -481,10 +478,12 @@ static struct page **__iommu_dma_alloc_pages(unsigned int 
count,
for (order_mask &= (2U << __fls(count)) - 1;
 order_mask; order_mask &= ~order_size) {
unsigned int order = __fls(order_mask);
+   gfp_t alloc_flags = gfp;
 
order_size = 1U << order;
-   page = alloc_pages((order_mask - order_size) ?
-  gfp | __GFP_NORETRY : gfp, order);
+   if (order_mask > order_size)
+   alloc_flags |= __GFP_NORETRY;
+   page = alloc_pages_node(nid, alloc_flags, order);
if (!page)
continue;
if (!order)
@@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t 
size, gfp_t gfp,
alloc_sizes = min_size;
 
count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-   pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+   pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+   gfp);
if (!pages)
return NULL;
 
-- 
1.9.1



Re: [PATCH] dma-debug: Kconfig for PREALLOC_DMA_DEBUG_ENTRIES

2018-12-03 Thread John Garry

On 01/12/2018 16:36, Christoph Hellwig wrote:

On Fri, Nov 30, 2018 at 07:39:50PM +, Robin Murphy wrote:

I was assuming the point was to also add something like

default 131072 if HNS_ENET

so that DMA debug doesn't require too much thought from the user. If they
still have to notice the overflow message and empirically figure out a
value that does work, rebuilding the kernel each time is far less
convenient than simply adding "dma_debug_entries=..." to their kernel
command line and rebooting, which they can do today. If they do already
know up-front that the default will need overriding and what the
appropriate value is, then the command line still seems seems just as
convenient.


I'm not so fond of random drivers changing the defaults.  My idea
was rather to have the config option so that the defconfig files for
the Hisilicon SOCs with this hardware could select a larger number
without making a total mess of the kernel configuration.

If we really have to we could do different defaults, but I'd still
much rather do this on a arch/platform basis than specific drivers.


As I understand it, some drivers could use even more than this (131072), 
to the point where I can't imagine we would want to set an arch default to 
support them. For this HNS_ENET case, it is arm64-specific, so it would be 
an arch defconfig.


Thanks,
John








Re: [PATCH 3/4] dma-debug: Dynamically expand the dma_debug_entry pool

2018-12-03 Thread John Garry

On 03/12/2018 17:28, Robin Murphy wrote:

Certain drivers such as large multi-queue network adapters can use pools
of mapped DMA buffers larger than the default dma_debug_entry pool of
65536 entries, with the result that merely probing such a device can
cause DMA debug to disable itself during boot unless explicitly given an
appropriate "dma_debug_entries=..." option.

Developers trying to debug some other driver on such a system may not be
immediately aware of this, and at worst it can hide bugs if they fail to
realise that dma-debug has already disabled itself unexpectedly by the
time the code of interest gets to run. Even once they do realise, it can
be a bit of a pain to empirically determine a suitable number of
preallocated entries to configure without massively over-allocating.

There's really no need for such a static limit, though, since we can
quite easily expand the pool at runtime in those rare cases that the
preallocated entries are insufficient, which is arguably the least
surprising and most useful behaviour.


Hi Robin,

Do you have an idea on shrinking the pool again when the culprit driver 
is removed, i.e. we have so many unused debug entries now available?


Thanks,
John



Signed-off-by: Robin Murphy 
---
 kernel/dma/debug.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index de5db800dbfc..46cc075aec99 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -47,6 +47,9 @@
 #ifndef PREALLOC_DMA_DEBUG_ENTRIES
 #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)
 #endif
+/* If the pool runs out, try this many times to allocate this many new entries 
*/
+#define DMA_DEBUG_DYNAMIC_ENTRIES 256
+#define DMA_DEBUG_DYNAMIC_RETRIES 2

 enum {
dma_debug_single,
@@ -702,12 +705,21 @@ static struct dma_debug_entry *dma_entry_alloc(void)
 {
struct dma_debug_entry *entry;
unsigned long flags;
+   int retry_count;

-   spin_lock_irqsave(&free_entries_lock, flags);
+   for (retry_count = 0; ; retry_count++) {
+   spin_lock_irqsave(&free_entries_lock, flags);
+
+   if (num_free_entries > 0)
+   break;

-   if (list_empty(&free_entries)) {
-   global_disable = true;
spin_unlock_irqrestore(&free_entries_lock, flags);
+
+   if (retry_count < DMA_DEBUG_DYNAMIC_RETRIES &&
+   !prealloc_memory(DMA_DEBUG_DYNAMIC_ENTRIES))
+   continue;
+
+   global_disable = true;
pr_err("debugging out of memory - disabling\n");
return NULL;
}






Re: [PATCH 3/4] dma-debug: Dynamically expand the dma_debug_entry pool

2018-12-04 Thread John Garry

On 04/12/2018 13:11, Robin Murphy wrote:

Hi John,

On 03/12/2018 18:23, John Garry wrote:

On 03/12/2018 17:28, Robin Murphy wrote:

Certain drivers such as large multi-queue network adapters can use pools
of mapped DMA buffers larger than the default dma_debug_entry pool of
65536 entries, with the result that merely probing such a device can
cause DMA debug to disable itself during boot unless explicitly given an
appropriate "dma_debug_entries=..." option.

Developers trying to debug some other driver on such a system may not be
immediately aware of this, and at worst it can hide bugs if they fail to
realise that dma-debug has already disabled itself unexpectedly by the
time the code of interest gets to run. Even once they do realise, it can
be a bit of a pain to empirically determine a suitable number of
preallocated entries to configure without massively over-allocating.

There's really no need for such a static limit, though, since we can
quite easily expand the pool at runtime in those rare cases that the
preallocated entries are insufficient, which is arguably the least
surprising and most useful behaviour.


Hi Robin,

Do you have an idea on shrinking the pool again when the culprit
driver is removed, i.e. we have so many unused debug entries now
available?


I honestly don't believe it's worth the complication. This is a
development feature with significant overheads already, so there's not
an awful lot to gain by trying to optimise memory usage. If a system can
ever load a driver that makes hundreds of thousands of simultaneous
mappings, it can almost certainly spare 20-odd megabytes of RAM for the
corresponding debug entries in perpetuity. Sure, it does mean you'd need
to reboot to recover memory from a major leak, but that's mostly true of
the current behaviour too, and rebooting during driver development is
hardly an unacceptable inconvenience.



OK, I just thought that it would not be too difficult to implement this 
in the DMA entry free path.



In fact, having got this far in, what I'd quite like to do is to get rid
of dma_debug_resize_entries() such that we never need to free things at
all, since then we could allocate whole pages as blocks of entries to
save on masses of individual slab allocations.
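
(As an illustration of the block-allocation idea: a sketch assuming the
existing free_entries list, the num_free_entries/nr_total_entries counters
and the dma_debug_entry layout; the function name is hypothetical.)

	static int dma_debug_add_entry_block(gfp_t gfp)
	{
		struct dma_debug_entry *entry;
		int i, n = PAGE_SIZE / sizeof(*entry);

		/* One page yields a whole block of entries. */
		entry = (void *)get_zeroed_page(gfp);
		if (!entry)
			return -ENOMEM;

		for (i = 0; i < n; i++)
			list_add_tail(&entry[i].list, &free_entries);
		num_free_entries += n;
		nr_total_entries += n;
		return 0;
	}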



On a related topic, is it possible for the user to learn the total 
entries created at a given point in time? If not, could we add a file in 
the debugfs folder for this?


Thanks,
John


Robin.



Thanks,
John



Signed-off-by: Robin Murphy 
---
 kernel/dma/debug.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index de5db800dbfc..46cc075aec99 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -47,6 +47,9 @@
 #ifndef PREALLOC_DMA_DEBUG_ENTRIES
 #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)
 #endif
+/* If the pool runs out, try this many times to allocate this many
new entries */
+#define DMA_DEBUG_DYNAMIC_ENTRIES 256
+#define DMA_DEBUG_DYNAMIC_RETRIES 2

 enum {
 dma_debug_single,
@@ -702,12 +705,21 @@ static struct dma_debug_entry
*dma_entry_alloc(void)
 {
 struct dma_debug_entry *entry;
 unsigned long flags;
+int retry_count;

-spin_lock_irqsave(&free_entries_lock, flags);
+for (retry_count = 0; ; retry_count++) {
+spin_lock_irqsave(&free_entries_lock, flags);
+
+if (num_free_entries > 0)
+break;

-if (list_empty(&free_entries)) {
-global_disable = true;
 spin_unlock_irqrestore(&free_entries_lock, flags);
+
+if (retry_count < DMA_DEBUG_DYNAMIC_RETRIES &&
+!prealloc_memory(DMA_DEBUG_DYNAMIC_ENTRIES))
+continue;
+
+global_disable = true;
 pr_err("debugging out of memory - disabling\n");
 return NULL;
 }












Re: [PATCH 3/4] dma-debug: Dynamically expand the dma_debug_entry pool

2018-12-04 Thread John Garry






In fact, having got this far in, what I'd quite like to do is to get rid
of dma_debug_resize_entries() such that we never need to free things at
all, since then we could allocate whole pages as blocks of entries to
save on masses of individual slab allocations.



On a related topic, is it possible for the user to learn the total
entries created at a given point in time? If not, could we add a file
in the debugfs folder for this?




Hi Robin,


I did get as far as pondering that you effectively lose track of
utilisation once the low-water-mark of min_free_entries hits 0 and stays


I did try your patches and I noticed this, i.e. I was hitting the point 
at which we start to alloc more entries.



there - AFAICS it should be sufficient to just expose nr_total_entries
as-is, since users can then calculate current and maximum occupancy
based on *_free_entries. Does that sound reasonable to you?



Sounds ok. I am just interested to know roughly how many DMA buffers 
we're using in our system.
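
(For reference, exposing the counter could be a one-liner in the dma-debug
debugfs setup; a sketch assuming the existing directory dentry
dma_debug_dent and the nr_total_entries counter.)

	debugfs_create_u32("nr_total_entries", 0444, dma_debug_dent,
			   &nr_total_entries);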



That also indirectly reminds me that this lot is documented in
DMA_API.txt, so I should be good and update that too...


Thanks,
John



Cheers,
Robin.







Re: [PATCH v4] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-12-07 Thread John Garry

On 30/11/2018 11:14, John Garry wrote:

From: Ganapatrao Kulkarni 



Hi Joerg,

A friendly reminder. Can you please let me know your position on this patch?

Cheers,
John


Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
the respective device's NUMA node. The ternary operator which would
otherwise be needed for alloc_pages_node() is tidied up along with this.

The motivation for this change is to have a policy for page allocation
consistent with direct DMA mapping, which attempts to allocate pages local
to the device, as mentioned in [1].

In addition, a marginal performance improvement has been observed for
certain workloads: the patch yielded a 0.9% average throughput improvement
when running tcrypt with the HiSilicon crypto engine.

We also include a modification to use kvzalloc() for the
kzalloc()/vzalloc() combination.

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html

Signed-off-by: Ganapatrao Kulkarni 
[JPG: Added kvzalloc(), drop pages ** being device local, remove ternary 
operator, update message]
Signed-off-by: John Garry 
---
Difference:
v1->v2:
- Add Ganapatrao's tag and change author

v2->v3:
- removed ternary operator
- stopped making pages ** allocation local to device

v3->v4:
- Update commit message to include motivation for patch, including
  headline performance improvement for test.

Some notes:
This patch was originally posted by Ganapatrao in [2].

However, after initial review, it was never reposted (due to lack of
cycles, I think). In addition, the functionality in its sibling patches
was merged through other patches, as mentioned in [3]; this also refers to a
discussion on device local allocations vs CPU local allocations for DMA
pool, and which is better [1].

However, as mentioned in [1], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

And from some testing, mentioned in commit message, shows marginal
improvement.

[2] https://lore.kernel.org/patchwork/patch/833004/
[3] https://lkml.org/lkml/2018/8/22/391

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..4afb1a8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, 
int count)
kvfree(pages);
 }

-static struct page **__iommu_dma_alloc_pages(unsigned int count,
-   unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+   unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
struct page **pages;
-   unsigned int i = 0, array_size = count * sizeof(*pages);
+   unsigned int i = 0, nid = dev_to_node(dev);

order_mask &= (2U << MAX_ORDER) - 1;
if (!order_mask)
return NULL;

-   if (array_size <= PAGE_SIZE)
-   pages = kzalloc(array_size, GFP_KERNEL);
-   else
-   pages = vzalloc(array_size);
+   pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
if (!pages)
return NULL;

@@ -481,10 +478,12 @@ static struct page **__iommu_dma_alloc_pages(unsigned int 
count,
for (order_mask &= (2U << __fls(count)) - 1;
 order_mask; order_mask &= ~order_size) {
unsigned int order = __fls(order_mask);
+   gfp_t alloc_flags = gfp;

order_size = 1U << order;
-   page = alloc_pages((order_mask - order_size) ?
-  gfp | __GFP_NORETRY : gfp, order);
+   if (order_mask > order_size)
+   alloc_flags |= __GFP_NORETRY;
+   page = alloc_pages_node(nid, alloc_flags, order);
if (!page)
continue;
if (!order)
@@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t 
size, gfp_t gfp,
alloc_sizes = min_size;

count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-   pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+   pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+   gfp);
if (!pages)
return NULL;







Re: [PATCH RFC 1/1] iommu: set the default iommu-dma mode as non-strict

2019-03-06 Thread John Garry






(VFIO) is untrusted, ok. But a malicious driver loaded into the kernel
address space would have much easier ways to corrupt the system than to
exploit lazy mode...

Yes, so we have no need to consider untrusted drivers.



For (3), I agree that we should at least disallow lazy mode if
pci_dev->untrusted is set. At the moment it means that we require the
strictest IOMMU configuration for external-facing PCI ports, but it can
be extended to blacklist other vulnerable devices or locations.

I plan to add an attribute file for each device, especially for hotplug
devices, and let the root user decide which mode should be used, strict or
non-strict, because they should know whether the hot-plugged device is
trusted or not.


Aside from the problem that without massive implementation changes 
strict/non-strict is at
best a per-domain property, not a per-device one, I can't see this being 
particularly practical
- surely the whole point of a malicious endpoint is that it's going to pretend 
to be some common
device for which a 'trusted' kernel driver already exists?

Yes, it should be assumed that all kernel drivers and all hard-wired
devices are trusted. There is no reason to suspect that the open source
drivers, or the drivers and devices provided by legitimate suppliers, are
malicious.



If you've chosen to trust *any* external device, I think you may as well have 
just set non-strict globally anyway.
The effort involved in trying to implement super-fine-grained control seems 
hard to justify.

The default mode for external devices is strict; it can obviously be
changed to non-strict mode. But as you said, it may be hard to implement.
In addition, bringing a malicious device into a computer room, then
attaching it and exfiltrating data, is not easy either. Maybe I should
follow Jean's suggestion first,




add a config item.



+1

On another topic, we also see a use case for selectively passing 
devices through.


Typically, having the kernel use the identity mapping when driving a 
device is fine. In fact, having the IOMMU translate puts a big performance 
burden on the system. However, sometimes we may require the IOMMU to be 
involved for certain devices, such as when the kernel device driver has 
big contiguous DMA requirements, which is the case for some RDMA NIC 
cards.
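
(As a sketch of what selective passthrough could look like with the
current API: iommu_request_dm_for_dev() is the real helper for requesting
an identity mapping for one device, while the policy check here is
hypothetical.)

	/* In a driver's probe path: keep this device on a direct mapping,
	 * leaving translation enabled for everything else.
	 */
	if (!dev_needs_iommu_translation(&pdev->dev))	/* hypothetical policy */
		iommu_request_dm_for_dev(&pdev->dev);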


John




Robin.



If you do (3) then maybe we don't need (1) and (2), which require a
tonne of work in the DMA and IOMMU layers (but would certainly be nice
to see, since it would also help handle ATS invalidation timeouts)

Thanks,
Jean


So some high-end trusted devices would use non-strict mode, while others
keep using strict mode. The drivers that want to use non-strict mode
should change to use the new APIs themselves.




Why not keep the policy as secure by default, as we do for
iommu.passthrough? And maybe add something similar to
CONFIG_IOMMU_DEFAULT_PASSTHROUGH? It's easy enough for experts to pass a
command-line argument or change the default config.


Sorry for the late reply; it was Chinese New Year, and we had a long
discussion internally. We are fine with adding a Kconfig option, but we
are not sure OS vendors will set it to default y.

OS vendors seem unhappy about passing a command-line argument; to be
honest, this is our motivation for enabling non-strict as the default.
Hopefully OS vendors can see this email thread and give some input here.

Thanks
Hanjun





















Re: [PATCH/RFC] driver core: Postpone DMA tear-down until after devres release

2019-03-07 Thread John Garry

On 11/02/2019 10:22, Robin Murphy wrote:

On 08/02/2019 18:55, Geert Uytterhoeven wrote:

Hi Robin,

On Fri, Feb 8, 2019 at 6:55 PM Robin Murphy  wrote:

On 08/02/2019 16:40, Joerg Roedel wrote:

On Thu, Feb 07, 2019 at 08:36:53PM +0100, Geert Uytterhoeven wrote:

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 8ac10af17c0043a3..d62487d024559620 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -968,9 +968,9 @@ static void __device_release_driver(struct
device *dev, struct device *parent)
  drv->remove(dev);

  device_links_driver_cleanup(dev);
-arch_teardown_dma_ops(dev);

  devres_release_all(dev);
+arch_teardown_dma_ops(dev);
  dev->driver = NULL;
  dev_set_drvdata(dev, NULL);
  if (dev->pm_domain && dev->pm_domain->dismiss)


Thanks for the fix! Should it also be tagged for stable and get a Fixes


FTR, Greg has added it to driver-core-testing, with a CC to stable.


So I see, great!


tag? I know it only triggers with a fix in v5.0-rc, but still...


I think so:

Fixes: 09515ef5ddad ("of/acpi: Configure dma operations at probe time
for platform/amba/pci bus devices")


Thanks! It won't backport cleanly due to commit dc3c05504d38849f
("dma-mapping: remove dma_deconfigure") in v4.20, though.


Ah yes - backports beyond that should simply be a case of moving the
dma_deconfigure() wrapper in the same manner.
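
(For anyone doing the manual backport, it would presumably be the same
one-line move with the old wrapper name; a sketch against a 4.19-era
__device_release_driver().)

 	device_links_driver_cleanup(dev);
-	dma_deconfigure(dev);
 
 	devres_release_all(dev);
+	dma_deconfigure(dev);
 	dev->driver = NULL;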



Hi guys,

Any idea what happened to this fix?

I have this on 5.0:

[0.00] Linux version 5.0.0 (john@htsatcamb-server) (gcc version 
4.8.5 (Linaro GCC 4.8-2015.06)) #121 SMP PREEMPT Thu Mar 7 14:28:39 GMT 2019
[0.00] Kernel command line: BOOT_IMAGE=/john/Image rdinit=/init 
crashkernel=256M@32M earlycon console=ttyAMA0,115200 acpi=force 
pcie_aspm=off scsi_mod.use_blk_mq=y no_console_suspend pcie-hisi.disable=1


...

[   26.806856] pci_bus 000c:20: 2-byte config write to 000c:20:00.0 
offset 0x44 may corrupt adjacent RW1C bits

[   26.817521] pcieport 0002:f8:00.0: can't derive routing for PCI INT B
[   26.837167] pci_bus 000c:20: 2-byte config write to 000c:20:00.0 
offset 0x44 may corrupt adjacent RW1C bits

[   26.850091] serial 0002:f9:00.1: PCI INT B: no GSI
[   26.879364] serial 0002:f9:00.1: enabling device ( -> 0003)
[   26.913131] 0002:f9:00.1: ttyS3 at I/O 0x1008 (irq = 0, base_baud = 
115200) is a ST16650V2
[   26.992897] rtc-efi rtc-efi: setting system clock to 
2019-03-07T14:41:48 UTC (1551969708)

[   27.009380] ALSA device list:
[   27.015326]   No soundcards found.
[   27.022549] Freeing unused kernel memory: 1216K
[   27.055567] Run /init as init process
root@(none)$ cd /sys/devices/platform/HISI0162:01
root@(none)$ echo HISI0162:01 > driver/unbind
[   36.488040] hisi_sas_v2_hw HISI0162:01: dev[9:1] is gone
[   36.561077] hisi_sas_v2_hw HISI0162:01: dev[8:1] is gone
[   36.621061] hisi_sas_v2_hw HISI0162:01: dev[7:1] is gone
[   36.693074] hisi_sas_v2_hw HISI0162:01: dev[6:1] is gone
[   36.753066] hisi_sas_v2_hw HISI0162:01: dev[5:1] is gone
[   36.764276] sd 0:0:3:0: [sdd] Synchronizing SCSI cache
[   36.821106] hisi_sas_v2_hw HISI0162:01: dev[4:1] is gone
[   36.889048] hisi_sas_v2_hw HISI0162:01: dev[3:1] is gone
[   36.993002] hisi_sas_v2_hw HISI0162:01: dev[1:1] is gone
[   37.004276] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   37.014768] sd 0:0:0:0: [sda] Stopping disk
[   37.709094] hisi_sas_v2_hw HISI0162:01: dev[2:5] is gone
[   37.721231] hisi_sas_v2_hw HISI0162:01: dev[0:2] is gone
[   37.803774] BUG: Bad page state in process sh  pfn:11356
[   37.81] page:7e44d580 count:1 mapcount:0 
mapping: index:0x0

[   37.830525] flags: 0xfffc0001000(reserved)
[   37.839443] raw: 0fffc0001000 7e44d588 7e44d588 

[   37.854998] raw:   0001 


[   37.870552] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[   37.883485] bad because of flags: 0x1000(reserved)
[   37.893098] Modules linked in:
[   37.899221] CPU: 5 PID: 2691 Comm: sh Not tainted 5.0.0 #121
[   37.910578] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon 
D05 IT21 Nemo 2.0 RC0 04/18/2018

[   37.928924] Call trace:
[   37.933825]  dump_backtrace+0x0/0x150
[   37.941166]  show_stack+0x14/0x1c
[   37.947808]  dump_stack+0x8c/0xb0
[   37.954451]  bad_page+0xe4/0x144
[   37.960918]  free_pages_check_bad+0x7c/0x84
[   37.969307]  __free_pages_ok+0x284/0x290
[   37.977173]  __free_pages+0x30/0x44
[   37.984163]  __dma_direct_free_pages+0x68/0x6c
[   37.993076]  dma_direct_free+0x24/0x38
[   38.000591]  dma_free_attrs+0x84/0xc0
[   38.007930]  dmam_release+0x20/0x28
[   38.014924]  release_nodes+0x128/0x1f8
[   38.022439]  devres_release_all+0x34/0x4c
[   38.030478]  device_release_driver_internal+0x190/0x208
[   38.040963]  device_release_driver+0x14/0x1c
[   38.049526]  unbind_store+0xbc/0xf4
[   38.056517]  drv_attr_store+0x20/0x30
[   38

Re: [PATCH/RFC] driver core: Postpone DMA tear-down until after devres release

2019-03-07 Thread John Garry

On 07/03/2019 14:52, Robin Murphy wrote:

Hi John,


Hi Robin,

ok, fine.

It's a pain bisecting another rmmod issue with it...

Cheers,



On 07/03/2019 14:45, John Garry wrote:
[...]

Hi guys,

Any idea what happened to this fix?


It's been in -next for a while (commit 376991db4b64) - I assume it will
land shortly and hit stable thereafter, at which point somebody gets to
sort out the manual backport past 4.20.

Robin.


I have this on 5.0:

[0.00] Linux version 5.0.0 (john@htsatcamb-server) (gcc
version 4.8.5 (Linaro GCC 4.8-2015.06)) #121 SMP PREEMPT Thu Mar 7
14:28:39 GMT 2019
[0.00] Kernel command line: BOOT_IMAGE=/john/Image
rdinit=/init crashkernel=256M@32M earlycon console=ttyAMA0,115200
acpi=force pcie_aspm=off scsi_mod.use_blk_mq=y no_console_suspend
pcie-hisi.disable=1

...

[   26.806856] pci_bus 000c:20: 2-byte config write to 000c:20:00.0
offset 0x44 may corrupt adjacent RW1C bits
[   26.817521] pcieport 0002:f8:00.0: can't derive routing for PCI INT B
[   26.837167] pci_bus 000c:20: 2-byte config write to 000c:20:00.0
offset 0x44 may corrupt adjacent RW1C bits
[   26.850091] serial 0002:f9:00.1: PCI INT B: no GSI
[   26.879364] serial 0002:f9:00.1: enabling device ( -> 0003)
[   26.913131] 0002:f9:00.1: ttyS3 at I/O 0x1008 (irq = 0, base_baud =
115200) is a ST16650V2
[   26.992897] rtc-efi rtc-efi: setting system clock to
2019-03-07T14:41:48 UTC (1551969708)
[   27.009380] ALSA device list:
[   27.015326]   No soundcards found.
[   27.022549] Freeing unused kernel memory: 1216K
[   27.055567] Run /init as init process
root@(none)$ cd /sys/devices/platform/HISI0162:01
root@(none)$ echo HISI0162:01 > driver/unbind
[   36.488040] hisi_sas_v2_hw HISI0162:01: dev[9:1] is gone
[   36.561077] hisi_sas_v2_hw HISI0162:01: dev[8:1] is gone
[   36.621061] hisi_sas_v2_hw HISI0162:01: dev[7:1] is gone
[   36.693074] hisi_sas_v2_hw HISI0162:01: dev[6:1] is gone
[   36.753066] hisi_sas_v2_hw HISI0162:01: dev[5:1] is gone
[   36.764276] sd 0:0:3:0: [sdd] Synchronizing SCSI cache
[   36.821106] hisi_sas_v2_hw HISI0162:01: dev[4:1] is gone
[   36.889048] hisi_sas_v2_hw HISI0162:01: dev[3:1] is gone
[   36.993002] hisi_sas_v2_hw HISI0162:01: dev[1:1] is gone
[   37.004276] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   37.014768] sd 0:0:0:0: [sda] Stopping disk
[   37.709094] hisi_sas_v2_hw HISI0162:01: dev[2:5] is gone
[   37.721231] hisi_sas_v2_hw HISI0162:01: dev[0:2] is gone
[   37.803774] BUG: Bad page state in process sh  pfn:11356
[   37.81] page:7e44d580 count:1 mapcount:0
mapping: index:0x0
[   37.830525] flags: 0xfffc0001000(reserved)
[   37.839443] raw: 0fffc0001000 7e44d588 7e44d588

[   37.854998] raw:   0001

[   37.870552] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[   37.883485] bad because of flags: 0x1000(reserved)
[   37.893098] Modules linked in:
[   37.899221] CPU: 5 PID: 2691 Comm: sh Not tainted 5.0.0 #121
[   37.910578] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon
D05 IT21 Nemo 2.0 RC0 04/18/2018
[   37.928924] Call trace:
[   37.933825]  dump_backtrace+0x0/0x150
[   37.941166]  show_stack+0x14/0x1c
[   37.947808]  dump_stack+0x8c/0xb0
[   37.954451]  bad_page+0xe4/0x144
[   37.960918]  free_pages_check_bad+0x7c/0x84
[   37.969307]  __free_pages_ok+0x284/0x290
[   37.977173]  __free_pages+0x30/0x44
[   37.984163]  __dma_direct_free_pages+0x68/0x6c
[   37.993076]  dma_direct_free+0x24/0x38
[   38.000591]  dma_free_attrs+0x84/0xc0
[   38.007930]  dmam_release+0x20/0x28
[   38.014924]  release_nodes+0x128/0x1f8
[   38.022439]  devres_release_all+0x34/0x4c
[   38.030478]  device_release_driver_internal+0x190/0x208
[   38.040963]  device_release_driver+0x14/0x1c
[   38.049526]  unbind_store+0xbc/0xf4
[   38.056517]  drv_attr_store+0x20/0x30
[   38.063859]  sysfs_kf_write+0x44/0x4c
[   38.071200]  kernfs_fop_write+0xd0/0x1c4
[   38.079065]  __vfs_write+0x2c/0x158
[   38.086055]  vfs_write+0xa8/0x19c
[   38.092696]  ksys_write+0x44/0xa0
[   38.099338]  __arm64_sys_write+0x1c/0x24
[   38.107203]  el0_svc_common+0xb0/0x100
[   38.114718]  el0_svc_handler+0x70/0x88
[   38.122232]  el0_svc+0x8/0x7c0
[   38.128356] Disabling lock debugging due to kernel taint
[   38.139019] BUG: Bad page state in process sh  pfn:11355
[   38.149682] page:7e44d540 count:0 mapcount:0
mapping: index:0x0
[   38.165760] flags: 0xfffc0001000(reserved)
[   38.174676] raw: 0fffc0001000 7e44d548 7e44d548

[   38.190230] raw:   

[   38.205783] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[   38.218716] bad because of flags: 0x1000(reserved)
[   38.228329] Modules linked in:
[   38.234451] CPU: 5 PID: 2691 Comm: sh Tainted: GB 5.0.0 #121
[   38.248604] Hardware name: Huawei Taishan 2280

Re: [PATCH 1/1] iommu: Add config option to set lazy mode as default

2019-03-22 Thread John Garry

On 22/03/2019 14:11, Zhen Lei wrote:


This allows the default behaviour to be controlled by a kernel config
option instead of changing the command line for the kernel to include
"iommu.strict=0" on ARM64 where this is desired.

This is similar to CONFIG_IOMMU_DEFAULT_PASSTHROUGH

Note: At present, intel_iommu, amd_iommu and s390_iommu use lazy mode as
defalut, so there is no need to add code for them.


/s/defalut/default/



Signed-off-by: Zhen Lei 
---
 drivers/iommu/Kconfig | 14 ++
 drivers/iommu/iommu.c |  5 +


Do we need to update kernel-parameters.txt for iommu.strict?
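
(For context, the existing kernel-parameters.txt entry reads roughly as
below; only the default would need touching if a Kconfig option changes it.)

	iommu.strict=	[ARM64] Configure TLB invalidation behaviour
			Format: { "0" | "1" }
			0 - Lazy mode: deferred IOTLB invalidation on unmap.
			1 - Strict mode (default): synchronous invalidation.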


 2 files changed, 19 insertions(+)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 6f07f3b21816c64..5ec9780f564eaf8 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -85,6 +85,20 @@ config IOMMU_DEFAULT_PASSTHROUGH

  If unsure, say N here.

+config IOMMU_LAZY_MODE


Maybe we should add "DMA" to the name, and even "DEFAULT".


+   bool "IOMMU use lazy mode to flush IOTLB and free IOVA"
+   depends on IOMMU_API
+   help
+ For every IOMMU unmap operation, the flush operation of IOTLB and the 
free
+ operation of IOVA are deferred.


This is a bit unclear, as there is no context. I think that you need to 
say something like, "Support lazy mode, where for every IOMMU DMA unmap 
operation, the flush operation of IOTLB and the free operation of IOVA 
are deferred. "



They are only guaranteed to be done before

+ the related IOVA will be reused. Removing the need to pass in 
iommu.strict=0
+ through command line on ARM64(Now, intel_iommu, amd_iommu, s390_iommu 
use
lazy mode as default).


prone to going out-of-date

 If this is enabled, you can still disable with kernel

+ parameters, such as iommu.strict=1, intel_iommu=strict, 
amd_iommu=fullflush
+ or s390_iommu=strict depending on the architecture.
+
+ If unsure, say N here.
+
 config OF_IOMMU
def_bool y
depends on OF && IOMMU_API
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 33a982e33716369..e307d70d1578b3b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -43,7 +43,12 @@
 #else
 static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA;
 #endif
+
+#ifdef CONFIG_IOMMU_LAZY_MODE
+static bool iommu_dma_strict __read_mostly;
+#else
 static bool iommu_dma_strict __read_mostly = true;
+#endif

 struct iommu_callback_data {
const struct iommu_ops *ops;
--
1.8.3





Cheers





Re: [PATCH/RFC] driver core: Postpone DMA tear-down until after devres release

2019-03-26 Thread John Garry


Memory is incorrectly freed using the direct ops, as dma_map_ops = NULL.
Oops...

After reversing the order of the calls to arch_teardown_dma_ops() and
devres_release_all(), dma_map_ops is still valid, and the DMA memory is
now released using __iommu_free_attrs():

+sata_rcar ee30.sata: dmam_release:32: size 2048 vaddr ff8012145000 
dma_handle 0x0xf000 attrs 0x0
+sata_rcar ee30.sata: dma_free_attrs:289: size 2048, ops = iommu_dma_ops
+sata_rcar ee30.sata: dma_free_attrs:311: calling __iommu_free_attrs()
---
 drivers/base/dd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 8ac10af17c0043a3..d62487d024559620 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -968,9 +968,9 @@ static void __device_release_driver(struct device *dev, 
struct device *parent)
drv->remove(dev);

device_links_driver_cleanup(dev);
-   arch_teardown_dma_ops(dev);

devres_release_all(dev);
+   arch_teardown_dma_ops(dev);
dev->driver = NULL;


Hi guys,

Could there still be the same problem in the error path of really_probe():

static int really_probe(struct device *dev, struct device_driver *drv)
{

[...]

goto done;

probe_failed:
arch_teardown_dma_ops(dev);
dma_failed:
if (dev->bus)
blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
 BUS_NOTIFY_DRIVER_NOT_BOUND, dev);
pinctrl_bind_failed:
device_links_no_driver(dev);
devres_release_all(dev);
driver_sysfs_remove(dev);
dev->driver = NULL;
dev_set_drvdata(dev, NULL);

We seem to be able to call arch_teardown_dma_ops() prior to 
devres_release_all() if we reach probe_failed label.
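
(If so, the fix would presumably mirror the earlier one; a sketch of the
error path reordered so devres_release_all() runs before the DMA ops are
torn down.)

	probe_failed:
		if (dev->bus)
			blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
						     BUS_NOTIFY_DRIVER_NOT_BOUND, dev);
	pinctrl_bind_failed:
		device_links_no_driver(dev);
		devres_release_all(dev);
		arch_teardown_dma_ops(dev);
		driver_sysfs_remove(dev);
		dev->driver = NULL;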


We have seen this crash when our driver probe fails on a dev branch 
based on v5.1-rc1:


[   87.896707] hisi_sas_v3_hw :74:02.0: Adding to iommu group 2
[   87.909765] scsi host1: hisi_sas_v3_hw
[   89.127958] hisi_sas_v3_hw :74:02.0: evaluate _DSM failed
[   89.134043] BUG: Bad page state in process swapper/0  pfn:313f5
[   89.139965] page:7ec4fd40 count:1 mapcount:0 
mapping: index:0x0

[   89.147960] flags: 0xfffe0001000(reserved)
[   89.152398] raw: 0fffe0001000 7ec4fd48 7ec4fd48 

[   89.160130] raw:   0001 


[   89.167861] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[   89.174290] bad because of flags: 0x1000(reserved)
[   89.179070] Modules linked in:
[   89.182117] CPU: 49 PID: 1 Comm: swapper/0 Not tainted 
5.1.0-rc1-43081-g22d97fd-dirty #1433
[   89.190453] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.12.01 01/29/2019

[   89.198876] Call trace:
[   89.201316]  dump_backtrace+0x0/0x118
[   89.204965]  show_stack+0x14/0x1c
[   89.208272]  dump_stack+0xa4/0xc8
[   89.211576]  bad_page+0xe4/0x13c
[   89.214791]  free_pages_check_bad+0x4c/0xc0
[   89.218961]  __free_pages_ok+0x30c/0x340
[   89.222871]  __free_pages+0x30/0x44
[   89.226347]  __dma_direct_free_pages+0x30/0x38
[   89.230777]  dma_direct_free+0x24/0x38
[   89.234513]  dma_free_attrs+0x9c/0xd8
[   89.238161]  dmam_release+0x20/0x28
[   89.241640]  release_nodes+0x17c/0x220
[   89.245375]  devres_release_all+0x34/0x54
[   89.249371]  really_probe+0xc4/0x2c8
[   89.252933]  driver_probe_device+0x58/0xfc
[   89.257016]  device_driver_attach+0x68/0x70
[   89.261185]  __driver_attach+0x94/0xdc
[   89.264921]  bus_for_each_dev+0x5c/0xb4
[   89.268744]  driver_attach+0x20/0x28
[   89.272306]  bus_add_driver+0x14c/0x200
[   89.276128]  driver_register+0x6c/0x124
[   89.279953]  __pci_register_driver+0x48/0x50
[   89.284213]  sas_v3_pci_driver_init+0x20/0x28
[   89.288557]  do_one_initcall+0x40/0x25c
[   89.292381]  kernel_init_freeable+0x2b8/0x3c0
[   89.296727]  kernel_init+0x10/0x100
[   89.300202]  ret_from_fork+0x10/0x18
[   89.303773] Disabling lock debugging due to kernel taint
[   89.309076] BUG: Bad page state in process swapper/0  pfn:313f6
[   89.314988] page:7ec4fd80 count:1 mapcount:0 
mapping: index:0x0

[   89.322983] flags: 0xfffe0001000(reserved)
[   89.327417] raw: 0fffe0001000 7ec4fd88 7ec4fd88 

[   89.335149] raw:   0001 



Thanks,
John



dev_set_drvdata(dev, NULL);
if (dev->pm_domain && dev->pm_domain->dismiss)






Re: [PATCH/RFC] driver core: Postpone DMA tear-down until after devres release

2019-03-26 Thread John Garry

On 26/03/2019 12:31, Geert Uytterhoeven wrote:

Hi John,

CC robh

On Tue, Mar 26, 2019 at 12:42 PM John Garry  wrote:

Memory is incorrectly freed using the direct ops, as dma_map_ops = NULL.
Oops...

After reversing the order of the calls to arch_teardown_dma_ops() and
devres_release_all(), dma_map_ops is still valid, and the DMA memory is
now released using __iommu_free_attrs():

+sata_rcar ee30.sata: dmam_release:32: size 2048 vaddr ff8012145000 
dma_handle 0x0xf000 attrs 0x0
+sata_rcar ee30.sata: dma_free_attrs:289: size 2048, ops = iommu_dma_ops
+sata_rcar ee30.sata: dma_free_attrs:311: calling __iommu_free_attrs()
---
 drivers/base/dd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 8ac10af17c0043a3..d62487d024559620 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -968,9 +968,9 @@ static void __device_release_driver(struct device *dev, 
struct device *parent)
  drv->remove(dev);

  device_links_driver_cleanup(dev);
- arch_teardown_dma_ops(dev);

  devres_release_all(dev);
+ arch_teardown_dma_ops(dev);
  dev->driver = NULL;


Hi guys,

Could there still be the same problem in the error path of really_probe():

static int really_probe(struct device *dev, struct device_driver *drv)
{

[...]

goto done;

probe_failed:
arch_teardown_dma_ops(dev);
dma_failed:
if (dev->bus)
blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
 BUS_NOTIFY_DRIVER_NOT_BOUND, dev);
pinctrl_bind_failed:
device_links_no_driver(dev);
devres_release_all(dev);
driver_sysfs_remove(dev);
dev->driver = NULL;
dev_set_drvdata(dev, NULL);

We seem to be able to call arch_teardown_dma_ops() prior to
devres_release_all() if we reach the probe_failed label.


Yes, this looks like another instance of the same problem.
And test_remove doesn't expose this, as it doesn't exercise the full cycle.



Hi Geert,

OK, thanks.

There is another devres_release_all() call in device_release(). Not sure 
if there is another potential problem there also...


As for this issue, I'll look to fix it unless anyone else is doing so.

Cheers,
John


Gr{oetje,eeting}s,

Geert






[PATCH] driver core: Postpone DMA tear-down until after devres release for probe failure

2019-03-28 Thread John Garry
In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after
devres release"), we changed the ordering of tearing down the device DMA
ops and releasing all the device's resources; this was because the DMA ops
should be maintained until we release the device's managed DMA memories.

However, we have seen another crash on an arm64 system when a
device driver probe fails:

  hisi_sas_v3_hw :74:02.0: Adding to iommu group 2
  scsi host1: hisi_sas_v3_hw
  BUG: Bad page state in process swapper/0  pfn:313f5
  page:7ec4fd40 count:1 mapcount:0 mapping: index:0x0
  flags: 0xfffe0001000(reserved)
  raw: 0fffe0001000 7ec4fd48 7ec4fd48
  raw:   0001
  page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
  bad because of flags: 0x1000(reserved)
  Modules linked in:
  CPU: 49 PID: 1 Comm: swapper/0 Not tainted
5.1.0-rc1-43081-g22d97fd-dirty #1433
  Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI
RC0 - V1.12.01 01/29/2019
  Call trace:
  dump_backtrace+0x0/0x118
  show_stack+0x14/0x1c
  dump_stack+0xa4/0xc8
  bad_page+0xe4/0x13c
  free_pages_check_bad+0x4c/0xc0
  __free_pages_ok+0x30c/0x340
  __free_pages+0x30/0x44
  __dma_direct_free_pages+0x30/0x38
  dma_direct_free+0x24/0x38
  dma_free_attrs+0x9c/0xd8
  dmam_release+0x20/0x28
  release_nodes+0x17c/0x220
  devres_release_all+0x34/0x54
  really_probe+0xc4/0x2c8
  driver_probe_device+0x58/0xfc
  device_driver_attach+0x68/0x70
  __driver_attach+0x94/0xdc
  bus_for_each_dev+0x5c/0xb4
  driver_attach+0x20/0x28
  bus_add_driver+0x14c/0x200
  driver_register+0x6c/0x124
  __pci_register_driver+0x48/0x50
  sas_v3_pci_driver_init+0x20/0x28
  do_one_initcall+0x40/0x25c
  kernel_init_freeable+0x2b8/0x3c0
  kernel_init+0x10/0x100
  ret_from_fork+0x10/0x18
  Disabling lock debugging due to kernel taint
  BUG: Bad page state in process swapper/0  pfn:313f6
  page:7ec4fd80 count:1 mapcount:0 mapping: index:0x0
  flags: 0xfffe0001000(reserved)
  raw: 0fffe0001000 7ec4fd88 7ec4fd88
  raw:   0001

The crash occurs for the same reason.

In this case, on the really_probe() failure path, we are still clearing
the DMA ops prior to releasing the device's managed memories.

This patch fixes this issue by reordering the DMA ops teardown and the
call to devres_release_all() on the failure path.
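
For illustration, the reordering would be roughly as follows (a sketch
against the error path quoted earlier in this thread, not necessarily the
exact committed hunk):

 probe_failed:
-	arch_teardown_dma_ops(dev);
 dma_failed:
 	if (dev->bus)
 		blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
 					     BUS_NOTIFY_DRIVER_NOT_BOUND, dev);
 pinctrl_bind_failed:
 	device_links_no_driver(dev);
 	devres_release_all(dev);
+	arch_teardown_dma_ops(dev);
 	driver_sysfs_remove(dev);
 	dev->driver = NULL;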

Reported-by: Xiang Chen 
Tested-by: Xiang Chen 
Signed-off-by: John Garry 
---

For convenience, here is the 2nd half of really_probe() now:

atomic_inc(&probe_count);
pr_debug("bus: '%s': %s: probing driver %s with device %s\n",
 drv->bus->name, __func__, drv->name, dev_name(dev));
WARN_ON(!list_empty(&dev->devres_head));

re_probe:
dev->driver = drv;

/* If using pinctrl, bind pins now before probing */
ret = pinctrl_bind_pins(dev);
if (ret)
goto pinctrl_bind_failed;

if (dev->bus->dma_configure) {
ret = dev->bus->dma_configure(dev);
if (ret)
goto probe_failed;
}

if (driver_sysfs_add(dev)) {
printk(KERN_ERR "%s: driver_sysfs_add(%s) failed\n",
__func__, dev_name(dev));
goto probe_failed;
}

if (dev->pm_domain && dev->pm_domain->activate) {
ret = dev->pm_domain->activate(dev);
if (ret)
goto probe_failed;
}

if (dev->bus->probe) {
ret = dev->bus->probe(dev);
if (ret)
goto probe_failed;
} else if (drv->probe) {
ret = drv->probe(dev);
if (ret)
goto probe_failed;
}

if (test_remove) {
test_remove = false;

if (dev->bus->remove)
dev->bus->remove(dev);
else if (drv->remove)
drv->remove(dev);

devres_release_all(dev);
driver_sysfs_remove(dev);
dev->driver = NULL;
dev_set_drvdata(dev, NULL);
if (dev->pm_domain && dev->pm_domain->dismiss)
dev->pm_domain->dismiss(dev);
pm_runtime_reinit(dev);

goto re_probe;
}

pinctrl_init_done(dev);

if (dev->pm_domain && dev->pm_domain->sync)
dev->pm_domain->sync(dev);

driver_bound(dev);
ret = 1;
pr_debug("bus: '%s': %s: bound device %s t

Re: arm64 iommu groups issue

2020-02-17 Thread John Garry


Right, and even worse is that it relies on the port driver even 
existing at all.


All this iommu group assignment should be taken outside device driver 
probe paths.


However we could still consider device links for sync'ing the SMMU and 
each device probing.


Yes, we should get that for DT now thanks to the of_devlink stuff, but 
cooking up some equivalent for IORT might be worthwhile.


It doesn't solve this problem, but at least we could remove the 
iommu_ops check in iort_iommu_xlate().


We would need to carve out a path from pci_device_add() or even 
device_add() to solve all cases.





Another thought that crosses my mind is that when pci_device_group()
walks up to the point of ACS isolation and doesn't find an existing
group, it can still infer that everything it walked past *should* be put
in the same group it's then eventually going to return. Unfortunately I
can't see an obvious way for it to act on that knowledge, though, since
recursive iommu_probe_device() is unlikely to end well.




[...]

And this looks to be the reason why the current 
iommu_bus_init()->bus_for_each_device(..., add_iommu_group) approach also fails.


Of course, just adding a 'correct' add_device replay without the 
of_xlate process doesn't help at all. No wonder this looked suspiciously 
simpler than where the first idea left off...


(on reflection, the core of this idea seems to be recycling the existing 
iommu_bus_init walk rather than building up a separate "waiting list", 
while forgetting that that wasn't the difficult part of the original 
idea anyway)


We could still use a bus walk to add the group per iommu, but we would 
need an additional check to ensure the device is associated with the IOMMU.




Regarding the current code mentioned, the principle seems wrong to 
me - we call bus_for_each_device(..., add_iommu_group) for the first 
SMMU in the system which probes, but we attempt add_iommu_group() 
for all devices on the bus, even though the SMMU for a given device 
may have yet to probe.


Yes, iommu_bus_init() is one of the places still holding a 
deeply-ingrained assumption that the ops go live for all IOMMU instances 
at once, which is what warranted the further replay in 
of_iommu_configure() originally. Moving that out of 
of_platform_device_create() to support probe deferral is where the 
trouble really started.


I'm not too familiar with the history here, but could this be reverted 
now with the introduction of of_devlink stuff?


Cheers,
John


Re: [PATCH] iommu: silence iommu group prints

2020-02-28 Thread John Garry

On 28/02/2020 02:16, Lu Baolu wrote:

Hi,

On 2020/2/27 19:57, Russell King wrote:

On the LX2160A, there are lots (about 160) of IOMMU messages produced
during boot; this is excessive.  Reduce the severity of these messages
to debug level.

Signed-off-by: Russell King 
---
  drivers/iommu/iommu.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3ead597e1c57..304281ec623b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -741,7 +741,7 @@ int iommu_group_add_device(struct iommu_group 
*group, struct device *dev)

  trace_add_device_to_group(group->id, dev);
-    dev_info(dev, "Adding to iommu group %d\n", group->id);
+    dev_dbg(dev, "Adding to iommu group %d\n", group->id);


I'm not strongly against this. But to me this message seems to be a good
indicator that a device was probed successfully by the iommu subsystem.
Keeping it in the default kernel messages always helps kernel
debugging.



I would tend to agree.


Best regards,
baolu

Re: [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice

2020-03-03 Thread John Garry

On 21/01/2020 09:56, Robin Murphy wrote:

On 18/12/2019 4:39 am, Cong Wang wrote:

Both find_iova() and __free_iova() take iova_rbtree_lock,
there is no reason to take and release it twice inside
free_iova().

Fold them into one critical section by calling the unlock
versions instead.


Makes sense to me.

Reviewed-by: Robin Murphy 


Reviewed-by: John Garry 

Could we at least get this patch picked up? It should be ok to take in 
isolation, since there is some debate on the other 2 patches in this 
series. Thanks





Cc: Joerg Roedel 
Cc: John Garry 
Signed-off-by: Cong Wang 
---
  drivers/iommu/iova.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 184d4c0e20b5..f46f8f794678 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova);
  void
  free_iova(struct iova_domain *iovad, unsigned long pfn)
  {
-    struct iova *iova = find_iova(iovad, pfn);
+    unsigned long flags;
+    struct iova *iova;
 
+    spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+    iova = private_find_iova(iovad, pfn);
  if (iova)
-    __free_iova(iovad, iova);
+    private_free_iova(iovad, iova);
+    spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
  }
  EXPORT_SYMBOL_GPL(free_iova);
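
The pattern being applied here - a public wrapper takes the lock once and
calls unlocked helpers, instead of two lock/unlock round trips - in a
generic, hedged userspace C sketch (illustrative names, not the iova code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int values[8];

/* caller must hold 'lock' (cf. private_find_iova()) */
static int *find_locked(int key)
{
	for (int i = 0; i < 8; i++)
		if (values[i] == key)
			return &values[i];
	return NULL;
}

/* caller must hold 'lock' (cf. private_free_iova()) */
static void free_locked(int *slot)
{
	*slot = 0;
}

/* public API: one critical section, instead of a find() and a free()
 * which each take and release the lock separately
 */
static void free_value(int key)
{
	pthread_mutex_lock(&lock);
	int *slot = find_locked(key);

	if (slot)
		free_locked(slot);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	values[3] = 42;
	free_value(42);
	printf("slot 3 is now %d\n", values[3]);
	return 0;
}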





Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-19 Thread John Garry

Hi Will,



On Thu, Jan 02, 2020 at 05:44:39PM +, John Garry wrote:

And for the overall system, we have:

   PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop:
0/40116 [4000Hz cycles],  (all, 96 CPUs)
--

 27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
 11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
  6.35%  [kernel]  [k] _raw_spin_unlock_irq
  2.65%  [kernel]  [k] get_user_pages_fast
  2.03%  [kernel]  [k] __slab_free
  1.55%  [kernel]  [k] tick_nohz_idle_exit
  1.47%  [kernel]  [k] arm_lpae_map
  1.39%  [kernel]  [k] __fget
  1.14%  [kernel]  [k] __lock_text_start
  1.09%  [kernel]  [k] _raw_spin_lock
  1.08%  [kernel]  [k] bio_release_pages.part.42
  1.03%  [kernel]  [k] __sbitmap_get_word
  0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
  0.91%  [kernel]  [k] fput_many
  0.88%  [kernel]  [k] __arm_lpae_map

One thing to note is that we still spend an appreciable amount of time in
arm_smmu_atc_inv_domain(), which is disappointing when considering it should
effectively be a noop.

As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
batch size is 1, so we're not seeing the real benefit of the batching. I
can't help but think that we could improve this code to try to combine CMD
SYNCs for small batches.

Anyway, let me know your thoughts or any questions. I'll have a look if a
get a chance for other possible bottlenecks.


Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.


I'm only getting back to look at this now, as SMMU performance is a bit 
of a hot topic again for us.


So one thing we are doing which looks to help performance is this series 
from Marc:


https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d

So that is just spreading the per-CPU load for NVMe interrupt handling 
(where the DMA unmapping is happening), so I'd say just side-stepping 
any SMMU issue really.


Going back to the SMMU, I wanted to run eBPF and perf annotate to help 
profile this, but was having no luck getting them to work properly. I'll 
look at this again now.


Cheers,
John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-20 Thread John Garry

On 19/03/2020 18:43, Jean-Philippe Brucker wrote:

On Thu, Mar 19, 2020 at 12:54:59PM +, John Garry wrote:

Hi Will,



On Thu, Jan 02, 2020 at 05:44:39PM +, John Garry wrote:

And for the overall system, we have:

PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop:
0/40116 [4000Hz cycles],  (all, 96 CPUs)
--

  27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
  11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.35%  [kernel]  [k] _raw_spin_unlock_irq
   2.65%  [kernel]  [k] get_user_pages_fast
   2.03%  [kernel]  [k] __slab_free
   1.55%  [kernel]  [k] tick_nohz_idle_exit
   1.47%  [kernel]  [k] arm_lpae_map
   1.39%  [kernel]  [k] __fget
   1.14%  [kernel]  [k] __lock_text_start
   1.09%  [kernel]  [k] _raw_spin_lock
   1.08%  [kernel]  [k] bio_release_pages.part.42
   1.03%  [kernel]  [k] __sbitmap_get_word
   0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
   0.91%  [kernel]  [k] fput_many
   0.88%  [kernel]  [k] __arm_lpae_map

One thing to note is that we still spend an appreciable amount of time in
arm_smmu_atc_inv_domain(), which is disappointing when considering it should
effectively be a noop.

As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
batch size is 1, so we're not seeing the real benefit of the batching. I
can't help but think that we could improve this code to try to combine CMD
SYNCs for small batches.

Anyway, let me know your thoughts or any questions. I'll have a look if a
get a chance for other possible bottlenecks.


Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.


I'm only getting back to look at this now, as SMMU performance is a bit of a
hot topic again for us.

So one thing we are doing which looks to help performance is this series
from Marc:

https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d

So that is just spreading the per-CPU load for NVMe interrupt handling
(where the DMA unmapping is happening), so I'd say just side-stepping any
SMMU issue really.

Going back to the SMMU, I wanted to run eBPF and perf annotate to help
profile this, but was having no luck getting them to work properly. I'll
look at this again now.


Could you also try with the upcoming ATS change currently in Will's tree?
They won't improve your numbers but it'd be good to check that they don't
make things worse.


I can do when I get a chance.



I've run a bunch of netperf instances on multiple cores and collecting
SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
consistently.

- 6.07% arm_smmu_iotlb_sync
- 5.74% arm_smmu_tlb_inv_range
 5.09% arm_smmu_cmdq_issue_cmdlist
 0.28% __pi_memset
 0.08% __pi_memcpy
 0.08% arm_smmu_atc_inv_domain.constprop.37
 0.07% arm_smmu_cmdq_build_cmd
 0.01% arm_smmu_cmdq_batch_add
  0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
when ATS is not used. According to the annotations, the load from the
atomic_read(), that checks whether the domain uses ATS, is 77% of the
samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
there is much room for optimization there.


Well I did originally suggest using RCU protection to scan the list of 
devices, instead of reading an atomic and checking for a non-zero value. 
But that would be an optimisation for ATS also, and there were no ATS 
devices at the time (to verify performance).


Cheers,
John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-20 Thread John Garry







I've run a bunch of netperf instances on multiple cores and collecting
SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
consistently.

- 6.07% arm_smmu_iotlb_sync
 - 5.74% arm_smmu_tlb_inv_range
  5.09% arm_smmu_cmdq_issue_cmdlist
  0.28% __pi_memset
  0.08% __pi_memcpy
  0.08% arm_smmu_atc_inv_domain.constprop.37
  0.07% arm_smmu_cmdq_build_cmd
  0.01% arm_smmu_cmdq_batch_add
   0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
when ATS is not used. According to the annotations, the load from the
atomic_read(), that checks whether the domain uses ATS, is 77% of the
samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
there is much room for optimization there.


Well I did originally suggest using RCU protection to scan the list of
devices, instead of reading an atomic and checking for a non-zero value. But
that would be an optimisation for ATS also, and there were no ATS devices at
the time (to verify performance).


Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
performance, but I agree that using RCU to scan the list should get better
results when using ATS.

When ATS isn't in use however, I suspect reading nr_ats_masters should be
more efficient than taking the RCU lock + reading an "ats_devices" list
(since the smmu_domain->devices list also serves context descriptor
invalidation, even when ATS isn't in use). I'll run some tests however, to
see if I can micro-optimize this case, but I don't expect noticeable
improvements.


ok, cheers. I, too, would not expect a significant improvement there.

JFYI, I've been playing with "perf annotate" today and it's giving 
strange results for my NVMe testing. So "report" looks somewhat sane, if 
not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():



55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] 
arm_smmu_cmdq_issue_cmdlist
 9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] 
_raw_spin_unlock_irqrestore

 2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
 1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
 1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] 
arm_smmu_atc_inv_domain.constprop.42

 1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
 1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw

But "annotate" consistently tells me that a specific instruction 
consumes ~99% of the load for the enqueue function:


         :      /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
         :      if (sync) {
    0.00 :   80001071c948:   ldr     w0, [x29, #108]
         :      int ret = 0;
    0.00 :   80001071c94c:   mov     w24, #0x0                   // #0
         :      if (sync) {
    0.00 :   80001071c950:   cbnz    w0, 80001071c990
         :      arch_local_irq_restore():
    0.00 :   80001071c954:   msr     daif, x21
         :      arm_smmu_cmdq_issue_cmdlist():
         :              }
         :      }
         :
         :      local_irq_restore(flags);
         :      return ret;
         :      }
   99.51 :   80001071c958:   adrp    x0, 800011909000
    0.00 :   80001071c95c:   add     x21, x0, #0x908
    0.02 :   80001071c960:   ldr     x2, [x29, #488]
    0.14 :   80001071c964:   ldr     x1, [x21]
    0.00 :   80001071c968:   eor     x1, x2, x1
    0.00 :   80001071c96c:   mov     w0, w24


But there may be a hint that we're spending a lot of time polling for 
CMD_SYNC consumption.


The files are available here:

https://raw.githubusercontent.com/hisilicon/kernel-dev/private-topic-nvme-5.6-profiling/ann.txt, 
and .../report


Or maybe I'm just not using the tool properly ...

Cheers,
John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-23 Thread John Garry

On 20/03/2020 16:33, Marc Zyngier wrote:

JFYI, I've been playing for "perf annotate" today and it's giving
strange results for my NVMe testing. So "report" looks somewhat sane,
if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():


    55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] 
arm_smmu_cmdq_issue_cmdlist
 9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] 
_raw_spin_unlock_irqrestore

 2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
 1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
 1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k]
arm_smmu_atc_inv_domain.constprop.42
 1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
 1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw

But "annotate" consistently tells me that a specific instruction
consumes ~99% of the load for the enqueue function:

 :  /* 5. If we are inserting a CMD_SYNC,
we must wait for it to complete */
 :  if (sync) {
    0.00 :   80001071c948:   ldr w0, [x29, #108]
 :  int ret = 0;
    0.00 :   80001071c94c:   mov w24, #0x0  // #0
 :  if (sync) {
    0.00 :   80001071c950:   cbnz    w0, 80001071c990

 :  arch_local_irq_restore():
    0.00 :   80001071c954:   msr daif, x21
 :  arm_smmu_cmdq_issue_cmdlist():
 :  }
 :  }
 :
 :  local_irq_restore(flags);
 :  return ret;
 :  }
   99.51 :   80001071c958:   adrp    x0, 800011909000





Hi Marc,

This is likely the side effect of the re-enabling of interrupts (msr 
daif, x21)
on the previous instruction which causes the perf interrupt to fire 
right after.


ok, makes sense.



Time to enable pseudo-NMIs in the PMUv3 driver...



Do you know if there is any plan for this?

In the meantime, maybe I can do some trickery by putting the 
local_irq_restore() in a separate function, outside 
arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same function.


Cheers,
John

Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-24 Thread John Garry

On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,


Time to enable pseudo-NMIs in the PMUv3 driver...



Do you know if there is any plan for this?


There was. Julien Thierry has a bunch of patches for that [1], but they 
need reviving.



So those patches still apply cleanly (apart from the kvm patch, which I 
can skip, I suppose) and build, so I figure I can try this. Is there 
anything else which I should ensure or know about? Apart from enabling 
CONFIG_ARM64_PSEUDO_NMI.


A quickly taken perf annotate and report is at the tip here: 
https://github.com/hisilicon/kernel-dev/commits/private-topic-nvme-5.6-profiling 





In the meantime, maybe I can do some trickery by putting the
local_irq_restore() in a separate function, outside
arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same
function.




Scratch that :)


I don't see how you can improve the profiling without compromising
the locking in this case...



Cheers,
John

[1] https://patchwork.kernel.org/cover/11047407/


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-24 Thread John Garry

On 24/03/2020 10:43, Marc Zyngier wrote:

On Tue, 24 Mar 2020 09:18:10 +
John Garry  wrote:


On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,


Time to enable pseudo-NMIs in the PMUv3 driver...


Do you know if there is any plan for this?

There was. Julien Thierry has a bunch of patches for that [1], but they
need reviving.


So those patches still apply cleanly (apart from the kvm patch, which
I can skip, I suppose) and build, so I figure I can try this. Is
there anything else which I should ensure or know about? Apart from
enabling CONFIG_ARM64_PSEUDO_NMI.

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
has it set to 0, preventing me from being able to use the feature
(hint, nudge...;-).


Yeah, apparently it's set on our D06CS board, but I just need to double 
check the FW version with our FW guy.


As for D05, there has not been a FW update there in quite a long time 
and no plans for it. Sorry.


Cheers,
John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-24 Thread John Garry

On 24/03/2020 12:07, Robin Murphy wrote:

On 2020-03-24 11:55 am, John Garry wrote:

On 24/03/2020 10:43, Marc Zyngier wrote:

On Tue, 24 Mar 2020 09:18:10 +
John Garry  wrote:


On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,


Time to enable pseudo-NMIs in the PMUv3 driver...


Do you know if there is any plan for this?
There was. Julien Thierry has a bunch of patches for that [1], but 
they need reviving.


So those patches still apply cleanly (apart from the kvm patch, which
I can skip, I suppose) and build, so I figure I can try this. Is
there anything else which I should ensure or know about? Apart from
enabling CONFIG_ARM64_PSEUDO_NMI.

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
has it set to 0, preventing me from being able to use the feature
(hint, nudge...;-).


Yeah, apparently it's set on our D06CS board, but I just need to 
double check the FW version with our FW guy.


Hopefully you saw the help for CONFIG_ARM64_PSEUDO_NMI already, but 
since it's not been called out:


   This high priority configuration for interrupts needs to be
   explicitly enabled by setting the kernel parameter
   "irqchip.gicv3_pseudo_nmi" to 1.


Yeah, I saw that by chance somewhere else previously.



FWIW I believe it is still on the plan for someone here to dust off the 
PMU pNMI patches at some point.


Cool. Well I can try to experiment with what Julien had at v4 for now.

Cheers,
John

Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-25 Thread John Garry




FWIW I believe it is still on the plan for someone here to dust off 
the PMU pNMI patches at some point.


Cool. Well I can try to experiment with what Julien had at v4 for now.



JFYI, I have done some more perf record capturing, and updated the 
"annotate" and "report" output here 
https://raw.githubusercontent.com/hisilicon/kernel-dev/679eca1008b1d11b42e1b5fa8a205266c240d1e1/ann.txt 
and .../report


This capture is just for cpu0, since NVMe irq handling+dma unmapping 
will occur on specific CPUs, cpu0 being one of them.


The reports look somewhat sane. So we no longer have ~99% of time 
attributed to re-enabling interrupts, now that's like:



    3.14 :   80001071eae0:   ldr     w0, [x29, #108]
         :      int ret = 0;
    0.00 :   80001071eae4:   mov     w24, #0x0                   // #0
         :      if (sync) {
    0.00 :   80001071eae8:   cbnz    w0, 80001071eb44
         :      arch_local_irq_restore():
         :      asm volatile(ALTERNATIVE(
    0.00 :   80001071eaec:   msr     daif, x21
         :      arch_static_branch():
    0.25 :   80001071eaf0:   nop
         :      arm_smmu_cmdq_issue_cmdlist():
         :              }
         :      }
         :
         :      local_irq_restore(flags);
         :      return ret;
         :      }
One observation (if these reports are to be believed) is that we may 
spend a lot of time in the CAS loop, trying to get a place in the queue 
initially:


 :  __CMPXCHG_CASE(w,  , , 32,   )
 :  __CMPXCHG_CASE(x,  , , 64,   )
0.00 :   80001071e828:   mov x0, x27
0.00 :   80001071e82c:   mov x4, x1
0.00 :   80001071e830:   cas x4, x2, [x27]
   28.61 :   80001071e834:   mov x0, x4
 :  arm_smmu_cmdq_issue_cmdlist():
 :  if (old == llq.val)
0.00 :   80001071e838:   ldr x1, [x23]

John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-04-06 Thread John Garry

On 02/04/2020 13:10, John Garry wrote:

On 18/03/2020 20:53, Will Deacon wrote:
As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the 
testing our

batch size is 1, so we're not seeing the real benefit of the batching. I
can't help but think that we could improve this code to try to 
combine CMD

SYNCs for small batches.

Anyway, let me know your thoughts or any questions. I'll have a look 
if a

get a chance for other possible bottlenecks.

Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.





Hi Will,

JFYI, I added some debug in arm_smmu_cmdq_issue_cmdlist() to get some 
idea of what is going on. Perf annotate did not tell much.


I tested NVMe performance with and without Marc's patchset to spread 
LPIs for managed interrupts.


Average duration of arm_smmu_cmdq_issue_cmdlist() mainline [all results 
are approximations]:

owner: 6ms
non-owner: 4ms

mainline + LPI spreading patchset:
owner: 25ms
non-owner: 22ms

For this, a list would be an itlb+cmd_sync.

Please note that the LPI spreading patchset is still giving circa 25% 
NVMe throughput increase. What happens there would be that we get many 
more cpus involved, which creates more inter-cpu contention. But the 
performance increase comes from just alleviating pressure on those 
overloaded cpus.


I also notice that with the LPI spreading patchset, on average a cpu is 
an "owner" in arm_smmu_cmdq_issue_cmdlist() 1 in 8, as opposed to 1 in 3 
for mainline. This means that we're just creating longer chains of lists 
to be published.


But I found that for a non-owner, average msi cmd_sync polling time is 
12ms with the LPI spreading patchset. As such, it seems to be really 
taking approx (12 * 2 / (8 - 1) =) ~3ms to consume a single list. This seems 
consistent with my finding that an owner polls consumption for 3ms also. 
Without the LPI spreading patchset, polling time is approx 2 and 3ms for 
owner and non-owner, respectively.


As an experiment, I did try to hack the code to use a spinlock again for 
protecting the command queue, instead of current solution - and always 
saw a performance drop there. To be expected. But maybe we can try to 
not use a spinlock, but still serialise production+consumption to 
alleviate the long polling periods.


Let me know your thoughts.

Cheers,
John



Re: iommu_iova slab eats too much memory

2020-04-24 Thread John Garry

On 24/04/2020 17:30, Robin Murphy wrote:

On 2020-04-24 2:20 pm, Bin wrote:

Dear Robin:
  Thank you for your explanation. Now, I understand that this could be
NIC driver's fault, but how could I confirm it? Do I have to debug the
driver myself?


I'd start with CONFIG_DMA_API_DEBUG - of course it will chew through
memory about an order of magnitude faster than the IOVAs alone, but it
should shed some light on whether DMA API usage looks suspicious, and
dumping the mappings should help track down the responsible driver(s).
Although the debugfs code doesn't show the stacktrace of where each
mapping was made, I guess it would be fairly simple to tweak that for a
quick way to narrow down where to start looking in an offending driver.

Robin.


Just mentioning this in case it's relevant - we found that a long-term aging 
throughput test causes the RB tree to grow very large (and, I assume, would 
eat lots of memory):


https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leiz...@huawei.com/

John




Robin Murphy wrote on Fri, 24 Apr 2020 at 8:15 PM:


On 2020-04-24 1:06 pm, Bin wrote:

I'm not familiar with the mmu stuff, so regarding what you mean by "some
driver leaking DMA mappings" - is it possible that some other kernel module,
like KVM or a NIC driver, leads to the leaking problem instead of the iommu
module itself?


Yes - I doubt that intel-iommu itself is failing to free IOVAs when it
should, since I'd expect a lot of people to have noticed that. It's far
more likely that some driver is failing to call dma_unmap_* when it's
finished with a buffer - with the IOMMU disabled that would be a no-op
on x86 with a modern 64-bit-capable device, so such a latent bug could
have been easily overlooked.

Robin.


Bin wrote on Fri, 24 Apr 2020 at 20:00:


Well, that's the problem! I'm assuming the iommu kernel module is leaking
memory. But I don't know why and how.

Do you have any idea about it? Or is any further information needed?

Robin Murphy wrote on Fri, 24 Apr 2020 at 19:20:


On 2020-04-24 1:40 am, Bin wrote:

Hello? anyone there?

Bin wrote on Thu, 23 Apr 2020 at 5:14 PM:


Forgot to mention: I've already disabled slab merging, so this is what
it is.

Bin wrote on Thu, 23 Apr 2020 at 5:11 PM:


Hey, guys:

I'm running a batch of CoreOS boxes, the lsb_release is:

```
# cat /etc/lsb-release
DISTRIB_ID="Container Linux by CoreOS"
DISTRIB_RELEASE=2303.3.0
DISTRIB_CODENAME="Rhyolite"
DISTRIB_DESCRIPTION="Container Linux by CoreOS 2303.3.0 (Rhyolite)"
```

```
# uname -a
Linux cloud-worker-25 4.19.86-coreos #1 SMP Mon Dec 2 20:13:38 -00 2019 x86_64 Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz GenuineIntel GNU/Linux

```
Recently, I found my VMs constantly being killed due to OOM, and after
digging into the problem, I finally realized that the kernel is leaking
memory.

Here's my slabinfo:

 Active / Total Objects (% used): 83818306 / 84191607 (99.6%)
 Active / Total Slabs (% used)  : 1336293 / 1336293 (100.0%)
 Active / Total Caches (% used) : 152 / 217 (70.0%)
 Active / Total Size (% used)   : 5828768.08K / 5996848.72K (97.2%)

 Minimum / Average / Maximum Object : 0.01K / 0.07K / 23.25K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME

80253888 80253888 100%    0.06K 1253967   64    5015868K iommu_iova


Do you really have a peak demand of ~80 million simultaneous DMA
buffers, or is some driver leaking DMA mappings?

Robin.


 489472 489123  99%    0.03K   3824  128     15296K kmalloc-32
 297444 271112  91%    0.19K   7082   42     56656K dentry
 254400 252784  99%    0.06K   3975   64     15900K anon_vma_chain
 222528  39255  17%    0.50K   6954   32    111264K kmalloc-512
 202482 201814  99%    0.19K   4821   42     38568K vm_area_struct
 200192 200192 100%    0.01K    391  512      1564K kmalloc-8
 170528 169359  99%    0.25K   5329   32     42632K filp
 158144 153508  97%    0.06K   2471   64      9884K kmalloc-64
 149914 149365  99%    0.09K   3259   46     13036K anon_vma
 146640 143123  97%    0.10K   3760   39     15040K buffer_head
 130368  32791  25%    0.09K   3104   42     12416K kmalloc-96
 129752 129752 100%    0.07K   2317   56      9268K Acpi-Operand
 105468 105106  99%    0.04K   1034  102      4136K selinux_inode_security
  73080  73080 100%    0.13K   2436   30      9744K kernfs_node_cache
  72360  70261  97%    0.59K   1340   54     42880K inode_cache
  71040  71040 100%    0.12K   2220   32      8880K eventpoll_epi
  68096  59262  87%    0.02K    266  256      1064K kmalloc-16
  53652  53652 100%    0.04K    526  102      2104K pde_opener
  50496  31654  62%    2.00K   3156   16    100992K kmalloc-2048
  46242  46242 100%    0.19K   1101   42      8808K cred_jar
  44496  43013  96%    0.66K    927   48     29664K proc_inode_cache
  44352  44352 100%    0.06K    693   64      2772K task_delay_info
  43516  43471  99%    0.69K    946   46      3027

Re: arm-smmu-v3 high cpu usage for NVMe

2020-05-22 Thread John Garry

On 20/03/2020 10:41, John Garry wrote:

+ Barry, Alexandru

    PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 
0/34434 drop:

0/40116 [4000Hz cycles],  (all, 96 CPUs)
-- 



  27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
  11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.35%  [kernel]  [k] _raw_spin_unlock_irq
   2.65%  [kernel]  [k] get_user_pages_fast
   2.03%  [kernel]  [k] __slab_free
   1.55%  [kernel]  [k] tick_nohz_idle_exit
   1.47%  [kernel]  [k] arm_lpae_map
   1.39%  [kernel]  [k] __fget
   1.14%  [kernel]  [k] __lock_text_start
   1.09%  [kernel]  [k] _raw_spin_lock
   1.08%  [kernel]  [k] bio_release_pages.part.42
   1.03%  [kernel]  [k] __sbitmap_get_word
   0.97%  [kernel]  [k] 
arm_smmu_atc_inv_domain.constprop.42

   0.91%  [kernel]  [k] fput_many
   0.88%  [kernel]  [k] __arm_lpae_map



Hi Will, Robin,

I'm just getting around to look at this topic again. Here's the current 
picture for my NVMe test:


perf top -C 0 *
Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
Overhead Shared Object Symbol
75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
3.28% [kernel] [k] arm_smmu_tlb_inv_range
2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
2.35% [kernel] [k] _raw_spin_unlock_irqrestore
1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
1.20% [kernel] [k] aio_complete_rw
0.96% [kernel] [k] enqueue_task_fair
0.93% [kernel] [k] gic_handle_irq
0.86% [kernel] [k] _raw_spin_lock_irqsave
0.72% [kernel] [k] put_reqs_available
0.72% [kernel] [k] sbitmap_queue_clear

* only certain CPUs run the dma unmap for my scenario, cpu0 being one of 
them.


Colleague Barry has similar findings for some other scenarios.

So we tried the latest perf NMI support wip patches, and noticed a few 
hotspots (see 
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate 
and 
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt) 
when running some NVMe traffic:


- initial cmpxchg to get a place in the queue
  - when more CPUs get involved, we start failing at an exponential rate

     0.00 :   8000107a3500:   cas     x4, x2, [x27]
    26.52 :   8000107a3504:   mov     x0, x4    : arm_smmu_cmdq_issue_cmdlist():


- the queue locking
- polling cmd_sync
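
As an aside, the exponential-failure claim is easy to reproduce in a
standalone C program (my own illustration, nothing from the driver), where
N threads contend on a single cmpxchg-updated word:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define OPS_PER_THREAD 100000

static _Atomic unsigned long shared;
static _Atomic unsigned long failures;

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < OPS_PER_THREAD; i++) {
		unsigned long old = atomic_load(&shared);

		/* retry loop, analogous to allocating cmdq space */
		while (!atomic_compare_exchange_weak(&shared, &old, old + 1))
			atomic_fetch_add(&failures, 1);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 4;
	pthread_t threads[nthreads];

	for (int i = 0; i < nthreads; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(threads[i], NULL);

	/* build with: cc -O2 -pthread cas.c; the failure count grows
	 * super-linearly as the thread count rises
	 */
	printf("%d threads: %lu ops, %lu CAS failures\n", nthreads,
	       (unsigned long)nthreads * OPS_PER_THREAD,
	       atomic_load(&failures));
	return 0;
}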

Some ideas to optimise:

a. initial cmpxchg
So this cmpxchg could be considered unfair. In addition, with all the 
contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged 
around the system.
Maybe we can implement something similar to the idea of queued/ticketed 
spinlocks, making a CPU spin on its own copy of arm_smmu_cmdq.q after the 
initial cmpxchg fails, released by its leader, which in turn releases 
subsequent followers (see the sketch after this list).


b. Drop the queue_full checking in certain circumstances
If we cannot theoretically fill the queue, then stop checking for the 
queue being full or similar. This should also help the problem in a., as 
the less time between cmpxchg attempts, the less chance of failing (we 
check queue available space between cmpxchg attempts).


So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we 
always issue a cmd_sync for a batch (regardless of whether requested), 
then we should never fill (I think). For example, with 96 CPUs and a max 
batch of 64 commands plus its CMD_SYNC, a cmdq deeper than 96 * 65 = 6240 
entries could never fill.


c. Don't do queue locking in certain circumstances
If we implement (and support) b. and support MSI polling, then I don't 
think that this is required.


d. More minor ideas are to move forward when the "owner" stops gathering 
to reduce time of advancing the prod, hopefully reducing cmd_sync 
polling time; and also use a smaller word size for the valid bitmap 
operations, maybe 32b atomic operations are overall more efficient (than 
64b) - mostly valid range check is < 16 bits from my observation.
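
For idea a., a minimal userspace sketch of an MCS-style handoff - each
waiter spins on its own node instead of the shared prod word - purely
illustrative under those assumptions, not a drop-in for the cmdq:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	struct mcs_node *_Atomic next;
	atomic_bool locked;
};

static struct mcs_node *_Atomic tail;

static void mcs_lock(struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store(&node->next, NULL);
	atomic_store(&node->locked, true);
	/* one unconditional swap replaces the cmpxchg() retry loop */
	prev = atomic_exchange(&tail, node);
	if (prev) {
		atomic_store(&prev->next, node);
		while (atomic_load(&node->locked))
			;	/* spin on our own node, not the shared word */
	}
}

static void mcs_unlock(struct mcs_node *node)
{
	struct mcs_node *next = atomic_load(&node->next);

	if (!next) {
		struct mcs_node *expected = node;

		if (atomic_compare_exchange_strong(&tail, &expected, NULL))
			return;	/* no successor queued */
		while (!(next = atomic_load(&node->next)))
			;	/* successor is mid-enqueue; wait for it */
	}
	atomic_store(&next->locked, false);	/* hand the lock off */
}

int main(void)
{
	struct mcs_node me;

	mcs_lock(&me);
	/* critical section: e.g. write cmdq entries and publish prod */
	mcs_unlock(&me);
	return 0;
}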


Let me know your thoughts or any other ideas.

Thanks,
John


[PATCH RFC 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency

2020-06-01 Thread John Garry
As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
get space on the queue takes approx 25% of the cycles for this function.

The cmpxchg() is removed as follows:
- We assume that the cmdq can never fill with changes to limit the
  batch size (where necessary) and always issue a CMD_SYNC for a batch
  We need to do this since we no longer maintain the cons value in
  software, and we cannot deal with no available space properly.
- Replace cmpxchg() with atomic inc operation, to maintain the prod
  and owner values.

Early experiments have shown that we may see a 25% boost in throughput
IOPS for my NVMe test with these changes. And some CPUs, which were
loaded at ~55%, now see a ~45% load.

So, even though the changes are incomplete and other parts of the driver
will need fixing up (and it looks maybe broken for !MSI support), the
performance boost seen would seem to be worth the effort of exploring
this.

Comments requested please.

Thanks

[0] 
https://lore.kernel.org/linux-iommu/b926444035e5e2439431908e3842afd24b8...@dggemi525-mbs.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3

John Garry (2):
  iommu/arm-smmu-v3: Calculate bits for prod and owner
  iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

 drivers/iommu/arm-smmu-v3.c | 92 +++--
 1 file changed, 47 insertions(+), 45 deletions(-)

-- 
2.16.4



[PATCH RFC 1/2] iommu/arm-smmu-v3: Calculate bits for prod and owner

2020-06-01 Thread John Garry
Since arm_smmu_ll_queue.prod will be used for counting the "owner" value
and also for the HW prod pointer, calculate how many bits are available
for, and used by, each.

This is based on the number of possible CPUs in the system. And we require
that each CPU can issue a minimum of 2 commands per batch - 1 x CMD_SYNC
and at least 1 x other.

Ignoring limits of HW max_n_shift and HW cmdq memory allocation, approx 16K
is the max supported CPUs. For this, max_n_shift would be 15.
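
A quick userspace check of that bit budget (assuming a 32-bit prod word
with 1 overflow bit and 1 wrap bit reserved, as in the code below):

#include <stdio.h>

/* integer log2, matching the kernel's ilog2() for non-zero input */
static int ilog2_u32(unsigned int v)
{
	int r = -1;

	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

int main(void)
{
	const unsigned int cpu_counts[] = { 96, 1024, 16384 };

	for (int i = 0; i < 3; i++) {
		unsigned int cpus = cpu_counts[i];
		/* +1 to cpus: wrap+prod may overflow before the owner zeroes */
		int bits_for_cmdq_owner = ilog2_u32(cpus + 1) + 1;
		/* 1 bit for overflow, 1 bit for wrap */
		int bits_available_for_prod = 32 - 2 - bits_for_cmdq_owner;

		printf("%5u CPUs: %2d owner bits, max_n_shift <= %2d (queue must hold >= %u entries)\n",
		       cpus, bits_for_cmdq_owner, bits_available_for_prod,
		       2 * cpus);
	}
	return 0;
}

For 16384 CPUs this prints 15 owner bits and max_n_shift <= 15, matching
the commit message above.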

Signed-off-by: John Garry 
---
 drivers/iommu/arm-smmu-v3.c | 34 +-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 82508730feb7..e607ab5a4cbd 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -530,6 +530,8 @@ struct arm_smmu_ll_queue {
u8  __pad[SMP_CACHE_BYTES];
} cacheline_aligned_in_smp;
u32 max_n_shift;
+   u32 max_cmd_per_batch;
+   u32 owner_count_shift;
 };
 
 struct arm_smmu_queue {
@@ -1512,7 +1514,10 @@ static void arm_smmu_cmdq_batch_add(struct 
arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds,
struct arm_smmu_cmdq_ent *cmd)
 {
-   if (cmds->num == CMDQ_BATCH_ENTRIES) {
+   struct arm_smmu_cmdq *q = &smmu->cmdq;
+   struct arm_smmu_ll_queue *llq = &q->q.llq;
+
+   if (cmds->num == llq->max_cmd_per_batch) {
arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
cmds->num = 0;
}
@@ -3156,8 +3161,24 @@ static int arm_smmu_init_one_queue(struct 
arm_smmu_device *smmu,
   unsigned long cons_off,
   size_t dwords, const char *name)
 {
+   int cpus = num_possible_cpus();
size_t qsz;
 
+   /*
+* We can get the number of bits required for owner counting by
+* log2(nr possible cpus) + 1, but we have to take into account that the
+* wrap+prod could overflow before the owner zeroes, so add 1
+* more (to cpus) for bits_for_cmdq_owner calculation.
+*/
+   int bits_for_cmdq_owner = ilog2(cpus + 1) + 1;
+   /* 1-bit for overflow, 1-bit for wrap */
+   int bits_available_for_prod = 32 - 2 - bits_for_cmdq_owner;
+   int entries_for_prod;
+
+   if (bits_available_for_prod < 1) /* this would be insane - how many 
CPUs? */
+   return -ENOMEM;
+
+   q->llq.max_n_shift = min_t(int, q->llq.max_n_shift, 
bits_available_for_prod);
do {
qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
@@ -3167,6 +3188,17 @@ static int arm_smmu_init_one_queue(struct 
arm_smmu_device *smmu,
 
q->llq.max_n_shift--;
} while (1);
+   entries_for_prod = 1 << q->llq.max_n_shift;
+
+   /*
+* We need at least 2 commands in a batch (1 x CMD_SYNC and 1 x 
whatever else).
+* Assuming orig max_n_shift >= 17, this would mean ~16K CPUs max.
+*/
+   if (entries_for_prod < 2 * cpus)
+   return -ENOMEM;
+
+   q->llq.max_cmd_per_batch = min_t(u32, entries_for_prod / cpus, 
CMDQ_BATCH_ENTRIES);
+   q->llq.owner_count_shift = q->llq.max_n_shift + 1;
 
if (!q->base) {
dev_err(smmu->dev,
-- 
2.16.4



[PATCH RFC 2/2] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

2020-06-01 Thread John Garry
It has been shown that the cmpxchg() for finding space in the cmdq can
be a bottleneck:
- For more CPUs contenting the cmdq, cmpxchg() will fail more often.
- Since the software-maintained cons pointer is updated on the same 64b
  memory region, the chance of cmpxchg() failure increases again.

The cmpxchg() is removed as part of 2 related changes:
- If a CMD_SYNC is always issued for a batch, the cmdq can - in theory -
  never fill, since we always wait for a CMD_SYNC to be consumed. We must
  issue the CMD_SYNC so that a CPU will be always limited to issuing
  max_cmd_per_batch commands. Generally for DMA unmap ops, a CMD_SYNC
  is always issued anyway.
  As such, the cmdq locking is removed, and we no longer maintain cons
  in software (this needs to be revised for !MSI support).
- Update prod and cmdq owner in a single operation. For this, we count the
  prod and owner in separate regions in arm_smmu_ll_queue.prod.
  As with simple binary counting, once the prod+wrap fields overflow, they
  will zero. They will overflow into "owner" region, but this is ok as we
  have accounted for the extra value.
  As for the "owner", we now count this value, instead of setting a flag.
  Similar to before, once the owner has finished gathering, it will clear
  this mask. As such, a CPU declares itself as the "owner" when it reads
  zero for this field. This zeroing will also clear possible overflow in
  wrap+prod region.
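
A hedged userspace model of that packed prod+owner counter (the layout
constants here are illustrative, not the driver's):

#include <stdatomic.h>
#include <stdio.h>

#define MAX_N_SHIFT	8			/* log2(queue depth); illustrative */
#define OWNER_SHIFT	(MAX_N_SHIFT + 1)	/* owner bits sit above prod+wrap */
#define OWNER_VAL	(1u << OWNER_SHIFT)
#define PROD_MASK	((1u << (MAX_N_SHIFT + 1)) - 1)		/* prod + wrap */
#define OWNER_MASK	(((1u << (30 - OWNER_SHIFT + 1)) - 1) << OWNER_SHIFT)

static _Atomic unsigned int llq_prod;

/* returns 1 if the caller became the owner; *prod gets its insertion point */
static int allocate_space(unsigned int n_cmds, unsigned int *prod)
{
	/* a single fetch-add claims queue space and bumps the owner count */
	unsigned int prodx = atomic_fetch_add(&llq_prod, n_cmds + OWNER_VAL);

	*prod = prodx & PROD_MASK;
	return !(prodx & OWNER_MASK);	/* owner field read as zero => owner */
}

int main(void)
{
	unsigned int prod;

	printf("first:  owner=%d prod=%u\n", allocate_space(4, &prod), prod);
	printf("second: owner=%d prod=%u\n", allocate_space(2, &prod), prod);
	return 0;
}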

Signed-off-by: John Garry 
---
 drivers/iommu/arm-smmu-v3.c | 58 +++--
 1 file changed, 14 insertions(+), 44 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index e607ab5a4cbd..d6c7d82f9cf8 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1375,7 +1375,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct 
arm_smmu_device *smmu,
   u64 *cmds, int n, bool sync)
 {
u64 cmd_sync[CMDQ_ENT_DWORDS];
-   u32 prod;
+   u32 prod, prodx;
unsigned long flags;
bool owner;
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
@@ -1383,33 +1383,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct 
arm_smmu_device *smmu,
.max_n_shift = cmdq->q.llq.max_n_shift,
}, head = llq;
int ret = 0;
+   u32 owner_val = 1 << cmdq->q.llq.owner_count_shift;
+   u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
+   u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift);
+
+   /* always issue a CMD_SYNC TODO: fixup callers for this */
+   sync = true;
 
/* 1. Allocate some space in the queue */
local_irq_save(flags);
-   llq.val = READ_ONCE(cmdq->q.llq.val);
-   do {
-   u64 old;
-
-   while (!queue_has_space(&llq, n + sync)) {
-   local_irq_restore(flags);
-   if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
-   dev_err_ratelimited(smmu->dev, "CMDQ 
timeout\n");
-   local_irq_save(flags);
-   }
 
-   head.cons = llq.cons;
-   head.prod = queue_inc_prod_n(&llq, n + sync) |
-CMDQ_PROD_OWNED_FLAG;
+   prodx = atomic_fetch_add(n + sync + owner_val, 
&cmdq->q.llq.atomic.prod);
 
-   old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-   if (old == llq.val)
-   break;
-
-   llq.val = old;
-   } while (1);
-   owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
-   head.prod &= ~CMDQ_PROD_OWNED_FLAG;
-   llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
+   owner = !(prodx & owner_mask);
+   llq.prod = prod_mask & prodx;
+   head.prod = queue_inc_prod_n(&llq, n + sync);
 
/*
 * 2. Write our commands into the queue
@@ -1420,14 +1408,6 @@ static int arm_smmu_cmdq_issue_cmdlist(struct 
arm_smmu_device *smmu,
prod = queue_inc_prod_n(&llq, n);
arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
-
-   /*
-* In order to determine completion of our CMD_SYNC, we must
-* ensure that the queue can't wrap twice without us noticing.
-* We achieve that by taking the cmdq lock as shared before
-* marking our slot as valid.
-*/
-   arm_smmu_cmdq_shared_lock(cmdq);
}
 
/* 3. Mark our slots as valid, ensuring commands are visible first */
@@

Re: arm64 iommu groups issue

2020-06-15 Thread John Garry

On 12/06/2020 15:30, Lorenzo Pieralisi wrote:

On Mon, Feb 17, 2020 at 12:08:48PM +, John Garry wrote:


Right, and even worse is that it relies on the port driver even
existing at all.

All this iommu group assignment should be taken outside device
driver probe paths.

However we could still consider device links for sync'ing the SMMU
and each device probing.


Yes, we should get that for DT now thanks to the of_devlink stuff, but
cooking up some equivalent for IORT might be worthwhile.


It doesn't solve this problem, but at least we could remove the iommu_ops
check in iort_iommu_xlate().

We would need to carve out a path from pci_device_add() or even device_add()
to solve all cases.




Another thought that crosses my mind is that when pci_device_group()
walks up to the point of ACS isolation and doesn't find an existing
group, it can still infer that everything it walked past *should* be put
in the same group it's then eventually going to return. Unfortunately I
can't see an obvious way for it to act on that knowledge, though, since
recursive iommu_probe_device() is unlikely to end well.




[...]


And this looks to be the reason why the current
iommu_bus_init()->bus_for_each_device(..., add_iommu_group) approach
also fails.


Of course, just adding a 'correct' add_device replay without the
of_xlate process doesn't help at all. No wonder this looked suspiciously
simpler than where the first idea left off...

(on reflection, the core of this idea seems to be recycling the existing
iommu_bus_init walk rather than building up a separate "waiting list",
while forgetting that that wasn't the difficult part of the original
idea anyway)


We could still use a bus walk to add the group per iommu, but we would need
an additional check to ensure the device is associated with the IOMMU.




Regarding the current code mentioned, the principle seems wrong to
me - we call bus_for_each_device(..., add_iommu_group) for the first
SMMU in the system which probes, but we attempt add_iommu_group()
for all devices on the bus, even though the SMMU for a given device
may have yet to probe.


Yes, iommu_bus_init() is one of the places still holding a
deeply-ingrained assumption that the ops go live for all IOMMU instances
at once, which is what warranted the further replay in
of_iommu_configure() originally. Moving that out of
of_platform_device_create() to support probe deferral is where the
trouble really started.


I'm not too familiar with the history here, but could this be reverted now
with the introduction of of_devlink stuff?


Hi John,


Hi Lorenzo,



have we managed to reach a consensus on this thread on how to solve
the issue ? 


No, not really. So Robin and I tried a couple of quick things 
previously, but they did not come to much, as above.



Asking because this thread seems stalled - I am keen on
getting it fixed.


I haven't spent more time on this. But from what I was hearing last 
time, this issue was ticketed internally at Arm, so I was waiting for 
that to be picked up to re-engage.


Thanks,
John


[PATCH RFC v2 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency

2020-06-16 Thread John Garry
As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
get space on the queue takes approx 25% of the cycles for this function.

This series removes that cmpxchg().

For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
increase:
Before: 1310 IOPs
After: 1630 IOPs

I also have a test harness to check the rate of DMA map+unmaps we can
achieve:

CPU count   32  64  128
Before: 63187   19418   10169
After:  93287   44789   15862

(unit is map+unmaps per CPU per second)

There's no specific problem that I know of with this series, as previous
issues should now be fixed - but I'm a bit nervous about how we deal with
the queue being full and wrapping.

And I want to test more.

Thanks

[0] 
https://lore.kernel.org/linux-iommu/b926444035e5e2439431908e3842afd24b8...@dggemi525-mbs.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3

John Garry (4):
  iommu/arm-smmu-v3: Fix trivial typo
  iommu/arm-smmu-v3: Calculate bits for prod and owner
  iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
  iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

 drivers/iommu/arm-smmu-v3.c | 210 ++--
 1 file changed, 131 insertions(+), 79 deletions(-)

-- 
2.26.2



[PATCH v2 1/4] iommu/arm-smmu-v3: Fix trivial typo

2020-06-16 Thread John Garry
Set "cmq" -> "cmdq".

Signed-off-by: John Garry 
---
 drivers/iommu/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..a8e814c652fe 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1479,7 +1479,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct 
arm_smmu_device *smmu,
}
 
/*
-* Try to unlock the cmq lock. This will fail if we're the last
+* Try to unlock the cmdq lock. This will fail if we're the last
 * reader, in which case we can safely update cmdq->q.llq.cons
 */
if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
-- 
2.26.2



[PATCH RFC v2 3/4] iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch

2020-06-16 Thread John Garry
To ensure that a CPU does not send more than a permitted number of commands
to the cmdq, ensure that each batch includes a CMD_SYNC. When issuing a
CMD_SYNC, we always wait for the consumption of its batch of commands - as
such, we guarantee that any CPU will not issue more than its permitted
number.

Signed-off-by: John Garry 
---
 drivers/iommu/arm-smmu-v3.c | 87 +
 1 file changed, 40 insertions(+), 47 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index c3562dc35d45..36648163364c 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1373,11 +1373,15 @@ static void arm_smmu_cmdq_write_entries(struct 
arm_smmu_cmdq *cmdq, u64 *cmds,
  * - Command insertion is totally ordered, so if two CPUs each race to
  *   insert their own list of commands then all of the commands from one
  *   CPU will appear before any of the commands from the other CPU.
+ *
+ * - A CMD_SYNC is always inserted, ensuring that any CPU does not issue
+ *   more than the permitted number of commands at once.
  */
 static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
-  u64 *cmds, int n, bool sync)
+  u64 *cmds, int n)
 {
u64 cmd_sync[CMDQ_ENT_DWORDS];
+   const int sync = 1;
u32 prod;
unsigned long flags;
bool owner;
@@ -1419,19 +1423,18 @@ static int arm_smmu_cmdq_issue_cmdlist(struct 
arm_smmu_device *smmu,
 * Dependency ordering from the cmpxchg() loop above.
 */
arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
-   if (sync) {
-   prod = queue_inc_prod_n(&llq, n);
-   arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
-   queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
 
-   /*
-* In order to determine completion of our CMD_SYNC, we must
-* ensure that the queue can't wrap twice without us noticing.
-* We achieve that by taking the cmdq lock as shared before
-* marking our slot as valid.
-*/
-   arm_smmu_cmdq_shared_lock(cmdq);
-   }
+   prod = queue_inc_prod_n(&llq, n);
+   arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
+   queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
+
+   /*
+* In order to determine completion of our CMD_SYNC, we must
+* ensure that the queue can't wrap twice without us noticing.
+* We achieve that by taking the cmdq lock as shared before
+* marking our slot as valid.
+*/
+   arm_smmu_cmdq_shared_lock(cmdq);
 
/* 3. Mark our slots as valid, ensuring commands are visible first */
dma_wmb();
@@ -1468,26 +1471,22 @@ static int arm_smmu_cmdq_issue_cmdlist(struct 
arm_smmu_device *smmu,
atomic_set_release(&cmdq->owner_prod, prod);
}
 
-   /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
-   if (sync) {
-   llq.prod = queue_inc_prod_n(&llq, n);
-   ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
-   if (ret) {
-   dev_err_ratelimited(smmu->dev,
-   "CMD_SYNC timeout at 0x%08x [hwprod 
0x%08x, hwcons 0x%08x]\n",
-   llq.prod,
-   readl_relaxed(cmdq->q.prod_reg),
-   readl_relaxed(cmdq->q.cons_reg));
-   }
+   /* 5. Since we always insert a CMD_SYNC, we must wait for it to 
complete */
+   llq.prod = queue_inc_prod_n(&llq, n);
+   ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
+   if (ret) {
+   dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x 
[hwprod 0x%08x, hwcons 0x%08x]\n",
+   llq.prod, readl_relaxed(cmdq->q.prod_reg),
+   readl_relaxed(cmdq->q.cons_reg));
+   }
 
-   /*
-* Try to unlock the cmdq lock. This will fail if we're the last
-* reader, in which case we can safely update cmdq->q.llq.cons
-*/
-   if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
-   WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
-   arm_smmu_cmdq_shared_unlock(cmdq);
-   }
+   /*
+* Try to unlock the cmdq lock. This will fail if we're the last reader,
+* in which case we can safely update cmdq->q.llq.cons
+*/
+   if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
+   WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
+   arm_smmu_cmdq_shared_unlock(cmdq);
}
 
  

[PATCH RFC v2 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

2020-06-16 Thread John Garry
It has been shown that the cmpxchg() for finding space in the cmdq can
be a real bottleneck:
- for more CPUs contending the cmdq, the cmpxchg() will fail more often
- since the software-maintained cons pointer is updated on the same 64b
  memory region, the chance of cmpxchg() failure increases again

The cmpxchg() is removed as part of 2 related changes:

- Update prod and cmdq owner in a single atomic add operation. For this, we
  count the prod and owner in separate regions in prod memory.

  As with simple binary counting, once the prod+wrap fields overflow, they
  will zero. They should never overflow into the "owner" region, and we zero
  the non-owner, prod region for each owner. This maintains the prod
  pointer.

  As for the "owner", we now count this value, instead of setting a flag.
  Similar to before, once the owner has finished gathering, it will clear
  a mask. As such, a CPU declares itself as the "owner" when it reads zero
  for this region. This zeroing will also clear possible overflow in the
  wrap+prod region, above.

  The owner is now responsible for cmdq locking to avoid possible deadlock.
  The owner will lock the cmdq for all non-owners it has gathered when they
  have space in the queue and have written their entries.

- Check for space in the cmdq after the prod pointer has been assigned.

  We don't bother checking for space in the cmdq before assigning the prod
  pointer, as this would be racy.

  So since the prod pointer is updated unconditionally, it would be common
  for no space to be available in the cmdq when prod is assigned - that
  is, according to the software-maintained prod and cons pointers. So now
  it must be ensured that the entries are not written until there is
  space.

  How the prod pointer is maintained also leads to a strange condition
  where the prod pointer can wrap past the cons pointer. We can detect this
  condition, and report no space here. However, a prod pointer progressed
  twice past the cons pointer cannot be detected. But it can be ensured
  that this scenario does not occur, as we limit the amount of commands
  any CPU can issue at any given time, such that the prod pointer cannot
  progress that far.
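
A minimal sketch of the packed prod+owner counter (field positions here
are illustrative only; the patch derives them at runtime from max_n_shift
and owner_count_shift):

/*
 * Example 32b layout: the low bits hold prod+wrap, a couple of guard
 * bits absorb overflow, and the top bits count owner candidates.
 */
#define OWNER_SHIFT	16

static u32 claim_slots(atomic_t *packed, int n)
{
	/*
	 * One atomic add advances prod by n and registers us as an owner
	 * candidate; if the owner field of the returned (old) value is
	 * zero, we are the owner for this gathering round.
	 */
	return atomic_fetch_add(n + (1U << OWNER_SHIFT), packed);
}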

Signed-off-by: John Garry 
---
 drivers/iommu/arm-smmu-v3.c | 102 +---
 1 file changed, 61 insertions(+), 41 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 36648163364c..fd42a98f501f 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -772,10 +772,19 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, 
u32 n)
prod = Q_IDX(q, q->prod);
cons = Q_IDX(q, q->cons);
 
-   if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
+   if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) {
+   /* check if we have wrapped twice, meaning definitely no space 
*/
+   if (cons > prod)
+   return false;
+
space = (1 << q->max_n_shift) - (prod - cons);
-   else
+   } else {
+   /* similar check to above */
+   if (prod > cons)
+   return false;
+
space = cons - prod;
+   }
 
return space >= n;
 }
@@ -1073,7 +1082,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device 
*smmu)
  *   fails if the caller appears to be the last lock holder (yes, this is
  *   racy). All successful UNLOCK routines have RELEASE semantics.
  */
-static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
+static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq, int count)
 {
int val;
 
@@ -1083,12 +1092,12 @@ static void arm_smmu_cmdq_shared_lock(struct 
arm_smmu_cmdq *cmdq)
 * to INT_MIN so these increments won't hurt as the value will remain
 * negative.
 */
-   if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
+   if (atomic_fetch_add_relaxed(count, &cmdq->lock) >= 0)
return;
 
do {
val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
-   } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+   } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + count) != val);
 }
 
 static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -1374,8 +1383,10 @@ static void arm_smmu_cmdq_write_entries(struct 
arm_smmu_cmdq *cmdq, u64 *cmds,
  *   insert their own list of commands then all of the commands from one
  *   CPU will appear before any of the commands from the other CPU.
  *
- * - A CMD_SYNC is always inserted, ensuring that any CPU does not issue
- *   more than the permitted amount commands at once.
+ * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer
+ *   for when the cmdq is full, such that we don't wrap more than twice.
+ *   It also m

[PATCH RFC v2 2/4] iommu/arm-smmu-v3: Calculate bits for prod and owner

2020-06-16 Thread John Garry
Since arm_smmu_ll_queue.prod will hold both the "owner" count and the HW
prod pointer, calculate how many bits are available for, and used by,
each.

This is based on the number of possible CPUs in the system, and we
require that each CPU can issue a minimum of 2 commands per batch: 1 x
CMD_SYNC and at least 1 x other.

Ignoring the limits of HW max_n_shift and HW cmdq memory allocation,
approx. 16K is the maximum number of supported CPUs. For this,
max_n_shift would be 14.
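
A worked example of the bit budget (illustrative; 128 possible CPUs is an
assumed figure):

  bits_for_cmdq_owner     = ilog2(128) + 1 = 8
  bits_available_for_prod = 32 - 3 - 8     = 21
                            (3 = 1 overflow + 1 wrap + 1 guard bit)

At the ~16K CPU limit: ilog2(16384) + 1 = 15 owner bits, leaving
32 - 3 - 15 = 14 bits for prod, i.e. max_n_shift = 14.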

Signed-off-by: John Garry 
---
 drivers/iommu/arm-smmu-v3.c | 41 -
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index a8e814c652fe..c3562dc35d45 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -532,6 +532,8 @@ struct arm_smmu_ll_queue {
u8  __pad[SMP_CACHE_BYTES];
} cacheline_aligned_in_smp;
u32 max_n_shift;
+   u32 max_cmd_per_batch;
+   u32 owner_count_shift;
 };
 
 struct arm_smmu_queue {
@@ -1515,7 +1517,10 @@ static void arm_smmu_cmdq_batch_add(struct 
arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds,
struct arm_smmu_cmdq_ent *cmd)
 {
-   if (cmds->num == CMDQ_BATCH_ENTRIES) {
+   struct arm_smmu_cmdq *q = &smmu->cmdq;
+   struct arm_smmu_ll_queue *llq = &q->q.llq;
+
+   if (cmds->num == llq->max_cmd_per_batch) {
arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
cmds->num = 0;
}
@@ -3141,8 +3146,26 @@ static int arm_smmu_init_one_queue(struct 
arm_smmu_device *smmu,
   unsigned long cons_off,
   size_t dwords, const char *name)
 {
+   int cpus = num_possible_cpus();
size_t qsz;
 
+   /*
+* We can get the number of bits required for owner counting by
+* log2(nr possible cpus) + 1
+*/
+   int bits_for_cmdq_owner = ilog2(cpus) + 1;
+   /*
+* 1-bit for overflow, 1-bit for wrap, 1-bit extra to ensure prod+wrap
+* does not overflow into CPU count.
+*/
+   int bits_available_for_prod = 32 - 3 - bits_for_cmdq_owner;
+   int entries_for_prod;
+
+   if (bits_available_for_prod < 1) /* How many CPUs??? */
+   return -ENOMEM;
+
+   q->llq.max_n_shift = min_t(int, q->llq.max_n_shift,
+  bits_available_for_prod);
do {
qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
@@ -3152,6 +3175,22 @@ static int arm_smmu_init_one_queue(struct 
arm_smmu_device *smmu,
 
q->llq.max_n_shift--;
} while (1);
+   entries_for_prod = 1 << q->llq.max_n_shift;
+
+   /*
+* We need at least 2 commands in a batch (1 x CMD_SYNC and 1 x
+* whatever else).
+*/
+   if (entries_for_prod < 2 * cpus)
+   return -ENOMEM;
+
+   /*
+* When finding max_cmd_per_batch, deduct 1 entry per batch to take
+* account CMD_SYNC
+*/
+   q->llq.max_cmd_per_batch = min_t(u32, (entries_for_prod - cpus) / cpus,
+CMDQ_BATCH_ENTRIES);
+   q->llq.owner_count_shift = q->llq.max_n_shift + 2;
 
if (!q->base) {
dev_err(smmu->dev,
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 00/13] Rework IOMMU API to allow for batching of invalidation

2019-08-16 Thread John Garry

On 15/08/2019 14:55, Will Deacon wrote:

On Thu, Aug 15, 2019 at 12:19:58PM +0100, John Garry wrote:

On 14/08/2019 18:56, Will Deacon wrote:

If you'd like to play with the patches, then I've also pushed them here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/unmap

but they should behave as a no-op on their own.


As anticipated, my storage testing scenarios roughly give parity throughput
and CPU loading before and after this series.

Patches to convert the

Arm SMMUv3 driver to the new API are here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq


I quickly tested this again and now I see a performance lift:

                    before (5.3-rc1)   after
D05 8x SAS disks    907K IOPS          970K IOPS
D05 1x NVMe         450K IOPS          466K IOPS
D06 1x NVMe         467K IOPS          466K IOPS

The CPU loading seems to track throughput, so nothing much to say there.

Note: From 5.2 testing, I was seeing >900K IOPS from that NVMe disk for
!IOMMU.


Cheers, John. For interest, how do things look if you pass iommu.strict=0?
That might give some indication about how much the invalidation is still
hurting us.


So I tested for iommu/cmdq for NVMe only, and I see:

           !SMMU       5.3-rc4 strict/!strict   cmdq strict/!strict
D05 NVMe   750K IOPS   456K/540K IOPS           466K/537K IOPS
D06 NVMe   750K IOPS   456K/740K IOPS           466K/745K IOPS

I don't know why the D06 iommu.strict performance is about the same as 
D05's, while !strict is so much better. The D06 SMMU implementation is 
supposed to be generally much better than that of D05, so I would have 
thought that the strict performance would also be better.





BTW, what were your thoughts on changing
arm_smmu_atc_inv_domain()->arm_smmu_atc_inv_master() to batching? It seems
suitable, but looks untouched. Were you waiting for a resolution to the
performance issue which Leizhen reported?


In principle, I'm supportive of such a change, but I'm not currently able
to test any ATS stuff so somebody else would need to write the patch.
Jean-Philippe is on holiday at the moment, but I'd be happy to review
something from you if you send it out.


Unfortunately I don't have anything ATS-enabled either. Not many do, it 
seems.


Cheers,
John



Will

.






Re: [PATCH 4/4] iommu: Document usage of "/sys/kernel/iommu_groups//type" file

2019-08-21 Thread John Garry

On 21/08/2019 03:42, Sai Praneeth Prakhya wrote:

The default domain type of an iommu group can be changed using file
"/sys/kernel/iommu_groups//type". Hence, document it's usage and
more importantly spell out it's limitations and an example.

Cc: Christoph Hellwig 
Cc: Joerg Roedel 
Cc: Ashok Raj 
Cc: Will Deacon 
Cc: Lu Baolu 
Cc: Sohil Mehta 
Cc: Robin Murphy 
Cc: Jacob Pan 
Signed-off-by: Sai Praneeth Prakhya 
---
 .../ABI/testing/sysfs-kernel-iommu_groups | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups 
b/Documentation/ABI/testing/sysfs-kernel-iommu_groups
index 017f5bc3920c..0a70b3a66ff3 100644
--- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups
+++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups
@@ -33,3 +33,38 @@ Description:In case an RMRR is used only by graphics or 
USB devices
it is now exposed as "direct-relaxable" instead of "direct".
In device assignment use case, for instance, those RMRR
are considered to be relaxable and safe.
+
+What:  /sys/kernel/iommu_groups//type
+Date:  August 2019
+KernelVersion: v5.4
+Contact:   Sai Praneeth Prakhya 
+Description:   /sys/kernel/iommu_groups//type lets the user know the
+   type of default domain in use by iommu for this group. A
+   privileged user could request kernel to change the group type by
+   writing to this file. Presently, only three types are supported


It's unclear whether the following is a list of all values the user 
could both read and write (which it isn't).



+   1. dma: All the DMA transactions from the devices in this group
+   are translated by the iommu.


Why "dma", and not "DMA" (which is what we would read for DMA type)?


+   2. identity: All the DMA transactions from the devices in this
+group are *not* translated by the iommu.
+   3. auto: Kernel decides one among dma/identity


Isn't this the same as the reset value, which we could read and remember?

(And the kernel could reject when we attempt to change to a disallowed 
type).



+   Note:
+   -
+   A group type could be modified only when *all* the devices in
+   the group are not binded to any device driver. So, the user has
+   to first unbind the appropriate drivers and then change the
+   default domain type.
+   Caution:
+   
+   Unbinding a device driver will take away the drivers control
+   over the device and if done on devices that host root file
+   system could lead to catastrophic effects (the user might
+   need to reboot the machine to get it to normal state). So, it's
+   expected that the user understands what he is doing.


I think that this goes without saying.


+   Example:
+   
+   # Unbind USB device driver
+   1. echo ":00:14.0" > /sys/bus/pci/drivers/xhci_hcd/unbind
+   # Put the USB device in identity mode (a.k.a pass through)
+   2. echo "identity" > /sys/kernel/iommu_groups//type
+   # Re-bind the driver
+   3. echo ":00:14.0" > /sys/bus/pci/drivers/xhci_hcd/bind



Thanks,
John




arm64 iommu groups issue

2019-09-19 Thread John Garry

Hi all,

We have noticed some special behaviour on our arm64 D05 board, when the 
SMMU is enabled, with regard to PCI device iommu groups.


This platform does not support ACS, yet we find that not all functions 
of a PCI device are grouped together:


root@ubuntu:/sys# dmesg | grep "Adding to iommu group"
[7.307539] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
[   12.590533] hns_dsaf HISI00B2:00: Adding to iommu group 1
[   13.688527] mlx5_core 000a:11:00.0: Adding to iommu group 2
[   14.324606] mlx5_core 000a:11:00.1: Adding to iommu group 3
[   14.937090] ehci-platform PNP0D20:00: Adding to iommu group 4
[   15.276637] pcieport 0002:f8:00.0: Adding to iommu group 5
[   15.340845] pcieport 0004:88:00.0: Adding to iommu group 6
[   15.392098] pcieport 0005:78:00.0: Adding to iommu group 7
[   15.443356] pcieport 000a:10:00.0: Adding to iommu group 8
[   15.484975] pcieport 000c:20:00.0: Adding to iommu group 9
[   15.543647] pcieport 000d:30:00.0: Adding to iommu group 10
[   15.599771] serial 0002:f9:00.0: Adding to iommu group 5
[   15.690807] serial 0002:f9:00.1: Adding to iommu group 5
[   84.322097] mlx5_core 000a:11:00.2: Adding to iommu group 8
[   84.856408] mlx5_core 000a:11:00.3: Adding to iommu group 8

root@ubuntu:/sys#  lspci -tv
lspci -tvv
-+-[000d:30]---00.0-[31]--
  +-[000c:20]---00.0-[21]00.0  Huawei Technologies Co., Ltd.
  +-[000a:10]---00.0-[11-12]--+-00.0  Mellanox [ConnectX-5]
  |   +-00.1  Mellanox [ConnectX-5]
  |   +-00.2  Mellanox [ConnectX-5 VF]
  |   \-00.3  Mellanox [ConnectX-5 VF]
  +-[0007:90]---00.0-[91]00.0  Huawei Technologies Co., ...
  +-[0006:c0]---00.0-[c1]--
  +-[0005:78]---00.0-[79]--
  +-[0004:88]---00.0-[89]--
  +-[0002:f8]---00.0-[f9]--+-00.0  MosChip Semiconductor Technology ...
  |+-00.1  MosChip Semiconductor Technology ...
  |\-00.2  MosChip Semiconductor Technology ...
  \-[:00]-

For the PCI devices in question - on port 000a:10:00.0 - you will notice 
that the port and VFs (000a:11:00.2, 3) are in one group, yet the 2 PFs 
(000a:11:00.0, 000a:11:00.1) are in separate groups.


I also notice the same ordering behaviour on our D06 platform - the 
pcieport is added to an iommu group after the PF for that port. However, 
this platform supports ACS, so it's not such a problem there.


After some checking, I find that when the pcieport driver probes, the 
associated SMMU device has not yet registered with the IOMMU framework, 
so we defer the probe for this device - in iort.c:iort_iommu_xlate(), 
when no iommu ops are available yet, we defer.
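
A minimal sketch of that defer path (simplified from the iort.c of that
era; error handling and the driver-enabled checks are trimmed):

static int iort_iommu_xlate(struct device *dev, struct acpi_iort_node *node,
			    u32 streamid)
{
	const struct iommu_ops *ops;
	struct fwnode_handle *iort_fwnode;

	if (!node)
		return -ENODEV;

	iort_fwnode = iort_get_fwnode(node);
	if (!iort_fwnode)
		return -ENODEV;

	ops = iommu_ops_from_fwnode(iort_fwnode);
	if (!ops)
		return -EPROBE_DEFER;	/* SMMU instance not registered yet */

	return arm_smmu_iort_xlate(dev, streamid, iort_fwnode, ops);
}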


Yet when the mlx5 PF devices probe, the iommu ops are available by that 
stage. So the probe continues and we get an iommu group for the device - 
but not the same group as the parent port, as the port has not yet been 
added to a group. When the port eventually probes, it gets a new, 
separate group.


This all seems to be because the built-in module init ordering is as 
follows: pcieport drv, smmu drv, mlx5 drv.


I notice that if I build the mlx5 drv as a ko and insert it after boot, 
all functions + the pcieport are in the same group:


[   11.530046] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
[   17.301093] hns_dsaf HISI00B2:00: Adding to iommu group 1
[   18.743600] ehci-platform PNP0D20:00: Adding to iommu group 2
[   20.212284] pcieport 0002:f8:00.0: Adding to iommu group 3
[   20.356303] pcieport 0004:88:00.0: Adding to iommu group 4
[   20.493337] pcieport 0005:78:00.0: Adding to iommu group 5
[   20.702999] pcieport 000a:10:00.0: Adding to iommu group 6
[   20.859183] pcieport 000c:20:00.0: Adding to iommu group 7
[   20.996140] pcieport 000d:30:00.0: Adding to iommu group 8
[   21.152637] serial 0002:f9:00.0: Adding to iommu group 3
[   21.346991] serial 0002:f9:00.1: Adding to iommu group 3
[  100.754306] mlx5_core 000a:11:00.0: Adding to iommu group 6
[  101.420156] mlx5_core 000a:11:00.1: Adding to iommu group 6
[  292.481714] mlx5_core 000a:11:00.2: Adding to iommu group 6
[  293.281061] mlx5_core 000a:11:00.3: Adding to iommu group 6

This does seem like a problem for arm64 platforms which don't support 
ACS, yet enable an SMMU. Maybe also a problem even if they do support ACS.


Opinion?

Thanks,
John



Re: arm64 iommu groups issue

2019-09-19 Thread John Garry

On 19/09/2019 14:25, Robin Murphy wrote:

When the port eventually probes it gets a new, separate group.

This all seems to be as the built-in module init ordering is as
follows: pcieport drv, smmu drv, mlx5 drv

I notice that if I build the mlx5 drv as a ko and insert after boot,
all functions + pcieport are in the same group:

[   11.530046] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
[   17.301093] hns_dsaf HISI00B2:00: Adding to iommu group 1
[   18.743600] ehci-platform PNP0D20:00: Adding to iommu group 2
[   20.212284] pcieport 0002:f8:00.0: Adding to iommu group 3
[   20.356303] pcieport 0004:88:00.0: Adding to iommu group 4
[   20.493337] pcieport 0005:78:00.0: Adding to iommu group 5
[   20.702999] pcieport 000a:10:00.0: Adding to iommu group 6
[   20.859183] pcieport 000c:20:00.0: Adding to iommu group 7
[   20.996140] pcieport 000d:30:00.0: Adding to iommu group 8
[   21.152637] serial 0002:f9:00.0: Adding to iommu group 3
[   21.346991] serial 0002:f9:00.1: Adding to iommu group 3
[  100.754306] mlx5_core 000a:11:00.0: Adding to iommu group 6
[  101.420156] mlx5_core 000a:11:00.1: Adding to iommu group 6
[  292.481714] mlx5_core 000a:11:00.2: Adding to iommu group 6
[  293.281061] mlx5_core 000a:11:00.3: Adding to iommu group 6

This does seem like a problem for arm64 platforms which don't support
ACS, yet enable an SMMU. Maybe also a problem even if they do support
ACS.

Opinion?




Hi Robin,


Yeah, this is less than ideal.


For sure. Our production D05 boards don't ship with the SMMU enabled in 
BIOS, but it would be slightly concerning in this regard if they did.


One way to bodge it might be to make pci_device_group() also walk
downwards to see if any non-ACS-isolated children already have a group,
rather than assuming that groups get allocated in hierarchical order,
but that's far from ideal.


Agree.
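
For reference, a very rough sketch of that suggested bodge (untested;
get_child_group() is invented here, and REQ_ACS_FLAGS is the mask that
pci_device_group() already uses in drivers/iommu/iommu.c):

static int get_child_group(struct pci_dev *pdev, void *data)
{
	struct iommu_group **group = data;

	/* Reuse the group of any child that ACS does not isolate */
	if (!pci_acs_path_enabled(pdev, NULL, REQ_ACS_FLAGS))
		*group = iommu_group_get(&pdev->dev);

	return *group ? 1 : 0;	/* non-zero stops the bus walk */
}

	/* ... in pci_device_group(), before allocating a new group: */
	if (pdev->subordinate)
		pci_walk_bus(pdev->subordinate, get_child_group, &group);
	if (group)
		return group;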

My own workaround was to hack the mentioned iort code to defer the PF 
probe if the parent port had not yet probed itself.




The underlying issue is that, for historical reasons, OF/IORT-based
IOMMU drivers have ended up with group allocation being tied to endpoint
driver probing via the dma_configure() mechanism (long story short,
driver probe is the only thing which can be delayed in order to wait for
a specific IOMMU instance to be ready).However, in the meantime, the
IOMMU API internals have evolved sufficiently that I think there's a way
to really put things right - I have the spark of an idea which I'll try
to sketch out ASAP...



OK, great.

Thanks,
John


Robin.





[RFC PATCH 0/6] SMMUv3 PMCG IMP DEF event support

2019-09-30 Thread John Garry
This patchset adds IMP DEF event support for the SMMUv3 PMCG.

It is marked as an RFC as the method used to identify the PMCG
implementation may be quite disliked. And, in general, the series is
somewhat incomplete.

So the background is that the PMCG supports IMP DEF events, yet we have
no method to identify the PMCG implementation in order to know its IMP
DEF events.

A method for identifying the PMCG implementation could be using
PMDEVARCH, but we cannot rely on this being set properly, as whether this
is implemented is not defined in the SMMUv3 spec.

Another method would be perf event aliasing, but this method of event
matching is based on the CPU id, which would not guarantee the same
uniqueness as the PMCG implementation.

Yet another method could be to continue using ACPI OEM ID in the IORT
code, but this does not scale. And it is not suitable if we ever add DT
support to the PMCG driver.

The method used in this series is based on matching on the parent SMMUv3
IIDR. We store the IIDR contents in the arm smmu structure as its first
element, which means that we don't have to expose SMMU APIs - this is
the part which may be disliked.

The final two patches switch the pre-existing PMCG model identification
from ACPI OEM ID to the same parent SMMUv3 IIDR matching.

For now, we only consider SMMUv3 nodes as the associated node for a
PMCG.

John Garry (6):
  ACPI/IORT: Set PMCG device parent
  iommu/arm-smmu-v3: Record IIDR in arm_smmu_device structure
  perf/smmuv3: Retrieve parent SMMUv3 IIDR
  perf/smmuv3: Support HiSilicon hip08 (hi1620) IMP DEF events
  perf/smmuv3: Match implementation options based on parent SMMU IIDR
  ACPI/IORT: Drop code to set the PMCG software-defined model

 drivers/acpi/arm64/iort.c | 69 ++--
 drivers/iommu/arm-smmu-v3.c   |  5 +++
 drivers/perf/arm_smmuv3_pmu.c | 84 ++-
 include/linux/acpi_iort.h |  8 
 4 files changed, 112 insertions(+), 54 deletions(-)

-- 
2.17.1



[RFC PATCH 3/6] perf/smmuv3: Retrieve parent SMMUv3 IIDR

2019-09-30 Thread John Garry
To support IMP DEF events per PMCG, retrieve the parent SMMUv3 IIDR. This
will be used as a lookup for the IMP DEF events supported, under the
assumption that a PMCG implementation has the same uniqueness as the
parent SMMUv3. In this, we assume that any PMCG associated with the same
SMMUv3 will have the same IMP DEF events - otherwise, some other
secondary matching would need to be done.

Signed-off-by: John Garry 
---
 drivers/perf/arm_smmuv3_pmu.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c
index da71c741cb46..f702898c695d 100644
--- a/drivers/perf/arm_smmuv3_pmu.c
+++ b/drivers/perf/arm_smmuv3_pmu.c
@@ -115,6 +115,7 @@ struct smmu_pmu {
bool global_filter;
u32 global_filter_span;
u32 global_filter_sid;
+   u32 parent_iidr;
 };
 
 #define to_smmu_pmu(p) (container_of(p, struct smmu_pmu, pmu))
@@ -551,6 +552,11 @@ static const struct attribute_group *smmu_pmu_attr_grps[] 
= {
NULL
 };
 
+static const struct attribute_group **smmu_pmu_lookup_attr_groups(u32 
parent_smmu_iidr)
+{
+   return smmu_pmu_attr_grps;
+}
+
 /*
  * Generic device handlers
  */
@@ -706,11 +712,21 @@ static int smmu_pmu_probe(struct platform_device *pdev)
int irq, err;
char *name;
struct device *dev = &pdev->dev;
+   struct device *parent = dev->parent;
 
smmu_pmu = devm_kzalloc(dev, sizeof(*smmu_pmu), GFP_KERNEL);
if (!smmu_pmu)
return -ENOMEM;
 
+   if (parent) {
+   void *parent_drvdata;
+
+   parent_drvdata = 
platform_get_drvdata(to_platform_device(parent));
+   if (!parent_drvdata)
+   return -EPROBE_DEFER;
+   smmu_pmu->parent_iidr = *(u32 *)parent_drvdata;
+   }
+
smmu_pmu->dev = dev;
platform_set_drvdata(pdev, smmu_pmu);
 
@@ -724,7 +740,7 @@ static int smmu_pmu_probe(struct platform_device *pdev)
.start  = smmu_pmu_event_start,
.stop   = smmu_pmu_event_stop,
.read   = smmu_pmu_event_read,
-   .attr_groups= smmu_pmu_attr_grps,
+   .attr_groups= 
smmu_pmu_lookup_attr_groups(smmu_pmu->parent_iidr),
.capabilities   = PERF_PMU_CAP_NO_EXCLUDE,
};
 
-- 
2.17.1



[RFC PATCH 5/6] perf/smmuv3: Match implementation options based on parent SMMU IIDR

2019-09-30 Thread John Garry
Currently we match the implementation options based on the ACPI PLATFORM
OEM ID.

Since we can now match based on the parent SMMUv3 IIDR, switch to this
method.

Signed-off-by: John Garry 
---
 drivers/perf/arm_smmuv3_pmu.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c
index 11f28ba5fae0..33d1379ae525 100644
--- a/drivers/perf/arm_smmuv3_pmu.c
+++ b/drivers/perf/arm_smmuv3_pmu.c
@@ -739,14 +739,10 @@ static void smmu_pmu_reset(struct smmu_pmu *smmu_pmu)
   smmu_pmu->reloc_base + SMMU_PMCG_OVSCLR0);
 }
 
-static void smmu_pmu_get_acpi_options(struct smmu_pmu *smmu_pmu)
+static void smmu_pmu_get_implementation_options(struct smmu_pmu *smmu_pmu)
 {
-   u32 model;
-
-   model = *(u32 *)dev_get_platdata(smmu_pmu->dev);
-
-   switch (model) {
-   case IORT_SMMU_V3_PMCG_HISI_HIP08:
+   switch (smmu_pmu->parent_iidr) {
+   case PARENT_SMMU_IIDR_HISI_HIP08:
/* HiSilicon Erratum 162001800 */
smmu_pmu->options |= SMMU_PMCG_EVCNTR_RDONLY;
break;
@@ -844,7 +840,7 @@ static int smmu_pmu_probe(struct platform_device *pdev)
return -EINVAL;
}
 
-   smmu_pmu_get_acpi_options(smmu_pmu);
+   smmu_pmu_get_implementation_options(smmu_pmu);
 
/* Pick one CPU to be the preferred one to use */
smmu_pmu->on_cpu = raw_smp_processor_id();
-- 
2.17.1



[RFC PATCH 1/6] ACPI/IORT: Set PMCG device parent

2019-09-30 Thread John Garry
In the IORT, a PMCG node includes a node reference to its associated
device.

Set the PMCG platform device parent device for future referencing.

For now, we only consider setting for when the associated component is an
SMMUv3.

Signed-off-by: John Garry 
---
 drivers/acpi/arm64/iort.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 8569b79e8b58..0b687520c3e7 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1455,7 +1455,7 @@ static __init const struct iort_dev_config 
*iort_get_dev_cfg(
  * Returns: 0 on success, <0 failure
  */
 static int __init iort_add_platform_device(struct acpi_iort_node *node,
-  const struct iort_dev_config *ops)
+  const struct iort_dev_config *ops, 
struct device *parent)
 {
struct fwnode_handle *fwnode;
struct platform_device *pdev;
@@ -1466,6 +1466,8 @@ static int __init iort_add_platform_device(struct 
acpi_iort_node *node,
if (!pdev)
return -ENOMEM;
 
+   pdev->dev.parent = parent;
+
if (ops->dev_set_proximity) {
ret = ops->dev_set_proximity(&pdev->dev, node);
if (ret)
@@ -1573,6 +1575,11 @@ static void __init iort_enable_acs(struct acpi_iort_node 
*iort_node)
 static inline void iort_enable_acs(struct acpi_iort_node *iort_node) { }
 #endif
 
+static int iort_fwnode_match(struct device *dev, const void *fwnode)
+{
+   return dev->fwnode == fwnode;
+}
+
 static void __init iort_init_platform_devices(void)
 {
struct acpi_iort_node *iort_node, *iort_end;
@@ -1594,11 +1601,34 @@ static void __init iort_init_platform_devices(void)
iort_table->length);
 
for (i = 0; i < iort->node_count; i++) {
+   struct device *parent = NULL;
+
if (iort_node >= iort_end) {
pr_err("iort node pointer overflows, bad table\n");
return;
}
 
+   /* Fixme: handle parent declared in IORT after PMCG */
+   if (iort_node->type == ACPI_IORT_NODE_PMCG) {
+   struct acpi_iort_node *iort_assoc_node;
+   struct acpi_iort_pmcg *pmcg;
+   u32 node_reference;
+
+   pmcg = (struct acpi_iort_pmcg *)iort_node->node_data;
+
+   node_reference = pmcg->node_reference;
+   iort_assoc_node = ACPI_ADD_PTR(struct acpi_iort_node, 
iort,
+node_reference);
+
+   if (iort_assoc_node->type == ACPI_IORT_NODE_SMMU_V3) {
+   struct fwnode_handle *assoc_fwnode;
+
+   assoc_fwnode = iort_get_fwnode(iort_assoc_node);
+
+   parent = bus_find_device(&platform_bus_type, 
NULL,
+ assoc_fwnode, iort_fwnode_match);
+   }
+   }
iort_enable_acs(iort_node);
 
ops = iort_get_dev_cfg(iort_node);
@@ -1609,7 +1639,7 @@ static void __init iort_init_platform_devices(void)
 
iort_set_fwnode(iort_node, fwnode);
 
-   ret = iort_add_platform_device(iort_node, ops);
+   ret = iort_add_platform_device(iort_node, ops, parent);
if (ret) {
iort_delete_fwnode(iort_node);
acpi_free_fwnode_static(fwnode);
-- 
2.17.1



[RFC PATCH 2/6] iommu/arm-smmu-v3: Record IIDR in arm_smmu_device structure

2019-09-30 Thread John Garry
To allow other drivers to know the SMMU HW IIDR, record the IIDR contents
as the first member of the arm_smmu_device structure.

Storing it as the first member saves exposing SMMU APIs, which are
nicely self-contained today.
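
A minimal sketch of the consumer side (this relies only on the C
guarantee that a pointer to a struct also points at its first member):

	void *parent_drvdata = platform_get_drvdata(to_platform_device(parent));
	u32 parent_iidr;

	if (parent_drvdata)
		parent_iidr = *(u32 *)parent_drvdata; /* iidr must stay first */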

Signed-off-by: John Garry 
---
 drivers/iommu/arm-smmu-v3.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 40f4757096c3..1ed3ef0f1ec3 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -70,6 +70,8 @@
 #define IDR1_SSIDSIZE  GENMASK(10, 6)
 #define IDR1_SIDSIZE   GENMASK(5, 0)
 
+#define ARM_SMMU_IIDR  0x18
+
 #define ARM_SMMU_IDR5  0x14
 #define IDR5_STALL_MAX GENMASK(31, 16)
 #define IDR5_GRAN64K   (1 << 6)
@@ -546,6 +548,7 @@ struct arm_smmu_strtab_cfg {
 
 /* An SMMUv3 instance */
 struct arm_smmu_device {
+   u32 iidr; /* must be first 
member */
struct device   *dev;
void __iomem*base;
 
@@ -3153,6 +3156,8 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
iommu_device_set_ops(&smmu->iommu, &arm_smmu_ops);
iommu_device_set_fwnode(&smmu->iommu, dev->fwnode);
 
+   smmu->iidr = readl(smmu->base + ARM_SMMU_IIDR);
+
ret = iommu_device_register(&smmu->iommu);
if (ret) {
dev_err(dev, "Failed to register iommu\n");
-- 
2.17.1



[RFC PATCH 6/6] ACPI/IORT: Drop code to set the PMCG software-defined model

2019-09-30 Thread John Garry
Now that we can identify a PMCG implementation from the parent SMMUv3
IIDR, drop all the code to match based on the ACPI OEM ID.

Signed-off-by: John Garry 
---
 drivers/acpi/arm64/iort.c | 35 +--
 include/linux/acpi_iort.h |  8 
 2 files changed, 1 insertion(+), 42 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 0b687520c3e7..d04888cb8cff 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1377,27 +1377,6 @@ static void __init 
arm_smmu_v3_pmcg_init_resources(struct resource *res,
   ACPI_EDGE_SENSITIVE, &res[2]);
 }
 
-static struct acpi_platform_list pmcg_plat_info[] __initdata = {
-   /* HiSilicon Hip08 Platform */
-   {"HISI  ", "HIP08   ", 0, ACPI_SIG_IORT, greater_than_or_equal,
-"Erratum #162001800", IORT_SMMU_V3_PMCG_HISI_HIP08},
-   { }
-};
-
-static int __init arm_smmu_v3_pmcg_add_platdata(struct platform_device *pdev)
-{
-   u32 model;
-   int idx;
-
-   idx = acpi_match_platform_list(pmcg_plat_info);
-   if (idx >= 0)
-   model = pmcg_plat_info[idx].data;
-   else
-   model = IORT_SMMU_V3_PMCG_GENERIC;
-
-   return platform_device_add_data(pdev, &model, sizeof(model));
-}
-
 struct iort_dev_config {
const char *name;
int (*dev_init)(struct acpi_iort_node *node);
@@ -1408,7 +1387,6 @@ struct iort_dev_config {
 struct acpi_iort_node *node);
int (*dev_set_proximity)(struct device *dev,
struct acpi_iort_node *node);
-   int (*dev_add_platdata)(struct platform_device *pdev);
 };
 
 static const struct iort_dev_config iort_arm_smmu_v3_cfg __initconst = {
@@ -1430,7 +1408,6 @@ static const struct iort_dev_config 
iort_arm_smmu_v3_pmcg_cfg __initconst = {
.name = "arm-smmu-v3-pmcg",
.dev_count_resources = arm_smmu_v3_pmcg_count_resources,
.dev_init_resources = arm_smmu_v3_pmcg_init_resources,
-   .dev_add_platdata = arm_smmu_v3_pmcg_add_platdata,
 };
 
 static __init const struct iort_dev_config *iort_get_dev_cfg(
@@ -1494,17 +1471,7 @@ static int __init iort_add_platform_device(struct 
acpi_iort_node *node,
if (ret)
goto dev_put;
 
-   /*
-* Platform devices based on PMCG nodes uses platform_data to
-* pass the hardware model info to the driver. For others, add
-* a copy of IORT node pointer to platform_data to be used to
-* retrieve IORT data information.
-*/
-   if (ops->dev_add_platdata)
-   ret = ops->dev_add_platdata(pdev);
-   else
-   ret = platform_device_add_data(pdev, &node, sizeof(node));
-
+   ret = platform_device_add_data(pdev, &node, sizeof(node));
if (ret)
goto dev_put;
 
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index 8e7e2ec37f1b..7a8961e6a8bb 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -14,14 +14,6 @@
 #define IORT_IRQ_MASK(irq) (irq & 0xULL)
 #define IORT_IRQ_TRIGGER_MASK(irq) ((irq >> 32) & 0xULL)
 
-/*
- * PMCG model identifiers for use in smmu pmu driver. Please note
- * that this is purely for the use of software and has nothing to
- * do with hardware or with IORT specification.
- */
-#define IORT_SMMU_V3_PMCG_GENERIC0x /* Generic SMMUv3 PMCG */
-#define IORT_SMMU_V3_PMCG_HISI_HIP08 0x0001 /* HiSilicon HIP08 PMCG */
-
 int iort_register_domain_token(int trans_id, phys_addr_t base,
   struct fwnode_handle *fw_node);
 void iort_deregister_domain_token(int trans_id);
-- 
2.17.1



[RFC PATCH 4/6] perf/smmuv3: Support HiSilicon hip08 (hi1620) IMP DEF events

2019-09-30 Thread John Garry
Now that we can identify a PMCG implementation through the parent SMMUv3
implementation, add support for IMP DEF events.

For each new implementation supported, we add a new attr_grps structure
and do lookup matching based on the parent SMMUv3 IIDR. This could maybe
be optimised in future to reduce the number of structures required.

For now, only the l1_tlb event is added for HiSilicon hip08 platform.

This platform supports many more IMP DEF events, but I need something
better than the electronically translated descriptions of the events
before I can support them.

Signed-off-by: John Garry 
---
 drivers/perf/arm_smmuv3_pmu.c | 54 ++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c
index f702898c695d..11f28ba5fae0 100644
--- a/drivers/perf/arm_smmuv3_pmu.c
+++ b/drivers/perf/arm_smmuv3_pmu.c
@@ -132,6 +132,8 @@ SMMU_PMU_EVENT_ATTR_EXTRACTOR(filter_stream_id, config1, 0, 
31);
 SMMU_PMU_EVENT_ATTR_EXTRACTOR(filter_span, config1, 32, 32);
 SMMU_PMU_EVENT_ATTR_EXTRACTOR(filter_enable, config1, 33, 33);
 
+#define PARENT_SMMU_IIDR_HISI_HIP08 (0x30736)
+
 static inline void smmu_pmu_enable(struct pmu *pmu)
 {
struct smmu_pmu *smmu_pmu = to_smmu_pmu(pmu);
@@ -505,6 +507,21 @@ static struct attribute *smmu_pmu_events[] = {
NULL
 };
 
+SMMU_EVENT_ATTR(l1_tlb, 0x8a);
+
+static struct attribute *smmu_pmu_events_hip08[] = {
+   &smmu_event_attr_cycles.attr.attr,
+   &smmu_event_attr_transaction.attr.attr,
+   &smmu_event_attr_tlb_miss.attr.attr,
+   &smmu_event_attr_config_cache_miss.attr.attr,
+   &smmu_event_attr_trans_table_walk_access.attr.attr,
+   &smmu_event_attr_config_struct_access.attr.attr,
+   &smmu_event_attr_pcie_ats_trans_rq.attr.attr,
+   &smmu_event_attr_pcie_ats_trans_passed.attr.attr,
+   &smmu_event_attr_l1_tlb.attr.attr,
+   NULL
+};
+
 static umode_t smmu_pmu_event_is_visible(struct kobject *kobj,
 struct attribute *attr, int unused)
 {
@@ -514,7 +531,10 @@ static umode_t smmu_pmu_event_is_visible(struct kobject 
*kobj,
 
pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr.attr);
 
-   if (test_bit(pmu_attr->id, smmu_pmu->supported_events))
+   if (pmu_attr->id < SMMU_PMCG_ARCH_MAX_EVENTS &&
+   test_bit(pmu_attr->id, smmu_pmu->supported_events))
+   return attr->mode;
+   else if (pmu_attr->id >= SMMU_PMCG_ARCH_MAX_EVENTS)
return attr->mode;
 
return 0;
@@ -526,6 +546,12 @@ static struct attribute_group smmu_pmu_events_group = {
.is_visible = smmu_pmu_event_is_visible,
 };
 
+static struct attribute_group smmu_pmu_events_group_hip08 = {
+   .name = "events",
+   .attrs = smmu_pmu_events_hip08,
+   .is_visible = smmu_pmu_event_is_visible,
+};
+
 /* Formats */
 PMU_FORMAT_ATTR(event,"config:0-15");
 PMU_FORMAT_ATTR(filter_stream_id,  "config1:0-31");
@@ -552,8 +578,34 @@ static const struct attribute_group *smmu_pmu_attr_grps[] 
= {
NULL
 };
 
+static const struct attribute_group *smmu_pmu_attr_grps_hip08[] = {
+   &smmu_pmu_cpumask_group,
+   &smmu_pmu_events_group_hip08,
+   &smmu_pmu_format_group,
+   NULL
+};
+
+struct smmu_pmu_attr_grps_custom {
+   u32 parent_smmu_iidr;
+   const struct attribute_group **attr_grps;
+} smmu_pmu_attr_custom_grps[] = {
+   {
+   .parent_smmu_iidr = PARENT_SMMU_IIDR_HISI_HIP08,
+   .attr_grps = smmu_pmu_attr_grps_hip08,
+   },
+};
+
 static const struct attribute_group **smmu_pmu_lookup_attr_groups(u32 
parent_smmu_iidr)
 {
+   int i;
+
+   for (i = 0; i < ARRAY_SIZE(smmu_pmu_attr_custom_grps); i++) {
+   struct smmu_pmu_attr_grps_custom *c = 
&smmu_pmu_attr_custom_grps[i];
+
+   if (c->parent_smmu_iidr == parent_smmu_iidr)
+   return c->attr_grps;
+   }
+
return smmu_pmu_attr_grps;
 }
 
-- 
2.17.1



Re: [RFC PATCH 6/6] ACPI/IORT: Drop code to set the PMCG software-defined model

2019-10-15 Thread John Garry

On 15/10/2019 04:06, Hanjun Guo wrote:

-/*
> - * PMCG model identifiers for use in smmu pmu driver. Please note
> - * that this is purely for the use of software and has nothing to
> - * do with hardware or with IORT specification.
> - */
> -#define IORT_SMMU_V3_PMCG_GENERIC0x /* Generic SMMUv3 PMCG */
> -#define IORT_SMMU_V3_PMCG_HISI_HIP08 0x0001 /* HiSilicon HIP08 PMCG 
*/

Since only the HiSilicon platform has such an erratum, and I think it works
with old versions of the firmware as well, I'm fine with removing this
erratum framework.



Yeah, seems a decent change on its own, even without the SMMU PMCG 
driver changes.


But we still need to check on patch "[PATCH RFC 2/6] iommu/arm-smmu-v3: 
Record IIDR in arm_smmu_device structure" to progress any of this.


Will, Robin, Any opinion on that patch?

https://lore.kernel.org/linux-iommu/1569854031-237636-1-git-send-email-john.ga...@huawei.com/T/#m1e24771a23ee5426ec94ca2c4ec572642c155a77


Acked-by: Hanjun Guo 


Cheers,
John



Thanks
Hanjun


.






Re: [RFC PATCH 1/6] ACPI/IORT: Set PMCG device parent

2019-10-15 Thread John Garry

Hi Hanjun,

Thanks for checking this.


  */
 static int __init iort_add_platform_device(struct acpi_iort_node *node,
-  const struct iort_dev_config *ops)
+  const struct iort_dev_config *ops, 
struct device *parent)


Since you added an input to this function, could you please update
the comments of this function as well?


Right, that can be updated. Indeed, the current comment omits the @ops 
argument too.





 {
struct fwnode_handle *fwnode;
struct platform_device *pdev;
@@ -1466,6 +1466,8 @@ static int __init iort_add_platform_device(struct 
acpi_iort_node *node,
if (!pdev)
return -ENOMEM;

+   pdev->dev.parent = parent;
+
if (ops->dev_set_proximity) {
ret = ops->dev_set_proximity(&pdev->dev, node);
if (ret)
@@ -1573,6 +1575,11 @@ static void __init iort_enable_acs(struct acpi_iort_node 
*iort_node)
 static inline void iort_enable_acs(struct acpi_iort_node *iort_node) { }
 #endif

+static int iort_fwnode_match(struct device *dev, const void *fwnode)
+{
+   return dev->fwnode == fwnode;
+}
+
 static void __init iort_init_platform_devices(void)
 {
struct acpi_iort_node *iort_node, *iort_end;
@@ -1594,11 +1601,34 @@ static void __init iort_init_platform_devices(void)
iort_table->length);

for (i = 0; i < iort->node_count; i++) {
+   struct device *parent = NULL;
+
if (iort_node >= iort_end) {
pr_err("iort node pointer overflows, bad table\n");
return;
}

+   /* Fixme: handle parent declared in IORT after PMCG */
+   if (iort_node->type == ACPI_IORT_NODE_PMCG) {
+   struct acpi_iort_node *iort_assoc_node;
+   struct acpi_iort_pmcg *pmcg;
+   u32 node_reference;
+
+   pmcg = (struct acpi_iort_pmcg *)iort_node->node_data;
+
+   node_reference = pmcg->node_reference;
+   iort_assoc_node = ACPI_ADD_PTR(struct acpi_iort_node, 
iort,
+node_reference);
+
+   if (iort_assoc_node->type == ACPI_IORT_NODE_SMMU_V3) {
+   struct fwnode_handle *assoc_fwnode;
+
+   assoc_fwnode = iort_get_fwnode(iort_assoc_node);
+
+   parent = bus_find_device(&platform_bus_type, 
NULL,
+ assoc_fwnode, iort_fwnode_match);
+   }
+   }


How about moving the newly added code into its own function, to make
iort_init_platform_devices() a bit cleaner?



Can do. But I still need to add code to deal with the scenario where the 
PMCG's parent is not an SMMU, which is supported in the spec as I recall.


Note that I would not have FW to test that, so maybe we can omit support 
for now as long as there's no regression.



iort_enable_acs(iort_node);

ops = iort_get_dev_cfg(iort_node);
@@ -1609,7 +1639,7 @@ static void __init iort_init_platform_devices(void)

iort_set_fwnode(iort_node, fwnode);

-   ret = iort_add_platform_device(iort_node, ops);
+   ret = iort_add_platform_device(iort_node, ops, parent);


This function is only called if ops is valid, so retrieving the parent
can be done before this function, I think.


Ah, yes

Thanks,
John






Thanks
Hanjun


.






Re: [RFC PATCH 0/6] SMMUv3 PMCG IMP DEF event support

2019-10-16 Thread John Garry

On 15/10/2019 19:00, Robin Murphy wrote:

Hi John,

On 30/09/2019 15:33, John Garry wrote:

This patchset adds IMP DEF event support for the SMMUv3 PMCG.

It is marked as an RFC as the method used to identify the PMCG
implementation may be quite disliked. And, in general, the series is
somewhat incomplete.

So the background is that the PMCG supports IMP DEF events, yet we
have no
method to identify the PMCG to know the IMP DEF events.

A method for identifying the PMCG implementation could be using
PMDEVARCH, but we cannot rely on this being set properly, as whether this
is implemented is not defined in SMMUv3 spec.

Another method would be perf event aliasing, but this method of event
matching is based on CPU id, which would not guarantee same
uniqueness as PMCG implementation.

Yet another method could be to continue using ACPI OEM ID in the IORT
code, but this does not scale. And it is not suitable if we ever add DT
support to the PMCG driver.

The method used in this series is based on matching on the parent SMMUv3
IIDR. We store this IIDR contents in the arm smmu structure as the first
element, which means that we don't have to expose SMMU APIs - this is
the part which may be disliked.

The final two patches switch the pre-existing PMCG model identification
from ACPI OEM ID to the same parent SMMUv3 IIDR matching.

For now, we only consider SMMUv3 nodes as the associated node for a
PMCG.




Hi Robin,


Two significant concerns right off the bat:

- It seems more common than not for silicon designers to fail to
implement IIDR correctly, so it's only a matter of time before
inevitably needing to bring back some firmware-level identifier
abstraction (if not already - does Hi161x have PMCGs?)


Maybe there's a way that we can switch to this method, and leave the 
door open for an easy way to support a firmware-level identifier again, 
if ever needed. I'm not too pushed - this was secondary to just allowing 
the PMCG driver to know the associated SMMU model.


And, no, hi161x does not have any PMCGs.



- This seems like a step in entirely the wrong direction for supporting
PMCGs that reference non-SMMU nodes (Named Components or Root Complexes).


So to support PMCGs that reference a Named Component or Root Complex, I 
thought that the IORT parsing code would have to do some secondary 
lookup to the associated SMMU, through the Named Component or Root 
Complex node.


What was your idea here?

Note: I do acknowledge that an overall issue is that we assume all PMCG 
IMP DEF events are the same for a given SMMU model.




Interpreting the Node Reference is definitely a welcome improvement over
matching table headers, but absent a truly compelling argument to the
contrary, I'd rather retain the "PMCG model" abstraction in between that
and the driver itself (especially since those can trivially be hung off
compatibles once it comes to DT support).


For DT, I would assume that just using compatible strings would allow 
us to identify the PMCG model.


On a related matter, is there still a need to deal with scenarios of the 
PMCG being located within the SMMU register map? As you may remember, we 
did have this issue but relocated the PMCG to outside the SMMU register 
map in a later chip rev.


Cheers,
John



Thanks,
Robin.



John Garry (6):
   ACPI/IORT: Set PMCG device parent
   iommu/arm-smmu-v3: Record IIDR in arm_smmu_device structure
   perf/smmuv3: Retrieve parent SMMUv3 IIDR
   perf/smmuv3: Support HiSilicon hip08 (hi1620) IMP DEF events
   perf/smmuv3: Match implementation options based on parent SMMU IIDR
   ACPI/IORT: Drop code to set the PMCG software-defined model

  drivers/acpi/arm64/iort.c | 69 ++--
  drivers/iommu/arm-smmu-v3.c   |  5 +++
  drivers/perf/arm_smmuv3_pmu.c | 84 ++-
  include/linux/acpi_iort.h |  8 
  4 files changed, 112 insertions(+), 54 deletions(-)



.






Re: [RFC PATCH 0/6] SMMUv3 PMCG IMP DEF event support

2019-10-16 Thread John Garry



Hi Robin,


Two significant concerns right off the bat:

- It seems more common than not for silicon designers to fail to
implement IIDR correctly, so it's only a matter of time before
inevitably needing to bring back some firmware-level identifier
abstraction (if not already - does Hi161x have PMCGs?)


Maybe there's a way that we can switch to this method, and leave the
door open for an easy way to support a firmware-level identifier again,
if ever needed. I'm not too pushed - this was secondary to just
allowing the PMCG driver to know the associated SMMU model.


But that's the part I'm not buying - there's no clear advantage to
pushing that complexity down into the PMCG driver, vs. leaving the IORT
code responsible for translating an SMMU model into a PMCG model, yet
the aforementioned disadvantages jump out right away.



One advantage is that the next piece of quirky hw with a properly 
implemented IIDR does not require a new IORT model.


And today, this handling is only for hi1620, and since we can use the 
hi1620 IIDR to identify it, it seems good to remove the code outside the 
PMCG driver that exists specifically to handle it.


But if you think it's going to be needed again, then it makes sense not 
to remove it.



And, no, hi161x does not have any PMCGs.


Hooray, I guess :)



- This seems like a step in entirely the wrong direction for supporting
PMCGs that reference non-SMMU nodes (Named Components or Root Complexes).


So to support PMCGs that reference a Named Component or Root Complex,
I thought that the IORT parsing code would have to do some secondary
lookup to the associated SMMU, through the Named Component or Root
Complex node.

What was your idea here?


The associated SMMU has no relevance in that context - the reason for
the Node Reference to point to a non-SMMU node is for devices that
implement their own embedded TLB (e.g. AMBA DTI masters) and expose a
standard PMCG interface to monitor it. It isn't reasonable to expect any
old PCIe controller or on-chip-accelerator driver to expose a fake SMMU
IIDR just to keep some other driver happy.


But won't there still be an SMMU associated with the AMBA DTI masters, 
in your example?


It's this SMMU which the PMCG driver would reference as the "parent" 
device, and the IORT parsing would need to do the lookup for this reference.


But then, this becomes something that the DT parsing would need to 
handle also.





Note: I do acknowledge that an overall issue is that we assume all
PMCG IMP DEF events are the same for a given SMMU model.


That assumption does technically fail already - I know MMU-600 has
different IMP-DEF events for its TCU and TBUs, however as long as we can
get as far as "this is some part of an MMU-600" the driver should be
able to figure out the rest (annoyingly it looks like both PMCG types
expose the same PMCG_ID_REGS information, but they should be
distinguishable by PMCG_CEIDn).


JFYI, the PMCG_CEIDn contents for hi1620 are all zero, apart from 
PMDEVARCH and PMDEVTYPE, which are the same as the Arm implementation 
according to the spec - sigh...





Interpreting the Node Reference is definitely a welcome improvement over
matching table headers, but absent a truly compelling argument to the
contrary, I'd rather retain the "PMCG model" abstraction in between that
and the driver itself (especially since those can trivially be hung off
compatibles once it comes to DT support).


For DT, I would assume that just using compatible strings would allow
us to identify the PMCG model.


Right, that was largely my point - DT probing can start with a PMCG
model, so it's a lot more logical for ACPI probing to do the same, with
the actual PMCG model determination hidden away in the ACPI code. That's
the basis of the current design.

I have been nagging the architects that PMCGs not having their own IIDR
is an unwelcome hole in the spec, so hopefully this might get a bit
easier some day.


For sure. The spec says that the PMCGs may be "independently-designed", 
hence no general ID method. I don't get this.





On a related matter, is there still a need to deal with scenarios of
the PMCG being located within the SMMU register map? As you may
remember, we did have this issue but relocated the PMCG to outside the
SMMU register map in a later chip rev.


MMU-600 has its TCU PMCG page 0 in the middle of its SMMU page 0 space,
but given that it's an Arm IP, I expect that when the heat gets turned
up for making it work, it's most likely to be under me ;)


OK, so this is another reason why I thought that having a reference to 
the SMMU device could be useful in terms of solving that problem.




Robin.

.


Thanks again,
John








Re: [PATCH 0/7] iommu: Permit modular builds of ARM SMMU[v3] drivers

2019-11-01 Thread John Garry

On 31/10/2019 23:34, Saravana Kannan via iommu wrote:

I looked into the iommu-map property and it shouldn't be too hard to
add support for it. Looks like we can simply hold off on probing the
root bridge device till all the iommus in its iommu-map are probed and
we should be fine.


I'm also unsure about distro vendors agreeing to a mandatory kernel
parameter (of_devlink). Do you plan to eventually enable it by default?


static const struct supplier_bindings of_supplier_bindings[] = {
	{ .parse_prop = parse_clocks, },
	{ .parse_prop = parse_interconnects, },
	{ .parse_prop = parse_regulators, },
+	{ .parse_prop = parse_iommus, },
	{},
};

I plan to upstream this pretty soon, but I have other patches in
flight that touch the same file and I'm waiting for those to get
accepted. I also want to clean up the code a bit to reduce some
repetition before I add support for more bindings.

I'm also wondering about ACPI support.

I'd love to add ACPI support too, but I have zero knowledge of ACPI.
I'd be happy to help anyone who wants to add ACPI support that allows
ACPI to add device links.


If possible to add, that may be useful for remedying this:

https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26...@huawei.com/

Thanks,
John




IOMMU already has a sort of
canonical code path that links endpoints to their IOMMU
(iommu_probe_device()), after the firmware descriptions have been parsed.
So if we created the device links in the iommu core, for example
iommu_bus_notifier(), we would support all firmware interface flavours.




Re: [PATCH 0/7] iommu: Permit modular builds of ARM SMMU[v3] drivers

2019-11-04 Thread John Garry

On 01/11/2019 21:13, Saravana Kannan wrote:

On Fri, Nov 1, 2019 at 3:28 AM John Garry  wrote:


On 31/10/2019 23:34, Saravana Kannan via iommu wrote:

I looked into the iommu-map property and it shouldn't be too hard to
add support for it. Looks like we can simply hold off on probing the
root bridge device till all the iommus in its iommu-map are probed and
we should be fine.


I'm also unsure about distro vendors agreeing to a mandatory kernel
parameter (of_devlink). Do you plan to eventually enable it by default?


static const struct supplier_bindings of_supplier_bindings[] = {
	{ .parse_prop = parse_clocks, },
	{ .parse_prop = parse_interconnects, },
	{ .parse_prop = parse_regulators, },
+	{ .parse_prop = parse_iommus, },
	{},
};

I plan to upstream this pretty soon, but I have other patches in
flight that touch the same file and I'm waiting for those to get
accepted. I also want to clean up the code a bit to reduce some
repetition before I add support for more bindings.

I'm also wondering about ACPI support.

I'd love to add ACPI support too, but I have zero knowledge of ACPI.
I'd be happy to help anyone who wants to add ACPI support that allows
ACPI to add device links.


If possible to add, that may be useful for remedying this:

https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26...@huawei.com/


I'm happy that this change might fix that problem, but isn't the
problem reported in that thread more to do with child devices getting
added before the parent probes successfully? That doesn't make sense
to me. 


So the pcieport device and then the child device are added in the PCI 
scan, but only some time later do the device drivers probe these 
devices; so it's not that the pcieport driver creates the child device.


The problem then occurs in that the ordering of the device driver probes 
is such that we have this: pcieport probe + defer (as no IOMMU group is 
registered), SMMU probe (registers the IOMMU group), child device probe, 
pcieport really probes.


Can't the pcieport driver avoid adding its child devices before it
probes successfully? Or more specifically, who adds the child devices
of the pcieport before the pcieport itself probes?


The devices are actually added in the order pcieport then child device, 
but not really probed in that same order, as above.


I'll add you to that thread if you want to discuss further.

Thanks,
John



Re: arm64 iommu groups issue

2019-11-04 Thread John Garry

+

On 19/09/2019 15:35, John Garry wrote:

On 19/09/2019 14:25, Robin Murphy wrote:

When the port eventually probes it gets a new, separate group.

This all seems to be because the built-in module init ordering is as
follows: pcieport drv, smmu drv, mlx5 drv

I notice that if I build the mlx5 drv as a ko and insert after boot,
all functions + pcieport are in the same group:

[   11.530046] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
[   17.301093] hns_dsaf HISI00B2:00: Adding to iommu group 1
[   18.743600] ehci-platform PNP0D20:00: Adding to iommu group 2
[   20.212284] pcieport 0002:f8:00.0: Adding to iommu group 3
[   20.356303] pcieport 0004:88:00.0: Adding to iommu group 4
[   20.493337] pcieport 0005:78:00.0: Adding to iommu group 5
[   20.702999] pcieport 000a:10:00.0: Adding to iommu group 6
[   20.859183] pcieport 000c:20:00.0: Adding to iommu group 7
[   20.996140] pcieport 000d:30:00.0: Adding to iommu group 8
[   21.152637] serial 0002:f9:00.0: Adding to iommu group 3
[   21.346991] serial 0002:f9:00.1: Adding to iommu group 3
[  100.754306] mlx5_core 000a:11:00.0: Adding to iommu group 6
[  101.420156] mlx5_core 000a:11:00.1: Adding to iommu group 6
[  292.481714] mlx5_core 000a:11:00.2: Adding to iommu group 6
[  293.281061] mlx5_core 000a:11:00.3: Adding to iommu group 6

This does seem like a problem for arm64 platforms which don't support
ACS, yet enable an SMMU. Maybe also a problem even if they do support
ACS.

Opinion?




Hi Robin,


Yeah, this is less than ideal.


For sure. Our production D05 boards don't ship with the SMMU enabled in 
BIOS, but it would be slightly concerning in this regard if they did.


One way to bodge it might be to make 
pci_device_group() also walk downwards to see if any non-ACS-isolated 
children already have a group, rather than assuming that groups get 
allocated in hierarchical order, but that's far from ideal.


Agree.

My own workaround was to hack the mentioned iort code to defer the PF 
probe if the parent port had also yet to probe.
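For illustration, the downward walk suggested above might look roughly
like this - a hypothetical, untested sketch, reusing the ACS mask that
pci_device_group() already checks:

#define REQ_ACS_FLAGS	(PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CF | PCI_ACS_UF)

/* Sketch: before allocating a fresh group for a bridge, adopt the
 * group of any child that ACS cannot isolate from it. */
static struct iommu_group *pci_bridge_child_group(struct pci_dev *bridge)
{
	struct pci_dev *child;

	if (!bridge->subordinate)
		return NULL;

	list_for_each_entry(child, &bridge->subordinate->devices, bus_list) {
		struct iommu_group *group = iommu_group_get(&child->dev);

		if (!group)
			continue;

		if (!pci_acs_path_enabled(child, NULL, REQ_ACS_FLAGS))
			return group;	/* caller adopts this reference */

		iommu_group_put(group);
	}

	return NULL;
}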




The underlying issue is that, for historical reasons, OF/IORT-based
IOMMU drivers have ended up with group allocation being tied to endpoint
driver probing via the dma_configure() mechanism (long story short,
driver probe is the only thing which can be delayed in order to wait for
a specific IOMMU instance to be ready). However, in the meantime, the
IOMMU API internals have evolved sufficiently that I think there's a way
to really put things right - I have the spark of an idea which I'll try
to sketch out ASAP...



OK, great.

Thanks,
John


Robin.





Re: [PATCH v2 6/9] Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"

2019-11-08 Thread John Garry

On 08/11/2019 15:16, Will Deacon wrote:

+MODULE_DEVICE_TABLE(of, arm_smmu_of_match);


Hi Will,

  
  static struct platform_driver arm_smmu_driver = {

.driver = {
.name   = "arm-smmu-v3",
.of_match_table = of_match_ptr(arm_smmu_of_match),
-   .suppress_bind_attrs = true,


Does this mean that we can now manually unbind this driver from the SMMU 
device?


Seems dangerous. Here's what happens for me:

root@ubuntu:/sys# cd ./bus/platform/drivers/arm-smmu-v3
root@ubuntu:/sys/bus/platform/drivers/arm-smmu-v3# echo arm-smmu-v3.0.auto > unbind

[   77.580351] hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found!
[   78.635473] platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000146 [hwprod 0x00000146, hwcons 0x00000000]



},
.probe  = arm_smmu_device_probe,
+   .remove = arm_smmu_device_remove,
.shutdown = arm_smmu_device_shutdown,
  };
-builtin_platform_driver(arm_smmu_driver);
+module_platform_driver(arm_smmu_driver);
+



Thanks,
John


Re: [PATCH v2 6/9] Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"

2019-11-08 Thread John Garry

On 08/11/2019 16:17, John Garry wrote:

On 08/11/2019 15:16, Will Deacon wrote:

+MODULE_DEVICE_TABLE(of, arm_smmu_of_match);


Hi Will,


  static struct platform_driver arm_smmu_driver = {
  .driver    = {
  .name    = "arm-smmu-v3",
  .of_match_table    = of_match_ptr(arm_smmu_of_match),
-    .suppress_bind_attrs = true,


Does this mean that we can now manually unbind this driver from the SMMU 
device?


Seems dangerous. Here's what happens for me:

root@ubuntu:/sys# cd ./bus/platform/drivers/arm-smmu-v3
root@ubuntu:/sys/bus/platform/drivers/arm-smmu-v3# echo arm-smmu-v3.0.auto > unbind

[   77.580351] hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found!
[   78.635473] platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000146 [hwprod 0x00000146, hwcons 0x00000000]



  },
  .probe    = arm_smmu_device_probe,
+    .remove    = arm_smmu_device_remove,
  .shutdown = arm_smmu_device_shutdown,
  };
-builtin_platform_driver(arm_smmu_driver);
+module_platform_driver(arm_smmu_driver);
+


BTW, it now looks like it was your v1 series I was testing there, on 
your branch iommu/module. It would be helpful to update for ease of testing.


Thanks,
John


Re: [PATCH v2 6/9] Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"

2019-11-08 Thread John Garry

On 08/11/2019 16:47, Will Deacon wrote:

On Fri, Nov 08, 2019 at 04:44:25PM +, John Garry wrote:

On 08/11/2019 16:17, John Garry wrote:

On 08/11/2019 15:16, Will Deacon wrote:

+MODULE_DEVICE_TABLE(of, arm_smmu_of_match);


Hi Will,


   static struct platform_driver arm_smmu_driver = {
   .driver    = {
   .name    = "arm-smmu-v3",
   .of_match_table    = of_match_ptr(arm_smmu_of_match),
-    .suppress_bind_attrs = true,


Does this mean that we can now manually unbind this driver from the SMMU
device?

Seems dangerous. Here's what happens for me:

root@ubuntu:/sys# cd ./bus/platform/drivers/arm-smmu-v3
root@ubuntu:/sys/bus/platform/drivers/arm-smmu-v3# echo arm-smmu-v3.0.auto > unbind
[   77.580351] hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found!
[   78.635473] platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000146 [hwprod 0x00000146, hwcons 0x00000000]


   },
   .probe    = arm_smmu_device_probe,
+    .remove    = arm_smmu_device_remove,
   .shutdown = arm_smmu_device_shutdown,
   };
-builtin_platform_driver(arm_smmu_driver);
+module_platform_driver(arm_smmu_driver);
+


BTW, it now looks like it was your v1 series I was testing there, on your
branch iommu/module. It would be helpful to update for ease of testing.




Hi Will,


Yes, sorry about that. I'll update it now (although I'm not sure it will
help with this -- I was going to see what happens with other devices such
as the intel-iommu or storage controllers)


So I tried your v2 series for this - it has the same issue, as I 
anticipated.


It seems that some iommu drivers do call iommu_device_register(), so 
maybe a decent reference. Or simply stop the driver being unbound.


Cheers,
John


Re: [PATCH v2 6/9] Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"

2019-11-08 Thread John Garry

On 08/11/2019 17:32, Will Deacon wrote:

On Fri, Nov 08, 2019 at 05:25:09PM +, John Garry wrote:

On 08/11/2019 16:47, Will Deacon wrote:

On Fri, Nov 08, 2019 at 04:44:25PM +, John Garry wrote:

BTW, it now looks like it was your v1 series I was testing there, on your
branch iommu/module. It would be helpful to update for ease of testing.


Yes, sorry about that. I'll update it now (although I'm not sure it will
help with this -- I was going to see what happens with other devices such
as the intel-iommu or storage controllers)


So I tried your v2 series for this - it has the same issue, as I
anticipated.


Right, I'm just not sure how resilient drivers are expected to be to force
unbinding like this. You can break lots of stuff with root...


For sure, but it is good practice to limit that.

I had to fix something like this recently, so know about it... another 
potential problem is use-after-frees, where your device managed memory 
is freed at removal but still registered somewhere.





It seems that some iommu drivers do call iommu_device_register(), so maybe a
decent reference. Or simply stop the driver being unbound.


I'm not sure what you mean about iommu_device_register() (we call that
already), 


Sorry, I meant to say iommu_device_unregister().

but I guess we can keep the '.suppress_bind_attrs = true' if
necessary. 


It may be good to add it to older stable kernels also, pre c07b6426df92.

I'll have a play on my laptop and see how well that works if

you start unbinding stuff.


Cheers,
John


Re: [PATCH v2 6/9] Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"

2019-11-08 Thread John Garry

On 08/11/2019 17:48, Will Deacon wrote:

On Fri, Nov 08, 2019 at 05:32:48PM +, Will Deacon wrote:

On Fri, Nov 08, 2019 at 05:25:09PM +, John Garry wrote:

On 08/11/2019 16:47, Will Deacon wrote:

On Fri, Nov 08, 2019 at 04:44:25PM +, John Garry wrote:

BTW, it now looks like it was your v1 series I was testing there, on your
branch iommu/module. It would be helpful to update for ease of testing.


Yes, sorry about that. I'll update it now (although I'm not sure it will
help with this -- I was going to see what happens with other devices such
as the intel-iommu or storage controllers)


So I tried your v2 series for this - it has the same issue, as I
anticipated.


Right, I'm just not sure how resilient drivers are expected to be to force
unbinding like this. You can break lots of stuff with root...


It seems that some iommu drivers do call iommu_device_register(), so maybe a
decent reference. Or simply stop the driver being unbound.


I'm not sure what you mean about iommu_device_register() (we call that
already), but I guess we can keep the '.suppress_bind_attrs = true' if
necessary. I'll have a play on my laptop and see how well that works if
you start unbinding stuff.


So unbinding the nvme driver goes bang:

[90139.090158] nvme nvme0: failed to set APST feature (-19)
[90141.966780] Aborting journal on device dm-1-8.
[90141.967124] Buffer I/O error on dev dm-1, logical block 26247168, lost sync 
page write
[90141.967169] JBD2: Error -5 detected when updating journal superblock for 
dm-1-8.
[90141.967403] Buffer I/O error on dev dm-1, logical block 0, lost sync page 
write
[90141.967454] EXT4-fs (dm-1): I/O error while writing superblock
[90141.967467] EXT4-fs error (device dm-1): ext4_journal_check_start:61: 
Detected aborted journal
[90141.967473] EXT4-fs (dm-1): Remounting filesystem read-only
[90141.967569] Buffer I/O error on dev dm-1, logical block 0, lost sync page 
write
[90141.967682] EXT4-fs (dm-1): I/O error while writing superblock

and I've not managed to recover the thing yet (it's stuck trying to reboot.)



Not surprised. I guess the device backing your root directory disappeared.


What state was your system in after unbinding the SMMU?


Unusable again. For me the storage controller backing the root directory 
is compromised by disabling the SMMU unsafely.


Thanks,
John


Re: [PATCH 1/2] iommu/arm-smmu-v3: Don't display an error when IRQ lines are missing

2019-11-11 Thread John Garry

On 11/11/2019 11:17, Jean-Philippe Brucker wrote:

Since commit 7723f4c5ecdb ("driver core: platform: Add an error message
to platform_get_irq*()"), platform_get_irq_byname() displays an error
when the IRQ isn't found. Since the SMMUv3 driver uses that function to
query which interrupt method is available, the message is now displayed
during boot for any SMMUv3 that doesn't implement the combined
interrupt, or that implements MSIs.

[   20.700337] arm-smmu-v3 arm-smmu-v3.7.auto: IRQ combined not found
[   20.706508] arm-smmu-v3 arm-smmu-v3.7.auto: IRQ eventq not found
[   20.712503] arm-smmu-v3 arm-smmu-v3.7.auto: IRQ priq not found
[   20.718325] arm-smmu-v3 arm-smmu-v3.7.auto: IRQ gerror not found

Use platform_get_irq_byname_optional() to avoid displaying a spurious
error.

Fixes: 7723f4c5ecdb ("driver core: platform: Add an error message to 
platform_get_irq*()")
Signed-off-by: Jean-Philippe Brucker 


That stops a nuisance:
Tested-by: John Garry 

However, I will say though that the combined irq seems necessary for 
TX2, which is not warned about being missing now.
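If keeping a warning for such platforms matters, one illustrative tweak
(not a patch; ARM_SMMU_OPT_COMBINED_IRQ is a made-up option flag) would
be:

	irq = platform_get_irq_byname_optional(pdev, "combined");
	if (irq > 0)
		smmu->combined_irq = irq;
	else if (smmu->options & ARM_SMMU_OPT_COMBINED_IRQ) /* hypothetical */
		dev_warn(&pdev->dev, "combined IRQ expected but not found\n");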


Finally, a cover letter would have been handy to mention that the new 
API was only introduced after rc1.


thanks


---
  drivers/iommu/arm-smmu-v3.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index bfa4a0f39ed0..a89797f346a4 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -5207,19 +5207,19 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
  
  	/* Interrupt lines */
  
-	irq = platform_get_irq_byname(pdev, "combined");

+   irq = platform_get_irq_byname_optional(pdev, "combined");
if (irq > 0)
smmu->combined_irq = irq;
else {
-   irq = platform_get_irq_byname(pdev, "eventq");
+   irq = platform_get_irq_byname_optional(pdev, "eventq");
if (irq > 0)
smmu->evtq.q.irq = irq;
  
-		irq = platform_get_irq_byname(pdev, "priq");

+   irq = platform_get_irq_byname_optional(pdev, "priq");
if (irq > 0)
smmu->priq.q.irq = irq;
  
-		irq = platform_get_irq_byname(pdev, "gerror");

+   irq = platform_get_irq_byname_optional(pdev, "gerror");
if (irq > 0)
smmu->gerr_irq = irq;
}





Re: [PATCH v3 09/14] iommu/arm-smmu: Prevent forced unbinding of Arm SMMU drivers

2019-11-26 Thread John Garry

On 21/11/2019 11:49, Will Deacon wrote:

Forcefully unbinding the Arm SMMU drivers is a pretty dangerous operation,
since it will likely lead to catastrophic failure for any DMA devices
mastering through the SMMU being unbound. When the driver then attempts
to "handle" the fatal faults, it's very easy to trip over dead data
structures, leading to use-after-free.

On John's machine, he reports that the machine was "unusable" due to
loss of the storage controller following a forced unbind of the SMMUv3
driver:

   | # cd ./bus/platform/drivers/arm-smmu-v3
   | # echo arm-smmu-v3.0.auto > unbind
   | hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found!
   | platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000146
   | [hwprod 0x00000146, hwcons 0x00000000]

Prevent this forced unbinding of the drivers by setting "suppress_bind_attrs"
to true.


This seems a reasonable approach for now.

BTW, I'll give this series a spin this week, which again looks to be 
your iommu/module branch, excluding the new IORT patch.


Cheers,
John



Link: 
https://lore.kernel.org/lkml/06dfd385-1af0-3106-4cc5-6a5b8e864...@huawei.com
Reported-by: John Garry 
Signed-off-by: Will Deacon 
---
  drivers/iommu/arm-smmu-v3.c | 5 +++--
  drivers/iommu/arm-smmu.c| 7 ---
  2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 2ad8e2ca0583..3fd75abce3bb 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3700,8 +3700,9 @@ MODULE_DEVICE_TABLE(of, arm_smmu_of_match);
  
  static struct platform_driver arm_smmu_driver = {

.driver = {
-   .name   = "arm-smmu-v3",
-   .of_match_table = of_match_ptr(arm_smmu_of_match),
+   .name   = "arm-smmu-v3",
+   .of_match_table = of_match_ptr(arm_smmu_of_match),
+   .suppress_bind_attrs= true,
},
.probe  = arm_smmu_device_probe,
.remove = arm_smmu_device_remove,
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 53bbe0663b9e..d6c83bd69555 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -2237,9 +2237,10 @@ static const struct dev_pm_ops arm_smmu_pm_ops = {
  
  static struct platform_driver arm_smmu_driver = {

.driver = {
-   .name   = "arm-smmu",
-   .of_match_table = of_match_ptr(arm_smmu_of_match),
-   .pm = &arm_smmu_pm_ops,
+   .name   = "arm-smmu",
+   .of_match_table = of_match_ptr(arm_smmu_of_match),
+   .pm = &arm_smmu_pm_ops,
+   .suppress_bind_attrs= true,
},
.probe  = arm_smmu_device_probe,
.remove = arm_smmu_device_remove,





Re: [PATCH v3 09/14] iommu/arm-smmu: Prevent forced unbinding of Arm SMMU drivers

2019-11-27 Thread John Garry

On 26/11/2019 20:27, Saravana Kannan wrote:

On Tue, Nov 26, 2019 at 1:13 AM John Garry  wrote:


On 21/11/2019 11:49, Will Deacon wrote:

Forcefully unbinding the Arm SMMU drivers is a pretty dangerous operation,
since it will likely lead to catastrophic failure for any DMA devices
mastering through the SMMU being unbound. When the driver then attempts
to "handle" the fatal faults, it's very easy to trip over dead data
structures, leading to use-after-free.

On John's machine, he reports that the machine was "unusable" due to
loss of the storage controller following a forced unbind of the SMMUv3
driver:

| # cd ./bus/platform/drivers/arm-smmu-v3
| # echo arm-smmu-v3.0.auto > unbind
| hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found!
| platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000146
| [hwprod 0x00000146, hwcons 0x00000000]

Prevent this forced unbinding of the drivers by setting "suppress_bind_attrs"
to true.


This seems a reasonable approach for now.

BTW, I'll give this series a spin this week, which again looks to be
your iommu/module branch, excluding the new IORT patch.




Hi Saravana,


Is this on a platform where of_devlink creates device links between
the iommu device and its suppliers? I'm guessing no? Because device
links should force unbinding of all the consumers before unbinding the
supplier.


I'm only really interested in ACPI, TBH.



Looks like it'll still allow the supplier to unbind if the consumers
don't allow unbinding. Is that the case here?


So just unbinding the driver from a device does not delete the device 
nor remove the device from its IOMMU group - so we keep the reference to 
the SMMU ko. As such, I don't know how to realistically test unloading 
the SMMU ko when we have platform devices involved. Maybe someone can 
enlighten me...
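For reference, the device link behaviour being discussed boils down to
something like this sketch (master_dev and smmu_dev are placeholder
names for the consumer and its IOMMU supplier):

	struct device_link *link;

	/* The driver core will then force master_dev to be unbound
	 * before smmu_dev can be unbound. */
	link = device_link_add(master_dev, smmu_dev, DL_FLAG_AUTOREMOVE_CONSUMER);
	if (!link)
		dev_err(master_dev, "failed to create device link to SMMU\n");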


Thanks,
John


Re: [PATCH v3 09/14] iommu/arm-smmu: Prevent forced unbinding of Arm SMMU drivers

2019-11-27 Thread John Garry

On 27/11/2019 11:04, John Garry wrote:

On 26/11/2019 20:27, Saravana Kannan wrote:

On Tue, Nov 26, 2019 at 1:13 AM John Garry  wrote:


On 21/11/2019 11:49, Will Deacon wrote:
Forcefully unbinding the Arm SMMU drivers is a pretty dangerous 
operation,

since it will likely lead to catastrophic failure for any DMA devices
mastering through the SMMU being unbound. When the driver then attempts
to "handle" the fatal faults, it's very easy to trip over dead data
structures, leading to use-after-free.

On John's machine, he reports that the machine was "unusable" due to
loss of the storage controller following a forced unbind of the SMMUv3
driver:

    | # cd ./bus/platform/drivers/arm-smmu-v3
    | # echo arm-smmu-v3.0.auto > unbind
    | hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found!
    | platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000146
    | [hwprod 0x00000146, hwcons 0x00000000]

Prevent this forced unbinding of the drivers by setting 
"suppress_bind_attrs"

to true.


This seems a reasonable approach for now.

BTW, I'll give this series a spin this week, which again looks to be
your iommu/module branch, excluding the new IORT patch.




Hi Saravana,


Is this on a platform where of_devlink creates device links between
the iommu device and its suppliers? I'm guessing no? Because device
links should force unbinding of all the consumers before unbinding the
supplier.


I'm only really interested in ACPI, TBH.



Looks like it'll still allow the supplier to unbind if the consumers
don't allow unbinding. Is that the case here?


So just unbinding the driver from a device does not delete the device 
nor remove the device from its IOMMU group - so we keep the reference to 
the SMMU ko. As such, I don't know how to realistically test unloading 
the SMMU ko when we have platform devices involved. Maybe someone can 
enlighten me...


But I could do it on our D06 dev board, where all devices behind the 
SMMUs are PCI based:


--- Initial state ---

root@(none)$ dmesg | grep Adding
[   30.271801] pcieport 0000:00:00.0: Adding to iommu group 0
[   30.296088] pcieport 0000:00:04.0: Adding to iommu group 1
[   30.322234] pcieport 0000:00:08.0: Adding to iommu group 2
[   30.335641] pcieport 0000:00:0c.0: Adding to iommu group 3
[   30.343114] pcieport 0000:00:10.0: Adding to iommu group 4
[   30.355650] pcieport 0000:00:12.0: Adding to iommu group 5
[   30.366794] pcieport 0000:7c:00.0: Adding to iommu group 6
[   30.377993] hns3 0000:7d:00.0: Adding to iommu group 7
[   31.861957] hns3 0000:7d:00.1: Adding to iommu group 8
[   33.313967] hns3 0000:7d:00.2: Adding to iommu group 9
[   33.436029] hns3 0000:7d:00.3: Adding to iommu group 10
[   33.555935] hns3 0000:bd:00.0: Adding to iommu group 11
[   35.143851] pcieport 0000:74:00.0: Adding to iommu group 12
[   35.150736] pcieport 0000:80:00.0: Adding to iommu group 13
[   35.158910] pcieport 0000:80:08.0: Adding to iommu group 14
[   35.166860] pcieport 0000:80:0c.0: Adding to iommu group 15
[   35.174813] pcieport 0000:80:10.0: Adding to iommu group 16
[   35.182854] pcieport 0000:bc:00.0: Adding to iommu group 17
[   35.189702] pcieport 0000:b4:00.0: Adding to iommu group 18
[   35.196445] hisi_sas_v3_hw 0000:74:02.0: Adding to iommu group 19
[   39.410693] ahci 0000:74:03.0: Adding to iommu group 20
root@(none)$ lsmod
Not tainted
arm_smmu_v3 40960 21 - Live 0x88c6

--- Start removing devices ---

root@(none)$ echo 1 > ./sys/devices/pci0000:00/0000:00:00.0/remove
[   55.567808] pci_bus 0000:01: busn_res: [bus 01] is released
[   55.573514] pci 0000:00:00.0: Removing from iommu group 0
root@(none)$ echo 1 > ./sys/devices/pci0000:00/0000:00:04.0/remove
[   61.767425] pci_bus 0000:02: busn_res: [bus 02] is released
[   61.773132] pci 0000:00:04.0: Removing from iommu group 1
root@(none)$ echo 1 > ./sys/devices/pci0000:00/0000:00:04.0/remove
sh: ./sys/devices/pci0000:00/0000:00:04.0/remove: No such file or directory
root@(none)$ echo 1 > ./sys/devices/pci0000:00/0000:00:08.0/remove
[   75.635417] pci_bus 0000:03: busn_res: [bus 03] is released
[   75.641124] pci 0000:00:08.0: Removing from iommu group 2
root@(none)$ echo 1 > ./sys/devices/pci0000:00/0000:00:0c.0/remove
[   81.587419] pci_bus 0000:04: busn_res: [bus 04] is released
[   81.593110] pci 0000:00:0c.0: Removing from iommu group 3
root@(none)$ echo 1 > ./sys/devices/pci0000:00/0000:00:10.0/remove
[   85.607605] pci_bus 0000:05: busn_res: [bus 05] is released
[   85.613300] pci 0000:00:10.0: Removing from iommu group 4
root@(none)$ echo 1 > ./sys/devices/pci0000:00/0000:00:12.0/remove
[   92.731421] pci_bus 0000:06: busn_res: [bus 06] is released
[   92.737125] pci 0000:00:12.0: Removing from iommu group 5
root@(none)$ echo 1 > ./sys/devices/pci0000:7c/0000:7c:00.0/remove
[  102.286726] pci 0000:7d:00.0: Removing from iommu group 7
[  102.294157] pci 0000:7d:00.1: Remov

Re: [PATCH] iommu/arm-smmu: support SMMU module probing from the IORT

2019-11-27 Thread John Garry

On 25/11/2019 16:04, Lorenzo Pieralisi wrote:

On Fri, Nov 22, 2019 at 06:41:25PM +0100, Ard Biesheuvel wrote:

Add support for SMMU drivers built as modules to the ACPI/IORT device
probing path, by deferring the probe of the master if the SMMU driver is
known to exist but has not been loaded yet. Given that the IORT code
registers a platform device for each SMMU that it discovers, we can
easily trigger the udev based autoloading of the SMMU drivers by making
the platform device identifier part of the module alias.

Signed-off-by: Ard Biesheuvel 
---
  drivers/acpi/arm64/iort.c   | 4 ++--
  drivers/iommu/arm-smmu-v3.c | 1 +
  drivers/iommu/arm-smmu.c| 1 +
  3 files changed, 4 insertions(+), 2 deletions(-)


I think it is best if Will picks this up and add it to the
series that modularize the SMMU drivers:

Acked-by: Lorenzo Pieralisi 


Tested-by: John Garry  # only manual smmu ko loading




diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 5a7551d060f2..a696457a9b11 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -850,9 +850,9 @@ static inline bool iort_iommu_driver_enabled(u8 type)
  {
switch (type) {
case ACPI_IORT_NODE_SMMU_V3:
-   return IS_BUILTIN(CONFIG_ARM_SMMU_V3);
+   return IS_ENABLED(CONFIG_ARM_SMMU_V3);
case ACPI_IORT_NODE_SMMU:
-   return IS_BUILTIN(CONFIG_ARM_SMMU);
+   return IS_ENABLED(CONFIG_ARM_SMMU);
default:
pr_warn("IORT node type %u does not describe an SMMU\n", type);
return false;
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 7669beafc493..bf6a1e8eb9b0 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3733,4 +3733,5 @@ module_platform_driver(arm_smmu_driver);
  
  MODULE_DESCRIPTION("IOMMU API for ARM architected SMMUv3 implementations");

  MODULE_AUTHOR("Will Deacon ");
+MODULE_ALIAS("platform:arm-smmu-v3");
  MODULE_LICENSE("GPL v2");
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index d55acc48aee3..db5106b0955b 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -2292,4 +2292,5 @@ module_platform_driver(arm_smmu_driver);
  
  MODULE_DESCRIPTION("IOMMU API for ARM architected SMMU implementations");

  MODULE_AUTHOR("Will Deacon ");
+MODULE_ALIAS("platform:arm-smmu");
  MODULE_LICENSE("GPL v2");
--
2.20.1


.





Re: [PATCH 1/3] iommu: match the original algorithm

2019-11-27 Thread John Garry

On 21/11/2019 00:13, Cong Wang wrote:

The IOVA cache algorithm implemented in IOMMU code does not
exactly match the original algorithm described in the paper.

Particularly, it doesn't need to free the loaded empty magazine
when trying to put it back to global depot.

This patch makes it exactly match the original algorithm.



I haven't gone into the details, but this patch alone is giving this:

root@(none)$ [  123.857024] kmemleak: 8 new suspected memory leaks (see 
/sys/kernel/debug/kmemleak)


root@(none)$ cat /sys/kernel/debug/kmemleak
unreferenced object 0x002339843000 (size 2048):
  comm "swapper/0", pid 1, jiffies 4294898165 (age 122.688s)
  hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<1d2710bf>] kmem_cache_alloc+0x188/0x260
[] init_iova_domain+0x1e8/0x2a8
[<2646fc92>] iommu_setup_dma_ops+0x200/0x710
[] arch_setup_dma_ops+0x80/0x128
[<994e1e43>] acpi_dma_configure+0x11c/0x140
[] pci_dma_configure+0xe0/0x108
[] really_probe+0x210/0x548
[<87884b1b>] driver_probe_device+0x7c/0x148
[<10af2936>] device_driver_attach+0x94/0xa0
[] __driver_attach+0xa4/0x110
[] bus_for_each_dev+0xe8/0x158
[] driver_attach+0x30/0x40
[<3cf39ba8>] bus_add_driver+0x234/0x2f0
[<43830a45>] driver_register+0xbc/0x1d0
[] __pci_register_driver+0xb0/0xc8
[] sas_v3_pci_driver_init+0x20/0x28
unreferenced object 0x002339844000 (size 2048):
  comm "swapper/0", pid 1, jiffies 4294898165 (age 122.688s)

[snip]

And I don't feel like continuing until it's resolved

Thanks,
John


Cc: Joerg Roedel 
Signed-off-by: Cong Wang 
---
  drivers/iommu/iova.c | 14 --
  1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 41c605b0058f..92f72a85e62a 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -900,7 +900,7 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
  
  	if (!iova_magazine_full(cpu_rcache->loaded)) {

can_insert = true;
-   } else if (!iova_magazine_full(cpu_rcache->prev)) {
+   } else if (iova_magazine_empty(cpu_rcache->prev)) {
swap(cpu_rcache->prev, cpu_rcache->loaded);
can_insert = true;
} else {
@@ -909,8 +909,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
if (new_mag) {
spin_lock(&rcache->lock);
if (rcache->depot_size < MAX_GLOBAL_MAGS) {
-   rcache->depot[rcache->depot_size++] =
-   cpu_rcache->loaded;
+   swap(rcache->depot[rcache->depot_size], 
cpu_rcache->prev);
+   swap(cpu_rcache->prev, cpu_rcache->loaded);
+   rcache->depot_size++;
} else {
mag_to_free = cpu_rcache->loaded;
}
@@ -963,14 +964,15 @@ static unsigned long __iova_rcache_get(struct iova_rcache 
*rcache,
  
  	if (!iova_magazine_empty(cpu_rcache->loaded)) {

has_pfn = true;
-   } else if (!iova_magazine_empty(cpu_rcache->prev)) {
+   } else if (iova_magazine_full(cpu_rcache->prev)) {
swap(cpu_rcache->prev, cpu_rcache->loaded);
has_pfn = true;
} else {
spin_lock(&rcache->lock);
if (rcache->depot_size > 0) {
-   iova_magazine_free(cpu_rcache->loaded);
-   cpu_rcache->loaded = 
rcache->depot[--rcache->depot_size];
+   swap(rcache->depot[rcache->depot_size - 1], 
cpu_rcache->prev);
+   swap(cpu_rcache->prev, cpu_rcache->loaded);
+   rcache->depot_size--;
has_pfn = true;
}
spin_unlock(&rcache->lock);





Re: [Patch v2 2/3] iommu: optimize iova_magazine_free_pfns()

2019-11-29 Thread John Garry

On 29/11/2019 00:48, Cong Wang wrote:

If the maganize is empty, iova_magazine_free_pfns() should


magazine


be a nop, however it misses the case of mag->size==0. So we
should just call iova_magazine_empty().

This should reduce the contention on iovad->iova_rbtree_lock
a little bit.

Cc: Joerg Roedel 
Signed-off-by: Cong Wang 
---
  drivers/iommu/iova.c | 22 +++---
  1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cb473ddce4cf..184d4c0e20b5 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -797,13 +797,23 @@ static void iova_magazine_free(struct iova_magazine *mag)
kfree(mag);
  }
  
+static bool iova_magazine_full(struct iova_magazine *mag)

+{
+   return (mag && mag->size == IOVA_MAG_SIZE);
+}
+
+static bool iova_magazine_empty(struct iova_magazine *mag)
+{
+   return (!mag || mag->size == 0);
+}
+
  static void
  iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
  {
unsigned long flags;
int i;
  
-	if (!mag)

+   if (iova_magazine_empty(mag))


The only hot path where we make this call is 
__iova_rcache_insert()->iova_magazine_free_pfns(mag_to_free), and 
mag_to_free is full in this case, so I am not sure how the additional 
check helps, right?


Thanks,
John


return;
  
  	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);

@@ -820,16 +830,6 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct 
iova_domain *iovad)
mag->size = 0;
  }
  
-static bool iova_magazine_full(struct iova_magazine *mag)

-{
-   return (mag && mag->size == IOVA_MAG_SIZE);
-}
-
-static bool iova_magazine_empty(struct iova_magazine *mag)
-{
-   return (!mag || mag->size == 0);
-}
-
  static unsigned long iova_magazine_pop(struct iova_magazine *mag,
   unsigned long limit_pfn)
  {





Re: [Patch v2 3/3] iommu: avoid taking iova_rbtree_lock twice

2019-11-29 Thread John Garry

On 29/11/2019 00:48, Cong Wang wrote:

Both find_iova() and __free_iova() take iova_rbtree_lock,
there is no reason to take and release it twice inside
free_iova().

Fold them into the critical section by calling the unlock
versions instead.


Since generally the iova would be non-NULL, this seems a reasonable 
change (which could be mentioned in the commit log)


John



Cc: Joerg Roedel 
Signed-off-by: Cong Wang 
---
  drivers/iommu/iova.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 184d4c0e20b5..f46f8f794678 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova);
  void
  free_iova(struct iova_domain *iovad, unsigned long pfn)
  {
-   struct iova *iova = find_iova(iovad, pfn);
+   unsigned long flags;
+   struct iova *iova;
  
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);

+   iova = private_find_iova(iovad, pfn);
if (iova)
-   __free_iova(iovad, iova);
+   private_free_iova(iovad, iova);
+   spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
  
  }

  EXPORT_SYMBOL_GPL(free_iova);





Re: [Patch v2 1/3] iommu: match the original algorithm

2019-11-29 Thread John Garry

On 29/11/2019 00:48, Cong Wang wrote:

The IOVA cache algorithm implemented in IOMMU code does not
exactly match the original algorithm described in the paper.



which paper?


Particularly, it doesn't need to free the loaded empty magazine
when trying to put it back to global depot. To make it work, we
have to pre-allocate magazines in the depot and only recycle them
when all of them are full.

Before this patch, rcache->depot[] contains either full or
freed entries, after this patch, it contains either full or
empty (but allocated) entries.


I *quickly* tested this patch and got a small performance gain.



Cc: Joerg Roedel 
Signed-off-by: Cong Wang 
---
  drivers/iommu/iova.c | 45 +++-
  1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 41c605b0058f..cb473ddce4cf 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -862,12 +862,16 @@ static void init_iova_rcaches(struct iova_domain *iovad)
struct iova_cpu_rcache *cpu_rcache;
struct iova_rcache *rcache;
unsigned int cpu;
-   int i;
+   int i, j;
  
  	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {

rcache = &iovad->rcaches[i];
spin_lock_init(&rcache->lock);
rcache->depot_size = 0;
+   for (j = 0; j < MAX_GLOBAL_MAGS; ++j) {
+   rcache->depot[j] = iova_magazine_alloc(GFP_KERNEL);
+   WARN_ON(!rcache->depot[j]);
+   }
rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), 
cache_line_size());
if (WARN_ON(!rcache->cpu_rcaches))
continue;
@@ -900,24 +904,30 @@ static bool __iova_rcache_insert(struct iova_domain 
*iovad,
  
  	if (!iova_magazine_full(cpu_rcache->loaded)) {

can_insert = true;
-   } else if (!iova_magazine_full(cpu_rcache->prev)) {
+   } else if (iova_magazine_empty(cpu_rcache->prev)) {


is this change strictly necessary?


swap(cpu_rcache->prev, cpu_rcache->loaded);
can_insert = true;
} else {
-   struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
+   spin_lock(&rcache->lock);
+   if (rcache->depot_size < MAX_GLOBAL_MAGS) {
+   swap(rcache->depot[rcache->depot_size], 
cpu_rcache->prev);
+   swap(cpu_rcache->prev, cpu_rcache->loaded);
+   rcache->depot_size++;
+   can_insert = true;
+   } else {
+   mag_to_free = cpu_rcache->loaded;
+   }
+   spin_unlock(&rcache->lock);
+
+   if (mag_to_free) {
+   struct iova_magazine *new_mag = 
iova_magazine_alloc(GFP_ATOMIC);
  
-		if (new_mag) {

-   spin_lock(&rcache->lock);
-   if (rcache->depot_size < MAX_GLOBAL_MAGS) {
-   rcache->depot[rcache->depot_size++] =
-   cpu_rcache->loaded;
+   if (new_mag) {
+   cpu_rcache->loaded = new_mag;
+   can_insert = true;
} else {
-   mag_to_free = cpu_rcache->loaded;
+   mag_to_free = NULL;
}
-   spin_unlock(&rcache->lock);
-
-   cpu_rcache->loaded = new_mag;
-   can_insert = true;
}
}
  
@@ -963,14 +973,15 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
  
  	if (!iova_magazine_empty(cpu_rcache->loaded)) {

has_pfn = true;
-   } else if (!iova_magazine_empty(cpu_rcache->prev)) {
+   } else if (iova_magazine_full(cpu_rcache->prev)) {
swap(cpu_rcache->prev, cpu_rcache->loaded);
has_pfn = true;
} else {
spin_lock(&rcache->lock);
if (rcache->depot_size > 0) {
-   iova_magazine_free(cpu_rcache->loaded);


it is good to remove this from under the lock, apart from this change


-   cpu_rcache->loaded = 
rcache->depot[--rcache->depot_size];
+   swap(rcache->depot[rcache->depot_size - 1], 
cpu_rcache->prev);
+   swap(cpu_rcache->prev, cpu_rcache->loaded);
+   rcache->depot_size--;


I'm not sure how appropriate the name "depot_size" is any longer.


has_pfn = true;
}
spin_unlock(&rcache->lock);
@@ -1019,7 +1030,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
iova_magazine_free(cpu_rcache->prev);
}
free_percpu(rcache->cpu_rcaches);
-   for (j = 0; j < rcache->

Re: [Patch v2 2/3] iommu: optimize iova_magazine_free_pfns()

2019-12-02 Thread John Garry

On 30/11/2019 06:02, Cong Wang wrote:

On Fri, Nov 29, 2019 at 5:24 AM John Garry  wrote:


On 29/11/2019 00:48, Cong Wang wrote:

If the maganize is empty, iova_magazine_free_pfns() should


magazine


Good catch!




be a nop, however it misses the case of mag->size==0. So we
should just call iova_magazine_empty().

This should reduce the contention on iovad->iova_rbtree_lock
a little bit.

Cc: Joerg Roedel 
Signed-off-by: Cong Wang 
---
   drivers/iommu/iova.c | 22 +++---
   1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cb473ddce4cf..184d4c0e20b5 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -797,13 +797,23 @@ static void iova_magazine_free(struct iova_magazine *mag)
   kfree(mag);
   }

+static bool iova_magazine_full(struct iova_magazine *mag)
+{
+ return (mag && mag->size == IOVA_MAG_SIZE);
+}
+
+static bool iova_magazine_empty(struct iova_magazine *mag)
+{
+ return (!mag || mag->size == 0);
+}
+
   static void
   iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
   {
   unsigned long flags;
   int i;

- if (!mag)
+ if (iova_magazine_empty(mag))


The only hot path where we make this call is
__iova_rcache_insert()->iova_magazine_free_pfns(mag_to_free), and
mag_to_free is full in this case, so I am not sure how the additional
check helps, right?


This is what I mean by "a little bit" in changelog, did you miss it or
misunderstand it? :)


I was concerned that in the fastpath we actually make things very 
marginally slower by adding a check which will fail.


Thanks,
John



Thanks.
.





Re: [Patch v2 1/3] iommu: match the original algorithm

2019-12-02 Thread John Garry

On 30/11/2019 05:58, Cong Wang wrote:

On Fri, Nov 29, 2019 at 6:43 AM John Garry  wrote:


On 29/11/2019 00:48, Cong Wang wrote:

The IOVA cache algorithm implemented in IOMMU code does not
exactly match the original algorithm described in the paper.



which paper?


It's in drivers/iommu/iova.c, from line 769:

  769 /*
  770  * Magazine caches for IOVA ranges.  For an introduction to magazines,
  771  * see the USENIX 2001 paper "Magazines and Vmem: Extending the Slab
  772  * Allocator to Many CPUs and Arbitrary Resources" by Bonwick and Adams.
  773  * For simplicity, we use a static magazine size and don't implement the
  774  * dynamic size tuning described in the paper.
  775  */





Particularly, it doesn't need to free the loaded empty magazine
when trying to put it back to global depot. To make it work, we
have to pre-allocate magazines in the depot and only recycle them
when all of them are full.

Before this patch, rcache->depot[] contains either full or
freed entries, after this patch, it contains either full or
empty (but allocated) entries.


I *quickly* tested this patch and got a small performance gain.


Thanks for testing! It requires a different workload to see bigger gain,
in our case, 24 memcache.parallel servers with 120 clients.



So in fact I was getting a ~10% throughput boost for my storage test. 
Seems more than I would expect/hope for. I would need to test more.








Cc: Joerg Roedel 
Signed-off-by: Cong Wang 
---
   drivers/iommu/iova.c | 45 +++-
   1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 41c605b0058f..cb473ddce4cf 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -862,12 +862,16 @@ static void init_iova_rcaches(struct iova_domain *iovad)
   struct iova_cpu_rcache *cpu_rcache;
   struct iova_rcache *rcache;
   unsigned int cpu;
- int i;
+ int i, j;

   for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
   rcache = &iovad->rcaches[i];
   spin_lock_init(&rcache->lock);
   rcache->depot_size = 0;
+ for (j = 0; j < MAX_GLOBAL_MAGS; ++j) {
+ rcache->depot[j] = iova_magazine_alloc(GFP_KERNEL);
+ WARN_ON(!rcache->depot[j]);
+ }
   rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), 
cache_line_size());
   if (WARN_ON(!rcache->cpu_rcaches))
   continue;
@@ -900,24 +904,30 @@ static bool __iova_rcache_insert(struct iova_domain 
*iovad,

   if (!iova_magazine_full(cpu_rcache->loaded)) {
   can_insert = true;
- } else if (!iova_magazine_full(cpu_rcache->prev)) {
+ } else if (iova_magazine_empty(cpu_rcache->prev)) {


is this change strictly necessary?


Yes, because it is what is described in the paper. But it should be
functionally the same because cpu_rcache->prev is either full or empty.


That was what I was getting at.








   swap(cpu_rcache->prev, cpu_rcache->loaded);
   can_insert = true;
   } else {
- struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);


Apart from this change, did anyone ever consider kmem cache for the 
magazines?
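A sketch of that idea, assuming a dedicated slab cache for the
magazines in place of the kzalloc()/kfree() pairs (names illustrative,
untested):

static struct kmem_cache *iova_magazine_cache;

static int __init iova_magazine_cache_init(void)
{
	iova_magazine_cache = kmem_cache_create("iova_magazine",
						sizeof(struct iova_magazine),
						0, SLAB_HWCACHE_ALIGN, NULL);
	return iova_magazine_cache ? 0 : -ENOMEM;
}

static struct iova_magazine *iova_magazine_alloc(gfp_t flags)
{
	return kmem_cache_zalloc(iova_magazine_cache, flags);
}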



+ spin_lock(&rcache->lock);
+ if (rcache->depot_size < MAX_GLOBAL_MAGS) {
+ swap(rcache->depot[rcache->depot_size], cpu_rcache->prev);
+ swap(cpu_rcache->prev, cpu_rcache->loaded);
+ rcache->depot_size++;
+ can_insert = true;
+ } else {
+ mag_to_free = cpu_rcache->loaded;
+ }
+ spin_unlock(&rcache->lock);
+
+ if (mag_to_free) {
+ struct iova_magazine *new_mag = 
iova_magazine_alloc(GFP_ATOMIC);

- if (new_mag) {
- spin_lock(&rcache->lock);
- if (rcache->depot_size < MAX_GLOBAL_MAGS) {
- rcache->depot[rcache->depot_size++] =
- cpu_rcache->loaded;
+ if (new_mag) {
+ cpu_rcache->loaded = new_mag;
+ can_insert = true;
   } else {
- mag_to_free = cpu_rcache->loaded;
+ mag_to_free = NULL;
   }
- spin_unlock(&rcache->lock);
-
- cpu_rcache->loaded = new_mag;
- can_insert = true;
   }
   }

@@ -963,14 +973,15 @@ static unsigned long __iova_rcache_get(struct iova_rcache 
*rcache,

   if (!iova_magazine_empty(cpu_rcache->loaded)) {
   has_p

Re: [Patch v3 0/3] iommu: reduce spinlock contention on fast path

2019-12-10 Thread John Garry

On 06/12/2019 21:38, Cong Wang wrote:

This patchset contains three small optimizations for the global spinlock
contention in IOVA cache. Our memcache perf test shows this reduced its
p999 latency down by 45% on AMD when IOMMU is enabled.

Cong Wang (3):
   iommu: avoid unnecessary magazine allocations
   iommu: optimize iova_magazine_free_pfns()
   iommu: avoid taking iova_rbtree_lock twice
---
  drivers/iommu/iova.c | 75 ++--
  1 file changed, 45 insertions(+), 30 deletions(-)



I retested, and got a ~1.1% gain in throughput for my storage test - 
results here if interested https://pastebin.com/dSQxYpN8


Thanks,
John


Re: [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice

2019-12-19 Thread John Garry

On 18/12/2019 04:39, Cong Wang wrote:

Both find_iova() and __free_iova() take iova_rbtree_lock,
there is no reason to take and release it twice inside
free_iova().

Fold them into one critical section by calling the unlock
versions instead.

Cc: Joerg Roedel 
Cc: John Garry 
Signed-off-by: Cong Wang 


FWIW:
Reviewed-by: John Garry 


---
  drivers/iommu/iova.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 184d4c0e20b5..f46f8f794678 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova);
  void
  free_iova(struct iova_domain *iovad, unsigned long pfn)
  {
-   struct iova *iova = find_iova(iovad, pfn);
+   unsigned long flags;
+   struct iova *iova;
  
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);

+   iova = private_find_iova(iovad, pfn);
if (iova)
-   __free_iova(iovad, iova);
+   private_free_iova(iovad, iova);
+   spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
  
  }

  EXPORT_SYMBOL_GPL(free_iova);





Re: [PATCH v4 00/16] iommu: Permit modular builds of ARM SMMU[v3] drivers

2019-12-19 Thread John Garry

On 19/12/2019 12:03, Will Deacon wrote:

Hi all,

This is version four of the patches I previously posted here:

   v1: https://lore.kernel.org/lkml/20191030145112.19738-1-w...@kernel.org/
   v2: https://lore.kernel.org/lkml/20191108151608.20932-1-w...@kernel.org
   v3: https://lore.kernel.org/lkml/20191121114918.2293-1-w...@kernel.org

Changes since v3 include:

   * Based on v5.5-rc1
   * ACPI/IORT support (thanks to Ard)
   * Export pci_{enable,disable}_ats() (thanks to Greg)
   * Added review tags


Since it looks like not much has changed since v3, and I even tested 
removal of smmu-v3.ko [0] - this should be ok for the series:


Tested-by: John Garry  # smmu v3

[0] 
https://lore.kernel.org/linux-iommu/c8eb97b1-dab5-cc25-7669-2819f64a8...@huawei.com/




I tested this on AMD Seattle by loading arm-smmu-mod.ko from the initrd.

Cheers,

Will

Cc: Jean-Philippe Brucker 
Cc: Jordan Crouse 
Cc: John Garry 
Cc: Bjorn Helgaas 
Cc: Saravana Kannan 
Cc: Greg Kroah-Hartman 
Cc: "Isaac J. Manjarres" 
Cc: Robin Murphy 
Cc: Lorenzo Pieralisi 
Cc: Joerg Roedel 
Cc: Ard Biesheuvel 

--->8

Ard Biesheuvel (1):
   iommu/arm-smmu: Support SMMU module probing from the IORT

Greg Kroah-Hartman (1):
   PCI/ATS: Restore EXPORT_SYMBOL_GPL() for pci_{enable,disable}_ats()

Will Deacon (14):
   drivers/iommu: Export core IOMMU API symbols to permit modular drivers
   iommu/of: Request ACS from the PCI core when configuring IOMMU linkage
   PCI: Export pci_ats_disabled() as a GPL symbol to modules
   drivers/iommu: Take a ref to the IOMMU driver prior to ->add_device()
   iommu/of: Take a ref to the IOMMU driver during ->of_xlate()
   drivers/iommu: Allow IOMMU bus ops to be unregistered
   Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"
   Revert "iommu/arm-smmu: Make arm-smmu explicitly non-modular"
   iommu/arm-smmu: Prevent forced unbinding of Arm SMMU drivers
   iommu/arm-smmu-v3: Unregister IOMMU and bus ops on device removal
   iommu/arm-smmu-v3: Allow building as a module
   iommu/arm-smmu: Unregister IOMMU and bus ops on device removal
   iommu/arm-smmu: Allow building as a module
   iommu/arm-smmu: Update my email address in MODULE_AUTHOR()

  drivers/acpi/arm64/iort.c   |   4 +-
  drivers/iommu/Kconfig   |  16 -
  drivers/iommu/Makefile  |   3 +-
  drivers/iommu/arm-smmu-v3.c |  94 +-
  drivers/iommu/arm-smmu.c| 128 +---
  drivers/iommu/iommu-sysfs.c |   5 ++
  drivers/iommu/iommu.c   |  32 -
  drivers/iommu/of_iommu.c|  19 --
  drivers/pci/ats.c   |   2 +
  drivers/pci/pci.c   |   1 +
  include/linux/iommu.h   |   4 +-
  11 files changed, 223 insertions(+), 85 deletions(-)





arm-smmu-v3 high cpu usage for NVMe

2020-01-02 Thread John Garry



Hi Will, Robin,

While analyzing an arm64 issue in interrupt handling for NVMe [0], we 
have noticed a worryingly high CPU utilization in the SMMU driver.


The background is that we may get CPU lockup for high-throughput NVMe 
testing, and we noticed that disabling the SMMU during testing avoids 
the issue. However this lockup is a cross-architecture issue and there 
are attempts to address it, like [1]. To me, disabling the SMMU is just 
avoiding that specific issue.


Anyway, we should still consider this high CPU loading:

   PerfTop:1694 irqs/sec  kernel:97.3%  exact:  0.0% lost: 0/0 
drop: 0/0 [4000Hz cycles],  (all, CPU: 0)

--

50.84%  [kernel]   [k] arm_smmu_cmdq_issue_cmdlist
19.51%  [kernel]   [k] _raw_spin_unlock_irqrestore
 5.14%  [kernel]   [k] __slab_free
 2.37%  [kernel]   [k] bio_release_pages.part.42
 2.20%  [kernel]   [k] fput_many
 1.92%  [kernel]   [k] aio_complete_rw
 1.85%  [kernel]   [k] __arm_lpae_unmap
 1.71%  [kernel]   [k] arm_smmu_atc_inv_domain.constprop.42
 1.11%  [kernel]   [k] sbitmap_queue_clear
 1.05%  [kernel]   [k] blk_mq_free_request
 0.97%  [kernel]   [k] nvme_irq
 0.71%  [kernel]   [k] blk_account_io_done
 0.66%  [kernel]   [k] kmem_cache_free
 0.66%  [kernel]   [k] blk_mq_complete_request

This is for a CPU servicing the NVMe interrupt and doing the DMA unmap. 
The DMA unmap is done in threaded interrupt context.


And for the overall system, we have:

  PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 
drop: 0/40116 [4000Hz cycles],  (all, 96 CPUs)

--

27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
 6.35%  [kernel]  [k] _raw_spin_unlock_irq
 2.65%  [kernel]  [k] get_user_pages_fast
 2.03%  [kernel]  [k] __slab_free
 1.55%  [kernel]  [k] tick_nohz_idle_exit
 1.47%  [kernel]  [k] arm_lpae_map
 1.39%  [kernel]  [k] __fget
 1.14%  [kernel]  [k] __lock_text_start
 1.09%  [kernel]  [k] _raw_spin_lock
 1.08%  [kernel]  [k] bio_release_pages.part.42
 1.03%  [kernel]  [k] __sbitmap_get_word
 0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
 0.91%  [kernel]  [k] fput_many
 0.88%  [kernel]  [k] __arm_lpae_map

One thing to note is that we still spend an appreciable amount of time 
in arm_smmu_atc_inv_domain(), which is disappointing when considering it 
should effectively be a noop.
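The sort of early-out I would hope for is something like the sketch
below, assuming a count of ATS-enabled masters is maintained at
attach/detach time (nr_ats_masters is illustrative):

static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
				   int ssid, unsigned long iova, size_t size)
{
	/* True noop: no ATC to invalidate, so don't touch the cmdq. */
	if (!atomic_read(&smmu_domain->nr_ats_masters))
		return 0;

	/* ... otherwise build and issue CMD_ATC_INV as today ... */
	return 0;
}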


As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing 
our batch size is 1, so we're not seeing the real benefit of the 
batching. I can't help but think that we could improve this code to try 
to combine CMD SYNCs for small batches.
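Purely illustrative, not working code - one possible shape for that,
where all three helpers named below are hypothetical:

	if (sync) {
		/* If another CPU already queued a CMD_SYNC that lands
		 * after our commands, wait on that instead of paying
		 * for our own. */
		if (cmdq_has_sync_after(cmdq, llq.prod))
			ret = cmdq_wait_for_covering_sync(cmdq, &llq);
		else
			ret = cmdq_issue_sync(smmu, &llq);
	}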


Anyway, let me know your thoughts or any questions. I'll have a look, if 
I get a chance, for other possible bottlenecks.


[0] 
https://lore.kernel.org/lkml/e815b5451ea86e99d42045f7067f4...@www.loen.fr/


[1] 
https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbu...@kernel.org/


Cheers,
John

On 21/08/2019 16:17, Will Deacon wrote:

Hi again,

This is version two of the patches I posted yesterday:

   v1: https://lkml.kernel.org/r/20190820154549.17018-1-w...@kernel.org

Changes since then include:

   * Fix 'ats_enabled' checking when enabling ATS
   * Remove redundant 'dev_is_pci()' calls
   * Remove bool bitfield
   * Add patch temporarily disabling ATS detection for -stable
   * Issue ATC invalidation even when non-leaf
   * Elide invalidation/SYNC for zero-sized address ranges
   * Shuffle the patches round a bit

Thanks,

Will

Cc: Zhen Lei 
Cc: Jean-Philippe Brucker 
Cc: John Garry 
Cc: Robin Murphy 

--->8

Will Deacon (8):
   iommu/arm-smmu-v3: Document ordering guarantees of command insertion
   iommu/arm-smmu-v3: Disable detection of ATS and PRI
   iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag
   iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations
   iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters
   iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs
   iommu/arm-smmu-v3: Avoid locking on invalidation path when not using
 ATS
   Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI"

  drivers/iommu/arm-smmu-v3.c | 117 
  1 file changed, 87 insertions(+), 30 deletions(-)




