Re: [RFC v1 7/7] iommu/arm-smmu-v3: Enable ACPI based HiSilicon erratum 161010801
On 16/05/2017 15:03, Shameerali Kolothum Thodi wrote: > Lorenzo made a point that it might be relatively straightforward to just > follow the IORT mapping for the SMMU through to the ITS MADT entry and > pull the ITS geometry out of that. It would certainly be nicer to have > such a helper abstracted away in the IORT code than have to go parsing > vendor-specific tables directly in the SMMU driver. I reckon it might be > worth taking that idea a bit further to see how it looks. Ok. John has already mentioned this idea in our off-list discussion and we have one implementation where we go through the IORT node ID mappings array to find the associated IORT ITS node. It then iterates over the ACPI MADT GIC ITS table entries, looking for a match for the GIC ITS id, and retrieves the base address of the matching ITS. But as you said, it requires a helper from the IORT code, otherwise we will end up adding all the ACPI table parsing code to the SMMUv3 driver. We will take another look at this. Thanks, Shameer It could also be worth considering hard-coding the reserved address in the SMMUv3 driver for this model. This would not involve (more) reliance on the CSRT or IORT driver. However, note that hip07 has multiple SMMUs, and we only need to enable the quirk for the SMMU in front of the PCIe host controller. John BTW, FYI, hi161x is an alias for hip07/06 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
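For illustration, the lookup described above can be modelled in isolation: walk the SMMU node's ID mapping array to the ITS group it outputs to, then match that group's GIC ITS identifiers against the MADT entries to recover a base address. The structures below are simplified stand-ins, not the real packed ACPI table layouts (struct acpi_iort_id_mapping, struct acpi_madt_generic_translator), and the values in the comments are hypothetical:

```c
#include <stdint.h>

/* Simplified stand-ins: real IORT/MADT parsing works on the packed ACPI
 * table layouts; these structs only model the shape of the lookup. */
struct id_mapping { uint32_t input_base, id_count, output_base; int output_node; };
struct its_group  { int n_ids; uint32_t its_ids[4]; };         /* IORT ITS group node */
struct madt_its   { uint32_t its_id; uint64_t base_address; }; /* MADT GIC ITS entry  */

/* Follow the SMMU node's ID mappings to an ITS group node, then scan the
 * MADT GIC ITS entries for a matching identifier and return its base
 * address (0 if nothing matches). */
static uint64_t smmu_its_base(const struct id_mapping *maps, int n_maps,
                              const struct its_group *groups,
                              const struct madt_its *madt, int n_madt)
{
    for (int m = 0; m < n_maps; m++) {
        const struct its_group *g = &groups[maps[m].output_node];

        for (int i = 0; i < g->n_ids; i++)
            for (int t = 0; t < n_madt; t++)
                if (madt[t].its_id == g->its_ids[i])
                    return madt[t].base_address;
    }
    return 0;
}
```

In the real kernel this would sit behind an IORT helper, as suggested above, so the SMMUv3 driver never has to touch the MADT directly.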
Re: [PATCH v7 3/3] iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #126
On 30/05/2017 13:03, Geetha sowjanya wrote: From: Geetha Sowjanya Cavium ThunderX2 SMMU doesn't support MSI and also doesn't have unique, separate irq lines for gerror, eventq and cmdq-sync. This patch addresses the issue by checking whether any interrupt sources are using the same irq number; if so, they are registered as shared irqs. Signed-off-by: Geetha Sowjanya --- Documentation/arm64/silicon-errata.txt |1 + drivers/iommu/arm-smmu-v3.c| 29 + 2 files changed, 26 insertions(+), 4 deletions(-) diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt index 4693a32..42422f6 100644 --- a/Documentation/arm64/silicon-errata.txt +++ b/Documentation/arm64/silicon-errata.txt @@ -63,6 +63,7 @@ stable kernels. | Cavium | ThunderX Core | #27456 | CAVIUM_ERRATUM_27456 | | Cavium | ThunderX SMMUv2 | #27704 | N/A | | Cavium | ThunderX2 SMMUv3| #74 | N/A | +| Cavium | ThunderX2 SMMUv3| #126| N/A | || | | | | Freescale/NXP | LS2080A/LS1043A | A-008585| FSL_ERRATUM_A008585 | || | | | diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 4e80205..d2db01f 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2232,6 +2232,25 @@ static void arm_smmu_setup_msis(struct arm_smmu_device *smmu) devm_add_action(dev, arm_smmu_free_msis, dev); } +static int get_irq_flags(struct arm_smmu_device *smmu, int irq) +{ + int match_count = 0; + + if (irq == smmu->evtq.q.irq) + match_count++; + if (irq == smmu->cmdq.q.irq) + match_count++; + if (irq == smmu->gerr_irq) + match_count++; + if (irq == smmu->priq.q.irq) + match_count++; + + if (match_count > 1) + return IRQF_SHARED | IRQF_ONESHOT; + + return IRQF_ONESHOT; +} + static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu) { int ret, irq; @@ -2252,7 +2271,7 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu) if (irq) { ret = devm_request_threaded_irq(smmu->dev, irq, NULL, arm_smmu_evtq_thread, - IRQF_ONESHOT, + get_irq_flags(smmu, irq), 
"arm-smmu-v3-evtq", smmu); if (ret < 0) dev_warn(smmu->dev, "failed to enable evtq irq\n"); @@ -2261,7 +2280,8 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu) irq = smmu->cmdq.q.irq; if (irq) { ret = devm_request_irq(smmu->dev, irq, - arm_smmu_cmdq_sync_handler, 0, If it is indeed ok to change this to having IRQF_ONESHOT set now (right?), then can you add a note to the commit message? + arm_smmu_cmdq_sync_handler, + get_irq_flags(smmu, irq), indentation looks inconsistent "arm-smmu-v3-cmdq-sync", smmu); if (ret < 0) dev_warn(smmu->dev, "failed to enable cmdq-sync irq\n"); @@ -2270,7 +2290,8 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu) irq = smmu->gerr_irq; if (irq) { ret = devm_request_irq(smmu->dev, irq, arm_smmu_gerror_handler, - 0, "arm-smmu-v3-gerror", smmu); + get_irq_flags(smmu, irq), + "arm-smmu-v3-gerror", smmu); if (ret < 0) dev_warn(smmu->dev, "failed to enable gerror irq\n"); } @@ -2280,7 +2301,7 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu) if (irq) { ret = devm_request_threaded_irq(smmu->dev, irq, NULL, arm_smmu_priq_thread, - IRQF_ONESHOT, + get_irq_flags(smmu, irq), "arm-smmu-v3-priq", smmu); if (ret < 0) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 0/2] acpi/iort, numa: Add numa node mapping for smmuv3 devices
On 08/06/2017 05:44, Ganapatrao Kulkarni wrote: ARM IORT specification (rev. C) has added provision to define the proximity domain in the SMMUv3 IORT table. Adding required code to parse the proximity domain and set numa_node of SMMUv3 platform devices. v3: - Addressed Lorenzo Pieralisi's comment. v2: - Changed as per Lorenzo Pieralisi and Hanjun Guo suggestions. v1: - Initial patch Ganapatrao Kulkarni (2): acpica: iort: Update SMMUv3 header for proximity domain mapping acpi/iort: numa: Add numa node mapping for smmuv3 devices drivers/acpi/arm64/iort.c | 28 ++-- include/acpi/actbl2.h | 4 2 files changed, 30 insertions(+), 2 deletions(-) We'll try and test this in the next day or so. John
Re: [PATCH 0/8] io-pgtable lock removal
On 15/06/2017 01:40, Ray Jui via iommu wrote: Hi Robin, wangzhou tested this patchset on our SMMUv3-based development board with a 10G PCI NIC card. Currently we see a ~17% performance (throughput) drop when enabling the SMMU, but only an ~8% drop with your patchset. FYI, for our integrated storage and network adapter, we see a big performance hit (maybe 40%) when enabling the SMMU, with or without the patchset. Leizhen has been doing some investigation on this. Thanks, John Hi Robin, I have applied this patch series on top of v4.12-rc4, and ran various Ethernet and NVMf target throughput tests on it. To give you some background on my setup: The system is an ARMv8-based system with 8 cores. It has various PCIe root complexes that can be used to connect to PCIe endpoint devices, including NIC cards and NVMe SSDs. I'm particularly interested in the performance of the PCIe root complex that connects to the NIC card, and during my test, IOMMU is enabled/disabled against that particular PCIe root complex. The root complexes connected to NVMe SSDs remain unchanged (i.e., without IOMMU). 
For the Ethernet throughput out of a 50G link (note: during the multiple TCP session tests, each session is spread to different CPU cores for optimized performance):

Without IOMMU:
TX TCP x1 - 29.7 Gbps
TX TCP x4 - 30.5 Gbps
TX TCP x8 - 28 Gbps
RX TCP x1 - 15 Gbps
RX TCP x4 - 33.7 Gbps
RX TCP x8 - 36 Gbps

With IOMMU, but without your latest patch:
TX TCP x1 - 15.2 Gbps
TX TCP x4 - 14.3 Gbps
TX TCP x8 - 13 Gbps
RX TCP x1 - 7.88 Gbps
RX TCP x4 - 13.2 Gbps
RX TCP x8 - 12.6 Gbps

With IOMMU and your latest patch:
TX TCP x1 - 21.4 Gbps
TX TCP x4 - 30.5 Gbps
TX TCP x8 - 21.3 Gbps
RX TCP x1 - 7.7 Gbps
RX TCP x4 - 20.1 Gbps
RX TCP x8 - 27.1 Gbps

With the NVMf target test with 4 SSDs (fio based test, random read, 4k, 8 jobs):

Without IOMMU: IOPS = 1080K
With IOMMU, but without your latest patch: IOPS = 520K
With IOMMU and your latest patch: IOPS = 500K ~ 850K (a lot of variation observed during the same test run)

As you can see, performance has improved significantly with this patch series! That is very impressive! However, it is still off compared to the test runs without the IOMMU. I'm wondering if more improvement is expected. In addition, a much larger throughput variation is observed in the tests with these latest patches when multiple CPUs are involved. I'm wondering if that is caused by some remaining lock in the driver? Also, on a few occasions, I observed the following message during the test, when multiple cores are involved: arm-smmu 6400.mmu: TLB sync timed out -- SMMU may be deadlocked Thanks, Ray On 6/9/17 12:28 PM, Nate Watterson wrote: Hi Robin, On 6/8/2017 7:51 AM, Robin Murphy wrote: Hi all, Here's the cleaned up nominally-final version of the patches everybody's keen to see. #1 is just a non-critical thing-I-spotted-in-passing fix, #2-#4 do some preparatory work (and bid farewell to everyone's least favourite bit of code, hooray!), and #5-#8 do the dirty deed itself. 
The branch I've previously shared has been updated too: git://linux-arm.org/linux-rm iommu/pgtable All feedback welcome, as I'd really like to land this for 4.13. I tested the series on a QDF2400 development platform and see notable performance improvements, particularly in workloads that make concurrent accesses to a single iommu_domain. Robin. Robin Murphy (8): iommu/io-pgtable-arm-v7s: Check table PTEs more precisely iommu/io-pgtable-arm: Improve split_blk_unmap iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap iommu/io-pgtable: Introduce explicit coherency iommu/io-pgtable-arm: Support lockless operation iommu/io-pgtable-arm-v7s: Support lockless operation iommu/arm-smmu: Remove io-pgtable spinlock iommu/arm-smmu-v3: Remove io-pgtable spinlock drivers/iommu/arm-smmu-v3.c| 36 ++- drivers/iommu/arm-smmu.c | 48 -- drivers/iommu/io-pgtable-arm-v7s.c | 173 + drivers/iommu/io-pgtable-arm.c | 190 - drivers/iommu/io-pgtable.h | 6 ++ 5 files changed, 268 insertions(+), 185 deletions(-)
Re: [PATCH v2 0/8] io-pgtable lock removal
On 22/06/2017 16:53, Robin Murphy wrote: The feedback has been promising, so v2 is just a final update to cover a handful of memory ordering and cosmetic tweaks that came up when Will and I went through this offline. Thanks, Robin. Hi Robin, Is it worth us retesting this patchset? If yes, then as for a branch (for convenience), does this one now match v2: git://linux-arm.org/linux-rm iommu/pgtable I saw it was rewritten recently. Thanks, John Robin Murphy (8): iommu/io-pgtable-arm-v7s: Check table PTEs more precisely iommu/io-pgtable-arm: Improve split_blk_unmap iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap iommu/io-pgtable: Introduce explicit coherency iommu/io-pgtable-arm: Support lockless operation iommu/io-pgtable-arm-v7s: Support lockless operation iommu/arm-smmu: Remove io-pgtable spinlock iommu/arm-smmu-v3: Remove io-pgtable spinlock drivers/iommu/arm-smmu-v3.c| 36 ++- drivers/iommu/arm-smmu.c | 48 -- drivers/iommu/io-pgtable-arm-v7s.c | 179 +- drivers/iommu/io-pgtable-arm.c | 191 - drivers/iommu/io-pgtable.h | 6 ++ 5 files changed, 274 insertions(+), 186 deletions(-)
Re: [PATCH v2 0/8] io-pgtable lock removal
On 23/06/2017 10:58, Robin Murphy wrote: On 23/06/17 09:47, John Garry wrote: On 22/06/2017 16:53, Robin Murphy wrote: The feedback has been promising, so v2 is just a final update to cover a handful of memory ordering and cosmetic tweaks that came up when Will and I went through this offline. Thanks, Robin. Hi Robin, Is it worth us retesting this patchset? If yes, then as for a branch (for convenience), does this one now match v2: git://linux-arm.org/linux-rm iommu/pgtable I saw it was rewritten recently. Indeed, I pushed the branch out before posting. Functionally it ought to be identical to v1, so this was more just a heads-up - feel free to double-check I've not broken anything, but I wouldn't say you need to throw the full test lab at it again. I saw Will has already sent the pull request. But, FWIW, we are seeing roughly the same performance as the v1 patchset. For the PCI NIC, Zhou again found the performance drop goes from ~15% to ~8% with the SMMU enabled, and for the integrated storage controller [a platform device], we still see a drop of about 50%, depending on data rates (Leizhen has been working on fixing this). John Thanks, Robin. Thanks, John Robin Murphy (8): iommu/io-pgtable-arm-v7s: Check table PTEs more precisely iommu/io-pgtable-arm: Improve split_blk_unmap iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap iommu/io-pgtable: Introduce explicit coherency iommu/io-pgtable-arm: Support lockless operation iommu/io-pgtable-arm-v7s: Support lockless operation iommu/arm-smmu: Remove io-pgtable spinlock iommu/arm-smmu-v3: Remove io-pgtable spinlock drivers/iommu/arm-smmu-v3.c| 36 ++- drivers/iommu/arm-smmu.c | 48 -- drivers/iommu/io-pgtable-arm-v7s.c | 179 +- drivers/iommu/io-pgtable-arm.c | 191 - drivers/iommu/io-pgtable.h | 6 ++ 5 files changed, 274 insertions(+), 186 deletions(-)
Re: [PATCH v2 0/8] io-pgtable lock removal
I saw Will has already sent the pull request. But, FWIW, we are seeing roughly the same performance as the v1 patchset. For the PCI NIC, Zhou again found the performance drop goes from ~15% to ~8% with the SMMU enabled, and for the integrated storage controller [a platform device], we still see a drop of about 50%, depending on data rates (Leizhen has been working on fixing this). Thanks for confirming. Following Joerg's suggestion that the storage workloads may still depend on rbtree performance - it had slipped my mind that even with small block sizes those could well be grouped into scatterlists large enough to trigger a >64-page IOVA allocation - I've taken the liberty of cooking up a simplified version of Leizhen's rbtree optimisation series in the iommu/iova branch of my tree. I'll follow up on that after the merge window, but if anyone wants to play with it in the meantime feel free. Just a reminder that we also saw poor performance with our integrated NIC on your v1 patchset (I can push for v2 patchset testing, but expect the same). We might be able to now include an LSI 3108 PCI SAS card in our testing as well, to give a broader set of results. John Robin.
Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction
On 29/06/2017 03:08, Leizhen (ThunderTown) wrote: On 2017/6/28 17:32, Will Deacon wrote: Hi Zhen Lei, Nate (CC'd), Robin and I have been working on something very similar to this series, but this patch is different to what we had planned. More below. On Mon, Jun 26, 2017 at 09:38:46PM +0800, Zhen Lei wrote: Because all TLBI commands should be followed by a SYNC command to make sure they have completely finished, we can just add the TLBI commands into the queue and put off execution until we meet a SYNC or another command. To prevent a following SYNC command from waiting a long time because too many commands have been delayed, restrict the max delayed number. According to my test, I got the same performance data as when I replaced writel with writel_relaxed in queue_inc_prod. Signed-off-by: Zhen Lei --- drivers/iommu/arm-smmu-v3.c | 42 +- 1 file changed, 37 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 291da5f..4481123 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -337,6 +337,7 @@ /* Command queue */ #define CMDQ_ENT_DWORDS2 #define CMDQ_MAX_SZ_SHIFT 8 +#define CMDQ_MAX_DELAYED 32 #define CMDQ_ERR_SHIFT 24 #define CMDQ_ERR_MASK 0x7f @@ -472,6 +473,7 @@ struct arm_smmu_cmdq_ent { }; } cfgi; + #define CMDQ_OP_TLBI_NH_ALL 0x10 #define CMDQ_OP_TLBI_NH_ASID0x11 #define CMDQ_OP_TLBI_NH_VA 0x12 #define CMDQ_OP_TLBI_EL2_ALL0x20 @@ -499,6 +501,7 @@ struct arm_smmu_cmdq_ent { struct arm_smmu_queue { int irq; /* Wired interrupt */ + u32 nr_delay; __le64 *base; dma_addr_t base_dma; @@ -722,11 +725,16 @@ static int queue_sync_prod(struct arm_smmu_queue *q) return ret; } -static void queue_inc_prod(struct arm_smmu_queue *q) +static void queue_inc_swprod(struct arm_smmu_queue *q) { - u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1; + u32 prod = q->prod + 1; q->prod = Q_OVF(q, q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod); +} + +static void queue_inc_prod(struct arm_smmu_queue 
*q) +{ + queue_inc_swprod(q); writel(q->prod, q->prod_reg); } @@ -761,13 +769,24 @@ static void queue_write(__le64 *dst, u64 *src, size_t n_dwords) *dst++ = cpu_to_le64(*src++); } -static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent) +static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent, int optimize) { if (queue_full(q)) return -ENOSPC; queue_write(Q_ENT(q, q->prod), ent, q->ent_dwords); - queue_inc_prod(q); + + /* +* We don't want too many commands to be delayed, this may lead the +* followed sync command to wait for a long time. +*/ + if (optimize && (++q->nr_delay < CMDQ_MAX_DELAYED)) { + queue_inc_swprod(q); + } else { + queue_inc_prod(q); + q->nr_delay = 0; + } + So here, you're effectively putting invalidation commands into the command queue without updating PROD. Do you actually see a performance advantage from doing so? Another side of the argument would be that we should be Yes, my SAS SSD performance test showed that it can improve by about 100-150K/s (the same as when I directly replace writel with writel_relaxed). And the average execution time of iommu_unmap() (which is called by iommu_dma_unmap_sg) dropped from 10us to 5us. moving PROD as soon as we can, so that the SMMU can process invalidation commands in the background and reduce the cost of the final SYNC operation when the high-level unmap operation is complete. It may be that __iowmb() is more expensive than waiting for the tlbi to complete. Besides the cost of __iowmb() itself, it is also protected by the spinlock, so lock contention will rise rapidly under stress. __iowmb() costs 300-500ns on average (sorry, I forget the exact value). In addition, after applying this patchset, Robin's v2, and my earlier dma64 iova optimization patchset, our net performance test got the same data as global bypass. But SAS SSD performance is still down more than 20%. Maybe we should still focus on map/unmap, because the average execution time of iova alloc/free is only about 400ns. 
By the way, patches 2-5 are more effective than this one; they can improve performance by more than 350K/s. And with them, we got about a 100-150K/s improvement from Robin's v2; otherwise, I saw no effect from Robin's v2. Sorry, I have not tested this patch without patches 2-5. Furthermore, I got the same performance data as global bypass for a traditional mechanical hard disk with only patches 2-5 (without this patch or Robin's). Wil
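The batching scheme under discussion can be modelled without any hardware: TLBI commands advance only a software producer index, and the doorbell (the prod register write, with its __iowmb() cost) is published lazily, either when a SYNC arrives or once CMDQ_MAX_DELAYED commands have been held back. This is a toy model of the idea in the patch, not the driver code; the ring is simplified to a plain modulo counter:

```c
#include <stdint.h>

#define QUEUE_SIZE        256
#define CMDQ_MAX_DELAYED  32

/* sw_prod: where the next command is written.
 * hw_prod: what the "SMMU" has been told about (models the MMIO prod
 *          register in the real driver). */
struct cmdq {
    uint32_t sw_prod;
    uint32_t hw_prod;
    int nr_delay;
};

static void publish(struct cmdq *q)
{
    q->hw_prod = q->sw_prod;   /* models writel(q->prod, q->prod_reg) */
    q->nr_delay = 0;
}

/* optimize == 1 for TLBI commands, 0 for SYNC and everything else. */
static void insert_cmd(struct cmdq *q, int optimize)
{
    q->sw_prod = (q->sw_prod + 1) % QUEUE_SIZE;
    if (optimize && ++q->nr_delay < CMDQ_MAX_DELAYED)
        return;                /* hold back the doorbell write */
    publish(q);
}
```

The payoff is that one doorbell write (and one ordering barrier) is amortised over up to CMDQ_MAX_DELAYED invalidations, at the cost of the SMMU not starting on them until the batch is published - which is exactly the trade-off Will questions above.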
Re: [PATCH 1/1] iommu/arm-smmu-v3: eliminate a potential memory corruption on Hi16xx soc
On 15/10/2018 09:36, Zhen Lei wrote: ITS translation register map: 0x-0x003C Reserved 0x0040 GITS_TRANSLATER 0x0044-0xFFFC Reserved Can you add a better opening than the ITS translation register map? The standard GITS_TRANSLATER register in ITS is only 4 bytes, but Hisilicon expands the next 4 bytes to carry some IMPDEF information. That means 8 bytes of data will be written to MSIAddress each time. MSIAddr: |4bytes|4bytes| |MSIData |IMPDEF| There is no problem for ITS, because the next 4 bytes of space are reserved in ITS. But it will overwrite the 4 bytes of memory following "sync_count". It's very I think arm_smmu_device.sync_count is better, or "sync_count member in the smmu driver control struct". luckly that the previous and the next neighbour of "sync_count" are both aligned /s/luckly/luckily or fortunately/ by 8 bytes, so no problem is met now. It's good to explicitly add a workaround: 1. Add gcc __attribute__((aligned(8))) to make sure that "sync_count" is always aligned by 8 bytes. 2. Add a "u64" union member to make sure the 4 bytes of padding always exist. There is no functional change. Signed-off-by: Zhen Lei --- drivers/iommu/arm-smmu-v3.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 5059d09..a07bc0d 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -586,7 +586,10 @@ struct arm_smmu_device { struct arm_smmu_strtab_cfg strtab_cfg; + union { + u64 padding; /* workaround for Hisilicon */ I think that a more detailed comment is required. u32 sync_count; Can you indent these 2 members? However - as discussed internally - this may have an endianness issue, so it is better to declare a full 64b struct. + } __attribute__((aligned(8))); /* IOMMU core code handle */ struct iommu_device iommu; Thanks
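The layout being reviewed can be checked in isolation: since the erratum makes the hardware write 8 bytes at the MSI target address, sync_count must sit in an 8-byte-aligned slot whose trailing 4 bytes are guaranteed padding rather than a live neighbouring field. A standalone model (the fields around the union are hypothetical, only sync_count and the union shape come from the patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Model of the proposed workaround: the anonymous union reserves the
 * full 8 bytes the hardware writes, and the aligned attribute keeps the
 * slot from straddling into a neighbour. */
struct smmu_model {
    uint32_t features;        /* hypothetical neighbouring field       */
    union {
        uint64_t padding;     /* reserves the 8 bytes the hw writes    */
        uint32_t sync_count;  /* the 4 bytes the driver actually reads */
    } __attribute__((aligned(8)));
    uint32_t next_field;      /* can no longer be silently overwritten */
};
```

Note the endianness caveat raised in the review: on a big-endian kernel the 32-bit sync_count overlays the wrong half of the hardware's 64-bit write, which is why declaring a full 64-bit field was suggested instead of the union.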
Re: [PATCH v5 2/2] iommu/arm-smmu-v3: Reunify arm_smmu_cmdq_issue_sync()
On 17/10/2018 14:56, Robin Murphy wrote: Now that both sync methods are more or less the same shape, we can save some code and levels of indirection by rolling them up together again, with just a couple of simple conditionals to discriminate the MSI and queue-polling specifics. Hi Robin, Will, I had been thinking of this other patch previously: iommu/arm-smmu-v3: Stop rebuilding non-MSI CMD_SYNC commands The contents of the non-MSI CMD_SYNC command are fixed. This patch offers a small optimisation by keeping a pre-built copy of this command in static storage for re-use, thereby avoiding unnecessary re-building. Signed-off-by: John Garry diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 6947ccf..9d86c29 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -963,14 +963,16 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu) static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) { - u64 cmd[CMDQ_ENT_DWORDS]; + static u64 cmd[CMDQ_ENT_DWORDS] = { + FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) | + FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV) | + FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH) | + FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB), + 0}; unsigned long flags; bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV); - struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC }; int ret; - arm_smmu_cmdq_build_cmd(cmd, &ent); - spin_lock_irqsave(&smmu->cmdq.lock, flags); arm_smmu_cmdq_insert_cmd(smmu, cmd); ret = queue_poll_cons(&smmu->cmdq.q, true, wfe); But it seems that combining the MSI and non-MSI methods would block this. How do you feel about this? 
Thanks, John Signed-off-by: Robin Murphy --- drivers/iommu/arm-smmu-v3.c | 49 + 1 file changed, 12 insertions(+), 37 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index da8a91d116bf..36db63e3afcf 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -933,7 +933,7 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu, * The difference between val and sync_idx is bounded by the maximum size of * a queue at 2^20 entries, so 32 bits is plenty for wrap-safe arithmetic. */ -static int __arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx) +static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx) { ktime_t timeout; u32 val; @@ -988,16 +988,17 @@ static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx, return -ETIMEDOUT; } -static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu) +static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) { u64 cmd[CMDQ_ENT_DWORDS]; unsigned long flags; - struct arm_smmu_cmdq_ent ent = { - .opcode = CMDQ_OP_CMD_SYNC, - .sync = { - .msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count)), - }, - }; + bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) && + (smmu->features & ARM_SMMU_FEAT_COHERENCY); + struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC }; + int ret, sync_idx, sync_gen; + + if (msi) + ent.sync.msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count)); spin_lock_irqsave(&smmu->cmdq.lock, flags); @@ -1009,39 +1010,13 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu) arm_smmu_cmdq_build_cmd(cmd, &ent); arm_smmu_cmdq_insert_cmd(smmu, cmd); } - - spin_unlock_irqrestore(&smmu->cmdq.lock, flags); - - return __arm_smmu_sync_poll_msi(smmu, ent.sync.msidata); -} - -static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) -{ - u64 cmd[CMDQ_ENT_DWORDS]; - unsigned long flags; - struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC }; - int 
sync_idx, sync_gen; - - arm_smmu_cmdq_build_cmd(cmd, &ent); - - spin_lock_irqsave(&smmu->cmdq.lock, flags); - if (smmu->prev_cmd_opcode != CMDQ_OP_CMD_SYNC) - arm_smmu_cmdq_insert_cmd(smmu, cmd); sync_idx = smmu->cmdq.q.prod; sync_gen = READ_ONCE(smmu->cmdq_generation); + spin_unlock_irqrestore(&smmu->cmdq.lock, flags); - return arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen); -} - -static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) -{ - int ret; - bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) && - (smmu->features & ARM_SMMU_FEAT_COHERENCY); - - ret = msi ? __arm_smmu_cmdq_issue_sync_m
Re: [PATCH v5 1/2] iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock
On 17/10/2018 14:56, Robin Murphy wrote: Even without the MSI trick, we can still do a lot better than hogging the entire queue while it drains. All we actually need to do for the necessary guarantee of completion is wait for our particular command to have been consumed - as long as we keep track of where we inserted it, there is no need to block other CPUs from adding further commands in the meantime. There is one theoretical (but incredibly unlikely) edge case to avoid, where cons has wrapped twice to still appear 'behind' the sync position - this is easily disambiguated by adding a generation count to the queue to indicate when prod wraps, since cons cannot wrap twice without prod having wrapped at least once. This also makes it reasonable to separate the two conceptually different modes of polling such that command insertion - which really wants to be fair and have minimal latency - is not subject to exponential backoff, and returns to its original implementation. Signed-off-by: Robin Murphy --- v5: - Rework to incorporate the back-to-back sync elision. - Refactor the generation count slightly to preemptively help with the HiSilicon MSI workaround. - Split the cleanup into a separate patch for ease of review (it could happily be squashed when applied). The fundamental logic is copied directly from v4, but I've dropped the previous tested-by since there are a fair few subtle changes in how it's integrated. Patches are based on Will's iommu/devel branch plus my "Fix big-endian CMD_SYNC writes" patch. Robin. 
drivers/iommu/arm-smmu-v3.c | 94 +++-- 1 file changed, 69 insertions(+), 25 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 867ba548c2cc..da8a91d116bf 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -588,6 +588,7 @@ struct arm_smmu_device { struct arm_smmu_strtab_cfg strtab_cfg; u32 sync_count; + int cmdq_generation; /* IOMMU core code handle */ struct iommu_device iommu; @@ -676,6 +677,17 @@ static bool queue_empty(struct arm_smmu_queue *q) Q_WRP(q, q->prod) == Q_WRP(q, q->cons); } +static bool queue_behind(struct arm_smmu_queue *q, u32 idx) +{ + return Q_IDX(q, q->cons) < Q_IDX(q, idx); +} + +static bool queue_ahead_not_wrapped(struct arm_smmu_queue *q, u32 idx) +{ + return Q_IDX(q, q->cons) >= Q_IDX(q, idx) && + Q_WRP(q, q->cons) == Q_WRP(q, idx); +} + static void queue_sync_cons(struct arm_smmu_queue *q) { q->cons = readl_relaxed(q->cons_reg); @@ -709,33 +721,19 @@ static void queue_inc_prod(struct arm_smmu_queue *q) writel(q->prod, q->prod_reg); } -/* - * Wait for the SMMU to consume items. If sync is true, wait until the queue - * is empty. Otherwise, wait until there is at least one free slot. - */ -static int queue_poll_cons(struct arm_smmu_queue *q, bool sync, bool wfe) +static int queue_poll_cons(struct arm_smmu_queue *q, bool wfe) { - ktime_t timeout; - unsigned int delay = 1, spin_cnt = 0; + ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_POLL_TIMEOUT_US); - /* Wait longer if it's a CMD_SYNC */ - timeout = ktime_add_us(ktime_get(), sync ? - ARM_SMMU_CMDQ_SYNC_TIMEOUT_US : - ARM_SMMU_POLL_TIMEOUT_US); - - while (queue_sync_cons(q), (sync ? 
!queue_empty(q) : queue_full(q))) { + while (queue_sync_cons(q), queue_full(q)) { if (ktime_compare(ktime_get(), timeout) > 0) return -ETIMEDOUT; if (wfe) { wfe(); - } else if (++spin_cnt < ARM_SMMU_CMDQ_SYNC_SPIN_COUNT) { - cpu_relax(); - continue; } else { - udelay(delay); - delay *= 2; - spin_cnt = 0; + cpu_relax(); + udelay(1); } } @@ -905,8 +903,11 @@ static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd) smmu->prev_cmd_opcode = FIELD_GET(CMDQ_0_OP, cmd[0]); + if (Q_IDX(q, q->prod + 1) == 0) + WRITE_ONCE(smmu->cmdq_generation, smmu->cmdq_generation + 1); + while (queue_insert_raw(q, cmd) == -ENOSPC) { - if (queue_poll_cons(q, false, wfe)) + if (queue_poll_cons(q, wfe)) dev_err_ratelimited(smmu->dev, "CMDQ timeout\n"); } } @@ -945,6 +946,48 @@ static int __arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx) return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0; } +static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx, + int sync_gen) +{ +
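The wrap-disambiguation idea from the commit message - a generation count incremented whenever prod wraps, so that cons appearing "behind" the sync can be told apart from cons having lapped it twice - can be sketched as a toy model. The queue size and macros below are illustrative stand-ins, not the driver's Q_IDX/Q_WRP definitions, and the completion test is deliberately simplified:

```c
#include <stdint.h>

#define Q_SHIFT  8                              /* toy 2^8-entry queue   */
#define Q_IDX(p) ((p) & ((1u << Q_SHIFT) - 1))  /* index within the ring */
#define Q_WRP(p) ((p) & (1u << Q_SHIFT))        /* single wrap bit       */

/* Toy model: with only an index and one wrap bit, cons wrapping twice
 * is indistinguishable from cons still being behind the sync. A
 * generation count bumped whenever prod wraps resolves this, since cons
 * cannot wrap twice without prod having wrapped at least once more than
 * when the sync was inserted. */
static int sync_done(uint32_t cons, uint32_t sync_idx,
                     int gen_now, int sync_gen)
{
    if (gen_now - sync_gen >= 2)
        return 1;   /* the queue has certainly lapped the sync */

    /* "ahead and not wrapped": consumed in the same lap */
    return Q_IDX(cons) >= Q_IDX(sync_idx) && Q_WRP(cons) == Q_WRP(sync_idx);
}
```

The point of separating this check from command insertion, as the patch does, is that the CPU issuing the sync can poll it outside the cmdq lock while other CPUs keep inserting commands.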
Re: [PATCH v5 2/2] iommu/arm-smmu-v3: Reunify arm_smmu_cmdq_issue_sync()
On 18/10/2018 12:18, Robin Murphy wrote: Hi John, On 17/10/18 15:38, John Garry wrote: On 17/10/2018 14:56, Robin Murphy wrote: Now that both sync methods are more or less the same shape, we can save some code and levels of indirection by rolling them up together again, with just a couple of simple conditionals to discriminate the MSI and queue-polling specifics. Hi Robin, Will, I had been thinking of this other patch previously: iommu/arm-smmu-v3: Stop rebuilding non-MSI CMD_SYNC commands The contents of the non-MSI CMD_SYNC command are fixed. This patch offers a small optimisation by keeping a pre-built copy of this command in static storage for re-use, thereby avoiding unnecessary re-building. Hi Robin, As far as I can tell, not counting the general call overhead which can be solved in less-specific ways, this essentially saves two MOVs, a MOVK, and an STP which in practice might not even reach L1 before it gets forwarded back out of the store buffer. Do you have any numbers for what difference this makes in terms of I/O performance or cache traffic? I'll try to get some numbers. I'm not expecting anything big, but this just seemed like a cheap optimisation, maybe at the expense of slightly inconsistent code. Thanks, John Robin. 
Signed-off-by: John Garry diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 6947ccf..9d86c29 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -963,14 +963,16 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu) static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) { - u64 cmd[CMDQ_ENT_DWORDS]; + static u64 cmd[CMDQ_ENT_DWORDS] = { + FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) | + FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV) | + FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH) | + FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB), + 0}; unsigned long flags; bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV); - struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC }; int ret; - arm_smmu_cmdq_build_cmd(cmd, &ent); - spin_lock_irqsave(&smmu->cmdq.lock, flags); arm_smmu_cmdq_insert_cmd(smmu, cmd); ret = queue_poll_cons(&smmu->cmdq.q, true, wfe); But it seems that combining the the MSI and non-MSI methods would block this. How do you feel about this? Thanks, John Signed-off-by: Robin Murphy --- drivers/iommu/arm-smmu-v3.c | 49 + 1 file changed, 12 insertions(+), 37 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index da8a91d116bf..36db63e3afcf 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -933,7 +933,7 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu, * The difference between val and sync_idx is bounded by the maximum size of * a queue at 2^20 entries, so 32 bits is plenty for wrap-safe arithmetic. 
*/ -static int __arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx) +static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx) { ktime_t timeout; u32 val; @@ -988,16 +988,17 @@ static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx, return -ETIMEDOUT; } -static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu) +static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) { u64 cmd[CMDQ_ENT_DWORDS]; unsigned long flags; -struct arm_smmu_cmdq_ent ent = { -.opcode = CMDQ_OP_CMD_SYNC, -.sync= { -.msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count)), -}, -}; +bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) && + (smmu->features & ARM_SMMU_FEAT_COHERENCY); +struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC }; +int ret, sync_idx, sync_gen; + +if (msi) +ent.sync.msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count)); spin_lock_irqsave(&smmu->cmdq.lock, flags); @@ -1009,39 +1010,13 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu) arm_smmu_cmdq_build_cmd(cmd, &ent); arm_smmu_cmdq_insert_cmd(smmu, cmd); } - -spin_unlock_irqrestore(&smmu->cmdq.lock, flags); - -return __arm_smmu_sync_poll_msi(smmu, ent.sync.msidata); -} - -static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) -{ -u64 cmd[CMDQ_ENT_DWORDS]; -unsigned long flags; -struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC }; -int sync_idx, sync_gen; - -arm_smmu_cmdq_build_cmd(cmd, &ent); - -spin_lock_irqsave(&smmu->cmdq.lock, flags); -if (smmu->prev_cmd_opcode != CMDQ_OP_CMD_SYNC) -arm_smmu_cmdq_insert_cmd(smmu, cmd); sync_idx = s
Re: [PATCH v5 1/2] iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock
On 18/10/2018 12:19, Robin Murphy wrote: On 18/10/18 11:55, John Garry wrote: [...] @@ -976,18 +1019,19 @@ static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) { u64 cmd[CMDQ_ENT_DWORDS]; unsigned long flags; -bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV); struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC }; -int ret; +int sync_idx, sync_gen; arm_smmu_cmdq_build_cmd(cmd, &ent); spin_lock_irqsave(&smmu->cmdq.lock, flags); -arm_smmu_cmdq_insert_cmd(smmu, cmd); -ret = queue_poll_cons(&smmu->cmdq.q, true, wfe); +if (smmu->prev_cmd_opcode != CMDQ_OP_CMD_SYNC) +arm_smmu_cmdq_insert_cmd(smmu, cmd); Hi Robin, If we did stop rebuilding the non-MSI command as I suggested, then we would not have the case of building the command and then discarding it, right? I suppose so. But that build/discard case can also be avoided by applying patch 2/2 of this series ;) OK, I'll check it in more detail. TBH, I found 2/2 hard to follow from just the diff. Thanks, John Robin. Thanks, John +sync_idx = smmu->cmdq.q.prod; +sync_gen = READ_ONCE(smmu->cmdq_generation); spin_unlock_irqrestore(&smmu->cmdq.lock, flags); -return ret; +return arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen); } static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v5 1/2] iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock
On 18/10/2018 12:48, John Garry wrote: On 18/10/2018 12:19, Robin Murphy wrote: On 18/10/18 11:55, John Garry wrote: [...] @@ -976,18 +1019,19 @@ static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) { u64 cmd[CMDQ_ENT_DWORDS]; unsigned long flags; -bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV); struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC }; -int ret; +int sync_idx, sync_gen; arm_smmu_cmdq_build_cmd(cmd, &ent); spin_lock_irqsave(&smmu->cmdq.lock, flags); -arm_smmu_cmdq_insert_cmd(smmu, cmd); -ret = queue_poll_cons(&smmu->cmdq.q, true, wfe); +if (smmu->prev_cmd_opcode != CMDQ_OP_CMD_SYNC) +arm_smmu_cmdq_insert_cmd(smmu, cmd); Hi Robin, If we did stop rebuilding the non-MSI command as I suggested, then we would not have the case of building the command and then discarding it, right? I suppose so. But that build/discard case can also be avoided by applying patch 2/2 of this series ;) So we can avoid the build/discard but the command is built under the lock. Hopefully worth it. However there is still scope (and - considering the locking - maybe more need) for a static non-MSI CMD_SYNC command :) Anyway, we'll look to get some numbers. Cheers, OK, I'll check it in more detail. TBH, I found 2/2 hard to follow from just the diff. 
Re: [PATCH v5 2/2] iommu/arm-smmu-v3: Reunify arm_smmu_cmdq_issue_sync()
On 17/10/2018 14:56, Robin Murphy wrote: Now that both sync methods are more or less the same shape, we can save some code and levels of indirection by rolling them up together again, with just a couple of simple conditionals to discriminate the MSI and queue-polling specifics. Signed-off-by: Robin Murphy Tested-by: John Garry I seem to be getting some boost in the scenarios I tested: Storage controller: 746K IOPS (with) vs 730K (without) NVMe disk: 471K IOPS (with) vs 420K IOPS (without) Note that this is with strict mode set and without the CMD_SYNC optimisation I punted. And this is on D05, so no MSI-based CMD_SYNC support. Thanks, John Ps. for anyone testing, use the v1 smmu "Fix Endian" patch to get this series to apply. --- drivers/iommu/arm-smmu-v3.c | 49 + 1 file changed, 12 insertions(+), 37 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index da8a91d116bf..36db63e3afcf 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -933,7 +933,7 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu, * The difference between val and sync_idx is bounded by the maximum size of * a queue at 2^20 entries, so 32 bits is plenty for wrap-safe arithmetic. 
*/ -static int __arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx) +static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx) { ktime_t timeout; u32 val; @@ -988,16 +988,17 @@ static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx, return -ETIMEDOUT; } -static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu) +static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) { u64 cmd[CMDQ_ENT_DWORDS]; unsigned long flags; - struct arm_smmu_cmdq_ent ent = { - .opcode = CMDQ_OP_CMD_SYNC, - .sync = { - .msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count)), - }, - }; + bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) && + (smmu->features & ARM_SMMU_FEAT_COHERENCY); + struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC }; + int ret, sync_idx, sync_gen; + + if (msi) + ent.sync.msiaddr = cpu_to_le32(virt_to_phys(&smmu->sync_count)); spin_lock_irqsave(&smmu->cmdq.lock, flags); @@ -1009,39 +1010,13 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu) arm_smmu_cmdq_build_cmd(cmd, &ent); arm_smmu_cmdq_insert_cmd(smmu, cmd); } - - spin_unlock_irqrestore(&smmu->cmdq.lock, flags); - - return __arm_smmu_sync_poll_msi(smmu, ent.sync.msidata); -} - -static int __arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) -{ - u64 cmd[CMDQ_ENT_DWORDS]; - unsigned long flags; - struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC }; - int sync_idx, sync_gen; - - arm_smmu_cmdq_build_cmd(cmd, &ent); - - spin_lock_irqsave(&smmu->cmdq.lock, flags); - if (smmu->prev_cmd_opcode != CMDQ_OP_CMD_SYNC) - arm_smmu_cmdq_insert_cmd(smmu, cmd); sync_idx = smmu->cmdq.q.prod; sync_gen = READ_ONCE(smmu->cmdq_generation); + spin_unlock_irqrestore(&smmu->cmdq.lock, flags); - return arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen); -} - -static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu) -{ - int ret; - bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) && - (smmu->features & 
ARM_SMMU_FEAT_COHERENCY); - - ret = msi ? __arm_smmu_cmdq_issue_sync_msi(smmu) - : __arm_smmu_cmdq_issue_sync(smmu); + ret = msi ? arm_smmu_sync_poll_msi(smmu, ent.sync.msidata) + : arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen); if (ret) dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout\n"); }
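The comment quoted in the patch above ("32 bits is plenty for wrap-safe arithmetic") relies on the usual modular-arithmetic trick: as long as the real distance between the two counters stays well below 2^31 (the queue bound of 2^20 entries gives plenty of margin), interpreting their 32-bit difference as signed yields the right ordering even across a counter wrap. A small illustrative sketch; the function name is mine, not the driver's:

```c
#include <assert.h>
#include <stdint.h>

/* Has `val` reached `sync_idx` yet? Correct across a 32-bit wrap as
 * long as the true distance between the counters is below 2^31. */
static int sync_reached(uint32_t val, uint32_t sync_idx)
{
	/* unsigned subtraction wraps; the signed view gives the ordering */
	return (int32_t)(val - sync_idx) >= 0;
}
```

This is the same idiom the kernel uses elsewhere for sequence-number comparison (e.g. `time_after()`), which is why a plain `val >= sync_idx` test would be wrong near the wrap point.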
Re: [PATCH v2 1/1] iommu/arm-smmu-v3: eliminate a potential memory corruption on Hi16xx soc
On 30/10/2018 01:52, Leizhen (ThunderTown) wrote: On 2018/10/30 1:59, Will Deacon wrote: On Sat, Oct 20, 2018 at 03:36:54PM +0800, Zhen Lei wrote: The standard GITS_TRANSLATER register in ITS is only 4 bytes, but Hisilicon expands the next 4 bytes to carry some IMPDEF information. That means a total of 8 bytes of data will be written to MSIAddress each time. MSIAddr: |4bytes|4bytes| |MSIData |IMPDEF| There is no problem for ITS, because the next 4 bytes space is reserved in ITS. But it will overwrite the 4 bytes of memory following "sync_count". It's very fortunate that the previous and the next neighbour of the "sync_count" are both aligned by 8 bytes, so no problem is met now. It's good to explicitly add a workaround: 1. Add gcc __attribute__((aligned(8))) to make sure that "sync_count" is always aligned by 8 bytes. 2. Add an "int" struct member to make sure the 4 bytes of padding always exist. There is no functional change. Signed-off-by: Zhen Lei --- drivers/iommu/arm-smmu-v3.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 5059d09..624fdd0 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -586,7 +586,20 @@ struct arm_smmu_device { struct arm_smmu_strtab_cfg strtab_cfg; - u32 sync_count; + /* +* The alignment and padding is required by Hi16xx of Hisilicon. Nit: I know that there is no functional change related to this bug which depends on the chip version, but how about "hi1620 and earlier"? hi16xx is very broad. Thanks +* Because the ITS hardware on Hi16xx will truncate the MSIAddress (here +* it's the address of "sync_count") to an 8 bytes boundary first, then +* write 32 bits MSIdata at offset 0, and 32 bits IMPDEF data at offset +* 4. Without this workaround, the adjacent member may be overwritten. 
+* +*|---4bytes---|---4bytes---| +* MSIAddress & (~0x7): MSIdata | IMPDEF data| +*/ + struct { + u32 sync_count; + int padding; + } __attribute__((aligned(8))); I thought the conclusion after reviewing your original patch was to maintain the union and drop the alignment directive? e.g. union { u32 sync_count; u64 padding; /* Hi16xx writes an extra 32 bits of goodness */ }; OK, I will send v3. Will .
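Will's suggested union gives both properties the workaround needs without an explicit attribute: the u64 member forces the union's size to 8 bytes (reserving the IMPDEF slot), and on typical 64-bit targets such as arm64 it forces natural 8-byte alignment as well. A compile-checked sketch of that shape, standalone rather than the driver's actual struct:

```c
#include <assert.h>
#include <stdint.h>

/* The u64 member reserves the 4 bytes of IMPDEF data the Hi16xx ITS
 * writes after MSIdata, and naturally aligns the union to 8 bytes, so
 * MSIAddress & ~0x7 always lands at the start of the union itself. */
union sync_area {
	uint32_t sync_count;
	uint64_t padding;	/* Hi16xx writes an extra 32 bits */
};

_Static_assert(sizeof(union sync_area) == 8,
	       "8 bytes reserved for MSIdata + IMPDEF data");
_Static_assert(_Alignof(union sync_area) == 8,
	       "truncated MSIAddress stays within the union");
```

Note the alignment assertion assumes `_Alignof(uint64_t) == 8`, which holds on arm64 but is not guaranteed by the C standard on every ABI.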
[PATCH] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
Change function __iommu_dma_alloc_pages() to allocate memory/pages for DMA from respective device NUMA node. Originally-from: Ganapatrao Kulkarni Signed-off-by: John Garry --- This patch was originally posted by Ganapatrao in [1] *. However, after initial review, it was never reposted (due to lack of cycles, I think). In addition, the functionality in its sibling patches was merged through other patches, as mentioned in [2]; this also refers to a discussion on device local allocations vs CPU local allocations for DMA pool, and which is better [3]. However, as mentioned in [3], dma_alloc_coherent() uses the locality information from the device - as in direct DMA - so this patch is just applying this same policy. [1] https://lore.kernel.org/patchwork/patch/833004/ [2] https://lkml.org/lkml/2018/8/22/391 [3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html * Authorship on this updated patch may need to be fixed - I did not want to add Ganapatrao's SOB without permission. 
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d1b0475..ada00bc 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count) kvfree(pages); } -static struct page **__iommu_dma_alloc_pages(unsigned int count, - unsigned long order_mask, gfp_t gfp) +static struct page **__iommu_dma_alloc_pages(struct device *dev, + unsigned int count, unsigned long order_mask, gfp_t gfp) { struct page **pages; - unsigned int i = 0, array_size = count * sizeof(*pages); + unsigned int i = 0, nid = dev_to_node(dev); order_mask &= (2U << MAX_ORDER) - 1; if (!order_mask) return NULL; - if (array_size <= PAGE_SIZE) - pages = kzalloc(array_size, GFP_KERNEL); - else - pages = vzalloc(array_size); + pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid); if (!pages) return NULL; @@ -483,8 +480,10 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count, unsigned int order = __fls(order_mask); order_size = 1U << order; - page = alloc_pages((order_mask - order_size) ? - gfp | __GFP_NORETRY : gfp, order); + page = alloc_pages_node(nid, + (order_mask - order_size) ? + gfp | __GFP_NORETRY : gfp, + order); if (!page) continue; if (!order) @@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp, alloc_sizes = min_size; count = PAGE_ALIGN(size) >> PAGE_SHIFT; - pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp); + pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT, + gfp); if (!pages) return NULL; -- 1.9.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 2/9] dmapool: remove checks for dev == NULL
On 12/11/2018 15:42, Tony Battersby wrote: dmapool originally tried to support pools without a device because dma_alloc_coherent() supports allocations without a device. But nobody ended up using dma pools without a device, so the current checks in dmapool.c for pool->dev == NULL are both insufficient and causing bloat. Remove them. As an aside, is it right that dma_pool_create() does not actually reject dev==NULL and would crash from a NULL-pointer dereference? Thanks, John Signed-off-by: Tony Battersby --- --- linux/mm/dmapool.c.orig 2018-08-03 16:12:23.0 -0400 +++ linux/mm/dmapool.c 2018-08-03 16:13:44.0 -0400 @@ -277,7 +277,7 @@ void dma_pool_destroy(struct dma_pool *p mutex_lock(&pools_reg_lock); mutex_lock(&pools_lock); list_del(&pool->pools); - if (pool->dev && list_empty(&pool->dev->dma_pools)) + if (list_empty(&pool->dev->dma_pools)) empty = true; mutex_unlock(&pools_lock); if (empty) @@ -289,13 +289,9 @@ void dma_pool_destroy(struct dma_pool *p page = list_entry(pool->page_list.next, struct dma_page, page_list); if (is_page_busy(page)) { - if (pool->dev) - dev_err(pool->dev, - "dma_pool_destroy %s, %p busy\n", - pool->name, page->vaddr); - else - pr_err("dma_pool_destroy %s, %p busy\n", - pool->name, page->vaddr); + dev_err(pool->dev, + "dma_pool_destroy %s, %p busy\n", + pool->name, page->vaddr); /* leak the still-in-use consistent memory */ list_del(&page->page_list); kfree(page); @@ -357,13 +353,9 @@ void *dma_pool_alloc(struct dma_pool *po for (i = sizeof(page->offset); i < pool->size; i++) { if (data[i] == POOL_POISON_FREED) continue; - if (pool->dev) - dev_err(pool->dev, - "dma_pool_alloc %s, %p (corrupted)\n", - pool->name, retval); - else - pr_err("dma_pool_alloc %s, %p (corrupted)\n", - pool->name, retval); + dev_err(pool->dev, + "dma_pool_alloc %s, %p (corrupted)\n", + pool->name, retval); /* * Dump the first 4 bytes even if they are not @@ -418,13 +410,9 @@ void dma_pool_free(struct dma_pool *pool page = pool_find_page(pool, dma); if (!page) { 
spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - dev_err(pool->dev, - "dma_pool_free %s, %p/%lx (bad dma)\n", - pool->name, vaddr, (unsigned long)dma); - else - pr_err("dma_pool_free %s, %p/%lx (bad dma)\n", - pool->name, vaddr, (unsigned long)dma); + dev_err(pool->dev, + "dma_pool_free %s, %p/%lx (bad dma)\n", + pool->name, vaddr, (unsigned long)dma); return; } @@ -432,13 +420,9 @@ void dma_pool_free(struct dma_pool *pool #ifdef DMAPOOL_DEBUG if ((dma - page->dma) != offset) { spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - dev_err(pool->dev, - "dma_pool_free %s, %p (bad vaddr)/%pad\n", - pool->name, vaddr, &dma); - else - pr_err("dma_pool_free %s, %p (bad vaddr)/%pad\n", - pool->name, vaddr, &dma); + dev_err(pool->dev, + "dma_pool_free %s, %p (bad vaddr)/%pad\n", + pool->name, vaddr, &dma); return; } { @@ -449,12 +433,9 @@ void dma_pool_free(struct dma_pool *pool continue; } spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - dev_err(pool->dev, "dma_pool_free %s, dma %pad already free\n", - pool->name, &dma); - else - pr_err("dma_pool_free %s, dma %pad already free\n", - pool->name, &dma); + dev_err(pool->dev, + "dm
Re: [PATCH] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
On 08/11/2018 17:55, John Garry wrote: Change function __iommu_dma_alloc_pages() to allocate memory/pages for DMA from respective device NUMA node. Ping: a friendly reminder on this patch. Thanks Originally-from: Ganapatrao Kulkarni Signed-off-by: John Garry --- This patch was originally posted by Ganapatrao in [1] *. However, after initial review, it was never reposted (due to lack of cycles, I think). In addition, the functionality in its sibling patches was merged through other patches, as mentioned in [2]; this also refers to a discussion on device local allocations vs CPU local allocations for DMA pool, and which is better [3]. However, as mentioned in [3], dma_alloc_coherent() uses the locality information from the device - as in direct DMA - so this patch is just applying this same policy. [1] https://lore.kernel.org/patchwork/patch/833004/ [2] https://lkml.org/lkml/2018/8/22/391 [3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html * Authorship on this updated patch may need to be fixed - I did not want to add Ganapatrao's SOB without permission. 
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d1b0475..ada00bc 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count) kvfree(pages); } -static struct page **__iommu_dma_alloc_pages(unsigned int count, - unsigned long order_mask, gfp_t gfp) +static struct page **__iommu_dma_alloc_pages(struct device *dev, + unsigned int count, unsigned long order_mask, gfp_t gfp) { struct page **pages; - unsigned int i = 0, array_size = count * sizeof(*pages); + unsigned int i = 0, nid = dev_to_node(dev); order_mask &= (2U << MAX_ORDER) - 1; if (!order_mask) return NULL; - if (array_size <= PAGE_SIZE) - pages = kzalloc(array_size, GFP_KERNEL); - else - pages = vzalloc(array_size); + pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid); if (!pages) return NULL; @@ -483,8 +480,10 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count, unsigned int order = __fls(order_mask); order_size = 1U << order; - page = alloc_pages((order_mask - order_size) ? - gfp | __GFP_NORETRY : gfp, order); + page = alloc_pages_node(nid, + (order_mask - order_size) ? + gfp | __GFP_NORETRY : gfp, + order); if (!page) continue; if (!order) @@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp, alloc_sizes = min_size; count = PAGE_ALIGN(size) >> PAGE_SHIFT; - pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp); + pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT, + gfp); if (!pages) return NULL; ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
On 20/11/2018 10:09, Ganapatrao Kulkarni wrote: Hi John, On Tue, Nov 20, 2018 at 3:35 PM John Garry wrote: On 08/11/2018 17:55, John Garry wrote: Change function __iommu_dma_alloc_pages() to allocate memory/pages for DMA from respective device NUMA node. Ping a friendly reminder on this patch. Thanks Originally-from: Ganapatrao Kulkarni Signed-off-by: John Garry --- This patch was originally posted by Ganapatrao in [1] *. However, after initial review, it was never reposted (due to lack of cycles, I think). In addition, the functionality in its sibling patches were merged through patches, as mentioned in [2]; this also refers to a discussion on device local allocations vs CPU local allocations for DMA pool, and which is better [3]. However, as mentioned in [3], dma_alloc_coherent() uses the locality information from the device - as in direct DMA - so this patch is just applying this same policy. [1] https://lore.kernel.org/patchwork/patch/833004/ [2] https://lkml.org/lkml/2018/8/22/391 [3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html * Authorship on this updated patch may need to be fixed - I add not want to add Ganapatrao's SOB without permission. thanks for taking this up. please feel free to add my SoB. OK, I will also make you the author and repost. 
Thanks, John diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d1b0475..ada00bc 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count) kvfree(pages); } -static struct page **__iommu_dma_alloc_pages(unsigned int count, - unsigned long order_mask, gfp_t gfp) +static struct page **__iommu_dma_alloc_pages(struct device *dev, + unsigned int count, unsigned long order_mask, gfp_t gfp) { struct page **pages; - unsigned int i = 0, array_size = count * sizeof(*pages); + unsigned int i = 0, nid = dev_to_node(dev); order_mask &= (2U << MAX_ORDER) - 1; if (!order_mask) return NULL; - if (array_size <= PAGE_SIZE) - pages = kzalloc(array_size, GFP_KERNEL); - else - pages = vzalloc(array_size); + pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid); if (!pages) return NULL; @@ -483,8 +480,10 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count, unsigned int order = __fls(order_mask); order_size = 1U << order; - page = alloc_pages((order_mask - order_size) ? -gfp | __GFP_NORETRY : gfp, order); + page = alloc_pages_node(nid, + (order_mask - order_size) ? + gfp | __GFP_NORETRY : gfp, + order); if (!page) continue; if (!order) @@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp, alloc_sizes = min_size; count = PAGE_ALIGN(size) >> PAGE_SHIFT; - pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp); + pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT, + gfp); if (!pages) return NULL; thanks Ganapat . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
From: Ganapatrao Kulkarni Change function __iommu_dma_alloc_pages() to allocate memory/pages for DMA from respective device NUMA node. Signed-off-by: Ganapatrao Kulkarni [JPG: Modified to use kvzalloc() and fixed indentation] Signed-off-by: John Garry --- Difference v1->v2: - Add Ganapatrao's tag and change author This patch was originally posted by Ganapatrao in [1]. However, after initial review, it was never reposted (due to lack of cycles, I think). In addition, the functionality in its sibling patches was merged through other patches, as mentioned in [2]; this also refers to a discussion on device local allocations vs CPU local allocations for DMA pool, and which is better [3]. However, as mentioned in [3], dma_alloc_coherent() uses the locality information from the device - as in direct DMA - so this patch is just applying this same policy. [1] https://lore.kernel.org/patchwork/patch/833004/ [2] https://lkml.org/lkml/2018/8/22/391 [3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d1b0475..ada00bc 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count) kvfree(pages); } -static struct page **__iommu_dma_alloc_pages(unsigned int count, - unsigned long order_mask, gfp_t gfp) +static struct page **__iommu_dma_alloc_pages(struct device *dev, + unsigned int count, unsigned long order_mask, gfp_t gfp) { struct page **pages; - unsigned int i = 0, array_size = count * sizeof(*pages); + unsigned int i = 0, nid = dev_to_node(dev); order_mask &= (2U << MAX_ORDER) - 1; if (!order_mask) return NULL; - if (array_size <= PAGE_SIZE) - pages = kzalloc(array_size, GFP_KERNEL); - else - pages = vzalloc(array_size); + pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid); if (!pages) return NULL; @@ -483,8 +480,10 @@ static struct page **__iommu_dma_alloc_pages(unsigned int 
count, unsigned int order = __fls(order_mask); order_size = 1U << order; - page = alloc_pages((order_mask - order_size) ? - gfp | __GFP_NORETRY : gfp, order); + page = alloc_pages_node(nid, + (order_mask - order_size) ? + gfp | __GFP_NORETRY : gfp, + order); if (!page) continue; if (!order) @@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp, alloc_sizes = min_size; count = PAGE_ALIGN(size) >> PAGE_SHIFT; - pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp); + pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT, + gfp); if (!pages) return NULL; -- 1.9.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
On 20/11/2018 14:20, Robin Murphy wrote: On 20/11/2018 13:42, John Garry wrote: From: Ganapatrao Kulkarni Change function __iommu_dma_alloc_pages() to allocate memory/pages for DMA from respective device NUMA node. Signed-off-by: Ganapatrao Kulkarni [JPG: Modifed to use kvzalloc() and fixed indentation] Signed-off-by: John Garry --- Difference v1->v2: - Add Ganapatrao's tag and change author This patch was originally posted by Ganapatrao in [1]. However, after initial review, it was never reposted (due to lack of cycles, I think). In addition, the functionality in its sibling patches were merged through patches, as mentioned in [2]; this also refers to a discussion on device local allocations vs CPU local allocations for DMA pool, and which is better [3]. However, as mentioned in [3], dma_alloc_coherent() uses the locality information from the device - as in direct DMA - so this patch is just applying this same policy. [1] https://lore.kernel.org/patchwork/patch/833004/ [2] https://lkml.org/lkml/2018/8/22/391 [3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d1b0475..ada00bc 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count) kvfree(pages); } -static struct page **__iommu_dma_alloc_pages(unsigned int count, -unsigned long order_mask, gfp_t gfp) +static struct page **__iommu_dma_alloc_pages(struct device *dev, +unsigned int count, unsigned long order_mask, gfp_t gfp) { struct page **pages; -unsigned int i = 0, array_size = count * sizeof(*pages); +unsigned int i = 0, nid = dev_to_node(dev); order_mask &= (2U << MAX_ORDER) - 1; if (!order_mask) return NULL; -if (array_size <= PAGE_SIZE) -pages = kzalloc(array_size, GFP_KERNEL); -else -pages = vzalloc(array_size); +pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, nid); The pages array is only accessed by 
the CPU servicing the iommu_dma_alloc() call, and is usually freed again before that call even returns. It's certainly never touched by the device, so forcing it to a potentially non-local node doesn't make a great deal of sense. Right, it seems sensible to not make this allocation include the device-locality requirement, so can leave as is. However modifying to use kvzalloc() would seem ok. if (!pages) return NULL; @@ -483,8 +480,10 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count, unsigned int order = __fls(order_mask); order_size = 1U << order; -page = alloc_pages((order_mask - order_size) ? - gfp | __GFP_NORETRY : gfp, order); +page = alloc_pages_node(nid, +(order_mask - order_size) ? +gfp | __GFP_NORETRY : gfp, +order); If we're touching this, can we sort out that horrendous ternary? FWIW I found I have a local version of the original patch which I tweaked at the time, and apparently I reworked this hunk as below, which does seem somewhat nicer for the same diffstat. Robin. @@ -446,10 +443,12 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count, for (order_mask &= (2U << __fls(count)) - 1; order_mask; order_mask &= ~order_size) { unsigned int order = __fls(order_mask); + gfp_t alloc_flags = gfp; order_size = 1U << order; - page = alloc_pages((order_mask - order_size) ? - gfp | __GFP_NORETRY : gfp, order); + if (order_size < order_mask) + alloc_flags |= __GFP_NORETRY; Sure, this can be included + page = alloc_pages_node(nid, alloc_flags, order); if (!page) continue; if (!order) . Cheers, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
From: Ganapatrao Kulkarni Change function __iommu_dma_alloc_pages() to allocate pages for DMA from respective device NUMA node. The ternary operator which would be for alloc_pages_node() is tidied along with this. We also include a change to use kvzalloc() for kzalloc()/vzalloc() combination. Signed-off-by: Ganapatrao Kulkarni [JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary operator] Signed-off-by: John Garry diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d1b0475..4afb1a8 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count) kvfree(pages); } -static struct page **__iommu_dma_alloc_pages(unsigned int count, - unsigned long order_mask, gfp_t gfp) +static struct page **__iommu_dma_alloc_pages(struct device *dev, + unsigned int count, unsigned long order_mask, gfp_t gfp) { struct page **pages; - unsigned int i = 0, array_size = count * sizeof(*pages); + unsigned int i = 0, nid = dev_to_node(dev); order_mask &= (2U << MAX_ORDER) - 1; if (!order_mask) return NULL; - if (array_size <= PAGE_SIZE) - pages = kzalloc(array_size, GFP_KERNEL); - else - pages = vzalloc(array_size); + pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL); if (!pages) return NULL; @@ -481,10 +478,12 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count, for (order_mask &= (2U << __fls(count)) - 1; order_mask; order_mask &= ~order_size) { unsigned int order = __fls(order_mask); + gfp_t alloc_flags = gfp; order_size = 1U << order; - page = alloc_pages((order_mask - order_size) ? 
- gfp | __GFP_NORETRY : gfp, order); + if (order_mask > order_size) + alloc_flags |= __GFP_NORETRY; + page = alloc_pages_node(nid, alloc_flags, order); if (!page) continue; if (!order) @@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp, alloc_sizes = min_size; count = PAGE_ALIGN(size) >> PAGE_SHIFT; - pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp); + pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT, + gfp); if (!pages) return NULL; -- 1.9.1
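Aside from the NUMA node, the job of the loop being patched above is to cover a request of `count` pages with the largest power-of-two chunks the order mask allows, falling back to smaller orders as needed (the opportunistic higher-order attempts get `__GFP_NORETRY`). A userspace sketch of just that splitting strategy, with illustrative names and without the kernel's retry-on-failure handling:

```c
#include <assert.h>
#include <stddef.h>

/* Index of the highest set bit (userspace stand-in for the kernel's __fls) */
static unsigned int hi_bit(unsigned int x)
{
	return 31u - (unsigned int)__builtin_clz(x);
}

/* Split a request for `count` pages into power-of-two chunks, biggest
 * allowed order first -- the same shape as __iommu_dma_alloc_pages().
 * `order_mask` has bit N set if chunks of 2^N pages are permitted.
 * Returns the number of chunks written to `chunks`. */
static size_t split_orders(unsigned int count, unsigned int order_mask,
			   unsigned int *chunks, size_t max_chunks)
{
	size_t n = 0;

	while (count && n < max_chunks) {
		/* restrict to orders no larger than the remaining count */
		unsigned int m = order_mask & ((2u << hi_bit(count)) - 1);
		unsigned int order;

		if (!m)
			break;	/* no permitted order fits any more */
		order = hi_bit(m);
		chunks[n++] = 1u << order;
		count -= 1u << order;
	}
	return n;
}
```

With orders 0-2 permitted, a 5-page request becomes one 4-page chunk plus a single page; restrict the mask to orders 0 and 2 and a 3-page request degrades to three singles, mirroring how the driver falls back when large contiguous allocations are unavailable.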
Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
On 21/11/2018 16:07, Will Deacon wrote: On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote: From: Ganapatrao Kulkarni Change function __iommu_dma_alloc_pages() to allocate pages for DMA from respective device NUMA node. The ternary operator which would be for alloc_pages_node() is tidied along with this. We also include a change to use kvzalloc() for kzalloc()/vzalloc() combination. Signed-off-by: Ganapatrao Kulkarni [JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary operator] Signed-off-by: John Garry Weird, you're missing a diffstat here. Anyway, the patch looks fine to me, but it would be nice if you could justify the change with some numbers. Do you actually see an improvement from this change? Hi Will, Ah, I missed adding my comments explaining the motivation. It would be better in the commit log. Anyway, here's the snippet: " ... as mentioned in [3], dma_alloc_coherent() uses the locality information from the device - as in direct DMA - so this patch is just applying this same policy. [3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html" I did have some numbers to show improvement in some scenarios when I tested this a while back which I'll dig out. However I would say that some scenarios will improve and the opposite for others with this change, considering different conditions in which DMA memory may be used. Cheers, John Will .
Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
On 21/11/2018 16:57, Will Deacon wrote: On Wed, Nov 21, 2018 at 04:47:48PM +, John Garry wrote: On 21/11/2018 16:07, Will Deacon wrote: On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote: From: Ganapatrao Kulkarni Change function __iommu_dma_alloc_pages() to allocate pages for DMA from respective device NUMA node. The ternary operator which would be for alloc_pages_node() is tidied along with this. We also include a change to use kvzalloc() for kzalloc()/vzalloc() combination. Signed-off-by: Ganapatrao Kulkarni [JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary operator] Signed-off-by: John Garry Weird, you're missing a diffstat here. Anyway, the patch looks fine to me, but it would be nice if you could justify the change with some numbers. Do you actually see an improvement from this change? Hi Will, Ah, I missed adding my comments explaining the motivation. It would be better in the commit log. Anyway, here's the snippet: "... as mentioned in [3], dma_alloc_coherent() uses the locality information from the device - as in direct DMA - so this patch is just applying this same policy. [3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html" Yes, please add this to the commit log. Sure, I did have some numbers to show improvement in some scenarios when I tested this a while back which I'll dig out. However I would say that some scenarios will improve and the opposite for others with this change, considering different conditions in which DMA memory may be used. Well, if you can show that it's useful in some cases and not catastrophic in others, then I think shooting for parity with direct DMA is a reasonable justification for the change. So I have done some more testing with our SoC crypto engine, using the tcrypt module. The reason I used this device was because we can utilise a local device per socket in the system, unlike other DMAing devices, which generally only exist on a single socket (for us, anyway).
Overall the results aren't brilliant - as expected - but show a general marginal improvement. Here's some figures: Summary: Average diff+0.9% Max diff+1.5% Min diff+0.2% Test Ops/second before after diff % async ecb(aes) encrypt test 0 (128 bit key, 16 byte blocks)68717 69057 0.5 test 1 (128 bit key, 64 byte blocks): 72633 73163 0.7 test 2 (128 bit key, 256 byte blocks): 71475 72076 0.8 test 3 (128 bit key, 1024 byte blocks): 66819 67467 1.0 test 4 (128 bit key, 8192 byte blocks): 38237 38495 0.7 test 5 (192 bit key, 16 byte blocks): 70273 71079 1.2 test 6 (192 bit key, 64 byte blocks): 72455 73292 1.2 test 7 (192 bit key, 256 byte blocks): 71085 71876 1.1 test 8 (192 bit key, 1024 byte blocks): 65891 66576 1.0 test 9 (192 bit key, 8192 byte blocks): 34846 35061 0.6 test 10 (256 bit key, 16 byte blocks): 72927 73762 1.2 test 11 (256 bit key, 64 byte blocks): 72361 73207 1.2 test 12 (256 bit key, 256 byte blocks): 70907 71602 1.0 test 13 (256 bit key, 1024 byte blocks):65035 65653 1.0 test 14 (256 bit key, 8192 byte blocks):32835 32998 0.5 async ecb(aes) decrypt test 0 (128 bit key, 16 byte blocks)68384 69130 1.1 test 1 (128 bit key, 64 byte blocks): 72645 73133 0.7 test 2 (128 bit key, 256 byte blocks): 71523 71912 0.5 test 3 (128 bit key, 1024 byte blocks): 66902 67258 0.5 test 4 (128 bit key, 8192 byte blocks): 38260 38434 0.5 test 5 (192 bit key, 16 byte blocks): 70301 70816 0.7 test 6 (192 bit key, 64 byte blocks): 72473 73064 0.8 test 7 (192 bit key, 256 byte blocks): 71106 71663 0.8 test 8 (192 bit key, 1024 byte blocks): 65915 66363 0.7 test 9 (192 bit key, 8192 byte blocks): 34876 35006 0.4 test 10 (256 bit key, 16 byte blocks): 72969 73519 0.8 test 11 (256 bit key, 64 byte blocks): 72404 72925 0.7 test 12 (256 bit key, 256 byte blocks): 70861 71350 0.7 test 13 (256 bit key, 1024 byte blocks):65074 65485 0.6 test 14 (256 bit key, 8192 byte blocks):32861 32974 0.3 async cbc(aes) encrypt test 0 (128 bit key, 16 byte blocks)58306 59131 1.4 test 1 (128 bit key, 
64 byte blocks): 61647 62565 1.5 test 2 (128 bit key, 256 byte blocks): 60841 61666 1.4 test 3 (128 bit key, 1024 byte blocks): 57503 58204 1.2 test 4 (128 bit key, 8192 byte blocks): 34760 35055 0.9 test 5 (192 bit key, 16 byte blocks): 59684 60452 1.3 test 6 (192 bit key, 64 byte blocks): 61705 62516 1.3 test 7 (192 bit key, 256 byte blocks): 60678 61426 1.2 test 8 (192 bit key, 1024 byte blocks): 56805 57487 1.2 test 9 (192 bit key, 8192 byte blocks): 31836 32093 0.8 test 10 (256 bit key, 16 byte blocks): 61961 62785 1.3 test 11 (256 bit key, 64 byte blocks): 61584 62427 1.4 test 12 (256 bit key, 256 byte blocks): 604
Re: [PATCH] dma-debug: hns_enet_drv could use more DMA entries
+ Pasting original message at bottom. On 30/11/2018 08:42, Christoph Hellwig wrote: On Thu, Nov 29, 2018 at 10:54:56PM -0500, Qian Cai wrote: /* allow architectures to override this if absolutely required */ #ifndef PREALLOC_DMA_DEBUG_ENTRIES +/* amount of DMA mappings on this driver is huge. */ +#ifdef HNS_ENET +#define PREALLOC_DMA_DEBUG_ENTRIES (1 << 17) +#else #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16) #endif +#endif How would this be defined in a header that leaks into this file? I think we need to turn PREALLOC_DMA_DEBUG_ENTRIES into a user-selectable config option, as I really don't want to collect hacks like this. The amount of DMA mappings from Hisilicon HNS ethernet devices is huge, so it could trigger "DMA-API: debugging out of memory - disabling". hnae_get_handle [1] hnae_init_queue hnae_init_ring hnae_alloc_buffers [2] debug_dma_map_page dma_entry_alloc [1] for (i = 0; i < handle->q_num; i++) [2] for (i = 0; i < ring->desc_num; i++) On this Huawei TaiShan 2280 aarch64 server, it has reached the limit already: 4 (ports) x 16 (handles) x 1024 (rings) = 65536 Signed-off-by: Qian Cai --- kernel/dma/debug.c | 5 + 1 file changed, 5 insertions(+) diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 231ca4628062..ae91689cc9ad 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -43,8 +43,13 @@ /* allow architectures to override this if absolutely required */ #ifndef PREALLOC_DMA_DEBUG_ENTRIES +/* amount of DMA mappings on this driver is huge. */ +#ifdef HNS_ENET +#define PREALLOC_DMA_DEBUG_ENTRIES (1 << 17) +#else #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16) #endif +#endif enum { dma_debug_single,
[PATCH v4] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
From: Ganapatrao Kulkarni Change function __iommu_dma_alloc_pages() to allocate pages for DMA from respective device NUMA node. The ternary operator which would be for alloc_pages_node() is tidied along with this. The motivation for this change is to have a policy for page allocation consistent with direct DMA mapping, which attempts to allocate pages local to the device, as mentioned in [1]. In addition, for certain workloads it has been observed a marginal performance improvement. The patch caused an observation of 0.9% average throughput improvement for running tcrypt with HiSilicon crypto engine. We also include a modification to use kvzalloc() for kzalloc()/vzalloc() combination. [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html Signed-off-by: Ganapatrao Kulkarni [JPG: Added kvzalloc(), drop pages ** being device local, remove ternary operator, update message] Signed-off-by: John Garry --- Difference: v1->v2: - Add Ganapatrao's tag and change author v2->v3: - removed ternary operator - stopped making pages ** allocation local to device v3->v4: - Update commit message to include motivation for patch, including headline performance improvement for test. Some notes: This patch was originally posted by Ganapatrao in [2]. However, after initial review, it was never reposted (due to lack of cycles, I think). In addition, the functionality in its sibling patches were merged through patches, as mentioned in [2]; this also refers to a discussion on device local allocations vs CPU local allocations for DMA pool, and which is better [1]. However, as mentioned in [1], dma_alloc_coherent() uses the locality information from the device - as in direct DMA - so this patch is just applying this same policy. And from some testing, mentioned in commit message, shows marginal improvement. 
[2] https://lore.kernel.org/patchwork/patch/833004/ [3] https://lkml.org/lkml/2018/8/22/391 diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d1b0475..4afb1a8 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count) kvfree(pages); } -static struct page **__iommu_dma_alloc_pages(unsigned int count, - unsigned long order_mask, gfp_t gfp) +static struct page **__iommu_dma_alloc_pages(struct device *dev, + unsigned int count, unsigned long order_mask, gfp_t gfp) { struct page **pages; - unsigned int i = 0, array_size = count * sizeof(*pages); + unsigned int i = 0, nid = dev_to_node(dev); order_mask &= (2U << MAX_ORDER) - 1; if (!order_mask) return NULL; - if (array_size <= PAGE_SIZE) - pages = kzalloc(array_size, GFP_KERNEL); - else - pages = vzalloc(array_size); + pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL); if (!pages) return NULL; @@ -481,10 +478,12 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count, for (order_mask &= (2U << __fls(count)) - 1; order_mask; order_mask &= ~order_size) { unsigned int order = __fls(order_mask); + gfp_t alloc_flags = gfp; order_size = 1U << order; - page = alloc_pages((order_mask - order_size) ? - gfp | __GFP_NORETRY : gfp, order); + if (order_mask > order_size) + alloc_flags |= __GFP_NORETRY; + page = alloc_pages_node(nid, alloc_flags, order); if (!page) continue; if (!order) @@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp, alloc_sizes = min_size; count = PAGE_ALIGN(size) >> PAGE_SHIFT; - pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp); + pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT, + gfp); if (!pages) return NULL; -- 1.9.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] dma-debug: Kconfig for PREALLOC_DMA_DEBUG_ENTRIES
On 01/12/2018 16:36, Christoph Hellwig wrote: On Fri, Nov 30, 2018 at 07:39:50PM +, Robin Murphy wrote: I was assuming the point was to also add something like default 131072 if HNS_ENET so that DMA debug doesn't require too much thought from the user. If they still have to notice the overflow message and empirically figure out a value that does work, rebuilding the kernel each time is far less convenient than simply adding "dma_debug_entries=..." to their kernel command line and rebooting, which they can do today. If they do already know up-front that the default will need overriding and what the appropriate value is, then the command line still seems just as convenient. I'm not so fond of random drivers changing the defaults. My idea was rather to have the config option so that the defconfig files for the Hisilicon SoCs with this hardware could select a larger number without making a total mess of the kernel configuration. If we really have to we could do different defaults, but I'd still much rather do this on an arch/platform basis than for specific drivers. As I understand, some drivers could even use much more than this (131072), to such a point where I can't imagine that we would want to set an arch default to support them. For this HNS_ENET case, it is arm64 specific so it would be an arch defconfig. Thanks, John
Re: [PATCH 3/4] dma-debug: Dynamically expand the dma_debug_entry pool
On 03/12/2018 17:28, Robin Murphy wrote: Certain drivers such as large multi-queue network adapters can use pools of mapped DMA buffers larger than the default dma_debug_entry pool of 65536 entries, with the result that merely probing such a device can cause DMA debug to disable itself during boot unless explicitly given an appropriate "dma_debug_entries=..." option. Developers trying to debug some other driver on such a system may not be immediately aware of this, and at worst it can hide bugs if they fail to realise that dma-debug has already disabled itself unexpectedly by the time the code of interest gets to run. Even once they do realise, it can be a bit of a pain to empirically determine a suitable number of preallocated entries to configure without massively over-allocating. There's really no need for such a static limit, though, since we can quite easily expand the pool at runtime in those rare cases that the preallocated entries are insufficient, which is arguably the least surprising and most useful behaviour. Hi Robin, Do you have an idea on shrinking the pool again when the culprit driver is removed, i.e. we have so many unused debug entries now available?
Thanks, John Signed-off-by: Robin Murphy --- kernel/dma/debug.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index de5db800dbfc..46cc075aec99 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -47,6 +47,9 @@ #ifndef PREALLOC_DMA_DEBUG_ENTRIES #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16) #endif +/* If the pool runs out, try this many times to allocate this many new entries */ +#define DMA_DEBUG_DYNAMIC_ENTRIES 256 +#define DMA_DEBUG_DYNAMIC_RETRIES 2 enum { dma_debug_single, @@ -702,12 +705,21 @@ static struct dma_debug_entry *dma_entry_alloc(void) { struct dma_debug_entry *entry; unsigned long flags; + int retry_count; - spin_lock_irqsave(&free_entries_lock, flags); + for (retry_count = 0; ; retry_count++) { + spin_lock_irqsave(&free_entries_lock, flags); + + if (num_free_entries > 0) + break; - if (list_empty(&free_entries)) { - global_disable = true; spin_unlock_irqrestore(&free_entries_lock, flags); + + if (retry_count < DMA_DEBUG_DYNAMIC_RETRIES && + !prealloc_memory(DMA_DEBUG_DYNAMIC_ENTRIES)) + continue; + + global_disable = true; pr_err("debugging out of memory - disabling\n"); return NULL; } ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 3/4] dma-debug: Dynamically expand the dma_debug_entry pool
On 04/12/2018 13:11, Robin Murphy wrote: Hi John, On 03/12/2018 18:23, John Garry wrote: On 03/12/2018 17:28, Robin Murphy wrote: Certain drivers such as large multi-queue network adapters can use pools of mapped DMA buffers larger than the default dma_debug_entry pool of 65536 entries, with the result that merely probing such a device can cause DMA debug to disable itself during boot unless explicitly given an appropriate "dma_debug_entries=..." option. Developers trying to debug some other driver on such a system may not be immediately aware of this, and at worst it can hide bugs if they fail to realise that dma-debug has already disabled itself unexpectedly by the time the code of interest gets to run. Even once they do realise, it can be a bit of a pain to emprirically determine a suitable number of preallocated entries to configure without massively over-allocating. There's really no need for such a static limit, though, since we can quite easily expand the pool at runtime in those rare cases that the preallocated entries are insufficient, which is arguably the least surprising and most useful behaviour. Hi Robin, Do you have an idea on shrinking the pool again when the culprit driver is removed, i.e. we have so many unused debug entries now available? I honestly don't believe it's worth the complication. This is a development feature with significant overheads already, so there's not an awful lot to gain by trying to optimise memory usage. If a system can ever load a driver that makes hundreds of thousands of simultaneous mappings, it can almost certainly spare 20-odd megabytes of RAM for the corresponding debug entries in perpetuity. Sure, it does mean you'd need to reboot to recover memory from a major leak, but that's mostly true of the current behaviour too, and rebooting during driver development is hardly an unacceptable inconvenience. ok, I just thought that it would not be too difficult to implement this on the dma entry free path. 
In fact, having got this far in, what I'd quite like to do is to get rid of dma_debug_resize_entries() such that we never need to free things at all, since then we could allocate whole pages as blocks of entries to save on masses of individual slab allocations. On a related topic, is it possible for the user to learn the total entries created at a given point in time? If not, could we add a file in the debugfs folder for this? Thanks, John Robin. Thanks, John Signed-off-by: Robin Murphy --- kernel/dma/debug.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index de5db800dbfc..46cc075aec99 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -47,6 +47,9 @@ #ifndef PREALLOC_DMA_DEBUG_ENTRIES #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16) #endif +/* If the pool runs out, try this many times to allocate this many new entries */ +#define DMA_DEBUG_DYNAMIC_ENTRIES 256 +#define DMA_DEBUG_DYNAMIC_RETRIES 2 enum { dma_debug_single, @@ -702,12 +705,21 @@ static struct dma_debug_entry *dma_entry_alloc(void) { struct dma_debug_entry *entry; unsigned long flags; +int retry_count; -spin_lock_irqsave(&free_entries_lock, flags); +for (retry_count = 0; ; retry_count++) { +spin_lock_irqsave(&free_entries_lock, flags); + +if (num_free_entries > 0) +break; -if (list_empty(&free_entries)) { -global_disable = true; spin_unlock_irqrestore(&free_entries_lock, flags); + +if (retry_count < DMA_DEBUG_DYNAMIC_RETRIES && +!prealloc_memory(DMA_DEBUG_DYNAMIC_ENTRIES)) +continue; + +global_disable = true; pr_err("debugging out of memory - disabling\n"); return NULL; } . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 3/4] dma-debug: Dynamically expand the dma_debug_entry pool
In fact, having got this far in, what I'd quite like to do is to get rid of dma_debug_resize_entries() such that we never need to free things at all, since then we could allocate whole pages as blocks of entries to save on masses of individual slab allocations. On a related topic, is it possible for the user to learn the total entries created at a given point in time? If not, could we add a file in the debugfs folder for this? Hi Robin, I did get as far as pondering that you effectively lose track of utilisation once the low-water-mark of min_free_entries hits 0 and stays I did try your patches and I noticed this, i.e. I was hitting the point at which we start to alloc more entries. there - AFAICS it should be sufficient to just expose nr_total_entries as-is, since users can then calculate current and maximum occupancy based on *_free_entries. Does that sound reasonable to you? Sounds ok. I am just interested to know roughly how many DMA buffers we're using in our system. That also indirectly reminds me that this lot is documented in DMA-API.txt, so I should be good and update that too... Thanks, John Cheers, Robin.
Re: [PATCH v4] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
On 30/11/2018 11:14, John Garry wrote: From: Ganapatrao Kulkarni Hi Joerg, A friendly reminder. Can you please let me know your position on this patch? Cheers, John Change function __iommu_dma_alloc_pages() to allocate pages for DMA from respective device NUMA node. The ternary operator which would be for alloc_pages_node() is tidied along with this. The motivation for this change is to have a policy for page allocation consistent with direct DMA mapping, which attempts to allocate pages local to the device, as mentioned in [1]. In addition, for certain workloads it has been observed a marginal performance improvement. The patch caused an observation of 0.9% average throughput improvement for running tcrypt with HiSilicon crypto engine. We also include a modification to use kvzalloc() for kzalloc()/vzalloc() combination. [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html Signed-off-by: Ganapatrao Kulkarni [JPG: Added kvzalloc(), drop pages ** being device local, remove ternary operator, update message] Signed-off-by: John Garry --- Difference: v1->v2: - Add Ganapatrao's tag and change author v2->v3: - removed ternary operator - stopped making pages ** allocation local to device v3->v4: - Update commit message to include motivation for patch, including headline performance improvement for test. Some notes: This patch was originally posted by Ganapatrao in [2]. However, after initial review, it was never reposted (due to lack of cycles, I think). In addition, the functionality in its sibling patches were merged through patches, as mentioned in [2]; this also refers to a discussion on device local allocations vs CPU local allocations for DMA pool, and which is better [1]. However, as mentioned in [1], dma_alloc_coherent() uses the locality information from the device - as in direct DMA - so this patch is just applying this same policy. And from some testing, mentioned in commit message, shows marginal improvement. 
[2] https://lore.kernel.org/patchwork/patch/833004/ [3] https://lkml.org/lkml/2018/8/22/391 diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d1b0475..4afb1a8 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count) kvfree(pages); } -static struct page **__iommu_dma_alloc_pages(unsigned int count, - unsigned long order_mask, gfp_t gfp) +static struct page **__iommu_dma_alloc_pages(struct device *dev, + unsigned int count, unsigned long order_mask, gfp_t gfp) { struct page **pages; - unsigned int i = 0, array_size = count * sizeof(*pages); + unsigned int i = 0, nid = dev_to_node(dev); order_mask &= (2U << MAX_ORDER) - 1; if (!order_mask) return NULL; - if (array_size <= PAGE_SIZE) - pages = kzalloc(array_size, GFP_KERNEL); - else - pages = vzalloc(array_size); + pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL); if (!pages) return NULL; @@ -481,10 +478,12 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count, for (order_mask &= (2U << __fls(count)) - 1; order_mask; order_mask &= ~order_size) { unsigned int order = __fls(order_mask); + gfp_t alloc_flags = gfp; order_size = 1U << order; - page = alloc_pages((order_mask - order_size) ? - gfp | __GFP_NORETRY : gfp, order); + if (order_mask > order_size) + alloc_flags |= __GFP_NORETRY; + page = alloc_pages_node(nid, alloc_flags, order); if (!page) continue; if (!order) @@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp, alloc_sizes = min_size; count = PAGE_ALIGN(size) >> PAGE_SHIFT; - pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp); + pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT, + gfp); if (!pages) return NULL; ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 1/1] iommu: set the default iommu-dma mode as non-strict
(VFIO) is untrusted, ok. But a malicious driver loaded into the kernel address space would have much easier ways to corrupt the system than to exploit lazy mode... Yes, so that we have no need to consider untrusted drivers. For (3), I agree that we should at least disallow lazy mode if pci_dev->untrusted is set. At the moment it means that we require the strictest IOMMU configuration for external-facing PCI ports, but it can be extended to blacklist other vulnerable devices or locations. I plan to add an attribute file for each device, especially for hotplug devices, and let the root users decide which mode should be used, strict or non-strict. Because they should know whether the hot-plug device is trusted or not. Aside from the problem that without massive implementation changes strict/non-strict is at best a per-domain property, not a per-device one, I can't see this being particularly practical - surely the whole point of a malicious endpoint is that it's going to pretend to be some common device for which a 'trusted' kernel driver already exists? Yes, it should be assumed that all kernel drivers and all hard-wired devices are trusted. There is no reason to suspect that open-source drivers or the drivers and devices provided by legitimate suppliers are malicious. If you've chosen to trust *any* external device, I think you may as well have just set non-strict globally anyway. The effort involved in trying to implement super-fine-grained control seems hard to justify. The default mode of external devices is strict; it can obviously be changed to non-strict mode. But as you said, it may be hard to implement. In addition, bringing a malicious device into a computer room, attaching it, and exporting data is not easy either. Maybe I should follow Jean's suggestion first and add a config item. +1 On another topic, we did also see a use case for selectively passing devices through. Typically, having the kernel use the identity mapping when driving a device is fine.
In fact, having the IOMMU translating puts a big performance burden on the system. However, sometimes we may require the IOMMU to be involved for certain devices, like when the kernel device driver has big contiguous DMA requirements, which is the case for some RDMA NIC cards. John Robin. If you do (3) then maybe we don't need (1) and (2), which require a tonne of work in the DMA and IOMMU layers (but would certainly be nice to see, since it would also help handle ATS invalidation timeouts). Thanks, Jean So that some high-end trusted devices use non-strict mode, and keep others still using strict mode. The drivers that want to use non-strict mode should change to use new APIs by themselves. Why not keep the policy secure by default, as we do for iommu.passthrough? And maybe add something similar to CONFIG_IOMMU_DEFAULT_PASSTHROUGH? It's easy enough for experts to pass a command-line argument or change the default config. Sorry for the late reply, it was Chinese New Year, and we had a long discussion internally; we are fine to add a Kconfig option but not sure OS vendors will set it to default y. OS vendors seem not happy to pass a command-line argument; to be honest, this is our motivation to enable non-strict as default. Hope OS vendors can see this email thread, and give some input here. Thanks, Hanjun
Re: [PATCH/RFC] driver core: Postpone DMA tear-down until after devres release
On 11/02/2019 10:22, Robin Murphy wrote: On 08/02/2019 18:55, Geert Uytterhoeven wrote: Hi Robin, On Fri, Feb 8, 2019 at 6:55 PM Robin Murphy wrote: On 08/02/2019 16:40, Joerg Roedel wrote: On Thu, Feb 07, 2019 at 08:36:53PM +0100, Geert Uytterhoeven wrote: diff --git a/drivers/base/dd.c b/drivers/base/dd.c index 8ac10af17c0043a3..d62487d024559620 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -968,9 +968,9 @@ static void __device_release_driver(struct device *dev, struct device *parent) drv->remove(dev); device_links_driver_cleanup(dev); -arch_teardown_dma_ops(dev); devres_release_all(dev); +arch_teardown_dma_ops(dev); dev->driver = NULL; dev_set_drvdata(dev, NULL); if (dev->pm_domain && dev->pm_domain->dismiss) Thanks for the fix! Should it also be tagged for stable and get a Fixes FTR, Greg has added it to driver-core-testing, with a CC to stable. So I see, great! tag? I know it only triggers with a fix in v5.0-rc, but still... I think so: Fixes: 09515ef5ddad ("of/acpi: Configure dma operations at probe time for platform/amba/pci bus devices") Thanks! It won't backport cleanly due to commit dc3c05504d38849f ("dma-mapping: remove dma_deconfigure") in v4.20, though. Ah yes - backports beyond that should simply be a case of moving the dma_deconfigure() wrapper in the same manner. Hi guys, Any idea what happened to this fix? I have this on 5.0: [0.00] Linux version 5.0.0 (john@htsatcamb-server) (gcc version 4.8.5 (Linaro GCC 4.8-2015.06)) #121 SMP PREEMPT Thu Mar 7 14:28:39 GMT 2019 [0.00] Kernel command line: BOOT_IMAGE=/john/Image rdinit=/init crashkernel=256M@32M earlycon console=ttyAMA0,115200 acpi=force pcie_aspm=off scsi_mod.use_blk_mq=y no_console_suspend pcie-hisi.disable=1 ... 
[ 26.806856] pci_bus 000c:20: 2-byte config write to 000c:20:00.0 offset 0x44 may corrupt adjacent RW1C bits [ 26.817521] pcieport 0002:f8:00.0: can't derive routing for PCI INT B [ 26.837167] pci_bus 000c:20: 2-byte config write to 000c:20:00.0 offset 0x44 may corrupt adjacent RW1C bits [ 26.850091] serial 0002:f9:00.1: PCI INT B: no GSI [ 26.879364] serial 0002:f9:00.1: enabling device ( -> 0003) [ 26.913131] 0002:f9:00.1: ttyS3 at I/O 0x1008 (irq = 0, base_baud = 115200) is a ST16650V2 [ 26.992897] rtc-efi rtc-efi: setting system clock to 2019-03-07T14:41:48 UTC (1551969708) [ 27.009380] ALSA device list: [ 27.015326] No soundcards found. [ 27.022549] Freeing unused kernel memory: 1216K [ 27.055567] Run /init as init process root@(none)$ cd /sys/devices/platform/HISI0162:01 root@(none)$ echo HISI0162:01 > driver/unbind [ 36.488040] hisi_sas_v2_hw HISI0162:01: dev[9:1] is gone [ 36.561077] hisi_sas_v2_hw HISI0162:01: dev[8:1] is gone [ 36.621061] hisi_sas_v2_hw HISI0162:01: dev[7:1] is gone [ 36.693074] hisi_sas_v2_hw HISI0162:01: dev[6:1] is gone [ 36.753066] hisi_sas_v2_hw HISI0162:01: dev[5:1] is gone [ 36.764276] sd 0:0:3:0: [sdd] Synchronizing SCSI cache [ 36.821106] hisi_sas_v2_hw HISI0162:01: dev[4:1] is gone [ 36.889048] hisi_sas_v2_hw HISI0162:01: dev[3:1] is gone [ 36.993002] hisi_sas_v2_hw HISI0162:01: dev[1:1] is gone [ 37.004276] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 37.014768] sd 0:0:0:0: [sda] Stopping disk [ 37.709094] hisi_sas_v2_hw HISI0162:01: dev[2:5] is gone [ 37.721231] hisi_sas_v2_hw HISI0162:01: dev[0:2] is gone [ 37.803774] BUG: Bad page state in process sh pfn:11356 [ 37.81] page:7e44d580 count:1 mapcount:0 mapping: index:0x0 [ 37.830525] flags: 0xfffc0001000(reserved) [ 37.839443] raw: 0fffc0001000 7e44d588 7e44d588 [ 37.854998] raw: 0001 [ 37.870552] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [ 37.883485] bad because of flags: 0x1000(reserved) [ 37.893098] Modules linked in: [ 37.899221] CPU: 5 PID: 2691 Comm: sh 
Not tainted 5.0.0 #121 [ 37.910578] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018 [ 37.928924] Call trace: [ 37.933825] dump_backtrace+0x0/0x150 [ 37.941166] show_stack+0x14/0x1c [ 37.947808] dump_stack+0x8c/0xb0 [ 37.954451] bad_page+0xe4/0x144 [ 37.960918] free_pages_check_bad+0x7c/0x84 [ 37.969307] __free_pages_ok+0x284/0x290 [ 37.977173] __free_pages+0x30/0x44 [ 37.984163] __dma_direct_free_pages+0x68/0x6c [ 37.993076] dma_direct_free+0x24/0x38 [ 38.000591] dma_free_attrs+0x84/0xc0 [ 38.007930] dmam_release+0x20/0x28 [ 38.014924] release_nodes+0x128/0x1f8 [ 38.022439] devres_release_all+0x34/0x4c [ 38.030478] device_release_driver_internal+0x190/0x208 [ 38.040963] device_release_driver+0x14/0x1c [ 38.049526] unbind_store+0xbc/0xf4 [ 38.056517] drv_attr_store+0x20/0x30 [ 38
Re: [PATCH/RFC] driver core: Postpone DMA tear-down until after devres release
On 07/03/2019 14:52, Robin Murphy wrote: Hi John, Hi Robin, ok, fine. It's a pain bisecting another rmmod issue with it... Cheers, On 07/03/2019 14:45, John Garry wrote: [...] Hi guys, Any idea what happened to this fix? It's been in -next for a while (commit 376991db4b64) - I assume it will land shortly and hit stable thereafter, at which point somebody gets to sort out the manual backport past 4.20. Robin. I have this on 5.0: [0.00] Linux version 5.0.0 (john@htsatcamb-server) (gcc version 4.8.5 (Linaro GCC 4.8-2015.06)) #121 SMP PREEMPT Thu Mar 7 14:28:39 GMT 2019 [0.00] Kernel command line: BOOT_IMAGE=/john/Image rdinit=/init crashkernel=256M@32M earlycon console=ttyAMA0,115200 acpi=force pcie_aspm=off scsi_mod.use_blk_mq=y no_console_suspend pcie-hisi.disable=1 ... [ 26.806856] pci_bus 000c:20: 2-byte config write to 000c:20:00.0 offset 0x44 may corrupt adjacent RW1C bits [ 26.817521] pcieport 0002:f8:00.0: can't derive routing for PCI INT B [ 26.837167] pci_bus 000c:20: 2-byte config write to 000c:20:00.0 offset 0x44 may corrupt adjacent RW1C bits [ 26.850091] serial 0002:f9:00.1: PCI INT B: no GSI [ 26.879364] serial 0002:f9:00.1: enabling device ( -> 0003) [ 26.913131] 0002:f9:00.1: ttyS3 at I/O 0x1008 (irq = 0, base_baud = 115200) is a ST16650V2 [ 26.992897] rtc-efi rtc-efi: setting system clock to 2019-03-07T14:41:48 UTC (1551969708) [ 27.009380] ALSA device list: [ 27.015326] No soundcards found. 
[ 27.022549] Freeing unused kernel memory: 1216K [ 27.055567] Run /init as init process root@(none)$ cd /sys/devices/platform/HISI0162:01 root@(none)$ echo HISI0162:01 > driver/unbind [ 36.488040] hisi_sas_v2_hw HISI0162:01: dev[9:1] is gone [ 36.561077] hisi_sas_v2_hw HISI0162:01: dev[8:1] is gone [ 36.621061] hisi_sas_v2_hw HISI0162:01: dev[7:1] is gone [ 36.693074] hisi_sas_v2_hw HISI0162:01: dev[6:1] is gone [ 36.753066] hisi_sas_v2_hw HISI0162:01: dev[5:1] is gone [ 36.764276] sd 0:0:3:0: [sdd] Synchronizing SCSI cache [ 36.821106] hisi_sas_v2_hw HISI0162:01: dev[4:1] is gone [ 36.889048] hisi_sas_v2_hw HISI0162:01: dev[3:1] is gone [ 36.993002] hisi_sas_v2_hw HISI0162:01: dev[1:1] is gone [ 37.004276] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 37.014768] sd 0:0:0:0: [sda] Stopping disk [ 37.709094] hisi_sas_v2_hw HISI0162:01: dev[2:5] is gone [ 37.721231] hisi_sas_v2_hw HISI0162:01: dev[0:2] is gone [ 37.803774] BUG: Bad page state in process sh pfn:11356 [ 37.81] page:7e44d580 count:1 mapcount:0 mapping: index:0x0 [ 37.830525] flags: 0xfffc0001000(reserved) [ 37.839443] raw: 0fffc0001000 7e44d588 7e44d588 [ 37.854998] raw: 0001 [ 37.870552] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [ 37.883485] bad because of flags: 0x1000(reserved) [ 37.893098] Modules linked in: [ 37.899221] CPU: 5 PID: 2691 Comm: sh Not tainted 5.0.0 #121 [ 37.910578] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018 [ 37.928924] Call trace: [ 37.933825] dump_backtrace+0x0/0x150 [ 37.941166] show_stack+0x14/0x1c [ 37.947808] dump_stack+0x8c/0xb0 [ 37.954451] bad_page+0xe4/0x144 [ 37.960918] free_pages_check_bad+0x7c/0x84 [ 37.969307] __free_pages_ok+0x284/0x290 [ 37.977173] __free_pages+0x30/0x44 [ 37.984163] __dma_direct_free_pages+0x68/0x6c [ 37.993076] dma_direct_free+0x24/0x38 [ 38.000591] dma_free_attrs+0x84/0xc0 [ 38.007930] dmam_release+0x20/0x28 [ 38.014924] release_nodes+0x128/0x1f8 [ 38.022439] devres_release_all+0x34/0x4c 
[ 38.030478] device_release_driver_internal+0x190/0x208 [ 38.040963] device_release_driver+0x14/0x1c [ 38.049526] unbind_store+0xbc/0xf4 [ 38.056517] drv_attr_store+0x20/0x30 [ 38.063859] sysfs_kf_write+0x44/0x4c [ 38.071200] kernfs_fop_write+0xd0/0x1c4 [ 38.079065] __vfs_write+0x2c/0x158 [ 38.086055] vfs_write+0xa8/0x19c [ 38.092696] ksys_write+0x44/0xa0 [ 38.099338] __arm64_sys_write+0x1c/0x24 [ 38.107203] el0_svc_common+0xb0/0x100 [ 38.114718] el0_svc_handler+0x70/0x88 [ 38.122232] el0_svc+0x8/0x7c0 [ 38.128356] Disabling lock debugging due to kernel taint [ 38.139019] BUG: Bad page state in process sh pfn:11355 [ 38.149682] page:7e44d540 count:0 mapcount:0 mapping: index:0x0 [ 38.165760] flags: 0xfffc0001000(reserved) [ 38.174676] raw: 0fffc0001000 7e44d548 7e44d548 [ 38.190230] raw: [ 38.205783] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [ 38.218716] bad because of flags: 0x1000(reserved) [ 38.228329] Modules linked in: [ 38.234451] CPU: 5 PID: 2691 Comm: sh Tainted: GB 5.0.0 #121 [ 38.248604] Hardware name: Huawei Taishan 2280
Re: [PATCH 1/1] iommu: Add config option to set lazy mode as default
On 22/03/2019 14:11, Zhen Lei wrote: This allows the default behaviour to be controlled by a kernel config option instead of changing the command line for the kernel to include "iommu.strict=0" on ARM64 where this is desired. This is similar to CONFIG_IOMMU_DEFAULT_PASSTHROUGH Note: At present, intel_iommu, amd_iommu and s390_iommu use lazy mode as defalut, so there is no need to add code for them. /s/defalut/default/ Signed-off-by: Zhen Lei --- drivers/iommu/Kconfig | 14 ++ drivers/iommu/iommu.c | 5 + Do we need to update kernel-parameters.txt for iommu.strict? 2 files changed, 19 insertions(+) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 6f07f3b21816c64..5ec9780f564eaf8 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -85,6 +85,20 @@ config IOMMU_DEFAULT_PASSTHROUGH If unsure, say N here. +config IOMMU_LAZY_MODE maybe should add "DMA" to the name, and even "DEFAULT" + bool "IOMMU use lazy mode to flush IOTLB and free IOVA" + depends on IOMMU_API + help + For every IOMMU unmap operation, the flush operation of IOTLB and the free + operation of IOVA are deferred. This is a bit unclear, as there is no context. I think that you need to say something like, "Support lazy mode, where for every IOMMU DMA unmap operation, the flush operation of IOTLB and the free operation of IOVA are deferred. " They are only guaranteed to be done before + the related IOVA will be reused. Removing the need to pass in iommu.strict=0 + through command line on ARM64(Now, intel_iommu, amd_iommu, s390_iommu use + lazy mode as deault). prone to going out-of-date If this is enabled, you can still disable with kernel + parameters, such as iommu.strict=1, intel_iommu=strict, amd_iommu=fullflush + or s390_iommu=strict depending on the architecture. + + If unsure, say N here. 
+ config OF_IOMMU def_bool y depends on OF && IOMMU_API diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 33a982e33716369..e307d70d1578b3b 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -43,7 +43,12 @@ #else static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA; #endif + +#ifdef CONFIG_IOMMU_LAZY_MODE +static bool iommu_dma_strict __read_mostly; +#else static bool iommu_dma_strict __read_mostly = true; +#endif struct iommu_callback_data { const struct iommu_ops *ops; -- 1.8.3 Cheers ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
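For context on what the proposed option selects between, here is a rough userspace model of strict vs. lazy invalidation (illustrative only — the kernel's real flush queue in dma-iommu is more involved, and all names below are made up): strict mode invalidates the IOTLB on every unmap, while lazy mode queues unmaps and batches the invalidation, only guaranteeing it before an IOVA is reused.

```c
#include <stdbool.h>
#include <stddef.h>

#define FQ_SIZE 8

/* Toy "flush queue" batching TLB invalidations, in the spirit of
 * iommu.strict=0 (lazy mode). */
struct flush_queue {
    unsigned long pending[FQ_SIZE];  /* IOVAs not yet safe to reuse */
    size_t count;
    unsigned long flushes;           /* how many IOTLB syncs issued */
};

static void iotlb_flush_all(struct flush_queue *fq)
{
    fq->flushes++;                   /* one global invalidation     */
    fq->count = 0;                   /* queued IOVAs may be reused  */
}

static void unmap_iova(struct flush_queue *fq, unsigned long iova, bool strict)
{
    if (strict) {
        fq->flushes++;               /* invalidate on every unmap   */
        return;
    }
    if (fq->count == FQ_SIZE)        /* queue full: flush first     */
        iotlb_flush_all(fq);
    fq->pending[fq->count++] = iova;
}
```

With this model, 16 strict unmaps cost 16 invalidations, while 16 lazy unmaps plus a final sync cost only 2 — which is the throughput win the Kconfig default is after.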
Re: [PATCH/RFC] driver core: Postpone DMA tear-down until after devres release
Memory is incorrectly freed using the direct ops, as dma_map_ops = NULL. Oops... After reversing the order of the calls to arch_teardown_dma_ops() and devres_release_all(), dma_map_ops is still valid, and the DMA memory is now released using __iommu_free_attrs(): +sata_rcar ee30.sata: dmam_release:32: size 2048 vaddr ff8012145000 dma_handle 0x0xf000 attrs 0x0 +sata_rcar ee30.sata: dma_free_attrs:289: size 2048, ops = iommu_dma_ops +sata_rcar ee30.sata: dma_free_attrs:311: calling __iommu_free_attrs() --- drivers/base/dd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/base/dd.c b/drivers/base/dd.c index 8ac10af17c0043a3..d62487d024559620 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -968,9 +968,9 @@ static void __device_release_driver(struct device *dev, struct device *parent) drv->remove(dev); device_links_driver_cleanup(dev); - arch_teardown_dma_ops(dev); devres_release_all(dev); + arch_teardown_dma_ops(dev); dev->driver = NULL; Hi guys, Could there still be the same problem in the error path of really_probe(): static int really_probe(struct device *dev, struct device_driver *drv) { [...] goto done; probe_failed: arch_teardown_dma_ops(dev); dma_failed: if (dev->bus) blocking_notifier_call_chain(&dev->bus->p->bus_notifier, BUS_NOTIFY_DRIVER_NOT_BOUND, dev); pinctrl_bind_failed: device_links_no_driver(dev); devres_release_all(dev); driver_sysfs_remove(dev); dev->driver = NULL; dev_set_drvdata(dev, NULL); We seem to be able to call arch_teardown_dma_ops() prior to devres_release_all() if we reach probe_failed label. 
We have seen this crash when our driver probe fails on a dev branch based on v5.1-rc1: [ 87.896707] hisi_sas_v3_hw :74:02.0: Adding to iommu group 2 [ 87.909765] scsi host1: hisi_sas_v3_hw [ 89.127958] hisi_sas_v3_hw :74:02.0: evaluate _DSM failed [ 89.134043] BUG: Bad page state in process swapper/0 pfn:313f5 [ 89.139965] page:7ec4fd40 count:1 mapcount:0 mapping: index:0x0 [ 89.147960] flags: 0xfffe0001000(reserved) [ 89.152398] raw: 0fffe0001000 7ec4fd48 7ec4fd48 [ 89.160130] raw: 0001 [ 89.167861] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [ 89.174290] bad because of flags: 0x1000(reserved) [ 89.179070] Modules linked in: [ 89.182117] CPU: 49 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc1-43081-g22d97fd-dirty #1433 [ 89.190453] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI RC0 - V1.12.01 01/29/2019 [ 89.198876] Call trace: [ 89.201316] dump_backtrace+0x0/0x118 [ 89.204965] show_stack+0x14/0x1c [ 89.208272] dump_stack+0xa4/0xc8 [ 89.211576] bad_page+0xe4/0x13c [ 89.214791] free_pages_check_bad+0x4c/0xc0 [ 89.218961] __free_pages_ok+0x30c/0x340 [ 89.222871] __free_pages+0x30/0x44 [ 89.226347] __dma_direct_free_pages+0x30/0x38 [ 89.230777] dma_direct_free+0x24/0x38 [ 89.234513] dma_free_attrs+0x9c/0xd8 [ 89.238161] dmam_release+0x20/0x28 [ 89.241640] release_nodes+0x17c/0x220 [ 89.245375] devres_release_all+0x34/0x54 [ 89.249371] really_probe+0xc4/0x2c8 [ 89.252933] driver_probe_device+0x58/0xfc [ 89.257016] device_driver_attach+0x68/0x70 [ 89.261185] __driver_attach+0x94/0xdc [ 89.264921] bus_for_each_dev+0x5c/0xb4 [ 89.268744] driver_attach+0x20/0x28 [ 89.272306] bus_add_driver+0x14c/0x200 [ 89.276128] driver_register+0x6c/0x124 [ 89.279953] __pci_register_driver+0x48/0x50 [ 89.284213] sas_v3_pci_driver_init+0x20/0x28 [ 89.288557] do_one_initcall+0x40/0x25c [ 89.292381] kernel_init_freeable+0x2b8/0x3c0 [ 89.296727] kernel_init+0x10/0x100 [ 89.300202] ret_from_fork+0x10/0x18 [ 89.303773] Disabling lock debugging due to kernel taint [ 
89.309076] BUG: Bad page state in process swapper/0 pfn:313f6 [ 89.314988] page:7ec4fd80 count:1 mapcount:0 mapping: index:0x0 [ 89.322983] flags: 0xfffe0001000(reserved) [ 89.327417] raw: 0fffe0001000 7ec4fd88 7ec4fd88 [ 89.335149] raw: 0001 Thanks, John dev_set_drvdata(dev, NULL); if (dev->pm_domain && dev->pm_domain->dismiss) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH/RFC] driver core: Postpone DMA tear-down until after devres release
On 26/03/2019 12:31, Geert Uytterhoeven wrote: Hi John, CC robh On Tue, Mar 26, 2019 at 12:42 PM John Garry wrote: Memory is incorrectly freed using the direct ops, as dma_map_ops = NULL. Oops... After reversing the order of the calls to arch_teardown_dma_ops() and devres_release_all(), dma_map_ops is still valid, and the DMA memory is now released using __iommu_free_attrs(): +sata_rcar ee30.sata: dmam_release:32: size 2048 vaddr ff8012145000 dma_handle 0x0xf000 attrs 0x0 +sata_rcar ee30.sata: dma_free_attrs:289: size 2048, ops = iommu_dma_ops +sata_rcar ee30.sata: dma_free_attrs:311: calling __iommu_free_attrs() --- drivers/base/dd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/base/dd.c b/drivers/base/dd.c index 8ac10af17c0043a3..d62487d024559620 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -968,9 +968,9 @@ static void __device_release_driver(struct device *dev, struct device *parent) drv->remove(dev); device_links_driver_cleanup(dev); - arch_teardown_dma_ops(dev); devres_release_all(dev); + arch_teardown_dma_ops(dev); dev->driver = NULL; Hi guys, Could there still be the same problem in the error path of really_probe(): static int really_probe(struct device *dev, struct device_driver *drv) { [...] goto done; probe_failed: arch_teardown_dma_ops(dev); dma_failed: if (dev->bus) blocking_notifier_call_chain(&dev->bus->p->bus_notifier, BUS_NOTIFY_DRIVER_NOT_BOUND, dev); pinctrl_bind_failed: device_links_no_driver(dev); devres_release_all(dev); driver_sysfs_remove(dev); dev->driver = NULL; dev_set_drvdata(dev, NULL); We seem to be able to call arch_teardown_dma_ops() prior to devres_release_all() if we reach probe_failed label. Yes, this looks like another instance of the same problem. And test_remove doesn't expose this, as it doesn't exercise the full cycle. Hi Geert, OK, thanks. There is another devres_release_all() call in device_release(). Not sure if there is another potential problem there also... 
As for this issue, I'll look to fix it unless anyone else is doing so. Cheers, John Gr{oetje,eeting}s, Geert ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] driver core: Postpone DMA tear-down until after devres release for probe failure
In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after devres release"), we changed the ordering of tearing down the device DMA ops and releasing all the device's resources; this was because the DMA ops should be maintained until we release the device's managed DMA memories. However, we have seen another crash on an arm64 system when a device driver probe fails: hisi_sas_v3_hw :74:02.0: Adding to iommu group 2 scsi host1: hisi_sas_v3_hw BUG: Bad page state in process swapper/0 pfn:313f5 page:7ec4fd40 count:1 mapcount:0 mapping: index:0x0 flags: 0xfffe0001000(reserved) raw: 0fffe0001000 7ec4fd48 7ec4fd48 raw: 0001 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: 0x1000(reserved) Modules linked in: CPU: 49 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc1-43081-g22d97fd-dirty #1433 Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI RC0 - V1.12.01 01/29/2019 Call trace: dump_backtrace+0x0/0x118 show_stack+0x14/0x1c dump_stack+0xa4/0xc8 bad_page+0xe4/0x13c free_pages_check_bad+0x4c/0xc0 __free_pages_ok+0x30c/0x340 __free_pages+0x30/0x44 __dma_direct_free_pages+0x30/0x38 dma_direct_free+0x24/0x38 dma_free_attrs+0x9c/0xd8 dmam_release+0x20/0x28 release_nodes+0x17c/0x220 devres_release_all+0x34/0x54 really_probe+0xc4/0x2c8 driver_probe_device+0x58/0xfc device_driver_attach+0x68/0x70 __driver_attach+0x94/0xdc bus_for_each_dev+0x5c/0xb4 driver_attach+0x20/0x28 bus_add_driver+0x14c/0x200 driver_register+0x6c/0x124 __pci_register_driver+0x48/0x50 sas_v3_pci_driver_init+0x20/0x28 do_one_initcall+0x40/0x25c kernel_init_freeable+0x2b8/0x3c0 kernel_init+0x10/0x100 ret_from_fork+0x10/0x18 Disabling lock debugging due to kernel taint BUG: Bad page state in process swapper/0 pfn:313f6 page:7ec4fd80 count:1 mapcount:0 mapping: index:0x0 [ 89.322983] flags: 0xfffe0001000(reserved) raw: 0fffe0001000 7ec4fd88 7ec4fd88 raw: 0001 The crash occurs for the same reason. 
In this case, on the really_probe() failure path, we are still clearing the DMA ops prior to releasing the device's managed memories. This patch fixes this issue by reordering the DMA ops teardown and the call to devres_release_all() on the failure path. Reported-by: Xiang Chen Tested-by: Xiang Chen Signed-off-by: John Garry --- For convenience, here is the 2nd half of really_probe() now: atomic_inc(&probe_count); pr_debug("bus: '%s': %s: probing driver %s with device %s\n", drv->bus->name, __func__, drv->name, dev_name(dev)); WARN_ON(!list_empty(&dev->devres_head)); re_probe: dev->driver = drv; /* If using pinctrl, bind pins now before probing */ ret = pinctrl_bind_pins(dev); if (ret) goto pinctrl_bind_failed; if (dev->bus->dma_configure) { ret = dev->bus->dma_configure(dev); if (ret) goto probe_failed; } if (driver_sysfs_add(dev)) { printk(KERN_ERR "%s: driver_sysfs_add(%s) failed\n", __func__, dev_name(dev)); goto probe_failed; } if (dev->pm_domain && dev->pm_domain->activate) { ret = dev->pm_domain->activate(dev); if (ret) goto probe_failed; } if (dev->bus->probe) { ret = dev->bus->probe(dev); if (ret) goto probe_failed; } else if (drv->probe) { ret = drv->probe(dev); if (ret) goto probe_failed; } if (test_remove) { test_remove = false; if (dev->bus->remove) dev->bus->remove(dev); else if (drv->remove) drv->remove(dev); devres_release_all(dev); driver_sysfs_remove(dev); dev->driver = NULL; dev_set_drvdata(dev, NULL); if (dev->pm_domain && dev->pm_domain->dismiss) dev->pm_domain->dismiss(dev); pm_runtime_reinit(dev); goto re_probe; } pinctrl_init_done(dev); if (dev->pm_domain && dev->pm_domain->sync) dev->pm_domain->sync(dev); driver_bound(dev); ret = 1; pr_debug("bus: '%s': %s: bound device %s t
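The reason the ordering matters can be shown with a toy userspace model (illustrative only — this is not kernel code, and every name below is invented for the sketch): once the DMA ops have been torn down, a devres-managed DMA free falls through to the direct path, which never allocated those pages, hence the "Bad page state" splats above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the ordering bug: managed (devres) DMA memory must be
 * released while the device's dma_ops are still valid, otherwise the
 * free takes the direct path that never allocated the buffer. */
struct dummy_dev {
    const char *dma_ops;      /* NULL models "direct" ops */
    int freed_via_iommu;
    int freed_via_direct;
};

static void dmam_release_model(struct dummy_dev *dev)
{
    if (dev->dma_ops)
        dev->freed_via_iommu++;   /* correct: __iommu_free_attrs()    */
    else
        dev->freed_via_direct++;  /* wrong: dma_direct_free() -> BUG  */
}

/* Buggy error path: tear down the ops first, then release devres. */
static void probe_failed_buggy(struct dummy_dev *dev)
{
    dev->dma_ops = NULL;          /* arch_teardown_dma_ops() */
    dmam_release_model(dev);      /* devres_release_all()    */
}

/* Fixed error path: release devres first, as the patch reorders it. */
static void probe_failed_fixed(struct dummy_dev *dev)
{
    dmam_release_model(dev);      /* devres_release_all()    */
    dev->dma_ops = NULL;          /* arch_teardown_dma_ops() */
}
```

In the buggy ordering the managed buffer is always released via the direct path; with the patch's ordering it goes through the IOMMU ops that mapped it.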
Re: arm64 iommu groups issue
Right, and even worse is that it relies on the port driver even existing at all. All this iommu group assignment should be taken outside device driver probe paths. However we could still consider device links for sync'ing the SMMU and each device probing. Yes, we should get that for DT now thanks to the of_devlink stuff, but cooking up some equivalent for IORT might be worthwhile. It doesn't solve this problem, but at least we could remove the iommu_ops check in iort_iommu_xlate(). We would need to carve out a path from pci_device_add() or even device_add() to solve all cases. Another thought that crosses my mind is that when pci_device_group() walks up to the point of ACS isolation and doesn't find an existing group, it can still infer that everything it walked past *should* be put in the same group it's then eventually going to return. Unfortunately I can't see an obvious way for it to act on that knowledge, though, since recursive iommu_probe_device() is unlikely to end well. [...] And this looks to be the reason for which current iommu_bus_init()->bus_for_each_device(..., add_iommu_group) fails also. Of course, just adding a 'correct' add_device replay without the of_xlate process doesn't help at all. No wonder this looked suspiciously simpler than where the first idea left off... (on reflection, the core of this idea seems to be recycling the existing iommu_bus_init walk rather than building up a separate "waiting list", while forgetting that that wasn't the difficult part of the original idea anyway) We could still use a bus walk to add the group per iommu, but we would need an additional check to ensure the device is associated with the IOMMU. On this current code mentioned, the principle of this seems wrong to me - we call bus_for_each_device(..., add_iommu_group) for the first SMMU in the system which probes, but we attempt to add_iommu_group() for all devices on the bus, even though the SMMU for that device may not yet have probed.
Yes, iommu_bus_init() is one of the places still holding a deeply-ingrained assumption that the ops go live for all IOMMU instances at once, which is what warranted the further replay in of_iommu_configure() originally. Moving that out of of_platform_device_create() to support probe deferral is where the trouble really started. I'm not too familiar with the history here, but could this be reverted now with the introduction of of_devlink stuff? Cheers, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu: silence iommu group prints
On 28/02/2020 02:16, Lu Baolu wrote: Hi, On 2020/2/27 19:57, Russell King wrote: On the LX2160A, there are lots (about 160) of IOMMU messages produced during boot; this is excessive. Reduce the severity of these messages to debug level. Signed-off-by: Russell King --- drivers/iommu/iommu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 3ead597e1c57..304281ec623b 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -741,7 +741,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev) trace_add_device_to_group(group->id, dev); - dev_info(dev, "Adding to iommu group %d\n", group->id); + dev_dbg(dev, "Adding to iommu group %d\n", group->id); I'm not strongly against this. But to me this message seems to be a good indicator that a device was probed successfully by the iommu subsystem. Keeping it in the default kernel messages always helps with kernel debugging. I would tend to agree. Best regards, baolu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice
On 21/01/2020 09:56, Robin Murphy wrote: On 18/12/2019 4:39 am, Cong Wang wrote: Both find_iova() and __free_iova() take iova_rbtree_lock, there is no reason to take and release it twice inside free_iova(). Fold them into one critical section by calling the unlock versions instead. Makes sense to me. Reviewed-by: Robin Murphy Reviewed-by: John Garry Could we at least get this patch picked up? It should be ok to take in isolation, since there is some debate on the other 2 patches in this series. Thanks Cc: Joerg Roedel Cc: John Garry Signed-off-by: Cong Wang --- drivers/iommu/iova.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 184d4c0e20b5..f46f8f794678 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova); void free_iova(struct iova_domain *iovad, unsigned long pfn) { - struct iova *iova = find_iova(iovad, pfn); + unsigned long flags; + struct iova *iova; + spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); + iova = private_find_iova(iovad, pfn); if (iova) - __free_iova(iovad, iova); + private_free_iova(iovad, iova); + spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); } EXPORT_SYMBOL_GPL(free_iova); . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
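The pattern in this patch — calling the lock-free helpers inside a single critical section instead of taking and dropping the lock twice — generalizes beyond the IOVA allocator. A self-contained userspace analogue (illustrative names; only `private_find_iova()`/`private_free_iova()` above are the kernel's real helpers, everything below is invented):

```c
#include <pthread.h>
#include <stddef.h>

/* Toy analogue of free_iova(): look up and remove an entry under one
 * lock/unlock pair, with no window where another thread could free
 * the entry between the lookup and the removal. */
struct table {
    pthread_mutex_t lock;
    int slots[16];                   /* nonzero = allocated pfn */
};

static int *find_locked(struct table *t, int pfn)  /* caller holds lock */
{
    for (size_t i = 0; i < 16; i++)
        if (t->slots[i] == pfn)
            return &t->slots[i];
    return NULL;
}

static void free_locked(int *slot)                 /* caller holds lock */
{
    *slot = 0;
}

/* One critical section instead of two. */
static int free_entry(struct table *t, int pfn)
{
    int found = 0;

    pthread_mutex_lock(&t->lock);
    int *slot = find_locked(t, pfn);
    if (slot) {
        free_locked(slot);
        found = 1;
    }
    pthread_mutex_unlock(&t->lock);
    return found;
}
```

Besides halving the lock traffic, folding the two sections also closes the find/free race that the two-step version would otherwise have to reason about.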
Re: arm-smmu-v3 high cpu usage for NVMe
Hi Will, On Thu, Jan 02, 2020 at 05:44:39PM +, John Garry wrote: And for the overall system, we have: PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs) -- 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist 11.71% [kernel] [k] _raw_spin_unlock_irqrestore 6.35% [kernel] [k] _raw_spin_unlock_irq 2.65% [kernel] [k] get_user_pages_fast 2.03% [kernel] [k] __slab_free 1.55% [kernel] [k] tick_nohz_idle_exit 1.47% [kernel] [k] arm_lpae_map 1.39% [kernel] [k] __fget 1.14% [kernel] [k] __lock_text_start 1.09% [kernel] [k] _raw_spin_lock 1.08% [kernel] [k] bio_release_pages.part.42 1.03% [kernel] [k] __sbitmap_get_word 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42 0.91% [kernel] [k] fput_many 0.88% [kernel] [k] __arm_lpae_map One thing to note is that we still spend an appreciable amount of time in arm_smmu_atc_inv_domain(), which is disappointing when considering it should effectively be a noop. As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD SYNCs for small batches. Anyway, let me know your thoughts or any questions. I'll have a look, if I get a chance, for other possible bottlenecks. Did you ever get any more information on this? I don't have any SMMUv3 hardware any more, so I can't really dig into this myself. I'm only getting back to look at this now, as SMMU performance is a bit of a hot topic again for us. So one thing we are doing which looks to help performance is this series from Marc: https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d So that is just spreading the per-CPU load for NVMe interrupt handling (where the DMA unmapping is happening), so I'd say just side-stepping any SMMU issue really.
Going back to the SMMU, I wanted to run eBPF and perf annotate to help profile this, but was having no luck getting them to work properly. I'll look at this again now. Cheers, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: arm-smmu-v3 high cpu usage for NVMe
On 19/03/2020 18:43, Jean-Philippe Brucker wrote: On Thu, Mar 19, 2020 at 12:54:59PM +, John Garry wrote: Hi Will, On Thu, Jan 02, 2020 at 05:44:39PM +, John Garry wrote: And for the overall system, we have: PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs) -- 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist 11.71% [kernel] [k] _raw_spin_unlock_irqrestore 6.35% [kernel] [k] _raw_spin_unlock_irq 2.65% [kernel] [k] get_user_pages_fast 2.03% [kernel] [k] __slab_free 1.55% [kernel] [k] tick_nohz_idle_exit 1.47% [kernel] [k] arm_lpae_map 1.39% [kernel] [k] __fget 1.14% [kernel] [k] __lock_text_start 1.09% [kernel] [k] _raw_spin_lock 1.08% [kernel] [k] bio_release_pages.part.42 1.03% [kernel] [k] __sbitmap_get_word 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42 0.91% [kernel] [k] fput_many 0.88% [kernel] [k] __arm_lpae_map One thing to note is that we still spend an appreciable amount of time in arm_smmu_atc_inv_domain(), which is disappointing when considering it should effectively be a noop. As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD SYNCs for small batches. Anyway, let me know your thoughts or any questions. I'll have a look if a get a chance for other possible bottlenecks. Did you ever get any more information on this? I don't have any SMMUv3 hardware any more, so I can't really dig into this myself. I'm only getting back to look at this now, as SMMU performance is a bit of a hot topic again for us. 
So one thing we are doing which looks to help performance is this series from Marc: https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d So that is just spreading the per-CPU load for NVMe interrupt handling (where the DMA unmapping is happening), so I'd say just side-stepping any SMMU issue really. Going back to the SMMU, I wanted to run eBPF and perf annotate to help profile this, but was having no luck getting them to work properly. I'll look at this again now. Could you also try with the upcoming ATS change currently in Will's tree? They won't improve your numbers but it'd be good to check that they don't make things worse. I can do when I get a chance. I've run a bunch of netperf instances on multiple cores and collected SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty consistently. - 6.07% arm_smmu_iotlb_sync - 5.74% arm_smmu_tlb_inv_range 5.09% arm_smmu_cmdq_issue_cmdlist 0.28% __pi_memset 0.08% __pi_memcpy 0.08% arm_smmu_atc_inv_domain.constprop.37 0.07% arm_smmu_cmdq_build_cmd 0.01% arm_smmu_cmdq_batch_add 0.31% __pi_memset So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(), when ATS is not used. According to the annotations, the load from the atomic_read(), that checks whether the domain uses ATS, is 77% of the samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure there is much room for optimization there. Well I did originally suggest using RCU protection to scan the list of devices, instead of reading an atomic and checking for non-zero value. But that would be an optimisation for ATS also, and there were no ATS devices at the time (to verify performance). Cheers, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
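The nr_ats_masters fast path being weighed against an RCU list walk can be modelled in a few lines of userspace C (purely illustrative — the real code is arm_smmu_atc_inv_domain() and the kernel's atomic counter; the struct and function names below are made up): keeping an atomic count of ATS-enabled masters lets the common no-ATS case return after a single load instead of walking the device list.

```c
#include <stdatomic.h>

/* Toy model of the fast path: an atomic count of ATS-enabled devices
 * short-circuits ATC invalidation when the domain has none. */
struct domain_model {
    atomic_int nr_ats_masters;
    int invalidations_sent;
};

static void ats_enable(struct domain_model *d)
{
    atomic_fetch_add(&d->nr_ats_masters, 1);
}

static void atc_inv_domain(struct domain_model *d)
{
    if (atomic_load(&d->nr_ats_masters) == 0)
        return;                  /* fast path: nothing to invalidate     */
    d->invalidations_sent++;     /* slow path: walk devices, send ATC inv */
}
```

The trade-off discussed in the thread is exactly this: the counter makes the no-ATS case one (possibly cache-missing) load, whereas an RCU-protected scan would favour the ATS case at some cost to the common one.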
Re: arm-smmu-v3 high cpu usage for NVMe
I've run a bunch of netperf instances on multiple cores and collected SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty consistently. - 6.07% arm_smmu_iotlb_sync - 5.74% arm_smmu_tlb_inv_range 5.09% arm_smmu_cmdq_issue_cmdlist 0.28% __pi_memset 0.08% __pi_memcpy 0.08% arm_smmu_atc_inv_domain.constprop.37 0.07% arm_smmu_cmdq_build_cmd 0.01% arm_smmu_cmdq_batch_add 0.31% __pi_memset So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(), when ATS is not used. According to the annotations, the load from the atomic_read(), that checks whether the domain uses ATS, is 77% of the samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure there is much room for optimization there. Well I did originally suggest using RCU protection to scan the list of devices, instead of reading an atomic and checking for non-zero value. But that would be an optimisation for ATS also, and there were no ATS devices at the time (to verify performance). Heh, I have yet to get my hands on one. Currently I can't evaluate ATS performance, but I agree that using RCU to scan the list should get better results when using ATS. When ATS isn't in use however, I suspect reading nr_ats_masters should be more efficient than taking the RCU lock + reading an "ats_devices" list (since the smmu_domain->devices list also serves context descriptor invalidation, even when ATS isn't in use). I'll run some tests however, to see if I can micro-optimize this case, but I don't expect noticeable improvements. ok, cheers. I, too, would not expect a significant improvement there. JFYI, I've been playing with "perf annotate" today and it's giving strange results for my NVMe testing.
So "report" looks somewhat sane, if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist(): 55.39% irq/342-nvme0q1 [kernel.kallsyms] [k] arm_smmu_cmdq_issue_cmdlist 9.74% irq/342-nvme0q1 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 2.02% irq/342-nvme0q1 [kernel.kallsyms] [k] nvme_irq 1.86% irq/342-nvme0q1 [kernel.kallsyms] [k] fput_many 1.73% irq/342-nvme0q1 [kernel.kallsyms] [k] arm_smmu_atc_inv_domain.constprop.42 1.67% irq/342-nvme0q1 [kernel.kallsyms] [k] __arm_lpae_unmap 1.49% irq/342-nvme0q1 [kernel.kallsyms] [k] aio_complete_rw But "annotate" consistently tells me that a specific instruction consumes ~99% of the load for the enqueue function: : /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */ : if (sync) { 0.00 : 80001071c948: ldr w0, [x29, #108] : int ret = 0; 0.00 : 80001071c94c: mov w24, #0x0 // #0 : if (sync) { 0.00 : 80001071c950: cbnz w0, 80001071c990 : arch_local_irq_restore(): 0.00 : 80001071c954: msr daif, x21 : arm_smmu_cmdq_issue_cmdlist(): : } : } : : local_irq_restore(flags); : return ret; : } 99.51 : 80001071c958: adrp x0, 800011909000 0.00 : 80001071c95c: add x21, x0, #0x908 0.02 : 80001071c960: ldr x2, [x29, #488] 0.14 : 80001071c964: ldr x1, [x21] 0.00 : 80001071c968: eor x1, x2, x1 0.00 : 80001071c96c: mov w0, w24 But there may be a hint that we're consuming a lot of time polling for CMD_SYNC consumption. The files are available here: https://raw.githubusercontent.com/hisilicon/kernel-dev/private-topic-nvme-5.6-profiling/ann.txt, report Or maybe I'm just not using the tool properly ... Cheers, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: arm-smmu-v3 high cpu usage for NVMe
On 20/03/2020 16:33, Marc Zyngier wrote: JFYI, I've been playing for "perf annotate" today and it's giving strange results for my NVMe testing. So "report" looks somewhat sane, if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist(): 55.39% irq/342-nvme0q1 [kernel.kallsyms] [k] arm_smmu_cmdq_issue_cmdlist 9.74% irq/342-nvme0q1 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 2.02% irq/342-nvme0q1 [kernel.kallsyms] [k] nvme_irq 1.86% irq/342-nvme0q1 [kernel.kallsyms] [k] fput_many 1.73% irq/342-nvme0q1 [kernel.kallsyms] [k] arm_smmu_atc_inv_domain.constprop.42 1.67% irq/342-nvme0q1 [kernel.kallsyms] [k] __arm_lpae_unmap 1.49% irq/342-nvme0q1 [kernel.kallsyms] [k] aio_complete_rw But "annotate" consistently tells me that a specific instruction consumes ~99% of the load for the enqueue function: : /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */ : if (sync) { 0.00 : 80001071c948: ldr w0, [x29, #108] : int ret = 0; 0.00 : 80001071c94c: mov w24, #0x0 // #0 : if (sync) { 0.00 : 80001071c950: cbnz w0, 80001071c990 : arch_local_irq_restore(): 0.00 : 80001071c954: msr daif, x21 : arm_smmu_cmdq_issue_cmdlist(): : } : } : : local_irq_restore(flags); : return ret; : } 99.51 : 80001071c958: adrp x0, 800011909000 Hi Marc, This is likely the side effect of the re-enabling of interrupts (msr daif, x21) on the previous instruction which causes the perf interrupt to fire right after. ok, makes sense. Time to enable pseudo-NMIs in the PMUv3 driver... Do you know if there is any plan for this? In the meantime, maybe I can do some trickery by putting the local_irq_restore() in a separate function, outside arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same function. Cheers, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: arm-smmu-v3 high cpu usage for NVMe
On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,

Time to enable pseudo-NMIs in the PMUv3 driver... Do you know if there is any plan for this?

There was. Julien Thierry has a bunch of patches for that [1], but they need reviving.

So those patches still apply cleanly (apart from the kvm patch, which I can skip, I suppose) and build, so I can try this, I figure. Is there anything else which I should ensure or know about, apart from enabling CONFIG_ARM64_PSEUDO_NMI?

A quickly taken perf annotate and report is at the tip here: https://github.com/hisilicon/kernel-dev/commits/private-topic-nvme-5.6-profiling

In the meantime, maybe I can do some trickery by putting the local_irq_restore() in a separate function, outside arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same function.

Scratch that :) I don't see how you can improve the profiling without compromising the locking in this case...

Cheers, John

[1] https://patchwork.kernel.org/cover/11047407/
Re: arm-smmu-v3 high cpu usage for NVMe
On 24/03/2020 10:43, Marc Zyngier wrote: On Tue, 24 Mar 2020 09:18:10 + John Garry wrote: On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,

Time to enable pseudo-NMIs in the PMUv3 driver... Do you know if there is any plan for this?

There was. Julien Thierry has a bunch of patches for that [1], but they need reviving.

So those patches still apply cleanly (apart from the kvm patch, which I can skip, I suppose) and build, so I can try this I figure. Is there anything else which I should ensure or know about? Apart from enabling CONFIG_ARM64_PSEUDO_NMI.

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05 has it set to 0, preventing me from being able to use the feature (hint, nudge... ;-).

Yeah, apparently it's set on our D06CS board, but I just need to double check the FW version with our FW guy. As for D05, there has not been a FW update there in quite a long time and no plans for it. Sorry.

Cheers, John
Re: arm-smmu-v3 high cpu usage for NVMe
On 24/03/2020 12:07, Robin Murphy wrote: On 2020-03-24 11:55 am, John Garry wrote: On 24/03/2020 10:43, Marc Zyngier wrote: On Tue, 24 Mar 2020 09:18:10 + John Garry wrote: On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,

Time to enable pseudo-NMIs in the PMUv3 driver... Do you know if there is any plan for this?

There was. Julien Thierry has a bunch of patches for that [1], but they need reviving.

So those patches still apply cleanly (apart from the kvm patch, which I can skip, I suppose) and build, so I can try this I figure. Is there anything else which I should ensure or know about? Apart from enabling CONFIG_ARM64_PSEUDO_NMI.

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05 has it set to 0, preventing me from being able to use the feature (hint, nudge... ;-).

Yeah, apparently it's set on our D06CS board, but I just need to double check the FW version with our FW guy.

Hopefully you saw the help for CONFIG_ARM64_PSEUDO_NMI already, but since it's not been called out: This high priority configuration for interrupts needs to be explicitly enabled by setting the kernel parameter "irqchip.gicv3_pseudo_nmi" to 1.

Yeah, I saw that by chance somewhere else previously.

FWIW I believe it is still on the plan for someone here to dust off the PMU pNMI patches at some point.

Cool. Well I can try to experiment with what Julien had at v4 for now.

Cheers, John
Re: arm-smmu-v3 high cpu usage for NVMe
FWIW I believe it is still on the plan for someone here to dust off the PMU pNMI patches at some point.

Cool. Well I can try to experiment with what Julien had at v4 for now.

JFYI, I have done some more perf record capturing, and updated the "annotate" and "report" output here: https://raw.githubusercontent.com/hisilicon/kernel-dev/679eca1008b1d11b42e1b5fa8a205266c240d1e1/ann.txt and .../report

This capture is just for cpu0, since NVMe irq handling+dma unmapping will occur on specific CPUs, cpu0 being one of them. The reports look somewhat sane. So we no longer have ~99% of time attributed to re-enabling interrupts, now that's like:

   3.14 : 80001071eae0: ldr  w0, [x29, #108]
        : int ret = 0;
   0.00 : 80001071eae4: mov  w24, #0x0  // #0
        : if (sync) {
   0.00 : 80001071eae8: cbnz w0, 80001071eb44
        : arch_local_irq_restore():
        : asm volatile(ALTERNATIVE(
   0.00 : 80001071eaec: msr  daif, x21
        : arch_static_branch():
   0.25 : 80001071eaf0: nop
        : arm_smmu_cmdq_issue_cmdlist():
        : }
        : }
        :
        : local_irq_restore(flags);
        : return ret;
        : }

One observation (if these reports are to be believed) is that we may spend a lot of time in the CAS loop, trying to get a place in the queue initially:

        : __CMPXCHG_CASE(w, , , 32, )
        : __CMPXCHG_CASE(x, , , 64, )
   0.00 : 80001071e828: mov  x0, x27
   0.00 : 80001071e82c: mov  x4, x1
   0.00 : 80001071e830: cas  x4, x2, [x27]
  28.61 : 80001071e834: mov  x0, x4
        : arm_smmu_cmdq_issue_cmdlist():
        : if (old == llq.val)
   0.00 : 80001071e838: ldr  x1, [x23]

John
Re: arm-smmu-v3 high cpu usage for NVMe
On 02/04/2020 13:10, John Garry wrote: On 18/03/2020 20:53, Will Deacon wrote:

As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD_SYNCs for small batches. Anyway, let me know your thoughts or any questions. I'll have a look if I get a chance for other possible bottlenecks.

Did you ever get any more information on this? I don't have any SMMUv3 hardware any more, so I can't really dig into this myself.

Hi Will,

JFYI, I added some debug in arm_smmu_cmdq_issue_cmdlist() to get some idea of what is going on. Perf annotate did not tell much.

I tested NVMe performance with and without Marc's patchset to spread LPIs for managed interrupts.

Average duration of arm_smmu_cmdq_issue_cmdlist() [all results are approximations]:

mainline:
  owner: 6ms
  non-owner: 4ms

mainline + LPI spreading patchset:
  owner: 25ms
  non-owner: 22ms

For this, a list would be a tlbi + cmd_sync.

Please note that the LPI spreading patchset is still giving circa 25% NVMe throughput increase. What happens there would be that we get many more cpus involved, which creates more inter-cpu contention. But the performance increase comes from just alleviating pressure on those overloaded cpus.

I also notice that with the LPI spreading patchset, on average a cpu is an "owner" in arm_smmu_cmdq_issue_cmdlist() 1 in 8 times, as opposed to 1 in 3 for mainline. This means that we're just creating longer chains of lists to be published.

But I found that for a non-owner, average msi cmd_sync polling time is 12ms with the LPI spreading patchset. As such, it seems to be really taking approx (12 * 2/(8 - 1) =) ~3ms to consume a single list. This seems consistent with my finding that an owner polls consumption for 3ms also. Without the LPI spreading patchset, polling time is approx 2 and 3ms for owner and non-owner, respectively.
As an experiment, I did try to hack the code to use a spinlock again for protecting the command queue, instead of current solution - and always saw a performance drop there. To be expected. But maybe we can try to not use a spinlock, but still serialise production+consumption to alleviate the long polling periods. Let me know your thoughts. Cheers, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: iommu_iova slab eats too much memory
On 24/04/2020 17:30, Robin Murphy wrote: On 2020-04-24 2:20 pm, Bin wrote: Dear Robin: Thank you for your explanation. Now, I understand that this could be NIC driver's fault, but how could I confirm it? Do I have to debug the driver myself? I'd start with CONFIG_DMA_API_DEBUG - of course it will chew through memory about an order of magnitude faster than the IOVAs alone, but it should shed some light on whether DMA API usage looks suspicious, and dumping the mappings should help track down the responsible driver(s). Although the debugfs code doesn't show the stacktrace of where each mapping was made, I guess it would be fairly simple to tweak that for a quick way to narrow down where to start looking in an offending driver. Robin. Just mentioning this in case it's relevant - we found long term aging throughput test causes RB tree to grow very large (and would I assume eat lots of memory): https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leiz...@huawei.com/ John Robin Murphy 于2020年4月24日周五 下午8:15写道: On 2020-04-24 1:06 pm, Bin wrote: I'm not familiar with the mmu stuff, so what you mean by "some driver leaking DMA mappings", is it possible that some other kernel module like KVM or NIC driver leads to the leaking problem instead of the iommu module itself? Yes - I doubt that intel-iommu itself is failing to free IOVAs when it should, since I'd expect a lot of people to have noticed that. It's far more likely that some driver is failing to call dma_unmap_* when it's finished with a buffer - with the IOMMU disabled that would be a no-op on x86 with a modern 64-bit-capable device, so such a latent bug could have been easily overlooked. Robin. Bin 于 2020年4月24日周五 20:00写道: Well, that's the problem! I'm assuming the iommu kernel module is leaking memory. But I don't know why and how. Do you have any idea about it? Or any further information is needed? Robin Murphy 于 2020年4月24日周五 19:20写道: On 2020-04-24 1:40 am, Bin wrote: Hello? anyone there? 
Bin 于2020年4月23日周四 下午5:14写道: Forget to mention, I've already disabled the slab merge, so this is what it is. Bin 于2020年4月23日周四 下午5:11写道: Hey, guys: I'm running a batch of CoreOS boxes, the lsb_release is: ``` # cat /etc/lsb-release DISTRIB_ID="Container Linux by CoreOS" DISTRIB_RELEASE=2303.3.0 DISTRIB_CODENAME="Rhyolite" DISTRIB_DESCRIPTION="Container Linux by CoreOS 2303.3.0 (Rhyolite)" ``` ``` # uname -a Linux cloud-worker-25 4.19.86-coreos #1 SMP Mon Dec 2 20:13:38 -00 2019 x86_64 Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz GenuineIntel GNU/Linux ``` Recently, I found my vms constently being killed due to OOM, and after digging into the problem, I finally realized that the kernel is leaking memory. Here's my slabinfo: Active / Total Objects (% used): 83818306 / 84191607 (99.6%) Active / Total Slabs (% used) : 1336293 / 1336293 (100.0%) Active / Total Caches (% used) : 152 / 217 (70.0%) Active / Total Size (% used) : 5828768.08K / 5996848.72K (97.2%) Minimum / Average / Maximum Object : 0.01K / 0.07K / 23.25K OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME 80253888 80253888 100%0.06K 1253967 64 5015868K iommu_iova Do you really have a peak demand of ~80 million simultaneous DMA buffers, or is some driver leaking DMA mappings? Robin. 
489472 489123  99% 0.03K   3824  128  15296K kmalloc-32
297444 271112  91% 0.19K   7082   42  56656K dentry
254400 252784  99% 0.06K   3975   64  15900K anon_vma_chain
222528  39255  17% 0.50K   6954   32 111264K kmalloc-512
202482 201814  99% 0.19K   4821   42  38568K vm_area_struct
200192 200192 100% 0.01K    391  512   1564K kmalloc-8
170528 169359  99% 0.25K   5329   32  42632K filp
158144 153508  97% 0.06K   2471   64   9884K kmalloc-64
149914 149365  99% 0.09K   3259   46  13036K anon_vma
146640 143123  97% 0.10K   3760   39  15040K buffer_head
130368  32791  25% 0.09K   3104   42  12416K kmalloc-96
129752 129752 100% 0.07K   2317   56   9268K Acpi-Operand
105468 105106  99% 0.04K   1034  102   4136K selinux_inode_security
 73080  73080 100% 0.13K   2436   30   9744K kernfs_node_cache
 72360  70261  97% 0.59K   1340   54  42880K inode_cache
 71040  71040 100% 0.12K   2220   32   8880K eventpoll_epi
 68096  59262  87% 0.02K    266  256   1064K kmalloc-16
 53652  53652 100% 0.04K    526  102   2104K pde_opener
 50496  31654  62% 2.00K   3156   16 100992K kmalloc-2048
 46242  46242 100% 0.19K   1101   42   8808K cred_jar
 44496  43013  96% 0.66K    927   48  29664K proc_inode_cache
 44352  44352 100% 0.06K    693   64   2772K task_delay_info
 43516  43471  99% 0.69K    946   46   3027
Re: arm-smmu-v3 high cpu usage for NVMe
On 20/03/2020 10:41, John Garry wrote:

+ Barry, Alexandru

PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
--
  27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
  11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.35%  [kernel]  [k] _raw_spin_unlock_irq
   2.65%  [kernel]  [k] get_user_pages_fast
   2.03%  [kernel]  [k] __slab_free
   1.55%  [kernel]  [k] tick_nohz_idle_exit
   1.47%  [kernel]  [k] arm_lpae_map
   1.39%  [kernel]  [k] __fget
   1.14%  [kernel]  [k] __lock_text_start
   1.09%  [kernel]  [k] _raw_spin_lock
   1.08%  [kernel]  [k] bio_release_pages.part.42
   1.03%  [kernel]  [k] __sbitmap_get_word
   0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
   0.91%  [kernel]  [k] fput_many
   0.88%  [kernel]  [k] __arm_lpae_map

Hi Will, Robin,

I'm just getting around to look at this topic again. Here's the current picture for my NVMe test:

perf top -C 0 *
Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
Overhead  Shared Object  Symbol
  75.91%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
   3.28%  [kernel]  [k] arm_smmu_tlb_inv_range
   2.42%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.49
   2.35%  [kernel]  [k] _raw_spin_unlock_irqrestore
   1.32%  [kernel]  [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
   1.20%  [kernel]  [k] aio_complete_rw
   0.96%  [kernel]  [k] enqueue_task_fair
   0.93%  [kernel]  [k] gic_handle_irq
   0.86%  [kernel]  [k] _raw_spin_lock_irqsave
   0.72%  [kernel]  [k] put_reqs_available
   0.72%  [kernel]  [k] sbitmap_queue_clear

* only certain CPUs run the dma unmap for my scenario, cpu0 being one of them. Colleague Barry has similar findings for some other scenarios.
So we tried the latest perf NMI support wip patches, and noticed a few hotspots (see https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate and https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt) when running some NVMe traffic:

- initial cmpxchg to get a place in the queue
  - when more CPUs get involved, we start failing at an exponential rate

   0.00 : 8000107a3500: cas  x4, x2, [x27]
  26.52 : 8000107a3504: mov  x0, x4
        : arm_smmu_cmdq_issue_cmdlist():

- the queue locking
- polling cmd_sync

Some ideas to optimise:

a. initial cmpxchg

So this cmpxchg could be considered unfair. In addition, with all the contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged around the system. Maybe we can implement something similar to the idea of queued/ticketed spinlocks, making a CPU spin on its own copy of arm_smmu_cmdq.q after the initial cmpxchg fails, released by its leader, and releasing subsequent followers.

b. Drop the queue_full checking in certain circumstances

If we cannot theoretically fill the queue, then stop the checking for queue full or similar. This should also help the current problem of a., as the less time between cmpxchg attempts, the less chance of failing (as we check queue available space between cmpxchg attempts). So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we always issue a cmd_sync for a batch (regardless of whether requested), then we should never fill (I think).

c. Don't do queue locking in certain circumstances

If we implement (and support) b. and support MSI polling, then I don't think that this is required.

d.
More minor ideas are to move forward when the "owner" stops gathering to reduce time of advancing the prod, hopefully reducing cmd_sync polling time; and also use a smaller word size for the valid bitmap operations, maybe 32b atomic operations are overall more efficient (than 64b) - mostly valid range check is < 16 bits from my observation. Let me know your thoughts or any other ideas. Thanks, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH RFC 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
As mentioned in [0], the CPU may consume many cycles processing arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to get space on the queue takes approx 25% of the cycles for this function. The cmpxchg() is removed as follows: - We assume that the cmdq can never fill with changes to limit the batch size (where necessary) and always issue a CMD_SYNC for a batch We need to do this since we no longer maintain the cons value in software, and we cannot deal with no available space properly. - Replace cmpxchg() with atomic inc operation, to maintain the prod and owner values. Early experiments have shown that we may see a 25% boost in throughput IOPS for my NVMe test with these changes. And some CPUs, which were loaded at ~55%, now see a ~45% load. So, even though the changes are incomplete and other parts of the driver will need fixing up (and it looks maybe broken for !MSI support), the performance boost seen would seem to be worth the effort of exploring this. Comments requested please. Thanks [0] https://lore.kernel.org/linux-iommu/b926444035e5e2439431908e3842afd24b8...@dggemi525-mbs.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3 John Garry (2): iommu/arm-smmu-v3: Calculate bits for prod and owner iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() drivers/iommu/arm-smmu-v3.c | 92 +++-- 1 file changed, 47 insertions(+), 45 deletions(-) -- 2.16.4 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH RFC 1/2] iommu/arm-smmu-v3: Calculate bits for prod and owner
Since the arm_smmu_ll_queue.prod will be used for counting the "owner" value and also the HW prod pointer, calculate how many bits are available for, and used by, each. This is based on the number of possible CPUs in the system. And we require that each CPU can issue a minimum of 2 commands per batch - 1 x CMD_SYNC and at least 1 x other.

Ignoring limits of HW max_n_shift and HW cmdq memory allocation, approx 16K is the max supported CPUs. For this, max_n_shift would be 15.

Signed-off-by: John Garry
---
 drivers/iommu/arm-smmu-v3.c | 34 +-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 82508730feb7..e607ab5a4cbd 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -530,6 +530,8 @@ struct arm_smmu_ll_queue {
 		u8 __pad[SMP_CACHE_BYTES];
 	} cacheline_aligned_in_smp;
 	u32 max_n_shift;
+	u32 max_cmd_per_batch;
+	u32 owner_count_shift;
 };
 
 struct arm_smmu_queue {
@@ -1512,7 +1514,10 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
 				    struct arm_smmu_cmdq_batch *cmds,
 				    struct arm_smmu_cmdq_ent *cmd)
 {
-	if (cmds->num == CMDQ_BATCH_ENTRIES) {
+	struct arm_smmu_cmdq *q = &smmu->cmdq;
+	struct arm_smmu_ll_queue *llq = &q->q.llq;
+
+	if (cmds->num == llq->max_cmd_per_batch) {
 		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
 		cmds->num = 0;
 	}
@@ -3156,8 +3161,24 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 				   unsigned long cons_off,
 				   size_t dwords, const char *name)
 {
+	int cpus = num_possible_cpus();
 	size_t qsz;
 
+	/*
+	 * We can get the number of bits required for owner counting by
+	 * log2(nr possible cpus) + 1, but we have to take into account that the
+	 * wrap+prod could overflow before the owner zeroes, so add 1
+	 * more (to cpus) for bits_for_cmdq_owner calculation.
+	 */
+	int bits_for_cmdq_owner = ilog2(cpus + 1) + 1;
+	/* 1-bit for overflow, 1-bit for wrap */
+	int bits_available_for_prod = 32 - 2 - bits_for_cmdq_owner;
+	int entries_for_prod;
+
+	if (bits_available_for_prod < 1) /* this would be insane - how many CPUs? */
+		return -ENOMEM;
+
+	q->llq.max_n_shift = min_t(int, q->llq.max_n_shift, bits_available_for_prod);
 	do {
 		qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
 		q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
@@ -3167,6 +3188,17 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 		q->llq.max_n_shift--;
 	} while (1);
 
+	entries_for_prod = 1 << q->llq.max_n_shift;
+
+	/*
+	 * We need at least 2 commands in a batch (1 x CMD_SYNC and 1 x whatever else).
+	 * Assuming orig max_n_shift >= 17, this would mean ~16K CPUs max.
+	 */
+	if (entries_for_prod < 2 * cpus)
+		return -ENOMEM;
+
+	q->llq.max_cmd_per_batch = min_t(u32, entries_for_prod / cpus, CMDQ_BATCH_ENTRIES);
+	q->llq.owner_count_shift = q->llq.max_n_shift + 1;
 
 	if (!q->base) {
 		dev_err(smmu->dev,
--
2.16.4
[PATCH RFC 2/2] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
It has been shown that the cmpxchg() for finding space in the cmdq can be a bottleneck:
- For more CPUs contending for the cmdq, cmpxchg() will fail more often.
- Since the software-maintained cons pointer is updated on the same 64b memory region, the chance of cmpxchg() failure increases again.

The cmpxchg() is removed as part of 2 related changes:

- If a CMD_SYNC is always issued for a batch, the cmdq can - in theory - never fill, since we always wait for a CMD_SYNC to be consumed. We must issue the CMD_SYNC so that a CPU will always be limited to issuing max_cmd_per_batch commands. Generally for DMA unmap ops, a CMD_SYNC is always issued anyway. As such, the cmdq locking is removed, and we no longer maintain cons in software (this needs to be revised for !MSI support).

- Update prod and cmdq owner in a single operation. For this, we count the prod and owner in separate regions in arm_smmu_ll_queue.prod. As with simple binary counting, once the prod+wrap fields overflow, they will zero. They will overflow into the "owner" region, but this is ok as we have accounted for the extra value. As for the "owner", we now count this value, instead of setting a flag. Similar to before, once the owner has finished gathering, it will clear this mask. As such, a CPU declares itself as the "owner" when it reads zero for this field. This zeroing will also clear possible overflow in the wrap+prod region.
Signed-off-by: John Garry --- drivers/iommu/arm-smmu-v3.c | 58 +++-- 1 file changed, 14 insertions(+), 44 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index e607ab5a4cbd..d6c7d82f9cf8 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -1375,7 +1375,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, u64 *cmds, int n, bool sync) { u64 cmd_sync[CMDQ_ENT_DWORDS]; - u32 prod; + u32 prod, prodx; unsigned long flags; bool owner; struct arm_smmu_cmdq *cmdq = &smmu->cmdq; @@ -1383,33 +1383,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, .max_n_shift = cmdq->q.llq.max_n_shift, }, head = llq; int ret = 0; + u32 owner_val = 1 << cmdq->q.llq.owner_count_shift; + u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); + u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift); + + /* always issue a CMD_SYNC TODO: fixup callers for this */ + sync = true; /* 1. Allocate some space in the queue */ local_irq_save(flags); - llq.val = READ_ONCE(cmdq->q.llq.val); - do { - u64 old; - - while (!queue_has_space(&llq, n + sync)) { - local_irq_restore(flags); - if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq)) - dev_err_ratelimited(smmu->dev, "CMDQ timeout\n"); - local_irq_save(flags); - } - head.cons = llq.cons; - head.prod = queue_inc_prod_n(&llq, n + sync) | -CMDQ_PROD_OWNED_FLAG; + prodx = atomic_fetch_add(n + sync + owner_val, &cmdq->q.llq.atomic.prod); - old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val); - if (old == llq.val) - break; - - llq.val = old; - } while (1); - owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG); - head.prod &= ~CMDQ_PROD_OWNED_FLAG; - llq.prod &= ~CMDQ_PROD_OWNED_FLAG; + owner = !(prodx & owner_mask); + llq.prod = prod_mask & prodx; + head.prod = queue_inc_prod_n(&llq, n + sync); /* * 2. Write our commands into the queue As with simple binary counting, once the prod+wrap fields overflow, they will zero. 
They will overflow into "owner" region, but this is ok as we have accounted for the extra value. @@ -1420,14 +1408,6 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, prod = queue_inc_prod_n(&llq, n); arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod); queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS); - - /* -* In order to determine completion of our CMD_SYNC, we must -* ensure that the queue can't wrap twice without us noticing. -* We achieve that by taking the cmdq lock as shared before -* marking our slot as valid. -*/ - arm_smmu_cmdq_shared_lock(cmdq); } /* 3. Mark our slots as valid, ensuring commands are visible first */ @@
Re: arm64 iommu groups issue
On 12/06/2020 15:30, Lorenzo Pieralisi wrote: On Mon, Feb 17, 2020 at 12:08:48PM +, John Garry wrote: Right, and even worse is that it relies on the port driver even existing at all. All this iommu group assignment should be taken outside device driver probe paths. However we could still consider device links for sync'ing the SMMU and each device probing. Yes, we should get that for DT now thanks to the of_devlink stuff, but cooking up some equivalent for IORT might be worthwhile. It doesn't solve this problem, but at least we could remove the iommu_ops check in iort_iommu_xlate(). We would need to carve out a path from pci_device_add() or even device_add() to solve all cases. Another thought that crosses my mind is that when pci_device_group() walks up to the point of ACS isolation and doesn't find an existing group, it can still infer that everything it walked past *should* be put in the same group it's then eventually going to return. Unfortunately I can't see an obvious way for it to act on that knowledge, though, since recursive iommu_probe_device() is unlikely to end well. [...] And this looks to be the reason for which current iommu_bus_init()->bus_for_each_device(..., add_iommu_group) fails also. Of course, just adding a 'correct' add_device replay without the of_xlate process doesn't help at all. No wonder this looked suspiciously simpler than where the first idea left off... (on reflection, the core of this idea seems to be recycling the existing iommu_bus_init walk rather than building up a separate "waiting list", while forgetting that that wasn't the difficult part of the original idea anyway) We could still use a bus walk to add the group per iommu, but we would need an additional check to ensure the device is associated with the IOMMU. 
On this current code mentioned, the principle of this seems wrong to me - we call bus_for_each_device(..., add_iommu_group) for the first SMMU in the system which probes, but we attempt to add_iommu_group() for all devices on the bus, even though the SMMU for a given device may not yet have probed.

Yes, iommu_bus_init() is one of the places still holding a deeply-ingrained assumption that the ops go live for all IOMMU instances at once, which is what warranted the further replay in of_iommu_configure() originally. Moving that out of of_platform_device_create() to support probe deferral is where the trouble really started.

I'm not too familiar with the history here, but could this be reverted now with the introduction of the of_devlink stuff?

Hi John,

Hi Lorenzo, have we managed to reach a consensus on this thread on how to solve the issue?

No, not really. So Robin and I tried a couple of quick things previously, but they did not come to much, as above.

Asking because this thread seems stalled - I am keen on getting it fixed.

I haven't spent more time on this. But from what I was hearing last time, this issue was ticketed internally for arm, so I was waiting for that to be picked up to re-engage.

Thanks, John
[PATCH RFC v2 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
As mentioned in [0], the CPU may consume many cycles processing arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to get space on the queue takes approx 25% of the cycles for this function. This series removes that cmpxchg(). For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput increase: Before: 1310 IOPs After: 1630 IOPs I also have a test harness to check the rate of DMA map+unmaps we can achieve: CPU count 32 64 128 Before: 63187 19418 10169 After: 93287 44789 15862 (unit is map+unmaps per CPU per second) There's no specific problem that I know of with this series, as previous issues should now be fixed - but I'm a bit nervous about how we deal with the queue being full and wrapping. And I want to test more. Thanks [0] https://lore.kernel.org/linux-iommu/b926444035e5e2439431908e3842afd24b8...@dggemi525-mbs.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3 John Garry (4): iommu/arm-smmu-v3: Fix trivial typo iommu/arm-smmu-v3: Calculate bits for prod and owner iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() drivers/iommu/arm-smmu-v3.c | 210 ++-- 1 file changed, 131 insertions(+), 79 deletions(-) -- 2.26.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 1/4] iommu/arm-smmu-v3: Fix trivial typo
Set "cmq" -> "cmdq". Signed-off-by: John Garry --- drivers/iommu/arm-smmu-v3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index f578677a5c41..a8e814c652fe 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -1479,7 +1479,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, } /* -* Try to unlock the cmq lock. This will fail if we're the last +* Try to unlock the cmdq lock. This will fail if we're the last * reader, in which case we can safely update cmdq->q.llq.cons */ if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) { -- 2.26.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH RFC v2 3/4] iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
To ensure that a CPU does not send more than a permitted amount of commands to the cmdq, ensure that each batch includes a CMD_SYNC. When issuing a CMD_SYNC, we always wait for the consumption of its batch of commands - as such, we guarantee that any CPU will not issue more than its permitted amount. Signed-off-by: John Garry --- drivers/iommu/arm-smmu-v3.c | 87 + 1 file changed, 40 insertions(+), 47 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index c3562dc35d45..36648163364c 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -1373,11 +1373,15 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds, * - Command insertion is totally ordered, so if two CPUs each race to * insert their own list of commands then all of the commands from one * CPU will appear before any of the commands from the other CPU. + * + * - A CMD_SYNC is always inserted, ensuring that any CPU does not issue + * more than the permitted amount commands at once. */ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, - u64 *cmds, int n, bool sync) + u64 *cmds, int n) { u64 cmd_sync[CMDQ_ENT_DWORDS]; + const int sync = 1; u32 prod; unsigned long flags; bool owner; @@ -1419,19 +1423,18 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, * Dependency ordering from the cmpxchg() loop above. */ arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n); - if (sync) { - prod = queue_inc_prod_n(&llq, n); - arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod); - queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS); - /* -* In order to determine completion of our CMD_SYNC, we must -* ensure that the queue can't wrap twice without us noticing. -* We achieve that by taking the cmdq lock as shared before -* marking our slot as valid. 
-*/ - arm_smmu_cmdq_shared_lock(cmdq); - } + prod = queue_inc_prod_n(&llq, n); + arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod); + queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS); + + /* +* In order to determine completion of our CMD_SYNC, we must +* ensure that the queue can't wrap twice without us noticing. +* We achieve that by taking the cmdq lock as shared before +* marking our slot as valid. +*/ + arm_smmu_cmdq_shared_lock(cmdq); /* 3. Mark our slots as valid, ensuring commands are visible first */ dma_wmb(); @@ -1468,26 +1471,22 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, atomic_set_release(&cmdq->owner_prod, prod); } - /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */ - if (sync) { - llq.prod = queue_inc_prod_n(&llq, n); - ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq); - if (ret) { - dev_err_ratelimited(smmu->dev, - "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n", - llq.prod, - readl_relaxed(cmdq->q.prod_reg), - readl_relaxed(cmdq->q.cons_reg)); - } + /* 5. Since we always insert a CMD_SYNC, we must wait for it to complete */ + llq.prod = queue_inc_prod_n(&llq, n); + ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq); + if (ret) { + dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n", + llq.prod, readl_relaxed(cmdq->q.prod_reg), + readl_relaxed(cmdq->q.cons_reg)); + } - /* -* Try to unlock the cmdq lock. This will fail if we're the last -* reader, in which case we can safely update cmdq->q.llq.cons -*/ - if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) { - WRITE_ONCE(cmdq->q.llq.cons, llq.cons); - arm_smmu_cmdq_shared_unlock(cmdq); - } + /* +* Try to unlock the cmdq lock. This will fail if we're the last reader, +* in which case we can safely update cmdq->q.llq.cons +*/ + if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) { + WRITE_ONCE(cmdq->q.llq.cons, llq.cons); + arm_smmu_cmdq_shared_unlock(cmdq); }
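The claim in the commit message above — that waiting on your own CMD_SYNC bounds each CPU's outstanding commands — can be sanity-checked with a toy model. This is only illustrative arithmetic (the queue size and CPU count are made-up example values, not real hardware figures), using the per-CPU batch sizing rule that patch 2/4 of this series applies:

```python
# Each CPU appends at most max_cmd_per_batch commands plus one CMD_SYNC,
# then waits for that CMD_SYNC before issuing again. So the worst-case
# number of queue slots outstanding at once is cpus * (max_cmd_per_batch + 1).
def worst_case_outstanding(cpus, max_cmd_per_batch):
    return cpus * (max_cmd_per_batch + 1)

# With the sizing rule max_cmd_per_batch <= (entries - cpus) / cpus from
# patch 2/4, the worst case never exceeds the queue size, so the prod
# pointer can never lap the cons pointer twice.
entries = 1 << 15                               # example queue size
cpus = 8192                                     # example CPU count
max_cmd_per_batch = (entries - cpus) // cpus    # == 3 here
assert worst_case_outstanding(cpus, max_cmd_per_batch) <= entries
```

The algebra is simple: cpus * (b + 1) = cpus * b + cpus <= (entries - cpus) + cpus = entries.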
[PATCH RFC v2 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
It has been shown that the cmpxchg() for finding space in the cmdq can be a real bottleneck:
- for more CPUs contending the cmdq, the cmpxchg() will fail more often
- since the software-maintained cons pointer is updated on the same 64b memory region, the chance of cmpxchg() failure increases again
The cmpxchg() is removed as part of 2 related changes:
- Update prod and cmdq owner in a single atomic add operation. For this, we count the prod and owner in separate regions of the prod memory. As with simple binary counting, once the prod+wrap fields overflow, they will zero. They should never overflow into the "owner" region, and we zero the non-owner, prod region for each owner. This maintains the prod pointer. As for the "owner", we now count this value, instead of setting a flag. Similar to before, once the owner has finished gathering, it will clear a mask. As such, a CPU declares itself as the "owner" when it reads zero for this region. This zeroing will also clear possible overflow in the wrap+prod region, above. The owner is now responsible for cmdq locking to avoid possible deadlock. The owner will lock the cmdq for all non-owners it has gathered when they have space in the queue and have written their entries.
- Check for space in the cmdq after the prod pointer has been assigned. We don't bother checking for space in the cmdq before assigning the prod pointer, as this would be racy. So since the prod pointer is updated unconditionally, it would be common for no space to be available in the cmdq when prod is assigned - that is, according to the software-maintained prod and cons pointers. So now it must be ensured that the entries are not written until there is space. How the prod pointer is maintained also leads to a strange condition where the prod pointer can wrap past the cons pointer. We can detect this condition, and report no space here. However, a prod pointer progressed twice past the cons pointer cannot be detected. 
But it can be ensured that this scenario does not occur, as we limit the amount of commands any CPU can issue at any given time, such that we cannot progress the prod pointer further. Signed-off-by: John Garry --- drivers/iommu/arm-smmu-v3.c | 102 +--- 1 file changed, 61 insertions(+), 41 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 36648163364c..fd42a98f501f 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -772,10 +772,19 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n) prod = Q_IDX(q, q->prod); cons = Q_IDX(q, q->cons); - if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) + if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) { + /* check if we have wrapped twice, meaning definitely no space */ + if (cons > prod) + return false; + space = (1 << q->max_n_shift) - (prod - cons); - else + } else { + /* similar check to above */ + if (prod > cons) + return false; + space = cons - prod; + } return space >= n; } @@ -1073,7 +1082,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu) * fails if the caller appears to be the last lock holder (yes, this is * racy). All successful UNLOCK routines have RELEASE semantics. */ -static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq) +static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq, int count) { int val; @@ -1083,12 +1092,12 @@ static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq) * to INT_MIN so these increments won't hurt as the value will remain * negative. 
*/ - if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0) + if (atomic_fetch_add_relaxed(count, &cmdq->lock) >= 0) return; do { val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0); - } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val); + } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + count) != val); } static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq) @@ -1374,8 +1383,10 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds, * insert their own list of commands then all of the commands from one * CPU will appear before any of the commands from the other CPU. * - * - A CMD_SYNC is always inserted, ensuring that any CPU does not issue - * more than the permitted amount commands at once. + * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer + * for when the cmdq is full, such that we don't wrap more than twice. + * It also m
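The combined prod/owner update described in this patch can be modelled outside the driver. The sketch below only illustrates the counting scheme — the field width (PROD_SHIFT) is invented for the example, and a plain integer stands in for the atomic word:

```python
# One 32-bit word holds the owner count in the high bits and prod+wrap
# in the low bits; a single fetch_add updates both at once.
PROD_SHIFT = 16                    # illustrative split, not the driver's
PROD_MASK = (1 << PROD_SHIFT) - 1

state = 0                          # stands in for the atomic word

def reserve(n):
    """Model of atomic_fetch_add((1 << PROD_SHIFT) + n): bump the owner
    count and advance prod by n in one operation; return the old value."""
    global state
    old = state
    state = (state + (1 << PROD_SHIFT) + n) & 0xFFFFFFFF
    return old

def owner_release():
    """The owner clears the owner-count field once gathering is done
    (which also clears any overflow out of the prod+wrap field)."""
    global state
    state &= PROD_MASK

old = reserve(4)
assert old >> PROD_SHIFT == 0      # read 0 in owner field => we own the gather
old = reserve(2)
assert old >> PROD_SHIFT == 1      # someone already owns this gather
assert old & PROD_MASK == 4        # and we see their 4 reserved slots
owner_release()
assert reserve(1) >> PROD_SHIFT == 0   # field zeroed => new owner elected
```

The key property is that no cmpxchg() retry loop is needed on the fast path: every CPU's fetch_add succeeds, and ownership falls out of the value read back.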
[PATCH RFC v2 2/4] iommu/arm-smmu-v3: Calculate bits for prod and owner
Since the arm_smmu_ll_queue.prod will be for counting the "owner" value and also HW prod pointer, calculate how many bits are available for and used by each. This is based on the number of possible CPUs in the system. And we require that each CPU can issue a minimum of 2 commands per batch - 1 x CMD_SYNC and at least 1 x other. Ignoring limits of HW max_n_shift and HW cmdq memory allocation, approx 16K is the max supported CPUs. For this, max_n_shift would be 14. Signed-off-by: John Garry --- drivers/iommu/arm-smmu-v3.c | 41 - 1 file changed, 40 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index a8e814c652fe..c3562dc35d45 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -532,6 +532,8 @@ struct arm_smmu_ll_queue { u8 __pad[SMP_CACHE_BYTES]; } cacheline_aligned_in_smp; u32 max_n_shift; + u32 max_cmd_per_batch; + u32 owner_count_shift; }; struct arm_smmu_queue { @@ -1515,7 +1517,10 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu, struct arm_smmu_cmdq_batch *cmds, struct arm_smmu_cmdq_ent *cmd) { - if (cmds->num == CMDQ_BATCH_ENTRIES) { + struct arm_smmu_cmdq *q = &smmu->cmdq; + struct arm_smmu_ll_queue *llq = &q->q.llq; + + if (cmds->num == llq->max_cmd_per_batch) { arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false); cmds->num = 0; } @@ -3141,8 +3146,26 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu, unsigned long cons_off, size_t dwords, const char *name) { + int cpus = num_possible_cpus(); size_t qsz; + /* +* We can get the number of bits required for owner counting by +* log2(nr possible cpus) + 1 +*/ + int bits_for_cmdq_owner = ilog2(cpus) + 1; + /* +* 1-bit for overflow, 1-bit for wrap, 1-bit extra to ensure prod+wrap +* does not overflow into CPU count. +*/ + int bits_available_for_prod = 32 - 3 - bits_for_cmdq_owner; + int entries_for_prod; + + if (bits_available_for_prod < 1) /* How many CPUs??? 
*/ + return -ENOMEM; + + q->llq.max_n_shift = min_t(int, q->llq.max_n_shift, + bits_available_for_prod); do { qsz = ((1 << q->llq.max_n_shift) * dwords) << 3; q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma, @@ -3152,6 +3175,22 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu, q->llq.max_n_shift--; } while (1); + entries_for_prod = 1 << q->llq.max_n_shift; + + /* +* We need at least 2 commands in a batch (1 x CMD_SYNC and 1 x +* whatever else). +*/ + if (entries_for_prod < 2 * cpus) + return -ENOMEM; + + /* +* When finding max_cmd_per_batch, deduct 1 entry per batch to take +* account CMD_SYNC +*/ + q->llq.max_cmd_per_batch = min_t(u32, (entries_for_prod - cpus) / cpus, +CMDQ_BATCH_ENTRIES); + q->llq.owner_count_shift = q->llq.max_n_shift + 2; if (!q->base) { dev_err(smmu->dev, -- 2.26.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
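The bit-budget arithmetic in this patch can be sketched in isolation. This is a rough model, not driver code — the CMDQ_BATCH_ENTRIES value here is an assumption for illustration:

```python
CMDQ_BATCH_ENTRIES = 64   # assumed value, for illustration only

def cmdq_budget(cpus, hw_max_n_shift):
    # ilog2(cpus) + 1 == cpus.bit_length() for positive integers
    bits_for_cmdq_owner = cpus.bit_length()
    # 1 bit for overflow, 1 bit for wrap, 1 guard bit, as in the patch
    bits_available_for_prod = 32 - 3 - bits_for_cmdq_owner
    if bits_available_for_prod < 1:
        raise ValueError("too many CPUs")
    max_n_shift = min(hw_max_n_shift, bits_available_for_prod)
    entries = 1 << max_n_shift
    # each CPU needs room for 1 x CMD_SYNC and at least 1 x other command
    if entries < 2 * cpus:
        raise ValueError("queue too small for CPU count")
    # deduct one CMD_SYNC slot per CPU before dividing up the queue
    max_cmd_per_batch = min((entries - cpus) // cpus, CMDQ_BATCH_ENTRIES)
    return max_n_shift, max_cmd_per_batch

assert cmdq_budget(96, 20) == (20, 64)    # plenty of queue: batch is capped
assert cmdq_budget(8192, 20) == (15, 3)   # many CPUs: tiny batches
```

It also shows why batch sizes shrink quickly at high CPU counts: the owner-count field eats into the prod field, capping the queue size, while every CPU still needs its reserved CMD_SYNC slot.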
Re: [PATCH 00/13] Rework IOMMU API to allow for batching of invalidation
On 15/08/2019 14:55, Will Deacon wrote: On Thu, Aug 15, 2019 at 12:19:58PM +0100, John Garry wrote: On 14/08/2019 18:56, Will Deacon wrote: If you'd like to play with the patches, then I've also pushed them here: https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/unmap but they should behave as a no-op on their own. As anticipated, my storage testing scenarios roughly give parity throughput and CPU loading before and after this series. Patches to convert the Arm SMMUv3 driver to the new API are here: https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq I quickly tested this again and now I see a performance lift:

                   before (5.3-rc1)   after
D05 8x SAS disks   907K IOPS          970K IOPS
D05 1x NVMe        450K IOPS          466K IOPS
D06 1x NVMe        467K IOPS          466K IOPS

The CPU loading seems to track throughput, so nothing much to say there. Note: From 5.2 testing, I was seeing >900K IOPS from that NVMe disk for !IOMMU. Cheers, John. For interest, how do things look if you pass iommu.strict=0? That might give some indication about how much the invalidation is still hurting us. So I tested for iommu/cmdq for NVMe only, and I see:

           !SMMU       5.3-rc4 strict/!strict   cmdq strict/!strict
D05 NVMe   750K IOPS   456K/540K IOPS           466K/537K
D06 NVMe   750K IOPS   456K/740K IOPS           466K/745K

I don't know why the D06 iommu.strict performance is ~ the same as D05, while !strict is so much better. The D06 SMMU implementation is supposed to be generally much better than that of D05, so I would have thought that the strict performance would be better (than that of D05). BTW, what were your thoughts on changing arm_smmu_atc_inv_domain()->arm_smmu_atc_inv_master() to batching? It seems suitable, but looks untouched. Were you waiting for a resolution to the performance issue which Leizhen reported? In principle, I'm supportive of such a change, but I'm not currently able to test any ATS stuff so somebody else would need to write the patch. 
Jean-Philippe is on holiday at the moment, but I'd be happy to review something from you if you send it out. Unfortunately I don't have anything ATS-enabled either. Not many do, it seems. Cheers, John Will . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 4/4] iommu: Document usage of "/sys/kernel/iommu_groups//type" file
On 21/08/2019 03:42, Sai Praneeth Prakhya wrote: The default domain type of an iommu group can be changed using file "/sys/kernel/iommu_groups//type". Hence, document it's usage and more importantly spell out it's limitations and an example. Cc: Christoph Hellwig Cc: Joerg Roedel Cc: Ashok Raj Cc: Will Deacon Cc: Lu Baolu Cc: Sohil Mehta Cc: Robin Murphy Cc: Jacob Pan Signed-off-by: Sai Praneeth Prakhya --- .../ABI/testing/sysfs-kernel-iommu_groups | 35 +++ 1 file changed, 35 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups index 017f5bc3920c..0a70b3a66ff3 100644 --- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups +++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups @@ -33,3 +33,38 @@ Description:In case an RMRR is used only by graphics or USB devices it is now exposed as "direct-relaxable" instead of "direct". In device assignment use case, for instance, those RMRR are considered to be relaxable and safe. + +What: /sys/kernel/iommu_groups//type +Date: August 2019 +KernelVersion: v5.4 +Contact: Sai Praneeth Prakhya +Description: /sys/kernel/iommu_groups//type lets the user know the + type of default domain in use by iommu for this group. A + privileged user could request kernel to change the group type by + writing to this file. Presently, only three types are supported It's unclear whether the following is a list of all values the user could both read and write (which it isn't). + 1. dma: All the DMA transactions from the devices in this group + are translated by the iommu. Why "dma", and not "DMA" (which is what we would read for DMA type)? + 2. identity: All the DMA transactions from the devices in this +group are *not* translated by the iommu. + 3. auto: Kernel decides one among dma/identity Isn't this the same as reset value, which we could read and remember? (And the kernel could reject when we attempt to change to a disallowed type). 
+ Note: + - + A group type could be modified only when *all* the devices in + the group are not binded to any device driver. So, the user has + to first unbind the appropriate drivers and then change the + default domain type. + Caution: + + Unbinding a device driver will take away the drivers control + over the device and if done on devices that host root file + system could lead to catastrophic effects (the user might + need to reboot the machine to get it to normal state). So, it's + expected that the user understands what he is doing. I think that this goes without saying. + Example: + + # Unbind USB device driver + 1. echo ":00:14.0" > /sys/bus/pci/drivers/xhci_hcd/unbind + # Put the USB device in identity mode (a.k.a pass through) + 2. echo "identity" > /sys/kernel/iommu_groups//type + # Re-bind the driver + 3. echo ":00:14.0" > /sys/bus/pci/drivers/xhci_hcd/bind Thanks, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
arm64 iommu groups issue
Hi all, We have noticed some special behaviour on our arm64 D05 board when the SMMU is enabled, with regard to PCI device iommu groups. This platform does not support ACS, yet we find that not all functions for a PCI device are grouped together:

root@ubuntu:/sys# dmesg | grep "Adding to iommu group"
[7.307539] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
[ 12.590533] hns_dsaf HISI00B2:00: Adding to iommu group 1
[ 13.688527] mlx5_core 000a:11:00.0: Adding to iommu group 2
[ 14.324606] mlx5_core 000a:11:00.1: Adding to iommu group 3
[ 14.937090] ehci-platform PNP0D20:00: Adding to iommu group 4
[ 15.276637] pcieport 0002:f8:00.0: Adding to iommu group 5
[ 15.340845] pcieport 0004:88:00.0: Adding to iommu group 6
[ 15.392098] pcieport 0005:78:00.0: Adding to iommu group 7
[ 15.443356] pcieport 000a:10:00.0: Adding to iommu group 8
[ 15.484975] pcieport 000c:20:00.0: Adding to iommu group 9
[ 15.543647] pcieport 000d:30:00.0: Adding to iommu group 10
[ 15.599771] serial 0002:f9:00.0: Adding to iommu group 5
[ 15.690807] serial 0002:f9:00.1: Adding to iommu group 5
[ 84.322097] mlx5_core 000a:11:00.2: Adding to iommu group 8
[ 84.856408] mlx5_core 000a:11:00.3: Adding to iommu group 8

root@ubuntu:/sys# lspci -tv
lspci -tvv
-+-[000d:30]---00.0-[31]--
 +-[000c:20]---00.0-[21]00.0 Huawei Technologies Co., Ltd.
 +-[000a:10]---00.0-[11-12]--+-00.0 Mellanox [ConnectX-5]
 |                           +-00.1 Mellanox [ConnectX-5]
 |                           +-00.2 Mellanox [ConnectX-5 VF]
 |                           \-00.3 Mellanox [ConnectX-5 VF]
 +-[0007:90]---00.0-[91]00.0 Huawei Technologies Co., ...
 +-[0006:c0]---00.0-[c1]--
 +-[0005:78]---00.0-[79]--
 +-[0004:88]---00.0-[89]--
 +-[0002:f8]---00.0-[f9]--+-00.0 MosChip Semiconductor Technology ...
 |                        +-00.1 MosChip Semiconductor Technology ...
 |                        \-00.2 MosChip Semiconductor Technology ...
 \-[:00]-

For the PCI devices in question - on port 000a:10:00.0 - you will notice that the port and VFs (000a:11:00.2, 3) are in one group, yet the 2 PFs (000a:11:00.0, 000a:11:00.1) are in separate groups. 
I also notice the same ordering behaviour on our D06 platform - the pcieport is added to an iommu group after the PF for that port. However this platform supports ACS, so it is not such a problem. After some checking, I find that when the pcieport driver probes, the associated SMMU device has not yet registered with the IOMMU framework, so we defer the probe for this device - in iort.c:iort_iommu_xlate(), when no iommu ops are available, we defer. Yet, when the mlx5 PF devices probe, the iommu ops are available at this stage. So the probe continues and we get an iommu group for the device - but not the same group as the parent port, as it has not yet been added to a group. When the port eventually probes it gets a new, separate group. This all seems to be because the built-in module init ordering is as follows: pcieport drv, smmu drv, mlx5 drv. I notice that if I build the mlx5 drv as a ko and insert it after boot, all functions + pcieport are in the same group:

[ 11.530046] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
[ 17.301093] hns_dsaf HISI00B2:00: Adding to iommu group 1
[ 18.743600] ehci-platform PNP0D20:00: Adding to iommu group 2
[ 20.212284] pcieport 0002:f8:00.0: Adding to iommu group 3
[ 20.356303] pcieport 0004:88:00.0: Adding to iommu group 4
[ 20.493337] pcieport 0005:78:00.0: Adding to iommu group 5
[ 20.702999] pcieport 000a:10:00.0: Adding to iommu group 6
[ 20.859183] pcieport 000c:20:00.0: Adding to iommu group 7
[ 20.996140] pcieport 000d:30:00.0: Adding to iommu group 8
[ 21.152637] serial 0002:f9:00.0: Adding to iommu group 3
[ 21.346991] serial 0002:f9:00.1: Adding to iommu group 3
[ 100.754306] mlx5_core 000a:11:00.0: Adding to iommu group 6
[ 101.420156] mlx5_core 000a:11:00.1: Adding to iommu group 6
[ 292.481714] mlx5_core 000a:11:00.2: Adding to iommu group 6
[ 293.281061] mlx5_core 000a:11:00.3: Adding to iommu group 6

This does seem like a problem for arm64 platforms which don't support ACS, yet enable an SMMU. 
Maybe also a problem even if they do support ACS. Opinion? Thanks, John
Re: arm64 iommu groups issue
On 19/09/2019 14:25, Robin Murphy wrote: When the port eventually probes it gets a new, separate group. This all seems to be as the built-in module init ordering is as follows: pcieport drv, smmu drv, mlx5 drv I notice that if I build the mlx5 drv as a ko and insert after boot, all functions + pcieport are in the same group: [ 11.530046] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0 [ 17.301093] hns_dsaf HISI00B2:00: Adding to iommu group 1 [ 18.743600] ehci-platform PNP0D20:00: Adding to iommu group 2 [ 20.212284] pcieport 0002:f8:00.0: Adding to iommu group 3 [ 20.356303] pcieport 0004:88:00.0: Adding to iommu group 4 [ 20.493337] pcieport 0005:78:00.0: Adding to iommu group 5 [ 20.702999] pcieport 000a:10:00.0: Adding to iommu group 6 [ 20.859183] pcieport 000c:20:00.0: Adding to iommu group 7 [ 20.996140] pcieport 000d:30:00.0: Adding to iommu group 8 [ 21.152637] serial 0002:f9:00.0: Adding to iommu group 3 [ 21.346991] serial 0002:f9:00.1: Adding to iommu group 3 [ 100.754306] mlx5_core 000a:11:00.0: Adding to iommu group 6 [ 101.420156] mlx5_core 000a:11:00.1: Adding to iommu group 6 [ 292.481714] mlx5_core 000a:11:00.2: Adding to iommu group 6 [ 293.281061] mlx5_core 000a:11:00.3: Adding to iommu group 6 This does seem like a problem for arm64 platforms which don't support ACS, yet enable an SMMU. Maybe also a problem even if they do support ACS. Opinion? Hi Robin, Yeah, this is less than ideal. For sure. Our production D05 boards don't ship with the SMMU enabled in BIOS, but it would be slightly concerning in this regard if they did. > One way to bodge it might be to make pci_device_group() also walk downwards to see if any non-ACS-isolated children already have a group, rather than assuming that groups get allocated in hierarchical order, but that's far from ideal. Agree. My own workaround was to hack the mentioned iort code to defer the PF probe if the parent port had also yet to probe. 
The underlying issue is that, for historical reasons, OF/IORT-based IOMMU drivers have ended up with group allocation being tied to endpoint driver probing via the dma_configure() mechanism (long story short, driver probe is the only thing which can be delayed in order to wait for a specific IOMMU instance to be ready). However, in the meantime, the IOMMU API internals have evolved sufficiently that I think there's a way to really put things right - I have the spark of an idea which I'll try to sketch out ASAP... OK, great. Thanks, John Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH 0/6] SMMUv3 PMCG IMP DEF event support
This patchset adds IMP DEF event support for the SMMUv3 PMCG. It is marked as an RFC as the method to identify the PMCG implementation may be quite disliked. And, in general, the series is somewhat incomplete. So the background is that the PMCG supports IMP DEF events, yet we have no method to identify the PMCG to know the IMP DEF events. A method for identifying the PMCG implementation could be using PMDEVARCH, but we cannot rely on this being set properly, as whether this is implemented is not defined in the SMMUv3 spec. Another method would be perf event aliasing, but this method of event matching is based on CPU id, which would not guarantee the same uniqueness as the PMCG implementation. Yet another method could be to continue using the ACPI OEM ID in the IORT code, but this does not scale. And it is not suitable if we ever add DT support to the PMCG driver. The method used in this series is based on matching on the parent SMMUv3 IIDR. We store the IIDR contents in the arm smmu structure as the first element, which means that we don't have to expose SMMU APIs - this is the part which may be disliked. The final two patches switch the pre-existing PMCG model identification from the ACPI OEM ID to the same parent SMMUv3 IIDR matching. For now, we only consider SMMUv3 nodes as the associated node for a PMCG.

John Garry (6):
ACPI/IORT: Set PMCG device parent
iommu/arm-smmu-v3: Record IIDR in arm_smmu_device structure
perf/smmuv3: Retrieve parent SMMUv3 IIDR
perf/smmuv3: Support HiSilicon hip08 (hi1620) IMP DEF events
perf/smmuv3: Match implementation options based on parent SMMU IIDR
ACPI/IORT: Drop code to set the PMCG software-defined model

drivers/acpi/arm64/iort.c | 69 ++-- drivers/iommu/arm-smmu-v3.c | 5 +++ drivers/perf/arm_smmuv3_pmu.c | 84 ++- include/linux/acpi_iort.h | 8 4 files changed, 112 insertions(+), 54 deletions(-) -- 2.17.1
[RFC PATCH 3/6] perf/smmuv3: Retrieve parent SMMUv3 IIDR
To support IMP DEF events per PMCG, retrieve the parent SMMUv3 IIDR. This will be used as a lookup for the IMP DEF events supported, under the assumption that a PMCG implementation has the same uniqueness as the parent SMMUv3. In this, we assume that any PMCG associated with the same SMMUv3 will have the same IMP DEF events - otherwise, some other secondary matching would need to be done. Signed-off-by: John Garry --- drivers/perf/arm_smmuv3_pmu.c | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c index da71c741cb46..f702898c695d 100644 --- a/drivers/perf/arm_smmuv3_pmu.c +++ b/drivers/perf/arm_smmuv3_pmu.c @@ -115,6 +115,7 @@ struct smmu_pmu { bool global_filter; u32 global_filter_span; u32 global_filter_sid; + u32 parent_iidr; }; #define to_smmu_pmu(p) (container_of(p, struct smmu_pmu, pmu)) @@ -551,6 +552,11 @@ static const struct attribute_group *smmu_pmu_attr_grps[] = { NULL }; +static const struct attribute_group **smmu_pmu_lookup_attr_groups(u32 parent_smmu_iidr) +{ + return smmu_pmu_attr_grps; +} + /* * Generic device handlers */ @@ -706,11 +712,21 @@ static int smmu_pmu_probe(struct platform_device *pdev) int irq, err; char *name; struct device *dev = &pdev->dev; + struct device *parent = dev->parent; smmu_pmu = devm_kzalloc(dev, sizeof(*smmu_pmu), GFP_KERNEL); if (!smmu_pmu) return -ENOMEM; + if (parent) { + void *parent_drvdata; + + parent_drvdata = platform_get_drvdata(to_platform_device(parent)); + if (!parent_drvdata) + return -EPROBE_DEFER; + smmu_pmu->parent_iidr = *(u32 *)parent_drvdata; + } + smmu_pmu->dev = dev; platform_set_drvdata(pdev, smmu_pmu); @@ -724,7 +740,7 @@ static int smmu_pmu_probe(struct platform_device *pdev) .start = smmu_pmu_event_start, .stop = smmu_pmu_event_stop, .read = smmu_pmu_event_read, - .attr_groups= smmu_pmu_attr_grps, + .attr_groups= smmu_pmu_lookup_attr_groups(smmu_pmu->parent_iidr), .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; -- 2.17.1 
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH 5/6] perf/smmuv3: Match implementation options based on parent SMMU IIDR
Currently we match the implementation options based on the ACPI PLATFORM OEM ID. Since we can now match based on the parent SMMUv3 IIDR, switch to this method. Signed-off-by: John Garry --- drivers/perf/arm_smmuv3_pmu.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c index 11f28ba5fae0..33d1379ae525 100644 --- a/drivers/perf/arm_smmuv3_pmu.c +++ b/drivers/perf/arm_smmuv3_pmu.c @@ -739,14 +739,10 @@ static void smmu_pmu_reset(struct smmu_pmu *smmu_pmu) smmu_pmu->reloc_base + SMMU_PMCG_OVSCLR0); } -static void smmu_pmu_get_acpi_options(struct smmu_pmu *smmu_pmu) +static void smmu_pmu_get_implementation_options(struct smmu_pmu *smmu_pmu) { - u32 model; - - model = *(u32 *)dev_get_platdata(smmu_pmu->dev); - - switch (model) { - case IORT_SMMU_V3_PMCG_HISI_HIP08: + switch (smmu_pmu->parent_iidr) { + case PARENT_SMMU_IIDR_HISI_HIP08: /* HiSilicon Erratum 162001800 */ smmu_pmu->options |= SMMU_PMCG_EVCNTR_RDONLY; break; @@ -844,7 +840,7 @@ static int smmu_pmu_probe(struct platform_device *pdev) return -EINVAL; } - smmu_pmu_get_acpi_options(smmu_pmu); + smmu_pmu_get_implementation_options(smmu_pmu); /* Pick one CPU to be the preferred one to use */ smmu_pmu->on_cpu = raw_smp_processor_id(); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH 1/6] ACPI/IORT: Set PMCG device parent
In the IORT, a PMCG node includes a node reference to its associated device. Set the PMCG platform device parent device for future referencing. For now, we only consider setting for when the associated component is an SMMUv3. Signed-off-by: John Garry --- drivers/acpi/arm64/iort.c | 34 -- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c index 8569b79e8b58..0b687520c3e7 100644 --- a/drivers/acpi/arm64/iort.c +++ b/drivers/acpi/arm64/iort.c @@ -1455,7 +1455,7 @@ static __init const struct iort_dev_config *iort_get_dev_cfg( * Returns: 0 on success, <0 failure */ static int __init iort_add_platform_device(struct acpi_iort_node *node, - const struct iort_dev_config *ops) + const struct iort_dev_config *ops, struct device *parent) { struct fwnode_handle *fwnode; struct platform_device *pdev; @@ -1466,6 +1466,8 @@ static int __init iort_add_platform_device(struct acpi_iort_node *node, if (!pdev) return -ENOMEM; + pdev->dev.parent = parent; + if (ops->dev_set_proximity) { ret = ops->dev_set_proximity(&pdev->dev, node); if (ret) @@ -1573,6 +1575,11 @@ static void __init iort_enable_acs(struct acpi_iort_node *iort_node) static inline void iort_enable_acs(struct acpi_iort_node *iort_node) { } #endif +static int iort_fwnode_match(struct device *dev, const void *fwnode) +{ + return dev->fwnode == fwnode; +} + static void __init iort_init_platform_devices(void) { struct acpi_iort_node *iort_node, *iort_end; @@ -1594,11 +1601,34 @@ static void __init iort_init_platform_devices(void) iort_table->length); for (i = 0; i < iort->node_count; i++) { + struct device *parent = NULL; + if (iort_node >= iort_end) { pr_err("iort node pointer overflows, bad table\n"); return; } + /* Fixme: handle parent declared in IORT after PMCG */ + if (iort_node->type == ACPI_IORT_NODE_PMCG) { + struct acpi_iort_node *iort_assoc_node; + struct acpi_iort_pmcg *pmcg; + u32 node_reference; + + pmcg = (struct acpi_iort_pmcg 
*)iort_node->node_data; + + node_reference = pmcg->node_reference; + iort_assoc_node = ACPI_ADD_PTR(struct acpi_iort_node, iort, +node_reference); + + if (iort_assoc_node->type == ACPI_IORT_NODE_SMMU_V3) { + struct fwnode_handle *assoc_fwnode; + + assoc_fwnode = iort_get_fwnode(iort_assoc_node); + + parent = bus_find_device(&platform_bus_type, NULL, + assoc_fwnode, iort_fwnode_match); + } + } iort_enable_acs(iort_node); ops = iort_get_dev_cfg(iort_node); @@ -1609,7 +1639,7 @@ static void __init iort_init_platform_devices(void) iort_set_fwnode(iort_node, fwnode); - ret = iort_add_platform_device(iort_node, ops); + ret = iort_add_platform_device(iort_node, ops, parent); if (ret) { iort_delete_fwnode(iort_node); acpi_free_fwnode_static(fwnode); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH 2/6] iommu/arm-smmu-v3: Record IIDR in arm_smmu_device structure
To allow other devices to know the SMMU HW IIDR, record the IIDR contents as the first member of the arm_smmu_device structure. By storing it as the first member, we avoid exposing SMMU APIs, which are nicely self-contained today. Signed-off-by: John Garry --- drivers/iommu/arm-smmu-v3.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 40f4757096c3..1ed3ef0f1ec3 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -70,6 +70,8 @@ #define IDR1_SSIDSIZE GENMASK(10, 6) #define IDR1_SIDSIZE GENMASK(5, 0) +#define ARM_SMMU_IIDR 0x18 + #define ARM_SMMU_IDR5 0x14 #define IDR5_STALL_MAX GENMASK(31, 16) #define IDR5_GRAN64K (1 << 6) @@ -546,6 +548,7 @@ struct arm_smmu_strtab_cfg { /* An SMMUv3 instance */ struct arm_smmu_device { + u32 iidr; /* must be first member */ struct device *dev; void __iomem*base; @@ -3153,6 +3156,8 @@ static int arm_smmu_device_probe(struct platform_device *pdev) iommu_device_set_ops(&smmu->iommu, &arm_smmu_ops); iommu_device_set_fwnode(&smmu->iommu, dev->fwnode); + smmu->iidr = readl(smmu->base + ARM_SMMU_IIDR); + ret = iommu_device_register(&smmu->iommu); if (ret) { dev_err(dev, "Failed to register iommu\n"); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
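The layout trick this patch depends on — that a pointer to a struct is also a pointer to its first member, so the PMCG driver can read the IIDR through an opaque drvdata pointer without any SMMU API — can be illustrated in userspace with ctypes. This is purely a model (the second field name is invented); the IIDR value is the HiSilicon hip08 one used later in this series:

```python
import ctypes

class ArmSmmuDevice(ctypes.Structure):
    """Toy stand-in for struct arm_smmu_device."""
    _fields_ = [
        ("iidr", ctypes.c_uint32),      # must stay the first member
        ("other_state", ctypes.c_uint64),
    ]

smmu = ArmSmmuDevice(iidr=0x30736, other_state=0)

# What the PMCG driver does with the opaque parent drvdata pointer:
# smmu_pmu->parent_iidr = *(u32 *)parent_drvdata;
opaque = ctypes.cast(ctypes.pointer(smmu), ctypes.c_void_p)
iidr = ctypes.cast(opaque, ctypes.POINTER(ctypes.c_uint32)).contents.value
assert iidr == 0x30736
```

The fragility, of course, is the unstated contract: if anyone moves or reorders the first member, every out-of-tree reader of the opaque pointer silently breaks — which is exactly the part of the series the cover letter flags as potentially disliked.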
[RFC PATCH 6/6] ACPI/IORT: Drop code to set the PMCG software-defined model
Now that we can identify a PMCG implementation from the parent SMMUv3 IIDR, drop all the code to match based on the ACPI OEM ID.

Signed-off-by: John Garry
---
 drivers/acpi/arm64/iort.c | 35 +--
 include/linux/acpi_iort.h |  8
 2 files changed, 1 insertion(+), 42 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 0b687520c3e7..d04888cb8cff 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1377,27 +1377,6 @@ static void __init arm_smmu_v3_pmcg_init_resources(struct resource *res,
 		       ACPI_EDGE_SENSITIVE, &res[2]);
 }
 
-static struct acpi_platform_list pmcg_plat_info[] __initdata = {
-	/* HiSilicon Hip08 Platform */
-	{"HISI ", "HIP08 ", 0, ACPI_SIG_IORT, greater_than_or_equal,
-	 "Erratum #162001800", IORT_SMMU_V3_PMCG_HISI_HIP08},
-	{ }
-};
-
-static int __init arm_smmu_v3_pmcg_add_platdata(struct platform_device *pdev)
-{
-	u32 model;
-	int idx;
-
-	idx = acpi_match_platform_list(pmcg_plat_info);
-	if (idx >= 0)
-		model = pmcg_plat_info[idx].data;
-	else
-		model = IORT_SMMU_V3_PMCG_GENERIC;
-
-	return platform_device_add_data(pdev, &model, sizeof(model));
-}
-
 struct iort_dev_config {
 	const char *name;
 	int (*dev_init)(struct acpi_iort_node *node);
@@ -1408,7 +1387,6 @@ struct iort_dev_config {
 				     struct acpi_iort_node *node);
 	int (*dev_set_proximity)(struct device *dev,
 				 struct acpi_iort_node *node);
-	int (*dev_add_platdata)(struct platform_device *pdev);
 };
 
 static const struct iort_dev_config iort_arm_smmu_v3_cfg __initconst = {
@@ -1430,7 +1408,6 @@ static const struct iort_dev_config iort_arm_smmu_v3_pmcg_cfg __initconst = {
 	.name = "arm-smmu-v3-pmcg",
 	.dev_count_resources = arm_smmu_v3_pmcg_count_resources,
 	.dev_init_resources = arm_smmu_v3_pmcg_init_resources,
-	.dev_add_platdata = arm_smmu_v3_pmcg_add_platdata,
 };
 
 static __init const struct iort_dev_config *iort_get_dev_cfg(
@@ -1494,17 +1471,7 @@ static int __init iort_add_platform_device(struct acpi_iort_node *node,
 	if (ret)
 		goto dev_put;
 
-	/*
-	 * Platform devices based on PMCG nodes uses platform_data to
-	 * pass the hardware model info to the driver. For others, add
-	 * a copy of IORT node pointer to platform_data to be used to
-	 * retrieve IORT data information.
-	 */
-	if (ops->dev_add_platdata)
-		ret = ops->dev_add_platdata(pdev);
-	else
-		ret = platform_device_add_data(pdev, &node, sizeof(node));
-
+	ret = platform_device_add_data(pdev, &node, sizeof(node));
 	if (ret)
 		goto dev_put;
 
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index 8e7e2ec37f1b..7a8961e6a8bb 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -14,14 +14,6 @@
 #define IORT_IRQ_MASK(irq)		(irq & 0xULL)
 #define IORT_IRQ_TRIGGER_MASK(irq)	((irq >> 32) & 0xULL)
 
-/*
- * PMCG model identifiers for use in smmu pmu driver. Please note
- * that this is purely for the use of software and has nothing to
- * do with hardware or with IORT specification.
- */
-#define IORT_SMMU_V3_PMCG_GENERIC	0x	/* Generic SMMUv3 PMCG */
-#define IORT_SMMU_V3_PMCG_HISI_HIP08	0x0001	/* HiSilicon HIP08 PMCG */
-
 int iort_register_domain_token(int trans_id, phys_addr_t base,
			       struct fwnode_handle *fw_node);
 void iort_deregister_domain_token(int trans_id);
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH 4/6] perf/smmuv3: Support HiSilicon hip08 (hi1620) IMP DEF events
Now that we can identify a PMCG implementation through the parent SMMUv3 implementation, add support for IMP DEF events.

For each new implementation supported, we add an attr_grps structure and do lookup matching based on the parent SMMUv3 IIDR. This could maybe be optimised in future to reduce the structures required.

For now, only the l1_tlb event is added for the HiSilicon hip08 platform. This platform supports many more IMP DEF events, but I need something better than the electronically translated descriptions of those events before adding support for them.

Signed-off-by: John Garry
---
 drivers/perf/arm_smmuv3_pmu.c | 54 ++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c
index f702898c695d..11f28ba5fae0 100644
--- a/drivers/perf/arm_smmuv3_pmu.c
+++ b/drivers/perf/arm_smmuv3_pmu.c
@@ -132,6 +132,8 @@ SMMU_PMU_EVENT_ATTR_EXTRACTOR(filter_stream_id, config1, 0, 31);
 SMMU_PMU_EVENT_ATTR_EXTRACTOR(filter_span, config1, 32, 32);
 SMMU_PMU_EVENT_ATTR_EXTRACTOR(filter_enable, config1, 33, 33);
 
+#define PARENT_SMMU_IIDR_HISI_HIP08 (0x30736)
+
 static inline void smmu_pmu_enable(struct pmu *pmu)
 {
 	struct smmu_pmu *smmu_pmu = to_smmu_pmu(pmu);
@@ -505,6 +507,21 @@ static struct attribute *smmu_pmu_events[] = {
 	NULL
 };
 
+SMMU_EVENT_ATTR(l1_tlb, 0x8a);
+
+static struct attribute *smmu_pmu_events_hip08[] = {
+	&smmu_event_attr_cycles.attr.attr,
+	&smmu_event_attr_transaction.attr.attr,
+	&smmu_event_attr_tlb_miss.attr.attr,
+	&smmu_event_attr_config_cache_miss.attr.attr,
+	&smmu_event_attr_trans_table_walk_access.attr.attr,
+	&smmu_event_attr_config_struct_access.attr.attr,
+	&smmu_event_attr_pcie_ats_trans_rq.attr.attr,
+	&smmu_event_attr_pcie_ats_trans_passed.attr.attr,
+	&smmu_event_attr_l1_tlb.attr.attr,
+	NULL
+};
+
 static umode_t smmu_pmu_event_is_visible(struct kobject *kobj,
					 struct attribute *attr, int unused)
 {
@@ -514,7 +531,10 @@ static umode_t smmu_pmu_event_is_visible(struct kobject *kobj,
 	pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr.attr);
 
-	if (test_bit(pmu_attr->id, smmu_pmu->supported_events))
+	if (pmu_attr->id < SMMU_PMCG_ARCH_MAX_EVENTS &&
+	    test_bit(pmu_attr->id, smmu_pmu->supported_events))
+		return attr->mode;
+	else if (pmu_attr->id >= SMMU_PMCG_ARCH_MAX_EVENTS)
 		return attr->mode;
 
 	return 0;
@@ -526,6 +546,12 @@ static struct attribute_group smmu_pmu_events_group = {
 	.is_visible = smmu_pmu_event_is_visible,
 };
 
+static struct attribute_group smmu_pmu_events_group_hip08 = {
+	.name = "events",
+	.attrs = smmu_pmu_events_hip08,
+	.is_visible = smmu_pmu_event_is_visible,
+};
+
 /* Formats */
 PMU_FORMAT_ATTR(event, "config:0-15");
 PMU_FORMAT_ATTR(filter_stream_id, "config1:0-31");
@@ -552,8 +578,34 @@ static const struct attribute_group *smmu_pmu_attr_grps[] = {
 	NULL
 };
 
+static const struct attribute_group *smmu_pmu_attr_grps_hip08[] = {
+	&smmu_pmu_cpumask_group,
+	&smmu_pmu_events_group_hip08,
+	&smmu_pmu_format_group,
+	NULL
+};
+
+struct smmu_pmu_attr_grps_custom {
+	u32 parent_smmu_iidr;
+	const struct attribute_group **attr_grps;
+} smmu_pmu_attr_custom_grps[] = {
+	{
+		.parent_smmu_iidr = PARENT_SMMU_IIDR_HISI_HIP08,
+		.attr_grps = smmu_pmu_attr_grps_hip08,
+	},
+};
+
 static const struct attribute_group **smmu_pmu_lookup_attr_groups(u32 parent_smmu_iidr)
 {
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(smmu_pmu_attr_custom_grps); i++) {
+		struct smmu_pmu_attr_grps_custom *c = &smmu_pmu_attr_custom_grps[i];
+
+		if (c->parent_smmu_iidr == parent_smmu_iidr)
+			return c->attr_grps;
+	}
+
 	return smmu_pmu_attr_grps;
 }
-- 
2.17.1
Re: [RFC PATCH 6/6] ACPI/IORT: Drop code to set the PMCG software-defined model
On 15/10/2019 04:06, Hanjun Guo wrote:
>> -/*
>> - * PMCG model identifiers for use in smmu pmu driver. Please note
>> - * that this is purely for the use of software and has nothing to
>> - * do with hardware or with IORT specification.
>> - */
>> -#define IORT_SMMU_V3_PMCG_GENERIC	0x	/* Generic SMMUv3 PMCG */
>> -#define IORT_SMMU_V3_PMCG_HISI_HIP08	0x0001	/* HiSilicon HIP08 PMCG */
>
> Since only the HiSilicon platform has such an erratum, and I think this works with old versions of the firmware as well, I'm fine with removing this erratum framework.

Yeah, it seems a decent change on its own, even without the SMMU PMCG driver changes. But we still need to check on patch "[PATCH RFC 2/6] iommu/arm-smmu-v3: Record IIDR in arm_smmu_device structure" to progress any of this.

Will, Robin, any opinion on that patch?

https://lore.kernel.org/linux-iommu/1569854031-237636-1-git-send-email-john.ga...@huawei.com/T/#m1e24771a23ee5426ec94ca2c4ec572642c155a77

> Acked-by: Hanjun Guo

Cheers,
John

> Thanks
> Hanjun
Re: [RFC PATCH 1/6] ACPI/IORT: Set PMCG device parent
Hi Hanjun,

Thanks for checking this.

>>  */
>> static int __init iort_add_platform_device(struct acpi_iort_node *node,
>> -			const struct iort_dev_config *ops)
>> +			const struct iort_dev_config *ops, struct device *parent)
>
> Since you added an input for this function, could you please update the comments of this function as well?

Right, that can be updated. Indeed, the current comment omits the @ops argument too.

>> {
>> 	struct fwnode_handle *fwnode;
>> 	struct platform_device *pdev;
>> @@ -1466,6 +1466,8 @@ static int __init iort_add_platform_device(struct acpi_iort_node *node,
>> 	if (!pdev)
>> 		return -ENOMEM;
>>
>> +	pdev->dev.parent = parent;
>> +
>> 	if (ops->dev_set_proximity) {
>> 		ret = ops->dev_set_proximity(&pdev->dev, node);
>> 		if (ret)
>> @@ -1573,6 +1575,11 @@ static void __init iort_enable_acs(struct acpi_iort_node *iort_node)
>> static inline void iort_enable_acs(struct acpi_iort_node *iort_node) { }
>> #endif
>>
>> +static int iort_fwnode_match(struct device *dev, const void *fwnode)
>> +{
>> +	return dev->fwnode == fwnode;
>> +}
>> +
>> static void __init iort_init_platform_devices(void)
>> {
>> 	struct acpi_iort_node *iort_node, *iort_end;
>> @@ -1594,11 +1601,34 @@ static void __init iort_init_platform_devices(void)
>> 			iort_table->length);
>>
>> 	for (i = 0; i < iort->node_count; i++) {
>> +		struct device *parent = NULL;
>> +
>> 		if (iort_node >= iort_end) {
>> 			pr_err("iort node pointer overflows, bad table\n");
>> 			return;
>> 		}
>>
>> +		/* Fixme: handle parent declared in IORT after PMCG */
>> +		if (iort_node->type == ACPI_IORT_NODE_PMCG) {
>> +			struct acpi_iort_node *iort_assoc_node;
>> +			struct acpi_iort_pmcg *pmcg;
>> +			u32 node_reference;
>> +
>> +			pmcg = (struct acpi_iort_pmcg *)iort_node->node_data;
>> +
>> +			node_reference = pmcg->node_reference;
>> +			iort_assoc_node = ACPI_ADD_PTR(struct acpi_iort_node, iort,
>> +						       node_reference);
>> +
>> +			if (iort_assoc_node->type == ACPI_IORT_NODE_SMMU_V3) {
>> +				struct fwnode_handle *assoc_fwnode;
>> +
>> +				assoc_fwnode = iort_get_fwnode(iort_assoc_node);
>> +
>> +				parent = bus_find_device(&platform_bus_type, NULL,
>> +							 assoc_fwnode, iort_fwnode_match);
>> +			}
>> +		}
>
> How about using a function to hold the newly added code, to make this function (iort_init_platform_devices()) a bit cleaner?

Can do. But I still need to add code to deal with the scenario of the PMCG's parent not being an SMMU, which is supported in the spec as I recall. Note that I would not have FW to test that, so maybe we can omit that support for now as long as there's no regression.

>> 		iort_enable_acs(iort_node);
>>
>> 		ops = iort_get_dev_cfg(iort_node);
>> @@ -1609,7 +1639,7 @@ static void __init iort_init_platform_devices(void)
>>
>> 		iort_set_fwnode(iort_node, fwnode);
>>
>> -		ret = iort_add_platform_device(iort_node, ops);
>> +		ret = iort_add_platform_device(iort_node, ops, parent);
>
> This function is called only if ops is valid, so I think retrieving the parent can be done before this function.

Ah, yes.

Thanks,
John

> Thanks
> Hanjun
Re: [RFC PATCH 0/6] SMMUv3 PMCG IMP DEF event support
On 15/10/2019 19:00, Robin Murphy wrote:
> Hi John,
>
> On 30/09/2019 15:33, John Garry wrote:
>> This patchset adds IMP DEF event support for the SMMUv3 PMCG. It is marked as an RFC as the method to identify the PMCG implementation may be quite disliked. And, in general, the series is somewhat incomplete.
>>
>> So the background is that the PMCG supports IMP DEF events, yet we have no method to identify the PMCG to know the IMP DEF events.
>>
>> A method for identifying the PMCG implementation could be using PMDEVARCH, but we cannot rely on this being set properly, as whether this is implemented is not defined in the SMMUv3 spec. Another method would be perf event aliasing, but this method of event matching is based on CPU id, which would not guarantee the same uniqueness as the PMCG implementation. Yet another method could be to continue using the ACPI OEM ID in the IORT code, but this does not scale. And it is not suitable if we ever add DT support to the PMCG driver.
>>
>> The method used in this series is based on matching on the parent SMMUv3 IIDR. We store this IIDR contents in the arm smmu structure as the first element, which means that we don't have to expose SMMU APIs - this is the part which may be disliked. The final two patches switch the pre-existing PMCG model identification from ACPI OEM ID to the same parent SMMUv3 IIDR matching.
>>
>> For now, we only consider SMMUv3 nodes as being the associated node for a PMCG.

Hi Robin,

> Two significant concerns right off the bat:
>
> - It seems more common than not for silicon designers to fail to implement IIDR correctly, so it's only a matter of time before inevitably needing to bring back some firmware-level identifier abstraction (if not already - does Hi161x have PMCGs?)

Maybe there's a way that we can switch to this method, and leave the door open for an easy way to support a firmware-level identifier again, if ever needed. I'm not too pushed - this was secondary to just allowing the PMCG driver to know the associated SMMU model.

And, no, hi161x does not have any PMCGs.

> - This seems like a step in entirely the wrong direction for supporting .

So to support PMCGs that reference a Named Component or Root Complex, I thought that the IORT parsing code would have to do some secondary lookup to the associated SMMU, through the Named Component or Root Complex node. What was your idea here?

Note: I do acknowledge that an overall issue is that we assume all PMCG IMP DEF events are the same for a given SMMU model.

> Interpreting the Node Reference is definitely a welcome improvement over matching table headers, but absent a truly compelling argument to the contrary, I'd rather retain the "PMCG model" abstraction in between that and the driver itself (especially since those can trivially be hung off compatibles once it comes to DT support).

For DT, I would assume that compatible strings would allow us to identify the PMCG model.

On a related matter, is there still a need to deal with scenarios of the PMCG being located within the SMMU register map? As you may remember, we did have this issue but relocated the PMCG to outside the SMMU register map in a later chip rev.

Cheers,
John

> Thanks,
> Robin.

>> John Garry (6):
>>   ACPI/IORT: Set PMCG device parent
>>   iommu/arm-smmu-v3: Record IIDR in arm_smmu_device structure
>>   perf/smmuv3: Retrieve parent SMMUv3 IIDR
>>   perf/smmuv3: Support HiSilicon hip08 (hi1620) IMP DEF events
>>   perf/smmuv3: Match implementation options based on parent SMMU IIDR
>>   ACPI/IORT: Drop code to set the PMCG software-defined model
>>
>>  drivers/acpi/arm64/iort.c | 69 ++--
>>  drivers/iommu/arm-smmu-v3.c | 5 +++
>>  drivers/perf/arm_smmuv3_pmu.c | 84 ++-
>>  include/linux/acpi_iort.h | 8
>>  4 files changed, 112 insertions(+), 54 deletions(-)
Re: [RFC PATCH 0/6] SMMUv3 PMCG IMP DEF event support
Hi Robin,

>>> Two significant concerns right off the bat:
>>>
>>> - It seems more common than not for silicon designers to fail to implement IIDR correctly, so it's only a matter of time before inevitably needing to bring back some firmware-level identifier abstraction (if not already - does Hi161x have PMCGs?)
>>
>> Maybe there's a way that we can switch to this method, and leave the door open for an easy way to support a firmware-level identifier again, if ever needed. I'm not too pushed - this was secondary to just allowing the PMCG driver to know the associated SMMU model.
>
> But that's the part I'm not buying - there's no clear advantage to pushing that complexity down into the PMCG driver, vs. leaving the IORT code responsible for translating an SMMU model into a PMCG model, yet the aforementioned disadvantages jump out right away.

One advantage is that the next piece of quirky hw with a properly implemented IIDR does not require a new IORT model. And today, this handling is only for hi1620, and since we can use the hi1620 IIDR to identify it, it seems good to remove the code outside the PMCG driver that exists specifically to handle it. But if you think it's going to be needed again, then it makes sense not to remove it.

>> And, no, hi161x does not have any PMCGs.
>
> Hooray, I guess :)
>
>>> - This seems like a step in entirely the wrong direction for supporting .
>>
>> So to support PMCGs that reference a Named Component or Root Complex, I thought that the IORT parsing code would have to do some secondary lookup to the associated SMMU, through the Named Component or Root Complex node. What was your idea here?
>
> The associated SMMU has no relevance in that context - the reason for the Node Reference to point to a non-SMMU node is for devices that implement their own embedded TLB (e.g. AMBA DTI masters) and expose a standard PMCG interface to monitor it. It isn't reasonable to expect any old PCIe controller or on-chip-accelerator driver to expose a fake SMMU IIDR just to keep some other driver happy.

But won't there still be an SMMU associated with the AMBA DTI masters, in your example? It's this SMMU which the PMCG driver would reference as the "parent" device, and the IORT parsing would need to do the lookup for this reference. But then, this becomes something that the DT parsing would need to handle also.

>> Note: I do acknowledge that an overall issue is that we assume all PMCG IMP DEF events are the same for a given SMMU model.
>
> That assumption does technically fail already - I know MMU-600 has different IMP-DEF events for its TCU and TBUs, however as long as we can get as far as "this is some part of an MMU-600" the driver should be able to figure out the rest (annoyingly it looks like both PMCG types expose the same PMCG_ID_REGS information, but they should be distinguishable by PMCG_CEIDn).

JFYI, the PMCG_CEIDn contents for hi1620 are all zero, apart from PMDEVARCH and PMDEVTYPE, which are the same as the Arm implementation, according to the spec - sigh...

>>> Interpreting the Node Reference is definitely a welcome improvement over matching table headers, but absent a truly compelling argument to the contrary, I'd rather retain the "PMCG model" abstraction in between that and the driver itself (especially since those can trivially be hung off compatibles once it comes to DT support).
>>
>> For DT, I would assume that compatible strings would allow us to identify the PMCG model.
>
> Right, that was largely my point - DT probing can start with a PMCG model, so it's a lot more logical for ACPI probing to do the same, with the actual PMCG model determination hidden away in the ACPI code. That's the basis of the current design.
>
> I have been nagging the architects that PMCGs not having their own IIDR is an unwelcome hole in the spec, so hopefully this might get a bit easier some day.

For sure. The spec reads that the PMCGs may be "independently-designed", hence no general id method. I don't get this.

>> On a related matter, is there still a need to deal with scenarios of the PMCG being located within the SMMU register map? As you may remember, we did have this issue but relocated the PMCG to outside the SMMU register map in a later chip rev.
>
> MMU-600 has its TCU PMCG page 0 in the middle of its SMMU page 0 space, but given that it's an Arm IP, I expect that when the heat gets turned up for making it work, it's most likely to be under me ;)

OK, so this is another reason why I thought that having a reference to the SMMU device could be useful in terms of solving that problem.

> Robin.

Thanks again,
John
Re: [PATCH 0/7] iommu: Permit modular builds of ARM SMMU[v3] drivers
On 31/10/2019 23:34, Saravana Kannan via iommu wrote:

I looked into the iommu-map property and it shouldn't be too hard to add support for it. Looks like we can simply hold off on probing the root bridge device till all the iommus in its iommu-map are probed and we should be fine.

I'm also unsure about distro vendors agreeing to a mandatory kernel parameter (of_devlink). Do you plan to eventually enable it by default?

 static const struct supplier_bindings of_supplier_bindings[] = {
 	{ .parse_prop = parse_clocks, },
 	{ .parse_prop = parse_interconnects, },
 	{ .parse_prop = parse_regulators, },
+	{ .parse_prop = parse_iommus, },
 	{},
 };

I plan to upstream this pretty soon, but I have other patches in flight that touch the same file and I'm waiting for those to get accepted. I also want to clean up the code a bit to reduce some repetition before I add support for more bindings.

I'm also wondering about ACPI support.

I'd love to add ACPI support too, but I have zero knowledge of ACPI. I'd be happy to help anyone who wants to add ACPI support that allows ACPI to add device links.

If possible to add, that may be useful for remedying this:

https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26...@huawei.com/

Thanks,
John

IOMMU already has a sort of canonical code path that links endpoints to their IOMMU (iommu_probe_device()), after the firmware descriptions have been parsed. So if we created the device links in the iommu core, for example iommu_bus_notifier(), we would support all firmware interface fl
Re: [PATCH 0/7] iommu: Permit modular builds of ARM SMMU[v3] drivers
On 01/11/2019 21:13, Saravana Kannan wrote: On Fri, Nov 1, 2019 at 3:28 AM John Garry wrote: On 31/10/2019 23:34, Saravana Kannan via iommu wrote: I looked into the iommu-map property and it shouldn't be too hard to add support for it. Looks like we can simply hold off on probing the root bridge device till all the iommus in its iommu-map are probed and we should be fine. I'm also unsure about distro vendors agreeing to a mandatory kernel parameter (of_devlink). Do you plan to eventually enable it by default? static const struct supplier_bindings of_supplier_bindings[] = { { .parse_prop = parse_clocks, }, { .parse_prop = parse_interconnects, }, { .parse_prop = parse_regulators, }, +{ .parse_prop = parse_iommus, }, {}, }; I plan to upstream this pretty soon, but I have other patches in flight that touch the same file and I'm waiting for those to get accepted. I also want to clean up the code a bit to reduce some repetition before I add support for more bindings. I'm also wondering about ACPI support. I'd love to add ACPI support too, but I have zero knowledge of ACPI. I'd be happy to help anyone who wants to add ACPI support that allows ACPI to add device links. If possible to add, that may be useful for remedying this: https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26...@huawei.com/ I'm happy that this change might fix that problem, but isn't the problem reported in that thread more to do with child devices getting added before the parent probes successfully? That doesn't make sense to me. So the pcieport device and then the child device are added in the PCI scan, but only some time later do the device drivers probe for these devices; so it's not that the that pcieport driver creates the child device. The problem then occurs in that the ordering the of device driver probe is such that we have this: pcieport probe + defer (as no IOMMU group registered), SMMU probe (registers the IOMMU group), child device probe, pcieport really probe. 
> Can't the pcieport driver avoid adding its child devices before it probes successfully? Or more specifically, who adds the child devices of the pcieport before the pcieport itself probes?

The devices are actually added in the order pcieport, child device, but not really probed in that same order, as above.

I'll add you to that thread if you want to discuss further.

Thanks,
John
Re: arm64 iommu groups issue
+
On 19/09/2019 15:35, John Garry wrote:
> On 19/09/2019 14:25, Robin Murphy wrote:
>> When the port eventually probes it gets a new, separate group. This all seems to be because the built-in module init ordering is as follows: pcieport drv, smmu drv, mlx5 drv
>>
>> I notice that if I build the mlx5 drv as a ko and insert it after boot, all functions + pcieport are in the same group:
>>
>> [ 11.530046] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
>> [ 17.301093] hns_dsaf HISI00B2:00: Adding to iommu group 1
>> [ 18.743600] ehci-platform PNP0D20:00: Adding to iommu group 2
>> [ 20.212284] pcieport 0002:f8:00.0: Adding to iommu group 3
>> [ 20.356303] pcieport 0004:88:00.0: Adding to iommu group 4
>> [ 20.493337] pcieport 0005:78:00.0: Adding to iommu group 5
>> [ 20.702999] pcieport 000a:10:00.0: Adding to iommu group 6
>> [ 20.859183] pcieport 000c:20:00.0: Adding to iommu group 7
>> [ 20.996140] pcieport 000d:30:00.0: Adding to iommu group 8
>> [ 21.152637] serial 0002:f9:00.0: Adding to iommu group 3
>> [ 21.346991] serial 0002:f9:00.1: Adding to iommu group 3
>> [ 100.754306] mlx5_core 000a:11:00.0: Adding to iommu group 6
>> [ 101.420156] mlx5_core 000a:11:00.1: Adding to iommu group 6
>> [ 292.481714] mlx5_core 000a:11:00.2: Adding to iommu group 6
>> [ 293.281061] mlx5_core 000a:11:00.3: Adding to iommu group 6
>>
>> This does seem like a problem for arm64 platforms which don't support ACS, yet enable an SMMU. Maybe also a problem even if they do support ACS. Opinion?

Hi Robin,

> Yeah, this is less than ideal.

For sure. Our production D05 boards don't ship with the SMMU enabled in BIOS, but it would be slightly concerning in this regard if they did.

> One way to bodge it might be to make pci_device_group() also walk downwards to see if any non-ACS-isolated children already have a group, rather than assuming that groups get allocated in hierarchical order, but that's far from ideal.

Agree. My own workaround was to hack the mentioned iort code to defer the PF probe if the parent port had also yet to probe.

> The underlying issue is that, for historical reasons, OF/IORT-based IOMMU drivers have ended up with group allocation being tied to endpoint driver probing via the dma_configure() mechanism (long story short, driver probe is the only thing which can be delayed in order to wait for a specific IOMMU instance to be ready). However, in the meantime, the IOMMU API internals have evolved sufficiently that I think there's a way to really put things right - I have the spark of an idea which I'll try to sketch out ASAP...

OK, great.

Thanks,
John

> Robin.
Re: [PATCH v2 6/9] Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"
On 08/11/2019 15:16, Will Deacon wrote:
> +MODULE_DEVICE_TABLE(of, arm_smmu_of_match);

Hi Will,

>  static struct platform_driver arm_smmu_driver = {
>  	.driver	= {
>  		.name		= "arm-smmu-v3",
>  		.of_match_table	= of_match_ptr(arm_smmu_of_match),
> -		.suppress_bind_attrs = true,

Does this mean that we can now manually unbind this driver from the SMMU device? Seems dangerous. Here's what happens for me:

root@ubuntu:/sys# cd ./bus/platform/drivers/arm-smmu-v3
root@ubuntu:/sys/bus/platform/drivers/arm-smmu-v3# echo arm-smmu-v3.0.auto > unbind
[   77.580351] hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found!
[   78.635473] platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0146 [hwprod 0x0146, hwcons 0x]

>  	},
>  	.probe	= arm_smmu_device_probe,
> +	.remove	= arm_smmu_device_remove,
>  	.shutdown = arm_smmu_device_shutdown,
>  };
> -builtin_platform_driver(arm_smmu_driver);
> +module_platform_driver(arm_smmu_driver);
> +

Thanks,
John
Re: [PATCH v2 6/9] Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"
On 08/11/2019 16:17, John Garry wrote: On 08/11/2019 15:16, Will Deacon wrote: +MODULE_DEVICE_TABLE(of, arm_smmu_of_match); Hi Will, static struct platform_driver arm_smmu_driver = { .driver = { .name = "arm-smmu-v3", .of_match_table = of_match_ptr(arm_smmu_of_match), - .suppress_bind_attrs = true, Does this mean that we can now manually unbind this driver from the SMMU device? Seems dangerous. Here's what happens for me: root@ubuntu:/sys# cd ./bus/platform/drivers/arm-smmu-v3 ind @ubuntu:/sys/bus/platform/drivers/arm-smmu-v3# echo arm-smmu-v3.0.auto > unbind [ 77.580351] hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found! ho [ 78.635473] platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0146 [hwprod 0x0146, hwcons 0x] }, .probe = arm_smmu_device_probe, + .remove = arm_smmu_device_remove, .shutdown = arm_smmu_device_shutdown, }; -builtin_platform_driver(arm_smmu_driver); +module_platform_driver(arm_smmu_driver); + BTW, it now looks like it was your v1 series I was testing there, on your branch iommu/module. It would be helpful to update for ease of testing. Thanks, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 6/9] Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"
On 08/11/2019 16:47, Will Deacon wrote: On Fri, Nov 08, 2019 at 04:44:25PM +, John Garry wrote: On 08/11/2019 16:17, John Garry wrote: On 08/11/2019 15:16, Will Deacon wrote: +MODULE_DEVICE_TABLE(of, arm_smmu_of_match); Hi Will, static struct platform_driver arm_smmu_driver = { .driver = { .name = "arm-smmu-v3", .of_match_table = of_match_ptr(arm_smmu_of_match), - .suppress_bind_attrs = true, Does this mean that we can now manually unbind this driver from the SMMU device? Seems dangerous. Here's what happens for me: root@ubuntu:/sys# cd ./bus/platform/drivers/arm-smmu-v3 ind @ubuntu:/sys/bus/platform/drivers/arm-smmu-v3# echo arm-smmu-v3.0.auto > unbind [ 77.580351] hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found! ho [ 78.635473] platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0146 [hwprod 0x0146, hwcons 0x] }, .probe = arm_smmu_device_probe, + .remove = arm_smmu_device_remove, .shutdown = arm_smmu_device_shutdown, }; -builtin_platform_driver(arm_smmu_driver); +module_platform_driver(arm_smmu_driver); + BTW, it now looks like it was your v1 series I was testing there, on your branch iommu/module. It would be helpful to update for ease of testing. Hi Will, Yes, sorry about that. I'll update it now (although I'm not sure it will help with this -- I was going to see what happens with other devices such as the intel-iommu or storage controllers) So I tried your v2 series for this - it has the same issue, as I anticipated. It seems that some iommu drivers do call iommu_device_register(), so maybe a decent reference. Or simply stop the driver being unbound. Cheers, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 6/9] Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"
On 08/11/2019 17:32, Will Deacon wrote: On Fri, Nov 08, 2019 at 05:25:09PM +, John Garry wrote: On 08/11/2019 16:47, Will Deacon wrote: On Fri, Nov 08, 2019 at 04:44:25PM +, John Garry wrote: BTW, it now looks like it was your v1 series I was testing there, on your branch iommu/module. It would be helpful to update for ease of testing. Yes, sorry about that. I'll update it now (although I'm not sure it will help with this -- I was going to see what happens with other devices such as the intel-iommu or storage controllers) So I tried your v2 series for this - it has the same issue, as I anticipated. Right, I'm just not sure how resilient drivers are expected to be to force unbinding like this. You can break lots of stuff with root... For sure, but it is good practice to limit that. I had to fix something like this recently, so know about it... another potential problem is use-after-frees, where your device managed memory is freed at removal but still registered somewhere. It seems that some iommu drivers do call iommu_device_register(), so maybe a decent reference. Or simply stop the driver being unbound. I'm not sure what you mean about iommu_device_register() (we call that already), Sorry, I meant to say iommu_device_unregister(). but I guess we can keep the '.suppress_bind_attrs = true' if necessary. It may be good to add it to older stable kernels also, pre c07b6426df92. I'll have a play on my laptop and see how well that works if you start unbinding stuff. Cheers, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 6/9] Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"
On 08/11/2019 17:48, Will Deacon wrote:
> On Fri, Nov 08, 2019 at 05:32:48PM +, Will Deacon wrote:
>> On Fri, Nov 08, 2019 at 05:25:09PM +, John Garry wrote:
>>> On 08/11/2019 16:47, Will Deacon wrote:
>>>> On Fri, Nov 08, 2019 at 04:44:25PM +, John Garry wrote:
>>>>> BTW, it now looks like it was your v1 series I was testing there, on your branch iommu/module. It would be helpful to update for ease of testing.
>>>> Yes, sorry about that. I'll update it now (although I'm not sure it will help with this -- I was going to see what happens with other devices such as the intel-iommu or storage controllers)
>>> So I tried your v2 series for this - it has the same issue, as I anticipated.
>> Right, I'm just not sure how resilient drivers are expected to be to force unbinding like this. You can break lots of stuff with root...
>>> It seems that some iommu drivers do call iommu_device_register(), so maybe a decent reference. Or simply stop the driver being unbound.
>> I'm not sure what you mean about iommu_device_register() (we call that already), but I guess we can keep the '.suppress_bind_attrs = true' if necessary. I'll have a play on my laptop and see how well that works if you start unbinding stuff.
>
> So unbinding the nvme driver goes bang:
>
> [90139.090158] nvme nvme0: failed to set APST feature (-19)
> [90141.966780] Aborting journal on device dm-1-8.
> [90141.967124] Buffer I/O error on dev dm-1, logical block 26247168, lost sync page write
> [90141.967169] JBD2: Error -5 detected when updating journal superblock for dm-1-8.
> [90141.967403] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
> [90141.967454] EXT4-fs (dm-1): I/O error while writing superblock
> [90141.967467] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
> [90141.967473] EXT4-fs (dm-1): Remounting filesystem read-only
> [90141.967569] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
> [90141.967682] EXT4-fs (dm-1): I/O error while writing superblock
>
> and I've not managed to recover the thing yet (it's stuck trying to reboot.)

Not surprised. I guess the device backing your root directory disappeared.

> What state was your system in after unbinding the SMMU?

Unusable again. For me, the storage controller backing the root directory is compromised by disabling the SMMU unsafely.

Thanks,
John
Re: [PATCH 1/2] iommu/arm-smmu-v3: Don't display an error when IRQ lines are missing
On 11/11/2019 11:17, Jean-Philippe Brucker wrote:

Since commit 7723f4c5ecdb ("driver core: platform: Add an error message to platform_get_irq*()"), platform_get_irq_byname() displays an error when the IRQ isn't found. Since the SMMUv3 driver uses that function to query which interrupt method is available, the message is now displayed during boot for any SMMUv3 that doesn't implement the combined interrupt, or that implements MSIs.

[ 20.700337] arm-smmu-v3 arm-smmu-v3.7.auto: IRQ combined not found
[ 20.706508] arm-smmu-v3 arm-smmu-v3.7.auto: IRQ eventq not found
[ 20.712503] arm-smmu-v3 arm-smmu-v3.7.auto: IRQ priq not found
[ 20.718325] arm-smmu-v3 arm-smmu-v3.7.auto: IRQ gerror not found

Use platform_get_irq_byname_optional() to avoid displaying a spurious error.

Fixes: 7723f4c5ecdb ("driver core: platform: Add an error message to platform_get_irq*()")
Signed-off-by: Jean-Philippe Brucker

That stops a nuisance:

Tested-by: John Garry

However, I will say that the combined irq seems necessary for TX2, yet it is no longer warned about when missing. 
Finally, a cover letter would have been handy to mention that the new API was only introduced after -rc1.

Thanks

---
 drivers/iommu/arm-smmu-v3.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index bfa4a0f39ed0..a89797f346a4 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -5207,19 +5207,19 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 	/* Interrupt lines */
-	irq = platform_get_irq_byname(pdev, "combined");
+	irq = platform_get_irq_byname_optional(pdev, "combined");
 	if (irq > 0)
 		smmu->combined_irq = irq;
 	else {
-		irq = platform_get_irq_byname(pdev, "eventq");
+		irq = platform_get_irq_byname_optional(pdev, "eventq");
 		if (irq > 0)
 			smmu->evtq.q.irq = irq;
-		irq = platform_get_irq_byname(pdev, "priq");
+		irq = platform_get_irq_byname_optional(pdev, "priq");
 		if (irq > 0)
 			smmu->priq.q.irq = irq;
-		irq = platform_get_irq_byname(pdev, "gerror");
+		irq = platform_get_irq_byname_optional(pdev, "gerror");
 		if (irq > 0)
 			smmu->gerr_irq = irq;
 	}
Re: [PATCH v3 09/14] iommu/arm-smmu: Prevent forced unbinding of Arm SMMU drivers
On 21/11/2019 11:49, Will Deacon wrote: Forcefully unbinding the Arm SMMU drivers is a pretty dangerous operation, since it will likely lead to catastrophic failure for any DMA devices mastering through the SMMU being unbound. When the driver then attempts to "handle" the fatal faults, it's very easy to trip over dead data structures, leading to use-after-free. On John's machine, he reports that the machine was "unusable" due to loss of the storage controller following a forced unbind of the SMMUv3 driver: | # cd ./bus/platform/drivers/arm-smmu-v3 | # echo arm-smmu-v3.0.auto > unbind | hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found! | platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0146 | [hwprod 0x0146, hwcons 0x] Prevent this forced unbinding of the drivers by setting "suppress_bind_attrs" to true. This seems a reasonable approach for now. BTW, I'll give this series a spin this week, which again looks to be your iommu/module branch, excluding the new IORT patch. Cheers, John Link: https://lore.kernel.org/lkml/06dfd385-1af0-3106-4cc5-6a5b8e864...@huawei.com Reported-by: John Garry Signed-off-by: Will Deacon --- drivers/iommu/arm-smmu-v3.c | 5 +++-- drivers/iommu/arm-smmu.c| 7 --- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 2ad8e2ca0583..3fd75abce3bb 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -3700,8 +3700,9 @@ MODULE_DEVICE_TABLE(of, arm_smmu_of_match); static struct platform_driver arm_smmu_driver = { .driver = { - .name = "arm-smmu-v3", - .of_match_table = of_match_ptr(arm_smmu_of_match), + .name = "arm-smmu-v3", + .of_match_table = of_match_ptr(arm_smmu_of_match), + .suppress_bind_attrs= true, }, .probe = arm_smmu_device_probe, .remove = arm_smmu_device_remove, diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index 53bbe0663b9e..d6c83bd69555 100644 --- a/drivers/iommu/arm-smmu.c +++ b/drivers/iommu/arm-smmu.c @@ 
-2237,9 +2237,10 @@ static const struct dev_pm_ops arm_smmu_pm_ops = { static struct platform_driver arm_smmu_driver = { .driver = { - .name = "arm-smmu", - .of_match_table = of_match_ptr(arm_smmu_of_match), - .pm = &arm_smmu_pm_ops, + .name = "arm-smmu", + .of_match_table = of_match_ptr(arm_smmu_of_match), + .pm = &arm_smmu_pm_ops, + .suppress_bind_attrs= true, }, .probe = arm_smmu_device_probe, .remove = arm_smmu_device_remove, ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 09/14] iommu/arm-smmu: Prevent forced unbinding of Arm SMMU drivers
On 26/11/2019 20:27, Saravana Kannan wrote: On Tue, Nov 26, 2019 at 1:13 AM John Garry wrote: On 21/11/2019 11:49, Will Deacon wrote:

Forcefully unbinding the Arm SMMU drivers is a pretty dangerous operation, since it will likely lead to catastrophic failure for any DMA devices mastering through the SMMU being unbound. When the driver then attempts to "handle" the fatal faults, it's very easy to trip over dead data structures, leading to use-after-free. On John's machine, he reports that the machine was "unusable" due to loss of the storage controller following a forced unbind of the SMMUv3 driver:

| # cd ./bus/platform/drivers/arm-smmu-v3
| # echo arm-smmu-v3.0.auto > unbind
| hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found!
| platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0146
| [hwprod 0x0146, hwcons 0x]

Prevent this forced unbinding of the drivers by setting "suppress_bind_attrs" to true.

This seems a reasonable approach for now. BTW, I'll give this series a spin this week, which again looks to be your iommu/module branch, excluding the new IORT patch.

Hi Saravana,

Is this on a platform where of_devlink creates device links between the iommu device and its suppliers? I'm guessing no? Because device links should force unbinding of all the consumers before unbinding the supplier.

I'm only really interested in ACPI, TBH.

Looks like it'll still allow the supplier to unbind if the consumers don't allow unbinding. Is that the case here?

So just unbinding the driver from a device does not delete the device nor remove the device from its IOMMU group - so we keep the reference to the SMMU ko. As such, I don't know how to realistically test unloading the SMMU ko when we have platform devices involved. Maybe someone can enlighten me...

Thanks, John
Re: [PATCH v3 09/14] iommu/arm-smmu: Prevent forced unbinding of Arm SMMU drivers
On 27/11/2019 11:04, John Garry wrote: On 26/11/2019 20:27, Saravana Kannan wrote: On Tue, Nov 26, 2019 at 1:13 AM John Garry wrote: On 21/11/2019 11:49, Will Deacon wrote: Forcefully unbinding the Arm SMMU drivers is a pretty dangerous operation, since it will likely lead to catastrophic failure for any DMA devices mastering through the SMMU being unbound. When the driver then attempts to "handle" the fatal faults, it's very easy to trip over dead data structures, leading to use-after-free. On John's machine, he reports that the machine was "unusable" due to loss of the storage controller following a forced unbind of the SMMUv3 driver: | # cd ./bus/platform/drivers/arm-smmu-v3 | # echo arm-smmu-v3.0.auto > unbind | hisi_sas_v2_hw HISI0162:01: CQE_AXI_W_ERR (0x800) found! | platform arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0146 | [hwprod 0x0146, hwcons 0x] Prevent this forced unbinding of the drivers by setting "suppress_bind_attrs" to true. This seems a reasonable approach for now. BTW, I'll give this series a spin this week, which again looks to be your iommu/module branch, excluding the new IORT patch. Hi Saravana, Is this on a platform where of_devlink creates device links between the iommu device and its suppliers?I'm guessing no? Because device links should for unbinding of all the consumers before unbinding the supplier. I'm only really interested in ACPI, TBH. Looks like it'll still allow the supplier to unbind if the consumers don't allow unbinding. Is that the case here? So just unbinding the driver from a device does not delete the device nor exit the device from it's IOMMU group - so we keep the reference to the SMMU ko. As such, I don't know how to realistically test unloading the SMMU ko when we have platform devices involved. Maybe someone can enlighten me... 
But I could do it on our D06 dev board, where all devices behind the SMMUs are PCI based:

--- Initial state ---
root@(none)$ dmesg | grep Adding
[ 30.271801] pcieport :00:00.0: Adding to iommu group 0
[ 30.296088] pcieport :00:04.0: Adding to iommu group 1
[ 30.322234] pcieport :00:08.0: Adding to iommu group 2
[ 30.335641] pcieport :00:0c.0: Adding to iommu group 3
[ 30.343114] pcieport :00:10.0: Adding to iommu group 4
[ 30.355650] pcieport :00:12.0: Adding to iommu group 5
[ 30.366794] pcieport :7c:00.0: Adding to iommu group 6
[ 30.377993] hns3 :7d:00.0: Adding to iommu group 7
[ 31.861957] hns3 :7d:00.1: Adding to iommu group 8
[ 33.313967] hns3 :7d:00.2: Adding to iommu group 9
[ 33.436029] hns3 :7d:00.3: Adding to iommu group 10
[ 33.555935] hns3 :bd:00.0: Adding to iommu group 11
[ 35.143851] pcieport :74:00.0: Adding to iommu group 12
[ 35.150736] pcieport :80:00.0: Adding to iommu group 13
[ 35.158910] pcieport :80:08.0: Adding to iommu group 14
[ 35.166860] pcieport :80:0c.0: Adding to iommu group 15
[ 35.174813] pcieport :80:10.0: Adding to iommu group 16
[ 35.182854] pcieport :bc:00.0: Adding to iommu group 17
[ 35.189702] pcieport :b4:00.0: Adding to iommu group 18
[ 35.196445] hisi_sas_v3_hw :74:02.0: Adding to iommu group 19
[ 39.410693] ahci :74:03.0: Adding to iommu group 20
root@(none)$ lsmod
Not tainted
arm_smmu_v3 40960 21 - Live 0x88c6

--- Start removing devices ---
root@(none)$ echo 1 > ./sys/devices/pci:00/:00:00.0/remove
[ 55.567808] pci_bus :01: busn_res: [bus 01] is released
[ 55.573514] pci :00:00.0: Removing from iommu group 0
root@(none)$ echo 1 > ./sys/devices/pci:00/:00:04.0/remove
[ 61.767425] pci_bus :02: busn_res: [bus 02] is released
[ 61.773132] pci :00:04.0: Removing from iommu group 1
root@(none)$ echo 1 > ./sys/devices/pci:00/:00:04.0/remove
sh: ./sys/devices/pci:00/:00:04.0/remove: No such file or directory
root@(none)$ echo 1 > ./sys/devices/pci:00/:00:08.0/remove
[ 75.635417] pci_bus :03: busn_res: [bus 03] is released
[ 75.641124] pci :00:08.0: Removing from iommu group 2
root@(none)$ echo 1 > ./sys/devices/pci:00/:00:0c.0/remove
[ 81.587419] pci_bus :04: busn_res: [bus 04] is released
[ 81.593110] pci :00:0c.0: Removing from iommu group 3
root@(none)$ echo 1 > ./sys/devices/pci:00/:00:10.0/remove
[ 85.607605] pci_bus :05: busn_res: [bus 05] is released
[ 85.613300] pci :00:10.0: Removing from iommu group 4
root@(none)$ echo 1 > ./sys/devices/pci:00/:00:12.0/remove
[ 92.731421] pci_bus :06: busn_res: [bus 06] is released
[ 92.737125] pci :00:12.0: Removing from iommu group 5
root@(none)$ echo 1 > ./sys/devices/pci:7c/:7c:00.0/remove
[ 102.286726] pci :7d:00.0: Removing from iommu group 7
[ 102.294157] pci :7d:00.1: Remov
Re: [PATCH] iommu/arm-smmu: support SMMU module probing from the IORT
On 25/11/2019 16:04, Lorenzo Pieralisi wrote: On Fri, Nov 22, 2019 at 06:41:25PM +0100, Ard Biesheuvel wrote: Add support for SMMU drivers built as modules to the ACPI/IORT device probing path, by deferring the probe of the master if the SMMU driver is known to exist but has not been loaded yet. Given that the IORT code registers a platform device for each SMMU that it discovers, we can easily trigger the udev based autoloading of the SMMU drivers by making the platform device identifier part of the module alias. Signed-off-by: Ard Biesheuvel --- drivers/acpi/arm64/iort.c | 4 ++-- drivers/iommu/arm-smmu-v3.c | 1 + drivers/iommu/arm-smmu.c| 1 + 3 files changed, 4 insertions(+), 2 deletions(-) I think it is best if Will picks this up and add it to the series that modularize the SMMU drivers: Acked-by: Lorenzo Pieralisi Tested-by: John Garry # only manual smmu ko loading diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c index 5a7551d060f2..a696457a9b11 100644 --- a/drivers/acpi/arm64/iort.c +++ b/drivers/acpi/arm64/iort.c @@ -850,9 +850,9 @@ static inline bool iort_iommu_driver_enabled(u8 type) { switch (type) { case ACPI_IORT_NODE_SMMU_V3: - return IS_BUILTIN(CONFIG_ARM_SMMU_V3); + return IS_ENABLED(CONFIG_ARM_SMMU_V3); case ACPI_IORT_NODE_SMMU: - return IS_BUILTIN(CONFIG_ARM_SMMU); + return IS_ENABLED(CONFIG_ARM_SMMU); default: pr_warn("IORT node type %u does not describe an SMMU\n", type); return false; diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 7669beafc493..bf6a1e8eb9b0 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -3733,4 +3733,5 @@ module_platform_driver(arm_smmu_driver); MODULE_DESCRIPTION("IOMMU API for ARM architected SMMUv3 implementations"); MODULE_AUTHOR("Will Deacon "); +MODULE_ALIAS("platform:arm-smmu-v3"); MODULE_LICENSE("GPL v2"); diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index d55acc48aee3..db5106b0955b 100644 --- a/drivers/iommu/arm-smmu.c +++ 
b/drivers/iommu/arm-smmu.c @@ -2292,4 +2292,5 @@ module_platform_driver(arm_smmu_driver); MODULE_DESCRIPTION("IOMMU API for ARM architected SMMU implementations"); MODULE_AUTHOR("Will Deacon "); +MODULE_ALIAS("platform:arm-smmu"); MODULE_LICENSE("GPL v2"); -- 2.20.1 . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/3] iommu: match the original algorithm
On 21/11/2019 00:13, Cong Wang wrote: The IOVA cache algorithm implemented in IOMMU code does not exactly match the original algorithm described in the paper. Particularly, it doesn't need to free the loaded empty magazine when trying to put it back to global depot. This patch makes it exactly match the original algorithm. I haven't gone into the details, but this patch alone is giving this: root@(none)$ [ 123.857024] kmemleak: 8 new suspected memory leaks (see /sys/kernel/debug/kmemleak) root@(none)$ cat /sys/kernel/debug/kmemleak unreferenced object 0x002339843000 (size 2048): comm "swapper/0", pid 1, jiffies 4294898165 (age 122.688s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 backtrace: [<1d2710bf>] kmem_cache_alloc+0x188/0x260 [] init_iova_domain+0x1e8/0x2a8 [<2646fc92>] iommu_setup_dma_ops+0x200/0x710 [ ] arch_setup_dma_ops+0x80/0x128 [<994e1e43>] acpi_dma_configure+0x11c/0x140 [ ] pci_dma_configure+0xe0/0x108 [ ] really_probe+0x210/0x548 [<87884b1b>] driver_probe_device+0x7c/0x148 [<10af2936>] device_driver_attach+0x94/0xa0 [ ] __driver_attach+0xa4/0x110 [ ] bus_for_each_dev+0xe8/0x158 [ ] driver_attach+0x30/0x40 [<3cf39ba8>] bus_add_driver+0x234/0x2f0 [<43830a45>] driver_register+0xbc/0x1d0 [ ] __pci_register_driver+0xb0/0xc8 [ ] sas_v3_pci_driver_init+0x20/0x28 unreferenced object 0x002339844000 (size 2048): comm "swapper/0", pid 1, jiffies 4294898165 (age 122.688s) [snip] And I don't feel like continuing until it's resolved Thanks, John Cc: Joerg Roedel Signed-off-by: Cong Wang --- drivers/iommu/iova.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 41c605b0058f..92f72a85e62a 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -900,7 +900,7 @@ static bool __iova_rcache_insert(struct iova_domain *iovad, if (!iova_magazine_full(cpu_rcache->loaded)) { can_insert = true; - } else if 
(!iova_magazine_full(cpu_rcache->prev)) { + } else if (iova_magazine_empty(cpu_rcache->prev)) { swap(cpu_rcache->prev, cpu_rcache->loaded); can_insert = true; } else { @@ -909,8 +909,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad, if (new_mag) { spin_lock(&rcache->lock); if (rcache->depot_size < MAX_GLOBAL_MAGS) { - rcache->depot[rcache->depot_size++] = - cpu_rcache->loaded; + swap(rcache->depot[rcache->depot_size], cpu_rcache->prev); + swap(cpu_rcache->prev, cpu_rcache->loaded); + rcache->depot_size++; } else { mag_to_free = cpu_rcache->loaded; } @@ -963,14 +964,15 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache, if (!iova_magazine_empty(cpu_rcache->loaded)) { has_pfn = true; - } else if (!iova_magazine_empty(cpu_rcache->prev)) { + } else if (iova_magazine_full(cpu_rcache->prev)) { swap(cpu_rcache->prev, cpu_rcache->loaded); has_pfn = true; } else { spin_lock(&rcache->lock); if (rcache->depot_size > 0) { - iova_magazine_free(cpu_rcache->loaded); - cpu_rcache->loaded = rcache->depot[--rcache->depot_size]; + swap(rcache->depot[rcache->depot_size - 1], cpu_rcache->prev); + swap(cpu_rcache->prev, cpu_rcache->loaded); + rcache->depot_size--; has_pfn = true; } spin_unlock(&rcache->lock); ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
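As an aside for readers following the magazine discussion: the loaded/prev/depot scheme being patched above can be pictured with a small simulation. This is plain Python, not kernel code — `Magazine`, `RCache` and the small constants are invented stand-ins (the kernel uses IOVA_MAG_SIZE and MAX_GLOBAL_MAGS) — but it follows the shape of Cong's change: the depot is pre-allocated with empty magazines, so insert and get only ever swap magazine pointers rather than freeing the loaded magazine.

```python
MAG_SIZE = 4    # stand-in for IOVA_MAG_SIZE (128 in the kernel)
MAX_DEPOT = 2   # stand-in for MAX_GLOBAL_MAGS


class Magazine:
    def __init__(self):
        self.pfns = []

    def full(self):
        return len(self.pfns) == MAG_SIZE

    def empty(self):
        return not self.pfns


class RCache:
    """Per-CPU 'loaded'/'prev' pair plus a pre-allocated global depot."""

    def __init__(self):
        self.loaded = Magazine()
        self.prev = Magazine()
        # After the patch, the depot holds allocated magazines from the start;
        # depot_size counts how many of them are currently *full*.
        self.depot = [Magazine() for _ in range(MAX_DEPOT)]
        self.depot_size = 0

    def insert(self, pfn):
        if not self.loaded.full():
            pass
        elif self.prev.empty():
            self.loaded, self.prev = self.prev, self.loaded
        elif self.depot_size < MAX_DEPOT:
            # Exchange an empty depot slot for our full 'prev', then rotate:
            # mirrors swap(depot[depot_size], prev); swap(prev, loaded).
            k = self.depot_size
            self.depot[k], self.prev = self.prev, self.depot[k]
            self.loaded, self.prev = self.prev, self.loaded
            self.depot_size += 1
        else:
            # Depot full: models returning the pfns to the rbtree
            # (iova_magazine_free_pfns on mag_to_free).
            self.loaded.pfns.clear()
        self.loaded.pfns.append(pfn)

    def get(self):
        if not self.loaded.empty():
            return self.loaded.pfns.pop()
        if self.prev.full():
            self.loaded, self.prev = self.prev, self.loaded
            return self.loaded.pfns.pop()
        if self.depot_size > 0:
            # Exchange our empty 'prev' for a full depot magazine, then rotate.
            self.depot_size -= 1
            k = self.depot_size
            self.depot[k], self.prev = self.prev, self.depot[k]
            self.loaded, self.prev = self.prev, self.loaded
            return self.loaded.pfns.pop()
        return None  # cache miss: real code falls back to the rbtree
```

With MAG_SIZE=4 and MAX_DEPOT=2 the cache absorbs 16 pfns without spilling, and retrieving them performs only swaps — no magazine is allocated or freed on either path, which is the point of the change.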
Re: [Patch v2 2/3] iommu: optimize iova_magazine_free_pfns()
On 29/11/2019 00:48, Cong Wang wrote:

If the maganize is empty, iova_magazine_free_pfns() should

magazine

be a nop, however it misses the case of mag->size==0. So we should just call iova_magazine_empty(). This should reduce the contention on iovad->iova_rbtree_lock a little bit.

Cc: Joerg Roedel
Signed-off-by: Cong Wang
---
 drivers/iommu/iova.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cb473ddce4cf..184d4c0e20b5 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -797,13 +797,23 @@ static void iova_magazine_free(struct iova_magazine *mag)
 	kfree(mag);
 }

+static bool iova_magazine_full(struct iova_magazine *mag)
+{
+	return (mag && mag->size == IOVA_MAG_SIZE);
+}
+
+static bool iova_magazine_empty(struct iova_magazine *mag)
+{
+	return (!mag || mag->size == 0);
+}
+
 static void iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
 {
 	unsigned long flags;
 	int i;

-	if (!mag)
+	if (iova_magazine_empty(mag))

The only hot path where we make this call is __iova_rcache_insert()->iova_magazine_free_pfns(mag_to_free), and mag_to_free is full in this case, so I am not sure how the additional check helps, right?

Thanks, John

 		return;

 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
@@ -820,16 +830,6 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
 	mag->size = 0;
 }

-static bool iova_magazine_full(struct iova_magazine *mag)
-{
-	return (mag && mag->size == IOVA_MAG_SIZE);
-}
-
-static bool iova_magazine_empty(struct iova_magazine *mag)
-{
-	return (!mag || mag->size == 0);
-}
-
 static unsigned long iova_magazine_pop(struct iova_magazine *mag, unsigned long limit_pfn)
 {
Re: [Patch v2 3/3] iommu: avoid taking iova_rbtree_lock twice
On 29/11/2019 00:48, Cong Wang wrote:

Both find_iova() and __free_iova() take iova_rbtree_lock, there is no reason to take and release it twice inside free_iova(). Fold them into the critical section by calling the unlock versions instead.

Since generally the iova would be non-NULL, this seems a reasonable change (which could be mentioned in the commit log).

John

Cc: Joerg Roedel
Signed-off-by: Cong Wang
---
 drivers/iommu/iova.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 184d4c0e20b5..f46f8f794678 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova);

 void free_iova(struct iova_domain *iovad, unsigned long pfn)
 {
-	struct iova *iova = find_iova(iovad, pfn);
+	unsigned long flags;
+	struct iova *iova;

+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	iova = private_find_iova(iovad, pfn);
 	if (iova)
-		__free_iova(iovad, iova);
+		private_free_iova(iovad, iova);
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 }
 EXPORT_SYMBOL_GPL(free_iova);
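The pattern this patch applies — fold two lock/unlock pairs into one critical section by calling "unlocked" helper variants — is generic. A toy sketch of the before/after shapes, with invented names (not the kernel's actual locking primitives):

```python
import threading


class IovaDomain:
    """Toy stand-in for iova_domain: a dict of pfn -> iova under one lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.tree = {}

    # "private_" helpers: caller must already hold self.lock
    def _find_iova(self, pfn):
        return self.tree.get(pfn)

    def _free_iova(self, pfn):
        del self.tree[pfn]

    def free_iova_twice_locked(self, pfn):
        # Old shape: find_iova() and __free_iova() each take the lock,
        # so we pay two acquisitions (and the entry could race away
        # between them in a less forgiving data structure).
        with self.lock:
            iova = self._find_iova(pfn)
        if iova is not None:
            with self.lock:
                self._free_iova(pfn)

    def free_iova_folded(self, pfn):
        # New shape: one critical section calling the unlocked helpers,
        # as in the private_find_iova()/private_free_iova() patch.
        with self.lock:
            if self._find_iova(pfn) is not None:
                self._free_iova(pfn)
```

Beyond the saved acquisition, the folded form also makes the lookup-then-free atomic, which is worth mentioning alongside the "generally non-NULL" argument above.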
Re: [Patch v2 1/3] iommu: match the original algorithm
On 29/11/2019 00:48, Cong Wang wrote: The IOVA cache algorithm implemented in IOMMU code does not exactly match the original algorithm described in the paper. which paper? Particularly, it doesn't need to free the loaded empty magazine when trying to put it back to global depot. To make it work, we have to pre-allocate magazines in the depot and only recycle them when all of them are full. Before this patch, rcache->depot[] contains either full or freed entries, after this patch, it contains either full or empty (but allocated) entries. I *quickly* tested this patch and got a small performance gain. Cc: Joerg Roedel Signed-off-by: Cong Wang --- drivers/iommu/iova.c | 45 +++- 1 file changed, 28 insertions(+), 17 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 41c605b0058f..cb473ddce4cf 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -862,12 +862,16 @@ static void init_iova_rcaches(struct iova_domain *iovad) struct iova_cpu_rcache *cpu_rcache; struct iova_rcache *rcache; unsigned int cpu; - int i; + int i, j; for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { rcache = &iovad->rcaches[i]; spin_lock_init(&rcache->lock); rcache->depot_size = 0; + for (j = 0; j < MAX_GLOBAL_MAGS; ++j) { + rcache->depot[j] = iova_magazine_alloc(GFP_KERNEL); + WARN_ON(!rcache->depot[j]); + } rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size()); if (WARN_ON(!rcache->cpu_rcaches)) continue; @@ -900,24 +904,30 @@ static bool __iova_rcache_insert(struct iova_domain *iovad, if (!iova_magazine_full(cpu_rcache->loaded)) { can_insert = true; - } else if (!iova_magazine_full(cpu_rcache->prev)) { + } else if (iova_magazine_empty(cpu_rcache->prev)) { is this change strictly necessary? 
swap(cpu_rcache->prev, cpu_rcache->loaded); can_insert = true; } else { - struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC); + spin_lock(&rcache->lock); + if (rcache->depot_size < MAX_GLOBAL_MAGS) { + swap(rcache->depot[rcache->depot_size], cpu_rcache->prev); + swap(cpu_rcache->prev, cpu_rcache->loaded); + rcache->depot_size++; + can_insert = true; + } else { + mag_to_free = cpu_rcache->loaded; + } + spin_unlock(&rcache->lock); + + if (mag_to_free) { + struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC); - if (new_mag) { - spin_lock(&rcache->lock); - if (rcache->depot_size < MAX_GLOBAL_MAGS) { - rcache->depot[rcache->depot_size++] = - cpu_rcache->loaded; + if (new_mag) { + cpu_rcache->loaded = new_mag; + can_insert = true; } else { - mag_to_free = cpu_rcache->loaded; + mag_to_free = NULL; } - spin_unlock(&rcache->lock); - - cpu_rcache->loaded = new_mag; - can_insert = true; } } @@ -963,14 +973,15 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache, if (!iova_magazine_empty(cpu_rcache->loaded)) { has_pfn = true; - } else if (!iova_magazine_empty(cpu_rcache->prev)) { + } else if (iova_magazine_full(cpu_rcache->prev)) { swap(cpu_rcache->prev, cpu_rcache->loaded); has_pfn = true; } else { spin_lock(&rcache->lock); if (rcache->depot_size > 0) { - iova_magazine_free(cpu_rcache->loaded); it is good to remove this from under the lock, apart from this change - cpu_rcache->loaded = rcache->depot[--rcache->depot_size]; + swap(rcache->depot[rcache->depot_size - 1], cpu_rcache->prev); + swap(cpu_rcache->prev, cpu_rcache->loaded); + rcache->depot_size--; I'm not sure how appropriate the name "depot_size" is any longer. has_pfn = true; } spin_unlock(&rcache->lock); @@ -1019,7 +1030,7 @@ static void free_iova_rcaches(struct iova_domain *iovad) iova_magazine_free(cpu_rcache->prev); } free_percpu(rcache->cpu_rcaches); - for (j = 0; j < rcache->
Re: [Patch v2 2/3] iommu: optimize iova_magazine_free_pfns()
On 30/11/2019 06:02, Cong Wang wrote: On Fri, Nov 29, 2019 at 5:24 AM John Garry wrote: On 29/11/2019 00:48, Cong Wang wrote: If the maganize is empty, iova_magazine_free_pfns() should magazine Good catch! be a nop, however it misses the case of mag->size==0. So we should just call iova_magazine_empty(). This should reduce the contention on iovad->iova_rbtree_lock a little bit. Cc: Joerg Roedel Signed-off-by: Cong Wang --- drivers/iommu/iova.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index cb473ddce4cf..184d4c0e20b5 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -797,13 +797,23 @@ static void iova_magazine_free(struct iova_magazine *mag) kfree(mag); } +static bool iova_magazine_full(struct iova_magazine *mag) +{ + return (mag && mag->size == IOVA_MAG_SIZE); +} + +static bool iova_magazine_empty(struct iova_magazine *mag) +{ + return (!mag || mag->size == 0); +} + static void iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad) { unsigned long flags; int i; - if (!mag) + if (iova_magazine_empty(mag)) The only hot path we this call is __iova_rcache_insert()->iova_magazine_free_pfns(mag_to_free) and mag_to_free is full in this case, so I am sure how the additional check helps, right? This is what I mean by "a little bit" in changelog, did you miss it or misunderstand it? :) I was concerned that in the fastpath we actually make things very marginally slower by adding a check which will fail. Thanks, John Thanks. . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [Patch v2 1/3] iommu: match the original algorithm
On 30/11/2019 05:58, Cong Wang wrote: On Fri, Nov 29, 2019 at 6:43 AM John Garry wrote: On 29/11/2019 00:48, Cong Wang wrote: The IOVA cache algorithm implemented in IOMMU code does not exactly match the original algorithm described in the paper. which paper? It's in drivers/iommu/iova.c, from line 769: 769 /* 770 * Magazine caches for IOVA ranges. For an introduction to magazines, 771 * see the USENIX 2001 paper "Magazines and Vmem: Extending the Slab 772 * Allocator to Many CPUs and Arbitrary Resources" by Bonwick and Adams. 773 * For simplicity, we use a static magazine size and don't implement the 774 * dynamic size tuning described in the paper. 775 */ Particularly, it doesn't need to free the loaded empty magazine when trying to put it back to global depot. To make it work, we have to pre-allocate magazines in the depot and only recycle them when all of them are full. Before this patch, rcache->depot[] contains either full or freed entries, after this patch, it contains either full or empty (but allocated) entries. I *quickly* tested this patch and got a small performance gain. Thanks for testing! It requires a different workload to see bigger gain, in our case, 24 memcache.parallel servers with 120 clients. So in fact I was getting a ~10% throughput boost for my storage test. Seems more than I would expect/hope for. I would need to test more. 
Cc: Joerg Roedel Signed-off-by: Cong Wang --- drivers/iommu/iova.c | 45 +++- 1 file changed, 28 insertions(+), 17 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 41c605b0058f..cb473ddce4cf 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -862,12 +862,16 @@ static void init_iova_rcaches(struct iova_domain *iovad) struct iova_cpu_rcache *cpu_rcache; struct iova_rcache *rcache; unsigned int cpu; - int i; + int i, j; for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { rcache = &iovad->rcaches[i]; spin_lock_init(&rcache->lock); rcache->depot_size = 0; + for (j = 0; j < MAX_GLOBAL_MAGS; ++j) { + rcache->depot[j] = iova_magazine_alloc(GFP_KERNEL); + WARN_ON(!rcache->depot[j]); + } rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size()); if (WARN_ON(!rcache->cpu_rcaches)) continue; @@ -900,24 +904,30 @@ static bool __iova_rcache_insert(struct iova_domain *iovad, if (!iova_magazine_full(cpu_rcache->loaded)) { can_insert = true; - } else if (!iova_magazine_full(cpu_rcache->prev)) { + } else if (iova_magazine_empty(cpu_rcache->prev)) { is this change strictly necessary? Yes, because it is what described in the paper. But it should be functionally same because cpu_rcache->prev is either full or empty. That is was what I was getting at. swap(cpu_rcache->prev, cpu_rcache->loaded); can_insert = true; } else { - struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC); Apart from this change, did anyone ever consider kmem cache for the magazines? 
+ spin_lock(&rcache->lock); + if (rcache->depot_size < MAX_GLOBAL_MAGS) { + swap(rcache->depot[rcache->depot_size], cpu_rcache->prev); + swap(cpu_rcache->prev, cpu_rcache->loaded); + rcache->depot_size++; + can_insert = true; + } else { + mag_to_free = cpu_rcache->loaded; + } + spin_unlock(&rcache->lock); + + if (mag_to_free) { + struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC); - if (new_mag) { - spin_lock(&rcache->lock); - if (rcache->depot_size < MAX_GLOBAL_MAGS) { - rcache->depot[rcache->depot_size++] = - cpu_rcache->loaded; + if (new_mag) { + cpu_rcache->loaded = new_mag; + can_insert = true; } else { - mag_to_free = cpu_rcache->loaded; + mag_to_free = NULL; } - spin_unlock(&rcache->lock); - - cpu_rcache->loaded = new_mag; - can_insert = true; } } @@ -963,14 +973,15 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache, if (!iova_magazine_empty(cpu_rcache->loaded)) { has_p
Re: [Patch v3 0/3] iommu: reduce spinlock contention on fast path
On 06/12/2019 21:38, Cong Wang wrote:

This patchset contains three small optimizations for the global spinlock contention in IOVA cache. Our memcache perf test shows this reduced its p999 latency down by 45% on AMD when IOMMU is enabled.

Cong Wang (3):
  iommu: avoid unnecessary magazine allocations
  iommu: optimize iova_magazine_free_pfns()
  iommu: avoid taking iova_rbtree_lock twice

---
 drivers/iommu/iova.c | 75 ++--
 1 file changed, 45 insertions(+), 30 deletions(-)

I retested, and got a ~1.1% gain in throughput for my storage test - results here if interested: https://pastebin.com/dSQxYpN8

Thanks, John
Re: [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice
On 18/12/2019 04:39, Cong Wang wrote: Both find_iova() and __free_iova() take iova_rbtree_lock, there is no reason to take and release it twice inside free_iova(). Fold them into one critical section by calling the unlock versions instead. Cc: Joerg Roedel Cc: John Garry Signed-off-by: Cong Wang FWIW: Reviewed-by: John Garry --- drivers/iommu/iova.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 184d4c0e20b5..f46f8f794678 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova); void free_iova(struct iova_domain *iovad, unsigned long pfn) { - struct iova *iova = find_iova(iovad, pfn); + unsigned long flags; + struct iova *iova; + spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); + iova = private_find_iova(iovad, pfn); if (iova) - __free_iova(iovad, iova); + private_free_iova(iovad, iova); + spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); } EXPORT_SYMBOL_GPL(free_iova); ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 00/16] iommu: Permit modular builds of ARM SMMU[v3] drivers
On 19/12/2019 12:03, Will Deacon wrote:
> Hi all,
>
> This is version four of the patches I previously posted here:
>
>   v1: https://lore.kernel.org/lkml/20191030145112.19738-1-w...@kernel.org/
>   v2: https://lore.kernel.org/lkml/20191108151608.20932-1-w...@kernel.org
>   v3: https://lore.kernel.org/lkml/20191121114918.2293-1-w...@kernel.org
>
> Changes since v3 include:
>
>   * Based on v5.5-rc1
>   * ACPI/IORT support (thanks to Ard)
>   * Export pci_{enable,disable}_ats() (thanks to Greg)
>   * Added review tags

Since it looks like not much has changed since v3, and I even tested removal of smmu-v3.ko [0], this should be ok for the series:

Tested-by: John Garry # smmu v3

[0] https://lore.kernel.org/linux-iommu/c8eb97b1-dab5-cc25-7669-2819f64a8...@huawei.com/

> I tested this on AMD Seattle by loading arm-smmu-mod.ko from the initrd.
>
> Cheers,
>
> Will
>
> Cc: Jean-Philippe Brucker
> Cc: Jordan Crouse
> Cc: John Garry
> Cc: Bjorn Helgaas
> Cc: Saravana Kannan
> Cc: Greg Kroah-Hartman
> Cc: "Isaac J. Manjarres"
> Cc: Robin Murphy
> Cc: Lorenzo Pieralisi
> Cc: Joerg Roedel
> Cc: Ard Biesheuvel
>
> --->8
>
> Ard Biesheuvel (1):
>   iommu/arm-smmu: Support SMMU module probing from the IORT
>
> Greg Kroah-Hartman (1):
>   PCI/ATS: Restore EXPORT_SYMBOL_GPL() for pci_{enable,disable}_ats()
>
> Will Deacon (14):
>   drivers/iommu: Export core IOMMU API symbols to permit modular drivers
>   iommu/of: Request ACS from the PCI core when configuring IOMMU linkage
>   PCI: Export pci_ats_disabled() as a GPL symbol to modules
>   drivers/iommu: Take a ref to the IOMMU driver prior to ->add_device()
>   iommu/of: Take a ref to the IOMMU driver during ->of_xlate()
>   drivers/iommu: Allow IOMMU bus ops to be unregistered
>   Revert "iommu/arm-smmu: Make arm-smmu-v3 explicitly non-modular"
>   Revert "iommu/arm-smmu: Make arm-smmu explicitly non-modular"
>   iommu/arm-smmu: Prevent forced unbinding of Arm SMMU drivers
>   iommu/arm-smmu-v3: Unregister IOMMU and bus ops on device removal
>   iommu/arm-smmu-v3: Allow building as a module
>   iommu/arm-smmu: Unregister IOMMU and bus ops on device removal
>   iommu/arm-smmu: Allow building as a module
>   iommu/arm-smmu: Update my email address in MODULE_AUTHOR()
>
>  drivers/acpi/arm64/iort.c   |   4 +-
>  drivers/iommu/Kconfig       |  16 -
>  drivers/iommu/Makefile      |   3 +-
>  drivers/iommu/arm-smmu-v3.c |  94 +-
>  drivers/iommu/arm-smmu.c    | 128 +---
>  drivers/iommu/iommu-sysfs.c |   5 ++
>  drivers/iommu/iommu.c       |  32 -
>  drivers/iommu/of_iommu.c    |  19 --
>  drivers/pci/ats.c           |   2 +
>  drivers/pci/pci.c           |   1 +
>  include/linux/iommu.h       |   4 +-
>  11 files changed, 223 insertions(+), 85 deletions(-)

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
arm-smmu-v3 high cpu usage for NVMe
Hi Will, Robin,

While analyzing an arm64 issue in interrupt handling for NVMe [0], we noticed worryingly high CPU utilization in the SMMU driver. The background is that we may get CPU lockups during high-throughput NVMe testing, and we noticed that disabling the SMMU during testing avoids the issue. However, this lockup is a cross-architecture issue and there are attempts to address it, like [1]. To me, disabling the SMMU just avoids that specific issue. Anyway, we should still consider this high CPU loading:

  PerfTop:    1694 irqs/sec  kernel:97.3%  exact: 0.0%  lost: 0/0  drop: 0/0  [4000Hz cycles], (all, CPU: 0)
--------------------------------------------------------------------------

   50.84%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
   19.51%  [kernel]  [k] _raw_spin_unlock_irqrestore
    5.14%  [kernel]  [k] __slab_free
    2.37%  [kernel]  [k] bio_release_pages.part.42
    2.20%  [kernel]  [k] fput_many
    1.92%  [kernel]  [k] aio_complete_rw
    1.85%  [kernel]  [k] __arm_lpae_unmap
    1.71%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
    1.11%  [kernel]  [k] sbitmap_queue_clear
    1.05%  [kernel]  [k] blk_mq_free_request
    0.97%  [kernel]  [k] nvme_irq
    0.71%  [kernel]  [k] blk_account_io_done
    0.66%  [kernel]  [k] kmem_cache_free
    0.66%  [kernel]  [k] blk_mq_complete_request

This is for a CPU servicing the NVMe interrupt and doing the DMA unmap. The DMA unmap is done in threaded interrupt context.
And for the overall system, we have:

  PerfTop:   85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434  drop: 0/40116  [4000Hz cycles], (all, 96 CPUs)
--------------------------------------------------------------------------

   27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
   11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
    6.35%  [kernel]  [k] _raw_spin_unlock_irq
    2.65%  [kernel]  [k] get_user_pages_fast
    2.03%  [kernel]  [k] __slab_free
    1.55%  [kernel]  [k] tick_nohz_idle_exit
    1.47%  [kernel]  [k] arm_lpae_map
    1.39%  [kernel]  [k] __fget
    1.14%  [kernel]  [k] __lock_text_start
    1.09%  [kernel]  [k] _raw_spin_lock
    1.08%  [kernel]  [k] bio_release_pages.part.42
    1.03%  [kernel]  [k] __sbitmap_get_word
    0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
    0.91%  [kernel]  [k] fput_many
    0.88%  [kernel]  [k] __arm_lpae_map

One thing to note is that we still spend an appreciable amount of time in arm_smmu_atc_inv_domain(), which is disappointing considering it should effectively be a no-op. As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD_SYNCs for small batches. Anyway, let me know your thoughts or any questions. I'll have a look for other possible bottlenecks if I get a chance.
[0] https://lore.kernel.org/lkml/e815b5451ea86e99d42045f7067f4...@www.loen.fr/
[1] https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbu...@kernel.org/

Cheers,
John

On 21/08/2019 16:17, Will Deacon wrote:
> Hi again,
>
> This is version two of the patches I posted yesterday:
>
>   v1: https://lkml.kernel.org/r/20190820154549.17018-1-w...@kernel.org
>
> Changes since then include:
>
>   * Fix 'ats_enabled' checking when enabling ATS
>   * Remove redundant 'dev_is_pci()' calls
>   * Remove bool bitfield
>   * Add patch temporarily disabling ATS detection for -stable
>   * Issue ATC invalidation even when non-leaf
>   * Elide invalidation/SYNC for zero-sized address ranges
>   * Shuffle the patches round a bit
>
> Thanks,
>
> Will
>
> Cc: Zhen Lei
> Cc: Jean-Philippe Brucker
> Cc: John Garry
> Cc: Robin Murphy
>
> --->8
>
> Will Deacon (8):
>   iommu/arm-smmu-v3: Document ordering guarantees of command insertion
>   iommu/arm-smmu-v3: Disable detection of ATS and PRI
>   iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag
>   iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations
>   iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters
>   iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs
>   iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS
>   Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI"
>
>  drivers/iommu/arm-smmu-v3.c | 117
>  1 file changed, 87 insertions(+), 30 deletions(-)