Re: Some questions about arm_lpae_install_table
Hi Will and Robin,

Sorry for the late reply.

On 2020/11/3 18:21, Robin Murphy wrote:
On 2020-11-03 09:11, Will Deacon wrote:
On Tue, Nov 03, 2020 at 11:00:27AM +0800, Kunkun Jiang wrote:

Recently, I have read and learned the code related to io-pgtable-arm.c. There are two questions about arm_lpae_install_table.

1. The first:

static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
					     arm_lpae_iopte *ptep,
					     arm_lpae_iopte curr,
					     struct io_pgtable_cfg *cfg)
{
	arm_lpae_iopte old, new;

	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
		new |= ARM_LPAE_PTE_NSTABLE;

	/*
	 * Ensure the table itself is visible before its PTE can be.
	 * Whilst we could get away with cmpxchg64_release below, this
	 * doesn't have any ordering semantics when !CONFIG_SMP.
	 */
	dma_wmb();

	old = cmpxchg64_relaxed(ptep, curr, new);

	if (cfg->coherent_walk || (old & ARM_LPAE_PTE_SW_SYNC))
		return old;

	/* Even if it's not ours, there's no point waiting; just kick it */
	__arm_lpae_sync_pte(ptep, cfg);
	if (old == curr)
		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);

	return old;
}

If another thread changes the ptep between cmpxchg64_relaxed and WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC), the WRITE_ONCE will overwrite the change.

Can you please provide an example of a code path where this happens? The idea is that CPUs can race on the cmpxchg(), but there will only be one winner.

The only way a table entry can suddenly disappear is in a race that involves mapping or unmapping a whole block/table-sized region, while simultaneously mapping a page *within* that region. Yes, if someone uses the API in a nonsensical and completely invalid way that cannot have a predictable outcome, they get an unpredictable outcome. Whoop de do...

Yes, as Robin mentioned, there will be an unpredictable outcome in extreme cases. Now we have two APIs, map and unmap, to alter a block/page descriptor. I worry that it may become increasingly difficult to add APIs in the future.

2. The second:

	for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz) {
		/* Unmap! */
		if (i == unmap_idx)
			continue;

		__arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
	}

	pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);

When altering a translation table descriptor, including splitting a block into constituent granules, the Armv8-A and SMMUv3 architectures require a break-before-make procedure. But the function arm_lpae_split_blk_unmap changes a block descriptor to an equivalent span of page translations directly. Is it appropriate to do so?

Break-before-make doesn't really work for the SMMU because faults are generally fatal. Are you seeing problems in practice with this code?

TBH I do still wonder if we shouldn't just get rid of split_blk_unmap and all its complexity. Other drivers treat an unmap of a page from a block entry as simply unmapping the whole block, and that's the behaviour VFIO seems to expect. My only worry is that it's been around long enough that there might be some horrible out-of-tree code that *is* relying on it, despite the fact that it's impossible to implement in a way that's definitely 100% safe :/

Robin.

I have not encountered a problem. I mean the SMMU has supported the BBML feature since SMMUv3.2. Maybe we can use this feature to update translation tables without break-before-make.

I look forward to your reply.

Thanks,
Kunkun Jiang
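To see why the cmpxchg() race can only have one winner, here is a minimal userspace analogue using C11 atomics. It is an illustrative sketch only, not the kernel's cmpxchg64_relaxed(), and the table addresses are made up: two threads race to install a pointer into the same empty slot, exactly one succeeds, and the loser observes the winner's value instead of overwriting it.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>

static _Atomic uint64_t ptep;		/* the shared PTE slot, initially 0 */

static void *installer(void *arg)
{
	uint64_t new = (uint64_t)(uintptr_t)arg;	/* this thread's "table" address */
	uint64_t curr = 0;				/* both threads expect an empty slot */

	/* Analogue of: old = cmpxchg64_relaxed(ptep, curr, new); */
	if (atomic_compare_exchange_strong(&ptep, &curr, new))
		printf("thread %#llx won the race\n", (unsigned long long)new);
	else
		printf("thread %#llx lost; slot already holds %#llx\n",
		       (unsigned long long)new, (unsigned long long)curr);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, installer, (void *)0x1000);
	pthread_create(&b, NULL, installer, (void *)0x2000);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* Exactly one "won" line prints; the loser never clobbers the slot. */
	return 0;
}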
Re: [PATCH 1/1] iommu: Fix deferred domain attachment in iommu_probe_device()
Hi Joerg,

On 11/3/20 9:23 PM, Joerg Roedel wrote:
Hi Baolu,

On Mon, Oct 26, 2020 at 02:30:08PM +0800, Lu Baolu wrote:

@@ -264,7 +266,8 @@ int iommu_probe_device(struct device *dev)
 	 */
 	iommu_alloc_default_domain(group, dev);

-	if (group->default_domain)
+	if (group->default_domain &&
+	    !iommu_is_attach_deferred(group->default_domain, dev))
 		ret = __iommu_attach_device(group->default_domain, dev);

This is the hotplug path, not used for boot-initialization. Have you seen failures from the missing check here?

I haven't seen any failure. Just wondered why deferred attaching was not checked here. It's fine to me if it's only for the hotplug path. Please ignore this change.

Best regards,
baolu
Re: [PATCH v2 3/4] iommu/iova: Flush CPU rcache for when a depot fills
On 2020-11-03 17:56, John Garry wrote:

To summarize, the issue is that as time goes by, the CPU rcache and depot rcache continue to grow. As such, IOVA RB tree access time also continues to grow.

Hi Robin,

I'm struggling to see how this is not simply indicative of a leak originating elsewhere.

It sounds like one, but I don't think it is.

For the number of magazines to continually grow, it means IOVAs *of a particular size* are being freed faster than they are being allocated, while the only place that ongoing allocations should be coming from is those same magazines!

But that is not the nature of how the IOVA caching works. The cache size is not defined by how many DMA mappings we may have at a given moment in time, or the maximum we had at some earlier point. It just grows until all the CPU and global depot rcaches fill.

Here's an artificial example of how the rcache can grow, which I hope helps illustrate:
- consider a process which wants many DMA mappings active at a given point in time
- if we tie it to cpu0, the cpu0 rcache will grow to 128 * 2
- then tie it to cpu1: the cpu1 rcache will grow to 128 * 2, so total CPU rcache = 2 * 128 * 2. The CPU rcache for cpu0 is not flushed - there is no maintenance for this.
- then tie it to cpu2: the cpu2 rcache will grow to 128 * 2, so total CPU rcache = 3 * 128 * 2
- then cpu3, cpu4, and so on.
- we can do this for all CPUs in the system, so the total CPU rcache grows from zero -> #CPUs * 128 * 2. Yet no DMA mapping leaks.

I get that. That's the initial warm-up phase I alluded to below. In an even simpler example, allocating on CPU A and freeing on CPU B will indeed move IOVAs from the tree into magazines without reuse, but only up to a point. Eventually, CPU B's cache fills up and pushes a magazine into the depot, and at *that* point things reach a steady state, since the next allocation on CPU A will then pull that magazine from the depot and proceed to allocate from there. If allocs and frees stay perfectly balanced, the working set is then 3 magazines. Yes, the depot can fill up if the number of IOVAs that CPU B frees at once before CPU A reallocates them is comparable to the total depot capacity, but it can't reasonably *stay* full unless CPU A stops allocating altogether.

Something similar can happen in normal use, where the scheduler relocates processes all over the CPUs in the system as time goes by, which causes the total rcache size to continue to grow. And in addition to this, the global depot continues to grow very slowly as well. But when it does fill (the global depot, that is), and we start to free magazines to make space - as is current policy - that's very slow and causes the performance drop.

Sure, but how does it then consistently *remain* in that state? And *why* does the depot slowly and steadily grow in the first place if alloc and free are ultimately balanced? I can get the depot swinging between full and empty if it's simply too small to bounce magazines between a large number of "CPU A"s and "CPU B"s, but again, that's surely going to show as repeated performance swings between bad at each end and good in the middle, not a steady degradation.

Now indeed that could happen over the short term if IOVAs are allocated and freed again in giant batches larger than the total global cache capacity, but that would show a cyclic behaviour - when activity starts, everything is first allocated straight from the tree, then when it ends the caches would get overwhelmed by the large burst of freeing and start having to release things back to the tree, but eventually that would stop once everything *is* freed, then when activity begins again the next round of allocating would inherently clear out all the caches before going anywhere near the tree.

But there is no clearing. A CPU will keep the IOVA cached indefinitely, even when there is no active DMA mapping present at all.

Sure, the percpu caches can buffer IOVAs for an indefinite amount of time depending on how many CPUs are active, but the depot usage is still absolutely representative of the total working set for whichever CPUs *are* active. In this whole discussion I'm basically just considering the percpu caches as pipeline stages for serialising IOVAs into and out of magazines. It's the motion of magazines that's the interesting part. If the depot keeps continually filling up, *some* CPUs are freeing enough IOVAs to push out full magazines, and those IOVAs have to come from somewhere, so *some* CPUs are allocating, and those CPUs can't allocate forever without taking magazines back out of the depot (that's the "clearing out" I meant). Something about a steady degradation that never shows any sign of recovery (even periodically) just doesn't seem to add up.

Anyway, by now I think it would be most interesting to get rid of this bottleneck completely rather than dance around bodging it, and see what happens if we simply let the depot
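As a minimal model of the magazine-pipeline behaviour described above, here is a self-contained C sketch - not kernel code; the single-allocator/single-freer split and treating a cache miss as an rbtree allocation are simplifying assumptions. One CPU only allocates and another only frees, and with balanced rates the depot occupancy stays bounded at a magazine or two rather than growing steadily:

#include <stdio.h>

#define MAG_SIZE 128	/* mirrors IOVA_MAG_SIZE in drivers/iommu/iova.c */

int main(void)
{
	int depot = 0;		/* full magazines parked in the depot      */
	int a_loaded = 0;	/* free IOVAs in CPU A's loaded magazine   */
	int b_loaded = 0;	/* freed IOVAs in CPU B's filling magazine */
	int depot_max = 0;

	for (long t = 0; t < 1000000; t++) {
		/* CPU A allocates one IOVA per step */
		if (a_loaded > 0)
			a_loaded--;		/* hit in the loaded magazine */
		else if (depot > 0)
			depot--, a_loaded = MAG_SIZE - 1; /* refill from depot */
		/* else: modelled as a slow-path rbtree allocation */

		/* CPU B frees one IOVA per step */
		if (++b_loaded == MAG_SIZE) {
			b_loaded = 0;
			depot++;		/* push a full magazine */
		}
		if (depot > depot_max)
			depot_max = depot;
	}
	printf("depot high-water mark: %d magazine(s)\n", depot_max);
	return 0;
}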
Re: [PATCH v3 0/6] Add virtio-iommu built-in topology
On 06 Oct 2020 17:23, Auger Eric wrote:
> Hello Al,
>
> On 10/2/20 8:23 PM, Al Stone wrote:
> > On 24 Sep 2020 11:54, Auger Eric wrote:
> >> Hi,
> >>
> >> Adding Al in the loop
> >>
> >> On 9/24/20 11:38 AM, Michael S. Tsirkin wrote:
> >>> On Thu, Sep 24, 2020 at 11:21:29AM +0200, Joerg Roedel wrote:
> On Thu, Sep 24, 2020 at 05:00:35AM -0400, Michael S. Tsirkin wrote:
> > OK so this looks good. Can you pls repost with the minor tweak
> > suggested and all acks included, and I will queue this?
>
> My NACK still stands, as long as a few questions are open:
>
> 1) The format used here will be the same as in the ACPI table? I
>    think the answer to this question must be Yes, so this leads
>    to the real question:
> >>>
> >>> I am not sure it's a must.
> >>> We can always tweak the parser if there are slight differences
> >>> between ACPI and virtio formats.
> >>>
> >>> But we do want the virtio format used here to be approved by the virtio
> >>> TC, so it won't change.
> >>>
> >>> Eric, Jean-Philippe, does one of you intend to create a github issue
> >>> and request a ballot for the TC? It's been posted end of August with no
> >>> changes ...
> >> Jean-Philippe, would you?
> >>>
> 2) Has the ACPI table format stabilized already? If and only if
>    the answer is Yes I will Ack these patches. We don't need to
>    wait until the ACPI table format is published in a
>    specification update, but at least some certainty that it
>    will not change in incompatible ways anymore is needed.
>
> >>
> >> Al, do you have any news about the VIOT definition submission to
> >> the UEFI ASWG?
> >>
> >> Thank you in advance
> >>
> >> Best Regards
> >>
> >> Eric
> >
> > A follow-up to my earlier post
> >
> > Hearing no objection, I've submitted the VIOT table description to
> > the ASWG for consideration under what they call the "code first"
> > process. The "first reading" -- a brief discussion on what the
> > table is and why we would like to add it -- was held yesterday.
> > No concerns have been raised as yet. Given the discussions that
> > have already occurred, I don't expect any, either. I have been
> > wrong at least once before, however.
> >
> > At this point, ASWG will revisit the request to add VIOT each
> > week. If there have been no comments in the prior week, and no
> > further discussion during the meeting, then a vote will be taken.
> > Otherwise, there will be discussion and we try again the next
> > week.
> >
> > The ASWG was also told that the likelihood of this definition of
> > the table changing is pretty low, and that it has been thought out
> > pretty well already. ASWG's consideration will therefore start
> > from the assumption that it would be best _not_ to make changes.
> >
> > So, I'll let you know what happens next week.
>
> Thank you very much for the updates and for your support backing the
> proposal in the best delays.
>
> Best Regards
>
> Eric

So, there are some questions about the VIOT definition and I just don't know enough to be able to answer them. One of the ASWG members is trying to understand the semantics behind the subtables. Is there a particular set of people, or mailing lists, that I can point to to get the questions answered? Ideally it would be one of the public lists where it has already been discussed, but an individual would be fine, too. No changes have been proposed, just some questions asked.

Thanks.

> >>>
> >>> Not that I know, but I don't see why it's a must.
> >>>
> So what progress has been made with the ACPI table specification, is it
> just a matter of time to get it approved or are there concerns?
>
> Regards,
>
> Joerg

--
ciao,
al
---
Al Stone
Software Engineer
Red Hat, Inc.
a...@redhat.com
---
Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
On Tue, Nov 03, 2020 at 01:48:51PM -0400, Jason Gunthorpe wrote:
> I think the same PCI driver with a small flag to support the PF or
> VF is not the same as two completely different drivers in different
> subsystems

There are counter-examples: ixgbe vs. ixgbevf. Note also that a single driver can support both an SVA device and an mdev device, sharing code for accessing parts of the device like queues and handling interrupts.

Regards,
Joerg
Re: [PATCH v5 0/7] arm64: Default to 32-bit wide ZONE_DMA
On Tue, Nov 03, 2020 at 06:00:33PM +0100, Nicolas Saenz Julienne wrote:
> On Fri, 2020-10-30 at 18:11 +0000, Catalin Marinas wrote:
> > On Thu, Oct 29, 2020 at 06:25:43PM +0100, Nicolas Saenz Julienne wrote:
> > > Ard Biesheuvel (1):
> > >   arm64: mm: Set ZONE_DMA size based on early IORT scan
> > >
> > > Nicolas Saenz Julienne (6):
> > >   arm64: mm: Move reserve_crashkernel() into mem_init()
> > >   arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
> > >   of/address: Introduce of_dma_get_max_cpu_address()
> > >   of: unittest: Add test for of_dma_get_max_cpu_address()
> > >   arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
> > >   mm: Remove examples from enum zone_type comment
> >
> > Thanks for putting this together. I had a minor comment but the patches
> > look fine to me. We still need an ack from Rob on the DT patch and I can
> > queue the series for 5.11.
>
> I'm preparing a v6 unifying both functions as you suggested.
>
> > Could you please also test the patch below on top of this series? It's
> > the removal of the implied DMA offset in the max_zone_phys()
> > calculation.
>
> Yes, happily. Comments below.
>
> > --8<-----------------------------------------------------------------
> > From 3ae252d888be4984a612236124f5b099e804c745 Mon Sep 17 00:00:00 2001
> > From: Catalin Marinas
> > Date: Fri, 30 Oct 2020 18:07:34 +0000
> > Subject: [PATCH] arm64: Ignore any DMA offsets in the max_zone_phys()
> >  calculation
> >
> > Currently, the kernel assumes that if RAM starts above 32-bit (or
> > zone_bits), there is still a ZONE_DMA/DMA32 at the bottom of the RAM and
> > such constrained devices have a hardwired DMA offset. In practice, we
> > haven't noticed any such hardware so let's assume that we can expand
> > ZONE_DMA32 to the available memory if no RAM below 4GB. Similarly,
> > ZONE_DMA is expanded to the 4GB limit if no RAM addressable by
> > zone_bits.
> >
> > Signed-off-by: Catalin Marinas
> > ---
> >  arch/arm64/mm/init.c | 17 +++++++++++------
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 095540667f0f..362160e16fb2 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -175,14 +175,21 @@ static void __init reserve_elfcorehdr(void)
> >  #endif /* CONFIG_CRASH_DUMP */
> >
> >  /*
> > - * Return the maximum physical address for a zone with a given address size
> > - * limit. It currently assumes that for memory starting above 4G, 32-bit
> > - * devices will use a DMA offset.
> > + * Return the maximum physical address for a zone accessible by the given bits
> > + * limit. If the DRAM starts above 32-bit, expand the zone to the maximum
> > + * available memory, otherwise cap it at 32-bit.
> >   */
> >  static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
> >  {
> > -	phys_addr_t offset = memblock_start_of_DRAM() & GENMASK_ULL(63, zone_bits);
> > -	return min(offset + (1ULL << zone_bits), memblock_end_of_DRAM());
> > +	phys_addr_t zone_mask = (1ULL << zone_bits) - 1;
>
> Maybe use DMA_BIT_MASK(), instead of the manual calculation?

Yes.

> > +	phys_addr_t phys_start = memblock_start_of_DRAM();
> > +
> > +	if (!(phys_start & U32_MAX))
>
> I'd suggest using 'bigger than' instead of masks. Just to cover ourselves
> against memory starting at odd locations. Also it'll behave properly when
> phys_start is zero (this breaks things on RPi4).

Good point.

> > +		zone_mask = PHYS_ADDR_MAX;
> > +	else if (!(phys_start & zone_mask))
> > +		zone_mask = U32_MAX;
> > +
> > +	return min(zone_mask + 1, memblock_end_of_DRAM());
>
> This + 1 isn't going to play well when zone_mask is PHYS_ADDR_MAX.

You are right on PHYS_ADDR_MAX overflowing, but I'd keep the +1 since memblock_end_of_DRAM() returns the first byte past the accessible range (so an exclusive end). I'll tweak this function a bit to avoid the overflow, or use the arm64-specific PHYS_MASK (that's never going to be the full 64 bits).

Thanks.

--
Catalin
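For reference, an untested sketch of how max_zone_phys() might look with those review comments folded in - DMA_BIT_MASK() instead of the open-coded shift, ordered comparisons instead of bitwise tests, and min() taken over inclusive limits so zone_mask + 1 can no longer overflow. This is a possible shape, not the final patch:

static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
{
	phys_addr_t zone_mask = DMA_BIT_MASK(zone_bits);
	phys_addr_t phys_start = memblock_start_of_DRAM();

	/* No RAM below 4GB: expand the zone over whatever memory there is */
	if (phys_start > U32_MAX)
		zone_mask = PHYS_ADDR_MAX;
	/* No RAM addressable by zone_bits: expand the zone up to 32-bit */
	else if (phys_start > zone_mask)
		zone_mask = U32_MAX;

	/* Compare inclusive limits, so PHYS_ADDR_MAX is never incremented */
	return min(zone_mask, memblock_end_of_DRAM() - 1) + 1;
}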
Re: can we switch bmips to the generic dma ranges code
I'll get on it.

Jim

On Tue, Nov 3, 2020 at 12:42 PM Florian Fainelli wrote:
>
> On 11/3/2020 2:15 AM, Christoph Hellwig wrote:
> > Hi Florian and others,
> >
> > now that the generic DMA ranges code landed, can we switch bmips over
> > to it instead of duplicating the logic?
>
> This should be fine, I cannot easily test the 338x chips, however as far
> as PCIe goes, I believe Jim may have some older patches he can dig up
> for testing.
> --
> Florian
Re: [PATCH v18 2/4] iommu/arm-smmu: Add a way for implementations to influence SCTLR
On Mon 02 Nov 12:18 CST 2020, Robin Murphy wrote:

> On 2020-11-02 17:14, Jordan Crouse wrote:
> > From: Rob Clark
> >
> > For the Adreno GPU's SMMU, we want SCTLR.HUPCF set to ensure that
> > pending translations are not terminated on iova fault. Otherwise
> > a terminated CP read could hang the GPU by returning invalid
> > command-stream data.
> >
> > Signed-off-by: Rob Clark
> > Reviewed-by: Bjorn Andersson
> > Signed-off-by: Jordan Crouse
> > ---
> >
> >  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 6 ++++++
> >  drivers/iommu/arm/arm-smmu/arm-smmu.c      | 3 +++
> >  drivers/iommu/arm/arm-smmu/arm-smmu.h      | 3 +++
> >  3 files changed, 12 insertions(+)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > index 1e942eed2dfc..0663d7d26908 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > @@ -129,6 +129,12 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
> >  	    (smmu_domain->cfg.fmt == ARM_SMMU_CTX_FMT_AARCH64))
> >  		pgtbl_cfg->quirks |= IO_PGTABLE_QUIRK_ARM_TTBR1;
> >
> > +	/*
> > +	 * On the GPU device we want to process subsequent transactions after a
> > +	 * fault to keep the GPU from hanging
> > +	 */
> > +	smmu_domain->cfg.sctlr_set |= ARM_SMMU_SCTLR_HUPCF;
> > +
> >  	/*
> >  	 * Initialize private interface with GPU:
> >  	 */
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > index dad7fa86fbd4..1f06ab219819 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > @@ -617,6 +617,9 @@ void arm_smmu_write_context_bank(struct arm_smmu_device *smmu, int idx)
> >  	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
> >  		reg |= ARM_SMMU_SCTLR_E;
> >
> > +	reg |= cfg->sctlr_set;
> > +	reg &= ~cfg->sctlr_clr;
>
> Since we now have a write_s2cr hook, I'm inclined to think that the
> consistency of a write_sctlr hook that could similarly apply its own
> arbitrary tweaks would make sense for this. Does anyone have any strong
> opinions?
>

I like it.

Regards,
Bjorn

> Robin.
>
> > +
> >  	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
> >  }
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > index 6c5ffeae..ddf2ca4c923d 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > @@ -144,6 +144,7 @@ enum arm_smmu_cbar_type {
> >  #define ARM_SMMU_CB_SCTLR		0x0
> >  #define ARM_SMMU_SCTLR_S1_ASIDPNE	BIT(12)
> >  #define ARM_SMMU_SCTLR_CFCFG		BIT(7)
> > +#define ARM_SMMU_SCTLR_HUPCF		BIT(8)
> >  #define ARM_SMMU_SCTLR_CFIE		BIT(6)
> >  #define ARM_SMMU_SCTLR_CFRE		BIT(5)
> >  #define ARM_SMMU_SCTLR_E		BIT(4)
> > @@ -341,6 +342,8 @@ struct arm_smmu_cfg {
> >  		u16			asid;
> >  		u16			vmid;
> >  	};
> > +	u32				sctlr_set;	/* extra bits to set in SCTLR */
> > +	u32				sctlr_clr;	/* bits to mask in SCTLR */
> >  	enum arm_smmu_cbar_type		cbar;
> >  	enum arm_smmu_context_fmt	fmt;
> >  };
Re: [PATCH v2 3/4] iommu/iova: Flush CPU rcache for when a depot fills
To summarize, the issue is that as time goes by, the CPU rcache and depot rcache continue to grow. As such, IOVA RB tree access time also continues to grow.

Hi Robin,

I'm struggling to see how this is not simply indicative of a leak originating elsewhere.

It sounds like one, but I don't think it is.

For the number of magazines to continually grow, it means IOVAs *of a particular size* are being freed faster than they are being allocated, while the only place that ongoing allocations should be coming from is those same magazines!

But that is not the nature of how the IOVA caching works. The cache size is not defined by how many DMA mappings we may have at a given moment in time, or the maximum we had at some earlier point. It just grows until all the CPU and global depot rcaches fill.

Here's an artificial example of how the rcache can grow, which I hope helps illustrate:
- consider a process which wants many DMA mappings active at a given point in time
- if we tie it to cpu0, the cpu0 rcache will grow to 128 * 2
- then tie it to cpu1: the cpu1 rcache will grow to 128 * 2, so total CPU rcache = 2 * 128 * 2. The CPU rcache for cpu0 is not flushed - there is no maintenance for this.
- then tie it to cpu2: the cpu2 rcache will grow to 128 * 2, so total CPU rcache = 3 * 128 * 2
- then cpu3, cpu4, and so on.
- we can do this for all CPUs in the system, so the total CPU rcache grows from zero -> #CPUs * 128 * 2. Yet no DMA mapping leaks.

Something similar can happen in normal use, where the scheduler relocates processes all over the CPUs in the system as time goes by, which causes the total rcache size to continue to grow. And in addition to this, the global depot continues to grow very slowly as well. But when it does fill (the global depot, that is), and we start to free magazines to make space - as is current policy - that's very slow and causes the performance drop.

Now indeed that could happen over the short term if IOVAs are allocated and freed again in giant batches larger than the total global cache capacity, but that would show a cyclic behaviour - when activity starts, everything is first allocated straight from the tree, then when it ends the caches would get overwhelmed by the large burst of freeing and start having to release things back to the tree, but eventually that would stop once everything *is* freed, then when activity begins again the next round of allocating would inherently clear out all the caches before going anywhere near the tree.

But there is no clearing. A CPU will keep the IOVA cached indefinitely, even when there is no active DMA mapping present at all.

To me the "steady decline" behaviour suggests that someone somewhere is making DMA unmap calls with a smaller size than they were mapped with (you tend to notice it quicker the other way round due to all the device errors and random memory corruption) - in many cases that would appear to work out fine from the driver's point of view, but would provoke exactly this behaviour in the IOVA allocator.

Thanks,
John
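For a sense of scale, here is a back-of-envelope bound on how many IOVAs the caches can pin per size class, using the constants from drivers/iommu/iova.c at the time (IOVA_MAG_SIZE = 128, a loaded plus a previous magazine per CPU, MAX_GLOBAL_MAGS = 32 depot magazines); the CPU count below is an assumed example:

#include <stdio.h>

#define IOVA_MAG_SIZE	128	/* entries per magazine */
#define MAGS_PER_CPU	2	/* "loaded" + "prev" per size class */
#define MAX_GLOBAL_MAGS	32	/* depot magazines per size class */

int main(void)
{
	int cpus = 128;	/* assumed example: a large arm64 server */
	long per_class = (long)cpus * MAGS_PER_CPU * IOVA_MAG_SIZE +
			 (long)MAX_GLOBAL_MAGS * IOVA_MAG_SIZE;

	/* 128 CPUs: 32768 + 4096 = 36864 cached IOVAs per size class */
	printf("up to %ld cached IOVAs per size class on a %d-CPU system\n",
	       per_class, cpus);
	return 0;
}

None of those IOVAs are leaked, yet all of them remain allocated in the RB tree until their magazines are eventually freed back.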
Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
On Tue, Nov 03, 2020 at 05:55:40PM +0100, j...@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 11:22:23AM -0400, Jason Gunthorpe wrote:
> > This whole thread was brought up by IDXD which has a SVA driver and
> > now wants to add a vfio-mdev driver too. SVA devices that want to be
> > plugged into VMs are going to be common - this architecture that a SVA
> > driver cannot cover the kvm case seems problematic.
>
> Isn't that the same pattern as having separate drivers for VFs and the
> parent device in SR-IOV?

I think the same PCI driver with a small flag to support the PF or VF is not the same as two completely different drivers in different subsystems.

Jason
Re: can we switch bmips to the generic dma ranges code
On 11/3/2020 2:15 AM, Christoph Hellwig wrote:
> Hi Florian and others,
>
> now that the generic DMA ranges code landed, can we switch bmips over
> to it instead of duplicating the logic?

This should be fine, I cannot easily test the 338x chips, however as far as PCIe goes, I believe Jim may have some older patches he can dig up for testing.
--
Florian
[PATCH v6 7/7] mm: Remove examples from enum zone_type comment
We can't really list every setup in common code. On top of that they are unlikely to stay true for long as things change in the arch trees independently of this comment.

Suggested-by: Christoph Hellwig
Signed-off-by: Nicolas Saenz Julienne
Reviewed-by: Christoph Hellwig
---
 include/linux/mmzone.h | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..9d0c454d23cd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -354,26 +354,6 @@ enum zone_type {
 	 * DMA mask is assumed when ZONE_DMA32 is defined. Some 64-bit
 	 * platforms may need both zones as they support peripherals with
 	 * different DMA addressing limitations.
-	 *
-	 * Some examples:
-	 *
-	 * - i386 and x86_64 have a fixed 16M ZONE_DMA and ZONE_DMA32 for the
-	 *   rest of the lower 4G.
-	 *
-	 * - arm only uses ZONE_DMA, the size, up to 4G, may vary depending on
-	 *   the specific device.
-	 *
-	 * - arm64 has a fixed 1G ZONE_DMA and ZONE_DMA32 for the rest of the
-	 *   lower 4G.
-	 *
-	 * - powerpc only uses ZONE_DMA, the size, up to 2G, may vary
-	 *   depending on the specific device.
-	 *
-	 * - s390 uses ZONE_DMA fixed to the lower 2G.
-	 *
-	 * - ia64 and riscv only use ZONE_DMA32.
-	 *
-	 * - parisc uses neither.
 	 */
 #ifdef CONFIG_ZONE_DMA
 	ZONE_DMA,
--
2.29.1
[PATCH v6 6/7] arm64: mm: Set ZONE_DMA size based on early IORT scan
From: Ard Biesheuvel

We recently introduced a 1 GB sized ZONE_DMA to cater for platforms incorporating masters that can address less than 32 bits of DMA, in particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has peripherals that can only address up to 1 GB (and its PCIe host bridge can only access the bottom 3 GB).

Instructing the DMA layer about these limitations is straightforward, even though we had to fix some issues regarding memory limits set in the IORT for named components, and regarding the handling of ACPI _DMA methods. However, the DMA layer also needs to be able to allocate memory that is guaranteed to meet those DMA constraints, for bounce buffering as well as allocating the backing for consistent mappings. This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately, it turns out that having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes problems with kdump, and potentially in other places where allocations cannot cross zone boundaries. Therefore, we should avoid having two separate DMA zones when possible.

So let's do an early scan of the IORT, and only create the ZONE_DMA if we encounter any devices that need it. This puts the burden on the firmware to describe such limitations in the IORT, which may be redundant (and less precise) if _DMA methods are also being provided. However, it should be noted that this situation is highly unusual for arm64 ACPI machines. Also, the DMA subsystem still gives precedence to the _DMA method if implemented, and so we will not lose the ability to perform streaming DMA outside the ZONE_DMA if the _DMA method permits it.

Cc: Jeremy Linton
Cc: Lorenzo Pieralisi
Cc: Nicolas Saenz Julienne
Cc: Rob Herring
Cc: Christoph Hellwig
Cc: Robin Murphy
Cc: Hanjun Guo
Cc: Sudeep Holla
Cc: Anshuman Khandual
Signed-off-by: Ard Biesheuvel
[nsaenz: unified implementation with DT's counterpart]
Signed-off-by: Nicolas Saenz Julienne
Tested-by: Jeremy Linton
Acked-by: Lorenzo Pieralisi
Acked-by: Hanjun Guo
---

Changes since v5:
- Unify with DT's counterpart, return phys_addr_t

Changes since v3:
- Use min_not_zero()
- Check revision
- Remove unnecessary #ifdef in zone_sizes_init()

 arch/arm64/mm/init.c      |  5 +++-
 drivers/acpi/arm64/iort.c | 55 +++++++++++++++++++++++++++++++++++++++
 include/linux/acpi_iort.h |  4 +++
 3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index a2ce8a9a71a6..ca5d4b10679d 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include <linux/acpi_iort.h>
 #include
 #include

@@ -186,11 +187,13 @@ static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
 static void __init zone_sizes_init(unsigned long min, unsigned long max)
 {
 	unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
+	unsigned int __maybe_unused acpi_zone_dma_bits;
 	unsigned int __maybe_unused dt_zone_dma_bits;

 #ifdef CONFIG_ZONE_DMA
+	acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address());
 	dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
-	zone_dma_bits = min(32U, dt_zone_dma_bits);
+	zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits);
 	arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
 	max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 9929ff50c0c0..1787406684aa 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1718,3 +1718,58 @@ void __init acpi_iort_init(void)

 	iort_init_platform_devices();
 }
+
+#ifdef CONFIG_ZONE_DMA
+/*
+ * Extract the highest CPU physical address accessible to all DMA masters in
+ * the system. PHYS_ADDR_MAX is returned when no constrained device is found.
+ */
+phys_addr_t __init acpi_iort_dma_get_max_cpu_address(void)
+{
+	phys_addr_t limit = PHYS_ADDR_MAX;
+	struct acpi_iort_node *node, *end;
+	struct acpi_table_iort *iort;
+	acpi_status status;
+	int i;
+
+	if (acpi_disabled)
+		return limit;
+
+	status = acpi_get_table(ACPI_SIG_IORT, 0,
+				(struct acpi_table_header **)&iort);
+	if (ACPI_FAILURE(status))
+		return limit;
+
+	node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
+	end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
+
+	for (i = 0; i < iort->node_count; i++) {
+		if (node >= end)
+			break;
+
+		switch (node->type) {
+			struct acpi_iort_named_component *ncomp;
+			struct acpi_iort_root_complex *rc;
+			phys_addr_t local_limit;
+
+		case ACPI_IORT_NODE_NAMED_COMPONENT:
+			ncomp = (struct acpi_iort_named_component *)node->node_data;
+			local_limit =
[PATCH v6 4/7] of: unittest: Add test for of_dma_get_max_cpu_address()
Introduce a test for of_dma_get_max_cpu_address(); it uses the same DT data as the rest of the dma-ranges unit tests.

Signed-off-by: Nicolas Saenz Julienne
Reviewed-by: Rob Herring
---

Changes since v5:
- Update address expected by test

Changes since v3:
- Remove HAS_DMA guards

 drivers/of/unittest.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 06cc988faf78..98cc0163301b 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -869,6 +869,23 @@ static void __init of_unittest_changeset(void)
 #endif
 }

+static void __init of_unittest_dma_get_max_cpu_address(void)
+{
+	struct device_node *np;
+	phys_addr_t cpu_addr;
+
+	np = of_find_node_by_path("/testcase-data/address-tests");
+	if (!np) {
+		pr_err("missing testcase data\n");
+		return;
+	}
+
+	cpu_addr = of_dma_get_max_cpu_address(np);
+	unittest(cpu_addr == 0x4fffffff,
+		 "of_dma_get_max_cpu_address: wrong CPU addr %pad (expecting %x)\n",
+		 &cpu_addr, 0x4fffffff);
+}
+
 static void __init of_unittest_dma_ranges_one(const char *path,
 		u64 expect_dma_addr, u64 expect_paddr)
 {
@@ -3266,6 +3283,7 @@ static int __init of_unittest(void)
 	of_unittest_changeset();
 	of_unittest_parse_interrupts();
 	of_unittest_parse_interrupts_extended();
+	of_unittest_dma_get_max_cpu_address();
 	of_unittest_parse_dma_ranges();
 	of_unittest_pci_dma_ranges();
 	of_unittest_match_node();
--
2.29.1
[PATCH v6 3/7] of/address: Introduce of_dma_get_max_cpu_address()
Introduce of_dma_get_max_cpu_address(), which provides the highest CPU physical address addressable by all DMA masters in the system. It's especially useful for setting memory zone sizes at early boot time.

Signed-off-by: Nicolas Saenz Julienne
Reviewed-by: Rob Herring
---

Changes since v4:
- Return max address, not address limit (one-off difference)

Changes since v3:
- use u64 with cpu_end

Changes since v2:
- Use PHYS_ADDR_MAX
- return phys_addr_t
- Rename function
- Correct subject
- Add support to start parsing from an arbitrary device node in order for the function to work with unit tests

 drivers/of/address.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/of.h   |  7 +++++++
 2 files changed, 49 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index eb9ab4f1e80b..09c0af7fd1c4 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -1024,6 +1024,48 @@ int of_dma_get_range(struct device_node *np, const struct bus_dma_region **map)
 }
 #endif /* CONFIG_HAS_DMA */

+/**
+ * of_dma_get_max_cpu_address - Gets highest CPU address suitable for DMA
+ * @np: The node to start searching from or NULL to start from the root
+ *
+ * Gets the highest CPU physical address that is addressable by all DMA masters
+ * in the sub-tree pointed by np, or the whole tree if NULL is passed. If no
+ * DMA constrained device is found, it returns PHYS_ADDR_MAX.
+ */
+phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
+{
+	phys_addr_t max_cpu_addr = PHYS_ADDR_MAX;
+	struct of_range_parser parser;
+	phys_addr_t subtree_max_addr;
+	struct device_node *child;
+	struct of_range range;
+	const __be32 *ranges;
+	u64 cpu_end = 0;
+	int len;
+
+	if (!np)
+		np = of_root;
+
+	ranges = of_get_property(np, "dma-ranges", &len);
+	if (ranges && len) {
+		of_dma_range_parser_init(&parser, np);
+		for_each_of_range(&parser, &range)
+			if (range.cpu_addr + range.size > cpu_end)
+				cpu_end = range.cpu_addr + range.size - 1;
+
+		if (max_cpu_addr > cpu_end)
+			max_cpu_addr = cpu_end;
+	}
+
+	for_each_available_child_of_node(np, child) {
+		subtree_max_addr = of_dma_get_max_cpu_address(child);
+		if (max_cpu_addr > subtree_max_addr)
+			max_cpu_addr = subtree_max_addr;
+	}
+
+	return max_cpu_addr;
+}
+
 /**
  * of_dma_is_coherent - Check if device is coherent
  * @np:	device node

diff --git a/include/linux/of.h b/include/linux/of.h
index 5d51891cbf1a..9ed5b8532c30 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -558,6 +558,8 @@ int of_map_id(struct device_node *np, u32 id,
 	       const char *map_name, const char *map_mask_name,
 	       struct device_node **target, u32 *id_out);

+phys_addr_t of_dma_get_max_cpu_address(struct device_node *np);
+
 #else /* CONFIG_OF */

 static inline void of_core_init(void)
@@ -995,6 +997,11 @@ static inline int of_map_id(struct device_node *np, u32 id,
 	return -EINVAL;
 }

+static inline phys_addr_t of_dma_get_max_cpu_address(struct device_node *np)
+{
+	return PHYS_ADDR_MAX;
+}
+
 #define of_match_ptr(_ptr)	NULL
 #define of_match_node(_matches, _node)	NULL
 #endif /* CONFIG_OF */
--
2.29.1
[PATCH v6 2/7] arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
zone_dma_bits's initialization happens earlier than it's actually needed, in arm64_memblock_init(). So move it into the more suitable zone_sizes_init().

Signed-off-by: Nicolas Saenz Julienne
Tested-by: Jeremy Linton
---
 arch/arm64/mm/init.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index fc4ab0d6d5d2..410721fc4fc0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -190,6 +190,8 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
 	unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};

 #ifdef CONFIG_ZONE_DMA
+	zone_dma_bits = ARM64_ZONE_DMA_BITS;
+	arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
 	max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
 #ifdef CONFIG_ZONE_DMA32
@@ -376,11 +378,6 @@ void __init arm64_memblock_init(void)

 	early_init_fdt_scan_reserved_mem();

-	if (IS_ENABLED(CONFIG_ZONE_DMA)) {
-		zone_dma_bits = ARM64_ZONE_DMA_BITS;
-		arm64_dma_phys_limit = max_zone_phys(ARM64_ZONE_DMA_BITS);
-	}
-
 	if (IS_ENABLED(CONFIG_ZONE_DMA32))
 		arm64_dma32_phys_limit = max_zone_phys(32);
 	else
--
2.29.1
[PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()
crashkernel might reserve memory located in ZONE_DMA. We plan to delay ZONE_DMA's initialization until after unflattening the devicetree and ACPI's boot table initialization, so move it later in the boot process, specifically into mem_init(); this is the last place crashkernel will be able to reserve the memory before the page allocator kicks in. There isn't any apparent reason for doing this earlier.

Signed-off-by: Nicolas Saenz Julienne
Tested-by: Jeremy Linton
---
 arch/arm64/mm/init.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 095540667f0f..fc4ab0d6d5d2 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -386,8 +386,6 @@ void __init arm64_memblock_init(void)
 	else
 		arm64_dma32_phys_limit = PHYS_MASK + 1;

-	reserve_crashkernel();
-
 	reserve_elfcorehdr();

 	high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
@@ -508,6 +506,8 @@ void __init mem_init(void)
 	else
 		swiotlb_force = SWIOTLB_NO_FORCE;

+	reserve_crashkernel();
+
 	set_max_mapnr(max_pfn - PHYS_PFN_OFFSET);

 #ifndef CONFIG_SPARSEMEM_VMEMMAP
--
2.29.1
[PATCH v6 5/7] arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
We recently introduced a 1 GB sized ZONE_DMA to cater for platforms incorporating masters that can address less than 32 bits of DMA, in particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has peripherals that can only address up to 1 GB (and its PCIe host bridge can only access the bottom 3 GB).

The DMA layer also needs to be able to allocate memory that is guaranteed to meet those DMA constraints, for bounce buffering as well as allocating the backing for consistent mappings. This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately, it turns out that having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes problems with kdump, and potentially in other places where allocations cannot cross zone boundaries. Therefore, we should avoid having two separate DMA zones when possible.

So, with the help of of_dma_get_max_cpu_address(), get the topmost physical address accessible to all DMA masters in the system and use that information to fine-tune ZONE_DMA's size. In the absence of addressing-limited masters ZONE_DMA will span the whole 32-bit address space; otherwise, in the case of the Raspberry Pi 4, it'll only span the 30-bit address space, and ZONE_DMA32 will cover the rest of the 32-bit address space.

Signed-off-by: Nicolas Saenz Julienne
---

Changes since v4:
- Use fls64 as we're now using the max address (as opposed to the limit)

Changes since v3:
- Simplify code for readability.

Changes since v2:
- Updated commit log by shamelessly copying Ard's ACPI commit log

 arch/arm64/mm/init.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 410721fc4fc0..a2ce8a9a71a6 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -42,8 +42,6 @@
 #include
 #include

-#define ARM64_ZONE_DMA_BITS	30
-
 /*
  * We need to be able to catch inadvertent references to memstart_addr
  * that occur (potentially in generic code) before arm64_memblock_init()
@@ -188,9 +186,11 @@ static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
 static void __init zone_sizes_init(unsigned long min, unsigned long max)
 {
 	unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
+	unsigned int __maybe_unused dt_zone_dma_bits;

 #ifdef CONFIG_ZONE_DMA
-	zone_dma_bits = ARM64_ZONE_DMA_BITS;
+	dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
+	zone_dma_bits = min(32U, dt_zone_dma_bits);
 	arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
 	max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
--
2.29.1
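As a worked example of the new calculation (the Raspberry Pi 4 value below is an assumed illustration of its dma-ranges; fls64() is re-implemented here so the sketch runs in userspace):

#include <stdio.h>
#include <stdint.h>

static unsigned int fls64(uint64_t x)	/* userspace stand-in */
{
	return x ? 64 - __builtin_clzll(x) : 0;
}

int main(void)
{
	uint64_t rpi4_max = 0x3fffffffULL;	/* assumed: master limited to 1 GB */
	uint64_t none_max = UINT64_MAX;		/* PHYS_ADDR_MAX: no constrained master */
	unsigned int rpi4_bits = fls64(rpi4_max);	/* 30 */
	unsigned int none_bits = fls64(none_max);	/* 64 */

	printf("RPi4: zone_dma_bits = min(32, %u) = %u -> 1 GB ZONE_DMA\n",
	       rpi4_bits, rpi4_bits < 32 ? rpi4_bits : 32);
	printf("unconstrained: zone_dma_bits = min(32, %u) = 32 -> 4 GB ZONE_DMA\n",
	       none_bits);
	return 0;
}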
[PATCH v6 0/7] arm64: Default to 32-bit wide ZONE_DMA
Using two distinct DMA zones turned out to be problematic. Here's an attempt to go back to a saner default.

I tested this on both a RPi4 and QEMU.

---
Changes since v5:
- Unify ACPI/DT functions

Changes since v4:
- Fix of_dma_get_max_cpu_address() so it returns the last addressable address, not the limit

Changes since v3:
- Drop patch adding define in dma-mapping
- Address small review changes
- Update Ard's patch
- Add new patch removing examples from mmzone.h

Changes since v2:
- Introduce Ard's patch
- Improve OF dma-ranges parsing function
- Add unit test for OF function
- Address small changes
- Move crashkernel reservation later in boot process

Changes since v1:
- Parse dma-ranges instead of using machine compatible string

Ard Biesheuvel (1):
  arm64: mm: Set ZONE_DMA size based on early IORT scan

Nicolas Saenz Julienne (6):
  arm64: mm: Move reserve_crashkernel() into mem_init()
  arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
  of/address: Introduce of_dma_get_max_cpu_address()
  of: unittest: Add test for of_dma_get_max_cpu_address()
  arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
  mm: Remove examples from enum zone_type comment

 arch/arm64/mm/init.c      | 18 ++++++++++------
 drivers/acpi/arm64/iort.c | 55 +++++++++++++++++++++++++++++++++++++++
 drivers/of/address.c      | 42 ++++++++++++++++++++++++++++++
 drivers/of/unittest.c     | 18 ++++++++++++++
 include/linux/acpi_iort.h |  4 +++
 include/linux/mmzone.h    | 20 ----------------
 include/linux/of.h        |  7 ++++++
 7 files changed, 135 insertions(+), 29 deletions(-)

--
2.29.1
Re: [PATCH v18 2/4] iommu/arm-smmu: Add a way for implementations to influence SCTLR
On Mon, Nov 02, 2020 at 06:18:45PM +0000, Robin Murphy wrote:
> On 2020-11-02 17:14, Jordan Crouse wrote:
> > From: Rob Clark
> >
> > For the Adreno GPU's SMMU, we want SCTLR.HUPCF set to ensure that
> > pending translations are not terminated on iova fault. Otherwise
> > a terminated CP read could hang the GPU by returning invalid
> > command-stream data.
> >
> > Signed-off-by: Rob Clark
> > Reviewed-by: Bjorn Andersson
> > Signed-off-by: Jordan Crouse
> > ---
> >
> >  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 6 ++++++
> >  drivers/iommu/arm/arm-smmu/arm-smmu.c      | 3 +++
> >  drivers/iommu/arm/arm-smmu/arm-smmu.h      | 3 +++
> >  3 files changed, 12 insertions(+)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > index 1e942eed2dfc..0663d7d26908 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > @@ -129,6 +129,12 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
> >  	    (smmu_domain->cfg.fmt == ARM_SMMU_CTX_FMT_AARCH64))
> >  		pgtbl_cfg->quirks |= IO_PGTABLE_QUIRK_ARM_TTBR1;
> >
> > +	/*
> > +	 * On the GPU device we want to process subsequent transactions after a
> > +	 * fault to keep the GPU from hanging
> > +	 */
> > +	smmu_domain->cfg.sctlr_set |= ARM_SMMU_SCTLR_HUPCF;
> > +
> >  	/*
> >  	 * Initialize private interface with GPU:
> >  	 */
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > index dad7fa86fbd4..1f06ab219819 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > @@ -617,6 +617,9 @@ void arm_smmu_write_context_bank(struct arm_smmu_device *smmu, int idx)
> >  	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
> >  		reg |= ARM_SMMU_SCTLR_E;
> >
> > +	reg |= cfg->sctlr_set;
> > +	reg &= ~cfg->sctlr_clr;
>
> Since we now have a write_s2cr hook, I'm inclined to think that the
> consistency of a write_sctlr hook that could similarly apply its own
> arbitrary tweaks would make sense for this. Does anyone have any strong
> opinions?

None from me. That would make an eventual stall-on-fault implementation easier too.

Jordan

> Robin.
>
> > +
> >  	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
> >  }
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > index 6c5ffeae..ddf2ca4c923d 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > @@ -144,6 +144,7 @@ enum arm_smmu_cbar_type {
> >  #define ARM_SMMU_CB_SCTLR		0x0
> >  #define ARM_SMMU_SCTLR_S1_ASIDPNE	BIT(12)
> >  #define ARM_SMMU_SCTLR_CFCFG		BIT(7)
> > +#define ARM_SMMU_SCTLR_HUPCF		BIT(8)
> >  #define ARM_SMMU_SCTLR_CFIE		BIT(6)
> >  #define ARM_SMMU_SCTLR_CFRE		BIT(5)
> >  #define ARM_SMMU_SCTLR_E		BIT(4)
> > @@ -341,6 +342,8 @@ struct arm_smmu_cfg {
> >  		u16			asid;
> >  		u16			vmid;
> >  	};
> > +	u32				sctlr_set;	/* extra bits to set in SCTLR */
> > +	u32				sctlr_clr;	/* bits to mask in SCTLR */
> >  	enum arm_smmu_cbar_type		cbar;
> >  	enum arm_smmu_context_fmt	fmt;
> >  };

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
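For illustration, a rough sketch of what such a hook could look like, modelled on the existing write_s2cr impl hook. The names and placement are assumptions, not the merged code:

	/* in struct arm_smmu_impl, mirroring the write_s2cr hook: */
	void (*write_sctlr)(struct arm_smmu_device *smmu, int idx, u32 reg);

	/* arm_smmu_write_context_bank() would then finish with: */
	if (smmu->impl && smmu->impl->write_sctlr)
		smmu->impl->write_sctlr(smmu, idx, reg);
	else
		arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);

static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
					 u32 reg)
{
	/*
	 * Keep processing transactions after a fault so a terminated CP
	 * read doesn't feed the GPU invalid command-stream data.
	 */
	reg |= ARM_SMMU_SCTLR_HUPCF;
	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
}

With a hook like this, the generic sctlr_set/sctlr_clr fields in arm_smmu_cfg would no longer be needed for this particular quirk.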
Re: [PATCH v5 0/7] arm64: Default to 32-bit wide ZONE_DMA
On Fri, 2020-10-30 at 18:11 +0000, Catalin Marinas wrote:
> On Thu, Oct 29, 2020 at 06:25:43PM +0100, Nicolas Saenz Julienne wrote:
> > Ard Biesheuvel (1):
> >   arm64: mm: Set ZONE_DMA size based on early IORT scan
> >
> > Nicolas Saenz Julienne (6):
> >   arm64: mm: Move reserve_crashkernel() into mem_init()
> >   arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
> >   of/address: Introduce of_dma_get_max_cpu_address()
> >   of: unittest: Add test for of_dma_get_max_cpu_address()
> >   arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
> >   mm: Remove examples from enum zone_type comment
>
> Thanks for putting this together. I had a minor comment but the patches
> look fine to me. We still need an ack from Rob on the DT patch and I can
> queue the series for 5.11.

I'm preparing a v6 unifying both functions as you suggested.

> Could you please also test the patch below on top of this series? It's
> the removal of the implied DMA offset in the max_zone_phys()
> calculation.

Yes, happily. Comments below.

> --8<-----------------------------------------------------------------
> From 3ae252d888be4984a612236124f5b099e804c745 Mon Sep 17 00:00:00 2001
> From: Catalin Marinas
> Date: Fri, 30 Oct 2020 18:07:34 +0000
> Subject: [PATCH] arm64: Ignore any DMA offsets in the max_zone_phys()
>  calculation
>
> Currently, the kernel assumes that if RAM starts above 32-bit (or
> zone_bits), there is still a ZONE_DMA/DMA32 at the bottom of the RAM and
> such constrained devices have a hardwired DMA offset. In practice, we
> haven't noticed any such hardware so let's assume that we can expand
> ZONE_DMA32 to the available memory if no RAM below 4GB. Similarly,
> ZONE_DMA is expanded to the 4GB limit if no RAM addressable by
> zone_bits.
>
> Signed-off-by: Catalin Marinas
> ---
>  arch/arm64/mm/init.c | 17 +++++++++++------
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 095540667f0f..362160e16fb2 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -175,14 +175,21 @@ static void __init reserve_elfcorehdr(void)
>  #endif /* CONFIG_CRASH_DUMP */
>
>  /*
> - * Return the maximum physical address for a zone with a given address size
> - * limit. It currently assumes that for memory starting above 4G, 32-bit
> - * devices will use a DMA offset.
> + * Return the maximum physical address for a zone accessible by the given bits
> + * limit. If the DRAM starts above 32-bit, expand the zone to the maximum
> + * available memory, otherwise cap it at 32-bit.
>   */
>  static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
>  {
> -	phys_addr_t offset = memblock_start_of_DRAM() & GENMASK_ULL(63, zone_bits);
> -	return min(offset + (1ULL << zone_bits), memblock_end_of_DRAM());
> +	phys_addr_t zone_mask = (1ULL << zone_bits) - 1;

Maybe use DMA_BIT_MASK(), instead of the manual calculation?

> +	phys_addr_t phys_start = memblock_start_of_DRAM();
> +
> +	if (!(phys_start & U32_MAX))

I'd suggest using 'bigger than' instead of masks. Just to cover ourselves against memory starting at odd locations. Also it'll behave properly when phys_start is zero (this breaks things on RPi4).

> +		zone_mask = PHYS_ADDR_MAX;
> +	else if (!(phys_start & zone_mask))
> +		zone_mask = U32_MAX;
> +
> +	return min(zone_mask + 1, memblock_end_of_DRAM());

This + 1 isn't going to play well when zone_mask is PHYS_ADDR_MAX.

Regards,
Nicolas
Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
On Tue, Nov 03, 2020 at 11:22:23AM -0400, Jason Gunthorpe wrote:
> This whole thread was brought up by IDXD which has a SVA driver and
> now wants to add a vfio-mdev driver too. SVA devices that want to be
> plugged into VMs are going to be common - this architecture that a SVA
> driver cannot cover the kvm case seems problematic.

Isn't that the same pattern as having separate drivers for VFs and the parent device in SR-IOV?

Regards,
Joerg
Re: [PATCH v2 3/4] iommu/iova: Flush CPU rcache for when a depot fills
On 2020-10-26 17:31, John Garry wrote:

Leizhen reported some time ago that IOVA performance may degrade over time [0], but unfortunately his solution to fix this problem was not given attention.

To summarize, the issue is that as time goes by, the CPU rcache and depot rcache continue to grow. As such, IOVA RB tree access time also continues to grow.

I'm struggling to see how this is not simply indicative of a leak originating elsewhere. For the number of magazines to continually grow, it means IOVAs *of a particular size* are being freed faster than they are being allocated, while the only place that ongoing allocations should be coming from is those same magazines!

Now indeed that could happen over the short term if IOVAs are allocated and freed again in giant batches larger than the total global cache capacity, but that would show a cyclic behaviour - when activity starts, everything is first allocated straight from the tree, then when it ends the caches would get overwhelmed by the large burst of freeing and start having to release things back to the tree, but eventually that would stop once everything *is* freed, then when activity begins again the next round of allocating would inherently clear out all the caches before going anywhere near the tree.

To me the "steady decline" behaviour suggests that someone somewhere is making DMA unmap calls with a smaller size than they were mapped with (you tend to notice it quicker the other way round due to all the device errors and random memory corruption) - in many cases that would appear to work out fine from the driver's point of view, but would provoke exactly this behaviour in the IOVA allocator.

Robin.

At a certain point, a depot may become full, and also some CPU rcaches may also be full when inserting another IOVA is attempted. For this scenario, currently the "loaded" CPU rcache is freed and a new one is created. This freeing means that many IOVAs in the RB tree need to be freed, which makes IO throughput performance fall off a cliff in some storage scenarios:

Jobs: 12 (f=12): [] [0.0% done] [6314MB/0KB/0KB /s] [1616K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [5669MB/0KB/0KB /s] [1451K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6673MB/0KB/0KB /s] [1708K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6761MB/0KB/0KB /s] [1731K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6685MB/0KB/0KB /s] [1711K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6178MB/0KB/0KB /s] [1582K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6731MB/0KB/0KB /s] [1723K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [2387MB/0KB/0KB /s] [611K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [2689MB/0KB/0KB /s] [688K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [2278MB/0KB/0KB /s] [583K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [1288MB/0KB/0KB /s] [330K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [1632MB/0KB/0KB /s] [418K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [1765MB/0KB/0KB /s] [452K/0/0 iops]

And it continues in this fashion, without recovering. Note that in this example it was required to wait 16 hours for this to occur. Also note that IO throughput gradually becomes more unstable leading up to this point.

As a solution to this issue, judge that the IOVA caches have grown too big when cached magazines need to be freed, and just flush all the CPUs' rcaches instead. The depot rcaches, however, are not flushed, as they can be used to immediately replenish active CPUs.

In future, some IOVA compaction could be implemented to solve the instability issue, which I figure could be quite complex to implement.

[0] https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leiz...@huawei.com/

Analyzed-by: Zhen Lei
Reported-by: Xiang Chen
Signed-off-by: John Garry
---
 drivers/iommu/iova.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 1f3f0f8b12e0..386005055aca 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -901,7 +901,6 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
 				 struct iova_rcache *rcache,
 				 unsigned long iova_pfn)
 {
-	struct iova_magazine *mag_to_free = NULL;
 	struct iova_cpu_rcache *cpu_rcache;
 	bool can_insert = false;
 	unsigned long flags;
@@ -923,13 +922,12 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
 		if (cpu_rcache->loaded)
Re: [PATCH v5 2/2] iommu/iova: Free global iova rcache on iova alloc failure
On 2020-11-03 14:31, John Garry wrote: On 03/11/2020 12:35, Robin Murphy wrote: On 2020-09-30 08:44, vji...@codeaurora.org wrote: From: Vijayanand Jitta

Whenever an IOVA alloc request fails, we free the IOVA ranges present in the percpu IOVA rcaches and then retry, but the global IOVA rcache is not freed; as a result we could still see IOVA alloc failures even after the retry, since the global rcache is holding on to the IOVAs, which can cause fragmentation. So, free the global IOVA rcache as well and then go for the retry.

If we do clear all the CPU rcaches, it would be nice to have something immediately available to replenish, i.e. use the global rcache, instead of flushing it, if that is not required...

If we've reached the point of clearing *any* caches, though, I think any hope of maintaining performance is already long gone. We've walked the rbtree for the entire address space and found that it's still too full to allocate from; we're teetering on the brink of hard failure and this is a last-ditch attempt to claw back as much as possible in the hope that it gives us a usable space. TBH I'm not entirely sure what allocation pattern was expected by the original code such that purging only some of the caches made sense, nor what kind of pattern leads to lots of smaller IOVAs being allocated, freed, and never reused to the point of blocking larger allocations, but either way the reasoning does at least seem to hold up in the abstract.

This looks reasonable to me - it's mildly annoying that we end up with so many similar-looking functions,

Well I did add a function to clear all CPU rcaches here, if you would like to check: https://lore.kernel.org/linux-iommu/1603733501-211004-2-git-send-email-john.ga...@huawei.com/

I was thinking more of the way free_iova_rcaches(), free_cpu_cached_iovas(), and free_global_cached_iovas() all look pretty much the same shape at a glance.

but the necessary differences are right down in the middle of the loops so nothing can reasonably be factored out :(

Reviewed-by: Robin Murphy
Signed-off-by: Vijayanand Jitta
---
 drivers/iommu/iova.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c3a1a8e..faf9b13 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -25,6 +25,7 @@ static void init_iova_rcaches(struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 static void fq_destroy_all_entries(struct iova_domain *iovad);
 static void fq_flush_timeout(struct timer_list *t);
+static void free_global_cached_iovas(struct iova_domain *iovad);

a thought: It would be great if the file could be rearranged at some point so that we don't require so many forward declarations.

 void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
@@ -442,6 +443,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
 		flush_rcache = false;
 		for_each_online_cpu(cpu)
 			free_cpu_cached_iovas(cpu, iovad);
+		free_global_cached_iovas(iovad);
 		goto retry;
 	}
@@ -1057,5 +1059,26 @@ void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
 	}
 }
+/*
+ * free all the IOVA ranges of global cache
+ */
+static void free_global_cached_iovas(struct iova_domain *iovad)
+{
+	struct iova_rcache *rcache;
+	unsigned long flags;
+	int i, j;
+
+	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+		rcache = &iovad->rcaches[i];
+		spin_lock_irqsave(&rcache->lock, flags);
+		for (j = 0; j < rcache->depot_size; ++j) {
+			iova_magazine_free_pfns(rcache->depot[j], iovad);
+			iova_magazine_free(rcache->depot[j]);
+			rcache->depot[j] = NULL;

I don't think that NULLify is strictly necessary

True, we don't explicitly clear depot entries in __iova_rcache_get() for normal operation, so there's not much point in doing so here.

Robin.

+		}
+		rcache->depot_size = 0;
+		spin_unlock_irqrestore(&rcache->lock, flags);
+	}
+}
 MODULE_AUTHOR("Anil S Keshavamurthy ");
 MODULE_LICENSE("GPL");
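For context, the retry path the second hunk lands in is alloc_iova_fast(); here is the whole function as of v5.10, with the patch's one added line marked:

	unsigned long
	alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
			unsigned long limit_pfn, bool flush_rcache)
	{
		unsigned long iova_pfn;
		struct iova *new_iova;

		/* fast path: satisfy the request from the caches */
		iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1);
		if (iova_pfn)
			return iova_pfn;

	retry:
		new_iova = alloc_iova(iovad, size, limit_pfn, true);
		if (!new_iova) {
			unsigned int cpu;

			if (!flush_rcache)
				return 0;

			/* Try replenishing IOVAs by flushing rcache. */
			flush_rcache = false;
			for_each_online_cpu(cpu)
				free_cpu_cached_iovas(cpu, iovad);
			free_global_cached_iovas(iovad);	/* added by this patch */
			goto retry;
		}

		return new_iova->pfn_lo;
	}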
Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
On Tue, Nov 03, 2020 at 03:35:32PM +0100, j...@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 10:06:42AM -0400, Jason Gunthorpe wrote:
> > The point is that other places beyond VFIO need this
>
> Which and why?
>
> > Sure, but sometimes it is necessary, and in those cases the answer
> > can't be "rewrite an SVA driver to use vfio"
>
> This is getting too abstract. Can you come up with an example where
> handling this in VFIO or an endpoint device kernel driver does not work?

This whole thread was brought up by IDXD, which has an SVA driver and now wants to add a vfio-mdev driver too. SVA devices that want to be plugged into VMs are going to be common - an architecture in which an SVA driver cannot cover the KVM case seems problematic.

Yes, everything can have an SVA driver and a vfio-mdev; it works just fine, but it is not very clean or simple.

Jason
Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
On Tue, Nov 03, 2020 at 10:06:42AM -0400, Jason Gunthorpe wrote:
> The point is that other places beyond VFIO need this

Which and why?

> Sure, but sometimes it is necessary, and in those cases the answer
> can't be "rewrite an SVA driver to use vfio"

This is getting too abstract. Can you come up with an example where handling this in VFIO or an endpoint device kernel driver does not work?

Regards, Joerg
Re: [PATCH v5 2/2] iommu/iova: Free global iova rcache on iova alloc failure
On 03/11/2020 12:35, Robin Murphy wrote: On 2020-09-30 08:44, vji...@codeaurora.org wrote: From: Vijayanand Jitta

Whenever an IOVA alloc request fails, we free the IOVA ranges present in the percpu IOVA rcaches and then retry, but the global IOVA rcache is not freed; as a result we could still see IOVA alloc failures even after the retry, since the global rcache is holding on to the IOVAs, which can cause fragmentation. So, free the global IOVA rcache as well and then go for the retry.

If we do clear all the CPU rcaches, it would be nice to have something immediately available to replenish, i.e. use the global rcache, instead of flushing it, if that is not required...

This looks reasonable to me - it's mildly annoying that we end up with so many similar-looking functions,

Well I did add a function to clear all CPU rcaches here, if you would like to check: https://lore.kernel.org/linux-iommu/1603733501-211004-2-git-send-email-john.ga...@huawei.com/

but the necessary differences are right down in the middle of the loops so nothing can reasonably be factored out :(

Reviewed-by: Robin Murphy
Signed-off-by: Vijayanand Jitta
---
 drivers/iommu/iova.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c3a1a8e..faf9b13 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -25,6 +25,7 @@ static void init_iova_rcaches(struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 static void fq_destroy_all_entries(struct iova_domain *iovad);
 static void fq_flush_timeout(struct timer_list *t);
+static void free_global_cached_iovas(struct iova_domain *iovad);

a thought: It would be great if the file could be rearranged at some point so that we don't require so many forward declarations.

 void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
@@ -442,6 +443,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
 		flush_rcache = false;
 		for_each_online_cpu(cpu)
 			free_cpu_cached_iovas(cpu, iovad);
+		free_global_cached_iovas(iovad);
 		goto retry;
 	}
@@ -1057,5 +1059,26 @@ void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
 	}
 }
+/*
+ * free all the IOVA ranges of global cache
+ */
+static void free_global_cached_iovas(struct iova_domain *iovad)
+{
+	struct iova_rcache *rcache;
+	unsigned long flags;
+	int i, j;
+
+	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+		rcache = &iovad->rcaches[i];
+		spin_lock_irqsave(&rcache->lock, flags);
+		for (j = 0; j < rcache->depot_size; ++j) {
+			iova_magazine_free_pfns(rcache->depot[j], iovad);
+			iova_magazine_free(rcache->depot[j]);
+			rcache->depot[j] = NULL;

I don't think that NULLify is strictly necessary

+		}
+		rcache->depot_size = 0;
+		spin_unlock_irqrestore(&rcache->lock, flags);
+	}
+}
 MODULE_AUTHOR("Anil S Keshavamurthy ");
 MODULE_LICENSE("GPL");
Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
On Tue, Nov 03, 2020 at 03:03:18PM +0100, j...@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 09:23:35AM -0400, Jason Gunthorpe wrote:
> > Userspace needs fine-grained control over the composition of the page
> > table behind the PASID; 1:1 with the mm_struct is only one use case.
>
> VFIO already offers an interface for that. It shouldn't be too
> complicated to expand that for PASID-bound page-tables.
>
> > Userspace needs to be able to handle IOMMU faults, apparently
>
> Could be implemented by a fault-fd handed out by VFIO.

The point is that other places beyond VFIO need this

> I really don't think that user-space should have to deal with details
> like PASIDs or other IOMMU internals, unless absolutely necessary. This
> is an OS we work on, and the idea behind an OS is to abstract the
> hardware away.

Sure, but sometimes it is necessary, and in those cases the answer can't be "rewrite an SVA driver to use vfio"

Jason
Re: [PATCH] iommu: fix a check in iommu_check_bind_data()
On Tue, Nov 03, 2020 at 01:16:23PM +0300, Dan Carpenter wrote:
> The "data->flags" variable is a u64 so if one of the high 32 bits is
> set the original code will allow it, but it should be rejected. The
> fix is to declare "mask" as a u64 instead of a u32.
>
> Fixes: d90573812eea ("iommu/uapi: Handle data and argsz filled by users")
> Signed-off-by: Dan Carpenter
> ---
>  drivers/iommu/iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Applied for v5.10, thanks.
Re: [PATCH v2 0/2] iommu: Fix a few issues related to PRQ
On Fri, Oct 30, 2020 at 10:37:22AM +0800, Yi Sun wrote:
> We found a few issues with PRQ. So, two patches are cooked to
> fix them. Please have a review. Thanks!
>
> Changes from v1:
>
> - Modify subject of patch 1 to make it more accurate.
> - Move get_domain_info() up to the sanity check part in patch 1.
> - Remove v1 patch 2 which is not suitable.
> - Add description for current patch 2.
> - Add 'Fixes:' tags for all patches.
>
> Liu Yi L (1):
>   iommu/vt-d: Fix sid not set issue in intel_svm_bind_gpasid()
>
> Liu, Yi L (1):
>   iommu/vt-d: Fix a bug for PDP check in prq_event_thread
>
>  drivers/iommu/intel/svm.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)

Applied for v5.10, thanks.
Re: [PATCH] iommu/amd: Enforce 4k mapping for certain IOMMU data structures
Hi Suravee,

On Wed, Oct 28, 2020 at 11:18:24PM +0000, Suravee Suthikulpanit wrote:
> AMD IOMMU requires 4k-aligned pages for the event log, the PPR log,
> and the completion wait write-back regions. However, when allocating
> the pages, they could be part of a larger (e.g. 2M) mapping. This
> causes a #PF, because the SNP RMP hardware enforces the check based
> on the page level for these data structures.
>
> So, fix by calling set_memory_4k() on the allocated pages.
>
> Fixes: commit c69d89aff393 ("iommu/amd: Use 4K page for completion wait
> write-back semaphore")
> Cc: Brijesh Singh
> Signed-off-by: Suravee Suthikulpanit
> ---
>  drivers/iommu/amd/init.c | 22 +-
>  1 file changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
> index 82e4af8f09bb..75dc30226a7c 100644
> --- a/drivers/iommu/amd/init.c
> +++ b/drivers/iommu/amd/init.c
> @@ -29,6 +29,7 @@
>  #include <...>
>  #include <...>
>  #include <...>
> +#include <asm/set_memory.h>
>
>  #include <...>
>
> @@ -672,11 +673,22 @@ static void __init free_command_buffer(struct amd_iommu *iommu)
>  	free_pages((unsigned long)iommu->cmd_buf, get_order(CMD_BUFFER_SIZE));
>  }
>
> +static void *__init iommu_alloc_4k_pages(gfp_t gfp, size_t size)
> +{
> +	void *buf;
> +	int order = get_order(size);
> +
> +	buf = (void *)__get_free_pages(gfp, order);
> +	if (!buf)
> +		return buf;
> +	return set_memory_4k((unsigned long)buf, (1 << order)) ? NULL : buf;
> +}
> +

Please do the 4k split only if SNP is actually enabled in the system.

Regards, Joerg
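One way to honour that request, sketched under the assumption that the driver can test an SNP feature bit; the FEATURE_SNP check and the extra iommu argument here are assumptions for illustration, not the eventual v2 patch:

	static void *__init iommu_alloc_4k_pages(struct amd_iommu *iommu,
						 gfp_t gfp, size_t size)
	{
		int order = get_order(size);
		void *buf = (void *)__get_free_pages(gfp, order);

		/* Only force 4k mappings when SNP enforcement is active */
		if (buf && iommu_feature(iommu, FEATURE_SNP) &&
		    set_memory_4k((unsigned long)buf, (1 << order))) {
			free_pages((unsigned long)buf, order);
			buf = NULL;
		}

		return buf;
	}

As a side note, freeing the pages on set_memory_4k() failure also avoids the leak in the version above, which returns NULL without releasing the allocation.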
Re: [PATCH 1/1] iommu/vt-d: Fix kernel NULL pointer dereference in find_domain()
On Wed, Oct 28, 2020 at 03:07:25PM +0800, Lu Baolu wrote:
> Fixes: e2726daea583d ("iommu/vt-d: debugfs: Add support to show page table internals")
> Cc: sta...@vger.kernel.org#v5.6+
> Reported-and-tested-by: Xu Pengfei
> Signed-off-by: Lu Baolu

Applied for v5.10, thanks.
Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
On Tue, Nov 03, 2020 at 02:18:52PM +0100, j...@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 08:56:43AM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 03, 2020 at 10:52:09AM +0100, j...@8bytes.org wrote:
> > > So having said this, what is the benefit of exposing those SVA internals
> > > to user-space?
> >
> > Only the device use of the PASID is device specific, the actual PASID
> > and everything on the IOMMU side is generic.
> >
> > There is enough API there it doesn't make sense to duplicate it into
> > every single SVA driver.
>
> What generic things have to be done by the drivers besides
> allocating/deallocating PASIDs and binding an address space to them?
>
> Is there anything which isn't better handled in a kernel-internal
> library which drivers just use?

Userspace needs fine-grained control over the composition of the page table behind the PASID; 1:1 with the mm_struct is only one use case.

Userspace needs to be able to handle IOMMU faults, apparently.

The Intel guys had a bunch of other stuff too; looking through the new API they are proposing for vfio gives some flavour of what they think is needed.

Jason
Re: [PATCH 1/1] iommu: Fix deferred domain attachment in iommu_probe_device()
Hi Baolu,

On Mon, Oct 26, 2020 at 02:30:08PM +0800, Lu Baolu wrote:
> @@ -264,7 +266,8 @@ int iommu_probe_device(struct device *dev)
>  	 */
>  	iommu_alloc_default_domain(group, dev);
>
> -	if (group->default_domain)
> +	if (group->default_domain &&
> +	    !iommu_is_attach_deferred(group->default_domain, dev))
>  		ret = __iommu_attach_device(group->default_domain, dev);

This is the hotplug path, not used for boot-initialization. Have you seen failures from the missing check here?

Regards, Joerg
Re: [PATCH] iommu/amd: Increase interrupt remapping table limit to 512 entries
On Thu, Oct 15, 2020 at 02:50:02AM +0000, Suravee Suthikulpanit wrote:
> Certain device drivers allocate IO queues on a per-cpu basis.
> On the AMD EPYC platform, which can support up to 256 CPU threads,
> this can exceed the current MAX_IRQ_PER_TABLE limit of 256,
> and result in the error message:
>
>     AMD-Vi: Failed to allocate IRTE
>
> This has been observed with certain NVMe devices.
>
> AMD IOMMU hardware can actually support up to 512 interrupt
> remapping table entries. Therefore, update the driver to
> match the hardware limit.
>
> Please note that this also increases the size of the interrupt remapping
> table to 8KB per device when using the 128-bit IRTE format.
>
> Signed-off-by: Suravee Suthikulpanit
> ---
>  drivers/iommu/amd/amd_iommu_types.h | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)

Applied for 5.10, thanks.
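The hunk itself isn't quoted above, but the change amounts to bumping a constant in amd_iommu_types.h; a sketch of the idea follows. Treat the exact identifiers and values as assumptions - the DTE's IRQ-table-length field encodes a power of two, so it has to grow in step with the entry count:

	/* drivers/iommu/amd/amd_iommu_types.h - sketch of the change */
	-#define DTE_IRQ_TABLE_LEN	(8ULL << 1)
	-#define MAX_IRQS_PER_TABLE	256
	+#define DTE_IRQ_TABLE_LEN	(9ULL << 1)	/* 2^9 = 512 entries */
	+#define MAX_IRQS_PER_TABLE	512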
Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
On Tue, Nov 03, 2020 at 08:56:43AM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 03, 2020 at 10:52:09AM +0100, j...@8bytes.org wrote:
> > So having said this, what is the benefit of exposing those SVA internals
> > to user-space?
>
> Only the device use of the PASID is device specific, the actual PASID
> and everything on the IOMMU side is generic.
>
> There is enough API there it doesn't make sense to duplicate it into
> every single SVA driver.

What generic things have to be done by the drivers besides allocating/deallocating PASIDs and binding an address space to them?

Is there anything which isn't better handled in a kernel-internal library which drivers just use?

Regards, Joerg
Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
On Tue, Nov 03, 2020 at 10:52:09AM +0100, j...@8bytes.org wrote:
> So having said this, what is the benefit of exposing those SVA internals
> to user-space?

Only the device use of the PASID is device-specific; the actual PASID and everything on the IOMMU side is generic.

There is enough API there that it doesn't make sense to duplicate it into every single SVA driver.

Jason
Re: [PATCH v5 2/2] iommu/iova: Free global iova rcache on iova alloc failure
On 2020-09-30 08:44, vji...@codeaurora.org wrote: From: Vijayanand Jitta

Whenever an IOVA alloc request fails, we free the IOVA ranges present in the percpu IOVA rcaches and then retry, but the global IOVA rcache is not freed; as a result we could still see IOVA alloc failures even after the retry, since the global rcache is holding on to the IOVAs, which can cause fragmentation. So, free the global IOVA rcache as well and then go for the retry.

This looks reasonable to me - it's mildly annoying that we end up with so many similar-looking functions, but the necessary differences are right down in the middle of the loops so nothing can reasonably be factored out :(

Reviewed-by: Robin Murphy

Signed-off-by: Vijayanand Jitta
---
 drivers/iommu/iova.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c3a1a8e..faf9b13 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -25,6 +25,7 @@ static void init_iova_rcaches(struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 static void fq_destroy_all_entries(struct iova_domain *iovad);
 static void fq_flush_timeout(struct timer_list *t);
+static void free_global_cached_iovas(struct iova_domain *iovad);

 void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
@@ -442,6 +443,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
 		flush_rcache = false;
 		for_each_online_cpu(cpu)
 			free_cpu_cached_iovas(cpu, iovad);
+		free_global_cached_iovas(iovad);
 		goto retry;
 	}
@@ -1057,5 +1059,26 @@ void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
 	}
 }
+/*
+ * free all the IOVA ranges of global cache
+ */
+static void free_global_cached_iovas(struct iova_domain *iovad)
+{
+	struct iova_rcache *rcache;
+	unsigned long flags;
+	int i, j;
+
+	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+		rcache = &iovad->rcaches[i];
+		spin_lock_irqsave(&rcache->lock, flags);
+		for (j = 0; j < rcache->depot_size; ++j) {
+			iova_magazine_free_pfns(rcache->depot[j], iovad);
+			iova_magazine_free(rcache->depot[j]);
+			rcache->depot[j] = NULL;
+		}
+		rcache->depot_size = 0;
+		spin_unlock_irqrestore(&rcache->lock, flags);
+	}
+}
 MODULE_AUTHOR("Anil S Keshavamurthy ");
 MODULE_LICENSE("GPL");
Re: [PATCH v4 4/7] iommu: Add quirk for Intel graphic devices in map_sg
On 2020-09-27 07:34, Lu Baolu wrote:

Combining the sg segments exposes a bug in the Intel i915 driver which causes visual artifacts and the screen to freeze. This is most likely because of how the i915 handles the returned list. It probably doesn't respect the returned value specifying the number of elements in the list and instead depends on the previous behaviour of the Intel iommu driver, which would return the same number of elements in the output list as in the input list.

Signed-off-by: Tom Murphy
Signed-off-by: Lu Baolu
---
 drivers/iommu/dma-iommu.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 3526db774611..e7e4d758f51a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -879,6 +879,33 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
 	unsigned int cur_len = 0, max_len = dma_get_max_seg_size(dev);
 	int i, count = 0;

+	/*
+	 * The Intel graphics driver assumes that the returned sg list is
+	 * not combined. This blocks the effort of converting the Intel
+	 * IOMMU driver to the dma-iommu APIs. Add this quirk to make the
+	 * device driver work; it should be removed once this is fixed in
+	 * the i915 driver.
+	 */
+	if (IS_ENABLED(CONFIG_DRM_I915) && dev_is_pci(dev) &&
+	    to_pci_dev(dev)->vendor == PCI_VENDOR_ID_INTEL &&
+	    (to_pci_dev(dev)->class >> 16) == PCI_BASE_CLASS_DISPLAY) {
+		for_each_sg(sg, s, nents, i) {
+			unsigned int s_iova_off = sg_dma_address(s);
+			unsigned int s_length = sg_dma_len(s);
+			unsigned int s_iova_len = s->length;
+
+			s->offset += s_iova_off;
+			s->length = s_length;
+			sg_dma_address(s) = dma_addr + s_iova_off;
+			sg_dma_len(s) = s_length;
+			dma_addr += s_iova_len;
+
+			pr_info_once("sg combining disabled due to i915 driver\n");
+		}
+
+		return nents;
+	}

BTW, a much less invasive workaround would be to simply override seg_mask to 0. That's enough to make sure that no segment looks eligible for merging.

Robin.

+
 	for_each_sg(sg, s, nents, i) {
 		/* Restore this segment's original unaligned fields first */
 		unsigned int s_iova_off = sg_dma_address(s);
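For reference, a sketch of Robin's "override seg_mask" idea (untested; dev_need_sg_quirk() is a hypothetical predicate). The point is that with seg_mask forced to 0, the existing merge test in __finalise_sg() can never be true, so every input segment stays a separate output segment:

	unsigned long seg_mask = dma_get_seg_boundary(dev);

	/* hypothetical: affected devices never get merged segments */
	if (dev_need_sg_quirk(dev))
		seg_mask = 0;
	...
	/* existing merge condition: (dma_addr & seg_mask) is always 0
	 * when seg_mask == 0, so this branch is never taken */
	if (cur_len && !s_iova_off && (dma_addr & seg_mask) &&
	    (max_len - cur_len >= s_length)) {
		/* append to the current output segment */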
Re: [PATCH v4 0/7] Convert the intel iommu driver to the dma-iommu api
Hi,

On Tue, Nov 03, 2020 at 11:58:26AM +0200, Joonas Lahtinen wrote:
> Would that work for you? We intend to send the feature pull requests
> to DRM for 5.11 in the upcoming weeks.

For the IOMMU side it is best to include the workaround for now. When the DRM fixes are merged into v5.11-rc1 together with this conversion, it can be reverted and will not be in 5.11-final.

Regards, Joerg
Re: Some questions about arm_lpae_install_table
On 2020-11-03 09:11, Will Deacon wrote: On Tue, Nov 03, 2020 at 11:00:27AM +0800, Kunkun Jiang wrote:

Recently, I have read and learned the code related to io-pgtable-arm.c. There are two questions on arm_lpae_install_table.

1. The first:

	static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
						     arm_lpae_iopte *ptep,
						     arm_lpae_iopte curr,
						     struct io_pgtable_cfg *cfg)
	{
		arm_lpae_iopte old, new;

		new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
		if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
			new |= ARM_LPAE_PTE_NSTABLE;

		/*
		 * Ensure the table itself is visible before its PTE can be.
		 * Whilst we could get away with cmpxchg64_release below, this
		 * doesn't have any ordering semantics when !CONFIG_SMP.
		 */
		dma_wmb();

		old = cmpxchg64_relaxed(ptep, curr, new);

		if (cfg->coherent_walk || (old & ARM_LPAE_PTE_SW_SYNC))
			return old;

		/* Even if it's not ours, there's no point waiting; just kick it */
		__arm_lpae_sync_pte(ptep, cfg);
		if (old == curr)
			WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);

		return old;
	}

If another thread changes the ptep between cmpxchg64_relaxed and WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC), the WRITE_ONCE will overwrite the change.

Can you please provide an example of a code path where this happens? The idea is that CPUs can race on the cmpxchg(), but there will only be one winner.

The only way a table entry can suddenly disappear is in a race that involves mapping or unmapping a whole block/table-sized region, while simultaneously mapping a page *within* that region. Yes, if someone uses the API in a nonsensical and completely invalid way that cannot have a predictable outcome, they get an unpredictable outcome. Whoop de do...

2. The second:

	for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz) {
		/* Unmap! */
		if (i == unmap_idx)
			continue;

		__arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
	}

	pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);

When altering a translation table descriptor, such as splitting a block into constituent granules, the Armv8-A and SMMUv3 architectures require a break-before-make procedure. But the function arm_lpae_split_blk_unmap changes a block descriptor to an equivalent span of page translations directly. Is it appropriate to do so?

Break-before-make doesn't really work for the SMMU because faults are generally fatal. Are you seeing problems in practice with this code?

TBH I do still wonder if we shouldn't just get rid of split_blk_unmap and all its complexity. Other drivers treat an unmap of a page from a block entry as simply unmapping the whole block, and that's the behaviour VFIO seems to expect. My only worry is that it's been around long enough that there might be some horrible out-of-tree code that *is* relying on it, despite the fact that it's impossible to implement in a way that's definitely 100% safe :/

Robin.
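For reference, the break-before-make sequence the architecture asks for looks roughly like this (a sketch with placeholder arguments; as noted above, io-pgtable-arm deliberately does not do this, because the intermediate invalid window would surface as fatal SMMU faults):

	/* 1. Break: invalidate the live block entry */
	WRITE_ONCE(*ptep, 0);

	/* 2. Invalidate any TLB entries covering the old walk */
	io_pgtable_tlb_flush_walk(iop, iova, size, granule);

	/* 3. Make: install the replacement table entry */
	WRITE_ONCE(*ptep, new_table_pte);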
[PATCH] iommu: fix a check in iommu_check_bind_data()
The "data->flags" variable is a u64 so if one of the high 32 bits is set the original code will allow it, but it should be rejected. The fix is to declare "mask" as a u64 instead of a u32. Fixes: d90573812eea ("iommu/uapi: Handle data and argsz filled by users") Signed-off-by: Dan Carpenter --- drivers/iommu/iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 8c470f451a32..b53446bb8c6b 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2071,7 +2071,7 @@ EXPORT_SYMBOL_GPL(iommu_uapi_cache_invalidate); static int iommu_check_bind_data(struct iommu_gpasid_bind_data *data) { - u32 mask; + u64 mask; int i; if (data->version != IOMMU_GPASID_BIND_VERSION_1) -- 2.28.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
can we switch bmips to the generic dma ranges code
Hi Florian and others, now that the generic DMA ranges code landed, can we switch bmips over to it instead of duplicating the logic?
Re: [PATCH v4 0/7] Convert the intel iommu driver to the dma-iommu api
Quoting Tvrtko Ursulin (2020-11-03 11:14:32):

> On 03/11/2020 02:53, Lu Baolu wrote:
> > On 11/2/20 7:52 PM, Tvrtko Ursulin wrote:
> > > On 02/11/2020 02:00, Lu Baolu wrote:
> > > > Hi Tvrtko, On 10/12/20 4:44 PM, Tvrtko Ursulin wrote:
> > > > > On 29/09/2020 01:11, Lu Baolu wrote:
> > > > > FYI we have merged the required i915 patches to our tree last week
> > > > > or so. I *think* this means they will go into 5.11. So the i915-
> > > > > specific workaround patch will not be needed in Intel IOMMU.
> > > >
> > > > Do you mind telling me what's the status of this fix patch? I tried this
> > > > series on v5.10-rc1 with the graphic quirk patch dropped. I am still
> > > > seeing dma faults from the graphics device.
> > >
> > > Hmm back then I thought i915 fixes for this would land in 5.11 so I
> > > will stick with that. :) (See my quoted text a paragraph above yours.)
> >
> > What size are those fixes? I am considering pushing this series for
> > v5.11. Is it possible to get some acks for those patches and let them
> > go to Linus through the iommu tree?
>
> For 5.10 you mean? They feel a bit too large for comfort to go via a
> non-i915/drm tree. These are the two patches required:
>
> https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-gt-next=8a473dbadccfc6206150de3db3223c40785da348
> https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-gt-next=934941ed5a3070a7833c688c9b1d71484fc01a68
>
> I'll copy Joonas as our maintainer - how does the idea of taking the
> above two patches through the iommu tree sound to you?

Hi Lu,

The patches have already been merged into our tree and are heading towards 5.11, so I don't think we should merge them elsewhere. The DRM subsystem had its 5.10 feature freeze at the time of 5.9-rc5, and only drm-intel-fixes pull requests are sent after that.

The patches seem to aim at eliminating the need for a previously used workaround. To me it seems more appropriate for the patches to follow the regular process as a new feature for 5.11, to make sure the changes get validated as part of linux-next.

Would that work for you? We intend to send the feature pull requests to DRM for 5.11 in the upcoming weeks.

Regards, Joonas
RE: amdgpu error whenever IOMMU is enabled
Hi Jörg,

I am seeing that amdgpu uses the amd_iommu_v2 kernel module. To me this is the last puzzle piece that was missing to explain the dependency of that bug in amdgpu on the IOMMU.

[inline screenshot omitted]

Best regards, Edgar

From: Merger, Edgar [AUTOSOL/MAS/AUGS] Sent: Friday, 30 October 2020 15:26 To: jroe...@suse.de Cc: iommu@lists.linux-foundation.org Subject: amdgpu error whenever IOMMU is enabled

Hello Jörg,

We have developed a board, "mCOM10L1900", that can be populated with an AMD R1305G Ryzen CPU but also with other CPUs from AMD's R1000 and V1000 series. Please see the attached datasheet. With one board we have a boot problem that is reproducible roughly every 50th boot. The system is accessible via ssh and works fine except for the graphics; the graphics is off and we don't see a screen. Please see the attached "dmesg.log". From [52.772273] onwards the kernel reports drm/amdgpu errors. It even tries to reset the GPU, but that fails too. I also tried to reset amdgpu with the command "sudo cat /sys/kernel/debug/dri/N/amdgpu_gpu_recover". That did not help either.

There is a similar error reported here: https://bugzilla.kernel.org/show_bug.cgi?id=204241. However, the applied patch should already have been in the Linux kernel we use, which is "5.4.0-47-generic". I also found that "amdgpu_info" shows different versions at "SMC firmware version". Please see the attachment.

It is still unclear what role the IOMMU plays here. Whenever we turn it off, the error does not show up anymore on the failing board.

Best regards, Edgar
use of dma_direct_set_offset in (allwinner) drivers
Hi all,

Linux 5.10-rc1 switched from having a single DMA offset in struct device to a set of DMA ranges, and introduced a new helper to set them, dma_direct_set_offset. This surfaced a bunch of drivers that violate our layering by setting the offset themselves, which meant we had to reluctantly export the symbol to set up the DMA range.

The drivers are:

drivers/gpu/drm/sun4i/sun4i_backend.c
    This just uses dma_direct_set_offset as a fallback. Is there any good reason to not just kill off the fallback?

drivers/media/platform/sunxi/sun4i-csi/sun4i_csi.c
    Same as above.

drivers/media/platform/sunxi/sun6i-csi/sun6i_csi.c
    This driver unconditionally sets the offset. Why can't we do this in the device tree?

drivers/staging/media/sunxi/cedrus/cedrus_hw.c
    Same as above.
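For reference, the helper in question is declared in include/linux/dma-map-ops.h (as of v5.10):

	int dma_direct_set_offset(struct device *dev, phys_addr_t cpu_start,
				  dma_addr_t dma_start, u64 size);

and the driver-side calls being objected to all have roughly this shape (values illustrative of the sunxi case, where the device sees DRAM at a different base address than the CPU):

	/* map CPU physical PHYS_OFFSET.. to device address 0.. for 4 GiB */
	ret = dma_direct_set_offset(dev, PHYS_OFFSET, 0, SZ_4G);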
Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
On Mon, Oct 12, 2020 at 08:38:54AM +0000, Tian, Kevin wrote:
> > From: Jason Wang
> > Jason suggests something like /dev/sva. There will be a lot of other
> > subsystems that could benefit from this (e.g. vDPA).

Honestly, I fail to see the benefit of offloading these IOMMU-specific setup tasks to user-space. The ways PASIDs and the device partitioning they allow are used are very device-specific. A GPU will be partitioned completely differently than a network card. So the device drivers should use the (v)SVA APIs to set up the partitioning in a way which makes sense for the device.

And VFIO is of course a user by itself, as it allows assigning device partitions to guests, or even allows assigning complete devices and letting the guests partition them themselves.

So having said this, what is the benefit of exposing those SVA internals to user-space?

Regards, Joerg
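For concreteness, the kernel-internal SVA interface being referred to already exists in include/linux/iommu.h (as of v5.10). A minimal sketch of a driver using it follows; the helper function itself is hypothetical:

	#include <linux/err.h>
	#include <linux/iommu.h>
	#include <linux/sched.h>

	/* Hypothetical driver helper: bind the current process's address
	 * space to the device, then program the returned PASID into the
	 * device's own context registers or descriptors. */
	static int example_bind_current_mm(struct device *dev,
					   struct iommu_sva **out)
	{
		struct iommu_sva *handle;

		handle = iommu_sva_bind_device(dev, current->mm, NULL);
		if (IS_ERR(handle))
			return PTR_ERR(handle);

		/* program iommu_sva_get_pasid(handle) into the device here */
		*out = handle;
		return 0;
	}

	/* teardown: iommu_sva_unbind_device(handle); */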
Re: [PATCH for-5.10] swiotlb: remove the tbl_dma_addr argument to swiotlb_tbl_map_single
ping?

On Fri, Oct 23, 2020 at 08:33:09AM +0200, Christoph Hellwig wrote:
> The tbl_dma_addr argument is used to check the DMA boundary for the
> allocations, and thus needs to be a dma_addr_t. swiotlb-xen instead
> passed a physical address, which could lead to incorrect results for
> strange offsets. Fix this by removing the parameter entirely and
> hard-coding the DMA address for io_tlb_start instead.
>
> Fixes: 91ffe4ad534a ("swiotlb-xen: introduce phys_to_dma/dma_to_phys translations")
> Signed-off-by: Christoph Hellwig
> Reviewed-by: Stefano Stabellini
> ---
>  drivers/iommu/intel/iommu.c |  5 ++---
>  drivers/xen/swiotlb-xen.c   |  3 +--
>  include/linux/swiotlb.h     | 10 +++---
>  kernel/dma/swiotlb.c        | 16 ++--
>  4 files changed, 12 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 8651f6d4dfa032..6b560e6f193096 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -3815,9 +3815,8 @@ bounce_map_single(struct device *dev, phys_addr_t paddr, size_t size,
>  	 * page aligned, we don't need to use a bounce page.
>  	 */
>  	if (!IS_ALIGNED(paddr | size, VTD_PAGE_SIZE)) {
> -		tlb_addr = swiotlb_tbl_map_single(dev,
> -				phys_to_dma_unencrypted(dev, io_tlb_start),
> -				paddr, size, aligned_size, dir, attrs);
> +		tlb_addr = swiotlb_tbl_map_single(dev, paddr, size,
> +				aligned_size, dir, attrs);
>  		if (tlb_addr == DMA_MAPPING_ERROR) {
>  			goto swiotlb_error;
>  		} else {
>
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 71ce1b7a23d1cc..2b385c1b4a99cb 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -395,8 +395,7 @@ static dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
>  	 */
>  	trace_swiotlb_bounced(dev, dev_addr, size, swiotlb_force);
>
> -	map = swiotlb_tbl_map_single(dev, virt_to_phys(xen_io_tlb_start),
> -			phys, size, size, dir, attrs);
> +	map = swiotlb_tbl_map_single(dev, phys, size, size, dir, attrs);
>  	if (map == (phys_addr_t)DMA_MAPPING_ERROR)
>  		return DMA_MAPPING_ERROR;
>
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 513913ff748626..3bb72266a75a1d 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -45,13 +45,9 @@ enum dma_sync_target {
>  	SYNC_FOR_DEVICE = 1,
>  };
>
> -extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
> -					  dma_addr_t tbl_dma_addr,
> -					  phys_addr_t phys,
> -					  size_t mapping_size,
> -					  size_t alloc_size,
> -					  enum dma_data_direction dir,
> -					  unsigned long attrs);
> +phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
> +		size_t mapping_size, size_t alloc_size,
> +		enum dma_data_direction dir, unsigned long attrs);
>
>  extern void swiotlb_tbl_unmap_single(struct device *hwdev,
>  				     phys_addr_t tlb_addr,
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index b4eea0abc3f002..92e2f54f24c01b 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -441,14 +441,11 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>  	}
>  }
>
> -phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
> -				   dma_addr_t tbl_dma_addr,
> -				   phys_addr_t orig_addr,
> -				   size_t mapping_size,
> -				   size_t alloc_size,
> -				   enum dma_data_direction dir,
> -				   unsigned long attrs)
> +phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
> +		size_t mapping_size, size_t alloc_size,
> +		enum dma_data_direction dir, unsigned long attrs)
>  {
> +	dma_addr_t tbl_dma_addr = phys_to_dma_unencrypted(hwdev, io_tlb_start);
>  	unsigned long flags;
>  	phys_addr_t tlb_addr;
>  	unsigned int nslots, stride, index, wrap;
> @@ -667,9 +664,8 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
>  	trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size, swiotlb_force);
>
> -	swiotlb_addr = swiotlb_tbl_map_single(dev,
> -			phys_to_dma_unencrypted(dev, io_tlb_start),
> -			paddr, size, size, dir, attrs);
> +	swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, size, dir,
> +			attrs);
>
Re: [PATCH v4 0/7] Convert the intel iommu driver to the dma-iommu api
On 03/11/2020 02:53, Lu Baolu wrote: On 11/2/20 7:52 PM, Tvrtko Ursulin wrote: On 02/11/2020 02:00, Lu Baolu wrote: Hi Tvrtko, On 10/12/20 4:44 PM, Tvrtko Ursulin wrote: On 29/09/2020 01:11, Lu Baolu wrote: Hi Tvrtko, On 9/28/20 5:44 PM, Tvrtko Ursulin wrote: On 27/09/2020 07:34, Lu Baolu wrote:

Hi,

The previous post of this series can be found here: https://lore.kernel.org/linux-iommu/20200912032200.11489-1-baolu...@linux.intel.com/ This version introduces a new patch [4/7] to fix an issue reported here: https://lore.kernel.org/linux-iommu/51a1baec-48d1-c0ac-181b-1fba92aa4...@linux.intel.com/ There aren't any other changes. Please help to test and review.

Best regards, baolu

Lu Baolu (3): iommu: Add quirk for Intel graphic devices in map_sg

Since I do have patches to fix i915 to handle this, do we want to co-ordinate the two and avoid having to add this quirk and then later remove it? Or do you want to go the staged approach?

I have no preference. It depends on which patch goes first. Let the maintainers help here.

FYI we have merged the required i915 patches to our tree last week or so. I *think* this means they will go into 5.11. So the i915-specific workaround patch will not be needed in Intel IOMMU.

Do you mind telling me what's the status of this fix patch? I tried this series on v5.10-rc1 with the graphic quirk patch dropped. I am still seeing dma faults from the graphics device.

Hmm back then I thought i915 fixes for this would land in 5.11 so I will stick with that. :) (See my quoted text a paragraph above yours.)

What size are those fixes? I am considering pushing this series for v5.11. Is it possible to get some acks for those patches and let them go to Linus through the iommu tree?

For 5.10 you mean? They feel a bit too large for comfort to go via a non-i915/drm tree. These are the two patches required:

https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-gt-next=8a473dbadccfc6206150de3db3223c40785da348
https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-gt-next=934941ed5a3070a7833c688c9b1d71484fc01a68

I'll copy Joonas as our maintainer - how does the idea of taking the above two patches through the iommu tree sound to you?

Regards, Tvrtko
Re: Some questions about arm_lpae_install_table
On Tue, Nov 03, 2020 at 11:00:27AM +0800, Kunkun Jiang wrote:
> Recently, I have read and learned the code related to io-pgtable-arm.c. There
> are two questions on arm_lpae_install_table.
>
> 1. The first:
>
> > static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
> > 					     arm_lpae_iopte *ptep,
> > 					     arm_lpae_iopte curr,
> > 					     struct io_pgtable_cfg *cfg)
> > {
> > 	arm_lpae_iopte old, new;
> >
> > 	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
> > 	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
> > 		new |= ARM_LPAE_PTE_NSTABLE;
> >
> > 	/*
> > 	 * Ensure the table itself is visible before its PTE can be.
> > 	 * Whilst we could get away with cmpxchg64_release below, this
> > 	 * doesn't have any ordering semantics when !CONFIG_SMP.
> > 	 */
> > 	dma_wmb();
> >
> > 	old = cmpxchg64_relaxed(ptep, curr, new);
> >
> > 	if (cfg->coherent_walk || (old & ARM_LPAE_PTE_SW_SYNC))
> > 		return old;
> >
> > 	/* Even if it's not ours, there's no point waiting; just kick it */
> > 	__arm_lpae_sync_pte(ptep, cfg);
> > 	if (old == curr)
> > 		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
> >
> > 	return old;
> > }
>
> If another thread changes the ptep between cmpxchg64_relaxed and
> WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC), the WRITE_ONCE
> will overwrite the change.

Can you please provide an example of a code path where this happens? The idea is that CPUs can race on the cmpxchg(), but there will only be one winner.

> 2. The second:
>
> > 	for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz) {
> > 		/* Unmap! */
> > 		if (i == unmap_idx)
> > 			continue;
> >
> > 		__arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
> > 	}
> >
> > 	pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);
>
> When altering a translation table descriptor, such as splitting a block into
> constituent granules, the Armv8-A and SMMUv3 architectures require
> a break-before-make procedure. But the function arm_lpae_split_blk_unmap
> changes a block descriptor to an equivalent span of page translations
> directly. Is it appropriate to do so?

Break-before-make doesn't really work for the SMMU because faults are generally fatal. Are you seeing problems in practice with this code?

Will