Re: Some questions about arm_lpae_install_table

2020-11-03 Thread Kunkun Jiang

Hi Will and Robin,

Sorry for the late reply.

On 2020/11/3 18:21, Robin Murphy wrote:

On 2020-11-03 09:11, Will Deacon wrote:

On Tue, Nov 03, 2020 at 11:00:27AM +0800, Kunkun Jiang wrote:

Recently, I have read and learned the code related to io-pgtable-arm.c. There
are two questions on arm_lpae_install_table.

1. The first question


static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
                                             arm_lpae_iopte *ptep,
                                             arm_lpae_iopte curr,
                                             struct io_pgtable_cfg *cfg)
{
        arm_lpae_iopte old, new;

        new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
        if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
                new |= ARM_LPAE_PTE_NSTABLE;

        /*
         * Ensure the table itself is visible before its PTE can be.
         * Whilst we could get away with cmpxchg64_release below, this
         * doesn't have any ordering semantics when !CONFIG_SMP.
         */
        dma_wmb();

        old = cmpxchg64_relaxed(ptep, curr, new);

        if (cfg->coherent_walk || (old & ARM_LPAE_PTE_SW_SYNC))
                return old;

        /* Even if it's not ours, there's no point waiting; just kick it */
        __arm_lpae_sync_pte(ptep, cfg);
        if (old == curr)
                WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);

        return old;
}


If another thread changes the ptep between cmpxchg64_relaxed and
WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC), the WRITE_ONCE will
overwrite that change.


Can you please provide an example of a code path where this happens? The
idea is that CPUs can race on the cmpxchg(), but there will only be one
winner.


The only way a table entry can suddenly disappear is in a race that 
involves mapping or unmapping a whole block/table-sized region, while 
simultaneously mapping a page *within* that region. Yes, if someone 
uses the API in a nonsensical and completely invalid way that cannot 
have a predictable outcome, they get an unpredictable outcome. Whoop 
de do...



Yes, as Robin mentioned, there will be an unpredictable outcome in extreme
cases. For now we have two APIs, map and unmap, to alter a block/page
descriptor. I worry that it may become increasingly difficult to add new
APIs in the future.


2. The second question


for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz) {
        /* Unmap! */
        if (i == unmap_idx)
                continue;

        __arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
}

pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);


When altering a translation table descriptor, including splitting a block
into constituent granules, the Armv8-A and SMMUv3 architectures require a
break-before-make procedure. But in the function arm_lpae_split_blk_unmap,
it changes a block descriptor to an equivalent span of page translations
directly. Is it appropriate to do so?


Break-before-make doesn't really work for the SMMU because faults are
generally fatal.

Are you seeing problems in practice with this code?


TBH I do still wonder if we shouldn't just get rid of split_blk_unmap 
and all its complexity. Other drivers treat an unmap of a page from a 
block entry as simply unmapping the whole block, and that's the 
behaviour VFIO seems to expect. My only worry is that it's been around 
long enough that there might be some horrible out-of-tree code that 
*is* relying on it, despite the fact that it's impossible to implement 
in a way that's definitely 100% safe :/


Robin.

I have not encountered a problem. I mean that the SMMU has supported the
BBML feature since SMMUv3.2. Maybe we can use this feature to update
translation tables without break-before-make.
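
For example, a capability check might look roughly like this (purely a
sketch: the field position is taken from the SMMUv3.2 spec's SMMU_IDR3.BBML
description, and the macro/function names are my own guesses, not existing
driver code):

#define ARM_SMMU_IDR3_BBML              GENMASK(12, 11)

static bool arm_smmu_supports_bbml2(struct arm_smmu_device *smmu)
{
        u32 idr3 = readl_relaxed(smmu->base + ARM_SMMU_IDR3);

        /*
         * BBM level 2 relaxes break-before-make for changes between block
         * and table entries describing the same output range, at the cost
         * of possibly having to handle TLB conflict events.
         */
        return FIELD_GET(ARM_SMMU_IDR3_BBML, idr3) == 2;
}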

Look forward to your reply.

Thanks,
Kunkun Jiang

Re: [PATCH 1/1] iommu: Fix deferred domain attachment in iommu_probe_device()

2020-11-03 Thread Lu Baolu

Hi Joerg,

On 11/3/20 9:23 PM, Joerg Roedel wrote:

Hi Baolu,

On Mon, Oct 26, 2020 at 02:30:08PM +0800, Lu Baolu wrote:

@@ -264,7 +266,8 @@ int iommu_probe_device(struct device *dev)
 	 */
 	iommu_alloc_default_domain(group, dev);
 
-	if (group->default_domain)
+	if (group->default_domain &&
+	    !iommu_is_attach_deferred(group->default_domain, dev))
 		ret = __iommu_attach_device(group->default_domain, dev);


This is the hotplug path, not used for boot-initialization. Have you
seen failures from the missing check here?


I haven't seen any failure. I just wondered why deferred attaching was not
checked here. It's fine to me if it's only for the hotplug path.

Please ignore this change.

Best regards,
baolu


Re: [PATCH v2 3/4] iommu/iova: Flush CPU rcache for when a depot fills

2020-11-03 Thread Robin Murphy

On 2020-11-03 17:56, John Garry wrote:
To summarize, the issue is that as time goes by, the CPU rcache and depot
rcache continue to grow. As such, IOVA RB tree access time also continues
to grow.




Hi Robin,


I'm struggling to see how this is not simply indicative of a leak
originating elsewhere. 


It sounds like one, but I don't think it is.


For the number of magazines to continually grow,
it means IOVAs *of a particular size* are being freed faster than they
are being allocated, while the only place that ongoing allocations
should be coming from is those same magazines!


But that is not the nature of how the IOVA caching works. The cache size
is not defined by how many DMA mappings we may have at a given moment in
time, or the maximum we had at an earlier point. It just grows to a limit
where all CPU and global depot rcaches fill.


Here's an artificial example of how the rcache can grow, which I hope can
help illustrate:
- consider a process which wants many DMA mappings active at a given
point in time
- if we tie it to cpu0, the cpu0 rcache will grow to 128 * 2
- then tie it to cpu1, the cpu1 rcache will grow to 128 * 2, so total CPU
rcache = 2 * 128 * 2. The CPU rcache for cpu0 is not flushed - there is no
maintenance for this.
- then tie it to cpu2, the cpu2 rcache will grow to 128 * 2, so total CPU
rcache = 3 * 128 * 2
- then cpu3, cpu4, and so on.
- We can do this for all CPUs in the system, so the total CPU rcache grows
from zero -> #CPUs * 128 * 2. Yet no DMA mapping leaks.


I get that. That's the initial warm-up phase I alluded to below. In an 
even simpler example, allocating on CPU A and freeing on CPU B will 
indeed move IOVAs from the tree into magazines without reuse, but only 
up to a point. Eventually, CPU B's cache fills up and pushes a magazine 
into the depot, and at *that* point things reach a steady state, since 
the next allocation on CPU A will then pull that magazine from the depot 
and proceed to allocate from there. If allocs and frees stay perfectly 
balanced, the working set is then 3 magazines. Yes, the depot can fill 
up if the number of IOVAs that CPU B frees at once before CPU A 
reallocates them is comparable to the total depot capacity, but it can't 
reasonably *stay* full unless CPU A stops allocating altogether.


Something similar can happen in normal use, where the scheduler 
relocates processes all over the CPUs in the system as time goes by, 
which causes the total rcache size to continue to grow. And in addition 
to this, the global depot continues to grow very slowly as well. But 
when it does fill (the global depot, that is), and we start to free 
magazines to make space – as is current policy - that's very slow and 
causes the performance drop.


Sure, but how does it then consistently *remain* in that state? And 
*why* does the depot slowly and steadily grow in the first place if 
alloc and free are ultimately balanced? I can get the depot swinging 
between full and empty if it's simply too small to bounce magazines 
between a large number of "CPU A"s and "CPU B"s, but again, that's 
surely going to show as repeated performance swings between bad at each 
end and good in the middle, not a steady degradation.



Now indeed that could happen over the short term if IOVAs are allocated
and freed again in giant batches larger than the total global cache
capacity, but that would show a cyclic behaviour - when activity starts,
everything is first allocated straight from the tree, then when it ends
the caches would get overwhelmed by the large burst of freeing and start
having to release things back to the tree, but eventually that would
stop once everything *is* freed, then when activity begins again the
next round of allocating would inherently clear out all the caches
before going anywhere near the tree. 


But there is no clearing. A CPU will keep the IOVA cached indefinitely, 
even when there is no active DMA mapping present at all.


Sure, the percpu caches can buffer IOVAs for an indefinite amount of 
time depending on how many CPUs are active, but the depot usage is still 
absolutely representative of the total working set for whichever CPUs 
*are* active. In this whole discussion I'm basically just considering 
the percpu caches as pipeline stages for serialising IOVAs into and out 
of magazines. It's the motion of magazines that's the interesting part.


If the depot keeps continually filling up, *some* CPUs are freeing 
enough IOVAs to push out full magazines, and those IOVAs have to come 
from somewhere, so *some* CPUs are allocating, and those CPUs can't 
allocate forever without taking magazines back out of the depot (that's 
the "clearing out" I meant). Something about a steady degradation that 
never shows any sign of recovery (even periodically) just doesn't seem 
to add up.


Anyway, by now I think it would be most interesting to get rid of this 
bottleneck completely rather than dance around bodging it, and see what 
happens if we simply let the depot 

Re: [PATCH v3 0/6] Add virtio-iommu built-in topology

2020-11-03 Thread Al Stone
On 06 Oct 2020 17:23, Auger Eric wrote:
> Hello Al,
> 
> On 10/2/20 8:23 PM, Al Stone wrote:
> > On 24 Sep 2020 11:54, Auger Eric wrote:
> >> Hi,
> >>
> >> Adding Al in the loop
> >>
> >> On 9/24/20 11:38 AM, Michael S. Tsirkin wrote:
> >>> On Thu, Sep 24, 2020 at 11:21:29AM +0200, Joerg Roedel wrote:
>  On Thu, Sep 24, 2020 at 05:00:35AM -0400, Michael S. Tsirkin wrote:
> > OK so this looks good. Can you pls repost with the minor tweak
> > suggested and all acks included, and I will queue this?
> 
>  My NACK still stands, as long as a few questions are open:
> 
>   1) The format used here will be the same as in the ACPI table? I
>  think the answer to this question must be Yes, so this leads
>  to the real question:
> >>>
> >>> I am not sure it's a must.
> >>> We can always tweak the parser if there are slight differences
> >>> between ACPI and virtio formats.
> >>>
> >>> But we do want the virtio format used here to be approved by the virtio
> >>> TC, so it won't change.
> >>>
> >>> Eric, Jean-Philippe, does one of you intend to create a github issue
> >>> and request a ballot for the TC? It's been posted end of August with no
> >>> changes ...
> >> Jean-Philippe, would you?
> >>>
>   2) Has the ACPI table format stabilized already? If and only if
>  the answer is Yes I will Ack these patches. We don't need to
>  wait until the ACPI table format is published in a
>  specification update, but at least some certainty that it
>  will not change in incompatible ways anymore is needed.
> 
> >>
> >> Al, do you have any news about the VIOT definition submission to
> >> the UEFI ASWG?
> >>
> >> Thank you in advance
> >>
> >> Best Regards
> >>
> >> Eric
> > 
> > A follow-up to my earlier post 
> > 
> > Hearing no objection, I've submitted the VIOT table description to
> > the ASWG for consideration under what they call the "code first"
> > process.  The "first reading" -- a brief discussion on what the
> > table is and why we would like to add it -- was held yesterday.
> > No concerns have been raised as yet.  Given the discussions that
> > have already occurred, I don't expect any, either.  I have been
> > wrong at least once before, however.
> > 
> > At this point, ASWG will revisit the request to add VIOT each
> > week.  If there have been no comments in the prior week, and no
> > further discussion during the meeting, then a vote will be taken.
> > Otherwise, there will be discussion and we try again the next
> > week.
> > 
> > The ASWG was also told that the likelihood of this definition of
> > the table changing is pretty low, and that it has been thought out
> > pretty well already.  ASWG's consideration will therefore start
> > from the assumption that it would be best _not_ to make changes.
> > 
> > So, I'll let you know what happens next week.
> 
> Thank you very much for the updates and for your support backing the
> proposal in the best delays.
> 
> Best Regards
> 
> Eric

So, there are some questions about the VIOT definition and I just
don't know enough to be able to answer them.  One of the ASWG members
is trying to understand the semantics behind the subtables.

Is there a particular set of people, or mailing lists, that I can
point to to get the questions answered?  Ideally it would be one
of the public lists where it has already been discussed, but an
individual would be fine, too.  No changes have been proposed, just
some questions asked.

Thanks.

> > 
> >>
> >>>
> >>> Not that I know, but I don't see why it's a must.
> >>>
>  So what progress has been made with the ACPI table specification, is it
>  just a matter of time to get it approved or are there concerns?
> 
>  Regards,
> 
>   Joerg
> >>>
> >>
> > 
> 

-- 
ciao,
al
---
Al Stone
Software Engineer
Red Hat, Inc.
a...@redhat.com
---



Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-11-03 Thread j...@8bytes.org
On Tue, Nov 03, 2020 at 01:48:51PM -0400, Jason Gunthorpe wrote:
> I think the same PCI driver with a small flag to support the PF or
> VF is not the same as two completely different drivers in different
> subsystems

There are counter-examples: ixgbe vs. ixgbevf.

Note also that a single driver can support both an SVA device and an
mdev device, sharing code for accessing parts of the device, like queues,
and for handling interrupts.

Regards,

Joerg


Re: [PATCH v5 0/7] arm64: Default to 32-bit wide ZONE_DMA

2020-11-03 Thread Catalin Marinas
On Tue, Nov 03, 2020 at 06:00:33PM +0100, Nicolas Saenz Julienne wrote:
> On Fri, 2020-10-30 at 18:11 +, Catalin Marinas wrote:
> > On Thu, Oct 29, 2020 at 06:25:43PM +0100, Nicolas Saenz Julienne wrote:
> > > Ard Biesheuvel (1):
> > >   arm64: mm: Set ZONE_DMA size based on early IORT scan
> > > 
> > > Nicolas Saenz Julienne (6):
> > >   arm64: mm: Move reserve_crashkernel() into mem_init()
> > >   arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
> > >   of/address: Introduce of_dma_get_max_cpu_address()
> > >   of: unittest: Add test for of_dma_get_max_cpu_address()
> > >   arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
> > >   mm: Remove examples from enum zone_type comment
> > 
> > Thanks for putting this together. I had a minor comment but the patches
> > look fine to me. We still need an ack from Rob on the DT patch and I can
> > queue the series for 5.11.
> 
> I'm preparing a v6 unifying both functions as you suggested.
> 
> > Could you please also test the patch below on top of this series? It's
> > the removal of the implied DMA offset in the max_zone_phys()
> > calculation.
> 
> Yes, happily. Comments below.
> 
> > --8<-
> > From 3ae252d888be4984a612236124f5b099e804c745 Mon Sep 17 00:00:00 2001
> > From: Catalin Marinas 
> > Date: Fri, 30 Oct 2020 18:07:34 +
> > Subject: [PATCH] arm64: Ignore any DMA offsets in the max_zone_phys()
> >  calculation
> > 
> > Currently, the kernel assumes that if RAM starts above 32-bit (or
> > zone_bits), there is still a ZONE_DMA/DMA32 at the bottom of the RAM and
> > such constrained devices have a hardwired DMA offset. In practice, we
> > haven't noticed any such hardware so let's assume that we can expand
> > ZONE_DMA32 to the available memory if no RAM below 4GB. Similarly,
> > ZONE_DMA is expanded to the 4GB limit if no RAM addressable by
> > zone_bits.
> > 
> > Signed-off-by: Catalin Marinas 
> > ---
> >  arch/arm64/mm/init.c | 17 -
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 095540667f0f..362160e16fb2 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -175,14 +175,21 @@ static void __init reserve_elfcorehdr(void)
> >  #endif /* CONFIG_CRASH_DUMP */
> >  
> >  /*
> > - * Return the maximum physical address for a zone with a given address size
> > - * limit. It currently assumes that for memory starting above 4G, 32-bit
> > - * devices will use a DMA offset.
> > + * Return the maximum physical address for a zone accessible by the given 
> > bits
> > + * limit. If the DRAM starts above 32-bit, expand the zone to the maximum
> > + * available memory, otherwise cap it at 32-bit.
> >   */
> >  static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
> >  {
> > -   phys_addr_t offset = memblock_start_of_DRAM() & GENMASK_ULL(63, 
> > zone_bits);
> > -   return min(offset + (1ULL << zone_bits), memblock_end_of_DRAM());
> > +   phys_addr_t zone_mask = (1ULL << zone_bits) - 1;
> 
> Maybe use DMA_BIT_MASK(), instead of the manual calculation?

Yes.

> 
> > +   phys_addr_t phys_start = memblock_start_of_DRAM();
> > +
> > +   if (!(phys_start & U32_MAX))
> 
> I'd suggest using 'bigger than' instead of masks. Just to cover ourselves
> against memory starting at odd locations. Also it'll behave properly when
> phys_start is zero (this breaks things on RPi4).

Good point.

> > +   zone_mask = PHYS_ADDR_MAX;
> > +   else if (!(phys_start & zone_mask))
> > +   zone_mask = U32_MAX;
> > +
> > +   return min(zone_mask + 1, memblock_end_of_DRAM());
> 
> This + 1 isn't going to play well when zone_mask is PHYS_ADDR_MAX.

You are right on PHYS_ADDR_MAX overflowing but I'd keep the +1 since
memblock_end_of_DRAM() returns the first byte past the accessible range
(so exclusive end).

I'll tweak this function a bit to avoid the overflow or use the
arm64-specific PHYS_MASK (that's never going to be the full 64 bits).
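
Something along these lines, perhaps (only a sketch of the direction,
folding in the DMA_BIT_MASK() and 'bigger than' suggestions and
sidestepping the wrap-around; not the final patch):

static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
{
	phys_addr_t zone_limit = DMA_BIT_MASK(zone_bits) + 1;
	phys_addr_t phys_start = memblock_start_of_DRAM();

	if (phys_start > U32_MAX)
		/* No RAM below 4GB: expand the zone to all of memory */
		zone_limit = PHYS_ADDR_MAX;
	else if (phys_start >= zone_limit)
		/* No RAM addressable by zone_bits: cap the zone at 4GB */
		zone_limit = U32_MAX + 1ULL;

	return min(zone_limit, memblock_end_of_DRAM());
}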

Thanks.

-- 
Catalin


Re: can we switch bmips to the generic dma ranges code

2020-11-03 Thread Jim Quinlan via iommu
I'll get on it.

Jim

On Tue, Nov 3, 2020 at 12:42 PM Florian Fainelli  wrote:
>
>
>
> On 11/3/2020 2:15 AM, Christoph Hellwig wrote:
> > Hi Florian and others,
> >
> > now that the generic DMA ranges code landed, can we switch bmips over
> > to it instead of duplicating the logic?
>
> This should be fine, I cannot easily test the 338x chips, however as far
> as PCIe goes, I believe Jim may have some older patches he can dig up
> for testing.
> --
> Florian



Re: [PATCH v18 2/4] iommu/arm-smmu: Add a way for implementations to influence SCTLR

2020-11-03 Thread Bjorn Andersson
On Mon 02 Nov 12:18 CST 2020, Robin Murphy wrote:

> On 2020-11-02 17:14, Jordan Crouse wrote:
> > From: Rob Clark 
> > 
> > For the Adreno GPU's SMMU, we want SCTLR.HUPCF set to ensure that
> > pending translations are not terminated on iova fault.  Otherwise
> > a terminated CP read could hang the GPU by returning invalid
> > command-stream data.
> > 
> > Signed-off-by: Rob Clark 
> > Reviewed-by: Bjorn Andersson 
> > Signed-off-by: Jordan Crouse 
> > ---
> > 
> >   drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 6 ++
> >   drivers/iommu/arm/arm-smmu/arm-smmu.c  | 3 +++
> >   drivers/iommu/arm/arm-smmu/arm-smmu.h  | 3 +++
> >   3 files changed, 12 insertions(+)
> > 
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c 
> > b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > index 1e942eed2dfc..0663d7d26908 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > @@ -129,6 +129,12 @@ static int qcom_adreno_smmu_init_context(struct 
> > arm_smmu_domain *smmu_domain,
> > (smmu_domain->cfg.fmt == ARM_SMMU_CTX_FMT_AARCH64))
> > pgtbl_cfg->quirks |= IO_PGTABLE_QUIRK_ARM_TTBR1;
> > +   /*
> > +* On the GPU device we want to process subsequent transactions after a
> > +* fault to keep the GPU from hanging
> > +*/
> > +   smmu_domain->cfg.sctlr_set |= ARM_SMMU_SCTLR_HUPCF;
> > +
> > /*
> >  * Initialize private interface with GPU:
> >  */
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
> > b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > index dad7fa86fbd4..1f06ab219819 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > @@ -617,6 +617,9 @@ void arm_smmu_write_context_bank(struct arm_smmu_device 
> > *smmu, int idx)
> > if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
> > reg |= ARM_SMMU_SCTLR_E;
> > +   reg |= cfg->sctlr_set;
> > +   reg &= ~cfg->sctlr_clr;
> 
> Since we now have a write_s2cr hook, I'm inclined to think that the
> consistency of a write_sctlr hook that could similarly apply its own
> arbitrary tweaks would make sense for this. Does anyone have any strong
> opinions?
> 

I like it.
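
Something like this, perhaps (just a sketch to show the shape, mirroring
the existing write_s2cr hook; the names are guesses rather than actual
code):

/* arm-smmu.h: new impl hook alongside write_s2cr */
	void (*write_sctlr)(struct arm_smmu_device *smmu, int idx, u32 reg);

/* arm_smmu_write_context_bank(): give the impl the final say */
	if (smmu->impl && smmu->impl->write_sctlr)
		smmu->impl->write_sctlr(smmu, idx, reg);
	else
		arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);

/* arm-smmu-qcom.c: keep the GPU processing after a fault */
static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu,
					 int idx, u32 reg)
{
	reg |= ARM_SMMU_SCTLR_HUPCF;
	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
}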

Regards,
Bjorn

> Robin.
> 
> > +
> > arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
> >   }
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h 
> > b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > index 6c5ffeae..ddf2ca4c923d 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > @@ -144,6 +144,7 @@ enum arm_smmu_cbar_type {
> >   #define ARM_SMMU_CB_SCTLR 0x0
> >   #define ARM_SMMU_SCTLR_S1_ASIDPNE BIT(12)
> >   #define ARM_SMMU_SCTLR_CFCFG  BIT(7)
> > +#define ARM_SMMU_SCTLR_HUPCF   BIT(8)
> >   #define ARM_SMMU_SCTLR_CFIE   BIT(6)
> >   #define ARM_SMMU_SCTLR_CFRE   BIT(5)
> >   #define ARM_SMMU_SCTLR_E  BIT(4)
> > @@ -341,6 +342,8 @@ struct arm_smmu_cfg {
> > u16 asid;
> > u16 vmid;
> > };
> > +   u32 sctlr_set;/* extra bits to set in 
> > SCTLR */
> > +   u32 sctlr_clr;/* bits to mask in SCTLR 
> > */
> > enum arm_smmu_cbar_type cbar;
> > enum arm_smmu_context_fmt   fmt;
> >   };
> > 


Re: [PATCH v2 3/4] iommu/iova: Flush CPU rcache for when a depot fills

2020-11-03 Thread John Garry

To summarize, the issue is that as time goes by, the CPU rcache and depot
rcache continue to grow. As such, IOVA RB tree access time also continues
to grow.




Hi Robin,


I'm struggling to see how this is not simply indicative of a leak
originating elsewhere. 


It sounds like one, but I don't think it is.


For the number of magazines to continually grow,
it means IOVAs *of a particular size* are being freed faster than they
are being allocated, while the only place that ongoing allocations
should be coming from is those same magazines!


But that is not the nature of how the IOVA caching works. The cache size
is not defined by how many DMA mappings we may have at a given moment in
time, or the maximum we had at an earlier point. It just grows to a limit
where all CPU and global depot rcaches fill.


Here's an artificial example of how the rcache can grow, which I hope can
help illustrate:
- consider a process which wants many DMA mappings active at a given
point in time
- if we tie it to cpu0, the cpu0 rcache will grow to 128 * 2
- then tie it to cpu1, the cpu1 rcache will grow to 128 * 2, so total CPU
rcache = 2 * 128 * 2. The CPU rcache for cpu0 is not flushed - there is no
maintenance for this.
- then tie it to cpu2, the cpu2 rcache will grow to 128 * 2, so total CPU
rcache = 3 * 128 * 2
- then cpu3, cpu4, and so on.
- We can do this for all CPUs in the system, so the total CPU rcache grows
from zero -> #CPUs * 128 * 2. Yet no DMA mapping leaks.
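
As a rough upper bound (a sketch only, using the constants from
drivers/iommu/iova.c - IOVA_MAG_SIZE, MAX_GLOBAL_MAGS and
IOVA_RANGE_CACHE_MAX_SIZE - just to show how much can legitimately sit in
the caches with no leak at all):

/*
 * Worst case for the caches with zero leaked mappings: every CPU ends up
 * holding a full "loaded" and "prev" magazine for each cached size, and
 * each size also has a full global depot.
 */
static unsigned long iova_rcache_worst_case(unsigned int nr_cpus)
{
	unsigned long per_size = nr_cpus * 2 * IOVA_MAG_SIZE +	/* CPU rcaches */
				 MAX_GLOBAL_MAGS * IOVA_MAG_SIZE;	/* depot */

	return per_size * IOVA_RANGE_CACHE_MAX_SIZE;
}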


Something similar can happen in normal use, where the scheduler 
relocates processes all over the CPUs in the system as time goes by, 
which causes the total rcache size to continue to grow. And in addition 
to this, the global depot continues to grow very slowly as well. But 
when it does fill (the global depot, that is), and we start to free 
magazines to make space – as is current policy - that's very slow and 
causes the performance drop.




Now indeed that could happen over the short term if IOVAs are allocated
and freed again in giant batches larger than the total global cache
capacity, but that would show a cyclic behaviour - when activity starts,
everything is first allocated straight from the tree, then when it ends
the caches would get overwhelmed by the large burst of freeing and start
having to release things back to the tree, but eventually that would
stop once everything *is* freed, then when activity begins again the
next round of allocating would inherently clear out all the caches
before going anywhere near the tree. 


But there is no clearing. A CPU will keep the IOVA cached indefinitely, 
even when there is no active DMA mapping present at all.



To me the "steady decline"
behaviour suggests that someone somewhere is making DMA unmap calls with
a smaller size than they were mapped with (you tend to notice it quicker
the other way round due to all the device errors and random memory
corruption) - in many cases that would appear to work out fine from the
driver's point of view, but would provoke exactly this behaviour in the
IOVA allocator.



Thanks,
John

Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-11-03 Thread Jason Gunthorpe
On Tue, Nov 03, 2020 at 05:55:40PM +0100, j...@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 11:22:23AM -0400, Jason Gunthorpe wrote:
> > This whole thread was brought up by IDXD which has a SVA driver and
> > now wants to add a vfio-mdev driver too. SVA devices that want to be
> > plugged into VMs are going to be common - this architecture that a SVA
> > driver cannot cover the kvm case seems problematic.
> 
> Isn't that the same pattern as having separate drivers for VFs and the
> parent device in SR-IOV?

I think the same PCI driver with a small flag to support the PF or
VF is not the same as two completely different drivers in different
subsystems

Jason


Re: can we switch bmips to the generic dma ranges code

2020-11-03 Thread Florian Fainelli



On 11/3/2020 2:15 AM, Christoph Hellwig wrote:
> Hi Florian and others,
> 
> now that the generic DMA ranges code landed, can we switch bmips over
> to it instead of duplicating the logic?

This should be fine, I cannot easily test the 338x chips, however as far
as PCIe goes, I believe Jim may have some older patches he can dig up
for testing.
-- 
Florian


[PATCH v6 7/7] mm: Remove examples from enum zone_type comment

2020-11-03 Thread Nicolas Saenz Julienne
We can't really list every setup in common code. On top of that they are
unlikely to stay true for long as things change in the arch trees
independently of this comment.

Suggested-by: Christoph Hellwig 
Signed-off-by: Nicolas Saenz Julienne 
Reviewed-by: Christoph Hellwig 
---
 include/linux/mmzone.h | 20 
 1 file changed, 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..9d0c454d23cd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -354,26 +354,6 @@ enum zone_type {
 * DMA mask is assumed when ZONE_DMA32 is defined. Some 64-bit
 * platforms may need both zones as they support peripherals with
 * different DMA addressing limitations.
-*
-* Some examples:
-*
-*  - i386 and x86_64 have a fixed 16M ZONE_DMA and ZONE_DMA32 for the
-*rest of the lower 4G.
-*
-*  - arm only uses ZONE_DMA, the size, up to 4G, may vary depending on
-*the specific device.
-*
-*  - arm64 has a fixed 1G ZONE_DMA and ZONE_DMA32 for the rest of the
-*lower 4G.
-*
-*  - powerpc only uses ZONE_DMA, the size, up to 2G, may vary
-*depending on the specific device.
-*
-*  - s390 uses ZONE_DMA fixed to the lower 2G.
-*
-*  - ia64 and riscv only use ZONE_DMA32.
-*
-*  - parisc uses neither.
 */
 #ifdef CONFIG_ZONE_DMA
ZONE_DMA,
-- 
2.29.1



[PATCH v6 6/7] arm64: mm: Set ZONE_DMA size based on early IORT scan

2020-11-03 Thread Nicolas Saenz Julienne
From: Ard Biesheuvel 

We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
incorporating masters that can address less than 32 bits of DMA, in
particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
peripherals that can only address up to 1 GB (and its PCIe host
bridge can only access the bottom 3 GB).

Instructing the DMA layer about these limitations is straight-forward,
even though we had to fix some issues regarding memory limits set in
the IORT for named components, and regarding the handling of ACPI _DMA
methods. However, the DMA layer also needs to be able to allocate
memory that is guaranteed to meet those DMA constraints, for bounce
buffering as well as allocating the backing for consistent mappings.

This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
it turns out that having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
problems with kdump, and potentially in other places where allocations
cannot cross zone boundaries. Therefore, we should avoid having two
separate DMA zones when possible.

So let's do an early scan of the IORT, and only create the ZONE_DMA
if we encounter any devices that need it. This puts the burden on
the firmware to describe such limitations in the IORT, which may be
redundant (and less precise) if _DMA methods are also being provided.
However, it should be noted that this situation is highly unusual for
arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
the _DMA method if implemented, and so we will not lose the ability to
perform streaming DMA outside the ZONE_DMA if the _DMA method permits
it.

Cc: Jeremy Linton 
Cc: Lorenzo Pieralisi 
Cc: Nicolas Saenz Julienne 
Cc: Rob Herring 
Cc: Christoph Hellwig 
Cc: Robin Murphy 
Cc: Hanjun Guo 
Cc: Sudeep Holla 
Cc: Anshuman Khandual 
Signed-off-by: Ard Biesheuvel 
[nsaenz: unified implementation with DT's counterpart]
Signed-off-by: Nicolas Saenz Julienne 
Tested-by: Jeremy Linton 
Acked-by: Lorenzo Pieralisi 
Acked-by: Hanjun Guo 

---

Changes since v5:
 - Unify with DT's counterpart, return phys_addr_t

Changes since v3:
 - Use min_not_zero()
 - Check revision
 - Remove unnecessary #ifdef in zone_sizes_init()

 arch/arm64/mm/init.c  |  5 +++-
 drivers/acpi/arm64/iort.c | 55 +++
 include/linux/acpi_iort.h |  4 +++
 3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index a2ce8a9a71a6..ca5d4b10679d 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -186,11 +187,13 @@ static phys_addr_t __init max_zone_phys(unsigned int 
zone_bits)
 static void __init zone_sizes_init(unsigned long min, unsigned long max)
 {
unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
+   unsigned int __maybe_unused acpi_zone_dma_bits;
unsigned int __maybe_unused dt_zone_dma_bits;
 
 #ifdef CONFIG_ZONE_DMA
+   acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address());
dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
-   zone_dma_bits = min(32U, dt_zone_dma_bits);
+   zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits);
arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 9929ff50c0c0..1787406684aa 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1718,3 +1718,58 @@ void __init acpi_iort_init(void)
 
iort_init_platform_devices();
 }
+
+#ifdef CONFIG_ZONE_DMA
+/*
+ * Extract the highest CPU physical address accessible to all DMA masters in
+ * the system. PHYS_ADDR_MAX is returned when no constrained device is found.
+ */
+phys_addr_t __init acpi_iort_dma_get_max_cpu_address(void)
+{
+   phys_addr_t limit = PHYS_ADDR_MAX;
+   struct acpi_iort_node *node, *end;
+   struct acpi_table_iort *iort;
+   acpi_status status;
+   int i;
+
+   if (acpi_disabled)
+   return limit;
+
+   status = acpi_get_table(ACPI_SIG_IORT, 0,
+			(struct acpi_table_header **)&iort);
+   if (ACPI_FAILURE(status))
+   return limit;
+
+   node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
+   end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
+
+   for (i = 0; i < iort->node_count; i++) {
+   if (node >= end)
+   break;
+
+   switch (node->type) {
+   struct acpi_iort_named_component *ncomp;
+   struct acpi_iort_root_complex *rc;
+   phys_addr_t local_limit;
+
+   case ACPI_IORT_NODE_NAMED_COMPONENT:
+   ncomp = (struct acpi_iort_named_component 
*)node->node_data;
+   local_limit = 

[PATCH v6 4/7] of: unittest: Add test for of_dma_get_max_cpu_address()

2020-11-03 Thread Nicolas Saenz Julienne
Introduce a test for of_dma_get_max_cpu_address(); it uses the same DT
data as the rest of the dma-ranges unit tests.

Signed-off-by: Nicolas Saenz Julienne 
Reviewed-by: Rob Herring 

---
Changes since v5:
- Update address expected by test

Changes since v3:
 - Remove HAS_DMA guards

 drivers/of/unittest.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 06cc988faf78..98cc0163301b 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -869,6 +869,23 @@ static void __init of_unittest_changeset(void)
 #endif
 }
 
+static void __init of_unittest_dma_get_max_cpu_address(void)
+{
+   struct device_node *np;
+   phys_addr_t cpu_addr;
+
+   np = of_find_node_by_path("/testcase-data/address-tests");
+   if (!np) {
+   pr_err("missing testcase data\n");
+   return;
+   }
+
+   cpu_addr = of_dma_get_max_cpu_address(np);
+   unittest(cpu_addr == 0x4fffffff,
+            "of_dma_get_max_cpu_address: wrong CPU addr %pad (expecting %x)\n",
+            &cpu_addr, 0x4fffffff);
+}
+
 static void __init of_unittest_dma_ranges_one(const char *path,
u64 expect_dma_addr, u64 expect_paddr)
 {
@@ -3266,6 +3283,7 @@ static int __init of_unittest(void)
of_unittest_changeset();
of_unittest_parse_interrupts();
of_unittest_parse_interrupts_extended();
+   of_unittest_dma_get_max_cpu_address();
of_unittest_parse_dma_ranges();
of_unittest_pci_dma_ranges();
of_unittest_match_node();
-- 
2.29.1



[PATCH v6 3/7] of/address: Introduce of_dma_get_max_cpu_address()

2020-11-03 Thread Nicolas Saenz Julienne
Introduce of_dma_get_max_cpu_address(), which provides the highest CPU
physical address addressable by all DMA masters in the system. It's
specially useful for setting memory zones sizes at early boot time.

Signed-off-by: Nicolas Saenz Julienne 
Reviewed-by: Rob Herring 

---

Changes since v4:
 - Return max address, not address limit (one off difference)

Changes since v3:
 - use u64 with cpu_end

Changes since v2:
 - Use PHYS_ADDR_MAX
 - return phys_dma_t
 - Rename function
 - Correct subject
 - Add support to start parsing from an arbitrary device node in order
   for the function to work with unit tests

 drivers/of/address.c | 42 ++
 include/linux/of.h   |  7 +++
 2 files changed, 49 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index eb9ab4f1e80b..09c0af7fd1c4 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -1024,6 +1024,48 @@ int of_dma_get_range(struct device_node *np, const 
struct bus_dma_region **map)
 }
 #endif /* CONFIG_HAS_DMA */
 
+/**
+ * of_dma_get_max_cpu_address - Gets highest CPU address suitable for DMA
+ * @np: The node to start searching from or NULL to start from the root
+ *
+ * Gets the highest CPU physical address that is addressable by all DMA masters
+ * in the sub-tree pointed by np, or the whole tree if NULL is passed. If no
+ * DMA constrained device is found, it returns PHYS_ADDR_MAX.
+ */
+phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
+{
+   phys_addr_t max_cpu_addr = PHYS_ADDR_MAX;
+   struct of_range_parser parser;
+   phys_addr_t subtree_max_addr;
+   struct device_node *child;
+   struct of_range range;
+   const __be32 *ranges;
+   u64 cpu_end = 0;
+   int len;
+
+   if (!np)
+   np = of_root;
+
+   ranges = of_get_property(np, "dma-ranges", &len);
+   if (ranges && len) {
+           of_dma_range_parser_init(&parser, np);
+           for_each_of_range(&parser, &range)
+                   if (range.cpu_addr + range.size > cpu_end)
+                           cpu_end = range.cpu_addr + range.size - 1;
+
+   if (max_cpu_addr > cpu_end)
+   max_cpu_addr = cpu_end;
+   }
+
+   for_each_available_child_of_node(np, child) {
+   subtree_max_addr = of_dma_get_max_cpu_address(child);
+   if (max_cpu_addr > subtree_max_addr)
+   max_cpu_addr = subtree_max_addr;
+   }
+
+   return max_cpu_addr;
+}
+
 /**
  * of_dma_is_coherent - Check if device is coherent
  * @np:device node
diff --git a/include/linux/of.h b/include/linux/of.h
index 5d51891cbf1a..9ed5b8532c30 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -558,6 +558,8 @@ int of_map_id(struct device_node *np, u32 id,
   const char *map_name, const char *map_mask_name,
   struct device_node **target, u32 *id_out);
 
+phys_addr_t of_dma_get_max_cpu_address(struct device_node *np);
+
 #else /* CONFIG_OF */
 
 static inline void of_core_init(void)
@@ -995,6 +997,11 @@ static inline int of_map_id(struct device_node *np, u32 id,
return -EINVAL;
 }
 
+static inline phys_addr_t of_dma_get_max_cpu_address(struct device_node *np)
+{
+   return PHYS_ADDR_MAX;
+}
+
 #define of_match_ptr(_ptr) NULL
 #define of_match_node(_matches, _node) NULL
 #endif /* CONFIG_OF */
-- 
2.29.1



[PATCH v6 2/7] arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()

2020-11-03 Thread Nicolas Saenz Julienne
zone_dma_bits's initialization happens earlier than it's actually
needed, in arm64_memblock_init(). So move it into the more suitable
zone_sizes_init().

Signed-off-by: Nicolas Saenz Julienne 
Tested-by: Jeremy Linton 
---
 arch/arm64/mm/init.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index fc4ab0d6d5d2..410721fc4fc0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -190,6 +190,8 @@ static void __init zone_sizes_init(unsigned long min, 
unsigned long max)
unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
 
 #ifdef CONFIG_ZONE_DMA
+   zone_dma_bits = ARM64_ZONE_DMA_BITS;
+   arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
 #ifdef CONFIG_ZONE_DMA32
@@ -376,11 +378,6 @@ void __init arm64_memblock_init(void)
 
early_init_fdt_scan_reserved_mem();
 
-   if (IS_ENABLED(CONFIG_ZONE_DMA)) {
-   zone_dma_bits = ARM64_ZONE_DMA_BITS;
-   arm64_dma_phys_limit = max_zone_phys(ARM64_ZONE_DMA_BITS);
-   }
-
if (IS_ENABLED(CONFIG_ZONE_DMA32))
arm64_dma32_phys_limit = max_zone_phys(32);
else
-- 
2.29.1



[PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

2020-11-03 Thread Nicolas Saenz Julienne
crashkernel might reserve memory located in ZONE_DMA. We plan to delay
ZONE_DMA's initialization after unflattening the devicetree and ACPI's
boot table initialization, so move it later in the boot process.
Specifically into mem_init(), this is the last place crashkernel will be
able to reserve the memory before the page allocator kicks in. There
isn't any apparent reason for doing this earlier.

Signed-off-by: Nicolas Saenz Julienne 
Tested-by: Jeremy Linton 
---
 arch/arm64/mm/init.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 095540667f0f..fc4ab0d6d5d2 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -386,8 +386,6 @@ void __init arm64_memblock_init(void)
else
arm64_dma32_phys_limit = PHYS_MASK + 1;
 
-   reserve_crashkernel();
-
reserve_elfcorehdr();
 
high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
@@ -508,6 +506,8 @@ void __init mem_init(void)
else
swiotlb_force = SWIOTLB_NO_FORCE;
 
+   reserve_crashkernel();
+
set_max_mapnr(max_pfn - PHYS_PFN_OFFSET);
 
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
-- 
2.29.1



[PATCH v6 5/7] arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges

2020-11-03 Thread Nicolas Saenz Julienne
We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
incorporating masters that can address less than 32 bits of DMA, in
particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
peripherals that can only address up to 1 GB (and its PCIe host
bridge can only access the bottom 3 GB).

The DMA layer also needs to be able to allocate memory that is
guaranteed to meet those DMA constraints, for bounce buffering as well
as allocating the backing for consistent mappings. This is why the 1 GB
ZONE_DMA was introduced recently. Unfortunately, it turns out that having
a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes problems with kdump, and
potentially in other places where allocations cannot cross zone
boundaries. Therefore, we should avoid having two separate DMA zones
when possible.

So, with the help of of_dma_get_max_cpu_address(), get the topmost
physical address accessible to all DMA masters in the system and use that
information to fine-tune ZONE_DMA's size. In the absence of addressing
limited masters ZONE_DMA will span the whole 32-bit address space,
otherwise, in the case of the Raspberry Pi 4 it'll only span the 30-bit
address space, and have ZONE_DMA32 cover the rest of the 32-bit address
space.

Signed-off-by: Nicolas Saenz Julienne 

---

Changes since v4:
 - Use fls64 as we're now using the max address (as opposed to the
   limit)

Changes since v3:
 - Simplify code for readability.

Changes since v2:
 - Updated commit log by shamelessly copying Ard's ACPI commit log

 arch/arm64/mm/init.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 410721fc4fc0..a2ce8a9a71a6 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -42,8 +42,6 @@
 #include 
 #include 
 
-#define ARM64_ZONE_DMA_BITS30
-
 /*
  * We need to be able to catch inadvertent references to memstart_addr
  * that occur (potentially in generic code) before arm64_memblock_init()
@@ -188,9 +186,11 @@ static phys_addr_t __init max_zone_phys(unsigned int 
zone_bits)
 static void __init zone_sizes_init(unsigned long min, unsigned long max)
 {
unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
+   unsigned int __maybe_unused dt_zone_dma_bits;
 
 #ifdef CONFIG_ZONE_DMA
-   zone_dma_bits = ARM64_ZONE_DMA_BITS;
+   dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
+   zone_dma_bits = min(32U, dt_zone_dma_bits);
arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
-- 
2.29.1



[PATCH v6 0/7] arm64: Default to 32-bit wide ZONE_DMA

2020-11-03 Thread Nicolas Saenz Julienne
Using two distinct DMA zones turned out to be problematic. Here's an
attempt to go back to a saner default.

I tested this on both a RPi4 and QEMU.

---

Changes since v5:
 - Unify ACPI/DT functions

Changes since v4:
 - Fix of_dma_get_max_cpu_address() so it returns the last addressable
   address, not the limit

Changes since v3:
 - Drop patch adding define in dma-mapping
 - Address small review changes
 - Update Ard's patch
 - Add new patch removing examples from mmzone.h

Changes since v2:
 - Introduce Ard's patch
 - Improve OF dma-ranges parsing function
 - Add unit test for OF function
 - Address small changes
 - Move crashkernel reservation later in boot process

Changes since v1:
 - Parse dma-ranges instead of using machine compatible string

Ard Biesheuvel (1):
  arm64: mm: Set ZONE_DMA size based on early IORT scan

Nicolas Saenz Julienne (6):
  arm64: mm: Move reserve_crashkernel() into mem_init()
  arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
  of/address: Introduce of_dma_get_max_cpu_address()
  of: unittest: Add test for of_dma_get_max_cpu_address()
  arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
  mm: Remove examples from enum zone_type comment

 arch/arm64/mm/init.c  | 18 ++---
 drivers/acpi/arm64/iort.c | 55 +++
 drivers/of/address.c  | 42 ++
 drivers/of/unittest.c | 18 +
 include/linux/acpi_iort.h |  4 +++
 include/linux/mmzone.h| 20 --
 include/linux/of.h|  7 +
 7 files changed, 135 insertions(+), 29 deletions(-)

-- 
2.29.1



Re: [PATCH v18 2/4] iommu/arm-smmu: Add a way for implementations to influence SCTLR

2020-11-03 Thread Jordan Crouse
On Mon, Nov 02, 2020 at 06:18:45PM +, Robin Murphy wrote:
> On 2020-11-02 17:14, Jordan Crouse wrote:
> >From: Rob Clark 
> >
> >For the Adreno GPU's SMMU, we want SCTLR.HUPCF set to ensure that
> >pending translations are not terminated on iova fault.  Otherwise
> >a terminated CP read could hang the GPU by returning invalid
> >command-stream data.
> >
> >Signed-off-by: Rob Clark 
> >Reviewed-by: Bjorn Andersson 
> >Signed-off-by: Jordan Crouse 
> >---
> >
> >  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 6 ++
> >  drivers/iommu/arm/arm-smmu/arm-smmu.c  | 3 +++
> >  drivers/iommu/arm/arm-smmu/arm-smmu.h  | 3 +++
> >  3 files changed, 12 insertions(+)
> >
> >diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c 
> >b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> >index 1e942eed2dfc..0663d7d26908 100644
> >--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> >+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> >@@ -129,6 +129,12 @@ static int qcom_adreno_smmu_init_context(struct 
> >arm_smmu_domain *smmu_domain,
> > (smmu_domain->cfg.fmt == ARM_SMMU_CTX_FMT_AARCH64))
> > pgtbl_cfg->quirks |= IO_PGTABLE_QUIRK_ARM_TTBR1;
> >+/*
> >+ * On the GPU device we want to process subsequent transactions after a
> >+ * fault to keep the GPU from hanging
> >+ */
> >+smmu_domain->cfg.sctlr_set |= ARM_SMMU_SCTLR_HUPCF;
> >+
> > /*
> >  * Initialize private interface with GPU:
> >  */
> >diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
> >b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> >index dad7fa86fbd4..1f06ab219819 100644
> >--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> >+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> >@@ -617,6 +617,9 @@ void arm_smmu_write_context_bank(struct arm_smmu_device 
> >*smmu, int idx)
> > if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
> > reg |= ARM_SMMU_SCTLR_E;
> >+reg |= cfg->sctlr_set;
> >+reg &= ~cfg->sctlr_clr;
> 
> Since we now have a write_s2cr hook, I'm inclined to think that the
> consistency of a write_sctlr hook that could similarly apply its own
> arbitrary tweaks would make sense for this. Does anyone have any strong
> opinions?

None from me. That would make an eventual stall-on-fault implementation easier
too.

Jordan

> Robin.
> 
> >+
> > arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
> >  }
> >diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h 
> >b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> >index 6c5ffeae..ddf2ca4c923d 100644
> >--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
> >+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> >@@ -144,6 +144,7 @@ enum arm_smmu_cbar_type {
> >  #define ARM_SMMU_CB_SCTLR  0x0
> >  #define ARM_SMMU_SCTLR_S1_ASIDPNE  BIT(12)
> >  #define ARM_SMMU_SCTLR_CFCFG   BIT(7)
> >+#define ARM_SMMU_SCTLR_HUPCFBIT(8)
> >  #define ARM_SMMU_SCTLR_CFIEBIT(6)
> >  #define ARM_SMMU_SCTLR_CFREBIT(5)
> >  #define ARM_SMMU_SCTLR_E   BIT(4)
> >@@ -341,6 +342,8 @@ struct arm_smmu_cfg {
> > u16 asid;
> > u16 vmid;
> > };
> >+u32 sctlr_set;/* extra bits to set in 
> >SCTLR */
> >+u32 sctlr_clr;/* bits to mask in SCTLR 
> >*/
> > enum arm_smmu_cbar_type cbar;
> > enum arm_smmu_context_fmt   fmt;
> >  };
> >

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v5 0/7] arm64: Default to 32-bit wide ZONE_DMA

2020-11-03 Thread Nicolas Saenz Julienne
On Fri, 2020-10-30 at 18:11 +, Catalin Marinas wrote:
> On Thu, Oct 29, 2020 at 06:25:43PM +0100, Nicolas Saenz Julienne wrote:
> > Ard Biesheuvel (1):
> >   arm64: mm: Set ZONE_DMA size based on early IORT scan
> > 
> > Nicolas Saenz Julienne (6):
> >   arm64: mm: Move reserve_crashkernel() into mem_init()
> >   arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
> >   of/address: Introduce of_dma_get_max_cpu_address()
> >   of: unittest: Add test for of_dma_get_max_cpu_address()
> >   arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
> >   mm: Remove examples from enum zone_type comment
> 
> Thanks for putting this together. I had a minor comment but the patches
> look fine to me. We still need an ack from Rob on the DT patch and I can
> queue the series for 5.11.

I'm preparing a v6 unifying both functions as you suggested.

> Could you please also test the patch below on top of this series? It's
> the removal of the implied DMA offset in the max_zone_phys()
> calculation.

Yes, happily. Comments below.

> --8<-
> From 3ae252d888be4984a612236124f5b099e804c745 Mon Sep 17 00:00:00 2001
> From: Catalin Marinas 
> Date: Fri, 30 Oct 2020 18:07:34 +
> Subject: [PATCH] arm64: Ignore any DMA offsets in the max_zone_phys()
>  calculation
> 
> Currently, the kernel assumes that if RAM starts above 32-bit (or
> zone_bits), there is still a ZONE_DMA/DMA32 at the bottom of the RAM and
> such constrained devices have a hardwired DMA offset. In practice, we
> haven't noticed any such hardware so let's assume that we can expand
> ZONE_DMA32 to the available memory if no RAM below 4GB. Similarly,
> ZONE_DMA is expanded to the 4GB limit if no RAM addressable by
> zone_bits.
> 
> Signed-off-by: Catalin Marinas 
> ---
>  arch/arm64/mm/init.c | 17 -
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 095540667f0f..362160e16fb2 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -175,14 +175,21 @@ static void __init reserve_elfcorehdr(void)
>  #endif /* CONFIG_CRASH_DUMP */
>  
>  /*
> - * Return the maximum physical address for a zone with a given address size
> - * limit. It currently assumes that for memory starting above 4G, 32-bit
> - * devices will use a DMA offset.
> + * Return the maximum physical address for a zone accessible by the given 
> bits
> + * limit. If the DRAM starts above 32-bit, expand the zone to the maximum
> + * available memory, otherwise cap it at 32-bit.
>   */
>  static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
>  {
> - phys_addr_t offset = memblock_start_of_DRAM() & GENMASK_ULL(63, 
> zone_bits);
> - return min(offset + (1ULL << zone_bits), memblock_end_of_DRAM());
> + phys_addr_t zone_mask = (1ULL << zone_bits) - 1;

Maybe use DMA_BIT_MASK(), instead of the manual calculation?

> + phys_addr_t phys_start = memblock_start_of_DRAM();
> +
> + if (!(phys_start & U32_MAX))

I'd suggest using 'bigger than' instead of masks. Just to cover ourselves
against memory starting at odd locations. Also it'll behave properly when
phys_start is zero (this breaks things on RPi4).

> + zone_mask = PHYS_ADDR_MAX;
> + else if (!(phys_start & zone_mask))
> + zone_mask = U32_MAX;
> +
> + return min(zone_mask + 1, memblock_end_of_DRAM());

This + 1 isn't going to play well when zone_mask is PHYS_ADDR_MAX.

Regards,
Nicolas




Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-11-03 Thread j...@8bytes.org
On Tue, Nov 03, 2020 at 11:22:23AM -0400, Jason Gunthorpe wrote:
> This whole thread was brought up by IDXD which has a SVA driver and
> now wants to add a vfio-mdev driver too. SVA devices that want to be
> plugged into VMs are going to be common - this architecture that a SVA
> driver cannot cover the kvm case seems problematic.

Isn't that the same pattern as having separate drivers for VFs and the
parent device in SR-IOV?

Regards,

Joerg


Re: [PATCH v2 3/4] iommu/iova: Flush CPU rcache for when a depot fills

2020-11-03 Thread Robin Murphy

On 2020-10-26 17:31, John Garry wrote:

Leizhen reported some time ago that IOVA performance may degrade over time
[0], but unfortunately his solution to fix this problem was not given
attention.

To summarize, the issue is that as time goes by, the CPU rcache and depot
rcache continue to grow. As such, IOVA RB tree access time also continues
to grow.


I'm struggling to see how this is not simply indicative of a leak 
originating elsewhere. For the number of magazines to continually grow, 
it means IOVAs *of a particular size* are being freed faster than they 
are being allocated, while the only place that ongoing allocations 
should be coming from is those same magazines!


Now indeed that could happen over the short term if IOVAs are allocated 
and freed again in giant batches larger than the total global cache 
capacity, but that would show a cyclic behaviour - when activity starts, 
everything is first allocated straight from the tree, then when it ends 
the caches would get overwhelmed by the large burst of freeing and start 
having to release things back to the tree, but eventually that would 
stop once everything *is* freed, then when activity begins again the 
next round of allocating would inherently clear out all the caches 
before going anywhere near the tree. To me the "steady decline" 
behaviour suggests that someone somewhere is making DMA unmap calls with 
a smaller size than they were mapped with (you tend to notice it quicker 
the other way round due to all the device errors and random memory 
corruption) - in many cases that would appear to work out fine from the 
driver's point of view, but would provoke exactly this behaviour in the 
IOVA allocator.
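
To illustrate, that kind of mismatch would look something like this (a
made-up driver snippet, not pointing at anything real); the 8K mapping only
ever gives a 4K IOVA back to the caches, so the 4K magazines keep filling
with entries that are never reallocated at that size, while the other half
of every 8K IOVA quietly leaks:

/* Hypothetical example: a driver that maps 8K but only unmaps 4K */
static void buggy_rx(struct device *dev, void *buf)
{
	dma_addr_t dma = dma_map_single(dev, buf, SZ_8K, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, dma))
		return;

	/* ... device fills the buffer ... */

	dma_unmap_single(dev, dma, SZ_4K, DMA_FROM_DEVICE);	/* wrong size */
}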


Robin.


At a certain point, a depot may become full, and some CPU rcaches may
also be full when inserting another IOVA is attempted. For this scenario,
currently the "loaded" CPU rcache is freed and a new one is created. This
freeing means that many IOVAs in the RB tree need to be freed, which
makes IO throughput performance fall off a cliff in some storage scenarios:

Jobs: 12 (f=12): [] [0.0% done] [6314MB/0KB/0KB /s] [1616K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [5669MB/0KB/0KB /s] [1451K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6673MB/0KB/0KB /s] [1708K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6761MB/0KB/0KB /s] [1731K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6685MB/0KB/0KB /s] [1711K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6178MB/0KB/0KB /s] [1582K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [6731MB/0KB/0KB /s] [1723K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [2387MB/0KB/0KB /s] [611K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [2689MB/0KB/0KB /s] [688K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [2278MB/0KB/0KB /s] [583K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [1288MB/0KB/0KB /s] [330K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [1632MB/0KB/0KB /s] [418K/0/0 iops]
Jobs: 12 (f=12): [] [0.0% done] [1765MB/0KB/0KB /s] [452K/0/0 iops]

And it continues in this fashion, without recovering. Note that in this
example it was necessary to wait 16 hours for this to occur. Also note that
IO throughput gradually becomes more unstable leading up to
this point.

As a solution to this issue, judge that the IOVA caches have grown too big
when a cached magazine needs to be freed, and just flush all the CPU rcaches
instead.

The depot rcaches, however, are not flushed, as they can be used to
immediately replenish active CPUs.

In future, some IOVA compaction could be implemented to solve the
instability issue, which I figure could be quite complex to implement.

[0] 
https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leiz...@huawei.com/

Analyzed-by: Zhen Lei 
Reported-by: Xiang Chen 
Signed-off-by: John Garry 
---
  drivers/iommu/iova.c | 16 ++--
  1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 1f3f0f8b12e0..386005055aca 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -901,7 +901,6 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
 struct iova_rcache *rcache,
 unsigned long iova_pfn)
  {
-   struct iova_magazine *mag_to_free = NULL;
struct iova_cpu_rcache *cpu_rcache;
bool can_insert = false;
unsigned long flags;
@@ -923,13 +922,12 @@ static bool __iova_rcache_insert(struct iova_domain 
*iovad,
if (cpu_rcache->loaded)

Re: [PATCH v5 2/2] iommu/iova: Free global iova rcache on iova alloc failure

2020-11-03 Thread Robin Murphy

On 2020-11-03 14:31, John Garry wrote:

On 03/11/2020 12:35, Robin Murphy wrote:

On 2020-09-30 08:44, vji...@codeaurora.org wrote:

From: Vijayanand Jitta 

Whenever an iova alloc request fails, we free the iova
ranges present in the percpu iova rcaches and then retry,
but the global iova rcache is not freed. As a result, we could
still see iova alloc failures even after the retry, as the global
rcache is holding on to iovas, which can cause fragmentation.
So, free the global iova rcache as well and then go for the
retry.




If we do clear all the CPU rcaches, it would nice to have something 
immediately available to replenish, i.e. use the global rcache, instead 
of flushing it, if that is not required...


If we've reached the point of clearing *any* caches, though, I think any 
hope of maintaining performance is already long gone. We've walked the 
rbtree for the entire address space and found that it's still too full 
to allocate from; we're teetering on the brink of hard failure and this 
is a last-ditch attempt to claw back as much as possible in the hope 
that it gives us a usable space.


TBH I'm not entirely sure what allocation pattern was expected by the 
original code such that purging only some of the caches made sense, nor 
what kind of pattern leads to lots of smaller IOVAs being allocated, 
freed, and never reused to the point of blocking larger allocations, but 
either way the reasoning does at least seem to hold up in the abstract.


This looks reasonable to me - it's mildly annoying that we end up with 
so many similar-looking functions,


Well I did add a function to clear all CPU rcaches here, if you would 
like to check:


https://lore.kernel.org/linux-iommu/1603733501-211004-2-git-send-email-john.ga...@huawei.com/ 


I was thinking more of the way free_iova_rcaches(), 
free_cpu_cached_iovas(), and free_global_cached_iovas() all look pretty 
much the same shape at a glance.


but the necessary differences are right down in the middle of the 
loops so nothing can reasonably be factored out :(


Reviewed-by: Robin Murphy 


Signed-off-by: Vijayanand Jitta 
---
  drivers/iommu/iova.c | 23 +++
  1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c3a1a8e..faf9b13 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -25,6 +25,7 @@ static void init_iova_rcaches(struct iova_domain 
*iovad);

  static void free_iova_rcaches(struct iova_domain *iovad);
  static void fq_destroy_all_entries(struct iova_domain *iovad);
  static void fq_flush_timeout(struct timer_list *t);
+static void free_global_cached_iovas(struct iova_domain *iovad);


a thought: It would be great if the file could be rearranged at some 
point where we don't require so many forward declarations.



  void
  init_iova_domain(struct iova_domain *iovad, unsigned long granule,
@@ -442,6 +443,7 @@ alloc_iova_fast(struct iova_domain *iovad, 
unsigned long size,

  flush_rcache = false;
  for_each_online_cpu(cpu)
  free_cpu_cached_iovas(cpu, iovad);
+    free_global_cached_iovas(iovad);
  goto retry;
  }
@@ -1057,5 +1059,26 @@ void free_cpu_cached_iovas(unsigned int cpu, 
struct iova_domain *iovad)

  }
  }
+/*
+ * free all the IOVA ranges of global cache
+ */
+static void free_global_cached_iovas(struct iova_domain *iovad)
+{
+    struct iova_rcache *rcache;
+    unsigned long flags;
+    int i, j;
+
+    for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+    rcache = &iovad->rcaches[i];
+    spin_lock_irqsave(&rcache->lock, flags);
+    for (j = 0; j < rcache->depot_size; ++j) {
+    iova_magazine_free_pfns(rcache->depot[j], iovad);
+    iova_magazine_free(rcache->depot[j]);
+    rcache->depot[j] = NULL;


I don't think that NULLify is strictly necessary


True, we don't explicitly clear depot entries in __iova_rcache_get() for 
normal operation, so there's not much point in doing so here.


Robin.


+    }
+    rcache->depot_size = 0;
+    spin_unlock_irqrestore(&rcache->lock, flags);
+    }
+}
  MODULE_AUTHOR("Anil S Keshavamurthy 
");

  MODULE_LICENSE("GPL");



Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-11-03 Thread Jason Gunthorpe
On Tue, Nov 03, 2020 at 03:35:32PM +0100, j...@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 10:06:42AM -0400, Jason Gunthorpe wrote:
> > The point is that other places beyond VFIO need this
> 
> Which and why?
>
> > Sure, but sometimes it is necessary, and in those cases the answer
> > can't be "rewrite a SVA driver to use vfio"
> 
> This is getting too abstract. Can you come up with an example where
> handling this in VFIO or an endpoint device kernel driver does not work?

This whole thread was brought up by IDXD which has a SVA driver and
now wants to add a vfio-mdev driver too. SVA devices that want to be
plugged into VMs are going to be common - this architecture that a SVA
driver cannot cover the kvm case seems problematic.

Yes, everything can have a SVA driver and a vfio-mdev, it works just
fine, but it is not very clean or simple.

Jason


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-11-03 Thread j...@8bytes.org
On Tue, Nov 03, 2020 at 10:06:42AM -0400, Jason Gunthorpe wrote:
> The point is that other places beyond VFIO need this

Which and why?

> Sure, but sometimes it is necessary, and in those cases the answer
> can't be "rewrite a SVA driver to use vfio"

This is getting too abstract. Can you come up with an example where
handling this in VFIO or an endpoint device kernel driver does not work?

Regards,

Joerg


Re: [PATCH v5 2/2] iommu/iova: Free global iova rcache on iova alloc failure

2020-11-03 Thread John Garry

On 03/11/2020 12:35, Robin Murphy wrote:

On 2020-09-30 08:44, vji...@codeaurora.org wrote:

From: Vijayanand Jitta 

Whenever an iova alloc request fails, we free the iova
ranges present in the percpu iova rcaches and then retry,
but the global iova rcache is not freed. As a result, we could
still see iova alloc failures even after the retry, as the global
rcache is holding on to iovas, which can cause fragmentation.
So, free the global iova rcache as well and then go for the
retry.




If we do clear all the CPU rcaches, it would nice to have something 
immediately available to replenish, i.e. use the global rcache, instead 
of flushing it, if that is not required...


This looks reasonable to me - it's mildly annoying that we end up with 
so many similar-looking functions,


Well I did add a function to clear all CPU rcaches here, if you would 
like to check:


https://lore.kernel.org/linux-iommu/1603733501-211004-2-git-send-email-john.ga...@huawei.com/

but the necessary differences are 
right down in the middle of the loops so nothing can reasonably be 
factored out :(


Reviewed-by: Robin Murphy 


Signed-off-by: Vijayanand Jitta 
---
  drivers/iommu/iova.c | 23 +++
  1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c3a1a8e..faf9b13 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -25,6 +25,7 @@ static void init_iova_rcaches(struct iova_domain 
*iovad);

  static void free_iova_rcaches(struct iova_domain *iovad);
  static void fq_destroy_all_entries(struct iova_domain *iovad);
  static void fq_flush_timeout(struct timer_list *t);
+static void free_global_cached_iovas(struct iova_domain *iovad);


a thought: It would be great if the file could be rearranged at some 
point where we don't require so many forward declarations.



  void
  init_iova_domain(struct iova_domain *iovad, unsigned long granule,
@@ -442,6 +443,7 @@ alloc_iova_fast(struct iova_domain *iovad, 
unsigned long size,

  flush_rcache = false;
  for_each_online_cpu(cpu)
  free_cpu_cached_iovas(cpu, iovad);
+    free_global_cached_iovas(iovad);
  goto retry;
  }
@@ -1057,5 +1059,26 @@ void free_cpu_cached_iovas(unsigned int cpu, 
struct iova_domain *iovad)

  }
  }
+/*
+ * free all the IOVA ranges of global cache
+ */
+static void free_global_cached_iovas(struct iova_domain *iovad)
+{
+    struct iova_rcache *rcache;
+    unsigned long flags;
+    int i, j;
+
+    for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+    rcache = &iovad->rcaches[i];
+    spin_lock_irqsave(&rcache->lock, flags);
+    for (j = 0; j < rcache->depot_size; ++j) {
+    iova_magazine_free_pfns(rcache->depot[j], iovad);
+    iova_magazine_free(rcache->depot[j]);
+    rcache->depot[j] = NULL;


I don't think that NULLify is strictly necessary


+    }
+    rcache->depot_size = 0;
+    spin_unlock_irqrestore(&rcache->lock, flags);
+    }
+}
  MODULE_AUTHOR("Anil S Keshavamurthy ");
  MODULE_LICENSE("GPL");



Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-11-03 Thread Jason Gunthorpe
On Tue, Nov 03, 2020 at 03:03:18PM +0100, j...@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 09:23:35AM -0400, Jason Gunthorpe wrote:
> > Userspace needs fine grained control over the composition of the page
> > table behind the PASID, 1:1 with the mm_struct is only one use case.
> 
> VFIO already offers an interface for that. It shouldn't be too
> complicated to expand that for PASID-bound page-tables.
> 
> > Userspace needs to be able to handle IOMMU faults, apparently
> 
> Could be implemented by a fault-fd handed out by VFIO.

The point is that other places beyond VFIO need this

> I really don't think that user-space should have to deal with details
> like PASIDs or other IOMMU internals, unless absolutely necessary. This
> is an OS we work on, and the idea behind an OS is to abstract the
> hardware away.

Sure, but sometimes it is necessary, and in those cases the answer
can't be "rewrite a SVA driver to use vfio"

Jason


Re: [PATCH] iommu: fix a check in iommu_check_bind_data()

2020-11-03 Thread Joerg Roedel
On Tue, Nov 03, 2020 at 01:16:23PM +0300, Dan Carpenter wrote:
> The "data->flags" variable is a u64 so if one of the high 32 bits is
> set, the original code will allow it, but it should be rejected.  The
> fix is to declare "mask" as a u64 instead of a u32.
> 
> Fixes: d90573812eea ("iommu/uapi: Handle data and argsz filled by users")
> Signed-off-by: Dan Carpenter 
> ---
>  drivers/iommu/iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Applied for v5.10, thanks.


Re: [PATCH v2 0/2] iommu: Fix a few issues related to PRQ

2020-11-03 Thread Joerg Roedel
On Fri, Oct 30, 2020 at 10:37:22AM +0800, Yi Sun wrote:
> We found a few issues about PRQ. So, two patches are cooked to
> fix them. Please have a review. Thanks!
> 
> Changes from v1:
> 
> - Modify subject of patch 1 to make it more accurate.
> - Move get_domain_info() up to the sanity check part in patch 1.
> - Remove v1 patch 2 which is not suitable.
> - Add description for current patch 2.
> - Add 'Fixes:' tags for all patches.
> 
> Liu Yi L (1):
>   iommu/vt-d: Fix sid not set issue in in intel_svm_bind_gpasid()
> 
> Liu, Yi L (1):
>   iommu/vt-d: Fix a bug for PDP check in prq_event_thread
> 
>  drivers/iommu/intel/svm.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)

Applied for v5.10, thanks.


Re: [PATCH] iommu/amd: Enforce 4k mapping for certain IOMMU data structures

2020-11-03 Thread Joerg Roedel
Hi Suravee,

On Wed, Oct 28, 2020 at 11:18:24PM +, Suravee Suthikulpanit wrote:
> AMD IOMMU requires 4k-aligned pages for the event log, the PPR log,
> and the completion wait write-back regions. However, when allocating
> the pages, they could be part of a large mapping (e.g. 2M) page.
> This causes a #PF because the SNP RMP hardware enforces the check based
> on the page level for these data structures.
> 
> So, fix by calling set_memory_4k() on the allocated pages.
> 
> Fixes: commit c69d89aff393 ("iommu/amd: Use 4K page for completion wait 
> write-back semaphore")
> Cc: Brijesh Singh 
> Signed-off-by: Suravee Suthikulpanit 
> ---
>  drivers/iommu/amd/init.c | 22 +-
>  1 file changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
> index 82e4af8f09bb..75dc30226a7c 100644
> --- a/drivers/iommu/amd/init.c
> +++ b/drivers/iommu/amd/init.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -672,11 +673,22 @@ static void __init free_command_buffer(struct amd_iommu 
> *iommu)
>   free_pages((unsigned long)iommu->cmd_buf, get_order(CMD_BUFFER_SIZE));
>  }
>  
> +static void *__init iommu_alloc_4k_pages(gfp_t gfp, size_t size)
> +{
> + void *buf;
> + int order = get_order(size);
> +
> + buf = (void *)__get_free_pages(gfp, order);
> + if (!buf)
> + return buf;
> + return set_memory_4k((unsigned long)buf, (1 << order)) ? NULL : buf;
> +}
> +

Please make the 4k split only if SNP is actually enabled in the system.
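
For illustration only, a sketch of what that gating could look like, based on the helper in the quoted patch; "amd_iommu_snp_en" is a hypothetical stand-in for whatever SNP-active check the final patch uses:

static void *__init iommu_alloc_4k_pages(gfp_t gfp, size_t size)
{
	void *buf;
	int order = get_order(size);

	buf = (void *)__get_free_pages(gfp, order);
	if (!buf)
		return buf;
	/* Hypothetical flag: only pay the set_memory_4k() cost when SNP is on */
	if (!amd_iommu_snp_en)
		return buf;
	return set_memory_4k((unsigned long)buf, (1 << order)) ? NULL : buf;
}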

Regards,

Joerg


Re: [PATCH 1/1] iommu/vt-d: Fix kernel NULL pointer dereference in find_domain()

2020-11-03 Thread Joerg Roedel
On Wed, Oct 28, 2020 at 03:07:25PM +0800, Lu Baolu wrote:
> Fixes: e2726daea583d ("iommu/vt-d: debugfs: Add support to show page table 
> internals")
> Cc: sta...@vger.kernel.org#v5.6+
> Reported-and-tested-by: Xu Pengfei 
> Signed-off-by: Lu Baolu 

Applied for v5.10, thanks.



Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-11-03 Thread Jason Gunthorpe
On Tue, Nov 03, 2020 at 02:18:52PM +0100, j...@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 08:56:43AM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 03, 2020 at 10:52:09AM +0100, j...@8bytes.org wrote:
> > > So having said this, what is the benefit of exposing those SVA internals
> > > to user-space?
> > 
> > Only the device use of the PASID is device specific, the actual PASID
> > and everything on the IOMMU side is generic.
> > 
> > There is enough API there it doesn't make sense to duplicate it into
> > every single SVA driver.
> 
> What generic things have to be done by the drivers besides
> allocating/deallocating PASIDs and binding an address space to it?
> 
> Is there anything which isn't better handled in a kernel-internal
> library which drivers just use?

Userspace needs fine grained control over the composition of the page
table behind the PASID, 1:1 with the mm_struct is only one use case.

Userspace needs to be able to handle IOMMU faults, apparently

The Intel guys had a bunch of other stuff too, looking through the new
API they are proposing for vfio gives some flavour what they think is
needed..

Jason


Re: [PATCH 1/1] iommu: Fix deferred domain attachment in iommu_probe_device()

2020-11-03 Thread Joerg Roedel
Hi Baolu,

On Mon, Oct 26, 2020 at 02:30:08PM +0800, Lu Baolu wrote:
> @@ -264,7 +266,8 @@ int iommu_probe_device(struct device *dev)
>*/
>   iommu_alloc_default_domain(group, dev);
>  
> - if (group->default_domain)
> + if (group->default_domain &&
> + !iommu_is_attach_deferred(group->default_domain, dev))
>   ret = __iommu_attach_device(group->default_domain, dev);

This is the hotplug path, not used for boot-initialization. Have you
seen failures from the missing check here?

Regards,

Joerg


Re: [PATCH] iommu/amd: Increase interrupt remapping table limit to 512 entries

2020-11-03 Thread Joerg Roedel
On Thu, Oct 15, 2020 at 02:50:02AM +, Suravee Suthikulpanit wrote:
> Certain device drivers allocate IO queues on a per-cpu basis.
> On the AMD EPYC platform, which can support up to 256 CPU threads,
> this can exceed the current MAX_IRQ_PER_TABLE limit of 256,
> and result in the error message:
> 
> AMD-Vi: Failed to allocate IRTE
> 
> This has been observed with certain NVME devices.
> 
> AMD IOMMU hardware can actually support up to 512 interrupt
> remapping table entries. Therefore, update the driver to
> match the hardware limit.
> 
> Please note that this also increases the size of interrupt remapping
> table to 8KB per device when using the 128-bit IRTE format.
> 
> Signed-off-by: Suravee Suthikulpanit 
> ---
>  drivers/iommu/amd/amd_iommu_types.h | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
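
As a sanity check on the size note above (the hunk itself is not quoted in this mail, so the macro names below are assumptions based on the text): a 128-bit IRTE is 16 bytes, so 512 entries come to 8 KiB per device, up from 4 KiB at 256 entries.

#define MAX_IRQS_PER_TABLE	512	/* assumed macro name; previously 256 */
#define IRTE_SIZE_128BIT	16	/* bytes per 128-bit IRTE (assumption) */
#define IRT_SIZE_PER_DEVICE	(MAX_IRQS_PER_TABLE * IRTE_SIZE_128BIT)	/* 512 * 16 = 8192 bytes = 8 KiB */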

Applied for 5.10, thanks.



Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-11-03 Thread j...@8bytes.org
On Tue, Nov 03, 2020 at 08:56:43AM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 03, 2020 at 10:52:09AM +0100, j...@8bytes.org wrote:
> > So having said this, what is the benefit of exposing those SVA internals
> > to user-space?
> 
> Only the device use of the PASID is device specific, the actual PASID
> and everything on the IOMMU side is generic.
> 
> There is enough API there it doesn't make sense to duplicate it into
> every single SVA driver.

What generic things have to be done by the drivers besides
allocating/deallocating PASIDs and binding an address space to it?

Is there anything which isn't better handled in a kernel-internal
library which drivers just use?

Regards,

Joerg


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-11-03 Thread Jason Gunthorpe
On Tue, Nov 03, 2020 at 10:52:09AM +0100, j...@8bytes.org wrote:
> So having said this, what is the benefit of exposing those SVA internals
> to user-space?

Only the device use of the PASID is device specific, the actual PASID
and everything on the IOMMU side is generic.

There is enough API there it doesn't make sense to duplicate it into
every single SVA driver.

Jason


Re: [PATCH v5 2/2] iommu/iova: Free global iova rcache on iova alloc failure

2020-11-03 Thread Robin Murphy

On 2020-09-30 08:44, vji...@codeaurora.org wrote:

From: Vijayanand Jitta 

Whenever an iova alloc request fails, we free the iova
ranges present in the percpu iova rcaches and then retry,
but the global iova rcache is not freed. As a result, we could
still see iova alloc failures even after the retry, as the global
rcache is holding on to iovas, which can cause fragmentation.
So, free the global iova rcache as well and then go for the
retry.


This looks reasonable to me - it's mildly annoying that we end up with 
so many similar-looking functions, but the necessary differences are 
right down in the middle of the loops so nothing can reasonably be 
factored out :(


Reviewed-by: Robin Murphy 


Signed-off-by: Vijayanand Jitta 
---
  drivers/iommu/iova.c | 23 +++
  1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c3a1a8e..faf9b13 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -25,6 +25,7 @@ static void init_iova_rcaches(struct iova_domain *iovad);
  static void free_iova_rcaches(struct iova_domain *iovad);
  static void fq_destroy_all_entries(struct iova_domain *iovad);
  static void fq_flush_timeout(struct timer_list *t);
+static void free_global_cached_iovas(struct iova_domain *iovad);
  
  void

  init_iova_domain(struct iova_domain *iovad, unsigned long granule,
@@ -442,6 +443,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long 
size,
flush_rcache = false;
for_each_online_cpu(cpu)
free_cpu_cached_iovas(cpu, iovad);
+   free_global_cached_iovas(iovad);
goto retry;
}
  
@@ -1057,5 +1059,26 @@ void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)

}
  }
  
+/*

+ * free all the IOVA ranges of global cache
+ */
+static void free_global_cached_iovas(struct iova_domain *iovad)
+{
+   struct iova_rcache *rcache;
+   unsigned long flags;
+   int i, j;
+
+   for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+   rcache = &iovad->rcaches[i];
+   spin_lock_irqsave(&rcache->lock, flags);
+   for (j = 0; j < rcache->depot_size; ++j) {
+   iova_magazine_free_pfns(rcache->depot[j], iovad);
+   iova_magazine_free(rcache->depot[j]);
+   rcache->depot[j] = NULL;
+   }
+   rcache->depot_size = 0;
+   spin_unlock_irqrestore(&rcache->lock, flags);
+   }
+}
  MODULE_AUTHOR("Anil S Keshavamurthy ");
  MODULE_LICENSE("GPL");




Re: [PATCH v4 4/7] iommu: Add quirk for Intel graphic devices in map_sg

2020-11-03 Thread Robin Murphy

On 2020-09-27 07:34, Lu Baolu wrote:

Combining the sg segments exposes a bug in the Intel i915 driver which
causes visual artifacts and the screen to freeze. This is most likely
because of how the i915 handles the returned list. It probably doesn't
respect the returned value specifying the number of elements in the list
and instead depends on the previous behaviour of the Intel iommu driver
which would return the same number of elements in the output list as in
the input list.

Signed-off-by: Tom Murphy 
Signed-off-by: Lu Baolu 
---
  drivers/iommu/dma-iommu.c | 27 +++
  1 file changed, 27 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 3526db774611..e7e4d758f51a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -879,6 +879,33 @@ static int __finalise_sg(struct device *dev, struct 
scatterlist *sg, int nents,
unsigned int cur_len = 0, max_len = dma_get_max_seg_size(dev);
int i, count = 0;
  
+	/*

+	 * The Intel graphics driver assumes that the returned sg list is
+	 * not combined. This blocks the effort of converting the Intel
+	 * IOMMU driver to the dma-iommu API. Add this quirk to make the
+	 * device driver work; it should be removed once this is fixed in
+	 * the i915 driver.
+	 */
+   if (IS_ENABLED(CONFIG_DRM_I915) && dev_is_pci(dev) &&
+   to_pci_dev(dev)->vendor == PCI_VENDOR_ID_INTEL &&
+   (to_pci_dev(dev)->class >> 16) == PCI_BASE_CLASS_DISPLAY) {
+   for_each_sg(sg, s, nents, i) {
+   unsigned int s_iova_off = sg_dma_address(s);
+   unsigned int s_length = sg_dma_len(s);
+   unsigned int s_iova_len = s->length;
+
+   s->offset += s_iova_off;
+   s->length = s_length;
+   sg_dma_address(s) = dma_addr + s_iova_off;
+   sg_dma_len(s) = s_length;
+   dma_addr += s_iova_len;
+
+   pr_info_once("sg combining disabled due to i915 driver\n");
+   }
+
+   return nents;
+   }


BTW, a much less invasive workaround would be to simply override 
seg_mask to 0. That's enough to make sure that no segment looks eligible 
for merging.
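
For reference, a rough and untested sketch of that less invasive variant, reusing the same i915 detection as the quoted quirk; the placement near the top of __finalise_sg and the exact shape of the merge condition are assumptions:

	/* Untested sketch: rely on the merge test checking dma_addr against
	 * seg_mask; a zero mask makes that check fail for every segment, so
	 * every input segment stays a separate output segment. */
	if (IS_ENABLED(CONFIG_DRM_I915) && dev_is_pci(dev) &&
	    to_pci_dev(dev)->vendor == PCI_VENDOR_ID_INTEL &&
	    (to_pci_dev(dev)->class >> 16) == PCI_BASE_CLASS_DISPLAY)
		seg_mask = 0;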


Robin.


+
for_each_sg(sg, s, nents, i) {
/* Restore this segment's original unaligned fields first */
unsigned int s_iova_off = sg_dma_address(s);




Re: [PATCH v4 0/7] Convert the intel iommu driver to the dma-iommu api

2020-11-03 Thread Joerg Roedel
Hi,

On Tue, Nov 03, 2020 at 11:58:26AM +0200, Joonas Lahtinen wrote:
> Would that work for you? We intend to send the feature pull requests
> to DRM for 5.11 in the upcoming weeks.

For the IOMMU side it is best to include the workaround for now. When
the DRM fixes are merged into v5.11-rc1 together with this conversion,
it can be reverted and will not be in 5.11-final.

Regards,

Joerg


Re: Some questions about arm_lpae_install_table

2020-11-03 Thread Robin Murphy

On 2020-11-03 09:11, Will Deacon wrote:

On Tue, Nov 03, 2020 at 11:00:27AM +0800, Kunkun Jiang wrote:

Recently, I have read and learned the code related to io-pgtable-arm.c. There
are two questions about arm_lpae_install_table.

1、the first


static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
  arm_lpae_iopte *ptep,
  arm_lpae_iopte curr,
  struct io_pgtable_cfg *cfg)
{
     arm_lpae_iopte old, new;

     new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
     if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
     new |= ARM_LPAE_PTE_NSTABLE;

    /*
  * Ensure the table itself is visible before its PTE can be.
  * Whilst we could get away with cmpxchg64_release below, this
  * doesn't have any ordering semantics when !CONFIG_SMP.
  */
     dma_wmb();

     old = cmpxchg64_relaxed(ptep, curr, new);

     if (cfg->coherent_walk || (old & ARM_LPAE_PTE_SW_SYNC))
     return old;

     /* Even if it's not ours, there's no point waiting; just kick it
*/
     __arm_lpae_sync_pte(ptep, cfg);
     if (old == curr)
     WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);

     return old;
}


If another thread changes the ptep between cmpxchg64_relaxed and
WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC), the operation
WRITE_ONCE will overwrite the change.


Can you please provide an example of a code path where this happens? The
idea is that CPUs can race on the cmpxchg(), but there will only be one
winner.


The only way a table entry can suddenly disappear is in a race that 
involves mapping or unmapping a whole block/table-sized region, while 
simultaneously mapping a page *within* that region. Yes, if someone uses 
the API in a nonsensical and completely invalid way that cannot have a 
predictable outcome, they get an unpredictable outcome. Whoop de do...
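
As a concrete sketch of such invalid usage (hypothetical caller code, not something any correct user of the API does), consider two threads racing on overlapping ranges of the same domain:

	/* thread A: tears down a whole 2M block entry */
	iommu_unmap(domain, iova, SZ_2M);

	/* thread B, concurrently: maps a 4K page *within* that 2M region,
	 * which may install a next-level table at the very PTE thread A is
	 * clearing; the outcome is unpredictable by construction. */
	iommu_map(domain, iova + SZ_4K, paddr, SZ_4K, IOMMU_READ | IOMMU_WRITE);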



2、the second


for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz) {
     /* Unmap! */
     if (i == unmap_idx)
continue;

     __arm_lpae_init_pte(data, blk_paddr, pte, lvl,
&tablep[i]);
}

pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);


When altering a translation table descriptor, including splitting a block into
constituent granules, the Armv8-A and SMMUv3 architectures require
a break-before-make procedure. But in the function arm_lpae_split_blk_unmap,
it changes a block descriptor to an equivalent span of page translations
directly. Is it appropriate to do so?


Break-before-make doesn't really work for the SMMU because faults are
generally fatal.

Are you seeing problems in practice with this code?


TBH I do still wonder if we shouldn't just get rid of split_blk_unmap 
and all its complexity. Other drivers treat an unmap of a page from a 
block entry as simply unmapping the whole block, and that's the 
behaviour VFIO seems to expect. My only worry is that it's been around 
long enough that there might be some horrible out-of-tree code that *is* 
relying on it, despite the fact that it's impossible to implement in a 
way that's definitely 100% safe :/


Robin.

[PATCH] iommu: fix a check in iommu_check_bind_data()

2020-11-03 Thread Dan Carpenter
The "data->flags" variable is a u64 so if one of the high 32 bits is
set, the original code will allow it, but it should be rejected.  The
fix is to declare "mask" as a u64 instead of a u32.

Fixes: d90573812eea ("iommu/uapi: Handle data and argsz filled by users")
Signed-off-by: Dan Carpenter 
---
 drivers/iommu/iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8c470f451a32..b53446bb8c6b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2071,7 +2071,7 @@ EXPORT_SYMBOL_GPL(iommu_uapi_cache_invalidate);
 
 static int iommu_check_bind_data(struct iommu_gpasid_bind_data *data)
 {
-   u32 mask;
+   u64 mask;
int i;
 
if (data->version != IOMMU_GPASID_BIND_VERSION_1)
-- 
2.28.0
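
A standalone illustration (userspace C with a hypothetical mask value, not the kernel code itself) of why the u32 declaration lets invalid high bits through: ~mask is computed in 32 bits and zero-extended for the comparison, so the upper 32 bits of data->flags are never tested.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t flags = 1ULL << 40;	/* invalid high bit set by userspace */
	uint32_t mask32 = 0x7;		/* hypothetical set of valid low bits */
	uint64_t mask64 = 0x7;

	/* ~mask32 is only 32 bits wide; zero-extension leaves the upper
	 * 32 bits clear, so the bad flag slips through the check. */
	printf("u32 mask rejects? %s\n", (flags & ~mask32) ? "yes" : "no"); /* no  */

	/* With a 64-bit mask the inversion covers all 64 bits. */
	printf("u64 mask rejects? %s\n", (flags & ~mask64) ? "yes" : "no"); /* yes */
	return 0;
}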



can we switch bmips to the generic dma ranges code

2020-11-03 Thread Christoph Hellwig
Hi Florian and others,

now that the generic DMA ranges code landed, can we switch bmips over
to it instead of duplicating the logic?


Re: [PATCH v4 0/7] Convert the intel iommu driver to the dma-iommu api

2020-11-03 Thread Joonas Lahtinen
Quoting Tvrtko Ursulin (2020-11-03 11:14:32)
> 
> 
> On 03/11/2020 02:53, Lu Baolu wrote:
> > On 11/2/20 7:52 PM, Tvrtko Ursulin wrote:
> >>
> >> On 02/11/2020 02:00, Lu Baolu wrote:
> >>> Hi Tvrtko,
> >>> On 10/12/20 4:44 PM, Tvrtko Ursulin wrote:
> 
>  On 29/09/2020 01:11, Lu Baolu wrote:



>  FYI we have merged the required i915 patches to our tree last week 
>  or so. I *think* this means they will go into 5.11. So the i915 
>  specific workaround patch will not be needed in Intel IOMMU.
> >>>
> >>> Do you mind telling me what's the status of this fix patch? I tried this
> >>> series on v5.10-rc1 with the graphic quirk patch dropped. I am still
> >>> seeing dma faults from graphic device.
> >>
> >> Hmm back then I thought i915 fixes for this would land in 5.11 so I 
> >> will stick with that. :) (See my quoted text a paragraph above yours.)
> > 
> > What size are those fixes? I am considering pushing this series for
> > v5.11. Is it possible to get some acks for those patches and let them
> > go to Linus through iommu tree?
> 
> For 5.10 you mean? They feel a bit too large for comfort to go via a 
> non-i915/drm tree. These are the two patches required:
> 
> https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-gt-next=8a473dbadccfc6206150de3db3223c40785da348
> https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-gt-next=934941ed5a3070a7833c688c9b1d71484fc01a68
> 
> I'll copy Joonas as our maintainer - how does the idea of taking the 
> above two patches through the iommu tree sound to you?

Hi Lu,

The patches have already been merged into our tree and are heading
towards 5.11, so I don't think we should merge them elsewhere. DRM
subsystem had the feature freeze for 5.10 at the time of 5.9-rc5
and only drm-intel-fixes pull requests are sent after that.

The patches seem to aim at eliminating the need for a previously used
workaround. To me it seems more appropriate for the patches to follow
the regular process as a new feature for 5.11, to make sure the changes
get validated as part of linux-next.

Would that work for you? We intend to send the feature pull requests
to DRM for 5.11 in the upcoming weeks.

Regards, Joonas


RE: amdgpu error whenever IOMMU is enabled

2020-11-03 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Hi Jörg,

I am seeing that amdgpu uses the amd_iommu_v2 kernel module.
To me this is the last puzzle piece that was missing to explain the dependency
of that amdgpu bug on the IOMMU.


Best regards
Edgar

From: Merger, Edgar [AUTOSOL/MAS/AUGS]
Sent: Freitag, 30. Oktober 2020 15:26
To: jroe...@suse.de
Cc: iommu@lists.linux-foundation.org
Subject: amdgpu error whenever IOMMU is enabled

Hello Jörg,

We have developed a board, "mCOM10L1900", that can be populated with an AMD
R1305G Ryzen CPU but also with other CPUs from AMD's R1000 and V1000 series.
Please see the attached datasheet.

With one board we have a boot problem that is reproducible roughly every ~50 boots.
The system is accessible via ssh and works fine except for the graphics: the
graphics is off and we don't see a screen. Please see the attached "dmesg.log". From
[52.772273] onwards the kernel reports drm/amdgpu errors. It even tries to
reset the GPU, but that fails too. I also tried to reset amdgpu with the command "sudo
cat /sys/kernel/debug/dri/N/amdgpu_gpu_recover". That did not help either.
There is a similar error reported here 
https://bugzilla.kernel.org/show_bug.cgi?id=204241. However the applied patch 
should have already been in the Linux-kernel we use, which is 
"5.4.0-47-generic".

Also found that "amdgpu_info" shows different versions at "SMC firmware 
version". Please see attachment. It is still unclear what role the IOMMU plays 
here. Whenever we turn it off, the error does not show up anymore on the 
failing board.

Best regards
Edgar


use of dma_direct_set_offset in (allwinner) drivers

2020-11-03 Thread Christoph Hellwig
Hi all,

Linux 5.10-rc1 switched from having a single dma offset in struct device
to a set of DMA ranges, and introduced a new helper to set them,
dma_direct_set_offset.

This in fact surfaced that a bunch of drivers violate our layering
and set the offset from driver code, which meant we had to reluctantly
export the symbol to set up the DMA range.

The drivers are:

drivers/gpu/drm/sun4i/sun4i_backend.c

  This just uses dma_direct_set_offset as a fallback.  Is there any good
  reason to not just kill off the fallback?

drivers/media/platform/sunxi/sun4i-csi/sun4i_csi.c

  Same as above.

drivers/media/platform/sunxi/sun6i-csi/sun6i_csi.c

  This driver unconditionally sets the offset.  Why can't we do this
  in the device tree?

drivers/staging/media/sunxi/cedrus/cedrus_hw.c

  Same as above.



Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-11-03 Thread j...@8bytes.org
On Mon, Oct 12, 2020 at 08:38:54AM +, Tian, Kevin wrote:
> > From: Jason Wang 

> > Jason suggest something like /dev/sva. There will be a lot of other
> > subsystems that could benefit from this (e.g vDPA).

Honestly, I fail to see the benefit of offloading these IOMMU specific
setup tasks to user-space.

The ways PASID and the device partitioning it allows are used are very
device specific. A GPU will be partitioned completely differently from a
network card. So the device drivers should use the (v)SVA APIs to set up
the partitioning in a way which makes sense for the device.

And VFIO is of course a user by itself, as it allows assigning device
partitions to guests. Or it can even allow assigning complete devices and
let the guests partition them themselves.

So having said this, what is the benefit of exposing those SVA internals
to user-space?

Regards,

Joerg


Re: [PATCH for-5.10] swiotlb: remove the tbl_dma_addr argument to swiotlb_tbl_map_single

2020-11-03 Thread Christoph Hellwig
ping?

On Fri, Oct 23, 2020 at 08:33:09AM +0200, Christoph Hellwig wrote:
> The tbl_dma_addr argument is used to check the DMA boundary for the
> allocations, and thus needs to be a dma_addr_t.  swiotlb-xen instead
> passed a physical address, which could lead to incorrect results for
> strange offsets.  Fix this by removing the parameter entirely and hard
> code the DMA address for io_tlb_start instead.
> 
> Fixes: 91ffe4ad534a ("swiotlb-xen: introduce phys_to_dma/dma_to_phys 
> translations")
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Stefano Stabellini 
> ---
>  drivers/iommu/intel/iommu.c |  5 ++---
>  drivers/xen/swiotlb-xen.c   |  3 +--
>  include/linux/swiotlb.h | 10 +++---
>  kernel/dma/swiotlb.c| 16 ++--
>  4 files changed, 12 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 8651f6d4dfa032..6b560e6f193096 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -3815,9 +3815,8 @@ bounce_map_single(struct device *dev, phys_addr_t 
> paddr, size_t size,
>* page aligned, we don't need to use a bounce page.
>*/
>   if (!IS_ALIGNED(paddr | size, VTD_PAGE_SIZE)) {
> - tlb_addr = swiotlb_tbl_map_single(dev,
> - phys_to_dma_unencrypted(dev, io_tlb_start),
> - paddr, size, aligned_size, dir, attrs);
> + tlb_addr = swiotlb_tbl_map_single(dev, paddr, size,
> + aligned_size, dir, attrs);
>   if (tlb_addr == DMA_MAPPING_ERROR) {
>   goto swiotlb_error;
>   } else {
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 71ce1b7a23d1cc..2b385c1b4a99cb 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -395,8 +395,7 @@ static dma_addr_t xen_swiotlb_map_page(struct device 
> *dev, struct page *page,
>*/
>   trace_swiotlb_bounced(dev, dev_addr, size, swiotlb_force);
>  
> - map = swiotlb_tbl_map_single(dev, virt_to_phys(xen_io_tlb_start),
> -  phys, size, size, dir, attrs);
> + map = swiotlb_tbl_map_single(dev, phys, size, size, dir, attrs);
>   if (map == (phys_addr_t)DMA_MAPPING_ERROR)
>   return DMA_MAPPING_ERROR;
>  
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 513913ff748626..3bb72266a75a1d 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -45,13 +45,9 @@ enum dma_sync_target {
>   SYNC_FOR_DEVICE = 1,
>  };
>  
> -extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
> -   dma_addr_t tbl_dma_addr,
> -   phys_addr_t phys,
> -   size_t mapping_size,
> -   size_t alloc_size,
> -   enum dma_data_direction dir,
> -   unsigned long attrs);
> +phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
> + size_t mapping_size, size_t alloc_size,
> + enum dma_data_direction dir, unsigned long attrs);
>  
>  extern void swiotlb_tbl_unmap_single(struct device *hwdev,
>phys_addr_t tlb_addr,
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index b4eea0abc3f002..92e2f54f24c01b 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -441,14 +441,11 @@ static void swiotlb_bounce(phys_addr_t orig_addr, 
> phys_addr_t tlb_addr,
>   }
>  }
>  
> -phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
> -dma_addr_t tbl_dma_addr,
> -phys_addr_t orig_addr,
> -size_t mapping_size,
> -size_t alloc_size,
> -enum dma_data_direction dir,
> -unsigned long attrs)
> +phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t 
> orig_addr,
> + size_t mapping_size, size_t alloc_size,
> + enum dma_data_direction dir, unsigned long attrs)
>  {
> + dma_addr_t tbl_dma_addr = phys_to_dma_unencrypted(hwdev, io_tlb_start);
>   unsigned long flags;
>   phys_addr_t tlb_addr;
>   unsigned int nslots, stride, index, wrap;
> @@ -667,9 +664,8 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t 
> paddr, size_t size,
>   trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size,
> swiotlb_force);
>  
> - swiotlb_addr = swiotlb_tbl_map_single(dev,
> - phys_to_dma_unencrypted(dev, io_tlb_start),
> - paddr, size, size, dir, attrs);
> + swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, size, dir,
> + attrs);
>  

Re: [PATCH v4 0/7] Convert the intel iommu driver to the dma-iommu api

2020-11-03 Thread Tvrtko Ursulin



On 03/11/2020 02:53, Lu Baolu wrote:

On 11/2/20 7:52 PM, Tvrtko Ursulin wrote:


On 02/11/2020 02:00, Lu Baolu wrote:

Hi Tvrtko,
On 10/12/20 4:44 PM, Tvrtko Ursulin wrote:


On 29/09/2020 01:11, Lu Baolu wrote:

Hi Tvrtko,

On 9/28/20 5:44 PM, Tvrtko Ursulin wrote:


On 27/09/2020 07:34, Lu Baolu wrote:

Hi,

The previous post of this series could be found here.

https://lore.kernel.org/linux-iommu/20200912032200.11489-1-baolu...@linux.intel.com/ 



This version introduces a new patch [4/7] to fix an issue reported 
here.


https://lore.kernel.org/linux-iommu/51a1baec-48d1-c0ac-181b-1fba92aa4...@linux.intel.com/ 



There aren't any other changes.

Please help to test and review.

Best regards,
baolu

Lu Baolu (3):
   iommu: Add quirk for Intel graphic devices in map_sg


Since I do have patches to fix i915 to handle this, do we want to 
co-ordinate the two and avoid having to add this quirk and then 
later remove it? Or do you want to go the staged approach?


I have no preference. It depends on which patch goes first. Let the
maintainers help here.


FYI we have merged the required i915 patches to our tree last week 
or so. I *think* this means they will go into 5.11. So the i915 
specific workaround patch will not be needed in Intel IOMMU.


Do you mind telling me what's the status of this fix patch? I tried this
series on v5.10-rc1 with the graphic quirk patch dropped. I am still
seeing dma faults from graphic device.


Hmm back then I thought i915 fixes for this would land in 5.11 so I 
will stick with that. :) (See my quoted text a paragraph above yours.)


What size are those fixes? I am considering pushing this series for
v5.11. Is it possible to get some acks for those patches and let them
go to Linus through iommu tree?


For 5.10 you mean? They feel a bit too large for comfort to go via a 
non-i915/drm tree. These are the two patches required:


https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-gt-next=8a473dbadccfc6206150de3db3223c40785da348
https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-gt-next=934941ed5a3070a7833c688c9b1d71484fc01a68

I'll copy Joonas as our maintainer - how does the idea of taking the 
above two patches through the iommu tree sound to you?


Regards,

Tvrtko

Re: Some questions about arm_lpae_install_table

2020-11-03 Thread Will Deacon
On Tue, Nov 03, 2020 at 11:00:27AM +0800, Kunkun Jiang wrote:
> Recently, I have read and learned the code related to io-pgtable-arm.c. There
> are two questions about arm_lpae_install_table.
> 
> 1、the first
> 
> > static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
> >  arm_lpae_iopte *ptep,
> >  arm_lpae_iopte curr,
> >  struct io_pgtable_cfg *cfg)
> > {
> >     arm_lpae_iopte old, new;
> > 
> >     new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
> >     if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
> >     new |= ARM_LPAE_PTE_NSTABLE;
> > 
> >    /*
> >  * Ensure the table itself is visible before its PTE can be.
> >  * Whilst we could get away with cmpxchg64_release below, this
> >  * doesn't have any ordering semantics when !CONFIG_SMP.
> >  */
> >     dma_wmb();
> > 
> >     old = cmpxchg64_relaxed(ptep, curr, new);
> > 
> >     if (cfg->coherent_walk || (old & ARM_LPAE_PTE_SW_SYNC))
> >     return old;
> > 
> >     /* Even if it's not ours, there's no point waiting; just kick it
> > */
> >     __arm_lpae_sync_pte(ptep, cfg);
> >     if (old == curr)
> >     WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
> > 
> >     return old;
> > }
> 
> If another thread changes the ptep between cmpxchg64_relaxed and
> WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC), the operation
> WRITE_ONCE will overwrite the change.

Can you please provide an example of a code path where this happens? The
idea is that CPUs can race on the cmpxchg(), but there will only be one
winner.

> 2、the second
> 
> > for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz) {
> >     /* Unmap! */
> >     if (i == unmap_idx)
> > continue;
> > 
> >     __arm_lpae_init_pte(data, blk_paddr, pte, lvl,
> > &tablep[i]);
> > }
> > 
> > pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);
> 
> When altering a translation table descriptor, including splitting a block into
> constituent granules, the Armv8-A and SMMUv3 architectures require
> a break-before-make procedure. But in the function arm_lpae_split_blk_unmap,
> it changes a block descriptor to an equivalent span of page translations
> directly. Is it appropriate to do so?

Break-before-make doesn't really work for the SMMU because faults are
generally fatal.

Are you seeing problems in practice with this code?

Will