Re: [PATCH v3] dma-mapping: set default segment_boundary_mask to ULONG_MAX

2020-08-18 Thread Christoph Hellwig
Applied to the dma-mapping tree.  This should give us the whole merge
window to root out any obvious problems with drivers.


Re: [PATCH v3] powerpc/pseries/svm: Allocate SWIOTLB buffer anywhere in memory

2020-08-18 Thread Christoph Hellwig
On Tue, Aug 18, 2020 at 07:11:26PM -0300, Thiago Jung Bauermann wrote:
> POWER secure guests (i.e., guests which use the Protected Execution
> Facility) need to use SWIOTLB to be able to do I/O with the hypervisor, but
> they don't need the SWIOTLB memory to be in low addresses since the
> hypervisor doesn't have any addressing limitation.
> 
> This solves a SWIOTLB initialization problem we are seeing in secure guests
> with 128 GB of RAM: they are configured with 4 GB of crashkernel reserved
> memory, which leaves no space for SWIOTLB in low addresses.
> 
> To do this, we use mostly the same code as swiotlb_init(), but allocate the
> buffer using memblock_alloc() instead of memblock_alloc_low().
> 
> Signed-off-by: Thiago Jung Bauermann 

Looks fine to me (except for the pointlessly long comment lines, but I've
been told that's the powerpc way).


Re: [PATCH 1/2] dma-mapping: introduce relaxed version of dma sync

2020-08-18 Thread Cho KyongHo
On Tue, Aug 18, 2020 at 05:10:06PM +0100, Christoph Hellwig wrote:
> On Tue, Aug 18, 2020 at 11:07:57AM +0100, Will Deacon wrote:
> > > > so I'm not sure
> > > > that we should be complicating the implementation like this to try to
> > > > make it "fast".
> > > > 
> > > I agree that this patch makes the implementation of the DMA API a bit
> > > more complex, but I don't think it seriously complicates it.
> > 
> > It's death by a thousand cuts; this patch further fragments the architecture
> > backends and leads to arm64-specific behaviour which consequently won't get
> > well tested by anybody else. Now, it might be worth it, but there's not
> > enough information here to make that call.
> 
> So it turns out I misread the series (*cough*, crazy long lines,
> *cough*), and it does not actually expose a new API as I thought, but
> it still makes a total mess of the internal interface.  It turns out
> that on the for cpu side we already have arch_sync_dma_for_cpu_all,
> which should do all that is needed.  We could do the equivalent for
> the to device side, but only IFF there really is a major benefit for
> something that actually is mainstream and matters.
> 
Indeed, arch_sync_dma_for_cpu_all() is called where the new internal API
arch_sync_barrier_for_cpu() would be; I had just thought it was a special
hook for MIPS.
In the next version of the patch series, I will consider using
arch_sync_dma_for_cpu_all() and introducing its 'for_dev' counterpart, with
some performance data to show the benefit of the change.
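
For concreteness, a minimal sketch of the shape such a change could take.
arch_sync_dma_for_cpu_all() exists today; arch_sync_dma_for_device_all() is
hypothetical, and the sg loop below is simplified (it omits the swiotlb
bounce-buffer handling the real dma-direct code has):

	/* Hypothetical batching hook, mirroring arch_sync_dma_for_cpu_all(). */
	void __weak arch_sync_dma_for_device_all(void)
	{
	}

	static void dma_direct_sync_sg_for_device(struct device *dev,
			struct scatterlist *sgl, int nents,
			enum dma_data_direction dir)
	{
		struct scatterlist *sg;
		int i;

		/* Per-entry maintenance, ideally without a barrier per entry... */
		for_each_sg(sgl, sg, nents, i)
			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);

		/* ...followed by a single barrier for the whole list. */
		arch_sync_dma_for_device_all();
	}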

Thank you for the proposal.

KyongHo

Re: [PATCH 1/2] dma-mapping: introduce relaxed version of dma sync

2020-08-18 Thread Cho KyongHo
On Tue, Aug 18, 2020 at 11:07:57AM +0100, Will Deacon wrote:
> On Tue, Aug 18, 2020 at 06:37:39PM +0900, Cho KyongHo wrote:
> > On Tue, Aug 18, 2020 at 09:28:53AM +0100, Will Deacon wrote:
> > > On Tue, Aug 18, 2020 at 04:43:10PM +0900, Cho KyongHo wrote:
> > > > Cache maintenance operations on most CPU architectures need a memory
> > > > barrier after the maintenance for the DMA to view the memory region
> > > > correctly. The problem is that the memory barrier is very expensive, and
> > > > dma_[un]map_sg() and dma_sync_sg_for_{device|cpu}() issue the barrier for
> > > > every single sg entry. On some CPU micro-architectures, a single memory
> > > > barrier consumes more time than a cache clean of 4KiB. It becomes more
> > > > serious as the number of CPU cores grows.
> > > 
> > > Have you got higher-level performance data for this change? It's more
> > > likely that the DSB is what actually forces the prior cache maintenance
> > > to complete,
> > 
> > This patch does not skip the necessary DSB after cache maintenance. It
> > just removes the repeated DSB for every single sg entry and issues a
> > single DSB once cache maintenance on all sg entries is complete.
> 
> Yes, I realise that, but what I'm saying is that a big part of your
> justification for this change is:
> 
>   | The problem is that memory barrier is very expensive and dma_[un]map_sg()
>   | and dma_sync_sg_for_{device|cpu}() involves the memory barrier per every
>   | single cache sg entry. In some CPU micro-architecture, a single memory
>   | barrier consumes more time than cache clean on 4KiB.
> 
> and my point is that the DSB is likely completing the cache maintenance,
> so as cache maintenance instructions retire faster in the micro-architecture,
> the DSB becomes absolutely slower. In other words, it doesn't make much
> sense to me to compare the cost of the DSB with the cost of the cache
> maintenance; what matters more is the cost of the high-level unmap()
> operation for the sglist.
> 
I now understand your point. But I still believe that repeated DSBs in the
middle of cache maintenance waste CPU cycles. Avoiding that redundancy adds
some extra complexity to the implementation of the DMA API, but I think it
is worthwhile.
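
As a minimal sketch of the idea inside a sync_sg implementation
(clean_dcache_nobarrier() is a hypothetical helper; on arm64 the real
per-range maintenance lives in assembly):

	struct scatterlist *sg;
	int i;

	/* Today, conceptually: each entry's maintenance ends in a DSB. */
	for_each_sg(sgl, sg, nents, i) {
		clean_dcache_nobarrier(sg_virt(sg), sg->length);
		dsb(sy);			/* per-entry barrier */
	}

	/*
	 * Proposed: hoist the barriers into one DSB that completes all of
	 * the maintenance issued above.
	 */
	for_each_sg(sgl, sg, nents, i)
		clean_dcache_nobarrier(sg_virt(sg), sg->length);
	dsb(sy);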

> > > so it's important to look at the bigger picture, not just the
> > > apparent relative cost of these instructions.
> > > 
If by the bigger picture you mean the performance impact of this patch on a
complete user scenario, we are evaluating it in some latency-sensitive
scenarios. But I wonder if a performance gain in a platform/SoC-specific
scenario is also persuasive.
> 
> Latency is fine too, but phrasing the numbers (and we really need those)
> in terms of things like "The interrupt response time for this in-tree
> driver is improved by xxx ns (yy %) after this change" or "Throughput
> for this in-tree driver goes from xxx mb/s to yyy mb/s" would be really
> helpful.
> 

Unfortunately, we have no in-tree driver to show the performance.
Instead, we evaluated the speed of dma_sync_sg_for_device() to see the
improvement from this patch.
For example, a Cortex-A55 in our 2-cluster, big-mid-little system gains 28%
(130.9 usec -> 94.5 usec) running dma_sync_sg_for_device(sg, nents,
DMA_TO_DEVICE) with nents = 256 and a 4KiB length for each sg entry.
Let me describe the detailed performance results in the next patch series,
which will also include some fixes to errors in the commit messages.

> > > Also, it's a miracle that non-coherent DMA even works,
> > 
I am sorry, Will. I don't understand this. Can you let me know what you
mean by the above sentence?
> 
> Non-coherent DMA sucks for software.

I agree. But due to the H/W cost, proposals about coherent DMA are
always challenging.

> For the most part, Linux does a nice
> job of hiding this from device drivers, and I think _that_ is the primary
> concern, rather than performance. If performance is a problem, then the
> solution is cache coherence or a shared non-cacheable buffer (rather than
> the streaming API).
We are also trying to use non-cacheable buffers for the non-coherent
DMAs. But the problem with the non-cacheable buffer is CPU access speed.
> 
> > > so I'm not sure
> > > that we should be complicating the implementation like this to try to
> > > make it "fast".
> > > 
> > I agree that this patch makes the implementation of the DMA API a bit
> > more complex, but I don't think it seriously complicates it.
> 
> It's death by a thousand cuts; this patch further fragments the architecture
> backends and leads to arm64-specific behaviour which consequently won't get
> well tested by anybody else. Now, it might be worth it, but there's not
> enough information here to make that call.
> Will
> 

RE: [PATCH v4] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-18 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Wednesday, August 19, 2020 2:31 AM
> To: Song Bao Hua (Barry Song) ; w...@kernel.org;
> j...@8bytes.org
> Cc: Zengtao (B) ;
> iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> Linuxarm 
> Subject: Re: [PATCH v4] iommu/arm-smmu-v3: permit users to disable msi
> polling
> 
> On 2020-08-18 12:17, Barry Song wrote:
> > Polling by MSI isn't necessarily faster than polling by SEV. Tests on
> > hi1620 show hns3 100G NIC network throughput can improve from 25G to
> > 27G if we disable MSI polling while running 16 netperf threads sending
> > UDP packets of size 32KB. TX throughput can improve from 7G to 7.7G for
> > a single thread.
> > The reason for the throughput improvement is that the latency to poll
> > the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC
> > in an empty cmd queue, we typically need to wait 280ns using MSI
> > polling, but only around 190ns after disabling MSI polling.
> > This patch provides a command line option so that users can decide
> > whether to use MSI polling based on their own tests.
> >
> > Signed-off-by: Barry Song 
> > ---
> >   -v4: rebase on top of 5.9-rc1
> >   refine changelog
> >
> >   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ++++++++++++++----
> >   1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 7196207be7ea..89d3cb391fef 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
> >   MODULE_PARM_DESC(disable_bypass,
> > 	"Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU.");
> >
> > +static bool disable_msipolling;
> > +module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);
> 
> Just use module_param() - going out of the way to specify a "different"
> name that's identical to the variable name is silly.

Thanks for pointing that out; I also fixed the same issue in the existing
disable_bypass parameter in the new patchset.

I'm sorry, I made a typo: the new patchset should be v5, but I wrote v4.

> 
> Also I think the preference these days is to specify permissions as
> plain octal constants rather than those rather inscrutable macros. I
> certainly find that more readable myself.
> 
> (Yes, the existing parameter commits the same offences, but I'd rather
> clean that up separately than perpetuate it)

Thanks for pointing that out. This is fixed in the new patchset.

> 
> > +MODULE_PARM_DESC(disable_msipolling,
> > +   "Disable MSI-based polling for CMD_SYNC completion.");
> > +
> >   enum pri_resp {
> > PRI_RESP_DENY = 0,
> > PRI_RESP_FAIL = 1,
> > @@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> > return 0;
> >   }
> >
> > +static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)
> > +{
> > +   return !disable_msipolling &&
> > +  smmu->features & ARM_SMMU_FEAT_COHERENCY &&
> > +  smmu->features & ARM_SMMU_FEAT_MSI;
> > +}
> 
> I'd wrap this up into a new ARM_SMMU_OPT_MSIPOLL flag set at probe time,
> rather than constantly reevaluating this whole expression (now that it's
> no longer simply testing two adjacent bits of the same word).

Done in the new patchset. It turns out we only need to check one bit now
with the new patch:

-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY)
+   if (smmu->options & ARM_SMMU_OPT_MSIPOLL)
return __arm_smmu_cmdq_poll_until_msi(smmu, llq);


> 
> Robin.
> 
> > +
> >   static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
> > 					 u32 prod)
> >   {
> > @@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
> >  * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
> >  * payload, so the write will zero the entire command on that platform.
> >  */
> > -   if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > -   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
> > +   if (arm_smmu_use_msipolling(smmu)) {
> > 		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
> > 				   q->ent_dwords * 8;
> > }
> > @@ -1332,8 +1343,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
> >   static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
> > 					 struct arm_smmu_ll_queue *llq)
> >   {
> > -   if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > -   smmu->features & ARM_SMMU_FEAT_COHERENCY)

[PATCH v4 1/3] iommu/arm-smmu-v3: replace symbolic permissions by octal permissions for module parameter

2020-08-18 Thread Barry Song
This fixes the below checkpatch issue:
WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using
octal permissions '0444'.
417: FILE: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:417:
module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);

-v4:
   * clean up the existing disable_bypass module parameter
   * add the ARM_SMMU_OPT_MSIPOLL flag, as suggested by Robin

Signed-off-by: Barry Song 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..eea5f7c6d9ab 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -414,7 +414,7 @@
 #define MSI_IOVA_LENGTH			0x100000
 
 static bool disable_bypass = 1;
-module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
+module_param_named(disable_bypass, disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
-- 
2.27.0




[PATCH v4 3/3] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-18 Thread Barry Song
Polling by MSI isn't necessarily faster than polling by SEV. Tests on
hi1620 show hns3 100G NIC network throughput can improve from 25G to
27G if we disable MSI polling while running 16 netperf threads sending
UDP packets of size 32KB. TX throughput can improve from 7G to 7.7G for
a single thread.
The reason for the throughput improvement is that the latency to poll
the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC
in an empty cmd queue, we typically need to wait 280ns using MSI
polling, but only around 190ns after disabling MSI polling.
This patch provides a command line option so that users can decide
whether to use MSI polling based on their own tests.

Signed-off-by: Barry Song 
---
 -v4: add the ARM_SMMU_OPT_MSIPOLL flag, as suggested by Robin
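
For reference, a brief usage note: since this is a boolean module parameter
read at probe time, a built-in driver would take it on the kernel command
line with the module-name prefix (assuming the usual KBUILD_MODNAME-derived
prefix for arm-smmu-v3.c; the kernel treats '-' and '_' in parameter names
interchangeably):

	arm_smmu_v3.disable_msipolling=1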

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5b40d535a7c8..7332251dd8cd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param(disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
+static bool disable_msipolling;
+module_param(disable_msipolling, bool, 0444);
+MODULE_PARM_DESC(disable_msipolling,
+   "Disable MSI-based polling for CMD_SYNC completion.");
+
 enum pri_resp {
PRI_RESP_DENY = 0,
PRI_RESP_FAIL = 1,
@@ -652,6 +657,7 @@ struct arm_smmu_device {
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
 #define ARM_SMMU_OPT_PAGE0_REGS_ONLY   (1 << 1)
+#define ARM_SMMU_OPT_MSIPOLL   (1 << 2)
u32 options;
 
	struct arm_smmu_cmdq		cmdq;
@@ -992,8 +998,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
 * payload, so the write will zero the entire command on that platform.
 */
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
+   if (smmu->options & ARM_SMMU_OPT_MSIPOLL) {
		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
   q->ent_dwords * 8;
}
@@ -1332,8 +1337,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
 static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 struct arm_smmu_ll_queue *llq)
 {
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY)
+   if (smmu->options & ARM_SMMU_OPT_MSIPOLL)
return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
 
return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);
@@ -3741,8 +3745,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
if (reg & IDR0_SEV)
smmu->features |= ARM_SMMU_FEAT_SEV;
 
-   if (reg & IDR0_MSI)
+   if (reg & IDR0_MSI) {
smmu->features |= ARM_SMMU_FEAT_MSI;
+   if (coherent && !disable_msipolling)
+   smmu->options |= ARM_SMMU_OPT_MSIPOLL;
+   }
 
if (reg & IDR0_HYP)
smmu->features |= ARM_SMMU_FEAT_HYP;
-- 
2.27.0




[PATCH v4 0/3] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-18 Thread Barry Song
Patches 1/3 and 2/3 are preparation for patch 3/3, which permits users
to disable MSI-based polling via the command line.

-v4:
  in response to Robin's comments:
  * clean up the code of the existing module parameter disable_bypass
  * add the ARM_SMMU_OPT_MSIPOLL flag; as a result, we only need to check
    one bit in options rather than two bits in features

Barry Song (3):
  iommu/arm-smmu-v3: replace symbolic permissions by octal permissions
for module parameter
  iommu/arm-smmu-v3: replace module_param_named by module_param for
disable_bypass
  iommu/arm-smmu-v3: permit users to disable msi polling

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

-- 
2.27.0




[PATCH v4 2/3] iommu/arm-smmu-v3: replace module_param_named by module_param for disable_bypass

2020-08-18 Thread Barry Song
Just use module_param() - going out of the way to specify a "different"
name that's identical to the variable name is silly.

Signed-off-by: Barry Song 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index eea5f7c6d9ab..5b40d535a7c8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -414,7 +414,7 @@
 #define MSI_IOVA_LENGTH			0x100000
 
 static bool disable_bypass = 1;
-module_param_named(disable_bypass, disable_bypass, bool, 0444);
+module_param(disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
-- 
2.27.0




Re: [PATCH v3 10/17] memblock: reduce number of parameters in for_each_mem_range()

2020-08-18 Thread Miguel Ojeda
On Tue, Aug 18, 2020 at 5:19 PM Mike Rapoport  wrote:
>
>  .clang-format  |  2 ++

For the .clang-format bit:

Acked-by: Miguel Ojeda 

Cheers,
Miguel


[PATCH v3] powerpc/pseries/svm: Allocate SWIOTLB buffer anywhere in memory

2020-08-18 Thread Thiago Jung Bauermann
POWER secure guests (i.e., guests which use the Protected Execution
Facility) need to use SWIOTLB to be able to do I/O with the hypervisor, but
they don't need the SWIOTLB memory to be in low addresses since the
hypervisor doesn't have any addressing limitation.

This solves a SWIOTLB initialization problem we are seeing in secure guests
with 128 GB of RAM: they are configured with 4 GB of crashkernel reserved
memory, which leaves no space for SWIOTLB in low addresses.

To do this, we use mostly the same code as swiotlb_init(), but allocate the
buffer using memblock_alloc() instead of memblock_alloc_low().

Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/svm.h       |  4 ++++
 arch/powerpc/mm/mem.c                |  6 +++++-
 arch/powerpc/platforms/pseries/svm.c | 26 ++++++++++++++++++++++++++
 3 files changed, 35 insertions(+), 1 deletion(-)

Changes from v2:
- Panic if unable to allocate buffer, as suggested by Christoph.

Changes from v1:
- Open-code swiotlb_init() in arch-specific code, as suggested by
  Christoph.

diff --git a/arch/powerpc/include/asm/svm.h b/arch/powerpc/include/asm/svm.h
index 85580b30aba4..7546402d796a 100644
--- a/arch/powerpc/include/asm/svm.h
+++ b/arch/powerpc/include/asm/svm.h
@@ -15,6 +15,8 @@ static inline bool is_secure_guest(void)
return mfmsr() & MSR_S;
 }
 
+void __init svm_swiotlb_init(void);
+
 void dtl_cache_ctor(void *addr);
 #define get_dtl_cache_ctor()   (is_secure_guest() ? dtl_cache_ctor : NULL)
 
@@ -25,6 +27,8 @@ static inline bool is_secure_guest(void)
return false;
 }
 
+static inline void svm_swiotlb_init(void) {}
+
 #define get_dtl_cache_ctor() NULL
 
 #endif /* CONFIG_PPC_SVM */
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c2c11eb8dcfc..0f21bcb16405 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include <asm/svm.h>
 
 #include 
 
@@ -290,7 +291,10 @@ void __init mem_init(void)
 * back to top-down.
 */
memblock_set_bottom_up(true);
-   swiotlb_init(0);
+   if (is_secure_guest())
+   svm_swiotlb_init();
+   else
+   swiotlb_init(0);
 #endif
 
high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
index 40c0637203d5..81085eb8f225 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include <linux/memblock.h>
 #include 
 #include 
 #include 
@@ -34,6 +35,31 @@ static int __init init_svm(void)
 }
 machine_early_initcall(pseries, init_svm);
 
+/*
+ * Initialize SWIOTLB. Essentially the same as swiotlb_init(), except that it
+ * can allocate the buffer anywhere in memory. Since the hypervisor doesn't have
+ * any addressing limitation, we don't need to allocate it in low addresses.
+ */
+void __init svm_swiotlb_init(void)
+{
+   unsigned char *vstart;
+   unsigned long bytes, io_tlb_nslabs;
+
+   io_tlb_nslabs = (swiotlb_size_or_default() >> IO_TLB_SHIFT);
+   io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
+
+   bytes = io_tlb_nslabs << IO_TLB_SHIFT;
+
+   vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
+   if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
+   return;
+
+   if (io_tlb_start)
+   memblock_free_early(io_tlb_start,
+   PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
+   panic("SVM: Cannot allocate SWIOTLB buffer");
+}
+
 int set_memory_encrypted(unsigned long addr, int numpages)
 {
if (!PAGE_ALIGNED(addr))


Re: [PATCH 00/16] IOMMU driver for Kirin 960/970

2020-08-18 Thread John Stultz
On Tue, Aug 18, 2020 at 9:26 AM Robin Murphy  wrote:
> On 2020-08-18 16:29, Mauro Carvalho Chehab wrote:
> > Em Tue, 18 Aug 2020 15:47:55 +0100
> > Basically, the DT binding has this, for IOMMU:
> >
> >
> >   smmu_lpae {
> >   compatible = "hisilicon,smmu-lpae";
> >   };
> >
> > ...
> >   dpe: dpe@e860 {
> >   compatible = "hisilicon,kirin970-dpe";
> >   memory-region = <_dma_reserved>;
> > ...
> >   iommu_info {
> >   start-addr = <0x8000>;
> >   size = <0xbfff8000>;
> >   };
> >   }
> >
> > This is used by kirin9xx_drm_dss.c in order to enable and use
> > the iommu:
> >
> >
> >   static int dss_enable_iommu(struct platform_device *pdev, struct 
> > dss_hw_ctx *ctx)
> >   {
> >   struct device *dev = NULL;
> >
> >   dev = &pdev->dev;
> >
> >   /* create iommu domain */
> >   ctx->mmu_domain = iommu_domain_alloc(dev->bus);
> >   if (!ctx->mmu_domain) {
> >   pr_err("iommu_domain_alloc failed!\n");
> >   return -EINVAL;
> >   }
> >
> >   iommu_attach_device(ctx->mmu_domain, dev);
> >
> >   return 0;
> >   }
> >
> > The only place where the IOMMU domain is used is on this part of the
> > code(error part simplified here) [1]:
> >
> >   void hisi_dss_smmu_on(struct dss_hw_ctx *ctx)
> >   {
> >   uint64_t fama_phy_pgd_base;
> >   uint32_t phy_pgd_base;
> > ...
> >   fama_phy_pgd_base = iommu_iova_to_phys(ctx->mmu_domain, 0);
> >   phy_pgd_base = (uint32_t)fama_phy_pgd_base;
> >   if (WARN_ON(!phy_pgd_base))
> >   return;
> >
> >   set_reg(smmu_base + SMMU_CB_TTBR0, phy_pgd_base, 32, 0);
> >   }
> >
> > [1] 
> > https://github.com/mchehab/linux/commit/36da105e719b47bbe9d6cb7e5619b30c7f3eb1bd
> >
> > In other words, the driver needs to get the physical address of the frame
> > buffer (mapped via iommu) in order to set some DRM-specific register.
> >
> > Yeah, the above code is somewhat hackish. I would love to replace
> > this part by a more standard approach.
>
> OK, so from a quick look at that, my impression is that your display
> controller has its own MMU and you don't need to pretend to use the
> IOMMU API at all. Just have the DRM driver use io-pgtable directly to
> run its own set of ARM_32_LPAE_S1 pagetables - see Panfrost for an
> example (but try to ignore the wacky "Mali LPAE" format).

Yea. For the HiKey960, there was originally a similar patch series, but
it was refactored out, and the (still out-of-tree) DRM driver I'm
carrying doesn't seem to need it (though, looking now, we still have the
iommu_info subnode in the dts that may need to be cleaned up).

thanks
-john


Re: [PATCH v2] powerpc/pseries/svm: Allocate SWIOTLB buffer anywhere in memory

2020-08-18 Thread Thiago Jung Bauermann


Christoph Hellwig  writes:

> On Mon, Aug 17, 2020 at 06:46:58PM -0300, Thiago Jung Bauermann wrote:
>> POWER secure guests (i.e., guests which use the Protected Execution
>> Facility) need to use SWIOTLB to be able to do I/O with the hypervisor, but
>> they don't need the SWIOTLB memory to be in low addresses since the
>> hypervisor doesn't have any addressing limitation.
>> 
>> This solves a SWIOTLB initialization problem we are seeing in secure guests
>> with 128 GB of RAM: they are configured with 4 GB of crashkernel reserved
>> memory, which leaves no space for SWIOTLB in low addresses.
>> 
>> To do this, we use mostly the same code as swiotlb_init(), but allocate the
>> buffer using memblock_alloc() instead of memblock_alloc_low().
>> 
>> We also need to add swiotlb_set_no_iotlb_memory() in order to set the
>> no_iotlb_memory flag if initialization fails.
>
> Do you really need the helper?  As far as I can tell the secure guests
> very much rely on swiotlb for all I/O, so you might as well panic if
> you fail to allocate it.

That is true. Ok, I will do that.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH 00/16] IOMMU driver for Kirin 960/970

2020-08-18 Thread Robin Murphy

On 2020-08-18 16:29, Mauro Carvalho Chehab wrote:

Hi Robin,

Em Tue, 18 Aug 2020 15:47:55 +0100
Robin Murphy  escreveu:


On 2020-08-17 08:49, Mauro Carvalho Chehab wrote:

Add a driver for the Kirin 960/970 iommu.

As on the past series, this starts from the original 4.9 driver from
the 96boards tree:

https://github.com/96boards-hikey/linux/tree/hikey970-v4.9

The remaining patches add SPDX headers and make it build and run with
the upstream Kernel.

Chenfeng (1):
iommu: add support for HiSilicon Kirin 960/970 iommu

Mauro Carvalho Chehab (15):
iommu: hisilicon: remove default iommu_map_sg handler
iommu: hisilicon: map and unmap ops gained new arguments
iommu: hisi_smmu_lpae: rebase it to work with upstream
iommu: hisi_smmu: remove linux/hisi/hisi-iommu.h
iommu: hisilicon: cleanup its code style
iommu: hisi_smmu_lpae: get rid of IOMMU_SEC and IOMMU_DEVICE
iommu: get rid of map/unmap tile functions
iommu: hisi_smmu_lpae: use the right code to get domain-priv data
iommu: hisi_smmu_lpae: convert it to probe_device
iommu: add Hisilicon Kirin970 iommu at the building system
iommu: hisi_smmu_lpae: cleanup printk macros
iommu: hisi_smmu_lpae: make OF compatible more standard


Echoing the other comments about none of the driver patches being CC'd
to the IOMMU list...

Still, I dug the series up on lore and frankly I'm not sure what to make
of it - AFAICS the "driver" is just yet another implementation of Arm
LPAE pagetable code, with no obvious indication of how those pagetables
ever get handed off to IOMMU hardware (and indeed no indication of IOMMU
hardware at all). Can you explain how it's supposed to work?

And as a pre-emptive strike, we really don't need any more LPAE
implementations - that's what the io-pgtable library is all about (which
incidentally has been around since 4.0...). I think that should make the
issue of preserving authorship largely moot since there's no need to
preserve most of the code anyway ;)


I didn't know about that, since I got a Hikey 970 board for the first time
about one month ago, and that's the first time I looked into iommu code.

My end goal with this is to make the DRM/KMS driver work with upstream
kernels.

The full patch series are at:

https://github.com/mchehab/linux/commits/hikey970/to_upstream-2.0-v1.1

(I need to put a new version there, after some changes due to recent
upstream discussions at the regulator's part of the code)

Basically, the DT binding has this, for IOMMU:


smmu_lpae {
compatible = "hisilicon,smmu-lpae";
};

...
dpe: dpe@e860 {
compatible = "hisilicon,kirin970-dpe";
memory-region = <_dma_reserved>;
...
iommu_info {
start-addr = <0x8000>;
size = <0xbfff8000>;
};
}

This is used by kirin9xx_drm_dss.c in order to enable and use
the iommu:


static int dss_enable_iommu(struct platform_device *pdev, struct 
dss_hw_ctx *ctx)
{
struct device *dev = NULL;

dev = &pdev->dev;

/* create iommu domain */
ctx->mmu_domain = iommu_domain_alloc(dev->bus);
if (!ctx->mmu_domain) {
pr_err("iommu_domain_alloc failed!\n");
return -EINVAL;
}

iommu_attach_device(ctx->mmu_domain, dev);

return 0;
}

The only place where the IOMMU domain is used is on this part of the
code(error part simplified here) [1]:

void hisi_dss_smmu_on(struct dss_hw_ctx *ctx)
{
uint64_t fama_phy_pgd_base;
uint32_t phy_pgd_base;
...
fama_phy_pgd_base = iommu_iova_to_phys(ctx->mmu_domain, 0);
phy_pgd_base = (uint32_t)fama_phy_pgd_base;
if (WARN_ON(!phy_pgd_base))
return;

set_reg(smmu_base + SMMU_CB_TTBR0, phy_pgd_base, 32, 0);
}

[1] 
https://github.com/mchehab/linux/commit/36da105e719b47bbe9d6cb7e5619b30c7f3eb1bd

In other words, the driver needs to get the physical address of the frame
buffer (mapped via iommu) in order to set some DRM-specific register.

Yeah, the above code is somewhat hackish. I would love to replace
this part by a more standard approach.


OK, so from a quick look at that, my impression is that your display 
controller has its own MMU and you don't need to pretend to use the 
IOMMU API at all. Just have the DRM driver use io-pgtable directly to 
run its own set of ARM_32_LPAE_S1 pagetables - see Panfrost for an 
example (but try to ignore the wacky "Mali LPAE" format).
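
As a rough sketch of that suggestion (assuming a ~v5.9 kernel; the dss_*
names, the flush callbacks, and the ias/oas/pgsize numbers are illustrative
placeholders, not a drop-in driver):

	#include <linux/io-pgtable.h>

	static void dss_tlb_flush_all(void *cookie)
	{
		/* invalidate the display MMU's TLB, hardware-specific */
	}

	static void dss_tlb_flush_walk(unsigned long iova, size_t size,
				       size_t granule, void *cookie)
	{
		dss_tlb_flush_all(cookie);
	}

	static const struct iommu_flush_ops dss_flush_ops = {
		.tlb_flush_all	= dss_tlb_flush_all,
		.tlb_flush_walk	= dss_tlb_flush_walk,
	};

	static int dss_setup_pgtable(struct dss_hw_ctx *ctx, struct device *dev)
	{
		struct io_pgtable_cfg cfg = {
			.pgsize_bitmap	= SZ_4K | SZ_2M,
			.ias		= 32,
			.oas		= 40,
			.coherent_walk	= true,
			.tlb		= &dss_flush_ops,
			.iommu_dev	= dev,
		};

		ctx->pgtbl_ops = alloc_io_pgtable_ops(ARM_32_LPAE_S1, &cfg, ctx);
		if (!ctx->pgtbl_ops)
			return -ENOMEM;

		/*
		 * The TTBR comes straight out of the cfg, so the
		 * iommu_iova_to_phys(..., 0) trick is no longer needed.
		 */
		set_reg(smmu_base + SMMU_CB_TTBR0,
			lower_32_bits(cfg.arm_lpae_s1_cfg.ttbr), 32, 0);
		return 0;
	}

Mappings would then go through ctx->pgtbl_ops->map()/unmap() instead of the
IOMMU API.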


Robin.


Re: [PATCH V2 1/2] Add new flush_iotlb_range and handle freelists when using iommu_unmap_fast

2020-08-18 Thread Tom Murphy
On Tue, 18 Aug 2020 at 16:17, Robin Murphy  wrote:
>
> On 2020-08-18 07:04, Tom Murphy wrote:
> > Add a flush_iotlb_range to allow flushing of an iova range instead of a
> > full flush in the dma-iommu path.
> >
> > Allow the iommu_unmap_fast to return newly freed page table pages and
> > pass the freelist to queue_iova in the dma-iommu ops path.
> >
> > This patch is useful for iommu drivers (in this case the intel iommu
> > driver) which need to wait for the ioTLB to be flushed before newly
> > free/unmapped page table pages can be freed. This way we can still batch
> > ioTLB free operations and handle the freelists.
>
> It sounds like the freelist is something that logically belongs in the
> iommu_iotlb_gather structure. And even if it's not a perfect fit I'd be
> inclined to jam it in there anyway just to avoid this giant argument
> explosion ;)

Good point, I'll do that.
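
A minimal sketch of what that could look like (the freelist member is the
proposed addition; the other fields are the gather struct's existing ones,
and the two helpers are hypothetical):

	struct iommu_iotlb_gather {
		unsigned long	start;
		unsigned long	end;
		size_t		pgsize;
		struct page	*freelist;	/* proposed: page-table pages
						 * unlinked by unmap, freed
						 * after the IOTLB flush */
	};

	static void example_iotlb_sync(struct iommu_domain *domain,
				       struct iommu_iotlb_gather *gather)
	{
		/* flush first, then it is safe to free the old tables */
		flush_iotlb_hw(domain, gather->start, gather->end);
		free_pages_list(gather->freelist);
	}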

>
> Why exactly do we need to introduce a new flush_iotlb_range() op? Can't
> the AMD driver simply use the gather mechanism like everyone else?

No, there's no reason it can't simply use the gather mechanism. I will
use the gather mechanism.
I think I wrote this patch way back before the gather mechanism was
introduced and I've been rebasing/slightly updating it since then
without paying proper attention to the code.

>
> Robin.
>
> > Change-log:
> > V2:
> > -fix missing parameter in mtk_iommu_v1.c
> >
> > Signed-off-by: Tom Murphy 
> > ---
> >   drivers/iommu/amd/iommu.c   | 14 -
> >   drivers/iommu/arm-smmu-v3.c |  3 +-
> >   drivers/iommu/arm-smmu.c|  3 +-
> >   drivers/iommu/dma-iommu.c   | 45 ---
> >   drivers/iommu/exynos-iommu.c|  3 +-
> >   drivers/iommu/intel/iommu.c | 54 +
> >   drivers/iommu/iommu.c   | 25 +++
> >   drivers/iommu/ipmmu-vmsa.c  |  3 +-
> >   drivers/iommu/msm_iommu.c   |  3 +-
> >   drivers/iommu/mtk_iommu.c   |  3 +-
> >   drivers/iommu/mtk_iommu_v1.c|  3 +-
> >   drivers/iommu/omap-iommu.c  |  3 +-
> >   drivers/iommu/qcom_iommu.c  |  3 +-
> >   drivers/iommu/rockchip-iommu.c  |  3 +-
> >   drivers/iommu/s390-iommu.c  |  3 +-
> >   drivers/iommu/sun50i-iommu.c|  3 +-
> >   drivers/iommu/tegra-gart.c  |  3 +-
> >   drivers/iommu/tegra-smmu.c  |  3 +-
> >   drivers/iommu/virtio-iommu.c|  3 +-
> >   drivers/vfio/vfio_iommu_type1.c |  2 +-
> >   include/linux/iommu.h   | 21 +++--
> >   21 files changed, 150 insertions(+), 56 deletions(-)
> >
> > diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> > index 2f22326ee4df..25fbacab23c3 100644
> > --- a/drivers/iommu/amd/iommu.c
> > +++ b/drivers/iommu/amd/iommu.c
> > @@ -2513,7 +2513,8 @@ static int amd_iommu_map(struct iommu_domain *dom, 
> > unsigned long iova,
> >
> >   static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long 
> > iova,
> > size_t page_size,
> > -   struct iommu_iotlb_gather *gather)
> > +   struct iommu_iotlb_gather *gather,
> > +   struct page **freelist)
> >   {
> >   struct protection_domain *domain = to_pdomain(dom);
> >   struct domain_pgtable pgtable;
> > @@ -2636,6 +2637,16 @@ static void amd_iommu_flush_iotlb_all(struct 
> > iommu_domain *domain)
> >   spin_unlock_irqrestore(&dom->lock, flags);
> >   }
> >
> > +static void amd_iommu_flush_iotlb_range(struct iommu_domain *domain,
> > + unsigned long iova, size_t size,
> > + struct page *freelist)
> > +{
> > + struct protection_domain *dom = to_pdomain(domain);
> > +
> > + domain_flush_pages(dom, iova, size);
> > + domain_flush_complete(dom);
> > +}
> > +
> >   static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
> >struct iommu_iotlb_gather *gather)
> >   {
> > @@ -2675,6 +2686,7 @@ const struct iommu_ops amd_iommu_ops = {
> >   .is_attach_deferred = amd_iommu_is_attach_deferred,
> >   .pgsize_bitmap  = AMD_IOMMU_PGSIZES,
> >   .flush_iotlb_all = amd_iommu_flush_iotlb_all,
> > + .flush_iotlb_range = amd_iommu_flush_iotlb_range,
> >   .iotlb_sync = amd_iommu_iotlb_sync,
> >   .def_domain_type = amd_iommu_def_domain_type,
> >   };
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index f578677a5c41..8d328dc25326 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -2854,7 +2854,8 @@ static int arm_smmu_map(struct iommu_domain *domain, 
> > unsigned long iova,
> >   }
> >
> >   static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long 
> > iova,
> > -  size_t size, struct iommu_iotlb_gather *gather)
> > +  size_t size, struct iommu_iotlb_gather *gather,
> > +  struct page **freelist)
> >   

Re: [PATCH 1/2] dma-mapping: introduce relaxed version of dma sync

2020-08-18 Thread Christoph Hellwig
On Tue, Aug 18, 2020 at 11:07:57AM +0100, Will Deacon wrote:
> > > so I'm not sure
> > > that we should be complicating the implementation like this to try to
> > > make it "fast".
> > > 
> > I agree that this patch makes the implementation of the DMA API a bit
> > more complex, but I don't think it seriously complicates it.
> 
> It's death by a thousand cuts; this patch further fragments the architecture
> backends and leads to arm64-specific behaviour which consequently won't get
> well tested by anybody else. Now, it might be worth it, but there's not
> enough information here to make that call.

So it turns out I misread the series (*cough*, crazy long lines,
*cough*), and it does not actually expose a new API as I thought, but
it still makes a total mess of the internal interface.  It turns out
that on the for cpu side we already have arch_sync_dma_for_cpu_all,
which should do all that is needed.  We could do the equivalent for
the to device side, but only IFF there really is a major benefit for
something that actually is mainstream and matters.


Re: [PATCH 00/16] IOMMU driver for Kirin 960/970

2020-08-18 Thread Mauro Carvalho Chehab
Hi Robin,

Em Tue, 18 Aug 2020 15:47:55 +0100
Robin Murphy  escreveu:

> On 2020-08-17 08:49, Mauro Carvalho Chehab wrote:
> > Add a driver for the Kirin 960/970 iommu.
> > 
> > As on the past series, this starts from the original 4.9 driver from
> > the 96boards tree:
> > 
> > https://github.com/96boards-hikey/linux/tree/hikey970-v4.9
> > 
> > The remaining patches add SPDX headers and make it build and run with
> > the upstream Kernel.
> > 
> > Chenfeng (1):
> >iommu: add support for HiSilicon Kirin 960/970 iommu
> > 
> > Mauro Carvalho Chehab (15):
> >iommu: hisilicon: remove default iommu_map_sg handler
> >iommu: hisilicon: map and unmap ops gained new arguments
> >iommu: hisi_smmu_lpae: rebase it to work with upstream
> >iommu: hisi_smmu: remove linux/hisi/hisi-iommu.h
> >iommu: hisilicon: cleanup its code style
> >iommu: hisi_smmu_lpae: get rid of IOMMU_SEC and IOMMU_DEVICE
> >iommu: get rid of map/unmap tile functions
> >iommu: hisi_smmu_lpae: use the right code to get domain-priv data
> >iommu: hisi_smmu_lpae: convert it to probe_device
> >iommu: add Hisilicon Kirin970 iommu at the building system
> >iommu: hisi_smmu_lpae: cleanup printk macros
> >iommu: hisi_smmu_lpae: make OF compatible more standard  
> 
> Echoing the other comments about none of the driver patches being CC'd 
> to the IOMMU list...
> 
> Still, I dug the series up on lore and frankly I'm not sure what to make 
> of it - AFAICS the "driver" is just yet another implementation of Arm 
> LPAE pagetable code, with no obvious indication of how those pagetables 
> ever get handed off to IOMMU hardware (and indeed no indication of IOMMU 
> hardware at all). Can you explain how it's supposed to work?
> 
> And as a pre-emptive strike, we really don't need any more LPAE 
> implementations - that's what the io-pgtable library is all about (which 
> incidentally has been around since 4.0...). I think that should make the 
> issue of preserving authorship largely moot since there's no need to 
> preserve most of the code anyway ;)

I didn't know about that, since I got a Hikey 970 board for the first time
about one month ago, and that's the first time I looked into iommu code.

My end goal with this is to make the DRM/KMS driver work with upstream
kernels.

The full patch series are at:

https://github.com/mchehab/linux/commits/hikey970/to_upstream-2.0-v1.1

(I need to put a new version there, after some changes due to recent
upstream discussions at the regulator's part of the code)

Basically, the DT binding has this, for IOMMU:


smmu_lpae {
compatible = "hisilicon,smmu-lpae";
};

...
dpe: dpe@e860 {
compatible = "hisilicon,kirin970-dpe";
memory-region = <_dma_reserved>;
...
iommu_info {
start-addr = <0x8000>;
size = <0xbfff8000>;
};
}

This is used by kirin9xx_drm_dss.c in order to enable and use
the iommu:


static int dss_enable_iommu(struct platform_device *pdev, struct 
dss_hw_ctx *ctx)
{
struct device *dev = NULL;

dev = &pdev->dev;

/* create iommu domain */
ctx->mmu_domain = iommu_domain_alloc(dev->bus);
if (!ctx->mmu_domain) {
pr_err("iommu_domain_alloc failed!\n");
return -EINVAL;
}

iommu_attach_device(ctx->mmu_domain, dev);

return 0;
}

The only place where the IOMMU domain is used is on this part of the
code(error part simplified here) [1]:

void hisi_dss_smmu_on(struct dss_hw_ctx *ctx) 
{
uint64_t fama_phy_pgd_base;
uint32_t phy_pgd_base;
...
fama_phy_pgd_base = iommu_iova_to_phys(ctx->mmu_domain, 0);
phy_pgd_base = (uint32_t)fama_phy_pgd_base;
if (WARN_ON(!phy_pgd_base))
return;

set_reg(smmu_base + SMMU_CB_TTBR0, phy_pgd_base, 32, 0);
}

[1] 
https://github.com/mchehab/linux/commit/36da105e719b47bbe9d6cb7e5619b30c7f3eb1bd

In other words, the driver needs to get the physical address of the frame
buffer (mapped via iommu) in order to set some DRM-specific register.

Yeah, the above code is somewhat hackish. I would love to replace 
this part by a more standard approach.

Thanks,
Mauro


[PATCH v3 17/17] memblock: use separate iterators for memory and reserved regions

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

for_each_memblock() is used to iterate over memblock.memory in
a few places that use data from memblock_region rather than the memory
ranges.

Introduce separate for_each_mem_region() and for_each_reserved_mem_region()
to improve encapsulation of memblock internals from its users.

Signed-off-by: Mike Rapoport 
Reviewed-by: Baoquan He 
Acked-by: Ingo Molnar   # x86
Acked-by: Thomas Bogendoerfer   # MIPS
Acked-by: Miguel Ojeda   # .clang-format
---
 .clang-format  |  3 ++-
 arch/arm64/kernel/setup.c  |  2 +-
 arch/arm64/mm/numa.c   |  2 +-
 arch/mips/netlogic/xlp/setup.c |  2 +-
 arch/riscv/mm/init.c   |  2 +-
 arch/x86/mm/numa.c |  2 +-
 include/linux/memblock.h   | 19 ---
 mm/memblock.c  |  4 ++--
 mm/page_alloc.c|  8 
 9 files changed, 29 insertions(+), 15 deletions(-)
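
For illustration, a short sketch of how the new iterators read at a call
site (the pr_info() lines are illustrative):

	struct memblock_region *r;

	/* Instead of for_each_memblock(memory, r): */
	for_each_mem_region(r)
		pr_info("memory:   base=%pa size=%pa\n", &r->base, &r->size);

	/* Instead of for_each_memblock(reserved, r): */
	for_each_reserved_mem_region(r)
		pr_info("reserved: base=%pa size=%pa\n", &r->base, &r->size);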

diff --git a/.clang-format b/.clang-format
index 2b77cc419b97..a118fdde25c1 100644
--- a/.clang-format
+++ b/.clang-format
@@ -201,7 +201,7 @@ ForEachMacros:
   - 'for_each_matching_node'
   - 'for_each_matching_node_and_match'
   - 'for_each_member'
-  - 'for_each_memblock'
+  - 'for_each_mem_region'
   - 'for_each_memblock_type'
   - 'for_each_memcg_cache_index'
   - 'for_each_mem_pfn_range'
@@ -268,6 +268,7 @@ ForEachMacros:
   - 'for_each_property_of_node'
   - 'for_each_registered_fb'
   - 'for_each_reserved_mem_range'
+  - 'for_each_reserved_mem_region'
   - 'for_each_rtd_codec_dais'
   - 'for_each_rtd_codec_dais_rollback'
   - 'for_each_rtd_components'
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index a986c6f8ab42..dcce72ac072b 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -217,7 +217,7 @@ static void __init request_standard_resources(void)
if (!standard_resources)
panic("%s: Failed to allocate %zu bytes\n", __func__, res_size);
 
-   for_each_memblock(memory, region) {
+   for_each_mem_region(region) {
		res = &standard_resources[i++];
if (memblock_is_nomap(region)) {
res->name  = "reserved";
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 8a97cd3d2dfe..5efdbd01a59c 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -350,7 +350,7 @@ static int __init numa_register_nodes(void)
struct memblock_region *mblk;
 
/* Check that valid nid is set to memblks */
-   for_each_memblock(memory, mblk) {
+   for_each_mem_region(mblk) {
int mblk_nid = memblock_get_region_node(mblk);
 
if (mblk_nid == NUMA_NO_NODE || mblk_nid >= MAX_NUMNODES) {
diff --git a/arch/mips/netlogic/xlp/setup.c b/arch/mips/netlogic/xlp/setup.c
index 1a0fc5b62ba4..6e3102bcd2f1 100644
--- a/arch/mips/netlogic/xlp/setup.c
+++ b/arch/mips/netlogic/xlp/setup.c
@@ -70,7 +70,7 @@ static void nlm_fixup_mem(void)
const int pref_backup = 512;
struct memblock_region *mem;
 
-   for_each_memblock(memory, mem) {
+   for_each_mem_region(mem) {
memblock_remove(mem->base + mem->size - pref_backup,
pref_backup);
}
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 06355716d19a..1fb6a826c2fd 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -531,7 +531,7 @@ static void __init resource_init(void)
 {
struct memblock_region *region;
 
-   for_each_memblock(memory, region) {
+   for_each_mem_region(region) {
struct resource *res;
 
res = memblock_alloc(sizeof(struct resource), SMP_CACHE_BYTES);
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index aa76ec2d359b..b6246768479d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -516,7 +516,7 @@ static void __init numa_clear_kernel_node_hotplug(void)
 *   memory ranges, because quirks such as trim_snb_memory()
 *   reserve specific pages for Sandy Bridge graphics. ]
 */
-   for_each_memblock(reserved, mb_region) {
+   for_each_reserved_mem_region(mb_region) {
int nid = memblock_get_region_node(mb_region);
 
if (nid != MAX_NUMNODES)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 354078713cd1..ef131255cedc 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -553,9 +553,22 @@ static inline unsigned long memblock_region_reserved_end_pfn(const struct memblo
return PFN_UP(reg->base + reg->size);
 }
 
-#define for_each_memblock(memblock_type, region)			\
-	for (region = memblock.memblock_type.regions;			\
-	     region < (memblock.memblock_type.regions + memblock.memblock_type.cnt);\
+/**
+ * for_each_mem_region - iterate over memory regions
+ * @region: loop variable
+ */
+#define for_each_mem_region(region)

[PATCH v3 16/17] memblock: implement for_each_reserved_mem_region() using __next_mem_region()

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

Iteration over memblock.reserved with for_each_reserved_mem_region() used
__next_reserved_mem_region() that implemented a subset of
__next_mem_region().

Use __for_each_mem_range() and, essentially, __next_mem_region() with
appropriate parameters to reduce code duplication.

While on it, rename for_each_reserved_mem_region() to
for_each_reserved_mem_range() for consistency.

Signed-off-by: Mike Rapoport 
Acked-by: Miguel Ojeda  # .clang-format
---
 .clang-format|  2 +-
 arch/arm64/kernel/setup.c|  2 +-
 drivers/irqchip/irq-gic-v3-its.c |  2 +-
 include/linux/memblock.h | 12 +++
 mm/memblock.c| 56 
 5 files changed, 27 insertions(+), 47 deletions(-)

diff --git a/.clang-format b/.clang-format
index 3e42a8e4df73..2b77cc419b97 100644
--- a/.clang-format
+++ b/.clang-format
@@ -267,7 +267,7 @@ ForEachMacros:
   - 'for_each_process_thread'
   - 'for_each_property_of_node'
   - 'for_each_registered_fb'
-  - 'for_each_reserved_mem_region'
+  - 'for_each_reserved_mem_range'
   - 'for_each_rtd_codec_dais'
   - 'for_each_rtd_codec_dais_rollback'
   - 'for_each_rtd_components'
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 77c4c9bad1b8..a986c6f8ab42 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -257,7 +257,7 @@ static int __init reserve_memblock_reserved_regions(void)
if (!memblock_is_region_reserved(mem->start, mem_size))
continue;
 
-   for_each_reserved_mem_region(j, &r_start, &r_end) {
+   for_each_reserved_mem_range(j, &r_start, &r_end) {
resource_size_t start, end;
 
start = max(PFN_PHYS(PFN_DOWN(r_start)), mem->start);
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2192,7 +2192,7 @@ static bool gic_check_reserved_range(phys_addr_t addr, unsigned long size)
 
addr_end = addr + size - 1;
 
-   for_each_reserved_mem_region(i, &start, &end) {
+   for_each_reserved_mem_range(i, &start, &end) {
if (addr >= start && addr_end <= end)
return true;
}
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 15ed119701c1..354078713cd1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -132,9 +132,6 @@ void __next_mem_range_rev(u64 *idx, int nid, enum memblock_flags flags,
  struct memblock_type *type_b, phys_addr_t *out_start,
  phys_addr_t *out_end, int *out_nid);
 
-void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
-   phys_addr_t *out_end);
-
 void __memblock_free_late(phys_addr_t base, phys_addr_t size);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
@@ -224,7 +221,7 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
 MEMBLOCK_NONE, p_start, p_end, NULL)
 
 /**
- * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * for_each_reserved_mem_range - iterate over all reserved memblock areas
  * @i: u64 used as loop variable
  * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
  * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
@@ -232,10 +229,9 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
  * Walks over reserved areas of memblock. Available as soon as memblock
  * is initialized.
  */
-#define for_each_reserved_mem_region(i, p_start, p_end)		\
-	for (i = 0UL, __next_reserved_mem_region(&i, p_start, p_end);	\
-	     i != (u64)ULLONG_MAX;					\
-	     __next_reserved_mem_region(&i, p_start, p_end))
+#define for_each_reserved_mem_range(i, p_start, p_end)			\
+	__for_each_mem_range(i, &memblock.reserved, NULL, NUMA_NO_NODE,	\
+			     MEMBLOCK_NONE, p_start, p_end, NULL)
 
 static inline bool memblock_is_hotpluggable(struct memblock_region *m)
 {
diff --git a/mm/memblock.c b/mm/memblock.c
index eb4f866bea34..d0be57acccf2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -132,6 +132,14 @@ struct memblock_type physmem = {
 };
 #endif
 
+/*
+ * keep a pointer to &memblock.memory in the text section to use it in
+ * __next_mem_range() and its helpers.
+ *  For architectures that do not keep memblock data after init, this
+ * pointer will be reset to NULL at memblock_discard()
+ */
+static __refdata struct memblock_type *memblock_memory = &memblock.memory;
+
 #define for_each_memblock_type(i, memblock_type, rgn)  \
	for (i = 0, rgn = &memblock_type->regions[0];			\
 i < memblock_type->cnt;\
@@ -399,6 +407,8 @@ void __init memblock_discard(void)

[PATCH v3 15/17] memblock: remove unused memblock_mem_size()

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

The only user of memblock_mem_size() was the x86 setup code. It is gone
now, and the memblock_mem_size() function can be removed.

Signed-off-by: Mike Rapoport 
Reviewed-by: Baoquan He 
---
 include/linux/memblock.h |  1 -
 mm/memblock.c| 15 ---
 2 files changed, 16 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 27c3b84d1615..15ed119701c1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -481,7 +481,6 @@ static inline bool memblock_bottom_up(void)
 
 phys_addr_t memblock_phys_mem_size(void);
 phys_addr_t memblock_reserved_size(void);
-phys_addr_t memblock_mem_size(unsigned long limit_pfn);
 phys_addr_t memblock_start_of_DRAM(void);
 phys_addr_t memblock_end_of_DRAM(void);
 void memblock_enforce_memory_limit(phys_addr_t memory_limit);
diff --git a/mm/memblock.c b/mm/memblock.c
index 567e454ce0a1..eb4f866bea34 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1657,21 +1657,6 @@ phys_addr_t __init_memblock memblock_reserved_size(void)
return memblock.reserved.total_size;
 }
 
-phys_addr_t __init memblock_mem_size(unsigned long limit_pfn)
-{
-   unsigned long pages = 0;
-   unsigned long start_pfn, end_pfn;
-   int i;
-
-   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
-   start_pfn = min_t(unsigned long, start_pfn, limit_pfn);
-   end_pfn = min_t(unsigned long, end_pfn, limit_pfn);
-   pages += end_pfn - start_pfn;
-   }
-
-   return PFN_PHYS(pages);
-}
-
 /* lowest address */
 phys_addr_t __init_memblock memblock_start_of_DRAM(void)
 {
-- 
2.26.2



[PATCH v3 14/17] x86/setup: simplify reserve_crashkernel()

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

* Replace magic numbers with defines
* Replace memblock_find_in_range() + memblock_reserve() with
  memblock_phys_alloc_range()
* Stop checking for low memory size in reserve_crashkernel_low(). The
  allocation from a limited range will fail anyway if there is not enough
  memory, so there is no need for an extra traversal of memblock.memory

Signed-off-by: Mike Rapoport 
Acked-by: Ingo Molnar 
Reviewed-by: Baoquan He 
---
 arch/x86/kernel/setup.c | 40 ++--
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2cac39ade2e3..52e83ba607b3 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -420,13 +420,13 @@ static int __init reserve_crashkernel_low(void)
 {
 #ifdef CONFIG_X86_64
unsigned long long base, low_base = 0, low_size = 0;
-   unsigned long total_low_mem;
+   unsigned long low_mem_limit;
int ret;
 
-   total_low_mem = memblock_mem_size(1UL << (32 - PAGE_SHIFT));
+   low_mem_limit = min(memblock_phys_mem_size(), CRASH_ADDR_LOW_MAX);
 
/* crashkernel=Y,low */
-	ret = parse_crashkernel_low(boot_command_line, total_low_mem, &low_size, &base);
+	ret = parse_crashkernel_low(boot_command_line, low_mem_limit, &low_size, &base);
if (ret) {
/*
 * two parts from kernel/dma/swiotlb.c:
@@ -444,23 +444,17 @@ static int __init reserve_crashkernel_low(void)
return 0;
}
 
-   low_base = memblock_find_in_range(0, 1ULL << 32, low_size, CRASH_ALIGN);
+	low_base = memblock_phys_alloc_range(low_size, CRASH_ALIGN, 0, CRASH_ADDR_LOW_MAX);
if (!low_base) {
pr_err("Cannot reserve %ldMB crashkernel low memory, please try 
smaller size.\n",
   (unsigned long)(low_size >> 20));
return -ENOMEM;
}
 
-   ret = memblock_reserve(low_base, low_size);
-   if (ret) {
-   pr_err("%s: Error reserving crashkernel low memblock.\n", 
__func__);
-   return ret;
-   }
-
-   pr_info("Reserving %ldMB of low memory at %ldMB for crashkernel (System 
low RAM: %ldMB)\n",
+   pr_info("Reserving %ldMB of low memory at %ldMB for crashkernel (low 
RAM limit: %ldMB)\n",
(unsigned long)(low_size >> 20),
(unsigned long)(low_base >> 20),
-   (unsigned long)(total_low_mem >> 20));
+   (unsigned long)(low_mem_limit >> 20));
 
crashk_low_res.start = low_base;
crashk_low_res.end   = low_base + low_size - 1;
@@ -504,13 +498,13 @@ static void __init reserve_crashkernel(void)
 * unless "crashkernel=size[KMG],high" is specified.
 */
if (!high)
-   crash_base = memblock_find_in_range(CRASH_ALIGN,
-   CRASH_ADDR_LOW_MAX,
-   crash_size, CRASH_ALIGN);
+   crash_base = memblock_phys_alloc_range(crash_size,
+   CRASH_ALIGN, CRASH_ALIGN,
+   CRASH_ADDR_LOW_MAX);
if (!crash_base)
-   crash_base = memblock_find_in_range(CRASH_ALIGN,
-   CRASH_ADDR_HIGH_MAX,
-   crash_size, CRASH_ALIGN);
+   crash_base = memblock_phys_alloc_range(crash_size,
+   CRASH_ALIGN, CRASH_ALIGN,
+   CRASH_ADDR_HIGH_MAX);
if (!crash_base) {
pr_info("crashkernel reservation failed - No suitable 
area found.\n");
return;
@@ -518,19 +512,13 @@ static void __init reserve_crashkernel(void)
} else {
unsigned long long start;
 
-   start = memblock_find_in_range(crash_base,
-  crash_base + crash_size,
-  crash_size, 1 << 20);
+   start = memblock_phys_alloc_range(crash_size, SZ_1M, crash_base,
+ crash_base + crash_size);
if (start != crash_base) {
pr_info("crashkernel reservation failed - memory is in 
use.\n");
return;
}
}
-   ret = memblock_reserve(crash_base, crash_size);
-   if (ret) {
-   pr_err("%s: Error reserving crashkernel memblock.\n", __func__);
-   return;
-   }
 
if (crash_base >= (1ULL << 32) && reserve_crashkernel_low()) {
memblock_free(crash_base, crash_size);
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v3 13/17] x86/setup: simplify initrd relocation and reservation

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

Currently, the initrd image is reserved very early during setup and then it
might be relocated and re-reserved after the initial physical memory
mapping is created. The "late" reservation code verifies that the mapped
memory size exceeds the size of initrd, then checks whether relocation is
required and, if yes, relocates initrd to new memory allocated from
memblock and frees the old location.

The check for memory size is excessive as the memblock allocation will fail
anyway if there is not enough memory. Besides, there is no point in
allocating memory from memblock using memblock_find_in_range() +
memblock_reserve() when memblock_phys_alloc_range() exists with the
required functionality.

Remove the redundant check and simplify memblock allocation.
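
For illustration, the conversion boils down to replacing the two-step
pattern with a single allocation call (a sketch with generic variable
names, not the exact hunk):

	/* before: find a slot, then reserve it */
	base = memblock_find_in_range(start, end, size, align);
	if (base)
		memblock_reserve(base, size);

	/* after: one call does both */
	base = memblock_phys_alloc_range(size, align, start, end);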

Signed-off-by: Mike Rapoport 
Acked-by: Ingo Molnar 
Reviewed-by: Baoquan He 
---
 arch/x86/kernel/setup.c | 16 +++-
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3511736fbc74..2cac39ade2e3 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -263,16 +263,12 @@ static void __init relocate_initrd(void)
u64 area_size = PAGE_ALIGN(ramdisk_size);
 
/* We need to move the initrd down into directly mapped mem */
-   relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped),
-  area_size, PAGE_SIZE);
-
+   relocated_ramdisk = memblock_phys_alloc_range(area_size, PAGE_SIZE, 0,
+ PFN_PHYS(max_pfn_mapped));
if (!relocated_ramdisk)
panic("Cannot find place for new RAMDISK of size %lld\n",
  ramdisk_size);
 
-   /* Note: this includes all the mem currently occupied by
-  the initrd, we rely on that fact to keep the data intact. */
-   memblock_reserve(relocated_ramdisk, area_size);
initrd_start = relocated_ramdisk + PAGE_OFFSET;
initrd_end   = initrd_start + ramdisk_size;
printk(KERN_INFO "Allocated new RAMDISK: [mem %#010llx-%#010llx]\n",
@@ -299,13 +295,13 @@ static void __init early_reserve_initrd(void)
 
memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
 }
+
 static void __init reserve_initrd(void)
 {
/* Assume only end is not page aligned */
u64 ramdisk_image = get_ramdisk_image();
u64 ramdisk_size  = get_ramdisk_size();
u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
-   u64 mapped_size;
 
if (!boot_params.hdr.type_of_loader ||
!ramdisk_image || !ramdisk_size)
@@ -313,12 +309,6 @@ static void __init reserve_initrd(void)
 
initrd_start = 0;
 
-   mapped_size = memblock_mem_size(max_pfn_mapped);
-   if (ramdisk_size >= (mapped_size>>1))
-   panic("initrd too large to handle, "
-  "disabling initrd (%lld needed, %lld available)\n",
-  ramdisk_size, mapped_size>>1);
-
printk(KERN_INFO "RAMDISK: [mem %#010llx-%#010llx]\n", ramdisk_image,
ramdisk_end - 1);
 
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 12/17] arch, drivers: replace for_each_membock() with for_each_mem_range()

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

There are several occurrences of the following pattern:

for_each_memblock(memory, reg) {
start = __pfn_to_phys(memblock_region_memory_base_pfn(reg));
end = __pfn_to_phys(memblock_region_memory_end_pfn(reg));

/* do something with start and end */
}

Using for_each_mem_range() iterator is more appropriate in such cases and
allows simpler and cleaner code.
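
For comparison, the converted loops read roughly like this:

	for_each_mem_range(i, &start, &end) {
		/* do something with start and end */
	}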

Signed-off-by: Mike Rapoport 
---
 arch/arm/kernel/setup.c  | 18 ++---
 arch/arm/mm/mmu.c| 39 ++
 arch/arm/mm/pmsa-v7.c| 23 ++-
 arch/arm/mm/pmsa-v8.c| 17 
 arch/arm/xen/mm.c|  7 ++--
 arch/arm64/mm/kasan_init.c   | 10 ++---
 arch/arm64/mm/mmu.c  | 11 ++
 arch/c6x/kernel/setup.c  |  9 +++--
 arch/microblaze/mm/init.c|  9 +++--
 arch/mips/cavium-octeon/dma-octeon.c | 12 +++---
 arch/mips/kernel/setup.c | 31 +++
 arch/openrisc/mm/init.c  |  8 ++--
 arch/powerpc/kernel/fadump.c | 50 +++-
 arch/powerpc/kexec/file_load_64.c| 10 ++---
 arch/powerpc/mm/book3s64/hash_utils.c| 16 
 arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ++---
 arch/powerpc/mm/kasan/kasan_init_32.c|  8 ++--
 arch/powerpc/mm/mem.c| 16 +---
 arch/powerpc/mm/pgtable_32.c |  8 ++--
 arch/riscv/mm/init.c | 25 +---
 arch/riscv/mm/kasan_init.c   | 10 ++---
 arch/s390/kernel/setup.c | 23 +++
 arch/s390/mm/vmem.c  |  7 ++--
 arch/sparc/mm/init_64.c  | 12 ++
 drivers/bus/mvebu-mbus.c | 12 +++---
 25 files changed, 194 insertions(+), 207 deletions(-)

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index d8e18cdd96d3..3f65d0ac9f63 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -843,19 +843,25 @@ early_param("mem", early_mem);
 
 static void __init request_standard_resources(const struct machine_desc *mdesc)
 {
-   struct memblock_region *region;
+   phys_addr_t start, end, res_end;
struct resource *res;
+   u64 i;
 
kernel_code.start   = virt_to_phys(_text);
kernel_code.end = virt_to_phys(__init_begin - 1);
kernel_data.start   = virt_to_phys(_sdata);
kernel_data.end = virt_to_phys(_end - 1);
 
-   for_each_memblock(memory, region) {
-   phys_addr_t start = 
__pfn_to_phys(memblock_region_memory_base_pfn(region));
-   phys_addr_t end = 
__pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
+   for_each_mem_range(i, &start, &end) {
unsigned long boot_alias_start;
 
+   /*
+* In memblock, end points to the first byte after the
+* range while in resources, end points to the last byte in
+* the range.
+*/
+   res_end = end - 1;
+
/*
 * Some systems have a special memory alias which is only
 * used for booting.  We need to advertise this region to
@@ -869,7 +875,7 @@ static void __init request_standard_resources(const struct 
machine_desc *mdesc)
  __func__, sizeof(*res));
res->name = "System RAM (boot alias)";
res->start = boot_alias_start;
-   res->end = phys_to_idmap(end);
+   res->end = phys_to_idmap(res_end);
res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
request_resource(&iomem_resource, res);
}
@@ -880,7 +886,7 @@ static void __init request_standard_resources(const struct 
machine_desc *mdesc)
  sizeof(*res));
res->name  = "System RAM";
res->start = start;
-   res->end = end;
+   res->end = res_end;
res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
 
request_resource(&iomem_resource, res);
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index c36f977b2ccb..698cc740c6b8 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -1154,9 +1154,8 @@ phys_addr_t arm_lowmem_limit __initdata = 0;
 
 void __init adjust_lowmem_bounds(void)
 {
-   phys_addr_t memblock_limit = 0;
-   u64 vmalloc_limit;
-   struct memblock_region *reg;
+   phys_addr_t block_start, block_end, memblock_limit = 0;
+   u64 vmalloc_limit, i;
phys_addr_t lowmem_limit = 0;
 
/*
@@ -1172,26 +1171,18 @@ void __init adjust_lowmem_bounds(void)
 * The first usable region must be PMD aligned. Mark its start
 * as MEMBLOCK_NOMAP if it isn't
 */
-   for_each_memblock(memory, 

[PATCH v3 11/17] arch, mm: replace for_each_memblock() with for_each_mem_pfn_range()

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

There are several occurrences of the following pattern:

for_each_memblock(memory, reg) {
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);

/* do something with start_pfn and end_pfn */
}

Rather than iterate over all memblock.memory regions and each time query
for their start and end PFNs, use for_each_mem_pfn_range() iterator to get
simpler and clearer code.
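
The resulting pattern looks roughly like this:

	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
		/* do something with start_pfn and end_pfn */
	}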

Signed-off-by: Mike Rapoport 
Reviewed-by: Baoquan He 
---
 arch/arm/mm/init.c   | 11 ---
 arch/arm64/mm/init.c | 11 ---
 arch/powerpc/kernel/fadump.c | 11 ++-
 arch/powerpc/mm/mem.c| 15 ---
 arch/powerpc/mm/numa.c   |  7 ++-
 arch/s390/mm/page-states.c   |  6 ++
 arch/sh/mm/init.c|  9 +++--
 mm/memblock.c|  6 ++
 mm/sparse.c  | 10 --
 9 files changed, 35 insertions(+), 51 deletions(-)

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 50a5a30a78ff..45f9d5ec2360 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -299,16 +299,14 @@ free_memmap(unsigned long start_pfn, unsigned long 
end_pfn)
  */
 static void __init free_unused_memmap(void)
 {
-   unsigned long start, prev_end = 0;
-   struct memblock_region *reg;
+   unsigned long start, end, prev_end = 0;
+   int i;
 
/*
 * This relies on each bank being in address order.
 * The banks are sorted previously in bootmem_init().
 */
-   for_each_memblock(memory, reg) {
-   start = memblock_region_memory_base_pfn(reg);
-
+   for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, NULL) {
 #ifdef CONFIG_SPARSEMEM
/*
 * Take care not to free memmap entries that don't exist
@@ -336,8 +334,7 @@ static void __init free_unused_memmap(void)
 * memmap entries are valid from the bank end aligned to
 * MAX_ORDER_NR_PAGES.
 */
-   prev_end = ALIGN(memblock_region_memory_end_pfn(reg),
-MAX_ORDER_NR_PAGES);
+   prev_end = ALIGN(end, MAX_ORDER_NR_PAGES);
}
 
 #ifdef CONFIG_SPARSEMEM
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 481d22c32a2e..f0bf86d81622 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -471,12 +471,10 @@ static inline void free_memmap(unsigned long start_pfn, 
unsigned long end_pfn)
  */
 static void __init free_unused_memmap(void)
 {
-   unsigned long start, prev_end = 0;
-   struct memblock_region *reg;
-
-   for_each_memblock(memory, reg) {
-   start = __phys_to_pfn(reg->base);
+   unsigned long start, end, prev_end = 0;
+   int i;
 
+   for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, NULL) {
 #ifdef CONFIG_SPARSEMEM
/*
 * Take care not to free memmap entries that don't exist due
@@ -496,8 +494,7 @@ static void __init free_unused_memmap(void)
 * memmap entries are valid from the bank end aligned to
 * MAX_ORDER_NR_PAGES.
 */
-   prev_end = ALIGN(__phys_to_pfn(reg->base + reg->size),
-MAX_ORDER_NR_PAGES);
+   prev_end = ALIGN(end, MAX_ORDER_NR_PAGES);
}
 
 #ifdef CONFIG_SPARSEMEM
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 10ebb4bf71ad..e469b150be21 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1242,14 +1242,15 @@ static void fadump_free_reserved_memory(unsigned long 
start_pfn,
  */
 static void fadump_release_reserved_area(u64 start, u64 end)
 {
-   u64 tstart, tend, spfn, epfn;
-   struct memblock_region *reg;
+   u64 tstart, tend, spfn, epfn, reg_spfn, reg_epfn, i;
 
spfn = PHYS_PFN(start);
epfn = PHYS_PFN(end);
-   for_each_memblock(memory, reg) {
-   tstart = max_t(u64, spfn, memblock_region_memory_base_pfn(reg));
-   tend   = min_t(u64, epfn, memblock_region_memory_end_pfn(reg));
+
+   for_each_mem_pfn_range(i, MAX_NUMNODES, &reg_spfn, &reg_epfn, NULL) {
+   tstart = max_t(u64, spfn, reg_spfn);
+   tend   = min_t(u64, epfn, reg_epfn);
+
if (tstart < tend) {
fadump_free_reserved_memory(tstart, tend);
 
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 42e25874f5a8..80df329f180e 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -184,15 +184,16 @@ void __init initmem_init(void)
 /* mark pages that don't exist as nosave */
 static int __init mark_nonram_nosave(void)
 {
-   struct memblock_region *reg, *prev = NULL;
+   unsigned long spfn, epfn, prev = 0;
+   int i;
 
-   for_each_memblock(memory, reg) {
-   if (prev &&
-   memblock_region_memory_end_pfn(prev) 

[PATCH v3 10/17] memblock: reduce number of parameters in for_each_mem_range()

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

Currently for_each_mem_range() and for_each_mem_range_rev() iterators are
the most generic way to traverse memblock regions. As such, they have 8
parameters and they are hardly convenient to users. Most users choose to
utilize one of their wrappers and the only user that actually needs most of
the parameters is memblock itself.

To avoid yet another naming for memblock iterators, rename the existing
for_each_mem_range[_rev]() to __for_each_mem_range[_rev]() and add new
for_each_mem_range[_rev]() wrappers with only index, start and end
parameters.

The new wrapper nicely fits into init_unavailable_mem() and will be used in
upcoming changes to simplify memblock traversals.
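
For illustration, a typical call site (taken from the arm64 kexec hunk
below) shrinks like this:

	/* before: */
	for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
			   MEMBLOCK_NONE, &start, &end, NULL)
		nr_ranges++;

	/* after: */
	for_each_mem_range(i, &start, &end)
		nr_ranges++;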

Signed-off-by: Mike Rapoport 
Acked-by: Thomas Bogendoerfer   # MIPS
---
 .clang-format  |  2 ++
 arch/arm64/kernel/machine_kexec_file.c |  6 ++--
 arch/powerpc/kexec/file_load_64.c  |  6 ++--
 include/linux/memblock.h   | 41 +++---
 mm/page_alloc.c|  3 +-
 5 files changed, 38 insertions(+), 20 deletions(-)

diff --git a/.clang-format b/.clang-format
index a0a96088c74f..3e42a8e4df73 100644
--- a/.clang-format
+++ b/.clang-format
@@ -205,7 +205,9 @@ ForEachMacros:
   - 'for_each_memblock_type'
   - 'for_each_memcg_cache_index'
   - 'for_each_mem_pfn_range'
+  - '__for_each_mem_range'
   - 'for_each_mem_range'
+  - '__for_each_mem_range_rev'
   - 'for_each_mem_range_rev'
   - 'for_each_migratetype_order'
   - 'for_each_msi_entry'
diff --git a/arch/arm64/kernel/machine_kexec_file.c 
b/arch/arm64/kernel/machine_kexec_file.c
index 361a1143e09e..5b0e67b93cdc 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -215,8 +215,7 @@ static int prepare_elf_headers(void **addr, unsigned long 
*sz)
phys_addr_t start, end;
 
nr_ranges = 1; /* for exclusion of crashkernel region */
-   for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
-   MEMBLOCK_NONE, &start, &end, NULL)
+   for_each_mem_range(i, &start, &end)
nr_ranges++;
 
cmem = kmalloc(struct_size(cmem, ranges, nr_ranges), GFP_KERNEL);
@@ -225,8 +224,7 @@ static int prepare_elf_headers(void **addr, unsigned long 
*sz)
 
cmem->max_nr_ranges = nr_ranges;
cmem->nr_ranges = 0;
-   for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
-   MEMBLOCK_NONE, &start, &end, NULL) {
+   for_each_mem_range(i, &start, &end) {
cmem->ranges[cmem->nr_ranges].start = start;
cmem->ranges[cmem->nr_ranges].end = end - 1;
cmem->nr_ranges++;
diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index 53bb71e3a2e1..2c9d908eab96 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -250,8 +250,7 @@ static int __locate_mem_hole_top_down(struct kexec_buf 
*kbuf,
phys_addr_t start, end;
u64 i;
 
-   for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE,
-  MEMBLOCK_NONE, &start, &end, NULL) {
+   for_each_mem_range_rev(i, &start, &end) {
/*
 * memblock uses [start, end) convention while it is
 * [start, end] here. Fix the off-by-one to have the
@@ -350,8 +349,7 @@ static int __locate_mem_hole_bottom_up(struct kexec_buf 
*kbuf,
phys_addr_t start, end;
u64 i;
 
-   for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
-  MEMBLOCK_NONE, &start, &end, NULL) {
+   for_each_mem_range(i, &start, &end) {
/*
 * memblock uses [start, end) convention while it is
 * [start, end] here. Fix the off-by-one to have the
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 47a76e237fca..27c3b84d1615 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -162,7 +162,7 @@ static inline void __next_physmem_range(u64 *idx, struct 
memblock_type *type,
 #endif /* CONFIG_HAVE_MEMBLOCK_PHYS_MAP */
 
 /**
- * for_each_mem_range - iterate through memblock areas from type_a and not
+ * __for_each_mem_range - iterate through memblock areas from type_a and not
  * included in type_b. Or just type_a if type_b is NULL.
  * @i: u64 used as loop variable
  * @type_a: ptr to memblock_type to iterate
@@ -173,7 +173,7 @@ static inline void __next_physmem_range(u64 *idx, struct 
memblock_type *type,
  * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
  * @p_nid: ptr to int for nid of the range, can be %NULL
  */
-#define for_each_mem_range(i, type_a, type_b, nid, flags,  \
+#define __for_each_mem_range(i, type_a, type_b, nid, flags,\
   p_start, p_end, p_nid)   \
for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b,\
 p_start, p_end, p_nid);\
@@ -182,7 +182,7 @@ static inline void 

[PATCH v3 09/17] memblock: make memblock_debug and related functionality private

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

The only user of memblock_dbg() outside memblock was the s390 setup code,
and it is converted to use pr_debug() instead.
This allows us to stop exposing memblock_debug and memblock_dbg() to the
rest of the kernel.

Signed-off-by: Mike Rapoport 
Reviewed-by: Baoquan He 
---
 arch/s390/kernel/setup.c |  4 ++--
 include/linux/memblock.h | 12 +---
 mm/memblock.c| 13 +++--
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index e600f6953d7c..68089eabae27 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -776,8 +776,8 @@ static void __init memblock_add_mem_detect_info(void)
unsigned long start, end;
int i;
 
-   memblock_dbg("physmem info source: %s (%hhd)\n",
-get_mem_info_source(), mem_detect.info_source);
+   pr_debug("physmem info source: %s (%hhd)\n",
+get_mem_info_source(), mem_detect.info_source);
/* keep memblock lists close to the kernel */
memblock_set_bottom_up(true);
for_each_mem_detect_block(i, &start, &end) {
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 550faf69fc1c..47a76e237fca 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -86,7 +86,6 @@ struct memblock {
 };
 
 extern struct memblock memblock;
-extern int memblock_debug;
 
 #ifndef CONFIG_ARCH_KEEP_MEMBLOCK
 #define __init_memblock __meminit
@@ -98,9 +97,6 @@ void memblock_discard(void);
 static inline void memblock_discard(void) {}
 #endif
 
-#define memblock_dbg(fmt, ...) \
-   if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
-
 phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
   phys_addr_t size, phys_addr_t align);
 void memblock_allow_resize(void);
@@ -476,13 +472,7 @@ bool memblock_is_region_memory(phys_addr_t base, 
phys_addr_t size);
 bool memblock_is_reserved(phys_addr_t addr);
 bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
 
-extern void __memblock_dump_all(void);
-
-static inline void memblock_dump_all(void)
-{
-   if (memblock_debug)
-   __memblock_dump_all();
-}
+void memblock_dump_all(void);
 
 /**
  * memblock_set_current_limit - Set the current allocation limit to allow
diff --git a/mm/memblock.c b/mm/memblock.c
index 59f3998ae5db..799513f3d6a9 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -137,7 +137,10 @@ struct memblock_type physmem = {
 i < memblock_type->cnt;\
 i++, rgn = &memblock_type->regions[i])
 
-int memblock_debug __initdata_memblock;
+#define memblock_dbg(fmt, ...) \
+   if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
+
+static int memblock_debug __initdata_memblock;
 static bool system_has_some_mirror __initdata_memblock = false;
 static int memblock_can_resize __initdata_memblock;
 static int memblock_memory_in_slab __initdata_memblock = 0;
@@ -1920,7 +1923,7 @@ static void __init_memblock memblock_dump(struct 
memblock_type *type)
}
 }
 
-void __init_memblock __memblock_dump_all(void)
+static void __init_memblock __memblock_dump_all(void)
 {
pr_info("MEMBLOCK configuration:\n");
pr_info(" memory size = %pa reserved size = %pa\n",
@@ -1934,6 +1937,12 @@ void __init_memblock __memblock_dump_all(void)
 #endif
 }
 
+void __init_memblock memblock_dump_all(void)
+{
+   if (memblock_debug)
+   __memblock_dump_all();
+}
+
 void __init memblock_allow_resize(void)
 {
memblock_can_resize = 1;
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 07/17] microblaze: drop unneeded NUMA and sparsemem initializations

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

microblaze supports neither NUMA nor SPARSEMEM, so there is no point in
calling memblock_set_node() and sparse_memory_present_with_active_regions()
during microblaze memory initialization.

Remove these calls and the surrounding code.

Signed-off-by: Mike Rapoport 
---
 arch/microblaze/mm/init.c | 14 +-
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
index 0880a003573d..49e0c241f9b1 100644
--- a/arch/microblaze/mm/init.c
+++ b/arch/microblaze/mm/init.c
@@ -105,9 +105,8 @@ static void __init paging_init(void)
 
 void __init setup_memory(void)
 {
-   struct memblock_region *reg;
-
 #ifndef CONFIG_MMU
+   struct memblock_region *reg;
u32 kernel_align_start, kernel_align_size;
 
/* Find main memory where is the kernel */
@@ -161,17 +160,6 @@ void __init setup_memory(void)
pr_info("%s: max_low_pfn: %#lx\n", __func__, max_low_pfn);
pr_info("%s: max_pfn: %#lx\n", __func__, max_pfn);
 
-   /* Add active regions with valid PFNs */
-   for_each_memblock(memory, reg) {
-   unsigned long start_pfn, end_pfn;
-
-   start_pfn = memblock_region_memory_base_pfn(reg);
-   end_pfn = memblock_region_memory_end_pfn(reg);
-   memblock_set_node(start_pfn << PAGE_SHIFT,
- (end_pfn - start_pfn) << PAGE_SHIFT,
- &memblock.memory, 0);
-   }
-
paging_init();
 }
 
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 08/17] memblock: make for_each_memblock_type() iterator private

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

for_each_memblock_type() is not used outside mm/memblock.c, so move it
there from include/linux/memblock.h.

Signed-off-by: Mike Rapoport 
Reviewed-by: Baoquan He 
---
 include/linux/memblock.h | 5 -
 mm/memblock.c| 5 +
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 9d925db0d355..550faf69fc1c 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -552,11 +552,6 @@ static inline unsigned long 
memblock_region_reserved_end_pfn(const struct memblo
 region < (memblock.memblock_type.regions + 
memblock.memblock_type.cnt);\
 region++)
 
-#define for_each_memblock_type(i, memblock_type, rgn)  \
-   for (i = 0, rgn = &memblock_type->regions[0];   \
-i < memblock_type->cnt;\
-i++, rgn = &memblock_type->regions[i])
-
 extern void *alloc_large_system_hash(const char *tablename,
 unsigned long bucketsize,
 unsigned long numentries,
diff --git a/mm/memblock.c b/mm/memblock.c
index 45f198750be9..59f3998ae5db 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -132,6 +132,11 @@ struct memblock_type physmem = {
 };
 #endif
 
+#define for_each_memblock_type(i, memblock_type, rgn)  \
+   for (i = 0, rgn = &memblock_type->regions[0];   \
+i < memblock_type->cnt;\
+i++, rgn = &memblock_type->regions[i])
+
 int memblock_debug __initdata_memblock;
 static bool system_has_some_mirror __initdata_memblock = false;
 static int memblock_can_resize __initdata_memblock;
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 04/17] arm64: numa: simplify dummy_numa_init()

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

dummy_numa_init() loops over memblock.memory and passes nid=0 to
numa_add_memblk() which essentially wraps memblock_set_node(). However,
memblock_set_node() can cope with the entire memory span itself, so the loop
over memblock.memory regions is redundant.

Using a single call to memblock_set_node() rather than a loop also fixes an
issue with a buggy ACPI firmware in which the SRAT table covers some but
not all of the memory in the EFI memory map.

Jonathan Cameron says:

  This issue can be easily triggered by having an SRAT table which fails
  to cover all elements of the EFI memory map.

  This firmware error is detected and a warning printed. e.g.
  "NUMA: Warning: invalid memblk node 64 [mem 0x24000-0x27fff]"
  At that point we fall back to dummy_numa_init().

  However, the failed ACPI init has left us with our memblocks all broken
  up as we split them when trying to assign them to NUMA nodes.

  We then iterate over the memblocks and add them to node 0.

  numa_add_memblk() calls memblock_set_node() which merges regions that
  were previously split up during the earlier attempt to add them to different
  nodes during parsing of SRAT.

  This means elements are moved in the memblock array and we can end up
  in a different memblock after the call to numa_add_memblk().
  Result is:

  Unable to handle kernel paging request at virtual address 3a40
  Mem abort info:
ESR = 0x9604
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
  Data abort info:
ISV = 0, ISS = 0x0004
CM = 0, WnR = 0
  [3a40] user address but active_mm is swapper
  Internal error: Oops: 9604 [#1] PREEMPT SMP

  ...

  Call trace:
sparse_init_nid+0x5c/0x2b0
sparse_init+0x138/0x170
bootmem_init+0x80/0xe0
setup_arch+0x2a0/0x5fc
start_kernel+0x8c/0x648

Replace the loop with a single call to memblock_set_node() covering the
entire memory.
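
For reference, the single call essentially reduces to (sketch):

	/* roughly what numa_add_memblk(0, start, end) ends up doing */
	memblock_set_node(start, end - start, &memblock.memory, 0);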

Signed-off-by: Mike Rapoport 
Acked-by: Jonathan Cameron 
Acked-by: Catalin Marinas 
---
 arch/arm64/mm/numa.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 73f8b49d485c..8a97cd3d2dfe 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -423,19 +423,16 @@ static int __init numa_init(int (*init_func)(void))
  */
 static int __init dummy_numa_init(void)
 {
+   phys_addr_t start = memblock_start_of_DRAM();
+   phys_addr_t end = memblock_end_of_DRAM();
int ret;
-   struct memblock_region *mblk;
 
if (numa_off)
pr_info("NUMA disabled\n"); /* Forced off on command line. */
-   pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n",
-   memblock_start_of_DRAM(), memblock_end_of_DRAM() - 1);
-
-   for_each_memblock(memory, mblk) {
-   ret = numa_add_memblk(0, mblk->base, mblk->base + mblk->size);
-   if (!ret)
-   continue;
+   pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n", start, end - 1);
 
+   ret = numa_add_memblk(0, start, end);
+   if (ret) {
pr_err("NUMA init failed\n");
return ret;
}
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 05/17] h8300, nds32, openrisc: simplify detection of memory extents

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

Instead of traversing memblock.memory regions to find memory_start and
memory_end, simply query memblock_{start,end}_of_DRAM().
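
That is, the per-arch detection loop collapses to two accessor calls:

	memory_start = memblock_start_of_DRAM();
	memory_end = memblock_end_of_DRAM();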

Signed-off-by: Mike Rapoport 
Acked-by: Stafford Horne 
---
 arch/h8300/kernel/setup.c| 8 +++-
 arch/nds32/kernel/setup.c| 8 ++--
 arch/openrisc/kernel/setup.c | 9 ++---
 3 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/arch/h8300/kernel/setup.c b/arch/h8300/kernel/setup.c
index 28ac88358a89..0281f92eea3d 100644
--- a/arch/h8300/kernel/setup.c
+++ b/arch/h8300/kernel/setup.c
@@ -74,17 +74,15 @@ static void __init bootmem_init(void)
memory_end = memory_start = 0;
 
/* Find main memory where is the kernel */
-   for_each_memblock(memory, region) {
-   memory_start = region->base;
-   memory_end = region->base + region->size;
-   }
+   memory_start = memblock_start_of_DRAM();
+   memory_end = memblock_end_of_DRAM();
 
if (!memory_end)
panic("No memory!");
 
/* setup bootmem globals (we use no_bootmem, but mm still depends on 
this) */
min_low_pfn = PFN_UP(memory_start);
-   max_low_pfn = PFN_DOWN(memblock_end_of_DRAM());
+   max_low_pfn = PFN_DOWN(memory_end);
max_pfn = max_low_pfn;
 
memblock_reserve(__pa(_stext), _end - _stext);
diff --git a/arch/nds32/kernel/setup.c b/arch/nds32/kernel/setup.c
index a066efbe53c0..c356e484dcab 100644
--- a/arch/nds32/kernel/setup.c
+++ b/arch/nds32/kernel/setup.c
@@ -249,12 +249,8 @@ static void __init setup_memory(void)
memory_end = memory_start = 0;
 
/* Find main memory where is the kernel */
-   for_each_memblock(memory, region) {
-   memory_start = region->base;
-   memory_end = region->base + region->size;
-   pr_info("%s: Memory: 0x%x-0x%x\n", __func__,
-   memory_start, memory_end);
-   }
+   memory_start = memblock_start_of_DRAM();
+   memory_end = memblock_end_of_DRAM();
 
if (!memory_end) {
panic("No memory!");
diff --git a/arch/openrisc/kernel/setup.c b/arch/openrisc/kernel/setup.c
index b18e775f8be3..5a5940f7ebb1 100644
--- a/arch/openrisc/kernel/setup.c
+++ b/arch/openrisc/kernel/setup.c
@@ -48,17 +48,12 @@ static void __init setup_memory(void)
unsigned long ram_start_pfn;
unsigned long ram_end_pfn;
phys_addr_t memory_start, memory_end;
-   struct memblock_region *region;
 
memory_end = memory_start = 0;
 
/* Find main memory where is the kernel, we assume its the only one */
-   for_each_memblock(memory, region) {
-   memory_start = region->base;
-   memory_end = region->base + region->size;
-   printk(KERN_INFO "%s: Memory: 0x%x-0x%x\n", __func__,
-  memory_start, memory_end);
-   }
+   memory_start = memblock_start_of_DRAM();
+   memory_end = memblock_end_of_DRAM();
 
if (!memory_end) {
panic("No memory!");
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 06/17] riscv: drop unneeded node initialization

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

RISC-V does not (yet) support NUMA and, for UMA architectures, node 0 is
used implicitly during early memory initialization.

There is no need to call memblock_set_node(), remove this call and the
surrounding code.

Signed-off-by: Mike Rapoport 
---
 arch/riscv/mm/init.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 787c75f751a5..0485cfaacc72 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -191,15 +191,6 @@ void __init setup_bootmem(void)
early_init_fdt_scan_reserved_mem();
memblock_allow_resize();
memblock_dump_all();
-
-   for_each_memblock(memory, reg) {
-   unsigned long start_pfn = memblock_region_memory_base_pfn(reg);
-   unsigned long end_pfn = memblock_region_memory_end_pfn(reg);
-
-   memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn),
- &memblock.memory, 0);
-   }
 }
 
 #ifdef CONFIG_MMU
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 03/17] arm, xtensa: simplify initialization of high memory pages

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

The free_highpages() functions in both arm and xtensa essentially open-code
the for_each_free_mem_range() loop to detect high memory pages that were not
reserved and that should be initialized and passed to the buddy allocator.

Replace the open-coded implementation of for_each_free_mem_range() with
usage of the memblock API to simplify the code.
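
The replacement iterator has this shape (sketch):

	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
				&range_start, &range_end, NULL) {
		/* [range_start, range_end) is neither reserved nor NOMAP */
	}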

Signed-off-by: Mike Rapoport 
Reviewed-by: Max Filippov   # xtensa
Tested-by: Max Filippov # xtensa
---
 arch/arm/mm/init.c| 48 +++--
 arch/xtensa/mm/init.c | 55 ---
 2 files changed, 18 insertions(+), 85 deletions(-)

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 000c1b48e973..50a5a30a78ff 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -347,61 +347,29 @@ static void __init free_unused_memmap(void)
 #endif
 }
 
-#ifdef CONFIG_HIGHMEM
-static inline void free_area_high(unsigned long pfn, unsigned long end)
-{
-   for (; pfn < end; pfn++)
-   free_highmem_page(pfn_to_page(pfn));
-}
-#endif
-
 static void __init free_highpages(void)
 {
 #ifdef CONFIG_HIGHMEM
unsigned long max_low = max_low_pfn;
-   struct memblock_region *mem, *res;
+   phys_addr_t range_start, range_end;
+   u64 i;
 
/* set highmem page free */
-   for_each_memblock(memory, mem) {
-   unsigned long start = memblock_region_memory_base_pfn(mem);
-   unsigned long end = memblock_region_memory_end_pfn(mem);
+   for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+   &range_start, &range_end, NULL) {
+   unsigned long start = PHYS_PFN(range_start);
+   unsigned long end = PHYS_PFN(range_end);
 
/* Ignore complete lowmem entries */
if (end <= max_low)
continue;
 
-   if (memblock_is_nomap(mem))
-   continue;
-
/* Truncate partial highmem entries */
if (start < max_low)
start = max_low;
 
-   /* Find and exclude any reserved regions */
-   for_each_memblock(reserved, res) {
-   unsigned long res_start, res_end;
-
-   res_start = memblock_region_reserved_base_pfn(res);
-   res_end = memblock_region_reserved_end_pfn(res);
-
-   if (res_end < start)
-   continue;
-   if (res_start < start)
-   res_start = start;
-   if (res_start > end)
-   res_start = end;
-   if (res_end > end)
-   res_end = end;
-   if (res_start != start)
-   free_area_high(start, res_start);
-   start = res_end;
-   if (start == end)
-   break;
-   }
-
-   /* And now free anything which remains */
-   if (start < end)
-   free_area_high(start, end);
+   for (; start < end; start++)
+   free_highmem_page(pfn_to_page(start));
}
 #endif
 }
diff --git a/arch/xtensa/mm/init.c b/arch/xtensa/mm/init.c
index a05b306cf371..ad9d59d93f39 100644
--- a/arch/xtensa/mm/init.c
+++ b/arch/xtensa/mm/init.c
@@ -79,67 +79,32 @@ void __init zones_init(void)
free_area_init(max_zone_pfn);
 }
 
-#ifdef CONFIG_HIGHMEM
-static void __init free_area_high(unsigned long pfn, unsigned long end)
-{
-   for (; pfn < end; pfn++)
-   free_highmem_page(pfn_to_page(pfn));
-}
-
 static void __init free_highpages(void)
 {
+#ifdef CONFIG_HIGHMEM
unsigned long max_low = max_low_pfn;
-   struct memblock_region *mem, *res;
+   phys_addr_t range_start, range_end;
+   u64 i;
 
-   reset_all_zones_managed_pages();
/* set highmem page free */
-   for_each_memblock(memory, mem) {
-   unsigned long start = memblock_region_memory_base_pfn(mem);
-   unsigned long end = memblock_region_memory_end_pfn(mem);
+   for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+   &range_start, &range_end, NULL) {
+   unsigned long start = PHYS_PFN(range_start);
+   unsigned long end = PHYS_PFN(range_end);
 
/* Ignore complete lowmem entries */
if (end <= max_low)
continue;
 
-   if (memblock_is_nomap(mem))
-   continue;
-
/* Truncate partial highmem entries */
if (start < max_low)
start = max_low;
 
-   /* Find and exclude any reserved regions */
-   for_each_memblock(reserved, res) {
-   unsigned long res_start, 

Re: [PATCH V2 1/2] Add new flush_iotlb_range and handle freelists when using iommu_unmap_fast

2020-08-18 Thread Robin Murphy

On 2020-08-18 07:04, Tom Murphy wrote:

Add a flush_iotlb_range to allow flushing of an iova range instead of a
full flush in the dma-iommu path.

Allow the iommu_unmap_fast to return newly freed page table pages and
pass the freelist to queue_iova in the dma-iommu ops path.

This patch is useful for iommu drivers (in this case the intel iommu
driver) which need to wait for the ioTLB to be flushed before newly
unmapped page table pages can be freed. This way we can still batch
ioTLB free operations and handle the freelists.
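
A rough sketch of the resulting dma-iommu unmap flow (the variable names
here are illustrative, not the actual hunk):

	unmapped = iommu_unmap_fast(domain, dma_addr, size,
				    &iotlb_gather, &freelist);
	queue_iova(iovad, iova_pfn(iovad, dma_addr), size >> shift,
		   (unsigned long)freelist);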


It sounds like the freelist is something that logically belongs in the 
iommu_iotlb_gather structure. And even if it's not a perfect fit I'd be 
inclined to jam it in there anyway just to avoid this giant argument 
explosion ;)
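
For illustration, that could mean growing the gather structure along
these lines (the freelist member is the hypothetical addition; the other
fields already exist):

	struct iommu_iotlb_gather {
		unsigned long	start;
		unsigned long	end;
		size_t		pgsize;
		struct page	*freelist;	/* hypothetical addition */
	};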


Why exactly do we need to introduce a new flush_iotlb_range() op? Can't 
the AMD driver simply use the gather mechanism like everyone else?


Robin.


Change-log:
V2:
-fix missing parameter in mtk_iommu_v1.c

Signed-off-by: Tom Murphy 
---
  drivers/iommu/amd/iommu.c   | 14 -
  drivers/iommu/arm-smmu-v3.c |  3 +-
  drivers/iommu/arm-smmu.c|  3 +-
  drivers/iommu/dma-iommu.c   | 45 ---
  drivers/iommu/exynos-iommu.c|  3 +-
  drivers/iommu/intel/iommu.c | 54 +
  drivers/iommu/iommu.c   | 25 +++
  drivers/iommu/ipmmu-vmsa.c  |  3 +-
  drivers/iommu/msm_iommu.c   |  3 +-
  drivers/iommu/mtk_iommu.c   |  3 +-
  drivers/iommu/mtk_iommu_v1.c|  3 +-
  drivers/iommu/omap-iommu.c  |  3 +-
  drivers/iommu/qcom_iommu.c  |  3 +-
  drivers/iommu/rockchip-iommu.c  |  3 +-
  drivers/iommu/s390-iommu.c  |  3 +-
  drivers/iommu/sun50i-iommu.c|  3 +-
  drivers/iommu/tegra-gart.c  |  3 +-
  drivers/iommu/tegra-smmu.c  |  3 +-
  drivers/iommu/virtio-iommu.c|  3 +-
  drivers/vfio/vfio_iommu_type1.c |  2 +-
  include/linux/iommu.h   | 21 +++--
  21 files changed, 150 insertions(+), 56 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 2f22326ee4df..25fbacab23c3 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2513,7 +2513,8 @@ static int amd_iommu_map(struct iommu_domain *dom, 
unsigned long iova,
  
  static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,

  size_t page_size,
- struct iommu_iotlb_gather *gather)
+ struct iommu_iotlb_gather *gather,
+ struct page **freelist)
  {
struct protection_domain *domain = to_pdomain(dom);
struct domain_pgtable pgtable;
@@ -2636,6 +2637,16 @@ static void amd_iommu_flush_iotlb_all(struct 
iommu_domain *domain)
spin_unlock_irqrestore(&dom->lock, flags);
  }
  
+static void amd_iommu_flush_iotlb_range(struct iommu_domain *domain,

+   unsigned long iova, size_t size,
+   struct page *freelist)
+{
+   struct protection_domain *dom = to_pdomain(domain);
+
+   domain_flush_pages(dom, iova, size);
+   domain_flush_complete(dom);
+}
+
  static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
 struct iommu_iotlb_gather *gather)
  {
@@ -2675,6 +2686,7 @@ const struct iommu_ops amd_iommu_ops = {
.is_attach_deferred = amd_iommu_is_attach_deferred,
.pgsize_bitmap  = AMD_IOMMU_PGSIZES,
.flush_iotlb_all = amd_iommu_flush_iotlb_all,
+   .flush_iotlb_range = amd_iommu_flush_iotlb_range,
.iotlb_sync = amd_iommu_iotlb_sync,
.def_domain_type = amd_iommu_def_domain_type,
  };
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..8d328dc25326 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2854,7 +2854,8 @@ static int arm_smmu_map(struct iommu_domain *domain, 
unsigned long iova,
  }
  
  static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,

-size_t size, struct iommu_iotlb_gather *gather)
+size_t size, struct iommu_iotlb_gather *gather,
+struct page **freelist)
  {
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 243bc4cb2705..0cd0dfc89875 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1234,7 +1234,8 @@ static int arm_smmu_map(struct iommu_domain *domain, 
unsigned long iova,
  }
  
  static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,

-size_t size, struct iommu_iotlb_gather *gather)
+size_t size, struct iommu_iotlb_gather *gather,
+struct page **freelist)

[PATCH v3 02/17] dma-contiguous: simplify cma_early_percent_memory()

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

The memory size calculation in cma_early_percent_memory() traverses
memblock.memory rather than simply calling memblock_phys_mem_size(). The
comment in that function suggests that at some point there should have been
a call to memblock_analyze() before memblock_phys_mem_size() could be used.
As of now, there is no memblock_analyze() at all and
memblock_phys_mem_size() can be used as soon as cold-plug memory is
registered with memblock.

Replace loop over memblock.memory with a call to memblock_phys_mem_size().

Signed-off-by: Mike Rapoport 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Baoquan He 
---
 kernel/dma/contiguous.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index cff7e60968b9..0369fd5fda8f 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -73,16 +73,7 @@ early_param("cma", early_cma);
 
 static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
 {
-   struct memblock_region *reg;
-   unsigned long total_pages = 0;
-
-   /*
-* We cannot use memblock_phys_mem_size() here, because
-* memblock_analyze() has not been called yet.
-*/
-   for_each_memblock(memory, reg)
-   total_pages += memblock_region_memory_end_pfn(reg) -
-  memblock_region_memory_base_pfn(reg);
+   unsigned long total_pages = PHYS_PFN(memblock_phys_mem_size());
 
return (total_pages * CONFIG_CMA_SIZE_PERCENTAGE / 100) << PAGE_SHIFT;
 }
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 01/17] KVM: PPC: Book3S HV: simplify kvm_cma_reserve()

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

The memory size calculation in kvm_cma_reserve() traverses memblock.memory
rather than simply calling memblock_phys_mem_size(). The comment in that
function suggests that at some point there should have been a call to
memblock_analyze() before memblock_phys_mem_size() could be used.
As of now, there is no memblock_analyze() at all and
memblock_phys_mem_size() can be used as soon as cold-plug memory is
registered with memblock.

Replace loop over memblock.memory with a call to memblock_phys_mem_size().

Signed-off-by: Mike Rapoport 
---
 arch/powerpc/kvm/book3s_hv_builtin.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 073617ce83e0..8f58dd20b362 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -95,23 +95,15 @@ EXPORT_SYMBOL_GPL(kvm_free_hpt_cma);
 void __init kvm_cma_reserve(void)
 {
unsigned long align_size;
-   struct memblock_region *reg;
-   phys_addr_t selected_size = 0;
+   phys_addr_t selected_size;
 
/*
 * We need CMA reservation only when we are in HV mode
 */
if (!cpu_has_feature(CPU_FTR_HVMODE))
return;
-   /*
-* We cannot use memblock_phys_mem_size() here, because
-* memblock_analyze() has not been called yet.
-*/
-   for_each_memblock(memory, reg)
-   selected_size += memblock_region_memory_end_pfn(reg) -
-memblock_region_memory_base_pfn(reg);
 
-   selected_size = (selected_size * kvm_cma_resv_ratio / 100) << 
PAGE_SHIFT;
+   selected_size = PAGE_ALIGN(memblock_phys_mem_size() * 
kvm_cma_resv_ratio / 100);
if (selected_size) {
pr_info("%s: reserving %ld MiB for global area\n", __func__,
 (unsigned long)selected_size / SZ_1M);
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 00/17] memblock: seasonal cleaning^w cleanup

2020-08-18 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

These patches simplify several uses of memblock iterators and hide some of
the memblock implementation details from the rest of the system.

The patches are on top of v5.9-rc1

v3 changes:
* rebase on v5.9-rc1; as a result this required some non-trivial changes
  in patches 10 and 16. I didn't add Baoquan's Reviewed-by to these
  patches, but I kept Thomas' and Miguel's
* Add Acked-by from Thomas and Miguel as there were changes in MIPS and
  only trivial changes in .clang-format
* Added Reviewed-by from Baoquan except for patches 10 and 16
* Fixed misc build errors and warnings reported by kbuild bot
* Updated PowerPC KVM reservation size (patch 2), as per Daniel's comment

v2 changes:
* replace for_each_memblock() with two versions, one for memblock.memory
  and another one for memblock.reserved
* fix overzealous cleanup of powerpc fadump: keep the traversal over the
  memblocks, but use better suited iterators
* don't remove traversal over memblock.reserved in x86 numa cleanup but
  replace for_each_memblock() with new for_each_reserved_mem_region()
* simplify ramdisk and crash kernel allocations on x86
* drop more redundant and unused code: __next_reserved_mem_region() and
  memblock_mem_size()
* add description of numa initialization fix on arm64 (thanks Jonathan)
* add Acked and Reviewed tags

Mike Rapoport (17):
  KVM: PPC: Book3S HV: simplify kvm_cma_reserve()
  dma-contiguous: simplify cma_early_percent_memory()
  arm, xtensa: simplify initialization of high memory pages
  arm64: numa: simplify dummy_numa_init()
  h8300, nds32, openrisc: simplify detection of memory extents
  riscv: drop unneeded node initialization
  microblaze: drop unneeded NUMA and sparsemem initializations
  memblock: make for_each_memblock_type() iterator private
  memblock: make memblock_debug and related functionality private
  memblock: reduce number of parameters in for_each_mem_range()
  arch, mm: replace for_each_memblock() with for_each_mem_pfn_range()
  arch, drivers: replace for_each_membock() with for_each_mem_range()
  x86/setup: simplify initrd relocation and reservation
  x86/setup: simplify reserve_crashkernel()
  memblock: remove unused memblock_mem_size()
  memblock: implement for_each_reserved_mem_region() using
__next_mem_region()
  memblock: use separate iterators for memory and reserved regions

 .clang-format|  5 +-
 arch/arm/kernel/setup.c  | 18 +++--
 arch/arm/mm/init.c   | 59 +++
 arch/arm/mm/mmu.c| 39 --
 arch/arm/mm/pmsa-v7.c| 23 +++---
 arch/arm/mm/pmsa-v8.c| 17 ++---
 arch/arm/xen/mm.c|  7 +-
 arch/arm64/kernel/machine_kexec_file.c   |  6 +-
 arch/arm64/kernel/setup.c|  4 +-
 arch/arm64/mm/init.c | 11 +--
 arch/arm64/mm/kasan_init.c   | 10 +--
 arch/arm64/mm/mmu.c  | 11 +--
 arch/arm64/mm/numa.c | 15 ++--
 arch/c6x/kernel/setup.c  |  9 ++-
 arch/h8300/kernel/setup.c|  8 +-
 arch/microblaze/mm/init.c| 21 ++
 arch/mips/cavium-octeon/dma-octeon.c | 12 +--
 arch/mips/kernel/setup.c | 31 
 arch/mips/netlogic/xlp/setup.c   |  2 +-
 arch/nds32/kernel/setup.c|  8 +-
 arch/openrisc/kernel/setup.c |  9 +--
 arch/openrisc/mm/init.c  |  8 +-
 arch/powerpc/kernel/fadump.c | 57 +++---
 arch/powerpc/kexec/file_load_64.c| 16 ++--
 arch/powerpc/kvm/book3s_hv_builtin.c | 12 +--
 arch/powerpc/mm/book3s64/hash_utils.c| 16 ++--
 arch/powerpc/mm/book3s64/radix_pgtable.c | 10 +--
 arch/powerpc/mm/kasan/kasan_init_32.c|  8 +-
 arch/powerpc/mm/mem.c| 33 
 arch/powerpc/mm/numa.c   |  7 +-
 arch/powerpc/mm/pgtable_32.c |  8 +-
 arch/riscv/mm/init.c | 36 +++--
 arch/riscv/mm/kasan_init.c   | 10 +--
 arch/s390/kernel/setup.c | 27 ---
 arch/s390/mm/page-states.c   |  6 +-
 arch/s390/mm/vmem.c  |  7 +-
 arch/sh/mm/init.c|  9 +--
 arch/sparc/mm/init_64.c  | 12 +--
 arch/x86/kernel/setup.c  | 56 +-
 arch/x86/mm/numa.c   |  2 +-
 arch/xtensa/mm/init.c| 55 +++---
 drivers/bus/mvebu-mbus.c | 12 +--
 drivers/irqchip/irq-gic-v3-its.c |  2 +-
 include/linux/memblock.h | 88 +-
 kernel/dma/contiguous.c  | 11 +--
 mm/memblock.c| 95 ++--
 mm/page_alloc.c  | 11 ++-
 mm/sparse.c  | 10 +--
 48 files changed, 387 insertions(+), 562 deletions(-)

-- 
2.26.2

Re: [PATCH 00/16] IOMMU driver for Kirin 960/970

2020-08-18 Thread Robin Murphy

On 2020-08-17 08:49, Mauro Carvalho Chehab wrote:

Add a driver for the Kirin 960/970 iommu.

As on the past series, this starts from the original 4.9 driver from
the 96boards tree:

https://github.com/96boards-hikey/linux/tree/hikey970-v4.9

The remaining patches add SPDX headers and make it build and run with
the upstream Kernel.

Chenfeng (1):
   iommu: add support for HiSilicon Kirin 960/970 iommu

Mauro Carvalho Chehab (15):
   iommu: hisilicon: remove default iommu_map_sg handler
   iommu: hisilicon: map and unmap ops gained new arguments
   iommu: hisi_smmu_lpae: rebase it to work with upstream
   iommu: hisi_smmu: remove linux/hisi/hisi-iommu.h
   iommu: hisilicon: cleanup its code style
   iommu: hisi_smmu_lpae: get rid of IOMMU_SEC and IOMMU_DEVICE
   iommu: get rid of map/unmap tile functions
   iommu: hisi_smmu_lpae: use the right code to get domain-priv data
   iommu: hisi_smmu_lpae: convert it to probe_device
   iommu: add Hisilicon Kirin970 iommu at the building system
   iommu: hisi_smmu_lpae: cleanup printk macros
   iommu: hisi_smmu_lpae: make OF compatible more standard


Echoing the other comments about none of the driver patches being CC'd 
to the IOMMU list...


Still, I dug the series up on lore and frankly I'm not sure what to make 
of it - AFAICS the "driver" is just yet another implementation of Arm 
LPAE pagetable code, with no obvious indication of how those pagetables 
ever get handed off to IOMMU hardware (and indeed no indication of IOMMU 
hardware at all). Can you explain how it's supposed to work?


And as a pre-emptive strike, we really don't need any more LPAE 
implementations - that's what the io-pgtable library is all about (which 
incidentally has been around since 4.0...). I think that should make the 
issue of preserving authorship largely moot since there's no need to 
preserve most of the code anyway ;)
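
For reference, hooking into io-pgtable is little more than filling in a
config and allocating the ops (a minimal sketch; ARM_64_LPAE_S1,
my_tlb_ops, dev and cookie are placeholders, not taken from the posted
driver):

	struct io_pgtable_cfg cfg = {
		.pgsize_bitmap	= SZ_4K | SZ_2M | SZ_1G,
		.ias		= 32,
		.oas		= 40,
		.coherent_walk	= true,
		.tlb		= &my_tlb_ops,
		.iommu_dev	= dev,
	};
	struct io_pgtable_ops *ops;

	ops = alloc_io_pgtable_ops(ARM_64_LPAE_S1, &cfg, cookie);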


Robin.


   dt: add a spec for the Kirin36x0 SMMU
   dt: hi3670-hikey970.dts: load the SMMU driver on Hikey970
   staging: hikey9xx: add an item about the iommu driver

  .../iommu/hisilicon,kirin36x0-smmu.yaml   |  55 ++
  .../boot/dts/hisilicon/hi3670-hikey970.dts|   3 +
  drivers/staging/hikey9xx/Kconfig  |   9 +
  drivers/staging/hikey9xx/Makefile |   1 +
  drivers/staging/hikey9xx/TODO |   1 +
  drivers/staging/hikey9xx/hisi_smmu.h  | 196 ++
  drivers/staging/hikey9xx/hisi_smmu_lpae.c | 648 ++
  7 files changed, 913 insertions(+)
  create mode 100644 
Documentation/devicetree/bindings/iommu/hisilicon,kirin36x0-smmu.yaml
  create mode 100644 drivers/staging/hikey9xx/hisi_smmu.h
  create mode 100644 drivers/staging/hikey9xx/hisi_smmu_lpae.c


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-18 Thread Robin Murphy

On 2020-08-18 12:17, Barry Song wrote:

Polling by MSI isn't necessarily faster than polling by SEV. Tests on
hi1620 show hns3 100G NIC network throughput can improve from 25G to
27G if we disable MSI polling while running 16 netperf threads sending
UDP packets in size 32KB. TX throughput can improve from 7G to 7.7G for
single thread.
The reason for the throughput improvement is that the latency to poll
the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC
in an empty cmd queue, typically we need to wait for 280ns using MSI
polling. But we only need around 190ns after disabling MSI polling.
This patch provides a command line option so that users can decide to
use MSI polling or not based on their tests.

Signed-off-by: Barry Song 
---
  -v4: rebase on top of 5.9-rc1
  refine changelog

  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ++
  1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..89d3cb391fef 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, 
S_IRUGO);
  MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices that 
are not attached to an iommu domain will report an abort back to the device and will not 
be allowed to pass through the SMMU.");
  
+static bool disable_msipolling;

+module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);


Just use module_param() - going out of the way to specify a "different" 
name that's identical to the variable name is silly.


Also I think the preference these days is to specify permissions as 
plain octal constants rather than those rather inscrutable macros. I 
certainly find that more readable myself.


(Yes, the existing parameter commits the same offences, but I'd rather 
clean that up separately than perpetuate it)



+MODULE_PARM_DESC(disable_msipolling,
+   "Disable MSI-based polling for CMD_SYNC completion.");
+
  enum pri_resp {
PRI_RESP_DENY = 0,
PRI_RESP_FAIL = 1,
@@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct 
arm_smmu_cmdq_ent *ent)
return 0;
  }
  
+static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)

+{
+   return !disable_msipolling &&
+  smmu->features & ARM_SMMU_FEAT_COHERENCY &&
+  smmu->features & ARM_SMMU_FEAT_MSI;
+}


I'd wrap this up into a new ARM_SMMU_OPT_MSIPOLL flag set at probe time, 
rather than constantly reevaluating this whole expression (now that it's 
no longer simply testing two adjacent bits of the same word).
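
For illustration, such a probe-time option could look roughly like this
(ARM_SMMU_OPT_MSIPOLL here is a hypothetical new option bit):

	/* once, at probe time: */
	if (!disable_msipolling &&
	    (smmu->features & ARM_SMMU_FEAT_COHERENCY) &&
	    (smmu->features & ARM_SMMU_FEAT_MSI))
		smmu->options |= ARM_SMMU_OPT_MSIPOLL;

	/* hot paths then test a single bit: */
	if (smmu->options & ARM_SMMU_OPT_MSIPOLL)
		return __arm_smmu_cmdq_poll_until_msi(smmu, llq);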


Robin.


+
  static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device 
*smmu,
 u32 prod)
  {
@@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct 
arm_smmu_device *smmu,
 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
 * payload, so the write will zero the entire command on that platform.
 */
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
+   if (arm_smmu_use_msipolling(smmu)) {
ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
   q->ent_dwords * 8;
}
@@ -1332,8 +1343,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct 
arm_smmu_device *smmu,
  static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 struct arm_smmu_ll_queue *llq)
  {
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY)
+   if (arm_smmu_use_msipolling(smmu))
return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
  
  	return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-18 Thread Barry Song
Polling by MSI isn't necessarily faster than polling by SEV. Tests on
hi1620 show hns3 100G NIC network throughput can improve from 25G to
27G if we disable MSI polling while running 16 netperf threads sending
UDP packets in size 32KB. TX throughput can improve from 7G to 7.7G for
single thread.
The reason for the throughput improvement is that the latency to poll
the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC
in an empty cmd queue, typically we need to wait for 280ns using MSI
polling. But we only need around 190ns after disabling MSI polling.
This patch provides a command line option so that users can decide to
use MSI polling or not based on their tests.

Signed-off-by: Barry Song 
---
 -v4: rebase on top of 5.9-rc1
 refine changelog

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..89d3cb391fef 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, 
S_IRUGO);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
+static bool disable_msipolling;
+module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_msipolling,
+   "Disable MSI-based polling for CMD_SYNC completion.");
+
 enum pri_resp {
PRI_RESP_DENY = 0,
PRI_RESP_FAIL = 1,
@@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct 
arm_smmu_cmdq_ent *ent)
return 0;
 }
 
+static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)
+{
+   return !disable_msipolling &&
+  smmu->features & ARM_SMMU_FEAT_COHERENCY &&
+  smmu->features & ARM_SMMU_FEAT_MSI;
+}
+
 static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device 
*smmu,
 u32 prod)
 {
@@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct 
arm_smmu_device *smmu,
 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
 * payload, so the write will zero the entire command on that platform.
 */
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
+   if (arm_smmu_use_msipolling(smmu)) {
ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
   q->ent_dwords * 8;
}
@@ -1332,8 +1343,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct 
arm_smmu_device *smmu,
 static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 struct arm_smmu_ll_queue *llq)
 {
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY)
+   if (arm_smmu_use_msipolling(smmu))
return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
 
return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);
-- 
2.27.0


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 16/20] drm/msm/a6xx: Add support for per-instance pagetables

2020-08-18 Thread Akhil P Oommen

Reviewed-by: Akhil P Oommen 

On 8/18/2020 3:31 AM, Rob Clark wrote:

From: Jordan Crouse 

Add support for using per-instance pagetables if all the dependencies are
available.

Signed-off-by: Jordan Crouse 
Signed-off-by: Rob Clark 
---
  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 63 +++
  drivers/gpu/drm/msm/adreno/a6xx_gpu.h |  1 +
  drivers/gpu/drm/msm/msm_ringbuffer.h  |  1 +
  3 files changed, 65 insertions(+)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 5eabb0109577..d7ad6c78d787 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -81,6 +81,49 @@ static void get_stats_counter(struct msm_ringbuffer *ring, 
u32 counter,
OUT_RING(ring, upper_32_bits(iova));
  }
  
+static void a6xx_set_pagetable(struct a6xx_gpu *a6xx_gpu,
+   struct msm_ringbuffer *ring, struct msm_file_private *ctx)
+{
+   phys_addr_t ttbr;
+   u32 asid;
+   u64 memptr = rbmemptr(ring, ttbr0);
+
+   if (ctx == a6xx_gpu->cur_ctx)
+   return;
+
+   if (msm_iommu_pagetable_params(ctx->aspace->mmu, &ttbr, &asid))
+   return;
+
+   /* Execute the table update */
+   OUT_PKT7(ring, CP_SMMU_TABLE_UPDATE, 4);
+   OUT_RING(ring, CP_SMMU_TABLE_UPDATE_0_TTBR0_LO(lower_32_bits(ttbr)));
+
+   OUT_RING(ring,
+   CP_SMMU_TABLE_UPDATE_1_TTBR0_HI(upper_32_bits(ttbr)) |
+   CP_SMMU_TABLE_UPDATE_1_ASID(asid));
+   OUT_RING(ring, CP_SMMU_TABLE_UPDATE_2_CONTEXTIDR(0));
+   OUT_RING(ring, CP_SMMU_TABLE_UPDATE_3_CONTEXTBANK(0));
+
+   /*
+* Write the new TTBR0 to the memstore. This is good for debugging.
+*/
+   OUT_PKT7(ring, CP_MEM_WRITE, 4);
+   OUT_RING(ring, CP_MEM_WRITE_0_ADDR_LO(lower_32_bits(memptr)));
+   OUT_RING(ring, CP_MEM_WRITE_1_ADDR_HI(upper_32_bits(memptr)));
+   OUT_RING(ring, lower_32_bits(ttbr));
+   OUT_RING(ring, (asid << 16) | upper_32_bits(ttbr));
+
+   /*
+* And finally, trigger a uche flush to be sure there isn't anything
+* lingering in that part of the GPU
+*/
+
+   OUT_PKT7(ring, CP_EVENT_WRITE, 1);
+   OUT_RING(ring, 0x31);
+
+   a6xx_gpu->cur_ctx = ctx;
+}
+
  static void a6xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
  {
unsigned int index = submit->seqno % MSM_GPU_SUBMIT_STATS_COUNT;
@@ -90,6 +133,8 @@ static void a6xx_submit(struct msm_gpu *gpu, struct 
msm_gem_submit *submit)
struct msm_ringbuffer *ring = submit->ring;
unsigned int i;
  
+	a6xx_set_pagetable(a6xx_gpu, ring, submit->queue->ctx);
+
get_stats_counter(ring, REG_A6XX_RBBM_PERFCTR_CP_0_LO,
rbmemptr_stats(ring, index, cpcycles_start));
  
@@ -696,6 +741,8 @@ static int a6xx_hw_init(struct msm_gpu *gpu)

/* Always come up on rb 0 */
a6xx_gpu->cur_ring = gpu->rb[0];
  
+	a6xx_gpu->cur_ctx = NULL;
+
/* Enable the SQE_to start the CP engine */
gpu_write(gpu, REG_A6XX_CP_SQE_CNTL, 1);
  
@@ -1008,6 +1055,21 @@ static unsigned long a6xx_gpu_busy(struct msm_gpu *gpu)

return (unsigned long)busy_time;
  }
  
+static struct msm_gem_address_space *
+a6xx_create_private_address_space(struct msm_gpu *gpu)
+{
+   struct msm_gem_address_space *aspace = NULL;
+   struct msm_mmu *mmu;
+
+   mmu = msm_iommu_pagetable_create(gpu->aspace->mmu);
+
+   if (!IS_ERR(mmu))
+   aspace = msm_gem_address_space_create(mmu,
+   "gpu", 0x1ULL, 0x1ULL);
+
+   return aspace;
+}
+
  static const struct adreno_gpu_funcs funcs = {
.base = {
.get_param = adreno_get_param,
@@ -1031,6 +1093,7 @@ static const struct adreno_gpu_funcs funcs = {
.gpu_state_put = a6xx_gpu_state_put,
  #endif
.create_address_space = adreno_iommu_create_address_space,
+   .create_private_address_space = 
a6xx_create_private_address_space,
},
.get_timestamp = a6xx_get_timestamp,
  };
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.h 
b/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
index 03ba60d5b07f..da22d7549d9b 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
@@ -19,6 +19,7 @@ struct a6xx_gpu {
uint64_t sqe_iova;
  
	struct msm_ringbuffer *cur_ring;
+   struct msm_file_private *cur_ctx;
  
  	struct a6xx_gmu gmu;

  };
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.h 
b/drivers/gpu/drm/msm/msm_ringbuffer.h
index 7764373d0ed2..0987d6bf848c 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.h
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.h
@@ -31,6 +31,7 @@ struct msm_rbmemptrs {
volatile uint32_t fence;
  
	volatile struct msm_gpu_submit_stats stats[MSM_GPU_SUBMIT_STATS_COUNT];
+   volatile u64 ttbr0;
  };
  
  struct msm_ringbuffer {




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 1/2] dma-mapping: introduce relaxed version of dma sync

2020-08-18 Thread Will Deacon
On Tue, Aug 18, 2020 at 06:37:39PM +0900, Cho KyongHo wrote:
> On Tue, Aug 18, 2020 at 09:28:53AM +0100, Will Deacon wrote:
> > On Tue, Aug 18, 2020 at 04:43:10PM +0900, Cho KyongHo wrote:
> > > Cache maintenance operations in most CPU architectures need a memory
> > > barrier after the cache maintenance for the DMA masters to view the
> > > memory region correctly. The problem is that the memory barrier is very
> > > expensive, and dma_[un]map_sg() and dma_sync_sg_for_{device|cpu}()
> > > involve a memory barrier per sg entry. In some CPU micro-architectures,
> > > a single memory barrier consumes more time than a cache clean on 4KiB.
> > > It becomes more serious as the number of CPU cores grows.
> > 
> > Have you got higher-level performance data for this change? It's more likely
> > that the DSB is what actually forces the prior cache maintenance to
> > complete,
> 
> This patch does not skip the necessary DSB after cache maintenance. It just
> removes the repeated dsb per sg entry and calls dsb once after cache
> maintenance on all sg entries is completed.

Yes, I realise that, but what I'm saying is that a big part of your
justification for this change is:

  | The problem is that the memory barrier is very expensive, and
  | dma_[un]map_sg() and dma_sync_sg_for_{device|cpu}() involve a memory
  | barrier per sg entry. In some CPU micro-architectures, a single memory
  | barrier consumes more time than a cache clean on 4KiB.

and my point is that the DSB is likely what completes the cache maintenance,
so as cache maintenance instructions retire faster in the micro-architecture,
the DSB becomes slower in absolute terms. In other words, it doesn't make
much sense to me to compare the cost of the DSB with the cost of the cache
maintenance; what matters more is the cost of the high-level unmap()
operation for the sglist.

> > so it's important to look at the bigger picture, not just the
> > apparent relative cost of these instructions.
> > 
> If by the bigger picture you mean the performance impact of this patch on
> a complete user scenario, we are evaluating it in some latency-sensitive
> scenarios. But I wonder if a performance gain in a platform/SoC-specific
> scenario is also persuasive.

Latency is fine too, but phrasing the numbers (and we really need those)
in terms of things like "The interrupt response time for this in-tree
driver is improved by xxx ns (yy %) after this change" or "Throughput
for this in-tree driver goes from xxx mb/s to yyy mb/s" would be really
helpful.

> > Also, it's a miracle that non-coherent DMA even works,
> 
> I am sorry, Will. I don't understand this. Can you let me know what you
> mean by the above sentence?

Non-coherent DMA sucks for software. For the most part, Linux does a nice
job of hiding this from device drivers, and I think _that_ is the primary
concern, rather than performance. If performance is a problem, then the
solution is cache coherence or a shared non-cacheable buffer (rather than
the streaming API).

> > so I'm not sure
> > that we should be complicating the implementation like this to try to
> > make it "fast".
> > 
> I agree that this patch makes the implementation of the DMA API a bit
> more complex, but I don't think it complicates it seriously.

It's death by a thousand cuts; this patch further fragments the architecture
backends and leads to arm64-specific behaviour which consequently won't get
well tested by anybody else. Now, it might be worth it, but there's not
enough information here to make that call.

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/2] dma-mapping: introduce relaxed version of dma sync

2020-08-18 Thread Cho KyongHo
On Tue, Aug 18, 2020 at 09:37:20AM +0100, Christoph Hellwig wrote:
> On Tue, Aug 18, 2020 at 09:28:53AM +0100, Will Deacon wrote:
> > On Tue, Aug 18, 2020 at 04:43:10PM +0900, Cho KyongHo wrote:
> > > Cache maintenance operations in most CPU architectures need a memory
> > > barrier after the cache maintenance for the DMA masters to view the
> > > memory region correctly. The problem is that the memory barrier is very
> > > expensive, and dma_[un]map_sg() and dma_sync_sg_for_{device|cpu}()
> > > involve a memory barrier per sg entry. In some CPU micro-architectures,
> > > a single memory barrier consumes more time than a cache clean on 4KiB.
> > > It becomes more serious as the number of CPU cores grows.
> > 
> > Have you got higher-level performance data for this change? It's more likely
> > that the DSB is what actually forces the prior cache maintenance to
> > complete, so it's important to look at the bigger picture, not just the
> > apparent relative cost of these instructions.
> > 
> > Also, it's a miracle that non-coherent DMA even works, so I'm not sure
> > that we should be complicating the implementation like this to try to
> > make it "fast".
> 
> And without not just an important in-tree user, but one that actually
> matters and can show how this is correct, the whole proposal is a
> complete nonstarter.
> 
The patch introduces the new kernel configurations
ARCH_HAS_SYNC_DMA_FOR_DEVICE_RELAXED and ARCH_HAS_SYNC_DMA_FOR_CPU_RELAXED
so as not to affect the rest of the system. I also confirmed that the
patch does not break other architectures, including arm and x86, which
do not define the new kernel configurations.

Would you let me know what else is needed to confirm this patch is
correct?

Thank you.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 1/2] dma-mapping: introduce relaxed version of dma sync

2020-08-18 Thread Cho KyongHo
On Tue, Aug 18, 2020 at 09:28:53AM +0100, Will Deacon wrote:
> On Tue, Aug 18, 2020 at 04:43:10PM +0900, Cho KyongHo wrote:
> > Cache maintenance operations in most CPU architectures need a memory
> > barrier after the cache maintenance for the DMA masters to view the
> > memory region correctly. The problem is that the memory barrier is very
> > expensive, and dma_[un]map_sg() and dma_sync_sg_for_{device|cpu}()
> > involve a memory barrier per sg entry. In some CPU micro-architectures,
> > a single memory barrier consumes more time than a cache clean on 4KiB.
> > It becomes more serious as the number of CPU cores grows.
> 
> Have you got higher-level performance data for this change? It's more likely
> that the DSB is what actually forces the prior cache maintenance to
> complete,

This patch does not skip the necessary DSB after cache maintenance. It just
removes the repeated dsb per sg entry and calls dsb once after cache
maintenance on all sg entries is completed.
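
Concretely, the sg path goes from one barrier per entry to a single
trailing barrier; a sketch mirroring the dma-iommu hunk in patch 1/2:

	/* before: arch_sync_dma_for_device() ends in 'dsb sy', once per entry */
	for_each_sg(sgl, sg, nelems, i)
		arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);

	/* after: relaxed per-entry maintenance, one barrier for the lot */
	for_each_sg(sgl, sg, nelems, i)
		arch_sync_dma_for_device_relaxed(sg_phys(sg), sg->length, dir);
	arch_sync_barrier_for_device(dir);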

> so it's important to look at the bigger picture, not just the
> apparent relative cost of these instructions.
> 
If by the bigger picture you mean the performance impact of this patch on
a complete user scenario, we are evaluating it in some latency-sensitive
scenarios. But I wonder if a performance gain in a platform/SoC-specific
scenario is also persuasive.

> Also, it's a miracle that non-coherent DMA even works,

I am sorry, Will. I don't understand this. Can you let me know what you
mean by the above sentence?

> so I'm not sure
> that we should be complicating the implementation like this to try to
> make it "fast".
> 
I agree that this patch makes the implementation of the DMA API a bit
more complex, but I don't think it complicates it seriously.

> Will
> 

Thank you.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 1/2] dma-mapping: introduce relaxed version of dma sync

2020-08-18 Thread Christoph Hellwig
On Tue, Aug 18, 2020 at 09:28:53AM +0100, Will Deacon wrote:
> On Tue, Aug 18, 2020 at 04:43:10PM +0900, Cho KyongHo wrote:
> > Cache maintenance operations in most CPU architectures need a memory
> > barrier after the cache maintenance for the DMA masters to view the
> > memory region correctly. The problem is that the memory barrier is very
> > expensive, and dma_[un]map_sg() and dma_sync_sg_for_{device|cpu}()
> > involve a memory barrier per sg entry. In some CPU micro-architectures,
> > a single memory barrier consumes more time than a cache clean on 4KiB.
> > It becomes more serious as the number of CPU cores grows.
> 
> Have you got higher-level performance data for this change? It's more likely
> that the DSB is what actually forces the prior cache maintenance to
> complete, so it's important to look at the bigger picture, not just the
> apparent relative cost of these instructions.
> 
> Also, it's a miracle that non-coherent DMA even works, so I'm not sure
> that we should be complicating the implementation like this to try to
> make it "fast".

And without not just an important in-tree user, but one that actually
matters and can show how this is correct, the whole proposal is a
complete nonstarter.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/2] dma-mapping: introduce relaxed version of dma sync

2020-08-18 Thread Will Deacon
On Tue, Aug 18, 2020 at 04:43:10PM +0900, Cho KyongHo wrote:
> Cache maintenance operations in most CPU architectures need a memory
> barrier after the cache maintenance for the DMA masters to view the
> memory region correctly. The problem is that the memory barrier is very
> expensive, and dma_[un]map_sg() and dma_sync_sg_for_{device|cpu}()
> involve a memory barrier per sg entry. In some CPU micro-architectures,
> a single memory barrier consumes more time than a cache clean on 4KiB.
> It becomes more serious as the number of CPU cores grows.

Have you got higher-level performance data for this change? It's more likely
that the DSB is what actually forces the prior cache maintenance to
complete, so it's important to look at the bigger picture, not just the
apparent relative cost of these instructions.

Also, it's a miracle that non-coherent DMA even works, so I'm not sure
that we should be complicating the implementation like this to try to
make it "fast".

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH RESEND v10 07/11] device-mapping: Introduce DMA range map, supplanting dma_pfn_offset

2020-08-18 Thread Andy Shevchenko
On Mon, Aug 17, 2020 at 05:53:09PM -0400, Jim Quinlan wrote:
> The new field 'dma_range_map' in struct device is used to facilitate the
> use of single or multiple offsets between mapping regions of cpu addrs and
> dma addrs.  It subsumes the role of "dev->dma_pfn_offset" which was only
> capable of holding a single uniform offset and had no region bounds
> checking.
> 
> The function of_dma_get_range() has been modified so that it takes a single
> argument -- the device node -- and returns a map, NULL, or an error code.
> The map is an array that holds the information regarding the DMA regions.
> Each range entry contains the address offset, the cpu_start address, the
> dma_start address, and the size of the region.
> 
> of_dma_configure() is the typical manner to set range offsets but there are
> a number of ad hoc assignments to "dev->dma_pfn_offset" in the kernel
> driver code.  These cases now invoke the function
> dma_set_offset_range(dev, cpu_addr, dma_addr, size).

...

> + if (dev) {
> + phys_addr_t paddr = PFN_PHYS(pfn);
> +

> + pfn -= (dma_offset_from_phys_addr(dev, paddr) >> PAGE_SHIFT);

PFN_DOWN() ?
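
i.e., presumably something like:

	pfn -= PFN_DOWN(dma_offset_from_phys_addr(dev, paddr));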

> + }

...

> + pfn += (dma_offset_from_dma_addr(dev, addr) >> PAGE_SHIFT);

Ditto.


...

> +static inline u64 dma_offset_from_dma_addr(struct device *dev, dma_addr_t 
> dma_addr)
> +{
> + const struct bus_dma_region *m = dev->dma_range_map;
> +
> + if (!m)
> + return 0;
> + for (; m->size; m++)
> + if (dma_addr >= m->dma_start && dma_addr - m->dma_start < 
> m->size)
> + return m->offset;
> + return 0;
> +}
> +
> +static inline u64 dma_offset_from_phys_addr(struct device *dev, phys_addr_t 
> paddr)
> +{
> + const struct bus_dma_region *m = dev->dma_range_map;
> +
> + if (!m)
> + return 0;
> + for (; m->size; m++)
> + if (paddr >= m->cpu_start && paddr - m->cpu_start < m->size)
> + return m->offset;
> + return 0;
> +}

Perhaps for these the form with one return 0 is easier to read

if (m) {
for (; m->size; m++)
if (paddr >= m->cpu_start && paddr - m->cpu_start < 
m->size)
return m->offset;
}
return 0;

?

...

> + if (mem->use_dev_dma_pfn_offset) {
> + u64 base_addr = (u64)mem->pfn_base << PAGE_SHIFT;

PFN_PHYS() ?
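
i.e., presumably something like:

	u64 base_addr = PFN_PHYS(mem->pfn_base);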

> +
> + return base_addr - dma_offset_from_phys_addr(dev, base_addr);
> + }

...

> + * It returns -ENOMEM if out of memory, 0 otherwise.

This doesn't describe cases dev->dma_range_map != NULL and offset == 0.

> +int dma_set_offset_range(struct device *dev, phys_addr_t cpu_start,
> +  dma_addr_t dma_start, u64 size)
> +{
> + struct bus_dma_region *map;
> + u64 offset = (u64)cpu_start - (u64)dma_start;
> +
> + if (!offset)
> + return 0;
> +
> + if (dev->dma_range_map) {
> + dev_err(dev, "attempt to add DMA range to existing map\n");
> + return -EINVAL;
> + }
> +
> + map = kcalloc(2, sizeof(*map), GFP_KERNEL);
> + if (!map)
> + return -ENOMEM;
> + map[0].cpu_start = cpu_start;
> + map[0].dma_start = dma_start;
> + map[0].offset = offset;
> + map[0].size = size;
> + dev->dma_range_map = map;
> +
> + return 0;
> +}

...

> +void *dma_copy_dma_range_map(const struct bus_dma_region *map)
> +{
> + int num_ranges;
> + struct bus_dma_region *new_map;
> + const struct bus_dma_region *r = map;
> +
> + for (num_ranges = 0; r->size; num_ranges++)
> + r++;

> + new_map = kcalloc(num_ranges + 1, sizeof(*map), GFP_KERNEL);
> + if (new_map)
> + memcpy(new_map, map, sizeof(*map) * num_ranges);

Looks like krealloc() at first glance...

> +
> + return new_map;
> +}

-- 
With Best Regards,
Andy Shevchenko


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 1/2] dma-mapping: introduce relaxed version of dma sync

2020-08-18 Thread Cho KyongHo
Cache maintenance operations in most CPU architectures need a memory
barrier after the cache maintenance for the DMA masters to view the
memory region correctly. The problem is that the memory barrier is very
expensive, and dma_[un]map_sg() and dma_sync_sg_for_{device|cpu}()
involve a memory barrier per sg entry. In some CPU micro-architectures,
a single memory barrier consumes more time than a cache clean on 4KiB.
It becomes more serious as the number of CPU cores grows.
This patch introduces arch_sync_dma_for_device_relaxed() and
arch_sync_dma_for_cpu_relaxed(), which do not involve a memory barrier.
Users of those functions are therefore required to explicitly call
arch_sync_barrier_for_device() and arch_sync_barrier_for_cpu(),
respectively, to ensure the view of memory is consistent between the
CPUs and the DMA masters.

Signed-off-by: Cho KyongHo 
---
 drivers/iommu/dma-iommu.c   |  6 +++--
 include/linux/dma-direct.h  | 29 +-
 include/linux/dma-noncoherent.h | 54 +
 kernel/dma/Kconfig  |  8 ++
 kernel/dma/direct.c | 25 +++
 5 files changed, 109 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 5141d49..4f9c9cb 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -705,7 +705,8 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
return;
 
for_each_sg(sgl, sg, nelems, i)
-   arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
+   arch_sync_dma_for_cpu_relaxed(sg_phys(sg), sg->length, dir);
+   arch_sync_barrier_for_cpu(dir);
 }
 
 static void iommu_dma_sync_sg_for_device(struct device *dev,
@@ -719,7 +720,8 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
return;
 
for_each_sg(sgl, sg, nelems, i)
-   arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
+   arch_sync_dma_for_device_relaxed(sg_phys(sg), sg->length, dir);
+   arch_sync_barrier_for_device(dir);
 }
 
 static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 6e87225..f5b1fee 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -152,7 +152,7 @@ static inline void dma_direct_sync_single_for_cpu(struct 
device *dev,
swiotlb_tbl_sync_single(dev, paddr, size, dir, SYNC_FOR_CPU);
 }
 
-static inline dma_addr_t dma_direct_map_page(struct device *dev,
+static inline dma_addr_t __dma_direct_map_page(struct device *dev,
struct page *page, unsigned long offset, size_t size,
enum dma_data_direction dir, unsigned long attrs)
 {
@@ -172,20 +172,37 @@ static inline dma_addr_t dma_direct_map_page(struct 
device *dev,
return DMA_MAPPING_ERROR;
}
 
-   if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-   arch_sync_dma_for_device(phys, size, dir);
return dma_addr;
 }
 
-static inline void dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
+static inline dma_addr_t dma_direct_map_page(struct device *dev,
+   struct page *page, unsigned long offset, size_t size,
+   enum dma_data_direction dir, unsigned long attrs)
+{
+   dma_addr_t dma_addr = __dma_direct_map_page(dev, page, offset, size, 
dir, attrs);
+
+   if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
+   !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   arch_sync_dma_for_device(page_to_phys(page) + offset, size, 
dir);
+
+   return dma_addr;
+}
+
+static inline void __dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
phys_addr_t phys = dma_to_phys(dev, addr);
 
+   if (unlikely(is_swiotlb_buffer(phys)))
+   swiotlb_tbl_unmap_single(dev, phys, size, size, dir, attrs);
+}
+
+static inline void dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
+   size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
dma_direct_sync_single_for_cpu(dev, addr, size, dir);
 
-   if (unlikely(is_swiotlb_buffer(phys)))
-   swiotlb_tbl_unmap_single(dev, phys, size, size, dir, attrs);
+   __dma_direct_unmap_page(dev, addr, size, dir, attrs);
 }
 #endif /* _LINUX_DMA_DIRECT_H */
diff --git a/include/linux/dma-noncoherent.h b/include/linux/dma-noncoherent.h
index ca09a4e..0a31e6c 100644
--- a/include/linux/dma-noncoherent.h
+++ b/include/linux/dma-noncoherent.h
@@ -73,23 +73,77 @@ static inline void arch_dma_cache_sync(struct device *dev, 
void *vaddr,
 #endif /* CONFIG_DMA_NONCOHERENT_CACHE_SYNC */
 
 #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE
+#ifdef 

[PATCH 2/2] arm64: dma-mapping: add relaxed DMA sync

2020-08-18 Thread Cho KyongHo
__dma_[un]map_area() is the implementation of the cache maintenance
operations for DMA on arm64. The 'dsb sy' in the subroutine guarantees
that the view of the given memory area is consistent for all memory
observers, so it is required.
However, dma_sync_sg_for_{device|cpu}() and dma_[un]map_sg() call
__dma_[un]map_area() nents times, and the 'dsb sy' instruction is
executed the same number of times. We have observed that 'dsb sy'
consumes more time than cleaning or invalidating a 4KiB area.
arch_sync_dma_for_{device|cpu}_relaxed() and
arch_sync_barrier_for_{device|cpu}() were introduced in commit
6a9356234 ("dma-mapping: introduce relaxed version of dma sync") to
reduce redundant memory barriers in the sg versions of the DMA sync
API. Implementing the relaxed version of the DMA sync API significantly
improves the performance of dma_sync_sg_for_{device|cpu}().

Signed-off-by: Cho KyongHo 
---
 arch/arm64/Kconfig |  4 ++--
 arch/arm64/include/asm/assembler.h | 33 -
 arch/arm64/include/asm/barrier.h   | 13 +
 arch/arm64/mm/cache.S  | 34 +++---
 arch/arm64/mm/dma-mapping.c|  4 ++--
 include/linux/dma-noncoherent.h|  1 +
 6 files changed, 61 insertions(+), 28 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d23283..4fc7ef4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -31,8 +31,8 @@ config ARM64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
-   select ARCH_HAS_SYNC_DMA_FOR_DEVICE
-   select ARCH_HAS_SYNC_DMA_FOR_CPU
+   select ARCH_HAS_SYNC_DMA_FOR_DEVICE_RELAXED
+   select ARCH_HAS_SYNC_DMA_FOR_CPU_RELAXED
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index 54d1811..1f87d98 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -345,6 +345,33 @@ alternative_endif
.endm
 
 /*
+ * Macro to perform a data cache invalidation for the interval
+ * [kaddr, kaddr + size)
+ *
+ * kaddr:  starting virtual address of the region
+ * size:   size of the region
+ * Corrupts:   kaddr, size, tmp1, tmp2
+ */
+   .macro __dcache_inv_by_line kaddr, size, tmp1, tmp2
+   add     \size, \size, \kaddr
+   dcache_line_size \tmp1, \tmp2
+   sub     \tmp2, \tmp1, #1
+   tst     \size, \tmp2            // end cache line aligned?
+   bic     \size, \size, \tmp2
+   b.eq    9997f
+   dc      civac, \size            // clean & invalidate D / U line
+9997:  tst     \kaddr, \tmp2           // start cache line aligned?
+   bic     \kaddr, \kaddr, \tmp2
+   b.eq    9998f
+   dc      civac, \kaddr           // clean & invalidate D / U line
+   b       9999f
+9998:  dc      ivac, \kaddr            // invalidate D / U line
+9999:  add     \kaddr, \kaddr, \tmp1
+   cmp     \kaddr, \size
+   b.lo    9998b
+   .endm
+
+/*
  * Macro to perform a data cache maintenance for the interval
  * [kaddr, kaddr + size)
  *
@@ -362,7 +389,7 @@ alternative_else
 alternative_endif
.endm
 
-   .macro dcache_by_line_op op, domain, kaddr, size, tmp1, tmp2
+   .macro __dcache_by_line_op op, kaddr, size, tmp1, tmp2
dcache_line_size \tmp1, \tmp2
add \size, \kaddr, \size
sub \tmp2, \tmp1, #1
@@ -388,6 +415,10 @@ alternative_endif
add \kaddr, \kaddr, \tmp1
cmp \kaddr, \size
	b.lo    9998b
+   .endm
+
+   .macro dcache_by_line_op op, domain, kaddr, size, tmp1, tmp2
+   __dcache_by_line_op \op, \kaddr, \size, \tmp1, \tmp2
dsb \domain
.endm
 
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index fb4c275..96bbbf6 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -167,6 +167,19 @@ do {   
\
 
 #include <asm-generic/barrier.h>
 
+#include <linux/dma-direction.h>
+
+static inline void arch_sync_barrier_for_device(enum dma_data_direction dir)
+{
+   dsb(sy);
+}
+
+static inline void arch_sync_barrier_for_cpu(enum dma_data_direction dir)
+{
+   if (dir == DMA_FROM_DEVICE)
+   dsb(sy);
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __ASM_BARRIER_H */
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 2d881f3..7180256 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -138,34 +138,20 @@ SYM_FUNC_END(__clean_dcache_area_pou)
  * - kaddr   - kernel address
  * - size- size in question
  */
-SYM_FUNC_START_LOCAL(__dma_inv_area)
 SYM_FUNC_START_PI(__inval_dcache_area)
-   /* FALLTHROUGH */
+   __dcache_inv_by_line x0, x1, x2, x3
+   dsb sy
+   ret

Re: [PATCH v6 02/15] iommu: Report domain nesting info

2020-08-18 Thread Auger Eric
Hi Jacob,

On 8/18/20 6:21 AM, Jacob Pan wrote:
> On Sun, 16 Aug 2020 14:40:57 +0200
> Auger Eric  wrote:
> 
>> Hi Yi,
>>
>> On 8/14/20 9:15 AM, Liu, Yi L wrote:
>>> Hi Eric,
>>>   
 From: Auger Eric 
 Sent: Thursday, August 13, 2020 8:53 PM

 Yi,
 On 7/28/20 8:27 AM, Liu Yi L wrote:  
> IOMMUs that support nesting translation needs report the
> capability info  
 s/needs/need to  
> to userspace. It gives information about requirements the
> userspace needs to implement plus other features characterizing
> the physical implementation.
>
> This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller
> can get nesting info after setting DOMAIN_ATTR_NESTING. For VFIO,
> it is after selecting VFIO_TYPE1_NESTING_IOMMU.  
 This is not what this patch does ;-) It introduces a new IOMMU UAPI
 struct that gives information about the nesting capabilities and
 features. This struct is supposed to be returned by
 iommu_domain_get_attr() with DOMAIN_ATTR_NESTING attribute
 parameter, one a domain whose type has been set to
 DOMAIN_ATTR_NESTING.  
>>>
>>> got it. let me apply your suggestion. thanks. :-)
>>>   
>
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
> v5 -> v6:
> *) rephrase the feature notes per comments from Eric Auger.
> *) rename @size of struct iommu_nesting_info to @argsz.
>
> v4 -> v5:
> *) address comments from Eric Auger.
>
> v3 -> v4:
> *) split the SMMU driver changes to be a separate patch
> *) move the @addr_width and @pasid_bits from vendor specific
>part to generic part.
> *) tweak the description for the @features field of struct
>iommu_nesting_info.
> *) add description on the @data[] field of struct
> iommu_nesting_info
>
> v2 -> v3:
> *) remove cap/ecap_mask in iommu_nesting_info.
> *) reuse DOMAIN_ATTR_NESTING to get nesting info.
> *) return an empty iommu_nesting_info for SMMU drivers per Jean'
>suggestion.
> ---
>  include/uapi/linux/iommu.h | 74  
 ++  
>  1 file changed, 74 insertions(+)
>
> diff --git a/include/uapi/linux/iommu.h
> b/include/uapi/linux/iommu.h index 7c8e075..5e4745a 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -332,4 +332,78 @@ struct iommu_gpasid_bind_data {
>   } vendor;
>  };
>
> +/*
> + * struct iommu_nesting_info - Information for nesting-capable
> IOMMU.
> + *  userspace should check it
> before using
> + *  nesting capability.
> + *
> + * @argsz:   size of the whole structure.
> + * @flags:   currently reserved for future extension. must
> set to 0.
> + * @format:  PASID table entry format, the same definition
> as struct
> + *   iommu_gpasid_bind_data @format.
> + * @features:supported nesting features.
> + * @addr_width:  The output addr width of first
> level/stage translation
> + * @pasid_bits:  Maximum supported PASID bits, 0
> represents no PASID
> + *   support.
> + * @data:vendor specific cap info. data[] structure type
> can be deduced
> + *   from @format field.
> + *
> + *  
 +===+===
 ===+  
> + * | feature   |
> Notes   |
> + *  
 +===+===
 ===+  
> + * | SYSWIDE_PASID |  IOMMU vendor driver sets it to mandate
> userspace|
> + * |   |  to allocate PASID from kernel. All PASID
> allocation |
> + * |   |  free must be mediated through the TBD
> API.  |  
 s/TBD/IOMMU  
>>>
>>> got it.
>>>   
> + *
> +---+--+
> + * | BIND_PGTBL|  IOMMU vendor driver sets it to mandate
> userspace|
> + * |   |  bind the first level/stage page table to
> associated |  
 s/bind/to bind  
>>>
>>> got it.
>>>   
> + * |   |  PASID (either the one specified in bind
> request or  |
> + * |   |  the default PASID of iommu domain),
> through IOMMU   |
> + * |   |
> UAPI.   |
> + *
> +---+--+
> + * | CACHE_INVLD   |  IOMMU vendor driver sets it to mandate
> userspace|  
  
> + * |   |  

Re: [PATCH v2] powerpc/pseries/svm: Allocate SWIOTLB buffer anywhere in memory

2020-08-18 Thread Christoph Hellwig
On Mon, Aug 17, 2020 at 06:46:58PM -0300, Thiago Jung Bauermann wrote:
> POWER secure guests (i.e., guests which use the Protection Execution
> Facility) need to use SWIOTLB to be able to do I/O with the hypervisor, but
> they don't need the SWIOTLB memory to be in low addresses since the
> hypervisor doesn't have any addressing limitation.
> 
> This solves a SWIOTLB initialization problem we are seeing in secure guests
> with 128 GB of RAM: they are configured with 4 GB of crashkernel reserved
> memory, which leaves no space for SWIOTLB in low addresses.
> 
> To do this, we use mostly the same code as swiotlb_init(), but allocate the
> buffer using memblock_alloc() instead of memblock_alloc_low().
> 
> We also need to add swiotlb_set_no_iotlb_memory() in order to set the
> no_iotlb_memory flag if initialization fails.

Do you really need the helper?  As far as I can tell the secure guests
very much rely on swiotlb for all I/O, so you might as well panic if
you fail to allocate it.
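
That is, something along these lines -- a sketch of the suggested shape
only, reusing the setup from swiotlb_init() (svm_swiotlb_init() is the
function this series adds; the panic message is illustrative):

	void __init svm_swiotlb_init(void)
	{
		char *vstart;
		unsigned long bytes, io_tlb_nslabs;

		io_tlb_nslabs = swiotlb_size_or_default() >> IO_TLB_SHIFT;
		io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
		bytes = io_tlb_nslabs << IO_TLB_SHIFT;

		/* no low-address restriction for secure guests */
		vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
		if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
			return;

		if (vstart)
			memblock_free_early(__pa(vstart), PAGE_ALIGN(bytes));
		panic("SVM: Cannot allocate SWIOTLB buffer");
	}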
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH V2 1/2] Add new flush_iotlb_range and handle freelists when using iommu_unmap_fast

2020-08-18 Thread Tom Murphy
Add a flush_iotlb_range to allow flushing of an iova range instead of a
full flush in the dma-iommu path.

Allow the iommu_unmap_fast to return newly freed page table pages and
pass the freelist to queue_iova in the dma-iommu ops path.

This patch is useful for iommu drivers (in this case the intel iommu
driver) which need to wait for the ioTLB to be flushed before newly
freed/unmapped page table pages can be released. This way we can still batch
ioTLB free operations and handle the freelists.
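
In the dma-iommu path, the resulting flow looks roughly like this (a
sketch; it assumes iommu_unmap_fast() gains the same freelist argument
as the driver callbacks below):

	struct page *freelist = NULL;
	size_t unmapped;

	/* collect page-table pages freed by the unmap rather than
	 * releasing them immediately */
	unmapped = iommu_unmap_fast(domain, iova, size, &gather, &freelist);

	/* defer the free: iommu_dma_entry_dtor() releases the pages only
	 * after the flush queue has flushed the IOTLB for this range */
	queue_iova(iovad, iova_pfn(iovad, iova), size >> iova_shift(iovad),
		   (unsigned long)freelist);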

Change-log:
V2:
-fix missing parameter in mtk_iommu_v1.c

Signed-off-by: Tom Murphy 
---
 drivers/iommu/amd/iommu.c   | 14 -
 drivers/iommu/arm-smmu-v3.c |  3 +-
 drivers/iommu/arm-smmu.c|  3 +-
 drivers/iommu/dma-iommu.c   | 45 ---
 drivers/iommu/exynos-iommu.c|  3 +-
 drivers/iommu/intel/iommu.c | 54 +
 drivers/iommu/iommu.c   | 25 +++
 drivers/iommu/ipmmu-vmsa.c  |  3 +-
 drivers/iommu/msm_iommu.c   |  3 +-
 drivers/iommu/mtk_iommu.c   |  3 +-
 drivers/iommu/mtk_iommu_v1.c|  3 +-
 drivers/iommu/omap-iommu.c  |  3 +-
 drivers/iommu/qcom_iommu.c  |  3 +-
 drivers/iommu/rockchip-iommu.c  |  3 +-
 drivers/iommu/s390-iommu.c  |  3 +-
 drivers/iommu/sun50i-iommu.c|  3 +-
 drivers/iommu/tegra-gart.c  |  3 +-
 drivers/iommu/tegra-smmu.c  |  3 +-
 drivers/iommu/virtio-iommu.c|  3 +-
 drivers/vfio/vfio_iommu_type1.c |  2 +-
 include/linux/iommu.h   | 21 +++--
 21 files changed, 150 insertions(+), 56 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 2f22326ee4df..25fbacab23c3 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2513,7 +2513,8 @@ static int amd_iommu_map(struct iommu_domain *dom, 
unsigned long iova,
 
 static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
  size_t page_size,
- struct iommu_iotlb_gather *gather)
+ struct iommu_iotlb_gather *gather,
+ struct page **freelist)
 {
struct protection_domain *domain = to_pdomain(dom);
struct domain_pgtable pgtable;
@@ -2636,6 +2637,16 @@ static void amd_iommu_flush_iotlb_all(struct 
iommu_domain *domain)
	spin_unlock_irqrestore(&dom->lock, flags);
 }
 
+static void amd_iommu_flush_iotlb_range(struct iommu_domain *domain,
+   unsigned long iova, size_t size,
+   struct page *freelist)
+{
+   struct protection_domain *dom = to_pdomain(domain);
+
+   domain_flush_pages(dom, iova, size);
+   domain_flush_complete(dom);
+}
+
 static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
 struct iommu_iotlb_gather *gather)
 {
@@ -2675,6 +2686,7 @@ const struct iommu_ops amd_iommu_ops = {
.is_attach_deferred = amd_iommu_is_attach_deferred,
.pgsize_bitmap  = AMD_IOMMU_PGSIZES,
.flush_iotlb_all = amd_iommu_flush_iotlb_all,
+   .flush_iotlb_range = amd_iommu_flush_iotlb_range,
.iotlb_sync = amd_iommu_iotlb_sync,
.def_domain_type = amd_iommu_def_domain_type,
 };
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..8d328dc25326 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2854,7 +2854,8 @@ static int arm_smmu_map(struct iommu_domain *domain, 
unsigned long iova,
 }
 
 static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
-size_t size, struct iommu_iotlb_gather *gather)
+size_t size, struct iommu_iotlb_gather *gather,
+struct page **freelist)
 {
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 243bc4cb2705..0cd0dfc89875 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1234,7 +1234,8 @@ static int arm_smmu_map(struct iommu_domain *domain, 
unsigned long iova,
 }
 
 static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
-size_t size, struct iommu_iotlb_gather *gather)
+size_t size, struct iommu_iotlb_gather *gather,
+struct page **freelist)
 {
struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 4959f5df21bd..7433f74d921a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -50,6 +50,19 @@ struct iommu_dma_cookie {
struct iommu_domain *fq_domain;
 };
 
+
+static void iommu_dma_entry_dtor(unsigned 

[PATCH V2 2/2] Handle init_iova_flush_queue failure in dma-iommu path

2020-08-18 Thread Tom Murphy
init_iova_flush_queue can fail if we run out of memory. Fall back to no
flush queue if it fails.

Signed-off-by: Tom Murphy 
---
 drivers/iommu/dma-iommu.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7433f74d921a..5445e2be08b5 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -356,9 +356,11 @@ static int iommu_dma_init_domain(struct iommu_domain 
*domain, dma_addr_t base,
 
if (!cookie->fq_domain && !iommu_domain_get_attr(domain,
		DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, &attr) && attr) {
-   cookie->fq_domain = domain;
-   init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all,
-   iommu_dma_entry_dtor);
+   if (init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all,
+   iommu_dma_entry_dtor))
+   pr_warn("iova flush queue initialization failed\n");
+   else
+   cookie->fq_domain = domain;
}
 
if (!dev)
-- 
2.20.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu