Re: [PATCH 3/3] powerpc: Set crashkernel offset to mid of RMA region

2021-10-04 Thread Aneesh Kumar K.V

On 10/4/21 20:41, Sourabh Jain wrote:

On large config LPARs (having 192 or more cores), Linux fails to boot
due to insufficient memory in the first memory block. This is because the
reserved crashkernel area starts at a 128MB offset by default, which
doesn't leave enough space in the first memory block to accommodate
memory for other essential system resources.

Given that the RMA region size can be 512MB or more, setting the
crashkernel offset to the middle of the RMA will leave enough space for
the kernel to allocate memory for other system resources in the first
memory block.

Signed-off-by: Sourabh Jain 
Reported-and-tested-by: Abdul haleem 
---
  arch/powerpc/kernel/rtas.c |  3 +++
  arch/powerpc/kexec/core.c  | 13 +
  2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index ff80bbad22a5..ce5e62bb4d8e 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1235,6 +1235,9 @@ int __init early_init_dt_scan_rtas(unsigned long node,
entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
sizep  = of_get_flat_dt_prop(node, "rtas-size", NULL);
  
+	if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))

+   powerpc_firmware_features |= FW_FEATURE_LPAR;
+


The equivalent check that we currently have does more than just check
ibm,hypertas-functions.


if (!strcmp(uname, "rtas") || !strcmp(uname, "rtas@0")) {
prop = of_get_flat_dt_prop(node, "ibm,hypertas-functions",
   &len);
if (prop) {
powerpc_firmware_features |= FW_FEATURE_LPAR;
fw_hypertas_feature_init(prop, len);
}


Also, do we expect other firmware features to be set along with
FW_FEATURE_LPAR?



if (basep && entryp && sizep) {
rtas.base = *basep;
rtas.entry = *entryp;
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 48525e8b5730..f69cf3e370ec 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -147,11 +147,16 @@ void __init reserve_crashkernel(void)
if (!crashk_res.start) {
  #ifdef CONFIG_PPC64
/*
-* On 64bit we split the RMO in half but cap it at half of
-* a small SLB (128MB) since the crash kernel needs to place
-* itself and some stacks to be in the first segment.
+* crash kernel needs to be placed in the first segment. On LPAR
+* setting crash kernel start to mid of RMA size (512MB or more)
+* would help primary kernel to boot properly on large config
+* LPAR (with core count 192 or more) and for the rest keep
+* the crash kernel start capped at 128MB offset.
 */
-   crashk_res.start = min(0x800ULL, (ppc64_rma_size / 2));
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   crashk_res.start = ppc64_rma_size / 2;
+   else
+   crashk_res.start = min(0x800ULL, (ppc64_rma_size / 2));
  #else
crashk_res.start = KDUMP_KERNELBASE;
  #endif
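
For readers following the thread, the selection logic in the hunk above can
be modelled in plain userspace C. This is only a sketch under stated
assumptions, not the kernel code: pick_crashk_start() and CRASHK_START_CAP
are hypothetical names, the is_lpar flag stands in for
firmware_has_feature(FW_FEATURE_LPAR), and the 128MB cap is taken from the
patch comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 128MB cap mentioned in the patch comment; the macro name is hypothetical. */
#define CRASHK_START_CAP (128ULL << 20)

/*
 * Model of the crash kernel start selection.
 * is_lpar stands in for firmware_has_feature(FW_FEATURE_LPAR).
 */
static uint64_t pick_crashk_start(uint64_t rma_size, bool is_lpar)
{
	uint64_t half = rma_size / 2;

	assert(rma_size > 0);

	/* On LPAR the start is the middle of the RMA, with no cap. */
	if (is_lpar)
		return half;

	/* Otherwise keep the old behaviour: half the RMA, capped at 128MB. */
	return half < CRASHK_START_CAP ? half : CRASHK_START_CAP;
}
```

With a 512MB RMA this model places the crash kernel at 256MB on LPAR and at
128MB elsewhere, which is the difference the commit message relies on.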





Re: [PATCH 1/3] fixup mmu_features immediately after getting cpu pa features.

2021-10-04 Thread Aneesh Kumar K.V

On 10/4/21 20:41, Sourabh Jain wrote:

From: Mahesh Salgaonkar 

On systems with radix support available, early_radix_enabled() returns
true for a small window (until mmu_early_init_devtree() is called) even
when radix mode is disabled on the kernel command line. This causes
ppc64_bolted_size() to return ULONG_MAX in HPT mode instead of the supported
segment size during boot cpu paca allocation.

With kernel command line = "... disable_radix":

early_init_devtree:              <- early_radix_enabled() = false
   early_init_dt_scan_cpus:      <- early_radix_enabled() = false
   ...
   check_cpu_pa_features:        <- early_radix_enabled() = false
   ...                        ^  <- early_radix_enabled() = TRUE
   allocate_paca:             |  <- early_radix_enabled() = TRUE
   ...                        |
   ppc64_bolted_size:         |  <- early_radix_enabled() = TRUE
   if (early_radix_enabled()) |  <- early_radix_enabled() = TRUE
   return ULONG_MAX;          |
   ...                        |
   ...                        |  <- early_radix_enabled() = TRUE
   ...                        |  <- early_radix_enabled() = TRUE
   mmu_early_init_devtree()   V
   ...                           <- early_radix_enabled() = false

So far we have not seen any issue because allocate_paca() takes the minimum of
ppc64_bolted_size and rma_size while allocating the paca. However, it is better
to close this window by fixing up the mmu features as early as possible.
This makes early_radix_enabled() and ppc64_bolted_size() return valid
values when radix is disabled. This patch also lets a subsequent patch
depend on the early_radix_enabled() check while detecting the supported segment
size in HPT mode.

Signed-off-by: Mahesh Salgaonkar 
Signed-off-by: Sourabh Jain 
Reported-and-tested-by: Abdul haleem 
---
  arch/powerpc/include/asm/book3s/64/mmu.h | 1 +
  arch/powerpc/include/asm/mmu.h   | 1 +
  arch/powerpc/kernel/prom.c   | 1 +
  arch/powerpc/mm/init_64.c| 5 -
  4 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index c02f42d1031e..69a89fa1330d 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -197,6 +197,7 @@ extern int mmu_vmemmap_psize;
  extern int mmu_io_psize;
  
  /* MMU initialization */

+void mmu_cpu_feature_fixup(void);
  void mmu_early_init_devtree(void);
  void hash__early_init_devtree(void);
  void radix__early_init_devtree(void);
diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index 8abe8e42e045..c8eafd401fe9 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -401,6 +401,7 @@ extern void early_init_mmu(void);
  extern void early_init_mmu_secondary(void);
  extern void setup_initial_memory_limit(phys_addr_t first_memblock_base,
   phys_addr_t first_memblock_size);
+static inline void mmu_cpu_feature_fixup(void) { }
  static inline void mmu_early_init_devtree(void) { }
  
  static inline void pkey_early_init_devtree(void) {}

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 2e67588f6f6e..1727a3abe6c1 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -380,6 +380,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
check_cpu_pa_features(node);
}
  
+	mmu_cpu_feature_fixup();


Can you do that call inside check_cpu_pa_features()? Or is it because we
have the same issue with baremetal platforms?


Can we also rename this to indicate that we are sanitizing the feature flag
based on the kernel command line? Something like

/* Update cpu features based on kernel command line */
update_cpu_features();


identical_pvr_fixup(node);
init_mmu_slb_size(node);
  
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c

index 386be136026e..9ed452605a2c 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -437,12 +437,15 @@ static void __init early_check_vec5(void)
}
  }
  
-void __init mmu_early_init_devtree(void)

+void __init mmu_cpu_feature_fixup(void)
  {
/* Disable radix mode based on kernel command line. */
if (disable_radix)
cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
+}
  
+void __init mmu_early_init_devtree(void)

+{
/*
 * Check /chosen/ibm,architecture-vec-5 if running as a guest.
 * When running bare-metal, we can use radix if we like





[PATCH] powerpc/doc: Fix htmldocs errors

2021-08-24 Thread Aneesh Kumar K.V
Fix make htmldocs related errors with the newly added associativity.rst
doc file.

Reported-by: Stephen Rothwell 
Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/powerpc/associativity.rst | 29 +
 Documentation/powerpc/index.rst |  1 +
 2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
index 07e7dd3d6c87..4d01c7368561 100644
--- a/Documentation/powerpc/associativity.rst
+++ b/Documentation/powerpc/associativity.rst
@@ -1,6 +1,6 @@
 
 NUMA resource associativity
-=
+
 
 Associativity represents the groupings of the various platform resources into
 domains of substantially similar mean performance relative to resources outside
@@ -20,11 +20,11 @@ A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativi
 bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
 
 Form 0
--
+--
 Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
 
 Form 1
--
+--
 With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
 device tree properties are used to determine the NUMA distance between resource groups/domains.
 
@@ -78,17 +78,18 @@ numa-lookup-index-table.
 
 For ex:
 ibm,numa-lookup-index-table = <3 0 8 40>;
-ibm,numa-distace-table = <9>, /bits/ 8 < 10  20  80
-20  10 160
-80 160  10>;
-  | 08   40
---|
-  |
-0 | 10   20  80
-  |
-8 | 20   10  160
-  |
-40| 80   160  10
+ibm,numa-distace-table = <9>, /bits/ 8 < 10  20  80 20  10 160 80 160  10>;
+
+::
+
+ | 08   40
+   --|
+ |
+   0 | 10   20  80
+ |
+   8 | 20   10  160
+ |
+   40| 80   160  10
 
 A possible "ibm,associativity" property for resources in node 0, 8 and 40
 
diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index bf5f1a2bdbdf..0f7d3c495693 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -7,6 +7,7 @@ powerpc
 .. toctree::
 :maxdepth: 1
 
+associativity
 booting
 bootwrapper
 cpu_families
-- 
2.31.1
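
As an aside for readers of the example table in this patch, the Form 2
lookup can be modelled in userspace C as follows. This is a hedged sketch,
not the kernel's parser: domain_offset() and form2_distance() are
hypothetical names, the arrays copy the example values
(ibm,numa-lookup-index-table = <3 0 8 40> without the leading count, and the
nine distance bytes), and a row-major layout of the distance table is
assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Example values from the document; the leading counts are dropped. */
static const uint32_t lookup_index[] = { 0, 8, 40 };
static const uint8_t distance_table[] = {
	10,  20,  80,
	20,  10, 160,
	80, 160,  10,
};

#define NR_DOMAINS (sizeof(lookup_index) / sizeof(lookup_index[0]))

/* Return the offset of domain_id in the lookup table, or -1 if absent. */
static int domain_offset(uint32_t domain_id)
{
	for (size_t i = 0; i < NR_DOMAINS; i++) {
		if (lookup_index[i] == domain_id)
			return (int)i;
	}
	return -1;
}

/* Distance between two domains, or -1 if either domain is unknown. */
static int form2_distance(uint32_t from, uint32_t to)
{
	int row = domain_offset(from);
	int col = domain_offset(to);

	/* The table must hold one entry per pair of domains. */
	assert(sizeof(distance_table) == NR_DOMAINS * NR_DOMAINS);

	if (row < 0 || col < 0)
		return -1;

	/* Assumed row-major layout: row is the source, column the target. */
	return distance_table[(size_t)row * NR_DOMAINS + (size_t)col];
}
```

For instance, form2_distance(8, 40) picks row 1, column 2 of the table and
yields 160, matching the row labelled 8 and the column labelled 40 in the
example.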



Re: linux-next: build warning after merge of the powerpc tree

2021-08-23 Thread Aneesh Kumar K.V
Stephen Rothwell  writes:

> Hi all,
>
> [cc'ing Jon in case he can fix the sphinx hang - or knows anything about it]
>
> On Mon, 23 Aug 2021 19:55:40 +1000 Stephen Rothwell wrote:
>>
>> After merging the powerpc tree, today's linux-next build (htmldocs)
>> produced this warning:
>> 
>
> I missed a line:
>
> Sphinx parallel build error:
>
>> docutils.utils.SystemMessage: Documentation/powerpc/associativity.rst:1: (SEVERE/4) Title overline & underline mismatch.
>> 
>> 
>> NUMA resource associativity
>> =
>> 
>> Introduced by commit
>> 
>>   1c6b5a7e7405 ("powerpc/pseries: Add support for FORM2 associativity")
>> 
>> There are other obvious problems with this document (but sphinx seems
>> to have hung before it reported them).
>> 
>> Like
>> 
>> Form 0
>> -
>> 
>> and
>> 
>> Form 1
>> -
>> 
>> and
>> 
>> Form 2
>> ---
>
> I also get the following warning:
>
> Documentation/powerpc/associativity.rst: WARNING: document isn't included in any toctree
>
> And applying the following patch is enough to allow sphinx to finish
> (rather than livelocking):
>
> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> index 07e7dd3d6c87..b77c6ccbd6cb 100644
> --- a/Documentation/powerpc/associativity.rst
> +++ b/Documentation/powerpc/associativity.rst
> @@ -1,6 +1,6 @@
> -
> +===
>  NUMA resource associativity
> -=
> +===
>  
>  Associativity represents the groupings of the various platform resources into
>  domains of substantially similar mean performance relative to resources outside
> @@ -20,11 +20,11 @@ A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativi
>  bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>  
>  Form 0
> --
> +--
>  Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
>  
>  Form 1
> --
> +--
>  With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
>  device tree properties are used to determine the NUMA distance between resource groups/domains.
>  
> @@ -45,7 +45,7 @@ level of the resource group, the kernel doubles the NUMA distance between the
>  comparing domains.
>  
>  Form 2
> 
> +--
>  Form 2 associativity format adds separate device tree properties representing NUMA node distance
>  thereby making the node distance computation flexible. Form 2 also allows flexible primary
>  domain numbering. With numa distance computation now detached from the index value in

Thanks for looking into this. I guess we also need to format the below table?

  |  0    8   40
--|-------------
  |
0 | 10   20   80
  |
8 | 20   10  160
  |
40| 80  160   10


I don't know how to represent that in the documentation file. A table is
probably not the right choice?

-aneesh


[PATCH v2 1/2] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry

2021-08-12 Thread Aneesh Kumar K.V
Similar to x86/s390, add a debugfs file to tune tlb_single_page_flush_ceiling.
Also add a debugfs entry for tlb_local_single_page_flush_ceiling.

Signed-off-by: Aneesh Kumar K.V 
---
Changes from v1:
* switch to debugfs_create_u32

 arch/powerpc/mm/book3s64/radix_tlb.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..1fa2bc6a969e 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -1106,8 +1107,8 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
  * invalidating a full PID, so it has a far lower threshold to change from
  * individual page flushes to full-pid flushes.
  */
-static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
-static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
+static u32 tlb_single_page_flush_ceiling __read_mostly = 33;
+static u32 tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
 
 static inline void __radix__flush_tlb_range(struct mm_struct *mm,
unsigned long start, unsigned long end)
@@ -1524,3 +1525,14 @@ void do_h_rpt_invalidate_prt(unsigned long pid, unsigned long lpid,
 EXPORT_SYMBOL_GPL(do_h_rpt_invalidate_prt);
 
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
+
+static int __init create_tlb_single_page_flush_ceiling(void)
+{
+   debugfs_create_u32("tlb_single_page_flush_ceiling", 0600,
+  powerpc_debugfs_root, &tlb_single_page_flush_ceiling);
+   debugfs_create_u32("tlb_local_single_page_flush_ceiling", 0600,
+  powerpc_debugfs_root, &tlb_local_single_page_flush_ceiling);
+   return 0;
+}
+late_initcall(create_tlb_single_page_flush_ceiling);
+
-- 
2.31.1
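
For context on what the tunable controls, the comment in the hunk above
describes a threshold for switching from individual page flushes to a
full-PID flush. A minimal userspace sketch of that decision follows. It is
an illustration only: use_full_pid_flush() is a hypothetical name, the
"greater than" comparison is an assumption about how the ceiling is applied,
and only the default value 33 comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Default from the patch; in the kernel this becomes tunable via debugfs. */
static uint32_t flush_ceiling = 33;

/*
 * Decide between per-page invalidations and a full-PID flush for the
 * range [start, end). Assumes the switch happens once the page count
 * exceeds the ceiling.
 */
static bool use_full_pid_flush(unsigned long start, unsigned long end,
			       unsigned int page_shift)
{
	unsigned long nr_pages;

	assert(end >= start);

	nr_pages = (end - start) >> page_shift;
	return nr_pages > flush_ceiling;
}
```

Under this model, raising the ceiling keeps more ranges on the per-page
path, and lowering it makes full flushes kick in sooner, which is the
trade-off the debugfs entry lets an admin experiment with.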



[PATCH v2 2/2] powerpc: rename powerpc_debugfs_root to arch_debugfs_dir

2021-08-12 Thread Aneesh Kumar K.V
No functional change in this patch. arch_debugfs_dir is the generic kernel
name declared in linux/debugfs.h for the arch-specific debugfs directory.
Architectures like x86/s390 already use the name. Rename the powerpc-specific
powerpc_debugfs_root to arch_debugfs_dir.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/debugfs.h  | 13 -
 arch/powerpc/kernel/Makefile|  3 ++-
 arch/powerpc/kernel/dawr.c  |  3 +--
 arch/powerpc/kernel/eeh.c   | 16 
 arch/powerpc/kernel/eeh_cache.c |  4 ++--
 arch/powerpc/kernel/fadump.c|  4 ++--
 arch/powerpc/kernel/hw_breakpoint.c |  1 -
 arch/powerpc/kernel/kdebugfs.c  | 14 ++
 arch/powerpc/kernel/security.c  | 16 
 arch/powerpc/kernel/setup-common.c  | 13 -
 arch/powerpc/kernel/setup_64.c  |  1 -
 arch/powerpc/kernel/traps.c |  4 ++--
 arch/powerpc/kvm/book3s_xics.c  |  6 +++---
 arch/powerpc/kvm/book3s_xive.c  |  3 +--
 arch/powerpc/kvm/book3s_xive_native.c   |  3 +--
 arch/powerpc/mm/book3s64/hash_utils.c   |  4 ++--
 arch/powerpc/mm/book3s64/pgtable.c  |  4 ++--
 arch/powerpc/mm/book3s64/radix_tlb.c|  6 +++---
 arch/powerpc/mm/ptdump/bats.c   |  4 ++--
 arch/powerpc/mm/ptdump/segment_regs.c   |  4 ++--
 arch/powerpc/platforms/cell/axon_msi.c  |  4 ++--
 arch/powerpc/platforms/powernv/memtrace.c   |  3 +--
 arch/powerpc/platforms/powernv/opal-imc.c   |  4 ++--
 arch/powerpc/platforms/powernv/opal-lpc.c   |  4 ++--
 arch/powerpc/platforms/powernv/opal-xscom.c |  4 ++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  4 ++--
 arch/powerpc/platforms/pseries/dtl.c|  4 ++--
 arch/powerpc/platforms/pseries/lpar.c   |  5 +++--
 arch/powerpc/sysdev/xive/common.c   |  3 +--
 arch/powerpc/xmon/xmon.c|  6 +++---
 30 files changed, 75 insertions(+), 92 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/debugfs.h
 create mode 100644 arch/powerpc/kernel/kdebugfs.c

diff --git a/arch/powerpc/include/asm/debugfs.h b/arch/powerpc/include/asm/debugfs.h
deleted file mode 100644
index 2c5c48571d75..
--- a/arch/powerpc/include/asm/debugfs.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifndef _ASM_POWERPC_DEBUGFS_H
-#define _ASM_POWERPC_DEBUGFS_H
-
-/*
- * Copyright 2017, Michael Ellerman, IBM Corporation.
- */
-
-#include 
-
-extern struct dentry *powerpc_debugfs_root;
-
-#endif /* _ASM_POWERPC_DEBUGFS_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index f66b63e81c3b..7be36c1e1db6 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -46,7 +46,8 @@ obj-y := cputable.o syscalls.o \
   prom.o traps.o setup-common.o \
   udbg.o misc.o io.o misc_$(BITS).o \
   of_platform.o prom_parse.o firmware.o \
-  hw_breakpoint_constraints.o interrupt.o
+  hw_breakpoint_constraints.o interrupt.o \
+  kdebugfs.o
 obj-y  += ptrace/
 obj-$(CONFIG_PPC64)+= setup_64.o \
   paca.o nvram_64.o note.o
diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c
index cdc2dccb987d..64e423d2fe0f 100644
--- a/arch/powerpc/kernel/dawr.c
+++ b/arch/powerpc/kernel/dawr.c
@@ -9,7 +9,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -101,7 +100,7 @@ static int __init dawr_force_setup(void)
if (PVR_VER(mfspr(SPRN_PVR)) == PVR_POWER9) {
/* Turn DAWR off by default, but allow admin to turn it on */
debugfs_create_file_unsafe("dawr_enable_dangerous", 0600,
-  powerpc_debugfs_root,
+  arch_debugfs_dir,
   _force_enable,
   _enable_fops);
}
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 3bbdcc86d01b..e9b597ed423c 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -21,9 +21,9 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -1901,24 +1901,24 @@ static int __init eeh_init_proc(void)
proc_create_single("powerpc/eeh", 0, NULL, proc_eeh_show);
 #ifdef CONFIG_DEBUG_FS
debugfs_create_file_unsafe("eeh_enable", 0600,
-  powerpc_debugfs_root, NULL,
+  arch_debugfs_dir, NULL,
   _enable_dbgfs_ops);

[PATCH v8 5/5] powerpc/pseries: Add support for FORM2 associativity

2021-08-12 Thread Aneesh Kumar K.V
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/powerpc/associativity.rst   | 104 
 arch/powerpc/include/asm/firmware.h   |   3 +-
 arch/powerpc/include/asm/prom.h   |   1 +
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 187 ++
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 262 insertions(+), 37 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..07e7dd3d6c87
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,104 @@
+
+NUMA resource associativity
+=
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the oldest format and is now considered deprecated.
+
+Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+-
+With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
+device tree properties are used to determine the NUMA distance between resource groups/domains.
+
+The “ibm,associativity” property contains a list of one or more numbers (domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains a list of one or more numbers
+(domainID index) that represents the 1 based ordinal in the associativity lists.
+The list of domainID indexes represents an increasing hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
+Linux kernel computes NUMA distance between two domains by recursively comparing
+if they belong to the same higher-level domains. For mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+---
+Form 2 associativity format adds separate device tree properties representing NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows flexible primary
+domain numbering. With numa distance computation now detached from the index value in
+"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
+ids at the same domainID index representing resource groups of different performance/latency
+characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
+the domainIDs present in the system. The offset of the domainID in this property is
+used as an index while computing numa distance information via "ibm,numa-distance-table".
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
+N domainID encoded as with encode-int
+
+For ex:
+"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
+computing the distance of domain 8 from other domains present in the system. For the rest of
+this document, this offset will be referred to as domain distance offset.
+
+"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
+distance between resource groups/domains present in the syst

[PATCH v8 4/5] powerpc/pseries: Add a helper for form1 cpu distance

2021-08-12 Thread Aneesh Kumar K.V
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining that we don't expect
the code to be called with FORM0.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/topology.h   |  4 ++--
 arch/powerpc/mm/numa.c| 10 +-
 arch/powerpc/platforms/pseries/lpar.c |  4 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index a6425a70c37b..a4712ecad3e9 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 cpu_all_mask : \
 cpumask_of_node(pcibus_to_node(bus)))
 
-extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
@@ -84,7 +84,7 @@ static inline void sysfs_remove_device_from_node(struct device *dev,
 
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
 
-static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
return 0;
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c0a89839310c..fdb2472befa4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
return dist;
 }
 
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+   /* We should not get called with FORM0 */
+   VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+   return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index dab356e3ff87..afefbdfe768d 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int last_disp_cpu, int cur_disp_cpu)
if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
return -EIO;
 
-   return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
+   return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
 }
 
 static int cpu_home_node_dispatch_distance(int disp_cpu)
@@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
if (!disp_cpu_assoc || !vcpu_assoc)
return -EIO;
 
-   return cpu_distance(disp_cpu_assoc, vcpu_assoc);
+   return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
 }
 
 static void update_vcpu_disp_stat(int disp_cpu)
-- 
2.31.1
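
As background for the Form 1 helper, the associativity document in this
series states that for Form 1 the kernel doubles the NUMA distance for each
mismatched level of the resource group hierarchy. A minimal userspace sketch
of that rule follows. It is not the kernel function from this patch:
form1_distance() and LOCAL_DISTANCE_MODEL are hypothetical names, the base
distance of 10 is an assumption, and each array is assumed to hold the
domain IDs already picked out at the reference points, ordered from the
primary domain upward.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed base distance for two resources in the same primary domain. */
#define LOCAL_DISTANCE_MODEL 10

/*
 * Walk the reference-point domain IDs of two resources and double the
 * distance for every level at which they differ, stopping at the first
 * level where they share a domain.
 */
static int form1_distance(const uint32_t *a, const uint32_t *b, size_t depth)
{
	int distance = LOCAL_DISTANCE_MODEL;

	assert(a != NULL && b != NULL);

	for (size_t i = 0; i < depth; i++) {
		if (a[i] == b[i])
			break;
		distance *= 2;
	}
	return distance;
}
```

In this model two resources that differ only in the primary domain end up at
twice the base distance, and each additional mismatched level doubles it
again.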



[PATCH v8 3/5] powerpc/pseries: Consolidate different NUMA distance update code paths

2021-08-12 Thread Aneesh Kumar K.V
The associativity details of newly added resources are collected from
the hypervisor via the "ibm,configure-connector" rtas call. Update the numa
distance details of the newly added numa node after the above call.

Instead of updating the NUMA distance every time we look up a node id
from the associativity property, add helpers that can be used
during boot to do this only once. Also remove the distance
update from the node id lookup helpers.

Currently, we duplicate parsing code for ibm,associativity and
ibm,associativity-lookup-arrays in the kernel. The associativity arrays provided
by these device tree properties are very similar and hence can use
a helper to parse the node id and numa distance details.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/topology.h   |   2 +
 arch/powerpc/mm/numa.c| 212 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 4 files changed, 161 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..a6425a70c37b 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -64,6 +64,7 @@ static inline int early_cpu_to_node(int cpu)
 }
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
+void update_numa_distance(struct device_node *node);
 
 #else
 
@@ -93,6 +94,7 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return first_online_node;
 }
 
+static inline void update_numa_distance(struct device_node *node) {}
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 368719b14dcc..c0a89839310c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,50 +208,35 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-   const __be32 *associativity)
+static int __associativity_to_nid(const __be32 *associativity,
+ int max_array_sz)
 {
-   int i;
+   int nid;
+   /*
+* primary_domain_index is 1 based array index.
+*/
+   int index = primary_domain_index  - 1;
 
-   if (affinity_form != FORM1_AFFINITY)
-   return;
+   if (!numa_enabled || index >= max_array_sz)
+   return NUMA_NO_NODE;
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
-   const __be32 *entry;
+   nid = of_read_number(&associativity[index], 1);
 
-   entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
-   distance_lookup_table[nid][i] = of_read_number(entry, 1);
-   }
+   /* POWER4 LPAR uses 0x as invalid node */
+   if (nid == 0x || nid >= nr_node_ids)
+   nid = NUMA_NO_NODE;
+   return nid;
 }
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
  */
 static int associativity_to_nid(const __be32 *associativity)
 {
-   int nid = NUMA_NO_NODE;
-
-   if (!numa_enabled)
-   goto out;
-
-   if (of_read_number(associativity, 1) >= primary_domain_index)
-   nid = of_read_number(&associativity[primary_domain_index], 1);
-
-   /* POWER4 LPAR uses 0x as invalid node */
-   if (nid == 0x || nid >= nr_node_ids)
-   nid = NUMA_NO_NODE;
-
-   if (nid > 0 &&
-   of_read_number(associativity, 1) >= distance_ref_points_depth) {
-   /*
-* Skip the length field and send start of associativity array
-*/
-   initialize_distance_lookup_table(nid, associativity + 1);
-   }
+   int array_sz = of_read_number(associativity, 1);
 
-out:
-   return nid;
+   /* Skip the first element in the associativity array */
+   return __associativity_to_nid((associativity + 1), array_sz);
 }
 
 /* Returns the nid associated with the given device tree node,
@@ -287,6 +272,60 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity,
+int max_array_sz)
+{
+   int i, nid;
+
+   if (affinity_form != FORM1_AFFINITY)
+   return;
+
+   nid = __associativity_to_nid(associativity, max_array_sz);
+   if (nid != NUMA_NO_NODE) {
+   for (i = 0; i < distance_ref_points_depth; i++) {
+   const __be32 *entry;
+   int index = be32_to_cpu(distance_ref_points[i]) - 1;
+
+   /*
+* broken hierarchy, return with broken distance table
+*/
+   if (WARN(index >= max_array_sz, "Broken 
ibm,associ

[PATCH v8 2/5] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY

2021-08-12 Thread Aneesh Kumar K.V
Also make related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/firmware.h   |  4 +--
 arch/powerpc/include/asm/prom.h   |  2 +-
 arch/powerpc/kernel/prom_init.c   |  2 +-
 arch/powerpc/mm/numa.c| 35 ++-
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPAL ASM_CONST(0x1000)
 #define FW_FEATURE_SET_MODE ASM_CONST(0x4000)
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRN ASM_CONST(0x0002)
 #define FW_FEATURE_DRMEM_V2 ASM_CONST(0x0004)
 #define FW_FEATURE_DRC_INFO ASM_CONST(0x0008)
@@ -69,7 +69,7 @@ enum {
FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI0x0201  /* PCIe/MSI support */
 #define OV5_CMO0x0480  /* Cooperative Memory 
Overcommitment */
 #define OV5_XCMO   0x0440  /* Page Coalescing */
-#define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY 0x0580  /* FORM1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
 #define OV5_HP_EVT 0x0604  /* Hot Plug Event support */
 #define OV5_RESIZE_HPT 0x0601  /* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index a5bf355ce1d6..57db605ad33a 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1096,7 +1096,7 @@ static const struct ibm_arch_vec 
ibm_architecture_vec_template __initconst = {
 #else
0,
 #endif
-   .associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
+   .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
.micro_checkpoint = 0,
.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..368719b14dcc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int distance_ref_points_depth;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
int i;
int distance = LOCAL_DISTANCE;
 
-   if (!form1_affinity)
+   if (affinity_form == FORM0_AFFINITY)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
for (i = 0; i < distance_ref_points_depth; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
int i;
 
-   if (!form1_affinity)
+   if (affinity_form != FORM1_AFFINITY)
return;
 
for (i = 0; i < distance_ref_points_depth; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
int index;
struct device_node *root;
 
+   /*
+* Check for which form of affinity.
+*/
+   if (firmware_has_feature(FW_FEATURE_OPAL)) {
+   affinity_form = FORM1_AFFINITY;
+   } else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+   dbg("Using form 1 affinity\n");
+   affinity_form = FORM1_AFFINITY;
+   } else
+   affinity_form = FORM0_AFFINITY;
+
if (firmware_has_feature(FW_FEATURE_OPAL))
root = of_find_node_by_path("/ibm,opal");
else
@@

[PATCH v8 1/5] powerpc/pseries: rename min_common_depth to primary_domain_index

2021-08-12 Thread Aneesh Kumar K.V
No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
if (!numa_enabled)
goto out;
 
-   if (of_read_number(associativity, 1) >= min_common_depth)
-   nid = of_read_number(&associativity[min_common_depth], 1);
+   if (of_read_number(associativity, 1) >= primary_domain_index)
+   nid = of_read_number(&associativity[primary_domain_index], 1);
 
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-   int depth;
+   int index;
struct device_node *root;
 
if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
}
 
if (form1_affinity) {
-   depth = of_read_number(distance_ref_points, 1);
+   index = of_read_number(distance_ref_points, 1);
} else {
if (distance_ref_points_depth < 2) {
printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
goto err;
}
 
-   depth = of_read_number(&distance_ref_points[1], 1);
+   index = of_read_number(&distance_ref_points[1], 1);
}
 
/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
}
 
of_node_put(root);
-   return depth;
+   return index;
 
 err:
of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
int nid = default_nid;
int rc, index;
 
-   if ((min_common_depth < 0) || !numa_enabled)
+   if ((primary_domain_index < 0) || !numa_enabled)
return default_nid;
 
rc = of_get_assoc_arrays();
if (rc)
return default_nid;
 
-   if (min_common_depth <= aa.array_sz &&
+   if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
nid = of_read_number(&aa.arrays[index], 1);
 
if (nid == 0x || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
return -1;
}
 
-   min_common_depth = find_min_common_depth();
+   primary_domain_index = find_primary_domain_index();
 
-   if (min_common_depth < 0) {
+   if (primary_domain_index < 0) {
/*
-* if we fail to parse min_common_depth from device tree
+* if we fail to parse primary_domain_index from device tree
 * mark the numa disabled, boot with numa disabled.
 */
numa_enabled = false;
-   return min_common_depth;
+   return primary_domain_index;
}
 
-   dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+   dbg("NUMA associativity depth for CPU/Memory: %d\n", 
primary_domain_index);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
goto out;
}
 
-   max_nodes = of_read_number([min_common_depth], 1);
+   max_nodes = of_read_number([primary_domain_index], 1);
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
}
 
prop_length /= sizeof(int);
-   if (prop_length > min_common_depth + 2)
+   if (prop_length > primary_domain_index + 2)
coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
goto out;
 
index = of_read_number(associativity, 1);
-   if (index > min_common_depth + 1)
+   if (index > primary_domain_index + 1)
return of_read_number([index - 1], 1);
 
 out:
-- 
2.31.1



[PATCH v8 0/5] Add support for FORM2 associativity

2021-08-12 Thread Aneesh Kumar K.V
Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 7.

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$

After DAX kmem memory add
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  222  33  240 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  230 
  4:  255  255  255  230  10 


PAPR SCM now uses the NUMA distance details to find the numa_node and target_node
for the device.

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v 
[
  {
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":1071644672,
"uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
"raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
"daxregion":{
  "id":0,
  "size":1071644672,
  "devices":[
{
  "chardev":"dax0.0",
  "size":1071644672,
  "target_node":4,
  "mode":"devdax"
}
  ]
},
"align":2097152,
"numa_node":3
  }
]
kvaneesh@ubuntu-guest:~$ 
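
As a rough illustration of how a nearest node can be derived from a distance
table like the one above, here is a minimal user-space C sketch. It is not the
kernel's implementation: nearest_other_node() and the hard-coded table are
hypothetical and only mirror the "After DAX kmem memory add" numactl output.
With that table, node 4 resolves to node 3 (distance 230), which is consistent
with the target_node/numa_node pair reported by ndctl.

```c
#include <limits.h>

#define NR_NODES 5

/* Distances copied from the "After DAX kmem memory add" numactl output. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{  10,  11, 222,  33, 240 },
	{  44,  10,  55,  66, 255 },
	{  77,  88,  10,  99, 255 },
	{ 101, 121, 132,  10, 230 },
	{ 255, 255, 255, 230,  10 },
};

/*
 * Hypothetical helper: return the node other than @nid with the smallest
 * distance from @nid, or -1 if there is no other node. Ties resolve to
 * the lowest node id.
 */
static int nearest_other_node(int nid)
{
	int best = -1, best_dist = INT_MAX;

	for (int i = 0; i < NR_NODES; i++) {
		if (i == nid)
			continue;
		if (node_distance[nid][i] < best_dist) {
			best_dist = node_distance[nid][i];
			best = i;
		}
	}
	return best;
}
```

Note that the table is not symmetric, so the lookup uses the row of the node
being resolved (distance from @nid to the candidates).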


The above output is with a Qemu command line

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa 
dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa 
dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa 
dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa 
dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa 
dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object 
memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}
  \
-device 
nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/

Changes from v7:
* Address review feedback 
* fold patch 6 to patch 3

Changes from v6:
* Address review feedback 

Changes from v5:
* Fix build error reported by kernel test robot
* Address review feedback 

Changes from v4:
* Drop DLPAR related device tree property for now because both Qemu and PowerVM
  will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance 
information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index


Aneesh Kumar K.V (5):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate different NUMA distance update code paths
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity

 Documentation/powerpc/associativity.rst   | 104 +
 arch/powerpc/include/asm/firmware.h   |   7 +-
 arch/powerpc/include/asm/prom.h   |   3 +-
 arch/powerpc/include/asm/topology.h   |   6 +-
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 432 ++
 arch/powerpc/platforms/pseries/firmware.c |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 arch/powerpc/platforms/pseries/lpar.c |   4 +-
 10 files changed, 455 insertions(+), 111 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1



Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE

2021-08-12 Thread Aneesh Kumar K.V

On 8/12/21 6:19 PM, Michael Ellerman wrote:

"Puvichakravarthy Ramachandran"  writes:

With shared mapping, even though we are unmapping a large range, the kernel
will force a TLB flush with ptl lock held to avoid the race mentioned in
commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory
freeing parts").
This results in the kernel issuing a high number of TLB flushes even for a large
range. This can be improved by making sure the kernel switches to a PID-based
flush if the kernel is unmapping a 2M range.

Signed-off-by: Aneesh Kumar K.V 
---
  arch/powerpc/mm/book3s64/radix_tlb.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c > 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..21d0f098e43b 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
   * invalidating a full PID, so it has a far lower threshold to change > from
   * individual page flushes to full-pid flushes.
   */
-static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
+static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly > = 
POWER9_TLB_SETS_RADIX * 2;

  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
@@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct > 
mm_struct *mm,
   if (fullmm)
   flush_pid = true;
   else if (type == FLUSH_TYPE_GLOBAL)
- flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+ flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
   else
   flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;


Additional details on the test environment. This was tested on a 2 Node/8
socket Power10 system.
The LPAR had 105 cores and the LPAR spanned across all the sockets.
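
To make the threshold change in the patch above concrete, here is a small
user-space C sketch of the arithmetic. It assumes 64K base pages, so that a 2M
range spans 32 pages; the helper names are hypothetical and only restate the
old (> 33) and proposed (>= 32) comparisons from the diff.

```c
#include <stdbool.h>

/* Assumption: 64K base pages, so a 2M range spans 32 pages. */
#define PAGE_SHIFT_64K 16

/* Number of 64K pages covered by the byte range [start, end). */
static unsigned long range_to_pages(unsigned long start, unsigned long end)
{
	return (end - start) >> PAGE_SHIFT_64K;
}

/* Old check: ceiling of 33 with a strict '>' comparison. */
static bool old_flush_pid(unsigned long nr_pages)
{
	return nr_pages > 33;
}

/* Proposed check: ceiling of 32 with a '>=' comparison. */
static bool new_flush_pid(unsigned long nr_pages)
{
	return nr_pages >= 32;
}
```

Under these assumptions a 2M unmap is 32 pages: the old check keeps issuing
per-page flushes, while the proposed check upgrades to a full PID flush.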

# perf stat -I 1000 -a -e cycles,instructions -e
"{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
"{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
  Rate of work: = 176
#   time counts unit events
  1.029206442 4198594519  cycles
  1.029206442 2458254252  instructions  # 0.59 insn 
per cycle
  1.029206442 3004031488  PM_EXEC_STALL
  1.029206442 1798186036  PM_EXEC_STALL_TLBIE
  Rate of work: = 181
  2.054288539 4183883450  cycles
  2.054288539 2472178171  instructions  # 0.59 insn 
per cycle
  2.054288539 3014609313  PM_EXEC_STALL
  2.054288539 1797851642  PM_EXEC_STALL_TLBIE
  Rate of work: = 180
  3.078306883 4171250717  cycles
  3.078306883 2468341094  instructions  # 0.59 insn 
per cycle
  3.078306883 2993036205  PM_EXEC_STALL
  3.078306883 1798181890  PM_EXEC_STALL_TLBIE
.
.

# cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
34

# echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling

# perf stat -I 1000 -a -e cycles,instructions -e
"{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
"{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
  Rate of work: = 313
#   time counts unit events
  1.030310506 4206071143  cycles
  1.030310506 4314716958  instructions  # 1.03 insn 
per cycle
  1.030310506 2157762167  PM_EXEC_STALL
  1.030310506  110825573  PM_EXEC_STALL_TLBIE
  Rate of work: = 322
  2.056034068 4331745630  cycles
  2.056034068 4531658304  instructions  # 1.05 insn 
per cycle
  2.056034068 2288971361  PM_EXEC_STALL
  2.056034068  111267927  PM_EXEC_STALL_TLBIE
  Rate of work: = 321
  3.081216434 4327050349  cycles
  3.081216434 4379679508  instructions  # 1.01 insn 
per cycle
  3.081216434 2252602550  PM_EXEC_STALL
  3.081216434  110974887  PM_EXEC_STALL_TLBIE



What is the tlbie test actually doing?

Does it do anything to measure the cost of refilling after the full mm flush?




That is essentially

for ()
{
  shmat()
  fillshm()
  shmdt()

}

for a 256MB range. So it is not really a fair benchmark because it 
doesn't take into account the impact of throwing away the full PID 
translation. But even then, aren't the TLBIE stalls an important data point?
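
For readers who want to reproduce the shape of that loop, a minimal sketch
follows. This is not the actual tlbie test program: the function name, the use
of memset() as a stand-in for fillshm(), and the segment permissions are all
assumptions made for illustration.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Attach a SysV shared memory segment, touch every byte, and detach it
 * again, @iterations times. Returns 0 on success, -1 on failure.
 */
static int shm_attach_fill_detach(size_t size, int iterations)
{
	int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);

	if (id < 0)
		return -1;

	for (int i = 0; i < iterations; i++) {
		void *p = shmat(id, NULL, 0);

		if (p == (void *)-1) {
			shmctl(id, IPC_RMID, NULL);
			return -1;
		}
		memset(p, i & 0xff, size);	/* stand-in for fillshm() */
		shmdt(p);			/* unmapping the shared range */
	}

	/* Mark the segment for removal now that nothing is attached. */
	return shmctl(id, IPC_RMID, NULL);
}
```

Calling it with a 256MB size would approximate the range described above; a
smaller size is enough to check that the loop itself works.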


-aneesh




Re: [PATCH] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry

2021-08-12 Thread Aneesh Kumar K.V

On 8/12/21 12:58 PM, Michael Ellerman wrote:

Christophe Leroy  writes:

Le 10/08/2021 à 06:53, Aneesh Kumar K.V a écrit :

Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling.

Signed-off-by: Aneesh Kumar K.V 
---
   arch/powerpc/mm/book3s64/radix_tlb.c | 48 
   1 file changed, 48 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..5cca0fe130e7 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -17,6 +17,7 @@

...

+
+static int __init create_tlb_single_page_flush_ceiling(void)
+{
+   debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
+   powerpc_debugfs_root, NULL, _tlbflush);


Could you just use debugfs_create_u32() instead of re-implementing simple read 
and write ?


Yeah AFAICS that should work fine.

It could probably even be a u16?



I was looking at switching all that to u64. Should I fall back to u16, 
considering that a tlb_single_page_flush_ceiling value larger than 2**16 
doesn't make sense?


-aneesh



Re: [PATCH v7 5/6] powerpc/pseries: Add support for FORM2 associativity

2021-08-11 Thread Aneesh Kumar K.V

On 8/12/21 7:11 AM, David Gibson wrote:

On Wed, Aug 11, 2021 at 09:39:32AM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:


On Mon, Aug 09, 2021 at 10:54:33AM +0530, Aneesh Kumar K.V wrote:

PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 


LGTM, with the exception of some minor nits noted below.

...


+

+   for (i = 0; i < max_numa_index; i++)
+   /* +1 skip the max_numa_index in the property */
+   numa_id_index_table[i] = of_read_number(_lookup_index[i + 
1], 1);
+
+
+   if (numa_dist_table_length != max_numa_index * max_numa_index) {
+


Stray extra whitespace line here.


+   WARN(1, "Wrong NUMA distance information\n");
+   /* consider everybody else just remote. */
+   for (i = 0;  i < max_numa_index; i++) {
+   for (j = 0; j < max_numa_index; j++) {
+   int nodeA = numa_id_index_table[i];
+   int nodeB = numa_id_index_table[j];
+
+   if (nodeA == nodeB)
+   numa_distance_table[nodeA][nodeB] = 
LOCAL_DISTANCE;
+   else
+   numa_distance_table[nodeA][nodeB] = 
REMOTE_DISTANCE;
+   }
+   }


I don't think it's necessarily a problem, but something to consider is
that this fallback will initialize distance for *all* node IDs,
whereas the normal path will only initialize it for nodes that are in
the index table.  Since some later error checks key off whether
certain fields in the distance table are initialized, is that the
outcome you want?



With the device tree details not correct, one possible way to
make progress is to consider everybody remote. With new node hotplug
support we used to check whether the distance table entry is
initialized. With the updated spec, we expect the distances for all
possible NUMA nodes to be available during boot.


Sure.  But my main point here is that the fallback behaviour in this
clause is different from the fallback behaviour if the table is there
and parseable, but incomplete - which is also not expected.



With FORM2, the fallback for bad device tree details is to consider 
everybody REMOTE. With Form1, we leave the distance table unpopulated, 
as current kernel versions do.
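
A user-space sketch of that FORM2 fallback, mirroring the loop quoted earlier
in this thread. The table size, the function name and the standalone globals
are assumptions made for illustration; LOCAL_DISTANCE and REMOTE_DISTANCE use
the generic kernel defaults of 10 and 20.

```c
#define MAX_NUMNODES    8	/* illustrative size, not the kernel's value */
#define LOCAL_DISTANCE  10
#define REMOTE_DISTANCE 20

static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES];

/*
 * Fallback for a malformed distance table: every node listed in
 * @id_table is local to itself and remote to every other listed node.
 * Entries for node ids that are not listed are left untouched.
 */
static void fill_fallback_distances(const int *id_table, int nr_ids)
{
	for (int i = 0; i < nr_ids; i++) {
		for (int j = 0; j < nr_ids; j++) {
			int a = id_table[i];
			int b = id_table[j];

			numa_distance_table[a][b] =
				(a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
		}
	}
}
```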


-aneesh


Re: [PATCH v7 5/6] powerpc/pseries: Add support for FORM2 associativity

2021-08-10 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Mon, Aug 09, 2021 at 10:54:33AM +0530, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating 
>> resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza 
>> Signed-off-by: Aneesh Kumar K.V 
>
> LGTM, with the exception of some minor nits noted below.
...

> +
>> +for (i = 0; i < max_numa_index; i++)
>> +/* +1 skip the max_numa_index in the property */
>> +numa_id_index_table[i] = of_read_number(_lookup_index[i + 
>> 1], 1);
>> +
>> +
>> +if (numa_dist_table_length != max_numa_index * max_numa_index) {
>> +
>
> Stray extra whitespace line here.
>
>> +WARN(1, "Wrong NUMA distance information\n");
>> +/* consider everybody else just remote. */
>> +for (i = 0;  i < max_numa_index; i++) {
>> +for (j = 0; j < max_numa_index; j++) {
>> +int nodeA = numa_id_index_table[i];
>> +int nodeB = numa_id_index_table[j];
>> +
>> +if (nodeA == nodeB)
>> +numa_distance_table[nodeA][nodeB] = 
>> LOCAL_DISTANCE;
>> +else
>> +numa_distance_table[nodeA][nodeB] = 
>> REMOTE_DISTANCE;
>> +}
>> +}
>
> I don't think it's necessarily a problem, but something to consider is
> that this fallback will initialize distance for *all* node IDs,
> whereas the normal path will only initialize it for nodes that are in
> the index table.  Since some later error checks key off whether
> certain fields in the distance table are initialized, is that the
> outcome you want?
>

With the device tree details not correct, one possible way to
make progress is to consider everybody remote. With new node hotplug
support we used to check whether the distance table entry is
initialized. With the updated spec, we expect the distances for all
possible NUMA nodes to be available during boot.

-aneesh


[PATCH] powerpc: rename powerpc_debugfs_root to arch_debugfs_dir

2021-08-09 Thread Aneesh Kumar K.V
No functional change in this patch. arch_debugfs_dir is the generic kernel
name declared in linux/debugfs.h for the arch-specific debugfs directory.
Architectures like x86/s390 already use the name. Rename the powerpc-specific
powerpc_debugfs_root to arch_debugfs_dir.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/debugfs.h  | 13 -
 arch/powerpc/kernel/Makefile|  3 ++-
 arch/powerpc/kernel/dawr.c  |  3 +--
 arch/powerpc/kernel/eeh.c   | 16 
 arch/powerpc/kernel/eeh_cache.c |  4 ++--
 arch/powerpc/kernel/fadump.c|  4 ++--
 arch/powerpc/kernel/hw_breakpoint.c |  1 -
 arch/powerpc/kernel/kdebugfs.c  | 14 ++
 arch/powerpc/kernel/security.c  | 16 
 arch/powerpc/kernel/setup-common.c  | 13 -
 arch/powerpc/kernel/setup_64.c  |  1 -
 arch/powerpc/kernel/traps.c |  4 ++--
 arch/powerpc/kvm/book3s_xics.c  |  6 +++---
 arch/powerpc/kvm/book3s_xive.c  |  3 +--
 arch/powerpc/kvm/book3s_xive_native.c   |  3 +--
 arch/powerpc/mm/book3s64/hash_utils.c   |  4 ++--
 arch/powerpc/mm/book3s64/pgtable.c  |  4 ++--
 arch/powerpc/mm/book3s64/radix_tlb.c|  4 ++--
 arch/powerpc/mm/ptdump/bats.c   |  4 ++--
 arch/powerpc/mm/ptdump/segment_regs.c   |  4 ++--
 arch/powerpc/platforms/cell/axon_msi.c  |  4 ++--
 arch/powerpc/platforms/powernv/memtrace.c   |  3 +--
 arch/powerpc/platforms/powernv/opal-imc.c   |  4 ++--
 arch/powerpc/platforms/powernv/opal-lpc.c   |  4 ++--
 arch/powerpc/platforms/powernv/opal-xscom.c |  4 ++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  4 ++--
 arch/powerpc/platforms/pseries/dtl.c|  4 ++--
 arch/powerpc/platforms/pseries/lpar.c   |  5 +++--
 arch/powerpc/sysdev/xive/common.c   |  3 +--
 arch/powerpc/xmon/xmon.c|  6 +++---
 30 files changed, 74 insertions(+), 91 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/debugfs.h
 create mode 100644 arch/powerpc/kernel/kdebugfs.c

diff --git a/arch/powerpc/include/asm/debugfs.h 
b/arch/powerpc/include/asm/debugfs.h
deleted file mode 100644
index 2c5c48571d75..
--- a/arch/powerpc/include/asm/debugfs.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifndef _ASM_POWERPC_DEBUGFS_H
-#define _ASM_POWERPC_DEBUGFS_H
-
-/*
- * Copyright 2017, Michael Ellerman, IBM Corporation.
- */
-
-#include 
-
-extern struct dentry *powerpc_debugfs_root;
-
-#endif /* _ASM_POWERPC_DEBUGFS_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index f66b63e81c3b..7be36c1e1db6 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -46,7 +46,8 @@ obj-y := cputable.o syscalls.o \
   prom.o traps.o setup-common.o \
   udbg.o misc.o io.o misc_$(BITS).o \
   of_platform.o prom_parse.o firmware.o \
-  hw_breakpoint_constraints.o interrupt.o
+  hw_breakpoint_constraints.o interrupt.o \
+  kdebugfs.o
 obj-y  += ptrace/
 obj-$(CONFIG_PPC64)+= setup_64.o \
   paca.o nvram_64.o note.o
diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c
index cdc2dccb987d..64e423d2fe0f 100644
--- a/arch/powerpc/kernel/dawr.c
+++ b/arch/powerpc/kernel/dawr.c
@@ -9,7 +9,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -101,7 +100,7 @@ static int __init dawr_force_setup(void)
if (PVR_VER(mfspr(SPRN_PVR)) == PVR_POWER9) {
/* Turn DAWR off by default, but allow admin to turn it on */
debugfs_create_file_unsafe("dawr_enable_dangerous", 0600,
-  powerpc_debugfs_root,
+  arch_debugfs_dir,
   _force_enable,
   _enable_fops);
}
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 3bbdcc86d01b..e9b597ed423c 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -21,9 +21,9 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -1901,24 +1901,24 @@ static int __init eeh_init_proc(void)
proc_create_single("powerpc/eeh", 0, NULL, proc_eeh_show);
 #ifdef CONFIG_DEBUG_FS
debugfs_create_file_unsafe("eeh_enable", 0600,
-  powerpc_debugfs_root, NULL,
+  arch_debugfs_dir, NULL,
   _enable_dbgfs_ops);

[PATCH] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry

2021-08-09 Thread Aneesh Kumar K.V
Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_tlb.c | 48 
 1 file changed, 48 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..5cca0fe130e7 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -1524,3 +1525,50 @@ void do_h_rpt_invalidate_prt(unsigned long pid, unsigned 
long lpid,
 EXPORT_SYMBOL_GPL(do_h_rpt_invalidate_prt);
 
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
+
+static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
+size_t count, loff_t *ppos)
+{
+   char buf[32];
+   unsigned int len;
+
+   len = sprintf(buf, "%ld\n", tlb_single_page_flush_ceiling);
+   return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t tlbflush_write_file(struct file *file,
+const char __user *user_buf, size_t count, loff_t *ppos)
+{
+   char buf[32];
+   ssize_t len;
+   int ceiling;
+
+   len = min(count, sizeof(buf) - 1);
+   if (copy_from_user(buf, user_buf, len))
+   return -EFAULT;
+
+   buf[len] = '\0';
+   if (kstrtoint(buf, 0, &ceiling))
+   return -EINVAL;
+
+   if (ceiling < 0)
+   return -EINVAL;
+
+   tlb_single_page_flush_ceiling = ceiling;
+   return count;
+}
+
+static const struct file_operations fops_tlbflush = {
+   .read = tlbflush_read_file,
+   .write = tlbflush_write_file,
+   .llseek = default_llseek,
+};
+
+static int __init create_tlb_single_page_flush_ceiling(void)
+{
+   debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
+   powerpc_debugfs_root, NULL, &fops_tlbflush);
+   return 0;
+}
+late_initcall(create_tlb_single_page_flush_ceiling);
+
-- 
2.31.1



[PATCH v7 6/6] powerpc/pseries: Consolidate form1 distance initialization into a helper

2021-08-08 Thread Aneesh Kumar K.V
Currently, we duplicate parsing code for ibm,associativity and
ibm,associativity-lookup-arrays in the kernel. The associativity arrays provided
by these device tree properties are very similar and hence can use
a helper to parse the node id and NUMA distance details.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 104 +++--
 1 file changed, 58 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index fffb3c40f595..e6d47fcba335 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -171,26 +171,36 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-/*
- * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
- * info is found.
- */
-static int associativity_to_nid(const __be32 *associativity)
+static int __associativity_to_nid(const __be32 *associativity,
+ int max_array_sz)
 {
-   int nid = NUMA_NO_NODE;
+   int nid;
+   /*
+* primary_domain_index is 1 based array index.
+*/
+   int index = primary_domain_index  - 1;
 
-   if (!numa_enabled)
-   goto out;
+   if (!numa_enabled || index >= max_array_sz)
+   return NUMA_NO_NODE;
 
-   if (of_read_number(associativity, 1) >= primary_domain_index)
-   nid = of_read_number(&associativity[primary_domain_index], 1);
+   nid = of_read_number(&associativity[index], 1);
 
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
nid = NUMA_NO_NODE;
-out:
return nid;
 }
+/*
+ * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
+ * info is found.
+ */
+static int associativity_to_nid(const __be32 *associativity)
+{
+   int array_sz = of_read_number(associativity, 1);
+
+   /* Skip the first element in the associativity array */
+   return __associativity_to_nid((associativity + 1), array_sz);
+}
 
 static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 
*cpu2_assoc)
 {
@@ -295,33 +305,39 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static void __initialize_form1_numa_distance(const __be32 *associativity)
+static void __initialize_form1_numa_distance(const __be32 *associativity,
+int max_array_sz)
 {
int i, nid;
 
if (affinity_form != FORM1_AFFINITY)
return;
 
-   nid = associativity_to_nid(associativity);
+   nid = __associativity_to_nid(associativity, max_array_sz);
if (nid != NUMA_NO_NODE) {
for (i = 0; i < distance_ref_points_depth; i++) {
const __be32 *entry;
+   int index = be32_to_cpu(distance_ref_points[i]) - 1;
+
+   /*
+* broken hierarchy, return with broken distance table
+*/
+   if (WARN(index >= max_array_sz, "Broken 
ibm,associativity property"))
+   return;
 
-   entry = 
[be32_to_cpu(distance_ref_points[i])];
+   entry = [index];
distance_lookup_table[nid][i] = of_read_number(entry, 
1);
}
}
 }
 
-static void initialize_form1_numa_distance(struct device_node *node)
+static void initialize_form1_numa_distance(const __be32 *associativity)
 {
-   const __be32 *associativity;
-
-   associativity = of_get_associativity(node);
-   if (!associativity)
-   return;
+   int array_sz;
 
-   __initialize_form1_numa_distance(associativity);
+   array_sz = of_read_number(associativity, 1);
+   /* Skip the first element in the associativity array */
+   __initialize_form1_numa_distance(associativity + 1, array_sz);
 }
 
 /*
@@ -334,7 +350,13 @@ void update_numa_distance(struct device_node *node)
if (affinity_form == FORM0_AFFINITY)
return;
else if (affinity_form == FORM1_AFFINITY) {
-   initialize_form1_numa_distance(node);
+   const __be32 *associativity;
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return;
+
+   initialize_form1_numa_distance(associativity);
return;
}
 
@@ -586,27 +608,17 @@ static int get_nid_and_numa_distance(struct drmem_lmb 
*lmb)
 
if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
-   nid = of_read_number([index], 1);
+   const __be32 *associativity;
 
-   if (nid == 0x || nid >= nr_node_ids)
-  

[PATCH v7 5/6] powerpc/pseries: Add support for FORM2 associativity

2021-08-08 Thread Aneesh Kumar K.V
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/powerpc/associativity.rst   | 103 +
 arch/powerpc/include/asm/firmware.h   |   3 +-
 arch/powerpc/include/asm/prom.h   |   1 +
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 168 ++
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 252 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
new file mode 100644
index 000000000000..b6c89706ca03
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,103 @@
+
+NUMA resource associativity
+===========================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the oldest format and is now considered deprecated.
+
+Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+------
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+------
+With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
+device tree properties are used to determine the NUMA distance between resource groups/domains.
+
+The “ibm,associativity” property contains a list of one or more numbers (domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains a list of one or more numbers
+(domainID index) that represents the 1 based ordinal in the associativity lists.
+The list of domainID indexes represents an increasing hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
+Linux kernel computes NUMA distance between two domains by recursively comparing
+if they belong to the same higher-level domains. For mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+------
+Form 2 associativity format adds separate device tree properties representing NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows flexible primary
+domain numbering. With numa distance computation now detached from the index value in
+"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
+ids at the same domainID index representing resource groups of different performance/latency
+characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
+the domainIDs present in the system. The offset of the domainID in this property is
+used as an index while computing numa distance information via "ibm,numa-distance-table".
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
+N domainID encoded as with encode-int
+
+For ex:
+"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
+computing the distance of domain 8 from other domains present in the system. For the rest of
+this document, this offset will be referred to as domain distance offset.
+
+"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
+distance between resource groups/domains present in the syst

[PATCH v7 4/6] powerpc/pseries: Add a helper for form1 cpu distance

2021-08-08 Thread Aneesh Kumar K.V
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining that we don't expect
the code to be called with FORM0.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/topology.h   |  4 ++--
 arch/powerpc/mm/numa.c| 10 +-
 arch/powerpc/platforms/pseries/lpar.c |  4 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index a6425a70c37b..a4712ecad3e9 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 cpu_all_mask : \
 cpumask_of_node(pcibus_to_node(bus)))
 
-extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
@@ -84,7 +84,7 @@ static inline void sysfs_remove_device_from_node(struct device *dev,
 
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
 
-static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
return 0;
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c695faf67d68..a244398a7766 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
return dist;
 }
 
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+   /* We should not get called with FORM0 */
+   VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+   return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index dab356e3ff87..afefbdfe768d 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int last_disp_cpu, int cur_disp_cpu)
if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
return -EIO;
 
-   return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
+   return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
 }
 
 static int cpu_home_node_dispatch_distance(int disp_cpu)
@@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
if (!disp_cpu_assoc || !vcpu_assoc)
return -EIO;
 
-   return cpu_distance(disp_cpu_assoc, vcpu_assoc);
+   return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
 }
 
 static void update_vcpu_disp_stat(int disp_cpu)
-- 
2.31.1
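
The hunks above show only the first and last lines of `__cpu_form1_relative_distance()`, so the loop below is an assumption about its body rather than a copy of it. As a minimal user-space sketch: the function name, the host-endian `uint32_t` arrays, and the parameter list are all hypothetical, and the idea is simply to count how many reference-point levels are walked before the two CPUs share a domain.

```c
#include <stdint.h>

/*
 * Hypothetical user-space model, not the kernel helper itself.
 * Assumptions: both associativity arrays are host-endian and
 * length-prefixed (element 0 holds the count), and ref_points holds
 * 1-based indexes into them.
 */
static int sketch_form1_relative_distance(const uint32_t *assoc1,
                                          const uint32_t *assoc2,
                                          const uint32_t *ref_points,
                                          int ref_points_depth)
{
    int dist = 0;

    for (int i = 0; i < ref_points_depth; i++) {
        uint32_t index = ref_points[i];

        /* Stop at the first level where both CPUs share a domain. */
        if (assoc1[index] == assoc2[index])
            break;
        dist++;
    }
    return dist;
}
```

With two reference points, two CPUs in the same primary domain would give 0, and CPUs that differ at every level would give 2.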



[PATCH v7 3/6] powerpc/pseries: Consolidate different NUMA distance update code paths

2021-08-08 Thread Aneesh Kumar K.V
The associativity details of newly added resources are collected from
the hypervisor via the "ibm,configure-connector" RTAS call. Update the NUMA
distance details of the newly added NUMA node after that call.

Instead of updating the NUMA distance every time we look up a node id
from the associativity property, add helpers, used during boot, that
do this only once. Also remove the distance update from the node id
lookup helpers.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/topology.h   |   2 +
 arch/powerpc/mm/numa.c| 178 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 4 files changed, 138 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..a6425a70c37b 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -64,6 +64,7 @@ static inline int early_cpu_to_node(int cpu)
 }
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
+void update_numa_distance(struct device_node *node);
 
 #else
 
@@ -93,6 +94,7 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return first_online_node;
 }
 
+static inline void update_numa_distance(struct device_node *node) {}
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 368719b14dcc..c695faf67d68 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,22 +208,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-   const __be32 *associativity)
-{
-   int i;
-
-   if (affinity_form != FORM1_AFFINITY)
-   return;
-
-   for (i = 0; i < distance_ref_points_depth; i++) {
-   const __be32 *entry;
-
-   entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
-   distance_lookup_table[nid][i] = of_read_number(entry, 1);
-   }
-}
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
@@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 *associativity)
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
nid = NUMA_NO_NODE;
-
-   if (nid > 0 &&
-   of_read_number(associativity, 1) >= distance_ref_points_depth) {
-   /*
-* Skip the length field and send start of associativity array
-*/
-   initialize_distance_lookup_table(nid, associativity + 1);
-   }
-
 out:
return nid;
 }
@@ -287,6 +262,48 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity)
+{
+   int i, nid;
+
+   if (affinity_form != FORM1_AFFINITY)
+   return;
+
+   nid = associativity_to_nid(associativity);
+   if (nid != NUMA_NO_NODE) {
+   for (i = 0; i < distance_ref_points_depth; i++) {
+   const __be32 *entry;
+
+   entry = &associativity[be32_to_cpu(distance_ref_points[i])];
+   distance_lookup_table[nid][i] = of_read_number(entry, 1);
+   }
+   }
+}
+
+static void initialize_form1_numa_distance(struct device_node *node)
+{
+   const __be32 *associativity;
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return;
+
+   __initialize_form1_numa_distance(associativity);
+}
+
+/*
+ * Used to update distance information w.r.t newly added node.
+ */
+void update_numa_distance(struct device_node *node)
+{
+   if (affinity_form == FORM0_AFFINITY)
+   return;
+   else if (affinity_form == FORM1_AFFINITY) {
+   initialize_form1_numa_distance(node);
+   return;
+   }
+}
+
 static int __init find_primary_domain_index(void)
 {
int index;
@@ -433,6 +450,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
return 0;
 }
 
+static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
+{
+   struct assoc_arrays aa = { .arrays = NULL };
+   int default_nid = NUMA_NO_NODE;
+   int nid = default_nid;
+   int rc, index;
+
+   if ((primary_domain_index < 0) || !numa_enabled)
+   return default_nid;
+
+   rc = of_get_assoc_arrays(&aa);
+   if (rc)
+   return default_nid;
+
+   if (primary_domain_index <= aa.array_sz &&
+   !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
+   nid = of_rea

[PATCH v7 2/6] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY

2021-08-08 Thread Aneesh Kumar K.V
Also make related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/firmware.h   |  4 +--
 arch/powerpc/include/asm/prom.h   |  2 +-
 arch/powerpc/kernel/prom_init.c   |  2 +-
 arch/powerpc/mm/numa.c| 35 ++-
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPAL ASM_CONST(0x1000)
 #define FW_FEATURE_SET_MODE ASM_CONST(0x4000)
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRN ASM_CONST(0x0002)
 #define FW_FEATURE_DRMEM_V2 ASM_CONST(0x0004)
 #define FW_FEATURE_DRC_INFO ASM_CONST(0x0008)
@@ -69,7 +69,7 @@ enum {
FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI    0x0201  /* PCIe/MSI support */
 #define OV5_CMO    0x0480  /* Cooperative Memory Overcommitment */
 #define OV5_XCMO   0x0440  /* Page Coalescing */
-#define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY 0x0580  /* FORM1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
 #define OV5_HP_EVT 0x0604  /* Hot Plug Event support */
 #define OV5_RESIZE_HPT 0x0601  /* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index a5bf355ce1d6..57db605ad33a 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1096,7 +1096,7 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 #else
0,
 #endif
-   .associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN),
+   .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
.micro_checkpoint = 0,
.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..368719b14dcc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int distance_ref_points_depth;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
int i;
int distance = LOCAL_DISTANCE;
 
-   if (!form1_affinity)
+   if (affinity_form == FORM0_AFFINITY)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
for (i = 0; i < distance_ref_points_depth; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
int i;
 
-   if (!form1_affinity)
+   if (affinity_form != FORM1_AFFINITY)
return;
 
for (i = 0; i < distance_ref_points_depth; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
int index;
struct device_node *root;
 
+   /*
+* Check for which form of affinity.
+*/
+   if (firmware_has_feature(FW_FEATURE_OPAL)) {
+   affinity_form = FORM1_AFFINITY;
+   } else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+   dbg("Using form 1 affinity\n");
+   affinity_form = FORM1_AFFINITY;
+   } else
+   affinity_form = FORM0_AFFINITY;
+
if (firmware_has_feature(FW_FEATURE_OPAL))
root = of_find_node_by_path("/ibm,opal");
else
@@

[PATCH v7 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index

2021-08-08 Thread Aneesh Kumar K.V
No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
if (!numa_enabled)
goto out;
 
-   if (of_read_number(associativity, 1) >= min_common_depth)
-   nid = of_read_number(&associativity[min_common_depth], 1);
+   if (of_read_number(associativity, 1) >= primary_domain_index)
+   nid = of_read_number(&associativity[primary_domain_index], 1);
 
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-   int depth;
+   int index;
struct device_node *root;
 
if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
}
 
if (form1_affinity) {
-   depth = of_read_number(distance_ref_points, 1);
+   index = of_read_number(distance_ref_points, 1);
} else {
if (distance_ref_points_depth < 2) {
printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
goto err;
}
 
-   depth = of_read_number(&distance_ref_points[1], 1);
+   index = of_read_number(&distance_ref_points[1], 1);
}
 
/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
}
 
of_node_put(root);
-   return depth;
+   return index;
 
 err:
of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
int nid = default_nid;
int rc, index;
 
-   if ((min_common_depth < 0) || !numa_enabled)
+   if ((primary_domain_index < 0) || !numa_enabled)
return default_nid;
 
rc = of_get_assoc_arrays(&aa);
if (rc)
return default_nid;
 
-   if (min_common_depth <= aa.array_sz &&
+   if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
nid = of_read_number(&aa.arrays[index], 1);
 
if (nid == 0xffff || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
return -1;
}
 
-   min_common_depth = find_min_common_depth();
+   primary_domain_index = find_primary_domain_index();
 
-   if (min_common_depth < 0) {
+   if (primary_domain_index < 0) {
/*
-* if we fail to parse min_common_depth from device tree
+* if we fail to parse primary_domain_index from device tree
 * mark the numa disabled, boot with numa disabled.
 */
numa_enabled = false;
-   return min_common_depth;
+   return primary_domain_index;
}
 
-   dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+   dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
goto out;
}
 
-   max_nodes = of_read_number(&domains[min_common_depth], 1);
+   max_nodes = of_read_number(&domains[primary_domain_index], 1);
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
}
 
prop_length /= sizeof(int);
-   if (prop_length > min_common_depth + 2)
+   if (prop_length > primary_domain_index + 2)
coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
goto out;
 
index = of_read_number(associativity, 1);
-   if (index > min_common_depth + 1)
+   if (index > primary_domain_index + 1)
return of_read_number(&associativity[index - 1], 1);
 
 out:
-- 
2.31.1
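
The renamed `primary_domain_index` is used as a 1-based index into a length-prefixed associativity array. A minimal user-space sketch of that lookup follows, assuming host-endian `uint32_t` values in place of `__be32` plus `of_read_number()`. The names `sketch_associativity_to_nid` and `SKETCH_NO_NODE` are hypothetical, and `nr_node_ids` is passed in as a parameter purely for illustration.

```c
#include <stdint.h>

#define SKETCH_NO_NODE (-1) /* stands in for NUMA_NO_NODE */

/*
 * Hypothetical model of the node id lookup touched by this rename.
 * Element 0 of the array is the number of domain IDs that follow, so
 * primary_domain_index can index the array directly (1-based).
 */
static int sketch_associativity_to_nid(const uint32_t *associativity,
                                       int primary_domain_index,
                                       int nr_node_ids)
{
    int nid = SKETCH_NO_NODE;

    /* Only read the entry if the array is long enough to contain it. */
    if ((int)associativity[0] >= primary_domain_index)
        nid = (int)associativity[primary_domain_index];

    /* 0xffff marks an invalid node; out-of-range IDs are rejected too. */
    if (nid == 0xffff || nid >= nr_node_ids)
        nid = SKETCH_NO_NODE;

    return nid;
}
```

The length check is what keeps a short associativity array from being read past its end, which is why the comparison uses the count stored in element 0.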



[PATCH v7 0/6] Add support for FORM2 associativity

2021-08-08 Thread Aneesh Kumar K.V
Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 5.

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$
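
The matrix above is not symmetric (node 0 to node 2 is 222, node 2 to node 0 is 77), which a per-pair distance table can express directly. As a rough sketch only: the row-major flat layout, the table name, and the helper below are assumptions made for illustration, not the format of any device tree property.

```c
#include <stdint.h>

#define SKETCH_NR_NODES 4

/* Distances from the numactl output above, row-major: row = source node. */
static const uint8_t sketch_distance_table[SKETCH_NR_NODES * SKETCH_NR_NODES] = {
     10,  11, 222,  33,
     44,  10,  55,  66,
     77,  88,  10,  99,
    101, 121, 132,  10,
};

/* Returns the distance from one node to another, or -1 if out of range. */
static int sketch_node_distance(int from, int to)
{
    if (from < 0 || from >= SKETCH_NR_NODES ||
        to < 0 || to >= SKETCH_NR_NODES)
        return -1;

    return sketch_distance_table[from * SKETCH_NR_NODES + to];
}
```

Because each ordered pair has its own entry, the two directions between a pair of nodes do not have to agree.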

After DAX kmem memory add
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  222  33  240 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  230 
  4:  255  255  255  230  10 


PAPR SCM now uses the NUMA distance details to find the numa_node and target_node
for the device.

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v 
[
  {
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":1071644672,
"uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
"raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
"daxregion":{
  "id":0,
  "size":1071644672,
  "devices":[
{
  "chardev":"dax0.0",
  "size":1071644672,
  "target_node":4,
  "mode":"devdax"
}
  ]
},
"align":2097152,
"numa_node":3
  }
]
kvaneesh@ubuntu-guest:~$ 


The above output was produced with the following QEMU command-line options:

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE} \
-device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/

Changes from v6:
* Address review feedback 

Changes from v5:
* Fix build error reported by kernel test robot
* Address review feedback 

Changes from v4:
* Drop the DLPAR-related device tree property for now because both QEMU and PowerVM
  will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index


Aneesh Kumar K.V (6):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate different NUMA distance update code paths
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity
  powerpc/pseries: Consolidate form1 distance initialization into a
helper

 Documentation/powerpc/associativity.rst   | 103 +
 arch/powerpc/include/asm/firmware.h   |   7 +-
 arch/powerpc/include/asm/prom.h   |   3 +-
 arch/powerpc/include/asm/topology.h   |   6 +-
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 433 ++
 arch/powerpc/platforms/pseries/firmware.c |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 arch/powerpc/platforms/pseries/lpar.c |   4 +-
 10 files changed, 455 insertions(+), 111 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1



Re: [PATCH v6 6/6] powerpc/pseries: Consolidate form1 distance initialization into a helper

2021-08-06 Thread Aneesh Kumar K.V

On 8/6/21 12:17 PM, David Gibson wrote:

On Tue, Jul 27, 2021 at 03:33:11PM +0530, Aneesh Kumar K.V wrote:

Currently, we duplicate parsing code for ibm,associativity and
ibm,associativity-lookup-arrays in the kernel. The associativity arrays provided
by these device tree properties are very similar and hence can use
a helper to parse the node id and numa distance details.


Oh... sorry.. comments on the earlier patch were from before I read
and saw you adjusted things here.



Signed-off-by: Aneesh Kumar K.V 
---
  arch/powerpc/mm/numa.c | 83 ++
  1 file changed, 51 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index fffb3c40f595..7506251e17f2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -171,19 +171,19 @@ static void unmap_cpu_from_node(unsigned long cpu)
  }
  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
  
-/*

- * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
- * info is found.
- */
-static int associativity_to_nid(const __be32 *associativity)
+static int __associativity_to_nid(const __be32 *associativity,
+ int max_array_sz)
  {
int nid = NUMA_NO_NODE;
+   /*
+* primary_domain_index is 1 based array index.
+*/
+   int index = primary_domain_index  - 1;
  
-	if (!numa_enabled)

+   if (!numa_enabled || index >= max_array_sz)
goto out;


You don't need a goto, you can just return NUMA_NO_NODE.


updated



  
-	if (of_read_number(associativity, 1) >= primary_domain_index)

-   nid = of_read_number(&associativity[primary_domain_index], 1);
+   nid = of_read_number(&associativity[index], 1);
  
  	/* POWER4 LPAR uses 0xffff as invalid node */

if (nid == 0xffff || nid >= nr_node_ids)
@@ -191,6 +191,17 @@ static int associativity_to_nid(const __be32 *associativity)
  out:
return nid;
  }
+/*
+ * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
+ * info is found.
+ */
+static int associativity_to_nid(const __be32 *associativity)
+{
+   int array_sz = of_read_number(associativity, 1);
+
+   /* Skip the first element in the associativity array */
+   return __associativity_to_nid((associativity + 1), array_sz);
+}
  
  static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)

  {
@@ -295,24 +306,41 @@ int of_node_to_nid(struct device_node *device)
  }
  EXPORT_SYMBOL(of_node_to_nid);
  
-static void __initialize_form1_numa_distance(const __be32 *associativity)

+static void ___initialize_form1_numa_distance(const __be32 *associativity,
+int max_array_sz)
  {
int i, nid;
  
  	if (affinity_form != FORM1_AFFINITY)

return;
  
-	nid = associativity_to_nid(associativity);

+   nid = __associativity_to_nid(associativity, max_array_sz);
if (nid != NUMA_NO_NODE) {
for (i = 0; i < distance_ref_points_depth; i++) {
const __be32 *entry;
+   int index = be32_to_cpu(distance_ref_points[i]) - 1;
+
+   /*
+* broken hierarchy, return with broken distance table


WARN_ON, maybe?



updated




+*/
+   if (index >= max_array_sz)
+   return;
  
-			entry = &associativity[be32_to_cpu(distance_ref_points[i])];

+   entry = &associativity[index];
distance_lookup_table[nid][i] = of_read_number(entry, 1);
}
}
  }
  
+static void __initialize_form1_numa_distance(const __be32 *associativity)


Do you actually use this in-between wrapper?


yes used in

static void initialize_form1_numa_distance(struct device_node *node)
{
const __be32 *associativity;

associativity = of_get_associativity(node);
if (!associativity)
return;

__initialize_form1_numa_distance(associativity);
}






+{
+   int array_sz;
+
+   array_sz = of_read_number(associativity, 1);
+   /* Skip the first element in the associativity array */
+   ___initialize_form1_numa_distance(associativity + 1, array_sz);
+}
+
  static void initialize_form1_numa_distance(struct device_node *node)
  {
const __be32 *associativity;
@@ -586,27 +614,18 @@ static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
  
  	if (primary_domain_index <= aa.array_sz &&

!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
-   nid = of_read_number(&aa.arrays[index], 1);
+   const __be32 *associativity;
  
-		if (nid == 0xffff || nid >= nr_node_ids)

-   nid = default_nid;
+   index = lmb->aa_index * aa.array_sz;
+   associativity

[RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE

2021-08-03 Thread Aneesh Kumar K.V
With a shared mapping, even though we are unmapping a large range, the kernel
forces a TLB flush with the ptl lock held to avoid the race mentioned in
commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory
freeing parts"). This results in the kernel issuing a high number of TLB flushes
even for a large range. This can be improved by making sure the kernel switches
to a PID-based flush if it is unmapping a 2M range.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_tlb.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..21d0f098e43b 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
  * invalidating a full PID, so it has a far lower threshold to change from
  * individual page flushes to full-pid flushes.
  */
-static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
+static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
 static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
 
 static inline void __radix__flush_tlb_range(struct mm_struct *mm,
@@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
if (fullmm)
flush_pid = true;
else if (type == FLUSH_TYPE_GLOBAL)
-   flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+   flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
else
flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
/*
@@ -1335,7 +1335,7 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
if (fullmm)
flush_pid = true;
else if (type == FLUSH_TYPE_GLOBAL)
-   flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+   flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
else
flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
 
@@ -1505,7 +1505,7 @@ void do_h_rpt_invalidate_prt(unsigned long pid, unsigned long lpid,
continue;
 
nr_pages = (end - start) >> def->shift;
-   flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+   flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
 
/*
 * If the number of pages spanning the range is above
-- 
2.31.1
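
The effect of moving the ceiling from 33 to 32 and the comparison from `>` to `>=` can be checked with a little arithmetic. The sketch below assumes a 64K base page size, which is not stated in the patch; the `SKETCH_*` constants and helper names are hypothetical. Under that assumption a 2M range spans exactly 32 pages.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 16          /* assumed 64K base pages */
#define SKETCH_PMD_SIZE   (2UL << 20) /* the 2M range named in the commit message */

/* Number of base pages covered by [start, end). */
static unsigned long sketch_nr_pages(unsigned long start, unsigned long end)
{
    return (end - start) >> SKETCH_PAGE_SHIFT;
}

/* Global-flush decision before the patch: ceiling 33, strict comparison. */
static bool sketch_flush_pid_before(unsigned long nr_pages)
{
    return nr_pages > 33;
}

/* Global-flush decision after the patch: ceiling 32, inclusive comparison. */
static bool sketch_flush_pid_after(unsigned long nr_pages)
{
    return nr_pages >= 32;
}
```

With these assumptions, a 2M unmap stays on per-page invalidation before the change and switches to a PID flush after it, while a 31-page range is unaffected.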



[PATCH v6 4/6] powerpc/pseries: Add a helper for form1 cpu distance

2021-07-27 Thread Aneesh Kumar K.V
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining that we don't expect
the code to be called with FORM0.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/topology.h   |  4 ++--
 arch/powerpc/mm/numa.c| 10 +-
 arch/powerpc/platforms/pseries/lpar.c |  4 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index a6425a70c37b..a4712ecad3e9 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 cpu_all_mask : \
 cpumask_of_node(pcibus_to_node(bus)))
 
-extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
@@ -84,7 +84,7 @@ static inline void sysfs_remove_device_from_node(struct 
device *dev,
 
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
 
-static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
return 0;
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c695faf67d68..a244398a7766 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
return dist;
 }
 
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+   /* We should not get called with FORM0 */
+   VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+   return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index dab356e3ff87..afefbdfe768d 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int 
last_disp_cpu, int cur_disp_cpu)
if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
return -EIO;
 
-   return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
+   return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
 }
 
 static int cpu_home_node_dispatch_distance(int disp_cpu)
@@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
if (!disp_cpu_assoc || !vcpu_assoc)
return -EIO;
 
-   return cpu_distance(disp_cpu_assoc, vcpu_assoc);
+   return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
 }
 
 static void update_vcpu_disp_stat(int disp_cpu)
-- 
2.31.1
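
A user-space sketch of the wrapper pattern this patch introduces may help. The hunks do not show the loop inside __cpu_form1_relative_distance(), so the body below is an assumption: it counts the reference-point levels at which the two associativity arrays differ. The reference-point values, the plain uint32_t arrays (no __be32 conversion) and the fprintf() standing in for VM_WARN_ON() are all illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define FORM0_AFFINITY 0
#define FORM1_AFFINITY 1

/* Hypothetical stand-ins for the kernel globals, already in CPU byte order. */
static int affinity_form = FORM1_AFFINITY;
static const uint32_t distance_ref_points[] = { 4, 2 };
static const int distance_ref_points_depth = 2;

/*
 * Assumed loop body: walk the reference points and count how many
 * levels mismatch before the first matching domain is found.
 * The arrays are length-prefixed, so index N addresses the Nth domain.
 */
static int form1_relative_distance(const uint32_t *cpu1_assoc,
				   const uint32_t *cpu2_assoc)
{
	int dist = 0;

	for (int i = 0; i < distance_ref_points_depth; i++) {
		uint32_t index = distance_ref_points[i];

		if (cpu1_assoc[index] == cpu2_assoc[index])
			break;
		dist++;
	}
	return dist;
}

/* Public entry point: complain about FORM0, then use the Form 1 helper. */
static int cpu_relative_distance(const uint32_t *cpu1_assoc,
				 const uint32_t *cpu2_assoc)
{
	/* Stand-in for VM_WARN_ON(): warn and carry on. */
	if (affinity_form == FORM0_AFFINITY)
		fprintf(stderr, "unexpected FORM0 affinity\n");

	return form1_relative_distance(cpu1_assoc, cpu2_assoc);
}
```

Keeping the Form 1 logic behind a private helper means a later form only needs another branch in cpu_relative_distance(), which appears to be the simplification the commit message refers to.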



[PATCH v6 5/6] powerpc/pseries: Add support for FORM2 associativity

2021-07-27 Thread Aneesh Kumar K.V
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/powerpc/associativity.rst   | 103 +
 arch/powerpc/include/asm/firmware.h   |   3 +-
 arch/powerpc/include/asm/prom.h   |   1 +
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 168 ++
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 252 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..b6c89706ca03
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,103 @@
+
+NUMA resource associativity
+===========================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resource subsets of a given domain that exhibit better
+performance relative to each other than relative to other resource subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form 2
+associativity grouping. Form 0 is the oldest format and is now considered deprecated.
+
+Hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+------
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+------
+With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
+device tree properties are used to determine the NUMA distance between resource groups/domains.
+
+The “ibm,associativity” property contains a list of one or more numbers (domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains a list of one or more numbers
+(domainID index) that represents the 1 based ordinal in the associativity lists.
+The list of domainID indexes represents an increasing hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
+Linux kernel computes NUMA distance between two domains by recursively comparing
+if they belong to the same higher-level domains. For mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+------
+Form 2 associativity format adds separate device tree properties representing NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows flexible primary
+domain numbering. With numa distance computation now detached from the index value in
+"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
+ids at the same domainID index representing resource groups of different performance/latency
+characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
+the domainIDs present in the system. The offset of the domainID in this property is
+used as an index while computing numa distance information via "ibm,numa-distance-table".
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
+N domainID encoded as with encode-int
+
+For ex:
+"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
+computing the distance of domain 8 from other domains present in the system. For the rest of
+this document, this offset will be referred to as domain distance offset.
+
+"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
+distance between resource groups/domains present in the syst

[PATCH v6 2/6] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY

2021-07-27 Thread Aneesh Kumar K.V
Also make a related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.
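
As background for the OV5_* constants renamed below, the sketch shows one way to read the 0x0580 encoding: the high byte selects a byte of the "ibm,architecture-vec-5" property and the low byte is a mask within it, with bit 0 being the most significant bit. The helper names are made up for this note, and the Form 2 value 0x0520 is derived here from "bit 2 of byte 5" rather than copied from a hunk shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the prom.h hunk in this patch. */
#define OV5_FORM1_AFFINITY 0x0580
/* Assumption: bit 2 of byte 5 under the same MSB-first bit numbering. */
#define SKETCH_FORM2_AFFINITY 0x0520

/* Hypothetical helpers: split an OV5 value into byte index and bit mask. */
static unsigned int ov5_byte_index(unsigned int feature)
{
	return feature >> 8;
}

static uint8_t ov5_bit_mask(unsigned int feature)
{
	return (uint8_t)(feature & 0xff);
}

/* Test one feature against a raw copy of the option vector 5 bytes. */
static bool ov5_has_feature(const uint8_t *vec5, unsigned int feature)
{
	return (vec5[ov5_byte_index(feature)] & ov5_bit_mask(feature)) != 0;
}
```

With byte 5 set to 0x80 only the Form 1 test succeeds; with 0xa0 both tests succeed.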

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/firmware.h   |  4 +--
 arch/powerpc/include/asm/prom.h   |  2 +-
 arch/powerpc/kernel/prom_init.c   |  2 +-
 arch/powerpc/mm/numa.c| 35 ++-
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPALASM_CONST(0x1000)
 #define FW_FEATURE_SET_MODEASM_CONST(0x4000)
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRNASM_CONST(0x0002)
 #define FW_FEATURE_DRMEM_V2ASM_CONST(0x0004)
 #define FW_FEATURE_DRC_INFOASM_CONST(0x0008)
@@ -69,7 +69,7 @@ enum {
FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI0x0201  /* PCIe/MSI support */
 #define OV5_CMO0x0480  /* Cooperative Memory 
Overcommitment */
 #define OV5_XCMO   0x0440  /* Page Coalescing */
-#define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY 0x0580  /* FORM1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
 #define OV5_HP_EVT 0x0604  /* Hot Plug Event support */
 #define OV5_RESIZE_HPT 0x0601  /* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index a5bf355ce1d6..57db605ad33a 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1096,7 +1096,7 @@ static const struct ibm_arch_vec 
ibm_architecture_vec_template __initconst = {
 #else
0,
 #endif
-   .associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
+   .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
.micro_checkpoint = 0,
.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..368719b14dcc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int distance_ref_points_depth;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
int i;
int distance = LOCAL_DISTANCE;
 
-   if (!form1_affinity)
+   if (affinity_form == FORM0_AFFINITY)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
for (i = 0; i < distance_ref_points_depth; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
int i;
 
-   if (!form1_affinity)
+   if (affinity_form != FORM1_AFFINITY)
return;
 
for (i = 0; i < distance_ref_points_depth; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
int index;
struct device_node *root;
 
+   /*
+* Check for which form of affinity.
+*/
+   if (firmware_has_feature(FW_FEATURE_OPAL)) {
+   affinity_form = FORM1_AFFINITY;
+   } else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+   dbg("Using form 1 affinity\n");
+   affinity_form = FORM1_AFFINITY;
+   } else
+   affinity_form = FORM0_AFFINITY;
+
if (firmware_has_feature(FW_FEATURE_OPAL))
root = of_find_node_by_path("/ibm,opal");
else
@@

[PATCH v6 6/6] powerpc/pseries: Consolidate form1 distance initialization into a helper

2021-07-27 Thread Aneesh Kumar K.V
Currently, we duplicate the parsing code for ibm,associativity and
ibm,associativity-lookup-arrays in the kernel. The associativity arrays provided
by these device tree properties are very similar and hence can share
a helper to parse the node id and NUMA distance details.
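
The shape of the consolidation can be sketched in user space as follows. One helper works on a "bare" array plus an explicit length, and a thin wrapper handles the length-prefixed ibm,associativity layout by skipping the first cell. The values of primary_domain_index and nr_node_ids are made up, byte-order conversion is omitted, and 0xffff as the invalid-node marker is an assumption.

```c
#include <stdint.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical values for the sketch; the kernel derives these at boot. */
static int primary_domain_index = 4;	/* 1-based ordinal */
static int nr_node_ids = 8;

/* Look up the node id in a bare associativity array of max_array_sz cells. */
static int bare_assoc_to_nid(const uint32_t *assoc, int max_array_sz)
{
	int index = primary_domain_index - 1;	/* convert to 0-based */
	int nid;

	if (index >= max_array_sz)
		return NUMA_NO_NODE;

	nid = (int)assoc[index];
	/* Assumption: 0xffff marks an invalid node. */
	if (nid == 0xffff || nid >= nr_node_ids)
		return NUMA_NO_NODE;

	return nid;
}

/* ibm,associativity layout: the first cell holds the number of entries. */
static int prefixed_assoc_to_nid(const uint32_t *assoc)
{
	return bare_assoc_to_nid(assoc + 1, (int)assoc[0]);
}
```

A lookup-array row is passed straight to bare_assoc_to_nid() with aa.array_sz as the length, which is why no "- 1" adjustment is needed at the call sites any more.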

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 83 ++
 1 file changed, 51 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index fffb3c40f595..7506251e17f2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -171,19 +171,19 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-/*
- * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
- * info is found.
- */
-static int associativity_to_nid(const __be32 *associativity)
+static int __associativity_to_nid(const __be32 *associativity,
+ int max_array_sz)
 {
int nid = NUMA_NO_NODE;
+   /*
+* primary_domain_index is 1 based array index.
+*/
+   int index = primary_domain_index  - 1;
 
-   if (!numa_enabled)
+   if (!numa_enabled || index >= max_array_sz)
goto out;
 
-   if (of_read_number(associativity, 1) >= primary_domain_index)
-   nid = of_read_number(&associativity[primary_domain_index], 1);
+   nid = of_read_number(&associativity[index], 1);
 
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
@@ -191,6 +191,17 @@ static int associativity_to_nid(const __be32 
*associativity)
 out:
return nid;
 }
+/*
+ * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
+ * info is found.
+ */
+static int associativity_to_nid(const __be32 *associativity)
+{
+   int array_sz = of_read_number(associativity, 1);
+
+   /* Skip the first element in the associativity array */
+   return __associativity_to_nid((associativity + 1), array_sz);
+}
 
 static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 
*cpu2_assoc)
 {
@@ -295,24 +306,41 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static void __initialize_form1_numa_distance(const __be32 *associativity)
+static void ___initialize_form1_numa_distance(const __be32 *associativity,
+int max_array_sz)
 {
int i, nid;
 
if (affinity_form != FORM1_AFFINITY)
return;
 
-   nid = associativity_to_nid(associativity);
+   nid = __associativity_to_nid(associativity, max_array_sz);
if (nid != NUMA_NO_NODE) {
for (i = 0; i < distance_ref_points_depth; i++) {
const __be32 *entry;
+   int index = be32_to_cpu(distance_ref_points[i]) - 1;
+
+   /*
+* broken hierarchy, return with broken distance table
+*/
+   if (index >= max_array_sz)
+   return;
 
-   entry = &associativity[be32_to_cpu(distance_ref_points[i])];
+   entry = &associativity[index];
distance_lookup_table[nid][i] = of_read_number(entry, 1);
}
}
 }
 
+static void __initialize_form1_numa_distance(const __be32 *associativity)
+{
+   int array_sz;
+
+   array_sz = of_read_number(associativity, 1);
+   /* Skip the first element in the associativity array */
+   ___initialize_form1_numa_distance(associativity + 1, array_sz);
+}
+
 static void initialize_form1_numa_distance(struct device_node *node)
 {
const __be32 *associativity;
@@ -586,27 +614,18 @@ static int get_nid_and_numa_distance(struct drmem_lmb 
*lmb)
 
if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
-   nid = of_read_number(&aa.arrays[index], 1);
+   const __be32 *associativity;
 
-   if (nid == 0xffff || nid >= nr_node_ids)
-   nid = default_nid;
+   index = lmb->aa_index * aa.array_sz;
+   associativity = &aa.arrays[index];
+   nid = __associativity_to_nid(associativity, aa.array_sz);
if (nid > 0 && affinity_form == FORM1_AFFINITY) {
-   int i;
-   const __be32 *associativity;
-
-   index = lmb->aa_index * aa.array_sz;
-   associativity = &aa.arrays[index];
/*
-* lookup array associativity entries have different 
format
-* There is no length of the array as the first element.
+* lookup array associativity entries have

[PATCH v6 3/6] powerpc/pseries: Consolidate different NUMA distance update code paths

2021-07-27 Thread Aneesh Kumar K.V
The associativity details of newly added resources are collected from
the hypervisor via the "ibm,configure-connector" RTAS call. Update the NUMA
distance details of the newly added NUMA node after the above call.

Instead of updating the NUMA distance every time we look up a node id
from the associativity property, add helpers that can be used
during boot to do this only once. Also remove the distance
update from the node id lookup helpers.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/topology.h   |   2 +
 arch/powerpc/mm/numa.c| 178 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 4 files changed, 138 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..a6425a70c37b 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -64,6 +64,7 @@ static inline int early_cpu_to_node(int cpu)
 }
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
+void update_numa_distance(struct device_node *node);
 
 #else
 
@@ -93,6 +94,7 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb 
*lmb)
return first_online_node;
 }
 
+static inline void update_numa_distance(struct device_node *node) {}
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 368719b14dcc..c695faf67d68 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,22 +208,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-   const __be32 *associativity)
-{
-   int i;
-
-   if (affinity_form != FORM1_AFFINITY)
-   return;
-
-   for (i = 0; i < distance_ref_points_depth; i++) {
-   const __be32 *entry;
-
-   entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
-   distance_lookup_table[nid][i] = of_read_number(entry, 1);
-   }
-}
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
@@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 
*associativity)
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
nid = NUMA_NO_NODE;
-
-   if (nid > 0 &&
-   of_read_number(associativity, 1) >= distance_ref_points_depth) {
-   /*
-* Skip the length field and send start of associativity array
-*/
-   initialize_distance_lookup_table(nid, associativity + 1);
-   }
-
 out:
return nid;
 }
@@ -287,6 +262,48 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity)
+{
+   int i, nid;
+
+   if (affinity_form != FORM1_AFFINITY)
+   return;
+
+   nid = associativity_to_nid(associativity);
+   if (nid != NUMA_NO_NODE) {
+   for (i = 0; i < distance_ref_points_depth; i++) {
+   const __be32 *entry;
+
+   entry = &associativity[be32_to_cpu(distance_ref_points[i])];
+   distance_lookup_table[nid][i] = of_read_number(entry, 1);
+   }
+   }
+}
+
+static void initialize_form1_numa_distance(struct device_node *node)
+{
+   const __be32 *associativity;
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return;
+
+   __initialize_form1_numa_distance(associativity);
+}
+
+/*
+ * Used to update distance information w.r.t newly added node.
+ */
+void update_numa_distance(struct device_node *node)
+{
+   if (affinity_form == FORM0_AFFINITY)
+   return;
+   else if (affinity_form == FORM1_AFFINITY) {
+   initialize_form1_numa_distance(node);
+   return;
+   }
+}
+
 static int __init find_primary_domain_index(void)
 {
int index;
@@ -433,6 +450,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
return 0;
 }
 
+static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
+{
+   struct assoc_arrays aa = { .arrays = NULL };
+   int default_nid = NUMA_NO_NODE;
+   int nid = default_nid;
+   int rc, index;
+
+   if ((primary_domain_index < 0) || !numa_enabled)
+   return default_nid;
+
+   rc = of_get_assoc_arrays(&aa);
+   if (rc)
+   return default_nid;
+
+   if (primary_domain_index <= aa.array_sz &&
+   !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
+   nid = of_rea

[PATCH v6 0/6] Add support for FORM2 associativity

2021-07-27 Thread Aneesh Kumar K.V
Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 5.
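
For contrast, the Form 1 rule described in the documentation patch (double the distance for every mismatching level) can be sketched as below. The table contents and the depth of 2 are invented for illustration; the loop body follows the documentation text and is not copied from a hunk in this series.

```c
#define LOCAL_DISTANCE 10
#define MAX_DISTANCE_REF_POINTS 4
#define SKETCH_MAX_NODES 4	/* hypothetical node count */

/* Hypothetical topology: per node, the domain id at each reference level. */
static int distance_ref_points_depth = 2;
static const unsigned int
distance_lookup_table[SKETCH_MAX_NODES][MAX_DISTANCE_REF_POINTS] = {
	{ 0, 0 },
	{ 1, 0 },
	{ 2, 1 },
	{ 3, 1 },
};

/* Start at LOCAL_DISTANCE and double once per level that does not match. */
static int form1_node_distance(int a, int b)
{
	int distance = LOCAL_DISTANCE;

	for (int i = 0; i < distance_ref_points_depth; i++) {
		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
			break;
		distance *= 2;
	}
	return distance;
}
```

With a depth of 2 this can only ever produce 10, 20 or 40, and always the same value in both directions. The asymmetric matrices in the numactl output below are not expressible this way.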

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$

After DAX kmem memory add
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  222  33  240 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  230 
  4:  255  255  255  230  10 
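
A rough sketch of how a Form 2 lookup could produce such a matrix is shown below. It assumes that "ibm,numa-distance-table" is a flat, row-major N x N array of one-byte distances indexed by the domain distance offset; that layout is an assumption of this note. The domain IDs reuse the {0, 8, 250, 252} example from the documentation patch and the distances reuse the first 4-node matrix above; pairing the two is purely illustrative.

```c
#include <stdint.h>

#define NR_DOMAINS 4

/* Domain IDs in lookup-index-table order (example values from the docs). */
static const uint32_t lookup_index_table[NR_DOMAINS] = { 0, 8, 250, 252 };

/* Assumed layout: row-major, distance_table[from * NR_DOMAINS + to]. */
static const uint8_t distance_table[NR_DOMAINS * NR_DOMAINS] = {
	 10,  11, 222,  33,
	 44,  10,  55,  66,
	 77,  88,  10,  99,
	101, 121, 132,  10,
};

/* Return the domain distance offset of a domain ID, or -1 if unknown. */
static int domain_distance_offset(uint32_t domain_id)
{
	for (int i = 0; i < NR_DOMAINS; i++)
		if (lookup_index_table[i] == domain_id)
			return i;
	return -1;
}

/* Distance from one domain to another, or -1 if either ID is unknown. */
static int form2_distance(uint32_t from, uint32_t to)
{
	int row = domain_distance_offset(from);
	int col = domain_distance_offset(to);

	if (row < 0 || col < 0)
		return -1;
	return distance_table[row * NR_DOMAINS + col];
}
```

Because each direction has its own cell, the distance from domain 8 to domain 250 (55) can differ from the reverse direction (88).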


PAPR SCM now uses the NUMA distance details to find the numa_node and target_node
for the device.

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v 
[
  {
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":1071644672,
"uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
"raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
"daxregion":{
  "id":0,
  "size":1071644672,
  "devices":[
{
  "chardev":"dax0.0",
  "size":1071644672,
  "target_node":4,
  "mode":"devdax"
}
  ]
},
"align":2097152,
"numa_node":3
  }
]
kvaneesh@ubuntu-guest:~$ 
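
The numa_node value above is consistent with picking the closest existing node to the target node. The sketch below only illustrates that reading with a plain minimum search over row 4 of the distance matrix; it is not the kernel's selection code, and resolving ties to the lowest node number is an assumption.

```c
/*
 * Return the index of the smallest distance among nr_candidates entries,
 * or -1 when there are no candidates. Ties go to the lowest index.
 */
static int nearest_node(const int *dist, int nr_candidates)
{
	int best = -1;

	for (int i = 0; i < nr_candidates; i++)
		if (best < 0 || dist[i] < dist[best])
			best = i;
	return best;
}
```

Fed with the distances from node 4 to nodes 0..3 (255, 255, 255, 230), the search returns node 3, matching "numa_node":3 for "target_node":4.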


The above output is with a Qemu command line

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa 
dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa 
dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa 
dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa 
dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa 
dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object 
memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}
  \
-device 
nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/

Changes from v5:
* Fix build error reported by kernel test robot
* Address review feedback 

Changes from v4:
* Drop DLPAR related device tree property for now because both Qemu and PowerVM
  will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance 
information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index


Aneesh Kumar K.V (6):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate different NUMA distance update code paths
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity
  powerpc/pseries: Consolidate form1 distance initialization into a
helper

 Documentation/powerpc/associativity.rst   | 103 
 arch/powerpc/include/asm/firmware.h   |   7 +-
 arch/powerpc/include/asm/prom.h   |   3 +-
 arch/powerpc/include/asm/topology.h   |   6 +-
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 440 ++
 arch/powerpc/platforms/pseries/firmware.c |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 arch/powerpc/platforms/pseries/lpar.c |   4 +-
 10 files changed, 462 insertions(+), 111 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1



[PATCH v6 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index

2021-07-27 Thread Aneesh Kumar K.V
No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
if (!numa_enabled)
goto out;
 
-   if (of_read_number(associativity, 1) >= min_common_depth)
-   nid = of_read_number(&associativity[min_common_depth], 1);
+   if (of_read_number(associativity, 1) >= primary_domain_index)
+   nid = of_read_number(&associativity[primary_domain_index], 1);
 
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-   int depth;
+   int index;
struct device_node *root;
 
if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
}
 
if (form1_affinity) {
-   depth = of_read_number(distance_ref_points, 1);
+   index = of_read_number(distance_ref_points, 1);
} else {
if (distance_ref_points_depth < 2) {
printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
goto err;
}
 
-   depth = of_read_number(&distance_ref_points[1], 1);
+   index = of_read_number(&distance_ref_points[1], 1);
}
 
/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
}
 
of_node_put(root);
-   return depth;
+   return index;
 
 err:
of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
int nid = default_nid;
int rc, index;
 
-   if ((min_common_depth < 0) || !numa_enabled)
+   if ((primary_domain_index < 0) || !numa_enabled)
return default_nid;
 
rc = of_get_assoc_arrays(&aa);
if (rc)
return default_nid;
 
-   if (min_common_depth <= aa.array_sz &&
+   if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
nid = of_read_number(&aa.arrays[index], 1);
 
if (nid == 0xffff || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
return -1;
}
 
-   min_common_depth = find_min_common_depth();
+   primary_domain_index = find_primary_domain_index();
 
-   if (min_common_depth < 0) {
+   if (primary_domain_index < 0) {
/*
-* if we fail to parse min_common_depth from device tree
+* if we fail to parse primary_domain_index from device tree
 * mark the numa disabled, boot with numa disabled.
 */
numa_enabled = false;
-   return min_common_depth;
+   return primary_domain_index;
}
 
-   dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+   dbg("NUMA associativity depth for CPU/Memory: %d\n", 
primary_domain_index);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
goto out;
}
 
-   max_nodes = of_read_number([min_common_depth], 1);
+   max_nodes = of_read_number([primary_domain_index], 1);
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
}
 
prop_length /= sizeof(int);
-   if (prop_length > min_common_depth + 2)
+   if (prop_length > primary_domain_index + 2)
coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
goto out;
 
index = of_read_number(associativity, 1);
-   if (index > min_common_depth + 1)
+   if (index > primary_domain_index + 1)
return of_read_number(&associativity[index - 1], 1);
 
 out:
-- 
2.31.1



Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths

2021-07-26 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Thu, Jul 22, 2021 at 12:37:46PM +0530, Aneesh Kumar K.V wrote:
>> David Gibson  writes:
>> 
>> > On Mon, Jun 28, 2021 at 08:41:15PM +0530, Aneesh Kumar K.V wrote:



> 
>> >
>> >> + nid = of_read_number(&aa.arrays[index], 1);
>> >> +
>> >> + if (nid == 0xffff || nid >= nr_node_ids)
>> >> + nid = default_nid;
>> >> + if (nid > 0 && affinity_form == FORM1_AFFINITY) {
>> >> + int i;
>> >> + const __be32 *associativity;
>> >> +
>> >> + index = lmb->aa_index * aa.array_sz;
>> >> + associativity = &aa.arrays[index];
>> >> + /*
>> >> +  * lookup array associativity entries have different format
>> >> +  * There is no length of the array as the first element.
>> >
>> > The difference is very small, and this is not a hot path.  Couldn't
>> > you reduce a chunk of code by prepending aa.array_sz, then re-using
>> > __initialize_form1_numa_distance()?  Or even making
>> > __initialize_form1_numa_distance() take the length as a parameter.
>> 
>> The changes are small but confusing w.r.t how we look at the
>> associativity-lookup-arrays. The way we interpret associativity array
>> and associativity lookup array using primary_domain_index is different.
>> Hence the '-1' in the node lookup here.
>
> They're really not, though.  It's exactly the same interpretation of
> the associativity array itself - it's just that one of them has the
> array prepended with a (redundant) length.  So you can make
> __initialize_form1_numa_distance() work on the "bare" associativity
> array, with a given length.  Here you call it with aa.array_sz as the
> length, and in the other place you call it with prop[0] as the length.
>
>> 
>>  index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
>>  nid = of_read_number(&aa.arrays[index], 1);
>> 
>> 
>> >
>> >> +  */
>> >> + for (i = 0; i < max_associativity_domain_index; i++) {
>> >> + const __be32 *entry;
>> >> +
>> >> + entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
>> >
>> > Does anywhere verify that distance_ref_points[i] <= aa.array_size for
>> > every i?
>> 
>> We do check for 
>> 
>>  if (primary_domain_index <= aa.array_sz &&
>
> Right, but that doesn't check the other distance_ref_points entries.
> Not that there's any reason to have extra entries with Form2, but we
> still don't want stray array accesses.

This is how the change looks. I am not convinced this makes it simpler.
I will add it as the last patch, and we can drop it if we find it is
not helpful.

modified   arch/powerpc/mm/numa.c
@@ -171,20 +171,31 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-/*
- * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
- * info is found.
- */
-static int associativity_to_nid(const __be32 *associativity)
+static int __associativity_to_nid(const __be32 *associativity,
+ bool lookup_array_assoc,
+ int max_array_index)
 {
int nid = NUMA_NO_NODE;
+   int index;
 
if (!numa_enabled)
goto out;
+   /*
+* ibm,associativity-lookup-array doesn't have element
+* count at the start of the associativity. Hence
+* decrement the primary_domain_index when used with
+* lookup-array associativity.
+*/
+   if (lookup_array_assoc)
+   index = primary_domain_index - 1;
+   else {
+   index = primary_domain_index;
+   max_array_index = of_read_number(associativity, 1);
+   }
+   if (index > max_array_index)
+   goto out;
 
-   if (of_read_number(associativity, 1) >= primary_domain_index)
-   nid = of_read_number(&associativity[primary_domain_index], 1);
-
+   nid = of_read_number(&associativity[index], 1);
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
nid = NUMA_NO_NODE;
@@ -192,6 +203,15 @@ static int associativity_to_nid(const __be32 
*associativity)
return nid;
 }
 
+/*
+ * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
+ * info is found.
+ */
+static inline int associativity_to_nid(const __be32 *associativity)
+{
+   return

Re: [PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity

2021-07-22 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating 
>> resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  Documentation/powerpc/associativity.rst   | 103 ++
>>  arch/powerpc/include/asm/firmware.h   |   3 +-
>>  arch/powerpc/include/asm/prom.h   |   1 +
>>  arch/powerpc/kernel/prom_init.c   |   3 +-
>>  arch/powerpc/mm/numa.c| 157 ++
>>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>>  6 files changed, 242 insertions(+), 26 deletions(-)
>>  create mode 100644 Documentation/powerpc/associativity.rst
>> 
>> diff --git a/Documentation/powerpc/associativity.rst 
>> b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index ..31cc7da2c7a6
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,103 @@
>> +
>> +NUMA resource associativity
>> +=
>> +
>> +Associativity represents the groupings of the various platform resources 
>> into
>> +domains of substantially similar mean performance relative to resources 
>> outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux 
>> kernel.
>> +From the platform view, these groups are also referred to as domains.
>
> Pretty hard to decipher, but that's typical for PAPR.
>
>> +PAPR interface currently supports different ways of communicating these 
>> resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and 
>> Form2
>> +associativity grouping. Form 0 is the older format and is now considered 
>> deprecated.
>
> Nit: s/older/oldest/ since there are now >2 forms.

updated.

>
>> +Hypervisor indicates the type/form of associativity used via 
>> "ibm,architecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
>> Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 
>> associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-
>> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-
>> +With Form 1 a combination of ibm,associativity-reference-points, and 
>> ibm,associativity
>> +device tree properties are used to determine the NUMA distance between 
>> resource groups/domains.
>> +
>> +The “ibm,associativity” property contains a list of one or more numbers 
>> (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains a list of one or 
>> more numbers
>> +(domainID index) that represents the 1 based ordinal in the associativity 
>> lists.
>> +The list of domainID indexes represents an increasing hierarchy of resource 
>> grouping.
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID 
>> index.. }
>> +
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA 
>> node id.
>> +Linux kernel computes NUMA distance between two domains by recursively 
>> comparing
>> +if they belong to the same higher-level domains. For mismatch at every 
>> higher
>> +level of the resource group, the kernel doubles the NUMA distance between 
>> the
>> +comparing domains.
>> +
>> +Form 2
>> +---
>> +Form 2 associativity format adds separate device tree properties 
>> representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows 
>> flexible primary
>> +domain numbering. With numa distance computation now detached from the 
>> index value in
>> +"ibm,associativity-reference-points" property, Form 2 allows a large number 
>> of primary domain

Re: [PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance

2021-07-22 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Mon, Jun 28, 2021 at 08:41:16PM +0530, Aneesh Kumar K.V wrote:
>> This helper is only used with the dispatch trace log collection.
>> A later patch will add Form2 affinity support and this change helps
>> in keeping that simpler. Also add a comment explaining we don't expect
>> the code to be called with FORM0
>> 
>> Reviewed-by: David Gibson 
>> Signed-off-by: Aneesh Kumar K.V 
>
> What makes it a "relative_distance" rather than just a "distance"?

I added that to indicate that the function is not returning the actual
distance but a number indicative of 'near', 'far' etc. (it actually returns
1, 2 etc).
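
To make the "relative" part concrete, here is a minimal userspace sketch of
how a Form 1 style relative distance can be derived. It is not the kernel
code: the function name, the host-byte-order arrays and the explicit
ref_points/nr_ref_points parameters are assumptions made for illustration.
The result is a count of hierarchy levels at which the two CPUs differ, not
a NUMA distance value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical, simplified model: associativity arrays are in host byte
 * order, element 0 holds the element count, and ref_points holds 1-based
 * indexes into those arrays, primary domain first.
 */
int form1_relative_distance(const uint32_t *cpu1_assoc,
			    const uint32_t *cpu2_assoc,
			    const uint32_t *ref_points, int nr_ref_points)
{
	int dist = 0;

	assert(cpu1_assoc && cpu2_assoc && ref_points);

	for (int i = 0; i < nr_ref_points; i++) {
		uint32_t index = ref_points[i];

		/* Stop at the first level where both CPUs share a domain. */
		if (cpu1_assoc[index] == cpu2_assoc[index])
			break;
		dist++;
	}
	return dist;
}
```

With reference points { 4, 2 }, two CPUs that differ only in the domain at
index 4 get a relative distance of 1, and CPUs that differ at both levels
get 2.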

>
>> ---
>>  arch/powerpc/include/asm/topology.h   |  4 ++--
>>  arch/powerpc/mm/numa.c| 10 +-
>>  arch/powerpc/platforms/pseries/lpar.c |  4 ++--
>>  3 files changed, 13 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/topology.h 
>> b/arch/powerpc/include/asm/topology.h
>> index e4db64c0e184..ac8b5ed79832 100644
>> --- a/arch/powerpc/include/asm/topology.h
>> +++ b/arch/powerpc/include/asm/topology.h
>> @@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
>>   cpu_all_mask : \
>>   cpumask_of_node(pcibus_to_node(bus)))
>>  
>> -extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
>> +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
>>  extern int __node_distance(int, int);
>>  #define node_distance(a, b) __node_distance(a, b)
>>  
>> @@ -83,7 +83,7 @@ static inline void sysfs_remove_device_from_node(struct 
>> device *dev,
>>  
>>  static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) 
>> {}
>>  
>> -static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 
>> *cpu2_assoc)
>>  {
>>  return 0;
>>  }
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 7b142f79d600..c6293037a103 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
>>  }
>>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>>  
>> -int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 
>> *cpu2_assoc)
>>  {
>>  int dist = 0;
>>  
>> @@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  return dist;
>>  }
>>  
>> +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +{
>> +/* We should not get called with FORM0 */
>> +VM_WARN_ON(affinity_form == FORM0_AFFINITY);
>> +
>> +return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
>> +}
>> +
>>  /* must hold reference to node during call */
>>  static const __be32 *of_get_associativity(struct device_node *dev)
>>  {
>> diff --git a/arch/powerpc/platforms/pseries/lpar.c 
>> b/arch/powerpc/platforms/pseries/lpar.c
>> index dab356e3ff87..afefbdfe768d 100644
>> --- a/arch/powerpc/platforms/pseries/lpar.c
>> +++ b/arch/powerpc/platforms/pseries/lpar.c
>> @@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int 
>> last_disp_cpu, int cur_disp_cpu)
>>  if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
>>  return -EIO;
>>  
>> -return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
>> +return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
>>  }
>>  
>>  static int cpu_home_node_dispatch_distance(int disp_cpu)
>> @@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
>>  if (!disp_cpu_assoc || !vcpu_assoc)
>>  return -EIO;
>>  
>> -return cpu_distance(disp_cpu_assoc, vcpu_assoc);
>> +return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
>>  }
>>  
>>  static void update_vcpu_disp_stat(int disp_cpu)
>
> -- 
> David Gibson  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au| minimalist, thank you.  NOT _the_ 
> _other_
>   | _way_ _around_!
> http://www.ozlabs.org/~dgibson


Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths

2021-07-22 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Mon, Jun 28, 2021 at 08:41:15PM +0530, Aneesh Kumar K.V wrote:
>> The associativity details of the newly added resources are collected from
>> the hypervisor via "ibm,configure-connector" rtas call. Update the numa
>> distance details of the newly added numa node after the above call.
>> 
>> Instead of updating NUMA distance every time we lookup a node id
>> from the associativity property, add helpers that can be used
>> during boot which does this only once. Also remove the distance
>> update from node id lookup helpers.
>> 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/mm/numa.c| 173 +-
>>  arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
>>  .../platforms/pseries/hotplug-memory.c|   2 +
>>  arch/powerpc/platforms/pseries/pseries.h  |   1 +
>>  4 files changed, 132 insertions(+), 46 deletions(-)
>> 
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 0ec16999beef..7b142f79d600 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -208,22 +208,6 @@ int __node_distance(int a, int b)
>>  }
>>  EXPORT_SYMBOL(__node_distance);
>>  
>> -static void initialize_distance_lookup_table(int nid,
>> -const __be32 *associativity)
>> -{
>> -int i;
>> -
>> -if (affinity_form != FORM1_AFFINITY)
>> -return;
>> -
>> -for (i = 0; i < max_associativity_domain_index; i++) {
>> -const __be32 *entry;
>> -
>> -entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
>> -distance_lookup_table[nid][i] = of_read_number(entry, 1);
>> -}
>> -}
>> -
>>  /*
>>   * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>>   * info is found.
>> @@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 
>> *associativity)
>>  /* POWER4 LPAR uses 0xffff as invalid node */
>>  if (nid == 0xffff || nid >= nr_node_ids)
>>  nid = NUMA_NO_NODE;
>> -
>> -if (nid > 0 &&
>> -of_read_number(associativity, 1) >= 
>> max_associativity_domain_index) {
>> -/*
>> - * Skip the length field and send start of associativity array
>> - */
>> -initialize_distance_lookup_table(nid, associativity + 1);
>> -}
>> -
>>  out:
>>  return nid;
>>  }
>> @@ -287,6 +262,49 @@ int of_node_to_nid(struct device_node *device)
>>  }
>>  EXPORT_SYMBOL(of_node_to_nid);
>>  
>> +static void __initialize_form1_numa_distance(const __be32 *associativity)
>> +{
>> +int i, nid;
>> +
>> +if (affinity_form != FORM1_AFFINITY)
>
> Since this shouldn't be called on a !form1 system, this could be a WARN_ON().

The way we call functions currently, instead of doing

if (affinity_form == FORM1_AFFINITY)
__initialize_form1_numa_distance()

We avoid repeating the if check in multiple places. For example,
parse_numa_properties fetches the associativity array to find the
details of each online node and sets it online. We use the same code path
to initialize the distance.

if (__vphn_get_associativity(i, vphn_assoc) == 0) {
nid = associativity_to_nid(vphn_assoc);
__initialize_form1_numa_distance(vphn_assoc);
} else {

cpu = of_get_cpu_node(i, NULL);
BUG_ON(!cpu);

associativity = of_get_associativity(cpu);
if (associativity) {
nid = associativity_to_nid(associativity);
__initialize_form1_numa_distance(associativity);
}

We avoid the if (affinity_form == FORM1_AFFINITY) check there by
moving the check inside __initialize_form1_numa_distance().
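
As an illustration of that design choice, here is a small userspace sketch,
not the kernel implementation. The globals, sizes and names below are
hypothetical stand-ins, values are in host byte order, and the extra bounds
checks are my own addition. The point is only that the form check sits
inside the helper, so every caller can invoke it unconditionally.

```c
#include <assert.h>
#include <stdint.h>

enum affinity { FORM0, FORM1 };

#define MAX_NODES  4
#define MAX_LEVELS 2

/* Hypothetical stand-ins for the kernel's globals. */
enum affinity affinity_form = FORM1;
uint32_t ref_points[MAX_LEVELS] = { 4, 2 };
uint32_t primary_index = 4;
uint32_t distance_table[MAX_NODES][MAX_LEVELS];

/* assoc[0] is the element count; assoc[1..] are domain IDs (host order). */
void init_form1_distance(const uint32_t *assoc)
{
	uint32_t nid;

	assert(assoc);

	/* The form check lives here, so callers need no guard of their own. */
	if (affinity_form != FORM1)
		return;
	if (assoc[0] < primary_index)
		return;

	nid = assoc[primary_index];
	if (nid >= MAX_NODES)
		return;

	for (int i = 0; i < MAX_LEVELS; i++) {
		/* Skip reference points that fall outside this array. */
		if (ref_points[i] > assoc[0])
			continue;
		distance_table[nid][i] = assoc[ref_points[i]];
	}
}
```

A caller simply pairs the nid lookup with init_form1_distance(assoc). On a
non-Form 1 system the call is a no-op and the table stays untouched.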


>
>> +return;
>> +
>> +if (of_read_number(associativity, 1) >= primary_domain_index) {
>> +nid = of_read_number(&associativity[primary_domain_index], 1);
>
> This computes the nid from the assoc array independently of
> associativity_to_nid, which doesn't seem like a good idea.  Wouldn't
> it be better to call associativity_to_nid(), then make the next bit
> conditional on nid != NUMA_NO_NODE?

@@ -302,9 +302,8 @@ static void __initialize_form1_numa_distance(const __be32 
*associativity)
if (affinity_form != FORM1_AFFINITY)
return;
 
-   if (of_read_number(associativity, 1) >= primary_domain_index) {
-   nid = of_re

Re: [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index

2021-07-21 Thread Aneesh Kumar K.V

On 7/22/21 8:06 AM, David Gibson wrote:

On Thu, Jul 22, 2021 at 11:59:15AM +1000, David Gibson wrote:

On Mon, Jun 28, 2021 at 08:41:12PM +0530, Aneesh Kumar K.V wrote:

No functional change in this patch.


The new name does not match how you describe "primary domain index" in
the documentation from patch 6/6.  There it comes from the values in
associativity-reference-points, but here it simply comes from the
lengths of all the associativity properties.


No, sorry, I misread this code... misled by the old name, so it's a
good thing you're changing it.

But.. I'm still not sure the new name is accurate, either...

[snip]

if (form1_affinity) {
-   depth = of_read_number(distance_ref_points, 1);
+   index = of_read_number(distance_ref_points, 1);


AFAICT distance_ref_points hasn't been altered from the
of_get_property() at this point, so isn't this setting depth / index
to the number of entries in ref-points, rather than the value of the
first entry (which is what primary domain index is supposed to be).



ibm,associativity-reference-points property format is as below.

# lsprop  ibm,associativity-reference-points
ibm,associativity-reference-points
 0004 0002

it doesn't have the number of elements as the first item.

For FORM1, the 1st element is the NUMA boundary index/primary_domain_index.
For FORM0, the 2nd element is the NUMA boundary index/primary_domain_index.
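
A minimal sketch of that selection rule, assuming the property cells have
already been converted to host byte order. The helper name and its
parameters are hypothetical and only mirror the rule stated above; this is
not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: ref_points holds the cells of
 * ibm,associativity-reference-points in host byte order and nr_cells is
 * the number of cells in the property (there is no leading count cell).
 * Returns the primary domain index, or -1 if the property is too short.
 */
int pick_primary_domain_index(const uint32_t *ref_points, int nr_cells,
			      bool form1)
{
	assert(ref_points);

	/* FORM1 uses the first cell. */
	if (form1)
		return nr_cells >= 1 ? (int)ref_points[0] : -1;

	/* FORM0 uses the second cell, so at least two must be present. */
	if (nr_cells < 2)
		return -1;
	return (int)ref_points[1];
}
```

For the lsprop output above (cells 4 and 2), this picks 4 for FORM1 and 2
for FORM0.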



} else {
if (distance_ref_points_depth < 2) {
printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
goto err;
}
  
-		depth = of_read_number(&distance_ref_points[1], 1);

+   index = of_read_number(&distance_ref_points[1], 1);
}
  
  	/*

@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
}
  
  	of_node_put(root);

-   return depth;
+   return index;
  
  err:

of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
int nid = default_nid;
int rc, index;
  
-	if ((min_common_depth < 0) || !numa_enabled)

+   if ((primary_domain_index < 0) || !numa_enabled)
return default_nid;
  
  	rc = of_get_assoc_arrays(&aa);

if (rc)
return default_nid;
  
-	if (min_common_depth <= aa.array_sz &&

+   if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
nid = of_read_number(&aa.arrays[index], 1);
  
  		if (nid == 0xffff || nid >= nr_node_ids)

@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
return -1;
}
  
-	min_common_depth = find_min_common_depth();

+   primary_domain_index = find_primary_domain_index();
  
-	if (min_common_depth < 0) {

+   if (primary_domain_index < 0) {
/*
-* if we fail to parse min_common_depth from device tree
+* if we fail to parse primary_domain_index from device tree
 * mark the numa disabled, boot with numa disabled.
 */
numa_enabled = false;
-   return min_common_depth;
+   return primary_domain_index;
}
  
-	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);

+   dbg("NUMA associativity depth for CPU/Memory: %d\n", 
primary_domain_index);
  
  	/*

 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
goto out;
}
  
-	max_nodes = of_read_number(&prop[min_common_depth], 1);

+   max_nodes = of_read_number(&prop[primary_domain_index], 1);
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
}
  
  	prop_length /= sizeof(int);

-   if (prop_length > min_common_depth + 2)
+   if (prop_length > primary_domain_index + 2)
coregroup_enabled = 1;
  
  out:

@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
goto out;
  
  	index = of_read_number(associativity, 1);

-   if (index > min_common_depth + 1)
+   if (index > primary_domain_index + 1)
return of_read_number(&associativity[index - 1], 1);
  
  out:










Re: [PATCH v5 0/6] Add support for FORM2 associativity

2021-07-13 Thread Aneesh Kumar K.V

On 7/13/21 7:57 PM, Daniel Henrique Barboza wrote:

Aneesh,

This series compiles with a configuration made with "pseries_le_defconfig"
but fails with a config based on an existing RHEL8 config.

The reason, which is hinted in the robot replies in patch 4, is that you 
defined
a "__vphn_get_associativity" inside a #ifdef CONFIG_PPC_SPLPAR guard but 
didn't

define how the function would behave without the config, and you ended up
using the function elsewhere.

This fixes the compilation but I'm not sure if this is what you intended
for this function:


diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c68846fc9550..6e8551d16b7a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -680,6 +680,11 @@ static int vphn_get_nid(long lcpu)

  }
  #else
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
+{
+   return -1;
+}
+
  static int vphn_get_nid(long unused)
  {
     return NUMA_NO_NODE;


I'll post a new version of the QEMU FORM2 changes using these patches as 
is (with

the above fixup), but I guess you'll want to post a v6.



kernel test robot did report that earlier and I have that fixed in my 
local tree. I haven't posted v6 yet because I want to close the review 
on the approach with v5 patchset.
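
For readers following the build failure, the general pattern is sketched
below. HAVE_SPLPAR is a hypothetical stand-in for CONFIG_PPC_SPLPAR, and
the function names and return values are illustrative assumptions rather
than the fix that is in my tree. The idea is simply that a function defined
under a config guard needs a stub on the other side if code outside the
guard calls it.

```c
#include <assert.h>
#include <stdint.h>

#ifdef HAVE_SPLPAR
/* Real variant: pretend the hypervisor returned one domain ID. */
int get_associativity(long lcpu, uint32_t *assoc)
{
	assert(assoc);
	assoc[0] = 1;
	assoc[1] = (uint32_t)lcpu;
	return 0;
}
#else
/* Stub so callers outside the guard still compile and link. */
int get_associativity(long lcpu, uint32_t *assoc)
{
	(void)lcpu;
	(void)assoc;
	return -1;
}
#endif

/* A caller that is built regardless of the configuration. */
int cpu_to_first_domain(long lcpu)
{
	uint32_t assoc[2];

	if (get_associativity(lcpu, assoc) != 0)
		return -1;	/* no associativity info available */
	return (int)assoc[1];
}
```

Without HAVE_SPLPAR defined, the caller sees the stub's failure return and
falls back to "no information", which is the behaviour the callers already
have to handle.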


-aneesh


Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-28 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Thu, Jun 24, 2021 at 01:50:34PM +0530, Aneesh Kumar K.V wrote:
>> David Gibson  writes:
>> 
>> > On Thu, Jun 17, 2021 at 10:21:05PM +0530, Aneesh Kumar K.V wrote:
>> >> PAPR interface currently supports two different ways of communicating 
>> >> resource
>> >> grouping details to the OS. These are referred to as Form 0 and Form 1
>> >> associativity grouping. Form 0 is the older format and is now considered
>> >> deprecated. This patch adds another resource grouping named FORM2.
>> >> 
>> >> Signed-off-by: Daniel Henrique Barboza 
>> >> Signed-off-by: Aneesh Kumar K.V 
>> >> ---
>> >>  Documentation/powerpc/associativity.rst   | 135 
>> >>  arch/powerpc/include/asm/firmware.h   |   3 +-
>> >>  arch/powerpc/include/asm/prom.h   |   1 +
>> >>  arch/powerpc/kernel/prom_init.c   |   3 +-
>> >>  arch/powerpc/mm/numa.c| 149 +-
>> >>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>> >>  6 files changed, 286 insertions(+), 6 deletions(-)
>> >>  create mode 100644 Documentation/powerpc/associativity.rst
>> >> 
>> >> diff --git a/Documentation/powerpc/associativity.rst 
>> >> b/Documentation/powerpc/associativity.rst
>> >> new file mode 100644
>> >> index ..93be604ac54d
>> >> --- /dev/null
>> >> +++ b/Documentation/powerpc/associativity.rst
>> >> @@ -0,0 +1,135 @@
>> >> +
>> >> +NUMA resource associativity
>> >> +=
>> >> +
>> >> +Associativity represents the groupings of the various platform resources 
>> >> into
>> >> +domains of substantially similar mean performance relative to resources 
>> >> outside
>> >> +of that domain. Resources subsets of a given domain that exhibit better
>> >> +performance relative to each other than relative to other resources 
>> >> subsets
>> >> +are represented as being members of a sub-grouping domain. This 
>> >> performance
>> >> +characteristic is presented in terms of NUMA node distance within the 
>> >> Linux kernel.
>> >> +From the platform view, these groups are also referred to as domains.
>> >> +
>> >> +PAPR interface currently supports different ways of communicating these 
>> >> resource
>> >> +grouping details to the OS. These are referred to as Form 0, Form 1 and 
>> >> Form2
>> >> +associativity grouping. Form 0 is the older format and is now considered 
>> >> deprecated.
>> >> +
>> >> +Hypervisor indicates the type/form of associativity used via 
>> >> "ibm,architecture-vec-5 property".
>> >> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage 
>> >> of Form 0 or Form 1.
>> >> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 
>> >> associativity
>> >> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> >> +
>> >> +Form 0
>> >> +-
>> >> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>> >> +
>> >> +Form 1
>> >> +-
>> >> +With Form 1 a combination of ibm,associativity-reference-points and 
>> >> ibm,associativity
>> >> +device tree properties are used to determine the NUMA distance between 
>> >> resource groups/domains.
>> >> +
>> >> +The “ibm,associativity” property contains one or more lists of numbers 
>> >> (domainID)
>> >> +representing the resource’s platform grouping domains.
>> >> +
>> >> +The “ibm,associativity-reference-points” property contains one or more 
>> >> list of numbers
>> >> +(domainID index) that represents the 1 based ordinal in the 
>> >> associativity lists.
>> >> +The list of domainID index represnets increasing hierachy of
>> >> resource grouping.
>> >
>> > Typo "represnets".  Also s/hierachy/hierarchy/
>> >
>> >> +
>> >> +ex:
>> >> +{ primary domainID index, secondary domainID index, tertiary domainID 
>> >> index.. }
>> >
>> >> +Linux kernel uses the domainID at the primary domainID index as th

[PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity

2021-06-28 Thread Aneesh Kumar K.V
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/powerpc/associativity.rst   | 103 ++
 arch/powerpc/include/asm/firmware.h   |   3 +-
 arch/powerpc/include/asm/prom.h   |   1 +
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 157 ++
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 242 insertions(+), 26 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..31cc7da2c7a6
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,103 @@
+
+NUMA resource associativity
+=
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux 
kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these 
resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the older format and is now considered 
deprecated.
+
+Hypervisor indicates the type/form of associativity used via 
"ibm,architecture-vec-5 property".
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 
associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+-
+With Form 1 a combination of ibm,associativity-reference-points, and 
ibm,associativity
+device tree properties are used to determine the NUMA distance between 
resource groups/domains.
+
+The “ibm,associativity” property contains a list of one or more numbers 
(domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains a list of one or 
more numbers
+(domainID index) that represents the 1 based ordinal in the associativity 
lists.
+The list of domainID indexes represents an increasing hierarchy of resource 
grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID at the primary domainID index as the NUMA node 
id.
+Linux kernel computes NUMA distance between two domains by recursively 
comparing
+if they belong to the same higher-level domains. For mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+---
+Form 2 associativity format adds separate device tree properties representing 
NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows 
flexible primary
+domain numbering. With numa distance computation now detached from the index 
value in
+"ibm,associativity-reference-points" property, Form 2 allows a large number of 
primary domain
+ids at the same domainID index representing resource groups of different 
performance/latency
+characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in 
the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains a list of one or more numbers 
representing
+the domainIDs present in the system. The offset of the domainID in this 
property is
+used as an index while computing numa distance information via 
"ibm,numa-distance-table".
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, 
followed by
+N domainID encoded as with encode-int
+
+For ex:
+"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 
(2) is used when
+computing the distance of domain 8 from other domains present in the system. 
For the rest of
+this document, this offset will be referred to as domain distance offset.
+
+"ibm,numa-distance-table" property contains a list of one or more numbers 
representing the NUMA
+distance between resource groups/domains present in the syst

[PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance

2021-06-28 Thread Aneesh Kumar K.V
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining we don't expect
the code to be called with FORM0

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/topology.h   |  4 ++--
 arch/powerpc/mm/numa.c| 10 +-
 arch/powerpc/platforms/pseries/lpar.c |  4 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..ac8b5ed79832 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 cpu_all_mask : \
 cpumask_of_node(pcibus_to_node(bus)))
 
-extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
@@ -83,7 +83,7 @@ static inline void sysfs_remove_device_from_node(struct 
device *dev,
 
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
 
-static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
return 0;
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 7b142f79d600..c6293037a103 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 
*cpu2_assoc)
 {
int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
return dist;
 }
 
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+   /* We should not get called with FORM0 */
+   VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+   return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index dab356e3ff87..afefbdfe768d 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int 
last_disp_cpu, int cur_disp_cpu)
if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
return -EIO;
 
-   return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
+   return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
 }
 
 static int cpu_home_node_dispatch_distance(int disp_cpu)
@@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
if (!disp_cpu_assoc || !vcpu_assoc)
return -EIO;
 
-   return cpu_distance(disp_cpu_assoc, vcpu_assoc);
+   return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
 }
 
 static void update_vcpu_disp_stat(int disp_cpu)
-- 
2.31.1



[PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths

2021-06-28 Thread Aneesh Kumar K.V
The associativity details of the newly added resources are collected from
the hypervisor via "ibm,configure-connector" rtas call. Update the numa
distance details of the newly added numa node after the above call.

Instead of updating NUMA distance every time we lookup a node id
from the associativity property, add helpers that can be used
during boot which does this only once. Also remove the distance
update from node id lookup helpers.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c| 173 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 arch/powerpc/platforms/pseries/pseries.h  |   1 +
 4 files changed, 132 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0ec16999beef..7b142f79d600 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,22 +208,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-   const __be32 *associativity)
-{
-   int i;
-
-   if (affinity_form != FORM1_AFFINITY)
-   return;
-
-   for (i = 0; i < max_associativity_domain_index; i++) {
-   const __be32 *entry;
-
-   entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
-   distance_lookup_table[nid][i] = of_read_number(entry, 1);
-   }
-}
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
@@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 
*associativity)
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
nid = NUMA_NO_NODE;
-
-   if (nid > 0 &&
-   of_read_number(associativity, 1) >= max_associativity_domain_index) {
-   /*
-* Skip the length field and send start of associativity array
-*/
-   initialize_distance_lookup_table(nid, associativity + 1);
-   }
-
 out:
return nid;
 }
@@ -287,6 +262,49 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity)
+{
+   int i, nid;
+
+   if (affinity_form != FORM1_AFFINITY)
+   return;
+
+   if (of_read_number(associativity, 1) >= primary_domain_index) {
+   nid = of_read_number(&associativity[primary_domain_index], 1);
+
+   for (i = 0; i < max_associativity_domain_index; i++) {
+   const __be32 *entry;
+
+   entry = &associativity[be32_to_cpu(distance_ref_points[i])];
+   distance_lookup_table[nid][i] = of_read_number(entry, 1);
+   }
+   }
+}
+
+static void initialize_form1_numa_distance(struct device_node *node)
+{
+   const __be32 *associativity;
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return;
+
+   __initialize_form1_numa_distance(associativity);
+}
+
+/*
+ * Used to update distance information w.r.t newly added node.
+ */
+void update_numa_distance(struct device_node *node)
+{
+   if (affinity_form == FORM0_AFFINITY)
+   return;
+   else if (affinity_form == FORM1_AFFINITY) {
+   initialize_form1_numa_distance(node);
+   return;
+   }
+}
+
 static int __init find_primary_domain_index(void)
 {
int index;
@@ -433,6 +451,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
return 0;
 }
 
+static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
+{
+   struct assoc_arrays aa = { .arrays = NULL };
+   int default_nid = NUMA_NO_NODE;
+   int nid = default_nid;
+   int rc, index;
+
+   if ((primary_domain_index < 0) || !numa_enabled)
+   return default_nid;
+
+   rc = of_get_assoc_arrays(&aa);
+   if (rc)
+   return default_nid;
+
+   if (primary_domain_index <= aa.array_sz &&
+   !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
+   nid = of_read_number(&aa.arrays[index], 1);
+
+   if (nid == 0xffff || nid >= nr_node_ids)
+   nid = default_nid;
+   if (nid > 0 && affinity_form == FORM1_AFFINITY) {
+   int i;
+   const __be32 *associativity;
+
+   index = lmb->aa_index * aa.array_sz;
+   associativity = &aa.arrays[index];
+   /*
+* lookup array associativity entries have different format
+* There is no length of the array as the first element.
+ 

[PATCH v5 3/6] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY

2021-06-28 Thread Aneesh Kumar K.V
Also make related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/firmware.h   |  4 +--
 arch/powerpc/include/asm/prom.h   |  2 +-
 arch/powerpc/kernel/prom_init.c   |  2 +-
 arch/powerpc/mm/numa.c| 35 ++-
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPAL ASM_CONST(0x1000)
 #define FW_FEATURE_SET_MODE ASM_CONST(0x4000)
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRN ASM_CONST(0x0002)
 #define FW_FEATURE_DRMEM_V2 ASM_CONST(0x0004)
 #define FW_FEATURE_DRC_INFO ASM_CONST(0x0008)
@@ -69,7 +69,7 @@ enum {
FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI 0x0201  /* PCIe/MSI support */
 #define OV5_CMO 0x0480  /* Cooperative Memory Overcommitment */
 #define OV5_XCMO   0x0440  /* Page Coalescing */
-#define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY 0x0580  /* FORM1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
 #define OV5_HP_EVT 0x0604  /* Hot Plug Event support */
 #define OV5_RESIZE_HPT 0x0601  /* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 523b31685c4c..5d9ea059594f 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1069,7 +1069,7 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 #else
0,
 #endif
-   .associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN),
+   .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
.micro_checkpoint = 0,
.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 132813dd1a6c..0ec16999beef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int max_associativity_domain_index;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
int i;
int distance = LOCAL_DISTANCE;
 
-   if (!form1_affinity)
+   if (affinity_form == FORM0_AFFINITY)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
for (i = 0; i < max_associativity_domain_index; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
int i;
 
-   if (!form1_affinity)
+   if (affinity_form != FORM1_AFFINITY)
return;
 
for (i = 0; i < max_associativity_domain_index; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
int index;
struct device_node *root;
 
+   /*
+* Check for which form of affinity.
+*/
+   if (firmware_has_feature(FW_FEATURE_OPAL)) {
+   affinity_form = FORM1_AFFINITY;
+   } else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+   dbg("Using form 1 affinity\n");
+   affinity_form = FORM1_AFFINITY;
+   } else
+   affinity_form = FORM0_AFFINITY;
+
if (firmware_has_feature(FW_FEATURE_OPAL))
root = of_find_node_by_path("/ibm,opal"

[PATCH v5 2/6] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index

2021-06-28 Thread Aneesh Kumar K.V
No functional change in this patch

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..132813dd1a6c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,7 +56,7 @@ static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
 #define MAX_DISTANCE_REF_POINTS 4
-static int distance_ref_points_depth;
+static int max_associativity_domain_index;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
 
@@ -169,7 +169,7 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 
int i, index;
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
index = be32_to_cpu(distance_ref_points[i]);
if (cpu1_assoc[index] == cpu2_assoc[index])
break;
@@ -193,7 +193,7 @@ int __node_distance(int a, int b)
if (!form1_affinity)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
break;
 
@@ -213,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
if (!form1_affinity)
return;
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
const __be32 *entry;
 
entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
@@ -240,7 +240,7 @@ static int associativity_to_nid(const __be32 *associativity)
nid = NUMA_NO_NODE;
 
if (nid > 0 &&
-   of_read_number(associativity, 1) >= distance_ref_points_depth) {
+   of_read_number(associativity, 1) >= max_associativity_domain_index) {
/*
 * Skip the length field and send start of associativity array
 */
@@ -310,14 +310,14 @@ static int __init find_primary_domain_index(void)
 */
distance_ref_points = of_get_property(root,
"ibm,associativity-reference-points",
-   &distance_ref_points_depth);
+   &max_associativity_domain_index);
 
if (!distance_ref_points) {
dbg("NUMA: ibm,associativity-reference-points not found.\n");
goto err;
}
 
-   distance_ref_points_depth /= sizeof(int);
+   max_associativity_domain_index /= sizeof(int);
 
if (firmware_has_feature(FW_FEATURE_OPAL) ||
firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
@@ -328,7 +328,7 @@ static int __init find_primary_domain_index(void)
if (form1_affinity) {
index = of_read_number(distance_ref_points, 1);
} else {
-   if (distance_ref_points_depth < 2) {
+   if (max_associativity_domain_index < 2) {
printk(KERN_WARNING "NUMA: "
"short ibm,associativity-reference-points\n");
goto err;
@@ -341,10 +341,10 @@ static int __init find_primary_domain_index(void)
 * Warn and cap if the hardware supports more than
 * MAX_DISTANCE_REF_POINTS domains.
 */
-   if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
+   if (max_associativity_domain_index > MAX_DISTANCE_REF_POINTS) {
printk(KERN_WARNING "NUMA: distance array capped at "
"%d entries\n", MAX_DISTANCE_REF_POINTS);
-   distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
+   max_associativity_domain_index = MAX_DISTANCE_REF_POINTS;
}
 
of_node_put(root);
-- 
2.31.1



[PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index

2021-06-28 Thread Aneesh Kumar K.V
No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
if (!numa_enabled)
goto out;
 
-   if (of_read_number(associativity, 1) >= min_common_depth)
-   nid = of_read_number(&associativity[min_common_depth], 1);
+   if (of_read_number(associativity, 1) >= primary_domain_index)
+   nid = of_read_number(&associativity[primary_domain_index], 1);
 
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-   int depth;
+   int index;
struct device_node *root;
 
if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
}
 
if (form1_affinity) {
-   depth = of_read_number(distance_ref_points, 1);
+   index = of_read_number(distance_ref_points, 1);
} else {
if (distance_ref_points_depth < 2) {
printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
goto err;
}
 
-   depth = of_read_number(&distance_ref_points[1], 1);
+   index = of_read_number(&distance_ref_points[1], 1);
}
 
/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
}
 
of_node_put(root);
-   return depth;
+   return index;
 
 err:
of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
int nid = default_nid;
int rc, index;
 
-   if ((min_common_depth < 0) || !numa_enabled)
+   if ((primary_domain_index < 0) || !numa_enabled)
return default_nid;
 
rc = of_get_assoc_arrays(&aa);
if (rc)
return default_nid;
 
-   if (min_common_depth <= aa.array_sz &&
+   if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
nid = of_read_number(&aa.arrays[index], 1);
 
if (nid == 0xffff || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
return -1;
}
 
-   min_common_depth = find_min_common_depth();
+   primary_domain_index = find_primary_domain_index();
 
-   if (min_common_depth < 0) {
+   if (primary_domain_index < 0) {
/*
-* if we fail to parse min_common_depth from device tree
+* if we fail to parse primary_domain_index from device tree
 * mark the numa disabled, boot with numa disabled.
 */
numa_enabled = false;
-   return min_common_depth;
+   return primary_domain_index;
}
 
-   dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+   dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
goto out;
}
 
-   max_nodes = of_read_number([min_common_depth], 1);
+   max_nodes = of_read_number([primary_domain_index], 1);
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
}
 
prop_length /= sizeof(int);
-   if (prop_length > min_common_depth + 2)
+   if (prop_length > primary_domain_index + 2)
coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
goto out;
 
index = of_read_number(associativity, 1);
-   if (index > min_common_depth + 1)
+   if (index > primary_domain_index + 1)
return of_read_number(&associativity[index - 1], 1);
 
 out:
-- 
2.31.1



[PATCH v5 0/6] Add support for FORM2 associativity

2021-06-28 Thread Aneesh Kumar K.V
Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 6.

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$

After DAX kmem memory add
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  222  33  240 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  230 
  4:  255  255  255  230  10 


PAPR SCM now uses the NUMA distance details to find the numa_node and
target_node for the device.

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v 
[
  {
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":1071644672,
"uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
"raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
"daxregion":{
  "id":0,
  "size":1071644672,
  "devices":[
{
  "chardev":"dax0.0",
  "size":1071644672,
  "target_node":4,
  "mode":"devdax"
}
  ]
},
"align":2097152,
"numa_node":3
  }
]
kvaneesh@ubuntu-guest:~$ 


The above output was produced with the following Qemu command line:

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE} \
-device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/

Changes from v4:
* Drop the DLPAR related device tree property for now because both Qemu and
  PowerVM will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index


Aneesh Kumar K.V (6):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: rename distance_ref_points_depth to
max_associativity_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate different NUMA distance update code paths
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity

 Documentation/powerpc/associativity.rst   | 103 +
 arch/powerpc/include/asm/firmware.h   |   7 +-
 arch/powerpc/include/asm/prom.h   |   3 +-
 arch/powerpc/include/asm/topology.h   |   4 +-
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 415 +-
 arch/powerpc/platforms/pseries/firmware.c |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 arch/powerpc/platforms/pseries/lpar.c |   4 +-
 arch/powerpc/platforms/pseries/pseries.h  |   1 +
 11 files changed, 432 insertions(+), 115 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1



Re: [PATCH v3] mm: pagewalk: Fix walk for hugepage tables

2021-06-28 Thread Aneesh Kumar K.V
Christophe Leroy  writes:

> Pagewalk ignores hugepd entries and walks down the tables
> as if they were traditional entries, leading to crazy results.

But we do handle hugetlb separately

if (vma && is_vm_hugetlb_page(vma)) {
if (ops->hugetlb_entry)
err = walk_hugetlb_range(start, end, walk);
} else
err = walk_pgd_range(start, end, walk);

Are we using hugepd format for non hugetlb entries?

>
> Add walk_hugepd_range() and use it to walk hugepage tables.
>
> Signed-off-by: Christophe Leroy 
> Reviewed-by: Steven Price 
> ---
> v3:
> - Rebased on next-20210624 (no change since v2)
> - Added Steven's Reviewed-by
> - Sent as standalone for merge via mm
>
> v2:
> - Add a guard for NULL ops->pte_entry
> - Take mm->page_table_lock when walking hugepage table, as suggested by 
> follow_huge_pd()
> ---
>  mm/pagewalk.c | 58 ++-
>  1 file changed, 53 insertions(+), 5 deletions(-)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index e81640d9f177..9b3db11a4d1d 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -58,6 +58,45 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
> unsigned long end,
>   return err;
>  }
>  
> +#ifdef CONFIG_ARCH_HAS_HUGEPD
> +static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr,
> +  unsigned long end, struct mm_walk *walk, int 
> pdshift)
> +{
> + int err = 0;
> + const struct mm_walk_ops *ops = walk->ops;
> + int shift = hugepd_shift(*phpd);
> + int page_size = 1 << shift;
> +
> + if (!ops->pte_entry)
> + return 0;
> +
> + if (addr & (page_size - 1))
> + return 0;
> +
> + for (;;) {
> + pte_t *pte;
> +
> + spin_lock(&walk->mm->page_table_lock);
> + pte = hugepte_offset(*phpd, addr, pdshift);
> + err = ops->pte_entry(pte, addr, addr + page_size, walk);
> + spin_unlock(&walk->mm->page_table_lock);
> +
> + if (err)
> + break;
> + if (addr >= end - page_size)
> + break;
> + addr += page_size;
> + }
> + return err;
> +}
> +#else
> +static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr,
> +  unsigned long end, struct mm_walk *walk, int 
> pdshift)
> +{
> + return 0;
> +}
> +#endif
> +
>  static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> struct mm_walk *walk)
>  {
> @@ -108,7 +147,10 @@ static int walk_pmd_range(pud_t *pud, unsigned long 
> addr, unsigned long end,
>   goto again;
>   }
>  
> - err = walk_pte_range(pmd, addr, next, walk);
> + if (is_hugepd(__hugepd(pmd_val(*pmd))))
> + err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT);
> + else
> + err = walk_pte_range(pmd, addr, next, walk);
>   if (err)
>   break;
>   } while (pmd++, addr = next, addr != end);
> @@ -157,7 +199,10 @@ static int walk_pud_range(p4d_t *p4d, unsigned long 
> addr, unsigned long end,
>   if (pud_none(*pud))
>   goto again;
>  
> - err = walk_pmd_range(pud, addr, next, walk);
> + if (is_hugepd(__hugepd(pud_val(*pud))))
> + err = walk_hugepd_range((hugepd_t *)pud, addr, next, walk, PUD_SHIFT);
> + else
> + err = walk_pmd_range(pud, addr, next, walk);
>   if (err)
>   break;
>   } while (pud++, addr = next, addr != end);
> @@ -189,7 +234,9 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, 
> unsigned long end,
>   if (err)
>   break;
>   }
> - if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> + if (is_hugepd(__hugepd(p4d_val(*p4d))))
> + err = walk_hugepd_range((hugepd_t *)p4d, addr, next, walk, P4D_SHIFT);
> + else if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
>   err = walk_pud_range(p4d, addr, next, walk);
>   if (err)
>   break;
> @@ -224,8 +271,9 @@ static int walk_pgd_range(unsigned long addr, unsigned 
> long end,
>   if (err)
>   break;
>   }
> - if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
> - ops->pte_entry)
> + if (is_hugepd(__hugepd(pgd_val(*pgd))))
> + err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);
> + else if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
>   err = walk_p4d_range(pgd, addr, next, walk);
>   if (err)
> 

Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-24 Thread Aneesh Kumar K.V

On 6/24/21 4:03 PM, Laurent Dufour wrote:

Hi Aneesh,

A little bit of wordsmithing below...

On 17/06/2021 at 18:51, Aneesh Kumar K.V wrote:
PAPR interface currently supports two different ways of communicating 
resource

grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
  Documentation/powerpc/associativity.rst   | 135 
  arch/powerpc/include/asm/firmware.h   |   3 +-
  arch/powerpc/include/asm/prom.h   |   1 +
  arch/powerpc/kernel/prom_init.c   |   3 +-
  arch/powerpc/mm/numa.c    | 149 +-
  arch/powerpc/platforms/pseries/firmware.c |   1 +
  6 files changed, 286 insertions(+), 6 deletions(-)
  create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst

new file mode 100644
index ..93be604ac54d
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,135 @@
+
+NUMA resource associativity
+=
+
+Associativity represents the groupings of the various platform 
resources into
+domains of substantially similar mean performance relative to 
resources outside

+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources 
subsets
+are represented as being members of a sub-grouping domain. This 
performance
+characteristic is presented in terms of NUMA node distance within the 
Linux kernel.

+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating 
these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 
and Form2
+associativity grouping. Form 0 is the older format and is now 
considered deprecated.

+
+Hypervisor indicates the type/form of associativity used via 
"ibm,arcitecture-vec-5 property".

    architecture ^



fixed

+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates 
usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 
associativity

+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-
+Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
+
+Form 1
+-
+With Form 1 a combination of ibm,associativity-reference-points and 
ibm,associativity
+device tree properties are used to determine the NUMA distance 
between resource groups/domains.

+
+The “ibm,associativity” property contains one or more lists of 
numbers (domainID)

+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains one or 
more list of numbers
+(domainID index) that represents the 1 based ordinal in the 
associativity lists.
+The list of domainID index represnets increasing hierachy of resource 
grouping.

     represents ^



fixed


+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID 
index.. }

+
+Linux kernel uses the domainID at the primary domainID index as the 
NUMA node id.
+Linux kernel computes NUMA distance between two domains by 
recursively comparing
+if they belong to the same higher-level domains. For mismatch at 
every higher
+level of the resource group, the kernel doubles the NUMA distance 
between the

+comparing domains.
+
+Form 2
+---
+Form 2 associativity format adds separate device tree properties 
representing NUMA node distance
+thereby making the node distance computation flexible. Form 2 also 
allows flexible primary
+domain numbering. With numa distance computation now detached from 
the index value of
+"ibm,associativity" property, Form 2 allows a large number of primary 
domain ids at the
+same domainID index representing resource groups of different 
performance/latency characteristics.

+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of 
byte 5 in the

+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains one or more list 
numbers representing
+the domainIDs present in the system. The offset of the domainID in 
this property is considered

+the domainID index.
+
+prop-encoded-array: The number N of the domainIDs encoded as with 
encode-int, followed by

+N domainID encoded as with encode-int
+
+For ex:
+ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index 
for domainID 8 is 1.
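
As a rough illustration of the example above, the lookup from a domainID to
its domainID index could be sketched as below. This is only a sketch under
assumptions: domain_id_to_index() is a hypothetical helper, not a function
from this patch, and it expects the property cells to have been decoded to
host-endian integers already (N first, then the N domainIDs).

```c
/*
 * Hypothetical helper, for illustration only.
 *
 * 'table' holds the decoded cells of "ibm,numa-lookup-index-table":
 * table[0] is the count N, followed by N domainIDs.
 * Returns the domainID index (the offset of the domainID within the
 * list), or -1 if the domainID is not present.
 */
static int domain_id_to_index(const unsigned int *table, unsigned int domain_id)
{
	unsigned int n = table[0];
	unsigned int i;

	for (i = 0; i < n; i++) {
		/* Skip the leading count cell when indexing the domainIDs. */
		if (table[i + 1] == domain_id)
			return (int)i;
	}
	return -1;
}
```

With the table {4, 0, 8, 250, 252} from the example, domainID 8 maps to
index 1 and domainID 252 maps to index 3.
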

+
+"ibm,numa-distance-table" property contains one or more list of 
numbers representing the NUMA

+distance between resource gro

Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-24 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Thu, Jun 17, 2021 at 10:21:05PM +0530, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating 
>> resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  Documentation/powerpc/associativity.rst   | 135 
>>  arch/powerpc/include/asm/firmware.h   |   3 +-
>>  arch/powerpc/include/asm/prom.h   |   1 +
>>  arch/powerpc/kernel/prom_init.c   |   3 +-
>>  arch/powerpc/mm/numa.c| 149 +-
>>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>>  6 files changed, 286 insertions(+), 6 deletions(-)
>>  create mode 100644 Documentation/powerpc/associativity.rst
>> 
>> diff --git a/Documentation/powerpc/associativity.rst 
>> b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index ..93be604ac54d
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,135 @@
>> +
>> +NUMA resource associativity
>> +=
>> +
>> +Associativity represents the groupings of the various platform resources 
>> into
>> +domains of substantially similar mean performance relative to resources 
>> outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux 
>> kernel.
>> +From the platform view, these groups are also referred to as domains.
>> +
>> +PAPR interface currently supports different ways of communicating these 
>> resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and 
>> Form2
>> +associativity grouping. Form 0 is the older format and is now considered 
>> deprecated.
>> +
>> +Hypervisor indicates the type/form of associativity used via 
>> "ibm,arcitecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
>> Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 
>> associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-
>> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-
>> +With Form 1 a combination of ibm,associativity-reference-points and 
>> ibm,associativity
>> +device tree properties are used to determine the NUMA distance between 
>> resource groups/domains.
>> +
>> +The “ibm,associativity” property contains one or more lists of numbers 
>> (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains one or more list 
>> of numbers
>> +(domainID index) that represents the 1 based ordinal in the associativity 
>> lists.
>> +The list of domainID index represnets increasing hierachy of
>> resource grouping.
>
> Typo "represnets".  Also s/hierachy/hierarchy/
>
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID 
>> index.. }
>
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA 
>> node id.
>> +Linux kernel computes NUMA distance between two domains by recursively 
>> comparing
>> +if they belong to the same higher-level domains. For mismatch at every 
>> higher
>> +level of the resource group, the kernel doubles the NUMA distance between 
>> the
>> +comparing domains.
>
> The Form1 description is still kinda confusing, but I don't really
> care.  Form1 *is* confusing, it's Form2 that I hope will be clearer.
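
For what it is worth, the Form 0 and Form 1 rules quoted above can be
sketched in a few lines. This is a minimal sketch modelled on the
__node_distance() loop in arch/powerpc/mm/numa.c, not the kernel code
itself: the lookup[] contents are invented, and form0_distance() and
form1_distance() are hypothetical names used only here.

```c
#define LOCAL_DISTANCE   10
#define REMOTE_DISTANCE  20
#define MAX_REF_POINTS    4	/* mirrors MAX_DISTANCE_REF_POINTS */

/*
 * Invented stand-in for distance_lookup_table[nid][i]: the domainID of
 * each node at every associativity reference point, primary domain first.
 */
static const int lookup[3][MAX_REF_POINTS] = {
	{ 0, 0, 0, 0 },	/* node 0 */
	{ 1, 0, 0, 0 },	/* node 1: differs from node 0 at level 0 only */
	{ 2, 1, 0, 0 },	/* node 2: differs from node 0 at levels 0 and 1 */
};

/* Form 0: only LOCAL and REMOTE distances exist. */
static int form0_distance(int a, int b)
{
	return (a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
}

/*
 * Form 1: start at LOCAL_DISTANCE and double it for every reference
 * point at which the two nodes are in different domains, stopping at
 * the first level where they share a domain.
 */
static int form1_distance(int a, int b, int depth)
{
	int distance = LOCAL_DISTANCE;
	int i;

	for (i = 0; i < depth; i++) {
		if (lookup[a][i] == lookup[b][i])
			break;
		distance *= 2;
	}
	return distance;
}
```

With this made-up table, nodes 0 and 1 end up at distance 20 and nodes 0
and 2 at distance 40, while Form 0 would report 20 for both pairs.
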
>
>> +
>> +Form 2
>> +---
>> +Form 2 associativity format adds separate device tree properties 
>> representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows 
>> flexible primary
>> +domain numbering. With numa distance computation now detached from the 
>> index value of
>> +"ibm,associativity"

Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-22 Thread Aneesh Kumar K.V
Daniel Henrique Barboza  writes:

> On 6/17/21 1:51 PM, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating 
>> resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>   Documentation/powerpc/associativity.rst   | 135 
>>   arch/powerpc/include/asm/firmware.h   |   3 +-
>>   arch/powerpc/include/asm/prom.h   |   1 +
>>   arch/powerpc/kernel/prom_init.c   |   3 +-
>>   arch/powerpc/mm/numa.c| 149 +-
>>   arch/powerpc/platforms/pseries/firmware.c |   1 +
>>   6 files changed, 286 insertions(+), 6 deletions(-)
>>   create mode 100644 Documentation/powerpc/associativity.rst
>> 
>> diff --git a/Documentation/powerpc/associativity.rst 
>> b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index ..93be604ac54d
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,135 @@
>> +
>> +NUMA resource associativity
>> +=
>> +
>> +Associativity represents the groupings of the various platform resources 
>> into
>> +domains of substantially similar mean performance relative to resources 
>> outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux 
>> kernel.
>> +From the platform view, these groups are also referred to as domains.
>> +
>> +PAPR interface currently supports different ways of communicating these 
>> resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and 
>> Form2
>> +associativity grouping. Form 0 is the older format and is now considered 
>> deprecated.
>> +
>> +Hypervisor indicates the type/form of associativity used via 
>> "ibm,arcitecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
>> Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 
>> associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-
>> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-
>> +With Form 1 a combination of ibm,associativity-reference-points and 
>> ibm,associativity
>> +device tree properties are used to determine the NUMA distance between 
>> resource groups/domains.
>> +
>> +The “ibm,associativity” property contains one or more lists of numbers 
>> (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains one or more list 
>> of numbers
>> +(domainID index) that represents the 1 based ordinal in the associativity 
>> lists.
>> +The list of domainID index represnets increasing hierachy of resource 
>> grouping.
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID 
>> index.. }
>> +
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA 
>> node id.
>> +Linux kernel computes NUMA distance between two domains by recursively 
>> comparing
>> +if they belong to the same higher-level domains. For mismatch at every 
>> higher
>> +level of the resource group, the kernel doubles the NUMA distance between 
>> the
>> +comparing domains.
>> +
>> +Form 2
>> +---
>> +Form 2 associativity format adds separate device tree properties 
>> representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows 
>> flexible primary
>> +domain numbering. With numa distance computation now detached from the 
>> index value of
>> +"ibm,associativity" property, Form 2 allows a large number of primary 
>> domain ids at the
>> +same domainID index representing resource groups of different 
>> performance/latency characteristics.
>> +
>> +H

Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-17 Thread Aneesh Kumar K.V

On 6/18/21 1:30 AM, Daniel Henrique Barboza wrote:



On 6/17/21 8:11 AM, Aneesh Kumar K.V wrote:

Daniel Henrique Barboza  writes:


On 6/17/21 4:46 AM, David Gibson wrote:

On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:







> In fact, the more I speak about this PMEM scenario the more I wonder:
> why doesn't the PMEM driver, when switching from persistent to regular
> memory and vice-versa, take care of all the necessary updates in the
> numa-distance-table and kernel internals to reflect the current distances
> of its current mode? Is this a technical limitation?




I sent v4, which does something similar to this.

-aneesh



[PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-17 Thread Aneesh Kumar K.V
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/powerpc/associativity.rst   | 135 
 arch/powerpc/include/asm/firmware.h   |   3 +-
 arch/powerpc/include/asm/prom.h   |   1 +
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 149 +-
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 286 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..93be604ac54d
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,135 @@
+
+NUMA resource associativity
+=
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux 
kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these 
resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form 2
+associativity grouping. Form 0 is the older format and is now considered 
deprecated.
+
+The hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 
associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+-
+With Form 1 a combination of ibm,associativity-reference-points and 
ibm,associativity
+device tree properties are used to determine the NUMA distance between 
resource groups/domains.
+
+The “ibm,associativity” property contains one or more lists of numbers 
(domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains one or more lists of numbers
+(domainID index) that represent the 1-based ordinal in the associativity lists.
+The list of domainID indexes represents an increasing hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID at the primary domainID index as the NUMA node 
id.
+Linux kernel computes NUMA distance between two domains by recursively 
comparing
+if they belong to the same higher-level domains. For mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+---
+Form 2 associativity format adds separate device tree properties representing 
NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows 
flexible primary
+domain numbering. With numa distance computation now detached from the index 
value of
+"ibm,associativity" property, Form 2 allows a large number of primary domain 
ids at the
+same domainID index representing resource groups of different 
performance/latency characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in 
the
+"ibm,architecture-vec-5" property.
+
+The "ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
+the domainIDs present in the system. The offset of the domainID in this property is considered
+the domainID index.
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, 
followed by
+N domainID encoded as with encode-int
+
+For ex:
+ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for 
domainID 8 is 1.
+
+The "ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with 
encode-int, followed by
+N distance values encoded as with encode-bytes. The max distance value we 
could encode is 255.
+
+For ex:
+ibm,numa-lookup-index-table =  {3, 0, 8, 40}
+ibm,numa-distance-tab

[PATCH v4 6/7] powerpc/pseries: Add a helper for form1 cpu distance

2021-06-17 Thread Aneesh Kumar K.V
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining that we don't expect
the code to be called with FORM0.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c481f08d565b..d32729f235b8 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
return dist;
 }
 
+int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+   /* We should not get called with FORM0 */
+   VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+   return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
-- 
2.31.1
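
For readers following along, the Form 1 CPU distance walk that this helper wraps can be sketched in user space roughly as below. This is only an illustration under stated assumptions, not the kernel code: the names form1_cpu_distance, ref_points and n_ref_points are hypothetical, the arrays are assumed to be already in host byte order, and element 0 of each associativity array is assumed to hold the array length so that the 1-based reference points index it directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * User-space sketch of a Form 1 CPU distance walk (hypothetical names).
 * Assumptions: arrays are in host byte order, element 0 of each
 * associativity array is its length, and ref_points holds 1-based
 * indexes into those arrays.
 */
static int form1_cpu_distance(const uint32_t *cpu1_assoc,
			      const uint32_t *cpu2_assoc,
			      const uint32_t *ref_points, size_t n_ref_points)
{
	int dist = 0;

	for (size_t i = 0; i < n_ref_points; i++) {
		uint32_t index = ref_points[i];

		/* The reference point must lie inside both arrays. */
		assert(index >= 1 && index <= cpu1_assoc[0] &&
		       index <= cpu2_assoc[0]);

		/* Stop at the first level where both CPUs share a domain. */
		if (cpu1_assoc[index] == cpu2_assoc[index])
			break;
		dist++;
	}
	return dist;
}
```

The returned value counts how many reference-point levels had to be crossed before the two CPUs landed in a common domain; 0 means they already match at the first reference point.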



[PATCH v4 5/7] powerpc/pseries: Consolidate NUMA distance update during boot

2021-06-17 Thread Aneesh Kumar K.V
Instead of updating the NUMA distance every time we look up a node id
from the associativity property, add helpers that can be used
during boot and do this only once. Also remove the distance
update from the node id lookup helpers.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 135 +++--
 1 file changed, 88 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 645a95e3a7ea..c481f08d565b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,22 +208,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-   const __be32 *associativity)
-{
-   int i;
-
-   if (affinity_form != FORM1_AFFINITY)
-   return;
-
-   for (i = 0; i < max_associativity_domain_index; i++) {
-   const __be32 *entry;
-
-   entry = [be32_to_cpu(distance_ref_points[i]) - 1];
-   distance_lookup_table[nid][i] = of_read_number(entry, 1);
-   }
-}
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
@@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 
*associativity)
/* POWER4 LPAR uses 0x as invalid node */
if (nid == 0x || nid >= nr_node_ids)
nid = NUMA_NO_NODE;
-
-   if (nid > 0 &&
-   of_read_number(associativity, 1) >= 
max_associativity_domain_index) {
-   /*
-* Skip the length field and send start of associativity array
-*/
-   initialize_distance_lookup_table(nid, associativity + 1);
-   }
-
 out:
return nid;
 }
@@ -291,10 +266,13 @@ static void __initialize_form1_numa_distance(const __be32 
*associativity)
 {
int i, nid;
 
+   if (affinity_form != FORM1_AFFINITY)
+   return;
+
if (of_read_number(associativity, 1) >= primary_domain_index) {
nid = of_read_number([primary_domain_index], 1);
 
-   for (i = 0; i < max_domain_index; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
const __be32 *entry;
 
entry = 
[be32_to_cpu(distance_ref_points[i])];
@@ -474,6 +452,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
return 0;
 }
 
+static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
+{
+   struct assoc_arrays aa = { .arrays = NULL };
+   int default_nid = NUMA_NO_NODE;
+   int nid = default_nid;
+   int rc, index;
+
+   if ((primary_domain_index < 0) || !numa_enabled)
+   return default_nid;
+
+   rc = of_get_assoc_arrays();
+   if (rc)
+   return default_nid;
+
+   if (primary_domain_index <= aa.array_sz &&
+   !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
+   nid = of_read_number([index], 1);
+
+   if (nid == 0x || nid >= nr_node_ids)
+   nid = default_nid;
+   if (nid > 0 && affinity_form == FORM1_AFFINITY) {
+   int i;
+   const __be32 *associativity;
+
+   index = lmb->aa_index * aa.array_sz;
+   associativity = [index];
+   /*
+* lookup array associativity entries have different 
format
+* There is no length of the array as the first element.
+*/
+   for (i = 0; i < max_associativity_domain_index; i++) {
+   const __be32 *entry;
+
+   entry = 
[be32_to_cpu(distance_ref_points[i]) - 1];
+   distance_lookup_table[nid][i] = 
of_read_number(entry, 1);
+   }
+   }
+   }
+   return nid;
+}
+
 /*
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
@@ -499,21 +519,14 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 
if (nid == 0x || nid >= nr_node_ids)
nid = default_nid;
-
-   if (nid > 0) {
-   index = lmb->aa_index * aa.array_sz;
-   initialize_distance_lookup_table(nid,
-   [index]);
-   }
}
-
return nid;
 }
 
 #ifdef CONFIG_PPC_SPLPAR
-static int vphn_get_nid(long lcpu)
+
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
 {
-   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
long rc, hwid;
 
/*
@@ -533,10 +546,22 @@ static i

[PATCH v4 4/7] powerpc/pseries: Consolidate DLPAR NUMA distance update

2021-06-17 Thread Aneesh Kumar K.V
The associativity details of the newly added resources are collected from
the hypervisor via the "ibm,configure-connector" RTAS call. Update the NUMA
distance details of the newly added NUMA node after the above call. In a
later patch we will remove updating the NUMA distance when we are looking
up the node id from the associativity array.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c| 41 +++
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |  2 +
 .../platforms/pseries/hotplug-memory.c|  2 +
 arch/powerpc/platforms/pseries/pseries.h  |  1 +
 4 files changed, 46 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0ec16999beef..645a95e3a7ea 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -287,6 +287,47 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity)
+{
+   int i, nid;
+
+   if (of_read_number(associativity, 1) >= primary_domain_index) {
+   nid = of_read_number([primary_domain_index], 1);
+
+   for (i = 0; i < max_domain_index; i++) {
+   const __be32 *entry;
+
+   entry = 
[be32_to_cpu(distance_ref_points[i])];
+   distance_lookup_table[nid][i] = of_read_number(entry, 
1);
+   }
+   }
+}
+
+static void initialize_form1_numa_distance(struct device_node *node)
+{
+   const __be32 *associativity;
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return;
+
+   __initialize_form1_numa_distance(associativity);
+   return;
+}
+
+/*
+ * Used to update distance information w.r.t newly added node.
+ */
+void update_numa_distance(struct device_node *node)
+{
+   if (affinity_form == FORM0_AFFINITY)
+   return;
+   else if (affinity_form == FORM1_AFFINITY) {
+   initialize_form1_numa_distance(node);
+   return;
+   }
+}
+
 static int __init find_primary_domain_index(void)
 {
int index;
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..778b6ab35f0d 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
return saved_rc;
}
 
+   update_numa_distance(dn);
+
rc = dlpar_online_cpu(dn);
if (rc) {
saved_rc = rc;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..0e602c3b01ea 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct drmem_lmb 
*lmb)
return -ENODEV;
}
 
+   update_numa_distance(lmb_node);
+
dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
if (!dr_node) {
dlpar_free_cc_nodes(lmb_node);
diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index 1f051a786fb3..663a0859cf13 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -113,4 +113,5 @@ extern u32 pseries_security_flavor;
 void pseries_setup_security_mitigations(void);
 void pseries_lpar_read_hblkrm_characteristics(void);
 
+void update_numa_distance(struct device_node *node);
 #endif /* _PSERIES_PSERIES_H */
-- 
2.31.1



[PATCH v4 3/7] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY

2021-06-17 Thread Aneesh Kumar K.V
Also make related code cleanups that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/firmware.h   |  4 +--
 arch/powerpc/include/asm/prom.h   |  2 +-
 arch/powerpc/kernel/prom_init.c   |  2 +-
 arch/powerpc/mm/numa.c| 35 ++-
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPALASM_CONST(0x1000)
 #define FW_FEATURE_SET_MODEASM_CONST(0x4000)
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRNASM_CONST(0x0002)
 #define FW_FEATURE_DRMEM_V2ASM_CONST(0x0004)
 #define FW_FEATURE_DRC_INFOASM_CONST(0x0008)
@@ -69,7 +69,7 @@ enum {
FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI0x0201  /* PCIe/MSI support */
 #define OV5_CMO0x0480  /* Cooperative Memory 
Overcommitment */
 #define OV5_XCMO   0x0440  /* Page Coalescing */
-#define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY 0x0580  /* FORM1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
 #define OV5_HP_EVT 0x0604  /* Hot Plug Event support */
 #define OV5_RESIZE_HPT 0x0601  /* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 41ed7e33d897..64b9593038a7 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1070,7 +1070,7 @@ static const struct ibm_arch_vec 
ibm_architecture_vec_template __initconst = {
 #else
0,
 #endif
-   .associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
+   .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
.micro_checkpoint = 0,
.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 132813dd1a6c..0ec16999beef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int max_associativity_domain_index;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
int i;
int distance = LOCAL_DISTANCE;
 
-   if (!form1_affinity)
+   if (affinity_form == FORM0_AFFINITY)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
for (i = 0; i < max_associativity_domain_index; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
int i;
 
-   if (!form1_affinity)
+   if (affinity_form != FORM1_AFFINITY)
return;
 
for (i = 0; i < max_associativity_domain_index; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
int index;
struct device_node *root;
 
+   /*
+* Check for which form of affinity.
+*/
+   if (firmware_has_feature(FW_FEATURE_OPAL)) {
+   affinity_form = FORM1_AFFINITY;
+   } else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+   dbg("Using form 1 affinity\n");
+   affinity_form = FORM1_AFFINITY;
+   } else
+   affinity_form = FORM0_AFFINITY;
+
if (firmware_has_feature(FW_FEATURE_OPAL))
root = of_find_node_by_path("/ibm,opal"

[PATCH v4 1/7] powerpc/pseries: rename min_common_depth to primary_domain_index

2021-06-17 Thread Aneesh Kumar K.V
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
if (!numa_enabled)
goto out;
 
-   if (of_read_number(associativity, 1) >= min_common_depth)
-   nid = of_read_number([min_common_depth], 1);
+   if (of_read_number(associativity, 1) >= primary_domain_index)
+   nid = of_read_number([primary_domain_index], 1);
 
/* POWER4 LPAR uses 0x as invalid node */
if (nid == 0x || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-   int depth;
+   int index;
struct device_node *root;
 
if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
}
 
if (form1_affinity) {
-   depth = of_read_number(distance_ref_points, 1);
+   index = of_read_number(distance_ref_points, 1);
} else {
if (distance_ref_points_depth < 2) {
printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
goto err;
}
 
-   depth = of_read_number(_ref_points[1], 1);
+   index = of_read_number(_ref_points[1], 1);
}
 
/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
}
 
of_node_put(root);
-   return depth;
+   return index;
 
 err:
of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
int nid = default_nid;
int rc, index;
 
-   if ((min_common_depth < 0) || !numa_enabled)
+   if ((primary_domain_index < 0) || !numa_enabled)
return default_nid;
 
rc = of_get_assoc_arrays();
if (rc)
return default_nid;
 
-   if (min_common_depth <= aa.array_sz &&
+   if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
nid = of_read_number([index], 1);
 
if (nid == 0x || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
return -1;
}
 
-   min_common_depth = find_min_common_depth();
+   primary_domain_index = find_primary_domain_index();
 
-   if (min_common_depth < 0) {
+   if (primary_domain_index < 0) {
/*
-* if we fail to parse min_common_depth from device tree
+* if we fail to parse primary_domain_index from device tree
 * mark the numa disabled, boot with numa disabled.
 */
numa_enabled = false;
-   return min_common_depth;
+   return primary_domain_index;
}
 
-   dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+   dbg("NUMA associativity depth for CPU/Memory: %d\n", 
primary_domain_index);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
goto out;
}
 
-   max_nodes = of_read_number([min_common_depth], 1);
+   max_nodes = of_read_number([primary_domain_index], 1);
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
}
 
prop_length /= sizeof(int);
-   if (prop_length > min_common_depth + 2)
+   if (prop_length > primary_domain_index + 2)
coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
goto out;
 
index = of_read_number(associativity, 1);
-   if (index > min_common_depth + 1)
+   if (index > primary_domain_index + 1)
return of_read_number([index - 1], 1);
 
 out:
-- 
2.31.1
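
As a reading aid for the rename, here is a minimal user-space sketch of how a primary domain index selects the node id out of an associativity array. It is an assumption-laden illustration rather than the kernel function: sketch_assoc_to_nid and SKETCH_NO_NODE are hypothetical names, the array is assumed to be in host byte order with the count of domainIDs in element 0, and the invalid-node handling is reduced to a plain range check.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NO_NODE (-1)	/* hypothetical stand-in for NUMA_NO_NODE */

/*
 * Pick the node id from an associativity array (hypothetical helper).
 * assoc[0] is assumed to be the number of domainIDs that follow, so the
 * 1-based primary_domain_index can be used as a direct array index.
 */
static int sketch_assoc_to_nid(const uint32_t *assoc,
			       int primary_domain_index, int nr_node_ids)
{
	int nid = SKETCH_NO_NODE;

	assert(primary_domain_index >= 1);

	/* Only read the entry if the array is long enough. */
	if (assoc[0] >= (uint32_t)primary_domain_index)
		nid = (int)assoc[primary_domain_index];

	/* Treat anything outside the known node range as "no node". */
	if (nid < 0 || nid >= nr_node_ids)
		nid = SKETCH_NO_NODE;

	return nid;
}
```

With an array such as {4, 0, 0, 1, 3} and a primary domain index of 4, the sketch yields node 3; a primary domain index larger than the array length yields SKETCH_NO_NODE.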



[PATCH v4 2/7] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index

2021-06-17 Thread Aneesh Kumar K.V
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..132813dd1a6c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,7 +56,7 @@ static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
 #define MAX_DISTANCE_REF_POINTS 4
-static int distance_ref_points_depth;
+static int max_associativity_domain_index;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
 
@@ -169,7 +169,7 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 
int i, index;
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
index = be32_to_cpu(distance_ref_points[i]);
if (cpu1_assoc[index] == cpu2_assoc[index])
break;
@@ -193,7 +193,7 @@ int __node_distance(int a, int b)
if (!form1_affinity)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
break;
 
@@ -213,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
if (!form1_affinity)
return;
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
const __be32 *entry;
 
entry = [be32_to_cpu(distance_ref_points[i]) - 1];
@@ -240,7 +240,7 @@ static int associativity_to_nid(const __be32 *associativity)
nid = NUMA_NO_NODE;
 
if (nid > 0 &&
-   of_read_number(associativity, 1) >= distance_ref_points_depth) {
+   of_read_number(associativity, 1) >= 
max_associativity_domain_index) {
/*
 * Skip the length field and send start of associativity array
 */
@@ -310,14 +310,14 @@ static int __init find_primary_domain_index(void)
 */
distance_ref_points = of_get_property(root,
"ibm,associativity-reference-points",
-   _ref_points_depth);
+   _associativity_domain_index);
 
if (!distance_ref_points) {
dbg("NUMA: ibm,associativity-reference-points not found.\n");
goto err;
}
 
-   distance_ref_points_depth /= sizeof(int);
+   max_associativity_domain_index /= sizeof(int);
 
if (firmware_has_feature(FW_FEATURE_OPAL) ||
firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
@@ -328,7 +328,7 @@ static int __init find_primary_domain_index(void)
if (form1_affinity) {
index = of_read_number(distance_ref_points, 1);
} else {
-   if (distance_ref_points_depth < 2) {
+   if (max_associativity_domain_index < 2) {
printk(KERN_WARNING "NUMA: "
"short ibm,associativity-reference-points\n");
goto err;
@@ -341,10 +341,10 @@ static int __init find_primary_domain_index(void)
 * Warn and cap if the hardware supports more than
 * MAX_DISTANCE_REF_POINTS domains.
 */
-   if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
+   if (max_associativity_domain_index > MAX_DISTANCE_REF_POINTS) {
printk(KERN_WARNING "NUMA: distance array capped at "
"%d entries\n", MAX_DISTANCE_REF_POINTS);
-   distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
+   max_associativity_domain_index = MAX_DISTANCE_REF_POINTS;
}
 
of_node_put(root);
-- 
2.31.1
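
The Form 1 rule that the renamed max_associativity_domain_index bounds (start at the local distance and double it for every reference-point level at which two nodes differ) can be sketched as follows. This is a simplified user-space illustration: the sketch_ names are hypothetical, the local distance of 10 is taken from the numactl output elsewhere in this series, and the lookup table is passed in as a parameter instead of being a file-scope array.

```c
#include <assert.h>

#define SKETCH_LOCAL_DISTANCE	10	/* assumed local distance */
#define SKETCH_MAX_REF_POINTS	4	/* mirrors MAX_DISTANCE_REF_POINTS */

/*
 * Form 1 style node distance (hypothetical helper): table[nid][i] holds
 * the domainID of node nid at reference point i. The distance doubles for
 * each level where the two nodes sit in different domains.
 */
static int sketch_form1_node_distance(const int table[][SKETCH_MAX_REF_POINTS],
				      int a, int b, int n_ref_points)
{
	int distance = SKETCH_LOCAL_DISTANCE;

	assert(n_ref_points >= 0 && n_ref_points <= SKETCH_MAX_REF_POINTS);

	for (int i = 0; i < n_ref_points; i++) {
		/* Same domain at this level: no further doubling. */
		if (table[a][i] == table[b][i])
			break;
		distance *= 2;
	}
	return distance;
}
```

Because the loop is capped by the number of reference points, the largest distance such a scheme can report is the local distance shifted left by that count, which is one reason the cap matters.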



[PATCH v4 0/7] Add support for FORM2 associativity

2021-06-17 Thread Aneesh Kumar K.V
Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 7.

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$

After DAX kmem memory add
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  222  33  240 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  230 
  4:  255  255  255  230  10 
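
To make the matrix above concrete, here is a small user-space sketch of a Form 2 style lookup: a domainID is first translated to its index in a lookup-index list, and the distance is then read from a table of one-byte values. The sketch_ names are hypothetical, and the flat row-major layout of the distance table is an assumption made for illustration; patch 7 carries the authoritative description of the properties.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the position of domain_id in domain_ids, or -1 if absent. */
static int sketch_domain_index(const uint32_t *domain_ids, size_t n,
			       uint32_t domain_id)
{
	for (size_t i = 0; i < n; i++)
		if (domain_ids[i] == domain_id)
			return (int)i;
	return -1;
}

/*
 * Look up the distance from one domain to another (hypothetical helper).
 * Assumption: distances holds n * n one-byte values in row-major order,
 * with the row selected by "from" and the column by "to".
 */
static int sketch_form2_distance(const uint32_t *domain_ids, size_t n,
				 const uint8_t *distances,
				 uint32_t from, uint32_t to)
{
	int i, j;

	assert(domain_ids != NULL && distances != NULL);

	i = sketch_domain_index(domain_ids, n, from);
	j = sketch_domain_index(domain_ids, n, to);
	if (i < 0 || j < 0)
		return -1;

	return distances[(size_t)i * n + (size_t)j];
}
```

Note that nothing in such a table forces symmetry, which matches the output above, where the distance from node 0 to node 2 differs from the distance from node 2 to node 0.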


PAPR SCM now uses the NUMA distance details to find the numa_node and target_node
for the device.

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v 
[
  {
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":1071644672,
"uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
"raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
"daxregion":{
  "id":0,
  "size":1071644672,
  "devices":[
{
  "chardev":"dax0.0",
  "size":1071644672,
  "target_node":4,
  "mode":"devdax"
}
  ]
},
"align":2097152,
"numa_node":3
  }
]
kvaneesh@ubuntu-guest:~$ 


The above output was produced with the following QEMU command line:

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE} \
-device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/
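
As a rough illustration of how the numa_node/target_node pair in the ndctl
output relates to the distance table, here is a minimal sketch. It assumes
numa_node is simply the online node with the smallest distance to the target
node. nearest_online_node() and the online[] flags are hypothetical names used
only for this example; this is not the kernel's implementation.

```c
#define NR_NODES 5

/* Distance matrix copied from the "numactl -H" output after DAX kmem add. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{  10,  11, 222,  33, 240 },
	{  44,  10,  55,  66, 255 },
	{  77,  88,  10,  99, 255 },
	{ 101, 121, 132,  10, 230 },
	{ 255, 255, 255, 230,  10 },
};

/*
 * Hypothetical helper: return the online node closest to 'target',
 * or -1 if no node is online. 'online' holds one flag per node.
 */
static int nearest_online_node(int target, const int *online)
{
	int best = -1;

	for (int n = 0; n < NR_NODES; n++) {
		if (!online[n])
			continue;
		if (best < 0 ||
		    node_distance[target][n] < node_distance[target][best])
			best = n;
	}
	return best;
}
```

With node 4 not yet online (before the DAX kmem memory add), the closest online
node to target node 4 is node 3 at distance 230, which matches "numa_node":3
and "target_node":4 in the ndctl listing.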

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance 
information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index

Aneesh Kumar K.V (7):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: rename distance_ref_points_depth to
max_associativity_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate DLPAR NUMA distance update
  powerpc/pseries: Consolidate NUMA distance update during boot
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity

 Documentation/powerpc/associativity.rst   | 135 ++
 arch/powerpc/include/asm/firmware.h   |   7 +-
 arch/powerpc/include/asm/prom.h   |   3 +-
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 410 ++
 arch/powerpc/platforms/pseries/firmware.c |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 arch/powerpc/platforms/pseries/pseries.h  |   1 +
 9 files changed, 474 insertions(+), 92 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1



Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-17 Thread Aneesh Kumar K.V
Aneesh Kumar K.V  writes:

> David Gibson  writes:
>
>> On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:
>>> David Gibson  writes:
>>> 
>>> > On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:
>>> >> David Gibson  writes:
>>> >> 
>>> >> > On Mon, Jun 14, 2021 at 10:10:03PM +0530, Aneesh Kumar K.V wrote:
> .
>
>> I'm still not understanding why the latency we care about is different
>> in the two cases.  Can you give an example of when this would result
>> in different actual node assignments for the two different cases?
>
> How about the below update?
>
> With Form2 "ibm,associativity" for resources is listed as below:
>
> "ibm,associativity" property for resources in node 0, 8 and 40
> { 3, 6, 7, 0 }
> { 3, 6, 9, 8 }
> { 4, 6, 7, 0, 40}
>
> With "ibm,associativity-reference-points"  { 0x3, 0x2 }
>
> Form2 adds an additional property which can be used with devices like
> persistent memory devices which would also like to be presented as
> memory-only NUMA nodes.
>
> "ibm,associativity-memory-node-reference-point" property contains a number
> representing the domainID index to be used to find the domainID that should
> be used when using the resource as a memory-only NUMA node. The NUMA distance
> information w.r.t. this domainID will take into consideration the latency of
> the media. A high latency memory device will have a large NUMA distance value
> assigned w.r.t. the domainID found at the
> "ibm,associativity-memory-node-reference-point" domainID index.
>
> prop-encoded-array: An integer encoded as with encode-int specifying the
> domainID index
>
> In the above example:
> "ibm,associativity-memory-node-reference-point"  { 0x4 }
>
> ex:
>
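>    ----------------
>   |NUMA node0      |
>   |ProcA ---> MEMA |
>   |  |             |
>   |  |             |
>   |  ---> PMEMB    |
>   |                |
>    ----------------
>
>    ----------------
>   |NUMA node1      |
>   |                |
>   |ProcB ---> MEMC |
>   |  |             |
>   |  ---> PMEMD    |
>   |                |
>   |                |
>    ----------------
>
>  ---------------------------------------------------------
> |  domainID 20                                            |
> |   ------------------                                    |
> |  |NUMA node0        |                                   |
> |  |                  |           ---------------         |
> |  |ProcA ---> MEMA   |          |NUMA node40    |        |
> |  |  |               |          |               |        |
> |  |  ----------------|--------> |  PMEMB        |        |
> |  |                  |           ---------------         |
> |   ------------------                                    |
> |                                                         |
> |   ------------------                                    |
> |  |NUMA node1        |                                   |
> |  |                  |           ---------------         |
> |  |ProcB ---> MEMC   |          |NUMA node41    |        |
> |  |  |               |          |               |        |
> |  |  ----------------|--------> |  PMEMD        |        |
> |  |                  |           ---------------         |
> |   ------------------                                    |
> |                                                         |
>  ---------------------------------------------------------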

Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-17 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:
>> David Gibson  writes:
>> 
>> > On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:
>> >> David Gibson  writes:
>> >> 
>> >> > On Mon, Jun 14, 2021 at 10:10:03PM +0530, Aneesh Kumar K.V wrote:
.

> I'm still not understanding why the latency we care about is different
> in the two cases.  Can you give an example of when this would result
> in different actual node assignments for the two different cases?

How about the below update?

With Form2 "ibm,associativity" for resources is listed as below:

"ibm,associativity" property for resources in node 0, 8 and 40
{ 3, 6, 7, 0 }
{ 3, 6, 9, 8 }
{ 4, 6, 7, 0, 40}

With "ibm,associativity-reference-points"  { 0x3, 0x2 }

Form2 adds an additional property which can be used with devices like persistent
memory devices which would also like to be presented as memory-only NUMA nodes.

"ibm,associativity-memory-node-reference-point" property contains a number
representing the domainID index to be used to find the domainID that should be
used when using the resource as a memory-only NUMA node. The NUMA distance
information w.r.t. this domainID will take into consideration the latency of the
media. A high latency memory device will have a large NUMA distance value
assigned w.r.t. the domainID found at the
"ibm,associativity-memory-node-reference-point" domainID index.

prop-encoded-array: An integer encoded as with encode-int specifying the
domainID index

In the above example:
"ibm,associativity-memory-node-reference-point"  { 0x4 }

ex:

   ----------------
  |NUMA node0      |
  |ProcA ---> MEMA |
  |  |             |
  |  |             |
  |  ---> PMEMB    |
  |                |
   ----------------

   ----------------
  |NUMA node1      |
  |                |
  |ProcB ---> MEMC |
  |  |             |
  |  ---> PMEMD    |
  |                |
  |                |
   ----------------

 

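 ---------------------------------------------------------
|  domainID 20                                            |
|   ------------------                                    |
|  |NUMA node0        |                                   |
|  |                  |           ---------------         |
|  |ProcA ---> MEMA   |          |NUMA node40    |        |
|  |  |               |          |               |        |
|  |  ----------------|--------> |  PMEMB        |        |
|  |                  |           ---------------         |
|   ------------------                                    |
|                                                         |
|   ------------------                                    |
|  |NUMA node1        |                                   |
|  |                  |           ---------------         |
|  |ProcB ---> MEMC   |          |NUMA node41    |        |
|  |  |               |          |               |        |
|  |  ----------------|--------> |  PMEMD        |        |
|  |                  |           ---------------         |
|   ------------------                                    |
|                                                         |
 ---------------------------------------------------------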
 


For a topology like the above, an application running on ProcA wants to find the
persistent memory mount local to its NUMA node. Hence when using it as a
pmem fsdax mount or devdax device we want PMEMB to have the associativity
of NUMA node0 and PMEMD to have the associativity of NUMA node1. But when
we want to use it as memory using the dax kmem driver, we want both PMEMB
and PMEMD to appear as memory-only NUMA nodes at a distance that is
derived based on the latency of the media.
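
To make the two lookups concrete, here is a minimal sketch of reading a domainID
out of an "ibm,associativity" array at a given 1-based domainID index. It uses
the example arrays listed earlier in this mail and assumes the values are
already in CPU byte order. domain_at_index() and NO_DOMAIN are hypothetical
names for illustration, and returning NO_DOMAIN for a too-short array is my
assumption, not something the proposal specifies.

```c
#include <stdint.h>

#define NO_DOMAIN (-1)	/* hypothetical "index not present" marker */

/*
 * 'assoc' is an "ibm,associativity" array in CPU byte order.
 * assoc[0] holds the number of domainIDs that follow, so a 1-based
 * domainID index can be used directly as an array subscript.
 */
static int domain_at_index(const uint32_t *assoc, uint32_t index)
{
	if (index == 0 || index > assoc[0])
		return NO_DOMAIN;
	return (int)assoc[index];
}
```

For the persistent memory resource { 4, 6, 7, 0, 40 }, index 3 (the first
reference point) gives domainID 0, the node to use for fsdax/devdax, while
index 4 (the memory-node reference point) gives domainID 40, the memory-only
node used with dax kmem. For { 3, 6, 7, 0 } index 4 is not present.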

"ibm,associativity":
PROCA/MEMA -> { 2, 20, 0 } 
PROCB/MEMC -> { 2, 20

Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-17 Thread Aneesh Kumar K.V

On 6/17/21 4:41 PM, Aneesh Kumar K.V wrote:

Daniel Henrique Barboza  writes:


On 6/17/21 4:46 AM, David Gibson wrote:

On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:


On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:


On Mon, Jun 14, 2021 at 10:10:03PM +0530, Aneesh Kumar K.V wrote:

FORM2 introduces a concept of secondary domain which is identical to the
concept of FORM1 primary domain. Use secondary domain as the numa node
when using persistent memory device. For DAX kmem use the logical domain
id introduced in FORM2. This new numa node

Signed-off-by: Aneesh Kumar K.V 
---
   arch/powerpc/mm/numa.c| 28 +++
   arch/powerpc/platforms/pseries/papr_scm.c | 26 +
   arch/powerpc/platforms/pseries/pseries.h  |  1 +
   3 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 86cd2af014f7..b9ac6d02e944 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -265,6 +265,34 @@ static int associativity_to_nid(const __be32 
*associativity)
return nid;
   }
   
+int get_primary_and_secondary_domain(struct device_node *node, int *primary, int *secondary)

+{
+   int secondary_index;
+   const __be32 *associativity;
+
+   if (!numa_enabled) {
+   *primary = NUMA_NO_NODE;
+   *secondary = NUMA_NO_NODE;
+   return 0;
+   }
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return -ENODEV;
+
+   if (of_read_number(associativity, 1) >= primary_domain_index) {
+   *primary = of_read_number(&associativity[primary_domain_index], 1);
+   secondary_index = of_read_number(&distance_ref_points[1], 1);


Secondary ID is always the second reference point, but primary depends
on the length of resources?  That seems very weird.


primary_domain_index is distance_ref_point[0]. With Form2 we would find
both primary and secondary domain ID same for all resources other than
persistent memory device. The usage w.r.t. persistent memory is
explained in patch 7.


Right, I misunderstood



With Form2 the primary domainID and secondary domainID are used to identify the
NUMA nodes the kernel should use when using persistent memory devices.


This seems kind of bogus.  With Form1, the primary/secondary ID are a
sort of hierarchy of distance (things with same primary ID are very
close, things with same secondary are kinda-close, etc.).  With Form2,
it's referring to their effective node for different purposes.

Using the same terms for different meanings seems unnecessarily
confusing.


They are essentially domainIDs. The interpretation of them is different
between Form1 and Form2. Hence I kept referring to them as primary and
secondary domainID. Any suggestion on what to name them with Form2?


My point is that reusing associativity-reference-points for something
with completely unrelated semantics seems like a very poor choice.



I agree that this reuse can be confusing. I could argue that there is
precedent for that in PAPR - FORM0 puts a different spin on the same
property as well - but there is no need to keep following existing PAPR
practices in new spec (and some might argue it's best not to).

As far as QEMU goes, renaming this property to "numa-associativity-mode"
(just an example) is a quick change to do since we separated FORM1 and FORM2
code over there.

Doing such a rename can also help with the issue of having to describe new
FORM2 semantics using "least significant boundary" or "primary domain" or
any FORM0|FORM1 related terminology.



It is not just changing the name, we will then have to explain the
meaning of ibm,associativity-reference-points with FORM2 right?

With FORM2 we want to represent the topology better

  

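 ---------------------------------------------------------
|  domainID 20                                            |
|   ------------------                                    |
|  |NUMA node1        |                                   |
|  |                  |           ---------------         |
|  |ProcB ---> MEMC   |          |NUMA node40    |        |
|  |  |               |          |               |        |
|  |  ----------------|--------> |  PMEMD        |        |
|  |                  |           ---------------         |
|   ------------------                                    |
|                                                         |
 ---------------------------------------------------------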
  


ibm,associativity:
 { 20, 1, 40}  ->

Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-17 Thread Aneesh Kumar K.V
Daniel Henrique Barboza  writes:

> On 6/17/21 4:46 AM, David Gibson wrote:
>> On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:
>>> David Gibson  writes:
>>>
>>>> On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:
>>>>> David Gibson  writes:
>>>>>
>>>>>> On Mon, Jun 14, 2021 at 10:10:03PM +0530, Aneesh Kumar K.V wrote:
>>>>>>> FORM2 introduces a concept of secondary domain which is identical to the
>>>>>>> concept of FORM1 primary domain. Use secondary domain as the numa node
>>>>>>> when using persistent memory device. For DAX kmem use the logical domain
>>>>>>> id introduced in FORM2. This new numa node
>>>>>>>
>>>>>>> Signed-off-by: Aneesh Kumar K.V 
>>>>>>> ---
>>>>>>>   arch/powerpc/mm/numa.c| 28 +++
>>>>>>>   arch/powerpc/platforms/pseries/papr_scm.c | 26 +
>>>>>>>   arch/powerpc/platforms/pseries/pseries.h  |  1 +
>>>>>>>   3 files changed, 45 insertions(+), 10 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>>>>>> index 86cd2af014f7..b9ac6d02e944 100644
>>>>>>> --- a/arch/powerpc/mm/numa.c
>>>>>>> +++ b/arch/powerpc/mm/numa.c
>>>>>>> @@ -265,6 +265,34 @@ static int associativity_to_nid(const __be32 
>>>>>>> *associativity)
>>>>>>> return nid;
>>>>>>>   }
>>>>>>>   
>>>>>>> +int get_primary_and_secondary_domain(struct device_node *node, int 
>>>>>>> *primary, int *secondary)
>>>>>>> +{
>>>>>>> +   int secondary_index;
>>>>>>> +   const __be32 *associativity;
>>>>>>> +
>>>>>>> +   if (!numa_enabled) {
>>>>>>> +   *primary = NUMA_NO_NODE;
>>>>>>> +   *secondary = NUMA_NO_NODE;
>>>>>>> +   return 0;
>>>>>>> +   }
>>>>>>> +
>>>>>>> +   associativity = of_get_associativity(node);
>>>>>>> +   if (!associativity)
>>>>>>> +   return -ENODEV;
>>>>>>> +
>>>>>>> +   if (of_read_number(associativity, 1) >= primary_domain_index) {
>>>>>>> +   *primary = of_read_number(&associativity[primary_domain_index], 1);
>>>>>>> +   secondary_index = of_read_number(&distance_ref_points[1], 1);
>>>>>>
>>>>>> Secondary ID is always the second reference point, but primary depends
>>>>>> on the length of resources?  That seems very weird.
>>>>>
>>>>> primary_domain_index is distance_ref_point[0]. With Form2 we would find
>>>>> both primary and secondary domain ID same for all resources other than
>>>>> persistent memory device. The usage w.r.t. persistent memory is
>>>>> explained in patch 7.
>>>>
>>>> Right, I misunderstood
>>>>
>>>>>
>>>>> With Form2 the primary domainID and secondary domainID are used to 
>>>>> identify the NUMA nodes
>>>>> the kernel should use when using persistent memory devices.
>>>>
>>>> This seems kind of bogus.  With Form1, the primary/secondary ID are a
>>>> sort of hierarchy of distance (things with same primary ID are very
>>>> close, things with same secondary are kinda-close, etc.).  With Form2,
>>>> it's referring to their effective node for different purposes.
>>>>
>>>> Using the same terms for different meanings seems unnecessarily
>>>> confusing.
>>>
>>> They are essentially domainIDs. The interpretation of them is different
>>> between Form1 and Form2. Hence I kept referring to them as primary and
>>> secondary domainID. Any suggestion on what to name them with Form2?
>> 
>> My point is that reusing associativity-reference-points for something
>> with completely unrelated semantics seems like a very poo

Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-17 Thread Aneesh Kumar K.V

On 6/17/21 1:16 PM, David Gibson wrote:

On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:


On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:


...


It's weird to me that you'd want to consider them in different nodes
for those different purposes.



   ----------------
  |NUMA node0      |
  |ProcA ---> MEMA |
  |  |             |
  |  |             |
  |  ---> PMEMB    |
  |                |
   ----------------

   ----------------
  |NUMA node1      |
  |                |
  |ProcB ---> MEMC |
  |  |             |
  |  ---> PMEMD    |
  |                |
  |                |
   ----------------
  


For a topology like the above, an application running on ProcA wants to find the
persistent memory mount local to its NUMA node. Hence when using it as a
pmem fsdax mount or devdax device we want PMEMB to have the associativity
of NUMA node0 and PMEMD to have the associativity of NUMA node 1. But when
we want to use it as memory using the dax kmem driver, we want both PMEMB
and PMEMD to appear as memory-only NUMA nodes at a distance that is
derived based on the latency of the media.


I'm still not understanding why the latency we care about is different
in the two cases.  Can you give an example of when this would result
in different actual node assignments for the two different cases?



In the above example, in order to allow the use of PMEMB and PMEMD as
memory-only NUMA nodes, we need the platform to represent each of them in its
own domainID. Let's assume that the platform assigned ids 40 and 41, and hence
PMEMB and PMEMD will have associativity arrays like below


{ 4, 6, 0}  -> PROCA/MEMA
{ 4, 6, 40} -> PMEMB
{ 4, 6, 41} -> PMEMD
{ 4, 6, 1} ->  PROCB/MEMC

When we want to use the devices PMEMB and PMEMD as fsdax/devdax devices,
we essentially look for the first nearest online node, which means both
PMEMB and PMEMD will appear as devices attached to node0. That is not
ideal for many applications.


Using the secondary domainID index as explained here helps to associate
each PMEM device with the right group. On a non-virtualized config or hard
partitioned config, such a device tree representation can be looked at as
a hint to identify which socket the actual device is connected to.


-aneesh


Re: [RFC PATCH 7/8] powerpc/pseries: Add support for FORM2 associativity

2021-06-17 Thread Aneesh Kumar K.V

On 6/17/21 1:20 PM, David Gibson wrote:

On Tue, Jun 15, 2021 at 01:10:27PM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:






PAPR defines "most significant" as below

When the “ibm,architecture-vec-5” property byte 5 bit 0 has the value of one,
the “ibm,associativity-reference-points” property indicates boundaries between
associativity domains presented by the “ibm,associativity” property containing
“near” and “far” resources. The first such boundary in the list represents the
1 based ordinal in the associativity lists of the most significant boundary,
with subsequent entries indicating progressively less significant boundaries


No... that's not a definition.  Like your draft PAPR uses the term
while entirely failing to define it.  From what I can tell about how
it is used the "most significant" boundary corresponds to what Linux
simply thinks of as the node id.  But intuitively, I'd think of that
as the "least significant" boundary, since that's basically the
smallest granularity at which we care about NUMA distances.



I would interpret it as the boundary where we start defining NUMA
nodes.


That isn't any clearer to me.


How about calling it least significant boundary then?


Heck, no.  My whole point here is that the meaning is unclear: my
first guess at the meaning is different from whoever wrote that text.
We need to come up with a way of describing it that's clearer.


The “ibm,associativity-reference-points” property contains one or more lists of
numbers (domainID index) that represent the 1 based ordinal in the associativity
lists of the least significant boundary, with subsequent entries indicating
progressively higher significant boundaries.

ex:
{ primary domainID index, secondary domainID index, tertiary domainID index.. }

Linux kernel uses the domainID of the least significant boundary (aka primary
domain) as the NUMA node id. Linux kernel computes NUMA distance between two
domains by recursively comparing if they belong to the same higher-level
domains. For mismatch at every higher level of the resource group, the kernel
doubles the NUMA distance between the comparing domains.
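
A minimal sketch of the doubling rule described in that wording, assuming both
associativity arrays are in CPU byte order and every reference point index is
within the length of both arrays. form1_distance() is a hypothetical
stand-alone helper for illustration, and the arrays in the usage note are made
up; it follows the same idea as the Form1 loop in arch/powerpc/mm/numa.c but is
not that code.

```c
#include <stdint.h>

#define LOCAL_DISTANCE 10

/*
 * Walk the reference points from the least significant boundary
 * (primary domain) upwards. Every level at which the two resources
 * sit in different domains doubles the distance; the walk stops at
 * the first level where the domains match.
 */
static int form1_distance(const uint32_t *assoc_a, const uint32_t *assoc_b,
			  const uint32_t *ref_points, int nr_ref_points)
{
	int distance = LOCAL_DISTANCE;

	for (int i = 0; i < nr_ref_points; i++) {
		uint32_t index = ref_points[i];	/* 1-based domainID index */

		if (assoc_a[index] == assoc_b[index])
			break;
		distance *= 2;
	}
	return distance;
}
```

With reference points { 3, 2 }, the arrays { 3, 6, 7, 0 } and { 3, 6, 7, 1 }
differ only at the primary level and end up at distance 20, while
{ 3, 6, 7, 0 } and { 3, 6, 9, 8 } differ at both levels and end up at 40.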





Any suggestion on how to reword the above section then? We could say
associativity-reference-points is a list of domainID indexes representing an
increasing hierarchy of resource groups. I am not sure that explains it
any better?




For ex: With domainID 0, 4, 5 we could do a 5x5 matrix to represent the
numa distance. Instead ibm,numa-lookup-index-table allows us to present
the same in a 3x3 matrix  distance[index0][index1] is the  distance
between NUMA node 0 and 4 and distance[index0][index2] is the distance
between NUMA node 0 and 5
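
A minimal sketch of that compact lookup, using the domainIDs 0, 4 and 5 from
the example. The distance values are made up for illustration and are not from
this thread; the table layout (a plain two-dimensional array rather than the
encoded device tree property) and the helper names domain_index() and
domain_distance() are also assumptions of the sketch.

```c
#include <stdint.h>

#define NR_DOMAINS 3

/* domainIDs in lookup-table order; the position is the domainID index. */
static const uint32_t lookup_index_table[NR_DOMAINS] = { 0, 4, 5 };

/* Illustrative distances only, indexed by domainID index. */
static const uint8_t distance_table[NR_DOMAINS][NR_DOMAINS] = {
	{ 10, 20, 40 },
	{ 20, 10, 80 },
	{ 40, 80, 10 },
};

/* Map a domainID to its index in the lookup table, or -1 if absent. */
static int domain_index(uint32_t domain_id)
{
	for (int i = 0; i < NR_DOMAINS; i++)
		if (lookup_index_table[i] == domain_id)
			return i;
	return -1;
}

/* Distance between two domainIDs, or -1 if either one is unknown. */
static int domain_distance(uint32_t a, uint32_t b)
{
	int ia = domain_index(a);
	int ib = domain_index(b);

	if (ia < 0 || ib < 0)
		return -1;
	return distance_table[ia][ib];
}
```

Here domain_distance(0, 4) reads distance_table[0][1] and
domain_distance(0, 5) reads distance_table[0][2], so three domainIDs need only
a 3x3 table regardless of how sparse the domainID numbering is.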


Right, I get the purpose of it, and I realized I misphrased my
question.  My point is that in a Form2 world, the *only* thing the
associativity array is used for is to deduce its position in
lookup-index-table.  Once you have that for each resource, you
have everything you need, yes?



ibm,associativity is used to find the domainID/NUMA node id of the
resource.

ibm,numa-lookup-index-table is used to compute the distance information between
NUMA nodes using ibm,numa-distance-table.


I get that you need to use lookup-index-table to work out how to
interpret numa-distance-table.  My point is that IIUC once you've done
the lookup in lookup-index-table once for each associativity array
value, the number you get out (which just a compacted version of the
node id) should be all you need ever again.



That is correct. We will continue to use the index to nodeid map during 
DLPAR, if such an operation adds a new numa node. update_numa_distance() 
shows the detail.


-aneesh


[PATCH v3 2/8] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index

2021-06-17 Thread Aneesh Kumar K.V
No functional change in this patch

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..132813dd1a6c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,7 +56,7 @@ static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
 #define MAX_DISTANCE_REF_POINTS 4
-static int distance_ref_points_depth;
+static int max_associativity_domain_index;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
 
@@ -169,7 +169,7 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 
int i, index;
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
index = be32_to_cpu(distance_ref_points[i]);
if (cpu1_assoc[index] == cpu2_assoc[index])
break;
@@ -193,7 +193,7 @@ int __node_distance(int a, int b)
if (!form1_affinity)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
break;
 
@@ -213,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
if (!form1_affinity)
return;
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
const __be32 *entry;
 
 entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
@@ -240,7 +240,7 @@ static int associativity_to_nid(const __be32 *associativity)
nid = NUMA_NO_NODE;
 
if (nid > 0 &&
-   of_read_number(associativity, 1) >= distance_ref_points_depth) {
+   of_read_number(associativity, 1) >= max_associativity_domain_index) {
/*
 * Skip the length field and send start of associativity array
 */
@@ -310,14 +310,14 @@ static int __init find_primary_domain_index(void)
 */
distance_ref_points = of_get_property(root,
"ibm,associativity-reference-points",
-   &distance_ref_points_depth);
+   &max_associativity_domain_index);
 
if (!distance_ref_points) {
dbg("NUMA: ibm,associativity-reference-points not found.\n");
goto err;
}
 
-   distance_ref_points_depth /= sizeof(int);
+   max_associativity_domain_index /= sizeof(int);
 
if (firmware_has_feature(FW_FEATURE_OPAL) ||
firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
@@ -328,7 +328,7 @@ static int __init find_primary_domain_index(void)
if (form1_affinity) {
index = of_read_number(distance_ref_points, 1);
} else {
-   if (distance_ref_points_depth < 2) {
+   if (max_associativity_domain_index < 2) {
printk(KERN_WARNING "NUMA: "
"short ibm,associativity-reference-points\n");
goto err;
@@ -341,10 +341,10 @@ static int __init find_primary_domain_index(void)
 * Warn and cap if the hardware supports more than
 * MAX_DISTANCE_REF_POINTS domains.
 */
-   if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
+   if (max_associativity_domain_index > MAX_DISTANCE_REF_POINTS) {
printk(KERN_WARNING "NUMA: distance array capped at "
"%d entries\n", MAX_DISTANCE_REF_POINTS);
-   distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
+   max_associativity_domain_index = MAX_DISTANCE_REF_POINTS;
}
 
of_node_put(root);
-- 
2.31.1



[PATCH v3 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-17 Thread Aneesh Kumar K.V
FORM2 introduces a concept of secondary domain which is identical to the
concept of FORM1 primary domain. Use the secondary domain as the NUMA node
when using a persistent memory device. With DAX kmem the kernel can use the
primary domainID introduced in Form2. More details can be found in
patch "powerpc/pseries: Add support for FORM2 associativity"

Cc: nvd...@lists.linux.dev
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c| 28 +++
 arch/powerpc/platforms/pseries/papr_scm.c | 26 +
 arch/powerpc/platforms/pseries/pseries.h  |  1 +
 3 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5a7d94960fb7..cd3ae7ff77ac 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -265,6 +265,34 @@ static int associativity_to_nid(const __be32 *associativity)
return nid;
 }
 
+int get_primary_and_secondary_domain(struct device_node *node, int *primary, int *secondary)
+{
+   int secondary_index;
+   const __be32 *associativity;
+
+   if (!numa_enabled) {
+   *primary = NUMA_NO_NODE;
+   *secondary = NUMA_NO_NODE;
+   return 0;
+   }
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return -ENODEV;
+
+   if (of_read_number(associativity, 1) >= primary_domain_index) {
+   *primary = of_read_number(&associativity[primary_domain_index], 1);
+   secondary_index = of_read_number(&distance_ref_points[1], 1);
+   *secondary = of_read_number(&associativity[secondary_index], 1);
+   }
+   if (*primary == 0xffff || *primary >= nr_node_ids)
+   *primary = NUMA_NO_NODE;
+
+   if (*secondary == 0xffff || *secondary >= nr_node_ids)
+   *secondary = NUMA_NO_NODE;
+   return 0;
+}
+
 /* Returns the nid associated with the given device tree node,
  * or -1 if not found.
  */
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index ef26fe40efb0..9bf2f1f3ddc5 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include "pseries.h"
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -88,6 +89,8 @@ struct papr_scm_perf_stats {
 struct papr_scm_priv {
struct platform_device *pdev;
struct device_node *dn;
+   int numa_node;
+   int target_node;
uint32_t drc_index;
uint64_t blocks;
uint64_t block_size;
@@ -923,7 +926,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
struct nd_mapping_desc mapping;
struct nd_region_desc ndr_desc;
unsigned long dimm_flags;
-   int target_nid, online_nid;
ssize_t stat_size;
 
p->bus_desc.ndctl = papr_scm_ndctl;
@@ -974,10 +976,8 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
 
 memset(&ndr_desc, 0, sizeof(ndr_desc));
-   target_nid = dev_to_node(&p->pdev->dev);
-   online_nid = numa_map_to_online_node(target_nid);
-   ndr_desc.numa_node = online_nid;
-   ndr_desc.target_node = target_nid;
+   ndr_desc.numa_node = p->numa_node;
+   ndr_desc.target_node = p->target_node;
 ndr_desc.res = &p->res;
ndr_desc.of_node = p->dn;
ndr_desc.provider_data = p;
@@ -1001,9 +1001,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
ndr_desc.res, p->dn);
goto err;
}
-   if (target_nid != online_nid)
-   dev_info(dev, "Region registered with target node %d and online node %d",
-target_nid, online_nid);
 
 mutex_lock(&papr_ndr_lock);
 list_add_tail(&p->region_list, &papr_nd_regions);
@@ -1096,7 +1093,7 @@ static int papr_scm_probe(struct platform_device *pdev)
struct papr_scm_priv *p;
const char *uuid_str;
u64 uuid[2];
-   int rc;
+   int rc, numa_node;
 
/* check we have all the required DT properties */
 if (of_property_read_u32(dn, "ibm,my-drc-index", &drc_index)) {
@@ -1119,11 +1116,20 @@ static int papr_scm_probe(struct platform_device *pdev)
return -ENODEV;
}
 
-
p = kzalloc(sizeof(*p), GFP_KERNEL);
if (!p)
return -ENOMEM;
 
+   if (get_primary_and_secondary_domain(dn, &p->target_node, &numa_node)) {
+   dev_err(&pdev->dev, "%pOF: missing NUMA attributes!\n", dn);
+   rc = -ENODEV;
+   goto err;
+   }
+   p->numa_node = numa_map_to_online_node(numa_node);
+   if (numa_node != p->numa_node)
+   dev_info(&pdev->dev, "Region registered with online node %d and device tree node %d",
+p->

[PATCH v3 7/8] powerpc/pseries: Add support for FORM2 associativity

2021-06-17 Thread Aneesh Kumar K.V
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/powerpc/associativity.rst   | 177 ++
 arch/powerpc/include/asm/firmware.h   |   3 +-
 arch/powerpc/include/asm/prom.h   |   1 +
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 149 +-
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 328 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..5fa9352dfc05
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,177 @@
+============================
+NUMA resource associativity
+============================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resource subsets of a given domain that exhibit better
+performance relative to each other than relative to other resource subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form 2
+associativity grouping. Form 0 is the older format and is now considered deprecated.
+
+Hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+------
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+------
+With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
+device tree properties are used to determine the NUMA distance between resource groups/domains.
+
+The “ibm,associativity” property contains one or more lists of numbers (domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains one or more lists of numbers
+(domainID index) that represent the 1 based ordinal in the associativity lists of the
+least significant boundary, with subsequent entries indicating progressively higher
+significant boundaries.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID of the least significant boundary (aka primary domain)
+as the NUMA node id. Linux kernel computes NUMA distance between two domains by
+recursively comparing if they belong to the same higher-level domains. For mismatch
+at every higher level of the resource group, the kernel doubles the NUMA distance between
+the comparing domains.
+
+Form 2
+------
+Form 2 associativity format adds separate device tree properties representing NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows flexible primary
+domain numbering. With numa distance computation now detached from the index value of
+"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
+same domainID index representing resource groups of different performance/latency characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains one or more lists of numbers representing
+the domainIDs present in the system. The offset of the domainID in this property is considered
+the domainID index.
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
+N domainID encoded as with encode-int
+
+For ex:
+ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
+
+"ibm,numa-distance-table" property contains one or more lists of numbers representing the NUMA
+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
+N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
+
+For ex:
+ibm,numa-lookup-index
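
The "ibm,numa-lookup-index-table" description in the documentation hunk above can be illustrated with a small userspace sketch. This is not kernel code: the helper name `domain_id_to_index` is hypothetical, and the property is assumed to be already decoded into host-endian integers, with the count N in the first cell.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: find the domainID index of
 * `domain_id` in a decoded "ibm,numa-lookup-index-table" property.
 * table[0] holds the count N, table[1..N] hold the domainIDs.
 * Returns the zero-based index, or -1 if the domainID is not listed.
 */
static int domain_id_to_index(const uint32_t *table, uint32_t domain_id)
{
	uint32_t n = table[0];

	for (uint32_t i = 0; i < n; i++) {
		if (table[1 + i] == domain_id)
			return (int)i;
	}
	return -1;
}
```

With the table {4, 0, 8, 250, 252} from the documentation, looking up domainID 8 yields index 1, which matches the example given there.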

[PATCH v3 6/8] powerpc/pseries: Add a helper for form1 cpu distance

2021-06-17 Thread Aneesh Kumar K.V
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support, and this change helps
keep that patch simpler. Also add a comment explaining that we don't
expect the code to be called with FORM0.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c481f08d565b..d32729f235b8 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
return dist;
 }
 
+int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+   /* We should not get called with FORM0 */
+   VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+   return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
-- 
2.31.1



[PATCH v3 5/8] powerpc/pseries: Consolidate NUMA distance update during boot

2021-06-17 Thread Aneesh Kumar K.V
Instead of updating the NUMA distance every time we look up a node id
from the associativity property, add helpers that can be used
during boot to do this only once. Also remove the distance
update from the node id lookup helpers.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 135 +++--
 1 file changed, 88 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 645a95e3a7ea..c481f08d565b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,22 +208,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-   const __be32 *associativity)
-{
-   int i;
-
-   if (affinity_form != FORM1_AFFINITY)
-   return;
-
-   for (i = 0; i < max_associativity_domain_index; i++) {
-   const __be32 *entry;
-
-   entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
-   distance_lookup_table[nid][i] = of_read_number(entry, 1);
-   }
-}
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
@@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 
*associativity)
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
nid = NUMA_NO_NODE;
-
-   if (nid > 0 &&
-   of_read_number(associativity, 1) >= 
max_associativity_domain_index) {
-   /*
-* Skip the length field and send start of associativity array
-*/
-   initialize_distance_lookup_table(nid, associativity + 1);
-   }
-
 out:
return nid;
 }
@@ -291,10 +266,13 @@ static void __initialize_form1_numa_distance(const __be32 
*associativity)
 {
int i, nid;
 
+   if (affinity_form != FORM1_AFFINITY)
+   return;
+
if (of_read_number(associativity, 1) >= primary_domain_index) {
nid = of_read_number(&associativity[primary_domain_index], 1);
 
-   for (i = 0; i < max_domain_index; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
const __be32 *entry;
 
entry = &associativity[be32_to_cpu(distance_ref_points[i])];
@@ -474,6 +452,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
return 0;
 }
 
+static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
+{
+   struct assoc_arrays aa = { .arrays = NULL };
+   int default_nid = NUMA_NO_NODE;
+   int nid = default_nid;
+   int rc, index;
+
+   if ((primary_domain_index < 0) || !numa_enabled)
+   return default_nid;
+
+   rc = of_get_assoc_arrays(&aa);
+   if (rc)
+   return default_nid;
+
+   if (primary_domain_index <= aa.array_sz &&
+   !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
+   nid = of_read_number(&aa.arrays[index], 1);
+
+   if (nid == 0xffff || nid >= nr_node_ids)
+   nid = default_nid;
+   if (nid > 0 && affinity_form == FORM1_AFFINITY) {
+   int i;
+   const __be32 *associativity;
+
+   index = lmb->aa_index * aa.array_sz;
+   associativity = &aa.arrays[index];
+   /*
+* lookup array associativity entries have different 
format
+* There is no length of the array as the first element.
+*/
+   for (i = 0; i < max_associativity_domain_index; i++) {
+   const __be32 *entry;
+
+   entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
+   distance_lookup_table[nid][i] = of_read_number(entry, 1);
+   }
+   }
+   }
+   return nid;
+}
+
 /*
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
@@ -499,21 +519,14 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 
if (nid == 0xffff || nid >= nr_node_ids)
nid = default_nid;
-
-   if (nid > 0) {
-   index = lmb->aa_index * aa.array_sz;
-   initialize_distance_lookup_table(nid,
-   &aa.arrays[index]);
-   }
}
-
return nid;
 }
 
 #ifdef CONFIG_PPC_SPLPAR
-static int vphn_get_nid(long lcpu)
+
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
 {
-   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
long rc, hwid;
 
/*
@@ -533,10 +546,22 @@ static i

[PATCH v3 4/8] powerpc/pseries: Consolidate DLPAR NUMA distance update

2021-06-17 Thread Aneesh Kumar K.V
The associativity details of the newly added resources are collected from
the hypervisor via the "ibm,configure-connector" RTAS call. Update the NUMA
distance details of the newly added NUMA node after the above call. A
later patch will stop updating the NUMA distance when looking up the
node id from the associativity array.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c| 41 +++
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |  2 +
 .../platforms/pseries/hotplug-memory.c|  2 +
 arch/powerpc/platforms/pseries/pseries.h  |  1 +
 4 files changed, 46 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0ec16999beef..645a95e3a7ea 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -287,6 +287,47 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity)
+{
+   int i, nid;
+
+   if (of_read_number(associativity, 1) >= primary_domain_index) {
+   nid = of_read_number(&associativity[primary_domain_index], 1);
+
+   for (i = 0; i < max_domain_index; i++) {
+   const __be32 *entry;
+
+   entry = &associativity[be32_to_cpu(distance_ref_points[i])];
+   distance_lookup_table[nid][i] = of_read_number(entry, 1);
+   }
+   }
+}
+
+static void initialize_form1_numa_distance(struct device_node *node)
+{
+   const __be32 *associativity;
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return;
+
+   __initialize_form1_numa_distance(associativity);
+   return;
+}
+
+/*
+ * Used to update distance information w.r.t newly added node.
+ */
+void update_numa_distance(struct device_node *node)
+{
+   if (affinity_form == FORM0_AFFINITY)
+   return;
+   else if (affinity_form == FORM1_AFFINITY) {
+   initialize_form1_numa_distance(node);
+   return;
+   }
+}
+
 static int __init find_primary_domain_index(void)
 {
int index;
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..778b6ab35f0d 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
return saved_rc;
}
 
+   update_numa_distance(dn);
+
rc = dlpar_online_cpu(dn);
if (rc) {
saved_rc = rc;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..0e602c3b01ea 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct drmem_lmb 
*lmb)
return -ENODEV;
}
 
+   update_numa_distance(lmb_node);
+
dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
if (!dr_node) {
dlpar_free_cc_nodes(lmb_node);
diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index 1f051a786fb3..663a0859cf13 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -113,4 +113,5 @@ extern u32 pseries_security_flavor;
 void pseries_setup_security_mitigations(void);
 void pseries_lpar_read_hblkrm_characteristics(void);
 
+void update_numa_distance(struct device_node *node);
 #endif /* _PSERIES_PSERIES_H */
-- 
2.31.1
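
The indexing used by __initialize_form1_numa_distance() in the patch above relies on the associativity array carrying its length in the first cell, so a 1-based reference point can be used directly as an array index. A minimal userspace sketch of that convention follows. It is not the kernel implementation: the name `fill_distance_row` is hypothetical, the arrays are assumed to be host-endian (no be32 conversion), and the per-entry bounds check is an addition for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical sketch, not kernel code.
 *
 * assoc[0]      number of domainIDs that follow (the length cell)
 * assoc[1..]    domainIDs, most significant boundary first
 * ref_points[]  1-based positions into the domainID list
 *
 * Because of the length cell, assoc[ref_points[i]] is the domainID at
 * the i-th reference point. Returns 0 on success, -1 if a reference
 * point falls outside the associativity array.
 */
static int fill_distance_row(const uint32_t *assoc,
			     const uint32_t *ref_points,
			     int n_ref_points, uint32_t *row)
{
	for (int i = 0; i < n_ref_points; i++) {
		uint32_t pos = ref_points[i];

		if (pos == 0 || pos > assoc[0])
			return -1;
		row[i] = assoc[pos];
	}
	return 0;
}
```

For an array that has no length cell, the same lookup would need `pos - 1` as the index instead of `pos`.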



[PATCH v3 3/8] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY

2021-06-17 Thread Aneesh Kumar K.V
Also make related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/firmware.h   |  4 +--
 arch/powerpc/include/asm/prom.h   |  2 +-
 arch/powerpc/kernel/prom_init.c   |  2 +-
 arch/powerpc/mm/numa.c| 35 ++-
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPAL ASM_CONST(0x1000)
 #define FW_FEATURE_SET_MODE ASM_CONST(0x4000)
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRN ASM_CONST(0x0002)
 #define FW_FEATURE_DRMEM_V2 ASM_CONST(0x0004)
 #define FW_FEATURE_DRC_INFO ASM_CONST(0x0008)
@@ -69,7 +69,7 @@ enum {
FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI 0x0201  /* PCIe/MSI support */
 #define OV5_CMO 0x0480  /* Cooperative Memory Overcommitment */
 #define OV5_XCMO   0x0440  /* Page Coalescing */
-#define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY 0x0580  /* FORM1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
 #define OV5_HP_EVT 0x0604  /* Hot Plug Event support */
 #define OV5_RESIZE_HPT 0x0601  /* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 41ed7e33d897..64b9593038a7 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1070,7 +1070,7 @@ static const struct ibm_arch_vec 
ibm_architecture_vec_template __initconst = {
 #else
0,
 #endif
-   .associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
+   .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
.micro_checkpoint = 0,
.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 132813dd1a6c..0ec16999beef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int max_associativity_domain_index;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
int i;
int distance = LOCAL_DISTANCE;
 
-   if (!form1_affinity)
+   if (affinity_form == FORM0_AFFINITY)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
for (i = 0; i < max_associativity_domain_index; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
int i;
 
-   if (!form1_affinity)
+   if (affinity_form != FORM1_AFFINITY)
return;
 
for (i = 0; i < max_associativity_domain_index; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
int index;
struct device_node *root;
 
+   /*
+* Check for which form of affinity.
+*/
+   if (firmware_has_feature(FW_FEATURE_OPAL)) {
+   affinity_form = FORM1_AFFINITY;
+   } else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+   dbg("Using form 1 affinity\n");
+   affinity_form = FORM1_AFFINITY;
+   } else
+   affinity_form = FORM0_AFFINITY;
+
if (firmware_has_feature(FW_FEATURE_OPAL))
root = of_find_node_by_path("/ibm,opal"
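
The OV5_* constants in the prom.h hunk above appear to pack a byte index and a bit mask into one value: 0x0580 reads as byte 5, mask 0x80, which matches the documentation's "bit 0 of byte 5" if bits are numbered from the most significant end. The sketch below works under that assumption. The helper names are hypothetical and this is not the kernel's feature-detection code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OV5_FORM1_AFFINITY 0x0580 /* value as shown in the prom.h hunk */
#define OV5_PRRN           0x0540 /* value as shown in the prom.h hunk */

/* Hypothetical helpers: split an OV5_* constant into index and mask. */
static unsigned int ov5_byte_index(unsigned int feature)
{
	return feature >> 8;
}

static uint8_t ov5_bit_mask(unsigned int feature)
{
	return (uint8_t)(feature & 0xff);
}

/* Test one feature bit in a copy of the "ibm,architecture-vec-5" bytes. */
static bool vec5_has_feature(const uint8_t *vec5, size_t len,
			     unsigned int feature)
{
	unsigned int idx = ov5_byte_index(feature);

	if (idx >= len)
		return false;
	return (vec5[idx] & ov5_bit_mask(feature)) != 0;
}
```

Under the same numbering, "bit 2 of byte 5" would correspond to mask 0x20 in byte 5. The constant the series uses for Form 2 is not visible in this excerpt, so that value is only an inference.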

[PATCH v3 1/8] powerpc/pseries: rename min_common_depth to primary_domain_index

2021-06-17 Thread Aneesh Kumar K.V
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
if (!numa_enabled)
goto out;
 
-   if (of_read_number(associativity, 1) >= min_common_depth)
-   nid = of_read_number(&associativity[min_common_depth], 1);
+   if (of_read_number(associativity, 1) >= primary_domain_index)
+   nid = of_read_number(&associativity[primary_domain_index], 1);
 
/* POWER4 LPAR uses 0xffff as invalid node */
if (nid == 0xffff || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-   int depth;
+   int index;
struct device_node *root;
 
if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
}
 
if (form1_affinity) {
-   depth = of_read_number(distance_ref_points, 1);
+   index = of_read_number(distance_ref_points, 1);
} else {
if (distance_ref_points_depth < 2) {
printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
goto err;
}
 
-   depth = of_read_number(&distance_ref_points[1], 1);
+   index = of_read_number(&distance_ref_points[1], 1);
}
 
/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
}
 
of_node_put(root);
-   return depth;
+   return index;
 
 err:
of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
int nid = default_nid;
int rc, index;
 
-   if ((min_common_depth < 0) || !numa_enabled)
+   if ((primary_domain_index < 0) || !numa_enabled)
return default_nid;
 
rc = of_get_assoc_arrays(&aa);
if (rc)
return default_nid;
 
-   if (min_common_depth <= aa.array_sz &&
+   if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
nid = of_read_number(&aa.arrays[index], 1);
 
if (nid == 0xffff || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
return -1;
}
 
-   min_common_depth = find_min_common_depth();
+   primary_domain_index = find_primary_domain_index();
 
-   if (min_common_depth < 0) {
+   if (primary_domain_index < 0) {
/*
-* if we fail to parse min_common_depth from device tree
+* if we fail to parse primary_domain_index from device tree
 * mark the numa disabled, boot with numa disabled.
 */
numa_enabled = false;
-   return min_common_depth;
+   return primary_domain_index;
}
 
-   dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+   dbg("NUMA associativity depth for CPU/Memory: %d\n", 
primary_domain_index);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
goto out;
}
 
-   max_nodes = of_read_number(&domains[min_common_depth], 1);
+   max_nodes = of_read_number(&domains[primary_domain_index], 1);
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
}
 
prop_length /= sizeof(int);
-   if (prop_length > min_common_depth + 2)
+   if (prop_length > primary_domain_index + 2)
coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
goto out;
 
index = of_read_number(associativity, 1);
-   if (index > min_common_depth + 1)
+   if (index > primary_domain_index + 1)
return of_read_number(&associativity[index - 1], 1);
 
 out:
-- 
2.31.1



[PATCH v3 0/8] Add support for FORM2 associativity

2021-06-17 Thread Aneesh Kumar K.V
Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. This also allows PAPR SCM device
to use better associativity when using the device as DAX KMEM
device. More details can be found in patch 7.

$ ndctl list -N -v
[
  {
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":1071644672,
"uuid":"37dea198-ddb5-4e42-915a-99a915e24188",
"raw_uuid":"148deeaa-4a2f-41d1-8d74-fd9a942d26ba",
"daxregion":{
  "id":0,
  "size":1071644672,
  "devices":[
{
  "chardev":"dax0.0",
  "size":1071644672,
  "target_node":4,
  "mode":"devdax"
}
  ]
},
"align":2097152,
"numa_node":1
  }
]

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  22  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$

After DAX KMEM
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  22  33  255 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  255 
  4:  255  255  255  255  10 
# 

The above output is with a Qemu command line

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=22 -numa 
dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=255 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa 
dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa 
dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa 
dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=255 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa 
dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=255 \
-object 
memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}
  \
-device 
nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4,device-node=1

Qemu changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index

Aneesh Kumar K.V (8):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: rename distance_ref_points_depth to
max_associativity_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate DLPAR NUMA distance update
  powerpc/pseries: Consolidate NUMA distance update during boot
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity
  powerpc/papr_scm: Use FORM2 associativity details

 Documentation/powerpc/associativity.rst   | 177 +++
 arch/powerpc/include/asm/firmware.h   |   7 +-
 arch/powerpc/include/asm/prom.h   |   3 +-
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 436 ++
 arch/powerpc/platforms/pseries/firmware.c |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 arch/powerpc/platforms/pseries/papr_scm.c |  26 +-
 arch/powerpc/platforms/pseries/pseries.h  |   2 +
 10 files changed, 560 insertions(+), 101 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1
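
To make the Form2 idea in this cover letter concrete (distances come from an explicit table rather than being derived from associativity levels), here is a small userspace sketch using the "After DAX KMEM" matrix shown above. It is not kernel code: the helper name `form2_distance` is hypothetical, and the row-by-row flattening of the matrix is an assumption, since the exact property layout is not spelled out in this message.

```c
#include <stddef.h>
#include <stdint.h>

/* The 5x5 node distance matrix from the numactl output above. */
static const uint8_t cover_letter_distances[5 * 5] = {
	 10,  11,  22,  33, 255,
	 44,  10,  55,  66, 255,
	 77,  88,  10,  99, 255,
	101, 121, 132,  10, 255,
	255, 255, 255, 255,  10,
};

/*
 * Hypothetical helper, not kernel code. `dist` is a flattened n x n
 * matrix of one-byte distances, assumed to be stored row by row.
 * Returns the distance from `src` to `dst`, or -1 if either index is
 * out of range.
 */
static int form2_distance(const uint8_t *dist, size_t n,
			  size_t src, size_t dst)
{
	if (src >= n || dst >= n)
		return -1;
	return dist[src * n + dst];
}
```

Note that the matrix is not symmetric: the distance from node 0 to node 1 is 11, while the distance from node 1 to node 0 is 44, which a level-based scheme such as Form1 cannot express.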



[PATCH v2 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-17 Thread Aneesh Kumar K.V
FORM2 introduces a concept of secondary domain which is identical to the
concept of FORM1 primary domain. Use secondary domain as the numa node
when using persistent memory device. For DAX kmem use the logical domain
id introduced in FORM2. This new numa node

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c| 28 +++
 arch/powerpc/platforms/pseries/papr_scm.c | 26 +
 arch/powerpc/platforms/pseries/pseries.h  |  1 +
 3 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5a7d94960fb7..cd3ae7ff77ac 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -265,6 +265,34 @@ static int associativity_to_nid(const __be32 
*associativity)
return nid;
 }
 
+int get_primary_and_secondary_domain(struct device_node *node, int *primary, 
int *secondary)
+{
+   int secondary_index;
+   const __be32 *associativity;
+
+   if (!numa_enabled) {
+   *primary = NUMA_NO_NODE;
+   *secondary = NUMA_NO_NODE;
+   return 0;
+   }
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return -ENODEV;
+
+   if (of_read_number(associativity, 1) >= primary_domain_index) {
+   *primary = of_read_number(&associativity[primary_domain_index], 1);
+   secondary_index = of_read_number(&distance_ref_points[1], 1);
+   *secondary = of_read_number(&associativity[secondary_index], 1);
+   }
+   if (*primary == 0xffff || *primary >= nr_node_ids)
+   *primary = NUMA_NO_NODE;
+
+   if (*secondary == 0xffff || *secondary >= nr_node_ids)
+   *secondary = NUMA_NO_NODE;
+   return 0;
+}
+
 /* Returns the nid associated with the given device tree node,
  * or -1 if not found.
  */
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index ef26fe40efb0..9bf2f1f3ddc5 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include "pseries.h"
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -88,6 +89,8 @@ struct papr_scm_perf_stats {
 struct papr_scm_priv {
struct platform_device *pdev;
struct device_node *dn;
+   int numa_node;
+   int target_node;
uint32_t drc_index;
uint64_t blocks;
uint64_t block_size;
@@ -923,7 +926,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
struct nd_mapping_desc mapping;
struct nd_region_desc ndr_desc;
unsigned long dimm_flags;
-   int target_nid, online_nid;
ssize_t stat_size;
 
p->bus_desc.ndctl = papr_scm_ndctl;
@@ -974,10 +976,8 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
 
memset(&ndr_desc, 0, sizeof(ndr_desc));
-   target_nid = dev_to_node(&p->pdev->dev);
-   online_nid = numa_map_to_online_node(target_nid);
-   ndr_desc.numa_node = online_nid;
-   ndr_desc.target_node = target_nid;
+   ndr_desc.numa_node = p->numa_node;
+   ndr_desc.target_node = p->target_node;
ndr_desc.res = &p->res;
ndr_desc.of_node = p->dn;
ndr_desc.provider_data = p;
@@ -1001,9 +1001,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
ndr_desc.res, p->dn);
goto err;
}
-   if (target_nid != online_nid)
-   dev_info(dev, "Region registered with target node %d and online 
node %d",
-target_nid, online_nid);
 
mutex_lock(&papr_ndr_lock);
list_add_tail(&p->region_list, &papr_nd_regions);
@@ -1096,7 +1093,7 @@ static int papr_scm_probe(struct platform_device *pdev)
struct papr_scm_priv *p;
const char *uuid_str;
u64 uuid[2];
-   int rc;
+   int rc, numa_node;
 
/* check we have all the required DT properties */
if (of_property_read_u32(dn, "ibm,my-drc-index", &drc_index)) {
@@ -1119,11 +1116,20 @@ static int papr_scm_probe(struct platform_device *pdev)
return -ENODEV;
}
 
-
p = kzalloc(sizeof(*p), GFP_KERNEL);
if (!p)
return -ENOMEM;
 
+   if (get_primary_and_secondary_domain(dn, &p->target_node, &numa_node)) {
+   dev_err(&pdev->dev, "%pOF: missing NUMA attributes!\n", dn);
+   rc = -ENODEV;
+   goto err;
+   }
+   p->numa_node = numa_map_to_online_node(numa_node);
+   if (numa_node != p->numa_node)
+   dev_info(&pdev->dev, "Region registered with online node %d and device tree node %d",
+p->numa_node, numa_node);
+
/* Initialize the dimm mutex */
mutex_init(&p->health_mutex);
 
diff --git a/arch

[PATCH v2 7/8] powerpc/pseries: Add support for FORM2 associativity

2021-06-17 Thread Aneesh Kumar K.V
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/powerpc/associativity.rst   | 177 ++
 arch/powerpc/include/asm/firmware.h   |   3 +-
 arch/powerpc/include/asm/prom.h   |   1 +
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 149 +-
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 328 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..5fa9352dfc05
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,177 @@
+
+NUMA resource associativity
+=
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux 
kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these 
resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the older format and is now considered 
deprecated.
+
+Hypervisor indicates the type/form of associativity used via 
"ibm,architecture-vec-5" property.
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 
associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+-
+With Form 1 a combination of ibm,associativity-reference-points and 
ibm,associativity
+device tree properties are used to determine the NUMA distance between 
resource groups/domains.
+
+The “ibm,associativity” property contains one or more lists of numbers 
(domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains one or more 
list of numbers
+(domainID index) that represents the 1 based ordinal in the associativity 
lists of the
+least significant boundary, with subsequent entries indicating progressively 
higher
+significant boundaries.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID of the least significant boundary (aka primary 
domain)
+as the NUMA node id. Linux kernel computes NUMA distance between two domains by
+recursively comparing if they belong to the same higher-level domains. For 
mismatch
+at every higher level of the resource group, the kernel doubles the NUMA 
distance between
+the comparing domains.
+
+Form 2
+---
+Form 2 associativity format adds separate device tree properties representing 
NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows 
flexible primary
+domain numbering. With numa distance computation now detached from the index 
value of
+"ibm,associativity" property, Form 2 allows a large number of primary domain 
ids at the
+same domainID index representing resource groups of different 
performance/latency characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in 
the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains one or more list numbers 
representing
+the domainIDs present in the system. The offset of the domainID in this 
property is considered
+the domainID index.
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, 
followed by
+N domainID encoded as with encode-int
+
+For ex:
+ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for 
domainID 8 is 1.
+
+"ibm,numa-distance-table" property contains one or more list of numbers 
representing the NUMA
+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with 
encode-int, followed by
+N distance values encoded as with encode-bytes. The max distance value we 
could encode is 255.
+
+For ex:
+ibm,numa-look
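
The Form 1 rule described in the documentation hunk above (start from the local distance and double it for every level at which two domains differ) can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel's implementation: the function name, the fixed `MAX_LEVELS`, and the layout of the lookup table are hypothetical, and the loop stops at the first level where the two nodes share a domain. `LOCAL_DISTANCE` is 10, as in Linux.

```c
#define LOCAL_DISTANCE 10
#define MAX_LEVELS 4

/*
 * Hypothetical sketch, not kernel code.
 * lookup[nid][i] is the domainID of node `nid` at the i-th reference
 * point, ordered from the primary domain towards higher-level domains.
 */
static int form1_node_distance(const unsigned int lookup[][MAX_LEVELS],
			       int levels, int a, int b)
{
	int distance = LOCAL_DISTANCE;

	for (int i = 0; i < levels; i++) {
		/* Same domain at this level: no further doubling. */
		if (lookup[a][i] == lookup[b][i])
			break;
		distance *= 2;
	}
	return distance;
}
```

With two levels, nodes that differ only in their primary domain end up at distance 20, and nodes that also differ at the next level end up at distance 40.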

[PATCH v2 6/8] powerpc/pseries: Add a helper for form1 cpu distance

2021-06-17 Thread Aneesh Kumar K.V
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support, and this change helps
keep that patch simpler. Also add a comment explaining that we don't
expect the code to be called with FORM0.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c481f08d565b..d32729f235b8 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
return dist;
 }
 
+int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+   /* We should not get called with FORM0 */
+   VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+   return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
-- 
2.31.1
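
For readers following the series, the FORM1 walk that the helper in this patch wraps can be modelled in user space roughly as follows. This is a sketch under assumptions, not the kernel function: `form1_cpu_distance` is a hypothetical name, the arrays are plain host-endian integers rather than `__be32`, and the reference points are passed in as a parameter instead of being read from a global.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * User-space model of the FORM1 CPU distance walk. ref_points[] holds
 * indices into the associativity arrays, primary domain first; all
 * values are assumed to be host-endian already.
 */
static int form1_cpu_distance(const uint32_t *cpu1_assoc,
			      const uint32_t *cpu2_assoc,
			      const uint32_t *ref_points, size_t depth)
{
	int dist = 0;

	assert(cpu1_assoc && cpu2_assoc && ref_points);

	for (size_t i = 0; i < depth; i++) {
		uint32_t index = ref_points[i];

		/* Stop at the first reference point the two CPUs share. */
		if (cpu1_assoc[index] == cpu2_assoc[index])
			break;
		dist++;
	}
	return dist;
}
```

The result is the number of reference points at which the two CPUs differ before the first match, so two CPUs in the same primary domain get a distance of 0.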



[PATCH v2 5/8] powerpc/pseries: Consolidate NUMA distance update during boot

2021-06-17 Thread Aneesh Kumar K.V
Instead of updating the NUMA distance every time we look up a node id
from the associativity property, add helpers that can be used
during boot and that do this only once. Also remove the distance
update from the node id lookup helpers.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 135 +++--
 1 file changed, 88 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 645a95e3a7ea..c481f08d565b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,22 +208,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-   const __be32 *associativity)
-{
-   int i;
-
-   if (affinity_form != FORM1_AFFINITY)
-   return;
-
-   for (i = 0; i < max_associativity_domain_index; i++) {
-   const __be32 *entry;
-
-   entry = [be32_to_cpu(distance_ref_points[i]) - 1];
-   distance_lookup_table[nid][i] = of_read_number(entry, 1);
-   }
-}
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
@@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 
*associativity)
/* POWER4 LPAR uses 0x as invalid node */
if (nid == 0x || nid >= nr_node_ids)
nid = NUMA_NO_NODE;
-
-   if (nid > 0 &&
-   of_read_number(associativity, 1) >= 
max_associativity_domain_index) {
-   /*
-* Skip the length field and send start of associativity array
-*/
-   initialize_distance_lookup_table(nid, associativity + 1);
-   }
-
 out:
return nid;
 }
@@ -291,10 +266,13 @@ static void __initialize_form1_numa_distance(const __be32 
*associativity)
 {
int i, nid;
 
+   if (affinity_form != FORM1_AFFINITY)
+   return;
+
if (of_read_number(associativity, 1) >= primary_domain_index) {
nid = of_read_number([primary_domain_index], 1);
 
-   for (i = 0; i < max_domain_index; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
const __be32 *entry;
 
entry = 
[be32_to_cpu(distance_ref_points[i])];
@@ -474,6 +452,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
return 0;
 }
 
+static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
+{
+   struct assoc_arrays aa = { .arrays = NULL };
+   int default_nid = NUMA_NO_NODE;
+   int nid = default_nid;
+   int rc, index;
+
+   if ((primary_domain_index < 0) || !numa_enabled)
+   return default_nid;
+
+   rc = of_get_assoc_arrays();
+   if (rc)
+   return default_nid;
+
+   if (primary_domain_index <= aa.array_sz &&
+   !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
+   nid = of_read_number([index], 1);
+
+   if (nid == 0x || nid >= nr_node_ids)
+   nid = default_nid;
+   if (nid > 0 && affinity_form == FORM1_AFFINITY) {
+   int i;
+   const __be32 *associativity;
+
+   index = lmb->aa_index * aa.array_sz;
+   associativity = [index];
+   /*
+* lookup array associativity entries have different 
format
+* There is no length of the array as the first element.
+*/
+   for (i = 0; i < max_associativity_domain_index; i++) {
+   const __be32 *entry;
+
+   entry = 
[be32_to_cpu(distance_ref_points[i]) - 1];
+   distance_lookup_table[nid][i] = 
of_read_number(entry, 1);
+   }
+   }
+   }
+   return nid;
+}
+
 /*
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
@@ -499,21 +519,14 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 
if (nid == 0x || nid >= nr_node_ids)
nid = default_nid;
-
-   if (nid > 0) {
-   index = lmb->aa_index * aa.array_sz;
-   initialize_distance_lookup_table(nid,
-   [index]);
-   }
}
-
return nid;
 }
 
 #ifdef CONFIG_PPC_SPLPAR
-static int vphn_get_nid(long lcpu)
+
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
 {
-   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
long rc, hwid;
 
/*
@@ -533,10 +546,22 @@ static i

[PATCH v2 4/8] powerpc/pseries: Consolidate DLPAR NUMA distance update

2021-06-17 Thread Aneesh Kumar K.V
The associativity details of newly added resources are collected from
the hypervisor via the "ibm,configure-connector" RTAS call. Update the NUMA
distance details of the newly added NUMA node after the above call. A
later patch will remove the NUMA distance update that happens when we look
up a node id from the associativity array.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c| 41 +++
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |  2 +
 .../platforms/pseries/hotplug-memory.c|  2 +
 arch/powerpc/platforms/pseries/pseries.h  |  1 +
 4 files changed, 46 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0ec16999beef..645a95e3a7ea 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -287,6 +287,47 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity)
+{
+   int i, nid;
+
+   if (of_read_number(associativity, 1) >= primary_domain_index) {
+   nid = of_read_number([primary_domain_index], 1);
+
+   for (i = 0; i < max_domain_index; i++) {
+   const __be32 *entry;
+
+   entry = 
[be32_to_cpu(distance_ref_points[i])];
+   distance_lookup_table[nid][i] = of_read_number(entry, 
1);
+   }
+   }
+}
+
+static void initialize_form1_numa_distance(struct device_node *node)
+{
+   const __be32 *associativity;
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return;
+
+   __initialize_form1_numa_distance(associativity);
+   return;
+}
+
+/*
+ * Used to update distance information w.r.t newly added node.
+ */
+void update_numa_distance(struct device_node *node)
+{
+   if (affinity_form == FORM0_AFFINITY)
+   return;
+   else if (affinity_form == FORM1_AFFINITY) {
+   initialize_form1_numa_distance(node);
+   return;
+   }
+}
+
 static int __init find_primary_domain_index(void)
 {
int index;
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..778b6ab35f0d 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
return saved_rc;
}
 
+   update_numa_distance(dn);
+
rc = dlpar_online_cpu(dn);
if (rc) {
saved_rc = rc;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..0e602c3b01ea 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct drmem_lmb 
*lmb)
return -ENODEV;
}
 
+   update_numa_distance(lmb_node);
+
dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
if (!dr_node) {
dlpar_free_cc_nodes(lmb_node);
diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index 1f051a786fb3..663a0859cf13 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -113,4 +113,5 @@ extern u32 pseries_security_flavor;
 void pseries_setup_security_mitigations(void);
 void pseries_lpar_read_hblkrm_characteristics(void);
 
+void update_numa_distance(struct device_node *node);
 #endif /* _PSERIES_PSERIES_H */
-- 
2.31.1
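
The way a FORM1 associativity array feeds the distance lookup table in this patch can be sketched in user space as below. Treat it as an illustration under assumptions: `record_form1_distance`, `lookup_table`, and `MAX_NODES` are hypothetical names, values are host-endian, and the array is assumed to carry its length in cell 0, so the reference points index into it without any adjustment. The extra bounds check is added for the sketch and is not part of the quoted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REF_POINTS 4	/* mirrors MAX_DISTANCE_REF_POINTS in the series */
#define MAX_NODES 8		/* illustrative bound, not the kernel's MAX_NUMNODES */

static uint32_t lookup_table[MAX_NODES][MAX_REF_POINTS];

/*
 * assoc[0] is the number of entries that follow. Returns the node id
 * whose row was updated, or -1 if the array cannot be used.
 */
static int record_form1_distance(const uint32_t *assoc,
				 const uint32_t *ref_points, size_t depth,
				 uint32_t primary_index)
{
	assert(assoc && ref_points && depth <= MAX_REF_POINTS);

	/* Mirrors the "length >= primary_domain_index" test in the patch. */
	if (assoc[0] < primary_index)
		return -1;

	uint32_t nid = assoc[primary_index];

	if (nid >= MAX_NODES)
		return -1;

	/* Extra bounds check added for the sketch; not in the quoted code. */
	for (size_t i = 0; i < depth; i++) {
		if (ref_points[i] > assoc[0])
			return -1;
	}

	/* One domain id per reference point, stored in the node's row. */
	for (size_t i = 0; i < depth; i++)
		lookup_table[nid][i] = assoc[ref_points[i]];

	return (int)nid;
}
```

Calling this once per newly added node is the user-space analogue of what `update_numa_distance()` does after the DLPAR add.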



[PATCH v2 3/8] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY

2021-06-17 Thread Aneesh Kumar K.V
Also make related code cleanups that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/firmware.h   |  4 +--
 arch/powerpc/include/asm/prom.h   |  2 +-
 arch/powerpc/kernel/prom_init.c   |  2 +-
 arch/powerpc/mm/numa.c| 35 ++-
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPALASM_CONST(0x1000)
 #define FW_FEATURE_SET_MODEASM_CONST(0x4000)
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRNASM_CONST(0x0002)
 #define FW_FEATURE_DRMEM_V2ASM_CONST(0x0004)
 #define FW_FEATURE_DRC_INFOASM_CONST(0x0008)
@@ -69,7 +69,7 @@ enum {
FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI0x0201  /* PCIe/MSI support */
 #define OV5_CMO0x0480  /* Cooperative Memory 
Overcommitment */
 #define OV5_XCMO   0x0440  /* Page Coalescing */
-#define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY 0x0580  /* FORM1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
 #define OV5_HP_EVT 0x0604  /* Hot Plug Event support */
 #define OV5_RESIZE_HPT 0x0601  /* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 41ed7e33d897..64b9593038a7 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1070,7 +1070,7 @@ static const struct ibm_arch_vec 
ibm_architecture_vec_template __initconst = {
 #else
0,
 #endif
-   .associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
+   .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
.micro_checkpoint = 0,
.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 132813dd1a6c..0ec16999beef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int max_associativity_domain_index;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
int i;
int distance = LOCAL_DISTANCE;
 
-   if (!form1_affinity)
+   if (affinity_form == FORM0_AFFINITY)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
for (i = 0; i < max_associativity_domain_index; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
int i;
 
-   if (!form1_affinity)
+   if (affinity_form != FORM1_AFFINITY)
return;
 
for (i = 0; i < max_associativity_domain_index; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
int index;
struct device_node *root;
 
+   /*
+* Check for which form of affinity.
+*/
+   if (firmware_has_feature(FW_FEATURE_OPAL)) {
+   affinity_form = FORM1_AFFINITY;
+   } else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+   dbg("Using form 1 affinity\n");
+   affinity_form = FORM1_AFFINITY;
+   } else
+   affinity_form = FORM0_AFFINITY;
+
if (firmware_has_feature(FW_FEATURE_OPAL))
root = of_find_node_by_path("/ibm,opal"

[PATCH v2 2/8] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index

2021-06-17 Thread Aneesh Kumar K.V
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..132813dd1a6c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,7 +56,7 @@ static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
 #define MAX_DISTANCE_REF_POINTS 4
-static int distance_ref_points_depth;
+static int max_associativity_domain_index;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
 
@@ -169,7 +169,7 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 
int i, index;
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
index = be32_to_cpu(distance_ref_points[i]);
if (cpu1_assoc[index] == cpu2_assoc[index])
break;
@@ -193,7 +193,7 @@ int __node_distance(int a, int b)
if (!form1_affinity)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
break;
 
@@ -213,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
if (!form1_affinity)
return;
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
+   for (i = 0; i < max_associativity_domain_index; i++) {
const __be32 *entry;
 
entry = [be32_to_cpu(distance_ref_points[i]) - 1];
@@ -240,7 +240,7 @@ static int associativity_to_nid(const __be32 *associativity)
nid = NUMA_NO_NODE;
 
if (nid > 0 &&
-   of_read_number(associativity, 1) >= distance_ref_points_depth) {
+   of_read_number(associativity, 1) >= 
max_associativity_domain_index) {
/*
 * Skip the length field and send start of associativity array
 */
@@ -310,14 +310,14 @@ static int __init find_primary_domain_index(void)
 */
distance_ref_points = of_get_property(root,
"ibm,associativity-reference-points",
-   &distance_ref_points_depth);
+   &max_associativity_domain_index);
 
if (!distance_ref_points) {
dbg("NUMA: ibm,associativity-reference-points not found.\n");
goto err;
}
 
-   distance_ref_points_depth /= sizeof(int);
+   max_associativity_domain_index /= sizeof(int);
 
if (firmware_has_feature(FW_FEATURE_OPAL) ||
firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
@@ -328,7 +328,7 @@ static int __init find_primary_domain_index(void)
if (form1_affinity) {
index = of_read_number(distance_ref_points, 1);
} else {
-   if (distance_ref_points_depth < 2) {
+   if (max_associativity_domain_index < 2) {
printk(KERN_WARNING "NUMA: "
"short ibm,associativity-reference-points\n");
goto err;
@@ -341,10 +341,10 @@ static int __init find_primary_domain_index(void)
 * Warn and cap if the hardware supports more than
 * MAX_DISTANCE_REF_POINTS domains.
 */
-   if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
+   if (max_associativity_domain_index > MAX_DISTANCE_REF_POINTS) {
printk(KERN_WARNING "NUMA: distance array capped at "
"%d entries\n", MAX_DISTANCE_REF_POINTS);
-   distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
+   max_associativity_domain_index = MAX_DISTANCE_REF_POINTS;
}
 
of_node_put(root);
-- 
2.31.1
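
As background for the `__node_distance()` hunk in this patch, the comparison loop can be modelled in user space as below. This is a sketch, not the kernel code: the function names are hypothetical, each node's row of the lookup table is passed in directly, and the doubling step is not visible in the quoted hunk, so it is an assumption based on my understanding of the upstream function. The values 10 and 20 for LOCAL_DISTANCE and REMOTE_DISTANCE are likewise assumed generic defaults.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCAL_DISTANCE  10	/* assumed generic default */
#define REMOTE_DISTANCE 20	/* assumed generic default */

/* FORM0: only "same node" versus "different node" is known. */
static int form0_node_distance(int a, int b)
{
	return (a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
}

/*
 * FORM1: row_a and row_b are the two nodes' rows of the distance lookup
 * table, one domain id per reference point.
 */
static int form1_node_distance(const uint32_t *row_a, const uint32_t *row_b,
			       size_t depth)
{
	int distance = LOCAL_DISTANCE;

	assert(row_a && row_b);

	for (size_t i = 0; i < depth; i++) {
		if (row_a[i] == row_b[i])
			break;
		/* Assumption: the distance doubles per differing level. */
		distance *= 2;
	}
	return distance;
}
```

The renamed `max_associativity_domain_index` plays the role of `depth` here.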



[PATCH v2 1/8] powerpc/pseries: rename min_common_depth to primary_domain_index

2021-06-17 Thread Aneesh Kumar K.V
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
if (!numa_enabled)
goto out;
 
-   if (of_read_number(associativity, 1) >= min_common_depth)
-   nid = of_read_number([min_common_depth], 1);
+   if (of_read_number(associativity, 1) >= primary_domain_index)
+   nid = of_read_number([primary_domain_index], 1);
 
/* POWER4 LPAR uses 0x as invalid node */
if (nid == 0x || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-   int depth;
+   int index;
struct device_node *root;
 
if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
}
 
if (form1_affinity) {
-   depth = of_read_number(distance_ref_points, 1);
+   index = of_read_number(distance_ref_points, 1);
} else {
if (distance_ref_points_depth < 2) {
printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
goto err;
}
 
-   depth = of_read_number(&distance_ref_points[1], 1);
+   index = of_read_number(&distance_ref_points[1], 1);
}
 
/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
}
 
of_node_put(root);
-   return depth;
+   return index;
 
 err:
of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
int nid = default_nid;
int rc, index;
 
-   if ((min_common_depth < 0) || !numa_enabled)
+   if ((primary_domain_index < 0) || !numa_enabled)
return default_nid;
 
rc = of_get_assoc_arrays();
if (rc)
return default_nid;
 
-   if (min_common_depth <= aa.array_sz &&
+   if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
nid = of_read_number([index], 1);
 
if (nid == 0x || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
return -1;
}
 
-   min_common_depth = find_min_common_depth();
+   primary_domain_index = find_primary_domain_index();
 
-   if (min_common_depth < 0) {
+   if (primary_domain_index < 0) {
/*
-* if we fail to parse min_common_depth from device tree
+* if we fail to parse primary_domain_index from device tree
 * mark the numa disabled, boot with numa disabled.
 */
numa_enabled = false;
-   return min_common_depth;
+   return primary_domain_index;
}
 
-   dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+   dbg("NUMA associativity depth for CPU/Memory: %d\n", 
primary_domain_index);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
goto out;
}
 
-   max_nodes = of_read_number([min_common_depth], 1);
+   max_nodes = of_read_number([primary_domain_index], 1);
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
}
 
prop_length /= sizeof(int);
-   if (prop_length > min_common_depth + 2)
+   if (prop_length > primary_domain_index + 2)
coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
goto out;
 
index = of_read_number(associativity, 1);
-   if (index > min_common_depth + 1)
+   if (index > primary_domain_index + 1)
return of_read_number([index - 1], 1);
 
 out:
-- 
2.31.1



[PATCH v2 0/8] Add support for FORM2 associativity

2021-06-17 Thread Aneesh Kumar K.V
Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. This also allows PAPR SCM devices
to use better associativity when a device is used as a DAX KMEM
device. More details can be found in patch 7.

$ ndctl list -N -v
[
  {
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":1071644672,
"uuid":"37dea198-ddb5-4e42-915a-99a915e24188",
"raw_uuid":"148deeaa-4a2f-41d1-8d74-fd9a942d26ba",
"daxregion":{
  "id":0,
  "size":1071644672,
  "devices":[
{
  "chardev":"dax0.0",
  "size":1071644672,
  "target_node":4,
  "mode":"devdax"
}
  ]
},
"align":2097152,
"numa_node":1
  }
]

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  22  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$

After DAX KMEM
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  22  33  255 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  255 
  4:  255  255  255  255  10 
# 

The above output was produced with the following QEMU command line:

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=22 -numa 
dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=255 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa 
dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa 
dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa 
dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=255 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa 
dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=255 \
-object 
memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}
  \
-device 
nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4,device-node=1

QEMU changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index

Aneesh Kumar K.V (8):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: rename distance_ref_points_depth to
max_associativity_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate DLPAR NUMA distance update
  powerpc/pseries: Consolidate NUMA distance update during boot
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity
  powerpc/papr_scm: Use FORM2 associativity details

 Documentation/powerpc/associativity.rst   | 177 +++
 arch/powerpc/include/asm/firmware.h   |   7 +-
 arch/powerpc/include/asm/prom.h   |   3 +-
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 436 ++
 arch/powerpc/platforms/pseries/firmware.c |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 arch/powerpc/platforms/pseries/papr_scm.c |  26 +-
 arch/powerpc/platforms/pseries/pseries.h  |   2 +
 10 files changed, 560 insertions(+), 101 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1



Re: [PATCH v2 1/1] powerpc/papr_scm: Properly handle UUID types and API

2021-06-16 Thread Aneesh Kumar K.V

On 6/16/21 7:13 PM, Andy Shevchenko wrote:

Parse to and export from UUID own type, before dereferencing.
This also fixes wrong comment (Little Endian UUID is something else)
and should eliminate the direct strict types assignments.

Fixes: 43001c52b603 ("powerpc/papr_scm: Use ibm,unit-guid as the iset cookie")
Fixes: 259a948c4ba1 ("powerpc/pseries/scm: Use a specific endian format for storing 
uuid from the device tree")



Do we need the Fixes: tags there? This didn't change any functionality, right? 
The format in which we store cookie1 remains the same with the older and 
newer code. The newer code is simply better?


Reviewed-by: Aneesh Kumar K.V 


Cc: Oliver O'Halloran 
Cc: Aneesh Kumar K.V 
Signed-off-by: Andy Shevchenko 
---
v2: added missed header (Vaibhav), updated comment (Aneesh),
 rewrite part of the commit message to avoid mentioning the Sparse
  arch/powerpc/platforms/pseries/papr_scm.c | 27 +++
  1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index e2b69cc3beaf..b43be41e8ff7 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -18,6 +18,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #define BIND_ANY_ADDR (~0ul)
  
@@ -1101,8 +1102,9 @@ static int papr_scm_probe(struct platform_device *pdev)

u32 drc_index, metadata_size;
u64 blocks, block_size;
struct papr_scm_priv *p;
+   u8 uuid_raw[UUID_SIZE];
const char *uuid_str;
-   u64 uuid[2];
+   uuid_t uuid;
int rc;
  
  	/* check we have all the required DT properties */

@@ -1145,16 +1147,23 @@ static int papr_scm_probe(struct platform_device *pdev)
p->hcall_flush_required = of_property_read_bool(dn, 
"ibm,hcall-flush-required");
  
  	/* We just need to ensure that set cookies are unique across */

-   uuid_parse(uuid_str, (uuid_t *) uuid);
+   uuid_parse(uuid_str, &uuid);
+
/*
-* cookie1 and cookie2 are not really little endian
-* we store a little endian representation of the
-* uuid str so that we can compare this with the label
-* area cookie irrespective of the endian config with which
-* the kernel is built.
+* The cookie1 and cookie2 are not really little endian.
+* We store a raw buffer representation of the
+* uuid string so that we can compare this with the label
+* area cookie irrespective of the endian configuration
+* with which the kernel is built.
+*
+* Historically we stored the cookie in the below format.
+* for a uuid string 72511b67-0b3b-42fd-8d1d-5be3cae8bcaa
+*  cookie1 was 0xfd423b0b671b5172
+*  cookie2 was 0xaabce8cae35b1d8d
 */
-   p->nd_set.cookie1 = cpu_to_le64(uuid[0]);
-   p->nd_set.cookie2 = cpu_to_le64(uuid[1]);
+   export_uuid(uuid_raw, &uuid);
+   p->nd_set.cookie1 = get_unaligned_le64(&uuid_raw[0]);
+   p->nd_set.cookie2 = get_unaligned_le64(&uuid_raw[8]);
  
  	/* might be zero */

p->metadata_size = metadata_size;
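
The cookie values quoted in the new comment can be checked with a small user-space program. This is a sketch, not the kernel code: `load_le64` is a hypothetical stand-in for a little-endian unaligned load, and the byte array is the UUID string 72511b67-0b3b-42fd-8d1d-5be3cae8bcaa written out in string order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read 8 bytes as a little-endian 64-bit value, independent of host endianness. */
static uint64_t load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	assert(p != NULL);

	/* p[7] ends up most significant, p[0] least significant. */
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Raw bytes of 72511b67-0b3b-42fd-8d1d-5be3cae8bcaa in string order. */
static const uint8_t example_uuid[16] = {
	0x72, 0x51, 0x1b, 0x67, 0x0b, 0x3b, 0x42, 0xfd,
	0x8d, 0x1d, 0x5b, 0xe3, 0xca, 0xe8, 0xbc, 0xaa,
};
```

Loading the first and second halves of `example_uuid` gives 0xfd423b0b671b5172 and 0xaabce8cae35b1d8d, the cookie1 and cookie2 values listed in the comment, regardless of the endianness of the machine running the check.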





[PATCH v8 3/3] powerpc/mm: Enable HAVE_MOVE_PMD support

2021-06-15 Thread Aneesh Kumar K.V
mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:
1GB mremap - Source PTE-aligned, Destination PTE-aligned
  mremap time:  2292772ns
1GB mremap - Source PMD-aligned, Destination PMD-aligned
  mremap time:  1158928ns
1GB mremap - Source PUD-aligned, Destination PUD-aligned
  mremap time:63886ns

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/Kconfig.cputype | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index f998e655b570..be8ceb5bece4 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -101,6 +101,8 @@ config PPC_BOOK3S_64
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
select ARCH_SUPPORTS_HUGETLBFS
select ARCH_SUPPORTS_NUMA_BALANCING
+   select HAVE_MOVE_PMD
+   select HAVE_MOVE_PUD
select IRQ_WORK
select PPC_MM_SLICES
select PPC_HAVE_KUEP
-- 
2.31.1
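
To put the timings from the commit message in relative terms, the ratios can be computed as follows. `speedup` is a hypothetical helper used only for this arithmetic; the inputs are the three numbers reported above.

```c
#include <assert.h>

/* Ratio of the baseline time to the optimized time. */
static double speedup(double baseline_ns, double optimized_ns)
{
	assert(optimized_ns > 0.0);
	return baseline_ns / optimized_ns;
}
```

Using the PTE-aligned run (2292772ns) as the baseline, the PMD-aligned run (1158928ns) is roughly 1.98 times faster and the PUD-aligned run (63886ns) is roughly 35.9 times faster.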



[PATCH v8 2/3] powerpc/book3s64/mm: Update flush_tlb_range to flush page walk cache

2021-06-15 Thread Aneesh Kumar K.V
flush_tlb_range is special in that we don't specify the page size used
for the translation. Hence, when flushing the TLB, we flush the translation cache
for all possible page sizes. The kernel also uses the same interface when
moving page tables around. Such a move requires us to flush the page walk cache.

Instead of adding another interface to force a page walk cache flush,
update flush_tlb_range to flush the page walk cache if the range flushed
is at least the PMD range. A page table move will always involve an
invalidate range of at least PMD_SIZE.

Running a microbenchmark with mprotect and parallel memory access
didn't show any observable performance impact.
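
The decision described above can be modelled in user space as follows. This is a sketch under assumptions, not the kernel function: `plan_range_flush` and `struct flush_plan` are hypothetical names, the 2MB PMD span is an assumed value, the single-page flush ceiling is passed in as a parameter, and the full-mm case is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PMD_SIZE (2UL << 20)	/* assumption: 2MB PMD span */

struct flush_plan {
	bool flush_pid;		/* flush everything for this PID */
	bool flush_pwc;		/* also invalidate the page walk cache */
};

static struct flush_plan plan_range_flush(unsigned long start,
					  unsigned long end,
					  unsigned long page_shift,
					  unsigned long ceiling)
{
	struct flush_plan plan = { false, false };

	assert(end >= start);

	unsigned long nr_pages = (end - start) >> page_shift;

	/* Too many pages: a full PID flush is cheaper than per-page ones. */
	plan.flush_pid = nr_pages > ceiling;

	/*
	 * A full PID flush already drops the PWC. Otherwise force a PWC
	 * flush once the range covers at least one PMD.
	 */
	if (!plan.flush_pid && (end - start) >= SKETCH_PMD_SIZE)
		plan.flush_pwc = true;

	return plan;
}
```

With 64K pages and a ceiling of 33 pages, a 2MB range stays below the ceiling but still requests a PWC flush, which is the case mremap() relies on.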

Signed-off-by: Aneesh Kumar K.V 
---
 .../include/asm/book3s/64/tlbflush-radix.h|  2 +
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c  |  8 +++-
 arch/powerpc/mm/book3s64/radix_tlb.c  | 44 ---
 3 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 8b33601cdb9d..ab9d5e535000 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -60,6 +60,8 @@ extern void radix__flush_hugetlb_tlb_range(struct 
vm_area_struct *vma,
   unsigned long start, unsigned long 
end);
 extern void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long 
start,
 unsigned long end, int psize);
+void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long 
start,
+ unsigned long end, int psize);
 extern void radix__flush_pmd_tlb_range(struct vm_area_struct *vma,
   unsigned long start, unsigned long end);
 extern void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long 
start,
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c 
b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index cb91071eef52..23d3e08911d3 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -32,7 +32,13 @@ void radix__flush_hugetlb_tlb_range(struct vm_area_struct 
*vma, unsigned long st
struct hstate *hstate = hstate_file(vma->vm_file);
 
psize = hstate_get_psize(hstate);
-   radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
+   /*
+* Flush PWC even if we get PUD_SIZE hugetlb invalidate to keep this 
simpler.
+*/
+   if (end - start >= PUD_SIZE)
+   radix__flush_tlb_pwc_range_psize(vma->vm_mm, start, end, psize);
+   else
+   radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
 }
 
 /*
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 409e61210789..9f1a177f6bb6 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -989,14 +989,13 @@ static unsigned long tlb_local_single_page_flush_ceiling 
__read_mostly = POWER9_
 
 static inline void __radix__flush_tlb_range(struct mm_struct *mm,
unsigned long start, unsigned long 
end)
-
 {
unsigned long pid;
unsigned int page_shift = mmu_psize_defs[mmu_virtual_psize].shift;
unsigned long page_size = 1UL << page_shift;
unsigned long nr_pages = (end - start) >> page_shift;
bool fullmm = (end == TLB_FLUSH_ALL);
-   bool flush_pid;
+   bool flush_pid, flush_pwc = false;
enum tlb_flush_type type;
 
pid = mm->context.id;
@@ -1015,8 +1014,16 @@ static inline void __radix__flush_tlb_range(struct 
mm_struct *mm,
flush_pid = nr_pages > tlb_single_page_flush_ceiling;
else
flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
+   /*
+* full pid flush already does the PWC flush. if it is not full pid
+* flush check the range is more than PMD and force a pwc flush
+* mremap() depends on this behaviour.
+*/
+   if (!flush_pid && (end - start) >= PMD_SIZE)
+   flush_pwc = true;
 
if (!mmu_has_feature(MMU_FTR_GTSE) && type == FLUSH_TYPE_GLOBAL) {
+   unsigned long type = H_RPTI_TYPE_TLB;
unsigned long tgt = H_RPTI_TARGET_CMMU;
unsigned long pg_sizes = 
psize_to_rpti_pgsize(mmu_virtual_psize);
 
@@ -1024,19 +1031,20 @@ static inline void __radix__flush_tlb_range(struct 
mm_struct *mm,
pg_sizes |= psize_to_rpti_pgsize(MMU_PAGE_2M);
if (atomic_read(>context.copros) > 0)
tgt |= H_RPTI_TARGET_NMMU;
-   pseries_rpt_invalidate(pid, tgt, H_RPTI_TYPE_TLB, pg_sizes,
-  start, end);
+   if (flush_pwc)
+   type |= H_R

[PATCH v8 1/3] mm/mremap: Allow arch runtime override

2021-06-15 Thread Aneesh Kumar K.V
Architectures like ppc64 support faster mremap only with radix
translation. Hence allow a runtime check for fast mremap support.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/tlb.h |  6 ++
 mm/mremap.c| 15 ++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 160422a439aa..09a9ae5f3656 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -83,5 +83,11 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
 }
 #endif
 
+#define arch_supports_page_table_move arch_supports_page_table_move
+static inline bool arch_supports_page_table_move(void)
+{
+   return radix_enabled();
+}
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_TLB_H */
diff --git a/mm/mremap.c b/mm/mremap.c
index c3cad539a7aa..ca9d345f22e8 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -25,7 +25,7 @@
 #include 
 
 #include 
-#include 
+#include 
 #include 
 
 #include "internal.h"
@@ -210,6 +210,15 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
*old_pmd,
drop_rmap_locks(vma);
 }
 
+#ifndef arch_supports_page_table_move
+#define arch_supports_page_table_move arch_supports_page_table_move
+static inline bool arch_supports_page_table_move(void)
+{
+   return IS_ENABLED(CONFIG_HAVE_MOVE_PMD) ||
+   IS_ENABLED(CONFIG_HAVE_MOVE_PUD);
+}
+#endif
+
 #ifdef CONFIG_HAVE_MOVE_PMD
 static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
@@ -218,6 +227,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
struct mm_struct *mm = vma->vm_mm;
pmd_t pmd;
 
+   if (!arch_supports_page_table_move())
+   return false;
/*
 * The destination pmd shouldn't be established, free_pgtables()
 * should have released it.
@@ -284,6 +295,8 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
struct mm_struct *mm = vma->vm_mm;
pud_t pud;
 
+   if (!arch_supports_page_table_move())
+   return false;
/*
 * The destination pud shouldn't be established, free_pgtables()
 * should have released it.
-- 
2.31.1
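
The override idiom used by the patch above (define the macro to its own name in the arch header, provide a fallback under #ifndef in generic code) can be tried outside the kernel. A minimal sketch, assuming a hypothetical model_radix_enabled flag in place of radix_enabled() and a hypothetical try_fast_move() caller:

```c
#include <stdbool.h>

/*
 * "Arch header" part. Defining the macro to its own name tells generic code
 * that an override exists. Remove this part to get the fallback below.
 */
#define arch_supports_page_table_move arch_supports_page_table_move
static bool model_radix_enabled = true;  /* hypothetical stand-in for radix_enabled() */
static inline bool arch_supports_page_table_move(void)
{
	return model_radix_enabled;
}

/* "Generic code" part: only compiled when no arch override was seen. */
#ifndef arch_supports_page_table_move
#define arch_supports_page_table_move arch_supports_page_table_move
static inline bool arch_supports_page_table_move(void)
{
	return true;  /* stand-in for the compile-time IS_ENABLED() answer */
}
#endif

/* Caller in the style of move_normal_pmd(): bail out early if unsupported. */
static bool try_fast_move(void)
{
	if (!arch_supports_page_table_move())
		return false;
	/* the real code would move the page table entry here */
	return true;
}
```

The point of the design is that the generic fallback stays a compile-time constant, while an architecture that needs it can make the answer depend on runtime state.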



[PATCH v8 0/3] Speedup mremap on ppc64

2021-06-15 Thread Aneesh Kumar K.V
This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
the platform to support updating higher-level page tables without
updating page table entries. It also requires invalidating the Page Walk
Cache on architectures that have one.

This patchset is dependent on
https://lore.kernel.org/linux-mm/20210616045239.370802-1-aneesh.ku...@linux.ibm.com/

Changes from v7:
* Split mremap fixes to a separate series

Changes from v6:
* Update ppc64 flush_tlb_range to invalidate page walk cache.
* Add patches to fix race between mremap and page out
* Add patch to fix build error with page table levels 2

Changes from v5:
* Drop patch mm/mremap: Move TLB flush outside page table lock
* Add fixes for race between optimized mremap and page out

Changes from v4:
* Change function name and arguments based on review feedback.

Changes from v3:
* Fix build error reported by kernel test robot
* Address review feedback.

Changes from v2:
* switch from using mmu_gather to flush_pte_tlb_pwc_range() 

Changes from v1:
* Rebase to recent upstream
* Fix build issues with tlb_gather_mmu changes



Aneesh Kumar K.V (3):
  mm/mremap: Allow arch runtime override
  powerpc/book3s64/mm: Update flush_tlb_range to flush page walk cache
  powerpc/mm: Enable HAVE_MOVE_PMD support

 .../include/asm/book3s/64/tlbflush-radix.h|  2 +
 arch/powerpc/include/asm/tlb.h|  6 +++
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c  |  8 +++-
 arch/powerpc/mm/book3s64/radix_tlb.c  | 44 ---
 arch/powerpc/platforms/Kconfig.cputype|  2 +
 mm/mremap.c   | 15 ++-
 6 files changed, 58 insertions(+), 19 deletions(-)

-- 
2.31.1



[PATCH v2 6/6] mm/mremap: hold the rmap lock in write mode when moving page table entries.

2021-06-15 Thread Aneesh Kumar K.V
To avoid a race between rmap walk and mremap, mremap does take_rmap_locks().
The lock is taken to ensure that the rmap walk doesn't miss a page table entry
due to PTE moves via move_page_tables(). The kernel further optimizes this
lock: if we are going to find the newly added vma after the old vma, the
rmap lock is not taken. This is because the rmap walk would find the vmas in
the same order, and if we don't find the page table attached to the older
vma, we would find it with the new vma, which we would iterate later.

As explained in commit eb66ae030829 ("mremap: properly flush TLB before
releasing the page"), mremap is special in that it doesn't take ownership of
the page. The optimized version for PUD/PMD aligned mremap also doesn't hold
the ptl lock. This can result in stale TLB entries as shown below.

This patch updates the rmap locking requirement in mremap to handle the race
condition explained below with optimized mremap:

Optimized PMD move

    CPU 1                           CPU 2                           CPU 3

    mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one

    mmap_write_lock_killable()

                                    addr = old_addr
                                    lock(pte_ptl)
    lock(pmd_ptl)
    pmd = *old_pmd
    pmd_clear(old_pmd)
    flush_tlb_range(old_addr)

    *new_pmd = pmd
                                                                    *new_addr = 10; and fills
                                                                    TLB with new addr
                                                                    and old pfn

    unlock(pmd_ptl)
                                    ptep_clear_flush()
                                    old pfn is free.
                                                                    Stale TLB entry

Optimized PUD move also suffers from a similar race.
Both of the above race conditions can be fixed if we force the mremap path to
take the rmap lock.

Cc: sta...@vger.kernel.org
Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")
Link: 
https://lore.kernel.org/linux-mm/CAHk-=wgxvr04ebntxqfevontwnp6fdm+oj5vauqxp3s-huw...@mail.gmail.com
Acked-by: Hugh Dickins 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Aneesh Kumar K.V 
---
 mm/mremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 72fa0491681e..c3cad539a7aa 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -503,7 +503,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == 
PUD_SIZE) {
 
if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
-  old_pud, new_pud, need_rmap_locks))
+  old_pud, new_pud, true))
continue;
}
 
@@ -530,7 +530,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 * moving at the PMD level if possible.
 */
if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
-  old_pmd, new_pmd, need_rmap_locks))
+  old_pmd, new_pmd, true))
continue;
}
 
-- 
2.31.1



[PATCH v2 5/6] mm/mremap: Use pmd/pud_populate to update page table entries

2021-06-15 Thread Aneesh Kumar K.V
pmd/pud_populate is the right interface to be used to set the respective
page table entries. Some architectures like ppc64 do assume that set_pmd/pud_at
can only be used to set a hugepage PTE. Since we are not setting up a hugepage
PTE here, use the pmd/pud_populate interface.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/mremap.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 97313e316a4d..72fa0491681e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -26,6 +26,7 @@
 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -258,8 +259,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
 
VM_BUG_ON(!pmd_none(*new_pmd));
 
-   /* Set the new pmd */
-   set_pmd_at(mm, new_addr, new_pmd, pmd);
+   pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
@@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
 
VM_BUG_ON(!pud_none(*new_pud));
 
-   /* Set the new pud */
-   set_pud_at(mm, new_addr, new_pud, pud);
+   pud_populate(mm, new_pud, pud_pgtable(pud));
flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
-- 
2.31.1



[PATCH v2 4/6] mm/mremap: Don't enable optimized PUD move if page table levels is 2

2021-06-15 Thread Aneesh Kumar K.V
With a two-level page table, don't enable move_normal_pud.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/mremap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 958ecdc6f29d..97313e316a4d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -276,7 +276,7 @@ static inline bool move_normal_pmd(struct vm_area_struct 
*vma,
 }
 #endif
 
-#ifdef CONFIG_HAVE_MOVE_PUD
+#if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_HAVE_MOVE_PUD)
 static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
 {
-- 
2.31.1



[PATCH v2 3/6] mm/mremap: Convert huge PUD move to separate helper

2021-06-15 Thread Aneesh Kumar K.V
With TRANSPARENT_HUGEPAGE_PUD enabled, the kernel can find huge PUD entries.
Add a helper to move huge PUD entries on mremap().

This will be used by a later patch to optimize mremap of a PUD_SIZE aligned
level 4 PTE mapped address.

This also makes sure we support mremap on huge PUD entries even with
CONFIG_HAVE_MOVE_PUD disabled.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/mremap.c | 79 -
 1 file changed, 72 insertions(+), 7 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 47c255b60150..958ecdc6f29d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -324,10 +324,61 @@ static inline bool move_normal_pud(struct vm_area_struct 
*vma,
 }
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pud_t *old_pud, pud_t 
*new_pud)
+{
+   spinlock_t *old_ptl, *new_ptl;
+   struct mm_struct *mm = vma->vm_mm;
+   pud_t pud;
+
+   /*
+* The destination pud shouldn't be established, free_pgtables()
+* should have released it.
+*/
+   if (WARN_ON_ONCE(!pud_none(*new_pud)))
+   return false;
+
+   /*
+* We don't have to worry about the ordering of src and dst
+* ptlocks because exclusive mmap_lock prevents deadlock.
+*/
+   old_ptl = pud_lock(vma->vm_mm, old_pud);
+   new_ptl = pud_lockptr(mm, new_pud);
+   if (new_ptl != old_ptl)
+   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+   /* Clear the pud */
+   pud = *old_pud;
+   pud_clear(old_pud);
+
+   VM_BUG_ON(!pud_none(*new_pud));
+
+   /* Set the new pud */
+   /* mark soft_ditry when we add pud level soft dirty support */
+   set_pud_at(mm, new_addr, new_pud, pud);
+   flush_pud_tlb_range(vma, old_addr, old_addr + HPAGE_PUD_SIZE);
+   if (new_ptl != old_ptl)
+   spin_unlock(new_ptl);
+   spin_unlock(old_ptl);
+
+   return true;
+}
+#else
+static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pud_t *old_pud, pud_t 
*new_pud)
+{
+   WARN_ON_ONCE(1);
+   return false;
+
+}
+#endif
+
 enum pgt_entry {
NORMAL_PMD,
HPAGE_PMD,
NORMAL_PUD,
+   HPAGE_PUD,
 };
 
 /*
@@ -347,6 +398,7 @@ static __always_inline unsigned long get_extent(enum 
pgt_entry entry,
mask = PMD_MASK;
size = PMD_SIZE;
break;
+   case HPAGE_PUD:
case NORMAL_PUD:
mask = PUD_MASK;
size = PUD_SIZE;
@@ -395,6 +447,11 @@ static bool move_pgt_entry(enum pgt_entry entry, struct 
vm_area_struct *vma,
move_huge_pmd(vma, old_addr, new_addr, old_entry,
  new_entry);
break;
+   case HPAGE_PUD:
+   moved = move_huge_pud(vma, old_addr, new_addr, old_entry,
+ new_entry);
+   break;
+
default:
WARN_ON_ONCE(1);
break;
@@ -414,6 +471,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long extent, old_end;
struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;
+   pud_t *old_pud, *new_pud;
 
old_end = old_addr + len;
flush_cache_range(vma, old_addr, old_end);
@@ -429,15 +487,22 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 * PUD level if possible.
 */
extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
-   if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
-   pud_t *old_pud, *new_pud;
 
-   old_pud = get_old_pud(vma->vm_mm, old_addr);
-   if (!old_pud)
+   old_pud = get_old_pud(vma->vm_mm, old_addr);
+   if (!old_pud)
+   continue;
+   new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
+   if (!new_pud)
+   break;
+   if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
+   if (extent == HPAGE_PUD_SIZE) {
+   move_pgt_entry(HPAGE_PUD, vma, old_addr, 
new_addr,
+  old_pud, new_pud, 
need_rmap_locks);
+   /* We ignore and continue on error? */
continue;
-   new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
-   if (!new_pud)
-   break;
+   }
+   } else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == 
PUD_SIZE) {
+
if (move_pgt_entry(N
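
The move loop above only takes the PUD or PMD fast path when the computed extent equals the entry size. How that extent gets clamped can be sketched in user space. This is a simplified model of the get_extent() helper as I understand it, not the kernel code itself; model_get_extent() and its size parameter are names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model: how many bytes can be moved in one step, given an entry size
 * (for example 2M for a PMD or 1G for a PUD; both are assumptions here).
 * The step may cross neither the next size-aligned boundary on the source
 * side, nor the end of the range, nor the next boundary on the destination.
 */
static uint64_t model_get_extent(uint64_t size, uint64_t old_addr,
				 uint64_t old_end, uint64_t new_addr)
{
	uint64_t mask = ~(size - 1);
	uint64_t next, extent;

	assert(size != 0 && (size & (size - 1)) == 0);  /* power of two */

	next = (old_addr + size) & mask;     /* next boundary after old_addr */
	extent = next - old_addr;
	if (extent > old_end - old_addr)     /* do not run past the range */
		extent = old_end - old_addr;
	next = (new_addr + size) & mask;     /* next boundary after new_addr */
	if (extent > next - new_addr)
		extent = next - new_addr;
	return extent;
}
```

Only when source and destination are both aligned to the entry size, and at least one full entry remains, does the result equal size. Every other case yields a smaller extent and falls through to the next lower level.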

[PATCH v2 2/6] selftest/mremap_test: Avoid crash with static build

2021-06-15 Thread Aneesh Kumar K.V
With a large mmap map size, we can overlap with the text area and using
MAP_FIXED results in unmapping that area. Switch to MAP_FIXED_NOREPLACE
and handle the EEXIST error.

Reviewed-by: Kalesh Singh 
Signed-off-by: Aneesh Kumar K.V 
---
 tools/testing/selftests/vm/mremap_test.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c 
b/tools/testing/selftests/vm/mremap_test.c
index c9a5461eb786..0624d1bd71b5 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -75,9 +75,10 @@ static void *get_source_mapping(struct config c)
 retry:
addr += c.src_alignment;
src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
-   MAP_FIXED | MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+   MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+   -1, 0);
if (src_addr == MAP_FAILED) {
-   if (errno == EPERM)
+   if (errno == EPERM || errno == EEXIST)
goto retry;
goto error;
}
-- 
2.31.1
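
The retry loop from the patch above can be shown as a standalone helper. A sketch under assumptions: map_at_free_aligned() and its parameters are hypothetical, and the fallback value for MAP_FIXED_NOREPLACE is the generic Linux one, which may not hold on every architecture.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000  /* assumption: generic Linux value, for old libc headers */
#endif

/*
 * Hypothetical helper: step through align-spaced addresses above start until
 * one can be mapped without replacing an existing mapping.
 * Returns MAP_FAILED if nothing could be mapped within max_tries attempts.
 */
static void *map_at_free_aligned(uintptr_t start, size_t align, size_t size,
				 int max_tries)
{
	uintptr_t addr = start;

	assert(align != 0 && size != 0);
	for (int i = 0; i < max_tries; i++) {
		void *p;

		addr += align;
		p = mmap((void *)addr, size, PROT_READ | PROT_WRITE,
			 MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED, -1, 0);
		if (p != MAP_FAILED) {
			if ((uintptr_t)p == addr)
				return p;
			/* Flag was ignored and addr was treated as a hint only. */
			munmap(p, size);
			continue;
		}
		/* EEXIST: range is in use. EPERM: address not permitted. Try the next one. */
		if (errno != EEXIST && errno != EPERM)
			return MAP_FAILED;
	}
	return MAP_FAILED;
}
```

Unlike MAP_FIXED, a collision here fails with EEXIST instead of silently unmapping whatever was at that address, which is what protects the text area of a static build.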



[PATCH v2 1/6] selftest/mremap_test: Update the test to handle pagesize other than 4K

2021-06-15 Thread Aneesh Kumar K.V
Instead of hardcoding a 4K page size, fetch it using sysconf(). For the
performance measurements, the test still assumes 2M and 1G are hugepage sizes.

Reviewed-by: Kalesh Singh 
Signed-off-by: Aneesh Kumar K.V 
---
 tools/testing/selftests/vm/mremap_test.c | 113 ---
 1 file changed, 61 insertions(+), 52 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c 
b/tools/testing/selftests/vm/mremap_test.c
index 9c391d016922..c9a5461eb786 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -45,14 +45,15 @@ enum {
_4MB = 4ULL << 20,
_1GB = 1ULL << 30,
_2GB = 2ULL << 30,
-   PTE = _4KB,
PMD = _2MB,
PUD = _1GB,
 };
 
+#define PTE page_size
+
 #define MAKE_TEST(source_align, destination_align, size,   \
  overlaps, should_fail, test_name) \
-{  \
+(struct test){ \
.name = test_name,  \
.config = { \
.src_alignment = source_align,  \
@@ -252,12 +253,17 @@ static int parse_args(int argc, char **argv, unsigned int 
*threshold_mb,
return 0;
 }
 
+#define MAX_TEST 13
+#define MAX_PERF_TEST 3
 int main(int argc, char **argv)
 {
int failures = 0;
int i, run_perf_tests;
unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
unsigned int pattern_seed;
+   struct test test_cases[MAX_TEST];
+   struct test perf_test_cases[MAX_PERF_TEST];
+   int page_size;
time_t t;
 
pattern_seed = (unsigned int) time(&t);
@@ -268,56 +274,59 @@ int main(int argc, char **argv)
ksft_print_msg("Test 
configs:\n\tthreshold_mb=%u\n\tpattern_seed=%u\n\n",
   threshold_mb, pattern_seed);
 
-   struct test test_cases[] = {
-   /* Expected mremap failures */
-   MAKE_TEST(_4KB, _4KB, _4KB, OVERLAPPING, EXPECT_FAILURE,
- "mremap - Source and Destination Regions Overlapping"),
-   MAKE_TEST(_4KB, _1KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE,
- "mremap - Destination Address Misaligned (1KB-aligned)"),
-   MAKE_TEST(_1KB, _4KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE,
- "mremap - Source Address Misaligned (1KB-aligned)"),
-
-   /* Src addr PTE aligned */
-   MAKE_TEST(PTE, PTE, _8KB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "8KB mremap - Source PTE-aligned, Destination PTE-aligned"),
-
-   /* Src addr 1MB aligned */
-   MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2MB mremap - Source 1MB-aligned, Destination PTE-aligned"),
-   MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned"),
-
-   /* Src addr PMD aligned */
-   MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination PTE-aligned"),
-   MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination 1MB-aligned"),
-   MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination PMD-aligned"),
-
-   /* Src addr PUD aligned */
-   MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PTE-aligned"),
-   MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination 1MB-aligned"),
-   MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PMD-aligned"),
-   MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PUD-aligned"),
-   };
-
-   struct test perf_test_cases[] = {
-   /*
-* mremap 1GB region - Page table level aligned time
-* comparison.
-*/
-   MAKE_TEST(PTE, PTE, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PTE-aligned, Destination PTE-aligned"),
-   MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PMD-aligned, Destination PMD-aligned"),
-   MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB m
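
Fetching the page size at run time, as the patch above does for the selftest, can be sketched like this. The helper names runtime_page_size() and round_up_to_page() are made up for the example; only sysconf(_SC_PAGESIZE) itself is the real interface.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: base page size as reported by the running system. */
static long runtime_page_size(void)
{
	long sz = sysconf(_SC_PAGESIZE);

	assert(sz > 0);  /* sysconf() returns -1 on error */
	return sz;
}

/* Round len up to a whole number of pages (page_size must be a power of two). */
static unsigned long round_up_to_page(unsigned long len, long page_size)
{
	unsigned long mask = (unsigned long)page_size - 1;

	return (len + mask) & ~mask;
}
```

A test sized this way keeps working on 64K-page ppc64 kernels, where a hardcoded 4K length would be rounded or rejected differently than the test expects.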

[PATCH v2 0/6] mremap fixes

2021-06-15 Thread Aneesh Kumar K.V
This patch series is split out from [PATCH v7 00/11] Speedup mremap on ppc64
(https://lore.kernel.org/linux-mm/20210607055131.156184-1-aneesh.ku...@linux.ibm.com),
dropping the ppc64 specific changes.

This patchset is dependent on
https://lore.kernel.org/linux-mm/20210615110859.320299-1-aneesh.ku...@linux.ibm.com

ppc64 specific changes will be sent as a separate series depending on this 
patchset.

Changes from v1:
* cc sta...@kernel.org
* Use the correct config for TRANSPARENT_HUGEPAGE_PUD
* use pud_pgtable instead of pud_page_vaddr


Aneesh Kumar K.V (6):
  selftest/mremap_test: Update the test to handle pagesize other than 4K
  selftest/mremap_test: Avoid crash with static build
  mm/mremap: Convert huge PUD move to separate helper
  mm/mremap: Don't enable optimized PUD move if page table levels is 2
  mm/mremap: Use pmd/pud_populate to update page table entries
  mm/mremap: hold the rmap lock in write mode when moving page table
entries.

 mm/mremap.c  |  92 +++---
 tools/testing/selftests/vm/mremap_test.c | 118 ---
 2 files changed, 142 insertions(+), 68 deletions(-)

-- 
2.31.1



[PATCH v2 1/2] mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t *

2021-06-15 Thread Aneesh Kumar K.V
No functional change in this patch.

Cc: linux-al...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@lists.infradead.org
Cc: linux-a...@vger.kernel.org

Link: 
https://lore.kernel.org/linuxppc-dev/CAHk-=wi+j+iodze9ftjm3zi4j4oes+qqbkxme9qn4roxpex...@mail.gmail.com/
Signed-off-by: Aneesh Kumar K.V 
---
 arch/alpha/include/asm/pgtable.h | 8 +---
 arch/arm/include/asm/pgtable-3level.h| 2 +-
 arch/arm64/include/asm/pgtable.h | 4 ++--
 arch/ia64/include/asm/pgtable.h  | 2 +-
 arch/m68k/include/asm/motorola_pgtable.h | 2 +-
 arch/mips/include/asm/pgtable-64.h   | 4 ++--
 arch/parisc/include/asm/pgtable.h| 4 ++--
 arch/powerpc/include/asm/book3s/64/pgtable.h | 6 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h | 6 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c | 4 ++--
 arch/powerpc/mm/pgtable_64.c | 2 +-
 arch/riscv/include/asm/pgtable-64.h  | 4 ++--
 arch/sh/include/asm/pgtable-3level.h | 4 ++--
 arch/sparc/include/asm/pgtable_32.h  | 4 ++--
 arch/sparc/include/asm/pgtable_64.h  | 6 +++---
 arch/um/include/asm/pgtable-3level.h | 2 +-
 arch/x86/include/asm/pgtable.h   | 4 ++--
 arch/x86/mm/pat/set_memory.c | 4 ++--
 arch/x86/mm/pgtable.c| 2 +-
 include/asm-generic/pgtable-nopmd.h  | 2 +-
 include/asm-generic/pgtable-nopud.h  | 2 +-
 include/linux/pgtable.h  | 2 +-
 22 files changed, 45 insertions(+), 35 deletions(-)

diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index 8d856c62e22a..be02e4e403d1 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -239,8 +239,10 @@ pmd_page_vaddr(pmd_t pmd)
 #define pmd_page(pmd)  (pfn_to_page(pmd_val(pmd) >> 32))
 #define pud_page(pud)  (pfn_to_page(pud_val(pud) >> 32))
 
-extern inline unsigned long pud_page_vaddr(pud_t pgd)
-{ return PAGE_OFFSET + ((pud_val(pgd) & _PFN_MASK) >> (32-PAGE_SHIFT)); }
+static inline pmd_t *pud_pgtable(pud_t pgd)
+{
+   return (pmd_t *)(PAGE_OFFSET + ((pud_val(pgd) & _PFN_MASK) >> 
(32-PAGE_SHIFT)));
+}
 
 extern inline int pte_none(pte_t pte)  { return !pte_val(pte); }
 extern inline int pte_present(pte_t pte)   { return pte_val(pte) & 
_PAGE_VALID; }
@@ -290,7 +292,7 @@ extern inline pte_t pte_mkyoung(pte_t pte)  { pte_val(pte) 
|= __ACCESS_BITS; retu
 /* Find an entry in the second-level page table.. */
 extern inline pmd_t * pmd_offset(pud_t * dir, unsigned long address)
 {
-   pmd_t *ret = (pmd_t *) pud_page_vaddr(*dir) + ((address >> PMD_SHIFT) & 
(PTRS_PER_PAGE - 1));
+   pmd_t *ret = pud_pgtable(*dir) + ((address >> PMD_SHIFT) & 
(PTRS_PER_PAGE - 1));
smp_rmb(); /* see above */
return ret;
 }
diff --git a/arch/arm/include/asm/pgtable-3level.h 
b/arch/arm/include/asm/pgtable-3level.h
index d4edab51a77c..eabe72ff7381 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -130,7 +130,7 @@
flush_pmd_entry(pudp);  \
} while (0)
 
-static inline pmd_t *pud_page_vaddr(pud_t pud)
+static inline pmd_t *pud_pgtable(pud_t pud)
 {
return __va(pud_val(pud) & PHYS_MASK & (s32)PAGE_MASK);
 }
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b10204e72fc..53a415b329b0 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -636,9 +636,9 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
return __pud_to_phys(pud);
 }
 
-static inline unsigned long pud_page_vaddr(pud_t pud)
+static inline pmd_t *pud_pgtable(pud_t pud)
 {
-   return (unsigned long)__va(pud_page_paddr(pud));
+   return (pmd_t *)__va(pud_page_paddr(pud));
 }
 
 /* Find an entry in the second-level page table. */
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index d765fd948fae..b2ddfbe70365 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -274,7 +274,7 @@ ia64_phys_addr_valid (unsigned long addr)
 #define pud_bad(pud)   (!ia64_phys_addr_valid(pud_val(pud)))
 #define pud_present(pud)   (pud_val(pud) != 0UL)
 #define pud_clear(pudp)(pud_val(*(pudp)) = 0UL)
-#define pud_page_vaddr(pud)((unsigned long) __va(pud_val(pud) & 
_PFN_MASK))
+#define pud_pgtable(pud)   ((pmd_t *) __va(pud_val(pud) & 
_PFN_MASK))
 #define pud_page(pud)  virt_to_page((pud_val(pud) + 
PAGE_OFFSE
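
What the rename above buys at call sites can be seen in a small user-space model. Everything here is a stand-in: the pmd_t and pud_t definitions, the MODEL_ constants and the model_ function names are hypothetical and only mimic the shape of the kernel helpers.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's page table entry types. */
typedef struct { uint64_t val; } pmd_t;
typedef struct { pmd_t *table; } pud_t;  /* model: a pud just points at a pmd table */

#define MODEL_PTRS_PER_PMD 512
#define MODEL_PMD_SHIFT    21

/* Old style: the caller gets an integer and has to cast it back. */
static uintptr_t model_pud_page_vaddr(pud_t pud)
{
	return (uintptr_t)pud.table;
}

/* New style: the return type already says what the address points to. */
static pmd_t *model_pud_pgtable(pud_t pud)
{
	return pud.table;
}

static pmd_t *model_pmd_offset_old(pud_t *pud, unsigned long addr)
{
	return (pmd_t *)model_pud_page_vaddr(*pud) +
	       ((addr >> MODEL_PMD_SHIFT) & (MODEL_PTRS_PER_PMD - 1));
}

static pmd_t *model_pmd_offset_new(pud_t *pud, unsigned long addr)
{
	return model_pud_pgtable(*pud) +
	       ((addr >> MODEL_PMD_SHIFT) & (MODEL_PTRS_PER_PMD - 1));
}
```

Both variants compute the same pointer. The typed version drops the cast, so the compiler can catch a caller that treats the result as the wrong page table level.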

[PATCH v2 2/2] mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t *

2021-06-15 Thread Aneesh Kumar K.V
No functional change in this patch.

Cc: linux-al...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@lists.infradead.org
Cc: linux-a...@vger.kernel.org

Link: 
https://lore.kernel.org/linuxppc-dev/CAHk-=wi+j+iodze9ftjm3zi4j4oes+qqbkxme9qn4roxpex...@mail.gmail.com/
Signed-off-by: Aneesh Kumar K.V 
---
 arch/arm64/include/asm/pgtable.h| 4 ++--
 arch/ia64/include/asm/pgtable.h | 2 +-
 arch/mips/include/asm/pgtable-64.h  | 4 ++--
 arch/powerpc/include/asm/book3s/64/pgtable.h| 5 -
 arch/powerpc/include/asm/nohash/64/pgtable-4k.h | 6 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c| 2 +-
 arch/powerpc/mm/pgtable_64.c| 2 +-
 arch/sparc/include/asm/pgtable_64.h | 4 ++--
 arch/x86/include/asm/pgtable.h  | 4 ++--
 arch/x86/mm/init_64.c   | 4 ++--
 include/asm-generic/pgtable-nop4d.h | 2 +-
 include/asm-generic/pgtable-nopud.h | 2 +-
 include/linux/pgtable.h | 2 +-
 13 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 53a415b329b0..fde06639fff8 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -697,9 +697,9 @@ static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
return __p4d_to_phys(p4d);
 }
 
-static inline unsigned long p4d_page_vaddr(p4d_t p4d)
+static inline pud_t *p4d_pgtable(p4d_t p4d)
 {
-   return (unsigned long)__va(p4d_page_paddr(p4d));
+   return (pud_t *)__va(p4d_page_paddr(p4d));
 }
 
 /* Find an entry in the frst-level page table. */
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index b2ddfbe70365..a2d5098b299a 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -282,7 +282,7 @@ ia64_phys_addr_valid (unsigned long addr)
 #define p4d_bad(p4d)   (!ia64_phys_addr_valid(p4d_val(p4d)))
 #define p4d_present(p4d)   (p4d_val(p4d) != 0UL)
 #define p4d_clear(p4dp)(p4d_val(*(p4dp)) = 0UL)
-#define p4d_page_vaddr(p4d)((unsigned long) __va(p4d_val(p4d) & 
_PFN_MASK))
+#define p4d_pgtable(p4d)   ((pud_t *) __va(p4d_val(p4d) & 
_PFN_MASK))
 #define p4d_page(p4d)  virt_to_page((p4d_val(p4d) + 
PAGE_OFFSET))
 #endif
 
diff --git a/arch/mips/include/asm/pgtable-64.h 
b/arch/mips/include/asm/pgtable-64.h
index ab305453e90f..b865edff2670 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -210,9 +210,9 @@ static inline void p4d_clear(p4d_t *p4dp)
p4d_val(*p4dp) = (unsigned long)invalid_pud_table;
 }
 
-static inline unsigned long p4d_page_vaddr(p4d_t p4d)
+static inline pud_t *p4d_pgtable(p4d_t p4d)
 {
-   return p4d_val(p4d);
+   return (pud_t *)p4d_val(p4d);
 }
 
 #define p4d_phys(p4d)  virt_to_phys((void *)p4d_val(p4d))
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 40bafe1e80c9..cbedc7c8959d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1048,7 +1048,10 @@ extern struct page *p4d_page(p4d_t p4d);
 /* Pointers in the page table tree are physical addresses */
 #define __pgtable_ptr_val(ptr) __pa(ptr)
 
-#define p4d_page_vaddr(p4d)__va(p4d_val(p4d) & ~P4D_MASKED_BITS)
+static inline pud_t *p4d_pgtable(p4d_t p4d)
+{
+   return (pud_t *)__va(p4d_val(p4d) & ~P4D_MASKED_BITS);
+}
 
 static inline pmd_t *pud_pgtable(pud_t pud)
 {
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable-4k.h 
b/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
index fe2f4c9acd9e..10f5cf444d72 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
@@ -56,10 +56,14 @@
 #define p4d_none(p4d)  (!p4d_val(p4d))
 #define p4d_bad(p4d)   (p4d_val(p4d) == 0)
 #define p4d_present(p4d)   (p4d_val(p4d) != 0)
-#define p4d_page_vaddr(p4d)(p4d_val(p4d) & ~P4D_MASKED_BITS)
 
 #ifndef __ASSEMBLY__
 
+static inline pud_t *p4d_pgtable(p4d_t p4d)
+{
+   return (pud_t *) (p4d_val(p4d) & ~P4D_MASKED_BITS);
+}
+
 static inline void p4d_clear(p4d_t *p4dp)
 {
*p4dp = __p4d(0);
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index b663d8f9f05c..1ba6d9291c10 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -859,7 +859,7 @@ static void __meminit remove_pagetable(unsigned lo

Re: [RFC PATCH 4/8] powerpc/pseries: Consolidate DLPAR NUMA distance update

2021-06-15 Thread Aneesh Kumar K.V

On 6/15/21 8:43 AM, David Gibson wrote:

On Mon, Jun 14, 2021 at 10:09:59PM +0530, Aneesh Kumar K.V wrote:

The associativity details of the newly added resources are collected from
the hypervisor via the "ibm,configure-connector" rtas call. Update the numa
distance details of the newly added numa node after the above call. In a
later patch we will remove updating NUMA distance when we are looking
for node id from associativity array.

Signed-off-by: Aneesh Kumar K.V 
---
  arch/powerpc/mm/numa.c| 41 +++
  arch/powerpc/platforms/pseries/hotplug-cpu.c  |  2 +
  .../platforms/pseries/hotplug-memory.c|  2 +
  arch/powerpc/platforms/pseries/pseries.h  |  1 +
  4 files changed, 46 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 192067991f8a..fec47981c1ef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -287,6 +287,47 @@ int of_node_to_nid(struct device_node *device)
  }
  EXPORT_SYMBOL(of_node_to_nid);
  
+static void __initialize_form1_numa_distance(const __be32 *associativity)

+{
+   int i, nid;
+
+   if (of_read_number(associativity, 1) >= primary_domain_index) {
+   nid = of_read_number(&associativity[primary_domain_index], 1);
+
+   for (i = 0; i < max_domain_index; i++) {
+   const __be32 *entry;
+
+   entry = &associativity[be32_to_cpu(distance_ref_points[i])];
+   distance_lookup_table[nid][i] = of_read_number(entry, 1);
+   }
+   }
+}


This logic is almost identical to initialize_distance_lookup_table()
- it would be good if they could be consolidated, so it's clear that
coldplugged and hotplugged nodes are parsing the NUMA information in
the same way.


initialize_distance_lookup_table() gets removed in the next patch.

-aneesh
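
For readers following the exchange above, the lookup done by the quoted helper can be modelled in user space. This is a sketch only: ntohl()/htonl() stand in for the kernel's big-endian helpers, and model_fill_distance(), model_distance_table and the MODEL_ limits are hypothetical names.

```c
#include <arpa/inet.h>  /* htonl()/ntohl() as a stand-in for be32 helpers */
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_NODES 8
#define MODEL_MAX_DEPTH 4

static int model_distance_table[MODEL_MAX_NODES][MODEL_MAX_DEPTH];

/*
 * assoc[0] holds the number of cells that follow; all cells are big-endian.
 * ref_points[] holds big-endian indexes into assoc[].
 * Returns the node id, or -1 if the array is too short to contain one.
 */
static int model_fill_distance(const uint32_t *assoc, const uint32_t *ref_points,
			       int depth, uint32_t primary_index)
{
	uint32_t nid;

	assert(depth <= MODEL_MAX_DEPTH);
	if (ntohl(assoc[0]) < primary_index)
		return -1;

	nid = ntohl(assoc[primary_index]);
	assert(nid < MODEL_MAX_NODES);
	/* One table entry per reference point: the domain id at that level. */
	for (int i = 0; i < depth; i++)
		model_distance_table[nid][i] =
			(int)ntohl(assoc[ntohl(ref_points[i])]);
	return (int)nid;
}
```

Running the same routine for coldplugged and hotplugged nodes is the kind of consolidation the review comment asks about.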

