Re: [PATCH v4 2/2] sched/topology: Improve load balancing on AMD EPYC

2019-10-09 Thread Guenter Roeck

On 10/9/19 5:04 AM, Matt Fleming wrote:

> On Mon, 07 Oct, at 08:28:16AM, Guenter Roeck wrote:
>>
>> This patch causes build errors on systems where NUMA does not depend on SMP,
>> for example MIPS and PPC. For example, building mips:ip27_defconfig with SMP
>> disabled results in
>>
>> mips-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
>> page_alloc.c:(.text+0x5018): undefined reference to `node_reclaim_distance'
>> mips-linux-ld: page_alloc.c:(.text+0x5020): undefined reference to `node_reclaim_distance'
>> mips-linux-ld: page_alloc.c:(.text+0x5028): undefined reference to `node_reclaim_distance'
>> mips-linux-ld: page_alloc.c:(.text+0x5040): undefined reference to `node_reclaim_distance'
>> Makefile:1074: recipe for target 'vmlinux' failed
>> make: *** [vmlinux] Error 1
>>
>> I have seen a similar problem with one of my PPC test builds.
>>
>> powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x18): undefined reference to `node_reclaim_distance'
>
> Thanks for this Guenter.
>
> So, the way I've fixed this same issue for ia64 was to make NUMA
> depend on SMP. Does that seem like a suitable solution for both PPC
> and MIPS?



You would still have to cover all other architectures where SMP and NUMA
are independent of each other. Fortunately, it looks like this is only sh4.

sh4-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x3ce0): undefined reference to `node_reclaim_distance'
Makefile:1074: recipe for target 'vmlinux' failed
make: *** [vmlinux] Error 1

arm64 and s390 happen to work because they mandate SMP support, even though NUMA
is nominally independent.

Wondering: why not define node_reclaim_distance outside the SMP dependency?

Thanks,
Guenter
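
For illustration, the question above can be sketched in plain C. This is a
minimal stand-alone sketch, not kernel code: it assumes the definition of
node_reclaim_distance is placed in a translation unit that is built whether
or not CONFIG_SMP is set, so that users such as mm/page_alloc.c always find
the symbol at link time. Which file would host the definition is not decided
in this thread, and the helpers exceeds_reclaim_distance() and
override_reclaim_distance() are hypothetical names used only here.

```c
#include <stdbool.h>

#define RECLAIM_DISTANCE 30	/* default from include/linux/topology.h */

/*
 * Assumption for this sketch: the variable is defined in a file that is
 * always compiled, rather than inside SMP-only scheduler code.
 */
int node_reclaim_distance = RECLAIM_DISTANCE;

/* Hypothetical stand-in for the users in mm/ that compare against it. */
static bool exceeds_reclaim_distance(int distance)
{
	return distance > node_reclaim_distance;
}

/* Hypothetical stand-in for an arch override such as init_amd_zn(). */
static void override_reclaim_distance(int distance)
{
	node_reclaim_distance = distance;
}
```

With the definition always present, a !SMP NUMA build would resolve the
symbol and simply keep the default unless the architecture overrides it.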


Re: [PATCH v4 2/2] sched/topology: Improve load balancing on AMD EPYC

2019-10-09 Thread Matt Fleming
On Mon, 07 Oct, at 08:28:16AM, Guenter Roeck wrote:
> 
> This patch causes build errors on systems where NUMA does not depend on SMP,
> for example MIPS and PPC. For example, building mips:ip27_defconfig with SMP
> disabled results in
> 
> mips-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
> page_alloc.c:(.text+0x5018): undefined reference to `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5020): undefined reference to `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5028): undefined reference to `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5040): undefined reference to `node_reclaim_distance'
> Makefile:1074: recipe for target 'vmlinux' failed
> make: *** [vmlinux] Error 1
> 
> I have seen a similar problem with one of my PPC test builds.
> 
> powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x18): undefined reference to `node_reclaim_distance'

Thanks for this Guenter.

So, the way I've fixed this same issue for ia64 was to make NUMA
depend on SMP. Does that seem like a suitable solution for both PPC
and MIPS?


Re: [PATCH v4 2/2] sched/topology: Improve load balancing on AMD EPYC

2019-10-07 Thread Guenter Roeck
Hi,

On Thu, Aug 08, 2019 at 08:53:01PM +0100, Matt Fleming wrote:
> SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
> for any sched domains with a NUMA distance greater than 2 hops
> (RECLAIM_DISTANCE). The idea being that it's expensive to balance
> across domains that far apart.
> 
> However, as is rather unfortunately explained in
> 
>   commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
> 
> the value for RECLAIM_DISTANCE is based on node distance tables from
> 2011-era hardware.
> 
> Current AMD EPYC machines have the following NUMA node distances:
> 
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  16  32  32  32  32
>   1:  16  10  16  16  32  32  32  32
>   2:  16  16  10  16  32  32  32  32
>   3:  16  16  16  10  32  32  32  32
>   4:  32  32  32  32  10  16  16  16
>   5:  32  32  32  32  16  10  16  16
>   6:  32  32  32  32  16  16  10  16
>   7:  32  32  32  32  16  16  16  10
> 
> where 2 hops is 32.
> 
> The result is that the scheduler fails to load balance properly across
> NUMA nodes on different sockets -- 2 hops apart.
> 
> For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
> (CPUs 32-39) like so,
> 
>   $ numactl -C 0-7,32-39 ./spinner 16
> 
> causes all threads to fork and remain on node 0 until the active
> balancer kicks in after a few seconds and forcibly moves some threads
> to node 4.
> 
> Override node_reclaim_distance for AMD Zen.
> 
> Signed-off-by: Matt Fleming 
> Signed-off-by: Peter Zijlstra (Intel) 
> Acked-by: Mel Gorman 
> Cc: suravee.suthikulpa...@amd.com
> Cc: Borislav Petkov 
> Cc: thomas.lenda...@amd.com

This patch causes build errors on systems where NUMA does not depend on SMP,
for example MIPS and PPC. For example, building mips:ip27_defconfig with SMP
disabled results in

mips-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x5018): undefined reference to `node_reclaim_distance'
mips-linux-ld: page_alloc.c:(.text+0x5020): undefined reference to `node_reclaim_distance'
mips-linux-ld: page_alloc.c:(.text+0x5028): undefined reference to `node_reclaim_distance'
mips-linux-ld: page_alloc.c:(.text+0x5040): undefined reference to `node_reclaim_distance'
Makefile:1074: recipe for target 'vmlinux' failed
make: *** [vmlinux] Error 1

I have seen a similar problem with one of my PPC test builds.

powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x18): undefined reference to `node_reclaim_distance'

Guenter


[PATCH v4 2/2] sched/topology: Improve load balancing on AMD EPYC

2019-08-08 Thread Matt Fleming
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in

  commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.
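
To make the arithmetic above concrete, here is a minimal stand-alone C
sketch (not kernel code). It applies the same "distance > threshold"
comparison that sd_init() uses to the distance table quoted above. The names
node_dist, NR_NODES, balance_flags_stripped() and check_epyc_example() are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define RECLAIM_DISTANCE 30	/* default from include/linux/topology.h */
#define NR_NODES 8		/* hypothetical constant for this sketch */

/* Node distances from the table above (two sockets, four nodes each). */
static const int node_dist[NR_NODES][NR_NODES] = {
	{ 10, 16, 16, 16, 32, 32, 32, 32 },
	{ 16, 10, 16, 16, 32, 32, 32, 32 },
	{ 16, 16, 10, 16, 32, 32, 32, 32 },
	{ 16, 16, 16, 10, 32, 32, 32, 32 },
	{ 32, 32, 32, 32, 10, 16, 16, 16 },
	{ 32, 32, 32, 32, 16, 10, 16, 16 },
	{ 32, 32, 32, 32, 16, 16, 10, 16 },
	{ 32, 32, 32, 32, 16, 16, 16, 10 },
};

/*
 * Hypothetical helper mirroring the comparison in sd_init(): the
 * SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE flags are stripped when the
 * NUMA distance of a domain exceeds the reclaim threshold.
 */
static bool balance_flags_stripped(int distance, int reclaim_distance)
{
	return distance > reclaim_distance;
}

/* Self-check of the cases described in the changelog. */
static void check_epyc_example(void)
{
	/* Same socket (16): flags are kept with the default threshold. */
	assert(!balance_flags_stripped(node_dist[0][1], RECLAIM_DISTANCE));
	/* Other socket (32 > 30): flags are stripped with the default. */
	assert(balance_flags_stripped(node_dist[0][4], RECLAIM_DISTANCE));
	/* Other socket with the threshold raised to 32: flags are kept. */
	assert(!balance_flags_stripped(node_dist[0][4], 32));
}
```

The comparison is strict, so raising the threshold to exactly 32 is enough
for the cross-socket domains to keep their balancing flags.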

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Mel Gorman 
Cc: suravee.suthikulpa...@amd.com
Cc: Borislav Petkov 
Cc: thomas.lenda...@amd.com
---
 arch/x86/kernel/cpu/amd.c |  5 +++++
 include/linux/topology.h  | 14 ++++++++++++++
 kernel/sched/topology.c   |  3 ++-
 mm/khugepaged.c           |  2 +-
 mm/page_alloc.c           |  2 +-
 5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e50428b68..ceeb8afc7cf3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -824,6 +825,10 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);
 
+#ifdef CONFIG_NUMA
+	node_reclaim_distance = 32;
+#endif
+
 	/*
 	 * Fix erratum 1076: CPB feature bit not being set in CPUID.
 	 * Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c08036..579522ec446c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 30
 #endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS (1)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8f83e8e3ea9a..b5667a273bf6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int				sched_domains_curr_level;
 int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
 static struct cpumask		***sched_domains_numa_masks;
+int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
 #endif
 
 /*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
-		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
 				       SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b23215..ccede2425c3f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		if (!khugepaged_node_load[i])
 			continue;
-		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
 	}
 	return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..0d54cd2c43a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3522,7 +3522,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 static bool zone_allows_recla