[tip: x86/urgent] x86/asm/64: Align start of __clear_user() loop to 16-bytes

2020-06-19 Thread tip-bot2 for Matt Fleming
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: bb5570ad3b54e7930997aec76ab68256d5236d94
Gitweb:        https://git.kernel.org/tip/bb5570ad3b54e7930997aec76ab68256d5236d94
Author:        Matt Fleming 
AuthorDate:    Thu, 18 Jun 2020 11:20:02 +01:00
Committer: Borislav Petkov 
CommitterDate: Fri, 19 Jun 2020 18:32:11 +02:00

x86/asm/64: Align start of __clear_user() loop to 16-bytes

x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window, or worse, a 64-byte cacheline. This issue was discovered in the
SUSE kernel with the following commit,

  1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate 
constants")

which increased the code object size from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across a
64-byte cacheline.

Aligning the start of the loop to 16-bytes makes this fit neatly inside
a single instruction fetch window again and restores the performance of
__clear_user() which is used heavily when reading from /dev/zero.

Here are some numbers from running libmicro's read_z* and pread_z*
microbenchmarks which read from /dev/zero:

  Zen 1 (Naples)

  libmicro-file
                               5.7.0-rc6             5.7.0-rc6             5.7.0-rc6
                                         revert-1153933703d9+              align16+
  Time mean95-pread_z100k     9.9195 (   0.00%)     5.9856 (  39.66%)     5.9938 (  39.58%)
  Time mean95-pread_z10k      1.1378 (   0.00%)     0.7450 (  34.52%)     0.7467 (  34.38%)
  Time mean95-pread_z1k       0.2623 (   0.00%)     0.2251 (  14.18%)     0.2252 (  14.15%)
  Time mean95-pread_zw100k    9.9974 (   0.00%)     6.0648 (  39.34%)     6.0756 (  39.23%)
  Time mean95-read_z100k      9.8940 (   0.00%)     5.9885 (  39.47%)     5.9994 (  39.36%)
  Time mean95-read_z10k       1.1394 (   0.00%)     0.7483 (  34.33%)     0.7482 (  34.33%)

Note that this doesn't affect Haswell or Broadwell microarchitectures
which seem to avoid the alignment issue by executing the loop straight
out of the Loop Stream Detector (verified using perf events).

Fixes: 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Signed-off-by: Matt Fleming 
Signed-off-by: Borislav Petkov 
Cc:  # v4.19+
Link: https://lkml.kernel.org/r/20200618102002.30034-1-m...@codeblueprint.co.uk
---
 arch/x86/lib/usercopy_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index fff28c6..b0dfac3 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long 
size)
asm volatile(
"   testq  %[size8],%[size8]\n"
"   jz 4f\n"
+   "   .align 16\n"
"0: movq $0,(%[dst])\n"
"   addq   $8,%[dst]\n"
"   decl %%ecx ; jnz   0b\n"


[PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes

2020-06-18 Thread Matt Fleming
x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window, or worse, a 64-byte cacheline. This issue was discovered in the
SUSE kernel with the following commit,

  1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate 
constants")

which increased the code object size from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across a
64-byte cacheline.

Aligning the start of the loop to 16-bytes makes this fit neatly inside
a single instruction fetch window again and restores the performance of
__clear_user() which is used heavily when reading from /dev/zero.

Here are some numbers from running libmicro's read_z* and pread_z*
microbenchmarks which read from /dev/zero:

  Zen 1 (Naples)

  libmicro-file
                               5.7.0-rc6             5.7.0-rc6             5.7.0-rc6
                                         revert-1153933703d9+              align16+
  Time mean95-pread_z100k     9.9195 (   0.00%)     5.9856 (  39.66%)     5.9938 (  39.58%)
  Time mean95-pread_z10k      1.1378 (   0.00%)     0.7450 (  34.52%)     0.7467 (  34.38%)
  Time mean95-pread_z1k       0.2623 (   0.00%)     0.2251 (  14.18%)     0.2252 (  14.15%)
  Time mean95-pread_zw100k    9.9974 (   0.00%)     6.0648 (  39.34%)     6.0756 (  39.23%)
  Time mean95-read_z100k      9.8940 (   0.00%)     5.9885 (  39.47%)     5.9994 (  39.36%)
  Time mean95-read_z10k       1.1394 (   0.00%)     0.7483 (  34.33%)     0.7482 (  34.33%)

Note that this doesn't affect Haswell or Broadwell microarchitectures
which seem to avoid the alignment issue by executing the loop straight
out of the Loop Stream Detector (verified using perf events).

Fixes: 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Cc: "Grimm, Jon" 
Cc: "Kumar, Venkataramanan" 
CC: Jan Kara 
Cc:  # v4.19+
Signed-off-by: Matt Fleming 
---
 arch/x86/lib/usercopy_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index fff28c6f73a2..b0dfac3d3df7 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long 
size)
asm volatile(
"   testq  %[size8],%[size8]\n"
"   jz 4f\n"
+   "   .align 16\n"
"0: movq $0,(%[dst])\n"
"   addq   $8,%[dst]\n"
"   decl %%ecx ; jnz   0b\n"
-- 
2.17.1



Re: [PATCH v2] perf ordered_events: Optimise event object reuse

2020-05-28 Thread Matt Fleming
On Wed, 27 May, at 12:25:33PM, Jiri Olsa wrote:
> On Tue, May 26, 2020 at 02:59:28PM +0100, Matt Fleming wrote:
> > +/*
> > + * Allocate a new event object from the free event cache.
> > + *
> > + * Find the first address range in the cache and carve out enough bytes
> > + * for an ordered_event object. The object with the lowest address is
> > + * always returned so that subsequent allocations benefit from
> > + * contiguous memory accesses (spatial locality).
> > + */
> > +static struct ordered_event *free_event_get_tree(struct ordered_events *oe)
> > +{
> > +   struct interval_tree_node *it;
> > +   struct ordered_event *new;
> > +   size_t bytes = sizeof(*new);
> > +
> > +   it = interval_tree_iter_first(&oe->cache.rb, 0, ULONG_MAX);
> > +   if (!it)
> > +   return NULL;
> > +
> > +   /* Has the cache memory been exhausted? */
> > +   assert(cache_region_size(it) >= bytes);
> > +
> > +   new = (void *)it->start;
> > +
> > +   if (cache_region_size(it) == bytes) {
> > +   interval_tree_remove(it, &oe->cache.rb);
> > +   free(it);
> > +   }
> > +
> > +   it->start += bytes;
> 
> this does not look right.. should this go to else path in above condition?

Urgh, yes. You're right -- this should be in the else path.
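
For clarity, a sketch of how the corrected function would read, with the
range only shrunk in the else path (cache_region_size(), struct
ordered_events and struct ordered_event are assumed from the patch under
review; the final return is implied by the truncated quote above):

static struct ordered_event *free_event_get_tree(struct ordered_events *oe)
{
        struct interval_tree_node *it;
        struct ordered_event *new;
        size_t bytes = sizeof(*new);

        it = interval_tree_iter_first(&oe->cache.rb, 0, ULONG_MAX);
        if (!it)
                return NULL;

        /* Has the cache memory been exhausted? */
        assert(cache_region_size(it) >= bytes);

        new = (void *)it->start;

        if (cache_region_size(it) == bytes) {
                /* Range fully consumed: drop the node from the tree. */
                interval_tree_remove(it, &oe->cache.rb);
                free(it);
        } else {
                /* Carve the object off the front of the range. */
                it->start += bytes;
        }

        return new;
}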


[PATCH v2] perf ordered_events: Optimise event object reuse

2020-05-26 Thread Matt Fleming
ordered_event objects can be placed on the free event list in any order
which means future allocations may not return objects at sequential
locations in memory. Getting non-contiguous objects from the free list
has bad consequences when later iterating over those objects in
ordered_events__queue().

For example, large perf.data files can contain trillions of events and
since objects that are next to each other in the free linked-list can
point to pretty much anywhere in the object address space, lots of
cycles in ordered_events__queue() are spent servicing DTLB misses.

Implement the free event cache using the in-kernel implementation of
interval trees so that objects can always be allocated from the free
object cache in sequential order, improving spatial locality and
reducing DTLB misses.

Since the existing linked-list implementation is faster if you're
working with fewer events, add a --free-event-list option to force the
use of the linked-list implementation. However, most users will benefit
from the new interval tree code.

Here are some numbers showing the speed up (reduction in execution time)
when running perf sched latency on sched events data and perf report on
HW_CPU_CYCLES.

 $ perf stat --null -r 10 -- bash -c \
"export PAGER=cat ; perf sched latency -i $file --stdio &>/dev/null"

   Nr events        File Size    Before      After     Speed up
  --------------    ---------    -------    -------    --------
    123318457470        29MB       0.2149     0.2440      -13.5%
   1651500885357       260MB       3.1205     1.9855      +36.4%
   3470519751785       506MB       6.1821     3.8941      +37.0%
   8213765551679      1100MB      13.4875     8.5079      +36.9%
  15900515973759      1946MB      23.4069    15.0960      +35.5%

and HW_CPU_CYCLES events:

 $ perf stat --null -r 10 -- bash -c \
"export PAGER=cat ; perf report -i $file --stdio &>/dev/null"

   Nr events        File Size    Before      After     Speed up
  --------------    ---------    -------    -------    --------
    328805166262        29MB       1.637      1.587       +3.0%
   3280413919096       253MB      13.381     12.349       +7.7%
   6475305954370       500MB      25.648     23.753       +7.4%
  14218430569416      1000MB      52.800     49.036       +7.1%
  26760562279427      2000MB      97.169     90.129       +7.2%

Signed-off-by: Matt Fleming 
---

Changes in v2:

 - Add --free-event-list parameter to fallback to the linked-list
   implementation of the free event cache.

 - Rename the free_cache_* API to free_event_* now that we have both
   the old linked-list version and the newer _tree() calls. Also make
   the API public so I can drop the #define gunk in the tests.

 - Update check-header.sh for interval tree files.

 - Remove leftover struct cache_region type that accidentally snuck into v1.

 tools/include/linux/interval_tree.h |  30 +++
 tools/include/linux/interval_tree_generic.h | 187 ++
 tools/lib/interval_tree.c   |  12 ++
 tools/perf/MANIFEST |   1 +
 tools/perf/check-headers.sh |   2 +
 tools/perf/perf.c   |   6 +
 tools/perf/tests/Build  |   1 +
 tools/perf/tests/builtin-test.c |   4 +
 tools/perf/tests/free-event-tree.c  | 201 
 tools/perf/tests/tests.h|   2 +
 tools/perf/util/Build   |   6 +
 tools/perf/util/ordered-events.c| 174 -
 tools/perf/util/ordered-events.h|  16 +-
 13 files changed, 634 insertions(+), 8 deletions(-)
 create mode 100644 tools/include/linux/interval_tree.h
 create mode 100644 tools/include/linux/interval_tree_generic.h
 create mode 100644 tools/lib/interval_tree.c
 create mode 100644 tools/perf/tests/free-event-tree.c

diff --git a/tools/include/linux/interval_tree.h 
b/tools/include/linux/interval_tree.h
new file mode 100644
index ..288c26f50732
--- /dev/null
+++ b/tools/include/linux/interval_tree.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_INTERVAL_TREE_H
+#define _LINUX_INTERVAL_TREE_H
+
+#include 
+
+struct interval_tree_node {
+   struct rb_node rb;
+   unsigned long start;/* Start of interval */
+   unsigned long last; /* Last location _in_ interval */
+   unsigned long __subtree_last;
+};
+
+extern void
+interval_tree_insert(struct interval_tree_node *node,
+struct rb_root_cached *root);
+
+extern void
+interval_tree_remove(struct interval_tree_node *node,
+struct rb_root_cached *root);
+
+extern struct interval_tree_node *
+interval_tree_iter_first(struct rb_root_cached *root,
+unsigned long start, unsigned long last);
+
+extern struct interval_tree_node *
+interval_tree_iter_next(struct interval_tree_node *node,
+   unsigned long start, unsigned long last);
+
+#endif /* _LINUX_INTERVAL_TREE_H */
diff --git a/tools/include/linux/interval_tre

Re: [PATCH] perf ordered_events: Optimise event object reuse

2020-05-21 Thread Matt Fleming
On Wed, 20 May, at 11:52:34PM, Jiri Olsa wrote:
> On Wed, May 20, 2020 at 02:00:49PM +0100, Matt Fleming wrote:
> >  
> > Nope, the tests in this file are unit tests so I'm testing
> > free_cache_{get,put} which are file-local functions by #include'ing
> > ordered-events.c.
> > 
> > The above define are required to avoid duplicate symbol errors at
> > link-time, e.g.
> > 
> >   util/perf-in.o: In function `ordered_events__flush_time':
> >   /home/matt/src/kernels/linux/tools/perf/util/ordered-events.c:461: 
> > multiple definition of `ordered_events__flush_time'
> >   
> > tests/perf-in.o:/home/matt/src/kernels/linux/tools/perf/tests/../util/ordered-events.c:461:
> >  first defined here
> > 
> > There are other ways to resolve this (linker flags to change the
> > symbols) but I couldn't find any precedent with that, so this seemed
> > like the easiest and most obvious solution. I'm happy to fix this up any
> > other way if you have suggestions though.
> 
> hum, could we just make free_cache_{get,put} public?

Yeah that's totally doable. I'll send a v2 with all these changes.
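
For readers following the thread, this is the rename trick being dropped
in v2, in rough outline (the symbol names are the ones quoted from the
test file elsewhere in this thread; the include path is illustrative):

/*
 * Illustrative sketch only: unit-test the file-local (static) helpers by
 * #include'ing the .c file, renaming its public symbols first so they
 * don't collide with the copies already linked into util/perf-in.o.
 */
#define ordered_events__queue  __test_ordered_events__queue
#define ordered_events__flush  __test_ordered_events__flush
/* ... one #define per symbol exported by ordered-events.c ... */

#include "util/ordered-events.c"

static int test_free_cache_get_put(void)
{
        /* free_cache_get()/free_cache_put() are now visible here. */
        return 0;
}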


Re: [PATCH] perf ordered_events: Optimise event object reuse

2020-05-20 Thread Matt Fleming
On Mon, 18 May, at 02:04:08PM, Jiri Olsa wrote:
> On Fri, May 15, 2020 at 10:01:51PM +0100, Matt Fleming wrote:
> > ordered_event objects can be placed on the free object cache list in any
> > order which means future allocations may not return objects at
> > sequential locations in memory. Getting non-contiguous objects from the
> > free cache has bad consequences when later iterating over those objects
> > in ordered_events__queue().
> > 
> > For example, large perf.data files can contain trillions of events and
> > since objects that are next to each other in the free cache linked list
> > can point to pretty much anywhere in the object address space, lots of
> > cycles in ordered_events__queue() are spent servicing DTLB misses.
> > 
> > Implement the free object cache using the in-kernel implementation of
> > interval trees so that objects can always be allocated from the free
> > object cache in sequential order, improving spatial locality and
> > reducing DTLB misses.
> > 
> > Here are some numbers showing the speed up (reduction in execution time)
> > when running perf sched latency on sched events data and perf report on
> > HW_CPU_CYCLES.
> 
> really nice, few questions below
> 
> > 
> >  $ perf stat --null -r 10 -- bash -c \
> > "export PAGER=cat ; perf sched latency -i $file --stdio &>/dev/null"
> > 
> >   Nr events File Size   BeforeAfterSpeed up
> > --  -    ---  --
> >   123318457470 29MB 0.21490.2440-13.5%
> 
> should we be concerned about small data and the extra processing?
 
I didn't look into this slowdown originally because it's ~2.9 ms, but
FYI it looks like this is caused by:

 - Longer code paths (more instructions)
 - More branches
 - More branch mispredicts

> maybe we could add some option that disables this, at least to be
> able to compare times in the future
 
Sure. Do you mean a command-line option or build-time config?

> > diff --git a/tools/include/linux/interval_tree_generic.h 
> > b/tools/include/linux/interval_tree_generic.h
> > new file mode 100644
> > index ..aaa8a0767aa3
> > --- /dev/null
> > +++ b/tools/include/linux/interval_tree_generic.h
> > @@ -0,0 +1,187 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > +/*
> > +  Interval Trees
> > +  (C) 2012  Michel Lespinasse 
> > +
> > +
> > +  include/linux/interval_tree_generic.h
> > +*/
> > +
> > +#include 
> > +
> > +/*
> > + * Template for implementing interval trees
> > + *
> > + * ITSTRUCT:   struct type of the interval tree nodes
> > + * ITRB:   name of struct rb_node field within ITSTRUCT
> > + * ITTYPE: type of the interval endpoints
> > + * ITSUBTREE:  name of ITTYPE field within ITSTRUCT holding last-in-subtree
> > + * ITSTART(n): start endpoint of ITSTRUCT node n
> > + * ITLAST(n):  last endpoint of ITSTRUCT node n
> > + * ITSTATIC:   'static' or empty
> > + * ITPREFIX:   prefix to use for the inline tree definitions
> > + *
> > + * Note - before using this, please consider if generic version
> > + * (interval_tree.h) would work for you...
> 
> the interval_tree.h looks like what you have added with the generic
> header.. is there some reason to use the _generic header?
 
Yes, the _generic version contains the actual implementation of
interval trees, so you need both.
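
To make the split concrete: interval_tree_generic.h only carries the
template macro, and tools/lib/interval_tree.c instantiates it for the
plain interval_tree_node. A minimal, illustrative-only instantiation for
a hypothetical node type looks roughly like this (parameter order as
documented in the template comment):

#include <linux/rbtree_augmented.h>
#include <linux/interval_tree_generic.h>

struct range_node {
        struct rb_node rb;              /* ITRB */
        unsigned long start;            /* ITSTART(n) */
        unsigned long last;             /* ITLAST(n) */
        unsigned long __subtree_last;   /* ITSUBTREE */
};

#define RANGE_START(n) ((n)->start)
#define RANGE_LAST(n)  ((n)->last)

/* Generates range_tree_insert/remove/iter_first/iter_next. */
INTERVAL_TREE_DEFINE(struct range_node, rb, unsigned long, __subtree_last,
                     RANGE_START, RANGE_LAST, static, range_tree)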

> please add the header file also to check-headers.sh

Done!
 
> > diff --git a/tools/perf/tests/free-object-cache.c 
> > b/tools/perf/tests/free-object-cache.c
> > new file mode 100644
> > index ..e4395ece7d2b
> > --- /dev/null
> > +++ b/tools/perf/tests/free-object-cache.c
> > @@ -0,0 +1,200 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "tests.h"
> > +#include 
> > +
> > +#define ordered_events__flush_time __test_ordered_events__flush_time
> > +#define ordered_events__first_time __test_ordered_events__first_time
> > +#define ordered_events__delete __test_ordered_events__delete
> > +#define ordered_events__init __test_ordered_events__init
> > +#define ordered_events__free __test_ordered_events__free
> > +#define ordered_events__queue __test_ordered_events__queue
> > +#define ordered_events__reinit __test_ordered_events__reinit
> > +#define ordered_events__flush __test_ordered_events__flush
> 
> I'm excited to see these tests, but why is above needed?
> 
> can't you use ordered-events interface as it is? you used only
> exported functions right?
 
Nope, the tests in this file are unit tests so 

[PATCH] perf ordered_events: Optimise event object reuse

2020-05-15 Thread Matt Fleming
ordered_event objects can be placed on the free object cache list in any
order which means future allocations may not return objects at
sequential locations in memory. Getting non-contiguous objects from the
free cache has bad consequences when later iterating over those objects
in ordered_events__queue().

For example, large perf.data files can contain trillions of events and
since objects that are next to each other in the free cache linked list
can point to pretty much anywhere in the object address space, lots of
cycles in ordered_events__queue() are spent servicing DTLB misses.

Implement the free object cache using the in-kernel implementation of
interval trees so that objects can always be allocated from the free
object cache in sequential order, improving spatial locality and
reducing DTLB misses.

Here are some numbers showing the speed up (reduction in execution time)
when running perf sched latency on sched events data and perf report on
HW_CPU_CYCLES.

 $ perf stat --null -r 10 -- bash -c \
"export PAGER=cat ; perf sched latency -i $file --stdio &>/dev/null"

   Nr events        File Size    Before      After     Speed up
  --------------    ---------    -------    -------    --------
    123318457470        29MB       0.2149     0.2440      -13.5%
   1651500885357       260MB       3.1205     1.9855      +36.4%
   3470519751785       506MB       6.1821     3.8941      +37.0%
   8213765551679      1100MB      13.4875     8.5079      +36.9%
  15900515973759      1946MB      23.4069    15.0960      +35.5%

and HW_CPU_CYCLES events:

 $ perf stat --null -r 10 -- bash -c \
"export PAGER=cat ; perf report -i $file --stdio &>/dev/null"

   Nr events        File Size    Before      After     Speed up
  --------------    ---------    -------    -------    --------
    328805166262        29MB       1.637      1.587       +3.0%
   3280413919096       253MB      13.381     12.349       +7.7%
   6475305954370       500MB      25.648     23.753       +7.4%
  14218430569416      1000MB      52.800     49.036       +7.1%
  26760562279427      2000MB      97.169     90.129       +7.2%

Signed-off-by: Matt Fleming 
---
 tools/include/linux/interval_tree.h |  30 +++
 tools/include/linux/interval_tree_generic.h | 187 ++
 tools/lib/interval_tree.c   |  12 ++
 tools/perf/MANIFEST |   1 +
 tools/perf/tests/Build  |   1 +
 tools/perf/tests/builtin-test.c |   4 +
 tools/perf/tests/free-object-cache.c| 200 
 tools/perf/tests/tests.h|   2 +
 tools/perf/util/Build   |   6 +
 tools/perf/util/ordered-events.c| 130 -
 tools/perf/util/ordered-events.h|   8 +-
 11 files changed, 573 insertions(+), 8 deletions(-)
 create mode 100644 tools/include/linux/interval_tree.h
 create mode 100644 tools/include/linux/interval_tree_generic.h
 create mode 100644 tools/lib/interval_tree.c
 create mode 100644 tools/perf/tests/free-object-cache.c

diff --git a/tools/include/linux/interval_tree.h 
b/tools/include/linux/interval_tree.h
new file mode 100644
index ..288c26f50732
--- /dev/null
+++ b/tools/include/linux/interval_tree.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_INTERVAL_TREE_H
+#define _LINUX_INTERVAL_TREE_H
+
+#include 
+
+struct interval_tree_node {
+   struct rb_node rb;
+   unsigned long start;/* Start of interval */
+   unsigned long last; /* Last location _in_ interval */
+   unsigned long __subtree_last;
+};
+
+extern void
+interval_tree_insert(struct interval_tree_node *node,
+struct rb_root_cached *root);
+
+extern void
+interval_tree_remove(struct interval_tree_node *node,
+struct rb_root_cached *root);
+
+extern struct interval_tree_node *
+interval_tree_iter_first(struct rb_root_cached *root,
+unsigned long start, unsigned long last);
+
+extern struct interval_tree_node *
+interval_tree_iter_next(struct interval_tree_node *node,
+   unsigned long start, unsigned long last);
+
+#endif /* _LINUX_INTERVAL_TREE_H */
diff --git a/tools/include/linux/interval_tree_generic.h 
b/tools/include/linux/interval_tree_generic.h
new file mode 100644
index ..aaa8a0767aa3
--- /dev/null
+++ b/tools/include/linux/interval_tree_generic.h
@@ -0,0 +1,187 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+  Interval Trees
+  (C) 2012  Michel Lespinasse 
+
+
+  include/linux/interval_tree_generic.h
+*/
+
+#include 
+
+/*
+ * Template for implementing interval trees
+ *
+ * ITSTRUCT:   struct type of the interval tree nodes
+ * ITRB:   name of struct rb_node field within ITSTRUCT
+ * ITTYPE: type of the interval endpoints
+ * ITSUBTREE:  name of ITTYPE field within ITSTRUCT holding last-in-subtree
+ * ITSTART(n): start endpoint of ITSTRUCT node n
+ * ITLAST(n):  last endpoint of ITSTRUCT node n
+ * ITSTATIC:   'static' or empty
+ * ITPREFIX:  

Re: [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit

2019-10-18 Thread Matt Fleming
On Fri, 18 Oct, at 01:54:49PM, Mel Gorman wrote:
> On Fri, Oct 18, 2019 at 12:58:49PM +0100, Matt Fleming wrote:
> > On Fri, 18 Oct, at 11:56:03AM, Mel Gorman wrote:
> > > A private report stated that system CPU usage was excessive on an AMD
> > > EPYC 2 machine while building kernels with much longer build times than
> > > expected. The issue is partially explained by high zone lock contention
> > > due to the per-cpu page allocator batch and high limits being calculated
> > > incorrectly. This series addresses a large chunk of the problem. Patch 1
> > > is mostly cosmetic but prepares for patch 2 which is the real fix. Patch
> > > 3 is definitely cosmetic but was noticed while implementing the fix. Proper
> > > details are in the changelog for patch 2.
> > > 
> > >  include/linux/mm.h |  3 ---
> > >  mm/internal.h  |  3 +++
> > >  mm/page_alloc.c| 33 -
> > >  3 files changed, 23 insertions(+), 16 deletions(-)
> > 
> > Just to confirm, these patches don't fix the issue we're seeing on the
> > EPYC 2 machines, but they do return the batch sizes to sensible values.
> 
> To be clear, does the patch a) fix *some* of the issue and there is
> something else also going on that needs to be chased down or b) has no
> impact on build time or system CPU usage on your machine?

Sorry, I realise my email was pretty unclear.

These patches *do* fix some of the issue because I no longer see as
much contention on the zone locks with the patches applied.


Re: [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit

2019-10-18 Thread Matt Fleming
On Fri, 18 Oct, at 11:56:03AM, Mel Gorman wrote:
> A private report stated that system CPU usage was excessive on an AMD
> EPYC 2 machine while building kernels with much longer build times than
> expected. The issue is partially explained by high zone lock contention
> due to the per-cpu page allocator batch and high limits being calculated
> incorrectly. This series addresses a large chunk of the problem. Patch 1
> is mostly cosmetic but prepares for patch 2 which is the real fix. Patch
> 3 is definitely cosmetic but was noticed while implementing the fix. Proper
> details are in the changelog for patch 2.
> 
>  include/linux/mm.h |  3 ---
>  mm/internal.h  |  3 +++
>  mm/page_alloc.c| 33 -
>  3 files changed, 23 insertions(+), 16 deletions(-)

Just to confirm, these patches don't fix the issue we're seeing on the
EPYC 2 machines, but they do return the batch sizes to sensible values.


Re: [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm

2019-10-18 Thread Matt Fleming
On Fri, 18 Oct, at 11:56:06AM, Mel Gorman wrote:
> Memory hotplug needs to be able to reset and reinit the pcpu allocator
> batch and high limits but this action is internal to the VM. Move
> the declaration to internal.h
> 
> Signed-off-by: Mel Gorman 
> ---
>  include/linux/mm.h | 3 ---
>  mm/internal.h  | 3 +++
>  2 files changed, 3 insertions(+), 3 deletions(-)

Tested-by: Matt Fleming 


Re: [PATCH 1/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler

2019-10-18 Thread Matt Fleming
On Fri, 18 Oct, at 11:56:04AM, Mel Gorman wrote:
> Both the percpu_pagelist_fraction sysctl handler and memory hotplug
> have a common requirement of updating the pcpu page allocation batch
> and high values. Split the relevant helper to share common code.
> 
> No functional change.
> 
> Signed-off-by: Mel Gorman 
> ---
>  mm/page_alloc.c | 23 ---
>  1 file changed, 12 insertions(+), 11 deletions(-)
 
Tested-by: Matt Fleming 


Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-18 Thread Matt Fleming
On Fri, 18 Oct, at 11:56:05AM, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes on each node. It was tested on a 2-socket AMD
> EPYC 2 machine using a kernel compilation workload -- allmodconfig and
> all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                5.4.0-rc3              5.4.0-rc3
>                                  vanilla          resetpcpu-v1r1
> Amean     user-256    13249.50 (   0.00%)    15928.40 * -20.22%*
> Amean     syst-256    14760.30 (   0.00%)     4551.77 *  69.16%*
> Amean     elsp-256      162.42 (   0.00%)      118.46 *  27.06%*
> Stddev    user-256       42.97 (   0.00%)       50.83 ( -18.30%)
> Stddev    syst-256      336.87 (   0.00%)       33.70 (  90.00%)
> Stddev    elsp-256        2.46 (   0.00%)        0.81 (  67.01%)
> 
>                     5.4.0-rc3       5.4.0-rc3
>                       vanilla  resetpcpu-v1r1
> Duration User        39766.24        47802.92
> Duration System      44298.10        13671.93
> Duration Elapsed       519.11          387.65
> 
> The patch reduces system CPU usage by 69.16% and total build time by
> 27.06%. The variance of system CPU usage is also much reduced.
> 
> Cc: sta...@vger.kernel.org # v4.15+
> Signed-off-by: Mel Gorman 
> ---
>  mm/page_alloc.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)

Tested-by: Matt Fleming 


Re: [PATCH v4 2/2] sched/topology: Improve load balancing on AMD EPYC

2019-10-09 Thread Matt Fleming
On Mon, 07 Oct, at 08:28:16AM, Guenter Roeck wrote:
> 
> This patch causes build errors on systems where NUMA does not depend on SMP,
> for example MIPS and PPC. For example, building mips:ip27_defconfig with SMP
> disabled results in
> 
> mips-linux-ld: mm/page_alloc.o: in function `get_page_from_freelist':
> page_alloc.c:(.text+0x5018): undefined reference to `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5020): undefined reference to 
> `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5028): undefined reference to 
> `node_reclaim_distance'
> mips-linux-ld: page_alloc.c:(.text+0x5040): undefined reference to 
> `node_reclaim_distance'
> Makefile:1074: recipe for target 'vmlinux' failed
> make: *** [vmlinux] Error 1
> 
> I have seen a similar problem with one of my PPC test builds.
> 
> powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x18): undefined reference to 
> `node_reclaim_distance'

Thanks for this Guenter.

So, the way I've fixed this same issue for ia64 was to make NUMA
depend on SMP. Does that seem like a suitable solution for both PPC
and MIPS?


[tip: sched/core] sched/topology: Improve load balancing on AMD EPYC systems

2019-09-03 Thread tip-bot2 for Matt Fleming
The following commit has been merged into the sched/core branch of tip:

Commit-ID: a55c7454a8c887b226a01d7eed088ccb5374d81e
Gitweb:        https://git.kernel.org/tip/a55c7454a8c887b226a01d7eed088ccb5374d81e
Author:        Matt Fleming 
AuthorDate:    Thu, 08 Aug 2019 20:53:01 +01:00
Committer: Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 09:17:37 +02:00

sched/topology: Improve load balancing on AMD EPYC systems

SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in:

  commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

 node distances:
 node   0   1   2   3   4   5   6   7
   0:  10  16  16  16  32  32  32  32
   1:  16  10  16  16  32  32  32  32
   2:  16  16  10  16  32  32  32  32
   3:  16  16  16  10  32  32  32  32
   4:  32  32  32  32  10  16  16  16
   5:  32  32  32  32  16  10  16  16
   6:  32  32  32  32  16  16  10  16
   7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Mel Gorman 
Cc: Borislav Petkov 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: suravee.suthikulpa...@amd.com
Cc: Thomas Gleixner 
Cc: thomas.lenda...@amd.com
Cc: Tony Luck 
Link: https://lkml.kernel.org/r/20190808195301.13222-3-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/amd.c |  5 +
 include/linux/topology.h  | 14 ++
 kernel/sched/topology.c   |  3 ++-
 mm/khugepaged.c   |  2 +-
 mm/page_alloc.c   |  2 +-
 5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e504..ceeb8af 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -824,6 +825,10 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
set_cpu_cap(c, X86_FEATURE_ZEN);
 
+#ifdef CONFIG_NUMA
+   node_reclaim_distance = 32;
+#endif
+
/*
 * Fix erratum 1076: CPB feature bit not being set in CPUID.
 * Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c..579522e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 30
 #endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS (1)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8f83e8e..b5667a2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int
sched_domains_curr_level;
 int                            sched_max_numa_distance;
 static int *sched_domains_numa_distance;
 static struct cpumask  ***sched_domains_numa_masks;
+int __read_mostly  node_reclaim_distance = RECLAIM_DISTANCE;
 #endif
 
 /*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
-   if (sched_domains_numa_distance[tl->numa_level] > 
RECLAIM_DISTANCE) {
+   if (sched_domains_numa_distance[tl->numa_level] > 
node_reclaim_distance) {
sd->flags &= ~(SD_BALANCE_EXEC |
   SD_BALANCE_FORK |
   SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b..ccede24 100644
--- a/mm/khugepaged.c
+++ b/mm/khuge

[tip: sched/core] arch, ia64: Make NUMA select SMP

2019-09-03 Thread tip-bot2 for Matt Fleming
The following commit has been merged into the sched/core branch of tip:

Commit-ID: a2cbfd46559e809c8165773b7fe8afa058b35414
Gitweb:        https://git.kernel.org/tip/a2cbfd46559e809c8165773b7fe8afa058b35414
Author:        Matt Fleming 
AuthorDate:    Thu, 08 Aug 2019 20:53:00 +01:00
Committer: Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 09:17:36 +02:00

arch, ia64: Make NUMA select SMP

While it does make sense to allow CONFIG_NUMA and !CONFIG_SMP in
theory, it doesn't make much sense in practice.

Follow other architectures and make CONFIG_NUMA select CONFIG_SMP.

The motivation for this patch is to allow a new NUMA variable to be
initialised in kernel/sched/topology.c.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Borislav Petkov 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: suravee.suthikulpa...@amd.com
Cc: Thomas Gleixner 
Cc: thomas.lenda...@amd.com
Cc: Tony Luck 
Link: https://lkml.kernel.org/r/20190808195301.13222-2-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/ia64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 7468d8e..997baba 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -389,6 +389,7 @@ config NUMA
depends on !IA64_HP_SIM && !FLATMEM
default y if IA64_SGI_SN2
select ACPI_NUMA if ACPI
+   select SMP
help
  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
  Access).  This option is for configuring high-end multiprocessor


[PATCH v4 2/2] sched/topology: Improve load balancing on AMD EPYC

2019-08-08 Thread Matt Fleming
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in

  commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Mel Gorman 
Cc: suravee.suthikulpa...@amd.com
Cc: Borislav Petkov 
Cc: thomas.lenda...@amd.com
---
 arch/x86/kernel/cpu/amd.c |  5 +
 include/linux/topology.h  | 14 ++
 kernel/sched/topology.c   |  3 ++-
 mm/khugepaged.c   |  2 +-
 mm/page_alloc.c   |  2 +-
 5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e50428b68..ceeb8afc7cf3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -824,6 +825,10 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
set_cpu_cap(c, X86_FEATURE_ZEN);
 
+#ifdef CONFIG_NUMA
+   node_reclaim_distance = 32;
+#endif
+
/*
 * Fix erratum 1076: CPB feature bit not being set in CPUID.
 * Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c08036..579522ec446c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 30
 #endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS (1)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8f83e8e3ea9a..b5667a273bf6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int
sched_domains_curr_level;
 int                            sched_max_numa_distance;
 static int *sched_domains_numa_distance;
 static struct cpumask  ***sched_domains_numa_masks;
+int __read_mostly  node_reclaim_distance = RECLAIM_DISTANCE;
 #endif
 
 /*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
-   if (sched_domains_numa_distance[tl->numa_level] > 
RECLAIM_DISTANCE) {
+   if (sched_domains_numa_distance[tl->numa_level] > 
node_reclaim_distance) {
sd->flags &= ~(SD_BALANCE_EXEC |
   SD_BALANCE_FORK |
   SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b23215..ccede2425c3f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
for (i = 0; i < MAX_NUMNODES; i++) {
if (!khugepaged_node_load[i])
continue;
-   if (node_distance(nid, i) > RECLAIM_DISTANCE)
+   if (node_distance(nid, i) > node_reclaim_distance)
return true;
}
return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..0d54cd2c43a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3522,7 +3522,7 @@ bool zone_watermark_ok_safe(s

[PATCH 1/2] ia64: Make NUMA select SMP

2019-08-08 Thread Matt Fleming
While it does make sense to allow CONFIG_NUMA and !CONFIG_SMP in
theory, it doesn't make much sense in practice.

Follow other architectures and make CONFIG_NUMA select CONFIG_SMP.

The motivation for this patch is to allow a new NUMA variable to be
initialised in kernel/sched/topology.c.

Signed-off-by: Matt Fleming 
Cc: Tony Luck 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
---
 arch/ia64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 7468d8e50467..997baba02b70 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -389,6 +389,7 @@ config NUMA
depends on !IA64_HP_SIM && !FLATMEM
default y if IA64_SGI_SN2
select ACPI_NUMA if ACPI
+   select SMP
help
  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
  Access).  This option is for configuring high-end multiprocessor
-- 
2.13.7



[PATCH v4 0/2] sched: Improve load balancing on AMD EPYC

2019-08-08 Thread Matt Fleming
This is another version of the AMD EPYC load balancing patch. The
difference with this one is that now it fixes the following ia64 build
error, reported by 0day:

   mm/page_alloc.o: In function `get_page_from_freelist':
   page_alloc.c:(.text+0x7850): undefined reference to `node_reclaim_distance'
   page_alloc.c:(.text+0x7931): undefined reference to `node_reclaim_distance'

Matt Fleming (2):
  ia64: Make NUMA select SMP
  sched/topology: Improve load balancing on AMD EPYC

 arch/ia64/Kconfig |  1 +
 arch/x86/kernel/cpu/amd.c |  5 +
 include/linux/topology.h  | 14 ++
 kernel/sched/topology.c   |  3 ++-
 mm/khugepaged.c   |  2 +-
 mm/page_alloc.c   |  2 +-
 6 files changed, 24 insertions(+), 3 deletions(-)

-- 
2.13.7



Re: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC

2019-07-29 Thread Matt Fleming
On Thu, 25 Jul, at 04:37:06PM, Suthikulpanit, Suravee wrote:
> 
> I am testing this patch on the Linux-5.2, and I actually do not
> notice difference pre vs post patch.
> 
> Besides the case above, I have also run an experiment with
> a different number of threads across two sockets:
> 
> (Note: I only focus on thread0 of each core.)
> 
> sXnY = Socket X Node Y
> 
>  * s0n0 + s0n1 + s1n0 + s1n1
>  numactl -C 0-15,32-47 ./spinner 32
> 
>  * s0n2 + s0n3 + s1n2 + s1n3
>  numactl -C 16-31,48-63 ./spinner 32
> 
>  * s0 + s1
>  numactl -C 0-63 ./spinner 64
> 
> My observations are:
> 
>  * I still notice improper load-balance on one of the task initially
>for a few seconds before they are load-balanced correctly.
> 
>  * It is taking longer to load balance w/ more number of tasks.
> 
> I wonder if you have tried with a different kernel base?

It was tested with one of the 5.2 -rc kernels.

I'll take another look at this behaviour, but for the benefit of LKML
readers, here's the summary I gave before. It's specific to using
cgroups to partitions tasks:

It turns out there's a secondary issue to do with how run queue load
averages are compared between sched groups.

Load averages for a sched_group (a group within a domain) are
effectively "scaled" by the number of CPUs in that group. This has a
direct influence on how quickly load ramps up in a group.

What's happening on my system when running with $(numactl -C
0-7,32-39) is that the load for the top NUMA sched_domain (domain4) is
scaling the load by 64 CPUs -- even though the workload can't use all
64 due to scheduler affinity.

So because the load balancer thinks there's plenty of room left to run
tasks, it doesn't balance very well across sockets even with the
SD_BALANCE_FORK flag.

This super quick and ugly patch, which caps the number of CPUs at 8, gets
both sockets used by fork() on my system.

>8

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..9444c34d038c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5791,6 +5791,7 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
(sd->imbalance_pct-100) / 100;
+   unsigned long capacity;
 
if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;
@@ -5835,10 +5836,15 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
}
 
/* Adjust by relative CPU capacity of the group */
+   capacity = group->sgc->capacity;
+
+   if (capacity > (SCHED_CAPACITY_SCALE * 8))
+   capacity = SCHED_CAPACITY_SCALE * 8;
+
avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
-   group->sgc->capacity;
+   capacity;
runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
-   group->sgc->capacity;
+   capacity;
 
if (local_group) {
this_runnable_load = runnable_load;

8<

There's still an issue with the active load balancer kicking in after a few
seconds, but I suspect that is related to the use of group capacity elsewhere
in the load balancer code (like update_sg_lb_stats()).


-- 
Matt Fleming
SUSE Performance Team


[PATCH v3] sched/topology: Improve load balancing on AMD EPYC

2019-07-23 Thread Matt Fleming
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in

  commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming 
Cc: "Suthikulpanit, Suravee" 
Cc: Mel Gorman 
Cc: "Lendacky, Thomas" 
Cc: Borislav Petkov 
---
 arch/x86/kernel/cpu/amd.c |  3 +++
 include/linux/topology.h  | 14 ++
 kernel/sched/topology.c   |  3 ++-
 mm/khugepaged.c   |  2 +-
 mm/page_alloc.c   |  2 +-
 5 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 8d4e50428b68..d94bf83d5ee6 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
set_cpu_cap(c, X86_FEATURE_ZEN);
 
+   node_reclaim_distance = 32;
+
/*
 * Fix erratum 1076: CPB feature bit not being set in CPUID.
 * Always set it, except when running under a hypervisor.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 47a3e3c08036..579522ec446c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,20 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 30
 #endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS (1)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f751ce0b783e..f684fde00536 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int
sched_domains_curr_level;
 int                            sched_max_numa_distance;
 static int *sched_domains_numa_distance;
 static struct cpumask  ***sched_domains_numa_masks;
+int __read_mostly  node_reclaim_distance = RECLAIM_DISTANCE;
 #endif
 
 /*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
-   if (sched_domains_numa_distance[tl->numa_level] > 
RECLAIM_DISTANCE) {
+   if (sched_domains_numa_distance[tl->numa_level] > 
node_reclaim_distance) {
sd->flags &= ~(SD_BALANCE_EXEC |
   SD_BALANCE_FORK |
   SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b23215..ccede2425c3f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
for (i = 0; i < MAX_NUMNODES; i++) {
if (!khugepaged_node_load[i])
continue;
-   if (node_distance(nid, i) > RECLAIM_DISTANCE)
+   if (node_distance(nid, i) > node_reclaim_distance)
return true;
}
return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..0d54cd2c43a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3522,7 +3522,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int 
order,
 static bool zone_allows_reclaim(struct 

Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

2019-06-28 Thread Matt Fleming
On Wed, 26 Jun, at 09:18:01PM, Suthikulpanit, Suravee wrote:
> 
> We use 16 to designate 1-hop latency (for different node within the same 
> socket).
> For across-socket access, since the latency is greater, we set the latency to 
> 32
> (twice the latency of 1-hop) not aware of the RECLAIM_DISTANCE at the time.
 
I guess the question is: Is the memory latency of a remote node 1 hop
away 1.6x the local node latency?


Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

2019-06-19 Thread Matt Fleming
On Tue, 18 Jun, at 02:33:18PM, Peter Zijlstra wrote:
> On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
> > This works for me under all my tests. Thoughts?
> > 
> > --->8---
> > 
> > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > index 80a405c2048a..4db4e9e7654b 100644
> > --- a/arch/x86/kernel/cpu/amd.c
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -8,6 +8,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
> >  {
> > set_cpu_cap(c, X86_FEATURE_ZEN);
> >  
> 
> I'm thinking this deserves a comment. Traditionally the SLIT table held
> relative memory latency. So where the identity is 10, 16 would indicate
> 1.6 times local latency and 32 would be 3.2 times local.
> 
> Now, even very early on BIOS monkeys went about their business and put
> in random values in an attempt to 'tune' the system based on how
> $random-os behaved, which is all sorts of fu^Wwrong.
> 
> Now, I suppose my question is; is that 32 Zen puts in an actual relative
> memory latency metric, or a random value we somehow have to deal with.
> And can we pretty please describe the whole sordid story behind this
> 'tunable' somewhere?

This is one for the AMD folks. I don't know if the memory latency
really is 3.2 times or not, only that that's the value in all the Zen
machines I have access to. Even this 2-socket one:

node distances:
node   0   1 
  0:  10  32 
  1:  32  10 

Tom, Suravee?


Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

2019-06-18 Thread Matt Fleming
On Tue, 11 Jun, at 05:22:21PM, Lendacky, Thomas wrote:
> On 6/10/19 4:26 PM, Matt Fleming wrote:
> > On Wed, 05 Jun, at 08:00:35PM, Peter Zijlstra wrote:
> >>
> >> And then we had two magic values :/
> >>
> >> Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
> >> surely, if we want to load-balance aggressively over 30, then so too
> >> should we do node_reclaim() I'm thinking.
> > 
> > Yeah we can fix it just for EPYC, Mel suggested that approach originally.
> > 
> > Suravee, Tom, what's the best way to detect these EPYC machines that need to
> > override RECLAIM_DISTANCE?
> 
> You should be able to do it based on the family. There's an init_amd_zn()
> function in arch/x86/kernel/cpu/amd.c.  You can add something there or,
> since init_amd_zn() sets X86_FEATURE_ZEN, you could check for that if you
> prefer putting it in a different location.

This works for me under all my tests. Thoughts?

--->8---

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 80a405c2048a..4db4e9e7654b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
set_cpu_cap(c, X86_FEATURE_ZEN);
 
+   node_reclaim_distance = 32;
+
/* Fix erratum 1076: CPB feature bit not being set in CPUID. */
if (!cpu_has(c, X86_FEATURE_CPB))
set_cpu_cap(c, X86_FEATURE_CPB);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e1ee4b..74b484354ac9 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -59,6 +59,9 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 30
 #endif
+
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS (1)
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f53f89df837d..20f0f8f6792b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1284,6 +1284,7 @@ static int
sched_domains_curr_level;
 int                            sched_max_numa_distance;
 static int *sched_domains_numa_distance;
 static struct cpumask  ***sched_domains_numa_masks;
+int __read_mostly  node_reclaim_distance = RECLAIM_DISTANCE;
 #endif
 
 /*
@@ -1410,7 +1411,7 @@ sd_init(struct sched_domain_topology_level *tl,
 
sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
-   if (sched_domains_numa_distance[tl->numa_level] > 
RECLAIM_DISTANCE) {
+   if (sched_domains_numa_distance[tl->numa_level] > 
node_reclaim_distance) {
sd->flags &= ~(SD_BALANCE_EXEC |
   SD_BALANCE_FORK |
   SD_WAKE_AFFINE);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a335f7c1fac4..67f5f09d70ed 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
for (i = 0; i < MAX_NUMNODES; i++) {
if (!khugepaged_node_load[i])
continue;
-   if (node_distance(nid, i) > RECLAIM_DISTANCE)
+   if (node_distance(nid, i) > node_reclaim_distance)
return true;
}
return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8abe0af..8ccaaf3a47f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3450,7 +3450,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int 
order,
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
-   RECLAIM_DISTANCE;
+   node_reclaim_distance;
 }
 #else  /* CONFIG_NUMA */
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)


Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

2019-06-10 Thread Matt Fleming
On Wed, 05 Jun, at 08:00:35PM, Peter Zijlstra wrote:
> 
> And then we had two magic values :/
> 
> Should we not 'fix' RECLAIM_DISTANCE for EPYC or something? Because
> surely, if we want to load-balance aggressively over 30, then so too
> should we do node_reclaim() I'm thinking.

Yeah we can fix it just for EPYC, Mel suggested that approach originally.

Suravee, Tom, what's the best way to detect these EPYC machines that need to
override RECLAIM_DISTANCE?


[PATCH] sched/topology: Improve load balancing on AMD EPYC

2019-06-05 Thread Matt Fleming
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in

  commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Update the code in sd_init() to account for modern node distances, and
maintaining backward-compatible behaviour by respecting
RECLAIM_DISTANCE for distances more than 2 hops.

Signed-off-by: Matt Fleming 
Cc: "Suthikulpanit, Suravee" 
Cc: Mel Gorman 
Cc: "Lendacky, Thomas" 
Cc: Borislav Petkov 
---
 kernel/sched/topology.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f53f89df837d..0eea395f7c6b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1410,7 +1410,18 @@ sd_init(struct sched_domain_topology_level *tl,
 
sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
-   if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+
+   /*
+* Strip the following flags for sched domains with a NUMA
+* distance greater than the historical 2-hops value
+* (RECLAIM_DISTANCE) and where tl->numa_level confirms it
+* really is more than 2 hops.
+*
+* Respecting RECLAIM_DISTANCE means we maintain
+* backwards-compatible behaviour.
+*/
+   if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE &&
+   tl->numa_level > 3) {
sd->flags &= ~(SD_BALANCE_EXEC |
   SD_BALANCE_FORK |
   SD_WAKE_AFFINE);
-- 
2.13.7



Re: [PATCH] doc: add boot protocol 2.13 description to Documentation/x86/boot.txt

2019-03-18 Thread Matt Fleming
On Fri, 08 Mar, at 12:43:10PM, Juergen Gross wrote:
> Documentation/x86/boot.txt is missing protocol 2.13 description.
> 
> Signed-off-by: Juergen Gross 
> ---
>  Documentation/x86/boot.txt | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt
> index f4c2a97bfdbd..223e484a1304 100644
> --- a/Documentation/x86/boot.txt
> +++ b/Documentation/x86/boot.txt
> @@ -61,6 +61,10 @@ Protocol 2.12: (Kernel 3.8) Added the xloadflags field and extension fields
>   to struct boot_params for loading bzImage and ramdisk
>   above 4G in 64bit.
>  
> +Protocol 2.13:   (Kernel 3.14) Support 32- and 64-bit flags being set in
> + xloadflags to support booting a 64-bit kernel from 32-bit
> + EFI
> +
>   MEMORY LAYOUT
>  
>  The traditional memory map for the kernel loader, used for Image or
> -- 
> 2.16.4
 
Not sure if this has already been picked up but...

Reviewed-by: Matt Fleming 


Re: [tip:efi/core] x86/efi: Unmap EFI boot services code/data regions from efi_pgd

2019-01-07 Thread Matt Fleming
On Sat, 22 Dec, at 12:07:48PM, Ard Biesheuvel wrote:
> On Fri, 21 Dec 2018 at 20:29, Borislav Petkov  wrote:
> >
> > On Fri, Dec 21, 2018 at 06:26:23PM +0100, Ard Biesheuvel wrote:
> > > On Fri, 21 Dec 2018 at 18:13, Borislav Petkov  wrote:
> > > >
> > > > On Fri, Dec 21, 2018 at 06:02:29PM +0100, Ard Biesheuvel wrote:
> > > > > As far as I can tell, the SGI x86 UV platforms still rely on this, so
> > > > > we're stuck with it for the foreseeable future.
> > > >
> > > > What happened with the old apple laptops which couldn't handle high
> > > > virtual mappings and needed 1:1? We don't care anymore?
> > > >
> > >
> > > If that is the case (I wouldn't know) then yes, there is a second
> > > reason why we need to keep this code.
> >
> > Fleming knows details and he's on CC, lemme "pull" him up into To: :-)
> >
> 
> IIUC the 1:1 mapping and the 'old' mapping are two different things,
> and the new mapping also contains a 1:1 mapping of the boot services
> regions, at least until SetVirtualAddressMap() returns.

Yep, they're different. And yes the 1:1 mapping should stick around
with the new scheme IIRC (it's been a while).


Re: [PATCH] sched/fair: Avoid divide by zero when rebalancing domains

2018-08-17 Thread Matt Fleming
On Thu, 05 Jul, at 05:54:02PM, Valentin Schneider wrote:
> On 05/07/18 14:27, Matt Fleming wrote:
> > On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote:
> >> Hi,
> >>
> >> On 04/07/18 15:24, Matt Fleming wrote:
> >>> It's possible that the CPU doing nohz idle balance hasn't had its own
> >>> load updated for many seconds. This can lead to huge deltas between
> >>> rq->avg_stamp and rq->clock when rebalancing, and has been seen to
> >>> cause the following crash:
> >>>
> >>>  divide error:  [#1] SMP
> >>>  Call Trace:
> >>>   [] update_sd_lb_stats+0xe8/0x560
> 
> My confusion comes from not seeing where that crash happens. Would you mind
> sharing the associated line number? I can feel the "how did I not see this"
> from there but it can't be helped :(

The divide by zero comes from scale_rt_capacity() where 'total' is a
u64 but gets truncated when passed to div_u64() since the divisor
parameter is u32.

Sure, you could use div64_u64() instead, but the real issue is that
the load hasn't been updated for a very long time and that we're
trying to balance the domains with stale data.
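
To make the truncation concrete, here's a minimal standalone sketch (plain
userspace C, illustrative only -- it mimics div_u64()'s u32 divisor, it is
not the kernel code itself):

	#include <stdint.h>
	#include <stdio.h>

	/* div_u64()-style helper: note the divisor is only 32 bits wide */
	static uint64_t div_u64_like(uint64_t dividend, uint32_t divisor)
	{
		return dividend / divisor;
	}

	int main(void)
	{
		/* a stale rq clock can push 'total' beyond 32 bits */
		uint64_t total = 1ULL << 32;

		/* the implicit cast yields 0, so dividing by it would fault */
		printf("truncated divisor: %u\n", (uint32_t)total);
		(void)div_u64_like;	/* calling it with that divisor divides by zero */
		return 0;
	}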


Re: [lkp-robot] [sched/fair] fbd5188493: WARNING:inconsistent_lock_state

2018-07-05 Thread Matt Fleming
On Thu, 05 Jul, at 02:24:58PM, Matt Fleming wrote:
> 
> Hmm.. it still looks to me like we should be saving and restoring IRQs
> since this can be called from IRQ context, no?
> 
> The patch was a forward-port from one of our SLE kernels, and I messed
> up the IRQ flag balancing for the v4.18-rc3 code :-(
 
Something like this?

>8

From 9b152d8dadec04ac631300d86a92552e57e81db5 Mon Sep 17 00:00:00 2001
From: Matt Fleming 
Date: Wed, 4 Jul 2018 14:22:51 +0100
Subject: [PATCH v2] sched/fair: Avoid divide by zero when rebalancing domains

It's possible that the CPU doing nohz idle balance hasn't had its own
load updated for many seconds. This can lead to huge deltas between
rq->avg_stamp and rq->clock when rebalancing, and has been seen to
cause the following crash:

 divide error:  [#1] SMP
 Call Trace:
  [] update_sd_lb_stats+0xe8/0x560
  [] find_busiest_group+0x2d/0x4b0
  [] load_balance+0x170/0x950
  [] rebalance_domains+0x13f/0x290
  [] __do_softirq+0xec/0x300
  [] irq_exit+0xfa/0x110
  [] reschedule_interrupt+0xc9/0xd0

Make sure we update the rq clock and load before balancing.

Cc: Ingo Molnar 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Dietmar Eggemann 
Cc: Valentin Schneider 
Signed-off-by: Matt Fleming 
---
 kernel/sched/fair.c | 11 +++
 1 file changed, 11 insertions(+)

Changes in v2: Balance IRQ flags properly.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f0a0be4d344..150b92c7c9d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9676,6 +9676,7 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 {
int this_cpu = this_rq->cpu;
unsigned int flags;
+   struct rq_flags rf;
 
if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
return false;
@@ -9692,6 +9693,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
if (!(flags & NOHZ_KICK_MASK))
return false;
 
+   /*
+* Ensure this_rq's clock and load are up-to-date before we
+* rebalance since it's possible that they haven't been
+* updated for multiple schedule periods, i.e. many seconds.
+*/
+   rq_lock_irqsave(this_rq, &rf);
+   update_rq_clock(this_rq);
+   cpu_load_update_idle(this_rq);
+   rq_unlock_irqrestore(this_rq, &rf);
+
_nohz_idle_balance(this_rq, flags, idle);
 
return true;
-- 
2.13.6



Re: [PATCH] sched/fair: Avoid divide by zero when rebalancing domains

2018-07-05 Thread Matt Fleming
On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote:
> Hi,
> 
> On 04/07/18 15:24, Matt Fleming wrote:
> > It's possible that the CPU doing nohz idle balance hasn't had its own
> > load updated for many seconds. This can lead to huge deltas between
> > rq->avg_stamp and rq->clock when rebalancing, and has been seen to
> > cause the following crash:
> > 
> >  divide error:  [#1] SMP
> >  Call Trace:
> >   [] update_sd_lb_stats+0xe8/0x560
> >   [] find_busiest_group+0x2d/0x4b0
> >   [] load_balance+0x170/0x950
> >   [] rebalance_domains+0x13f/0x290
> >   [] __do_softirq+0xec/0x300
> >   [] irq_exit+0xfa/0x110
> >   [] reschedule_interrupt+0xc9/0xd0
> > 
> 
> Do you have some sort of reproducer for that crash? If not I guess I can cook
> something up with a quiet userspace & rt-app, though I've never seen that one
> on arm64.
 
Unfortunately no, I don't have a reproduction case. Would love to have
one to verify the patch though.

> > Make sure we update the rq clock and load before balancing.
> > 
> > Cc: Ingo Molnar 
> > Cc: Mike Galbraith 
> > Cc: Peter Zijlstra 
> > Signed-off-by: Matt Fleming 
> > ---
> >  kernel/sched/fair.c | 10 ++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2f0a0be4d344..2c81662c858a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> >  */
> > smp_mb();
> >  
> > +   /*
> > +* Ensure this_rq's clock and load are up-to-date before we
> > +* rebalance since it's possible that they haven't been
> > +* updated for multiple schedule periods, i.e. many seconds.
> > +*/
> > +   raw_spin_lock_irq(&this_rq->lock);
> > +   update_rq_clock(this_rq);
> > +   cpu_load_update_idle(this_rq);
> > +   raw_spin_unlock_irq(&this_rq->lock);
> > +
> 
> I'm failing to understand why the updates further down below are seemingly
> not enough. After we've potentially done 
> 
> update_rq_clock(rq);
> cpu_load_update_idle(rq);
> 
> for all nohz cpus != this_cpu, we still end up doing:
> 
> if (idle != CPU_NEWLY_IDLE) {
>   update_blocked_averages(this_cpu);
>   has_blocked_load |= this_rq->has_blocked_load;
> }
> 
> which should properly update this_rq's clock and load before we attempt to do
> any balancing on it.
 
But cpu_load_update_idle() and update_blocked_averages() are not the same
thing.


Re: [lkp-robot] [sched/fair] fbd5188493: WARNING:inconsistent_lock_state

2018-07-05 Thread Matt Fleming
On Thu, 05 Jul, at 11:52:21AM, Dietmar Eggemann wrote:
> 
> Moving the code from _nohz_idle_balance to nohz_idle_balance let it disappear:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02be51c9dcc1..070924f07c68 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9596,16 +9596,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>  */
> smp_mb();
>  
> -   /*
> -* Ensure this_rq's clock and load are up-to-date before we
> -* rebalance since it's possible that they haven't been
> -* updated for multiple schedule periods, i.e. many seconds.
> -*/
> -   raw_spin_lock_irq(&this_rq->lock);
> -   update_rq_clock(this_rq);
> -   cpu_load_update_idle(this_rq);
> -   raw_spin_unlock_irq(&this_rq->lock);
> -
> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> continue;
> @@ -9701,6 +9691,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> if (!(flags & NOHZ_KICK_MASK))
> return false;
>  
> +   /*
> +* Ensure this_rq's clock and load are up-to-date before we
> +* rebalance since it's possible that they haven't been
> +* updated for multiple schedule periods, i.e. many seconds.
> +*/
> +   raw_spin_lock_irq(&this_rq->lock);
> +   update_rq_clock(this_rq);
> +   cpu_load_update_idle(this_rq);
> +   raw_spin_unlock_irq(&this_rq->lock);
> +
> _nohz_idle_balance(this_rq, flags, idle);
>  
> return true;
> 

Hmm.. it still looks to me like we should be saving and restoring IRQs
since this can be called from IRQ context, no?

The patch was a forward-port from one of our SLE kernels, and I messed
up the IRQ flag balancing for the v4.18-rc3 code :-(



[PATCH] sched/fair: Avoid divide by zero when rebalancing domains

2018-07-04 Thread Matt Fleming
It's possible that the CPU doing nohz idle balance hasn't had its own
load updated for many seconds. This can lead to huge deltas between
rq->avg_stamp and rq->clock when rebalancing, and has been seen to
cause the following crash:

 divide error:  [#1] SMP
 Call Trace:
  [] update_sd_lb_stats+0xe8/0x560
  [] find_busiest_group+0x2d/0x4b0
  [] load_balance+0x170/0x950
  [] rebalance_domains+0x13f/0x290
  [] __do_softirq+0xec/0x300
  [] irq_exit+0xfa/0x110
  [] reschedule_interrupt+0xc9/0xd0

Make sure we update the rq clock and load before balancing.

Cc: Ingo Molnar 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Signed-off-by: Matt Fleming 
---
 kernel/sched/fair.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f0a0be4d344..2c81662c858a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
 */
smp_mb();
 
+   /*
+* Ensure this_rq's clock and load are up-to-date before we
+* rebalance since it's possible that they haven't been
+* updated for multiple schedule periods, i.e. many seconds.
+*/
+   raw_spin_lock_irq(&this_rq->lock);
+   update_rq_clock(this_rq);
+   cpu_load_update_idle(this_rq);
+   raw_spin_unlock_irq(&this_rq->lock);
+
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
continue;
-- 
2.13.6



Re: [RFC 00/11] select_idle_sibling rework

2018-06-19 Thread Matt Fleming
On Wed, 30 May, at 04:22:36PM, Peter Zijlstra wrote:
> Hi all,
> 
> This is all still very preliminary and could all still go up in flames (it has
> only seen hackbench so far). This is mostly the same code I posted yesterday,
> but hopefully in a more readable form.
> 
> This fixes the SIS_PROP as per the outline here:
> 
>   
> https://lkml.kernel.org/r/20180425153600.ga4...@hirez.programming.kicks-ass.net
> 
> and Rohit's suggestion of folding the iteration loops.
> 
> For testing I would suggest to ignore the last 3 patches, those are purely
> cleanups once the first lot is found to actually work as advertised.

This series looks pretty good from my testing. I see double-digit
improvements to hackbench results and only one case of a clear
regression (easily offset by all the wins).

Are you aware of any regressions for particular benchmarks I should
take a look at?


Re: cpu stopper threads and load balancing leads to deadlock

2018-04-24 Thread Matt Fleming
On Fri, 20 Apr, at 11:50:05AM, Peter Zijlstra wrote:
> On Tue, Apr 17, 2018 at 03:21:19PM +0100, Matt Fleming wrote:
> > Hi guys,
> > 
> > We've seen a bug in one of our SLE kernels where the cpu stopper
> > thread ("migration/15") is entering idle balance. This then triggers
> > active load balance.
> > 
> > At the same time, a task on another CPU triggers a page fault and NUMA
> > balancing kicks in to try and migrate the task closer to the NUMA node
> > for that page (we're inside stop_two_cpus()). This faulting task is
> > spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
> > !VAL)), waiting for "migration/15" to context switch.
> > 
> > Unfortunately, because "migration/15" is doing active load balance
> > it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
> > which is already held (since it's inside stop_two_cpus()).
> > 
> > Deadlock ensues.
> 
> 
> So if I read that right, something like the following happens:
> 
> CPU0                                          CPU1
> 
> schedule(.prev=migrate/0)
>   pick_next_task...
>     idle_balance                              migrate_swap()
>       active_balance                            stop_two_cpus()
>                                                   spin_lock(stopper0->lock)
>                                                   spin_lock(stopper1->lock)
>                                                   ttwu(migrate/0)
>                                                     smp_cond_load_acquire() -- waits for schedule()
>       stop_one_cpu(1)
>         spin_lock(stopper1->lock) -- waits for stopper lock

Yep, that's exactly right.

> Fix _this_ deadlock by taking out the wakeups from under stopper->lock.
> I'm not entirely sure there isn't more dragons here, but this particular
> one seems fixable by doing that.
> 
> Is there any way you can reproduce/test this?

I'm afraid I don't have any way to test this, but I can ask the
customer that reported it if they can.

Either way, this fix looks good to me.


cpu stopper threads and load balancing leads to deadlock

2018-04-17 Thread Matt Fleming
Hi guys,

We've seen a bug in one of our SLE kernels where the cpu stopper
thread ("migration/15") is entering idle balance. This then triggers
active load balance.

At the same time, a task on another CPU triggers a page fault and NUMA
balancing kicks in to try and migrate the task closer to the NUMA node
for that page (we're inside stop_two_cpus()). This faulting task is
spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
!VAL)), waiting for "migration/15" to context switch.

Unfortunately, because "migration/15" is doing active load balance
it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
which is already held (since it's inside stop_two_cpus()).

Deadlock ensues.

This seems like a situation that should be prohibited, but I cannot
find any code to prevent it. Is it OK for stopper threads to load
balance? Is there something that should prevent this situation from
happening?


Re: [PATCH] sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning

2018-04-03 Thread Matt Fleming
On Mon, 02 Apr, at 09:49:54AM, Davidlohr Bueso wrote:
> 
> We can get rid of it be the "traditional" means of adding an 
> update_rq_clock() call
> after acquiring the rq->lock in do_sched_rt_period_timer().
> 
> The case for the rt task throttling (which this workload also hits) can be 
> ignored in
> that the skip_update call is actually bogus and quite the contrary (the 
> request bits
> are removed/reverted). By setting RQCF_UPDATED we really don't care if the 
> skip is
> happening or not and will therefore make the assert_clock_updated() check 
> happy.
> 
> Signed-off-by: Davidlohr Bueso <dbu...@suse.de>
> ---
>  kernel/sched/rt.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 86b77987435e..ad13e6242481 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -839,6 +839,8 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
>   continue;
>  
>   raw_spin_lock(&rq->lock);
> + update_rq_clock(rq);
> +
>   if (rt_rq->rt_time) {
>   u64 runtime;

Looks good to me.

Reviewed-by: Matt Fleming <m...@codeblueprint.co.uk>


Re: [PATCH RFC v2] sched: Minimize the idle cpu selection race window.

2018-02-07 Thread Matt Fleming
On Tue, 05 Dec, at 01:09:07PM, Atish Patra wrote:
> Currently, multiple tasks can wakeup on same cpu from
> select_idle_sibiling() path in case they wakeup simulatenously
> and last ran on the same llc. This happens because an idle cpu
> is not updated until idle task is scheduled out. Any task waking
> during that period may potentially select that cpu for a wakeup
> candidate.
> 
> Introduce a per cpu variable that is set as soon as a cpu is
> selected for wakeup for any task. This prevents from other tasks
> to select the same cpu again. Note: This does not close the race
> window but minimizes it to accessing the per-cpu variable. If two
> wakee tasks access the per cpu variable at the same time, they may
> select the same cpu again. But it minimizes the race window
> considerably.
> 
> Here are some performance numbers:

I ran this patch through some tests here on the SUSE performance grid
and there's a definite regression for Mike's personal favourite
benchmark, tbench.

Here are the results: vanilla 4.15-rc9 on the left, -rc9 plus this
patch on the right.

tbench4
                                        4.15.0-rc9                      4.15.0-rc9
                                           vanilla  sched-minimize-idle-cpu-window
Min   mb/sec-1        484.50 (   0.00%)      463.03 (  -4.43%)
Min   mb/sec-2        961.43 (   0.00%)      959.35 (  -0.22%)
Min   mb/sec-4       1789.60 (   0.00%)     1760.21 (  -1.64%)
Min   mb/sec-8       3518.51 (   0.00%)     3471.47 (  -1.34%)
Min   mb/sec-16      5521.12 (   0.00%)     5409.77 (  -2.02%)
Min   mb/sec-32      7268.61 (   0.00%)     7491.29 (   3.06%)
Min   mb/sec-64     14413.45 (   0.00%)    14347.69 (  -0.46%)
Min   mb/sec-128    13501.84 (   0.00%)    13413.82 (  -0.65%)
Min   mb/sec-192    13237.02 (   0.00%)    13231.43 (  -0.04%)
Hmean mb/sec-1        505.20 (   0.00%)      485.81 (  -3.84%)
Hmean mb/sec-2        973.12 (   0.00%)      970.67 (  -0.25%)
Hmean mb/sec-4       1835.22 (   0.00%)     1788.54 (  -2.54%)
Hmean mb/sec-8       3529.35 (   0.00%)     3487.20 (  -1.19%)
Hmean mb/sec-16      5531.16 (   0.00%)     5437.43 (  -1.69%)
Hmean mb/sec-32      7627.96 (   0.00%)     8021.26 (   5.16%)
Hmean mb/sec-64     14441.20 (   0.00%)    14395.08 (  -0.32%)
Hmean mb/sec-128    13620.40 (   0.00%)    13569.17 (  -0.38%)
Hmean mb/sec-192    13265.26 (   0.00%)    13263.98 (  -0.01%)
Max   mb/sec-1        510.30 (   0.00%)      489.89 (  -4.00%)
Max   mb/sec-2        989.45 (   0.00%)      976.10 (  -1.35%)
Max   mb/sec-4       1845.65 (   0.00%)     1795.50 (  -2.72%)
Max   mb/sec-8       3574.03 (   0.00%)     3547.56 (  -0.74%)
Max   mb/sec-16      5556.99 (   0.00%)     5564.80 (   0.14%)
Max   mb/sec-32      7678.18 (   0.00%)     8098.63 (   5.48%)
Max   mb/sec-64     14463.07 (   0.00%)    14437.58 (  -0.18%)
Max   mb/sec-128    13659.67 (   0.00%)    13602.65 (  -0.42%)
Max   mb/sec-192    13612.01 (   0.00%)    13832.98 (   1.62%)

There's a nice little performance bump around the 32-client mark.
Incidentally, my test machine has 2 NUMA nodes with 24 cpus (12 cores,
2 threads) each. So 32 clients is the point at which things no longer
fit on a single node.

It doesn't look like the regression is caused by the schedule() path
being slightly longer (i.e. it's not a latency issue) because schbench
results show improvements for the low-end:

schbench
                                        4.15.0-rc9                      4.15.0-rc9
                                           vanilla  sched-minimize-idle-cpu-window
Lat 50.00th-qrtle-1        46.00 (   0.00%)       36.00 (  21.74%)
Lat 75.00th-qrtle-1        49.00 (   0.00%)       37.00 (  24.49%)
Lat 90.00th-qrtle-1        52.00 (   0.00%)       38.00 (  26.92%)
Lat 95.00th-qrtle-1        56.00 (   0.00%)       41.00 (  26.79%)
Lat 99.00th-qrtle-1        61.00 (   0.00%)       46.00 (  24.59%)
Lat 99.50th-qrtle-1        63.00 (   0.00%)       48.00 (  23.81%)
Lat 99.90th-qrtle-1        77.00 (   0.00%)       64.00 (  16.88%)
Lat 50.00th-qrtle-2        41.00 (   0.00%)       41.00 (   0.00%)
Lat 75.00th-qrtle-2        47.00 (   0.00%)       46.00 (   2.13%)
Lat 90.00th-qrtle-2        50.00 (   0.00%)       49.00 (   2.00%)
Lat 95.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
Lat 99.00th-qrtle-2        58.00 (   0.00%)       57.00 (   1.72%)
Lat 99.50th-qrtle-2        60.00 (   0.00%)       59.00 (   1.67%)
Lat 99.90th-qrtle-2        72.00 (   0.00%)       69.00 (   4.17%)
Lat 50.00th-qrtle-4        46.00 (   0.00%)       45.00 (   2.17%)
Lat 75.00th-qrtle-4        49.00 (   0.00%)       48.00 (   2.04%)
Lat 90.00th-qrtle-4        52.00 (   0.00%)       51.00 (   1.92%)
Lat 95.00th-qrtle-4        55.00 (   0.00%)       53.00 (   3.64%)
Lat 99.00th-qrtle-4        61.00 (   0.00%)       59.00 (   3.28%)
Lat 99.50th-qrtle-4        63.00 (   0.00%)       61.00 (   3.17%)
Lat 99.90th-qrtle-4        69.00 (   0.00%)       74.00 


Re: [PATCH V4 0/3] Use mm_struct and switch_mm() instead of manually

2018-01-26 Thread Matt Fleming
On Thu, 18 Jan, at 01:01:04PM, Sai Praneeth Prakhya wrote:
> From: Sai Praneeth <sai.praneeth.prak...@intel.com>
> 
> Presently, in x86, to invoke any efi function like
> efi_set_virtual_address_map() or any efi_runtime_service() the code path
> typically involves read_cr3() (save previous pgd), write_cr3()
> (write efi_pgd) and calling efi function. Likewise after returning from
> efi function the code path typically involves read_cr3() (save efi_pgd),
> write_cr3() (write previous pgd). We do this couple of times in efi
> subsystem of Linux kernel, instead we can use helper function
> efi_switch_mm() to do this. This improves readability and maintainability.
> Also, instead of maintaining a separate struct "efi_scratch" to store/restore
> efi_pgd, we can use mm_struct to do this.
 
FWIW this series looks OK to me.

Reviewed-by: Matt Fleming <m...@codeblueprint.co.uk>
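
The helper being proposed boils down to something along these lines -- a
rough sketch of the idea only, not necessarily the exact code that gets
merged:

	/* Switch to 'mm' around an EFI call, remembering the outgoing mm. */
	void efi_switch_mm(struct mm_struct *mm)
	{
		efi_scratch.prev_mm = current->active_mm;
		current->active_mm  = mm;
		switch_mm(efi_scratch.prev_mm, mm, NULL);
	}

i.e. the explicit read_cr3()/write_cr3() pgd juggling is replaced by an
ordinary mm switch.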


[tip:efi/core] MAINTAINERS: Remove Matt Fleming as EFI co-maintainer

2018-01-03 Thread tip-bot for Matt Fleming
Commit-ID:  81b60dbff04980a45b348c5b5eeca2713d4594ca
Gitweb: https://git.kernel.org/tip/81b60dbff04980a45b348c5b5eeca2713d4594ca
Author: Matt Fleming <m...@codeblueprint.co.uk>
AuthorDate: Wed, 3 Jan 2018 09:44:17 +
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 3 Jan 2018 14:03:18 +0100

MAINTAINERS: Remove Matt Fleming as EFI co-maintainer

Instate Ard Biesheuvel as the sole EFI maintainer and leave other folks
as maintainers for the EFI test driver and efivarfs file system.

Also add Ard Biesheuvel as the EFI test driver and efivarfs maintainer.

Signed-off-by: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ivan Hu <ivan...@canonical.com>
Cc: Jeremy Kerr <j...@ozlabs.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Matthew Garrett <mj...@srcf.ucam.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/20180103094417.6353-1-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 MAINTAINERS | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index b46c9ce..95c3fa1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5149,15 +5149,15 @@ F:  sound/usb/misc/ua101.c
 EFI TEST DRIVER
 L: linux-...@vger.kernel.org
 M: Ivan Hu <ivan...@canonical.com>
-M: Matt Fleming <m...@codeblueprint.co.uk>
+M: Ard Biesheuvel <ard.biesheu...@linaro.org>
 S: Maintained
 F: drivers/firmware/efi/test/
 
 EFI VARIABLE FILESYSTEM
 M: Matthew Garrett <matthew.garr...@nebula.com>
 M: Jeremy Kerr <j...@ozlabs.org>
-M: Matt Fleming <m...@codeblueprint.co.uk>
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git
+M: Ard Biesheuvel <ard.biesheu...@linaro.org>
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git
 L: linux-...@vger.kernel.org
 S: Maintained
 F: fs/efivarfs/
@@ -5318,7 +5318,6 @@ S:    Supported
 F: security/integrity/evm/
 
 EXTENSIBLE FIRMWARE INTERFACE (EFI)
-M: Matt Fleming <m...@codeblueprint.co.uk>
 M: Ard Biesheuvel <ard.biesheu...@linaro.org>
 L: linux-...@vger.kernel.org
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git


Re: [PATCH] MAINTAINERS: Remove Matt Fleming as EFI co-maintainer

2018-01-03 Thread Matt Fleming
On Wed, 03 Jan, at 10:13:55AM, Ard Biesheuvel wrote:
> On 3 January 2018 at 09:44, Matt Fleming <m...@codeblueprint.co.uk> wrote:
> > Instate Ard Biesheuvel as the sole EFI maintainer and leave other folks
> > as maintainers for the EFI test driver and efivarfs file system.
> >
> > Signed-off-by: Matt Fleming <m...@codeblueprint.co.uk>
> 
> Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> 
> Thanks Matt
> 
> > ---
> >  MAINTAINERS | 5 +
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> >
> > Ard, if you want to add yourself as co-maintainer of the EFI test
> > driver or efivarfs, please go ahead. I'm sure no one would object.
> >
> 
> I guess that makes sense. Should we just fold that in?

Sounds good to me.


[PATCH] MAINTAINERS: Remove Matt Fleming as EFI co-maintainer

2018-01-03 Thread Matt Fleming
Instate Ard Biesheuvel as the sole EFI maintainer and leave other folks
as maintainers for the EFI test driver and efivarfs file system.

Signed-off-by: Matt Fleming <m...@codeblueprint.co.uk>
---
 MAINTAINERS | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

Ard, if you want to add yourself as co-maintainer of the EFI test
driver or efivarfs, please go ahead. I'm sure no one would object.

diff --git a/MAINTAINERS b/MAINTAINERS
index d4fdcb12616c..9a41aa072e6a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5150,15 +5150,13 @@ F:  sound/usb/misc/ua101.c
 EFI TEST DRIVER
 L: linux-...@vger.kernel.org
 M: Ivan Hu <ivan...@canonical.com>
-M: Matt Fleming <m...@codeblueprint.co.uk>
 S: Maintained
 F: drivers/firmware/efi/test/
 
 EFI VARIABLE FILESYSTEM
 M: Matthew Garrett <matthew.garr...@nebula.com>
 M: Jeremy Kerr <j...@ozlabs.org>
-M: Matt Fleming <m...@codeblueprint.co.uk>
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git
 L: linux-...@vger.kernel.org
 S: Maintained
 F: fs/efivarfs/
@@ -5319,7 +5317,6 @@ S:Supported
 F: security/integrity/evm/
 
 EXTENSIBLE FIRMWARE INTERFACE (EFI)
-M: Matt Fleming <m...@codeblueprint.co.uk>
 M: Ard Biesheuvel <ard.biesheu...@linaro.org>
 L: linux-...@vger.kernel.org
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git
-- 
2.14.2



Re: [PATCH V2] x86/efi: fix kernel param add_efi_memmap regression

2017-12-18 Thread Matt Fleming
On Sat, 16 Dec, at 12:19:53PM, Dave Young wrote:
> 'add_efi_memmap' is an early param, but do_add_efi_memmap() has no
> chance to run because the code path is before parse_early_param().
> I believe it worked when the param was introduced but probably later
> some other changes caused the wrong order and nobody noticed it.
> 
> Move efi_memblock_x86_reserve_range() after parse_early_param()
> to fix it.
> 
> Signed-off-by: Dave Young 
> ---
>  arch/x86/kernel/setup.c |5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

Thanks Dave, applied to 'urgent'.


Re: [PATCH] x86: move parse_early_param to earlier code for add_efi_memmap

2017-12-18 Thread Matt Fleming
On Sat, 16 Dec, at 03:06:32PM, Ingo Molnar wrote:
> 
> * Matt Fleming <m...@codeblueprint.co.uk> wrote:
> 
> > >   x86_init.oem.arch_setup();
> > > @@ -962,6 +959,8 @@ void __init setup_arch(char **cmdline_p)
> > >  
> > >   parse_early_param();
> > >  
> > > + if (efi_enabled(EFI_BOOT))
> > > + efi_memblock_x86_reserve_range();
> > >  #ifdef CONFIG_MEMORY_HOTPLUG
> > >   /*
> > >* Memory used by the kernel cannot be hot-removed because Linux
> > > 
> > 
> > I prefer this version. Please re-send a full patch and update the
> > subject line to include the "fix" somewhere; it wasn't obvious to me
> > from the start that this is a bug fix.
> 
> Agreed.
> 
> I dropped the commit I just applied to tip:efi/core, since you are handling 
> this, 
> so this patch should come through the regular EFI channels, right?

Yep, that's right. Me or Ard will take care of it.


Re: [PATCH] efi: make EFI a menuconfig to ease disabling it all

2017-12-15 Thread Matt Fleming
On Sat, 09 Dec, at 04:52:52PM, Vincent Legoll wrote:
> No need to get into the submenu to disable all related
> config entries.
> 
> This makes it easier to disable all EFI config options
> without entering the submenu. It will also enable one
> to see that en/dis-abled state from the outside menu.
> 
> This is only intended to change menuconfig UI, not change
> the config dependencies.
> 
> Signed-off-by: Vincent Legoll 
> ---
>  drivers/firmware/efi/Kconfig | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)

This looks fine to me. Ard?


Re: [PATCH] x86: move parse_early_param to earlier code for add_efi_memmap

2017-12-15 Thread Matt Fleming
On Thu, 14 Dec, at 06:41:19PM, Dave Young wrote:
> On 11/30/17 at 01:23pm, Dave Young wrote:
> > 'add_efi_memmap' is an early param, but do_add_efi_memmap() has no
> > chance to run because the code path is before parse_early_param().
> > I believe it worked when the param was introduced but probably later
> > some other changes caused the wrong order and nobody noticed it.
> > 
> > Move parse_early_param before efi_memblock_x86_reserve_range to fix
> > this.
> > 
> > Signed-off-by: Dave Young 
> > ---
> >  arch/x86/kernel/setup.c |   55 
> > 
> >  1 file changed, 28 insertions(+), 27 deletions(-)
> > 
> > --- linux-x86.orig/arch/x86/kernel/setup.c
> > +++ linux-x86/arch/x86/kernel/setup.c
> > @@ -897,6 +897,34 @@ void __init setup_arch(char **cmdline_p)
> > rd_prompt = ((boot_params.hdr.ram_size & RAMDISK_PROMPT_FLAG) != 0);
> > rd_doload = ((boot_params.hdr.ram_size & RAMDISK_LOAD_FLAG) != 0);
> >  #endif
> > +
> > +#ifdef CONFIG_CMDLINE_BOOL
> > +#ifdef CONFIG_CMDLINE_OVERRIDE
> > +   strlcpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
> > +#else
> > +   if (builtin_cmdline[0]) {
> > +   /* append boot loader cmdline to builtin */
> > +   strlcat(builtin_cmdline, " ", COMMAND_LINE_SIZE);
> > +   strlcat(builtin_cmdline, boot_command_line, COMMAND_LINE_SIZE);
> > +   strlcpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
> > +   }
> > +#endif
> > +#endif
> > +
> > +   strlcpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
> > +   *cmdline_p = command_line;
> > +
> > +   /*
> > +* x86_configure_nx() is called before parse_early_param() to detect
> > +* whether hardware doesn't support NX (so that the early EHCI debug
> > +* console setup can safely call set_fixmap()). It may then be called
> > +* again from within noexec_setup() during parsing early parameters
> > +* to honor the respective command line option.
> > +*/
> > +   x86_configure_nx();
> > +
> > +   parse_early_param();
> > +
> >  #ifdef CONFIG_EFI
> > if (!strncmp((char *)&boot_params.efi_info.efi_loader_signature,
> >  EFI32_LOADER_SIGNATURE, 4)) {
> > @@ -935,33 +963,6 @@ void __init setup_arch(char **cmdline_p)
> > bss_resource.start = __pa_symbol(__bss_start);
> > bss_resource.end = __pa_symbol(__bss_stop)-1;
> >  
> > -#ifdef CONFIG_CMDLINE_BOOL
> > -#ifdef CONFIG_CMDLINE_OVERRIDE
> > -   strlcpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
> > -#else
> > -   if (builtin_cmdline[0]) {
> > -   /* append boot loader cmdline to builtin */
> > -   strlcat(builtin_cmdline, " ", COMMAND_LINE_SIZE);
> > -   strlcat(builtin_cmdline, boot_command_line, COMMAND_LINE_SIZE);
> > -   strlcpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
> > -   }
> > -#endif
> > -#endif
> > -
> > -   strlcpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
> > -   *cmdline_p = command_line;
> > -
> > -   /*
> > -* x86_configure_nx() is called before parse_early_param() to detect
> > -* whether hardware doesn't support NX (so that the early EHCI debug
> > -* console setup can safely call set_fixmap()). It may then be called
> > -* again from within noexec_setup() during parsing early parameters
> > -* to honor the respective command line option.
> > -*/
> > -   x86_configure_nx();
> > -
> > -   parse_early_param();
> > -
> >  #ifdef CONFIG_MEMORY_HOTPLUG
> > /*
> >  * Memory used by the kernel cannot be hot-removed because Linux
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-efi" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> Ping for review..
> 
> Another way is move "efi_memblock_x86_reserve_range" to later code
> Maybe below is better?
> 
> ---
>  arch/x86/kernel/setup.c |5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> --- linux-x86.orig/arch/x86/kernel/setup.c
> +++ linux-x86/arch/x86/kernel/setup.c
> @@ -906,9 +906,6 @@ void __init setup_arch(char **cmdline_p)
>   set_bit(EFI_BOOT, &efi.flags);
>   set_bit(EFI_64BIT, &efi.flags);
>   }
> -
> - if (efi_enabled(EFI_BOOT))
> - efi_memblock_x86_reserve_range();
>  #endif
>  
>   x86_init.oem.arch_setup();
> @@ -962,6 +959,8 @@ void __init setup_arch(char **cmdline_p)
>  
>   parse_early_param();
>  
> + if (efi_enabled(EFI_BOOT))
> + efi_memblock_x86_reserve_range();
>  #ifdef CONFIG_MEMORY_HOTPLUG
>   /*
>* Memory used by the kernel cannot be hot-removed because Linux
> 

I prefer this version. Please re-send a full patch and update the
subject line to include the "fix" somewhere; it wasn't obvious to me
from the start that this is a bug fix.
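
For reference, the mechanism involved, sketched from memory rather than
copied from arch/x86/platform/efi/efi.c: an early_param() handler only
runs once parse_early_param() is called, so any code that consumes the
flag before that point (efi_memblock_x86_reserve_range() here) still
sees the default value.

    static bool add_efi_memmap;

    static int __init setup_add_efi_memmap(char *arg)
    {
            add_efi_memmap = true;
            return 0;
    }
    early_param("add_efi_memmap", setup_add_efi_memmap);

Which is why either parse_early_param() has to move up, or the consumer
has to move down, as in the two variants above.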


Re: [PATCH] efi: Use PTR_ERR_OR_ZERO()

2017-12-15 Thread Matt Fleming
On Tue, 28 Nov, at 10:39:37PM, Vasyl Gomonovych wrote:
> Fix ptr_ret.cocci warnings:
> drivers/firmware/efi/efi.c:610:8-14: WARNING: PTR_ERR_OR_ZERO can be used
> 
> Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
> 
> Generated by: scripts/coccinelle/api/ptr_ret.cocci
> 
> Signed-off-by: Vasyl Gomonovych 
> ---
>  drivers/firmware/efi/efi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks. Applied to the 'next' branch.
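
For anyone skimming the archive, the transformation boils down to the
following, illustrated on a made-up pointer 'p' rather than the actual
efi.c hunk:

    /* before */
    if (IS_ERR(p))
            return PTR_ERR(p);
    return 0;

    /* after */
    return PTR_ERR_OR_ZERO(p);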


Re: [GIT PULL] hash addresses printed with %p

2017-12-02 Thread Matt Fleming
(Cc'ing Dave since this is used for kexec on EFI)

On Fri, 01 Dec, at 09:54:43AM, Ard Biesheuvel wrote:
> On 1 December 2017 at 09:48, Greg Kroah-Hartman
>  wrote:
> > On Thu, Nov 30, 2017 at 05:18:42PM +, Ard Biesheuvel wrote:
> >> On 30 November 2017 at 17:10, Greg Kroah-Hartman
> >>  wrote:
> >> > On Thu, Nov 30, 2017 at 04:32:35PM +, Greg Kroah-Hartman wrote:
> >> >> On Wed, Nov 29, 2017 at 01:36:25PM -0800, Linus Torvalds wrote:
> >> >> > On Wed, Nov 29, 2017 at 1:14 PM, Linus Torvalds
> >> >> >  wrote:
> >> >> > >
> >> >> > > Not because %pK itself changed, but because the semantics of %p did.
> >> >> > > The baseline moved, and the "safe" version did not.
> >> >> >
> >> >> > Btw, that baseline for me is now that I can do
> >> >> >
> >> >> >   ./scripts/leaking_addresses.pl | wc -l
> >> >> >   18
> >> >> >
> >> >> > and of those 18 hits, six are false positives (looks like bitmaps in
> >> >> > the uevent keys).
> >> >> >
> >> >> > The remaining 12 are from the EFI runtime map files
> >> >> > (/sys/firmware/efi/runtime-map/*). They should presumably not be
> >> >> > world-readable, but sadly the kset_create_and_add() helper seems to do
> >> >> > that by default.
> >> >> >
> >> >> > I think the sysfs code makes it insanely too easy to make things
> >> >> > world-readable. You try to be careful, and mark things read-only etc,
> >> >> > but __ATTR_RO() just means S_IRUGO, which means world-readable.
> >> >> >
> >> >> > There seems to be no convenient model for kobjects having better
> >> >> > permissions. Greg?
> >> >>
> >> >> They can just use __ATTR() which lets you set the exact mode settings
> >> >> that are wanted.
> >> >>
> >> >> Something like the patch below, which breaks the build as the
> >> >> map_attributes are "odd", but you get the idea.  The EFI developers can
> >> >> fix this up properly :)
> >> >>
> >> >> Note, this only accounts for 5 attributes, what is the whole list?
> >> >
> >> > Ah, it's the virt_addr file 12 times, I just ran it on my laptop:
> >> >
> >> > /sys/firmware/efi/runtime-map/7/virt_addr: 0xfffeea6ea000
> >> > /sys/firmware/efi/runtime-map/5/virt_addr: 0xfffeee88b000
> >> > /sys/firmware/efi/runtime-map/3/virt_addr: 0xfffefea0
> >> > /sys/firmware/efi/runtime-map/11/virt_addr: 0xfffed9c0
> >> > /sys/firmware/efi/runtime-map/1/virt_addr: 0xfffefee0
> >> > /sys/firmware/efi/runtime-map/8/virt_addr: 0xfffedba4e000
> >> > /sys/firmware/efi/runtime-map/6/virt_addr: 0xfffeee2de000
> >> > /sys/firmware/efi/runtime-map/4/virt_addr: 0xfffeeea0
> >> > /sys/firmware/efi/runtime-map/2/virt_addr: 0xfffefec0
> >> > /sys/firmware/efi/runtime-map/10/virt_addr: 0xfffed9c6
> >> > /sys/firmware/efi/runtime-map/0/virt_addr: 0xfffeff00
> >> > /sys/firmware/efi/runtime-map/9/virt_addr: 0xfffedb9c9000
> >> >
> >> > So changing it to use __ATTR() should fix this remaining leakage up.
> >> > That is if we even really need to export these values at all.  What does
> >> > userspace do with them?  Shouldn't they just be in debugfs instead?
> >> >
> >>
> >> These are the virtual mappings UEFI firmware regions, which must
> >> remain in the same place across kexec reboots. So kexec tooling
> >> consumes this information and passes it on to the incoming kernel in
> >> some way.
> >>
> >> Note that these are not kernel addresses, so while I agree they should
> >> not be world readable, they won't give you any clue as to where the
> >> kernel itself is mapped.
> >>
> >> So the recommendation is to switch to __ATTR( ... 0400 ... ) instead?
> >> If so, I'll code up a patch.
> >
> > If these pointers are not "real", I recommend just leaving them as-is.
> 
> That's not what I said :-)
> 
> These are real pointers, and stuff will actually be mapped there
> (although I am not intimately familiar with the way x86 does this, but
> on ARM [which does not have these sysfs nodes in the first place],
> these mappings are only live during the time a UEFI runtime service
> call is in progress, and IIRC, work was underway to do the same for
> x86). So while these values don't correlate with the placement of
> kernel data structures, they could still be useful for an attacker to
> figure out where exploitable firmware memory regions are located,
> especially given that some of these may be mapped RWX.

These are mappings of the EFI firmware's runtime regions, dynamically
allocated by the kernel starting at EFI_VA_START. Because we only get
one chance to tell the firmware where we placed its regions (via
SetVirtualAddressMap()) we have to guarantee that any subsequent kexec
reboots use the same addresses.

So that's why they're exported to userspace through sysfs.

Like Ard said, these mappings are not mapped into the regular process
address space. Instead, they're only used while making EFI runtime
calls.

But this is still an information leak. And I think __ATTR(..0400) is
the right way to fix it. Dave, could you double-check my logic here?
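
For illustration, a minimal sketch of the proposed tightening using a
plain kobject attribute (the real runtime-map code wraps its attributes
in its own struct, so the eventual fix will not look exactly like
this):

    static ssize_t virt_addr_show(struct kobject *kobj,
                                  struct kobj_attribute *attr, char *buf)
    {
            /* placeholder; the real code prints the map entry's address */
            return sprintf(buf, "0x%llx\n", 0ULL);
    }

    /* __ATTR() with an explicit 0400 mode instead of __ATTR_RO()'s 0444 */
    static struct kobj_attribute virt_addr_attr =
            __ATTR(virt_addr, 0400, virt_addr_show, NULL);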

Re: sysbench throughput degradation in 4.13+

2017-10-10 Thread Matt Fleming
On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
> 
> It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
> regress but performance is restored for sockets.
> 
> Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
> patch. So I've queued that up now and it should be done by tomorrow.

Yeah, netperf results look fine for either your NO_WA_WEIGHT or
WA_WEIGHT patch.

Any ETA on when this is going to tip?


Re: [PATCH] efi/capsule-loader: pr_err() strings should end with newlines

2017-10-06 Thread Matt Fleming
On Mon, 25 Sep, at 04:17:23PM, Arvind Yadav wrote:
> pr_err() messages should be terminated with a new-line to avoid
> other messages being concatenated onto the end.
> 
> Signed-off-by: Arvind Yadav 
> ---
>  drivers/firmware/efi/capsule-loader.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/firmware/efi/capsule-loader.c 
> b/drivers/firmware/efi/capsule-loader.c
> index ec8ac5c..51d2874 100644
> --- a/drivers/firmware/efi/capsule-loader.c
> +++ b/drivers/firmware/efi/capsule-loader.c
> @@ -49,7 +49,7 @@ int __efi_capsule_setup_info(struct capsule_info *cap_info)
>   pages_needed = ALIGN(cap_info->total_size, PAGE_SIZE) / PAGE_SIZE;
>  
>   if (pages_needed == 0) {
> - pr_err("invalid capsule size");
> + pr_err("invalid capsule size\n");
>   return -EINVAL;
>   }
>  

Thanks, applied.


Re: sysbench throughput degradation in 4.13+

2017-10-06 Thread Matt Fleming
On Wed, 04 Oct, at 06:18:50PM, Peter Zijlstra wrote:
> On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> > So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> > over the weekend.
> > 
> > The trivial thing regresses a wee bit on the overloaded case, I've not
> > yet tried to fix it.
> 
> WA_IDLE is my 'old' patch and what you all tested, WA_WEIGHT is the
> addition -- based on the old scheme -- that I've tried in order to lift
> the overloaded case (including hackbench).

My results (2 nodes, 12 cores/node, 2 threads/core) show that you've
pretty much restored hackbench performance to v4.12. However, it's a
regression against v4.13 for hackbench-process-pipes (I'm guessing the
v4.13 improvement is due to Rik's original patches).

The last two result columns are your latest patch with NO_WA_WEIGHT
and then with WA_WEIGHT enabled.

(I hope you've all got wide terminals)

hackbench-process-pipes
                       4.12.0             4.13.0             4.13.0             4.13.0                      4.13.0                   4.13.0
                      vanilla            vanilla         peterz-fix            rik-fix  peterz-fix-v2-no-wa-weight  peterz-fix-v2-wa-weight
Amean 1        1.1600 (  0.00%)   1.6037 (-38.25%)   1.0727 (  7.53%)   1.0200 ( 12.07%)   1.0837 (  6.58%)   1.1110 (  4.22%)
Amean 4        2.4207 (  0.00%)   2.2300 (  7.88%)   2.0520 ( 15.23%)   1.9483 ( 19.51%)   2.0623 ( 14.80%)   2.2807 (  5.78%)
Amean 7        5.4140 (  0.00%)   3.2027 ( 40.84%)   3.6100 ( 33.32%)   3.5620 ( 34.21%)   3.5590 ( 34.26%)   4.6573 ( 13.98%)
Amean 12       9.7130 (  0.00%)   4.7170 ( 51.44%)   6.5280 ( 32.79%)   6.2063 ( 36.10%)   6.5670 ( 32.39%)  10.5440 ( -8.56%)
Amean 21      11.6687 (  0.00%)   8.8073 ( 24.52%)  14.4997 (-24.26%)  10.2700 ( 11.99%)  14.3187 (-22.71%)  11.5417 (  1.09%)
Amean 30      14.6410 (  0.00%)  11.7003 ( 20.09%)  23.7660 (-62.32%)  13.9847 (  4.48%)  21.8957 (-49.55%)  14.4847 (  1.07%)
Amean 48      19.8780 (  0.00%)  17.0317 ( 14.32%)  37.6397 (-89.35%)  19.7577 (  0.61%)  39.2110 (-97.26%)  20.3293 ( -2.27%)
Amean 79      46.4200 (  0.00%)  27.1180 ( 41.58%)  58.4037 (-25.82%)  35.5537 ( 23.41%)  60.8957 (-31.18%)  49.7470 ( -7.17%)
Amean 110     57.7550 (  0.00%)  42.7013 ( 26.06%)  73.0483 (-26.48%)  48.8880 ( 15.35%)  77.8597 (-34.81%)  61.9353 ( -7.24%)
Amean 141     61.0490 (  0.00%)  48.0857 ( 21.23%)  98.5567 (-61.44%)  63.2187 ( -3.55%)  90.4857 (-48.22%)  68.3100 (-11.89%)
Amean 172     70.5180 (  0.00%)  59.3620 ( 15.82%) 122.5423 (-73.77%)  76.0197 ( -7.80%) 127.4023 (-80.67%)  75.8233 ( -7.52%)
Amean 192     76.1643 (  0.00%)  65.1613 ( 14.45%) 142.1393 (-86.62%)  91.4923 (-20.12%) 145.0663 (-90.46%)  80.5867 ( -5.81%)

But things look pretty good for hackbench-process-sockets:

hackbench-process-sockets
                       4.12.0             4.13.0             4.13.0             4.13.0                      4.13.0                   4.13.0
                      vanilla            vanilla         peterz-fix            rik-fix  peterz-fix-v2-no-wa-weight  peterz-fix-v2-wa-weight
Amean 1        0.9657 (  0.00%)   1.0850 ( -12.36%)   1.3737 ( -42.25%)   1.3093 ( -35.59%)   1.3220 ( -36.90%)   1.3937 ( -44.32%)
Amean 4        2.3040 (  0.00%)   3.3840 ( -46.88%)   2.1807 (   5.35%)   2.3010 (   0.13%)   2.2070 (   4.21%)   2.1770 (   5.51%)
Amean 7        4.5467 (  0.00%)   4.0787 (  10.29%)   5.0530 ( -11.14%)   3.7427 (  17.68%)   4.5517 (  -0.11%)   3.8560 (  15.19%)
Amean 12       5.7707 (  0.00%)   5.4440 (   5.66%)  10.5680 ( -83.13%)   7.7240 ( -33.85%)  10.5990 ( -83.67%)   5.9123 (  -2.45%)
Amean 21       8.9387 (  0.00%)   9.5850 (  -7.23%)  18.3103 (-104.84%)  10.9253 ( -22.23%)  18.1540 (-103.10%)   9.2627 (  -3.62%)
Amean 30      13.1243 (  0.00%)  14.0773 (  -7.26%)  25.6563 ( -95.49%)  15.7590 ( -20.07%)  25.6920 ( -95.76%)  14.6523 ( -11.64%)
Amean 48      25.1030 (  0.00%)  22.5233 (  10.28%)  40.5937 ( -61.71%)  24.6727 (   1.71%)  40.6357 ( -61.88%)  22.1807 (  11.64%)
Amean 79      39.9150 (  0.00%)  33.4220 (  16.27%)  66.3343 ( -66.19%)  40.2713 (  -0.89%)  65.8543 ( -64.99%)  35.3360 (  11.47%)
Amean 110     49.1700 (  0.00%)  46.1173 (   6.21%)  92.3153 ( -87.75%)  55.6567 ( -13.19%)  92.0567 ( -87.22%)  46.7280 (   4.97%)
Amean 141     59.3157 (  0.00%)  57.2670 (   3.45%) 118.5863 ( -99.92%)  70.4800 ( -18.82%) 118.6013 ( -99.95%)

Re: sysbench throughput degradation in 4.13+

2017-10-02 Thread Matt Fleming
On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> 
> I like the simplicity of your approach!  I hope it does not break
> stuff like netperf...
> 
> I have been working on the patch below, which is much less optimistic
> about when to do an affine wakeup than before.

Running netperf for this patch and Peter's patch shows that Peter's
comes out on top, with scores pretty close to v4.12 in most places on
my 2-NUMA node 48-CPU Xeon box.

I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
but I will next week.

netperf-tcp
                          4.12.0               4.13.0               4.13.0               4.13.0
                         default              default              peterz+                riel+
Min   64        1653.72 (  0.00%)    1554.29 ( -6.01%)    1601.71 ( -3.15%)    1627.01 ( -1.62%)
Min   128       3240.96 (  0.00%)    2858.49 (-11.80%)    3122.62 ( -3.65%)    3063.67 ( -5.47%)
Min   256       5840.55 (  0.00%)    4077.43 (-30.19%)    5529.98 ( -5.32%)    5362.18 ( -8.19%)
Min   1024     16812.11 (  0.00%)   11899.72 (-29.22%)   16335.83 ( -2.83%)   15075.24 (-10.33%)
Min   2048     26875.79 (  0.00%)   18852.35 (-29.85%)   25902.85 ( -3.62%)   22804.82 (-15.15%)
Min   3312     33060.18 (  0.00%)   20984.28 (-36.53%)   32817.82 ( -0.73%)   29161.49 (-11.79%)
Min   4096     34513.24 (  0.00%)   23253.94 (-32.62%)   34167.80 ( -1.00%)   29349.09 (-14.96%)
Min   8192     39836.88 (  0.00%)   28881.63 (-27.50%)   39613.28 ( -0.56%)   35307.95 (-11.37%)
Min   16384    44203.84 (  0.00%)   31616.74 (-28.48%)   43608.86 ( -1.35%)   38130.44 (-13.74%)
Hmean 64        1686.58 (  0.00%)    1613.25 ( -4.35%)    1657.04 ( -1.75%)    1655.38 ( -1.85%)
Hmean 128       3361.84 (  0.00%)    2945.34 (-12.39%)    3173.47 ( -5.60%)    3122.38 ( -7.12%)
Hmean 256       5993.92 (  0.00%)    4423.32 (-26.20%)    5618.26 ( -6.27%)    5523.72 ( -7.84%)
Hmean 1024     17225.83 (  0.00%)   12314.23 (-28.51%)   16574.85 ( -3.78%)   15644.71 ( -9.18%)
Hmean 2048     27944.22 (  0.00%)   21301.63 (-23.77%)   26395.38 ( -5.54%)   24067.57 (-13.87%)
Hmean 3312     33760.48 (  0.00%)   22361.07 (-33.77%)   33198.32 ( -1.67%)   30055.64 (-10.97%)
Hmean 4096     35077.74 (  0.00%)   29153.73 (-16.89%)   34479.40 ( -1.71%)   31215.64 (-11.01%)
Hmean 8192     40674.31 (  0.00%)   33493.01 (-17.66%)   40443.22 ( -0.57%)   37298.58 ( -8.30%)
Hmean 16384    45492.12 (  0.00%)   37177.64 (-18.28%)   44308.62 ( -2.60%)   40728.33 (-10.47%)
Max   64        1745.95 (  0.00%)    1649.03 ( -5.55%)    1710.52 ( -2.03%)    1702.65 ( -2.48%)
Max   128       3509.96 (  0.00%)    3082.35 (-12.18%)    3204.19 ( -8.71%)    3174.41 ( -9.56%)
Max   256       6138.35 (  0.00%)    4687.62 (-23.63%)    5694.52 ( -7.23%)    5722.08 ( -6.78%)
Max   1024     17732.13 (  0.00%)   13270.42 (-25.16%)   16838.69 ( -5.04%)   16580.18 ( -6.50%)
Max   2048     28907.99 (  0.00%)   24816.39 (-14.15%)   26792.86 ( -7.32%)   25003.60 (-13.51%)
Max   3312     34512.60 (  0.00%)   23510.32 (-31.88%)   33762.47 ( -2.17%)   31676.54 ( -8.22%)
Max   4096     35918.95 (  0.00%)   35245.77 ( -1.87%)   34866.23 ( -2.93%)   32537.07 ( -9.42%)
Max   8192     41749.55 (  0.00%)   41068.52 ( -1.63%)   42164.53 (  0.99%)   40105.50 ( -3.94%)
Max   16384    47234.74 (  0.00%)   41728.66 (-11.66%)   45387.40 ( -3.91%)   44107.25 ( -6.62%)


Re: [PATCH 3/3] x86/efi: Use efi_switch_mm() rather than manually twiddling with cr3

2017-08-16 Thread Matt Fleming
On Wed, 16 Aug, at 12:03:22PM, Mark Rutland wrote:
> 
> I'd expect we'd abort at a higher level, not taking any sample. i.e.
> we'd have the core overflow handler check in_funny_mm(), and if so, skip
> the sample, as with the skid case.

FYI, this is my preferred solution for x86 too.
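
For completeness, one possible spelling of that check, given that this
series switches x86 to a dedicated efi_mm ("in_funny_mm()" is the
thread's placeholder name, not an existing helper, and this is only a
sketch, not the eventual implementation):

    static inline bool in_funny_mm(void)
    {
            /* an EFI runtime-services call is in progress on this CPU */
            return current->active_mm == &efi_mm;
    }

The core overflow handler would then drop the sample when this returns
true, just as it does for samples discarded due to skid.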


Re: [PATCH v9 1/2] efi: Introduce efi_early_memdesc_ptr to get pointer to memmap descriptor

2017-08-16 Thread Matt Fleming
On Mon, 14 Aug, at 10:54:23PM, Baoquan He wrote:
> The existing map iteration helper for_each_efi_memory_desc_in_map can
> only be used after OS initializes EFI to fill data of struct efi_memory_map.

Should this say "EFI subsystem"? The firmware doesn't care about the
kernel's internal data structures.

> Before that we also need to iterate map descriptors which are stored in several
> intermediate structures, like struct efi_boot_memmap for arch independent
> usage and struct efi_info for x86 arch only.
> 
> Introduce efi_early_memdesc_ptr to get pointer to a map descriptor, and
> replace several places of open code with it.
> 
> Suggested-by: Ingo Molnar 
> Signed-off-by: Baoquan He 
> ---
>  arch/x86/boot/compressed/eboot.c   |  2 +-
>  drivers/firmware/efi/libstub/efi-stub-helper.c |  4 ++--
>  include/linux/efi.h| 21 +
>  3 files changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/eboot.c 
> b/arch/x86/boot/compressed/eboot.c
> index c3e869eaef0c..e007887a33b0 100644
> --- a/arch/x86/boot/compressed/eboot.c
> +++ b/arch/x86/boot/compressed/eboot.c
> @@ -767,7 +767,7 @@ static efi_status_t setup_e820(struct boot_params *params,
>   m |= (u64)efi->efi_memmap_hi << 32;
>  #endif
>  
> - d = (efi_memory_desc_t *)(m + (i * efi->efi_memdesc_size));
> + d = efi_early_memdesc_ptr(m, efi->efi_memdesc_size, i);
>   switch (d->type) {
>   case EFI_RESERVED_TYPE:
>   case EFI_RUNTIME_SERVICES_CODE:
> diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c 
> b/drivers/firmware/efi/libstub/efi-stub-helper.c
> index b0184360efc6..50a9cab5a834 100644
> --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
> +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
> @@ -205,7 +205,7 @@ efi_status_t efi_high_alloc(efi_system_table_t 
> *sys_table_arg,
>   unsigned long m = (unsigned long)map;
>   u64 start, end;
>  
> - desc = (efi_memory_desc_t *)(m + (i * desc_size));
> + desc = efi_early_memdesc_ptr(m, desc_size, i);
>   if (desc->type != EFI_CONVENTIONAL_MEMORY)
>   continue;
>  
> @@ -298,7 +298,7 @@ efi_status_t efi_low_alloc(efi_system_table_t 
> *sys_table_arg,
>   unsigned long m = (unsigned long)map;
>   u64 start, end;
>  
> - desc = (efi_memory_desc_t *)(m + (i * desc_size));
> + desc = efi_early_memdesc_ptr(m, desc_size, i);
>  
>   if (desc->type != EFI_CONVENTIONAL_MEMORY)
>   continue;
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 8269bcb8ccf7..9783d9e4a4b2 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -1020,6 +1020,27 @@ extern int efi_memattr_init(void);
>  extern int efi_memattr_apply_permissions(struct mm_struct *mm,
>efi_memattr_perm_setter fn);
>  
> +/*
> + * efi_early_memdesc_ptr - get the n-th efi memmap descriptor
> + * @map: the start of efi memmap
> + * @desc_size: the size of space for each efi memmap descriptor
> + * @n: the index of efi memmap descriptor
> + *
> + * EFI boot service provides function GetMemoryMap() to get a copy of the
> + * current memory map which is an array of memory descriptors, each of
> + * which describes a contiguous block of memory. And also get the size of
> + * map, and the size of each descriptor, etc. Note that per section 6.2 of
> + * UEFI Spec 2.6 Errata A, the returned size of each descriptor might not
> + * be equal to sizeof(efi_memory_memdesc_t) since efi_memory_memdesc_t may
> + * be extended in the future in response to hardware innovation. Thus OS
> + * MUST use the returned size of descriptor to find the start of each
> + * efi_memory_memdesc_t in the memory map array. This should only be used
> + * during bootup since for_each_efi_memory_desc_xxx is suggested after OS
> + * initializes EFI to fill data of struct efi_memory_map.
> + */

Again, please use "EFI subsystem" here.

> +#define efi_early_memdesc_ptr(map, desc_size, n) \
> + (efi_memory_desc_t *)((void *)(map) + ((n) * (desc_size)))
> +
>  /* Iterate through an efi_memory_map */
>  #define for_each_efi_memory_desc_in_map(m, md)   
>\
>   for ((md) = (m)->map;  \

Otherwise, this looks OK to me.
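
As a usage illustration (names like 'map', 'desc_size' and 'nr_desc'
stand for whatever the caller got back from GetMemoryMap(); they are
not variables from the patch):

    for (i = 0; i < nr_desc; i++) {
            efi_memory_desc_t *d = efi_early_memdesc_ptr(map, desc_size, i);

            if (d->type != EFI_CONVENTIONAL_MEMORY)
                    continue;
            /* use d->phys_addr / d->num_pages here */
    }

Indexing by desc_size rather than sizeof(efi_memory_desc_t) is exactly
the spec requirement quoted in the comment above.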


Re: [PATCH] x86/efi: page align EFI ROM image ranges

2017-08-04 Thread Matt Fleming
On Wed, 02 Aug, at 11:41:38AM, Stuart Hayes wrote:
> (Resend because I mistyped the maintainer's email address the first time.)
> 
> The kernel's EFI stub locates and copies EFI ROM images into memory,
> which it allocates using the byte-granular EFI allocate_pool
> function.  These memory ranges are then added to setup_data, and
> later to e820 (in e820__reserve_setup_data()).  The e820 ranges are
> parsed to create nosave regions (in
> e820__register_nosave_regions()), but when non-page-aligned e820
> regions are parsed, a nosave page is added at the beginning and end
> of each non-page-aligned region, which results in data not getting
> saved or restored during a hibernate/resume.  This can result in
> random failures after a hibernate/resume.
>  
> Round up the allocation size to a whole number of pages, and use EFI
> allocate_pages to ensure that the EFI ROM copy regions are
> page-aligned.
> 
> On a system with six EFI ROM images, before the patch:
> 
> e820: update [mem 0x64866020-0x6486e05f] usable ==> usable
> e820: update [mem 0x6147a020-0x61499c5f] usable ==> usable
> e820: update [mem 0x60fff020-0x6105785f] usable ==> usable
> e820: update [mem 0x60fa6020-0x60ffe85f] usable ==> usable
> e820: update [mem 0x60f4d020-0x60fa585f] usable ==> usable
> e820: update [mem 0x60ef4020-0x60f4c85f] usable ==> usable
> ...
> PM: Registered nosave memory: [mem 0x-0x0fff]
> PM: Registered nosave memory: [mem 0x000a-0x000f]
> PM: Registered nosave memory: [mem 0x60ef4000-0x60ef4fff]
> PM: Registered nosave memory: [mem 0x60f4c000-0x60f4cfff]
> PM: Registered nosave memory: [mem 0x60f4d000-0x60f4dfff]
> PM: Registered nosave memory: [mem 0x60fa5000-0x60fa5fff]
> PM: Registered nosave memory: [mem 0x60fa6000-0x60fa6fff]
> PM: Registered nosave memory: [mem 0x60ffe000-0x60ffefff]
> PM: Registered nosave memory: [mem 0x60fff000-0x60ff]
> PM: Registered nosave memory: [mem 0x61057000-0x61057fff]
> PM: Registered nosave memory: [mem 0x6147a000-0x6147afff]
> PM: Registered nosave memory: [mem 0x61499000-0x61499fff]
> PM: Registered nosave memory: [mem 0x64866000-0x64866fff]
> PM: Registered nosave memory: [mem 0x6486e000-0x6486efff]
> PM: Registered nosave memory: [mem 0x6cf6e000-0x6f3ccfff]
> 
> After the patch:
> 
> e820: update [mem 0x64866000-0x6486efff] usable ==> usable
> e820: update [mem 0x6147a000-0x61499fff] usable ==> usable
> e820: update [mem 0x60fff000-0x61057fff] usable ==> usable
> e820: update [mem 0x60fa6000-0x60ffefff] usable ==> usable
> e820: update [mem 0x60f4d000-0x60fa5fff] usable ==> usable
> e820: update [mem 0x60ef4000-0x60f4cfff] usable ==> usable
> ...
> PM: Registered nosave memory: [mem 0x-0x0fff]
> PM: Registered nosave memory: [mem 0x000a-0x000f]
> PM: Registered nosave memory: [mem 0x6cf6e000-0x6f3ccfff]
> 
> Signed-off-by: Stuart Hayes 
> ---
> 
> --- linux-4.13-rc2/arch/x86/boot/compressed/eboot.c.orig  2017-08-01 
> 12:12:04.696049106 -0400
> +++ linux-4.13-rc2/arch/x86/boot/compressed/eboot.c   2017-08-01 
> 12:11:33.120182236 -0400
> @@ -235,7 +235,12 @@ __setup_efi_pci64(efi_pci_io_protocol_64
>  
>   size = pci->romsize + sizeof(*rom);
>  
> - status = efi_call_early(allocate_pool, EFI_LOADER_DATA, size, &rom);
> + /*
> +  * Get whole pages because this will be added to e820.
> +  */
> + size = PAGE_ALIGN(size);
> + status = efi_call_early(allocate_pages, EFI_ALLOCATE_ANY_PAGES,
> + EFI_LOADER_DATA, (size >> PAGE_SHIFT), &rom);
>   if (status != EFI_SUCCESS) {
>   efi_printk(sys_table, "Failed to alloc mem for rom\n");
>   return status;
>  
 
Nice catch. The comment could do with a little more information,
including the fact that it's the e820 nosave code that expects
page-aligned ROM regions.

Also, you'll need the same fix for __setup_efi_pci32().
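
Something along these lines, i.e. the same PAGE_ALIGN plus
allocate_pages change mirrored into __setup_efi_pci32() (an untested
sketch, not the resent patch):

    /* Get whole pages because the copy is added to e820 and the
     * nosave code expects page-aligned regions. */
    size = PAGE_ALIGN(pci->romsize + sizeof(*rom));
    status = efi_call_early(allocate_pages, EFI_ALLOCATE_ANY_PAGES,
                            EFI_LOADER_DATA, size >> PAGE_SHIFT, &rom);
    if (status != EFI_SUCCESS) {
            efi_printk(sys_table, "Failed to alloc mem for rom\n");
            return status;
    }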


Re: [PATCH v6 RESEND] x86/boot/KASLR: Restrict kernel to be randomized in mirror regions

2017-08-04 Thread Matt Fleming
On Fri, 04 Aug, at 07:40:05PM, Baoquan He wrote:
> On 08/04/17 at 12:23pm, Matt Fleming wrote:
> > On Fri, 28 Jul, at 07:26:03PM, Baoquan He wrote:
> > > Hi Matt,
> > > 
> > > On 07/28/17 at 11:55am, Ingo Molnar wrote:
> > > > 
> > > > * Matt Fleming <m...@codeblueprint.co.uk> wrote:
> > > > 
> > > > > On Fri, 21 Jul, at 09:19:56PM, Baoquan He wrote:
> > > > > >
> > > > > > There are places where the efi map is getting and used like this. 
> > > > > > E.g
> > > > > > in efi_high_alloc() of 
> > > > > > drivers/firmware/efi/libstub/efi-stub-helper.c.
> > > > > > EFI developers worry the size of efi_memory_desc_t could not be the 
> > > > > > same
> > > > > > as e->efi_memdesc_size?
> > > > > > 
> > > > > > Hi Matt,
> > > > > > 
> > > > > > Could you help have a look at this?
> > > > > 
> > > > > You're exactly right. The code guards against the size of the
> > > > > efi_memory_desc_t struct changing. The UEFI spec says to traverse the
> > > > > memory map this way.
> > > > 
> > > > This is not obvious and looks pretty ugly as well, and open coded in 
> > > > several 
> > > > places.
> > > > 
> > > > At minimum we should have an efi_memdesc_ptr(efi, i) wrapper inline (or 
> > > > so) that 
> > > > gives us the entry pointer, plus a comment that points out that 
> > > > ->memdesc_size 
> > > > might not be equal to sizeof(efi_memory_memdesc_t).
> > > 
> > > I can make a efi_memdesc_ptr(efi, i) wrapper as Ingo suggested and use
> > > it here if you agree. Seems it might be not good to add another
> > > for_each_efi_memory_desc_ wrapper since there are different memmap
> > > data structures in x86 boot and in general efi libstub. Or any other
> > > idea?
> > 
> > I think adding a wrapper is fine, but I'd suggest including the word
> > "early" (or something similar) to explain that it should only be used
> > during bootup -- we want everyone else to use the
> > for_each_efi_memory_*() API.
> 
> Thanks, Matt. I can do that. Do you think below helper definition is OK
> to you? If yes, I can update with it and post v9.
> 
> #define efi_early_memdesc_ptr(map, desc_size, n)  
> \
>   (efi_memory_desc_t *)((void *)(map) + ((n) * (desc_size)))
 
Looks fine to me, but I'd wait for Ingo's OK before resending.


Re: [PATCH v6 RESEND] x86/boot/KASLR: Restrict kernel to be randomized in mirror regions

2017-08-04 Thread Matt Fleming
On Fri, 28 Jul, at 07:26:03PM, Baoquan He wrote:
> Hi Matt,
> 
> On 07/28/17 at 11:55am, Ingo Molnar wrote:
> > 
> > * Matt Fleming <m...@codeblueprint.co.uk> wrote:
> > 
> > > On Fri, 21 Jul, at 09:19:56PM, Baoquan He wrote:
> > > >
> > > > There are places where the EFI map is retrieved and used like this, e.g.
> > > > in efi_high_alloc() of drivers/firmware/efi/libstub/efi-stub-helper.c.
> > > > EFI developers worry that the size of efi_memory_desc_t might not be the
> > > > same as e->efi_memdesc_size?
> > > > 
> > > > Hi Matt,
> > > > 
> > > > Could you help have a look at this?
> > > 
> > > You're exactly right. The code guards against the size of the
> > > efi_memory_desc_t struct changing. The UEFI spec says to traverse the
> > > memory map this way.
> > 
> > This is not obvious and looks pretty ugly as well, and open coded in 
> > several 
> > places.
> > 
> > At minimum we should have an efi_memdesc_ptr(efi, i) wrapper inline (or so) 
> > that 
> > gives us the entry pointer, plus a comment that points out that 
> > ->memdesc_size 
> > might not be equal to sizeof(efi_memory_desc_t).
> 
> I can make an efi_memdesc_ptr(efi, i) wrapper as Ingo suggested and use
> it here if you agree. It might not be a good idea to add another
> for_each_efi_memory_desc_ wrapper since there are different memmap
> data structures in the x86 boot code and in the generic EFI libstub. Or
> any other idea?

I think adding a wrapper is fine, but I'd suggest including the word
"early" (or something similar) to explain that it should only be used
during bootup -- we want everyone else to use the
for_each_efi_memory_*() API.
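
For anything that runs after boot, the descriptor-size arithmetic is already
hidden behind the generic iterator, which is why the new wrapper is meant for
the early path only. A minimal sketch of the runtime pattern (the loop body is
illustrative, not taken from any existing caller):

	efi_memory_desc_t *md;

	/*
	 * The iterator advances by efi.memmap.desc_size internally, so
	 * callers never open-code the stride.
	 */
	for_each_efi_memory_desc(md) {
		if (md->attribute & EFI_MEMORY_MORE_RELIABLE)
			pr_info("mirrored region at %#llx\n", md->phys_addr);
	}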


Re: [PATCH v4.4.y] sched/cgroup: Move sched_online_group() back into css_online() to fix crash

2017-07-26 Thread Matt Fleming
On Wed, 26 Jul, at 07:35:51AM, Greg KH wrote:
> 
> If it needs a backport and a simple cherry-pick does not work, yes
> please.

Oh, it turns out cherry-picking commit 96b777452d88 to 4.9-stable
works just fine.


Re: [PATCH v4.4.y] sched/cgroup: Move sched_online_group() back into css_online() to fix crash

2017-07-26 Thread Matt Fleming
On Tue, 25 Jul, at 11:04:39AM, Greg KH wrote:
> On Thu, Jul 20, 2017 at 02:53:09PM +0100, Matt Fleming wrote:
> > From: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
> > 
> > commit 96b777452d8881480fd5be50112f791c17db4b6b upstream.
> > 
> > Commit:
> > 
> >   2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init")
> > 
> > .. moved sched_online_group() from css_online() to css_alloc().
> > It exposes half-baked task group into global lists before initializing
> > generic cgroup stuff.
> > 
> > LTP testcase (third in cgroup_regression_test) written for testing
> > similar race in kernels 2.6.26-2.6.28 easily triggers this oops:
> > 
> >   BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> >   IP: kernfs_path_from_node_locked+0x260/0x320
> >   CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
> >   Call Trace:
> >   ? kernfs_path_from_node+0x4f/0x60
> >   kernfs_path_from_node+0x3e/0x60
> >   print_rt_rq+0x44/0x2b0
> >   print_rt_stats+0x7a/0xd0
> >   print_cpu+0x2fc/0xe80
> >   ? __might_sleep+0x4a/0x80
> >   sched_debug_show+0x17/0x30
> >   seq_read+0xf2/0x3b0
> >   proc_reg_read+0x42/0x70
> >   __vfs_read+0x28/0x130
> >   ? security_file_permission+0x9b/0xc0
> >   ? rw_verify_area+0x4e/0xb0
> >   vfs_read+0xa5/0x170
> >   SyS_read+0x46/0xa0
> >   entry_SYSCALL_64_fastpath+0x1e/0xad
> > 
> > Here the task group is already linked into the global RCU-protected 
> > 'task_groups'
> > list, but the css->cgroup pointer is still NULL.
> > 
> > This patch reverts this chunk and moves online back to css_online().
> > 
> > Signed-off-by: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
> > Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> > Cc: Linus Torvalds <torva...@linux-foundation.org>
> > Cc: Peter Zijlstra <pet...@infradead.org>
> > Cc: Tejun Heo <t...@kernel.org>
> > Cc: Thomas Gleixner <t...@linutronix.de>
> > Fixes: 2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init")
> > Link: 
> > http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
> > Signed-off-by: Ingo Molnar <mi...@kernel.org>
> > Signed-off-by: Matt Fleming <m...@codeblueprint.co.uk>
> > ---
> >  kernel/sched/core.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> What about 4.9-stable, this should go there too, right?

Yes, good catch. Would you like me to send a separate patch?


Re: [PATCH v6 RESEND] x86/boot/KASLR: Restrict kernel to be randomized in mirror regions

2017-07-24 Thread Matt Fleming
On Fri, 21 Jul, at 09:19:56PM, Baoquan He wrote:
>
> There are places where the EFI map is retrieved and used like this, e.g.
> in efi_high_alloc() of drivers/firmware/efi/libstub/efi-stub-helper.c.
> EFI developers worry that the size of efi_memory_desc_t might not be the
> same as e->efi_memdesc_size?
> 
> Hi Matt,
> 
> Could you help have a look at this?

You're exactly right. The code guards against the size of the
efi_memory_desc_t struct changing. The UEFI spec says to traverse the
memory map this way.


Re: [PATCH v3 2/2] x86/efi: clean up dead code around efi_reserve_boot_services()

2017-07-24 Thread Matt Fleming
On Mon, 10 Jul, at 02:51:36PM, Naoya Horiguchi wrote:
> EFI_BOOT_SERVICES_{CODE|DATA} regions never overlap the kernel now,
> so we can clean up the check in efi_reserve_boot_services().
> 
> Signed-off-by: Naoya Horiguchi 
> ---
>  arch/x86/platform/efi/quirks.c | 23 +----------------------
>  1 file changed, 1 insertion(+), 22 deletions(-)

Is this true for kernels not using KASLR? 


Re: [PATCH v3 1/2] x86/boot/KASLR: exclude EFI_BOOT_SERVICES_{CODE|DATA} from KASLR's choice

2017-07-24 Thread Matt Fleming
On Mon, 10 Jul, at 02:51:35PM, Naoya Horiguchi wrote:
> KASLR chooses kernel location from E820_TYPE_RAM regions by walking over
> e820 entries now. E820_TYPE_RAM includes EFI_BOOT_SERVICES_CODE and
> EFI_BOOT_SERVICES_DATA, so those regions can be the target. According to
> UEFI spec, all memory regions marked as EfiBootServicesCode and
> EfiBootServicesData are available for free memory after the first call
> of ExitBootServices(). So such regions should be usable for kernel on
> spec basis.
> 
> In x86, however, we have some workaround for broken firmware, where we
> keep such regions reserved until SetVirtualAddressMap() is done.
> See the following code in should_map_region():
> 
>   static bool should_map_region(efi_memory_desc_t *md)
>   {
>   ...
>   /*
>* Map boot services regions as a workaround for buggy
>* firmware that accesses them even when they shouldn't.
>*
>* See efi_{reserve,free}_boot_services().
>*/
>   if (md->type == EFI_BOOT_SERVICES_CODE ||
>   md->type == EFI_BOOT_SERVICES_DATA)
>   return false;
> 
> This workaround suppressed a boot crash, but potential issues still
> remain because no one prevents the regions from overlapping with kernel
> image by KASLR.
> 
> So let's make sure that EFI_BOOT_SERVICES_{CODE|DATA} regions are never
> chosen as kernel memory for the workaround to work fine.
> 
> Signed-off-by: Naoya Horiguchi 
> ---
> v2 -> v3:
> - skip EFI_LOADER_CODE and EFI_LOADER_DATA in region scan
> 
> v1 -> v2:
> - switch efi_mirror_found to local variable
> - insert break when EFI_MEMORY_MORE_RELIABLE found
> ---
>  arch/x86/boot/compressed/kaslr.c | 44 ++++++++++++++++++++++++++++++++------------
>  1 file changed, 32 insertions(+), 12 deletions(-)

I think the patch content is fine, but please update the comments and
the changelog to say why EFI_CONVENTIONAL_MEMORY was chosen, i.e. that
it's the only memory type that we know to be free.
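
For reference, the check under review boils down to something like the sketch
below; the function name and the exact mirror handling are illustrative rather
than the precise hunk from the patch:

	static bool region_is_usable(efi_memory_desc_t *md, bool mirror_found)
	{
		/*
		 * EFI_CONVENTIONAL_MEMORY is the only EFI memory type known
		 * to be free this early; boot/runtime services, loader
		 * code/data, ACPI and MMIO regions may still be in use.
		 */
		if (md->type != EFI_CONVENTIONAL_MEMORY)
			return false;

		/* If mirrored memory exists, restrict KASLR to it. */
		if (mirror_found && !(md->attribute & EFI_MEMORY_MORE_RELIABLE))
			return false;

		return true;
	}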


[PATCH v4.4.y] sched/cgroup: Move sched_online_group() back into css_online() to fix crash

2017-07-20 Thread Matt Fleming
From: Konstantin Khlebnikov <khlebni...@yandex-team.ru>

commit 96b777452d8881480fd5be50112f791c17db4b6b upstream.

Commit:

  2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init")

.. moved sched_online_group() from css_online() to css_alloc().
It exposes half-baked task group into global lists before initializing
generic cgroup stuff.

LTP testcase (third in cgroup_regression_test) written for testing
similar race in kernels 2.6.26-2.6.28 easily triggers this oops:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: kernfs_path_from_node_locked+0x260/0x320
  CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
  Call Trace:
  ? kernfs_path_from_node+0x4f/0x60
  kernfs_path_from_node+0x3e/0x60
  print_rt_rq+0x44/0x2b0
  print_rt_stats+0x7a/0xd0
  print_cpu+0x2fc/0xe80
  ? __might_sleep+0x4a/0x80
  sched_debug_show+0x17/0x30
  seq_read+0xf2/0x3b0
  proc_reg_read+0x42/0x70
  __vfs_read+0x28/0x130
  ? security_file_permission+0x9b/0xc0
  ? rw_verify_area+0x4e/0xb0
  vfs_read+0xa5/0x170
  SyS_read+0x46/0xa0
  entry_SYSCALL_64_fastpath+0x1e/0xad

Here the task group is already linked into the global RCU-protected 
'task_groups'
list, but the css->cgroup pointer is still NULL.

This patch reverts this chunk and moves online back to css_online().

Signed-off-by: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Tejun Heo <t...@kernel.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Fixes: 2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init")
Link: 
http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
Signed-off-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: Matt Fleming <m...@codeblueprint.co.uk>
---
 kernel/sched/core.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20253dbc8610..7a5060bf3bbc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8237,11 +8237,20 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state 
*parent_css)
if (IS_ERR(tg))
return ERR_PTR(-ENOMEM);
 
-   sched_online_group(tg, parent);
-
return &tg->css;
 }
 
+/* Expose task group only after completing cgroup initialization */
+static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
+{
+   struct task_group *tg = css_tg(css);
+   struct task_group *parent = css_tg(css->parent);
+
+   if (parent)
+   sched_online_group(tg, parent);
+   return 0;
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
struct task_group *tg = css_tg(css);
@@ -8616,6 +8625,7 @@ static struct cftype cpu_files[] = {
 
 struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc  = cpu_cgroup_css_alloc,
+   .css_online = cpu_cgroup_css_online,
.css_released   = cpu_cgroup_css_released,
.css_free   = cpu_cgroup_css_free,
.fork   = cpu_cgroup_fork,
-- 
2.12.2



Re: [PATCH v4 00/10] PCID and improved laziness

2017-07-13 Thread Matt Fleming
On Tue, 11 Jul, at 08:00:47AM, Andy Lutomirski wrote:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/
> 
> I did:
> 
> $ ./context_switch_latency_64 0 process same

Ah, that's better. I see about a 3.3% speedup with your patches when
running the context-switch benchmark.
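
For anyone without the misc-tests tree to hand, the measurement is conceptually
a ping-pong between two tasks. The program below is a rough, self-contained
approximation of that idea, not Andy's actual context_switch_latency_64:

	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>
	#include <unistd.h>

	#define ITERS 100000

	/* Bounce a byte over two pipes; each round trip is ~2 switches. */
	int main(void)
	{
		int p1[2], p2[2];
		char c = 0;
		struct timespec t0, t1;
		long i;

		if (pipe(p1) || pipe(p2))
			return 1;

		if (fork() == 0) {	/* child: echo everything back */
			for (i = 0; i < ITERS; i++) {
				if (read(p1[0], &c, 1) != 1 ||
				    write(p2[1], &c, 1) != 1)
					exit(1);
			}
			exit(0);
		}

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < ITERS; i++) {
			if (write(p1[1], &c, 1) != 1 ||
			    read(p2[0], &c, 1) != 1)
				return 1;
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("%.0f ns per switch\n",
		       ((t1.tv_sec - t0.tv_sec) * 1e9 +
			(t1.tv_nsec - t0.tv_nsec)) / (2.0 * ITERS));
		return 0;
	}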


Re: [PATCH v4 00/10] PCID and improved laziness

2017-07-11 Thread Matt Fleming
On Fri, 30 Jun, at 01:44:22PM, Matt Fleming wrote:
> On Thu, 29 Jun, at 08:53:12AM, Andy Lutomirski wrote:
> > *** Ingo, even if this misses 4.13, please apply the first patch before
> > *** the merge window.
> > 
> > There are three performance benefits here:
> > 
> > 1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
> >This avoids many of them when switching tasks by using PCID.  In
> >a stupid little benchmark I did, it saves about 100ns on my laptop
> >per context switch.  I'll try to improve that benchmark.
> > 
> > 2. Mms that have been used recently on a given CPU might get to keep
> >their TLB entries alive across process switches with this patch
> >set.  TLB fills are pretty fast on modern CPUs, but they're even
> >faster when they don't happen.
> > 
> > 3. Lazy TLB is way better.  We used to do two stupid things when we
> >ran kernel threads: we'd send IPIs to flush user contexts on their
> >CPUs and then we'd write to CR3 for no particular reason as an excuse
> >to stop further IPIs.  With this patch, we do neither.
> 
> Heads up, I'm gonna queue this for a run on SUSE's performance test
> grid.

FWIW, I didn't see any change in performance with this series on a
PCID-capable machine. On the plus side, I didn't see any weird-looking
bugs either.

Are your benchmarks available anywhere?

