from:"tip\-bot for Matt Fleming"

[tip:efi/core] MAINTAINERS: Remove Matt Fleming as EFI co-maintainer

2018-01-03 Thread tip-bot for Matt Fleming

Commit-ID:  81b60dbff04980a45b348c5b5eeca2713d4594ca
Gitweb: https://git.kernel.org/tip/81b60dbff04980a45b348c5b5eeca2713d4594ca
Author: Matt Fleming 
AuthorDate: Wed, 3 Jan 2018 09:44:17 +
Committer:  Ingo Molnar 
CommitDate: Wed, 3 Jan 2018 14:03:18 +0100

MAINTAINERS: Remove Matt Fleming as EFI co-maintainer

Instate Ard Biesheuvel as the sole EFI maintainer and leave other folks
as maintainers for the EFI test driver and efivarfs file system.

Also add Ard Biesheuvel as the EFI test driver and efivarfs maintainer.

Signed-off-by: Matt Fleming 
Cc: Ard Biesheuvel 
Cc: Ivan Hu 
Cc: Jeremy Kerr 
Cc: Linus Torvalds 
Cc: Matthew Garrett 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/20180103094417.6353-1-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 MAINTAINERS | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index b46c9ce..95c3fa1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5149,15 +5149,15 @@ F:  sound/usb/misc/ua101.c
 EFI TEST DRIVER
 L: linux-...@vger.kernel.org
 M: Ivan Hu 
-M: Matt Fleming 
+M: Ard Biesheuvel 
 S: Maintained
 F: drivers/firmware/efi/test/
 
 EFI VARIABLE FILESYSTEM
 M: Matthew Garrett 
 M: Jeremy Kerr 
-M: Matt Fleming 
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git
+M: Ard Biesheuvel 
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git
 L: linux-...@vger.kernel.org
 S: Maintained
 F: fs/efivarfs/
@@ -5318,7 +5318,6 @@ S:Supported
 F: security/integrity/evm/
 
 EXTENSIBLE FIRMWARE INTERFACE (EFI)
-M: Matt Fleming 
 M: Ard Biesheuvel 
 L: linux-...@vger.kernel.org
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git

[tip:sched/core] sched/loadavg: Use {READ,WRITE}_ONCE() for sample window

2017-03-16 Thread tip-bot for Matt Fleming

Commit-ID:  caeb5882979bc6f3c8766fcf59c6269b38f521bc
Gitweb: http://git.kernel.org/tip/caeb5882979bc6f3c8766fcf59c6269b38f521bc
Author: Matt Fleming 
AuthorDate: Fri, 17 Feb 2017 12:07:31 +
Committer:  Ingo Molnar 
CommitDate: Thu, 16 Mar 2017 09:21:01 +0100

sched/loadavg: Use {READ,WRITE}_ONCE() for sample window

'calc_load_update' is accessed without any kind of locking and there's
a clear assumption in the code that only a single value is read or
written.

Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
unintentionally seeing multiple values, or having the loads/stores
split.

Technically the loads in calc_global_*() don't require this since
those are the only functions that update 'calc_load_update', but I've
added the READ_ONCE() for consistency.

Suggested-by: Peter Zijlstra 
Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Mike Galbraith 
Cc: Morten Rasmussen 
Cc: Thomas Gleixner 
Cc: Vincent Guittot 
Link: http://lkml.kernel.org/r/20170217120731.11868-3-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/sched/loadavg.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 3a55f3f..f15fb2b 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -169,7 +169,7 @@ static inline int calc_load_write_idx(void)
 * If the folding window started, make sure we start writing in the
 * next idle-delta.
 */
-   if (!time_before(jiffies, calc_load_update))
+   if (!time_before(jiffies, READ_ONCE(calc_load_update)))
idx++;
 
return idx & 1;
@@ -204,7 +204,7 @@ void calc_load_exit_idle(void)
/*
 * If we're still before the pending sample window, we're done.
 */
-   this_rq->calc_load_update = calc_load_update;
+   this_rq->calc_load_update = READ_ONCE(calc_load_update);
if (time_before(jiffies, this_rq->calc_load_update))
return;
 
@@ -308,13 +308,15 @@ calc_load_n(unsigned long load, unsigned long exp,
  */
 static void calc_global_nohz(void)
 {
+   unsigned long sample_window;
long delta, active, n;
 
-   if (!time_before(jiffies, calc_load_update + 10)) {
+   sample_window = READ_ONCE(calc_load_update);
+   if (!time_before(jiffies, sample_window + 10)) {
/*
 * Catch-up, fold however many we are behind still
 */
-   delta = jiffies - calc_load_update - 10;
+   delta = jiffies - sample_window - 10;
n = 1 + (delta / LOAD_FREQ);
 
active = atomic_long_read(&calc_load_tasks);
@@ -324,7 +326,7 @@ static void calc_global_nohz(void)
avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
 
-   calc_load_update += n * LOAD_FREQ;
+   WRITE_ONCE(calc_load_update, sample_window + n * LOAD_FREQ);
}
 
/*
@@ -352,9 +354,11 @@ static inline void calc_global_nohz(void) { }
  */
 void calc_global_load(unsigned long ticks)
 {
+   unsigned long sample_window;
long active, delta;
 
-   if (time_before(jiffies, calc_load_update + 10))
+   sample_window = READ_ONCE(calc_load_update);
+   if (time_before(jiffies, sample_window + 10))
return;
 
/*
@@ -371,7 +375,7 @@ void calc_global_load(unsigned long ticks)
avenrun[1] = calc_load(avenrun[1], EXP_5, active);
avenrun[2] = calc_load(avenrun[2], EXP_15, active);
 
-   calc_load_update += LOAD_FREQ;
+   WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ);
 
/*
 * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk.
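
The single-snapshot pattern this patch introduces can be sketched in user space. This is a minimal illustration under stated assumptions, not the kernel code: the READ_ONCE()/WRITE_ONCE() definitions below are simplified volatile-cast approximations that rely on the GCC/Clang `__typeof__` extension, `HZ` is an arbitrary value, and `advance_sample_window()` and `demo()` are hypothetical helpers that exist only for this example.

```c
#include <assert.h>

/*
 * Simplified stand-ins for the kernel macros (assumption: a volatile
 * access to an aligned unsigned long compiles to one load or store).
 */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

#define HZ        100              /* illustrative tick rate */
#define LOAD_FREQ (5 * HZ + 1)     /* sample window length in ticks */

/* User-space stand-ins for the kernel variables of the same name. */
static unsigned long jiffies;
static unsigned long calc_load_update;

/* Wrap-safe "a is before b", modelled on the kernel's time_before(). */
static int time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Hypothetical helper: read the shared window exactly once, make the
 * decision on that snapshot, and derive the new value from the same
 * snapshot. Returns 1 if the window was advanced, 0 otherwise.
 */
static int advance_sample_window(void)
{
	unsigned long sample_window = READ_ONCE(calc_load_update);

	if (time_before(jiffies, sample_window + 10))
		return 0;

	WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ);
	return 1;
}

/* Usage sketch: an expired window is advanced by one LOAD_FREQ. */
static unsigned long demo(void)
{
	jiffies = 1000;
	calc_load_update = 900;
	assert(advance_sample_window() == 1);
	return calc_load_update;
}
```

Because the comparison and the store both use `sample_window`, a concurrent reader can only ever observe the old or the new value of `calc_load_update`, which is the assumption the commit message describes.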

[tip:sched/core] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting

2017-03-16 Thread tip-bot for Matt Fleming

Commit-ID:  6e5f32f7a43f45ee55c401c0b9585eb01f9629a8
Gitweb: http://git.kernel.org/tip/6e5f32f7a43f45ee55c401c0b9585eb01f9629a8
Author: Matt Fleming 
AuthorDate: Fri, 17 Feb 2017 12:07:30 +
Committer:  Ingo Molnar 
CommitDate: Thu, 16 Mar 2017 09:21:00 +0100

sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting

If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.

This situation on exiting NO_HZ is described by:

  this_rq->calc_load_update < jiffies < calc_load_update

In this scenario, what we should be doing is:

  this_rq->calc_load_update = calc_load_update   [ next window ]

But what we actually do is:

  this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]

This has the effect of delaying load average updates for potentially
up to ~9 seconds.

This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.

It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().

This issue is easy to reproduce before,

  commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Mike Galbraith 
Cc: Morten Rasmussen 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vincent Guittot 
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/sched/loadavg.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 7296b73..3a55f3f 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -202,8 +202,9 @@ void calc_load_exit_idle(void)
struct rq *this_rq = this_rq();
 
/*
-* If we're still before the sample window, we're done.
+* If we're still before the pending sample window, we're done.
 */
+   this_rq->calc_load_update = calc_load_update;
if (time_before(jiffies, this_rq->calc_load_update))
return;
 
@@ -212,7 +213,6 @@ void calc_load_exit_idle(void)
 * accounted through the nohz accounting, so skip the entire deal and
 * sync up for the next window.
 */
-   this_rq->calc_load_update = calc_load_update;
if (time_before(jiffies, this_rq->calc_load_update + 10))
this_rq->calc_load_update += LOAD_FREQ;
 }
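
The effect of moving the assignment can be checked with a small user-space model. This is a sketch under stated assumptions rather than the kernel function: `exit_idle_old()` and `exit_idle_new()` are hypothetical pure functions that take the per-runqueue and global window values as arguments and return the resulting per-runqueue value, `HZ` is an arbitrary choice, and `time_before()` is a simplified version of the kernel macro.

```c
#include <assert.h>

#define HZ        100              /* illustrative tick rate */
#define LOAD_FREQ (5 * HZ + 1)     /* sample window length in ticks */

/* Wrap-safe "a is before b", modelled on the kernel's time_before(). */
static int time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Ordering before the patch: test the stale per-rq window, then sync. */
static unsigned long exit_idle_old(unsigned long jiffies,
				   unsigned long rq_update,
				   unsigned long global_update)
{
	if (time_before(jiffies, rq_update))
		return rq_update;

	rq_update = global_update;
	if (time_before(jiffies, rq_update + 10))
		rq_update += LOAD_FREQ;
	return rq_update;
}

/* Ordering after the patch: sync first, then test the pending window. */
static unsigned long exit_idle_new(unsigned long jiffies,
				   unsigned long rq_update,
				   unsigned long global_update)
{
	rq_update = global_update;
	if (time_before(jiffies, rq_update))
		return rq_update;

	if (time_before(jiffies, rq_update + 10))
		rq_update += LOAD_FREQ;
	return rq_update;
}

/*
 * Usage sketch for the scenario in the commit message:
 * this_rq->calc_load_update < jiffies < calc_load_update.
 * Returns how much later the old ordering schedules the next update.
 */
static unsigned long demo_extra_delay(void)
{
	unsigned long rq_update = 1000;
	unsigned long global_update = rq_update + LOAD_FREQ;
	unsigned long jiffies = 1200;

	assert(time_before(rq_update, jiffies));
	assert(time_before(jiffies, global_update));

	return exit_idle_old(jiffies, rq_update, global_update) -
	       exit_idle_new(jiffies, rq_update, global_update);
}
```

In this model the old ordering lands one full `LOAD_FREQ` later than the new one, matching the "next+1 window" case described above. The two orderings agree once `jiffies` has reached the global window.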

[tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls

2017-01-14 Thread tip-bot for Matt Fleming

Commit-ID:  cb42c9a3e23448c3f9a25417fae6309b1a92
Gitweb: http://git.kernel.org/tip/cb42c9a3e23448c3f9a25417fae6309b1a92
Author: Matt Fleming 
AuthorDate: Wed, 21 Sep 2016 14:38:13 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 14 Jan 2017 11:29:35 +0100

sched/core: Add debugging code to catch missing update_rq_clock() calls

There are no diagnostic checks for figuring out when we've accidentally
missed update_rq_clock() calls. Let's add some by piggybacking on the
rq_*pin_lock() wrappers.

The idea behind the diagnostic checks is that upon pinning the rq lock the
rq clock should be updated, via update_rq_clock(), before anybody
reads the clock with rq_clock() or rq_clock_task().

The exception to this rule is when updates have explicitly been
disabled with the rq_clock_skip_update() optimisation.

There are some functions that only unpin the rq lock in order to grab
some other lock and avoid deadlock. In that case we don't need to
update the clock again and the previous diagnostic state can be
carried over in rq_repin_lock() by saving the state in the rq_flags
context.

Since this patch adds a new clock update flag and some already exist
in rq::clock_skip_update, that field has now been renamed. An attempt
has been made to keep the flag manipulation code small and fast since
it's used in the heart of the __schedule() fast path.

For the !CONFIG_SCHED_DEBUG case the only object code change (other
than addresses) is the following change to reset RQCF_ACT_SKIP inside
of __schedule(),

  -   c7 83 38 09 00 00 00    movl   $0x0,0x938(%rbx)
  -   00 00 00
  +   83 a3 38 09 00 00 fc    andl   $0xfffc,0x938(%rbx)

Suggested-by: Peter Zijlstra 
Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Byungchul Park 
Cc: Frederic Weisbecker 
Cc: Jan Kara 
Cc: Linus Torvalds 
Cc: Luca Abeni 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Mike Galbraith 
Cc: Petr Mladek 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Cc: Yuyang Du 
Link: http://lkml.kernel.org/r/20160921133813.31976-8-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c  | 11 +---
 kernel/sched/sched.h | 74 +++-
 2 files changed, 75 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d233892..a129b34 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -102,9 +102,12 @@ void update_rq_clock(struct rq *rq)
 
lockdep_assert_held(&rq->lock);
 
-   if (rq->clock_skip_update & RQCF_ACT_SKIP)
+   if (rq->clock_update_flags & RQCF_ACT_SKIP)
return;
 
+#ifdef CONFIG_SCHED_DEBUG
+   rq->clock_update_flags |= RQCF_UPDATED;
+#endif
delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
if (delta < 0)
return;
@@ -2889,7 +2892,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
rq->prev_mm = oldmm;
}
 
-   rq->clock_skip_update = 0;
+   rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
 
/*
 * Since the runqueue lock will be released by the next
@@ -3364,7 +3367,7 @@ static void __sched notrace __schedule(bool preempt)
raw_spin_lock(&rq->lock);
rq_pin_lock(rq, &rf);
 
-   rq->clock_skip_update <<= 1; /* promote REQ to ACT */
+   rq->clock_update_flags <<= 1; /* promote REQ to ACT */
 
switch_count = &prev->nivcsw;
if (!preempt && prev->state) {
@@ -3405,7 +3408,7 @@ static void __sched notrace __schedule(bool preempt)
trace_sched_switch(preempt, prev, next);
rq = context_switch(rq, prev, next, &rf); /* unlocks the rq */
} else {
-   rq->clock_skip_update = 0;
+   rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
rq_unpin_lock(rq, &rf);
raw_spin_unlock_irq(&rq->lock);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 98e7eee..6eeae7e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -644,7 +644,7 @@ struct rq {
unsigned long next_balance;
struct mm_struct *prev_mm;
 
-   unsigned int clock_skip_update;
+   unsigned int clock_update_flags;
u64 clock;
u64 clock_task;
 
@@ -768,48 +768,110 @@ static inline u64 __rq_clock_broken(struct rq *rq)
return READ_ONCE(rq->clock);
 }
 
+/*
+ * rq::clock_update_flags bits
+ *
+ * %RQCF_REQ_SKIP - will request skipping of clock update on the next
+ *  call to __schedule(). This is an optimisation to avoid
+ *  neighbouring rq clock updates.
+ *
+ * %RQCF_ACT_SKIP - is set from inside of __schedule() when skipping is
+ *  in effect and calls to update_rq_clock() are being ignored.
+ *
+ * %RQCF_UPDATED - is a debug flag that indicates whether a call has been
+ *  made to update_rq_clock() since the last time rq::lock was pinned.
+ *
+ * If inside of __schedule(), clock_update_flags will have been
+ * shifted left (a
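
The flag handling described in this message can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the kernel implementation: the values of RQCF_REQ_SKIP and RQCF_ACT_SKIP follow from the "promote REQ to ACT" shift in the diff, the value of RQCF_UPDATED is an assumption, `struct rq_model` and every `model_*()` function are hypothetical, and the model always maintains the debug flag whereas the patch does so only under CONFIG_SCHED_DEBUG.

```c
#include <assert.h>

#define RQCF_REQ_SKIP 0x01u
#define RQCF_ACT_SKIP 0x02u   /* REQ_SKIP shifted left by one */
#define RQCF_UPDATED  0x04u   /* assumed value for the debug flag */

/* Hypothetical reduced runqueue: only the fields this sketch needs. */
struct rq_model {
	unsigned int clock_update_flags;
	unsigned long long clock;
};

/* Pinning starts a new context: forget any earlier "updated" state. */
static void model_pin_lock(struct rq_model *rq)
{
	rq->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
}

/* Update the clock unless skipping is active; record that we did. */
static void model_update_rq_clock(struct rq_model *rq, unsigned long long now)
{
	if (rq->clock_update_flags & RQCF_ACT_SKIP)
		return;

	rq->clock_update_flags |= RQCF_UPDATED;
	rq->clock = now;
}

/* A clock read is fine if an update happened or skipping is in effect. */
static int model_clock_read_is_safe(const struct rq_model *rq)
{
	return (rq->clock_update_flags & (RQCF_ACT_SKIP | RQCF_UPDATED)) != 0;
}

/* Entry to the scheduler: pin, then promote a requested skip to active. */
static void model_schedule_enter(struct rq_model *rq)
{
	model_pin_lock(rq);
	assert(!(rq->clock_update_flags & RQCF_ACT_SKIP));
	rq->clock_update_flags <<= 1;
}

/* Before unpinning: drop both skip bits, as the patched code does. */
static void model_schedule_exit(struct rq_model *rq)
{
	rq->clock_update_flags &= ~(RQCF_ACT_SKIP | RQCF_REQ_SKIP);
}
```

A caller that pins the lock and reads the clock without first calling `model_update_rq_clock()` makes `model_clock_read_is_safe()` return 0, which is the situation the new diagnostics are meant to flag.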

[tip:sched/core] sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock

2017-01-14 Thread tip-bot for Matt Fleming

Commit-ID:  92509b732baf14c59ca702307270cfaa3a585ae7
Gitweb: http://git.kernel.org/tip/92509b732baf14c59ca702307270cfaa3a585ae7
Author: Matt Fleming 
AuthorDate: Wed, 21 Sep 2016 14:38:11 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 14 Jan 2017 11:29:31 +0100

sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock

rq_clock() is called from sched_info_{depart,arrive}() after resetting
RQCF_ACT_SKIP but prior to a call to update_rq_clock().

In preparation for pending patches that check whether the rq clock has
been updated inside of a pin context before rq_clock() is called, move
the reset of rq->clock_skip_update immediately before unpinning the rq
lock.

This will avoid the new warnings which check if update_rq_clock() is
being actively skipped.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Byungchul Park 
Cc: Frederic Weisbecker 
Cc: Jan Kara 
Cc: Linus Torvalds 
Cc: Luca Abeni 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Petr Mladek 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Cc: Yuyang Du 
Link: http://lkml.kernel.org/r/20160921133813.31976-6-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41df935..311460b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2887,6 +2887,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
prev->active_mm = NULL;
rq->prev_mm = oldmm;
}
+
+   rq->clock_skip_update = 0;
+
/*
 * Since the runqueue lock will be released by the next
 * task (which is an invalid locking op but in the case
@@ -3392,7 +3395,6 @@ static void __sched notrace __schedule(bool preempt)
next = pick_next_task(rq, prev, &rf);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
-   rq->clock_skip_update = 0;
 
if (likely(prev != next)) {
rq->nr_switches++;
@@ -3402,6 +3404,7 @@ static void __sched notrace __schedule(bool preempt)
trace_sched_switch(preempt, prev, next);
rq = context_switch(rq, prev, next, &rf); /* unlocks the rq */
} else {
+   rq->clock_skip_update = 0;
rq_unpin_lock(rq, &rf);
raw_spin_unlock_irq(&rq->lock);
}

[tip:sched/core] sched/core: Add wrappers for lockdep_(un)pin_lock()

2017-01-14 Thread tip-bot for Matt Fleming

Commit-ID:  d8ac897137a230ec351269f6378017f2decca512
Gitweb: http://git.kernel.org/tip/d8ac897137a230ec351269f6378017f2decca512
Author: Matt Fleming 
AuthorDate: Wed, 21 Sep 2016 14:38:10 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 14 Jan 2017 11:29:30 +0100

sched/core: Add wrappers for lockdep_(un)pin_lock()

In preparation for adding diagnostic checks to catch missing calls to
update_rq_clock(), provide wrappers for (re)pinning and unpinning
rq->lock.

Because the pending diagnostic checks allow state to be maintained in
rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
for 'struct rq_flags *'.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Byungchul Park 
Cc: Frederic Weisbecker 
Cc: Jan Kara 
Cc: Linus Torvalds 
Cc: Luca Abeni 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Petr Mladek 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Cc: Yuyang Du 
Link: http://lkml.kernel.org/r/20160921133813.31976-5-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c  | 80 
 kernel/sched/deadline.c  | 10 +++---
 kernel/sched/fair.c  |  6 ++--
 kernel/sched/idle_task.c |  2 +-
 kernel/sched/rt.c|  6 ++--
 kernel/sched/sched.h | 31 ++-
 kernel/sched/stop_task.c |  2 +-
 7 files changed, 76 insertions(+), 61 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c56fb57..41df935 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -185,7 +185,7 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
rq = task_rq(p);
raw_spin_lock(&rq->lock);
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
-   rf->cookie = lockdep_pin_lock(&rq->lock);
+   rq_pin_lock(rq, rf);
return rq;
}
raw_spin_unlock(&rq->lock);
@@ -225,7 +225,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 * pair with the WMB to ensure we must then also see migrating.
 */
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
-   rf->cookie = lockdep_pin_lock(&rq->lock);
+   rq_pin_lock(rq, rf);
return rq;
}
raw_spin_unlock(&rq->lock);
@@ -1195,9 +1195,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 * OK, since we're going to drop the lock immediately
 * afterwards anyway.
 */
-   lockdep_unpin_lock(&rq->lock, rf.cookie);
+   rq_unpin_lock(rq, &rf);
rq = move_queued_task(rq, p, dest_cpu);
-   lockdep_repin_lock(&rq->lock, rf.cookie);
+   rq_repin_lock(rq, &rf);
}
 out:
task_rq_unlock(rq, p, &rf);
@@ -1690,7 +1690,7 @@ static inline void ttwu_activate(struct rq *rq, struct 
task_struct *p, int en_fl
  * Mark the task runnable and perform wakeup-preemption.
  */
 static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int 
wake_flags,
-  struct pin_cookie cookie)
+  struct rq_flags *rf)
 {
check_preempt_curr(rq, p, wake_flags);
p->state = TASK_RUNNING;
@@ -1702,9 +1702,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct 
task_struct *p, int wake_flags,
 * Our task @p is fully woken up and running; so its safe to
 * drop the rq->lock, hereafter rq is only used for statistics.
 */
-   lockdep_unpin_lock(>lock, cookie);
+   rq_unpin_lock(rq, rf);
p->sched_class->task_woken(rq, p);
-   lockdep_repin_lock(>lock, cookie);
+   rq_repin_lock(rq, rf);
}
 
if (rq->idle_stamp) {
@@ -1723,7 +1723,7 @@ static void ttwu_do_wakeup(struct rq *rq, struct 
task_struct *p, int wake_flags,
 
 static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
-struct pin_cookie cookie)
+struct rq_flags *rf)
 {
int en_flags = ENQUEUE_WAKEUP;
 
@@ -1738,7 +1738,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 #endif
 
ttwu_activate(rq, p, en_flags);
-   ttwu_do_wakeup(rq, p, wake_flags, cookie);
+   ttwu_do_wakeup(rq, p, wake_flags, rf);
 }
 
 /*
@@

[tip:sched/core] sched/core: Add wrappers for lockdep_(un)pin_lock()

2017-01-14 Thread tip-bot for Matt Fleming

Commit-ID:  d8ac897137a230ec351269f6378017f2decca512
Gitweb: http://git.kernel.org/tip/d8ac897137a230ec351269f6378017f2decca512
Author: Matt Fleming 
AuthorDate: Wed, 21 Sep 2016 14:38:10 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 14 Jan 2017 11:29:30 +0100

sched/core: Add wrappers for lockdep_(un)pin_lock()

In preparation for adding diagnostic checks to catch missing calls to
update_rq_clock(), provide wrappers for (re)pinning and unpinning
rq->lock.

Because the pending diagnostic checks allow state to be maintained in
rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
for 'struct rq_flags *'.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Byungchul Park 
Cc: Frederic Weisbecker 
Cc: Jan Kara 
Cc: Linus Torvalds 
Cc: Luca Abeni 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Petr Mladek 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Cc: Yuyang Du 
Link: http://lkml.kernel.org/r/20160921133813.31976-5-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c  | 80 
 kernel/sched/deadline.c  | 10 +++---
 kernel/sched/fair.c  |  6 ++--
 kernel/sched/idle_task.c |  2 +-
 kernel/sched/rt.c|  6 ++--
 kernel/sched/sched.h | 31 ++-
 kernel/sched/stop_task.c |  2 +-
 7 files changed, 76 insertions(+), 61 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c56fb57..41df935 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -185,7 +185,7 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
rq = task_rq(p);
raw_spin_lock(&rq->lock);
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
-   rf->cookie = lockdep_pin_lock(&rq->lock);
+   rq_pin_lock(rq, rf);
return rq;
}
raw_spin_unlock(&rq->lock);
@@ -225,7 +225,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 * pair with the WMB to ensure we must then also see migrating.
 */
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
-   rf->cookie = lockdep_pin_lock(&rq->lock);
+   rq_pin_lock(rq, rf);
return rq;
}
raw_spin_unlock(&rq->lock);
@@ -1195,9 +1195,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 * OK, since we're going to drop the lock immediately
 * afterwards anyway.
 */
-   lockdep_unpin_lock(&rq->lock, rf.cookie);
+   rq_unpin_lock(rq, &rf);
rq = move_queued_task(rq, p, dest_cpu);
-   lockdep_repin_lock(&rq->lock, rf.cookie);
+   rq_repin_lock(rq, &rf);
}
 out:
task_rq_unlock(rq, p, &rf);
@@ -1690,7 +1690,7 @@ static inline void ttwu_activate(struct rq *rq, struct task_struct *p, int en_fl
  * Mark the task runnable and perform wakeup-preemption.
  */
 static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
-  struct pin_cookie cookie)
+  struct rq_flags *rf)
 {
check_preempt_curr(rq, p, wake_flags);
p->state = TASK_RUNNING;
@@ -1702,9 +1702,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 * Our task @p is fully woken up and running; so its safe to
 * drop the rq->lock, hereafter rq is only used for statistics.
 */
-   lockdep_unpin_lock(&rq->lock, cookie);
+   rq_unpin_lock(rq, rf);
p->sched_class->task_woken(rq, p);
-   lockdep_repin_lock(&rq->lock, cookie);
+   rq_repin_lock(rq, rf);
}
 
if (rq->idle_stamp) {
@@ -1723,7 +1723,7 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 
 static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
-struct pin_cookie cookie)
+struct rq_flags *rf)
 {
int en_flags = ENQUEUE_WAKEUP;
 
@@ -1738,7 +1738,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 #endif
 
ttwu_activate(rq, p, en_flags);
-   ttwu_do_wakeup(rq, p, wake_flags, cookie);
+   ttwu_do_wakeup(rq, p, wake_flags, rf);
 }
 
 /*
@@ -1757,7 +1757,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
if (task_on_rq_queued(p)) {
/* check_preempt_curr() may use rq clock */
update_rq_clock(rq);
-   ttwu_do_wakeup(rq, p, wake_flags, rf.cookie);
+   ttwu_do_wakeup(rq, p, wake_flags, &rf);
ret = 1;
}
__task_rq_unlock(rq, &rf);
@@ -1770,15 +1770,15 @@ void
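
The wrappers described in the commit message above can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the kernel's code: the sched.h hunk is not shown in this archive, so the wrapper bodies are inferred from the call sites in the diff, and `struct fake_lock`, the toy `lockdep_*` functions, and `demo_pin_cycle()` are hypothetical stand-ins that merely count pins so the example compiles without kernel headers.

```c
#include <assert.h>

/* Simplified stand-ins for kernel types; all are assumptions for illustration. */
struct pin_cookie { unsigned int val; };
struct fake_lock  { int pin_count; unsigned int next_cookie; };
struct rq         { struct fake_lock lock; };
struct rq_flags   { unsigned long flags; struct pin_cookie cookie; };

/* Toy versions of the lockdep pin API: they only track a pin count. */
static struct pin_cookie lockdep_pin_lock(struct fake_lock *l)
{
	struct pin_cookie c = { ++l->next_cookie };

	l->pin_count++;
	return c;
}

static void lockdep_unpin_lock(struct fake_lock *l, struct pin_cookie c)
{
	(void)c;
	l->pin_count--;
}

static void lockdep_repin_lock(struct fake_lock *l, struct pin_cookie c)
{
	(void)c;
	l->pin_count++;
}

/* The wrappers: callers hand over the whole rq_flags, not a bare cookie. */
static void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
	rf->cookie = lockdep_pin_lock(&rq->lock);
}

static void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
{
	lockdep_unpin_lock(&rq->lock, rf->cookie);
}

static void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
{
	lockdep_repin_lock(&rq->lock, rf->cookie);
}

/* Hypothetical helper: pin, temporarily unpin, repin; returns the final pin count. */
static int demo_pin_cycle(void)
{
	struct rq rq = { { 0, 0 } };
	struct rq_flags rf = { 0, { 0 } };

	rq_pin_lock(&rq, &rf);
	assert(rq.lock.pin_count == 1);
	rq_unpin_lock(&rq, &rf);
	assert(rq.lock.pin_count == 0);
	rq_repin_lock(&rq, &rf);
	return rq.lock.pin_count;
}
```

Passing `struct rq_flags *` rather than the cookie leaves room for extra per-pin-context state to be added to `rq_flags` later without touching every call site again, which is the motivation the commit message gives.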

[tip:sched/core] sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock

2017-01-14 Thread tip-bot for Matt Fleming

Commit-ID:  92509b732baf14c59ca702307270cfaa3a585ae7
Gitweb: http://git.kernel.org/tip/92509b732baf14c59ca702307270cfaa3a585ae7
Author: Matt Fleming 
AuthorDate: Wed, 21 Sep 2016 14:38:11 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 14 Jan 2017 11:29:31 +0100

sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock

rq_clock() is called from sched_info_{depart,arrive}() after resetting
RQCF_ACT_SKIP but prior to a call to update_rq_clock().

In preparation for pending patches that check whether the rq clock has
been updated inside of a pin context before rq_clock() is called, move
the reset of rq->clock_skip_update immediately before unpinning the rq
lock.

This will avoid the new warnings which check if update_rq_clock() is
being actively skipped.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Byungchul Park 
Cc: Frederic Weisbecker 
Cc: Jan Kara 
Cc: Linus Torvalds 
Cc: Luca Abeni 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Petr Mladek 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Cc: Yuyang Du 
Link: http://lkml.kernel.org/r/20160921133813.31976-6-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41df935..311460b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2887,6 +2887,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
prev->active_mm = NULL;
rq->prev_mm = oldmm;
}
+
+   rq->clock_skip_update = 0;
+
/*
 * Since the runqueue lock will be released by the next
 * task (which is an invalid locking op but in the case
@@ -3392,7 +3395,6 @@ static void __sched notrace __schedule(bool preempt)
next = pick_next_task(rq, prev, &rf);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
-   rq->clock_skip_update = 0;
 
if (likely(prev != next)) {
rq->nr_switches++;
@@ -3402,6 +3404,7 @@ static void __sched notrace __schedule(bool preempt)
trace_sched_switch(preempt, prev, next);
rq = context_switch(rq, prev, next, &rf); /* unlocks the rq */
} else {
+   rq->clock_skip_update = 0;
rq_unpin_lock(rq, &rf);
raw_spin_unlock_irq(&rq->lock);
}
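
Why the ordering matters can be sketched with a toy model of the pending diagnostic. Everything below is an assumption made for illustration: the check in `rq_clock()`, the `clock_updated` and `warnings` fields, and the two `schedule_*_order()` helpers are hypothetical, and the flag values are placeholders. The only point is that a clock read that happens while the skip flag is still set does not trip the check, whereas one that happens after the flag was cleared does.

```c
#include <assert.h>
#include <stdbool.h>

#define RQCF_REQ_SKIP 0x01 /* placeholder values for this sketch */
#define RQCF_ACT_SKIP 0x02

struct rq {
	unsigned int clock_skip_update;
	bool clock_updated;  /* hypothetical: update_rq_clock() ran in this pin context */
	int warnings;        /* hypothetical: number of diagnostic hits */
	unsigned long long clock;
};

/* Toy diagnostic: complain unless the clock was updated or a skip is active. */
static unsigned long long rq_clock(struct rq *rq)
{
	if (!rq->clock_updated && !(rq->clock_skip_update & RQCF_ACT_SKIP))
		rq->warnings++;
	return rq->clock;
}

/* Old ordering: the flag is reset first, then the clock is read. */
static int schedule_old_order(void)
{
	struct rq rq = { .clock_skip_update = RQCF_ACT_SKIP };

	rq.clock_skip_update = 0;  /* reset early */
	(void)rq_clock(&rq);       /* stands in for sched_info_{depart,arrive}() */
	return rq.warnings;
}

/* New ordering: the clock is read first, the reset happens just before unpinning. */
static int schedule_new_order(void)
{
	struct rq rq = { .clock_skip_update = RQCF_ACT_SKIP };

	(void)rq_clock(&rq);
	rq.clock_skip_update = 0;  /* reset immediately before the unpin */
	return rq.warnings;
}
```

In this model the old ordering produces one warning and the new ordering produces none.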

[tip:sched/core] sched/fair: Push rq lock pin/unpin into idle_balance()

2017-01-14 Thread tip-bot for Matt Fleming

Commit-ID:  46f69fa33712ad12ccaa723e46ed5929ee93589b
Gitweb: http://git.kernel.org/tip/46f69fa33712ad12ccaa723e46ed5929ee93589b
Author: Matt Fleming 
AuthorDate: Wed, 21 Sep 2016 14:38:12 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 14 Jan 2017 11:29:32 +0100

sched/fair: Push rq lock pin/unpin into idle_balance()

Future patches will emit warnings if rq_clock() is called before
update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair.

Since there is only one caller of idle_balance() we can push the
unpin/repin there.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Byungchul Park 
Cc: Frederic Weisbecker 
Cc: Jan Kara 
Cc: Linus Torvalds 
Cc: Luca Abeni 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Petr Mladek 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Cc: Yuyang Du 
Link: http://lkml.kernel.org/r/20160921133813.31976-7-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4904412..faf80e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3424,7 +3424,7 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
return cfs_rq->avg.load_avg;
 }
 
-static int idle_balance(struct rq *this_rq);
+static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
 
 #else /* CONFIG_SMP */
 
@@ -3453,7 +3453,7 @@ attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
 detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 
-static inline int idle_balance(struct rq *rq)
+static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
 {
return 0;
 }
@@ -6320,15 +6320,8 @@ simple:
return p;
 
 idle:
-   /*
-* This is OK, because current is on_cpu, which avoids it being picked
-* for load-balance and preemption/IRQs are still disabled avoiding
-* further scheduler activity on it and we're being very careful to
-* re-start the picking loop.
-*/
-   rq_unpin_lock(rq, rf);
-   new_tasks = idle_balance(rq);
-   rq_repin_lock(rq, rf);
+   new_tasks = idle_balance(rq, rf);
+
/*
 * Because idle_balance() releases (and re-acquires) rq->lock, it is
 * possible for any higher priority task to appear. In that case we
@@ -8297,7 +8290,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-static int idle_balance(struct rq *this_rq)
+static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
unsigned long next_balance = jiffies + HZ;
int this_cpu = this_rq->cpu;
@@ -8311,6 +8304,14 @@ static int idle_balance(struct rq *this_rq)
 */
this_rq->idle_stamp = rq_clock(this_rq);
 
+   /*
+* This is OK, because current is on_cpu, which avoids it being picked
+* for load-balance and preemption/IRQs are still disabled avoiding
+* further scheduler activity on it and we're being very careful to
+* re-start the picking loop.
+*/
+   rq_unpin_lock(this_rq, rf);
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
!this_rq->rd->overload) {
rcu_read_lock();
@@ -8388,6 +8389,8 @@ out:
if (pulled_task)
this_rq->idle_stamp = 0;
 
+   rq_repin_lock(this_rq, rf);
+
return pulled_task;
 }
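
The effect of moving the unpin/repin into the callee can be modelled roughly as follows. This is a simplified sketch, not the kernel implementation: the `pinned` flag, the `reads_while_unpinned` counter, and `demo_idle_balance()` are hypothetical, and counting clock reads made outside a pin context is a stand-in for the real diagnostic, which is not part of this patch.

```c
#include <assert.h>
#include <stdbool.h>

struct rq {
	bool pinned;                   /* hypothetical: inside a pin context? */
	unsigned long long clock;
	unsigned long long idle_stamp;
	int reads_while_unpinned;      /* hypothetical diagnostic counter */
};
struct rq_flags { int unused; };

static void rq_unpin_lock(struct rq *rq, struct rq_flags *rf) { (void)rf; rq->pinned = false; }
static void rq_repin_lock(struct rq *rq, struct rq_flags *rf) { (void)rf; rq->pinned = true; }

static unsigned long long rq_clock(struct rq *rq)
{
	if (!rq->pinned)
		rq->reads_while_unpinned++;
	return rq->clock;
}

/* The callee now owns the unpin/repin, so the clock read happens while pinned. */
static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
{
	int pulled_task = 0;

	this_rq->idle_stamp = rq_clock(this_rq); /* still inside the pin context */
	rq_unpin_lock(this_rq, rf);
	/* ... load balancing that may drop and retake rq->lock is elided ... */
	rq_repin_lock(this_rq, rf);
	return pulled_task;
}

/* Hypothetical driver: returns how many clock reads happened while unpinned. */
static int demo_idle_balance(void)
{
	struct rq rq = { .pinned = true, .clock = 42 };
	struct rq_flags rf = { 0 };

	(void)idle_balance(&rq, &rf);
	assert(rq.pinned);           /* the pin is restored on return */
	assert(rq.idle_stamp == 42); /* the stamp was taken from the clock */
	return rq.reads_while_unpinned;
}
```

With a single caller, this keeps the caller simpler and lets the callee choose exactly where the pin context ends.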

[tip:efi/urgent] x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y

2016-11-13 Thread tip-bot for Matt Fleming

Commit-ID:  f6697df36bdf0bf7fce984605c2918d4a7b4269f
Gitweb: http://git.kernel.org/tip/f6697df36bdf0bf7fce984605c2918d4a7b4269f
Author: Matt Fleming 
AuthorDate: Sat, 12 Nov 2016 21:04:24 +0000
Committer:  Ingo Molnar 
CommitDate: Sun, 13 Nov 2016 08:26:40 +0100

x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y

Booting an EFI mixed mode kernel has been crashing since commit:

  e37e43a497d5 ("x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)")

The user-visible effect in my test setup was the kernel being unable
to find the root file system ramdisk. This was likely caused by silent
memory or page table corruption.

Enabling CONFIG_DEBUG_VIRTUAL=y immediately flagged the thunking code as
abusing virt_to_phys() because it was passing addresses that were not
part of the kernel direct mapping.

Use the slow version instead, which correctly handles all memory
regions by performing a page table walk.

Suggested-by: Andy Lutomirski 
Signed-off-by: Matt Fleming 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/20161112210424.5157-3-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi_64.c | 80 ++
 1 file changed, 57 insertions(+), 23 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 58b0f80..319148b 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -211,6 +212,35 @@ void efi_sync_low_kernel_mappings(void)
memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
 }
 
+/*
+ * Wrapper for slow_virt_to_phys() that handles NULL addresses.
+ */
+static inline phys_addr_t
+virt_to_phys_or_null_size(void *va, unsigned long size)
+{
+   bool bad_size;
+
+   if (!va)
+   return 0;
+
+   if (virt_addr_valid(va))
+   return virt_to_phys(va);
+
+   /*
+* A fully aligned variable on the stack is guaranteed not to
+* cross a page boundary. Try to catch strings on the stack by
+* checking that 'size' is a power of two.
+*/
+   bad_size = size > PAGE_SIZE || !is_power_of_2(size);
+
+   WARN_ON(!IS_ALIGNED((unsigned long)va, size) || bad_size);
+
+   return slow_virt_to_phys(va);
+}
+
+#define virt_to_phys_or_null(addr) \
+   virt_to_phys_or_null_size((addr), sizeof(*(addr)))
+
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
unsigned long pfn, text;
@@ -494,8 +524,8 @@ static efi_status_t efi_thunk_get_time(efi_time_t *tm, efi_time_cap_t *tc)
 
spin_lock(&rtc_lock);
 
-   phys_tm = virt_to_phys(tm);
-   phys_tc = virt_to_phys(tc);
+   phys_tm = virt_to_phys_or_null(tm);
+   phys_tc = virt_to_phys_or_null(tc);
 
status = efi_thunk(get_time, phys_tm, phys_tc);
 
@@ -511,7 +541,7 @@ static efi_status_t efi_thunk_set_time(efi_time_t *tm)
 
spin_lock(&rtc_lock);
 
-   phys_tm = virt_to_phys(tm);
+   phys_tm = virt_to_phys_or_null(tm);
 
status = efi_thunk(set_time, phys_tm);
 
@@ -529,9 +559,9 @@ efi_thunk_get_wakeup_time(efi_bool_t *enabled, efi_bool_t *pending,
 
spin_lock(&rtc_lock);
 
-   phys_enabled = virt_to_phys(enabled);
-   phys_pending = virt_to_phys(pending);
-   phys_tm = virt_to_phys(tm);
+   phys_enabled = virt_to_phys_or_null(enabled);
+   phys_pending = virt_to_phys_or_null(pending);
+   phys_tm = virt_to_phys_or_null(tm);
 
status = efi_thunk(get_wakeup_time, phys_enabled,
 phys_pending, phys_tm);
@@ -549,7 +579,7 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
 
spin_lock(&rtc_lock);
 
-   phys_tm = virt_to_phys(tm);
+   phys_tm = virt_to_phys_or_null(tm);
 
status = efi_thunk(set_wakeup_time, enabled, phys_tm);
 
@@ -558,6 +588,10 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
return status;
 }
 
+static unsigned long efi_name_size(efi_char16_t *name)
+{
+   return ucs2_strsize(name, EFI_VAR_NAME_LEN) + 1;
+}
 
 static efi_status_t
 efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor,
@@ -567,11 +601,11 @@ efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor,
u32 phys_name, phys_vendor, phys_attr;
u32 phys_data_size, phys_data;
 
-   phys_data_size = virt_to_phys(data_size);
-   phys_vendor = virt_to_phys(vendor);
-   phys_name = virt_to_phys(name);
-   phys_attr = virt_to_phys(attr);
-   phys_data = virt_to_phys(data);
+   phys_data_size = virt_to_phys_or_null(data_size);
+   phys_vendor = virt_to_phys_or_null(vendor);
+
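
The comment in virt_to_phys_or_null_size() relies on a small piece of arithmetic: an object whose size is a power of two no larger than a page, placed at an address aligned to that size, cannot straddle a page boundary. The following user-space sketch checks that claim. It assumes 4 KiB pages; `crosses_page()`, `would_warn()`, and `quiet_objects_stay_in_one_page()` are hypothetical helpers written for this illustration, with `would_warn()` mirroring the condition the patch passes to WARN_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* True when [addr, addr + size) touches more than one page; size must be > 0. */
static bool crosses_page(uintptr_t addr, unsigned long size)
{
	return (addr / PAGE_SIZE) != ((addr + size - 1) / PAGE_SIZE);
}

/* Same condition as the WARN_ON() in the patch, on plain integers. */
static bool would_warn(uintptr_t addr, unsigned long size)
{
	bool bad_size = size > PAGE_SIZE || !is_power_of_2(size);
	bool misaligned = (addr & (size - 1)) != 0;

	return misaligned || bad_size;
}

/* Brute-force check: anything that passes the test stays within one page. */
static bool quiet_objects_stay_in_one_page(void)
{
	for (unsigned long size = 1; size <= PAGE_SIZE; size <<= 1)
		for (uintptr_t addr = 0; addr < 3 * PAGE_SIZE; addr += size)
			if (!would_warn(addr, size) && crosses_page(addr, size))
				return false;
	return true;
}
```

For example, an 8-byte object at 0x1ffc is misaligned and does straddle two pages, so the check fires. A 6-byte object is rejected purely because of its size, which is how the patch tries to catch strings placed on the stack.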

[tip:sched/core] sched/fair: Kill the unused 'sched_shares_window_ns' tunable

2016-10-20 Thread tip-bot for Matt Fleming

Commit-ID:  3c3fcb45d524feb5d14a14f332e3eec7f2aff8f3
Gitweb: http://git.kernel.org/tip/3c3fcb45d524feb5d14a14f332e3eec7f2aff8f3
Author: Matt Fleming 
AuthorDate: Wed, 19 Oct 2016 15:10:59 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 20 Oct 2016 08:44:57 +0200

sched/fair: Kill the unused 'sched_shares_window_ns' tunable

The last user of this tunable was removed in 2012 in commit:

  82958366cfea ("sched: Replace update_shares weight distribution with per-entity computation")

Delete it since its very existence confuses people.

Signed-off-by: Matt Fleming 
Cc: Dietmar Eggemann 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/20161019141059.26408-1-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 include/linux/sched/sysctl.h | 1 -
 kernel/sched/fair.c  | 7 ---
 kernel/sysctl.c  | 7 ---
 3 files changed, 15 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 22db1e6..4411453 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -36,7 +36,6 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
 extern unsigned int sysctl_sched_time_avg;
-extern unsigned int sysctl_sched_shares_window;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d941c97..79d464a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -93,13 +93,6 @@ unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 
-/*
- * The exponential sliding  window over which load is averaged for shares
- * distribution.
- * (default: 10msec)
- */
-unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
-
 #ifdef CONFIG_CFS_BANDWIDTH
 /*
  * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 706309f..739fb17 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -347,13 +347,6 @@ static struct ctl_table kern_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec,
},
-   {
-   .procname   = "sched_shares_window_ns",
-   .data   = &sysctl_sched_shares_window,
-   .maxlen = sizeof(unsigned int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   },
 #ifdef CONFIG_SCHEDSTATS
{
.procname   = "sched_schedstats",

[tip:perf/urgent] perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2

2016-09-16 Thread tip-bot for Matt Fleming

Commit-ID:  080fe0b790ad438fc1b61621dac37c1964ce7f35
Gitweb: http://git.kernel.org/tip/080fe0b790ad438fc1b61621dac37c1964ce7f35
Author: Matt Fleming 
AuthorDate: Wed, 24 Aug 2016 14:12:08 +0100
Committer:  Ingo Molnar 
CommitDate: Fri, 16 Sep 2016 16:19:49 +0200

perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2

While the Intel PMU monitors the LLC when perf enables the
HW_CACHE_REFERENCES and HW_CACHE_MISSES events, these events monitor
L1 instruction cache fetches (0x0080) and instruction cache misses
(0x0081) on the AMD PMU.

This is extremely confusing when monitoring the same workload across
Intel and AMD machines, since parameters like,

  $ perf stat -e cache-references,cache-misses

measure completely different things.

Instead, make the AMD PMU measure instruction/data cache and TLB fill
requests to the L2 and instruction/data cache and TLB misses in the L2
when HW_CACHE_REFERENCES and HW_CACHE_MISSES are enabled,
respectively. That way the events measure unified caches on both
platforms.

Signed-off-by: Matt Fleming 
Acked-by: Peter Zijlstra 
Cc: 
Cc: Borislav Petkov 
Cc: Linus Torvalds 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1472044328-21302-1-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/events/amd/core.c | 4 ++--
 arch/x86/kvm/pmu_amd.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index e07a22b..f5f4b3f 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -119,8 +119,8 @@ static const u64 amd_perfmon_event_map[PERF_COUNT_HW_MAX] =
 {
   [PERF_COUNT_HW_CPU_CYCLES]   = 0x0076,
   [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,
-  [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0080,
-  [PERF_COUNT_HW_CACHE_MISSES] = 0x0081,
+  [PERF_COUNT_HW_CACHE_REFERENCES] = 0x077d,
+  [PERF_COUNT_HW_CACHE_MISSES] = 0x077e,
   [PERF_COUNT_HW_BRANCH_INSTRUCTIONS]  = 0x00c2,
   [PERF_COUNT_HW_BRANCH_MISSES]= 0x00c3,
   [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND]  = 0x00d0, /* "Decoder empty" event */
diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c
index 39b9112..cd94443 100644
--- a/arch/x86/kvm/pmu_amd.c
+++ b/arch/x86/kvm/pmu_amd.c
@@ -23,8 +23,8 @@
 static struct kvm_event_hw_type_mapping amd_event_mapping[] = {
[0] = { 0x76, 0x00, PERF_COUNT_HW_CPU_CYCLES },
[1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },
-   [2] = { 0x80, 0x00, PERF_COUNT_HW_CACHE_REFERENCES },
-   [3] = { 0x81, 0x00, PERF_COUNT_HW_CACHE_MISSES },
+   [2] = { 0x7d, 0x07, PERF_COUNT_HW_CACHE_REFERENCES },
+   [3] = { 0x7e, 0x07, PERF_COUNT_HW_CACHE_MISSES },
[4] = { 0xc2, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
[5] = { 0xc3, 0x00, PERF_COUNT_HW_BRANCH_MISSES },
[6] = { 0xd0, 0x00, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND },
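
The two tables in this patch encode the same events in different forms: the perf table stores a raw config value such as 0x077d, while the KVM table stores the event select and unit mask separately as { 0x7d, 0x07 }. The sketch below shows how the two relate. It assumes the AMD event-select register layout (event select bits 7:0 in bits 7:0, unit mask in bits 15:8, event select bits 11:8 in bits 35:32); `struct amd_event` and `amd_decode_raw()` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two fields the KVM table stores separately. */
struct amd_event {
	uint16_t event_select; /* 12-bit event select */
	uint8_t  unit_mask;    /* 8-bit unit mask */
};

/* Split a raw perf config value into event select and unit mask. */
static struct amd_event amd_decode_raw(uint64_t config)
{
	struct amd_event ev;

	ev.event_select = (uint16_t)((config & 0xff) |
				     (((config >> 32) & 0xf) << 8));
	ev.unit_mask = (uint8_t)((config >> 8) & 0xff);
	return ev;
}
```

Decoding 0x077d gives event select 0x7d with unit mask 0x07, and 0x077e gives 0x7e with 0x07, matching the new KVM entries. The old value 0x0080 decodes to event select 0x80 with an empty unit mask.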

[tip:efi/core] efi/capsule: Move 'capsule' to the stack in efi_capsule_supported()

2016-05-07 Thread tip-bot for Matt Fleming

Commit-ID:  fb7a84cac03541f4da18dfa25b3f4767d4efc6fc
Gitweb: http://git.kernel.org/tip/fb7a84cac03541f4da18dfa25b3f4767d4efc6fc
Author: Matt Fleming 
AuthorDate: Fri, 6 May 2016 22:39:29 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 7 May 2016 07:06:13 +0200

efi/capsule: Move 'capsule' to the stack in efi_capsule_supported()

Dan Carpenter reports that passing the address of the pointer to the
kmalloc()'d memory for 'capsule' is dangerous:

 "drivers/firmware/efi/capsule.c:109 efi_capsule_supported()
  warn: did you mean to pass the address of 'capsule'

   108
   109  status = efi.query_capsule_caps(&capsule, 1, &max_size, reset);

  If we modify capsule inside this function call then at the end of the
  function we aren't freeing the original pointer that we allocated."

Ard Biesheuvel noted that we don't even need to call kmalloc() since the
object we allocate isn't very big and doesn't need to persist after the
function returns.

Place 'capsule' on the stack instead.

Suggested-by: Ard Biesheuvel 
Reported-by: Dan Carpenter 
Signed-off-by: Matt Fleming 
Acked-by: Ard Biesheuvel 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Bryan O'Donoghue 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Kweh Hock Leong 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: joeyli 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1462570771-13324-4-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 drivers/firmware/efi/capsule.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/drivers/firmware/efi/capsule.c b/drivers/firmware/efi/capsule.c
index e530540..53b9fd2 100644
--- a/drivers/firmware/efi/capsule.c
+++ b/drivers/firmware/efi/capsule.c
@@ -86,33 +86,26 @@ bool efi_capsule_pending(int *reset_type)
  */
 int efi_capsule_supported(efi_guid_t guid, u32 flags, size_t size, int *reset)
 {
-	efi_capsule_header_t *capsule;
+	efi_capsule_header_t capsule;
+	efi_capsule_header_t *cap_list[] = { &capsule };
 	efi_status_t status;
 	u64 max_size;
-	int rv = 0;
 
 	if (flags & ~EFI_CAPSULE_SUPPORTED_FLAG_MASK)
 		return -EINVAL;
 
-	capsule = kmalloc(sizeof(*capsule), GFP_KERNEL);
-	if (!capsule)
-		return -ENOMEM;
-
-	capsule->headersize = capsule->imagesize = sizeof(*capsule);
-	memcpy(&capsule->guid, &guid, sizeof(efi_guid_t));
-	capsule->flags = flags;
+	capsule.headersize = capsule.imagesize = sizeof(capsule);
+	memcpy(&capsule.guid, &guid, sizeof(efi_guid_t));
+	capsule.flags = flags;
 
-	status = efi.query_capsule_caps(&capsule, 1, &max_size, reset);
-	if (status != EFI_SUCCESS) {
-		rv = efi_status_to_err(status);
-		goto out;
-	}
+	status = efi.query_capsule_caps(cap_list, 1, &max_size, reset);
+	if (status != EFI_SUCCESS)
+		return efi_status_to_err(status);
 
 	if (size > max_size)
-		rv = -ENOSPC;
-out:
-	kfree(capsule);
-	return rv;
+		return -ENOSPC;
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(efi_capsule_supported);
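
The patch above passes a one-element array of header pointers to the firmware call and keeps the header itself on the stack, so no return path has anything to free. A user-space sketch of the same calling pattern follows. struct capsule_hdr, query_caps() and the 4096-byte limit are hypothetical stand-ins that only mimic the shape of efi_capsule_header_t and efi.query_capsule_caps(); the real header also carries a GUID.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for efi_capsule_header_t. */
struct capsule_hdr {
	uint32_t headersize;
	uint32_t flags;
	uint32_t imagesize;
};

/* Hypothetical stand-in for the firmware call: takes an array of pointers. */
static int query_caps(struct capsule_hdr **list, size_t count, uint64_t *max_size)
{
	uint64_t total = 0;

	for (size_t i = 0; i < count; i++)
		total += list[i]->imagesize;

	*max_size = 4096;		/* pretend firmware limit */
	return total <= *max_size ? 0 : -1;
}

/* Header lives on the stack; nothing to free on any return path. */
static int capsule_fits(uint32_t flags, size_t size)
{
	struct capsule_hdr capsule;
	struct capsule_hdr *cap_list[] = { &capsule };
	uint64_t max_size;

	capsule.headersize = capsule.imagesize = sizeof(capsule);
	capsule.flags = flags;

	if (query_caps(cap_list, 1, &max_size) != 0)
		return -1;

	return size > max_size ? -2 : 0;
}
```

Because every exit is a plain return, the "goto out" cleanup label from the kmalloc() version has no counterpart here.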

[tip:efi/core] efi/capsule: Make efi_capsule_pending() lockless

2016-05-07 Thread tip-bot for Matt Fleming

Commit-ID:  62075e581802ea1842d5d3c490a7e46330bdb9e1
Gitweb: http://git.kernel.org/tip/62075e581802ea1842d5d3c490a7e46330bdb9e1
Author: Matt Fleming 
AuthorDate: Fri, 6 May 2016 22:39:27 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 7 May 2016 07:06:13 +0200

efi/capsule: Make efi_capsule_pending() lockless

Taking a mutex in the reboot path is bogus because we cannot sleep
with interrupts disabled, such as when rebooting due to panic(),

  BUG: sleeping function called from invalid context at kernel/locking/mutex.c:97
  in_atomic(): 0, irqs_disabled(): 1, pid: 7, name: rcu_sched
  Call Trace:
    dump_stack+0x63/0x89
    ___might_sleep+0xd8/0x120
    __might_sleep+0x49/0x80
    mutex_lock+0x20/0x50
    efi_capsule_pending+0x1d/0x60
    native_machine_emergency_restart+0x59/0x280
    machine_emergency_restart+0x19/0x20
    emergency_restart+0x18/0x20
    panic+0x1ba/0x217

In this case all other CPUs will have been stopped by the time we
execute the platform reboot code, so 'capsule_pending' cannot change
under our feet. We wouldn't care even if it could since we cannot wait
for it to complete.

Also, instead of relying on the external 'system_state' variable just
use a reboot notifier, so we can set 'stop_capsules' while holding
'capsule_mutex', thereby avoiding a race where system_state is updated
while we're in the middle of efi_capsule_update_locked() (since CPUs
won't have been stopped at that point).

Signed-off-by: Matt Fleming 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Bryan O'Donoghue 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Kweh Hock Leong 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: joeyli 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1462570771-13324-2-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 drivers/firmware/efi/capsule.c | 35 +--
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/firmware/efi/capsule.c b/drivers/firmware/efi/capsule.c
index 0de5594..e530540 100644
--- a/drivers/firmware/efi/capsule.c
+++ b/drivers/firmware/efi/capsule.c
@@ -22,11 +22,12 @@ typedef struct {
 } efi_capsule_block_desc_t;
 
 static bool capsule_pending;
+static bool stop_capsules;
 static int efi_reset_type = -1;
 
 /*
  * capsule_mutex serialises access to both capsule_pending and
- * efi_reset_type.
+ * efi_reset_type and stop_capsules.
  */
 static DEFINE_MUTEX(capsule_mutex);
 
@@ -50,18 +51,13 @@ static DEFINE_MUTEX(capsule_mutex);
  */
 bool efi_capsule_pending(int *reset_type)
 {
-	bool rv = false;
-
-	mutex_lock(&capsule_mutex);
 	if (!capsule_pending)
-		goto out;
+		return false;
 
 	if (reset_type)
 		*reset_type = efi_reset_type;
-	rv = true;
-out:
-	mutex_unlock(&capsule_mutex);
-	return rv;
+
+	return true;
 }
 
 /*
@@ -176,7 +172,7 @@ efi_capsule_update_locked(efi_capsule_header_t *capsule,
 * whether to force an EFI reboot), and we're racing against
 * that call. Abort in that case.
 */
-   if (unlikely(system_state == SYSTEM_RESTART)) {
+   if (unlikely(stop_capsules)) {
pr_warn("Capsule update raced with reboot, aborting.\n");
return -EINVAL;
}
@@ -298,3 +294,22 @@ out:
return rv;
 }
 EXPORT_SYMBOL_GPL(efi_capsule_update);
+
+static int capsule_reboot_notify(struct notifier_block *nb, unsigned long event, void *cmd)
+{
+	mutex_lock(&capsule_mutex);
+	stop_capsules = true;
+	mutex_unlock(&capsule_mutex);
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block capsule_reboot_nb = {
+	.notifier_call = capsule_reboot_notify,
+};
+
+static int __init capsule_reboot_register(void)
+{
+	return register_reboot_notifier(&capsule_reboot_nb);
+}
+core_initcall(capsule_reboot_register);
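
The locking scheme in this patch can be summarised as: the stop flag is written and checked under capsule_mutex, while the late reboot path reads capsule_pending without any lock. A rough user-space sketch of that split, using a pthread mutex in place of the kernel mutex. The function names reboot_notify(), capsule_update() and capsule_is_pending() are hypothetical; only the variable names are taken from the patch.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t capsule_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool capsule_pending;
static bool stop_capsules;

/* Stands in for the reboot notifier: sets the stop flag under the lock. */
static void reboot_notify(void)
{
	pthread_mutex_lock(&capsule_mutex);
	stop_capsules = true;
	pthread_mutex_unlock(&capsule_mutex);
}

/* Update path: the flag is checked under the same lock that sets it. */
static int capsule_update(void)
{
	int rv = 0;

	pthread_mutex_lock(&capsule_mutex);
	if (stop_capsules)
		rv = -1;		/* raced with reboot, refuse */
	else
		capsule_pending = true;
	pthread_mutex_unlock(&capsule_mutex);

	return rv;
}

/* Late reboot path: plain read, no lock, so it can never sleep. */
static bool capsule_is_pending(void)
{
	return capsule_pending;
}
```

Once reboot_notify() has run, no later capsule_update() can change capsule_pending, which is why the unlocked read is tolerable in this model.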

[tip:sched/core] sched/fair: Update rq clock before updating nohz CPU load

2016-05-05 Thread tip-bot for Matt Fleming

Commit-ID:  b52fad2db5d792d89975cebf2fe1646a7af28ed0
Gitweb: http://git.kernel.org/tip/b52fad2db5d792d89975cebf2fe1646a7af28ed0
Author: Matt Fleming 
AuthorDate: Tue, 3 May 2016 20:46:54 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 5 May 2016 09:41:09 +0200

sched/fair: Update rq clock before updating nohz CPU load

If we're accessing rq_clock() (e.g. in sched_avg_update()) we should
update the rq clock before calling cpu_load_update(), otherwise any
time calculations will be stale.

All other paths currently call update_rq_clock().

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Wanpeng Li 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1462304814-11715-1-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c381a6..7a00c7c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4724,6 +4724,7 @@ void cpu_load_update_nohz_stop(void)
 
 	load = weighted_cpuload(cpu_of(this_rq));
 	raw_spin_lock(&this_rq->lock);
+	update_rq_clock(this_rq);
 	cpu_load_update_nohz(this_rq, curr_jiffies, load);
 	raw_spin_unlock(&this_rq->lock);
 }
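
The effect of the missing clock update can be shown with a toy model: a run queue that caches a timestamp and a load update that computes elapsed time from that cache. Everything below is a hypothetical simplification (struct rq here has only two fields, and fake_now replaces the real time source); it is not the scheduler's code.

```c
#include <stdint.h>

/* Hypothetical, heavily simplified run queue. */
struct rq {
	uint64_t clock;		/* cached timestamp */
	uint64_t last_update;	/* clock value at the previous load update */
};

static uint64_t fake_now;	/* stand-in for the real time source */

/* Refresh the cached timestamp from the time source. */
static void update_clock(struct rq *rq)
{
	rq->clock = fake_now;
}

/* Returns the elapsed time the load update would account for. */
static uint64_t load_update(struct rq *rq)
{
	uint64_t delta = rq->clock - rq->last_update;

	rq->last_update = rq->clock;
	return delta;
}
```

If fake_now advances but update_clock() is not called first, load_update() sees a delta of zero, which is the kind of stale calculation the commit message describes.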

[tip:efi/urgent] MAINTAINERS: Remove asterisk from EFI directory names

2016-05-04 Thread tip-bot for Matt Fleming

Commit-ID:  e8dfe6d8f6762d515fcd4f30577f7bfcf7659887
Gitweb: http://git.kernel.org/tip/e8dfe6d8f6762d515fcd4f30577f7bfcf7659887
Author: Matt Fleming 
AuthorDate: Tue, 3 May 2016 20:29:39 +0100
Committer:  Ingo Molnar 
CommitDate: Wed, 4 May 2016 08:36:44 +0200

MAINTAINERS: Remove asterisk from EFI directory names

Mark reported that having asterisks on the end of directory names
confuses get_maintainer.pl when it encounters subdirectories, and that
my name does not appear when run on drivers/firmware/efi/libstub.

Reported-by: Mark Rutland 
Signed-off-by: Matt Fleming 
Cc: 
Cc: Ard Biesheuvel 
Cc: Catalin Marinas 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1462303781-8686-2-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 42e65d1..4dca3b3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4223,8 +4223,8 @@ F:Documentation/efi-stub.txt
 F: arch/ia64/kernel/efi.c
 F: arch/x86/boot/compressed/eboot.[ch]
 F: arch/x86/include/asm/efi.h
-F: arch/x86/platform/efi/*
-F: drivers/firmware/efi/*
+F: arch/x86/platform/efi/
+F: drivers/firmware/efi/
 F: include/linux/efi*.h
 
 EFI VARIABLE FILESYSTEM
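
The difference between the two F: forms touched by this patch can be modelled as follows: a pattern ending in a slash covers everything in and below the directory, while a pattern ending in an asterisk covers only files directly inside it. The sketch below is a simplified model under that assumption, not a port of get_maintainer.pl, and f_pattern_matches() is a hypothetical name.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Simplified model of two F: pattern forms:
 *   trailing '/'          matches every file in and below the directory
 *   trailing '/' plus '*' matches files directly in the directory only
 * Anything else is compared as an exact path.
 */
static bool f_pattern_matches(const char *pattern, const char *path)
{
	size_t len = strlen(pattern);

	if (len > 0 && pattern[len - 1] == '/')
		return strncmp(path, pattern, len) == 0;

	if (len > 1 && pattern[len - 1] == '*' && pattern[len - 2] == '/') {
		size_t dir_len = len - 1;	/* keep the trailing '/' */

		if (strncmp(path, pattern, dir_len) != 0)
			return false;
		/* no further '/' allowed after the directory prefix */
		return strchr(path + dir_len, '/') == NULL;
	}

	return strcmp(pattern, path) == 0;
}
```

Under this model a file below drivers/firmware/efi/libstub/ matches "drivers/firmware/efi/" but not the asterisk form, which is consistent with the report in the commit message.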

[tip:efi/core] efi: Add 'capsule' update support

2016-04-28 Thread tip-bot for Matt Fleming

Commit-ID:  f0133f3c5b8bb34ec4dec50c27e7a655aeee8935
Gitweb: http://git.kernel.org/tip/f0133f3c5b8bb34ec4dec50c27e7a655aeee8935
Author: Matt Fleming 
AuthorDate: Mon, 25 Apr 2016 21:06:59 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 28 Apr 2016 11:34:03 +0200

efi: Add 'capsule' update support

The EFI capsule mechanism allows data blobs to be passed to the EFI
firmware. A common use case is performing firmware updates. This patch
just introduces the main infrastructure for interacting with the
firmware, and a driver that allows users to upload capsules will come
in a later patch.

Once a capsule has been passed to the firmware, the next reboot must
be performed using the ResetSystem() EFI runtime service, which may
involve overriding the reboot type specified by reboot=. This ensures
the reset value returned by QueryCapsuleCapabilities() is used to
reset the system, which is required for the capsule to be processed.
efi_capsule_pending() is provided for this purpose.

At the moment we only allow a single capsule blob to be sent to the
firmware despite the fact that UpdateCapsule() takes a 'CapsuleCount'
parameter. This simplifies the API and shouldn't result in any
downside since it is still possible to send multiple capsules by
repeatedly calling UpdateCapsule().

Signed-off-by: Matt Fleming 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Bryan O'Donoghue 
Cc: Kweh Hock Leong 
Cc: Mark Salter 
Cc: Peter Jones 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: joeyli 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1461614832-17633-28-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 drivers/firmware/efi/Makefile  |   1 +
 drivers/firmware/efi/capsule.c | 300 +
 drivers/firmware/efi/reboot.c  |  12 +-
 include/linux/efi.h|  14 ++
 4 files changed, 326 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index b080808..fb8ad5d 100644
--- a/drivers/firmware/efi/Makefile
+++ b/drivers/firmware/efi/Makefile
@@ -10,6 +10,7 @@
 KASAN_SANITIZE_runtime-wrappers.o  := n
 
 obj-$(CONFIG_EFI)  += efi.o vars.o reboot.o memattr.o
+obj-$(CONFIG_EFI)  += capsule.o
 obj-$(CONFIG_EFI_VARS) += efivars.o
 obj-$(CONFIG_EFI_ESRT) += esrt.o
 obj-$(CONFIG_EFI_VARS_PSTORE)  += efi-pstore.o
diff --git a/drivers/firmware/efi/capsule.c b/drivers/firmware/efi/capsule.c
new file mode 100644
index 000..0de5594
--- /dev/null
+++ b/drivers/firmware/efi/capsule.c
@@ -0,0 +1,300 @@
+/*
+ * EFI capsule support.
+ *
+ * Copyright 2013 Intel Corporation; author Matt Fleming
+ *
+ * This file is part of the Linux kernel, and is made available under
+ * the terms of the GNU General Public License version 2.
+ */
+
+#define pr_fmt(fmt) "efi: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+typedef struct {
+   u64 length;
+   u64 data;
+} efi_capsule_block_desc_t;
+
+static bool capsule_pending;
+static int efi_reset_type = -1;
+
+/*
+ * capsule_mutex serialises access to both capsule_pending and
+ * efi_reset_type.
+ */
+static DEFINE_MUTEX(capsule_mutex);
+
+/**
+ * efi_capsule_pending - has a capsule been passed to the firmware?
+ * @reset_type: store the type of EFI reset if capsule is pending
+ *
+ * To ensure that the registered capsule is processed correctly by the
+ * firmware we need to perform a specific type of reset. If a capsule is
+ * pending return the reset type in @reset_type.
+ *
+ * This function will race with callers of efi_capsule_update(), for
+ * example, calling this function while somebody else is in
+ * efi_capsule_update() but hasn't reached efi_capsue_update_locked()
+ * will miss the updates to capsule_pending and efi_reset_type after
+ * efi_capsule_update_locked() completes.
+ *
+ * A non-racy use is from platform reboot code because we use
+ * system_state to ensure no capsules can be sent to the firmware once
+ * we're at SYSTEM_RESTART. See efi_capsule_update_locked().
+ */
+bool efi_capsule_pending(int *reset_type)
+{
+	bool rv = false;
+
+	mutex_lock(&capsule_mutex);
+	if (!capsule_pending)
+		goto out;
+
+	if (reset_type)
+		*reset_type = efi_reset_type;
+	rv = true;
+out:
+	mutex_unlock(&capsule_mutex);
+	return rv;
+}
+
+/*
+ * Whitelist of EFI capsule flags that we support.
+ *
+ * We do not handle EFI_CAPSULE_INITIATE_RESET because that would
+ * require us to prepare the kernel for reboot. Refuse to load any
+ * capsules with that flag and any other flags that we do not know how
+ * to handle.
+ */
+#define EFI_CAPSULE_SUPPORTED_FLAG_MASK\
+   (EFI_CAPSULE_PERSIST_ACROSS_RESET | EFI_CAPSULE_POPULATE_SYSTEM_TABLE)
+
+/**
+ * efi_capsule_supported - does the firmware support the capsule?
+ * @guid: vendor guid of capsule
+ * @flags: capsule flags
+ * @size:
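
The flag whitelist described in the comment above amounts to rejecting any capsule whose flags contain a bit outside the supported mask. A small user-space sketch of that check, using the flag values from the UEFI specification. The unprefixed macro names and capsule_flags_ok() are hypothetical; the kernel code returns -EINVAL rather than -1.

```c
#include <stdint.h>

/* Flag values as defined by the UEFI specification. */
#define CAPSULE_PERSIST_ACROSS_RESET	0x00010000u
#define CAPSULE_POPULATE_SYSTEM_TABLE	0x00020000u
#define CAPSULE_INITIATE_RESET		0x00040000u

#define CAPSULE_SUPPORTED_FLAG_MASK \
	(CAPSULE_PERSIST_ACROSS_RESET | CAPSULE_POPULATE_SYSTEM_TABLE)

/* Returns 0 if every set flag is in the whitelist, -1 otherwise. */
static inline int capsule_flags_ok(uint32_t flags)
{
	return (flags & ~CAPSULE_SUPPORTED_FLAG_MASK) ? -1 : 0;
}
```

Masking with the complement of the whitelist means unknown future flags are refused by default, not only the reset flag that is excluded on purpose.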

[tip:efi/core] x86/efi: Force EFI reboot to process pending capsules

2016-04-28 Thread tip-bot for Matt Fleming

Commit-ID:  87615a34d561ef59bd0cffc73256a21220dfdffd
Gitweb: http://git.kernel.org/tip/87615a34d561ef59bd0cffc73256a21220dfdffd
Author: Matt Fleming 
AuthorDate: Mon, 25 Apr 2016 21:07:00 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 28 Apr 2016 11:34:04 +0200

x86/efi: Force EFI reboot to process pending capsules

If an EFI capsule has been sent to the firmware we must match the type
of EFI reset against that required by the capsule to ensure it is
processed correctly.

Force an EFI reboot if a capsule is pending for the next reset.

Signed-off-by: Matt Fleming 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Kweh Hock Leong 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: joeyli 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1461614832-17633-29-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/reboot.c | 9 +
 include/linux/efi.h  | 6 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index ab0adc0..a9b31eb 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -535,6 +535,15 @@ static void native_machine_emergency_restart(void)
 	mode = reboot_mode == REBOOT_WARM ? 0x1234 : 0;
 	*((unsigned short *)__va(0x472)) = mode;
 
+	/*
+	 * If an EFI capsule has been registered with the firmware then
+	 * override the reboot= parameter.
+	 */
+	if (efi_capsule_pending(NULL)) {
+		pr_info("EFI capsule is pending, forcing EFI reboot.\n");
+		reboot_type = BOOT_EFI;
+	}
+
 	for (;;) {
 		/* Could also try the reset bit in the Hammer NB */
 		switch (reboot_type) {
diff --git a/include/linux/efi.h b/include/linux/efi.h
index a3b4c1e..aa36fb8 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -1085,6 +1085,12 @@ static inline bool efi_enabled(int feature)
 }
 static inline void
 efi_reboot(enum reboot_mode reboot_mode, const char *__unused) {}
+
+static inline bool
+efi_capsule_pending(int *reset_type)
+{
+	return false;
+}
 #endif
 
 extern int efi_status_to_err(efi_status_t status);
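
The override in the patch above can be modelled outside the kernel as follows. This is a simplified sketch, not the kernel code: the model_* names and the enum are hypothetical stand-ins, and the mutex that protects the real capsule state is omitted because the model is single-threaded.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's reboot_type values. */
enum model_reboot_type { MODEL_BOOT_ACPI, MODEL_BOOT_KBD, MODEL_BOOT_EFI };

/* Stand-ins for the capsule loader state (mutex-protected in the kernel). */
static bool model_capsule_pending;
static int model_efi_reset_type;

/* Mirrors the shape of efi_capsule_pending(): reset_type may be NULL. */
static bool model_efi_capsule_pending(int *reset_type)
{
	if (!model_capsule_pending)
		return false;

	if (reset_type)
		*reset_type = model_efi_reset_type;
	return true;
}

/* A pending capsule overrides whatever reboot method was requested. */
static enum model_reboot_type
model_pick_reboot_type(enum model_reboot_type requested)
{
	if (model_efi_capsule_pending(NULL))
		return MODEL_BOOT_EFI;
	return requested;
}
```

Passing NULL, as the patch does, is enough when the caller only needs to know whether a capsule is pending and does not care which reset type it requires.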

[tip:efi/core] efi: Move efi_status_to_err() to drivers/firmware/efi/

2016-04-28 Thread tip-bot for Matt Fleming

Commit-ID:  806b0351c9ff9890c1ef0ba2c46237baef49ac79
Gitweb: http://git.kernel.org/tip/806b0351c9ff9890c1ef0ba2c46237baef49ac79
Author: Matt Fleming 
AuthorDate: Mon, 25 Apr 2016 21:06:58 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 28 Apr 2016 11:34:03 +0200

efi: Move efi_status_to_err() to drivers/firmware/efi/

Move efi_status_to_err() to the architecture independent code as it's
generally useful in all bits of EFI code where there is a need to
convert an efi_status_t to a kernel error value.

Signed-off-by: Matt Fleming 
Acked-by: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Kweh Hock Leong 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: joeyli 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1461614832-17633-27-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 drivers/firmware/efi/efi.c  | 33 +
 drivers/firmware/efi/vars.c | 33 -
 include/linux/efi.h |  2 ++
 3 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 4991371..05509f3 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -636,3 +636,36 @@ u64 __weak efi_mem_attributes(unsigned long phys_addr)
 	}
 	return 0;
 }
+
+int efi_status_to_err(efi_status_t status)
+{
+	int err;
+
+	switch (status) {
+	case EFI_SUCCESS:
+		err = 0;
+		break;
+	case EFI_INVALID_PARAMETER:
+		err = -EINVAL;
+		break;
+	case EFI_OUT_OF_RESOURCES:
+		err = -ENOSPC;
+		break;
+	case EFI_DEVICE_ERROR:
+		err = -EIO;
+		break;
+	case EFI_WRITE_PROTECTED:
+		err = -EROFS;
+		break;
+	case EFI_SECURITY_VIOLATION:
+		err = -EACCES;
+		break;
+	case EFI_NOT_FOUND:
+		err = -ENOENT;
+		break;
+	default:
+		err = -EINVAL;
+	}
+
+	return err;
+}
diff --git a/drivers/firmware/efi/vars.c b/drivers/firmware/efi/vars.c
index 34b7419..0012331 100644
--- a/drivers/firmware/efi/vars.c
+++ b/drivers/firmware/efi/vars.c
@@ -329,39 +329,6 @@ check_var_size_nonblocking(u32 attributes, unsigned long size)
 	return fops->query_variable_store(attributes, size, true);
 }
 
-static int efi_status_to_err(efi_status_t status)
-{
-	int err;
-
-	switch (status) {
-	case EFI_SUCCESS:
-		err = 0;
-		break;
-	case EFI_INVALID_PARAMETER:
-		err = -EINVAL;
-		break;
-	case EFI_OUT_OF_RESOURCES:
-		err = -ENOSPC;
-		break;
-	case EFI_DEVICE_ERROR:
-		err = -EIO;
-		break;
-	case EFI_WRITE_PROTECTED:
-		err = -EROFS;
-		break;
-	case EFI_SECURITY_VIOLATION:
-		err = -EACCES;
-		break;
-	case EFI_NOT_FOUND:
-		err = -ENOENT;
-		break;
-	default:
-		err = -EINVAL;
-	}
-
-	return err;
-}
-
 static bool variable_is_present(efi_char16_t *variable_name, efi_guid_t *vendor,
 				struct list_head *head)
 {
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 4db7052..ca47481 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -1080,6 +1080,8 @@ static inline void
 efi_reboot(enum reboot_mode reboot_mode, const char *__unused) {}
 #endif
 
+extern int efi_status_to_err(efi_status_t status);
+
 /*
  * Variable Attributes
  */
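
For readers who want to try the mapping outside the kernel, here is a self-contained user-space sketch. The status values follow the UEFI specification (error codes have the top bit of the native word set); the efi_status_t typedef and the EFI_ERR_BIT macro are assumptions made for this example, and the table is trimmed to three cases to keep it short.

```c
#include <errno.h>

/* Assumed stand-in for the kernel's efi_status_t. */
typedef unsigned long efi_status_t;

/* UEFI error codes set the most significant bit of the native word. */
#define EFI_ERR_BIT		(1UL << (sizeof(unsigned long) * 8 - 1))
#define EFI_SUCCESS		0UL
#define EFI_INVALID_PARAMETER	(2UL | EFI_ERR_BIT)
#define EFI_BUFFER_TOO_SMALL	(5UL | EFI_ERR_BIT)
#define EFI_NOT_FOUND		(14UL | EFI_ERR_BIT)

/* Trimmed copy of the conversion: unknown codes fall back to -EINVAL. */
static int efi_status_to_err(efi_status_t status)
{
	switch (status) {
	case EFI_SUCCESS:
		return 0;
	case EFI_INVALID_PARAMETER:
		return -EINVAL;
	case EFI_NOT_FOUND:
		return -ENOENT;
	default:
		return -EINVAL;
	}
}
```

Note that a status without an explicit case, such as EFI_BUFFER_TOO_SMALL here, is reported as -EINVAL, so callers that need to distinguish it must check the efi_status_t before converting.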

[tip:efi/core] x86/efi: Remove the always true EFI_DEBUG symbol

2016-04-28 Thread tip-bot for Matt Fleming

Commit-ID:  c3c1c47f15b37a8492e630d1e9ab8ad576ee10e5
Gitweb: http://git.kernel.org/tip/c3c1c47f15b37a8492e630d1e9ab8ad576ee10e5
Author: Matt Fleming 
AuthorDate: Mon, 25 Apr 2016 21:06:47 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 28 Apr 2016 11:33:56 +0200

x86/efi: Remove the always true EFI_DEBUG symbol

This symbol is always set which makes it useless. Additionally we have
a kernel command-line switch, efi=debug, which actually controls the
printing of the memory map.

Reported-by: Robert Elliott 
Signed-off-by: Matt Fleming 
Acked-by: Borislav Petkov 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1461614832-17633-16-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index dde46cd..f93545e 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -54,8 +54,6 @@
 #include 
 #include 
 
-#define EFI_DEBUG
-
 static struct efi efi_phys __initdata;
 static efi_system_table_t efi_systab __initdata;
 
@@ -222,7 +220,6 @@ int __init efi_memblock_x86_reserve_range(void)
 
 void __init efi_print_memmap(void)
 {
-#ifdef EFI_DEBUG
 	efi_memory_desc_t *md;
 	int i = 0;
 
@@ -235,7 +232,6 @@ void __init efi_print_memmap(void)
 			md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1,
 			(md->num_pages >> (20 - EFI_PAGE_SHIFT)));
 	}
-#endif  /*  EFI_DEBUG  */
 }
 
 void __init efi_unmap_memmap(void)
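
The two expressions visible in the efi_print_memmap() context lines compute the last byte of a region and its size in MB. A small sketch of that arithmetic, assuming 4 KiB EFI pages (EFI_PAGE_SHIFT of 12); struct model_desc and the helper names are hypothetical and only carry the two fields used here.

```c
#include <stdint.h>

#define EFI_PAGE_SHIFT	12	/* EFI pages are 4 KiB */

/* Hypothetical, trimmed-down memory descriptor. */
struct model_desc {
	uint64_t phys_addr;
	uint64_t num_pages;
};

/* Address of the last byte covered by the region (inclusive end). */
static uint64_t model_desc_last_byte(const struct model_desc *md)
{
	return md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1;
}

/* Size in MB, rounded down: 2^20 bytes per MB is 2^(20 - 12) pages. */
static uint64_t model_desc_size_mb(const struct model_desc *md)
{
	return md->num_pages >> (20 - EFI_PAGE_SHIFT);
}
```

Because the shift rounds down, a region smaller than 256 pages is printed as 0MB.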

[tip:efi/core] efi: Iterate over efi.memmap in for_each_efi_memory_desc()

2016-04-28 Thread tip-bot for Matt Fleming

Commit-ID:  78ce248faa3c46e24e9bd42db3ab3650659f16dd
Gitweb: http://git.kernel.org/tip/78ce248faa3c46e24e9bd42db3ab3650659f16dd
Author: Matt Fleming 
AuthorDate: Mon, 25 Apr 2016 21:06:38 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 28 Apr 2016 11:33:50 +0200

efi: Iterate over efi.memmap in for_each_efi_memory_desc()

Most of the users of for_each_efi_memory_desc() are equally happy
iterating over the EFI memory map in efi.memmap instead of 'memmap',
since the former is usually a pointer to the latter.

For those users that want to specify an EFI memory map other than
efi.memmap, that can be done using for_each_efi_memory_desc_in_map().
One such example is in the libstub code where the firmware is queried
directly for the memory map, it gets iterated over, and then freed.

This change goes part of the way toward deleting the global 'memmap'
variable, which is not universally available on all architectures
(notably IA64) and is rather poorly named.

Signed-off-by: Matt Fleming 
Reviewed-by: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Leif Lindholm 
Cc: Mark Salter 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1461614832-17633-7-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi.c| 43 --
 arch/x86/platform/efi/efi_64.c | 10 ++
 arch/x86/platform/efi/quirks.c | 10 +++---
 drivers/firmware/efi/arm-init.c|  4 +--
 drivers/firmware/efi/arm-runtime.c |  2 +-
 drivers/firmware/efi/efi.c |  6 +---
 drivers/firmware/efi/fake_mem.c|  3 +-
 drivers/firmware/efi/libstub/efi-stub-helper.c |  6 ++--
 include/linux/efi.h| 11 ++-
 9 files changed, 39 insertions(+), 56 deletions(-)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index df393ea..6f49981 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -119,11 +119,10 @@ void efi_get_time(struct timespec *now)
 
 void __init efi_find_mirror(void)
 {
-	void *p;
+	efi_memory_desc_t *md;
 	u64 mirror_size = 0, total_size = 0;
 
-	for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
-		efi_memory_desc_t *md = p;
+	for_each_efi_memory_desc(md) {
 		unsigned long long start = md->phys_addr;
 		unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
 
@@ -146,10 +145,9 @@ void __init efi_find_mirror(void)
 
 static void __init do_add_efi_memmap(void)
 {
-	void *p;
+	efi_memory_desc_t *md;
 
-	for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
-		efi_memory_desc_t *md = p;
+	for_each_efi_memory_desc(md) {
 		unsigned long long start = md->phys_addr;
 		unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
 		int e820_type;
@@ -226,17 +224,13 @@ void __init efi_print_memmap(void)
 {
 #ifdef EFI_DEBUG
 	efi_memory_desc_t *md;
-	void *p;
-	int i;
+	int i = 0;
 
-	for (p = memmap.map, i = 0;
-	     p < memmap.map_end;
-	     p += memmap.desc_size, i++) {
+	for_each_efi_memory_desc(md) {
 		char buf[64];
 
-		md = p;
 		pr_info("mem%02u: %s range=[0x%016llx-0x%016llx] (%lluMB)\n",
-			i, efi_md_typeattr_format(buf, sizeof(buf), md),
+			i++, efi_md_typeattr_format(buf, sizeof(buf), md),
 			md->phys_addr,
 			md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1,
 			(md->num_pages >> (20 - EFI_PAGE_SHIFT)));
@@ -550,12 +544,9 @@ void __init efi_set_executable(efi_memory_desc_t *md, bool executable)
 void __init runtime_code_page_mkexec(void)
 {
 	efi_memory_desc_t *md;
-	void *p;
 
 	/* Make EFI runtime service code area executable */
-	for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
-		md = p;
-
+	for_each_efi_memory_desc(md) {
 		if (md->type != EFI_RUNTIME_SERVICES_CODE)
 			continue;
 
@@ -602,12 +593,10 @@ void __init old_map_region(efi_memory_desc_t *md)
 /* Merge contiguous regions of the same type and attribute */
 static void __init efi_merge_regions(void)
 {
-	void *p;
 	efi_memory_desc_t *md, *prev_md = NULL;
 
-	for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
+	for_each_efi_memory_desc(md) {
 		u64 prev_size;
-		md = p;
 
 		if (!prev_md) {
 			prev_md = md;
@@ -650,15 +639,13 @@ static void __init save_runtime_map(void)
 {
 #ifdef CONFIG_KEXEC_CORE
 	efi_memory_desc_t *md;
-	void *tmp, *q = NULL;
+	void *tmp, *q = NULL;
 	int count = 0;
 
 	if
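
The open-coded loops removed above all step through the map by desc_size rather than by sizeof(efi_memory_desc_t), because firmware may report a descriptor stride larger than the structure the kernel knows about. The sketch below shows one way an iterator macro can hide that stride. It is an illustration under assumptions, not the kernel's definition of for_each_efi_memory_desc_in_map(): the model_* names, the struct layout and the macro are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor; only num_pages is used by this example. */
struct model_desc {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

/* Hypothetical map header: start, end and per-entry stride in bytes. */
struct model_memmap {
	void *map;
	void *map_end;
	unsigned long desc_size;
};

/*
 * Visit every descriptor in m, advancing by m->desc_size bytes. The
 * loop stops as soon as a full descriptor no longer fits before
 * map_end. Byte arithmetic goes through char * to stay within ISO C.
 */
#define model_for_each_desc_in_map(m, md)				\
	for ((md) = (struct model_desc *)(m)->map;			\
	     (char *)(md) + (m)->desc_size <= (char *)(m)->map_end;	\
	     (md) = (struct model_desc *)((char *)(md) + (m)->desc_size))

/* Example user: total number of pages described by the map. */
static uint64_t model_total_pages(const struct model_memmap *m)
{
	struct model_desc *md;
	uint64_t total = 0;

	model_for_each_desc_in_map(m, md)
		total += md->num_pages;

	return total;
}
```

With a macro like this, callers such as the ones converted in the patch only declare the descriptor pointer and never touch the stride or the end pointer themselves.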

[tip:efi/core] efi: Remove global 'memmap' EFI memory map

2016-04-28 Thread tip-bot for Matt Fleming

Commit-ID:  884f4f66ffd6ffe632f3a8be4e6d10a858afdc37
Gitweb: http://git.kernel.org/tip/884f4f66ffd6ffe632f3a8be4e6d10a858afdc37
Author: Matt Fleming 
AuthorDate: Mon, 25 Apr 2016 21:06:39 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 28 Apr 2016 11:33:51 +0200

efi: Remove global 'memmap' EFI memory map

Abolish the poorly named EFI memory map, 'memmap'. It is shadowed by a
bunch of local definitions in various files and having two ways to
access the EFI memory map ('efi.memmap' vs. 'memmap') is rather
confusing.

Furthermore, IA64 doesn't even provide this global object, which has
caused issues when trying to write generic EFI memmap code.

Replace all occurrences with efi.memmap, and convert the remaining
iterator code to use for_each_efi_memory_desc().

Signed-off-by: Matt Fleming 
Reviewed-by: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Luck, Tony 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1461614832-17633-8-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi.c| 84 +-
 drivers/firmware/efi/arm-init.c| 20 -
 drivers/firmware/efi/arm-runtime.c | 12 +++---
 drivers/firmware/efi/efi.c |  2 +-
 drivers/firmware/efi/fake_mem.c| 40 +-
 include/linux/efi.h|  5 +--
 6 files changed, 85 insertions(+), 78 deletions(-)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 6f49981..88d2fb2 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -56,8 +56,6 @@
 
 #define EFI_DEBUG
 
-struct efi_memory_map memmap;
-
 static struct efi efi_phys __initdata;
 static efi_system_table_t efi_systab __initdata;
 
@@ -207,15 +205,13 @@ int __init efi_memblock_x86_reserve_range(void)
 #else
 	pmap = (e->efi_memmap | ((__u64)e->efi_memmap_hi << 32));
 #endif
-	memmap.phys_map		= pmap;
-	memmap.nr_map		= e->efi_memmap_size /
+	efi.memmap.phys_map	= pmap;
+	efi.memmap.nr_map	= e->efi_memmap_size /
 				  e->efi_memdesc_size;
-	memmap.desc_size	= e->efi_memdesc_size;
-	memmap.desc_version	= e->efi_memdesc_version;
-
-	memblock_reserve(pmap, memmap.nr_map * memmap.desc_size);
+	efi.memmap.desc_size	= e->efi_memdesc_size;
+	efi.memmap.desc_version	= e->efi_memdesc_version;
 
-	efi.memmap = &memmap;
+	memblock_reserve(pmap, efi.memmap.nr_map * efi.memmap.desc_size);
 
 	return 0;
 }
@@ -240,10 +236,14 @@ void __init efi_print_memmap(void)
 
 void __init efi_unmap_memmap(void)
 {
+	unsigned long size;
+
 	clear_bit(EFI_MEMMAP, &efi.flags);
-	if (memmap.map) {
-		early_memunmap(memmap.map, memmap.nr_map * memmap.desc_size);
-		memmap.map = NULL;
+
+	size = efi.memmap.nr_map * efi.memmap.desc_size;
+	if (efi.memmap.map) {
+		early_memunmap(efi.memmap.map, size);
+		efi.memmap.map = NULL;
 	}
 }
 
@@ -432,17 +432,22 @@ static int __init efi_runtime_init(void)
 
 static int __init efi_memmap_init(void)
 {
+	unsigned long addr, size;
+
 	if (efi_enabled(EFI_PARAVIRT))
 		return 0;
 
 	/* Map the EFI memory map */
-	memmap.map = early_memremap((unsigned long)memmap.phys_map,
-				   memmap.nr_map * memmap.desc_size);
-	if (memmap.map == NULL) {
+	size = efi.memmap.nr_map * efi.memmap.desc_size;
+	addr = (unsigned long)efi.memmap.phys_map;
+
+	efi.memmap.map = early_memremap(addr, size);
+	if (efi.memmap.map == NULL) {
 		pr_err("Could not map the memory map!\n");
 		return -ENOMEM;
 	}
-	memmap.map_end = memmap.map + (memmap.nr_map * memmap.desc_size);
+
+	efi.memmap.map_end = efi.memmap.map + size;
 
 	if (add_efi_memmap)
 		do_add_efi_memmap();
@@ -638,6 +643,7 @@ static void __init get_systab_virt_addr(efi_memory_desc_t *md)
 static void __init save_runtime_map(void)
 {
 #ifdef CONFIG_KEXEC_CORE
+	unsigned long desc_size;
 	efi_memory_desc_t *md;
 	void *tmp, *q = NULL;
 	int count = 0;
@@ -645,21 +651,23 @@ static void __init save_runtime_map(void)
 	if (efi_enabled(EFI_OLD_MEMMAP))
 		return;
 
+	desc_size = efi.memmap.desc_size;
+
 	for_each_efi_memory_desc(md) {
 		if (!(md->attribute & EFI_MEMORY_RUNTIME) ||
 		    (md->type == EFI_BOOT_SERVICES_CODE) ||
 		    (md->type == EFI_BOOT_SERVICES_DATA))
 			continue;
-		tmp = krealloc(q, (count + 1) * memmap.desc_size, GFP_KERNEL);
+		tmp = krealloc(q, (count + 1) * desc_size, GFP_KERNEL);
 		if (!tmp)
 			goto out;
 		q = tmp;
 
-		memcpy(q + count * memmap.desc_size,
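
The save_runtime_map() hunk is cut off above, but the visible part shows the pattern: filter the descriptors, grow a buffer by one desc_size per match, and copy the entry in. A user-space sketch of that pattern follows, with realloc() standing in for krealloc(). The type and attribute constants follow the UEFI specification; struct model_desc and model_save_runtime_map() are hypothetical names, and the error handling is an assumption since the original's out: path is not shown.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Memory type and attribute values as defined by the UEFI specification. */
#define EFI_BOOT_SERVICES_CODE	3
#define EFI_BOOT_SERVICES_DATA	4
#define EFI_MEMORY_RUNTIME	0x8000000000000000ULL

/* Hypothetical descriptor layout for this example. */
struct model_desc {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

/*
 * Copy every runtime descriptor from src (n entries, desc_size bytes
 * apart) into a buffer that grows by one entry per match. Returns the
 * buffer, which the caller frees, and stores the number of copied
 * entries in *count. Returns NULL when nothing matched or when an
 * allocation failed.
 */
static void *model_save_runtime_map(const void *src, size_t n,
				    size_t desc_size, size_t *count)
{
	const char *p = src;
	void *q = NULL;
	size_t i;

	*count = 0;
	for (i = 0; i < n; i++, p += desc_size) {
		const struct model_desc *md = (const struct model_desc *)p;
		void *tmp;

		/* Same filter as the patch: runtime regions only. */
		if (!(md->attribute & EFI_MEMORY_RUNTIME) ||
		    md->type == EFI_BOOT_SERVICES_CODE ||
		    md->type == EFI_BOOT_SERVICES_DATA)
			continue;

		tmp = realloc(q, (*count + 1) * desc_size);
		if (!tmp) {
			free(q);
			*count = 0;
			return NULL;
		}
		q = tmp;

		memcpy((char *)q + *count * desc_size, md, desc_size);
		(*count)++;
	}

	return q;
}
```

Caching desc_size in a local, as the patch does, keeps the allocation and the copy offset visibly in step with each other.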

[tip:efi/core] x86/mm/pat: Document the (currently) EFI-only code path

2016-04-28 Thread tip-bot for Matt Fleming

Commit-ID:  7fc8442f2a8a77f40565b42c41e4f2d48b179a56
Gitweb: http://git.kernel.org/tip/7fc8442f2a8a77f40565b42c41e4f2d48b179a56
Author: Matt Fleming 
AuthorDate: Mon, 25 Apr 2016 21:06:35 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 28 Apr 2016 11:33:48 +0200

x86/mm/pat: Document the (currently) EFI-only code path

It's not at all obvious that populate_pgd() and friends are only
executed when mapping EFI virtual memory regions or that no other
pageattr callers pass a ->pgd value.

Reported-by: Andy Lutomirski 
Signed-off-by: Matt Fleming 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1461614832-17633-4-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/mm/pageattr.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 01be9ec..a1f0e1d 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1125,8 +1125,14 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
 static int __cpa_process_fault(struct cpa_data *cpa, unsigned long vaddr,
 			       int primary)
 {
-	if (cpa->pgd)
+	if (cpa->pgd) {
+		/*
+		 * Right now, we only execute this code path when mapping
+		 * the EFI virtual memory map regions, no other users
+		 * provide a ->pgd value. This may change in the future.
+		 */
 		return populate_pgd(cpa, vaddr);
+	}
 
 	/*
 	 * Ignore all non primary paths.

[tip:sched/urgent] sched/fair: Add comments to explain select_idle_sibling()

2016-03-21 Thread tip-bot for Matt Fleming

Commit-ID:  d4335581dc30ec6545999c7443bb9fead274a980
Gitweb: http://git.kernel.org/tip/d4335581dc30ec6545999c7443bb9fead274a980
Author: Matt Fleming 
AuthorDate: Wed, 9 Mar 2016 14:59:08 +
Committer:  Ingo Molnar 
CommitDate: Mon, 21 Mar 2016 10:52:51 +0100

sched/fair: Add comments to explain select_idle_sibling()

It's not entirely obvious how the main loop in select_idle_sibling()
works at first glance. Sprinkle a few comments to explain the design
and intention behind the loop based on some conversations with Mike
and Peter.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1457535548-15329-1-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c114d9..303d639 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5055,7 +5055,19 @@ static int select_idle_sibling(struct task_struct *p, int target)
return i;
 
/*
-* Otherwise, iterate the domains and find an elegible idle cpu.
+* Otherwise, iterate the domains and find an eligible idle cpu.
+*
+* A completely idle sched group at higher domains is more
+* desirable than an idle group at a lower level, because lower
+* domains have smaller groups and usually share hardware
+* resources which causes tasks to contend on them, e.g. x86
+* hyperthread siblings in the lowest domain (SMT) can contend
+* on the shared cpu pipeline.
+*
+* However, while we prefer idle groups at higher domains
+* finding an idle cpu at the lowest domain is still better than
+* returning 'target', which we've already established, isn't
+* idle.
 */
sd = rcu_dereference(per_cpu(sd_llc, target));
for_each_lower_domain(sd) {
@@ -5065,11 +5077,16 @@ static int select_idle_sibling(struct task_struct *p, int target)
tsk_cpus_allowed(p)))
goto next;
 
+   /* Ensure the entire group is idle */
for_each_cpu(i, sched_group_cpus(sg)) {
if (i == target || !idle_cpu(i))
goto next;
}
 
+   /*
+* It doesn't matter which cpu we pick, the
+* whole group is idle.
+*/
target = cpumask_first_and(sched_group_cpus(sg),
tsk_cpus_allowed(p));
goto done;
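
The group scan that the new comments describe can be sketched in isolation.
This is a simplified model under stated assumptions, not the scheduler code:
"struct group" and pick_idle_group_cpu() are hypothetical names, the walk over
scheduling domains is flattened into a single array of groups, and the
tsk_cpus_allowed() filter is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flat model of one sched group: a list of CPU numbers. */
struct group {
	const int *cpus;
	size_t nr_cpus;
};

/*
 * Return the first CPU of the first group in which every CPU is idle and
 * none of them is 'target'. Return 'target' if no such group exists.
 */
int pick_idle_group_cpu(const struct group *groups, size_t nr_groups,
			const bool *cpu_idle, int target)
{
	assert(groups && cpu_idle);

	for (size_t g = 0; g < nr_groups; g++) {
		bool all_idle = true;

		/* Ensure the entire group is idle. */
		for (size_t i = 0; i < groups[g].nr_cpus; i++) {
			int cpu = groups[g].cpus[i];

			if (cpu == target || !cpu_idle[cpu]) {
				all_idle = false;
				break;
			}
		}

		/* Any CPU of a fully idle group will do; take the first. */
		if (all_idle && groups[g].nr_cpus > 0)
			return groups[g].cpus[0];
	}

	return target;
}
```

A group containing 'target' is always skipped, because 'target' has already
been established as busy by the time the loop runs.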

[tip:efi/core] x86/mm/pat: Fix boot crash when 1GB pages are not supported by the CPU

2016-03-16 Thread tip-bot for Matt Fleming

Commit-ID:  d367cef0a7f0c6ee86e997c0cb455b21b3c6b9ba
Gitweb: http://git.kernel.org/tip/d367cef0a7f0c6ee86e997c0cb455b21b3c6b9ba
Author: Matt Fleming 
AuthorDate: Mon, 14 Mar 2016 10:33:01 +
Committer:  Ingo Molnar 
CommitDate: Wed, 16 Mar 2016 09:00:49 +0100

x86/mm/pat: Fix boot crash when 1GB pages are not supported by the CPU

Scott reports that with the new separate EFI page tables he's seeing
the following error on boot, caused by setting reserved bits in the
page table structures (fault code is PF_RSVD | PF_PROT),

  swapper/0: Corrupted page table at address 17b102020
  PGD 17b0e5063 PUD 140e3
  Bad pagetable: 0009 [#1] SMP

On first inspection the PUD is using a 1GB page size (_PAGE_PSE) and
looks fine but that's only true if support for 1GB PUD pages
("pdpe1gb") is present in the CPU.

Scott's Intel Celeron N2820 does not have that feature and so the
_PAGE_PSE bit is reserved. Fix this issue by making the 1GB mapping
code conditional on "cpu_has_gbpages".

This issue didn't come up in the past because the required mapping for
the faulting address (0x17b102020) will already have been set up by the
kernel in early boot before we got to efi_map_regions(), but we no
longer use the standard kernel page tables during EFI calls.
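
The effect of the added condition can be modelled outside the kernel. The
sketch below is an illustration only: count_gb_mappings() and its has_gbpages
parameter are hypothetical stand-ins for the loop in populate_pud() and for
cpu_has_gbpages, and nothing here touches real page tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PUD_SIZE ((uint64_t)1 << 30)	/* 1GB, the span of one PUD entry */

/*
 * Count how many 1GB entries a range starting on a 1GB boundary would get.
 * With has_gbpages false the loop body never runs, so the whole range is
 * left to the smaller-page paths.
 */
unsigned int count_gb_mappings(uint64_t start, uint64_t end, bool has_gbpages)
{
	unsigned int n = 0;

	assert(start <= end && start % PUD_SIZE == 0);

	while (has_gbpages && end - start >= PUD_SIZE) {
		start += PUD_SIZE;
		n++;
	}

	return n;
}
```

For a range of 3GB plus one page this yields three 1GB entries when the
feature is present and none when it is absent, which is the behaviour needed
on a CPU where _PAGE_PSE in a PUD is a reserved bit.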

Reported-by: Scott Ashcroft 
Tested-by: Scott Ashcroft 
Signed-off-by: Matt Fleming 
Acked-by: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Ben Hutchings 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Maarten Lankhorst 
Cc: Matthew Garrett 
Cc: Peter Zijlstra 
Cc: Raphael Hertzog 
Cc: Roger Shimizu 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1457951581-27353-2-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/mm/pageattr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 14c38ae..fcf8e29 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1055,7 +1055,7 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
/*
 * Map everything starting from the Gb boundary, possibly with 1G pages
 */
-   while (end - start >= PUD_SIZE) {
+   while (cpu_has_gbpages && end - start >= PUD_SIZE) {
set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
   massage_pgprot(pud_pgprot)));

[tip:x86/urgent] x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables

2016-03-12 Thread tip-bot for Matt Fleming

Commit-ID:  452308de61056a539352a9306c46716d7af8a1f1
Gitweb: http://git.kernel.org/tip/452308de61056a539352a9306c46716d7af8a1f1
Author: Matt Fleming 
AuthorDate: Fri, 11 Mar 2016 11:19:23 +
Committer:  Ingo Molnar 
CommitDate: Sat, 12 Mar 2016 16:57:45 +0100

x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables

Some machines have EFI regions in page zero (physical address
0x0) and historically that region has been added to the e820
map via trim_bios_range(), and ultimately mapped into the kernel page
tables. It was not mapped via efi_map_regions() as one would expect.

Alexis reports that with the new separate EFI page tables some boot
services regions, such as page zero, are not mapped. This triggers an
oops during the SetVirtualAddressMap() runtime call.

For the EFI boot services quirk on x86 we need to memblock_reserve()
boot services regions until after SetVirtualAddressMap(). Doing that
while respecting the ownership of regions that may have already been
reserved by the kernel was the motivation behind this commit:

  7d68dc3f1003 ("x86, efi: Do not reserve boot services regions within reserved areas")

That patch was merged at a time when the EFI runtime virtual mappings
were inserted into the kernel page tables as described above, and the
trick of setting ->numpages (and hence the region size) to zero to
track regions that should not be freed in efi_free_boot_services()
meant that we never mapped those regions in efi_map_regions(). Instead
we were relying solely on the existing kernel mappings.

Now that we have separate page tables we need to make sure the EFI
boot services regions are mapped correctly, even if someone else has
already called memblock_reserve(). Instead of stashing a tag in
->numpages, set the EFI_MEMORY_RUNTIME bit of ->attribute. Since it
generally makes no sense to mark a boot services region as required at
runtime, it's pretty much guaranteed the firmware will not have
already set this bit.
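
The difference between the two tagging schemes can be sketched as follows.
This is a minimal model, not the code in quirks.c: "struct mem_desc" is a
hypothetical cut-down descriptor and tag_old(), tag_new() and may_free() are
illustrative names. The numeric constants follow the UEFI specification's
memory map definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the UEFI specification's memory map definitions. */
#define EFI_BOOT_SERVICES_CODE	3
#define EFI_BOOT_SERVICES_DATA	4
#define EFI_MEMORY_RUNTIME	((uint64_t)1 << 63)

/* Hypothetical, cut-down descriptor; the real one has more fields. */
struct mem_desc {
	uint32_t type;
	uint64_t num_pages;
	uint64_t attribute;
};

/* Old scheme: zeroing the size marks the region but loses its extent. */
void tag_old(struct mem_desc *md)
{
	md->num_pages = 0;
}

/* New scheme: the size survives, so the region can still be mapped. */
void tag_new(struct mem_desc *md)
{
	md->attribute |= EFI_MEMORY_RUNTIME;
}

/* Under the new scheme a boot services region is freed only if untagged. */
bool may_free(const struct mem_desc *md)
{
	assert(md->type == EFI_BOOT_SERVICES_CODE ||
	       md->type == EFI_BOOT_SERVICES_DATA);
	return !(md->attribute & EFI_MEMORY_RUNTIME);
}
```

A descriptor tagged with tag_new() keeps its page count, so code that maps
regions into the EFI page tables still sees the full extent.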

For the record, the specific circumstances under which Alexis
triggered this bug were that an EFI runtime driver on his machine was
responding to the EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE event during
SetVirtualAddressMap().

The event handler for this driver looks like this,

  sub rsp,0x28
  lea rdx,[rip+0x2445] # 0xaa948720
  mov ecx,0x4
  call func_aa9447c0  ; call to ConvertPointer(4, & 0xaa948720)
  mov r11,QWORD PTR [rip+0x2434] # 0xaa948720
  xor eax,eax
  mov BYTE PTR [r11+0x1],0x1
  add rsp,0x28
  ret

Which is pretty typical code for an EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE
handler. The "mov r11, QWORD PTR [rip+0x2434]" was the faulting
instruction because ConvertPointer() was being called to convert the
address 0x0, which when converted is left unchanged and
remains 0x0.

The output of the oops trace gave the impression of a standard NULL
pointer dereference bug, but because we're accessing physical
addresses during ConvertPointer(), it wasn't. EFI boot services code
is stored at that address on Alexis' machine.

Reported-by: Alexis Murzeau 
Signed-off-by: Matt Fleming 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Ben Hutchings 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Maarten Lankhorst 
Cc: Matthew Garrett 
Cc: Peter Zijlstra 
Cc: Raphael Hertzog 
Cc: Roger Shimizu 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1457695163-29632-2-git-send-email-m...@codeblueprint.co.uk
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=815125
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/quirks.c | 79 +-
 1 file changed, 62 insertions(+), 17 deletions(-)

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 2d66db8..ed30e79 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -131,6 +131,27 @@ efi_status_t efi_query_variable_store(u32 attributes, unsigned long size)
 EXPORT_SYMBOL_GPL(efi_query_variable_store);
 
 /*
+ * Helper function for efi_reserve_boot_services() to figure out if we
+ * can free regions in efi_free_boot_services().
+ *
+ * Use this function to ensure we do not free regions owned by somebody
+ * else. We must only reserve (and then free) regions:
+ *
+ * - Not within any part of the kernel
+ * - Not the BIOS reserved area (E820_RESERVED, E820_NVS, etc)
+ */
+static bool can_free_region(u64 start, u64 size)
+{
+   if (start + size > __pa_symbol(_text) && start <= __pa_symbol(_end))
+   return false;
+
+   if (!e820_all_mapped(start, start+size, E820_RAM))
+   return false;
+
+   return true;
+}
+
+/*
  * The UEFI specification makes it clear that the operating system is free to do
  * whatever it wants with boot services code after ExitBootServices() has been
  * called. Ignoring this recommendation a significant bunch of EFI implementations
@@

[tip:x86/urgent] x86/mm/pat: Avoid truncation when converting cpa->numpages to address

2016-01-29 Thread tip-bot for Matt Fleming

Commit-ID:  742563777e8da62197d6cb4b99f4027f59454735
Gitweb: http://git.kernel.org/tip/742563777e8da62197d6cb4b99f4027f59454735
Author: Matt Fleming 
AuthorDate: Fri, 29 Jan 2016 11:36:10 +
Committer:  Thomas Gleixner 
CommitDate: Fri, 29 Jan 2016 15:03:09 +0100

x86/mm/pat: Avoid truncation when converting cpa->numpages to address

There are a couple of nasty truncation bugs lurking in the pageattr
code that can be triggered when mapping EFI regions, e.g. when we pass
a cpa->pgd pointer. Because cpa->numpages is a 32-bit value, shifting
left by PAGE_SHIFT will truncate the resultant address to 32 bits.

Viorel-Cătălin managed to trigger this bug on his Dell machine that
provides a ~5GB EFI region which requires 1236992 pages to be mapped.
When calling populate_pud() the end of the region gets calculated
incorrectly in the following buggy expression,

  end = start + (cpa->numpages << PAGE_SHIFT);

And only 188416 pages are mapped. Next, populate_pud() gets invoked
for a second time because of the loop in __change_page_attr_set_clr(),
only this time no pages get mapped because shifting the remaining
number of pages (1048576) by PAGE_SHIFT is zero. At which point the
loop in __change_page_attr_set_clr() spins forever because we fail to
make progress.
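
The page counts quoted above can be reproduced with a short sketch. It is a
model under stated assumptions rather than the pageattr code: uint32_t stands
in for the kernel's 32-bit 'int' so that the wrap-around is well defined in
portable C, and pages_mapped() and remaining_after_pass() are hypothetical
helper names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/*
 * Pages covered when the byte count is computed in 32 bits: any bits
 * above bit 31 of (numpages << PAGE_SHIFT) are discarded.
 */
uint64_t pages_mapped(uint32_t numpages)
{
	uint32_t bytes = (uint32_t)(numpages << PAGE_SHIFT);

	return bytes >> PAGE_SHIFT;
}

/* One pass of the outer loop: subtract what the pass managed to map. */
uint32_t remaining_after_pass(uint32_t remaining)
{
	uint64_t mapped = pages_mapped(remaining);

	/* Mirrors the existing BUG_ON(cpa->numpages > numpages). */
	assert(mapped <= remaining);
	return remaining - (uint32_t)mapped;
}
```

Starting from 1236992 pages the first pass covers 188416 and leaves 1048576.
A second pass over 1048576 pages covers none and leaves the count unchanged,
which is the lack of progress described above.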

Hitting this bug depends very much on the virtual address we pick to
map the large region at and how many pages we map on the initial run
through the loop. This explains why this issue was only recently hit
with the introduction of commit

  a5caa209ba9c ("x86/efi: Fix boot crash by mapping EFI memmap
   entries bottom-up at runtime, instead of top-down")

It's interesting to note that safe uses of cpa->numpages do exist in
the pageattr code. If instead of shifting ->numpages we multiply by
PAGE_SIZE, no truncation occurs because PAGE_SIZE is a UL value, and
so the result is unsigned long.
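
The contrast between shifting and multiplying can be shown with the same page
count. This is a sketch assuming a 64-bit 'unsigned long', modelled here with
uint64_t so that it behaves the same on any host. PAGE_SIZE is redefined
locally and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
/* 64-bit stand-in for the kernel's unsigned long PAGE_SIZE on x86-64. */
#define PAGE_SIZE	((uint64_t)1 << PAGE_SHIFT)

/* Shift performed in the 32-bit type of the operand: high bits are lost. */
uint64_t bytes_by_shift(uint32_t numpages)
{
	return (uint32_t)(numpages << PAGE_SHIFT);
}

/* Multiplying by a 64-bit constant converts numpages to 64 bits first. */
uint64_t bytes_by_multiply(uint32_t numpages)
{
	uint64_t bytes = numpages * PAGE_SIZE;

	assert(bytes >> PAGE_SHIFT == numpages);	/* nothing was lost */
	return bytes;
}
```

For 1236992 pages the multiply form gives 5066719232 bytes, while the shift
form gives 771751936 bytes, the low 32 bits of the same value.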

To avoid surprises when users try to convert very large cpa->numpages
values to addresses, change the data type from 'int' to 'unsigned
long', thereby making it suitable for shifting by PAGE_SHIFT without
any type casting.

The alternative would be to make liberal use of casting, but that is
far more likely to cause problems in the future when someone adds more
code and fails to cast properly; this bug was difficult enough to
track down in the first place.

Reported-and-tested-by: Viorel-Cătălin Răpițeanu 
Acked-by: Borislav Petkov 
Cc: Sai Praneeth Prakhya 
Cc: 
Signed-off-by: Matt Fleming 
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110131
Link: http://lkml.kernel.org/r/1454067370-10374-1-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Thomas Gleixner 
---
 arch/x86/mm/pageattr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index fc6a4c8..2440814 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -33,7 +33,7 @@ struct cpa_data {
pgd_t   *pgd;
pgprot_t   mask_set;
pgprot_t   mask_clr;
-   int numpages;
+   unsigned long   numpages;
int flags;
unsigned long   pfn;
unsigned   force_split : 1;
@@ -1350,7 +1350,7 @@ static int __change_page_attr_set_clr(struct cpa_data *cpa, int checkalias)
 * CPA operation. Either a large page has been
 * preserved or a single page update happened.
 */
-   BUG_ON(cpa->numpages > numpages);
+   BUG_ON(cpa->numpages > numpages || !cpa->numpages);
numpages -= cpa->numpages;
if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY))
cpa->curpage++;

[tip:efi/core] x86/efi: Setup separate EFI page tables in kexec paths

2016-01-22 Thread tip-bot for Matt Fleming

Commit-ID:  753b11ef8e92a1c1bbe97f2a5ec14bdd1ef2e6fe
Gitweb: http://git.kernel.org/tip/753b11ef8e92a1c1bbe97f2a5ec14bdd1ef2e6fe
Author: Matt Fleming 
AuthorDate: Thu, 21 Jan 2016 14:11:59 +
Committer:  Ingo Molnar 
CommitDate: Thu, 21 Jan 2016 21:01:34 +0100

x86/efi: Setup separate EFI page tables in kexec paths

The switch to using a new dedicated page table for EFI runtime
calls in commit 67a9108ed431 ("x86/efi: Build our own
page table structures") failed to take into account changes
required for the kexec code paths, which are unfortunately
duplicated in the EFI code.

Call the allocation and setup functions in
kexec_enter_virtual_mode() just like we do for
__efi_enter_virtual_mode() to avoid hitting NULL-pointer
dereferences when making EFI runtime calls.

At the very least, the call to efi_setup_page_tables() should
have existed for kexec before the following commit:

  67a9108ed431 ("x86/efi: Build our own page table structures")

Things just magically worked because we were actually using
the kernel's page tables that contained the required mappings.
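
The page count passed to efi_setup_page_tables() in the patch is the size of
the memory map rounded up to whole pages. A minimal sketch of that arithmetic
follows. ALIGN_UP() and memmap_pages() are hypothetical names chosen for the
illustration, and PAGE_SHIFT is assumed to be 12.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	((uint64_t)1 << PAGE_SHIFT)

/* Round x up to a multiple of a, where a is a power of two. */
#define ALIGN_UP(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

/* Pages needed to hold nr_map descriptors of desc_size bytes each. */
uint64_t memmap_pages(uint64_t nr_map, uint64_t desc_size)
{
	assert(desc_size > 0);
	return ALIGN_UP(nr_map * desc_size, PAGE_SIZE) >> PAGE_SHIFT;
}
```

With 100 descriptors of 48 bytes the map occupies 4800 bytes and therefore
two pages. A map of exactly 12288 bytes needs three pages, with no extra page
added by the rounding.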

Reported-by: Srikar Dronamraju 
Tested-by: Srikar Dronamraju 
Signed-off-by: Matt Fleming 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Young 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Raghavendra K T 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1453385519-11477-1-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 3c1f3cd..bdd9477 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -815,6 +815,7 @@ static void __init kexec_enter_virtual_mode(void)
 {
 #ifdef CONFIG_KEXEC_CORE
efi_memory_desc_t *md;
+   unsigned int num_pages;
void *p;
 
efi.systab = NULL;
@@ -829,6 +830,12 @@ static void __init kexec_enter_virtual_mode(void)
return;
}
 
+   if (efi_alloc_page_tables()) {
+   pr_err("Failed to allocate EFI page tables\n");
+   clear_bit(EFI_RUNTIME_SERVICES, );
+   return;
+   }
+
/*
* Map efi regions which were passed via setup_data. The virt_addr is a
* fixed addr which was used in first kernel of a kexec boot.
@@ -843,6 +850,14 @@ static void __init kexec_enter_virtual_mode(void)
 
BUG_ON(!efi.systab);
 
+   num_pages = ALIGN(memmap.nr_map * memmap.desc_size, PAGE_SIZE);
+   num_pages >>= PAGE_SHIFT;
+
+   if (efi_setup_page_tables(memmap.phys_map, num_pages)) {
+   clear_bit(EFI_RUNTIME_SERVICES, );
+   return;
+   }
+
efi_sync_low_kernel_mappings();
 
/*

[tip:x86/efi] x86/efi-bgrt: Replace early_memremap() with memremap()

2016-01-06 Thread tip-bot for Matt Fleming

Commit-ID:  e2c90dd7e11e3025b46719a79fb4bb1e7a5cef9f
Gitweb: http://git.kernel.org/tip/e2c90dd7e11e3025b46719a79fb4bb1e7a5cef9f
Author: Matt Fleming 
AuthorDate: Mon, 21 Dec 2015 14:12:52 +
Committer:  Thomas Gleixner 
CommitDate: Wed, 6 Jan 2016 18:28:52 +0100

x86/efi-bgrt: Replace early_memremap() with memremap()

Môshe reported the following warning triggered on his machine since
commit 50a0cb565246 ("x86/efi-bgrt: Fix kernel panic when mapping BGRT
data"),

  [0.026936] [ cut here ]
  [0.026941] WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:137 __early_ioremap+0x102/0x1bb()
  [0.026941] Modules linked in:
  [0.026944] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-rc1 #2
  [0.026945] Hardware name: Dell Inc. XPS 13 9343/09K8G1, BIOS A05 07/14/2015
  [0.026946]   900f03d5a116524d 81c03e60 813a3fff
  [0.026948]   81c03e98 810a0852 d7b76000
  [0.026949]   0001 0001 017c
  [0.026951] Call Trace:
  [0.026955]  [] dump_stack+0x44/0x55
  [0.026958]  [] warn_slowpath_common+0x82/0xc0
  [0.026959]  [] warn_slowpath_null+0x1a/0x20
  [0.026961]  [] __early_ioremap+0x102/0x1bb
  [0.026962]  [] early_memremap+0x13/0x15
  [0.026964]  [] efi_bgrt_init+0x162/0x1ad
  [0.026966]  [] efi_late_init+0x9/0xb
  [0.026968]  [] start_kernel+0x46f/0x49f
  [0.026970]  [] ? early_idt_handler_array+0x120/0x120
  [0.026972]  [] x86_64_start_reservations+0x2a/0x2c
  [0.026974]  [] x86_64_start_kernel+0x14a/0x16d
  [0.026977] ---[ end trace f9b3812eb8e24c58 ]---
  [0.026978] efi_bgrt: Ignoring BGRT: failed to map image memory

early_memremap() has an upper limit of ~200KB on the size of mapping
it can handle. Clearly the BGRT image on Môshe's machine is
much larger than that.

There's actually no reason to restrict ourselves to using the early_*
version of memremap() - the ACPI BGRT driver is invoked late enough in
boot that we can use the standard version, with the benefit that the
late version allows mappings of arbitrary size.
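
The function being patched reads the image in two steps: a small mapping to
fetch the BMP header and learn the image size, then a second mapping covering
the whole image, which is the one that can exceed the early mapping limit. A
minimal userspace sketch of that two-step copy follows. The struct and
function names are hypothetical, plain memcpy() stands in for the kernel
mapping calls, and the size field is read in host byte order, so a
little-endian host is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the first two fields of a BMP file header. */
struct bmp_header_sketch {
	uint16_t id;    /* "BM" magic */
	uint32_t size;  /* total file size in bytes, host byte order here */
} __attribute__((packed));

/*
 * Copy a BMP image out of src. The fixed-size header is read first to learn
 * the full size, then the whole image is allocated and copied.
 * Returns NULL on failure; the caller frees the result.
 */
static uint8_t *copy_bmp_image(const uint8_t *src, size_t src_len,
			       size_t *out_len)
{
	struct bmp_header_sketch hdr;
	uint8_t *copy;

	if (src_len < sizeof(hdr))
		return NULL;
	memcpy(&hdr, src, sizeof(hdr));   /* step 1: header only */

	if (hdr.size < sizeof(hdr) || hdr.size > src_len)
		return NULL;

	copy = malloc(hdr.size);
	if (!copy)
		return NULL;
	memcpy(copy, src, hdr.size);      /* step 2: whole image */
	*out_len = hdr.size;
	return copy;
}
```

Only the second step scales with the image, which is why a per-mapping size
cap matters there and not for the header read.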

Reported-by: Môshe van der Sterre 
Tested-by: Môshe van der Sterre 
Signed-off-by: Matt Fleming 
Cc: Josh Triplett 
Cc: Sai Praneeth Prakhya 
Cc: Borislav Petkov 
Link: http://lkml.kernel.org/r/1450707172-12561-1-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Thomas Gleixner 
---
 arch/x86/platform/efi/efi-bgrt.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/platform/efi/efi-bgrt.c b/arch/x86/platform/efi/efi-bgrt.c
index bf51f4c..b097066 100644
--- a/arch/x86/platform/efi/efi-bgrt.c
+++ b/arch/x86/platform/efi/efi-bgrt.c
@@ -72,14 +72,14 @@ void __init efi_bgrt_init(void)
return;
}
 
-   image = early_memremap(bgrt_tab->image_address, sizeof(bmp_header));
+   image = memremap(bgrt_tab->image_address, sizeof(bmp_header), MEMREMAP_WB);
if (!image) {
pr_err("Ignoring BGRT: failed to map image header memory\n");
return;
}
 
memcpy(&bmp_header, image, sizeof(bmp_header));
-   early_memunmap(image, sizeof(bmp_header));
+   memunmap(image);
bgrt_image_size = bmp_header.size;
 
bgrt_image = kmalloc(bgrt_image_size, GFP_KERNEL | __GFP_NOWARN);
@@ -89,7 +89,7 @@ void __init efi_bgrt_init(void)
return;
}
 
-   image = early_memremap(bgrt_tab->image_address, bmp_header.size);
+   image = memremap(bgrt_tab->image_address, bmp_header.size, MEMREMAP_WB);
if (!image) {
pr_err("Ignoring BGRT: failed to map image memory\n");
kfree(bgrt_image);
@@ -98,5 +98,5 @@ void __init efi_bgrt_init(void)
}
 
memcpy(bgrt_image, image, bgrt_image_size);
-   early_memunmap(image, bmp_header.size);
+   memunmap(image);
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[tip:x86/efi] Documentation/x86: Update EFI memory region description

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  ff3d0a12fb2dc123e2b46e9524ebf4e08de5c59c
Gitweb: http://git.kernel.org/tip/ff3d0a12fb2dc123e2b46e9524ebf4e08de5c59c
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:35 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:43 +0100

Documentation/x86: Update EFI memory region description

Make it clear that the EFI page tables are only available during
EFI runtime calls since that subject has come up a fair number
of times in the past.

Additionally, add the EFI region start and end addresses to the
table so that it's possible to see at a glance where they fall
in relation to other regions.
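
As a quick arithmetic check on the "(=64 GB)" figure in the table, the sketch
below assumes the window runs from 0xffffffef00000000 to 0xffffffff00000000,
that is, from -68 GiB to -4 GiB in the 64-bit address space. Those bounds and
all identifiers are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ULL << 30)

/* Assumed bounds of the EFI region mapping space. */
static const uint64_t efi_region_start = 0xffffffef00000000ULL; /* -68 GiB */
static const uint64_t efi_region_end   = 0xffffffff00000000ULL; /*  -4 GiB */

/* Size of the window in GiB. */
static uint64_t efi_region_size_gib(void)
{
	return (efi_region_end - efi_region_start) / GIB;
}
```

With those bounds the difference is 0x1000000000 bytes, or 64 GiB, matching
the size stated in the table.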

Signed-off-by: Matt Fleming 
Reviewed-by: Borislav Petkov 
Acked-by: Borislav Petkov 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Jones 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Stephen Smalley 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-7-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 Documentation/x86/x86_64/mm.txt | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 05712ac..c518dce 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -16,6 +16,8 @@ ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB)
 ... unused hole ...
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
+ffffffef00000000 - ffffffff00000000 (=64 GB) EFI region mapping space
+... unused hole ...
 ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
 ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
 ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
@@ -32,11 +34,9 @@ reference.
 Current X86-64 implementations only support 40 bits of address space,
 but we support up to 46 bits. This expands into MBZ space in the page tables.
 
-->trampoline_pgd:
-
-We map EFI runtime services in the aforementioned PGD in the virtual
-range of 64Gb (arbitrarily set, can be raised if needed)
-
-0xffffffef00000000 - 0xffffffff00000000
+We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual
+memory window (this size is arbitrary, it can be raised later if needed).
+The mappings are not part of any other kernel PGD and are only available
+during EFI runtime calls.
 
 -Andi Kleen, Jul 2004

[tip:x86/efi] x86/efi: Build our own page table structures

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  67a9108ed4313b85a9c53406d80dc1ae3f8c3e36
Gitweb: http://git.kernel.org/tip/67a9108ed4313b85a9c53406d80dc1ae3f8c3e36
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:34 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:42 +0100

x86/efi: Build our own page table structures

With commit e1a58320a38d ("x86/mm: Warn on W^X mappings") all
users booting on 64-bit UEFI machines see the following warning,

  [ cut here ]
  WARNING: CPU: 7 PID: 1 at arch/x86/mm/dump_pagetables.c:225 note_page+0x5dc/0x780()
  x86/mm: Found insecure W+X mapping at address ffff88000005f000/0xffff88000005f000
  ...
  x86/mm: Checked W+X mappings: FAILED, 165660 W+X pages found.
  ...

This is caused by mapping EFI regions with RWX permissions.
There isn't much we can do to restrict the permissions for these
regions due to the way the firmware toolchains mix code and
data, but we can at least isolate these mappings so that they do
not appear in the regular kernel page tables.

In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual
mapping") we started using 'trampoline_pgd' to map the EFI
regions because there was an existing identity mapping there
which we use during the SetVirtualAddressMap() call and for
broken firmware that accesses those addresses.

But 'trampoline_pgd' shares some PGD entries with
'swapper_pg_dir' and does not provide the isolation we require.
Notably the virtual address for __START_KERNEL_map and
MODULES_START are mapped by the same PGD entry so we need to be
more careful when copying changes over in
efi_sync_low_kernel_mappings().
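
To see why those two addresses land in the same top-level entry, here is a
small sketch of the PGD index calculation. It assumes 4-level paging with 512
entries per table and a 39-bit PGD shift; the helper names are hypothetical.
The module start address 0xffffffffa0000000 is taken from the x86_64 memory
map and is an assumption about what MODULES_START refers to.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 4-level paging constants; not taken from the patch. */
#define PGDIR_SHIFT   39
#define PTRS_PER_PGD  512

/* Index of the top-level (PGD) entry that translates a virtual address. */
static unsigned int pgd_index_of(uint64_t vaddr)
{
	return (unsigned int)((vaddr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}

/* Non-zero when two addresses are translated through the same PGD entry. */
static int same_pgd_entry(uint64_t a, uint64_t b)
{
	return pgd_index_of(a) == pgd_index_of(b);
}
```

Each PGD entry covers 512 GiB, so everything in the top 512 GiB of the
address space, including the kernel text and module ranges, resolves to the
last entry (index 511). Isolation therefore has to happen below the PGD level
for that entry.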

This patch doesn't go the full mile; we still want to share some
PGD entries with 'swapper_pg_dir'. Having completely separate
page tables brings its own issues such as synchronising new
mappings after memory hotplug and module loading. Sharing also
keeps memory usage down.

Signed-off-by: Matt Fleming 
Reviewed-by: Borislav Petkov 
Acked-by: Borislav Petkov 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Jones 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Stephen Smalley 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-6-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/efi.h |  1 +
 arch/x86/platform/efi/efi.c| 39 ++---
 arch/x86/platform/efi/efi_32.c |  5 +++
 arch/x86/platform/efi/efi_64.c | 97 +++---
 4 files changed, 102 insertions(+), 40 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 347eeac..8fd9e63 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -136,6 +136,7 @@ extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
 extern void __init efi_map_region_fixed(efi_memory_desc_t *md);
 extern void efi_sync_low_kernel_mappings(void);
+extern int __init efi_alloc_page_tables(void);
 extern int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages);
 extern void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages);
 extern void __init old_map_region(efi_memory_desc_t *md);
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index ad28540..3c1f3cd 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -869,7 +869,7 @@ static void __init kexec_enter_virtual_mode(void)
  * This function will switch the EFI runtime services to virtual mode.
  * Essentially, we look through the EFI memmap and map every region that
  * has the runtime attribute bit set in its memory descriptor into the
- * ->trampoline_pgd page table using a top-down VA allocation scheme.
+ * efi_pgd page table.
  *
  * The old method which used to update that memory descriptor with the
  * virtual address obtained from ioremap() is still supported when the
@@ -879,8 +879,8 @@ static void __init kexec_enter_virtual_mode(void)
  *
  * The new method does a pagetable switch in a preemption-safe manner
  * so that we're in a different address space when calling a runtime
- * function. For function arguments passing we do copy the PGDs of the
- * kernel page table into ->trampoline_pgd prior to each call.
+ * function. For function arguments passing we do copy the PUDs of the
+ * kernel page table into efi_pgd prior to each call.
  *
  * Specially for kexec boot, efi runtime maps in previous kernel should
  * be passed in via setup_data. In that case runtime ranges will be mapped
@@ -895,6 +895,12 @@ static void __init __efi_enter_virtual_mode(void)
 
efi.systab = NULL;
 
+   if (efi_alloc_page_tables()) {
+   pr_err("Failed to allocate EFI page tables\n");
+

[tip:x86/efi] x86/efi: Map RAM into the identity page table for mixed mode

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  b61a76f8850d2979550abc42d7e09154ebb8d785
Gitweb: http://git.kernel.org/tip/b61a76f8850d2979550abc42d7e09154ebb8d785
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:32 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:42 +0100

x86/efi: Map RAM into the identity page table for mixed mode

We are relying on the pre-existing mappings in 'trampoline_pgd'
when accessing function arguments in the EFI mixed mode thunking
code.

Instead let's map memory explicitly so that things will continue
to work when we move to a separate page table in the future.
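
The explicit mapping amounts to walking the firmware memory map and picking
out the regions that are ordinary RAM. A minimal sketch of that selection
follows, assuming 4 KiB pages. The descriptor struct is a simplified,
hypothetical stand-in and the helper names are invented for illustration; the
numeric type values are the ones defined by the UEFI specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Memory type values as defined by the UEFI specification. */
enum {
	EFI_LOADER_CODE         = 1,
	EFI_LOADER_DATA         = 2,
	EFI_CONVENTIONAL_MEMORY = 7,
};

/* Simplified, hypothetical descriptor; the real one has more fields. */
struct mem_desc {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
};

/* True for the region types treated as ordinary RAM. */
static int is_ram_type(uint32_t type)
{
	return type == EFI_CONVENTIONAL_MEMORY ||
	       type == EFI_LOADER_DATA ||
	       type == EFI_LOADER_CODE;
}

/* First page frame number of a region. */
static uint64_t desc_first_pfn(const struct mem_desc *md)
{
	return md->phys_addr >> PAGE_SHIFT;
}

/* Count the pages that would be mapped 1:1 (virtual == physical). */
static uint64_t count_identity_pages(const struct mem_desc *map, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		if (is_ram_type(map[i].type))
			total += map[i].num_pages;
	return total;
}
```

Runtime-services and reserved regions are skipped by the filter; only memory
that may hold call arguments needs the 1:1 mapping.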

Signed-off-by: Matt Fleming 
Reviewed-by: Borislav Petkov 
Acked-by: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-4-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi_64.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 5aa186d..102976d 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -144,6 +144,7 @@ void efi_sync_low_kernel_mappings(void)
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
unsigned long pfn, text;
+   efi_memory_desc_t *md;
struct page *page;
unsigned npages;
pgd_t *pgd;
@@ -177,6 +178,25 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
if (!IS_ENABLED(CONFIG_EFI_MIXED))
return 0;
 
+   /*
+* Map all of RAM so that we can access arguments in the 1:1
+* mapping when making EFI runtime calls.
+*/
+   for_each_efi_memory_desc(&memmap, md) {
+   if (md->type != EFI_CONVENTIONAL_MEMORY &&
+   md->type != EFI_LOADER_DATA &&
+   md->type != EFI_LOADER_CODE)
+   continue;
+
+   pfn = md->phys_addr >> PAGE_SHIFT;
+   npages = md->num_pages;
+
+   if (kernel_map_pages_in_pgd(pgd, pfn, md->phys_addr, npages, 0)) {
+   pr_err("Failed to map 1:1 memory\n");
+   return 1;
+   }
+   }
+
page = alloc_page(GFP_KERNEL|__GFP_DMA32);
if (!page)
panic("Unable to allocate EFI runtime stack < 4GB\n");

[tip:x86/efi] x86/efi: Hoist page table switching code into efi_call_virt()

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  c9f2a9a65e4855b74d92cdad688f6ee4a1a323ff
Gitweb: http://git.kernel.org/tip/c9f2a9a65e4855b74d92cdad688f6ee4a1a323ff
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:33 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:42 +0100

x86/efi: Hoist page table switching code into efi_call_virt()

This change is a prerequisite for pending patches that switch to
a dedicated EFI page table, instead of using 'trampoline_pgd'
which shares PGD entries with 'swapper_pg_dir'. The pending
patches make it impossible to dereference the runtime service
function pointer without first switching %cr3.

It's true that we now have duplicated switching code in
efi_call_virt() and efi_call_phys_{prolog,epilog}(), but we are
accepting that duplication in exchange for a little more clarity
and the ease of writing the page table switching code in C instead
of asm.
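
The pattern being hoisted is save, switch, call, restore. A userspace sketch
of it follows, with a plain variable standing in for %cr3. All names here are
hypothetical, and the TLB flushes and the preemption and FPU handling that
surround the real call are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated %cr3; a real kernel reads and writes the register instead. */
static uint64_t sim_cr3;

struct scratch_sketch {
	uint64_t prev_cr3;  /* page table base to restore afterwards */
	uint64_t efi_pgt;   /* base of the dedicated EFI page table */
	int      use_pgd;   /* switch only when a dedicated table exists */
};

typedef uint64_t (*runtime_fn)(void);

/* Switch page tables, make the call, then switch back. */
static uint64_t call_with_efi_pgd(struct scratch_sketch *s, runtime_fn fn)
{
	uint64_t status;

	if (s->use_pgd) {
		s->prev_cr3 = sim_cr3;
		sim_cr3 = s->efi_pgt;
	}

	status = fn();  /* runs with the EFI page table active */

	if (s->use_pgd)
		sim_cr3 = s->prev_cr3;

	return status;
}

/* Example callee: reports which page table was live during the call. */
static uint64_t report_cr3(void)
{
	return sim_cr3;
}
```

Doing the switch before the function pointer is even loaded is the point of
the hoist: once the EFI mappings live only in their own page table, the
pointer cannot be dereferenced from the regular kernel address space.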

Signed-off-by: Matt Fleming 
Reviewed-by: Borislav Petkov 
Acked-by: Borislav Petkov 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Jones 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Stephen Smalley 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-5-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/efi.h  | 25 +
 arch/x86/platform/efi/efi_64.c  | 24 ++---
 arch/x86/platform/efi/efi_stub_64.S | 43 -
 3 files changed, 36 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 0010c78..347eeac 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * We map the EFI regions needed for runtime services non-contiguously,
@@ -64,6 +65,17 @@ extern u64 asmlinkage efi_call(void *fp, ...);
 
 #define efi_call_phys(f, args...)  efi_call((f), args)
 
+/*
+ * Scratch space used for switching the pagetable in the EFI stub
+ */
+struct efi_scratch {
+   u64 r15;
+   u64 prev_cr3;
+   pgd_t   *efi_pgt;
+   bool    use_pgd;
+   u64 phys_stack;
+} __packed;
+
 #define efi_call_virt(f, ...)  \
 ({ \
efi_status_t __s;   \
@@ -71,7 +83,20 @@ extern u64 asmlinkage efi_call(void *fp, ...);
efi_sync_low_kernel_mappings(); \
preempt_disable();  \
__kernel_fpu_begin();   \
+   \
+   if (efi_scratch.use_pgd) {  \
+   efi_scratch.prev_cr3 = read_cr3();  \
+   write_cr3((unsigned long)efi_scratch.efi_pgt);  \
+   __flush_tlb_all();  \
+   }   \
+   \
__s = efi_call((void *)efi.systab->runtime->f, __VA_ARGS__);\
+   \
+   if (efi_scratch.use_pgd) {  \
+   write_cr3(efi_scratch.prev_cr3);\
+   __flush_tlb_all();  \
+   }   \
+   \
__kernel_fpu_end(); \
preempt_enable();   \
__s;\
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 102976d..b19cdac 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -47,16 +47,7 @@
  */
 static u64 efi_va = EFI_VA_START;
 
-/*
- * Scratch space used for switching the pagetable in the EFI stub
- */
-struct efi_scratch {
-   u64 r15;
-   u64 prev_cr3;
-   pgd_t *efi_pgt;
-   bool use_pgd;
-   u64 phys_stack;
-} __packed;
+struct efi_scratch efi_scratch;
 
 static void __init early_code_mapping_set_exec(int executable)
 {
@@ -83,8 +74,11 @@ pgd_t * __init efi_call_phys_prolog(void)
int pgd;
int n_pgds;
 
-   if (!efi_enabled(EFI_OLD_MEMMAP))
-   return NULL;
+   if (!efi_enabled(EFI_OLD_MEMMAP)) {
+   save_pgd = (pgd_t *)read_cr3();
+   write_cr3((unsigned

[tip:x86/efi] x86/mm/pat: Ensure cpa->pfn only contains page frame numbers

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  edc3b9129cecd0f0857112136f5b8b1bc1d45918
Gitweb: http://git.kernel.org/tip/edc3b9129cecd0f0857112136f5b8b1bc1d45918
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:31 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:42 +0100

x86/mm/pat: Ensure cpa->pfn only contains page frame numbers

The x86 pageattr code is confused about the data that is stored
in cpa->pfn: sometimes it's treated as a page frame number,
sometimes as an unshifted physical address, and in one place as
a pte.

The result of this is that the mapping functions do not map the
intended physical address.

This isn't a problem in practice because most of the addresses
we're mapping in the EFI code paths are already mapped in
'trampoline_pgd' and so the pageattr mapping functions don't
actually do anything in this case. But when we move to using a
separate page table for the EFI runtime this will be an issue.
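
The unit confusion is easy to reproduce outside the kernel. The sketch below
assumes 4 KiB pages and 2 MiB PMD-level pages; the helper names are
hypothetical. It shows the two conversions and the correct step sizes once
the field holds a page frame number and not a physical address.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)
#define PMD_SHIFT  21
#define PMD_SIZE   (1ULL << PMD_SHIFT)

/* Physical address to page frame number and back. */
static uint64_t phys_to_pfn(uint64_t phys) { return phys >> PAGE_SHIFT; }
static uint64_t pfn_to_phys(uint64_t pfn)  { return pfn << PAGE_SHIFT; }

/* A pfn advances by one per 4 KiB page, not by PAGE_SIZE. */
static uint64_t next_pfn(uint64_t pfn)
{
	return pfn + 1;
}

/* A 2 MiB step covers PMD_SIZE >> PAGE_SHIFT (512) page frames. */
static uint64_t pfn_after_pmd(uint64_t pfn)
{
	return pfn + (PMD_SIZE >> PAGE_SHIFT);
}
```

Adding PAGE_SIZE to a pfn, as the old code did, skips 4096 frames (16 MiB) at
a time, which is the kind of mismatch the patch removes.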

Signed-off-by: Matt Fleming 
Reviewed-by: Borislav Petkov 
Acked-by: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-3-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/mm/pageattr.c | 17 ++---
 arch/x86/platform/efi/efi_64.c | 16 ++--
 2 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a3137a4..c70e420 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -905,15 +905,10 @@ static void populate_pte(struct cpa_data *cpa,
pte = pte_offset_kernel(pmd, start);
 
while (num_pages-- && start < end) {
-
-   /* deal with the NX bit */
-   if (!(pgprot_val(pgprot) & _PAGE_NX))
-   cpa->pfn &= ~_PAGE_NX;
-
-   set_pte(pte, pfn_pte(cpa->pfn >> PAGE_SHIFT, pgprot));
+   set_pte(pte, pfn_pte(cpa->pfn, pgprot));
 
start+= PAGE_SIZE;
-   cpa->pfn += PAGE_SIZE;
+   cpa->pfn++;
pte++;
}
 }
@@ -969,11 +964,11 @@ static int populate_pmd(struct cpa_data *cpa,
 
pmd = pmd_offset(pud, start);
 
-   set_pmd(pmd, __pmd(cpa->pfn | _PAGE_PSE |
+   set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
   massage_pgprot(pmd_pgprot)));
 
start += PMD_SIZE;
-   cpa->pfn  += PMD_SIZE;
+   cpa->pfn  += PMD_SIZE >> PAGE_SHIFT;
cur_pages += PMD_SIZE >> PAGE_SHIFT;
}
 
@@ -1042,11 +1037,11 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
 * Map everything starting from the Gb boundary, possibly with 1G pages
 */
while (end - start >= PUD_SIZE) {
-   set_pud(pud, __pud(cpa->pfn | _PAGE_PSE |
+   set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
   massage_pgprot(pud_pgprot)));
 
start += PUD_SIZE;
-   cpa->pfn  += PUD_SIZE;
+   cpa->pfn  += PUD_SIZE >> PAGE_SHIFT;
cur_pages += PUD_SIZE >> PAGE_SHIFT;
pud++;
}
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index a0ac0f9..5aa186d 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -143,7 +143,7 @@ void efi_sync_low_kernel_mappings(void)
 
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
-   unsigned long text;
+   unsigned long pfn, text;
struct page *page;
unsigned npages;
pgd_t *pgd;
@@ -160,7 +160,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 * and ident-map those pages containing the map before calling
 * phys_efi_set_virtual_address_map().
 */
-   if (kernel_map_pages_in_pgd(pgd, pa_memmap, pa_memmap, num_pages, _PAGE_NX)) {
+   pfn = pa_memmap >> PAGE_SHIFT;
+   if (kernel_map_pages_in_pgd(pgd, pfn, pa_memmap, num_pages, _PAGE_NX)) {
pr_err("Error ident-mapping new memmap (0x%lx)!\n", pa_memmap);
return 1;
}
@@ -185,8 +186,9 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 
npages = (_end - _text) >> PAGE_SHIFT;
text = __pa(_text);
+   pfn = text >> PAGE_SHIFT;
 
-   if (kernel_map_pages_in_pgd(pgd, text >> PAGE_SHIFT, text, npages, 0)) {
+   if (kernel_map_pages_in_pgd(pgd, pfn, text, npages, 0)) {
pr_err("Failed to map kernel text 1:1\n");
return 1;
}
@@ -204,12 +206,14 @@ void __init

[tip:x86/efi] x86/mm: Page align the '_end' symbol to avoid pfn conversion bugs

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  21cdb6b568435738cc0b303b2b3b82742396310c
Gitweb: http://git.kernel.org/tip/21cdb6b568435738cc0b303b2b3b82742396310c
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:30 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:42 +0100

x86/mm: Page align the '_end' symbol to avoid pfn conversion bugs

Ingo noted that if we can guarantee _end is aligned to PAGE_SIZE
we can automatically avoid bugs along the lines of,

size = _end - _text >> PAGE_SHIFT

which is missing a call to PFN_ALIGN(). The EFI mixed mode
code contains this bug, for example.

_text is already aligned to PAGE_SIZE through the use of
LOAD_PHYSICAL_ADDR, and the BSS and BRK sections are explicitly
aligned in the linker script, so it makes sense to align _end to
match.
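
The effect of the missing round-up can be shown with plain integers. This
sketch assumes 4 KiB pages and uses hypothetical helper names;
page_align_up() performs the same rounding as the kernel's PFN_ALIGN().

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Round an address up to the next page boundary. */
static uint64_t page_align_up(uint64_t addr)
{
	return (addr + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

/* Truncating count: silently drops a trailing partial page. */
static uint64_t pages_truncated(uint64_t start, uint64_t end)
{
	return (end - start) >> PAGE_SHIFT;
}

/* Rounded count: includes the trailing partial page. */
static uint64_t pages_rounded(uint64_t start, uint64_t end)
{
	return (page_align_up(end) - start) >> PAGE_SHIFT;
}
```

When the end address is already page aligned the two counts agree, which is
why aligning _end in the linker script makes the unrounded expression safe.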

Reported-by: Ingo Molnar 
Signed-off-by: Matt Fleming 
Acked-by: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-2-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/vmlinux.lds.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 74e4bf1..4f19942 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -325,6 +325,7 @@ SECTIONS
__brk_limit = .;
}
 
+   . = ALIGN(PAGE_SIZE);
_end = .;
 
 STABS_DEBUG

[tip:x86/efi] x86/efi: Build our own page table structures

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  67a9108ed4313b85a9c53406d80dc1ae3f8c3e36
Gitweb: http://git.kernel.org/tip/67a9108ed4313b85a9c53406d80dc1ae3f8c3e36
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:34 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:42 +0100

x86/efi: Build our own page table structures

With commit e1a58320a38d ("x86/mm: Warn on W^X mappings") all
users booting on 64-bit UEFI machines see the following warning,

  [ cut here ]
  WARNING: CPU: 7 PID: 1 at arch/x86/mm/dump_pagetables.c:225 note_page+0x5dc/0x780()
  x86/mm: Found insecure W+X mapping at address 8805f000/0x8805f000
  ...
  x86/mm: Checked W+X mappings: FAILED, 165660 W+X pages found.
  ...

This is caused by mapping EFI regions with RWX permissions.
There isn't much we can do to restrict the permissions for these
regions due to the way the firmware toolchains mix code and
data, but we can at least isolate these mappings so that they do
not appear in the regular kernel page tables.

In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual
mapping") we started using 'trampoline_pgd' to map the EFI
regions because there was an existing identity mapping there
which we use during the SetVirtualAddressMap() call and for
broken firmware that accesses those addresses.

But 'trampoline_pgd' shares some PGD entries with
'swapper_pg_dir' and does not provide the isolation we require.
Notably the virtual address for __START_KERNEL_map and
MODULES_START are mapped by the same PGD entry so we need to be
more careful when copying changes over in
efi_sync_low_kernel_mappings().

This patch doesn't go the full mile, we still want to share some
PGD entries with 'swapper_pg_dir'. Having completely separate
page tables brings its own issues such as synchronising new
mappings after memory hotplug and module loading. Sharing also
keeps memory usage down.
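
The PGD sharing described above can be checked with a little index
arithmetic. This is a minimal sketch assuming 4-level paging (PGD index
from bits 47:39, PUD index from bits 38:30) and assuming the usual
addresses for the kernel text mapping, the module area and the EFI
window; none of these constants are quoted in the patch, and the
sketch_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PGDIR_SHIFT 39
#define SKETCH_PUD_SHIFT   30
#define SKETCH_PTRS        512   /* entries per table level */

/* Assumed addresses, not taken from the patch text. */
#define SKETCH_START_KERNEL_MAP UINT64_C(0xffffffff80000000)
#define SKETCH_MODULES_START    UINT64_C(0xffffffffa0000000)
#define SKETCH_EFI_VA_START     UINT64_C(0xffffffff00000000)
#define SKETCH_EFI_VA_END       UINT64_C(0xffffffef00000000)

/* Index of the top-level (PGD) entry that translates va. */
unsigned sketch_pgd_index(uint64_t va)
{
    return (unsigned)((va >> SKETCH_PGDIR_SHIFT) & (SKETCH_PTRS - 1));
}

/* Index of the PUD entry, one level below the PGD, that translates va. */
unsigned sketch_pud_index(uint64_t va)
{
    return (unsigned)((va >> SKETCH_PUD_SHIFT) & (SKETCH_PTRS - 1));
}

/* Non-zero when both addresses are reached through the same PGD entry. */
int sketch_same_pgd_entry(uint64_t a, uint64_t b)
{
    assert(SKETCH_PGDIR_SHIFT > SKETCH_PUD_SHIFT);
    return sketch_pgd_index(a) == sketch_pgd_index(b);
}
```

With these values the kernel text, the module area and the EFI window
all fall under the last PGD entry, so isolating the EFI mappings means
working at PUD granularity rather than copying whole PGD entries.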

Signed-off-by: Matt Fleming 
Reviewed-by: Borislav Petkov 
Acked-by: Borislav Petkov 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Jones 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Stephen Smalley 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-6-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/efi.h |  1 +
 arch/x86/platform/efi/efi.c| 39 ++---
 arch/x86/platform/efi/efi_32.c |  5 +++
 arch/x86/platform/efi/efi_64.c | 97 +++---
 4 files changed, 102 insertions(+), 40 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 347eeac..8fd9e63 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -136,6 +136,7 @@ extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
 extern void __init efi_map_region_fixed(efi_memory_desc_t *md);
 extern void efi_sync_low_kernel_mappings(void);
+extern int __init efi_alloc_page_tables(void);
 extern int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages);
 extern void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages);
 extern void __init old_map_region(efi_memory_desc_t *md);
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index ad28540..3c1f3cd 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -869,7 +869,7 @@ static void __init kexec_enter_virtual_mode(void)
  * This function will switch the EFI runtime services to virtual mode.
  * Essentially, we look through the EFI memmap and map every region that
  * has the runtime attribute bit set in its memory descriptor into the
- * ->trampoline_pgd page table using a top-down VA allocation scheme.
+ * efi_pgd page table.
  *
  * The old method which used to update that memory descriptor with the
  * virtual address obtained from ioremap() is still supported when the
@@ -879,8 +879,8 @@ static void __init kexec_enter_virtual_mode(void)
  *
  * The new method does a pagetable switch in a preemption-safe manner
  * so that we're in a different address space when calling a runtime
- * function. For function arguments passing we do copy the PGDs of the
- * kernel page table into ->trampoline_pgd prior to each call.
+ * function. For function arguments passing we do copy the PUDs of the
+ *

[tip:x86/efi] x86/mm/pat: Ensure cpa->pfn only contains page frame numbers

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  edc3b9129cecd0f0857112136f5b8b1bc1d45918
Gitweb: http://git.kernel.org/tip/edc3b9129cecd0f0857112136f5b8b1bc1d45918
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:31 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:42 +0100

x86/mm/pat: Ensure cpa->pfn only contains page frame numbers

The x86 pageattr code is inconsistent about the data stored in
cpa->pfn: sometimes it's treated as a page frame number,
sometimes as an unshifted physical address, and in one place as
a pte.

The result of this is that the mapping functions do not map the
intended physical address.

This isn't a problem in practice because most of the addresses
we're mapping in the EFI code paths are already mapped in
'trampoline_pgd' and so the pageattr mapping functions don't
actually do anything in this case. But when we move to using a
separate page table for the EFI runtime this will be an issue.
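
The unit confusion is easy to reproduce outside the kernel. The sketch
below assumes 4 KiB pages and 2 MiB PMD mappings; the sketch_* helpers
are hypothetical and only model the arithmetic, not the pageattr code
itself.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PMD_SIZE   (UINT64_C(1) << 21)

/* Physical address -> page frame number. */
uint64_t sketch_phys_to_pfn(uint64_t pa)
{
    return pa >> SKETCH_PAGE_SHIFT;
}

/* Page frame number -> physical address of the start of that page. */
uint64_t sketch_pfn_to_phys(uint64_t pfn)
{
    return pfn << SKETCH_PAGE_SHIFT;
}

/*
 * What gets mapped when a callee expects an unshifted physical address
 * but the caller hands it a real pfn: the value is shifted a second
 * time and the mapping lands on the wrong page.
 */
uint64_t sketch_mapped_phys_if_units_mixed(uint64_t pfn)
{
    return sketch_pfn_to_phys(sketch_phys_to_pfn(pfn));
}

/* Number of pfns to advance after installing one 2 MiB mapping. */
uint64_t sketch_pfns_per_pmd(void)
{
    assert(SKETCH_PMD_SIZE >= (UINT64_C(1) << SKETCH_PAGE_SHIFT));
    return SKETCH_PMD_SIZE >> SKETCH_PAGE_SHIFT;
}
```

Keeping the field as a pfn everywhere means every consumer shifts
exactly once, and the per-level increments become page counts rather
than byte sizes.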

Signed-off-by: Matt Fleming 
Reviewed-by: Borislav Petkov 
Acked-by: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-3-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/mm/pageattr.c | 17 ++---
 arch/x86/platform/efi/efi_64.c | 16 ++--
 2 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a3137a4..c70e420 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -905,15 +905,10 @@ static void populate_pte(struct cpa_data *cpa,
pte = pte_offset_kernel(pmd, start);
 
while (num_pages-- && start < end) {
-
-   /* deal with the NX bit */
-   if (!(pgprot_val(pgprot) & _PAGE_NX))
-   cpa->pfn &= ~_PAGE_NX;
-
-   set_pte(pte, pfn_pte(cpa->pfn >> PAGE_SHIFT, pgprot));
+   set_pte(pte, pfn_pte(cpa->pfn, pgprot));
 
start+= PAGE_SIZE;
-   cpa->pfn += PAGE_SIZE;
+   cpa->pfn++;
pte++;
}
 }
@@ -969,11 +964,11 @@ static int populate_pmd(struct cpa_data *cpa,
 
pmd = pmd_offset(pud, start);
 
-   set_pmd(pmd, __pmd(cpa->pfn | _PAGE_PSE |
+   set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
   massage_pgprot(pmd_pgprot)));
 
start += PMD_SIZE;
-   cpa->pfn  += PMD_SIZE;
+   cpa->pfn  += PMD_SIZE >> PAGE_SHIFT;
cur_pages += PMD_SIZE >> PAGE_SHIFT;
}
 
@@ -1042,11 +1037,11 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
 * Map everything starting from the Gb boundary, possibly with 1G pages
 */
while (end - start >= PUD_SIZE) {
-   set_pud(pud, __pud(cpa->pfn | _PAGE_PSE |
+   set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
   massage_pgprot(pud_pgprot)));
 
start += PUD_SIZE;
-   cpa->pfn  += PUD_SIZE;
+   cpa->pfn  += PUD_SIZE >> PAGE_SHIFT;
cur_pages += PUD_SIZE >> PAGE_SHIFT;
pud++;
}
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index a0ac0f9..5aa186d 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -143,7 +143,7 @@ void efi_sync_low_kernel_mappings(void)
 
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
-   unsigned long text;
+   unsigned long pfn, text;
struct page *page;
unsigned npages;
pgd_t *pgd;
@@ -160,7 +160,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 * and ident-map those pages containing the map before calling
 * phys_efi_set_virtual_address_map().
 */
-   if (kernel_map_pages_in_pgd(pgd, pa_memmap, pa_memmap, num_pages, _PAGE_NX)) {
+   pfn = pa_memmap >> PAGE_SHIFT;
+   if (kernel_map_pages_in_pgd(pgd, pfn, pa_memmap, num_pages, _PAGE_NX)) {
pr_err("Error ident-mapping new memmap (0x%lx)!\n", pa_memmap);
return 1;
}
@@ -185,8 +186,9 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)

[tip:x86/efi] x86/mm: Page align the '_end' symbol to avoid pfn conversion bugs

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  21cdb6b568435738cc0b303b2b3b82742396310c
Gitweb: http://git.kernel.org/tip/21cdb6b568435738cc0b303b2b3b82742396310c
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:30 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:42 +0100

x86/mm: Page align the '_end' symbol to avoid pfn conversion bugs

Ingo noted that if we can guarantee _end is aligned to PAGE_SIZE
we can automatically avoid bugs along the lines of,

size = _end - _text >> PAGE_SHIFT

which is missing a call to PFN_ALIGN(). The EFI mixed mode code
contains this bug, for example.

_text is already aligned to PAGE_SIZE through the use of
LOAD_PHYSICAL_ADDR, and the BSS and BRK sections are explicitly
aligned in the linker script, so it makes sense to align _end to
match.
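
To make the truncation concrete, here is a minimal sketch assuming
4 KiB pages. SKETCH_PFN_ALIGN follows the usual round-up-to-page
definition; the helper names and the sample addresses in the usage note
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Round an address up to the next page boundary. */
#define SKETCH_PFN_ALIGN(x) \
    (((uint64_t)(x) + (SKETCH_PAGE_SIZE - 1)) & SKETCH_PAGE_MASK)

/* The buggy form: a partial last page is silently dropped. */
uint64_t sketch_pages_truncated(uint64_t text, uint64_t end)
{
    assert(text <= end);
    return (end - text) >> SKETCH_PAGE_SHIFT;
}

/* The intended form: round the end up before converting to pages. */
uint64_t sketch_pages_rounded_up(uint64_t text, uint64_t end)
{
    assert(text <= end);
    return (SKETCH_PFN_ALIGN(end) - text) >> SKETCH_PAGE_SHIFT;
}
```

For a start of 0x1000000 and an unaligned end of 0x1a03123 the two
forms differ by one page; once the end is page aligned they agree,
which is the property the linker script change guarantees.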

Reported-by: Ingo Molnar 
Signed-off-by: Matt Fleming 
Acked-by: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-2-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/vmlinux.lds.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 74e4bf1..4f19942 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -325,6 +325,7 @@ SECTIONS
__brk_limit = .;
}
 
+   . = ALIGN(PAGE_SIZE);
_end = .;
 
 STABS_DEBUG

[tip:x86/efi] x86/efi: Map RAM into the identity page table for mixed mode

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  b61a76f8850d2979550abc42d7e09154ebb8d785
Gitweb: http://git.kernel.org/tip/b61a76f8850d2979550abc42d7e09154ebb8d785
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:32 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:42 +0100

x86/efi: Map RAM into the identity page table for mixed mode

We are relying on the pre-existing mappings in 'trampoline_pgd'
when accessing function arguments in the EFI mixed mode thunking
code.

Instead let's map memory explicitly so that things will continue
to work when we move to a separate page table in the future.
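
A userspace sketch of the selection logic may help: it walks an array
of simplified memory descriptors and counts the pages that would be
identity mapped. The numeric memory types follow the UEFI memory type
enumeration; struct sketch_md and the helper names are hypothetical
stand-ins, not the kernel's efi_memory_desc_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Memory types, numbered as in the UEFI memory type enumeration. */
enum {
    SKETCH_EFI_LOADER_CODE           = 1,
    SKETCH_EFI_LOADER_DATA           = 2,
    SKETCH_EFI_RUNTIME_SERVICES_CODE = 5,
    SKETCH_EFI_CONVENTIONAL_MEMORY   = 7,
};

/* Simplified descriptor: only the fields this sketch needs. */
struct sketch_md {
    uint32_t type;
    uint64_t phys_addr;
    uint64_t num_pages;
};

/* Non-zero for regions the kernel may use as ordinary RAM. */
int sketch_is_ram(uint32_t type)
{
    return type == SKETCH_EFI_CONVENTIONAL_MEMORY ||
           type == SKETCH_EFI_LOADER_DATA ||
           type == SKETCH_EFI_LOADER_CODE;
}

/* Count the pages that would be mapped 1:1 (virtual == physical). */
uint64_t sketch_count_ident_pages(const struct sketch_md *map, size_t n)
{
    uint64_t pages = 0;

    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!sketch_is_ram(map[i].type))
            continue;   /* other region types are mapped elsewhere */
        pages += map[i].num_pages;
    }
    return pages;
}
```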

Signed-off-by: Matt Fleming 
Reviewed-by: Borislav Petkov 
Acked-by: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-4-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi_64.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 5aa186d..102976d 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -144,6 +144,7 @@ void efi_sync_low_kernel_mappings(void)
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
unsigned long pfn, text;
+   efi_memory_desc_t *md;
struct page *page;
unsigned npages;
pgd_t *pgd;
@@ -177,6 +178,25 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
if (!IS_ENABLED(CONFIG_EFI_MIXED))
return 0;
 
+   /*
+* Map all of RAM so that we can access arguments in the 1:1
+* mapping when making EFI runtime calls.
+*/
+   for_each_efi_memory_desc(&memmap, md) {
+   if (md->type != EFI_CONVENTIONAL_MEMORY &&
+   md->type != EFI_LOADER_DATA &&
+   md->type != EFI_LOADER_CODE)
+   continue;
+
+   pfn = md->phys_addr >> PAGE_SHIFT;
+   npages = md->num_pages;
+
+   if (kernel_map_pages_in_pgd(pgd, pfn, md->phys_addr, npages, 0)) {
+   pr_err("Failed to map 1:1 memory\n");
+   return 1;
+   }
+   }
+
page = alloc_page(GFP_KERNEL|__GFP_DMA32);
if (!page)
panic("Unable to allocate EFI runtime stack < 4GB\n");

[tip:x86/efi] x86/efi: Hoist page table switching code into efi_call_virt()

2015-11-29 Thread tip-bot for Matt Fleming

Commit-ID:  c9f2a9a65e4855b74d92cdad688f6ee4a1a323ff
Gitweb: http://git.kernel.org/tip/c9f2a9a65e4855b74d92cdad688f6ee4a1a323ff
Author: Matt Fleming 
AuthorDate: Fri, 27 Nov 2015 21:09:33 +
Committer:  Ingo Molnar 
CommitDate: Sun, 29 Nov 2015 09:15:42 +0100

x86/efi: Hoist page table switching code into efi_call_virt()

This change is a prerequisite for pending patches that switch to
a dedicated EFI page table, instead of using 'trampoline_pgd'
which shares PGD entries with 'swapper_pg_dir'. The pending
patches make it impossible to dereference the runtime service
function pointer without first switching %cr3.

It's true that we now have duplicated switching code in
efi_call_virt() and efi_call_phys_{prolog,epilog}(), but we are
accepting some code duplication in exchange for a little more
clarity and the ease of writing the page table switching code in
C instead of asm.

Signed-off-by: Matt Fleming 
Reviewed-by: Borislav Petkov 
Acked-by: Borislav Petkov 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Jones 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sai Praneeth Prakhya 
Cc: Stephen Smalley 
Cc: Thomas Gleixner 
Cc: Toshi Kani 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1448658575-17029-5-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/efi.h  | 25 +
 arch/x86/platform/efi/efi_64.c  | 24 ++---
 arch/x86/platform/efi/efi_stub_64.S | 43 -
 3 files changed, 36 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 0010c78..347eeac 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * We map the EFI regions needed for runtime services non-contiguously,
@@ -64,6 +65,17 @@ extern u64 asmlinkage efi_call(void *fp, ...);
 
 #define efi_call_phys(f, args...)  efi_call((f), args)
 
+/*
+ * Scratch space used for switching the pagetable in the EFI stub
+ */
+struct efi_scratch {
+   u64 r15;
+   u64 prev_cr3;
+   pgd_t   *efi_pgt;
+   bool    use_pgd;
+   u64 phys_stack;
+} __packed;
+
 #define efi_call_virt(f, ...)  \
 ({ \
efi_status_t __s;   \
@@ -71,7 +83,20 @@ extern u64 asmlinkage efi_call(void *fp, ...);
efi_sync_low_kernel_mappings(); \
preempt_disable();  \
__kernel_fpu_begin();   \
+   \
+   if (efi_scratch.use_pgd) {  \
+   efi_scratch.prev_cr3 = read_cr3();  \
+   write_cr3((unsigned long)efi_scratch.efi_pgt);  \
+   __flush_tlb_all();  \
+   }   \
+   \
__s = efi_call((void *)efi.systab->runtime->f, __VA_ARGS__);\
+   \
+   if (efi_scratch.use_pgd) {  \
+   write_cr3(efi_scratch.prev_cr3);\
+   __flush_tlb_all();  \
+   }   \
+   \
__kernel_fpu_end(); \
preempt_enable();   \
__s;\
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 102976d..b19cdac 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -47,16 +47,7 @@
  */
 static u64 efi_va = EFI_VA_START;
 
-/*
- * Scratch space used for switching the pagetable in the EFI stub
- */
-struct efi_scratch {
-   u64 r15;
-   u64 prev_cr3;
-   pgd_t *efi_pgt;
-   bool use_pgd;
-

[tip:x86/urgent] x86/setup: Fix recent boot crash on 32-bit SMP machines

2015-11-04 Thread tip-bot for Matt Fleming

Commit-ID:  1c5dac914794f0170e1582d8ffdee52d30e0e4dd
Gitweb: http://git.kernel.org/tip/1c5dac914794f0170e1582d8ffdee52d30e0e4dd
Author: Matt Fleming 
AuthorDate: Tue, 3 Nov 2015 13:40:41 +
Committer:  Thomas Gleixner 
CommitDate: Wed, 4 Nov 2015 11:48:47 +0100

x86/setup: Fix recent boot crash on 32-bit SMP machines

The LKP test robot reported that the bug fix in commit f5f3497cad8c
("x86/setup: Extend low identity map to cover whole kernel range")
causes CONFIG_X86_32 SMP machines to crash on boot when trying to
bring AP cpus online.

The above commit erroneously copies too many of the PGD entries to the
low memory region of 'identity_page_table', resulting in some of the
kernel mappings for PAGE_OFFSET being trashed because,

  KERNEL_PGD_PTRS > KERNEL_PGD_BOUNDARY

The maximum number of PGD entries we can copy without corrupting the
kernel mapping is KERNEL_PGD_BOUNDARY or pgd_index(PAGE_OFFSET).
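
The overflow can be worked through with plain arithmetic. The sketch
below assumes two-level 32-bit paging (1024 PGD entries, 4 MB each) and
a hypothetical 1G/3G split with PAGE_OFFSET at 0x40000000. That layout
is picked for illustration only and is not stated in the report.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PGDIR_SHIFT  22
#define SKETCH_PTRS_PER_PGD 1024u

/* First PGD index that belongs to the kernel half. */
unsigned sketch_pgd_boundary(uint32_t page_offset)
{
    return page_offset >> SKETCH_PGDIR_SHIFT;
}

/* Number of PGD entries covering the kernel half. */
unsigned sketch_kernel_pgd_ptrs(uint32_t page_offset)
{
    return SKETCH_PTRS_PER_PGD - sketch_pgd_boundary(page_offset);
}

/*
 * Copying `count` entries to index 0 onwards overwrites every slot
 * below `count`; anything at or above the boundary is a kernel mapping.
 */
unsigned sketch_clobbered_kernel_entries(uint32_t page_offset, unsigned count)
{
    unsigned boundary = sketch_pgd_boundary(page_offset);

    assert(count <= SKETCH_PTRS_PER_PGD);
    return count > boundary ? count - boundary : 0;
}
```

In this layout the boundary is 256 while the kernel half has 768
entries, so copying 768 entries overwrites 512 kernel slots, whereas
copying only up to the boundary overwrites none.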

Fixes: f5f3497cad8c "x86/setup: Extend low identity map to cover whole kernel range"
Reported-by: Ying Huang 
Cc: Paolo Bonzini 
Cc: Laszlo Ersek 
Cc: l...@01.org
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Andy Lutomirski 
Cc: 
Signed-off-by: Matt Fleming 
Link: http://lkml.kernel.org/r/20151103140354.ga2...@codeblueprint.co.uk
Signed-off-by: Thomas Gleixner 
---
 arch/x86/kernel/setup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a3cccbf..2b8cbd6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1180,7 +1180,7 @@ void __init setup_arch(char **cmdline_p)
 */
clone_pgd_range(initial_page_table,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
-   KERNEL_PGD_PTRS);
+   KERNEL_PGD_BOUNDARY);
 #endif
 
tboot_probe();

[tip:core/efi] efi: Use the generic efi.memmap instead of 'memmap'

2015-10-11 Thread tip-bot for Matt Fleming

Commit-ID:  0ce423b6492a02be11662bfaa837dd16945aad3e
Gitweb: http://git.kernel.org/tip/0ce423b6492a02be11662bfaa837dd16945aad3e
Author: Matt Fleming 
AuthorDate: Sat, 3 Oct 2015 23:26:07 +0100
Committer:  Ingo Molnar 
CommitDate: Sun, 11 Oct 2015 11:04:18 +0200

efi: Use the generic efi.memmap instead of 'memmap'

Guenter reports that commit:

  7bf793115dd9 ("efi, x86: Rearrange efi_mem_attributes()")

breaks the IA64 compilation with the following error:

  drivers/built-in.o: In function `efi_mem_attributes': (.text+0xde962): undefined reference to `memmap'

Instead of using the (rather poorly named) global variable
'memmap' which doesn't exist on IA64, use efi.memmap which
points to the 'memmap' object on x86 and arm64 and which is NULL
for IA64.

The fact that efi.memmap is NULL for IA64 is OK because IA64
provides its own implementation of efi_mem_attributes().
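
For readers unfamiliar with the lookup, here is a standalone sketch of
walking a memory map through a pointer and a descriptor stride. The
structures are simplified, hypothetical stand-ins for the kernel's
types, and the NULL check is an addition for the sketch rather than
something this patch introduces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EFI_PAGE_SHIFT 12

/* Simplified memory descriptor. */
struct sketch_md {
    uint32_t type;
    uint64_t phys_addr;
    uint64_t num_pages;
    uint64_t attribute;
};

/* Simplified map header: raw bounds plus the per-descriptor stride. */
struct sketch_memory_map {
    void   *map;
    void   *map_end;
    size_t  desc_size;   /* firmware may report more than sizeof(md) */
};

/* Return the attributes of the region containing phys_addr, else 0. */
uint64_t sketch_mem_attributes(const struct sketch_memory_map *m,
                               uint64_t phys_addr)
{
    if (m == NULL)
        return 0;
    assert(m->desc_size >= sizeof(struct sketch_md));

    /* Step by desc_size, not sizeof, since the stride comes from firmware. */
    for (const char *p = m->map; p < (const char *)m->map_end;
         p += m->desc_size) {
        const struct sketch_md *md = (const void *)p;
        uint64_t end = md->phys_addr +
                       (md->num_pages << SKETCH_EFI_PAGE_SHIFT);

        if (md->phys_addr <= phys_addr && phys_addr < end)
            return md->attribute;
    }
    return 0;
}
```

Going through a pointer lets generic code share one symbol across
architectures, while an architecture with its own implementation never
dereferences it.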

Reported-by: Guenter Roeck 
Signed-off-by: Matt Fleming 
Cc: Ard Biesheuvel 
Cc: Jonathan Zhang 
Cc: Peter Zijlstra 
Cc: Stephen Rothwell 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Tony Luck 
Link: http://lkml.kernel.org/r/20151003222607.ga2...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 drivers/firmware/efi/efi.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index afee2880..16c4928 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -623,13 +623,15 @@ char * __init efi_md_typeattr_format(char *buf, size_t size,
  */
 u64 __weak efi_mem_attributes(unsigned long phys_addr)
 {
+   struct efi_memory_map *map;
efi_memory_desc_t *md;
void *p;
 
if (!efi_enabled(EFI_MEMMAP))
return 0;
 
-   for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
+   map = efi.memmap;
+   for (p = map->map; p < map->map_end; p += map->desc_size) {
md = p;
if ((md->phys_addr <= phys_addr) &&
(phys_addr < (md->phys_addr +

[tip:x86/urgent] MAINTAINERS: Change Matt Fleming's email address

2015-10-11 Thread tip-bot for Matt Fleming

Commit-ID:  825fcfce81921c9cc4ef801d844793815721e458
Gitweb: http://git.kernel.org/tip/825fcfce81921c9cc4ef801d844793815721e458
Author: Matt Fleming 
AuthorDate: Sat, 10 Oct 2015 17:22:16 +0100
Committer:  Ingo Molnar 
CommitDate: Sun, 11 Oct 2015 09:54:29 +0200

MAINTAINERS: Change Matt Fleming's email address

My Intel email address will soon expire. Replace it with my
personal address so people still know where to send patches.

Signed-off-by: Matt Fleming 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/194136-10333-1-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 60aacd8..43bd01e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4003,7 +4003,7 @@ S:Maintained
 F: sound/usb/misc/ua101.c
 
 EXTENSIBLE FIRMWARE INTERFACE (EFI)
-M: Matt Fleming 
+M: Matt Fleming 
 L: linux-...@vger.kernel.org
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git
 S: Maintained
@@ -4018,7 +4018,7 @@ F:include/linux/efi*.h
 EFI VARIABLE FILESYSTEM
 M: Matthew Garrett 
 M: Jeremy Kerr 
-M: Matt Fleming 
+M: Matt Fleming 
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git
 L: linux-...@vger.kernel.org
 S: Maintained

[tip:perf/core] perf tests: Add Intel CQM test

2015-10-06 Thread tip-bot for Matt Fleming

Commit-ID:  035827e9f2bd71a280f4eb58c65811d377ab2217
Gitweb: http://git.kernel.org/tip/035827e9f2bd71a280f4eb58c65811d377ab2217
Author: Matt Fleming 
AuthorDate: Mon, 5 Oct 2015 15:40:21 +0100
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Mon, 5 Oct 2015 16:56:07 -0300

perf tests: Add Intel CQM test

Peter reports that it's possible to trigger a WARN_ON_ONCE() in the
Intel CQM code by combining a hardware event and an Intel CQM
(software) event into a group. Unfortunately, the perf tools are not
able to create this bundle and we need to manually construct a test
case.

For posterity, record Peter's proof of concept test case in tools/perf
so that it presents a model for how we can perform architecture
specific tests, or "arch tests", in perf in the future.

The particular issue triggered in the test case is that when the
counter for the hardware event overflows and triggers a PMI we'll read
both the hardware event and the software event counters.
Unfortunately, for CQM that involves performing an IPI to read the CQM
event counters on all sockets, which in NMI context triggers the
WARN_ON_ONCE().

Reported-by: Peter Zijlstra 
Signed-off-by: Matt Fleming 
Cc: Adrian Hunter 
Cc: Andi Kleen 
Cc: Fenghua Yu 
Cc: Jiri Olsa 
Cc: Kanaka Juvva 
Cc: Vikas Shivappa 
Cc: Vince Weaver 
Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-m...@codeblueprint.co.uk
Link: http://lkml.kernel.org/n/tip-3p4ra0u8vzm7m289a1m79...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/arch/x86/include/arch-tests.h |   1 +
 tools/perf/arch/x86/tests/Build  |   1 +
 tools/perf/arch/x86/tests/arch-tests.c   |   4 +
 tools/perf/arch/x86/tests/intel-cqm.c| 124 +++
 4 files changed, 130 insertions(+)

diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h
index 5927cf2..7ed00f4 100644
--- a/tools/perf/arch/x86/include/arch-tests.h
+++ b/tools/perf/arch/x86/include/arch-tests.h
@@ -5,6 +5,7 @@
 int test__rdpmc(void);
 int test__perf_time_to_tsc(void);
 int test__insn_x86(void);
+int test__intel_cqm_count_nmi_context(void);
 
 #ifdef HAVE_DWARF_UNWIND_SUPPORT
 struct thread;
diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build
index 8e2c5a3..cbb7e97 100644
--- a/tools/perf/arch/x86/tests/Build
+++ b/tools/perf/arch/x86/tests/Build
@@ -5,3 +5,4 @@ libperf-y += arch-tests.o
 libperf-y += rdpmc.o
 libperf-y += perf-time-to-tsc.o
 libperf-$(CONFIG_AUXTRACE) += insn-x86.o
+libperf-y += intel-cqm.o
diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c
index d116c21..2218cb6 100644
--- a/tools/perf/arch/x86/tests/arch-tests.c
+++ b/tools/perf/arch/x86/tests/arch-tests.c
@@ -24,6 +24,10 @@ struct test arch_tests[] = {
},
 #endif
{
+   .desc = "Test intel cqm nmi context read",
+   .func = test__intel_cqm_count_nmi_context,
+   },
+   {
.func = NULL,
},
 
diff --git a/tools/perf/arch/x86/tests/intel-cqm.c b/tools/perf/arch/x86/tests/intel-cqm.c
new file mode 100644
index 000..d28c1b6
--- /dev/null
+++ b/tools/perf/arch/x86/tests/intel-cqm.c
@@ -0,0 +1,124 @@
+#include "tests/tests.h"
+#include "perf.h"
+#include "cloexec.h"
+#include "debug.h"
+#include "evlist.h"
+#include "evsel.h"
+#include "arch-tests.h"
+
+#include 
+#include 
+
+static pid_t spawn(void)
+{
+   pid_t pid;
+
+   pid = fork();
+   if (pid)
+   return pid;
+
+   while(1);
+   sleep(5);
+   return 0;
+}
+
+/*
+ * Create an event group that contains both a sampled hardware
+ * (cpu-cycles) and software (intel_cqm/llc_occupancy/) event. We then
+ * wait for the hardware perf counter to overflow and generate a PMI,
+ * which triggers an event read for both of the events in the group.
+ *
+ * Since reading Intel CQM event counters requires sending SMP IPIs, the
+ * CQM pmu needs to handle the above situation gracefully, and return
+ * the last read counter value to avoid triggering a WARN_ON_ONCE() in
+ * smp_call_function_many() caused by sending IPIs from NMI context.
+ */
+int test__intel_cqm_count_nmi_context(void)
+{
+   struct perf_evlist *evlist = NULL;
+   struct perf_evsel *evsel = NULL;
+   struct perf_event_attr pe;
+   int i, fd[2], flag, ret;
+   size_t mmap_len;
+   void *event;
+   pid_t pid;
+   int err = TEST_FAIL;
+
+   flag = perf_event_open_cloexec_flag();
+
+   evlist = perf_evlist__new();
+   if (!evlist) {
+   pr_debug("perf_evlist__new failed\n");
+   return TEST_FAIL;
+   }
+
+   ret = parse_events(evlist, "intel_cqm/llc_occupancy/", NULL);
+   if (ret) {
+   pr_debug("parse_events failed\n");
+   err = TEST_SKIP;
+   goto out;
+   }
+
+   evsel = perf_evlist__first(evlist);
+   if (!evsel) {
+
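
The failure mode described in the commit message above can be modelled in user space. The sketch below is not the kernel's CQM code: `struct cqm_counter`, `cqm_read()` and the `in_nmi` flag are hypothetical names, and the "IPI" is only simulated by a counter. It illustrates the behaviour the test expects, namely that a read from NMI context returns the last cached value instead of starting a cross-CPU read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a counter whose fresh value needs a cross-CPU read. */
struct cqm_counter {
	uint64_t last_value;    /* value cached by the most recent full read */
	uint64_t hw_value;      /* stands in for the value gathered via IPIs */
	unsigned int ipi_count; /* number of simulated cross-CPU reads */
};

/*
 * Read the counter. A full read "sends IPIs" (simulated by bumping
 * ipi_count) and refreshes the cache. When called from NMI context the
 * full read is skipped and the cached value is returned unchanged.
 */
static uint64_t cqm_read(struct cqm_counter *c, bool in_nmi)
{
	assert(c != NULL);

	if (in_nmi)
		return c->last_value;

	c->ipi_count++;
	c->last_value = c->hw_value;
	return c->last_value;
}
```

A caller would pass `in_nmi = true` for reads triggered by a counter overflow and `false` otherwise; the value seen in NMI context may be stale, which is the trade-off the code comment in the patch describes.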

[tip:perf/core] perf tests: Move x86 tests into arch directory

2015-10-06 Thread tip-bot for Matt Fleming

Commit-ID:  d8b167f9d8af817073ee35cf904e2e527465dbc1
Gitweb: http://git.kernel.org/tip/d8b167f9d8af817073ee35cf904e2e527465dbc1
Author: Matt Fleming 
AuthorDate: Mon, 5 Oct 2015 15:40:20 +0100
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Mon, 5 Oct 2015 16:55:43 -0300

perf tests: Move x86 tests into arch directory

Move out the x86-specific tests into tools/perf/arch/x86/tests and
define an 'arch_tests' array, which is the list of tests that only apply
to the build architecture.

We can also now begin to get rid of some of the #ifdef code that is
present in the generic perf tests.

Signed-off-by: Matt Fleming 
Cc: Adrian Hunter 
Cc: Andi Kleen 
Cc: Fenghua Yu 
Cc: Jiri Olsa 
Cc: Kanaka Juvva 
Cc: Peter Zijlstra 
Cc: Vikas Shivappa 
Cc: Vince Weaver 
Link: http://lkml.kernel.org/n/tip-9s68h4ptg06ah0lgnjz55...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/arch/x86/include/arch-tests.h   | 12 ++
 tools/perf/arch/x86/tests/Build|  3 +++
 tools/perf/arch/x86/tests/arch-tests.c | 20 
 tools/perf/arch/x86/tests/dwarf-unwind.c   |  1 +
 .../perf/{ => arch/x86}/tests/gen-insn-x86-dat.awk |  0
 .../perf/{ => arch/x86}/tests/gen-insn-x86-dat.sh  |  0
 tools/perf/{ => arch/x86}/tests/insn-x86-dat-32.c  |  0
 tools/perf/{ => arch/x86}/tests/insn-x86-dat-64.c  |  0
 tools/perf/{ => arch/x86}/tests/insn-x86-dat-src.c |  0
 tools/perf/{ => arch/x86}/tests/insn-x86.c |  3 ++-
 tools/perf/{ => arch/x86}/tests/perf-time-to-tsc.c |  4 +++-
 tools/perf/{ => arch/x86}/tests/rdpmc.c|  7 ++
 tools/perf/tests/Build |  6 -
 tools/perf/tests/builtin-test.c| 28 --
 tools/perf/tests/dwarf-unwind.c|  4 
 tools/perf/tests/tests.h   |  5 +---
 16 files changed, 48 insertions(+), 45 deletions(-)

diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h
index 4bd41d8..5927cf2 100644
--- a/tools/perf/arch/x86/include/arch-tests.h
+++ b/tools/perf/arch/x86/include/arch-tests.h
@@ -1,6 +1,18 @@
 #ifndef ARCH_TESTS_H
 #define ARCH_TESTS_H
 
+/* Tests */
+int test__rdpmc(void);
+int test__perf_time_to_tsc(void);
+int test__insn_x86(void);
+
+#ifdef HAVE_DWARF_UNWIND_SUPPORT
+struct thread;
+struct perf_sample;
+int test__arch_unwind_sample(struct perf_sample *sample,
+struct thread *thread);
+#endif
+
 extern struct test arch_tests[];
 
 #endif
diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build
index d827ef3..8e2c5a3 100644
--- a/tools/perf/arch/x86/tests/Build
+++ b/tools/perf/arch/x86/tests/Build
@@ -2,3 +2,6 @@ libperf-$(CONFIG_DWARF_UNWIND) += regs_load.o
 libperf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o
 
 libperf-y += arch-tests.o
+libperf-y += rdpmc.o
+libperf-y += perf-time-to-tsc.o
+libperf-$(CONFIG_AUXTRACE) += insn-x86.o
diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c
index fca9eb9..d116c21 100644
--- a/tools/perf/arch/x86/tests/arch-tests.c
+++ b/tools/perf/arch/x86/tests/arch-tests.c
@@ -4,6 +4,26 @@
 
 struct test arch_tests[] = {
{
+   .desc = "x86 rdpmc test",
+   .func = test__rdpmc,
+   },
+   {
+   .desc = "Test converting perf time to TSC",
+   .func = test__perf_time_to_tsc,
+   },
+#ifdef HAVE_DWARF_UNWIND_SUPPORT
+   {
+   .desc = "Test dwarf unwind",
+   .func = test__dwarf_unwind,
+   },
+#endif
+#ifdef HAVE_AUXTRACE_SUPPORT
+   {
+   .desc = "Test x86 instruction decoder - new instructions",
+   .func = test__insn_x86,
+   },
+#endif
+   {
.func = NULL,
},
 
diff --git a/tools/perf/arch/x86/tests/dwarf-unwind.c b/tools/perf/arch/x86/tests/dwarf-unwind.c
index d8bbf7a..7f209ce 100644
--- a/tools/perf/arch/x86/tests/dwarf-unwind.c
+++ b/tools/perf/arch/x86/tests/dwarf-unwind.c
@@ -5,6 +5,7 @@
 #include "event.h"
 #include "debug.h"
 #include "tests/tests.h"
+#include "arch-tests.h"
 
 #define STACK_SIZE 8192
 
diff --git a/tools/perf/tests/gen-insn-x86-dat.awk b/tools/perf/arch/x86/tests/gen-insn-x86-dat.awk
similarity index 100%
rename from tools/perf/tests/gen-insn-x86-dat.awk
rename to tools/perf/arch/x86/tests/gen-insn-x86-dat.awk
diff --git a/tools/perf/tests/gen-insn-x86-dat.sh b/tools/perf/arch/x86/tests/gen-insn-x86-dat.sh
similarity index 100%
rename from tools/perf/tests/gen-insn-x86-dat.sh
rename to tools/perf/arch/x86/tests/gen-insn-x86-dat.sh
diff --git a/tools/perf/tests/insn-x86-dat-32.c b/tools/perf/arch/x86/tests/insn-x86-dat-32.c
similarity index 100%
rename from tools/perf/tests/insn-x86-dat-32.c
rename to tools/perf/arch/x86/tests/insn-x86-dat-32.c
diff --git a/tools/perf/tests/insn-x86-dat-64.c b/tools/perf/arch/x86/tests/insn-x86-dat-64.c

[tip:perf/core] perf tests: Add arch tests

2015-10-06 Thread tip-bot for Matt Fleming

Commit-ID:  31b6753f95320260b160935d0e9c0b29f096ab57
Gitweb: http://git.kernel.org/tip/31b6753f95320260b160935d0e9c0b29f096ab57
Author: Matt Fleming 
AuthorDate: Mon, 5 Oct 2015 15:40:19 +0100
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Mon, 5 Oct 2015 16:55:38 -0300

perf tests: Add arch tests

Tests that only make sense for some architectures currently live in
the same place as the generic tests. Move out the x86-specific tests
into tools/perf/arch/x86/tests and define an 'arch_tests' array, which
is the list of tests that only apply to the build architecture.

The main idea is to encourage developers to add arch tests to build
out perf's test coverage, without dumping everything in
tools/perf/tests.

Signed-off-by: Matt Fleming 
Cc: Adrian Hunter 
Cc: Andi Kleen 
Cc: Fenghua Yu 
Cc: Jiri Olsa 
Cc: Kanaka Juvva 
Cc: Peter Zijlstra 
Cc: Vikas Shivappa 
Cc: Vince Weaver 
Link: http://lkml.kernel.org/n/tip-p4uc1c15ssbj8xj7ku5sl...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/arch/x86/Build|  2 +-
 tools/perf/arch/x86/include/arch-tests.h |  6 ++
 tools/perf/arch/x86/tests/Build  |  6 --
 tools/perf/arch/x86/tests/arch-tests.c   | 10 ++
 tools/perf/tests/builtin-test.c  | 28 
 tools/perf/tests/tests.h |  5 +
 6 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/tools/perf/arch/x86/Build b/tools/perf/arch/x86/Build
index 41bf61d..db52fa2 100644
--- a/tools/perf/arch/x86/Build
+++ b/tools/perf/arch/x86/Build
@@ -1,2 +1,2 @@
 libperf-y += util/
-libperf-$(CONFIG_DWARF_UNWIND) += tests/
+libperf-y += tests/
diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h
new file mode 100644
index 000..4bd41d8
--- /dev/null
+++ b/tools/perf/arch/x86/include/arch-tests.h
@@ -0,0 +1,6 @@
+#ifndef ARCH_TESTS_H
+#define ARCH_TESTS_H
+
+extern struct test arch_tests[];
+
+#endif
diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build
index b30eff9..d827ef3 100644
--- a/tools/perf/arch/x86/tests/Build
+++ b/tools/perf/arch/x86/tests/Build
@@ -1,2 +1,4 @@
-libperf-y += regs_load.o
-libperf-y += dwarf-unwind.o
+libperf-$(CONFIG_DWARF_UNWIND) += regs_load.o
+libperf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o
+
+libperf-y += arch-tests.o
diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c
new file mode 100644
index 000..fca9eb9
--- /dev/null
+++ b/tools/perf/arch/x86/tests/arch-tests.c
@@ -0,0 +1,10 @@
+#include 
+#include "tests/tests.h"
+#include "arch-tests.h"
+
+struct test arch_tests[] = {
+   {
+   .func = NULL,
+   },
+
+};
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index d9bf51d..2b6c1bf 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -14,10 +14,13 @@
 #include "parse-options.h"
 #include "symbol.h"
 
-static struct test {
-   const char *desc;
-   int (*func)(void);
-} tests[] = {
+struct test __weak arch_tests[] = {
+   {
+   .func = NULL,
+   },
+};
+
+static struct test generic_tests[] = {
{
.desc = "vmlinux symtab matches kallsyms",
.func = test__vmlinux_matches_kallsyms,
@@ -195,6 +198,11 @@ static struct test {
},
 };
 
+static struct test *tests[] = {
+   generic_tests,
+   arch_tests,
+};
+
 static bool perf_test__matches(struct test *test, int curr, int argc, const char *argv[])
 {
int i;
@@ -249,22 +257,25 @@ static int run_test(struct test *test)
return err;
 }
 
-#define for_each_test(t) for (t = &tests[0]; t->func; t++)
+#define for_each_test(j, t) \
+   for (j = 0; j < ARRAY_SIZE(tests); j++) \
+   for (t = &tests[j][0]; t->func; t++)
 
 static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist)
 {
struct test *t;
+   unsigned int j;
int i = 0;
int width = 0;
 
-   for_each_test(t) {
+   for_each_test(j, t) {
int len = strlen(t->desc);
 
if (width < len)
width = len;
}
 
-   for_each_test(t) {
+   for_each_test(j, t) {
int curr = i++, err;
 
if (!perf_test__matches(t, curr, argc, argv))
@@ -300,10 +311,11 @@ static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist)
 
 static int perf_test__list(int argc, const char **argv)
 {
+   unsigned int j;
struct test *t;
int i = 0;
 
-   for_each_test(t) {
+   for_each_test(j, t) {
if (argc > 1 && !strstr(t->desc, argv[1]))
continue;
 
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index 0b35496..b1cb1c0 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -24,6 +24,11 @@ enum {
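
The table layout introduced by this patch can be tried outside of perf. The following is a minimal sketch, assuming made-up test functions (`test_generic_a`, `test_generic_b`, `test_arch_only`) and a hypothetical `run_all_tests()` helper. Only `struct test`, the NULL `.func` sentinel, the `tests[]` array of tables and the two-level `for_each_test()` macro mirror the diff; the `__weak` fallback definition of `arch_tests` is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

struct test {
	const char *desc;
	int (*func)(void);
};

/* Placeholder tests; each returns 0 on success. */
static int test_generic_a(void) { return 0; }
static int test_generic_b(void) { return 0; }
static int test_arch_only(void) { return 0; }

/* Every table ends with a sentinel entry whose .func is NULL. */
static struct test generic_tests[] = {
	{ .desc = "generic a", .func = test_generic_a },
	{ .desc = "generic b", .func = test_generic_b },
	{ .func = NULL },
};

static struct test arch_tests[] = {
	{ .desc = "arch only", .func = test_arch_only },
	{ .func = NULL },
};

static struct test *tests[] = { generic_tests, arch_tests };

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Outer loop walks the tables, inner loop walks one table up to its sentinel. */
#define for_each_test(j, t)                         \
	for (j = 0; j < ARRAY_SIZE(tests); j++)     \
		for (t = &tests[j][0]; t->func; t++)

/* Run every test in every table; return the number of tests run. */
static unsigned int run_all_tests(unsigned int *failed)
{
	struct test *t;
	unsigned int j, ran = 0;

	assert(failed != NULL);
	*failed = 0;

	for_each_test(j, t) {
		ran++;
		if (t->func() != 0)
			(*failed)++;
	}
	return ran;
}
```

Because the sentinel terminates each table, adding an architecture test only means adding one entry to `arch_tests`; the iteration code does not need to know how many entries either table holds.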

[tip:core/urgent] x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down

2015-10-01 Thread tip-bot for Matt Fleming

Commit-ID:  a5caa209ba9c29c6421292e7879d2387a2ef39c9
Gitweb: http://git.kernel.org/tip/a5caa209ba9c29c6421292e7879d2387a2ef39c9
Author: Matt Fleming 
AuthorDate: Fri, 25 Sep 2015 23:02:18 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 1 Oct 2015 12:51:28 +0200

x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down

Beginning with UEFI v2.5 EFI_PROPERTIES_TABLE was introduced
that signals that the firmware PE/COFF loader supports splitting
code and data sections of PE/COFF images into separate EFI
memory map entries. This allows the kernel to map those regions
with strict memory protections, e.g. EFI_MEMORY_RO for code,
EFI_MEMORY_XP for data, etc.

Unfortunately, an unwritten requirement of this new feature is
that the regions need to be mapped with the same offsets
relative to each other as observed in the EFI memory map. If
this is not done crashes like this may occur,

  BUG: unable to handle kernel paging request at fffefe6086dd
  IP: [] 0xfffefe6086dd
  Call Trace:
   [] efi_call+0x7e/0x100
   [] ? virt_efi_set_variable+0x61/0x90
   [] efi_delete_dummy_variable+0x63/0x70
   [] efi_enter_virtual_mode+0x383/0x392
   [] start_kernel+0x38a/0x417
   [] x86_64_start_reservations+0x2a/0x2c
   [] x86_64_start_kernel+0xeb/0xef

Here 0xfffefe6086dd refers to an address the firmware
expects to be mapped but which the OS never claimed was mapped.
The issue is that included in these regions are relative
addresses to other regions which were emitted by the firmware
toolchain before the "splitting" of sections occurred at
runtime.

Needless to say, we don't satisfy this unwritten requirement on
x86_64 and instead map the EFI memory map entries in reverse
order. The above crash is almost certainly triggerable with any
kernel newer than v3.13 because that's when we rewrote the EFI
runtime region mapping code, in commit d2f7cbe7b26a ("x86/efi:
Runtime services virtual mapping"). For kernel versions before
v3.13 things may work by pure luck depending on the
fragmentation of the kernel virtual address space at the time we
map the EFI regions.

Instead of mapping the EFI memory map entries in reverse order,
where entry N has a higher virtual address than entry N+1, map
them in the same order as they appear in the EFI memory map to
preserve this relative offset between regions.

This patch has been kept as small as possible with the intention
that it should be applied aggressively to stable and
distribution kernels. It is very much a bugfix rather than
support for a new feature, since when EFI_PROPERTIES_TABLE is
enabled we must map things as outlined above to even boot - we
have no way of asking the firmware not to split the code/data
regions.

In fact, this patch doesn't even make use of the more strict
memory protections available in UEFI v2.5. That will come later.

Suggested-by: Ard Biesheuvel 
Reported-by: Ard Biesheuvel 
Signed-off-by: Matt Fleming 
Cc: 
Cc: Borislav Petkov 
Cc: Chun-Yi 
Cc: Dave Young 
Cc: H. Peter Anvin 
Cc: James Bottomley 
Cc: Lee, Chun-Yi 
Cc: Leif Lindholm 
Cc: Linus Torvalds 
Cc: Matthew Garrett 
Cc: Mike Galbraith 
Cc: Peter Jones 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443218539-7610-2-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi.c | 67 -
 1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 1db84c0..6a28ded 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -705,6 +705,70 @@ out:
 }
 
 /*
+ * Iterate the EFI memory map in reverse order because the regions
+ * will be mapped top-down. The end result is the same as if we had
+ * mapped things forward, but doesn't require us to change the
+ * existing implementation of efi_map_region().
+ */
+static inline void *efi_map_next_entry_reverse(void *entry)
+{
+   /* Initial call */
+   if (!entry)
+   return memmap.map_end - memmap.desc_size;
+
+   entry -= memmap.desc_size;
+   if (entry < memmap.map)
+   return NULL;
+
+   return entry;
+}
+
+/*
+ * efi_map_next_entry - Return the next EFI memory map descriptor
+ * @entry: Previous EFI memory map descriptor
+ *
+ * This is a helper function to iterate over the EFI memory map, which
+ * we do in different orders depending on the current configuration.
+ *
+ * To begin traversing the memory map @entry must be %NULL.
+ *
+ * Returns %NULL when we reach the end of the memory map.
+ */
+static void *efi_map_next_entry(void *entry)
+{
+   if (!efi_enabled(EFI_OLD_MEMMAP) && efi_enabled(EFI_64BIT)) {
+   /*
+* Starting in UEFI v2.5 the EFI_PROPERTIES_TABLE
+* config table feature requires us to map all entries
+* in the same order as they appear in the EFI memory
+
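
The ordering argument in the commit message can be checked with a small user-space model. This is only a sketch under simplifying assumptions: `struct region`, `map_top_down()`, `map_regions()` and `offsets_preserved()` are hypothetical names, the regions are physically contiguous, and no alignment is applied, unlike the real efi_map_region(). It shows that with a top-down allocator, walking the map in reverse keeps the regions' relative offsets, while walking it forward does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region {
	uint64_t phys; /* physical start address */
	uint64_t size; /* length in bytes */
	uint64_t virt; /* virtual start address assigned by map_regions() */
};

/* Top-down allocator: each call places a region just below the previous one. */
static uint64_t map_top_down(uint64_t *next_va, uint64_t size)
{
	*next_va -= size;
	return *next_va;
}

/* Map all regions below 'top', walking the array forward or in reverse. */
static void map_regions(struct region *r, size_t n, uint64_t top, int reverse)
{
	uint64_t next_va = top;
	size_t i;

	assert(r != NULL);

	for (i = 0; i < n; i++) {
		size_t idx = reverse ? n - 1 - i : i;

		r[idx].virt = map_top_down(&next_va, r[idx].size);
	}
}

/* Return 1 if each adjacent pair has the same virtual and physical offset. */
static int offsets_preserved(const struct region *r, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++) {
		if (r[i].virt - r[i - 1].virt != r[i].phys - r[i - 1].phys)
			return 0;
	}
	return 1;
}
```

With three contiguous regions mapped below 0x100000, the reverse walk places them at ascending virtual addresses with the same spacing as their physical addresses, so firmware code that uses relative addresses across regions would still find its data.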

[tip:core/urgent] x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down

2015-10-01 Thread tip-bot for Matt Fleming

Commit-ID:  a5caa209ba9c29c6421292e7879d2387a2ef39c9
Gitweb: http://git.kernel.org/tip/a5caa209ba9c29c6421292e7879d2387a2ef39c9
Author: Matt Fleming 
AuthorDate: Fri, 25 Sep 2015 23:02:18 +0100
Committer:  Ingo Molnar 
CommitDate: Thu, 1 Oct 2015 12:51:28 +0200

x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down

Beginning with UEFI v2.5 EFI_PROPERTIES_TABLE was introduced
that signals that the firmware PE/COFF loader supports splitting
code and data sections of PE/COFF images into separate EFI
memory map entries. This allows the kernel to map those regions
with strict memory protections, e.g. EFI_MEMORY_RO for code,
EFI_MEMORY_XP for data, etc.

Unfortunately, an unwritten requirement of this new feature is
that the regions need to be mapped with the same offsets
relative to each other as observed in the EFI memory map. If
this is not done crashes like this may occur,

  BUG: unable to handle kernel paging request at fffefe6086dd
  IP: [] 0xfffefe6086dd
  Call Trace:
   [] efi_call+0x7e/0x100
   [] ? virt_efi_set_variable+0x61/0x90
   [] efi_delete_dummy_variable+0x63/0x70
   [] efi_enter_virtual_mode+0x383/0x392
   [] start_kernel+0x38a/0x417
   [] x86_64_start_reservations+0x2a/0x2c
   [] x86_64_start_kernel+0xeb/0xef

Here 0xfffefe6086dd refers to an address the firmware
expects to be mapped but which the OS never claimed was mapped.
The issue is that included in these regions are relative
addresses to other regions which were emitted by the firmware
toolchain before the "splitting" of sections occurred at
runtime.

Needless to say, we don't satisfy this unwritten requirement on
x86_64 and instead map the EFI memory map entries in reverse
order. The above crash is almost certainly triggerable with any
kernel newer than v3.13 because that's when we rewrote the EFI
runtime region mapping code, in commit d2f7cbe7b26a ("x86/efi:
Runtime services virtual mapping"). For kernel versions before
v3.13 things may work by pure luck depending on the
fragmentation of the kernel virtual address space at the time we
map the EFI regions.

Instead of mapping the EFI memory map entries in reverse order,
where entry N has a higher virtual address than entry N+1, map
them in the same order as they appear in the EFI memory map to
preserve this relative offset between regions.
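
To illustrate why the mapping order matters, here is a minimal user-space
sketch. It is not the kernel implementation: `struct region`,
`map_bottom_up()`, `map_top_down()` and `offsets_preserved()` are
hypothetical names invented for this illustration, and the sketch assigns
addresses directly instead of iterating the memory map in reverse as the
patch in this message does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an EFI memory descriptor. */
struct region {
	uint64_t phys;  /* physical start address */
	uint64_t size;  /* size in bytes */
	uint64_t virt;  /* virtual address assigned by the mapper */
};

/* Map entry 0 lowest: virtual addresses grow in memory-map order. */
static void map_bottom_up(struct region *r, size_t n, uint64_t base)
{
	assert(r != NULL);
	for (size_t i = 0; i < n; i++) {
		r[i].virt = base;
		base += r[i].size;
	}
}

/* Map entry 0 highest: each later entry lands below the previous one. */
static void map_top_down(struct region *r, size_t n, uint64_t top)
{
	assert(r != NULL);
	for (size_t i = 0; i < n; i++) {
		top -= r[i].size;
		r[i].virt = top;
	}
}

/* Nonzero if every pair of adjacent regions keeps its physical offset. */
static int offsets_preserved(const struct region *r, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (r[i].virt - r[i - 1].virt != r[i].phys - r[i - 1].phys)
			return 0;
	}
	return 1;
}
```

With two physically adjacent regions, `map_bottom_up()` keeps the second
region at the same distance from the first, while `map_top_down()` places
it below the first, so a relative reference from one region into the
other would point at an unmapped address.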

This patch has been kept as small as possible with the intention
that it should be applied aggressively to stable and
distribution kernels. It is very much a bugfix rather than
support for a new feature, since when EFI_PROPERTIES_TABLE is
enabled we must map things as outlined above to even boot - we
have no way of asking the firmware not to split the code/data
regions.

In fact, this patch doesn't even make use of the more strict
memory protections available in UEFI v2.5. That will come later.

Suggested-by: Ard Biesheuvel 
Reported-by: Ard Biesheuvel 
Signed-off-by: Matt Fleming 
Cc: 
Cc: Borislav Petkov 
Cc: Chun-Yi 
Cc: Dave Young 
Cc: H. Peter Anvin 
Cc: James Bottomley 
Cc: Lee, Chun-Yi 
Cc: Leif Lindholm 
Cc: Linus Torvalds 
Cc: Matthew Garrett 
Cc: Mike Galbraith 
Cc: Peter Jones 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443218539-7610-2-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi.c | 67 -
 1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 1db84c0..6a28ded 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -705,6 +705,70 @@ out:
 }
 
 /*
+ * Iterate the EFI memory map in reverse order because the regions
+ * will be mapped top-down. The end result is the same as if we had
+ * mapped things forward, but doesn't require us to change the
+ * existing implementation of efi_map_region().
+ */
+static inline void *efi_map_next_entry_reverse(void *entry)
+{
+   /* Initial call */
+   if (!entry)
+   return memmap.map_end - memmap.desc_size;
+
+   entry -= memmap.desc_size;
+   if (entry < memmap.map)
+   return NULL;
+
+   return entry;
+}
+
+/*
+ * efi_map_next_entry - Return the next EFI memory map descriptor
+ * @entry: Previous EFI memory map descriptor
+ *
+ * This is a helper function to iterate over the EFI memory map, which
+ * we do in different orders depending on the current configuration.
+ *
+ * To begin traversing the memory map @entry must be %NULL.
+ *
+

[tip:perf/core] perf tests: Introduce iterator function for tests

2015-09-15 Thread tip-bot for Matt Fleming

Commit-ID:  e8210cefb7e1ec0760a6fe581ad0727a2dcf8dd1
Gitweb: http://git.kernel.org/tip/e8210cefb7e1ec0760a6fe581ad0727a2dcf8dd1
Author: Matt Fleming 
AuthorDate: Sat, 5 Sep 2015 20:02:20 +0100
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Mon, 14 Sep 2015 12:50:18 -0300

perf tests: Introduce iterator function for tests

In preparation for introducing more arrays of tests, e.g. "arch tests"
(architecture-specific tests), abstract the code to iterate over the
list of tests into a helper function.

This way, code that uses a 'struct test' doesn't need to worry about how
the tests are grouped together, and changes to the list of tests don't
require changes to the code using it.
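
As a rough, self-contained sketch of the iterator idea (not the perf
source itself): the table contents, `test_ok()`, `longest_desc()` and
`count_matches()` below are assumptions made for illustration. Only the
shape of `struct test` and of the `for_each_test()` macro follows the
diff in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for perf's 'struct test'; fields mirror the diff. */
struct test {
	const char *desc;
	int (*func)(void);
};

static int test_ok(void) { return 0; }

/* Illustrative test table, terminated by an entry with a NULL func. */
static struct test tests[] = {
	{ "vmlinux symtab matches kallsyms", test_ok },
	{ "detect open syscall event", test_ok },
	{ NULL, NULL },
};

/* Walk the table until the sentinel entry is reached. */
#define for_each_test(t) for (t = &tests[0]; t->func; t++)

/* Return the length of the longest description in the table. */
static int longest_desc(void)
{
	struct test *t;
	int width = 0;

	for_each_test(t) {
		int len = (int)strlen(t->desc);

		if (width < len)
			width = len;
	}
	return width;
}

/* Count the entries whose description contains 'needle'. */
static int count_matches(const char *needle)
{
	struct test *t;
	int n = 0;

	assert(needle != NULL);
	for_each_test(t) {
		if (strstr(t->desc, needle))
			n++;
	}
	return n;
}
```

Callers only see a `struct test *`, so swapping the table for several
arrays later would mean changing the macro rather than every loop.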

Signed-off-by: Matt Fleming 
Acked-by: Jiri Olsa 
Cc: Andi Kleen 
Cc: Kanaka Juvva 
Cc: Peter Zijlstra 
Cc: Vikas Shivappa 
Cc: Vince Weaver 
Link: http://lkml.kernel.org/r/1441479742-15402-2-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/tests/builtin-test.c | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 98b0b24..d9bf51d 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -195,7 +195,7 @@ static struct test {
},
 };
 
-static bool perf_test__matches(int curr, int argc, const char *argv[])
+static bool perf_test__matches(struct test *test, int curr, int argc, const char *argv[])
 {
int i;
 
@@ -212,7 +212,7 @@ static bool perf_test__matches(int curr, int argc, const char *argv[])
continue;
}
 
-   if (strstr(tests[curr].desc, argv[i]))
+   if (strstr(test->desc, argv[i]))
return true;
}
 
@@ -249,27 +249,28 @@ static int run_test(struct test *test)
return err;
 }
 
+#define for_each_test(t) for (t = &tests[0]; t->func; t++)
+
 static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist)
 {
+   struct test *t;
int i = 0;
int width = 0;
 
-   while (tests[i].func) {
-   int len = strlen(tests[i].desc);
+   for_each_test(t) {
+   int len = strlen(t->desc);
 
if (width < len)
width = len;
-   ++i;
}
 
-   i = 0;
-   while (tests[i].func) {
+   for_each_test(t) {
int curr = i++, err;
 
-   if (!perf_test__matches(curr, argc, argv))
+   if (!perf_test__matches(t, curr, argc, argv))
continue;
 
-   pr_info("%2d: %-*s:", i, width, tests[curr].desc);
+   pr_info("%2d: %-*s:", i, width, t->desc);
 
if (intlist__find(skiplist, i)) {
color_fprintf(stderr, PERF_COLOR_YELLOW, " Skip (user override)\n");
@@ -277,8 +278,8 @@ static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist)
}
 
pr_debug("\n--- start ---\n");
-   err = run_test(&tests[curr]);
-   pr_debug(" end \n%s:", tests[curr].desc);
+   err = run_test(t);
+   pr_debug(" end \n%s:", t->desc);
 
switch (err) {
case TEST_OK:
@@ -299,15 +300,14 @@ static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist)
 
 static int perf_test__list(int argc, const char **argv)
 {
+   struct test *t;
int i = 0;
 
-   while (tests[i].func) {
-   int curr = i++;
-
-   if (argc > 1 && !strstr(tests[curr].desc, argv[1]))
+   for_each_test(t) {
+   if (argc > 1 && !strstr(t->desc, argv[1]))
continue;
 
-   pr_info("%2d: %s\n", i, tests[curr].desc);
+   pr_info("%2d: %s\n", ++i, t->desc);
}
 
return 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[tip:perf/core] perf/x86/intel/cqm: Do not access cpu_data() from CPU_UP_PREPARE handler

2015-08-12 Thread tip-bot for Matt Fleming

Commit-ID:  d7a702f0b1033cf402fef65bd6395072738f0844
Gitweb: http://git.kernel.org/tip/d7a702f0b1033cf402fef65bd6395072738f0844
Author: Matt Fleming 
AuthorDate: Thu, 6 Aug 2015 13:12:43 +0100
Committer:  Ingo Molnar 
CommitDate: Wed, 12 Aug 2015 11:37:23 +0200

perf/x86/intel/cqm: Do not access cpu_data() from CPU_UP_PREPARE handler

Tony reports that booting his 144-cpu machine with maxcpus=10 triggers
the following WARN_ON():

[   21.045727] WARNING: CPU: 8 PID: 647 at arch/x86/kernel/cpu/perf_event_intel_cqm.c:1267 intel_cqm_cpu_prepare+0x75/0x90()
[   21.045744] CPU: 8 PID: 647 Comm: systemd-udevd Not tainted 4.2.0-rc4 #1
[   21.045745] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0066.R00.1506021730 06/02/2015
[   21.045747]   82771b09 880856333ba8 
81669b67
[   21.045748]    880856333be8 
8107b02a
[   21.045750]  88085b789800 88085f68a020 819e2470 
000a
[   21.045750] Call Trace:
[   21.045757]  [] dump_stack+0x45/0x57
[   21.045759]  [] warn_slowpath_common+0x8a/0xc0
[   21.045761]  [] warn_slowpath_null+0x1a/0x20
[   21.045762]  [] intel_cqm_cpu_prepare+0x75/0x90
[   21.045764]  [] intel_cqm_cpu_notifier+0x42/0x160
[   21.045767]  [] notifier_call_chain+0x4d/0x80
[   21.045769]  [] __raw_notifier_call_chain+0xe/0x10
[   21.045770]  [] _cpu_up+0xe8/0x190
[   21.045771]  [] cpu_up+0x7a/0xa0
[   21.045774]  [] cpu_subsys_online+0x40/0x90
[   21.045777]  [] device_online+0x67/0x90
[   21.045778]  [] online_store+0x8a/0xa0
[   21.045782]  [] dev_attr_store+0x18/0x30
[   21.045785]  [] sysfs_kf_write+0x3a/0x50
[   21.045786]  [] kernfs_fop_write+0x120/0x170
[   21.045789]  [] __vfs_write+0x37/0x100
[   21.045791]  [] ? __sb_start_write+0x58/0x110
[   21.045795]  [] ? security_file_permission+0x3d/0xc0
[   21.045796]  [] vfs_write+0xa9/0x190
[   21.045797]  [] SyS_write+0x55/0xc0
[   21.045800]  [] ? do_page_fault+0x30/0x80
[   21.045804]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[   21.045805] ---[ end trace fe228b836d8af405 ]---

The root cause is that CPU_UP_PREPARE is completely the wrong notifier
action from which to access cpu_data(), because smp_store_cpu_info()
won't have been executed by the target CPU at that point, which in turn
means that ->x86_cache_max_rmid and ->x86_cache_occ_scale haven't been
filled out.

Instead let's invoke our handler from CPU_STARTING and rename it
appropriately.
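
The ordering problem can be modelled with a toy program. This is only a
sketch under stated assumptions: `enum phase`, `toy_cpu_data`,
`toy_store_cpu_info()` and `toy_bring_up()` are hypothetical names and do
not correspond to kernel interfaces; the model just encodes the ordering
described above, where the target CPU fills in its own data after the
prepare step and before the starting step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified hotplug phases (not the kernel's constants). */
enum phase { PHASE_UP_PREPARE, PHASE_STARTING };

/* Toy per-CPU info; 'valid' models whether the CPU stored its data yet. */
struct cpu_info {
	bool valid;
	unsigned int max_rmid;
};

#define NR_TOY_CPUS 4
static struct cpu_info toy_cpu_data[NR_TOY_CPUS];

/* Models the target CPU filling in its own data while it boots. */
static void toy_store_cpu_info(unsigned int cpu, unsigned int max_rmid)
{
	assert(cpu < NR_TOY_CPUS);
	toy_cpu_data[cpu].max_rmid = max_rmid;
	toy_cpu_data[cpu].valid = true;
}

/*
 * Bring up 'cpu' and report whether a handler registered for
 * 'handler_phase' saw valid per-CPU data when it ran.
 */
static bool toy_bring_up(unsigned int cpu, enum phase handler_phase)
{
	bool seen = false;

	assert(cpu < NR_TOY_CPUS);

	/* Prepare step: the target CPU has not started executing yet. */
	if (handler_phase == PHASE_UP_PREPARE)
		seen = toy_cpu_data[cpu].valid;

	/* The target CPU boots and records its own information. */
	toy_store_cpu_info(cpu, 55);

	/* Starting step: runs after the target CPU stored its data. */
	if (handler_phase == PHASE_STARTING)
		seen = toy_cpu_data[cpu].valid;

	return seen;
}
```

In this model a handler hooked to the prepare step reads zeroed fields,
which is the situation the WARN_ON() above catches, while one hooked to
the starting step sees the populated values.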

Reported-by: Tony Luck 
Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Ashok Raj 
Cc: Kanaka Juvva 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vikas Shivappa 
Link: http://lkml.kernel.org/r/1438863163-14083-1-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 63eb68b..377e8f8 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -1255,7 +1255,7 @@ static inline void cqm_pick_event_reader(int cpu)
cpumask_set_cpu(cpu, &cqm_cpumask);
 }
 
-static void intel_cqm_cpu_prepare(unsigned int cpu)
+static void intel_cqm_cpu_starting(unsigned int cpu)
 {
struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
struct cpuinfo_x86 *c = &cpu_data(cpu);
@@ -1296,13 +1296,11 @@ static int intel_cqm_cpu_notifier(struct notifier_block *nb,
unsigned int cpu  = (unsigned long)hcpu;
 
switch (action & ~CPU_TASKS_FROZEN) {
-   case CPU_UP_PREPARE:
-   intel_cqm_cpu_prepare(cpu);
-   break;
case CPU_DOWN_PREPARE:
intel_cqm_cpu_exit(cpu);
break;
case CPU_STARTING:
+   intel_cqm_cpu_starting(cpu);
cqm_pick_event_reader(cpu);
break;
}
@@ -1373,7 +1371,7 @@ static int __init intel_cqm_init(void)
goto out;
 
for_each_online_cpu(i) {
-   intel_cqm_cpu_prepare(i);
+   intel_cqm_cpu_starting(i);
cqm_pick_event_reader(i);
}
 

[tip:core/efi] x86/efi-bgrt: Switch pr_err() to pr_debug() for invalid BGRT

2015-08-09 Thread tip-bot for Matt Fleming

Commit-ID:  248fbcd5aee00f6519a12c5ed3bc3dc0f5e84de5
Gitweb: http://git.kernel.org/tip/248fbcd5aee00f6519a12c5ed3bc3dc0f5e84de5
Author: Matt Fleming 
AuthorDate: Fri, 7 Aug 2015 09:36:55 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 8 Aug 2015 10:37:39 +0200

x86/efi-bgrt: Switch pr_err() to pr_debug() for invalid BGRT

It's totally legitimate, per the ACPI spec, for the firmware to
set the BGRT 'status' field to zero to indicate that the BGRT
image isn't being displayed, and we shouldn't be printing an
error message in that case because it's just noise for users. So
swap pr_err() for pr_debug().

However, Josh points out that it still makes sense to test the
validity of the upper 7 bits of the 'status' field, since
they're marked as "reserved" in the spec and must be zero. If
firmware violates this it really *is* an error.
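
The two-stage check described above can be sketched as a small
classifier. This is an illustration only: `enum bgrt_status_class`,
`BGRT_STATUS_RESERVED_MASK` and `bgrt_classify_status()` are hypothetical
names, and the sketch assumes the 8-bit 'status' field and the 0xfe mask
shown in the diff of this message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch (not kernel identifiers). */
enum bgrt_status_class {
	BGRT_STATUS_DISPLAYED,      /* bit 0 set, reserved bits clear */
	BGRT_STATUS_NOT_DISPLAYED,  /* all bits clear: legitimate, log quietly */
	BGRT_STATUS_RESERVED_SET,   /* reserved bits non-zero: firmware error */
};

/* Upper 7 bits of the 8-bit status field, as tested in the patch. */
#define BGRT_STATUS_RESERVED_MASK 0xfe

/* Bit 0 carries the "displayed" flag, so it must not be in the mask. */
static_assert((BGRT_STATUS_RESERVED_MASK & 1) == 0, "bit 0 is not reserved");

static enum bgrt_status_class bgrt_classify_status(uint8_t status)
{
	/* Check the reserved bits first: any of them set is a real error. */
	if (status & BGRT_STATUS_RESERVED_MASK)
		return BGRT_STATUS_RESERVED_SET;

	/* Reserved bits are clear, so status is either 0 or 1 here. */
	if (status != 1)
		return BGRT_STATUS_NOT_DISPLAYED;

	return BGRT_STATUS_DISPLAYED;
}
```

Testing the reserved bits before the displayed bit means a status of
zero falls through to the quiet case and never reaches the error path.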

Reported-by: Tom Yan 
Tested-by: Tom Yan 
Signed-off-by: Matt Fleming 
Reviewed-by: Josh Triplett 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Matthew Garrett 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1438936621-5215-2-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi-bgrt.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/platform/efi/efi-bgrt.c b/arch/x86/platform/efi/efi-bgrt.c
index d7f997f..ea48449 100644
--- a/arch/x86/platform/efi/efi-bgrt.c
+++ b/arch/x86/platform/efi/efi-bgrt.c
@@ -50,11 +50,16 @@ void __init efi_bgrt_init(void)
   bgrt_tab->version);
return;
}
-   if (bgrt_tab->status != 1) {
-   pr_err("Ignoring BGRT: invalid status %u (expected 1)\n",
+   if (bgrt_tab->status & 0xfe) {
+   pr_err("Ignoring BGRT: reserved status bits are non-zero %u\n",
   bgrt_tab->status);
return;
}
+   if (bgrt_tab->status != 1) {
+   pr_debug("Ignoring BGRT: invalid status %u (expected 1)\n",
+bgrt_tab->status);
+   return;
+   }
if (bgrt_tab->image_type != 0) {
pr_err("Ignoring BGRT: invalid image type %u (expected 0)\n",
   bgrt_tab->image_type);

[tip:core/efi] Revert "x86/efi: Request desired alignment via the PE/COFF headers"

2015-08-09 Thread tip-bot for Matt Fleming

Commit-ID:  fa5c35011a8d5f3d0c597a6336107eafd1b6046c
Gitweb: http://git.kernel.org/tip/fa5c35011a8d5f3d0c597a6336107eafd1b6046c
Author: Matt Fleming 
AuthorDate: Fri, 7 Aug 2015 09:36:56 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 8 Aug 2015 10:37:39 +0200

Revert "x86/efi: Request desired alignment via the PE/COFF headers"

This reverts commit:

  aeffc4928ea2 ("x86/efi: Request desired alignment via the PE/COFF headers")

Linn reports that Signtool complains that kernels built with
CONFIG_EFI_STUB=y are violating the PE/COFF specification because
the 'SizeOfImage' field is not a multiple of 'SectionAlignment'.

This violation was introduced as an optimisation to skip having
the kernel relocate itself during boot and instead have the
firmware place it at a correctly aligned address.

No one else has complained and I'm not aware of any firmware
implementations that refuse to boot with commit aeffc4928ea2,
but it's a real bug, so revert the offending commit.
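
For readers who want to check an image by hand, the constraint that
Signtool enforces can be expressed in a few lines. This is a sketch, not
code from the kernel or from Signtool: `pe_size_of_image_ok()` and
`pe_round_up()` are hypothetical helpers, and the header values are
assumed to have been read from the image already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if size_of_image is a whole multiple of section_alignment. */
static bool pe_size_of_image_ok(uint32_t size_of_image,
				uint32_t section_alignment)
{
	assert(section_alignment != 0);
	return size_of_image % section_alignment == 0;
}

/*
 * Round size up to the next multiple of a power-of-two alignment.
 * The caller must ensure size + alignment - 1 does not overflow.
 */
static uint32_t pe_round_up(uint32_t size, uint32_t alignment)
{
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	return (size + alignment - 1) & ~(alignment - 1);
}
```

A large 'SectionAlignment' makes the check much harder to satisfy for an
unchanged 'SizeOfImage', which matches the violation reported above.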

Reported-by: Linn Crosetto 
Signed-off-by: Matt Fleming 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Michael Brown 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1438936621-5215-3-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/boot/header.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 16ef025..7a6d43a 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -154,7 +154,7 @@ extra_header_fields:
 #else
.quad   0   # ImageBase
 #endif
-   .long   CONFIG_PHYSICAL_ALIGN   # SectionAlignment
+   .long   0x20   # SectionAlignment
.long   0x20   # FileAlignment
.word   0   # MajorOperatingSystemVersion
.word   0   # MinorOperatingSystemVersion

[tip:perf/urgent] perf/x86/intel/cqm: Return cached counter value from IRQ context

2015-07-26 Thread tip-bot for Matt Fleming

Commit-ID:  2c534c0da0a68418693e10ce1c4146e085f39518
Gitweb: http://git.kernel.org/tip/2c534c0da0a68418693e10ce1c4146e085f39518
Author: Matt Fleming 
AuthorDate: Tue, 21 Jul 2015 15:55:09 +0100
Committer:  Thomas Gleixner 
CommitDate: Sun, 26 Jul 2015 10:22:29 +0200

perf/x86/intel/cqm: Return cached counter value from IRQ context

Peter reported the following potential crash which I was able to
reproduce with his test program,

[  148.765788] ------------[ cut here ]------------
[  148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260()
[  148.765797] Modules linked in:
[  148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4
[  148.765803]  81cdc398 88085f105950 818bdfd5 
0007
[  148.765805]   88085f105990 810e413a 

[  148.765807]  82301080 0022 8107f640 
8107f640
[  148.765809] Call Trace:
[  148.765810]  <NMI>  [] dump_stack+0x45/0x57
[  148.765818]  [] warn_slowpath_common+0x8a/0xc0
[  148.765822]  [] ? intel_cqm_stable+0x60/0x60
[  148.765824]  [] ? intel_cqm_stable+0x60/0x60
[  148.765825]  [] warn_slowpath_null+0x1a/0x20
[  148.765827]  [] smp_call_function_many+0xb6/0x260
[  148.765829]  [] ? intel_cqm_stable+0x60/0x60
[  148.765831]  [] on_each_cpu_mask+0x28/0x60
[  148.765832]  [] intel_cqm_event_count+0x7f/0xe0
[  148.765836]  [] perf_output_read+0x2a5/0x400
[  148.765839]  [] perf_output_sample+0x31a/0x590
[  148.765840]  [] ? perf_prepare_sample+0x26d/0x380
[  148.765841]  [] perf_event_output+0x47/0x60
[  148.765843]  [] __perf_event_overflow+0x215/0x240
[  148.765844]  [] perf_event_overflow+0x14/0x20
[  148.765847]  [] intel_pmu_handle_irq+0x1d4/0x440
[  148.765849]  [] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765853]  [] ? vunmap_page_range+0x19d/0x2f0
[  148.765854]  [] ? unmap_kernel_range_noflush+0x11/0x20
[  148.765859]  [] ? ghes_copy_tofrom_phys+0x11e/0x2a0
[  148.765863]  [] ? native_apic_msr_write+0x2b/0x30
[  148.765865]  [] ? x2apic_send_IPI_self+0x1d/0x20
[  148.765869]  [] ? arch_irq_work_raise+0x35/0x40
[  148.765872]  [] ? irq_work_queue+0x66/0x80
[  148.765875]  [] perf_event_nmi_handler+0x26/0x40
[  148.765877]  [] nmi_handle+0x79/0x100
[  148.765879]  [] default_do_nmi+0x42/0x100
[  148.765880]  [] do_nmi+0x83/0xb0
[  148.765884]  [] end_repeat_nmi+0x1e/0x2e
[  148.765886]  [] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765888]  [] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765890]  [] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765891]  <<EOE>>  [] finish_task_switch+0x156/0x210
[  148.765898]  [] __schedule+0x341/0x920
[  148.765899]  [] schedule+0x37/0x80
[  148.765903]  [] ? do_page_fault+0x2f/0x80
[  148.765905]  [] schedule_user+0x1a/0x50
[  148.765907]  [] retint_careful+0x14/0x32
[  148.765908] ---[ end trace e33ff2be78e14901 ]---

The CQM task events are not safe to be called from within interrupt
context because they require performing an IPI to read the counter value
on all sockets. And performing IPIs from within IRQ context is a
"no-no".

Make do with the last read counter value, currently stored in
event->count, when we're invoked in this context.
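
The fallback can be pictured with a small user-space model. It is a
sketch under stated assumptions, not the driver code: `struct toy_event`,
`toy_in_interrupt`, `toy_hw_counter`, `toy_ipi_reads`,
`toy_read_all_sockets()` and `toy_event_count()` are hypothetical
stand-ins for the perf event, in_interrupt(), and the IPI-based read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy event: 'count' caches the most recently read counter value. */
struct toy_event {
	uint64_t count;
};

/* Hypothetical stand-ins for in_interrupt() and the hardware state. */
static bool toy_in_interrupt;
static uint64_t toy_hw_counter;
static unsigned int toy_ipi_reads;

/* Models the expensive cross-CPU read; counts how often it is used. */
static uint64_t toy_read_all_sockets(void)
{
	toy_ipi_reads++;
	return toy_hw_counter;
}

static uint64_t toy_event_count(struct toy_event *event)
{
	assert(event != NULL);

	/* Interrupt context: skip the cross-CPU read, use the cache. */
	if (toy_in_interrupt)
		return event->count;

	/* Process context: refresh the cache, then return it. */
	event->count = toy_read_all_sockets();
	return event->count;
}
```

The value returned from interrupt context may be stale, which is the
trade-off this patch accepts in exchange for never sending IPIs there.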

Reported-by: Peter Zijlstra 
Signed-off-by: Matt Fleming 
Cc: Thomas Gleixner 
Cc: Vikas Shivappa 
Cc: Kanaka Juvva 
Cc: Will Auld 
Cc: 
Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Thomas Gleixner 
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 1880761..63eb68b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -952,6 +952,14 @@ static u64 intel_cqm_event_count(struct perf_event *event)
 		return 0;
 
 	/*
+	 * Getting up-to-date values requires an SMP IPI which is not
+	 * possible if we're being called in interrupt context. Return
+	 * the cached values instead.
+	 */
+	if (unlikely(in_interrupt()))
+		goto out;
+
+	/*
 	 * Notice that we don't perform the reading of an RMID
 	 * atomically, because we can't hold a spin lock across the
 	 * IPIs.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
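
The fallback described in the patch above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct fake_event`, `fake_in_interrupt`, `read_all_sockets()` and the value 4096 are hypothetical stand-ins and are not part of the kernel driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the event; only the cached count matters here. */
struct fake_event {
	uint64_t count;		/* value stored by the last full read */
};

static bool fake_in_interrupt;	/* models the result of in_interrupt() */
static int cross_calls;		/* how many simulated IPI rounds were issued */

/* Models the expensive cross-CPU read that must not run in IRQ context. */
static uint64_t read_all_sockets(void)
{
	cross_calls++;
	return 4096;		/* made-up occupancy value */
}

static uint64_t event_count(struct fake_event *ev)
{
	assert(ev != NULL);

	if (fake_in_interrupt)
		goto out;	/* no IPI allowed: reuse the cached value */

	ev->count = read_all_sockets();
out:
	return ev->count;
}
```

With `fake_in_interrupt` set, `event_count()` returns whatever was cached and `cross_calls` stays unchanged; once it is cleared, the next call refreshes the cache.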

[tip:perf/core] perf/x86/intel/cqm: Use 'u32' data type for RMIDs

2015-05-27 Thread tip-bot for Matt Fleming

Commit-ID:  adafa99960ef18b019f001ddee4d9d81c4e25944
Gitweb: http://git.kernel.org/tip/adafa99960ef18b019f001ddee4d9d81c4e25944
Author: Matt Fleming 
AuthorDate: Fri, 22 May 2015 09:59:42 +0100
Committer:  Ingo Molnar 
CommitDate: Wed, 27 May 2015 09:17:41 +0200

perf/x86/intel/cqm: Use 'u32' data type for RMIDs

Since we write RMID values to MSRs the correct type to use is 'u32'
because that clearly articulates we're writing a hardware register
value.

Fix up all uses of RMID in this code to consistently use the correct data
type.

Reported-by: Thomas Gleixner 
Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Thomas Gleixner 
Cc: Kanaka Juvva 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Vikas Shivappa 
Cc: Will Auld 
Link: http://lkml.kernel.org/r/1432285182-17180-1-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 37 +++---
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 8233b29..1880761 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -13,7 +13,7 @@
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
 
-static unsigned int cqm_max_rmid = -1;
+static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
 /**
@@ -76,7 +76,7 @@ static cpumask_t cqm_cpumask;
  * near-zero occupancy value, i.e. no cachelines are tagged with this
  * RMID, once __intel_cqm_rmid_rotate() returns.
  */
-static unsigned int intel_cqm_rotation_rmid;
+static u32 intel_cqm_rotation_rmid;
 
 #define INVALID_RMID   (-1)
 
@@ -88,7 +88,7 @@ static unsigned int intel_cqm_rotation_rmid;
  * Likewise, an rmid value of -1 is used to indicate "no rmid currently
  * assigned" and is used as part of the rotation code.
  */
-static inline bool __rmid_valid(unsigned int rmid)
+static inline bool __rmid_valid(u32 rmid)
 {
if (!rmid || rmid == INVALID_RMID)
return false;
@@ -96,7 +96,7 @@ static inline bool __rmid_valid(unsigned int rmid)
return true;
 }
 
-static u64 __rmid_read(unsigned int rmid)
+static u64 __rmid_read(u32 rmid)
 {
u64 val;
 
@@ -121,7 +121,7 @@ enum rmid_recycle_state {
 };
 
 struct cqm_rmid_entry {
-   unsigned int rmid;
+   u32 rmid;
enum rmid_recycle_state state;
struct list_head list;
unsigned long queue_time;
@@ -166,7 +166,7 @@ static LIST_HEAD(cqm_rmid_limbo_lru);
  */
 static struct cqm_rmid_entry **cqm_rmid_ptrs;
 
-static inline struct cqm_rmid_entry *__rmid_entry(int rmid)
+static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid)
 {
struct cqm_rmid_entry *entry;
 
@@ -181,7 +181,7 @@ static inline struct cqm_rmid_entry *__rmid_entry(int rmid)
  *
  * We expect to be called with cache_mutex held.
  */
-static int __get_rmid(void)
+static u32 __get_rmid(void)
 {
struct cqm_rmid_entry *entry;
 
@@ -196,7 +196,7 @@ static int __get_rmid(void)
return entry->rmid;
 }
 
-static void __put_rmid(unsigned int rmid)
+static void __put_rmid(u32 rmid)
 {
struct cqm_rmid_entry *entry;
 
@@ -391,7 +391,7 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b)
 }
 
 struct rmid_read {
-   unsigned int rmid;
+   u32 rmid;
atomic64_t value;
 };
 
@@ -400,12 +400,11 @@ static void __intel_cqm_event_count(void *info);
 /*
  * Exchange the RMID of a group of events.
  */
-static unsigned int
-intel_cqm_xchg_rmid(struct perf_event *group, unsigned int rmid)
+static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
 {
struct perf_event *event;
-   unsigned int old_rmid = group->hw.cqm_rmid;
struct list_head *head = &group->hw.cqm_group_entry;
+   u32 old_rmid = group->hw.cqm_rmid;
 
lockdep_assert_held(&cache_mutex);
 
@@ -470,7 +469,7 @@ static void intel_cqm_stable(void *arg)
  * If we have group events waiting for an RMID that don't conflict with
  * events already running, assign @rmid.
  */
-static bool intel_cqm_sched_in_event(unsigned int rmid)
+static bool intel_cqm_sched_in_event(u32 rmid)
 {
struct perf_event *leader, *event;
 
@@ -617,7 +616,7 @@ static bool intel_cqm_rmid_stabilize(unsigned int *available)
 static void __intel_cqm_pick_and_rotate(struct perf_event *next)
 {
struct perf_event *rotor;
-   unsigned int rmid;
+   u32 rmid;
 
lockdep_assert_held(&cache_mutex);
 
@@ -645,7 +644,7 @@ static void __intel_cqm_pick_and_rotate(struct perf_event *next)
 static void intel_cqm_sched_out_conflicting_events(struct perf_event *event)
 {
struct perf_event *group, *g;
-   unsigned int rmid;
+   u32 rmid;
 
lockdep_assert_held(&cache_mutex);
 
@@ -847,8 +846,8 @@ static void intel_cqm_setup_event(struct perf_event *event,
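
As a rough userspace illustration of why the sentinel comparison still works after the switch to 'u32', the validity check from the diff can be rebuilt with fixed-width types. The `u32` typedef and the name `rmid_valid()` are assumptions made for this sketch; the explicit cast is added here only to make the conversion of -1 visible.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t u32;		/* userspace stand-in for the kernel's u32 */

#define INVALID_RMID	(-1)

/*
 * RMID 0 is reserved by the hardware and -1 means "no RMID assigned".
 * Converting -1 to u32 yields 0xffffffff, so the sentinel stays
 * distinguishable from every real RMID.
 */
static inline bool rmid_valid(u32 rmid)
{
	if (!rmid || rmid == (u32)INVALID_RMID)
		return false;

	return true;
}
```

Any value other than 0 and 0xffffffff is reported as valid by this sketch; it does not check against the hardware's maximum RMID.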

[tip:perf/x86] perf/x86/intel: Fix Makefile to actually build the cqm driver

2015-03-23 Thread tip-bot for Matt Fleming

Commit-ID:  4e16ed99416ef569a89782a7234f95007919fadd
Gitweb: http://git.kernel.org/tip/4e16ed99416ef569a89782a7234f95007919fadd
Author: Matt Fleming 
AuthorDate: Thu, 26 Feb 2015 18:47:00 +
Committer:  Ingo Molnar 
CommitDate: Mon, 23 Mar 2015 10:58:03 +0100

perf/x86/intel: Fix Makefile to actually build the cqm driver

Someone fat fingered a merge conflict and lost the Makefile hunk.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
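Cc: a...@redhat.com
Cc: h...@zytor.com
Cc: jo...@redhat.com
Cc: kanaka.d.ju...@intel.com
Cc: t...@linutronix.de
Cc: torva...@linux-foundation.org
Cc: vikas.shiva...@linux.intel.com
Link: http://lkml.kernel.org/r/1424976420.15321.35.ca...@mfleming-mobl1.ger.corp.intel.com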
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 80091ae..6c1ca13 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -39,7 +39,7 @@ obj-$(CONFIG_CPU_SUP_AMD)		+= perf_event_amd_iommu.o
 endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
-obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o perf_event_intel_cqm.o
 
 obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
   perf_event_intel_uncore_snb.o \

[tip:perf/x86] perf/x86/intel: Enable conflicting event scheduling for CQM

2015-02-25 Thread tip-bot for Matt Fleming

Commit-ID:  59bf7fd45c90a8fde22a7717b5413e4ed9666c32
Gitweb: http://git.kernel.org/tip/59bf7fd45c90a8fde22a7717b5413e4ed9666c32
Author: Matt Fleming 
AuthorDate: Fri, 23 Jan 2015 18:45:48 +
Committer:  Ingo Molnar 
CommitDate: Wed, 25 Feb 2015 13:53:36 +0100

perf/x86/intel: Enable conflicting event scheduling for CQM

We can leverage the workqueue that we use for RMID rotation to support
scheduling of conflicting monitoring events. Allowing events that
monitor conflicting things is done at various other places in the perf
subsystem, so there's precedent there.

An example of two conflicting events would be monitoring a cgroup and
simultaneously monitoring a task within that cgroup.

This uses the cache_groups list as a queuing mechanism, where every
event that reaches the front of the list gets the chance to be scheduled
in, possibly descheduling any conflicting events that are running.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: H. Peter Anvin 
Cc: Jiri Olsa 
Cc: Kanaka Juvva 
Cc: Linus Torvalds 
Cc: Vikas Shivappa 
Link: http://lkml.kernel.org/r/1422038748-21397-10-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 130 +++--
 1 file changed, 84 insertions(+), 46 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e31f508..9a8ef83 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -507,7 +507,6 @@ static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME;
 static bool intel_cqm_rmid_stabilize(unsigned int *available)
 {
struct cqm_rmid_entry *entry, *tmp;
-   struct perf_event *event;
 
lockdep_assert_held(&cache_mutex);
 
@@ -577,19 +576,9 @@ static bool intel_cqm_rmid_stabilize(unsigned int *available)
 
/*
 * If we have groups waiting for RMIDs, hand
-* them one now.
+* them one now provided they don't conflict.
 */
-   list_for_each_entry(event, &cache_groups,
-   hw.cqm_groups_entry) {
-   if (__rmid_valid(event->hw.cqm_rmid))
-   continue;
-
-   intel_cqm_xchg_rmid(event, entry->rmid);
-   entry = NULL;
-   break;
-   }
-
-   if (!entry)
+   if (intel_cqm_sched_in_event(entry->rmid))
continue;
 
/*
@@ -604,25 +593,73 @@ static bool intel_cqm_rmid_stabilize(unsigned int *available)
 
 /*
  * Pick a victim group and move it to the tail of the group list.
+ * @next: The first group without an RMID
  */
-static struct perf_event *
-__intel_cqm_pick_and_rotate(void)
+static void __intel_cqm_pick_and_rotate(struct perf_event *next)
 {
struct perf_event *rotor;
+   unsigned int rmid;
 
lockdep_assert_held(&cache_mutex);
-   lockdep_assert_held(&cache_lock);
 
rotor = list_first_entry(&cache_groups, struct perf_event,
 hw.cqm_groups_entry);
+
+   /*
+* The group at the front of the list should always have a valid
+* RMID. If it doesn't then no groups have RMIDs assigned and we
+* don't need to rotate the list.
+*/
+   if (next == rotor)
+   return;
+
+   rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID);
+   __put_rmid(rmid);
+
list_rotate_left(&cache_groups);
+}
+
+/*
+ * Deallocate the RMIDs from any events that conflict with @event, and
+ * place them on the back of the group list.
+ */
+static void intel_cqm_sched_out_conflicting_events(struct perf_event *event)
+{
+   struct perf_event *group, *g;
+   unsigned int rmid;
+
+   lockdep_assert_held(&cache_mutex);
+
+   list_for_each_entry_safe(group, g, &cache_groups, hw.cqm_groups_entry) {
+   if (group == event)
+   continue;
+
+   rmid = group->hw.cqm_rmid;
+
+   /*
+* Skip events that don't have a valid RMID.
+*/
+   if (!__rmid_valid(rmid))
+   continue;
+
+   /*
+* No conflict? No problem! Leave the event alone.
+*/
+   if (!__conflict_event(group, event))
+   continue;
 
-   return rotor;
+   intel_cqm_xchg_rmid(group, INVALID_RMID);
+   __put_rmid(rmid);
+   }
 }
 
 /*
  * Attempt to rotate the groups and assign new RMIDs.
  *
+ * We rotate for two reasons,
+ *   1. To handle the scheduling of conflicting events
+ *   2. To recycle RMIDs
+ *
  * Rotating RMIDs is complicated because the hardware doesn't give us
  * any clues.
  *
@@ -642,11 +679,10 @@ __intel_cqm_pick_and_rotate(void)
  */
 static bool
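
The descheduling step in this patch can be approximated in plain C as follows. It is a simplified sketch, not the driver code: `struct group`, the array in place of the cache_groups list, and the toy rule in `conflicts()` (same cgroup id means conflict) are all assumptions, and the sketch does not move descheduled groups to the back of the list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_RMID	UINT32_MAX

/* Hypothetical, simplified group; the driver uses struct perf_event. */
struct group {
	int cgroup_id;		/* which cgroup this group monitors */
	uint32_t rmid;		/* INVALID_RMID when descheduled */
};

/* Toy conflict rule (an assumption): same cgroup id means conflict. */
static bool conflicts(const struct group *a, const struct group *b)
{
	return a->cgroup_id == b->cgroup_id;
}

/*
 * Take the RMID away from every group that conflicts with @event and
 * return how many RMIDs were released.
 */
static size_t sched_out_conflicting(struct group *groups, size_t n,
				    const struct group *event)
{
	size_t freed = 0;

	for (size_t i = 0; i < n; i++) {
		struct group *g = &groups[i];

		if (g == event)
			continue;
		if (g->rmid == INVALID_RMID)
			continue;	/* nothing to take away */
		if (!conflicts(g, event))
			continue;	/* no conflict, leave it alone */

		g->rmid = INVALID_RMID;
		freed++;
	}

	return freed;
}
```

The three `continue` checks follow the same order as the loop added by the patch: skip the event itself, skip groups without a valid RMID, skip groups that do not conflict.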

[tip:perf/x86] perf/x86/intel: Perform rotation on Intel CQM RMIDs

2015-02-25 Thread tip-bot for Matt Fleming

Commit-ID:  bff671dba7981195a644a5dc210d65de8ae2d251
Gitweb: http://git.kernel.org/tip/bff671dba7981195a644a5dc210d65de8ae2d251
Author: Matt Fleming 
AuthorDate: Fri, 23 Jan 2015 18:45:47 +
Committer:  Ingo Molnar 
CommitDate: Wed, 25 Feb 2015 13:53:35 +0100

perf/x86/intel: Perform rotation on Intel CQM RMIDs

There are many use cases where people will want to monitor more tasks
than there exist RMIDs in the hardware, meaning that we have to perform
some kind of multiplexing.

We do this by "rotating" the RMIDs in a workqueue, and assigning an RMID
to a waiting event when the RMID becomes unused.

This scheme reserves one RMID at all times for rotation. When we need to
schedule a new event we give it the reserved RMID, pick a victim event
from the front of the global CQM list and wait for the victim's RMID to
drop to zero occupancy, before it becomes the new reserved RMID.

We put the victim's RMID onto the limbo list, where it resides for a
"minimum queue time", which is intended to save ourselves an expensive
SMP IPI when the RMID is unlikely to have an occupancy value below
__intel_cqm_threshold.

If we fail to recycle an RMID even after waiting the minimum queue time,
then we need to increment __intel_cqm_threshold. There is an upper bound
on this threshold, __intel_cqm_max_threshold, which is programmable from
userland as /sys/devices/intel_cqm/max_recycling_threshold.

The comments above __intel_cqm_rmid_rotate() have more details.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: H. Peter Anvin 
Cc: Jiri Olsa 
Cc: Kanaka Juvva 
Cc: Linus Torvalds 
Cc: Vikas Shivappa 
Link: http://lkml.kernel.org/r/1422038748-21397-9-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 671 ++---
 1 file changed, 623 insertions(+), 48 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 8003d87..e31f508 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -25,9 +25,13 @@ struct intel_cqm_state {
 static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
 
 /*
- * Protects cache_cgroups and cqm_rmid_lru.
+ * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
+ * Also protects event->hw.cqm_rmid
+ *
+ * Hold either for stability, both for modification of ->hw.cqm_rmid.
  */
 static DEFINE_MUTEX(cache_mutex);
+static DEFINE_RAW_SPINLOCK(cache_lock);
 
 /*
  * Groups of events that have the same target(s), one RMID per group.
@@ -46,7 +50,34 @@ static cpumask_t cqm_cpumask;
 
 #define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
 
-static u64 __rmid_read(unsigned long rmid)
+/*
+ * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
+ *
+ * This rmid is always free and is guaranteed to have an associated
+ * near-zero occupancy value, i.e. no cachelines are tagged with this
+ * RMID, once __intel_cqm_rmid_rotate() returns.
+ */
+static unsigned int intel_cqm_rotation_rmid;
+
+#define INVALID_RMID   (-1)
+
+/*
+ * Is @rmid valid for programming the hardware?
+ *
+ * rmid 0 is reserved by the hardware for all non-monitored tasks, which
+ * means that we should never come across an rmid with that value.
+ * Likewise, an rmid value of -1 is used to indicate "no rmid currently
+ * assigned" and is used as part of the rotation code.
+ */
+static inline bool __rmid_valid(unsigned int rmid)
+{
+   if (!rmid || rmid == INVALID_RMID)
+   return false;
+
+   return true;
+}
+
+static u64 __rmid_read(unsigned int rmid)
 {
u64 val;
 
@@ -64,13 +95,21 @@ static u64 __rmid_read(unsigned long rmid)
return val;
 }
 
+enum rmid_recycle_state {
+   RMID_YOUNG = 0,
+   RMID_AVAILABLE,
+   RMID_DIRTY,
+};
+
 struct cqm_rmid_entry {
-   u64 rmid;
+   unsigned int rmid;
+   enum rmid_recycle_state state;
struct list_head list;
+   unsigned long queue_time;
 };
 
 /*
- * A least recently used list of RMIDs.
+ * cqm_rmid_free_lru - A least recently used list of RMIDs.
  *
  * Oldest entry at the head, newest (most recently used) entry at the
  * tail. This list is never traversed, it's only used to keep track of
@@ -81,9 +120,18 @@ struct cqm_rmid_entry {
  * in use. To mark an RMID as in use, remove its entry from the lru
  * list.
  *
- * This list is protected by cache_mutex.
+ *
+ * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs.
+ *
+ * This list contains RMIDs that no one is currently using but that
+ * may have a non-zero occupancy value associated with them. The
+ * rotation worker moves RMIDs from the limbo list to the free list once
+ * the occupancy value drops below __intel_cqm_threshold.
+ *
+ * Both lists are protected by cache_mutex.
  */
-static LIST_HEAD(cqm_rmid_lru);
+static LIST_HEAD(cqm_rmid_free_lru);
+static
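
The threshold handling described in the commit message can be modelled roughly as below. This is a sketch under several assumptions: the entry layout, the `step` parameter and the "at or below the threshold" comparison are invented for illustration, and the minimum queue time is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum rmid_state { RMID_DIRTY, RMID_AVAILABLE };

/* Hypothetical limbo entry; occupancy would be read from the hardware. */
struct limbo_entry {
	uint32_t rmid;
	uint64_t occupancy;
	enum rmid_state state;
};

/*
 * Mark every dirty entry whose occupancy is at or below @threshold as
 * available and return how many entries were recycled.
 */
static size_t stabilize(struct limbo_entry *limbo, size_t n, uint64_t threshold)
{
	size_t recycled = 0;

	for (size_t i = 0; i < n; i++) {
		if (limbo[i].state == RMID_DIRTY &&
		    limbo[i].occupancy <= threshold) {
			limbo[i].state = RMID_AVAILABLE;
			recycled++;
		}
	}

	return recycled;
}

/*
 * Retry with a growing threshold, capped at @max_threshold. Returns the
 * threshold that recycled at least one RMID, or UINT64_MAX on failure.
 */
static uint64_t recycle(struct limbo_entry *limbo, size_t n,
			uint64_t threshold, uint64_t step,
			uint64_t max_threshold)
{
	assert(step > 0);	/* otherwise the loop could never progress */

	for (;;) {
		if (stabilize(limbo, n, threshold))
			return threshold;
		if (threshold >= max_threshold)
			return UINT64_MAX;

		threshold += step;
		if (threshold > max_threshold)
			threshold = max_threshold;
	}
}
```

The cap plays the role that the commit message assigns to the max_recycling_threshold value: once it is reached without recycling anything, the attempt gives up rather than accept arbitrarily dirty RMIDs.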

[tip:perf/x86] perf/x86/intel: Support task events with Intel CQM

2015-02-25 Thread tip-bot for Matt Fleming

Commit-ID:  bfe1fcd2688f557a6b6a88f59ea7619228728bd7
Gitweb: http://git.kernel.org/tip/bfe1fcd2688f557a6b6a88f59ea7619228728bd7
Author: Matt Fleming 
AuthorDate: Fri, 23 Jan 2015 18:45:46 +
Committer:  Ingo Molnar 
CommitDate: Wed, 25 Feb 2015 13:53:34 +0100

perf/x86/intel: Support task events with Intel CQM

Add support for task events as well as system-wide events. This change
has a big impact on the way that we gather LLC occupancy values in
intel_cqm_event_read().

Currently, for system-wide (per-cpu) events we defer processing to
userspace which knows how to discard all but one cpu result per package.

Things aren't so simple for task events because we need to do the value
aggregation ourselves. To do this, we defer updating the LLC occupancy
value in event->count from intel_cqm_event_read() and do an SMP
cross-call to read values for all packages in intel_cqm_event_count().
We need to ensure that we only do this for one task event per cache
group, otherwise we'll report duplicate values.

If we're a system-wide event we want to fall back to the default
perf_event_count() implementation. Refactor this into a common function
so that we don't duplicate the code.

Also, introduce PERF_TYPE_INTEL_CQM, since we need a way to track an
event's task (if the event isn't per-cpu) inside of the Intel CQM PMU
driver.  This task information is only available in the upper layers of
the perf infrastructure.

Other perf backends stash the target task in event->hw.*target so we
need to do something similar. The task is used to determine whether
events should share a cache group and an RMID.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: H. Peter Anvin 
Cc: Jiri Olsa 
Cc: Kanaka Juvva 
Cc: Linus Torvalds 
Cc: Vikas Shivappa 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1422038748-21397-8-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 195 +
 include/linux/perf_event.h |   1 +
 include/uapi/linux/perf_event.h|   1 +
 kernel/events/core.c   |   2 +
 4 files changed, 178 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index b5d9d74..8003d87 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -182,23 +182,124 @@ fail:
 
 /*
  * Determine if @a and @b measure the same set of tasks.
+ *
+ * If @a and @b measure the same set of tasks then we want to share a
+ * single RMID.
  */
 static bool __match_event(struct perf_event *a, struct perf_event *b)
 {
+   /* Per-cpu and task events don't mix */
if ((a->attach_state & PERF_ATTACH_TASK) !=
(b->attach_state & PERF_ATTACH_TASK))
return false;
 
-   /* not task */
+#ifdef CONFIG_CGROUP_PERF
+   if (a->cgrp != b->cgrp)
+   return false;
+#endif
+
+   /* If not task event, we're machine wide */
+   if (!(b->attach_state & PERF_ATTACH_TASK))
+   return true;
+
+   /*
+* Events that target same task are placed into the same cache group.
+*/
+   if (a->hw.cqm_target == b->hw.cqm_target)
+   return true;
+
+   /*
+* Are we an inherited event?
+*/
+   if (b->parent == a)
+   return true;
+
+   return false;
+}
+
+#ifdef CONFIG_CGROUP_PERF
+static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
+{
+   if (event->attach_state & PERF_ATTACH_TASK)
+   return perf_cgroup_from_task(event->hw.cqm_target);
 
-   return true; /* if not task, we're machine wide */
+   return event->cgrp;
 }
+#endif
 
 /*
  * Determine if @a's tasks intersect with @b's tasks
+ *
+ * There are combinations of events that we explicitly prohibit,
+ *
+ *                    PROHIBITS
+ *     system-wide    ->   cgroup and task
+ *     cgroup         ->   system-wide
+ *                    ->   task in cgroup
+ *     task           ->   system-wide
+ *                    ->   task in cgroup
+ *
+ * Call this function before allocating an RMID.
  */
 static bool __conflict_event(struct perf_event *a, struct perf_event *b)
 {
+#ifdef CONFIG_CGROUP_PERF
+   /*
+* We can have any number of cgroups but only one system-wide
+* event at a time.
+*/
+   if (a->cgrp && b->cgrp) {
+   struct perf_cgroup *ac = a->cgrp;
+   struct perf_cgroup *bc = b->cgrp;
+
+   /*
+* This condition should have been caught in
+* __match_event() and we should be sharing an RMID.
+*/
+   WARN_ON_ONCE(ac == bc);
+
+   if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
+
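
The archive view cuts the hunk off above. As a rough illustration only, the sharing rules that the patched __match_event() applies can be sketched in userspace C. The struct ev type and its fields are hypothetical stand-ins for the perf_event fields used in the patch (attach_state & PERF_ATTACH_TASK, cgrp, hw.cqm_target, parent); this is not the kernel code itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of struct perf_event used here. */
struct ev {
    bool task;                /* models attach_state & PERF_ATTACH_TASK */
    const void *cgrp;         /* models event->cgrp */
    const void *target;       /* models event->hw.cqm_target */
    const struct ev *parent;  /* models event->parent */
};

/* Returns true when a and b measure the same tasks and may share an RMID. */
static bool match_event(const struct ev *a, const struct ev *b)
{
    /* Per-cpu and task events don't mix. */
    if (a->task != b->task)
        return false;

    /* Events in different cgroups never share. */
    if (a->cgrp != b->cgrp)
        return false;

    /* Neither is a task event, so both are machine wide. */
    if (!b->task)
        return true;

    /* Same target task. */
    if (a->target == b->target)
        return true;

    /* b was inherited from a. */
    return b->parent == a;
}
```

Two task events with the same target, or a child event whose parent is the other event, match; a task event never matches a per-cpu event.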

[tip:perf/x86] perf: Move cgroup init before PMU ->event_init()

2015-02-25 Thread tip-bot for Matt Fleming

Commit-ID:  79dff51e900fd26a073be8b23acfbd8c15edb181
Gitweb: http://git.kernel.org/tip/79dff51e900fd26a073be8b23acfbd8c15edb181
Author: Matt Fleming 
AuthorDate: Fri, 23 Jan 2015 18:45:42 +
Committer:  Ingo Molnar 
CommitDate: Wed, 25 Feb 2015 13:53:30 +0100

perf: Move cgroup init before PMU ->event_init()

The Intel QoS PMU needs to know whether an event is part of a cgroup
during ->event_init(), because tasks in the same cgroup share a
monitoring ID.

Move the cgroup initialisation before calling into the PMU driver.
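
A minimal userspace sketch of the ordering this patch establishes, assuming hypothetical stand-ins for the kernel pieces: struct event for struct perf_event, cgroup_connect() for perf_cgroup_connect(), and pmu_event_init() for the PMU's ->event_init() hook. The pmu_fail parameter exists only to exercise the error path. The cgroup is attached first, so the PMU hook can see it, and the error path detaches it again.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct perf_event. */
struct event {
    int cgroup_fd;   /* -1 when the event is not a cgroup event */
    int has_cgroup;  /* non-zero once a cgroup is attached */
};

/* Hypothetical stand-in for perf_cgroup_connect(). */
static int cgroup_connect(struct event *ev, int fd)
{
    if (fd < 0)
        return -1;
    ev->cgroup_fd = fd;
    ev->has_cgroup = 1;
    return 0;
}

/* Hypothetical stand-in for perf_detach_cgroup(). */
static void cgroup_detach(struct event *ev)
{
    ev->has_cgroup = 0;
    ev->cgroup_fd = -1;
}

/* Hypothetical PMU hook: records whether it could see the cgroup. */
static int pmu_event_init(const struct event *ev, int fail, int *saw_cgroup)
{
    *saw_cgroup = ev->has_cgroup;
    return fail ? -1 : 0;
}

/* Allocate an event; attach the cgroup before calling into the PMU. */
static struct event *event_alloc(int cgroup_fd, int pmu_fail, int *saw_cgroup)
{
    struct event *ev = calloc(1, sizeof(*ev));

    if (!ev)
        return NULL;
    ev->cgroup_fd = -1;

    if (cgroup_fd != -1 && cgroup_connect(ev, cgroup_fd))
        goto err;

    if (pmu_event_init(ev, pmu_fail, saw_cgroup))
        goto err;

    return ev;

err:
    /* Undo the cgroup attachment if it happened, then free. */
    if (ev->has_cgroup)
        cgroup_detach(ev);
    free(ev);
    return NULL;
}
```

With this ordering, event_alloc(5, 0, &saw) leaves saw set to 1, which is the property the Intel QoS PMU relies on.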

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: H. Peter Anvin 
Cc: Jiri Olsa 
Cc: Kanaka Juvva 
Cc: Linus Torvalds 
Cc: Vikas Shivappa 
Link: http://lkml.kernel.org/r/1422038748-21397-4-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 kernel/events/core.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4e8dc59..1fc3bae 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7116,7 +7116,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 struct perf_event *group_leader,
 struct perf_event *parent_event,
 perf_overflow_handler_t overflow_handler,
-void *context)
+void *context, int cgroup_fd)
 {
struct pmu *pmu;
struct perf_event *event;
@@ -7212,6 +7212,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (!has_branch_stack(event))
event->attr.branch_sample_type = 0;
 
+   if (cgroup_fd != -1) {
+   err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
+   if (err)
+   goto err_ns;
+   }
+
pmu = perf_init_event(event);
if (!pmu)
goto err_ns;
@@ -7235,6 +7241,8 @@ err_pmu:
event->destroy(event);
module_put(pmu->module);
 err_ns:
+   if (is_cgroup_event(event))
+   perf_detach_cgroup(event);
if (event->ns)
put_pid_ns(event->ns);
kfree(event);
@@ -7453,6 +7461,7 @@ SYSCALL_DEFINE5(perf_event_open,
int move_group = 0;
int err;
int f_flags = O_RDWR;
+   int cgroup_fd = -1;
 
/* for future expandability... */
if (flags & ~PERF_FLAG_ALL)
@@ -7518,21 +7527,16 @@ SYSCALL_DEFINE5(perf_event_open,
 
get_online_cpus();
 
+   if (flags & PERF_FLAG_PID_CGROUP)
+   cgroup_fd = pid;
+
event = perf_event_alloc(&attr, cpu, task, group_leader, NULL,
-NULL, NULL);
+NULL, NULL, cgroup_fd);
if (IS_ERR(event)) {
err = PTR_ERR(event);
goto err_cpus;
}
 
-   if (flags & PERF_FLAG_PID_CGROUP) {
-   err = perf_cgroup_connect(pid, event, &attr, group_leader);
-   if (err) {
-   __free_event(event);
-   goto err_cpus;
-   }
-   }
-
if (is_sampling_event(event)) {
if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
err = -ENOTSUPP;
@@ -7769,7 +7773,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 */
 
event = perf_event_alloc(attr, cpu, task, NULL, NULL,
-overflow_handler, context);
+overflow_handler, context, -1);
if (IS_ERR(event)) {
err = PTR_ERR(event);
goto err;
@@ -8130,7 +8134,7 @@ inherit_event(struct perf_event *parent_event,
   parent_event->cpu,
   child,
   group_leader, parent_event,
-  NULL, NULL);
+  NULL, NULL, -1);
if (IS_ERR(child_event))
return child_event;
 

[tip:perf/x86] perf/x86/intel: Implement LRU monitoring ID allocation for CQM

2015-02-25 Thread tip-bot for Matt Fleming

Commit-ID:  35298e554c74b7849875e3676ba8eaf833c7b917
Gitweb: http://git.kernel.org/tip/35298e554c74b7849875e3676ba8eaf833c7b917
Author: Matt Fleming 
AuthorDate: Fri, 23 Jan 2015 18:45:45 +
Committer:  Ingo Molnar 
CommitDate: Wed, 25 Feb 2015 13:53:33 +0100

perf/x86/intel: Implement LRU monitoring ID allocation for CQM

It's possible to run into issues with re-using unused monitoring IDs
because there may be stale cachelines associated with that ID from a
previous allocation. This can cause the LLC occupancy values to be
inaccurate.

To attempt to mitigate this problem we place the IDs on a least recently
used list, essentially a FIFO. The basic idea is that the longer the
time period between ID re-use the lower the probability that stale
cachelines exist in the cache.
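
The FIFO behaviour described above can be sketched in userspace C. This is an illustration under stated assumptions, not the patch's implementation: it swaps the kernel's list_head for a small ring buffer, MAX_RMID is a made-up value (real hardware reports its own maximum), and it returns -1 rather than a kernel error code when no RMID is free. RMID 0 is kept off the free list because the patch treats it as always allocated.

```c
#include <assert.h>

#define MAX_RMID 3  /* hypothetical; real hardware reports its own maximum */
#define NSLOTS   (MAX_RMID + 1)

/* Free RMIDs in FIFO order: oldest release at the head, newest at the tail. */
struct rmid_fifo {
    int slots[NSLOTS];
    int head;   /* index of the oldest free RMID */
    int count;  /* number of free RMIDs */
};

static void rmid_fifo_init(struct rmid_fifo *f)
{
    f->head = 0;
    f->count = 0;
    /* RMID 0 is reserved for unmonitored tasks, so it is never free. */
    for (int r = 1; r <= MAX_RMID; r++)
        f->slots[f->count++] = r;
}

/* Take the RMID that has been free the longest; -1 if none is free. */
static int rmid_get(struct rmid_fifo *f)
{
    int rmid;

    if (f->count == 0)
        return -1;

    rmid = f->slots[f->head];
    f->head = (f->head + 1) % NSLOTS;
    f->count--;
    return rmid;
}

/* Return an RMID to the tail, so it is reused as late as possible. */
static void rmid_put(struct rmid_fifo *f, int rmid)
{
    int tail;

    assert(rmid > 0 && rmid <= MAX_RMID);
    assert(f->count < NSLOTS);

    tail = (f->head + f->count) % NSLOTS;
    f->slots[tail] = rmid;
    f->count++;
}
```

A freed RMID goes to the back of the queue, so every other free RMID is handed out before it is reused, which gives stale cachelines the longest possible time to be evicted.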

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: H. Peter Anvin 
Cc: Jiri Olsa 
Cc: Kanaka Juvva 
Cc: Linus Torvalds 
Cc: Vikas Shivappa 
Link: http://lkml.kernel.org/r/1422038748-21397-7-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 100 ++---
 1 file changed, 92 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 05b4cd2..b5d9d74 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -25,7 +25,7 @@ struct intel_cqm_state {
 static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
 
 /*
- * Protects cache_cgroups.
+ * Protects cache_cgroups and cqm_rmid_lru.
  */
 static DEFINE_MUTEX(cache_mutex);
 
@@ -64,36 +64,120 @@ static u64 __rmid_read(unsigned long rmid)
return val;
 }
 
-static unsigned long *cqm_rmid_bitmap;
+struct cqm_rmid_entry {
+   u64 rmid;
+   struct list_head list;
+};
+
+/*
+ * A least recently used list of RMIDs.
+ *
+ * Oldest entry at the head, newest (most recently used) entry at the
+ * tail. This list is never traversed, it's only used to keep track of
+ * the lru order. That is, we only pick entries off the head or insert
+ * them on the tail.
+ *
+ * All entries on the list are 'free', and their RMIDs are not currently
+ * in use. To mark an RMID as in use, remove its entry from the lru
+ * list.
+ *
+ * This list is protected by cache_mutex.
+ */
+static LIST_HEAD(cqm_rmid_lru);
+
+/*
+ * We use a simple array of pointers so that we can lookup a struct
+ * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
+ * and __put_rmid() from having to worry about dealing with struct
+ * cqm_rmid_entry - they just deal with rmids, i.e. integers.
+ *
+ * Once this array is initialized it is read-only. No locks are required
+ * to access it.
+ *
+ * All entries for all RMIDs can be looked up in this array at all
+ * times.
+ */
+static struct cqm_rmid_entry **cqm_rmid_ptrs;
+
+static inline struct cqm_rmid_entry *__rmid_entry(int rmid)
+{
+   struct cqm_rmid_entry *entry;
+
+   entry = cqm_rmid_ptrs[rmid];
+   WARN_ON(entry->rmid != rmid);
+
+   return entry;
+}
 
 /*
  * Returns < 0 on fail.
+ *
+ * We expect to be called with cache_mutex held.
  */
 static int __get_rmid(void)
 {
-   return bitmap_find_free_region(cqm_rmid_bitmap, cqm_max_rmid, 0);
+   struct cqm_rmid_entry *entry;
+
+   lockdep_assert_held(&cache_mutex);
+
+   if (list_empty(&cqm_rmid_lru))
+   return -EAGAIN;
+
+   entry = list_first_entry(&cqm_rmid_lru, struct cqm_rmid_entry, list);
+   list_del(&entry->list);
+
+   return entry->rmid;
 }
 
 static void __put_rmid(int rmid)
 {
-   bitmap_release_region(cqm_rmid_bitmap, rmid, 0);
+   struct cqm_rmid_entry *entry;
+
+   lockdep_assert_held(&cache_mutex);
+
+   entry = __rmid_entry(rmid);
+
+   list_add_tail(&entry->list, &cqm_rmid_lru);
 }
 
 static int intel_cqm_setup_rmid_cache(void)
 {
-   cqm_rmid_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(cqm_max_rmid), GFP_KERNEL);
-   if (!cqm_rmid_bitmap)
+   struct cqm_rmid_entry *entry;
+   int r;
+
+   cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
+   (cqm_max_rmid + 1), GFP_KERNEL);
+   if (!cqm_rmid_ptrs)
return -ENOMEM;
 
-   bitmap_zero(cqm_rmid_bitmap, cqm_max_rmid);
+   for (r = 0; r <= cqm_max_rmid; r++) {
+   struct cqm_rmid_entry *entry;
+
+   entry = kmalloc(sizeof(*entry), GFP_KERNEL);
+   if (!entry)
+   goto fail;
+
+   INIT_LIST_HEAD(&entry->list);
+   entry->rmid = r;
+   cqm_rmid_ptrs[r] = entry;
+
+   list_add_tail(&entry->list, &cqm_rmid_lru);
+   }
 
/*
 * RMID 0 is special and is always allocated. It's used for all
 * tasks that are not monitored.
 */
-   bitmap_allocate_region(cqm_rmid_bitmap, 0, 0);
+   entry =

[tip:perf/x86] perf/x86/intel: Add Intel Cache QoS Monitoring support

2015-02-25 Thread tip-bot for Matt Fleming

Commit-ID:  4afbb24ce5e723c8a093a6674a3c33062175078a
Gitweb: http://git.kernel.org/tip/4afbb24ce5e723c8a093a6674a3c33062175078a
Author: Matt Fleming 
AuthorDate: Fri, 23 Jan 2015 18:45:44 +
Committer:  Ingo Molnar 
CommitDate: Wed, 25 Feb 2015 13:53:32 +0100

perf/x86/intel: Add Intel Cache QoS Monitoring support

Future Intel Xeon processors support a Cache QoS Monitoring feature that
allows tracking of the LLC occupancy for a task or task group, i.e. the
amount of data pulled into the LLC for the task (group).

Currently the PMU only supports per-cpu events. We create an event for
each cpu and read out all the LLC occupancy values.

Because this results in duplicate values being written out to userspace,
we also export a .per-pkg event file so that the perf tools only
accumulate values for one cpu per package.
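
To illustrate why one reading per package is enough, here is a small userspace sketch of the accumulation. It is not the perf tools' code: the function name, the MAX_PKGS bound, and the cpu-to-package array are assumptions made for the example. Each cpu in a package reports the same LLC occupancy, so only the first cpu seen in each package contributes to the total.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PKGS 8  /* hypothetical upper bound for this sketch */

/*
 * Sum per-cpu readings, counting only the first cpu seen in each package.
 * val[i] is the reading from cpu i, pkg[i] is that cpu's package id.
 */
static uint64_t sum_one_per_package(const uint64_t *val, const int *pkg,
                                    size_t ncpus)
{
    int seen[MAX_PKGS] = { 0 };
    uint64_t total = 0;

    for (size_t i = 0; i < ncpus; i++) {
        assert(pkg[i] >= 0 && pkg[i] < MAX_PKGS);
        if (seen[pkg[i]])
            continue;  /* duplicate value from the same package */
        seen[pkg[i]] = 1;
        total += val[i];
    }
    return total;
}
```

For four cpus in two packages reporting {100, 100, 40, 40}, the sketch counts 100 and 40 once each instead of summing all four values.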

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: H. Peter Anvin 
Cc: Jiri Olsa 
Cc: Kanaka Juvva 
Cc: Linus Torvalds 
Cc: Vikas Shivappa 
Link: http://lkml.kernel.org/r/1422038748-21397-6-git-send-email-m...@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 530 +
 include/linux/perf_event.h |   7 +
 2 files changed, 537 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
new file mode 100644
index 000..05b4cd2
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -0,0 +1,530 @@
+/*
+ * Intel Cache Quality-of-Service Monitoring (CQM) support.
+ *
+ * Based very, very heavily on work by Peter Zijlstra.
+ */
+
+#include 
+#include 
+#include 
+#include "perf_event.h"
+
+#define MSR_IA32_PQR_ASSOC 0x0c8f
+#define MSR_IA32_QM_CTR0x0c8e
+#define MSR_IA32_QM_EVTSEL 0x0c8d
+
+static unsigned int cqm_max_rmid = -1;
+static unsigned int cqm_l3_scale; /* supposedly cacheline size */
+
+struct intel_cqm_state {
+   raw_spinlock_t  lock;
+   int rmid;
+   int cnt;
+};
+
+static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
+
+/*
+ * Protects cache_cgroups.
+ */
+static DEFINE_MUTEX(cache_mutex);
+
+/*
+ * Groups of events that have the same target(s), one RMID per group.
+ */
+static LIST_HEAD(cache_groups);
+
+/*
+ * Mask of CPUs for reading CQM values. We only need one per-socket.
+ */
+static cpumask_t cqm_cpumask;
+
+#define RMID_VAL_ERROR (1ULL << 63)
+#define RMID_VAL_UNAVAIL   (1ULL << 62)
+
+#define QOS_L3_OCCUP_EVENT_ID  (1 << 0)
+
+#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
+
+static u64 __rmid_read(unsigned long rmid)
+{
+   u64 val;
+
+   /*
+* Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt,
+* it just says that to increase confusion.
+*/
+   wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
+   rdmsrl(MSR_IA32_QM_CTR, val);
+
+   /*
+* Aside from the ERROR and UNAVAIL bits, assume this thing returns
+* the number of cachelines tagged with @rmid.
+*/
+   return val;
+}
+
+static unsigned long *cqm_rmid_bitmap;
+
+/*
+ * Returns < 0 on fail.
+ */
+static int __get_rmid(void)
+{
+   return bitmap_find_free_region(cqm_rmid_bitmap, cqm_max_rmid, 0);
+}
+
+static void __put_rmid(int rmid)
+{
+   bitmap_release_region(cqm_rmid_bitmap, rmid, 0);
+}
+
+static int intel_cqm_setup_rmid_cache(void)
+{
+   cqm_rmid_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(cqm_max_rmid), GFP_KERNEL);
+   if (!cqm_rmid_bitmap)
+   return -ENOMEM;
+
+   bitmap_zero(cqm_rmid_bitmap, cqm_max_rmid);
+
+   /*
+* RMID 0 is special and is always allocated. It's used for all
+* tasks that are not monitored.
+*/
+   bitmap_allocate_region(cqm_rmid_bitmap, 0, 0);
+
+   return 0;
+}
+
+/*
+ * Determine if @a and @b measure the same set of tasks.
+ */
+static bool __match_event(struct perf_event *a, struct perf_event *b)
+{
+   if ((a->attach_state & PERF_ATTACH_TASK) !=
+   (b->attach_state & PERF_ATTACH_TASK))
+   return false;
+
+   /* not task */
+
+   return true; /* if not task, we're machine wide */
+}
+
+/*
+ * Determine if @a's tasks intersect with @b's tasks
+ */
+static bool __conflict_event(struct perf_event *a, struct perf_event *b)
+{
+   /*
+* If one of them is not a task, same story as above with cgroups.
+*/
+   if (!(a->attach_state & PERF_ATTACH_TASK) ||
+   !(b->attach_state & PERF_ATTACH_TASK))
+   return true;
+
+   /*
+* Must be non-overlapping.
+*/
+   return false;
+}
+
+/*
+ * Find a group and setup RMID.
+ *
+ * If we're part of a group, we use the group's RMID.
+ */
+static int intel_cqm_setup_event(struct perf_event *event,
+struct perf_event
