[tip:efi/core] MAINTAINERS: Remove Matt Fleming as EFI co-maintainer
Commit-ID: 81b60dbff04980a45b348c5b5eeca2713d4594ca Gitweb: https://git.kernel.org/tip/81b60dbff04980a45b348c5b5eeca2713d4594ca Author: Matt FlemingAuthorDate: Wed, 3 Jan 2018 09:44:17 + Committer: Ingo Molnar CommitDate: Wed, 3 Jan 2018 14:03:18 +0100 MAINTAINERS: Remove Matt Fleming as EFI co-maintainer Instate Ard Biesheuvel as the sole EFI maintainer and leave other folks as maintainers for the EFI test driver and efivarfs file system. Also add Ard Biesheuvel as the EFI test driver and efivarfs maintainer. Signed-off-by: Matt Fleming Cc: Ard Biesheuvel Cc: Ivan Hu Cc: Jeremy Kerr Cc: Linus Torvalds Cc: Matthew Garrett Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/20180103094417.6353-1-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- MAINTAINERS | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index b46c9ce..95c3fa1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5149,15 +5149,15 @@ F: sound/usb/misc/ua101.c EFI TEST DRIVER L: linux-...@vger.kernel.org M: Ivan Hu -M: Matt Fleming +M: Ard Biesheuvel S: Maintained F: drivers/firmware/efi/test/ EFI VARIABLE FILESYSTEM M: Matthew Garrett M: Jeremy Kerr -M: Matt Fleming -T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git +M: Ard Biesheuvel +T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git L: linux-...@vger.kernel.org S: Maintained F: fs/efivarfs/ @@ -5318,7 +5318,6 @@ S:Supported F: security/integrity/evm/ EXTENSIBLE FIRMWARE INTERFACE (EFI) -M: Matt Fleming M: Ard Biesheuvel L: linux-...@vger.kernel.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git
[tip:efi/core] MAINTAINERS: Remove Matt Fleming as EFI co-maintainer
Commit-ID: 81b60dbff04980a45b348c5b5eeca2713d4594ca Gitweb: https://git.kernel.org/tip/81b60dbff04980a45b348c5b5eeca2713d4594ca Author: Matt Fleming AuthorDate: Wed, 3 Jan 2018 09:44:17 + Committer: Ingo Molnar CommitDate: Wed, 3 Jan 2018 14:03:18 +0100 MAINTAINERS: Remove Matt Fleming as EFI co-maintainer Instate Ard Biesheuvel as the sole EFI maintainer and leave other folks as maintainers for the EFI test driver and efivarfs file system. Also add Ard Biesheuvel as the EFI test driver and efivarfs maintainer. Signed-off-by: Matt Fleming Cc: Ard Biesheuvel Cc: Ivan Hu Cc: Jeremy Kerr Cc: Linus Torvalds Cc: Matthew Garrett Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/20180103094417.6353-1-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- MAINTAINERS | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index b46c9ce..95c3fa1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5149,15 +5149,15 @@ F: sound/usb/misc/ua101.c EFI TEST DRIVER L: linux-...@vger.kernel.org M: Ivan Hu -M: Matt Fleming +M: Ard Biesheuvel S: Maintained F: drivers/firmware/efi/test/ EFI VARIABLE FILESYSTEM M: Matthew Garrett M: Jeremy Kerr -M: Matt Fleming -T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git +M: Ard Biesheuvel +T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git L: linux-...@vger.kernel.org S: Maintained F: fs/efivarfs/ @@ -5318,7 +5318,6 @@ S:Supported F: security/integrity/evm/ EXTENSIBLE FIRMWARE INTERFACE (EFI) -M: Matt Fleming M: Ard Biesheuvel L: linux-...@vger.kernel.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git
[tip:sched/core] sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
Commit-ID: caeb5882979bc6f3c8766fcf59c6269b38f521bc Gitweb: http://git.kernel.org/tip/caeb5882979bc6f3c8766fcf59c6269b38f521bc Author: Matt FlemingAuthorDate: Fri, 17 Feb 2017 12:07:31 + Committer: Ingo Molnar CommitDate: Thu, 16 Mar 2017 09:21:01 +0100 sched/loadavg: Use {READ,WRITE}_ONCE() for sample window 'calc_load_update' is accessed without any kind of locking and there's a clear assumption in the code that only a single value is read or written. Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid unintentionally seeing multiple values, or having the load/stores split. Technically the loads in calc_global_*() don't require this since those are the only functions that update 'calc_load_update', but I've added the READ_ONCE() for consistency. Suggested-by: Peter Zijlstra Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Frederic Weisbecker Cc: Linus Torvalds Cc: Mike Galbraith Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Thomas Gleixner Cc: Vincent Guittot Link: http://lkml.kernel.org/r/20170217120731.11868-3-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/loadavg.c | 18 +++--- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index 3a55f3f..f15fb2b 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -169,7 +169,7 @@ static inline int calc_load_write_idx(void) * If the folding window started, make sure we start writing in the * next idle-delta. */ - if (!time_before(jiffies, calc_load_update)) + if (!time_before(jiffies, READ_ONCE(calc_load_update))) idx++; return idx & 1; @@ -204,7 +204,7 @@ void calc_load_exit_idle(void) /* * If we're still before the pending sample window, we're done. */ - this_rq->calc_load_update = calc_load_update; + this_rq->calc_load_update = READ_ONCE(calc_load_update); if (time_before(jiffies, this_rq->calc_load_update)) return; @@ -308,13 +308,15 @@ calc_load_n(unsigned long load, unsigned long exp, */ static void calc_global_nohz(void) { + unsigned long sample_window; long delta, active, n; - if (!time_before(jiffies, calc_load_update + 10)) { + sample_window = READ_ONCE(calc_load_update); + if (!time_before(jiffies, sample_window + 10)) { /* * Catch-up, fold however many we are behind still */ - delta = jiffies - calc_load_update - 10; + delta = jiffies - sample_window - 10; n = 1 + (delta / LOAD_FREQ); active = atomic_long_read(_load_tasks); @@ -324,7 +326,7 @@ static void calc_global_nohz(void) avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n); avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n); - calc_load_update += n * LOAD_FREQ; + WRITE_ONCE(calc_load_update, sample_window + n * LOAD_FREQ); } /* @@ -352,9 +354,11 @@ static inline void calc_global_nohz(void) { } */ void calc_global_load(unsigned long ticks) { + unsigned long sample_window; long active, delta; - if (time_before(jiffies, calc_load_update + 10)) + sample_window = READ_ONCE(calc_load_update); + if (time_before(jiffies, sample_window + 10)) return; /* @@ -371,7 +375,7 @@ void calc_global_load(unsigned long ticks) avenrun[1] = calc_load(avenrun[1], EXP_5, active); avenrun[2] = calc_load(avenrun[2], EXP_15, active); - calc_load_update += LOAD_FREQ; + WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ); /* * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk.
[tip:sched/core] sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
Commit-ID: caeb5882979bc6f3c8766fcf59c6269b38f521bc Gitweb: http://git.kernel.org/tip/caeb5882979bc6f3c8766fcf59c6269b38f521bc Author: Matt Fleming AuthorDate: Fri, 17 Feb 2017 12:07:31 + Committer: Ingo Molnar CommitDate: Thu, 16 Mar 2017 09:21:01 +0100 sched/loadavg: Use {READ,WRITE}_ONCE() for sample window 'calc_load_update' is accessed without any kind of locking and there's a clear assumption in the code that only a single value is read or written. Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid unintentionally seeing multiple values, or having the load/stores split. Technically the loads in calc_global_*() don't require this since those are the only functions that update 'calc_load_update', but I've added the READ_ONCE() for consistency. Suggested-by: Peter Zijlstra Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Frederic Weisbecker Cc: Linus Torvalds Cc: Mike Galbraith Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Thomas Gleixner Cc: Vincent Guittot Link: http://lkml.kernel.org/r/20170217120731.11868-3-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/loadavg.c | 18 +++--- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index 3a55f3f..f15fb2b 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -169,7 +169,7 @@ static inline int calc_load_write_idx(void) * If the folding window started, make sure we start writing in the * next idle-delta. */ - if (!time_before(jiffies, calc_load_update)) + if (!time_before(jiffies, READ_ONCE(calc_load_update))) idx++; return idx & 1; @@ -204,7 +204,7 @@ void calc_load_exit_idle(void) /* * If we're still before the pending sample window, we're done. */ - this_rq->calc_load_update = calc_load_update; + this_rq->calc_load_update = READ_ONCE(calc_load_update); if (time_before(jiffies, this_rq->calc_load_update)) return; @@ -308,13 +308,15 @@ calc_load_n(unsigned long load, unsigned long exp, */ static void calc_global_nohz(void) { + unsigned long sample_window; long delta, active, n; - if (!time_before(jiffies, calc_load_update + 10)) { + sample_window = READ_ONCE(calc_load_update); + if (!time_before(jiffies, sample_window + 10)) { /* * Catch-up, fold however many we are behind still */ - delta = jiffies - calc_load_update - 10; + delta = jiffies - sample_window - 10; n = 1 + (delta / LOAD_FREQ); active = atomic_long_read(_load_tasks); @@ -324,7 +326,7 @@ static void calc_global_nohz(void) avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n); avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n); - calc_load_update += n * LOAD_FREQ; + WRITE_ONCE(calc_load_update, sample_window + n * LOAD_FREQ); } /* @@ -352,9 +354,11 @@ static inline void calc_global_nohz(void) { } */ void calc_global_load(unsigned long ticks) { + unsigned long sample_window; long active, delta; - if (time_before(jiffies, calc_load_update + 10)) + sample_window = READ_ONCE(calc_load_update); + if (time_before(jiffies, sample_window + 10)) return; /* @@ -371,7 +375,7 @@ void calc_global_load(unsigned long ticks) avenrun[1] = calc_load(avenrun[1], EXP_5, active); avenrun[2] = calc_load(avenrun[2], EXP_15, active); - calc_load_update += LOAD_FREQ; + WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ); /* * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk.
[tip:sched/core] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
Commit-ID: 6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 Gitweb: http://git.kernel.org/tip/6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 Author: Matt FlemingAuthorDate: Fri, 17 Feb 2017 12:07:30 + Committer: Ingo Molnar CommitDate: Thu, 16 Mar 2017 09:21:00 +0100 sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to the pending sample window time on exit, setting the next update not one window into the future, but two. This situation on exiting NO_HZ is described by: this_rq->calc_load_update < jiffies < calc_load_update In this scenario, what we should be doing is: this_rq->calc_load_update = calc_load_update [ next window ] But what we actually do is: this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ] This has the effect of delaying load average updates for potentially up to ~9seconds. This can result in huge spikes in the load average values due to per-cpu uninterruptible task counts being out of sync when accumulated across all CPUs. It's safe to update the per-cpu active count if we wake between sample windows because any load that we left in 'calc_load_idle' will have been zero'd when the idle load was folded in calc_global_load(). This issue is easy to reproduce before, commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") just by forking short-lived process pipelines built from ps(1) and grep(1) in a loop. I'm unable to reproduce the spikes after that commit, but the bug still seems to be present from code review. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Frederic Weisbecker Cc: Linus Torvalds Cc: Mike Galbraith Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vincent Guittot Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again") Link: http://lkml.kernel.org/r/20170217120731.11868-2-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/loadavg.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index 7296b73..3a55f3f 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -202,8 +202,9 @@ void calc_load_exit_idle(void) struct rq *this_rq = this_rq(); /* -* If we're still before the sample window, we're done. +* If we're still before the pending sample window, we're done. */ + this_rq->calc_load_update = calc_load_update; if (time_before(jiffies, this_rq->calc_load_update)) return; @@ -212,7 +213,6 @@ void calc_load_exit_idle(void) * accounted through the nohz accounting, so skip the entire deal and * sync up for the next window. */ - this_rq->calc_load_update = calc_load_update; if (time_before(jiffies, this_rq->calc_load_update + 10)) this_rq->calc_load_update += LOAD_FREQ; }
[tip:sched/core] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
Commit-ID: 6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 Gitweb: http://git.kernel.org/tip/6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 Author: Matt Fleming AuthorDate: Fri, 17 Feb 2017 12:07:30 + Committer: Ingo Molnar CommitDate: Thu, 16 Mar 2017 09:21:00 +0100 sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to the pending sample window time on exit, setting the next update not one window into the future, but two. This situation on exiting NO_HZ is described by: this_rq->calc_load_update < jiffies < calc_load_update In this scenario, what we should be doing is: this_rq->calc_load_update = calc_load_update [ next window ] But what we actually do is: this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ] This has the effect of delaying load average updates for potentially up to ~9seconds. This can result in huge spikes in the load average values due to per-cpu uninterruptible task counts being out of sync when accumulated across all CPUs. It's safe to update the per-cpu active count if we wake between sample windows because any load that we left in 'calc_load_idle' will have been zero'd when the idle load was folded in calc_global_load(). This issue is easy to reproduce before, commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") just by forking short-lived process pipelines built from ps(1) and grep(1) in a loop. I'm unable to reproduce the spikes after that commit, but the bug still seems to be present from code review. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Frederic Weisbecker Cc: Linus Torvalds Cc: Mike Galbraith Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vincent Guittot Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again") Link: http://lkml.kernel.org/r/20170217120731.11868-2-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/loadavg.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index 7296b73..3a55f3f 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -202,8 +202,9 @@ void calc_load_exit_idle(void) struct rq *this_rq = this_rq(); /* -* If we're still before the sample window, we're done. +* If we're still before the pending sample window, we're done. */ + this_rq->calc_load_update = calc_load_update; if (time_before(jiffies, this_rq->calc_load_update)) return; @@ -212,7 +213,6 @@ void calc_load_exit_idle(void) * accounted through the nohz accounting, so skip the entire deal and * sync up for the next window. */ - this_rq->calc_load_update = calc_load_update; if (time_before(jiffies, this_rq->calc_load_update + 10)) this_rq->calc_load_update += LOAD_FREQ; }
[tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls
Commit-ID: cb42c9a3e23448c3f9a25417fae6309b1a92 Gitweb: http://git.kernel.org/tip/cb42c9a3e23448c3f9a25417fae6309b1a92 Author: Matt FlemingAuthorDate: Wed, 21 Sep 2016 14:38:13 +0100 Committer: Ingo Molnar CommitDate: Sat, 14 Jan 2017 11:29:35 +0100 sched/core: Add debugging code to catch missing update_rq_clock() calls There's no diagnostic checks for figuring out when we've accidentally missed update_rq_clock() calls. Let's add some by piggybacking on the rq_*pin_lock() wrappers. The idea behind the diagnostic checks is that upon pining rq lock the rq clock should be updated, via update_rq_clock(), before anybody reads the clock with rq_clock() or rq_clock_task(). The exception to this rule is when updates have explicitly been disabled with the rq_clock_skip_update() optimisation. There are some functions that only unpin the rq lock in order to grab some other lock and avoid deadlock. In that case we don't need to update the clock again and the previous diagnostic state can be carried over in rq_repin_lock() by saving the state in the rq_flags context. Since this patch adds a new clock update flag and some already exist in rq::clock_skip_update, that field has now been renamed. An attempt has been made to keep the flag manipulation code small and fast since it's used in the heart of the __schedule() fast path. For the !CONFIG_SCHED_DEBUG case the only object code change (other than addresses) is the following change to reset RQCF_ACT_SKIP inside of __schedule(), - c7 83 38 09 00 00 00movl $0x0,0x938(%rbx) - 00 00 00 + 83 a3 38 09 00 00 fcandl $0xfffc,0x938(%rbx) Suggested-by: Peter Zijlstra Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Byungchul Park Cc: Frederic Weisbecker Cc: Jan Kara Cc: Linus Torvalds Cc: Luca Abeni Cc: Mel Gorman Cc: Mike Galbraith Cc: Mike Galbraith Cc: Petr Mladek Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Thomas Gleixner Cc: Wanpeng Li Cc: Yuyang Du Link: http://lkml.kernel.org/r/20160921133813.31976-8-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 11 +--- kernel/sched/sched.h | 74 +++- 2 files changed, 75 insertions(+), 10 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d233892..a129b34 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -102,9 +102,12 @@ void update_rq_clock(struct rq *rq) lockdep_assert_held(>lock); - if (rq->clock_skip_update & RQCF_ACT_SKIP) + if (rq->clock_update_flags & RQCF_ACT_SKIP) return; +#ifdef CONFIG_SCHED_DEBUG + rq->clock_update_flags |= RQCF_UPDATED; +#endif delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; if (delta < 0) return; @@ -2889,7 +2892,7 @@ context_switch(struct rq *rq, struct task_struct *prev, rq->prev_mm = oldmm; } - rq->clock_skip_update = 0; + rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); /* * Since the runqueue lock will be released by the next @@ -3364,7 +3367,7 @@ static void __sched notrace __schedule(bool preempt) raw_spin_lock(>lock); rq_pin_lock(rq, ); - rq->clock_skip_update <<= 1; /* promote REQ to ACT */ + rq->clock_update_flags <<= 1; /* promote REQ to ACT */ switch_count = >nivcsw; if (!preempt && prev->state) { @@ -3405,7 +3408,7 @@ static void __sched notrace __schedule(bool preempt) trace_sched_switch(preempt, prev, next); rq = context_switch(rq, prev, next, ); /* unlocks the rq */ } else { - rq->clock_skip_update = 0; + rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); rq_unpin_lock(rq, ); raw_spin_unlock_irq(>lock); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 98e7eee..6eeae7e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -644,7 +644,7 @@ struct rq { unsigned long next_balance; struct mm_struct *prev_mm; - unsigned int clock_skip_update; + unsigned int clock_update_flags; u64 clock; u64 clock_task; @@ -768,48 +768,110 @@ static inline u64 __rq_clock_broken(struct rq *rq) return READ_ONCE(rq->clock); } +/* + * rq::clock_update_flags bits + * + * %RQCF_REQ_SKIP - will request skipping of clock update on the next + * call to __schedule(). This
[tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls
Commit-ID: cb42c9a3e23448c3f9a25417fae6309b1a92 Gitweb: http://git.kernel.org/tip/cb42c9a3e23448c3f9a25417fae6309b1a92 Author: Matt Fleming AuthorDate: Wed, 21 Sep 2016 14:38:13 +0100 Committer: Ingo Molnar CommitDate: Sat, 14 Jan 2017 11:29:35 +0100 sched/core: Add debugging code to catch missing update_rq_clock() calls There's no diagnostic checks for figuring out when we've accidentally missed update_rq_clock() calls. Let's add some by piggybacking on the rq_*pin_lock() wrappers. The idea behind the diagnostic checks is that upon pining rq lock the rq clock should be updated, via update_rq_clock(), before anybody reads the clock with rq_clock() or rq_clock_task(). The exception to this rule is when updates have explicitly been disabled with the rq_clock_skip_update() optimisation. There are some functions that only unpin the rq lock in order to grab some other lock and avoid deadlock. In that case we don't need to update the clock again and the previous diagnostic state can be carried over in rq_repin_lock() by saving the state in the rq_flags context. Since this patch adds a new clock update flag and some already exist in rq::clock_skip_update, that field has now been renamed. An attempt has been made to keep the flag manipulation code small and fast since it's used in the heart of the __schedule() fast path. For the !CONFIG_SCHED_DEBUG case the only object code change (other than addresses) is the following change to reset RQCF_ACT_SKIP inside of __schedule(), - c7 83 38 09 00 00 00movl $0x0,0x938(%rbx) - 00 00 00 + 83 a3 38 09 00 00 fcandl $0xfffc,0x938(%rbx) Suggested-by: Peter Zijlstra Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Byungchul Park Cc: Frederic Weisbecker Cc: Jan Kara Cc: Linus Torvalds Cc: Luca Abeni Cc: Mel Gorman Cc: Mike Galbraith Cc: Mike Galbraith Cc: Petr Mladek Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Thomas Gleixner Cc: Wanpeng Li Cc: Yuyang Du Link: http://lkml.kernel.org/r/20160921133813.31976-8-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 11 +--- kernel/sched/sched.h | 74 +++- 2 files changed, 75 insertions(+), 10 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d233892..a129b34 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -102,9 +102,12 @@ void update_rq_clock(struct rq *rq) lockdep_assert_held(>lock); - if (rq->clock_skip_update & RQCF_ACT_SKIP) + if (rq->clock_update_flags & RQCF_ACT_SKIP) return; +#ifdef CONFIG_SCHED_DEBUG + rq->clock_update_flags |= RQCF_UPDATED; +#endif delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; if (delta < 0) return; @@ -2889,7 +2892,7 @@ context_switch(struct rq *rq, struct task_struct *prev, rq->prev_mm = oldmm; } - rq->clock_skip_update = 0; + rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); /* * Since the runqueue lock will be released by the next @@ -3364,7 +3367,7 @@ static void __sched notrace __schedule(bool preempt) raw_spin_lock(>lock); rq_pin_lock(rq, ); - rq->clock_skip_update <<= 1; /* promote REQ to ACT */ + rq->clock_update_flags <<= 1; /* promote REQ to ACT */ switch_count = >nivcsw; if (!preempt && prev->state) { @@ -3405,7 +3408,7 @@ static void __sched notrace __schedule(bool preempt) trace_sched_switch(preempt, prev, next); rq = context_switch(rq, prev, next, ); /* unlocks the rq */ } else { - rq->clock_skip_update = 0; + rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); rq_unpin_lock(rq, ); raw_spin_unlock_irq(>lock); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 98e7eee..6eeae7e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -644,7 +644,7 @@ struct rq { unsigned long next_balance; struct mm_struct *prev_mm; - unsigned int clock_skip_update; + unsigned int clock_update_flags; u64 clock; u64 clock_task; @@ -768,48 +768,110 @@ static inline u64 __rq_clock_broken(struct rq *rq) return READ_ONCE(rq->clock); } +/* + * rq::clock_update_flags bits + * + * %RQCF_REQ_SKIP - will request skipping of clock update on the next + * call to __schedule(). This is an optimisation to avoid + * neighbouring rq clock updates. + * + * %RQCF_ACT_SKIP - is set from inside of __schedule() when skipping is + * in effect and calls to update_rq_clock() are being ignored. + * + * %RQCF_UPDATED - is a debug flag that indicates whether a call has been + * made to update_rq_clock() since the last time rq::lock was pinned. + * + * If inside of __schedule(), clock_update_flags will have been + * shifted left (a
[tip:sched/core] sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock
Commit-ID: 92509b732baf14c59ca702307270cfaa3a585ae7 Gitweb: http://git.kernel.org/tip/92509b732baf14c59ca702307270cfaa3a585ae7 Author: Matt FlemingAuthorDate: Wed, 21 Sep 2016 14:38:11 +0100 Committer: Ingo Molnar CommitDate: Sat, 14 Jan 2017 11:29:31 +0100 sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock rq_clock() is called from sched_info_{depart,arrive}() after resetting RQCF_ACT_SKIP but prior to a call to update_rq_clock(). In preparation for pending patches that check whether the rq clock has been updated inside of a pin context before rq_clock() is called, move the reset of rq->clock_skip_update immediately before unpinning the rq lock. This will avoid the new warnings which check if update_rq_clock() is being actively skipped. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Byungchul Park Cc: Frederic Weisbecker Cc: Jan Kara Cc: Linus Torvalds Cc: Luca Abeni Cc: Mel Gorman Cc: Mike Galbraith Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Petr Mladek Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Thomas Gleixner Cc: Wanpeng Li Cc: Yuyang Du Link: http://lkml.kernel.org/r/20160921133813.31976-6-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 41df935..311460b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2887,6 +2887,9 @@ context_switch(struct rq *rq, struct task_struct *prev, prev->active_mm = NULL; rq->prev_mm = oldmm; } + + rq->clock_skip_update = 0; + /* * Since the runqueue lock will be released by the next * task (which is an invalid locking op but in the case @@ -3392,7 +3395,6 @@ static void __sched notrace __schedule(bool preempt) next = pick_next_task(rq, prev, ); clear_tsk_need_resched(prev); clear_preempt_need_resched(); - rq->clock_skip_update = 0; if (likely(prev != next)) { rq->nr_switches++; @@ -3402,6 +3404,7 @@ static void __sched notrace __schedule(bool preempt) trace_sched_switch(preempt, prev, next); rq = context_switch(rq, prev, next, ); /* unlocks the rq */ } else { + rq->clock_skip_update = 0; rq_unpin_lock(rq, ); raw_spin_unlock_irq(>lock); }
[tip:sched/core] sched/core: Add wrappers for lockdep_(un)pin_lock()
Commit-ID: d8ac897137a230ec351269f6378017f2decca512 Gitweb: http://git.kernel.org/tip/d8ac897137a230ec351269f6378017f2decca512 Author: Matt FlemingAuthorDate: Wed, 21 Sep 2016 14:38:10 +0100 Committer: Ingo Molnar CommitDate: Sat, 14 Jan 2017 11:29:30 +0100 sched/core: Add wrappers for lockdep_(un)pin_lock() In preparation for adding diagnostic checks to catch missing calls to update_rq_clock(), provide wrappers for (re)pinning and unpinning rq->lock. Because the pending diagnostic checks allow state to be maintained in rq_flags across pin contexts, swap the 'struct pin_cookie' arguments for 'struct rq_flags *'. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Byungchul Park Cc: Frederic Weisbecker Cc: Jan Kara Cc: Linus Torvalds Cc: Luca Abeni Cc: Mel Gorman Cc: Mike Galbraith Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Petr Mladek Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Thomas Gleixner Cc: Wanpeng Li Cc: Yuyang Du Link: http://lkml.kernel.org/r/20160921133813.31976-5-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 80 kernel/sched/deadline.c | 10 +++--- kernel/sched/fair.c | 6 ++-- kernel/sched/idle_task.c | 2 +- kernel/sched/rt.c| 6 ++-- kernel/sched/sched.h | 31 ++- kernel/sched/stop_task.c | 2 +- 7 files changed, 76 insertions(+), 61 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c56fb57..41df935 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -185,7 +185,7 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf) rq = task_rq(p); raw_spin_lock(>lock); if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) { - rf->cookie = lockdep_pin_lock(>lock); + rq_pin_lock(rq, rf); return rq; } raw_spin_unlock(>lock); @@ -225,7 +225,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) * pair with the WMB to ensure we must then also see migrating. */ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) { - rf->cookie = lockdep_pin_lock(>lock); + rq_pin_lock(rq, rf); return rq; } raw_spin_unlock(>lock); @@ -1195,9 +1195,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p, * OK, since we're going to drop the lock immediately * afterwards anyway. */ - lockdep_unpin_lock(>lock, rf.cookie); + rq_unpin_lock(rq, ); rq = move_queued_task(rq, p, dest_cpu); - lockdep_repin_lock(>lock, rf.cookie); + rq_repin_lock(rq, ); } out: task_rq_unlock(rq, p, ); @@ -1690,7 +1690,7 @@ static inline void ttwu_activate(struct rq *rq, struct task_struct *p, int en_fl * Mark the task runnable and perform wakeup-preemption. */ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags, - struct pin_cookie cookie) + struct rq_flags *rf) { check_preempt_curr(rq, p, wake_flags); p->state = TASK_RUNNING; @@ -1702,9 +1702,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags, * Our task @p is fully woken up and running; so its safe to * drop the rq->lock, hereafter rq is only used for statistics. */ - lockdep_unpin_lock(>lock, cookie); + rq_unpin_lock(rq, rf); p->sched_class->task_woken(rq, p); - lockdep_repin_lock(>lock, cookie); + rq_repin_lock(rq, rf); } if (rq->idle_stamp) { @@ -1723,7 +1723,7 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags, static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, -struct pin_cookie cookie) +struct rq_flags *rf) { int en_flags = ENQUEUE_WAKEUP; @@ -1738,7 +1738,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, #endif ttwu_activate(rq, p, en_flags); - ttwu_do_wakeup(rq, p, wake_flags, cookie); + ttwu_do_wakeup(rq, p, wake_flags, rf); } /* @@
[tip:sched/core] sched/core: Add wrappers for lockdep_(un)pin_lock()
Commit-ID: d8ac897137a230ec351269f6378017f2decca512 Gitweb: http://git.kernel.org/tip/d8ac897137a230ec351269f6378017f2decca512 Author: Matt Fleming AuthorDate: Wed, 21 Sep 2016 14:38:10 +0100 Committer: Ingo Molnar CommitDate: Sat, 14 Jan 2017 11:29:30 +0100 sched/core: Add wrappers for lockdep_(un)pin_lock() In preparation for adding diagnostic checks to catch missing calls to update_rq_clock(), provide wrappers for (re)pinning and unpinning rq->lock. Because the pending diagnostic checks allow state to be maintained in rq_flags across pin contexts, swap the 'struct pin_cookie' arguments for 'struct rq_flags *'. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Byungchul Park Cc: Frederic Weisbecker Cc: Jan Kara Cc: Linus Torvalds Cc: Luca Abeni Cc: Mel Gorman Cc: Mike Galbraith Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Petr Mladek Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Thomas Gleixner Cc: Wanpeng Li Cc: Yuyang Du Link: http://lkml.kernel.org/r/20160921133813.31976-5-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 80 kernel/sched/deadline.c | 10 +++--- kernel/sched/fair.c | 6 ++-- kernel/sched/idle_task.c | 2 +- kernel/sched/rt.c| 6 ++-- kernel/sched/sched.h | 31 ++- kernel/sched/stop_task.c | 2 +- 7 files changed, 76 insertions(+), 61 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c56fb57..41df935 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -185,7 +185,7 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf) rq = task_rq(p); raw_spin_lock(>lock); if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) { - rf->cookie = lockdep_pin_lock(>lock); + rq_pin_lock(rq, rf); return rq; } raw_spin_unlock(>lock); @@ -225,7 +225,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) * pair with the WMB to ensure we must then also see migrating. */ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) { - rf->cookie = lockdep_pin_lock(>lock); + rq_pin_lock(rq, rf); return rq; } raw_spin_unlock(>lock); @@ -1195,9 +1195,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p, * OK, since we're going to drop the lock immediately * afterwards anyway. */ - lockdep_unpin_lock(>lock, rf.cookie); + rq_unpin_lock(rq, ); rq = move_queued_task(rq, p, dest_cpu); - lockdep_repin_lock(>lock, rf.cookie); + rq_repin_lock(rq, ); } out: task_rq_unlock(rq, p, ); @@ -1690,7 +1690,7 @@ static inline void ttwu_activate(struct rq *rq, struct task_struct *p, int en_fl * Mark the task runnable and perform wakeup-preemption. */ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags, - struct pin_cookie cookie) + struct rq_flags *rf) { check_preempt_curr(rq, p, wake_flags); p->state = TASK_RUNNING; @@ -1702,9 +1702,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags, * Our task @p is fully woken up and running; so its safe to * drop the rq->lock, hereafter rq is only used for statistics. */ - lockdep_unpin_lock(>lock, cookie); + rq_unpin_lock(rq, rf); p->sched_class->task_woken(rq, p); - lockdep_repin_lock(>lock, cookie); + rq_repin_lock(rq, rf); } if (rq->idle_stamp) { @@ -1723,7 +1723,7 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags, static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, -struct pin_cookie cookie) +struct rq_flags *rf) { int en_flags = ENQUEUE_WAKEUP; @@ -1738,7 +1738,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, #endif ttwu_activate(rq, p, en_flags); - ttwu_do_wakeup(rq, p, wake_flags, cookie); + ttwu_do_wakeup(rq, p, wake_flags, rf); } /* @@ -1757,7 +1757,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags) if (task_on_rq_queued(p)) { /* check_preempt_curr() may use rq clock */ update_rq_clock(rq); - ttwu_do_wakeup(rq, p, wake_flags, rf.cookie); + ttwu_do_wakeup(rq, p, wake_flags, ); ret = 1; } __task_rq_unlock(rq, ); @@ -1770,15 +1770,15 @@ void
[tip:sched/core] sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock
Commit-ID: 92509b732baf14c59ca702307270cfaa3a585ae7 Gitweb: http://git.kernel.org/tip/92509b732baf14c59ca702307270cfaa3a585ae7 Author: Matt Fleming AuthorDate: Wed, 21 Sep 2016 14:38:11 +0100 Committer: Ingo Molnar CommitDate: Sat, 14 Jan 2017 11:29:31 +0100 sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock rq_clock() is called from sched_info_{depart,arrive}() after resetting RQCF_ACT_SKIP but prior to a call to update_rq_clock(). In preparation for pending patches that check whether the rq clock has been updated inside of a pin context before rq_clock() is called, move the reset of rq->clock_skip_update immediately before unpinning the rq lock. This will avoid the new warnings which check if update_rq_clock() is being actively skipped. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Byungchul Park Cc: Frederic Weisbecker Cc: Jan Kara Cc: Linus Torvalds Cc: Luca Abeni Cc: Mel Gorman Cc: Mike Galbraith Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Petr Mladek Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Thomas Gleixner Cc: Wanpeng Li Cc: Yuyang Du Link: http://lkml.kernel.org/r/20160921133813.31976-6-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 41df935..311460b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2887,6 +2887,9 @@ context_switch(struct rq *rq, struct task_struct *prev, prev->active_mm = NULL; rq->prev_mm = oldmm; } + + rq->clock_skip_update = 0; + /* * Since the runqueue lock will be released by the next * task (which is an invalid locking op but in the case @@ -3392,7 +3395,6 @@ static void __sched notrace __schedule(bool preempt) next = pick_next_task(rq, prev, ); clear_tsk_need_resched(prev); clear_preempt_need_resched(); - rq->clock_skip_update = 0; if (likely(prev != next)) { rq->nr_switches++; @@ -3402,6 +3404,7 @@ static void __sched notrace __schedule(bool preempt) trace_sched_switch(preempt, prev, next); rq = context_switch(rq, prev, next, ); /* unlocks the rq */ } else { + rq->clock_skip_update = 0; rq_unpin_lock(rq, ); raw_spin_unlock_irq(>lock); }
[tip:sched/core] sched/fair: Push rq lock pin/unpin into idle_balance()
Commit-ID: 46f69fa33712ad12ccaa723e46ed5929ee93589b Gitweb: http://git.kernel.org/tip/46f69fa33712ad12ccaa723e46ed5929ee93589b Author: Matt FlemingAuthorDate: Wed, 21 Sep 2016 14:38:12 +0100 Committer: Ingo Molnar CommitDate: Sat, 14 Jan 2017 11:29:32 +0100 sched/fair: Push rq lock pin/unpin into idle_balance() Future patches will emit warnings if rq_clock() is called before update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair. Since there is only one caller of idle_balance() we can push the unpin/repin there. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Byungchul Park Cc: Frederic Weisbecker Cc: Jan Kara Cc: Linus Torvalds Cc: Luca Abeni Cc: Mel Gorman Cc: Mike Galbraith Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Petr Mladek Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Thomas Gleixner Cc: Wanpeng Li Cc: Yuyang Du Link: http://lkml.kernel.org/r/20160921133813.31976-7-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 27 +++ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4904412..faf80e1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3424,7 +3424,7 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq) return cfs_rq->avg.load_avg; } -static int idle_balance(struct rq *this_rq); +static int idle_balance(struct rq *this_rq, struct rq_flags *rf); #else /* CONFIG_SMP */ @@ -3453,7 +3453,7 @@ attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} static inline void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} -static inline int idle_balance(struct rq *rq) +static inline int idle_balance(struct rq *rq, struct rq_flags *rf) { return 0; } @@ -6320,15 +6320,8 @@ simple: return p; idle: - /* -* This is OK, because current is on_cpu, which avoids it being picked -* for load-balance and preemption/IRQs are still disabled avoiding -* further scheduler activity on it and we're being very careful to -* re-start the picking loop. -*/ - rq_unpin_lock(rq, rf); - new_tasks = idle_balance(rq); - rq_repin_lock(rq, rf); + new_tasks = idle_balance(rq, rf); + /* * Because idle_balance() releases (and re-acquires) rq->lock, it is * possible for any higher priority task to appear. In that case we @@ -8297,7 +8290,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance) * idle_balance is called by schedule() if this_cpu is about to become * idle. Attempts to pull tasks from other CPUs. */ -static int idle_balance(struct rq *this_rq) +static int idle_balance(struct rq *this_rq, struct rq_flags *rf) { unsigned long next_balance = jiffies + HZ; int this_cpu = this_rq->cpu; @@ -8311,6 +8304,14 @@ static int idle_balance(struct rq *this_rq) */ this_rq->idle_stamp = rq_clock(this_rq); + /* +* This is OK, because current is on_cpu, which avoids it being picked +* for load-balance and preemption/IRQs are still disabled avoiding +* further scheduler activity on it and we're being very careful to +* re-start the picking loop. +*/ + rq_unpin_lock(this_rq, rf); + if (this_rq->avg_idle < sysctl_sched_migration_cost || !this_rq->rd->overload) { rcu_read_lock(); @@ -8388,6 +8389,8 @@ out: if (pulled_task) this_rq->idle_stamp = 0; + rq_repin_lock(this_rq, rf); + return pulled_task; }
[tip:sched/core] sched/fair: Push rq lock pin/unpin into idle_balance()
Commit-ID: 46f69fa33712ad12ccaa723e46ed5929ee93589b Gitweb: http://git.kernel.org/tip/46f69fa33712ad12ccaa723e46ed5929ee93589b Author: Matt Fleming AuthorDate: Wed, 21 Sep 2016 14:38:12 +0100 Committer: Ingo Molnar CommitDate: Sat, 14 Jan 2017 11:29:32 +0100 sched/fair: Push rq lock pin/unpin into idle_balance() Future patches will emit warnings if rq_clock() is called before update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair. Since there is only one caller of idle_balance() we can push the unpin/repin there. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Byungchul Park Cc: Frederic Weisbecker Cc: Jan Kara Cc: Linus Torvalds Cc: Luca Abeni Cc: Mel Gorman Cc: Mike Galbraith Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Petr Mladek Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Thomas Gleixner Cc: Wanpeng Li Cc: Yuyang Du Link: http://lkml.kernel.org/r/20160921133813.31976-7-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 27 +++ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4904412..faf80e1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3424,7 +3424,7 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq) return cfs_rq->avg.load_avg; } -static int idle_balance(struct rq *this_rq); +static int idle_balance(struct rq *this_rq, struct rq_flags *rf); #else /* CONFIG_SMP */ @@ -3453,7 +3453,7 @@ attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} static inline void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} -static inline int idle_balance(struct rq *rq) +static inline int idle_balance(struct rq *rq, struct rq_flags *rf) { return 0; } @@ -6320,15 +6320,8 @@ simple: return p; idle: - /* -* This is OK, because current is on_cpu, which avoids it being picked -* for load-balance and preemption/IRQs are still disabled avoiding -* further scheduler activity on it and we're being very careful to -* re-start the picking loop. -*/ - rq_unpin_lock(rq, rf); - new_tasks = idle_balance(rq); - rq_repin_lock(rq, rf); + new_tasks = idle_balance(rq, rf); + /* * Because idle_balance() releases (and re-acquires) rq->lock, it is * possible for any higher priority task to appear. In that case we @@ -8297,7 +8290,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance) * idle_balance is called by schedule() if this_cpu is about to become * idle. Attempts to pull tasks from other CPUs. */ -static int idle_balance(struct rq *this_rq) +static int idle_balance(struct rq *this_rq, struct rq_flags *rf) { unsigned long next_balance = jiffies + HZ; int this_cpu = this_rq->cpu; @@ -8311,6 +8304,14 @@ static int idle_balance(struct rq *this_rq) */ this_rq->idle_stamp = rq_clock(this_rq); + /* +* This is OK, because current is on_cpu, which avoids it being picked +* for load-balance and preemption/IRQs are still disabled avoiding +* further scheduler activity on it and we're being very careful to +* re-start the picking loop. +*/ + rq_unpin_lock(this_rq, rf); + if (this_rq->avg_idle < sysctl_sched_migration_cost || !this_rq->rd->overload) { rcu_read_lock(); @@ -8388,6 +8389,8 @@ out: if (pulled_task) this_rq->idle_stamp = 0; + rq_repin_lock(this_rq, rf); + return pulled_task; }
[tip:efi/urgent] x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y
Commit-ID: f6697df36bdf0bf7fce984605c2918d4a7b4269f Gitweb: http://git.kernel.org/tip/f6697df36bdf0bf7fce984605c2918d4a7b4269f Author: Matt FlemingAuthorDate: Sat, 12 Nov 2016 21:04:24 + Committer: Ingo Molnar CommitDate: Sun, 13 Nov 2016 08:26:40 +0100 x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y Booting an EFI mixed mode kernel has been crashing since commit: e37e43a497d5 ("x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)") The user-visible effect in my test setup was the kernel being unable to find the root file system ramdisk. This was likely caused by silent memory or page table corruption. Enabling CONFIG_DEBUG_VIRTUAL=y immediately flagged the thunking code as abusing virt_to_phys() because it was passing addresses that were not part of the kernel direct mapping. Use the slow version instead, which correctly handles all memory regions by performing a page table walk. Suggested-by: Andy Lutomirski Signed-off-by: Matt Fleming Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/20161112210424.5157-3-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi_64.c | 80 ++ 1 file changed, 57 insertions(+), 23 deletions(-) diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index 58b0f80..319148b 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -211,6 +212,35 @@ void efi_sync_low_kernel_mappings(void) memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries); } +/* + * Wrapper for slow_virt_to_phys() that handles NULL addresses. + */ +static inline phys_addr_t +virt_to_phys_or_null_size(void *va, unsigned long size) +{ + bool bad_size; + + if (!va) + return 0; + + if (virt_addr_valid(va)) + return virt_to_phys(va); + + /* +* A fully aligned variable on the stack is guaranteed not to +* cross a page bounary. Try to catch strings on the stack by +* checking that 'size' is a power of two. +*/ + bad_size = size > PAGE_SIZE || !is_power_of_2(size); + + WARN_ON(!IS_ALIGNED((unsigned long)va, size) || bad_size); + + return slow_virt_to_phys(va); +} + +#define virt_to_phys_or_null(addr) \ + virt_to_phys_or_null_size((addr), sizeof(*(addr))) + int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) { unsigned long pfn, text; @@ -494,8 +524,8 @@ static efi_status_t efi_thunk_get_time(efi_time_t *tm, efi_time_cap_t *tc) spin_lock(_lock); - phys_tm = virt_to_phys(tm); - phys_tc = virt_to_phys(tc); + phys_tm = virt_to_phys_or_null(tm); + phys_tc = virt_to_phys_or_null(tc); status = efi_thunk(get_time, phys_tm, phys_tc); @@ -511,7 +541,7 @@ static efi_status_t efi_thunk_set_time(efi_time_t *tm) spin_lock(_lock); - phys_tm = virt_to_phys(tm); + phys_tm = virt_to_phys_or_null(tm); status = efi_thunk(set_time, phys_tm); @@ -529,9 +559,9 @@ efi_thunk_get_wakeup_time(efi_bool_t *enabled, efi_bool_t *pending, spin_lock(_lock); - phys_enabled = virt_to_phys(enabled); - phys_pending = virt_to_phys(pending); - phys_tm = virt_to_phys(tm); + phys_enabled = virt_to_phys_or_null(enabled); + phys_pending = virt_to_phys_or_null(pending); + phys_tm = virt_to_phys_or_null(tm); status = efi_thunk(get_wakeup_time, phys_enabled, phys_pending, phys_tm); @@ -549,7 +579,7 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm) spin_lock(_lock); - phys_tm = virt_to_phys(tm); + phys_tm = virt_to_phys_or_null(tm); status = efi_thunk(set_wakeup_time, enabled, phys_tm); @@ -558,6 +588,10 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm) return status; } +static unsigned long efi_name_size(efi_char16_t *name) +{ + return ucs2_strsize(name, EFI_VAR_NAME_LEN) + 1; +} static efi_status_t efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor, @@ -567,11 +601,11 @@ efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor, u32 phys_name, phys_vendor, phys_attr; u32 phys_data_size, phys_data; -
[tip:efi/urgent] x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y
Commit-ID: f6697df36bdf0bf7fce984605c2918d4a7b4269f Gitweb: http://git.kernel.org/tip/f6697df36bdf0bf7fce984605c2918d4a7b4269f Author: Matt Fleming AuthorDate: Sat, 12 Nov 2016 21:04:24 + Committer: Ingo Molnar CommitDate: Sun, 13 Nov 2016 08:26:40 +0100 x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y Booting an EFI mixed mode kernel has been crashing since commit: e37e43a497d5 ("x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)") The user-visible effect in my test setup was the kernel being unable to find the root file system ramdisk. This was likely caused by silent memory or page table corruption. Enabling CONFIG_DEBUG_VIRTUAL=y immediately flagged the thunking code as abusing virt_to_phys() because it was passing addresses that were not part of the kernel direct mapping. Use the slow version instead, which correctly handles all memory regions by performing a page table walk. Suggested-by: Andy Lutomirski Signed-off-by: Matt Fleming Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/20161112210424.5157-3-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi_64.c | 80 ++ 1 file changed, 57 insertions(+), 23 deletions(-) diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index 58b0f80..319148b 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -211,6 +212,35 @@ void efi_sync_low_kernel_mappings(void) memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries); } +/* + * Wrapper for slow_virt_to_phys() that handles NULL addresses. + */ +static inline phys_addr_t +virt_to_phys_or_null_size(void *va, unsigned long size) +{ + bool bad_size; + + if (!va) + return 0; + + if (virt_addr_valid(va)) + return virt_to_phys(va); + + /* +* A fully aligned variable on the stack is guaranteed not to +* cross a page bounary. Try to catch strings on the stack by +* checking that 'size' is a power of two. +*/ + bad_size = size > PAGE_SIZE || !is_power_of_2(size); + + WARN_ON(!IS_ALIGNED((unsigned long)va, size) || bad_size); + + return slow_virt_to_phys(va); +} + +#define virt_to_phys_or_null(addr) \ + virt_to_phys_or_null_size((addr), sizeof(*(addr))) + int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) { unsigned long pfn, text; @@ -494,8 +524,8 @@ static efi_status_t efi_thunk_get_time(efi_time_t *tm, efi_time_cap_t *tc) spin_lock(_lock); - phys_tm = virt_to_phys(tm); - phys_tc = virt_to_phys(tc); + phys_tm = virt_to_phys_or_null(tm); + phys_tc = virt_to_phys_or_null(tc); status = efi_thunk(get_time, phys_tm, phys_tc); @@ -511,7 +541,7 @@ static efi_status_t efi_thunk_set_time(efi_time_t *tm) spin_lock(_lock); - phys_tm = virt_to_phys(tm); + phys_tm = virt_to_phys_or_null(tm); status = efi_thunk(set_time, phys_tm); @@ -529,9 +559,9 @@ efi_thunk_get_wakeup_time(efi_bool_t *enabled, efi_bool_t *pending, spin_lock(_lock); - phys_enabled = virt_to_phys(enabled); - phys_pending = virt_to_phys(pending); - phys_tm = virt_to_phys(tm); + phys_enabled = virt_to_phys_or_null(enabled); + phys_pending = virt_to_phys_or_null(pending); + phys_tm = virt_to_phys_or_null(tm); status = efi_thunk(get_wakeup_time, phys_enabled, phys_pending, phys_tm); @@ -549,7 +579,7 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm) spin_lock(_lock); - phys_tm = virt_to_phys(tm); + phys_tm = virt_to_phys_or_null(tm); status = efi_thunk(set_wakeup_time, enabled, phys_tm); @@ -558,6 +588,10 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm) return status; } +static unsigned long efi_name_size(efi_char16_t *name) +{ + return ucs2_strsize(name, EFI_VAR_NAME_LEN) + 1; +} static efi_status_t efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor, @@ -567,11 +601,11 @@ efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor, u32 phys_name, phys_vendor, phys_attr; u32 phys_data_size, phys_data; - phys_data_size = virt_to_phys(data_size); - phys_vendor = virt_to_phys(vendor); - phys_name = virt_to_phys(name); - phys_attr = virt_to_phys(attr); - phys_data = virt_to_phys(data); + phys_data_size = virt_to_phys_or_null(data_size); + phys_vendor = virt_to_phys_or_null(vendor); +
[tip:sched/core] sched/fair: Kill the unused 'sched_shares_window_ns' tunable
Commit-ID: 3c3fcb45d524feb5d14a14f332e3eec7f2aff8f3 Gitweb: http://git.kernel.org/tip/3c3fcb45d524feb5d14a14f332e3eec7f2aff8f3 Author: Matt FlemingAuthorDate: Wed, 19 Oct 2016 15:10:59 +0100 Committer: Ingo Molnar CommitDate: Thu, 20 Oct 2016 08:44:57 +0200 sched/fair: Kill the unused 'sched_shares_window_ns' tunable The last user of this tunable was removed in 2012 in commit: 82958366cfea ("sched: Replace update_shares weight distribution with per-entity computation") Delete it since its very existence confuses people. Signed-off-by: Matt Fleming Cc: Dietmar Eggemann Cc: Linus Torvalds Cc: Mike Galbraith Cc: Paul Turner Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20161019141059.26408-1-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- include/linux/sched/sysctl.h | 1 - kernel/sched/fair.c | 7 --- kernel/sysctl.c | 7 --- 3 files changed, 15 deletions(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 22db1e6..4411453 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -36,7 +36,6 @@ extern unsigned int sysctl_numa_balancing_scan_size; extern unsigned int sysctl_sched_migration_cost; extern unsigned int sysctl_sched_nr_migrate; extern unsigned int sysctl_sched_time_avg; -extern unsigned int sysctl_sched_shares_window; int sched_proc_update_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d941c97..79d464a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -93,13 +93,6 @@ unsigned int normalized_sysctl_sched_wakeup_granularity = 100UL; const_debug unsigned int sysctl_sched_migration_cost = 50UL; -/* - * The exponential sliding window over which load is averaged for shares - * distribution. - * (default: 10msec) - */ -unsigned int __read_mostly sysctl_sched_shares_window = 1000UL; - #ifdef CONFIG_CFS_BANDWIDTH /* * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 706309f..739fb17 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -347,13 +347,6 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, - { - .procname = "sched_shares_window_ns", - .data = _sched_shares_window, - .maxlen = sizeof(unsigned int), - .mode = 0644, - .proc_handler = proc_dointvec, - }, #ifdef CONFIG_SCHEDSTATS { .procname = "sched_schedstats",
[tip:sched/core] sched/fair: Kill the unused 'sched_shares_window_ns' tunable
Commit-ID: 3c3fcb45d524feb5d14a14f332e3eec7f2aff8f3 Gitweb: http://git.kernel.org/tip/3c3fcb45d524feb5d14a14f332e3eec7f2aff8f3 Author: Matt Fleming AuthorDate: Wed, 19 Oct 2016 15:10:59 +0100 Committer: Ingo Molnar CommitDate: Thu, 20 Oct 2016 08:44:57 +0200 sched/fair: Kill the unused 'sched_shares_window_ns' tunable The last user of this tunable was removed in 2012 in commit: 82958366cfea ("sched: Replace update_shares weight distribution with per-entity computation") Delete it since its very existence confuses people. Signed-off-by: Matt Fleming Cc: Dietmar Eggemann Cc: Linus Torvalds Cc: Mike Galbraith Cc: Paul Turner Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20161019141059.26408-1-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- include/linux/sched/sysctl.h | 1 - kernel/sched/fair.c | 7 --- kernel/sysctl.c | 7 --- 3 files changed, 15 deletions(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 22db1e6..4411453 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -36,7 +36,6 @@ extern unsigned int sysctl_numa_balancing_scan_size; extern unsigned int sysctl_sched_migration_cost; extern unsigned int sysctl_sched_nr_migrate; extern unsigned int sysctl_sched_time_avg; -extern unsigned int sysctl_sched_shares_window; int sched_proc_update_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d941c97..79d464a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -93,13 +93,6 @@ unsigned int normalized_sysctl_sched_wakeup_granularity = 100UL; const_debug unsigned int sysctl_sched_migration_cost = 50UL; -/* - * The exponential sliding window over which load is averaged for shares - * distribution. - * (default: 10msec) - */ -unsigned int __read_mostly sysctl_sched_shares_window = 1000UL; - #ifdef CONFIG_CFS_BANDWIDTH /* * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 706309f..739fb17 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -347,13 +347,6 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, - { - .procname = "sched_shares_window_ns", - .data = _sched_shares_window, - .maxlen = sizeof(unsigned int), - .mode = 0644, - .proc_handler = proc_dointvec, - }, #ifdef CONFIG_SCHEDSTATS { .procname = "sched_schedstats",
[tip:perf/urgent] perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2
Commit-ID: 080fe0b790ad438fc1b61621dac37c1964ce7f35 Gitweb: http://git.kernel.org/tip/080fe0b790ad438fc1b61621dac37c1964ce7f35 Author: Matt FlemingAuthorDate: Wed, 24 Aug 2016 14:12:08 +0100 Committer: Ingo Molnar CommitDate: Fri, 16 Sep 2016 16:19:49 +0200 perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2 While the Intel PMU monitors the LLC when perf enables the HW_CACHE_REFERENCES and HW_CACHE_MISSES events, these events monitor L1 instruction cache fetches (0x0080) and instruction cache misses (0x0081) on the AMD PMU. This is extremely confusing when monitoring the same workload across Intel and AMD machines, since parameters like, $ perf stat -e cache-references,cache-misses measure completely different things. Instead, make the AMD PMU measure instruction/data cache and TLB fill requests to the L2 and instruction/data cache and TLB misses in the L2 when HW_CACHE_REFERENCES and HW_CACHE_MISSES are enabled, respectively. That way the events measure unified caches on both platforms. Signed-off-by: Matt Fleming Acked-by: Peter Zijlstra Cc: Cc: Borislav Petkov Cc: Linus Torvalds Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1472044328-21302-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/events/amd/core.c | 4 ++-- arch/x86/kvm/pmu_amd.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c index e07a22b..f5f4b3f 100644 --- a/arch/x86/events/amd/core.c +++ b/arch/x86/events/amd/core.c @@ -119,8 +119,8 @@ static const u64 amd_perfmon_event_map[PERF_COUNT_HW_MAX] = { [PERF_COUNT_HW_CPU_CYCLES] = 0x0076, [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0, - [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0080, - [PERF_COUNT_HW_CACHE_MISSES] = 0x0081, + [PERF_COUNT_HW_CACHE_REFERENCES] = 0x077d, + [PERF_COUNT_HW_CACHE_MISSES] = 0x077e, [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c2, [PERF_COUNT_HW_BRANCH_MISSES]= 0x00c3, [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x00d0, /* "Decoder empty" event */ diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c index 39b9112..cd94443 100644 --- a/arch/x86/kvm/pmu_amd.c +++ b/arch/x86/kvm/pmu_amd.c @@ -23,8 +23,8 @@ static struct kvm_event_hw_type_mapping amd_event_mapping[] = { [0] = { 0x76, 0x00, PERF_COUNT_HW_CPU_CYCLES }, [1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS }, - [2] = { 0x80, 0x00, PERF_COUNT_HW_CACHE_REFERENCES }, - [3] = { 0x81, 0x00, PERF_COUNT_HW_CACHE_MISSES }, + [2] = { 0x7d, 0x07, PERF_COUNT_HW_CACHE_REFERENCES }, + [3] = { 0x7e, 0x07, PERF_COUNT_HW_CACHE_MISSES }, [4] = { 0xc2, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS }, [5] = { 0xc3, 0x00, PERF_COUNT_HW_BRANCH_MISSES }, [6] = { 0xd0, 0x00, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND },
[tip:perf/urgent] perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2
Commit-ID: 080fe0b790ad438fc1b61621dac37c1964ce7f35 Gitweb: http://git.kernel.org/tip/080fe0b790ad438fc1b61621dac37c1964ce7f35 Author: Matt Fleming AuthorDate: Wed, 24 Aug 2016 14:12:08 +0100 Committer: Ingo Molnar CommitDate: Fri, 16 Sep 2016 16:19:49 +0200 perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2 While the Intel PMU monitors the LLC when perf enables the HW_CACHE_REFERENCES and HW_CACHE_MISSES events, these events monitor L1 instruction cache fetches (0x0080) and instruction cache misses (0x0081) on the AMD PMU. This is extremely confusing when monitoring the same workload across Intel and AMD machines, since parameters like, $ perf stat -e cache-references,cache-misses measure completely different things. Instead, make the AMD PMU measure instruction/data cache and TLB fill requests to the L2 and instruction/data cache and TLB misses in the L2 when HW_CACHE_REFERENCES and HW_CACHE_MISSES are enabled, respectively. That way the events measure unified caches on both platforms. Signed-off-by: Matt Fleming Acked-by: Peter Zijlstra Cc: Cc: Borislav Petkov Cc: Linus Torvalds Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1472044328-21302-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/events/amd/core.c | 4 ++-- arch/x86/kvm/pmu_amd.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c index e07a22b..f5f4b3f 100644 --- a/arch/x86/events/amd/core.c +++ b/arch/x86/events/amd/core.c @@ -119,8 +119,8 @@ static const u64 amd_perfmon_event_map[PERF_COUNT_HW_MAX] = { [PERF_COUNT_HW_CPU_CYCLES] = 0x0076, [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0, - [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0080, - [PERF_COUNT_HW_CACHE_MISSES] = 0x0081, + [PERF_COUNT_HW_CACHE_REFERENCES] = 0x077d, + [PERF_COUNT_HW_CACHE_MISSES] = 0x077e, [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c2, [PERF_COUNT_HW_BRANCH_MISSES]= 0x00c3, [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x00d0, /* "Decoder empty" event */ diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c index 39b9112..cd94443 100644 --- a/arch/x86/kvm/pmu_amd.c +++ b/arch/x86/kvm/pmu_amd.c @@ -23,8 +23,8 @@ static struct kvm_event_hw_type_mapping amd_event_mapping[] = { [0] = { 0x76, 0x00, PERF_COUNT_HW_CPU_CYCLES }, [1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS }, - [2] = { 0x80, 0x00, PERF_COUNT_HW_CACHE_REFERENCES }, - [3] = { 0x81, 0x00, PERF_COUNT_HW_CACHE_MISSES }, + [2] = { 0x7d, 0x07, PERF_COUNT_HW_CACHE_REFERENCES }, + [3] = { 0x7e, 0x07, PERF_COUNT_HW_CACHE_MISSES }, [4] = { 0xc2, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS }, [5] = { 0xc3, 0x00, PERF_COUNT_HW_BRANCH_MISSES }, [6] = { 0xd0, 0x00, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND },
[tip:efi/core] efi/capsule: Move 'capsule' to the stack in efi_capsule_supported()
Commit-ID: fb7a84cac03541f4da18dfa25b3f4767d4efc6fc Gitweb: http://git.kernel.org/tip/fb7a84cac03541f4da18dfa25b3f4767d4efc6fc Author: Matt FlemingAuthorDate: Fri, 6 May 2016 22:39:29 +0100 Committer: Ingo Molnar CommitDate: Sat, 7 May 2016 07:06:13 +0200 efi/capsule: Move 'capsule' to the stack in efi_capsule_supported() Dan Carpenter reports that passing the address of the pointer to the kmalloc()'d memory for 'capsule' is dangerous: "drivers/firmware/efi/capsule.c:109 efi_capsule_supported() warn: did you mean to pass the address of 'capsule' 108 109 status = efi.query_capsule_caps(, 1, _size, reset); If we modify capsule inside this function call then at the end of the function we aren't freeing the original pointer that we allocated." Ard Biesheuvel noted that we don't even need to call kmalloc() since the object we allocate isn't very big and doesn't need to persist after the function returns. Place 'capsule' on the stack instead. Suggested-by: Ard Biesheuvel Reported-by: Dan Carpenter Signed-off-by: Matt Fleming Acked-by: Ard Biesheuvel Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Bryan O'Donoghue Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Kweh Hock Leong Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: joeyli Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1462570771-13324-4-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- drivers/firmware/efi/capsule.c | 29 +++-- 1 file changed, 11 insertions(+), 18 deletions(-) diff --git a/drivers/firmware/efi/capsule.c b/drivers/firmware/efi/capsule.c index e530540..53b9fd2 100644 --- a/drivers/firmware/efi/capsule.c +++ b/drivers/firmware/efi/capsule.c @@ -86,33 +86,26 @@ bool efi_capsule_pending(int *reset_type) */ int efi_capsule_supported(efi_guid_t guid, u32 flags, size_t size, int *reset) { - efi_capsule_header_t *capsule; + efi_capsule_header_t capsule; + efi_capsule_header_t *cap_list[] = { }; efi_status_t status; u64 max_size; - int rv = 0; if (flags & ~EFI_CAPSULE_SUPPORTED_FLAG_MASK) return -EINVAL; - capsule = kmalloc(sizeof(*capsule), GFP_KERNEL); - if (!capsule) - return -ENOMEM; - - capsule->headersize = capsule->imagesize = sizeof(*capsule); - memcpy(>guid, , sizeof(efi_guid_t)); - capsule->flags = flags; + capsule.headersize = capsule.imagesize = sizeof(capsule); + memcpy(, , sizeof(efi_guid_t)); + capsule.flags = flags; - status = efi.query_capsule_caps(, 1, _size, reset); - if (status != EFI_SUCCESS) { - rv = efi_status_to_err(status); - goto out; - } + status = efi.query_capsule_caps(cap_list, 1, _size, reset); + if (status != EFI_SUCCESS) + return efi_status_to_err(status); if (size > max_size) - rv = -ENOSPC; -out: - kfree(capsule); - return rv; + return -ENOSPC; + + return 0; } EXPORT_SYMBOL_GPL(efi_capsule_supported);
[tip:efi/core] efi/capsule: Move 'capsule' to the stack in efi_capsule_supported()
Commit-ID: fb7a84cac03541f4da18dfa25b3f4767d4efc6fc Gitweb: http://git.kernel.org/tip/fb7a84cac03541f4da18dfa25b3f4767d4efc6fc Author: Matt Fleming AuthorDate: Fri, 6 May 2016 22:39:29 +0100 Committer: Ingo Molnar CommitDate: Sat, 7 May 2016 07:06:13 +0200 efi/capsule: Move 'capsule' to the stack in efi_capsule_supported() Dan Carpenter reports that passing the address of the pointer to the kmalloc()'d memory for 'capsule' is dangerous: "drivers/firmware/efi/capsule.c:109 efi_capsule_supported() warn: did you mean to pass the address of 'capsule' 108 109 status = efi.query_capsule_caps(, 1, _size, reset); If we modify capsule inside this function call then at the end of the function we aren't freeing the original pointer that we allocated." Ard Biesheuvel noted that we don't even need to call kmalloc() since the object we allocate isn't very big and doesn't need to persist after the function returns. Place 'capsule' on the stack instead. Suggested-by: Ard Biesheuvel Reported-by: Dan Carpenter Signed-off-by: Matt Fleming Acked-by: Ard Biesheuvel Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Bryan O'Donoghue Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Kweh Hock Leong Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: joeyli Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1462570771-13324-4-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- drivers/firmware/efi/capsule.c | 29 +++-- 1 file changed, 11 insertions(+), 18 deletions(-) diff --git a/drivers/firmware/efi/capsule.c b/drivers/firmware/efi/capsule.c index e530540..53b9fd2 100644 --- a/drivers/firmware/efi/capsule.c +++ b/drivers/firmware/efi/capsule.c @@ -86,33 +86,26 @@ bool efi_capsule_pending(int *reset_type) */ int efi_capsule_supported(efi_guid_t guid, u32 flags, size_t size, int *reset) { - efi_capsule_header_t *capsule; + efi_capsule_header_t capsule; + efi_capsule_header_t *cap_list[] = { }; efi_status_t status; u64 max_size; - int rv = 0; if (flags & ~EFI_CAPSULE_SUPPORTED_FLAG_MASK) return -EINVAL; - capsule = kmalloc(sizeof(*capsule), GFP_KERNEL); - if (!capsule) - return -ENOMEM; - - capsule->headersize = capsule->imagesize = sizeof(*capsule); - memcpy(>guid, , sizeof(efi_guid_t)); - capsule->flags = flags; + capsule.headersize = capsule.imagesize = sizeof(capsule); + memcpy(, , sizeof(efi_guid_t)); + capsule.flags = flags; - status = efi.query_capsule_caps(, 1, _size, reset); - if (status != EFI_SUCCESS) { - rv = efi_status_to_err(status); - goto out; - } + status = efi.query_capsule_caps(cap_list, 1, _size, reset); + if (status != EFI_SUCCESS) + return efi_status_to_err(status); if (size > max_size) - rv = -ENOSPC; -out: - kfree(capsule); - return rv; + return -ENOSPC; + + return 0; } EXPORT_SYMBOL_GPL(efi_capsule_supported);
[tip:efi/core] efi/capsule: Make efi_capsule_pending() lockless
Commit-ID: 62075e581802ea1842d5d3c490a7e46330bdb9e1 Gitweb: http://git.kernel.org/tip/62075e581802ea1842d5d3c490a7e46330bdb9e1 Author: Matt FlemingAuthorDate: Fri, 6 May 2016 22:39:27 +0100 Committer: Ingo Molnar CommitDate: Sat, 7 May 2016 07:06:13 +0200 efi/capsule: Make efi_capsule_pending() lockless Taking a mutex in the reboot path is bogus because we cannot sleep with interrupts disabled, such as when rebooting due to panic(), BUG: sleeping function called from invalid context at kernel/locking/mutex.c:97 in_atomic(): 0, irqs_disabled(): 1, pid: 7, name: rcu_sched Call Trace: dump_stack+0x63/0x89 ___might_sleep+0xd8/0x120 __might_sleep+0x49/0x80 mutex_lock+0x20/0x50 efi_capsule_pending+0x1d/0x60 native_machine_emergency_restart+0x59/0x280 machine_emergency_restart+0x19/0x20 emergency_restart+0x18/0x20 panic+0x1ba/0x217 In this case all other CPUs will have been stopped by the time we execute the platform reboot code, so 'capsule_pending' cannot change under our feet. We wouldn't care even if it could since we cannot wait for it complete. Also, instead of relying on the external 'system_state' variable just use a reboot notifier, so we can set 'stop_capsules' while holding 'capsule_mutex', thereby avoiding a race where system_state is updated while we're in the middle of efi_capsule_update_locked() (since CPUs won't have been stopped at that point). Signed-off-by: Matt Fleming Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Bryan O'Donoghue Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Kweh Hock Leong Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: joeyli Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1462570771-13324-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- drivers/firmware/efi/capsule.c | 35 +-- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/drivers/firmware/efi/capsule.c b/drivers/firmware/efi/capsule.c index 0de5594..e530540 100644 --- a/drivers/firmware/efi/capsule.c +++ b/drivers/firmware/efi/capsule.c @@ -22,11 +22,12 @@ typedef struct { } efi_capsule_block_desc_t; static bool capsule_pending; +static bool stop_capsules; static int efi_reset_type = -1; /* * capsule_mutex serialises access to both capsule_pending and - * efi_reset_type. + * efi_reset_type and stop_capsules. */ static DEFINE_MUTEX(capsule_mutex); @@ -50,18 +51,13 @@ static DEFINE_MUTEX(capsule_mutex); */ bool efi_capsule_pending(int *reset_type) { - bool rv = false; - - mutex_lock(_mutex); if (!capsule_pending) - goto out; + return false; if (reset_type) *reset_type = efi_reset_type; - rv = true; -out: - mutex_unlock(_mutex); - return rv; + + return true; } /* @@ -176,7 +172,7 @@ efi_capsule_update_locked(efi_capsule_header_t *capsule, * whether to force an EFI reboot), and we're racing against * that call. Abort in that case. */ - if (unlikely(system_state == SYSTEM_RESTART)) { + if (unlikely(stop_capsules)) { pr_warn("Capsule update raced with reboot, aborting.\n"); return -EINVAL; } @@ -298,3 +294,22 @@ out: return rv; } EXPORT_SYMBOL_GPL(efi_capsule_update); + +static int capsule_reboot_notify(struct notifier_block *nb, unsigned long event, void *cmd) +{ + mutex_lock(_mutex); + stop_capsules = true; + mutex_unlock(_mutex); + + return NOTIFY_DONE; +} + +static struct notifier_block capsule_reboot_nb = { + .notifier_call = capsule_reboot_notify, +}; + +static int __init capsule_reboot_register(void) +{ + return register_reboot_notifier(_reboot_nb); +} +core_initcall(capsule_reboot_register);
[tip:efi/core] efi/capsule: Make efi_capsule_pending() lockless
Commit-ID: 62075e581802ea1842d5d3c490a7e46330bdb9e1 Gitweb: http://git.kernel.org/tip/62075e581802ea1842d5d3c490a7e46330bdb9e1 Author: Matt Fleming AuthorDate: Fri, 6 May 2016 22:39:27 +0100 Committer: Ingo Molnar CommitDate: Sat, 7 May 2016 07:06:13 +0200 efi/capsule: Make efi_capsule_pending() lockless Taking a mutex in the reboot path is bogus because we cannot sleep with interrupts disabled, such as when rebooting due to panic(), BUG: sleeping function called from invalid context at kernel/locking/mutex.c:97 in_atomic(): 0, irqs_disabled(): 1, pid: 7, name: rcu_sched Call Trace: dump_stack+0x63/0x89 ___might_sleep+0xd8/0x120 __might_sleep+0x49/0x80 mutex_lock+0x20/0x50 efi_capsule_pending+0x1d/0x60 native_machine_emergency_restart+0x59/0x280 machine_emergency_restart+0x19/0x20 emergency_restart+0x18/0x20 panic+0x1ba/0x217 In this case all other CPUs will have been stopped by the time we execute the platform reboot code, so 'capsule_pending' cannot change under our feet. We wouldn't care even if it could since we cannot wait for it complete. Also, instead of relying on the external 'system_state' variable just use a reboot notifier, so we can set 'stop_capsules' while holding 'capsule_mutex', thereby avoiding a race where system_state is updated while we're in the middle of efi_capsule_update_locked() (since CPUs won't have been stopped at that point). Signed-off-by: Matt Fleming Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Bryan O'Donoghue Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Kweh Hock Leong Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: joeyli Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1462570771-13324-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- drivers/firmware/efi/capsule.c | 35 +-- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/drivers/firmware/efi/capsule.c b/drivers/firmware/efi/capsule.c index 0de5594..e530540 100644 --- a/drivers/firmware/efi/capsule.c +++ b/drivers/firmware/efi/capsule.c @@ -22,11 +22,12 @@ typedef struct { } efi_capsule_block_desc_t; static bool capsule_pending; +static bool stop_capsules; static int efi_reset_type = -1; /* * capsule_mutex serialises access to both capsule_pending and - * efi_reset_type. + * efi_reset_type and stop_capsules. */ static DEFINE_MUTEX(capsule_mutex); @@ -50,18 +51,13 @@ static DEFINE_MUTEX(capsule_mutex); */ bool efi_capsule_pending(int *reset_type) { - bool rv = false; - - mutex_lock(_mutex); if (!capsule_pending) - goto out; + return false; if (reset_type) *reset_type = efi_reset_type; - rv = true; -out: - mutex_unlock(_mutex); - return rv; + + return true; } /* @@ -176,7 +172,7 @@ efi_capsule_update_locked(efi_capsule_header_t *capsule, * whether to force an EFI reboot), and we're racing against * that call. Abort in that case. */ - if (unlikely(system_state == SYSTEM_RESTART)) { + if (unlikely(stop_capsules)) { pr_warn("Capsule update raced with reboot, aborting.\n"); return -EINVAL; } @@ -298,3 +294,22 @@ out: return rv; } EXPORT_SYMBOL_GPL(efi_capsule_update); + +static int capsule_reboot_notify(struct notifier_block *nb, unsigned long event, void *cmd) +{ + mutex_lock(_mutex); + stop_capsules = true; + mutex_unlock(_mutex); + + return NOTIFY_DONE; +} + +static struct notifier_block capsule_reboot_nb = { + .notifier_call = capsule_reboot_notify, +}; + +static int __init capsule_reboot_register(void) +{ + return register_reboot_notifier(_reboot_nb); +} +core_initcall(capsule_reboot_register);
[tip:sched/core] sched/fair: Update rq clock before updating nohz CPU load
Commit-ID: b52fad2db5d792d89975cebf2fe1646a7af28ed0 Gitweb: http://git.kernel.org/tip/b52fad2db5d792d89975cebf2fe1646a7af28ed0 Author: Matt FlemingAuthorDate: Tue, 3 May 2016 20:46:54 +0100 Committer: Ingo Molnar CommitDate: Thu, 5 May 2016 09:41:09 +0200 sched/fair: Update rq clock before updating nohz CPU load If we're accessing rq_clock() (e.g. in sched_avg_update()) we should update the rq clock before calling cpu_load_update(), otherwise any time calculations will be stale. All other paths currently call update_rq_clock(). Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Wanpeng Li Cc: Frederic Weisbecker Cc: Linus Torvalds Cc: Mel Gorman Cc: Mike Galbraith Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1462304814-11715-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8c381a6..7a00c7c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4724,6 +4724,7 @@ void cpu_load_update_nohz_stop(void) load = weighted_cpuload(cpu_of(this_rq)); raw_spin_lock(_rq->lock); + update_rq_clock(this_rq); cpu_load_update_nohz(this_rq, curr_jiffies, load); raw_spin_unlock(_rq->lock); }
[tip:sched/core] sched/fair: Update rq clock before updating nohz CPU load
Commit-ID: b52fad2db5d792d89975cebf2fe1646a7af28ed0 Gitweb: http://git.kernel.org/tip/b52fad2db5d792d89975cebf2fe1646a7af28ed0 Author: Matt Fleming AuthorDate: Tue, 3 May 2016 20:46:54 +0100 Committer: Ingo Molnar CommitDate: Thu, 5 May 2016 09:41:09 +0200 sched/fair: Update rq clock before updating nohz CPU load If we're accessing rq_clock() (e.g. in sched_avg_update()) we should update the rq clock before calling cpu_load_update(), otherwise any time calculations will be stale. All other paths currently call update_rq_clock(). Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Wanpeng Li Cc: Frederic Weisbecker Cc: Linus Torvalds Cc: Mel Gorman Cc: Mike Galbraith Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1462304814-11715-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8c381a6..7a00c7c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4724,6 +4724,7 @@ void cpu_load_update_nohz_stop(void) load = weighted_cpuload(cpu_of(this_rq)); raw_spin_lock(_rq->lock); + update_rq_clock(this_rq); cpu_load_update_nohz(this_rq, curr_jiffies, load); raw_spin_unlock(_rq->lock); }
[tip:efi/urgent] MAINTAINERS: Remove asterisk from EFI directory names
Commit-ID: e8dfe6d8f6762d515fcd4f30577f7bfcf7659887 Gitweb: http://git.kernel.org/tip/e8dfe6d8f6762d515fcd4f30577f7bfcf7659887 Author: Matt FlemingAuthorDate: Tue, 3 May 2016 20:29:39 +0100 Committer: Ingo Molnar CommitDate: Wed, 4 May 2016 08:36:44 +0200 MAINTAINERS: Remove asterisk from EFI directory names Mark reported that having asterisks on the end of directory names confuses get_maintainer.pl when it encounters subdirectories, and that my name does not appear when run on drivers/firmware/efi/libstub. Reported-by: Mark Rutland Signed-off-by: Matt Fleming Cc: Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1462303781-8686-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- MAINTAINERS | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 42e65d1..4dca3b3 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4223,8 +4223,8 @@ F:Documentation/efi-stub.txt F: arch/ia64/kernel/efi.c F: arch/x86/boot/compressed/eboot.[ch] F: arch/x86/include/asm/efi.h -F: arch/x86/platform/efi/* -F: drivers/firmware/efi/* +F: arch/x86/platform/efi/ +F: drivers/firmware/efi/ F: include/linux/efi*.h EFI VARIABLE FILESYSTEM
[tip:efi/urgent] MAINTAINERS: Remove asterisk from EFI directory names
Commit-ID: e8dfe6d8f6762d515fcd4f30577f7bfcf7659887 Gitweb: http://git.kernel.org/tip/e8dfe6d8f6762d515fcd4f30577f7bfcf7659887 Author: Matt Fleming AuthorDate: Tue, 3 May 2016 20:29:39 +0100 Committer: Ingo Molnar CommitDate: Wed, 4 May 2016 08:36:44 +0200 MAINTAINERS: Remove asterisk from EFI directory names Mark reported that having asterisks on the end of directory names confuses get_maintainer.pl when it encounters subdirectories, and that my name does not appear when run on drivers/firmware/efi/libstub. Reported-by: Mark Rutland Signed-off-by: Matt Fleming Cc: Cc: Ard Biesheuvel Cc: Catalin Marinas Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1462303781-8686-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- MAINTAINERS | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 42e65d1..4dca3b3 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4223,8 +4223,8 @@ F:Documentation/efi-stub.txt F: arch/ia64/kernel/efi.c F: arch/x86/boot/compressed/eboot.[ch] F: arch/x86/include/asm/efi.h -F: arch/x86/platform/efi/* -F: drivers/firmware/efi/* +F: arch/x86/platform/efi/ +F: drivers/firmware/efi/ F: include/linux/efi*.h EFI VARIABLE FILESYSTEM
[tip:efi/core] efi: Add 'capsule' update support
Commit-ID: f0133f3c5b8bb34ec4dec50c27e7a655aeee8935 Gitweb: http://git.kernel.org/tip/f0133f3c5b8bb34ec4dec50c27e7a655aeee8935 Author: Matt FlemingAuthorDate: Mon, 25 Apr 2016 21:06:59 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:34:03 +0200 efi: Add 'capsule' update support The EFI capsule mechanism allows data blobs to be passed to the EFI firmware. A common use case is performing firmware updates. This patch just introduces the main infrastructure for interacting with the firmware, and a driver that allows users to upload capsules will come in a later patch. Once a capsule has been passed to the firmware, the next reboot must be performed using the ResetSystem() EFI runtime service, which may involve overriding the reboot type specified by reboot=. This ensures the reset value returned by QueryCapsuleCapabilities() is used to reset the system, which is required for the capsule to be processed. efi_capsule_pending() is provided for this purpose. At the moment we only allow a single capsule blob to be sent to the firmware despite the fact that UpdateCapsule() takes a 'CapsuleCount' parameter. This simplifies the API and shouldn't result in any downside since it is still possible to send multiple capsules by repeatedly calling UpdateCapsule(). Signed-off-by: Matt Fleming Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Bryan O'Donoghue Cc: Kweh Hock Leong Cc: Mark Salter Cc: Peter Jones Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: joeyli Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-28-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- drivers/firmware/efi/Makefile | 1 + drivers/firmware/efi/capsule.c | 300 + drivers/firmware/efi/reboot.c | 12 +- include/linux/efi.h| 14 ++ 4 files changed, 326 insertions(+), 1 deletion(-) diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile index b080808..fb8ad5d 100644 --- a/drivers/firmware/efi/Makefile +++ b/drivers/firmware/efi/Makefile @@ -10,6 +10,7 @@ KASAN_SANITIZE_runtime-wrappers.o := n obj-$(CONFIG_EFI) += efi.o vars.o reboot.o memattr.o +obj-$(CONFIG_EFI) += capsule.o obj-$(CONFIG_EFI_VARS) += efivars.o obj-$(CONFIG_EFI_ESRT) += esrt.o obj-$(CONFIG_EFI_VARS_PSTORE) += efi-pstore.o diff --git a/drivers/firmware/efi/capsule.c b/drivers/firmware/efi/capsule.c new file mode 100644 index 000..0de5594 --- /dev/null +++ b/drivers/firmware/efi/capsule.c @@ -0,0 +1,300 @@ +/* + * EFI capsule support. + * + * Copyright 2013 Intel Corporation; author Matt Fleming + * + * This file is part of the Linux kernel, and is made available under + * the terms of the GNU General Public License version 2. + */ + +#define pr_fmt(fmt) "efi: " fmt + +#include +#include +#include +#include +#include +#include + +typedef struct { + u64 length; + u64 data; +} efi_capsule_block_desc_t; + +static bool capsule_pending; +static int efi_reset_type = -1; + +/* + * capsule_mutex serialises access to both capsule_pending and + * efi_reset_type. + */ +static DEFINE_MUTEX(capsule_mutex); + +/** + * efi_capsule_pending - has a capsule been passed to the firmware? + * @reset_type: store the type of EFI reset if capsule is pending + * + * To ensure that the registered capsule is processed correctly by the + * firmware we need to perform a specific type of reset. If a capsule is + * pending return the reset type in @reset_type. + * + * This function will race with callers of efi_capsule_update(), for + * example, calling this function while somebody else is in + * efi_capsule_update() but hasn't reached efi_capsue_update_locked() + * will miss the updates to capsule_pending and efi_reset_type after + * efi_capsule_update_locked() completes. + * + * A non-racy use is from platform reboot code because we use + * system_state to ensure no capsules can be sent to the firmware once + * we're at SYSTEM_RESTART. See efi_capsule_update_locked(). + */ +bool efi_capsule_pending(int *reset_type) +{ + bool rv = false; + + mutex_lock(_mutex); + if (!capsule_pending) + goto out; + + if (reset_type) + *reset_type = efi_reset_type; + rv = true; +out: + mutex_unlock(_mutex); + return rv; +} + +/* + * Whitelist of EFI capsule flags that we support. + * + * We do not handle EFI_CAPSULE_INITIATE_RESET because that would + * require us to prepare the kernel for reboot. Refuse to load any + * capsules with that flag and any other flags that we do not know how + * to handle. + */ +#define
[tip:efi/core] efi: Add 'capsule' update support
Commit-ID: f0133f3c5b8bb34ec4dec50c27e7a655aeee8935 Gitweb: http://git.kernel.org/tip/f0133f3c5b8bb34ec4dec50c27e7a655aeee8935 Author: Matt Fleming AuthorDate: Mon, 25 Apr 2016 21:06:59 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:34:03 +0200 efi: Add 'capsule' update support The EFI capsule mechanism allows data blobs to be passed to the EFI firmware. A common use case is performing firmware updates. This patch just introduces the main infrastructure for interacting with the firmware, and a driver that allows users to upload capsules will come in a later patch. Once a capsule has been passed to the firmware, the next reboot must be performed using the ResetSystem() EFI runtime service, which may involve overriding the reboot type specified by reboot=. This ensures the reset value returned by QueryCapsuleCapabilities() is used to reset the system, which is required for the capsule to be processed. efi_capsule_pending() is provided for this purpose. At the moment we only allow a single capsule blob to be sent to the firmware despite the fact that UpdateCapsule() takes a 'CapsuleCount' parameter. This simplifies the API and shouldn't result in any downside since it is still possible to send multiple capsules by repeatedly calling UpdateCapsule(). Signed-off-by: Matt Fleming Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Bryan O'Donoghue Cc: Kweh Hock Leong Cc: Mark Salter Cc: Peter Jones Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: joeyli Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-28-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- drivers/firmware/efi/Makefile | 1 + drivers/firmware/efi/capsule.c | 300 + drivers/firmware/efi/reboot.c | 12 +- include/linux/efi.h| 14 ++ 4 files changed, 326 insertions(+), 1 deletion(-) diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile index b080808..fb8ad5d 100644 --- a/drivers/firmware/efi/Makefile +++ b/drivers/firmware/efi/Makefile @@ -10,6 +10,7 @@ KASAN_SANITIZE_runtime-wrappers.o := n obj-$(CONFIG_EFI) += efi.o vars.o reboot.o memattr.o +obj-$(CONFIG_EFI) += capsule.o obj-$(CONFIG_EFI_VARS) += efivars.o obj-$(CONFIG_EFI_ESRT) += esrt.o obj-$(CONFIG_EFI_VARS_PSTORE) += efi-pstore.o diff --git a/drivers/firmware/efi/capsule.c b/drivers/firmware/efi/capsule.c new file mode 100644 index 000..0de5594 --- /dev/null +++ b/drivers/firmware/efi/capsule.c @@ -0,0 +1,300 @@ +/* + * EFI capsule support. + * + * Copyright 2013 Intel Corporation; author Matt Fleming + * + * This file is part of the Linux kernel, and is made available under + * the terms of the GNU General Public License version 2. + */ + +#define pr_fmt(fmt) "efi: " fmt + +#include +#include +#include +#include +#include +#include + +typedef struct { + u64 length; + u64 data; +} efi_capsule_block_desc_t; + +static bool capsule_pending; +static int efi_reset_type = -1; + +/* + * capsule_mutex serialises access to both capsule_pending and + * efi_reset_type. + */ +static DEFINE_MUTEX(capsule_mutex); + +/** + * efi_capsule_pending - has a capsule been passed to the firmware? + * @reset_type: store the type of EFI reset if capsule is pending + * + * To ensure that the registered capsule is processed correctly by the + * firmware we need to perform a specific type of reset. If a capsule is + * pending return the reset type in @reset_type. + * + * This function will race with callers of efi_capsule_update(), for + * example, calling this function while somebody else is in + * efi_capsule_update() but hasn't reached efi_capsue_update_locked() + * will miss the updates to capsule_pending and efi_reset_type after + * efi_capsule_update_locked() completes. + * + * A non-racy use is from platform reboot code because we use + * system_state to ensure no capsules can be sent to the firmware once + * we're at SYSTEM_RESTART. See efi_capsule_update_locked(). + */ +bool efi_capsule_pending(int *reset_type) +{ + bool rv = false; + + mutex_lock(_mutex); + if (!capsule_pending) + goto out; + + if (reset_type) + *reset_type = efi_reset_type; + rv = true; +out: + mutex_unlock(_mutex); + return rv; +} + +/* + * Whitelist of EFI capsule flags that we support. + * + * We do not handle EFI_CAPSULE_INITIATE_RESET because that would + * require us to prepare the kernel for reboot. Refuse to load any + * capsules with that flag and any other flags that we do not know how + * to handle. + */ +#define EFI_CAPSULE_SUPPORTED_FLAG_MASK\ + (EFI_CAPSULE_PERSIST_ACROSS_RESET | EFI_CAPSULE_POPULATE_SYSTEM_TABLE) + +/** + * efi_capsule_supported - does the firmware support the capsule? + * @guid: vendor guid of capsule + * @flags: capsule flags + * @size:
[tip:efi/core] x86/efi: Force EFI reboot to process pending capsules
Commit-ID: 87615a34d561ef59bd0cffc73256a21220dfdffd Gitweb: http://git.kernel.org/tip/87615a34d561ef59bd0cffc73256a21220dfdffd Author: Matt FlemingAuthorDate: Mon, 25 Apr 2016 21:07:00 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:34:04 +0200 x86/efi: Force EFI reboot to process pending capsules If an EFI capsule has been sent to the firmware we must match the type of EFI reset against that required by the capsule to ensure it is processed correctly. Force an EFI reboot if a capsule is pending for the next reset. Signed-off-by: Matt Fleming Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Kweh Hock Leong Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: joeyli Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-29-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/reboot.c | 9 + include/linux/efi.h | 6 ++ 2 files changed, 15 insertions(+) diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index ab0adc0..a9b31eb 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -535,6 +535,15 @@ static void native_machine_emergency_restart(void) mode = reboot_mode == REBOOT_WARM ? 0x1234 : 0; *((unsigned short *)__va(0x472)) = mode; + /* +* If an EFI capsule has been registered with the firmware then +* override the reboot= parameter. +*/ + if (efi_capsule_pending(NULL)) { + pr_info("EFI capsule is pending, forcing EFI reboot.\n"); + reboot_type = BOOT_EFI; + } + for (;;) { /* Could also try the reset bit in the Hammer NB */ switch (reboot_type) { diff --git a/include/linux/efi.h b/include/linux/efi.h index a3b4c1e..aa36fb8 100644 --- a/include/linux/efi.h +++ b/include/linux/efi.h @@ -1085,6 +1085,12 @@ static inline bool efi_enabled(int feature) } static inline void efi_reboot(enum reboot_mode reboot_mode, const char *__unused) {} + +static inline bool +efi_capsule_pending(int *reset_type) +{ + return false; +} #endif extern int efi_status_to_err(efi_status_t status);
[tip:efi/core] x86/efi: Force EFI reboot to process pending capsules
Commit-ID: 87615a34d561ef59bd0cffc73256a21220dfdffd Gitweb: http://git.kernel.org/tip/87615a34d561ef59bd0cffc73256a21220dfdffd Author: Matt Fleming AuthorDate: Mon, 25 Apr 2016 21:07:00 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:34:04 +0200 x86/efi: Force EFI reboot to process pending capsules If an EFI capsule has been sent to the firmware we must match the type of EFI reset against that required by the capsule to ensure it is processed correctly. Force an EFI reboot if a capsule is pending for the next reset. Signed-off-by: Matt Fleming Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Kweh Hock Leong Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: joeyli Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-29-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/reboot.c | 9 + include/linux/efi.h | 6 ++ 2 files changed, 15 insertions(+) diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index ab0adc0..a9b31eb 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -535,6 +535,15 @@ static void native_machine_emergency_restart(void) mode = reboot_mode == REBOOT_WARM ? 0x1234 : 0; *((unsigned short *)__va(0x472)) = mode; + /* +* If an EFI capsule has been registered with the firmware then +* override the reboot= parameter. +*/ + if (efi_capsule_pending(NULL)) { + pr_info("EFI capsule is pending, forcing EFI reboot.\n"); + reboot_type = BOOT_EFI; + } + for (;;) { /* Could also try the reset bit in the Hammer NB */ switch (reboot_type) { diff --git a/include/linux/efi.h b/include/linux/efi.h index a3b4c1e..aa36fb8 100644 --- a/include/linux/efi.h +++ b/include/linux/efi.h @@ -1085,6 +1085,12 @@ static inline bool efi_enabled(int feature) } static inline void efi_reboot(enum reboot_mode reboot_mode, const char *__unused) {} + +static inline bool +efi_capsule_pending(int *reset_type) +{ + return false; +} #endif extern int efi_status_to_err(efi_status_t status);
[tip:efi/core] efi: Move efi_status_to_err() to drivers/firmware/efi/
Commit-ID: 806b0351c9ff9890c1ef0ba2c46237baef49ac79 Gitweb: http://git.kernel.org/tip/806b0351c9ff9890c1ef0ba2c46237baef49ac79 Author: Matt FlemingAuthorDate: Mon, 25 Apr 2016 21:06:58 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:34:03 +0200 efi: Move efi_status_to_err() to drivers/firmware/efi/ Move efi_status_to_err() to the architecture independent code as it's generally useful in all bits of EFI code where there is a need to convert an efi_status_t to a kernel error value. Signed-off-by: Matt Fleming Acked-by: Ard Biesheuvel Cc: Borislav Petkov Cc: Kweh Hock Leong Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: joeyli Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-27-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- drivers/firmware/efi/efi.c | 33 + drivers/firmware/efi/vars.c | 33 - include/linux/efi.h | 2 ++ 3 files changed, 35 insertions(+), 33 deletions(-) diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index 4991371..05509f3 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -636,3 +636,36 @@ u64 __weak efi_mem_attributes(unsigned long phys_addr) } return 0; } + +int efi_status_to_err(efi_status_t status) +{ + int err; + + switch (status) { + case EFI_SUCCESS: + err = 0; + break; + case EFI_INVALID_PARAMETER: + err = -EINVAL; + break; + case EFI_OUT_OF_RESOURCES: + err = -ENOSPC; + break; + case EFI_DEVICE_ERROR: + err = -EIO; + break; + case EFI_WRITE_PROTECTED: + err = -EROFS; + break; + case EFI_SECURITY_VIOLATION: + err = -EACCES; + break; + case EFI_NOT_FOUND: + err = -ENOENT; + break; + default: + err = -EINVAL; + } + + return err; +} diff --git a/drivers/firmware/efi/vars.c b/drivers/firmware/efi/vars.c index 34b7419..0012331 100644 --- a/drivers/firmware/efi/vars.c +++ b/drivers/firmware/efi/vars.c @@ -329,39 +329,6 @@ check_var_size_nonblocking(u32 attributes, unsigned long size) return fops->query_variable_store(attributes, size, true); } -static int efi_status_to_err(efi_status_t status) -{ - int err; - - switch (status) { - case EFI_SUCCESS: - err = 0; - break; - case EFI_INVALID_PARAMETER: - err = -EINVAL; - break; - case EFI_OUT_OF_RESOURCES: - err = -ENOSPC; - break; - case EFI_DEVICE_ERROR: - err = -EIO; - break; - case EFI_WRITE_PROTECTED: - err = -EROFS; - break; - case EFI_SECURITY_VIOLATION: - err = -EACCES; - break; - case EFI_NOT_FOUND: - err = -ENOENT; - break; - default: - err = -EINVAL; - } - - return err; -} - static bool variable_is_present(efi_char16_t *variable_name, efi_guid_t *vendor, struct list_head *head) { diff --git a/include/linux/efi.h b/include/linux/efi.h index 4db7052..ca47481 100644 --- a/include/linux/efi.h +++ b/include/linux/efi.h @@ -1080,6 +1080,8 @@ static inline void efi_reboot(enum reboot_mode reboot_mode, const char *__unused) {} #endif +extern int efi_status_to_err(efi_status_t status); + /* * Variable Attributes */
[tip:efi/core] efi: Move efi_status_to_err() to drivers/firmware/efi/
Commit-ID: 806b0351c9ff9890c1ef0ba2c46237baef49ac79 Gitweb: http://git.kernel.org/tip/806b0351c9ff9890c1ef0ba2c46237baef49ac79 Author: Matt Fleming AuthorDate: Mon, 25 Apr 2016 21:06:58 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:34:03 +0200 efi: Move efi_status_to_err() to drivers/firmware/efi/ Move efi_status_to_err() to the architecture independent code as it's generally useful in all bits of EFI code where there is a need to convert an efi_status_t to a kernel error value. Signed-off-by: Matt Fleming Acked-by: Ard Biesheuvel Cc: Borislav Petkov Cc: Kweh Hock Leong Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: joeyli Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-27-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- drivers/firmware/efi/efi.c | 33 + drivers/firmware/efi/vars.c | 33 - include/linux/efi.h | 2 ++ 3 files changed, 35 insertions(+), 33 deletions(-) diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index 4991371..05509f3 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -636,3 +636,36 @@ u64 __weak efi_mem_attributes(unsigned long phys_addr) } return 0; } + +int efi_status_to_err(efi_status_t status) +{ + int err; + + switch (status) { + case EFI_SUCCESS: + err = 0; + break; + case EFI_INVALID_PARAMETER: + err = -EINVAL; + break; + case EFI_OUT_OF_RESOURCES: + err = -ENOSPC; + break; + case EFI_DEVICE_ERROR: + err = -EIO; + break; + case EFI_WRITE_PROTECTED: + err = -EROFS; + break; + case EFI_SECURITY_VIOLATION: + err = -EACCES; + break; + case EFI_NOT_FOUND: + err = -ENOENT; + break; + default: + err = -EINVAL; + } + + return err; +} diff --git a/drivers/firmware/efi/vars.c b/drivers/firmware/efi/vars.c index 34b7419..0012331 100644 --- a/drivers/firmware/efi/vars.c +++ b/drivers/firmware/efi/vars.c @@ -329,39 +329,6 @@ check_var_size_nonblocking(u32 attributes, unsigned long size) return fops->query_variable_store(attributes, size, true); } -static int efi_status_to_err(efi_status_t status) -{ - int err; - - switch (status) { - case EFI_SUCCESS: - err = 0; - break; - case EFI_INVALID_PARAMETER: - err = -EINVAL; - break; - case EFI_OUT_OF_RESOURCES: - err = -ENOSPC; - break; - case EFI_DEVICE_ERROR: - err = -EIO; - break; - case EFI_WRITE_PROTECTED: - err = -EROFS; - break; - case EFI_SECURITY_VIOLATION: - err = -EACCES; - break; - case EFI_NOT_FOUND: - err = -ENOENT; - break; - default: - err = -EINVAL; - } - - return err; -} - static bool variable_is_present(efi_char16_t *variable_name, efi_guid_t *vendor, struct list_head *head) { diff --git a/include/linux/efi.h b/include/linux/efi.h index 4db7052..ca47481 100644 --- a/include/linux/efi.h +++ b/include/linux/efi.h @@ -1080,6 +1080,8 @@ static inline void efi_reboot(enum reboot_mode reboot_mode, const char *__unused) {} #endif +extern int efi_status_to_err(efi_status_t status); + /* * Variable Attributes */
[tip:efi/core] x86/efi: Remove the always true EFI_DEBUG symbol
Commit-ID: c3c1c47f15b37a8492e630d1e9ab8ad576ee10e5 Gitweb: http://git.kernel.org/tip/c3c1c47f15b37a8492e630d1e9ab8ad576ee10e5 Author: Matt FlemingAuthorDate: Mon, 25 Apr 2016 21:06:47 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:33:56 +0200 x86/efi: Remove the always true EFI_DEBUG symbol This symbol is always set which makes it useless. Additionally we have a kernel command-line switch, efi=debug, which actually controls the printing of the memory map. Reported-by: Robert Elliott Signed-off-by: Matt Fleming Acked-by: Borislav Petkov Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-16-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi.c | 4 1 file changed, 4 deletions(-) diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index dde46cd..f93545e 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -54,8 +54,6 @@ #include #include -#define EFI_DEBUG - static struct efi efi_phys __initdata; static efi_system_table_t efi_systab __initdata; @@ -222,7 +220,6 @@ int __init efi_memblock_x86_reserve_range(void) void __init efi_print_memmap(void) { -#ifdef EFI_DEBUG efi_memory_desc_t *md; int i = 0; @@ -235,7 +232,6 @@ void __init efi_print_memmap(void) md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1, (md->num_pages >> (20 - EFI_PAGE_SHIFT))); } -#endif /* EFI_DEBUG */ } void __init efi_unmap_memmap(void)
[tip:efi/core] x86/efi: Remove the always true EFI_DEBUG symbol
Commit-ID: c3c1c47f15b37a8492e630d1e9ab8ad576ee10e5 Gitweb: http://git.kernel.org/tip/c3c1c47f15b37a8492e630d1e9ab8ad576ee10e5 Author: Matt Fleming AuthorDate: Mon, 25 Apr 2016 21:06:47 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:33:56 +0200 x86/efi: Remove the always true EFI_DEBUG symbol This symbol is always set which makes it useless. Additionally we have a kernel command-line switch, efi=debug, which actually controls the printing of the memory map. Reported-by: Robert Elliott Signed-off-by: Matt Fleming Acked-by: Borislav Petkov Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-16-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi.c | 4 1 file changed, 4 deletions(-) diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index dde46cd..f93545e 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -54,8 +54,6 @@ #include #include -#define EFI_DEBUG - static struct efi efi_phys __initdata; static efi_system_table_t efi_systab __initdata; @@ -222,7 +220,6 @@ int __init efi_memblock_x86_reserve_range(void) void __init efi_print_memmap(void) { -#ifdef EFI_DEBUG efi_memory_desc_t *md; int i = 0; @@ -235,7 +232,6 @@ void __init efi_print_memmap(void) md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1, (md->num_pages >> (20 - EFI_PAGE_SHIFT))); } -#endif /* EFI_DEBUG */ } void __init efi_unmap_memmap(void)
[tip:efi/core] efi: Remove global 'memmap' EFI memory map
Commit-ID: 884f4f66ffd6ffe632f3a8be4e6d10a858afdc37 Gitweb: http://git.kernel.org/tip/884f4f66ffd6ffe632f3a8be4e6d10a858afdc37 Author: Matt FlemingAuthorDate: Mon, 25 Apr 2016 21:06:39 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:33:51 +0200 efi: Remove global 'memmap' EFI memory map Abolish the poorly named EFI memory map, 'memmap'. It is shadowed by a bunch of local definitions in various files and having two ways to access the EFI memory map ('efi.memmap' vs. 'memmap') is rather confusing. Furthermore, IA64 doesn't even provide this global object, which has caused issues when trying to write generic EFI memmap code. Replace all occurrences with efi.memmap, and convert the remaining iterator code to use for_each_efi_mem_desc(). Signed-off-by: Matt Fleming Reviewed-by: Ard Biesheuvel Cc: Borislav Petkov Cc: Luck, Tony Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-8-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi.c| 84 +- drivers/firmware/efi/arm-init.c| 20 - drivers/firmware/efi/arm-runtime.c | 12 +++--- drivers/firmware/efi/efi.c | 2 +- drivers/firmware/efi/fake_mem.c| 40 +- include/linux/efi.h| 5 +-- 6 files changed, 85 insertions(+), 78 deletions(-) diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 6f49981..88d2fb2 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -56,8 +56,6 @@ #define EFI_DEBUG -struct efi_memory_map memmap; - static struct efi efi_phys __initdata; static efi_system_table_t efi_systab __initdata; @@ -207,15 +205,13 @@ int __init efi_memblock_x86_reserve_range(void) #else pmap = (e->efi_memmap | ((__u64)e->efi_memmap_hi << 32)); #endif - memmap.phys_map = pmap; - memmap.nr_map = e->efi_memmap_size / + efi.memmap.phys_map = pmap; + efi.memmap.nr_map = e->efi_memmap_size / e->efi_memdesc_size; - memmap.desc_size= e->efi_memdesc_size; - memmap.desc_version = e->efi_memdesc_version; - - memblock_reserve(pmap, memmap.nr_map * memmap.desc_size); + efi.memmap.desc_size= e->efi_memdesc_size; + efi.memmap.desc_version = e->efi_memdesc_version; - efi.memmap = + memblock_reserve(pmap, efi.memmap.nr_map * efi.memmap.desc_size); return 0; } @@ -240,10 +236,14 @@ void __init efi_print_memmap(void) void __init efi_unmap_memmap(void) { + unsigned long size; + clear_bit(EFI_MEMMAP, ); - if (memmap.map) { - early_memunmap(memmap.map, memmap.nr_map * memmap.desc_size); - memmap.map = NULL; + + size = efi.memmap.nr_map * efi.memmap.desc_size; + if (efi.memmap.map) { + early_memunmap(efi.memmap.map, size); + efi.memmap.map = NULL; } } @@ -432,17 +432,22 @@ static int __init efi_runtime_init(void) static int __init efi_memmap_init(void) { + unsigned long addr, size; + if (efi_enabled(EFI_PARAVIRT)) return 0; /* Map the EFI memory map */ - memmap.map = early_memremap((unsigned long)memmap.phys_map, - memmap.nr_map * memmap.desc_size); - if (memmap.map == NULL) { + size = efi.memmap.nr_map * efi.memmap.desc_size; + addr = (unsigned long)efi.memmap.phys_map; + + efi.memmap.map = early_memremap(addr, size); + if (efi.memmap.map == NULL) { pr_err("Could not map the memory map!\n"); return -ENOMEM; } - memmap.map_end = memmap.map + (memmap.nr_map * memmap.desc_size); + + efi.memmap.map_end = efi.memmap.map + size; if (add_efi_memmap) do_add_efi_memmap(); @@ -638,6 +643,7 @@ static void __init get_systab_virt_addr(efi_memory_desc_t *md) static void __init save_runtime_map(void) { #ifdef CONFIG_KEXEC_CORE + unsigned long desc_size; efi_memory_desc_t *md; void *tmp, *q = NULL; int count = 0; @@ -645,21 +651,23 @@ static void __init save_runtime_map(void) if (efi_enabled(EFI_OLD_MEMMAP)) return; + desc_size = efi.memmap.desc_size; + for_each_efi_memory_desc(md) { if (!(md->attribute & EFI_MEMORY_RUNTIME) || (md->type == EFI_BOOT_SERVICES_CODE) || (md->type == EFI_BOOT_SERVICES_DATA)) continue; - tmp = krealloc(q, (count + 1) * memmap.desc_size, GFP_KERNEL); + tmp
[tip:efi/core] efi: Iterate over efi.memmap in for_each_efi_memory_desc()
Commit-ID: 78ce248faa3c46e24e9bd42db3ab3650659f16dd Gitweb: http://git.kernel.org/tip/78ce248faa3c46e24e9bd42db3ab3650659f16dd Author: Matt FlemingAuthorDate: Mon, 25 Apr 2016 21:06:38 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:33:50 +0200 efi: Iterate over efi.memmap in for_each_efi_memory_desc() Most of the users of for_each_efi_memory_desc() are equally happy iterating over the EFI memory map in efi.memmap instead of 'memmap', since the former is usually a pointer to the latter. For those users that want to specify an EFI memory map other than efi.memmap, that can be done using for_each_efi_memory_desc_in_map(). One such example is in the libstub code where the firmware is queried directly for the memory map, it gets iterated over, and then freed. This change goes part of the way toward deleting the global 'memmap' variable, which is not universally available on all architectures (notably IA64) and is rather poorly named. Signed-off-by: Matt Fleming Reviewed-by: Ard Biesheuvel Cc: Borislav Petkov Cc: Leif Lindholm Cc: Mark Salter Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-7-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi.c| 43 -- arch/x86/platform/efi/efi_64.c | 10 ++ arch/x86/platform/efi/quirks.c | 10 +++--- drivers/firmware/efi/arm-init.c| 4 +-- drivers/firmware/efi/arm-runtime.c | 2 +- drivers/firmware/efi/efi.c | 6 +--- drivers/firmware/efi/fake_mem.c| 3 +- drivers/firmware/efi/libstub/efi-stub-helper.c | 6 ++-- include/linux/efi.h| 11 ++- 9 files changed, 39 insertions(+), 56 deletions(-) diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index df393ea..6f49981 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -119,11 +119,10 @@ void efi_get_time(struct timespec *now) void __init efi_find_mirror(void) { - void *p; + efi_memory_desc_t *md; u64 mirror_size = 0, total_size = 0; - for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { - efi_memory_desc_t *md = p; + for_each_efi_memory_desc(md) { unsigned long long start = md->phys_addr; unsigned long long size = md->num_pages << EFI_PAGE_SHIFT; @@ -146,10 +145,9 @@ void __init efi_find_mirror(void) static void __init do_add_efi_memmap(void) { - void *p; + efi_memory_desc_t *md; - for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { - efi_memory_desc_t *md = p; + for_each_efi_memory_desc(md) { unsigned long long start = md->phys_addr; unsigned long long size = md->num_pages << EFI_PAGE_SHIFT; int e820_type; @@ -226,17 +224,13 @@ void __init efi_print_memmap(void) { #ifdef EFI_DEBUG efi_memory_desc_t *md; - void *p; - int i; + int i = 0; - for (p = memmap.map, i = 0; -p < memmap.map_end; -p += memmap.desc_size, i++) { + for_each_efi_memory_desc(md) { char buf[64]; - md = p; pr_info("mem%02u: %s range=[0x%016llx-0x%016llx] (%lluMB)\n", - i, efi_md_typeattr_format(buf, sizeof(buf), md), + i++, efi_md_typeattr_format(buf, sizeof(buf), md), md->phys_addr, md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1, (md->num_pages >> (20 - EFI_PAGE_SHIFT))); @@ -550,12 +544,9 @@ void __init efi_set_executable(efi_memory_desc_t *md, bool executable) void __init runtime_code_page_mkexec(void) { efi_memory_desc_t *md; - void *p; /* Make EFI runtime service code area executable */ - for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { - md = p; - + for_each_efi_memory_desc(md) { if (md->type != EFI_RUNTIME_SERVICES_CODE) continue; @@ -602,12 +593,10 @@ void __init old_map_region(efi_memory_desc_t *md) /* Merge contiguous regions of the same type and attribute */ static void __init efi_merge_regions(void) { - void *p; efi_memory_desc_t *md, *prev_md = NULL; - for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { + for_each_efi_memory_desc(md) { u64 prev_size; - md = p; if (!prev_md) { prev_md = md; @@ -650,15
[tip:efi/core] efi: Iterate over efi.memmap in for_each_efi_memory_desc()
Commit-ID: 78ce248faa3c46e24e9bd42db3ab3650659f16dd Gitweb: http://git.kernel.org/tip/78ce248faa3c46e24e9bd42db3ab3650659f16dd Author: Matt Fleming AuthorDate: Mon, 25 Apr 2016 21:06:38 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:33:50 +0200 efi: Iterate over efi.memmap in for_each_efi_memory_desc() Most of the users of for_each_efi_memory_desc() are equally happy iterating over the EFI memory map in efi.memmap instead of 'memmap', since the former is usually a pointer to the latter. For those users that want to specify an EFI memory map other than efi.memmap, that can be done using for_each_efi_memory_desc_in_map(). One such example is in the libstub code where the firmware is queried directly for the memory map, it gets iterated over, and then freed. This change goes part of the way toward deleting the global 'memmap' variable, which is not universally available on all architectures (notably IA64) and is rather poorly named. Signed-off-by: Matt Fleming Reviewed-by: Ard Biesheuvel Cc: Borislav Petkov Cc: Leif Lindholm Cc: Mark Salter Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-7-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi.c| 43 -- arch/x86/platform/efi/efi_64.c | 10 ++ arch/x86/platform/efi/quirks.c | 10 +++--- drivers/firmware/efi/arm-init.c| 4 +-- drivers/firmware/efi/arm-runtime.c | 2 +- drivers/firmware/efi/efi.c | 6 +--- drivers/firmware/efi/fake_mem.c| 3 +- drivers/firmware/efi/libstub/efi-stub-helper.c | 6 ++-- include/linux/efi.h| 11 ++- 9 files changed, 39 insertions(+), 56 deletions(-) diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index df393ea..6f49981 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -119,11 +119,10 @@ void efi_get_time(struct timespec *now) void __init efi_find_mirror(void) { - void *p; + efi_memory_desc_t *md; u64 mirror_size = 0, total_size = 0; - for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { - efi_memory_desc_t *md = p; + for_each_efi_memory_desc(md) { unsigned long long start = md->phys_addr; unsigned long long size = md->num_pages << EFI_PAGE_SHIFT; @@ -146,10 +145,9 @@ void __init efi_find_mirror(void) static void __init do_add_efi_memmap(void) { - void *p; + efi_memory_desc_t *md; - for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { - efi_memory_desc_t *md = p; + for_each_efi_memory_desc(md) { unsigned long long start = md->phys_addr; unsigned long long size = md->num_pages << EFI_PAGE_SHIFT; int e820_type; @@ -226,17 +224,13 @@ void __init efi_print_memmap(void) { #ifdef EFI_DEBUG efi_memory_desc_t *md; - void *p; - int i; + int i = 0; - for (p = memmap.map, i = 0; -p < memmap.map_end; -p += memmap.desc_size, i++) { + for_each_efi_memory_desc(md) { char buf[64]; - md = p; pr_info("mem%02u: %s range=[0x%016llx-0x%016llx] (%lluMB)\n", - i, efi_md_typeattr_format(buf, sizeof(buf), md), + i++, efi_md_typeattr_format(buf, sizeof(buf), md), md->phys_addr, md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1, (md->num_pages >> (20 - EFI_PAGE_SHIFT))); @@ -550,12 +544,9 @@ void __init efi_set_executable(efi_memory_desc_t *md, bool executable) void __init runtime_code_page_mkexec(void) { efi_memory_desc_t *md; - void *p; /* Make EFI runtime service code area executable */ - for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { - md = p; - + for_each_efi_memory_desc(md) { if (md->type != EFI_RUNTIME_SERVICES_CODE) continue; @@ -602,12 +593,10 @@ void __init old_map_region(efi_memory_desc_t *md) /* Merge contiguous regions of the same type and attribute */ static void __init efi_merge_regions(void) { - void *p; efi_memory_desc_t *md, *prev_md = NULL; - for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { + for_each_efi_memory_desc(md) { u64 prev_size; - md = p; if (!prev_md) { prev_md = md; @@ -650,15 +639,13 @@ static void __init save_runtime_map(void) { #ifdef CONFIG_KEXEC_CORE efi_memory_desc_t *md; - void *tmp, *p, *q = NULL; + void *tmp, *q = NULL; int count = 0; if
[tip:efi/core] efi: Remove global 'memmap' EFI memory map
Commit-ID: 884f4f66ffd6ffe632f3a8be4e6d10a858afdc37 Gitweb: http://git.kernel.org/tip/884f4f66ffd6ffe632f3a8be4e6d10a858afdc37 Author: Matt Fleming AuthorDate: Mon, 25 Apr 2016 21:06:39 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:33:51 +0200 efi: Remove global 'memmap' EFI memory map Abolish the poorly named EFI memory map, 'memmap'. It is shadowed by a bunch of local definitions in various files and having two ways to access the EFI memory map ('efi.memmap' vs. 'memmap') is rather confusing. Furthermore, IA64 doesn't even provide this global object, which has caused issues when trying to write generic EFI memmap code. Replace all occurrences with efi.memmap, and convert the remaining iterator code to use for_each_efi_mem_desc(). Signed-off-by: Matt Fleming Reviewed-by: Ard Biesheuvel Cc: Borislav Petkov Cc: Luck, Tony Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-8-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi.c| 84 +- drivers/firmware/efi/arm-init.c| 20 - drivers/firmware/efi/arm-runtime.c | 12 +++--- drivers/firmware/efi/efi.c | 2 +- drivers/firmware/efi/fake_mem.c| 40 +- include/linux/efi.h| 5 +-- 6 files changed, 85 insertions(+), 78 deletions(-) diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 6f49981..88d2fb2 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -56,8 +56,6 @@ #define EFI_DEBUG -struct efi_memory_map memmap; - static struct efi efi_phys __initdata; static efi_system_table_t efi_systab __initdata; @@ -207,15 +205,13 @@ int __init efi_memblock_x86_reserve_range(void) #else pmap = (e->efi_memmap | ((__u64)e->efi_memmap_hi << 32)); #endif - memmap.phys_map = pmap; - memmap.nr_map = e->efi_memmap_size / + efi.memmap.phys_map = pmap; + efi.memmap.nr_map = e->efi_memmap_size / e->efi_memdesc_size; - memmap.desc_size= e->efi_memdesc_size; - memmap.desc_version = e->efi_memdesc_version; - - memblock_reserve(pmap, memmap.nr_map * memmap.desc_size); + efi.memmap.desc_size= e->efi_memdesc_size; + efi.memmap.desc_version = e->efi_memdesc_version; - efi.memmap = + memblock_reserve(pmap, efi.memmap.nr_map * efi.memmap.desc_size); return 0; } @@ -240,10 +236,14 @@ void __init efi_print_memmap(void) void __init efi_unmap_memmap(void) { + unsigned long size; + clear_bit(EFI_MEMMAP, ); - if (memmap.map) { - early_memunmap(memmap.map, memmap.nr_map * memmap.desc_size); - memmap.map = NULL; + + size = efi.memmap.nr_map * efi.memmap.desc_size; + if (efi.memmap.map) { + early_memunmap(efi.memmap.map, size); + efi.memmap.map = NULL; } } @@ -432,17 +432,22 @@ static int __init efi_runtime_init(void) static int __init efi_memmap_init(void) { + unsigned long addr, size; + if (efi_enabled(EFI_PARAVIRT)) return 0; /* Map the EFI memory map */ - memmap.map = early_memremap((unsigned long)memmap.phys_map, - memmap.nr_map * memmap.desc_size); - if (memmap.map == NULL) { + size = efi.memmap.nr_map * efi.memmap.desc_size; + addr = (unsigned long)efi.memmap.phys_map; + + efi.memmap.map = early_memremap(addr, size); + if (efi.memmap.map == NULL) { pr_err("Could not map the memory map!\n"); return -ENOMEM; } - memmap.map_end = memmap.map + (memmap.nr_map * memmap.desc_size); + + efi.memmap.map_end = efi.memmap.map + size; if (add_efi_memmap) do_add_efi_memmap(); @@ -638,6 +643,7 @@ static void __init get_systab_virt_addr(efi_memory_desc_t *md) static void __init save_runtime_map(void) { #ifdef CONFIG_KEXEC_CORE + unsigned long desc_size; efi_memory_desc_t *md; void *tmp, *q = NULL; int count = 0; @@ -645,21 +651,23 @@ static void __init save_runtime_map(void) if (efi_enabled(EFI_OLD_MEMMAP)) return; + desc_size = efi.memmap.desc_size; + for_each_efi_memory_desc(md) { if (!(md->attribute & EFI_MEMORY_RUNTIME) || (md->type == EFI_BOOT_SERVICES_CODE) || (md->type == EFI_BOOT_SERVICES_DATA)) continue; - tmp = krealloc(q, (count + 1) * memmap.desc_size, GFP_KERNEL); + tmp = krealloc(q, (count + 1) * desc_size, GFP_KERNEL); if (!tmp) goto out; q = tmp; - memcpy(q + count * memmap.desc_size,
[tip:efi/core] x86/mm/pat: Document the (currently) EFI-only code path
Commit-ID: 7fc8442f2a8a77f40565b42c41e4f2d48b179a56 Gitweb: http://git.kernel.org/tip/7fc8442f2a8a77f40565b42c41e4f2d48b179a56 Author: Matt FlemingAuthorDate: Mon, 25 Apr 2016 21:06:35 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:33:48 +0200 x86/mm/pat: Document the (currently) EFI-only code path It's not at all obvious that populate_pgd() and friends are only executed when mapping EFI virtual memory regions or that no other pageattr callers pass a ->pgd value. Reported-by: Andy Lutomirski Signed-off-by: Matt Fleming Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-4-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/mm/pageattr.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 01be9ec..a1f0e1d 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1125,8 +1125,14 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr) static int __cpa_process_fault(struct cpa_data *cpa, unsigned long vaddr, int primary) { - if (cpa->pgd) + if (cpa->pgd) { + /* +* Right now, we only execute this code path when mapping +* the EFI virtual memory map regions, no other users +* provide a ->pgd value. This may change in the future. +*/ return populate_pgd(cpa, vaddr); + } /* * Ignore all non primary paths.
[tip:efi/core] x86/mm/pat: Document the (currently) EFI-only code path
Commit-ID: 7fc8442f2a8a77f40565b42c41e4f2d48b179a56 Gitweb: http://git.kernel.org/tip/7fc8442f2a8a77f40565b42c41e4f2d48b179a56 Author: Matt Fleming AuthorDate: Mon, 25 Apr 2016 21:06:35 +0100 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 11:33:48 +0200 x86/mm/pat: Document the (currently) EFI-only code path It's not at all obvious that populate_pgd() and friends are only executed when mapping EFI virtual memory regions or that no other pageattr callers pass a ->pgd value. Reported-by: Andy Lutomirski Signed-off-by: Matt Fleming Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1461614832-17633-4-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/mm/pageattr.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 01be9ec..a1f0e1d 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1125,8 +1125,14 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr) static int __cpa_process_fault(struct cpa_data *cpa, unsigned long vaddr, int primary) { - if (cpa->pgd) + if (cpa->pgd) { + /* +* Right now, we only execute this code path when mapping +* the EFI virtual memory map regions, no other users +* provide a ->pgd value. This may change in the future. +*/ return populate_pgd(cpa, vaddr); + } /* * Ignore all non primary paths.
[tip:sched/urgent] sched/fair: Add comments to explain select_idle_sibling()
Commit-ID: d4335581dc30ec6545999c7443bb9fead274a980 Gitweb: http://git.kernel.org/tip/d4335581dc30ec6545999c7443bb9fead274a980 Author: Matt FlemingAuthorDate: Wed, 9 Mar 2016 14:59:08 + Committer: Ingo Molnar CommitDate: Mon, 21 Mar 2016 10:52:51 +0100 sched/fair: Add comments to explain select_idle_sibling() It's not entirely obvious how the main loop in select_idle_sibling() works on first glance. Sprinkle a few comments to explain the design and intention behind the loop based on some conversations with Mike and Peter. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mel Gorman Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1457535548-15329-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 19 ++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3c114d9..303d639 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5055,7 +5055,19 @@ static int select_idle_sibling(struct task_struct *p, int target) return i; /* -* Otherwise, iterate the domains and find an elegible idle cpu. +* Otherwise, iterate the domains and find an eligible idle cpu. +* +* A completely idle sched group at higher domains is more +* desirable than an idle group at a lower level, because lower +* domains have smaller groups and usually share hardware +* resources which causes tasks to contend on them, e.g. x86 +* hyperthread siblings in the lowest domain (SMT) can contend +* on the shared cpu pipeline. +* +* However, while we prefer idle groups at higher domains +* finding an idle cpu at the lowest domain is still better than +* returning 'target', which we've already established, isn't +* idle. */ sd = rcu_dereference(per_cpu(sd_llc, target)); for_each_lower_domain(sd) { @@ -5065,11 +5077,16 @@ static int select_idle_sibling(struct task_struct *p, int target) tsk_cpus_allowed(p))) goto next; + /* Ensure the entire group is idle */ for_each_cpu(i, sched_group_cpus(sg)) { if (i == target || !idle_cpu(i)) goto next; } + /* +* It doesn't matter which cpu we pick, the +* whole group is idle. +*/ target = cpumask_first_and(sched_group_cpus(sg), tsk_cpus_allowed(p)); goto done;
[tip:sched/urgent] sched/fair: Add comments to explain select_idle_sibling()
Commit-ID: d4335581dc30ec6545999c7443bb9fead274a980 Gitweb: http://git.kernel.org/tip/d4335581dc30ec6545999c7443bb9fead274a980 Author: Matt Fleming AuthorDate: Wed, 9 Mar 2016 14:59:08 + Committer: Ingo Molnar CommitDate: Mon, 21 Mar 2016 10:52:51 +0100 sched/fair: Add comments to explain select_idle_sibling() It's not entirely obvious how the main loop in select_idle_sibling() works on first glance. Sprinkle a few comments to explain the design and intention behind the loop based on some conversations with Mike and Peter. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mel Gorman Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1457535548-15329-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 19 ++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3c114d9..303d639 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5055,7 +5055,19 @@ static int select_idle_sibling(struct task_struct *p, int target) return i; /* -* Otherwise, iterate the domains and find an elegible idle cpu. +* Otherwise, iterate the domains and find an eligible idle cpu. +* +* A completely idle sched group at higher domains is more +* desirable than an idle group at a lower level, because lower +* domains have smaller groups and usually share hardware +* resources which causes tasks to contend on them, e.g. x86 +* hyperthread siblings in the lowest domain (SMT) can contend +* on the shared cpu pipeline. +* +* However, while we prefer idle groups at higher domains +* finding an idle cpu at the lowest domain is still better than +* returning 'target', which we've already established, isn't +* idle. */ sd = rcu_dereference(per_cpu(sd_llc, target)); for_each_lower_domain(sd) { @@ -5065,11 +5077,16 @@ static int select_idle_sibling(struct task_struct *p, int target) tsk_cpus_allowed(p))) goto next; + /* Ensure the entire group is idle */ for_each_cpu(i, sched_group_cpus(sg)) { if (i == target || !idle_cpu(i)) goto next; } + /* +* It doesn't matter which cpu we pick, the +* whole group is idle. +*/ target = cpumask_first_and(sched_group_cpus(sg), tsk_cpus_allowed(p)); goto done;
[tip:efi/core] x86/mm/pat: Fix boot crash when 1GB pages are not supported by the CPU
Commit-ID: d367cef0a7f0c6ee86e997c0cb455b21b3c6b9ba Gitweb: http://git.kernel.org/tip/d367cef0a7f0c6ee86e997c0cb455b21b3c6b9ba Author: Matt FlemingAuthorDate: Mon, 14 Mar 2016 10:33:01 + Committer: Ingo Molnar CommitDate: Wed, 16 Mar 2016 09:00:49 +0100 x86/mm/pat: Fix boot crash when 1GB pages are not supported by the CPU Scott reports that with the new separate EFI page tables he's seeing the following error on boot, caused by setting reserved bits in the page table structures (fault code is PF_RSVD | PF_PROT), swapper/0: Corrupted page table at address 17b102020 PGD 17b0e5063 PUD 140e3 Bad pagetable: 0009 [#1] SMP On first inspection the PUD is using a 1GB page size (_PAGE_PSE) and looks fine but that's only true if support for 1GB PUD pages ("pdpe1gb") is present in the CPU. Scott's Intel Celeron N2820 does not have that feature and so the _PAGE_PSE bit is reserved. Fix this issue by making the 1GB mapping code in conditional on "cpu_has_gbpages". This issue didn't come up in the past because the required mapping for the faulting address (0x17b102020) will already have been setup by the kernel in early boot before we got to efi_map_regions(), but we no longer use the standard kernel page tables during EFI calls. Reported-by: Scott Ashcroft Tested-by: Scott Ashcroft Signed-off-by: Matt Fleming Acked-by: Borislav Petkov Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ben Hutchings Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Maarten Lankhorst Cc: Matthew Garrett Cc: Peter Zijlstra Cc: Raphael Hertzog Cc: Roger Shimizu Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1457951581-27353-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/mm/pageattr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 14c38ae..fcf8e29 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1055,7 +1055,7 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd, /* * Map everything starting from the Gb boundary, possibly with 1G pages */ - while (end - start >= PUD_SIZE) { + while (cpu_has_gbpages && end - start >= PUD_SIZE) { set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE | massage_pgprot(pud_pgprot)));
[tip:efi/core] x86/mm/pat: Fix boot crash when 1GB pages are not supported by the CPU
Commit-ID: d367cef0a7f0c6ee86e997c0cb455b21b3c6b9ba Gitweb: http://git.kernel.org/tip/d367cef0a7f0c6ee86e997c0cb455b21b3c6b9ba Author: Matt Fleming AuthorDate: Mon, 14 Mar 2016 10:33:01 + Committer: Ingo Molnar CommitDate: Wed, 16 Mar 2016 09:00:49 +0100 x86/mm/pat: Fix boot crash when 1GB pages are not supported by the CPU Scott reports that with the new separate EFI page tables he's seeing the following error on boot, caused by setting reserved bits in the page table structures (fault code is PF_RSVD | PF_PROT), swapper/0: Corrupted page table at address 17b102020 PGD 17b0e5063 PUD 140e3 Bad pagetable: 0009 [#1] SMP On first inspection the PUD is using a 1GB page size (_PAGE_PSE) and looks fine but that's only true if support for 1GB PUD pages ("pdpe1gb") is present in the CPU. Scott's Intel Celeron N2820 does not have that feature and so the _PAGE_PSE bit is reserved. Fix this issue by making the 1GB mapping code in conditional on "cpu_has_gbpages". This issue didn't come up in the past because the required mapping for the faulting address (0x17b102020) will already have been setup by the kernel in early boot before we got to efi_map_regions(), but we no longer use the standard kernel page tables during EFI calls. Reported-by: Scott Ashcroft Tested-by: Scott Ashcroft Signed-off-by: Matt Fleming Acked-by: Borislav Petkov Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ben Hutchings Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Maarten Lankhorst Cc: Matthew Garrett Cc: Peter Zijlstra Cc: Raphael Hertzog Cc: Roger Shimizu Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1457951581-27353-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/mm/pageattr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 14c38ae..fcf8e29 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1055,7 +1055,7 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd, /* * Map everything starting from the Gb boundary, possibly with 1G pages */ - while (end - start >= PUD_SIZE) { + while (cpu_has_gbpages && end - start >= PUD_SIZE) { set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE | massage_pgprot(pud_pgprot)));
[tip:x86/urgent] x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables
Commit-ID: 452308de61056a539352a9306c46716d7af8a1f1 Gitweb: http://git.kernel.org/tip/452308de61056a539352a9306c46716d7af8a1f1 Author: Matt FlemingAuthorDate: Fri, 11 Mar 2016 11:19:23 + Committer: Ingo Molnar CommitDate: Sat, 12 Mar 2016 16:57:45 +0100 x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables Some machines have EFI regions in page zero (physical address 0x) and historically that region has been added to the e820 map via trim_bios_range(), and ultimately mapped into the kernel page tables. It was not mapped via efi_map_regions() as one would expect. Alexis reports that with the new separate EFI page tables some boot services regions, such as page zero, are not mapped. This triggers an oops during the SetVirtualAddressMap() runtime call. For the EFI boot services quirk on x86 we need to memblock_reserve() boot services regions until after SetVirtualAddressMap(). Doing that while respecting the ownership of regions that may have already been reserved by the kernel was the motivation behind this commit: 7d68dc3f1003 ("x86, efi: Do not reserve boot services regions within reserved areas") That patch was merged at a time when the EFI runtime virtual mappings were inserted into the kernel page tables as described above, and the trick of setting ->numpages (and hence the region size) to zero to track regions that should not be freed in efi_free_boot_services() meant that we never mapped those regions in efi_map_regions(). Instead we were relying solely on the existing kernel mappings. Now that we have separate page tables we need to make sure the EFI boot services regions are mapped correctly, even if someone else has already called memblock_reserve(). Instead of stashing a tag in ->numpages, set the EFI_MEMORY_RUNTIME bit of ->attribute. Since it generally makes no sense to mark a boot services region as required at runtime, it's pretty much guaranteed the firmware will not have already set this bit. For the record, the specific circumstances under which Alexis triggered this bug was that an EFI runtime driver on his machine was responding to the EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE event during SetVirtualAddressMap(). The event handler for this driver looks like this, sub rsp,0x28 lea rdx,[rip+0x2445] # 0xaa948720 mov ecx,0x4 call func_aa9447c0 ; call to ConvertPointer(4, & 0xaa948720) mov r11,QWORD PTR [rip+0x2434] # 0xaa948720 xor eax,eax mov BYTE PTR [r11+0x1],0x1 add rsp,0x28 ret Which is pretty typical code for an EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE handler. The "mov r11, QWORD PTR [rip+0x2424]" was the faulting instruction because ConvertPointer() was being called to convert the address 0x, which when converted is left unchanged and remains 0x. The output of the oops trace gave the impression of a standard NULL pointer dereference bug, but because we're accessing physical addresses during ConvertPointer(), it wasn't. EFI boot services code is stored at that address on Alexis' machine. Reported-by: Alexis Murzeau Signed-off-by: Matt Fleming Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ben Hutchings Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Maarten Lankhorst Cc: Matthew Garrett Cc: Peter Zijlstra Cc: Raphael Hertzog Cc: Roger Shimizu Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1457695163-29632-2-git-send-email-m...@codeblueprint.co.uk Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=815125 Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/quirks.c | 79 +- 1 file changed, 62 insertions(+), 17 deletions(-) diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c index 2d66db8..ed30e79 100644 --- a/arch/x86/platform/efi/quirks.c +++ b/arch/x86/platform/efi/quirks.c @@ -131,6 +131,27 @@ efi_status_t efi_query_variable_store(u32 attributes, unsigned long size) EXPORT_SYMBOL_GPL(efi_query_variable_store); /* + * Helper function for efi_reserve_boot_services() to figure out if we + * can free regions in efi_free_boot_services(). + * + * Use this function to ensure we do not free regions owned by somebody + * else. We must only reserve (and then free) regions: + * + * - Not within any part of the kernel + * - Not the BIOS reserved area (E820_RESERVED, E820_NVS, etc) + */ +static bool can_free_region(u64 start, u64 size) +{ + if (start + size > __pa_symbol(_text) && start <=
[tip:x86/urgent] x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables
Commit-ID: 452308de61056a539352a9306c46716d7af8a1f1 Gitweb: http://git.kernel.org/tip/452308de61056a539352a9306c46716d7af8a1f1 Author: Matt Fleming AuthorDate: Fri, 11 Mar 2016 11:19:23 + Committer: Ingo Molnar CommitDate: Sat, 12 Mar 2016 16:57:45 +0100 x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables Some machines have EFI regions in page zero (physical address 0x) and historically that region has been added to the e820 map via trim_bios_range(), and ultimately mapped into the kernel page tables. It was not mapped via efi_map_regions() as one would expect. Alexis reports that with the new separate EFI page tables some boot services regions, such as page zero, are not mapped. This triggers an oops during the SetVirtualAddressMap() runtime call. For the EFI boot services quirk on x86 we need to memblock_reserve() boot services regions until after SetVirtualAddressMap(). Doing that while respecting the ownership of regions that may have already been reserved by the kernel was the motivation behind this commit: 7d68dc3f1003 ("x86, efi: Do not reserve boot services regions within reserved areas") That patch was merged at a time when the EFI runtime virtual mappings were inserted into the kernel page tables as described above, and the trick of setting ->numpages (and hence the region size) to zero to track regions that should not be freed in efi_free_boot_services() meant that we never mapped those regions in efi_map_regions(). Instead we were relying solely on the existing kernel mappings. Now that we have separate page tables we need to make sure the EFI boot services regions are mapped correctly, even if someone else has already called memblock_reserve(). Instead of stashing a tag in ->numpages, set the EFI_MEMORY_RUNTIME bit of ->attribute. Since it generally makes no sense to mark a boot services region as required at runtime, it's pretty much guaranteed the firmware will not have already set this bit. For the record, the specific circumstances under which Alexis triggered this bug was that an EFI runtime driver on his machine was responding to the EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE event during SetVirtualAddressMap(). The event handler for this driver looks like this, sub rsp,0x28 lea rdx,[rip+0x2445] # 0xaa948720 mov ecx,0x4 call func_aa9447c0 ; call to ConvertPointer(4, & 0xaa948720) mov r11,QWORD PTR [rip+0x2434] # 0xaa948720 xor eax,eax mov BYTE PTR [r11+0x1],0x1 add rsp,0x28 ret Which is pretty typical code for an EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE handler. The "mov r11, QWORD PTR [rip+0x2424]" was the faulting instruction because ConvertPointer() was being called to convert the address 0x, which when converted is left unchanged and remains 0x. The output of the oops trace gave the impression of a standard NULL pointer dereference bug, but because we're accessing physical addresses during ConvertPointer(), it wasn't. EFI boot services code is stored at that address on Alexis' machine. Reported-by: Alexis Murzeau Signed-off-by: Matt Fleming Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ben Hutchings Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Maarten Lankhorst Cc: Matthew Garrett Cc: Peter Zijlstra Cc: Raphael Hertzog Cc: Roger Shimizu Cc: Thomas Gleixner Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1457695163-29632-2-git-send-email-m...@codeblueprint.co.uk Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=815125 Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/quirks.c | 79 +- 1 file changed, 62 insertions(+), 17 deletions(-) diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c index 2d66db8..ed30e79 100644 --- a/arch/x86/platform/efi/quirks.c +++ b/arch/x86/platform/efi/quirks.c @@ -131,6 +131,27 @@ efi_status_t efi_query_variable_store(u32 attributes, unsigned long size) EXPORT_SYMBOL_GPL(efi_query_variable_store); /* + * Helper function for efi_reserve_boot_services() to figure out if we + * can free regions in efi_free_boot_services(). + * + * Use this function to ensure we do not free regions owned by somebody + * else. We must only reserve (and then free) regions: + * + * - Not within any part of the kernel + * - Not the BIOS reserved area (E820_RESERVED, E820_NVS, etc) + */ +static bool can_free_region(u64 start, u64 size) +{ + if (start + size > __pa_symbol(_text) && start <= __pa_symbol(_end)) + return false; + + if (!e820_all_mapped(start, start+size, E820_RAM)) + return false; + + return true; +} + +/* * The UEFI specification makes it clear that the operating system is free to do * whatever it wants with boot services code after ExitBootServices() has been * called. Ignoring this recommendation a significant bunch of EFI implementations @@
[tip:x86/urgent] x86/mm/pat: Avoid truncation when converting cpa->numpages to address
Commit-ID: 742563777e8da62197d6cb4b99f4027f59454735 Gitweb: http://git.kernel.org/tip/742563777e8da62197d6cb4b99f4027f59454735 Author: Matt Fleming AuthorDate: Fri, 29 Jan 2016 11:36:10 + Committer: Thomas Gleixner CommitDate: Fri, 29 Jan 2016 15:03:09 +0100 x86/mm/pat: Avoid truncation when converting cpa->numpages to address There are a couple of nasty truncation bugs lurking in the pageattr code that can be triggered when mapping EFI regions, e.g. when we pass a cpa->pgd pointer. Because cpa->numpages is a 32-bit value, shifting left by PAGE_SHIFT will truncate the resultant address to 32-bits. Viorel-Cătălin managed to trigger this bug on his Dell machine that provides a ~5GB EFI region which requires 1236992 pages to be mapped. When calling populate_pud() the end of the region gets calculated incorrectly in the following buggy expression, end = start + (cpa->numpages << PAGE_SHIFT); And only 188416 pages are mapped. Next, populate_pud() gets invoked for a second time because of the loop in __change_page_attr_set_clr(), only this time no pages get mapped because shifting the remaining number of pages (1048576) by PAGE_SHIFT is zero. At which point the loop in __change_page_attr_set_clr() spins forever because we fail to map progress. Hitting this bug depends very much on the virtual address we pick to map the large region at and how many pages we map on the initial run through the loop. This explains why this issue was only recently hit with the introduction of commit a5caa209ba9c ("x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down") It's interesting to note that safe uses of cpa->numpages do exist in the pageattr code. If instead of shifting ->numpages we multiply by PAGE_SIZE, no truncation occurs because PAGE_SIZE is a UL value, and so the result is unsigned long. To avoid surprises when users try to convert very large cpa->numpages values to addresses, change the data type from 'int' to 'unsigned long', thereby making it suitable for shifting by PAGE_SHIFT without any type casting. The alternative would be to make liberal use of casting, but that is far more likely to cause problems in the future when someone adds more code and fails to cast properly; this bug was difficult enough to track down in the first place. Reported-and-tested-by: Viorel-Cătălin Răpițeanu Acked-by: Borislav Petkov Cc: Sai Praneeth Prakhya Cc: Signed-off-by: Matt Fleming Link: https://bugzilla.kernel.org/show_bug.cgi?id=110131 Link: http://lkml.kernel.org/r/1454067370-10374-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Thomas Gleixner --- arch/x86/mm/pageattr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index fc6a4c8..2440814 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -33,7 +33,7 @@ struct cpa_data { pgd_t *pgd; pgprot_tmask_set; pgprot_tmask_clr; - int numpages; + unsigned long numpages; int flags; unsigned long pfn; unsignedforce_split : 1; @@ -1350,7 +1350,7 @@ static int __change_page_attr_set_clr(struct cpa_data *cpa, int checkalias) * CPA operation. Either a large page has been * preserved or a single page update happened. */ - BUG_ON(cpa->numpages > numpages); + BUG_ON(cpa->numpages > numpages || !cpa->numpages); numpages -= cpa->numpages; if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY)) cpa->curpage++;
[tip:x86/urgent] x86/mm/pat: Avoid truncation when converting cpa->numpages to address
Commit-ID: 742563777e8da62197d6cb4b99f4027f59454735 Gitweb: http://git.kernel.org/tip/742563777e8da62197d6cb4b99f4027f59454735 Author: Matt FlemingAuthorDate: Fri, 29 Jan 2016 11:36:10 + Committer: Thomas Gleixner CommitDate: Fri, 29 Jan 2016 15:03:09 +0100 x86/mm/pat: Avoid truncation when converting cpa->numpages to address There are a couple of nasty truncation bugs lurking in the pageattr code that can be triggered when mapping EFI regions, e.g. when we pass a cpa->pgd pointer. Because cpa->numpages is a 32-bit value, shifting left by PAGE_SHIFT will truncate the resultant address to 32-bits. Viorel-Cătălin managed to trigger this bug on his Dell machine that provides a ~5GB EFI region which requires 1236992 pages to be mapped. When calling populate_pud() the end of the region gets calculated incorrectly in the following buggy expression, end = start + (cpa->numpages << PAGE_SHIFT); And only 188416 pages are mapped. Next, populate_pud() gets invoked for a second time because of the loop in __change_page_attr_set_clr(), only this time no pages get mapped because shifting the remaining number of pages (1048576) by PAGE_SHIFT is zero. At which point the loop in __change_page_attr_set_clr() spins forever because we fail to map progress. Hitting this bug depends very much on the virtual address we pick to map the large region at and how many pages we map on the initial run through the loop. This explains why this issue was only recently hit with the introduction of commit a5caa209ba9c ("x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down") It's interesting to note that safe uses of cpa->numpages do exist in the pageattr code. If instead of shifting ->numpages we multiply by PAGE_SIZE, no truncation occurs because PAGE_SIZE is a UL value, and so the result is unsigned long. To avoid surprises when users try to convert very large cpa->numpages values to addresses, change the data type from 'int' to 'unsigned long', thereby making it suitable for shifting by PAGE_SHIFT without any type casting. The alternative would be to make liberal use of casting, but that is far more likely to cause problems in the future when someone adds more code and fails to cast properly; this bug was difficult enough to track down in the first place. Reported-and-tested-by: Viorel-Cătălin Răpițeanu Acked-by: Borislav Petkov Cc: Sai Praneeth Prakhya Cc: Signed-off-by: Matt Fleming Link: https://bugzilla.kernel.org/show_bug.cgi?id=110131 Link: http://lkml.kernel.org/r/1454067370-10374-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Thomas Gleixner --- arch/x86/mm/pageattr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index fc6a4c8..2440814 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -33,7 +33,7 @@ struct cpa_data { pgd_t *pgd; pgprot_tmask_set; pgprot_tmask_clr; - int numpages; + unsigned long numpages; int flags; unsigned long pfn; unsignedforce_split : 1; @@ -1350,7 +1350,7 @@ static int __change_page_attr_set_clr(struct cpa_data *cpa, int checkalias) * CPA operation. Either a large page has been * preserved or a single page update happened. */ - BUG_ON(cpa->numpages > numpages); + BUG_ON(cpa->numpages > numpages || !cpa->numpages); numpages -= cpa->numpages; if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY)) cpa->curpage++;
[tip:efi/core] x86/efi: Setup separate EFI page tables in kexec paths
Commit-ID: 753b11ef8e92a1c1bbe97f2a5ec14bdd1ef2e6fe Gitweb: http://git.kernel.org/tip/753b11ef8e92a1c1bbe97f2a5ec14bdd1ef2e6fe Author: Matt Fleming AuthorDate: Thu, 21 Jan 2016 14:11:59 + Committer: Ingo Molnar CommitDate: Thu, 21 Jan 2016 21:01:34 +0100 x86/efi: Setup separate EFI page tables in kexec paths The switch to using a new dedicated page table for EFI runtime calls in commit commit 67a9108ed431 ("x86/efi: Build our own page table structures") failed to take into account changes required for the kexec code paths, which are unfortunately duplicated in the EFI code. Call the allocation and setup functions in kexec_enter_virtual_mode() just like we do for __efi_enter_virtual_mode() to avoid hitting NULL-pointer dereferences when making EFI runtime calls. At the very least, the call to efi_setup_page_tables() should have existed for kexec before the following commit: 67a9108ed431 ("x86/efi: Build our own page table structures") Things just magically worked because we were actually using the kernel's page tables that contained the required mappings. Reported-by: Srikar Dronamraju Tested-by: Srikar Dronamraju Signed-off-by: Matt Fleming Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Young Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1453385519-11477-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 3c1f3cd..bdd9477 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -815,6 +815,7 @@ static void __init kexec_enter_virtual_mode(void) { #ifdef CONFIG_KEXEC_CORE efi_memory_desc_t *md; + unsigned int num_pages; void *p; efi.systab = NULL; @@ -829,6 +830,12 @@ static void __init kexec_enter_virtual_mode(void) return; } + if (efi_alloc_page_tables()) { + pr_err("Failed to allocate EFI page tables\n"); + clear_bit(EFI_RUNTIME_SERVICES, ); + return; + } + /* * Map efi regions which were passed via setup_data. The virt_addr is a * fixed addr which was used in first kernel of a kexec boot. @@ -843,6 +850,14 @@ static void __init kexec_enter_virtual_mode(void) BUG_ON(!efi.systab); + num_pages = ALIGN(memmap.nr_map * memmap.desc_size, PAGE_SIZE); + num_pages >>= PAGE_SHIFT; + + if (efi_setup_page_tables(memmap.phys_map, num_pages)) { + clear_bit(EFI_RUNTIME_SERVICES, ); + return; + } + efi_sync_low_kernel_mappings(); /*
[tip:efi/core] x86/efi: Setup separate EFI page tables in kexec paths
Commit-ID: 753b11ef8e92a1c1bbe97f2a5ec14bdd1ef2e6fe Gitweb: http://git.kernel.org/tip/753b11ef8e92a1c1bbe97f2a5ec14bdd1ef2e6fe Author: Matt FlemingAuthorDate: Thu, 21 Jan 2016 14:11:59 + Committer: Ingo Molnar CommitDate: Thu, 21 Jan 2016 21:01:34 +0100 x86/efi: Setup separate EFI page tables in kexec paths The switch to using a new dedicated page table for EFI runtime calls in commit commit 67a9108ed431 ("x86/efi: Build our own page table structures") failed to take into account changes required for the kexec code paths, which are unfortunately duplicated in the EFI code. Call the allocation and setup functions in kexec_enter_virtual_mode() just like we do for __efi_enter_virtual_mode() to avoid hitting NULL-pointer dereferences when making EFI runtime calls. At the very least, the call to efi_setup_page_tables() should have existed for kexec before the following commit: 67a9108ed431 ("x86/efi: Build our own page table structures") Things just magically worked because we were actually using the kernel's page tables that contained the required mappings. Reported-by: Srikar Dronamraju Tested-by: Srikar Dronamraju Signed-off-by: Matt Fleming Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Young Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1453385519-11477-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 3c1f3cd..bdd9477 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -815,6 +815,7 @@ static void __init kexec_enter_virtual_mode(void) { #ifdef CONFIG_KEXEC_CORE efi_memory_desc_t *md; + unsigned int num_pages; void *p; efi.systab = NULL; @@ -829,6 +830,12 @@ static void __init kexec_enter_virtual_mode(void) return; } + if (efi_alloc_page_tables()) { + pr_err("Failed to allocate EFI page tables\n"); + clear_bit(EFI_RUNTIME_SERVICES, ); + return; + } + /* * Map efi regions which were passed via setup_data. The virt_addr is a * fixed addr which was used in first kernel of a kexec boot. @@ -843,6 +850,14 @@ static void __init kexec_enter_virtual_mode(void) BUG_ON(!efi.systab); + num_pages = ALIGN(memmap.nr_map * memmap.desc_size, PAGE_SIZE); + num_pages >>= PAGE_SHIFT; + + if (efi_setup_page_tables(memmap.phys_map, num_pages)) { + clear_bit(EFI_RUNTIME_SERVICES, ); + return; + } + efi_sync_low_kernel_mappings(); /*
[tip:x86/efi] x86/efi-bgrt: Replace early_memremap() with memremap()
Commit-ID: e2c90dd7e11e3025b46719a79fb4bb1e7a5cef9f Gitweb: http://git.kernel.org/tip/e2c90dd7e11e3025b46719a79fb4bb1e7a5cef9f Author: Matt Fleming AuthorDate: Mon, 21 Dec 2015 14:12:52 + Committer: Thomas Gleixner CommitDate: Wed, 6 Jan 2016 18:28:52 +0100 x86/efi-bgrt: Replace early_memremap() with memremap() Môshe reported the following warning triggered on his machine since commit 50a0cb565246 ("x86/efi-bgrt: Fix kernel panic when mapping BGRT data"), [0.026936] [ cut here ] [0.026941] WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:137 __early_ioremap+0x102/0x1bb() [0.026941] Modules linked in: [0.026944] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-rc1 #2 [0.026945] Hardware name: Dell Inc. XPS 13 9343/09K8G1, BIOS A05 07/14/2015 [0.026946] 900f03d5a116524d 81c03e60 813a3fff [0.026948] 81c03e98 810a0852 d7b76000 [0.026949] 0001 0001 017c [0.026951] Call Trace: [0.026955] [] dump_stack+0x44/0x55 [0.026958] [] warn_slowpath_common+0x82/0xc0 [0.026959] [] warn_slowpath_null+0x1a/0x20 [0.026961] [] __early_ioremap+0x102/0x1bb [0.026962] [] early_memremap+0x13/0x15 [0.026964] [] efi_bgrt_init+0x162/0x1ad [0.026966] [] efi_late_init+0x9/0xb [0.026968] [] start_kernel+0x46f/0x49f [0.026970] [] ? early_idt_handler_array+0x120/0x120 [0.026972] [] x86_64_start_reservations+0x2a/0x2c [0.026974] [] x86_64_start_kernel+0x14a/0x16d [0.026977] ---[ end trace f9b3812eb8e24c58 ]--- [0.026978] efi_bgrt: Ignoring BGRT: failed to map image memory early_memremap() has an upper limit on the size of mapping it can handle which is ~200KB. Clearly the BGRT image on Môshe's machine is much larger than that. There's actually no reason to restrict ourselves to using the early_* version of memremap() - the ACPI BGRT driver is invoked late enough in boot that we can use the standard version, with the benefit that the late version allows mappings of arbitrary size. Reported-by: Môshe van der Sterre Tested-by: Môshe van der Sterre Signed-off-by: Matt Fleming Cc: Josh Triplett Cc: Sai Praneeth Prakhya Cc: Borislav Petkov Link: http://lkml.kernel.org/r/1450707172-12561-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Thomas Gleixner --- arch/x86/platform/efi/efi-bgrt.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/platform/efi/efi-bgrt.c b/arch/x86/platform/efi/efi-bgrt.c index bf51f4c..b097066 100644 --- a/arch/x86/platform/efi/efi-bgrt.c +++ b/arch/x86/platform/efi/efi-bgrt.c @@ -72,14 +72,14 @@ void __init efi_bgrt_init(void) return; } - image = early_memremap(bgrt_tab->image_address, sizeof(bmp_header)); + image = memremap(bgrt_tab->image_address, sizeof(bmp_header), MEMREMAP_WB); if (!image) { pr_err("Ignoring BGRT: failed to map image header memory\n"); return; } memcpy(_header, image, sizeof(bmp_header)); - early_memunmap(image, sizeof(bmp_header)); + memunmap(image); bgrt_image_size = bmp_header.size; bgrt_image = kmalloc(bgrt_image_size, GFP_KERNEL | __GFP_NOWARN); @@ -89,7 +89,7 @@ void __init efi_bgrt_init(void) return; } - image = early_memremap(bgrt_tab->image_address, bmp_header.size); + image = memremap(bgrt_tab->image_address, bmp_header.size, MEMREMAP_WB); if (!image) { pr_err("Ignoring BGRT: failed to map image memory\n"); kfree(bgrt_image); @@ -98,5 +98,5 @@ void __init efi_bgrt_init(void) } memcpy(bgrt_image, image, bgrt_image_size); - early_memunmap(image, bmp_header.size); + memunmap(image); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/efi] x86/efi-bgrt: Replace early_memremap() with memremap()
Commit-ID: e2c90dd7e11e3025b46719a79fb4bb1e7a5cef9f Gitweb: http://git.kernel.org/tip/e2c90dd7e11e3025b46719a79fb4bb1e7a5cef9f Author: Matt FlemingAuthorDate: Mon, 21 Dec 2015 14:12:52 + Committer: Thomas Gleixner CommitDate: Wed, 6 Jan 2016 18:28:52 +0100 x86/efi-bgrt: Replace early_memremap() with memremap() Môshe reported the following warning triggered on his machine since commit 50a0cb565246 ("x86/efi-bgrt: Fix kernel panic when mapping BGRT data"), [0.026936] [ cut here ] [0.026941] WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:137 __early_ioremap+0x102/0x1bb() [0.026941] Modules linked in: [0.026944] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-rc1 #2 [0.026945] Hardware name: Dell Inc. XPS 13 9343/09K8G1, BIOS A05 07/14/2015 [0.026946] 900f03d5a116524d 81c03e60 813a3fff [0.026948] 81c03e98 810a0852 d7b76000 [0.026949] 0001 0001 017c [0.026951] Call Trace: [0.026955] [] dump_stack+0x44/0x55 [0.026958] [] warn_slowpath_common+0x82/0xc0 [0.026959] [] warn_slowpath_null+0x1a/0x20 [0.026961] [] __early_ioremap+0x102/0x1bb [0.026962] [] early_memremap+0x13/0x15 [0.026964] [] efi_bgrt_init+0x162/0x1ad [0.026966] [] efi_late_init+0x9/0xb [0.026968] [] start_kernel+0x46f/0x49f [0.026970] [] ? early_idt_handler_array+0x120/0x120 [0.026972] [] x86_64_start_reservations+0x2a/0x2c [0.026974] [] x86_64_start_kernel+0x14a/0x16d [0.026977] ---[ end trace f9b3812eb8e24c58 ]--- [0.026978] efi_bgrt: Ignoring BGRT: failed to map image memory early_memremap() has an upper limit on the size of mapping it can handle which is ~200KB. Clearly the BGRT image on Môshe's machine is much larger than that. There's actually no reason to restrict ourselves to using the early_* version of memremap() - the ACPI BGRT driver is invoked late enough in boot that we can use the standard version, with the benefit that the late version allows mappings of arbitrary size. Reported-by: Môshe van der Sterre Tested-by: Môshe van der Sterre Signed-off-by: Matt Fleming Cc: Josh Triplett Cc: Sai Praneeth Prakhya Cc: Borislav Petkov Link: http://lkml.kernel.org/r/1450707172-12561-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Thomas Gleixner --- arch/x86/platform/efi/efi-bgrt.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/platform/efi/efi-bgrt.c b/arch/x86/platform/efi/efi-bgrt.c index bf51f4c..b097066 100644 --- a/arch/x86/platform/efi/efi-bgrt.c +++ b/arch/x86/platform/efi/efi-bgrt.c @@ -72,14 +72,14 @@ void __init efi_bgrt_init(void) return; } - image = early_memremap(bgrt_tab->image_address, sizeof(bmp_header)); + image = memremap(bgrt_tab->image_address, sizeof(bmp_header), MEMREMAP_WB); if (!image) { pr_err("Ignoring BGRT: failed to map image header memory\n"); return; } memcpy(_header, image, sizeof(bmp_header)); - early_memunmap(image, sizeof(bmp_header)); + memunmap(image); bgrt_image_size = bmp_header.size; bgrt_image = kmalloc(bgrt_image_size, GFP_KERNEL | __GFP_NOWARN); @@ -89,7 +89,7 @@ void __init efi_bgrt_init(void) return; } - image = early_memremap(bgrt_tab->image_address, bmp_header.size); + image = memremap(bgrt_tab->image_address, bmp_header.size, MEMREMAP_WB); if (!image) { pr_err("Ignoring BGRT: failed to map image memory\n"); kfree(bgrt_image); @@ -98,5 +98,5 @@ void __init efi_bgrt_init(void) } memcpy(bgrt_image, image, bgrt_image_size); - early_memunmap(image, bmp_header.size); + memunmap(image); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/efi] Documentation/x86: Update EFI memory region description
Commit-ID: ff3d0a12fb2dc123e2b46e9524ebf4e08de5c59c Gitweb: http://git.kernel.org/tip/ff3d0a12fb2dc123e2b46e9524ebf4e08de5c59c Author: Matt Fleming AuthorDate: Fri, 27 Nov 2015 21:09:35 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:43 +0100 Documentation/x86: Update EFI memory region description Make it clear that the EFI page tables are only available during EFI runtime calls since that subject has come up a fair numbers of times in the past. Additionally, add the EFI region start and end addresses to the table so that it's possible to see at a glance where they fall in relation to other regions. Signed-off-by: Matt Fleming Reviewed-by: Borislav Petkov Acked-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Jones Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Stephen Smalley Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-7-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- Documentation/x86/x86_64/mm.txt | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt index 05712ac..c518dce 100644 --- a/Documentation/x86/x86_64/mm.txt +++ b/Documentation/x86/x86_64/mm.txt @@ -16,6 +16,8 @@ ec00 - fc00 (=44 bits) kasan shadow memory (16TB) ... unused hole ... ff00 - ff7f (=39 bits) %esp fixup stacks ... unused hole ... +ffef - (=64 GB) EFI region mapping space +... unused hole ... 8000 - a000 (=512 MB) kernel text mapping, from phys 0 a000 - ff5f (=1525 MB) module mapping space ff60 - ffdf (=8 MB) vsyscalls @@ -32,11 +34,9 @@ reference. Current X86-64 implementations only support 40 bits of address space, but we support up to 46 bits. This expands into MBZ space in the page tables. -->trampoline_pgd: - -We map EFI runtime services in the aforementioned PGD in the virtual -range of 64Gb (arbitrarily set, can be raised if needed) - -0xffef - 0x +We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual +memory window (this size is arbitrary, it can be raised later if needed). +The mappings are not part of any other kernel PGD and are only available +during EFI runtime calls. -Andi Kleen, Jul 2004 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/efi] x86/efi: Build our own page table structures
Commit-ID: 67a9108ed4313b85a9c53406d80dc1ae3f8c3e36 Gitweb: http://git.kernel.org/tip/67a9108ed4313b85a9c53406d80dc1ae3f8c3e36 Author: Matt Fleming AuthorDate: Fri, 27 Nov 2015 21:09:34 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:42 +0100 x86/efi: Build our own page table structures With commit e1a58320a38d ("x86/mm: Warn on W^X mappings") all users booting on 64-bit UEFI machines see the following warning, [ cut here ] WARNING: CPU: 7 PID: 1 at arch/x86/mm/dump_pagetables.c:225 note_page+0x5dc/0x780() x86/mm: Found insecure W+X mapping at address 8805f000/0x8805f000 ... x86/mm: Checked W+X mappings: FAILED, 165660 W+X pages found. ... This is caused by mapping EFI regions with RWX permissions. There isn't much we can do to restrict the permissions for these regions due to the way the firmware toolchains mix code and data, but we can at least isolate these mappings so that they do not appear in the regular kernel page tables. In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping") we started using 'trampoline_pgd' to map the EFI regions because there was an existing identity mapping there which we use during the SetVirtualAddressMap() call and for broken firmware that accesses those addresses. But 'trampoline_pgd' shares some PGD entries with 'swapper_pg_dir' and does not provide the isolation we require. Notably the virtual address for __START_KERNEL_map and MODULES_START are mapped by the same PGD entry so we need to be more careful when copying changes over in efi_sync_low_kernel_mappings(). This patch doesn't go the full mile, we still want to share some PGD entries with 'swapper_pg_dir'. Having completely separate page tables brings its own issues such as synchronising new mappings after memory hotplug and module loading. Sharing also keeps memory usage down. Signed-off-by: Matt Fleming Reviewed-by: Borislav Petkov Acked-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Jones Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Stephen Smalley Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-6-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/include/asm/efi.h | 1 + arch/x86/platform/efi/efi.c| 39 ++--- arch/x86/platform/efi/efi_32.c | 5 +++ arch/x86/platform/efi/efi_64.c | 97 +++--- 4 files changed, 102 insertions(+), 40 deletions(-) diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index 347eeac..8fd9e63 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -136,6 +136,7 @@ extern void __init efi_memory_uc(u64 addr, unsigned long size); extern void __init efi_map_region(efi_memory_desc_t *md); extern void __init efi_map_region_fixed(efi_memory_desc_t *md); extern void efi_sync_low_kernel_mappings(void); +extern int __init efi_alloc_page_tables(void); extern int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages); extern void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages); extern void __init old_map_region(efi_memory_desc_t *md); diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index ad28540..3c1f3cd 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -869,7 +869,7 @@ static void __init kexec_enter_virtual_mode(void) * This function will switch the EFI runtime services to virtual mode. * Essentially, we look through the EFI memmap and map every region that * has the runtime attribute bit set in its memory descriptor into the - * ->trampoline_pgd page table using a top-down VA allocation scheme. + * efi_pgd page table. * * The old method which used to update that memory descriptor with the * virtual address obtained from ioremap() is still supported when the @@ -879,8 +879,8 @@ static void __init kexec_enter_virtual_mode(void) * * The new method does a pagetable switch in a preemption-safe manner * so that we're in a different address space when calling a runtime - * function. For function arguments passing we do copy the PGDs of the - * kernel page table into ->trampoline_pgd prior to each call. + * function. For function arguments passing we do copy the PUDs of the + * kernel page table into efi_pgd prior to each call. * * Specially for kexec boot, efi runtime maps in previous kernel should * be passed in via setup_data. In that case runtime ranges will be mapped @@ -895,6 +895,12 @@ static void __init __efi_enter_virtual_mode(void) efi.systab = NULL; + if (efi_alloc_page_tables()) { + pr_err("Failed to allocate EFI page tables\n"); +
[tip:x86/efi] x86/efi: Map RAM into the identity page table for mixed mode
Commit-ID: b61a76f8850d2979550abc42d7e09154ebb8d785 Gitweb: http://git.kernel.org/tip/b61a76f8850d2979550abc42d7e09154ebb8d785 Author: Matt Fleming AuthorDate: Fri, 27 Nov 2015 21:09:32 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:42 +0100 x86/efi: Map RAM into the identity page table for mixed mode We are relying on the pre-existing mappings in 'trampoline_pgd' when accessing function arguments in the EFI mixed mode thunking code. Instead let's map memory explicitly so that things will continue to work when we move to a separate page table in the future. Signed-off-by: Matt Fleming Reviewed-by: Borislav Petkov Acked-by: Borislav Petkov Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-4-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi_64.c | 20 1 file changed, 20 insertions(+) diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index 5aa186d..102976d 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -144,6 +144,7 @@ void efi_sync_low_kernel_mappings(void) int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) { unsigned long pfn, text; + efi_memory_desc_t *md; struct page *page; unsigned npages; pgd_t *pgd; @@ -177,6 +178,25 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) if (!IS_ENABLED(CONFIG_EFI_MIXED)) return 0; + /* +* Map all of RAM so that we can access arguments in the 1:1 +* mapping when making EFI runtime calls. +*/ + for_each_efi_memory_desc(, md) { + if (md->type != EFI_CONVENTIONAL_MEMORY && + md->type != EFI_LOADER_DATA && + md->type != EFI_LOADER_CODE) + continue; + + pfn = md->phys_addr >> PAGE_SHIFT; + npages = md->num_pages; + + if (kernel_map_pages_in_pgd(pgd, pfn, md->phys_addr, npages, 0)) { + pr_err("Failed to map 1:1 memory\n"); + return 1; + } + } + page = alloc_page(GFP_KERNEL|__GFP_DMA32); if (!page) panic("Unable to allocate EFI runtime stack < 4GB\n"); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/efi] x86/efi: Hoist page table switching code into efi_call_virt()
Commit-ID: c9f2a9a65e4855b74d92cdad688f6ee4a1a323ff Gitweb: http://git.kernel.org/tip/c9f2a9a65e4855b74d92cdad688f6ee4a1a323ff Author: Matt Fleming AuthorDate: Fri, 27 Nov 2015 21:09:33 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:42 +0100 x86/efi: Hoist page table switching code into efi_call_virt() This change is a prerequisite for pending patches that switch to a dedicated EFI page table, instead of using 'trampoline_pgd' which shares PGD entries with 'swapper_pg_dir'. The pending patches make it impossible to dereference the runtime service function pointer without first switching %cr3. It's true that we now have duplicated switching code in efi_call_virt() and efi_call_phys_{prolog,epilog}() but we are sacrificing code duplication for a little more clarity and the ease of writing the page table switching code in C instead of asm. Signed-off-by: Matt Fleming Reviewed-by: Borislav Petkov Acked-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Jones Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Stephen Smalley Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-5-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/include/asm/efi.h | 25 + arch/x86/platform/efi/efi_64.c | 24 ++--- arch/x86/platform/efi/efi_stub_64.S | 43 - 3 files changed, 36 insertions(+), 56 deletions(-) diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index 0010c78..347eeac 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -3,6 +3,7 @@ #include #include +#include /* * We map the EFI regions needed for runtime services non-contiguously, @@ -64,6 +65,17 @@ extern u64 asmlinkage efi_call(void *fp, ...); #define efi_call_phys(f, args...) efi_call((f), args) +/* + * Scratch space used for switching the pagetable in the EFI stub + */ +struct efi_scratch { + u64 r15; + u64 prev_cr3; + pgd_t *efi_pgt; + booluse_pgd; + u64 phys_stack; +} __packed; + #define efi_call_virt(f, ...) \ ({ \ efi_status_t __s; \ @@ -71,7 +83,20 @@ extern u64 asmlinkage efi_call(void *fp, ...); efi_sync_low_kernel_mappings(); \ preempt_disable(); \ __kernel_fpu_begin(); \ + \ + if (efi_scratch.use_pgd) { \ + efi_scratch.prev_cr3 = read_cr3(); \ + write_cr3((unsigned long)efi_scratch.efi_pgt); \ + __flush_tlb_all(); \ + } \ + \ __s = efi_call((void *)efi.systab->runtime->f, __VA_ARGS__);\ + \ + if (efi_scratch.use_pgd) { \ + write_cr3(efi_scratch.prev_cr3);\ + __flush_tlb_all(); \ + } \ + \ __kernel_fpu_end(); \ preempt_enable(); \ __s;\ diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index 102976d..b19cdac 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -47,16 +47,7 @@ */ static u64 efi_va = EFI_VA_START; -/* - * Scratch space used for switching the pagetable in the EFI stub - */ -struct efi_scratch { - u64 r15; - u64 prev_cr3; - pgd_t *efi_pgt; - bool use_pgd; - u64 phys_stack; -} __packed; +struct efi_scratch efi_scratch; static void __init early_code_mapping_set_exec(int executable) { @@ -83,8 +74,11 @@ pgd_t * __init efi_call_phys_prolog(void) int pgd; int n_pgds; - if (!efi_enabled(EFI_OLD_MEMMAP)) - return NULL; + if (!efi_enabled(EFI_OLD_MEMMAP)) { + save_pgd = (pgd_t *)read_cr3(); + write_cr3((unsigned
[tip:x86/efi] x86/mm/pat: Ensure cpa-> pfn only contains page frame numbers
Commit-ID: edc3b9129cecd0f0857112136f5b8b1bc1d45918 Gitweb: http://git.kernel.org/tip/edc3b9129cecd0f0857112136f5b8b1bc1d45918 Author: Matt Fleming AuthorDate: Fri, 27 Nov 2015 21:09:31 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:42 +0100 x86/mm/pat: Ensure cpa->pfn only contains page frame numbers The x86 pageattr code is confused about the data that is stored in cpa->pfn, sometimes it's treated as a page frame number, sometimes it's treated as an unshifted physical address, and in one place it's treated as a pte. The result of this is that the mapping functions do not map the intended physical address. This isn't a problem in practice because most of the addresses we're mapping in the EFI code paths are already mapped in 'trampoline_pgd' and so the pageattr mapping functions don't actually do anything in this case. But when we move to using a separate page table for the EFI runtime this will be an issue. Signed-off-by: Matt Fleming Reviewed-by: Borislav Petkov Acked-by: Borislav Petkov Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-3-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/mm/pageattr.c | 17 ++--- arch/x86/platform/efi/efi_64.c | 16 ++-- 2 files changed, 16 insertions(+), 17 deletions(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index a3137a4..c70e420 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -905,15 +905,10 @@ static void populate_pte(struct cpa_data *cpa, pte = pte_offset_kernel(pmd, start); while (num_pages-- && start < end) { - - /* deal with the NX bit */ - if (!(pgprot_val(pgprot) & _PAGE_NX)) - cpa->pfn &= ~_PAGE_NX; - - set_pte(pte, pfn_pte(cpa->pfn >> PAGE_SHIFT, pgprot)); + set_pte(pte, pfn_pte(cpa->pfn, pgprot)); start+= PAGE_SIZE; - cpa->pfn += PAGE_SIZE; + cpa->pfn++; pte++; } } @@ -969,11 +964,11 @@ static int populate_pmd(struct cpa_data *cpa, pmd = pmd_offset(pud, start); - set_pmd(pmd, __pmd(cpa->pfn | _PAGE_PSE | + set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE | massage_pgprot(pmd_pgprot))); start += PMD_SIZE; - cpa->pfn += PMD_SIZE; + cpa->pfn += PMD_SIZE >> PAGE_SHIFT; cur_pages += PMD_SIZE >> PAGE_SHIFT; } @@ -1042,11 +1037,11 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd, * Map everything starting from the Gb boundary, possibly with 1G pages */ while (end - start >= PUD_SIZE) { - set_pud(pud, __pud(cpa->pfn | _PAGE_PSE | + set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE | massage_pgprot(pud_pgprot))); start += PUD_SIZE; - cpa->pfn += PUD_SIZE; + cpa->pfn += PUD_SIZE >> PAGE_SHIFT; cur_pages += PUD_SIZE >> PAGE_SHIFT; pud++; } diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index a0ac0f9..5aa186d 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -143,7 +143,7 @@ void efi_sync_low_kernel_mappings(void) int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) { - unsigned long text; + unsigned long pfn, text; struct page *page; unsigned npages; pgd_t *pgd; @@ -160,7 +160,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) * and ident-map those pages containing the map before calling * phys_efi_set_virtual_address_map(). */ - if (kernel_map_pages_in_pgd(pgd, pa_memmap, pa_memmap, num_pages, _PAGE_NX)) { + pfn = pa_memmap >> PAGE_SHIFT; + if (kernel_map_pages_in_pgd(pgd, pfn, pa_memmap, num_pages, _PAGE_NX)) { pr_err("Error ident-mapping new memmap (0x%lx)!\n", pa_memmap); return 1; } @@ -185,8 +186,9 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) npages = (_end - _text) >> PAGE_SHIFT; text = __pa(_text); + pfn = text >> PAGE_SHIFT; - if (kernel_map_pages_in_pgd(pgd, text >> PAGE_SHIFT, text, npages, 0)) { + if (kernel_map_pages_in_pgd(pgd, pfn, text, npages, 0)) { pr_err("Failed to map kernel text 1:1\n"); return 1; } @@ -204,12 +206,14 @@ void __init
[tip:x86/efi] x86/mm: Page align the '_end' symbol to avoid pfn conversion bugs
Commit-ID: 21cdb6b568435738cc0b303b2b3b82742396310c Gitweb: http://git.kernel.org/tip/21cdb6b568435738cc0b303b2b3b82742396310c Author: Matt Fleming AuthorDate: Fri, 27 Nov 2015 21:09:30 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:42 +0100 x86/mm: Page align the '_end' symbol to avoid pfn conversion bugs Ingo noted that if we can guarantee _end is aligned to PAGE_SIZE we can automatically avoid bugs along the lines of, size = _end - _text >> PAGE_SHIFT which is missing a call to PFN_ALIGN(). The EFI mixed mode contains this bug, for example. _text is already aligned to PAGE_SIZE through the use of LOAD_PHYSICAL_ADDR, and the BSS and BRK sections are explicitly aligned in the linker script, so it makes sense to align _end to match. Reported-by: Ingo Molnar Signed-off-by: Matt Fleming Acked-by: Borislav Petkov Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/vmlinux.lds.S | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S index 74e4bf1..4f19942 100644 --- a/arch/x86/kernel/vmlinux.lds.S +++ b/arch/x86/kernel/vmlinux.lds.S @@ -325,6 +325,7 @@ SECTIONS __brk_limit = .; } + . = ALIGN(PAGE_SIZE); _end = .; STABS_DEBUG -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/efi] Documentation/x86: Update EFI memory region description
Commit-ID: ff3d0a12fb2dc123e2b46e9524ebf4e08de5c59c Gitweb: http://git.kernel.org/tip/ff3d0a12fb2dc123e2b46e9524ebf4e08de5c59c Author: Matt FlemingAuthorDate: Fri, 27 Nov 2015 21:09:35 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:43 +0100 Documentation/x86: Update EFI memory region description Make it clear that the EFI page tables are only available during EFI runtime calls since that subject has come up a fair numbers of times in the past. Additionally, add the EFI region start and end addresses to the table so that it's possible to see at a glance where they fall in relation to other regions. Signed-off-by: Matt Fleming Reviewed-by: Borislav Petkov Acked-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Jones Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Stephen Smalley Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-7-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- Documentation/x86/x86_64/mm.txt | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt index 05712ac..c518dce 100644 --- a/Documentation/x86/x86_64/mm.txt +++ b/Documentation/x86/x86_64/mm.txt @@ -16,6 +16,8 @@ ec00 - fc00 (=44 bits) kasan shadow memory (16TB) ... unused hole ... ff00 - ff7f (=39 bits) %esp fixup stacks ... unused hole ... +ffef - (=64 GB) EFI region mapping space +... unused hole ... 8000 - a000 (=512 MB) kernel text mapping, from phys 0 a000 - ff5f (=1525 MB) module mapping space ff60 - ffdf (=8 MB) vsyscalls @@ -32,11 +34,9 @@ reference. Current X86-64 implementations only support 40 bits of address space, but we support up to 46 bits. This expands into MBZ space in the page tables. -->trampoline_pgd: - -We map EFI runtime services in the aforementioned PGD in the virtual -range of 64Gb (arbitrarily set, can be raised if needed) - -0xffef - 0x +We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual +memory window (this size is arbitrary, it can be raised later if needed). +The mappings are not part of any other kernel PGD and are only available +during EFI runtime calls. -Andi Kleen, Jul 2004 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/efi] x86/efi: Build our own page table structures
Commit-ID: 67a9108ed4313b85a9c53406d80dc1ae3f8c3e36 Gitweb: http://git.kernel.org/tip/67a9108ed4313b85a9c53406d80dc1ae3f8c3e36 Author: Matt FlemingAuthorDate: Fri, 27 Nov 2015 21:09:34 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:42 +0100 x86/efi: Build our own page table structures With commit e1a58320a38d ("x86/mm: Warn on W^X mappings") all users booting on 64-bit UEFI machines see the following warning, [ cut here ] WARNING: CPU: 7 PID: 1 at arch/x86/mm/dump_pagetables.c:225 note_page+0x5dc/0x780() x86/mm: Found insecure W+X mapping at address 8805f000/0x8805f000 ... x86/mm: Checked W+X mappings: FAILED, 165660 W+X pages found. ... This is caused by mapping EFI regions with RWX permissions. There isn't much we can do to restrict the permissions for these regions due to the way the firmware toolchains mix code and data, but we can at least isolate these mappings so that they do not appear in the regular kernel page tables. In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping") we started using 'trampoline_pgd' to map the EFI regions because there was an existing identity mapping there which we use during the SetVirtualAddressMap() call and for broken firmware that accesses those addresses. But 'trampoline_pgd' shares some PGD entries with 'swapper_pg_dir' and does not provide the isolation we require. Notably the virtual address for __START_KERNEL_map and MODULES_START are mapped by the same PGD entry so we need to be more careful when copying changes over in efi_sync_low_kernel_mappings(). This patch doesn't go the full mile, we still want to share some PGD entries with 'swapper_pg_dir'. Having completely separate page tables brings its own issues such as synchronising new mappings after memory hotplug and module loading. Sharing also keeps memory usage down. Signed-off-by: Matt Fleming Reviewed-by: Borislav Petkov Acked-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Jones Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Stephen Smalley Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-6-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/include/asm/efi.h | 1 + arch/x86/platform/efi/efi.c| 39 ++--- arch/x86/platform/efi/efi_32.c | 5 +++ arch/x86/platform/efi/efi_64.c | 97 +++--- 4 files changed, 102 insertions(+), 40 deletions(-) diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index 347eeac..8fd9e63 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -136,6 +136,7 @@ extern void __init efi_memory_uc(u64 addr, unsigned long size); extern void __init efi_map_region(efi_memory_desc_t *md); extern void __init efi_map_region_fixed(efi_memory_desc_t *md); extern void efi_sync_low_kernel_mappings(void); +extern int __init efi_alloc_page_tables(void); extern int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages); extern void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages); extern void __init old_map_region(efi_memory_desc_t *md); diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index ad28540..3c1f3cd 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -869,7 +869,7 @@ static void __init kexec_enter_virtual_mode(void) * This function will switch the EFI runtime services to virtual mode. * Essentially, we look through the EFI memmap and map every region that * has the runtime attribute bit set in its memory descriptor into the - * ->trampoline_pgd page table using a top-down VA allocation scheme. + * efi_pgd page table. * * The old method which used to update that memory descriptor with the * virtual address obtained from ioremap() is still supported when the @@ -879,8 +879,8 @@ static void __init kexec_enter_virtual_mode(void) * * The new method does a pagetable switch in a preemption-safe manner * so that we're in a different address space when calling a runtime - * function. For function arguments passing we do copy the PGDs of the - * kernel page table into ->trampoline_pgd prior to each call. + * function. For function arguments passing we do copy the PUDs of the + *
[tip:x86/efi] x86/mm/pat: Ensure cpa-> pfn only contains page frame numbers
Commit-ID: edc3b9129cecd0f0857112136f5b8b1bc1d45918 Gitweb: http://git.kernel.org/tip/edc3b9129cecd0f0857112136f5b8b1bc1d45918 Author: Matt FlemingAuthorDate: Fri, 27 Nov 2015 21:09:31 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:42 +0100 x86/mm/pat: Ensure cpa->pfn only contains page frame numbers The x86 pageattr code is confused about the data that is stored in cpa->pfn, sometimes it's treated as a page frame number, sometimes it's treated as an unshifted physical address, and in one place it's treated as a pte. The result of this is that the mapping functions do not map the intended physical address. This isn't a problem in practice because most of the addresses we're mapping in the EFI code paths are already mapped in 'trampoline_pgd' and so the pageattr mapping functions don't actually do anything in this case. But when we move to using a separate page table for the EFI runtime this will be an issue. Signed-off-by: Matt Fleming Reviewed-by: Borislav Petkov Acked-by: Borislav Petkov Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-3-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/mm/pageattr.c | 17 ++--- arch/x86/platform/efi/efi_64.c | 16 ++-- 2 files changed, 16 insertions(+), 17 deletions(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index a3137a4..c70e420 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -905,15 +905,10 @@ static void populate_pte(struct cpa_data *cpa, pte = pte_offset_kernel(pmd, start); while (num_pages-- && start < end) { - - /* deal with the NX bit */ - if (!(pgprot_val(pgprot) & _PAGE_NX)) - cpa->pfn &= ~_PAGE_NX; - - set_pte(pte, pfn_pte(cpa->pfn >> PAGE_SHIFT, pgprot)); + set_pte(pte, pfn_pte(cpa->pfn, pgprot)); start+= PAGE_SIZE; - cpa->pfn += PAGE_SIZE; + cpa->pfn++; pte++; } } @@ -969,11 +964,11 @@ static int populate_pmd(struct cpa_data *cpa, pmd = pmd_offset(pud, start); - set_pmd(pmd, __pmd(cpa->pfn | _PAGE_PSE | + set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE | massage_pgprot(pmd_pgprot))); start += PMD_SIZE; - cpa->pfn += PMD_SIZE; + cpa->pfn += PMD_SIZE >> PAGE_SHIFT; cur_pages += PMD_SIZE >> PAGE_SHIFT; } @@ -1042,11 +1037,11 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd, * Map everything starting from the Gb boundary, possibly with 1G pages */ while (end - start >= PUD_SIZE) { - set_pud(pud, __pud(cpa->pfn | _PAGE_PSE | + set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE | massage_pgprot(pud_pgprot))); start += PUD_SIZE; - cpa->pfn += PUD_SIZE; + cpa->pfn += PUD_SIZE >> PAGE_SHIFT; cur_pages += PUD_SIZE >> PAGE_SHIFT; pud++; } diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index a0ac0f9..5aa186d 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -143,7 +143,7 @@ void efi_sync_low_kernel_mappings(void) int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) { - unsigned long text; + unsigned long pfn, text; struct page *page; unsigned npages; pgd_t *pgd; @@ -160,7 +160,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) * and ident-map those pages containing the map before calling * phys_efi_set_virtual_address_map(). */ - if (kernel_map_pages_in_pgd(pgd, pa_memmap, pa_memmap, num_pages, _PAGE_NX)) { + pfn = pa_memmap >> PAGE_SHIFT; + if (kernel_map_pages_in_pgd(pgd, pfn, pa_memmap, num_pages, _PAGE_NX)) { pr_err("Error ident-mapping new memmap (0x%lx)!\n", pa_memmap); return 1; } @@ -185,8 +186,9 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
[tip:x86/efi] x86/mm: Page align the '_end' symbol to avoid pfn conversion bugs
Commit-ID: 21cdb6b568435738cc0b303b2b3b82742396310c Gitweb: http://git.kernel.org/tip/21cdb6b568435738cc0b303b2b3b82742396310c Author: Matt FlemingAuthorDate: Fri, 27 Nov 2015 21:09:30 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:42 +0100 x86/mm: Page align the '_end' symbol to avoid pfn conversion bugs Ingo noted that if we can guarantee _end is aligned to PAGE_SIZE we can automatically avoid bugs along the lines of, size = _end - _text >> PAGE_SHIFT which is missing a call to PFN_ALIGN(). The EFI mixed mode contains this bug, for example. _text is already aligned to PAGE_SIZE through the use of LOAD_PHYSICAL_ADDR, and the BSS and BRK sections are explicitly aligned in the linker script, so it makes sense to align _end to match. Reported-by: Ingo Molnar Signed-off-by: Matt Fleming Acked-by: Borislav Petkov Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/vmlinux.lds.S | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S index 74e4bf1..4f19942 100644 --- a/arch/x86/kernel/vmlinux.lds.S +++ b/arch/x86/kernel/vmlinux.lds.S @@ -325,6 +325,7 @@ SECTIONS __brk_limit = .; } + . = ALIGN(PAGE_SIZE); _end = .; STABS_DEBUG -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/efi] x86/efi: Map RAM into the identity page table for mixed mode
Commit-ID: b61a76f8850d2979550abc42d7e09154ebb8d785 Gitweb: http://git.kernel.org/tip/b61a76f8850d2979550abc42d7e09154ebb8d785 Author: Matt FlemingAuthorDate: Fri, 27 Nov 2015 21:09:32 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:42 +0100 x86/efi: Map RAM into the identity page table for mixed mode We are relying on the pre-existing mappings in 'trampoline_pgd' when accessing function arguments in the EFI mixed mode thunking code. Instead let's map memory explicitly so that things will continue to work when we move to a separate page table in the future. Signed-off-by: Matt Fleming Reviewed-by: Borislav Petkov Acked-by: Borislav Petkov Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-4-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi_64.c | 20 1 file changed, 20 insertions(+) diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index 5aa186d..102976d 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -144,6 +144,7 @@ void efi_sync_low_kernel_mappings(void) int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) { unsigned long pfn, text; + efi_memory_desc_t *md; struct page *page; unsigned npages; pgd_t *pgd; @@ -177,6 +178,25 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) if (!IS_ENABLED(CONFIG_EFI_MIXED)) return 0; + /* +* Map all of RAM so that we can access arguments in the 1:1 +* mapping when making EFI runtime calls. +*/ + for_each_efi_memory_desc(, md) { + if (md->type != EFI_CONVENTIONAL_MEMORY && + md->type != EFI_LOADER_DATA && + md->type != EFI_LOADER_CODE) + continue; + + pfn = md->phys_addr >> PAGE_SHIFT; + npages = md->num_pages; + + if (kernel_map_pages_in_pgd(pgd, pfn, md->phys_addr, npages, 0)) { + pr_err("Failed to map 1:1 memory\n"); + return 1; + } + } + page = alloc_page(GFP_KERNEL|__GFP_DMA32); if (!page) panic("Unable to allocate EFI runtime stack < 4GB\n"); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/efi] x86/efi: Hoist page table switching code into efi_call_virt()
Commit-ID: c9f2a9a65e4855b74d92cdad688f6ee4a1a323ff Gitweb: http://git.kernel.org/tip/c9f2a9a65e4855b74d92cdad688f6ee4a1a323ff Author: Matt FlemingAuthorDate: Fri, 27 Nov 2015 21:09:33 + Committer: Ingo Molnar CommitDate: Sun, 29 Nov 2015 09:15:42 +0100 x86/efi: Hoist page table switching code into efi_call_virt() This change is a prerequisite for pending patches that switch to a dedicated EFI page table, instead of using 'trampoline_pgd' which shares PGD entries with 'swapper_pg_dir'. The pending patches make it impossible to dereference the runtime service function pointer without first switching %cr3. It's true that we now have duplicated switching code in efi_call_virt() and efi_call_phys_{prolog,epilog}() but we are sacrificing code duplication for a little more clarity and the ease of writing the page table switching code in C instead of asm. Signed-off-by: Matt Fleming Reviewed-by: Borislav Petkov Acked-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Jones Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sai Praneeth Prakhya Cc: Stephen Smalley Cc: Thomas Gleixner Cc: Toshi Kani Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-5-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/include/asm/efi.h | 25 + arch/x86/platform/efi/efi_64.c | 24 ++--- arch/x86/platform/efi/efi_stub_64.S | 43 - 3 files changed, 36 insertions(+), 56 deletions(-) diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index 0010c78..347eeac 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -3,6 +3,7 @@ #include #include +#include /* * We map the EFI regions needed for runtime services non-contiguously, @@ -64,6 +65,17 @@ extern u64 asmlinkage efi_call(void *fp, ...); #define efi_call_phys(f, args...) efi_call((f), args) +/* + * Scratch space used for switching the pagetable in the EFI stub + */ +struct efi_scratch { + u64 r15; + u64 prev_cr3; + pgd_t *efi_pgt; + booluse_pgd; + u64 phys_stack; +} __packed; + #define efi_call_virt(f, ...) \ ({ \ efi_status_t __s; \ @@ -71,7 +83,20 @@ extern u64 asmlinkage efi_call(void *fp, ...); efi_sync_low_kernel_mappings(); \ preempt_disable(); \ __kernel_fpu_begin(); \ + \ + if (efi_scratch.use_pgd) { \ + efi_scratch.prev_cr3 = read_cr3(); \ + write_cr3((unsigned long)efi_scratch.efi_pgt); \ + __flush_tlb_all(); \ + } \ + \ __s = efi_call((void *)efi.systab->runtime->f, __VA_ARGS__);\ + \ + if (efi_scratch.use_pgd) { \ + write_cr3(efi_scratch.prev_cr3);\ + __flush_tlb_all(); \ + } \ + \ __kernel_fpu_end(); \ preempt_enable(); \ __s;\ diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index 102976d..b19cdac 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -47,16 +47,7 @@ */ static u64 efi_va = EFI_VA_START; -/* - * Scratch space used for switching the pagetable in the EFI stub - */ -struct efi_scratch { - u64 r15; - u64 prev_cr3; - pgd_t *efi_pgt; - bool use_pgd; -
[tip:x86/urgent] x86/setup: Fix recent boot crash on 32-bit SMP machines
Commit-ID: 1c5dac914794f0170e1582d8ffdee52d30e0e4dd Gitweb: http://git.kernel.org/tip/1c5dac914794f0170e1582d8ffdee52d30e0e4dd Author: Matt Fleming AuthorDate: Tue, 3 Nov 2015 13:40:41 + Committer: Thomas Gleixner CommitDate: Wed, 4 Nov 2015 11:48:47 +0100 x86/setup: Fix recent boot crash on 32-bit SMP machines The LKP test robot reported that the bug fix in commit f5f3497cad8c ("x86/setup: Extend low identity map to cover whole kernel range") causes CONFIG_X86_32 SMP machines to crash on boot when trying to bring AP cpus online. The above commit erroneously copies too many of the PGD entries to the low memory region of 'identity_page_table', resulting in some of the kernel mappings for PAGE_OFFSET being trashed because, KERNEL_PGD_PTRS > KERNEL_PGD_BOUNDARY The maximum number of PGD entries we can copy without corrupting the kernel mapping is KERNEL_PGD_BOUNDARY or pgd_index(PAGE_OFFSET). Fixes: f5f3497cad8c "x86/setup: Extend low identity map to cover whole kernel range" Reported-by: Ying Huang Cc: Paolo Bonzini Cc: Laszlo Ersek Cc: l...@01.org Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Andy Lutomirski Cc: Signed-off-by: Matt Fleming Link: http://lkml.kernel.org/r/20151103140354.ga2...@codeblueprint.co.uk Signed-off-by: Thomas Gleixner --- arch/x86/kernel/setup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index a3cccbf..2b8cbd6 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1180,7 +1180,7 @@ void __init setup_arch(char **cmdline_p) */ clone_pgd_range(initial_page_table, swapper_pg_dir + KERNEL_PGD_BOUNDARY, - KERNEL_PGD_PTRS); + KERNEL_PGD_BOUNDARY); #endif tboot_probe(); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/urgent] x86/setup: Fix recent boot crash on 32-bit SMP machines
Commit-ID: 1c5dac914794f0170e1582d8ffdee52d30e0e4dd Gitweb: http://git.kernel.org/tip/1c5dac914794f0170e1582d8ffdee52d30e0e4dd Author: Matt FlemingAuthorDate: Tue, 3 Nov 2015 13:40:41 + Committer: Thomas Gleixner CommitDate: Wed, 4 Nov 2015 11:48:47 +0100 x86/setup: Fix recent boot crash on 32-bit SMP machines The LKP test robot reported that the bug fix in commit f5f3497cad8c ("x86/setup: Extend low identity map to cover whole kernel range") causes CONFIG_X86_32 SMP machines to crash on boot when trying to bring AP cpus online. The above commit erroneously copies too many of the PGD entries to the low memory region of 'identity_page_table', resulting in some of the kernel mappings for PAGE_OFFSET being trashed because, KERNEL_PGD_PTRS > KERNEL_PGD_BOUNDARY The maximum number of PGD entries we can copy without corrupting the kernel mapping is KERNEL_PGD_BOUNDARY or pgd_index(PAGE_OFFSET). Fixes: f5f3497cad8c "x86/setup: Extend low identity map to cover whole kernel range" Reported-by: Ying Huang Cc: Paolo Bonzini Cc: Laszlo Ersek Cc: l...@01.org Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Andy Lutomirski Cc: Signed-off-by: Matt Fleming Link: http://lkml.kernel.org/r/20151103140354.ga2...@codeblueprint.co.uk Signed-off-by: Thomas Gleixner --- arch/x86/kernel/setup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index a3cccbf..2b8cbd6 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1180,7 +1180,7 @@ void __init setup_arch(char **cmdline_p) */ clone_pgd_range(initial_page_table, swapper_pg_dir + KERNEL_PGD_BOUNDARY, - KERNEL_PGD_PTRS); + KERNEL_PGD_BOUNDARY); #endif tboot_probe(); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:core/efi] efi: Use the generic efi.memmap instead of 'memmap '
Commit-ID: 0ce423b6492a02be11662bfaa837dd16945aad3e Gitweb: http://git.kernel.org/tip/0ce423b6492a02be11662bfaa837dd16945aad3e Author: Matt Fleming AuthorDate: Sat, 3 Oct 2015 23:26:07 +0100 Committer: Ingo Molnar CommitDate: Sun, 11 Oct 2015 11:04:18 +0200 efi: Use the generic efi.memmap instead of 'memmap' Guenter reports that commit: 7bf793115dd9 ("efi, x86: Rearrange efi_mem_attributes()") breaks the IA64 compilation with the following error: drivers/built-in.o: In function `efi_mem_attributes': (.text+0xde962): undefined reference to `memmap' Instead of using the (rather poorly named) global variable 'memmap' which doesn't exist on IA64, use efi.memmap which points to the 'memmap' object on x86 and arm64 and which is NULL for IA64. The fact that efi.memmap is NULL for IA64 is OK because IA64 provides its own implementation of efi_mem_attributes(). Reported-by: Guenter Roeck Signed-off-by: Matt Fleming Cc: Ard Biesheuvel Cc: Jonathan Zhang Cc: Peter Zijlstra Cc: Stephen Rothwell Cc: Thomas Gleixner Cc: Tony Luck Cc: Tony Luck Link: http://lkml.kernel.org/r/20151003222607.ga2...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- drivers/firmware/efi/efi.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index afee2880..16c4928 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -623,13 +623,15 @@ char * __init efi_md_typeattr_format(char *buf, size_t size, */ u64 __weak efi_mem_attributes(unsigned long phys_addr) { + struct efi_memory_map *map; efi_memory_desc_t *md; void *p; if (!efi_enabled(EFI_MEMMAP)) return 0; - for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { + map = efi.memmap; + for (p = map->map; p < map->map_end; p += map->desc_size) { md = p; if ((md->phys_addr <= phys_addr) && (phys_addr < (md->phys_addr + -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/urgent] MAINTAINERS: Change Matt Fleming's email address
Commit-ID: 825fcfce81921c9cc4ef801d844793815721e458 Gitweb: http://git.kernel.org/tip/825fcfce81921c9cc4ef801d844793815721e458 Author: Matt Fleming AuthorDate: Sat, 10 Oct 2015 17:22:16 +0100 Committer: Ingo Molnar CommitDate: Sun, 11 Oct 2015 09:54:29 +0200 MAINTAINERS: Change Matt Fleming's email address My Intel email address will soon expire. Replace it with my personal address so people still know where to send patches. Signed-off-by: Matt Fleming Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/194136-10333-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- MAINTAINERS | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 60aacd8..43bd01e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4003,7 +4003,7 @@ S:Maintained F: sound/usb/misc/ua101.c EXTENSIBLE FIRMWARE INTERFACE (EFI) -M: Matt Fleming +M: Matt Fleming L: linux-...@vger.kernel.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git S: Maintained @@ -4018,7 +4018,7 @@ F:include/linux/efi*.h EFI VARIABLE FILESYSTEM M: Matthew Garrett M: Jeremy Kerr -M: Matt Fleming +M: Matt Fleming T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git L: linux-...@vger.kernel.org S: Maintained -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:core/efi] efi: Use the generic efi.memmap instead of 'memmap '
Commit-ID: 0ce423b6492a02be11662bfaa837dd16945aad3e Gitweb: http://git.kernel.org/tip/0ce423b6492a02be11662bfaa837dd16945aad3e Author: Matt FlemingAuthorDate: Sat, 3 Oct 2015 23:26:07 +0100 Committer: Ingo Molnar CommitDate: Sun, 11 Oct 2015 11:04:18 +0200 efi: Use the generic efi.memmap instead of 'memmap' Guenter reports that commit: 7bf793115dd9 ("efi, x86: Rearrange efi_mem_attributes()") breaks the IA64 compilation with the following error: drivers/built-in.o: In function `efi_mem_attributes': (.text+0xde962): undefined reference to `memmap' Instead of using the (rather poorly named) global variable 'memmap' which doesn't exist on IA64, use efi.memmap which points to the 'memmap' object on x86 and arm64 and which is NULL for IA64. The fact that efi.memmap is NULL for IA64 is OK because IA64 provides its own implementation of efi_mem_attributes(). Reported-by: Guenter Roeck Signed-off-by: Matt Fleming Cc: Ard Biesheuvel Cc: Jonathan Zhang Cc: Peter Zijlstra Cc: Stephen Rothwell Cc: Thomas Gleixner Cc: Tony Luck Cc: Tony Luck Link: http://lkml.kernel.org/r/20151003222607.ga2...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- drivers/firmware/efi/efi.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index afee2880..16c4928 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -623,13 +623,15 @@ char * __init efi_md_typeattr_format(char *buf, size_t size, */ u64 __weak efi_mem_attributes(unsigned long phys_addr) { + struct efi_memory_map *map; efi_memory_desc_t *md; void *p; if (!efi_enabled(EFI_MEMMAP)) return 0; - for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { + map = efi.memmap; + for (p = map->map; p < map->map_end; p += map->desc_size) { md = p; if ((md->phys_addr <= phys_addr) && (phys_addr < (md->phys_addr + -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/urgent] MAINTAINERS: Change Matt Fleming's email address
Commit-ID: 825fcfce81921c9cc4ef801d844793815721e458 Gitweb: http://git.kernel.org/tip/825fcfce81921c9cc4ef801d844793815721e458 Author: Matt FlemingAuthorDate: Sat, 10 Oct 2015 17:22:16 +0100 Committer: Ingo Molnar CommitDate: Sun, 11 Oct 2015 09:54:29 +0200 MAINTAINERS: Change Matt Fleming's email address My Intel email address will soon expire. Replace it with my personal address so people still know where to send patches. Signed-off-by: Matt Fleming Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/194136-10333-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- MAINTAINERS | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 60aacd8..43bd01e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4003,7 +4003,7 @@ S:Maintained F: sound/usb/misc/ua101.c EXTENSIBLE FIRMWARE INTERFACE (EFI) -M: Matt Fleming +M: Matt Fleming L: linux-...@vger.kernel.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git S: Maintained @@ -4018,7 +4018,7 @@ F:include/linux/efi*.h EFI VARIABLE FILESYSTEM M: Matthew Garrett M: Jeremy Kerr -M: Matt Fleming +M: Matt Fleming T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git L: linux-...@vger.kernel.org S: Maintained -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/core] perf tests: Add Intel CQM test
Commit-ID: 035827e9f2bd71a280f4eb58c65811d377ab2217 Gitweb: http://git.kernel.org/tip/035827e9f2bd71a280f4eb58c65811d377ab2217 Author: Matt Fleming AuthorDate: Mon, 5 Oct 2015 15:40:21 +0100 Committer: Arnaldo Carvalho de Melo CommitDate: Mon, 5 Oct 2015 16:56:07 -0300 perf tests: Add Intel CQM test Peter reports that it's possible to trigger a WARN_ON_ONCE() in the Intel CQM code by combining a hardware event and an Intel CQM (software) event into a group. Unfortunately, the perf tools are not able to create this bundle and we need to manually construct a test case. For posterity, record Peter's proof of concept test case in tools/perf so that it presents a model for how we can perform architecture specific tests, or "arch tests", in perf in the future. The particular issue triggered in the test case is that when the counter for the hardware event overflows and triggers a PMI we'll read both the hardware event and the software event counters. Unfortunately, for CQM that involves performing an IPI to read the CQM event counters on all sockets, which in NMI context triggers the WARN_ON_ONCE(). Reported-by: Peter Zijlstra Signed-off-by: Matt Fleming Cc: Adrian Hunter Cc: Andi Kleen Cc: Fenghua Yu Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Vikas Shivappa Cc: Vince Weaver Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-m...@codeblueprint.co.uk Link: http://lkml.kernel.org/n/tip-3p4ra0u8vzm7m289a1m79...@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/arch/x86/include/arch-tests.h | 1 + tools/perf/arch/x86/tests/Build | 1 + tools/perf/arch/x86/tests/arch-tests.c | 4 + tools/perf/arch/x86/tests/intel-cqm.c| 124 +++ 4 files changed, 130 insertions(+) diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h index 5927cf2..7ed00f4 100644 --- a/tools/perf/arch/x86/include/arch-tests.h +++ b/tools/perf/arch/x86/include/arch-tests.h @@ -5,6 +5,7 @@ int test__rdpmc(void); int test__perf_time_to_tsc(void); int test__insn_x86(void); +int test__intel_cqm_count_nmi_context(void); #ifdef HAVE_DWARF_UNWIND_SUPPORT struct thread; diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build index 8e2c5a3..cbb7e97 100644 --- a/tools/perf/arch/x86/tests/Build +++ b/tools/perf/arch/x86/tests/Build @@ -5,3 +5,4 @@ libperf-y += arch-tests.o libperf-y += rdpmc.o libperf-y += perf-time-to-tsc.o libperf-$(CONFIG_AUXTRACE) += insn-x86.o +libperf-y += intel-cqm.o diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c index d116c21..2218cb6 100644 --- a/tools/perf/arch/x86/tests/arch-tests.c +++ b/tools/perf/arch/x86/tests/arch-tests.c @@ -24,6 +24,10 @@ struct test arch_tests[] = { }, #endif { + .desc = "Test intel cqm nmi context read", + .func = test__intel_cqm_count_nmi_context, + }, + { .func = NULL, }, diff --git a/tools/perf/arch/x86/tests/intel-cqm.c b/tools/perf/arch/x86/tests/intel-cqm.c new file mode 100644 index 000..d28c1b6 --- /dev/null +++ b/tools/perf/arch/x86/tests/intel-cqm.c @@ -0,0 +1,124 @@ +#include "tests/tests.h" +#include "perf.h" +#include "cloexec.h" +#include "debug.h" +#include "evlist.h" +#include "evsel.h" +#include "arch-tests.h" + +#include +#include + +static pid_t spawn(void) +{ + pid_t pid; + + pid = fork(); + if (pid) + return pid; + + while(1); + sleep(5); + return 0; +} + +/* + * Create an event group that contains both a sampled hardware + * (cpu-cycles) and software (intel_cqm/llc_occupancy/) event. We then + * wait for the hardware perf counter to overflow and generate a PMI, + * which triggers an event read for both of the events in the group. + * + * Since reading Intel CQM event counters requires sending SMP IPIs, the + * CQM pmu needs to handle the above situation gracefully, and return + * the last read counter value to avoid triggering a WARN_ON_ONCE() in + * smp_call_function_many() caused by sending IPIs from NMI context. + */ +int test__intel_cqm_count_nmi_context(void) +{ + struct perf_evlist *evlist = NULL; + struct perf_evsel *evsel = NULL; + struct perf_event_attr pe; + int i, fd[2], flag, ret; + size_t mmap_len; + void *event; + pid_t pid; + int err = TEST_FAIL; + + flag = perf_event_open_cloexec_flag(); + + evlist = perf_evlist__new(); + if (!evlist) { + pr_debug("perf_evlist__new failed\n"); + return TEST_FAIL; + } + + ret = parse_events(evlist, "intel_cqm/llc_occupancy/", NULL); + if (ret) { + pr_debug("parse_events failed\n"); + err = TEST_SKIP; + goto out; + } + + evsel = perf_evlist__first(evlist); + if (!evsel) { +
[tip:perf/core] perf tests: Move x86 tests into arch directory
Commit-ID: d8b167f9d8af817073ee35cf904e2e527465dbc1 Gitweb: http://git.kernel.org/tip/d8b167f9d8af817073ee35cf904e2e527465dbc1 Author: Matt Fleming AuthorDate: Mon, 5 Oct 2015 15:40:20 +0100 Committer: Arnaldo Carvalho de Melo CommitDate: Mon, 5 Oct 2015 16:55:43 -0300 perf tests: Move x86 tests into arch directory Move out the x86-specific tests into tools/perf/arch/x86/tests and define an 'arch_tests' array, which is the list of tests that only apply to the build architecture. We can also now begin to get rid of some of the #ifdef code that is present in the generic perf tests. Signed-off-by: Matt Fleming Cc: Adrian Hunter Cc: Andi Kleen Cc: Fenghua Yu Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Peter Zijlstra Cc: Vikas Shivappa Cc: Vince Weaver Link: http://lkml.kernel.org/n/tip-9s68h4ptg06ah0lgnjz55...@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/arch/x86/include/arch-tests.h | 12 ++ tools/perf/arch/x86/tests/Build| 3 +++ tools/perf/arch/x86/tests/arch-tests.c | 20 tools/perf/arch/x86/tests/dwarf-unwind.c | 1 + .../perf/{ => arch/x86}/tests/gen-insn-x86-dat.awk | 0 .../perf/{ => arch/x86}/tests/gen-insn-x86-dat.sh | 0 tools/perf/{ => arch/x86}/tests/insn-x86-dat-32.c | 0 tools/perf/{ => arch/x86}/tests/insn-x86-dat-64.c | 0 tools/perf/{ => arch/x86}/tests/insn-x86-dat-src.c | 0 tools/perf/{ => arch/x86}/tests/insn-x86.c | 3 ++- tools/perf/{ => arch/x86}/tests/perf-time-to-tsc.c | 4 +++- tools/perf/{ => arch/x86}/tests/rdpmc.c| 7 ++ tools/perf/tests/Build | 6 - tools/perf/tests/builtin-test.c| 28 -- tools/perf/tests/dwarf-unwind.c| 4 tools/perf/tests/tests.h | 5 +--- 16 files changed, 48 insertions(+), 45 deletions(-) diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h index 4bd41d8..5927cf2 100644 --- a/tools/perf/arch/x86/include/arch-tests.h +++ b/tools/perf/arch/x86/include/arch-tests.h @@ -1,6 +1,18 @@ #ifndef ARCH_TESTS_H #define ARCH_TESTS_H +/* Tests */ +int test__rdpmc(void); +int test__perf_time_to_tsc(void); +int test__insn_x86(void); + +#ifdef HAVE_DWARF_UNWIND_SUPPORT +struct thread; +struct perf_sample; +int test__arch_unwind_sample(struct perf_sample *sample, +struct thread *thread); +#endif + extern struct test arch_tests[]; #endif diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build index d827ef3..8e2c5a3 100644 --- a/tools/perf/arch/x86/tests/Build +++ b/tools/perf/arch/x86/tests/Build @@ -2,3 +2,6 @@ libperf-$(CONFIG_DWARF_UNWIND) += regs_load.o libperf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o libperf-y += arch-tests.o +libperf-y += rdpmc.o +libperf-y += perf-time-to-tsc.o +libperf-$(CONFIG_AUXTRACE) += insn-x86.o diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c index fca9eb9..d116c21 100644 --- a/tools/perf/arch/x86/tests/arch-tests.c +++ b/tools/perf/arch/x86/tests/arch-tests.c @@ -4,6 +4,26 @@ struct test arch_tests[] = { { + .desc = "x86 rdpmc test", + .func = test__rdpmc, + }, + { + .desc = "Test converting perf time to TSC", + .func = test__perf_time_to_tsc, + }, +#ifdef HAVE_DWARF_UNWIND_SUPPORT + { + .desc = "Test dwarf unwind", + .func = test__dwarf_unwind, + }, +#endif +#ifdef HAVE_AUXTRACE_SUPPORT + { + .desc = "Test x86 instruction decoder - new instructions", + .func = test__insn_x86, + }, +#endif + { .func = NULL, }, diff --git a/tools/perf/arch/x86/tests/dwarf-unwind.c b/tools/perf/arch/x86/tests/dwarf-unwind.c index d8bbf7a..7f209ce 100644 --- a/tools/perf/arch/x86/tests/dwarf-unwind.c +++ b/tools/perf/arch/x86/tests/dwarf-unwind.c @@ -5,6 +5,7 @@ #include "event.h" #include "debug.h" #include "tests/tests.h" +#include "arch-tests.h" #define STACK_SIZE 8192 diff --git a/tools/perf/tests/gen-insn-x86-dat.awk b/tools/perf/arch/x86/tests/gen-insn-x86-dat.awk similarity index 100% rename from tools/perf/tests/gen-insn-x86-dat.awk rename to tools/perf/arch/x86/tests/gen-insn-x86-dat.awk diff --git a/tools/perf/tests/gen-insn-x86-dat.sh b/tools/perf/arch/x86/tests/gen-insn-x86-dat.sh similarity index 100% rename from tools/perf/tests/gen-insn-x86-dat.sh rename to tools/perf/arch/x86/tests/gen-insn-x86-dat.sh diff --git a/tools/perf/tests/insn-x86-dat-32.c b/tools/perf/arch/x86/tests/insn-x86-dat-32.c similarity index 100% rename from tools/perf/tests/insn-x86-dat-32.c rename to tools/perf/arch/x86/tests/insn-x86-dat-32.c diff --git a/tools/perf/tests/insn-x86-dat-64.c b/tools/perf/arch/x86/tests/insn-x86-dat-64.c
[tip:perf/core] perf tests: Add arch tests
Commit-ID: 31b6753f95320260b160935d0e9c0b29f096ab57 Gitweb: http://git.kernel.org/tip/31b6753f95320260b160935d0e9c0b29f096ab57 Author: Matt Fleming AuthorDate: Mon, 5 Oct 2015 15:40:19 +0100 Committer: Arnaldo Carvalho de Melo CommitDate: Mon, 5 Oct 2015 16:55:38 -0300 perf tests: Add arch tests Tests that only make sense for some architectures currently live in the same place as the generic tests. Move out the x86-specific tests into tools/perf/arch/x86/tests and define an 'arch_tests' array, which is the list of tests that only apply to the build architecture. The main idea is to encourage developers to add arch tests to build out perf's test coverage, without dumping everything in tools/perf/tests. Signed-off-by: Matt Fleming Cc: Adrian Hunter Cc: Andi Kleen Cc: Fenghua Yu Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Peter Zijlstra Cc: Vikas Shivappa Cc: Vince Weaver Link: http://lkml.kernel.org/n/tip-p4uc1c15ssbj8xj7ku5sl...@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/arch/x86/Build| 2 +- tools/perf/arch/x86/include/arch-tests.h | 6 ++ tools/perf/arch/x86/tests/Build | 6 -- tools/perf/arch/x86/tests/arch-tests.c | 10 ++ tools/perf/tests/builtin-test.c | 28 tools/perf/tests/tests.h | 5 + 6 files changed, 46 insertions(+), 11 deletions(-) diff --git a/tools/perf/arch/x86/Build b/tools/perf/arch/x86/Build index 41bf61d..db52fa2 100644 --- a/tools/perf/arch/x86/Build +++ b/tools/perf/arch/x86/Build @@ -1,2 +1,2 @@ libperf-y += util/ -libperf-$(CONFIG_DWARF_UNWIND) += tests/ +libperf-y += tests/ diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h new file mode 100644 index 000..4bd41d8 --- /dev/null +++ b/tools/perf/arch/x86/include/arch-tests.h @@ -0,0 +1,6 @@ +#ifndef ARCH_TESTS_H +#define ARCH_TESTS_H + +extern struct test arch_tests[]; + +#endif diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build index b30eff9..d827ef3 100644 --- a/tools/perf/arch/x86/tests/Build +++ b/tools/perf/arch/x86/tests/Build @@ -1,2 +1,4 @@ -libperf-y += regs_load.o -libperf-y += dwarf-unwind.o +libperf-$(CONFIG_DWARF_UNWIND) += regs_load.o +libperf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o + +libperf-y += arch-tests.o diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c new file mode 100644 index 000..fca9eb9 --- /dev/null +++ b/tools/perf/arch/x86/tests/arch-tests.c @@ -0,0 +1,10 @@ +#include +#include "tests/tests.h" +#include "arch-tests.h" + +struct test arch_tests[] = { + { + .func = NULL, + }, + +}; diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c index d9bf51d..2b6c1bf 100644 --- a/tools/perf/tests/builtin-test.c +++ b/tools/perf/tests/builtin-test.c @@ -14,10 +14,13 @@ #include "parse-options.h" #include "symbol.h" -static struct test { - const char *desc; - int (*func)(void); -} tests[] = { +struct test __weak arch_tests[] = { + { + .func = NULL, + }, +}; + +static struct test generic_tests[] = { { .desc = "vmlinux symtab matches kallsyms", .func = test__vmlinux_matches_kallsyms, @@ -195,6 +198,11 @@ static struct test { }, }; +static struct test *tests[] = { + generic_tests, + arch_tests, +}; + static bool perf_test__matches(struct test *test, int curr, int argc, const char *argv[]) { int i; @@ -249,22 +257,25 @@ static int run_test(struct test *test) return err; } -#define for_each_test(t)for (t = [0]; t->func; t++) +#define for_each_test(j, t)\ + for (j = 0; j < ARRAY_SIZE(tests); j++) \ + for (t = [j][0]; t->func; t++) static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist) { struct test *t; + unsigned int j; int i = 0; int width = 0; - for_each_test(t) { + for_each_test(j, t) { int len = strlen(t->desc); if (width < len) width = len; } - for_each_test(t) { + for_each_test(j, t) { int curr = i++, err; if (!perf_test__matches(t, curr, argc, argv)) @@ -300,10 +311,11 @@ static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist) static int perf_test__list(int argc, const char **argv) { + unsigned int j; struct test *t; int i = 0; - for_each_test(t) { + for_each_test(j, t) { if (argc > 1 && !strstr(t->desc, argv[1])) continue; diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h index 0b35496..b1cb1c0 100644 --- a/tools/perf/tests/tests.h +++ b/tools/perf/tests/tests.h @@ -24,6 +24,11 @@ enum {
[tip:perf/core] perf tests: Add Intel CQM test
Commit-ID: 035827e9f2bd71a280f4eb58c65811d377ab2217 Gitweb: http://git.kernel.org/tip/035827e9f2bd71a280f4eb58c65811d377ab2217 Author: Matt FlemingAuthorDate: Mon, 5 Oct 2015 15:40:21 +0100 Committer: Arnaldo Carvalho de Melo CommitDate: Mon, 5 Oct 2015 16:56:07 -0300 perf tests: Add Intel CQM test Peter reports that it's possible to trigger a WARN_ON_ONCE() in the Intel CQM code by combining a hardware event and an Intel CQM (software) event into a group. Unfortunately, the perf tools are not able to create this bundle and we need to manually construct a test case. For posterity, record Peter's proof of concept test case in tools/perf so that it presents a model for how we can perform architecture specific tests, or "arch tests", in perf in the future. The particular issue triggered in the test case is that when the counter for the hardware event overflows and triggers a PMI we'll read both the hardware event and the software event counters. Unfortunately, for CQM that involves performing an IPI to read the CQM event counters on all sockets, which in NMI context triggers the WARN_ON_ONCE(). Reported-by: Peter Zijlstra Signed-off-by: Matt Fleming Cc: Adrian Hunter Cc: Andi Kleen Cc: Fenghua Yu Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Vikas Shivappa Cc: Vince Weaver Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-m...@codeblueprint.co.uk Link: http://lkml.kernel.org/n/tip-3p4ra0u8vzm7m289a1m79...@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/arch/x86/include/arch-tests.h | 1 + tools/perf/arch/x86/tests/Build | 1 + tools/perf/arch/x86/tests/arch-tests.c | 4 + tools/perf/arch/x86/tests/intel-cqm.c| 124 +++ 4 files changed, 130 insertions(+) diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h index 5927cf2..7ed00f4 100644 --- a/tools/perf/arch/x86/include/arch-tests.h +++ b/tools/perf/arch/x86/include/arch-tests.h @@ -5,6 +5,7 @@ int test__rdpmc(void); int test__perf_time_to_tsc(void); int test__insn_x86(void); +int test__intel_cqm_count_nmi_context(void); #ifdef HAVE_DWARF_UNWIND_SUPPORT struct thread; diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build index 8e2c5a3..cbb7e97 100644 --- a/tools/perf/arch/x86/tests/Build +++ b/tools/perf/arch/x86/tests/Build @@ -5,3 +5,4 @@ libperf-y += arch-tests.o libperf-y += rdpmc.o libperf-y += perf-time-to-tsc.o libperf-$(CONFIG_AUXTRACE) += insn-x86.o +libperf-y += intel-cqm.o diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c index d116c21..2218cb6 100644 --- a/tools/perf/arch/x86/tests/arch-tests.c +++ b/tools/perf/arch/x86/tests/arch-tests.c @@ -24,6 +24,10 @@ struct test arch_tests[] = { }, #endif { + .desc = "Test intel cqm nmi context read", + .func = test__intel_cqm_count_nmi_context, + }, + { .func = NULL, }, diff --git a/tools/perf/arch/x86/tests/intel-cqm.c b/tools/perf/arch/x86/tests/intel-cqm.c new file mode 100644 index 000..d28c1b6 --- /dev/null +++ b/tools/perf/arch/x86/tests/intel-cqm.c @@ -0,0 +1,124 @@ +#include "tests/tests.h" +#include "perf.h" +#include "cloexec.h" +#include "debug.h" +#include "evlist.h" +#include "evsel.h" +#include "arch-tests.h" + +#include +#include + +static pid_t spawn(void) +{ + pid_t pid; + + pid = fork(); + if (pid) + return pid; + + while(1); + sleep(5); + return 0; +} + +/* + * Create an event group that contains both a sampled hardware + * (cpu-cycles) and software (intel_cqm/llc_occupancy/) event. We then + * wait for the hardware perf counter to overflow and generate a PMI, + * which triggers an event read for both of the events in the group. + * + * Since reading Intel CQM event counters requires sending SMP IPIs, the + * CQM pmu needs to handle the above situation gracefully, and return + * the last read counter value to avoid triggering a WARN_ON_ONCE() in + * smp_call_function_many() caused by sending IPIs from NMI context. + */ +int test__intel_cqm_count_nmi_context(void) +{ + struct perf_evlist *evlist = NULL; + struct perf_evsel *evsel = NULL; + struct perf_event_attr pe; + int i, fd[2], flag, ret; + size_t mmap_len; + void *event; + pid_t pid; + int err = TEST_FAIL; + + flag = perf_event_open_cloexec_flag(); + + evlist = perf_evlist__new(); + if (!evlist) { + pr_debug("perf_evlist__new failed\n"); + return TEST_FAIL; + } + + ret = parse_events(evlist,
[tip:perf/core] perf tests: Move x86 tests into arch directory
Commit-ID: d8b167f9d8af817073ee35cf904e2e527465dbc1 Gitweb: http://git.kernel.org/tip/d8b167f9d8af817073ee35cf904e2e527465dbc1 Author: Matt FlemingAuthorDate: Mon, 5 Oct 2015 15:40:20 +0100 Committer: Arnaldo Carvalho de Melo CommitDate: Mon, 5 Oct 2015 16:55:43 -0300 perf tests: Move x86 tests into arch directory Move out the x86-specific tests into tools/perf/arch/x86/tests and define an 'arch_tests' array, which is the list of tests that only apply to the build architecture. We can also now begin to get rid of some of the #ifdef code that is present in the generic perf tests. Signed-off-by: Matt Fleming Cc: Adrian Hunter Cc: Andi Kleen Cc: Fenghua Yu Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Peter Zijlstra Cc: Vikas Shivappa Cc: Vince Weaver Link: http://lkml.kernel.org/n/tip-9s68h4ptg06ah0lgnjz55...@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/arch/x86/include/arch-tests.h | 12 ++ tools/perf/arch/x86/tests/Build| 3 +++ tools/perf/arch/x86/tests/arch-tests.c | 20 tools/perf/arch/x86/tests/dwarf-unwind.c | 1 + .../perf/{ => arch/x86}/tests/gen-insn-x86-dat.awk | 0 .../perf/{ => arch/x86}/tests/gen-insn-x86-dat.sh | 0 tools/perf/{ => arch/x86}/tests/insn-x86-dat-32.c | 0 tools/perf/{ => arch/x86}/tests/insn-x86-dat-64.c | 0 tools/perf/{ => arch/x86}/tests/insn-x86-dat-src.c | 0 tools/perf/{ => arch/x86}/tests/insn-x86.c | 3 ++- tools/perf/{ => arch/x86}/tests/perf-time-to-tsc.c | 4 +++- tools/perf/{ => arch/x86}/tests/rdpmc.c| 7 ++ tools/perf/tests/Build | 6 - tools/perf/tests/builtin-test.c| 28 -- tools/perf/tests/dwarf-unwind.c| 4 tools/perf/tests/tests.h | 5 +--- 16 files changed, 48 insertions(+), 45 deletions(-) diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h index 4bd41d8..5927cf2 100644 --- a/tools/perf/arch/x86/include/arch-tests.h +++ b/tools/perf/arch/x86/include/arch-tests.h @@ -1,6 +1,18 @@ #ifndef ARCH_TESTS_H #define ARCH_TESTS_H +/* Tests */ +int test__rdpmc(void); +int test__perf_time_to_tsc(void); +int test__insn_x86(void); + +#ifdef HAVE_DWARF_UNWIND_SUPPORT +struct thread; +struct perf_sample; +int test__arch_unwind_sample(struct perf_sample *sample, +struct thread *thread); +#endif + extern struct test arch_tests[]; #endif diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build index d827ef3..8e2c5a3 100644 --- a/tools/perf/arch/x86/tests/Build +++ b/tools/perf/arch/x86/tests/Build @@ -2,3 +2,6 @@ libperf-$(CONFIG_DWARF_UNWIND) += regs_load.o libperf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o libperf-y += arch-tests.o +libperf-y += rdpmc.o +libperf-y += perf-time-to-tsc.o +libperf-$(CONFIG_AUXTRACE) += insn-x86.o diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c index fca9eb9..d116c21 100644 --- a/tools/perf/arch/x86/tests/arch-tests.c +++ b/tools/perf/arch/x86/tests/arch-tests.c @@ -4,6 +4,26 @@ struct test arch_tests[] = { { + .desc = "x86 rdpmc test", + .func = test__rdpmc, + }, + { + .desc = "Test converting perf time to TSC", + .func = test__perf_time_to_tsc, + }, +#ifdef HAVE_DWARF_UNWIND_SUPPORT + { + .desc = "Test dwarf unwind", + .func = test__dwarf_unwind, + }, +#endif +#ifdef HAVE_AUXTRACE_SUPPORT + { + .desc = "Test x86 instruction decoder - new instructions", + .func = test__insn_x86, + }, +#endif + { .func = NULL, }, diff --git a/tools/perf/arch/x86/tests/dwarf-unwind.c b/tools/perf/arch/x86/tests/dwarf-unwind.c index d8bbf7a..7f209ce 100644 --- a/tools/perf/arch/x86/tests/dwarf-unwind.c +++ b/tools/perf/arch/x86/tests/dwarf-unwind.c @@ -5,6 +5,7 @@ #include "event.h" #include "debug.h" #include "tests/tests.h" +#include "arch-tests.h" #define STACK_SIZE 8192 diff --git a/tools/perf/tests/gen-insn-x86-dat.awk b/tools/perf/arch/x86/tests/gen-insn-x86-dat.awk similarity index 100% rename from tools/perf/tests/gen-insn-x86-dat.awk rename to tools/perf/arch/x86/tests/gen-insn-x86-dat.awk diff --git a/tools/perf/tests/gen-insn-x86-dat.sh b/tools/perf/arch/x86/tests/gen-insn-x86-dat.sh similarity index 100% rename from tools/perf/tests/gen-insn-x86-dat.sh rename to tools/perf/arch/x86/tests/gen-insn-x86-dat.sh diff --git a/tools/perf/tests/insn-x86-dat-32.c
[tip:perf/core] perf tests: Add arch tests
Commit-ID: 31b6753f95320260b160935d0e9c0b29f096ab57 Gitweb: http://git.kernel.org/tip/31b6753f95320260b160935d0e9c0b29f096ab57 Author: Matt FlemingAuthorDate: Mon, 5 Oct 2015 15:40:19 +0100 Committer: Arnaldo Carvalho de Melo CommitDate: Mon, 5 Oct 2015 16:55:38 -0300 perf tests: Add arch tests Tests that only make sense for some architectures currently live in the same place as the generic tests. Move out the x86-specific tests into tools/perf/arch/x86/tests and define an 'arch_tests' array, which is the list of tests that only apply to the build architecture. The main idea is to encourage developers to add arch tests to build out perf's test coverage, without dumping everything in tools/perf/tests. Signed-off-by: Matt Fleming Cc: Adrian Hunter Cc: Andi Kleen Cc: Fenghua Yu Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Peter Zijlstra Cc: Vikas Shivappa Cc: Vince Weaver Link: http://lkml.kernel.org/n/tip-p4uc1c15ssbj8xj7ku5sl...@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/arch/x86/Build| 2 +- tools/perf/arch/x86/include/arch-tests.h | 6 ++ tools/perf/arch/x86/tests/Build | 6 -- tools/perf/arch/x86/tests/arch-tests.c | 10 ++ tools/perf/tests/builtin-test.c | 28 tools/perf/tests/tests.h | 5 + 6 files changed, 46 insertions(+), 11 deletions(-) diff --git a/tools/perf/arch/x86/Build b/tools/perf/arch/x86/Build index 41bf61d..db52fa2 100644 --- a/tools/perf/arch/x86/Build +++ b/tools/perf/arch/x86/Build @@ -1,2 +1,2 @@ libperf-y += util/ -libperf-$(CONFIG_DWARF_UNWIND) += tests/ +libperf-y += tests/ diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h new file mode 100644 index 000..4bd41d8 --- /dev/null +++ b/tools/perf/arch/x86/include/arch-tests.h @@ -0,0 +1,6 @@ +#ifndef ARCH_TESTS_H +#define ARCH_TESTS_H + +extern struct test arch_tests[]; + +#endif diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build index b30eff9..d827ef3 100644 --- a/tools/perf/arch/x86/tests/Build +++ b/tools/perf/arch/x86/tests/Build @@ -1,2 +1,4 @@ -libperf-y += regs_load.o -libperf-y += dwarf-unwind.o +libperf-$(CONFIG_DWARF_UNWIND) += regs_load.o +libperf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o + +libperf-y += arch-tests.o diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c new file mode 100644 index 000..fca9eb9 --- /dev/null +++ b/tools/perf/arch/x86/tests/arch-tests.c @@ -0,0 +1,10 @@ +#include +#include "tests/tests.h" +#include "arch-tests.h" + +struct test arch_tests[] = { + { + .func = NULL, + }, + +}; diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c index d9bf51d..2b6c1bf 100644 --- a/tools/perf/tests/builtin-test.c +++ b/tools/perf/tests/builtin-test.c @@ -14,10 +14,13 @@ #include "parse-options.h" #include "symbol.h" -static struct test { - const char *desc; - int (*func)(void); -} tests[] = { +struct test __weak arch_tests[] = { + { + .func = NULL, + }, +}; + +static struct test generic_tests[] = { { .desc = "vmlinux symtab matches kallsyms", .func = test__vmlinux_matches_kallsyms, @@ -195,6 +198,11 @@ static struct test { }, }; +static struct test *tests[] = { + generic_tests, + arch_tests, +}; + static bool perf_test__matches(struct test *test, int curr, int argc, const char *argv[]) { int i; @@ -249,22 +257,25 @@ static int run_test(struct test *test) return err; } -#define for_each_test(t)for (t = [0]; t->func; t++) +#define for_each_test(j, t)\ + for (j = 0; j < ARRAY_SIZE(tests); j++) \ + for (t = [j][0]; t->func; t++) static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist) { struct test *t; + unsigned int j; int i = 0; int width = 0; - for_each_test(t) { + for_each_test(j, t) { int len = strlen(t->desc); if (width < len) width = len; } - for_each_test(t) { + for_each_test(j, t) { int curr = i++, err; if (!perf_test__matches(t, curr, argc, argv)) @@ -300,10 +311,11 @@ static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist) static int perf_test__list(int argc, const char **argv) { + unsigned int j; struct test *t; int i = 0; - for_each_test(t) { + for_each_test(j, t) { if (argc > 1 &&
[tip:core/urgent] x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down
Commit-ID: a5caa209ba9c29c6421292e7879d2387a2ef39c9 Gitweb: http://git.kernel.org/tip/a5caa209ba9c29c6421292e7879d2387a2ef39c9 Author: Matt Fleming AuthorDate: Fri, 25 Sep 2015 23:02:18 +0100 Committer: Ingo Molnar CommitDate: Thu, 1 Oct 2015 12:51:28 +0200 x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down Beginning with UEFI v2.5 EFI_PROPERTIES_TABLE was introduced that signals that the firmware PE/COFF loader supports splitting code and data sections of PE/COFF images into separate EFI memory map entries. This allows the kernel to map those regions with strict memory protections, e.g. EFI_MEMORY_RO for code, EFI_MEMORY_XP for data, etc. Unfortunately, an unwritten requirement of this new feature is that the regions need to be mapped with the same offsets relative to each other as observed in the EFI memory map. If this is not done crashes like this may occur, BUG: unable to handle kernel paging request at fffefe6086dd IP: [] 0xfffefe6086dd Call Trace: [] efi_call+0x7e/0x100 [] ? virt_efi_set_variable+0x61/0x90 [] efi_delete_dummy_variable+0x63/0x70 [] efi_enter_virtual_mode+0x383/0x392 [] start_kernel+0x38a/0x417 [] x86_64_start_reservations+0x2a/0x2c [] x86_64_start_kernel+0xeb/0xef Here 0xfffefe6086dd refers to an address the firmware expects to be mapped but which the OS never claimed was mapped. The issue is that included in these regions are relative addresses to other regions which were emitted by the firmware toolchain before the "splitting" of sections occurred at runtime. Needless to say, we don't satisfy this unwritten requirement on x86_64 and instead map the EFI memory map entries in reverse order. The above crash is almost certainly triggerable with any kernel newer than v3.13 because that's when we rewrote the EFI runtime region mapping code, in commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping"). For kernel versions before v3.13 things may work by pure luck depending on the fragmentation of the kernel virtual address space at the time we map the EFI regions. Instead of mapping the EFI memory map entries in reverse order, where entry N has a higher virtual address than entry N+1, map them in the same order as they appear in the EFI memory map to preserve this relative offset between regions. This patch has been kept as small as possible with the intention that it should be applied aggressively to stable and distribution kernels. It is very much a bugfix rather than support for a new feature, since when EFI_PROPERTIES_TABLE is enabled we must map things as outlined above to even boot - we have no way of asking the firmware not to split the code/data regions. In fact, this patch doesn't even make use of the more strict memory protections available in UEFI v2.5. That will come later. Suggested-by: Ard Biesheuvel Reported-by: Ard Biesheuvel Signed-off-by: Matt Fleming Cc: Cc: Borislav Petkov Cc: Chun-Yi Cc: Dave Young Cc: H. Peter Anvin Cc: James Bottomley Cc: Lee, Chun-Yi Cc: Leif Lindholm Cc: Linus Torvalds Cc: Matthew Garrett Cc: Mike Galbraith Cc: Peter Jones Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443218539-7610-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi.c | 67 - 1 file changed, 66 insertions(+), 1 deletion(-) diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 1db84c0..6a28ded 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -705,6 +705,70 @@ out: } /* + * Iterate the EFI memory map in reverse order because the regions + * will be mapped top-down. The end result is the same as if we had + * mapped things forward, but doesn't require us to change the + * existing implementation of efi_map_region(). + */ +static inline void *efi_map_next_entry_reverse(void *entry) +{ + /* Initial call */ + if (!entry) + return memmap.map_end - memmap.desc_size; + + entry -= memmap.desc_size; + if (entry < memmap.map) + return NULL; + + return entry; +} + +/* + * efi_map_next_entry - Return the next EFI memory map descriptor + * @entry: Previous EFI memory map descriptor + * + * This is a helper function to iterate over the EFI memory map, which + * we do in different orders depending on the current configuration. + * + * To begin traversing the memory map @entry must be %NULL. + * + * Returns %NULL when we reach the end of the memory map. + */ +static void *efi_map_next_entry(void *entry) +{ + if (!efi_enabled(EFI_OLD_MEMMAP) && efi_enabled(EFI_64BIT)) { + /* +* Starting in UEFI v2.5 the EFI_PROPERTIES_TABLE +* config table feature requires us to map all entries +* in the same order as they appear in the EFI memory +
[tip:core/urgent] x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down
Commit-ID: a5caa209ba9c29c6421292e7879d2387a2ef39c9 Gitweb: http://git.kernel.org/tip/a5caa209ba9c29c6421292e7879d2387a2ef39c9 Author: Matt FlemingAuthorDate: Fri, 25 Sep 2015 23:02:18 +0100 Committer: Ingo Molnar CommitDate: Thu, 1 Oct 2015 12:51:28 +0200 x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down Beginning with UEFI v2.5 EFI_PROPERTIES_TABLE was introduced that signals that the firmware PE/COFF loader supports splitting code and data sections of PE/COFF images into separate EFI memory map entries. This allows the kernel to map those regions with strict memory protections, e.g. EFI_MEMORY_RO for code, EFI_MEMORY_XP for data, etc. Unfortunately, an unwritten requirement of this new feature is that the regions need to be mapped with the same offsets relative to each other as observed in the EFI memory map. If this is not done crashes like this may occur, BUG: unable to handle kernel paging request at fffefe6086dd IP: [] 0xfffefe6086dd Call Trace: [] efi_call+0x7e/0x100 [] ? virt_efi_set_variable+0x61/0x90 [] efi_delete_dummy_variable+0x63/0x70 [] efi_enter_virtual_mode+0x383/0x392 [] start_kernel+0x38a/0x417 [] x86_64_start_reservations+0x2a/0x2c [] x86_64_start_kernel+0xeb/0xef Here 0xfffefe6086dd refers to an address the firmware expects to be mapped but which the OS never claimed was mapped. The issue is that included in these regions are relative addresses to other regions which were emitted by the firmware toolchain before the "splitting" of sections occurred at runtime. Needless to say, we don't satisfy this unwritten requirement on x86_64 and instead map the EFI memory map entries in reverse order. The above crash is almost certainly triggerable with any kernel newer than v3.13 because that's when we rewrote the EFI runtime region mapping code, in commit d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping"). For kernel versions before v3.13 things may work by pure luck depending on the fragmentation of the kernel virtual address space at the time we map the EFI regions. Instead of mapping the EFI memory map entries in reverse order, where entry N has a higher virtual address than entry N+1, map them in the same order as they appear in the EFI memory map to preserve this relative offset between regions. This patch has been kept as small as possible with the intention that it should be applied aggressively to stable and distribution kernels. It is very much a bugfix rather than support for a new feature, since when EFI_PROPERTIES_TABLE is enabled we must map things as outlined above to even boot - we have no way of asking the firmware not to split the code/data regions. In fact, this patch doesn't even make use of the more strict memory protections available in UEFI v2.5. That will come later. Suggested-by: Ard Biesheuvel Reported-by: Ard Biesheuvel Signed-off-by: Matt Fleming Cc: Cc: Borislav Petkov Cc: Chun-Yi Cc: Dave Young Cc: H. Peter Anvin Cc: James Bottomley Cc: Lee, Chun-Yi Cc: Leif Lindholm Cc: Linus Torvalds Cc: Matthew Garrett Cc: Mike Galbraith Cc: Peter Jones Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443218539-7610-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi.c | 67 - 1 file changed, 66 insertions(+), 1 deletion(-) diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 1db84c0..6a28ded 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -705,6 +705,70 @@ out: } /* + * Iterate the EFI memory map in reverse order because the regions + * will be mapped top-down. The end result is the same as if we had + * mapped things forward, but doesn't require us to change the + * existing implementation of efi_map_region(). + */ +static inline void *efi_map_next_entry_reverse(void *entry) +{ + /* Initial call */ + if (!entry) + return memmap.map_end - memmap.desc_size; + + entry -= memmap.desc_size; + if (entry < memmap.map) + return NULL; + + return entry; +} + +/* + * efi_map_next_entry - Return the next EFI memory map descriptor + * @entry: Previous EFI memory map descriptor + * + * This is a helper function to iterate over the EFI memory map, which + * we do in different orders depending on the current configuration. + * + * To begin traversing the memory map @entry must be %NULL. + * +
[tip:perf/core] perf tests: Introduce iterator function for tests
Commit-ID: e8210cefb7e1ec0760a6fe581ad0727a2dcf8dd1 Gitweb: http://git.kernel.org/tip/e8210cefb7e1ec0760a6fe581ad0727a2dcf8dd1 Author: Matt Fleming AuthorDate: Sat, 5 Sep 2015 20:02:20 +0100 Committer: Arnaldo Carvalho de Melo CommitDate: Mon, 14 Sep 2015 12:50:18 -0300 perf tests: Introduce iterator function for tests In preparation for introducing more arrays of tests, e.g. "arch tests" (architecture-specific tests), abstract the code to iterate over the list of tests into a helper function. This way, code that uses a 'struct test' doesn't need to worry about how the tests are grouped together and changes to the list of tests doesn't require changes to the code using it. Signed-off-by: Matt Fleming Acked-by: Jiri Olsa Cc: Andi Kleen Cc: Kanaka Juvva Cc: Peter Zijlstra Cc: Vikas Shivappa Cc: Vince Weaver Link: http://lkml.kernel.org/r/1441479742-15402-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/tests/builtin-test.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c index 98b0b24..d9bf51d 100644 --- a/tools/perf/tests/builtin-test.c +++ b/tools/perf/tests/builtin-test.c @@ -195,7 +195,7 @@ static struct test { }, }; -static bool perf_test__matches(int curr, int argc, const char *argv[]) +static bool perf_test__matches(struct test *test, int curr, int argc, const char *argv[]) { int i; @@ -212,7 +212,7 @@ static bool perf_test__matches(int curr, int argc, const char *argv[]) continue; } - if (strstr(tests[curr].desc, argv[i])) + if (strstr(test->desc, argv[i])) return true; } @@ -249,27 +249,28 @@ static int run_test(struct test *test) return err; } +#define for_each_test(t)for (t = [0]; t->func; t++) + static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist) { + struct test *t; int i = 0; int width = 0; - while (tests[i].func) { - int len = strlen(tests[i].desc); + for_each_test(t) { + int len = strlen(t->desc); if (width < len) width = len; - ++i; } - i = 0; - while (tests[i].func) { + for_each_test(t) { int curr = i++, err; - if (!perf_test__matches(curr, argc, argv)) + if (!perf_test__matches(t, curr, argc, argv)) continue; - pr_info("%2d: %-*s:", i, width, tests[curr].desc); + pr_info("%2d: %-*s:", i, width, t->desc); if (intlist__find(skiplist, i)) { color_fprintf(stderr, PERF_COLOR_YELLOW, " Skip (user override)\n"); @@ -277,8 +278,8 @@ static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist) } pr_debug("\n--- start ---\n"); - err = run_test([curr]); - pr_debug(" end \n%s:", tests[curr].desc); + err = run_test(t); + pr_debug(" end \n%s:", t->desc); switch (err) { case TEST_OK: @@ -299,15 +300,14 @@ static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist) static int perf_test__list(int argc, const char **argv) { + struct test *t; int i = 0; - while (tests[i].func) { - int curr = i++; - - if (argc > 1 && !strstr(tests[curr].desc, argv[1])) + for_each_test(t) { + if (argc > 1 && !strstr(t->desc, argv[1])) continue; - pr_info("%2d: %s\n", i, tests[curr].desc); + pr_info("%2d: %s\n", ++i, t->desc); } return 0; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/core] perf tests: Introduce iterator function for tests
Commit-ID: e8210cefb7e1ec0760a6fe581ad0727a2dcf8dd1 Gitweb: http://git.kernel.org/tip/e8210cefb7e1ec0760a6fe581ad0727a2dcf8dd1 Author: Matt FlemingAuthorDate: Sat, 5 Sep 2015 20:02:20 +0100 Committer: Arnaldo Carvalho de Melo CommitDate: Mon, 14 Sep 2015 12:50:18 -0300 perf tests: Introduce iterator function for tests In preparation for introducing more arrays of tests, e.g. "arch tests" (architecture-specific tests), abstract the code to iterate over the list of tests into a helper function. This way, code that uses a 'struct test' doesn't need to worry about how the tests are grouped together and changes to the list of tests doesn't require changes to the code using it. Signed-off-by: Matt Fleming Acked-by: Jiri Olsa Cc: Andi Kleen Cc: Kanaka Juvva Cc: Peter Zijlstra Cc: Vikas Shivappa Cc: Vince Weaver Link: http://lkml.kernel.org/r/1441479742-15402-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/tests/builtin-test.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c index 98b0b24..d9bf51d 100644 --- a/tools/perf/tests/builtin-test.c +++ b/tools/perf/tests/builtin-test.c @@ -195,7 +195,7 @@ static struct test { }, }; -static bool perf_test__matches(int curr, int argc, const char *argv[]) +static bool perf_test__matches(struct test *test, int curr, int argc, const char *argv[]) { int i; @@ -212,7 +212,7 @@ static bool perf_test__matches(int curr, int argc, const char *argv[]) continue; } - if (strstr(tests[curr].desc, argv[i])) + if (strstr(test->desc, argv[i])) return true; } @@ -249,27 +249,28 @@ static int run_test(struct test *test) return err; } +#define for_each_test(t)for (t = [0]; t->func; t++) + static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist) { + struct test *t; int i = 0; int width = 0; - while (tests[i].func) { - int len = strlen(tests[i].desc); + for_each_test(t) { + int len = strlen(t->desc); if (width < len) width = len; - ++i; } - i = 0; - while (tests[i].func) { + for_each_test(t) { int curr = i++, err; - if (!perf_test__matches(curr, argc, argv)) + if (!perf_test__matches(t, curr, argc, argv)) continue; - pr_info("%2d: %-*s:", i, width, tests[curr].desc); + pr_info("%2d: %-*s:", i, width, t->desc); if (intlist__find(skiplist, i)) { color_fprintf(stderr, PERF_COLOR_YELLOW, " Skip (user override)\n"); @@ -277,8 +278,8 @@ static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist) } pr_debug("\n--- start ---\n"); - err = run_test([curr]); - pr_debug(" end \n%s:", tests[curr].desc); + err = run_test(t); + pr_debug(" end \n%s:", t->desc); switch (err) { case TEST_OK: @@ -299,15 +300,14 @@ static int __cmd_test(int argc, const char *argv[], struct intlist *skiplist) static int perf_test__list(int argc, const char **argv) { + struct test *t; int i = 0; - while (tests[i].func) { - int curr = i++; - - if (argc > 1 && !strstr(tests[curr].desc, argv[1])) + for_each_test(t) { + if (argc > 1 && !strstr(t->desc, argv[1])) continue; - pr_info("%2d: %s\n", i, tests[curr].desc); + pr_info("%2d: %s\n", ++i, t->desc); } return 0; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/core] perf/x86/intel/cqm: Do not access cpu_data() from CPU_UP_PREPARE handler
Commit-ID: d7a702f0b1033cf402fef65bd6395072738f0844 Gitweb: http://git.kernel.org/tip/d7a702f0b1033cf402fef65bd6395072738f0844 Author: Matt Fleming AuthorDate: Thu, 6 Aug 2015 13:12:43 +0100 Committer: Ingo Molnar CommitDate: Wed, 12 Aug 2015 11:37:23 +0200 perf/x86/intel/cqm: Do not access cpu_data() from CPU_UP_PREPARE handler Tony reports that booting his 144-cpu machine with maxcpus=10 triggers the following WARN_ON(): [ 21.045727] WARNING: CPU: 8 PID: 647 at arch/x86/kernel/cpu/perf_event_intel_cqm.c:1267 intel_cqm_cpu_prepare+0x75/0x90() [ 21.045744] CPU: 8 PID: 647 Comm: systemd-udevd Not tainted 4.2.0-rc4 #1 [ 21.045745] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0066.R00.1506021730 06/02/2015 [ 21.045747] 82771b09 880856333ba8 81669b67 [ 21.045748] 880856333be8 8107b02a [ 21.045750] 88085b789800 88085f68a020 819e2470 000a [ 21.045750] Call Trace: [ 21.045757] [] dump_stack+0x45/0x57 [ 21.045759] [] warn_slowpath_common+0x8a/0xc0 [ 21.045761] [] warn_slowpath_null+0x1a/0x20 [ 21.045762] [] intel_cqm_cpu_prepare+0x75/0x90 [ 21.045764] [] intel_cqm_cpu_notifier+0x42/0x160 [ 21.045767] [] notifier_call_chain+0x4d/0x80 [ 21.045769] [] __raw_notifier_call_chain+0xe/0x10 [ 21.045770] [] _cpu_up+0xe8/0x190 [ 21.045771] [] cpu_up+0x7a/0xa0 [ 21.045774] [] cpu_subsys_online+0x40/0x90 [ 21.045777] [] device_online+0x67/0x90 [ 21.045778] [] online_store+0x8a/0xa0 [ 21.045782] [] dev_attr_store+0x18/0x30 [ 21.045785] [] sysfs_kf_write+0x3a/0x50 [ 21.045786] [] kernfs_fop_write+0x120/0x170 [ 21.045789] [] __vfs_write+0x37/0x100 [ 21.045791] [] ? __sb_start_write+0x58/0x110 [ 21.045795] [] ? security_file_permission+0x3d/0xc0 [ 21.045796] [] vfs_write+0xa9/0x190 [ 21.045797] [] SyS_write+0x55/0xc0 [ 21.045800] [] ? do_page_fault+0x30/0x80 [ 21.045804] [] entry_SYSCALL_64_fastpath+0x12/0x71 [ 21.045805] ---[ end trace fe228b836d8af405 ]--- The root cause is that CPU_UP_PREPARE is completely the wrong notifier action from which to access cpu_data(), because smp_store_cpu_info() won't have been executed by the target CPU at that point, which in turn means that ->x86_cache_max_rmid and ->x86_cache_occ_scale haven't been filled out. Instead let's invoke our handler from CPU_STARTING and rename it appropriately. Reported-by: Tony Luck Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Ashok Raj Cc: Kanaka Juvva Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vikas Shivappa Link: http://lkml.kernel.org/r/1438863163-14083-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index 63eb68b..377e8f8 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -1255,7 +1255,7 @@ static inline void cqm_pick_event_reader(int cpu) cpumask_set_cpu(cpu, _cpumask); } -static void intel_cqm_cpu_prepare(unsigned int cpu) +static void intel_cqm_cpu_starting(unsigned int cpu) { struct intel_pqr_state *state = _cpu(pqr_state, cpu); struct cpuinfo_x86 *c = _data(cpu); @@ -1296,13 +1296,11 @@ static int intel_cqm_cpu_notifier(struct notifier_block *nb, unsigned int cpu = (unsigned long)hcpu; switch (action & ~CPU_TASKS_FROZEN) { - case CPU_UP_PREPARE: - intel_cqm_cpu_prepare(cpu); - break; case CPU_DOWN_PREPARE: intel_cqm_cpu_exit(cpu); break; case CPU_STARTING: + intel_cqm_cpu_starting(cpu); cqm_pick_event_reader(cpu); break; } @@ -1373,7 +1371,7 @@ static int __init intel_cqm_init(void) goto out; for_each_online_cpu(i) { - intel_cqm_cpu_prepare(i); + intel_cqm_cpu_starting(i); cqm_pick_event_reader(i); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/core] perf/x86/intel/cqm: Do not access cpu_data() from CPU_UP_PREPARE handler
Commit-ID: d7a702f0b1033cf402fef65bd6395072738f0844 Gitweb: http://git.kernel.org/tip/d7a702f0b1033cf402fef65bd6395072738f0844 Author: Matt Fleming matt.flem...@intel.com AuthorDate: Thu, 6 Aug 2015 13:12:43 +0100 Committer: Ingo Molnar mi...@kernel.org CommitDate: Wed, 12 Aug 2015 11:37:23 +0200 perf/x86/intel/cqm: Do not access cpu_data() from CPU_UP_PREPARE handler Tony reports that booting his 144-cpu machine with maxcpus=10 triggers the following WARN_ON(): [ 21.045727] WARNING: CPU: 8 PID: 647 at arch/x86/kernel/cpu/perf_event_intel_cqm.c:1267 intel_cqm_cpu_prepare+0x75/0x90() [ 21.045744] CPU: 8 PID: 647 Comm: systemd-udevd Not tainted 4.2.0-rc4 #1 [ 21.045745] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0066.R00.1506021730 06/02/2015 [ 21.045747] 82771b09 880856333ba8 81669b67 [ 21.045748] 880856333be8 8107b02a [ 21.045750] 88085b789800 88085f68a020 819e2470 000a [ 21.045750] Call Trace: [ 21.045757] [81669b67] dump_stack+0x45/0x57 [ 21.045759] [8107b02a] warn_slowpath_common+0x8a/0xc0 [ 21.045761] [8107b15a] warn_slowpath_null+0x1a/0x20 [ 21.045762] [81036725] intel_cqm_cpu_prepare+0x75/0x90 [ 21.045764] [81036872] intel_cqm_cpu_notifier+0x42/0x160 [ 21.045767] [8109a33d] notifier_call_chain+0x4d/0x80 [ 21.045769] [8109a44e] __raw_notifier_call_chain+0xe/0x10 [ 21.045770] [8107b538] _cpu_up+0xe8/0x190 [ 21.045771] [8107b65a] cpu_up+0x7a/0xa0 [ 21.045774] [8165e920] cpu_subsys_online+0x40/0x90 [ 21.045777] [81433b37] device_online+0x67/0x90 [ 21.045778] [81433bea] online_store+0x8a/0xa0 [ 21.045782] [81430e78] dev_attr_store+0x18/0x30 [ 21.045785] [8126b6ba] sysfs_kf_write+0x3a/0x50 [ 21.045786] [8126ad40] kernfs_fop_write+0x120/0x170 [ 21.045789] [811f0b77] __vfs_write+0x37/0x100 [ 21.045791] [811f38b8] ? __sb_start_write+0x58/0x110 [ 21.045795] [81296d2d] ? security_file_permission+0x3d/0xc0 [ 21.045796] [811f1279] vfs_write+0xa9/0x190 [ 21.045797] [811f2075] SyS_write+0x55/0xc0 [ 21.045800] [81067300] ? do_page_fault+0x30/0x80 [ 21.045804] [816709ae] entry_SYSCALL_64_fastpath+0x12/0x71 [ 21.045805] ---[ end trace fe228b836d8af405 ]--- The root cause is that CPU_UP_PREPARE is completely the wrong notifier action from which to access cpu_data(), because smp_store_cpu_info() won't have been executed by the target CPU at that point, which in turn means that -x86_cache_max_rmid and -x86_cache_occ_scale haven't been filled out. Instead let's invoke our handler from CPU_STARTING and rename it appropriately. Reported-by: Tony Luck tony.l...@intel.com Signed-off-by: Matt Fleming matt.flem...@intel.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Cc: Ashok Raj ashok@intel.com Cc: Kanaka Juvva kanaka.d.ju...@intel.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Vikas Shivappa vikas.shiva...@intel.com Link: http://lkml.kernel.org/r/1438863163-14083-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar mi...@kernel.org --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index 63eb68b..377e8f8 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -1255,7 +1255,7 @@ static inline void cqm_pick_event_reader(int cpu) cpumask_set_cpu(cpu, cqm_cpumask); } -static void intel_cqm_cpu_prepare(unsigned int cpu) +static void intel_cqm_cpu_starting(unsigned int cpu) { struct intel_pqr_state *state = per_cpu(pqr_state, cpu); struct cpuinfo_x86 *c = cpu_data(cpu); @@ -1296,13 +1296,11 @@ static int intel_cqm_cpu_notifier(struct notifier_block *nb, unsigned int cpu = (unsigned long)hcpu; switch (action ~CPU_TASKS_FROZEN) { - case CPU_UP_PREPARE: - intel_cqm_cpu_prepare(cpu); - break; case CPU_DOWN_PREPARE: intel_cqm_cpu_exit(cpu); break; case CPU_STARTING: + intel_cqm_cpu_starting(cpu); cqm_pick_event_reader(cpu); break; } @@ -1373,7 +1371,7 @@ static int __init intel_cqm_init(void) goto out; for_each_online_cpu(i) { - intel_cqm_cpu_prepare(i); + intel_cqm_cpu_starting(i); cqm_pick_event_reader(i); } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to
[tip:core/efi] x86/efi-bgrt: Switch pr_err() to pr_debug() for invalid BGRT
Commit-ID: 248fbcd5aee00f6519a12c5ed3bc3dc0f5e84de5 Gitweb: http://git.kernel.org/tip/248fbcd5aee00f6519a12c5ed3bc3dc0f5e84de5 Author: Matt Fleming AuthorDate: Fri, 7 Aug 2015 09:36:55 +0100 Committer: Ingo Molnar CommitDate: Sat, 8 Aug 2015 10:37:39 +0200 x86/efi-bgrt: Switch pr_err() to pr_debug() for invalid BGRT It's totally legitimate, per the ACPI spec, for the firmware to set the BGRT 'status' field to zero to indicate that the BGRT image isn't being displayed, and we shouldn't be printing an error message in that case because it's just noise for users. So swap pr_err() for pr_debug(). However, Josh points that out it still makes sense to test the validity of the upper 7 bits of the 'status' field, since they're marked as "reserved" in the spec and must be zero. If firmware violates this it really *is* an error. Reported-by: Tom Yan Tested-by: Tom Yan Signed-off-by: Matt Fleming Reviewed-by: Josh Triplett Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Matthew Garrett Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1438936621-5215-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/platform/efi/efi-bgrt.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/platform/efi/efi-bgrt.c b/arch/x86/platform/efi/efi-bgrt.c index d7f997f..ea48449 100644 --- a/arch/x86/platform/efi/efi-bgrt.c +++ b/arch/x86/platform/efi/efi-bgrt.c @@ -50,11 +50,16 @@ void __init efi_bgrt_init(void) bgrt_tab->version); return; } - if (bgrt_tab->status != 1) { - pr_err("Ignoring BGRT: invalid status %u (expected 1)\n", + if (bgrt_tab->status & 0xfe) { + pr_err("Ignoring BGRT: reserved status bits are non-zero %u\n", bgrt_tab->status); return; } + if (bgrt_tab->status != 1) { + pr_debug("Ignoring BGRT: invalid status %u (expected 1)\n", +bgrt_tab->status); + return; + } if (bgrt_tab->image_type != 0) { pr_err("Ignoring BGRT: invalid image type %u (expected 0)\n", bgrt_tab->image_type); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:core/efi] Revert "x86/efi: Request desired alignment via the PE/COFF headers"
Commit-ID: fa5c35011a8d5f3d0c597a6336107eafd1b6046c Gitweb: http://git.kernel.org/tip/fa5c35011a8d5f3d0c597a6336107eafd1b6046c Author: Matt Fleming AuthorDate: Fri, 7 Aug 2015 09:36:56 +0100 Committer: Ingo Molnar CommitDate: Sat, 8 Aug 2015 10:37:39 +0200 Revert "x86/efi: Request desired alignment via the PE/COFF headers" This reverts commit: aeffc4928ea2 ("x86/efi: Request desired alignment via the PE/COFF headers") Linn reports that Signtool complains that kernels built with CONFIG_EFI_STUB=y are violating the PE/COFF specification because the 'SizeOfImage' field is not a multiple of 'SectionAlignment'. This violation was introduced as an optimisation to skip having the kernel relocate itself during boot and instead have the firmware place it at a correctly aligned address. No one else has complained and I'm not aware of any firmware implementations that refuse to boot with commit aeffc4928ea2, but it's a real bug, so revert the offending commit. Reported-by: Linn Crosetto Signed-off-by: Matt Fleming Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Michael Brown Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1438936621-5215-3-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/boot/header.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S index 16ef025..7a6d43a 100644 --- a/arch/x86/boot/header.S +++ b/arch/x86/boot/header.S @@ -154,7 +154,7 @@ extra_header_fields: #else .quad 0 # ImageBase #endif - .long CONFIG_PHYSICAL_ALIGN # SectionAlignment + .long 0x20# SectionAlignment .long 0x20# FileAlignment .word 0 # MajorOperatingSystemVersion .word 0 # MinorOperatingSystemVersion -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:core/efi] Revert x86/efi: Request desired alignment via the PE/COFF headers
Commit-ID: fa5c35011a8d5f3d0c597a6336107eafd1b6046c Gitweb: http://git.kernel.org/tip/fa5c35011a8d5f3d0c597a6336107eafd1b6046c Author: Matt Fleming matt.flem...@intel.com AuthorDate: Fri, 7 Aug 2015 09:36:56 +0100 Committer: Ingo Molnar mi...@kernel.org CommitDate: Sat, 8 Aug 2015 10:37:39 +0200 Revert x86/efi: Request desired alignment via the PE/COFF headers This reverts commit: aeffc4928ea2 (x86/efi: Request desired alignment via the PE/COFF headers) Linn reports that Signtool complains that kernels built with CONFIG_EFI_STUB=y are violating the PE/COFF specification because the 'SizeOfImage' field is not a multiple of 'SectionAlignment'. This violation was introduced as an optimisation to skip having the kernel relocate itself during boot and instead have the firmware place it at a correctly aligned address. No one else has complained and I'm not aware of any firmware implementations that refuse to boot with commit aeffc4928ea2, but it's a real bug, so revert the offending commit. Reported-by: Linn Crosetto l...@hp.com Signed-off-by: Matt Fleming matt.flem...@intel.com Cc: H. Peter Anvin h...@zytor.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Michael Brown mbr...@fensystems.co.uk Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Link: http://lkml.kernel.org/r/1438936621-5215-3-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar mi...@kernel.org --- arch/x86/boot/header.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S index 16ef025..7a6d43a 100644 --- a/arch/x86/boot/header.S +++ b/arch/x86/boot/header.S @@ -154,7 +154,7 @@ extra_header_fields: #else .quad 0 # ImageBase #endif - .long CONFIG_PHYSICAL_ALIGN # SectionAlignment + .long 0x20# SectionAlignment .long 0x20# FileAlignment .word 0 # MajorOperatingSystemVersion .word 0 # MinorOperatingSystemVersion -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:core/efi] x86/efi-bgrt: Switch pr_err() to pr_debug() for invalid BGRT
Commit-ID: 248fbcd5aee00f6519a12c5ed3bc3dc0f5e84de5 Gitweb: http://git.kernel.org/tip/248fbcd5aee00f6519a12c5ed3bc3dc0f5e84de5 Author: Matt Fleming matt.flem...@intel.com AuthorDate: Fri, 7 Aug 2015 09:36:55 +0100 Committer: Ingo Molnar mi...@kernel.org CommitDate: Sat, 8 Aug 2015 10:37:39 +0200 x86/efi-bgrt: Switch pr_err() to pr_debug() for invalid BGRT It's totally legitimate, per the ACPI spec, for the firmware to set the BGRT 'status' field to zero to indicate that the BGRT image isn't being displayed, and we shouldn't be printing an error message in that case because it's just noise for users. So swap pr_err() for pr_debug(). However, Josh points that out it still makes sense to test the validity of the upper 7 bits of the 'status' field, since they're marked as reserved in the spec and must be zero. If firmware violates this it really *is* an error. Reported-by: Tom Yan tom.t...@gmail.com Tested-by: Tom Yan tom.t...@gmail.com Signed-off-by: Matt Fleming matt.flem...@intel.com Reviewed-by: Josh Triplett j...@joshtriplett.org Cc: H. Peter Anvin h...@zytor.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Matthew Garrett mj...@srcf.ucam.org Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Link: http://lkml.kernel.org/r/1438936621-5215-2-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar mi...@kernel.org --- arch/x86/platform/efi/efi-bgrt.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/platform/efi/efi-bgrt.c b/arch/x86/platform/efi/efi-bgrt.c index d7f997f..ea48449 100644 --- a/arch/x86/platform/efi/efi-bgrt.c +++ b/arch/x86/platform/efi/efi-bgrt.c @@ -50,11 +50,16 @@ void __init efi_bgrt_init(void) bgrt_tab-version); return; } - if (bgrt_tab-status != 1) { - pr_err(Ignoring BGRT: invalid status %u (expected 1)\n, + if (bgrt_tab-status 0xfe) { + pr_err(Ignoring BGRT: reserved status bits are non-zero %u\n, bgrt_tab-status); return; } + if (bgrt_tab-status != 1) { + pr_debug(Ignoring BGRT: invalid status %u (expected 1)\n, +bgrt_tab-status); + return; + } if (bgrt_tab-image_type != 0) { pr_err(Ignoring BGRT: invalid image type %u (expected 0)\n, bgrt_tab-image_type); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/urgent] perf/x86/intel/cqm: Return cached counter value from IRQ context
Commit-ID: 2c534c0da0a68418693e10ce1c4146e085f39518 Gitweb: http://git.kernel.org/tip/2c534c0da0a68418693e10ce1c4146e085f39518 Author: Matt Fleming AuthorDate: Tue, 21 Jul 2015 15:55:09 +0100 Committer: Thomas Gleixner CommitDate: Sun, 26 Jul 2015 10:22:29 +0200 perf/x86/intel/cqm: Return cached counter value from IRQ context Peter reported the following potential crash which I was able to reproduce with his test program, [ 148.765788] [ cut here ] [ 148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260() [ 148.765797] Modules linked in: [ 148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4 [ 148.765803] 81cdc398 88085f105950 818bdfd5 0007 [ 148.765805] 88085f105990 810e413a [ 148.765807] 82301080 0022 8107f640 8107f640 [ 148.765809] Call Trace: [ 148.765810][] dump_stack+0x45/0x57 [ 148.765818] [] warn_slowpath_common+0x8a/0xc0 [ 148.765822] [] ? intel_cqm_stable+0x60/0x60 [ 148.765824] [] ? intel_cqm_stable+0x60/0x60 [ 148.765825] [] warn_slowpath_null+0x1a/0x20 [ 148.765827] [] smp_call_function_many+0xb6/0x260 [ 148.765829] [] ? intel_cqm_stable+0x60/0x60 [ 148.765831] [] on_each_cpu_mask+0x28/0x60 [ 148.765832] [] intel_cqm_event_count+0x7f/0xe0 [ 148.765836] [] perf_output_read+0x2a5/0x400 [ 148.765839] [] perf_output_sample+0x31a/0x590 [ 148.765840] [] ? perf_prepare_sample+0x26d/0x380 [ 148.765841] [] perf_event_output+0x47/0x60 [ 148.765843] [] __perf_event_overflow+0x215/0x240 [ 148.765844] [] perf_event_overflow+0x14/0x20 [ 148.765847] [] intel_pmu_handle_irq+0x1d4/0x440 [ 148.765849] [] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765853] [] ? vunmap_page_range+0x19d/0x2f0 [ 148.765854] [] ? unmap_kernel_range_noflush+0x11/0x20 [ 148.765859] [] ? ghes_copy_tofrom_phys+0x11e/0x2a0 [ 148.765863] [] ? native_apic_msr_write+0x2b/0x30 [ 148.765865] [] ? x2apic_send_IPI_self+0x1d/0x20 [ 148.765869] [] ? arch_irq_work_raise+0x35/0x40 [ 148.765872] [] ? irq_work_queue+0x66/0x80 [ 148.765875] [] perf_event_nmi_handler+0x26/0x40 [ 148.765877] [] nmi_handle+0x79/0x100 [ 148.765879] [] default_do_nmi+0x42/0x100 [ 148.765880] [] do_nmi+0x83/0xb0 [ 148.765884] [] end_repeat_nmi+0x1e/0x2e [ 148.765886] [] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765888] [] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765890] [] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765891] <> [] finish_task_switch+0x156/0x210 [ 148.765898] [] __schedule+0x341/0x920 [ 148.765899] [] schedule+0x37/0x80 [ 148.765903] [] ? do_page_fault+0x2f/0x80 [ 148.765905] [] schedule_user+0x1a/0x50 [ 148.765907] [] retint_careful+0x14/0x32 [ 148.765908] ---[ end trace e33ff2be78e14901 ]--- The CQM task events are not safe to be called from within interrupt context because they require performing an IPI to read the counter value on all sockets. And performing IPIs from within IRQ context is a "no-no". Make do with the last read counter value currently event in event->count when we're invoked in this context. Reported-by: Peter Zijlstra Signed-off-by: Matt Fleming Cc: Thomas Gleixner Cc: Vikas Shivappa Cc: Kanaka Juvva Cc: Will Auld Cc: Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Thomas Gleixner --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 8 1 file changed, 8 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index 1880761..63eb68b 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -952,6 +952,14 @@ static u64 intel_cqm_event_count(struct perf_event *event) return 0; /* +* Getting up-to-date values requires an SMP IPI which is not +* possible if we're being called in interrupt context. Return +* the cached values instead. +*/ + if (unlikely(in_interrupt())) + goto out; + + /* * Notice that we don't perform the reading of an RMID * atomically, because we can't hold a spin lock across the * IPIs. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/urgent] perf/x86/intel/cqm: Return cached counter value from IRQ context
Commit-ID: 2c534c0da0a68418693e10ce1c4146e085f39518 Gitweb: http://git.kernel.org/tip/2c534c0da0a68418693e10ce1c4146e085f39518 Author: Matt Fleming matt.flem...@intel.com AuthorDate: Tue, 21 Jul 2015 15:55:09 +0100 Committer: Thomas Gleixner t...@linutronix.de CommitDate: Sun, 26 Jul 2015 10:22:29 +0200 perf/x86/intel/cqm: Return cached counter value from IRQ context Peter reported the following potential crash which I was able to reproduce with his test program, [ 148.765788] [ cut here ] [ 148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260() [ 148.765797] Modules linked in: [ 148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4 [ 148.765803] 81cdc398 88085f105950 818bdfd5 0007 [ 148.765805] 88085f105990 810e413a [ 148.765807] 82301080 0022 8107f640 8107f640 [ 148.765809] Call Trace: [ 148.765810] NMI [818bdfd5] dump_stack+0x45/0x57 [ 148.765818] [810e413a] warn_slowpath_common+0x8a/0xc0 [ 148.765822] [8107f640] ? intel_cqm_stable+0x60/0x60 [ 148.765824] [8107f640] ? intel_cqm_stable+0x60/0x60 [ 148.765825] [810e422a] warn_slowpath_null+0x1a/0x20 [ 148.765827] [811613f6] smp_call_function_many+0xb6/0x260 [ 148.765829] [8107f640] ? intel_cqm_stable+0x60/0x60 [ 148.765831] [81161748] on_each_cpu_mask+0x28/0x60 [ 148.765832] [8107f6ef] intel_cqm_event_count+0x7f/0xe0 [ 148.765836] [811cdd35] perf_output_read+0x2a5/0x400 [ 148.765839] [811d2e5a] perf_output_sample+0x31a/0x590 [ 148.765840] [811d333d] ? perf_prepare_sample+0x26d/0x380 [ 148.765841] [811d3497] perf_event_output+0x47/0x60 [ 148.765843] [811d36c5] __perf_event_overflow+0x215/0x240 [ 148.765844] [811d4124] perf_event_overflow+0x14/0x20 [ 148.765847] [8107e7f4] intel_pmu_handle_irq+0x1d4/0x440 [ 148.765849] [811d07a6] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765853] [81219bad] ? vunmap_page_range+0x19d/0x2f0 [ 148.765854] [81219d11] ? unmap_kernel_range_noflush+0x11/0x20 [ 148.765859] [814ce6fe] ? ghes_copy_tofrom_phys+0x11e/0x2a0 [ 148.765863] [8109e5db] ? native_apic_msr_write+0x2b/0x30 [ 148.765865] [8109e44d] ? x2apic_send_IPI_self+0x1d/0x20 [ 148.765869] [81065135] ? arch_irq_work_raise+0x35/0x40 [ 148.765872] [811c8d86] ? irq_work_queue+0x66/0x80 [ 148.765875] [81075306] perf_event_nmi_handler+0x26/0x40 [ 148.765877] [81063ed9] nmi_handle+0x79/0x100 [ 148.765879] [81064422] default_do_nmi+0x42/0x100 [ 148.765880] [81064563] do_nmi+0x83/0xb0 [ 148.765884] [818c7c0f] end_repeat_nmi+0x1e/0x2e [ 148.765886] [811d07a6] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765888] [811d07a6] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765890] [811d07a6] ? __perf_event_task_sched_in+0x36/0xa0 [ 148.765891] EOE [8110ab66] finish_task_switch+0x156/0x210 [ 148.765898] [818c1671] __schedule+0x341/0x920 [ 148.765899] [818c1c87] schedule+0x37/0x80 [ 148.765903] [810ae1af] ? do_page_fault+0x2f/0x80 [ 148.765905] [818c1f4a] schedule_user+0x1a/0x50 [ 148.765907] [818c666c] retint_careful+0x14/0x32 [ 148.765908] ---[ end trace e33ff2be78e14901 ]--- The CQM task events are not safe to be called from within interrupt context because they require performing an IPI to read the counter value on all sockets. And performing IPIs from within IRQ context is a no-no. Make do with the last read counter value currently event in event-count when we're invoked in this context. Reported-by: Peter Zijlstra pet...@infradead.org Signed-off-by: Matt Fleming matt.flem...@intel.com Cc: Thomas Gleixner t...@linutronix.de Cc: Vikas Shivappa vikas.shiva...@intel.com Cc: Kanaka Juvva kanaka.d.ju...@intel.com Cc: Will Auld will.a...@intel.com Cc: sta...@vger.kernel.org Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Thomas Gleixner t...@linutronix.de --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 8 1 file changed, 8 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index 1880761..63eb68b 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -952,6 +952,14 @@ static u64 intel_cqm_event_count(struct perf_event *event) return 0; /* +* Getting up-to-date values requires an SMP IPI which is not +* possible if we're being called in interrupt context. Return +* the cached values instead. +*/ + if (unlikely(in_interrupt())) + goto
[tip:perf/core] perf/x86/intel/cqm: Use 'u32' data type for RMIDs
Commit-ID: adafa99960ef18b019f001ddee4d9d81c4e25944 Gitweb: http://git.kernel.org/tip/adafa99960ef18b019f001ddee4d9d81c4e25944 Author: Matt Fleming AuthorDate: Fri, 22 May 2015 09:59:42 +0100 Committer: Ingo Molnar CommitDate: Wed, 27 May 2015 09:17:41 +0200 perf/x86/intel/cqm: Use 'u32' data type for RMIDs Since we write RMID values to MSRs the correct type to use is 'u32' because that clearly articulates we're writing a hardware register value. Fix up all uses of RMID in this code to consistently use the correct data type. Reported-by: Thomas Gleixner Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Acked-by: Thomas Gleixner Cc: Kanaka Juvva Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Vikas Shivappa Cc: Will Auld Link: http://lkml.kernel.org/r/1432285182-17180-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 37 +++--- 1 file changed, 18 insertions(+), 19 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index 8233b29..1880761 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -13,7 +13,7 @@ #define MSR_IA32_QM_CTR0x0c8e #define MSR_IA32_QM_EVTSEL 0x0c8d -static unsigned int cqm_max_rmid = -1; +static u32 cqm_max_rmid = -1; static unsigned int cqm_l3_scale; /* supposedly cacheline size */ /** @@ -76,7 +76,7 @@ static cpumask_t cqm_cpumask; * near-zero occupancy value, i.e. no cachelines are tagged with this * RMID, once __intel_cqm_rmid_rotate() returns. */ -static unsigned int intel_cqm_rotation_rmid; +static u32 intel_cqm_rotation_rmid; #define INVALID_RMID (-1) @@ -88,7 +88,7 @@ static unsigned int intel_cqm_rotation_rmid; * Likewise, an rmid value of -1 is used to indicate "no rmid currently * assigned" and is used as part of the rotation code. */ -static inline bool __rmid_valid(unsigned int rmid) +static inline bool __rmid_valid(u32 rmid) { if (!rmid || rmid == INVALID_RMID) return false; @@ -96,7 +96,7 @@ static inline bool __rmid_valid(unsigned int rmid) return true; } -static u64 __rmid_read(unsigned int rmid) +static u64 __rmid_read(u32 rmid) { u64 val; @@ -121,7 +121,7 @@ enum rmid_recycle_state { }; struct cqm_rmid_entry { - unsigned int rmid; + u32 rmid; enum rmid_recycle_state state; struct list_head list; unsigned long queue_time; @@ -166,7 +166,7 @@ static LIST_HEAD(cqm_rmid_limbo_lru); */ static struct cqm_rmid_entry **cqm_rmid_ptrs; -static inline struct cqm_rmid_entry *__rmid_entry(int rmid) +static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid) { struct cqm_rmid_entry *entry; @@ -181,7 +181,7 @@ static inline struct cqm_rmid_entry *__rmid_entry(int rmid) * * We expect to be called with cache_mutex held. */ -static int __get_rmid(void) +static u32 __get_rmid(void) { struct cqm_rmid_entry *entry; @@ -196,7 +196,7 @@ static int __get_rmid(void) return entry->rmid; } -static void __put_rmid(unsigned int rmid) +static void __put_rmid(u32 rmid) { struct cqm_rmid_entry *entry; @@ -391,7 +391,7 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b) } struct rmid_read { - unsigned int rmid; + u32 rmid; atomic64_t value; }; @@ -400,12 +400,11 @@ static void __intel_cqm_event_count(void *info); /* * Exchange the RMID of a group of events. */ -static unsigned int -intel_cqm_xchg_rmid(struct perf_event *group, unsigned int rmid) +static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid) { struct perf_event *event; - unsigned int old_rmid = group->hw.cqm_rmid; struct list_head *head = >hw.cqm_group_entry; + u32 old_rmid = group->hw.cqm_rmid; lockdep_assert_held(_mutex); @@ -470,7 +469,7 @@ static void intel_cqm_stable(void *arg) * If we have group events waiting for an RMID that don't conflict with * events already running, assign @rmid. */ -static bool intel_cqm_sched_in_event(unsigned int rmid) +static bool intel_cqm_sched_in_event(u32 rmid) { struct perf_event *leader, *event; @@ -617,7 +616,7 @@ static bool intel_cqm_rmid_stabilize(unsigned int *available) static void __intel_cqm_pick_and_rotate(struct perf_event *next) { struct perf_event *rotor; - unsigned int rmid; + u32 rmid; lockdep_assert_held(_mutex); @@ -645,7 +644,7 @@ static void __intel_cqm_pick_and_rotate(struct perf_event *next) static void intel_cqm_sched_out_conflicting_events(struct perf_event *event) { struct perf_event *group, *g; - unsigned int rmid; + u32 rmid; lockdep_assert_held(_mutex); @@ -847,8 +846,8 @@ static void intel_cqm_setup_event(struct perf_event *event,
[tip:perf/core] perf/x86/intel/cqm: Use 'u32' data type for RMIDs
Commit-ID: adafa99960ef18b019f001ddee4d9d81c4e25944 Gitweb: http://git.kernel.org/tip/adafa99960ef18b019f001ddee4d9d81c4e25944 Author: Matt Fleming matt.flem...@intel.com AuthorDate: Fri, 22 May 2015 09:59:42 +0100 Committer: Ingo Molnar mi...@kernel.org CommitDate: Wed, 27 May 2015 09:17:41 +0200 perf/x86/intel/cqm: Use 'u32' data type for RMIDs Since we write RMID values to MSRs the correct type to use is 'u32' because that clearly articulates we're writing a hardware register value. Fix up all uses of RMID in this code to consistently use the correct data type. Reported-by: Thomas Gleixner t...@linutronix.de Signed-off-by: Matt Fleming matt.flem...@intel.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Acked-by: Thomas Gleixner t...@linutronix.de Cc: Kanaka Juvva kanaka.d.ju...@intel.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Peter Zijlstra pet...@infradead.org Cc: Vikas Shivappa vikas.shiva...@linux.intel.com Cc: Will Auld will.a...@intel.com Link: http://lkml.kernel.org/r/1432285182-17180-1-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar mi...@kernel.org --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 37 +++--- 1 file changed, 18 insertions(+), 19 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index 8233b29..1880761 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -13,7 +13,7 @@ #define MSR_IA32_QM_CTR0x0c8e #define MSR_IA32_QM_EVTSEL 0x0c8d -static unsigned int cqm_max_rmid = -1; +static u32 cqm_max_rmid = -1; static unsigned int cqm_l3_scale; /* supposedly cacheline size */ /** @@ -76,7 +76,7 @@ static cpumask_t cqm_cpumask; * near-zero occupancy value, i.e. no cachelines are tagged with this * RMID, once __intel_cqm_rmid_rotate() returns. */ -static unsigned int intel_cqm_rotation_rmid; +static u32 intel_cqm_rotation_rmid; #define INVALID_RMID (-1) @@ -88,7 +88,7 @@ static unsigned int intel_cqm_rotation_rmid; * Likewise, an rmid value of -1 is used to indicate no rmid currently * assigned and is used as part of the rotation code. */ -static inline bool __rmid_valid(unsigned int rmid) +static inline bool __rmid_valid(u32 rmid) { if (!rmid || rmid == INVALID_RMID) return false; @@ -96,7 +96,7 @@ static inline bool __rmid_valid(unsigned int rmid) return true; } -static u64 __rmid_read(unsigned int rmid) +static u64 __rmid_read(u32 rmid) { u64 val; @@ -121,7 +121,7 @@ enum rmid_recycle_state { }; struct cqm_rmid_entry { - unsigned int rmid; + u32 rmid; enum rmid_recycle_state state; struct list_head list; unsigned long queue_time; @@ -166,7 +166,7 @@ static LIST_HEAD(cqm_rmid_limbo_lru); */ static struct cqm_rmid_entry **cqm_rmid_ptrs; -static inline struct cqm_rmid_entry *__rmid_entry(int rmid) +static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid) { struct cqm_rmid_entry *entry; @@ -181,7 +181,7 @@ static inline struct cqm_rmid_entry *__rmid_entry(int rmid) * * We expect to be called with cache_mutex held. */ -static int __get_rmid(void) +static u32 __get_rmid(void) { struct cqm_rmid_entry *entry; @@ -196,7 +196,7 @@ static int __get_rmid(void) return entry-rmid; } -static void __put_rmid(unsigned int rmid) +static void __put_rmid(u32 rmid) { struct cqm_rmid_entry *entry; @@ -391,7 +391,7 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b) } struct rmid_read { - unsigned int rmid; + u32 rmid; atomic64_t value; }; @@ -400,12 +400,11 @@ static void __intel_cqm_event_count(void *info); /* * Exchange the RMID of a group of events. */ -static unsigned int -intel_cqm_xchg_rmid(struct perf_event *group, unsigned int rmid) +static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid) { struct perf_event *event; - unsigned int old_rmid = group-hw.cqm_rmid; struct list_head *head = group-hw.cqm_group_entry; + u32 old_rmid = group-hw.cqm_rmid; lockdep_assert_held(cache_mutex); @@ -470,7 +469,7 @@ static void intel_cqm_stable(void *arg) * If we have group events waiting for an RMID that don't conflict with * events already running, assign @rmid. */ -static bool intel_cqm_sched_in_event(unsigned int rmid) +static bool intel_cqm_sched_in_event(u32 rmid) { struct perf_event *leader, *event; @@ -617,7 +616,7 @@ static bool intel_cqm_rmid_stabilize(unsigned int *available) static void __intel_cqm_pick_and_rotate(struct perf_event *next) { struct perf_event *rotor; - unsigned int rmid; + u32 rmid; lockdep_assert_held(cache_mutex); @@ -645,7 +644,7 @@ static void __intel_cqm_pick_and_rotate(struct perf_event *next) static void
[tip:perf/x86] perf/x86/intel: Fix Makefile to actually build the cqm driver
Commit-ID: 4e16ed99416ef569a89782a7234f95007919fadd Gitweb: http://git.kernel.org/tip/4e16ed99416ef569a89782a7234f95007919fadd Author: Matt Fleming AuthorDate: Thu, 26 Feb 2015 18:47:00 + Committer: Ingo Molnar CommitDate: Mon, 23 Mar 2015 10:58:03 +0100 perf/x86/intel: Fix Makefile to actually build the cqm driver Someone fat fingered a merge conflict and lost the Makefile hunk. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Cc: Cc: Cc: Cc: Cc: Cc: Link: http://lkml.kernel.org/r/1424976420.15321.35.ca...@mfleming-mobl1.ger.corp.intel.com Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile index 80091ae..6c1ca13 100644 --- a/arch/x86/kernel/cpu/Makefile +++ b/arch/x86/kernel/cpu/Makefile @@ -39,7 +39,7 @@ obj-$(CONFIG_CPU_SUP_AMD) += perf_event_amd_iommu.o endif obj-$(CONFIG_CPU_SUP_INTEL)+= perf_event_p6.o perf_event_knc.o perf_event_p4.o obj-$(CONFIG_CPU_SUP_INTEL)+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o -obj-$(CONFIG_CPU_SUP_INTEL)+= perf_event_intel_rapl.o +obj-$(CONFIG_CPU_SUP_INTEL)+= perf_event_intel_rapl.o perf_event_intel_cqm.o obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \ perf_event_intel_uncore_snb.o \ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/x86] perf/x86/intel: Fix Makefile to actually build the cqm driver
Commit-ID: 4e16ed99416ef569a89782a7234f95007919fadd Gitweb: http://git.kernel.org/tip/4e16ed99416ef569a89782a7234f95007919fadd Author: Matt Fleming matt.flem...@intel.com AuthorDate: Thu, 26 Feb 2015 18:47:00 + Committer: Ingo Molnar mi...@kernel.org CommitDate: Mon, 23 Mar 2015 10:58:03 +0100 perf/x86/intel: Fix Makefile to actually build the cqm driver Someone fat fingered a merge conflict and lost the Makefile hunk. Signed-off-by: Matt Fleming matt.flem...@intel.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Cc: a...@redhat.com Cc: h...@zytor.com Cc: jo...@redhat.com Cc: kanaka.d.ju...@intel.com Cc: t...@linutronix.de Cc: torva...@linux-foundation.org Cc: vikas.shiva...@linux.intel.com Link: http://lkml.kernel.org/r/1424976420.15321.35.ca...@mfleming-mobl1.ger.corp.intel.com Signed-off-by: Ingo Molnar mi...@kernel.org --- arch/x86/kernel/cpu/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile index 80091ae..6c1ca13 100644 --- a/arch/x86/kernel/cpu/Makefile +++ b/arch/x86/kernel/cpu/Makefile @@ -39,7 +39,7 @@ obj-$(CONFIG_CPU_SUP_AMD) += perf_event_amd_iommu.o endif obj-$(CONFIG_CPU_SUP_INTEL)+= perf_event_p6.o perf_event_knc.o perf_event_p4.o obj-$(CONFIG_CPU_SUP_INTEL)+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o -obj-$(CONFIG_CPU_SUP_INTEL)+= perf_event_intel_rapl.o +obj-$(CONFIG_CPU_SUP_INTEL)+= perf_event_intel_rapl.o perf_event_intel_cqm.o obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \ perf_event_intel_uncore_snb.o \ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/x86] perf/x86/intel: Enable conflicting event scheduling for CQM
Commit-ID: 59bf7fd45c90a8fde22a7717b5413e4ed9666c32 Gitweb: http://git.kernel.org/tip/59bf7fd45c90a8fde22a7717b5413e4ed9666c32 Author: Matt Fleming AuthorDate: Fri, 23 Jan 2015 18:45:48 + Committer: Ingo Molnar CommitDate: Wed, 25 Feb 2015 13:53:36 +0100 perf/x86/intel: Enable conflicting event scheduling for CQM We can leverage the workqueue that we use for RMID rotation to support scheduling of conflicting monitoring events. Allowing events that monitor conflicting things is done at various other places in the perf subsystem, so there's precedent there. An example of two conflicting events would be monitoring a cgroup and simultaneously monitoring a task within that cgroup. This uses the cache_groups list as a queuing mechanism, where every event that reaches the front of the list gets the chance to be scheduled in, possibly descheduling any conflicting events that are running. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: H. Peter Anvin Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Linus Torvalds Cc: Vikas Shivappa Link: http://lkml.kernel.org/r/1422038748-21397-10-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 130 +++-- 1 file changed, 84 insertions(+), 46 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index e31f508..9a8ef83 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -507,7 +507,6 @@ static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME; static bool intel_cqm_rmid_stabilize(unsigned int *available) { struct cqm_rmid_entry *entry, *tmp; - struct perf_event *event; lockdep_assert_held(_mutex); @@ -577,19 +576,9 @@ static bool intel_cqm_rmid_stabilize(unsigned int *available) /* * If we have groups waiting for RMIDs, hand -* them one now. +* them one now provided they don't conflict. */ - list_for_each_entry(event, _groups, - hw.cqm_groups_entry) { - if (__rmid_valid(event->hw.cqm_rmid)) - continue; - - intel_cqm_xchg_rmid(event, entry->rmid); - entry = NULL; - break; - } - - if (!entry) + if (intel_cqm_sched_in_event(entry->rmid)) continue; /* @@ -604,25 +593,73 @@ static bool intel_cqm_rmid_stabilize(unsigned int *available) /* * Pick a victim group and move it to the tail of the group list. + * @next: The first group without an RMID */ -static struct perf_event * -__intel_cqm_pick_and_rotate(void) +static void __intel_cqm_pick_and_rotate(struct perf_event *next) { struct perf_event *rotor; + unsigned int rmid; lockdep_assert_held(_mutex); - lockdep_assert_held(_lock); rotor = list_first_entry(_groups, struct perf_event, hw.cqm_groups_entry); + + /* +* The group at the front of the list should always have a valid +* RMID. If it doesn't then no groups have RMIDs assigned and we +* don't need to rotate the list. +*/ + if (next == rotor) + return; + + rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID); + __put_rmid(rmid); + list_rotate_left(_groups); +} + +/* + * Deallocate the RMIDs from any events that conflict with @event, and + * place them on the back of the group list. + */ +static void intel_cqm_sched_out_conflicting_events(struct perf_event *event) +{ + struct perf_event *group, *g; + unsigned int rmid; + + lockdep_assert_held(_mutex); + + list_for_each_entry_safe(group, g, _groups, hw.cqm_groups_entry) { + if (group == event) + continue; + + rmid = group->hw.cqm_rmid; + + /* +* Skip events that don't have a valid RMID. +*/ + if (!__rmid_valid(rmid)) + continue; + + /* +* No conflict? No problem! Leave the event alone. +*/ + if (!__conflict_event(group, event)) + continue; - return rotor; + intel_cqm_xchg_rmid(group, INVALID_RMID); + __put_rmid(rmid); + } } /* * Attempt to rotate the groups and assign new RMIDs. * + * We rotate for two reasons, + * 1. To handle the scheduling of conflicting events + * 2. To recycle RMIDs + * * Rotating RMIDs is complicated because the hardware doesn't give us * any clues. * @@ -642,11 +679,10 @@ __intel_cqm_pick_and_rotate(void) */ static bool
[tip:perf/x86] perf/x86/intel: Perform rotation on Intel CQM RMIDs
Commit-ID: bff671dba7981195a644a5dc210d65de8ae2d251 Gitweb: http://git.kernel.org/tip/bff671dba7981195a644a5dc210d65de8ae2d251 Author: Matt Fleming AuthorDate: Fri, 23 Jan 2015 18:45:47 + Committer: Ingo Molnar CommitDate: Wed, 25 Feb 2015 13:53:35 +0100 perf/x86/intel: Perform rotation on Intel CQM RMIDs There are many use cases where people will want to monitor more tasks than there exist RMIDs in the hardware, meaning that we have to perform some kind of multiplexing. We do this by "rotating" the RMIDs in a workqueue, and assigning an RMID to a waiting event when the RMID becomes unused. This scheme reserves one RMID at all times for rotation. When we need to schedule a new event we give it the reserved RMID, pick a victim event from the front of the global CQM list and wait for the victim's RMID to drop to zero occupancy, before it becomes the new reserved RMID. We put the victim's RMID onto the limbo list, where it resides for a "minimum queue time", which is intended to save ourselves an expensive smp IPI when the RMID is unlikely to have a occupancy value below __intel_cqm_threshold. If we fail to recycle an RMID, even after waiting the minimum queue time then we need to increment __intel_cqm_threshold. There is an upper bound on this threshold, __intel_cqm_max_threshold, which is programmable from userland as /sys/devices/intel_cqm/max_recycling_threshold. The comments above __intel_cqm_rmid_rotate() have more details. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: H. Peter Anvin Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Linus Torvalds Cc: Vikas Shivappa Link: http://lkml.kernel.org/r/1422038748-21397-9-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 671 ++--- 1 file changed, 623 insertions(+), 48 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index 8003d87..e31f508 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -25,9 +25,13 @@ struct intel_cqm_state { static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state); /* - * Protects cache_cgroups and cqm_rmid_lru. + * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru. + * Also protects event->hw.cqm_rmid + * + * Hold either for stability, both for modification of ->hw.cqm_rmid. */ static DEFINE_MUTEX(cache_mutex); +static DEFINE_RAW_SPINLOCK(cache_lock); /* * Groups of events that have the same target(s), one RMID per group. @@ -46,7 +50,34 @@ static cpumask_t cqm_cpumask; #define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID -static u64 __rmid_read(unsigned long rmid) +/* + * This is central to the rotation algorithm in __intel_cqm_rmid_rotate(). + * + * This rmid is always free and is guaranteed to have an associated + * near-zero occupancy value, i.e. no cachelines are tagged with this + * RMID, once __intel_cqm_rmid_rotate() returns. + */ +static unsigned int intel_cqm_rotation_rmid; + +#define INVALID_RMID (-1) + +/* + * Is @rmid valid for programming the hardware? + * + * rmid 0 is reserved by the hardware for all non-monitored tasks, which + * means that we should never come across an rmid with that value. + * Likewise, an rmid value of -1 is used to indicate "no rmid currently + * assigned" and is used as part of the rotation code. + */ +static inline bool __rmid_valid(unsigned int rmid) +{ + if (!rmid || rmid == INVALID_RMID) + return false; + + return true; +} + +static u64 __rmid_read(unsigned int rmid) { u64 val; @@ -64,13 +95,21 @@ static u64 __rmid_read(unsigned long rmid) return val; } +enum rmid_recycle_state { + RMID_YOUNG = 0, + RMID_AVAILABLE, + RMID_DIRTY, +}; + struct cqm_rmid_entry { - u64 rmid; + unsigned int rmid; + enum rmid_recycle_state state; struct list_head list; + unsigned long queue_time; }; /* - * A least recently used list of RMIDs. + * cqm_rmid_free_lru - A least recently used list of RMIDs. * * Oldest entry at the head, newest (most recently used) entry at the * tail. This list is never traversed, it's only used to keep track of @@ -81,9 +120,18 @@ struct cqm_rmid_entry { * in use. To mark an RMID as in use, remove its entry from the lru * list. * - * This list is protected by cache_mutex. + * + * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs. + * + * This list is contains RMIDs that no one is currently using but that + * may have a non-zero occupancy value associated with them. The + * rotation worker moves RMIDs from the limbo list to the free list once + * the occupancy value drops below __intel_cqm_threshold. + * + * Both lists are protected by cache_mutex. */ -static LIST_HEAD(cqm_rmid_lru); +static LIST_HEAD(cqm_rmid_free_lru); +static
[tip:perf/x86] perf/x86/intel: Support task events with Intel CQM
Commit-ID: bfe1fcd2688f557a6b6a88f59ea7619228728bd7 Gitweb: http://git.kernel.org/tip/bfe1fcd2688f557a6b6a88f59ea7619228728bd7 Author: Matt Fleming AuthorDate: Fri, 23 Jan 2015 18:45:46 + Committer: Ingo Molnar CommitDate: Wed, 25 Feb 2015 13:53:34 +0100 perf/x86/intel: Support task events with Intel CQM Add support for task events as well as system-wide events. This change has a big impact on the way that we gather LLC occupancy values in intel_cqm_event_read(). Currently, for system-wide (per-cpu) events we defer processing to userspace which knows how to discard all but one cpu result per package. Things aren't so simple for task events because we need to do the value aggregation ourselves. To do this, we defer updating the LLC occupancy value in event->count from intel_cqm_event_read() and do an SMP cross-call to read values for all packages in intel_cqm_event_count(). We need to ensure that we only do this for one task event per cache group, otherwise we'll report duplicate values. If we're a system-wide event we want to fallback to the default perf_event_count() implementation. Refactor this into a common function so that we don't duplicate the code. Also, introduce PERF_TYPE_INTEL_CQM, since we need a way to track an event's task (if the event isn't per-cpu) inside of the Intel CQM PMU driver. This task information is only availble in the upper layers of the perf infrastructure. Other perf backends stash the target task in event->hw.*target so we need to do something similar. The task is used to determine whether events should share a cache group and an RMID. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Arnaldo Carvalho de Melo Cc: H. Peter Anvin Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Linus Torvalds Cc: Vikas Shivappa Cc: linux-...@vger.kernel.org Link: http://lkml.kernel.org/r/1422038748-21397-8-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 195 + include/linux/perf_event.h | 1 + include/uapi/linux/perf_event.h| 1 + kernel/events/core.c | 2 + 4 files changed, 178 insertions(+), 21 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index b5d9d74..8003d87 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -182,23 +182,124 @@ fail: /* * Determine if @a and @b measure the same set of tasks. + * + * If @a and @b measure the same set of tasks then we want to share a + * single RMID. */ static bool __match_event(struct perf_event *a, struct perf_event *b) { + /* Per-cpu and task events don't mix */ if ((a->attach_state & PERF_ATTACH_TASK) != (b->attach_state & PERF_ATTACH_TASK)) return false; - /* not task */ +#ifdef CONFIG_CGROUP_PERF + if (a->cgrp != b->cgrp) + return false; +#endif + + /* If not task event, we're machine wide */ + if (!(b->attach_state & PERF_ATTACH_TASK)) + return true; + + /* +* Events that target same task are placed into the same cache group. +*/ + if (a->hw.cqm_target == b->hw.cqm_target) + return true; + + /* +* Are we an inherited event? +*/ + if (b->parent == a) + return true; + + return false; +} + +#ifdef CONFIG_CGROUP_PERF +static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event) +{ + if (event->attach_state & PERF_ATTACH_TASK) + return perf_cgroup_from_task(event->hw.cqm_target); - return true; /* if not task, we're machine wide */ + return event->cgrp; } +#endif /* * Determine if @a's tasks intersect with @b's tasks + * + * There are combinations of events that we explicitly prohibit, + * + *PROHIBITS + * system-wide-> cgroup and task + * cgroup->system-wide + * ->task in cgroup + * task ->system-wide + * ->task in cgroup + * + * Call this function before allocating an RMID. */ static bool __conflict_event(struct perf_event *a, struct perf_event *b) { +#ifdef CONFIG_CGROUP_PERF + /* +* We can have any number of cgroups but only one system-wide +* event at a time. +*/ + if (a->cgrp && b->cgrp) { + struct perf_cgroup *ac = a->cgrp; + struct perf_cgroup *bc = b->cgrp; + + /* +* This condition should have been caught in +* __match_event() and we should be sharing an RMID. +*/ + WARN_ON_ONCE(ac == bc); + + if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) || +
[tip:perf/x86] perf: Move cgroup init before PMU ->event_init()
Commit-ID: 79dff51e900fd26a073be8b23acfbd8c15edb181 Gitweb: http://git.kernel.org/tip/79dff51e900fd26a073be8b23acfbd8c15edb181 Author: Matt Fleming AuthorDate: Fri, 23 Jan 2015 18:45:42 + Committer: Ingo Molnar CommitDate: Wed, 25 Feb 2015 13:53:30 +0100 perf: Move cgroup init before PMU ->event_init() The Intel QoS PMU needs to know whether an event is part of a cgroup during ->event_init(), because tasks in the same cgroup share a monitoring ID. Move the cgroup initialisation before calling into the PMU driver. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Arnaldo Carvalho de Melo Cc: H. Peter Anvin Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Linus Torvalds Cc: Vikas Shivappa Link: http://lkml.kernel.org/r/1422038748-21397-4-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- kernel/events/core.c | 28 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 4e8dc59..1fc3bae 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7116,7 +7116,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, struct perf_event *group_leader, struct perf_event *parent_event, perf_overflow_handler_t overflow_handler, -void *context) +void *context, int cgroup_fd) { struct pmu *pmu; struct perf_event *event; @@ -7212,6 +7212,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, if (!has_branch_stack(event)) event->attr.branch_sample_type = 0; + if (cgroup_fd != -1) { + err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader); + if (err) + goto err_ns; + } + pmu = perf_init_event(event); if (!pmu) goto err_ns; @@ -7235,6 +7241,8 @@ err_pmu: event->destroy(event); module_put(pmu->module); err_ns: + if (is_cgroup_event(event)) + perf_detach_cgroup(event); if (event->ns) put_pid_ns(event->ns); kfree(event); @@ -7453,6 +7461,7 @@ SYSCALL_DEFINE5(perf_event_open, int move_group = 0; int err; int f_flags = O_RDWR; + int cgroup_fd = -1; /* for future expandability... */ if (flags & ~PERF_FLAG_ALL) @@ -7518,21 +7527,16 @@ SYSCALL_DEFINE5(perf_event_open, get_online_cpus(); + if (flags & PERF_FLAG_PID_CGROUP) + cgroup_fd = pid; + event = perf_event_alloc(, cpu, task, group_leader, NULL, -NULL, NULL); +NULL, NULL, cgroup_fd); if (IS_ERR(event)) { err = PTR_ERR(event); goto err_cpus; } - if (flags & PERF_FLAG_PID_CGROUP) { - err = perf_cgroup_connect(pid, event, , group_leader); - if (err) { - __free_event(event); - goto err_cpus; - } - } - if (is_sampling_event(event)) { if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) { err = -ENOTSUPP; @@ -7769,7 +7773,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu, */ event = perf_event_alloc(attr, cpu, task, NULL, NULL, -overflow_handler, context); +overflow_handler, context, -1); if (IS_ERR(event)) { err = PTR_ERR(event); goto err; @@ -8130,7 +8134,7 @@ inherit_event(struct perf_event *parent_event, parent_event->cpu, child, group_leader, parent_event, - NULL, NULL); + NULL, NULL, -1); if (IS_ERR(child_event)) return child_event; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:perf/x86] perf/x86/intel: Implement LRU monitoring ID allocation for CQM
Commit-ID: 35298e554c74b7849875e3676ba8eaf833c7b917 Gitweb: http://git.kernel.org/tip/35298e554c74b7849875e3676ba8eaf833c7b917 Author: Matt Fleming AuthorDate: Fri, 23 Jan 2015 18:45:45 + Committer: Ingo Molnar CommitDate: Wed, 25 Feb 2015 13:53:33 +0100 perf/x86/intel: Implement LRU monitoring ID allocation for CQM It's possible to run into issues with re-using unused monitoring IDs because there may be stale cachelines associated with that ID from a previous allocation. This can cause the LLC occupancy values to be inaccurate. To attempt to mitigate this problem we place the IDs on a least recently used list, essentially a FIFO. The basic idea is that the longer the time period between ID re-use the lower the probability that stale cachelines exist in the cache. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Arnaldo Carvalho de Melo Cc: H. Peter Anvin Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Linus Torvalds Cc: Vikas Shivappa Link: http://lkml.kernel.org/r/1422038748-21397-7-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 100 ++--- 1 file changed, 92 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c index 05b4cd2..b5d9d74 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -25,7 +25,7 @@ struct intel_cqm_state { static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state); /* - * Protects cache_cgroups. + * Protects cache_cgroups and cqm_rmid_lru. */ static DEFINE_MUTEX(cache_mutex); @@ -64,36 +64,120 @@ static u64 __rmid_read(unsigned long rmid) return val; } -static unsigned long *cqm_rmid_bitmap; +struct cqm_rmid_entry { + u64 rmid; + struct list_head list; +}; + +/* + * A least recently used list of RMIDs. + * + * Oldest entry at the head, newest (most recently used) entry at the + * tail. This list is never traversed, it's only used to keep track of + * the lru order. That is, we only pick entries of the head or insert + * them on the tail. + * + * All entries on the list are 'free', and their RMIDs are not currently + * in use. To mark an RMID as in use, remove its entry from the lru + * list. + * + * This list is protected by cache_mutex. + */ +static LIST_HEAD(cqm_rmid_lru); + +/* + * We use a simple array of pointers so that we can lookup a struct + * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid() + * and __put_rmid() from having to worry about dealing with struct + * cqm_rmid_entry - they just deal with rmids, i.e. integers. + * + * Once this array is initialized it is read-only. No locks are required + * to access it. + * + * All entries for all RMIDs can be looked up in the this array at all + * times. + */ +static struct cqm_rmid_entry **cqm_rmid_ptrs; + +static inline struct cqm_rmid_entry *__rmid_entry(int rmid) +{ + struct cqm_rmid_entry *entry; + + entry = cqm_rmid_ptrs[rmid]; + WARN_ON(entry->rmid != rmid); + + return entry; +} /* * Returns < 0 on fail. + * + * We expect to be called with cache_mutex held. */ static int __get_rmid(void) { - return bitmap_find_free_region(cqm_rmid_bitmap, cqm_max_rmid, 0); + struct cqm_rmid_entry *entry; + + lockdep_assert_held(_mutex); + + if (list_empty(_rmid_lru)) + return -EAGAIN; + + entry = list_first_entry(_rmid_lru, struct cqm_rmid_entry, list); + list_del(>list); + + return entry->rmid; } static void __put_rmid(int rmid) { - bitmap_release_region(cqm_rmid_bitmap, rmid, 0); + struct cqm_rmid_entry *entry; + + lockdep_assert_held(_mutex); + + entry = __rmid_entry(rmid); + + list_add_tail(>list, _rmid_lru); } static int intel_cqm_setup_rmid_cache(void) { - cqm_rmid_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(cqm_max_rmid), GFP_KERNEL); - if (!cqm_rmid_bitmap) + struct cqm_rmid_entry *entry; + int r; + + cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) * + (cqm_max_rmid + 1), GFP_KERNEL); + if (!cqm_rmid_ptrs) return -ENOMEM; - bitmap_zero(cqm_rmid_bitmap, cqm_max_rmid); + for (r = 0; r <= cqm_max_rmid; r++) { + struct cqm_rmid_entry *entry; + + entry = kmalloc(sizeof(*entry), GFP_KERNEL); + if (!entry) + goto fail; + + INIT_LIST_HEAD(>list); + entry->rmid = r; + cqm_rmid_ptrs[r] = entry; + + list_add_tail(>list, _rmid_lru); + } /* * RMID 0 is special and is always allocated. It's used for all * tasks that are not monitored. */ - bitmap_allocate_region(cqm_rmid_bitmap, 0, 0); + entry =
[tip:perf/x86] perf/x86/intel: Add Intel Cache QoS Monitoring support
Commit-ID: 4afbb24ce5e723c8a093a6674a3c33062175078a Gitweb: http://git.kernel.org/tip/4afbb24ce5e723c8a093a6674a3c33062175078a Author: Matt Fleming AuthorDate: Fri, 23 Jan 2015 18:45:44 + Committer: Ingo Molnar CommitDate: Wed, 25 Feb 2015 13:53:32 +0100 perf/x86/intel: Add Intel Cache QoS Monitoring support Future Intel Xeon processors support a Cache QoS Monitoring feature that allows tracking of the LLC occupancy for a task or task group, i.e. the amount of data in pulled into the LLC for the task (group). Currently the PMU only supports per-cpu events. We create an event for each cpu and read out all the LLC occupancy values. Because this results in duplicate values being written out to userspace, we also export a .per-pkg event file so that the perf tools only accumulate values for one cpu per package. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Arnaldo Carvalho de Melo Cc: H. Peter Anvin Cc: Jiri Olsa Cc: Kanaka Juvva Cc: Linus Torvalds Cc: Vikas Shivappa Link: http://lkml.kernel.org/r/1422038748-21397-6-git-send-email-m...@codeblueprint.co.uk Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_intel_cqm.c | 530 + include/linux/perf_event.h | 7 + 2 files changed, 537 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c new file mode 100644 index 000..05b4cd2 --- /dev/null +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c @@ -0,0 +1,530 @@ +/* + * Intel Cache Quality-of-Service Monitoring (CQM) support. + * + * Based very, very heavily on work by Peter Zijlstra. + */ + +#include +#include +#include +#include "perf_event.h" + +#define MSR_IA32_PQR_ASSOC 0x0c8f +#define MSR_IA32_QM_CTR0x0c8e +#define MSR_IA32_QM_EVTSEL 0x0c8d + +static unsigned int cqm_max_rmid = -1; +static unsigned int cqm_l3_scale; /* supposedly cacheline size */ + +struct intel_cqm_state { + raw_spinlock_t lock; + int rmid; + int cnt; +}; + +static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state); + +/* + * Protects cache_cgroups. + */ +static DEFINE_MUTEX(cache_mutex); + +/* + * Groups of events that have the same target(s), one RMID per group. + */ +static LIST_HEAD(cache_groups); + +/* + * Mask of CPUs for reading CQM values. We only need one per-socket. + */ +static cpumask_t cqm_cpumask; + +#define RMID_VAL_ERROR (1ULL << 63) +#define RMID_VAL_UNAVAIL (1ULL << 62) + +#define QOS_L3_OCCUP_EVENT_ID (1 << 0) + +#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID + +static u64 __rmid_read(unsigned long rmid) +{ + u64 val; + + /* +* Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt, +* it just says that to increase confusion. +*/ + wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid); + rdmsrl(MSR_IA32_QM_CTR, val); + + /* +* Aside from the ERROR and UNAVAIL bits, assume this thing returns +* the number of cachelines tagged with @rmid. +*/ + return val; +} + +static unsigned long *cqm_rmid_bitmap; + +/* + * Returns < 0 on fail. + */ +static int __get_rmid(void) +{ + return bitmap_find_free_region(cqm_rmid_bitmap, cqm_max_rmid, 0); +} + +static void __put_rmid(int rmid) +{ + bitmap_release_region(cqm_rmid_bitmap, rmid, 0); +} + +static int intel_cqm_setup_rmid_cache(void) +{ + cqm_rmid_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(cqm_max_rmid), GFP_KERNEL); + if (!cqm_rmid_bitmap) + return -ENOMEM; + + bitmap_zero(cqm_rmid_bitmap, cqm_max_rmid); + + /* +* RMID 0 is special and is always allocated. It's used for all +* tasks that are not monitored. +*/ + bitmap_allocate_region(cqm_rmid_bitmap, 0, 0); + + return 0; +} + +/* + * Determine if @a and @b measure the same set of tasks. + */ +static bool __match_event(struct perf_event *a, struct perf_event *b) +{ + if ((a->attach_state & PERF_ATTACH_TASK) != + (b->attach_state & PERF_ATTACH_TASK)) + return false; + + /* not task */ + + return true; /* if not task, we're machine wide */ +} + +/* + * Determine if @a's tasks intersect with @b's tasks + */ +static bool __conflict_event(struct perf_event *a, struct perf_event *b) +{ + /* +* If one of them is not a task, same story as above with cgroups. +*/ + if (!(a->attach_state & PERF_ATTACH_TASK) || + !(b->attach_state & PERF_ATTACH_TASK)) + return true; + + /* +* Must be non-overlapping. +*/ + return false; +} + +/* + * Find a group and setup RMID. + * + * If we're part of a group, we use the group's RMID. + */ +static int intel_cqm_setup_event(struct perf_event *event, +struct perf_event