Re: [PATCH] x86/mm/fault: Streamline the fault error_code decoder some more

2018-12-06 Thread Ingo Molnar


* Andy Lutomirski  wrote:

> That’s why I suggested “read,” in lowercase, for reads.  Other than 
> that, most of the unset bits are uninteresting. An OOPS is so likely to 
> be a kernel fault that it’s barely worth mentioning, and I even added a 
> whole separate diagnostic for user oopses.  Similarly, I don’t think we 
> need to remind the reader that an oops wasn’t an SGX error or that it 
> wasn’t a PK error.  So I think my idea highlights the interesting bits 
> and avoids distraction from the uninteresting bits.

Ok - all good points.

Thanks,

Ingo


Re: [PATCH] x86/mm/fault: Streamline the fault error_code decoder some more

2018-12-06 Thread Ingo Molnar


* Andy Lutomirski  wrote:

> > vs. (with SGX added as 'G' for testing purposes)
> > 
> > [0.158849] #PF error code(0001):  +P !W !U !S !I !K !G
> > [0.159292] #PF error code(0003):  +P +W !U !S !I !K !G
> > [0.159742] #PF error code(0007):  +P +W +U !S !I !K !G
> > [0.160190] #PF error code(0025):  +P !W +U !S !I +K !G
> > [0.160638] #PF error code(0002):  !P +W !U !S !I !K !G
> > [0.161087] #PF error code(0004):  !P !W +U !S !I !K !G
> > [0.161538] #PF error code(0006):  !P +W +U !S !I !K !G
> > [0.161992] #PF error code(0014):  !P !W +U !S +I !K !G
> > [0.162450] #PF error code(0011):  +P !W !U !S +I !K !G
> > [0.162667] #PF error code(8001):  +P !W !U !S !I !K +G
> > [0.162667] #PF error code(8003):  +P +W !U !S !I !K +G
> > [0.162667] #PF error code(8007):  +P +W +U !S !I !K +G
> > [0.162667] #PF error code(8025):  +P !W +U !S !I +K +G
> > [0.162667] #PF error code(8002):  !P +W !U !S !I !K +G
> > [0.162667] #PF error code(8004):  !P !W +U !S !I !K +G
> > [0.162667] #PF error code(8006):  !P +W +U !S !I !K +G
> > [0.162667] #PF error code(8014):  !P !W +U !S +I !K +G
> > [0.162667] #PF error code(8011):  +P !W !U !S +I !K +G
> > [0.162667] #PF error code():  !P !W !U !S !I !K !G
> > 
> 
> Please don’t. The whole reason I added the decoding was to make it easy 
> to read without a cheat sheet. This is incomprehensible without 
> reference to the code, and I’m familiar with it to begin with.

Dunno, I can deduce the meaning from the above abbreviations without a 
cheat sheet and I'm sure you'll be able to as well from now on. All the 
letters are very obvious references - to me at least - and brevity and 
predictable, fixed-length output matter.

> How about:
> 
> #PF error code: 0001 [PROT read kernel]
> 
> #PF error code: 0001 [PROT WRITE kernel]
> 
> #PF error code: 0001 [PROT read kernel]
> 
> #PF error code: 8011 [PROT INSTR kernel SGX]
> 
> This has no noise from unset bits except that we add lowercase “read” 
> or “kernel” as appropriate.  Even “kernel” seems barely necessary.

The thing is, the 'noise' from unset bits is actually important 
information as well - at least for the major bits: it was a mostly random 
choice that Intel defined '1' for write access and not for read access. 

Thanks,

Ingo


Re: [PATCH v7 08/14] x86/ftrace: Use text_poke_*() infrastructure

2018-12-06 Thread Ingo Molnar


* Nadav Amit  wrote:

> > On Dec 4, 2018, at 5:34 PM, Nadav Amit  wrote:
> > 
> > A following patch is going to make module allocated memory
> > non-executable. This requires to modify ftrace and make the memory
> > executable again after it is configured.
> > 
> > In addition, this patch makes ftrace use the general text poking
> > infrastructure instead of ftrace's homegrown text patching. This provides
> > the advantages of having slightly "safer" code patching and avoiding
> > races with module removal or other mechanisms that patch the kernel
> > code.
> > 
> > Cc: Steven Rostedt 
> > Signed-off-by: Nadav Amit 
> > ---
> > arch/x86/kernel/ftrace.c | 74 +---
> > 1 file changed, 23 insertions(+), 51 deletions(-)
> 
> Steven Rostedt pointed that using text_poke() instead of
> probe_kernel_write() would introduce considerable overheads. Running:
> 
>   # time { echo function > current_tracer; } 
> 
> takes 0.24s without this patch and 0.7s with. I don’t know whether to
> consider it “so bad”. Obviously we can introduce a batching mechanism and/or
> do some micro-optimization (the latter will not buy us much though).

This should definitely not regress, so can we try the batching approach?

Thanks,

Ingo


Re: [PATCH] x86/mce: Streamline MCE subsystem's naming

2018-12-06 Thread Ingo Molnar


* Borislav Petkov  wrote:

> On Wed, Dec 05, 2018 at 07:01:58PM +0100, Ingo Molnar wrote:
> > Oh - I thought we'd have arch/x86/mce/ or so?
> > 
> > There's machine check events that are not tied to any particular CPU, 
> > correct? So this would be the right conceptual level - and it would also 
> > remove the somewhat redundant 'kernel' part.
> 
> Well, all the MCE events reported are some way or the other tied to the
> CPU and they're reported in the CPU's MCA banks so I think we want
> 
>   arch/x86/cpu/mce/

Well, *everything* the kernel does is in some way connected to a CPU, 
because we always execute the kernel on a CPU, still we have things like 
discrete PMUs and abstractions for other pieces of hardware that are not 
per CPU.

So the real question is, is there a significant class of MCE events that 
are not tied to the reporting channel which is per CPU (-ish ...) MCA 
banks?

Thanks,

Ingo


Re: [PATCH] x86/mm/fault: Streamline the fault error_code decoder some more

2018-12-06 Thread Ingo Molnar


* Sean Christopherson  wrote:

> On Thu, Dec 06, 2018 at 08:34:09AM +0100, Ingo Molnar wrote:
> > I like your '!' idea, but with a further simplification: how about using 
> > '-/+' differentiation and a single character and a fixed-length message.
> > 
> > The new output will be lines of:
> > 
> >   #PF error code: -P -W -U -S -I -K (0x00)
> >   ...
> >   #PF error code: -P -W +U +S -I -K (0x0c)
> >   ...
> >   #PF error code: +P +W +U +S +I +K (0x3f)
> > 
> > The symbol abbreviations are pretty self-explanatory:
> > 
> >   P = protection fault   (X86_PF_PROT)
> >   W = write access   (X86_PF_WRITE)
> >   U = user-mode access   (X86_PF_USER)
> > >   S = supervisor mode    (X86_PF_RSVD)
> >   I = instruction fault  (X86_PF_INSTR)
> >   K = keys fault (X86_PF_PK)
> > 
> > Misc notes:
> > 
> > - In principle the new text is now short enough to include it in one of 
> >   the existing output lines, further shortening the oops output - but I
> >   haven't done that in this patch.
> > 
> > - Another question is the ordering of the bits: the symbolic display is 
> >   actually big endian, while the numeric hex printout is little endian.
> > 
> >   I kind of still like it that way, not just because the decoding loop is 
> >   more natural, but because the bits are actually ordered by importance: 
> >   the PROT bit is more important than the INSTR or the PK bits - and the 
> >   more important bits are displayed first.
> 
> Hmm, my eyes tend to be drawn to the end of the line, e.g. having PROT
> be the last thing makes it stand out more than being buried in the middle
> of the line.  Inverting the ordering between raw and decoded also makes
> it very difficult to correlate the raw value with each bit.
> 
> > - Only build-tested the patch and looked at the generated assembly, but 
> >   it all looks sane enough so will obviously work just fine! ;-)
> 
> ...
> 
> >  /*
> > - * This helper function transforms the #PF error_code bits into
> > - * "[PROT] [USER]" type of descriptive, almost human-readable error 
> > strings:
> > + * This maps the somewhat obscure error_code number to symbolic text:
> > + *
> > + * P = protection fault   (X86_PF_PROT)
> > + * W = write access   (X86_PF_WRITE)
> > + * U = user-mode access   (X86_PF_USER)
> > + * S = supervisor mode    (X86_PF_RSVD)
> > + * I = instruction fault  (X86_PF_INSTR)
> > + * K = keys fault (X86_PF_PK)
> >   */
> > -static void err_str_append(unsigned long error_code, char *buf, unsigned 
> > long mask, const char *txt)
> > +static const char error_code_chars[] = "PWUSIK";
> > +
> > +/*
> > + * This helper function transforms the #PF error_code bits into " +P -W +U 
> > -R -I -K"
> > + * type of descriptive, almost human-readable error strings:
> > + */
> > +static void show_error_code(struct pt_regs *regs, unsigned long error_code)
> 
> No need for @regs.
> 
> >  {
> > - if (error_code & mask) {
> > - if (buf[0])
> > - strcat(buf, " ");
> > - strcat(buf, txt);
> > + unsigned int bit, mask;
> > + char err_txt[6*3+1]; /* Fixed length of 6 bits decoded plus zero at 
> > the end */
> 
> Assuming the error code bits are contiguous breaks if/when SGX gets added,
> which uses bit 15.  Oopsing on an SGX fault should be nigh impossible, but
> it might be nice to be able to reuse show_error_code in other places.
> 
> Hardcoding "6" is also a bit painful.
> 
> > +
> > + /* We go from the X86_PF_PROT bit to the X86_PF_PK bit: */
> > +
> > + for (bit = 0; bit < 6; bit++) {
> > + unsigned int offset = bit*3;
> > +
> > + err_txt[offset+0] = ' ';
> > +
> > + mask = 1 << bit;
> > + if (error_code & mask)
> > + err_txt[offset+1] = '+';
> > + else
> > + err_txt[offset+1] = '-';
> 
> To me, using '!' contrasts better when side-by-side with '+'.
> 
> > +
> > + err_txt[offset+2] = error_code_chars[bit];
> >   }
> > +
> > + /* Close the string: */
> > + err_txt[sizeof(err_txt)-1] = 0;
> > +
> > + pr_alert("#PF error code: %s (%02lx)\n", err_txt, error_code);
> 
> The changelog example has a leading "0x" on the error code.  That being
> said, I actually like it without the "0x".

Re: [PATCH] scripts/atomic: change 'fold' to 'grep'

2018-12-06 Thread Ingo Molnar


* Will Deacon  wrote:

> [+ Ingo and Mark]
> 
> On Tue, Dec 04, 2018 at 10:47:13PM +0100, Anders Roxell wrote:
> > Some distributions and build systems don't include 'fold' from
> > coreutils by default.
> > 
> > .../scripts/atomic/atomic-tbl.sh: line 183: fold: command not found
> > 
> > Rework to use 'grep' instead of 'fold' to use a dependency that is
> > already used a lot in the kernel.
> > 
> > Reported-by: Naresh Kamboju 
> > Suggested-by: Will Deacon 
> > Signed-off-by: Anders Roxell 
> > ---
> >  scripts/atomic/atomic-tbl.sh | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/scripts/atomic/atomic-tbl.sh b/scripts/atomic/atomic-tbl.sh
> > index 9d6be538a987..81d5c32039dd 100755
> > --- a/scripts/atomic/atomic-tbl.sh
> > +++ b/scripts/atomic/atomic-tbl.sh
> > @@ -180,7 +180,7 @@ gen_proto_variants()
> >  #gen_proto(meta, ...)
> >  gen_proto() {
> > local meta="$1"; shift
> > -   for m in $(echo "${meta}" | fold -w1); do
> > +   for m in $(echo "${meta}" | grep -o .); do
> > gen_proto_variants "${m}" "$@"
> > done
> 
> Acked-by: Will Deacon 
> 
> Ingo -- please can you take this one via -tip?

I'm still waiting for a reply to my previous concerns expressed in:

   Re: [tip:locking/core] locking/atomics: Check generated headers are 
up-to-date
   <20181128083057.ga7...@gmail.com>

Will remove it from linux-next if there's no good resolution for this 
cycle.

Thanks,

Ingo
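[Editorial note: the `fold`-to-`grep` substitution in the patch above is behavior-preserving for this use - both commands emit one character per line. A quick illustrative sketch, with `meta` as a stand-in for the scripts' metadata string:]

```shell
# 'fold -w1' (coreutils) and 'grep -o .' both split a string into one
# character per line; the latter only needs grep, which the kernel build
# already depends on.
meta="vRf"

for m in $(printf '%s\n' "$meta" | grep -o .); do
	printf 'gen variants for: %s\n' "$m"
done
```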


Re: [PATCHv2] locking/atomics: build atomic headers as required

2018-12-06 Thread Ingo Molnar


* Mark Rutland  wrote:

> Andrew and Ingo report that the check-atomics.sh script is simply too
> slow to run for every kernel build, and it's impractical to make it
> faster without rewriting it in something other than shell.
> 
> Rather than committing the generated headers, let's regenerate these
> as-required, if we change the data or scripts used to generate the
> atomic headers, or when building from a pristine tree.
> 
> That ensures they're always up-to-date, allows them to be built in
> parallel, and avoids redundant rebuilds, which is a 2-8s saving per 
> incremental build. Since the results are not committed, it's very
> obvious that they should not be modified directly. If we need to
> generate more headers in future, it's easy to extend Makefile.genheader
> to permit this.
> 
> I've verified that this works in the cases we previously had issues with
> (out-of-tree builds and where scripts have no execute permissions), and
> have tested these cases for both x86_64 and arm64.
> 
> The diffstat looks nice, at least...
> 
> Signed-off-by: Mark Rutland 
> Cc: Andrew Morton 
> Cc: Boqun Feng 
> Cc: Borislav Petkov 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Will Deacon 
> Cc: linux-kernel@vger.kernel.org
> ---
>  Kbuild|   18 +-
>  Makefile  |8 +-
>  arch/arm64/include/asm/atomic.h   |2 +-
>  arch/x86/include/asm/atomic.h |2 +-
>  include/asm-generic/atomic-instrumented.h | 1787 --
>  include/asm-generic/atomic-long.h | 1012 -
>  include/linux/atomic-fallback.h   | 2294 
> -
>  include/linux/atomic.h|4 +-
>  scripts/Makefile.genheader|   26 +
>  scripts/atomic/check-atomics.sh   |   19 -
>  10 files changed, 38 insertions(+), 5134 deletions(-)
>  delete mode 100644 include/asm-generic/atomic-instrumented.h
>  delete mode 100644 include/asm-generic/atomic-long.h
>  delete mode 100644 include/linux/atomic-fallback.h
>  create mode 100644 scripts/Makefile.genheader
>  delete mode 100755 scripts/atomic/check-atomics.sh

So these 'automatically generated' headers are actual and important code, 
and I think it's bad practice to remove these from the git grep search 
space and history as well.

Really, this whole notion of auto-generating the headers should be 
implemented correctly, instead of working around deficiencies in a 
short-term fashion that introduces other deficiencies.

I never got any replies to my previous comments about this:

  <20181128083057.ga7...@gmail.com>

Did I miss some mails of yours perhaps?

Thanks,

Ingo


[PATCH] x86/mm/fault: Streamline the fault error_code decoder some more

2018-12-05 Thread Ingo Molnar


* Sean Christopherson  wrote:

> ...instead of manually handling the case where error_code=0, e.g. to
> display "[SUPERVISOR] [READ]" instead of "normal kernel read fault".
> 
> This makes the zero case consistent with all other messages and also
> provides additional information for other error code combinations,
> e.g. error_code==1 will display "[PROT] [SUPERVISOR] [READ]" instead
> of simply "[PROT]".
> 
> Print unique names for the negative cases as opposed to e.g. "[!USER]"
> to avoid mixups due to users missing a single "!" character, and to be
> more concise for the !INSTR && !WRITE case.
> 
> Print "SUPERVISOR" in favor of "KERNEL" to reduce the likelihood that
> the message is misinterpreted as a generic kernel/software error and
> to be consistent with the SDM's nomenclature.
> 
> An alternative to passing a negated error code to err_str_append() would
> be to expand err_str_append() to take a second string for the negative
> test, but that approach complicates handling the "[READ]" case, which
> looks for !INSTR && !WRITE, e.g. it would require an extra call to
> err_str_append() and logic in err_str_append() to allow null messages
> for both the positive and negative tests.  Printing "[INSTR] [READ]"
> wouldn't be the end of the world, but a little bit of trickery in the
> kernel is a relatively small price to pay in exchange for the ability
> to unequivocally know the access type by reading a single word.
> 
> Now that all components of the message use the [] format,
> explicitly state that it's the error *code* that's being printed and
> group the err_str_append() calls by type so that the resulting print
> messages are consistent, e.g. the deciphered codes will always be:
> 
> [PROT] [USER|SUPERVISOR] [WRITE|INSTR|READ] [RSVD] [PK]
> 
> Cc: Andy Lutomirski 
> Cc: Borislav Petkov 
> Cc: Dave Hansen 
> Cc: H. Peter Anvin 
> Cc: Linus Torvalds 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: Yu-cheng Yu 
> Cc: linux-kernel@vger.kernel.org
> Cc: Ingo Molnar 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/mm/fault.c | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 2ff25ad33233..0b4ce5d2b461 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -609,7 +609,7 @@ static void show_ldttss(const struct desc_ptr *gdt, const 
> char *name, u16 index)
>   */
>  static void err_str_append(unsigned long error_code, char *buf, unsigned 
> long mask, const char *txt)
>  {
> - if (error_code & mask) {
> + if ((error_code & mask) == mask) {
>   if (buf[0])
>   strcat(buf, " ");
>   strcat(buf, txt);
> @@ -655,13 +655,16 @@ show_fault_oops(struct pt_regs *regs, unsigned long 
> error_code, unsigned long ad
>* zero delimiter must fit into err_txt[].
>*/
>   err_str_append(error_code, err_txt, X86_PF_PROT,  "[PROT]" );
> - err_str_append(error_code, err_txt, X86_PF_WRITE, "[WRITE]");
>   err_str_append(error_code, err_txt, X86_PF_USER,  "[USER]" );
> - err_str_append(error_code, err_txt, X86_PF_RSVD,  "[RSVD]" );
> + err_str_append(~error_code, err_txt, X86_PF_USER, "[SUPERVISOR]");
> + err_str_append(error_code, err_txt, X86_PF_WRITE, "[WRITE]");
>   err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]");
> + err_str_append(~error_code, err_txt, X86_PF_WRITE | X86_PF_INSTR,
> +   "[READ]");
> + err_str_append(error_code, err_txt, X86_PF_RSVD,  "[RSVD]" );
>   err_str_append(error_code, err_txt, X86_PF_PK,"[PK]"   );
>  
> - pr_alert("#PF error: %s\n", error_code ? err_txt : "[normal kernel read 
> fault]");
> + pr_alert("#PF error code: %s\n", err_txt);
>  
>   if (!(error_code & X86_PF_USER) && user_mode(regs)) {
>   struct desc_ptr idt, gdt;

Yeah, so I don't like the overly long 'SUPERVISOR' and the somewhat 
inconsistent, sporadic handling of negatives. Here's our error code bits:

/*
 * Page fault error code bits:
 *
 *   bit 0 ==   0: no page found        1: protection fault
 *   bit 1 ==   0: read access          1: write access
 *   bit 2 ==   0: kernel-mode access   1: user-mode access
 *   bit 3 ==                           1: use of reserved bit detected
 *   bit 4 ==                           1: fault was an instruction fetch
 *   bit 5 == 

Re: [PATCH] x86/mce: Streamline MCE subsystem's naming

2018-12-05 Thread Ingo Molnar


* Borislav Petkov  wrote:

> On Wed, Dec 05, 2018 at 05:30:37PM +0100, Ingo Molnar wrote:
> > Would it make sense to organize it a bit more and separate out vendor 
> > specific functionality:
> > 
> >   mce/cpu/intel.c
> >   mce/cpu/intel-p5.c
> >   mce/cpu/amd.c
> >   mce/cpu/winchip.c
> 
> That's too fine-grained IMO and look at the path we'd get then:
> 
> arch/x86/kernel/cpu/mce/cpu/intel.c
>   ^^^ ^^^

Oh - I thought we'd have arch/x86/mce/ or so?

There's machine check events that are not tied to any particular CPU, 
correct? So this would be the right conceptual level - and it would also 
remove the somewhat redundant 'kernel' part.

Thanks,

Ingo


Re: [PATCH] x86/mce: Streamline MCE subsystem's naming

2018-12-05 Thread Ingo Molnar


* Borislav Petkov  wrote:

> From: Borislav Petkov 
> 
> Rename the containing folder to "mce" which is the most widespread name.
> Drop the "mce[-_]" filename prefix of some compilation units (while
> others don't have it).
> 
> This unifies the file naming in the MCE subsystem:
> 
> mce/
> |-- amd.c
> |-- apei.c
> |-- core.c
> |-- dev-mcelog.c
> |-- genpool.c
> |-- inject.c
> |-- intel.c
> |-- internal.h
> |-- Makefile
> |-- p5.c
> |-- severity.c
> |-- therm_throt.c
> |-- threshold.c
> `-- winchip.c
> 
> No functional changes.

Cool!

Would it make sense to organize it a bit more and separate out vendor 
specific functionality:

  mce/cpu/intel.c
  mce/cpu/intel-p5.c
  mce/cpu/amd.c
  mce/cpu/winchip.c

  mce/internal.h
  mce/core.c

  mce/genpool.c
  mce/threshold.c
  mce/severity.c
  mce/inject.c
  mce/therm_throt.c
  mce/dev-mcelog.c
  mce/apei.c

?

This way there's a clear separation between low level, vendor specific 
MCE logic and higher level MCE logic.

mce/apei.c, if this is an Intel-only feature, could perhaps become 
mce/cpu/intel-apei.c?

Anyway, your patch is fine too, so whichever subset you decide to use:

  Reviewed-by: Ingo Molnar 

Thanks,

Ingo


Re: [PATCH] x86/kernel: Fix more -Wmissing-prototypes warnings

2018-12-05 Thread Ingo Molnar


* Borislav Petkov  wrote:

> On Wed, Dec 05, 2018 at 10:49:06PM +0900, Masami Hiramatsu wrote:
> > I would like to put this prototype inside arch/x86/kernel/kprobes/core.c,
> > since that is locally used.
> 
> Done.

Also, preferably the prototype should be eliminated via proper ordering 
of functions from lower level to higher levels.

On the rest:

Reviewed-by: Ingo Molnar 

Thanks,

Ingo


Re: [PATCH] x86/umip: Make the UMIP activated message generic

2018-12-04 Thread Ingo Molnar


* Lendacky, Thomas  wrote:

> The User Mode Instruction Prevention (UMIP) feature is part of the x86_64
> instruction set architecture and not specific to Intel.  Make the message
> generic.
> 
> Signed-off-by: Tom Lendacky 
> ---
> 
> This patch is against the x86/cpu branch of the tip tree:
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/cpu
> 
>  arch/x86/kernel/cpu/common.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 2c56b80..cb28e98 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -353,7 +353,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 
> *c)
>  
>   cr4_set_bits(X86_CR4_UMIP);
>  
> - pr_info_once("x86/cpu: Intel User Mode Instruction Prevention (UMIP) 
> activated\n");
> + pr_info_once("x86/cpu: User Mode Instruction Prevention (UMIP) 
> activated\n");

Is there any public information about which AMD CPUs are going to support 
it?

The latest AMD CPU I can test on is a Ryzen Threadripper 1950X, and that 
doesn't have UMIP.

Thanks,

Ingo


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Ingo Molnar


* Michal Hocko  wrote:

> I dunno. I do not use hibernation. I am a heavy user of the suspend 
> though. I s2ram all the time. And I have certainly experienced cases 
> where suspend has failed and I only found out later when I've picked 
> up my laptop from my heated-up bag. Nothing fatal has resulted from that 
> but this is certainly annoying.

Hm, so I (mistakenly) thought freezing was mostly limited to hibernation 
and to a few weird cases when in flight DMA must not be suspended - but 
I'm wrong and in practice we always freeze tasks during s2ram, right?

And indeed:

 config SUSPEND_FREEZER
bool "Enable freezer for suspend to RAM/standby" \
if ARCH_WANTS_FREEZER_CONTROL || BROKEN
depends on SUSPEND
default y

which is essentially always enabled on x86.

TIL ...

s2ram is obviously a huge deal.

Just a newbie question: any chance to not do any freezing at all on 
modern laptops when doing s2ram, or at least only warn if it fails and 
try to suspend?

Because in practice losing power due to failed freezing *will* result in 
data loss about 90% of the time ...

So I don't even know what we are trying to protect against by refusing to 
freeze - avoiding a 0.001% data loss risk against a 90% data loss risk?

Thanks,

Ingo


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Ingo Molnar


* Oleg Nesterov  wrote:

> > I reviewed the ->cred_guard_mutex code, and the mutex is held across all
> > of exec() - and we always did this.
> 
> Yes, and this was always wrong. For example, this test-case hangs:
> 
>   #include <sys/ptrace.h>
>   #include <pthread.h>
>   #include <unistd.h>
>   #include <signal.h>
> 
>   void *thread(void *arg)
>   {
>   ptrace(PTRACE_TRACEME, 0,0,0);
>   return NULL;
>   }
> 
>   int main(void)
>   {
>   int pid = fork();
> 
>   if (!pid) {
>   pthread_t pt;
>   pthread_create(&pt, NULL, thread, NULL);
>   pthread_join(pt, NULL);
>   execlp("echo", "echo", "passed", NULL);
>   }
> 
>   sleep(1);
>   // or anything else which needs ->cred_guard_mutex,
>   // say open(/proc/$pid/mem)
>   ptrace(PTRACE_ATTACH, pid, 0,0);
>   kill(pid, SIGCONT);
> 
>   return 0;
>   }
> 
> we really need to narrow the (huge) scope of ->cred_guard_mutex in exec paths.
> 
> my attempt to fix this was nacked, and nobody suggested a better solution so 
> far.

Any link to your patch and the NAK?

Thanks,

Ingo


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Ingo Molnar


* Michal Hocko  wrote:

> > Do we actually have reports of this happening for people outside
> > Android?
> 
> Not that I am aware of.

I'd say outside of Android 99% of the use of hibernation is the fail-safe 
that distributions offer on laptops with very low battery levels: the 
emergency hibernation when there's almost no power left anymore.

Do these hibernation failure messages typically make it to persistent 
logs before the system loses power?

In practice if that is buggy the kernel won't hibernate and the laptop 
will run out of power and the user will conclude "ugh, I shouldn't have 
left my laptop turned on" - without looking into the logs and reporting, 
as they'll perceive it as a user failure not a system failure.

I certainly saw random Linux laptops fail to hibernate over the years and 
didn't report it, so if the distribution doesn't do the reporting 
automatically then chances are we'll never see it.

Thanks,

Ingo


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Ingo Molnar


* Rafael J. Wysocki  wrote:

> > Note that I haven't tested the revert yet, but the code and the breakage 
> > looks pretty obvious. (I'll boot the revert, will follow up if that 
> > didn't solve the problem.)
> 
> I can queue up a revert unless anyone beats me to that.

Thanks!

I have 1+ days uptime now on the system, no splat or other problems, as 
expected:

  Tested-by: Ingo Molnar 

I have the revert locally in -tip, will drop it once your commit hits 
upstream.

Ingo


Re: Strange hang with gcc 8 of kprobe multiple_kprobes test

2018-12-04 Thread Ingo Molnar


* Masami Hiramatsu  wrote:

> I remember I have fixed this, and actually WE did it :-D
> 
> https://lkml.org/lkml/2018/8/23/1203
> 
> Ah, we hit a same bug...
> 
> Ingo, could you pick the patch? Should I resend it?

Indeed: I just picked it up into tip:perf/urgent.

It's my bad: I missed the original submission due to Steve's feedback 
which I mistook as a request for another iteration, while he only 
commented on the reason for the original breakage and gave his 
Reviewed-by ...

Thanks,

Ingo


Re: [BUGFIX PATCH -tip] kprobes/x86: Fix to copy RIP relative instruction correctly

2018-12-04 Thread Ingo Molnar


* Masami Hiramatsu  wrote:

> Since copy_optimized_instructions() misses to update real RIP
> address while copying several instructions to working buffer,
> it adjusts RIP-relative instruction with wrong RIP address for
> the 2nd and subsequent instructions.
> 
> This may break the kernel (like kernel freeze) because
> probed instruction can refer a wrong data. For example,
> putting kprobes on cpumask_next hit this bug.
> 
> cpumask_next is normally like below if CONFIG_CPUMASK_OFFSTACK=y
> (in this case nr_cpumask_bits is an alias of nr_cpu_ids)
> 
>  :
>    48 89 f0             mov    %rsi,%rax
>    8b 35 7b fb e2 00    mov    0xe2fb7b(%rip),%esi
> # 82db9e64 
>55  push   %rbp
> ...
> 
> If we put a kprobe on it and optimized with jump, it becomes
> like this.
> 
>   e9 95 7d 07 1e  jmpq   0xa000207a
>   7b fb   jnp    0x81f8a2e2 
>   e2 00   loop   0x81f8a2e9 
>   55  push   %rbp
> 
> This shows first 2 "mov" instructions are copied to trampoline
> buffer at 0xa000207a. Here is the disassembled result.
> (skipped optprobe template instructions)
> 
> Dump of assembler code from 0xa000207a to 0xa00020ea:
>   54  push   %rsp
> ...
>   48 83 c4 08 add$0x8,%rsp
>   9d  popfq
>   48 89 f0            mov    %rsi,%rax
>   8b 35 82 7d db e2   mov    -0x1d24827e(%rip),%esi
> # 0x82db9e67 
> 
> As it shows, the 2nd mov accesses *(nr_cpu_ids+3) instead of
> *nr_cpu_ids. This leads a kernel freeze because cpumask_next()
> always returns 0 and for_each_cpu() never ended.
> 
> Fixing this by adding len correctly to real RIP address while
> copying.
> 
> Fixes: 63fef14fc98a ("kprobes/x86: Make insn buffer always ROX and use 
> text_poke()")
> Reported-by: Michael Rodin 
> Signed-off-by: Masami Hiramatsu 
> Cc: sta...@vger.kernel.org
> ---
>  arch/x86/kernel/kprobes/opt.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
> index eaf02f2e7300..e92672b8b490 100644
> --- a/arch/x86/kernel/kprobes/opt.c
> +++ b/arch/x86/kernel/kprobes/opt.c
> @@ -189,7 +189,8 @@ static int copy_optimized_instructions(u8 *dest, u8 *src, 
> u8 *real)
>   int len = 0, ret;
>  
>   while (len < RELATIVEJUMP_SIZE) {
> > - ret = __copy_instruction(dest + len, src + len, real, &len);
> > + ret = __copy_instruction(dest + len, src + len, real + len,
> > + &len);

I applied this, except that I unbroke the line: please ignore checkpatch 
in such cases where the cure is worse than the disease ...

I.e. even if slightly over 80 cols, this is more readable:

        ret = __copy_instruction(dest + len, src + len, real + len, &len);

Thanks,

Ingo


Re: [GIT PULL rcu/next] RCU commits for 4.21/5.0

2018-12-04 Thread Ingo Molnar



[ Added Linus and Arnaldo to the Cc:, for high level kernel side and 
  tooling side buy-in for the  inclusion in the kernel tree 
  below: ]

* Paul E. McKenney  wrote:

> Hello, Ingo,
> 
> This pull request contains the following changes:
> 
> 1.Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.
> 
>   http://lkml.kernel.org/r/2018193156.ga3...@linux.ibm.com
> 
> 2.Replace of calls to RCU-bh and RCU-sched update-side functions
>   to their vanilla RCU counterparts.  This series is a step
>   towards complete removal of the RCU-bh and RCU-sched update-side
>   functions.
> 
>   http://lkml.kernel.org/r/2018194104.ga4...@linux.ibm.com
> 
>   Note that several of the patches sent out have been dropped from
>   this series due to having been pulled in by their respective
>   maintainers.
> 
> 3.Documentation updates, including a number of flavor-consolidation
>   updates from Joel Fernandes.
> 
>   http://lkml.kernel.org/r/2018195619.ga6...@linux.ibm.com
> 
> 4.Miscellaneous fixes.
> 
>   http://lkml.kernel.org/r/2018192839.ga32...@linux.ibm.com
> 
> 5.Automate generation of the initrd filesystem used for
>   rcutorture testing.
> 
>   http://lkml.kernel.org/r/2018200127.ga9...@linux.ibm.com
> 
> 6.Convert spin_is_locked() assertions to instead use lockdep.
> 
>   http://lkml.kernel.org/r/2018200421.ga10...@linux.ibm.com
> 
>   Note that several of the patches sent out have been dropped from
>   this series due to having been pulled in by their respective
>   maintainers.
> 
> 7.SRCU updates, especially including a fix from Dennis Krein
>   for a bag-on-head-class bug.
> 
>   http://lkml.kernel.org/r/2018200834.ga10...@linux.ibm.com
> 
> 8.RCU torture-test updates.
> 
>   http://lkml.kernel.org/r/2018201956.ga11...@linux.ibm.com
> 
> All of these changes have been subjected to 0day Test Robot and -next
> testing, and are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git 
> for-mingo
> 
> for you to fetch changes up to 5ac7cdc29897e5fc3f5e214f3f8c8b03ef8d7029:
> 
>   rcutorture: Don't do busted forward-progress testing (2018-12-01 12:45:42 
> -0800)
> 
> 
> Connor Shu (1):
>   rcutorture: Automatically create initrd directory
> 
> Dennis Krein (1):
>   srcu: Lock srcu_data structure in srcu_gp_start()
> 
> Joe Perches (1):
>   checkpatch: Create table of obsolete APIs and apply to RCU
> 
> Joel Fernandes (Google) (19):
>   rcu: Remove unused rcu_state externs
>   rcu: Fix rcu_{node,data} comments about gp_seq_needed
>   doc: Clarify RCU data-structure comment about rcu_tree fanout
>   doc: Remove rcu_preempt_state reference in stallwarn
>   doc: Update information about resched_cpu
>   doc: Remove rcu_dynticks from Data-Structures
>   doc: rcu: Update Data-Structures for RCU flavor consolidation
>   doc: rcu: Better clarify the rcu_segcblist ->len field
>   doc: rcu: Update description of gp_seq fields in rcu_data
>   doc: rcu: Update core and full API in whatisRCU
>   doc: rcu: Add more rationale for using rcu_read_lock_sched in checklist
>   doc: rcu: Remove obsolete suggestion from checklist
>   doc: rcu: Remove obsolete checklist item about synchronize_rcu usage
>   doc: rcu: Encourage use of rcu_barrier in checklist
>   doc: Make reader aware of rcu_dereference_protected
>   doc: Remove obsolete (non-)requirement about disabling preemption
>   doc: Make listing in RCU perf/scale requirements use 
> rcu_assign_pointer()
>   doc: Correct parameter in stallwarn
>   doc: Fix "struction" typo in RCU memory-ordering documentation
> 
> Lance Roy (7):
>   x86/PCI: Replace spin_is_locked() with lockdep
>   sfc: Replace spin_is_locked() with lockdep
>   smsc: Replace spin_is_locked() with lockdep
>   userfaultfd: Replace spin_is_locked() with lockdep
>   locking/mutex: Replace spin_is_locked() with lockdep
>   mm: Replace spin_is_locked() with lockdep
>   KVM: arm/arm64: vgic: Replace spin_is_locked() with lockdep
> 
> Paul E. McKenney (77):
>   rcu: Eliminate BUG_ON() for sync.c
>   rcu: Eliminate BUG_ON() for kernel/rcu/tree.c
>   rcu: Eliminate synchronize_rcu_mult()
>   rcu: Consolidate the RCU update functions invoked by sync.c
>   sched/membarrier: Replace synchronize_sched() with synchronize_rcu()
>   sparc/oprofile: Convert timer_stop() to use synchronize_rcu()
>   s390/mm: Convert tlb_table_flush() to use call_rcu()
>   powerpc: Convert hugepd_free() to use call_rcu()
>   doc: Set down forward-progress requirements
>   rcutorture: Add initrd support for systems lacking dracut
>   rcutorture: Make initrd/init execute in userspace
>   rcutorture: Add 

Re: [PATCH] tools: Fix diverse typos

2018-12-03 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> On Mon, Dec 03, 2018 at 11:22:00AM +0100, Ingo Molnar wrote:
> > Go over the tools/ files that are maintained in Arnaldo's tree and
> > fix common typos: half of them were in comments, the other half
> > in JSON files.
> > 
> > ( Care should be taken not to re-import these typos in the future,
> >   if the JSON files get updated by the vendor without fixing the typos. )
> 
> >  .../perf/pmu-events/arch/x86/broadwell/cache.json  |  4 +--
> >  .../pmu-events/arch/x86/broadwell/pipeline.json|  2 +-
> >  .../pmu-events/arch/x86/broadwellde/cache.json |  4 +--
> >  .../pmu-events/arch/x86/broadwellde/pipeline.json  |  2 +-
> >  .../perf/pmu-events/arch/x86/broadwellx/cache.json |  4 +--
> >  .../pmu-events/arch/x86/broadwellx/pipeline.json   |  2 +-
> >  tools/perf/pmu-events/arch/x86/jaketown/cache.json |  4 +--
> >  .../pmu-events/arch/x86/jaketown/pipeline.json |  2 +-
> >  .../pmu-events/arch/x86/knightslanding/cache.json  | 30 
> > +++---
> >  .../pmu-events/arch/x86/sandybridge/cache.json |  4 +--
> >  .../pmu-events/arch/x86/sandybridge/pipeline.json  |  2 +-
> >  .../pmu-events/arch/x86/skylakex/uncore-other.json | 12 -
> 
> Yeah, so I think those are generated from somewhere, fixing them here
> isn't going to necessarily help much.

It's in our source code, the output is visible to our users, so such 
typos should be fixed.

But yes, I agree that the fixes should also be applied at the Intel 
source of the JSON definitions.

Thanks,

Ingo


[PATCH] tools: Fix diverse typos

2018-12-03 Thread Ingo Molnar
Go over the tools/ files that are maintained in Arnaldo's tree and
fix common typos: half of them were in comments, the other half
in JSON files.

( Care should be taken not to re-import these typos in the future,
  if the JSON files get updated by the vendor without fixing the typos. )

No change in functionality intended.

Cc: Arnaldo Carvalho de Melo 
Cc: Peter Zijlstra 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar 
---
 tools/lib/subcmd/parse-options.h   |  4 +--
 tools/lib/traceevent/event-parse.c | 12 -
 tools/lib/traceevent/plugin_kvm.c  |  2 +-
 tools/perf/Documentation/perf-list.txt |  2 +-
 tools/perf/Documentation/perf-report.txt   |  2 +-
 tools/perf/Documentation/perf-stat.txt |  4 +--
 tools/perf/arch/x86/tests/insn-x86.c   |  2 +-
 tools/perf/builtin-top.c   |  2 +-
 tools/perf/builtin-trace.c |  2 +-
 .../perf/pmu-events/arch/x86/broadwell/cache.json  |  4 +--
 .../pmu-events/arch/x86/broadwell/pipeline.json|  2 +-
 .../pmu-events/arch/x86/broadwellde/cache.json |  4 +--
 .../pmu-events/arch/x86/broadwellde/pipeline.json  |  2 +-
 .../perf/pmu-events/arch/x86/broadwellx/cache.json |  4 +--
 .../pmu-events/arch/x86/broadwellx/pipeline.json   |  2 +-
 tools/perf/pmu-events/arch/x86/jaketown/cache.json |  4 +--
 .../pmu-events/arch/x86/jaketown/pipeline.json |  2 +-
 .../pmu-events/arch/x86/knightslanding/cache.json  | 30 +++---
 .../pmu-events/arch/x86/sandybridge/cache.json |  4 +--
 .../pmu-events/arch/x86/sandybridge/pipeline.json  |  2 +-
 .../pmu-events/arch/x86/skylakex/uncore-other.json | 12 -
 tools/perf/tests/attr.c|  2 +-
 tools/perf/util/annotate.c |  2 +-
 tools/perf/util/bpf-loader.c   |  2 +-
 tools/perf/util/header.c   |  2 +-
 tools/perf/util/hist.c |  2 +-
 tools/perf/util/jitdump.c  |  2 +-
 tools/perf/util/machine.c  |  2 +-
 tools/perf/util/probe-event.c  |  4 +--
 tools/perf/util/sort.c |  2 +-
 30 files changed, 62 insertions(+), 62 deletions(-)

diff --git a/tools/lib/subcmd/parse-options.h b/tools/lib/subcmd/parse-options.h
index 6ca2a8bfe716..af9def589863 100644
--- a/tools/lib/subcmd/parse-options.h
+++ b/tools/lib/subcmd/parse-options.h
@@ -71,7 +71,7 @@ typedef int parse_opt_cb(const struct option *, const char 
*arg, int unset);
  *
  * `argh`::
  *   token to explain the kind of argument this option wants. Keep it
- *   homogenous across the repository.
+ *   homogeneous across the repository.
  *
  * `help`::
  *   the short help associated to what the option does.
@@ -80,7 +80,7 @@ typedef int parse_opt_cb(const struct option *, const char 
*arg, int unset);
  *
  * `flags`::
  *   mask of parse_opt_option_flags.
- *   PARSE_OPT_OPTARG: says that the argument is optionnal (not for BOOLEANs)
+ *   PARSE_OPT_OPTARG: says that the argument is optional (not for BOOLEANs)
  *   PARSE_OPT_NOARG: says that this option takes no argument, for CALLBACKs
  *   PARSE_OPT_NONEG: says that this option cannot be negated
  *   PARSE_OPT_HIDDEN this option is skipped in the default usage, showed in
diff --git a/tools/lib/traceevent/event-parse.c 
b/tools/lib/traceevent/event-parse.c
index 3692f29fee46..934c441d3618 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -1145,7 +1145,7 @@ static enum tep_event_type read_token(char **tok)
 }
 
 /**
- * tep_read_token - access to utilites to use the pevent parser
+ * tep_read_token - access to utilities to use the pevent parser
  * @tok: The token to return
  *
  * This will parse tokens from the string given by
@@ -3258,7 +3258,7 @@ static int event_read_print(struct tep_event_format 
*event)
  * @name: the name of the common field to return
  *
  * Returns a common field from the event by the given @name.
- * This only searchs the common fields and not all field.
+ * This only searches the common fields and not all field.
  */
 struct tep_format_field *
 tep_find_common_field(struct tep_event_format *event, const char *name)
@@ -3302,7 +3302,7 @@ tep_find_field(struct tep_event_format *event, const char 
*name)
  * @name: the name of the field
  *
  * Returns a field by the given @name.
- * This searchs the common field names first, then
+ * This searches the common field names first, then
  * the non-common ones if a common one was not found.
  */
 struct tep_format_field *
@@ -3838,7 +3838,7 @@ static void print_bitmask_to_seq(struct tep_handle 
*pevent,
/*
 * data points to a bit mask of size bytes.
 * In the kernel, this is an array of long words, thus
-* endianess is very important

Re: [PATCH] cpu: Bool tests don't need comparisons

2018-12-03 Thread Ingo Molnar


* Wen Yang  wrote:

> This is the patch to the file cpu.c
> which fixes the following coccinelle warning:
> 
> WARNING: Comparison to bool
> 
> Signed-off-by: Wen Yang 
> CC: Thomas Gleixner 
> CC: Ingo Molnar 
> CC: Konrad Rzeszutek Wilk 
> CC: Josh Poimboeuf 
> CC: "Peter Zijlstra (Intel)" 
> CC: Peter Zijlstra 
> CC: Mukesh Ojha 
> CC: linux-kernel@vger.kernel.org
> ---
>  kernel/cpu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 91d5c38eb7e5..5bdd7e150a11 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1650,7 +1650,7 @@ int __cpuhp_state_add_instance_cpuslocked(enum 
> cpuhp_state state,
>   lockdep_assert_cpus_held();
>  
>   sp = cpuhp_get_step(state);
> - if (sp->multi_instance == false)
> + if (!sp->multi_instance)
>   return -EINVAL;
>  

This is a *totally* bogus explanation.

This is an equivalent pattern to '== 0' which is commonly used.

The patch is still doing the right thing, but only accidentally, for 
another reason, it's because we are using ->multi_instance in an 
inconsistent fashion:

 kernel/cpu.c:  if (!step->multi_instance) {
 kernel/cpu.c:  if (sp->multi_instance == false)
 kernel/cpu.c:  if (!sp->multi_instance)
 kernel/cpu.c:  if (sp->multi_instance) {

But that's really just by accident - if all usages were of the
'== true/false' pattern then this wouldn't be necessary.

Thanks,

Ingo


Re: [PATCH] locktorture: Fix assignment of boolean variables

2018-12-03 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> On Sat, Dec 01, 2018 at 12:37:01PM -0800, Paul E. McKenney wrote:
> > On Sat, Dec 01, 2018 at 04:31:49PM +0800, Wen Yang wrote:
> > > Fix the following warnings reported by coccinelle:
> > > 
> > > kernel/locking/locktorture.c:703:6-10: WARNING: Assignment of bool to 0/1
> > > kernel/locking/locktorture.c:918:2-20: WARNING: Assignment of bool to 0/1
> > > kernel/locking/locktorture.c:949:3-20: WARNING: Assignment of bool to 0/1
> > > kernel/locking/locktorture.c:682:2-19: WARNING: Assignment of bool to 0/1
> > > kernel/locking/locktorture.c:688:2-19: WARNING: Assignment of bool to 0/1
> > > kernel/locking/locktorture.c:648:2-20: WARNING: Assignment of bool to 0/1
> > > kernel/locking/locktorture.c:654:2-20: WARNING: Assignment of bool to 0/1
> > > 
> > > This patch also makes the code more readable.
> > > 
> > > Signed-off-by: Wen Yang 
> > > CC: Davidlohr Bueso 
> > > CC: "Paul E. McKenney" 
> > > CC: Josh Triplett 
> > > CC: linux-kernel@vger.kernel.org
> > 
> > Adding the current maintainers on CC.
> 
> So I strongly disagree with this. Anybody that has trouble with 0/1 vs
> false/true needs to stay the heck away from C.

Indeed, and it's actually *worse* to read, as 0/1 stands out more and is 
more compact than false/true...

The only reasonable case where bool is recommended is when functions are 
returning it, to make sure there's no mishap returning something else.

But for a plain .c variable? Nope.

> I would suggest we delete that stupid coccinelle scripts that generates
> these pointless warns.

Ack.

Thanks,

Ingo


[PATCH] x86/pci: Clean up the asm/pci_x86.h header a bit

2018-12-03 Thread Ingo Molnar


* Ingo Molnar  wrote:

> From 22b71f970f18f5f38161be028ab7ce7cd1f769f7 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar 
> Date: Mon, 3 Dec 2018 09:15:40 +0100
> Subject: [PATCH] x86/pci: Remove the dead-code DBG() macro
> 
> While reading arch/x86/include/asm/pci_x86.h I noticed that we have ancient
> residuals of debugging code, which is never actually enabled via any regular
> Kconfig mechanism:

I also noticed that the PCI_CHECK_ENABLE_AMD_MMCONF line added eons ago 
looked weird, so did a quick cleanup pass of the header.

Thanks,

Ingo

====>
From: Ingo Molnar 
Date: Mon, 3 Dec 2018 09:42:40 +0100
Subject: [PATCH] x86/pci: Clean up the asm/pci_x86.h header a bit

- The addition of PCI_CHECK_ENABLE_AMD_MMCONF broke the tabulation of the
  enumeration - fix the tabulation and also extend it with zeroes to make
  it clearer how the bits relate, and to make it easier to check that
  there's no accidental overlap.

- Remove unnecessary and out of date comments referring to file names that
  do not exist with that name anymore.

- Fix non-standard comment lines.

- Add 'extern' to global scope variables/functions that don't have it.

- Fix typo.

- Tabulate 'struct pci_mmcfg_region', similarly to the other structure
  definitions in this header.

- Move the x86_default_pci_* definitions closer to their usage site
  in arch/x86/kernel/x86_init.c, they are not used anywhere else,
  so it's unnecessary to include it in 30+ .c files.

No change to functionality intended.

Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/pci_x86.h | 101 +++--
 arch/x86/kernel/x86_init.c |  14 ++
 2 files changed, 51 insertions(+), 64 deletions(-)

diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
index 3f8d9c75b9ee..3448f24282c7 100644
--- a/arch/x86/include/asm/pci_x86.h
+++ b/arch/x86/include/asm/pci_x86.h
@@ -7,26 +7,26 @@
 
 #include 
 
-#define PCI_PROBE_BIOS 0x0001
-#define PCI_PROBE_CONF1    0x0002
-#define PCI_PROBE_CONF2    0x0004
-#define PCI_PROBE_MMCONF   0x0008
-#define PCI_PROBE_MASK 0x000f
-#define PCI_PROBE_NOEARLY  0x0010
-
-#define PCI_NO_CHECKS  0x0400
-#define PCI_USE_PIRQ_MASK  0x0800
-#define PCI_ASSIGN_ROMS    0x1000
-#define PCI_BIOS_IRQ_SCAN  0x2000
-#define PCI_ASSIGN_ALL_BUSSES  0x4000
-#define PCI_CAN_SKIP_ISA_ALIGN 0x8000
-#define PCI_USE__CRS   0x10000
-#define PCI_CHECK_ENABLE_AMD_MMCONF    0x20000
-#define PCI_HAS_IO_ECS 0x40000
-#define PCI_NOASSIGN_ROMS  0x80000
-#define PCI_ROOT_NO_CRS    0x100000
-#define PCI_NOASSIGN_BARS  0x200000
-#define PCI_BIG_ROOT_WINDOW    0x400000
+#define PCI_PROBE_BIOS 0x01
+#define PCI_PROBE_CONF1    0x02
+#define PCI_PROBE_CONF2    0x04
+#define PCI_PROBE_MMCONF   0x08
+#define PCI_PROBE_MASK 0x0f
+#define PCI_PROBE_NOEARLY  0x10
+
+#define PCI_NO_CHECKS  0x000400
+#define PCI_USE_PIRQ_MASK  0x000800
+#define PCI_ASSIGN_ROMS    0x001000
+#define PCI_BIOS_IRQ_SCAN  0x002000
+#define PCI_ASSIGN_ALL_BUSSES  0x004000
+#define PCI_CAN_SKIP_ISA_ALIGN 0x008000
+#define PCI_USE__CRS   0x010000
+#define PCI_CHECK_ENABLE_AMD_MMCONF    0x020000
+#define PCI_HAS_IO_ECS 0x040000
+#define PCI_NOASSIGN_ROMS  0x080000
+#define PCI_ROOT_NO_CRS    0x100000
+#define PCI_NOASSIGN_BARS  0x200000
+#define PCI_BIG_ROOT_WINDOW    0x400000
 
 extern unsigned int pci_probe;
 extern unsigned long pirq_table_addr;
@@ -38,28 +38,19 @@ enum pci_bf_sort_state {
pci_dmi_bf,
 };
 
-/* pci-i386.c */
-
-void pcibios_resource_survey(void);
-void pcibios_set_cache_line_size(void);
-
-/* pci-pc.c */
-
+extern void pcibios_resource_survey(void);
+extern void pcibios_set_cache_line_size(void);
 extern int pcibios_last_bus;
 extern struct pci_ops pci_root_ops;
-
-void pcibios_scan_specific_bus(int busn);
-
-/* pci-irq.c */
+extern void pcibios_scan_specific_bus(int busn);
 
 struct irq_info {
u8 bus, devfn;  /* Bus, device and function */
struct {
-   u8 link;/* IRQ line ID, chipset dependent,
-  0 = not routed */
+   u8 link;/* IRQ line ID, chipset dependent, 0 = 
not routed */
u16 bitmap; /* Available IRQs */
-   } __attribute__((packed)) irq[4];
-   u8 slot;/* Slot number, 0=onboard */
+   } __packed irq[4];
+   u8 slot;/* Slot number, 0 = onboard */
u8 rfu;
 } __attribute__((packed));
 
@@ -68,10 +59,8 @@ struct irq_routing_table {
u16 version;

[PATCH] x86/pci: Remove dead code DBG() macro

2018-12-03 Thread Ingo Molnar
From 22b71f970f18f5f38161be028ab7ce7cd1f769f7 Mon Sep 17 00:00:00 2001
From: Ingo Molnar 
Date: Mon, 3 Dec 2018 09:15:40 +0100
Subject: [PATCH] x86/pci: Remove the dead-code DBG() macro

While reading arch/x86/include/asm/pci_x86.h I noticed that we have ancient
residuals of debugging code, which is never actually enabled via any regular
Kconfig mechanism:

 #undef DEBUG

 #ifdef DEBUG
 #define DBG(fmt, ...) printk(fmt, ##__VA_ARGS__)
 #else
 #define DBG(fmt, ...)  \
 do {   \
   if (0)  \
   printk(fmt, ##__VA_ARGS__); \
 } while (0)
 #endif

Remove this and the call sites. These messages might have been
super interesting decades ago when the PCI code was first
bootstrapped, but we have better mechanisms meanwhile, and code
readability is king ... ;-)

Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/pci_x86.h | 12 
 arch/x86/pci/direct.c  |  1 -
 arch/x86/pci/i386.c|  2 --
 arch/x86/pci/irq.c | 29 +++--
 arch/x86/pci/legacy.c  |  3 +--
 arch/x86/pci/pcbios.c  |  9 ++---
 6 files changed, 6 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
index 959d618dbb17..3f8d9c75b9ee 100644
--- a/arch/x86/include/asm/pci_x86.h
+++ b/arch/x86/include/asm/pci_x86.h
@@ -7,18 +7,6 @@
 
 #include 
 
-#undef DEBUG
-
-#ifdef DEBUG
-#define DBG(fmt, ...) printk(fmt, ##__VA_ARGS__)
-#else
-#define DBG(fmt, ...)  \
-do {   \
-   if (0)  \
-   printk(fmt, ##__VA_ARGS__); \
-} while (0)
-#endif
-
 #define PCI_PROBE_BIOS 0x0001
 #define PCI_PROBE_CONF1    0x0002
 #define PCI_PROBE_CONF2    0x0004
diff --git a/arch/x86/pci/direct.c b/arch/x86/pci/direct.c
index a51074c55982..68a115629568 100644
--- a/arch/x86/pci/direct.c
+++ b/arch/x86/pci/direct.c
@@ -216,7 +216,6 @@ static int __init pci_sanity_check(const struct pci_raw_ops 
*o)
return 1;
}
 
-   DBG(KERN_WARNING "PCI: Sanity check failed\n");
return 0;
 }
 
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index 8cd66152cdb0..5c2b31315750 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -389,8 +389,6 @@ void __init pcibios_resource_survey(void)
 {
struct pci_bus *bus;
 
-   DBG("PCI: Allocating resources\n");
-
	list_for_each_entry(bus, &pci_root_buses, node)
pcibios_allocate_bus_resources(bus);
 
diff --git a/arch/x86/pci/irq.c b/arch/x86/pci/irq.c
index 52e55108404e..1286f138e281 100644
--- a/arch/x86/pci/irq.c
+++ b/arch/x86/pci/irq.c
@@ -77,11 +77,8 @@ static inline struct irq_routing_table 
*pirq_check_routing_table(u8 *addr)
sum = 0;
for (i = 0; i < rt->size; i++)
sum += addr[i];
-   if (!sum) {
-   DBG(KERN_DEBUG "PCI: Interrupt Routing Table found at 0x%p\n",
-   rt);
+   if (!sum)
return rt;
-   }
return NULL;
 }
 
@@ -126,15 +123,6 @@ static void __init pirq_peer_trick(void)
memset(busmap, 0, sizeof(busmap));
for (i = 0; i < (rt->size - sizeof(struct irq_routing_table)) / 
sizeof(struct irq_info); i++) {
	e = &rt->slots[i];
-#ifdef DEBUG
-   {
-   int j;
-   DBG(KERN_DEBUG "%02x:%02x slot=%02x", e->bus, 
e->devfn/8, e->slot);
-   for (j = 0; j < 4; j++)
-   DBG(" %d:%02x/%04x", j, e->irq[j].link, 
e->irq[j].bitmap);
-   DBG("\n");
-   }
-#endif
busmap[e->bus] = 1;
}
for (i = 1; i < 256; i++) {
@@ -163,10 +151,8 @@ void elcr_set_level_irq(unsigned int irq)
elcr_irq_mask |= (1 << irq);
printk(KERN_DEBUG "PCI: setting IRQ %u as level-triggered\n", irq);
val = inb(port);
-   if (!(val & mask)) {
-   DBG(KERN_DEBUG " -> edge");
+   if (!(val & mask))
outb(val | mask, port);
-   }
 }
 
 /*
@@ -836,16 +822,10 @@ static void __init pirq_find_router(struct irq_router *r)
r->get = NULL;
r->set = NULL;
 
-   DBG(KERN_DEBUG "PCI: Attempting to find IRQ router for [%04x:%04x]\n",
-   rt->rtr_vendor, rt->rtr_device);
-
pirq_router_dev = pci_get_domain_bus_and_slot(0, rt->rtr_bus,
  rt->rtr_devfn);
-   if (!pirq_router_dev) {
-   DBG(KERN_DEBUG "PCI: Interrupt router not found at "
-   "%02x:%02x\n", rt->rtr_bus, rt-

Re: [PATCH v2] locktorture: style fix - spaces required around

2018-12-03 Thread Ingo Molnar


* Wen Yang  wrote:

> This patch fixes the following checkpatch.pl errors:
> 
> ERROR: spaces required around that ':' (ctx:VxW)
> +torture_type, tag, cxt.debug_lock ? " [debug]": "",

There's literally 1 million checkpatch 'failures' in the kernel, so 
unless we want to apply 1 million commits (we don't), please do such 
cleanups as part of some real feature, simplification or optimization 
work.

Thanks,

Ingo


Re: [PATCH v2] locktorture: Fix assignment of boolean variables

2018-12-03 Thread Ingo Molnar


* Wen Yang  wrote:

> Fix the following warnings reported by coccinelle:
> 
> kernel/locking/locktorture.c:703:6-10: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:918:2-20: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:949:3-20: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:682:2-19: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:688:2-19: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:648:2-20: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:654:2-20: WARNING: Assignment of bool to 0/1
> 
> This patch also makes the code more readable.

No, it doesn't make the code more readable!

0/1 patterns are blatantly obvious and shorter to both write and read 
than false/true text.

NAK.

Thanks,

Ingo


[PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-02 Thread Ingo Molnar


* Linus Torvalds  wrote:

> The patch stats this week look a little bit more normal than last time,
> probably simply because it's also a normal-sized rc4 rather than the
> unusually small rc3.

So there's a new regression in v4.20-rc4, my desktop produces this 
lockdep splat:

[ 1772.588771] WARNING: pkexec/4633 still has locks held!
[ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
[ 1772.588775] 
[ 1772.588776] 1 lock held by pkexec/4633:
[ 1772.588778]  #0: ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: 
prepare_bprm_creds+0x2a/0x70
[ 1772.588786] stack backtrace:
[ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 
4.20.0-rc4-custom-00213-g93a49841322b #1
[ 1772.588792] Call Trace:
[ 1772.588800]  dump_stack+0x85/0xcb
[ 1772.588803]  flush_old_exec+0x116/0x890
[ 1772.588807]  ? load_elf_phdrs+0x72/0xb0
[ 1772.588809]  load_elf_binary+0x291/0x1620
[ 1772.588815]  ? sched_clock+0x5/0x10
[ 1772.588817]  ? search_binary_handler+0x6d/0x240
[ 1772.588820]  search_binary_handler+0x80/0x240
[ 1772.588823]  load_script+0x201/0x220
[ 1772.588825]  search_binary_handler+0x80/0x240
[ 1772.588828]  __do_execve_file.isra.32+0x7d2/0xa60
[ 1772.588832]  ? strncpy_from_user+0x40/0x180
[ 1772.588835]  __x64_sys_execve+0x34/0x40
[ 1772.588838]  do_syscall_64+0x60/0x1c0

The warning gets triggered by an ancient lockdep check in the freezer:

(gdb) list *0x812ece06
0x812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
52   * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
53   * If try_to_freeze causes a lockdep warning it means the caller may 
deadlock
54   */
55  static inline bool try_to_freeze_unsafe(void)
56  {
57  might_sleep();
58  if (likely(!freezing(current)))
59  return false;
60  return __refrigerator(false);
61  }

I reviewed the ->cred_guard_mutex code, and the mutex is held across all 
of exec() - and we always did this.

But there's this recent -rc4 commit:

> Chanho Min (1):
>   exec: make de_thread() freezable

  c22397888f1e: exec: make de_thread() freezable

I believe this commit is bogus, you cannot call try_to_freeze() from 
de_thread(), because it's holding the ->cred_guard_mutex.

Also, I'm 3 times grumpy:

 #1: I think this commit was never tested with lockdep turned on, as I 
 think splat this should trigger 100% of the time with the ELF 
 binfmt loader.

 #2: Less than 4 days passed between the commit on Nov 19 and it being 
 upstream by Nov 23. *Please* give it more testing if you change 
 something as central as fs/exec.c ...

 #3  Also, please also proof-read changelogs before committing them:

 >>  The casue is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting 
for
 >>  [...]
 >>
 >>  In our machine, it causes freeze timeout as bellows.

 That's *three* typos in a single commit:

s/casue/cause
s/In our/On our
s/bellows/below

 ...

Grump! :-)

Note that I haven't tested the revert yet, but the code and the breakage 
looks pretty obvious. (I'll boot the revert, will follow up if that 
didn't solve the problem.)

Thanks,

Ingo

Signed-off-by: Ingo Molnar 

This reverts commit c22397888f1eed98cd59f0a88f2a5f6925f80e15.
---
 fs/exec.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index acc3a5536384..fc281b738a98 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -62,7 +62,6 @@
 #include 
 #include 
 #include 
-#include <linux/freezer.h>
 
 #include 
 #include 
@@ -1084,7 +1083,7 @@ static int de_thread(struct task_struct *tsk)
while (sig->notify_count) {
__set_current_state(TASK_KILLABLE);
spin_unlock_irq(lock);
-   freezable_schedule();
+   schedule();
if (unlikely(__fatal_signal_pending(tsk)))
goto killed;
spin_lock_irq(lock);
@@ -1112,7 +1111,7 @@ static int de_thread(struct task_struct *tsk)
__set_current_state(TASK_KILLABLE);
	write_unlock_irq(&tasklist_lock);
cgroup_threadgroup_change_end(tsk);
-   freezable_schedule();
+   schedule();
if (unlikely(__fatal_signal_pending(tsk)))
goto killed;
}


[GIT PULL] x86 fixes

2018-11-29 Thread Ingo Molnar
Linus,

Please pull the latest x86-urgent-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
x86-urgent-for-linus

   # HEAD: 60c8144afc287ef09ce8c1230c6aa972659ba1bb x86/MCE/AMD: Fix the 
thresholding machinery initialization order

Misc fixes:

 - an MCE related boot crash fix on certain AMD systems
 - an FPU exception handling fix
 - an FPU handling race fix
 - a revert+rewrite of the RSDP boot protocol extension, use boot_params 
   instead
 - a documentation fix

 Thanks,

Ingo

-->
Borislav Petkov (1):
  x86/MCE/AMD: Fix the thresholding machinery initialization order

Elvira Khabirova (1):
  x86/ptrace: Fix documentation for tracehook_report_syscall_entry()

Jann Horn (1):
  x86/fpu: Use the correct exception table macro in the XSTATE_OP wrapper

Juergen Gross (2):
  x86/boot: Mostly revert commit ae7e1238e68f2a ("Add ACPI RSDP address to 
setup_header")
  x86/acpi, x86/boot: Take RSDP address from boot params if available

Sebastian Andrzej Siewior (1):
  x86/fpu: Disable bottom halves while loading FPU registers


 Documentation/x86/boot.txt| 32 +---
 arch/x86/boot/header.S|  6 +-
 arch/x86/include/asm/fpu/internal.h   |  2 +-
 arch/x86/include/asm/x86_init.h   |  2 --
 arch/x86/include/uapi/asm/bootparam.h |  7 ++-
 arch/x86/kernel/acpi/boot.c   |  2 +-
 arch/x86/kernel/cpu/mcheck/mce_amd.c  | 19 ++-
 arch/x86/kernel/fpu/signal.c  |  4 ++--
 arch/x86/kernel/head32.c  |  1 -
 arch/x86/kernel/head64.c  |  2 --
 arch/x86/kernel/setup.c   | 17 -
 include/linux/tracehook.h |  4 ++--
 12 files changed, 16 insertions(+), 82 deletions(-)

diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt
index 7727db8f94bc..5e9b826b5f62 100644
--- a/Documentation/x86/boot.txt
+++ b/Documentation/x86/boot.txt
@@ -61,18 +61,6 @@ Protocol 2.12:   (Kernel 3.8) Added the xloadflags field 
and extension fields
to struct boot_params for loading bzImage and ramdisk
above 4G in 64bit.
 
-Protocol 2.13: (Kernel 3.14) Support 32- and 64-bit flags being set in
-   xloadflags to support booting a 64-bit kernel from 32-bit
-   EFI
-
-Protocol 2.14: (Kernel 4.20) Added acpi_rsdp_addr holding the physical
-   address of the ACPI RSDP table.
-   The bootloader updates version with:
-   0x8000 | min(kernel-version, bootloader-version)
-   kernel-version being the protocol version supported by
-   the kernel and bootloader-version the protocol version
-   supported by the bootloader.
-
  MEMORY LAYOUT
 
 The traditional memory map for the kernel loader, used for Image or
@@ -209,7 +197,6 @@ Offset  Proto   NameMeaning
 0258/8 2.10+   pref_addressPreferred loading address
 0260/4 2.10+   init_size   Linear memory required during initialization
 0264/4 2.11+   handover_offset Offset of handover entry point
-0268/8 2.14+   acpi_rsdp_addr  Physical address of RSDP table
 
 (1) For backwards compatibility, if the setup_sects field contains 0, the
 real value is 4.
@@ -322,7 +309,7 @@ Protocol:   2.00+
   Contains the magic number "HdrS" (0x53726448).
 
 Field name:version
-Type:  modify
+Type:  read
 Offset/size:   0x206/2
 Protocol:  2.00+
 
@@ -330,12 +317,6 @@ Protocol:  2.00+
   e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version
   10.17.
 
-  Up to protocol version 2.13 this information is only read by the
-  bootloader. From protocol version 2.14 onwards the bootloader will
-  write the used protocol version or-ed with 0x8000 to the field. The
-  used protocol version will be the minimum of the supported protocol
-  versions of the bootloader and the kernel.
-
 Field name:realmode_swtch
 Type:  modify (optional)
 Offset/size:   0x208/4
@@ -763,17 +744,6 @@ Offset/size:   0x264/4
 
   See EFI HANDOVER PROTOCOL below for more details.
 
-Field name:acpi_rsdp_addr
-Type:  write
-Offset/size:   0x268/8
-Protocol:  2.14+
-
-  This field can be set by the boot loader to tell the kernel the
-  physical address of the ACPI RSDP table.
-
-  A value of 0 indicates the kernel should fall back to the standard
-  methods to locate the RSDP.
-
 
  THE IMAGE CHECKSUM
 
diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 4c881c850125..850b8762e889 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -300,7 +300,7 @@ _start:
# Part 2 of the header, from the old setup.S
 
.ascii  "HdrS"  # header signature
-   .word   0x020e  # header version number (>= 0x0105)
+   .word   0x020d  # header version number (>= 0x0105)

[GIT PULL] perf fixes

2018-11-29 Thread Ingo Molnar
Linus,

Please pull the latest perf-urgent-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
perf-urgent-for-linus

   # HEAD: 09d3f015d1e1b4fee7e9bbdcf54201d239393391 uprobes: Fix handle_swbp() 
vs. unregister() + register() race once more

Misc fixes:

 - a counter freezing related regression fix,
 - an uprobes race fix,
 - an Intel PMU unusual event combination fix,
 - and diverse tooling fixes.

 Thanks,

Ingo

-->
Andrea Parri (1):
  uprobes: Fix handle_swbp() vs. unregister() + register() race once more

Arnaldo Carvalho de Melo (5):
  tools build feature: Check if get_current_dir_name() is available
  tools headers uapi: Synchronize i915_drm.h
  tools arch x86: Update tools's copy of cpufeatures.h
  tools uapi asm-generic: Synchronize ioctls.h
  perf tools beauty ioctl: Support new ISO7816 commands

Jiri Olsa (5):
  perf tools: Fix crash on synthesizing the unit
  perf tools: Restore proper cwd on return from mnt namespace
  perf/x86/intel: Move branch tracing setup to the Intel-specific source 
file
  perf/x86/intel: Add generic branch tracing check to intel_pmu_has_bts()
  perf/x86/intel: Disallow precise_ip on BTS events

Peter Zijlstra (1):
  perf/x86/intel: Fix regression by default disabling perfmon v4 interrupt 
handling


 Documentation/admin-guide/kernel-parameters.txt |  3 +-
 arch/x86/events/core.c  | 20 
 arch/x86/events/intel/core.c| 68 +++--
 arch/x86/events/perf_event.h| 13 +++--
 kernel/events/uprobes.c | 12 -
 tools/arch/x86/include/asm/cpufeatures.h|  2 +
 tools/build/Makefile.feature|  1 +
 tools/build/feature/Makefile|  4 ++
 tools/build/feature/test-all.c  |  5 ++
 tools/build/feature/test-get_current_dir_name.c | 10 
 tools/include/uapi/asm-generic/ioctls.h |  2 +
 tools/include/uapi/drm/i915_drm.h   | 22 
 tools/perf/Makefile.config  |  5 ++
 tools/perf/tests/attr/base-record   |  2 +-
 tools/perf/trace/beauty/ioctl.c |  1 +
 tools/perf/util/Build   |  1 +
 tools/perf/util/evsel.c |  2 +-
 tools/perf/util/get_current_dir_name.c  | 18 +++
 tools/perf/util/namespaces.c| 17 ++-
 tools/perf/util/namespaces.h|  1 +
 tools/perf/util/util.h  |  4 ++
 21 files changed, 166 insertions(+), 47 deletions(-)
 create mode 100644 tools/build/feature/test-get_current_dir_name.c
 create mode 100644 tools/perf/util/get_current_dir_name.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 81d1d5a74728..5463d5a4d85c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -856,7 +856,8 @@
causing system reset or hang due to sending
INIT from AP to BSP.
 
-   disable_counter_freezing [HW]
+   perf_v4_pmi=[X86,INTEL]
+   Format: 
Disable Intel PMU counter freezing feature.
The feature only exists starting from
Arch Perfmon v4 (Skylake and newer).
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 106911b603bd..374a19712e20 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -438,26 +438,6 @@ int x86_setup_perfctr(struct perf_event *event)
if (config == -1LL)
return -EINVAL;
 
-   /*
-* Branch tracing:
-*/
-   if (attr->config == PERF_COUNT_HW_BRANCH_INSTRUCTIONS &&
-   !attr->freq && hwc->sample_period == 1) {
-   /* BTS is not supported by this architecture. */
-   if (!x86_pmu.bts_active)
-   return -EOPNOTSUPP;
-
-   /* BTS is currently only allowed for user-mode. */
-   if (!attr->exclude_kernel)
-   return -EOPNOTSUPP;
-
-   /* disallow bts if conflicting events are present */
-   if (x86_add_exclusive(x86_lbr_exclusive_lbr))
-   return -EBUSY;
-
-   event->destroy = hw_perf_lbr_event_destroy;
-   }
-
hwc->config |= config;
 
return 0;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 273c62e81546..ecc3e34ca955 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2306,14 +2306,18 @@ static int handle_pmi_common(struct pt_regs *regs, u64 
status)
return handled;
 }
 
-static bool disable_counter_freezing;
+static bool disable_counter_freezing = true;
 static int __init intel_perf_counter_freezing_setup(char *s)
 {

[GIT PULL] EFI fix

2018-11-29 Thread Ingo Molnar
Linus,

Please pull the latest efi-urgent-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
efi-urgent-for-linus

   # HEAD: 976b489120cdab2b1b3a41ffa14661db43d58190 efi: Prevent GICv3 WARN() 
by mapping the memreserve table before first use

An arm64 warning fix.

 Thanks,

Ingo

-->
Ard Biesheuvel (1):
  efi: Prevent GICv3 WARN() by mapping the memreserve table before first use


 drivers/firmware/efi/efi.c | 36 ++--
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index fad7c62cfc0e..415849bab233 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -969,13 +969,33 @@ bool efi_is_table_address(unsigned long phys_addr)
 static DEFINE_SPINLOCK(efi_mem_reserve_persistent_lock);
 static struct linux_efi_memreserve *efi_memreserve_root __ro_after_init;
 
-int efi_mem_reserve_persistent(phys_addr_t addr, u64 size)
+static int __init efi_memreserve_map_root(void)
+{
+   if (efi.mem_reserve == EFI_INVALID_TABLE_ADDR)
+   return -ENODEV;
+
+   efi_memreserve_root = memremap(efi.mem_reserve,
+  sizeof(*efi_memreserve_root),
+  MEMREMAP_WB);
+   if (WARN_ON_ONCE(!efi_memreserve_root))
+   return -ENOMEM;
+   return 0;
+}
+
+int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size)
 {
struct linux_efi_memreserve *rsv;
+   int rc;
 
-   if (!efi_memreserve_root)
+   if (efi_memreserve_root == (void *)ULONG_MAX)
return -ENODEV;
 
+   if (!efi_memreserve_root) {
+   rc = efi_memreserve_map_root();
+   if (rc)
+   return rc;
+   }
+
rsv = kmalloc(sizeof(*rsv), GFP_ATOMIC);
if (!rsv)
return -ENOMEM;
@@ -993,14 +1013,10 @@ int efi_mem_reserve_persistent(phys_addr_t addr, u64 
size)
 
 static int __init efi_memreserve_root_init(void)
 {
-   if (efi.mem_reserve == EFI_INVALID_TABLE_ADDR)
-   return -ENODEV;
-
-   efi_memreserve_root = memremap(efi.mem_reserve,
-  sizeof(*efi_memreserve_root),
-  MEMREMAP_WB);
-   if (!efi_memreserve_root)
-   return -ENOMEM;
+   if (efi_memreserve_root)
+   return 0;
+   if (efi_memreserve_map_root())
+   efi_memreserve_root = (void *)ULONG_MAX;
return 0;
 }
 early_initcall(efi_memreserve_root_init);


[GIT PULL] objtool fixes

2018-11-29 Thread Ingo Molnar
Linus,

Please pull the latest core-urgent-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
core-urgent-for-linus

   # HEAD: 22566c1603030f0a036ad564634b064ad1a55db2 objtool: Fix segfault in 
.cold detection with -ffunction-sections

Two fixes for boundary conditions.

 Thanks,

Ingo

-->
Artem Savkov (2):
  objtool: Fix double-free in .cold detection error path
  objtool: Fix segfault in .cold detection with -ffunction-sections


 tools/objtool/elf.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index 6dbb9fae0f9d..b8f3cca8e58b 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -31,6 +31,8 @@
 #include "elf.h"
 #include "warn.h"
 
+#define MAX_NAME_LEN 128
+
 struct section *find_section_by_name(struct elf *elf, const char *name)
 {
struct section *sec;
@@ -298,6 +300,8 @@ static int read_symbols(struct elf *elf)
/* Create parent/child links for any cold subfunctions */
	list_for_each_entry(sec, &elf->sections, list) {
		list_for_each_entry(sym, &sec->symbol_list, list) {
+   char pname[MAX_NAME_LEN + 1];
+   size_t pnamelen;
if (sym->type != STT_FUNC)
continue;
sym->pfunc = sym->cfunc = sym;
@@ -305,14 +309,21 @@ static int read_symbols(struct elf *elf)
if (!coldstr)
continue;
 
-   coldstr[0] = '\0';
-   pfunc = find_symbol_by_name(elf, sym->name);
-   coldstr[0] = '.';
+   pnamelen = coldstr - sym->name;
+   if (pnamelen > MAX_NAME_LEN) {
+   WARN("%s(): parent function name exceeds 
maximum length of %d characters",
+sym->name, MAX_NAME_LEN);
+   return -1;
+   }
+
+   strncpy(pname, sym->name, pnamelen);
+   pname[pnamelen] = '\0';
+   pfunc = find_symbol_by_name(elf, pname);
 
if (!pfunc) {
WARN("%s(): can't find parent function",
 sym->name);
-   goto err;
+   return -1;
}
 
sym->pfunc = pfunc;


Re: [PATCH v4 4/4] perf report: Documentation average IPC and IPC coverage

2018-11-29 Thread Ingo Molnar


* Jin Yao  wrote:

> Add explanations for new columns "IPC" and "IPC coverage" in perf
> documentation.
> 
> Signed-off-by: Jin Yao 
> ---
>  tools/perf/Documentation/perf-report.txt | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/tools/perf/Documentation/perf-report.txt 
> b/tools/perf/Documentation/perf-report.txt
> index 474a494..e5a32f3 100644
> --- a/tools/perf/Documentation/perf-report.txt
> +++ b/tools/perf/Documentation/perf-report.txt
> @@ -126,6 +126,14 @@ OPTIONS
>   And default sort keys are changed to comm, dso_from, symbol_from, dso_to
>   and symbol_to, see '--branch-stack'.
>  
> + When the sort key symbol is specified, columns "IPC" and "IPC Coverage"
> + are enabled automatically. Column "IPC" reports the average IPC per 
> function
> + and column "IPC coverage" reports the percentage of instructions with
> + sampled IPC in this function. IPC means Instruction Per Cycle. If it's 
> low,
> + it indicates there may be performance bottleneck when the function is
> + executed, such as, memory access bottleneck. If a function has high 
> overhead
> + and low IPC, it's worth further analysis for performance optimization.

Thank you for adding this!

Just a few small nits:

s/may be performance bottleneck
 /may be a performance bottleneck

s/such as, memory access bottleneck
 /such as a memory access bottleneck

s/it's worth further analysis for performance optimization.
 /it's worth further analyzing it to optimize its performance.

?

Other than that:

Reviewed-by: Ingo Molnar 

Ingo


Re: [PATCH v3 0/3] perf report/annotate: Support average IPC and IPC coverage for function

2018-11-28 Thread Ingo Molnar


* Jin Yao  wrote:

> Add supporting of displaying the average IPC and IPC coverage
> percentage per function.
> 
> For example,
> 
> $ perf record -b ...
> $ perf report -s symbol or
>   perf report -s symbol --stdio
> 
> Overhead  Symbol   IPC   [IPC Coverage]
>   39.60%  [.] __random 2.30  [ 54.8%]
>   18.02%  [.] main 0.43  [ 54.3%]
>   14.21%  [.] compute_flag 2.29  [100.0%]
>   14.16%  [.] rand 0.36  [100.0%]
>7.06%  [.] __random_r   2.57  [ 70.5%]
>6.85%  [.] rand@plt 0.00  [  0.0%]
>   ...
> 
> $ perf annotate --stdio2
> 
> Percent  IPC Cycle (Average IPC: 2.30, IPC Coverage: 54.8%)
> 
> Disassembly of section .text:
> 
> 0003aac0 :
>   8.32  3.28  sub$0x18,%rsp
> 3.28  mov$0x1,%esi
> 3.28  xor%eax,%eax
> 3.28  cmpl   
> $0x0,argp_program_version_hook@@GLIBC_2.2.5+0x1e0
>  11.57  3.28 1  ↓ je 20
>   lock   cmpxchg %esi,__abort_msg@@GLIBC_PRIVATE+0x8a0
> ↓ jne29
> ↓ jmp43
>  11.57  1.1020:   cmpxchg %esi,__abort_msg@@GLIBC_PRIVATE+0x8a0

That's a nice feature: please add meaningful documentation, accessible 
via the perf help system preferably, that outlines how the IPC metrics 
should be interpreted and how they are useful when optimizing programs.

Thanks,

Ingo


Re: [tip:locking/core] locking/atomics: Check generated headers are up-to-date

2018-11-28 Thread Ingo Molnar


* Mark Rutland  wrote:

> > Could we please get this fixed so that proper dependencies are checked 
> > and it's only regenerated when needed? This slowdown makes additive-build 
> > kernel development quite painful, as ~5 seconds is in the 'too long' 
> > category already, while 1.2 seconds is basically instantaneous.
> 
> Just to check, are we happy to eat the full cost for the first build of a
> pristine tree?

No, not happy to add 3-4 seconds to a full build that usually takes less 
than 60 seconds. This stuff isn't parallelized nor particularly well 
optimized it appears.

This *must* get faster.

> One reason we do the check rather than (re-)generating the headers is 
> that Linus requested [1] the generated header be committed so that they 
> show up in git grep, but it looks like he was happy to be flexible on 
> that.

I think the generated headers should be part of the commit space, the 
grepping is important.

> If we're happy to not commit in the generated headers, and if we're happy to
> pay the cost for a pristine tree, that's fairly straightforward to do.
> Otherwise, this has to be an optional check.

Or faster code, or a different concept!

Thanks,

Ingo


Re: [patch V2 00/28] x86/speculation: Remedy the STIBP/IBPB overhead

2018-11-26 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> Thats hopefully the final version of this. Changes since V1:
> 
>   - Renamed the command line option and related code to spectre_v2_user= as
> suggested by Josh.
> 
>   - Thought more about the back to back optimization and finally left the
> IBPB code in switch_mm().
> 
> It still removes the ptrace check for the always IBPB case. That's
> substantial overhead for dubious value now that the default is
> conditional (prctl/seccomp) IBPB.
> 
>   - Added two options which allow conditional STIBP and IBPB always mode.
> 
>   - Addressed the review comments

With the typo review feedback outlined in the discussions:

Reviewed-by: Ingo Molnar 

Thanks,

Ingo


Re: [patch V2 21/28] x86/speculation: Prepare for conditional IBPB in switch_mm()

2018-11-26 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> On Sun, 25 Nov 2018, Andy Lutomirski wrote:
> > > On Nov 25, 2018, at 2:20 PM, Thomas Gleixner  wrote:
> > > On Sun, 25 Nov 2018, Andi Kleen wrote:
> > > 
> > >>> The current check whether two tasks belong to the same context is using 
> > >>> the
> > >>> tasks context id. While correct, it's simpler to use the mm pointer 
> > >>> because
> > >>> it allows to mangle the TIF_SPEC_IB bit into it. The context id based
> > >>> mechanism requires extra storage, which creates worse code.
> > >> 
> > >> [We tried similar in some really early versions, but it was replaced
> > >> with the context id later.]
> > >> 
> > >> One issue with using the pointer is that the pointer can be reused
> > >> when the original mm_struct is freed, and then gets reallocated
> > >> immediately to an attacker. Then the attacker may avoid the IBPB.
> > >> 
> > >> Given it's probably hard to generate any reasonable leak bandwidth with
> > >> such a complex scenario, but it still seemed better to close the hole.
> > > 
> > > Sorry, but that's really a purely academic exercise. 
> > 
> > I would guess that it’s actually very easy to force mm_struct* reuse.
> > Don’t the various allocators try to allocate hot memory?  There’s nothing
> > hotter than a just-freed allocation of the same size.
> 
> Sure, but this is about an indirect branch predictor attack against
> something which reuses the mm.
> 
> So you'd need to pull off:
> 
>P1 poisons branch predictor
>P1 exit
> 
>P2 starts and reuses mm(P1) and uses the poisoned branch predictor
> 
> the only thing between P1 and P2 is either idle or some other kernel
> thread, but no other user task. If that happens then the code would not
> issue IBPB as it assumes to switch back to the same process.
> 
> Even if you can pull that off the speculation would hit the startup code of
> P2, which is hardly a source of secret information. Creating a valuable
> attack based on mm reuse is really less probable than a lottery jackpot.
> 
> So using mm is really good enough and results in better assembly code which
> is surely more valuable than addressing some hypothetical hole.

OTOH we could probably close even this with very little cost if we added 
an IBPB to non-threaded fork() and vfork()+exec() paths? Those are really 
slow paths compared to all the context switch paths we are trying to 
optimize here.

Alternatively we could IBPB on the post-exit() final task struct freeing, 
which too is a relative slow path compared to the context switch paths.

But no strong opinion.

Thanks,

Ingo


Re: [patch V2 27/28] x86/speculation: Add seccomp Spectre v2 user space protection mode

2018-11-26 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> On Sun, 25 Nov 2018, Linus Torvalds wrote:
> 
> > [ You forgot to fix your quilt setup.. ]
> 
> Duh. Should have pinned that package.
> 
> > On Sun, 25 Nov 2018, Thomas Gleixner wrote:
> > >
> > > The mitigation guide documents how STIBP works:
> > >
> > >Setting bit 1 (STIBP) of the IA32_SPEC_CTRL MSR on a logical processor
> > >prevents the predicted targets of indirect branches on any logical
> > >processor of that core from being controlled by software that executes
> > >(or executed previously) on another logical processor of the same core.
> > 
> > Can we please just fix this stupid lie?
> 
> Well, it's not a lie. The above is correct, it just does not tell WHY this
> works.

Well, it's a "technically correct but misleading" phrase, which has much 
more of the effects of an actual "lie" than that of a true description of 
it.

I.e. in terms of what effects it's likely going to have on readers not 
aware of the underlying mechanics it's much more correct to call it a 
"lie" than to call it "truth" - which I think is at the core of Linus's 
argument.

> > Yes, Intel calls it "STIBP" and tries to make it out to be about the 
> > indirect branch predictor being per-SMT thread.
> > 
> > But the reason it is unacceptable is apparently because in reality it 
> > just disables indirect branch prediction entirely. So yes, 
> > *technically* it's true that that limits indirect branch prediction 
> > to just a single SMT core, but in reality it is just a "go really 
> > slow" mode.
> 
> Indeed. Just checked the documentation again, it's also not clear 
> whether IBPB is required if STIBP is in use.

So I think we should clarify all this.

Thanks,

Ingo


Re: [patch 20/24] x86/speculation: Split out TIF update

2018-11-22 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> On Thu, 22 Nov 2018, Ingo Molnar wrote:
> > * Thomas Gleixner  wrote:
> > 
> > Had to read this twice, because the comment and the code are both correct 
> > but deal with the inverse case. This might have helped:
> > 
> > /*
> >  * Immediately update the speculation MSRs on the current task,
> >  * but for non-current tasks delay setting the CPU mitigation 
> >  * until it is scheduled next.
> >  */
> > if (tsk == current && update)
> > speculation_ctrl_update_current();
> > 
> > But can the target task ever be non-current here? I don't think so: the 
> > two callers are prctl and seccomp, and both are passing in the current 
> > task pointer.
> 
> See te other mail. Yes, seccomp passes non current task pointers. Will
> update the comment.

Ignore my previous mail :-)

Thanks,

Ingo


Re: [patch 20/24] x86/speculation: Split out TIF update

2018-11-22 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> On Wed, 21 Nov 2018, Tim Chen wrote:
> 
> > On Wed, Nov 21, 2018 at 09:14:50PM +0100, Thomas Gleixner wrote:
> > > +static void task_update_spec_tif(struct task_struct *tsk, int tifbit, 
> > > bool on)
> > >  {
> > >   bool update;
> > >
> > > + if (on)
> > > + update = !test_and_set_tsk_thread_flag(tsk, tifbit);
> > > + else
> > > + update = test_and_clear_tsk_thread_flag(tsk, tifbit);
> > > +
> > > + /*
> > > +  * If being set on non-current task, delay setting the CPU
> > > +  * mitigation until it is scheduled next.
> > > +  */
> > > + if (tsk == current && update)
> > > + speculation_ctrl_update_current();
> > 
> > I think all the call paths from prctl and seccomp coming here
> > have tsk == current.
> 
> We had that discussion before with SSBD:
> 
> seccomp_set_mode_filter()
>seccomp_attach_filter()
>   seccomp_sync_threads()
>  for_each_thread(t)
>   if (t == current)
>   continue;
>   seccomp_assign_mode(t)
> arch_seccomp_spec_mitigate(t);
> 
> seccomp_assign_mode(current...)
>   arch_seccomp_spec_mitigate();
> 
> > But if task_update_spec_tif gets used in the future where tsk is running
> > on a remote CPU, this could lead to the MSR getting out of sync with the
> > running task's TIF flag. This will break either performance or security.
> 
> We also had that discussion with SSBD and decided that we won't chase
> threads and send IPIs around. Yes, it's not perfect, but not the end of the
> world either. For PRCTL it's a non issue.

Fair enough and agreed - but please add a comment for all this, as it's a 
non-trivial and rare call context and a non-trivial implementation 
trade-off as a result.

Thanks,

Ingo


Re: [PATCH] uprobes: Fix handle_swbp() vs. unregister() + register() race once more

2018-11-22 Thread Ingo Molnar


* Oleg Nesterov  wrote:

> On 11/22, Oleg Nesterov wrote:
> > On 11/22, Andrea Parri wrote:
> > >
> > > Commit 142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() +
> > > register() race") added the UPROBE_COPY_INSN flag, and corresponding
> > > smp_wmb() and smp_rmb() memory barriers, to ensure that handle_swbp()
> > > uses fully-initialized uprobes only.
> > > 
> > > However, the smp_rmb() is mis-placed: this barrier should be placed
> > > after handle_swbp() has tested for the flag, thus guaranteeing that
> > > (program-order) subsequent loads from the uprobe can see the initial
> > > stores performed by prepare_uprobe().
> > > 
> > > Move the smp_rmb() accordingly.  Also amend the comments associated
> > > to the two memory barriers to indicate their actual locations.
> > > 
> > > Signed-off-by: Andrea Parri 
> > > Cc: Peter Zijlstra 
> > > Cc: Ingo Molnar 
> > > Cc: Arnaldo Carvalho de Melo 
> > > Cc: Alexander Shishkin 
> > > Cc: Jiri Olsa 
> > > Cc: Namhyung Kim 
> > > Cc: Oleg Nesterov 
> > > Cc: sta...@kernel.org
> > > Fixes: 142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + 
> > > register() race")
> > 
> > Thanks,
> > 
> > Acked-by: Oleg Nesterov 
> 
> Yes, but I am not sure this is the -stable material...

So I left the Cc: stable tag intact, because this is a really low-risk 
fix (it just moves barriers around), and clearly fixes a bug that people 
might or might not have observed.

Even if they observed it the race is probably very hard to reproduce and 
almost impossible to report - so we are better off propagating this fix 
to -stable, as there's no realistic actionable way for users to actually 
complain about the bug if it affects them.

That's the general backporting policy for race fixes, unless they are 
really, really intrusive - which this one isn't really.

Thanks,

Ingo


Re: [tip:x86/cleanups] x86/headers: Fix -Wmissing-prototypes warning

2018-11-22 Thread Ingo Molnar


* tip-bot for Yi Wang  wrote:

> Commit-ID:  d37904c5b14317a2c76efec6b9e4dbcaa17380e5
> Gitweb: 
> https://git.kernel.org/tip/d37904c5b14317a2c76efec6b9e4dbcaa17380e5
> Author: Yi Wang 
> AuthorDate: Thu, 22 Nov 2018 10:04:09 +0800
> Committer:  Ingo Molnar 
> CommitDate: Thu, 22 Nov 2018 09:52:28 +0100
> 
> x86/headers: Fix -Wmissing-prototypes warning


>  #endif /* _ASM_X86_CRASH_H */
> diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> index eea40d52ca78..063f1d4d698e 100644
> --- a/arch/x86/include/asm/efi.h
> +++ b/arch/x86/include/asm/efi.h
> @@ -173,8 +173,6 @@ static inline bool efi_runtime_supported(void)
>  extern struct console early_efi_console;
>  extern void parse_efi_setup(u64 phys_addr, u32 data_len);
>  
> -extern void efifb_setup_from_dmi(struct screen_info *si, const char *opt);
> -
>  #ifdef CONFIG_EFI_MIXED
>  extern void efi_thunk_runtime_setup(void);
>  extern efi_status_t efi_thunk_set_virtual_address_map(

> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 845174e113ce..890c4cb37502 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -1053,6 +1053,8 @@ extern struct kobject *efi_kobj;
>  extern int efi_reboot_quirk_mode;
>  extern bool efi_poweroff_required(void);
>  
> +extern void efifb_setup_from_dmi(struct screen_info *si, const char *opt);
> +
>  #ifdef CONFIG_EFI_FAKE_MEMMAP
>  extern void __init efi_fake_memmap(void);
>  #else

These efifb_setup_from_dmi() changes were completely bogus and broke 
every non-x86 EFI architecture, which a simple:

  git grep efifb_setup_from_dmi

should have already exposed.

Did you actually check the usage of every single prototype and think 
about them case by case? It doesn't seem to be the case.

Thanks,

Ingo


Re: [PATCH v5] x86/fsgsbase/64: Fix the base write helper functions

2018-11-22 Thread Ingo Molnar


* Andy Lutomirski  wrote:

> On Fri, Nov 16, 2018 at 3:27 PM Chang S. Bae  wrote:
> >
> > The helper functions that purport to write the base should just write it
> > only. It shouldn't have magic optimizations to change the index.
> >
> > Make the index explicitly changed from the caller, instead of including
> > the code in the helpers.
> >
> > Subsequently, the task write helpers do not handle for the current task
> > anymore. The range check for a base value is also factored out, to
> > minimize code redundancy from the caller.
> >
> > v2: Fix further on the task write functions. Revert the changes on the
> > task read helpers.
> >
> > v3: Fix putreg(). Edit the changelog.
> >
> > v4: Update the task write helper functions and do_arch_prctl_64(). Fix
> > the comment in putreg().
> >
> > v5: Fix preempt_disable() calls in do_arch_prctl_64()
> 
> Reviewed-by: Andy Lutomirski 
> 
> Ingo, Thomas: can we get this in x86/urgent, please?

Sadly this commit introduced a boot failure on both an Intel and an AMD 
64-bit testbox.

Symptoms range from silent bootup hang in early userspace to segfaults 
like this:

[   21.885741] random: systemd: uninitialized urandom read (16 bytes read)
[   21.964778] systemd[1]: segfault at 28 ip 5584d8d8247d sp 
7ffc7a05aed0 error 4 in systemd[5584d8d0d000+137000]
[   21.977664] Code: c3 4c 89 ff e8 94 78 fa ff eb bb 48 89 c3 eb f1 00 00 00 
00 00 00 00 00 00 00 00 00 00 41 55 41 54 55 53 48 89 fd 48 83 ec 28 <64> 48 8b 
04 25 28 00 00 00 48 89 44 24 18 31 c0 48 85 ff 74 6e 48
[   22.04] systemd[1]: segfault at 28 ip 5584d8db0a3d sp 
7ffc7a05a7e0 error 4 in systemd[5584d8d0d000+137000]
[   22.012869] Code: 49 89 e9 ba 67 01 00 00 bf 04 00 00 00 31 c0 e8 c9 1c 03 
00 59 31 c0 5e e9 ff fa ff ff 41 54 55 53 89 fb 48 81 ec 40 01 00 00 <64> 48 8b 
04 25 28 00 00 00 48 89 84 24 38 01 00 00 31 c0 e8 fb 92

I've zapped the commit from x86/urgent because it's clearly not ready 
yet.

I used a fairly regular distro .config and a fairly regular distro - 
nothing fancy.

Thanks,

Ingo


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-22 Thread Ingo Molnar


* Ingo Molnar  wrote:

> So I dug into this some more:
> 
> 1)
> 
> Firstly I tracked down GCC bloating the might_fault() checks and the 
> related out-of-line code exception handling which bloats the full 
> generated function.

Sorry, I mis-remembered that detail when I wrote the email: it was 
CONFIG_HARDENED_USERCOPY=y and its object size checks that distros enable 
- and I disabled that option to simplify the size analysis.

(might_fault() doesn't have inline conditionals so shouldn't have any 
effect on the generated code.)

Thanks,

Ingo


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-22 Thread Ingo Molnar


* Ingo Molnar  wrote:

> The kernel text size reduction with Jens's patch is small but real:
> 
>      text     data      bss      dec     hex filename
>  19572694 11516934 19873888 50963516 309a43c vmlinux.before
>  19572468 11516934 19873888 50963290 309a35a vmlinux.after
> 
> But I checked the disassembly, and it's not a real win, the new code is 
> actually more complex than the old one, as expected, but GCC (7.3.0) does 
> some particularly stupid things which bloats the generated code.

So I dug into this some more:

1)

Firstly I tracked down GCC bloating the might_fault() checks and the 
related out-of-line code exception handling which bloats the full 
generated function.

2)

But with even that complication eliminated, there's a size reduction when 
Jens's patch is applied, which is puzzling:

 19563640 11516790 19882080 50962510 309a04e vmlinux.before
 19563274 11516790 19882080 50962144 3099ee0 vmlinux.after

but this is entirely due to the .altinstructions section being counted as 
'text' part of the vmlinux - while in reality it's not:

3)

The _real_ part of the vmlinux gets bloated by Jens's patch:

 8100 <_stext>:

 before:  81b0e5e0 <__clear_user>
 after:   81b0e670 <__clear_user>:

I.e. we get a e5e0 => e670 bloat, as expected.

In the config I tested a later section of the kernel image first aligns 
away the bloat:

 before: 82fa6321 <.altinstr_aux>:
 after:  82fa6321 <.altinstr_aux>:

and then artificially debloats the modified kernel via the 
altinstructions section:

  before: Disassembly of section .exit.text: 83160798 

  after:  Disassembly of section .exit.text: 83160608 


Note that there's a third level of obfuscation here: Jens's patch actually 
*adds* a new altinstructions statement:

+   /*
+* For smaller copies, don't use ERMS as it's slower.
+*/
+   if (len < 128) {
+   alternative_call(copy_user_generic_unrolled,
+copy_user_generic_string, X86_FEATURE_REP_GOOD,
+ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
+"=d" (len)),
+"1" (to), "2" (from), "3" (len)
+: "memory", "rcx", "r8", "r9", "r10", "r11");
+   return ret;
+   }
+
/*
 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
 * Otherwise, use copy_user_generic_unrolled.
 */
alternative_call_2(copy_user_generic_unrolled,
-copy_user_generic_string,
-X86_FEATURE_REP_GOOD,
-copy_user_enhanced_fast_string,
-X86_FEATURE_ERMS,
+copy_user_generic_string, X86_FEATURE_REP_GOOD,
+copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
 ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
 "=d" (len)),
 "1" (to), "2" (from), "3" (len)

So how can this change possibly result in a *small* altinstructions 
section?

4)

The reason is GCC's somewhat broken __builtin_constant_p() logic, which 
leaves ~10% of the constant call sites actually active, but which are 
then optimized by GCC's later stages, and the alternative_call_2() gets 
optimized out and replaced with the alternative_call() call.

This is where Jens's patch 'debloats' the vmlinux and confuses the 'size' 
utility and gains its code reduction street cred.

Note to self: watch out for patches that change altinstructions and don't 
make premature vmlinux size impact assumptions. :-)

Thanks,

Ingo


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-22 Thread Ingo Molnar


* Linus Torvalds  wrote:

> On Wed, Nov 21, 2018 at 10:16 AM Linus Torvalds
>  wrote:
> >
> > It might be interesting to just change raw_copy_to/from_user() to
> > handle a lot more cases (in particular, handle cases where 'size' is
> > 8-byte aligned). The special cases we *do* have may not be the right
> > ones (the 10-byte case in particular looks odd).
> >
> > For example, instead of having a "if constant size is 8 bytes, do one
> > get/put_user()" case, we might have a "if constant size is < 64 just
> > unroll it into get/put_user()" calls.
> 
> Actually, x86 doesn't even set INLINE_COPY_TO_USER, so I don't think
> the constant size cases ever trigger at all the way they are set up
> now.

Side note, there's one artifact the patch exposes: some of the 
__builtin_constant_p() checks are imprecise and don't trigger at the 
early stage where GCC checks them, but the length is actually known to 
the compiler at later optimization stages.

This means that with Jens's patch some of the length checks go away. I 
checked x86-64 defconfig and a distro config, and the numbers were ~7% 
and 10%, so not a big effect.

The kernel text size reduction with Jens's patch is small but real:

     text     data      bss      dec     hex filename
 19572694 11516934 19873888 50963516 309a43c vmlinux.before
 19572468 11516934 19873888 50963290 309a35a vmlinux.after

But I checked the disassembly, and it's not a real win, the new code is 
actually more complex than the old one, as expected, but GCC (7.3.0) does 
some particularly stupid things which bloats the generated code.

> I do have a random patch that makes "unsafe_put_user()" actually use
> "asm goto" for the error case, and that, together with the attached
> patch seems to generate fairly nice code, but even then it would
> depend on gcc actually unrolling things (which we do *not* want in
> general).
> 
> But for a 32-byte user copy (cp_old_stat), and that
> INLINE_COPY_TO_USER, it generates this:
> 
> stac
> movl$32, %edx   #, size
> movq%rsp, %rax  #, src
> .L201:
> movq(%rax), %rcx# MEM[base: src_155, offset: 0B],
> MEM[base: src_155, offset: 0B]
> 1:  movq %rcx,0(%rbp)   # MEM[base: src_155, offset: 0B],
> MEM[(struct __large_struct *)dst_156]
> ASM_EXTABLE_HANDLE from=1b to=.L200 handler="ex_handler_uaccess"#
> 
> addq$8, %rax#, src
> addq$8, %rbp#, statbuf
> subq$8, %rdx#, size
> jne .L201   #,
> clac
> 
> which is actually fairly close to "optimal".

Yeah, that looks pretty sweet!

> Random patch (with my "asm goto" hack included) attached, in case
> people want to play with it.

Doesn't even look all that hacky to me. Any hack in it that I didn't 
notice? :-)

The only question is the inlining overhead - will try to measure that.

> Impressively, it actually removes more lines of code than it adds. But
> I didn't actually check whether the end result *works*, so hey..

Most of the linecount reduction appears to come from the simplification 
of the unroll loop and moving it into C, from a 6-way hard-coded copy 
routine:

> - switch (size) {
> - case 1:
> - case 2:
> - case 4:
> - case 8:
> - case 10:
> - case 16:

to a more flexible 4-way loop unrolling:

> + while (size >= sizeof(unsigned long)) {
> + while (size >= sizeof(unsigned int)) {
> + while (size >= sizeof(unsigned short)) {
> + while (size >= sizeof(unsigned char)) {

Which is a nice improvement in itself.

> + user_access_begin();
> + if (dirent)
> + unsafe_put_user(offset, &dirent->d_off, efault_end);
>   dirent = buf->current_dir;
> + unsafe_put_user(d_ino, &dirent->d_ino, efault_end);
> + unsafe_put_user(reclen, &dirent->d_reclen, efault_end);
> + unsafe_put_user(0, dirent->d_name + namlen, efault_end);
> + unsafe_put_user(d_type, (char __user *) dirent + reclen - 1, 
> efault_end);
> + user_access_end();
> +
>   if (copy_to_user(dirent->d_name, name, namlen))
>   goto efault;
>   buf->previous = dirent;
>   dirent = (void __user *)dirent + reclen;
>   buf->current_dir = dirent;
>   buf->count -= reclen;
>   return 0;
> +efault_end:
> + user_access_end();
>  efault:
>   buf->error = -EFAULT;
>   return -EFAULT;

In terms of high level APIs, could we perhaps use the opportunity to 
introduce unsafe_write_user() instead, which would allow us to write it 
as:

unsafe_write_user(&dirent->d_ino, d_ino, efault_end);
unsafe_write_user(&dirent->d_reclen, reclen, efault_end);
unsafe_write_user(dirent->d_name + namlen, 0, efault_end);
unsafe_write_user((char __user *)dirent + reclen - 1, d_type, efault_end);

if (copy_to_user(dirent->d_name, name, namlen))
goto efault;

This gives it the regular 'VAR = VAL;' notation of C assignments.

[PATCH 6/5] x86/fault: Clean up the page fault oops decoder a bit

2018-11-22 Thread Ingo Molnar


* Andy Lutomirski  wrote:

> One of Linus' favorite hobbies seems to be looking at OOPSes and
> decoding the error code in his head.  This is not one of my favorite
> hobbies :)
> 
> Teach the page fault OOPS hander to decode the error code.  If it's
> a !USER fault from user mode, print an explicit note to that effect
> and print out the addresses of various tables that might cause such
> an error.
> 
> With this patch applied, if I intentionally point the LDT at 0x0 and
> run the x86 selftests, I get:
> 
> BUG: unable to handle kernel NULL pointer dereference at 
> HW error: normal kernel read fault
> This was a system access from user code
> IDT: 0xfe00 (limit=0xfff) GDT: 0xfe001000 (limit=0x7f)
> LDTR: 0x50 -- base=0x0 limit=0xfff7
> TR: 0x40 -- base=0xfe003000 limit=0x206f
> PGD 8456e067 P4D 8456e067 PUD 4623067 PMD 0
> SMP PTI
> CPU: 0 PID: 153 Comm: ldt_gdt_64 Not tainted 4.19.0+ #1317
> Hardware name: ...
> RIP: 0033:0x401454

I've applied your series, with one small edit, the following message:

  > HW error: normal kernel read fault

will IMHO confuse the heck out of users, thinking that their hardware is 
broken...

Yes, the message is accurate, in MM pagefault language it's indeed the HW 
error code, but it's a language very few people speak.

So I edited it over to say '#PF error code'. I also applied a few other 
minor cleanups - see the changelog below.

Let me know if you have any objections.

Thanks,

Ingo

===>
From a2aa52ab16efbee40ad118ebac4a5e438f5b43ee Mon Sep 17 00:00:00 2001
From: Ingo Molnar 
Date: Thu, 22 Nov 2018 09:34:03 +0100
Subject: [PATCH] x86/fault: Clean up the page fault oops decoder a bit

 - Make the oops messages a bit less scary (don't mention 'HW errors')

 - Turn 'PROT USER' (which is visually easily confused with PROT_USER)
   into individual bit descriptors: "[PROT] [USER]".
   This also makes "[normal kernel read fault]" more apparent.

 - De-abbreviate variables to make the code easier to read

 - Use vertical alignment where appropriate.

 - Add comment about string size limits and the helper function.

 - Remove unnecessary line breaks.

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Yu-cheng Yu 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/mm/fault.c | 38 +++---
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f5efbdba2b6d..2ff25ad33233 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -603,10 +603,13 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index)
 name, index, addr, (desc.limit0 | (desc.limit1 << 16)));
 }
 
-static void errstr(unsigned long ec, char *buf, unsigned long mask,
-  const char *txt)
+/*
+ * This helper function transforms the #PF error_code bits into
+ * "[PROT] [USER]" type of descriptive, almost human-readable error strings:
+ */
+static void err_str_append(unsigned long error_code, char *buf, unsigned long mask, const char *txt)
 {
-   if (ec & mask) {
+   if (error_code & mask) {
if (buf[0])
strcat(buf, " ");
strcat(buf, txt);
@@ -614,10 +617,9 @@ static void errstr(unsigned long ec, char *buf, unsigned long mask,
 }
 
 static void
-show_fault_oops(struct pt_regs *regs, unsigned long error_code,
-   unsigned long address)
+show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 {
-   char errtxt[64];
+   char err_txt[64];
 
if (!oops_may_print())
return;
@@ -646,15 +648,21 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code,
 address < PAGE_SIZE ? "NULL pointer dereference" : "paging request",
 (void *)address);
 
-   errtxt[0] = 0;
-   errstr(error_code, errtxt, X86_PF_PROT, "PROT");
-   errstr(error_code, errtxt, X86_PF_WRITE, "WRITE");
-   errstr(error_code, errtxt, X86_PF_USER, "USER");
-   errstr(error_code, errtxt, X86_PF_RSVD, "RSVD");
-   errstr(error_code, errtxt, X86_PF_INSTR, "INSTR");
-   errstr(error_code, errtxt, X86_PF_PK, "PK");
-	pr_alert("HW error: %s\n", error_code ? errtxt :
-		 "normal kernel read fault");
+   err_txt[0] = 0;
+
+   /*
+	 * Note: length of these appended strings including the separation space and the
+* zero delimiter must fit into err_txt[].
+*/
+   err_str_append(error_code, err_

Re: [patch 14/24] x86/speculation: Unify conditional spectre v2 print functions

2018-11-21 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> There is no point in having two functions and a conditional at the call
> site.
> 
> Signed-off-by: Thomas Gleixner 

patches 1-14:

  Reviewed-by: Ingo Molnar 

15-24 look good to me too, modulo the (mostly trivial) feedback I gave.

Thanks,

Ingo


Re: [patch 16/24] x86/speculation: Prepare for per task indirect branch speculation control

2018-11-21 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> From: Tim Chen 
> 
> To avoid the overhead of STIBP always on, it's necessary to allow per task
> control of STIBP.
> 
> Add a new task flag TIF_SPEC_IB and evaluate it during context switch if
> SMT is active and flag evaluation is enabled by the speculation control
> code. Add the conditional evaluation to x86_virt_spec_ctrl() as well so the
> guest/host switch works properly.
> 
> This has no effect because TIF_SPEC_IB cannot be set yet and the static key
> which controls evaluation is off. Preparatory patch for adding the control
> code.
> 
> [ tglx: Simplify the context switch logic and make the TIF evaluation
>   depend on SMP=y and on the static key controlling the conditional
>   update. Rename it to TIF_SPEC_IB because it controls both STIBP and
>   IBPB ]
> 
> Signed-off-by: Tim Chen 
> Signed-off-by: Thomas Gleixner 
> 
> ---
>  arch/x86/include/asm/msr-index.h   |5 +++--
>  arch/x86/include/asm/spec-ctrl.h   |   12 
>  arch/x86/include/asm/thread_info.h |5 -
>  arch/x86/kernel/cpu/bugs.c |4 
>  arch/x86/kernel/process.c  |   24 ++--
>  5 files changed, 45 insertions(+), 5 deletions(-)
> 
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -41,9 +41,10 @@
>  
>  #define MSR_IA32_SPEC_CTRL		0x00000048 /* Speculation Control */
>  #define SPEC_CTRL_IBRS			(1 << 0)   /* Indirect Branch Restricted Speculation */
> -#define SPEC_CTRL_STIBP			(1 << 1)   /* Single Thread Indirect Branch Predictors */
> +#define SPEC_CTRL_STIBP_SHIFT		1	   /* Single Thread Indirect Branch Predictor (STIBP) bit */
> +#define SPEC_CTRL_STIBP			(1 << SPEC_CTRL_STIBP_SHIFT)	/* STIBP mask */
>  #define SPEC_CTRL_SSBD_SHIFT		2	   /* Speculative Store Bypass Disable bit */
> -#define SPEC_CTRL_SSBD			(1 << SPEC_CTRL_SSBD_SHIFT)   /* Speculative Store Bypass Disable */
> +#define SPEC_CTRL_SSBD			(1 << SPEC_CTRL_SSBD_SHIFT)	/* Speculative Store Bypass Disable */
>  
>  #define MSR_IA32_PRED_CMD		0x00000049 /* Prediction Command */
>  #define PRED_CMD_IBPB			(1 << 0)   /* Indirect Branch Prediction Barrier */
> --- a/arch/x86/include/asm/spec-ctrl.h
> +++ b/arch/x86/include/asm/spec-ctrl.h
> @@ -53,12 +53,24 @@ static inline u64 ssbd_tif_to_spec_ctrl(
>   return (tifn & _TIF_SSBD) >> (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT);
>  }
>  
> +static inline u64 stibp_tif_to_spec_ctrl(u64 tifn)
> +{
> + BUILD_BUG_ON(TIF_SPEC_IB < SPEC_CTRL_STIBP_SHIFT);
> + return (tifn & _TIF_SPEC_IB) >> (TIF_SPEC_IB - SPEC_CTRL_STIBP_SHIFT);
> +}
> +
>  static inline unsigned long ssbd_spec_ctrl_to_tif(u64 spec_ctrl)
>  {
>   BUILD_BUG_ON(TIF_SSBD < SPEC_CTRL_SSBD_SHIFT);
>   return (spec_ctrl & SPEC_CTRL_SSBD) << (TIF_SSBD - 
> SPEC_CTRL_SSBD_SHIFT);
>  }
>  
> +static inline unsigned long stibp_spec_ctrl_to_tif(u64 spec_ctrl)
> +{
> + BUILD_BUG_ON(TIF_SPEC_IB < SPEC_CTRL_STIBP_SHIFT);
> + return (spec_ctrl & SPEC_CTRL_STIBP) << (TIF_SPEC_IB - 
> SPEC_CTRL_STIBP_SHIFT);
> +}
> +
>  static inline u64 ssbd_tif_to_amd_ls_cfg(u64 tifn)
>  {
>   return (tifn & _TIF_SSBD) ? x86_amd_ls_cfg_ssbd_mask : 0ULL;
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -83,6 +83,7 @@ struct thread_info {
>  #define TIF_SYSCALL_EMU  6   /* syscall emulation active */
>  #define TIF_SYSCALL_AUDIT7   /* syscall auditing active */
>  #define TIF_SECCOMP  8   /* secure computing */
> +#define TIF_SPEC_IB  9   /* Indirect branch speculation 
> mitigation */
>  #define TIF_USER_RETURN_NOTIFY   11  /* notify kernel of userspace 
> return */
>  #define TIF_UPROBE   12  /* breakpointed or singlestepping */
>  #define TIF_PATCH_PENDING13  /* pending live patching update */
> @@ -110,6 +111,7 @@ struct thread_info {
>  #define _TIF_SYSCALL_EMU (1 << TIF_SYSCALL_EMU)
>  #define _TIF_SYSCALL_AUDIT   (1 << TIF_SYSCALL_AUDIT)
>  #define _TIF_SECCOMP (1 << TIF_SECCOMP)
> +#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
>  #define _TIF_USER_RETURN_NOTIFY  (1 << TIF_USER_RETURN_NOTIFY)
>  #define _TIF_UPROBE  (1 << TIF_UPROBE)
>  #define _TIF_PATCH_PENDING   (1 << TIF_PATCH_PENDING)
> @@ -146,7 +148,8 @@ struct thread_info {
>  
>  /* flags to check in __switch_to() */
>  #define _TIF_WORK_CTXSW  
> \
> - (_TIF_IO_BITMAP|_TIF_NOCPUID|_TIF_NOTSC|_TIF_BLOCKSTEP|_TIF_SSBD)
> + (_TIF_IO_BITMAP|_TIF_NOCPUID|_TIF_NOTSC|_TIF_BLOCKSTEP| \
> +  _TIF_SSBD|_TIF_SPEC_IB)
>  
>  #define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY)
>  #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW)
> --- 

Re: [patch 17/24] x86/speculation: Move IBPB control out of switch_mm()

2018-11-21 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> IBPB control is currently in switch_mm() to avoid issuing IBPB when
> switching between tasks of the same process.
> 
> But that's not covering the case of sandboxed tasks which get the
> TIF_SPEC_IB flag set via seccomp. There the barrier is required when the
> potentially malicious task is switched out because the task which is
> switched in might have it not set and would still be attackable.
> 
> For tasks which mark themself with TIF_SPEC_IB via the prctl, the barrier
> needs to be when the tasks switches in because the previous one might be an
> attacker.

s/themself
 /themselves
> 
> Move the code out of switch_mm() and evaluate the TIF bit in
> switch_to(). Make it an inline function so it can be used both in 32bit and
> 64bit code.

s/32bit
 /32-bit

s/64bit
 /64-bit

> 
> This loses the optimization of switching back to the same process, but
> that's wrong in the context of seccomp anyway as it does not protect tasks
> of the same process against each other.
> 
> This could be optimized by keeping track of the last user task per cpu and
> avoiding the barrier when the task is immediately scheduled back and the
> thread inbetween was a kernel thread. It's dubious whether that'd be worth
> the extra load/store and conditional operations. Keep it optimized for the
> common case where the TIF bit is not set.

s/cpu/CPU

> 
> Signed-off-by: Thomas Gleixner 
> ---
>  arch/x86/include/asm/nospec-branch.h |2 +
>  arch/x86/include/asm/spec-ctrl.h |   46 
> +++
>  arch/x86/include/asm/tlbflush.h  |2 -
>  arch/x86/kernel/cpu/bugs.c   |   16 +++-
>  arch/x86/kernel/process_32.c |   11 ++--
>  arch/x86/kernel/process_64.c |   11 ++--
>  arch/x86/mm/tlb.c|   39 -
>  7 files changed, 81 insertions(+), 46 deletions(-)
> 
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -312,6 +312,8 @@ do {  
> \
>  } while (0)
>  
>  DECLARE_STATIC_KEY_FALSE(switch_to_cond_stibp);
> +DECLARE_STATIC_KEY_FALSE(switch_to_cond_ibpb);
> +DECLARE_STATIC_KEY_FALSE(switch_to_always_ibpb);
>  
>  #endif /* __ASSEMBLY__ */
>  
> --- a/arch/x86/include/asm/spec-ctrl.h
> +++ b/arch/x86/include/asm/spec-ctrl.h
> @@ -76,6 +76,52 @@ static inline u64 ssbd_tif_to_amd_ls_cfg
>   return (tifn & _TIF_SSBD) ? x86_amd_ls_cfg_ssbd_mask : 0ULL;
>  }
>  
> +/**
> + * switch_to_ibpb - Issue IBPB on task switch
> + * @next:Pointer to the next task
> + * @prev_tif:Threadinfo flags of the previous task
> + * @next_tif:Threadinfo flags of the next task
> + *
> + * IBPB flushes the branch predictor, which stops Spectre-v2 attacks
> + * between user space tasks. Depending on the mode the flush is made
> + * conditional.
> + */
> +static inline void switch_to_ibpb(struct task_struct *next,
> +   unsigned long prev_tif,
> +   unsigned long next_tif)
> +{
> + if (static_branch_unlikely(&switch_to_always_ibpb)) {
> + /* Only flush when switching to a user task. */
> + if (next->mm)
> + indirect_branch_prediction_barrier();
> + }
> +
> + if (static_branch_unlikely(&switch_to_cond_ibpb)) {
> + /*
> +  * Both tasks' threadinfo flags are checked for TIF_SPEC_IB.
> +  *
> +  * For an outgoing sandboxed task which has TIF_SPEC_IB set
> +  * via seccomp this is needed because it might be malicious
> +  * and the next user task switching in might not have it
> +  * set.
> +  *
> +  * For an incoming task which has set TIF_SPEC_IB itself
> +  * via prctl() this is needed because the previous user
> +  * task might be malicious and have the flag unset.
> +  *
> +  * This could be optimized by keeping track of the last
> +  * user task per cpu and avoiding the barrier when the task
> +  * is immediately scheduled back and the thread inbetween
> +  * was a kernel thread. It's dubious whether that'd be
> +  * worth the extra load/store and conditional operations.
> +  * Keep it optimized for the common case where the TIF bit
> +  * is not set.
> +  */
> + if ((prev_tif | next_tif) & _TIF_SPEC_IB)
> + indirect_branch_prediction_barrier();

s/cpu/CPU

> +
> + switch (mode) {
> + case SPECTRE_V2_APP2APP_STRICT:
> + static_branch_enable(&switch_to_always_ibpb);
> + break;
> + default:
> + break;
> + }
> +
> + pr_info("mitigation: Enabling %s Indirect Branch Prediction Barrier\n",
> +

Re: [patch 18/24] x86/speculation: Avoid __switch_to_xtra() calls

2018-11-21 Thread Ingo Molnar


* Tim Chen  wrote:

> On 11/21/2018 12:14 PM, Thomas Gleixner wrote:
> 
> > +* Avoid __switch_to_xtra() invocation when conditional stpib is
> 
> s/stpib/stibp

and:

  s/stibp/STIBP

to make it consistent throughout the patchset.

Thanks,

Ingo


Re: [patch 20/24] x86/speculation: Split out TIF update

2018-11-21 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> The update of the TIF_SSBD flag and the conditional speculation control MSR
> update is done in the ssb_prctl_set() function directly. The upcoming prctl
> support for controlling indirect branch speculation via STIBP needs the
> same mechanism.
> 
> Split the code out and make it reusable.
> 
> Signed-off-by: Thomas Gleixner 
> ---
>  arch/x86/kernel/cpu/bugs.c |   31 +++
>  1 file changed, 19 insertions(+), 12 deletions(-)
> 
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -703,10 +703,25 @@ static void ssb_select_mitigation(void)
>  #undef pr_fmt
>  #define pr_fmt(fmt) "Speculation prctl: " fmt
>  
> -static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl)
> +static void task_update_spec_tif(struct task_struct *tsk, int tifbit, bool 
> on)
>  {
>   bool update;
>  
> + if (on)
> + update = !test_and_set_tsk_thread_flag(tsk, tifbit);
> + else
> + update = test_and_clear_tsk_thread_flag(tsk, tifbit);
> +
> + /*
> +  * If being set on non-current task, delay setting the CPU
> +  * mitigation until it is scheduled next.
> +  */
> + if (tsk == current && update)
> + speculation_ctrl_update_current();

Had to read this twice, because the comment and the code are both correct 
but deal with the inverse case. This might have helped:

/*
 * Immediately update the speculation MSRs on the current task,
 * but for non-current tasks delay setting the CPU mitigation 
 * until it is scheduled next.
 */
if (tsk == current && update)
speculation_ctrl_update_current();

But can the target task ever be non-current here? I don't think so: the 
two callers are prctl and seccomp, and both are passing in the current 
task pointer.

If so then it would be nice to rename all these task variable names from 
'task' to 'curr' or such, to make this property apparent.

Then we can also remove the condition and the comment, and update 
unconditionally, and maybe add:

WARN_ON_ONCE(curr != current);

... or so?

Thanks,

Ingo


Re: [patch 21/24] x86/speculation: Prepare arch_smt_update() for PRCTL mode

2018-11-21 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> The upcoming fine grained per task STIBP control needs to be updated on CPU
> hotplug as well.
> 
> Split out the code which controls the strict mode so the prctl control code
> can be added later.
> 
> Signed-off-by: Thomas Gleixner 
> ---
>  arch/x86/kernel/cpu/bugs.c |   46 
> -
>  1 file changed, 25 insertions(+), 21 deletions(-)
> 
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -531,40 +531,44 @@ static void __init spectre_v2_select_mit
>   arch_smt_update();
>  }
>  
> -static bool stibp_needed(void)
> +static void update_stibp_msr(void *info)
>  {
> - /* Enhanced IBRS makes using STIBP unnecessary. */
> - if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
> - return false;
> -
> - /* Check for strict app2app mitigation mode */
> - return spectre_v2_app2app == SPECTRE_V2_APP2APP_STRICT;
> + wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
>  }


Does Sparse or other tooling warn about unused function parameters? If 
yes then it might make sense to mark it __used?

>  
> -static void update_stibp_msr(void *info)
> +/* Update x86_spec_ctrl_base in case SMT state changed. */
> +static void update_stibp_strict(void)
>  {
> - wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
> + u64 mask = x86_spec_ctrl_base & ~SPEC_CTRL_STIBP;
> +
> + if (sched_smt_active())
> + mask |= SPEC_CTRL_STIBP;
> +
> + if (mask == x86_spec_ctrl_base)
> + return;
> +
> + pr_info("Spectre v2 cross-process SMT mitigation: %s STIBP\n",
> + mask & SPEC_CTRL_STIBP ? "Enabling" : "Disabling");
> + x86_spec_ctrl_base = mask;
> + on_each_cpu(update_stibp_msr, NULL, 1);
>  }
>  
>  void arch_smt_update(void)
>  {
> - u64 mask;
> -
> - if (!stibp_needed())
> + /* Enhanced IBRS makes using STIBP unnecessary. No update required. */
> + if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
>   return;
>  
>   mutex_lock(&spec_ctrl_mutex);
>  
> - mask = x86_spec_ctrl_base & ~SPEC_CTRL_STIBP;
> - if (sched_smt_active())
> - mask |= SPEC_CTRL_STIBP;
> -
> - if (mask != x86_spec_ctrl_base) {
> - pr_info("Spectre v2 cross-process SMT mitigation: %s STIBP\n",
> - mask & SPEC_CTRL_STIBP ? "Enabling" : "Disabling");
> - x86_spec_ctrl_base = mask;
> - on_each_cpu(update_stibp_msr, NULL, 1);
> + switch (spectre_v2_app2app) {
> + case SPECTRE_V2_APP2APP_NONE:
> + break;
> + case SPECTRE_V2_APP2APP_STRICT:
> + update_stibp_strict();
> + break;
>   }

So I'm wondering, shouldn't firmware_restrict_branch_speculation_start()/_end()
also enable/disable STIBP? It already enabled/disables IBRS.

Thanks,

Ingo


Re: [patch 24/24] x86/speculation: Add seccomp Spectre v2 app to app protection mode

2018-11-21 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> From: Jiri Kosina 
> 
> If 'prctl' mode of app2app protection from spectre v2 is selected on the
> kernel command-line, STIBP and IBPB are applied on tasks which restrict
> their indirect branch speculation via prctl.
> 
> SECCOMP enables the SSBD mitigation for sandboxed tasks already, so it
> makes sense to prevent spectre v2 application to application attacks as
> well.
> 
> The mitigation guide documents how STIPB works:
> 
>Setting bit 1 (STIBP) of the IA32_SPEC_CTRL MSR on a logical processor
>prevents the predicted targets of indirect branches on any logical
>processor of that core from being controlled by software that executes
>(or executed previously) on another logical processor of the same core.
> 
> Ergo setting STIBP protects the task itself from being attacked from a task
> running on a different hyper-thread and protects the tasks running on
> different hyper-threads from being attacked.
> 
> IBPB is issued when the task switches out, so malicious sandbox code cannot
> mistrain the branch predictor for the next user space task on the same
> logical processor.
> 
> Signed-off-by: Jiri Kosina 
> Signed-off-by: Thomas Gleixner 
> 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |7 +-
>  arch/x86/include/asm/nospec-branch.h|1 
>  arch/x86/kernel/cpu/bugs.c  |   27 
> +++-
>  3 files changed, 29 insertions(+), 6 deletions(-)
> 
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4228,10 +4228,15 @@
> by spectre_v2=off
>   auto- Kernel selects the mitigation depending on
> the available CPU features and vulnerability.
> -   Default is prctl.
>   prctl   - Indirect branch speculation is enabled, but
> mitigation can be enabled via prctl per 
> thread.
> The mitigation control state is inherited on 
> fork.
> + seccomp - Same as "prctl" above, but all seccomp threads
> +   will enable the mitigation unless they 
> explicitly
> +   opt out.
> +
> + Default mitigation:
> + If CONFIG_SECCOMP=y "seccomp", otherwise "prctl"
>  
>   Not specifying this option is equivalent to
>   spectre_v2_app2app=auto.
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -233,6 +233,7 @@ enum spectre_v2_app2app_mitigation {
>   SPECTRE_V2_APP2APP_NONE,
>   SPECTRE_V2_APP2APP_STRICT,
>   SPECTRE_V2_APP2APP_PRCTL,
> + SPECTRE_V2_APP2APP_SECCOMP,
>  };
>  
>  /* The Speculative Store Bypass disable variants */
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -256,12 +256,14 @@ enum spectre_v2_app2app_cmd {
>   SPECTRE_V2_APP2APP_CMD_AUTO,
>   SPECTRE_V2_APP2APP_CMD_FORCE,
>   SPECTRE_V2_APP2APP_CMD_PRCTL,
> + SPECTRE_V2_APP2APP_CMD_SECCOMP,
>  };
>  
>  static const char *spectre_v2_app2app_strings[] = {
>   [SPECTRE_V2_APP2APP_NONE]   = "App-App Vulnerable",
> -	[SPECTRE_V2_APP2APP_STRICT]	= "App-App Mitigation: STIBP protection",
> -	[SPECTRE_V2_APP2APP_PRCTL]	= "App-App Mitigation: STIBP via prctl",
> +	[SPECTRE_V2_APP2APP_STRICT]	= "App-App Mitigation: forced protection",
> +	[SPECTRE_V2_APP2APP_PRCTL]	= "App-App Mitigation: prctl opt-in",
> +	[SPECTRE_V2_APP2APP_SECCOMP]	= "App-App Mitigation: seccomp and prctl opt-in",

This description is not accurate: it's not a 'seccomp and prctl opt-in', 
the seccomp functionality is opt-out, the prctl is opt-in.

So something like:

> +	[SPECTRE_V2_APP2APP_SECCOMP]	= "App-App Mitigation: seccomp by default and prctl opt-in",

or so?

>  void arch_seccomp_spec_mitigate(struct task_struct *task)
>  {
>   if (ssb_mode == SPEC_STORE_BYPASS_SECCOMP)
>   ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE);
> + if (spectre_v2_app2app == SPECTRE_V2_APP2APP_SECCOMP)
> + indir_branch_prctl_set(task, PR_SPEC_FORCE_DISABLE);
>  }
>  #endif

Hm, so isn't arch_seccomp_spec_mitigate() called right before untrusted 
seccomp code is executed? So why are we disabling the mitigation here?

> + case SPECTRE_V2_APP2APP_SECCOMP:
> + return ", STIBP: seccomp and prctl opt-in";
> + case SPECTRE_V2_APP2APP_SECCOMP:
> + return ", IBPB: seccomp and prctl opt-in";

Same feedback wrt. potentially confusing use of 'opt-in' here, while 
seccomp is more like an opt-out mechanism.

Thanks,

Ingo


Re: [patch 23/24] x86/speculation: Enable PRCTL mode for spectre_v2_app2app

2018-11-21 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> Now that all prerequisites are in place:
> 
>  - Add the prctl command line option
> 
>  - Default the 'auto' mode to 'prctl'
> 
>  - When SMT state changes, update the static key which controls the
>conditional STIBP evaluation on context switch.
> 
>  - At init update the static key which controls the conditional IBPB
>evaluation on context switch.
> 
> Signed-off-by: Thomas Gleixner 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |5 ++
>  arch/x86/kernel/cpu/bugs.c  |   46 
> +---
>  2 files changed, 45 insertions(+), 6 deletions(-)
> 
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4246,7 +4246,10 @@
> by spectre_v2=off
>   auto- Kernel selects the mitigation depending on
> the available CPU features and vulnerability.
> -   Default is off.
> +   Default is prctl.
> + prctl   - Indirect branch speculation is enabled, but
> +   mitigation can be enabled via prctl per 
> thread.
> +   The mitigation control state is inherited on 
> fork.

Please change the order of the last two entries, i.e. make 'auto' the 
last one. This is the pattern we use in other places plus we refer to 
'prctl' before we document it.

>  static const struct {
> @@ -270,6 +272,7 @@ static const struct {
>   { "auto",   SPECTRE_V2_APP2APP_CMD_AUTO,false },
>   { "off",SPECTRE_V2_APP2APP_CMD_NONE,false },
>   { "on", SPECTRE_V2_APP2APP_CMD_FORCE,   true  },
> + { "prctl",  SPECTRE_V2_APP2APP_CMD_PRCTL,   false },

Might make sense to order them in a consistent fashion as well.

> + /*
> +  * If STIBP is not available or SMT is not possible clear the STIPB
> +  * mode.
> +  */
> + if (!smt_possible || !boot_cpu_has(X86_FEATURE_STIBP))
> + mode = SPECTRE_V2_APP2APP_NONE;

Another nit: please match order of the comments to how the condition is 
written in the code.

> +/* Update the static key controlling the evaluation of TIF_SPEC_IB */
> +static void update_indir_branch_cond(void)
> +{
> + if (!IS_ENABLED(CONFIG_SMP))
> + return;
> +
> + if (sched_smt_active())
> + static_branch_enable(&switch_to_cond_stibp);
> + else
> + static_branch_disable(&switch_to_cond_stibp);

So in the !SMP case sched_smt_active() is already doing the right thing:

  static inline bool sched_smt_active(void) { return false; }

I.e. couldn't we just remove the extra CONFIG_SMP condition?
This would simplify the code with some very minor expense on !SMP.

Thanks,

Ingo


Re: [patch 22/24] x86/speculation: Create PRCTL interface to restrict indirect branch speculation

2018-11-21 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> From: Tim Chen 
> 
> Add the PR_SPEC_INDIR_BRANCH option for the PR_GET_SPECULATION_CTRL and
> PR_SET_SPECULATION_CTRL prctls to allow fine grained per task control of
> indirect branch speculation via STIBP.
> 
> Invocations:
>  Check indirect branch speculation status with
>  - prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIR_BRANCH, 0, 0, 0);
> 
>  Enable indirect branch speculation with
>  - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIR_BRANCH, PR_SPEC_ENABLE, 0, 0);
> 
>  Disable indirect branch speculation with
>  - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIR_BRANCH, PR_SPEC_DISABLE, 0, 
> 0);
> 
>  Force disable indirect branch speculation with
>  - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIR_BRANCH, 
> PR_SPEC_FORCE_DISABLE, 0, 0);
> 
> See Documentation/userspace-api/spec_ctrl.rst.
> 
> Signed-off-by: Tim Chen 
> Signed-off-by: Thomas Gleixner 

Please either de-capitalize the title to 'prctl()', or use PR_SPEC_PRCTL 
or PR_SET/GET_SPECULATION_CTRL - there's no such thing as 'PRCTL' 
interface - the interface is called prctl() and the speculation control 
ABIs have their own names.

This applies to the next patch as well.

> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -212,6 +212,7 @@ struct prctl_mm_map {
>  #define PR_SET_SPECULATION_CTRL  53
>  /* Speculation control variants */
>  # define PR_SPEC_STORE_BYPASS0
> +# define PR_SPEC_INDIR_BRANCH1
>  /* Return and control values for PR_SET/GET_SPECULATION_CTRL */
>  # define PR_SPEC_NOT_AFFECTED0
>  # define PR_SPEC_PRCTL   (1UL << 0)
> --- a/tools/include/uapi/linux/prctl.h
> +++ b/tools/include/uapi/linux/prctl.h
> @@ -212,6 +212,7 @@ struct prctl_mm_map {
>  #define PR_SET_SPECULATION_CTRL  53
>  /* Speculation control variants */
>  # define PR_SPEC_STORE_BYPASS0
> +# define PR_SPEC_INDIR_BRANCH1
>  /* Return and control values for PR_SET/GET_SPECULATION_CTRL */
>  # define PR_SPEC_NOT_AFFECTED0
>  # define PR_SPEC_PRCTL   (1UL << 0)

Please de-abbreviate the new ABI: PR_SPEC_INDIRECT_BRANCH?

Thanks,

Ingo


Re: [GIT PULL 00/28] perf/core improvements and fixes

2018-11-21 Thread Ingo Molnar


* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling, some from before the trip to Vancouver,
> some that were more easy to process before I continue with the backlog.
> Took a bit more time than I anticipated due to fixing build breakage in
> various places due to multiple patches.  This has tip/perf/urgent
> merged.
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit b1a9d7b0190119dad5b9b7841751b5a7586bbc8b:
> 
>   Merge tag 'perf-urgent-for-mingo-4.20-20181121' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent 
> (2018-11-21 15:57:21 +0100)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.21-20181122
> 
> for you to fetch changes up to f4a0742b3cc1d03b2ff448017b8c714a77e5a261:
> 
>   perf pmu: Move *_cpuid_str() weak functions to header.c (2018-11-21 
> 22:39:59 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> - Start using BPF maps in 'perf trace' for filters in the augmented syscalls
>   code, keeping the existing code for tracepoint filters so that we can switch
>   back and forth while getting everything BPFied (Arnaldo Carvalho de Melo)
> 
> - Suppress potential format-truncation warning in the PMU code (Ben Hutchings)
> 
> - Introduce 'perf bench epoll', with "wait" and "ctl" benchmarks (Davidlohr 
> Bueso)
> 
> - Fix slowness due to -ffunction-section, do it by sorting the maps by name, 
> so
>   avoiding the using rb_first/next to traverse all entries looking for a map 
> name,
>   that with --ffunction-section gets to thousands of maps (Eric Saint-Etienne)
> 
> - Separate jvmti cmlr check (Jiri Olsa)
> 
> - Allow using the stepping when figuring out which JSON files to use for a x86
>   processor, so that Cascadelake server can be support, which has the same
>   cpuid as some other processor, being different only in the stepping (Kan 
> Liang)
> 
> - Share code and output format for uregs and iregs 'perf script' output 
> (Milian Wolff)
> 
> - Use perf_evsel__is_clocki() for clock events in 'perf stat' (Ravi Bangoria)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (15):
>   perf bpf: Add unistd.h to the headers accessible to bpf proggies
>   perf augmented_syscalls: Filter on a hard coded pid
>   perf augmented_syscalls: Remove needless linux/socket.h include
>   perf bpf: Add defines for map insertion/lookup
>   perf bpf: Add simple pid_filter class accessible to BPF proggies
>   perf augmented_syscalls: Drop 'write', 'poll' for testing without self 
> pid filter
>   perf augmented_syscalls: Use pid_filter
>   perf evlist: Rename perf_evlist__set_filter* to 
> perf_evlist__set_tp_filter*
>   perf trace: Add "_from_option" suffix to trace__set_filter()
>   perf trace: See if there is a map named "filtered_pids"
>   perf trace: Fill in BPF "filtered_pids" map when present
>   perf augmented_syscalls: Remove example hardcoded set of filtered pids
>   Revert "perf augmented_syscalls: Drop 'write', 'poll' for testing 
> without self pid filter"
>   perf bpf: Reduce the hardcoded .max_entries for pid_maps
>   tools build feature: Check if eventfd() is available
> 
> Ben Hutchings (1):
>   perf pmu: Suppress potential format-truncation warning
> 
> Davidlohr Bueso (3):
>   perf bench: Move HAVE_PTHREAD_ATTR_SETAFFINITY_NP into bench.h
>   perf bench: Add epoll parallel epoll_wait benchmark
>   perf bench: Add epoll_ctl(2) benchmark
> 
> Eric Saint-Etienne (1):
>   perf symbols: Fix slowness due to -ffunction-section
> 
> Jiri Olsa (1):
>   perf jvmti: Separate jvmti cmlr check
> 
> Kan Liang (3):
>   perf vendor events: Add stepping in CPUID string for x86
>   perf vendor events: Add JSON metrics for Cascadelake server
>   perf pmu: Move *_cpuid_str() weak functions to header.c
> 
> Milian Wolff (2):
>   perf script: Add newline after uregs output
>   perf script: Share code and output format for uregs and iregs output
> 
> Pu Wen (1):
>   perf tools: Add Hygon Dhyana support
> 
> Ravi Bangoria (1):
>   perf stat: Use perf_evsel__is_clock() for clock events
> 
>  tools/build/Makefile.feature   | 1 +
>  tools/build/feature/Makefile   | 8 +
>  tools/build/feature/test-all.c | 5 +
>  tools/build/feature/test-eventfd.c | 9 +
>  tools/build/feature/test-jvmti-cmlr.c  |11 +
>  tools/build/feature/test-jvmti.c   | 1 -
>  tools/perf/Documentation/perf-bench.txt|10 +
>  tools/perf/Makefile.config |12 +-
>  tools/perf/Makefile.perf   

Re: [GIT PULL 0/7] perf/urgent fixes

2018-11-21 Thread Ingo Molnar


* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling, more to come as I process the backlog
> from being away for LPC.
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 4d47d6407ac7b4b442a4e717488a3bb137398b6c:
> 
>   perf/x86/intel/uncore: Support CoffeeLake 8th CBOX (2018-11-12 05:03:26 
> +0100)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-urgent-for-mingo-4.20-20181121
> 
> for you to fetch changes up to a4243e1494532ab8fa679a4134153149a71fa331:
> 
>   perf tools beauty ioctl: Support new ISO7816 commands (2018-11-19 12:38:50 
> -0800)
> 
> 
> perf/urgent fixes:
> 
> - Update kernel ABI headers, one of them lead to a small change in
>   the ioctl 'cmd' beautifier in 'perf trace' to support the new ISO7816
>   commands. (Arnaldo Carvalho de Melo)
> 
> - Restore proper cwd on return from mnt namespace (Jiri Olsa)
> 
> - Add feature check for the get_current_dir_name() function used in the
>   namespace fix from Jiri, that is not available in systems such as
> Alpine Linux, which uses the musl libc (Arnaldo Carvalho de Melo)
> 
> - Fix crash in 'perf record' when synthesizing the unit for events such
>   as 'cpu-clock' (Jiri Olsa)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (5):
>   tools build feature: Check if get_current_dir_name() is available
>   tools headers uapi: Synchronize i915_drm.h
>   tools arch x86: Update tools's copy of cpufeatures.h
>   tools uapi asm-generic: Synchronize ioctls.h
>   perf tools beauty ioctl: Support new ISO7816 commands
> 
> Jiri Olsa (2):
>   perf tools: Fix crash on synthesizing the unit
>   perf tools: Restore proper cwd on return from mnt namespace
> 
>  tools/arch/x86/include/asm/cpufeatures.h|  2 ++
>  tools/build/Makefile.feature|  1 +
>  tools/build/feature/Makefile|  4 
>  tools/build/feature/test-all.c  |  5 +
>  tools/build/feature/test-get_current_dir_name.c | 10 ++
>  tools/include/uapi/asm-generic/ioctls.h |  2 ++
>  tools/include/uapi/drm/i915_drm.h   | 22 ++
>  tools/perf/Makefile.config  |  5 +
>  tools/perf/tests/attr/base-record   |  2 +-
>  tools/perf/trace/beauty/ioctl.c |  1 +
>  tools/perf/util/Build   |  1 +
>  tools/perf/util/evsel.c |  2 +-
>  tools/perf/util/get_current_dir_name.c  | 18 ++
>  tools/perf/util/namespaces.c| 17 +++--
>  tools/perf/util/namespaces.h|  1 +
>  tools/perf/util/util.h  |  4 
>  16 files changed, 93 insertions(+), 4 deletions(-)
>  create mode 100644 tools/build/feature/test-get_current_dir_name.c
>  create mode 100644 tools/perf/util/get_current_dir_name.c

Pulled, thanks a lot Arnaldo!

Ingo


Re: [tip:locking/core] locking/atomics: Check generated headers are up-to-date

2018-11-21 Thread Ingo Molnar


* tip-bot for Mark Rutland  wrote:

> Commit-ID:  8d32588077bdc390420cfa6946f407033a20d7a8
> Gitweb: 
> https://git.kernel.org/tip/8d32588077bdc390420cfa6946f407033a20d7a8
> Author: Mark Rutland 
> AuthorDate: Tue, 4 Sep 2018 11:48:29 +0100
> Committer:  Ingo Molnar 
> CommitDate: Thu, 1 Nov 2018 11:01:10 +0100
> 
> locking/atomics: Check generated headers are up-to-date
> 
> Now that all the generated atomic headers are in place, it would be good
> to ensure that:
> 
> a) the headers are up-to-date when scripting changes.
> 
> b) developers don't directly modify the generated headers.
> 
> To ensure both of these properties, let's add a Kbuild step to check
> that the generated headers are up-to-date.
> 
> Signed-off-by: Mark Rutland 
> Signed-off-by: Peter Zijlstra (Intel) 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: catalin.mari...@arm.com
> Cc: Will Deacon 
> Cc: linuxdriv...@attotech.com
> Cc: dvyu...@google.com
> Cc: Boqun Feng 
> Cc: a...@arndb.de
> Cc: aryabi...@virtuozzo.com
> Cc: gli...@google.com
> Link: http://lkml.kernel.org/r/20180904104830.2975-6-mark.rutl...@arm.com
> Signed-off-by: Ingo Molnar 
> ---
>  Kbuild  | 18 --
>  scripts/atomic/check-atomics.sh | 19 +++
>  2 files changed, 35 insertions(+), 2 deletions(-)

These scripts are *awfully* slow to be run at every kernel build - even on 
a reasonably fast machine:

  model name: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

... they are adding 3-4 seconds to the build time:

[Before]:

  galatea:~/linux/linux> perf stat --null --repeat 3 make kernel/sched/core.o
  CALLscripts/checksyscalls.sh
  DESCEND  objtool
  CALLscripts/checksyscalls.sh
  DESCEND  objtool
  CALLscripts/checksyscalls.sh
  DESCEND  objtool

 Performance counter stats for 'make kernel/sched/core.o' (3 runs):

  1.201874 +- 0.000371 seconds time elapsed  ( +-  0.03% )


[After]:

  galatea:~/linux/linux> perf stat --null --repeat 3 make kernel/sched/core.o

  CALLscripts/checksyscalls.sh
  CALLscripts/atomic/check-atomics.sh
  DESCEND  objtool
  CALLscripts/checksyscalls.sh
  CALLscripts/atomic/check-atomics.sh
  DESCEND  objtool
  CALLscripts/checksyscalls.sh
  CALLscripts/atomic/check-atomics.sh
  DESCEND  objtool

 Performance counter stats for 'make kernel/sched/core.o' (3 runs):

4.5987 +- 0.0109 seconds time elapsed  ( +-  0.24% )

Could we please get this fixed so that proper dependencies are checked 
and it's only regenerated when needed? This slowdown makes additive-build 
kernel development quite painful, as ~5 seconds is in the 'too long' 
category already, while 1.2 seconds is basically instantaneous.

I cannot even imagine the slowdown on a truly slow box where kernel 
development *has* to be additive.

Thanks,

Ingo


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-20 Thread Ingo Molnar


[ Cc:-ed a few other gents and lkml. ]

* Jens Axboe  wrote:

> Hi,
> 
> So this is a fun one... While I was doing the aio polled work, I noticed
> that the submitting process spent a substantial amount of time copying
> data to/from userspace. For aio, that's iocb and io_event, which are 64
> and 32 bytes respectively. Looking closer at this, and it seems that
> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
> cost.
> 
> I came up with this hack to test it out, and lo and behold, we now cut
> the time spent in copying in half. 50% less.
> 
> Since these kinds of patches tend to lend themselves to bike shedding, I
> also ran a string of kernel compilations out of RAM. Results are as
> follows:
> 
> Patched   : 62.86s avg, stddev 0.65s
> Stock : 63.73s avg, stddev 0.67s
> 
> which would also seem to indicate that we're faster punting smaller
> (< 128 byte) copies.
> 
> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
> 
> Interestingly, text size is smaller with the patch as well?!
> 
> I'm sure there are smarter ways to do this, but results look fairly
> conclusive. FWIW, the behavioral change was introduced by:
> 
> commit 954e482bde20b0e208fd4d34ef26e10afd194600
> Author: Fenghua Yu 
> Date:   Thu May 24 18:19:45 2012 -0700
> 
> x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
> 
> which contains nothing in terms of benchmarking or results, just claims
> that the new hotness is better.
> 
> Signed-off-by: Jens Axboe 
> ---
> 
> diff --git a/arch/x86/include/asm/uaccess_64.h 
> b/arch/x86/include/asm/uaccess_64.h
> index a9d637bc301d..7dbb78827e64 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned 
> len)
>  {
>   unsigned ret;
>  
> + /*
> +  * For smaller copies, don't use ERMS as it's slower.
> +  */
> + if (len < 128) {
> + alternative_call(copy_user_generic_unrolled,
> +  copy_user_generic_string, X86_FEATURE_REP_GOOD,
> +  ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
> +  "=d" (len)),
> +  "1" (to), "2" (from), "3" (len)
> +  : "memory", "rcx", "r8", "r9", "r10", "r11");
> + return ret;
> + }
> +
>   /*
>* If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>* Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>* Otherwise, use copy_user_generic_unrolled.
>*/
>   alternative_call_2(copy_user_generic_unrolled,
> -  copy_user_generic_string,
> -  X86_FEATURE_REP_GOOD,
> -  copy_user_enhanced_fast_string,
> -  X86_FEATURE_ERMS,
> +  copy_user_generic_string, X86_FEATURE_REP_GOOD,
> +  copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>"=d" (len)),
>"1" (to), "2" (from), "3" (len)

So I'm inclined to do something like yours, because clearly the changelog 
of 954e482bde20 was at least partly false: Intel can say whatever they 
want, it's a fact that ERMS has high setup costs for low buffer sizes - 
ERMS is optimized for large size, cache-aligned copies mainly.

But the result is counter-intuitive in terms of kernel text footprint, 
plus the '128' is pretty arbitrary - we should at least try to come up 
with a break-even point where manual copy is about as fast as ERMS - on 
at least a single CPU ...

Thanks,

Ingo


Re: [PATCH 02/13] x86/fault: Check user_mode(regs) when validating a stack extension

2018-11-20 Thread Ingo Molnar


* Ingo Molnar  wrote:

> 
> * Andy Lutomirski  wrote:
> 
> > The fault handling code tries to validate that a page fault from
> > user mode that would extend the stack is within a certain range of
> > the user SP.  regs->sp is only equal to the user SP if
> > user_mode(regs).  In the extremely unlikely event that that
> > sw_error_code had the USER bit set but the faulting instruction was
> > in the kernel (i.e. the faulting instruction was WRUSS), then the
> > *kernel* stack pointer would have been checked, which would be an
> > info leak.
> > 
> > Note to -stable maintainers: don't backport this unless you backport
> > CET.  The bug it fixes is unobservable in current kernels.
> > 
> > Signed-off-by: Andy Lutomirski 
> > ---
> >  arch/x86/mm/fault.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 91d4d2722f2e..eae7ee3ce89b 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1377,7 +1377,7 @@ void do_user_addr_fault(struct pt_regs *regs,
> > bad_area(regs, sw_error_code, address);
> > return;
> > }
> > -   if (sw_error_code & X86_PF_USER) {
> > +   if (user_mode(regs)) {
> > /*
> >  * Accessing the stack below %sp is always a bug.
> >  * The large cushion allows instructions like enter
> 
> Note that this check is gone now due to:
> 
>   1d8ca3be86eb: x86/mm/fault: Allow stack access below %rsp
> 
> Thanks,

Ok, I like your series - I have applied the first ~7 patches of it to 
tip:x86/mm, the rest is interacting with 1d8ca3be86eb - will apply the 
rest as well once you send a v2.

Thanks,

Ingo


Re: [PATCH 02/13] x86/fault: Check user_mode(regs) when validating a stack extension

2018-11-19 Thread Ingo Molnar


* Andy Lutomirski  wrote:

> The fault handling code tries to validate that a page fault from
> user mode that would extend the stack is within a certain range of
> the user SP.  regs->sp is only equal to the user SP if
> user_mode(regs).  In the extremely unlikely event that that
> sw_error_code had the USER bit set but the faulting instruction was
> in the kernel (i.e. the faulting instruction was WRUSS), then the
> *kernel* stack pointer would have been checked, which would be an
> info leak.
> 
> Note to -stable maintainers: don't backport this unless you backport
> CET.  The bug it fixes is unobservable in current kernels.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/mm/fault.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 91d4d2722f2e..eae7ee3ce89b 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1377,7 +1377,7 @@ void do_user_addr_fault(struct pt_regs *regs,
>   bad_area(regs, sw_error_code, address);
>   return;
>   }
> - if (sw_error_code & X86_PF_USER) {
> + if (user_mode(regs)) {
>   /*
>* Accessing the stack below %sp is always a bug.
>* The large cushion allows instructions like enter

Note that this check is gone now due to:

  1d8ca3be86eb: x86/mm/fault: Allow stack access below %rsp

Thanks,

Ingo


Re: Cleaning up numbering for new x86 syscalls?

2018-11-19 Thread Ingo Molnar


* Andy Lutomirski  wrote:

> Hi all-
> 
> We currently have some giant turds in the way that syscalls are
> numbered.  We have the x86_32 table, which is totally sane other than
> some legacy multiplexers.  Then we have the x86_64 table, which is,
> um, demented:
> 
>  - The numbers don't match x86_32.  I have no idea why.
> 
>  - We use bit 30, which triggers in_x32_syscall().  It should have
> been bit 31, bit I digress.
> 
>  - We have this weird set of extra x32 syscalls that start at 512.
> Who wants to bet whether we have no bugs if someone does syscall with,
> say, nr == 512 (i.e. not 512 | BIT(30)) or nr == (16 | BIT(30))?  The
> latter would be non-compat ioctl with in_x32_syscall() set and hence
> in_compat_syscall() set.
> 
>  - Bloody restart_syscall() has a different number on x86_64 and
> x86_32, which is a big mess.
> 
> I propose we consider some subset of the following:
> 
> 1. Introduce restart_syscall_2().  Make its number be 1024.  Maybe
> someday we could start using it instead of restart_syscall().  The
> only issue I can see is programs that allow restart_syscall() using
> seccomp but don't allow the new variant.
> 
> 2. Introduce an outright ban on new syscalls with nr < 1024.

Also let's make sure it results in a build error or boot panic if someone 
tries.

> 3. Introduce an outright ban on the addition of new __x32_compat
> syscalls.  If new compat hacks are needed, they can use
> in_compat_syscall(), thank you very much.

Here too build-time and runtime enforcement would be nice.

> 4. Modify the wrappers of the __x32_compat entries so that they will
> return -ENOSYS if in_x32_syscall() returns false.
> 
> 5. Adjust the scripts so that we only have to wire up new syscalls
> once.  They'll have a nr above 1024, and they'll have the same nr on
> all x86 variants.
> 
> Thoughts?

Fully agreed:

6. Is x32 even used in practice? I still think it was a mistake to add it 
   and some significant distributions like Fedora are not enabling it.

Barring any sane way to phase out x32 support I'd suggest we implement 
all your suggestions.

Thanks,

Ingo


Re: STIBP by default.. Revert?

2018-11-19 Thread Ingo Molnar


* Linus Torvalds  wrote:

> This was marked for stable, and honestly, nowhere in the discussion
> did I see any mention of just *how* bad the performance impact of this
> was.

Yeah. This was an oversight - we'll fix it!

> When performance goes down by 50% on some loads, people need to start
> asking themselves whether it was worth it. It's apparently better to
> just disable SMT entirely, which is what security-conscious people do
> anyway.
> 
> So why do that STIBP slow-down by default when the people who *really*
> care already disabled SMT?
> 
> I think we should use the same logic as for L1TF: we default to
> something that doesn't kill performance. Warn once about it, and let
> the  crazy people say "I'd rather take a 50% performance hit than
> worry about a theoretical issue".

Yeah, absolutely.

We'll also require performance measurements in the changelogs of patches 
enabling any sort of mitigation feature from now on - this requirement 
was implicit but 53c613fe6349 flew in under the radar, so it's going to 
be an explicit requirement.

Thanks,

Ingo


[GIT PULL] scheduler fix

2018-11-17 Thread Ingo Molnar
Linus,

Please pull the latest sched-urgent-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
sched-urgent-for-linus

   # HEAD: c469933e772132aad040bd6a2adc8edf9ad6f825 sched/fair: Fix 
cpu_util_wake() for 'execl' type workloads

Fix an exec() related scalability/performance regression, which was 
caused by incorrectly calculating load and migrating tasks on exec() when 
they shouldn't be.

 Thanks,

Ingo

-->
Patrick Bellasi (1):
  sched/fair: Fix cpu_util_wake() for 'execl' type workloads


 kernel/sched/fair.c | 62 +
 1 file changed, 48 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3648d0300fdf..ac855b2f4774 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5674,11 +5674,11 @@ static int wake_affine(struct sched_domain *sd, struct 
task_struct *p,
return target;
 }
 
-static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
+static unsigned long cpu_util_without(int cpu, struct task_struct *p);
 
-static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
+static unsigned long capacity_spare_without(int cpu, struct task_struct *p)
 {
-   return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
+   return max_t(long, capacity_of(cpu) - cpu_util_without(cpu, p), 0);
 }
 
 /*
@@ -5738,7 +5738,7 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
 
 avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 
-   spare_cap = capacity_spare_wake(i, p);
+   spare_cap = capacity_spare_without(i, p);
 
if (spare_cap > max_spare_cap)
max_spare_cap = spare_cap;
@@ -5889,8 +5889,8 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
return prev_cpu;
 
/*
-* We need task's util for capacity_spare_wake, sync it up to prev_cpu's
-* last_update_time.
+* We need task's util for capacity_spare_without, sync it up to
+* prev_cpu's last_update_time.
 */
if (!(sd_flag & SD_BALANCE_FORK))
 sync_entity_load_avg(&p->se);
@@ -6216,10 +6216,19 @@ static inline unsigned long cpu_util(int cpu)
 }
 
 /*
- * cpu_util_wake: Compute CPU utilization with any contributions from
- * the waking task p removed.
+ * cpu_util_without: compute cpu utilization without any contributions from *p
+ * @cpu: the CPU which utilization is requested
+ * @p: the task which utilization should be discounted
+ *
+ * The utilization of a CPU is defined by the utilization of tasks currently
+ * enqueued on that CPU as well as tasks which are currently sleeping after an
+ * execution on that CPU.
+ *
+ * This method returns the utilization of the specified CPU by discounting the
+ * utilization of the specified task, whenever the task is currently
+ * contributing to the CPU utilization.
  */
-static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
+static unsigned long cpu_util_without(int cpu, struct task_struct *p)
 {
struct cfs_rq *cfs_rq;
unsigned int util;
@@ -6231,7 +6240,7 @@ static unsigned long cpu_util_wake(int cpu, struct 
task_struct *p)
 cfs_rq = &cpu_rq(cpu)->cfs;
util = READ_ONCE(cfs_rq->avg.util_avg);
 
-   /* Discount task's blocked util from CPU's util */
+   /* Discount task's util from CPU's util */
util -= min_t(unsigned int, util, task_util(p));
 
/*
@@ -6240,14 +6249,14 @@ static unsigned long cpu_util_wake(int cpu, struct 
task_struct *p)
 * a) if *p is the only task sleeping on this CPU, then:
 *  cpu_util (== task_util) > util_est (== 0)
 *and thus we return:
-*  cpu_util_wake = (cpu_util - task_util) = 0
+*  cpu_util_without = (cpu_util - task_util) = 0
 *
 * b) if other tasks are SLEEPING on this CPU, which is now exiting
 *IDLE, then:
 *  cpu_util >= task_util
 *  cpu_util > util_est (== 0)
 *and thus we discount *p's blocked utilization to return:
-*  cpu_util_wake = (cpu_util - task_util) >= 0
+*  cpu_util_without = (cpu_util - task_util) >= 0
 *
 * c) if other tasks are RUNNABLE on that CPU and
 *  util_est > cpu_util
@@ -6260,8 +6269,33 @@ static unsigned long cpu_util_wake(int cpu, struct 
task_struct *p)
 * covered by the following code when estimated utilization is
 * enabled.
 */
-   if (sched_feat(UTIL_EST))
-   util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));
+   if (sched_feat(UTIL_EST)) {
+   unsigned int estimated =
+   READ_ONCE(cfs_rq->avg.util_est.enqueued);
+
+   /*
+* Despite the following checks we still have a 

[GIT PULL] perf fixes

2018-11-17 Thread Ingo Molnar
Linus,

Please pull the latest perf-urgent-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
perf-urgent-for-linus

   # HEAD: 4d47d6407ac7b4b442a4e717488a3bb137398b6c perf/x86/intel/uncore: 
Support CoffeeLake 8th CBOX

Fix uncore PMU enumeration for CoffeeLake CPUs.

 Thanks,

Ingo

-->
Kan Liang (2):
  perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake 
CPUs
  perf/x86/intel/uncore: Support CoffeeLake 8th CBOX


 arch/x86/events/intel/uncore.h |  33 +++---
 arch/x86/events/intel/uncore_snb.c | 121 -
 2 files changed, 144 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
index e17ab885b1e9..cb46d602a6b8 100644
--- a/arch/x86/events/intel/uncore.h
+++ b/arch/x86/events/intel/uncore.h
@@ -129,8 +129,15 @@ struct intel_uncore_box {
struct intel_uncore_extra_reg shared_regs[0];
 };
 
-#define UNCORE_BOX_FLAG_INITIATED  0
-#define UNCORE_BOX_FLAG_CTL_OFFS8  1 /* event config registers are 8-byte 
apart */
+/* CFL uncore 8th cbox MSRs */
+#define CFL_UNC_CBO_7_PERFEVTSEL0  0xf70
+#define CFL_UNC_CBO_7_PER_CTR0 0xf76
+
+#define UNCORE_BOX_FLAG_INITIATED  0
+/* event config registers are 8-byte apart */
+#define UNCORE_BOX_FLAG_CTL_OFFS8  1
+/* CFL 8th CBOX has different MSR space */
+#define UNCORE_BOX_FLAG_CFL8_CBOX_MSR_OFFS 2
 
 struct uncore_event_desc {
struct kobj_attribute attr;
@@ -297,17 +304,27 @@ unsigned int uncore_freerunning_counter(struct 
intel_uncore_box *box,
 static inline
 unsigned uncore_msr_event_ctl(struct intel_uncore_box *box, int idx)
 {
-   return box->pmu->type->event_ctl +
-   (box->pmu->type->pair_ctr_ctl ? 2 * idx : idx) +
-   uncore_msr_box_offset(box);
+   if (test_bit(UNCORE_BOX_FLAG_CFL8_CBOX_MSR_OFFS, &box->flags)) {
+   return CFL_UNC_CBO_7_PERFEVTSEL0 +
+  (box->pmu->type->pair_ctr_ctl ? 2 * idx : idx);
+   } else {
+   return box->pmu->type->event_ctl +
+  (box->pmu->type->pair_ctr_ctl ? 2 * idx : idx) +
+  uncore_msr_box_offset(box);
+   }
 }
 
 static inline
 unsigned uncore_msr_perf_ctr(struct intel_uncore_box *box, int idx)
 {
-   return box->pmu->type->perf_ctr +
-   (box->pmu->type->pair_ctr_ctl ? 2 * idx : idx) +
-   uncore_msr_box_offset(box);
+   if (test_bit(UNCORE_BOX_FLAG_CFL8_CBOX_MSR_OFFS, &box->flags)) {
+   return CFL_UNC_CBO_7_PER_CTR0 +
+  (box->pmu->type->pair_ctr_ctl ? 2 * idx : idx);
+   } else {
+   return box->pmu->type->perf_ctr +
+  (box->pmu->type->pair_ctr_ctl ? 2 * idx : idx) +
+  uncore_msr_box_offset(box);
+   }
 }
 
 static inline
diff --git a/arch/x86/events/intel/uncore_snb.c 
b/arch/x86/events/intel/uncore_snb.c
index 8527c3e1038b..2593b0d7aeee 100644
--- a/arch/x86/events/intel/uncore_snb.c
+++ b/arch/x86/events/intel/uncore_snb.c
@@ -15,6 +15,25 @@
 #define PCI_DEVICE_ID_INTEL_SKL_HQ_IMC 0x1910
 #define PCI_DEVICE_ID_INTEL_SKL_SD_IMC 0x190f
 #define PCI_DEVICE_ID_INTEL_SKL_SQ_IMC 0x191f
+#define PCI_DEVICE_ID_INTEL_KBL_Y_IMC  0x590c
+#define PCI_DEVICE_ID_INTEL_KBL_U_IMC  0x5904
+#define PCI_DEVICE_ID_INTEL_KBL_UQ_IMC 0x5914
+#define PCI_DEVICE_ID_INTEL_KBL_SD_IMC 0x590f
+#define PCI_DEVICE_ID_INTEL_KBL_SQ_IMC 0x591f
+#define PCI_DEVICE_ID_INTEL_CFL_2U_IMC 0x3ecc
+#define PCI_DEVICE_ID_INTEL_CFL_4U_IMC 0x3ed0
+#define PCI_DEVICE_ID_INTEL_CFL_4H_IMC 0x3e10
+#define PCI_DEVICE_ID_INTEL_CFL_6H_IMC 0x3ec4
+#define PCI_DEVICE_ID_INTEL_CFL_2S_D_IMC   0x3e0f
+#define PCI_DEVICE_ID_INTEL_CFL_4S_D_IMC   0x3e1f
+#define PCI_DEVICE_ID_INTEL_CFL_6S_D_IMC   0x3ec2
+#define PCI_DEVICE_ID_INTEL_CFL_8S_D_IMC   0x3e30
+#define PCI_DEVICE_ID_INTEL_CFL_4S_W_IMC   0x3e18
+#define PCI_DEVICE_ID_INTEL_CFL_6S_W_IMC   0x3ec6
+#define PCI_DEVICE_ID_INTEL_CFL_8S_W_IMC   0x3e31
+#define PCI_DEVICE_ID_INTEL_CFL_4S_S_IMC   0x3e33
+#define PCI_DEVICE_ID_INTEL_CFL_6S_S_IMC   0x3eca
+#define PCI_DEVICE_ID_INTEL_CFL_8S_S_IMC   0x3e32
 
 /* SNB event control */
 #define SNB_UNC_CTL_EV_SEL_MASK0x00ff
@@ -202,6 +221,10 @@ static void skl_uncore_msr_init_box(struct 
intel_uncore_box *box)
wrmsrl(SKL_UNC_PERF_GLOBAL_CTL,
SNB_UNC_GLOBAL_CTL_EN | SKL_UNC_GLOBAL_CTL_CORE_ALL);
}
+
+   /* The 8th CBOX has different MSR space */
+   if (box->pmu->pmu_idx == 7)
+   __set_bit(UNCORE_BOX_FLAG_CFL8_CBOX_MSR_OFFS, &box->flags);
 }
 
 static void skl_uncore_msr_enable_box(struct intel_uncore_box *box)
@@ -228,7 +251,7 @@ static struct intel_uncore_ops skl_uncore_msr_ops = {
 static struct intel_uncore_type skl_uncore_cbox = {
.name   = 

Re: [GIT PULL] PCI changes for v4.20

2018-11-14 Thread Ingo Molnar


* Bjorn Helgaas  wrote:

> [+cc Martin, Rafael, Len, linux-acpi]
> 
> On Tue, Nov 13, 2018 at 11:20:04AM +0100, Borislav Petkov wrote:
> > On Tue, Nov 13, 2018 at 08:17:12AM +0100, Ingo Molnar wrote:
> > > 
> > > * Bjorn Helgaas  wrote:
> > > 
> > > > PCI changes:
> > > > 
> > > >   - Pay attention to device-specific _PXM node values (Jonathan Cameron)
> > > 
> > > There's a new boot regression, my AMD ThreadRipper system (MSI X399 SLI 
> > > PLUS (MS-7B09)) hangs during early bootup, and I have bisected it down to 
> > > this commit:
> > > 
> > >   bad7dcd94f39: ACPI/PCI: Pay attention to device-specific _PXM node 
> > > values
> > > 
> > > Reverting it solves the hang.
> > > 
> > > Unfortunately there's no console output when it hangs, even with 
> > > earlyprintk. It just hangs after the "loading initrd" line.
> > > 
> > > Config is an Ubuntu-ish config with PROVE_LOCKING=y and a few other debug 
> > > options.
> > > 
> > > All my other testsystems boot fine with similar configs, so it's probably 
> > > something specific to this system.
> 
> Martin reported the same thing [1] (unfortunately the archive didn't
> capture Martin's original emails, I think because they were multi-part
> messages with attachments).
> 
> Looks like Martin might have a similar system:
> 
>   DMI: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.30 
> 08/14/2018
>   smpboot: CPU0: AMD Ryzen Threadripper 2950X 16-Core Processor (family: 
> 0x17, model: 0x8, stepping: 0x2)
> 
> Given how painful this is to debug, I queued up a revert on my
> for-linus branch until we figure out what sanity checks are needed to
> make the original patch safe.

Thanks!

Took me about a day to bisect this, on this hard-to-bisect machine. :-/

> I would expect proximity information to be basically just a hint for 
> optimization, not a functional requirement, so it would be really 
> interesting to figure out why this causes such a catastrophic failure. 
> Maybe there's a way to improve that path as well so it would be more 
> robust or at least more debuggable.

Yeah.

Thanks,

Ingo


Re: [PATCH 2/2] x86: set a dependency on macros.S

2018-11-13 Thread Ingo Molnar


* Nadav Amit  wrote:

> Changes in macros.S should trigger the recompilation of all C files, as
> the macros might need to affect their compilation.
> 
> Signed-off-by: Nadav Amit 
> ---
>  scripts/Makefile.build | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/scripts/Makefile.build b/scripts/Makefile.build
> index d2213b041408..ffe3e1a01210 100644
> --- a/scripts/Makefile.build
> +++ b/scripts/Makefile.build
> @@ -312,13 +312,13 @@ cmd_undef_syms = echo
>  endif
>  
>  # Built-in and composite module parts
> -$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
> +$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) 
> $(ASM_MACRO_FILE:.s=.S) FORCE
>   $(call cmd,force_checksrc)
>   $(call if_changed_rule,cc_o_c)
>  
>  # Single-part modules are special since we need to mark them in $(MODVERDIR)
>  
> -$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) 
> $(objtool_dep) FORCE
> +$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) 
> $(objtool_dep) $(ASM_MACRO_FILE:.s=.S) FORCE
>   $(call cmd,force_checksrc)
>       $(call if_changed_rule,cc_o_c)
>   @{ echo $(@:.o=.ko); echo $@; \

Acked-by: Ingo Molnar 

Thanks,

Ingo


Re: [PATCH 1/2] Makefile: Fix distcc compilation with x86 macros

2018-11-13 Thread Ingo Molnar


* Nadav Amit  wrote:

> Introducing the use of asm macros in c-code broke distcc, since it only
> sends the preprocessed source file. The solution is to break the
> compilation into two separate phases of compilation and assembly, and
> between the two concatanate the assembly macros and the compiled (yet

s/concatenate

> not assembled) source file. Since this is less efficient, this
> compilation mode is only used when make is called with the "DISTCC=y"
> parameter.
> 
> Note that the assembly stage should also be distributed, if distcc is
> configured using "CFLAGS=-DENABLE_REMOTE_ASSEMBLE".

It's a bit sad that we regressed distcc performance ...

> +# If distcc is used, then when an assembly macro files is needed, the
> +# compilation stage and the assembly stage need to be separated. Providing
> +# "DISTCC=y" option enables the separate compilation and assembly.

Let's fix the various typos:

  > +# If distcc is used, and when assembly macro files are needed, the
  > +# compilation stage and the assembly stage needs to be separated. 
  > +# Providing the "DISTCC=y" option enables separate compilation and 
  > +# assembly.



> +cmd_cc_o_c_direct = $(CC) $(c_flags) -c -o $(1) $<
> +
> +ifeq ($(DISTCC),y)
> +a_flags_no_debug = $(filter-out $(AFLAGS_DEBUG_INFO), $(a_flags))
> +c_flags_no_macros = $(filter-out $(ASM_MACRO_FLAGS), $(c_flags))
> +
> +cmd_cc_o_c_two_steps =   
> \
> + $(CC) $(c_flags_no_macros) $(DISABLE_LTO) -fverbose-asm -S  \
> + -o $(@D)/.$(@F:.o=.s) $< ;  \
> + cat $(ASM_MACRO_FILE) $(@D)/.$(@F:.o=.s) >  \
> + $(@D)/.tmp_$(@F:.o=.s); \
> + $(CC) $(a_flags_no_debug) -c -o $(1) $(@D)/.tmp_$(@F:.o=.s) ;   \
> + rm -f $(@D)/.$(@F:.o=.s) $(@D)/.tmp_$(@F:.o=.s) \
> +
> +cmd_cc_o_c_helper =  
> \
> + $(if $(findstring $(ASM_MACRO_FLAGS),$(c_flags)),   \
> + $(call cmd_cc_o_c_two_steps, $(1)), \
> + $(call cmd_cc_o_c_direct, $(1)))

There are various stray + whitespace noise errors in the 
chunks above.

Thanks,

Ingo


Re: [GIT PULL] PCI changes for v4.20

2018-11-12 Thread Ingo Molnar


* Bjorn Helgaas  wrote:

> PCI changes:
> 
>   - Pay attention to device-specific _PXM node values (Jonathan Cameron)

There's a new boot regression, my AMD ThreadRipper system (MSI X399 SLI 
PLUS (MS-7B09)) hangs during early bootup, and I have bisected it down to 
this commit:

  bad7dcd94f39: ACPI/PCI: Pay attention to device-specific _PXM node values

Reverting it solves the hang.

Unfortunately there's no console output when it hangs, even with 
earlyprintk. It just hangs after the "loading initrd" line.

Config is an Ubuntu-ish config with PROVE_LOCKING=y and a few other debug 
options.

All my other testsystems boot fine with similar configs, so it's probably 
something specific to this system.

Thanks,

Ingo


Re: [PATCH] sched/rt: Introduce prio_{higher,lower}() helper for comparing RT task prority

2018-11-11 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> I think you only need the less thing, because:
> 
> static inline bool prio_lower(int a, int b)
> {
>   return a > b;
> }

I'd say that should be named rt_prio_lower(), even if it's local to 
sched/rt.c, right?

Thanks,

Ingo


Re: [RFC PATCH 00/12] locking/lockdep: Add a new class of terminal locks

2018-11-11 Thread Ingo Molnar


* Josh Poimboeuf  wrote:

> On Mon, Nov 12, 2018 at 06:10:33AM +0100, Ingo Molnar wrote:
> > 
> > * Waiman Long  wrote:
> > 
> > > On 11/10/2018 09:10 AM, Peter Zijlstra wrote:
> > > > On Fri, Nov 09, 2018 at 09:04:12AM +0100, Ingo Molnar wrote:
> > > >> BTW., if you are interested in more radical approaches to optimize 
> > > >> lockdep, we could also add a static checker via objtool driven call 
> > > >> graph 
> > > >> analysis, and mark those locks terminal that we can prove are terminal.
> > > >>
> > > >> This would require the unified call graph of the kernel image and of 
> > > >> all 
> > > >> modules to be examined in a final pass, but that's within the 
> > > >> principal 
> > > >> scope of objtool. (This 'final pass' could also be done during bootup, 
> > > >> at 
> > > >> least in initial versions.)
> > > >
> > > > Something like this is needed for objtool LTO support as well. I just
> > > > dread the build time 'regressions' this will introduce :/
> > > >
> > > > The final link pass is already by far the most expensive part (as
> > > > measured in wall-time) of building a kernel, adding more work there
> > > > would really suck :/
> > > 
> > > I think the idea is to make objtool have the capability to do that. It
> > > doesn't mean we need to turn it on by default in every build.
> > 
> > Yeah.
> > 
> > Also note that much of the objtool legwork would be on a per file basis 
> > which is reasonably parallelized already. On x86 it's also already done 
> > for every ORC build i.e. every distro build and the incremental overhead 
> > from also extracting locking dependencies should be reasonably small.
> > 
> > The final search of the global graph would be serialized but still 
> > reasonably fast as these are all 'class' level dependencies which are 
> > much less numerous than runtime dependencies.
> > 
> > I.e. I think we are talking about tens of thousands of dependencies, not 
> > tens of millions.
> > 
> > At least in theory. ;-)
> 
> Generating a unified call graph sounds very expensive (and very far
> beyond what objtool can do today).

Well, objtool already goes through the instruction stream and recognizes 
function calls - so it can in effect generate a stream of "function x 
called by function y" data, correct?

>  Also, what about function pointers?

So maybe it's possible to enumerate all potential values for function 
pointers with a reasonably simple compiler plugin and work from there?

One complication would be function pointers encoded as opaque data 
types...

> BTW there's another kernel static analysis tool which attempts to 
> create such a call graph already: smatch.

It's not included in the kernel tree though and I'd expect tight coupling 
(or at least lock-step improvements) between tooling and lockdep here.

Thanks,

Ingo


Re: [patch 0/2] Documentation/process: Add subsystem/tree handbook

2018-11-11 Thread Ingo Molnar


* Jonathan Corbet  wrote:

> > Even after a decade of introducing Git I still see Signed-off-by used as 
> > an Acked-by or Reviewed-by substitutes, so I'd suggest adding this small 
> > explanation as well:
> > 
> >   +   SOB chains should reflect the *real* route a patch took as it was 
> >   +   propagated to us, with the first SOB entry signalling primary
> >   +   authorship of a single author. Acks should be given as Acked-by 
> >   +   lines and review approvals as Reviewed-by lines.
> 
> If SOB means anything like what it's supposed to mean, this *can't* be a
> "local quirk" - we have to agree on it globally.

Yeah, I didn't intend this paragraph to be a local quirk - more like a 
reiteration of what the global rule already is (regardless of enforcement 
...), given the context.

I presume the cross section of readers of *both* the global documents and 
the -tip documents will be even smaller than just the readers of the -tip 
document - so some redundancy is beneficial. ;-)

In that sense the 'local' rules can also be indicators about which parts 
of the myriads of global rules the maintainers feel most strongly about, 
and are also a list of the most common problems we see in practice. Just 
repeating the very large set of global rules is counter-productive in 
that fashion. Maybe we should re-phrase and re-structure it into such a 
format, with references back to the original rules where we are just 
reiterating global rules? That would also allow the eventual inclusion of 
any clarification into the global rule.

Another aspect is that in -tip we are pretty strict about certain 
stylistic and not so stylistic details - this is partly a personal 
preference of the maintainers, and partly a maintainer workload reduction 
measure.

But level of enforcement matters and I think other trees can legitimately 
be more relaxed: when there's a lack of contributors then you *really* 
don't want to be the maintainer-from-hell who wants all i's dotted and 
all t's crossed, and you might not have the bandwidth to do it all 
yourself ...

Also I think there are and should be a lot of shades of grey when it 
comes to accepting patches, to find the right balance between pushing 
back on patches to improve quality and pulling in patches to move the 
kernel forward.

> If you want to push this into the tree in something like its current 
> form, I'm not going to resist too hard - far be it from me to say we 
> don't want more documentation!  But allow me to complain a little.
> 
> Suppose I came along with my nifty new architecture, and it dragged in 
> a whole new set of timer and interrupt subsystems that duplicated a lot 
> of what's in the kernel now, but buried a few "local quirks" deep in 
> the middle.  "Don't worry", I say, "we'll factor out the common stuff 
> later once we figure out what it is; I'd rather not deal with the 
> bikeshedding now". Correct me if I'm wrong, but I suspect I might just 
> get a response back from you.  That's not how we normally do things.
> 
> This proposal takes a similar approach to the documentation.  Changelog 
> rules, your comment rules (other than tail comments), brace rules, line 
> breaks, etc. are common stuff; if they are not well-enough documented 
> in the global docs, the fix should really be applied there.  If it 
> lands in the current form, you know as well as I do that it will almost 
> certainly stay there for years, if not indefinitely.
>
> IMO, the subsystem-specific documentation should be something that an
> existing kernel developer can use to quickly learn how to avoid surprises
> when wandering into a different subsystem.  So it should be concise and
> strongly focused on the local customs.  If we don't start that way, I'm
> afraid we'll never have that.  Then developers will miss the important
> information, and we'll reinforce the image of the kernel project as a
> collection of little fiefdoms that one wanders into at one's own risk.
> And Documentation/ will continue to be a painful mess.
> 
> 

Yeah, so in -tip we don't do "local customs": we genuinely believe that 
*all* our rules where they go beyond the current development process 
would improve the generic kernel as well. (In fact if anyone can give 
reasons for why a particular rule is not an absolute advantage to the 
kernel we'd reconsider changing the rule.)

So your complaint is legitimate - how would you propose we handle this?

Thanks,

Ingo


Re: [RFC PATCH v1 2/2] proc: add /proc//thread_state

2018-11-11 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> On Thu, Nov 08, 2018 at 07:32:46AM +0100, Ingo Molnar wrote:
> > 
> > * Aubrey Li  wrote:
> > 
> > > Expose the per-task cpu specific thread state value, it's helpful
> > > for userland to classify and schedule the tasks by different policies
> > 
> > That's pretty vague - what exactly would use this information? I'm sure 
> > you have a usecase in mind - could you please describe it?
> 
> Yeah, "thread_state" is a pretty terrible name for this. The use-case is
> detectoring which tasks use AVX3 such that a userspace component (think
> job scheduler) can cluster them together.

I'd prefer the kernel to do such clustering...

Thanks,

Ingo


Re: [tip:ras/core] x86/mce: Fix -Wmissing-prototypes warnings

2018-11-11 Thread Ingo Molnar


* Borislav Petkov  wrote:

> Fix an actual bug too which the warning triggered:
> 
>   arch/x86/kernel/cpu/mcheck/therm_throt.c:395:39: error: conflicting \
>   types for ‘smp_thermal_interrupt’
>    asmlinkage __visible void __irq_entry smp_thermal_interrupt(struct pt_regs *r)
>                                          ^

>   In file included from arch/x86/kernel/cpu/mcheck/therm_throt.c:29:
>   ./arch/x86/include/asm/traps.h:107:17: note: previous declaration of \
> ‘smp_thermal_interrupt’ was here
>    asmlinkage void smp_thermal_interrupt(void);
> 
> Signed-off-by: Borislav Petkov 
> Cc: Yi Wang 
> Cc: Michael Matz 
> Cc: x...@kernel.org
> Link: https://lkml.kernel.org/r/alpine.deb.2.21.1811081633160.1...@nanos.tec.linutronix.de
> ---
>  arch/x86/include/asm/traps.h | 6 +++---
>  arch/x86/kernel/cpu/mcheck/mce_amd.c | 5 +++--
>  arch/x86/kernel/cpu/mcheck/therm_throt.c | 1 +
>  arch/x86/kernel/cpu/mcheck/threshold.c   | 3 ++-
>  4 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index 3de69330e6c5..9ea54a764a8e 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -104,9 +104,9 @@ extern int panic_on_unrecovered_nmi;
>  
>  void math_emulate(struct math_emu_info *);
>  #ifndef CONFIG_X86_32
> -asmlinkage void smp_thermal_interrupt(void);
> -asmlinkage void smp_threshold_interrupt(void);
> -asmlinkage void smp_deferred_error_interrupt(void);
> +asmlinkage void smp_thermal_interrupt(struct pt_regs *);
> +asmlinkage void smp_threshold_interrupt(struct pt_regs *);
> +asmlinkage void smp_deferred_error_interrupt(struct pt_regs *);

Nit: please use full declarations in prototypes, i.e. 'struct pt_regs 
*regs' or so. Much of traps.h does this wrong so I guess this should be a 
separate cleanup patch.

> +asmlinkage __visible void __irq_entry smp_deferred_error_interrupt(struct pt_regs *r)

> +asmlinkage __visible void __irq_entry smp_threshold_interrupt(struct pt_regs *r)

Please use the canonical name, i.e. 'regs'.

Thanks,

Ingo


Re: [RFC PATCH 00/12] locking/lockdep: Add a new class of terminal locks

2018-11-11 Thread Ingo Molnar


* Waiman Long  wrote:

> > Could you please measure a locking intense workload instead, such as:
> >
> >$ perf stat --null --sync --repeat 10 perf bench sched messaging
> >
> > and profile which locks used there could be marked terminal, and measure 
> > the before/after performance impact?
> 
> I will run the test. It will probably be done after the LPC next week.

Thanks!

> >> Below were selected output lines from the lockdep_stats files of the
> >> patched and unpatched kernels after bootup and running parallel kernel
> >> builds.
> >>
> >>   Item                     Unpatched kernel  Patched kernel  % Change
> >>   ----                     ----------------  --------------  --------
> >>   direct dependencies                  9732            8994     -7.6%
> >>   dependency chains                   18776           17033     -9.3%
> >>   dependency chain hlocks             76044           68419    -10.0%
> >>   stack-trace entries                110403          104341     -5.5%
> > That's pretty impressive!
> >
> >> There were some reductions in the size of the lockdep tables. They were
> >> not significant, but it is still a good start to rein in the number of
> >> entries in those tables to make it harder to overflow them.
> > Agreed.
> >
> > BTW., if you are interested in more radical approaches to optimize 
> > lockdep, we could also add a static checker via objtool driven call graph 
> > analysis, and mark those locks terminal that we can prove are terminal.
> >
> > This would require the unified call graph of the kernel image and of all 
> > modules to be examined in a final pass, but that's within the principal 
> > scope of objtool. (This 'final pass' could also be done during bootup, at 
> > least in initial versions.)
> >
> > Note that beyond marking it 'terminal' such a static analysis pass would 
> > also allow the detection of obvious locking bugs at the build (or boot) 
> > stage already - plus it would allow the disabling of lockdep for 
> > self-contained locks that don't interact with anything else.
> >
> > I.e. the static analysis pass would 'augment' lockdep and leave only 
> > those locks active for runtime lockdep tracking whose dependencies it 
> > cannot prove to be correct yet.
> 
> It is a pretty interesting idea to use objtool to scan for locks. The
> list of locks that I marked as terminal in this patch was found by
> looking at /proc/lockdep for those that only have backward dependencies,
> but no forward dependency. I focused on those with a large number of BDs
> and check the code to see if they could marked as terminal. This is a
> rather labor intensive process and is subject to error.

Yeah.

> [...] It would be nice if it can be done by an automated tool. So I am 
> going to look into that, but it won't be part of this initial patchset, 
> though.

Of course!

> I sent this patchset out to see if anyone has any objection to it. It
> seems you don't have any objection to that. So I am going to move ahead
> to do more testing and performance analysis.

The one worry I have is that this interim solution removes the benefit of 
a proper static analysis method.

But if you promise to make a serious effort on the static analysis 
tooling as well (which should have awesome performance results and 
automate the manual markup), then I have no fundamental objections to the 
interim approach either.

If static analysis works as well as I expect it to then in principle we 
might even be able to have lockdep enabled in production kernels: it 
would only add overhead to locks that are overly complex - which would 
create incentives to improve those dependencies.

Thanks,

Ingo


Re: [RFC PATCH 00/12] locking/lockdep: Add a new class of terminal locks

2018-11-11 Thread Ingo Molnar


* Waiman Long  wrote:

> On 11/10/2018 09:10 AM, Peter Zijlstra wrote:
> > On Fri, Nov 09, 2018 at 09:04:12AM +0100, Ingo Molnar wrote:
> >> BTW., if you are interested in more radical approaches to optimize 
> >> lockdep, we could also add a static checker via objtool driven call graph 
> >> analysis, and mark those locks terminal that we can prove are terminal.
> >>
> >> This would require the unified call graph of the kernel image and of all 
> >> modules to be examined in a final pass, but that's within the principal 
> >> scope of objtool. (This 'final pass' could also be done during bootup, at 
> >> least in initial versions.)
> >
> > Something like this is needed for objtool LTO support as well. I just
> > dread the build time 'regressions' this will introduce :/
> >
> > The final link pass is already by far the most expensive part (as
> > measured in wall-time) of building a kernel, adding more work there
> > would really suck :/
> 
> I think the idea is to make objtool have the capability to do that. It
> doesn't mean we need to turn it on by default in every build.

Yeah.

Also note that much of the objtool legwork would be on a per file basis 
which is reasonably parallelized already. On x86 it's also already done 
for every ORC build i.e. every distro build and the incremental overhead 
from also extracting locking dependencies should be reasonably small.

The final search of the global graph would be serialized but still 
reasonably fast as these are all 'class' level dependencies which are 
much less numerous than runtime dependencies.

I.e. I think we are talking about tens of thousands of dependencies, not 
tens of millions.

At least in theory. ;-)

Thanks,

Ingo


Re: [PATCH RFC 0/3] Static calls

2018-11-11 Thread Ingo Molnar


* Josh Poimboeuf  wrote:

> On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > - I'm not sure about the objtool approach.  Objtool is (currently)
> > >   x86-64 only, which means we have to use the "unoptimized" version
> > >   everywhere else.  I may experiment with a GCC plugin instead.
> > 
> > I'd prefer the objtool approach. It's a pretty reliable first-principles 
> > approach while GCC plugin would have to be replicated for Clang and any 
> > other compilers, etc.
> 
> The benefit of a plugin is that we'd only need two of them: GCC and
> Clang.  And presumably, they'd share a lot of code.
> 
> The prospect of porting objtool to all architectures is going to be much
> more of a daunting task (though we are at least already considering it
> for some arches).

Which architectures would benefit from ORC support the most?

I really think that hard reliance on GCC plugins is foolish - but maybe 
Clang's plugin infrastructure is a guarantee that it remains a sane and 
usable interface.

> > I'd be very happy with a demonstrated paravirt optimization already - 
> > i.e. seeing the before/after effect on the vmlinux with an x86 distro 
> > config.
> > 
> > All major Linux distributions enable CONFIG_PARAVIRT=y and 
> > CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much 
> > as possible in the 99.999% cases where it's not used is a primary 
> > concern.
> 
> For paravirt, I was thinking of it as more of a cleanup than an
> optimization.  The paravirt patching code already replaces indirect
> branches with direct ones -- see paravirt_patch_default().
> 
> Though it *would* reduce the instruction footprint a bit, as the 7-byte
> indirect calls (later patched to 5-byte direct + 2-byte nop) would
> instead be 5-byte direct calls to begin with.

Yes.

> > All other usecases are bonus, but it would certainly be interesting to 
> > investigate the impact of using these APIs for tracing: that too is a 
> > feature enabled everywhere but utilized only by a small fraction of Linux 
> > users - so literally every single cycle or instruction saved or hot-path 
> > shortened is a major win.
> 
> With retpolines, and with tracepoints enabled, it's definitely a major
> win.  Steve measured an 8.9% general slowdown on hackbench caused by
> retpolines.

How much of that slowdown is reversed?

> But with tracepoints disabled, I believe static jumps are used, which
> already minimizes the impact on hot paths.

Yeah.

Thanks,

Ingo


Re: [PATCH v2 3/3] sched/fair: add lsub_positive and use it consistently

2018-11-11 Thread Ingo Molnar


* Patrick Bellasi  wrote:

> The following pattern:
> 
>var -= min_t(typeof(var), var, val);
> 
> is used multiple times in fair.c.
> 
> The existing sub_positive() already capture that pattern but it adds
> also explicit load-sotre to properly support lockless observations.
> In other cases, the patter above is used to update local, and/or not
> concurrently accessed, variables.

> Let's add a simpler version of sub_positive, targeted to local variables
> updates, which gives the same readability benefits at calling sites
> without enforcing {READ,WRITE}_ONCE barriers.

> +/*
> + * Remove and clamp on negative, from a local variable.
> + *
> + * A variant of sub_positive which do not use explicit load-store
> + * and thus optimized for local variable updates.

I fixed up the two typos ('load-sotre', 'patter'), and fixed eight 
grammar mistakes (!) in the changelog and in the code comments, but 
*please* read the changelogs and code you are writing, this is scheduler 
code after all ...

( Please also use the fn() notation in changelogs consistently in the 
  future: first you use 'sub_positive()' correctly then it becomes 
  'sub_positive'. )

Anyway, the fix looks sane so I've applied it to sched/urgent.

Thanks,

Ingo


Re: [PATCH v4 06/10] x86/alternative: use temporary mm for text poking

2018-11-11 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> On Sun, Nov 11, 2018 at 08:53:07PM +, Nadav Amit wrote:
> 
> > >> +/*
> > >> + * The lock is not really needed, but this allows to avoid 
> > >> open-coding.
> > >> + */
> > >> +	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
> > >> +
> > >> +/*
> > >> + * If we failed to allocate a PTE, fail. This should *never* 
> > >> happen,
> > >> + * since we preallocate the PTE.
> > >> + */
> > >> +if (WARN_ON_ONCE(!ptep))
> > >> +goto out;
> > > 
> > > Since we hard rely on init getting that right; can't we simply get rid
> > > of this?
> > 
> > This is a repeated complaint of yours, which I do not feel comfortable with.
> > One day someone will run some static analysis tool and start finding that
> > all these checks are missing.
> > 
> > The question is why do you care about them.
> 
> Mostly because they should not be happening, ever.

Since get_locked_pte() might in principle return NULL, it's an entirely 
routine pattern to check the return for NULL. This will save reviewer 
time in the future.

> [...] And if they happen, there really isn't anything sensible we can 
> do about it.

Warning about it is 'something', even if we crash afterwards, isn't it?

> > If it is because they affect the
> > generated code and make it less efficient, I can fully understand and 
> > perhaps
> > we should have something like PARANOID_WARN_ON_ONCE() which compiles into 
> > nothing
> > unless a certain debug option is set.
> > 
> > If it is about the way the source code looks - I guess it doesn’t sore my
> > eyes as hard as some other stuff, and I cannot do much about it (other than
> > removing it as you asked).
> 
> And yes on the above two points. It adds both runtime overhead (albeit
> trivially small) and code complexity.

It's trivially small cycle level overhead in something that will be 
burdened by two TLB flushes anyway and is utterly slow to begin with.

> > >> +out:
> > >> +if (memcmp(addr, opcode, len))
> > >> +r = -EFAULT;
> > > 
> > > How could this ever fail? And how can we reliably recover from that?
> > 
> > This code has been there before (with slightly uglier code). Before this
> > patch, a BUG_ON() was used here. However, I noticed that kgdb actually
> > checks that text_poke() succeeded after calling it and gracefully fail.
> > However, this was useless, since text_poke() would panic before kgdb gets
> > the chance to do anything (see patch 7).
> 
> Yes, I know it was there before, and I did see kgdb do it too. But aside
> from that out-label case, which we also should never hit, how can we
> realistically ever fail that memcmp()?
> 
> If we fail here, something is _seriously_ buggered.

So wouldn't it be better to just document and verify our assumptions of 
this non-trivial code by using return values intelligently?

I mean, being worried about overhead would be legitimate in the syscall 
entry code. In code patching code, which is essentially a slow path, we 
should be much more worried about *robustness*.

Thanks,

Ingo


Re: [RFC PATCH 00/12] locking/lockdep: Add a new class of terminal locks

2018-11-09 Thread Ingo Molnar


* Waiman Long  wrote:

> The purpose of this patchset is to add a new class of locks called
> terminal locks and converts some of the low level raw or regular
> spinlocks to terminal locks. A terminal lock does not have forward
> dependency and it won't allow a lock or unlock operation on another
> lock. Two level nesting of terminal locks is allowed, though.
> 
> Only spinlocks that are acquired with the _irq/_irqsave variants or
> acquired in an IRQ disabled context should be classified as terminal
> locks.
> 
> Because of the restrictions on terminal locks, we can do simple checks on
> them without using the lockdep lock validation machinery. The advantages
> of making these changes are as follows:
> 
>  1) The lockdep check will be faster for terminal locks without using
> the lock validation code.
>  2) It saves table entries used by the validation code and hence make
> it harder to overflow those tables.
> 
> In fact, it is possible to overflow some of the tables by running
> a variety of different workloads on a debug kernel. I have seen bug
> reports about exhausting MAX_LOCKDEP_KEYS, MAX_LOCKDEP_ENTRIES and
> MAX_STACK_TRACE_ENTRIES. This patch will help to reduce the chance
> of overflowing some of the tables.
> 
> Performance wise, there was no statistically significant difference in
> performance when doing a parallel kernel build on a debug kernel.

Could you please measure a locking intense workload instead, such as:

   $ perf stat --null --sync --repeat 10 perf bench sched messaging

and profile which locks used there could be marked terminal, and measure 
the before/after performance impact?

> Below were selected output lines from the lockdep_stats files of the
> patched and unpatched kernels after bootup and running parallel kernel
> builds.
> 
>   Item                     Unpatched kernel  Patched kernel  % Change
>   ----                     ----------------  --------------  --------
>   direct dependencies                  9732            8994     -7.6%
>   dependency chains                   18776           17033     -9.3%
>   dependency chain hlocks             76044           68419    -10.0%
>   stack-trace entries                110403          104341     -5.5%

That's pretty impressive!

> There were some reductions in the size of the lockdep tables. They were
> not significant, but it is still a good start to rein in the number of
> entries in those tables to make it harder to overflow them.

Agreed.

BTW., if you are interested in more radical approaches to optimize 
lockdep, we could also add a static checker via objtool driven call graph 
analysis, and mark those locks terminal that we can prove are terminal.

This would require the unified call graph of the kernel image and of all 
modules to be examined in a final pass, but that's within the principal 
scope of objtool. (This 'final pass' could also be done during bootup, at 
least in initial versions.)

Note that beyond marking it 'terminal' such a static analysis pass would 
also allow the detection of obvious locking bugs at the build (or boot) 
stage already - plus it would allow the disabling of lockdep for 
self-contained locks that don't interact with anything else.

I.e. the static analysis pass would 'augment' lockdep and leave only 
those locks active for runtime lockdep tracking whose dependencies it 
cannot prove to be correct yet.

Thanks,

Ingo


Re: [PATCH RFC 0/3] Static calls

2018-11-08 Thread Ingo Molnar


* Ingo Molnar  wrote:

> > - Does this feature have much value without retpolines?  If not, should
> >   we make it depend on retpolines somehow?
> 
> Paravirt patching, as you mention in your later reply?

BTW., to look for candidates of this API, I'd suggest looking at the 
function call frequency of my (almost-)distro kernel vmlinux:

  $ objdump -d vmlinux | grep -w callq | cut -f3- | sort | uniq -c | sort -n | tail -100

which gives:

502 callq  8157d050 
522 callq  81aaf420 
536 callq  81547e60 <_copy_to_user>
615 callq  81a97700 
624 callq  *0x82648428
624 callq  810cc810 <__might_sleep>
625 callq  81a93b90 
649 callq  81547dd0 <_copy_from_user>
651 callq  811ba930 
654 callq  8170b6f0 <_dev_warn>
691 callq  81a93790 
693 callq  81a88dc0 
709 callq  *0x82648438
723 callq  811bdbd0 
735 callq  810feac0 
750 callq  8163e9f0 
768 callq  *0x82648430
814 callq  81ab2710 <_raw_spin_lock_irq>
841 callq  81a9e680 <__memcpy>
863 callq  812ae3d0 <__kmalloc>
899 callq  8126ac80 <__might_fault>
912 callq  81ab2970 <_raw_spin_unlock_irq>
939 callq  81aaaf10 <_cond_resched>
966 callq  811bda00 
   1069 callq  81126f50 
   1078 callq  81097760 <__warn_printk>
   1081 callq  8157b140 <__dynamic_dev_dbg>
   1351 callq  8170b630 <_dev_err>
   1365 callq  811050c0 
   1373 callq  81a977f0 
   1390 callq  8157b090 <__dynamic_pr_debug>
   1453 callq  8155c650 <__list_add_valid>
   1501 callq  812ad6f0 
   1509 callq  8155c6c0 <__list_del_entry_valid>
   1513 callq  81310ce0 
   1571 callq  81ab2780 <_raw_spin_lock_irqsave>
   1624 callq  81ab29b0 <_raw_spin_unlock_irqrestore>
   1661 callq  81126fd0 
   1986 callq  81104940 
   2050 callq  811c5110 
   2133 callq  81102c70 
   2507 callq  81ab2560 <_raw_spin_lock>
   2676 callq  81aadc40 
   3056 callq  81ab2900 <_raw_spin_unlock>
   3294 callq  81aac610 
   3628 callq  81129100 
   4462 callq  812ac2c0 
   6454 callq  8111a51e 
   6676 callq  81101420 
   7328 callq  81e014b0 <__x86_indirect_thunk_rax>
   7598 callq  81126f30 
   9065 callq  810979f0 <__stack_chk_fail>

The most prominent callers which are already function call pointers today 
are:

  $ objdump -d vmlinux | grep -w callq | grep \* | cut -f3- | sort | uniq -c | sort -n | tail -10

109 callq  *0x82648530
134 callq  *0x82648568
154 callq  *0x826483d0
260 callq  *0x826483d8
297 callq  *0x826483e0
345 callq  *0x82648440
345 callq  *0x82648558
624 callq  *0x82648428
709 callq  *0x82648438
768 callq  *0x82648430

That's all pv_ops->*() method calls:

   82648300 D pv_ops
   826485d0 D pv_info

Optimizing those thousands of function pointer calls would already be a 
nice improvement.

But retpolines:

   7328 callq  81e014b0 <__x86_indirect_thunk_rax>

  81e014b0 <__x86_indirect_thunk_rax>:
  81e014b0:   ff e0   jmpq   *%rax

... are even more prominent, and turned on in every distro as well, 
obviously.

Thanks,

Ingo


Re: [PATCH RFC 0/3] Static calls

2018-11-08 Thread Ingo Molnar


* Josh Poimboeuf  wrote:

> These patches are related to two similar patch sets from Ard and Steve:
> 
> - https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
> 
> The code is also heavily inspired by the jump label code, as some of the
> concepts are very similar.
> 
> There are three separate implementations, depending on what the arch
> supports:
> 
>   1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
>  objtool and a small amount of arch code
>   
>   2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
>  a small amount of arch code
>   
>   3) If no arch support, fall back to regular function pointers
> 
> 
> TODO:
> 
> - I'm not sure about the objtool approach.  Objtool is (currently)
>   x86-64 only, which means we have to use the "unoptimized" version
>   everywhere else.  I may experiment with a GCC plugin instead.

I'd prefer the objtool approach. It's a pretty reliable first-principles 
approach while GCC plugin would have to be replicated for Clang and any 
other compilers, etc.

> - Does this feature have much value without retpolines?  If not, should
>   we make it depend on retpolines somehow?

Paravirt patching, as you mention in your later reply?

> - Find some actual users of the interfaces (tracepoints? crypto?)

I'd be very happy with a demonstrated paravirt optimization already - 
i.e. seeing the before/after effect on the vmlinux with an x86 distro 
config.

All major Linux distributions enable CONFIG_PARAVIRT=y and 
CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much 
as possible in the 99.999% cases where it's not used is a primary 
concern.

All other usecases are bonus, but it would certainly be interesting to 
investigate the impact of using these APIs for tracing: that too is a 
feature enabled everywhere but utilized only by a small fraction of Linux 
users - so literally every single cycle or instruction saved or hot-path 
shortened is a major win.

Thanks,

Ingo


Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-08 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> +Line breaks
> +^^^
> +
> +Restricting line length to 80 characters makes deeply indented code hard to
> +read.  Consider breaking out code into helper functions to avoid excessive
> +line breaking.
> +
> +The 80 character rule is not a strict rule, so please use common sense when
> +breaking lines. Especially format strings should never be broken up.

Might make sense to explain that:

  + The reason for that rule is that if for example we have this printk line:
  +
  +	if (uh_oh()) {
  +		pr_info("Something really bad happened, danger"
  +			"danger, blue smoke reported!\n");
  +	}
  +
  + People would see this message in the syslog:
  +
  +   Thu Nov  8 09:22:33: Something really bad happened, dangerdanger, blue 
smoke reported!
  +
  + And chances are that in sheer panic they'd type the most distinctive 
  + part of that text as a search pattern for the kernel source tree:
  +
  +   $ git grep -i 'dangerdanger'
  +   $
  +
  + ... and they'd get absolutely no match on that string due to the 
  + col80 broken format string, and confusion and frustration would rise, 
  + in addition to growing amounts of blue smoke.
  +
  + We don't want that, so just write out the single line:

  + if (uh_oh())
  + pr_info("Something really bad happened, danger danger, 
blue smoke reported!\n");
  +
  + Also note two other advantages of writing it like this:
  +
  +  - We saved two curly braces.
  +  - We also added a proper space to 'danger danger' which was the original 
intended message.

?

> +
> +When splitting function declarations or function calls, then please align
> +the first argument in the second line with the first argument in the first
> +line::
> +
> +  static int long_function_name(struct foobar *barfoo, unsigned int id,
> + unsigned int offset)
> +  {
> +
> + if (!id) {
> + ret = longer_function_name(barfoo, DEFAULT_BARFOO_ID,
> +offset);

BTW., in this particular case I think small violations of col80 rule are 
even more readable, i.e.:

> + ret = longer_function_name(barfoo, DEFAULT_BARFOO_ID, offset);

And note that in this example we used 78 columns so we didn't even violate 
the col80 rule. ;-)

Thanks,

Ingo


Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-08 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> +Commit notifications
> +
> +
> +The tip tree is monitored by a bot for new commits. The bot sends an email
> +for each new commit to a dedicated mailing list
> +(``linux-tip-comm...@vger.kernel.org``) and Cc's all people who are
> +mentioned in one of the commit tags. It uses the email message id from the
> +Link tag at the end of the tag list to set the In-Reply-To email header so
> +the message is properly threaded with the patch submission email.

s/id
 /ID

> +
> +The maintainers try to reply to the submitter when merging a patch, but
> +they sometimes forget or it does not fit the workflow of the moment. While
> +the bot message is purely mechanical assume it implies a 'Thank you!
> +Applied.'.

s/The maintainers
 /The -tip maintainers and submaintainers

s/is purely mechanical assume it implies a 'Thank you! Applied.'
 /is purely mechanical, it also implies a 'Thank you! Applied.' message.

... added a missing comma, plus there's nothing to assume, that the 
thank-you note is implied is a given! :-)

Thanks,

Ingo


Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-08 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> +Namespaces
> +^^^^^^^^^^
> +
> +To improve readability and to allow easy grepping for information the usage
> +of consistent namespaces is important. The namespace should be used as a
> +prefix for globally visible (inline) functions and variables. A simple rule
> +for choosing a namespace prefix is to combine the subsystem and the
> +component name, e.g. 'x86_comp\_', 'sched\_', 'irq\_', 'mutex\_', etc. For
> +static functions and variables the namespace prefix can be omitted.
> +
> +Also be careful when using vendor names in namespaces. There is no value of
> +having 'xxx_vendor\_' or 'vendor_xxx\_' as prefix for all static functions
> +of a vendor specific file as it is already clear that the code is vendor
> +specific. Aside of that vendor names should only be used when it is really
> +vendor specific functionality.
> +
> +As always apply common sense and aim for consistency and readability.

I'd also suggest adding:

> +This also includes static file scope functions that are immediately put into 
> +globally visible driver templates - it's useful for those symbols to carry a
> +good prefix as well, for backtrace readability.
> +
> +Truly local functions, only called by other local functions, can have 
> +shorter descriptive names - our primary concern is greppability and 
> +backtrace readability.

Beyond the driver-template aspect also note the backtrace readability 
argument I added: good namespaces are not just about greppability.

Thanks,

Ingo


Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-08 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> +Variable types
> +^^^^^^^^^^^^^^
> +
> +Please use the proper u8, u16, u32, u64 types for variables which are meant
> +to describe hardware or are used as arguments for functions which access
> +hardware. These types are clearly defining the bit width and avoid
> +truncation, expansion and 32/64 bit confusion.
> +
> +u64 is also recommended in code which would become ambiguous for 32bit when
> +'unsigned long' would be used instead. While in such situations 'unsigned
> +long long' could be used as well, u64 is shorter and also clearly shows
> +that the operation is required to be 64bit wide independent of the target
> +CPU.
> +
> +Please use 'unsigned int' instead of 'unsigned'.

s/for 32bit
 /for 32-bit kernels

s/64bit wide
 /64 bits wide

> +Constants
> +^^^^^^^^^
> +
> +Please do not use literal (hexa)decimal numbers in code or initializers.
> +Either use proper defines which have descriptive names or consider using
> +an enum.

I believe there should be an exception for 'obvious' literal values like 
0 and 1.

I.e. the above is mostly a rule that is intended to cover undocumented 
'magic' numbers.

I.e. how about this wording:

  +Constants
  +^^^^^^^^^
  +
  +Please do not use magic literal (hexa)decimal numbers when interfacing
  +with hardware where the number has an unclear origin in code or 
  +initializers. I.e. "no magic numbers".
  +
  +Either use proper defines which have descriptive names or use an enum.
  +
  +Using obvious 0/1 literal values is fine in most cases.

?

> +
> +
> +Struct declarations and initializers
> +
> +
> +Struct declarations should align the struct member names in a tabular
> +fashion::
> +
> + struct bar_order {
> + unsigned int    guest_id;
> + int ordered_item;
> + struct menu *menu;
> + };
> +
> +Please avoid documenting struct members within the declaration, because
> +this often results in strangely formatted comments and the struct members
> +become obfuscated::
> +
> + struct bar_order {
> + unsigned int    guest_id;    /* Unique guest id */

[ Sidenote: there's whitespace damage (extra spaces) in the text here. ]

> + int ordered_item;
> + /* Pointer to a menu instance which contains all the drinks */
> + struct menu *menu;
> + };
> +
> +Instead, please consider using the kernel-doc format in a comment preceding
> +the struct declaration, which is easier to read and has the added advantage
> +of including the information in the kernel documentation, for example, as
> +follows::

I disagree slightly here. While adding kernel-doc format is fine of 
course, so are in-line comments which I frequently use.

This form is particularly helpful for more complex structures. Have a 
look at 'struct fpu' for example:


/*
 * Highest level per task FPU state data structure that
 * contains the FPU register state plus various FPU
 * state fields:
 */
struct fpu {
/*
 * @last_cpu:
 *
 * Records the last CPU on which this context was loaded into
 * FPU registers. (In the lazy-restore case we might be
 * able to reuse FPU registers across multiple context switches
 * this way, if no intermediate task used the FPU.)
 *
 * A value of -1 is used to indicate that the FPU state in context
 * memory is newer than the FPU state in registers, and that the
 * FPU state should be reloaded next time the task is run.
 */
unsigned int            last_cpu;

/*
 * @initialized:
 *
 * This flag indicates whether this context is initialized: if the task
 * is not running then we can restore from this context, if the task
 * is running then we should save into this context.
 */
unsigned char   initialized;

/*
 * @state:
 *
 * In-memory copy of all FPU registers that we save/restore
 * over context switches. If the task is using the FPU then
 * the registers in the FPU are more recent than this state
 * copy. If the task context-switches away then they get
 * saved here and represent the FPU state.
 */
union fpregs_state  state;
/*
 * WARNING: 'state' is dynamically-sized.  Do not put
 * anything after it here.
 */
};

The in-line freestanding comment style is perfectly structured and readable as 
well, and this is analogous to the 'freestanding comments' style for C 
statements.

We also have occasional examples where tail comments are fine, such as:

/*
 * The legacy x87 FPU state format, as saved by FSAVE and
 * restored by the FRSTOR instructions:
 */
struct fregs_state {
u32 cwd;        /* FPU Control Word */
u32 swd;/* FPU 

Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-07 Thread Ingo Molnar



Lemme fill in the scheduler and locking/atomics bits as well:

> +The tip tree contains the following subsystems:
> +
> +   - **x86 architecture**
> +
> + The x86 architecture development takes place in the tip tree except
> + for the x86 KVM and XEN specific parts which are maintained in the
> + corresponding subsystems and routed directly to mainline from
> + there. It's still good practice to Cc the x86 maintainers on
> + x86-specific KVM and XEN patches.
> +
> + Some x86 subsystems have their own maintainers in addition to the
> + overall x86 maintainers.  Please Cc the overall x86 maintainers on
> + patches touching files in arch/x86 even when they are not called out
> + by the MAINTAINER file.
> +
> + Note, that ``x...@kernel.org`` is not a mailing list. It is merely a
> + mail alias which distributes mails to the x86 top-level maintainer
> + team. Please always Cc the Linux Kernel mailing list (LKML)
> + ``linux-kernel@vger.kernel.org``, otherwise your mail ends up only in
> + the private inboxes of the maintainers.
> +
> +   - **Scheduler**

Scheduler development takes place in the -tip tree, in the 
sched/core branch - with occasional sub-topic trees for 
work-in-progress patch-sets.

> +
> +   - **Locking and atomics**

Locking development (including atomics and other synchronization 
primitives that are connected to locking) takes place in the -tip 
tree, in the locking/core branch - with occasional sub-topic 
trees for work-in-progress patch-sets.

Thanks,

Ingo


Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-07 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> + - Signed-off-by: ``Patch handler ``
> +
> +   SOBs after the author SOB are from people handling and transporting the
> +   patch, but were not involved in development. If the handler made
> +   modifications to the patch or the changelog, then this should be
> +   mentioned **after** the changelog text and **above** all commit tags in
> +   the following format::
> +
> + ... changelog text ends.
> +
> + [ handler: Replaced foo by bar and updated changelog ]
> +
> + First-tag: .
> +
> +   Note the two empty new lines which separate the changelog text and the
> +   commit tags from that notice.

Even after a decade of introducing Git I still see Signed-off-by used as 
an Acked-by or Reviewed-by substitute, so I'd suggest adding this small 
explanation as well:

  +   SOB chains should reflect the *real* route a patch took as it was 
  +   propagated to us, with the first SOB entry signalling primary
  +   authorship of a single author. Acks should be given as Acked-by 
  +   lines and review approvals as Reviewed-by lines.


> +   If a patch is sent to the mailing list by a handler then the author has
> +   to be noted in the first line of the changelog with::
> +
> + From: ``Author ``
> +
> + Changelog text starts here
> +
> +   so the authorship is preserved. The 'From:' line has to be followed by an
> +   empty newline. If that 'From:' line is missing, then the patch would be
> +   attributed to the person who sent (transported) it. The 'From:' line is
> +   automatically removed when the patch is applied and does not show up in
> +   the final git changelog. It merely affects the authorship information of
> +   the resulting git commit.

s/(transported)
 /(transported, handled)

to connect the text with the whole 'handler' language used before?

and since we are not talking about the 'git command', maybe also:

s/git
 /Git

?

> + - Cc: ``cc-ed-person ``
> +
> +   If the patch should be backported to stable, then please add a '``Cc:
> +   sta...@vger.kernel.org``' tag, but do not Cc stable when sending your
> +   mail.

Can I suggest a more canonical form:

Cc:  # v4.18 and later kernels

It would be nice if people adding Cc: stable lines would actually try to 
figure out which exact kernel versions are affected.

Also the '<>' form makes it easier to read and my email client will also 
syntax highlight it in that case. ;-)


> + - Link: ``https://link/to/information``
> +
> +   For referring to email on LKML or other kernel mailing lists, please use
> +   the lkml.kernel.org redirector URL::

s/referring to email
 /referring to an email

> +
> + https://lkml.kernel.org/r/email-message@id
> +
> +   The kernel.org redirector is considered a stable URL unlike other email
> +   archives.

s/URL unlike
 /URL, unlike

?

Thanks,

Ingo


Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-07 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> + - Fixes: 12char-SHA1 ("sub/sys: Original subject line")
> +
> +   A Fixes tag should be added even for changes which do not need to be
> +   backported to stable kernels, i.e. when addressing a recently introduced
> +   issue which only affects tip or the current head of mainline. These tags
> +   are helpful to identify the original commit and are much more valuable
> +   than prominently mentioning the commit which introduced a problem in the
> +   text of the changelog itself because they can be automatically
> +   extracted.
> +
> +   The following example illustrates the difference::
> +
> + Commit
> +
> +   abcdef012345678 ("x86/xxx: Replace foo with bar")
> +
> + left an unused instance of variable foo around. Remove it.
> +
> + Signed-off-by: J.Dev 
> +
> +   Please say instead::
> +
> + The recent replacement of foo with bar left an unused instance of
> + variable foo around. Remove it.
> +
> + Fixes: abcdef012345678 ("x86/xxx: Replace foo with bar")
> + Signed-off-by: J.Dev 

Let me extend this policy element, I frequently write out commits in the 
changelog itself *as well*, because that's where I utilize it myself when 
reading other people's changelogs.

I.e. I would convert this:

 The recent replacement of left with right left an unused instance of
 variable left around. Remove it.

 Fixes: abcdef012345678 ("x86/xxx: Replace 'left' with 'right'")
 Signed-off-by: J.Dev 

... to the following form:

 Two years ago the following commit:

   abcdef012345678 ("x86/xxx: Replace 'left' with 'right'")

 ... left an unused instance of the variable 'left' around. Remove it.

 Fixes: abcdef012345678 ("x86/xxx: Replace 'left' with 'right'")
 Signed-off-by: J.Dev 

This changelog style, while more verbose, has a couple of advantages:

 - It's a bad practice to force the reader to go to the tags section, fish
   out a commit ID, just to be able to see the original commit. 
   Especially with longer changelogs and with changelogs mentioning 
   multiple source commits, in-lining the commit ID is useful.

 - Also note how this style allows for human-readable time information to
   be inserted - which can be important to backporters. While an unused
   variable warning might not be backported, in other cases the time
   information can be useful in prioritizing the backporting.

 - Also note another pet peeve of mine: the quotation marks around the 
   variable names 'left' and 'right'. I changed the variable names to 
   English words that are ambiguous in free-flowing changelog text, just
   to illustrate how important it can be to escape them for better
   readability.

The 'Fixes' tag is mainly a standard tag that backporter tooling can 
search for - otherwise for human readers the in-line explanation is more 
useful.

In really trivial cases the inlining can be skipped and only a 'Fixes' tag 
is perfectly sufficient.

Thanks,

Ingo


Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-07 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> +Backtraces in changelogs
> +
> +
> +Backtraces can be useful to document the call chain which led to a
> +problem. Though not all backtraces are really valuable because the call
> +chain is unique and obvious, e.g. in early boot code. Just copying the full
> +dmesg output is adding a lot of distracting information like timestamps,
> +module lists, register and stack dumps etc.
> +
> +Reducing the backtrace to the real relevant data helps to concentrate on
> +the issue and not being distracted by distilling the relevant information
> +out of the dump. Here is an example of a well trimmed backtrace::
> +
> +  unchecked MSR access error: WRMSR to 0xd51 (tried to write 0x
> +  0064) at rIP: 0xae059994 (native_write_msr+0x4/0x20)
> +  Call Trace:
> +  mba_wrmsr+0x41/0x80
> +  update_domains+0x125/0x130
> +  rdtgroup_mkdir+0x270/0x500

Yeah, so I frequently simplify such backtraces even more, i.e.:

> +  unchecked MSR access error: WRMSR to 0xd51 (tried to write 0x 
> 0064) at rIP: 0xae059994 (native_write_msr())
> +
> +  Call Trace:
> +mba_wrmsr()
> +update_domains()
> +rdtgroup_mkdir()

Note how the actual MSR contents and the attempted operation's parameters 
are important, the actual hexadecimal offsets of the function call 
backtrace are not. They are useful when trying to do fuzzy version 
matching and in the occasional case when there's a question about which 
exact call chain it is - but those are 0.01% cases really.

See for example this recent commit:

 commit e4a02ed2aaf447fa849e3254bfdb3b9b01e1e520 
(origin/locking-urgent-for-linus, locking-urgent-for-linus)
 Author: Guenter Roeck 
 Date:   Tue Oct 2 14:48:49 2018 -0700

locking/ww_mutex: Fix runtime warning in the WW mutex selftest

If CONFIG_WW_MUTEX_SELFTEST=y is enabled, booting an image
in an arm64 virtual machine results in the following
traceback if 8 CPUs are enabled:

  DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current)
  WARNING: CPU: 2 PID: 537 at kernel/locking/mutex.c:1033 
__mutex_unlock_slowpath+0x1a8/0x2e0
  ...
  Call trace:
   __mutex_unlock_slowpath()
   ww_mutex_unlock()
   test_cycle_work()
   process_one_work()
   worker_thread()
   kthread()
   ret_from_fork()

If requesting b_mutex fails with -EDEADLK, the error variable
is reassigned to the return value from calling ww_mutex_lock
on a_mutex again. If this call fails, a_mutex is not locked.
It is, however, unconditionally unlocked subsequently, causing
the reported warning. Fix the problem by using two error variables.

With this change, the selftest still fails as follows:

  cyclic deadlock not resolved, ret[7/8] = -35

However, the traceback is gone.

The C-style writing of the backtrace is more readable than listing the 
offsets.

Thanks,

Ingo


Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-07 Thread Ingo Molnar


* Ingo Molnar  wrote:

> With tail comments the code looks like this:
> 
>   res = dostuff(); /* We explain something here. */
> 
>   seed = 1; /* Another explanation. */
> 
>   mod_timer(&our_object->our_timer, jiffies + OUR_INTERVAL); /* We like 
> to talk */
> 
>   res = check_stuff(our_object); /* We explain something here. */
>   if (res)
>   return -EINVAL;
> 
>   interval = nontrivial_calculation(); /* Another explanation. */
> 
>   mod_timer(&our_object->our_timer, jiffies + interval); /* This doesn't 
> race, because. */
> 
> ... while with freestanding comments it's:
> 
>   /* We explain something here: */
>   res = check_stuff(our_object);
>   if (res)
>   return -EINVAL;
> 
>   /* Another explanation: */
>   interval = nontrivial_calculation();
> 
>   /* This doesn't race with init_our_stuff(), because: */
>   mod_timer(&our_object->our_timer, jiffies + interval);
> 
> This comment placement style has several advantages:
> 
>   - Comments precede actual code - while in tail comments it's exactly
> the wrong way around.
> 
>   - We don't create over-long lines nor artificially short tail comments 
> just because we were trying to stay within the col80 limit with the 
> "this doesn't race" comment. The full-line comment was able to 
> explain that the race is with init_our_stuff().
> 
>   - Freestanding oneliner comments are much better aligned as well: note 
> how every comment starts at the exact same column, making it very easy 
> to read (or not to read) these comments.
> 
>   - In short: predictable visual parsing rules and proper semantic 
> ordering of information is good, random joining of typographical 
> elements just to create marginally more compact source code is bad.
> 
>   - Just *look* at the tail comments example: it's a visually diffuse 
> jumble of statements and misaligned comments without good structure.

Forgot to mention the role of punctuation:

- Also note how punctuation actually *helps* the integration of 
  comments and code. The ":" will direct attention to the code that 
  follows, making it clear that the sentence explains the next 
  statement. There are exceptions for when different type of 
  punctuation is a better choice, for example:

/* TODO: convert the code to our newest calc API ASAP. */
interval = nontrivial_calculation();

  Here we don't explain the statement but document a TODO entry. 

  Also, it's frequent practice to use multi-line comments to explain 
  the next section of code, with a particular note about the first 
  statement. Proper punctuation helps there too:

/*
 * Establish the timeout interval and use it to set up
 * the timer of our object. The object is going to be
 * freed when the last reference is released.
 *
 * Note, the initial calculation is non-trivial, because
 * our timeout rules have complexity A), B) and C):
 */
interval = nontrivial_calculation();

  Note how part of the multi-line comment describes overall 
  principles of the code to follow, while the last sentence 
  describes a noteworthy aspect of the next C statement.

Thanks,

Ingo


Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-07 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> +Coding style notes
> +------------------
> +
> +Comment style
> +^^^^^^^^^^^^^
> +
> +Sentences in comments start with a uppercase letter.
> +
> +Single line comments::
> +
> + /* This is a single line comment */
> +
> +Multi-line comments::
> +
> + /*
> +  * This is a properly formatted
> +  * multi-line comment.
> +  *
> +  * Larger multi-line comments should be split into paragraphs.
> +  */
> +
> +No tail comments:
> +
> +  Please refrain from using tail comments. Tail comments disturb the
> +  reading flow in almost all contexts, but especially in code::
> +
> + if (somecondition_is_true) /* Don't put a comment here */
> + dostuff(); /* Neither here */
> +
> + seed = MAGIC_CONSTANT; /* Nor here */
> +
> +  Use freestanding comments instead::
> +
> + /* This condition is not obvious without a comment */
> + if (somecondition_is_true) {
> + /* This really needs to be documented */
> + dostuff();
> + }
> +
> + /* This magic initialization needs a comment. Maybe not? */
> + seed = MAGIC_CONSTANT;

Yeah, so I think a better way to visualize and explain the 'no tail 
comments' guideline in -tip is not these more complex code flows, but the 
totally simple linear flows of statements.

With tail comments the code looks like this:

res = dostuff(); /* We explain something here. */

seed = 1; /* Another explanation. */

mod_timer(&our_object->our_timer, jiffies + OUR_INTERVAL); /* We like 
to talk */

res = check_stuff(our_object); /* We explain something here. */
if (res)
return -EINVAL;

interval = nontrivial_calculation(); /* Another explanation. */

mod_timer(&our_object->our_timer, jiffies + interval); /* This doesn't 
race, because. */

... while with freestanding comments it's:

/* We explain something here: */
res = check_stuff(our_object);
if (res)
return -EINVAL;

/* Another explanation: */
interval = nontrivial_calculation();

/* This doesn't race with init_our_stuff(), because: */
mod_timer(&our_object->our_timer, jiffies + interval);

This comment placement style has several advantages:

  - Comments precede actual code - while in tail comments it's exactly
the wrong way around.

  - We don't create over-long lines nor artificially short tail comments 
just because we were trying to stay within the col80 limit with the 
"this doesn't race" comment. The full-line comment was able to 
explain that the race is with init_our_stuff().

  - Freestanding oneliner comments are much better aligned as well: note 
how every comment starts at the exact same column, making it very easy 
to read (or not to read) these comments.

  - In short: predictable visual parsing rules and proper semantic 
ordering of information is good, random joining of typographical 
elements just to create marginally more compact source code is bad.

  - Just *look* at the tail comments example: it's a visually diffuse 
jumble of statements and misaligned comments without good structure.

Do you want me to send a delta patch, or an edited document?

Thanks,

Ingo


Re: [RFC PATCH v1 2/2] proc: add /proc//thread_state

2018-11-07 Thread Ingo Molnar


* Aubrey Li  wrote:

> Expose the per-task cpu specific thread state value, it's helpful
> for userland to classify and schedule the tasks by different policies

That's pretty vague - what exactly would use this information? I'm sure 
you have a usecase in mind - could you please describe it?

Thanks,

Ingo


Re: [PATCH 0/3] x86/cpu: fix some prototype warning

2018-11-07 Thread Ingo Molnar


* Yi Wang  wrote:

> This series of patches fixes some prototype warnings caused
> by missing include files.
> 
> Yi Wang (3):
>   x86/cpu: fix prototype warning in cacheinfo.c
>   x86/cpu: fix prototype warning in scattered.c
>   x86/cpu: fix prototype warning in topology.c
> 
>  arch/x86/kernel/cpu/cacheinfo.c | 1 +
>  arch/x86/kernel/cpu/scattered.c | 2 ++
>  arch/x86/kernel/cpu/topology.c  | 2 ++
>  3 files changed, 5 insertions(+)

Could you please submit this as a single patch?

Thanks,

Ingo


Re: [GIT PULL 00/18] perf/urgent improvements and fixes

2018-11-06 Thread Ingo Molnar


* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling, mostly fixes, some late coming
> improvements in non-core areas,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 29995d296e3e9ce4f9767963ecbef143ade26c36:
> 
>   Merge tag 'perf-urgent-for-mingo-4.20-20181031' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent 
> (2018-10-31 22:53:40 +0100)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-urgent-for-mingo-4.20-20181106
> 
> for you to fetch changes up to 8e88c29b351ed4e09dd63f825f1c8260b0cb0ab3:
> 
>   perf tools: Do not zero sample_id_all for group members (2018-11-06 
> 08:29:56 -0300)
> 
> 
> perf/urgent improvements and fixes:
> 
> Intel PT sql viewer: (Adrian Hunter)
> 
> - Fall back to /usr/local/lib/libxed.so
> - Add Selected branches report
> - Add help window
> - Fix table find when table re-ordered
> 
> Intel PT debug log (Adrian Hunter)
> 
> - Add more event information
> - Add MTC and CYC timestamps
> 
> perf record: (Andi Kleen)
> 
> - Support weak groups, just like with 'perf stat'
> 
> perf trace: (Arnaldo Carvalho de Melo)
> 
> - Start augmenting raw_syscalls:{sys_enter,sys_exit}: goal is to have a
>   generic, arch independent eBPF kernel component that is programmed with
>   syscall table details, what to copy, how many bytes, pid, arg filters from 
> the
>   userspace via eBPF maps by the 'perf trace' tool that continues to use all 
> its
>   argument beautifiers, just taking advantage of the extra pointer contents.
> 
> JVMTI: (Gustavo Romero)
> 
> - Fix undefined symbol scnprintf in libperf-jvmti.so
> 
> perf top: (Jin Yao)
> 
> - Display the LBR stats in callchain entries
> 
> perf stat: (Thomas Richter)
> 
> - Handle different PMU names with common prefix
> 
> arm64: Will (Deacon)
> 
> - Fix arm64 tools build failure wrt smp_load_{acquire,release}.
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Adrian Hunter (6):
>   perf scripts python: exported-sql-viewer.py: Fall back to 
> /usr/local/lib/libxed.so
>   perf scripts python: exported-sql-viewer.py: Add Selected branches 
> report
>   perf scripts python: exported-sql-viewer.py: Add help window
>   perf scripts python: exported-sql-viewer.py: Fix table find when table 
> re-ordered
>   perf intel-pt: Add more event information to debug log
>   perf intel-pt: Add MTC and CYC timestamps to debug log
> 
> Andi Kleen (2):
>   perf evlist: Move perf_evsel__reset_weak_group into evlist
>   perf record: Support weak groups
> 
> Arnaldo Carvalho de Melo (5):
>   perf examples bpf: Start augmenting raw_syscalls:sys_{start,exit}
>   perf trace: When augmenting raw_syscalls plug raw_syscalls:sys_exit too
>   perf trace: Fix setting of augmented payload when using eBPF + 
> raw_syscalls
>   perf augmented_syscalls: Start collecting pathnames in the BPF program
>   perf beauty: Use SRCARCH, ARCH=x86_64 must map to "x86" to find the 
> headers
> 
> Gustavo Romero (1):
>   perf tools: Fix undefined symbol scnprintf in libperf-jvmti.so
> 
> Jin Yao (1):
>   perf top: Display the LBR stats in callchain entry
> 
> Jiri Olsa (1):
>   perf tools: Do not zero sample_id_all for group members
> 
> Thomas Richter (1):
>   perf stat: Handle different PMU names with common prefix
> 
> Will Deacon (1):
>   tools headers barrier: Fix arm64 tools build failure wrt 
> smp_load_{acquire,release}
> 
>  tools/arch/arm64/include/asm/barrier.h | 133 +++---
>  tools/perf/Documentation/perf-list.txt |   1 -
>  tools/perf/Makefile.perf   |   2 +-
>  tools/perf/builtin-record.c|   7 +-
>  tools/perf/builtin-stat.c  |  28 +-
>  tools/perf/builtin-top.c   |   3 +
>  tools/perf/builtin-trace.c |  34 +-
>  tools/perf/examples/bpf/augmented_raw_syscalls.c   | 131 ++
>  tools/perf/jvmti/jvmti_agent.c |  49 +-
>  tools/perf/scripts/python/exported-sql-viewer.py   | 493 
> -
>  tools/perf/tests/attr/test-record-group-sampling   |   1 -
>  tools/perf/util/evlist.c   |  27 ++
>  tools/perf/util/evlist.h   |   3 +
>  tools/perf/util/evsel.c|   1 -
>  .../perf/util/intel-pt-decoder/intel-pt-decoder.c  |   4 +
>  tools/perf/util/intel-pt-decoder/intel-pt-log.c|   5 +
>  tools/perf/util/intel-pt-decoder/intel-pt-log.h|   1 +
>  tools/perf/util/intel-pt.c |  16 +-
>  tools/perf/util/pmu.c  |   2 +-
>  19 files changed, 820 insertions(+), 121 deletions(-)
>  create 

Re: [patch 0/9] time: Add SPDX identifiers and cleanup boilerplates

2018-11-05 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> Add SPDX identifiers to all files in kernel/time and remove the license
> boiler plates.
> 
> Aside of that use the chance to get rid of (stale) file references and tidy
> up the top of file comments as they are touched anyway by this work.
> 
> This work is based on a script and data from Philippe Ombredanne, Kate
> Stewart and myself. The data has been created with two independent license
> scanners and manual inspection.
> 
> Thanks,
> 
>   tglx
> 
> ---
>  include/linux/hrtimer.h  |5 +
>  kernel/time/alarmtimer.c |5 +
>  kernel/time/clockevents.c|6 +-
>  kernel/time/clocksource.c|   20 +---
>  kernel/time/hrtimer.c|   19 +--
>  kernel/time/itimer.c |2 --
>  kernel/time/jiffies.c|   28 ++--
>  kernel/time/posix-clock.c|   17 ++---
>  kernel/time/posix-stubs.c|5 +
>  kernel/time/posix-timers.c   |   25 ++---
>  kernel/time/sched_clock.c|9 +++--
>  kernel/time/test_udelay.c|   10 +-
>  kernel/time/tick-broadcast-hrtimer.c |4 +---
>  kernel/time/tick-broadcast.c |6 +-
>  kernel/time/tick-common.c|6 +-
>  kernel/time/tick-oneshot.c   |6 +-
>  kernel/time/tick-sched.c |5 +
>  kernel/time/time.c   |   13 +
>  kernel/time/timeconst.bc |2 ++
>  kernel/time/timeconv.c   |1 +
>  kernel/time/timecounter.c|   17 ++---
>  kernel/time/timekeeping.c|   11 +++
>  kernel/time/timekeeping_debug.c  |   11 +--
>  kernel/time/timer.c  |3 +--
>  kernel/time/timer_list.c |7 +--
>  25 files changed, 45 insertions(+), 198 deletions(-)

Acked-by: Ingo Molnar 

Thanks,

Ingo


Re: [PATCH v3] x86/vsmp_64.c: Remove dependency on pv_irq_ops

2018-11-05 Thread Ingo Molnar


* Eial Czerwacki  wrote:

> +#if defined(CONFIG_PCI)

This is shorter:

   #ifdef CONFIG_PCI

Thanks,

Ingo


Re: [tip:x86/urgent] x86/vsmp: Remove dependency on pv_irq_ops

2018-11-05 Thread Ingo Molnar


* tip-bot for Eial Czerwacki  wrote:

> Commit-ID:  7f2d7f8190d856949d8aaab55d3902cd10ea1048
> Gitweb: https://git.kernel.org/tip/7f2d7f8190d856949d8aaab55d3902cd10ea1048
> Author: Eial Czerwacki 
> AuthorDate: Mon, 5 Nov 2018 10:01:04 +0200
> Committer:  Thomas Gleixner 
> CommitDate: Mon, 5 Nov 2018 12:33:47 +0100
> 
> x86/vsmp: Remove dependency on pv_irq_ops
> 
> vsmp dependency on pv_irq_ops has been removed some years ago, but the code
> still deals with pv_irq_ops.

s/vsmp
 /vSMP

> +#if defined CONFIG_PCI

I don't think this code was ever tested with PCI disabled?

The above typoed sequence of code accidentally evaluates to 'true' 
regardless of CONFIG_PCI.

Thanks,

Ingo


Re: [PATCH] x86/gart: Rewrite early_gart_iommu_check() comment

2018-11-05 Thread Ingo Molnar


* Borislav Petkov  wrote:

> From: Borislav Petkov 
> 
> ... to actually explain what the function is trying to do.
> 
> Reported-by: Mike Rapoport 
> Signed-off-by: Borislav Petkov 
> ---
>  arch/x86/kernel/aperture_64.c | 25 +++--
>  1 file changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
> index 2c4d5ece7456..4fb4b7f53568 100644
> --- a/arch/x86/kernel/aperture_64.c
> +++ b/arch/x86/kernel/aperture_64.c
> @@ -264,18 +264,23 @@ static int __init parse_gart_mem(char *p)
>  }
>  early_param("gart_fix_e820", parse_gart_mem);
>  
> +/*
> + * With kexec/kdump, if the first kernel doesn't shutdown the GART and the
> + * second kernel allocates a different GART region, there might be two
> + * overlapping GART regions present:

s/shutdown
 /shut down

Thanks,

Ingo

