Re: static_branch/jump_label vs branch merging

2021-04-10 Thread Segher Boessenkool
On Fri, Apr 09, 2021 at 12:33:29PM -0700, Nick Desaulniers wrote:
> Since asm goto is implicitly volatile qualified, it sounds like
> removing the implicit volatile qualifier from asm goto might help?
> Then if there were side effects but you forgot to inform the compiler
> that there were via an explicit volatile qualifier, and it performed
> the suggested merge, oh well.

"asm goto" without outputs is always volatile, just like any other asm
without outputs (if it wasn't, the compiler would always delete every
asm without outputs!)


Segher


RE: static_branch/jump_label vs branch merging

2021-04-10 Thread David Laight
From: David Malcolm
> Sent: 09 April 2021 14:49
...
> With the caveat that my knowledge of GCC's middle-end is mostly about
> implementing warnings, rather than optimization, I did some
> experimentation, with gcc trunk on x86_64 FWIW.
> 
> Given:
> 
> int __attribute__((pure)) foo(void);
> 
> int t(void)
> {
>   int a;
   = 0;
>   if (foo())
> a++;
>   if (foo())
> a++;
>   if (foo())
> a++;
>   return a;
> }
> 
> At -O1 and above this is optimized to a single call to foo, returning 0
> or 3 accordingly.

Interesting horrid idea.
The code generated for the above should be:
callfoo
jz  label
So objtool could find the relocation entries for 'foo'
and use that information to replace the call instruction
(and maybe the jz) with a suitable alternate instruction
sequence.

Although that might end up with a game of 'whack-a-mole'
on the perverse instruction sequences the compiler
generates.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Peter Zijlstra
On Fri, Apr 09, 2021 at 05:07:15PM -0400, David Malcolm wrote:

> You've built a very specific thing out of asm-goto to fulfil the tough
> requirements you outlined above - as well as the nops, there's a thing
> in another section to contend with.
> 
> How to merge these asm-goto constructs?

By calling the function less, you emit less of them. Which then brings
us back to the whole pure/const thing.

> Doing so feels very special-case to the kernel and not something that
> other GCC users would find useful.

Doesn't it boil down to 'fixing' the pure/const vs asm-goto interaction?
I could imagine that having that interaction fixed could allow other
creative uses.


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread David Malcolm
On Fri, 2021-04-09 at 22:09 +0200, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 03:21:49PM -0400, David Malcolm wrote:
> > [Caveat: I'm a gcc developer, not a kernel expert]
> > 
> > But it's not *quite* a global constant, or presumably you would be
> > simply using a global constant, right?  As the optimizer gets
> > smarter,
> > you don't want to have it one day decide that actually it really is
> > constant, and optimize away everything at compile-time (e.g. when
> > LTO
> > is turned on, or whatnot).
> 
> Right; as I said, the result is not a constant, but any invocation
> ever,
> will return the same result. Small but subtle difference :-)
> 
> > I get the impression that you're resorting to assembler because
> > you're
> > pushing beyond what the C language can express. 
> 
> Of course :-) I tend to always push way past what C considers
> sane.
> Lets say I'm firmly in the C-as-Optimizing-Assembler camp :-)

Yeah, I got that :)

> > Taking things to a slightly higher level, am I right in thinking
> > that
> > what you're trying to achieve is a control flow construct that
> > almost
> > always takes one of the given branches, but which can (very rarely)
> > be
> > switched to permanently take one of the other branches, and that
> > you
> > want the lowest possible overhead for the common case where the
> > control flow hasn't been touched yet? 
> 
> Correct, that's what it is. We do runtime code patching to flip the
> branch if/when needed. We've been doing this for many many years now.
> 
> The issue of today is all this clever stuff defeating some simple
> optimizations.

It's certainly clever - though, if you'll forgive me, that's not always
a good thing :)

> > (and presumably little overhead for when it
> > has been?)... and that you want to be able to merge repeated such
> > conditionals. 
> 
> This.. So the 'static' branches have been upstream and in use ever
> since
> GCC added asm-goto, it was in fact the driving force to get asm-goto
> implemented. This was 2010 according to git history.
> 
> So we emit, using asm goto, either a "NOP5" or "JMP.d32" (x86
> speaking),
> and a special section entry into which we encode the key address and
> the
> instruction address and the jump target.
> 
> GCC, not knowing what the asm does, only sees the 2 edges and all is
> well.
> 
> Then, at runtime, when we decide we want the other edge for a given
> key,
> we iterate our section and rewrite the code to either nop5 or jmp.d32
> with the correct jump target.
> 
> > It's kind of the opposite of "volatile" - something that the user
> > is
> > happy for the compiler to treat as not changing much, as opposed to
> > something the user is warning the compiler about changing from
> > under
> > it.  A "const-ish" value?
> 
> Just so. Encoded in text, not data.
> 
> > Sorry if I'm being incoherent; I'm kind of thinking aloud here.
> 
> No problem, we're way outside of what is generally considered normal,
> and I did somewhat assume people were familiar with our 'dodgy'
> construct (some on this list are more than others).
> 
> I hope it's all a little clearer now.

Yeah.  This is actually on two mailing lists; I'm only subscribed to
linux-toolchains, which AIUI is about sharing ideas between Linux and
the toolchains.

You've built a very specific thing out of asm-goto to fulfil the tough
requirements you outlined above - as well as the nops, there's a thing
in another section to contend with.

How to merge these asm-goto constructs?

Doing so feels very special-case to the kernel and not something that
other GCC users would find useful.

I can imagine a GCC plugin that implemented a custom optimization pass
for that - basically something that spots the asm-gotos in the gimple
IR and optimizes away duplicates by replacing them with jumps, but
having read about Linus's feelings about GCC plugins recently:
  https://lwn.net/Articles/851090/
I suspect that that isn't going to fly (and if you're going down the
route of adding an optimization pass via a plugin, there's probably a
way to do that that doesn't involve asm).  In theory, something to
optimize the asm-gotos could be relatively simple, but that said, we
don't really have a GCC plugin API; all of our internal APIs are
exposed, and are liable to change from release to release, which I know
is a pain (I've managed to break one of my own plugins with one of my
own API changes at least once).

Hope this is constructive
Dave




Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Peter Zijlstra
On Fri, Apr 09, 2021 at 12:33:29PM -0700, Nick Desaulniers wrote:
> On Fri, Apr 9, 2021 at 4:18 AM Peter Zijlstra  wrote:
> >
> > On Fri, Apr 09, 2021 at 12:55:18PM +0200, Florian Weimer wrote:
> > > * Ard Biesheuvel:
> > >
> > > > Wouldn't that require the compiler to interpret the contents of the
> > > > asm() block?
> > >
> > > Yes and no.  It would require proper toolchain support, so in this case
> > > a new ELF relocation type, with compiler, assembler, and linker support
> > > to generate those relocations and process them.  As far as I understand
> > > it, the kernel doesn't do things this way.
> >
> > I don't think that all is needed. All we need is for the compiler to
> > recognise that:
> >
> > if (cond) {
> > stmt-A;
> > }
> > if (cond) {
> > stmt-B;
> > }
> >
> > both cond are equivalent and hence can merge the blocks like:
> >
> > if (cond) {
> > stmt-A;
> > stmt-B;
> > }
> >
> > But because @cond is some super opaque asm crap, the compiler throws up
> > it's imaginry hands and says it cannot possibly tell and leaves them as
> > is.
> 
> Right, because if `cond` has side effects (such as is implied by asm
> statements that are volatile qualified), suddenly those side effects
> would only occur once whereas previously they occurred twice.
> 
> Since asm goto is implicitly volatile qualified, it sounds like
> removing the implicit volatile qualifier from asm goto might help?
> Then if there were side effects but you forgot to inform the compiler
> that there were via an explicit volatile qualifier, and it performed
> the suggested merge, oh well.

So, as mentioned elsewhere in this thread, it would be nice if either
the pure or const function attribute could over-ride/constrain that
volatile side effect.

I'm fine with things going side-ways if we get it wrong, that's more or
less the game we're playing anyway ;-)


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Peter Zijlstra
On Fri, Apr 09, 2021 at 03:21:49PM -0400, David Malcolm wrote:
> [Caveat: I'm a gcc developer, not a kernel expert]
> 
> But it's not *quite* a global constant, or presumably you would be
> simply using a global constant, right?  As the optimizer gets smarter,
> you don't want to have it one day decide that actually it really is
> constant, and optimize away everything at compile-time (e.g. when LTO
> is turned on, or whatnot).

Right; as I said, the result is not a constant, but any invocation ever,
will return the same result. Small but subtle difference :-)

> I get the impression that you're resorting to assembler because you're
> pushing beyond what the C language can express. 

Of course :-) I tend to always push way past what C considers sane.
Lets say I'm firmly in the C-as-Optimizing-Assembler camp :-)

> Taking things to a slightly higher level, am I right in thinking that
> what you're trying to achieve is a control flow construct that almost
> always takes one of the given branches, but which can (very rarely) be
> switched to permanently take one of the other branches, and that you
> want the lowest possible overhead for the common case where the
> control flow hasn't been touched yet? 

Correct, that's what it is. We do runtime code patching to flip the
branch if/when needed. We've been doing this for many many years now.

The issue of today is all this clever stuff defeating some simple
optimizations.

> (and presumably little overhead for when it
> has been?)... and that you want to be able to merge repeated such
> conditionals. 

This.. So the 'static' branches have been upstream and in use ever since
GCC added asm-goto, it was in fact the driving force to get asm-goto
implemented. This was 2010 according to git history.

So we emit, using asm goto, either a "NOP5" or "JMP.d32" (x86 speaking),
and a special section entry into which we encode the key address and the
instruction address and the jump target.

GCC, not knowing what the asm does, only sees the 2 edges and all is
well.

Then, at runtime, when we decide we want the other edge for a given key,
we iterate our section and rewrite the code to either nop5 or jmp.d32
with the correct jump target.

> It's kind of the opposite of "volatile" - something that the user is
> happy for the compiler to treat as not changing much, as opposed to
> something the user is warning the compiler about changing from under
> it.  A "const-ish" value?

Just so. Encoded in text, not data.

> Sorry if I'm being incoherent; I'm kind of thinking aloud here.

No problem, we're way outside of what is generally considered normal,
and I did somewhat assume people were familiar with our 'dodgy'
construct (some on this list are more than others).

I hope it's all a little clearer now.



Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Nick Desaulniers
On Fri, Apr 9, 2021 at 4:18 AM Peter Zijlstra  wrote:
>
> On Fri, Apr 09, 2021 at 12:55:18PM +0200, Florian Weimer wrote:
> > * Ard Biesheuvel:
> >
> > > Wouldn't that require the compiler to interpret the contents of the
> > > asm() block?
> >
> > Yes and no.  It would require proper toolchain support, so in this case
> > a new ELF relocation type, with compiler, assembler, and linker support
> > to generate those relocations and process them.  As far as I understand
> > it, the kernel doesn't do things this way.
>
> I don't think that all is needed. All we need is for the compiler to
> recognise that:
>
> if (cond) {
> stmt-A;
> }
> if (cond) {
> stmt-B;
> }
>
> both cond are equivalent and hence can merge the blocks like:
>
> if (cond) {
> stmt-A;
> stmt-B;
> }
>
> But because @cond is some super opaque asm crap, the compiler throws up
> it's imaginry hands and says it cannot possibly tell and leaves them as
> is.

Right, because if `cond` has side effects (such as is implied by asm
statements that are volatile qualified), suddenly those side effects
would only occur once whereas previously they occurred twice.

Since asm goto is implicitly volatile qualified, it sounds like
removing the implicit volatile qualifier from asm goto might help?
Then if there were side effects but you forgot to inform the compiler
that there were via an explicit volatile qualifier, and it performed
the suggested merge, oh well.
-- 
Thanks,
~Nick Desaulniers


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread David Malcolm
On Fri, 2021-04-09 at 20:40 +0200, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 09:48:33AM -0400, David Malcolm wrote:
> > You tried __pure on arch_static_branch; did you try it on
> > static_branch_unlikely?
> 
> static_branch_unlikely() is a CPP macro that expands to a statement
> expression, or as with the later patch, a _Generic(). I'm not sure
> how
> to apply the attribute to either of them since it is a function
> attribute.
> 
> I was hoping the attribute would percolate through, so to speak.
> 
> > With the caveat that my knowledge of GCC's middle-end is mostly
> > about
> > implementing warnings, rather than optimization, I did some
> > experimentation, with gcc trunk on x86_64 FWIW.
> > 
> > Given:
> > 
> > int __attribute__((pure)) foo(void);
> > 
> > int t(void)
> > {
> >   int a;
> >   if (foo())
> >     a++;
> >   if (foo())
> >     a++;
> >   if (foo())
> >     a++;
> >   return a;
> > }
> > 
> > At -O1 and above this is optimized to a single call to foo,
> > returning 0
> > or 3 accordingly.
> > 
> > -fdump-tree-all shows that it's the "fre1" pass that eliminates the
> > subsequent calls to foo, replacing them with reuses of the result
> > of
> > the first call.
> > 
> > This is in gcc/tree-ssa-sccvn.c, a value-numbering pass.
> > 
> > I think you want to somehow "teach" the compiler that:
> >   static_branch_unlikely(_schedstats)
> > is "pure-ish", that for some portion of the surrounding code that
> > you
> > want the result to be treated as pure - though I suspect compiler
> > maintainers with more experience than me are thinking "but which
> > portion? what is it safe to assume, and what will users be annoyed
> > about if we optimize away? what if t itself is inlined somewhere?"
> > and
> > similar concerns.
> 
> Right, pure or even const. As to the scope, as wide as possible. It
> literally is a global constant, the value returned is the same
> everywhere.

[Caveat: I'm a gcc developer, not a kernel expert]

But it's not *quite* a global constant, or presumably you would be
simply using a global constant, right?  As the optimizer gets smarter,
you don't want to have it one day decide that actually it really is
constant, and optimize away everything at compile-time (e.g. when LTO
is turned on, or whatnot).

I get the impression that you're resorting to assembler because you're
pushing beyond what the C language can express.  Taking things to a
slightly higher level, am I right in thinking that what you're trying
to achieve is a control flow construct that almost always takes one of
the given branches, but which can (very rarely) be switched to
permanently take one of the other branches, and that you want the
lowest possible overhead for the common case where the control flow
hasn't been touched yet?  (and presumably little overhead for when it
has been?)... and that you want to be able to merge repeated such
conditionals.  Perhaps a __builtin_ to hint that a conditional should
work that way (analogous to __builtin_expect)?  I can imagine a way of
implementing such a construct in gcc's gimple and RTL representations,
but it would be a ton of work (and I have a full plate already)

Or maybe another way of thinking about it is that you're reading a
value and you would like the compiler to amortize away repeated reads
of the value (perhaps just within the current function).

It's kind of the opposite of "volatile" - something that the user is
happy for the compiler to treat as not changing much, as opposed to
something the user is warning the compiler about changing from under
it.  A "const-ish" value?

Sorry if I'm being incoherent; I'm kind of thinking aloud here.

Hope this is constructive
Dave


> 
> All we need GCC to do for the static_branch construct is to emit both
> branches; that is, it must not treat the result as a constant and
> elide
> the other branches. But it can consider consecutive calls (as far and
> wide as it wants) to return the same value.
> 
> > Or maybe the asm stmt itself could somehow be marked as pure???
> > (with
> > similar concerns about semantics as above)
> 
> Yeah, not sure, someone with more clue will have to inform us what,
> if
> anything more than marking it either pure or const is required.
> Perhaps
> that attribute is sufficient and the compiler just isn't optimizing
> for
> an unrelated reason.
> 
> Regards,
> 
> Peter
> 




Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Peter Zijlstra
On Fri, Apr 09, 2021 at 09:48:33AM -0400, David Malcolm wrote:
> You tried __pure on arch_static_branch; did you try it on
> static_branch_unlikely?

static_branch_unlikely() is a CPP macro that expands to a statement
expression, or as with the later patch, a _Generic(). I'm not sure how
to apply the attribute to either of them since it is a function
attribute.

I was hoping the attribute would percolate through, so to speak.

> With the caveat that my knowledge of GCC's middle-end is mostly about
> implementing warnings, rather than optimization, I did some
> experimentation, with gcc trunk on x86_64 FWIW.
> 
> Given:
> 
> int __attribute__((pure)) foo(void);
> 
> int t(void)
> {
>   int a;
>   if (foo())
> a++;
>   if (foo())
> a++;
>   if (foo())
> a++;
>   return a;
> }
> 
> At -O1 and above this is optimized to a single call to foo, returning 0
> or 3 accordingly.
> 
> -fdump-tree-all shows that it's the "fre1" pass that eliminates the
> subsequent calls to foo, replacing them with reuses of the result of
> the first call.
> 
> This is in gcc/tree-ssa-sccvn.c, a value-numbering pass.
> 
> I think you want to somehow "teach" the compiler that:
>   static_branch_unlikely(_schedstats)
> is "pure-ish", that for some portion of the surrounding code that you
> want the result to be treated as pure - though I suspect compiler
> maintainers with more experience than me are thinking "but which
> portion? what is it safe to assume, and what will users be annoyed
> about if we optimize away? what if t itself is inlined somewhere?" and
> similar concerns.

Right, pure or even const. As to the scope, as wide as possible. It
literally is a global constant, the value returned is the same
everywhere.

All we need GCC to do for the static_branch construct is to emit both
branches; that is, it must not treat the result as a constant and elide
the other branches. But it can consider consecutive calls (as far and
wide as it wants) to return the same value.

> Or maybe the asm stmt itself could somehow be marked as pure??? (with
> similar concerns about semantics as above)

Yeah, not sure, someone with more clue will have to inform us what, if
anything more than marking it either pure or const is required. Perhaps
that attribute is sufficient and the compiler just isn't optimizing for
an unrelated reason.

Regards,

Peter


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread David Malcolm
On Fri, 2021-04-09 at 15:13 +0200, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 03:01:50PM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 09, 2021 at 02:03:46PM +0200, Peter Zijlstra wrote:
> > > On Fri, Apr 09, 2021 at 07:55:42AM -0400, David Malcolm wrote:
> > 
> > > > Sorry if this is a dumb question, but does the function
> > > > attribute:
> > > >   __attribute__ ((pure)) 
> > > > help here?  It's meant to allow multiple calls to a predicate
> > > > to be
> > > > merged - though I'd be nervous of using it here, the predicate
> > > > isn't
> > > > 100% pure, since AIUI the whole point of what you've built is
> > > > for
> > > > predicates that very rarely change - but can change
> > > > occasionally.
> > > 
> > > I actually tried that, but it doesn't seem to work. Given the
> > > function
> > > arguments are all compile time constants it should DTRT AFAICT,
> > > but
> > > alas.
> > 
> > FWIW, I tried the below patch and GCC-10.2.1 on current tip/master.
> 
> I also just tried __attribute__((__const__)), which is stronger still
> than __pure__ and that's also not working :/
> 
> I then also tried to replace the __buildin_types_compatible_p() magic
> in
> static_branch_unlikely() with _Generic(), but still no joy.
> 

[..snip...]

You tried __pure on arch_static_branch; did you try it on
static_branch_unlikely?

With the caveat that my knowledge of GCC's middle-end is mostly about
implementing warnings, rather than optimization, I did some
experimentation, with gcc trunk on x86_64 FWIW.

Given:

int __attribute__((pure)) foo(void);

int t(void)
{
  int a;
  if (foo())
a++;
  if (foo())
a++;
  if (foo())
a++;
  return a;
}

At -O1 and above this is optimized to a single call to foo, returning 0
or 3 accordingly.

-fdump-tree-all shows that it's the "fre1" pass that eliminates the
subsequent calls to foo, replacing them with reuses of the result of
the first call.

This is in gcc/tree-ssa-sccvn.c, a value-numbering pass.

I think you want to somehow "teach" the compiler that:
  static_branch_unlikely(_schedstats)
is "pure-ish", that for some portion of the surrounding code that you
want the result to be treated as pure - though I suspect compiler
maintainers with more experience than me are thinking "but which
portion? what is it safe to assume, and what will users be annoyed
about if we optimize away? what if t itself is inlined somewhere?" and
similar concerns.

Or maybe the asm stmt itself could somehow be marked as pure??? (with
similar concerns about semantics as above)

Hope this is constructive
Dave




Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Peter Zijlstra
On Fri, Apr 09, 2021 at 03:01:50PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 02:03:46PM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 09, 2021 at 07:55:42AM -0400, David Malcolm wrote:
> 
> > > Sorry if this is a dumb question, but does the function attribute:
> > >   __attribute__ ((pure)) 
> > > help here?  It's meant to allow multiple calls to a predicate to be
> > > merged - though I'd be nervous of using it here, the predicate isn't
> > > 100% pure, since AIUI the whole point of what you've built is for
> > > predicates that very rarely change - but can change occasionally.
> > 
> > I actually tried that, but it doesn't seem to work. Given the function
> > arguments are all compile time constants it should DTRT AFAICT, but
> > alas.
> 
> FWIW, I tried the below patch and GCC-10.2.1 on current tip/master.

I also just tried __attribute__((__const__)), which is stronger still
than __pure__ and that's also not working :/

I then also tried to replace the __buildin_types_compatible_p() magic in
static_branch_unlikely() with _Generic(), but still no joy.


---
diff --git a/arch/x86/include/asm/jump_label.h 
b/arch/x86/include/asm/jump_label.h
index 610a05374c02..f14c6863b911 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -14,7 +14,7 @@
 #include 
 #include 
 
-static __always_inline bool arch_static_branch(struct static_key * const key, 
const bool branch)
+static __always_inline __attribute_const__ bool arch_static_branch(struct 
static_key * const key, const bool branch)
 {
asm_volatile_goto("1:"
".byte " __stringify(BYTES_NOP5) "\n\t"
@@ -30,7 +30,7 @@ static __always_inline bool arch_static_branch(struct 
static_key * const key, co
return true;
 }
 
-static __always_inline bool arch_static_branch_jump(struct static_key * const 
key, const bool branch)
+static __always_inline __attribute_const__ bool arch_static_branch_jump(struct 
static_key * const key, const bool branch)
 {
asm_volatile_goto("1:"
".byte 0xe9\n\t .long %l[l_yes] - 2f\n\t"
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 05f5554d860f..2c250d8b9a02 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -195,12 +195,12 @@ struct module;
 #define JUMP_TYPE_LINKED   2UL
 #define JUMP_TYPE_MASK 3UL
 
-static __always_inline bool static_key_false(struct static_key *key)
+static __always_inline __attribute_const__ bool static_key_false(struct 
static_key * const key)
 {
return arch_static_branch(key, false);
 }
 
-static __always_inline bool static_key_true(struct static_key *key)
+static __always_inline __attribute_const__ bool static_key_true(struct 
static_key * const key)
 {
return !arch_static_branch(key, true);
 }
@@ -466,6 +466,18 @@ extern bool wrong_branch_error(void);
  * See jump_label_type() / jump_label_init_type().
  */
 
+#if 1
+
+#define static_branch_likely(x)_Generic(*(x),  
\
+   struct static_key_true: !arch_static_branch(&(x)->key, true),   
\
+   struct static_key_false: !arch_static_branch_jump(&(x)->key, true))
+
+#define static_branch_unlikely(x) _Generic(*(x),   
\
+   struct static_key_true: arch_static_branch_jump(&(x)->key, false),  
\
+   struct static_key_false: arch_static_branch(&(x)->key, false))
+
+#else
+
 #define static_branch_likely(x)
\
 ({ 
\
bool branch;
\
@@ -490,6 +502,8 @@ extern bool wrong_branch_error(void);
unlikely_notrace(branch);   
\
 })
 
+#endif
+
 #else /* !CONFIG_JUMP_LABEL */
 
 #define static_branch_likely(x)
likely_notrace(static_key_enabled(&(x)->key))


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Segher Boessenkool
On Thu, Apr 08, 2021 at 06:52:18PM +0200, Peter Zijlstra wrote:
> Is there *any* way in which we can have the compiler recognise that the
> asm_goto only depends on its arguments and have it merge the branches
> itself?
> 
> I do realize that asm-goto being volatile this is a fairly huge ask, but
> I figured I should at least raise the issue, if only to raise awareness.

"volatile" should not be an impediment to this at all, volatile means
there is an unspecified side effect, nothing more, nothing less.  But
yes this currently does not work with GCC:

void f(int x)
{   if (x)
asm volatile("ojee %0" :: "r"(x));
else
asm volatile("ojee %0" :: "r"(x));
}

or even

static inline void h(int x)
{
asm volatile("ojee %0" :: "r"(x));
}

void f1(int x)
{   if (x)
h(x);
else
h(x);
}

which both emit silly machine code.


Segher


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Peter Zijlstra
On Fri, Apr 09, 2021 at 02:03:46PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 07:55:42AM -0400, David Malcolm wrote:

> > Sorry if this is a dumb question, but does the function attribute:
> >   __attribute__ ((pure)) 
> > help here?  It's meant to allow multiple calls to a predicate to be
> > merged - though I'd be nervous of using it here, the predicate isn't
> > 100% pure, since AIUI the whole point of what you've built is for
> > predicates that very rarely change - but can change occasionally.
> 
> I actually tried that, but it doesn't seem to work. Given the function
> arguments are all compile time constants it should DTRT AFAICT, but
> alas.

FWIW, I tried the below patch and GCC-10.2.1 on current tip/master.

---
diff --git a/arch/x86/include/asm/jump_label.h 
b/arch/x86/include/asm/jump_label.h
index 610a05374c02..704438d07bc8 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -14,7 +14,7 @@
 #include 
 #include 
 
-static __always_inline bool arch_static_branch(struct static_key * const key, 
const bool branch)
+static __always_inline __pure bool arch_static_branch(struct static_key * 
const key, const bool branch)
 {
asm_volatile_goto("1:"
".byte " __stringify(BYTES_NOP5) "\n\t"
@@ -30,7 +30,7 @@ static __always_inline bool arch_static_branch(struct 
static_key * const key, co
return true;
 }
 
-static __always_inline bool arch_static_branch_jump(struct static_key * const 
key, const bool branch)
+static __always_inline __pure bool arch_static_branch_jump(struct static_key * 
const key, const bool branch)
 {
asm_volatile_goto("1:"
".byte 0xe9\n\t .long %l[l_yes] - 2f\n\t"
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 05f5554d860f..834086663c26 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -195,12 +195,12 @@ struct module;
 #define JUMP_TYPE_LINKED   2UL
 #define JUMP_TYPE_MASK 3UL
 
-static __always_inline bool static_key_false(struct static_key *key)
+static __always_inline __pure bool static_key_false(struct static_key * const 
key)
 {
return arch_static_branch(key, false);
 }
 
-static __always_inline bool static_key_true(struct static_key *key)
+static __always_inline __pure bool static_key_true(struct static_key * const 
key)
 {
return !arch_static_branch(key, true);
 }


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Peter Zijlstra
On Fri, Apr 09, 2021 at 07:55:42AM -0400, David Malcolm wrote:
> On Fri, 2021-04-09 at 13:12 +0200, Peter Zijlstra wrote:
> > On Fri, Apr 09, 2021 at 11:57:22AM +0200, Ard Biesheuvel wrote:
> > > On Thu, 8 Apr 2021 at 18:53, Peter Zijlstra 
> > > wrote:
> > 
> > > > Is there *any* way in which we can have the compiler recognise
> > > > that the
> > > > asm_goto only depends on its arguments and have it merge the
> > > > branches
> > > > itself?
> > > > 
> > > > I do realize that asm-goto being volatile this is a fairly huge
> > > > ask, but
> > > > I figured I should at least raise the issue, if only to raise
> > > > awareness.
> > > > 
> > > 
> > > Wouldn't that require the compiler to interpret the contents of the
> > > asm() block?
> > 
> > Yeah, this is more or less asking for ponies :-) One option would be
> > some annotation that conveys the desired semantics without it having
> > to
> > untangle the mess in the asm block.
> > 
> > The thing the compiler needs to know is that the branch is constant
> > for
> > any @key, and hence allow the obvious optimizations. I'm not sure if
> > this is something compiler folks would be even willing to consider,
> > but
> > I figured asking never hurts.
> > 
> 
> Sorry if this is a dumb question, but does the function attribute:
>   __attribute__ ((pure)) 
> help here?  It's meant to allow multiple calls to a predicate to be
> merged - though I'd be nervous of using it here, the predicate isn't
> 100% pure, since AIUI the whole point of what you've built is for
> predicates that very rarely change - but can change occasionally.

I actually tried that, but it doesn't seem to work. Given the function
arguments are all compile time constants it should DTRT AFAICT, but
alas.


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread David Malcolm
On Fri, 2021-04-09 at 13:12 +0200, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 11:57:22AM +0200, Ard Biesheuvel wrote:
> > On Thu, 8 Apr 2021 at 18:53, Peter Zijlstra 
> > wrote:
> 
> > > Is there *any* way in which we can have the compiler recognise
> > > that the
> > > asm_goto only depends on its arguments and have it merge the
> > > branches
> > > itself?
> > > 
> > > I do realize that asm-goto being volatile this is a fairly huge
> > > ask, but
> > > I figured I should at least raise the issue, if only to raise
> > > awareness.
> > > 
> > 
> > Wouldn't that require the compiler to interpret the contents of the
> > asm() block?
> 
> Yeah, this is more or less asking for ponies :-) One option would be
> some annotation that conveys the desired semantics without it having
> to
> untangle the mess in the asm block.
> 
> The thing the compiler needs to know is that the branch is constant
> for
> any @key, and hence allow the obvious optimizations. I'm not sure if
> this is something compiler folks would be even willing to consider,
> but
> I figured asking never hurts.
> 

Sorry if this is a dumb question, but does the function attribute:
  __attribute__ ((pure)) 
help here?  It's meant to allow multiple calls to a predicate to be
merged - though I'd be nervous of using it here, the predicate isn't
100% pure, since AIUI the whole point of what you've built is for
predicates that very rarely change - but can change occasionally.

Hope this is constructive
Dave




Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Peter Zijlstra
On Fri, Apr 09, 2021 at 12:55:18PM +0200, Florian Weimer wrote:
> * Ard Biesheuvel:
> 
> > Wouldn't that require the compiler to interpret the contents of the
> > asm() block?
> 
> Yes and no.  It would require proper toolchain support, so in this case
> a new ELF relocation type, with compiler, assembler, and linker support
> to generate those relocations and process them.  As far as I understand
> it, the kernel doesn't do things this way.

I don't think that all is needed. All we need is for the compiler to
recognise that:

if (cond) {
stmt-A;
}
if (cond) {
stmt-B;
}

both cond are equivalent and hence can merge the blocks like:

if (cond) {
stmt-A;
stmt-B;
}

But because @cond is some super opaque asm crap, the compiler throws up
it's imaginry hands and says it cannot possibly tell and leaves them as
is.


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Peter Zijlstra
On Fri, Apr 09, 2021 at 11:57:22AM +0200, Ard Biesheuvel wrote:
> On Thu, 8 Apr 2021 at 18:53, Peter Zijlstra  wrote:

> > Is there *any* way in which we can have the compiler recognise that the
> > asm_goto only depends on its arguments and have it merge the branches
> > itself?
> >
> > I do realize that asm-goto being volatile this is a fairly huge ask, but
> > I figured I should at least raise the issue, if only to raise awareness.
> >
> 
> Wouldn't that require the compiler to interpret the contents of the asm() 
> block?

Yeah, this is more or less asking for ponies :-) One option would be
some annotation that conveys the desired semantics without it having to
untangle the mess in the asm block.

The thing the compiler needs to know is that the branch is constant for
any @key, and hence allow the obvious optimizations. I'm not sure if
this is something compiler folks would be even willing to consider, but
I figured asking never hurts.


Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Florian Weimer
* Ard Biesheuvel:

> Wouldn't that require the compiler to interpret the contents of the
> asm() block?

Yes and no.  It would require proper toolchain support, so in this case
a new ELF relocation type, with compiler, assembler, and linker support
to generate those relocations and process them.  As far as I understand
it, the kernel doesn't do things this way.

Thanks,
Florian



Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Ard Biesheuvel
On Thu, 8 Apr 2021 at 18:53, Peter Zijlstra  wrote:
>
> Hi!
>
> Given code like:
>
> DEFINE_STATIC_KEY_FALSE(sched_schedstats);
>
> #define   schedstat_enabled()   
> static_branch_unlikely(_schedstats)
> #define   schedstat_set(var, val)   do { if (schedstat_enabled()) { var = 
> (val); } } while (0)
> #define __schedstat_set(var, val)   do { var = (val); } while (0)
>
> void foo(void)
> {
> struct task_struct *p = current;
>
> schedstat_set(p->se.statistics.wait_start,  0);
> schedstat_set(p->se.statistics.sleep_start, 0);
> schedstat_set(p->se.statistics.block_start, 0);
> }
>
> Where the static_branch_unlikely() ends up being:
>
> static __always_inline bool arch_static_branch(struct static_key * const key, 
> const bool branch)
> {
> asm_volatile_goto("1:"
> ".byte " __stringify(BYTES_NOP5) "\n\t"
> ".pushsection __jump_table,  \"aw\" \n\t"
> _ASM_ALIGN "\n\t"
> ".long 1b - ., %l[l_yes] - . \n\t"
> _ASM_PTR "%c0 + %c1 - .\n\t"
> ".popsection \n\t"
> : :  "i" (key), "i" (branch) : : l_yes);
>
> return false;
> l_yes:
> return true;
> }
>
> The compiler gives us code like:
>
> a290 :
> a290:   65 48 8b 04 25 00 00 00 00  mov%gs:0x0,%rax a295: 
> R_X86_64_32S  current_task
> a299:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
> a29e:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
> a2a3:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
> a2a8:   c3  retq
> a2a9:   48 c7 80 28 01 00 00 00 00 00 00movq   
> $0x0,0x128(%rax)
> a2b4:   eb e8   jmpa29e 
> a2b6:   48 c7 80 58 01 00 00 00 00 00 00movq   
> $0x0,0x158(%rax)
> a2c1:   eb e0   jmpa2a3 
> a2c3:   48 c7 80 70 01 00 00 00 00 00 00movq   
> $0x0,0x170(%rax)
> a2ce:   c3  retq
>
>
> Now, in this case I can easily rewrite foo like:
>
> void foo2(void)
> {
> struct task_struct *p = current;
>
> if (schedstat_enabled()) {
> __schedstat_set(p->se.statistics.wait_start,  0);
> __schedstat_set(p->se.statistics.sleep_start, 0);
> __schedstat_set(p->se.statistics.block_start, 0);
> }
> }
>
> Which gives the far more reasonable:
>
> a2d0 :
> a2d0:   65 48 8b 04 25 00 00 00 00  mov%gs:0x0,%rax a2d5: 
> R_X86_64_32S  current_task
> a2d9:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
> a2de:   c3  retq
> a2df:   48 c7 80 28 01 00 00 00 00 00 00movq   
> $0x0,0x128(%rax)
> a2ea:   48 c7 80 58 01 00 00 00 00 00 00movq   
> $0x0,0x158(%rax)
> a2f5:   48 c7 80 70 01 00 00 00 00 00 00movq   
> $0x0,0x170(%rax)
> a300:   c3  retq
>
> But I've found a few sites where this isn't so trivial.
>
> Is there *any* way in which we can have the compiler recognise that the
> asm_goto only depends on its arguments and have it merge the branches
> itself?
>
> I do realize that asm-goto being volatile this is a fairly huge ask, but
> I figured I should at least raise the issue, if only to raise awareness.
>

Wouldn't that require the compiler to interpret the contents of the asm() block?


static_branch/jump_label vs branch merging

2021-04-08 Thread Peter Zijlstra
Hi!

Given code like:

DEFINE_STATIC_KEY_FALSE(sched_schedstats);

#define   schedstat_enabled()   
static_branch_unlikely(_schedstats)
#define   schedstat_set(var, val)   do { if (schedstat_enabled()) { var = 
(val); } } while (0)
#define __schedstat_set(var, val)   do { var = (val); } while (0)

void foo(void)
{
struct task_struct *p = current;

schedstat_set(p->se.statistics.wait_start,  0);
schedstat_set(p->se.statistics.sleep_start, 0);
schedstat_set(p->se.statistics.block_start, 0);
}

Where the static_branch_unlikely() ends up being:

static __always_inline bool arch_static_branch(struct static_key * const key, 
const bool branch)
{
asm_volatile_goto("1:"
".byte " __stringify(BYTES_NOP5) "\n\t"
".pushsection __jump_table,  \"aw\" \n\t"
_ASM_ALIGN "\n\t"
".long 1b - ., %l[l_yes] - . \n\t"
_ASM_PTR "%c0 + %c1 - .\n\t"
".popsection \n\t"
: :  "i" (key), "i" (branch) : : l_yes);

return false;
l_yes:
return true;
}

The compiler gives us code like:

a290 :
a290:   65 48 8b 04 25 00 00 00 00  mov%gs:0x0,%rax a295: 
R_X86_64_32S  current_task
a299:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
a29e:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
a2a3:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
a2a8:   c3  retq
a2a9:   48 c7 80 28 01 00 00 00 00 00 00movq   $0x0,0x128(%rax)
a2b4:   eb e8   jmpa29e 
a2b6:   48 c7 80 58 01 00 00 00 00 00 00movq   $0x0,0x158(%rax)
a2c1:   eb e0   jmpa2a3 
a2c3:   48 c7 80 70 01 00 00 00 00 00 00movq   $0x0,0x170(%rax)
a2ce:   c3  retq


Now, in this case I can easily rewrite foo like:

void foo2(void)
{
struct task_struct *p = current;

if (schedstat_enabled()) {
__schedstat_set(p->se.statistics.wait_start,  0);
__schedstat_set(p->se.statistics.sleep_start, 0);
__schedstat_set(p->se.statistics.block_start, 0);
}
}

Which gives the far more reasonable:

a2d0 :
a2d0:   65 48 8b 04 25 00 00 00 00  mov%gs:0x0,%rax a2d5: 
R_X86_64_32S  current_task
a2d9:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
a2de:   c3  retq
a2df:   48 c7 80 28 01 00 00 00 00 00 00movq   $0x0,0x128(%rax)
a2ea:   48 c7 80 58 01 00 00 00 00 00 00movq   $0x0,0x158(%rax)
a2f5:   48 c7 80 70 01 00 00 00 00 00 00movq   $0x0,0x170(%rax)
a300:   c3  retq

But I've found a few sites where this isn't so trivial.

Is there *any* way in which we can have the compiler recognise that the
asm_goto only depends on its arguments and have it merge the branches
itself?

I do realize that asm-goto being volatile this is a fairly huge ask, but
I figured I should at least raise the issue, if only to raise awareness.