Re: [PATCH RFC 0/3] Static calls

2018-11-12 Thread Josh Poimboeuf
On Mon, Nov 12, 2018 at 12:03:34PM -0500, Steven Rostedt wrote:
> On Sun, 11 Nov 2018 23:30:55 -0600
> Josh Poimboeuf  wrote:
> 
> > > How much of that slowdown is reversed?  
> > 
> > In theory, it should reverse all of the slowdown, and actually may even
> > speed it up a little.  Steve is working on measuring that now.
> 
> When I'm able to get it to work! Hopefully that last patch snippet you
> posted will help. If not, I'm assuming you'll be in Vancouver this
> week, and we could sit down and work it out.

Sure, I'm already in Vancouver.  Just grab me if you see me, or ping me
on IRC/email.  Or feel free to send me your patches if it's still giving
you trouble.

> That said, I don't expect a 100% improvement. Because the retpoline
> causes slow down in other areas than just tracing, which is not being
> fixed by this. I'm expecting a substantial improvement (which I see
> good improvement with the unoptimized static calls), and hoping for
> much more with the optimized one (when I get it working). But not 100%,
> as stated above.

Ah, ok.  Makes sense.

-- 
Josh


Re: [PATCH RFC 0/3] Static calls

2018-11-12 Thread Josh Poimboeuf
On Mon, Nov 12, 2018 at 10:39:52AM +0100, Ard Biesheuvel wrote:
> On Mon, 12 Nov 2018 at 06:31, Josh Poimboeuf  wrote:
> >
> > On Mon, Nov 12, 2018 at 06:02:41AM +0100, Ingo Molnar wrote:
> > >
> > > * Josh Poimboeuf  wrote:
> > >
> > > > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > > > - I'm not sure about the objtool approach.  Objtool is (currently)
> > > > > >   x86-64 only, which means we have to use the "unoptimized" version
> > > > > >   everywhere else.  I may experiment with a GCC plugin instead.
> > > > >
> > > > > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > > > > approach while GCC plugin would have to be replicated for Clang and any
> > > > > > other compilers, etc.
> > > >
> > > > The benefit of a plugin is that we'd only need two of them: GCC and
> > > > Clang.  And presumably, they'd share a lot of code.
> > > >
> 
> Having looked into this, I don't think they will share any code at
> all, to be honest. Perhaps some macros and string templates, that's
> all.

Oh well.  That should still be easier to maintain than objtool across
all arches at this point.

> > > > The prospect of porting objtool to all architectures is going to be much
> > > > more of a daunting task (though we are at least already considering it
> > > > for some arches).
> > >
> > > Which architectures would benefit from ORC support the most?
> >
> > According to my (limited and potentially flawed) knowledge, I think
> > arm64 would benefit the most performance-wise, whereas powerpc and s390
> > gains would be quite a bit less.
> >
> 
> What would arm64 gain from ORC and/or objtool?

Other than live patching, the biggest benefit would be an
across-the-board performance improvement from disabling frame pointers.
It would be interesting to see some arm64 performance numbers there, for
a kernel compiled with -fomit-frame-pointer.

For more details (and benefits of) ORC see
Documentation/x86/orc-unwinder.txt.

Objtool has also come in handy for other cases, like ensuring retpolines
are used everywhere.

Over time, I would like to move some objtool functionality to compiler
plugins, such that it would be easier to port it to other arches.

> > We may have to port objtool to arm64 anyway, for live patching.
> 
> Is this about the reliable stack traces, i.e., the ability to detect
> non-leaf functions that don't create stack frames? I think we should
> be able to manage this without objtool on arm64 tbh.

Hm?  How else would you ensure all functions honor CONFIG_FRAME_POINTER,
and continue to do so indefinitely?

> >  But
> > that will be a lot more work than it took for Ard to write a GCC plugin.
> >
> > > I really think that hard reliance on GCC plugins is foolish
> >
> > Funny, I feel the same way about hard reliance on objtool :-)
> >
> 
> I tend to agree here. I think objtool is a necessary evil (as are
> compiler plugins, for that matter) which I hope does not spread to
> other architectures.

I agree that it's a necessary evil, but it may be necessary on arm64 for
live patching.

> But the main difference is that the GCC plugin is only ~50 lines (for
> this particular use case, and minus another 50 lines of boilerplate),
> whereas objtool (AIUI) duplicates lots and lots of functionality of
> the compiler, assembler and/or linker, to mangle relocations, create
> new sections etc etc. Porting this to other architectures is going to
> be a major maintenance effort, especially when I think of, e.g.,
> 32-bit ARM with its Thumb2 quirks and other idiosyncrasies that are
> currently hidden in the toolchain. Other architectures should be first
> class citizens if objtool gains support for them, which means that the
> x86 people that own it currently are on the hook for testing their
> changes against architectures they are not familiar with.

Sounds like we could use you as a co-maintainer then :-)

BTW, AFAIK, there are no plans to support live patching for 32-bit ARM.

-- 
Josh


Re: [PATCH RFC 0/3] Static calls

2018-11-12 Thread Steven Rostedt
On Sun, 11 Nov 2018 23:30:55 -0600
Josh Poimboeuf  wrote:

> > How much of that slowdown is reversed?  
> 
> In theory, it should reverse all of the slowdown, and actually may even
> speed it up a little.  Steve is working on measuring that now.

When I'm able to get it to work! Hopefully that last patch snippet you
posted will help. If not, I'm assuming you'll be in Vancouver this
week, and we could sit down and work it out.

That said, I don't expect a 100% improvement. Because the retpoline
causes slow down in other areas than just tracing, which is not being
fixed by this. I'm expecting a substantial improvement (which I see
good improvement with the unoptimized static calls), and hoping for
much more with the optimized one (when I get it working). But not 100%,
as stated above.

-- Steve


Re: [PATCH RFC 0/3] Static calls

2018-11-12 Thread Ard Biesheuvel
On Mon, 12 Nov 2018 at 06:31, Josh Poimboeuf  wrote:
>
> On Mon, Nov 12, 2018 at 06:02:41AM +0100, Ingo Molnar wrote:
> >
> > * Josh Poimboeuf  wrote:
> >
> > > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > > - I'm not sure about the objtool approach.  Objtool is (currently)
> > > > >   x86-64 only, which means we have to use the "unoptimized" version
> > > > >   everywhere else.  I may experiment with a GCC plugin instead.
> > > >
> > > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > > approach while GCC plugin would have to be replicated for Clang and any
> > > > other compilers, etc.
> > >
> > > The benefit of a plugin is that we'd only need two of them: GCC and
> > > Clang.  And presumably, they'd share a lot of code.
> > >

Having looked into this, I don't think they will share any code at
all, to be honest. Perhaps some macros and string templates, that's
all.

> > > The prospect of porting objtool to all architectures is going to be much
> > > more of a daunting task (though we are at least already considering it
> > > for some arches).
> >
> > Which architectures would benefit from ORC support the most?
>
> According to my (limited and potentially flawed) knowledge, I think
> arm64 would benefit the most performance-wise, whereas powerpc and s390
> gains would be quite a bit less.
>

What would arm64 gain from ORC and/or objtool?

> We may have to port objtool to arm64 anyway, for live patching.

Is this about the reliable stack traces, i.e., the ability to detect
non-leaf functions that don't create stack frames? I think we should
be able to manage this without objtool on arm64 tbh.

>  But
> that will be a lot more work than it took for Ard to write a GCC plugin.
>
> > I really think that hard reliance on GCC plugins is foolish
>
> Funny, I feel the same way about hard reliance on objtool :-)
>

I tend to agree here. I think objtool is a necessary evil (as are
compiler plugins, for that matter) which I hope does not spread to
other architectures.

But the main difference is that the GCC plugin is only ~50 lines (for
this particular use case, and minus another 50 lines of boilerplate),
whereas objtool (AIUI) duplicates lots and lots of functionality of
the compiler, assembler and/or linker, to mangle relocations, create
new sections etc etc. Porting this to other architectures is going to
be a major maintenance effort, especially when I think of, e.g.,
32-bit ARM with its Thumb2 quirks and other idiosyncrasies that are
currently hidden in the toolchain. Other architectures should be first
class citizens if objtool gains support for them, which means that the
x86 people that own it currently are on the hook for testing their
changes against architectures they are not familiar with.

This obviously applies equally to compiler plugins, but those have a
lot more focus.


> > - but maybe Clang's plugin infrastructure is a guarantee that it
> > remains a sane and usable interface.
>
> Hopefully so.  If it breaks, we could always write another tool, as the
> work is straightforward.  Or we could make it an objtool subcommand
> which works on all arches.
>
> > > > All other usecases are bonus, but it would certainly be interesting to
> > > > investigate the impact of using these APIs for tracing: that too is a
> > > > feature enabled everywhere but utilized only by a small fraction of Linux
> > > > users - so literally every single cycle or instruction saved or hot-path
> > > > shortened is a major win.
> > >
> > > With retpolines, and with tracepoints enabled, it's definitely a major
> > > win.  Steve measured an 8.9% general slowdown on hackbench caused by
> > > retpolines.
> >
> > How much of that slowdown is reversed?
>
> In theory, it should reverse all of the slowdown, and actually may even
> speed it up a little.  Steve is working on measuring that now.
>
> --
> Josh


Re: [PATCH RFC 0/3] Static calls

2018-11-11 Thread Andy Lutomirski
On Sun, Nov 11, 2018 at 9:02 PM Ingo Molnar  wrote:
>
>
> * Josh Poimboeuf  wrote:
>
> > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > - I'm not sure about the objtool approach.  Objtool is (currently)
> > > >   x86-64 only, which means we have to use the "unoptimized" version
> > > >   everywhere else.  I may experiment with a GCC plugin instead.
> > >
> > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > approach while GCC plugin would have to be replicated for Clang and any
> > > other compilers, etc.
> >
> > The benefit of a plugin is that we'd only need two of them: GCC and
> > Clang.  And presumably, they'd share a lot of code.
> >
> > The prospect of porting objtool to all architectures is going to be much
> > more of a daunting task (though we are at least already considering it
> > for some arches).
>
> Which architectures would benefit from ORC support the most?
>
> I really think that hard reliance on GCC plugins is foolish - but maybe
> Clang's plugin infrastructure is a guarantee that it remains a sane and
> usable interface.
>
> > > I'd be very happy with a demonstrated paravirt optimization already -
> > > i.e. seeing the before/after effect on the vmlinux with an x86 distro
> > > config.
> > >
> > > All major Linux distributions enable CONFIG_PARAVIRT=y and
> > > CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
> > > as possible in the 99.999% cases where it's not used is a primary
> > > concern.
> >
> > For paravirt, I was thinking of it as more of a cleanup than an
> > optimization.  The paravirt patching code already replaces indirect
> > branches with direct ones -- see paravirt_patch_default().
> >
> > Though it *would* reduce the instruction footprint a bit, as the 7-byte
> > indirect calls (later patched to 5-byte direct + 2-byte nop) would
> > instead be 5-byte direct calls to begin with.
>
> Yes.

It would be a huge cleanup IMO -- the existing PVOP call stuff is
really quite ugly IMO.  Also, the existing stuff tries to emulate the
semantics of passing parameters of unknown types using asm
constraints, and I just don't believe that GCC does what we want it to
do.  In general, passing the *value* of a pointer to asm doesn't seem
to convince gcc that the pointed-to value is used by the asm, and this
makes me nervous.  See commit 715bd9d12f84d8f5cc8ad21d888f9bc304a8eb0b
as an example of this.  In a similar vein, the existing PVOP calls
have a "memory" clobber, and that's not free.
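
A minimal sketch of the constraint issue described above, not taken from the PVOP code. With GCC-style extended asm, an "r" operand only tells the compiler that the pointer value is read. To declare that the pointed-to object is read, the asm needs either an "m" operand for that object or a "memory" clobber, and the clobber makes the compiler assume any memory may be read or written, which is why it is not free. The empty asm bodies and all function names below are hypothetical and exist only to show the constraints.

```c
#include <stdio.h>

/*
 * Only the pointer value is an input here. The compiler is not told
 * that the asm reads *p, so it may delay or drop a preceding store
 * to *p.
 */
static inline void consume_pointer_only(const int *p)
{
	__asm__ volatile("" : : "r"(p));
}

/*
 * The "m" operand declares that the asm reads the object itself, so
 * any pending store to *p has to be completed before the asm.
 */
static inline void consume_pointee(const int *p)
{
	__asm__ volatile("" : : "r"(p), "m"(*p));
}

/*
 * The "memory" clobber also covers *p, but it additionally forces the
 * compiler to assume that every other object in memory may have been
 * read or written, so cached values are reloaded afterwards.
 */
static inline void consume_with_clobber(const int *p)
{
	__asm__ volatile("" : : "r"(p) : "memory");
}

/* Hypothetical driver: passes the same local through all three forms. */
int constraint_demo(void)
{
	int value = 41;

	value += 1;
	consume_pointer_only(&value);
	consume_pointee(&value);
	consume_with_clobber(&value);
	return value;
}
```

The "m" form is the narrowest way to express the dependency when the type of the pointed-to object is known. The difficulty raised above is that the PVOP macros pass parameters of unknown types, which makes that form hard to apply there.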


Re: [PATCH RFC 0/3] Static calls

2018-11-11 Thread Josh Poimboeuf
On Mon, Nov 12, 2018 at 06:02:41AM +0100, Ingo Molnar wrote:
> 
> * Josh Poimboeuf  wrote:
> 
> > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > - I'm not sure about the objtool approach.  Objtool is (currently)
> > > >   x86-64 only, which means we have to use the "unoptimized" version
> > > >   everywhere else.  I may experiment with a GCC plugin instead.
> > > 
> > > I'd prefer the objtool approach. It's a pretty reliable first-principles 
> > > approach while GCC plugin would have to be replicated for Clang and any 
> > > other compilers, etc.
> > 
> > The benefit of a plugin is that we'd only need two of them: GCC and
> > Clang.  And presumably, they'd share a lot of code.
> > 
> > The prospect of porting objtool to all architectures is going to be much
> > more of a daunting task (though we are at least already considering it
> > for some arches).
> 
> Which architectures would benefit from ORC support the most?

According to my (limited and potentially flawed) knowledge, I think
arm64 would benefit the most performance-wise, whereas powerpc and s390
gains would be quite a bit less.

We may have to port objtool to arm64 anyway, for live patching.  But
that will be a lot more work than it took for Ard to write a GCC plugin.

> I really think that hard reliance on GCC plugins is foolish

Funny, I feel the same way about hard reliance on objtool :-)

> - but maybe Clang's plugin infrastructure is a guarantee that it
> remains a sane and usable interface.

Hopefully so.  If it breaks, we could always write another tool, as the
work is straightforward.  Or we could make it an objtool subcommand
which works on all arches.

> > > All other usecases are bonus, but it would certainly be interesting to 
> > > investigate the impact of using these APIs for tracing: that too is a 
> > > feature enabled everywhere but utilized only by a small fraction of Linux 
> > > users - so literally every single cycle or instruction saved or hot-path 
> > > shortened is a major win.
> > 
> > With retpolines, and with tracepoints enabled, it's definitely a major
> > win.  Steve measured an 8.9% general slowdown on hackbench caused by
> > retpolines.
> 
> How much of that slowdown is reversed?

In theory, it should reverse all of the slowdown, and actually may even
speed it up a little.  Steve is working on measuring that now.

-- 
Josh


Re: [PATCH RFC 0/3] Static calls

2018-11-11 Thread Ingo Molnar


* Josh Poimboeuf  wrote:

> On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > - I'm not sure about the objtool approach.  Objtool is (currently)
> > >   x86-64 only, which means we have to use the "unoptimized" version
> > >   everywhere else.  I may experiment with a GCC plugin instead.
> > 
> > I'd prefer the objtool approach. It's a pretty reliable first-principles 
> > approach while GCC plugin would have to be replicated for Clang and any 
> > other compilers, etc.
> 
> The benefit of a plugin is that we'd only need two of them: GCC and
> Clang.  And presumably, they'd share a lot of code.
> 
> The prospect of porting objtool to all architectures is going to be much
> more of a daunting task (though we are at least already considering it
> for some arches).

Which architectures would benefit from ORC support the most?

I really think that hard reliance on GCC plugins is foolish - but maybe 
Clang's plugin infrastructure is a guarantee that it remains a sane and 
usable interface.

> > I'd be very happy with a demonstrated paravirt optimization already - 
> > i.e. seeing the before/after effect on the vmlinux with an x86 distro 
> > config.
> > 
> > All major Linux distributions enable CONFIG_PARAVIRT=y and 
> > CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much 
> > as possible in the 99.999% cases where it's not used is a primary 
> > concern.
> 
> For paravirt, I was thinking of it as more of a cleanup than an
> optimization.  The paravirt patching code already replaces indirect
> branches with direct ones -- see paravirt_patch_default().
> 
> Though it *would* reduce the instruction footprint a bit, as the 7-byte
> indirect calls (later patched to 5-byte direct + 2-byte nop) would
> instead be 5-byte direct calls to begin with.

Yes.
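
A rough sketch of the byte counts mentioned above, assuming the usual x86-64 encodings (absolute indirect call `ff 14 25 <addr32>`, direct call `e8 <rel32>`, 2-byte nop `66 90`). The array names and the helper are made up for illustration and the zero bytes only stand in for real addresses:

```c
#include <stddef.h>

/* call *addr32 (indirect, absolute): opcode FF /2, SIB byte, 32-bit address */
static const unsigned char indirect_call[] = { 0xff, 0x14, 0x25, 0, 0, 0, 0 };

/* call rel32 (direct): opcode E8 followed by a 32-bit displacement */
static const unsigned char direct_call[] = { 0xe8, 0, 0, 0, 0 };

/* 2-byte nop used to pad a patched site back to its original length */
static const unsigned char nop2[] = { 0x66, 0x90 };

/* Bytes saved per call site if the direct call is emitted from the start
 * (hypothetical helper, only here to make the arithmetic explicit). */
static size_t bytes_saved_per_site(void)
{
    return sizeof(indirect_call) - sizeof(direct_call);
}
```

So a patched paravirt site still occupies 7 bytes (5 + 2), while a site that starts out as a direct call would only need 5.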

> > All other usecases are bonus, but it would certainly be interesting to 
> > investigate the impact of using these APIs for tracing: that too is a 
> > feature enabled everywhere but utilized only by a small fraction of Linux 
> > users - so literally every single cycle or instruction saved or hot-path 
> > shortened is a major win.
> 
> With retpolines, and with tracepoints enabled, it's definitely a major
> win.  Steve measured an 8.9% general slowdown on hackbench caused by
> retpolines.

How much of that slowdown is reversed?

> But with tracepoints disabled, I believe static jumps are used, which
> already minimizes the impact on hot paths.

Yeah.

Thanks,

Ingo


Re: [PATCH RFC 0/3] Static calls

2018-11-11 Thread Peter Zijlstra
On Sun, Nov 11, 2018 at 02:42:55PM +0100, Ard Biesheuvel wrote:
> On 11 November 2018 at 00:20, Peter Zijlstra  wrote:
> > On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
> >> On 9 November 2018 at 08:28, Ingo Molnar  wrote:
> >> >> - I'm not sure about the objtool approach.  Objtool is (currently)
> >> >>   x86-64 only, which means we have to use the "unoptimized" version
> >> >>   everywhere else.  I may experiment with a GCC plugin instead.
> >> >
> >> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> >> > approach while GCC plugin would have to be replicated for Clang and any
> >> > other compilers, etc.
> >> >
> >>
> >> I implemented the GCC plugin approach here for arm64
> >
> > I'm confused; I thought we only needed objtool for variable instruction
> > length architectures, because we can't reliably decode our instruction
> > stream. Otherwise we can fairly trivially use the DWARF relocation data,
> > no?
> 
> How would that work? We could build vmlinux with --emit-relocs, filter
> out the static jump/call relocations and resolve the symbol names to
> filter the ones associated with calls to trampolines. But then, we
> have to build the static_call_sites section and reinject it back into
> the image in some way, which is essentially objtool, no?

It's a _much_ simpler tool than objtool, but yes, we need a tool that
reads the relocation stuff and (re)injects it in a new section -- we
don't need it on a vmlinux level, it can be done per TU.
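
As a very rough illustration of the "read the relocations, (re)inject a section" idea, here is a userspace sketch of the filtering step only. Everything in it is hypothetical: the two structs, the `static_call_tramp_` prefix and `collect_call_sites()` are invented for the example and are not taken from the patches or from any existing tool.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one call relocation in a translation unit. */
struct call_reloc {
    unsigned long offset;   /* where the call operand lives in the text */
    const char *sym;        /* name of the symbol being called */
};

/* Hypothetical entry for the generated static_call_sites section. */
struct call_site {
    unsigned long offset;
    const char *tramp;
};

/* Assumed naming convention for trampolines; purely illustrative. */
#define TRAMP_PREFIX "static_call_tramp_"

/* Keep only the relocations that target a trampoline and record them as
 * call sites. Returns the number of entries written to out[]. */
static size_t collect_call_sites(const struct call_reloc *relocs, size_t n,
                                 struct call_site *out, size_t max)
{
    size_t count = 0;

    for (size_t i = 0; i < n && count < max; i++) {
        if (strncmp(relocs[i].sym, TRAMP_PREFIX, strlen(TRAMP_PREFIX)) != 0)
            continue;       /* ordinary call, not a static call */
        out[count].offset = relocs[i].offset;
        out[count].tramp = relocs[i].sym;
        count++;
    }
    return count;
}
```

A real tool would additionally have to parse the ELF relocation sections and write the result back into the object file, which is the part being debated here.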

Anyway, a GCC plugin (I still have to have a peek at your thing) sounds
like it should work just fine too.


Re: [PATCH RFC 0/3] Static calls

2018-11-11 Thread Ard Biesheuvel
On 11 November 2018 at 00:20, Peter Zijlstra  wrote:
> On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
>> On 9 November 2018 at 08:28, Ingo Molnar  wrote:
>> >> - I'm not sure about the objtool approach.  Objtool is (currently)
>> >>   x86-64 only, which means we have to use the "unoptimized" version
>> >>   everywhere else.  I may experiment with a GCC plugin instead.
>> >
>> > I'd prefer the objtool approach. It's a pretty reliable first-principles
>> > approach while GCC plugin would have to be replicated for Clang and any
>> > other compilers, etc.
>> >
>>
>> I implemented the GCC plugin approach here for arm64
>
> I'm confused; I thought we only needed objtool for variable instruction
> length architectures, because we can't reliably decode our instruction
> stream. Otherwise we can fairly trivially use the DWARF relocation data,
> no?

How would that work? We could build vmlinux with --emit-relocs, filter
out the static jump/call relocations and resolve the symbol names to
filter the ones associated with calls to trampolines. But then, we
have to build the static_call_sites section and reinject it back into
the image in some way, which is essentially objtool, no?


Re: [PATCH RFC 0/3] Static calls

2018-11-10 Thread Peter Zijlstra
On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
> On 9 November 2018 at 08:28, Ingo Molnar  wrote:
> >> - I'm not sure about the objtool approach.  Objtool is (currently)
> >>   x86-64 only, which means we have to use the "unoptimized" version
> >>   everywhere else.  I may experiment with a GCC plugin instead.
> >
> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > approach while GCC plugin would have to be replicated for Clang and any
> > other compilers, etc.
> >
> 
> I implemented the GCC plugin approach here for arm64

I'm confused; I thought we only needed objtool for variable instruction
length architectures, because we can't reliably decode our instruction
stream. Otherwise we can fairly trivially use the DWARF relocation data,
no?


Re: [PATCH RFC 0/3] Static calls

2018-11-10 Thread Masami Hiramatsu
On Fri, 9 Nov 2018 11:05:51 -0800
Andy Lutomirski  wrote:

> 
> 
> > On Nov 9, 2018, at 10:42 AM, Steven Rostedt  wrote:
> > 
> > On Fri, 9 Nov 2018 10:41:37 -0600
> > Josh Poimboeuf  wrote:
> > 
> >>> On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
> >>>> On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> >>>>> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar  wrote:
> >>>>> 
> >>>>> 
> >>>>> All other usecases are bonus, but it would certainly be interesting to
> >>>>> investigate the impact of using these APIs for tracing: that too is a
> >>>>> feature enabled everywhere but utilized only by a small fraction of Linux
> >>>>> users - so literally every single cycle or instruction saved or hot-path
> >>>>> shortened is a major win.
> >>>> 
> >>>> For tracing, we'd want static_call_set_to_nop() or something like that,
> >>>> right?
> >>> 
> >>> Are we talking about tracepoints?  Or ftrace?  
> >> 
> >> Since ftrace changes calls to nops, and vice versa, I assume you meant
> >> ftrace.  I don't think ftrace is a good candidate for this, as it's
> >> inherently more flexible than this API would reasonably allow.
> >> 
> > 
> > Not sure what Andy was talking about, but I'm currently implementing
> > tracepoints to use this, as tracepoints use indirect calls, and are a
> > prime candidate for static calls, as I showed in my original RFC of
> > this feature.
> > 
> > 
> 
> Indeed.
> 
> Although I had assumed that tracepoints already had appropriate jump label 
> magic.

As far as I know, the jump label magic is for reducing the overhead when the
tracepoint is OFF (because it can skip function parameter preparation), and this
static call will be good when the tracepoint is ON (enabled) because it can
avoid the retpoline performance degradation.

Thank you,


-- 
Masami Hiramatsu 


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Rasmus Villemoes
On 09/11/2018 16.16, Andy Lutomirski wrote:
> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar  wrote:
>>
>>
>> All other usecases are bonus, but it would certainly be interesting to
>> investigate the impact of using these APIs for tracing: that too is a
>> feature enabled everywhere but utilized only by a small fraction of Linux
>> users - so literally every single cycle or instruction saved or hot-path
>> shortened is a major win.
> 
> For tracing, we'd want static_call_set_to_nop() or something like that, right?
> 

Hm. IIUC, when gcc sees static_call(key)(...), it has to generate code
to put the right values in %rdi, %rsi etc. Even if the function is void
(*)(void), gcc would still need to shuffle things around (either spill
and reload, or move %rdi to some callee saved register). So if the
static_call is noop'ed out most of the time, that seems like a net loss?
With an unlikely static_key, gcc can do all the parameter setup and
reloading in an out-of-line chunk of code.
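
A small userspace model of that point, with plain C stand-ins: `hook` is an ordinary function pointer standing in for the static call site, `hook_enabled` is an ordinary bool standing in for the static key, and `prepare_arg()` stands in for the register and parameter setup. None of these names come from the patches, and a real nop'ed site would not call anything at all; the model only shows that the setup around the site still runs.

```c
#include <stdbool.h>

static int arg_setups;                  /* how often the argument was prepared */

static int prepare_arg(void)            /* stand-in for argument/register setup */
{
    return ++arg_setups;
}

static void hook_nop(int arg)           /* models a call site patched to a nop */
{
    (void)arg;
}

static void (*hook)(int) = hook_nop;    /* stand-in for the static call target */
static bool hook_enabled;               /* stand-in for an unlikely static key */

/* Always-present call site: the argument is prepared even though the
 * callee does nothing. */
static void site_unconditional(void)
{
    hook(prepare_arg());
}

/* Call site behind an unlikely branch: the argument is prepared only
 * when the hook is enabled, so the setup can live out of line. */
static void site_guarded(void)
{
    if (__builtin_expect(hook_enabled, 0))
        hook(prepare_arg());
}
```

Calling `site_unconditional()` bumps `arg_setups` every time, while `site_guarded()` leaves it untouched until `hook_enabled` is set.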

Static calls seem like quite a useful concept, but only/mostly if
_some_ function needs to be called at that spot.

Aside: there should be some compile-time check that
static_call_set_to_nop can only be used if the return type is void.
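
One way such a check might look, as a sketch only: it relies on the GCC/Clang extensions `__typeof__` and `__builtin_types_compatible_p`, and the `RETURNS_VOID()` macro and the two hook prototypes are hypothetical names invented for this example, not part of the proposed API.

```c
/* Evaluates to 1 if the given call expression has type void, else 0.
 * The expression is only inspected for its type and never evaluated. */
#define RETURNS_VOID(call_expr) \
    __builtin_types_compatible_p(__typeof__(call_expr), void)

/* Hypothetical hooks: one that could be nop'ed, one that could not. */
void trace_hook(int cpu);
int query_hook(int cpu);

/* Builds: a void call can safely be turned into a nop. */
_Static_assert(RETURNS_VOID(trace_hook(0)),
               "set_to_nop requires a void return type");

/* A non-void hook is detected at compile time; asserting
 * RETURNS_VOID(query_hook(0)) instead would break the build. */
_Static_assert(!RETURNS_VOID(query_hook(0)),
               "query_hook returns a value");
```

Because the call expression sits inside `__typeof__`, the hooks only need declarations, not definitions.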

Rasmus




Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Josh Poimboeuf
On Fri, Nov 09, 2018 at 02:59:18PM -0500, Steven Rostedt wrote:
> On Fri, 9 Nov 2018 13:44:09 -0600
> Josh Poimboeuf  wrote:
> 
> > On Fri, Nov 09, 2018 at 02:37:03PM -0500, Steven Rostedt wrote:
> > > On Fri, 9 Nov 2018 11:05:51 -0800
> > > Andy Lutomirski  wrote:
> > >   
> > > > > Not sure what Andy was talking about, but I'm currently implementing
> > > > > tracepoints to use this, as tracepoints use indirect calls, and are a
> > > > > prime candidate for static calls, as I showed in my original RFC of
> > > > > this feature.
> > > > > 
> > > > > 
> > > > 
> > > > Indeed.
> > > > 
> > > > Although I had assumed that tracepoints already had appropriate jump 
> > > > label magic.  
> > > 
> > > It does. But that's not the problem I was trying to solve. It's that
> > > > tracing took an 8% nose dive with retpolines when enabled (hackbench
> > > > slowed down by 8% with all the trace events enabled compared to all
> > > > trace events enabled without retpoline). That is, normal users (those
> > > > not tracing) are not affected by trace events slowing down by
> > > retpoline. Those that care about performance when they are tracing, are
> > > affected by retpoline, quite drastically.
> > > 
> > > I'm doing another test run and measurements, to see how the unoptimized
> > > trampolines help, followed by the trampoline case.  
> > 
> > Are you sure you're using unoptimized?  Optimized is the default on
> > x86-64 (with my third patch).
> > 
> 
> Yes, because I haven't applied that third patch yet ;-)
> 
> Then I'll apply it and see how much that improves things.

Ah, good.  That will be interesting to see the difference between
optimized/unoptimized.

-- 
Josh


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Steven Rostedt
On Fri, 9 Nov 2018 13:44:09 -0600
Josh Poimboeuf  wrote:

> On Fri, Nov 09, 2018 at 02:37:03PM -0500, Steven Rostedt wrote:
> > On Fri, 9 Nov 2018 11:05:51 -0800
> > Andy Lutomirski  wrote:
> >   
> > > > Not sure what Andy was talking about, but I'm currently implementing
> > > > tracepoints to use this, as tracepoints use indirect calls, and are a
> > > > prime candidate for static calls, as I showed in my original RFC of
> > > > this feature.
> > > > 
> > > > 
> > > 
> > > Indeed.
> > > 
> > > Although I had assumed that tracepoints already had appropriate jump 
> > > label magic.  
> > 
> > It does. But that's not the problem I was trying to solve. It's that
> > > tracing took an 8% nose dive with retpolines when enabled (hackbench
> > > slowed down by 8% with all the trace events enabled compared to all
> > > trace events enabled without retpoline). That is, normal users (those
> > > not tracing) are not affected by trace events slowing down by
> > retpoline. Those that care about performance when they are tracing, are
> > affected by retpoline, quite drastically.
> > 
> > I'm doing another test run and measurements, to see how the unoptimized
> > trampolines help, followed by the trampoline case.  
> 
> Are you sure you're using unoptimized?  Optimized is the default on
> x86-64 (with my third patch).
> 

Yes, because I haven't applied that third patch yet ;-)

Then I'll apply it and see how much that improves things.

-- Steve


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Josh Poimboeuf
On Fri, Nov 09, 2018 at 02:37:03PM -0500, Steven Rostedt wrote:
> On Fri, 9 Nov 2018 11:05:51 -0800
> Andy Lutomirski  wrote:
> 
> > > Not sure what Andy was talking about, but I'm currently implementing
> > > tracepoints to use this, as tracepoints use indirect calls, and are a
> > > prime candidate for static calls, as I showed in my original RFC of
> > > this feature.
> > > 
> > >   
> > 
> > Indeed.
> > 
> > Although I had assumed that tracepoints already had appropriate jump label 
> > magic.
> 
> It does. But that's not the problem I was trying to solve. It's that
> > tracing took an 8% nose dive with retpolines when enabled (hackbench
> > slowed down by 8% with all the trace events enabled compared to all
> > trace events enabled without retpoline). That is, normal users (those
> > not tracing) are not affected by trace events slowing down by
> retpoline. Those that care about performance when they are tracing, are
> affected by retpoline, quite drastically.
> 
> I'm doing another test run and measurements, to see how the unoptimized
> trampolines help, followed by the trampoline case.

Are you sure you're using unoptimized?  Optimized is the default on
x86-64 (with my third patch).

-- 
Josh


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Steven Rostedt
On Fri, 9 Nov 2018 11:05:51 -0800
Andy Lutomirski  wrote:

> > Not sure what Andy was talking about, but I'm currently implementing
> > tracepoints to use this, as tracepoints use indirect calls, and are a
> > prime candidate for static calls, as I showed in my original RFC of
> > this feature.
> > 
> >   
> 
> Indeed.
> 
> Although I had assumed that tracepoints already had appropriate jump label 
> magic.

It does. But that's not the problem I was trying to solve. It's that
tracing took an 8% nose dive with retpolines when enabled (hackbench
slowed down by 8% with all the trace events enabled compared to all
trace events enabled without retpoline). That is, normal users (those
not tracing) are not affected by trace events slowing down by
retpoline. Those that care about performance when they are tracing, are
affected by retpoline, quite drastically.

I'm doing another test run and measurements, to see how the unoptimized
trampolines help, followed by the trampoline case.

-- Steve


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Andy Lutomirski



> On Nov 9, 2018, at 10:42 AM, Steven Rostedt  wrote:
> 
> On Fri, 9 Nov 2018 10:41:37 -0600
> Josh Poimboeuf  wrote:
> 
>>> On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
>>>> On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
>>>>> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar  wrote:
>>>>> 
>>>>> 
>>>>> All other usecases are bonus, but it would certainly be interesting to
>>>>> investigate the impact of using these APIs for tracing: that too is a
>>>>> feature enabled everywhere but utilized only by a small fraction of Linux
>>>>> users - so literally every single cycle or instruction saved or hot-path
>>>>> shortened is a major win.
>>>> 
>>>> For tracing, we'd want static_call_set_to_nop() or something like that,
>>>> right?
>>> 
>>> Are we talking about tracepoints?  Or ftrace?  
>> 
>> Since ftrace changes calls to nops, and vice versa, I assume you meant
>> ftrace.  I don't think ftrace is a good candidate for this, as it's
>> inherently more flexible than this API would reasonably allow.
>> 
> 
> Not sure what Andy was talking about, but I'm currently implementing
> tracepoints to use this, as tracepoints use indirect calls, and are a
> prime candidate for static calls, as I showed in my original RFC of
> this feature.
> 
> 

Indeed.

Although I had assumed that tracepoints already had appropriate jump label 
magic.


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Steven Rostedt
On Fri, 9 Nov 2018 10:41:37 -0600
Josh Poimboeuf  wrote:

> On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
> > On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:  
> > > On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar  wrote:  
> > > >
> > > >
> > > > All other usecases are bonus, but it would certainly be interesting to
> > > > investigate the impact of using these APIs for tracing: that too is a
> > > > feature enabled everywhere but utilized only by a small fraction of 
> > > > Linux
> > > > users - so literally every single cycle or instruction saved or hot-path
> > > > shortened is a major win.  
> > > 
> > > For tracing, we'd want static_call_set_to_nop() or something like that, 
> > > right?  
> > 
> > Are we talking about tracepoints?  Or ftrace?  
> 
> Since ftrace changes calls to nops, and vice versa, I assume you meant
> ftrace.  I don't think ftrace is a good candidate for this, as it's
> inherently more flexible than this API would reasonably allow.
> 

Not sure what Andy was talking about, but I'm currently implementing
tracepoints to use this, as tracepoints use indirect calls, and are a
prime candidate for static calls, as I showed in my original RFC of
this feature.
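
As a rough illustration of what such a conversion looks like from the caller's side, here is a userspace sketch of the "no arch support" fallback, where the static call degrades to a plain function pointer. The `_model` names are assumptions made for this example and are not the macros from the patch set:

```c
#include <stddef.h>

/* Userspace stand-ins; the names mirror the idea, not the kernel macros. */
typedef int (*handler_fn)(int);

struct static_call_key_model {
    handler_fn func;  /* current target, read on every call */
};

static int handler_a(int x) { return x + 1; }
static int handler_b(int x) { return x * 2; }

/* Retarget the call. With arch support, this is where code would be patched. */
static void static_call_update_model(struct static_call_key_model *key,
                                     handler_fn func)
{
    key->func = func;
}

/* Fallback call path: an ordinary indirect call through the key. */
static int static_call_model(const struct static_call_key_model *key, int arg)
{
    return key->func(arg);
}
```

The caller only ever names the key, so the patched variants can presumably replace the pointer load with a direct call without changing call sites.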

-- Steve


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Josh Poimboeuf
On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
> On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> > On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar  wrote:
> > >
> > >
> > > All other usecases are bonus, but it would certainly be interesting to
> > > investigate the impact of using these APIs for tracing: that too is a
> > > feature enabled everywhere but utilized only by a small fraction of Linux
> > > users - so literally every single cycle or instruction saved or hot-path
> > > shortened is a major win.
> > 
> > For tracing, we'd want static_call_set_to_nop() or something like that, 
> > right?
> 
> Are we talking about tracepoints?  Or ftrace?

Since ftrace changes calls to nops, and vice versa, I assume you meant
ftrace.  I don't think ftrace is a good candidate for this, as it's
inherently more flexible than this API would reasonably allow.

-- 
Josh


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Josh Poimboeuf
On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar  wrote:
> >
> >
> > All other usecases are bonus, but it would certainly be interesting to
> > investigate the impact of using these APIs for tracing: that too is a
> > feature enabled everywhere but utilized only by a small fraction of Linux
> > users - so literally every single cycle or instruction saved or hot-path
> > shortened is a major win.
> 
> For tracing, we'd want static_call_set_to_nop() or something like that, right?

Are we talking about tracepoints?  Or ftrace?

-- 
Josh


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Josh Poimboeuf
On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
> On 9 November 2018 at 08:28, Ingo Molnar  wrote:
> >
> > * Josh Poimboeuf  wrote:
> >
> >> These patches are related to two similar patch sets from Ard and Steve:
> >>
> >> - https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
> >> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
> >>
> >> The code is also heavily inspired by the jump label code, as some of the
> >> concepts are very similar.
> >>
> >> There are three separate implementations, depending on what the arch
> >> supports:
> >>
> >>   1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
> >>  objtool and a small amount of arch code
> >>
> >>   2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
> >>  a small amount of arch code
> >>
> >>   3) If no arch support, fall back to regular function pointers
> >>
> >>
> >> TODO:
> >>
> >> - I'm not sure about the objtool approach.  Objtool is (currently)
> >>   x86-64 only, which means we have to use the "unoptimized" version
> >>   everywhere else.  I may experiment with a GCC plugin instead.
> >
> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > approach while GCC plugin would have to be replicated for Clang and any
> > other compilers, etc.
> >
> 
> I implemented the GCC plugin approach here for arm64
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=static-calls
>
> That implements both the unoptimized and the optimized versions.

Nice!  That was fast :-)

> I do take your point about GCC and other compilers, but on arm64 we
> don't have a lot of choice.
> 
> As far as I can tell, the GCC plugin is generic (i.e., it does not
> rely on any ARM specific passes), but obviously, this requires a *lot*
> of testing and validation to be taken seriously.

Yeah.  I haven't had a chance to try your plugin on x86 yet, but in
theory it should be arch-independent.

-- 
Josh


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Andy Lutomirski
On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar  wrote:
>
>
> All other usecases are bonus, but it would certainly be interesting to
> investigate the impact of using these APIs for tracing: that too is a
> feature enabled everywhere but utilized only by a small fraction of Linux
> users - so literally every single cycle or instruction saved or hot-path
> shortened is a major win.

For tracing, we'd want static_call_set_to_nop() or something like that, right?
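
One possible reading of that idea, sketched in userspace: "set to nop" is modelled by pointing the call at an empty function. The thread does not define static_call_set_to_nop(), so the names and semantics below are assumptions for illustration only:

```c
/* Hypothetical hook type; the counter makes the effect observable. */
typedef void (*hook_fn)(int *counter);

struct hook_key_model {
    hook_fn func;  /* current target of the call */
};

static void nop_hook(int *counter) { (void)counter; }  /* does nothing */
static void counting_hook(int *counter) { (*counter)++; }

/* Model of "set to nop": the call site stays, its target becomes a no-op. */
static void hook_set_to_nop_model(struct hook_key_model *key)
{
    key->func = nop_hook;
}

static void hook_set_model(struct hook_key_model *key, hook_fn func)
{
    key->func = func;
}

static void hook_call_model(const struct hook_key_model *key, int *counter)
{
    key->func(counter);
}
```

A patched implementation could go further and replace the call instruction itself with a nop, which this pointer-based model cannot show.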

--Andy


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Josh Poimboeuf
On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > - I'm not sure about the objtool approach.  Objtool is (currently)
> >   x86-64 only, which means we have to use the "unoptimized" version
> >   everywhere else.  I may experiment with a GCC plugin instead.
> 
> I'd prefer the objtool approach. It's a pretty reliable first-principles 
> approach while GCC plugin would have to be replicated for Clang and any 
> other compilers, etc.

The benefit of a plugin is that we'd only need two of them: GCC and
Clang.  And presumably, they'd share a lot of code.

The prospect of porting objtool to all architectures is going to be much
more of a daunting task (though we are at least already considering it
for some arches).

> > - Does this feature have much value without retpolines?  If not, should
> >   we make it depend on retpolines somehow?
> 
> Paravirt patching, as you mention in your later reply?
> 
> > - Find some actual users of the interfaces (tracepoints? crypto?)
> 
> I'd be very happy with a demonstrated paravirt optimization already - 
> i.e. seeing the before/after effect on the vmlinux with an x86 distro 
> config.
> 
> All major Linux distributions enable CONFIG_PARAVIRT=y and 
> CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much 
> as possible in the 99.999% cases where it's not used is a primary 
> concern.

For paravirt, I was thinking of it as more of a cleanup than an
optimization.  The paravirt patching code already replaces indirect
branches with direct ones -- see paravirt_patch_default().

Though it *would* reduce the instruction footprint a bit, as the 7-byte
indirect calls (later patched to 5-byte direct + 2-byte nop) would
instead be 5-byte direct calls to begin with.
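
To make the byte counts concrete, here is a small sketch that fills a 7-byte site with a 5-byte direct call (opcode e8 plus a 32-bit relative displacement) followed by a 2-byte nop. The function name is hypothetical, and the nop bytes 66 90 are an assumption (a common 2-byte x86 nop), not necessarily what the paravirt code emits:

```c
#include <stdint.h>
#include <stddef.h>

#define CALL_INSN_SIZE 5  /* e8 + rel32 */

/*
 * Fill a 7-byte call site with "call rel32" followed by a 2-byte nop.
 * The displacement is relative to the end of the 5-byte call instruction.
 * Returns 0 on success, -1 if the target is out of rel32 range.
 */
static int patch_site_model(uint8_t site[7], uint64_t site_addr,
                            uint64_t target)
{
    int64_t rel = (int64_t)(target - (site_addr + CALL_INSN_SIZE));

    if (rel < INT32_MIN || rel > INT32_MAX)
        return -1;

    uint32_t r = (uint32_t)(int32_t)rel;

    site[0] = 0xe8;                 /* direct near call */
    site[1] = (uint8_t)(r & 0xff);  /* rel32, little-endian */
    site[2] = (uint8_t)((r >> 8) & 0xff);
    site[3] = (uint8_t)((r >> 16) & 0xff);
    site[4] = (uint8_t)((r >> 24) & 0xff);
    site[5] = 0x66;                 /* assumed 2-byte nop: 66 90 */
    site[6] = 0x90;
    return 0;
}
```

A site that is a 5-byte direct call from the start needs only the first five bytes, which is where the 2 bytes per call site would be saved.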

> All other usecases are bonus, but it would certainly be interesting to 
> investigate the impact of using these APIs for tracing: that too is a 
> feature enabled everywhere but utilized only by a small fraction of Linux 
> users - so literally every single cycle or instruction saved or hot-path 
> shortened is a major win.

With retpolines, and with tracepoints enabled, it's definitely a major
win.  Steve measured an 8.9% general slowdown on hackbench caused by
retpolines.

But with tracepoints disabled, I believe static jumps are used, which
already minimizes the impact on hot paths.

-- 
Josh


Re: [PATCH RFC 0/3] Static calls

2018-11-09 Thread Ard Biesheuvel
On 9 November 2018 at 08:28, Ingo Molnar  wrote:
>
> * Josh Poimboeuf  wrote:
>
>> These patches are related to two similar patch sets from Ard and Steve:
>>
>> - https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
>> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
>>
>> The code is also heavily inspired by the jump label code, as some of the
>> concepts are very similar.
>>
>> There are three separate implementations, depending on what the arch
>> supports:
>>
>>   1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
>>  objtool and a small amount of arch code
>>
>>   2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
>>  a small amount of arch code
>>
>>   3) If no arch support, fall back to regular function pointers
>>
>>
>> TODO:
>>
>> - I'm not sure about the objtool approach.  Objtool is (currently)
>>   x86-64 only, which means we have to use the "unoptimized" version
>>   everywhere else.  I may experiment with a GCC plugin instead.
>
> I'd prefer the objtool approach. It's a pretty reliable first-principles
> approach while GCC plugin would have to be replicated for Clang and any
> other compilers, etc.
>

I implemented the GCC plugin approach here for arm64

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=static-calls

That implements both the unoptimized and the optimized versions.

I do take your point about GCC and other compilers, but on arm64 we
don't have a lot of choice.

As far as I can tell, the GCC plugin is generic (i.e., it does not
rely on any ARM specific passes), but obviously, this requires a *lot*
of testing and validation to be taken seriously.

>> - Does this feature have much value without retpolines?  If not, should
>>   we make it depend on retpolines somehow?
>
> Paravirt patching, as you mention in your later reply?
>
>> - Find some actual users of the interfaces (tracepoints? crypto?)
>
> I'd be very happy with a demonstrated paravirt optimization already -
> i.e. seeing the before/after effect on the vmlinux with an x86 distro
> config.
>
> All major Linux distributions enable CONFIG_PARAVIRT=y and
> CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
> as possible in the 99.999% cases where it's not used is a primary
> concern.
>
> All other usecases are bonus, but it would certainly be interesting to
> investigate the impact of using these APIs for tracing: that too is a
> feature enabled everywhere but utilized only by a small fraction of Linux
> users - so literally every single cycle or instruction saved or hot-path
> shortened is a major win.
>
> Thanks,
>
> Ingo


Re: [PATCH RFC 0/3] Static calls

2018-11-08 Thread Ingo Molnar


* Ingo Molnar  wrote:

> > - Does this feature have much value without retpolines?  If not, should
> >   we make it depend on retpolines somehow?
> 
> Paravirt patching, as you mention in your later reply?

BTW., to look for candidates for this API, I'd suggest looking at the 
function call frequency of my (almost-)distro kernel vmlinux:

  $ objdump -d vmlinux | grep -w callq | cut -f3- | sort | uniq -c | sort -n | tail -100

which gives:

502 callq  8157d050 
522 callq  81aaf420 
536 callq  81547e60 <_copy_to_user>
615 callq  81a97700 
624 callq  *0x82648428
624 callq  810cc810 <__might_sleep>
625 callq  81a93b90 
649 callq  81547dd0 <_copy_from_user>
651 callq  811ba930 
654 callq  8170b6f0 <_dev_warn>
691 callq  81a93790 
693 callq  81a88dc0 
709 callq  *0x82648438
723 callq  811bdbd0 
735 callq  810feac0 
750 callq  8163e9f0 
768 callq  *0x82648430
814 callq  81ab2710 <_raw_spin_lock_irq>
841 callq  81a9e680 <__memcpy>
863 callq  812ae3d0 <__kmalloc>
899 callq  8126ac80 <__might_fault>
912 callq  81ab2970 <_raw_spin_unlock_irq>
939 callq  81aaaf10 <_cond_resched>
966 callq  811bda00 
   1069 callq  81126f50 
   1078 callq  81097760 <__warn_printk>
   1081 callq  8157b140 <__dynamic_dev_dbg>
   1351 callq  8170b630 <_dev_err>
   1365 callq  811050c0 
   1373 callq  81a977f0 
   1390 callq  8157b090 <__dynamic_pr_debug>
   1453 callq  8155c650 <__list_add_valid>
   1501 callq  812ad6f0 
   1509 callq  8155c6c0 <__list_del_entry_valid>
   1513 callq  81310ce0 
   1571 callq  81ab2780 <_raw_spin_lock_irqsave>
   1624 callq  81ab29b0 <_raw_spin_unlock_irqrestore>
   1661 callq  81126fd0 
   1986 callq  81104940 
   2050 callq  811c5110 
   2133 callq  81102c70 
   2507 callq  81ab2560 <_raw_spin_lock>
   2676 callq  81aadc40 
   3056 callq  81ab2900 <_raw_spin_unlock>
   3294 callq  81aac610 
   3628 callq  81129100 
   4462 callq  812ac2c0 
   6454 callq  8111a51e 
   6676 callq  81101420 
   7328 callq  81e014b0 <__x86_indirect_thunk_rax>
   7598 callq  81126f30 
   9065 callq  810979f0 <__stack_chk_fail>

The most prominent call sites that already go through function pointers 
today are:

  $ objdump -d vmlinux | grep -w callq | grep \* | cut -f3- | sort | uniq -c | sort -n | tail -10

109 callq  *0x82648530
134 callq  *0x82648568
154 callq  *0x826483d0
260 callq  *0x826483d8
297 callq  *0x826483e0
345 callq  *0x82648440
345 callq  *0x82648558
624 callq  *0x82648428
709 callq  *0x82648438
768 callq  *0x82648430

That's all pv_ops->*() method calls:

   82648300 D pv_ops
   826485d0 D pv_info

Optimizing those thousands of function pointer calls would already be a 
nice improvement.
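
The address check behind that observation can be sketched as follows, assuming that pv_ops extends up to the start of pv_info and that each slot is an 8-byte function pointer. The addresses are written as they appear in the listing above, and the helper names are hypothetical:

```c
#include <stdint.h>
#include <stdbool.h>

/* Symbol addresses as they appear in the listing above. */
#define PV_OPS_START 0x82648300u
#define PV_OPS_END   0x826485d0u  /* pv_info, assumed to follow pv_ops */
#define SLOT_SIZE    8u           /* assumed size of one function pointer */

/* True if the indirect call target is read from inside pv_ops. */
static bool is_pv_ops_slot(uint32_t addr)
{
    return addr >= PV_OPS_START && addr < PV_OPS_END;
}

/* Zero-based pointer slot within pv_ops that the call reads from. */
static unsigned int pv_ops_slot_index(uint32_t addr)
{
    return (addr - PV_OPS_START) / SLOT_SIZE;
}
```

Under these assumptions, the three most frequent targets (0x82648428, 0x82648430, 0x82648438) fall on adjacent slots 37, 38 and 39 of pv_ops.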

But retpolines:

   7328 callq  81e014b0 <__x86_indirect_thunk_rax>

  81e014b0 <__x86_indirect_thunk_rax>:
  81e014b0:   ff e0   jmpq   *%rax

... are even more prominent, and turned on in every distro as well, 
obviously.

Thanks,

Ingo


Re: [PATCH RFC 0/3] Static calls

2018-11-08 Thread Ingo Molnar


* Josh Poimboeuf  wrote:

> These patches are related to two similar patch sets from Ard and Steve:
> 
> - https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
> 
> The code is also heavily inspired by the jump label code, as some of the
> concepts are very similar.
> 
> There are three separate implementations, depending on what the arch
> supports:
> 
>   1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
>  objtool and a small amount of arch code
>   
>   2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
>  a small amount of arch code
>   
>   3) If no arch support, fall back to regular function pointers
> 
> 
> TODO:
> 
> - I'm not sure about the objtool approach.  Objtool is (currently)
>   x86-64 only, which means we have to use the "unoptimized" version
>   everywhere else.  I may experiment with a GCC plugin instead.

I'd prefer the objtool approach. It's a pretty reliable first-principles 
approach while GCC plugin would have to be replicated for Clang and any 
other compilers, etc.

> - Does this feature have much value without retpolines?  If not, should
>   we make it depend on retpolines somehow?

Paravirt patching, as you mention in your later reply?

> - Find some actual users of the interfaces (tracepoints? crypto?)

I'd be very happy with a demonstrated paravirt optimization already - 
i.e. seeing the before/after effect on the vmlinux with an x86 distro 
config.

All major Linux distributions enable CONFIG_PARAVIRT=y and 
CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much 
as possible in the 99.999% cases where it's not used is a primary 
concern.

All other usecases are bonus, but it would certainly be interesting to 
investigate the impact of using these APIs for tracing: that too is a 
feature enabled everywhere but utilized only by a small fraction of Linux 
users - so literally every single cycle or instruction saved or hot-path 
shortened is a major win.

Thanks,

Ingo


Re: [PATCH RFC 0/3] Static calls

2018-11-08 Thread Josh Poimboeuf
On Thu, Nov 08, 2018 at 03:15:50PM -0600, Josh Poimboeuf wrote:
> - Does this feature have much value without retpolines?  If not, should
>   we make it depend on retpolines somehow?

I forgot Andy mentioned that we might be able to use this to clean up
paravirt patching, in which case it would have a lot of value,
retpolines or not...

-- 
Josh

