Re: [PATCH v2 0/4] Static calls

2018-12-12 Thread Nadav Amit
> On Dec 12, 2018, at 1:36 PM, Edward Cree  wrote:
> 
> On 12/12/18 21:15, Nadav Amit wrote:
>>> On Dec 12, 2018, at 10:33 AM, Edward Cree  wrote:
>>> 
>>> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
>>> additional call and ret.  So I wouldn't expect it to be too expensive.
>>> More to the point, it seems like it's easier to get right than the inline
>>> version, and if we get the inline version working later we can introduce it
>>> without any API change, much as Josh's existing patches have both versions
>>> behind a Kconfig switch.
>> I see. For my outlined blocks I used the opposite approach - a call followed
>> by jmp
> That's what Josh did too.  I.e. caller calls the trampoline, which jmps to the
>  callee; later it rets, taking it back to the caller.  Perhaps I wasn't clear.
> The point is that there's still only one call and one ret.

Sorry for the misunderstanding.

> 
>>> I was working on the assumption that it would be opt-in, wrapping a macro
>>> around indirect calls that are known to have a fairly small number of hot
>>> targets.  There are plenty of indirect calls in the kernel that are only
>>> called once in a blue moon, e.g. in control-plane operations like ethtool;
>>> we don't really need to bulk up .text with trampolines for all of them.
>> On the other hand, I’m not sure the static_call interface is so intuitive.
>> And extending it into “dynamic_call” might be even worse. As I initially
>> used an opt-in approach, I can tell you that it was very exhausting.
> Well, if it's done with a gcc plugin after all, then it wouldn't be too hard
>  to make it opt-out.
> One advantage of the explicit opt-in dynamic_call, though, which can be seen
>  in my patch is that multiple call sites can share the same learning-state,
>  if they're expected to call the same set of functions.  An opt-out approach
>  would automatically give each indirect call statement its own individual BTB.
> Either way, I think the question is orthogonal to what the trampolines
>  themselves look like (and even to the inline vs outline question).

Not entirely. If the mechanism is opt-out and outlined, and especially if it
also supports multiple targets, you may not want to allocate all the memory
for them at build time, and instead use module memory to allocate them
dynamically (that’s what we did).
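
To be concrete, “module memory” here means something like the following
(a minimal sketch; the helper name is invented):

#include <linux/moduleloader.h>

/* module_alloc() returns executable memory within branch range (+/-2GB)
 * of kernel text, so rel32 call/jmp instructions emitted into the block
 * can reach their targets; the caller then writes the outline code into
 * it and maps it read-only + executable. */
static void *optpoline_alloc(unsigned long size)
{
	return module_alloc(size);
}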

Re: [PATCH v2 0/4] Static calls

2018-12-12 Thread Edward Cree
On 12/12/18 21:15, Nadav Amit wrote:
>> On Dec 12, 2018, at 10:33 AM, Edward Cree  wrote:
>>
>> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
>>  additional call and ret.  So I wouldn't expect it to be too expensive.
>> More to the point, it seems like it's easier to get right than the inline
>>  version, and if we get the inline version working later we can introduce it
>>  without any API change, much as Josh's existing patches have both versions
>>  behind a Kconfig switch.
> I see. For my outlined blocks I used the opposite approach - a call followed
> by jmp
That's what Josh did too.  I.e. caller calls the trampoline, which jmps to the
 callee; later it rets, taking it back to the caller.  Perhaps I wasn't clear.
The point is that there's still only one call and one ret.
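
Roughly, the shape under discussion (a sketch with invented labels, not
the actual patch code):

caller:
	call	foo_tramp	# the one call; pushes caller's return address
foo_tramp:
	jmp	foo_impl	# patched direct jmp (or an indirect jmpq *target)
foo_impl:
	# ... callee body ...
	ret			# returns straight to the caller: one call, one ret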

>> I was working on the assumption that it would be opt-in, wrapping a macro
>>  around indirect calls that are known to have a fairly small number of hot
>>  targets.  There are plenty of indirect calls in the kernel that are only
>>  called once in a blue moon, e.g. in control-plane operations like ethtool;
>>  we don't really need to bulk up .text with trampolines for all of them.
> On the other hand, I’m not sure the static_call interface is so intuitive.
> And extending it into “dynamic_call” might be even worse. As I initially
> used an opt-in approach, I can tell you that it was very exhausting.
Well, if it's done with a gcc plugin after all, then it wouldn't be too hard
 to make it opt-out.
One advantage of the explicit opt-in dynamic_call, though, which can be seen
 in my patch is that multiple call sites can share the same learning-state,
 if they're expected to call the same set of functions.  An opt-out approach
 would automatically give each indirect call statement its own individual BTB.
Either way, I think the question is orthogonal to what the trampolines
 themselves look like (and even to the inline vs outline question).

-Ed


Re: [PATCH v2 0/4] Static calls

2018-12-12 Thread Nadav Amit
> On Dec 12, 2018, at 10:33 AM, Edward Cree  wrote:
> 
> On 12/12/18 18:14, Nadav Amit wrote:
>> Second, (2i) is not very intuitive for me. Using the out-of-line static
>> calls seems to me less performant than the inline (potentially, I didn’t
>> check).
>> 
>> Anyhow, the use of out-of-line static calls seems counter-intuitive to me.
>> I think (didn’t measure) that it may add more overhead than it saves due
>> to the additional call, ret, and so on
> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
>  additional call and ret.  So I wouldn't expect it to be too expensive.
> More to the point, it seems like it's easier to get right than the inline
>  version, and if we get the inline version working later we can introduce it
>  without any API change, much as Josh's existing patches have both versions
>  behind a Kconfig switch.

I see. For my outlined blocks I used the opposite approach - a call
followed by a jmp (instead of the jmp followed by a call that Josh did).
Does the stack look correct when you first do the jmp? It seems to me that
you will not see the calling function on the stack in this case. Can it
have an effect on live-patching or debugging?

>> I tried to avoid reading the
>> compared target from memory and therefore used an immediate. This should
>> prevent data-cache misses, and even when the data is available it is faster
>> by one cycle. But it requires both the “cmp %target-reg, imm” and the
>> “call rel-target” to be patched “atomically”. So the static-calls
>> mechanism wouldn’t be sufficient.
> The approach I took to deal with that (since though I'm doing a read from
>  memory, it's key->func in .data rather than the jmp immediate in .text) was
>  to have another static_call (though a plain static_key could also be used)
>  to 'skip' the fast-path while it's actually being patched.  Then, since all
>  my callers were under the rcu_read_lock, I just needed to synchronize_rcu()
>  after switching off the fast-path to make sure no threads were still in it.
> I'm not sure how that would be generalised to all cases, though; we don't
>  want to force every indirect call to take the rcu_read_lock as that means
>  no callee can ever synchronize_rcu().  I guess we could have our own
>  separate RCU read lock just for indirect call patching?  (What does kgraft
>  do?)

I used a similar approach to a certain extent. (I’m going to describe the
implementation following the discussion with Andy Lutomirski.) We use a
“restartable section”: if we are preempted in this block of code, we
restart the entire section. Then we use synchronize_rcu() after patching,
like you do.

>> Based on Josh’s previous feedback, I thought of improving the learning using
>> some hysteresis. Anyhow, note that there are quite a few cases in which you
>> wouldn’t want optpolines. The question is whether in general it would be an
>> opt-in or opt-out mechanism.
> I was working on the assumption that it would be opt-in, wrapping a macro
>  around indirect calls that are known to have a fairly small number of hot
>  targets.  There are plenty of indirect calls in the kernel that are only
>  called once in a blue moon, e.g. in control-plane operations like ethtool;
>  we don't really need to bulk up .text with trampolines for all of them.

On the other hand, I’m not sure the static_call interface is so intuitive.
And extending it into “dynamic_call” might be even worse. As I initially
used an opt-in approach, I can tell you that it was very exhausting.



Re: [PATCH v2 0/4] Static calls

2018-12-12 Thread Edward Cree
On 12/12/18 18:14, Nadav Amit wrote:
> Second, (2i) is not very intuitive for me. Using the out-of-line static
> calls seems to me less performant than the inline (potentially, I didn’t
> check).
>
> Anyhow, the use of out-of-line static calls seems counter-intuitive to me.
> I think (didn’t measure) that it may add more overhead than it saves due
> to the additional call, ret, and so on
AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
 additional call and ret.  So I wouldn't expect it to be too expensive.
More to the point, it seems like it's easier to get right than the inline
 version, and if we get the inline version working later we can introduce it
 without any API change, much as Josh's existing patches have both versions
 behind a Kconfig switch.

> I tried to avoid reading the
> compared target from memory and therefore used an immediate. This should
> prevent data-cache misses, and even when the data is available it is faster
> by one cycle. But it requires both the “cmp %target-reg, imm” and the
> “call rel-target” to be patched “atomically”. So the static-calls
> mechanism wouldn’t be sufficient.
The approach I took to deal with that (since though I'm doing a read from
 memory, it's key->func in .data rather than the jmp immediate in .text) was
 to have another static_call (though a plain static_key could also be used)
 to 'skip' the fast-path while it's actually being patched.  Then, since all
 my callers were under the rcu_read_lock, I just needed to synchronize_rcu()
 after switching off the fast-path to make sure no threads were still in it.
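
Sketch of that sequencing, assuming a static_key guard named fastpath_key
(the name is invented):

	static_branch_disable(&fastpath_key);	/* callers take the plain path */
	synchronize_rcu();			/* wait out rcu_read_lock'd fast-path users */
	/* now safe to retarget the static call(s) */
	static_branch_enable(&fastpath_key);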
I'm not sure how that would be generalised to all cases, though; we don't
 want to force every indirect call to take the rcu_read_lock as that means
 no callee can ever synchronize_rcu().  I guess we could have our own
 separate RCU read lock just for indirect call patching?  (What does kgraft
 do?)

> Based on Josh’s previous feedback, I thought of improving the learning using
> some hysteresis. Anyhow, note that there are quite a few cases in which you
> wouldn’t want optpolines. The question is whether in general it would be an
> opt-in or opt-out mechanism.
I was working on the assumption that it would be opt-in, wrapping a macro
 around indirect calls that are known to have a fairly small number of hot
 targets.  There are plenty of indirect calls in the kernel that are only
 called once in a blue moon, e.g. in control-plane operations like ethtool;
 we don't really need to bulk up .text with trampolines for all of them.

-Ed


Re: [PATCH v2 0/4] Static calls

2018-12-12 Thread Nadav Amit
> On Dec 12, 2018, at 9:11 AM, Edward Cree  wrote:
> 
> On 12/12/18 05:59, Nadav Amit wrote:
>> Thanks for cc’ing me. (I didn’t know about the other patch-sets.)
> Well in my case, that's because I haven't posted any yet.  (Will follow up
>  shortly with what I currently have, though it's not pretty.)
> 
> Looking at your patches, it seems you've got a much more developed learning
>  mechanism.  Mine on the other hand is brutally simple but runs continuously
>  (i.e. after we patch we immediately enter the next 'relearning' phase);
>  since it never does anything but prod a handful of percpu variables, this
>  shouldn't be too costly.
> 
> Also, you've got the macrology for making all indirect calls use this,
>  whereas at present I just have an open-coded instance on a single call site
>  (I went with deliver_skb in the networking stack).
> 
> So I think where we probably want to go from here is:
>  1) get Josh's static_calls in.  AIUI Linus seems to prefer the out-of-line
> approach; I'd say ditch the inline version (at least for now).
>  2) build a relpolines patch series that uses
>i) static_calls for the text-patching part
>   ii) as much of Nadav's macrology as is applicable
>  iii) either my or Nadav's learning mechanism; we can experiment with both,
>   bikeshed it incessantly etc.
> 
> Seem reasonable?

Mostly yes. I have a few reservations (and let’s call them optpolines from
now on, since Josh disliked the previous name).

First, I still have to address the issues that Josh raised before, and try
to use a gcc plugin instead of (most of) the macros. Specifically, I need
to bring back (from my PoC code) the part that sets multiple targets.

Second, (2i) is not very intuitive for me. Using the out-of-line static
calls seems to me less performant than the inline (potentially, I didn’t
check).

Anyhow, the use of out-of-line static calls seems counter-intuitive to me.
I think (didn’t measure) that it may add more overhead than it saves due to
the additional call, ret, and so on - at least if retpolines are not used.
For multiple targets it may be useful in saving some memory if the outline
block is dynamically allocated (as I did in my as-yet-unpublished code).
But that’s not how it’s done in Josh’s code.

If we talk about the inline implementation, there is a different problem
that prevents me from using Josh’s static-calls as-is. I tried to avoid
reading the compared target from memory and therefore used an immediate.
This should prevent data-cache misses, and even when the data is available
it is faster by one cycle. But it requires both the “cmp %target-reg, imm”
and the “call rel-target” to be patched “atomically”. So the static-calls
mechanism wouldn’t be sufficient.
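
For reference, the sequence I have in mind looks roughly like this (a
sketch; expected_target is a placeholder for the learned target, and the
target register is %rax here):

	cmpq	$expected_target, %rax		# immediate compare, no memory read
	jnz	1f
	call	expected_target			# direct rel32 call
	jmp	2f
1:	call	__x86_indirect_thunk_rax	# retpoline fallback, target in %rax
2: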

Based on Josh’s previous feedback, I thought of improving the learning using
some hysteresis. Anyhow, note that there are quite a few cases in which you
wouldn’t want optpolines. The question is whether in general it would be an
opt-in or opt-out mechanism.

Let me know what you think.

BTW: When it comes to deliver_skb, you have packet_type as an identifier.
You can use it directly or through an indirection table to figure out the
target. Here’s a chunk of assembly magic that I used in a similar case:

# The macros below expand into a binary tree of bit tests on %al: each
# level tests one bit of the value, and the leaves jump directly to
# syscall_relpoline_<val>, so dispatching to one of `max` targets costs a
# few conditional branches instead of an indirect call.
.macro _call_table val:req bit:req max:req val1:req bit1:req
call_table_\val\()_\bit\():
	test	$(1 << \bit), %al
.if \val1 + (1 << \bit1) >= \max
	jnz	syscall_relpoline_\val1
	jmp	syscall_relpoline_\val
.else
	jnz	call_table_\val1\()_\bit1

	# fall-through: no carry, val unchanged, go on to the next bit
	call_table \val,\bit1,\max
	call_table \val1,\bit1,\max
.endif
.endm

.macro call_table val:req bit:req max:req
.altmacro
_call_table \val,\bit,\max,%(\val + (1 << \bit)),%(\bit + 1)
.noaltmacro
.endm

ENTRY(direct_syscall)
	mov	%esi, %eax
	call_table val=0 bit=0 max=16
ENDPROC(direct_syscall)

Re: [PATCH v2 0/4] Static calls

2018-12-12 Thread Edward Cree
On 12/12/18 05:59, Nadav Amit wrote:
> Thanks for cc’ing me. (I didn’t know about the other patch-sets.)
Well in my case, that's because I haven't posted any yet.  (Will follow up
 shortly with what I currently have, though it's not pretty.)

Looking at your patches, it seems you've got a much more developed learning
 mechanism.  Mine on the other hand is brutally simple but runs continuously
 (i.e. after we patch we immediately enter the next 'relearning' phase);
 since it never does anything but prod a handful of percpu variables, this
 shouldn't be too costly.
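
Something like this, conceptually (a sketch with invented names, not the
actual code):

#include <linux/percpu.h>

DEFINE_PER_CPU(void *, dyncall_candidate);
DEFINE_PER_CPU(unsigned int, dyncall_hits);

static void dyncall_learn(void *func)
{
	if (this_cpu_read(dyncall_candidate) == func) {
		this_cpu_inc(dyncall_hits);		/* same target again */
	} else {
		this_cpu_write(dyncall_candidate, func);
		this_cpu_write(dyncall_hits, 1);	/* new candidate */
	}
	/* periodically: if hits is large enough, patch the static call
	 * to func and reset the counters for the next relearning phase */
}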

Also, you've got the macrology for making all indirect calls use this,
 whereas at present I just have an open-coded instance on a single call site
 (I went with deliver_skb in the networking stack).

So I think where we probably want to go from here is:
 1) get Josh's static_calls in.  AIUI Linus seems to prefer the out-of-line
    approach; I'd say ditch the inline version (at least for now).
 2) build a relpolines patch series that uses
   i) static_calls for the text-patching part
  ii) as much of Nadav's macrology as is applicable
 iii) either my or Nadav's learning mechanism; we can experiment with both,
  bikeshed it incessantly etc.

Seem reasonable?

-Ed


Re: [PATCH v2 0/4] Static calls

2018-12-11 Thread Nadav Amit
> On Dec 11, 2018, at 10:05 AM, Josh Poimboeuf  wrote:
> 
> On Fri, Dec 07, 2018 at 04:06:32PM +, Edward Cree wrote:
>> Sorry if this has been pointed out before (it's a very long thread), but
>> in the out-of-line implementation, it appears that static_call_update()
>> never alters key->func.  Am I right in thinking that this should be
>> fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
>> arch_static_call_transform() on line 159 of include/linux/static_call.h?
> 
> Yes, you're right about both bugs in the out-of-line case: key->func
> needs to be written, and __static_call_update() needs to be called by
> static_call_update.  I was so focused on getting the inline case working
> that I overlooked those.
> 
>> Some background (why does key->func matter for the
>> CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with
>> combining these static calls with the 'indirect call wrappers' notion
>> that Paolo Abeni has been working on [1], using runtime instrumentation
>> to determine a list of potential callees.  (This allows us to cope with
>> cases where the callees are in modules, or where different workloads may
>> use different sets of callees for a given call site, neither of which is
>> handled by Paolo's approach).
>> The core of my design looks something like:
>> 
>> static int dynamic_call_xyz(int (*func)(some_args), some_args)
>> {
>>  if (func == dynamic_call_xyz_1.func)
>>  return static_call(dynamic_call_xyz_1, some_args);
>>  if (func == dynamic_call_xyz_2.func)
>>  return static_call(dynamic_call_xyz_2, some_args);
>>  return (*func)(some_args);
>> }
>> 
>> albeit with a bunch of extra (and currently rather ugly) stuff to collect
>> the statistics needed to decide what to put in the static call keys, and
>> mechanisms (RCU in my current case) to ensure that the static call isn't
>> changed between checking its .func and actually calling it.
>> 
>> -Ed
>> 
>> PS: not on list, please keep me in CC.
>> 
>> [1] https://lwn.net/Articles/773985/
> 
> Thanks, this sounds very interesting.  Adding Nadav to CC, as he has
> been looking at a different approach to solving the same problem:
> 
>  https://lkml.kernel.org/r/20181018005420.82993-1-namit@vmware.com

Thanks for cc’ing me. (I didn’t know about the other patch-sets.)

Allow me to share my experience, because I have been studying this issue
for some time and have implemented more than I shared, since the code needs
more cleanup. Some of the proposed approaches are things we either
considered or actually implemented (and later dropped). Eventually, our
design was guided by performance profiling and a grain of “what’s academic”
consideration.

I think that eventually you would want to go with one central mechanism for
the various situations:

1. Registered targets (static) and automatically learnt targets (dynamic).
Registration does not work in some cases (e.g., file-system function
pointers - there are too many of those). On the other hand, if you know
your target it is obviously simpler/better.

2. With/without retpoline fallback. We have always had the retpoline as a
fallback, but if you use a registration mechanism, it’s not necessary.

3. Single and multiple targets. For multiple targets we decided to use an
outline block in order not to inflate the code for no reason. There were a
great many indirect branches in our kernel build, but in our workloads only
~500 were actually run.

If you go with the approach that Edward mentioned, you may want to
associate each “function” with an identifier (think of file_operations
having an additional field that holds a unique ID, or using the struct
address). This would allow you to use a “binary search” to find the right
target, which would be slightly more efficient - see the sketch after this
list. We actually used a binary search for a different reason - learning
the most frequent syscalls per process and calling them in this manner (we
had an indirection table to choose the right one).

4. Call-chains which are mostly fixed (next). You want to unroll those.

5. Per-process calls. The case that bothered us the most is seccomp. In our
setup, systemd installed 17(!) BPF seccomp programs on a Redis server,
causing every syscall to go through 17 indirect branches to invoke them.
But similarly you have mmu-notifiers and others. We used a per-process
trampoline page for this.
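
Here is what I mean by the ID/binary-search idea, in C (a sketch; all
names are invented and the table size is arbitrary):

struct learned_target {
	unsigned short id;		/* unique per registered function */
	void (*fn)(void *);
};
static struct learned_target targets[8];	/* kept sorted by id */

static void dispatch(unsigned short id, void *arg, void (*slow)(void *))
{
	int lo = 0, hi = 7;

	while (lo <= hi) {		/* binary search over learned ids */
		int mid = (lo + hi) / 2;

		if (targets[mid].id == id) {
			targets[mid].fn(arg);	/* would be a patched direct call */
			return;
		}
		if (targets[mid].id < id)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	slow(arg);			/* indirect/retpoline fallback */
}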

Now there is of course the question of whether to go through automatic

Re: [PATCH v2 0/4] Static calls

2018-12-11 Thread Josh Poimboeuf
On Fri, Dec 07, 2018 at 04:06:32PM +, Edward Cree wrote:
> Sorry if this has been pointed out before (it's a very long thread), but
>  in the out-of-line implementation, it appears that static_call_update()
>  never alters key->func.  Am I right in thinking that this should be
>  fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
>  arch_static_call_transform() on line 159 of include/linux/static_call.h?

Yes, you're right about both bugs in the out-of-line case: key->func
needs to be written, and __static_call_update() needs to be called by
static_call_update.  I was so focused on getting the inline case working
that I overlooked those.
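
In other words, for the out-of-line case, something like (a sketch only;
the actual signatures in the patches may differ):

static void __static_call_update(struct static_call_key *key, void *func)
{
	/* patch the trampoline first (arch_static_call_transform()) ... */
	WRITE_ONCE(key->func, func);	/* ...then the store Edward points out */
}

and static_call_update() must actually call __static_call_update().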

> Some background (why does key->func matter for the
>  CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with
>  combining these static calls with the 'indirect call wrappers' notion
>  that Paolo Abeni has been working on [1], using runtime instrumentation
>  to determine a list of potential callees.  (This allows us to cope with
>  cases where the callees are in modules, or where different workloads may
>  use different sets of callees for a given call site, neither of which is
>  handled by Paolo's approach).
> The core of my design looks something like:
> 
> static int dynamic_call_xyz(int (*func)(some_args), some_args)
> {
>   if (func == dynamic_call_xyz_1.func)
>   return static_call(dynamic_call_xyz_1, some_args);
>   if (func == dynamic_call_xyz_2.func)
>   return static_call(dynamic_call_xyz_2, some_args);
>   return (*func)(some_args);
> }
> 
> albeit with a bunch of extra (and currently rather ugly) stuff to collect
>  the statistics needed to decide what to put in the static call keys, and
>  mechanisms (RCU in my current case) to ensure that the static call isn't
>  changed between checking its .func and actually calling it.
> 
> -Ed
> 
> PS: not on list, please keep me in CC.
> 
> [1] https://lwn.net/Articles/773985/

Thanks, this sounds very interesting.  Adding Nadav to CC, as he has
been looking at a different approach to solving the same problem:

  https://lkml.kernel.org/r/20181018005420.82993-1-namit@vmware.com

-- 
Josh


Re: [PATCH v2 0/4] Static calls

2018-12-10 Thread Pavel Machek
Hi!

> These patches are related to two similar patch sets from Ard and Steve:
> 
> - https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheuvel@linaro.org
> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
> 
> The code is also heavily inspired by the jump label code, as some of the
> concepts are very similar.
> 
> There are three separate implementations, depending on what the arch
> supports:
> 
>   1) CONFIG_HAVE_STATIC_CALL_INLINE: patched call sites - requires
>  objtool and a small amount of arch code
>   
>   2) CONFIG_HAVE_STATIC_CALL_OUTLINE: patched trampolines - requires
>  a small amount of arch code
>   
>   3) If no arch support, fall back to regular function pointers

Well, it would be nice to mention what these patches do :-).

I guess they are expected to make things slightly faster? If so it
would be nice to mention benchmarks...

(There are even statistics later in the series, but I guess at least a
short explanation should be in the cover letter.)

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



Re: [PATCH v2 0/4] Static calls

2018-12-07 Thread Edward Cree
On 07/12/18 16:06, Edward Cree wrote:
> Sorry if this has been pointed out before (it's a very long thread), but
>  in the out-of-line implementation, it appears that static_call_update()
>  never alters key->func.  Am I right in thinking that this should be
>  fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
>  arch_static_call_transform() on line 159 of include/linux/static_call.h?
On further examination, it's worse than that.

Why does the CONFIG_HAVE_STATIC_CALL_OUTLINE static_call_update() not
 call __static_call_update()?  It contains nothing but a BUILD_BUG_ON,
 which isn't likely to update anything.

-Ed


Re: [PATCH v2 0/4] Static calls

2018-12-07 Thread Edward Cree
Sorry if this has been pointed out before (it's a very long thread), but
 in the out-of-line implementation, it appears that static_call_update()
 never alters key->func.  Am I right in thinking that this should be
 fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
 arch_static_call_transform() on line 159 of include/linux/static_call.h?

Some background (why does key->func matter for the
 CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with
 combining these static calls with the 'indirect call wrappers' notion
 that Paolo Abeni has been working on [1], using runtime instrumentation
 to determine a list of potential callees.  (This allows us to cope with
 cases where the callees are in modules, or where different workloads may
 use different sets of callees for a given call site, neither of which is
 handled by Paolo's approach).
The core of my design looks something like:

static int dynamic_call_xyz(int (*func)(some_args), some_args)
{
if (func == dynamic_call_xyz_1.func)
return static_call(dynamic_call_xyz_1, some_args);
if (func == dynamic_call_xyz_2.func)
return static_call(dynamic_call_xyz_2, some_args);
return (*func)(some_args);
}

albeit with a bunch of extra (and currently rather ugly) stuff to collect
 the statistics needed to decide what to put in the static call keys, and
 mechanisms (RCU in my current case) to ensure that the static call isn't
 changed between checking its .func and actually calling it.
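
So a call site ends up looking roughly like this (sketch):

	rcu_read_lock();	/* pins the current key->func pairings */
	ret = dynamic_call_xyz(f, some_args);
	rcu_read_unlock();

with the updater doing synchronize_rcu() after retargeting a key, before
the old .func value can be reused.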

-Ed

PS: not on list, please keep me in CC.

[1] https://lwn.net/Articles/773985/


Re: [PATCH v2 0/4] Static calls

2018-12-05 Thread Andy Lutomirski
>> On Dec 5, 2018, at 7:04 AM, Josh Poimboeuf  wrote:
>
>
>> Anyway, I have a new objection to Josh’s create_gap proposal: what on
>> Earth will kernel CET do to it?  Maybe my longjmp-like hack is
>> actually better.
>
> Does CET even care about iret?  I assumed it didn't.  If it does, your
> proposal would have the same problem, no?

I think it doesn’t, but it doesn’t really matter.  The shadow stack looks like:

retaddr of function being poked
call do_int3 + 5

And, to emulate a call, you need to stick a new frame right in the
middle.  At least with a longjmp-like approach, you can clobber the
“call do_int3 + 5” part and then INCSSP on the way out.  To be fair, I
think this also sucks.

PeterZ, can we abuse NMI to make this problem go away?  I don't
suppose that we have some rule that NMI handlers never wait for other
CPUs to finish doing anything?


Re: [PATCH v2 0/4] Static calls

2018-12-05 Thread Josh Poimboeuf
On Tue, Dec 04, 2018 at 03:41:01PM -0800, Andy Lutomirski wrote:
> 
> 
> > On Dec 4, 2018, at 3:08 PM, Steven Rostedt  wrote:
> > 
> > 
> > Where did this end up BTW?
> > 
> > I know that there's controversy about the
> > CONFIG_HAVE_STATIC_CALL_OPTIMIZED option, but I don't think the 
> > CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED version was controversial. From the
> > v1 patch 0 description:
> > 
> > There are three separate implementations, depending on what the arch
> > supports:
> > 
> >  1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
> > objtool and a small amount of arch code
> > 
> >  2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
> > a small amount of arch code
> > 
> >  3) If no arch support, fall back to regular function pointers
> > 
> > My benchmarks showed the best improvements with the
> > STATIC_CALL_OPTIMIZED, but it still showed improvement with the
> > UNOPTIMIZED version as well. Can we at least apply 2 and 3 from the
> > above (which happen to be the first part of the patch set. 1 comes in
> > at the end).
> 
> Sounds good to me.
> 
> > 
> > I would also just call it CONFIG_STATIC_CALL. If we ever agree on the
> > optimized version, then we can call it CONFIG_STATIC_CALL_OPTIMIZED.
> > Having an option called UNOPTIMIZED just seems wrong.

(Poking my head up for a bit, soon to disappear again until next week)

Ard had already objected to "unoptimized", which was why for v2 I
renamed them to CONFIG_STATIC_CALL_OUTLINE and CONFIG_STATIC_CALL_INLINE.

I could rename it to CONFIG_STATIC_CALL and CONFIG_STATIC_CALL_INLINE if
you prefer.  I don't have much of an opinion either way.

I'll post a v3 next week or so, with the controversial bits more fully
separated from the non-controversial bits.  So at least the out-of-line
implementation can get merged.

> My objection to all the bike shed colors so far is that we *always*
> have static_call() — it’s just not always static.

Hm?  Do you mean you don't like that we have a generic function pointer
implementation?  or what?

> Anyway, I have a new objection to Josh’s create_gap proposal: what on
> Earth will kernel CET do to it?  Maybe my longjmp-like hack is
> actually better.

Does CET even care about iret?  I assumed it didn't.  If it does, your
proposal would have the same problem, no?

-- 
Josh


Re: [PATCH v2 0/4] Static calls

2018-12-04 Thread Andy Lutomirski



> On Dec 4, 2018, at 3:08 PM, Steven Rostedt  wrote:
> 
> 
> Where did this end up BTW?
> 
> I know that there's controversy about the
> CONFIG_HAVE_STATIC_CALL_OPTIMIZED option, but I don't think the 
> CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED version was controversial. From the
> v1 patch 0 description:
> 
> There are three separate implementations, depending on what the arch
> supports:
> 
>  1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
> objtool and a small amount of arch code
> 
>  2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
> a small amount of arch code
> 
>  3) If no arch support, fall back to regular function pointers
> 
> My benchmarks showed the best improvements with the
> STATIC_CALL_OPTIMIZED, but it still showed improvement with the
> UNOPTIMIZED version as well. Can we at least apply 2 and 3 from the
> above (which happen to be the first part of the patch set. 1 comes in
> at the end).

Sounds good to me.

> 
> I would also just call it CONFIG_STATIC_CALL. If we ever agree on the
> optimized version, then we can call it CONFIG_STATIC_CALL_OPTIMIZED.
> Having an option called UNOPTIMIZED just seems wrong.

My objection to all the bike shed colors so far is that we *always* have 
static_call() — it’s just not always static.

Anyway, I have a new objection to Josh’s create_gap proposal: what on Earth 
will kernel CET do to it?  Maybe my longjmp-like hack is actually better.




Re: [PATCH v2 0/4] Static calls

2018-12-04 Thread Steven Rostedt


Where did this end up BTW?

I know that there's controversy about the
CONFIG_HAVE_STATIC_CALL_OPTIMIZED option, but I don't think the 
CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED version was controversial. From the
v1 patch 0 description:

There are three separate implementations, depending on what the arch
supports:

  1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
 objtool and a small amount of arch code
  
  2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
 a small amount of arch code
  
  3) If no arch support, fall back to regular function pointers

My benchmarks showed the best improvements with the
STATIC_CALL_OPTIMIZED, but it still showed improvement with the
UNOPTIMIZED version as well. Can we at least apply 2 and 3 from the
above (which happen to be the first part of the patch set. 1 comes in
at the end).

I would also just call it CONFIG_STATIC_CALL. If we ever agree on the
optimized version, then we can call it CONFIG_STATIC_CALL_OPTIMIZED.
Having an option called UNOPTIMIZED just seems wrong.

-- Steve



On Mon, 26 Nov 2018 07:54:56 -0600
Josh Poimboeuf  wrote:

> v2:
> - fix STATIC_CALL_TRAMP() macro by using __PASTE() [Ard]
> - rename optimized/unoptimized -> inline/out-of-line [Ard]
> - tweak arch interfaces for PLT and add key->tramp field [Ard]
> - rename 'poison' to 'defuse' and do it after all sites have been patched 
> [Ard]
> - fix .init handling [Ard, Steven]
> - add CONFIG_HAVE_STATIC_CALL [Steven]
> - make interfaces more consistent across configs to allow tracepoints to
>   use them [Steven]
> - move __ADDRESSABLE() to static_call() macro [Steven]
> - prevent 2-byte jumps [Steven]
> - add offset to asm-offsets.c instead of hard coding key->func offset
> - add kernel_text_address() sanity check
> - make __ADDRESSABLE() symbols truly unique
> 
> TODO:
> - port Ard's arm64 patches to the new arch interfaces
> - tracepoint performance testing
> 
> 
> 
> These patches are related to two similar patch sets from Ard and Steve:
> 
> - https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheuvel@linaro.org
> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
> 
> The code is also heavily inspired by the jump label code, as some of the
> concepts are very similar.
> 
> There are three separate implementations, depending on what the arch
> supports:
> 
>   1) CONFIG_HAVE_STATIC_CALL_INLINE: patched call sites - requires
>  objtool and a small amount of arch code
>   
>   2) CONFIG_HAVE_STATIC_CALL_OUTLINE: patched trampolines - requires
>  a small amount of arch code
>   
>   3) If no arch support, fall back to regular function pointers
> 
> 
> Josh Poimboeuf (4):
>   compiler.h: Make __ADDRESSABLE() symbol truly unique
>   static_call: Add static call infrastructure
>   x86/static_call: Add out-of-line static call implementation
>   x86/static_call: Add inline static call implementation for x86-64
> 
>  arch/Kconfig  |  10 +
>  arch/x86/Kconfig  |   4 +-
>  arch/x86/include/asm/static_call.h|  52 +++
>  arch/x86/kernel/Makefile  |   1 +
>  arch/x86/kernel/asm-offsets.c |   6 +
>  arch/x86/kernel/static_call.c |  78 
>  include/asm-generic/vmlinux.lds.h |  11 +
>  include/linux/compiler.h  |   2 +-
>  include/linux/module.h|  10 +
>  include/linux/static_call.h   | 202 ++
>  include/linux/static_call_types.h |  19 +
>  kernel/Makefile   |   1 +
>  kernel/module.c   |   5 +
>  kernel/static_call.c  | 350 ++
>  tools/objtool/Makefile|   3 +-
>  tools/objtool/check.c | 126 ++-
>  tools/objtool/check.h |   2 +
>  tools/objtool/elf.h   |   1 +
>  .../objtool/include/linux/static_call_types.h |  19 +
>  tools/objtool/sync-check.sh   |   1 +
>  20 files changed, 899 insertions(+), 4 deletions(-)
>  create mode 100644 arch/x86/include/asm/static_call.h
>  create mode 100644 arch/x86/kernel/static_call.c
>  create mode 100644 include/linux/static_call.h
>  create mode 100644 include/linux/static_call_types.h
>  create mode 100644 kernel/static_call.c
>  create mode 100644 tools/objtool/include/linux/static_call_types.h
> 



Re: [PATCH v2 0/4] Static calls

2018-11-26 Thread Steven Rostedt
On Mon, 26 Nov 2018 16:24:20 -0600
Josh Poimboeuf  wrote:

> Should I add your tracepoint patch to the next version of my patches?
> 

No, not yet. Especially since I haven't totally vetted them.

When yours are ready, I'll post an RFC, and then we can add them in. I
would want an Acked-by from Mathieu Desnoyers too.

-- Steve



Re: [PATCH v2 0/4] Static calls

2018-11-26 Thread Josh Poimboeuf
On Mon, Nov 26, 2018 at 03:54:05PM -0500, Steven Rostedt wrote:
> In summary, we had this:
> 
> No RETPOLINES:
> 1.4503 +- 0.0148 seconds time elapsed  ( +-  1.02% )
> 
> baseline RETPOLINES:
> 1.5120 +- 0.0133 seconds time elapsed  ( +-  0.88% )
> 
> Added direct calls for trace_events:
> 1.5239 +- 0.0139 seconds time elapsed  ( +-  0.91% )
> 
> With static calls:
> 1.5282 +- 0.0135 seconds time elapsed  ( +-  0.88% )
> 
> With static call trampolines:
>1.48328 +- 0.00515 seconds time elapsed  ( +-  0.35% )
> 
> Full static calls:
>1.47364 +- 0.00706 seconds time elapsed  ( +-  0.48% )
> 
> 
> Adding Retpolines caused a 1.5120 / 1.4503 = 1.0425 ( 4.25% ) slowdown
> 
> Trampolines made it into 1.48328 / 1.4503 = 1.0227 ( 2.27% ) slowdown
> 
> With full static calls 1.47364 / 1.4503 = 1.0160 ( 1.6% ) slowdown
> 
> Going from 4.25 to 1.6 isn't bad, and I think this is very much worth
> the effort. I did not expect it to go to 0% as there's a lot of other
> places that retpolines cause issues, but this shows that it does help
> the tracing code.
> 
> I originally did the tests with the development config, which has a
> bunch of debugging options enabled (hackbench usually takes over 9
> seconds, not the 1.5 that was done here), and the slowdown was closer
> to 9% with retpolines. If people want me to do this with that, or I can
> send them the config. Or better yet, the code is here, just use your
> own configs.

Thanks a lot for running these.  This looks like a nice speedup.  Also a
nice reduction in the standard deviation.

Should I add your tracepoint patch to the next version of my patches?

-- 
Josh


Re: [PATCH v2 0/4] Static calls

2018-11-26 Thread Josh Poimboeuf
On Mon, Nov 26, 2018 at 07:54:56AM -0600, Josh Poimboeuf wrote:
> v2:
> - fix STATIC_CALL_TRAMP() macro by using __PASTE() [Ard]
> - rename optimized/unoptimized -> inline/out-of-line [Ard]
> - tweak arch interfaces for PLT and add key->tramp field [Ard]
> - rename 'poison' to 'defuse' and do it after all sites have been patched 
> [Ard]
> - fix .init handling [Ard, Steven]
> - add CONFIG_HAVE_STATIC_CALL [Steven]
> - make interfaces more consistent across configs to allow tracepoints to
>   use them [Steven]
> - move __ADDRESSABLE() to static_call() macro [Steven]
> - prevent 2-byte jumps [Steven]
> - add offset to asm-offsets.c instead of hard coding key->func offset
> - add kernel_text_address() sanity check
> - make __ADDRESSABLE() symbols truly unique
> 
> TODO:
> - port Ard's arm64 patches to the new arch interfaces
> - tracepoint performance testing

Below is the patch Steve gave me for converting tracepoints to use
static calls.

Steve, if you want me to do the performance testing, send me the test
details and I can give it a try this week.



diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 49ba9cde7e4b..ae16672bea61 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -11,6 +11,8 @@
 #include 
 #include 
 
+struct static_call_key;
+
 struct trace_print_flags {
unsigned long   mask;
const char  *name;
@@ -30,6 +32,8 @@ struct tracepoint_func {
 struct tracepoint {
const char *name;   /* Tracepoint name */
struct static_key key;
+   struct static_call_key *static_call_key;
+   void *iterator;
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 538ba1a58f5b..bddaf6043027 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct module;
 struct tracepoint;
@@ -94,7 +95,9 @@ extern int syscall_regfunc(void);
 extern void syscall_unregfunc(void);
 #endif /* CONFIG_HAVE_SYSCALL_TRACEPOINTS */
 
+#ifndef PARAMS
 #define PARAMS(args...) args
+#endif
 
 #define TRACE_DEFINE_ENUM(x)
 #define TRACE_DEFINE_SIZEOF(x)
@@ -161,12 +164,11 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
  * "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
  */
-#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
+#define __DO_TRACE(name, proto, args, cond, rcuidle)   \
do {\
struct tracepoint_func *it_func_ptr;\
-   void *it_func;  \
-   void *__data;   \
int __maybe_unused idx = 0; \
+   void *__data;   \
\
if (!(cond))\
return; \
@@ -186,14 +188,11 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
rcu_irq_enter_irqson(); \
}   \
\
-   it_func_ptr = rcu_dereference_raw((tp)->funcs); \
-   \
+   it_func_ptr =   \
+   rcu_dereference_raw((&__tracepoint_##name)->funcs); \
if (it_func_ptr) {  \
-   do {\
-   it_func = (it_func_ptr)->func;  \
-   __data = (it_func_ptr)->data;   \
-   ((void(*)(proto))(it_func))(args);  \
-   } while ((++it_func_ptr)->func);\
+   __data = (it_func_ptr)->data;   \
+   static_call(__tp_func_##name, args);\
}   \
\
if (rcuidle) {  \
@@ -209,7 +208,7 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
static inline void trace_##name##_rcuidle(proto)\
{
