Re: [PATCH v2 0/4] Static calls
> On Dec 12, 2018, at 1:36 PM, Edward Cree wrote:
>
> On 12/12/18 21:15, Nadav Amit wrote:
>>> On Dec 12, 2018, at 10:33 AM, Edward Cree wrote:
>>>
>>> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
>>> additional call and ret. So I wouldn't expect it to be too expensive.
>>> More to the point, it seems like it's easier to get right than the inline
>>> version, and if we get the inline version working later we can introduce it
>>> without any API change, much as Josh's existing patches have both versions
>>> behind a Kconfig switch.
>> I see. For my outlined blocks I used the opposite approach - a call followed
>> by jmp
> That's what Josh did too. I.e. caller calls the trampoline, which jmps to the
> callee; later it rets, taking it back to the caller. Perhaps I wasn't clear.
> The point is that there's still only one call and one ret.

Sorry for the misunderstanding.

>>> I was working on the assumption that it would be opt-in, wrapping a macro
>>> around indirect calls that are known to have a fairly small number of hot
>>> targets. There are plenty of indirect calls in the kernel that are only
>>> called once in a blue moon, e.g. in control-plane operations like ethtool;
>>> we don't really need to bulk up .text with trampolines for all of them.
>> On the other hand, I'm not sure the static_call interface is so intuitive.
>> And extending it into "dynamic_call" might be even worse. As I initially
>> used an opt-in approach, I can tell you that it was very exhausting.
> Well, if it's done with a gcc plugin after all, then it wouldn't be too hard
> to make it opt-out.
> One advantage of the explicit opt-in dynamic_call, though, which can be seen
> in my patch is that multiple call sites can share the same learning-state,
> if they're expected to call the same set of functions. An opt-out approach
> would automatically give each indirect call statement its own individual BTB.
> Either way, I think the question is orthogonal to what the trampolines
> themselves look like (and even to the inline vs outline question).

Not entirely. If the mechanism is opt-out and outlined, and especially if it
also supports multiple targets, you may not want to allocate all the memory
for them at build time, and instead use module memory to allocate them
dynamically (that's what we did).
Re: [PATCH v2 0/4] Static calls
On 12/12/18 21:15, Nadav Amit wrote:
>> On Dec 12, 2018, at 10:33 AM, Edward Cree wrote:
>>
>> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
>> additional call and ret. So I wouldn't expect it to be too expensive.
>> More to the point, it seems like it's easier to get right than the inline
>> version, and if we get the inline version working later we can introduce it
>> without any API change, much as Josh's existing patches have both versions
>> behind a Kconfig switch.
> I see. For my outlined blocks I used the opposite approach - a call followed
> by jmp
That's what Josh did too. I.e. caller calls the trampoline, which jmps to the
callee; later it rets, taking it back to the caller. Perhaps I wasn't clear.
The point is that there's still only one call and one ret.

>> I was working on the assumption that it would be opt-in, wrapping a macro
>> around indirect calls that are known to have a fairly small number of hot
>> targets. There are plenty of indirect calls in the kernel that are only
>> called once in a blue moon, e.g. in control-plane operations like ethtool;
>> we don't really need to bulk up .text with trampolines for all of them.
> On the other hand, I'm not sure the static_call interface is so intuitive.
> And extending it into "dynamic_call" might be even worse. As I initially
> used an opt-in approach, I can tell you that it was very exhausting.
Well, if it's done with a gcc plugin after all, then it wouldn't be too hard
to make it opt-out.

One advantage of the explicit opt-in dynamic_call, though, which can be seen
in my patch, is that multiple call sites can share the same learning-state,
if they're expected to call the same set of functions. An opt-out approach
would automatically give each indirect call statement its own individual BTB.

Either way, I think the question is orthogonal to what the trampolines
themselves look like (and even to the inline vs outline question).

-Ed
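The trampoline shape described here (one call into the trampoline, a tail-jmp
to the current target, one ret straight back to the original caller) can be
sketched in user-space C. All names below (`target`, `trampoline`, the two
callees) are invented for illustration and are not the kernel API; in the real
implementation the jmp instruction itself is rewritten by text-patching rather
than read from a data pointer:

```c
#include <assert.h>

static int add_one(int x) { return x + 1; }
static int add_two(int x) { return x + 2; }

/* the patchable slot; the kernel rewrites a jmp in .text instead of
 * loading a pointer from .data like this sketch does */
static int (*target)(int) = add_one;

/* with tail-call optimization this body compiles to `jmp *target`,
 * so the callee's ret returns directly to trampoline()'s caller:
 * one call and one ret in total */
static int trampoline(int x)
{
    return target(x);
}
```

Retargeting is just a store to `target` in this sketch, whereas the real
static_call patches the jmp instruction in the trampoline.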
Re: [PATCH v2 0/4] Static calls
> On Dec 12, 2018, at 10:33 AM, Edward Cree wrote:
>
> On 12/12/18 18:14, Nadav Amit wrote:
>> Second, (2i) is not very intuitive for me. Using the out-of-line static
>> calls seems to me less performant than the inline (potentially, I didn't
>> check).
>>
>> Anyhow, the use of out-of-line static calls seems counter-intuitive to me.
>> I think (didn't measure) that it may add more overhead than it saves due
>> to the additional call, ret, and so on
> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
> additional call and ret. So I wouldn't expect it to be too expensive.
> More to the point, it seems like it's easier to get right than the inline
> version, and if we get the inline version working later we can introduce it
> without any API change, much as Josh's existing patches have both versions
> behind a Kconfig switch.

I see. For my outlined blocks I used the opposite approach - a call followed
by jmp (instead of the jmp followed by call that Josh did). Does the stack
look correct when you first do the jmp? It seems to me that you will not see
the calling function on the stack in this case. Can it have an effect on
live-patching, debugging?

>> I tried to avoid reading the compared target from memory and therefore
>> used an immediate. This should prevent data cache misses and even when the
>> data is available is faster by one cycle. But it requires the patching of
>> both the "cmp %target-reg, imm" and "call rel-target" to be done
>> "atomically". So the static-calls mechanism wouldn't be sufficient.
> The approach I took to deal with that (since though I'm doing a read from
> memory, it's key->func in .data rather than the jmp immediate in .text) was
> to have another static_call (though a plain static_key could also be used)
> to 'skip' the fast-path while it's actually being patched.
> Then, since all my callers were under the rcu_read_lock, I just needed to
> synchronize_rcu() after switching off the fast-path to make sure no threads
> were still in it.
> I'm not sure how that would be generalised to all cases, though; we don't
> want to force every indirect call to take the rcu_read_lock as that means
> no callee can ever synchronize_rcu(). I guess we could have our own
> separate RCU read lock just for indirect call patching? (What does kgraft
> do?)

I used a similar approach to a certain extent. (I'm going to describe the
implementation following the discussion with Andy Lutomirski.)

We use a "restartable section": if we need to be preempted in this block of
code, we restart the entire section. Then we use synchronize_rcu() like you
do after patching.

>> Based on Josh's previous feedback, I thought of improving the learning
>> using some hysteresis. Anyhow, note that there are quite a few cases in
>> which you wouldn't want optpolines. The question is whether in general it
>> would be an opt-in or opt-out mechanism.
> I was working on the assumption that it would be opt-in, wrapping a macro
> around indirect calls that are known to have a fairly small number of hot
> targets. There are plenty of indirect calls in the kernel that are only
> called once in a blue moon, e.g. in control-plane operations like ethtool;
> we don't really need to bulk up .text with trampolines for all of them.

On the other hand, I'm not sure the static_call interface is so intuitive.
And extending it into "dynamic_call" might be even worse. As I initially
used an opt-in approach, I can tell you that it was very exhausting.
Re: [PATCH v2 0/4] Static calls
On 12/12/18 18:14, Nadav Amit wrote:
> Second, (2i) is not very intuitive for me. Using the out-of-line static
> calls seems to me less performant than the inline (potentially, I didn't
> check).
>
> Anyhow, the use of out-of-line static calls seems counter-intuitive to me.
> I think (didn't measure) that it may add more overhead than it saves due
> to the additional call, ret, and so on
AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
additional call and ret. So I wouldn't expect it to be too expensive.
More to the point, it seems like it's easier to get right than the inline
version, and if we get the inline version working later we can introduce it
without any API change, much as Josh's existing patches have both versions
behind a Kconfig switch.

> I tried to avoid reading the compared target from memory and therefore
> used an immediate. This should prevent data cache misses and even when the
> data is available is faster by one cycle. But it requires the patching of
> both the "cmp %target-reg, imm" and "call rel-target" to be done
> "atomically". So the static-calls mechanism wouldn't be sufficient.
The approach I took to deal with that (since though I'm doing a read from
memory, it's key->func in .data rather than the jmp immediate in .text) was
to have another static_call (though a plain static_key could also be used)
to 'skip' the fast-path while it's actually being patched. Then, since all
my callers were under the rcu_read_lock, I just needed to synchronize_rcu()
after switching off the fast-path to make sure no threads were still in it.

I'm not sure how that would be generalised to all cases, though; we don't
want to force every indirect call to take the rcu_read_lock as that means
no callee can ever synchronize_rcu(). I guess we could have our own
separate RCU read lock just for indirect call patching? (What does kgraft
do?)

> Based on Josh's previous feedback, I thought of improving the learning
> using some hysteresis. Anyhow, note that there are quite a few cases in
> which you wouldn't want optpolines. The question is whether in general it
> would be an opt-in or opt-out mechanism.
I was working on the assumption that it would be opt-in, wrapping a macro
around indirect calls that are known to have a fairly small number of hot
targets. There are plenty of indirect calls in the kernel that are only
called once in a blue moon, e.g. in control-plane operations like ethtool;
we don't really need to bulk up .text with trampolines for all of them.

-Ed
Re: [PATCH v2 0/4] Static calls
> On Dec 12, 2018, at 9:11 AM, Edward Cree wrote:
>
> On 12/12/18 05:59, Nadav Amit wrote:
>> Thanks for cc'ing me. (I didn't know about the other patch-sets.)
> Well in my case, that's because I haven't posted any yet. (Will follow up
> shortly with what I currently have, though it's not pretty.)
>
> Looking at your patches, it seems you've got a much more developed learning
> mechanism. Mine on the other hand is brutally simple but runs continuously
> (i.e. after we patch we immediately enter the next 'relearning' phase);
> since it never does anything but prod a handful of percpu variables, this
> shouldn't be too costly.
>
> Also, you've got the macrology for making all indirect calls use this,
> whereas at present I just have an open-coded instance on a single call site
> (I went with deliver_skb in the networking stack).
>
> So I think where we probably want to go from here is:
> 1) get Josh's static_calls in. AIUI Linus seems to prefer the out-of-line
>    approach; I'd say ditch the inline version (at least for now).
> 2) build a relpolines patch series that uses
>    i) static_calls for the text-patching part
>    ii) as much of Nadav's macrology as is applicable
>    iii) either my or Nadav's learning mechanism; we can experiment with
>         both, bikeshed it incessantly etc.
>
> Seem reasonable?

Mostly yes. I have a few reservations (and let's call them optpolines from
now on, since Josh disliked the previous name).

First, I still have to address the issues that Josh raised before, and try
to use a gcc plugin instead of (most of) the macros. Specifically, I need to
bring back (from my PoC code) the part that sets multiple targets.

Second, (2i) is not very intuitive for me. Using the out-of-line static
calls seems to me less performant than the inline (potentially, I didn't
check). Anyhow, the use of out-of-line static calls seems counter-intuitive
to me. I think (didn't measure) that it may add more overhead than it saves
due to the additional call, ret, and so on - at least if retpolines are not
used. For multiple targets it may be useful in saving some memory if the
outline block is dynamically allocated (as I did in my yet unpublished
code). But that's not how it's done in Josh's code.

If we talk about the inline implementation there is a different problem that
prevents me from using Josh's static-calls as-is. I tried to avoid reading
the compared target from memory and therefore used an immediate. This should
prevent data cache misses and even when the data is available is faster by
one cycle. But it requires the patching of both the "cmp %target-reg, imm"
and "call rel-target" to be done "atomically". So the static-calls mechanism
wouldn't be sufficient.

Based on Josh's previous feedback, I thought of improving the learning using
some hysteresis. Anyhow, note that there are quite a few cases in which you
wouldn't want optpolines. The question is whether in general it would be an
opt-in or opt-out mechanism.

Let me know what you think.

BTW: when it comes to deliver_skb, you have packet_type as an identifier.
You can use it directly or through an indirection table to figure out the
target. Here's a chunk of assembly magic that I used in a similar case:

.macro _call_table val:req bit:req max:req val1:req bit1:req
call_table_\val\()_\bit\():
	test	$(1 << \bit), %al
.if \val1 + (1 << \bit1) >= \max
	jnz	syscall_relpoline_\val1
	jmp	syscall_relpoline_\val
.else
	jnz	call_table_\val1\()_\bit1
	# fall-through: no carry, val unchanged, go to the next bit
	call_table \val,\bit1,\max
	call_table \val1,\bit1,\max
.endif
.endm

.macro call_table val:req bit:req max:req
.altmacro
	_call_table \val,\bit,\max,%(\val + (1 << \bit)),%(\bit + 1)
.noaltmacro
.endm

ENTRY(direct_syscall)
	mov	%esi, %eax
	call_table val=0 bit=0 max=16
ENDPROC(direct_syscall)
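The macros above expand into a binary tree of conditional branches, testing
one bit of the value per level, so dispatch reaches its leaf with only direct
jumps and no indirect branch. A hypothetical C rendering of the same idea,
shrunk to four targets (the handler names are made up for the sketch):

```c
#include <assert.h>

static int sys_a(void) { return 100; }
static int sys_b(void) { return 101; }
static int sys_c(void) { return 102; }
static int sys_d(void) { return 103; }

/* one conditional branch per bit, mirroring `test $(1 << bit), %al`;
 * every leaf is a direct call, so no indirect-branch predictor entry
 * (and no retpoline) is needed */
static int direct_dispatch(unsigned int v)
{
    if (v & 2) {
        if (v & 1)
            return sys_d();  /* v == 3 */
        return sys_c();      /* v == 2 */
    }
    if (v & 1)
        return sys_b();      /* v == 1 */
    return sys_a();          /* v == 0 */
}
```

For N targets this costs log2(N) well-predicted conditional branches per
call, which is the trade the assembly above is making against one indirect
branch.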
Re: [PATCH v2 0/4] Static calls
On 12/12/18 05:59, Nadav Amit wrote:
> Thanks for cc'ing me. (I didn't know about the other patch-sets.)
Well in my case, that's because I haven't posted any yet. (Will follow up
shortly with what I currently have, though it's not pretty.)

Looking at your patches, it seems you've got a much more developed learning
mechanism. Mine on the other hand is brutally simple but runs continuously
(i.e. after we patch we immediately enter the next 'relearning' phase);
since it never does anything but prod a handful of percpu variables, this
shouldn't be too costly.

Also, you've got the macrology for making all indirect calls use this,
whereas at present I just have an open-coded instance on a single call site
(I went with deliver_skb in the networking stack).

So I think where we probably want to go from here is:
1) get Josh's static_calls in. AIUI Linus seems to prefer the out-of-line
   approach; I'd say ditch the inline version (at least for now).
2) build a relpolines patch series that uses
   i) static_calls for the text-patching part
   ii) as much of Nadav's macrology as is applicable
   iii) either my or Nadav's learning mechanism; we can experiment with
        both, bikeshed it incessantly etc.

Seem reasonable?

-Ed
Re: [PATCH v2 0/4] Static calls
> On Dec 11, 2018, at 10:05 AM, Josh Poimboeuf wrote:
>
> On Fri, Dec 07, 2018 at 04:06:32PM +, Edward Cree wrote:
>> Sorry if this has been pointed out before (it's a very long thread), but
>> in the out-of-line implementation, it appears that static_call_update()
>> never alters key->func. Am I right in thinking that this should be
>> fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
>> arch_static_call_transform() on line 159 of include/linux/static_call.h?
>
> Yes, you're right about both bugs in the out-of-line case: key->func
> needs to be written, and __static_call_update() needs to be called by
> static_call_update. I was so focused on getting the inline case working
> that I overlooked those.
>
>> Some background (why does key->func matter for the
>> CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with
>> combining these static calls with the 'indirect call wrappers' notion
>> that Paolo Abeni has been working on [1], using runtime instrumentation
>> to determine a list of potential callees. (This allows us to cope with
>> cases where the callees are in modules, or where different workloads may
>> use different sets of callees for a given call site, neither of which is
>> handled by Paolo's approach.)
>> The core of my design looks something like:
>>
>> static int dynamic_call_xyz(int (*func)(some_args), some_args)
>> {
>> 	if (func == dynamic_call_xyz_1.func)
>> 		return static_call(dynamic_call_xyz_1, some_args);
>> 	if (func == dynamic_call_xyz_2.func)
>> 		return static_call(dynamic_call_xyz_2, some_args);
>> 	return (*func)(some_args);
>> }
>>
>> albeit with a bunch of extra (and currently rather ugly) stuff to collect
>> the statistics needed to decide what to put in the static call keys, and
>> mechanisms (RCU in my current case) to ensure that the static call isn't
>> changed between checking its .func and actually calling it.
>>
>> -Ed
>>
>> PS: not on list, please keep me in CC.
>>
>> [1] https://lwn.net/Articles/773985/
>
> Thanks, this sounds very interesting. Adding Nadav to CC, as he has
> been looking at a different approach to solving the same problem:
>
> https://lkml.kernel.org/r/20181018005420.82993-1-namit@vmware.com

Thanks for cc'ing me. (I didn't know about the other patch-sets.)

Allow me to share my experience, because I have been studying this issue for
some time and have implemented more than I shared, since the code needs more
cleanup. Some of the proposed approaches are things we either considered or
actually implemented (and later dropped). Eventually, our design was guided
by performance profiling and a grain of "what's academic" consideration.

I think that eventually you would want to go with one central mechanism for
the various situations:

1. Registered targets (static) and automatically learnt targets (dynamic).
   Registration does not work in some cases (e.g., file-system function
   pointers, there are too many of those). On the other hand, if you know
   your target it is obviously simpler/better.

2. With/without retpoline fallback. We have always had the retpoline as
   fallback, but if you use a registration mechanism, it's not necessary.

3. Single and multiple targets. For multiple targets we decided to use an
   outline block in order not to inflate the code for no reason. There were
   over 1 indirect branches in our kernel build, but in our workloads only
   ~500 were actually run.

   If you go with the approach that Edward mentioned, you may want to
   associate each "function" with an identifier (think about file_operations
   having an additional field that holds a unique ID, or using the struct
   address). This would allow you to use a "binary search" to find the right
   target, which would be slightly more efficient. We actually used a
   binary-search for a different reason - learning the most frequent
   syscalls per process and calling them in this manner (we actually had an
   indirection table to choose the right one).

4. Call-chains which are mostly fixed (next). You want to unroll those.

5. Per-process calls. The case that bothered us the most is seccomp. On our
   setup, systemd installed 17(!) BPF seccomp programs on a Redis server,
   causing every syscall to go through 17 indirect branches to invoke them.
   But similarly you have mmu-notifiers and others. We used a per-process
   trampoline page for this matter.

Now there is of course the question of whether to go through automatic
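The "binary search" over per-callee identifiers suggested here can be
sketched as follows. The IDs, table, and handlers are all invented for
illustration; in the real case each table slot would be reached via a
patched direct call, and the final failure return would be the
retpoline/indirect-call fallback:

```c
#include <assert.h>

static int recv_a(int x) { return x + 1; }
static int recv_b(int x) { return x + 2; }
static int recv_c(int x) { return x + 3; }

struct target { int id; int (*fn)(int); };

/* sorted by id: each callee carries a unique small identifier,
 * e.g. an extra field in its ops structure */
static const struct target table[] = {
    { 10, recv_a },
    { 20, recv_b },
    { 30, recv_c },
};

static int dispatch(int id, int x)
{
    int lo = 0, hi = (int)(sizeof(table) / sizeof(table[0])) - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;

        if (table[mid].id == id)
            return table[mid].fn(x);  /* a direct call per slot */
        if (table[mid].id < id)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;  /* unknown target: retpoline fallback in the real case */
}
```

Compared with a linear chain of pointer compares, this takes log2(N)
comparisons to pick the direct call, which is the "slightly more efficient"
part of the argument.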
Re: [PATCH v2 0/4] Static calls
On Fri, Dec 07, 2018 at 04:06:32PM +, Edward Cree wrote:
> Sorry if this has been pointed out before (it's a very long thread), but
> in the out-of-line implementation, it appears that static_call_update()
> never alters key->func. Am I right in thinking that this should be
> fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
> arch_static_call_transform() on line 159 of include/linux/static_call.h?

Yes, you're right about both bugs in the out-of-line case: key->func
needs to be written, and __static_call_update() needs to be called by
static_call_update. I was so focused on getting the inline case working
that I overlooked those.

> Some background (why does key->func matter for the
> CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with
> combining these static calls with the 'indirect call wrappers' notion
> that Paolo Abeni has been working on [1], using runtime instrumentation
> to determine a list of potential callees. (This allows us to cope with
> cases where the callees are in modules, or where different workloads may
> use different sets of callees for a given call site, neither of which is
> handled by Paolo's approach.)
> The core of my design looks something like:
>
> static int dynamic_call_xyz(int (*func)(some_args), some_args)
> {
> 	if (func == dynamic_call_xyz_1.func)
> 		return static_call(dynamic_call_xyz_1, some_args);
> 	if (func == dynamic_call_xyz_2.func)
> 		return static_call(dynamic_call_xyz_2, some_args);
> 	return (*func)(some_args);
> }
>
> albeit with a bunch of extra (and currently rather ugly) stuff to collect
> the statistics needed to decide what to put in the static call keys, and
> mechanisms (RCU in my current case) to ensure that the static call isn't
> changed between checking its .func and actually calling it.
>
> -Ed
>
> PS: not on list, please keep me in CC.
>
> [1] https://lwn.net/Articles/773985/

Thanks, this sounds very interesting. Adding Nadav to CC, as he has
been looking at a different approach to solving the same problem:

https://lkml.kernel.org/r/20181018005420.82993-1-na...@vmware.com

--
Josh
Re: [PATCH v2 0/4] Static calls
Hi!

> These patches are related to two similar patch sets from Ard and Steve:
>
> - https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
>
> The code is also heavily inspired by the jump label code, as some of the
> concepts are very similar.
>
> There are three separate implementations, depending on what the arch
> supports:
>
> 1) CONFIG_HAVE_STATIC_CALL_INLINE: patched call sites - requires
>    objtool and a small amount of arch code
>
> 2) CONFIG_HAVE_STATIC_CALL_OUTLINE: patched trampolines - requires
>    a small amount of arch code
>
> 3) If no arch support, fall back to regular function pointers

Well, it would be nice to mention what these patches do :-). I guess they
are expected to make things slightly faster? If so it would be nice to
mention benchmarks... (There are even statistics later in the series, but I
guess at least a short explanation should be in the cover letter.)

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH v2 0/4] Static calls
On 07/12/18 16:06, Edward Cree wrote:
> Sorry if this has been pointed out before (it's a very long thread), but
> in the out-of-line implementation, it appears that static_call_update()
> never alters key->func. Am I right in thinking that this should be
> fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
> arch_static_call_transform() on line 159 of include/linux/static_call.h?
On further examination, it's worse than that. Why does the
CONFIG_HAVE_STATIC_CALL_OUTLINE static_call_update() not call
__static_call_update()? It contains nothing but a BUILD_BUG_ON, which
isn't likely to update anything.

-Ed
Re: [PATCH v2 0/4] Static calls
Sorry if this has been pointed out before (it's a very long thread), but in
the out-of-line implementation, it appears that static_call_update() never
alters key->func. Am I right in thinking that this should be fixed by
adding 'WRITE_ONCE(key->func, func);' just after the call to
arch_static_call_transform() on line 159 of include/linux/static_call.h?

Some background (why does key->func matter for the
CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with combining
these static calls with the 'indirect call wrappers' notion that Paolo
Abeni has been working on [1], using runtime instrumentation to determine a
list of potential callees. (This allows us to cope with cases where the
callees are in modules, or where different workloads may use different sets
of callees for a given call site, neither of which is handled by Paolo's
approach.)

The core of my design looks something like:

static int dynamic_call_xyz(int (*func)(some_args), some_args)
{
	if (func == dynamic_call_xyz_1.func)
		return static_call(dynamic_call_xyz_1, some_args);
	if (func == dynamic_call_xyz_2.func)
		return static_call(dynamic_call_xyz_2, some_args);
	return (*func)(some_args);
}

albeit with a bunch of extra (and currently rather ugly) stuff to collect
the statistics needed to decide what to put in the static call keys, and
mechanisms (RCU in my current case) to ensure that the static call isn't
changed between checking its .func and actually calling it.

-Ed

PS: not on list, please keep me in CC.

[1] https://lwn.net/Articles/773985/
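The design sketch above can be made concrete as a runnable user-space
program. Direct calls stand in for static_call(), a plain struct stands in
for the static_call_key, and the callee names and key contents are invented
for the sketch:

```c
#include <assert.h>

static int callee_a(int x) { return x * 2; }
static int callee_b(int x) { return x + 100; }
static int callee_cold(int x) { return x - 1; }  /* never promoted */

/* stand-in for struct static_call_key; only .func matters here */
struct dyn_key { int (*func)(int); };

static struct dyn_key dynamic_call_xyz_1 = { callee_a };
static struct dyn_key dynamic_call_xyz_2 = { callee_b };

static int dynamic_call_xyz(int (*func)(int), int x)
{
    if (func == dynamic_call_xyz_1.func)
        return callee_a(x);  /* stands in for static_call(xyz_1, x) */
    if (func == dynamic_call_xyz_2.func)
        return callee_b(x);  /* stands in for static_call(xyz_2, x) */
    return (*func)(x);       /* cold target: plain indirect call */
}
```

The comparisons read key->func, which is exactly why static_call_update()
must keep key->func in sync with the patched trampoline - the bug the email
above reports.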
Re: [PATCH v2 0/4] Static calls
> On Dec 5, 2018, at 7:04 AM, Josh Poimboeuf wrote:
>
>> Anyway, I have a new objection to Josh's create_gap proposal: what on
>> Earth will kernel CET do to it? Maybe my longjmp-like hack is
>> actually better.
>
> Does CET even care about iret? I assumed it didn't. If it does, your
> proposal would have the same problem, no?

I think it doesn't, but it doesn't really matter. The shadow stack looks
like:

  retaddr of function being poked
  call do_int3 + 5

And, to emulate a call, you need to stick a new frame right in the middle.
At least with a longjmp-like approach, you can clobber the "call do_int3 + 5"
part and then INCSSP on the way out. To be fair, I think this also sucks.

PeterZ, can we abuse NMI to make this problem go away? I don't suppose that
we have some rule that NMI handlers never wait for other CPUs to finish
doing anything?
Re: [PATCH v2 0/4] Static calls
On Tue, Dec 04, 2018 at 03:41:01PM -0800, Andy Lutomirski wrote:
>> On Dec 4, 2018, at 3:08 PM, Steven Rostedt wrote:
>>
>> Where did this end up BTW?
>>
>> I know that there's controversy about the
>> CONFIG_HAVE_STATIC_CALL_OPTIMIZED option, but I don't think the
>> CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED version was controversial. From the
>> v1 patch 0 description:
>>
>> There are three separate implementations, depending on what the arch
>> supports:
>>
>> 1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
>>    objtool and a small amount of arch code
>>
>> 2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
>>    a small amount of arch code
>>
>> 3) If no arch support, fall back to regular function pointers
>>
>> My benchmarks showed the best improvements with the
>> STATIC_CALL_OPTIMIZED, but it still showed improvement with the
>> UNOPTIMIZED version as well. Can we at least apply 2 and 3 from the
>> above (which happen to be the first part of the patch set; 1 comes in
>> at the end)?
>
> Sounds good to me.
>
>> I would also just call it CONFIG_STATIC_CALL. If we ever agree on the
>> optimized version, then we can call it CONFIG_STATIC_CALL_OPTIMIZED.
>> Having an option called UNOPTIMIZED just seems wrong.

(Poking my head up for a bit, soon to disappear again until next week.)

Ard had already objected to "unoptimized", which was why for v2 I renamed
them to CONFIG_STATIC_CALL_OUTLINE and CONFIG_STATIC_CALL_INLINE. I could
rename it to CONFIG_STATIC_CALL and CONFIG_STATIC_CALL_INLINE if you
prefer. I don't have much of an opinion either way.

I'll post a v3 next week or so, with the controversial bits more fully
separated from the non-controversial bits. So at least the out-of-line
implementation can get merged.

> My objection to all the bike shed colors so far is that we *always*
> have static_call() - it's just not always static.

Hm? Do you mean you don't like that we have a generic function pointer
implementation? Or what?

> Anyway, I have a new objection to Josh's create_gap proposal: what on
> Earth will kernel CET do to it? Maybe my longjmp-like hack is
> actually better.

Does CET even care about iret? I assumed it didn't. If it does, your
proposal would have the same problem, no?

--
Josh
Re: [PATCH v2 0/4] Static calls
> On Dec 4, 2018, at 3:08 PM, Steven Rostedt wrote:
>
> Where did this end up BTW?
>
> I know that there's controversy about the
> CONFIG_HAVE_STATIC_CALL_OPTIMIZED option, but I don't think the
> CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED version was controversial. From the
> v1 patch 0 description:
>
> There are three separate implementations, depending on what the arch
> supports:
>
> 1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
>    objtool and a small amount of arch code
>
> 2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
>    a small amount of arch code
>
> 3) If no arch support, fall back to regular function pointers
>
> My benchmarks showed the best improvements with
> STATIC_CALL_OPTIMIZED, but it still showed improvement with the
> UNOPTIMIZED version as well. Can we at least apply 2 and 3 from the
> above (which happen to be the first part of the patch set; 1 comes in
> at the end)?

Sounds good to me.

> I would also just call it CONFIG_STATIC_CALL. If we ever agree on the
> optimized version, then we can call it CONFIG_STATIC_CALL_OPTIMIZED.
> Having an option called UNOPTIMIZED just seems wrong.

My objection to all the bike shed colors so far is that we *always*
have static_call() — it’s just not always static.

Anyway, I have a new objection to Josh’s create_gap proposal: what on
Earth will kernel CET do to it? Maybe my longjmp-like hack is
actually better.
Re: [PATCH v2 0/4] Static calls
Where did this end up BTW?

I know that there's controversy about the
CONFIG_HAVE_STATIC_CALL_OPTIMIZED option, but I don't think the
CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED version was controversial. From the
v1 patch 0 description:

  There are three separate implementations, depending on what the arch
  supports:

  1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
     objtool and a small amount of arch code

  2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
     a small amount of arch code

  3) If no arch support, fall back to regular function pointers

My benchmarks showed the best improvements with
STATIC_CALL_OPTIMIZED, but it still showed improvement with the
UNOPTIMIZED version as well. Can we at least apply 2 and 3 from the
above (which happen to be the first part of the patch set; 1 comes in
at the end)?

I would also just call it CONFIG_STATIC_CALL. If we ever agree on the
optimized version, then we can call it CONFIG_STATIC_CALL_OPTIMIZED.
Having an option called UNOPTIMIZED just seems wrong.

-- Steve

On Mon, 26 Nov 2018 07:54:56 -0600 Josh Poimboeuf wrote:

> v2:
> - fix STATIC_CALL_TRAMP() macro by using __PASTE() [Ard]
> - rename optimized/unoptimized -> inline/out-of-line [Ard]
> - tweak arch interfaces for PLT and add key->tramp field [Ard]
> - rename 'poison' to 'defuse' and do it after all sites have been
>   patched [Ard]
> - fix .init handling [Ard, Steven]
> - add CONFIG_HAVE_STATIC_CALL [Steven]
> - make interfaces more consistent across configs to allow tracepoints
>   to use them [Steven]
> - move __ADDRESSABLE() to static_call() macro [Steven]
> - prevent 2-byte jumps [Steven]
> - add offset to asm-offsets.c instead of hard coding key->func offset
> - add kernel_text_address() sanity check
> - make __ADDRESSABLE() symbols truly unique
>
> TODO:
> - port Ard's arm64 patches to the new arch interfaces
> - tracepoint performance testing
>
> These patches are related to two similar patch sets from Ard and Steve:
>
> - https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
> - https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
>
> The code is also heavily inspired by the jump label code, as some of
> the concepts are very similar.
>
> There are three separate implementations, depending on what the arch
> supports:
>
> 1) CONFIG_HAVE_STATIC_CALL_INLINE: patched call sites - requires
>    objtool and a small amount of arch code
>
> 2) CONFIG_HAVE_STATIC_CALL_OUTLINE: patched trampolines - requires
>    a small amount of arch code
>
> 3) If no arch support, fall back to regular function pointers
>
> Josh Poimboeuf (4):
>   compiler.h: Make __ADDRESSABLE() symbol truly unique
>   static_call: Add static call infrastructure
>   x86/static_call: Add out-of-line static call implementation
>   x86/static_call: Add inline static call implementation for x86-64
>
>  arch/Kconfig                                   |  10 +
>  arch/x86/Kconfig                               |   4 +-
>  arch/x86/include/asm/static_call.h             |  52 +++
>  arch/x86/kernel/Makefile                       |   1 +
>  arch/x86/kernel/asm-offsets.c                  |   6 +
>  arch/x86/kernel/static_call.c                  |  78
>  include/asm-generic/vmlinux.lds.h              |  11 +
>  include/linux/compiler.h                       |   2 +-
>  include/linux/module.h                         |  10 +
>  include/linux/static_call.h                    | 202 ++
>  include/linux/static_call_types.h              |  19 +
>  kernel/Makefile                                |   1 +
>  kernel/module.c                                |   5 +
>  kernel/static_call.c                           | 350 ++
>  tools/objtool/Makefile                         |   3 +-
>  tools/objtool/check.c                          | 126 ++-
>  tools/objtool/check.h                          |   2 +
>  tools/objtool/elf.h                            |   1 +
>  .../objtool/include/linux/static_call_types.h  |  19 +
>  tools/objtool/sync-check.sh                    |   1 +
>  20 files changed, 899 insertions(+), 4 deletions(-)
>  create mode 100644 arch/x86/include/asm/static_call.h
>  create mode 100644 arch/x86/kernel/static_call.c
>  create mode 100644 include/linux/static_call.h
>  create mode 100644 include/linux/static_call_types.h
>  create mode 100644 kernel/static_call.c
>  create mode 100644 tools/objtool/include/linux/static_call_types.h
Re: [PATCH v2 0/4] Static calls
On Mon, 26 Nov 2018 16:24:20 -0600 Josh Poimboeuf wrote:

> Should I add your tracepoint patch to the next version of my patches?

No, not yet. Especially since I haven't totally vetted them. When yours
are ready, I'll post an RFC, and then we can add them in. I would want
an Acked-by from Mathieu Desnoyers too.

-- Steve
Re: [PATCH v2 0/4] Static calls
On Mon, Nov 26, 2018 at 03:54:05PM -0500, Steven Rostedt wrote:
> In summary, we had this:
>
>  No RETPOLINES:
>    1.4503 +- 0.0148 seconds time elapsed ( +- 1.02% )
>
>  baseline RETPOLINES:
>    1.5120 +- 0.0133 seconds time elapsed ( +- 0.88% )
>
>  Added direct calls for trace_events:
>    1.5239 +- 0.0139 seconds time elapsed ( +- 0.91% )
>
>  With static calls:
>    1.5282 +- 0.0135 seconds time elapsed ( +- 0.88% )
>
>  With static call trampolines:
>    1.48328 +- 0.00515 seconds time elapsed ( +- 0.35% )
>
>  Full static calls:
>    1.47364 +- 0.00706 seconds time elapsed ( +- 0.48% )
>
> Adding retpolines caused a 1.5120 / 1.4503 = 1.0425 ( 4.25% ) slowdown.
>
> Trampolines brought that down to 1.48328 / 1.4503 = 1.0227 ( 2.27% ).
>
> Full static calls brought it to 1.47364 / 1.4503 = 1.0160 ( 1.6% ).
>
> Going from 4.25% to 1.6% isn't bad, and I think this is very much worth
> the effort. I did not expect it to go to 0%, as there are a lot of
> other places where retpolines cause issues, but this shows that it
> does help the tracing code.
>
> I originally did the tests with the development config, which has a
> bunch of debugging options enabled (hackbench usually takes over 9
> seconds, not the 1.5 seen here), and the slowdown was closer to 9%
> with retpolines. I can redo the tests with that config if people want,
> or I can send them the config. Or better yet, the code is here, just
> use your own configs.

Thanks a lot for running these. This looks like a nice speedup. Also a
nice reduction in the standard deviation.

Should I add your tracepoint patch to the next version of my patches?

-- Josh
Re: [PATCH v2 0/4] Static calls
On Mon, Nov 26, 2018 at 07:54:56AM -0600, Josh Poimboeuf wrote:
> v2:
> - fix STATIC_CALL_TRAMP() macro by using __PASTE() [Ard]
> - rename optimized/unoptimized -> inline/out-of-line [Ard]
> - tweak arch interfaces for PLT and add key->tramp field [Ard]
> - rename 'poison' to 'defuse' and do it after all sites have been
>   patched [Ard]
> - fix .init handling [Ard, Steven]
> - add CONFIG_HAVE_STATIC_CALL [Steven]
> - make interfaces more consistent across configs to allow tracepoints
>   to use them [Steven]
> - move __ADDRESSABLE() to static_call() macro [Steven]
> - prevent 2-byte jumps [Steven]
> - add offset to asm-offsets.c instead of hard coding key->func offset
> - add kernel_text_address() sanity check
> - make __ADDRESSABLE() symbols truly unique
>
> TODO:
> - port Ard's arm64 patches to the new arch interfaces
> - tracepoint performance testing

Below is the patch Steve gave me for converting tracepoints to use
static calls. Steve, if you want me to do the performance testing, send
me the test details and I can give it a try this week.
diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 49ba9cde7e4b..ae16672bea61 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -11,6 +11,8 @@
 #include
 #include
 
+struct static_call_key;
+
 struct trace_print_flags {
 	unsigned long		mask;
 	const char		*name;
@@ -30,6 +32,8 @@ struct tracepoint_func {
 struct tracepoint {
 	const char *name;		/* Tracepoint name */
 	struct static_key key;
+	struct static_call_key *static_call_key;
+	void *iterator;
 	int (*regfunc)(void);
 	void (*unregfunc)(void);
 	struct tracepoint_func __rcu *funcs;
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 538ba1a58f5b..bddaf6043027 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include
 
 struct module;
 struct tracepoint;
@@ -94,7 +95,9 @@ extern int syscall_regfunc(void);
 extern void syscall_unregfunc(void);
 #endif /* CONFIG_HAVE_SYSCALL_TRACEPOINTS */
 
+#ifndef PARAMS
 #define PARAMS(args...) args
+#endif
 
 #define TRACE_DEFINE_ENUM(x)
 #define TRACE_DEFINE_SIZEOF(x)
@@ -161,12 +164,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
  * "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
  */
-#define __DO_TRACE(tp, proto, args, cond, rcuidle)			\
+#define __DO_TRACE(name, proto, args, cond, rcuidle)			\
 	do {								\
 		struct tracepoint_func *it_func_ptr;			\
-		void *it_func;						\
-		void *__data;						\
 		int __maybe_unused idx = 0;				\
+		void *__data;						\
 									\
 		if (!(cond))						\
 			return;						\
@@ -186,14 +188,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			rcu_irq_enter_irqson();				\
 		}							\
 									\
-		it_func_ptr = rcu_dereference_raw((tp)->funcs);		\
-									\
+		it_func_ptr =						\
+			rcu_dereference_raw((&__tracepoint_##name)->funcs); \
 		if (it_func_ptr) {					\
-			do {						\
-				it_func = (it_func_ptr)->func;		\
-				__data = (it_func_ptr)->data;		\
-				((void(*)(proto))(it_func))(args);	\
-			} while ((++it_func_ptr)->func);		\
+			__data = (it_func_ptr)->data;			\
+			static_call(__tp_func_##name, args);		\
 		}							\
 									\
 		if (rcuidle) {						\
@@ -209,7 +208,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	static inline void trace_##name##_rcuidle(proto)		\
 	{