Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote: > On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf wrote: > > > > On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote: > > > > > > I propose a different solution: > > > > > > As in this patch set, we have a direct and an indirect version. The > > > indirect version remains exactly the same as in this patch set. The > > > direct version just only does the patching when all seems well: the > > > call instruction needs to be 0xe8, and we only do it when the thing > > > doesn't cross a cache line. Does that work? In the rare case where > > > the compiler generates something other than 0xe8 or crosses a cache > > > line, then the thing just remains as a call to the out of line jmp > > > trampoline. Does that seem reasonable? It's a very minor change to > > > the patch set. > > > > Maybe that would be ok. If my math is right, we would use the > > out-of-line version almost 5% of the time due to cache misalignment of > > the address. > > Note that I don't think cache-line alignment is necessarily sufficient. > > The I$ fetch from the cacheline can happen in smaller chunks, because > the bus between the I$ and the instruction decode isn't a full > cacheline (well, it is _now_ in modern big cores, but it hasn't always > been). > > So even if the cacheline is updated atomically, I could imagine seeing > a partial fetch from the I$ (old values) and then a second partial > fetch (new values). > > It would be interesting to know what the exact fetch rules are. So I fixed my test case to do 32-bit writes, and now the results are making a lot more sense. Now I only get crashes when writing across cache lines. So maybe we should just go with Andy's suggestion above. It would be great if some CPU people could confirm that it's safe (for x86-64 only), since it's not in the SDM. Who can help answer that? -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Tue, Dec 11, 2018 at 09:41:37AM +, David Laight wrote: > From: Josh Poimboeuf > > Sent: 30 November 2018 16:27 > > > > On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote: > > > On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf > > > wrote: > ... > > > > Maybe that would be ok. If my math is right, we would use the > > > > out-of-line version almost 5% of the time due to cache misalignment of > > > > the address. > > > > > > Note that I don't think cache-line alignment is necessarily sufficient. > > > > > > The I$ fetch from the cacheline can happen in smaller chunks, because > > > the bus between the I$ and the instruction decode isn't a full > > > cacheline (well, it is _now_ in modern big cores, but it hasn't always > > > been). > > > > > > So even if the cacheline is updated atomically, I could imagine seeing > > > a partial fetch from the I$ (old values) and then a second partial > > > fetch (new values). > > > > > > It would be interesting to know what the exact fetch rules are. > > > > I've been doing some cross-modifying code experiments on Nehalem, with > > one CPU writing call destinations while the other CPUs are executing > > them. Reliably, one of the readers goes off into the weeds within a few > > seconds. > > > > The writing was done with just text_poke(), no #BP. > > > > I wasn't able to figure out the pattern in the addresses of the > > corrupted call sites. It wasn't cache line. > > > > That was on Nehalem. Skylake didn't crash at all. > > Interesting thought? > > If it is possible to add a prefix that can be overwritten by an int3 > is it also possible to add something that the assembler will use > to align the instruction so that a write to the 4 byte offset > will be atomic? > > I'd guess that avoiding 8 byte granularity would be sufficient. > So you'd need a 1, 2 or 3 byte nop depending on the actual > alignment - although a 3 byte one would always do. 
The problem is that the call is done in C code, and we don't have a feasible way to use inline asm to call functions with more than five arguments. BTW, my original experiments (mentioned above) were a bit... flawed. I used text_poke(), which does memcpy(), which writes one byte at a time. No wonder it wasn't atomic. I'll need to do some more experiments. -- Josh
RE: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
From: Josh Poimboeuf > Sent: 30 November 2018 16:27 > > On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote: > > On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf wrote: ... > > > Maybe that would be ok. If my math is right, we would use the > > > out-of-line version almost 5% of the time due to cache misalignment of > > > the address. > > > > Note that I don't think cache-line alignment is necessarily sufficient. > > > > The I$ fetch from the cacheline can happen in smaller chunks, because > > the bus between the I$ and the instruction decode isn't a full > > cacheline (well, it is _now_ in modern big cores, but it hasn't always > > been). > > > > So even if the cacheline is updated atomically, I could imagine seeing > > a partial fetch from the I$ (old values) and then a second partial > > fetch (new values). > > > > It would be interesting to know what the exact fetch rules are. > > I've been doing some cross-modifying code experiments on Nehalem, with > one CPU writing call destinations while the other CPUs are executing > them. Reliably, one of the readers goes off into the weeds within a few > seconds. > > The writing was done with just text_poke(), no #BP. > > I wasn't able to figure out the pattern in the addresses of the > corrupted call sites. It wasn't cache line. > > That was on Nehalem. Skylake didn't crash at all. Interesting thought? If it is possible to add a prefix that can be overwritten by an int3 is it also possible to add something that the assembler will use to align the instruction so that a write to the 4 byte offset will be atomic? I'd guess that avoiding 8 byte granularity would be sufficient. So you'd need a 1, 2 or 3 byte nop depending on the actual alignment - although a 3 byte one would always do. David
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Dec 10, 2018 at 3:58 PM Pavel Machek wrote: > > On Thu 2018-11-29 11:11:50, Linus Torvalds wrote: > > > > It might be better to use an empty REX prefix on x86-64 or something like > > that. > > It might be easiest to use plain old NOP, no? :-). No. The whole point would be that the instruction rewriting is atomic wrt fetch. If it's a "nop" + "second instruction", and the "nop" is overwritten by "int3", then the second instruction could still be executed after the "int3" has been written (because the other CPU just finished the "nop". So an empty rex prefix is very different from a one-byte nop, exactly because it's executed atomically with the instruction itself. Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu 2018-11-29 11:11:50, Linus Torvalds wrote: > On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds > wrote: > > > > What you can do then is basically add a single-byte prefix to the > > "call" instruction that does nothing (say, cs override), and then > > replace *that* with a 'int3' instruction. > > Hmm. the segment prefixes are documented as being "reserved" for > branch instructions. I *think* that means just conditional branches > (Intel at one point used the prefixes for static prediction > information), not "call", but who knows.. > > It might be better to use an empty REX prefix on x86-64 or something like > that. It might be easiest to use plain old NOP, no? :-). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Fri, Nov 30, 2018 at 11:16:34PM +0100, Rasmus Villemoes wrote:
> On 29/11/2018 20.22, Josh Poimboeuf wrote:
> > On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote:
> >>> and honestly, the way "static_call()" works now, can you guarantee
> >>> that the call-site doesn't end up doing that, and calling the
> >>> trampoline function for two different static calls from one indirect
> >>> call?
> >>>
> >>> See what I'm talking about? Saying "callers are wrapped in macros"
> >>> doesn't actually protect you from the compiler doing things like that.
> >>>
> >>> In contrast, if the call was wrapped in an inline asm, we'd *know* the
> >>> compiler couldn't turn a "call wrapper(%rip)" into anything else.
> >>
> >> But then we need to implement all numbers of parameters.
> >
> > I actually have an old unfinished patch which (ab)used C macros to
> > detect the number of parameters and then setup the asm constraints
> > accordingly.  At the time, the goal was to optimize the BUG code.
> >
> > I had wanted to avoid this kind of approach for static calls, because
> > "ugh", but now it's starting to look much more appealing.
> >
> > Behold:
> >
> > diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
> > index aa6b2023d8f8..d63e9240da77 100644
> > --- a/arch/x86/include/asm/bug.h
> > +++ b/arch/x86/include/asm/bug.h
> > @@ -32,10 +32,59 @@
> >
> >  #ifdef CONFIG_DEBUG_BUGVERBOSE
> >
> > -#define _BUG_FLAGS(ins, flags)					\
> > +#define __BUG_ARGS_0(ins, ...)					\
> > +({								\
> > +	asm volatile("1:\t" ins "\n");				\
> > +})
> > +#define __BUG_ARGS_1(ins, ...)					\
> > +({								\
> > +	asm volatile("1:\t" ins "\n"				\
> > +		     : : "D" (ARG1(__VA_ARGS__)));		\
> > +})
> > +#define __BUG_ARGS_2(ins, ...)					\
> > +({								\
> > +	asm volatile("1:\t" ins "\n"				\
> > +		     : : "D" (ARG1(__VA_ARGS__)),		\
> > +			 "S" (ARG2(__VA_ARGS__)));		\
> > +})
> > +#define __BUG_ARGS_3(ins, ...)					\
> > +({								\
> > +	asm volatile("1:\t" ins "\n"				\
> > +		     : : "D" (ARG1(__VA_ARGS__)),		\
> > +			 "S" (ARG2(__VA_ARGS__)),		\
> > +			 "d" (ARG3(__VA_ARGS__)));		\
> > +})
>
> wouldn't you need to tie all these to (unused) outputs as well as adding
> the remaining caller-saved registers to the clobber list? Maybe not for
> the WARN machinery(?), but at least for stuff that should look like a
> normal call to gcc? Then there's %rax which is either a clobber or an
> output, and if there's not to be a separate static_call_void(), one
> would need to do some __builtin_choose_expr(__same_type(void, f(...)), ...).

Yes, this is a crappy unfinished patch.  It should be ignored, and
perhaps even mercilessly mocked :-)

paravirt_types.h already does something similar today, and it's at least
more correct than this.

What I was trying to show was that you can use macros to count
arguments, like this:

	_BUG_ARGS(ins, NUM_ARGS(__VA_ARGS__), __VA_ARGS__);

which can make a macro look and act like a function call.  Though as
Steven pointed out, the concept falls apart after 6 arguments.

-- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On 29/11/2018 20.22, Josh Poimboeuf wrote: > On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote: >>> and honestly, the way "static_call()" works now, can you guarantee >>> that the call-site doesn't end up doing that, and calling the >>> trampoline function for two different static calls from one indirect >>> call? >>> >>> See what I'm talking about? Saying "callers are wrapped in macros" >>> doesn't actually protect you from the compiler doing things like that. >>> >>> In contrast, if the call was wrapped in an inline asm, we'd *know* the >>> compiler couldn't turn a "call wrapper(%rip)" into anything else. >> >> But then we need to implement all numbers of parameters. > > I actually have an old unfinished patch which (ab)used C macros to > detect the number of parameters and then setup the asm constraints > accordingly. At the time, the goal was to optimize the BUG code. > > I had wanted to avoid this kind of approach for static calls, because > "ugh", but now it's starting to look much more appealing. > > Behold: > > diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h > index aa6b2023d8f8..d63e9240da77 100644 > --- a/arch/x86/include/asm/bug.h > +++ b/arch/x86/include/asm/bug.h > @@ -32,10 +32,59 @@ > > #ifdef CONFIG_DEBUG_BUGVERBOSE > > -#define _BUG_FLAGS(ins, flags) > \ > +#define __BUG_ARGS_0(ins, ...) \ > +({\ > + asm volatile("1:\t" ins "\n"); \ > +}) > +#define __BUG_ARGS_1(ins, ...) \ > +({\ > + asm volatile("1:\t" ins "\n" \ > + : : "D" (ARG1(__VA_ARGS__))); \ > +}) > +#define __BUG_ARGS_2(ins, ...) \ > +({\ > + asm volatile("1:\t" ins "\n" \ > + : : "D" (ARG1(__VA_ARGS__)), \ > + "S" (ARG2(__VA_ARGS__))); \ > +}) > +#define __BUG_ARGS_3(ins, ...) \ > +({\ > + asm volatile("1:\t" ins "\n" \ > + : : "D" (ARG1(__VA_ARGS__)), \ > + "S" (ARG2(__VA_ARGS__)), \ > + "d" (ARG3(__VA_ARGS__))); \ > +}) wouldn't you need to tie all these to (unused) outputs as well as adding the remaining caller-saved registers to the clobber list? 
Maybe not for the WARN machinery(?), but at least for stuff that should look like a normal call to gcc? Then there's %rax which is either a clobber or an output, and if there's not to be a separate static_call_void(), one would need to do some __builtin_choose_expr(__same_type(void, f(...)), ...). Rasmus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Fri, 30 Nov 2018, Andy Lutomirski wrote: > According to the SDM, you can program the APIC ICR to request an SMI. > It's not remotely clear to me what will happen if we do this. I think one of the known reliable ways to trigger SMI is to write 0x0 to the SMI command I/O port (0xb2). > For all I know, the SMI handler will explode and the computer will catch > fire. Ha, so no one can claim any more that SMIs are always harmful :) -- Jiri Kosina SUSE Labs
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Fri, Nov 30, 2018 at 12:18:33PM -0800, Andy Lutomirski wrote: > On Fri, Nov 30, 2018 at 11:51 AM Linus Torvalds > wrote: > > > > On Fri, Nov 30, 2018 at 10:39 AM Josh Poimboeuf wrote: > > > > > > AFAICT, all the other proposed options seem to have major issues. > > > > I still absolutely detest this patch, and in fact it got worse from > > the test of the config variable. > > > > Honestly, the entry code being legible and simple is more important > > than the extra cycle from branching to a trampoline for static calls. > > > > Just don't do the inline case if it causes this much confusion. I *really* don't want to have to drop the inline feature. The speedup is measurable and not insignificant. And out-of-line would be a regression if we ported paravirt to use static calls. > With my entry maintainer hat on, I don't mind it so much, although the > implementation needs some work. The #ifdef should just go away, and > there should be another sanity check in the sanity check section. Your suggested changes sound good to me. I'll be gone next week, so here's hoping you'll have this all figured out when I get back! -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Fri, 30 Nov 2018 12:59:36 -0800 Andy Lutomirski wrote: > For all I know, the SMI handler will explode and the computer will catch fire. That sounds like an AWESOME feature!!! -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Fri, Nov 30, 2018 at 12:28 PM Steven Rostedt wrote: > > On Fri, 30 Nov 2018 12:18:33 -0800 > Andy Lutomirski wrote: > > > Or we could replace that IPI with x86's bona fide serialize-all-cpus > > primitive and then we can just retry instead of emulating. It's a > > piece of cake -- we just trigger an SMI :) /me runs away. > > I must have fallen on my head one too many times, because I really like > the idea of synchronizing all the CPUs with an SMI! (If that's even > possible). The IPI's that are sent are only to force smp_mb() on all > CPUs. Which should be something an SMI could do. > > /me runs after Andy According to the SDM, you can program the APIC ICR to request an SMI. It's not remotely clear to me what will happen if we do this. For all I know, the SMI handler will explode and the computer will catch fire. PeterZ?
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Fri, 30 Nov 2018 12:18:33 -0800 Andy Lutomirski wrote: > Or we could replace that IPI with x86's bona fide serialize-all-cpus > primitive and then we can just retry instead of emulating. It's a > piece of cake -- we just trigger an SMI :) /me runs away. I must have fallen on my head one too many times, because I really like the idea of synchronizing all the CPUs with an SMI! (If that's even possible). The IPI's that are sent are only to force smp_mb() on all CPUs. Which should be something an SMI could do. /me runs after Andy -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Fri, Nov 30, 2018 at 11:51 AM Linus Torvalds wrote: > > On Fri, Nov 30, 2018 at 10:39 AM Josh Poimboeuf wrote: > > > > AFAICT, all the other proposed options seem to have major issues. > > I still absolutely detest this patch, and in fact it got worse from > the test of the config variable. > > Honestly, the entry code being legible and simple is more important > than the extra cycle from branching to a trampoline for static calls. > > Just don't do the inline case if it causes this much confusion. With my entry maintainer hat on, I don't mind it so much, although the implementation needs some work. The #ifdef should just go away, and there should be another sanity check in the sanity check section. Or we could replace that IPI with x86's bona fide serialize-all-cpus primitive and then we can just retry instead of emulating. It's a piece of cake -- we just trigger an SMI :) /me runs away. --Andy
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Fri, Nov 30, 2018 at 10:39 AM Josh Poimboeuf wrote: > > AFAICT, all the other proposed options seem to have major issues. I still absolutely detest this patch, and in fact it got worse from the test of the config variable. Honestly, the entry code being legible and simple is more important than the extra cycle from branching to a trampoline for static calls. Just don't do the inline case if it causes this much confusion. Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Fri, Nov 30, 2018 at 08:42:26AM -0800, Andy Lutomirski wrote:
> On Thu, Nov 29, 2018 at 12:24 PM Josh Poimboeuf wrote:
> >
> > > Alternatively, we could actually emulate call instructions like this:
> > >
> > > void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...)
> > > {
> > >   struct pt_regs ptregs_copy = *regs;
> > >   barrier();
> > >   *(unsigned long *)(regs->sp - 8) = whatever;  /* may clobber old regs, but so what? */
> > >   asm volatile ("jmp return_to_alternate_ptregs");
> > > }
> > >
> > > where return_to_alternate_ptregs points rsp to the ptregs and goes
> > > through the normal return path.  It's ugly, but we could have a test
> > > case for it, and it should work fine.
> >
> > Is that really any better than my patch to create a gap in the stack
> > (modified for kernel space #BP only)?
>
> I tend to prefer a nice local hack like mine over a hack that further
> complicates the entry in general.  This is not to say I'm thrilled by
> my idea either.

They're both mucking with the location of the pt_regs.  The above code
just takes that fact and hides it in the corner and hopes that there are
no bugs lurking there.

Even with the CPL check, the "gap" code is simple and self-contained
(see below).  The kernel pt_regs can already be anywhere on the stack,
so there should be no harm in moving them.

AFAICT, all the other proposed options seem to have major issues.

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ce25d84023c0..f487f7daed6c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR		irq_work_interrupt	smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -896,6 +896,18 @@ ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif

+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept	6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+#endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1126,7 +1138,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */

 idtentry debug		do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3		do_int3			has_error_code=0
+idtentry int3		do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment	do_stack_segment	has_error_code=1

 #ifdef CONFIG_XEN_PV
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 12:24 PM Josh Poimboeuf wrote: > > > Alternatively, we could actually emulate call instructions like this: > > > > void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...) > > { > > struct pt_regs ptregs_copy = *regs; > > barrier(); > > *(unsigned long *)(regs->sp - 8) = whatever; /* may clobber old > > regs, but so what? */ > > asm volatile ("jmp return_to_alternate_ptregs"); > > } > > > > where return_to_alternate_ptregs points rsp to the ptregs and goes > > through the normal return path. It's ugly, but we could have a test > > case for it, and it should work fine. > > Is that really any better than my patch to create a gap in the stack > (modified for kernel space #BP only)? > I tend to prefer a nice local hack like mine over a hack that further complicates the entry in general. This is not to say I'm thrilled by my idea either.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote: > On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf wrote: > > > > On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote: > > > > > > I propose a different solution: > > > > > > As in this patch set, we have a direct and an indirect version. The > > > indirect version remains exactly the same as in this patch set. The > > > direct version just only does the patching when all seems well: the > > > call instruction needs to be 0xe8, and we only do it when the thing > > > doesn't cross a cache line. Does that work? In the rare case where > > > the compiler generates something other than 0xe8 or crosses a cache > > > line, then the thing just remains as a call to the out of line jmp > > > trampoline. Does that seem reasonable? It's a very minor change to > > > the patch set. > > > > Maybe that would be ok. If my math is right, we would use the > > out-of-line version almost 5% of the time due to cache misalignment of > > the address. > > Note that I don't think cache-line alignment is necessarily sufficient. > > The I$ fetch from the cacheline can happen in smaller chunks, because > the bus between the I$ and the instruction decode isn't a full > cacheline (well, it is _now_ in modern big cores, but it hasn't always > been). > > So even if the cacheline is updated atomically, I could imagine seeing > a partial fetch from the I$ (old values) and then a second partial > fetch (new values). > > It would be interesting to know what the exact fetch rules are. I've been doing some cross-modifying code experiments on Nehalem, with one CPU writing call destinations while the other CPUs are executing them. Reliably, one of the readers goes off into the weeds within a few seconds. The writing was done with just text_poke(), no #BP. I wasn't able to figure out the pattern in the addresses of the corrupted call sites. It wasn't cache line. That was on Nehalem. Skylake didn't crash at all. -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf wrote: > > On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote: > > > > I propose a different solution: > > > > As in this patch set, we have a direct and an indirect version. The > > indirect version remains exactly the same as in this patch set. The > > direct version just only does the patching when all seems well: the > > call instruction needs to be 0xe8, and we only do it when the thing > > doesn't cross a cache line. Does that work? In the rare case where > > the compiler generates something other than 0xe8 or crosses a cache > > line, then the thing just remains as a call to the out of line jmp > > trampoline. Does that seem reasonable? It's a very minor change to > > the patch set. > > Maybe that would be ok. If my math is right, we would use the > out-of-line version almost 5% of the time due to cache misalignment of > the address. Note that I don't think cache-line alignment is necessarily sufficient. The I$ fetch from the cacheline can happen in smaller chunks, because the bus between the I$ and the instruction decode isn't a full cacheline (well, it is _now_ in modern big cores, but it hasn't always been). So even if the cacheline is updated atomically, I could imagine seeing a partial fetch from the I$ (old values) and then a second partial fetch (new values). It would be interesting to know what the exact fetch rules are. Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 04:14:46PM -0600, Josh Poimboeuf wrote: > On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote: > > On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote: > > > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote: > > > > > > (like pointing IP at a stub that retpolines to the target by reading > > > > the function pointer, a la the unoptimizable version), then okay, I > > > > guess, with only a small amount of grumbling. > > > > > > I tried that in v2, but Peter pointed out it's racy: > > > > > > > > > https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net > > > > Ah, but that is because it is a global shared trampoline. > > > > Each static_call has it's own trampoline; which currently reads > > something like: > > > > RETPOLINE_SAFE > > JMP *key > > > > which you then 'defuse' by writing an UD2 on. _However_, if you write > > that trampoline like: > > > > 1: RETPOLINE_SAFE > > JMP *key > > 2: CALL_NOSPEC *key > > RET > > > > and have the text_poke_bp() handler jump to 2 (a location you'll never > > reach when you enter at 1), it will in fact work I think. The trampoline > > is never modified and not shared between different static_call's. > > But after returning from the function to the trampoline, how does it > return from the trampoline to the call site? At that point there is no > return address on the stack. Oh, right, so that RET don't work. ARGH. Time to go sleep I suppose.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 02:25:33PM -0800, Andy Lutomirski wrote: > On Thu, Nov 29, 2018 at 2:22 PM Peter Zijlstra wrote: > > > > On Thu, Nov 29, 2018 at 04:14:46PM -0600, Josh Poimboeuf wrote: > > > On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote: > > > > On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote: > > > > > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote: > > > > > > > > > > (like pointing IP at a stub that retpolines to the target by reading > > > > > > the function pointer, a la the unoptimizable version), then okay, I > > > > > > guess, with only a small amount of grumbling. > > > > > > > > > > I tried that in v2, but Peter pointed out it's racy: > > > > > > > > > > > > > > > https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net > > > > > > > > Ah, but that is because it is a global shared trampoline. > > > > > > > > Each static_call has it's own trampoline; which currently reads > > > > something like: > > > > > > > > RETPOLINE_SAFE > > > > JMP *key > > > > > > > > which you then 'defuse' by writing an UD2 on. _However_, if you write > > > > that trampoline like: > > > > > > > > 1: RETPOLINE_SAFE > > > > JMP *key > > > > 2: CALL_NOSPEC *key > > > > RET > > > > > > > > and have the text_poke_bp() handler jump to 2 (a location you'll never > > > > reach when you enter at 1), it will in fact work I think. The trampoline > > > > is never modified and not shared between different static_call's. > > > > > > But after returning from the function to the trampoline, how does it > > > return from the trampoline to the call site? At that point there is no > > > return address on the stack. > > > > Oh, right, so that RET don't work. ARGH. Time to go sleep I suppose. > > I assume I'm missing something, but can't it just be JMP_NOSPEC *key? > The code would call the trampoline just like any other function and, > if the alignment is bad, we can skip patching it. 
And, if we want the > performance back, maybe some day we can find a clean way to patch > those misaligned callers, too. Yeah, this is currently the leading contender, though I believe it will use a direct jump like the current out-of-line implementation. -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 2:22 PM Peter Zijlstra wrote: > > On Thu, Nov 29, 2018 at 04:14:46PM -0600, Josh Poimboeuf wrote: > > On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote: > > > On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote: > > > > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote: > > > > > > > > (like pointing IP at a stub that retpolines to the target by reading > > > > > the function pointer, a la the unoptimizable version), then okay, I > > > > > guess, with only a small amount of grumbling. > > > > > > > > I tried that in v2, but Peter pointed out it's racy: > > > > > > > > > > > > https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net > > > > > > Ah, but that is because it is a global shared trampoline. > > > > > > Each static_call has it's own trampoline; which currently reads > > > something like: > > > > > > RETPOLINE_SAFE > > > JMP *key > > > > > > which you then 'defuse' by writing an UD2 on. _However_, if you write > > > that trampoline like: > > > > > > 1: RETPOLINE_SAFE > > > JMP *key > > > 2: CALL_NOSPEC *key > > > RET > > > > > > and have the text_poke_bp() handler jump to 2 (a location you'll never > > > reach when you enter at 1), it will in fact work I think. The trampoline > > > is never modified and not shared between different static_call's. > > > > But after returning from the function to the trampoline, how does it > > return from the trampoline to the call site? At that point there is no > > return address on the stack. > > Oh, right, so that RET don't work. ARGH. Time to go sleep I suppose. I assume I'm missing something, but can't it just be JMP_NOSPEC *key? The code would call the trampoline just like any other function and, if the alignment is bad, we can skip patching it. And, if we want the performance back, maybe some day we can find a clean way to patch those misaligned callers, too.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 02:24:52PM -0600, Josh Poimboeuf wrote: > On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote: > > On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds > > wrote: > > > > > > On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds > > > wrote: > > > > > > > > In contrast, if the call was wrapped in an inline asm, we'd *know* the > > > > compiler couldn't turn a "call wrapper(%rip)" into anything else. > > > > > > Actually, I think I have a better model - if the caller is done with > > > inline asm. > > > > > > What you can do then is basically add a single-byte prefix to the > > > "call" instruction that does nothing (say, cs override), and then > > > replace *that* with a 'int3' instruction. > > > > > > Boom. Done. > > > > > > Now, the "int3" handler can just update the instruction in-place, but > > > leave the "int3" in place, and then return to the next instruction > > > byte (which is just the normal branch instruction without the prefix > > > byte). > > > > > > The cross-CPU case continues to work, because the 'int3' remains in > > > place until after the IPI. > > > > Hmm, cute. But then the calls are in inline asm, which results in > > giant turds like we have for the pvop vcalls. And, if they start > > being used more generally, we potentially have ABI issues where the > > calling convention isn't quite what the asm expects, and we explode. > > > > I propose a different solution: > > > > As in this patch set, we have a direct and an indirect version. The > > indirect version remains exactly the same as in this patch set. The > > direct version just only does the patching when all seems well: the > > call instruction needs to be 0xe8, and we only do it when the thing > > doesn't cross a cache line. Does that work? In the rare case where > > the compiler generates something other than 0xe8 or crosses a cache > > line, then the thing just remains as a call to the out of line jmp > > trampoline. Does that seem reasonable? 
It's a very minor change to > > the patch set. > > Maybe that would be ok. If my math is right, we would use the > out-of-line version almost 5% of the time due to cache misalignment of > the address. BTW, this means that if any of a trampoline's callers crosses cache boundaries then we won't be able to poison the trampoline. Which is kind of sad. -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote: > On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote: > > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote: > > > > (like pointing IP at a stub that retpolines to the target by reading > > > the function pointer, a la the unoptimizable version), then okay, I > > > guess, with only a small amount of grumbling. > > > > I tried that in v2, but Peter pointed out it's racy: > > > > > > https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net > > Ah, but that is because it is a global shared trampoline. > > Each static_call has its own trampoline; which currently reads > something like: > > RETPOLINE_SAFE > JMP *key > > which you then 'defuse' by writing a UD2 on. _However_, if you write > that trampoline like: > > 1: RETPOLINE_SAFE > JMP *key > 2: CALL_NOSPEC *key > RET > > and have the text_poke_bp() handler jump to 2 (a location you'll never > reach when you enter at 1), it will in fact work I think. The trampoline > is never modified and not shared between different static_calls. But after returning from the function to the trampoline, how does it return from the trampoline to the call site? At that point there is no return address on the stack. -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote: > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote: > > (like pointing IP at a stub that retpolines to the target by reading > > the function pointer, a la the unoptimizable version), then okay, I > > guess, with only a small amount of grumbling. > > I tried that in v2, but Peter pointed out it's racy: > > > https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net Ah, but that is because it is a global shared trampoline. Each static_call has its own trampoline; which currently reads something like: RETPOLINE_SAFE JMP *key which you then 'defuse' by writing a UD2 on. _However_, if you write that trampoline like: 1: RETPOLINE_SAFE JMP *key 2: CALL_NOSPEC *key RET and have the text_poke_bp() handler jump to 2 (a location you'll never reach when you enter at 1), it will in fact work I think. The trampoline is never modified and not shared between different static_calls.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote: > On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds > wrote: > > > > On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds > > wrote: > > > > > > In contrast, if the call was wrapped in an inline asm, we'd *know* the > > > compiler couldn't turn a "call wrapper(%rip)" into anything else. > > > > Actually, I think I have a better model - if the caller is done with inline > > asm. > > > > What you can do then is basically add a single-byte prefix to the > > "call" instruction that does nothing (say, cs override), and then > > replace *that* with a 'int3' instruction. > > > > Boom. Done. > > > > Now, the "int3" handler can just update the instruction in-place, but > > leave the "int3" in place, and then return to the next instruction > > byte (which is just the normal branch instruction without the prefix > > byte). > > > > The cross-CPU case continues to work, because the 'int3' remains in > > place until after the IPI. > > Hmm, cute. But then the calls are in inline asm, which results in > giant turds like we have for the pvop vcalls. And, if they start > being used more generally, we potentially have ABI issues where the > calling convention isn't quite what the asm expects, and we explode. > > I propose a different solution: > > As in this patch set, we have a direct and an indirect version. The > indirect version remains exactly the same as in this patch set. The > direct version just only does the patching when all seems well: the > call instruction needs to be 0xe8, and we only do it when the thing > doesn't cross a cache line. Does that work? In the rare case where > the compiler generates something other than 0xe8 or crosses a cache > line, then the thing just remains as a call to the out of line jmp > trampoline. Does that seem reasonable? It's a very minor change to > the patch set. Maybe that would be ok. 
If my math is right, we would use the out-of-line version almost 5% of the time due to cache misalignment of the address. > Alternatively, we could actually emulate call instructions like this: > > void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...) > { > struct pt_regs ptregs_copy = *regs; > barrier(); > *(unsigned long *)(regs->sp - 8) = whatever; /* may clobber old > regs, but so what? */ > asm volatile ("jmp return_to_alternate_ptregs"); > } > > where return_to_alternate_ptregs points rsp to the ptregs and goes > through the normal return path. It's ugly, but we could have a test > case for it, and it should work fine. Is that really any better than my patch to create a gap in the stack (modified for kernel space #BP only)? -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 10:58:40AM -0800, Linus Torvalds wrote: > On Thu, Nov 29, 2018 at 10:47 AM Steven Rostedt wrote: > > > > Note, we do have a bit of control at what is getting called. The patch > > set requires that the callers are wrapped in macros. We should not > > allow just any random callers (like from asm). > > Actually, I'd argue that asm is often more controlled than C code. > > Right now you can do odd things if you really want to, and have the > compiler generate indirect calls to those wrapper functions. > > For example, I can easily imagine a pre-retpoline compiler turning > > if (cond) > fn1(a,b) > else >fn2(a,b); > > into a function pointer conditional > > (cond ? fn1 : fn2)(a,b); > > and honestly, the way "static_call()" works now, can you guarantee > that the call-site doesn't end up doing that, and calling the > trampoline function for two different static calls from one indirect > call? > > See what I'm talking about? Saying "callers are wrapped in macros" > doesn't actually protect you from the compiler doing things like that. > > In contrast, if the call was wrapped in an inline asm, we'd *know* the > compiler couldn't turn a "call wrapper(%rip)" into anything else. I think objtool could warn about many such issues, including function pointer references to trampolines and short tail call jumps. -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 11:24:43 -0800 Linus Torvalds wrote: > On Thu, Nov 29, 2018 at 11:16 AM Steven Rostedt wrote: > > > > But then we need to implement all numbers of parameters. > > Oh, I agree, it's nasty. > > But it's actually a nastiness that we've solved before. In particular, > with the system call mappings, which have pretty much the exact same > issue of "map unknown number of arguments to registers". > > Yes, it's different - there you map the unknown number of arguments to > a structure access instead. And yes, the macros are unbelievably ugly. > See > > arch/x86/include/asm/syscall_wrapper.h Those are not doing inline assembly. > > and the __MAP() macro from > > include/linux/syscalls.h > > so it's not pretty. But it would solve all the problems. > Again, not inline assembly, and those only handle up to 6 parameters. My POC started down this route, until I noticed that there are tracepoints with 13 parameters! And I need to handle all tracepoints. Yes, we can argue that we need to change those (if that doesn't break the API of something using it). -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 11:25 AM Linus Torvalds wrote: > > On Thu, Nov 29, 2018 at 11:16 AM Steven Rostedt wrote: > > > > But then we need to implement all numbers of parameters. > > Oh, I agree, it's nasty. > > But it's actually a nastiness that we've solved before. In particular, > with the system call mappings, which have pretty much the exact same > issue of "map unknown number of arguments to registers". > > Yes, it's different - there you map the unknown number of arguments to > a structure access instead. And yes, the macros are unbelievably ugly. > See > > arch/x86/include/asm/syscall_wrapper.h > > and the __MAP() macro from > > include/linux/syscalls.h > > so it's not pretty. But it would solve all the problems. > Until someone does: struct foo foo; static_call(thingy, foo); For syscalls, we know better than to do that. For static calls, I'm less confident.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 13:22:11 -0600 Josh Poimboeuf wrote: > On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote: > > > and honestly, the way "static_call()" works now, can you guarantee > > > that the call-site doesn't end up doing that, and calling the > > > trampoline function for two different static calls from one indirect > > > call? > > > > > > See what I'm talking about? Saying "callers are wrapped in macros" > > > doesn't actually protect you from the compiler doing things like that. > > > > > > In contrast, if the call was wrapped in an inline asm, we'd *know* the > > > compiler couldn't turn a "call wrapper(%rip)" into anything else. > > > > But then we need to implement all numbers of parameters. > > I actually have an old unfinished patch which (ab)used C macros to > detect the number of parameters and then setup the asm constraints > accordingly. At the time, the goal was to optimize the BUG code. > > I had wanted to avoid this kind of approach for static calls, because > "ugh", but now it's starting to look much more appealing. > > Behold: > > diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h > index aa6b2023d8f8..d63e9240da77 100644 > --- a/arch/x86/include/asm/bug.h > +++ b/arch/x86/include/asm/bug.h > @@ -32,10 +32,59 @@ > > #ifdef CONFIG_DEBUG_BUGVERBOSE > > -#define _BUG_FLAGS(ins, flags) > \ > +#define __BUG_ARGS_0(ins, ...) \ > +({\ > + asm volatile("1:\t" ins "\n"); \ > +}) > +#define __BUG_ARGS_1(ins, ...) \ > +({\ > + asm volatile("1:\t" ins "\n" \ > + : : "D" (ARG1(__VA_ARGS__))); \ > +}) > +#define __BUG_ARGS_2(ins, ...) \ > +({\ > + asm volatile("1:\t" ins "\n" \ > + : : "D" (ARG1(__VA_ARGS__)), \ > + "S" (ARG2(__VA_ARGS__))); \ > +}) > +#define __BUG_ARGS_3(ins, ...) \ > +({\ > + asm volatile("1:\t" ins "\n" \ > + : : "D" (ARG1(__VA_ARGS__)), \ > + "S" (ARG2(__VA_ARGS__)), \ > + "d" (ARG3(__VA_ARGS__))); \ > +}) > +#define __BUG_ARGS_4(ins, ...) 
\ > +({\ > + asm volatile("1:\t" ins "\n" \ > + : : "D" (ARG1(__VA_ARGS__)), \ > + "S" (ARG2(__VA_ARGS__)), \ > + "d" (ARG3(__VA_ARGS__)), \ > + "c" (ARG4(__VA_ARGS__))); \ > +}) > +#define __BUG_ARGS_5(ins, ...) \ > +({\ > + register u64 __r8 asm("r8") = (u64)ARG5(__VA_ARGS__); \ > + asm volatile("1:\t" ins "\n" \ > + : : "D" (ARG1(__VA_ARGS__)), \ > + "S" (ARG2(__VA_ARGS__)), \ > + "d" (ARG3(__VA_ARGS__)), \ > + "c" (ARG4(__VA_ARGS__)), \ > + "r" (__r8)); \ > +}) > +#define __BUG_ARGS_6 foo > +#define __BUG_ARGS_7 foo > +#define __BUG_ARGS_8 foo > +#define __BUG_ARGS_9 foo > + There exist tracepoints with 13 arguments. -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds wrote: > > On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds > wrote: > > > > In contrast, if the call was wrapped in an inline asm, we'd *know* the > > compiler couldn't turn a "call wrapper(%rip)" into anything else. > > Actually, I think I have a better model - if the caller is done with inline > asm. > > What you can do then is basically add a single-byte prefix to the > "call" instruction that does nothing (say, cs override), and then > replace *that* with a 'int3' instruction. > > Boom. Done. > > Now, the "int3" handler can just update the instruction in-place, but > leave the "int3" in place, and then return to the next instruction > byte (which is just the normal branch instruction without the prefix > byte). > > The cross-CPU case continues to work, because the 'int3' remains in > place until after the IPI. Hmm, cute. But then the calls are in inline asm, which results in giant turds like we have for the pvop vcalls. And, if they start being used more generally, we potentially have ABI issues where the calling convention isn't quite what the asm expects, and we explode. I propose a different solution: As in this patch set, we have a direct and an indirect version. The indirect version remains exactly the same as in this patch set. The direct version just only does the patching when all seems well: the call instruction needs to be 0xe8, and we only do it when the thing doesn't cross a cache line. Does that work? In the rare case where the compiler generates something other than 0xe8 or crosses a cache line, then the thing just remains as a call to the out of line jmp trampoline. Does that seem reasonable? It's a very minor change to the patch set. Alternatively, we could actually emulate call instructions like this: void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...) { struct pt_regs ptregs_copy = *regs; barrier(); *(unsigned long *)(regs->sp - 8) = whatever; /* may clobber old regs, but so what? 
*/ asm volatile ("jmp return_to_alternate_ptregs"); } where return_to_alternate_ptregs points rsp to the ptregs and goes through the normal return path. It's ugly, but we could have a test case for it, and it should work fine.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 11:16 AM Steven Rostedt wrote: > > But then we need to implement all numbers of parameters. Oh, I agree, it's nasty. But it's actually a nastiness that we've solved before. In particular, with the system call mappings, which have pretty much the exact same issue of "map unknown number of arguments to registers". Yes, it's different - there you map the unknown number of arguments to a structure access instead. And yes, the macros are unbelievably ugly. See arch/x86/include/asm/syscall_wrapper.h and the __MAP() macro from include/linux/syscalls.h so it's not pretty. But it would solve all the problems. Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote: > > and honestly, the way "static_call()" works now, can you guarantee > > that the call-site doesn't end up doing that, and calling the > > trampoline function for two different static calls from one indirect > > call? > > > > See what I'm talking about? Saying "callers are wrapped in macros" > > doesn't actually protect you from the compiler doing things like that. > > > > In contrast, if the call was wrapped in an inline asm, we'd *know* the > > compiler couldn't turn a "call wrapper(%rip)" into anything else. > > But then we need to implement all numbers of parameters. I actually have an old unfinished patch which (ab)used C macros to detect the number of parameters and then setup the asm constraints accordingly. At the time, the goal was to optimize the BUG code. I had wanted to avoid this kind of approach for static calls, because "ugh", but now it's starting to look much more appealing. Behold: diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h index aa6b2023d8f8..d63e9240da77 100644 --- a/arch/x86/include/asm/bug.h +++ b/arch/x86/include/asm/bug.h @@ -32,10 +32,59 @@ #ifdef CONFIG_DEBUG_BUGVERBOSE -#define _BUG_FLAGS(ins, flags) \ +#define __BUG_ARGS_0(ins, ...) \ +({\ + asm volatile("1:\t" ins "\n"); \ +}) +#define __BUG_ARGS_1(ins, ...) \ +({\ + asm volatile("1:\t" ins "\n" \ +: : "D" (ARG1(__VA_ARGS__))); \ +}) +#define __BUG_ARGS_2(ins, ...) \ +({\ + asm volatile("1:\t" ins "\n" \ +: : "D" (ARG1(__VA_ARGS__)), \ +"S" (ARG2(__VA_ARGS__))); \ +}) +#define __BUG_ARGS_3(ins, ...) \ +({\ + asm volatile("1:\t" ins "\n" \ +: : "D" (ARG1(__VA_ARGS__)), \ +"S" (ARG2(__VA_ARGS__)), \ +"d" (ARG3(__VA_ARGS__))); \ +}) +#define __BUG_ARGS_4(ins, ...) \ +({\ + asm volatile("1:\t" ins "\n" \ +: : "D" (ARG1(__VA_ARGS__)), \ +"S" (ARG2(__VA_ARGS__)), \ +"d" (ARG3(__VA_ARGS__)), \ +"c" (ARG4(__VA_ARGS__))); \ +}) +#define __BUG_ARGS_5(ins, ...) 
\ +({\ + register u64 __r8 asm("r8") = (u64)ARG5(__VA_ARGS__); \ + asm volatile("1:\t" ins "\n" \ +: : "D" (ARG1(__VA_ARGS__)), \ +"S" (ARG2(__VA_ARGS__)), \ +"d" (ARG3(__VA_ARGS__)), \ +"c" (ARG4(__VA_ARGS__)), \ +"r" (__r8)); \ +}) +#define __BUG_ARGS_6 foo +#define __BUG_ARGS_7 foo +#define __BUG_ARGS_8 foo +#define __BUG_ARGS_9 foo + +#define __BUG_ARGS(ins, num, ...) __BUG_ARGS_ ## num(ins, __VA_ARGS__) + +#define _BUG_ARGS(ins, num, ...) __BUG_ARGS(ins, num, __VA_ARGS__) + +#define _BUG_FLAGS(ins, flags, ...)\ do { \ - asm volatile("1:\t" ins "\n"\ -".pushsection __bug_table,\"aw\"\n"\ + _BUG_ARGS(ins, NUM_ARGS(__VA_ARGS__), __VA_ARGS__); \ + asm volatile(".pushsection __bug_table,\"aw\"\n"\ "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n" \ "\t" __BUG_REL(%c0) "\t# bug_entry::file\n" \ "\t.word %c1""\t# bug_entry::line\n" \ @@ -76,7 +125,7 @@ do { \ unreachable(); \ } while (0) -#define __WARN_FLAGS(flags)_BUG_FLAGS(ASM_UD0, BUGFLAG_WARNING|(flags)) +#define __WARN_FLAGS(flags, ...) _BUG_FLAGS(ASM_UD0, BUGFLAG_WARNING|(flags), __VA_ARGS__) #include diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h index 70c7732c9594..0cb16e912c02 100644 --- a/include/asm-generic/bug.h +++ b/include/asm-generic/bug.h @@ -58,8 +58,8 @@ struct bug_entry { #endif #ifdef __WARN_FLAGS -#define __WARN_TAINT(taint)__WARN_FLAGS(BUGFLAG_TAINT(taint)) -#define __WARN_ONCE_TAINT(taint) __WARN_FLAGS(BUGFLAG_ONCE|BUGFLAG_TAINT(taint)) +#define __WARN_TAINT(taint, args...) __WARN_FLAGS(BUGFLAG_TAINT(taint), args) +#define __WARN_ONCE_TAINT(taint, args...) __WARN_FLAGS(BUGFLAG_ONCE|BUGFLAG_TAINT(taint), args) #define WARN_ON_ONCE(condition) ({ \ int __ret_warn_on = !!(condition); \ @@ -84,11 +84,12 @@ void warn_slowpath_fmt_taint(const char *file, const int line, unsigned taint, extern void warn_slowpath_null(const char *file, const int line); #ifdef __WARN_TAINT #define __WARN() __WARN_TAINT(TAINT_WARN) +#define __WARN_printf(args...) 
__WARN_TAINT(TAINT_WARN, args) #else #define __WARN() warn_slowpath_null(__FILE__, __LINE__
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 10:58:40 -0800 Linus Torvalds wrote: > On Thu, Nov 29, 2018 at 10:47 AM Steven Rostedt wrote: > > > > Note, we do have a bit of control at what is getting called. The patch > > set requires that the callers are wrapped in macros. We should not > > allow just any random callers (like from asm). > > Actually, I'd argue that asm is often more controlled than C code. > > Right now you can do odd things if you really want to, and have the > compiler generate indirect calls to those wrapper functions. > > For example, I can easily imagine a pre-retpoline compiler turning > > if (cond) > fn1(a,b) > else >fn2(a,b); > > into a function pointer conditional > > (cond ? fn1 : fn2)(a,b); If we are worried about such a construct, wouldn't a compiler barrier before and after the static_call solve that? barrier(); static_call(func...); barrier(); It should also stop tail calls too. > > and honestly, the way "static_call()" works now, can you guarantee > that the call-site doesn't end up doing that, and calling the > trampoline function for two different static calls from one indirect > call? > > See what I'm talking about? Saying "callers are wrapped in macros" > doesn't actually protect you from the compiler doing things like that. > > In contrast, if the call was wrapped in an inline asm, we'd *know* the > compiler couldn't turn a "call wrapper(%rip)" into anything else. But then we need to implement all numbers of parameters. -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 11:08:26 -0800 Linus Torvalds wrote: > On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds > wrote: > > > > In contrast, if the call was wrapped in an inline asm, we'd *know* the > > compiler couldn't turn a "call wrapper(%rip)" into anything else. > > Actually, I think I have a better model - if the caller is done with inline > asm. > > What you can do then is basically add a single-byte prefix to the > "call" instruction that does nothing (say, cs override), and then > replace *that* with a 'int3' instruction. > > Boom. Done. > > Now, the "int3" handler can just update the instruction in-place, but > leave the "int3" in place, and then return to the next instruction > byte (which is just the normal branch instruction without the prefix > byte). > > The cross-CPU case continues to work, because the 'int3' remains in > place until after the IPI. > > But that would require that we'd mark those call instruction with > In my original proof of concept, I tried to implement the callers with asm, but then the way to handle parameters became a nightmare. The goal of this (for me) was to replace the tracepoint indirect calls with static calls, and tracepoints can have any number of parameters to pass. I ended up needing the compiler to help me with the passing of parameters. -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds wrote: > > What you can do then is basically add a single-byte prefix to the > "call" instruction that does nothing (say, cs override), and then > replace *that* with a 'int3' instruction. Hmm. the segment prefixes are documented as being "reserved" for branch instructions. I *think* that means just conditional branches (Intel at one point used the prefixes for static prediction information), not "call", but who knows.. It might be better to use an empty REX prefix on x86-64 or something like that. Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds wrote: > > In contrast, if the call was wrapped in an inline asm, we'd *know* the > compiler couldn't turn a "call wrapper(%rip)" into anything else. Actually, I think I have a better model - if the caller is done with inline asm. What you can do then is basically add a single-byte prefix to the "call" instruction that does nothing (say, cs override), and then replace *that* with a 'int3' instruction. Boom. Done. Now, the "int3" handler can just update the instruction in-place, but leave the "int3" in place, and then return to the next instruction byte (which is just the normal branch instruction without the prefix byte). The cross-CPU case continues to work, because the 'int3' remains in place until after the IPI. But that would require that we'd mark those call instruction with Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 10:47 AM Steven Rostedt wrote: > > Note, we do have a bit of control at what is getting called. The patch > set requires that the callers are wrapped in macros. We should not > allow just any random callers (like from asm). Actually, I'd argue that asm is often more controlled than C code. Right now you can do odd things if you really want to, and have the compiler generate indirect calls to those wrapper functions. For example, I can easily imagine a pre-retpoline compiler turning if (cond) fn1(a,b) else fn2(a,b); into a function pointer conditional (cond ? fn1 : fn2)(a,b); and honestly, the way "static_call()" works now, can you guarantee that the call-site doesn't end up doing that, and calling the trampoline function for two different static calls from one indirect call? See what I'm talking about? Saying "callers are wrapped in macros" doesn't actually protect you from the compiler doing things like that. In contrast, if the call was wrapped in an inline asm, we'd *know* the compiler couldn't turn a "call wrapper(%rip)" into anything else. Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 10:00:48 -0800 Andy Lutomirski wrote: > > > > Of course, another option is to just say "we don't do the inline case, > > then", and only ever do a call to a stub that does a "jmp" > > instruction. > > That’s not a terrible idea. It was the implementation of my first proof of concept that kicked off this entire idea, where others (Peter and Josh) thought it was better to modify the calls themselves. It does improve things. Just a reminder of the benchmarks of enabling all tracepoints (which use indirect jumps) and running hackbench: No RETPOLINES: 1.4503 +- 0.0148 seconds time elapsed ( +- 1.02% ) baseline RETPOLINES: 1.5120 +- 0.0133 seconds time elapsed ( +- 0.88% ) Added direct calls for trace_events: 1.5239 +- 0.0139 seconds time elapsed ( +- 0.91% ) With static calls: 1.5282 +- 0.0135 seconds time elapsed ( +- 0.88% ) With static call trampolines: 1.48328 +- 0.00515 seconds time elapsed ( +- 0.35% ) Full static calls: 1.47364 +- 0.00706 seconds time elapsed ( +- 0.48% ) Adding Retpolines caused a 1.5120 / 1.4503 = 1.0425 ( 4.25% ) slowdown Trampolines made it into 1.48328 / 1.4503 = 1.0227 ( 2.27% ) slowdown The above is the stub with the jmp case. With full static calls 1.47364 / 1.4503 = 1.0160 ( 1.6% ) slowdown Modifying the calls themselves does have an improvement (and this is much greater of an improvement when I had debugging enabled). Perhaps it's not worth the effort, but again, we do have control of what uses this. It's not a total free-for-all. Full results here: http://lkml.kernel.org/r/20181126155405.72b4f...@gandalf.local.home Although since lore.kernel.org seems to be having issues: https://marc.info/?l=linux-kernel&m=154326714710686 -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 10:23:44 -0800 Linus Torvalds wrote: > On Thu, Nov 29, 2018 at 9:59 AM Steven Rostedt wrote: > > > > Do you realize that the cmpxchg used by the first attempts of the > > dynamic modification of code by ftrace was the source of the e1000e > > NVRAM corruption bug. > > If you have a static call in IO memory, you have bigger problems than that. > > What's your point? Just that cmpxchg on dynamic modified code brings back bad memories ;-) > > Again - I will point out that the things you guys have tried to come > up with have been *WORSE*. Much worse. Note, we do have a bit of control at what is getting called. The patch set requires that the callers are wrapped in macros. We should not allow just any random callers (like from asm). This isn't about modifying any function call. This is for a specific subset, that we can impose rules on. -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 10:00 AM Andy Lutomirski wrote: > > then it really sounds pretty safe to just say "ok, just make it > > aligned and update the instruction with an atomic cmpxchg or > > something". > > And how do we do that? With a gcc plugin and some asm magic? Asm magic. You already have to mark the call sites with static_call(fn, arg1, arg2, ...); and while it right now just magically depends on gcc outputting the right code to call the trampoline. But it could do it as a jmp instruction (tail-call), and maybe that works right, maybe it doesn't. And maybe some gcc switch makes it output it as a indirect call due to instrumentation or something. Doing it with asm magic would, I feel, be safer anyway, so that we'd know *exactly* how that call gets done. For example, if gcc does it as a jmp due to a tail-call, the compiler/linker could in theory turn the jump into a short jump if it sees that the trampoline is close enough. Does that happen? Probably not. But I don't see why it *couldn't* happen in the current patch series. The trampoline is just a regular function, even if it has been defined by global asm. Putting the trampoline in a different code section could fix things like that (maybe there was a patch that did that and I missed it?) but I do think that doing the call with an asm would *also* fix it. But the "just always use a trampoline" is certainly the simpler model. Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 06:15:39PM +0100, Peter Zijlstra wrote: > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote: > > > If you make it conditional on CPL, do it for 32-bit as well, add > > comments, > > > and convince yourself that there isn’t a better solution > > (like pointing IP at a stub that retpolines to the target by reading > > the function pointer, a la the unoptimizable version), then okay, I > > guess, with only a small amount of grumbling. > > Right; so we _could_ grow the trampoline with a retpoline indirect call > and ret. It just makes the trampoline a whole lot bigger, but it could > work. I'm trying to envision how this would work. How would the function (or stub) know how to return back to the call site? -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 9:59 AM Steven Rostedt wrote: > > Do you realize that the cmpxchg used by the first attempts of the > dynamic modification of code by ftrace was the source of the e1000e > NVRAM corruption bug. If you have a static call in IO memory, you have bigger problems than that. What's your point? Again - I will point out that the things you guys have tried to come up with have been *WORSE*. Much worse. Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
> On Nov 29, 2018, at 9:50 AM, Linus Torvalds > wrote: > >> On Thu, Nov 29, 2018 at 9:44 AM Steven Rostedt wrote: >> >> Well, the current method (as Jiri mentioned) did get the OK from at >> least Intel (and that was with a lot of arm twisting to do so). > > Guys, when the comparison is to: > > - create a huge honking security hole by screwing up the stack frame > > or > > - corrupt random registers because we "know" they aren't in use For C calls, we do indeed know that. But I guess there could be asm calls. > > then it really sounds pretty safe to just say "ok, just make it > aligned and update the instruction with an atomic cmpxchg or > something". And how do we do that? With a gcc plugin and some asm magic? > > Of course, another option is to just say "we don't do the inline case, > then", and only ever do a call to a stub that does a "jmp" > instruction. That’s not a terrible idea. > > Problem solved, at the cost of some I$. Emulating a "jmp" is trivial, > in ways emulating a "call" is not. > >
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 09:50:28 -0800 Linus Torvalds wrote: > On Thu, Nov 29, 2018 at 9:44 AM Steven Rostedt wrote: > > > > Well, the current method (as Jiri mentioned) did get the OK from at > > least Intel (and that was with a lot of arm twisting to do so). > > Guys, when the comparison is to: > > - create a huge honking security hole by screwing up the stack frame > > or > > - corrupt random registers because we "know" they aren't in use > > then it really sounds pretty safe to just say "ok, just make it > aligned and update the instruction with an atomic cmpxchg or > something". Do you realize that the cmpxchg used by the first attempts of the dynamic modification of code by ftrace was the source of the e1000e NVRAM corruption bug. It's because it happened to do it to IO write only memory, and a cmpxchg will *always* write, even if it didn't match. It will just write out what it read. In the case of the e1000e bug, it read 0x and that's what it wrote back out. So no, I don't think that's a better solution. -- Steve > > Of course, another option is to just say "we don't do the inline case, > then", and only ever do a call to a stub that does a "jmp" > instruction. > > Problem solved, at the cost of some I$. Emulating a "jmp" is trivial, > in ways emulating a "call" is not. > > Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 9:50 AM Linus Torvalds wrote: > > - corrupt random registers because we "know" they aren't in use Just to clarify: I think that's a completely unacceptable model. We already have lots of special calling conventions, including ones that do not have any call-clobbered registers at all, because we have special magic calls in inline asm. Some of those might be prime material for doing static calls (ie PV-op stuff, where the native model does *not* change any registers). So no. Don't do ugly hacks like that. Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
> On Nov 29, 2018, at 9:45 AM, Josh Poimboeuf wrote: > >> On Thu, Nov 29, 2018 at 09:41:33AM -0800, Andy Lutomirski wrote: >> >>> On Nov 29, 2018, at 9:21 AM, Steven Rostedt wrote: >>> >>> On Thu, 29 Nov 2018 12:20:00 -0500 >>> Steven Rostedt wrote: >>> >>> r8 = return address r9 = function to call >>> >>> Bad example, r8 and r9 are args, but r10 and r11 are available. >>> >>> -- Steve >>> push r8 jmp *r9 Then have the regs->ip point to that trampoline. >> >> Cute. That’ll need ORC annotations and some kind of retpoline to replace the >> indirect jump, though. > > I'm going with this idea, but the BP is so rare that I really don't see > why a retpoline would be needed. > Without the retpoline in place, you are vulnerable to security researchers causing you a personal denial of service by finding a way to cause the BP to get hit, mistraining the branch predictor, and writing a paper about it :)
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 9:44 AM Steven Rostedt wrote: > > Well, the current method (as Jiri mentioned) did get the OK from at > least Intel (and that was with a lot of arm twisting to do so). Guys, when the comparison is to: - create a huge honking security hole by screwing up the stack frame or - corrupt random registers because we "know" they aren't in use then it really sounds pretty safe to just say "ok, just make it aligned and update the instruction with an atomic cmpxchg or something". Of course, another option is to just say "we don't do the inline case, then", and only ever do a call to a stub that does a "jmp" instruction. Problem solved, at the cost of some I$. Emulating a "jmp" is trivial, in ways emulating a "call" is not. Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 09:41:33 -0800 Andy Lutomirski wrote: > > On Nov 29, 2018, at 9:21 AM, Steven Rostedt wrote: > > > > On Thu, 29 Nov 2018 12:20:00 -0500 > > Steven Rostedt wrote: > > > > > >> r8 = return address > >> r9 = function to call > >> > > > > Bad example, r8 and r9 are args, but r10 and r11 are available. > > > > -- Steve > > > >>push r8 > >>jmp *r9 > >> > >> Then have the regs->ip point to that trampoline. > > Cute. That’ll need ORC annotations and some kind of retpoline to replace the > indirect jump, though. > Do we really need to worry about retpoline here? I'm not fully up on all the current vulnerabilities, but can this really be taken advantage of when it only happens in the transition of changing a static call with the small chance of one of those calls triggering the break point? If someone can take advantage of that, I almost think they deserve cracking my box ;-) -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 09:41:33AM -0800, Andy Lutomirski wrote: > > > On Nov 29, 2018, at 9:21 AM, Steven Rostedt wrote: > > > > On Thu, 29 Nov 2018 12:20:00 -0500 > > Steven Rostedt wrote: > > > > > >> r8 = return address > >> r9 = function to call > >> > > > > Bad example, r8 and r9 are args, but r10 and r11 are available. > > > > -- Steve > > > >>push r8 > >>jmp *r9 > >> > >> Then have the regs->ip point to that trampoline. > > Cute. That’ll need ORC annotations and some kind of retpoline to replace the > indirect jump, though. I'm going with this idea, but the BP is so rare that I really don't see why a retpoline would be needed. -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 09:35:11 -0800 Linus Torvalds wrote: > On Thu, Nov 29, 2018 at 9:13 AM Steven Rostedt wrote: > > > > No, we really do need to sync after we change the second part of the > > command with the int3 on it. Unless there's another way to guarantee > > that the full instruction gets seen when we replace the int3 with the > > finished command. > > Making sure the call instruction is aligned with the I$ fetch boundary > should do that. > > It's not in the SDM, but neither was our current behavior - we > were/are just relying on "it will work". > Well, the current method (as Jiri mentioned) did get the OK from at least Intel (and that was with a lot of arm twisting to do so). -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
> On Nov 29, 2018, at 9:21 AM, Steven Rostedt wrote: > > On Thu, 29 Nov 2018 12:20:00 -0500 > Steven Rostedt wrote: > > >> r8 = return address >> r9 = function to call >> > > Bad example, r8 and r9 are args, but r10 and r11 are available. > > -- Steve > >>push r8 >>jmp *r9 >> >> Then have the regs->ip point to that trampoline. Cute. That’ll need ORC annotations and some kind of retpoline to replace the indirect jump, though. >> >> -- Steve >
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 9:13 AM Steven Rostedt wrote: > > No, we really do need to sync after we change the second part of the > command with the int3 on it. Unless there's another way to guarantee > that the full instruction gets seen when we replace the int3 with the > finished command. Making sure the call instruction is aligned with the I$ fetch boundary should do that. It's not in the SDM, but neither was our current behavior - we were/are just relying on "it will work". Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
> On Nov 29, 2018, at 9:29 AM, Linus Torvalds > wrote: > > On Thu, Nov 29, 2018 at 9:02 AM Andy Lutomirski wrote: >>> >>> - just restart the instruction (with the suggested "ptregs->rip --") >>> >>> - to avoid any "oh, we're not making progress" issues, just fix the >>> instruction yourself to be the right call, by looking it up in the >>> "what needs to be fixed" tables. >> >> I thought that too. I think it deadlocks. CPU A does text_poke_bp(). CPU B >> is waiting for a spinlock with IRQs off. CPU C holds the spinlock and hits >> the int3. The int3 never goes away because CPU A is waiting for CPU B to >> handle the sync_core IPI. >> >> Or do you think we can avoid the IPI while the int3 is there? > > I'm handwaving and thinking that CPU C that hits the int3 can just fix > up the instruction directly in its own caches, and return. > > Yes, it does what he "text_poke" *will* do (so now the instruction > gets rewritten _twice_), but who cares? It's idempotent. > > But it’s out of order. I’m not concerned about the final IPI — I’m concerned about the IPI after the int3 write and before the int3 is removed again. If one CPU replaces 0xcc with 0xe8, another CPU could observe that before the last couple bytes of the call target are written and observed by all CPUs.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018, Andy Lutomirski wrote: > Does anyone know what the actual hardware semantics are? The SDM is not > particularly informative unless I looked at the wrong section. I don't think SDM answers all the questions there, unfortunately. I vaguely remember that back then when I was preparing the original text_poke_bp() implementation, hpa had to provide some answers directly from inner depths of Intel ... see fd4363fff3 ("x86: Introduce int3 (breakpoint)-based instruction patching") for reference. -- Jiri Kosina SUSE Labs
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 9:02 AM Andy Lutomirski wrote: > > > > - just restart the instruction (with the suggested "ptregs->rip --") > > > > - to avoid any "oh, we're not making progress" issues, just fix the > > instruction yourself to be the right call, by looking it up in the > > "what needs to be fixed" tables. > > I thought that too. I think it deadlocks. CPU A does text_poke_bp(). CPU B > is waiting for a spinlock with IRQs off. CPU C holds the spinlock and hits > the int3. The int3 never goes away because CPU A is waiting for CPU B to > handle the sync_core IPI. > > Or do you think we can avoid the IPI while the int3 is there? I'm handwaving and thinking that CPU C that hits the int3 can just fix up the instruction directly in its own caches, and return. Yes, it does what he "text_poke" *will* do (so now the instruction gets rewritten _twice_), but who cares? It's idempotent. And no, I don't have code, just "maybe some handwaving like this" Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
> On Nov 29, 2018, at 9:07 AM, Peter Zijlstra wrote: > > On Thu, Nov 29, 2018 at 09:02:23AM -0800, Andy Lutomirski wrote: >>> On Nov 29, 2018, at 8:50 AM, Linus Torvalds >>> wrote: > >>> So no. Do *not* try to change %rsp on the stack in the bp handler. >>> Instead, I'd suggest: >>> >>> - just restart the instruction (with the suggested "ptregs->rip --") >>> >>> - to avoid any "oh, we're not making progress" issues, just fix the >>> instruction yourself to be the right call, by looking it up in the >>> "what needs to be fixed" tables. >>> >>> No? > >> Or do you think we can avoid the IPI while the int3 is there? > > I'm thinking Linus is suggesting the #BP handler does the text write too > (as a competing store) and then sync_core() and restarts. > > But I think that is broken, because then there is no telling what the > other CPUs will observe. Does anyone know what the actual hardware semantics are? The SDM is not particularly informative unless I looked at the wrong section.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 12:20:00 -0500 Steven Rostedt wrote: > r8 = return address > r9 = function to call > Bad example, r8 and r9 are args, but r10 and r11 are available. -- Steve > push r8 > jmp *r9 > > Then have the regs->ip point to that trampoline. > > -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 18:15:39 +0100 Peter Zijlstra wrote: > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote: > > > If you make it conditional on CPL, do it for 32-bit as well, add > > comments, > > > and convince yourself that there isn’t a better solution > > (like pointing IP at a stub that retpolines to the target by reading > > the function pointer, a la the unoptimizable version), then okay, I > > guess, with only a small amount of grumbling. > > Right; so we _could_ grow the trampoline with a retpoline indirect call > and ret. It just makes the trampoline a whole lot bigger, but it could > work. Can't we make use of the callee clobbered registers? I mean, we know that call is being made when the int3 is triggered. Then we can save the return address in one register, and the jump location in another, and then just call a trampoline that does: r8 = return address r9 = function to call push r8 jmp *r9 Then have the regs->ip point to that trampoline. -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote: > If you make it conditional on CPL, do it for 32-bit as well, add > comments, > and convince yourself that there isn’t a better solution > (like pointing IP at a stub that retpolines to the target by reading > the function pointer, a la the unoptimizable version), then okay, I > guess, with only a small amount of grumbling. Right; so we _could_ grow the trampoline with a retpoline indirect call and ret. It just makes the trampoline a whole lot bigger, but it could work.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 09:02:23 -0800 Andy Lutomirski wrote: > > Instead, I'd suggest: > > > > - just restart the instruction (with the suggested "ptregs->rip --") > > > > - to avoid any "oh, we're not making progress" issues, just fix the > > instruction yourself to be the right call, by looking it up in the > > "what needs to be fixed" tables. > > > > No? > > I thought that too. I think it deadlocks. CPU A does > text_poke_bp(). CPU B is waiting for a spinlock with IRQs off. CPU > C holds the spinlock and hits the int3. The int3 never goes away > because CPU A is waiting for CPU B to handle the sync_core IPI. I agree that this can happen. > > Or do you think we can avoid the IPI while the int3 is there? No, we really do need to sync after we change the second part of the command with the int3 on it. Unless there's another way to guarantee that the full instruction gets seen when we replace the int3 with the finished command. To refresh everyone's memory for why we have an IPI (as IPIs have an implicit memory barrier for the CPU). We start with: e8 01 02 03 04 and we want to convert it to: e8 ab cd ef 01 And let's say the instruction crosses a cache line that breaks it into e8 01 and 02 03 04. We add the breakpoint: cc 01 02 03 04 We do a sync (so now everyone should see the break point), because we don't want to update the second part and another CPU happens to update the second part of the cache, and might see: e8 01 cd ef 01 Which would not be good. And we need another sync after we change the code so all CPUs see cc ab cd ef 01 Because when we remove the break point, we don't want other CPUs to see e8 ab 02 03 04 Which would also be bad. -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote: > > > > On Nov 29, 2018, at 8:49 AM, Peter Zijlstra wrote: > > > > On Thu, Nov 29, 2018 at 10:33:42AM -0600, Josh Poimboeuf wrote: > >>> can't we 'fix' that again? The alternative is moving that IRET-frame and > >>> fixing everything up, which is going to be fragile, ugly and such > >>> things more. > > > >> This seems to work... > > > > That's almost too easy... nice! > > It is indeed too easy: you’re putting pt_regs in the wrong place for > int3 from user mode, which is probably a root hole if you arrange for > a ptraced process to do int3 and try to write to whatever register > aliases CS. > > If you make it conditional on CPL, do it for 32-bit as well, add > comments convince yourself that there isn’t a better solution I could do that - but why subject 32-bit to it? I was going to make it conditional on CONFIG_HAVE_STATIC_CALL_INLINE which is 64-bit only. > (like pointing IP at a stub that retpolines to the target by reading > the function pointer, a la the unoptimizable version), then okay, I > guess, with only a small amount of grumbling. I tried that in v2, but Peter pointed out it's racy: https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 09:02:23AM -0800, Andy Lutomirski wrote: > > On Nov 29, 2018, at 8:50 AM, Linus Torvalds > > wrote: > > So no. Do *not* try to change %rsp on the stack in the bp handler. > > Instead, I'd suggest: > > > > - just restart the instruction (with the suggested "ptregs->rip --") > > > > - to avoid any "oh, we're not making progress" issues, just fix the > > instruction yourself to be the right call, by looking it up in the > > "what needs to be fixed" tables. > > > > No? > Or do you think we can avoid the IPI while the int3 is there? I'm thinking Linus is suggesting the #BP handler does the text write too (as a competing store) and then sync_core() and restarts. But I think that is broken, because then there is no telling what the other CPUs will observe.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
> On Nov 29, 2018, at 8:50 AM, Linus Torvalds > wrote: > >> On Thu, Nov 29, 2018 at 8:33 AM Josh Poimboeuf wrote: >> >> This seems to work... >> >> + .if \create_gap == 1 >> + .rept 6 >> + pushq 5*8(%rsp) >> + .endr >> + .endif >> + >> -idtentry int3 do_int3 has_error_code=0 >> +idtentry int3 do_int3 has_error_code=0 >>create_gap=1 > > Ugh. Doesn't this entirely screw up the stack layout, which then > screws up task_pt_regs(), which then breaks ptrace and friends? > > ... and you'd only notice it for users that use int3 in user space, > which now writes random locations on the kernel stack, which is then a > huge honking security hole. > > It's possible that I'm confused, but let's not play random games with > the stack like this. The entry code is sacred, in scary ways. > > So no. Do *not* try to change %rsp on the stack in the bp handler. > Instead, I'd suggest: > > - just restart the instruction (with the suggested "ptregs->rip --") > > - to avoid any "oh, we're not making progress" issues, just fix the > instruction yourself to be the right call, by looking it up in the > "what needs to be fixed" tables. > > No? I thought that too. I think it deadlocks. CPU A does text_poke_bp(). CPU B is waiting for a spinlock with IRQs off. CPU C holds the spinlock and hits the int3. The int3 never goes away because CPU A is waiting for CPU B to handle the sync_core IPI. Or do you think we can avoid the IPI while the int3 is there?
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
> On Nov 29, 2018, at 8:49 AM, Peter Zijlstra wrote: > > On Thu, Nov 29, 2018 at 10:33:42AM -0600, Josh Poimboeuf wrote: >>> can't we 'fix' that again? The alternative is moving that IRET-frame and >>> fixing everything up, which is going to be fragile, ugly and such >>> things more. > >> This seems to work... > > That's almost too easy... nice! It is indeed too easy: you’re putting pt_regs in the wrong place for int3 from user mode, which is probably a root hole if you arrange for a ptraced process to do int3 and try to write to whatever register aliases CS. If you make it conditional on CPL, do it for 32-bit as well, add comments, and convince yourself that there isn’t a better solution (like pointing IP at a stub that retpolines to the target by reading the function pointer, a la the unoptimizable version), then okay, I guess, with only a small amount of grumbling. > >> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S >> index ce25d84023c0..184523447d35 100644 >> --- a/arch/x86/entry/entry_64.S >> +++ b/arch/x86/entry/entry_64.S >> @@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR >> irq_work_interrupt smp_irq_work_interrupt >> * @paranoid == 2 is special: the stub will never switch stacks. This is for >> * #DF: if the thread stack is somehow unusable, we'll still get a useful >> OOPS.
>> */ >> -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 >> +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 >> create_gap=0 >> ENTRY(\sym) >>UNWIND_HINT_IRET_REGS offset=\has_error_code*8 >> >> @@ -891,6 +891,12 @@ ENTRY(\sym) >>pushq$-1/* ORIG_RAX: no syscall to restart */ >>.endif >> >> +.if \create_gap == 1 >> +.rept 6 >> +pushq 5*8(%rsp) >> +.endr >> +.endif >> + >>.if \paranoid == 1 >>testb$3, CS-ORIG_RAX(%rsp)/* If coming from userspace, switch >> stacks */ >>jnz.Lfrom_usermode_switch_stack_\@ >> @@ -1126,7 +1132,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ >> #endif /* CONFIG_HYPERV */ >> >> idtentry debugdo_debughas_error_code=0paranoid=1 >> shift_ist=DEBUG_STACK >> -idtentry int3do_int3has_error_code=0 >> +idtentry int3do_int3has_error_code=0create_gap=1 >> idtentry stack_segmentdo_stack_segmenthas_error_code=1 >> >> #ifdef CONFIG_XEN_PV
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018 08:50:16 -0800 Linus Torvalds wrote: > Instead, I'd suggest: > > - just restart the instruction (with the suggested "ptregs->rip --") > > - to avoid any "oh, we're not making progress" issues, just fix the > instruction yourself to be the right call, by looking it up in the > "what needs to be fixed" tables. So basically this will cause the code to go into a spin while we are doing the update, right? -- Steve
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 8:33 AM Josh Poimboeuf wrote: > > This seems to work... > > + .if \create_gap == 1 > + .rept 6 > + pushq 5*8(%rsp) > + .endr > + .endif > + > -idtentry int3 do_int3 has_error_code=0 > +idtentry int3 do_int3 has_error_code=0 > create_gap=1 Ugh. Doesn't this entirely screw up the stack layout, which then screws up task_pt_regs(), which then breaks ptrace and friends? ... and you'd only notice it for users that use int3 in user space, which now writes random locations on the kernel stack, which is then a huge honking security hole. It's possible that I'm confused, but let's not play random games with the stack like this. The entry code is sacred, in scary ways. So no. Do *not* try to change %rsp on the stack in the bp handler. Instead, I'd suggest: - just restart the instruction (with the suggested "ptregs->rip --") - to avoid any "oh, we're not making progress" issues, just fix the instruction yourself to be the right call, by looking it up in the "what needs to be fixed" tables. No? Linus
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 10:33:42AM -0600, Josh Poimboeuf wrote: > > can't we 'fix' that again? The alternative is moving that IRET-frame and > > fixing everything up, which is going to be fragile, ugly and such > > things more. > This seems to work... That's almost too easy... nice! > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S > index ce25d84023c0..184523447d35 100644 > --- a/arch/x86/entry/entry_64.S > +++ b/arch/x86/entry/entry_64.S > @@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR > irq_work_interrupt smp_irq_work_interrupt > * @paranoid == 2 is special: the stub will never switch stacks. This is for > * #DF: if the thread stack is somehow unusable, we'll still get a useful > OOPS. > */ > -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 > +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 > create_gap=0 > ENTRY(\sym) > UNWIND_HINT_IRET_REGS offset=\has_error_code*8 > > @@ -891,6 +891,12 @@ ENTRY(\sym) > pushq $-1 /* ORIG_RAX: no syscall to > restart */ > .endif > > + .if \create_gap == 1 > + .rept 6 > + pushq 5*8(%rsp) > + .endr > + .endif > + > .if \paranoid == 1 > testb $3, CS-ORIG_RAX(%rsp) /* If coming from userspace, > switch stacks */ > jnz .Lfrom_usermode_switch_stack_\@ > @@ -1126,7 +1132,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ > #endif /* CONFIG_HYPERV */ > > idtentry debug do_debug > has_error_code=0paranoid=1 shift_ist=DEBUG_STACK > -idtentry int3do_int3 has_error_code=0 > +idtentry int3do_int3 > has_error_code=0create_gap=1 > idtentry stack_segment do_stack_segmenthas_error_code=1 > > #ifdef CONFIG_XEN_PV
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 03:38:53PM +0100, Peter Zijlstra wrote: > On Thu, Nov 29, 2018 at 05:37:39AM -0800, Andy Lutomirski wrote: > > > > > > > On Nov 29, 2018, at 1:42 AM, Peter Zijlstra wrote: > > > > > > On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote: > > > > > +static void static_call_bp_handler(struct pt_regs *regs, void *_data) > > +{ > > +struct static_call_bp_data *data = _data; > > + > > +/* > > + * For inline static calls, push the return address on the stack > > so the > > + * "called" function will return to the location immediately > > after the > > + * call site. > > + * > > + * NOTE: This code will need to be revisited when kernel CET gets > > + * implemented. > > + */ > > +if (data->ret) { > > +regs->sp -= sizeof(long); > > +*(unsigned long *)regs->sp = data->ret; > > +} > > >> > > >> You can’t do this. Depending on the alignment of the old RSP, which > > >> is not guaranteed, this overwrites regs->cs. IRET goes boom. > > > > > > I don't get it; can you spell that out? > > > > > > The way I understand it is that we're at a location where a "E8 - Near > > > CALL" instruction should be, and thus RSP should be the regular kernel > > > stack, and the above simply does "PUSH ret", which is what that CALL > > > would've done too. > > > > > > > int3 isn’t IST anymore, so the int3 instruction conditionally > > subtracts 8 from RSP and then pushes SS, etc. So my email was > > obviously wrong wrt “cs”, but you’re still potentially overwriting the > > int3 IRET frame. > > ARGH!.. > > can't we 'fix' that again? The alternative is moving that IRET-frame and > fixing everything up, which is going to be fragile, ugly and such > things more. > > Commit d8ba61ba58c8 ("x86/entry/64: Don't use IST entry for #BP stack") > doesn't list any strong reasons for why it should NOT be an IST. This seems to work... 
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index ce25d84023c0..184523447d35 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt * @paranoid == 2 is special: the stub will never switch stacks. This is for * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS. */ -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0 ENTRY(\sym) UNWIND_HINT_IRET_REGS offset=\has_error_code*8 @@ -891,6 +891,12 @@ ENTRY(\sym) pushq $-1 /* ORIG_RAX: no syscall to restart */ .endif + .if \create_gap == 1 + .rept 6 + pushq 5*8(%rsp) + .endr + .endif + .if \paranoid == 1 testb $3, CS-ORIG_RAX(%rsp) /* If coming from userspace, switch stacks */ jnz .Lfrom_usermode_switch_stack_\@ @@ -1126,7 +1132,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ #endif /* CONFIG_HYPERV */ idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK -idtentry int3 do_int3 has_error_code=0 +idtentry int3 do_int3 has_error_code=0 create_gap=1 idtentry stack_segment do_stack_segment has_error_code=1 #ifdef CONFIG_XEN_PV
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, 29 Nov 2018, Peter Zijlstra wrote: > > int3 isn’t IST anymore, so the int3 instruction conditionally > > subtracts 8 from RSP and then pushes SS, etc. So my email was > > obviously wrong wrt “cs”, but you’re still potentially overwriting the > > int3 IRET frame. > > ARGH!.. > > can't we 'fix' that again? The alternative is moving that IRET-frame and > fixing everything up, which is going to be fragile, ugly and such > things more. > > Commit d8ba61ba58c8 ("x86/entry/64: Don't use IST entry for #BP stack") > doesn't list any strong reasons for why it should NOT be an IST. It's CVE-2018-8897. -- Jiri Kosina SUSE Labs
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 05:37:39AM -0800, Andy Lutomirski wrote: > > > > On Nov 29, 2018, at 1:42 AM, Peter Zijlstra wrote: > > > > On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote: > > > +static void static_call_bp_handler(struct pt_regs *regs, void *_data) > +{ > +struct static_call_bp_data *data = _data; > + > +/* > + * For inline static calls, push the return address on the stack so > the > + * "called" function will return to the location immediately after > the > + * call site. > + * > + * NOTE: This code will need to be revisited when kernel CET gets > + * implemented. > + */ > +if (data->ret) { > +regs->sp -= sizeof(long); > +*(unsigned long *)regs->sp = data->ret; > +} > >> > >> You can’t do this. Depending on the alignment of the old RSP, which > >> is not guaranteed, this overwrites regs->cs. IRET goes boom. > > > > I don't get it; can you spell that out? > > > > The way I understand it is that we're at a location where a "E8 - Near > > CALL" instruction should be, and thus RSP should be the regular kernel > > stack, and the above simply does "PUSH ret", which is what that CALL > > would've done too. > > > > int3 isn’t IST anymore, so the int3 instruction conditionally > subtracts 8 from RSP and then pushes SS, etc. So my email was > obviously wrong wrt “cs”, but you’re still potentially overwriting the > int3 IRET frame. ARGH!.. can't we 'fix' that again? The alternative is moving that IRET-frame and fixing everything up, which is going to be fragile, ugly and such things more. Commit d8ba61ba58c8 ("x86/entry/64: Don't use IST entry for #BP stack") doesn't list any strong reasons for why it should NOT be an IST.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
> On Nov 29, 2018, at 1:42 AM, Peter Zijlstra wrote: > > On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote: > +static void static_call_bp_handler(struct pt_regs *regs, void *_data) +{ +struct static_call_bp_data *data = _data; + +/* + * For inline static calls, push the return address on the stack so the + * "called" function will return to the location immediately after the + * call site. + * + * NOTE: This code will need to be revisited when kernel CET gets + * implemented. + */ +if (data->ret) { +regs->sp -= sizeof(long); +*(unsigned long *)regs->sp = data->ret; +} >> >> You can’t do this. Depending on the alignment of the old RSP, which >> is not guaranteed, this overwrites regs->cs. IRET goes boom. > > I don't get it; can you spell that out? > > The way I understand it is that we're at a location where a "E8 - Near > CALL" instruction should be, and thus RSP should be the regular kernel > stack, and the above simply does "PUSH ret", which is what that CALL > would've done too. > int3 isn’t IST anymore, so the int3 instruction conditionally subtracts 8 from RSP and then pushes SS, etc. So my email was obviously wrong wrt “cs”, but you’re still potentially overwriting the int3 IRET frame.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Thu, Nov 29, 2018 at 10:42:10AM +0100, Peter Zijlstra wrote: > On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote: > > > >> +static void static_call_bp_handler(struct pt_regs *regs, void *_data) > > >> +{ > > >> +struct static_call_bp_data *data = _data; > > >> + > > >> +/* > > >> + * For inline static calls, push the return address on the stack so > > >> the > > >> + * "called" function will return to the location immediately after > > >> the > > >> + * call site. > > >> + * > > >> + * NOTE: This code will need to be revisited when kernel CET gets > > >> + * implemented. > > >> + */ > > >> +if (data->ret) { > > >> +regs->sp -= sizeof(long); > > >> +*(unsigned long *)regs->sp = data->ret; > > >> +} > > > > You can’t do this. Depending on the alignment of the old RSP, which > > is not guaranteed, this overwrites regs->cs. IRET goes boom. > > I don't get it; can you spell that out? I don't quite follow that either. Maybe Andy is referring to x86-32, for which regs->sp isn't actually saved: see kernel_stack_pointer(). This code is 64-bit only so that's not a concern. > The way I understand it is that we're at a location where a "E8 - Near > CALL" instruction should be, and thus RSP should be the regular kernel > stack, and the above simply does "PUSH ret", which is what that CALL > would've done too. Right. -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote: > >> +static void static_call_bp_handler(struct pt_regs *regs, void *_data) > >> +{ > >> +struct static_call_bp_data *data = _data; > >> + > >> +/* > >> + * For inline static calls, push the return address on the stack so > >> the > >> + * "called" function will return to the location immediately after the > >> + * call site. > >> + * > >> + * NOTE: This code will need to be revisited when kernel CET gets > >> + * implemented. > >> + */ > >> +if (data->ret) { > >> +regs->sp -= sizeof(long); > >> +*(unsigned long *)regs->sp = data->ret; > >> +} > > You can’t do this. Depending on the alignment of the old RSP, which > is not guaranteed, this overwrites regs->cs. IRET goes boom. I don't get it; can you spell that out? The way I understand it is that we're at a location where a "E8 - Near CALL" instruction should be, and thus RSP should be the regular kernel stack, and the above simply does "PUSH ret", which is what that CALL would've done too.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
> On Nov 27, 2018, at 12:43 AM, Peter Zijlstra wrote: > >> On Mon, Nov 26, 2018 at 03:26:28PM -0600, Josh Poimboeuf wrote: >> >> Yeah, that's probably better. I assume you also mean that we would have >> all text_poke_bp() users create a handler callback? That way the >> interface is clear and consistent for everybody. Like: > > Can do, it does indeed make the interface less like a hack. It is not > like there are too many users. > >> diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c >> index aac0c1f7e354..d4b0abe4912d 100644 >> --- a/arch/x86/kernel/jump_label.c >> +++ b/arch/x86/kernel/jump_label.c >> @@ -37,6 +37,11 @@ static void bug_at(unsigned char *ip, int line) >>BUG(); >> } >> >> +static inline void jump_label_bp_handler(struct pt_regs *regs, void *data) >> +{ >> +regs->ip += JUMP_LABEL_NOP_SIZE - 1; >> +} >> + >> static void __ref __jump_label_transform(struct jump_entry *entry, >> enum jump_label_type type, >> void *(*poker)(void *, const void *, size_t), >> @@ -91,7 +96,7 @@ static void __ref __jump_label_transform(struct jump_entry >> *entry, >>} >> >>text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE, >> - (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE); >> + jump_label_bp_handler, NULL); >> } >> >> void arch_jump_label_transform(struct jump_entry *entry, > > Per that example.. > >> diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c >> index d3869295b88d..e05ebc6d4db5 100644 >> --- a/arch/x86/kernel/static_call.c >> +++ b/arch/x86/kernel/static_call.c >> @@ -7,24 +7,30 @@ >> >> #define CALL_INSN_SIZE 5 >> >> +struct static_call_bp_data { >> +unsigned long func, ret; >> +}; >> + >> +static void static_call_bp_handler(struct pt_regs *regs, void *_data) >> +{ >> +struct static_call_bp_data *data = _data; >> + >> +/* >> + * For inline static calls, push the return address on the stack so the >> + * "called" function will return to the location immediately after the >> + * call site. 
>> + * >> + * NOTE: This code will need to be revisited when kernel CET gets >> + * implemented. >> + */ >> +if (data->ret) { >> +regs->sp -= sizeof(long); >> +*(unsigned long *)regs->sp = data->ret; >> +} You can’t do this. Depending on the alignment of the old RSP, which is not guaranteed, this overwrites regs->cs. IRET goes boom. Maybe it could be fixed by pointing regs->ip at a real trampoline? This code is subtle and executed rarely, which is a bad combination. It would be great if we had a test case. I think it would be great if the implementation could be, literally: regs->ip -= 1; return; IOW, just retry and wait until we get the new patched instruction. The problem is that, if we're in a context where IRQs are off, then we're preventing on_each_cpu() from completing and, even if we somehow just let the code know that we already serialized ourselves, we're still potentially holding a spinlock that another CPU is waiting for with IRQs off. Ugh. Anyone have a clever idea to fix that?
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Tue, Nov 27, 2018 at 09:43:30AM +0100, Peter Zijlstra wrote: > Now; if I'm not mistaken, the below @site is in fact @regs->ip - 1, no? > > We already patched site with INT3, which is what we just trapped on. So > we could in fact write something like: > > static void static_call_bp_handler(struct pt_regs *regs, void *data) > { > struct static_call_bp_data *scd = data; > > switch (data->type) { > case CALL_INSN: /* emulate CALL instruction */ > regs->sp -= sizeof(unsigned long); > *(unsigned long *)regs->sp = regs->ip + CALL_INSN_SIZE - 1; > regs->ip = data->func; > break; > > case JMP_INSN: /* emulate JMP instruction */ > regs->ip = data->func; > break; > } > } > handler_data = (struct static_call_bp_data){ > .type = IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) ? CALL_INSN > : JMP_INSN, > .func = func, > }; Heck; check this: static void static_call_bp_handler(struct pt_regs *regs, void *data) { #ifdef CONFIG_HAVE_STATIC_CALL_INLINE /* emulate CALL instruction */ regs->sp -= sizeof(unsigned long); *(unsigned long *)regs->sp = regs->ip + CALL_INSN_SIZE - 1; regs->ip = data; #else /* !CONFIG_HAVE_STATIC_CALL_INLINE */ /* emulate JMP instruction */ regs->ip = data; #endif }
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 02:14:49PM -0600, Josh Poimboeuf wrote: > On Mon, Nov 26, 2018 at 10:28:08AM -0800, Andy Lutomirski wrote: > > Can you add a comment that it will need updating when kernel CET is added? > > Will do, though I get the feeling there's a lot of other (existing) code > that will also need to change for kernel CET. Yeah, function graph tracer and kretprobes at the very least. But I suspect there's a few surprises to be had once they try kernel CET.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 03:26:28PM -0600, Josh Poimboeuf wrote: > Yeah, that's probably better. I assume you also mean that we would have > all text_poke_bp() users create a handler callback? That way the > interface is clear and consistent for everybody. Like: Can do, it does indeed make the interface less like a hack. It is not like there are too many users. > diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c > index aac0c1f7e354..d4b0abe4912d 100644 > --- a/arch/x86/kernel/jump_label.c > +++ b/arch/x86/kernel/jump_label.c > @@ -37,6 +37,11 @@ static void bug_at(unsigned char *ip, int line) > BUG(); > } > > +static inline void jump_label_bp_handler(struct pt_regs *regs, void *data) > +{ > + regs->ip += JUMP_LABEL_NOP_SIZE - 1; > +} > + > static void __ref __jump_label_transform(struct jump_entry *entry, >enum jump_label_type type, >void *(*poker)(void *, const void *, > size_t), > @@ -91,7 +96,7 @@ static void __ref __jump_label_transform(struct jump_entry > *entry, > } > > text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE, > - (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE); > + jump_label_bp_handler, NULL); > } > > void arch_jump_label_transform(struct jump_entry *entry, Per that example.. > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c > index d3869295b88d..e05ebc6d4db5 100644 > --- a/arch/x86/kernel/static_call.c > +++ b/arch/x86/kernel/static_call.c > @@ -7,24 +7,30 @@ > > #define CALL_INSN_SIZE 5 > > +struct static_call_bp_data { > + unsigned long func, ret; > +}; > + > +static void static_call_bp_handler(struct pt_regs *regs, void *_data) > +{ > + struct static_call_bp_data *data = _data; > + > + /* > + * For inline static calls, push the return address on the stack so the > + * "called" function will return to the location immediately after the > + * call site. > + * > + * NOTE: This code will need to be revisited when kernel CET gets > + * implemented. 
> + */ > + if (data->ret) { > + regs->sp -= sizeof(long); > + *(unsigned long *)regs->sp = data->ret; > + } > + > + /* The exception handler will 'return' to the destination function. */ > + regs->ip = data->func; > +} Now; if I'm not mistaken, the below @site is in fact @regs->ip - 1, no? We already patched site with INT3, which is what we just trapped on. So we could in fact write something like: static void static_call_bp_handler(struct pt_regs *regs, void *data) { struct static_call_bp_data *scd = data; switch (data->type) { case CALL_INSN: /* emulate CALL instruction */ regs->sp -= sizeof(unsigned long); *(unsigned long *)regs->sp = regs->ip + CALL_INSN_SIZE - 1; regs->ip = data->func; break; case JMP_INSN: /* emulate JMP instruction */ regs->ip = data->func; break; } } > void arch_static_call_transform(void *site, void *tramp, void *func) > { > @@ -32,11 +38,17 @@ void arch_static_call_transform(void *site, void *tramp, > void *func) > unsigned long insn; > unsigned char insn_opcode; > unsigned char opcodes[CALL_INSN_SIZE]; > + struct static_call_bp_data handler_data; > + > + handler_data.func = (unsigned long)func; > > - if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) { > insn = (unsigned long)site; > + handler_data.ret = insn + CALL_INSN_SIZE; > + } else { > insn = (unsigned long)tramp; > + handler_data.ret = 0; > + } handler_data = (struct static_call_bp_data){ .type = IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) ? CALL_INSN : JMP_INSN, .func = func, }; > mutex_lock(&text_mutex); > > @@ -52,14 +64,9 @@ void arch_static_call_transform(void *site, void *tramp, > void *func) > opcodes[0] = insn_opcode; > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1); > > /* Patch the call site: */ > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE, > + static_call_bp_handler, &handler_data); > > done: > mutex_unlock(&text_mutex);
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 09:08:01PM +0100, Peter Zijlstra wrote: > On Mon, Nov 26, 2018 at 11:56:24AM -0600, Josh Poimboeuf wrote: > > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c > > index d3869295b88d..8fd6c8556750 100644 > > --- a/arch/x86/kernel/static_call.c > > +++ b/arch/x86/kernel/static_call.c > > @@ -7,24 +7,19 @@ > > > > #define CALL_INSN_SIZE 5 > > > > +unsigned long bp_handler_call_return_addr; > > > > +static void static_call_bp_handler(struct pt_regs *regs) > > +{ > > #ifdef CONFIG_HAVE_STATIC_CALL_INLINE > > + /* > > +* Push the return address on the stack so the "called" function will > > +* return to immediately after the call site. > > +*/ > > + regs->sp -= sizeof(long); > > + *(unsigned long *)regs->sp = bp_handler_call_return_addr; > > #endif > > +} > > > > void arch_static_call_transform(void *site, void *tramp, void *func) > > { > > @@ -52,14 +47,12 @@ void arch_static_call_transform(void *site, void > > *tramp, void *func) > > opcodes[0] = insn_opcode; > > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1); > > > > if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > > + bp_handler_call_return_addr = insn + CALL_INSN_SIZE; > > > > /* Patch the call site: */ > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE, > > -static_call_bp_handler); > > +static_call_bp_handler, func); > > > > done: > > mutex_unlock(&text_mutex); > > > like maybe something along the lines of: > > struct sc_data { > unsigned long ret; > unsigned long ip; > }; > > void sc_handler(struct pt_regs *regs, void *data) > { > struct sc_data *scd = data; > > regs->sp -= sizeof(long); > *(unsigned long *)regs->sp = scd->ret; > regs->ip = scd->ip; > } > > arch_static_call_transform() > { > ... > > scd = (struct sc_data){ > .ret = insn + CALL_INSN_SIZE, > .ip = (unsigned long)func, > }; > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE, > sc_handler, (void *)&scd); > > ... > } Yeah, that's probably better. 
I assume you also mean that we would have all text_poke_bp() users create a handler callback? That way the interface is clear and consistent for everybody. Like: diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index e85ff65c43c3..04d6cf838fb7 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -20,6 +20,8 @@ static inline void apply_paravirt(struct paravirt_patch_site *start, extern void *text_poke_early(void *addr, const void *opcode, size_t len); +typedef void (*bp_handler_t)(struct pt_regs *regs, void *data); + /* * Clear and restore the kernel write-protection flag on the local CPU. * Allows the kernel to edit read-only pages. @@ -36,7 +38,8 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len); */ extern void *text_poke(void *addr, const void *opcode, size_t len); extern int poke_int3_handler(struct pt_regs *regs); -extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler); +extern void *text_poke_bp(void *addr, const void *opcode, size_t len, + bp_handler_t handler, void *data); extern int after_bootmem; #endif /* _ASM_X86_TEXT_PATCHING_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index ebeac487a20c..547af714bd60 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -738,7 +738,8 @@ static void do_sync_core(void *info) } static bool bp_patching_in_progress; -static void *bp_int3_handler, *bp_int3_addr; +static void *bp_int3_data, *bp_int3_addr; +static bp_handler_t bp_int3_handler; int poke_int3_handler(struct pt_regs *regs) { @@ -746,11 +747,11 @@ int poke_int3_handler(struct pt_regs *regs) * Having observed our INT3 instruction, we now must observe * bp_patching_in_progress. * -* in_progress = TRUE INT3 -* WMB RMB -* write INT3 if (in_progress) +* in_progress = TRUE INT3 +* WMB RMB +* write INT3 if (in_progress) * -* Idem for bp_int3_handler. 
+* Idem for bp_int3_data. */ smp_rmb(); @@ -760,8 +761,7 @@ int poke_int3_handler(struct pt_regs *regs) if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr) return 0; - /* set up the specified breakpoint handler */ - regs->ip = (unsigned long) bp_int3_handler; + bp_int3_handler(regs, bp_int3_data); return 1; @@ -772,7 +772,8 @@ int poke_int3_handler(struct pt_regs *regs) * @addr: address to patch * @opcode:opcode o
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 10:28:08AM -0800, Andy Lutomirski wrote: > On Mon, Nov 26, 2018 at 9:10 AM Josh Poimboeuf wrote: > > > > On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote: > > > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote: > > > > diff --git a/arch/x86/kernel/static_call.c > > > > b/arch/x86/kernel/static_call.c > > > > index 8026d176f25c..d3869295b88d 100644 > > > > --- a/arch/x86/kernel/static_call.c > > > > +++ b/arch/x86/kernel/static_call.c > > > > @@ -9,13 +9,21 @@ > > > > > > > > void static_call_bp_handler(void); > > > > void *bp_handler_dest; > > > > +void *bp_handler_continue; > > > > > > > > asm(".pushsection .text, \"ax\" > > > > \n" > > > > ".globl static_call_bp_handler \n" > > > > ".type static_call_bp_handler, @function > > > > \n" > > > > "static_call_bp_handler: > > > > \n" > > > > -"ANNOTATE_RETPOLINE_SAFE > > > > \n" > > > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE > > > > +ANNOTATE_RETPOLINE_SAFE > > > > +"call *bp_handler_dest \n" > > > > +ANNOTATE_RETPOLINE_SAFE > > > > +"jmp *bp_handler_continue > > > > \n" > > > > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */ > > > > +ANNOTATE_RETPOLINE_SAFE > > > > "jmp *bp_handler_dest \n" > > > > +#endif > > > > ".popsection > > > > \n"); > > > > > > > > void arch_static_call_transform(void *site, void *tramp, void *func) > > > > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void > > > > *tramp, void *func) > > > > unsigned char insn_opcode; > > > > unsigned char opcodes[CALL_INSN_SIZE]; > > > > > > > > - insn = (unsigned long)tramp; > > > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > > > > + insn = (unsigned long)site; > > > > + else > > > > + insn = (unsigned long)tramp; > > > > > > > > mutex_lock(&text_mutex); > > > > > > > > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void > > > > *tramp, void *func) > > > > opcodes[0] = insn_opcode; > > > > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1); > > > > > > > > - /* 
Set up the variable for the breakpoint handler: */ > > > > + /* Set up the variables for the breakpoint handler: */ > > > > bp_handler_dest = func; > > > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > > > > + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE); > > > > > > > > /* Patch the call site: */ > > > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE, > > > > > > OK, so this is where that static_call_bp_handler comes from; you need > > > that CALL to frob the stack. > > > > > > But I still think it is broken; consider: > > > > > > CPU0CPU1 > > > > > > bp_handler = ponies; > > > > > > text_poke_bp(, &static_call_bp_handler) > > > text_poke(&int3); > > > on_each_cpu(sync) > > > > > > ... > > > > > > > > > text_poke(/* all but first bytes */) > > > on_each_cpu(sync) > > > > > > ... > > > > > > > > > > > > pt_regs->ip = > > > &static_call_bp_handler > > > > > > > > > // VCPU takes a nap... > > > text_poke(/* first byte */) > > > on_each_cpu(sync) > > > > > > ... > > > > > > > > > // VCPU sleeps more > > > bp_handler = unicorn; > > > > > > CALL unicorn > > > > > > *whoops* > > > > > > Now, granted, that is all rather 'unlikely', but that never stopped > > > Murphy. > > > > Good find, thanks Peter. > > > > As we discussed on IRC, we'll need to fix this from within the int3 > > exception handler by faking the call: putting a fake return address on > > the stack (pointing to right after the call) and setting regs->ip to the > > called function. > > Can you add a comment that it will need updating when kernel CET is added? Will do, though I get the feeling there's a lot of other (existing) code that will also need to change for kernel CET. -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 11:56:24AM -0600, Josh Poimboeuf wrote: > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c > index d3869295b88d..8fd6c8556750 100644 > --- a/arch/x86/kernel/static_call.c > +++ b/arch/x86/kernel/static_call.c > @@ -7,24 +7,19 @@ > > #define CALL_INSN_SIZE 5 > > +unsigned long bp_handler_call_return_addr; > > +static void static_call_bp_handler(struct pt_regs *regs) > +{ > #ifdef CONFIG_HAVE_STATIC_CALL_INLINE > + /* > + * Push the return address on the stack so the "called" function will > + * return to immediately after the call site. > + */ > + regs->sp -= sizeof(long); > + *(unsigned long *)regs->sp = bp_handler_call_return_addr; > #endif > +} > > void arch_static_call_transform(void *site, void *tramp, void *func) > { > @@ -52,14 +47,12 @@ void arch_static_call_transform(void *site, void *tramp, > void *func) > opcodes[0] = insn_opcode; > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1); > > if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > + bp_handler_call_return_addr = insn + CALL_INSN_SIZE; > > /* Patch the call site: */ > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE, > - static_call_bp_handler); > + static_call_bp_handler, func); > > done: > mutex_unlock(&text_mutex); like maybe something along the lines of: struct sc_data { unsigned long ret; unsigned long ip; }; void sc_handler(struct pt_regs *regs, void *data) { struct sc_data *scd = data; regs->sp -= sizeof(long); *(unsigned long *)regs->sp = scd->ret; regs->ip = scd->ip; } arch_static_call_transform() { ... scd = (struct sc_data){ .ret = insn + CALL_INSN_SIZE, .ip = (unsigned long)func, }; text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE, sc_handler, (void *)&scd); ... }
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 11:56:24AM -0600, Josh Poimboeuf wrote: > Peter suggested updating the text_poke_bp() interface to add a handler > which is called from int3 context. This seems to work. > @@ -760,8 +761,10 @@ int poke_int3_handler(struct pt_regs *regs) > if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr) > return 0; > > - /* set up the specified breakpoint handler */ > - regs->ip = (unsigned long) bp_int3_handler; > + if (bp_int3_handler) > + bp_int3_handler(regs); > + > + regs->ip = (unsigned long)bp_int3_resume; > > return 1; > Peter also suggested you write that like: if (bp_int3_handler) bp_int3_handler(regs, resume); else regs->ip = resume; That allows 'abusing' @resume as 'data' pointer for @handler. Which allows for more complicated handlers.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 9:10 AM Josh Poimboeuf wrote: > > On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote: > > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote: > > > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c > > > index 8026d176f25c..d3869295b88d 100644 > > > --- a/arch/x86/kernel/static_call.c > > > +++ b/arch/x86/kernel/static_call.c > > > @@ -9,13 +9,21 @@ > > > > > > void static_call_bp_handler(void); > > > void *bp_handler_dest; > > > +void *bp_handler_continue; > > > > > > asm(".pushsection .text, \"ax\" > > > \n" > > > ".globl static_call_bp_handler \n" > > > ".type static_call_bp_handler, @function > > > \n" > > > "static_call_bp_handler: > > > \n" > > > -"ANNOTATE_RETPOLINE_SAFE > > > \n" > > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE > > > +ANNOTATE_RETPOLINE_SAFE > > > +"call *bp_handler_dest \n" > > > +ANNOTATE_RETPOLINE_SAFE > > > +"jmp *bp_handler_continue > > > \n" > > > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */ > > > +ANNOTATE_RETPOLINE_SAFE > > > "jmp *bp_handler_dest \n" > > > +#endif > > > ".popsection \n"); > > > > > > void arch_static_call_transform(void *site, void *tramp, void *func) > > > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void > > > *tramp, void *func) > > > unsigned char insn_opcode; > > > unsigned char opcodes[CALL_INSN_SIZE]; > > > > > > - insn = (unsigned long)tramp; > > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > > > + insn = (unsigned long)site; > > > + else > > > + insn = (unsigned long)tramp; > > > > > > mutex_lock(&text_mutex); > > > > > > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void > > > *tramp, void *func) > > > opcodes[0] = insn_opcode; > > > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1); > > > > > > - /* Set up the variable for the breakpoint handler: */ > > > + /* Set up the variables for the breakpoint handler: */ > > > bp_handler_dest = func; > > > + if 
(IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > > > + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE); > > > > > > /* Patch the call site: */ > > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE, > > > > OK, so this is where that static_call_bp_handler comes from; you need > > that CALL to frob the stack. > > > > But I still think it is broken; consider: > > > > CPU0CPU1 > > > > bp_handler = ponies; > > > > text_poke_bp(, &static_call_bp_handler) > > text_poke(&int3); > > on_each_cpu(sync) > > > > ... > > > > > > text_poke(/* all but first bytes */) > > on_each_cpu(sync) > > > > ... > > > > > > > > pt_regs->ip = > > &static_call_bp_handler > > > > > > // VCPU takes a nap... > > text_poke(/* first byte */) > > on_each_cpu(sync) > > > > ... > > > > > > // VCPU sleeps more > > bp_handler = unicorn; > > > > CALL unicorn > > > > *whoops* > > > > Now, granted, that is all rather 'unlikely', but that never stopped > > Murphy. > > Good find, thanks Peter. > > As we discussed on IRC, we'll need to fix this from within the int3 > exception handler by faking the call: putting a fake return address on > the stack (pointing to right after the call) and setting regs->ip to the > called function. Can you add a comment that it will need updating when kernel CET is added?
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 11:10:36AM -0600, Josh Poimboeuf wrote: > On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote: > > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote: > > > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c > > > index 8026d176f25c..d3869295b88d 100644 > > > --- a/arch/x86/kernel/static_call.c > > > +++ b/arch/x86/kernel/static_call.c > > > @@ -9,13 +9,21 @@ > > > > > > void static_call_bp_handler(void); > > > void *bp_handler_dest; > > > +void *bp_handler_continue; > > > > > > asm(".pushsection .text, \"ax\" > > > \n" > > > ".globl static_call_bp_handler > > > \n" > > > ".type static_call_bp_handler, @function > > > \n" > > > "static_call_bp_handler: > > > \n" > > > -"ANNOTATE_RETPOLINE_SAFE > > > \n" > > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE > > > +ANNOTATE_RETPOLINE_SAFE > > > +"call *bp_handler_dest > > > \n" > > > +ANNOTATE_RETPOLINE_SAFE > > > +"jmp *bp_handler_continue > > > \n" > > > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */ > > > +ANNOTATE_RETPOLINE_SAFE > > > "jmp *bp_handler_dest > > > \n" > > > +#endif > > > ".popsection \n"); > > > > > > void arch_static_call_transform(void *site, void *tramp, void *func) > > > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void > > > *tramp, void *func) > > > unsigned char insn_opcode; > > > unsigned char opcodes[CALL_INSN_SIZE]; > > > > > > - insn = (unsigned long)tramp; > > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > > > + insn = (unsigned long)site; > > > + else > > > + insn = (unsigned long)tramp; > > > > > > mutex_lock(&text_mutex); > > > > > > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void > > > *tramp, void *func) > > > opcodes[0] = insn_opcode; > > > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1); > > > > > > - /* Set up the variable for the breakpoint handler: */ > > > + /* Set up the variables for the breakpoint handler: */ > > > bp_handler_dest = func; > > > + if 
(IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > > > + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE); > > > > > > /* Patch the call site: */ > > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE, > > > > OK, so this is where that static_call_bp_handler comes from; you need > > that CALL to frob the stack. > > > > But I still think it is broken; consider: > > > > CPU0CPU1 > > > > bp_handler = ponies; > > > > text_poke_bp(, &static_call_bp_handler) > > text_poke(&int3); > > on_each_cpu(sync) > > > > ... > > > > > > text_poke(/* all but first bytes */) > > on_each_cpu(sync) > > > > ... > > > > > > > > pt_regs->ip = &static_call_bp_handler > > > > > > // VCPU takes a nap... > > text_poke(/* first byte */) > > on_each_cpu(sync) > > > > ... > > > > > > // VCPU sleeps more > > bp_handler = unicorn; > > > > CALL unicorn > > > > *whoops* > > > > Now, granted, that is all rather 'unlikely', but that never stopped > > Murphy. > > Good find, thanks Peter. > > As we discussed on IRC, we'll need to fix this from within the int3 > exception handler by faking the call: putting a fake return address on > the stack (pointing to right after the call) and setting regs->ip to the > called function. > > And for the out-of-line case we can just jump straight to the function, > so the function itself will be the text_poke_bp() "handler". > > So the static_call_bp_handler() trampoline will go away. Peter suggested updating the text_poke_bp() interface to add a handler which is called from int3 context. This seems to work. diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index e85ff65c43c3..7fcaa37c1876 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -20,6 +20,8 @@ static inline void apply_paravirt(struct paravirt_patch_site *start, extern void *text_po
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote: > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote: > > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c > > index 8026d176f25c..d3869295b88d 100644 > > --- a/arch/x86/kernel/static_call.c > > +++ b/arch/x86/kernel/static_call.c > > @@ -9,13 +9,21 @@ > > > > void static_call_bp_handler(void); > > void *bp_handler_dest; > > +void *bp_handler_continue; > > > > asm(".pushsection .text, \"ax\" > > \n" > > ".globl static_call_bp_handler \n" > > ".type static_call_bp_handler, @function > > \n" > > "static_call_bp_handler: > > \n" > > -"ANNOTATE_RETPOLINE_SAFE > > \n" > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE > > +ANNOTATE_RETPOLINE_SAFE > > +"call *bp_handler_dest \n" > > +ANNOTATE_RETPOLINE_SAFE > > +"jmp *bp_handler_continue > > \n" > > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */ > > +ANNOTATE_RETPOLINE_SAFE > > "jmp *bp_handler_dest \n" > > +#endif > > ".popsection \n"); > > > > void arch_static_call_transform(void *site, void *tramp, void *func) > > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, > > void *func) > > unsigned char insn_opcode; > > unsigned char opcodes[CALL_INSN_SIZE]; > > > > - insn = (unsigned long)tramp; > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > > + insn = (unsigned long)site; > > + else > > + insn = (unsigned long)tramp; > > > > mutex_lock(&text_mutex); > > > > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, > > void *func) > > opcodes[0] = insn_opcode; > > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1); > > > > - /* Set up the variable for the breakpoint handler: */ > > + /* Set up the variables for the breakpoint handler: */ > > bp_handler_dest = func; > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) > > + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE); > > > > /* Patch the call site: */ > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE, > > OK, 
so this is where that static_call_bp_handler comes from; you need > that CALL to frob the stack. > > But I still think it is broken; consider: > > CPU0CPU1 > > bp_handler = ponies; > > text_poke_bp(, &static_call_bp_handler) > text_poke(&int3); > on_each_cpu(sync) > > ... > > > text_poke(/* all but first bytes */) > on_each_cpu(sync) > > ... > > > > pt_regs->ip = &static_call_bp_handler > > > // VCPU takes a nap... > text_poke(/* first byte */) > on_each_cpu(sync) > > ... > > > // VCPU sleeps more > bp_handler = unicorn; > > CALL unicorn > > *whoops* > > Now, granted, that is all rather 'unlikely', but that never stopped > Murphy. Good find, thanks Peter. As we discussed on IRC, we'll need to fix this from within the int3 exception handler by faking the call: putting a fake return address on the stack (pointing to right after the call) and setting regs->ip to the called function. And for the out-of-line case we can just jump straight to the function, so the function itself will be the text_poke_bp() "handler". So the static_call_bp_handler() trampoline will go away. -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 05:39:23PM +0100, Peter Zijlstra wrote: > On Mon, Nov 26, 2018 at 05:11:05PM +0100, Ard Biesheuvel wrote: > > On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra wrote: > > > > > > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote: > > > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE > > > > +void arch_static_call_defuse_tramp(void *site, void *tramp) > > > > +{ > > > > + unsigned short opcode = INSN_UD2; > > > > + > > > > + mutex_lock(&text_mutex); > > > > + text_poke((void *)tramp, &opcode, 2); > > > > + mutex_unlock(&text_mutex); > > > > +} > > > > +#endif > > > > > > I would rather think that makes the trampoline _more_ dangerous, rather > > > than less so. > > > > > > My dictionary sayeth: > > > > > > defuse: verb > > > > > > - remove the fuse from (an explosive device) in order to prevent it > > > from exploding. > > > > > > - make (a situation) less tense or dangerous > > > > > > patching in an UD2 seems to do the exact opposite. > > > > That is my fault. > > > > The original name was 'poison' iirc, but on arm64, we need to retain > > the trampoline for cases where the direct branch is out of range, and > > so poisoning is semantically inaccurate. > > > > But since you opened your dictionary anyway, any better suggestions? :-) > > I was leaning towards: "prime", but I'm not entirely sure that works > with your case. Maybe we should just go back to "poison", along with a comment that it will not necessarily be poisoned for all arches. I think "poison" at least describes the intent, if not always the implementation. -- Josh
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 05:11:05PM +0100, Ard Biesheuvel wrote: > On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra wrote: > > > > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote: > > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE > > > +void arch_static_call_defuse_tramp(void *site, void *tramp) > > > +{ > > > + unsigned short opcode = INSN_UD2; > > > + > > > + mutex_lock(&text_mutex); > > > + text_poke((void *)tramp, &opcode, 2); > > > + mutex_unlock(&text_mutex); > > > +} > > > +#endif > > > > I would rather think that makes the trampoline _more_ dangerous, rather > > than less so. > > > > My dictionary sayeth: > > > > defuse: verb > > > > - remove the fuse from (an explosive device) in order to prevent it > > from exploding. > > > > - make (a situation) less tense or dangerous > > > > patching in an UD2 seems to do the exact opposite. > > That is my fault. > > The original name was 'poison' iirc, but on arm64, we need to retain > the trampoline for cases where the direct branch is out of range, and > so poisoning is semantically inaccurate. > > But since you opened your dictionary anyway, any better suggestions? :-) I was leaning towards: "prime", but I'm not entirely sure that works with your case.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 8:11 AM Ard Biesheuvel wrote:
>
> On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra wrote:
> >
> > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > +void arch_static_call_defuse_tramp(void *site, void *tramp)
> > > +{
> > > +	unsigned short opcode = INSN_UD2;
> > > +
> > > +	mutex_lock(&text_mutex);
> > > +	text_poke((void *)tramp, &opcode, 2);
> > > +	mutex_unlock(&text_mutex);
> > > +}
> > > +#endif
> >
> > I would rather think that makes the trampoline _more_ dangerous, rather
> > than less so.
> >
> > My dictionary sayeth:
> >
> >   defuse: verb
> >
> >   - remove the fuse from (an explosive device) in order to prevent it
> >     from exploding.
> >
> >   - make (a situation) less tense or dangerous
> >
> > patching in an UD2 seems to do the exact opposite.
>
> That is my fault.
>
> The original name was 'poison' iirc, but on arm64, we need to retain
> the trampoline for cases where the direct branch is out of range, and
> so poisoning is semantically inaccurate.
>
> But since you opened your dictionary anyway, any better suggestions? :-)

Release?  Finish?
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra wrote:
>
> On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > +void arch_static_call_defuse_tramp(void *site, void *tramp)
> > +{
> > +	unsigned short opcode = INSN_UD2;
> > +
> > +	mutex_lock(&text_mutex);
> > +	text_poke((void *)tramp, &opcode, 2);
> > +	mutex_unlock(&text_mutex);
> > +}
> > +#endif
>
> I would rather think that makes the trampoline _more_ dangerous, rather
> than less so.
>
> My dictionary sayeth:
>
>   defuse: verb
>
>   - remove the fuse from (an explosive device) in order to prevent it
>     from exploding.
>
>   - make (a situation) less tense or dangerous
>
> patching in an UD2 seems to do the exact opposite.

That is my fault.

The original name was 'poison' iirc, but on arm64, we need to retain
the trampoline for cases where the direct branch is out of range, and
so poisoning is semantically inaccurate.

But since you opened your dictionary anyway, any better suggestions? :-)
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> +void arch_static_call_defuse_tramp(void *site, void *tramp)
> +{
> +	unsigned short opcode = INSN_UD2;
> +
> +	mutex_lock(&text_mutex);
> +	text_poke((void *)tramp, &opcode, 2);
> +	mutex_unlock(&text_mutex);
> +}
> +#endif

I would rather think that makes the trampoline _more_ dangerous, rather
than less so.

My dictionary sayeth:

  defuse: verb

  - remove the fuse from (an explosive device) in order to prevent it
    from exploding.

  - make (a situation) less tense or dangerous

patching in an UD2 seems to do the exact opposite.
Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> index 8026d176f25c..d3869295b88d 100644
> --- a/arch/x86/kernel/static_call.c
> +++ b/arch/x86/kernel/static_call.c
> @@ -9,13 +9,21 @@
>
>  void static_call_bp_handler(void);
>  void *bp_handler_dest;
> +void *bp_handler_continue;
>
>  asm(".pushsection .text, \"ax\"			\n"
>      ".globl static_call_bp_handler		\n"
>      ".type static_call_bp_handler, @function	\n"
>      "static_call_bp_handler:			\n"
> -"ANNOTATE_RETPOLINE_SAFE			\n"
> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> +ANNOTATE_RETPOLINE_SAFE
> +"call *bp_handler_dest			\n"
> +ANNOTATE_RETPOLINE_SAFE
> +"jmp *bp_handler_continue			\n"
> +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> +ANNOTATE_RETPOLINE_SAFE
>  "jmp *bp_handler_dest\n"
> +#endif
>  ".popsection				\n");
>
>  void arch_static_call_transform(void *site, void *tramp, void *func)
> @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
>  	unsigned char insn_opcode;
>  	unsigned char opcodes[CALL_INSN_SIZE];
>
> -	insn = (unsigned long)tramp;
> +	if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> +		insn = (unsigned long)site;
> +	else
> +		insn = (unsigned long)tramp;
>
>  	mutex_lock(&text_mutex);
>
> @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
>  	opcodes[0] = insn_opcode;
>  	memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
>
> -	/* Set up the variable for the breakpoint handler: */
> +	/* Set up the variables for the breakpoint handler: */
>  	bp_handler_dest = func;
> +	if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> +		bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
>
>  	/* Patch the call site: */
>  	text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,

OK, so this is where that static_call_bp_handler comes from; you need
that CALL to frob the stack.
But I still think it is broken; consider:

	CPU0					CPU1

	bp_handler = ponies;

	text_poke_bp(, &static_call_bp_handler)
	  text_poke(&int3);
	  on_each_cpu(sync)
						...

	  text_poke(/* all but first bytes */)
	  on_each_cpu(sync)
						...

						pt_regs->ip = &static_call_bp_handler
						// VCPU takes a nap...

	  text_poke(/* first byte */)
	  on_each_cpu(sync)
						...

						// VCPU sleeps more
	bp_handler = unicorn;

						CALL unicorn

*whoops*

Now, granted, that is all rather 'unlikely', but that never stopped
Murphy.
[PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64
Add the inline static call implementation for x86-64.  For each key, a
temporary trampoline is created, named __static_call_tramp_<key>.  The
trampoline has an indirect jump to the destination function.

Objtool uses the trampoline naming convention to detect all the call
sites.  It then annotates those call sites in the .static_call_sites
section.

During boot (and module init), the call sites are patched to call
directly into the destination function.  The temporary trampoline is
then no longer used.

Signed-off-by: Josh Poimboeuf
---
 arch/x86/Kconfig                              |   5 +-
 arch/x86/include/asm/static_call.h            |  28 +++-
 arch/x86/kernel/asm-offsets.c                 |   6 +
 arch/x86/kernel/static_call.c                 |  30 +++-
 include/linux/static_call.h                   |   2 +-
 tools/objtool/Makefile                        |   3 +-
 tools/objtool/check.c                         | 126 +++++++++++++++++-
 tools/objtool/check.h                         |   2 +
 tools/objtool/elf.h                           |   1 +
 .../objtool/include/linux/static_call_types.h |  19 +++
 tools/objtool/sync-check.sh                   |   1 +
 11 files changed, 213 insertions(+), 10 deletions(-)
 create mode 100644 tools/objtool/include/linux/static_call_types.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a2a10e0ce248..e099ea87ea70 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -189,7 +189,8 @@ config X86
 	select HAVE_FUNCTION_ARG_ACCESS_API
 	select HAVE_STACKPROTECTOR		if CC_HAS_SANE_STACKPROTECTOR
 	select HAVE_STACK_VALIDATION		if X86_64
-	select HAVE_STATIC_CALL_OUTLINE
+	select HAVE_STATIC_CALL_INLINE		if HAVE_STACK_VALIDATION
+	select HAVE_STATIC_CALL_OUTLINE		if !HAVE_STACK_VALIDATION
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
@@ -203,6 +204,7 @@ config X86
 	select RTC_MC146818_LIB
 	select SPARSE_IRQ
 	select SRCU
+	select STACK_VALIDATION if HAVE_STACK_VALIDATION && (HAVE_STATIC_CALL_INLINE || RETPOLINE)
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
 	select USER_STACKTRACE_SUPPORT
@@ -438,7 +440,6 @@ config GOLDFISH
 config RETPOLINE
 	bool "Avoid speculative indirect branches in kernel"
 	default y
-	select STACK_VALIDATION if HAVE_STACK_VALIDATION
 	help
 	  Compile kernel with the retpoline compiler options to guard against
 	  kernel-to-user data leaks by avoiding speculative indirect
diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
index 6e9ad5969ec2..27bd7da16150 100644
--- a/arch/x86/include/asm/static_call.h
+++ b/arch/x86/include/asm/static_call.h
@@ -2,6 +2,20 @@
 #ifndef _ASM_STATIC_CALL_H
 #define _ASM_STATIC_CALL_H
 
+#include
+
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+
+/*
+ * This trampoline is only used during boot / module init, so it's safe to use
+ * the indirect branch without a retpoline.
+ */
+#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
+	ANNOTATE_RETPOLINE_SAFE						\
+	"jmpq *" __stringify(key) "+" __stringify(SC_KEY_func) "(%rip) \n"
+
+#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
+
 /*
  * Manually construct a 5-byte direct JMP to prevent the assembler from
  * optimizing it into a 2-byte JMP.
@@ -12,9 +26,19 @@
 	".long " #func " - " __ARCH_STATIC_CALL_JMP_LABEL(key) "\n"	\
 	__ARCH_STATIC_CALL_JMP_LABEL(key) ":"
 
+#endif /* !CONFIG_HAVE_STATIC_CALL_INLINE */
+
 /*
- * This is a permanent trampoline which does a direct jump to the function.
- * The direct jump get patched by static_call_update().
+ * For CONFIG_HAVE_STATIC_CALL_INLINE, this is a temporary trampoline which
+ * uses the current value of the key->func pointer to do an indirect jump to
+ * the function.  This trampoline is only used during boot, before the call
+ * sites get patched by static_call_update().  The name of this trampoline has
+ * a magical aspect: objtool uses it to find static call sites so it can create
+ * the .static_call_sites section.
+ *
+ * For CONFIG_HAVE_STATIC_CALL_OUTLINE, this is a permanent trampoline which
+ * does a direct jump to the function.  The direct jump gets patched by
+ * static_call_update().
  */
 #define ARCH_DEFINE_STATIC_CALL_TRAMP(key, func)			\
 	asm(".pushsection .text, \"ax\"				\n"	\
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 72adf6c335dc..da8fd220e4f2 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -104,4 +105,9 @@ void common(void) {
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);