Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-12-12 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote:
> On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf  wrote:
> >
> > On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> > >
> > > I propose a different solution:
> > >
> > > As in this patch set, we have a direct and an indirect version.  The
> > > indirect version remains exactly the same as in this patch set.  The
> > > direct version just only does the patching when all seems well: the
> > > call instruction needs to be 0xe8, and we only do it when the thing
> > > doesn't cross a cache line.  Does that work?  In the rare case where
> > > the compiler generates something other than 0xe8 or crosses a cache
> > > line, then the thing just remains as a call to the out of line jmp
> > > trampoline.  Does that seem reasonable?  It's a very minor change to
> > > the patch set.
> >
> > Maybe that would be ok.  If my math is right, we would use the
> > out-of-line version almost 5% of the time due to cache misalignment of
> > the address.
> 
> Note that I don't think cache-line alignment is necessarily sufficient.
> 
> The I$ fetch from the cacheline can happen in smaller chunks, because
> the bus between the I$ and the instruction decode isn't a full
> cacheline (well, it is _now_ in modern big cores, but it hasn't always
> been).
> 
> So even if the cacheline is updated atomically, I could imagine seeing
> a partial fetch from the I$ (old values) and then a second partial
> fetch (new values).
> 
> It would be interesting to know what the exact fetch rules are.

So I fixed my test case to do 32-bit writes, and now the results make a
lot more sense: I only get crashes when writing across cache lines.  So
maybe we should just go with Andy's suggestion above.

It would be great if some CPU people could confirm that it's safe (for
x86-64 only), since it's not in the SDM.  Who can help answer that?
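
Roughly, the test boils down to this (a userspace reconstruction, not
the actual code -- the real experiment patched a live kernel call site
with text_poke()):

#include <pthread.h>
#include <stdint.h>
#include <sys/mman.h>

static uint8_t *page, *site;

static void *writer(void *arg)
{
	int32_t r1 = (int32_t)(page +  0 - (site + 5));	/* rel32 to stub 1 */
	int32_t r2 = (int32_t)(page + 16 - (site + 5));	/* rel32 to stub 2 */

	for (;;) {
		/* single aligned 32-bit stores, as in the fixed test case */
		__atomic_store_n((int32_t *)(site + 1), r1, __ATOMIC_RELAXED);
		__atomic_store_n((int32_t *)(site + 1), r2, __ATOMIC_RELAXED);
	}
	return arg;
}

int main(void)
{
	pthread_t t;

	page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	page[0] = page[16] = 0xc3;	/* two 1-byte "ret" stubs as call targets */

	/*
	 * +3 keeps the rel32 4-byte aligned and inside one cache line;
	 * put the site at page + 125 instead and the rel32 straddles a
	 * 64-byte boundary, which is where the crashes show up.
	 */
	site = page + 64 + 3;
	site[0] = 0xe8;			/* call rel32 */
	*(int32_t *)(site + 1) = (int32_t)(page - (site + 5));
	site[5] = 0xc3;			/* return to the loop below */

	pthread_create(&t, NULL, writer, NULL);
	for (;;)
		((void (*)(void))site)();	/* reader executes the patched call */
}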

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-12-11 Thread Josh Poimboeuf
On Tue, Dec 11, 2018 at 09:41:37AM +, David Laight wrote:
> From: Josh Poimboeuf
> > Sent: 30 November 2018 16:27
> > 
> > On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote:
> > > On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf  
> > > wrote:
> ...
> > > > Maybe that would be ok.  If my math is right, we would use the
> > > > out-of-line version almost 5% of the time due to cache misalignment of
> > > > the address.
> > >
> > > Note that I don't think cache-line alignment is necessarily sufficient.
> > >
> > > The I$ fetch from the cacheline can happen in smaller chunks, because
> > > the bus between the I$ and the instruction decode isn't a full
> > > cacheline (well, it is _now_ in modern big cores, but it hasn't always
> > > been).
> > >
> > > So even if the cacheline is updated atomically, I could imagine seeing
> > > a partial fetch from the I$ (old values) and then a second partial
> > > fetch (new values).
> > >
> > > It would be interesting to know what the exact fetch rules are.
> > 
> > I've been doing some cross-modifying code experiments on Nehalem, with
> > one CPU writing call destinations while the other CPUs are executing
> > them.  Reliably, one of the readers goes off into the weeds within a few
> > seconds.
> > 
> > The writing was done with just text_poke(), no #BP.
> > 
> > I wasn't able to figure out the pattern in the addresses of the
> > corrupted call sites.  It wasn't cache-line alignment.
> > 
> > That was on Nehalem.  Skylake didn't crash at all.
> 
> Interesting thought:
> 
> If it is possible to add a prefix that can be overwritten by an int3,
> is it also possible to add something that the assembler will use
> to align the instruction so that a write to the 4-byte offset
> will be atomic?
> 
> I'd guess that avoiding crossing an 8-byte boundary would be sufficient.
> So you'd need a 1-, 2- or 3-byte nop depending on the actual
> alignment - although a 3-byte one would always do.

The problem is that the call is done in C code, and we don't have a
feasible way to use inline asm to call functions with more than five
arguments.
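
For the record, the register ceiling looks like this (a sketch in the
style of the paravirt wrappers; asm_call6 is a made-up name): the ABI's
six integer argument registers can be bound with register asm, but
anything that has to go on the stack can't be expressed, because inline
asm can't coordinate pushes with the compiler's frame:

static inline long asm_call6(void *fn, long a, long b, long c,
			     long d, long e, long f)
{
	register long rdi asm("rdi") = a;
	register long rsi asm("rsi") = b;
	register long rdx asm("rdx") = c;
	register long rcx asm("rcx") = d;
	register long r8  asm("r8")  = e;
	register long r9  asm("r9")  = f;
	register long rax asm("rax") = (long)fn;

	asm volatile("call *%%rax"	/* rax doubles as fn and return value */
		     : "+r" (rax), "+r" (rdi), "+r" (rsi), "+r" (rdx),
		       "+r" (rcx), "+r" (r8), "+r" (r9)
		     : : "r10", "r11", "memory", "cc");
	return rax;
}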

BTW, my original experiments (mentioned above) were a bit... flawed.  I
used text_poke(), which does memcpy(), which writes one byte at a time.
No wonder it wasn't atomic.

I'll need to do some more experiments.

-- 
Josh


RE: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-12-11 Thread David Laight
From: Josh Poimboeuf
> Sent: 30 November 2018 16:27
> 
> On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote:
> > On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf  wrote:
...
> > > Maybe that would be ok.  If my math is right, we would use the
> > > out-of-line version almost 5% of the time due to cache misalignment of
> > > the address.
> >
> > Note that I don't think cache-line alignment is necessarily sufficient.
> >
> > The I$ fetch from the cacheline can happen in smaller chunks, because
> > the bus between the I$ and the instruction decode isn't a full
> > cacheline (well, it is _now_ in modern big cores, but it hasn't always
> > been).
> >
> > So even if the cacheline is updated atomically, I could imagine seeing
> > a partial fetch from the I$ (old values) and then a second partial
> > fetch (new values).
> >
> > It would be interesting to know what the exact fetch rules are.
> 
> I've been doing some cross-modifying code experiments on Nehalem, with
> one CPU writing call destinations while the other CPUs are executing
> them.  Reliably, one of the readers goes off into the weeds within a few
> seconds.
> 
> The writing was done with just text_poke(), no #BP.
> 
> I wasn't able to figure out the pattern in the addresses of the
> corrupted call sites.  It wasn't cache-line alignment.
> 
> That was on Nehalem.  Skylake didn't crash at all.

Interesting thought:

If it is possible to add a prefix that can be overwritten by an int3,
is it also possible to add something that the assembler will use
to align the instruction so that a write to the 4-byte offset
will be atomic?

I'd guess that avoiding crossing an 8-byte boundary would be sufficient.
So you'd need a 1-, 2- or 3-byte nop depending on the actual
alignment - although a 3-byte one would always do.
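
For illustration, a crude assembler version of that idea (untested; it
pads with three 1-byte nops instead of computing the minimal single nop):

asm(".macro call_aligned target\n"
    "\t.balign 8\n"		/* force an 8-byte boundary... */
    "\t.skip 3, 0x90\n"		/* ...then 3 nops: the opcode lands at +3 */
    "\tcall \\target\n"		/* so the rel32 sits at +4..+7, aligned */
    ".endm\n");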

David



Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-12-10 Thread Linus Torvalds
On Mon, Dec 10, 2018 at 3:58 PM Pavel Machek  wrote:
>
> On Thu 2018-11-29 11:11:50, Linus Torvalds wrote:
> >
> > It might be better to use an empty REX prefix on x86-64 or something like 
> > that.
>
> It might be easiest to use plain old NOP, no? :-).

No. The whole point would be that the instruction rewriting is atomic wrt fetch.

If it's a "nop" + "second instruction", and the "nop" is overwritten
by "int3", then the second instruction could still be executed after
the "int3" has been written (because the other CPU just finished the
"nop").

So an empty REX prefix is very different from a one-byte nop, exactly
because it's executed atomically with the instruction itself.

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-12-10 Thread Pavel Machek
On Thu 2018-11-29 11:11:50, Linus Torvalds wrote:
> On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
>  wrote:
> >
> > What you can do then is basically add a single-byte prefix to the
> > "call" instruction that does nothing (say, cs override), and then
> > replace *that* with a 'int3' instruction.
> 
> Hmm, the segment prefixes are documented as being "reserved" for
> branch instructions. I *think* that means just conditional branches
> (Intel at one point used the prefixes for static prediction
> information), not "call", but who knows..
> 
> It might be better to use an empty REX prefix on x86-64 or something like 
> that.

It might be easiest to use plain old NOP, no? :-).
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Josh Poimboeuf
On Fri, Nov 30, 2018 at 11:16:34PM +0100, Rasmus Villemoes wrote:
> On 29/11/2018 20.22, Josh Poimboeuf wrote:
> > On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote:
> >>> and honestly, the way "static_call()" works now, can you guarantee
> >>> that the call-site doesn't end up doing that, and calling the
> >>> trampoline function for two different static calls from one indirect
> >>> call?
> >>>
> >>> See what I'm talking about? Saying "callers are wrapped in macros"
> >>> doesn't actually protect you from the compiler doing things like that.
> >>>
> >>> In contrast, if the call was wrapped in an inline asm, we'd *know* the
> >>> compiler couldn't turn a "call wrapper(%rip)" into anything else.
> >>
> >> But then we need to implement all numbers of parameters.
> > 
> > I actually have an old unfinished patch which (ab)used C macros to
> > detect the number of parameters and then setup the asm constraints
> > accordingly.  At the time, the goal was to optimize the BUG code.
> > 
> > I had wanted to avoid this kind of approach for static calls, because
> > "ugh", but now it's starting to look much more appealing.
> > 
> > Behold:
> > 
> > diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
> > index aa6b2023d8f8..d63e9240da77 100644
> > --- a/arch/x86/include/asm/bug.h
> > +++ b/arch/x86/include/asm/bug.h
> > @@ -32,10 +32,59 @@
> >  
> >  #ifdef CONFIG_DEBUG_BUGVERBOSE
> >  
> > -#define _BUG_FLAGS(ins, flags)					\
> > +#define __BUG_ARGS_0(ins, ...) \
> > +({\
> > +   asm volatile("1:\t" ins "\n"); \
> > +})
> > +#define __BUG_ARGS_1(ins, ...) \
> > +({\
> > +   asm volatile("1:\t" ins "\n" \
> > +: : "D" (ARG1(__VA_ARGS__))); \
> > +})
> > +#define __BUG_ARGS_2(ins, ...) \
> > +({\
> > +   asm volatile("1:\t" ins "\n" \
> > +: : "D" (ARG1(__VA_ARGS__)), \
> > +"S" (ARG2(__VA_ARGS__))); \
> > +})
> > +#define __BUG_ARGS_3(ins, ...) \
> > +({\
> > +   asm volatile("1:\t" ins "\n" \
> > +: : "D" (ARG1(__VA_ARGS__)), \
> > +"S" (ARG2(__VA_ARGS__)), \
> > +"d" (ARG3(__VA_ARGS__))); \
> > +})
> 
> wouldn't you need to tie all these to (unused) outputs as well as adding
> the remaining caller-saved registers to the clobber list? Maybe not for
> the WARN machinery(?), but at least for stuff that should look like a
> normal call to gcc? Then there's %rax which is either a clobber or an
> output, and if there's not to be a separate static_call_void(), one
> would need to do some __builtin_choose_expr(__same_type(void, f(...)), ...).

Yes, this is a crappy unfinished patch.  It should be ignored, and
perhaps even mercilessly mocked :-)

paravirt_types.h already does something similar today, and it's at least
more correct than this.

What I was trying to show was that you can use macros to count
arguments, like this:

  _BUG_ARGS(ins, NUM_ARGS(__VA_ARGS__), __VA_ARGS__);

which can make a macro look and act like a function call.  Though as
Steven pointed out, the concept falls apart after 6 arguments.
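
For completeness, the NUM_ARGS()/ARGn() helpers the patch leans on
aren't in the diff; they'd look roughly like this (my reconstruction):

/* count 0-9 arguments by shifting them past a reversed number list */
#define NUM_ARGS(...)	_NUM_ARGS(0, ##__VA_ARGS__, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
#define _NUM_ARGS(_0, _1, _2, _3, _4, _5, _6, _7, _8, _9, N, ...) N

/* pick out the Nth argument */
#define ARG1(_1, ...) _1
#define ARG2(_1, _2, ...) _2
#define ARG3(_1, _2, _3, ...) _3
#define ARG4(_1, _2, _3, _4, ...) _4
#define ARG5(_1, _2, _3, _4, _5, ...) _5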

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Rasmus Villemoes
On 29/11/2018 20.22, Josh Poimboeuf wrote:
> On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote:
>>> and honestly, the way "static_call()" works now, can you guarantee
>>> that the call-site doesn't end up doing that, and calling the
>>> trampoline function for two different static calls from one indirect
>>> call?
>>>
>>> See what I'm talking about? Saying "callers are wrapped in macros"
>>> doesn't actually protect you from the compiler doing things like that.
>>>
>>> In contrast, if the call was wrapped in an inline asm, we'd *know* the
>>> compiler couldn't turn a "call wrapper(%rip)" into anything else.
>>
>> But then we need to implement all numbers of parameters.
> 
> I actually have an old unfinished patch which (ab)used C macros to
> detect the number of parameters and then setup the asm constraints
> accordingly.  At the time, the goal was to optimize the BUG code.
> 
> I had wanted to avoid this kind of approach for static calls, because
> "ugh", but now it's starting to look much more appealing.
> 
> Behold:
> 
> diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
> index aa6b2023d8f8..d63e9240da77 100644
> --- a/arch/x86/include/asm/bug.h
> +++ b/arch/x86/include/asm/bug.h
> @@ -32,10 +32,59 @@
>  
>  #ifdef CONFIG_DEBUG_BUGVERBOSE
>  
> -#define _BUG_FLAGS(ins, flags)					\
> +#define __BUG_ARGS_0(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n"); \
> +})
> +#define __BUG_ARGS_1(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> +  : : "D" (ARG1(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_2(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> +  : : "D" (ARG1(__VA_ARGS__)), \
> +  "S" (ARG2(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_3(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> +  : : "D" (ARG1(__VA_ARGS__)), \
> +  "S" (ARG2(__VA_ARGS__)), \
> +  "d" (ARG3(__VA_ARGS__))); \
> +})

wouldn't you need to tie all these to (unused) outputs as well as adding
the remaining caller-saved registers to the clobber list? Maybe not for
the WARN machinery(?), but at least for stuff that should look like a
normal call to gcc? Then there's %rax which is either a clobber or an
output, and if there's not to be a separate static_call_void(), one
would need to do some __builtin_choose_expr(__same_type(void, f(...)), ...).
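
To illustrate that last point, a userspace sketch of the
__builtin_choose_expr() dispatch (static_call_demo and the two
functions are made-up names, not anything from the patch set):

#include <stdio.h>

#define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))

/* evaluate to fn(...) itself, or to (fn(...), 0) when fn returns void,
 * so a single macro covers both cases without a separate _void variant;
 * the fn(...) inside typeof() is never evaluated */
#define static_call_demo(fn, ...)					\
	__builtin_choose_expr(__same_type(void, fn(__VA_ARGS__)),	\
			      (fn(__VA_ARGS__), 0),			\
			      fn(__VA_ARGS__))

static int  ret_fn(int x)  { return x + 1; }
static void void_fn(int x) { printf("%d\n", x); }

int main(void)
{
	int r = static_call_demo(ret_fn, 41);	/* value-returning arm */
	static_call_demo(void_fn, r);		/* void arm, prints 42 */
	return 0;
}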

Rasmus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Jiri Kosina
On Fri, 30 Nov 2018, Andy Lutomirski wrote:

> According to the SDM, you can program the APIC ICR to request an SMI. 
> It's not remotely clear to me what will happen if we do this.  

I think one of the known reliable ways to trigger SMI is to write 0x0 to 
the SMI command I/O port (0xb2).

> For all I know, the SMI handler will explode and the computer will catch 
> fire.

Ha, so no one can claim any more that SMIs are always harmful :)

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Josh Poimboeuf
On Fri, Nov 30, 2018 at 12:18:33PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 30, 2018 at 11:51 AM Linus Torvalds
>  wrote:
> >
> > On Fri, Nov 30, 2018 at 10:39 AM Josh Poimboeuf  wrote:
> > >
> > > AFAICT, all the other proposed options seem to have major issues.
> >
> > I still absolutely detest this patch, and in fact it got worse with
> > the addition of the config variable test.
> >
> > Honestly, the entry code being legible and simple is more important
> > than the extra cycle from branching to a trampoline for static calls.
> >
> > Just don't do the inline case if it causes this much confusion.

I *really* don't want to have to drop the inline feature.  The speedup
is measurable and not insignificant.  And out-of-line would be a
regression if we ported paravirt to use static calls.

> With my entry maintainer hat on, I don't mind it so much, although the
> implementation needs some work.  The #ifdef should just go away, and
> there should be another sanity check in the sanity check section.

Your suggested changes sound good to me.  I'll be gone next week, so
here's hoping you'll have this all figured out when I get back!

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Steven Rostedt
On Fri, 30 Nov 2018 12:59:36 -0800
Andy Lutomirski  wrote:

> For all I know, the SMI handler will explode and the computer will catch fire.

That sounds like an AWESOME feature!!!

-- Steve



Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Andy Lutomirski
On Fri, Nov 30, 2018 at 12:28 PM Steven Rostedt  wrote:
>
> On Fri, 30 Nov 2018 12:18:33 -0800
> Andy Lutomirski  wrote:
>
> > Or we could replace that IPI with x86's bona fide serialize-all-cpus
> > primitive and then we can just retry instead of emulating.  It's a
> > piece of cake -- we just trigger an SMI :)  /me runs away.
>
> I must have fallen on my head one too many times, because I really like
> the idea of synchronizing all the CPUs with an SMI! (If that's even
> possible.)  The IPIs that are sent are only there to force smp_mb() on all
> CPUs, which should be something an SMI could do.
>
> /me runs after Andy

According to the SDM, you can program the APIC ICR to request an SMI.
It's not remotely clear to me what will happen if we do this.  For all
I know, the SMI handler will explode and the computer will catch fire.
PeterZ?


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Steven Rostedt
On Fri, 30 Nov 2018 12:18:33 -0800
Andy Lutomirski  wrote:

> Or we could replace that IPI with x86's bona fide serialize-all-cpus
> primitive and then we can just retry instead of emulating.  It's a
> piece of cake -- we just trigger an SMI :)  /me runs away.

I must have fallen on my head one too many times, because I really like
the idea of synchronizing all the CPUs with an SMI! (If that's even
possible.)  The IPIs that are sent are only there to force smp_mb() on all
CPUs, which should be something an SMI could do.

/me runs after Andy

-- Steve


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Andy Lutomirski
On Fri, Nov 30, 2018 at 11:51 AM Linus Torvalds
 wrote:
>
> On Fri, Nov 30, 2018 at 10:39 AM Josh Poimboeuf  wrote:
> >
> > AFAICT, all the other proposed options seem to have major issues.
>
> I still absolutely detest this patch, and in fact it got worse with
> the addition of the config variable test.
>
> Honestly, the entry code being legible and simple is more important
> than the extra cycle from branching to a trampoline for static calls.
>
> Just don't do the inline case if it causes this much confusion.

With my entry maintainer hat on, I don't mind it so much, although the
implementation needs some work.  The #ifdef should just go away, and
there should be another sanity check in the sanity check section.

Or we could replace that IPI with x86's bona fide serialize-all-cpus
primitive and then we can just retry instead of emulating.  It's a
piece of cake -- we just trigger an SMI :)  /me runs away.

--Andy


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Linus Torvalds
On Fri, Nov 30, 2018 at 10:39 AM Josh Poimboeuf  wrote:
>
> AFAICT, all the other proposed options seem to have major issues.

I still absolutely detest this patch, and in fact it got worse with the
addition of the config variable test.

Honestly, the entry code being legible and simple is more important
than the extra cycle from branching to a trampoline for static calls.

Just don't do the inline case if it causes this much confusion.

 Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Josh Poimboeuf
On Fri, Nov 30, 2018 at 08:42:26AM -0800, Andy Lutomirski wrote:
> On Thu, Nov 29, 2018 at 12:24 PM Josh Poimboeuf  wrote:
> >
> > > Alternatively, we could actually emulate call instructions like this:
> > >
> > > void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...)
> > > {
> > >   struct pt_regs ptregs_copy = *regs;
> > >   barrier();
> > >   *(unsigned long *)(regs->sp - 8) = whatever;  /* may clobber old regs, but so what? */
> > >   asm volatile ("jmp return_to_alternate_ptregs");
> > > }
> > >
> > > where return_to_alternate_ptregs points rsp to the ptregs and goes
> > > through the normal return path.  It's ugly, but we could have a test
> > > case for it, and it should work fine.
> >
> > Is that really any better than my patch to create a gap in the stack
> > (modified for kernel space #BP only)?
> >
> 
> I tend to prefer a nice local hack like mine over a hack that further
> complicates the entry code in general.  This is not to say I'm thrilled by
> my idea either.

They're both mucking with the location of the pt_regs.  The above code
just takes that fact and hides it in the corner and hopes that there are
no bugs lurking there.

Even with the CPL check, the "gap" code is simple and self-contained
(see below).  The kernel pt_regs can already be anywhere on the stack so
there should be no harm in moving them.

AFAICT, all the other proposed options seem to have major issues.

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ce25d84023c0..f487f7daed6c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR		irq_work_interrupt	smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
 ENTRY(\sym)
UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -896,6 +896,18 @@ ENTRY(\sym)
jnz .Lfrom_usermode_switch_stack_\@
.endif
 
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+   .if \create_gap == 1
+   testb   $3, CS-ORIG_RAX(%rsp)
+   jnz .Lfrom_usermode_no_gap_\@
+   .rept 6
+   pushq 5*8(%rsp)
+   .endr
+   UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+   .endif
+#endif
+
.if \paranoid
 	call	paranoid_entry
.else
@@ -1126,7 +1138,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
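
With the gap reserved, the #BP handler has room to emulate the call
that got patched out.  A rough sketch of what that handler could do (my
reconstruction, not part of the patch; it assumes regs->ip points just
past the int3 byte of a 5-byte call):

/* emulate "call func" from the int3 handler using the 6-word gap */
static void static_call_bp_handler(struct pt_regs *regs, unsigned long func)
{
	unsigned long ret_addr = regs->ip + 4;	/* skip the remaining rel32 */

	/* the gap guarantees this push cannot clobber pt_regs itself */
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = ret_addr;
	regs->ip = func;			/* resume at the call target */
}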


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Andy Lutomirski
On Thu, Nov 29, 2018 at 12:24 PM Josh Poimboeuf  wrote:
>
> > Alternatively, we could actually emulate call instructions like this:
> >
> > void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...)
> > {
> >   struct pt_regs ptregs_copy = *regs;
> >   barrier();
> >   *(unsigned long *)(regs->sp - 8) = whatever;  /* may clobber old regs, but so what? */
> >   asm volatile ("jmp return_to_alternate_ptregs");
> > }
> >
> > where return_to_alternate_ptregs points rsp to the ptregs and goes
> > through the normal return path.  It's ugly, but we could have a test
> > case for it, and it should work fine.
>
> Is that really any better than my patch to create a gap in the stack
> (modified for kernel space #BP only)?
>

I tend to prefer a nice local hack like mine over a hack that further
complicates the entry code in general.  This is not to say I'm thrilled by
my idea either.


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote:
> On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf  wrote:
> >
> > On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> > >
> > > I propose a different solution:
> > >
> > > As in this patch set, we have a direct and an indirect version.  The
> > > indirect version remains exactly the same as in this patch set.  The
> > > direct version just only does the patching when all seems well: the
> > > call instruction needs to be 0xe8, and we only do it when the thing
> > > doesn't cross a cache line.  Does that work?  In the rare case where
> > > the compiler generates something other than 0xe8 or crosses a cache
> > > line, then the thing just remains as a call to the out of line jmp
> > > trampoline.  Does that seem reasonable?  It's a very minor change to
> > > the patch set.
> >
> > Maybe that would be ok.  If my math is right, we would use the
> > out-of-line version almost 5% of the time due to cache misalignment of
> > the address.
> 
> Note that I don't think cache-line alignment is necessarily sufficient.
> 
> The I$ fetch from the cacheline can happen in smaller chunks, because
> the bus between the I$ and the instruction decode isn't a full
> cacheline (well, it is _now_ in modern big cores, but it hasn't always
> been).
> 
> So even if the cacheline is updated atomically, I could imagine seeing
> a partial fetch from the I$ (old values) and then a second partial
> fetch (new values).
> 
> It would be interesting to know what the exact fetch rules are.

I've been doing some cross-modifying code experiments on Nehalem, with
one CPU writing call destinations while the other CPUs are executing
them.  Reliably, one of the readers goes off into the weeds within a few
seconds.

The writing was done with just text_poke(), no #BP.

I wasn't able to figure out the pattern in the addresses of the
corrupted call sites.  It wasn't cache-line alignment.

That was on Nehalem.  Skylake didn't crash at all.

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf  wrote:
>
> On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> >
> > I propose a different solution:
> >
> > As in this patch set, we have a direct and an indirect version.  The
> > indirect version remains exactly the same as in this patch set.  The
> > direct version just only does the patching when all seems well: the
> > call instruction needs to be 0xe8, and we only do it when the thing
> > doesn't cross a cache line.  Does that work?  In the rare case where
> > the compiler generates something other than 0xe8 or crosses a cache
> > line, then the thing just remains as a call to the out of line jmp
> > trampoline.  Does that seem reasonable?  It's a very minor change to
> > the patch set.
>
> Maybe that would be ok.  If my math is right, we would use the
> out-of-line version almost 5% of the time due to cache misalignment of
> the address.

Note that I don't think cache-line alignment is necessarily sufficient.

The I$ fetch from the cacheline can happen in smaller chunks, because
the bus between the I$ and the instruction decode isn't a full
cacheline (well, it is _now_ in modern big cores, but it hasn't always
been).

So even if the cacheline is updated atomically, I could imagine seeing
a partial fetch from the I$ (old values) and then a second partial
fetch (new values).

It would be interesting to know what the exact fetch rules are.

 Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Peter Zijlstra
On Thu, Nov 29, 2018 at 04:14:46PM -0600, Josh Poimboeuf wrote:
> On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote:
> > > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
> > 
> > > > (like pointing IP at a stub that retpolines to the target by reading
> > > > the function pointer, a la the unoptimizable version), then okay, I
> > > > guess, with only a small amount of grumbling.
> > > 
> > > I tried that in v2, but Peter pointed out it's racy:
> > > 
> > >   
> > > https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net
> > 
> > Ah, but that is because it is a global shared trampoline.
> > 
> > Each static_call has its own trampoline, which currently reads
> > something like:
> > 
> > RETPOLINE_SAFE
> > JMP *key
> > 
> > which you then 'defuse' by writing a UD2 on. _However_, if you write
> > that trampoline like:
> > 
> > 1:  RETPOLINE_SAFE
> > JMP *key
> > 2:  CALL_NOSPEC *key
> > RET
> > 
> > and have the text_poke_bp() handler jump to 2 (a location you'll never
> > reach when you enter at 1), it will in fact work I think. The trampoline
> > is never modified and is not shared between different static_calls.
> 
> But after returning from the function to the trampoline, how does it
> return from the trampoline to the call site?  At that point there is no
> return address on the stack.

Oh, right, so that RET doesn't work. ARGH. Time to go sleep, I suppose.


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 02:25:33PM -0800, Andy Lutomirski wrote:
> On Thu, Nov 29, 2018 at 2:22 PM Peter Zijlstra  wrote:
> >
> > On Thu, Nov 29, 2018 at 04:14:46PM -0600, Josh Poimboeuf wrote:
> > > On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote:
> > > > On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote:
> > > > > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
> > > >
> > > > > > (like pointing IP at a stub that retpolines to the target by reading
> > > > > > the function pointer, a la the unoptimizable version), then okay, I
> > > > > > guess, with only a small amount of grumbling.
> > > > >
> > > > > I tried that in v2, but Peter pointed out it's racy:
> > > > >
> > > > >   
> > > > > https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net
> > > >
> > > > Ah, but that is because it is a global shared trampoline.
> > > >
> > > > Each static_call has its own trampoline, which currently reads
> > > > something like:
> > > >
> > > > RETPOLINE_SAFE
> > > > JMP *key
> > > >
> > > > which you then 'defuse' by writing a UD2 on. _However_, if you write
> > > > that trampoline like:
> > > >
> > > > 1:  RETPOLINE_SAFE
> > > > JMP *key
> > > > 2:  CALL_NOSPEC *key
> > > > RET
> > > >
> > > > and have the text_poke_bp() handler jump to 2 (a location you'll never
> > > > reach when you enter at 1), it will in fact work I think. The trampoline
> > > > is never modified and is not shared between different static_calls.
> > >
> > > But after returning from the function to the trampoline, how does it
> > > return from the trampoline to the call site?  At that point there is no
> > > return address on the stack.
> >
> > Oh, right, so that RET doesn't work. ARGH. Time to go sleep, I suppose.
> 
> I assume I'm missing something, but can't it just be JMP_NOSPEC *key?
> The code would call the trampoline just like any other function and,
> if the alignment is bad, we can skip patching it.  And, if we want the
> performance back, maybe some day we can find a clean way to patch
> those misaligned callers, too.

Yeah, this is currently the leading contender, though I believe it will
use a direct jump like the current out-of-line implementation.

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski
On Thu, Nov 29, 2018 at 2:22 PM Peter Zijlstra  wrote:
>
> On Thu, Nov 29, 2018 at 04:14:46PM -0600, Josh Poimboeuf wrote:
> > On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote:
> > > > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
> > >
> > > > > (like pointing IP at a stub that retpolines to the target by reading
> > > > > the function pointer, a la the unoptimizable version), then okay, I
> > > > > guess, with only a small amount of grumbling.
> > > >
> > > > I tried that in v2, but Peter pointed out it's racy:
> > > >
> > > >   
> > > > https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net
> > >
> > > Ah, but that is because it is a global shared trampoline.
> > >
> > > Each static_call has its own trampoline, which currently reads
> > > something like:
> > >
> > > RETPOLINE_SAFE
> > > JMP *key
> > >
> > > which you then 'defuse' by writing a UD2 on. _However_, if you write
> > > that trampoline like:
> > >
> > > 1:  RETPOLINE_SAFE
> > > JMP *key
> > > 2:  CALL_NOSPEC *key
> > > RET
> > >
> > > and have the text_poke_bp() handler jump to 2 (a location you'll never
> > > reach when you enter at 1), it will in fact work I think. The trampoline
> > > is never modified and is not shared between different static_calls.
> >
> > But after returning from the function to the trampoline, how does it
> > return from the trampoline to the call site?  At that point there is no
> > return address on the stack.
>
> Oh, right, so that RET doesn't work. ARGH. Time to go sleep, I suppose.

I assume I'm missing something, but can't it just be JMP_NOSPEC *key?
The code would call the trampoline just like any other function and,
if the alignment is bad, we can skip patching it.  And, if we want the
performance back, maybe some day we can find a clean way to patch
those misaligned callers, too.
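
Concretely, the per-key trampoline could be as small as this (a sketch
with made-up names; under retpolines the jmp would be the JMP_NOSPEC
sequence instead):

void my_func(void);			/* the key's current target */

asm(".pushsection .text\n"
    ".globl my_tramp\n"
    ".type my_tramp, @function\n"
    "my_tramp:\n"
    "\tjmp my_func\n"	/* 5-byte direct jmp, retargeted by text_poke() */
    ".popsection\n");

extern void my_tramp(void);

Callers then just do my_tramp() like any other call, and retargeting
the key only ever patches the jmp inside the trampoline.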


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 02:24:52PM -0600, Josh Poimboeuf wrote:
> On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> > On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
> >  wrote:
> > >
> > > On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
> > >  wrote:
> > > >
> > > > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > > > compiler couldn't turn a "call wrapper(%rip)" into anything else.
> > >
> > > Actually, I think I have a better model - if the caller is done with 
> > > inline asm.
> > >
> > > What you can do then is basically add a single-byte prefix to the
> > > "call" instruction that does nothing (say, cs override), and then
> > > replace *that* with a 'int3' instruction.
> > >
> > > Boom. Done.
> > >
> > > Now, the "int3" handler can just update the instruction in-place, but
> > > leave the "int3" in place, and then return to the next instruction
> > > byte (which is just the normal branch instruction without the prefix
> > > byte).
> > >
> > > The cross-CPU case continues to work, because the 'int3' remains in
> > > place until after the IPI.
> > 
> > Hmm, cute.  But then the calls are in inline asm, which results in
> > giant turds like we have for the pvop vcalls.  And, if they start
> > being used more generally, we potentially have ABI issues where the
> > calling convention isn't quite what the asm expects, and we explode.
> > 
> > I propose a different solution:
> > 
> > As in this patch set, we have a direct and an indirect version.  The
> > indirect version remains exactly the same as in this patch set.  The
> > direct version just only does the patching when all seems well: the
> > call instruction needs to be 0xe8, and we only do it when the thing
> > doesn't cross a cache line.  Does that work?  In the rare case where
> > the compiler generates something other than 0xe8 or crosses a cache
> > line, then the thing just remains as a call to the out of line jmp
> > trampoline.  Does that seem reasonable?  It's a very minor change to
> > the patch set.
> 
> Maybe that would be ok.  If my math is right, we would use the
> out-of-line version almost 5% of the time due to cache misalignment of
> the address.

BTW, this means that if any of a trampoline's callers crosses a cache
line boundary, then we won't be able to poison the trampoline.  Which is
kind of sad.

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote:
> > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
> 
> > > (like pointing IP at a stub that retpolines to the target by reading
> > > the function pointer, a la the unoptimizable version), then okay, I
> > > guess, with only a small amount of grumbling.
> > 
> > I tried that in v2, but Peter pointed out it's racy:
> > 
> >   
> > https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net
> 
> Ah, but that is because it is a global shared trampoline.
> 
> Each static_call has its own trampoline, which currently reads
> something like:
> 
>   RETPOLINE_SAFE
>   JMP *key
> 
> which you then 'defuse' by writing a UD2 on. _However_, if you write
> that trampoline like:
> 
> 1:RETPOLINE_SAFE
>   JMP *key
> 2:CALL_NOSPEC *key
>   RET
> 
> and have the text_poke_bp() handler jump to 2 (a location you'll never
> reach when you enter at 1), it will in fact work I think. The trampoline
> is never modified and is not shared between different static_calls.

But after returning from the function to the trampoline, how does it
return from the trampoline to the call site?  At that point there is no
return address on the stack.

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Peter Zijlstra
On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote:
> On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:

> > (like pointing IP at a stub that retpolines to the target by reading
> > the function pointer, a la the unoptimizable version), then okay, I
> > guess, with only a small amount of grumbling.
> 
> I tried that in v2, but Peter pointed out it's racy:
> 
>   
> https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net

Ah, but that is because it is a global shared trampoline.

Each static_call has its own trampoline, which currently reads
something like:

RETPOLINE_SAFE
JMP *key

which you then 'defuse' by writing a UD2 on. _However_, if you write
that trampoline like:

1:  RETPOLINE_SAFE
JMP *key
2:  CALL_NOSPEC *key
RET

and have the text_poke_bp() handler jump to 2 (a location you'll never
reach when you enter at 1), it will in fact work I think. The trampoline
is never modified and is not shared between different static_calls.


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
>  wrote:
> >
> > On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
> >  wrote:
> > >
> > > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > > compiler couldn't turn a "call wrapper(%rip)" into anything else.
> >
> > Actually, I think I have a better model - if the caller is done with inline 
> > asm.
> >
> > What you can do then is basically add a single-byte prefix to the
> > "call" instruction that does nothing (say, cs override), and then
> > replace *that* with a 'int3' instruction.
> >
> > Boom. Done.
> >
> > Now, the "int3" handler can just update the instruction in-place, but
> > leave the "int3" in place, and then return to the next instruction
> > byte (which is just the normal branch instruction without the prefix
> > byte).
> >
> > The cross-CPU case continues to work, because the 'int3' remains in
> > place until after the IPI.
> 
> Hmm, cute.  But then the calls are in inline asm, which results in
> giant turds like we have for the pvop vcalls.  And, if they start
> being used more generally, we potentially have ABI issues where the
> calling convention isn't quite what the asm expects, and we explode.
> 
> I propose a different solution:
> 
> As in this patch set, we have a direct and an indirect version.  The
> indirect version remains exactly the same as in this patch set.  The
> direct version just only does the patching when all seems well: the
> call instruction needs to be 0xe8, and we only do it when the thing
> doesn't cross a cache line.  Does that work?  In the rare case where
> the compiler generates something other than 0xe8 or crosses a cache
> line, then the thing just remains as a call to the out of line jmp
> trampoline.  Does that seem reasonable?  It's a very minor change to
> the patch set.

Maybe that would be ok.  If my math is right, we would use the
out-of-line version almost 5% of the time due to cache misalignment of
the address.

> Alternatively, we could actually emulate call instructions like this:
> 
> void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...)
> {
>   struct pt_regs ptregs_copy = *regs;
>   barrier();
>   *(unsigned long *)(regs->sp - 8) = whatever;  /* may clobber old regs, but so what? */
>   asm volatile ("jmp return_to_alternate_ptregs");
> }
> 
> where return_to_alternate_ptregs points rsp to the ptregs and goes
> through the normal return path.  It's ugly, but we could have a test
> case for it, and it should work fine.

Is that really any better than my patch to create a gap in the stack
(modified for kernel space #BP only)?

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 10:58:40AM -0800, Linus Torvalds wrote:
> On Thu, Nov 29, 2018 at 10:47 AM Steven Rostedt  wrote:
> >
> > Note, we do have a bit of control at what is getting called. The patch
> > set requires that the callers are wrapped in macros. We should not
> > allow just any random callers (like from asm).
> 
> Actually, I'd argue that asm is often more controlled than C code.
> 
> Right now you can do odd things if you really want to, and have the
> compiler generate indirect calls to those wrapper functions.
> 
> For example, I can easily imagine a pre-retpoline compiler turning
> 
>  if (cond)
>     fn1(a,b);
>  else
>     fn2(a,b);
> 
> into a function pointer conditional
> 
> (cond ? fn1 : fn2)(a,b);
> 
> and honestly, the way "static_call()" works now, can you guarantee
> that the call-site doesn't end up doing that, and calling the
> trampoline function for two different static calls from one indirect
> call?
> 
> See what I'm talking about? Saying "callers are wrapped in macros"
> doesn't actually protect you from the compiler doing things like that.
> 
> In contrast, if the call was wrapped in an inline asm, we'd *know* the
> compiler couldn't turn a "call wrapper(%rip)" into anything else.

I think objtool could warn about many such issues, including function
pointer references to trampolines and short tail call jumps.

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 11:24:43 -0800
Linus Torvalds  wrote:

> On Thu, Nov 29, 2018 at 11:16 AM Steven Rostedt  wrote:
> >
> > But then we need to implement all numbers of parameters.  
> 
> Oh, I agree, it's nasty.
> 
> But it's actually a nastiness that we've solved before. In particular,
> with the system call mappings, which have pretty much the exact same
> issue of "map unknown number of arguments to registers".
> 
> Yes, it's different - there you map the unknown number of arguments to
> a structure access instead. And yes, the macros are unbelievably ugly.
> See
> 
> arch/x86/include/asm/syscall_wrapper.h

Those are not doing inline assembly.

> 
> and the __MAP() macro from
> 
> include/linux/syscalls.h
> 
> so it's not pretty. But it would solve all the problems.
> 

Again, not inline assembly, and those only handle up to 6 parameters.

My POC started down this route, until I noticed that there are
tracepoints with 13 parameters!  And I need to handle all tracepoints.

Yes, we can argue that we need to change those (if that doesn't break
the API of something using it).

-- Steve



Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski
On Thu, Nov 29, 2018 at 11:25 AM Linus Torvalds
 wrote:
>
> On Thu, Nov 29, 2018 at 11:16 AM Steven Rostedt  wrote:
> >
> > But then we need to implement all numbers of parameters.
>
> Oh, I agree, it's nasty.
>
> But it's actually a nastiness that we've solved before. In particular,
> with the system call mappings, which have pretty much the exact same
> issue of "map unknown number of arguments to registers".
>
> Yes, it's different - there you map the unknown number of arguments to
> a structure access instead. And yes, the macros are unbelievably ugly.
> See
>
> arch/x86/include/asm/syscall_wrapper.h
>
> and the __MAP() macro from
>
> include/linux/syscalls.h
>
> so it's not pretty. But it would solve all the problems.
>

Until someone does:

struct foo foo;
static_call(thingy, foo);

For syscalls, we know better than to do that.  For static calls, I'm
less confident.


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 13:22:11 -0600
Josh Poimboeuf  wrote:

> On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote:
> > > and honestly, the way "static_call()" works now, can you guarantee
> > > that the call-site doesn't end up doing that, and calling the
> > > trampoline function for two different static calls from one indirect
> > > call?
> > > 
> > > See what I'm talking about? Saying "callers are wrapped in macros"
> > > doesn't actually protect you from the compiler doing things like that.
> > > 
> > > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > > compiler couldn't turn a "call wrapper(%rip)" into anything else.  
> > 
> > But then we need to implement all numbers of parameters.  
> 
> I actually have an old unfinished patch which (ab)used C macros to
> detect the number of parameters and then setup the asm constraints
> accordingly.  At the time, the goal was to optimize the BUG code.
> 
> I had wanted to avoid this kind of approach for static calls, because
> "ugh", but now it's starting to look much more appealing.
> 
> Behold:
> 
> diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
> index aa6b2023d8f8..d63e9240da77 100644
> --- a/arch/x86/include/asm/bug.h
> +++ b/arch/x86/include/asm/bug.h
> @@ -32,10 +32,59 @@
>  
>  #ifdef CONFIG_DEBUG_BUGVERBOSE
>  
> -#define _BUG_FLAGS(ins, flags)					\
> +#define __BUG_ARGS_0(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n"); \
> +})
> +#define __BUG_ARGS_1(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> +  : : "D" (ARG1(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_2(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> +  : : "D" (ARG1(__VA_ARGS__)), \
> +  "S" (ARG2(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_3(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> +  : : "D" (ARG1(__VA_ARGS__)), \
> +  "S" (ARG2(__VA_ARGS__)), \
> +  "d" (ARG3(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_4(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> +  : : "D" (ARG1(__VA_ARGS__)), \
> +  "S" (ARG2(__VA_ARGS__)), \
> +  "d" (ARG3(__VA_ARGS__)), \
> +  "c" (ARG4(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_5(ins, ...) \
> +({\
> + register u64 __r8 asm("r8") = (u64)ARG5(__VA_ARGS__); \
> + asm volatile("1:\t" ins "\n" \
> +  : : "D" (ARG1(__VA_ARGS__)), \
> +  "S" (ARG2(__VA_ARGS__)), \
> +  "d" (ARG3(__VA_ARGS__)), \
> +  "c" (ARG4(__VA_ARGS__)), \
> +  "r" (__r8)); \
> +})
> +#define __BUG_ARGS_6 foo
> +#define __BUG_ARGS_7 foo
> +#define __BUG_ARGS_8 foo
> +#define __BUG_ARGS_9 foo
> +


There exist tracepoints with 13 arguments.

-- Steve


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski
On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
 wrote:
>
> On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
>  wrote:
> >
> > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > compiler couldn't turn a "call wrapper(%rip)" into anything else.
>
> Actually, I think I have a better model - if the caller is done with inline 
> asm.
>
> What you can do then is basically add a single-byte prefix to the
> "call" instruction that does nothing (say, cs override), and then
> replace *that* with a 'int3' instruction.
>
> Boom. Done.
>
> Now, the "int3" handler can just update the instruction in-place, but
> leave the "int3" in place, and then return to the next instruction
> byte (which is just the normal branch instruction without the prefix
> byte).
>
> The cross-CPU case continues to work, because the 'int3' remains in
> place until after the IPI.

Hmm, cute.  But then the calls are in inline asm, which results in
giant turds like we have for the pvop vcalls.  And, if they start
being used more generally, we potentially have ABI issues where the
calling convention isn't quite what the asm expects, and we explode.

I propose a different solution:

As in this patch set, we have a direct and an indirect version.  The
indirect version remains exactly the same as in this patch set.  The
direct version just only does the patching when all seems well: the
call instruction needs to be 0xe8, and we only do it when the thing
doesn't cross a cache line.  Does that work?  In the rare case where
the compiler generates something other than 0xe8 or crosses a cache
line, then the thing just remains as a call to the out of line jmp
trampoline.  Does that seem reasonable?  It's a very minor change to
the patch set.

Alternatively, we could actually emulate call instructions like this:

void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...)
{
  struct pt_regs ptregs_copy = *regs;
  barrier();
  *(unsigned long *)(regs->sp - 8) = whatever;  /* may clobber old regs, but so what? */
  asm volatile ("jmp return_to_alternate_ptregs");
}

where return_to_alternate_ptregs points rsp to the ptregs and goes
through the normal return path.  It's ugly, but we could have a test
case for it, and it should work fine.


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 11:16 AM Steven Rostedt  wrote:
>
> But then we need to implement all numbers of parameters.

Oh, I agree, it's nasty.

But it's actually a nastiness that we've solved before. In particular,
with the system call mappings, which have pretty much the exact same
issue of "map unknown number of arguments to registers".

Yes, it's different - there you map the unknown number of arguments to
a structure access instead. And yes, the macros are unbelievably ugly.
See

arch/x86/include/asm/syscall_wrapper.h

and the __MAP() macro from

include/linux/syscalls.h

so it's not pretty. But it would solve all the problems.
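
For reference, __MAP() walks the (type, name) argument pairs and
applies a macro to each; condensed from include/linux/syscalls.h (see
the tree for the full version):

#define __MAP0(m, ...)
#define __MAP1(m, t, a, ...) m(t, a)
#define __MAP2(m, t, a, ...) m(t, a), __MAP1(m, __VA_ARGS__)
#define __MAP3(m, t, a, ...) m(t, a), __MAP2(m, __VA_ARGS__)
/* ... and so on up to __MAP6 ... */
#define __MAP(n, ...) __MAP##n(__VA_ARGS__)

#define __SC_DECL(t, a)	t a	/* expands a pair to "t a", for prototypes */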

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote:
> > and honestly, the way "static_call()" works now, can you guarantee
> > that the call-site doesn't end up doing that, and calling the
> > trampoline function for two different static calls from one indirect
> > call?
> > 
> > See what I'm talking about? Saying "callers are wrapped in macros"
> > doesn't actually protect you from the compiler doing things like that.
> > 
> > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > compiler couldn't turn a "call wrapper(%rip)" into anything else.
> 
> But then we need to implement all numbers of parameters.

I actually have an old unfinished patch which (ab)used C macros to
detect the number of parameters and then setup the asm constraints
accordingly.  At the time, the goal was to optimize the BUG code.

I had wanted to avoid this kind of approach for static calls, because
"ugh", but now it's starting to look much more appealing.

Behold:

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index aa6b2023d8f8..d63e9240da77 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -32,10 +32,59 @@
 
 #ifdef CONFIG_DEBUG_BUGVERBOSE
 
-#define _BUG_FLAGS(ins, flags) \
+#define __BUG_ARGS_0(ins, ...) \
+({\
+   asm volatile("1:\t" ins "\n"); \
+})
+#define __BUG_ARGS_1(ins, ...) \
+({\
+   asm volatile("1:\t" ins "\n" \
+: : "D" (ARG1(__VA_ARGS__))); \
+})
+#define __BUG_ARGS_2(ins, ...) \
+({\
+   asm volatile("1:\t" ins "\n" \
+: : "D" (ARG1(__VA_ARGS__)), \
+"S" (ARG2(__VA_ARGS__))); \
+})
+#define __BUG_ARGS_3(ins, ...) \
+({\
+   asm volatile("1:\t" ins "\n" \
+: : "D" (ARG1(__VA_ARGS__)), \
+"S" (ARG2(__VA_ARGS__)), \
+"d" (ARG3(__VA_ARGS__))); \
+})
+#define __BUG_ARGS_4(ins, ...) \
+({\
+   asm volatile("1:\t" ins "\n" \
+: : "D" (ARG1(__VA_ARGS__)), \
+"S" (ARG2(__VA_ARGS__)), \
+"d" (ARG3(__VA_ARGS__)), \
+"c" (ARG4(__VA_ARGS__))); \
+})
+#define __BUG_ARGS_5(ins, ...) \
+({\
+   register u64 __r8 asm("r8") = (u64)ARG5(__VA_ARGS__); \
+   asm volatile("1:\t" ins "\n" \
+: : "D" (ARG1(__VA_ARGS__)), \
+"S" (ARG2(__VA_ARGS__)), \
+"d" (ARG3(__VA_ARGS__)), \
+"c" (ARG4(__VA_ARGS__)), \
+"r" (__r8)); \
+})
+#define __BUG_ARGS_6 foo
+#define __BUG_ARGS_7 foo
+#define __BUG_ARGS_8 foo
+#define __BUG_ARGS_9 foo
+
+#define __BUG_ARGS(ins, num, ...) __BUG_ARGS_ ## num(ins, __VA_ARGS__)
+
+#define _BUG_ARGS(ins, num, ...) __BUG_ARGS(ins, num, __VA_ARGS__)
+
+#define _BUG_FLAGS(ins, flags, ...)\
 do {   \
-   asm volatile("1:\t" ins "\n"\
-".pushsection __bug_table,\"aw\"\n"\
+   _BUG_ARGS(ins, NUM_ARGS(__VA_ARGS__), __VA_ARGS__); \
+   asm volatile(".pushsection __bug_table,\"aw\"\n"\
 "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"   \
 "\t"  __BUG_REL(%c0) "\t# bug_entry::file\n"   \
 "\t.word %c1""\t# bug_entry::line\n"   \
@@ -76,7 +125,7 @@ do {						\
unreachable();  \
 } while (0)
 
-#define __WARN_FLAGS(flags)	_BUG_FLAGS(ASM_UD0, BUGFLAG_WARNING|(flags))
+#define __WARN_FLAGS(flags, ...)	_BUG_FLAGS(ASM_UD0, BUGFLAG_WARNING|(flags), __VA_ARGS__)
 
 #include <asm-generic/bug.h>
 
diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 70c7732c9594..0cb16e912c02 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -58,8 +58,8 @@ struct bug_entry {
 #endif
 
 #ifdef __WARN_FLAGS
-#define __WARN_TAINT(taint)		__WARN_FLAGS(BUGFLAG_TAINT(taint))
-#define __WARN_ONCE_TAINT(taint)	__WARN_FLAGS(BUGFLAG_ONCE|BUGFLAG_TAINT(taint))
+#define __WARN_TAINT(taint, args...)	__WARN_FLAGS(BUGFLAG_TAINT(taint), args)
+#define __WARN_ONCE_TAINT(taint, args...)	__WARN_FLAGS(BUGFLAG_ONCE|BUGFLAG_TAINT(taint), args)
 
 #define WARN_ON_ONCE(condition) ({ \
int __ret_warn_on = !!(condition);  \
@@ -84,11 +84,12 @@ void warn_slowpath_fmt_taint(const char *file, const int line, unsigned taint,
 extern void warn_slowpath_null(const char *file, const int line);
 #ifdef __WARN_TAINT
 #define __WARN()   __WARN_TAINT(TAINT_WARN)
+#define __WARN_printf(args...) __WARN_TAINT(TAINT_WARN, args)
 #else
 #define __WARN()	warn_slowpath_null(__FILE__, __LINE__)

Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 10:58:40 -0800
Linus Torvalds  wrote:

> On Thu, Nov 29, 2018 at 10:47 AM Steven Rostedt  wrote:
> >
> > Note, we do have a bit of control at what is getting called. The patch
> > set requires that the callers are wrapped in macros. We should not
> > allow just any random callers (like from asm).  
> 
> Actually, I'd argue that asm is often more controlled than C code.
> 
> Right now you can do odd things if you really want to, and have the
> compiler generate indirect calls to those wrapper functions.
> 
> For example, I can easily imagine a pre-retpoline compiler turning
> 
>  if (cond)
>     fn1(a,b);
>  else
>     fn2(a,b);
> 
> into a function pointer conditional
> 
> (cond ? fn1 : fn2)(a,b);

If we are worried about such a construct, wouldn't a compiler barrier
before and after the static_call solve that?

barrier();
static_call(func...);
barrier();

It should stop tail calls too.
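
Something like this, say (a sketch of the idea; barrier() is the usual
empty asm with a "memory" clobber, and "key.func" stands in for however
the call site dereferences the key):

#define barrier() asm volatile("" : : : "memory")

/* fence the call site so the compiler cannot merge or tail-call it */
#define static_call_fenced(key, ...)		\
do {						\
	barrier();				\
	key.func(__VA_ARGS__);			\
	barrier();				\
} while (0)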

> 
> and honestly, the way "static_call()" works now, can you guarantee
> that the call-site doesn't end up doing that, and calling the
> trampoline function for two different static calls from one indirect
> call?
> 
> See what I'm talking about? Saying "callers are wrapped in macros"
> doesn't actually protect you from the compiler doing things like that.
> 
> In contrast, if the call was wrapped in an inline asm, we'd *know* the
> compiler couldn't turn a "call wrapper(%rip)" into anything else.

But then we need to implement all numbers of parameters.

-- Steve


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 11:08:26 -0800
Linus Torvalds  wrote:

> On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
>  wrote:
> >
> > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > compiler couldn't turn a "call wrapper(%rip)" into anything else.  
> 
> Actually, I think I have a better model - if the caller is done with inline 
> asm.
> 
> What you can do then is basically add a single-byte prefix to the
> "call" instruction that does nothing (say, cs override), and then
> replace *that* with a 'int3' instruction.
> 
> Boom. Done.
> 
> Now, the "int3" handler can just update the instruction in-place, but
> leave the "int3" in place, and then return to the next instruction
> byte (which is just the normal branch instruction without the prefix
> byte).
> 
> The cross-CPU case continues to work, because the 'int3' remains in
> place until after the IPI.
> 
> But that would require that we'd mark those call instruction with
> 

In my original proof of concept, I tried to to implement the callers
with asm, but then the way to handle parameters became a nightmare.

The goal of this (for me) was to replace the tracepoint indirect calls
with static calls, and tracepoints can have any number of parameters to
pass. I ended up needing the compiler to help me with the passing of
parameters.

-- Steve


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
 wrote:
>
> What you can do then is basically add a single-byte prefix to the
> "call" instruction that does nothing (say, cs override), and then
> replace *that* with a 'int3' instruction.

Hmm. the segment prefixes are documented as being "reserved" for
branch instructions. I *think* that means just conditional branches
(Intel at one point used the prefixes for static prediction
information), not "call", but who knows..

It might be better to use an empty REX prefix on x86-64 or something like that.

   Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
 wrote:
>
> In contrast, if the call was wrapped in an inline asm, we'd *know* the
> compiler couldn't turn a "call wrapper(%rip)" into anything else.

Actually, I think I have a better model - if the caller is done with inline asm.

What you can do then is basically add a single-byte prefix to the
"call" instruction that does nothing (say, cs override), and then
replace *that* with a 'int3' instruction.

Boom. Done.

Now, the "int3" handler can just update the instruction in-place, but
leave the "int3" in place, and then return to the next instruction
byte (which is just the normal branch instruction without the prefix
byte).

The cross-CPU case continues to work, because the 'int3' remains in
place until after the IPI.

But that would require that we'd mark those call instruction with

  Linus
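
The byte sequences being proposed, as a sketch (poke_byte() and sync_cores()
are placeholder names for the real text-poking and IPI machinery, and the
choice of prefix byte was still being debated, per the reply above):

	#include <string.h>

	#define PFX	0x2e	/* CS override -- or an empty REX on x86-64 */
	#define INT3	0xcc

	/* site[0] = prefix, site[1] = 0xe8, site[2..5] = rel32 */
	static void repatch_prefixed_call(unsigned char *site, int new_rel32)
	{
		poke_byte(site, INT3);			/* cc e8 xx xx xx xx */
		sync_cores();				/* everyone sees the int3 */
		memcpy(site + 2, &new_rel32, 4);	/* cc e8 yy yy yy yy */
		sync_cores();
		poke_byte(site, PFX);			/* 2e e8 yy yy yy yy */
	}

The #BP handler can fix the rel32 in place and then resume at site + 1, the
un-prefixed call, so it never has to emulate the call itself.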


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 10:47 AM Steven Rostedt  wrote:
>
> Note, we do have a bit of control at what is getting called. The patch
> set requires that the callers are wrapped in macros. We should not
> allow just any random callers (like from asm).

Actually, I'd argue that asm is often more controlled than C code.

Right now you can do odd things if you really want to, and have the
compiler generate indirect calls to those wrapper functions.

For example, I can easily imagine a pre-retpoline compiler turning

 if (cond)
fn1(a,b)
 else
   fn2(a,b);

into a function pointer conditional

(cond ? fn1 : fn2)(a,b);

and honestly, the way "static_call()" works now, can you guarantee
that the call-site doesn't end up doing that, and calling the
trampoline function for two different static calls from one indirect
call?

See what I'm talking about? Saying "callers are wrapped in macros"
doesn't actually protect you from the compiler doing things like that.

In contrast, if the call was wrapped in an inline asm, we'd *know* the
compiler couldn't turn a "call wrapper(%rip)" into anything else.

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 10:00:48 -0800
Andy Lutomirski  wrote:

> > 
> > Of course, another option is to just say "we don't do the inline case,
> > then", and only ever do a call to a stub that does a "jmp"
> > instruction.  
> 
> That’s not a terrible idea.

It was the implementation of my first proof of concept that kicked off
this entire idea, where others (Peter and Josh) thought it was better
to modify the calls themselves. It does improve things.

Just a reminder of the benchmarks of enabling all tracepoints (which
use indirect jumps) and running hackbench:

  No RETPOLINES:
1.4503 +- 0.0148 seconds time elapsed  ( +-  1.02% )

  baseline RETPOLINES:
1.5120 +- 0.0133 seconds time elapsed  ( +-  0.88% )

  Added direct calls for trace_events:
1.5239 +- 0.0139 seconds time elapsed  ( +-  0.91% )

  With static calls:
1.5282 +- 0.0135 seconds time elapsed  ( +-  0.88% )

  With static call trampolines:
   1.48328 +- 0.00515 seconds time elapsed  ( +-  0.35% )

  Full static calls:
   1.47364 +- 0.00706 seconds time elapsed  ( +-  0.48% )


  Adding Retpolines caused a 1.5120 / 1.4503 = 1.0425 ( 4.25% ) slowdown

  Trampolines made it into 1.48328 / 1.4503 = 1.0227 ( 2.27% ) slowdown

The above is the stub with the jmp case.

  With full static calls 1.47364 / 1.4503 = 1.0160 ( 1.6% ) slowdown

Modifying the calls themselves does have an improvement (and this is
much greater of an improvement when I had debugging enabled).

Perhaps it's not worth the effort, but again, we do have control of
what uses this. It's not a total free-for-all.

Full results here:

  http://lkml.kernel.org/r/20181126155405.72b4f...@gandalf.local.home

Although since lore.kernel.org seems to be having issues:

  https://marc.info/?l=linux-kernel&m=154326714710686


-- Steve


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 10:23:44 -0800
Linus Torvalds  wrote:

> On Thu, Nov 29, 2018 at 9:59 AM Steven Rostedt  wrote:
> >
> > Do you realize that the cmpxchg used by the first attempts of the
> > dynamic modification of code by ftrace was the source of the e1000e
> > NVRAM corruption bug.  
> 
> If you have a static call in IO memory, you have bigger problems than that.
> 
> What's your point?

Just that cmpxchg on dynamic modified code brings back bad memories ;-)

> 
> Again - I will point out that the things you guys have tried to come
> up with have been *WORSE*. Much worse.

Note, we do have a bit of control at what is getting called. The patch
set requires that the callers are wrapped in macros. We should not
allow just any random callers (like from asm).

This isn't about modifying any function call. This is for a specific
subset, that we can impose rules on.

-- Steve



Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 10:00 AM Andy Lutomirski  wrote:
> > then it really sounds pretty safe to just say "ok, just make it
> > aligned and update the instruction with an atomic cmpxchg or
> > something".
>
> And how do we do that?  With a gcc plugin and some asm magic?

Asm magic.

You already have to mark the call sites with

static_call(fn, arg1, arg2, ...);

and right now it just magically depends on gcc outputting the right
code to call the trampoline. But gcc could emit it as a jmp
instruction (tail-call), and maybe that works right, maybe it doesn't.
And maybe some gcc switch makes it output it as a indirect call due to
instrumentation or something. Doing it with asm magic would, I feel,
be safer anyway, so that we'd know *exactly* how that call gets done.

For example, if gcc does it as a jmp due to a tail-call, the
compiler/linker could in theory turn the jump into a short jump if it
sees that the trampoline is close enough. Does that happen? Probably
not. But I don't see why it *couldn't* happen in the current patch
series. The trampoline is just a regular function, even if it has been
defined by global asm.

Putting the trampoline in a different code section could fix things
like that (maybe there was a patch that did that and I missed it?) but
I do think that doing the call with an asm would *also* fix it.

But the "just always use a trampoline" is certainly the simpler model.

  Linus
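
A sketch of what such an asm-wrapped call site could look like, hard-coded
for two arguments (the macro and trampoline naming is invented here; the
clobber list mirrors the C calling convention):

	#define static_call_2(key, arg1, arg2)				\
	({								\
		register long __a1 asm("rdi") = (long)(arg1);		\
		register long __a2 asm("rsi") = (long)(arg2);		\
		long __ret;						\
									\
		asm volatile("call " #key "_tramp"			\
			     : "=a" (__ret), "+r" (__a1), "+r" (__a2)	\
			     :						\
			     : "rdx", "rcx", "r8", "r9", "r10", "r11",	\
			       "cc", "memory");				\
		__ret;							\
	})

This nails down the exact 0xe8 near call, but it also shows the cost Steve
points out elsewhere in the thread: every arity (and any floating-point
argument) needs its own variant.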


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 06:15:39PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
> 
> > If you make it conditional on CPL, do it for 32-bit as well, add
> > comments, 
> 
> > and convince yourself that there isn’t a better solution
> > (like pointing IP at a stub that retpolines to the target by reading
> > the function pointer, a la the unoptimizable version), then okay, I
> > guess, with only a small amount of grumbling.
> 
> Right; so we _could_ grow the trampoline with a retpoline indirect call
> and ret. It just makes the trampoline a whole lot bigger, but it could
> work.

I'm trying to envision how this would work.  How would the function (or
stub) know how to return back to the call site?

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 9:59 AM Steven Rostedt  wrote:
>
> Do you realize that the cmpxchg used by the first attempts of the
> dynamic modification of code by ftrace was the source of the e1000e
> NVRAM corruption bug.

If you have a static call in IO memory, you have bigger problems than that.

What's your point?

Again - I will point out that the things you guys have tried to come
up with have been *WORSE*. Much worse.

   Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski


> On Nov 29, 2018, at 9:50 AM, Linus Torvalds  
> wrote:
> 
>> On Thu, Nov 29, 2018 at 9:44 AM Steven Rostedt  wrote:
>> 
>> Well, the current method (as Jiri mentioned) did get the OK from at
>> least Intel (and that was with a lot of arm twisting to do so).
> 
> Guys, when the comparison is to:
> 
> - create a huge honking security hole by screwing up the stack frame
> 
> or
> 
> - corrupt random registers because we "know" they aren't in use

For C calls, we do indeed know that.  But I guess there could be asm calls.

> 
> then it really sounds pretty safe to just say "ok, just make it
> aligned and update the instruction with an atomic cmpxchg or
> something".

And how do we do that?  With a gcc plugin and some asm magic?

> 
> Of course, another option is to just say "we don't do the inline case,
> then", and only ever do a call to a stub that does a "jmp"
> instruction.

That’s not a terrible idea.

> 
> Problem solved, at the cost of some I$. Emulating a "jmp" is trivial,
> in ways emulating a "call" is not.
> 
> 




Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 09:50:28 -0800
Linus Torvalds  wrote:

> On Thu, Nov 29, 2018 at 9:44 AM Steven Rostedt  wrote:
> >
> > Well, the current method (as Jiri mentioned) did get the OK from at
> > least Intel (and that was with a lot of arm twisting to do so).  
> 
> Guys, when the comparison is to:
> 
>  - create a huge honking security hole by screwing up the stack frame
> 
> or
> 
>  - corrupt random registers because we "know" they aren't in use
> 
> then it really sounds pretty safe to just say "ok, just make it
> aligned and update the instruction with an atomic cmpxchg or
> something".

Do you realize that the cmpxchg used by the first attempts of the
dynamic modification of code by ftrace was the source of the e1000e
NVRAM corruption bug.

It's because it happened to do it to IO write only memory, and a
cmpxchg will *always* write, even if it didn't match. It will just
write out what it read.

In the case of the e1000e bug, it read 0xffffffff and that's what it
wrote back out.

So no, I don't think that's a better solution.
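
The hazard, roughly in code (the SDM documents that CMPXCHG writes the
destination even when the comparison fails; on failure it stores back the
value it just read):

	static inline int poke_cmpxchg(unsigned int *addr,
				       unsigned int old, unsigned int new)
	{
		unsigned int prev;

		/* On normal RAM the failed-compare write-back is invisible.
		 * On the e1000e's write-sensitive flash window, which read
		 * back as 0xffffffff, that value got written out again. */
		asm volatile("lock; cmpxchgl %2, %1"
			     : "=a" (prev), "+m" (*addr)
			     : "r" (new), "0" (old)
			     : "memory");
		return prev == old;
	}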

-- Steve


> 
> Of course, another option is to just say "we don't do the inline case,
> then", and only ever do a call to a stub that does a "jmp"
> instruction.
> 
> Problem solved, at the cost of some I$. Emulating a "jmp" is trivial,
> in ways emulating a "call" is not.
> 
>   Linus



Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 9:50 AM Linus Torvalds
 wrote:
>
>  - corrupt random registers because we "know" they aren't in use

Just to clarify: I think that's a completely unacceptable model.

We already have lots of special calling conventions, including ones
that do not have any call-clobbered registers at all, because we have
special magic calls in inline asm.

Some of those might be prime material for doing static calls (ie PV-op
stuff, where the native model does *not* change any registers).

So no. Don't do ugly hacks like that.

   Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski


> On Nov 29, 2018, at 9:45 AM, Josh Poimboeuf  wrote:
> 
>> On Thu, Nov 29, 2018 at 09:41:33AM -0800, Andy Lutomirski wrote:
>> 
>>> On Nov 29, 2018, at 9:21 AM, Steven Rostedt  wrote:
>>> 
>>> On Thu, 29 Nov 2018 12:20:00 -0500
>>> Steven Rostedt  wrote:
>>> 
>>> 
 r8 = return address
 r9 = function to call
 
>>> 
>>> Bad example, r8 and r9 are args, but r10 and r11 are available.
>>> 
>>> -- Steve
>>> 
   push r8
   jmp *r9
 
 Then have the regs->ip point to that trampoline.
>> 
>> Cute. That’ll need ORC annotations and some kind of retpoline to replace the 
>> indirect jump, though.
> 
> I'm going with this idea, but the BP is so rare that I really don't see
> why a retpoline would be needed.
> 

Without the retpoline in place, you are vulnerable to security researchers 
causing you a personal denial of service by finding a way to cause the BP to 
get hit, mistraining the branch predictor, and writing a paper about it :)

Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 9:44 AM Steven Rostedt  wrote:
>
> Well, the current method (as Jiri mentioned) did get the OK from at
> least Intel (and that was with a lot of arm twisting to do so).

Guys, when the comparison is to:

 - create a huge honking security hole by screwing up the stack frame

or

 - corrupt random registers because we "know" they aren't in use

then it really sounds pretty safe to just say "ok, just make it
aligned and update the instruction with an atomic cmpxchg or
something".

Of course, another option is to just say "we don't do the inline case,
then", and only ever do a call to a stub that does a "jmp"
instruction.

Problem solved, at the cost of some I$. Emulating a "jmp" is trivial,
in ways emulating a "call" is not.

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 09:41:33 -0800
Andy Lutomirski  wrote:

> > On Nov 29, 2018, at 9:21 AM, Steven Rostedt  wrote:
> > 
> > On Thu, 29 Nov 2018 12:20:00 -0500
> > Steven Rostedt  wrote:
> > 
> >   
> >> r8 = return address
> >> r9 = function to call
> >>   
> > 
> > Bad example, r8 and r9 are args, but r10 and r11 are available.
> > 
> > -- Steve
> >   
> >>push r8
> >>jmp *r9
> >> 
> >> Then have the regs->ip point to that trampoline.  
> 
> Cute. That’ll need ORC annotations and some kind of retpoline to replace the 
> indirect jump, though.
>

Do we really need to worry about retpoline here?

I'm not fully up on all the current vulnerabilities, but can this
really be taken advantage of when it only happens in the transition of
changing a static call with the small chance of one of those calls
triggering the break point?

If someone can take advantage of that, I almost think they deserve
cracking my box ;-)

-- Steve


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 09:41:33AM -0800, Andy Lutomirski wrote:
> 
> > On Nov 29, 2018, at 9:21 AM, Steven Rostedt  wrote:
> > 
> > On Thu, 29 Nov 2018 12:20:00 -0500
> > Steven Rostedt  wrote:
> > 
> > 
> >> r8 = return address
> >> r9 = function to call
> >> 
> > 
> > Bad example, r8 and r9 are args, but r10 and r11 are available.
> > 
> > -- Steve
> > 
> >>push r8
> >>jmp *r9
> >> 
> >> Then have the regs->ip point to that trampoline.
> 
> Cute. That’ll need ORC annotations and some kind of retpoline to replace the 
> indirect jump, though.

I'm going with this idea, but the BP is so rare that I really don't see
why a retpoline would be needed.

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 09:35:11 -0800
Linus Torvalds  wrote:

> On Thu, Nov 29, 2018 at 9:13 AM Steven Rostedt  wrote:
> >
> > No, we really do need to sync after we change the second part of the
> > command with the int3 on it. Unless there's another way to guarantee
> > that the full instruction gets seen when we replace the int3 with the
> > finished command.  
> 
> Making sure the call instruction is aligned with the I$ fetch boundary
> should do that.
> 
> It's not in the SDM, but neither was our current behavior - we
> were/are just relying on "it will work".
>

Well, the current method (as Jiri mentioned) did get the OK from at
least Intel (and that was with a lot of arm twisting to do so).

-- Steve


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski


> On Nov 29, 2018, at 9:21 AM, Steven Rostedt  wrote:
> 
> On Thu, 29 Nov 2018 12:20:00 -0500
> Steven Rostedt  wrote:
> 
> 
>> r8 = return address
>> r9 = function to call
>> 
> 
> Bad example, r8 and r9 are args, but r10 and r11 are available.
> 
> -- Steve
> 
>>push r8
>>jmp *r9
>> 
>> Then have the regs->ip point to that trampoline.

Cute. That’ll need ORC annotations and some kind of retpoline to replace the 
indirect jump, though.

>> 
>> -- Steve
> 


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 9:13 AM Steven Rostedt  wrote:
>
> No, we really do need to sync after we change the second part of the
> command with the int3 on it. Unless there's another way to guarantee
> that the full instruction gets seen when we replace the int3 with the
> finished command.

Making sure the call instruction is aligned with the I$ fetch boundary
should do that.

It's not in the SDM, but neither was our current behavior - we
were/are just relying on "it will work".

 Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski



> On Nov 29, 2018, at 9:29 AM, Linus Torvalds  
> wrote:
> 
> On Thu, Nov 29, 2018 at 9:02 AM Andy Lutomirski  wrote:
>>> 
>>> - just restart the instruction (with the suggested "ptregs->rip --")
>>> 
>>> - to avoid any "oh, we're not making progress" issues, just fix the
>>> instruction yourself to be the right call, by looking it up in the
>>> "what needs to be fixed" tables.
>> 
>> I thought that too.  I think it deadlocks. CPU A does text_poke_bp().  CPU B 
>> is waiting for a spinlock with IRQs off.  CPU C holds the spinlock and hits 
>> the int3.  The int3 never goes away because CPU A is waiting for CPU B to 
>> handle the sync_core IPI.
>> 
>> Or do you think we can avoid the IPI while the int3 is there?
> 
> I'm handwaving and thinking that CPU C that hits the int3 can just fix
> up the instruction directly in its own caches, and return.
> 
> Yes, it does what the "text_poke" *will* do (so now the instruction
> gets rewritten _twice_), but who cares? It's idempotent.
> 
> 

But it’s out of order. I’m not concerned about the final IPI — I’m concerned 
about the IPI after the int3 write and before the int3 is removed again. If one 
CPU replaces 0xcc with 0xe8, another CPU could observe that before the last 
couple bytes of the call target are written and observed by all CPUs.

Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Jiri Kosina
On Thu, 29 Nov 2018, Andy Lutomirski wrote:

> Does anyone know what the actual hardware semantics are?  The SDM is not 
> particularly informative unless I looked at the wrong section.

I don't think SDM answers all the questions there, unfortunately.

I vaguely remember that back then when I was preparing the original 
text_poke_bp() implementation, hpa had to provide some answers directly 
from inner depths of Intel ... see fd4363fff3 ("x86: Introduce int3 
(breakpoint)-based instruction patching") for reference.

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 9:02 AM Andy Lutomirski  wrote:
> >
> > - just restart the instruction (with the suggested "ptregs->rip --")
> >
> > - to avoid any "oh, we're not making progress" issues, just fix the
> > instruction yourself to be the right call, by looking it up in the
> > "what needs to be fixed" tables.
>
> I thought that too.  I think it deadlocks. CPU A does text_poke_bp().  CPU B 
> is waiting for a spinlock with IRQs off.  CPU C holds the spinlock and hits 
> the int3.  The int3 never goes away because CPU A is waiting for CPU B to 
> handle the sync_core IPI.
>
> Or do you think we can avoid the IPI while the int3 is there?

I'm handwaving and thinking that CPU C that hits the int3 can just fix
up the instruction directly in its own caches, and return.

Yes, it does what the "text_poke" *will* do (so now the instruction
gets rewritten _twice_), but who cares? It's idempotent.

And no, I don't have code, just "maybe some handwaving like this"

   Linus
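
That handwaving, rendered as a sketch (struct fixup and find_fixup() are
hypothetical; text_poke() and sync_core() exist):

	struct fixup {
		const void *insn;	/* finished instruction bytes */
		size_t len;
	};

	int poke_int3_selffix(struct pt_regs *regs)
	{
		void *site = (void *)(regs->ip - 1);	/* int3 is one byte */
		struct fixup *f = find_fixup(site);	/* "needs fixing" table */

		if (!f)
			return 0;

		text_poke(site, f->insn, f->len);	/* competing, idempotent store */
		sync_core();				/* serialize this CPU */
		regs->ip -= 1;				/* restart the repaired insn */
		return 1;
	}

Peter's objection below still applies: with two CPUs storing into the same
instruction concurrently, it is not obvious what the *other* CPUs may
observe mid-update.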


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski



> On Nov 29, 2018, at 9:07 AM, Peter Zijlstra  wrote:
> 
> On Thu, Nov 29, 2018 at 09:02:23AM -0800, Andy Lutomirski wrote:
>>> On Nov 29, 2018, at 8:50 AM, Linus Torvalds  
>>> wrote:
> 
>>> So no. Do *not* try to change %rsp on the stack in the bp handler.
>>> Instead, I'd suggest:
>>> 
>>> - just restart the instruction (with the suggested "ptregs->rip --")
>>> 
>>> - to avoid any "oh, we're not making progress" issues, just fix the
>>> instruction yourself to be the right call, by looking it up in the
>>> "what needs to be fixed" tables.
>>> 
>>> No?
> 
>> Or do you think we can avoid the IPI while the int3 is there?
> 
> I'm thinking Linus is suggesting the #BP handler does the text write too
> (as a competing store) and then sync_core() and restarts.
> 
> But I think that is broken, because then there is no telling what the
> other CPUs will observe.

Does anyone know what the actual hardware semantics are?  The SDM is not 
particularly informative unless I looked at the wrong section.

Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 12:20:00 -0500
Steven Rostedt  wrote:


> r8 = return address
> r9 = function to call
> 

Bad example, r8 and r9 are args, but r10 and r11 are available.

-- Steve

>   push r8
>   jmp *r9
> 
> Then have the regs->ip point to that trampoline.
> 
> -- Steve



Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 18:15:39 +0100
Peter Zijlstra  wrote:

> On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
> 
> > If you make it conditional on CPL, do it for 32-bit as well, add
> > comments,   
> 
> > and convince yourself that there isn’t a better solution
> > (like pointing IP at a stub that retpolines to the target by reading
> > the function pointer, a la the unoptimizable version), then okay, I
> > guess, with only a small amount of grumbling.  
> 
> Right; so we _could_ grow the trampoline with a retpoline indirect call
> and ret. It just makes the trampoline a whole lot bigger, but it could
> work.

Can't we make use of the callee clobbered registers? I mean, we know
that call is being made when the int3 is triggered. Then we can save
the return address in one register, and the jump location in another,
and then just call a trampoline that does:

r8 = return address
r9 = function to call

push r8
jmp *r9

Then have the regs->ip point to that trampoline.

-- Steve
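
That trampoline, sketched with the register fix from the follow-up (r8/r9
carry arguments 5 and 6 in the x86-64 calling convention, so r10/r11 --
plain caller-clobbered scratch -- are used instead; the symbol name is
invented):

	asm(".text\n"
	    ".global static_call_bp_tramp\n"
	    "static_call_bp_tramp:\n"
	    "	pushq	%r10\n"		/* the return address the CALL would push */
	    "	jmpq	*%r11\n");	/* jump to the new call target */

As Andy replies above, a hardened kernel would want the indirect jmp to be
a retpoline, plus ORC annotations for the unwinder.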


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Peter Zijlstra
On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:

> If you make it conditional on CPL, do it for 32-bit as well, add
> comments, 

> and convince yourself that there isn’t a better solution
> (like pointing IP at a stub that retpolines to the target by reading
> the function pointer, a la the unoptimizable version), then okay, I
> guess, with only a small amount of grumbling.

Right; so we _could_ grow the trampoline with a retpoline indirect call
and ret. It just makes the trampoline a whole lot bigger, but it could
work.


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 09:02:23 -0800
Andy Lutomirski  wrote:


> > Instead, I'd suggest:
> > 
> > - just restart the instruction (with the suggested "ptregs->rip --")
> > 
> > - to avoid any "oh, we're not making progress" issues, just fix the
> > instruction yourself to be the right call, by looking it up in the
> > "what needs to be fixed" tables.
> > 
> > No?  
> 
> I thought that too.  I think it deadlocks. CPU A does
> text_poke_bp().  CPU B is waiting for a spinlock with IRQs off.  CPU
> C holds the spinlock and hits the int3.  The int3 never goes away
> because CPU A is waiting for CPU B to handle the sync_core IPI.

I agree that this can happen.

> 
> Or do you think we can avoid the IPI while the int3 is there?

No, we really do need to sync after we change the second part of the
command with the int3 on it. Unless there's another way to guarantee
that the full instruction gets seen when we replace the int3 with the
finished command.

To refresh everyone's memory of why we have an IPI (the IPI implies a
serializing memory barrier on each receiving CPU):

We start with:

   e8 01 02 03 04

and we want to convert it to:

   e8 ab cd ef 01

And let's say the instruction crosses a cache line that breaks it into
e8 01 and 02 03 04.

We add the breakpoint:

  cc 01 02 03 04

We do a sync (so now everyone should see the break point) before
updating the second part, because a CPU that still sees the old first
byte combined with the new second part might see:

  e8 01 cd ef 01

Which would not be good.

And we need another sync after we change the code so all CPUs see

  cc ab cd ef 01

Because when we remove the break point, we don't want other CPUs to see

  e8 ab 02 03 04

Which would also be bad.
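
Condensed into code, the ordering is (sync_cores() standing in for the
on_each_cpu(do_sync_core) IPI round):

	static void poke_insn(unsigned char *site, const unsigned char *new,
			      size_t len)
	{
		const unsigned char int3 = 0xcc;

		text_poke(site, &int3, 1);		/* cc 01 02 03 04 */
		sync_cores();		/* nobody can still decode "e8 01 ..." */

		text_poke(site + 1, new + 1, len - 1);	/* cc ab cd ef 01 */
		sync_cores();		/* nobody can still see the old tail */

		text_poke(site, new, 1);		/* e8 ab cd ef 01 */
		sync_cores();		/* before the poke is declared done */
	}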

-- Steve


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
> 
> 
> > On Nov 29, 2018, at 8:49 AM, Peter Zijlstra  wrote:
> > 
> > On Thu, Nov 29, 2018 at 10:33:42AM -0600, Josh Poimboeuf wrote:
> >>> can't we 'fix' that again? The alternative is moving that IRET-frame and
> >>> fixing everything up, which is going to be fragile, ugly and such
> >>> things more.
> > 
> >> This seems to work...
> > 
> > That's almost too easy... nice!
> 
> It is indeed too easy: you’re putting pt_regs in the wrong place for
> int3 from user mode, which is probably a root hole if you arrange for
> a ptraced process to do int3 and try to write to whatever register
> aliases CS.
> 
> If you make it conditional on CPL, do it for 32-bit as well, add
> comments, and convince yourself that there isn’t a better solution

I could do that - but why subject 32-bit to it?  I was going to make it
conditional on CONFIG_HAVE_STATIC_CALL_INLINE which is 64-bit only.

> (like pointing IP at a stub that retpolines to the target by reading
> the function pointer, a la the unoptimizable version), then okay, I
> guess, with only a small amount of grumbling.

I tried that in v2, but Peter pointed out it's racy:

  
https://lkml.kernel.org/r/20181126160217.gr2...@hirez.programming.kicks-ass.net

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Peter Zijlstra
On Thu, Nov 29, 2018 at 09:02:23AM -0800, Andy Lutomirski wrote:
> > On Nov 29, 2018, at 8:50 AM, Linus Torvalds  
> > wrote:

> > So no. Do *not* try to change %rsp on the stack in the bp handler.
> > Instead, I'd suggest:
> > 
> > - just restart the instruction (with the suggested "ptregs->rip --")
> > 
> > - to avoid any "oh, we're not making progress" issues, just fix the
> > instruction yourself to be the right call, by looking it up in the
> > "what needs to be fixed" tables.
> > 
> > No?

> Or do you think we can avoid the IPI while the int3 is there?

I'm thinking Linus is suggesting the #BP handler does the text write too
(as a competing store) and then sync_core() and restarts.

But I think that is broken, because then there is no telling what the
other CPUs will observe.


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski



> On Nov 29, 2018, at 8:50 AM, Linus Torvalds  
> wrote:
> 
>> On Thu, Nov 29, 2018 at 8:33 AM Josh Poimboeuf  wrote:
>> 
>> This seems to work...
>> 
>> +   .if \create_gap == 1
>> +   .rept 6
>> +   pushq 5*8(%rsp)
>> +   .endr
>> +   .endif
>> +
>> -idtentry int3			do_int3			has_error_code=0
>> +idtentry int3			do_int3			has_error_code=0	create_gap=1
> 
> Ugh. Doesn't this entirely screw up the stack layout, which then
> screws up  task_pt_regs(), which then breaks ptrace and friends?
> 
> ... and you'd only notice it for users that use int3 in user space,
> which now writes random locations on the kernel stack, which is then a
> huge honking security hole.
> 
> It's possible that I'm confused, but let's not play random games with
> the stack like this. The entry code is sacred, in scary ways.
> 
> So no. Do *not* try to change %rsp on the stack in the bp handler.
> Instead, I'd suggest:
> 
> - just restart the instruction (with the suggested "ptregs->rip --")
> 
> - to avoid any "oh, we're not making progress" issues, just fix the
> instruction yourself to be the right call, by looking it up in the
> "what needs to be fixed" tables.
> 
> No?

I thought that too.  I think it deadlocks. CPU A does text_poke_bp().  CPU B is 
waiting for a spinlock with IRQs off.  CPU C holds the spinlock and hits the 
int3.  The int3 never goes away because CPU A is waiting for CPU B to handle 
the sync_core IPI.

Or do you think we can avoid the IPI while the int3 is there?
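
The deadlock as a schedule, sketched (not code from the thread):

	/*
	 * CPU A                  CPU B                  CPU C (holds lock L)
	 * text_poke_bp():
	 *   write int3
	 *                        spin_lock_irqsave(L)   hits the int3;
	 *                        spins, IRQs off        "regs->ip -= 1" retries
	 *   on_each_cpu(sync)
	 *   waits on B forever   never takes the IPI    never gets past the
	 *                                               int3, never drops L
	 */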

Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski



> On Nov 29, 2018, at 8:49 AM, Peter Zijlstra  wrote:
> 
> On Thu, Nov 29, 2018 at 10:33:42AM -0600, Josh Poimboeuf wrote:
>>> can't we 'fix' that again? The alternative is moving that IRET-frame and
>>> fixing everything up, which is going to be fragile, ugly and such
>>> things more.
> 
>> This seems to work...
> 
> That's almost too easy... nice!

It is indeed too easy: you’re putting pt_regs in the wrong place for int3 from 
user mode, which is probably a root hole if you arrange for a ptraced process 
to do int3 and try to write to whatever register aliases CS.

If you make it conditional on CPL, do it for 32-bit as well, add comments, and 
convince yourself that there isn’t a better solution (like pointing IP at a 
stub that retpolines to the target by reading the function pointer, a la the 
unoptimizable version), then okay, I guess, with only a small amount of 
grumbling.

> 
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index ce25d84023c0..184523447d35 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR		irq_work_interrupt	smp_irq_work_interrupt
>>   * @paranoid == 2 is special: the stub will never switch stacks.  This is for
>>   * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
>>   */
>> -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
>> +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
>>  ENTRY(\sym)
>>  	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
>>  
>> @@ -891,6 +891,12 @@ ENTRY(\sym)
>>  	pushq	$-1			/* ORIG_RAX: no syscall to restart */
>>  	.endif
>>  
>> +	.if \create_gap == 1
>> +	.rept 6
>> +	pushq 5*8(%rsp)
>> +	.endr
>> +	.endif
>> +
>>  	.if \paranoid == 1
>>  	testb	$3, CS-ORIG_RAX(%rsp)	/* If coming from userspace, switch stacks */
>>  	jnz	.Lfrom_usermode_switch_stack_\@
>> @@ -1126,7 +1132,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
>>  #endif /* CONFIG_HYPERV */
>>  
>>  idtentry debug		do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
>> -idtentry int3		do_int3			has_error_code=0
>> +idtentry int3		do_int3			has_error_code=0	create_gap=1
>>  idtentry stack_segment	do_stack_segment	has_error_code=1
>>  
>>  #ifdef CONFIG_XEN_PV


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Steven Rostedt
On Thu, 29 Nov 2018 08:50:16 -0800
Linus Torvalds  wrote:

> Instead, I'd suggest:
> 
>  - just restart the instruction (with the suggested "ptregs->rip --")
> 
>  - to avoid any "oh, we're not making progress" issues, just fix the
> instruction yourself to be the right call, by looking it up in the
> "what needs to be fixed" tables.

So basically this will cause the code to go into a spin while we are
doing the update, right?

-- Steve


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 8:33 AM Josh Poimboeuf  wrote:
>
> This seems to work...
>
> +   .if \create_gap == 1
> +   .rept 6
> +   pushq 5*8(%rsp)
> +   .endr
> +   .endif
> +
> -idtentry int3			do_int3			has_error_code=0
> +idtentry int3			do_int3			has_error_code=0	create_gap=1

Ugh. Doesn't this entirely screw up the stack layout, which then
screws up  task_pt_regs(), which then breaks ptrace and friends?

... and you'd only notice it for users that use int3 in user space,
which now writes random locations on the kernel stack, which is then a
huge honking security hole.

It's possible that I'm confused, but let's not play random games with
the stack like this. The entry code is sacred, in scary ways.

So no. Do *not* try to change %rsp on the stack in the bp handler.
Instead, I'd suggest:

 - just restart the instruction (with the suggested "ptregs->rip --")

 - to avoid any "oh, we're not making progress" issues, just fix the
instruction yourself to be the right call, by looking it up in the
"what needs to be fixed" tables.

No?

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Peter Zijlstra
On Thu, Nov 29, 2018 at 10:33:42AM -0600, Josh Poimboeuf wrote:
> > can't we 'fix' that again? The alternative is moving that IRET-frame and
> > fixing everything up, which is going to be fragile, ugly and such
> > things more.

> This seems to work...

That's almost too easy... nice!

> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index ce25d84023c0..184523447d35 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR		irq_work_interrupt	smp_irq_work_interrupt
>   * @paranoid == 2 is special: the stub will never switch stacks.  This is for
>   * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
>   */
> -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
> +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
>  ENTRY(\sym)
>  	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
>  
> @@ -891,6 +891,12 @@ ENTRY(\sym)
>  	pushq	$-1			/* ORIG_RAX: no syscall to restart */
>  	.endif
>  
> +	.if \create_gap == 1
> +	.rept 6
> +	pushq 5*8(%rsp)
> +	.endr
> +	.endif
> +
>  	.if \paranoid == 1
>  	testb	$3, CS-ORIG_RAX(%rsp)	/* If coming from userspace, switch stacks */
>  	jnz	.Lfrom_usermode_switch_stack_\@
> @@ -1126,7 +1132,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
>  #endif /* CONFIG_HYPERV */
>  
>  idtentry debug		do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
> -idtentry int3		do_int3			has_error_code=0
> +idtentry int3		do_int3			has_error_code=0	create_gap=1
>  idtentry stack_segment	do_stack_segment	has_error_code=1
>  
>  #ifdef CONFIG_XEN_PV


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 03:38:53PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 05:37:39AM -0800, Andy Lutomirski wrote:
> > 
> > 
> > > On Nov 29, 2018, at 1:42 AM, Peter Zijlstra  wrote:
> > > 
> > > On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote:
> > > 
> >  +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
> >  +{
> >  +struct static_call_bp_data *data = _data;
> >  +
> >  +/*
> >  + * For inline static calls, push the return address on the stack 
> >  so the
> >  + * "called" function will return to the location immediately 
> >  after the
> >  + * call site.
> >  + *
> >  + * NOTE: This code will need to be revisited when kernel CET gets
> >  + *   implemented.
> >  + */
> >  +if (data->ret) {
> >  +regs->sp -= sizeof(long);
> >  +*(unsigned long *)regs->sp = data->ret;
> >  +}
> > >> 
> > >> You can’t do this.  Depending on the alignment of the old RSP, which
> > >> is not guaranteed, this overwrites regs->cs.  IRET goes boom.
> > > 
> > > I don't get it; can you spell that out?
> > > 
> > > The way I understand it is that we're at a location where a "E8 - Near
> > > CALL" instruction should be, and thus RSP should be the regular kernel
> > > stack, and the above simply does "PUSH ret", which is what that CALL
> > > would've done too.
> > > 
> > 
> > int3 isn’t IST anymore, so the int3 instruction conditionally
> > subtracts 8 from RSP and then pushes SS, etc. So my email was
> > obviously wrong wrt “cs”, but you’re still potentially overwriting the
> > int3 IRET frame.
> 
> ARGH!..
> 
> can't we 'fix' that again? The alternative is moving that IRET-frame and
> fixing everything up, which is going to be fragile, ugly and such
> things more.
> 
> Commit d8ba61ba58c8 ("x86/entry/64: Don't use IST entry for #BP stack")
> doesn't list any strong reasons for why it should NOT be an IST.

This seems to work...

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ce25d84023c0..184523447d35 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR		irq_work_interrupt	smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -891,6 +891,12 @@ ENTRY(\sym)
 	pushq	$-1				/* ORIG_RAX: no syscall to restart */
 	.endif
 
+	.if \create_gap == 1
+	.rept 6
+	pushq 5*8(%rsp)
+	.endr
+	.endif
+
 	.if \paranoid == 1
 	testb	$3, CS-ORIG_RAX(%rsp)		/* If coming from userspace, switch stacks */
 	jnz	.Lfrom_usermode_switch_stack_\@
@@ -1126,7 +1132,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug		do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3		do_int3			has_error_code=0
+idtentry int3		do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment	do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
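
What the six pushes buy, as a stack sketch: each pushq re-reads the slot
five words above the current %rsp, so the 6-word frame (SS, RSP, RFLAGS,
CS, RIP, ORIG_RAX) gets duplicated 48 bytes lower:

	on entry                 after ".rept 6"
	--------                 ---------------
	ss                       ss        <- stale copy, now scratch
	rsp                      rsp
	rflags                   rflags
	cs                       cs
	rip                      rip
	orig_rax  <- %rsp        orig_rax
	                         ss        <- live frame / pt_regs
	                         rsp
	                         rflags
	                         cs
	                         rip
	                         orig_rax  <- %rsp

The #BP handler can then emulate the CALL's push by writing the return
address into the stale words just below the interrupted RSP and lowering
the saved ->sp in the live frame, without corrupting the IRET frame it is
about to return through.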


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Jiri Kosina
On Thu, 29 Nov 2018, Peter Zijlstra wrote:

> > int3 isn’t IST anymore, so the int3 instruction conditionally 
> > subtracts 8 from RSP and then pushes SS, etc. So my email was 
> > obviously wrong wrt “cs”, but you’re still potentially overwriting the 
> > int3 IRET frame.
> 
> ARGH!..
> 
> can't we 'fix' that again? The alternative is moving that IRET-frame and
> fixing everything up, which is going to be fragile, ugly and such
> things more.
> 
> Commit d8ba61ba58c8 ("x86/entry/64: Don't use IST entry for #BP stack")
> doesn't list any strong reasons for why it should NOT be an IST.

It's CVE-2018-8897.

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Peter Zijlstra
On Thu, Nov 29, 2018 at 05:37:39AM -0800, Andy Lutomirski wrote:
> 
> 
> > On Nov 29, 2018, at 1:42 AM, Peter Zijlstra  wrote:
> > 
> > On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote:
> > 
>  +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
>  +{
>  +struct static_call_bp_data *data = _data;
>  +
>  +/*
>  + * For inline static calls, push the return address on the stack so 
>  the
>  + * "called" function will return to the location immediately after 
>  the
>  + * call site.
>  + *
>  + * NOTE: This code will need to be revisited when kernel CET gets
>  + *   implemented.
>  + */
>  +if (data->ret) {
>  +regs->sp -= sizeof(long);
>  +*(unsigned long *)regs->sp = data->ret;
>  +}
> >> 
> >> You can’t do this.  Depending on the alignment of the old RSP, which
> >> is not guaranteed, this overwrites regs->cs.  IRET goes boom.
> > 
> > I don't get it; can you spell that out?
> > 
> > The way I understand it is that we're at a location where a "E8 - Near
> > CALL" instruction should be, and thus RSP should be the regular kernel
> > stack, and the above simply does "PUSH ret", which is what that CALL
> > would've done too.
> > 
> 
> int3 isn’t IST anymore, so the int3 instruction conditionally
> subtracts 8 from RSP and then pushes SS, etc. So my email was
> obviously wrong wrt “cs”, but you’re still potentially overwriting the
> int3 IRET frame.

ARGH!..

can't we 'fix' that again? The alternative is moving that IRET-frame and
fixing everything up, which is going to be fragile, ugly and such
things more.

Commit d8ba61ba58c8 ("x86/entry/64: Don't use IST entry for #BP stack")
doesn't list any strong reasons for why it should NOT be an IST.




Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Andy Lutomirski



> On Nov 29, 2018, at 1:42 AM, Peter Zijlstra  wrote:
> 
> On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote:
> 
 +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
 +{
 +struct static_call_bp_data *data = _data;
 +
 +/*
 + * For inline static calls, push the return address on the stack so 
 the
 + * "called" function will return to the location immediately after the
 + * call site.
 + *
 + * NOTE: This code will need to be revisited when kernel CET gets
 + *   implemented.
 + */
 +if (data->ret) {
 +regs->sp -= sizeof(long);
 +*(unsigned long *)regs->sp = data->ret;
 +}
>> 
>> You can’t do this.  Depending on the alignment of the old RSP, which
>> is not guaranteed, this overwrites regs->cs.  IRET goes boom.
> 
> I don't get it; can you spell that out?
> 
> The way I understand it is that we're at a location where a "E8 - Near
> CALL" instruction should be, and thus RSP should be the regular kernel
> stack, and the above simply does "PUSH ret", which is what that CALL
> would've done too.
> 

int3 isn’t IST anymore, so the int3 instruction conditionally subtracts 8 from 
RSP and then pushes SS, etc. So my email was obviously wrong wrt “cs”, but 
you’re still potentially overwriting the int3 IRET frame.

Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Josh Poimboeuf
On Thu, Nov 29, 2018 at 10:42:10AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote:
> 
> > >> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
> > >> +{
> > >> +struct static_call_bp_data *data = _data;
> > >> +
> > >> +/*
> > >> + * For inline static calls, push the return address on the stack so 
> > >> the
> > >> + * "called" function will return to the location immediately after 
> > >> the
> > >> + * call site.
> > >> + *
> > >> + * NOTE: This code will need to be revisited when kernel CET gets
> > >> + *   implemented.
> > >> + */
> > >> +if (data->ret) {
> > >> +regs->sp -= sizeof(long);
> > >> +*(unsigned long *)regs->sp = data->ret;
> > >> +}
> > 
> > You can’t do this.  Depending on the alignment of the old RSP, which
> > is not guaranteed, this overwrites regs->cs.  IRET goes boom.
> 
> I don't get it; can you spell that out?

I don't quite follow that either.  Maybe Andy is referring to x86-32,
for which regs->sp isn't actually saved: see kernel_stack_pointer().

This code is 64-bit only so that's not a concern.

> The way I understand it is that we're at a location where a "E8 - Near
> CALL" instruction should be, and thus RSP should be the regular kernel
> stack, and the above simply does "PUSH ret", which is what that CALL
> would've done too.

Right.

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Peter Zijlstra
On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote:

> >> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
> >> +{
> >> +struct static_call_bp_data *data = _data;
> >> +
> >> +/*
> >> + * For inline static calls, push the return address on the stack so 
> >> the
> >> + * "called" function will return to the location immediately after the
> >> + * call site.
> >> + *
> >> + * NOTE: This code will need to be revisited when kernel CET gets
> >> + *   implemented.
> >> + */
> >> +if (data->ret) {
> >> +regs->sp -= sizeof(long);
> >> +*(unsigned long *)regs->sp = data->ret;
> >> +}
> 
> You can’t do this.  Depending on the alignment of the old RSP, which
> is not guaranteed, this overwrites regs->cs.  IRET goes boom.

I don't get it; can you spell that out?

The way I understand it is that we're at a location where a "E8 - Near
CALL" instruction should be, and thus RSP should be the regular kernel
stack, and the above simply does "PUSH ret", which is what that CALL
would've done too.




Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-28 Thread Andy Lutomirski
> On Nov 27, 2018, at 12:43 AM, Peter Zijlstra  wrote:
>
>> On Mon, Nov 26, 2018 at 03:26:28PM -0600, Josh Poimboeuf wrote:
>>
>> Yeah, that's probably better.  I assume you also mean that we would have
>> all text_poke_bp() users create a handler callback?  That way the
>> interface is clear and consistent for everybody.  Like:
>
> Can do, it does indeed make the interface less like a hack. It is not
> like there are too many users.
>
>> diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
>> index aac0c1f7e354..d4b0abe4912d 100644
>> --- a/arch/x86/kernel/jump_label.c
>> +++ b/arch/x86/kernel/jump_label.c
>> @@ -37,6 +37,11 @@ static void bug_at(unsigned char *ip, int line)
>>BUG();
>> }
>>
>> +static inline void jump_label_bp_handler(struct pt_regs *regs, void *data)
>> +{
>> +regs->ip += JUMP_LABEL_NOP_SIZE - 1;
>> +}
>> +
>> static void __ref __jump_label_transform(struct jump_entry *entry,
>> 					 enum jump_label_type type,
>> 					 void *(*poker)(void *, const void *, size_t),
>> @@ -91,7 +96,7 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
>>}
>>
>>text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
>> - (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
>> + jump_label_bp_handler, NULL);
>> }
>>
>> void arch_jump_label_transform(struct jump_entry *entry,
>
> Per that example..
>
>> diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
>> index d3869295b88d..e05ebc6d4db5 100644
>> --- a/arch/x86/kernel/static_call.c
>> +++ b/arch/x86/kernel/static_call.c
>> @@ -7,24 +7,30 @@
>>
>> #define CALL_INSN_SIZE 5
>>
>> +struct static_call_bp_data {
>> +unsigned long func, ret;
>> +};
>> +
>> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
>> +{
>> +struct static_call_bp_data *data = _data;
>> +
>> +/*
>> + * For inline static calls, push the return address on the stack so the
>> + * "called" function will return to the location immediately after the
>> + * call site.
>> + *
>> + * NOTE: This code will need to be revisited when kernel CET gets
>> + *   implemented.
>> + */
>> +if (data->ret) {
>> +regs->sp -= sizeof(long);
>> +*(unsigned long *)regs->sp = data->ret;
>> +}

You can’t do this.  Depending on the alignment of the old RSP, which
is not guaranteed, this overwrites regs->cs.  IRET goes boom.

Maybe it could be fixed by pointing regs->ip at a real trampoline?

This code is subtle and executed rarely, which is a bad combination.
It would be great if we had a test case.

I think it would be great if the implementation could be, literally:

regs->ip -= 1;
return;

IOW, just retry and wait until we get the new patched instruction.
The problem is that, if we're in a context where IRQs are off, then
we're preventing on_each_cpu() from completing and, even if we somehow
just let the code know that we already serialized ourselves, we're
still potentially holding a spinlock that another CPU is waiting for
with IRQs off.  Ugh.  Anyone have a clever idea to fix that?


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-27 Thread Peter Zijlstra
On Tue, Nov 27, 2018 at 09:43:30AM +0100, Peter Zijlstra wrote:
> Now; if I'm not mistaken, the below @site is in fact @regs->ip - 1, no?
> 
> We already patched site with INT3, which is what we just trapped on. So
> we could in fact write something like:
> 
> static void static_call_bp_handler(struct pt_regs *regs, void *data)
> {
>   struct static_call_bp_data *scd = data;
> 
>   switch (data->type) {
>   case CALL_INSN: /* emulate CALL instruction */
>   regs->sp -= sizeof(unsigned long);
>   *(unsigned long *)regs->sp = regs->ip + CALL_INSN_SIZE - 1;
>   regs->ip = data->func;
>   break;
> 
>   case JMP_INSN: /* emulate JMP instruction */
>   regs->ip = data->func;
>   break;
>   }
> }

>   handler_data = (struct static_call_bp_data){
> 		.type = IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) ? CALL_INSN : JMP_INSN,
>   .func = func,
>   };

Heck; check this:

static void static_call_bp_handler(struct pt_regs *regs, void *data)
{
#ifdef CONFIG_HAVE_STATIC_CALL_INLINE

/* emulate CALL instruction */
regs->sp -= sizeof(unsigned long);
*(unsigned long *)regs->sp = regs->ip + CALL_INSN_SIZE - 1;
regs->ip = data;

#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */

/* emulate JMP instruction */
regs->ip = data;

#endif
}




Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-27 Thread Peter Zijlstra
On Mon, Nov 26, 2018 at 02:14:49PM -0600, Josh Poimboeuf wrote:
> On Mon, Nov 26, 2018 at 10:28:08AM -0800, Andy Lutomirski wrote:

> > Can you add a comment that it will need updating when kernel CET is added?
> 
> Will do, though I get the feeling there's a lot of other (existing) code
> that will also need to change for kernel CET.

Yeah, function graph tracer and kretprobes at the very least. But I
suspect there's a few surprises to be had once they try kernel CET.



Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-27 Thread Peter Zijlstra
On Mon, Nov 26, 2018 at 03:26:28PM -0600, Josh Poimboeuf wrote:

> Yeah, that's probably better.  I assume you also mean that we would have
> all text_poke_bp() users create a handler callback?  That way the
> interface is clear and consistent for everybody.  Like:

Can do, it does indeed make the interface less like a hack. It is not
like there are too many users.

> diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
> index aac0c1f7e354..d4b0abe4912d 100644
> --- a/arch/x86/kernel/jump_label.c
> +++ b/arch/x86/kernel/jump_label.c
> @@ -37,6 +37,11 @@ static void bug_at(unsigned char *ip, int line)
>   BUG();
>  }
>  
> +static inline void jump_label_bp_handler(struct pt_regs *regs, void *data)
> +{
> + regs->ip += JUMP_LABEL_NOP_SIZE - 1;
> +}
> +
>  static void __ref __jump_label_transform(struct jump_entry *entry,
> 					  enum jump_label_type type,
> 					  void *(*poker)(void *, const void *, size_t),
> @@ -91,7 +96,7 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
>   }
>  
>   text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
> -  (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
> +  jump_label_bp_handler, NULL);
>  }
>  
>  void arch_jump_label_transform(struct jump_entry *entry,

Per that example..

> diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> index d3869295b88d..e05ebc6d4db5 100644
> --- a/arch/x86/kernel/static_call.c
> +++ b/arch/x86/kernel/static_call.c
> @@ -7,24 +7,30 @@
>  
>  #define CALL_INSN_SIZE 5
>  
> +struct static_call_bp_data {
> + unsigned long func, ret;
> +};
> +
> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
> +{
> + struct static_call_bp_data *data = _data;
> +
> + /*
> +  * For inline static calls, push the return address on the stack so the
> +  * "called" function will return to the location immediately after the
> +  * call site.
> +  *
> +  * NOTE: This code will need to be revisited when kernel CET gets
> +  *   implemented.
> +  */
> + if (data->ret) {
> + regs->sp -= sizeof(long);
> + *(unsigned long *)regs->sp = data->ret;
> + }
> +
> + /* The exception handler will 'return' to the destination function. */
> + regs->ip = data->func;
> +}

Now; if I'm not mistaken, the below @site is in fact @regs->ip - 1, no?

We already patched site with INT3, which is what we just trapped on. So
we could in fact write something like:

static void static_call_bp_handler(struct pt_regs *regs, void *data)
{
struct static_call_bp_data *scd = data;

switch (data->type) {
case CALL_INSN: /* emulate CALL instruction */
regs->sp -= sizeof(unsigned long);
*(unsigned long *)regs->sp = regs->ip + CALL_INSN_SIZE - 1;
regs->ip = data->func;
break;

case JMP_INSN: /* emulate JMP instruction */
regs->ip = data->func;
break;
}
}

>  void arch_static_call_transform(void *site, void *tramp, void *func)
>  {
> @@ -32,11 +38,17 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
>   unsigned long insn;
>   unsigned char insn_opcode;
>   unsigned char opcodes[CALL_INSN_SIZE];
> + struct static_call_bp_data handler_data;
> +
> + handler_data.func = (unsigned long)func;
>  
> - if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) {
>   insn = (unsigned long)site;
> + handler_data.ret = insn + CALL_INSN_SIZE;
> + } else {
>   insn = (unsigned long)tramp;
> + handler_data.ret = 0;
> + }

handler_data = (struct static_call_bp_data){
		.type = IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) ? CALL_INSN : JMP_INSN,
.func = func,
};

>   mutex_lock(&text_mutex);
>  
> @@ -52,14 +64,9 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
>   opcodes[0] = insn_opcode;
>   memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
>  
>   /* Patch the call site: */
>   text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> +  static_call_bp_handler, &handler_data);
>  
>  done:
>   mutex_unlock(&text_mutex);


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Josh Poimboeuf
On Mon, Nov 26, 2018 at 09:08:01PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 26, 2018 at 11:56:24AM -0600, Josh Poimboeuf wrote:
> > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> > index d3869295b88d..8fd6c8556750 100644
> > --- a/arch/x86/kernel/static_call.c
> > +++ b/arch/x86/kernel/static_call.c
> > @@ -7,24 +7,19 @@
> >  
> >  #define CALL_INSN_SIZE 5
> >  
> > +unsigned long bp_handler_call_return_addr;
> >  
> > +static void static_call_bp_handler(struct pt_regs *regs)
> > +{
> >  #ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > +   /*
> > +* Push the return address on the stack so the "called" function will
> > +* return to immediately after the call site.
> > +*/
> > +   regs->sp -= sizeof(long);
> > +   *(unsigned long *)regs->sp = bp_handler_call_return_addr;
> >  #endif
> > +}
> >  
> >  void arch_static_call_transform(void *site, void *tramp, void *func)
> >  {
> > @@ -52,14 +47,12 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > opcodes[0] = insn_opcode;
> > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> >  
> > if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > +   bp_handler_call_return_addr = insn + CALL_INSN_SIZE;
> >  
> > /* Patch the call site: */
> > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> > -static_call_bp_handler);
> > +static_call_bp_handler, func);
> >  
> >  done:
> > mutex_unlock(&text_mutex);
> 
> 
> like maybe something along the lines of:
> 
> struct sc_data {
>   unsigned long ret;
>   unsigned long ip;
> };
> 
> void sc_handler(struct pt_regs *regs, void *data)
> {
>   struct sc_data *scd = data;
> 
>   regs->sp -= sizeof(long);
>   *(unsigned long *)regs->sp = scd->ret;
>   regs->ip = scd->ip;
> }
> 
> arch_static_call_transform()
> {
>   ...
> 
>   scd = (struct sc_data){
>   .ret = insn + CALL_INSN_SIZE,
>   .ip = (unsigned long)func,
>   };
> 
>   text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
>   sc_handler, (void *)&scd);
> 
>   ...
> }

Yeah, that's probably better.  I assume you also mean that we would have
all text_poke_bp() users create a handler callback?  That way the
interface is clear and consistent for everybody.  Like:

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..04d6cf838fb7 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -20,6 +20,8 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 
 extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 
+typedef void (*bp_handler_t)(struct pt_regs *regs, void *data);
+
 /*
  * Clear and restore the kernel write-protection flag on the local CPU.
  * Allows the kernel to edit read-only pages.
@@ -36,7 +38,8 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
-extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void *text_poke_bp(void *addr, const void *opcode, size_t len,
+ bp_handler_t handler, void *data);
 extern int after_bootmem;
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ebeac487a20c..547af714bd60 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -738,7 +738,8 @@ static void do_sync_core(void *info)
 }
 
 static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static void *bp_int3_data, *bp_int3_addr;
+static bp_handler_t bp_int3_handler;
 
 int poke_int3_handler(struct pt_regs *regs)
 {
@@ -746,11 +747,11 @@ int poke_int3_handler(struct pt_regs *regs)
 * Having observed our INT3 instruction, we now must observe
 * bp_patching_in_progress.
 *
-	 *	in_progress = TRUE		INT3
-	 *	WMB				RMB
-	 *	write INT3			if (in_progress)
+	 *	in_progress = TRUE	INT3
+	 *	WMB			RMB
+	 *	write INT3		if (in_progress)
 *
-* Idem for bp_int3_handler.
+* Idem for bp_int3_data.
 */
smp_rmb();
 
@@ -760,8 +761,7 @@ int poke_int3_handler(struct pt_regs *regs)
if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
return 0;
 
-   /* set up the specified breakpoint handler */
-   regs->ip = (unsigned long) bp_int3_handler;
+   bp_int3_handler(regs, bp_int3_data);
 
return 1;
 
@@ -772,7 +772,8 @@ int poke_int3_handler(struct pt_regs *regs)
  * @addr:  address to patch
 * @opcode:	opcode of new instruction
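
For illustration, a caller of the reworked interface might look roughly
like the sketch below; the handler and its use of @data as a resume
address are assumptions for the example, not part of the patch:

/* A handler that simply resumes after the patched instruction. */
static void my_bp_handler(struct pt_regs *regs, void *data)
{
	regs->ip = (unsigned long)data;	/* resume address passed as @data */
}

static void patch_site(void *addr, const void *opcode, size_t len)
{
	/* Resume right after the patched bytes once the INT3 is hit. */
	text_poke_bp(addr, opcode, len, my_bp_handler, (char *)addr + len);
}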

Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Josh Poimboeuf
On Mon, Nov 26, 2018 at 10:28:08AM -0800, Andy Lutomirski wrote:
> On Mon, Nov 26, 2018 at 9:10 AM Josh Poimboeuf  wrote:
> >
> > On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> > > > index 8026d176f25c..d3869295b88d 100644
> > > > --- a/arch/x86/kernel/static_call.c
> > > > +++ b/arch/x86/kernel/static_call.c
> > > > @@ -9,13 +9,21 @@
> > > >
> > > >  void static_call_bp_handler(void);
> > > >  void *bp_handler_dest;
> > > > +void *bp_handler_continue;
> > > >
> > > >  asm(".pushsection .text, \"ax\"
> > > > \n"
> > > >  ".globl static_call_bp_handler \n"
> > > >  ".type static_call_bp_handler, @function   
> > > > \n"
> > > >  "static_call_bp_handler:   
> > > > \n"
> > > > -"ANNOTATE_RETPOLINE_SAFE   
> > > > \n"
> > > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > > +ANNOTATE_RETPOLINE_SAFE
> > > > +"call *bp_handler_dest \n"
> > > > +ANNOTATE_RETPOLINE_SAFE
> > > > +"jmp *bp_handler_continue  
> > > > \n"
> > > > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> > > > +ANNOTATE_RETPOLINE_SAFE
> > > >  "jmp *bp_handler_dest  \n"
> > > > +#endif
> > > >  ".popsection   
> > > > \n");
> > > >
> > > >  void arch_static_call_transform(void *site, void *tramp, void *func)
> > > > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > > > unsigned char insn_opcode;
> > > > unsigned char opcodes[CALL_INSN_SIZE];
> > > >
> > > > -   insn = (unsigned long)tramp;
> > > > +   if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > > +   insn = (unsigned long)site;
> > > > +   else
> > > > +   insn = (unsigned long)tramp;
> > > >
> > > > mutex_lock(&text_mutex);
> > > >
> > > > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > > > opcodes[0] = insn_opcode;
> > > > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> > > >
> > > > -   /* Set up the variable for the breakpoint handler: */
> > > > +   /* Set up the variables for the breakpoint handler: */
> > > > bp_handler_dest = func;
> > > > +   if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > > +   bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
> > > >
> > > > /* Patch the call site: */
> > > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> > >
> > > OK, so this is where that static_call_bp_handler comes from; you need
> > > that CALL to frob the stack.
> > >
> > > But I still think it is broken; consider:
> > >
> > > 	CPU0				CPU1
> > >
> > > 	bp_handler = ponies;
> > >
> > > 	text_poke_bp(, &static_call_bp_handler)
> > > 	  text_poke(&int3);
> > > 	  on_each_cpu(sync)
> > > 				<IPI>
> > > 				  ...
> > > 				</IPI>
> > >
> > > 	  text_poke(/* all but first bytes */)
> > > 	  on_each_cpu(sync)
> > > 				<IPI>
> > > 				  ...
> > > 				</IPI>
> > >
> > > 				<int3>
> > > 				  pt_regs->ip = &static_call_bp_handler
> > > 				</int3>
> > >
> > > 				// VCPU takes a nap...
> > > 	  text_poke(/* first byte */)
> > > 	  on_each_cpu(sync)
> > > 				<IPI>
> > > 				  ...
> > > 				</IPI>
> > >
> > > 				// VCPU sleeps more
> > > 				bp_handler = unicorn;
> > >
> > > 				CALL unicorn
> > >
> > > *whoops*
> > >
> > > Now, granted, that is all rather 'unlikely', but that never stopped
> > > Murphy.
> >
> > Good find, thanks Peter.
> >
> > As we discussed on IRC, we'll need to fix this from within the int3
> > exception handler by faking the call: putting a fake return address on
> > the stack (pointing to right after the call) and setting regs->ip to the
> > called function.
> 
> Can you add a comment that it will need updating when kernel CET is added?

Will do, though I get the feeling there's a lot of other (existing) code
that will also need to change for kernel CET.

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Peter Zijlstra
On Mon, Nov 26, 2018 at 11:56:24AM -0600, Josh Poimboeuf wrote:
> diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> index d3869295b88d..8fd6c8556750 100644
> --- a/arch/x86/kernel/static_call.c
> +++ b/arch/x86/kernel/static_call.c
> @@ -7,24 +7,19 @@
>  
>  #define CALL_INSN_SIZE 5
>  
> +unsigned long bp_handler_call_return_addr;
>  
> +static void static_call_bp_handler(struct pt_regs *regs)
> +{
>  #ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> + /*
> +  * Push the return address on the stack so the "called" function will
> +  * return to immediately after the call site.
> +  */
> + regs->sp -= sizeof(long);
> + *(unsigned long *)regs->sp = bp_handler_call_return_addr;
>  #endif
> +}
>  
>  void arch_static_call_transform(void *site, void *tramp, void *func)
>  {
> @@ -52,14 +47,12 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
>   opcodes[0] = insn_opcode;
>   memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
>  
>   if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> + bp_handler_call_return_addr = insn + CALL_INSN_SIZE;
>  
>   /* Patch the call site: */
>   text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> -  static_call_bp_handler);
> +  static_call_bp_handler, func);
>  
>  done:
>   mutex_unlock(&text_mutex);


like maybe something along the lines of:

struct sc_data {
unsigned long ret;
unsigned long ip;
};

void sc_handler(struct pt_regs *regs, void *data)
{
struct sc_data *scd = data;

regs->sp -= sizeof(long);
*(unsigned long *)regs->sp = scd->ret;
regs->ip = scd->ip;
}

arch_static_call_transform()
{
...

scd = (struct sc_data){
.ret = insn + CALL_INSN_SIZE,
.ip = (unsigned long)func,
};

text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
sc_handler, (void *)&scd);

...
}


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Peter Zijlstra
On Mon, Nov 26, 2018 at 11:56:24AM -0600, Josh Poimboeuf wrote:
> Peter suggested updating the text_poke_bp() interface to add a handler
> which is called from int3 context.  This seems to work.

> @@ -760,8 +761,10 @@ int poke_int3_handler(struct pt_regs *regs)
>   if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
>   return 0;
>  
> - /* set up the specified breakpoint handler */
> - regs->ip = (unsigned long) bp_int3_handler;
> + if (bp_int3_handler)
> + bp_int3_handler(regs);
> +
> + regs->ip = (unsigned long)bp_int3_resume;
>  
>   return 1;
>  

Peter also suggested you write that like:

	if (bp_int3_handler)
		bp_int3_handler(regs, resume);
	else
		regs->ip = resume;

That allows 'abusing' @resume as 'data' pointer for @handler. Which
allows for more complicated handlers.
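
A sketch of that dispatch, with the resume slot doubling as the
handler's data pointer (the variable names here are assumed):

static bp_handler_t bp_int3_handler;
static void *bp_int3_resume;

static int handle_int3(struct pt_regs *regs)
{
	if (bp_int3_handler)
		bp_int3_handler(regs, bp_int3_resume);	/* handler decodes the slot */
	else
		regs->ip = (unsigned long)bp_int3_resume;	/* plain resume address */
	return 1;
}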


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Andy Lutomirski
On Mon, Nov 26, 2018 at 9:10 AM Josh Poimboeuf  wrote:
>
> On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> > > index 8026d176f25c..d3869295b88d 100644
> > > --- a/arch/x86/kernel/static_call.c
> > > +++ b/arch/x86/kernel/static_call.c
> > > @@ -9,13 +9,21 @@
> > >
> > >  void static_call_bp_handler(void);
> > >  void *bp_handler_dest;
> > > +void *bp_handler_continue;
> > >
> > >  asm(".pushsection .text, \"ax\"  
> > >   \n"
> > >  ".globl static_call_bp_handler \n"
> > >  ".type static_call_bp_handler, @function 
> > >   \n"
> > >  "static_call_bp_handler: 
> > >   \n"
> > > -"ANNOTATE_RETPOLINE_SAFE 
> > >   \n"
> > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > +ANNOTATE_RETPOLINE_SAFE
> > > +"call *bp_handler_dest \n"
> > > +ANNOTATE_RETPOLINE_SAFE
> > > +"jmp *bp_handler_continue
> > >   \n"
> > > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> > > +ANNOTATE_RETPOLINE_SAFE
> > >  "jmp *bp_handler_dest  \n"
> > > +#endif
> > >  ".popsection   \n");
> > >
> > >  void arch_static_call_transform(void *site, void *tramp, void *func)
> > > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > > unsigned char insn_opcode;
> > > unsigned char opcodes[CALL_INSN_SIZE];
> > >
> > > -   insn = (unsigned long)tramp;
> > > +   if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > +   insn = (unsigned long)site;
> > > +   else
> > > +   insn = (unsigned long)tramp;
> > >
> > > mutex_lock(&text_mutex);
> > >
> > > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > > opcodes[0] = insn_opcode;
> > > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> > >
> > > -   /* Set up the variable for the breakpoint handler: */
> > > +   /* Set up the variables for the breakpoint handler: */
> > > bp_handler_dest = func;
> > > +   if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > +   bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
> > >
> > > /* Patch the call site: */
> > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> >
> > OK, so this is where that static_call_bp_handler comes from; you need
> > that CALL to frob the stack.
> >
> > But I still think it is broken; consider:
> >
> > 	CPU0				CPU1
> >
> > 	bp_handler = ponies;
> >
> > 	text_poke_bp(, &static_call_bp_handler)
> > 	  text_poke(&int3);
> > 	  on_each_cpu(sync)
> > 				<IPI>
> > 				  ...
> > 				</IPI>
> >
> > 	  text_poke(/* all but first bytes */)
> > 	  on_each_cpu(sync)
> > 				<IPI>
> > 				  ...
> > 				</IPI>
> >
> > 				<int3>
> > 				  pt_regs->ip = &static_call_bp_handler
> > 				</int3>
> >
> > 				// VCPU takes a nap...
> > 	  text_poke(/* first byte */)
> > 	  on_each_cpu(sync)
> > 				<IPI>
> > 				  ...
> > 				</IPI>
> >
> > 				// VCPU sleeps more
> > 				bp_handler = unicorn;
> >
> > 				CALL unicorn
> >
> > *whoops*
> >
> > Now, granted, that is all rather 'unlikely', but that never stopped
> > Murphy.
>
> Good find, thanks Peter.
>
> As we discussed on IRC, we'll need to fix this from within the int3
> exception handler by faking the call: putting a fake return address on
> the stack (pointing to right after the call) and setting regs->ip to the
> called function.

Can you add a comment that it will need updating when kernel CET is added?


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Josh Poimboeuf
On Mon, Nov 26, 2018 at 11:10:36AM -0600, Josh Poimboeuf wrote:
> On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> > > index 8026d176f25c..d3869295b88d 100644
> > > --- a/arch/x86/kernel/static_call.c
> > > +++ b/arch/x86/kernel/static_call.c
> > > @@ -9,13 +9,21 @@
> > >  
> > >  void static_call_bp_handler(void);
> > >  void *bp_handler_dest;
> > > +void *bp_handler_continue;
> > >  
> > >  asm(".pushsection .text, \"ax\"  
> > > \n"
> > >  ".globl static_call_bp_handler   
> > > \n"
> > >  ".type static_call_bp_handler, @function 
> > > \n"
> > >  "static_call_bp_handler: 
> > > \n"
> > > -"ANNOTATE_RETPOLINE_SAFE 
> > > \n"
> > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > +ANNOTATE_RETPOLINE_SAFE
> > > +"call *bp_handler_dest   
> > > \n"
> > > +ANNOTATE_RETPOLINE_SAFE
> > > +"jmp *bp_handler_continue
> > > \n"
> > > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> > > +ANNOTATE_RETPOLINE_SAFE
> > >  "jmp *bp_handler_dest
> > > \n"
> > > +#endif
> > >  ".popsection \n");
> > >  
> > >  void arch_static_call_transform(void *site, void *tramp, void *func)
> > > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > >   unsigned char insn_opcode;
> > >   unsigned char opcodes[CALL_INSN_SIZE];
> > >  
> > > - insn = (unsigned long)tramp;
> > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > + insn = (unsigned long)site;
> > > + else
> > > + insn = (unsigned long)tramp;
> > >  
> > >   mutex_lock(&text_mutex);
> > >  
> > > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > >   opcodes[0] = insn_opcode;
> > >   memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> > >  
> > > - /* Set up the variable for the breakpoint handler: */
> > > + /* Set up the variables for the breakpoint handler: */
> > >   bp_handler_dest = func;
> > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
> > >  
> > >   /* Patch the call site: */
> > >   text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> > 
> > OK, so this is where that static_call_bp_handler comes from; you need
> > that CALL to frob the stack.
> > 
> > But I still think it is broken; consider:
> > 
> > 	CPU0				CPU1
> >
> > 	bp_handler = ponies;
> >
> > 	text_poke_bp(, &static_call_bp_handler)
> > 	  text_poke(&int3);
> > 	  on_each_cpu(sync)
> > 				<IPI>
> > 				  ...
> > 				</IPI>
> >
> > 	  text_poke(/* all but first bytes */)
> > 	  on_each_cpu(sync)
> > 				<IPI>
> > 				  ...
> > 				</IPI>
> >
> > 				<int3>
> > 				  pt_regs->ip = &static_call_bp_handler
> > 				</int3>
> >
> > 				// VCPU takes a nap...
> > 	  text_poke(/* first byte */)
> > 	  on_each_cpu(sync)
> > 				<IPI>
> > 				  ...
> > 				</IPI>
> >
> > 				// VCPU sleeps more
> > 				bp_handler = unicorn;
> >
> > 				CALL unicorn
> >
> > *whoops*
> > 
> > Now, granted, that is all rather 'unlikely', but that never stopped
> > Murphy.
> 
> Good find, thanks Peter.
> 
> As we discussed on IRC, we'll need to fix this from within the int3
> exception handler by faking the call: putting a fake return address on
> the stack (pointing to right after the call) and setting regs->ip to the
> called function.
> 
> And for the out-of-line case we can just jump straight to the function,
> so the function itself will be the text_poke_bp() "handler".
> 
> So the static_call_bp_handler() trampoline will go away.

Peter suggested updating the text_poke_bp() interface to add a handler
which is called from int3 context.  This seems to work.

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..7fcaa37c1876 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -20,6 +20,8 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 
extern void *text_poke_early(void *addr, const void *opcode, size_t len);

Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Josh Poimboeuf
On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> > index 8026d176f25c..d3869295b88d 100644
> > --- a/arch/x86/kernel/static_call.c
> > +++ b/arch/x86/kernel/static_call.c
> > @@ -9,13 +9,21 @@
> >  
> >  void static_call_bp_handler(void);
> >  void *bp_handler_dest;
> > +void *bp_handler_continue;
> >  
> >  asm(".pushsection .text, \"ax\"
> > \n"
> >  ".globl static_call_bp_handler \n"
> >  ".type static_call_bp_handler, @function   
> > \n"
> >  "static_call_bp_handler:   
> > \n"
> > -"ANNOTATE_RETPOLINE_SAFE   
> > \n"
> > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > +ANNOTATE_RETPOLINE_SAFE
> > +"call *bp_handler_dest \n"
> > +ANNOTATE_RETPOLINE_SAFE
> > +"jmp *bp_handler_continue  
> > \n"
> > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> > +ANNOTATE_RETPOLINE_SAFE
> >  "jmp *bp_handler_dest  \n"
> > +#endif
> >  ".popsection   \n");
> >  
> >  void arch_static_call_transform(void *site, void *tramp, void *func)
> > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > unsigned char insn_opcode;
> > unsigned char opcodes[CALL_INSN_SIZE];
> >  
> > -   insn = (unsigned long)tramp;
> > +   if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > +   insn = (unsigned long)site;
> > +   else
> > +   insn = (unsigned long)tramp;
> >  
> > mutex_lock(&text_mutex);
> >  
> > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > opcodes[0] = insn_opcode;
> > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> >  
> > -   /* Set up the variable for the breakpoint handler: */
> > +   /* Set up the variables for the breakpoint handler: */
> > bp_handler_dest = func;
> > +   if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > +   bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
> >  
> > /* Patch the call site: */
> > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> 
> OK, so this is where that static_call_bp_handler comes from; you need
> that CALL to frob the stack.
> 
> But I still think it is broken; consider:
> 
> 	CPU0				CPU1
>
> 	bp_handler = ponies;
>
> 	text_poke_bp(, &static_call_bp_handler)
> 	  text_poke(&int3);
> 	  on_each_cpu(sync)
> 				<IPI>
> 				  ...
> 				</IPI>
>
> 	  text_poke(/* all but first bytes */)
> 	  on_each_cpu(sync)
> 				<IPI>
> 				  ...
> 				</IPI>
>
> 				<int3>
> 				  pt_regs->ip = &static_call_bp_handler
> 				</int3>
>
> 				// VCPU takes a nap...
> 	  text_poke(/* first byte */)
> 	  on_each_cpu(sync)
> 				<IPI>
> 				  ...
> 				</IPI>
>
> 				// VCPU sleeps more
> 				bp_handler = unicorn;
>
> 				CALL unicorn
>
> *whoops*
> 
> Now, granted, that is all rather 'unlikely', but that never stopped
> Murphy.

Good find, thanks Peter.

As we discussed on IRC, we'll need to fix this from within the int3
exception handler by faking the call: putting a fake return address on
the stack (pointing to right after the call) and setting regs->ip to the
called function.

And for the out-of-line case we can just jump straight to the function,
so the function itself will be the text_poke_bp() "handler".

So the static_call_bp_handler() trampoline will go away.

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Josh Poimboeuf
On Mon, Nov 26, 2018 at 05:39:23PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 26, 2018 at 05:11:05PM +0100, Ard Biesheuvel wrote:
> > On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra  wrote:
> > >
> > > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > > +void arch_static_call_defuse_tramp(void *site, void *tramp)
> > > > +{
> > > > + unsigned short opcode = INSN_UD2;
> > > > +
> > > > + mutex_lock(&text_mutex);
> > > > + text_poke((void *)tramp, &opcode, 2);
> > > > + mutex_unlock(&text_mutex);
> > > > +}
> > > > +#endif
> > >
> > > I would rather think that makes the trampoline _more_ dangerous, rather
> > > than less so.
> > >
> > > My dictionary sayeth:
> > >
> > >   defuse: verb
> > >
> > > - remove the fuse from (an explosive device) in order to prevent it
> > >   from exploding.
> > >
> > > - make (a situation) less tense or dangerous
> > >
> > > patching in an UD2 seems to do the exact opposite.
> > 
> > That is my fault.
> > 
> > The original name was 'poison' iirc, but on arm64, we need to retain
> > the trampoline for cases where the direct branch is out of range, and
> > so poisoning is semantically inaccurate.
> > 
> > But since you opened your dictionary anyway, any better suggestions? :-)
> 
> I was leaning towards: "prime", but I'm not entirely sure that works
> with your case.

Maybe we should just go back to "poison", along with a comment that it
will not necessarily be poisoned for all arches.  I think "poison" at
least describes the intent, if not always the implementation.

-- 
Josh


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Peter Zijlstra
On Mon, Nov 26, 2018 at 05:11:05PM +0100, Ard Biesheuvel wrote:
> On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra  wrote:
> >
> > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > +void arch_static_call_defuse_tramp(void *site, void *tramp)
> > > +{
> > > + unsigned short opcode = INSN_UD2;
> > > +
> > > + mutex_lock(&text_mutex);
> > > + text_poke((void *)tramp, &opcode, 2);
> > > + mutex_unlock(&text_mutex);
> > > +}
> > > +#endif
> >
> > I would rather think that makes the trampoline _more_ dangerous, rather
> > than less so.
> >
> > My dictionary sayeth:
> >
> >   defuse: verb
> >
> > - remove the fuse from (an explosive device) in order to prevent it
> >   from exploding.
> >
> > - make (a situation) less tense or dangerous
> >
> > patching in an UD2 seems to do the exact opposite.
> 
> That is my fault.
> 
> The original name was 'poison' iirc, but on arm64, we need to retain
> the trampoline for cases where the direct branch is out of range, and
> so poisoning is semantically inaccurate.
> 
> But since you opened your dictionary anyway, any better suggestions? :-)

I was leaning towards: "prime", but I'm not entirely sure that works
with your case.


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Andy Lutomirski
On Mon, Nov 26, 2018 at 8:11 AM Ard Biesheuvel
 wrote:
>
> On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra  wrote:
> >
> > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > +void arch_static_call_defuse_tramp(void *site, void *tramp)
> > > +{
> > > + unsigned short opcode = INSN_UD2;
> > > +
> > > + mutex_lock(&text_mutex);
> > > + text_poke((void *)tramp, &opcode, 2);
> > > + mutex_unlock(&text_mutex);
> > > +}
> > > +#endif
> >
> > I would rather think that makes the trampoline _more_ dangerous, rather
> > than less so.
> >
> > My dictionary sayeth:
> >
> >   defuse: verb
> >
> > - remove the fuse from (an explosive device) in order to prevent it
> >   from exploding.
> >
> > - make (a situation) less tense or dangerous
> >
> > patching in an UD2 seems to do the exact opposite.
>
> That is my fault.
>
> The original name was 'poison' iirc, but on arm64, we need to retain
> the trampoline for cases where the direct branch is out of range, and
> so poisoning is semantically inaccurate.
>
> But since you opened your dictionary anyway, any better suggestions? :-)

Release?  Finish?


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Ard Biesheuvel
On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra  wrote:
>
> On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > +void arch_static_call_defuse_tramp(void *site, void *tramp)
> > +{
> > + unsigned short opcode = INSN_UD2;
> > +
> > + mutex_lock(&text_mutex);
> > + text_poke((void *)tramp, &opcode, 2);
> > + mutex_unlock(&text_mutex);
> > +}
> > +#endif
>
> I would rather think that makes the trampoline _more_ dangerous, rather
> than less so.
>
> My dictionary sayeth:
>
>   defuse: verb
>
> - remove the fuse from (an explosive device) in order to prevent it
>   from exploding.
>
> - make (a situation) less tense or dangerous
>
> patching in an UD2 seems to do the exact opposite.

That is my fault.

The original name was 'poison' iirc, but on arm64, we need to retain
the trampoline for cases where the direct branch is out of range, and
so poisoning is semantically inaccurate.

But since you opened your dictionary anyway, any better suggestions? :-)


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Peter Zijlstra
On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> +void arch_static_call_defuse_tramp(void *site, void *tramp)
> +{
> + unsigned short opcode = INSN_UD2;
> +
> + mutex_lock(&text_mutex);
> + text_poke((void *)tramp, &opcode, 2);
> + mutex_unlock(&text_mutex);
> +}
> +#endif

I would rather think that makes the trampoline _more_ dangerous, rather
than less so.

My dictionary sayeth:

  defuse: verb

- remove the fuse from (an explosive device) in order to prevent it
  from exploding.

- make (a situation) less tense or dangerous

patching in an UD2 seems to do the exact opposite.


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Peter Zijlstra
On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> index 8026d176f25c..d3869295b88d 100644
> --- a/arch/x86/kernel/static_call.c
> +++ b/arch/x86/kernel/static_call.c
> @@ -9,13 +9,21 @@
>  
>  void static_call_bp_handler(void);
>  void *bp_handler_dest;
> +void *bp_handler_continue;
>  
>  asm(".pushsection .text, \"ax\"  
> \n"
>  ".globl static_call_bp_handler   \n"
>  ".type static_call_bp_handler, @function \n"
>  "static_call_bp_handler: \n"
> -"ANNOTATE_RETPOLINE_SAFE \n"
> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> +ANNOTATE_RETPOLINE_SAFE
> +"call *bp_handler_dest   \n"
> +ANNOTATE_RETPOLINE_SAFE
> +"jmp *bp_handler_continue
> \n"
> +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> +ANNOTATE_RETPOLINE_SAFE
>  "jmp *bp_handler_dest\n"
> +#endif
>  ".popsection \n");
>  
>  void arch_static_call_transform(void *site, void *tramp, void *func)
> @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
>   unsigned char insn_opcode;
>   unsigned char opcodes[CALL_INSN_SIZE];
>  
> - insn = (unsigned long)tramp;
> + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> + insn = (unsigned long)site;
> + else
> + insn = (unsigned long)tramp;
>  
>   mutex_lock(&text_mutex);
>  
> @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
>   opcodes[0] = insn_opcode;
>   memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
>  
> - /* Set up the variable for the breakpoint handler: */
> + /* Set up the variables for the breakpoint handler: */
>   bp_handler_dest = func;
> + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
>  
>   /* Patch the call site: */
>   text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,

OK, so this is where that static_call_bp_handler comes from; you need
that CALL to frob the stack.

But I still think it is broken; consider:

	CPU0				CPU1

	bp_handler = ponies;

	text_poke_bp(, &static_call_bp_handler)
	  text_poke(&int3);
	  on_each_cpu(sync)
				<IPI>
				  ...
				</IPI>

	  text_poke(/* all but first bytes */)
	  on_each_cpu(sync)
				<IPI>
				  ...
				</IPI>

				<int3>
				  pt_regs->ip = &static_call_bp_handler
				</int3>

				// VCPU takes a nap...
	  text_poke(/* first byte */)
	  on_each_cpu(sync)
				<IPI>
				  ...
				</IPI>

				// VCPU sleeps more
				bp_handler = unicorn;

				CALL unicorn

*whoops*

Now, granted, that is all rather 'unlikely', but that never stopped
Murphy.


[PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-26 Thread Josh Poimboeuf
Add the inline static call implementation for x86-64.  For each key, a
temporary trampoline is created, named __static_call_tramp_<key>.  The
trampoline has an indirect jump to the destination function.

Objtool uses the trampoline naming convention to detect all the call
sites.  It then annotates those call sites in the .static_call_sites
section.

During boot (and module init), the call sites are patched to call
directly into the destination function.  The temporary trampoline is
then no longer used.
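
For context, each .static_call_sites entry is essentially a (call site,
key) pair.  A rough sketch of its shape; the exact layout here is an
assumption, not taken from this patch:

struct static_call_site {
	s32 addr;	/* PC-relative offset to the call site */
	s32 key;	/* PC-relative offset to the static_call_key */
};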

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/Kconfig  |   5 +-
 arch/x86/include/asm/static_call.h|  28 +++-
 arch/x86/kernel/asm-offsets.c |   6 +
 arch/x86/kernel/static_call.c |  30 -
 include/linux/static_call.h   |   2 +-
 tools/objtool/Makefile|   3 +-
 tools/objtool/check.c | 126 +-
 tools/objtool/check.h |   2 +
 tools/objtool/elf.h   |   1 +
 .../objtool/include/linux/static_call_types.h |  19 +++
 tools/objtool/sync-check.sh   |   1 +
 11 files changed, 213 insertions(+), 10 deletions(-)
 create mode 100644 tools/objtool/include/linux/static_call_types.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a2a10e0ce248..e099ea87ea70 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -189,7 +189,8 @@ config X86
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_STACKPROTECTOR  if CC_HAS_SANE_STACKPROTECTOR
select HAVE_STACK_VALIDATIONif X86_64
-   select HAVE_STATIC_CALL_OUTLINE
+   select HAVE_STATIC_CALL_INLINE  if HAVE_STACK_VALIDATION
+   select HAVE_STATIC_CALL_OUTLINE if !HAVE_STACK_VALIDATION
select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
@@ -203,6 +204,7 @@ config X86
select RTC_MC146818_LIB
select SPARSE_IRQ
select SRCU
+	select STACK_VALIDATION			if HAVE_STACK_VALIDATION && (HAVE_STATIC_CALL_INLINE || RETPOLINE)
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
select USER_STACKTRACE_SUPPORT
@@ -438,7 +440,6 @@ config GOLDFISH
 config RETPOLINE
bool "Avoid speculative indirect branches in kernel"
default y
-   select STACK_VALIDATION if HAVE_STACK_VALIDATION
help
  Compile kernel with the retpoline compiler options to guard against
  kernel-to-user data leaks by avoiding speculative indirect
diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
index 6e9ad5969ec2..27bd7da16150 100644
--- a/arch/x86/include/asm/static_call.h
+++ b/arch/x86/include/asm/static_call.h
@@ -2,6 +2,20 @@
 #ifndef _ASM_STATIC_CALL_H
 #define _ASM_STATIC_CALL_H
 
+#include 
+
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+
+/*
+ * This trampoline is only used during boot / module init, so it's safe to use
+ * the indirect branch without a retpoline.
+ */
+#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)			\
+   ANNOTATE_RETPOLINE_SAFE \
+   "jmpq *" __stringify(key) "+" __stringify(SC_KEY_func) "(%rip) \n"
+
+#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
+
 /*
  * Manually construct a 5-byte direct JMP to prevent the assembler from
  * optimizing it into a 2-byte JMP.
@@ -12,9 +26,19 @@
".long " #func " - " __ARCH_STATIC_CALL_JMP_LABEL(key) "\n" \
__ARCH_STATIC_CALL_JMP_LABEL(key) ":"
 
+#endif /* !CONFIG_HAVE_STATIC_CALL_INLINE */
+
 /*
- * This is a permanent trampoline which does a direct jump to the function.
- * The direct jump get patched by static_call_update().
+ * For CONFIG_HAVE_STATIC_CALL_INLINE, this is a temporary trampoline which
+ * uses the current value of the key->func pointer to do an indirect jump to
+ * the function.  This trampoline is only used during boot, before the call
+ * sites get patched by static_call_update().  The name of this trampoline has
+ * a magical aspect: objtool uses it to find static call sites so it can create
+ * the .static_call_sites section.
+ *
+ * For CONFIG_HAVE_STATIC_CALL_OUTLINE, this is a permanent trampoline which
+ * does a direct jump to the function.  The direct jump gets patched by
+ * static_call_update().
  */
 #define ARCH_DEFINE_STATIC_CALL_TRAMP(key, func)   \
asm(".pushsection .text, \"ax\" \n" \
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 72adf6c335dc..da8fd220e4f2 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -104,4 +105,9 @@ void common(void) {
OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);