Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread H. Peter Anvin
On 1/14/19 9:01 PM, H. Peter Anvin wrote:
> 
> This could be as simple as spinning for a limited time waiting for
> states 0 or 3 if we are not the patching CPU. It is also not necessary
> to wait for the mask to become zero for the first sync if we find
> ourselves suddenly in state 4.
> 

So this would look something like this for the #BP handler; I think this
is safe.  This uses the TLB miss on the write page intentionally to slow
down the loop a bit to reduce the risk of livelock.  Note that
"bp_write_addr" here refers to the write address for the breakpoint that
was taken.

state = atomic_read(&bp_poke_state);
if (state == 0)
	return 0;	/* No patching in progress */

recheck:
	clear bit in mask

	switch (state) {
	case 1:
	case 4:
		if (smp_processor_id() != bp_patching_cpu) {
			int retries = NNN;
			while (retries--) {
				invlpg
				if (*bp_write_addr != 0xcc)
					goto recheck;
				state = atomic_read(&bp_poke_state);
				if (state != 1 && state != 4)
					goto recheck;
			}
		}
		state = cmpxchg(&bp_poke_state, 1, 4);
		if (state != 1 && state != 4)
			goto recheck;
		atomic_write(bp_write_addr, bp_old_value);
		break;

	case 2:
		if (smp_processor_id() != bp_patching_cpu) {
			invlpg
			state = atomic_read(&bp_poke_state);
			if (state != 2)
				goto recheck;
		}
		complete patch sequence
		remove breakpoint
		break;

	case 3:
	case 0:
		/*
		 * If we are here, the #BP will go away on its
		 * own, or we will re-take it if it was a "real"
		 * breakpoint.
		 */
		break;
	}
	return 1;


Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread H. Peter Anvin
On 1/14/19 7:05 PM, Andy Lutomirski wrote:
> On Mon, Jan 14, 2019 at 2:55 PM H. Peter Anvin  wrote:
>>
>> I think this sequence ought to work (keep in mind we are already under a
>> mutex, so the global data is safe even if we are preempted):
> 
> I'm trying to wrap my head around this.  The states are:
> 
> 0: normal operation
> 1: writing 0xcc, can be canceled
> 2: writing final instruction.  The 0xcc was definitely synced to all CPUs.
> 3: patch is definitely installed but maybe not sync_cored.
> 

4: breakpoint has been canceled; need to redo patching.
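
In enum form, for reference (the names here are made up, nothing defines
them anywhere yet):

enum bp_poke_state_val {
	BP_STATE_IDLE     = 0,	/* normal operation */
	BP_STATE_INT3     = 1,	/* writing 0xcc; still cancelable */
	BP_STATE_PATCH    = 2,	/* writing the final instruction; 0xcc synced to all CPUs */
	BP_STATE_DONE     = 3,	/* patch installed, but maybe not sync_core()'d everywhere */
	BP_STATE_CANCELED = 4,	/* breakpoint canceled; patching must be redone */
};

static atomic_t bp_poke_state = ATOMIC_INIT(BP_STATE_IDLE);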

>>
>> set up page table entries
>> invlpg
>> set up bp patching global data
>>
>> cpu = get_cpu()
>>
> So we're assuming that the state is
> 
>> bp_old_value = atomic_read(bp_write_addr)
>>
>> do {
> 
> So we're assuming that the state is 0 here.  A WARN_ON_ONCE to check
> that would be nice.

The state here can be 0 or 4.

>> atomic_write(&bp_poke_state, 1)
>>
>> atomic_write(bp_write_addr, 0xcc)
>>
>> mask <- online_cpu_mask - self
>> send IPIs
>> wait for mask = 0
>>
>> } while (cmpxchg(&bp_poke_state, 1, 2) != 1);
>>
>> patch sites, remove breakpoints after patching each one
> 
> Not sure what you mean by patch *sites*.  As written, this only
> supports one patch site at a time, since there's only one
> bp_write_addr, and fixing that may be complicated.  Not fixing it
> might also be a scalability problem.

Fixing it isn't all that complicated; we just need to have a list of
patch locations (which we need anyway!) and walk (or search) it instead
of checking just one; I omitted that detail for simplicity.
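
Something like this is what I have in mind (purely illustrative, none of
these names exist today); the #BP handler would binary-search the table
instead of comparing against a single bp_write_addr:

struct bp_patch_site {
	void *addr;	/* first byte of the instruction being patched */
	u8 len;
	u8 old[16];	/* original bytes, for rollback from the #BP handler */
	u8 new[16];	/* replacement bytes */
};

static struct bp_patch_site *bp_sites;	/* sorted by addr */
static unsigned int bp_nr_sites;

/*
 * The #BP handler traps with regs->ip pointing just past the 0xcc byte,
 * so it looks up the site with addr == regs->ip - 1.
 */
static struct bp_patch_site *bp_find_site(unsigned long ip)
{
	unsigned int lo = 0, hi = bp_nr_sites;

	while (lo < hi) {
		unsigned int mid = (lo + hi) / 2;
		unsigned long site = (unsigned long)bp_sites[mid].addr;

		if (site + 1 == ip)
			return &bp_sites[mid];
		if (site + 1 < ip)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}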

>> atomic_write(&bp_poke_state, 3);
>>
>> mask <- online_cpu_mask - self
>> send IPIs
>> wait for mask = 0
>>
>> atomic_write(&bp_poke_state, 0);
>>
>> tear down patching global data
>> tear down page table entries
>>
>>
>>
>> The #BP handler would then look like:
>>
>> state = cmpxchg(&bp_poke_state, 1, 4);
>> switch (state) {
>> case 1:
>> case 4:
> 
> What is state 4?
> 
>> invlpg
>> cmpxchg(bp_write_addr, 0xcc, bp_old_value)

I'm 85% sure that the cmpxchg here is actually unnecessary, an
atomic_write() is sufficient.

>> break;
>> case 2:
>> invlpg
>> complete patch sequence
>> remove breakpoint
>> break;
> 
> ISTM you might as well change state to 3 here, but it's arguably unnecessary.

If and only if you have only one patch location you could, but again,
unnecessary.

>> case 3:
>> /* If we are here, the #BP will go away on its own */
>> break;
>> case 0:
>> /* No patching in progress!!! */
>> return 0;
>> }
>>
>> clear bit in mask
>> return 1;
>>
>> The IPI handler:
>>
>> clear bit in mask
>> sync_core   /* Needed if multiple IPI events are chained */
> 
> I really like that this doesn't require fixups -- text_poke_bp() just
> works.  But I'm nervous about livelocks or maybe just extreme slowness
> under nasty loads.  Suppose some perf NMI code does a static call or
> uses a static call.  Now there's a situation where, under high
> frequency perf sampling, the patch process might almost always hit the
> breakpoint while in state 1.  It'll get reversed and done again, and
> we get stuck.  It would be neat if we could get the same "no
> deadlocks" property while significantly reducing the chance of a
> rollback.

This could be as simple as spinning for a limited time waiting for
states 0 or 3 if we are not the patching CPU. It is also not necessary
to wait for the mask to become zero for the first sync if we find
ourselves suddenly in state 4.

This wouldn't reduce the livelock probability to zero, but it ought to
reduce it enough that if we really are under such heavy event load we
may end up getting stuck in any number of ways...

> This is why I proposed something where we try to guarantee forward
> progress by making sure that any NMI code that might spin and wait for
> other CPUs is guaranteed to eventually sync_core(), clear its bit, and
> possibly finish a patch.  But this is a bit gross.

Yes, this gets really grotty and who knows how many code paths it would
touch.

-hpa




Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread Andy Lutomirski
On Mon, Jan 14, 2019 at 2:55 PM H. Peter Anvin  wrote:
>
> I think this sequence ought to work (keep in mind we are already under a
> mutex, so the global data is safe even if we are preempted):

I'm trying to wrap my head around this.  The states are:

0: normal operation
1: writing 0xcc, can be canceled
2: writing final instruction.  The 0xcc was definitely synced to all CPUs.
3: patch is definitely installed but maybe not sync_cored.

>
> set up page table entries
> invlpg
> set up bp patching global data
>
> cpu = get_cpu()
>
So we're assuming that the state is

> bp_old_value = atomic_read(bp_write_addr)
>
> do {

So we're assuming that the state is 0 here.  A WARN_ON_ONCE to check
that would be nice.

> atomic_write(&bp_poke_state, 1)
>
> atomic_write(bp_write_addr, 0xcc)
>
> mask <- online_cpu_mask - self
> send IPIs
> wait for mask = 0
>
> } while (cmpxchg(&bp_poke_state, 1, 2) != 1);
>
> patch sites, remove breakpoints after patching each one

Not sure what you mean by patch *sites*.  As written, this only
supports one patch site at a time, since there's only one
bp_write_addr, and fixing that may be complicated.  Not fixing it
might also be a scalability problem.

>
> atomic_write(&bp_poke_state, 3);
>
> mask <- online_cpu_mask - self
> send IPIs
> wait for mask = 0
>
> atomic_write(&bp_poke_state, 0);
>
> tear down patching global data
> tear down page table entries
>
>
>
> The #BP handler would then look like:
>
> state = cmpxchg(&bp_poke_state, 1, 4);
> switch (state) {
> case 1:
> case 4:

What is state 4?

> invlpg
> cmpxchg(bp_write_addr, 0xcc, bp_old_value)
> break;
> case 2:
> invlpg
> complete patch sequence
> remove breakpoint
> break;

ISTM you might as well change state to 3 here, but it's arguably unnecessary.

> case 3:
> /* If we are here, the #BP will go away on its own */
> break;
> case 0:
> /* No patching in progress!!! */
> return 0;
> }
>
> clear bit in mask
> return 1;
>
> The IPI handler:
>
> clear bit in mask
> sync_core   /* Needed if multiple IPI events are chained */

I really like that this doesn't require fixups -- text_poke_bp() just
works.  But I'm nervous about livelocks or maybe just extreme slowness
under nasty loads.  Suppose some perf NMI code does a static call or
uses a static call.  Now there's a situation where, under high
frequency perf sampling, the patch process might almost always hit the
breakpoint while in state 1.  It'll get reversed and done again, and
we get stuck.  It would be neat if we could get the same "no
deadlocks" property while significantly reducing the chance of a
rollback.

This is why I proposed something where we try to guarantee forward
progress by making sure that any NMI code that might spin and wait for
other CPUs is guaranteed to eventually sync_core(), clear its bit, and
possibly finish a patch.  But this is a bit gross.


Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread hpa
On January 14, 2019 3:27:55 PM PST, Andy Lutomirski  wrote:
>On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin  wrote:
>>
>> So I was already in the middle of composing this message when Andy
>posted:
>>
>> > I don't even think this is sufficient.  I think we also need
>everyone
>> > who clears the bit to check if all bits are clear and, if so,
>remove
>> > the breakpoint.  Otherwise we have a situation where, if you are in
>> > text_poke_bp() and you take an NMI (or interrupt or MCE or
>whatever)
>> > and that interrupt then hits the breakpoint, then you deadlock
>because
>> > no one removes the breakpoint.
>> >
>> > If we do this, and if we can guarantee that all CPUs make forward
>> > progress, then maybe the problem is solved. Can we guarantee
>something
>> > like all NMI handlers that might wait in a spinlock or for any
>other
>> > reason will periodically check if a sync is needed while they're
>> > spinning?
>>
>> So the really, really nasty case is when an asynchronous event on the
>> *patching* processor gets stuck spinning on a resource which is
>> unavailable due to another processor spinning on the #BP. We can
>disable
>> interrupts, but we can't stop NMIs from coming in (although we could
>> test in the NMI handler if we are in that condition and return
>> immediately; I'm not sure we want to do that, and we still have to
>deal
>> with #MC and what not.)
>>
>> The fundamental problem here is that we don't see the #BP on the
>> patching processor, in which case we could simply complete the
>patching
>> from the #BP handler on that processor.
>>
>> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
>> > On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>> >>
>> >> static cpumask_t text_poke_cpumask;
>> >>
>> >> static void text_poke_sync(void)
>> >> {
>> >>  smp_wmb();
>> >>  text_poke_cpumask = cpu_online_mask;
>> >>  smp_wmb();  /* Should be optional on x86 */
>> >>  cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>> >>  on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
>> >>  while (!cpumask_empty(&text_poke_cpumask)) {
>> >>  cpu_relax();
>> >>  smp_rmb();
>> >>  }
>> >> }
>> >>
>> >> static void text_poke_sync_cpu(void *dummy)
>> >> {
>> >>  (void)dummy;
>> >>
>> >>  smp_rmb();
>> >>  cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>> >>  /*
>> >>   * We are guaranteed to return with an IRET, either from the
>> >>   * IPI or the #BP handler; this provides serialization.
>> >>   */
>> >> }
>> >>
>> >
>> > The invariants here are:
>> >
>> > 1. The patching routine must set each bit in the cpumask after each
>event
>> >that requires synchronization is complete.
>> > 2. The bit can be (atomically) cleared on the target CPU only, and
>only in a
>> >place that guarantees a synchronizing event (e.g. IRET) before
>it may
>> >reaching the poked instruction.
>> > 3. At a minimum the IPI handler and #BP handler needs to clear the
>bit. It
>> >*is* also possible to clear it in other places, e.g. the NMI
>handler, if
>> >necessary as long as condition 2 is satisfied.
>> >
>>
>> OK, so with interrupts enabled *on the processor doing the patching*
>we
>> still have a problem if it takes an interrupt which in turn takes a
>#BP.
>>  Disabling interrupts would not help, because an NMI or #MC
>could
>> still cause problems unless we can guarantee that no path which may
>be
>> invoked by NMI/#MC can do text_poke, which seems to be a very
>aggressive
>> assumption.
>>
>> Note: I am assuming preemption is disabled.
>>
>> The easiest/sanest way to deal with this might be to switch the IDT
>(or
>> provide a hook in the generic exception entry code) on the patching
>> processor, such that if an asynchronous event comes in, we either
>roll
>> forward or revert. This is doable because the second sync we
>currently
>> do is not actually necessary per the hardware guys.
>
>This is IMO insanely complicated.  I much prefer the kind of
>complexity that is more or less deterministic and easy to test to the
>kind of complexity (like this) that only happens in corner cases.
>
>I see two solutions here:
>
>1. Just suck it up and emulate the CALL.  And find a way to write a
>test case so we know it works.
>
>2. Find a non-deadlocky way to make the breakpoint handler wait for
>the breakpoint to get removed, without any mucking at all with the
>entry code.  And find a way to write a test case so we know it works.
>(E.g. stick an actual static_call call site *in text_poke_bp()* that
>fires once on boot so that the really awful recursive case gets
>exercised all the time.
>
>But if we're going to do any mucking with the entry code, let's just
>do the simple mucking to make emulating CALL work.
>
>--Andy

Ugh. So much for not really proofreading. Yes, I think the second solution is 
the right thing since I think I figured out how to do it without deadlock; see 
other mail.
-- 
Sent from my Android device with K-9 Mail. 

Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread Nadav Amit
> On Jan 14, 2019, at 3:27 PM, Andy Lutomirski  wrote:
> 
> On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin  wrote:
>> So I was already in the middle of composing this message when Andy posted:
>> 
>>> I don't even think this is sufficient.  I think we also need everyone
>>> who clears the bit to check if all bits are clear and, if so, remove
>>> the breakpoint.  Otherwise we have a situation where, if you are in
>>> text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
>>> and that interrupt then hits the breakpoint, then you deadlock because
>>> no one removes the breakpoint.
>>> 
>>> If we do this, and if we can guarantee that all CPUs make forward
>>> progress, then maybe the problem is solved. Can we guarantee something
>>> like all NMI handlers that might wait in a spinlock or for any other
>>> reason will periodically check if a sync is needed while they're
>>> spinning?
>> 
>> So the really, really nasty case is when an asynchronous event on the
>> *patching* processor gets stuck spinning on a resource which is
>> unavailable due to another processor spinning on the #BP. We can disable
>> interrupts, but we can't stop NMIs from coming in (although we could
>> test in the NMI handler if we are in that condition and return
>> immediately; I'm not sure we want to do that, and we still have to deal
>> with #MC and what not.)
>> 
>> The fundamental problem here is that we don't see the #BP on the
>> patching processor, in which case we could simply complete the patching
>> from the #BP handler on that processor.
>> 
>> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
>>> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
 static cpumask_t text_poke_cpumask;
 
 static void text_poke_sync(void)
 {
 smp_wmb();
 text_poke_cpumask = cpu_online_mask;
 smp_wmb();  /* Should be optional on x86 */
 cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
 on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
 while (!cpumask_empty(&text_poke_cpumask)) {
 cpu_relax();
 smp_rmb();
 }
 }
 
 static void text_poke_sync_cpu(void *dummy)
 {
 (void)dummy;
 
 smp_rmb();
 cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
 /*
  * We are guaranteed to return with an IRET, either from the
  * IPI or the #BP handler; this provides serialization.
  */
 }
>>> 
>>> The invariants here are:
>>> 
>>> 1. The patching routine must set each bit in the cpumask after each event
>>>   that requires synchronization is complete.
>>> 2. The bit can be (atomically) cleared on the target CPU only, and only in a
>>>   place that guarantees a synchronizing event (e.g. IRET) before it may
>>>   reaching the poked instruction.
>>> 3. At a minimum the IPI handler and #BP handler needs to clear the bit. It
>>>   *is* also possible to clear it in other places, e.g. the NMI handler, if
>>>   necessary as long as condition 2 is satisfied.
>> 
>> OK, so with interrupts enabled *on the processor doing the patching* we
>> still have a problem if it takes an interrupt which in turn takes a #BP.
>> Disabling interrupts would not help, because an NMI or #MC could
>> still cause problems unless we can guarantee that no path which may be
>> invoked by NMI/#MC can do text_poke, which seems to be a very aggressive
>> assumption.
>> 
>> Note: I am assuming preemption is disabled.
>> 
>> The easiest/sanest way to deal with this might be to switch the IDT (or
>> provide a hook in the generic exception entry code) on the patching
>> processor, such that if an asynchronous event comes in, we either roll
>> forward or revert. This is doable because the second sync we currently
>> do is not actually necessary per the hardware guys.
> 
> This is IMO insanely complicated.  I much prefer the kind of
> complexity that is more or less deterministic and easy to test to the
> kind of complexity (like this) that only happens in corner cases.
> 
> I see two solutions here:
> 
> 1. Just suck it up and emulate the CALL.  And find a way to write a
> test case so we know it works.
> 
> 2. Find a non-deadlocky way to make the breakpoint handler wait for
> the breakpoint to get removed, without any mucking at all with the
> entry code.  And find a way to write a test case so we know it works.
> (E.g. stick an actual static_call call site *in text_poke_bp()* that
> fires once on boot so that the really awful recursive case gets
> exercised all the time.
> 
> But if we're going to do any mucking with the entry code, let's just
> do the simple mucking to make emulating CALL work.

These two approaches still seem more complicated to me than having a 
trampoline as backup, which is patched dynamically.
 
IIUC, the current pushback against this option is that it makes batching
more complicated. I am not sure it is that bad, but there are other variants
of this solution, for example using an 

Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread Andy Lutomirski
On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin  wrote:
>
> So I was already in the middle of composing this message when Andy posted:
>
> > I don't even think this is sufficient.  I think we also need everyone
> > who clears the bit to check if all bits are clear and, if so, remove
> > the breakpoint.  Otherwise we have a situation where, if you are in
> > text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
> > and that interrupt then hits the breakpoint, then you deadlock because
> > no one removes the breakpoint.
> >
> > If we do this, and if we can guarantee that all CPUs make forward
> > progress, then maybe the problem is solved. Can we guarantee something
> > like all NMI handlers that might wait in a spinlock or for any other
> > reason will periodically check if a sync is needed while they're
> > spinning?
>
> So the really, really nasty case is when an asynchronous event on the
> *patching* processor gets stuck spinning on a resource which is
> unavailable due to another processor spinning on the #BP. We can disable
> interrupts, but we can't stop NMIs from coming in (although we could
> test in the NMI handler if we are in that condition and return
> immediately; I'm not sure we want to do that, and we still have to deal
> with #MC and what not.)
>
> The fundamental problem here is that we don't see the #BP on the
> patching processor, in which case we could simply complete the patching
> from the #BP handler on that processor.
>
> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
> > On 1/13/19 6:31 PM, H. Peter Anvin wrote:
> >>
> >> static cpumask_t text_poke_cpumask;
> >>
> >> static void text_poke_sync(void)
> >> {
> >>  smp_wmb();
> >>  text_poke_cpumask = cpu_online_mask;
> >>  smp_wmb();  /* Should be optional on x86 */
> >>  cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> >>  on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
> >>  while (!cpumask_empty(&text_poke_cpumask)) {
> >>  cpu_relax();
> >>  smp_rmb();
> >>  }
> >> }
> >>
> >> static void text_poke_sync_cpu(void *dummy)
> >> {
> >>  (void)dummy;
> >>
> >>  smp_rmb();
> >>  cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> >>  /*
> >>   * We are guaranteed to return with an IRET, either from the
> >>   * IPI or the #BP handler; this provides serialization.
> >>   */
> >> }
> >>
> >
> > The invariants here are:
> >
> > 1. The patching routine must set each bit in the cpumask after each event
> >that requires synchronization is complete.
> > 2. The bit can be (atomically) cleared on the target CPU only, and only in a
> >place that guarantees a synchronizing event (e.g. IRET) before it may
> >reaching the poked instruction.
> > 3. At a minimum the IPI handler and #BP handler needs to clear the bit. It
> >*is* also possible to clear it in other places, e.g. the NMI handler, if
> >necessary as long as condition 2 is satisfied.
> >
>
> OK, so with interrupts enabled *on the processor doing the patching* we
> still have a problem if it takes an interrupt which in turn takes a #BP.
>  Disabling interrupts would not help, because an NMI or #MC could
> still cause problems unless we can guarantee that no path which may be
> invoked by NMI/#MC can do text_poke, which seems to be a very aggressive
> assumption.
>
> Note: I am assuming preemption is disabled.
>
> The easiest/sanest way to deal with this might be to switch the IDT (or
> provide a hook in the generic exception entry code) on the patching
> processor, such that if an asynchronous event comes in, we either roll
> forward or revert. This is doable because the second sync we currently
> do is not actually necessary per the hardware guys.

This is IMO insanely complicated.  I much prefer the kind of
complexity that is more or less deterministic and easy to test to the
kind of complexity (like this) that only happens in corner cases.

I see two solutions here:

1. Just suck it up and emulate the CALL.  And find a way to write a
test case so we know it works.

2. Find a non-deadlocky way to make the breakpoint handler wait for
the breakpoint to get removed, without any mucking at all with the
entry code.  And find a way to write a test case so we know it works.
(E.g. stick an actual static_call call site *in text_poke_bp()* that
fires once on boot so that the really awful recursive case gets
exercised all the time.

But if we're going to do any mucking with the entry code, let's just
do the simple mucking to make emulating CALL work.

--Andy


Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread H. Peter Anvin
I think this sequence ought to work (keep in mind we are already under a
mutex, so the global data is safe even if we are preempted):

set up page table entries
invlpg
set up bp patching global data

cpu = get_cpu()

bp_old_value = atomic_read(bp_write_addr)

do {
atomic_write(&bp_poke_state, 1)

atomic_write(bp_write_addr, 0xcc)

mask <- online_cpu_mask - self
send IPIs
wait for mask = 0

} while (cmpxchg(&bp_poke_state, 1, 2) != 1);

patch sites, remove breakpoints after patching each one

atomic_write(&bp_poke_state, 3);

mask <- online_cpu_mask - self
send IPIs
wait for mask = 0

atomic_write(&bp_poke_state, 0);

tear down patching global data
tear down page table entries



The #BP handler would then look like:

state = cmpxchg(&bp_poke_state, 1, 4);
switch (state) {
case 1:
case 4:
invlpg
cmpxchg(bp_write_addr, 0xcc, bp_old_value)
break;
case 2:
invlpg
complete patch sequence
remove breakpoint
break;
case 3:
/* If we are here, the #BP will go away on its own */
break;
case 0:
/* No patching in progress!!! */
return 0;
}

clear bit in mask
return 1;

The IPI handler:

clear bit in mask
sync_core   /* Needed if multiple IPI events are chained */
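
For clarity, here is the patching side written out as rough C. Everything
here is pseudo-code tied to the sketch above (the bp_* globals, poke_alias()
and text_poke_sync() are made-up names, not existing interfaces), and it
ignores both batching and the case where a #BP handler finishes the patch
for us in state 2:

static atomic_t bp_poke_state;		/* the 0/1/2/3/4 state described above */
static int bp_patching_cpu;
static void *bp_write_addr;		/* writable alias of the address being patched */
static u8 bp_old_value;

static void text_poke_bp_sequence(void *addr, const void *opcode, size_t len)
{
	u8 int3 = 0xcc;

	bp_patching_cpu = get_cpu();
	bp_write_addr = poke_alias(addr);	/* hypothetical: writable mapping of addr */
	bp_old_value = *(u8 *)addr;

	do {
		atomic_set(&bp_poke_state, 1);		/* 1: 0xcc going in, still cancelable */
		WRITE_ONCE(*(u8 *)bp_write_addr, int3);
		text_poke_sync();			/* IPIs; wait for every other CPU to ack */
	} while (atomic_cmpxchg(&bp_poke_state, 1, 2) != 1);	/* a #BP canceled us: redo */

	/* 2: the 0xcc is globally visible; write the tail, then the first byte */
	memcpy(bp_write_addr + 1, opcode + 1, len - 1);
	WRITE_ONCE(*(u8 *)bp_write_addr, *(const u8 *)opcode);

	atomic_set(&bp_poke_state, 3);		/* 3: patched, not yet synced everywhere */
	text_poke_sync();
	atomic_set(&bp_poke_state, 0);
	put_cpu();
}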


Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread H. Peter Anvin
So I was already in the middle of composing this message when Andy posted:

> I don't even think this is sufficient.  I think we also need everyone
> who clears the bit to check if all bits are clear and, if so, remove
> the breakpoint.  Otherwise we have a situation where, if you are in
> text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
> and that interrupt then hits the breakpoint, then you deadlock because
> no one removes the breakpoint.
> 
> If we do this, and if we can guarantee that all CPUs make forward
> progress, then maybe the problem is solved. Can we guarantee something
> like all NMI handlers that might wait in a spinlock or for any other
> reason will periodically check if a sync is needed while they're
> spinning?

So the really, really nasty case is when an asynchronous event on the
*patching* processor gets stuck spinning on a resource which is
unavailable due to another processor spinning on the #BP. We can disable
interrupts, but we can't stop NMIs from coming in (although we could
test in the NMI handler if we are in that condition and return
immediately; I'm not sure we want to do that, and we still have to deal
with #MC and what not.)

The fundamental problem here is that we don't see the #BP on the
patching processor, in which case we could simply complete the patching
from the #BP handler on that processor.

On 1/13/19 6:40 PM, H. Peter Anvin wrote:
> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>>
>> static cpumask_t text_poke_cpumask;
>>
>> static void text_poke_sync(void)
>> {
>>  smp_wmb();
>>  text_poke_cpumask = cpu_online_mask;
>>  smp_wmb();  /* Should be optional on x86 */
>>  cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>>  on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
>>  while (!cpumask_empty(&text_poke_cpumask)) {
>>  cpu_relax();
>>  smp_rmb();
>>  }
>> }
>>
>> static void text_poke_sync_cpu(void *dummy)
>> {
>>  (void)dummy;
>>
>>  smp_rmb();
>>  cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>>  /*
>>   * We are guaranteed to return with an IRET, either from the
>>   * IPI or the #BP handler; this provides serialization.
>>   */
>> }
>>
> 
> The invariants here are:
> 
> 1. The patching routine must set each bit in the cpumask after each event
>that requires synchronization is complete.
> 2. The bit can be (atomically) cleared on the target CPU only, and only in a
>place that guarantees a synchronizing event (e.g. IRET) before it may
>reaching the poked instruction.
> 3. At a minimum the IPI handler and #BP handler needs to clear the bit. It
>*is* also possible to clear it in other places, e.g. the NMI handler, if
>necessary as long as condition 2 is satisfied.
> 

OK, so with interrupts enabled *on the processor doing the patching* we
still have a problem if it takes an interrupt which in turn takes a #BP.
 Disabling interrupts would not help, because an NMI or #MC could
still cause problems unless we can guarantee that no path which may be
invoked by NMI/#MC can do text_poke, which seems to be a very aggressive
assumption.

Note: I am assuming preemption is disabled.

The easiest/sanest way to deal with this might be to switch the IDT (or
provide a hook in the generic exception entry code) on the patching
processor, such that if an asynchronous event comes in, we either roll
forward or revert. This is doable because the second sync we currently
do is not actually necessary per the hardware guys.

If we take that #BP during the breakpoint deployment phase -- that is,
before the first sync has completed -- restore the previous value of the
breakpoint byte. Upon return text_poke_bp() will then have to loop back
to the beginning and do it again.

If we take the #BP after that point, we can complete the patch in the
normal manner, by writing the rest of the instruction and then removing
the #BP. text_poke_bp() will complete the synchronization sequence on
return, but if another processor is spinning and sees the breakpoint
having been removed, it is good to go.

Right now we do completely unnecessary setup and teardown of the page table
entries for each phase of the patching. This would have to be removed,
so that the asynchronous event handler will always be able to do the
roll forward/roll back as required.

If this is unpalatable, the solution you touched on is probably also
doable, but I need to think *really* carefully about the sequencing
constraints, because now you also have to worry about events
interrupting a patch in progress but not completed. It would however
have the advantage that an arbitrary interrupt on the patching processor
is unlikely to cause a rollback, and so would be safer to execute with
interrupts enabled without causing a livelock.

Now, you can't just remove the breakpoint; you have to make sure that if
you do, during the first phase text_poke_bp() will loop, and in the
second phase complete the 

Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread Andy Lutomirski
On Sun, Jan 13, 2019 at 6:41 PM H. Peter Anvin  wrote:
>
> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
> >
> > static cpumask_t text_poke_cpumask;
> >
> > static void text_poke_sync(void)
> > {
> >   smp_wmb();
> >   text_poke_cpumask = cpu_online_mask;
> >   smp_wmb();  /* Should be optional on x86 */
> >   cpumask_clear_cpu(_poke_cpumask, smp_processor_id());
> >   on_each_cpu_mask(_poke_cpumask, text_poke_sync_cpu, NULL, false);
> >   while (!cpumask_empty(_poke_cpumask)) {
> >   cpu_relax();
> >   smp_rmb();
> >   }
> > }
> >
> > static void text_poke_sync_cpu(void *dummy)
> > {
> >   (void)dummy;
> >
> >   smp_rmb();
> >   cpumask_clear_cpu(_bitmask, smp_processor_id());
> >   /*
> >* We are guaranteed to return with an IRET, either from the
> >* IPI or the #BP handler; this provides serialization.
> >*/
> > }
> >
>
> The invariants here are:
>
> 1. The patching routine must set each bit in the cpumask after each event
>that requires synchronization is complete.
> 2. The bit can be (atomically) cleared on the target CPU only, and only in a
>place that guarantees a synchronizing event (e.g. IRET) before it may
>reaching the poked instruction.
> 3. At a minimum the IPI handler and #BP handler needs to clear the bit. It
>*is* also possible to clear it in other places, e.g. the NMI handler, if
>necessary as long as condition 2 is satisfied.
>

I don't even think this is sufficient.  I think we also need everyone
who clears the bit to check if all bits are clear and, if so, remove
the breakpoint.  Otherwise we have a situation where, if you are in
text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
and that interrupt then hits the breakpoint, then you deadlock because
no one removes the breakpoint.

If we do this, and if we can guarantee that all CPUs make forward
progress, then maybe the problem is solved. Can we guarantee something
like all NMI handlers that might wait in a spinlock or for any other
reason will periodically check if a sync is needed while they're
spinning?
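
Concretely, I'm imagining every such spin loop using a relax helper along
these lines (just a sketch, reusing the text_poke_cpumask idea from the
earlier mail):

static inline void text_poke_spin_relax(void)
{
	/*
	 * Safe per invariant 2: sync_core() is the serializing event, and
	 * we only clear our bit after it has executed.
	 */
	if (cpumask_test_cpu(smp_processor_id(), &text_poke_cpumask)) {
		sync_core();
		cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
	}
	cpu_relax();
}

and then auditing that every NMI-reachable loop that can wait on another
CPU calls this instead of a bare cpu_relax().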

--Andy


Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread Peter Zijlstra
On Fri, Jan 11, 2019 at 01:05:20PM -0800, Andy Lutomirski wrote:
> On Fri, Jan 11, 2019 at 12:54 PM Linus Torvalds
>  wrote:
> >
> > On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf  wrote:
> > >
> > > I was referring to the fact that a single static call key update will
> > > usually result in patching multiple call sites.  But you're right, it's
> > > only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> > > we may want to batch all the writes like what Daniel has proposed for
> > > jump labels, to reduce IPIs.
> >
> > Yeah, my suggestion doesn't allow for batching, since it would
> > basically generate one trampoline for every rewritten instruction.
> 
> Sure it does.  Just make 1000 trampolines and patch 1000 sites in a
> batch :)  As long as the number of trampolines is smallish (e.g. fits
> in a page), then we should be in good shape.

Much easier still would be to make the ARCH_DEFINE_STATIC_TRAMP thing
generate the two trampolines per callsite and simply keep them around.

Another advantage is that you then only have to patch the JMP target,
since the return address will always stay the same (since these things
are generated per call-site).


Anyway... the STI-shadow thing is very clever. But I'm with Josh in that
I think I prefer the IRET frame offset thing -- but yes, I've read
Linus' argument against that.


Re: [PATCH v3 0/6] Static calls

2019-01-13 Thread H. Peter Anvin
On 1/13/19 6:31 PM, H. Peter Anvin wrote:
> 
> static cpumask_t text_poke_cpumask;
> 
> static void text_poke_sync(void)
> {
>   smp_wmb();
>   text_poke_cpumask = cpu_online_mask;
>   smp_wmb();  /* Should be optional on x86 */
>   cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>   on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
>   while (!cpumask_empty(&text_poke_cpumask)) {
>   cpu_relax();
>   smp_rmb();
>   }
> }
> 
> static void text_poke_sync_cpu(void *dummy)
> {
>   (void)dummy;
> 
>   smp_rmb();
>   cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>   /*
>* We are guaranteed to return with an IRET, either from the
>* IPI or the #BP handler; this provides serialization.
>*/
> }
> 

The invariants here are:

1. The patching routine must set each bit in the cpumask after each event
   that requires synchronization is complete.
2. The bit can be (atomically) cleared on the target CPU only, and only in a
   place that guarantees a synchronizing event (e.g. IRET) before it may
   reach the poked instruction.
3. At a minimum the IPI handler and #BP handler need to clear the bit. It
   *is* also possible to clear it in other places, e.g. the NMI handler, if
   necessary as long as condition 2 is satisfied.

-hpa


Re: [PATCH v3 0/6] Static calls

2019-01-13 Thread H. Peter Anvin
On 1/11/19 11:39 AM, Jiri Kosina wrote:
> On Fri, 11 Jan 2019, h...@zytor.com wrote:
> 
>> I still don't see why can't simply spin in the #BP handler until the 
>> patch is complete.
> 
> I think this brings us to the already discussed possible deadlock, when 
> one CPU#0 is in the middle of text_poke_bp(), CPU#1 is spinning inside 
> spin_lock_irq*() and CPU#2 hits the breakpoint while holding that very 
> 'lock'.
> 
> Then we're stuck forever, because CPU#1 will never handle the pending 
> sync_core() IPI (it's not NMI).
> 
> Or have I misunderstood what you meant?
> 

OK, I was thinking about this quite a while ago, and even started hacking on
it, but apparently I managed to forget some key details.

Specifically, you do *not* want to use the acknowledgment of the IPI as the
blocking condition, so don't use a waiting IPI.

Instead, you want a CPU bitmap (or percpu variable) that the IPI handler
clears.  When you are spinning in the IPI handler, you *also* need to clear
that bit.  Doing so is safe even in the case of batched updates, because you
are guaranteed to execute an IRET before you get to patched code.

So the synchronization part of the patching routine becomes:

static cpumask_t text_poke_cpumask;

static void text_poke_sync(void)
{
smp_wmb();
cpumask_copy(&text_poke_cpumask, cpu_online_mask);
smp_wmb();  /* Optional on x86 */
cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
while (!cpumask_empty(&text_poke_cpumask)) {
cpu_relax();
smp_rmb();
}
}

static void text_poke_sync_cpu(void *dummy)
{
(void)dummy;

smp_rmb();
cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
/*
 * We are guaranteed to return with an IRET, either from the
 * IPI or the #BP handler; this provides serialization.
 */
}

The spin routine then needs to add a call to do something like this. By
(optionally) not comparing to a specific breakpoint address we allow for
batching, but we may end up spinning on a breakpoint that is not actually a
patching breakpoint until the patching is done.

int poke_int3_handler(struct pt_regs *regs)
{
/* In the current handler, but shouldn't be needed... */
smp_rmb();

if (likely(!atomic_read(&bp_patching_in_progress)))
return 0;

if (unlikely(!user_mode(regs) &&
    atomic_read(&bp_patching_in_progress))) {
text_poke_sync();
regs->ip--;
return 1;   /* May end up retaking the trap */
} else {
return 0;
}
}

Unless I'm totally mistaken, the worst thing that will happen with this code
is that it may end up taking a harmless spurious IPI at a later time.

-hpa


Re: [PATCH v3 0/6] Static calls

2019-01-12 Thread hpa
On January 11, 2019 11:34:34 AM PST, Linus Torvalds 
 wrote:
>On Fri, Jan 11, 2019 at 11:24 AM  wrote:
>>
>> I still don't see why can't simply spin in the #BP handler until the
>patch is complete.
>
>So here's at least one problem:
>
>text_poke_bp()
>  text_poke(addr, &int3, sizeof(int3));
>   *interrupt*
>  interrupt has a static call
>*BP*
>  poke_int3_handler
> *BOOM*
>
>Note how at BOOM we cannot just spin (or return) to wait for the
>'int3' to be switched back. Becuase it never will. Because we are
>interrupting the thing that would do that switch-back.
>
>So we'd have to do the 'text_poke_bp()' sequence with interrupts
>disabled. Which we can't do right now at least, because part of that
>sequence involves that on_each_cpu(do_sync_core) thing, which needs
>interrupts enabled.
>
>See?
>
>Or am I missing something?
>
>Linus

Ok, I was thinking far more about spinning with an IRET and letting the 
exception be delivered. Patching with interrupts disabled has other 
problems... 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v3 0/6] Static calls

2019-01-12 Thread hpa
On January 11, 2019 11:34:34 AM PST, Linus Torvalds 
 wrote:
>On Fri, Jan 11, 2019 at 11:24 AM  wrote:
>>
>> I still don't see why can't simply spin in the #BP handler until the
>patch is complete.
>
>So here's at least one problem:
>
>text_poke_bp()
>  text_poke(addr, &int3, sizeof(int3));
>   *interrupt*
>  interrupt has a static call
>*BP*
>  poke_int3_handler
> *BOOM*
>
>Note how at BOOM we cannot just spin (or return) to wait for the
>'int3' to be switched back. Becuase it never will. Because we are
>interrupting the thing that would do that switch-back.
>
>So we'd have to do the 'text_poke_bp()' sequence with interrupts
>disabled. Which we can't do right now at least, because part of that
>sequence involves that on_each_cpu(do_sync_core) thing, which needs
>interrupts enabled.
>
>See?
>
>Or am I missing something?
>
>Linus

Let me think about it.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v3 0/6] Static calls

2019-01-12 Thread Andy Lutomirski
On Fri, Jan 11, 2019 at 1:22 PM Josh Poimboeuf  wrote:
>
> On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
> > On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf  wrote:
> > >
> > > I was referring to the fact that a single static call key update will
> > > usually result in patching multiple call sites.  But you're right, it's
> > > only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> > > we may want to batch all the writes like what Daniel has proposed for
> > > jump labels, to reduce IPIs.
> >
> > Yeah, my suggestion doesn't allow for batching, since it would
> > basically generate one trampoline for every rewritten instruction.
>
> As Andy said, I think batching would still be possible, it's just that
> we'd have to create multiple trampolines at a time.
>
> Or... we could do a hybrid approach: create a single custom trampoline
> which has the call destination patched in, but put the return address in
> %rax -- which is always clobbered, even for callee-saved PV ops.  Like:
>

One thing I particularly like about the current design is that there
are no requirements at all on the calling convention.  I think it
seems fragile to add a calling convention constraint that only applies
when there's a race.  I'd rather do a longjmp-like hack or a stack gap
adding hack than make the actual static calls more fragile.


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Nadav Amit
> On Jan 11, 2019, at 1:55 PM, Steven Rostedt  wrote:
> 
> On Fri, 11 Jan 2019 15:41:22 -0600
> Josh Poimboeuf  wrote:
> 
>>> I don’t see RCU-sched solves the problem if you don’t disable preemption. On
>>> a fully preemptable kernel, you can get preempted between the push and the
>>> call (jmp) or before the push. RCU-sched can then finish, and the preempted
>>> task may later jump to a wrong patched-dest.  
>> 
>> Argh, I misspoke about RCU-sched.  Words are hard.
>> 
>> I meant synchronize_rcu_tasks(), which is a completely different animal.
>> My understanding is that it waits until all runnable tasks (including
>> preempted tasks) have gotten a chance to run.
> 
> Not quite, but does the same thing. It waits for all tasks to either
> schedule voluntarily (not preempted), or be / go into idle, or be /go
> into userspace. In any case, it makes sure code is off of trampolines.
> I use this before freeing trampolines used by ftrace.

Interesting. So my last email is completely wrong.
 

Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Nadav Amit
> On Jan 11, 2019, at 1:41 PM, Josh Poimboeuf  wrote:
> 
> On Fri, Jan 11, 2019 at 09:36:59PM +, Nadav Amit wrote:
>>> On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf  wrote:
>>> 
>>> On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
 On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf  
 wrote:
> I was referring to the fact that a single static call key update will
> usually result in patching multiple call sites.  But you're right, it's
> only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> we may want to batch all the writes like what Daniel has proposed for
> jump labels, to reduce IPIs.
 
 Yeah, my suggestion doesn't allow for batching, since it would
 basically generate one trampoline for every rewritten instruction.
>>> 
>>> As Andy said, I think batching would still be possible, it's just that
>>> we'd have to create multiple trampolines at a time.
>>> 
>>> Or... we could do a hybrid approach: create a single custom trampoline
>>> which has the call destination patched in, but put the return address in
>>> %rax -- which is always clobbered, even for callee-saved PV ops.  Like:
>>> 
>>> trampoline:
>>> push %rax
>>> call patched-dest
>>> 
>>> That way the batching could be done with a single trampoline
>>> (particularly if using rcu-sched to avoid the sti hack).
>> 
>> I don’t see RCU-sched solves the problem if you don’t disable preemption. On
>> a fully preemptable kernel, you can get preempted between the push and the
>> call (jmp) or before the push. RCU-sched can then finish, and the preempted
>> task may later jump to a wrong patched-dest.
> 
> Argh, I misspoke about RCU-sched.  Words are hard.
> 
> I meant synchronize_rcu_tasks(), which is a completely different animal.
> My understanding is that it waits until all runnable tasks (including
> preempted tasks) have gotten a chance to run.

Actually, I just used the term you used, and thought about
synchronize_sched(). If you look at my patch [1], you’ll see I did something
similar using synchronize_sched(). But this required some delicate work of
restarting any preempted “optpoline” (or whatever name you want) block. 

[Note that my implementation has a terrible bug in this respect].

This is required since running a preempted task does not prevent it from
being preempted again without making any “real” progress.

If we want to adapt the same solution to static_calls, this means that in
retint_kernel (entry_64.S), you need check whether you got preempted inside
the trampoline and change the saved RIP in such case back, before the
static_call.

IMHO, sti+jmp is simpler.

[1] https://lore.kernel.org/lkml/20181231072112.21051-6-na...@vmware.com/

Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Steven Rostedt
On Fri, 11 Jan 2019 15:41:22 -0600
Josh Poimboeuf  wrote:

> > I don’t see RCU-sched solves the problem if you don’t disable preemption. On
> > a fully preemptable kernel, you can get preempted between the push and the
> > call (jmp) or before the push. RCU-sched can then finish, and the preempted
> > task may later jump to a wrong patched-dest.  
> 
> Argh, I misspoke about RCU-sched.  Words are hard.
> 
> I meant synchronize_rcu_tasks(), which is a completely different animal.
> My understanding is that it waits until all runnable tasks (including
> preempted tasks) have gotten a chance to run.

Not quite, but does the same thing. It waits for all tasks to either
schedule voluntarily (not preempted), or be / go into idle, or be /go
into userspace. In any case, it makes sure code is off of trampolines.
I use this before freeing trampolines used by ftrace.
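
In other words the pattern is (names made up, just to illustrate):

	remove_all_callers_of(tramp);	/* nothing new can enter the trampoline */
	synchronize_rcu_tasks();	/* every task has since voluntarily scheduled,
					 * gone idle, or entered userspace, so no
					 * preempted task can still be on it */
	free_trampoline(tramp);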

-- Steve


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Josh Poimboeuf
On Fri, Jan 11, 2019 at 09:36:59PM +, Nadav Amit wrote:
> > On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf  wrote:
> > 
> > On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
> >> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf  
> >> wrote:
> >>> I was referring to the fact that a single static call key update will
> >>> usually result in patching multiple call sites.  But you're right, it's
> >>> only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> >>> we may want to batch all the writes like what Daniel has proposed for
> >>> jump labels, to reduce IPIs.
> >> 
> >> Yeah, my suggestion doesn't allow for batching, since it would
> >> basically generate one trampoline for every rewritten instruction.
> > 
> > As Andy said, I think batching would still be possible, it's just that
> > we'd have to create multiple trampolines at a time.
> > 
> > Or... we could do a hybrid approach: create a single custom trampoline
> > which has the call destination patched in, but put the return address in
> > %rax -- which is always clobbered, even for callee-saved PV ops.  Like:
> > 
> > trampoline:
> > push %rax
> > call patched-dest
> > 
> > That way the batching could be done with a single trampoline
> > (particularly if using rcu-sched to avoid the sti hack).
> 
> I don’t see RCU-sched solves the problem if you don’t disable preemption. On
> a fully preemptable kernel, you can get preempted between the push and the
> call (jmp) or before the push. RCU-sched can then finish, and the preempted
> task may later jump to a wrong patched-dest.

Argh, I misspoke about RCU-sched.  Words are hard.

I meant synchronize_rcu_tasks(), which is a completely different animal.
My understanding is that it waits until all runnable tasks (including
preempted tasks) have gotten a chance to run.

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Nadav Amit
> On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf  wrote:
> 
> On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
>> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf  wrote:
>>> I was referring to the fact that a single static call key update will
>>> usually result in patching multiple call sites.  But you're right, it's
>>> only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
>>> we may want to batch all the writes like what Daniel has proposed for
>>> jump labels, to reduce IPIs.
>> 
>> Yeah, my suggestion doesn't allow for batching, since it would
>> basically generate one trampoline for every rewritten instruction.
> 
> As Andy said, I think batching would still be possible, it's just that
> we'd have to create multiple trampolines at a time.
> 
> Or... we could do a hybrid approach: create a single custom trampoline
> which has the call destination patched in, but put the return address in
> %rax -- which is always clobbered, even for callee-saved PV ops.  Like:
> 
> trampoline:
>   push %rax
>   call patched-dest
> 
> That way the batching could be done with a single trampoline
> (particularly if using rcu-sched to avoid the sti hack).

I don’t see RCU-sched solves the problem if you don’t disable preemption. On
a fully preemptable kernel, you can get preempted between the push and the
call (jmp) or before the push. RCU-sched can then finish, and the preempted
task may later jump to a wrong patched-dest.
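
To spell out the window I mean (comments added to the trampoline from
above):

trampoline:
	push %rax		/* <-- preemption here (or before the push)... */
	jmp patched-dest	/* ...lets synchronize_sched() finish, and the
				 * task later resumes and jumps to a stale
				 * patched-dest */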



Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Josh Poimboeuf
On Fri, Jan 11, 2019 at 01:10:47PM -0800, Linus Torvalds wrote:
> On Fri, Jan 11, 2019 at 1:05 PM Andy Lutomirski  wrote:
> >
> > > Yeah, my suggestion doesn't allow for batching, since it would
> > > basically generate one trampoline for every rewritten instruction.
> >
> > Sure it does.  Just make 1000 trampolines and patch 1000 sites in a
> > batch :)  As long as the number of trampolines is smallish (e.g. fits
> > in a page), then we should be in good shape.
> 
> Yeah, I guess that's true.
> 
> But let's not worry about it. I don't think we do batching for any
> alternative instruction rewriting right now, do we?

Not at the moment, but Daniel has been working on some patches to do so:

  https://lkml.kernel.org/r/cover.1545228276.git.bris...@redhat.com

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Josh Poimboeuf
On Fri, Jan 11, 2019 at 03:22:10PM -0600, Josh Poimboeuf wrote:
> b) Adding a gap to the #DB entry stack

And that should be #BP of course...  Not sure why my fingers keep doing
that!

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Josh Poimboeuf
On Fri, Jan 11, 2019 at 03:22:10PM -0600, Josh Poimboeuf wrote:
> trampoline:
>   push %rax
>   call patched-dest

That should be a JMP of course.

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Josh Poimboeuf
On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf  wrote:
> >
> > I was referring to the fact that a single static call key update will
> > usually result in patching multiple call sites.  But you're right, it's
> > only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> > we may want to batch all the writes like what Daniel has proposed for
> > jump labels, to reduce IPIs.
> 
> Yeah, my suggestion doesn't allow for batching, since it would
> basically generate one trampoline for every rewritten instruction.

As Andy said, I think batching would still be possible, it's just that
we'd have to create multiple trampolines at a time.

Or... we could do a hybrid approach: create a single custom trampoline
which has the call destination patched in, but put the return address in
%rax -- which is always clobbered, even for callee-saved PV ops.  Like:

trampoline:
push %rax
call patched-dest

That way the batching could be done with a single trampoline
(particularly if using rcu-sched to avoid the sti hack).
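
Roughly, with made-up names and assuming the usual 5-byte call at the patch
site, I'm picturing:

	/*
	 * In poke_int3_handler(), for a #BP taken on a static call site
	 * (regs->ip points just past the int3 byte):
	 */
	regs->ax = regs->ip - 1 + 5;	/* return address the original call would have pushed */
	regs->ip = (unsigned long)bp_trampoline;
	return 1;

bp_trampoline:			/* one generated stub per text_poke_bp() batch */
	push	%rax		/* recreate the return address on the stack */
	jmp	patched-dest	/* destination baked in when the stub is built */

(jmp rather than call, as corrected elsewhere in this thread.)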

If you don't completely hate that approach then I may try it.

> > Regardless, the trampoline management seems more complex to me.  But
> > it's easier to argue about actual code, so maybe I'll code it up to make
> > it easier to compare solutions.
> 
> I do agree hat the stack games are likely "simpler" in one sense. I
> just abhor playing those kinds of games with the entry code and entry
> stack.
> 
> A small bit of extra complexity in the code that actually does the
> rewriting would be much more palatable to me than the complexity in
> the entry code. I prefer seeing the onus of complexity being on the
> code that introduces the problem, not on a innocent bystander.
> 
> I'd like to say that our entry code actually looks fairly sane these
> days.  I'd _like_ to say that, but I'd be lying through my teeth if I
> did. The macros we use make any normal persons head spin.
> 
> The workaround for the stack corruption was fairly simple, but the
> subtlety behind the *reason* for it was what made my hackles rise
> about that code.
> 
> The x86 entry code is some of the nastiest in the kernel, I feel, with
> all the subtle interactions about odd stack switches, odd CPU bugs
> causing odd TLB switches, NMI interactions etc etc.
> 
> So I am fully cognizant that the workaround to shift the stack in the
> entry code was just a couple of lines, and not very complicated.
> 
> And I agree that I may be a bit oversensitive about that area, but it
> really is one of those places where I go "damn, I think I know some
> low-level x86 stuff better than most, but that code scares *me*".
> 
> Which is why I'd accept a rather bigger complexity hit just about
> anywhere else in the code...

I agree that, to a certain extent, it can make sense to put the "onus of
complexity" on the code that introduces the problem.  But of course it's
not an absolute rule, and should be considered in context with the
relative complexities of the competing solutions.

But I think where I see things differently is that:

a) The entry code is, by far, in the best shape it's ever been [*],
   thanks to Andy's considerable efforts.  I find it to be quite
   readable, but that might be due to many hours of intense study...

   [*] overlooking recent unavoidable meltdown/spectre hacks

b) Adding a gap to the #DB entry stack is (in my mind) a simple
   localized change, which is easily understandable by a reader of the
   entry code -- assuming certain personality characteristics of a
   person whose life decisions have resulted in them reading entry code
   in the first place...

c) Doing so is an order of magnitude simpler than the custom trampoline
   thing (or really any of the other many alternatives we've discussed).
   At least that's my feeling.

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Linus Torvalds
On Fri, Jan 11, 2019 at 1:05 PM Andy Lutomirski  wrote:
>
> > Yeah, my suggestion doesn't allow for batching, since it would
> > basically generate one trampoline for every rewritten instruction.
>
> Sure it does.  Just make 1000 trampolines and patch 1000 sites in a
> batch :)  As long as the number of trampolines is smallish (e.g. fits
> in a page), then we should be in good shape.

Yeah, I guess that's true.

But let's not worry about it. I don't think we do batching for any
alternative instruction rewriting right now, do we?

 Linus


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Andy Lutomirski
On Fri, Jan 11, 2019 at 12:54 PM Linus Torvalds
 wrote:
>
> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf  wrote:
> >
> > I was referring to the fact that a single static call key update will
> > usually result in patching multiple call sites.  But you're right, it's
> > only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> > we may want to batch all the writes like what Daniel has proposed for
> > jump labels, to reduce IPIs.
>
> Yeah, my suggestion doesn't allow for batching, since it would
> basically generate one trampoline for every rewritten instruction.

Sure it does.  Just make 1000 trampolines and patch 1000 sites in a
batch :)  As long as the number of trampolines is smallish (e.g. fits
in a page), then we should be in good shape.


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Linus Torvalds
On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf  wrote:
>
> I was referring to the fact that a single static call key update will
> usually result in patching multiple call sites.  But you're right, it's
> only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> we may want to batch all the writes like what Daniel has proposed for
> jump labels, to reduce IPIs.

Yeah, my suggestion doesn't allow for batching, since it would
basically generate one trampoline for every rewritten instruction.

> Regardless, the trampoline management seems more complex to me.  But
> it's easier to argue about actual code, so maybe I'll code it up to make
> it easier to compare solutions.

I do agree that the stack games are likely "simpler" in one sense. I
just abhor playing those kinds of games with the entry code and entry
stack.

A small bit of extra complexity in the code that actually does the
rewriting would be much more palatable to me than the complexity in
the entry code. I prefer seeing the onus of complexity being on the
code that introduces the problem, not on an innocent bystander.

I'd like to say that our entry code actually looks fairly sane these
days.  I'd _like_ to say that, but I'd be lying through my teeth if I
did. The macros we use make any normal person's head spin.

The workaround for the stack corruption was fairly simple, but the
subtlety behind the *reason* for it was what made my hackles rise
about that code.

The x86 entry code is some of the nastiest in the kernel, I feel, with
all the subtle interactions about odd stack switches, odd CPU bugs
causing odd TLB switches, NMI interactions etc etc.

So I am fully cognizant that the workaround to shift the stack in the
entry code was just a couple of lines, and not very complicated.

And I agree that I may be a bit oversensitive about that area, but it
really is one of those places where I go "damn, I think I know some
low-level x86 stuff better than most, but that code scares *me*".

Which is why I'd accept a rather bigger complexity hit just about
anywhere else in the code...

   Linus


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Josh Poimboeuf
On Fri, Jan 11, 2019 at 12:12:30PM -0800, Linus Torvalds wrote:
> On Fri, Jan 11, 2019 at 12:04 PM Josh Poimboeuf  wrote:
> >
> > But really, to me, having to create and manage all those custom
> > trampolines still feels a lot more complex than just making a gap on the
> > stack.
> 
> There are no "all those custom trampolines".
> 
> There is literally *one* custom trampoline that you generate as you do
> the rewriting.
> 
> Well, two, since you need the version with the "sti" before the jmp.
> 
> It would be possible to generate the custom trampoline on the fly in
> the BP handler itself, and just have a magic flag for that case. But
> it's probably simpler to do it in the caller, since you need to
> generate that special writable and executable code sequence. You
> probably don't want to do that at BP time.
> 
> You probably want to use a FIX_TEXT_POKE2 page for the generated
> sequence that just maps some generated code executably for a short
> while. Or something like that.

I was referring to the fact that a single static call key update will
usually result in patching multiple call sites.  But you're right, it's
only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
we may want to batch all the writes like what Daniel has proposed for
jump labels, to reduce IPIs.

Regardless, the trampoline management seems more complex to me.  But
it's easier to argue about actual code, so maybe I'll code it up to make
it easier to compare solutions.

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Linus Torvalds
On Fri, Jan 11, 2019 at 12:04 PM Josh Poimboeuf  wrote:
>
> But really, to me, having to create and manage all those custom
> trampolines still feels a lot more complex than just making a gap on the
> stack.

There are no "all those custom trampolines".

There is literally *one* custom trampoline that you generate as you do
the rewriting.

Well, two, since you need the version with the "sti" before the jmp.

It would be possible to generate the custom trampoline on the fly in
the BP handler itself, and just have a magic flag for that case. But
it's probably simpler to do it in the caller, since you need to
generate that special writable and executable code sequence. You
probably don't want to do that at BP time.

You probably want to use a FIX_TEXT_POKE2 page for the generated
sequence that just maps some generated code executably for a short
while. Or something like that.

  Linus


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Josh Poimboeuf
On Fri, Jan 11, 2019 at 11:03:30AM -0800, Linus Torvalds wrote:
> Then we'd change the end of poke_int3_handler() to do something like
> this instead:
> 
>         void *newip = bp_int3_handler;
>         ..
>         if (new == magic_static_call_bp_int3_handler) {
>                 if (regs->flags & X86_EFLAGS_IF) {
>                         newip = magic_static_call_bp_int3_handler_sti;
>                         regs->flags &= ~X86_EFLAGS_IF;
>                 }
>                 regs->ip = (unsigned long) newip;
>                 return 1;
> 
> AAND now we're *really* done.
> 
> Does anybody see any issues in this?

This sounds ok, with a possible tweak: instead of the sti tricks,
couldn't we just use synchronize_rcu_tasks() (as Jason suggested), to
make sure the stubs are no longer used by a preempted task?
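
A minimal sketch of how that could look on the patching side (the helper name is an assumption, not from any posted series):

#include <linux/rcupdate.h>

/*
 * Hedged sketch: before freeing or reusing a generated stub, wait for
 * every task to pass a voluntary scheduling point, so none of them can
 * still be preempted in the middle of the stub.
 */
static void static_call_release_stub(void *stub)
{
        synchronize_rcu_tasks();
        /* 'stub' can now be freed or rewritten for the next call site */
}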

But really, to me, having to create and manage all those custom
trampolines still feels a lot more complex than just making a gap on the
stack.

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Jiri Kosina
On Fri, 11 Jan 2019, h...@zytor.com wrote:

> I still don't see why we can't simply spin in the #BP handler until the 
> patch is complete.

I think this brings us to the already discussed possible deadlock, when 
CPU#0 is in the middle of text_poke_bp(), CPU#1 is spinning inside 
spin_lock_irq*() and CPU#2 hits the breakpoint while holding that very 
'lock'.

Then we're stuck forever, because CPU#1 will never handle the pending 
sync_core() IPI (it's not NMI).

Or have I misunderstood what you meant?

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Linus Torvalds
On Fri, Jan 11, 2019 at 11:24 AM  wrote:
>
> I still don't see why we can't simply spin in the #BP handler until the patch is 
> complete.

So here's at least one problem:

text_poke_bp()
  text_poke(addr, , sizeof(int3));
   *interrupt*
  interrupt has a static call
*BP*
  poke_int3_handler
 *BOOM*

Note how at BOOM we cannot just spin (or return) to wait for the
'int3' to be switched back. Because it never will. Because we are
interrupting the thing that would do that switch-back.

So we'd have to do the 'text_poke_bp()' sequence with interrupts
disabled. Which we can't do right now at least, because part of that
sequence involves that on_each_cpu(do_sync_core) thing, which needs
interrupts enabled.

See?

Or am I missing something?

Linus


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Nadav Amit
> On Jan 11, 2019, at 11:23 AM, h...@zytor.com wrote:
> 
> On January 11, 2019 11:03:30 AM PST, Linus Torvalds 
>  wrote:
>> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf 
>> wrote:
 Now, in the int3 handler can you take the faulting RIP and search
>> for it in
 the “static-calls” table, writing the RIP+5 (offset) into R10
>> (return
 address) and the target into R11. You make the int3 handler to
>> divert the
 code execution by changing pt_regs->rip to point to a new function
>> that does:
  push R10
  jmp __x86_indirect_thunk_r11
 
 And then you are done. No?
>>> 
>>> IIUC, that sounds pretty much like what Steven proposed:
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20181129122000.7fb4fb04%40gandalf.local.homedata=02%7C01%7Cnamit%40vmware.com%7Ca665a074940b4630e3fc08d677fa753b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636828314949874019sdata=fs2uo%2BjL%2FV3rpzIHJ%2B3QoyHg4KhV%2B%2FUPmUOLpy8S8p8%3Dreserved=0
>>> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
>>> that case there's only one clobbered register to work with (rax).
>> 
>> Actually, there's a much simpler model now that I think about it.
>> 
>> The BP fixup just fixes up %rip to point to "bp_int3_handler".
>> 
>> And that's just a random text address set up by "text_poke_bp()".
>> 
>> So how about the static call rewriting simply do this:
>> 
>> - for each static call:
>> 
>> 1)   create a fixup code stub that does
>> 
>>   push $returnaddressforTHIScall
>>   jmp targetforTHIScall
>> 
>> 2) do
>> 
>>   on_each_cpu(do_sync_core, NULL, 1);
>> 
>>to make sure all CPU's see this generated code
>> 
>> 3) do
>> 
>>   text_poke_bp(addr, newcode, newlen, generatedcode);
>> 
>> Ta-daa! Done.
>> 
>> In fact, it turns out that even the extra "do_sync_core()" in #2 is
>> unnecessary, because taking the BP will be serializing on the CPU that
>> takes it, so we can skip it.
>> 
>> End result: the text_poke_bp() function will do the two do_sync_core
>> IPI's that guarantee that by the time it returns, no other CPU is
>> using the generated code any more, so it can be re-used for the next
>> static call fixup.
>> 
>> Notice? No odd emulation, no need to adjust the stack in the BP
>> handler, just the regular "return to a different IP".
>> 
>> Now, there is a nasty special case with that stub, though.
>> 
>> So nasty thing with the whole "generate a stub for each call" case:
>> because it's dynamic and because of the re-use of the stub, you could
>> be in the situation where:
>> 
>> CPU1  CPU2
>>   
>> 
>> generate a stub
>> on_each_cpu(do_sync_core..)
>> text_poke_bp()
>> ...
>> 
>> rewrite to BP
>>   trigger the BP
>>   return to the stub
>>   run the first instruction of the stub
>>   *INTERRUPT causes rescheduling*
>> 
>> on_each_cpu(do_sync_core..)
>> rewrite to good instruction
>> on_each_cpu(do_sync_core..)
>> 
>> free or re-generate the stub
>> 
>>   !! The stub is still in use !!
>> 
>> So that simple "just generate the stub dynamically" isn't so simple
>> after all.
>> 
>> But it turns out that that is really simple to handle too. How do we do
>> that?
>> 
>> We do that by giving the BP handler *two* code sequences, and we make
>> the BP handler pick one depending on whether it is returning to a
>> "interrupts disabled" or "interrupts enabled" case.
>> 
>> So the BP handler does this:
>> 
>> - if we're returning with interrupts disabled, pick the simple stub
>> 
>> - if we're returning with interrupts enabled, clear IF in the return
>> %rflags, and pick a *slightly* more complex stub:
>> 
>>   push $returnaddressforTHIScall
>>   sti
>>   jmp targetforTHIScall
>> 
>> and now the STI shadow will mean that this sequence is uninterruptible.
>> 
>> So we'd not do complex emulation of the call instruction at BP time,
>> but we'd do that *trivial* change at BP time.
>> 
>> This seems simple, doesn't need any temporary registers at all, and
>> doesn't need any extra stack magic. It literally needs just a trivial
>> sequence in poke_int3_handler().
>> 
>> Then we'd change the end of poke_int3_handler() to do something like
>> this instead:
>> 
>>         void *newip = bp_int3_handler;
>>         ..
>>         if (new == magic_static_call_bp_int3_handler) {
>>                 if (regs->flags & X86_EFLAGS_IF) {
>>                         newip = magic_static_call_bp_int3_handler_sti;
>>                         regs->flags &= ~X86_EFLAGS_IF;
>>                 }
>>                 regs->ip = (unsigned long) newip;
>>                 return 1;
>> 
>> AAND now we're *really* done.
>> 
>> Does anybody see any issues in this?
>> 
>> Linus
> 
> I still don't see why we can't simply spin in the #BP handler until the patch is 
> complete.
> 
> We can't have the #BP handler do any additional patching, as previously 
> 

Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread hpa
On January 11, 2019 11:03:30 AM PST, Linus Torvalds 
 wrote:
>On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf 
>wrote:
>>
>> >
>> > Now, in the int3 handler can you take the faulting RIP and search
>for it in
>> > the “static-calls” table, writing the RIP+5 (offset) into R10
>(return
>> > address) and the target into R11. You make the int3 handler to
>divert the
>> > code execution by changing pt_regs->rip to point to a new function
>that does:
>> >
>> >   push R10
>> >   jmp __x86_indirect_thunk_r11
>> >
>> > And then you are done. No?
>>
>> IIUC, that sounds pretty much like what Steven proposed:
>>
>>  
>https://lkml.kernel.org/r/20181129122000.7fb4f...@gandalf.local.home
>>
>> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
>> that case there's only one clobbered register to work with (rax).
>
>Actually, there's a much simpler model now that I think about it.
>
>The BP fixup just fixes up %rip to point to "bp_int3_handler".
>
>And that's just a random text address set up by "text_poke_bp()".
>
>So how about the static call rewriting simply do this:
>
> - for each static call:
>
> 1)   create a fixup code stub that does
>
>push $returnaddressforTHIScall
>jmp targetforTHIScall
>
> 2) do
>
>on_each_cpu(do_sync_core, NULL, 1);
>
> to make sure all CPU's see this generated code
>
>  3) do
>
>text_poke_bp(addr, newcode, newlen, generatedcode);
>
>Ta-daa! Done.
>
>In fact, it turns out that even the extra "do_sync_core()" in #2 is
>unnecessary, because taking the BP will be serializing on the CPU that
>takes it, so we can skip it.
>
>End result: the text_poke_bp() function will do the two do_sync_core
>IPI's that guarantee that by the time it returns, no other CPU is
>using the generated code any more, so it can be re-used for the next
>static call fixup.
>
>Notice? No odd emulation, no need to adjust the stack in the BP
>handler, just the regular "return to a different IP".
>
>Now, there is a nasty special case with that stub, though.
>
>So nasty thing with the whole "generate a stub for each call" case:
>because it's dynamic and because of the re-use of the stub, you could
>be in the situation where:
>
>  CPU1  CPU2
>    
>
>  generate a stub
>  on_each_cpu(do_sync_core..)
>  text_poke_bp()
>  ...
>
>  rewrite to BP
>trigger the BP
>return to the stub
>run the first instruction of the stub
>*INTERRUPT causes rescheduling*
>
>  on_each_cpu(do_sync_core..)
>  rewrite to good instruction
>  on_each_cpu(do_sync_core..)
>
>  free or re-generate the stub
>
>!! The stub is still in use !!
>
>So that simple "just generate the stub dynamically" isn't so simple
>after all.
>
>But it turns out that that is really simple to handle too. How do we do
>that?
>
>We do that by giving the BP handler *two* code sequences, and we make
>the BP handler pick one depending on whether it is returning to a
>"interrupts disabled" or "interrupts enabled" case.
>
>So the BP handler does this:
>
> - if we're returning with interrupts disabled, pick the simple stub
>
> - if we're returning with interrupts enabled, clear IF in the return
>%rflags, and pick a *slightly* more complex stub:
>
>push $returnaddressforTHIScall
>sti
>jmp targetforTHIScall
>
>and now the STI shadow will mean that this sequence is uninterruptible.
>
>So we'd not do complex emulation of the call instruction at BP time,
>but we'd do that *trivial* change at BP time.
>
>This seems simple, doesn't need any temporary registers at all, and
>doesn't need any extra stack magic. It literally needs just a trivial
>sequence in poke_int3_handler().
>
>Then we'd change the end of poke_int3_handler() to do something like
>this instead:
>
>        void *newip = bp_int3_handler;
>        ..
>        if (new == magic_static_call_bp_int3_handler) {
>                if (regs->flags & X86_EFLAGS_IF) {
>                        newip = magic_static_call_bp_int3_handler_sti;
>                        regs->flags &= ~X86_EFLAGS_IF;
>                }
>                regs->ip = (unsigned long) newip;
>                return 1;
>
>AAND now we're *really* done.
>
>Does anybody see any issues in this?
>
>  Linus

I still don't see why we can't simply spin in the #BP handler until the patch is 
complete.

We can't have the #BP handler do any additional patching, as previously 
discussed, but spinning should be perfectly safe. The simplest way to spin is 
to just IRET; that both serializes and will re-take the exception if the patch 
is still in progress.
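
A minimal sketch of that, reusing text_poke_bp()'s existing globals (the exact names here are assumptions):

#include <linux/compiler.h>
#include <linux/ptrace.h>
#include <linux/types.h>

static bool bp_patching_in_progress;    /* set/cleared by text_poke_bp() */
static void *bp_int3_addr;              /* the byte being patched */

/*
 * Hedged sketch of a "spin by IRET" #BP handler: back RIP up to the
 * patched byte and return.  The IRET serializes this CPU, and if the
 * int3 is still there we simply take the exception again.
 */
static int poke_int3_handler_spin(struct pt_regs *regs)
{
        if (user_mode(regs) || !READ_ONCE(bp_patching_in_progress))
                return 0;

        if ((void *)(regs->ip - 1) != bp_int3_addr)
                return 0;

        regs->ip -= 1;          /* re-execute the (possibly still int3) byte */
        return 1;               /* IRET */
}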

It requires exactly *no* awareness in the #BP handler, allows for the call to 
be replaced with inline code or a simple NOP if desired (or vice versa, as long 
as it is a single instruction.)

If I'm missing something, then please educate me or point me to previous 
discussion; I would greatly appreciate it.

-- 

Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Nadav Amit
> On Jan 11, 2019, at 11:03 AM, Linus Torvalds  
> wrote:
> 
> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf  wrote:
>>> Now, in the int3 handler can you take the faulting RIP and search for it in
>>> the “static-calls” table, writing the RIP+5 (offset) into R10 (return
>>> address) and the target into R11. You make the int3 handler to divert the
>>> code execution by changing pt_regs->rip to point to a new function that 
>>> does:
>>> 
>>>  push R10
>>>  jmp __x86_indirect_thunk_r11
>>> 
>>> And then you are done. No?
>> 
>> IIUC, that sounds pretty much like what Steven proposed:
>> 
>>  
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20181129122000.7fb4fb04%40gandalf.local.homedata=02%7C01%7Cnamit%40vmware.com%7Cd90f2aee03854a8da5dd08d677f7866e%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636828302371417454sdata=KEGVjCZbl5z7cCQX7F0%2FOaWfU7l%2Bvd2YDwZMXuOhpGI%3Dreserved=0
>> 
>> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
>> that case there's only one clobbered register to work with (rax).
> 
> Actually, there's a much simpler model now that I think about it.
> 
> The BP fixup just fixes up %rip to point to "bp_int3_handler".
> 
> And that's just a random text address set up by "text_poke_bp()".
> 
> So how about the static call rewriting simply do this:
> 
> - for each static call:
> 
> 1)   create a fixup code stub that does
> 
>push $returnaddressforTHIScall
>jmp targetforTHIScall
> 
> 2) do
> 
>on_each_cpu(do_sync_core, NULL, 1);
> 
> to make sure all CPU's see this generated code
> 
>  3) do
> 
>text_poke_bp(addr, newcode, newlen, generatedcode);
> 
> Ta-daa! Done.
> 
> In fact, it turns out that even the extra "do_sync_core()" in #2 is
> unnecessary, because taking the BP will be serializing on the CPU that
> takes it, so we can skip it.
> 
> End result: the text_poke_bp() function will do the two do_sync_core
> IPI's that guarantee that by the time it returns, no other CPU is
> using the generated code any more, so it can be re-used for the next
> static call fixup.
> 
> Notice? No odd emulation, no need to adjust the stack in the BP
> handler, just the regular "return to a different IP".
> 
> Now, there is a nasty special case with that stub, though.
> 
> So nasty thing with the whole "generate a stub for each call" case:
> because it's dynamic and because of the re-use of the stub, you could
> be in the situation where:
> 
>  CPU1  CPU2
>    
> 
>  generate a stub
>  on_each_cpu(do_sync_core..)
>  text_poke_bp()
>  ...
> 
>  rewrite to BP
>trigger the BP
>return to the stub
>run the first instruction of the stub
>*INTERRUPT causes rescheduling*
> 
>  on_each_cpu(do_sync_core..)
>  rewrite to good instruction
>  on_each_cpu(do_sync_core..)
> 
>  free or re-generate the stub
> 
>!! The stub is still in use !!
> 
> So that simple "just generate the stub dynamically" isn't so simple after all.
> 
> But it turns out that that is really simple to handle too. How do we do that?
> 
> We do that by giving the BP handler *two* code sequences, and we make
> the BP handler pick one depending on whether it is returning to a
> "interrupts disabled" or "interrupts enabled" case.
> 
> So the BP handler does this:
> 
> - if we're returning with interrupts disabled, pick the simple stub
> 
> - if we're returning with interrupts enabled, clear IF in the return
> %rflags, and pick a *slightly* more complex stub:
> 
>push $returnaddressforTHIScall
>sti
>jmp targetforTHIScall
> 
> and now the STI shadow will mean that this sequence is uninterruptible.
> 
> So we'd not do complex emulation of the call instruction at BP time,
> but we'd do that *trivial* change at BP time.
> 
> This seems simple, doesn't need any temporary registers at all, and
> doesn't need any extra stack magic. It literally needs just a trivial
> sequence in poke_int3_handler().
> 
> Then we'd change the end of poke_int3_handler() to do something like
> this instead:
> 
>         void *newip = bp_int3_handler;
>         ..
>         if (new == magic_static_call_bp_int3_handler) {
>                 if (regs->flags & X86_EFLAGS_IF) {
>                         newip = magic_static_call_bp_int3_handler_sti;
>                         regs->flags &= ~X86_EFLAGS_IF;
>                 }
>                 regs->ip = (unsigned long) newip;
>                 return 1;
> 
> AAND now we're *really* done.
> 
> Does anybody see any issues in this?

I think it’s a better articulated, more detailed version of the solution I
proposed in this thread (which also used a patched trampoline with STI+JMP).

So obviously I like it. ;-)

Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Linus Torvalds
On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf  wrote:
>
> >
> > Now, in the int3 handler can you take the faulting RIP and search for it in
> > the “static-calls” table, writing the RIP+5 (offset) into R10 (return
> > address) and the target into R11. You make the int3 handler to divert the
> > code execution by changing pt_regs->rip to point to a new function that 
> > does:
> >
> >   push R10
> >   jmp __x86_indirect_thunk_r11
> >
> > And then you are done. No?
>
> IIUC, that sounds pretty much like what Steven proposed:
>
>   https://lkml.kernel.org/r/20181129122000.7fb4f...@gandalf.local.home
>
> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
> that case there's only one clobbered register to work with (rax).

Actually, there's a much simpler model now that I think about it.

The BP fixup just fixes up %rip to point to "bp_int3_handler".

And that's just a random text address set up by "text_poke_bp()".

So how about the static call rewriting simply do this:

 - for each static call:

 1)   create a fixup code stub that does

push $returnaddressforTHIScall
jmp targetforTHIScall

 2) do

on_each_cpu(do_sync_core, NULL, 1);

 to make sure all CPU's see this generated code

  3) do

text_poke_bp(addr, newcode, newlen, generatedcode);

Ta-daa! Done.

In fact, it turns out that even the extra "do_sync_core()" in #2 is
unnecessary, because taking the BP will be serializing on the CPU that
takes it, so we can skip it.

End result: the text_poke_bp() function will do the two do_sync_core
IPI's that guarantee that by the time it returns, no other CPU is
using the generated code any more, so it can be re-used for the next
static call fixup.

Notice? No odd emulation, no need to adjust the stack in the BP
handler, just the regular "return to a different IP".

Now, there is a nasty special case with that stub, though.

So nasty thing with the whole "generate a stub for each call" case:
because it's dynamic and because of the re-use of the stub, you could
be in the situation where:

  CPU1  CPU2
    

  generate a stub
  on_each_cpu(do_sync_core..)
  text_poke_bp()
  ...

  rewrite to BP
trigger the BP
return to the stub
run the first instruction of the stub
*INTERRUPT causes rescheduling*

  on_each_cpu(do_sync_core..)
  rewrite to good instruction
  on_each_cpu(do_sync_core..)

  free or re-generate the stub

!! The stub is still in use !!

So that simple "just generate the stub dynamically" isn't so simple after all.

But it turns out that that is really simple to handle too. How do we do that?

We do that by giving the BP handler *two* code sequences, and we make
the BP handler pick one depending on whether it is returning to a
"interrupts disabled" or "interrupts enabled" case.

So the BP handler does this:

 - if we're returning with interrupts disabled, pick the simple stub

 - if we're returning with interrupts enabled, clear IF in the return
%rflags, and pick a *slightly* more complex stub:

push $returnaddressforTHIScall
sti
jmp targetforTHIScall

and now the STI shadow will mean that this sequence is uninterruptible.
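
For concreteness, a hedged sketch of emitting those two stub variants (byte values per the encoding of push/sti/jmp; the helper and layout are assumptions, not part of any patch):

#include <linux/string.h>
#include <linux/types.h>

/*
 * Sketch: write one per-site stub into an RWX scratch page.
 *
 *      push $ret_addr          ; 68 imm32 (sign-extends, which is fine for
 *                              ;  the kernel's top-2GB text mapping)
 *      [sti]                   ; fb, only for the "interrupts were on" stub
 *      jmp rel32               ; e9, rel32 relative to the end of the stub
 */
static size_t emit_static_call_stub(u8 *p, u32 ret_addr, s32 jmp_rel, bool with_sti)
{
        u8 *start = p;

        *p++ = 0x68;                            /* push imm32 */
        memcpy(p, &ret_addr, 4); p += 4;
        if (with_sti)
                *p++ = 0xfb;                    /* sti */
        *p++ = 0xe9;                            /* jmp rel32 */
        memcpy(p, &jmp_rel, 4); p += 4;

        return p - start;                       /* 10 or 11 bytes */
}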

So we'd not do complex emulation of the call instruction at BP time,
but we'd do that *trivial* change at BP time.

This seems simple, doesn't need any temporary registers at all, and
doesn't need any extra stack magic. It literally needs just a trivial
sequence in poke_int3_handler().

Then we'd change the end of poke_int3_handler() to do something like
this instead:

        void *newip = bp_int3_handler;
        ..
        if (new == magic_static_call_bp_int3_handler) {
                if (regs->flags & X86_EFLAGS_IF) {
                        newip = magic_static_call_bp_int3_handler_sti;
                        regs->flags &= ~X86_EFLAGS_IF;
                }
                regs->ip = (unsigned long) newip;
                return 1;

AAND now we're *really* done.

Does anybody see any issues in this?

  Linus


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Nadav Amit
> On Jan 11, 2019, at 8:07 AM, Josh Poimboeuf  wrote:
> 
> On Fri, Jan 11, 2019 at 03:48:59PM +, Nadav Amit wrote:
>>> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
>>> that case there's only one clobbered register to work with (rax).
>> 
>> That’s would be more tricky. How about using a per-CPU trampoline code to
>> hold a direct call to the target and temporarily disable preemption (which
>> might be simpler by disabling IRQs):
>> 

Allow me to simplify/correct:

>> Static-call modifier:
>> 
>>1. synchronize_sched() to ensure per-cpu trampoline is not used
No need for (1) (We are going to sync using IPI).

>>  2. Patches the jmp in a per-cpu trampoline (see below)
>>  3. Saves the call source RIP in [per-cpu scratchpad RIP] (below)
Both do not need to be per-cpu

>>  4. Configures the int3 handler to use static-call int3 handler
>>  5. Patches the call target (as it currently does).
Note that text_poke_bp() would do eventually:

on_each_cpu(do_sync_core, NULL, 1);

So you should know no cores run the trampoline after (5).



Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Josh Poimboeuf
On Fri, Jan 11, 2019 at 03:48:59PM +, Nadav Amit wrote:
> > I liked the idea, BUT, how would it work for callee-saved PV ops?  In
> > that case there's only one clobbered register to work with (rax).
> 
> That would be more tricky. How about using a per-CPU trampoline to
> hold a direct call to the target and temporarily disable preemption (which
> might be simpler by disabling IRQs):
> 
> Static-call modifier:
> 
> 1. synchronize_sched() to ensure per-cpu trampoline is not used
>   2. Patches the jmp in a per-cpu trampoline (see below)
>   3. Saves the call source RIP in [per-cpu scratchpad RIP] (below) 
>   4. Configures the int3 handler to use static-call int3 handler
>   5. Patches the call target (as it currently does).
> 
> Static-call int3 handler:
>   1. Changes flags on the stack to keep IRQs disabled on return
>   2. Jumps to per-cpu trampoline on return
> 
> Per-cpu trampoline:
>   push [per-CPU scratchpad RIP]
>   sti
>   jmp [ target ] (this one is patched)
> 
> Note that no IRQ should be possible between the STI and the JMP due to STI
> blocking.
> 
> What do you say?

This could work, but it's more complex than I was hoping for.

My current leading contender is to do call emulation in the #BP handler,
either by making a gap or by doing Andy's longjmp-style thingie.
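
As a rough sketch (names made up, not from any posted series), the emulation part itself is trivial; the hard bit is that the written slot sits below regs->sp, which is exactly what the gap / frame-shifting is for:

#include <linux/ptrace.h>

/*
 * Sketch: make the interrupted context look as if the patched "call"
 * had executed: push its return address and resume at the target.
 */
static void int3_emulate_call(struct pt_regs *regs, unsigned long ret,
                              unsigned long target)
{
        regs->sp -= sizeof(unsigned long);
        *(unsigned long *)regs->sp = ret;
        regs->ip = target;
}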

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Nadav Amit
> On Jan 11, 2019, at 7:15 AM, Josh Poimboeuf  wrote:
> 
> On Fri, Jan 11, 2019 at 01:47:01AM +, Nadav Amit wrote:
>> Here is an alternative idea (although similar to Steven’s and my code).
>> 
>> Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow
>> should be done by the calling convention (and gcc plugin should allow us to
>> enforce). Also assume that we hold a table with all source RIP and the
>> matching target.
>> 
>> Now, in the int3 handler can you take the faulting RIP and search for it in
>> the “static-calls” table, writing the RIP+5 (offset) into R10 (return
>> address) and the target into R11. You make the int3 handler to divert the
>> code execution by changing pt_regs->rip to point to a new function that does:
>> 
>>  push R10
>>  jmp __x86_indirect_thunk_r11
>> 
>> And then you are done. No?
> 
> IIUC, that sounds pretty much like what Steven proposed:
> 
>  
> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20181129122000.7fb4fb04%40gandalf.local.homedata=02%7C01%7Cnamit%40vmware.com%7Ce3f0b96a1e83417af48808d677d7a147%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636828165370908292sdata=PFzrJQzoa21IRYmEuqHSSGYrNZt0zIo8TGOZa3NWbOE%3Dreserved=0

Stupid me. I’ve remembered it slightly differently (the caller saving the
target in a register).

> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
> that case there's only one clobbered register to work with (rax).

That would be more tricky. How about using a per-CPU trampoline to
hold a direct call to the target and temporarily disable preemption (which
might be simpler by disabling IRQs):

Static-call modifier:

1. synchronize_sched() to ensure per-cpu trampoline is not used
2. Patches the jmp in a per-cpu trampoline (see below)
3. Saves the call source RIP in [per-cpu scratchpad RIP] (below) 
4. Configures the int3 handler to use static-call int3 handler
5. Patches the call target (as it currently does).

Static-call int3 handler:
1. Changes flags on the stack to keep IRQs disabled on return
2. Jumps to per-cpu trampoline on return

Per-cpu trampoline:
push [per-CPU scratchpad RIP]
sti
jmp [ target ] (this one is patched)

Note that no IRQ should be possible between the STI and the JMP due to STI
blocking.

What do you say?
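
Roughly, the int3 handler side of this would be (a hedged sketch with assumed names, not working code):

#include <linux/percpu.h>
#include <linux/ptrace.h>
#include <asm/processor-flags.h>

/* points at this CPU's trampoline: push [scratchpad RIP]; sti; jmp target */
static DEFINE_PER_CPU(void *, static_call_trampoline_ptr);

/*
 * Sketch: keep IRQs masked across the IRET and divert to this CPU's
 * trampoline; its closing sti re-enables IRQs with the jmp sitting in
 * the STI shadow.
 */
static int static_call_bp_handler(struct pt_regs *regs)
{
        regs->flags &= ~X86_EFLAGS_IF;
        regs->ip = (unsigned long)this_cpu_read(static_call_trampoline_ptr);
        return 1;
}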



Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread Josh Poimboeuf
On Fri, Jan 11, 2019 at 01:47:01AM +, Nadav Amit wrote:
> Here is an alternative idea (although similar to Steven’s and my code).
> 
> Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow
> should be done by the calling convention (and gcc plugin should allow us to
> enforce). Also assume that we hold a table with all source RIP and the
> matching target.
> 
> Now, in the int3 handler can you take the faulting RIP and search for it in
> the “static-calls” table, writing the RIP+5 (offset) into R10 (return
> address) and the target into R11. You make the int3 handler to divert the
> code execution by changing pt_regs->rip to point to a new function that does:
> 
>   push R10
>   jmp __x86_indirect_thunk_r11
> 
> And then you are done. No?

IIUC, that sounds pretty much like what Steven proposed:

  https://lkml.kernel.org/r/20181129122000.7fb4f...@gandalf.local.home

I liked the idea, BUT, how would it work for callee-saved PV ops?  In
that case there's only one clobbered register to work with (rax).

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Nadav Amit
> On Jan 10, 2019, at 4:56 PM, Andy Lutomirski  wrote:
> 
> On Thu, Jan 10, 2019 at 3:02 PM Linus Torvalds
>  wrote:
>> On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf  wrote:
>>> Right, emulating a call instruction from the #BP handler is ugly,
>>> because you have to somehow grow the stack to make room for the return
>>> address.  Personally I liked the idea of shifting the iret frame by 16
>>> bytes in the #DB entry code, but others hated it.
>> 
>> Yeah, I hated it.
>> 
>> But I'm starting to think it's the simplest solution.
>> 
>> So still not loving it, but all the other models have had huge issues too.
> 
> Putting my maintainer hat on:
> 
> I'm okay-ish with shifting the stack by 16 bytes.  If this is done, I
> want an assertion in do_int3() or wherever the fixup happens that the
> write isn't overlapping pt_regs (which is easy to implement because
> that code has the relevant pt_regs pointer).  And I want some code
> that explicitly triggers the fixup when a CONFIG_DEBUG_ENTRY=y or
> similar kernel is built so that this whole mess actually gets
> exercised.  Because the fixup only happens when a
> really-quite-improbable race gets hit, and the issues depend on stack
> alignment, which is presumably why Josh was able to submit a buggy
> series without noticing.
> 
> BUT: this is going to be utterly gross whenever anyone tries to
> implement shadow stacks for the kernel, and we might need to switch to
> a longjmp-like approach if that happens.

Here is an alternative idea (although similar to Steven’s and my code).

Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow
should be done by the calling convention (and gcc plugin should allow us to
enforce). Also assume that we hold a table with all source RIP and the
matching target.

Now, in the int3 handler can you take the faulting RIP and search for it in
the “static-calls” table, writing the RIP+5 (offset) into R10 (return
address) and the target into R11. You make the int3 handler to divert the
code execution by changing pt_regs->rip to point to a new function that does:

push R10
jmp __x86_indirect_thunk_r11

And then you are done. No?
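
A hedged sketch of that handler (the lookup helper and thunk symbol here are made up for illustration):

#include <linux/ptrace.h>

struct static_call_site {
        unsigned long call_addr;        /* address of the 5-byte call */
        unsigned long target;
};

extern struct static_call_site *static_call_find(unsigned long addr);
extern char static_call_r10_r11_thunk[];  /* push %r10; jmp __x86_indirect_thunk_r11 */

/*
 * Sketch: R10 carries the return address, R11 the destination; the
 * interrupted context resumes in the thunk instead of the patched call.
 */
static int static_call_int3_divert(struct pt_regs *regs)
{
        struct static_call_site *site = static_call_find(regs->ip - 1);

        if (!site)
                return 0;

        regs->r10 = site->call_addr + 5;        /* return address after the call */
        regs->r11 = site->target;
        regs->ip  = (unsigned long)static_call_r10_r11_thunk;
        return 1;
}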



Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Andy Lutomirski
On Thu, Jan 10, 2019 at 3:02 PM Linus Torvalds
 wrote:
>
> On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf  wrote:
> >
> > Right, emulating a call instruction from the #BP handler is ugly,
> > because you have to somehow grow the stack to make room for the return
> > address.  Personally I liked the idea of shifting the iret frame by 16
> > bytes in the #DB entry code, but others hated it.
>
> Yeah, I hated it.
>
> But I'm starting to think it's the simplest solution.
>
> So still not loving it, but all the other models have had huge issues too.
>

Putting my maintainer hat on:

I'm okay-ish with shifting the stack by 16 bytes.  If this is done, I
want an assertion in do_int3() or wherever the fixup happens that the
write isn't overlapping pt_regs (which is easy to implement because
that code has the relevant pt_regs pointer).  And I want some code
that explicitly triggers the fixup when a CONFIG_DEBUG_ENTRY=y or
similar kernel is built so that this whole mess actually gets
exercised.  Because the fixup only happens when a
really-quite-improbable race gets hit, and the issues depend on stack
alignment, which is presumably why Josh was able to submit a buggy
series without noticing.
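
Something along these lines, presumably (a sketch, not actual patch code):

#include <linux/bug.h>
#include <linux/ptrace.h>

/*
 * Sketch of the requested assertion: the slot about to be written,
 * just below the interrupted RSP, must not overlap the pt_regs of
 * the exception we are currently handling.
 */
static void check_emulated_push(struct pt_regs *regs)
{
        unsigned long slot = regs->sp - sizeof(unsigned long);

        WARN_ON_ONCE(slot >= (unsigned long)regs &&
                     slot <  (unsigned long)(regs + 1));
}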

BUT: this is going to be utterly gross whenever anyone tries to
implement shadow stacks for the kernel, and we might need to switch to
a longjmp-like approach if that happens.

--Andy


Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Linus Torvalds
On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf  wrote:
>
> Right, emulating a call instruction from the #BP handler is ugly,
> because you have to somehow grow the stack to make room for the return
> address.  Personally I liked the idea of shifting the iret frame by 16
> bytes in the #DB entry code, but others hated it.

Yeah, I hated it.

But I'm starting to think it's the simplest solution.

So still not loving it, but all the other models have had huge issues too.

Linus


Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Nadav Amit
> On Jan 10, 2019, at 12:57 PM, Josh Poimboeuf  wrote:
> 
> On Thu, Jan 10, 2019 at 08:48:31PM +, Nadav Amit wrote:
>>> On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf  wrote:
>>> 
>>> On Thu, Jan 10, 2019 at 07:45:26PM +, Nadav Amit wrote:
>> I’m not GCC expert either and writing this code was not making me full of
>> joy, etc.. I’ll be happy that my code would be reviewed, but it does 
>> work. I
>> don’t think an early pass is needed, as long as hardware registers were 
>> not
>> allocated.
>> 
>>> Would it work with more than 5 arguments, where args get passed on the
>>> stack?
>> 
>> It does.
>> 
>>> At the very least, it would (at least partially) defeat the point of the
>>> callee-saved paravirt ops.
>> 
>> Actually, I think you can even deal with callee-saved functions and 
>> remove
>> all the (terrible) macros. You would need to tell the extension not to
>> clobber the registers through a new attribute.
> 
> Ok, it does sound interesting then.  I assume you'll be sharing the
> code?
 
 Of course. If this is what is going to convince you, I’ll make a small version for
 PV callee-saved first.
>>> 
>>> It wasn't *only* the PV callee-saved part which interested me, so if you
>>> already have something which implements the other parts, I'd still like
>>> to see it.
>> 
>> Did you have a look at 
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20181231072112.21051-4-namit%40vmware.com%2Fdata=02%7C01%7Cnamit%40vmware.com%7C169b737792134fc852d808d6773e454b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636827506683819442sdata=FhM%2F5OD%2FNHx9Jr97iPIBNyn0BoLAlyiSv%2BT4XICBUdg%3Dreserved=0
>>  ?
>> 
>> See the changes to x86_call_markup_plugin.c .
>> 
>> The missing part (that I just finished but need to cleanup) is attributes
>> that allow you to provide key and dynamically enable the patching.
> 
> Aha, so it's basically the same plugin you had for optpolines.  I
> missed that.  I'll need to stare at the code for a little bit.

Pretty much. You would want to change the assembly code block, and based on
the use-case (e.g., callee-saved, static-calls) clobber or set as an input
more or fewer registers.



Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Josh Poimboeuf
On Thu, Jan 10, 2019 at 08:48:31PM +, Nadav Amit wrote:
> > On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf  wrote:
> > 
> > On Thu, Jan 10, 2019 at 07:45:26PM +, Nadav Amit wrote:
>  I’m not GCC expert either and writing this code was not making me full of
>  joy, etc.. I’ll be happy that my code would be reviewed, but it does 
>  work. I
>  don’t think an early pass is needed, as long as hardware registers were 
>  not
>  allocated.
>  
> > Would it work with more than 5 arguments, where args get passed on the
> > stack?
>  
>  It does.
>  
> > At the very least, it would (at least partially) defeat the point of the
> > callee-saved paravirt ops.
>  
>  Actually, I think you can even deal with callee-saved functions and 
>  remove
>  all the (terrible) macros. You would need to tell the extension not to
>  clobber the registers through a new attribute.
> >>> 
> >>> Ok, it does sound interesting then.  I assume you'll be sharing the
> >>> code?
> >> 
> > Of course. If this is what is going to convince you, I’ll make a small version for
> >> PV callee-saved first.
> > 
> > It wasn't *only* the PV callee-saved part which interested me, so if you
> > already have something which implements the other parts, I'd still like
> > to see it.
> 
> Did you have a look at 
> https://lore.kernel.org/lkml/20181231072112.21051-4-na...@vmware.com/ ?
> 
> See the changes to x86_call_markup_plugin.c .
> 
> The missing part (that I just finished but need to cleanup) is attributes
> that allow you to provide key and dynamically enable the patching.

Aha, so it's basically the same plugin you had for optpolines.  I
missed that.  I'll need to stare at the code for a little bit.

> > What if we just used a plugin in a simpler fashion -- to do call site
> > alignment, if necessary, to ensure the instruction doesn't cross
> > cacheline boundaries.  This could be done in a later pass, with no side
> > effects other than code layout.  And it would allow us to avoid
> > breakpoints altogether -- again, assuming somebody can verify that
> > intra-cacheline call destination writes are atomic with respect to
> > instruction decoder reads.
>  
>  The plugin should not be able to do so. Layout of the bytecode is done by
>  the assembler, so I don’t think a plugin would help you with this one.
> >>> 
> >>> Actually I think we could use .bundle_align_mode for this purpose:
> >>> 
> >>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsourceware.org%2Fbinutils%2Fdocs-2.31%2Fas%2FBundle-directives.htmldata=02%7C01%7Cnamit%40vmware.com%7Cbc4dcc541474462da00b08d6773ab61f%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636827491388051263sdata=HZNPN4UygwQCqsX8dOajaNeDZyy1O0O4cYeSwu%2BIdO0%3Dreserved=0
> >> 
> >> Hm… I don’t understand what you have in mind (i.e., when these
> >> assembly directives would be emitted).
> > 
> > For example, it could replace
> > 
> >  callq static_call_tramp_my_key
> > 
> > with
> > 
> >  .bundle_align_mode 6
> >  callq static_call_tramp_my_key
> >  .bundle_align_mode 0
> > 
> > which ensures the instruction is within a cache line, aligning it with
> > NOPs if necessary.  That would allow my current implementation to
> > upgrade out-of-line calls to inline calls 100% of the time, instead of
> > 95% of the time.
> 
> Heh. I almost wrote, based on the feature description, that this will add
> unnecessary padding no matter what, but actually (experimentally) it works
> well…

Yeah, based on the poorly worded docs, I made the same assumption, until
I tried it.

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread H. Peter Anvin
On 1/10/19 9:31 AM, Linus Torvalds wrote:
> On Wed, Jan 9, 2019 at 2:59 PM Josh Poimboeuf  wrote:
>>
>> NOTE: At least experimentally, the call destination writes seem to be
>> atomic with respect to instruction fetching.  On Nehalem I can easily
>> trigger crashes when writing a call destination across cachelines while
>> reading the instruction on other CPU; but I get no such crashes when
>> respecting cacheline boundaries.
> 
> I still doubt ifetch is atomic on a cacheline boundary for the simple
> reason that the bus between the IU and the L1 I$ is narrower in older
> CPU's.
> 

As far as I understand, on P6+ (and possibly earlier, but I don't know) it is
atomic on a 16-byte fetch datum, at least for Intel CPUs.

However, single byte accesses are always going to be safe.

-hpa
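
For reference, a minimal sketch of the boundary check this version's fallback logic implies (the helper name is an assumption, not from the patches):

#include <linux/types.h>

/*
 * Sketch: the rel32 of a 5-byte call starts at addr + 1; rewrite it in
 * place only if all four bytes share one 64-byte cacheline, otherwise
 * keep going through the out-of-line trampoline.
 */
static bool call_rel32_in_one_cacheline(unsigned long call_addr)
{
        unsigned long rel32 = call_addr + 1;

        return (rel32 >> 6) == ((rel32 + 3) >> 6);
}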



Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Josh Poimboeuf
On Thu, Jan 10, 2019 at 09:30:23PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 09, 2019 at 04:59:35PM -0600, Josh Poimboeuf wrote:
> > With this version, I stopped trying to use text_poke_bp(), and instead
> > went with a different approach: if the call site destination doesn't
> > cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> > using the trampoline indefinitely.
> 
> > - Get rid of the use of text_poke_bp(), in favor of atomic writes.
> >   Out-of-line calls will be promoted to inline only if the call sites
> >   don't cross cache line boundaries. [Linus/Andy]
> 
> Can we preserve why text_poke_bp() didn't work? I seem to have forgotten
> again. The problem was poking the return address onto the stack from the
> int3 handler, or something along those lines?

Right, emulating a call instruction from the #BP handler is ugly,
because you have to somehow grow the stack to make room for the return
address.  Personally I liked the idea of shifting the iret frame by 16
bytes in the #DB entry code, but others hated it.

So many bad-but-not-completely-unacceptable options to choose from.

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Nadav Amit
> On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf  wrote:
> 
> On Thu, Jan 10, 2019 at 07:45:26PM +, Nadav Amit wrote:
 I’m not GCC expert either and writing this code was not making me full of
 joy, etc.. I’ll be happy that my code would be reviewed, but it does work. 
 I
 don’t think an early pass is needed, as long as hardware registers were not
 allocated.
 
> Would it work with more than 5 arguments, where args get passed on the
> stack?
 
 It does.
 
> At the very least, it would (at least partially) defeat the point of the
> callee-saved paravirt ops.
 
 Actually, I think you can even deal with callee-saved functions and remove
 all the (terrible) macros. You would need to tell the extension not to
 clobber the registers through a new attribute.
>>> 
>>> Ok, it does sound interesting then.  I assume you'll be sharing the
>>> code?
>> 
>> Of course. If this is what is going to convince you, I’ll make a small version for
>> PV callee-saved first.
> 
> It wasn't *only* the PV callee-saved part which interested me, so if you
> already have something which implements the other parts, I'd still like
> to see it.

Did you have a look at 
https://lore.kernel.org/lkml/20181231072112.21051-4-na...@vmware.com/ ?

See the changes to x86_call_markup_plugin.c .

The missing part (that I just finished but need to cleanup) is attributes
that allow you to provide key and dynamically enable the patching.

> What if we just used a plugin in a simpler fashion -- to do call site
> alignment, if necessary, to ensure the instruction doesn't cross
> cacheline boundaries.  This could be done in a later pass, with no side
> effects other than code layout.  And it would allow us to avoid
> breakpoints altogether -- again, assuming somebody can verify that
> intra-cacheline call destination writes are atomic with respect to
> instruction decoder reads.
 
 The plugin should not be able to do so. Layout of the bytecode is done by
 the assembler, so I don’t think a plugin would help you with this one.
>>> 
>>> Actually I think we could use .bundle_align_mode for this purpose:
>>> 
>>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsourceware.org%2Fbinutils%2Fdocs-2.31%2Fas%2FBundle-directives.htmldata=02%7C01%7Cnamit%40vmware.com%7Cbc4dcc541474462da00b08d6773ab61f%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636827491388051263sdata=HZNPN4UygwQCqsX8dOajaNeDZyy1O0O4cYeSwu%2BIdO0%3Dreserved=0
>> 
>> Hm… I don’t understand what you have in mind (i.e., when these
>> assembly directives would be emitted).
> 
> For example, it could replace
> 
>  callq static_call_tramp_my_key
> 
> with
> 
>  .bundle_align_mode 6
>  callq static_call_tramp_my_key
>  .bundle_align_mode 0
> 
> which ensures the instruction is within a cache line, aligning it with
> NOPs if necessary.  That would allow my current implementation to
> upgrade out-of-line calls to inline calls 100% of the time, instead of
> 95% of the time.

Heh. I almost wrote, based on the feature description, that this will add
unnecessary padding no matter what, but actually (experimentally) it works
well…




Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Josh Poimboeuf
On Thu, Jan 10, 2019 at 07:45:26PM +, Nadav Amit wrote:
> >> I’m not GCC expert either and writing this code was not making me full of
> >> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. 
> >> I
> >> don’t think an early pass is needed, as long as hardware registers were not
> >> allocated.
> >> 
> >>> Would it work with more than 5 arguments, where args get passed on the
> >>> stack?
> >> 
> >> It does.
> >> 
> >>> At the very least, it would (at least partially) defeat the point of the
> >>> callee-saved paravirt ops.
> >> 
> >> Actually, I think you can even deal with callee-saved functions and remove
> >> all the (terrible) macros. You would need to tell the extension not to
> >> clobber the registers through a new attribute.
> > 
> > Ok, it does sound interesting then.  I assume you'll be sharing the
> > code?
> 
> Of course. If this is what is going to convince you, I’ll make a small version for
> PV callee-saved first.

It wasn't *only* the PV callee-saved part which interested me, so if you
already have something which implements the other parts, I'd still like
to see it.

> >>> What if we just used a plugin in a simpler fashion -- to do call site
> >>> alignment, if necessary, to ensure the instruction doesn't cross
> >>> cacheline boundaries.  This could be done in a later pass, with no side
> >>> effects other than code layout.  And it would allow us to avoid
> >>> breakpoints altogether -- again, assuming somebody can verify that
> >>> intra-cacheline call destination writes are atomic with respect to
> >>> instruction decoder reads.
> >> 
> >> The plugin should not be able to do so. Layout of the bytecode is done by
> >> the assembler, so I don’t think a plugin would help you with this one.
> > 
> > Actually I think we could use .bundle_align_mode for this purpose:
> > 
> >  
> > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsourceware.org%2Fbinutils%2Fdocs-2.31%2Fas%2FBundle-directives.htmldata=02%7C01%7Cnamit%40vmware.com%7Cfa29fb8be208498d039008d67727fe30%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636827411004664549sdata=elDuAVOsSlidG7pZSZfjbhrgnMOHeX6AWKs0hJM4cCE%3Dreserved=0
> 
> Hm… I don’t understand what you have in mind (i.e., when these
> assembly directives would be emitted).

For example, it could replace

  callq static_call_tramp_my_key

with

  .bundle_align_mode 6
  callq static_call_tramp_my_key
  .bundle_align_mode 0

which ensures the instruction is within a cache line, aligning it with
NOPs if necessary.  That would allow my current implementation to
upgrade out-of-line calls to inline calls 100% of the time, instead of
95% of the time.

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Peter Zijlstra
On Wed, Jan 09, 2019 at 04:59:35PM -0600, Josh Poimboeuf wrote:
> With this version, I stopped trying to use text_poke_bp(), and instead
> went with a different approach: if the call site destination doesn't
> cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> using the trampoline indefinitely.

> - Get rid of the use of text_poke_bp(), in favor of atomic writes.
>   Out-of-line calls will be promoted to inline only if the call sites
>   don't cross cache line boundaries. [Linus/Andy]

Can we preserve why text_poke_bp() didn't work? I seem to have forgotten
again. The problem was poking the return address onto the stack from the
int3 handler, or something along those lines?


Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Nadav Amit
> On Jan 10, 2019, at 10:18 AM, Josh Poimboeuf  wrote:
> 
> On Thu, Jan 10, 2019 at 05:32:08PM +, Nadav Amit wrote:
>>> On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf  wrote:
>>> 
>>> On Thu, Jan 10, 2019 at 01:21:00AM +, Nadav Amit wrote:
> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf  wrote:
> 
> With this version, I stopped trying to use text_poke_bp(), and instead
> went with a different approach: if the call site destination doesn't
> cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> using the trampoline indefinitely.
> 
> NOTE: At least experimentally, the call destination writes seem to be
> atomic with respect to instruction fetching.  On Nehalem I can easily
> trigger crashes when writing a call destination across cachelines while
> reading the instruction on other CPU; but I get no such crashes when
> respecting cacheline boundaries.
> 
> BUT, the SDM doesn't document this approach, so it would be great if any
> CPU people can confirm that it's safe!
 
 I (still) think that having a compiler plugin can make things much cleaner
 (as done in [1]). The callers would not need to be changed, and the key can
 be provided through an attribute.
 
 Using a plugin should also allow to use Steven’s proposal for doing
 text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done
 by the plugin, you can be guaranteed that registers are clobbered. Then, 
 you
 can store in the assembly block the return address in one of these
 registers.
>>> 
>>> I'm no GCC expert (why do I find myself saying this a lot lately?), but
>>> this sounds to me like it could be tricky to get right.
>>> 
>>> I suppose you'd have to do it in an early pass, to allow GCC to clobber
>>> the registers in a later pass.  So it would necessarily have side
>>> effects, but I don't know what the risks are.
>> 
>> I’m not GCC expert either and writing this code was not making me full of
>> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I
>> don’t think an early pass is needed, as long as hardware registers were not
>> allocated.
>> 
>>> Would it work with more than 5 arguments, where args get passed on the
>>> stack?
>> 
>> It does.
>> 
>>> At the very least, it would (at least partially) defeat the point of the
>>> callee-saved paravirt ops.
>> 
>> Actually, I think you can even deal with callee-saved functions and remove
>> all the (terrible) macros. You would need to tell the extension not to
>> clobber the registers through a new attribute.
> 
> Ok, it does sound interesting then.  I assume you'll be sharing the
> code?

Of course. If this is what is going to convince you, I’ll make a small version for
PV callee-saved first.

>>> What if we just used a plugin in a simpler fashion -- to do call site
>>> alignment, if necessary, to ensure the instruction doesn't cross
>>> cacheline boundaries.  This could be done in a later pass, with no side
>>> effects other than code layout.  And it would allow us to avoid
>>> breakpoints altogether -- again, assuming somebody can verify that
>>> intra-cacheline call destination writes are atomic with respect to
>>> instruction decoder reads.
>> 
>> The plugin should not be able to do so. Layout of the bytecode is done by
>> the assembler, so I don’t think a plugin would help you with this one.
> 
> Actually I think we could use .bundle_align_mode for this purpose:
> 
>  
> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsourceware.org%2Fbinutils%2Fdocs-2.31%2Fas%2FBundle-directives.htmldata=02%7C01%7Cnamit%40vmware.com%7Cfa29fb8be208498d039008d67727fe30%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636827411004664549sdata=elDuAVOsSlidG7pZSZfjbhrgnMOHeX6AWKs0hJM4cCE%3Dreserved=0

Hm… I don’t understand what you have in mind (i.e., when these
assembly directives would be emitted).





Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Josh Poimboeuf
On Thu, Jan 10, 2019 at 05:32:08PM +, Nadav Amit wrote:
> > On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf  wrote:
> > 
> > On Thu, Jan 10, 2019 at 01:21:00AM +, Nadav Amit wrote:
> >>> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf  wrote:
> >>> 
> >>> With this version, I stopped trying to use text_poke_bp(), and instead
> >>> went with a different approach: if the call site destination doesn't
> >>> cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> >>> using the trampoline indefinitely.
> >>> 
> >>> NOTE: At least experimentally, the call destination writes seem to be
> >>> atomic with respect to instruction fetching.  On Nehalem I can easily
> >>> trigger crashes when writing a call destination across cachelines while
> >>> reading the instruction on other CPU; but I get no such crashes when
> >>> respecting cacheline boundaries.
> >>> 
> >>> BUT, the SDM doesn't document this approach, so it would be great if any
> >>> CPU people can confirm that it's safe!
> >> 
> >> I (still) think that having a compiler plugin can make things much cleaner
> >> (as done in [1]). The callers would not need to be changed, and the key can
> >> be provided through an attribute.
> >> 
> >> Using a plugin should also allow to use Steven’s proposal for doing
> >> text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done
> >> by the plugin, you can be guaranteed that registers are clobbered. Then, 
> >> you
> >> can store in the assembly block the return address in one of these
> >> registers.
> > 
> > I'm no GCC expert (why do I find myself saying this a lot lately?), but
> > this sounds to me like it could be tricky to get right.
> > 
> > I suppose you'd have to do it in an early pass, to allow GCC to clobber
> > the registers in a later pass.  So it would necessarily have side
> > effects, but I don't know what the risks are.
> 
> I’m not GCC expert either and writing this code was not making me full of
> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I
> don’t think an early pass is needed, as long as hardware registers were not
> allocated.
> 
> > Would it work with more than 5 arguments, where args get passed on the
> > stack?
> 
> It does.
> 
> > 
> > At the very least, it would (at least partially) defeat the point of the
> > callee-saved paravirt ops.
> 
> Actually, I think you can even deal with callee-saved functions and remove
> all the (terrible) macros. You would need to tell the extension not to
> clobber the registers through a new attribute.

Ok, it does sound interesting then.  I assume you'll be sharing the
code?

> > What if we just used a plugin in a simpler fashion -- to do call site
> > alignment, if necessary, to ensure the instruction doesn't cross
> > cacheline boundaries.  This could be done in a later pass, with no side
> > effects other than code layout.  And it would allow us to avoid
> > breakpoints altogether -- again, assuming somebody can verify that
> > intra-cacheline call destination writes are atomic with respect to
> > instruction decoder reads.
> 
> The plugin should not be able to do so. Layout of the bytecode is done by
> the assembler, so I don’t think a plugin would help you with this one.

Actually I think we could use .bundle_align_mode for this purpose:

  https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Nadav Amit
> On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf  wrote:
> 
> On Thu, Jan 10, 2019 at 01:21:00AM +, Nadav Amit wrote:
>>> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf  wrote:
>>> 
>>> With this version, I stopped trying to use text_poke_bp(), and instead
>>> went with a different approach: if the call site destination doesn't
>>> cross a cacheline boundary, just do an atomic write.  Otherwise, keep
>>> using the trampoline indefinitely.
>>> 
>>> NOTE: At least experimentally, the call destination writes seem to be
>>> atomic with respect to instruction fetching.  On Nehalem I can easily
>>> trigger crashes when writing a call destination across cachelines while
>>> reading the instruction on other CPU; but I get no such crashes when
>>> respecting cacheline boundaries.
>>> 
>>> BUT, the SDM doesn't document this approach, so it would be great if any
>>> CPU people can confirm that it's safe!
>> 
>> I (still) think that having a compiler plugin can make things much cleaner
>> (as done in [1]). The callers would not need to be changed, and the key can
>> be provided through an attribute.
>> 
>> Using a plugin should also allow to use Steven’s proposal for doing
>> text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done
>> by the plugin, you can be guaranteed that registers are clobbered. Then, you
>> can store in the assembly block the return address in one of these
>> registers.
> 
> I'm no GCC expert (why do I find myself saying this a lot lately?), but
> this sounds to me like it could be tricky to get right.
> 
> I suppose you'd have to do it in an early pass, to allow GCC to clobber
> the registers in a later pass.  So it would necessarily have side
> effects, but I don't know what the risks are.

I’m not a GCC expert either and writing this code was not making me full of
joy, etc. I’ll be happy to have my code reviewed, but it does work. I
don’t think an early pass is needed, as long as hardware registers were not
allocated.

> Would it work with more than 5 arguments, where args get passed on the
> stack?

It does.

> 
> At the very least, it would (at least partially) defeat the point of the
> callee-saved paravirt ops.

Actually, I think you can even deal with callee-saved functions and remove
all the (terrible) macros. You would need to tell the extension not to
clobber the registers through a new attribute.
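
Something along these lines, where the attribute name is invented purely to
illustrate the idea (a real plugin would register its own attribute):

/*
 * Hypothetical plugin attribute: for functions marked this way, the
 * generated asm ("call ...") would leave the usual call-clobbered
 * registers out of its clobber list, so the callee-saved paravirt
 * calling convention is preserved without the wrapper macros.
 */
void my_callee_saved_op(unsigned long arg)
	__attribute__((preserve_caller_regs));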

> What if we just used a plugin in a simpler fashion -- to do call site
> alignment, if necessary, to ensure the instruction doesn't cross
> cacheline boundaries.  This could be done in a later pass, with no side
> effects other than code layout.  And it would allow us to avoid
> breakpoints altogether -- again, assuming somebody can verify that
> intra-cacheline call destination writes are atomic with respect to
> instruction decoder reads.

I don’t think the plugin can do that. The layout of the machine code is
done by the assembler, so a plugin would not help you with this one.

Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Linus Torvalds
On Wed, Jan 9, 2019 at 2:59 PM Josh Poimboeuf  wrote:
>
> NOTE: At least experimentally, the call destination writes seem to be
> atomic with respect to instruction fetching.  On Nehalem I can easily
> trigger crashes when writing a call destination across cachelines while
> reading the instruction on another CPU; but I get no such crashes when
> respecting cacheline boundaries.

I still doubt ifetch is atomic within a cacheline, for the simple reason
that the bus between the IU and the L1 I$ is narrower on older CPUs.

Also, the fill of the L1 I$ from the (cache-coherent) L2 may not happen a
cacheline at a time either.

That said, the fetch may be sufficiently ordered that it works in
practice. It _would_ be absolutely lovely to be able to do things like
this.

I do agree with Nadav that if there's some way to avoid this, it would
be good. I'm not in general a huge fan of compiler plugins (compiler
instability is just about my worst fear, and I feel plugins tend to
open up that area a lot), but it does feel like this might be
something where compiler tweaking would possibly be the cleanest
approach.

 Linus


Re: [PATCH v3 0/6] Static calls

2019-01-10 Thread Josh Poimboeuf
On Thu, Jan 10, 2019 at 01:21:00AM +, Nadav Amit wrote:
> > On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf  wrote:
> > 
> > With this version, I stopped trying to use text_poke_bp(), and instead
> > went with a different approach: if the call site destination doesn't
> > cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> > using the trampoline indefinitely.
> > 
> > NOTE: At least experimentally, the call destination writes seem to be
> > atomic with respect to instruction fetching.  On Nehalem I can easily
> > trigger crashes when writing a call destination across cachelines while
> > reading the instruction on another CPU; but I get no such crashes when
> > respecting cacheline boundaries.
> > 
> > BUT, the SDM doesn't document this approach, so it would be great if any
> > CPU people can confirm that it's safe!
> > 
> 
> I (still) think that having a compiler plugin can make things much cleaner
> (as done in [1]). The callers would not need to be changed, and the key can
> be provided through an attribute.
> 
> Using a plugin should also make it possible to use Steven’s proposal for
> doing text_poke() safely: by changing 'func()' into 'asm (“call func”)', as
> done by the plugin, you can be guaranteed that the registers are clobbered.
> Then, in the assembly block, you can store the return address in one of
> these registers.

I'm no GCC expert (why do I find myself saying this a lot lately?), but
this sounds to me like it could be tricky to get right.

I suppose you'd have to do it in an early pass, to allow GCC to clobber
the registers in a later pass.  So it would necessarily have side
effects, but I don't know what the risks are.

Would it work with more than 5 arguments, where args get passed on the
stack?

At the very least, it would (at least partially) defeat the point of the
callee-saved paravirt ops.

What if we just used a plugin in a simpler fashion -- to do call site
alignment, if necessary, to ensure the instruction doesn't cross
cacheline boundaries.  This could be done in a later pass, with no side
effects other than code layout.  And it would allow us to avoid
breakpoints altogether -- again, assuming somebody can verify that
intra-cacheline call destination writes are atomic with respect to
instruction decoder reads.

-- 
Josh


Re: [PATCH v3 0/6] Static calls

2019-01-09 Thread Nadav Amit
> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf  wrote:
> 
> With this version, I stopped trying to use text_poke_bp(), and instead
> went with a different approach: if the call site destination doesn't
> cross a cacheline boundary, just do an atomic write.  Otherwise, keep
> using the trampoline indefinitely.
> 
> NOTE: At least experimentally, the call destination writes seem to be
> atomic with respect to instruction fetching.  On Nehalem I can easily
> trigger crashes when writing a call destination across cachelines while
> reading the instruction on another CPU; but I get no such crashes when
> respecting cacheline boundaries.
> 
> BUT, the SDM doesn't document this approach, so it would be great if any
> CPU people can confirm that it's safe!
> 

I (still) think that having a compiler plugin can make things much cleaner
(as done in [1]). The callers would not need to be changed, and the key can
be provided through an attribute.

Using a plugin should also make it possible to use Steven’s proposal for
doing text_poke() safely: by changing 'func()' into 'asm (“call func”)', as
done by the plugin, you can be guaranteed that the registers are clobbered.
Then, in the assembly block, you can store the return address in one of
these registers.

[1] https://lkml.org/lkml/2018/12/31/25
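
For illustration, a hand-written sketch of what the transformed call site
could look like -- this is just the idea, not the plugin's actual output,
and the function/label names are invented:

extern void my_func(void);

static inline void my_func_static_call(void)
{
	/*
	 * Because the call is wrapped in asm(), GCC must assume the listed
	 * registers are clobbered; that freedom lets us stash the return
	 * address in one of them before the call, which a breakpoint-based
	 * patching scheme can then use to emulate the call.
	 */
	asm volatile ("leaq	1f(%%rip), %%rax\n\t"	/* return address */
		      "call	my_func\n"
		      "1:"
		      : : : "rax", "rcx", "rdx", "rsi", "rdi",
			    "r8", "r9", "r10", "r11", "memory");
}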






[PATCH v3 0/6] Static calls

2019-01-09 Thread Josh Poimboeuf
With this version, I stopped trying to use text_poke_bp(), and instead
went with a different approach: if the call site destination doesn't
cross a cacheline boundary, just do an atomic write.  Otherwise, keep
using the trampoline indefinitely.
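
The boundary check itself is conceptually simple; something along these
lines (illustrative only, not the exact code in the series), keeping in
mind that the 4-byte rel32 destination starts one byte into the 5-byte
call instruction:

#define CL_SHIFT 6	/* 64-byte cachelines assumed */

static bool call_dest_in_one_cacheline(const void *call_insn)
{
	unsigned long dest = (unsigned long)call_insn + 1;	/* skip the e8 opcode */

	/* the 4 destination bytes must not straddle a cacheline */
	return (dest >> CL_SHIFT) == ((dest + 3) >> CL_SHIFT);
}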

NOTE: At least experimentally, the call destination writes seem to be
atomic with respect to instruction fetching.  On Nehalem I can easily
trigger crashes when writing a call destination across cachelines while
reading the instruction on another CPU; but I get no such crashes when
respecting cacheline boundaries.

BUT, the SDM doesn't document this approach, so it would be great if any
CPU people can confirm that it's safe!


v3:
- Split up the patches a bit more so that out-of-line static calls can
  be separately mergeable.  Inline is more controversial, and other
  approaches or improvements might be considered.  For example, Nadav is
  looking at implementing it with the help of a GCC plugin to ensure the
  call sites don't cross cacheline boundaries.

- Get rid of the use of text_poke_bp(), in favor of atomic writes.
  Out-of-line calls will be promoted to inline only if the call sites
  don't cross cache line boundaries. [Linus/Andy]

- Converge the inline and out-of-line trampolines into a single
  implementation, which uses a direct jump.  This was made possible by
  making static_call_update() safe to be called during early boot.

- Remove trampoline poisoning for now, since trampolines may still be
  needed for call sites which cross cache line boundaries.

- Rename CONFIG_HAVE_STATIC_CALL_OUTLINE -> CONFIG_HAVE_STATIC_CALL [Steven]

- Add missing __static_call_update() call to static_call_update() [Edward]

- Add missing key->func update in __static_call_update() [Edward]

- Put trampoline in a separate section to prevent 2-byte tail calls [Linus]


v2:
- fix STATIC_CALL_TRAMP() macro by using __PASTE() [Ard]
- rename optimized/unoptimized -> inline/out-of-line [Ard]
- tweak arch interfaces for PLT and add key->tramp field [Ard]
- rename 'poison' to 'defuse' and do it after all sites have been patched [Ard]
- fix .init handling [Ard, Steven]
- add CONFIG_HAVE_STATIC_CALL [Steven]
- make interfaces more consistent across configs to allow tracepoints to
  use them [Steven]
- move __ADDRESSABLE() to static_call() macro [Steven]
- prevent 2-byte jumps [Steven]
- add offset to asm-offsets.c instead of hard coding key->func offset
- add kernel_text_address() sanity check
- make __ADDRESSABLE() symbols truly unique


Static calls are a replacement for global function pointers.  They use
code patching to allow direct calls to be used instead of indirect
calls.  They give the flexibility of function pointers, but with
improved performance.  This is especially important for cases where
retpolines would otherwise be used, as retpolines can significantly
impact performance.
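
For readers who haven't seen the earlier postings, usage looks roughly like
this (modelled on the interface in this series; the exact macro signatures
may differ between versions):

static int my_add(int a, int b)		/* initial target */
{
	return a + b;
}

static int my_mul(int a, int b)		/* alternative target */
{
	return a * b;
}

DEFINE_STATIC_CALL(my_key, my_add);

static int compute(int x, int y)
{
	/* compiles down to a direct call (or a call via trampoline) */
	return static_call(my_key, x, y);
}

static void switch_to_mul(void)
{
	/* retargets the trampoline and/or every inline call site */
	static_call_update(my_key, my_mul);
}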

The concept and code are an extension of previous work done by Ard
Biesheuvel and Steven Rostedt:

  https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
  https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org

There are three implementations, depending on arch support:

1) basic function pointers
2) out-of-line: patched trampolines (CONFIG_HAVE_STATIC_CALL)
3) inline: patched call sites (CONFIG_HAVE_STATIC_CALL_INLINE)
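
Concretely, the difference between 2) and 3) at the machine-code level is
roughly the following (illustrative x86-64 layout, not the exact generated
code):

/*
 * out-of-line (2): call sites are never touched; only the trampoline's
 * jump destination is rewritten by static_call_update():
 *
 *	caller:       e8 <rel32>   call  my_key_tramp
 *	my_key_tramp: e9 <rel32>   jmp   current_target	<- patched
 *
 * inline (3): the call sites themselves are rewritten to call the target
 * directly, avoiding the extra jump:
 *
 *	caller:       e8 <rel32>   call  current_target	<- patched
 */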



Josh Poimboeuf (6):
  compiler.h: Make __ADDRESSABLE() symbol truly unique
  static_call: Add basic static call infrastructure
  x86/static_call: Add out-of-line static call implementation
  static_call: Add inline static call infrastructure
  x86/alternative: Use a single access in text_poke() where possible
  x86/static_call: Add inline static call implementation for x86-64

 arch/Kconfig  |   7 +
 arch/x86/Kconfig  |   4 +-
 arch/x86/include/asm/static_call.h|  33 ++
 arch/x86/kernel/Makefile  |   1 +
 arch/x86/kernel/alternative.c |  31 +-
 arch/x86/kernel/static_call.c |  57 
 arch/x86/kernel/vmlinux.lds.S |   1 +
 include/asm-generic/vmlinux.lds.h |  15 +
 include/linux/compiler.h  |   2 +-
 include/linux/module.h|  10 +
 include/linux/static_call.h   | 196 +++
 include/linux/static_call_types.h |  22 ++
 kernel/Makefile   |   1 +
 kernel/module.c   |   5 +
 kernel/static_call.c  | 316 ++
 scripts/Makefile.build|   3 +
 tools/objtool/Makefile|   3 +-
 tools/objtool/builtin-check.c |   3 +-
 tools/objtool/builtin.h   |   2 +-
 tools/objtool/check.c | 131 +++-
 tools/objtool/check.h |   2 +
 tools/objtool/elf.h   |   1 +