Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-09-12 Thread Adrian Hunter
On 29/08/19 2:46 PM, Peter Zijlstra wrote:
> On Thu, Aug 29, 2019 at 12:40:56PM +0300, Adrian Hunter wrote:
>> Can you expand on "and ensure the poke_handler preserves the existing
>> control flow"?  Whatever the INT3-handler does will be traced normally so
>> long as it does not itself execute self-modified code.
> 
> My thinking was that the code shouldn't change semantics before emitting
> the RECORD_TEXT_POKE; but you're right in that that doesn't seem to
> matter much.
> 
> Either we run the old code or INT3 does 'something'.  Then we get
> RECORD_TEXT_POKE and finish the poke.  Which tells that the moment INT3
> stops the new text is in place.
> 
> I suppose that works too, and requires less changes.


What about this?


diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 70c09967a999..00aa9bef2b9d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -30,6 +30,7 @@ struct text_poke_loc {
void *addr;
size_t len;
const char opcode[POKE_MAX_OPCODE_SIZE];
+   char old_opcode[POKE_MAX_OPCODE_SIZE];
 };
 
 extern void text_poke_early(void *addr, const void *opcode, size_t len);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ccd32013c47a..c781bafd 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include <linux/perf_event.h>
 #include 
 #include 
 #include 
@@ -1045,8 +1046,10 @@ void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
/*
 * First step: add a int3 trap to the address that will be patched.
 */
-   for (i = 0; i < nr_entries; i++)
+   for (i = 0; i < nr_entries; i++) {
+   memcpy(tp[i].old_opcode, tp[i].addr, tp[i].len);
	text_poke(tp[i].addr, &int3, sizeof(int3));
+   }
 
on_each_cpu(do_sync_core, NULL, 1);
 
@@ -1071,6 +1074,11 @@ void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
on_each_cpu(do_sync_core, NULL, 1);
}
 
+   for (i = 0; i < nr_entries; i++) {
+   perf_event_text_poke((unsigned long)tp[i].addr,
+tp[i].old_opcode, tp[i].opcode, tp[i].len);
+   }
+
/*
 * Third step: replace the first byte (int3) by the first byte of
 * replacing opcode.
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61448c19a132..f4c6095d2110 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1183,6 +1183,8 @@ extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
 extern void perf_event_namespaces(struct task_struct *tsk);
 extern void perf_event_fork(struct task_struct *tsk);
+extern void perf_event_text_poke(unsigned long addr, const void *old_bytes,
+const void *new_bytes, size_t len);
 
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
@@ -1406,6 +1408,10 @@ static inline void perf_event_exec(void)   { }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec) { }
 static inline void perf_event_namespaces(struct task_struct *tsk)  { }
 static inline void perf_event_fork(struct task_struct *tsk){ }
+static inline void perf_event_text_poke(unsigned long addr,
+   const void *old_bytes,
+   const void *new_bytes,
+   size_t len) { }
 static inline void perf_event_init(void)   { }
 static inline int  perf_swevent_get_recursion_context(void)   { return -1; }
 static inline void perf_swevent_put_recursion_context(int rctx)   { }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb7b271397a6..6396d4c0d2f9 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -375,7 +375,8 @@ struct perf_event_attr {
	ksymbol:  1, /* include ksymbol events */
	bpf_event  :  1, /* include bpf events */
	aux_output :  1, /* generate AUX records instead of events */
-	__reserved_1   : 32;
+	text_poke  :  1, /* include text poke events */
+	__reserved_1   : 31;
 
union {
__u32   wakeup_events;/* wakeup every n events */
@@ -1000,6 +1001,22 @@ enum perf_event_type {
 */
PERF_RECORD_BPF_EVENT   = 18,
 
+	/*
+	 * Records changes to kernel text i.e. self-modified code.
+	 * 'len' is the number of old bytes, which is the same as the number
+	 * of new bytes. 'bytes' contains the old bytes followed by the new
+	 * bytes.
+	 */
+	PERF_RECORD_TEXT_POKE			= 19,
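
The intended consumer of such a record is a trace decoder (e.g. perf's
Intel PT decoder) that keeps its own copy of kernel text, taken from
kcore, and patches that copy as text-poke events arrive in timestamp
order. A minimal userspace-side sketch of that step, assuming a
hypothetical struct text_poke_event that carries the unpacked fields of
the record above:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical unpacked form of a text-poke record. */
struct text_poke_event {
	uint64_t addr;			/* kernel address that was patched */
	uint16_t len;			/* number of bytes changed */
	const uint8_t *old_bytes;	/* bytes before the poke */
	const uint8_t *new_bytes;	/* bytes after the poke */
};

/*
 * Apply one text-poke event to the decoder's private copy of kernel
 * text.  'image' holds the kcore snapshot and 'base' is the kernel
 * virtual address it starts at.
 */
static int apply_text_poke(uint8_t *image, uint64_t base, size_t size,
			   const struct text_poke_event *ev)
{
	uint64_t off = ev->addr - base;

	if (ev->addr < base || off + ev->len > size)
		return -1;			/* outside the snapshot */

	/* The snapshot should still hold the old bytes at this point. */
	if (memcmp(image + off, ev->old_bytes, ev->len))
		fprintf(stderr, "text mismatch at %#llx\n",
			(unsigned long long)ev->addr);

	memcpy(image + off, ev->new_bytes, ev->len);
	return 0;
}

Instructions decoded before the event's timestamp are read from the old
bytes, instructions decoded after it from the new bytes.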

Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2019 at 12:40:56PM +0300, Adrian Hunter wrote:
> Can you expand on "and ensure the poke_handler preserves the existing
> control flow"?  Whatever the INT3-handler does will be traced normally so
> long as it does not itself execute self-modified code.

My thinking was that the code shouldn't change semantics before emitting
the RECORD_TEXT_POKE; but you're right in that that doesn't seem to
matter much.

Either we run the old code or INT3 does 'something'.  Then we get
RECORD_TEXT_POKE and finish the poke.  Which tells that the moment INT3
stops the new text is in place.

I suppose that works too, and requires less changes.




Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-08-29 Thread Adrian Hunter
On 29/08/19 11:53 AM, Peter Zijlstra wrote:
> On Thu, Aug 29, 2019 at 11:23:52AM +0300, Adrian Hunter wrote:
>> On 9/01/19 12:35 PM, Peter Zijlstra wrote:
>>> On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:
>>>
 A general solution is more complicated, however, due to the racy nature of
 cross-modifying code. There would need to be TSC recording of the time
 before the modifications start and after they are done.

 BTW: I am not sure that static-keys are much better. Their change also
 affects the control flow, and they do affect the control flow.
>>>
>>> Any text_poke() user is a problem; which is why I suggested a
>>> PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
>>> timestamped and can be correlated to the trace.
>>>
>>> As to the racy nature of text_poke, yes, this is a wee bit tricky and
>>> might need some care. I _think_ we can make it work, but I'm not 100%
>>> sure on exactly how PT works, but something like:
>>>
>>>  - write INT3 byte
>>>  - IPI-SYNC
>>>
>>> and ensure the poke_handler preserves the existing control flow (which
>>> it currently does not, but should be possible).
>>>
>>>  - emit RECORD_TEXT_POKE with the new instruction
>>>
>>> at this point the actual control flow will be through the INT3 and
>>> handler and not hit the actual instruction, so the actual state is
>>> irrelevant.
>>>
>>>  - write instruction tail
>>>  - IPI-SYNC
>>>  - write first byte
>>>  - IPI-SYNC
>>>
>>> And at this point we start using the new instruction, but this is after
>>> the timestamp from the RECORD_TEXT_POKE event and decoding should work
>>> just fine.
>>>
>>
>> Presumably the IPI-SYNC does not guarantee that other CPUs will not already
>> have seen the change.  In that case, it is not possible to provide a
>> timestamp before which all CPUs executed the old code, and after which all
>> CPUs execute the new code.
> 
> 'the change' is an INT3 poke, so either you see the old code flow, or
> you see an INT3 emulate the old flow in your trace.
> 
> That should be unambiguous.
> 
> Then you emit the RECORD_TEXT_POKE with the new instruction on. This
> prepares the decoder to accept a new reality.
> 
> Then we finish the instruction poke.
> 
> And then when the trace no longer shows INT3 exceptions, you know the
> new code is in effect.
> 
> How is this ambiguous?

It's not.  I didn't get that from the first read, sorry.

Can you expand on "and ensure the poke_handler preserves the existing
control flow"?  Whatever the INT3-handler does will be traced normally so
long as it does not itself execute self-modified code.
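
For reference, "preserves the existing control flow" here means that a
breakpoint hit at a site that is being patched should behave like the
*old* instruction rather than simply skip it. A simplified, hypothetical
sketch for a site whose old instruction is a 5-byte relative jmp or call
(the kernel's real poke_int3_handler is more involved):

#include <linux/types.h>
#include <asm/ptrace.h>

/* Hypothetical per-site state recorded before the INT3 is written. */
struct poke_site {
	unsigned long addr;		/* address being patched */
	unsigned long old_target;	/* destination of the old jmp/call */
	bool old_is_call;		/* old instruction was a call */
};

static bool poke_int3_emulate_old(struct pt_regs *regs,
				  const struct poke_site *site)
{
	/* #BP reports the address of the byte after the INT3. */
	if (regs->ip - 1 != site->addr)
		return false;			/* not our site */

	if (site->old_is_call) {
		/* Emulate "call old_target": push the return address. */
		regs->sp -= sizeof(unsigned long);
		*(unsigned long *)regs->sp = site->addr + 5;
	}
	regs->ip = site->old_target;		/* continue along the old flow */
	return true;
}

Traced this way, execution either looks exactly like the old code or
like an INT3 that lands in the same places the old code went, so the
decoder never has to see a half-written instruction.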


Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2019 at 11:23:52AM +0300, Adrian Hunter wrote:
> On 9/01/19 12:35 PM, Peter Zijlstra wrote:
> > On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:
> > 
> >> A general solution is more complicated, however, due to the racy nature of
> >> cross-modifying code. There would need to be TSC recording of the time
> >> before the modifications start and after they are done.
> >>
> >> BTW: I am not sure that static-keys are much better. Their change also
> >> affects the control flow, and they do affect the control flow.
> > 
> > Any text_poke() user is a problem; which is why I suggested a
> > PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
> > timestamped and can be correlated to the trace.
> > 
> > As to the racy nature of text_poke, yes, this is a wee bit tricky and
> > might need some care. I _think_ we can make it work, but I'm not 100%
> > sure on exactly how PT works, but something like:
> > 
> >  - write INT3 byte
> >  - IPI-SYNC
> > 
> > and ensure the poke_handler preserves the existing control flow (which
> > it currently does not, but should be possible).
> > 
> >  - emit RECORD_TEXT_POKE with the new instruction
> > 
> > at this point the actual control flow will be through the INT3 and
> > handler and not hit the actual instruction, so the actual state is
> > irrelevant.
> > 
> >  - write instruction tail
> >  - IPI-SYNC
> >  - write first byte
> >  - IPI-SYNC
> > 
> > And at this point we start using the new instruction, but this is after
> > the timestamp from the RECORD_TEXT_POKE event and decoding should work
> > just fine.
> > 
> 
> Presumably the IPI-SYNC does not guarantee that other CPUs will not already
> have seen the change.  In that case, it is not possible to provide a
> timestamp before which all CPUs executed the old code, and after which all
> CPUs execute the new code.

'the change' is an INT3 poke, so either you see the old code flow, or
you see an INT3 emulate the old flow in your trace.

That should be unambiguous.

Then you emit the RECORD_TEXT_POKE with the new instruction on. This
prepares the decoder to accept a new reality.

Then we finish the instruction poke.

And then when the trace no longer shows INT3 exceptions, you know the
new code is in effect.

How is this ambiguous?



Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-08-29 Thread Adrian Hunter
On 9/01/19 12:35 PM, Peter Zijlstra wrote:
> On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:
> 
>> A general solution is more complicated, however, due to the racy nature of
>> cross-modifying code. There would need to be TSC recording of the time
>> before the modifications start and after they are done.
>>
>> BTW: I am not sure that static-keys are much better. Their change also
>> affects the control flow, and they do affect the control flow.
> 
> Any text_poke() user is a problem; which is why I suggested a
> PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
> timestamped and can be correlated to the trace.
> 
> As to the racy nature of text_poke, yes, this is a wee bit tricky and
> might need some care. I _think_ we can make it work, but I'm not 100%
> sure on exactly how PT works, but something like:
> 
>  - write INT3 byte
>  - IPI-SYNC
> 
> and ensure the poke_handler preserves the existing control flow (which
> it currently does not, but should be possible).
> 
>  - emit RECORD_TEXT_POKE with the new instruction
> 
> at this point the actual control flow will be through the INT3 and
> handler and not hit the actual instruction, so the actual state is
> irrelevant.
> 
>  - write instruction tail
>  - IPI-SYNC
>  - write first byte
>  - IPI-SYNC
> 
> And at this point we start using the new instruction, but this is after
> the timestamp from the RECORD_TEXT_POKE event and decoding should work
> just fine.
> 

Presumably the IPI-SYNC does not guarantee that other CPUs will not already
have seen the change.  In that case, it is not possible to provide a
timestamp before which all CPUs executed the old code, and after which all
CPUs execute the new code.

So we need 2 events: one before and one after the text poke.  Then it will
be up to the tools to figure out which code path was taken in that time
interval. e.g.

1. emit RECORD_TEXT_POKE
flags:  BEFORE (timestamp is before change)
method: INT3 (INT3 method used to change code)
ip
code size
code bytes before
code bytes after

2. text poke

3. emit RECORD_TEXT_POKE
flags:  AFTER (timestamp is after change)
method: INT3 (INT3 method used to change code)
ip
code size
code bytes before
code bytes after

Thoughts?
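
To make the proposal above concrete, a rough C view of what each such
record might carry -- field names and sizes are illustrative only, not
an existing perf ABI:

#include <linux/types.h>
#include <linux/perf_event.h>

/* Illustrative flag and method values, not part of any ABI. */
#define TEXT_POKE_BEFORE	(1 << 0)	/* timestamp taken before the change */
#define TEXT_POKE_AFTER		(1 << 1)	/* timestamp taken after the change */
#define TEXT_POKE_METHOD_INT3	1		/* INT3-based patching was used */

struct record_text_poke {
	struct perf_event_header header;	/* type = PERF_RECORD_TEXT_POKE */
	__u16	flags;			/* TEXT_POKE_BEFORE or TEXT_POKE_AFTER */
	__u16	method;			/* e.g. TEXT_POKE_METHOD_INT3 */
	__u32	len;			/* code size */
	__u64	ip;			/* address being patched */
	__u8	bytes[];		/* old bytes followed by the new bytes */
};

A tool would then treat the interval between the BEFORE and AFTER
timestamps as "either code image is possible" and pick whichever one
decodes cleanly.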


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-09 Thread Peter Zijlstra
On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:

> A general solution is more complicated, however, due to the racy nature of
> cross-modifying code. There would need to be TSC recording of the time
> before the modifications start and after they are done.
> 
> BTW: I am not sure that static-keys are much better. Their change also
> affects the control flow, and they do affect the control flow.

Any text_poke() user is a problem; which is why I suggested a
PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
timestamped and can be correlated to the trace.

As to the racy nature of text_poke, yes, this is a wee bit tricky and
might need some care. I _think_ we can make it work, but I'm not 100%
sure on exactly how PT works, but something like:

 - write INT3 byte
 - IPI-SYNC

and ensure the poke_handler preserves the existing control flow (which
it currently does not, but should be possible).

 - emit RECORD_TEXT_POKE with the new instruction

at this point the actual control flow will be through the INT3 and
handler and not hit the actual instruction, so the actual state is
irrelevant.

 - write instruction tail
 - IPI-SYNC
 - write first byte
 - IPI-SYNC

And at this point we start using the new instruction, but this is after
the timestamp from the RECORD_TEXT_POKE event and decoding should work
just fine.
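
The sequence above, written out for a single site as a sketch (the real
code is text_poke_bp()/text_poke_bp_batch() in
arch/x86/kernel/alternative.c; emit_record_text_poke() is a placeholder
for whatever ends up generating the perf record):

#include <linux/smp.h>			/* on_each_cpu() */
#include <asm/text-patching.h>		/* text_poke() */

static void poke_one_site(void *addr, const void *new_insn, size_t len)
{
	unsigned char int3 = 0xcc;

	/* 1) Arm the breakpoint so nobody executes a half-written insn. */
	text_poke(addr, &int3, sizeof(int3));
	on_each_cpu(do_sync_core, NULL, 1);		/* IPI-SYNC */

	/* 2) From this timestamp on, the decoder may expect the new text. */
	emit_record_text_poke(addr, new_insn, len);

	/* 3) Write the tail, then the first byte, syncing after each. */
	if (len > 1) {
		text_poke(addr + 1, new_insn + 1, len - 1);
		on_each_cpu(do_sync_core, NULL, 1);
	}
	text_poke(addr, new_insn, 1);
	on_each_cpu(do_sync_core, NULL, 1);
}

Execution that reaches the site while the INT3 is armed shows up as an
INT3 in the trace; once INT3s stop appearing, the new instruction --
already announced by the record -- is in place, so decoding stays
consistent throughout.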


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-08 Thread Andi Kleen
> BTW: I am not sure that static-keys are much better. Their change also
> affects the control flow, and they do affect the control flow.

Static keys have the same problem, but they only change infrequently so usually
it's not too big a problem if you dump the kernel close to the tracing 
sessions.

simple-pt doesn't have a similar mechanism, so it suffers more from it.

-Andi


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-08 Thread Nadav Amit
> On Jan 8, 2019, at 11:01 AM, Peter Zijlstra  wrote:
> 
> On Tue, Jan 08, 2019 at 10:28:02AM -0800, Nadav Amit wrote:
>> Is it really that important for debugging to get the instructions at the
>> time of execution? Wouldn’t it be easier to annotate the instructions that
>> might change? After all, it is not as if any instruction can change to any
>> other instruction.
> 
> I think PT has a bitstream encoding of branch-taken; to decode and
> follow the actual code-flow you then need to have the actual and
> accurate branch target from the code. If we go muck about with the code
> and change that, decoding gets somewhat 'tricky'.
> 
> Or something along those lines..

Thanks for the explanation (I now see it in the SDM and sources).

Basically, the updates should not be done too frequently, and I can expose
an interface to suspend them (this will not affect correctness). I think
this can be the easiest solution, which should not affect the workload
execution too much.

A general solution is more complicated, however, due to the racy nature of
cross-modifying code. There would need to be TSC recording of the time
before the modifications start and after they are done.

BTW: I am not sure that static-keys are much better. Their change also
affects the control flow, and they do affect the control flow.



Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-08 Thread Peter Zijlstra
On Tue, Jan 08, 2019 at 10:28:02AM -0800, Nadav Amit wrote:
> Is it really that important for debugging to get the instructions at the
> time of execution? Wouldn’t it be easier to annotate the instructions that
> might change? After all, it is not as if any instruction can change to any
> other instruction.

I think PT has a bitstream encoding of branch-taken; to decode and
follow the actual code-flow you then need to have the actual and
accurate branch target from the code. If we go muck about with the code
and change that, decoding gets somewhat 'tricky'.

Or something along those lines..
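
The point being that PT's TNT (taken/not-taken) packets carry only one
bit per conditional branch; the branch target itself has to be read
from the code image. A conceptual sketch (real PT decoding, see the
intel-pt decoder under tools/perf, is far more involved; struct
decoded_insn and decode_insn_at() are stand-ins for an instruction
decoder):

#include <stdbool.h>
#include <stdint.h>

/* Minimal, hypothetical view of one decoded instruction. */
struct decoded_insn {
	uint8_t  length;
	bool     is_cond_branch;
	uint64_t branch_target;	/* absolute target, taken from the code bytes */
};

/* Assume some x86 instruction decoder fills this in from the image. */
extern void decode_insn_at(const uint8_t *image, uint64_t ip,
			   struct decoded_insn *insn);

static uint64_t next_ip(const uint8_t *image, uint64_t ip, bool tnt_taken)
{
	struct decoded_insn insn;

	decode_insn_at(image, ip, &insn);
	if (insn.is_cond_branch && tnt_taken)
		return insn.branch_target;	/* from the code, not the trace */
	return ip + insn.length;		/* fall through */
}

If the image no longer matches what actually ran, branch_target is
wrong and the reconstructed control flow diverges, which is why
frequent unannounced patching breaks decoding.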


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-08 Thread Peter Zijlstra
On Tue, Jan 08, 2019 at 09:27:21AM -0800, Andi Kleen wrote:
> It doesn't work when the code is modified in place, like the
> patch in the $SUBJECT.

For in-place modification a-la jump_label / static_call etc. we can
'trivially' log the new instruction and the decoder can do matching
in-place kcore adjustments.




Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-08 Thread Nadav Amit
> On Jan 8, 2019, at 9:27 AM, Andi Kleen  wrote:
> 
> On Tue, Jan 08, 2019 at 11:10:58AM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 08, 2019 at 12:01:11PM +0200, Adrian Hunter wrote:
>>> The problem is that the jitted code gets freed from memory, which is why I
>>> suggested the ability to pin it for a while.
>> 
>> Then what do you tell the guy that keeps PT running for a day and runs
>> out of memory because he likes to JIT a lot?
> 
> It only would need to be saved until the next kcore dump, so they would
> need to do regular kcore dumps, after each of which the JIT code could be 
> freed.
> 
> In a sense it would be like RCU for code.
> 
> You would somehow need to tell the kernel when that happens though
> so it can schedule the frees.
> 
> It doesn't work when the code is modified in place, like the
> patch in the $SUBJECT.

Excuse my ignorance - can you be more concrete what will break where?

I am looking at perf-with-kcore, and intuitively the main thing that is
required is to take text_mutex while kcore is copied, to get a point-in-time
snapshot.

Is it really that important for debugging to get the instructions at the
time of execution? Wouldn’t it be easier to annotate the instructions that
might change? After all, it is not as if any instruction can change to any
other instruction.



Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-08 Thread Andi Kleen
On Tue, Jan 08, 2019 at 11:10:58AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 08, 2019 at 12:01:11PM +0200, Adrian Hunter wrote:
> > The problem is that the jitted code gets freed from memory, which is why I
> > suggested the ability to pin it for a while.
> 
> Then what do you tell the guy that keeps PT running for a day and runs
> out of memory because he likes to JIT a lot?

It only would need to be saved until the next kcore dump, so they would
need to do regular kcore dumps, after each of which the JIT code could be freed.

In a sense it would be like RCU for code.

You would somehow need to tell the kernel when that happens though
so it can schedule the frees.

It doesn't work when the code is modified in place, like the
patch in the $SUBJECT.

-Andi


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-08 Thread Peter Zijlstra
On Tue, Jan 08, 2019 at 12:01:11PM +0200, Adrian Hunter wrote:
> The problem is that the jitted code gets freed from memory, which is why I
> suggested the ability to pin it for a while.

Then what do you tell the guy that keeps PT running for a day and runs
out of memory because he likes to JIT a lot?


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-08 Thread Adrian Hunter
On 8/01/19 11:25 AM, Peter Zijlstra wrote:
> On Tue, Jan 08, 2019 at 09:47:18AM +0200, Adrian Hunter wrote:
>> On 7/01/19 6:32 PM, Peter Zijlstra wrote:
>>> On Thu, Jan 03, 2019 at 02:18:15PM -0800, Andi Kleen wrote:
 Nadav Amit  writes:
>
> - Do we use periodic learning or not? Josh suggested to reconfigure the
>   branches whenever a new target is found. However, I do not know at
>   this time how to do learning efficiently, without making learning much
>   more expensive.

 FWIW frequent patching will likely completely break perf Processor Trace
 decoding, which needs a somewhat stable kernel text image to decode the
 traces generated by the CPU. Right now it relies on kcore dumped after
 the trace usually being stable because jumplabel changes happen only
 infrequently. But if you start patching frequently this assumption will
 break.

 You would either need a way to turn this off, or provide
 updates for every change to the trace, so that the decoder can
 keep track.
>>>
>>> I'm thinking it would be entirely possible to create and feed text_poke
>>> events into the regular (!aux) buffer which can be timestamp correlated
>>> to the PT data.
>>
>> To rebuild kernel text from such events would require a starting point.
>> What is the starting point?  The problem with kcore is that people can
>> deconfig it without realising it is needed to enable the tracing of kernel
>> self-modifying code.  It would be nice if it was all tied together, so that
>> if someone selects the ability to trace kernel self-modifying code, then all
>> the bits needed are also selected.  Perhaps we should expose another ELF
>> image that contains only kernel executable code, and take the opportunity to
>> put the symbols in it also.
> 
> Meh; you always need a magic combo of CONFIG symbols to make stuff work.
> We don't even have a CONFIG symbol for PT, so if you really care you
> should probably start there.
> 
> If you want symbols; what stops us from exposing kallsyms in kcore as
> is?
> 
>> Also what about BPF jitted code?  Will it always fit in an event?  I was
>> thinking of trying to add a way to prevent temporarily the unload of modules
>> or jitted code, which would be a good-enough solution for now.
> 
> We're working on BPF and kallsym events, those should, esp. when
> combined with kcore, allow you to extract the actual instructions.

The problem is that the jitted code gets freed from memory, which is why I
suggested the ability to pin it for a while.


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-08 Thread Peter Zijlstra
On Tue, Jan 08, 2019 at 09:47:18AM +0200, Adrian Hunter wrote:
> On 7/01/19 6:32 PM, Peter Zijlstra wrote:
> > On Thu, Jan 03, 2019 at 02:18:15PM -0800, Andi Kleen wrote:
> >> Nadav Amit  writes:
> >>>
> >>> - Do we use periodic learning or not? Josh suggested to reconfigure the
> >>>   branches whenever a new target is found. However, I do not know at
> >>>   this time how to do learning efficiently, without making learning much
> >>>   more expensive.
> >>
> >> FWIW frequent patching will likely completely break perf Processor Trace
> >> decoding, which needs a somewhat stable kernel text image to decode the
> >> traces generated by the CPU. Right now it relies on kcore dumped after
> >> the trace usually being stable because jumplabel changes happen only
> >> infrequently. But if you start patching frequently this assumption will
> >> break.
> >>
> >> You would either need a way to turn this off, or provide
> >> updates for every change to the trace, so that the decoder can
> >> keep track.
> > 
> > I'm thinking it would be entirely possible to create and feed text_poke
> > events into the regular (!aux) buffer which can be timestamp correlated
> > to the PT data.
> 
> To rebuild kernel text from such events would require a starting point.
> What is the starting point?  The problem with kcore is that people can
> deconfig it without realising it is needed to enable the tracing of kernel
> self-modifying code.  It would be nice if it was all tied together, so that
> if someone selects the ability to trace kernel self-modifying code, then all
> the bits needed are also selected.  Perhaps we should expose another ELF
> image that contains only kernel executable code, and take the opportunity to
> put the symbols in it also.

Meh; you always need a magic combo of CONFIG symbols to make stuff work.
We don't even have a CONFIG symbol for PT, so if you really care you
should probably start there.

If you want symbols; what stops us from exposing kallsyms in kcore as
is?

> Also what about BPF jitted code?  Will it always fit in an event?  I was
> thinking of trying to add a way to prevent temporarily the unload of modules
> or jitted code, which would be a good-enough solution for now.

We're working on BPF and kallsym events, those should, esp. when
combined with kcore, allow you to extract the actual instructions.

We don't have module events, but I suppose the kallsym events should
cover module loading (a new module results in lots of new symbols after
all).

With all that there is still a race in that nothing blocks module-unload
while we're still harvesting the information, not sure how/if we want to
cure that.


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-07 Thread Adrian Hunter
On 7/01/19 6:32 PM, Peter Zijlstra wrote:
> On Thu, Jan 03, 2019 at 02:18:15PM -0800, Andi Kleen wrote:
>> Nadav Amit  writes:
>>>
>>> - Do we use periodic learning or not? Josh suggested to reconfigure the
>>>   branches whenever a new target is found. However, I do not know at
>>>   this time how to do learning efficiently, without making learning much
>>>   more expensive.
>>
>> FWIW frequent patching will likely completely break perf Processor Trace
>> decoding, which needs a somewhat stable kernel text image to decode the
>> traces generated by the CPU. Right now it relies on kcore dumped after
>> the trace usually being stable because jumplabel changes happen only
>> infrequently. But if you start patching frequently this assumption will
>> break.
>>
>> You would either need a way to turn this off, or provide
>> updates for every change to the trace, so that the decoder can
>> keep track.
> 
> I'm thinking it would be entirely possible to create and feed text_poke
> events into the regular (!aux) buffer which can be timestamp correlated
> to the PT data.

To rebuild kernel text from such events would require a starting point.
What is the starting point?  The problem with kcore is that people can
deconfig it without realising it is needed to enable the tracing of kernel
self-modifying code.  It would be nice if it was all tied together, so that
if someone selects the ability to trace kernel self-modifying code, then all
the bits needed are also selected.  Perhaps we should expose another ELF
image that contains only kernel executable code, and take the opportunity to
put the symbols in it also.

Also what about BPF jitted code?  Will it always fit in an event?  I was
thinking of trying to add a way to prevent temporarily the unload of modules
or jitted code, which would be a good-enough solution for now.


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-07 Thread Peter Zijlstra
On Thu, Jan 03, 2019 at 02:18:15PM -0800, Andi Kleen wrote:
> Nadav Amit  writes:
> >
> > - Do we use periodic learning or not? Josh suggested to reconfigure the
> >   branches whenever a new target is found. However, I do not know at
> >   this time how to do learning efficiently, without making learning much
> >   more expensive.
> 
> FWIW frequent patching will likely completely break perf Processor Trace
> decoding, which needs a somewhat stable kernel text image to decode the
> traces generated by the CPU. Right now it relies on kcore dumped after
> the trace usually being stable because jumplabel changes happen only
> infrequently. But if you start patching frequently this assumption will
> break.
> 
> You would either need a way to turn this off, or provide
> updates for every change to the trace, so that the decoder can
> keep track.

I'm thinking it would be entirely possible to create and feed text_poke
events into the regular (!aux) buffer which can be timestamp correlated
to the PT data.


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-03 Thread Andi Kleen
Nadav Amit  writes:
>
> - Do we use periodic learning or not? Josh suggested to reconfigure the
>   branches whenever a new target is found. However, I do not know at
>   this time how to do learning efficiently, without making learning much
>   more expensive.

FWIW frequent patching will likely completely break perf Processor Trace
decoding, which needs a somewhat stable kernel text image to decode the
traces generated by the CPU. Right now it relies on kcore dumped after
the trace usually being stable because jumplabel changes happen only
infrequently. But if you start patching frequently this assumption will
break.

You would either need a way to turn this off, or provide
updates for every change to the trace, so that the decoder can
keep track.

-Andi


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-03 Thread Josh Poimboeuf
On Thu, Jan 03, 2019 at 06:30:08PM +, Nadav Amit wrote:
> > On Jan 3, 2019, at 10:10 AM, Josh Poimboeuf  wrote:
> > 
> > On Mon, Dec 31, 2018 at 07:53:06PM +, Nadav Amit wrote:
> >>> On Dec 31, 2018, at 11:51 AM, Andy Lutomirski  wrote:
> >>> 
> >>> On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit  wrote:
>  This is a revised version of optpolines (formerly named relpolines) for
>  dynamic indirect branch promotion in order to reduce retpoline overheads
>  [1].
> >>> 
> >>> Some of your changelogs still call them "relpolines".
> >>> 
> >>> I have a crazy suggestion: maybe don't give them a cute name at all?
> >>> Where it's actually necessary to name them (like in a config option),
> >>> use a description like CONFIG_DYNAMIC_DEVIRTUALIZATION or
> >>> CONFIG_PATCH_INDIRECT_CALLS or something.
> > 
> > Cute or not, naming is important.
> > 
> > If you want a description instead of a name, it will be a challenge to
> > describe it in 2-3 words.
> > 
> > I have no idea what "dynamic devirtualization" means.
> > 
> > "Patch indirect calls" doesn't fully describe it either (and could be
> > easily confused with static calls and some other approaches).
> > 
> >> I’m totally fine with that (don’t turn me into a "marketing” guy). It was
> >> just a way to refer to the mechanism. I need more feedback about the more
> >> fundamental issues to go on.
> > 
> > Naming isn't marketing.  It's a real issue: it affects both usability
> > and code readability.
> 
> Well, allow me to be on the fence on this one.
> 
> I look for the path of least resistance. I think it would be easiest if I
> first manage to make Josh’s static calls to be less intrusive. For that, I
> try to add in the GCC plugin an attribute to annotate the function pointers
> whose calls should be promoted. However, I don’t manage to get the
> declaration from the call instruction's rtx. If anyone has a pointer on how
> it can be done, that would be very helpful.

Ok.  FYI, I'll be posting v3 in the next few days or so.

-- 
Josh


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-03 Thread Nadav Amit
> On Jan 3, 2019, at 10:10 AM, Josh Poimboeuf  wrote:
> 
> On Mon, Dec 31, 2018 at 07:53:06PM +, Nadav Amit wrote:
>>> On Dec 31, 2018, at 11:51 AM, Andy Lutomirski  wrote:
>>> 
>>> On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit  wrote:
 This is a revised version of optpolines (formerly named relpolines) for
 dynamic indirect branch promotion in order to reduce retpoline overheads
 [1].
>>> 
>>> Some of your changelogs still call them "relpolines".
>>> 
>>> I have a crazy suggestion: maybe don't give them a cute name at all?
>>> Where it's actually necessary to name them (like in a config option),
>>> use a description like CONFIG_DYNAMIC_DEVIRTUALIZATION or
>>> CONFIG_PATCH_INDIRECT_CALLS or something.
> 
> Cute or not, naming is important.
> 
> If you want a description instead of a name, it will be a challenge to
> describe it in 2-3 words.
> 
> I have no idea what "dynamic devirtualization" means.
> 
> "Patch indirect calls" doesn't fully describe it either (and could be
> easily confused with static calls and some other approaches).
> 
>> I’m totally fine with that (don’t turn me into a "marketing” guy). It was
>> just a way to refer to the mechanism. I need more feedback about the more
>> fundamental issues to go on.
> 
> Naming isn't marketing.  It's a real issue: it affects both usability
> and code readability.

Well, allow me to be on the fence on this one.

I look for the path of least resistance. I think it would be easiest if I
first manage to make Josh’s static calls to be less intrusive. For that, I
try to add in the GCC plugin an attribute to annotate the function pointers
whose calls should be promoted. However, I don’t manage to get the
declaration from the call instruction's rtx. If anyone has a pointer on how
it can be done, that would be very helpful.



Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2019-01-03 Thread Josh Poimboeuf
On Mon, Dec 31, 2018 at 07:53:06PM +, Nadav Amit wrote:
> > On Dec 31, 2018, at 11:51 AM, Andy Lutomirski  wrote:
> > 
> > On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit  wrote:
> >> This is a revised version of optpolines (formerly named relpolines) for
> >> dynamic indirect branch promotion in order to reduce retpoline overheads
> >> [1].
> > 
> > Some of your changelogs still call them "relpolines".
> > 
> > I have a crazy suggestion: maybe don't give them a cute name at all?
> > Where it's actually necessary to name them (like in a config option),
> > use a description like CONFIG_DYNAMIC_DEVIRTUALIZATION or
> > CONFIG_PATCH_INDIRECT_CALLS or something.

Cute or not, naming is important.

If you want a description instead of a name, it will be a challenge to
describe it in 2-3 words.

I have no idea what "dynamic devirtualization" means.

"Patch indirect calls" doesn't fully describe it either (and could be
easily confused with static calls and some other approaches).

> I’m totally fine with that (don’t turn me into a "marketing” guy). It was
> just a way to refer to the mechanism. I need more feedback about the more
> fundamental issues to go on.

Naming isn't marketing.  It's a real issue: it affects both usability
and code readability.

-- 
Josh


Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2018-12-31 Thread Nadav Amit
> On Dec 31, 2018, at 11:51 AM, Andy Lutomirski  wrote:
> 
> On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit  wrote:
>> This is a revised version of optpolines (formerly named relpolines) for
>> dynamic indirect branch promotion in order to reduce retpoline overheads
>> [1].
> 
> Some of your changelogs still call them "relpolines".
> 
> I have a crazy suggestion: maybe don't give them a cute name at all?
> Where it's actually necessary to name them (like in a config option),
> use a description like CONFIG_DYNAMIC_DEVIRTUALIZATION or
> CONFIG_PATCH_INDIRECT_CALLS or something.

I’m totally fine with that (don’t turn me into a "marketing” guy). It was
just a way to refer to the mechanism. I need more feedback about the more
fundamental issues to go on.



Re: [RFC v2 0/6] x86: dynamic indirect branch promotion

2018-12-31 Thread Andy Lutomirski
On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit  wrote:
>
> This is a revised version of optpolines (formerly named relpolines) for
> dynamic indirect branch promotion in order to reduce retpoline overheads
> [1].

Some of your changelogs still call them "relpolines".

I have a crazy suggestion: maybe don't give them a cute name at all?
Where it's actually necessary to name them (like in a config option),
use a description like CONFIG_DYNAMIC_DEVIRTUALIZATION or
CONFIG_PATCH_INDIRECT_CALLS or something.

--Andy


[RFC v2 0/6] x86: dynamic indirect branch promotion

2018-12-30 Thread Nadav Amit
This is a revised version of optpolines (formerly named relpolines) for
dynamic indirect branch promotion in order to reduce retpoline overheads
[1].

This version addresses some of the concerns that were raised before.
Accordingly, the code was slightly simplified and patching is now done
using the regular int3/breakpoint mechanism.

Support for outline optpolines with multiple targets was added. I do not think the
way I implemented it is the correct one. In my original (private)
version, if there are more targets than the outline block can hold, the
outline block is completely removed. However, I think this is
more-or-less how Josh wanted it to be.

The code modifications are now done using a gcc-plugin. This makes it
easy to ignore code from init and other code sections. I think it should
also allow us to add opt-in/opt-out support for each branch, for example
by marking function pointers using address-space attributes.

All of these changes required some optimizations to go away to keep the
code simple. I still have not run the benchmarks again.

So I might not have addressed all the open issues, but it is rather hard
to finish the implementation since some still open high-level decisions
affect the way in which optimizations should be done.

Specifically:

- Is it going to be the only indirect branch promotion mechanism? If so,
  it probably should also provide interface similar to Josh's
  "static-calls" with annotations.

- Should it also be used when retpolines are disabled (in the config)?
  This does complicate the implementation a bit (RFC v1 supported it).

- Is it going to be opt-in or opt-out? If it is an opt-out mechanism,
  memory and performance optimizations need to be more aggressive.

- Do we use periodic learning or not? Josh suggested to reconfigure the
  branches whenever a new target is found. However, I do not know at
  this time how to do learning efficiently, without making learning much
  more expensive.

[1] https://lore.kernel.org/patchwork/cover/1001332/
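
(For readers new to the series: a C-level sketch of the idea behind an
optpoline. In the actual implementation both the compared value and the
direct call are patched machine code, not loads through pointers as
written here, and "learned" stands in for hypothetical per-call-site
state maintained by the learning code.)

typedef long (*target_fn)(long);

struct optpoline_site {
	target_fn learned;	/* most frequently observed target, or NULL */
};

static long call_via_optpoline(struct optpoline_site *site,
			       target_fn fn, long arg)
{
	if (fn == site->learned)
		return site->learned(arg);	/* promoted: effectively a direct call */

	/* Fallback: retpoline-protected indirect call, plus learning. */
	return fn(arg);
}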

Nadav Amit (6):
  x86: introduce kernel restartable sequence
  objtool: ignore instructions
  x86: patch indirect branch promotion
  x86: interface for accessing indirect branch locations
  x86: learning and patching indirect branch targets
  x86: outline optpoline

 arch/x86/Kconfig |4 +
 arch/x86/entry/entry_64.S|   16 +-
 arch/x86/include/asm/nospec-branch.h |   83 ++
 arch/x86/include/asm/sections.h  |2 +
 arch/x86/kernel/Makefile |1 +
 arch/x86/kernel/asm-offsets.c|9 +
 arch/x86/kernel/nospec-branch.c  | 1293 ++
 arch/x86/kernel/traps.c  |7 +
 arch/x86/kernel/vmlinux.lds.S|7 +
 arch/x86/lib/retpoline.S |   83 ++
 include/linux/cpuhotplug.h   |1 +
 include/linux/module.h   |9 +
 kernel/module.c  |8 +
 scripts/Makefile.gcc-plugins |3 +
 scripts/gcc-plugins/x86_call_markup_plugin.c |  329 +
 tools/objtool/check.c|   21 +-
 16 files changed, 1872 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/kernel/nospec-branch.c
 create mode 100644 scripts/gcc-plugins/x86_call_markup_plugin.c

-- 
2.17.1