Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-09-12 Thread hpa
Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-09-12 Thread Adrian Hunter
On 29/08/19 2:46 PM, Peter Zijlstra wrote:
> On Thu, Aug 29, 2019 at 12:40:56PM +0300, Adrian Hunter wrote:
>> Can you expand on "and ensure the poke_handler preserves the existing
>> control flow"?  Whatever the INT3-handler does will be traced normally so
>> long as it does not itself execute self-modified code.
> 
> My thinking was that the code shouldn't change semantics before emitting
> the RECORD_TEXT_POKE; but you're right in that that doesn't seem to
> matter much.
> 
> Either we run the old code or INT3 does 'something'.  Then we get
> RECORD_TEXT_POKE and finish the poke.  Which tells that the moment INT3
> stops the new text is in place.
> 
> I suppose that works too, and requires less changes.


What about this?


diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 70c09967a999..00aa9bef2b9d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -30,6 +30,7 @@ struct text_poke_loc {
void *addr;
size_t len;
const char opcode[POKE_MAX_OPCODE_SIZE];
+   char old_opcode[POKE_MAX_OPCODE_SIZE];
 };
 
 extern void text_poke_early(void *addr, const void *opcode, size_t len);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ccd32013c47a..c781bafd 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1045,8 +1046,10 @@ void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
/*
 * First step: add a int3 trap to the address that will be patched.
 */
-   for (i = 0; i < nr_entries; i++)
+   for (i = 0; i < nr_entries; i++) {
+   memcpy(tp[i].old_opcode, tp[i].addr, tp[i].len);
		text_poke(tp[i].addr, &int3, sizeof(int3));
+   }
 
on_each_cpu(do_sync_core, NULL, 1);
 
@@ -1071,6 +1074,11 @@ void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
on_each_cpu(do_sync_core, NULL, 1);
}
 
+   for (i = 0; i < nr_entries; i++) {
+   perf_event_text_poke((unsigned long)tp[i].addr,
+tp[i].old_opcode, tp[i].opcode, tp[i].len);
+   }
+
/*
 * Third step: replace the first byte (int3) by the first byte of
 * replacing opcode.
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61448c19a132..f4c6095d2110 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1183,6 +1183,8 @@ extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
 extern void perf_event_namespaces(struct task_struct *tsk);
 extern void perf_event_fork(struct task_struct *tsk);
+extern void perf_event_text_poke(unsigned long addr, const void *old_bytes,
+const void *new_bytes, size_t len);
 
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
@@ -1406,6 +1408,10 @@ static inline void perf_event_exec(void)	{ }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec) { }
 static inline void perf_event_namespaces(struct task_struct *tsk)  { }
 static inline void perf_event_fork(struct task_struct *tsk){ }
+static inline void perf_event_text_poke(unsigned long addr,
+   const void *old_bytes,
+   const void *new_bytes,
+   size_t len) { }
 static inline void perf_event_init(void)   { }
 static inline int  perf_swevent_get_recursion_context(void)	{ return -1; }
 static inline void perf_swevent_put_recursion_context(int rctx)	{ }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb7b271397a6..6396d4c0d2f9 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -375,7 +375,8 @@ struct perf_event_attr {
	ksymbol        :  1, /* include ksymbol events */
	bpf_event      :  1, /* include bpf events */
	aux_output     :  1, /* generate AUX records instead of events */
-	__reserved_1   : 32;
+	text_poke      :  1, /* include text poke events */
+	__reserved_1   : 31;
 
union {
__u32   wakeup_events;/* wakeup every n events */
@@ -1000,6 +1001,22 @@ enum perf_event_type {
 */
PERF_RECORD_BPF_EVENT   = 18,
 
+   /*
+* Records changes to kernel text i.e. self-modified code.
+* 'len' is the number of old bytes, which is the same as the number
+* of new bytes. 'bytes' contains 

Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2019 at 12:40:56PM +0300, Adrian Hunter wrote:
> Can you expand on "and ensure the poke_handler preserves the existing
> control flow"?  Whatever the INT3-handler does will be traced normally so
> long as it does not itself execute self-modified code.

My thinking was that the code shouldn't change semantics before emitting
the RECORD_TEXT_POKE; but you're right in that that doesn't seem to
matter much.

Either we run the old code or INT3 does 'something'.  Then we get
RECORD_TEXT_POKE and finish the poke.  Which tells that the moment INT3
stops the new text is in place.

I suppose that works too, and requires less changes.




Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-08-29 Thread Adrian Hunter
On 29/08/19 11:53 AM, Peter Zijlstra wrote:
> On Thu, Aug 29, 2019 at 11:23:52AM +0300, Adrian Hunter wrote:
>> On 9/01/19 12:35 PM, Peter Zijlstra wrote:
>>> On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:
>>>
 A general solution is more complicated, however, due to the racy nature of
 cross-modifying code. There would need to be TSC recording of the time
 before the modifications start and after they are done.

 BTW: I am not sure that static-keys are much better. Their change also
 affects the control flow, and they do affect the control flow.
>>>
>>> Any text_poke() user is a problem; which is why I suggested a
>>> PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
>>> timestamped and can be correlated to the trace.
>>>
>>> As to the racy nature of text_poke, yes, this is a wee bit tricky and
>>> might need some care. I _think_ we can make it work, but I'm not 100%
>>> sure on exactly how PT works, but something like:
>>>
>>>  - write INT3 byte
>>>  - IPI-SYNC
>>>
>>> and ensure the poke_handler preserves the existing control flow (which
>>> it currently does not, but should be possible).
>>>
>>>  - emit RECORD_TEXT_POKE with the new instruction
>>>
>>> at this point the actual control flow will be through the INT3 and
>>> handler and not hit the actual instruction, so the actual state is
>>> irrelevant.
>>>
>>>  - write instruction tail
>>>  - IPI-SYNC
>>>  - write first byte
>>>  - IPI-SYNC
>>>
>>> And at this point we start using the new instruction, but this is after
>>> the timestamp from the RECORD_TEXT_POKE event and decoding should work
>>> just fine.
>>>
>>
>> Presumably the IPI-SYNC does not guarantee that other CPUs will not already
>> have seen the change.  In that case, it is not possible to provide a
>> timestamp before which all CPUs executed the old code, and after which all
>> CPUs execute the new code.
> 
> 'the change' is an INT3 poke, so either you see the old code flow, or
> you see an INT3 emulate the old flow in your trace.
> 
> That should be unambiguous.
> 
> Then you emit the RECORD_TEXT_POKE with the new instruction on. This
> prepares the decoder to accept a new reality.
> 
> Then we finish the instruction poke.
> 
> And then when the trace no longer shows INT3 exceptions, you know the
> new code is in effect.
> 
> How is this ambiguous?

It's not.  I didn't get that from the first read, sorry.

Can you expand on "and ensure the poke_handler preserves the existing
control flow"?  Whatever the INT3-handler does will be traced normally so
long as it does not itself execute self-modified code.


Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2019 at 11:23:52AM +0300, Adrian Hunter wrote:
> On 9/01/19 12:35 PM, Peter Zijlstra wrote:
> > On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:
> > 
> >> A general solution is more complicated, however, due to the racy nature of
> >> cross-modifying code. There would need to be TSC recording of the time
> >> before the modifications start and after they are done.
> >>
> >> BTW: I am not sure that static-keys are much better. Their change also
> >> affects the control flow, and they do affect the control flow.
> > 
> > Any text_poke() user is a problem; which is why I suggested a
> > PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
> > timestamped and can be correlated to the trace.
> > 
> > As to the racy nature of text_poke, yes, this is a wee bit tricky and
> > might need some care. I _think_ we can make it work, but I'm not 100%
> > sure on exactly how PT works, but something like:
> > 
> >  - write INT3 byte
> >  - IPI-SYNC
> > 
> > and ensure the poke_handler preserves the existing control flow (which
> > it currently does not, but should be possible).
> > 
> >  - emit RECORD_TEXT_POKE with the new instruction
> > 
> > at this point the actual control flow will be through the INT3 and
> > handler and not hit the actual instruction, so the actual state is
> > irrelevant.
> > 
> >  - write instruction tail
> >  - IPI-SYNC
> >  - write first byte
> >  - IPI-SYNC
> > 
> > And at this point we start using the new instruction, but this is after
> > the timestamp from the RECORD_TEXT_POKE event and decoding should work
> > just fine.
> > 
> 
> Presumably the IPI-SYNC does not guarantee that other CPUs will not already
> have seen the change.  In that case, it is not possible to provide a
> timestamp before which all CPUs executed the old code, and after which all
> CPUs execute the new code.

'the change' is an INT3 poke, so either you see the old code flow, or
you see an INT3 emulate the old flow in your trace.

That should be unambiguous.

Then you emit the RECORD_TEXT_POKE with the new instruction on. This
prepares the decoder to accept a new reality.

Then we finish the instruction poke.

And then when the trace no longer shows INT3 exceptions, you know the
new code is in effect.

How is this ambiguous?



Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)

2019-08-29 Thread Adrian Hunter
On 9/01/19 12:35 PM, Peter Zijlstra wrote:
> On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:
> 
>> A general solution is more complicated, however, due to the racy nature of
>> cross-modifying code. There would need to be TSC recording of the time
>> before the modifications start and after they are done.
>>
>> BTW: I am not sure that static-keys are much better. Their change also
>> affects the control flow, and they do affect the control flow.
> 
> Any text_poke() user is a problem; which is why I suggested a
> PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
> timestamped and can be correlated to the trace.
> 
> As to the racy nature of text_poke, yes, this is a wee bit tricky and
> might need some care. I _think_ we can make it work, but I'm not 100%
> sure on exactly how PT works, but something like:
> 
>  - write INT3 byte
>  - IPI-SYNC
> 
> and ensure the poke_handler preserves the existing control flow (which
> it currently does not, but should be possible).
> 
>  - emit RECORD_TEXT_POKE with the new instruction
> 
> at this point the actual control flow will be through the INT3 and
> handler and not hit the actual instruction, so the actual state is
> irrelevant.
> 
>  - write instruction tail
>  - IPI-SYNC
>  - write first byte
>  - IPI-SYNC
> 
> And at this point we start using the new instruction, but this is after
> the timestamp from the RECORD_TEXT_POKE event and decoding should work
> just fine.
> 

Presumably the IPI-SYNC does not guarantee that other CPUs will not already
have seen the change.  In that case, it is not possible to provide a
timestamp before which all CPUs executed the old code, and after which all
CPUs execute the new code.

So we need 2 events: one before and one after the text poke.  Then it will
be up to the tools to figure out which code path was taken in that time
interval. e.g.

1. emit RECORD_TEXT_POKE
flags:  BEFORE (timestamp is before change)
method: INT3 (INT3 method used to change code)
ip
code size
code bytes before
code bytes after

2. text poke

3. emit RECORD_TEXT_POKE
flags:  AFTER (timestamp is after change)
method: INT3 (INT3 method used to change code)
ip
code size
code bytes before
code bytes after

Thoughts?