Re: new text patching for review

2007-08-10 Thread Mathieu Desnoyers
* Andi Kleen ([EMAIL PROTECTED]) wrote:
> 
> Hallo,
> 
> I had some second thoughts about the text patching of DEBUG_RODATA kernels
> using change_page_attr(). Change_page_attr is intrusive and slow and using
> a separate mapping is a little more gentle. I came up with this patch.
> For your review and testing pleasure.
> 
> The main quirk is that it doesn't fully follow the cross-modifying code
> recommendations; but I'm not sure that's really needed.
> 
> Also I admit nop_out is quite inefficient, but that's a very slow path
> and it was easier to do it this way than handle all the corner cases
> explicitly.
> 
> Comments,
> 
> -Andi
> 
> x86: Fix alternatives and kprobes to remap write-protected kernel text
> 
> Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>
> 
> +/*
> + * RED-PEN Intel recommends stopping the other CPUs for such
> + * "cross-modifying code".
> + */
> +void __kprobes text_poke(void *oaddr, u8 opcode)
> +{
> +	u8 *addr = oaddr;
> +	if (!pte_write(*lookup_address((unsigned long)addr))) {
> +		struct page *p = virt_to_page(addr);
> +		addr = vmap(&p, 1, VM_MAP, PAGE_KERNEL);
> +		if (!addr)
> +			return;
> +		addr += ((unsigned long)oaddr) % PAGE_SIZE;
> +	}
> +	*addr = opcode;
> +	/* Not strictly needed, but can speed CPU recovery up */
> +	if (cpu_has_clflush)
> +		asm("clflush (%0) " :: "r" (addr) : "memory");
> +	if (addr != oaddr)
> +		vunmap(addr);
> +}

Hi Andi,

I was trying to make this work, but I found that it seems incredibly
slow when I call it multiple times. Trying to understand why, I came up
with a few questions:

- Is this supposed to work with large pages?
- Is this supposed to work with module text?
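
(A minimal sketch, not from Andi's patch, of the kind of loop that makes
repeated text_poke() calls expensive: on write-protected text each byte
poked pays for a full vmap()/vunmap() round trip plus a clflush. The
helper name nop_out_sketch is made up for illustration.)

static void nop_out_sketch(void *insns, unsigned int len)
{
	unsigned int i;

	/* one shadow mapping set up and torn down per byte */
	for (i = 0; i < len; i++)
		text_poke((u8 *)insns + i, 0x90);	/* 0x90 = NOP */
}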

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-21 Thread Andi Kleen
On Fri, Jul 20, 2007 at 11:17:49AM -0400, Mathieu Desnoyers wrote:
> * Zachary Amsden ([EMAIL PROTECTED]) wrote:
> > Mathieu Desnoyers wrote:
> > >Yes, kprobes is case 1: atomic update. And we don't even have to bother
> > >about Intel's erratum. This one is ok. That's mainly the
> > >alternatives/paravirt code I worry about.
> > >  
> > 
> > Paravirt and alternatives should all be ok because they are done before 
> > SMP bringup and with NMIs disabled.  NMI watchdog is not setup until 
> > smp_prepare_cpus/check_nmi_watchdog, which happens way later, not during 
> > parse_args/setup_nmi_watchdog, which just decides which type of watchdog 
> > to setup.
> > 
> 
> I'm not so sure about this. You are right in that it has nothing to do
> with parse_args, but I just went in detail through the source, and the
> order seems to be:
> 
> 1 - NMI is activated
> 2 - MCE is activated
> 3 - alternatives are applied

Yes I was wrong on this. I now added code to disable them again -- see 
the later patch I posted

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-20 Thread Mathieu Desnoyers
* Zachary Amsden ([EMAIL PROTECTED]) wrote:
> Mathieu Desnoyers wrote:
> >Yes, kprobes is case 1: atomic update. And we don't even have to bother
> >about Intel's erratum. This one is ok. That's mainly the
> >alternatives/paravirt code I worry about.
> >  
> 
> Paravirt and alternatives should all be ok because they are done before 
> SMP bringup and with NMIs disabled.  NMI watchdog is not setup until 
> smp_prepare_cpus/check_nmi_watchdog, which happens way later, not during 
> parse_args/setup_nmi_watchdog, which just decides which type of watchdog 
> to setup.
> 

I'm not so sure about this. You are right in that it has nothing to do
with parse_args, but I just went in detail through the source, and the
order seems to be:

1 - NMI is activated
2 - MCE is activated
3 - alternatives are applied

In detail:

start_kernel()
  ...
  include/asm-i386/smp.h:smp_prepare_boot_cpu()
smp_ops.smp_prepare_boot_cpu()
   arch/i386/kernel/smpboot.c:native_smp_prepare_boot_cpu()
arch/i386/kernel/smpboot.c:native_smp_prepare_cpus()
  arch/i386/kernel/smpboot.c:smp_boot_cpus()
arch/i386/kernel/apic.c:setup_local_APIC()
  arch/i386/kernel/nmi.c:setup_apic_nmi_watchdog()
arch/i386/kernel/cpu/perfctr-watchdog.c:lapic_watchdog_init()
  wd_ops->setup() (e.g. setup_intel_arch_watchdog)
  ...
  arch/i386/kernel/cpu/bugs.c:check_bugs()
arch/i386/kernel/cpu/common.c:identify_boot_cpu()
  identify_cpu()
arch/i386/kernel/cpu/mcheck/mce.c:mcheck_init()
  arch/i386/kernel/cpu/mcheck/p4.c:intel_p4_mcheck_init()
...
arch/i386/kernel/alternative.c:alternative_instructions()
  ...

Therefore, applying alternatives instructions seems to be done after NMI
and MCE init at boot time, or am I misunderstanding something ?


> I originally considered the NMI problem for paravirt-ops patching done 
> during module load, and found that I would need to modify 
> stop_machine_run to have an architecture specific callout to mask and 
> unmask NMIs.  I didn't imagine that would be very popular, and VMI was 
> the only paravirt-ops that were considering module load time patching, 
> so I flushed it.
> 
> You get some other nasty issues as well with run-time switching, like 
> missing early init calls (in particular, we would have to go to some 
> heroics to retake the land surrounding the APIC local timer interrupt).
> 
> Zach

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-20 Thread Mathieu Desnoyers
* Andi Kleen ([EMAIL PROTECTED]) wrote:
> On Thu, Jul 19, 2007 at 07:49:12PM -0400, Mathieu Desnoyers wrote:
> > * Andi Kleen ([EMAIL PROTECTED]) wrote:
> > > Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
> > > 
> > > > * Andi Kleen ([EMAIL PROTECTED]) wrote:
> > > > > 
> > > > > > Ewww you plan to run this in SMP ? So you actually go byte
> > > > > > by byte changing pieces of instructions non atomically and doing
> > > > > > non-Intel's errata friendly XMC. You are really looking for trouble
> > > > > > there :) Two distinct errors can occur: 
> > > > > 
> > > > > In this case it is ok because this only happens when transitioning
> > > > > from 1 CPU to 2 CPUs or vice versa and in both cases the other CPUs
> > > > > are essentially stopped.
> > > > > 
> > > > 
> > > > I agree that it's ok with SMP, but another problem arises: it's not only
> > > > a matter of being protected from SMP access, but also a matter of
> > > > reentrancy wrt interrupt handlers.
> > > > 
> > > > i.e.: if, as we are patching nops non atomically, we have a non maskable
> > > > interrupt coming which calls get_cycles_sync() which uses the
> > > 
> > > Hmm, i didn't think NMI handlers called that. e.g. nmi watchdog just
> > > uses jiffies.
> > > 
> > > get_cycles_sync patching happens only relatively early at boot, so oprofile
> > > cannot be running yet.
> > 
> > Actually, the nmi handler does use the get_cycles(), and also uses the
> > spinlock code:
> > 
> > arch/i386/kernel/nmi.c:
> > __kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
> > ...
> > static DEFINE_SPINLOCK(lock);   /* Serialise the printks */
> > spin_lock(&lock);
> > printk("NMI backtrace for cpu %d\n", cpu);
> > ...
> > spin_unlock(&lock);
> > 
> > If A - we change the spinlock code non atomically it would break.
> 
> It only has its lock prefixes twiddled, which should be ok.
> 

Yes, this is a special case where both the lock prefixed instructions
and the non lock prefixed instructions are valid, so it does not matter
if a thread is preempted right after executing the NOP turned into a
0xf0. However, if such a case happens when passing from UP to SMP, a
thread could be scheduled in and try to access a spinlock with the
non-locked instruction. There should be some kind of "teardown" to make
sure that no such case can happen.
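
(A sketch of the single-byte twiddling under discussion, assuming the
usual convention that the SMP variant of the instruction carries a 0xf0
LOCK prefix and the UP variant replaces that byte with a one-byte NOP;
the helper name is invented.)

static void set_lock_prefix(u8 *prefix_byte, int smp)
{
	/* a single aligned byte store: every intermediate state is valid */
	*prefix_byte = smp ? 0xf0 : 0x90;	/* LOCK prefix vs. NOP */
}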



> >B - printk reads the TSC to get a timestamp, it breaks:
> >it calls:
> > printk_clock(void) -> sched_clock(); -> get_cycles_sync() on x86_64.
> 
> Are we reading the same source? sched_clock has never used get_cycles_sync(),
> just ordinary get_cycles() which is not patched. In fact it mostly
> used rdtscll() directly.
> 

Yes, you are right.. I am thinking more about other clients, such as a
tracer, which could want the precision given by get_cycles_sync() and
may execute in NMI context. It does not apply to the current kernel
source. It's just that reading a timestamp counter is an operation so
common that it should not come with restrictions about which context it
could be called from due to the alternatives mechanism.

> The main problem is alternative() nopify, e.g. for prefetches which
> could hide in every list_for_each; but from a quick look the current
> early NMI code doesn't do that.

Yup.. well.. my tracer will ;) I use a list_for_each_rcu() to iterate on
active traces. That's another example of a very basic piece of
infrastructure for which we don't want to bother about alternatives
patching when using it.

> 
> > Yeah, that's a mess. That's why I always consider patching the code
> > in a way that will let the NMI handler run through it in a sane manner
> > _while_ the code is being patched. It implies _at least_ to do the
> > updates atomically with atomic aligned memory writes that keeps the site
> > being patched in a coherent state. Using a int3-based bypass is also
> > required on Intel because of the erratum regarding instruction cache.
> 
> That's only for cross modifying code, no? 
> 

No. It also applies to UP modification. Since it is hard to insure that
no unmaskable interrupt handler will run on top of you, it can help to
leave the code in a valid state at every moment.

> > > This cannot happen for the current code: 
> > > - full alternative patching happen only at boot when the other CPUs
> > > are not running
> > 
> > Should be checked if NMIs and MCEs are active at that moment.
> 
> They are probably both.
> 
> I guess we could disable them again. I will cook up a patch.
> 

I guess we could, although I wouldn't recommend doing it on a live
system, only at boot time.

> > I see the mb()/rmb()/wmb() also uses alternatives, they should be
> > checked for boot-time racing against NMIs and MCEs.
> 
> Patch above would take care of it.
> 
> > 
> > init/main.c:start_kernel()
> > 
> > parse_args() (where the nmi watchdog is enabled it seems) would probably
> > execute the smp-alt-boot and nmi_watchdog arguments in the order in which
> > they are given as kernel arguments. So I guess it could race.
> 
> Not sure I see your point here. How can arguments race?
> 

I thought parse_args() started the NMIs, but it seems to just take the
arguments and saves them for later.

Re: new text patching for review

2007-07-20 Thread Andi Kleen
On Thu, Jul 19, 2007 at 07:49:12PM -0400, Mathieu Desnoyers wrote:
> * Andi Kleen ([EMAIL PROTECTED]) wrote:
> > Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
> > 
> > > * Andi Kleen ([EMAIL PROTECTED]) wrote:
> > > > 
> > > > > Ewww you plan to run this in SMP ? So you actually go byte
> > > > > by byte changing pieces of instructions non atomically and doing
> > > > > non-Intel's errata friendly XMC. You are really looking for trouble
> > > > > there :) Two distinct errors can occur: 
> > > > 
> > > > In this case it is ok because this only happens when transitioning
> > > > from 1 CPU to 2 CPUs or vice versa and in both cases the other CPUs
> > > > are essentially stopped.
> > > > 
> > > 
> > > I agree that it's ok with SMP, but another problem arises: it's not only
> > > a matter of being protected from SMP access, but also a matter of
> > > reentrancy wrt interrupt handlers.
> > > 
> > > i.e.: if, as we are patching nops non atomically, we have a non maskable
> > > interrupt coming which calls get_cycles_sync() which uses the
> > 
> > Hmm, i didn't think NMI handlers called that. e.g. nmi watchdog just
> > uses jiffies.
> > 
> > get_cycles_sync patching happens only relatively early at boot, so oprofile
> > cannot be running yet.
> 
> Actually, the nmi handler does use the get_cycles(), and also uses the
> spinlock code:
> 
> arch/i386/kernel/nmi.c:
> __kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
> ...
> static DEFINE_SPINLOCK(lock);   /* Serialise the printks */
> spin_lock(&lock);
> printk("NMI backtrace for cpu %d\n", cpu);
> ...
> spin_unlock(&lock);
> 
> If A - we change the spinlock code non atomically it would break.

It only has its lock prefixes twiddled, which should be ok.

>B - printk reads the TSC to get a timestamp, it breaks:
>it calls:
> printk_clock(void) -> sched_clock(); -> get_cycles_sync() on x86_64.

Are we reading the same source? sched_clock has never used get_cycles_sync(),
just ordinary get_cycles() which is not patched. In fact it mostly
used rdtscll() directly.

The main problem is alternative() nopify, e.g. for prefetches which
could hide in every list_for_each; but from a quick look the current
early NMI code doesn't do that.

> Yeah, that's a mess. That's why I always consider patching the code
> in a way that will let the NMI handler run through it in a sane manner
> _while_ the code is being patched. It implies _at least_ to do the
> updates atomically with atomic aligned memory writes that keeps the site
> being patched in a coherent state. Using a int3-based bypass is also
> required on Intel because of the erratum regarding instruction cache.

That's only for cross modifying code, no? 

> > This cannot happen for the current code: 
> > - full alternative patching happen only at boot when the other CPUs
> > are not running
> 
> Should be checked if NMIs and MCEs are active at that moment.

They are probably both.

I guess we could disable them again. I will cook up a patch.

> I see the mb()/rmb()/wmb() also uses alternatives, they should be
> checked for boot-time racing against NMIs and MCEs.

Patch above would take care of it.

> 
> init/main.c:start_kernel()
> 
> parse_args() (where the nmi watchdog is enabled it seems) would probably
> execute the smp-alt-boot and nmi_watchdog arguments in the order in which
> they are given as kernel arguments. So I guess it could race.

Not sure I see your point here. How can arguments race?

> 
> the "mce" kernel argument is also parsed in parse_args(), which leads to
> the same problem.

?


> 
> > For the immediate value patching it also cannot happen because
> > you'll never modify multiple instructions and all immediate values
> > can be changed atomically. 
> > 
> 
> Exactly, I always make sure that the immediate value within the
> instruction is aligned (so a 5 bytes movl must have an offset of +3
> compared to a 4 bytes alignment).

The x86 architecture doesn't require alignment for atomic updates.

> Make sure this API is used only to modify code meeting these
> requirements (those are the ones I remember from the top of my head):

Umm, that's far too complicated. Nobody will understand it anyways.
I'll cook up something simpler.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-20 Thread Andi Kleen
On Friday 20 July 2007 02:37:20 Zachary Amsden wrote:
> Andi Kleen wrote:
> > +   *addr = opcode;
> > +   /* Not strictly needed, but can speed CPU recovery up */
> > +   if (cpu_has_clflush)
> > +   asm("clflush (%0) " :: "r" (addr) : "memory");
> > +   if (addr != oaddr)
> > +   vunmap(addr);
> >   
> 
> clflush should take oaddr.

Thanks

> If you had to remap, note that the processor does not know the linear 
> address you wrote to could be matched by another mapping in icache.  In 
> that case, you'll need a serializing instruction (cpuid) to 
> resynchronize caches.

We already got one in alternative patching, but you're right it's 
safer to do it in text_poke.
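
(A sketch of text_poke() with both of Zachary's points folded in -- clflush
on the original linear address and a serializing cpuid via sync_core() --
shown only to illustrate the discussion, not as the final code.)

void __kprobes text_poke(void *oaddr, u8 opcode)
{
	u8 *vaddr = NULL;
	u8 *addr = oaddr;

	if (!pte_write(*lookup_address((unsigned long)addr))) {
		struct page *p = virt_to_page(addr);

		vaddr = vmap(&p, 1, VM_MAP, PAGE_KERNEL);
		if (!vaddr)
			return;
		addr = vaddr + ((unsigned long)oaddr % PAGE_SIZE);
	}
	*addr = opcode;
	if (cpu_has_clflush)
		asm("clflush (%0)" :: "r" (oaddr) : "memory"); /* original address */
	sync_core();	/* cpuid serializes and resynchronizes the icache */
	if (vaddr)
		vunmap(vaddr);
}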

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-20 Thread Andi Kleen
On Thu, Jul 19, 2007 at 06:15:46PM -0700, Zachary Amsden wrote:
> Mathieu Desnoyers wrote:
> >Yes, kprobes is case 1: atomic update. And we don't even have to bother
> >about Intel's erratum. This one is ok. That's mainly the
> >alternatives/paravirt code I worry about.
> >  
> 
> Paravirt and alternatives should all be ok because they are done before 
> SMP bringup and with NMIs disabled.  NMI watchdog is not setup until 
> smp_prepare_cpus/check_nmi_watchdog, which happens way later, not during 
> parse_args/setup_nmi_watchdog, which just decides which type of watchdog 
> to setup.

There could be other NMIs too - e.g. caused by IO devices or NMI
buttons - but they're relatively unlikely.

> 
> I originally considered the NMI problem for paravirt-ops patching done 
> during module load, and found that I would need to modify 
> stop_machine_run to have an architecture specific callout to mask and 
> unmask NMIs.

When your virtual machine never injects NMIs or MCEs you're ok.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Zachary Amsden

Mathieu Desnoyers wrote:

Yes, kprobes is case 1: atomic update. And we don't even have to bother
about Intel's erratum. This one is ok. That's mainly the
alternatives/paravirt code I worry about.
  


Paravirt and alternatives should all be ok because they are done before 
SMP bringup and with NMIs disabled.  NMI watchdog is not setup until 
smp_prepare_cpus/check_nmi_watchdog, which happens way later, not during 
parse_args/setup_nmi_watchdog, which just decides which type of watchdog 
to setup.


I originally considered the NMI problem for paravirt-ops patching done 
during module load, and found that I would need to modify 
stop_machine_run to have an architecture specific callout to mask and 
unmask NMIs.  I didn't imagine that would be very popular, and VMI was 
the only paravirt-ops that were considering module load time patching, 
so I flushed it.


You get some other nasty issues as well with run-time switching, like 
missing early init calls (in particular, we would have to go to some 
heroics to retake the land surrounding the APIC local timer interrupt).


Zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Zachary Amsden

Andi Kleen wrote:

+   *addr = opcode;
+   /* Not strictly needed, but can speed CPU recovery up */
+   if (cpu_has_clflush)
+   asm("clflush (%0) " :: "r" (addr) : "memory");
+   if (addr != oaddr)
+   vunmap(addr);
  


clflush should take oaddr.

If you had to remap, note that the processor does not know the linear 
address you wrote to could be matched by another mapping in icache.  In 
that case, you'll need a serializing instruction (cpuid) to 
resynchronize caches.


I don't think any of these are SMP problems since either the code is 
patched before AP bringup or it is single byte patching.


Zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Mathieu Desnoyers
* Andi Kleen ([EMAIL PROTECTED]) wrote:
> Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
> 
> > * Andi Kleen ([EMAIL PROTECTED]) wrote:
> > > 
> > > > Ewww you plan to run this in SMP ? So you actually go byte
> > > > by byte changing pieces of instructions non atomically and doing
> > > > non-Intel's errata friendly XMC. You are really looking for trouble
> > > > there :) Two distinct errors can occur: 
> > > 
> > > In this case it is ok because this only happens when transitioning
> > > from 1 CPU to 2 CPUs or vice versa and in both cases the other CPUs
> > > are essentially stopped.
> > > 
> > 
> > I agree that it's ok with SMP, but another problem arises: it's not only
> > a matter of being protected from SMP access, but also a matter of
> > reentrancy wrt interrupt handlers.
> > 
> > i.e.: if, as we are patching nops non atomically, we have a non maskable
> > interrupt coming which calls get_cycles_sync() which uses the
> 
> Hmm, i didn't think NMI handlers called that. e.g. nmi watchdog just
> uses jiffies.
> 
> get_cycles_sync patching happens only relatively early at boot, so oprofile
> cannot be running yet.

Actually, the nmi handler does use the get_cycles(), and also uses the
spinlock code:

arch/i386/kernel/nmi.c:
__kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
...
static DEFINE_SPINLOCK(lock);   /* Serialise the printks */
spin_lock(&lock);
printk("NMI backtrace for cpu %d\n", cpu);
...
spin_unlock(&lock);

If A - we change the spinlock code non atomically it would break.
   B - printk reads the TSC to get a timestamp, it breaks:
   it calls:
printk_clock(void) -> sched_clock(); -> get_cycles_sync() on x86_64.


> 
> > I see that IRQs are disabled in alternative_instructions(), but it does
> > not protect against NMIs, which could come at a very inappropriate
> > moment. MCE and SMIs would potentially cause the same kind of trouble.
> > 
> > So unless you can guarantee that any code from NMI handler won't call
> > basic things such as get_cycles() (nor MCE, nor SMIs), you can't insure
> > it won't execute an illegal instruction. Also, the option of temporarily
> > disabling the NMI for the duration of the update simply adds unwanted
> > latency to the NMI handler which could be unacceptable in some setups.
> 
> Ok it's a fair point.  But how would you address it ?
> 
> Even if we IPIed the other CPUs NMIs or MCEs could still happen.
> 

Yeah, that's a mess. That's why I always consider patching the code
in a way that will let the NMI handler run through it in a sane manner
_while_ the code is being patched. It implies _at least_ to do the
updates atomically with atomic aligned memory writes that keeps the site
being patched in a coherent state. Using a int3-based bypass is also
required on Intel because of the erratum regarding instruction cache.
Using these techniques, I can atomically modify the execution stream. It
would be relatively simple to re-use the framework I have done in the
Immediate Values as long as no non-relocatable instructions are
involved.
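
(A rough outline of that int3-based sequence, with invented helper names:
install_bypass() would point the breakpoint handler at a relocated copy of
the old instructions, text_poke_copy() would write a range of bytes the way
text_poke() writes one, and sync_other_cpus() stands in for whatever
serialization the final scheme uses. This is an illustration of the idea,
not working kernel code.)

static void poke_multibyte(u8 *site, const u8 *newcode, size_t len)
{
	install_bypass(site, len);	/* hypothetical: reroute hits on 'site' */
	text_poke(site, 0xcc);		/* 1. int3 traps concurrent execution */
	sync_other_cpus();		/* hypothetical serialization point */
	text_poke_copy(site + 1, newcode + 1, len - 1);	/* 2. write the tail */
	sync_other_cpus();
	text_poke(site, newcode[0]);	/* 3. restore the first byte */
	sync_other_cpus();
	remove_bypass(site);		/* hypothetical */
}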

> BTW Jeremy, have you ever considered that problem with paravirt ops
> patching? 
> 
> > 
> > Another potential problem comes from preemption: if a thread is
> > preempted in the middle of the instructions you are modifying, let's say
> > we originally have 4 * 1 byte instructions that we replace with 1 * 4
> > byte instruction, a thread could iret in the middle of the new
> > instruction when it will be scheduled again, thus executing an illegal
> > instruction. This problem could be overcome by putting the threads in
> > the freezer. See the djprobe project for details about this.
> 
> This cannot happen for the current code: 
> - full alternative patching happen only at boot when the other CPUs
> are not running

Should be checked if NMIs and MCEs are active at that moment.

> - smp lock patching only ever changes a single byte (lock prefix) of
> a single instruction

Yup, as long as we are just adding/removing a f0 prefix, it's clearly
atomic.

> - kprobes only ever change a single byte
> 

I see the mb()/rmb()/wmb() also uses alternatives, they should be
checked for boot-time racing against NMIs and MCEs.

init/main.c:start_kernel()

parse_args() (where the nmi watchdog is enabled it seems) would probably
execute the smp-alt-boot and nmi_watchdog arguments in the order in which
they are given as kernel arguments. So I guess it could race.

the "mce" kernel argument is also parsed in parse_args(), which leads to
the same problem.

> For the immediate value patching it also cannot happen because
> you'll never modify multiple instructions and all immediate values
> can be changed atomically. 
> 

Exactly, I always make sure that the immediate value within the
instruction is aligned (so a 5 bytes movl must have an offset of +3
compared to a 4 bytes alignment).
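
(A sketch of why the +3 offset works, with an invented helper name: a
5-byte "movl $imm32, %reg" is one opcode byte followed by a 4-byte
immediate, so starting the instruction at 4n+3 puts the immediate itself on
a 4-byte boundary, and one aligned store then updates it atomically.)

static void update_imm32(u8 *insn, u32 new_imm)
{
	u32 *imm = (u32 *)(insn + 1);	/* skip the one-byte opcode, e.g. 0xb8 */

	BUG_ON((unsigned long)imm & 3);	/* guaranteed by the +3 placement */
	*imm = new_imm;			/* a single aligned 32-bit write */
}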

> 
> 
> > In the end, I think the safest solution would be to limit ourselves to
> > one of these 2 solutions:
> > - If the update can be done 

Re: new text patching for review

2007-07-19 Thread Mathieu Desnoyers
* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
> Andi Kleen wrote:
> > Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
> >   
> >> I see that IRQs are disabled in alternative_instructions(), but it does
> >> not protect against NMIs, which could come at a very inappropriate
> >> moment. MCE and SMIs would potentially cause the same kind of trouble.
> >>
> >> So unless you can guarantee that any code from NMI handler won't call
> >> basic things such as get_cycles() (nor MCE, nor SMIs), you can't insure
> >> it won't execute an illegal instruction. Also, the option of temporarily
> >> disabling the NMI for the duration of the update simply adds unwanted
> >> latency to the NMI handler which could be unacceptable in some setups.
> >> 
> >
> > Ok it's a fair point.  But how would you address it ?
> >
> > Even if we IPIed the other CPUs NMIs or MCEs could still happen.
> >
> > BTW Jeremy, have you ever considered that problem with paravirt ops
> > patching? 
> >   
> 
> I remember Zach was thinking about it when he was thinking of making vmi
> a kernel module, but I don't think we discussed it with respect to the
> current patching mechanism.  Though he did discover that at one point
> alternative_instructions() was being run with interrupts enabled, which
> caused surprisingly few problems...
> 
> But, yeah, it seems like it could be a problem.
> 
> > - smp lock patching only ever changes a single byte (lock prefix) of
> > a single instruction
> > - kprobes only ever change a single byte
> >
> > For the immediate value patching it also cannot happen because
> > you'll never modify multiple instructions and all immediate values
> > can be changed atomically. 
> >   
> 
> Are misaligned/cross-cache-line updates atomic?
> 

I align the "immediate values" within the mov instructions on multiples
of the immediate value size so I can update them atomically.

> J

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Jeremy Fitzhardinge
Andi Kleen wrote:
> Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
>   
>> I see that IRQs are disabled in alternative_instructions(), but it does
>> not protect against NMIs, which could come at a very inappropriate
>> moment. MCE and SMIs would potentially cause the same kind of trouble.
>>
>> So unless you can guarantee that any code from NMI handler won't call
>> basic things such as get_cycles() (nor MCE, nor SMIs), you can't insure
>> it won't execute an illegal instruction. Also, the option of temporarily
>> disabling the NMI for the duration of the update simply adds unwanted
>> latency to the NMI handler which could be unacceptable in some setups.
>> 
>
> Ok it's a fair point.  But how would you address it ?
>
> Even if we IPIed the other CPUs NMIs or MCEs could still happen.
>
> BTW Jeremy, have you ever considered that problem with paravirt ops
> patching? 
>   

I remember Zach was thinking about it when he was thinking of making vmi
a kernel module, but I don't think we discussed it with respect to the
current patching mechanism.  Though he did discover that at one point
alternative_instructions() was being run with interrupts enabled, which
caused surprisingly few problems...

But, yeah, it seems like it could be a problem.

> - smp lock patching only ever changes a single byte (lock prefix) of
> a single instruction
> - kprobes only ever change a single byte
>
> For the immediate value patching it also cannot happen because
> you'll never modify multiple instructions and all immediate values
> can be changed atomically. 
>   

Are misaligned/cross-cache-line updates atomic?

J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Andi Kleen
Mathieu Desnoyers <[EMAIL PROTECTED]> writes:

> * Andi Kleen ([EMAIL PROTECTED]) wrote:
> > 
> > > Ewww you plan to run this in SMP ? So you actually go byte
> > > by byte changing pieces of instructions non atomically and doing
> > > non-Intel's errata friendly XMC. You are really looking for trouble
> > > there :) Two distinct errors can occur: 
> > 
> > In this case it is ok because this only happens when transitioning
> > from 1 CPU to 2 CPUs or vice versa and in both cases the other CPUs
> > are essentially stopped.
> > 
> 
> I agree that it's ok with SMP, but another problem arises: it's not only
> a matter of being protected from SMP access, but also a matter of
> reentrancy wrt interrupt handlers.
> 
> i.e.: if, as we are patching nops non atomically, we have a non maskable
> interrupt coming which calls get_cycles_sync() which uses the

Hmm, i didn't think NMI handlers called that. e.g. nmi watchdog just
uses jiffies.

get_cycles_sync patching happens only relatively early at boot, so oprofile
cannot be running yet.

> I see that IRQs are disabled in alternative_instructions(), but it does
> not protect against NMIs, which could come at a very inappropriate
> moment. MCE and SMIs would potentially cause the same kind of trouble.
> 
> So unless you can guarantee that any code from NMI handler won't call
> basic things such as get_cycles() (nor MCE, nor SMIs), you can't insure
> it won't execute an illegal instruction. Also, the option of temporarily
> disabling the NMI for the duration of the update simply adds unwanted
> latency to the NMI handler which could be unacceptable in some setups.

Ok it's a fair point.  But how would you address it ?

Even if we IPIed the other CPUs NMIs or MCEs could still happen.

BTW Jeremy, have you ever considered that problem with paravirt ops
patching? 

> 
> Another potential problem comes from preemption: if a thread is
> preempted in the middle of the instructions you are modifying, let's say
> we originally have 4 * 1 byte instructions that we replace with 1 * 4
> byte instruction, a thread could iret in the middle of the new
> instruction when it will be scheduled again, thus executing an illegal
> instruction. This problem could be overcome by putting the threads in
> the freezer. See the djprobe project for details about this.

This cannot happen for the current code: 
- full alternative patching happen only at boot when the other CPUs
are not running
- smp lock patching only ever changes a single byte (lock prefix) of
a single instruction
- kprobes only ever change a single byte

For the immediate value patching it also cannot happen because
you'll never modify multiple instructions and all immediate values
can be changed atomically. 



> In the end, I think the safest solution would be to limit ourselves to
> one of these 2 solutions:
> - If the update can be done atomically and leaves the site in a coherent
>   state after each atomic modification, while keeping the same
>   instruction size, it could be done on the spot.
> - If not, care should be taken to first do a copy of the code to modify
>   into a "bypass", making sure that instructions can be relocated, so
>   that any context that would try to execute the site during the
>   modification would execute a breakpoint which would execute the bypass
>   while we modify the instructions.

kprobes does that, but to write the breakpoint it uses text_poke()
with my patch.

> Because of the tricks that must be done to do code modification on a
> live system (as explained above), I don't think it makes sense to
> provide a falsely-safe infrastructure that would allow code modification
> in a non-atomic manner.

Hmm, perhaps it would make sense to add a comment warning about
the limitations of the current text_poke. Can you supply one?

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Mathieu Desnoyers
* Andi Kleen ([EMAIL PROTECTED]) wrote:
> 
> > Ewww you plan to run this in SMP ? So you actually go byte
> > by byte changing pieces of instructions non atomically and doing
> > non-Intel's errata friendly XMC. You are really looking for trouble
> > there :) Two distinct errors can occur: 
> 
> In this case it is ok because this only happens when transitioning
> from 1 CPU to 2 CPUs or vice versa and in both cases the other CPUs
> are essentially stopped.
> 

I agree that it's ok with SMP, but another problem arises: it's not only
a matter of being protected from SMP access, but also a matter of
reentrancy wrt interrupt handlers.

i.e.: if, as we are patching nops non atomically, we have a non maskable
interrupt coming which calls get_cycles_sync() which uses the
alternatives for cpuid when the NOP instructions are not in a correct
state, we can end up executing an illegal instruction.

I see that IRQs are disabled in alternative_instructions(), but it does
not protect against NMIs, which could come at a very inappropriate
moment. MCE and SMIs would potentially cause the same kind of trouble.

So unless you can guarantee that any code from NMI handler won't call
basic things such as get_cycles() (nor MCE, nor SMIs), you can't insure
it won't execute an illegal instruction. Also, the option of temporarily
disabling the NMI for the duration of the update simply adds unwanted
latency to the NMI handler which could be unacceptable in some setups.

Another potential problem comes from preemption: if a thread is
preempted in the middle of the instructions you are modifying, let's say
we originally have 4 * 1 byte instructions that we replace with 1 * 4
byte instruction, a thread could iret in the middle of the new
instruction when it will be scheduled again, thus executing an illegal
instruction. This problem could be overcome by putting the threads in
the freezer. See the djprobe project for details about this.
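
(An illustration of the hazard, assuming the GENERIC_NOP4 byte values from
the i386 headers; the only point is that a return address saved inside the
old 4 x 1-byte sequence lands mid-instruction once the 4-byte replacement
is written.)

static const u8 old_site[4] = { 0x90, 0x90, 0x90, 0x90 };	/* four 1-byte NOPs */
static const u8 new_site[4] = { 0x8d, 0x74, 0x26, 0x00 };	/* one 4-byte NOP */
/*
 * A thread preempted after executing the second old NOP resumes at
 * site + 2, which now decodes as 0x26 0x00 ... -- not an instruction
 * anyone intended to execute.
 */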

In the end, I think the safest solution would be to limit ourselves to
one of these 2 solutions:
- If the update can be done atomically and leaves the site in a coherent
  state after each atomic modification, while keeping the same
  instruction size, it could be done on the spot.
- If not, care should be taken to first do a copy of the code to modify
  into a "bypass", making sure that instructions can be relocated, so
  that any context that would try to execute the site during the
  modification would execute a breakpoint which would execute the bypass
  while we modify the instructions. Care should be taken to make sure
  that threads are in the freezer while we do this. This is basically
  what djprobes does. Your case is only a bit simpler because you don't
  worry about SMP.


> All the other manipulations currently are single byte.
> 

For breakpoint, yes, this one is easy.


> I suppose for your immediate value patches something stronger is needed,
> but we can worry about that post .23.
> 
> > What I don't like about this particular implementation is that it does
> > not support "poking" more than 1 byte. In order to support this, you
> > would have to deal with the case where the address range spans over more
> > than one page.
> 
> I considered it, but the function would have been at least twice as big
> to handle all the corner cases. And for the current callers it's all fine.
> 
> > Also, doing the copy in the same interface seems a bit awkward.
> 
> Splitting it would also seem quite awkward.
>  

Because of the tricks that must be done to do code modification on a
live system (as explained above), I don't think it makes sense to
provide a falsely-safe infrastructure that would allow code modification
in a non-atomic manner.

> > 
> > I would much prefer something like:
> > 
> > void *map_shadow_write(void *addr, size_t len);
> > (returns a pointer to the shadow writable pages, at the same page offset
> > as "addr")
> > 
> > int unmap_shadow_write(void *shadow_addr, size_t len);
> > (unmap the shadow pages)
> > 
> > Then, the in-kernel user is free to modify their pages as they like.
> > Since we cannot foresee each modification pattern, I think that leaving
> > this kind of flexibility is useful.
> 
> You could as well call vmap directly then; it's not that much
> more complicated.  I don't really see much value in complicating
> it right now.
> 

You're right about the direct vmap call: it is quite simple and elegant. I
guess I'll use it in my next version of kernel text lock.
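
(A sketch of what calling vmap() directly for this purpose might look like,
following the map_shadow_write() interface proposed earlier in the thread;
it is limited to patches spanning at most two pages and is not code from
the thread.)

void *map_shadow_write(void *addr, size_t len)
{
	unsigned long offset = (unsigned long)addr & ~PAGE_MASK;
	int nr_pages = (offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
	struct page *pages[2];
	void *shadow;
	int i;

	if (nr_pages > 2)
		return NULL;
	for (i = 0; i < nr_pages; i++)
		pages[i] = virt_to_page((u8 *)addr - offset + i * PAGE_SIZE);
	shadow = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
	return shadow ? (u8 *)shadow + offset : NULL;
}

int unmap_shadow_write(void *shadow_addr, size_t len)
{
	vunmap((void *)((unsigned long)shadow_addr & PAGE_MASK));
	return 0;
}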

Mathieu

> -Andi
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Andi Kleen

> Ewww you plan to run this in SMP ? So you actually go byte
> by byte changing pieces of instructions non atomically and doing
> non-Intel's errata friendly XMC. You are really looking for trouble
> there :) Two distinct errors can occur: 

In this case it is ok because this only happens when transitioning
from 1 CPU to 2 CPUs or vice versa and in both cases the other CPUs
are essentially stopped.

All the other manipulations currently are single byte.

I suppose for your immediate value patches something stronger is needed,
but we can worry about that post .23.

> What I don't like about this particular implementation is that it does
> not support "poking" more than 1 byte. In order to support this, you
> would have to deal with the case where the address range spans over more
> than one page.

I considered it, but the function would have been at least twice as big
to handle all the corner cases. And for the current callers it's all fine.

> Also, doing the copy in the same interface seems a bit awkward.

Splitting it would also seem quite awkward.
 
> 
> I would much prefer something like:
> 
> void *map_shadow_write(void *addr, size_t len);
> (returns a pointer to the shadow writable pages, at the same page offset
> as "addr")
> 
> int unmap_shadow_write(void *shadow_addr, size_t len);
> (unmap the shadow pages)
> 
> Then, the in-kernel user is free to modify their pages as they like.
> Since we cannot foresee each modification pattern, I think that leaving
> this kind of flexibility is useful.

You could as well call vmap directly then; it's not that much
more complicated.  I don't really see much value in complicating
it right now.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Mathieu Desnoyers
Hi Andi,

* Andi Kleen ([EMAIL PROTECTED]) wrote:
> 
> Hallo,
> 
> I had some second thoughts about the text patching of DEBUG_RODATA kernels
> using change_page_attr(). Change_page_attr is intrusive and slow and using
> a separate mapping is a little more gentle. I came up with this patch.
> For your review and testing pleasure.
> 
> The main quirk is that it doesn't fully follow the cross-modifying code
> recommendations; but i'm not sure that's really needed.
> 
> Also I admit nop_out is quite inefficient, but that's a very slow path
> and it was easier to do it this way than handle all the corner cases
> explicitely.
> 
> Comments,
> 
> -Andi
> 
> x86: Fix alternatives and kprobes to remap write-protected kernel text
> 
> Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>
> 
> ---
>  arch/i386/kernel/alternative.c   |   33 +++--
>  arch/i386/kernel/kprobes.c   |9 +++--
>  arch/i386/mm/init.c  |   14 +++---
>  arch/x86_64/kernel/kprobes.c |   10 +++---
>  arch/x86_64/mm/init.c|   10 --
>  arch/x86_64/mm/pageattr.c|2 +-
>  include/asm-i386/alternative.h   |2 ++
>  include/asm-x86_64/alternative.h |2 ++
>  include/asm-x86_64/pgtable.h |2 ++
>  9 files changed, 47 insertions(+), 37 deletions(-)
> 
> Index: linux/arch/x86_64/kernel/kprobes.c
> ===
> --- linux.orig/arch/x86_64/kernel/kprobes.c
> +++ linux/arch/x86_64/kernel/kprobes.c
> @@ -39,9 +39,9 @@
>  #include <linux/module.h>
>  #include <linux/kdebug.h>
>  
> -#include <asm/cacheflush.h>
>  #include <asm/pgtable.h>
>  #include <asm/uaccess.h>
> +#include <asm/alternative.h>
>  
>  void jprobe_return_end(void);
>  static void __kprobes arch_copy_kprobe(struct kprobe *p);
> @@ -209,16 +209,12 @@ static void __kprobes arch_copy_kprobe(s
>  
>  void __kprobes arch_arm_kprobe(struct kprobe *p)
>  {
> - *p->addr = BREAKPOINT_INSTRUCTION;
> - flush_icache_range((unsigned long) p->addr,
> -(unsigned long) p->addr + sizeof(kprobe_opcode_t));
> + text_poke(p->addr, BREAKPOINT_INSTRUCTION);
>  }
>  
>  void __kprobes arch_disarm_kprobe(struct kprobe *p)
>  {
> - *p->addr = p->opcode;
> - flush_icache_range((unsigned long) p->addr,
> -(unsigned long) p->addr + sizeof(kprobe_opcode_t));
> + text_poke(p->addr, p->opcode);
>  }
>  
>  void __kprobes arch_remove_kprobe(struct kprobe *p)
> Index: linux/arch/i386/kernel/alternative.c
> ===
> --- linux.orig/arch/i386/kernel/alternative.c
> +++ linux/arch/i386/kernel/alternative.c
> @@ -2,8 +2,12 @@
>  #include <linux/sched.h>
>  #include <linux/spinlock.h>
>  #include <linux/list.h>
> +#include <linux/kprobes.h>
> +#include <linux/mm.h>
> +#include <linux/vmalloc.h>
>  #include <asm/alternative.h>
>  #include <asm/sections.h>
> +#include <asm/pgtable.h>
>  
>  #ifdef CONFIG_HOTPLUG_CPU
>  static int smp_alt_once;
> @@ -145,12 +149,15 @@ static unsigned char** find_nop_table(vo
>  static void nop_out(void *insns, unsigned int len)
>  {
>   unsigned char **noptable = find_nop_table();
> + int i;
>  
>   while (len > 0) {
>   unsigned int noplen = len;
>   if (noplen > ASM_NOP_MAX)
>   noplen = ASM_NOP_MAX;
> - memcpy(insns, noptable[noplen], noplen);
> + /* Very inefficient */
> + for (i = 0; i < noplen; i++)
> + text_poke(insns + i, noptable[noplen][i]);

Ewww you plan to run this in SMP ? So you actually go byte
by byte changing pieces of instructions non atomically and doing
non-Intel's errata friendly XMC. You are really looking for trouble
there :) Two distinct errors can occur: 

1 - Invalid instruction (due to partial instruction modification)
2 - GPF (or something like it), due to Intel's errata.

I don't see why we _need_ to do the copy operation within a wrapper. I
am always frowning when I see an API that could do a simple task (and do
it well) turn into a more sophisticated interface (here: dealing with
write permissions _and_ doing the copy). It seems to me that it will
just limit the programmer's options about what code they want to change,
how, on how many bytes at a time...

>   insns += noplen;
>   len -= noplen;
>   }
> @@ -202,7 +209,7 @@ static void alternatives_smp_lock(u8 **s
>   continue;
>   if (*ptr > text_end)
>   continue;
> - **ptr = 0xf0; /* lock prefix */
> + text_poke(ptr, 0xf0); /* lock prefix */
>   };
>  }
>  
> @@ -406,3 +413,25 @@ void __init alternative_instructions(voi
>   apply_paravirt(__parainstructions, __parainstructions_end);
>   local_irq_restore(flags);
>  }
> +
> +/*
> + * RED-PEN Intel recommends stopping the other CPUs for such
> + * "cross-modifying code".
> + */

Yeah, actually, you don't want them to busy loop with interrupts
disabled during this. Even then, you can get an NMI. Dealing with this
properly at the kernel level requires a little bit more thought than

Re: new text patching for review

2007-07-19 Thread Andi Kleen
On Thursday 19 July 2007 14:25:59 Jan Beulich wrote:
> I like this approach much better, just one small note (I think I
> had the same mistake in my changes initially):
> 
> >@@ -202,7 +209,7 @@ static void alternatives_smp_lock(u8 **s
> > continue;
> > if (*ptr > text_end)
> > continue;
> >-**ptr = 0xf0; /* lock prefix */
> >+text_poke(ptr, 0xf0); /* lock prefix */
> > };
> > }
>  
> You probably want *ptr here.

Thanks fixed

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Jan Beulich
I like this approach much better, just one small note (I think I
had the same mistake in my changes initially):

>@@ -202,7 +209,7 @@ static void alternatives_smp_lock(u8 **s
>   continue;
>   if (*ptr > text_end)
>   continue;
>-  **ptr = 0xf0; /* lock prefix */
>+  text_poke(ptr, 0xf0); /* lock prefix */
>   };
> }
 
You probably want *ptr here.

Jan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Andi Kleen
Mathieu Desnoyers [EMAIL PROTECTED] writes:

 * Andi Kleen ([EMAIL PROTECTED]) wrote:
  
   Ewww you plan to run this in SMP ? So you actually go byte
   by byte changing pieces of instructions non atomically and doing
   non-Intel's errata friendly XMC. You are really looking for trouble
   there :) Two distinct errors can occur: 
  
  In this case it is ok because this only happens when transitioning
  from 1 CPU to 2 CPUs or vice versa and in both cases the other CPUs
  are essentially stopped.
  
 
 I agree that it's ok with SMP, but another problem arises: it's not only
 a matter of being protected from SMP access, but also a matter of
 reentrancy wrt interrupt handlers.
 
 i.e.: if, as we are patching nops non atomically, we have a non maskable
 interrupt coming which calls get_cycles_sync() which uses the

Hmm, I didn't think NMI handlers called that; e.g. the NMI watchdog just
uses jiffies.

get_cycles_sync patching happens only relatively early at boot, so oprofile
cannot be running yet.

 I see that IRQs are disabled in alternative_instructions(), but it does
 not protect against NMIs, which could come at a very inappropriate
 moment. MCE and SMIs would potentially cause the same kind of trouble.
 
 So unless you can guarantee that any code from NMI handler won't call
 basic things such as get_cycles() (nor MCE, nor SMIs), you can't insure
 it won't execute an illegal instruction. Also, the option of temporarily
 disabling the NMI for the duration of the update simply adds unwanted
 latency to the NMI handler which could be unacceptable in some setups.

Ok it's a fair point.  But how would you address it ?

Even if we IPIed the other CPUs NMIs or MCEs could still happen.

BTW Jeremy, have you ever considered that problem with paravirt ops
patching? 

 
 Another potential problem comes from preemption: if a thread is
 preempted in the middle of the instructions you are modifying, let's say
 we originally have 4 * 1 byte instructions that we replace with 1 * 4
 byte instruction, a thread could iret in the middle of the new
 instruction when it will be scheduled again, thus executing an illegal
 instruction. This problem could be overcomed by putting the threads in
 the freezer. See the djprobe project for details about this.

This cannot happen for the current code: 
- full alternative patching happen only at boot when the other CPUs
are not running
- smp lock patching only ever changes a single byte (lock prefix) of
a single instruction
- kprobes only ever change a single byte

For the immediate value patching it also cannot happen because
you'll never modify multiple instructions and all immediate values
can be changed atomically. 



 In the end, I think the safest solution would be to limit ourselves to
 one of these 2 solutions:
 - If the update can be done atomically and leaves the site in a coherent
   state after each atomic modification, while keeping the same
   instruction size, it could be done on the spot.
 - If not, care should be taken to first do a copy of the code to modify
   into a bypass, making sure that instructions can be relocated, so
   that any context that would try to execute the site during the
   modification would execute a breakpoint which would execute the bypass
   while we modify the instructions.

kprobes does that, but to write the break point it uses text_poke() 
with my patch.

 Because of the tricks that must be done to do code modification on a
 live system (as explained above), I don't think it makes sense to
 provide a falsely-safe infrastructure that would allow code modification
 in a non-atomic manner.

Hmm, perhaps it would make sense to add a comment warning about
the limitations of the current text_poke. Can you supply one?
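
For what it's worth, one possible wording, simply collecting the
limitations discussed in this thread (illustrative only, not the
comment that was eventually merged):

/*
 * text_poke(): patch one byte of kernel text.
 *
 * Limitations:
 * - single byte only; the target must not span a page boundary;
 * - the store itself is atomic, but there is no bypass and no CPU
 *   synchronization, so callers must guarantee that no other context
 *   (including NMI/MCE/SMI handlers) can execute a half-modified
 *   multi-byte sequence;
 * - Intel's cross-modifying-code recommendations are not followed;
 *   tolerable today only because callers patch either at boot, while
 *   the other CPUs are essentially stopped, or one byte at a time.
 */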

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Jeremy Fitzhardinge
Andi Kleen wrote:
 Mathieu Desnoyers [EMAIL PROTECTED] writes:
   
 I see that IRQs are disabled in alternative_instructions(), but it does
 not protect against NMIs, which could come at a very inappropriate
 moment. MCE and SMIs would potentially cause the same kind of trouble.

 So unless you can guarantee that any code from NMI handler won't call
 basic things such as get_cycles() (nor MCE, nor SMIs), you can't insure
 it won't execute an illegal instruction. Also, the option of temporarily
 disabling the NMI for the duration of the update simply adds unwanted
 latency to the NMI handler which could be unacceptable in some setups.
 

 Ok it's a fair point.  But how would you address it ?

 Even if we IPIed the other CPUs NMIs or MCEs could still happen.

 BTW Jeremy, have you ever considered that problem with paravirt ops
 patching? 
   

I remember Zach was thinking about it when he was thinking of making vmi
a kernel module, but I don't think we discussed it with respect to the
current patching mechanism.  Though he did discover that at one point
alternative_instructions() was being run with interrupts enabled, which
caused surprisingly few problems...

But, yeah, it seems like it could be a problem.

 - smp lock patching only ever changes a single byte (lock prefix) of
 a single instruction
 - kprobes only ever change a single byte

 For the immediate value patching it also cannot happen because
 you'll never modify multiple instructions and all immediate values
 can be changed atomically. 
   

Are misaligned/cross-cache-line updates atomic?

J
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Mathieu Desnoyers
* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
 Andi Kleen wrote:
  Mathieu Desnoyers [EMAIL PROTECTED] writes:

  I see that IRQs are disabled in alternative_instructions(), but it does
  not protect against NMIs, which could come at a very inappropriate
  moment. MCE and SMIs would potentially cause the same kind of trouble.
 
  So unless you can guarantee that any code from NMI handler won't call
  basic things such as get_cycles() (nor MCE, nor SMIs), you can't insure
  it won't execute an illegal instruction. Also, the option of temporarily
  disabling the NMI for the duration of the update simply adds unwanted
  latency to the NMI handler which could be unacceptable in some setups.
  
 
  Ok it's a fair point.  But how would you address it ?
 
  Even if we IPIed the other CPUs NMIs or MCEs could still happen.
 
  BTW Jeremy, have you ever considered that problem with paravirt ops
  patching? 

 
 I remember Zach was thinking about it when he was thinking of making vmi
 a kernel module, but I don't think we discussed it with respect to the
 current patching mechanism.  Though he did discover that at one point
 alternative_instructions() was being run with interrupts enabled, which
 caused surprisingly few problems...
 
 But, yeah, it seems like it could be a problem.
 
  - smp lock patching only ever changes a single byte (lock prefix) of
  a single instruction
  - kprobes only ever change a single byte
 
  For the immediate value patching it also cannot happen because
  you'll never modify multiple instructions and all immediate values
  can be changed atomically. 

 
 Are misaligned/cross-cache-line updates atomic?
 

I align the immediate values within the mov instructions on multiples
of the immediate value size so I can update them atomically.

 J

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Mathieu Desnoyers
* Andi Kleen ([EMAIL PROTECTED]) wrote:
 Mathieu Desnoyers [EMAIL PROTECTED] writes:
 
  * Andi Kleen ([EMAIL PROTECTED]) wrote:
   
Ewww you plan to run this in SMP ? So you actually go byte
by byte changing pieces of instructions non atomically and doing
non-Intel's errata friendly XMC. You are really looking for trouble
there :) Two distinct errors can occur: 
   
   In this case it is ok because this only happens when transitioning
   from 1 CPU to 2 CPUs or vice versa and in both cases the other CPUs
   are essentially stopped.
   
  
  I agree that it's ok with SMP, but another problem arises: it's not only
  a matter of being protected from SMP access, but also a matter of
  reentrancy wrt interrupt handlers.
  
  i.e.: if, as we are patching nops non atomically, we have a non maskable
  interrupt coming which calls get_cycles_sync() which uses the
 
 Hmm, i didn't think NMI handlers called that. e.g. nmi watchdog just
 uses jiffies.
 
 get_cycles_sync patching happens only relatively early at boot, so oprofile
 cannot be running yet.

Actually, the NMI handler does use get_cycles(), and also uses the
spinlock code:

arch/i386/kernel/nmi.c:
__kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
...
static DEFINE_SPINLOCK(lock);   /* Serialise the printks */
spin_lock(&lock);
printk("NMI backtrace for cpu %d\n", cpu);
...
spin_unlock(&lock);

If A - we change the spinlock code non atomically it would break.
   B - printk reads the TSC to get a timestamp, it breaks:
       it calls:
       printk_clock(void) -> sched_clock() -> get_cycles_sync() on x86_64.


 
  I see that IRQs are disabled in alternative_instructions(), but it does
  not protect against NMIs, which could come at a very inappropriate
  moment. MCE and SMIs would potentially cause the same kind of trouble.
  
  So unless you can guarantee that any code from NMI handler won't call
  basic things such as get_cycles() (nor MCE, nor SMIs), you can't insure
  it won't execute an illegal instruction. Also, the option of temporarily
  disabling the NMI for the duration of the update simply adds unwanted
  latency to the NMI handler which could be unacceptable in some setups.
 
 Ok it's a fair point.  But how would you address it ?
 
 Even if we IPIed the other CPUs NMIs or MCEs could still happen.
 

Yeah, that's a mess. That's why I always consider patching the code
in a way that will let the NMI handler run through it in a sane manner
_while_ the code is being patched. It implies _at least_ to do the
updates atomically, with aligned memory writes that keep the site
being patched in a coherent state. Using an int3-based bypass is also
required on Intel because of the erratum regarding the instruction cache.
Using these techniques, I can atomically modify the execution stream. It
would be relatively simple to re-use the framework I have done in the
Immediate Values as long as no non-relocatable instructions are
involved.

 BTW Jeremy, have you ever considered that problem with paravirt ops
 patching? 
 
  
  Another potential problem comes from preemption: if a thread is
  preempted in the middle of the instructions you are modifying, let's say
  we originally have 4 * 1 byte instructions that we replace with 1 * 4
  byte instruction, a thread could iret in the middle of the new
  instruction when it will be scheduled again, thus executing an illegal
  instruction. This problem could be overcomed by putting the threads in
  the freezer. See the djprobe project for details about this.
 
 This cannot happen for the current code: 
 - full alternative patching happen only at boot when the other CPUs
 are not running

It should be checked whether NMIs and MCEs are active at that moment.

 - smp lock patching only ever changes a single byte (lock prefix) of
 a single instruction

Yup, as long as we are just adding/removing a f0 prefix, it's clearly
atomic.

 - kprobes only ever change a single byte
 

I see that mb()/rmb()/wmb() also use alternatives; they should be
checked for boot-time races against NMIs and MCEs.

init/main.c:start_kernel()

parse_args() (where the nmi watchdog is enabled it seems) would probably
execute the smp-alt-boot and nmi_watchdog arguments in the order in which
they are given as kernel arguments. So I guess it could race.

the mce kernel argument is also parsed in parse_args(), which leads to
the same problem.

 For the immediate value patching it also cannot happen because
 you'll never modify multiple instructions and all immediate values
 can be changed atomically. 
 

Exactly, I always make sure that the immediate value within the
instruction is aligned (so a 5-byte movl must start at an offset of +3
relative to a 4-byte alignment).
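
As an illustration of what that alignment buys (a sketch, not the
actual immediate-values implementation): a 5-byte movl $imm32,%reg has
its immediate in bytes 1..4, so when the instruction starts at an
address equal to 3 mod 4 the immediate itself is 4-byte aligned and a
single aligned store replaces it atomically:

/* Sketch: atomically patch the imm32 of a suitably aligned 5-byte movl. */
static int update_mov_imm32(u8 *insn, u32 new_imm)
{
	u32 *imm = (u32 *)(insn + 1);	/* immediate follows the 0xb8+reg opcode */

	if ((unsigned long)imm & 3)
		return -EINVAL;		/* not 4-byte aligned: cannot patch atomically */

	/* assumes the text is writable here (or written via a shadow mapping) */
	*(volatile u32 *)imm = new_imm;	/* one aligned 32-bit store, atomic on x86 */
	return 0;
}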

 
 
  In the end, I think the safest solution would be to limit ourselves to
  one of these 2 solutions:
  - If the update can be done atomically and leaves the site in a coherent
state after each atomic modification, while keeping the same
instruction 

Re: new text patching for review

2007-07-19 Thread Zachary Amsden

Andi Kleen wrote:

+   *addr = opcode;
+   /* Not strictly needed, but can speed CPU recovery up */
+   if (cpu_has_clflush)
+   asm("clflush (%0) " :: "r" (addr) : "memory");
+   if (addr != oaddr)
+   vunmap(addr);
  


clflush should take oaddr.

If you had to remap, note that the processor does not know the linear 
address you wrote to could be matched by another mapping in icache.  In 
that case, you'll need a serializing instruction (cpuid) to 
resynchronize caches.
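
(For reference, that serializing step is roughly what the kernel's
sync_core() helper does; a minimal sketch, which every CPU that may
hold stale instruction bytes would have to execute, e.g. via an IPI:)

/* Roughly sync_core(): cpuid is a serializing instruction on x86. */
static inline void serialize_cpu(void)
{
	unsigned int eax = 1, ebx, ecx, edx;

	asm volatile("cpuid"
		     : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
		     : : "memory");
}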


I don't think any of these are SMP problems since either the code is 
patched before AP bringup or it is single byte patching.


Zach
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: new text patching for review

2007-07-19 Thread Zachary Amsden

Mathieu Desnoyers wrote:

Yes, kprobes is case 1: atomic update. And we don't even have to bother
about Intel's erratum. This one is ok. That's mainly the
alternatives/paravirt code I worry about.
  


Paravirt and alternatives should all be ok because they are done before 
SMP bringup and with NMIs disabled.  NMI watchdog is not setup until 
smp_prepare_cpus/check_nmi_watchdog, which happens way later, not during 
parse_args/setup_nmi_watchdog, which just decides which type of watchdog 
to setup.


I originally considered the NMI problem for paravirt-ops patching done 
during module load, and found that I would need to modify 
stop_machine_run to have an architecture specific callout to mask and 
unmask NMIs.  I didn't imagine that would be very popular, and VMI was 
the only paravirt-ops that were considering module load time patching, 
so I flushed it.


You get some other nasty issues as well with run-time switching, like 
missing early init calls (in particular, we would have to go to some 
heroics to retake the land surrounding the APIC local timer interrupt).


Zach
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/