Re: KRETPROBES are broken since kernel 5.8
On Thu, 10 Dec 2020 18:14:30 +0100 Adam Zabrocki wrote: > Hi, > > > > However, there might be another issue which I wanted to brought / discuss > > > - > > > problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. > > > 'ftrace_enable_sysctl' function was correctly optimized without any > > > problems. > > > > Did you check it on other functions? Did you see it only on the > > "ftrace_enable_sysctl"? > > > > Yes, I see it in most of the functions with padding. Thanks for the confirmation. > > > > Since 5.9 it can't be optimized anynmore. I didn't see any changes in the > > > sources regarding the optimizer, neither function itself. > > > When I looked at the generated vmlinux binary, I can see that GCC > > > generated > > > padding at the end of this function using INT3 opcode: > > > > > > ... > > > 8130528b: 41 bd f0 ff ff ff mov$0xfff0,%r13d > > > 81305291: e9 fe fe ff ff jmpq 81305194 > > > > > > 81305296: cc int3 > > > 81305297: cc int3 > > > 81305298: cc int3 > > > 81305299: cc int3 > > > 8130529a: cc int3 > > > 8130529b: cc int3 > > > 8130529c: cc int3 > > > 8130529d: cc int3 > > > 8130529e: cc int3 > > > 8130529f: cc int3 > > > > So these int3 is generated by GCC for padding, right? > > > > I've just browsed a few commits and I've found that one: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7705dc8557973d8ad8f10840f61d8ec805695e9e > > It looks like INT3 is now a default padding used by linker. Thanks for the information! OK, I will add Fixed: tag and backport it. > > > > However, that's not the case here. INT3_INSN_OPCODE is placed at the end > > > of > > > the function as padding (and protect from NOP-padding problems). > > > > > > I wonder, if optimizer should take this special case into account? Is it > > > worth > > > to still optimize such functions when they are padded with INT3? > > > > Indeed. I expected int3 is used from other subsystems (e.g. kgdb) and, > > in that case the optimization can confuse them. > > Right. The same can happen when text section is being actively modified. > However, this case could be covered by running the optimizer logic under > text_mutex. No, this check is needed because of the instruction decoding. Usually, the int3 will be put a the first byte of the existing instruction whose length is usually 1-6 bytes. If the instruction's opcode is overwritten by the int3, kprobes can not get the original opcode and this means it can not get the original length of the instruction. To optimize the probe, kprobes have to ensure the other jump instruction doesn't jump into the instructions which will be overwritten by optimized jump instruction. This is why the can_optimize() decodes all instructions in the function (note that ksyms has no information of padding bytes, it returns the function size with the padding bytes.) Thus, when the kprobes detects the int3 in the function, it gives up the decoding and optimizing. > > > But if the gcc uses int3 to pad the room between functions, it should be > > reconsidered. > > > > Looks like it's a default behavior now. OK, let me fix that. If the int3 is only used for the padding between functions, those int3 should continue to the end of the function. So kprobes can distinguish the int3 comes from other subsystems or linker. Thank you, > > > Thank you, > > > > > If it is OK, we should backport those to stable tree. > > > > Agreed. > > It is also important to make sure that distro kernels would pick-up such > backported fix. > > Thanks, > Adam > > -- > pi3 (pi3ki31ny) - pi3 (at) itsec pl > http://pi3.com.pl > -- Masami Hiramatsu
Re: KRETPROBES are broken since kernel 5.8
Hi, > > However, there might be another issue which I wanted to brought / discuss - > > problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. > > 'ftrace_enable_sysctl' function was correctly optimized without any > > problems. > > Did you check it on other functions? Did you see it only on the > "ftrace_enable_sysctl"? > Yes, I see it in most of the functions with padding. > > Since 5.9 it can't be optimized anynmore. I didn't see any changes in the > > sources regarding the optimizer, neither function itself. > > When I looked at the generated vmlinux binary, I can see that GCC generated > > padding at the end of this function using INT3 opcode: > > > > ... > > 8130528b: 41 bd f0 ff ff ff mov$0xfff0,%r13d > > 81305291: e9 fe fe ff ff jmpq 81305194 > > > > 81305296: cc int3 > > 81305297: cc int3 > > 81305298: cc int3 > > 81305299: cc int3 > > 8130529a: cc int3 > > 8130529b: cc int3 > > 8130529c: cc int3 > > 8130529d: cc int3 > > 8130529e: cc int3 > > 8130529f: cc int3 > > So these int3 is generated by GCC for padding, right? > I've just browsed a few commits and I've found that one: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7705dc8557973d8ad8f10840f61d8ec805695e9e It looks like INT3 is now a default padding used by linker. > > However, that's not the case here. INT3_INSN_OPCODE is placed at the end of > > the function as padding (and protect from NOP-padding problems). > > > > I wonder, if optimizer should take this special case into account? Is it > > worth > > to still optimize such functions when they are padded with INT3? > > Indeed. I expected int3 is used from other subsystems (e.g. kgdb) and, > in that case the optimization can confuse them. Right. The same can happen when text section is being actively modified. However, this case could be covered by running the optimizer logic under text_mutex. > But if the gcc uses int3 to pad the room between functions, it should be > reconsidered. > Looks like it's a default behavior now. > Thank you, > > > If it is OK, we should backport those to stable tree. > > Agreed. It is also important to make sure that distro kernels would pick-up such backported fix. Thanks, Adam -- pi3 (pi3ki31ny) - pi3 (at) itsec pl http://pi3.com.pl
Re: KRETPROBES are broken since kernel 5.8
Hi Adam, On Thu, 10 Dec 2020 08:17:42 +0100 Adam Zabrocki wrote: > Hi, > > On Thu, Dec 10, 2020 at 10:25:07AM +0900, Masami Hiramatsu wrote: > > Hi Adam, > > > > Thank you for reporting and debugging, actually we had investigated this > > issue in Aug. Please read this thread. > > > > https://lkml.kernel.org/r/8816bdbbc55c4d2397e0b02aad282...@trendmicro.com > > > > Thanks for the link. I've read the entire history of proposed fix - very > informative :) > > > We finally fixed this issue by commit e03b4a084ea6 ("kprobes: Remove NMI > > context check") and commit 645f224e7ba2 ("kprobes: Tell lockdep about > > kprobe nesting"). Yeah, it will be in the v5.10. > > > > Could you try to test whether these commits can solve the issue? > > I've applied these commits on the top of kernel 5.9.7 and verified that it > works well. Non-optimized KRETPROBES are back to life. Good! > > However, there might be another issue which I wanted to brought / discuss - > problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. > 'ftrace_enable_sysctl' function was correctly optimized without any problems. Did you check it on other functions? Did you see it only on the "ftrace_enable_sysctl"? > Since 5.9 it can't be optimized anynmore. I didn't see any changes in the > sources regarding the optimizer, neither function itself. > When I looked at the generated vmlinux binary, I can see that GCC generated > padding at the end of this function using INT3 opcode: > > ... > 8130528b: 41 bd f0 ff ff ff mov$0xfff0,%r13d > 81305291: e9 fe fe ff ff jmpq 81305194 > > 81305296: cc int3 > 81305297: cc int3 > 81305298: cc int3 > 81305299: cc int3 > 8130529a: cc int3 > 8130529b: cc int3 > 8130529c: cc int3 > 8130529d: cc int3 > 8130529e: cc int3 > 8130529f: cc int3 So these int3 is generated by GCC for padding, right? > Such padding didn't exist in this function in generated images for older > kernels. Nevertheless, such padding is not unusual and it's pretty common. > > Optimizer logic fails here: > > try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> > __prepare_optimized_kprobe() > -> arch_prepare_optimized_kprobe() -> can_optimize(): > > /* Decode instructions */ > addr = paddr - offset; > while (addr < paddr - offset + size) { /* Decode until function end */ > unsigned long recovered_insn; > if (search_exception_tables(addr)) > /* > * Since some fixup code will jumps into this function, > * we can't optimize kprobe in this function. > */ > return 0; > recovered_insn = recover_probed_instruction(buf, addr); > if (!recovered_insn) > return 0; > kernel_insn_init(, (void *)recovered_insn, MAX_INSN_SIZE); > insn_get_length(); > /* Another subsystem puts a breakpoint */ > if (insn.opcode.bytes[0] == INT3_INSN_OPCODE) > return 0; > /* Recover address */ > insn.kaddr = (void *)addr; > insn.next_byte = (void *)(addr + insn.length); > /* Check any instructions don't jump into target */ > if (insn_is_indirect_jump() || > insn_jump_into_range(, paddr + INT3_INSN_SIZE, > DISP32_SIZE)) > return 0; > addr += insn.length; > } > > One of the check tries to protect from the situation when another subsystem > puts a breakpoint there as well: > > /* Another subsystem puts a breakpoint */ > if (insn.opcode.bytes[0] == INT3_INSN_OPCODE) > return 0; Right. > > However, that's not the case here. INT3_INSN_OPCODE is placed at the end of > the function as padding (and protect from NOP-padding problems). > > I wonder, if optimizer should take this special case into account? Is it > worth > to still optimize such functions when they are padded with INT3? Indeed. I expected int3 is used from other subsystems (e.g. kgdb) and, in that case the optimization can confuse them. But if the gcc uses int3 to pad the room between functions, it should be reconsidered. Thank you, > > > If it is OK, we should backport those to stable tree. > > Agreed. > > Thanks, > Adam > > > Thank you, > > > > On Wed, 9 Dec 2020 22:50:01 +0100 > > Adam Zabrocki wrote: > > > > > Hello, > > > > > > Starting from kernel 5.8 all non-optimized kretprobes don't work. Until > > > 5.8, > > > when #DB exception was raised, entry to the NMI was not fully performed. > > > Among > > > others, the following logic was executed: > > >
Re: KRETPROBES are broken since kernel 5.8
Hi, On Thu, Dec 10, 2020 at 10:25:07AM +0900, Masami Hiramatsu wrote: > Hi Adam, > > Thank you for reporting and debugging, actually we had investigated this > issue in Aug. Please read this thread. > > https://lkml.kernel.org/r/8816bdbbc55c4d2397e0b02aad282...@trendmicro.com > Thanks for the link. I've read the entire history of proposed fix - very informative :) > We finally fixed this issue by commit e03b4a084ea6 ("kprobes: Remove NMI > context check") and commit 645f224e7ba2 ("kprobes: Tell lockdep about > kprobe nesting"). Yeah, it will be in the v5.10. > > Could you try to test whether these commits can solve the issue? I've applied these commits on the top of kernel 5.9.7 and verified that it works well. Non-optimized KRETPROBES are back to life. However, there might be another issue which I wanted to brought / discuss - problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. 'ftrace_enable_sysctl' function was correctly optimized without any problems. Since 5.9 it can't be optimized anynmore. I didn't see any changes in the sources regarding the optimizer, neither function itself. When I looked at the generated vmlinux binary, I can see that GCC generated padding at the end of this function using INT3 opcode: ... 8130528b: 41 bd f0 ff ff ff mov$0xfff0,%r13d 81305291: e9 fe fe ff ff jmpq 81305194 81305296: cc int3 81305297: cc int3 81305298: cc int3 81305299: cc int3 8130529a: cc int3 8130529b: cc int3 8130529c: cc int3 8130529d: cc int3 8130529e: cc int3 8130529f: cc int3 Such padding didn't exist in this function in generated images for older kernels. Nevertheless, such padding is not unusual and it's pretty common. Optimizer logic fails here: try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe() -> arch_prepare_optimized_kprobe() -> can_optimize(): /* Decode instructions */ addr = paddr - offset; while (addr < paddr - offset + size) { /* Decode until function end */ unsigned long recovered_insn; if (search_exception_tables(addr)) /* * Since some fixup code will jumps into this function, * we can't optimize kprobe in this function. */ return 0; recovered_insn = recover_probed_instruction(buf, addr); if (!recovered_insn) return 0; kernel_insn_init(, (void *)recovered_insn, MAX_INSN_SIZE); insn_get_length(); /* Another subsystem puts a breakpoint */ if (insn.opcode.bytes[0] == INT3_INSN_OPCODE) return 0; /* Recover address */ insn.kaddr = (void *)addr; insn.next_byte = (void *)(addr + insn.length); /* Check any instructions don't jump into target */ if (insn_is_indirect_jump() || insn_jump_into_range(, paddr + INT3_INSN_SIZE, DISP32_SIZE)) return 0; addr += insn.length; } One of the check tries to protect from the situation when another subsystem puts a breakpoint there as well: /* Another subsystem puts a breakpoint */ if (insn.opcode.bytes[0] == INT3_INSN_OPCODE) return 0; However, that's not the case here. INT3_INSN_OPCODE is placed at the end of the function as padding (and protect from NOP-padding problems). I wonder, if optimizer should take this special case into account? Is it worth to still optimize such functions when they are padded with INT3? > If it is OK, we should backport those to stable tree. Agreed. Thanks, Adam > Thank you, > > On Wed, 9 Dec 2020 22:50:01 +0100 > Adam Zabrocki wrote: > > > Hello, > > > > Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8, > > when #DB exception was raised, entry to the NMI was not fully performed. > > Among > > others, the following logic was executed: > > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589 > > > > if (!user_mode(regs)) { > > rcu_nmi_enter(); > > preempt_disable(); > > } > > > > In some older kernels function ist_enter() was called instead. Inside this > > function we can see the following logic: > > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91 > > > > if (user_mode(regs)) { > > RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); > > } else { > > /* > > * We might have interrupted pretty much anything. In > > * fact, if we're a machine check, we can even interrupt > > * NMI processing. We don't want
Re: KRETPROBES are broken since kernel 5.8
Hi Adam, Thank you for reporting and debugging, actually we had investigated this issue in Aug. Please read this thread. https://lkml.kernel.org/r/8816bdbbc55c4d2397e0b02aad282...@trendmicro.com We finally fixed this issue by commit e03b4a084ea6 ("kprobes: Remove NMI context check") and commit 645f224e7ba2 ("kprobes: Tell lockdep about kprobe nesting"). Yeah, it will be in the v5.10. Could you try to test whether these commits can solve the issue? If it is OK, we should backport those to stable tree. Thank you, On Wed, 9 Dec 2020 22:50:01 +0100 Adam Zabrocki wrote: > Hello, > > Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8, > when #DB exception was raised, entry to the NMI was not fully performed. Among > others, the following logic was executed: > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589 > > if (!user_mode(regs)) { > rcu_nmi_enter(); > preempt_disable(); > } > > In some older kernels function ist_enter() was called instead. Inside this > function we can see the following logic: > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91 > > if (user_mode(regs)) { > RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); > } else { > /* > * We might have interrupted pretty much anything. In > * fact, if we're a machine check, we can even interrupt > * NMI processing. We don't want in_nmi() to return true, > * but we need to notify RCU. > */ > rcu_nmi_enter(); > } > > preempt_disable(); > > As the comment says "We don't want in_nmi() to return true, but we need to > notify RCU.". However, since kernel 5.8 the logic of how interrupts are > handled > was modified and currently we have this (function "exc_int3"): > https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630 > > /* > * idtentry_enter_user() uses static_branch_{,un}likely() and therefore > * can trigger INT3, hence poke_int3_handler() must be done > * before. If the entry came from kernel mode, then use nmi_enter() > * because the INT3 could have been hit in any context including > * NMI. > */ > if (user_mode(regs)) { > idtentry_enter_user(regs); > instrumentation_begin(); > do_int3_user(regs); > instrumentation_end(); > idtentry_exit_user(regs); > } else { > nmi_enter(); > instrumentation_begin(); > trace_hardirqs_off_finish(); > if (!do_int3(regs)) > die("int3", regs, 0); > if (regs->flags & X86_EFLAGS_IF) > trace_hardirqs_on_prepare(); > instrumentation_end(); > nmi_exit(); > } > > The root of unlucky change comes from this commit: > > https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5 > > which later was modified by this commit: > > https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2 > > and this is what we currently have in all kernels since 5.8. Essentially, > KRETPROBES are not working since these commits. We have the following logic: > > asm_exc_int3() -> exc_int3(): > | > | > | > v > ... > nmi_enter(); > ... > if (!do_int3(regs)) >| > -| > | > v > do_int3() -> kprobe_int3_handler(): > | > | > | > v > ... > if (!p->pre_handler || !p->pre_handler(p, regs)) > | > -| > | > v > ... > pre_handler_kretprobe(): > ... > if (unlikely(in_nmi())) { > rp->nmissed++; > return 0; > } > > Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before > invoking any registered kprobe verifies if it is not in NMI via in_nmi() call. > > This bug was masked by OPTIMIZER which was turning most of the KPROBES to be > FTRACE so essentially if someone registered KRETPROBE, buggy code was not > invoked (FTRACE code was executed instead). However, the optimizer was changed > and can't optimize as many functions anymore (probably another bug which might > be worth to investigate) and this bug is more visible. > > Thanks, > Adam > > -- > pi3 (pi3ki31ny) - pi3 (at) itsec pl > http://pi3.com.pl > -- Masami Hiramatsu
KRETPROBES are broken since kernel 5.8
Hello, Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8, when #DB exception was raised, entry to the NMI was not fully performed. Among others, the following logic was executed: https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589 if (!user_mode(regs)) { rcu_nmi_enter(); preempt_disable(); } In some older kernels function ist_enter() was called instead. Inside this function we can see the following logic: https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91 if (user_mode(regs)) { RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); } else { /* * We might have interrupted pretty much anything. In * fact, if we're a machine check, we can even interrupt * NMI processing. We don't want in_nmi() to return true, * but we need to notify RCU. */ rcu_nmi_enter(); } preempt_disable(); As the comment says "We don't want in_nmi() to return true, but we need to notify RCU.". However, since kernel 5.8 the logic of how interrupts are handled was modified and currently we have this (function "exc_int3"): https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630 /* * idtentry_enter_user() uses static_branch_{,un}likely() and therefore * can trigger INT3, hence poke_int3_handler() must be done * before. If the entry came from kernel mode, then use nmi_enter() * because the INT3 could have been hit in any context including * NMI. */ if (user_mode(regs)) { idtentry_enter_user(regs); instrumentation_begin(); do_int3_user(regs); instrumentation_end(); idtentry_exit_user(regs); } else { nmi_enter(); instrumentation_begin(); trace_hardirqs_off_finish(); if (!do_int3(regs)) die("int3", regs, 0); if (regs->flags & X86_EFLAGS_IF) trace_hardirqs_on_prepare(); instrumentation_end(); nmi_exit(); } The root of unlucky change comes from this commit: https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5 which later was modified by this commit: https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2 and this is what we currently have in all kernels since 5.8. Essentially, KRETPROBES are not working since these commits. We have the following logic: asm_exc_int3() -> exc_int3(): | | | v ... nmi_enter(); ... if (!do_int3(regs)) | -| | v do_int3() -> kprobe_int3_handler(): | | | v ... if (!p->pre_handler || !p->pre_handler(p, regs)) | -| | v ... pre_handler_kretprobe(): ... if (unlikely(in_nmi())) { rp->nmissed++; return 0; } Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before invoking any registered kprobe verifies if it is not in NMI via in_nmi() call. This bug was masked by OPTIMIZER which was turning most of the KPROBES to be FTRACE so essentially if someone registered KRETPROBE, buggy code was not invoked (FTRACE code was executed instead). However, the optimizer was changed and can't optimize as many functions anymore (probably another bug which might be worth to investigate) and this bug is more visible. Thanks, Adam -- pi3 (pi3ki31ny) - pi3 (at) itsec pl http://pi3.com.pl