Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86
On Wed, Oct 07, 2020 at 04:41PM +0200, Marco Elver wrote:
> On Wed, 7 Oct 2020 at 16:15, Jann Horn wrote:
> [...]
> > > > > +	return false;
> > > > > +
> > > > > +	if (protect)
> > > > > +		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> > > > > +	else
> > > > > +		set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
> > > >
> > > > Hmm... do we have this helper (instead of using the existing helpers
> > > > for modifying memory permissions) to work around the allocation out of
> > > > the data section?
> > >
> > > I just played around with using the set_memory.c functions, to remind
> > > myself why this didn't work. I experimented with using
> > > set_memory_{np,p}() functions; set_memory_p() isn't implemented, but
> > > is easily added (which I did for below experiment). However, this
> > > didn't quite work:
> [...]
> > > For one, smp_call_function_many_cond() doesn't want to be called with
> > > interrupts disabled, and we may very well get a KFENCE allocation or
> > > page fault with interrupts disabled / within interrupts.
> > >
> > > Therefore, to be safe, we should avoid IPIs.
> >
> > set_direct_map_invalid_noflush() does that, too, I think? And that's
> > already implemented for both arm64 and x86.
>
> Sure, that works.
>
> We still want the flush_tlb_one_kernel(), at least so the local CPU's
> TLB is flushed.

Nope, sorry, set_direct_map_invalid_noflush() does not work -- this
results in potential deadlock.

	WARNING: inconsistent lock state
	5.9.0-rc4+ #2 Not tainted
	inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
	ksoftirqd/1/16 [HC0[0]:SC1[1]:HE1:SE0] takes:
	89fcf9b8 (cpa_lock){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
	89fcf9b8 (cpa_lock){+.?.}-{2:2}, at: __change_page_attr_set_clr+0x1b0/0x2510 arch/x86/mm/pat/set_memory.c:1658
	{SOFTIRQ-ON-W} state was registered at:
	  lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
	  __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
	  _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
	  spin_lock include/linux/spinlock.h:354 [inline]
	  __change_page_attr_set_clr+0x1b0/0x2510 arch/x86/mm/pat/set_memory.c:1658
	  change_page_attr_set_clr+0x333/0x500 arch/x86/mm/pat/set_memory.c:1752
	  change_page_attr_set arch/x86/mm/pat/set_memory.c:1782 [inline]
	  set_memory_nx+0xb2/0x110 arch/x86/mm/pat/set_memory.c:1930
	  free_init_pages+0x73/0xc0 arch/x86/mm/init.c:876
	  alternative_instructions+0x155/0x1a4 arch/x86/kernel/alternative.c:738
	  check_bugs+0x1bd0/0x1c77 arch/x86/kernel/cpu/bugs.c:140
	  start_kernel+0x486/0x4b6 init/main.c:1042
	  secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
	irq event stamp: 14564
	hardirqs last enabled at (14564): [] __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160 [inline]
	hardirqs last enabled at (14564): [] _raw_spin_unlock_irqrestore+0x6f/0x90 kernel/locking/spinlock.c:191
	hardirqs last disabled at (14563): [] __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
	hardirqs last disabled at (14563): [] _raw_spin_lock_irqsave+0xa9/0xce kernel/locking/spinlock.c:159
	softirqs last enabled at (14486): [] run_ksoftirqd kernel/softirq.c:652 [inline]
	softirqs last enabled at (14486): [] run_ksoftirqd+0xcf/0x170 kernel/softirq.c:644
	softirqs last disabled at (14491): [] run_ksoftirqd kernel/softirq.c:652 [inline]
	softirqs last disabled at (14491): [] run_ksoftirqd+0xcf/0x170 kernel/softirq.c:644

	other info that might help us debug this:
	 Possible unsafe locking scenario:

	       CPU0
	  lock(cpa_lock);
	  <Interrupt>
	    lock(cpa_lock);

	 *** DEADLOCK ***

	1 lock held by ksoftirqd/1/16:
	 #0: 8a067e20 (rcu_callback){}-{0:0}, at: rcu_do_batch kernel/rcu/tree.c:2418 [inline]
	 #0: 8a067e20 (rcu_callback){}-{0:0}, at: rcu_core+0x55d/0x1130 kernel/rcu/tree.c:2656

	stack backtrace:
	CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.9.0-rc4+ #2
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
	Call Trace:
	 __dump_stack lib/dump_stack.c:77 [inline]
	 dump_stack+0x198/0x1fd lib/dump_stack.c:118
	 print_usage_bug kernel/locking/lockdep.c:3350 [inline]
	 valid_state kernel/locking/lockdep.c:3361 [inline]
	 mark_lock_irq kernel/locking/lockdep.c:3575 [inline]
	 mark_lock.cold+0x12/0x17 kernel/locking/lockdep.c:4006
	 mark_usage kernel/locking/lockdep.c:3905 [inline]
	 __lock_acquire+0x1159/0x5780 kernel/locking/lockdep.c:4380
Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86
On Wed, 7 Oct 2020 at 16:15, Jann Horn wrote:
> On Wed, Oct 7, 2020 at 3:09 PM Marco Elver wrote:
> > On Fri, 2 Oct 2020 at 07:45, Jann Horn wrote:
> > > On Tue, Sep 29, 2020 at 3:38 PM Marco Elver wrote:
> > > > Add architecture specific implementation details for KFENCE and enable
> > > > KFENCE for the x86 architecture. In particular, this implements the
> > > > required interface in <asm/kfence.h> for setting up the pool and
> > > > providing helper functions for protecting and unprotecting pages.
> > > >
> > > > For x86, we need to ensure that the pool uses 4K pages, which is done
> > > > using the set_memory_4k() helper function.
> > > [...]
> > > > diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> > > [...]
> > > > +/* Protect the given page and flush TLBs. */
> > > > +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> > > > +{
> > > > +	unsigned int level;
> > > > +	pte_t *pte = lookup_address(addr, &level);
> > > > +
> > > > +	if (!pte || level != PG_LEVEL_4K)
> > >
> > > Do we actually expect this to happen, or is this just a "robustness"
> > > check? If we don't expect this to happen, there should be a WARN_ON()
> > > around the condition.
> >
> > It's not obvious here, but we already have this covered with a WARN:
> > the core.c code has a KFENCE_WARN_ON, which disables KFENCE on a
> > warning.
>
> So for this specific branch: Can it ever happen? If not, please either
> remove it or add WARN_ON(). That serves two functions: It ensures that
> if something unexpected happens, we see a warning, and it hints to
> people reading the code "this isn't actually expected to happen, you
> don't have to wrack your brain trying to figure out for which scenario
> this branch is intended".

Perhaps I could have been clearer: we already have this returning false
covered by a WARN+disable KFENCE in core.c. We'll add another WARN_ON
right here, as it doesn't hurt, and hopefully improves readability.

> > > > +	return false;
> > > > +
> > > > +	if (protect)
> > > > +		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> > > > +	else
> > > > +		set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
> > >
> > > Hmm... do we have this helper (instead of using the existing helpers
> > > for modifying memory permissions) to work around the allocation out of
> > > the data section?
> >
> > I just played around with using the set_memory.c functions, to remind
> > myself why this didn't work. I experimented with using
> > set_memory_{np,p}() functions; set_memory_p() isn't implemented, but
> > is easily added (which I did for below experiment). However, this
> > didn't quite work:
> [...]
> > For one, smp_call_function_many_cond() doesn't want to be called with
> > interrupts disabled, and we may very well get a KFENCE allocation or
> > page fault with interrupts disabled / within interrupts.
> >
> > Therefore, to be safe, we should avoid IPIs.
>
> set_direct_map_invalid_noflush() does that, too, I think? And that's
> already implemented for both arm64 and x86.

Sure, that works.

We still want the flush_tlb_one_kernel(), at least so the local CPU's
TLB is flushed.

> > It follows that setting
> > the page attribute is best-effort, and we can tolerate some
> > inaccuracy. Lazy fault handling should take care of faults after we
> > set the page as PRESENT.
> [...]
> > > Shouldn't kfence_handle_page_fault() happen after prefetch handling,
> > > at least? Maybe directly above the "oops" label?
> >
> > Good question. AFAIK it doesn't matter, as is_kfence_address() should
> > never apply for any of those that follow, right? In any case, it
> > shouldn't hurt to move it down.
>
> is_prefetch() ignores any #PF not caused by instruction fetch if it
> comes from kernel mode and the faulting instruction is one of the
> PREFETCH* instructions. (Which is not supposed to happen - the
> processor should just be ignoring the fault for PREFETCH instead of
> generating an exception AFAIK. But the comments say that this is about
> CPU bugs and stuff.) While this is probably not a big deal anymore
> partly because the kernel doesn't use software prefetching in many
> places anymore, it seems to me like, in principle, this could also
> cause page faults that should be ignored in KFENCE regions if someone
> tries to do PREFETCH on an out-of-bounds array element or a dangling
> pointer or something.

Thanks for the clarification.
Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86
On Wed, Oct 7, 2020 at 3:09 PM Marco Elver wrote:
> On Fri, 2 Oct 2020 at 07:45, Jann Horn wrote:
> > On Tue, Sep 29, 2020 at 3:38 PM Marco Elver wrote:
> > > Add architecture specific implementation details for KFENCE and enable
> > > KFENCE for the x86 architecture. In particular, this implements the
> > > required interface in <asm/kfence.h> for setting up the pool and
> > > providing helper functions for protecting and unprotecting pages.
> > >
> > > For x86, we need to ensure that the pool uses 4K pages, which is done
> > > using the set_memory_4k() helper function.
> > [...]
> > > diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> > [...]
> > > +/* Protect the given page and flush TLBs. */
> > > +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> > > +{
> > > +	unsigned int level;
> > > +	pte_t *pte = lookup_address(addr, &level);
> > > +
> > > +	if (!pte || level != PG_LEVEL_4K)
> >
> > Do we actually expect this to happen, or is this just a "robustness"
> > check? If we don't expect this to happen, there should be a WARN_ON()
> > around the condition.
>
> It's not obvious here, but we already have this covered with a WARN:
> the core.c code has a KFENCE_WARN_ON, which disables KFENCE on a
> warning.

So for this specific branch: Can it ever happen? If not, please either
remove it or add WARN_ON(). That serves two functions: It ensures that
if something unexpected happens, we see a warning, and it hints to
people reading the code "this isn't actually expected to happen, you
don't have to wrack your brain trying to figure out for which scenario
this branch is intended".

> > > +	return false;
> > > +
> > > +	if (protect)
> > > +		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> > > +	else
> > > +		set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
> >
> > Hmm... do we have this helper (instead of using the existing helpers
> > for modifying memory permissions) to work around the allocation out of
> > the data section?
>
> I just played around with using the set_memory.c functions, to remind
> myself why this didn't work. I experimented with using
> set_memory_{np,p}() functions; set_memory_p() isn't implemented, but
> is easily added (which I did for below experiment). However, this
> didn't quite work:
[...]
> For one, smp_call_function_many_cond() doesn't want to be called with
> interrupts disabled, and we may very well get a KFENCE allocation or
> page fault with interrupts disabled / within interrupts.
>
> Therefore, to be safe, we should avoid IPIs.

set_direct_map_invalid_noflush() does that, too, I think? And that's
already implemented for both arm64 and x86.

> It follows that setting
> the page attribute is best-effort, and we can tolerate some
> inaccuracy. Lazy fault handling should take care of faults after we
> set the page as PRESENT.
[...]
> > Shouldn't kfence_handle_page_fault() happen after prefetch handling,
> > at least? Maybe directly above the "oops" label?
>
> Good question. AFAIK it doesn't matter, as is_kfence_address() should
> never apply for any of those that follow, right? In any case, it
> shouldn't hurt to move it down.

is_prefetch() ignores any #PF not caused by instruction fetch if it
comes from kernel mode and the faulting instruction is one of the
PREFETCH* instructions. (Which is not supposed to happen - the
processor should just be ignoring the fault for PREFETCH instead of
generating an exception AFAIK. But the comments say that this is about
CPU bugs and stuff.) While this is probably not a big deal anymore
partly because the kernel doesn't use software prefetching in many
places anymore, it seems to me like, in principle, this could also
cause page faults that should be ignored in KFENCE regions if someone
tries to do PREFETCH on an out-of-bounds array element or a dangling
pointer or something.
Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86
On Fri, 2 Oct 2020 at 07:45, Jann Horn wrote:
> On Tue, Sep 29, 2020 at 3:38 PM Marco Elver wrote:
> > Add architecture specific implementation details for KFENCE and enable
> > KFENCE for the x86 architecture. In particular, this implements the
> > required interface in <asm/kfence.h> for setting up the pool and
> > providing helper functions for protecting and unprotecting pages.
> >
> > For x86, we need to ensure that the pool uses 4K pages, which is done
> > using the set_memory_4k() helper function.
> [...]
> > diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> [...]
> > +/* Protect the given page and flush TLBs. */
> > +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> > +{
> > +	unsigned int level;
> > +	pte_t *pte = lookup_address(addr, &level);
> > +
> > +	if (!pte || level != PG_LEVEL_4K)
>
> Do we actually expect this to happen, or is this just a "robustness"
> check? If we don't expect this to happen, there should be a WARN_ON()
> around the condition.

It's not obvious here, but we already have this covered with a WARN:
the core.c code has a KFENCE_WARN_ON, which disables KFENCE on a
warning.

> > +	return false;
> > +
> > +	if (protect)
> > +		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> > +	else
> > +		set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
>
> Hmm... do we have this helper (instead of using the existing helpers
> for modifying memory permissions) to work around the allocation out of
> the data section?

I just played around with using the set_memory.c functions, to remind
myself why this didn't work. I experimented with using
set_memory_{np,p}() functions; set_memory_p() isn't implemented, but
is easily added (which I did for below experiment). However, this
didn't quite work:

	WARNING: CPU: 6 PID: 107 at kernel/smp.c:490 smp_call_function_many_cond+0x9c/0x2a0 kernel/smp.c:490
	[...]
	Call Trace:
	 smp_call_function_many kernel/smp.c:577 [inline]
	 smp_call_function kernel/smp.c:599 [inline]
	 on_each_cpu+0x3e/0x90 kernel/smp.c:698
	 __purge_vmap_area_lazy+0x58/0x670 mm/vmalloc.c:1352
	 _vm_unmap_aliases.part.0+0x10b/0x140 mm/vmalloc.c:1770
	 change_page_attr_set_clr+0xb4/0x1c0 arch/x86/mm/pat/set_memory.c:1732
	 change_page_attr_set arch/x86/mm/pat/set_memory.c:1782 [inline]
	 set_memory_p+0x21/0x30 arch/x86/mm/pat/set_memory.c:1950
	 kfence_protect_page arch/x86/include/asm/kfence.h:55 [inline]
	 kfence_protect_page arch/x86/include/asm/kfence.h:43 [inline]
	 kfence_unprotect+0x42/0x70 mm/kfence/core.c:139
	 no_context+0x115/0x300 arch/x86/mm/fault.c:705
	 handle_page_fault arch/x86/mm/fault.c:1431 [inline]
	 exc_page_fault+0xa7/0x170 arch/x86/mm/fault.c:1486
	 asm_exc_page_fault+0x1e/0x30 arch/x86/include/asm/idtentry.h:538

For one, smp_call_function_many_cond() doesn't want to be called with
interrupts disabled, and we may very well get a KFENCE allocation or
page fault with interrupts disabled / within interrupts.

Therefore, to be safe, we should avoid IPIs. It follows that setting
the page attribute is best-effort, and we can tolerate some
inaccuracy. Lazy fault handling should take care of faults after we
set the page as PRESENT.

Which hopefully also answers your other comment:

> flush_tlb_one_kernel() -> flush_tlb_one_user() ->
> __flush_tlb_one_user() -> native_flush_tlb_one_user() only flushes on
> the local CPU core, not on others. If you want to leave it this way, I
> think this needs a comment explaining why we're not doing a global
> flush (locking context / performance overhead / ... ?).

We'll add a comment to clarify why it's done this way.

> > +	flush_tlb_one_kernel(addr);
> > +	return true;
> > +}
> > +
> > +#endif /* _ASM_X86_KFENCE_H */
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> [...]
> > @@ -701,6 +702,9 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> >  	}
> >  #endif
> >
> > +	if (kfence_handle_page_fault(address))
> > +		return;
> > +
> >  	/*
> >  	 * 32-bit:
> >  	 *
>
> The standard 5 lines of diff context don't really make it obvious
> what's going on here. Here's a diff with more context:
>
> 	/*
> 	 * Stack overflow? During boot, we can fault near the initial
> 	 * stack in the direct map, but that's not an overflow -- check
> 	 * that we're in vmalloc space to avoid this.
> 	 */
> 	if (is_vmalloc_addr((void *)address) &&
> 	    (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
> 	     address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
> 		unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);
> 		/*
> 		 * We're likely to be running with very little stack space
> 		 * left. It's plausible that we'd hit this condition but
> 		 * double-fault even before we get this far, in which case
Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86
On Tue, Sep 29, 2020 at 3:38 PM Marco Elver wrote:
> Add architecture specific implementation details for KFENCE and enable
> KFENCE for the x86 architecture. In particular, this implements the
> required interface in <asm/kfence.h> for setting up the pool and
> providing helper functions for protecting and unprotecting pages.
[...]
> diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
[...]
> +/* Protect the given page and flush TLBs. */
> +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> +{
[...]
> +	flush_tlb_one_kernel(addr);

flush_tlb_one_kernel() -> flush_tlb_one_user() -> __flush_tlb_one_user()
-> native_flush_tlb_one_user() only flushes on the local CPU core, not
on others. If you want to leave it this way, I think this needs a
comment explaining why we're not doing a global flush (locking context /
performance overhead / ... ?).
Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86
On Tue, Sep 29, 2020 at 3:38 PM Marco Elver wrote:
> Add architecture specific implementation details for KFENCE and enable
> KFENCE for the x86 architecture. In particular, this implements the
> required interface in <asm/kfence.h> for setting up the pool and
> providing helper functions for protecting and unprotecting pages.
>
> For x86, we need to ensure that the pool uses 4K pages, which is done
> using the set_memory_4k() helper function.
[...]
> diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
[...]
> +/* Protect the given page and flush TLBs. */
> +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> +{
> +	unsigned int level;
> +	pte_t *pte = lookup_address(addr, &level);
> +
> +	if (!pte || level != PG_LEVEL_4K)

Do we actually expect this to happen, or is this just a "robustness"
check? If we don't expect this to happen, there should be a WARN_ON()
around the condition.

> +	return false;
> +
> +	if (protect)
> +		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> +	else
> +		set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));

Hmm... do we have this helper (instead of using the existing helpers
for modifying memory permissions) to work around the allocation out of
the data section?

> +	flush_tlb_one_kernel(addr);
> +	return true;
> +}
> +
> +#endif /* _ASM_X86_KFENCE_H */
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
[...]
> @@ -701,6 +702,9 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>  	}
>  #endif
>
> +	if (kfence_handle_page_fault(address))
> +		return;
> +
>  	/*
>  	 * 32-bit:
>  	 *

The standard 5 lines of diff context don't really make it obvious
what's going on here. Here's a diff with more context:

	/*
	 * Stack overflow? During boot, we can fault near the initial
	 * stack in the direct map, but that's not an overflow -- check
	 * that we're in vmalloc space to avoid this.
	 */
	if (is_vmalloc_addr((void *)address) &&
	    (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
	     address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
		unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);
		/*
		 * We're likely to be running with very little stack space
		 * left. It's plausible that we'd hit this condition but
		 * double-fault even before we get this far, in which case
		 * we're fine: the double-fault handler will deal with it.
		 *
		 * We don't want to make it all the way into the oops code
		 * and then double-fault, though, because we're likely to
		 * break the console driver and lose most of the stack dump.
		 */
		asm volatile ("movq %[stack], %%rsp\n\t"
			      "call handle_stack_overflow\n\t"
			      "1: jmp 1b"
			      : ASM_CALL_CONSTRAINT
			      : "D" ("kernel stack overflow (page fault)"),
				"S" (regs), "d" (address),
				[stack] "rm" (stack));
		unreachable();
	}
#endif

+	if (kfence_handle_page_fault(address))
+		return;
+
	/*
	 * 32-bit:
	 *
	 * Valid to do another page fault here, because if this fault
	 * had been triggered by is_prefetch fixup_exception would have
	 * handled it.
	 *
	 * 64-bit:
	 *
	 * Hall of shame of CPU/BIOS bugs.
	 */
	if (is_prefetch(regs, error_code, address))
		return;

	if (is_errata93(regs, address))
		return;

	/*
	 * Buggy firmware could access regions which might page fault, try to
	 * recover from such faults.
	 */
	if (IS_ENABLED(CONFIG_EFI))
		efi_recover_from_page_fault(address);

oops:
	/*
	 * Oops. The kernel tried to access some bad page. We'll have to
	 * terminate things with extreme prejudice:
	 */
	flags = oops_begin();

Shouldn't kfence_handle_page_fault() happen after prefetch handling, at
least? Maybe directly above the "oops" label?