Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86

2020-10-09 Thread Marco Elver
On Wed, Oct 07, 2020 at 04:41PM +0200, Marco Elver wrote:
> On Wed, 7 Oct 2020 at 16:15, Jann Horn  wrote:
[...]
> > > > > +   return false;
> > > > > +
> > > > > +   if (protect)
> > > > > +   set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> > > > > +   else
> > > > > +   set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
> > > >
> > > > Hmm... do we have this helper (instead of using the existing helpers
> > > > for modifying memory permissions) to work around the allocation out of
> > > > the data section?
> > >
> > > I just played around with using the set_memory.c functions, to remind
> > > myself why this didn't work. I experimented with using
> > > set_memory_{np,p}() functions; set_memory_p() isn't implemented, but
> > > is easily added (which I did for below experiment). However, this
> > > didn't quite work:
> > [...]
> > > For one, smp_call_function_many_cond() doesn't want to be called with
> > > interrupts disabled, and we may very well get a KFENCE allocation or
> > > page fault with interrupts disabled / within interrupts.
> > >
> > > Therefore, to be safe, we should avoid IPIs.
> >
> > set_direct_map_invalid_noflush() does that, too, I think? And that's
> > already implemented for both arm64 and x86.
> 
> Sure, that works.
> 
> We still want the flush_tlb_one_kernel(), at least so the local CPU's
> TLB is flushed.

Nope, sorry, set_direct_map_invalid_noflush() does not work -- this
results in a potential deadlock.


WARNING: inconsistent lock state
5.9.0-rc4+ #2 Not tainted

inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
ksoftirqd/1/16 [HC0[0]:SC1[1]:HE1:SE0] takes:
89fcf9b8 (cpa_lock){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
89fcf9b8 (cpa_lock){+.?.}-{2:2}, at: __change_page_attr_set_clr+0x1b0/0x2510 arch/x86/mm/pat/set_memory.c:1658
{SOFTIRQ-ON-W} state was registered at:
  lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
  __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
  _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
  spin_lock include/linux/spinlock.h:354 [inline]
  __change_page_attr_set_clr+0x1b0/0x2510 arch/x86/mm/pat/set_memory.c:1658
  change_page_attr_set_clr+0x333/0x500 arch/x86/mm/pat/set_memory.c:1752
  change_page_attr_set arch/x86/mm/pat/set_memory.c:1782 [inline]
  set_memory_nx+0xb2/0x110 arch/x86/mm/pat/set_memory.c:1930
  free_init_pages+0x73/0xc0 arch/x86/mm/init.c:876
  alternative_instructions+0x155/0x1a4 arch/x86/kernel/alternative.c:738
  check_bugs+0x1bd0/0x1c77 arch/x86/kernel/cpu/bugs.c:140
  start_kernel+0x486/0x4b6 init/main.c:1042
  secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
irq event stamp: 14564
hardirqs last  enabled at (14564): [] __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160 [inline]
hardirqs last  enabled at (14564): [] _raw_spin_unlock_irqrestore+0x6f/0x90 kernel/locking/spinlock.c:191
hardirqs last disabled at (14563): [] __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
hardirqs last disabled at (14563): [] _raw_spin_lock_irqsave+0xa9/0xce kernel/locking/spinlock.c:159
softirqs last  enabled at (14486): [] run_ksoftirqd kernel/softirq.c:652 [inline]
softirqs last  enabled at (14486): [] run_ksoftirqd+0xcf/0x170 kernel/softirq.c:644
softirqs last disabled at (14491): [] run_ksoftirqd kernel/softirq.c:652 [inline]
softirqs last disabled at (14491): [] run_ksoftirqd+0xcf/0x170 kernel/softirq.c:644

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(cpa_lock);
  <Interrupt>
    lock(cpa_lock);

 *** DEADLOCK ***

1 lock held by ksoftirqd/1/16:
 #0: 8a067e20 (rcu_callback){}-{0:0}, at: rcu_do_batch kernel/rcu/tree.c:2418 [inline]
 #0: 8a067e20 (rcu_callback){}-{0:0}, at: rcu_core+0x55d/0x1130 kernel/rcu/tree.c:2656

stack backtrace:
CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.9.0-rc4+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x198/0x1fd lib/dump_stack.c:118
 print_usage_bug kernel/locking/lockdep.c:3350 [inline]
 valid_state kernel/locking/lockdep.c:3361 [inline]
 mark_lock_irq kernel/locking/lockdep.c:3575 [inline]
 mark_lock.cold+0x12/0x17 kernel/locking/lockdep.c:4006
 mark_usage kernel/locking/lockdep.c:3905 [inline]
 __lock_acquire+0x1159/0x5780 kernel/locking/lockdep.c:4380
 

Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86

2020-10-07 Thread Marco Elver
On Wed, 7 Oct 2020 at 16:15, Jann Horn  wrote:
>
> On Wed, Oct 7, 2020 at 3:09 PM Marco Elver  wrote:
> > On Fri, 2 Oct 2020 at 07:45, Jann Horn  wrote:
> > > On Tue, Sep 29, 2020 at 3:38 PM Marco Elver  wrote:
> > > > Add architecture specific implementation details for KFENCE and enable
> > > > KFENCE for the x86 architecture. In particular, this implements the
> > > > required interface in <asm/kfence.h> for setting up the pool and
> > > > providing helper functions for protecting and unprotecting pages.
> > > >
> > > > For x86, we need to ensure that the pool uses 4K pages, which is done
> > > > using the set_memory_4k() helper function.
> > > [...]
> > > > diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> > > [...]
> > > > +/* Protect the given page and flush TLBs. */
> > > > +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> > > > +{
> > > > +   unsigned int level;
> > > > +   pte_t *pte = lookup_address(addr, &level);
> > > > +
> > > > +   if (!pte || level != PG_LEVEL_4K)
> > >
> > > Do we actually expect this to happen, or is this just a "robustness"
> > > check? If we don't expect this to happen, there should be a WARN_ON()
> > > around the condition.
> >
> > It's not obvious here, but we already have this covered with a WARN:
> > the core.c code has a KFENCE_WARN_ON, which disables KFENCE on a
> > warning.
>
> So for this specific branch: Can it ever happen? If not, please either
> remove it or add WARN_ON(). That serves two functions: It ensures that
> if something unexpected happens, we see a warning, and it hints to
> people reading the code "this isn't actually expected to happen, you
> don't have to wrack your brain trying to figure out for which scenario
> this branch is intended".

Perhaps I could have been clearer: the case where this returns false
is already covered in core.c by a WARN that also disables KFENCE.

We'll add another WARN_ON right here, since it doesn't hurt and
hopefully improves readability.
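
Concretely, the check might then read something like this (a sketch,
not necessarily the exact hunk that lands in v5):

	unsigned int level;
	pte_t *pte = lookup_address(addr, &level);

	/*
	 * The pool is set up with set_memory_4k(), so any other mapping
	 * level here is unexpected.
	 */
	if (WARN_ON(!pte || level != PG_LEVEL_4K))
		return false;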

> > > > +   return false;
> > > > +
> > > > +   if (protect)
> > > > +   set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> > > > +   else
> > > > +   set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
> > >
> > > Hmm... do we have this helper (instead of using the existing helpers
> > > for modifying memory permissions) to work around the allocation out of
> > > the data section?
> >
> > I just played around with using the set_memory.c functions, to remind
> > myself why this didn't work. I experimented with using
> > set_memory_{np,p}() functions; set_memory_p() isn't implemented, but
> > is easily added (which I did for below experiment). However, this
> > didn't quite work:
> [...]
> > For one, smp_call_function_many_cond() doesn't want to be called with
> > interrupts disabled, and we may very well get a KFENCE allocation or
> > page fault with interrupts disabled / within interrupts.
> >
> > Therefore, to be safe, we should avoid IPIs.
>
> set_direct_map_invalid_noflush() does that, too, I think? And that's
> already implemented for both arm64 and x86.

Sure, that works.

We still want the flush_tlb_one_kernel(), at least so the local CPU's
TLB is flushed.
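
For reference, a rough sketch of what this alternative could look like
(assuming <linux/set_memory.h> and <asm/tlbflush.h>; note the
2020-10-09 reply at the top of this thread later found this route
deadlocks on cpa_lock from softirq context, so it is not what will be
used):

static inline bool kfence_protect_page(unsigned long addr, bool protect)
{
	int ret;

	/* Route the change through the direct-map helpers, no IPIs. */
	if (protect)
		ret = set_direct_map_invalid_noflush(virt_to_page(addr));
	else
		ret = set_direct_map_default_noflush(virt_to_page(addr));

	/* Still flush the local CPU's TLB so the change is visible here. */
	flush_tlb_one_kernel(addr);

	return !ret;
}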

> > It follows that setting
> > the page attribute is best-effort, and we can tolerate some
> > inaccuracy. Lazy fault handling should take care of faults after we
> > set the page as PRESENT.
> [...]
> > > Shouldn't kfence_handle_page_fault() happen after prefetch handling,
> > > at least? Maybe directly above the "oops" label?
> >
> > Good question. AFAIK it doesn't matter, as is_kfence_address() should
> > never apply for any of those that follow, right? In any case, it
> > shouldn't hurt to move it down.
>
> is_prefetch() ignores any #PF not caused by instruction fetch if it
> comes from kernel mode and the faulting instruction is one of the
> PREFETCH* instructions. (Which is not supposed to happen - the
> processor should just be ignoring the fault for PREFETCH instead of
> generating an exception AFAIK. But the comments say that this is about
> CPU bugs and stuff.) While this is probably not a big deal anymore
> partly because the kernel doesn't use software prefetching in many
> places anymore, it seems to me like, in principle, this could also
> cause page faults that should be ignored in KFENCE regions if someone
> tries to do PREFETCH on an out-of-bounds array element or a dangling
> pointer or something.

Thanks for the clarification.


Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86

2020-10-07 Thread Jann Horn
On Wed, Oct 7, 2020 at 3:09 PM Marco Elver  wrote:
> On Fri, 2 Oct 2020 at 07:45, Jann Horn  wrote:
> > On Tue, Sep 29, 2020 at 3:38 PM Marco Elver  wrote:
> > > Add architecture specific implementation details for KFENCE and enable
> > > KFENCE for the x86 architecture. In particular, this implements the
> > > required interface in <asm/kfence.h> for setting up the pool and
> > > providing helper functions for protecting and unprotecting pages.
> > >
> > > For x86, we need to ensure that the pool uses 4K pages, which is done
> > > using the set_memory_4k() helper function.
> > [...]
> > > diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> > [...]
> > > +/* Protect the given page and flush TLBs. */
> > > +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> > > +{
> > > +   unsigned int level;
> > > +   pte_t *pte = lookup_address(addr, &level);
> > > +
> > > +   if (!pte || level != PG_LEVEL_4K)
> >
> > Do we actually expect this to happen, or is this just a "robustness"
> > check? If we don't expect this to happen, there should be a WARN_ON()
> > around the condition.
>
> It's not obvious here, but we already have this covered with a WARN:
> the core.c code has a KFENCE_WARN_ON, which disables KFENCE on a
> warning.

So for this specific branch: Can it ever happen? If not, please either
remove it or add WARN_ON(). That serves two functions: It ensures that
if something unexpected happens, we see a warning, and it hints to
people reading the code "this isn't actually expected to happen, you
don't have to wrack your brain trying to figure out for which scenario
this branch is intended".

> > > +   return false;
> > > +
> > > +   if (protect)
> > > +   set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> > > +   else
> > > +   set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
> >
> > Hmm... do we have this helper (instead of using the existing helpers
> > for modifying memory permissions) to work around the allocation out of
> > the data section?
>
> I just played around with using the set_memory.c functions, to remind
> myself why this didn't work. I experimented with using
> set_memory_{np,p}() functions; set_memory_p() isn't implemented, but
> is easily added (which I did for below experiment). However, this
> didn't quite work:
[...]
> For one, smp_call_function_many_cond() doesn't want to be called with
> interrupts disabled, and we may very well get a KFENCE allocation or
> page fault with interrupts disabled / within interrupts.
>
> Therefore, to be safe, we should avoid IPIs.

set_direct_map_invalid_noflush() does that, too, I think? And that's
already implemented for both arm64 and x86.

> It follows that setting
> the page attribute is best-effort, and we can tolerate some
> inaccuracy. Lazy fault handling should take care of faults after we
> set the page as PRESENT.
[...]
> > Shouldn't kfence_handle_page_fault() happen after prefetch handling,
> > at least? Maybe directly above the "oops" label?
>
> Good question. AFAIK it doesn't matter, as is_kfence_address() should
> never apply for any of those that follow, right? In any case, it
> shouldn't hurt to move it down.

is_prefetch() ignores any #PF not caused by instruction fetch if it
comes from kernel mode and the faulting instruction is one of the
PREFETCH* instructions. (Which is not supposed to happen - the
processor should just be ignoring the fault for PREFETCH instead of
generating an exception AFAIK. But the comments say that this is about
CPU bugs and stuff.) While this is probably not a big deal anymore
partly because the kernel doesn't use software prefetching in many
places anymore, it seems to me like, in principle, this could also
cause page faults that should be ignored in KFENCE regions if someone
tries to do PREFETCH on an out-of-bounds array element or a dangling
pointer or something.
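
To make that concrete, a purely illustrative snippet (hypothetical;
whether the freed object actually comes from the KFENCE pool depends
on KFENCE's sampling):

#include <linux/prefetch.h>
#include <linux/slab.h>

static void prefetch_dangling_pointer(void)
{
	char *obj = kmalloc(32, GFP_KERNEL);

	kfree(obj);
	/*
	 * Most CPUs silently drop a prefetch of an unmapped address, but
	 * the CPU-bug cases that is_prefetch() exists for can raise #PF.
	 * If obj came from the KFENCE pool, that fault would hit a KFENCE
	 * region even though it should simply be ignored -- hence the
	 * suggestion to run kfence_handle_page_fault() after is_prefetch().
	 */
	prefetch(obj);
}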


Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86

2020-10-07 Thread Marco Elver
On Fri, 2 Oct 2020 at 07:45, Jann Horn  wrote:
>
> On Tue, Sep 29, 2020 at 3:38 PM Marco Elver  wrote:
> > Add architecture specific implementation details for KFENCE and enable
> > KFENCE for the x86 architecture. In particular, this implements the
> > required interface in <asm/kfence.h> for setting up the pool and
> > providing helper functions for protecting and unprotecting pages.
> >
> > For x86, we need to ensure that the pool uses 4K pages, which is done
> > using the set_memory_4k() helper function.
> [...]
> > diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> [...]
> > +/* Protect the given page and flush TLBs. */
> > +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> > +{
> > +   unsigned int level;
> > +   pte_t *pte = lookup_address(addr, &level);
> > +
> > +   if (!pte || level != PG_LEVEL_4K)
>
> Do we actually expect this to happen, or is this just a "robustness"
> check? If we don't expect this to happen, there should be a WARN_ON()
> around the condition.

It's not obvious here, but we already have this covered with a WARN:
the core.c code has a KFENCE_WARN_ON, which disables KFENCE on a
warning.
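
For readers without the series at hand, a rough sketch of the kind of
helper meant here (the real macro in mm/kfence/core.c may differ; the
kfence_enabled flag is assumed):

#define KFENCE_WARN_ON(cond)                                   \
	({                                                     \
		const bool __cond = WARN_ON(cond);             \
		/* On a warning, disable KFENCE entirely. */   \
		if (unlikely(__cond))                          \
			WRITE_ONCE(kfence_enabled, false);     \
		__cond;                                        \
	})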

> > +   return false;
> > +
> > +   if (protect)
> > +   set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> > +   else
> > +   set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
>
> Hmm... do we have this helper (instead of using the existing helpers
> for modifying memory permissions) to work around the allocation out of
> the data section?

I just played around with using the set_memory.c functions, to remind
myself why this didn't work. I experimented with using
set_memory_{np,p}() functions; set_memory_p() isn't implemented, but
is easily added (which I did for below experiment). However, this
didn't quite work:

WARNING: CPU: 6 PID: 107 at kernel/smp.c:490 smp_call_function_many_cond+0x9c/0x2a0 kernel/smp.c:490
[...]
Call Trace:
 smp_call_function_many kernel/smp.c:577 [inline]
 smp_call_function kernel/smp.c:599 [inline]
 on_each_cpu+0x3e/0x90 kernel/smp.c:698
 __purge_vmap_area_lazy+0x58/0x670 mm/vmalloc.c:1352
 _vm_unmap_aliases.part.0+0x10b/0x140 mm/vmalloc.c:1770
 change_page_attr_set_clr+0xb4/0x1c0 arch/x86/mm/pat/set_memory.c:1732
 change_page_attr_set arch/x86/mm/pat/set_memory.c:1782 [inline]
 set_memory_p+0x21/0x30 arch/x86/mm/pat/set_memory.c:1950
 kfence_protect_page arch/x86/include/asm/kfence.h:55 [inline]
 kfence_protect_page arch/x86/include/asm/kfence.h:43 [inline]
 kfence_unprotect+0x42/0x70 mm/kfence/core.c:139
 no_context+0x115/0x300 arch/x86/mm/fault.c:705
 handle_page_fault arch/x86/mm/fault.c:1431 [inline]
 exc_page_fault+0xa7/0x170 arch/x86/mm/fault.c:1486
 asm_exc_page_fault+0x1e/0x30 arch/x86/include/asm/idtentry.h:538

For one, smp_call_function_many_cond() doesn't want to be called with
interrupts disabled, and we may very well get a KFENCE allocation or
page fault with interrupts disabled / within interrupts.

Therefore, to be safe, we should avoid IPIs. It follows that setting
the page attribute is best-effort, and we can tolerate some
inaccuracy. Lazy fault handling should take care of faults after we
set the page as PRESENT.

Which hopefully also answers your other comment:

> flush_tlb_one_kernel() -> flush_tlb_one_user() ->
> __flush_tlb_one_user() -> native_flush_tlb_one_user() only flushes on
> the local CPU core, not on others. If you want to leave it this way, I
> think this needs a comment explaining why we're not doing a global
> flush (locking context / performance overhead / ... ?).

We'll add a comment to clarify why it's done this way.
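
Something along these lines (the wording is a sketch, not the final
comment):

	/*
	 * Flush this CPU's TLB only. A global flush would require an IPI,
	 * which we must avoid since kfence_protect_page() may be called
	 * with interrupts disabled and from the page fault handler. Stale
	 * TLB entries on other CPUs are tolerated: the protection change
	 * is best-effort, and lazy fault handling deals with any fault
	 * taken after the page has been marked present again.
	 */
	flush_tlb_one_kernel(addr);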

> > +   flush_tlb_one_kernel(addr);
> > +   return true;
> > +}
> > +
> > +#endif /* _ASM_X86_KFENCE_H */
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> [...]
> > @@ -701,6 +702,9 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> > }
> >  #endif
> >
> > +   if (kfence_handle_page_fault(address))
> > +   return;
> > +
> > /*
> >  * 32-bit:
> >  *
>
> The standard 5 lines of diff context don't really make it obvious
> what's going on here. Here's a diff with more context:
>
>
> /*
>  * Stack overflow?  During boot, we can fault near the initial
>  * stack in the direct map, but that's not an overflow -- check
>  * that we're in vmalloc space to avoid this.
>  */
> if (is_vmalloc_addr((void *)address) &&
> (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
>  address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
> unsigned long stack = __this_cpu_ist_top_va(DF) -
> sizeof(void *);
> /*
>  * We're likely to be running with very little stack space
>  * left.  It's plausible that we'd hit this condition but
>  * double-fault even before we get this far, in which case
> 

Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86

2020-10-02 Thread Jann Horn
On Tue, Sep 29, 2020 at 3:38 PM Marco Elver  wrote:
> Add architecture specific implementation details for KFENCE and enable
> KFENCE for the x86 architecture. In particular, this implements the
> required interface in <asm/kfence.h> for setting up the pool and
> providing helper functions for protecting and unprotecting pages.
[...]
> diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
[...]
> +/* Protect the given page and flush TLBs. */
> +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> +{
[...]
> +   flush_tlb_one_kernel(addr);

flush_tlb_one_kernel() -> flush_tlb_one_user() ->
__flush_tlb_one_user() -> native_flush_tlb_one_user() only flushes on
the local CPU core, not on others. If you want to leave it this way, I
think this needs a comment explaining why we're not doing a global
flush (locking context / performance overhead / ... ?).


Re: [PATCH v4 02/11] x86, kfence: enable KFENCE for x86

2020-10-01 Thread Jann Horn
On Tue, Sep 29, 2020 at 3:38 PM Marco Elver  wrote:
> Add architecture specific implementation details for KFENCE and enable
> KFENCE for the x86 architecture. In particular, this implements the
> required interface in <asm/kfence.h> for setting up the pool and
> providing helper functions for protecting and unprotecting pages.
>
> For x86, we need to ensure that the pool uses 4K pages, which is done
> using the set_memory_4k() helper function.
[...]
> diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
[...]
> +/* Protect the given page and flush TLBs. */
> +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> +{
> +   unsigned int level;
> +   pte_t *pte = lookup_address(addr, &level);
> +
> +   if (!pte || level != PG_LEVEL_4K)

Do we actually expect this to happen, or is this just a "robustness"
check? If we don't expect this to happen, there should be a WARN_ON()
around the condition.

> +   return false;
> +
> +   if (protect)
> +   set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> +   else
> +   set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));

Hmm... do we have this helper (instead of using the existing helpers
for modifying memory permissions) to work around the allocation out of
the data section?

> +   flush_tlb_one_kernel(addr);
> +   return true;
> +}
> +
> +#endif /* _ASM_X86_KFENCE_H */
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
[...]
> @@ -701,6 +702,9 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> }
>  #endif
>
> +   if (kfence_handle_page_fault(address))
> +   return;
> +
> /*
>  * 32-bit:
>  *

The standard 5 lines of diff context don't really make it obvious
what's going on here. Here's a diff with more context:


/*
 * Stack overflow?  During boot, we can fault near the initial
 * stack in the direct map, but that's not an overflow -- check
 * that we're in vmalloc space to avoid this.
 */
if (is_vmalloc_addr((void *)address) &&
(((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
 address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
unsigned long stack = __this_cpu_ist_top_va(DF) -
sizeof(void *);
/*
 * We're likely to be running with very little stack space
 * left.  It's plausible that we'd hit this condition but
 * double-fault even before we get this far, in which case
 * we're fine: the double-fault handler will deal with it.
 *
 * We don't want to make it all the way into the oops code
 * and then double-fault, though, because we're likely to
 * break the console driver and lose most of the stack dump.
 */
asm volatile ("movq %[stack], %%rsp\n\t"
  "call handle_stack_overflow\n\t"
  "1: jmp 1b"
  : ASM_CALL_CONSTRAINT
  : "D" ("kernel stack overflow (page fault)"),
"S" (regs), "d" (address),
[stack] "rm" (stack));
unreachable();
}
 #endif

+   if (kfence_handle_page_fault(address))
+   return;
+
/*
 * 32-bit:
 *
 *   Valid to do another page fault here, because if this fault
 *   had been triggered by is_prefetch fixup_exception would have
 *   handled it.
 *
 * 64-bit:
 *
 *   Hall of shame of CPU/BIOS bugs.
 */
if (is_prefetch(regs, error_code, address))
return;

if (is_errata93(regs, address))
return;

/*
 * Buggy firmware could access regions which might page fault, try to
 * recover from such faults.
 */
if (IS_ENABLED(CONFIG_EFI))
efi_recover_from_page_fault(address);

 oops:
/*
 * Oops. The kernel tried to access some bad page. We'll have to
 * terminate things with extreme prejudice:
 */
flags = oops_begin();



Shouldn't kfence_handle_page_fault() happen after prefetch handling,
at least? Maybe directly above the "oops" label?
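
For reference, a sketch of the suggested placement directly above the
"oops" label (illustrative only; the earlier call above #endif would
be dropped, and the exact hunk in a later revision may differ):

	/*
	 * Buggy firmware could access regions which might page fault, try to
	 * recover from such faults.
	 */
	if (IS_ENABLED(CONFIG_EFI))
		efi_recover_from_page_fault(address);

+	/* Only tried once the recoverable cases above have been exhausted. */
+	if (kfence_handle_page_fault(address))
+		return;
+
 oops:
	/*
	 * Oops. The kernel tried to access some bad page. We'll have to
	 * terminate things with extreme prejudice:
	 */
	flags = oops_begin();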