Re: [PATCH 1/3] arm64/ptrace: don't clobber task registers on syscall entry/exit traps

2021-02-04 Thread Dave Martin
On Thu, Feb 04, 2021 at 03:23:34PM +, Will Deacon wrote:
> On Mon, Feb 01, 2021 at 11:40:10AM -0800, Andrei Vagin wrote:
> > ip/r12 for AArch32 and x7 for AArch64 is used to indicate whether or not
> > the stop has been signalled from syscall entry or syscall exit. This
> > means that:
> > 
> > - Any writes by the tracer to this register during the stop are
> >   ignored/discarded.
> > 
> > - The actual value of the register is not available during the stop,
> >   so the tracer cannot save it and restore it later.
> > 
> > Right now, these registers are clobbered in tracehook_report_syscall.
> > This change moves the logic to gpr_get and compat_gpr_get where
> > registers are copied into a user-space buffer.
> > 
> > This will allow to change these registers and to introduce a new
> > ptrace option to get the full set of registers.
> > 
> > Signed-off-by: Andrei Vagin 
> > ---
> >  arch/arm64/include/asm/ptrace.h |   5 ++
> >  arch/arm64/kernel/ptrace.c  | 104 
> >  2 files changed, 69 insertions(+), 40 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/ptrace.h 
> > b/arch/arm64/include/asm/ptrace.h
> > index e58bca832dff..0a9552b4f61e 100644
> > --- a/arch/arm64/include/asm/ptrace.h
> > +++ b/arch/arm64/include/asm/ptrace.h
> > @@ -170,6 +170,11 @@ static inline unsigned long pstate_to_compat_psr(const 
> > unsigned long pstate)
> > return psr;
> >  }
> >  
> > +enum ptrace_syscall_dir {
> > +   PTRACE_SYSCALL_ENTER = 0,
> > +   PTRACE_SYSCALL_EXIT,
> > +};
> > +
> >  /*
> >   * This struct defines the way the registers are stored on the stack 
> > during an
> >   * exception. Note that sizeof(struct pt_regs) has to be a multiple of 16 
> > (for
> > diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> > index 8ac487c84e37..39da03104528 100644
> > --- a/arch/arm64/kernel/ptrace.c
> > +++ b/arch/arm64/kernel/ptrace.c
> > @@ -40,6 +40,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #define CREATE_TRACE_POINTS
> >  #include 
> > @@ -561,7 +562,31 @@ static int gpr_get(struct task_struct *target,
> >struct membuf to)
> >  {
> > struct user_pt_regs *uregs = &task_pt_regs(target)->user_regs;
> > -   return membuf_write(&to, uregs, sizeof(*uregs));
> > +   unsigned long saved_reg;
> > +   int ret;
> > +
> > +   /*
> > +* We have some ABI weirdness here in the way that we handle syscall
> > +* exit stops because we indicate whether or not the stop has been
> > +* signalled from syscall entry or syscall exit by clobbering the 
> > general
> > +* purpose register x7.
> > +*/
> 
> When you move a comment, please don't truncate it!
> 
> > +   saved_reg = uregs->regs[7];
> > +
> > +   switch (target->ptrace_message) {
> > +   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
> > +   uregs->regs[7] = PTRACE_SYSCALL_ENTER;
> > +   break;
> > +   case PTRACE_EVENTMSG_SYSCALL_EXIT:
> > +   uregs->regs[7] = PTRACE_SYSCALL_EXIT;
> > +   break;
> > +   }
> 
> I'm wary of checking target->ptrace_message here, as I seem to recall the
> regset code also being used for coredumps. What guarantees we don't break
> things there?

For a coredump, is there any way to know whether a given thread was
inside a traced syscall when the coredump was generated?  If so, x7 in
the dump may already be unreliable and we only need to make best efforts to
get it "right".

Since triggering of the coredump and death of other threads all require
dequeueing of some signal, I think all threads must always be outside the
syscall-enter...syscall-exit path before any of the coredump runs anyway,
in which case the above should never matter...  Though someone else ought
to eyeball the coredump code before we agree on that.

ptrace_message doesn't seem absolutely the wrong thing to check, but
we'd need to be sure that it can't be stale (say, left over from some
previous trap).


Out of interest, where did this arm64 ptrace feature come from?  Was it
just pasted from 32-bit and thinly adapted?  It looks like an
arch-specific attempt to do what PTRACE_O_TRACESYSGOOD does, in which
case it may have been obsolete even before it was upstreamed.  I wonder
whether anyone is actually relying on it at all...  

Doesn't mean we can definitely fix it safely, but it's annoying.
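
For comparison: on kernels with PTRACE_GET_SYSCALL_INFO (v5.3 onwards) a
tracer can already learn the direction of a syscall stop without looking
at x7/r12 at all.  Rough userspace sketch (the interplay between
<sys/ptrace.h> and <linux/ptrace.h> varies between libcs; shown here for
brevity):

#include <stdio.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <linux/ptrace.h>	/* struct ptrace_syscall_info */

static void report_syscall_stop(pid_t pid)
{
	struct ptrace_syscall_info info;

	/* For this request, the "addr" argument carries the buffer size */
	if (ptrace(PTRACE_GET_SYSCALL_INFO, pid,
		   (void *)sizeof(info), &info) < 0) {
		perror("PTRACE_GET_SYSCALL_INFO");
		return;
	}

	if (info.op == PTRACE_SYSCALL_INFO_ENTRY)
		printf("syscall entry: nr=%llu\n",
		       (unsigned long long)info.entry.nr);
	else if (info.op == PTRACE_SYSCALL_INFO_EXIT)
		printf("syscall exit: rval=%lld\n",
		       (long long)info.exit.rval);
}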

[...]

Cheers
---Dave


Re: [PATCH v5 1/5] uapi: Move the aux vector AT_MINSIGSTKSZ define to uapi

2021-02-04 Thread Dave Martin
On Wed, Feb 03, 2021 at 09:22:38AM -0800, Chang S. Bae wrote:
> Move the AT_MINSIGSTKSZ definition to generic Linux from arm64. It is
> already used as generic ABI in glibc's generic elf.h, and this move will
> prevent future namespace conflicts. In particular, x86 will re-use this
> generic definition.
> 
> Signed-off-by: Chang S. Bae 
> Reviewed-by: Len Brown 
> Cc: Carlos O'Donell 
> Cc: Dave Martin 
> Cc: libc-al...@sourceware.org
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> ---
> Change from v4:
> * Added as a new patch (Carlos O'Donell)
> ---
>  arch/arm64/include/uapi/asm/auxvec.h | 1 -
>  include/uapi/linux/auxvec.h  | 1 +
>  2 files changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/uapi/asm/auxvec.h 
> b/arch/arm64/include/uapi/asm/auxvec.h
> index 743c0b84fd30..767d710c92aa 100644
> --- a/arch/arm64/include/uapi/asm/auxvec.h
> +++ b/arch/arm64/include/uapi/asm/auxvec.h
> @@ -19,7 +19,6 @@
>  
>  /* vDSO location */
>  #define AT_SYSINFO_EHDR  33
> -#define AT_MINSIGSTKSZ   51  /* stack needed for signal delivery */

Since this is UAPI, I'm wondering whether we should try to preserve this
definition for users of <asm/auxvec.h>.  (Indeed, it is not uncommon to
include <asm/*> headers in userspace hackery, since the <linux/*> headers
tend to interact badly with the libc headers.)

In C11 at least, duplicate #defines are not an error if the definitions
are the same.  I don't know about the history, but I suspect this was
true for older standards too.  So maybe we can just keep this definition
with a duplicate definition in the common header.
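
(As a tiny illustration of that point -- an identical redefinition of an
object-like macro is fine in ISO C, so carrying the same literal line in
both headers would be benign:

#define AT_MINSIGSTKSZ	51	/* e.g. from the asm uapi header */
#define AT_MINSIGSTKSZ	51	/* same tokens again: no diagnostic */

whereas two differing definitions would be a constraint violation.)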

Otherwise, we could have

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51
#endif

in include/uapi/linux/auxvec.h, and keep the arm64 header unchanged.
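
Either way, the consumer side is unaffected; a minimal userspace sketch
using glibc's getauxval(), with the fallback constant shown only for
illustration:

#include <signal.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ	51	/* value from the patch above */
#endif

/*
 * Size an alternate signal stack from the kernel-provided minimum,
 * falling back to the legacy MINSIGSTKSZ when the aux vector entry is
 * absent (getauxval() returns 0 for unknown types).
 */
static unsigned long sigstack_size(void)
{
	unsigned long min = getauxval(AT_MINSIGSTKSZ);

	return min ? min : MINSIGSTKSZ;
}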

>  
>  #define AT_VECTOR_SIZE_ARCH 2 /* entries in ARCH_DLINFO */
>  
> diff --git a/include/uapi/linux/auxvec.h b/include/uapi/linux/auxvec.h
> index abe5f2b6581b..cc4fa77bd2a7 100644
> --- a/include/uapi/linux/auxvec.h
> +++ b/include/uapi/linux/auxvec.h
> @@ -33,5 +33,6 @@
>  
>  #define AT_EXECFN  31/* filename of program */
>  
> +#define AT_MINSIGSTKSZ   51  /* stack needed for signal delivery  */
>  
>  #endif /* _UAPI_LINUX_AUXVEC_H */

Otherwise, this looks fine as a concept.

AFAICT, no other arch is already using the value 51.

If nobody else objects to the loss of the definition from arm64's
<asm/auxvec.h> then I guess I can put up with that -- but I will wait to
see if anyone gives a view first.

Cheers
---Dave


Re: [PATCH 1/3] arm64/ptrace: don't clobber task registers on syscall entry/exit traps

2021-01-27 Thread Dave Martin
On Tue, Jan 19, 2021 at 02:06:35PM -0800, Andrei Vagin wrote:
> ip/r12 for AArch32 and x7 for AArch64 is used to indicate whether or not
> the stop has been signalled from syscall entry or syscall exit. This
> means that:
> 
> - Any writes by the tracer to this register during the stop are
>   ignored/discarded.
> 
> - The actual value of the register is not available during the stop,
>   so the tracer cannot save it and restore it later.
> 
> Right now, these registers are clobbered in tracehook_report_syscall.
> This change moves this logic to gpr_get and compat_gpr_get where
> registers are copied into a user-space buffer.
> 
> This will allow to change these registers and to introduce a new
> NT_ARM_PRSTATUS command to get the full set of registers.
> 
> Signed-off-by: Andrei Vagin 
> ---
>  arch/arm64/include/asm/ptrace.h |   5 ++
>  arch/arm64/kernel/ptrace.c  | 104 +++-
>  2 files changed, 67 insertions(+), 42 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
> index e58bca832dff..0a9552b4f61e 100644
> --- a/arch/arm64/include/asm/ptrace.h
> +++ b/arch/arm64/include/asm/ptrace.h
> @@ -170,6 +170,11 @@ static inline unsigned long pstate_to_compat_psr(const 
> unsigned long pstate)
>   return psr;
>  }
>  
> +enum ptrace_syscall_dir {
> + PTRACE_SYSCALL_ENTER = 0,
> + PTRACE_SYSCALL_EXIT,
> +};
> +
>  /*
>   * This struct defines the way the registers are stored on the stack during 
> an
>   * exception. Note that sizeof(struct pt_regs) has to be a multiple of 16 
> (for
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index 8ac487c84e37..1863f080cb07 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -40,6 +40,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -561,7 +562,33 @@ static int gpr_get(struct task_struct *target,
>  struct membuf to)
>  {
>   struct user_pt_regs *uregs = &task_pt_regs(target)->user_regs;
> - return membuf_write(&to, uregs, sizeof(*uregs));
> + unsigned long saved_reg;
> + int ret;
> +
> + /*
> +  * We have some ABI weirdness here in the way that we handle syscall
> +  * exit stops because we indicate whether or not the stop has been
> +  * signalled from syscall entry or syscall exit by clobbering the 
> general
> +  * purpose register x7.
> +  */
> + switch (target->ptrace_message) {
> + case PTRACE_EVENTMSG_SYSCALL_ENTRY:
> + saved_reg = uregs->regs[7];
> + uregs->regs[7] = PTRACE_SYSCALL_ENTER;
> + break;
> + case PTRACE_EVENTMSG_SYSCALL_EXIT:
> + saved_reg = uregs->regs[7];
> + uregs->regs[7] = PTRACE_SYSCALL_EXIT;
> + break;
> + }
> +
> + ret = membuf_write(&to, uregs, sizeof(*uregs));
> +
> + if (target->ptrace_message == PTRACE_EVENTMSG_SYSCALL_ENTRY ||
> + target->ptrace_message == PTRACE_EVENTMSG_SYSCALL_EXIT)
> + uregs->regs[7] = saved_reg;

This might be a reasonable cleanup even if the extra regset isn't
introduced: it makes it clear that we're not changing the user registers
here, just the tracer's view of them.

I'm assuming it doesn't break tracing anywhere else.  I can't think of
anything it would break just now, but I haven't spent much time looking
into it.


Can you not just unconditionally back up and restore regs[7] here?  e.g.

saved_reg = uregs->regs[7];

switch (target->ptrace_message) {
case PTRACE_EVENTMSG_SYSCALL_ENTRY:
case PTRACE_EVENTMSG_SYSCALL_EXIT:
uregs->regs[7] = target->ptrace_message;
}

ret = membuf_write(...);

uregs->regs[7] = saved_reg;
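
Spelled out in full (and keeping the explicit PTRACE_SYSCALL_ENTER/EXIT
values rather than reporting ptrace_message directly), the helper might
end up looking something like this -- untested sketch, not the final
patch:

static int gpr_get(struct task_struct *target,
		   const struct user_regset *regset,
		   struct membuf to)
{
	struct user_pt_regs *uregs = &task_pt_regs(target)->user_regs;
	unsigned long saved_reg = uregs->regs[7];
	int ret;

	/* Report the syscall stop direction in x7, as the existing ABI does */
	switch (target->ptrace_message) {
	case PTRACE_EVENTMSG_SYSCALL_ENTRY:
		uregs->regs[7] = PTRACE_SYSCALL_ENTER;
		break;
	case PTRACE_EVENTMSG_SYSCALL_EXIT:
		uregs->regs[7] = PTRACE_SYSCALL_EXIT;
		break;
	}

	ret = membuf_write(&to, uregs, sizeof(*uregs));

	/* Unconditional restore: harmless if x7 was never touched */
	uregs->regs[7] = saved_reg;
	return ret;
}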


> +
> + return ret;
>  }
>  
>  static int gpr_set(struct task_struct *target, const struct user_regset 
> *regset,
> @@ -1221,10 +1248,40 @@ static int compat_gpr_get(struct task_struct *target,
> const struct user_regset *regset,
> struct membuf to)
>  {
> + compat_ulong_t r12;
> + bool overwrite_r12;
>   int i = 0;
>  
> - while (to.left)
> - membuf_store(&to, compat_get_user_reg(target, i++));
> + /*
> +  * We have some ABI weirdness here in the way that we handle syscall
> +  * exit stops because we indicate whether or not the stop has been
> +  * signalled from syscall entry or syscall exit by clobbering the
> +  * general purpose register r12.
> +  */
> + switch (target->ptrace_message) {
> + case PTRACE_EVENTMSG_SYSCALL_ENTRY:
> + r12 = PTRACE_SYSCALL_ENTER;
> + overwrite_r12 = true;
> + break;
> + case PTRACE_EVENTMSG_SYSCALL_EXIT:
> + r12 = PTRACE_SYSCALL_EXIT;
> + overwrite_r12 = true;
> + break;
> + default:
> + overwrite_r12 = false;
> + break;

Re: [PATCH 2/3] arm64/ptrace: introduce NT_ARM_PRSTATUS to get a full set of registers

2021-01-27 Thread Dave Martin
On Tue, Jan 19, 2021 at 02:06:36PM -0800, Andrei Vagin wrote:
> This is an alternative to NT_PRSTATUS that clobbers ip/r12 on AArch32,
> x7 on AArch64 when a tracee is stopped in syscall entry or syscall exit
> traps.
> 
> Signed-off-by: Andrei Vagin 

This approach looks like it works, though I still think adding an option
for this under PTRACE_SETOPTIONS would be less intrusive.

Adding a shadow regset like this also looks like it would cause the gp
regs to be pointlessly dumped twice in a core dump.  Avoiding that
might require hacks in the core code...
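
For what it's worth, the tracer-side usage would be the usual
PTRACE_GETREGSET iovec dance; a sketch, where NT_ARM_PRSTATUS is the note
type proposed by this patch rather than an existing UAPI constant:

#include <sys/types.h>
#include <sys/uio.h>
#include <sys/ptrace.h>
#include <linux/elf.h>
#include <asm/ptrace.h>		/* struct user_pt_regs */

/* Read the unclobbered GPRs of a stopped tracee via the proposed regset */
static long read_raw_gprs(pid_t pid, struct user_pt_regs *regs)
{
	struct iovec iov = {
		.iov_base = regs,
		.iov_len  = sizeof(*regs),
	};

	/* Unlike NT_PRSTATUS, x7 would not be rewritten at syscall stops */
	return ptrace(PTRACE_GETREGSET, pid, NT_ARM_PRSTATUS, &iov);
}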


> ---
>  arch/arm64/kernel/ptrace.c | 39 ++
>  include/uapi/linux/elf.h   |  1 +
>  2 files changed, 40 insertions(+)
> 
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index 1863f080cb07..b8e4c2ddf636 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -591,6 +591,15 @@ static int gpr_get(struct task_struct *target,
>   return ret;
>  }
>  
> +static int gpr_get_full(struct task_struct *target,
> +const struct user_regset *regset,
> +struct membuf to)
> +{
> + struct user_pt_regs *uregs = &task_pt_regs(target)->user_regs;
> +
> + return membuf_write(&to, uregs, sizeof(*uregs));
> +}
> +
>  static int gpr_set(struct task_struct *target, const struct user_regset 
> *regset,
>  unsigned int pos, unsigned int count,
>  const void *kbuf, const void __user *ubuf)
> @@ -1088,6 +1097,7 @@ static int tagged_addr_ctrl_set(struct task_struct 
> *target, const struct
>  
>  enum aarch64_regset {
>   REGSET_GPR,
> + REGSET_GPR_FULL,

If we go with this approach, "REGSET_GPR_RAW" might be a preferable
name.  Both regsets represent all the regs ("full"), but REGSET_GPR is
mangled by the kernel.

>   REGSET_FPR,
>   REGSET_TLS,
>  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> @@ -1119,6 +1129,14 @@ static const struct user_regset aarch64_regsets[] = {
>   .regset_get = gpr_get,
>   .set = gpr_set
>   },
> + [REGSET_GPR_FULL] = {
> + .core_note_type = NT_ARM_PRSTATUS,

Similarly, something like NT_ARM_PRSTATUS_RAW or similar.

> + .n = sizeof(struct user_pt_regs) / sizeof(u64),
> + .size = sizeof(u64),
> + .align = sizeof(u64),
> + .regset_get = gpr_get_full,
> + .set = gpr_set
> + },
>   [REGSET_FPR] = {
>   .core_note_type = NT_PRFPREG,
>   .n = sizeof(struct user_fpsimd_state) / sizeof(u32),
> @@ -1225,6 +1243,7 @@ static const struct user_regset_view user_aarch64_view 
> = {
>  #ifdef CONFIG_COMPAT
>  enum compat_regset {
>   REGSET_COMPAT_GPR,
> + REGSET_COMPAT_GPR_FULL,
>   REGSET_COMPAT_VFP,
>  };
>  
> @@ -1285,6 +1304,18 @@ static int compat_gpr_get(struct task_struct *target,
>   return 0;
>  }
>  
> +/* compat_gpr_get_full doesn't  overwrite x12 like compat_gpr_get. */
> +static int compat_gpr_get_full(struct task_struct *target,
> +   const struct user_regset *regset,
> +   struct membuf to)
> +{
> + int i = 0;
> +
> + while (to.left)
> + membuf_store(&to, compat_get_user_reg(target, i++));
> + return 0;
> +}
> +
>  static int compat_gpr_set(struct task_struct *target,
> const struct user_regset *regset,
> unsigned int pos, unsigned int count,
> @@ -1435,6 +1466,14 @@ static const struct user_regset aarch32_regsets[] = {
>   .regset_get = compat_gpr_get,
>   .set = compat_gpr_set
>   },
> + [REGSET_COMPAT_GPR_FULL] = {
> + .core_note_type = NT_ARM_PRSTATUS,
> + .n = COMPAT_ELF_NGREG,
> + .size = sizeof(compat_elf_greg_t),
> + .align = sizeof(compat_elf_greg_t),
> + .regset_get = compat_gpr_get_full,
> + .set = compat_gpr_set
> + },
>   [REGSET_COMPAT_VFP] = {
>   .core_note_type = NT_ARM_VFP,
>   .n = VFP_STATE_SIZE / sizeof(compat_ulong_t),
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index 30f68b42eeb5..a2086d19263a 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -426,6 +426,7 @@ typedef struct elf64_shdr {
>  #define NT_ARM_PACA_KEYS 0x407   /* ARM pointer authentication address 
> keys */
>  #define NT_ARM_PACG_KEYS 0x408   /* ARM pointer authentication generic 
> key */
>  #define NT_ARM_TAGGED_ADDR_CTRL  0x409   /* arm64 tagged address control 
> (prctl()) */

What happened to 0x40a..0x40f?

[...]

Cheers
---Dave


Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled

2021-01-20 Thread Dave Martin
On Tue, Jan 19, 2021 at 05:29:05PM +0100, Ard Biesheuvel wrote:
> On Tue, 19 Jan 2021 at 17:01, Dave Martin  wrote:
> >
> > On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> > > Kernel mode NEON can be used in task or softirq context, but only in
> > > a non-nesting manner, i.e., softirq context is only permitted if the
> > > interrupt was not taken at a point where the kernel was using the NEON
> > > in task context.
> > >
> > > This means all users of kernel mode NEON have to be aware of this
> > > limitation, and either need to provide scalar fallbacks that may be much
> > > slower (up to 20x for AES instructions) and potentially less safe, or
> > > use an asynchronous interface that defers processing to a later time
> > > when the NEON is guaranteed to be available.
> > >
> > > Given that grabbing and releasing the NEON is cheap, we can relax this
> > > restriction, by increasing the granularity of kernel mode NEON code, and
> > > always disabling softirq processing while the NEON is being used in task
> > > context.
> > >
> > > Signed-off-by: Ard Biesheuvel 
> >
> > Sorry for the slow reply on this...  it looks reasonable, but I have a
> > few comments below.
> >
> 
> No worries - thanks for taking a look.
> 
> > > ---
> > >  arch/arm64/include/asm/assembler.h | 19 +--
> > >  arch/arm64/kernel/asm-offsets.c|  2 ++
> > >  arch/arm64/kernel/fpsimd.c |  4 ++--
> > >  3 files changed, 17 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/assembler.h 
> > > b/arch/arm64/include/asm/assembler.h
> > > index ddbe6bf00e33..74ce46ed55ac 100644
> > > --- a/arch/arm64/include/asm/assembler.h
> > > +++ b/arch/arm64/include/asm/assembler.h
> > > @@ -15,6 +15,7 @@
> > >  #include 
> > >
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2)// 
> > > invalidate I line PoU
> > >   .endm
> > >
> > >   .macro  if_will_cond_yield_neon
> > > -#ifdef CONFIG_PREEMPTION
> > >   get_current_taskx0
> > >   ldr x0, [x0, #TSK_TI_PREEMPT]
> > > - sub x0, x0, #PREEMPT_DISABLE_OFFSET
> > > - cbz x0, .Lyield_\@
> > > +#ifdef CONFIG_PREEMPTION
> > > + cmp x0, #PREEMPT_DISABLE_OFFSET
> > > + beq .Lyield_\@  // yield on need_resched in task 
> > > context
> > > +#endif
> > > + /* never yield while serving a softirq */
> > > + tbnzx0, #SOFTIRQ_SHIFT, .Lnoyield_\@
> >
> > Can you explain the rationale here?
> >
> > Using if_will_cond_yield_neon suggests the algo thinks it may run for
> > too long, stalling preemption until completion, but we happily stall
> > preemption _and_ softirqs here.
> >
> > Is it actually a bug to use the NEON conditional yield helpers in
> > softirq context?
> >
> 
> No, it is not. But calling kernel_neon_end() from softirq context will
> not cause it to finish any faster, so there is really no point in
> doing so.
> 
> > Ideally, if processing in softirq context takes an unreasonable amount of
> > time, the work would be handed off to an asynchronous worker, but that
> > does seem to conflict rather with the purpose of this series...
> >
> 
> Agreed, but this is not something we can police at this level. If the
> caller does an unreasonable amount of work from a softirq, no amount
> of yielding is going to make a difference.

Ack, just wanted to make sure I wasn't missing something.

Anyone writing softirq code can starve preemption, so I agree that we
should trust people to know what they're doing.


> > > +
> > > + adr_l   x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> > > + this_cpu_offset x1
> > > + ldr w0, [x0, x1]
> > > + cbnzw0, .Lyield_\@  // yield on pending softirq in task 
> > > context
> > > +.Lnoyield_\@:
> > >   /* fall through to endif_yield_neon */
> > >   .subsection 1
> > >  .Lyield_\@ :
> > > -#else
> > > - .section".discard.cond_yield_neon", "ax"
> > > -#endif
> > >   .endm
> > >
> > >   .macro  do_cond_yield_ne

Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled

2021-01-19 Thread Dave Martin
On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> Kernel mode NEON can be used in task or softirq context, but only in
> a non-nesting manner, i.e., softirq context is only permitted if the
> interrupt was not taken at a point where the kernel was using the NEON
> in task context.
> 
> This means all users of kernel mode NEON have to be aware of this
> limitation, and either need to provide scalar fallbacks that may be much
> slower (up to 20x for AES instructions) and potentially less safe, or
> use an asynchronous interface that defers processing to a later time
> when the NEON is guaranteed to be available.
> 
> Given that grabbing and releasing the NEON is cheap, we can relax this
> restriction, by increasing the granularity of kernel mode NEON code, and
> always disabling softirq processing while the NEON is being used in task
> context.
> 
> Signed-off-by: Ard Biesheuvel 

Sorry for the slow reply on this...  it looks reasonable, but I have a
few comments below.

> ---
>  arch/arm64/include/asm/assembler.h | 19 +--
>  arch/arm64/kernel/asm-offsets.c|  2 ++
>  arch/arm64/kernel/fpsimd.c |  4 ++--
>  3 files changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/assembler.h 
> b/arch/arm64/include/asm/assembler.h
> index ddbe6bf00e33..74ce46ed55ac 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -15,6 +15,7 @@
>  #include 
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2)// 
> invalidate I line PoU
>   .endm
>  
>   .macro  if_will_cond_yield_neon
> -#ifdef CONFIG_PREEMPTION
>   get_current_taskx0
>   ldr x0, [x0, #TSK_TI_PREEMPT]
> - sub x0, x0, #PREEMPT_DISABLE_OFFSET
> - cbz x0, .Lyield_\@
> +#ifdef CONFIG_PREEMPTION
> + cmp x0, #PREEMPT_DISABLE_OFFSET
> + beq .Lyield_\@  // yield on need_resched in task context
> +#endif
> + /* never yield while serving a softirq */
> + tbnzx0, #SOFTIRQ_SHIFT, .Lnoyield_\@

Can you explain the rationale here?

Using if_will_cond_yield_neon suggests the algo thinks it may run for
too long, stalling preemption until completion, but we happily stall
preemption _and_ softirqs here.

Is it actually a bug to use the NEON conditional yield helpers in
softirq context?

Ideally, if processing in softirq context takes an unreasonable amount of
time, the work would be handed off to an asynchronous worker, but that
does seem to conflict rather with the purpose of this series...

> +
> + adr_l   x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> + this_cpu_offset x1
> + ldr w0, [x0, x1]
> + cbnzw0, .Lyield_\@  // yield on pending softirq in task 
> context
> +.Lnoyield_\@:
>   /* fall through to endif_yield_neon */
>   .subsection 1
>  .Lyield_\@ :
> -#else
> - .section".discard.cond_yield_neon", "ax"
> -#endif
>   .endm
>  
>   .macro  do_cond_yield_neon
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index 7d32fc959b1a..34ef70877de4 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -93,6 +93,8 @@ int main(void)
>DEFINE(DMA_FROM_DEVICE,DMA_FROM_DEVICE);
>BLANK();
>DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
> +  DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
> +  DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, 
> __softirq_pending));
>BLANK();
>DEFINE(CPU_BOOT_STACK, offsetof(struct secondary_data, stack));
>DEFINE(CPU_BOOT_TASK,  offsetof(struct secondary_data, task));
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 062b21f30f94..823e3a8a8871 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
>   */
>  static void get_cpu_fpsimd_context(void)
>  {
> - preempt_disable();
> + local_bh_disable();
>   __get_cpu_fpsimd_context();
>  }
>  
> @@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
>  static void put_cpu_fpsimd_context(void)
>  {
>   __put_cpu_fpsimd_context();
> - preempt_enable();
> + local_bh_enable();
>  }
>  
>  static bool have_cpu_fpsimd_context(void)

I was concerned about catching all the relevant preempt_disable()s, but
it had slipped my memory that Julien had factored these into one place.

I can't see off the top of my head any reason why this shouldn't work.


In theory, switching to local_bh_enable() here will add a check for
pending softirqs onto context handling fast paths.  I haven't dug into
how that works, so perhaps this is trivial on top of the preemption
check in preempt_enable().  Do you see any difference in hackbench or
similar benchmarks?
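
(For context, the fallback dance the cover text alludes to typically
looks something like the sketch below in a synchronous caller;
crypto_simd_usable() is the existing check, while the AES helpers here
are made-up names, not code from this series.)

static void aes_encrypt_one(struct my_aes_ctx *ctx, u8 *dst, const u8 *src)
{
	if (crypto_simd_usable()) {
		/* NEON is allowed here; do the fast SIMD version */
		kernel_neon_begin();
		aes_encrypt_neon(ctx, dst, src);
		kernel_neon_end();
	} else {
		/*
		 * e.g. in a context where kernel-mode NEON may not nest:
		 * fall back to a much slower scalar implementation.
		 */
		aes_encrypt_scalar(ctx, dst, src);
	}
}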

Re: [PATCH v8 06/22] perf arm-spe: Refactor printing string to buffer

2020-11-11 Thread Dave Martin


On Wed, Nov 11, 2020 at 05:39:22PM +, Arnaldo Carvalho de Melo wrote:
> Em Wed, Nov 11, 2020 at 03:45:23PM +, André Przywara escreveu:
> > On 11/11/2020 15:35, Arnaldo Carvalho de Melo wrote:
> > 
> > Hi Arnaldo,
> > 
> > thanks for taking a look!
> > 
> > > Em Wed, Nov 11, 2020 at 03:11:33PM +0800, Leo Yan escreveu:
> > >> When outputs strings to the decoding buffer with function snprintf(),
> > >> SPE decoder needs to detects if any error returns from snprintf() and if
> > >> so needs to directly bail out.  If snprintf() returns success, it needs
> > >> to update buffer pointer and reduce the buffer length so can continue to
> > >> output the next string into the consequent memory space.
> > >>
> > >> This complex logics are spreading in the function arm_spe_pkt_desc() so
> > >> there has many duplicate codes for handling error detecting, increment
> > >> buffer pointer and decrement buffer size.
> > >>
> > >> To avoid the duplicate code, this patch introduces a new helper function
> > >> arm_spe_pkt_snprintf() which is used to wrap up the complex logics, and
> > >> it's used by the caller arm_spe_pkt_desc().
> > >>
> > >> This patch also moves the variable 'blen' as the function's local
> > >> variable, this allows to remove the unnecessary braces and improve the
> > >> readability.
> > >>
> > >> Suggested-by: Dave Martin 
> > >> Signed-off-by: Leo Yan 
> > >> Reviewed-by: Andre Przywara 
> > >> ---
> > >>  .../arm-spe-decoder/arm-spe-pkt-decoder.c | 260 +-
> > >>  1 file changed, 126 insertions(+), 134 deletions(-)
> > >>
> > >> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> > >> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > >> index 04fd7fd7c15f..1970686f7020 100644
> > >> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > >> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > >> @@ -9,6 +9,7 @@
> > >>  #include 
> > >>  #include 
> > >>  #include 
> > >> +#include 
> > >>  
> > >>  #include "arm-spe-pkt-decoder.h"
> > >>  
> > >> @@ -258,192 +259,183 @@ int arm_spe_get_packet(const unsigned char *buf, 
> > >> size_t len,
> > >>  return ret;
> > >>  }
> > >>  
> > >> +static int arm_spe_pkt_snprintf(int *err, char **buf_p, size_t *blen,
> > >> +const char *fmt, ...)
> > >> +{
> > >> +va_list ap;
> > >> +int ret;
> > >> +
> > >> +/* Bail out if any error occurred */
> > >> +if (err && *err)
> > >> +return *err;
> > >> +
> > >> +va_start(ap, fmt);
> > >> +ret = vsnprintf(*buf_p, *blen, fmt, ap);
> > >> +va_end(ap);
> > >> +
> > >> +if (ret < 0) {
> > >> +if (err && !*err)
> > >> +*err = ret;
> > >> +
> > >> +/*
> > >> + * A return value of (*blen - 1) or more means that the
> > >> + * output was truncated and the buffer is overrun.
> > >> + */
> > >> +} else if (ret >= ((int)*blen - 1)) {
> > >> +(*buf_p)[*blen - 1] = '\0';
> > >> +
> > >> +/*
> > >> + * Set *err to 'ret' to avoid overflow if tries to
> > >> + * fill this buffer sequentially.
> > >> + */
> > >> +if (err && !*err)
> > >> +*err = ret;
> > >> +} else {
> > >> +*buf_p += ret;
> > >> +*blen -= ret;
> > >> +}
> > >> +
> > >> +return ret;
> > >> +}
> > >> +
> > >>  int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
> > >>   size_t buf_len)
> > >>  {
> > >>  int ret, ns, el, idx = packet->index;
> > >>  unsigned long long payload = packet->payload;
> > >>  const char *name = arm_spe_pkt_name(pack

Re: [PATCH v8 07/22] perf arm-spe: Consolidate arm_spe_pkt_desc()'s return value

2020-11-11 Thread Dave Martin
On Wed, Nov 11, 2020 at 07:11:34AM +, Leo Yan wrote:
> arm_spe_pkt_desc() returns the length of consumed the buffer for
> the success case; otherwise, it delivers the return value from
> arm_spe_pkt_snprintf(), and returns the last return value if there have
> multiple calling arm_spe_pkt_snprintf().
> 
> Since arm_spe_pkt_snprintf() has the same semantics with vsnprintf() for
> the return value, and vsnprintf() might return value equals to or bigger
> than the parameter 'size' to indicate the truncation.  Because the
> return value is >= 0 when the string is truncated, this condition will
> be returned up the stack as "success".
> 
> This patch simplifies the return value for arm_spe_pkt_desc(): '0' means
> success and other values mean an error has occurred.  To realize this,
> it relies on arm_spe_pkt_snprintf()'s parameter 'err', the 'err' is a
> cumulative value, returns its final value if printing buffer is called
> for one time or multiple times.
> 
> To unify the error value generation, this patch handles error in a
> central place, rather than directly bailing out in switch-cases,
> it returns error at the end of arm_spe_pkt_desc().
> 
> This patch changes the caller arm_spe_dump() to respect the updated
> return value semantics of arm_spe_pkt_desc().
> 
> Suggested-by: Dave Martin 
> Signed-off-by: Leo Yan 
> Reviewed-by: Andre Przywara 
> ---
>  .../arm-spe-decoder/arm-spe-pkt-decoder.c | 128 +-
>  tools/perf/util/arm-spe.c |   2 +-
>  2 files changed, 68 insertions(+), 62 deletions(-)
> 
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> index 1970686f7020..424ff5862aa1 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> @@ -301,9 +301,10 @@ static int arm_spe_pkt_snprintf(int *err, char **buf_p, 
> size_t *blen,
>  int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
>size_t buf_len)
>  {
> - int ret, ns, el, idx = packet->index;
> + int ns, el, idx = packet->index;
>   unsigned long long payload = packet->payload;
>   const char *name = arm_spe_pkt_name(packet->type);
> + char *buf_orig = buf;
>   size_t blen = buf_len;
>   int err = 0;
>  
> @@ -311,82 +312,76 @@ int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
>   case ARM_SPE_BAD:
>   case ARM_SPE_PAD:
>   case ARM_SPE_END:
> - return arm_spe_pkt_snprintf(&err, &buf, &blen, "%s", name);
> + arm_spe_pkt_snprintf(&err, &buf, &blen, "%s", name);
> + break;
>   case ARM_SPE_EVENTS:
> - ret = arm_spe_pkt_snprintf(&err, &buf, &blen, "EV");
> + arm_spe_pkt_snprintf(&err, &buf, &blen, "EV");
>  
>   if (payload & 0x1)
> - ret = arm_spe_pkt_snprintf(&err, &buf, &blen, " EXCEPTION-GEN");
> + arm_spe_pkt_snprintf(&err, &buf, &blen, " EXCEPTION-GEN");
>   if (payload & 0x2)
> - ret = arm_spe_pkt_snprintf(&err, &buf, &blen, " RETIRED");
> + arm_spe_pkt_snprintf(&err, &buf, &blen, " RETIRED");
>   if (payload & 0x4)
> - ret = arm_spe_pkt_snprintf(&err, &buf, &blen, " L1D-ACCESS");
> + arm_spe_pkt_snprintf(&err, &buf, &blen, " L1D-ACCESS");
>   if (payload & 0x8)
> - ret = arm_spe_pkt_snprintf(&err, &buf, &blen, " L1D-REFILL");
> + arm_spe_pkt_snprintf(&err, &buf, &blen, " L1D-REFILL");
>   if (payload & 0x10)
> - ret = arm_spe_pkt_snprintf(&err, &buf, &blen, " TLB-ACCESS");
> + arm_spe_pkt_snprintf(&err, &buf, &blen, " TLB-ACCESS");
>   if (payload & 0x20)
> - ret = arm_spe_pkt_snprintf(&err, &buf, &blen, " TLB-REFILL");
> + arm_spe_pkt_snprintf(&err, &buf, &blen, " TLB-REFILL");
>   if (payload & 0x40)
> - ret = arm_spe_pkt_snprintf(&err, &buf, &blen, " NOT-TAKEN");
> + arm_spe_pkt_snprintf(&err, &buf, &blen, " NOT-TAKEN");
>   if (payload & 0x80)
> - ret = arm_spe_pkt_snprintf(&err, &buf, &blen, " MISPRED");
> + arm_spe_pkt_snprintf(&err, &buf, &blen, " MISPRED");
>   if (idx > 1) {
>   if (payload & 0x100)
> - ret = arm_spe_pkt_snprintf(&err, &buf, &blen, " LLC-ACCESS");
> +

Re: [PATCH v8 06/22] perf arm-spe: Refactor printing string to buffer

2020-11-11 Thread Dave Martin
On Wed, Nov 11, 2020 at 03:53:20PM +, Dave Martin wrote:
> On Wed, Nov 11, 2020 at 07:11:33AM +, Leo Yan wrote:
> > When outputs strings to the decoding buffer with function snprintf(),
> > SPE decoder needs to detects if any error returns from snprintf() and if
> > so needs to directly bail out.  If snprintf() returns success, it needs
> > to update buffer pointer and reduce the buffer length so can continue to
> > output the next string into the consequent memory space.
> > 
> > This complex logics are spreading in the function arm_spe_pkt_desc() so
> > there has many duplicate codes for handling error detecting, increment
> > buffer pointer and decrement buffer size.
> > 
> > To avoid the duplicate code, this patch introduces a new helper function
> > arm_spe_pkt_snprintf() which is used to wrap up the complex logics, and
> > it's used by the caller arm_spe_pkt_desc().
> > 
> > This patch also moves the variable 'blen' as the function's local
> > variable, this allows to remove the unnecessary braces and improve the
> > readability.
> > 
> > Suggested-by: Dave Martin 
> > Signed-off-by: Leo Yan 
> > Reviewed-by: Andre Przywara 
> 
> Mostly looks fine to me now, though there are a few potential
> issues -- comments below.

Hmm, looks like patch 7 anticipated some of my comments here.

Rather than fixing up patch 6, maybe it would be better to squash these
patches together after all...  sorry!

[...]

Cheers
---Dave


Re: [PATCH v8 06/22] perf arm-spe: Refactor printing string to buffer

2020-11-11 Thread Dave Martin
On Wed, Nov 11, 2020 at 07:11:33AM +, Leo Yan wrote:
> When outputs strings to the decoding buffer with function snprintf(),
> SPE decoder needs to detects if any error returns from snprintf() and if
> so needs to directly bail out.  If snprintf() returns success, it needs
> to update buffer pointer and reduce the buffer length so can continue to
> output the next string into the consequent memory space.
> 
> This complex logics are spreading in the function arm_spe_pkt_desc() so
> there has many duplicate codes for handling error detecting, increment
> buffer pointer and decrement buffer size.
> 
> To avoid the duplicate code, this patch introduces a new helper function
> arm_spe_pkt_snprintf() which is used to wrap up the complex logics, and
> it's used by the caller arm_spe_pkt_desc().
> 
> This patch also moves the variable 'blen' as the function's local
> variable, this allows to remove the unnecessary braces and improve the
> readability.
> 
> Suggested-by: Dave Martin 
> Signed-off-by: Leo Yan 
> Reviewed-by: Andre Przywara 

Mostly looks fine to me now, though there are a few potential
issues -- comments below.

Cheers
---Dave

> ---
>  .../arm-spe-decoder/arm-spe-pkt-decoder.c | 260 +-
>  1 file changed, 126 insertions(+), 134 deletions(-)
> 
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> index 04fd7fd7c15f..1970686f7020 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "arm-spe-pkt-decoder.h"
>  
> @@ -258,192 +259,183 @@ int arm_spe_get_packet(const unsigned char *buf, 
> size_t len,
>   return ret;
>  }
>  
> +static int arm_spe_pkt_snprintf(int *err, char **buf_p, size_t *blen,
> + const char *fmt, ...)
> +{
> + va_list ap;
> + int ret;
> +
> + /* Bail out if any error occurred */
> + if (err && *err)
> + return *err;
> +
> + va_start(ap, fmt);
> + ret = vsnprintf(*buf_p, *blen, fmt, ap);
> + va_end(ap);
> +
> + if (ret < 0) {
> + if (err && !*err)
> + *err = ret;
> +
> + /*
> +  * A return value of (*blen - 1) or more means that the
> +  * output was truncated and the buffer is overrun.
> +  */

(*blen - 1) chars, + 1 for '\0', is exactly *blen bytes.  So ret ==
*blen - 1 is not an error.

> + } else if (ret >= ((int)*blen - 1)) {

So I suggest

if (ret >= *blen)

here.

Nit: If gcc moans about signedness in the comparison, I think it's
preferable to say

if ((size_t)ret >= *blen)

rather than casting *blen to an int.  We already know that ret >= 0, and
UINT_MAX always fits in a size_t.  On this code path it probably doesn't
matter in practice through, since *blen will be much less than INT_MAX.
vsnprintf() probably doesn't cope gracefully with super-large buffers
anyway, and the ISO C standards don't describe this situation
adequately.

If gcc doesn't warn, just drop the cast.
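
A quick standalone illustration of the snprintf() convention being relied
on here (userspace snippet, not decoder code):

#include <stdio.h>

int main(void)
{
	char buf[4];
	int fits, trunc;

	/* "abc" fits exactly: 3 chars + '\0' in a 4-byte buffer */
	fits = snprintf(buf, sizeof(buf), "abc");	/* returns 3 == blen - 1 */

	/*
	 * "abcd" is truncated to "abc": the return value is the length that
	 * would have been written, so truncation shows up as ret >= blen,
	 * not ret >= blen - 1.
	 */
	trunc = snprintf(buf, sizeof(buf), "abcd");	/* returns 4 >= blen */

	printf("%d %d\n", fits, trunc);			/* prints "3 4" */
	return 0;
}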

> + (*buf_p)[*blen - 1] = '\0';
> +
> + /*
> +  * Set *err to 'ret' to avoid overflow if tries to
> +  * fill this buffer sequentially.
> +  */
> + if (err && !*err)
> + *err = ret;
> + } else {
> + *buf_p += ret;
> + *blen -= ret;
> + }
> +
> + return ret;
> +}
> +
>  int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
>size_t buf_len)
>  {
>   int ret, ns, el, idx = packet->index;
>   unsigned long long payload = packet->payload;
>   const char *name = arm_spe_pkt_name(packet->type);
> + size_t blen = buf_len;
> + int err = 0;
>  
>   switch (packet->type) {
>   case ARM_SPE_BAD:
>   case ARM_SPE_PAD:
>   case ARM_SPE_END:
> - return snprintf(buf, buf_len, "%s", name);
> - case ARM_SPE_EVENTS: {
> - size_t blen = buf_len;
> -
> - ret = 0;
> - ret = snprintf(buf, buf_len, "EV");
> - buf += ret;
> - blen -= ret;
> - if (payload & 0x1) {
> - ret = snprintf(buf, buf_len, " EXCEPTION-GEN");
> - buf += ret;
> - blen -= ret;
> - }
> - if (payload & 0x2) {
> - ret = snprintf(buf, buf_len, " RETIR

Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-11-04 Thread Dave Martin
On Thu, Oct 29, 2020 at 11:02:22AM +, Catalin Marinas via Libc-alpha wrote:
> On Tue, Oct 27, 2020 at 02:15:22PM +, Dave P Martin wrote:
> > I also wonder whether we actually care whether the pages are marked
> > executable or not here; probably the flags can just be independent.  This
> > rather depends on how the architecture treats the BTI (a.k.a
> > GP) pagetable bit for non-executable pages.  I have a feeling we already
> > allow PROT_BTI && !PROT_EXEC through anyway.
> > 
> > 
> > What about a generic-ish set/clear interface that still works by just
> > adding a couple of PROT_ flags:
> > 
> > switch (flags & (PROT_SET | PROT_CLEAR)) {
> > case PROT_SET: prot |= flags; break;
> > case PROT_CLEAR: prot &= ~flags; break;
> > case 0: prot = flags; break;
> > 
> > default:
> > return -EINVAL;
> > }
> > 
> > This can't atomically set some flags while clearing some others, but for
> > simple stuff it seems sufficient and shouldn't be too invasive on the
> > kernel side.
> > 
> > We will still have to take the mm lock when doing a SET or CLEAR, but
> > not for the non-set/clear case.
> > 
> > 
> > Anyway, libc could now do:
> > 
> > mprotect(addr, len, PROT_SET | PROT_BTI);
> > 
> > with much the same effect as your PROT_BTI_IF_X.
> > 
> > 
> > JITting or breakpoint setting code that wants to change the permissions
> > temporarily, without needing to know whether PROT_BTI is set, say:
> > 
> > mprotect(addr, len, PROT_SET | PROT_WRITE);
> > *addr = BKPT_INSN;
> > mprotect(addr, len, PROT_CLEAR | PROT_WRITE);
> 
> The problem with this approach is that you can't catch
> PROT_EXEC|PROT_WRITE mappings via seccomp. So you'd have to limit it to
> some harmless PROT_ flags only. I don't like this limitation, nor the
> PROT_BTI_IF_X approach.

Ack; this is just one flavour of interface, and every approach seems to
have some shortcomings.

> The only generic solutions I see are to either use a stateful filter in
> systemd or pass the old state to the kernel in a cmpxchg style so that
> seccomp can check it (I think you suggest this at some point).

The "cmpxchg" option has the disadvantage that the caller needs to know
the original permissions.  It seems that glibc is prepared to work
around this, but it won't always be feasible in ancillary /
instrumentation code or libraries.
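
To make that concrete, a cmpxchg-style interface would have to be used
something like this, where mchangeprot() is the hypothetical syscall
sketched earlier in the thread, not an existing interface:

	/* Enable BTI only if the region is currently believed to be R+X;
	 * the call fails if that belief is wrong -- knowledge that
	 * ancillary or instrumentation code often simply doesn't have.
	 */
	if (mchangeprot(addr, len,
			PROT_READ | PROT_EXEC,			/* expected old flags */
			PROT_READ | PROT_EXEC | PROT_BTI))	/* new flags */
		perror("mchangeprot");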

IMHO it would be preferable to apply a policy to mmap/mprotect in the
kernel proper rather than BPF being the only way to do it -- in any
case, the required checks seem to be out of the scope of what can be
done efficiently (or perhaps at all) in a syscall filter.

> The latter requires a new syscall which is not something we can address
> as a quick, back-portable fix here. If systemd cannot be changed to use
> a stateful filter for w^x detection, my suggestion is to go for the
> kernel setting PROT_BTI on the main executable with glibc changed to
> tolerate EPERM on mprotect(). I don't mind adding an AT_FLAGS bit if
> needed but I don't think it buys us much.

I agree, this seems the best short-term approach.

> Once the current problem is fixed, we can look at a better solution
> longer term as a new syscall.

Agreed, I think if we try to rush the addition of new syscalls, the
chance of coming up with a bad design is high...

Cheers
---Dave


Re: [PATCH v6 20/21] perf arm_spe: Decode memory tagging properties

2020-11-03 Thread Dave Martin
On Tue, Nov 03, 2020 at 06:51:01AM +, Leo Yan wrote:
> On Mon, Nov 02, 2020 at 04:25:36PM +0000, Dave Martin wrote:
> > On Fri, Oct 30, 2020 at 02:57:23AM +, Leo Yan wrote:
> > > From: Andre Przywara 
> > > 
> > > When SPE records a physical address, it can additionally tag the event
> > > with information from the Memory Tagging architecture extension.
> > > 
> > > Decode the two additional fields in the SPE event payload.
> > > 
> > > [leoy: Refined patch to use predefined macros]
> > > 
> > > Signed-off-by: Andre Przywara 
> > > Signed-off-by: Leo Yan 
> > > ---
> > >  tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c | 6 +-
> > >  tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.h | 2 ++
> > >  2 files changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> > > b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > > index 3fca65e9cbbf..9ec3057de86f 100644
> > > --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > > +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > > @@ -371,6 +371,7 @@ static int arm_spe_pkt_desc_addr(const struct 
> > > arm_spe_pkt *packet,
> > >char *buf, size_t buf_len)
> > >  {
> > >   int ns, el, idx = packet->index;
> > > + int ch, pat;
> > >   u64 payload = packet->payload;
> > >   int err = 0;
> > >  
> > > @@ -388,9 +389,12 @@ static int arm_spe_pkt_desc_addr(const struct 
> > > arm_spe_pkt *packet,
> > >   "VA 0x%llx", payload);
> > >   case SPE_ADDR_PKT_HDR_INDEX_DATA_PHYS:
> > >   ns = !!SPE_ADDR_PKT_GET_NS(payload);
> > > + ch = !!SPE_ADDR_PKT_GET_CH(payload);
> > > + pat = SPE_ADDR_PKT_GET_PAT(payload);
> > >   payload = SPE_ADDR_PKT_ADDR_GET_BYTES_0_6(payload);
> > > + return arm_spe_pkt_snprintf(&err, &buf, &buf_len,
> > > - "PA 0x%llx ns=%d", payload, ns);
> > > + "PA 0x%llx ns=%d ch=%d, pat=%x",
> > 
> > Nit: given that this data is all closely related, do we really want the
> > extra comma here?
> 
> No reason for adding comma.  Will remove it.

OK, I'm happy for my Reviewed-by to stand.

[...]

Cheers
---Dave


Re: [PATCH v6 06/21] perf arm-spe: Refactor printing string to buffer

2020-11-03 Thread Dave Martin
On Tue, Nov 03, 2020 at 10:13:49AM +, André Przywara wrote:
> On 03/11/2020 06:40, Leo Yan wrote:
> 
> Hi Dave, Leo,
> 
> > On Mon, Nov 02, 2020 at 05:06:53PM +, Dave Martin wrote:
> >> On Fri, Oct 30, 2020 at 02:57:09AM +, Leo Yan wrote:
> >>> When outputs strings to the decoding buffer with function snprintf(),
> >>> SPE decoder needs to detects if any error returns from snprintf() and if
> >>> so needs to directly bail out.  If snprintf() returns success, it needs
> >>> to update buffer pointer and reduce the buffer length so can continue to
> >>> output the next string into the consequent memory space.
> >>>
> >>> This complex logics are spreading in the function arm_spe_pkt_desc() so
> >>> there has many duplicate codes for handling error detecting, increment
> >>> buffer pointer and decrement buffer size.
> >>>
> >>> To avoid the duplicate code, this patch introduces a new helper function
> >>> arm_spe_pkt_snprintf() which is used to wrap up the complex logics, and
> >>> it's used by the caller arm_spe_pkt_desc(); if printing buffer is called
> >>> for multiple times in a flow, the error is a cumulative value and simply
> >>> returns its final value.
> >>>
> >>> This patch also moves the variable 'blen' as the function's local
> >>> variable, this allows to remove the unnecessary braces and improve the
> >>> readability.
> >>>
> >>> Suggested-by: Dave Martin 
> >>> Signed-off-by: Leo Yan 
> >>
> >> This looks like a good refactoring now, but as pointed out by Andre this
> >> patch is now rather hard to review, since it combines the refactoring
> >> with other changes.
> >>
> >> If reposting this series, it would be good if this could be split into a
> >> first patch that introduces arm_spe_pkt_snprintf() and just updates each
> >> snprintf() call site to use it, but without moving other code around or
> >> optimising anything, followed by one or more patches that clean up and
> >> simplify arm_spe_pkt_desc().
> > 
> > I will respin the patch set and follow this approach.
> 
> Well, I am afraid this is not easily possible.
> 
> Dave: this patch is basically following the pattern turning this:
> ===
>   if (condition) {
>   ret = snprintf(buf, buf_len, "foo");
>   buf += ret;
>   blen -= ret;
>   }
>   ...
>   if (ret < 0)
>   return ret;
>   blen -= ret;
>   return buf_len - blen;
> ===
> into this:
> ---
>   if (condition)
>   arm_spe_pkt_snprintf(, , , "foo");
>   ...
>   return err ?: (int)(buf_len - blen);
> ---
> 
> And "diff" is getting really ahead of itself here and tries to be super
> clever, which leads to this hard to read patch.
> 
> But I don't think there is anything we can really do here, this is
> already the minimal version. Leo adds the optimisations only later on,
> in other patches.

OK, this is only nice-to-have, if feasible -- so I'd say Leo should not
bother to split this patch up unless it is easy to do and actually does
make the diff more intelligible!


> 
> Cheers,
> Andre
> 
> 
> > 
> >> If the series is otherwise mature though, then this rework may be
> >> overkill.
> >>
> >>> ---
> >>>  .../arm-spe-decoder/arm-spe-pkt-decoder.c | 267 --
> >>>  1 file changed, 117 insertions(+), 150 deletions(-)
> >>>
> >>> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> >>> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> >>> index 04fd7fd7c15f..1ecaf9805b79 100644
> >>> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> >>> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> >>> @@ -9,6 +9,7 @@
> >>>  #include 
> >>>  #include 
> >>>  #include 
> >>> +#include 
> >>>  
> >>>  #include "arm-spe-pkt-decoder.h"
> >>>  
> >>> @@ -258,192 +259,158 @@ int arm_spe_get_packet(const unsigned char *buf, 
> >>> size_t len,
> >>>   return ret;
> >>>  }
> >>>  
> >>> +static int arm_spe_pkt_snprintf(int *err, char **buf_p, size_t *blen,
> >>> + const char *fmt, ...)
> >>> +{
> >>> + 

Re: [PATCH v6 06/21] perf arm-spe: Refactor printing string to buffer

2020-11-02 Thread Dave Martin
On Fri, Oct 30, 2020 at 02:57:09AM +, Leo Yan wrote:
> When outputs strings to the decoding buffer with function snprintf(),
> SPE decoder needs to detects if any error returns from snprintf() and if
> so needs to directly bail out.  If snprintf() returns success, it needs
> to update buffer pointer and reduce the buffer length so can continue to
> output the next string into the consequent memory space.
> 
> This complex logics are spreading in the function arm_spe_pkt_desc() so
> there has many duplicate codes for handling error detecting, increment
> buffer pointer and decrement buffer size.
> 
> To avoid the duplicate code, this patch introduces a new helper function
> arm_spe_pkt_snprintf() which is used to wrap up the complex logics, and
> it's used by the caller arm_spe_pkt_desc(); if printing buffer is called
> for multiple times in a flow, the error is a cumulative value and simply
> returns its final value.
> 
> This patch also moves the variable 'blen' as the function's local
> variable, this allows to remove the unnecessary braces and improve the
> readability.
> 
> Suggested-by: Dave Martin 
> Signed-off-by: Leo Yan 

This looks like a good refactoring now, but as pointed out by Andre this
patch is now rather hard to review, since it combines the refactoring
with other changes.

If reposting this series, it would be good if this could be split into a
first patch that introduces arm_spe_pkt_snprintf() and just updates each
snprintf() call site to use it, but without moving other code around or
optimising anything, followed by one or more patches that clean up and
simplify arm_spe_pkt_desc().

If the series is otherwise mature though, then this rework may be
overkill.

> ---
>  .../arm-spe-decoder/arm-spe-pkt-decoder.c | 267 --
>  1 file changed, 117 insertions(+), 150 deletions(-)
> 
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> index 04fd7fd7c15f..1ecaf9805b79 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "arm-spe-pkt-decoder.h"
>  
> @@ -258,192 +259,158 @@ int arm_spe_get_packet(const unsigned char *buf, 
> size_t len,
>   return ret;
>  }
>  
> +static int arm_spe_pkt_snprintf(int *err, char **buf_p, size_t *blen,
> + const char *fmt, ...)
> +{
> + va_list ap;
> + int ret;
> +
> + /* Bail out if any error occurred */
> + if (err && *err)
> + return *err;
> +
> + va_start(ap, fmt);
> + ret = vsnprintf(*buf_p, *blen, fmt, ap);
> + va_end(ap);
> +
> + if (ret < 0) {
> + if (err && !*err)
> + *err = ret;

What happens on buffer overrun (i.e., ret >= *blen)?

It looks to me like we'll advance buf_p too far, blen will wrap around,
and the string at *buf_p won't be null terminated.  Because the return
value is still >= 0, this condition will be returned up the stack as
"success".

Perhaps this can never happen given the actual buffer sizes and strings
being printed, but it feels a bit unsafe.


It may be better to clamp the adjustments to *buf_p and *blen to
*blen - 1 in this case, and explicitly set (*buf_p)[*blen - 1] to '\0'.
We _may_ want indicate failure in the return from arm_spe_pkt_desc() in
this situation, but I don't know enough about how this code is called to
enable me to judge that.

(Note, this issue is not introduced by this patch, but this refactoring
makes it easier to address it in a single place -- so it may now be
worth doing so.)
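
Roughly what that might look like inside the helper -- a sketch of the
suggestion, not the posted patch (whether truncation is then reported as
an error is a separate policy choice):

static int arm_spe_pkt_snprintf(char **buf_p, size_t *blen,
				const char *fmt, ...)
{
	va_list ap;
	int ret;

	va_start(ap, fmt);
	ret = vsnprintf(*buf_p, *blen, fmt, ap);
	va_end(ap);

	if (ret < 0)
		return ret;		/* format error */

	if ((size_t)ret >= *blen) {	/* output truncated */
		if (*blen) {
			/* Clamp: keep the string terminated and park the
			 * cursor on the final byte so later calls cannot
			 * run past the buffer.
			 */
			(*buf_p)[*blen - 1] = '\0';
			*buf_p += *blen - 1;
			*blen = 1;
		}
		return -1;		/* one possible way to flag it */
	}

	*buf_p += ret;
	*blen -= ret;
	return ret;
}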

> + } else {
> + *buf_p += ret;
> + *blen -= ret;
> + }
> +
> + return ret;
> +}
> +
>  int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
>size_t buf_len)
>  {
> - int ret, ns, el, idx = packet->index;
> + int ns, el, idx = packet->index;
>   unsigned long long payload = packet->payload;
>   const char *name = arm_spe_pkt_name(packet->type);
> + size_t blen = buf_len;
> + int err = 0;
>  
>   switch (packet->type) {
>   case ARM_SPE_BAD:
>   case ARM_SPE_PAD:
>   case ARM_SPE_END:
> - return snprintf(buf, buf_len, "%s", name);
> - case ARM_SPE_EVENTS: {
> - size_t blen = buf_len;
> -
> - ret = 0;
> - ret = snprintf(buf, buf_len, "EV");
> - buf += ret;
> - blen -= ret;
> - if (payload & 0x1) {
> -  

Re: [PATCH v6 20/21] perf arm_spe: Decode memory tagging properties

2020-11-02 Thread Dave Martin
On Fri, Oct 30, 2020 at 02:57:23AM +, Leo Yan wrote:
> From: Andre Przywara 
> 
> When SPE records a physical address, it can additionally tag the event
> with information from the Memory Tagging architecture extension.
> 
> Decode the two additional fields in the SPE event payload.
> 
> [leoy: Refined patch to use predefined macros]
> 
> Signed-off-by: Andre Przywara 
> Signed-off-by: Leo Yan 
> ---
>  tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c | 6 +-
>  tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.h | 2 ++
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> index 3fca65e9cbbf..9ec3057de86f 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> @@ -371,6 +371,7 @@ static int arm_spe_pkt_desc_addr(const struct arm_spe_pkt 
> *packet,
>char *buf, size_t buf_len)
>  {
>   int ns, el, idx = packet->index;
> + int ch, pat;
>   u64 payload = packet->payload;
>   int err = 0;
>  
> @@ -388,9 +389,12 @@ static int arm_spe_pkt_desc_addr(const struct 
> arm_spe_pkt *packet,
>   "VA 0x%llx", payload);
>   case SPE_ADDR_PKT_HDR_INDEX_DATA_PHYS:
>   ns = !!SPE_ADDR_PKT_GET_NS(payload);
> + ch = !!SPE_ADDR_PKT_GET_CH(payload);
> + pat = SPE_ADDR_PKT_GET_PAT(payload);
>   payload = SPE_ADDR_PKT_ADDR_GET_BYTES_0_6(payload);
> + return arm_spe_pkt_snprintf(&err, &buf, &buf_len,
> - "PA 0x%llx ns=%d", payload, ns);
> + "PA 0x%llx ns=%d ch=%d, pat=%x",

Nit: given that this data is all closely related, do we really want the
extra comma here?

(Note, I am not familiar with how this text is consumed, so if there are
other reasons why the comma is needed then that's probably fine.)

> + payload, ns, ch, pat);
>   default:
>   return 0;
>   }
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.h 
> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.h
> index 7032fc141ad4..1ad14885c2a1 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.h
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.h
> @@ -73,6 +73,8 @@ struct arm_spe_pkt {
>  
>  #define SPE_ADDR_PKT_GET_NS(v)   (((v) & BIT_ULL(63)) >> 
> 63)
>  #define SPE_ADDR_PKT_GET_EL(v)   (((v) & GENMASK_ULL(62, 
> 61)) >> 61)
> +#define SPE_ADDR_PKT_GET_CH(v)       (((v) & BIT_ULL(62)) >> 
> 62)
> +#define SPE_ADDR_PKT_GET_PAT(v)  (((v) & GENMASK_ULL(59, 
> 56)) >> 56)

These seem to match the spec.

With or without addressing the nit above:

Reviewed-by: Dave Martin 

[...]

Cheers
---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-27 Thread Dave Martin
On Mon, Oct 26, 2020 at 05:39:42PM -0500, Jeremy Linton via Libc-alpha wrote:
> Hi,
> 
> On 10/26/20 12:52 PM, Dave Martin wrote:
> >On Mon, Oct 26, 2020 at 04:57:55PM +, Szabolcs Nagy via Libc-alpha wrote:
> >>The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
> >>>Unrolling this discussion a bit, this problem comes from a few sources:
> >>>
> >>>1) systemd is trying to implement a policy that doesn't fit SECCOMP
> >>>syscall filtering very well.
> >>>
> >>>2) The program is trying to do something not expressible through the
> >>>syscall interface: really the intent is to set PROT_BTI on the page,
> >>>with no intent to set PROT_EXEC on any page that didn't already have it
> >>>set.
> >>>
> >>>
> >>>This limitation of mprotect() was known when I originally added PROT_BTI,
> >>>but at that time we weren't aware of a clear use case that would fail.
> >>>
> >>>
> >>>Would it now help to add something like:
> >>>
> >>>int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> >>>{
> >>>   int ret = -EINVAL;
> >>>   mmap_write_lock(current->mm);
> >>>   if (all vmas in [addr .. addr + len) have
> >>>   their mprotect flags set to old_flags) {
> >>>
> >>>   ret = mprotect(addr, len, new_flags);
> >>>   }
> >>>   
> >>>   mmap_write_unlock(current->mm);
> >>>   return ret;
> >>>}
> >>
> >>if more prot flags are introduced then the exact
> >>match for old_flags may be restrictive and currently
> >>there is no way to query these flags to figure out
> >>how to toggle one prot flag in a future proof way,
> >>so i don't think this solves the issue completely.
> >
> >Ack -- I illustrated this model because it makes the seccomp filter's
> >job easy, but it does have limitations.
> >
> >>i think we might need a new api, given that aarch64
> >>now has PROT_BTI and PROT_MTE while existing code
> >>expects RWX only, but i don't know what api is best.
> >
> >An alternative option would be a call that sets / clears chosen
> >flags and leaves others unchanged.
> 
> I tend to favor a set/clear API, but that could also just be done by
> creating a new PROT_BTI_IF_X which enables BTI for areas already set to
> _EXEC. That goes right by the seccomp filters too, and actually is closer to
> what glibc wants to do anyway.

That works, though I'm not so keen on treating PROT_BTI as a special case,
since the problem is likely to recur when other weird per-arch flags get
added...

I also wonder whether we actually care whether the pages are marked
executable or not here; probably the flags can just be independent.  This
rather depends on how the architecture treats the BTI (a.k.a.
GP) pagetable bit for non-executable pages.  I have a feeling we already
allow PROT_BTI && !PROT_EXEC through anyway.


What about a generic-ish set/clear interface that still works by just
adding a couple of PROT_ flags:

switch (flags & (PROT_SET | PROT_CLEAR)) {
case PROT_SET:   prot |= flags;  break;
case PROT_CLEAR: prot &= ~flags; break;
case 0:          prot = flags;   break;

default:
        return -EINVAL;
}

This can't atomically set some flags while clearing some others, but for
simple stuff it seems sufficient and shouldn't be too invasive on the
kernel side.

We will still have to take the mm lock when doing a SET or CLEAR, but
not for the non-set/clear case.


Anyway, libc could now do:

mprotect(addr, len, PROT_SET | PROT_BTI);

with much the same effect as your PROT_BTI_IF_X.


JITting or breakpoint setting code that wants to change the permissions
temporarily, without needing to know whether PROT_BTI is set, say:

mprotect(addr, len, PROT_SET | PROT_WRITE);
*addr = BKPT_INSN;
mprotect(addr, len, PROT_CLEAR | PROT_WRITE);


Thoughts?

I won't claim this doesn't still have some limitations...
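
To spell out why this keeps the filter's job easy: the policy check can
still be done purely on the prot argument.  A rough sketch of the
systemd-side rule, with entirely made-up values for the new flags
(nothing like PROT_SET/PROT_CLEAR exists today):

#include <sys/mman.h>

#define PROT_SET        0x40    /* hypothetical value, illustration only */
#define PROT_CLEAR      0x80    /* hypothetical value, illustration only */

/*
 * Sketch of the decision an MDWX-style filter could take on the mprotect()
 * prot argument alone.  glibc's BTI call becomes PROT_SET | PROT_BTI, which
 * never mentions PROT_EXEC, so the existing "refuse PROT_EXEC" rule keeps
 * working unchanged.
 */
static int mdwx_allows(unsigned long prot)
{
        if (prot & PROT_CLEAR)
                return 1;               /* only removing permissions */

        return !(prot & PROT_EXEC);     /* same rule for plain and SET calls */
}

A stricter policy would probably also want to refuse PROT_SET | PROT_WRITE,
since the filter still can't see whether the range is already executable --
that's one of the limitations I mean.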

Cheers
---Dave


Re: [PATCH v4 06/21] perf arm-spe: Refactor printing string to buffer

2020-10-27 Thread Dave Martin
On Tue, Oct 27, 2020 at 03:09:02AM +, Leo Yan wrote:
> When outputs strings to the decoding buffer with function snprintf(),
> SPE decoder needs to detects if any error returns from snprintf() and if
> so needs to directly bail out.  If snprintf() returns success, it needs
> to update buffer pointer and reduce the buffer length so can continue to
> output the next string into the consequent memory space.
> 
> This complex logics are spreading in the function arm_spe_pkt_desc() so
> there has many duplicate codes for handling error detecting, increment
> buffer pointer and decrement buffer size.
> 
> To avoid the duplicate code, this patch introduces a new helper function
> arm_spe_pkt_snprintf() which is used to wrap up the complex logics, and
> the caller arm_spe_pkt_desc() will call it and simply check the returns
> value.
> 
> This patch also moves the variable 'blen' as the function's local
> variable, this allows to remove the unnecessary braces and improve the
> readability.
> 
> Suggested-by: Dave Martin 
> Signed-off-by: Leo Yan 
> Reviewed-by: Andre Przywara 
> ---
>  .../arm-spe-decoder/arm-spe-pkt-decoder.c | 247 ++
>  1 file changed, 135 insertions(+), 112 deletions(-)
> 
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> index 04fd7fd7c15f..b400636e6da2 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "arm-spe-pkt-decoder.h"
>  
> @@ -258,192 +259,214 @@ int arm_spe_get_packet(const unsigned char *buf, 
> size_t len,
>   return ret;
>  }
>  
> +static int arm_spe_pkt_snprintf(char **buf_p, size_t *blen,
> + const char *fmt, ...)
> +{
> + va_list ap;
> + int ret;
> +
> + va_start(ap, fmt);
> + ret = vsnprintf(*buf_p, *blen, fmt, ap);
> + va_end(ap);
> +
> + if (ret < 0)
> + return ret;
> +
> + *buf_p += ret;
> + *blen -= ret;
> + return ret;
> +}
> +

This approach seems OK, though I wonder whether all the
"if (ret < 0) return;" logic is really needed.

In case of failure, it probably doesn't matter what ends up in buf.
If so, we could just implement a cumulative error:

static int arm_spe_pkt_snprintf(int *err, char **buf_p, size_t *blen,
				const char *fmt, ...)
{
	/* ... */

	if (ret < 0) {
		if (err && !*err)
			*err = ret;
	} else {
		*buf_p += ret;
		*blen -= ret;
	}

	return err ? *err : ret;
}

and just return the final value of err.



>  int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
>size_t buf_len)
>  {
>   int ret, ns, el, idx = packet->index;
>   unsigned long long payload = packet->payload;
>   const char *name = arm_spe_pkt_name(packet->type);
> + size_t blen = buf_len;
>  
>   switch (packet->type) {
>   case ARM_SPE_BAD:
>   case ARM_SPE_PAD:
>   case ARM_SPE_END:
> - return snprintf(buf, buf_len, "%s", name);
> - case ARM_SPE_EVENTS: {
> - size_t blen = buf_len;
> -
> - ret = 0;
> - ret = snprintf(buf, buf_len, "EV");
> - buf += ret;
> - blen -= ret;
> + return arm_spe_pkt_snprintf(&buf, &blen, "%s", name);
> + case ARM_SPE_EVENTS:
> + ret = arm_spe_pkt_snprintf(&buf, &blen, "EV");
> + if (ret < 0)
> + return ret;
> +

...

Then this becomes

	case ARM_SPE_END:
		return arm_spe_pkt_snprintf(&err, &buf, &blen, "%s", name);
	case ARM_SPE_EVENTS:
		arm_spe_pkt_snprintf(&err, &buf, &blen, "%s", name);

		if (payload & 0x1)
			arm_spe_pkt_snprintf(&err, &buf, &blen, " EXCEPTION-GEN");
		if (payload & 0x2)
			arm_spe_pkt_snprintf(&err, &buf, &blen, " RETIRED");

		/* ... */


This might be over-engineering, but it does help to condense the code.
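
For what it's worth, here's a standalone toy of the same pattern that can
be built and run on its own.  It is not the perf code: the name is mine
and, unlike the sketch above, it also latches truncation as an error.

#include <stdarg.h>
#include <stdio.h>

/* Toy model of the cumulative-error scheme: the first failure is latched
 * in *err and later calls leave the buffer alone. */
static int buf_printf(int *err, char **buf, size_t *len, const char *fmt, ...)
{
        va_list ap;
        int ret;

        if (*err)
                return *err;

        va_start(ap, fmt);
        ret = vsnprintf(*buf, *len, fmt, ap);
        va_end(ap);

        if (ret < 0 || (size_t)ret >= *len) {
                *err = ret < 0 ? ret : -1;
        } else {
                *buf += ret;
                *len -= ret;
        }

        return *err;
}

int main(void)
{
        char out[64], *p = out;
        size_t len = sizeof(out);
        int err = 0;

        buf_printf(&err, &p, &len, "EV");
        buf_printf(&err, &p, &len, " RETIRED");
        buf_printf(&err, &p, &len, " %s", "LLC-ACCESS");

        printf("%s (err=%d)\n", out, err);
        return 0;
}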

[...]

Cheers
---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-27 Thread Dave Martin
On Mon, Oct 26, 2020 at 05:45:42PM +0100, Florian Weimer via Libc-alpha wrote:
> * Dave Martin via Libc-alpha:
> 
> > Would it now help to add something like:
> >
> > int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> > {
> > int ret = -EINVAL;
> > mmap_write_lock(current->mm);
> > if (all vmas in [addr .. addr + len) have
> > their mprotect flags set to old_flags) {
> >
> > ret = mprotect(addr, len, new_flags);
> > }
> > 
> > mmap_write_unlock(current->mm);
> > return ret;
> > }
> 
> I suggested something similar as well.  Ideally, the interface would
> subsume pkey_mprotect, though, and have a separate flags argument from
> the protection flags.  But then we run into argument list length limits.
>
> Thanks,
> Florian

I suppose.  Assuming that a syscall filter can inspect memory, we might
be able to bundle arguments into a struct if necessary.
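
Something along these lines is the sort of bundling I mean.  This is
entirely made up -- the struct, the syscall and its flags don't exist
anywhere:

/*
 * Hypothetical argument bundle: one pointer argument keeps us within the
 * syscall argument-count limit and leaves room to subsume pkey_mprotect().
 */
struct mchangeprot_args {
        unsigned long   addr;
        unsigned long   len;
        unsigned int    old_flags;
        unsigned int    new_flags;
        int             pkey;           /* -1: leave the protection key alone */
};

/* long mchangeprot(const struct mchangeprot_args *args, size_t size); */

A classic seccomp BPF filter can't dereference the pointer itself, so "a
syscall filter can inspect memory" really assumes something like a
user-notification supervisor (or an LSM) doing the check on the filter's
behalf.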

[...]

Cheers
---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Dave Martin
On Mon, Oct 26, 2020 at 04:57:55PM +, Szabolcs Nagy via Libc-alpha wrote:
> The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
> > Unrolling this discussion a bit, this problem comes from a few sources:
> > 
> > 1) systemd is trying to implement a policy that doesn't fit SECCOMP
> > syscall filtering very well.
> > 
> > 2) The program is trying to do something not expressible through the
> > syscall interface: really the intent is to set PROT_BTI on the page,
> > with no intent to set PROT_EXEC on any page that didn't already have it
> > set.
> > 
> > 
> > This limitation of mprotect() was known when I originally added PROT_BTI,
> > but at that time we weren't aware of a clear use case that would fail.
> > 
> > 
> > Would it now help to add something like:
> > 
> > int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> > {
> > int ret = -EINVAL;
> > mmap_write_lock(current->mm);
> > if (all vmas in [addr .. addr + len) have
> > their mprotect flags set to old_flags) {
> > 
> > ret = mprotect(addr, len, new_flags);
> > }
> > 
> > mmap_write_unlock(current->mm);
> > return ret;
> > }
> 
> if more prot flags are introduced then the exact
> match for old_flags may be restrictive and currently
> there is no way to query these flags to figure out
> how to toggle one prot flag in a future proof way,
> so i don't think this solves the issue completely.

Ack -- I illustrated this model because it makes the seccomp filter's
job easy, but it does have limitations.

> i think we might need a new api, given that aarch64
> now has PROT_BTI and PROT_MTE while existing code
> expects RWX only, but i don't know what api is best.

An alternative option would be a call that sets / clears chosen
flags and leaves others unchanged.

The trouble with that is that the MDWX policy then becomes hard to
implement again.


But policies might be best set via another route, such as a prctl,
rather than being implemented completely in a seccomp filter.

Cheers
---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Dave Martin
On Wed, Oct 21, 2020 at 10:44:46PM -0500, Jeremy Linton via Libc-alpha wrote:
> Hi,
> 
> There is a problem with glibc+systemd on BTI enabled systems. Systemd
> has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
> PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
> being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
> caught by the seccomp filter, resulting in service failures.
> 
> So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
> This is obviously not desirable.
> 
> Various changes have been suggested, replacing the mprotect with mmap calls
> having PROT_BTI set on the original mapping, re-mmapping the segments,
> implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
> and various modification to seccomp to allow particular mprotect cases to
> bypass the filters. In each case there seems to be an undesirable attribute
> to the solution.
> 
> So, whats the best solution?

Unrolling this discussion a bit, this problem comes from a few sources:

1) systemd is trying to implement a policy that doesn't fit SECCOMP
syscall filtering very well.

2) The program is trying to do something not expressible through the
syscall interface: really the intent is to set PROT_BTI on the page,
with no intent to set PROT_EXEC on any page that didn't already have it
set.


This limitation of mprotect() was known when I originally added PROT_BTI,
but at that time we weren't aware of a clear use case that would fail.


Would it now help to add something like:

int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
{
	int ret = -EINVAL;
	mmap_write_lock(current->mm);
	if (all vmas in [addr .. addr + len) have
	    their mprotect flags set to old_flags) {

		ret = mprotect(addr, len, new_flags);
	}

	mmap_write_unlock(current->mm);
	return ret;
}


libc would now be able to do

mchangeprot(addr, len, PROT_EXEC | PROT_READ,
PROT_EXEC | PROT_READ | PROT_BTI);

while systemd's MDWX filter would reject the call if

(new_flags & PROT_EXEC) &&
(!(old_flags & PROT_EXEC) || (new_flags & PROT_WRITE))



This won't magically fix current code, but something along these lines
might be better going forward.


Thoughts?

---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Dave Martin
On Mon, Oct 26, 2020 at 02:52:46PM +, Catalin Marinas via Libc-alpha wrote:
> On Sat, Oct 24, 2020 at 02:01:30PM +0300, Topi Miettinen wrote:
> > On 23.10.2020 12.02, Catalin Marinas wrote:
> > > On Thu, Oct 22, 2020 at 01:02:18PM -0700, Kees Cook wrote:
> > > > Regardless, it makes sense to me to have the kernel load the executable
> > > > itself with BTI enabled by default. I prefer gaining Catalin's suggested
> > > > patch[2]. :)
> > > [...]
> > > > [2] https://lore.kernel.org/linux-arm-kernel/20201022093104.GB1229@gaia/
> > > 
> > > I think I first heard the idea at Mark R ;).
> > > 
> > > It still needs glibc changes to avoid the mprotect(), or at least ignore
> > > the error. Since this is an ABI change and we don't know which kernels
> > > would have it backported, maybe better to still issue the mprotect() but
> > > ignore the failure.
> > 
> > What about kernel adding an auxiliary vector as a flag to indicate that BTI
> > is supported and recommended by the kernel? Then dynamic loader could use
> > that to detect that a) the main executable is BTI protected and there's no
> > need to mprotect() it and b) PROT_BTI flag should be added to all PROT_EXEC
> > pages.
> 
> We could add a bit to AT_FLAGS, it's always been 0 for Linux.
> 
> > In absence of the vector, the dynamic loader might choose to skip doing
> > PROT_BTI at all (since the main executable isn't protected anyway either, or
> > maybe even the kernel is up-to-date but it knows that it's not recommended
> > for some reason, or maybe the kernel is so ancient that it doesn't know
> > about BTI). Optionally it could still read the flag from ELF later (for
> > compatibility with old kernels) and then do the mprotect() dance, which may
> > trip seccomp filters, possibly fatally.
> 
> I think the safest is for the dynamic loader to issue an mprotect() and
> ignore the EPERM error. Not all user deployments have this seccomp
> filter, so they can still benefit, and user can't tell whether the
> kernel change has been backported.
> 
> Now, if the dynamic loader silently ignores the mprotect() failure on
> the main executable, is there much value in exposing a flag in the aux
> vectors? It saves a few (one?) mprotect() calls but I don't think it
> matters much. Anyway, I don't mind the flag.

I don't see a problem with the aforementioned patch [2] to pre-set BTI
on the pages of the main binary.

The original rationale here was that ld.so doesn't _need_ this, since it
is going to examine the binary's ELF headers anyway.  But equally, if
the binary is marked as supporting BTI then it's safe to enable BTI for
the binary's own pages.


I'd tend to agree that an AT_FLAGS flag doesn't add much.  I think real
EPERMs would only be seen in assert-fail type situations.  Failure of
mmap() is likely to result in a segfault later on, or correct operation
with weakened permissions on some pages.  Given the likely failure
modes, that situation doesn't feel too bad.
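
Concretely, the tolerant call would look something like this on the ld.so
side.  This is only a sketch: the helper name is mine and the real glibc
plumbing is more involved.

#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI        0x10            /* arm64 uapi value, for illustration */
#endif

static void try_enable_bti(void *start, size_t len)
{
        if (mprotect(start, len, PROT_READ | PROT_EXEC | PROT_BTI) == 0)
                return;

        /*
         * An MDWX-style seccomp filter is expected to fail this with EPERM.
         * Running without PROT_BTI is the lesser evil, so just carry on; as
         * argued above, other failures should only show up in assert-fail
         * type situations anyway.
         */
}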


> The only potential risk is if the dynamic loader decides not to turn
> PROT_BTI one because of some mix and match of objects but AFAIK BTI
> allows interworking.

Yes, the design means that a page's PROT_BTI can be set safely if the
code in that page was compiled for BTI, irrespective of how other pages
were compiled.  The reasons why we don't do this at finer granularity
are (a) it's not very useful, and (b) ELF images only contain a BTI
property note for the whole image, not per segment.

I think that ld.so already makes this decision at ELF image granularity
(unless someone contradicts me).

Cheers
---Dave


Re: [RFC PATCH 1/4] x86/signal: Introduce helpers to get the maximum signal frame size

2020-10-12 Thread Dave Martin
On Thu, Oct 08, 2020 at 10:43:50PM +, Bae, Chang Seok wrote:
> On Wed, 2020-10-07 at 11:05 +0100, Dave Martin wrote:
> > On Tue, Oct 06, 2020 at 05:45:24PM +, Bae, Chang Seok wrote:
> > > On Mon, 2020-10-05 at 14:42 +0100, Dave Martin wrote:
> > > > On Tue, Sep 29, 2020 at 01:57:43PM -0700, Chang S. Bae wrote:
> > > > > 
> > > > > +/*
> > > > > + * The FP state frame contains an XSAVE buffer which must be 64-byte 
> > > > > aligned.
> > > > > + * If a signal frame starts at an unaligned address, extra space is 
> > > > > required.
> > > > > + * This is the max alignment padding, conservatively.
> > > > > + */
> > > > > +#define MAX_XSAVE_PADDING63UL
> > > > > +
> > > > > +/*
> > > > > + * The frame data is composed of the following areas and laid out as:
> > > > > + *
> > > > > + * -
> > > > > + * | alignment padding |
> > > > > + * -
> > > > > + * | (f)xsave frame|
> > > > > + * -
> > > > > + * | fsave header  |
> > > > > + * -
> > > > > + * | siginfo + ucontext|
> > > > > + * -
> > > > > + */
> > > > > +
> > > > > +/* max_frame_size tells userspace the worst case signal stack size. 
> > > > > */
> > > > > +static unsigned long __ro_after_init max_frame_size;
> > > > > +
> > > > > +void __init init_sigframe_size(void)
> > > > > +{
> > > > > + /*
> > > > > +  * Use the largest of possible structure formats. This might
> > > > > +  * slightly oversize the frame for 64-bit apps.
> > > > > +  */
> > > > > +
> > > > > + if (IS_ENABLED(CONFIG_X86_32) ||
> > > > > + IS_ENABLED(CONFIG_IA32_EMULATION))
> > > > > + max_frame_size = max((unsigned 
> > > > > long)SIZEOF_sigframe_ia32,
> > > > > +  (unsigned 
> > > > > long)SIZEOF_rt_sigframe_ia32);
> > > > > +
> > > > > + if (IS_ENABLED(CONFIG_X86_X32_ABI))
> > > > > + max_frame_size = max(max_frame_size, (unsigned 
> > > > > long)SIZEOF_rt_sigframe_x32);
> > > > > +
> > > > > + if (IS_ENABLED(CONFIG_X86_64))
> > > > > + max_frame_size = max(max_frame_size, (unsigned 
> > > > > long)SIZEOF_rt_sigframe);
> > > > > +
> > > > > + max_frame_size += fpu__get_fpstate_sigframe_size() + 
> > > > > MAX_XSAVE_PADDING;
> > > > 
> > > > For arm64, we round the worst-case padding up by one.
> > > > 
> > > 
> > > Yeah, I saw that. The ARM code adds the max padding, too:
> > > 
> > >   signal_minsigstksz = sigframe_size() +
> > >   round_up(sizeof(struct frame_record), 16) +
> > >   16; /* max alignment padding */
> > > 
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/signal.c#n973
> > > 
> > > > I can't remember the full rationale for this, but it at least seemed a
> > > > bit weird to report a size that is not a multiple of the alignment.
> > > > 
> > > 
> > > Because the last state size of XSAVE may not be 64B aligned, the 
> > > (reported)
> > > sum of xstate size here does not guarantee 64B alignment.
> > > 
> > > > I'm can't think of a clear argument as to why it really matters, though.
> > > 
> > > We care about the start of XSAVE buffer for the XSAVE instructions, to be
> > > 64B-aligned.
> > 
> > Ah, I see.  That makes sense.
> > 
> > For arm64, there is no additional alignment padding inside the frame,
> > only the padding inserted after the frame to ensure that the base
> > address is 16-byte aligned.
> > 
> > However, I wonder whether people will tend to assume that AT_MINSIGSTKSZ
> > is a sensible (if minimal) amount of stack to allocate.  Allocating an
> > odd number of bytes, or any amount that isn't a multiple of the
> > architecture's preferred (or mandated) stack alignment probably doesn't
> > make sense.
> > 
> &

Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-07 Thread Dave Martin
On Wed, Oct 07, 2020 at 06:30:03AM -0700, H.J. Lu wrote:
> On Wed, Oct 7, 2020 at 3:47 AM Dave Martin  wrote:
> >
> > On Tue, Oct 06, 2020 at 10:44:14AM -0700, H.J. Lu wrote:

[...]

> > > I updated my glibc patch to add both _SC_MINSIGSTKSZ and _SC_SIGSTKSZ.
> > > _SC_MINSIGSTKSZ is  the minimum signal stack size from AT_MINSIGSTKSZ,
> > > which is the signal frame size used by kernel, and _SC_SIGSTKSZ is the 
> > > value
> > > of sysconf (_SC_MINSIGSTKSZ) + 6KB for user application.
> >
> > Can I suggest sysconf (_SC_MINSIGSTKSZ) * 4 instead?
> 
> Done.

OK.  I was prepared to have to argue my case a bit more, but if you
think this is OK then I will stop arguing ;)


> > If the arch has more or bigger registers to save in the signal frame,
> > the chances are that they're going to get saved in some userspace stack
> > frames too.  So I suspect that the user signal handler stack usage may
> > scale up to some extent rather than being a constant.
> >
> >
> > To help people migrate without unpleasant surprises, I also figured it
> > would be a good idea to make sure that sysconf (_SC_MINSIGSTKSZ) >=
> > legacy MINSIGSTKSZ, and sysconf (_SC_SIGSTKSZ) >= legacy SIGSTKSZ.
> > This should makes it safer to use sysconf (_SC_MINSIGSTKSZ) as a
> > drop-in replacement for MINSIGSTKSZ, etc.
> >
> > (To explain: AT_MINSIGSTKSZ may actually be < MINSIGSTKSZ on AArch64.
> > My idea was that sysconf () should hide this surprise, but people who
> > really want to know the kernel value would tolerate some
> > nonportability and read AT_MINSIGSTKSZ directly.)
> >
> >
> > So then:
> >
> > kernel_minsigstksz = getauxval(AT_MINSIGSTKSZ);
> > minsigstksz = LEGACY_MINSIGSTKSZ;
> > if (kernel_minsigstksz > minsigstksz)
> > minsigstksz = kernel_minsigstksz;
> >
> > sigstksz = LEGACY_SIGSTKSZ;
> > if (minsigstksz * 4 > sigstksz)
> > sigstksz = minsigstksz * 4;
> 
> I updated users/hjl/AT_MINSIGSTKSZ branch with
> 
> +@item _SC_MINSIGSTKSZ
> +@standards{GNU, unistd.h}

Can we specify these more aggressively now?

> +Inquire about the signal stack size used by the kernel.

I think we've already concluded that this should include all mandatory
overheads, including those imposed by the compiler and glibc?

e.g.:

--8<--

The returned value is the minimum number of bytes of free stack space
required in order to guarantee successful, non-nested handling of a
single signal whose handler is an empty function.

-->8--

> +
> +@item _SC_SIGSTKSZ
> +@standards{GNU, unistd.h}
> +Inquire about the default signal stack size for a signal handler.

Similarly:

--8<--

The returned value is the suggested minimum number of bytes of stack
space required for a signal stack.

This is not guaranteed to be enough for any specific purpose other than
the invocation of a single, non-nested, empty handler, but nonetheless
should be enough for basic scenarios involving simple signal handlers
and very low levels of signal nesting (say, 2 or 3 levels at the very
most).

This value is provided for developer convenience and to ease migration
from the legacy SIGSTKSZ constant.  Programs requiring stronger
guarantees should avoid using it if at all possible.

-->8--


If these descriptions are too wordy, we might want to move some of it
out to signal.texi, though.
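
On the application side, migration might then look like the sketch below,
assuming the proposed _SC_SIGSTKSZ name; the #ifdef is only there to keep
the sketch buildable against today's headers.

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static int setup_alt_stack(void)
{
        long sz = -1;
        stack_t ss;

#ifdef _SC_SIGSTKSZ
        sz = sysconf(_SC_SIGSTKSZ);
#endif
        if (sz < 0)
                sz = SIGSTKSZ;          /* legacy constant as the fallback */

        ss.ss_sp = malloc(sz);
        if (ss.ss_sp == NULL)
                return -1;
        ss.ss_size = sz;
        ss.ss_flags = 0;

        return sigaltstack(&ss, NULL);
}

Thanks to the >= guarantees above, this should never end up with less
stack than the legacy constants would have provided.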

> 
> case _SC_MINSIGSTKSZ:
>   assert (GLRO(dl_minsigstacksize) != 0);
>   return GLRO(dl_minsigstacksize);
> 
> case _SC_SIGSTKSZ:
>   {
> /* Return MAX (MINSIGSTKSZ, sysconf (_SC_MINSIGSTKSZ)) * 4.  */
> long int minsigstacksize = GLRO(dl_minsigstacksize);
> _Static_assert (__builtin_constant_p (MINSIGSTKSZ),
> "MINSIGSTKSZ is constant");
> if (minsigstacksize < MINSIGSTKSZ)
>   minsigstacksize = MINSIGSTKSZ;
> return minsigstacksize * 4;
>   }
> 
> >
> > (Or something like that, unless the architecture provides its own
> > definitions.  ia64's MINSIGSTKSZ is enormous, so it probably doesn't
> > want this.)
> >
> >
> > Also: should all these values be rounded up to a multiple of the
> > architecture's preferred stack alignment?
> 
> Kernel should provide a properly aligned AT_MINSIGSTKSZ.

OK.  Can you comment on Chang S. Bae's series?  I wasn't sure that the
proposal produces an aligned value for AT_MINSIGSTKSZ on x86, but maybe
I just worked it out wrong.


> > Should the preferred stack alignment also be exposed through sysconf?
> > Portable code otherwise has no way to know this, though if the
> &g

Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-07 Thread Dave Martin
On Tue, Oct 06, 2020 at 10:44:14AM -0700, H.J. Lu wrote:
> On Tue, Oct 6, 2020 at 9:55 AM Dave Martin  wrote:
> >
> > On Tue, Oct 06, 2020 at 08:34:06AM -0700, H.J. Lu wrote:
> > > On Tue, Oct 6, 2020 at 8:25 AM Dave Martin  wrote:
> > > >
> > > > On Tue, Oct 06, 2020 at 05:12:29AM -0700, H.J. Lu wrote:
> > > > > On Tue, Oct 6, 2020 at 2:25 AM Dave Martin  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Oct 05, 2020 at 10:17:06PM +0100, H.J. Lu wrote:
> > > > > > > On Mon, Oct 5, 2020 at 6:45 AM Dave Martin  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 29, 2020 at 01:57:42PM -0700, Chang S. Bae wrote:
> > > > > > > > > During signal entry, the kernel pushes data onto the normal 
> > > > > > > > > userspace
> > > > > > > > > stack. On x86, the data pushed onto the user stack includes 
> > > > > > > > > XSAVE state,
> > > > > > > > > which has grown over time as new features and larger 
> > > > > > > > > registers have been
> > > > > > > > > added to the architecture.
> > > > > > > > >
> > > > > > > > > MINSIGSTKSZ is a constant provided in the kernel signal.h 
> > > > > > > > > headers and
> > > > > > > > > typically distributed in lib-dev(el) packages, e.g. [1]. Its 
> > > > > > > > > value is
> > > > > > > > > compiled into programs and is part of the user/kernel ABI. 
> > > > > > > > > The MINSIGSTKSZ
> > > > > > > > > constant indicates to userspace how much data the kernel 
> > > > > > > > > expects to push on
> > > > > > > > > the user stack, [2][3].
> > > > > > > > >
> > > > > > > > > However, this constant is much too small and does not reflect 
> > > > > > > > > recent
> > > > > > > > > additions to the architecture. For instance, when AVX-512 
> > > > > > > > > states are in
> > > > > > > > > use, the signal frame size can be 3.5KB while MINSIGSTKSZ 
> > > > > > > > > remains 2KB.
> > > > > > > > >
> > > > > > > > > The bug report [4] explains this as an ABI issue. The small 
> > > > > > > > > MINSIGSTKSZ can
> > > > > > > > > cause user stack overflow when delivering a signal.
> > > > > > > > >
> > > > > > > > > In this series, we suggest a couple of things:
> > > > > > > > > 1. Provide a variable minimum stack size to userspace, as a 
> > > > > > > > > similar
> > > > > > > > >approach to [5]
> > > > > > > > > 2. Avoid using a too-small alternate stack
> > > > > > > >
> > > > > > > > I can't comment on the x86 specifics, but the approach followed 
> > > > > > > > in this
> > > > > > > > series does seem consistent with the way arm64 populates
> > > > > > > > AT_MINSIGSTKSZ.
> > > > > > > >
> > > > > > > > I need to dig up my glibc hacks for providing a sysconf 
> > > > > > > > interface to
> > > > > > > > this...
> > > > > > >
> > > > > > > Here is my proposal for glibc:
> > > > > > >
> > > > > > > https://sourceware.org/pipermail/libc-alpha/2020-September/118098.html
> > > > > >
> > > > > > Thanks for the link.
> > > > > >
> > > > > > Are there patches yet?  I already had some hacks in the works, but 
> > > > > > I can
> > > > > > drop them if there's something already out there.
> > > > >
> > > > > I am working on it.
> > > >
> > > > OK.  I may post something for discussion, but I'm happy for it to be
> > > > superseded by someone (i.e., other than me) who actually knows what
> > > > they're doing...
> > >
> > > Please see my previous email for my glibc patch:
> > 

Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-07 Thread Dave Martin
On Tue, Oct 06, 2020 at 11:30:42AM -0700, Dave Hansen wrote:
> On 10/6/20 10:00 AM, Dave Martin wrote:
> > On Tue, Oct 06, 2020 at 08:33:47AM -0700, Dave Hansen wrote:
> >> On 10/6/20 8:25 AM, Dave Martin wrote:
> >>> Or are people reporting real stack overruns on x86 today?
> >> We have real overruns.  We have ~2800 bytes of XSAVE (regisiter) state
> >> mostly from AVX-512, and a 2048 byte MINSIGSTKSZ.
> > Right.  Out of interest, do you believe that's a direct consequence of
> > the larger kernel-generated signal frame, or does the expansion of
> > userspace stack frames play a role too?
> 
> The kernel-generated signal frame is entirely responsible for the ~2800
> bytes that I'm talking about.
> 
> I'm sure there are some systems where userspace plays a role, but those
> are much less of a worry at the moment, since the kernel-induced
> overflows mean an instant crash that userspace has no recourse for.

Ack, that sounds pretty convincing.

Cheers
---Dave


Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-07 Thread Dave Martin
On Tue, Oct 06, 2020 at 08:21:00PM +0200, Florian Weimer wrote:
> * Dave Martin via Libc-alpha:
> 
> > On Tue, Oct 06, 2020 at 08:33:47AM -0700, Dave Hansen wrote:
> >> On 10/6/20 8:25 AM, Dave Martin wrote:
> >> > Or are people reporting real stack overruns on x86 today?
> >> 
> >> We have real overruns.  We have ~2800 bytes of XSAVE (regisiter) state
> >> mostly from AVX-512, and a 2048 byte MINSIGSTKSZ.
> >
> > Right.  Out of interest, do you believe that's a direct consequence of
> > the larger kernel-generated signal frame, or does the expansion of
> > userspace stack frames play a role too?
> 
> I must say that I do not quite understand this question.
> 
> 32 64-*byte* registers simply need 2048 bytes of storage space worst
> case, there is really no way around that.

If the architecture grows more or bigger registers, and if those
registers are used in general-purpose code, then all stack frames will
tend to grow, not just the signal frame.

So a stack overflow might be caused by the larger signal frame by
itself; or it might be caused by the growth of the stack of 20 function
frames created by someone's signal handler.

In the latter case, this is just a "normal" stack overflow, and nothing
really to do with signals or SIGSTKSZ.  Rebuilding with different
compiler flags could also grow the stack usage and cause just the same
problem.

I also strongly suspect that people often don't think about signal
nesting when allocating signal stacks.  So, there might be a pre-
existing potential overflow that just becomes more likely when the
signal frame grows.  That's not really SIGSTKSZ's fault.


Of course, AVX-512 might never be used in general-purpose code.  On
AArch64, SVE can be used in general-purpose code, but it's too early to
say what its prevalence will be in signal handlers.  Probably low.


> > In practice software just assumes SIGSTKSZ and then ignores the problem
> > until / unless an actual stack overflow is seen.
> >
> > There's probably a lot of software out there whose stack is
> > theoretically too small even without AVX-512 etc. in the mix, especially
> > when considering the possibility of nested signals...
> 
> That is certainly true.  We have seen problems with ntpd, which
> requested a 16 KiB stack, at a time when there were various deductions
> from the stack size, and since the glibc dynamic loader also uses XSAVE,
> ntpd exceeded the remaining stack space.  But in this case, we just
> fudged the stack size computation in pthread_create and made it less
> likely that the dynamic loader was activated, which largely worked
> around this particular problem.  For MINSIGSTKSZ, we just don't have
> this option because it's simply too small in the first place.
> 
> I don't immediately recall a bug due to SIGSTKSZ being too small.  The
> test cases I wrote for this were all artificial, to raise awareness of
> this issue (applications treating these as recommended values, rather
> than minimum value to avoid immediately sigaltstack/phtread_create
> failures, same issue with PTHREAD_STACK_MIN).

Ack, I think if SIGSTKSZ was too small significantly often, there would
be more awareness of the issue.

Cheers
---Dave


Re: [RFC PATCH 1/4] x86/signal: Introduce helpers to get the maximum signal frame size

2020-10-07 Thread Dave Martin
On Tue, Oct 06, 2020 at 05:45:24PM +, Bae, Chang Seok wrote:
> On Mon, 2020-10-05 at 14:42 +0100, Dave Martin wrote:
> > On Tue, Sep 29, 2020 at 01:57:43PM -0700, Chang S. Bae wrote:
> > > 
> > > +/*
> > > + * The FP state frame contains an XSAVE buffer which must be 64-byte 
> > > aligned.
> > > + * If a signal frame starts at an unaligned address, extra space is 
> > > required.
> > > + * This is the max alignment padding, conservatively.
> > > + */
> > > +#define MAX_XSAVE_PADDING63UL
> > > +
> > > +/*
> > > + * The frame data is composed of the following areas and laid out as:
> > > + *
> > > + * -
> > > + * | alignment padding |
> > > + * -
> > > + * | (f)xsave frame|
> > > + * -
> > > + * | fsave header  |
> > > + * -
> > > + * | siginfo + ucontext|
> > > + * -
> > > + */
> > > +
> > > +/* max_frame_size tells userspace the worst case signal stack size. */
> > > +static unsigned long __ro_after_init max_frame_size;
> > > +
> > > +void __init init_sigframe_size(void)
> > > +{
> > > + /*
> > > +  * Use the largest of possible structure formats. This might
> > > +  * slightly oversize the frame for 64-bit apps.
> > > +  */
> > > +
> > > + if (IS_ENABLED(CONFIG_X86_32) ||
> > > + IS_ENABLED(CONFIG_IA32_EMULATION))
> > > + max_frame_size = max((unsigned long)SIZEOF_sigframe_ia32,
> > > +  (unsigned long)SIZEOF_rt_sigframe_ia32);
> > > +
> > > + if (IS_ENABLED(CONFIG_X86_X32_ABI))
> > > + max_frame_size = max(max_frame_size, (unsigned 
> > > long)SIZEOF_rt_sigframe_x32);
> > > +
> > > + if (IS_ENABLED(CONFIG_X86_64))
> > > + max_frame_size = max(max_frame_size, (unsigned 
> > > long)SIZEOF_rt_sigframe);
> > > +
> > > + max_frame_size += fpu__get_fpstate_sigframe_size() + MAX_XSAVE_PADDING;
> > 
> > For arm64, we round the worst-case padding up by one.
> > 
> 
> Yeah, I saw that. The ARM code adds the max padding, too:
> 
>   signal_minsigstksz = sigframe_size() +
>   round_up(sizeof(struct frame_record), 16) +
>   16; /* max alignment padding */
> 
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/signal.c#n973
> 
> > I can't remember the full rationale for this, but it at least seemed a
> > bit weird to report a size that is not a multiple of the alignment.
> > 
> 
> Because the last state size of XSAVE may not be 64B aligned, the (reported)
> sum of xstate size here does not guarantee 64B alignment.
> 
> > I'm can't think of a clear argument as to why it really matters, though.
> 
> We care about the start of XSAVE buffer for the XSAVE instructions, to be
> 64B-aligned.

Ah, I see.  That makes sense.

For arm64, there is no additional alignment padding inside the frame,
only the padding inserted after the frame to ensure that the base
address is 16-byte aligned.

However, I wonder whether people will tend to assume that AT_MINSIGSTKSZ
is a sensible (if minimal) amount of stack to allocate.  Allocating an
odd number of bytes, or any amount that isn't a multiple of the
architecture's preferred (or mandated) stack alignment probably doesn't
make sense.

AArch64 has a mandatory stack alignment of 16 bytes; I'm not sure about
x86.
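
If we did want AT_MINSIGSTKSZ to come out aligned, it would only take
something like this on the reporting side (purely illustrative; 16 stands
for whatever the ABI's stack alignment is):

/* Round the advertised value up to the ABI stack alignment, so that
 * AT_MINSIGSTKSZ is itself a plausible stack size. */
static unsigned long align_sigframe_size(unsigned long size, unsigned long align)
{
        return (size + align - 1) & ~(align - 1);
}

/* e.g.: max_frame_size = align_sigframe_size(max_frame_size, 16); */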

Cheers
---Dave


Re: [BUG][PATCH v3] crypto: arm64: Use x16 with indirect branch to bti_c

2020-10-07 Thread Dave Martin
On Tue, Oct 06, 2020 at 11:33:26AM -0500, Jeremy Linton wrote:
> The AES code uses a 'br x7' as part of a function called by
> a macro. That branch needs a bti_j as a target. This results
> in a panic as seen below. Using x16 (or x17) with an indirect
> branch keeps the target bti_c.
> 
>   Bad mode in Synchronous Abort handler detected on CPU1, code 0x3403 -- 
> BTI
>   CPU: 1 PID: 265 Comm: cryptomgr_test Not tainted 5.8.11-300.fc33.aarch64 #1
>   pstate: 20400c05 (nzCv daif +PAN -UAO BTYPE=j-)
>   pc : aesbs_encrypt8+0x0/0x5f0 [aes_neon_bs]
>   lr : aesbs_xts_encrypt+0x48/0xe0 [aes_neon_bs]
>   sp : 80001052b730
> 
>   aesbs_encrypt8+0x0/0x5f0 [aes_neon_bs]
>__xts_crypt+0xb0/0x2dc [aes_neon_bs]
>xts_encrypt+0x28/0x3c [aes_neon_bs]
>   crypto_skcipher_encrypt+0x50/0x84
>   simd_skcipher_encrypt+0xc8/0xe0
>   crypto_skcipher_encrypt+0x50/0x84
>   test_skcipher_vec_cfg+0x224/0x5f0
>   test_skcipher+0xbc/0x120
>   alg_test_skcipher+0xa0/0x1b0
>   alg_test+0x3dc/0x47c
>   cryptomgr_test+0x38/0x60
> 
> Fixes: 0e89640b640d ("crypto: arm64 - Use modern annotations for assembly 
> functions")
> Signed-off-by: Jeremy Linton 

Reviewed-by: Dave Martin 

Note, if we ended up with any veneered function calls in the mix while
x16 is live, this register could get clobbered.

Given the self-contained nature of this code though, it seems highly
unlikely that we will ever have multiple code sections or external calls
here.

Cheers
---Dave

> ---
>  arch/arm64/crypto/aes-neonbs-core.S | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/crypto/aes-neonbs-core.S 
> b/arch/arm64/crypto/aes-neonbs-core.S
> index b357164379f6..63a52ad9a75c 100644
> --- a/arch/arm64/crypto/aes-neonbs-core.S
> +++ b/arch/arm64/crypto/aes-neonbs-core.S
> @@ -788,7 +788,7 @@ SYM_FUNC_START_LOCAL(__xts_crypt8)
>  
>  0:   mov bskey, x21
>   mov rounds, x22
> - br  x7
> + br  x16
>  SYM_FUNC_END(__xts_crypt8)
>  
>   .macro  __xts_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
> @@ -806,7 +806,7 @@ SYM_FUNC_END(__xts_crypt8)
>   uzp1v30.4s, v30.4s, v25.4s
>   ld1 {v25.16b}, [x24]
>  
> -99:  adr x7, \do8
> +99:  adr x16, \do8
>   bl  __xts_crypt8
>  
>   ldp q16, q17, [sp, #.Lframe_local_offset]
> -- 
> 2.25.4
> 
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-06 Thread Dave Martin
On Tue, Oct 06, 2020 at 08:33:47AM -0700, Dave Hansen wrote:
> On 10/6/20 8:25 AM, Dave Martin wrote:
> > Or are people reporting real stack overruns on x86 today?
> 
> We have real overruns.  We have ~2800 bytes of XSAVE (regisiter) state
> mostly from AVX-512, and a 2048 byte MINSIGSTKSZ.

Right.  Out of interest, do you believe that's a direct consequence of
the larger kernel-generated signal frame, or does the expansion of
userspace stack frames play a role too?

In practice software just assumes SIGSTKSZ and then ignores the problem
until / unless an actual stack overflow is seen.

There's probably a lot of software out there whose stack is
theoretically too small even without AVX-512 etc. in the mix, especially
when considering the possibility of nested signals...

Cheers
---Dave


Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-06 Thread Dave Martin
On Tue, Oct 06, 2020 at 08:34:06AM -0700, H.J. Lu wrote:
> On Tue, Oct 6, 2020 at 8:25 AM Dave Martin  wrote:
> >
> > On Tue, Oct 06, 2020 at 05:12:29AM -0700, H.J. Lu wrote:
> > > On Tue, Oct 6, 2020 at 2:25 AM Dave Martin  wrote:
> > > >
> > > > On Mon, Oct 05, 2020 at 10:17:06PM +0100, H.J. Lu wrote:
> > > > > On Mon, Oct 5, 2020 at 6:45 AM Dave Martin  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Sep 29, 2020 at 01:57:42PM -0700, Chang S. Bae wrote:
> > > > > > > During signal entry, the kernel pushes data onto the normal 
> > > > > > > userspace
> > > > > > > stack. On x86, the data pushed onto the user stack includes XSAVE 
> > > > > > > state,
> > > > > > > which has grown over time as new features and larger registers 
> > > > > > > have been
> > > > > > > added to the architecture.
> > > > > > >
> > > > > > > MINSIGSTKSZ is a constant provided in the kernel signal.h headers 
> > > > > > > and
> > > > > > > typically distributed in lib-dev(el) packages, e.g. [1]. Its 
> > > > > > > value is
> > > > > > > compiled into programs and is part of the user/kernel ABI. The 
> > > > > > > MINSIGSTKSZ
> > > > > > > constant indicates to userspace how much data the kernel expects 
> > > > > > > to push on
> > > > > > > the user stack, [2][3].
> > > > > > >
> > > > > > > However, this constant is much too small and does not reflect 
> > > > > > > recent
> > > > > > > additions to the architecture. For instance, when AVX-512 states 
> > > > > > > are in
> > > > > > > use, the signal frame size can be 3.5KB while MINSIGSTKSZ remains 
> > > > > > > 2KB.
> > > > > > >
> > > > > > > The bug report [4] explains this as an ABI issue. The small 
> > > > > > > MINSIGSTKSZ can
> > > > > > > cause user stack overflow when delivering a signal.
> > > > > > >
> > > > > > > In this series, we suggest a couple of things:
> > > > > > > 1. Provide a variable minimum stack size to userspace, as a 
> > > > > > > similar
> > > > > > >approach to [5]
> > > > > > > 2. Avoid using a too-small alternate stack
> > > > > >
> > > > > > I can't comment on the x86 specifics, but the approach followed in 
> > > > > > this
> > > > > > series does seem consistent with the way arm64 populates
> > > > > > AT_MINSIGSTKSZ.
> > > > > >
> > > > > > I need to dig up my glibc hacks for providing a sysconf interface to
> > > > > > this...
> > > > >
> > > > > Here is my proposal for glibc:
> > > > >
> > > > > https://sourceware.org/pipermail/libc-alpha/2020-September/118098.html
> > > >
> > > > Thanks for the link.
> > > >
> > > > Are there patches yet?  I already had some hacks in the works, but I can
> > > > drop them if there's something already out there.
> > >
> > > I am working on it.
> >
> > OK.  I may post something for discussion, but I'm happy for it to be
> > superseded by someone (i.e., other than me) who actually knows what
> > they're doing...
> 
> Please see my previous email for my glibc patch:
> 
> https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/AT_MINSIGSTKSZ
> 
> > > >
> > > > > 1. Define SIGSTKSZ and MINSIGSTKSZ to 64KB.
> > > >
> > > > Can we do this?  IIUC, this is an ABI break and carries the risk of
> > > > buffer overruns.
> > > >
> > > > The reason for not simply increasing the kernel's MINSIGSTKSZ #define
> > > > (apart from the fact that it is rarely used, due to glibc's shadowing
> > > > definitions) was that userspace binaries will have baked in the old
> > > > value of the constant and may be making assumptions about it.
> > > >
> > > > For example, the type (char [MINSIGSTKSZ]) changes if this #define
> > > > changes.  This could be a problem if an newly built library tries to
> > > > memcpy() or dump suc

Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-06 Thread Dave Martin
On Tue, Oct 06, 2020 at 08:18:03AM -0700, H.J. Lu wrote:
> On Tue, Oct 6, 2020 at 5:12 AM H.J. Lu  wrote:
> >
> > On Tue, Oct 6, 2020 at 2:25 AM Dave Martin  wrote:
> > >
> > > On Mon, Oct 05, 2020 at 10:17:06PM +0100, H.J. Lu wrote:
> > > > On Mon, Oct 5, 2020 at 6:45 AM Dave Martin  wrote:
> > > > >
> > > > > On Tue, Sep 29, 2020 at 01:57:42PM -0700, Chang S. Bae wrote:
> > > > > > During signal entry, the kernel pushes data onto the normal 
> > > > > > userspace
> > > > > > stack. On x86, the data pushed onto the user stack includes XSAVE 
> > > > > > state,
> > > > > > which has grown over time as new features and larger registers have 
> > > > > > been
> > > > > > added to the architecture.
> > > > > >
> > > > > > MINSIGSTKSZ is a constant provided in the kernel signal.h headers 
> > > > > > and
> > > > > > typically distributed in lib-dev(el) packages, e.g. [1]. Its value 
> > > > > > is
> > > > > > compiled into programs and is part of the user/kernel ABI. The 
> > > > > > MINSIGSTKSZ
> > > > > > constant indicates to userspace how much data the kernel expects to 
> > > > > > push on
> > > > > > the user stack, [2][3].
> > > > > >
> > > > > > However, this constant is much too small and does not reflect recent
> > > > > > additions to the architecture. For instance, when AVX-512 states 
> > > > > > are in
> > > > > > use, the signal frame size can be 3.5KB while MINSIGSTKSZ remains 
> > > > > > 2KB.
> > > > > >
> > > > > > The bug report [4] explains this as an ABI issue. The small 
> > > > > > MINSIGSTKSZ can
> > > > > > cause user stack overflow when delivering a signal.
> > > > > >
> > > > > > In this series, we suggest a couple of things:
> > > > > > 1. Provide a variable minimum stack size to userspace, as a similar
> > > > > >approach to [5]
> > > > > > 2. Avoid using a too-small alternate stack
> > > > >
> > > > > I can't comment on the x86 specifics, but the approach followed in 
> > > > > this
> > > > > series does seem consistent with the way arm64 populates
> > > > > AT_MINSIGSTKSZ.
> > > > >
> > > > > I need to dig up my glibc hacks for providing a sysconf interface to
> > > > > this...
> > > >
> > > > Here is my proposal for glibc:
> > > >
> > > > https://sourceware.org/pipermail/libc-alpha/2020-September/118098.html
> > >
> > > Thanks for the link.
> > >
> > > Are there patches yet?  I already had some hacks in the works, but I can
> > > drop them if there's something already out there.
> >
> > I am working on it.
> >
> > >
> > > > 1. Define SIGSTKSZ and MINSIGSTKSZ to 64KB.
> > >
> > > Can we do this?  IIUC, this is an ABI break and carries the risk of
> > > buffer overruns.
> > >
> > > The reason for not simply increasing the kernel's MINSIGSTKSZ #define
> > > (apart from the fact that it is rarely used, due to glibc's shadowing
> > > definitions) was that userspace binaries will have baked in the old
> > > value of the constant and may be making assumptions about it.
> > >
> > > For example, the type (char [MINSIGSTKSZ]) changes if this #define
> > > changes.  This could be a problem if an newly built library tries to
> > > memcpy() or dump such an object defined by and old binary.
> > > Bounds-checking and the stack sizes passed to things like sigaltstack()
> > > and makecontext() could similarly go wrong.
> >
> > With my original proposal:
> >
> > https://sourceware.org/pipermail/libc-alpha/2020-September/118028.html
> >
> > char [MINSIGSTKSZ] won't compile.  The feedback is to increase the
> > constants:
> >
> > https://sourceware.org/pipermail/libc-alpha/2020-September/118092.html
> >
> > >
> > > > 2. Add _SC_RSVD_SIG_STACK_SIZE for signal stack size reserved by the 
> > > > kernel.
> > >
> > > How about "_SC_MINSIGSTKSZ"?  This was my initial choice since only the
> > > discovery method is changing.  The meaning of the value is e

Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-06 Thread Dave Martin
On Tue, Oct 06, 2020 at 05:12:29AM -0700, H.J. Lu wrote:
> On Tue, Oct 6, 2020 at 2:25 AM Dave Martin  wrote:
> >
> > On Mon, Oct 05, 2020 at 10:17:06PM +0100, H.J. Lu wrote:
> > > On Mon, Oct 5, 2020 at 6:45 AM Dave Martin  wrote:
> > > >
> > > > On Tue, Sep 29, 2020 at 01:57:42PM -0700, Chang S. Bae wrote:
> > > > > During signal entry, the kernel pushes data onto the normal userspace
> > > > > stack. On x86, the data pushed onto the user stack includes XSAVE 
> > > > > state,
> > > > > which has grown over time as new features and larger registers have 
> > > > > been
> > > > > added to the architecture.
> > > > >
> > > > > MINSIGSTKSZ is a constant provided in the kernel signal.h headers and
> > > > > typically distributed in lib-dev(el) packages, e.g. [1]. Its value is
> > > > > compiled into programs and is part of the user/kernel ABI. The 
> > > > > MINSIGSTKSZ
> > > > > constant indicates to userspace how much data the kernel expects to 
> > > > > push on
> > > > > the user stack, [2][3].
> > > > >
> > > > > However, this constant is much too small and does not reflect recent
> > > > > additions to the architecture. For instance, when AVX-512 states are 
> > > > > in
> > > > > use, the signal frame size can be 3.5KB while MINSIGSTKSZ remains 2KB.
> > > > >
> > > > > The bug report [4] explains this as an ABI issue. The small 
> > > > > MINSIGSTKSZ can
> > > > > cause user stack overflow when delivering a signal.
> > > > >
> > > > > In this series, we suggest a couple of things:
> > > > > 1. Provide a variable minimum stack size to userspace, as a similar
> > > > >approach to [5]
> > > > > 2. Avoid using a too-small alternate stack
> > > >
> > > > I can't comment on the x86 specifics, but the approach followed in this
> > > > series does seem consistent with the way arm64 populates
> > > > AT_MINSIGSTKSZ.
> > > >
> > > > I need to dig up my glibc hacks for providing a sysconf interface to
> > > > this...
> > >
> > > Here is my proposal for glibc:
> > >
> > > https://sourceware.org/pipermail/libc-alpha/2020-September/118098.html
> >
> > Thanks for the link.
> >
> > Are there patches yet?  I already had some hacks in the works, but I can
> > drop them if there's something already out there.
> 
> I am working on it.

OK.  I may post something for discussion, but I'm happy for it to be
superseded by someone (i.e., other than me) who actually knows what
they're doing...

> >
> > > 1. Define SIGSTKSZ and MINSIGSTKSZ to 64KB.
> >
> > Can we do this?  IIUC, this is an ABI break and carries the risk of
> > buffer overruns.
> >
> > The reason for not simply increasing the kernel's MINSIGSTKSZ #define
> > (apart from the fact that it is rarely used, due to glibc's shadowing
> > definitions) was that userspace binaries will have baked in the old
> > value of the constant and may be making assumptions about it.
> >
> > For example, the type (char [MINSIGSTKSZ]) changes if this #define
> > changes.  This could be a problem if an newly built library tries to
> > memcpy() or dump such an object defined by and old binary.
> > Bounds-checking and the stack sizes passed to things like sigaltstack()
> > and makecontext() could similarly go wrong.
> 
> With my original proposal:
> 
> https://sourceware.org/pipermail/libc-alpha/2020-September/118028.html
> 
> char [MINSIGSTKSZ] won't compile.  The feedback is to increase the
> constants:
> 
> https://sourceware.org/pipermail/libc-alpha/2020-September/118092.html

Ah, I see.  But both are still API and ABI breaks; moreover, declaring an
array with size based on (MIN)SIGSTKSZ is not just reasonable, but the
obvious thing to do with this constant in many simple cases.  Such usage
is widespread, see:

 * https://codesearch.debian.net/search?q=%5BSIGSTKSZ%5D=1


Your two approaches seem to trade off two different sources of buffer
overruns: undersized stacks versus ABI breaks across library boundaries.

Since an undersized stack is by far the more familiar problem and we at
least have guard regions to help detect overruns, I'd vote to keep
MINSIGSTKSZ and SIGSTKSZ as-is, at least for now.

Or are people reporting real stack overruns on x86 today?


For arm64, we made large vectors on SVE opt-in, so that oversized signal
frames are not see

Re: [BUG][PATCH] crypto: arm64: Avoid indirect branch to bti_c

2020-10-06 Thread Dave Martin
On Tue, Oct 06, 2020 at 11:25:11AM +0100, Catalin Marinas wrote:
> On Tue, Oct 06, 2020 at 11:01:21AM +0100, Dave P Martin wrote:
> > On Tue, Oct 06, 2020 at 09:27:48AM +0100, Will Deacon wrote:
> > > On Mon, Oct 05, 2020 at 10:48:54PM -0500, Jeremy Linton wrote:
> > > > The AES code uses a 'br x7' as part of a function called by
> > > > a macro. That branch needs a bti_j as a target. This results
> > > > in a panic as seen below. Instead of trying to replace the branch
> > > > target with a bti_jc, lets replace the indirect branch with a
> > > > bl/ret, bl sequence that can target the existing bti_c.
> > > > 
> > > >   Bad mode in Synchronous Abort handler detected on CPU1, code 
> > > > 0x3403 -- BTI
> > > >   CPU: 1 PID: 265 Comm: cryptomgr_test Not tainted 
> > > > 5.8.11-300.fc33.aarch64 #1
> > > >   pstate: 20400c05 (nzCv daif +PAN -UAO BTYPE=j-)
> > > >   pc : aesbs_encrypt8+0x0/0x5f0 [aes_neon_bs]
> > > >   lr : aesbs_xts_encrypt+0x48/0xe0 [aes_neon_bs]
> > > >   sp : 80001052b730
> > > > 
> > > >   aesbs_encrypt8+0x0/0x5f0 [aes_neon_bs]
> > > >__xts_crypt+0xb0/0x2dc [aes_neon_bs]
> > > >xts_encrypt+0x28/0x3c [aes_neon_bs]
> > > >   crypto_skcipher_encrypt+0x50/0x84
> > > >   simd_skcipher_encrypt+0xc8/0xe0
> > > >   crypto_skcipher_encrypt+0x50/0x84
> > > >   test_skcipher_vec_cfg+0x224/0x5f0
> > > >   test_skcipher+0xbc/0x120
> > > >   alg_test_skcipher+0xa0/0x1b0
> > > >   alg_test+0x3dc/0x47c
> > > >   cryptomgr_test+0x38/0x60
> > > > 
> > > > Fixes: commit 0e89640b640d ("crypto: arm64 - Use modern annotations for 
> > > > assembly functions")
> > > 
> > > nit: the "commit" string shouldn't be here, and I think the linux-next
> > > scripts will yell at us if we don't remove it.
> > > 
> > > > Signed-off-by: Jeremy Linton 
> > > > ---
> > > >  arch/arm64/crypto/aes-neonbs-core.S | 6 +++---
> > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/arch/arm64/crypto/aes-neonbs-core.S 
> > > > b/arch/arm64/crypto/aes-neonbs-core.S
> > > > index b357164379f6..32f53ebe5e2c 100644
> > > > --- a/arch/arm64/crypto/aes-neonbs-core.S
> > > > +++ b/arch/arm64/crypto/aes-neonbs-core.S
> > > > @@ -788,7 +788,7 @@ SYM_FUNC_START_LOCAL(__xts_crypt8)
> > > >  
> > > >  0: mov bskey, x21
> > > > mov rounds, x22
> > > > -   br  x7
> > > > +   ret
> > 
> > Dang, replied on an old version.
> 
> Which I ignored (by default, when the kbuild test robot complains ;)).
> 
> > Since this is logically a tail call, could we simply be using br x16 or
> > br x17 for this?
> > 
> > The architecture makes special provision for that so that the compiler
> > can generate tail-calls.
> 
> So a "br x16" is compatible with a bti_c landing pad. I think it makes
> more sense to keep it as a tail call.

Just to be clear, I'm happy either way, but I thought it would make
sense to point this out.

Normally, "bti j" would be used just for weird stuff like jump tables,
but .S files all count as "weird stuff" to some extent -- so there are
no hard and fast rules.

Cheers
---Dave


Re: [BUG][PATCH] crypto: arm64: Avoid indirect branch to bti_c

2020-10-06 Thread Dave Martin
On Tue, Oct 06, 2020 at 09:27:48AM +0100, Will Deacon wrote:
> On Mon, Oct 05, 2020 at 10:48:54PM -0500, Jeremy Linton wrote:
> > The AES code uses a 'br x7' as part of a function called by
> > a macro. That branch needs a bti_j as a target. This results
> > in a panic as seen below. Instead of trying to replace the branch
> > target with a bti_jc, lets replace the indirect branch with a
> > bl/ret, bl sequence that can target the existing bti_c.
> > 
> >   Bad mode in Synchronous Abort handler detected on CPU1, code 0x3403 
> > -- BTI
> >   CPU: 1 PID: 265 Comm: cryptomgr_test Not tainted 5.8.11-300.fc33.aarch64 
> > #1
> >   pstate: 20400c05 (nzCv daif +PAN -UAO BTYPE=j-)
> >   pc : aesbs_encrypt8+0x0/0x5f0 [aes_neon_bs]
> >   lr : aesbs_xts_encrypt+0x48/0xe0 [aes_neon_bs]
> >   sp : 80001052b730
> > 
> >   aesbs_encrypt8+0x0/0x5f0 [aes_neon_bs]
> >__xts_crypt+0xb0/0x2dc [aes_neon_bs]
> >xts_encrypt+0x28/0x3c [aes_neon_bs]
> >   crypto_skcipher_encrypt+0x50/0x84
> >   simd_skcipher_encrypt+0xc8/0xe0
> >   crypto_skcipher_encrypt+0x50/0x84
> >   test_skcipher_vec_cfg+0x224/0x5f0
> >   test_skcipher+0xbc/0x120
> >   alg_test_skcipher+0xa0/0x1b0
> >   alg_test+0x3dc/0x47c
> >   cryptomgr_test+0x38/0x60
> > 
> > Fixes: commit 0e89640b640d ("crypto: arm64 - Use modern annotations for 
> > assembly functions")
> 
> nit: the "commit" string shouldn't be here, and I think the linux-next
> scripts will yell at us if we don't remove it.
> 
> > Signed-off-by: Jeremy Linton 
> > ---
> >  arch/arm64/crypto/aes-neonbs-core.S | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/arm64/crypto/aes-neonbs-core.S 
> > b/arch/arm64/crypto/aes-neonbs-core.S
> > index b357164379f6..32f53ebe5e2c 100644
> > --- a/arch/arm64/crypto/aes-neonbs-core.S
> > +++ b/arch/arm64/crypto/aes-neonbs-core.S
> > @@ -788,7 +788,7 @@ SYM_FUNC_START_LOCAL(__xts_crypt8)
> >  
> >  0: mov bskey, x21
> > mov rounds, x22
> > -   br  x7
> > +   ret

Dang, replied on an old version.

Since this is logically a tail call, could we simply be using br x16 or
br x17 for this?

The architecture makes special provision for that so that the compiler
can generate tail-calls.


This assumes that those regs aren't clobbered by any veneered function
call in the meantime, but all the calls here are local, so I don't think
that is a concern.

[...]

Cheers
---Dave


Re: [BUG][PATCH] arm64: bti: fix BTI to handle local indirect branches

2020-10-06 Thread Dave Martin
On Mon, Oct 05, 2020 at 02:24:47PM -0500, Jeremy Linton wrote:
> Hi,
> 
> On 10/5/20 1:54 PM, Ard Biesheuvel wrote:
> >On Mon, 5 Oct 2020 at 20:18, Jeremy Linton  wrote:
> >>
> >>The AES code uses a 'br x7' as part of a function called by
> >>a macro, that ends up needing a BTI_J as a target.
> >
> >Could we instead just drop the tail call, i.e, replace it with a ret
> >and do a 'bl' after it returns? The indirect call does not really
> >serve a purpose here anyway
> 
> Yes, that is an option, it adds an extra ret. Which probably doesn't mean
> much in most cases. I assumed this code was optimized this way because it
> mattered somewhere.

Since this really does seem to be a tail-call and since x16 and x17
appear to be otherwise unused here, can we not just use x16 or x17
instead of x7?

This relies on there being no other calls to veneered functions in the
mix, but this code is all in a single section so that shouldn't be a
concern.

Due to the magic status of x16 and x17 in br instructions, the resulting
jump should be compatible with BTI c.  I think this matches how the
compiler should typically compile tail-calls.

Cheers
---Dave


Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-06 Thread Dave Martin
On Mon, Oct 05, 2020 at 10:17:06PM +0100, H.J. Lu wrote:
> On Mon, Oct 5, 2020 at 6:45 AM Dave Martin  wrote:
> >
> > On Tue, Sep 29, 2020 at 01:57:42PM -0700, Chang S. Bae wrote:
> > > During signal entry, the kernel pushes data onto the normal userspace
> > > stack. On x86, the data pushed onto the user stack includes XSAVE state,
> > > which has grown over time as new features and larger registers have been
> > > added to the architecture.
> > >
> > > MINSIGSTKSZ is a constant provided in the kernel signal.h headers and
> > > typically distributed in lib-dev(el) packages, e.g. [1]. Its value is
> > > compiled into programs and is part of the user/kernel ABI. The MINSIGSTKSZ
> > > constant indicates to userspace how much data the kernel expects to push 
> > > on
> > > the user stack, [2][3].
> > >
> > > However, this constant is much too small and does not reflect recent
> > > additions to the architecture. For instance, when AVX-512 states are in
> > > use, the signal frame size can be 3.5KB while MINSIGSTKSZ remains 2KB.
> > >
> > > The bug report [4] explains this as an ABI issue. The small MINSIGSTKSZ 
> > > can
> > > cause user stack overflow when delivering a signal.
> > >
> > > In this series, we suggest a couple of things:
> > > 1. Provide a variable minimum stack size to userspace, as a similar
> > >approach to [5]
> > > 2. Avoid using a too-small alternate stack
> >
> > I can't comment on the x86 specifics, but the approach followed in this
> > series does seem consistent with the way arm64 populates
> > AT_MINSIGSTKSZ.
> >
> > I need to dig up my glibc hacks for providing a sysconf interface to
> > this...
> 
> Here is my proposal for glibc:
> 
> https://sourceware.org/pipermail/libc-alpha/2020-September/118098.html

Thanks for the link.

Are there patches yet?  I already had some hacks in the works, but I can
drop them if there's something already out there.


> 1. Define SIGSTKSZ and MINSIGSTKSZ to 64KB.

Can we do this?  IIUC, this is an ABI break and carries the risk of
buffer overruns.

The reason for not simply increasing the kernel's MINSIGSTKSZ #define
(apart from the fact that it is rarely used, due to glibc's shadowing
definitions) was that userspace binaries will have baked in the old
value of the constant and may be making assumptions about it.

For example, the type (char [MINSIGSTKSZ]) changes if this #define
changes.  This could be a problem if a newly built library tries to
memcpy() or dump such an object defined by an old binary.
Bounds-checking and the stack sizes passed to things like sigaltstack()
and makecontext() could similarly go wrong.
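
A contrived sketch of the kind of mismatch I mean (not code from any real
project):

#include <signal.h>
#include <string.h>

/* Compiled long ago against the old headers: */
struct old_ctx {
	char altstack[2048];		/* was: char altstack[MINSIGSTKSZ]; */
};

/* Compiled today, where MINSIGSTKSZ has been bumped to 64KB: */
void save_altstack(void *dst, const struct old_ctx *src)
{
	/* Overruns the 2KB object handed over by the old binary: */
	memcpy(dst, src->altstack, MINSIGSTKSZ);
}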


> 2. Add _SC_RSVD_SIG_STACK_SIZE for signal stack size reserved by the kernel.

How about "_SC_MINSIGSTKSZ"?  This was my initial choice since only the
discovery method is changing.  The meaning of the value is exactly the
same as before.

If we are going to rename it though, it could make sense to go for
something more directly descriptive, say, "_SC_SIGNAL_FRAME_SIZE".

The trouble with including "STKSZ" is that it sounds like a
recommendation for your stack size.  While the signal frame size is
relevant to picking a stack size, it's not the only thing to
consider.


Also, do we need a _SC_SIGSTKSZ constant, or should the entire concept
of a "recommended stack size" be abandoned?  glibc can at least make a
slightly more informed guess about suitable stack sizes than the kernel
(and glibc already has to guess anyway, in order to determine the
default thread stack size).
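
Whatever the name ends up being, the consumer side I have in mind is
roughly the following.  Just a sketch: the _SC_MINSIGSTKSZ name is the
proposal under discussion, not an existing glibc interface, and the
fallback is a guess at what a portable caller would do.

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef _SC_MINSIGSTKSZ
# define _SC_MINSIGSTKSZ (-1)	/* not defined by this libc yet */
#endif

static int setup_altstack(void)
{
	long min = sysconf(_SC_MINSIGSTKSZ);
	stack_t ss = { 0 };

	if (min < 0)
		min = MINSIGSTKSZ;	/* fall back to the legacy constant */

	ss.ss_size = min + 4096;	/* headroom for the handler itself */
	ss.ss_sp = malloc(ss.ss_size);
	if (!ss.ss_sp)
		return -1;

	return sigaltstack(&ss, NULL);
}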


> 3. Deprecate SIGSTKSZ and MINSIGSTKSZ if _SC_RSVD_SIG_STACK_SIZE
> is in use.

Great if we can do it.  I was concerned that this might be
controversial.

Would this just be a recommendation, or can we enforce it somehow?

Cheers
---Dave


Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-05 Thread Dave Martin
On Tue, Sep 29, 2020 at 01:57:42PM -0700, Chang S. Bae wrote:
> During signal entry, the kernel pushes data onto the normal userspace
> stack. On x86, the data pushed onto the user stack includes XSAVE state,
> which has grown over time as new features and larger registers have been
> added to the architecture.
> 
> MINSIGSTKSZ is a constant provided in the kernel signal.h headers and
> typically distributed in lib-dev(el) packages, e.g. [1]. Its value is
> compiled into programs and is part of the user/kernel ABI. The MINSIGSTKSZ
> constant indicates to userspace how much data the kernel expects to push on
> the user stack, [2][3].
> 
> However, this constant is much too small and does not reflect recent
> additions to the architecture. For instance, when AVX-512 states are in
> use, the signal frame size can be 3.5KB while MINSIGSTKSZ remains 2KB.
> 
> The bug report [4] explains this as an ABI issue. The small MINSIGSTKSZ can
> cause user stack overflow when delivering a signal.
> 
> In this series, we suggest a couple of things:
> 1. Provide a variable minimum stack size to userspace, as a similar
>approach to [5]
> 2. Avoid using a too-small alternate stack

I can't comment on the x86 specifics, but the approach followed in this
series does seem consistent with the way arm64 populates
AT_MINSIGSTKSZ.

I need to dig up my glibc hacks for providing a sysconf interface to
this...
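
(In the meantime, the raw discovery on arm64 is just an aux vector read.
A minimal illustration, with the AT_MINSIGSTKSZ define guarded in case
the headers don't have it yet:)

#include <signal.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
# define AT_MINSIGSTKSZ	51	/* value from the arm64 uapi headers */
#endif

static unsigned long min_sigstack_size(void)
{
	unsigned long v = getauxval(AT_MINSIGSTKSZ);

	/* 0 means the kernel didn't supply the entry; use the old constant. */
	return v ? v : MINSIGSTKSZ;
}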

Cheers
---Dave

> 
> [1]: 
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/bits/sigstack.h;h=b9dca794da093dc4d41d39db9851d444e1b54d9b;hb=HEAD
> [2]: https://www.gnu.org/software/libc/manual/html_node/Signal-Stack.html
> [3]: https://man7.org/linux/man-pages/man2/sigaltstack.2.html
> [4]: https://bugzilla.kernel.org/show_bug.cgi?id=153531
> [5]: 
> https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4671/original/plumbers-dm-2017.pdf
> 
> Chang S. Bae (4):
>   x86/signal: Introduce helpers to get the maximum signal frame size
>   x86/elf: Support a new ELF aux vector AT_MINSIGSTKSZ
>   x86/signal: Prevent an alternate stack overflow before a signal
> delivery
>   selftest/x86/signal: Include test cases for validating sigaltstack
> 
>  arch/x86/ia32/ia32_signal.c   |  11 +-
>  arch/x86/include/asm/elf.h|   4 +
>  arch/x86/include/asm/fpu/signal.h |   2 +
>  arch/x86/include/asm/sigframe.h   |  25 +
>  arch/x86/include/uapi/asm/auxvec.h|   6 +-
>  arch/x86/kernel/cpu/common.c  |   3 +
>  arch/x86/kernel/fpu/signal.c  |  20 
>  arch/x86/kernel/signal.c  |  66 +++-
>  tools/testing/selftests/x86/Makefile  |   2 +-
>  tools/testing/selftests/x86/sigaltstack.c | 126 ++
>  10 files changed, 258 insertions(+), 7 deletions(-)
>  create mode 100644 tools/testing/selftests/x86/sigaltstack.c
> 
> --
> 2.17.1
> 


Re: [RFC PATCH 1/4] x86/signal: Introduce helpers to get the maximum signal frame size

2020-10-05 Thread Dave Martin
On Tue, Sep 29, 2020 at 01:57:43PM -0700, Chang S. Bae wrote:
> Signal frames do not have a fixed format and can vary in size when a number
> of things change: support XSAVE features, 32 vs. 64-bit apps. Add the code
> to support a runtime method for userspace to dynamically discover how large
> a signal stack needs to be.
> 
> Introduce a new variable, max_frame_size, and helper functions for the
> calculation to be used in a new user interface. Set max_frame_size to a
> system-wide worst-case value, instead of storing multiple app-specific
> values.
> 
> Locate the body of the helper function -- fpu__get_fpstate_sigframe_size()
> in fpu/signal.c for its relevance.
> 
> Signed-off-by: Chang S. Bae 
> Reviewed-by: Len Brown 
> Cc: x...@kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/include/asm/fpu/signal.h |  2 ++
>  arch/x86/include/asm/sigframe.h   | 23 
>  arch/x86/kernel/cpu/common.c  |  3 +++
>  arch/x86/kernel/fpu/signal.c  | 20 ++
>  arch/x86/kernel/signal.c  | 45 +++
>  5 files changed, 93 insertions(+)

[...]

> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index be0d7d4152ec..239a0b23a4b0 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -663,6 +663,51 @@ SYSCALL_DEFINE0(rt_sigreturn)
>   return 0;
>  }
>  
> +/*
> + * The FP state frame contains an XSAVE buffer which must be 64-byte aligned.
> + * If a signal frame starts at an unaligned address, extra space is required.
> + * This is the max alignment padding, conservatively.
> + */
> +#define MAX_XSAVE_PADDING63UL
> +
> +/*
> + * The frame data is composed of the following areas and laid out as:
> + *
> + * -
> + * | alignment padding |
> + * -
> + * | (f)xsave frame|
> + * -
> + * | fsave header  |
> + * -
> + * | siginfo + ucontext|
> + * -
> + */
> +
> +/* max_frame_size tells userspace the worst case signal stack size. */
> +static unsigned long __ro_after_init max_frame_size;
> +
> +void __init init_sigframe_size(void)
> +{
> + /*
> +  * Use the largest of possible structure formats. This might
> +  * slightly oversize the frame for 64-bit apps.
> +  */
> +
> + if (IS_ENABLED(CONFIG_X86_32) ||
> + IS_ENABLED(CONFIG_IA32_EMULATION))
> + max_frame_size = max((unsigned long)SIZEOF_sigframe_ia32,
> +  (unsigned long)SIZEOF_rt_sigframe_ia32);
> +
> + if (IS_ENABLED(CONFIG_X86_X32_ABI))
> + max_frame_size = max(max_frame_size, (unsigned 
> long)SIZEOF_rt_sigframe_x32);
> +
> + if (IS_ENABLED(CONFIG_X86_64))
> + max_frame_size = max(max_frame_size, (unsigned 
> long)SIZEOF_rt_sigframe);
> +
> + max_frame_size += fpu__get_fpstate_sigframe_size() + MAX_XSAVE_PADDING;

For arm64, we round the worst-case padding up by one.

I can't remember the full rationale for this, but it at least seemed a
bit weird to report a size that is not a multiple of the alignment.

I can't think of a clear argument as to why it really matters, though.
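
For what it's worth, the shape of what arm64 does is roughly this
(made-up helper and macro names, purely for illustration):

#define XSAVE_ALIGN		64UL
#define MAX_XSAVE_PADDING	(XSAVE_ALIGN - 1)

static unsigned long sigframe_size_rounded(unsigned long frames,
					    unsigned long fpstate)
{
	unsigned long size = frames + fpstate + MAX_XSAVE_PADDING;

	/* Report a multiple of the alignment rather than an odd figure: */
	return (size + XSAVE_ALIGN - 1) & ~(XSAVE_ALIGN - 1);
}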

[...]

Cheers
---Dave


Re: [PATCH 5/5] perf: arm_spe: Decode SVE events

2020-10-05 Thread Dave Martin
On Wed, Sep 30, 2020 at 07:04:53PM +0800, Leo Yan wrote:

[...]

> > > > > >> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> > > > > >> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > > > > >> index a033f34846a6..f0c369259554 100644
> > > > > >> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > > > > >> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > > > > >> @@ -372,8 +372,35 @@ int arm_spe_pkt_desc(const struct arm_spe_pkt 
> > > > > >> *packet, char *buf,
> > > > > >>}
> > > > > >>case ARM_SPE_OP_TYPE:
> > > > > >>switch (idx) {
> > > > > >> -  case 0: return snprintf(buf, buf_len, "%s", payload & 
> > > > > >> 0x1 ?
> > > > > >> +  case 0: {
> > > > > >> +  size_t blen = buf_len;
> > > > > >> +
> > > > > >> +  if ((payload & 0x89) == 0x08) {
> > > > > >> +  ret = snprintf(buf, buf_len, "SVE");
> > > > > >> +  buf += ret;
> > > > > >> +  blen -= ret;
> > > > > > 
> > > > > > (Nit: can ret be < 0 ?  I've never been 100% clear on this myself 
> > > > > > for
> > > > > > the s*printf() family -- if this assumption is widespread in perf 
> > > > > > tool
> > > > > > already then I guess just go with the flow.)
> > > > > 
> > > > > Yeah, some parts of the code in here check for -1, actually, but doing
> > > > > this on every call to snprintf would push this current code over the
> > > > > edge - and I cowardly avoided a refactoring ;-)
> > > > > 
> > > > > Please note that this is perf userland, and also we are printing 
> > > > > constant
> > > > > strings here.
> > > > > Although admittedly this starts to sound like an excuse now ...
> > > > > 
> > > > > > I wonder if this snprintf+increment+decrement sequence could be 
> > > > > > wrapped
> > > > > > up as a helper, rather than having to be repeated all over the 
> > > > > > place.
> > > > > 
> > > > > Yes, I was hoping nobody would notice ;-)
> > > > 
> > > > It's probably not worth losing sleep over.
> > > > 
> > > > snprintf(3) says, under NOTES:
> > > > 
> > > > Until glibc 2.0.6, they would return -1 when the output was
> > > > truncated.
> > > > 
> > > > which is probably ancient enough history that we don't care.  C11 does
> > > > say that a negative return value can happen "if an encoding error
> > > > occurred".  _Probably_ not a problem if perf tool never calls
> > > > setlocale(), but ...
> > > 
> > > I have one patch which tried to fix the snprintf+increment sequence
> > > [1], to be honest, the change seems ugly for me.  I agree it's better
> > > to use a helper to wrap up.
> > > 
> > > [1] https://lore.kernel.org/patchwork/patch/1288410/
> > 
> > Sure, putting explicit checks all over the place makes a lot of noise in
> > the code.
> > 
> > I was wondering whether something along the following lines would work:
> > 
> > /* ... */
> > 
> > if (payload & SVE_EVT_PKT_GEN_EXCEPTION)
> > buf_appendf_err(&buf, &buf_len, &ret, " EXCEPTION-GEN");
> > if (payload & SVE_EVT_PKT_ARCH_RETIRED)
> > buf_appendf_err(&buf, &buf_len, &ret, " RETIRED");
> > if (payload & SVE_EVT_PKT_L1D_ACCESS)
> > buf_appendf_err(&buf, &buf_len, &ret, " L1D-ACCESS");
> > 
> > /* ... */
> > 
> > if (ret)
> > return ret;
> > 
> > [...]
> 
> I have sent out the patch v2 [1] and Cc'ed you; I used a similar API
> definition with your suggestion:
> 
>   static int arm_spe_pkt_snprintf(char **buf_p, size_t *blen,
> const char *fmt, ...)
> 
> One difference is that when arm_spe_pkt_snprintf() returns, the caller
> checks the return value and bails out directly on failure.  Your input
> will be considered for the next spin.
> 
> > Best to keep such refactoring independent of this series though.
> 
> Yeah, the patch set [2] is quite heavy; after it gets some review, it
> may need to be split into 2 or even 3 smaller patch sets.
> 
> Thanks a lot for your suggestions!
>
> Leo

No problem, your approach seems reasonable to me.

Cheers
---Dave

> [1] https://lore.kernel.org/patchwork/patch/1314603/
> [2] https://lore.kernel.org/patchwork/cover/1314599/
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [PATCH 5/5] perf: arm_spe: Decode SVE events

2020-09-30 Thread Dave Martin
On Tue, Sep 29, 2020 at 10:19:02AM +0800, Leo Yan wrote:
> On Mon, Sep 28, 2020 at 03:47:56PM +0100, Dave Martin wrote:
> > On Mon, Sep 28, 2020 at 02:59:34PM +0100, André Przywara wrote:
> > > On 28/09/2020 14:21, Dave Martin wrote:
> > > 
> > > Hi Dave,
> > > 
> > > > On Tue, Sep 22, 2020 at 11:12:25AM +0100, Andre Przywara wrote:
> > > >> The Scalable Vector Extension (SVE) is an ARMv8 architecture extension
> > > >> that introduces very long vector operations (up to 2048 bits).
> > > > 
> > > > (8192, in fact, though don't expect to see that on real hardware any
> > > > time soon...  qemu and the Arm fast model can do it, though.)
> > > > 
> > > >> The SPE profiling feature can tag SVE instructions with additional
> > > >> properties like predication or the effective vector length.
> > > >>
> > > >> Decode the new operation type bits in the SPE decoder to allow the perf
> > > >> tool to correctly report about SVE instructions.
> > > > 
> > > > 
> > > > I don't know anything about SPE, so just commenting on a few minor
> > > > things that catch my eye here.
> > > 
> > > Many thanks for taking a look!
> > > Please note that I actually missed a prior submission by Wei, so the
> > > code changes here will end up in:
> > > https://lore.kernel.org/patchwork/patch/1288413/
> > > 
> > > But your two points below magically apply to his patch as well, so
> > > 
> > > > 
> > > >> Signed-off-by: Andre Przywara 
> > > >> ---
> > > >>  .../arm-spe-decoder/arm-spe-pkt-decoder.c | 48 ++-
> > > >>  1 file changed, 47 insertions(+), 1 deletion(-)
> > > >>
> > > >> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> > > >> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > > >> index a033f34846a6..f0c369259554 100644
> > > >> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > > >> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> > > >> @@ -372,8 +372,35 @@ int arm_spe_pkt_desc(const struct arm_spe_pkt 
> > > >> *packet, char *buf,
> > > >>}
> > > >>case ARM_SPE_OP_TYPE:
> > > >>switch (idx) {
> > > >> -  case 0: return snprintf(buf, buf_len, "%s", payload & 
> > > >> 0x1 ?
> > > >> +  case 0: {
> > > >> +  size_t blen = buf_len;
> > > >> +
> > > >> +  if ((payload & 0x89) == 0x08) {
> > > >> +  ret = snprintf(buf, buf_len, "SVE");
> > > >> +  buf += ret;
> > > >> +  blen -= ret;
> > > > 
> > > > (Nit: can ret be < 0 ?  I've never been 100% clear on this myself for
> > > > the s*printf() family -- if this assumption is widespread in perf tool
> > > > already then I guess just go with the flow.)
> > > 
> > > Yeah, some parts of the code in here check for -1, actually, but doing
> > > this on every call to snprintf would push this current code over the
> > > edge - and I cowardly avoided a refactoring ;-)
> > > 
> > > Please note that this is perf userland, and also we are printing constant
> > > strings here.
> > > Although admittedly this starts to sound like an excuse now ...
> > > 
> > > > I wonder if this snprintf+increment+decrement sequence could be wrapped
> > > > up as a helper, rather than having to be repeated all over the place.
> > > 
> > > Yes, I was hoping nobody would notice ;-)
> > 
> > It's probably not worth losing sleep over.
> > 
> > snprintf(3) says, under NOTES:
> > 
> > Until glibc 2.0.6, they would return -1 when the output was
> > truncated.
> > 
> > which is probably ancient enough history that we don't care.  C11 does
> > say that a negative return value can happen "if an encoding error
> > occurred".  _Probably_ not a problem if perf tool never calls
> > setlocale(), but ...
> 
> I have one patch which tried to fix the snprintf+increment sequence
> [1], to be honest, the change seems ugly for me.  I agree it's better
> to use a helper to wrap up.
> 
> [1] https://lore.kernel.org/patchwork/patch/1288410/

Sure, putting explicit checks all over the place makes a lot of noise in
the code.

I was wondering whether something along the following lines would work:

/* ... */

if (payload & SVE_EVT_PKT_GEN_EXCEPTION)
buf_appendf_err(&buf, &buf_len, &ret, " EXCEPTION-GEN");
if (payload & SVE_EVT_PKT_ARCH_RETIRED)
buf_appendf_err(&buf, &buf_len, &ret, " RETIRED");
if (payload & SVE_EVT_PKT_L1D_ACCESS)
buf_appendf_err(&buf, &buf_len, &ret, " L1D-ACCESS");

/* ... */

if (ret)
return ret;

[...]
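
Something like the following is what I have in mind for the helper
itself -- just a sketch, with the name and signature guessed to match
the usage above, and with the caller expected to initialise its error
accumulator to 0:

#include <stdarg.h>
#include <stdio.h>

static void buf_appendf_err(char **buf, size_t *blen, int *err,
			    const char *fmt, ...)
{
	va_list ap;
	int ret;

	if (*err)
		return;		/* an earlier append already failed */

	va_start(ap, fmt);
	ret = vsnprintf(*buf, *blen, fmt, ap);
	va_end(ap);

	if (ret < 0 || (size_t)ret >= *blen) {
		*err = -1;	/* encoding error or truncation */
		return;
	}

	*buf += ret;
	*blen -= ret;
}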

Best to keep such refactoring independent of this series though.

Cheers
---Dave


Re: [PATCH 5/5] perf: arm_spe: Decode SVE events

2020-09-29 Thread Dave Martin
On Tue, Sep 29, 2020 at 10:19:02AM +0800, Leo Yan wrote:
> On Mon, Sep 28, 2020 at 03:47:56PM +0100, Dave Martin wrote:
> > On Mon, Sep 28, 2020 at 02:59:34PM +0100, André Przywara wrote:
> > > On 28/09/2020 14:21, Dave Martin wrote:
> > > 
> > > Hi Dave,
> > > 
> > > > On Tue, Sep 22, 2020 at 11:12:25AM +0100, Andre Przywara wrote:
> > > >> The Scalable Vector Extension (SVE) is an ARMv8 architecture extension
> > > >> that introduces very long vector operations (up to 2048 bits).
> > > > 
> > > > (8192, in fact, though don't expect to see that on real hardware any
> > > > time soon...  qemu and the Arm fast model can do it, though.)

[...]

> > Mostly I'm curious because the encoding doesn't match the SVE
> > architecture: SVE requires 4 bits to specify the vector length, not 3.
> > This might have been a deliberate limitation in the SPE spec., but it
> > raises questions about what should happen when 3 bits is not enough.
> > 
> > For SVE, valid vector lengths are 16 bytes * n
> > (or equivalently 128 bits * n), where 1 <= n <= 16.
> > 
> > The code here though cannot print EVLEN16 or EVLEN48 etc.  This might
> > not be a bug, but I'd like to understand where it comes from...
> 
> In the SPE's spec, the defined values for EVL are:
> 
>   0b'000 -> EVLEN: 32 bits.
>   0b'001 -> EVLEN: 64 bits.
>   0b'010 -> EVLEN: 128 bits.
>   0b'011 -> EVLEN: 256 bits.
>   0b'100 -> EVLEN: 512 bits.
>   0b'101 -> EVLEN: 1024 bits.
>   0b'110 -> EVLEN: 2048 bits.
> 
> Note that 0b'111 is reserved.  In theory, I think SPE Operation packet
> can support up to 4096 bits (32 << 7) when the EVL field is 0b'111; but

OK, having looked at the spec I can now confirm that this looks correct.
I was expecting a more direct correspondence between the SVE ISA and
these events, but it looks like SPE may report on a finer granularity
than whole instructions, hence showing effective vector lengths smaller
than 32; also SPE rounds the reported effective vector length up to a
power of two, which allows the full range of lengths to be reported via
the 3-bit EVL field.

> it's impossible to express vector length for 8192 bits as you mentioned.

Yes, ignore my comment about 8192-bit vectors: I was confusing myself
(the Linux API extensions support up to 8192 _bytes_ per vector in order
to have some expansion room just in case; however the SVE architecture
limits vectors to at most 2048 bits).

So I don't see any obvious issues.

It might be a good idea to explicitly reject the encoding 0b111, since
we can't be certain what it is going to mean -- however, I don't have a
strong opinion on this.
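
For illustration, the decode could look something like this (names
invented; the EVL field layout is as per your table above):

/* Convert the 3-bit EVL field (payload bits [6:4]) to an effective
 * vector length in bits, refusing the reserved 0b111 encoding. */
static int spe_evlen_bits(unsigned long long payload)
{
	unsigned int evl = (payload >> 4) & 0x7;

	if (evl == 0x7)
		return -1;		/* reserved; refuse to guess */

	return 32 << evl;		/* 32, 64, ..., 2048 bits */
}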

Cheers
---Dave


Re: [PATCH 5/5] perf: arm_spe: Decode SVE events

2020-09-28 Thread Dave Martin
On Mon, Sep 28, 2020 at 02:59:34PM +0100, André Przywara wrote:
> On 28/09/2020 14:21, Dave Martin wrote:
> 
> Hi Dave,
> 
> > On Tue, Sep 22, 2020 at 11:12:25AM +0100, Andre Przywara wrote:
> >> The Scalable Vector Extension (SVE) is an ARMv8 architecture extension
> >> that introduces very long vector operations (up to 2048 bits).
> > 
> > (8192, in fact, though don't expect to see that on real hardware any
> > time soon...  qemu and the Arm fast model can do it, though.)
> > 
> >> The SPE profiling feature can tag SVE instructions with additional
> >> properties like predication or the effective vector length.
> >>
> >> Decode the new operation type bits in the SPE decoder to allow the perf
> >> tool to correctly report about SVE instructions.
> > 
> > 
> > I don't know anything about SPE, so just commenting on a few minor
> > things that catch my eye here.
> 
> Many thanks for taking a look!
> Please note that I actually missed a prior submission by Wei, so the
> code changes here will end up in:
> https://lore.kernel.org/patchwork/patch/1288413/
> 
> But your two points below magically apply to his patch as well, so
> 
> > 
> >> Signed-off-by: Andre Przywara 
> >> ---
> >>  .../arm-spe-decoder/arm-spe-pkt-decoder.c | 48 ++-
> >>  1 file changed, 47 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> >> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> >> index a033f34846a6..f0c369259554 100644
> >> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> >> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> >> @@ -372,8 +372,35 @@ int arm_spe_pkt_desc(const struct arm_spe_pkt 
> >> *packet, char *buf,
> >>}
> >>case ARM_SPE_OP_TYPE:
> >>switch (idx) {
> >> -  case 0: return snprintf(buf, buf_len, "%s", payload & 0x1 ?
> >> +  case 0: {
> >> +  size_t blen = buf_len;
> >> +
> >> +  if ((payload & 0x89) == 0x08) {
> >> +  ret = snprintf(buf, buf_len, "SVE");
> >> +  buf += ret;
> >> +  blen -= ret;
> > 
> > (Nit: can ret be < 0 ?  I've never been 100% clear on this myself for
> > the s*printf() family -- if this assumption is widespread in perf tool
> > already then I guess just go with the flow.)
> 
> Yeah, some parts of the code in here check for -1, actually, but doing
> this on every call to snprintf would push this current code over the
> edge - and I cowardly avoided a refactoring ;-)
> 
> Please note that this is perf userland, and also we are printing constant
> strings here.
> Although admittedly this starts to sound like an excuse now ...
> 
> > I wonder if this snprintf+increment+decrement sequence could be wrapped
> > up as a helper, rather than having to be repeated all over the place.
> 
> Yes, I was hoping nobody would notice ;-)

It's probably not worth losing sleep over.

snprintf(3) says, under NOTES:

Until glibc 2.0.6, they would return -1 when the output was
truncated.

which is probably ancient enough history that we don't care.  C11 does
say that a negative return value can happen "if an encoding error
occurred".  _Probably_ not a problem if perf tool never calls
setlocale(), but ...


> >> +  if (payload & 0x2)
> >> +  ret = snprintf(buf, buf_len, " FP");
> >> +  else
> >> +  ret = snprintf(buf, buf_len, " INT");
> >> +  buf += ret;
> >> +  blen -= ret;
> >> +  if (payload & 0x4) {
> >> +  ret = snprintf(buf, buf_len, " PRED");
> >> +  buf += ret;
> >> +  blen -= ret;
> >> +  }
> >> +  /* Bits [7..4] encode the vector length */
> >> +  ret = snprintf(buf, buf_len, " EVLEN%d",
> >> + 32 << ((payload >> 4) & 0x7));
> > 
> > Isn't this just extracting 3 bits (0x7)? 
> 
> Ah, right, the comment is wrong. It's actually bits [6:4].
> 
> > And what unit are we a

Re: [PATCH 5/5] perf: arm_spe: Decode SVE events

2020-09-28 Thread Dave Martin
On Tue, Sep 22, 2020 at 11:12:25AM +0100, Andre Przywara wrote:
> The Scalable Vector Extension (SVE) is an ARMv8 architecture extension
> that introduces very long vector operations (up to 2048 bits).

(8192, in fact, though don't expect to see that on real hardware any
time soon...  qemu and the Arm fast model can do it, though.)

> The SPE profiling feature can tag SVE instructions with additional
> properties like predication or the effective vector length.
> 
> Decode the new operation type bits in the SPE decoder to allow the perf
> tool to correctly report about SVE instructions.


I don't know anything about SPE, so just commenting on a few minor
things that catch my eye here.

> Signed-off-by: Andre Przywara 
> ---
>  .../arm-spe-decoder/arm-spe-pkt-decoder.c | 48 ++-
>  1 file changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c 
> b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> index a033f34846a6..f0c369259554 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-pkt-decoder.c
> @@ -372,8 +372,35 @@ int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, 
> char *buf,
>   }
>   case ARM_SPE_OP_TYPE:
>   switch (idx) {
> - case 0: return snprintf(buf, buf_len, "%s", payload & 0x1 ?
> + case 0: {
> + size_t blen = buf_len;
> +
> + if ((payload & 0x89) == 0x08) {
> + ret = snprintf(buf, buf_len, "SVE");
> + buf += ret;
> + blen -= ret;

(Nit: can ret be < 0 ?  I've never been 100% clear on this myself for
the s*printf() family -- if this assumption is widespread in perf tool
already then I guess just go with the flow.)

I wonder if this snprintf+increment+decrement sequence could be wrapped
up as a helper, rather than having to be repeated all over the place.

> + if (payload & 0x2)
> + ret = snprintf(buf, buf_len, " FP");
> + else
> + ret = snprintf(buf, buf_len, " INT");
> + buf += ret;
> + blen -= ret;
> + if (payload & 0x4) {
> + ret = snprintf(buf, buf_len, " PRED");
> + buf += ret;
> + blen -= ret;
> + }
> + /* Bits [7..4] encode the vector length */
> + ret = snprintf(buf, buf_len, " EVLEN%d",
> +32 << ((payload >> 4) & 0x7));

Isn't this just extracting 3 bits (0x7)?  And what unit are we aiming
for here: is it the number of bytes per vector, or something else?  I'm
confused by the fact that this will go up in steps of 32, which doesn't
seem to match up to the architecture.

I notice that bit 7 has to be zero to get into this if() though.

> + buf += ret;
> + blen -= ret;
> + return buf_len - blen;
> + }
> +
> + return snprintf(buf, buf_len, "%s", payload & 0x1 ?
>   "COND-SELECT" : "INSN-OTHER");
> + }
>   case 1: {
>   size_t blen = buf_len;
>  
> @@ -403,6 +430,25 @@ int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, 
> char *buf,
>   ret = snprintf(buf, buf_len, " NV-SYSREG");
>   buf += ret;
>   blen -= ret;
> + } else if ((payload & 0x0a) == 0x08) {
> + ret = snprintf(buf, buf_len, " SVE");
> + buf += ret;
> + blen -= ret;
> + if (payload & 0x4) {
> + ret = snprintf(buf, buf_len, " PRED");
> + buf += ret;
> + blen -= ret;
> + }
> + if (payload & 0x80) {
> + ret = snprintf(buf, buf_len, " SG");
> + buf += ret;
> + blen -= ret;
> + }
> + /* Bits [7..4] encode the vector length */
> + ret = snprintf(buf, buf_len, " EVLEN%d",
> +32 << ((payload >> 4) & 0x7));

Same comment as above.  Maybe have a common helper for decoding the
vector length bits so it can be fixed in a single place?

> + buf 

Re: [PATCH 3/4] kselftests/arm64: add PAuth test for whether exec() changes keys

2020-09-16 Thread Dave Martin
On Tue, Sep 15, 2020 at 04:18:28PM +0100, Boyan Karatotev wrote:
> On 07/09/2020 11:27 am, Dave Martin wrote:
> > On Thu, Sep 03, 2020 at 11:20:25AM +0100, Boyan Karatotev wrote:
> >> On 02/09/2020 18:00, Dave Martin wrote:
> >>> On Fri, Aug 28, 2020 at 02:16:05PM +0100, Boyan Karatotev wrote:
> >>>> +int exec_sign_all(struct signatures *signed_vals, size_t val)
> >>>> +{
> >>>
> >>> Could popen(3) be used here?
> >>>
> >>> Fork-and-exec is notoriously fiddly, so it's preferable to use a library
> >>> function to do it where applicable.
> >> I would love to, but the worker needs
> >>> a bidirectional channel and popen
> >> only gives a unidirectional stream.
> > 
> > Ah, fair point.
> > 
> > Would it help if you created an additional pipe before calling popen()?
> > 
> > May not be worth it, though.  For one thing, wiring that extra pipe to
> > stdin or stdout in the child process would require some extra work...
> Well, I probably could, but I doubt the result would be any better. I
> agree that I'm not sure the effort is worth it and would rather keep it
> the same.

Sure, fair enough.

Ideally kselftest would provide some common code for this sort of thing,
but I guess that's a separate discussion.

Cheers
---Dave


Re: [PATCH v2 3/4] kselftests/arm64: add PAuth test for whether exec() changes keys

2020-09-07 Thread Dave Martin
On Thu, Sep 03, 2020 at 11:48:37AM +0100, Boyan Karatotev wrote:
> On 02/09/2020 18:08, Dave Martin wrote:
> > On Mon, Aug 31, 2020 at 12:04:49PM +0100, Boyan Karatotev wrote:
> >> +/*
> >> + * fork() does not change keys. Only exec() does so call a worker program.
> >> + * Its only job is to sign a value and report back the resutls
> >> + */
> >> +TEST(exec_unique_keys)
> >> +{
> > 
> > The kernel doesn't guarantee that keys are unique.
> > 
> > Can we present all the "unique keys" wording differently, say
> > 
> > exec_key_collision_likely()
> 
> I agree that this test's name is a bit out of place. I would rather have
> it named "exec_changed_keys" though.
> 
> > Otherwise people might infer from this test code that the keys are
> > supposed to be truly unique and start reporting bugs on the kernel.
> > 
> > I can't see an obvious security argument for unique keys (rather, the
> > keys just need to be "unique enough".  That's the job of
> > get_random_bytes().)
> 
> The "exec_unique_keys" test only checks that the keys changed after an
> exec() which I think the name change would reflect.
> 
> The thing with the "single_thread_unique_keys" test is that the kernel
> says the keys will be random. Yes, there is no uniqueness guarantee
> but I'm not sure how to phrase it differently. There is some minuscule
> chance that the keys end up the same, but for this test I pretend this
> will not happen. Would changing up the comments and the failure message
> communicate this? Maybe substitute "unique" for "different" and say how
> many keys clashed?

Yes, something like that seems reasonable.

Cheers
---Dave


Re: [PATCH 0/4] kselftests/arm64: add PAuth tests

2020-09-07 Thread Dave Martin
On Thu, Sep 03, 2020 at 10:46:33AM +0100, Boyan Karatotev wrote:
> On 02/09/2020 17:48, Dave Martin wrote:
> > On Fri, Aug 28, 2020 at 02:16:02PM +0100, Boyan Karatotev wrote:
> >> Pointer Authentication (PAuth) is a security feature introduced in ARMv8.3.
> >> It introduces instructions to sign addresses and later check for potential
> >> corruption using a second modifier value and one of a set of keys. The
> >> signature, in the form of the Pointer Authentication Code (PAC), is stored
> >> in some of the top unused bits of the virtual address (e.g. [54: 49] if
> >> TBID0 is enabled and TnSZ is set to use a 48 bit VA space). A set of
> >> controls are present to enable/disable groups of instructions (which use
> >> certain keys) for compatibility with libraries that do not utilize the
> >> feature. PAuth is used to verify the integrity of return addresses on the
> >> stack with less memory than the stack canary.
> >>
> >> This patchset adds kselftests to verify the kernel's configuration of the
> >> feature and its runtime behaviour. There are 7 tests which verify that:
> >>* an authentication failure leads to a SIGSEGV
> >>* the data/instruction instruction groups are enabled
> >>* the generic instructions are enabled
> >>* all 5 keys are unique for a single thread
> >>* exec() changes all keys to new unique ones
> >>* context switching preserves the 4 data/instruction keys
> >>* context switching preserves the generic keys
> >>
> >> The tests have been verified to work on qemu without a working PAUTH
> >> Implementation and on ARM's FVP with a full or partial PAuth
> >> implementation.
> >>
> >> Note: This patchset is only verified for ARMv8.3 and there will be some
> >> changes required for ARMv8.6. More details can be found here [1]. Once
> >> ARMv8.6 PAuth is merged the first test in this series will required to be
> >> updated.
> > 
> > Nit: is it worth running checkpatch over this series?
> > 
> > Although this is not kernel code, there are a number of formatting
> > weirdnesses and surplus blank lines etc. that checkpatch would probably
> > warn about.
> > 
> I ran it through checkpatch and it came out clean except for some
> MAINTAINERS warnings. I see that when I add --strict it does complain
> about multiple blank lines which I can fix for the next version. Are
> there any other flags I should be running checkpatch with?

Hmmm, probably not.  I had thought checkpatch was generally noisier
about that kind of thing.

Since the issues were all minor and nobody else objected, I would
suggest not to worry about them.

Cheers
---Dave


Re: [PATCH 3/4] kselftests/arm64: add PAuth test for whether exec() changes keys

2020-09-07 Thread Dave Martin
On Thu, Sep 03, 2020 at 11:20:25AM +0100, Boyan Karatotev wrote:
> On 02/09/2020 18:00, Dave Martin wrote:
> > On Fri, Aug 28, 2020 at 02:16:05PM +0100, Boyan Karatotev wrote:
> >> Kernel documentation states that it will change PAuth keys on exec() calls.
> >>
> >> Verify that all keys are correctly switched to new ones.
> >>
> >> Cc: Shuah Khan 
> >> Cc: Catalin Marinas 
> >> Cc: Will Deacon 
> >> Signed-off-by: Boyan Karatotev 
> >> ---

[...]

> >> diff --git a/tools/testing/selftests/arm64/pauth/pac.c 
> >> b/tools/testing/selftests/arm64/pauth/pac.c
> >> index cdbffa8bf61e..16dea47b11c7 100644
> >> --- a/tools/testing/selftests/arm64/pauth/pac.c
> >> +++ b/tools/testing/selftests/arm64/pauth/pac.c

[...]

> >> +int exec_sign_all(struct signatures *signed_vals, size_t val)
> >> +{
> > 
> > Could popen(3) be used here?
> > 
> > Fork-and-exec is notoriously fiddly, so it's preferable to use a library
> > function to do it where applicable.
> I would love to, but the worker needs a bidirectional channel and popen
> only gives a unidirectional stream.

Ah, fair point.

Would it help if you created an additional pipe before calling popen()?

May not be worth it, though.  For one thing, wiring that extra pipe to
stdin or stdout in the child process would require some extra work...
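
To be concrete, the kind of thing I mean is below -- purely a sketch,
and it assumes the worker would be taught to take the reply fd as a
command-line argument, which is exactly the extra work in question:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int exec_sign_all_popen(unsigned long val, unsigned long *result)
{
	int pfd[2];
	char cmd[64];
	FILE *child;

	if (pipe(pfd))
		return -1;

	/* Hypothetical convention: the worker writes its reply to the fd
	 * named on its command line. */
	snprintf(cmd, sizeof(cmd), "./exec_target %d", pfd[1]);

	child = popen(cmd, "w");
	if (!child) {
		close(pfd[0]);
		close(pfd[1]);
		return -1;
	}
	close(pfd[1]);		/* only the child writes to it now */

	fwrite(&val, sizeof(val), 1, child);
	fflush(child);

	if (read(pfd[0], result, sizeof(*result)) != sizeof(*result)) {
		pclose(child);
		close(pfd[0]);
		return -1;
	}

	close(pfd[0]);
	return pclose(child) ? -1 : 0;
}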

Cheers
---Dave


Re: [PATCH 1/4] kselftests/arm64: add a basic Pointer Authentication test

2020-09-07 Thread Dave Martin
On Thu, Sep 03, 2020 at 11:12:02AM +0100, Boyan Karatotev wrote:
> On 02/09/2020 17:49, Dave Martin wrote:
> > On Fri, Aug 28, 2020 at 02:16:03PM +0100, Boyan Karatotev wrote:
> >> PAuth signs and verifies return addresses on the stack. It does so by
> >> inserting a Pointer Authentication code (PAC) into some of the unused top
> >> bits of an address. This is achieved by adding paciasp/autiasp instructions
> >> at the beginning and end of a function.
> >>
> >> This feature is partially backwards compatible with earlier versions of the
> >> ARM architecture. To coerce the compiler into emitting fully backwards
> >> compatible code the main file is compiled to target an earlier ARM version.
> >> This allows the tests to check for the feature and print meaningful error
> >> messages instead of crashing.
> >>
> >> Add a test to verify that corrupting the return address results in a
> >> SIGSEGV on return.
> >>
> >> Cc: Shuah Khan 
> >> Cc: Catalin Marinas 
> >> Cc: Will Deacon 
> >> Signed-off-by: Boyan Karatotev 
> >> ---

[...]

> >> diff --git a/tools/testing/selftests/arm64/pauth/pac_corruptor.S 
> >> b/tools/testing/selftests/arm64/pauth/pac_corruptor.S
> >> new file mode 100644
> >> index ..6a34ec23a034
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/arm64/pauth/pac_corruptor.S
> >> @@ -0,0 +1,36 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +/* Copyright (C) 2020 ARM Limited */
> >> +
> >> +.global pac_corruptor
> >> +
> >> +.text
> >> +/*
> >> + * Corrupting a single bit of the PAC ensures the authentication will 
> >> fail.  It
> >> + * also guarantees no possible collision. TCR_EL1.TBI0 is set by default 
> >> so no
> >> + * top byte PAC is tested
> >> + */
> >> + pac_corruptor:
> >> +  paciasp
> >> +
> >> +  /* make stack frame */
> >> +  sub sp, sp, #16
> >> +  stp x29, lr, [sp]
> > 
> > Nit: if respinning, you can optimise a few sequences of this sort, e.g.
> > 
> > stp x29, lr, [sp, #-16]!
> > 
> >> +  mov x29, sp
> >> +
> >> +  /* prepare mask for bit to be corrupted (bit 54) */
> >> +  mov x1, xzr
> >> +  add x1, x1, #1
> >> +  lsl x1, x1, #54
> > 
> > Nit:
> > 
> > mov x1, #1 << 54
> Thank you for this, didn't know I could do it this way.
> > 
> > but anyway, the logic operations can encode most simple bitmasks
> > directly as immediate operands, so you can skip this and just do
> > 
> >> +
> >> +  /* get saved lr, corrupt selected bit, put it back */
> >> +  ldr x0, [sp, #8]
> >> +  eor x0, x0, x1
> > 
> > eor x0, x0, #1 << 54
> > 
> >> +  str x0, [sp, #8]
> >> +
> >> +  /* remove stack frame */
> >> +  ldp x29, lr, [sp]
> >> +  add sp, sp, #16
> > 
> > ldp x29, lr, [sp], #16
> > 
> > [...]
> > 
> > Actually, since there are no leaf nested function calls and no trap is
> > expected until the function returns (so backtracing in the middle of
> > this function is unlikely to be needed), could we optimise this whole
> > thing down to the following?
> > 
> I suppose you're right. The intent was to emulate a c function but there
> really is no point in doing all this extra work. Will change it.

It's not critical either way, but this way it's at least less code to
maintain / read.

> > pac_corruptor:
> > paciasp
> > eor lr, lr, #1 << 53
> > autiasp
> > ret
> > 
> > Cheers
> > ---Dave

[...]

Cheers
---Dave


Re: [PATCH v2 3/4] kselftests/arm64: add PAuth test for whether exec() changes keys

2020-09-02 Thread Dave Martin
On Mon, Aug 31, 2020 at 12:04:49PM +0100, Boyan Karatotev wrote:
> Kernel documentation states that it will change PAuth keys on exec() calls.
> 
> Verify that all keys are correctly switched to new ones.
> 
> Cc: Shuah Khan 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Reviewed-by: Vincenzo Frascino 
> Reviewed-by: Amit Daniel Kachhap 
> Signed-off-by: Boyan Karatotev 
> ---
>  tools/testing/selftests/arm64/pauth/Makefile  |   4 +
>  .../selftests/arm64/pauth/exec_target.c   |  35 +
>  tools/testing/selftests/arm64/pauth/helper.h  |  10 ++
>  tools/testing/selftests/arm64/pauth/pac.c | 148 ++
>  4 files changed, 197 insertions(+)
>  create mode 100644 tools/testing/selftests/arm64/pauth/exec_target.c
> 
> diff --git a/tools/testing/selftests/arm64/pauth/Makefile 
> b/tools/testing/selftests/arm64/pauth/Makefile
> index 5c0dd129562f..72e290b0b10c 100644
> --- a/tools/testing/selftests/arm64/pauth/Makefile
> +++ b/tools/testing/selftests/arm64/pauth/Makefile
> @@ -13,6 +13,7 @@ pauth_cc_support := $(shell if ($(CC) $(CFLAGS) 
> -march=armv8.3-a -E -x c /dev/nu
>  ifeq ($(pauth_cc_support),1)
>  TEST_GEN_PROGS := pac
>  TEST_GEN_FILES := pac_corruptor.o helper.o
> +TEST_GEN_PROGS_EXTENDED := exec_target
>  endif
>  
>  include ../../lib.mk
> @@ -30,6 +31,9 @@ $(OUTPUT)/helper.o: helper.c
>  # greater, gcc emits pac* instructions which are not in HINT NOP space,
>  # preventing the tests from occurring at all. Compile for ARMv8.2 so tests 
> can
>  # run on earlier targets and print a meaningful error messages
> +$(OUTPUT)/exec_target: exec_target.c $(OUTPUT)/helper.o
> + $(CC) $^ -o $@ $(CFLAGS) -march=armv8.2-a
> +
>  $(OUTPUT)/pac: pac.c $(OUTPUT)/pac_corruptor.o $(OUTPUT)/helper.o
>   $(CC) $^ -o $@ $(CFLAGS) -march=armv8.2-a
>  endif
> diff --git a/tools/testing/selftests/arm64/pauth/exec_target.c 
> b/tools/testing/selftests/arm64/pauth/exec_target.c
> new file mode 100644
> index ..07addef5a1d7
> --- /dev/null
> +++ b/tools/testing/selftests/arm64/pauth/exec_target.c
> @@ -0,0 +1,35 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (C) 2020 ARM Limited
> +
> +#include 
> +#include 
> +#include 
> +
> +#include "helper.h"
> +
> +
> +int main(void)
> +{
> + struct signatures signed_vals;
> + unsigned long hwcaps;
> + size_t val;
> +
> + fread(&val, sizeof(size_t), 1, stdin);
> +
> + /* don't try to execute illegal (unimplemented) instructions) caller
> +  * should have checked this and keep worker simple
> +  */
> + hwcaps = getauxval(AT_HWCAP);
> +
> + if (hwcaps & HWCAP_PACA) {
> + signed_vals.keyia = keyia_sign(val);
> + signed_vals.keyib = keyib_sign(val);
> + signed_vals.keyda = keyda_sign(val);
> + signed_vals.keydb = keydb_sign(val);
> + }
> + signed_vals.keyg = (hwcaps & HWCAP_PACG) ?  keyg_sign(val) : 0;
> +
> + fwrite(&signed_vals, sizeof(struct signatures), 1, stdout);
> +
> + return 0;
> +}
> diff --git a/tools/testing/selftests/arm64/pauth/helper.h 
> b/tools/testing/selftests/arm64/pauth/helper.h
> index e2ed910c9863..da6457177727 100644
> --- a/tools/testing/selftests/arm64/pauth/helper.h
> +++ b/tools/testing/selftests/arm64/pauth/helper.h
> @@ -6,6 +6,16 @@
>  
>  #include 
>  
> +#define NKEYS 5
> +
> +
> +struct signatures {
> + size_t keyia;
> + size_t keyib;
> + size_t keyda;
> + size_t keydb;
> + size_t keyg;
> +};
>  
>  void pac_corruptor(void);
>  
> diff --git a/tools/testing/selftests/arm64/pauth/pac.c 
> b/tools/testing/selftests/arm64/pauth/pac.c
> index 035fdd6aae9b..1b9e3acfeb61 100644
> --- a/tools/testing/selftests/arm64/pauth/pac.c
> +++ b/tools/testing/selftests/arm64/pauth/pac.c
> @@ -2,6 +2,8 @@
>  // Copyright (C) 2020 ARM Limited
>  
>  #include 
> +#include 
> +#include 
>  #include 
>  
>  #include "../../kselftest_harness.h"
> @@ -33,6 +35,117 @@ do { \
>  } while (0)
>  
>  
> +void sign_specific(struct signatures *sign, size_t val)
> +{
> + sign->keyia = keyia_sign(val);
> + sign->keyib = keyib_sign(val);
> + sign->keyda = keyda_sign(val);
> + sign->keydb = keydb_sign(val);
> +}
> +
> +void sign_all(struct signatures *sign, size_t val)
> +{
> + sign->keyia = keyia_sign(val);
> + sign->keyib = keyib_sign(val);
> + sign->keyda = keyda_sign(val);
> + sign->keydb = keydb_sign(val);
> + sign->keyg  = keyg_sign(val);
> +}
> +
> +int are_same(struct signatures *old, struct signatures *new, int nkeys)
> +{
> + int res = 0;
> +
> + res |= old->keyia == new->keyia;
> + res |= old->keyib == new->keyib;
> + res |= old->keyda == new->keyda;
> + res |= old->keydb == new->keydb;
> + if (nkeys == NKEYS)
> + res |= old->keyg  == new->keyg;
> +
> + return res;
> +}
> +
> +int exec_sign_all(struct signatures *signed_vals, size_t val)
> +{
> + int new_stdin[2];
> + int new_stdout[2];
> + int status;
> + ssize_t ret;
> + 

Re: [PATCH 3/4] kselftests/arm64: add PAuth test for whether exec() changes keys

2020-09-02 Thread Dave Martin
On Fri, Aug 28, 2020 at 02:16:05PM +0100, Boyan Karatotev wrote:
> Kernel documentation states that it will change PAuth keys on exec() calls.
> 
> Verify that all keys are correctly switched to new ones.
> 
> Cc: Shuah Khan 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Signed-off-by: Boyan Karatotev 
> ---
>  tools/testing/selftests/arm64/pauth/Makefile  |   4 +
>  .../selftests/arm64/pauth/exec_target.c   |  35 +
>  tools/testing/selftests/arm64/pauth/helper.h  |  10 ++
>  tools/testing/selftests/arm64/pauth/pac.c | 148 ++
>  4 files changed, 197 insertions(+)
>  create mode 100644 tools/testing/selftests/arm64/pauth/exec_target.c
> 
> diff --git a/tools/testing/selftests/arm64/pauth/Makefile 
> b/tools/testing/selftests/arm64/pauth/Makefile
> index a017d1c8dd58..2e237b21ccf6 100644
> --- a/tools/testing/selftests/arm64/pauth/Makefile
> +++ b/tools/testing/selftests/arm64/pauth/Makefile
> @@ -5,6 +5,7 @@ CFLAGS += -mbranch-protection=pac-ret
>  
>  TEST_GEN_PROGS := pac
>  TEST_GEN_FILES := pac_corruptor.o helper.o
> +TEST_GEN_PROGS_EXTENDED := exec_target
>  
>  include ../../lib.mk
>  
> @@ -20,6 +21,9 @@ $(OUTPUT)/helper.o: helper.c
>  # greater, gcc emits pac* instructions which are not in HINT NOP space,
>  # preventing the tests from occurring at all. Compile for ARMv8.2 so tests 
> can
>  # run on earlier targets and print a meaningful error messages
> +$(OUTPUT)/exec_target: exec_target.c $(OUTPUT)/helper.o
> + $(CC) $^ -o $@ $(CFLAGS) -march=armv8.2-a
> +
>  $(OUTPUT)/pac: pac.c $(OUTPUT)/pac_corruptor.o $(OUTPUT)/helper.o
>   $(CC) $^ -o $@ $(CFLAGS) -march=armv8.2-a
>  
> diff --git a/tools/testing/selftests/arm64/pauth/exec_target.c 
> b/tools/testing/selftests/arm64/pauth/exec_target.c
> new file mode 100644
> index ..07addef5a1d7
> --- /dev/null
> +++ b/tools/testing/selftests/arm64/pauth/exec_target.c
> @@ -0,0 +1,35 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (C) 2020 ARM Limited
> +
> +#include 
> +#include 
> +#include 
> +
> +#include "helper.h"
> +
> +
> +int main(void)
> +{
> + struct signatures signed_vals;
> + unsigned long hwcaps;
> + size_t val;
> +
> + fread(&val, sizeof(size_t), 1, stdin);
> +
> + /* don't try to execute illegal (unimplemented) instructions) caller
> +  * should have checked this and keep worker simple
> +  */
> + hwcaps = getauxval(AT_HWCAP);
> +
> + if (hwcaps & HWCAP_PACA) {
> + signed_vals.keyia = keyia_sign(val);
> + signed_vals.keyib = keyib_sign(val);
> + signed_vals.keyda = keyda_sign(val);
> + signed_vals.keydb = keydb_sign(val);
> + }
> + signed_vals.keyg = (hwcaps & HWCAP_PACG) ?  keyg_sign(val) : 0;
> +
> + fwrite(&signed_vals, sizeof(struct signatures), 1, stdout);
> +
> + return 0;
> +}
> diff --git a/tools/testing/selftests/arm64/pauth/helper.h 
> b/tools/testing/selftests/arm64/pauth/helper.h
> index b3cf709e249d..fceaa1e4824a 100644
> --- a/tools/testing/selftests/arm64/pauth/helper.h
> +++ b/tools/testing/selftests/arm64/pauth/helper.h
> @@ -6,6 +6,16 @@
>  
>  #include 
>  
> +#define NKEYS 5
> +
> +
> +struct signatures {
> + size_t keyia;
> + size_t keyib;
> + size_t keyda;
> + size_t keydb;
> + size_t keyg;
> +};
>  
>  void pac_corruptor(void);
>  
> diff --git a/tools/testing/selftests/arm64/pauth/pac.c 
> b/tools/testing/selftests/arm64/pauth/pac.c
> index cdbffa8bf61e..16dea47b11c7 100644
> --- a/tools/testing/selftests/arm64/pauth/pac.c
> +++ b/tools/testing/selftests/arm64/pauth/pac.c
> @@ -2,6 +2,8 @@
>  // Copyright (C) 2020 ARM Limited
>  
>  #include 
> +#include 
> +#include 
>  #include 
>  
>  #include "../../kselftest_harness.h"
> @@ -33,6 +35,117 @@ do { \
>  } while (0)
>  
>  
> +void sign_specific(struct signatures *sign, size_t val)
> +{
> + sign->keyia = keyia_sign(val);
> + sign->keyib = keyib_sign(val);
> + sign->keyda = keyda_sign(val);
> + sign->keydb = keydb_sign(val);
> +}
> +
> +void sign_all(struct signatures *sign, size_t val)
> +{
> + sign->keyia = keyia_sign(val);
> + sign->keyib = keyib_sign(val);
> + sign->keyda = keyda_sign(val);
> + sign->keydb = keydb_sign(val);
> + sign->keyg  = keyg_sign(val);
> +}
> +
> +int are_same(struct signatures *old, struct signatures *new, int nkeys)
> +{
> + int res = 0;
> +
> + res |= old->keyia == new->keyia;
> + res |= old->keyib == new->keyib;
> + res |= old->keyda == new->keyda;
> + res |= old->keydb == new->keydb;
> + if (nkeys == NKEYS)
> + res |= old->keyg  == new->keyg;
> +
> + return res;
> +}
> +
> +int exec_sign_all(struct signatures *signed_vals, size_t val)
> +{

Could popen(3) be used here?

Fork-and-exec is notoriously fiddly, so it's preferable to use a library
function to do it where applicable.

[...]

Cheers
---Dave


Re: [PATCH 1/4] kselftests/arm64: add a basic Pointer Authentication test

2020-09-02 Thread Dave Martin
On Fri, Aug 28, 2020 at 02:16:03PM +0100, Boyan Karatotev wrote:
> PAuth signs and verifies return addresses on the stack. It does so by
> inserting a Pointer Authentication code (PAC) into some of the unused top
> bits of an address. This is achieved by adding paciasp/autiasp instructions
> at the beginning and end of a function.
> 
> This feature is partially backwards compatible with earlier versions of the
> ARM architecture. To coerce the compiler into emitting fully backwards
> compatible code the main file is compiled to target an earlier ARM version.
> This allows the tests to check for the feature and print meaningful error
> messages instead of crashing.
> 
> Add a test to verify that corrupting the return address results in a
> SIGSEGV on return.
> 
> Cc: Shuah Khan 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Signed-off-by: Boyan Karatotev 
> ---
>  tools/testing/selftests/arm64/Makefile|  2 +-
>  .../testing/selftests/arm64/pauth/.gitignore  |  1 +
>  tools/testing/selftests/arm64/pauth/Makefile  | 22 
>  tools/testing/selftests/arm64/pauth/helper.h  | 10 ++
>  tools/testing/selftests/arm64/pauth/pac.c | 32 +
>  .../selftests/arm64/pauth/pac_corruptor.S | 36 +++
>  6 files changed, 102 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/arm64/pauth/.gitignore
>  create mode 100644 tools/testing/selftests/arm64/pauth/Makefile
>  create mode 100644 tools/testing/selftests/arm64/pauth/helper.h
>  create mode 100644 tools/testing/selftests/arm64/pauth/pac.c
>  create mode 100644 tools/testing/selftests/arm64/pauth/pac_corruptor.S
> 
> diff --git a/tools/testing/selftests/arm64/Makefile 
> b/tools/testing/selftests/arm64/Makefile
> index 93b567d23c8b..525506fd97b9 100644
> --- a/tools/testing/selftests/arm64/Makefile
> +++ b/tools/testing/selftests/arm64/Makefile
> @@ -4,7 +4,7 @@
>  ARCH ?= $(shell uname -m 2>/dev/null || echo not)
>  
>  ifneq (,$(filter $(ARCH),aarch64 arm64))
> -ARM64_SUBTARGETS ?= tags signal
> +ARM64_SUBTARGETS ?= tags signal pauth
>  else
>  ARM64_SUBTARGETS :=
>  endif
> diff --git a/tools/testing/selftests/arm64/pauth/.gitignore 
> b/tools/testing/selftests/arm64/pauth/.gitignore
> new file mode 100644
> index ..b557c916720a
> --- /dev/null
> +++ b/tools/testing/selftests/arm64/pauth/.gitignore
> @@ -0,0 +1 @@
> +pac
> diff --git a/tools/testing/selftests/arm64/pauth/Makefile 
> b/tools/testing/selftests/arm64/pauth/Makefile
> new file mode 100644
> index ..785c775e5e41
> --- /dev/null
> +++ b/tools/testing/selftests/arm64/pauth/Makefile
> @@ -0,0 +1,22 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2020 ARM Limited
> +
> +CFLAGS += -mbranch-protection=pac-ret
> +
> +TEST_GEN_PROGS := pac
> +TEST_GEN_FILES := pac_corruptor.o
> +
> +include ../../lib.mk
> +
> +# pac* and aut* instructions are not available on architectures berfore
> +# ARMv8.3. Therefore target ARMv8.3 wherever they are used directly
> +$(OUTPUT)/pac_corruptor.o: pac_corruptor.S
> + $(CC) -c $^ -o $@ $(CFLAGS) -march=armv8.3-a
> +
> +# when -mbranch-protection is enabled and the target architecture is ARMv8.3 
> or
> +# greater, gcc emits pac* instructions which are not in HINT NOP space,
> +# preventing the tests from occurring at all. Compile for ARMv8.2 so tests 
> can
> +# run on earlier targets and print a meaningful error messages
> +$(OUTPUT)/pac: pac.c $(OUTPUT)/pac_corruptor.o
> + $(CC) $^ -o $@ $(CFLAGS) -march=armv8.2-a
> +
> diff --git a/tools/testing/selftests/arm64/pauth/helper.h 
> b/tools/testing/selftests/arm64/pauth/helper.h
> new file mode 100644
> index ..f777f88acf0a
> --- /dev/null
> +++ b/tools/testing/selftests/arm64/pauth/helper.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2020 ARM Limited */
> +
> +#ifndef _HELPER_H_
> +#define _HELPER_H_
> +
> +void pac_corruptor(void);
> +
> +#endif
> +
> diff --git a/tools/testing/selftests/arm64/pauth/pac.c 
> b/tools/testing/selftests/arm64/pauth/pac.c
> new file mode 100644
> index ..ed445050f621
> --- /dev/null
> +++ b/tools/testing/selftests/arm64/pauth/pac.c
> @@ -0,0 +1,32 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (C) 2020 ARM Limited
> +
> +#include 
> +#include 
> +
> +#include "../../kselftest_harness.h"
> +#include "helper.h"
> +
> +/*
> + * Tests are ARMv8.3 compliant. They make no provisions for features present 
> in
> + * future version of the arm architecture
> + */
> +
> +#define ASSERT_PAUTH_ENABLED() \
> +do { \
> + unsigned long hwcaps = getauxval(AT_HWCAP); \
> + /* data key instructions are not in NOP space. This prevents a SIGILL 
> */ \


> + ASSERT_NE(0, hwcaps & HWCAP_PACA) TH_LOG("PAUTH not enabled"); \
> +} while (0)
> +
> +
> +/* check that a corrupted PAC results in SIGSEGV */
> +TEST_SIGNAL(corrupt_pac, SIGSEGV)
> +{
> + ASSERT_PAUTH_ENABLED();
> +
> + pac_corruptor();
> +}
> +
> 

Re: [PATCH 0/4] kselftests/arm64: add PAuth tests

2020-09-02 Thread Dave Martin
On Fri, Aug 28, 2020 at 02:16:02PM +0100, Boyan Karatotev wrote:
> Pointer Authentication (PAuth) is a security feature introduced in ARMv8.3.
> It introduces instructions to sign addresses and later check for potential
> corruption using a second modifier value and one of a set of keys. The
> signature, in the form of the Pointer Authentication Code (PAC), is stored
> in some of the top unused bits of the virtual address (e.g. [54: 49] if
> TBID0 is enabled and TnSZ is set to use a 48 bit VA space). A set of
> controls are present to enable/disable groups of instructions (which use
> certain keys) for compatibility with libraries that do not utilize the
> feature. PAuth is used to verify the integrity of return addresses on the
> stack with less memory than the stack canary.
> 
> This patchset adds kselftests to verify the kernel's configuration of the
> feature and its runtime behaviour. There are 7 tests which verify that:
>   * an authentication failure leads to a SIGSEGV
>   * the data/instruction instruction groups are enabled
>   * the generic instructions are enabled
>   * all 5 keys are unique for a single thread
>   * exec() changes all keys to new unique ones
>   * context switching preserves the 4 data/instruction keys
>   * context switching preserves the generic keys
> 
> The tests have been verified to work on qemu without a working PAUTH
> Implementation and on ARM's FVP with a full or partial PAuth
> implementation.
> 
> Note: This patchset is only verified for ARMv8.3 and there will be some
> changes required for ARMv8.6. More details can be found here [1]. Once
> ARMv8.6 PAuth is merged the first test in this series will required to be
> updated.

Nit: is it worth running checkpatch over this series?

Although this is not kernel code, there are a number of formatting
weirdnesses and surplus blank lines etc. that checkpatch would probably
warn about.

[...]

Cheers
---Dave


Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-02 Thread Dave Martin
On Tue, Sep 01, 2020 at 11:11:37AM -0700, Dave Hansen wrote:
> On 9/1/20 10:45 AM, Andy Lutomirski wrote:
> >>> For arm64 (and sparc etc.) we continue to use the regular mmap/mprotect
> >>> family of calls.  One or two additional arch-specific mmap flags are
> >>> sufficient for now.
> >>>
> >>> Is x86 definitely not going to fit within those calls?
> >> That can work for x86.  Andy, what if we create PROT_SHSTK, which can
> >> been seen only from the user.  Once in kernel, it is translated to
> >> VM_SHSTK.  One question for mremap/mprotect is, do we allow a normal
> >> data area to become shadow stack?
> > I'm unconvinced that we want to use a somewhat precious PROT_ or VM_
> > bit for this.  Using a flag bit makes sense if we expect anyone to
> > ever map an fd or similar as a shadow stack, but that seems a bit odd
> > in the first place.  To me, it seems more logical for a shadow stack
> > to be a special sort of mapping with a special vm_ops, not a normal
> > mapping with a special flag set.  Although I realize that we want
> > shadow stacks to work like anonymous memory with respect to fork().
> > Dave?
> 
> I actually don't like the idea of *creating* mappings much.
> 
> I think the pkey model has worked out pretty well where we separate
> creating the mapping from doing something *to* it, like changing
> protections.  For instance, it would be nice if we could preserve things
> like using hugetlbfs or heck even doing KSM for shadow stacks.
> 
> If we're *creating* mappings, we've pretty much ruled out things like
> hugetlbfs.
> 
> Something like mprotect_shstk() would allow an implementation today that
> only works on anonymous memory *and* sets up a special vm_ops.  But, the
> same exact ABI could do wonky stuff in the future if we decided we
> wanted to do shadow stacks on DAX or hugetlbfs or whatever.
> 
> I don't really like the idea of PROT_SHSTK those are plumbed into a
> bunch of interfaces.  But, I also can't deny that it seems to be working
> fine for the arm64 folks.

Note, there are some rough edges, such as what happens when someone
calls mprotect() on memory marked with PROT_BTI.  Unless the caller
knows whether PROT_BTI should be set for that page, the flag may get
unintentionally cleared.  Since the flag only applies to text pages
though, it's not _that_ much of a concern.  Software that deals with
writable text pages is also usually involved in generating the code and
so will know about PROT_BTI.  That was the theory, anyway.
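
As a concrete illustration of that rough edge (hypothetical code, not
taken from any real project): a caller that patches a guarded text page
without knowing about PROT_BTI ends up dropping the flag.

/*
 * Sketch only.  The page is assumed to have been mapped
 * PROT_READ|PROT_EXEC|PROT_BTI by the dynamic loader; the second
 * mprotect() silently drops PROT_BTI because this caller doesn't know
 * the flag was there.
 */
#include <stddef.h>
#include <sys/mman.h>

static void patch_code(void *page, size_t len)
{
	mprotect(page, len, PROT_READ | PROT_WRITE);	/* make writable */
	/* ... rewrite some instructions ... */
	mprotect(page, len, PROT_READ | PROT_EXEC);	/* PROT_BTI now lost */
}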

In the longer term, it might be preferable to have an mprotect2() that
can leave some flags unmodified, and that doesn't silently ignore
unknown flags (at least one of mmap or mprotect does; I don't recall
which).  We didn't attempt to go that far, for now.

For arm64 it seemed fairly natural for the BTI flag to be a PROT_ flag,
but I don't know enough detail about x86 shstk to know whether it's a
natural fit there.

Cheers
---Dave


Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-01 Thread Dave Martin
On Thu, Aug 27, 2020 at 06:26:11AM -0700, H.J. Lu wrote:
> On Wed, Aug 26, 2020 at 12:57 PM Dave Hansen  wrote:
> >
> > On 8/26/20 11:49 AM, Yu, Yu-cheng wrote:
> > >> I would expect things like Go and various JITs to call it directly.
> > >>
> > >> If we wanted to be fancy and add a potentially more widely useful
> > >> syscall, how about:
> > >>
> > >> mmap_special(void *addr, size_t length, int prot, int flags, int type);
> > >>
> > >> Where type is something like MMAP_SPECIAL_X86_SHSTK.  Fundamentally,
> > >> this is really just mmap() except that we want to map something a bit
> > >> magical, and we don't want to require opening a device node to do it.
> > >
> > > One benefit of MMAP_SPECIAL_* is there are more free bits than MAP_*.
> > > Does ARM have similar needs for memory mapping, Dave?
> >
> > No idea.
> >
> > But, mmap_special() is *basically* mmap2() with extra-big flags space.
> > I suspect it will grow some more uses on top of shadow stacks.  It could
> > have, for instance, been used to allocate MPX bounds tables.
> 
> There is no reason we can't use
> 
> long arch_prctl (int, unsigned long, unsigned long, unsigned long, ..);
> 
> for ARCH_X86_CET_MMAP_SHSTK.   We just need to use
> 
> syscall (SYS_arch_prctl, ARCH_X86_CET_MMAP_SHSTK, ...);


For arm64 (and sparc etc.) we continue to use the regular mmap/mprotect
family of calls.  One or two additional arch-specific mmap flags are
sufficient for now.

Is x86 definitely not going to fit within those calls?

For now, I can't see what arg[2] is used for (and hence the type
argument of mmap_special()), but I haven't dug through the whole series.

Cheers
---Dave


Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-08-26 Thread Dave Martin
On Wed, Aug 26, 2020 at 06:51:48PM +0200, Florian Weimer wrote:
> * Dave Martin:
> 
> > On Tue, Aug 25, 2020 at 04:34:27PM -0700, Yu, Yu-cheng wrote:
> >> On 8/25/2020 4:20 PM, Dave Hansen wrote:
> >> >On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
> >> >>>>I think this is more arch-specific.  Even if it becomes a new syscall,
> >> >>>>we still need to pass the same parameters.
> >> >>>
> >> >>>Right, but without the copying in and out of memory.
> >> >>>
> >> >>Linux-api is already on the Cc list.  Do we need to add more people to
> >> >>get some agreements for the syscall?
> >> >What kind of agreement are you looking for?  I'd suggest just coding it
> >> >up and posting the patches.  Adding syscalls really is really pretty
> >> >straightforward and isn't much code at all.
> >> >
> >> 
> >> Sure, I will do that.
> >
> > Alternatively, would a regular prctl() work here?
> 
> Is this something application code has to call, or just the dynamic
> loader?
> 
> prctl in glibc is a variadic function, so if there's a mismatch between
> the kernel/userspace syscall convention and the userspace calling
> convention (for variadic functions) for specific types, it can't be made
> to work in a generic way.
>
> The loader can use inline assembly for system calls and does not have
> this issue, but applications would be implicated by it.

To the extent that this is a problem, libc's prctl() wrapper has to
handle it already.  New prctl() calls tend to demand precisely 4
arguments and require unused arguments to be 0, but this is more a
matter of policy than because anything breaks otherwise.
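
For illustration only (PR_EXAMPLE_CMD is made up, not a real prctl
command), the convention amounts to something like the sketch below when
the raw syscall is used directly -- which is also how a loader or other
low-level code can sidestep the variadic libc wrapper entirely:

#include <sys/syscall.h>
#include <unistd.h>

#define PR_EXAMPLE_CMD	0x1000		/* hypothetical command number */

static long example_prctl(unsigned long addr, unsigned long size)
{
	/* Unused arguments passed explicitly as 0, per the convention. */
	return syscall(SYS_prctl, PR_EXAMPLE_CMD, addr, size, 0UL, 0UL);
}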

You're right that this has implications: for i386, libc probably pulls
more arguments off the stack than are really there in some situations.
This isn't a new problem though.  There are already generic prctls with
fewer than 4 args that are used on x86.

Merging the actual prctl() and arch_prctl() syscalls doesn't actually
stop libc from retaining separate wrappers if they have different
argument marshaling requirements in some corner cases.


There might be some underlying reason why x86 has its own call and nobody
else followed the same model, but I don't know what it is.

Cheers
---Dave


Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-08-26 Thread Dave Martin
On Tue, Aug 25, 2020 at 04:34:27PM -0700, Yu, Yu-cheng wrote:
> On 8/25/2020 4:20 PM, Dave Hansen wrote:
> >On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
> I think this is more arch-specific.  Even if it becomes a new syscall,
> we still need to pass the same parameters.
> >>>
> >>>Right, but without the copying in and out of memory.
> >>>
> >>Linux-api is already on the Cc list.  Do we need to add more people to
> >>get some agreements for the syscall?
> >What kind of agreement are you looking for?  I'd suggest just coding it
> >up and posting the patches.  Adding syscalls really is really pretty
> >straightforward and isn't much code at all.
> >
> 
> Sure, I will do that.

Alternatively, would a regular prctl() work here?

arch_prctl() feels like a historical weirdness for x86 -- other arches
all seem to be using regular prctl(), which allows for 4 args.  I don't
know the history behind the difference here.

(Since prctl() and arch_prctl() use non-clashing command numbers, I had
wondered whether it would be worth just merging the x86 calls in with
the rest and making the two calls aliases.  That's one for later,
though...)

Cheers
---Dave


Re: [PATCH 18/18] arm64: lto: Strengthen READ_ONCE() to acquire when CLANG_LTO=y

2020-07-07 Thread Dave Martin
On Mon, Jul 06, 2020 at 10:36:28AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 06, 2020 at 06:05:57PM +0100, Dave Martin wrote:
> > On Mon, Jul 06, 2020 at 09:34:55AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 06, 2020 at 05:00:23PM +0100, Dave Martin wrote:
> > > > On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> > > > > On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > > > > > On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > > > > > > When building with LTO, there is an increased risk of the compiler
> > > > > > > converting an address dependency headed by a READ_ONCE() 
> > > > > > > invocation
> > > > > > > into a control dependency and consequently allowing for harmful
> > > > > > > reordering by the CPU.
> > > > > > > 
> > > > > > > Ensure that such transformations are harmless by overriding the 
> > > > > > > generic
> > > > > > > READ_ONCE() definition with one that provides acquire semantics 
> > > > > > > when
> > > > > > > building with LTO.
> > > > > > > 
> > > > > > > Signed-off-by: Will Deacon 
> > > > > > > ---
> > > > > > >  arch/arm64/include/asm/rwonce.h   | 63 
> > > > > > > +++
> > > > > > >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> > > > > > >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> > > > > > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > > > > > >  create mode 100644 arch/arm64/include/asm/rwonce.h
> > > > > > > 
> > > > > > > diff --git a/arch/arm64/include/asm/rwonce.h 
> > > > > > > b/arch/arm64/include/asm/rwonce.h
> > > > > > > new file mode 100644
> > > > > > > index ..515e360b01a1
> > > > > > > --- /dev/null
> > > > > > > +++ b/arch/arm64/include/asm/rwonce.h
> > > > > > > @@ -0,0 +1,63 @@
> > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > +/*
> > > > > > > + * Copyright (C) 2020 Google LLC.
> > > > > > > + */
> > > > > > > +#ifndef __ASM_RWONCE_H
> > > > > > > +#define __ASM_RWONCE_H
> > > > > > > +
> > > > > > > +#ifdef CONFIG_CLANG_LTO
> > > > > > 
> > > > > > Don't we have a generic option for LTO that's not specific to Clang.
> > > > > 
> > > > > /me looks at the LTO series some more
> > > > > 
> > > > > Oh yeah, there's CONFIG_LTO which is selected by CONFIG_LTO_CLANG, 
> > > > > which is
> > > > > the non-typoed version of the above. I can switch this to CONFIG_LTO.
> > > > > 
> > > > > > Also, can you illustrate code that can only be unsafe with Clang 
> > > > > > LTO?
> > > > > 
> > > > > I don't have a concrete example, but it's an ongoing concern over on 
> > > > > the LTO
> > > > > thread [1], so I cooked this to show one way we could deal with it. 
> > > > > The main
> > > > > concern is that the whole-program optimisations enabled by LTO may 
> > > > > allow the
> > > > > compiler to enumerate possible values for a pointer at link time and 
> > > > > replace
> > > > > an address dependency between two loads with a control dependency 
> > > > > instead,
> > > > > defeating the dependency ordering within the CPU.
> > > > 
> > > > Why can't that happen without LTO?
> > > 
> > > Because without LTO, the compiler cannot see all the pointers all at
> > > the same time due to their being in different translation units.
> > > 
> > > But yes, if the compiler could see all the pointer values and further
> > > -know- that it was seeing all the pointer values, these optimizations
> > > could happen even without LTO.  But it is quite easy to make sure that
> > > the compiler thinks that there are additional pointer values that it
> > > does not know about.
> > 
> > Yes of course, but even without LTO the compiler can still apply this
> > optimisation to everything visible in the translation unit, and that
> > can drift as people refactor code over time.

Re: [PATCH 18/18] arm64: lto: Strengthen READ_ONCE() to acquire when CLANG_LTO=y

2020-07-07 Thread Dave Martin
On Mon, Jul 06, 2020 at 07:35:11PM +0100, Will Deacon wrote:
> On Mon, Jul 06, 2020 at 05:08:20PM +0100, Dave Martin wrote:
> > On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > > diff --git a/arch/arm64/include/asm/rwonce.h 
> > > b/arch/arm64/include/asm/rwonce.h
> > > new file mode 100644
> > > index ..515e360b01a1
> > > --- /dev/null
> > > +++ b/arch/arm64/include/asm/rwonce.h
> > > @@ -0,0 +1,63 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * Copyright (C) 2020 Google LLC.
> > > + */
> > > +#ifndef __ASM_RWONCE_H
> > > +#define __ASM_RWONCE_H
> > > +
> > > +#ifdef CONFIG_CLANG_LTO
> > > +
> > > +#include 
> > > +#include 
> > > +
> > > +#ifndef BUILD_VDSO
> > > +
> > > +#ifdef CONFIG_AS_HAS_LDAPR
> > > +#define __LOAD_RCPC(sfx, regs...)
> > > \
> > > + ALTERNATIVE(\
> > > + "ldar"  #sfx "\t" #regs,\
> > 
> > ^ Should this be here?  It seems that READ_ONCE() will actually read
> > twice... even if that doesn't actually conflict with the required
> > semantics of READ_ONCE(), it looks odd.
> 
> It's patched at runtime, so it's either LDAR or LDAPR.

Agh, ignore me, I somehow failed to spot the ALTERNATIVE().

For my understanding -- my background here is a bit shaky -- the LDAPR
gives us load-to-load order even if there is just a control dependency?

If so (possibly dumb question): why can't we just turn this on
unconditionally?  Is there a significant performance impact?

I'm still confused (or ignorant) though.  If both loads are READ_ONCE()
then switching to LDAPR presumably helps, but otherwise, once the
compiler has reduced the address dependency to a control dependency
can't it then go one step further and reverse the order of the loads?
LDAPR wouldn't rescue us from that.

Or does the "memory" clobber in READ_ONCE() fix that for all important
cases?  I can't see this mattering for local variables (where it
definitely won't work), but I wonder whether static variables might not
count as "memory" in some situations.

Discounting ridiculous things like static register variables, I think
the only way for a static variable not to count as memory would be if
there are no writes to it that are reachable from any translation unit
entry point (possibly after dead code removal).  If so, maybe that's
enough.

> > Making a direct link between LTO and the memory model also seems highly
> > spurious (as discussed in the other subthread) so can we have a comment
> > explaining the reasoning?
> 
> Sure, although like I say, this is more about helping to progress that
> conversation.

That's fair enough, but when there is a consensus it would be good to
see it documented in the code _especially_ if we know that the fix won't
address all instances of the problem and in any case works partly by
accident.  That doesn't mean it's not a good practical compromise, but
it could be very confusing to unpick later on.

Cheers
---Dave


Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

2020-07-06 Thread Dave Martin
On Sat, Jul 04, 2020 at 04:02:46PM +0200, Greg Kroah-Hartman wrote:
> Here is a tiny new syscall, readfile, that makes it simpler to read
> small/medium sized files all in one shot, no need to do open/read/close.
> This is especially helpful for tools that poke around in procfs or
> sysfs, making a little bit of a less system load than before, especially
> as syscall overheads go up over time due to various CPU bugs being
> addressed.
> 
> There are 4 patches in this series, the first 3 are against the kernel
> tree, adding the syscall logic, wiring up the syscall, and adding some
> tests for it.
> 
> The last patch is against the man-pages project, adding a tiny man page
> to try to describe the new syscall.

General question, using this series as an illustration only:


At the risk of starting a flamewar, why is this needed?  Is there a
realistic usecase that would get significant benefit from this?

A lot of syscalls seem to get added that combine or refactor the
functionality of existing syscalls without justifying why this is
needed (or even wise).  This case feels like a solution, not a
primitive, so I wonder if the long-term ABI fragmentation is worth the
benefit.

I ask because I'd like to get an idea of the policy on what is and is
not considered a frivolous ABI extension.

(I'm sure a usecase must be in mind, but it isn't mentioned here.
Certainly the time it takes top to dump the contents of /proc leaves
something to be desired.)

Cheers
---Dave


Re: [PATCH 3/3] Documentation: arm64/sve: drop duplicate words

2020-07-06 Thread Dave Martin
On Fri, Jul 03, 2020 at 01:51:10PM -0700, Randy Dunlap wrote:
> Drop the doubled word "for".
> 
> Signed-off-by: Randy Dunlap 
> Cc: Jonathan Corbet 
> Cc: linux-...@vger.kernel.org
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: Dave Martin 

Thanks!

Acked-by: Dave Martin 

> ---
>  Documentation/arm64/sve.rst |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-next-20200701.orig/Documentation/arm64/sve.rst
> +++ linux-next-20200701/Documentation/arm64/sve.rst
> @@ -494,7 +494,7 @@ Appendix B.  ARMv8-A FP/SIMD programmer'
>  Note: This section is for information only and not intended to be complete or
>  to replace any architectural specification.
>  
> -Refer to [4] for for more information.
> +Refer to [4] for more information.
>  
>  ARMv8-A defines the following floating-point / SIMD register state:
>  
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [PATCH 18/18] arm64: lto: Strengthen READ_ONCE() to acquire when CLANG_LTO=y

2020-07-06 Thread Dave Martin
On Mon, Jul 06, 2020 at 09:34:55AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 06, 2020 at 05:00:23PM +0100, Dave Martin wrote:
> > On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> > > On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > > > On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > > > > When building with LTO, there is an increased risk of the compiler
> > > > > converting an address dependency headed by a READ_ONCE() invocation
> > > > > into a control dependency and consequently allowing for harmful
> > > > > reordering by the CPU.
> > > > > 
> > > > > Ensure that such transformations are harmless by overriding the 
> > > > > generic
> > > > > READ_ONCE() definition with one that provides acquire semantics when
> > > > > building with LTO.
> > > > > 
> > > > > Signed-off-by: Will Deacon 
> > > > > ---
> > > > >  arch/arm64/include/asm/rwonce.h   | 63 
> > > > > +++
> > > > >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> > > > >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> > > > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > > > >  create mode 100644 arch/arm64/include/asm/rwonce.h
> > > > > 
> > > > > diff --git a/arch/arm64/include/asm/rwonce.h 
> > > > > b/arch/arm64/include/asm/rwonce.h
> > > > > new file mode 100644
> > > > > index ..515e360b01a1
> > > > > --- /dev/null
> > > > > +++ b/arch/arm64/include/asm/rwonce.h
> > > > > @@ -0,0 +1,63 @@
> > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > +/*
> > > > > + * Copyright (C) 2020 Google LLC.
> > > > > + */
> > > > > +#ifndef __ASM_RWONCE_H
> > > > > +#define __ASM_RWONCE_H
> > > > > +
> > > > > +#ifdef CONFIG_CLANG_LTO
> > > > 
> > > > Don't we have a generic option for LTO that's not specific to Clang.
> > > 
> > > /me looks at the LTO series some more
> > > 
> > > Oh yeah, there's CONFIG_LTO which is selected by CONFIG_LTO_CLANG, which 
> > > is
> > > the non-typoed version of the above. I can switch this to CONFIG_LTO.
> > > 
> > > > Also, can you illustrate code that can only be unsafe with Clang LTO?
> > > 
> > > I don't have a concrete example, but it's an ongoing concern over on the 
> > > LTO
> > > thread [1], so I cooked this to show one way we could deal with it. The 
> > > main
> > > concern is that the whole-program optimisations enabled by LTO may allow 
> > > the
> > > compiler to enumerate possible values for a pointer at link time and 
> > > replace
> > > an address dependency between two loads with a control dependency instead,
> > > defeating the dependency ordering within the CPU.
> > 
> > Why can't that happen without LTO?
> 
> Because without LTO, the compiler cannot see all the pointers all at
> the same time due to their being in different translation units.
> 
> But yes, if the compiler could see all the pointer values and further
> -know- that it was seeing all the pointer values, these optimizations
> could happen even without LTO.  But it is quite easy to make sure that
> the compiler thinks that there are additional pointer values that it
> does not know about.

Yes of course, but even without LTO the compiler can still apply this
optimisation to everything visible in the translation unit, and that can
drift as people refactor code over time.

Convincing the compiler there are other possible values doesn't help.
Even in

int foo(int *p)
{
	asm ("" : "+r" (p));
	return *p;
}

Can't the compiler still generate something like this:

switch (p) {
case &foo:
	return foo;

case &bar:
	return bar;

default:
	return *p;
}

...in which case we still have the same lost ordering guarantee that
we were trying to enforce.

If foo and bar already happen to be in registers and profiling shows
that &foo and &bar are the most likely values of p, then this might be
a reasonable optimisation in some situations, irrespective of LTO.

The underlying problem here seems to be that the necessary ordering
rule is not part of what passes for the C memory model prior to C11.
If we want to control the data flow, don't we have to wrap the entire
dereference in a macro?

Re: [PATCH 18/18] arm64: lto: Strengthen READ_ONCE() to acquire when CLANG_LTO=y

2020-07-06 Thread Dave Martin
On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> When building with LTO, there is an increased risk of the compiler
> converting an address dependency headed by a READ_ONCE() invocation
> into a control dependency and consequently allowing for harmful
> reordering by the CPU.
> 
> Ensure that such transformations are harmless by overriding the generic
> READ_ONCE() definition with one that provides acquire semantics when
> building with LTO.
> 
> Signed-off-by: Will Deacon 
> ---
>  arch/arm64/include/asm/rwonce.h   | 63 +++
>  arch/arm64/kernel/vdso/Makefile   |  2 +-
>  arch/arm64/kernel/vdso32/Makefile |  2 +-
>  3 files changed, 65 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/include/asm/rwonce.h
> 
> diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> new file mode 100644
> index ..515e360b01a1
> --- /dev/null
> +++ b/arch/arm64/include/asm/rwonce.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2020 Google LLC.
> + */
> +#ifndef __ASM_RWONCE_H
> +#define __ASM_RWONCE_H
> +
> +#ifdef CONFIG_CLANG_LTO
> +
> +#include 
> +#include 
> +
> +#ifndef BUILD_VDSO
> +
> +#ifdef CONFIG_AS_HAS_LDAPR
> +#define __LOAD_RCPC(sfx, regs...)\
> + ALTERNATIVE(\
> + "ldar"  #sfx "\t" #regs,\

^ Should this be here?  It seems that READ_ONCE() will actually read
twice... even if that doesn't actually conflict with the required
semantics of READ_ONCE(), it looks odd.

Making a direct link between LTO and the memory model also seems highly
spurious (as discussed in the other subthread) so can we have a comment
explaining the reasoning?

> + ".arch_extension rcpc\n"\
> + "ldapr" #sfx "\t" #regs,\
> + ARM64_HAS_LDAPR)
> +#else
> +#define __LOAD_RCPC(sfx, regs...)"ldar" #sfx "\t" #regs
> +#endif /* CONFIG_AS_HAS_LDAPR */

[...]

Cheers
---Dave


Re: [PATCH 18/18] arm64: lto: Strengthen READ_ONCE() to acquire when CLANG_LTO=y

2020-07-06 Thread Dave Martin
On Thu, Jul 02, 2020 at 08:23:02AM +0100, Will Deacon wrote:
> On Wed, Jul 01, 2020 at 06:07:25PM +0100, Dave P Martin wrote:
> > On Tue, Jun 30, 2020 at 06:37:34PM +0100, Will Deacon wrote:
> > > When building with LTO, there is an increased risk of the compiler
> > > converting an address dependency headed by a READ_ONCE() invocation
> > > into a control dependency and consequently allowing for harmful
> > > reordering by the CPU.
> > > 
> > > Ensure that such transformations are harmless by overriding the generic
> > > READ_ONCE() definition with one that provides acquire semantics when
> > > building with LTO.
> > > 
> > > Signed-off-by: Will Deacon 
> > > ---
> > >  arch/arm64/include/asm/rwonce.h   | 63 +++
> > >  arch/arm64/kernel/vdso/Makefile   |  2 +-
> > >  arch/arm64/kernel/vdso32/Makefile |  2 +-
> > >  3 files changed, 65 insertions(+), 2 deletions(-)
> > >  create mode 100644 arch/arm64/include/asm/rwonce.h
> > > 
> > > diff --git a/arch/arm64/include/asm/rwonce.h 
> > > b/arch/arm64/include/asm/rwonce.h
> > > new file mode 100644
> > > index ..515e360b01a1
> > > --- /dev/null
> > > +++ b/arch/arm64/include/asm/rwonce.h
> > > @@ -0,0 +1,63 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * Copyright (C) 2020 Google LLC.
> > > + */
> > > +#ifndef __ASM_RWONCE_H
> > > +#define __ASM_RWONCE_H
> > > +
> > > +#ifdef CONFIG_CLANG_LTO
> > 
> > Don't we have a generic option for LTO that's not specific to Clang.
> 
> /me looks at the LTO series some more
> 
> Oh yeah, there's CONFIG_LTO which is selected by CONFIG_LTO_CLANG, which is
> the non-typoed version of the above. I can switch this to CONFIG_LTO.
> 
> > Also, can you illustrate code that can only be unsafe with Clang LTO?
> 
> I don't have a concrete example, but it's an ongoing concern over on the LTO
> thread [1], so I cooked this to show one way we could deal with it. The main
> concern is that the whole-program optimisations enabled by LTO may allow the
> compiler to enumerate possible values for a pointer at link time and replace
> an address dependency between two loads with a control dependency instead,
> defeating the dependency ordering within the CPU.

Why can't that happen without LTO?

> We likely won't realise if/when this goes wrong, other than impossible to
> debug, subtle breakage that crops up seemingly randomly. Ideally, we'd be
> able to detect this sort of thing happening at build time, and perhaps
> even prevent it with compiler options or annotations, but none of that is
> close to being available and I'm keen to progress the LTO patches in the
> meantime because they are a requirement for CFI.

My concern was not so much why LTO makes things dangerous, as why !LTO
makes things safe...

Cheers
---Dave


Re: [PATCH v3 3/9] efi/libstub: Remove .note.gnu.property

2020-06-24 Thread Dave Martin
On Wed, Jun 24, 2020 at 06:40:48PM +0200, Ard Biesheuvel wrote:
> On Wed, 24 Jun 2020 at 18:29, Dave Martin  wrote:
> >
> > On Wed, Jun 24, 2020 at 05:48:41PM +0200, Ard Biesheuvel wrote:
> > > On Wed, 24 Jun 2020 at 17:45, Kees Cook  wrote:
> > > >
> > > > On Wed, Jun 24, 2020 at 05:31:06PM +0200, Ard Biesheuvel wrote:
> > > > > On Wed, 24 Jun 2020 at 17:21, Kees Cook  wrote:
> > > > > >
> > > > > > On Wed, Jun 24, 2020 at 12:46:32PM +0200, Ard Biesheuvel wrote:
> > > > > > > I'm not sure if there is a point to having PAC and/or BTI in the 
> > > > > > > EFI
> > > > > > > stub, given that it runs under the control of the firmware, with 
> > > > > > > its
> > > > > > > memory mappings and PAC configuration etc.
> > > > > >
> > > > > > Is BTI being ignored when the firmware runs?
> > > > >
> > > > > Given that it requires the 'guarded' attribute to be set in the page
> > > > > tables, and the fact that the UEFI spec does not require it for
> > > > > executables that it invokes, nor describes any means of annotating
> > > > > such executables as having been built with BTI annotations, I think we
> > > > > can safely assume that the EFI stub will execute with BTI disabled in
> > > > > the foreseeable future.
> > > >
> > > > yaay. *sigh* How long until EFI catches up?
> > > >
> > > > That said, BTI shouldn't _hurt_, right? If EFI ever decides to enable
> > > > it, we'll be ready?
> > > >
> > >
> > > Sure. Although I anticipate that we'll need to set some flag in the
> > > PE/COFF header to enable it, and so any BTI opcodes we emit without
> > > that will never take effect in practice.
> >
> > In the meantime, is it possible to build all the in-tree parts of EFI
> > for BTI, and just turn it off for out-of-tree EFI binaries?
> >
> 
> Not sure I understand the question. What do you mean by out-of-tree
> EFI binaries? And how would the firmware (which is out of tree itself,
> and is in charge of the page tables, vector table, timer interrupt etc
> when the EFI stub executes) distinguish such binaries from the EFI
> stub?

I'm not an EFI expert, but I'm guessing that you configure EFI with
certain compiler flags and build it.  Possibly some standalone EFI
executables are built out of the same tree and shipped with the
firmware from the same build, but I'm speculating.  If not, we can just
run all EFI executables with BTI off.

> > If there's no easy way to do this though, I guess we should wait for /
> > push for a PE/COFF flag to describe this properly.
> >
> 
> Yeah good point. I will take this to the forum.

In the interim, we could set the GP bit in EFI's page tables for the
executable code from the firmware image if we want this protection, but
turn it off in pages mapping the executable code of EFI executables.
This is better than nothing.

Cheers
---Dave


Re: [PATCH v3 3/9] efi/libstub: Remove .note.gnu.property

2020-06-24 Thread Dave Martin
On Wed, Jun 24, 2020 at 05:48:41PM +0200, Ard Biesheuvel wrote:
> On Wed, 24 Jun 2020 at 17:45, Kees Cook  wrote:
> >
> > On Wed, Jun 24, 2020 at 05:31:06PM +0200, Ard Biesheuvel wrote:
> > > On Wed, 24 Jun 2020 at 17:21, Kees Cook  wrote:
> > > >
> > > > On Wed, Jun 24, 2020 at 12:46:32PM +0200, Ard Biesheuvel wrote:
> > > > > I'm not sure if there is a point to having PAC and/or BTI in the EFI
> > > > > stub, given that it runs under the control of the firmware, with its
> > > > > memory mappings and PAC configuration etc.
> > > >
> > > > Is BTI being ignored when the firmware runs?
> > >
> > > Given that it requires the 'guarded' attribute to be set in the page
> > > tables, and the fact that the UEFI spec does not require it for
> > > executables that it invokes, nor describes any means of annotating
> > > such executables as having been built with BTI annotations, I think we
> > > can safely assume that the EFI stub will execute with BTI disabled in
> > > the foreseeable future.
> >
> > yaay. *sigh* How long until EFI catches up?
> >
> > That said, BTI shouldn't _hurt_, right? If EFI ever decides to enable
> > it, we'll be ready?
> >
> 
> Sure. Although I anticipate that we'll need to set some flag in the
> PE/COFF header to enable it, and so any BTI opcodes we emit without
> that will never take effect in practice.

In the meantime, is it possible to build all the in-tree parts of EFI
for BTI, and just turn it off for out-of-tree EFI binaries?

If there's no easy way to do this though, I guess we should wait for /
push for a PE/COFF flag to describe this properly.

Cheers
---Dave


Re: [PATCH v3 3/9] efi/libstub: Remove .note.gnu.property

2020-06-24 Thread Dave Martin
On Wed, Jun 24, 2020 at 04:26:46PM +0100, Will Deacon wrote:
> On Wed, Jun 24, 2020 at 02:48:55PM +0100, Dave Martin wrote:
> > On Wed, Jun 24, 2020 at 12:26:47PM +0100, Will Deacon wrote:
> > > On Wed, Jun 24, 2020 at 12:46:32PM +0200, Ard Biesheuvel wrote:
> > > > On Wed, 24 Jun 2020 at 12:44, Will Deacon  wrote:
> > > > > For the kernel Image, how do we remove these sections? The objcopy 
> > > > > flags
> > > > > in arch/arm64/boot/Makefile look both insufficient and out of date. My
> > > > > vmlinux ends up with both a ".notes" and a ".init.note.gnu.property"
> > > > > segment.
> > > > 
> > > > The latter is the fault of the libstub make rules, that prepend .init
> > > > to all section names.
> > > 
> > > Hmm. I tried adding -mbranch-protection=none to arm64 cflags for the stub,
> > > but I still see this note in vmlinux. It looks like it comes in via the
> > > stub copy of lib-ctype.o, but I don't know why that would force the
> > > note. The cflags look ok to me [1] and I confirmed that the note is
> > > being generated by the compiler.
> > > 
> > > > I'm not sure if there is a point to having PAC and/or BTI in the EFI
> > > > stub, given that it runs under the control of the firmware, with its
> > > > memory mappings and PAC configuration etc.
> > > 
> > > Agreed, I just can't figure out how to get rid of the note.
> > 
> > Because this section is generated by the linker itself I think you might
> > have to send it to /DISCARD/ in the link, or strip it explicitly after
> > linking.
> 
> Right, but why is the linker generating that section in the first place? I'm
> compiling with -mbranch-protection=none and all the other objects linked
> into the stub do not have the section.
> 
> I wonder if it's because lib/ctype.c doesn't have any executable code...

What compiler and flags are you using for the affected object?  I don't
see this with gcc so far.

I wonder if this is a hole in the specs: the property could logically
be emitted in any codeless object, since turning on BTI will obviously
not break that object.

For different linkers and compilers to interoperate though, the specs
would need to say what to do in that situation.

Cheers
---Dave



> 
> Will
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [PATCH v3 3/9] efi/libstub: Remove .note.gnu.property

2020-06-24 Thread Dave Martin
On Wed, Jun 24, 2020 at 12:26:47PM +0100, Will Deacon wrote:
> On Wed, Jun 24, 2020 at 12:46:32PM +0200, Ard Biesheuvel wrote:
> > On Wed, 24 Jun 2020 at 12:44, Will Deacon  wrote:
> > > On Tue, Jun 23, 2020 at 09:44:11PM -0700, Kees Cook wrote:
> > > > On Tue, Jun 23, 2020 at 08:31:42PM -0700, 'Fangrui Song' via Clang 
> > > > Built Linux wrote:
> > > > > arch/arm64/Kconfig enables ARM64_PTR_AUTH by default. When the config 
> > > > > is on
> > > > >
> > > > > ifeq ($(CONFIG_ARM64_BTI_KERNEL),y)
> > > > > branch-prot-flags-$(CONFIG_CC_HAS_BRANCH_PROT_PAC_RET_BTI) := 
> > > > > -mbranch-protection=pac-ret+leaf+bti
> > > > > else
> > > > > branch-prot-flags-$(CONFIG_CC_HAS_BRANCH_PROT_PAC_RET) := 
> > > > > -mbranch-protection=pac-ret+leaf
> > > > > endif
> > > > >
> > > > > This option creates .note.gnu.property:
> > > > >
> > > > > % readelf -n drivers/firmware/efi/libstub/efi-stub.o
> > > > >
> > > > > Displaying notes found in: .note.gnu.property
> > > > >   OwnerData sizeDescription
> > > > >   GNU  0x0010   NT_GNU_PROPERTY_TYPE_0
> > > > >   Properties: AArch64 feature: PAC
> > > > >
> > > > > If .note.gnu.property is not desired in drivers/firmware/efi/libstub, 
> > > > > specifying
> > > > > -mbranch-protection=none can override -mbranch-protection=pac-ret+leaf
> > > >
> > > > We want to keep the branch protection enabled. But since it's not a
> > > > "regular" ELF, we don't need to keep the property that identifies the
> > > > feature.
> > >
> > > For the kernel Image, how do we remove these sections? The objcopy flags
> > > in arch/arm64/boot/Makefile look both insufficient and out of date. My
> > > vmlinux ends up with both a ".notes" and a ".init.note.gnu.property"
> > > segment.
> > >
> > 
> > The latter is the fault of the libstub make rules, that prepend .init
> > to all section names.
> 
> Hmm. I tried adding -mbranch-protection=none to arm64 cflags for the stub,
> but I still see this note in vmlinux. It looks like it comes in via the
> stub copy of lib-ctype.o, but I don't know why that would force the
> note. The cflags look ok to me [1] and I confirmed that the note is
> being generated by the compiler.
> 
> > I'm not sure if there is a point to having PAC and/or BTI in the EFI
> > stub, given that it runs under the control of the firmware, with its
> > memory mappings and PAC configuration etc.
> 
> Agreed, I just can't figure out how to get rid of the note.

Because this section is generated by the linker itself I think you might
have to send it to /DISCARD/ in the link, or strip it explicitly after
linking.

Cheers
---Dave


Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Dave Martin
On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:
> On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
> > On Wed, 17 Jun 2020 at 13:39, Steven Price  wrote:
> > > These patches add support to KVM to enable MTE within a guest. It is
> > > based on Catalin's v4 MTE user space series[1].
> > >
> > > [1] 
> > > http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> > >
> > > Posting as an RFC as I'd like feedback on the approach taken.
> > 
> > What's your plan for handling tags across VM migration?
> > Will the kernel expose the tag ram to userspace so we
> > can copy it from the source machine to the destination
> > at the same time as we copy the actual ram contents ?
> 
> Qemu can map the guest memory with PROT_MTE and access the tags directly
> with LDG/STG instructions. Steven was actually asking in the cover
> letter whether we should require that the VMM maps the guest memory with
> PROT_MTE as a guarantee that it can access the guest tags.
> 
> There is no architecturally visible tag ram (tag storage), that's a
> microarchitecture detail.

If userspace maps the guest memory with PROT_MTE for dump purposes,
isn't it going to get tag check faults when accessing the memory
(i.e., when dumping the regular memory content, not the tags
specifically)?

Does it need to map two aliases, one with PROT_MTE and one without,
and is that architecturally valid?
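
For the sake of discussion, the two-alias idea might look roughly like
the sketch below from the VMM side.  PROT_MTE and its value are taken
from the MTE series; the use of a memfd here (so the pages can be mapped
twice) is just an assumption of the sketch.

#include <stddef.h>
#include <sys/mman.h>

#ifndef PROT_MTE
#define PROT_MTE 0x20	/* arm64-specific, from the MTE patch series */
#endif

/* Map guest memory twice: one alias for bulk data access (no tag
 * checks), one tagged alias for LDG/STG access to the tags. */
static void *map_guest_mem(int memfd, size_t len, void **tagged_alias)
{
	void *data = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED, memfd, 0);

	*tagged_alias = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_MTE,
			     MAP_SHARED, memfd, 0);
	return data;
}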

Cheers
---Dave


Re: [PATCH] arm64: fpsimd: Added API to manage fpsimd state inside kernel

2020-06-15 Thread Dave Martin
On Thu, Jun 11, 2020 at 03:11:02PM +0100, Catalin Marinas wrote:
> On Thu, Jun 11, 2020 at 06:42:12PM +0900, Wooyeon Kim wrote:
> > I am in charge of camera driver development in Samsung S.LSI division.
> > 
> > In order to guarantee real-time processing such as the Camera 3A
> > algorithm in current or ongoing projects, a prebuilt binary is loaded
> > and used in kernel space, rather than user space.
> 
> Thanks for the additional details.

I have to ask: there are other camera drivers in existence already.
What makes your hardware so different that it requires all this data
processing to be done inside the kernel?

> If you do such intensive processing in an IRQ context you'd probably
> introduce additional IRQ latency. Wouldn't offloading such work to a
> real-time (user) thread help? In a non-preempt-rt kernel, I don't think
> you can get much in terms of (soft) guarantees for IRQ latency anyway.
> 
> > Because the binary is built with other standard libraries which could
> > use FPSIMD registers, the kernel API should preserve the original
> > FPSIMD state for other user tasks.
> 
> Can you not recompile those libraries not to use FP?
> 
> As Mark said, for a kernel API we require at least an in-kernel,
> upstreamed, user of that functionality.
> 
> > In the case of the kernel_neon_begin / kernel_neon_end that you mentioned,
> > there is a limitation that they cannot be used in hardirq context.
> > Also, if a kernel task switch occurs while the kernel API is being
> > used, FPSIMD register corruption may occur.
> 
> kernel_neon_begin/end disable preemption, so you can't have a task
> switch (you can have interrupts though but we don't allow FPSIMD in IRQ
> context).

Note, the decision not to support kernel_neon_begin / kernel_neon_end in
hardirq context was deliberate.  hardirq handlers shouldn't usually do
anything at all except ensure that something responds to the hardware
event, by waking some other thread or scheduling a workqueue item for
example.  An IRQ handler that only does that has no need to do any data
processing, and gains no advantage from using FPSIMD.

Doing additional work in hardirq context will harm interrupt latency for
the rest of the system.

So, you should move the data processing work out of the hardirq handler.
Is there a reason why this is not possible?


Secondly, there is the question of whether FPSIMD can be used by kernel
threads.  Currently this is only supported in a limited way.  Again,
this a deliberate decision, for now.

Can you split the processing work into small blocks using
kernel_neon_begin/kernel_neon_end, similarly to the arm64 crypto
drivers?

This is the current accepted way of doing number crunching inside the
kernel without harming preemption latency too much.  Even so, it's
primarily intended for things that affect critical paths inside the
kernel, such as crypto or checksumming in the filesysem and network
subsystems.
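
A minimal sketch of that pattern is below, with my_neon_process_block()
standing in for whatever FPSIMD work needs doing (it is not a real
kernel API):

#include <asm/neon.h>
#include <linux/kernel.h>
#include <linux/sizes.h>
#include <linux/types.h>

static void my_process_buffer(u8 *buf, size_t len)
{
	while (len) {
		size_t n = min_t(size_t, len, SZ_4K);

		kernel_neon_begin();		/* disables preemption */
		my_neon_process_block(buf, n);	/* FPSIMD/NEON work here */
		kernel_neon_end();		/* preemption point */

		buf += n;
		len -= n;
	}
}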

Cheers
---Dave


Re: [PATCH] arm64: fpsimd: Added API to manage fpsimd state inside kernel

2020-06-08 Thread Dave Martin
On Fri, Jun 05, 2020 at 11:37:05AM +0100, Mark Rutland wrote:
> Hi Wooyeon,
> 
> There are a *lot* of people Cc'd here, many of whom will find this
> irrelevant. Please try to keep the Cc list constrained to a reasonable
> number of interested parties.
> 
> On Fri, Jun 05, 2020 at 04:30:52PM +0900, Wooyeon Kim wrote:
> > From: Wooki Min 
> > 
> >  This is a patch to use FPSIMD registers in kernel space.
> >  It needs to manage use of the FPSIMD registers without damaging
> >  the user task's state.
> >  The following items have been implemented and added.
> 
> Please introduce the problem you are trying to solve in more detail. We
> already have kernel_neon_{begin,end}() for kernel-mode NEON; why is that
> not sufficient for your needs? Please answer this before considering
> other details.
> 
> What do you want to use this for?
> 
> > 
> >  1. Using FPSIMD in ISR (in_interrupt)
> > It can used __efi_fpsimd_begin/__efi_fpsimd_end
> > which is already implemented.
> > Save fpsimd state before entering ISR,
> > and restore fpsimd state after ISR ends.
> > For use in external kernel module,
> > it is declared as EXPORT_SYMBOL.
> 
> This patch adds no in-tree modular users of this, so per the usual
> conventions, NAK to EXPORT_SYMBOL().

Ack, this looks suspicious.  Can you explain why your usecase _requires_
FPSIMD in hardirq context?

For now, these functions are strictly for EFI use only and should never
be used by modules.

Cheers
---Dave


Re: arm64: Register modification during syscall entry/exit stop

2020-06-01 Thread Dave Martin
On Mon, Jun 01, 2020 at 05:40:28AM -0400, Keno Fischer wrote:
> On Mon, Jun 1, 2020 at 5:23 AM Dave Martin  wrote:
> > > > Can't PTRACE_SYSEMU be emulated by using PTRACE_SYSCALL, cancelling the
> > > > syscall at the syscall enter stop, then modifying the regs at the
> > > > syscall exit stop?
> > >
> > > Yes, it can. The idea behind SYSEMU is to be able to save half the
> > > ptrace traps that would require, in theory making the ptracer
> > > a decent amount faster. That said, the x7 issue is orthogonal to
> > > SYSEMU, you'd have the same issues if you used PTRACE_SYSCALL.
> >
> > Right, I just wondered whether there was some deeper difference between
> > the two approaches.
> 
> You're asking about a new regset vs trying to do it via ptrace option?

I meant SYSEMU versus SYSCALL + cancellation and emulating the syscall
at the syscall exit stop.

i.e., I was trying to understand whether SYSEMU is just a convenience,
or does some magic that can't be reproduced by other means.

> I don't think there's anything a ptrace option can do that a new regset
> that replicates the same registers (I'm gonna propose adding orig_x0,
> while we're at it, and changing the x0 semantics a bit; I'll have
> those details with the patch) wouldn't be able to do.  The reason I
> originally thought it might have to be a ptrace option is because
> the register modification currently gets applied in the syscall entry
> code to the actual regs struct, so I thought you might have to know
> to preserve those registers. However, then I realized that you could
> just change the regset accessors to emulate the old behavior, since
> we do already store all the required information (what kind of stop
> we're currently at) in order to be able to answer the ptrace
> informational queries. So doing that is probably just all around
> easier. I guess NT_PRSTATUS might also rot, but I guess strace
> doesn't really have to stop using it, since it doesn't care about
> the x7 value nor does it need to modify it.

I think NT_PRSTATUS probably doesn't need to change.

Having a duplicate regset feels like a worse outcome than having a new
ptrace option.  Undocumentedly different things already happen to the
regs depending on how the tracee stopped, so adding a new special case
doesn't seem to justify creating a new regset.

Cheers
---Dave


Re: arm64: Register modification during syscall entry/exit stop

2020-06-01 Thread Dave Martin
On Mon, Jun 01, 2020 at 05:23:01AM -0400, Keno Fischer wrote:
> On Mon, Jun 1, 2020 at 5:14 AM Dave Martin  wrote:
> > Can you explain why userspace would write a changed value for x7
> > but at the same time need that new value to be thrown away?
> 
> The discarding behavior is the primary reason things aren't completely
> broken at the moment. If it read the wrong x7 value and didn't know about
> the Aarch64 quirk, it's often just trying to write that same wrong
> value back during the next stop, so if that's just ignored,
> that's probably fine in 99% of cases, since the value in the
> tracee will be undisturbed.

I guess that's my question: when is x7 "disturbed"?

Other than sigreturn, I can't think of a case.

I'm likely missing some aspect of what you're trying to do.

> I don't think there's a sane way to change the aarch64 NT_PRSTATUS
> semantics without just completely removing the x7 behavior, but of course
> people may be relying on that (I think somebody said upthread that strace 
> does?)

Since rt_sigreturn emulation was always broken, can we just say
that the effect of updating any reg other than x0 is unspecified in this
case?

Even fixing the x7 issue won't magically teach your tracer how to
deal with unrecognised data in the signal frame, so new hardware or
a new kernel could cause your tracer to become subtly broken.  Would you
be better off tweaking the real signal frame as desired and doing a real
rt_sigreturn for example, instead of attempting to emulate it?


I'm somewhat playing devil's advocate here...

Cheers
---Dave


Re: arm64: Register modification during syscall entry/exit stop

2020-06-01 Thread Dave Martin
On Sun, May 31, 2020 at 12:20:51PM -0400, Keno Fischer wrote:
> > Can't PTRACE_SYSEMU be emulated by using PTRACE_SYSCALL, cancelling the
> > syscall at the syscall enter stop, then modifying the regs at the
> > syscall exit stop?
> 
> Yes, it can. The idea behind SYSEMU is to be able to save half the
> ptrace traps that would require, in theory making the ptracer
> a decent amount faster. That said, the x7 issue is orthogonal to
> SYSEMU, you'd have the same issues if you used PTRACE_SYSCALL.

Right, I just wondered whether there was some deeper difference between
the two approaches.

Cheers
---Dave


Re: arm64: Register modification during syscall entry/exit stop

2020-06-01 Thread Dave Martin
On Sun, May 31, 2020 at 12:13:18PM -0400, Keno Fischer wrote:
> > Keno -- are you planning to send out a patch? You previously spoke about
> > implementing this using PTRACE_SETOPTIONS.
> 
> Yes, I'll have a patch for you. Though I've come to the conclusion
> that introducing a new regset is probably a better way to solve it.
> We can then also expose orig_x0 at the same time and give it sane semantics
> (there's some problems with the way it works currently - I'll write it up
> together with the patch).

I'd worry that having a new ptrace option would be useless bug-
compatibility that is just going to bitrot.

Can you explain why userspace would write a changed value for x7
but at the same time need that new value to be thrown away?

That sounds like a nonsensical thing for userspace to be doing.

Cheers
---Dave


Re: [PATCH] arm64: vdso32: force vdso32 to be compiled as -marm

2020-05-27 Thread Dave Martin
On Tue, May 26, 2020 at 09:45:05PM +0100, Will Deacon wrote:
> On Tue, 26 May 2020 10:31:14 -0700, Nick Desaulniers wrote:
> > Custom toolchains that modify the default target to -mthumb cannot

It's probably too late to water this down, but it's unfortunate to have
this comment in the upstream commit history.

It's not constructive to call the native compiler configuration of
major distros for many years a "custom" toolchain.  Unmodified GCC has
had a clean configure option for this for a very long time; it's not
someone's dirty hack.  (The wisdom of armhf's choice of -mthumb might
be debated, but it is well established.)

Ignoring the triplet and passing random options to a compiler in the
hopes that it will do the right thing for an irregular usecase has never
been reliable.  Usecases don't get much more irregular than building
vdso32.

arch/arm has the proper options in its Makefiles.

This patch is a kernel bugfix, plain and simple.

> > compile the arm64 compat vdso32, as
> > arch/arm64/include/asm/vdso/compat_gettimeofday.h
> > contains assembly that's invalid in -mthumb.  Force the use of -marm,
> > always.
> 
> Applied to arm64 (for-next/vdso), thanks!
> 
> [1/1] arm64: vdso32: force vdso32 to be compiled as -marm
>   https://git.kernel.org/arm64/c/20363b59ad4f

Does this need to go to stable?

Cheers
---Dave


Re: arm64: Register modification during syscall entry/exit stop

2020-05-27 Thread Dave Martin
On Wed, May 27, 2020 at 10:55:29AM +0100, Will Deacon wrote:
> On Sun, May 24, 2020 at 02:56:35AM -0400, Keno Fischer wrote:
> > Just ran into this issue again, with what I think may be most compelling
> > example yet why this is problematic:
> > 
> > The tracee incurred a signal, we PTRACE_SYSEMU'd to the rt_sigreturn,
> > which the tracer tried to emulate by applying the state from the signal 
> > frame.
> > However, the PTRACE_SYSEMU stop is a syscall-stop, so the tracer's write
> > to x7 was ignored and x7 retained the value it had in the signal handler,
> > which broke the tracee.
> 
> Yeah, that sounds like a good justification to add a way to stop this. Could
> you send a patch, please?
> 
> Interestingly, I *thought* the current behaviour was needed by strace, but I
> can't find anything there that seems to require it. Oh well, we're stuck
> with it anyway.

The fact that PTRACE_SYSEMU is only implemented for a few arches makes
me wonder whether it was a misguided addition that should not be ported
to new arches... i.e., why does hardly anyone need it?  But I haven't
attempted to understand the history.

Can't PTRACE_SYSEMU be emulated by using PTRACE_SYSCALL, cancelling the
syscall at the syscall enter stop, then modifying the regs at the
syscall exit stop?
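
Roughly, on arm64 a tracer could do something like the following sketch
(error handling and signal-stop handling omitted; emulate_syscall() is a
placeholder for whatever emulation the tracer performs, not a real
function):

#include <elf.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/user.h>
#include <sys/wait.h>

extern long emulate_syscall(pid_t pid, struct user_regs_struct *regs);

static void sysemu_step(pid_t pid)
{
	struct user_regs_struct regs;
	struct iovec iov = { .iov_base = &regs, .iov_len = sizeof(regs) };
	int nr = -1;
	struct iovec nr_iov = { .iov_base = &nr, .iov_len = sizeof(nr) };

	ptrace(PTRACE_SYSCALL, pid, 0, 0);	/* run to syscall entry */
	waitpid(pid, NULL, 0);

	ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov);
	/* Cancel the syscall so the kernel never executes it. */
	ptrace(PTRACE_SETREGSET, pid, NT_ARM_SYSTEM_CALL, &nr_iov);

	ptrace(PTRACE_SYSCALL, pid, 0, 0);	/* run to syscall exit */
	waitpid(pid, NULL, 0);

	regs.regs[0] = emulate_syscall(pid, &regs);	/* fake the result */
	ptrace(PTRACE_SETREGSET, pid, NT_PRSTATUS, &iov);
}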


If SYSEMU was obviously always broken, perhaps we can withdraw support
for it.  Assuming nobody is crazy enough to try to emulate execve() I
can't see anything other than sigreturn that would be affected by this
issue though.  So maybe SYSEMU isn't broken enough to justify
withdrawal.

Cheers
---Dave


Re: [PATCH v7 1/9] firmware: arm_scmi: Add notification protocol-registration

2020-05-13 Thread Dave Martin
On Mon, May 11, 2020 at 11:04:03PM +0100, Cristian Marussi wrote:
> Hi Dave
> 
> thanks for the review first of all.
> 
> On Wed, May 06, 2020 at 04:25:50PM +0100, Dave Martin wrote:
> > On Mon, May 04, 2020 at 05:38:47PM +0100, Cristian Marussi wrote:
> > > Add core SCMI Notifications protocol-registration support: allow protocols
> > > to register their own set of supported events, during their initialization
> > > phase. Notification core can track multiple platform instances by their
> > > handles.
> > > 
> > > Reviewed-by: Jonathan Cameron 
> > > Signed-off-by: Cristian Marussi 
> > > ---
> > > V4 --> V5
> > > - fixed kernel-doc
> > > - added barriers for registered protocols and events
> > > - using kfifo_alloc and devm_add_action_or_reset
> > > V3 --> V4
> > > - removed scratch ISR buffer, move scratch BH buffer into protocol
> > >   descriptor
> > > - converted registered_protocols and registered_events from hashtables
> > >   into bare fixed-sized arrays
> > > - removed unregister protocols' routines (never called really)
> > > V2 --> V3
> > > - added scmi_notify_instance to track target platform instance
> > > V1 --> V2
> > > - splitted out of V1 patch 04
> > > - moved from IDR maps to real HashTables to store events
> > > - scmi_notifications_initialized is now an atomic_t
> > > - reviewed protocol registration/unregistration to use devres
> > > - fixed:
> > >   drivers/firmware/arm_scmi/notify.c:483:18-23: ERROR:
> > >   reference preceded by free on line 482
> > > 
> > > Reported-by: kbuild test robot 
> > > Reported-by: Julia Lawall 
> > > ---
> > >  drivers/firmware/arm_scmi/Makefile |   2 +-
> > >  drivers/firmware/arm_scmi/common.h |   4 +
> > >  drivers/firmware/arm_scmi/notify.c | 444 +
> > >  drivers/firmware/arm_scmi/notify.h |  56 
> > >  include/linux/scmi_protocol.h  |   3 +
> > >  5 files changed, 508 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/firmware/arm_scmi/notify.c
> > >  create mode 100644 drivers/firmware/arm_scmi/notify.h
> > 
> > [...]
> > 
> > > diff --git a/drivers/firmware/arm_scmi/notify.c 
> > > b/drivers/firmware/arm_scmi/notify.c
> > 
> > [...]
> > 
> > > +int scmi_register_protocol_events(const struct scmi_handle *handle,
> > > +   u8 proto_id, size_t queue_sz,
> > > +   const struct scmi_protocol_event_ops *ops,
> > > +   const struct scmi_event *evt, int num_events,
> > > +   int num_sources)
> > > +{
> > > + int i;
> > > + size_t payld_sz = 0;
> > > + struct scmi_registered_protocol_events_desc *pd;
> > > + struct scmi_notify_instance *ni = handle->notify_priv;
> > > +
> > > + if (!ops || !evt || proto_id >= SCMI_MAX_PROTO)
> > > + return -EINVAL;
> > > +
> > > + /* Ensure atomic value is updated */
> > > + smp_mb__before_atomic();
> > > + if (unlikely(!ni || !atomic_read(>initialized)))
> > > + return -EAGAIN;
> > 
> > The atomics/barriers don't look quite right to me here.
> > 
> > I'd have expected:
> > 
> > scmi_register_protocol_events()
> > {
> > if (atomic_read(>initialized))
> > return -EAGAIN;
> > smp_mb_after_atomic();
> > 
> > /* ... */
> > }
> > 
> > to pair with:
> > 
> > scmi_notification_init()
> > {
> > /* ... */
> > 
> > smp_mb__before_atomic();
> > atomic_set(>enabled, 1);
> > }
> > 
> > 
> > ...however, do we need to allow these two functions to race with each
> > other at all?  (I haven't tried to understand the wider context here,
> > so if there really is no way to avoid initialisation racing with use I
> > guess we may have to do something like this.  We don't want callers
> > to dumbly spin on this function though.)
> > 
> > 
> > In other patches in the series, calls to scmi_register_protocol_events()
> > seem to be assuming there is no race: the return value is not checked.
> > Possibly a bug?
> > 
> 
> I think you are right in these regards, there's no need of an atomic here
> for 'initialized' and using -EAGAIN on !initialized as error code in
> scmi_register_protocol_events() is wrong too 

Re: [PATCH V2] arm64/cpuinfo: Move HWCAP name arrays alongside their bit definitions

2020-05-13 Thread Dave Martin
On Thu, May 07, 2020 at 06:59:10PM +0530, Anshuman Khandual wrote:
> All HWCAP name arrays (i.e. hwcap_str, compat_hwcap_str, compat_hwcap2_str)
> that are scanned for /proc/cpuinfo output are detached from their bit field
> definitions, making it difficult to correlate. This is also a bit problematic
> because during /proc/cpuinfo dump these arrays get traversed sequentially
> assuming that they reflect and match HWCAP bit sequence, to test various
> features for a given CPU.
> 
> This moves all HWCAP name arrays near their bit definitions. But first it
> defines all missing COMPAT_HWCAP_XXX that are present in the name string.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Mark Brown 
> Cc: Ard Biesheuvel 
> Cc: Mark Rutland 
> Cc: Suzuki K Poulose 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> 
> Signed-off-by: Anshuman Khandual 
> Acked-by: Mark Rutland 
> ---
> This applies on 5.7-rc4
> 
> Changes in V2:
> 
> - Defined COMPAT_KERNEL_HWCAP[2] and updated the name arrays per Mark
> - Updated the commit message as required
> 
> Changes in V1: (https://patchwork.kernel.org/patch/11532945/)
> 
>  arch/arm64/include/asm/hwcap.h | 101 +
>  arch/arm64/kernel/cpuinfo.c|  90 -
>  2 files changed, 101 insertions(+), 90 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/hwcap.h b/arch/arm64/include/asm/hwcap.h
> index 0f00265248b5..589ac02e1ddd 100644
> --- a/arch/arm64/include/asm/hwcap.h
> +++ b/arch/arm64/include/asm/hwcap.h
> @@ -8,18 +8,27 @@
>  #include 
>  #include 
>  
> +#define COMPAT_HWCAP_SWP (1 << 0)
>  #define COMPAT_HWCAP_HALF(1 << 1)
>  #define COMPAT_HWCAP_THUMB   (1 << 2)
> +#define COMPAT_HWCAP_26BIT   (1 << 3)
>  #define COMPAT_HWCAP_FAST_MULT   (1 << 4)
> +#define COMPAT_HWCAP_FPA (1 << 5)
>  #define COMPAT_HWCAP_VFP (1 << 6)
>  #define COMPAT_HWCAP_EDSP(1 << 7)
> +#define COMPAT_HWCAP_JAVA(1 << 8)
> +#define COMPAT_HWCAP_IWMMXT  (1 << 9)
> +#define COMPAT_HWCAP_CRUNCH  (1 << 10)
> +#define COMPAT_HWCAP_THUMBEE (1 << 11)
>  #define COMPAT_HWCAP_NEON(1 << 12)
>  #define COMPAT_HWCAP_VFPv3   (1 << 13)
> +#define COMPAT_HWCAP_VFPV3D16(1 << 14)
>  #define COMPAT_HWCAP_TLS (1 << 15)
>  #define COMPAT_HWCAP_VFPv4   (1 << 16)
>  #define COMPAT_HWCAP_IDIVA   (1 << 17)
>  #define COMPAT_HWCAP_IDIVT   (1 << 18)
>  #define COMPAT_HWCAP_IDIV(COMPAT_HWCAP_IDIVA|COMPAT_HWCAP_IDIVT)
> +#define COMPAT_HWCAP_VFPD32  (1 << 19)
>  #define COMPAT_HWCAP_LPAE(1 << 20)
>  #define COMPAT_HWCAP_EVTSTRM (1 << 21)

With the possible exception of SWP (does the swp emulation allow us to
report this as supported?), I think all these weren't mentioned because
they aren't included in ARMv8 and so can never be reported.

If we find ourselves reporting them, there's a bug somewhere.

So, can we just default all obsolete string entries to NULL?

When generating the cpuinfo strings we could WARN and just emit an empty
string for that hwcap.
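
i.e., something like the following sketch in the /proc/cpuinfo path
(the loop shape and use of WARN_ON_ONCE() are illustrative only; names
follow the arrays defined in this patch):

#include <linux/bits.h>
#include <linux/bug.h>
#include <linux/kernel.h>
#include <linux/seq_file.h>
#include <asm/hwcap.h>

/* A NULL entry means "this compat hwcap can never be set on arm64",
 * so warn if the bit is somehow reported and print nothing for it. */
static void print_compat_hwcaps(struct seq_file *m)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(compat_hwcap_str); i++) {
		if (!(compat_elf_hwcap & BIT(i)))
			continue;

		if (WARN_ON_ONCE(!compat_hwcap_str[i]))
			continue;

		seq_printf(m, " %s", compat_hwcap_str[i]);
	}
}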

Cheers
---Dave

[...]

> +#ifdef CONFIG_COMPAT
> +#define COMPAT_KERNEL_HWCAP(x)   const_ilog2(COMPAT_HWCAP_ ## x)
> +static const char *const compat_hwcap_str[] = {
> + [COMPAT_KERNEL_HWCAP(SWP)]  = "swp",
> + [COMPAT_KERNEL_HWCAP(HALF)] = "half",
> + [COMPAT_KERNEL_HWCAP(THUMB)]= "thumb",
> + [COMPAT_KERNEL_HWCAP(26BIT)]= "26bit",
> + [COMPAT_KERNEL_HWCAP(FAST_MULT)] = "fastmult",
> + [COMPAT_KERNEL_HWCAP(FPA)]  = "fpa",
> + [COMPAT_KERNEL_HWCAP(VFP)]  = "vfp",
> + [COMPAT_KERNEL_HWCAP(EDSP)] = "edsp",
> + [COMPAT_KERNEL_HWCAP(JAVA)] = "java",
> + [COMPAT_KERNEL_HWCAP(IWMMXT)]   = "iwmmxt",
> + [COMPAT_KERNEL_HWCAP(CRUNCH)]   = "crunch",
> + [COMPAT_KERNEL_HWCAP(THUMBEE)]  = "thumbee",
> + [COMPAT_KERNEL_HWCAP(NEON)] = "neon",
> + [COMPAT_KERNEL_HWCAP(VFPv3)]= "vfpv3",
> + [COMPAT_KERNEL_HWCAP(VFPV3D16)] = "vfpv3d16",
> + [COMPAT_KERNEL_HWCAP(TLS)]  = "tls",
> + [COMPAT_KERNEL_HWCAP(VFPv4)]= "vfpv4",
> + [COMPAT_KERNEL_HWCAP(IDIVA)]= "idiva",
> + [COMPAT_KERNEL_HWCAP(IDIVT)]= "idivt",
> + [COMPAT_KERNEL_HWCAP(VFPD32)]   = "vfpd32",
> + [COMPAT_KERNEL_HWCAP(LPAE)] = "lpae",
> + [COMPAT_KERNEL_HWCAP(EVTSTRM)]  = "evtstrm",
> + NULL
> +};
> +
> +#define COMPAT_KERNEL_HWCAP2(x)  const_ilog2(COMPAT_HWCAP2_ ## x)
> +static const char *const compat_hwcap2_str[] = {
> + [COMPAT_KERNEL_HWCAP2(AES)] = "aes",
> + [COMPAT_KERNEL_HWCAP2(PMULL)]   = "pmull",
> + [COMPAT_KERNEL_HWCAP2(SHA1)]= "sha1",
> + [COMPAT_KERNEL_HWCAP2(SHA2)]= "sha2",
> + [COMPAT_KERNEL_HWCAP2(CRC32)]   = "crc32",
> + NULL,
> +};
> +#endif /* CONFIG_COMPAT */
> +

[...]


Re: [PATCH v7 1/9] firmware: arm_scmi: Add notification protocol-registration

2020-05-06 Thread Dave Martin
On Mon, May 04, 2020 at 05:38:47PM +0100, Cristian Marussi wrote:
> Add core SCMI Notifications protocol-registration support: allow protocols
> to register their own set of supported events, during their initialization
> phase. Notification core can track multiple platform instances by their
> handles.
> 
> Reviewed-by: Jonathan Cameron 
> Signed-off-by: Cristian Marussi 
> ---
> V4 --> V5
> - fixed kernel-doc
> - added barriers for registered protocols and events
> - using kfifo_alloc and devm_add_action_or_reset
> V3 --> V4
> - removed scratch ISR buffer, move scratch BH buffer into protocol
>   descriptor
> - converted registered_protocols and registered_events from hashtables
>   into bare fixed-sized arrays
> - removed unregister protocols' routines (never called really)
> V2 --> V3
> - added scmi_notify_instance to track target platform instance
> V1 --> V2
> - split out of V1 patch 04
> - moved from IDR maps to real HashTables to store events
> - scmi_notifications_initialized is now an atomic_t
> - reviewed protocol registration/unregistration to use devres
> - fixed:
>   drivers/firmware/arm_scmi/notify.c:483:18-23: ERROR:
>   reference preceded by free on line 482
> 
> Reported-by: kbuild test robot 
> Reported-by: Julia Lawall 
> ---
>  drivers/firmware/arm_scmi/Makefile |   2 +-
>  drivers/firmware/arm_scmi/common.h |   4 +
>  drivers/firmware/arm_scmi/notify.c | 444 +
>  drivers/firmware/arm_scmi/notify.h |  56 
>  include/linux/scmi_protocol.h  |   3 +
>  5 files changed, 508 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/firmware/arm_scmi/notify.c
>  create mode 100644 drivers/firmware/arm_scmi/notify.h

[...]

> diff --git a/drivers/firmware/arm_scmi/notify.c 
> b/drivers/firmware/arm_scmi/notify.c

[...]

> +int scmi_register_protocol_events(const struct scmi_handle *handle,
> +   u8 proto_id, size_t queue_sz,
> +   const struct scmi_protocol_event_ops *ops,
> +   const struct scmi_event *evt, int num_events,
> +   int num_sources)
> +{
> + int i;
> + size_t payld_sz = 0;
> + struct scmi_registered_protocol_events_desc *pd;
> + struct scmi_notify_instance *ni = handle->notify_priv;
> +
> + if (!ops || !evt || proto_id >= SCMI_MAX_PROTO)
> + return -EINVAL;
> +
> + /* Ensure atomic value is updated */
> + smp_mb__before_atomic();
> + if (unlikely(!ni || !atomic_read(&ni->initialized)))
> + return -EAGAIN;

The atomics/barriers don't look quite right to me here.

I'd have expected:

scmi_register_protocol_events()
{
	if (!atomic_read(&ni->initialized))
		return -EAGAIN;
	smp_mb__after_atomic();

	/* ... */
}

to pair with:

scmi_notification_init()
{
	/* ... */

	smp_mb__before_atomic();
	atomic_set(&ni->enabled, 1);
}


...however, do we need to allow these two functions to race with each
other at all?  (I haven't tried to understand the wider context here,
so if there really is no way to avoid initialisation racing with use I
guess we may have to do something like this.  We don't want callers
to dumbly spin on this function though.)


In other patches in the series, calls to scmi_register_protocol_events()
seem to be assuming there is no race: the return value is not checked.
Possibly a bug?
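
(i.e., I'd expect the protocol init code to do something like the
following -- hypothetical caller, names loosely borrowed from the rest
of the series:

	ret = scmi_register_protocol_events(handle, SCMI_PROTOCOL_POWER,
					    SCMI_PROTO_QUEUE_SZ,
					    &power_event_ops, power_events,
					    ARRAY_SIZE(power_events),
					    dom_info->num_domains);
	if (ret)
		dev_warn(handle->dev,
			 "Power: failed to register events (%d)\n", ret);

rather than ignoring the return value.)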


I'm not sure about scmi_notification_exit() (see below).

> +
> + /* Attach to the notification main devres group */
> + if (!devres_open_group(ni->handle->dev, ni->gid, GFP_KERNEL))
> + return -ENOMEM;
> +
> + for (i = 0; i < num_events; i++)
> + payld_sz = max_t(size_t, payld_sz, evt[i].max_payld_sz);
> + pd = scmi_allocate_registered_protocol_desc(ni, proto_id, queue_sz,
> + sizeof(struct scmi_event_header) + payld_sz,
> + num_events, ops);
> + if (IS_ERR(pd))
> + goto err;
> +
> + for (i = 0; i < num_events; i++, evt++) {
> + struct scmi_registered_event *r_evt;
> +
> + r_evt = devm_kzalloc(ni->handle->dev, sizeof(*r_evt),
> +  GFP_KERNEL);
> + if (!r_evt)
> + goto err;
> + r_evt->proto = pd;
> + r_evt->evt = evt;
> +
> + r_evt->sources = devm_kcalloc(ni->handle->dev, num_sources,
> +   sizeof(refcount_t), GFP_KERNEL);
> + if (!r_evt->sources)
> + goto err;
> + r_evt->num_sources = num_sources;
> + mutex_init(&r_evt->sources_mtx);
> +
> + r_evt->report = devm_kzalloc(ni->handle->dev,
> +  evt->max_report_sz, GFP_KERNEL);
> + if (!r_evt->report)
> + goto err;
> +
> + 

[PATCH v3 12/12] KVM: arm64: BTI: Reset BTYPE when skipping emulated instructions

2019-10-18 Thread Dave Martin
Since normal execution of any non-branch instruction resets the
PSTATE BTYPE field to 0, do the same thing when emulating a
trapped instruction.

Branches don't trap directly, so we should never need to assign a
non-zero value to BTYPE here.

Signed-off-by: Dave Martin 

---

Changes since v2:

 * Drop the (u64) cast when masking out PSR_BTYPE_MASK in
   arm64_skip_faulting_instruction().

   PSTATE may grow, but we should address this more generally rather
   than with point hacks.

 * Add { } around if () clause that was unbalanced by the previous
   patch.
---
 arch/arm64/include/asm/kvm_emulate.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index d69c1ef..f41bfdee 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -450,10 +450,12 @@ static inline unsigned long 
vcpu_data_host_to_guest(struct kvm_vcpu *vcpu,
 
 static inline void kvm_skip_instr(struct kvm_vcpu *vcpu, bool is_wide_instr)
 {
-   if (vcpu_mode_is_32bit(vcpu))
+   if (vcpu_mode_is_32bit(vcpu)) {
kvm_skip_instr32(vcpu, is_wide_instr);
-   else
+   } else {
*vcpu_pc(vcpu) += 4;
+   *vcpu_cpsr(vcpu) &= ~PSR_BTYPE_MASK;
+   }
 
/* advance the singlestep state machine */
*vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
-- 
2.1.4



[PATCH v3 11/12] arm64: BTI: Reset BTYPE when skipping emulated instructions

2019-10-18 Thread Dave Martin
Since normal execution of any non-branch instruction resets the
PSTATE BTYPE field to 0, do the same thing when emulating a
trapped instruction.

Branches don't trap directly, so we should never need to assign a
non-zero value to BTYPE here.

Signed-off-by: Dave Martin 

---

Changes since v2:

 * Drop the (u64) cast when masking out PSR_BTYPE_MASK in
   arm64_skip_faulting_instruction().

   PSTATE may grow, but we should address this more generally rather
   than with point hacks in this series.
---
 arch/arm64/kernel/traps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 3af2768..5c46a7b 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -331,6 +331,8 @@ void arm64_skip_faulting_instruction(struct pt_regs *regs, 
unsigned long size)
 
if (regs->pstate & PSR_MODE32_BIT)
advance_itstate(regs);
+   else
+   regs->pstate &= ~PSR_BTYPE_MASK;
 }
 
 static LIST_HEAD(undef_hook);
-- 
2.1.4



[PATCH v3 07/12] arm64: elf: Enable BTI at exec based on ELF program properties

2019-10-18 Thread Dave Martin
For BTI protection to be as comprehensive as possible, it is
desirable to have BTI enabled from process startup.  If this is not
done, the process must use mprotect() to enable BTI for each of its
executable mappings, but this is painful to do in the libc startup
code.  It's simpler and more sound to have the kernel do it
instead.

To this end, detect BTI support in the executable (or ELF
interpreter, as appropriate), via the
NT_GNU_PROGRAM_PROPERTY_TYPE_0 note, and tweak the initial prot
flags for the process' executable pages to include PROT_BTI as
appropriate.

Signed-off-by: Dave Martin 
---
 arch/arm64/Kconfig   |  3 +++
 arch/arm64/include/asm/elf.h | 50 
 arch/arm64/kernel/process.c  | 19 +
 include/linux/elf.h  |  6 +-
 include/uapi/linux/elf.h |  6 ++
 5 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index bb3189e..a64d91d 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -9,6 +9,7 @@ config ARM64
select ACPI_MCFG if (ACPI && PCI)
select ACPI_SPCR_TABLE if ACPI
select ACPI_PPTT if ACPI
+   select ARCH_BINFMT_ELF_STATE
select ARCH_CLOCKSOURCE_DATA
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEVMEM_IS_ALLOWED
@@ -34,6 +35,7 @@ config ARM64
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
+   select ARCH_HAVE_ELF_PROT
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_INLINE_READ_LOCK if !PREEMPT
select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
@@ -63,6 +65,7 @@ config ARM64
select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPT
select ARCH_KEEP_MEMBLOCK
select ARCH_USE_CMPXCHG_LOCKREF
+   select ARCH_USE_GNU_PROPERTY if BINFMT_ELF
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_SUPPORTS_MEMORY_FAILURE
diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index b618017..8bc154c 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -114,7 +114,11 @@
 
 #ifndef __ASSEMBLY__
 
+#include 
 #include 
+#include 
+#include 
+#include 
 #include  /* for signal_minsigstksz, used by ARCH_DLINFO */
 
 typedef unsigned long elf_greg_t;
@@ -224,6 +228,52 @@ extern int aarch32_setup_additional_pages(struct 
linux_binprm *bprm,
 
 #endif /* CONFIG_COMPAT */
 
+struct arch_elf_state {
+   int flags;
+};
+
+#define ARM64_ELF_BTI  (1 << 0)
+
+#define INIT_ARCH_ELF_STATE {  \
+   .flags = 0, \
+}
+
+static inline int arch_parse_elf_property(u32 type, const void *data,
+ size_t datasz, bool compat,
+ struct arch_elf_state *arch)
+{
+   /* No known properties for AArch32 yet */
+   if (IS_ENABLED(CONFIG_COMPAT) && compat)
+   return 0;
+
+   if (type == GNU_PROPERTY_AARCH64_FEATURE_1_AND) {
+   const u32 *p = data;
+
+   if (datasz != sizeof(*p))
+   return -EIO;
+
+   if (IS_ENABLED(CONFIG_ARM64_BTI) &&
+   (*p & GNU_PROPERTY_AARCH64_FEATURE_1_BTI))
+   arch->flags |= ARM64_ELF_BTI;
+   }
+
+   return 0;
+}
+
+static inline int arch_elf_pt_proc(void *ehdr, void *phdr,
+  struct file *f, bool is_interp,
+  struct arch_elf_state *state)
+{
+   return 0;
+}
+
+static inline int arch_check_elf(void *ehdr, bool has_interp,
+void *interp_ehdr,
+struct arch_elf_state *state)
+{
+   return 0;
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index a47462d..4c78937 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -11,12 +11,14 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -633,3 +635,20 @@ static int __init tagged_addr_init(void)
 
 core_initcall(tagged_addr_init);
 #endif /* CONFIG_ARM64_TAGGED_ADDR_ABI */
+
+#ifdef CONFIG_BINFMT_ELF
+int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
+bool has_interp, bool is_interp)
+{
+   if (is_interp != has_interp)
+   return prot;
+
+   if (!(state->flags & ARM64_ELF_BTI))
+   return prot;
+
+   if (prot & PROT_EXEC)
+   prot |= PROT_BTI;
+
+   return prot;
+}
+#endif
diff --git a/include/linux/elf.h b/include/linux/elf.h
index 1b6e895..5d5b032 100644
--- a/include/linux/elf.h
+++ b/i

[PATCH v3 09/12] arm64: traps: Fix inconsistent faulting instruction skipping

2019-10-18 Thread Dave Martin
Correct skipping of an instruction on AArch32 works a bit
differently from AArch64, mainly due to the different CPSR/PSTATE
semantics.

There have been various attempts to get this right.  Currently
arm64_skip_faulting_instruction() mostly does the right thing, but
does not advance the IT state machine for the AArch32 case.

arm64_compat_skip_faulting_instruction() handles the IT state
machine but is local to traps.c, and porting other code to use it
will make a mess since there are some call sites that apply for
both the compat and native cases.

Since manual instruction skipping implies a trap, it's a relatively
slow path.

So, make arm64_skip_faulting_instruction() handle both compat and
native, and get rid of the arm64_compat_skip_faulting_instruction()
special case.

Fixes: 32a3e635fb0e ("arm64: compat: Add CNTFRQ trap handler")
Fixes: 1f1c014035a8 ("arm64: compat: Add condition code checks and IT advance")
Fixes: 6436b572 ("arm64: Fix single stepping in kernel traps")
Fixes: bd35a4adc413 ("arm64: Port SWP/SWPB emulation support from arm")
Signed-off-by: Dave Martin 

---

**NOTE**

Despite discussions on the v2 series to the effect that the prior
behaviour is not broken, I'm now not so sure:

Taking another look, I now can't track down for example where SWP in an
IT block is specified to be UNPREDICTABLE.  I only see e.g., ARM DDI
0487E.a Section 1.8.2 ("F1.8.2 Partial deprecation of IT"), which only
deprecates the affected instructions.

The legacy AArch32 SWP{B} insn is obsoleted by ARMv8, but the whole
point of the armv8_deprecated stuff is to provide some backwards
compatibility with v7.

So, this looks like it needs a closer look.

I'll leave the Fixes tags for now, so that the archaeology doesn't need
to be repeated if we conclude that this patch really is a fix.
---
 arch/arm64/kernel/traps.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 15e3c4f..44c91d4 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -268,6 +268,8 @@ void arm64_notify_die(const char *str, struct pt_regs *regs,
}
 }
 
+static void advance_itstate(struct pt_regs *regs);
+
 void arm64_skip_faulting_instruction(struct pt_regs *regs, unsigned long size)
 {
regs->pc += size;
@@ -278,6 +280,9 @@ void arm64_skip_faulting_instruction(struct pt_regs *regs, 
unsigned long size)
 */
if (user_mode(regs))
user_fastforward_single_step(current);
+
+   if (regs->pstate & PSR_MODE32_BIT)
+   advance_itstate(regs);
 }
 
 static LIST_HEAD(undef_hook);
@@ -629,19 +634,12 @@ static void advance_itstate(struct pt_regs *regs)
compat_set_it_state(regs, it);
 }
 
-static void arm64_compat_skip_faulting_instruction(struct pt_regs *regs,
-  unsigned int sz)
-{
-   advance_itstate(regs);
-   arm64_skip_faulting_instruction(regs, sz);
-}
-
 static void compat_cntfrq_read_handler(unsigned int esr, struct pt_regs *regs)
 {
int reg = (esr & ESR_ELx_CP15_32_ISS_RT_MASK) >> 
ESR_ELx_CP15_32_ISS_RT_SHIFT;
 
pt_regs_write_reg(regs, reg, arch_timer_get_rate());
-   arm64_compat_skip_faulting_instruction(regs, 4);
+   arm64_skip_faulting_instruction(regs, 4);
 }
 
 static const struct sys64_hook cp15_32_hooks[] = {
@@ -661,7 +659,7 @@ static void compat_cntvct_read_handler(unsigned int esr, 
struct pt_regs *regs)
 
pt_regs_write_reg(regs, rt, lower_32_bits(val));
pt_regs_write_reg(regs, rt2, upper_32_bits(val));
-   arm64_compat_skip_faulting_instruction(regs, 4);
+   arm64_skip_faulting_instruction(regs, 4);
 }
 
 static const struct sys64_hook cp15_64_hooks[] = {
@@ -682,7 +680,7 @@ asmlinkage void __exception do_cp15instr(unsigned int esr, 
struct pt_regs *regs)
 * There is no T16 variant of a CP access, so we
 * always advance PC by 4 bytes.
 */
-   arm64_compat_skip_faulting_instruction(regs, 4);
+   arm64_skip_faulting_instruction(regs, 4);
return;
}
 
-- 
2.1.4



[PATCH v3 03/12] mm: Reserve asm-generic prot flag 0x10 for arch use

2019-10-18 Thread Dave Martin
The asm-generic mman definitions are used by a few architectures
that also define an arch-specific PROT flag with value 0x10.  This
currently applies to sparc and powerpc, and arm64 will soon join
in.

To help future maintainers, document the use of this flag in the
asm-generic header too.

Signed-off-by: Dave Martin 
---
 include/uapi/asm-generic/mman-common.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/asm-generic/mman-common.h 
b/include/uapi/asm-generic/mman-common.h
index c160a53..81442d2 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -11,6 +11,7 @@
 #define PROT_WRITE 0x2 /* page can be written */
 #define PROT_EXEC  0x4 /* page can be executed */
 #define PROT_SEM   0x8 /* page may be used for atomic ops */
+ /*0x10   reserved for arch-specific use */
 #define PROT_NONE  0x0 /* page can not be accessed */
 #define PROT_GROWSDOWN 0x0100  /* mprotect flag: extend change to 
start of growsdown vma */
 #define PROT_GROWSUP   0x0200  /* mprotect flag: extend change to end 
of growsup vma */
-- 
2.1.4



[PATCH v3 04/12] arm64: docs: cpu-feature-registers: Document ID_AA64PFR1_EL1

2019-10-18 Thread Dave Martin
Commit d71be2b6c0e1 ("arm64: cpufeature: Detect SSBS and advertise
to userspace") exposes ID_AA64PFR1_EL1 to userspace, but didn't
update the documentation to match.

Add it.

Signed-off-by: Dave Martin 

---

Note to maintainers:

 * This patch has been racing with various other attempts to fix
   the same documentation in the meantime.

   Since this patch only fixes the documentation for pre-existing
   features, it can safely be dropped if appropriate.

   The _new_ documentation relating to BTI feature reporting
   is in a subsequent patch, and needs to be retained.
---
 Documentation/arm64/cpu-feature-registers.rst | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/Documentation/arm64/cpu-feature-registers.rst 
b/Documentation/arm64/cpu-feature-registers.rst
index 2955287..b86828f 100644
--- a/Documentation/arm64/cpu-feature-registers.rst
+++ b/Documentation/arm64/cpu-feature-registers.rst
@@ -168,8 +168,15 @@ infrastructure:
  +--+-+-+
 
 
-  3) MIDR_EL1 - Main ID Register
+  3) ID_AA64PFR1_EL1 - Processor Feature Register 1
+ +--+-+-+
+ | Name |  bits   | visible |
+ +--+-+-+
+ | SSBS | [7-4]   |y|
+ +--+-+-+
+
 
+  4) MIDR_EL1 - Main ID Register
  +--+-+-+
  | Name |  bits   | visible |
  +--+-+-+
@@ -188,7 +195,7 @@ infrastructure:
as available on the CPU where it is fetched and is not a system
wide safe value.
 
-  4) ID_AA64ISAR1_EL1 - Instruction set attribute register 1
+  5) ID_AA64ISAR1_EL1 - Instruction set attribute register 1
 
  +--+-+-+
  | Name |  bits   | visible |
@@ -210,7 +217,7 @@ infrastructure:
  | DPB  | [3-0]   |y|
  +--+-+-+
 
-  5) ID_AA64MMFR2_EL1 - Memory model feature register 2
+  6) ID_AA64MMFR2_EL1 - Memory model feature register 2
 
  +--+-+-+
  | Name |  bits   | visible |
@@ -218,7 +225,7 @@ infrastructure:
  | AT   | [35-32] |y|
  +--+-+-+
 
-  6) ID_AA64ZFR0_EL1 - SVE feature ID register 0
+  7) ID_AA64ZFR0_EL1 - SVE feature ID register 0
 
  +--+-+-+
  | Name |  bits   | visible |
-- 
2.1.4



[PATCH v3 06/12] elf: Allow arch to tweak initial mmap prot flags

2019-10-18 Thread Dave Martin
An arch may want to tweak the mmap prot flags for an
ELF executable's initial mappings.  For example, arm64 is going to
need to add PROT_BTI for executable pages in an ELF process whose
executable is marked as using Branch Target Identification (an
ARMv8.5-A control flow integrity feature).

So that this can be done in a generic way, add a hook
arch_elf_adjust_prot() to modify the prot flags as desired: arches
can select CONFIG_HAVE_ELF_PROT and implement their own backend
where necessary.

By default, leave the prot flags unchanged.

Signed-off-by: Dave Martin 
---
 fs/Kconfig.binfmt   |  3 +++
 fs/binfmt_elf.c | 18 --
 include/linux/elf.h | 12 
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index d2cfe07..2358368 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -36,6 +36,9 @@ config COMPAT_BINFMT_ELF
 config ARCH_BINFMT_ELF_STATE
bool
 
+config ARCH_HAVE_ELF_PROT
+   bool
+
 config ARCH_USE_GNU_PROPERTY
bool
 
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index ae345f6..dbfab2e 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -531,7 +531,8 @@ static inline int arch_check_elf(struct elfhdr *ehdr, bool 
has_interp,
 
 #endif /* !CONFIG_ARCH_BINFMT_ELF_STATE */
 
-static inline int make_prot(u32 p_flags)
+static inline int make_prot(u32 p_flags, struct arch_elf_state *arch_state,
+   bool has_interp, bool is_interp)
 {
int prot = 0;
 
@@ -541,7 +542,8 @@ static inline int make_prot(u32 p_flags)
prot |= PROT_WRITE;
if (p_flags & PF_X)
prot |= PROT_EXEC;
-   return prot;
+
+   return arch_elf_adjust_prot(prot, arch_state, has_interp, is_interp);
 }
 
 /* This is much more generalized than the library routine read function,
@@ -551,7 +553,8 @@ static inline int make_prot(u32 p_flags)
 
 static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
struct file *interpreter, unsigned long *interp_map_addr,
-   unsigned long no_base, struct elf_phdr *interp_elf_phdata)
+   unsigned long no_base, struct elf_phdr *interp_elf_phdata,
+   struct arch_elf_state *arch_state)
 {
struct elf_phdr *eppnt;
unsigned long load_addr = 0;
@@ -583,7 +586,8 @@ static unsigned long load_elf_interp(struct elfhdr 
*interp_elf_ex,
for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
if (eppnt->p_type == PT_LOAD) {
int elf_type = MAP_PRIVATE | MAP_DENYWRITE;
-   int elf_prot = make_prot(eppnt->p_flags);
+   int elf_prot = make_prot(eppnt->p_flags, arch_state,
+true, true);
unsigned long vaddr = 0;
unsigned long k, map_addr;
 
@@ -1040,7 +1044,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
}
}
 
-   elf_prot = make_prot(elf_ppnt->p_flags);
+   elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
+    !!interpreter, false);
 
elf_flags = MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE;
 
@@ -1186,7 +1191,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
elf_entry = load_elf_interp(&loc->interp_elf_ex,
interpreter,
&interp_map_addr,
-   load_bias, interp_elf_phdata);
+   load_bias, interp_elf_phdata,
+   &arch_state);
if (!IS_ERR((void *)elf_entry)) {
/*
 * load_elf_interp() returns relocation
diff --git a/include/linux/elf.h b/include/linux/elf.h
index 7bdc6da..1b6e895 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -83,4 +83,16 @@ extern int arch_parse_elf_property(u32 type, const void 
*data, size_t datasz,
   bool compat, struct arch_elf_state *arch);
 #endif
 
+#ifdef CONFIG_ARCH_HAVE_ELF_PROT
+int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
+bool has_interp, bool is_interp);
+#else
+static inline int arch_elf_adjust_prot(int prot,
+  const struct arch_elf_state *state,
+  bool has_interp, bool is_interp)
+{
+   return prot;
+}
+#endif
+
 #endif /* _LINUX_ELF_H */
-- 
2.1.4



[PATCH v3 01/12] ELF: UAPI and Kconfig additions for ELF program properties

2019-10-18 Thread Dave Martin
Pull the basic ELF definitions relating to the
NT_GNU_PROPERTY_TYPE_0 note from Yu-Cheng Yu's earlier x86 shstk
series.

Signed-off-by: Yu-cheng Yu 
Signed-off-by: Dave Martin 
---
 fs/Kconfig.binfmt| 3 +++
 include/linux/elf.h  | 8 
 include/uapi/linux/elf.h | 1 +
 3 files changed, 12 insertions(+)

diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index 62dc4f5..d2cfe07 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -36,6 +36,9 @@ config COMPAT_BINFMT_ELF
 config ARCH_BINFMT_ELF_STATE
bool
 
+config ARCH_USE_GNU_PROPERTY
+   bool
+
 config BINFMT_ELF_FDPIC
bool "Kernel support for FDPIC ELF binaries"
default y if !BINFMT_ELF
diff --git a/include/linux/elf.h b/include/linux/elf.h
index e3649b3..459cddc 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_ELF_H
 #define _LINUX_ELF_H
 
+#include 
 #include 
 #include 
 
@@ -56,4 +57,11 @@ static inline int elf_coredump_extra_notes_write(struct 
coredump_params *cprm) {
 extern int elf_coredump_extra_notes_size(void);
 extern int elf_coredump_extra_notes_write(struct coredump_params *cprm);
 #endif
+
+/* NT_GNU_PROPERTY_TYPE_0 header */
+struct gnu_property {
+   u32 pr_type;
+   u32 pr_datasz;
+};
+
 #endif /* _LINUX_ELF_H */
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 34c02e4..c377314 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -36,6 +36,7 @@ typedef __s64 Elf64_Sxword;
 #define PT_LOPROC  0x7000
 #define PT_HIPROC  0x7fff
 #define PT_GNU_EH_FRAME0x6474e550
+#define PT_GNU_PROPERTY0x6474e553
 
 #define PT_GNU_STACK   (PT_LOOS + 0x474e551)
 
-- 
2.1.4



[PATCH v3 02/12] ELF: Add ELF program property parsing support

2019-10-18 Thread Dave Martin
ELF program properties will be needed for detecting whether to
enable optional architecture or ABI features for a new ELF process.

For now, there are no generic properties that we care about, so do
nothing unless CONFIG_ARCH_USE_GNU_PROPERTY=y.

Otherwise, detect the presence of properties via the PT_GNU_PROPERTY
phdr entry (if any), and notify each property to the arch code.

For now, the added code is not used.

Signed-off-by: Dave Martin 
---
 fs/binfmt_elf.c  | 127 +++
 fs/compat_binfmt_elf.c   |   4 ++
 include/linux/elf.h  |  19 +++
 include/uapi/linux/elf.h |   4 ++
 4 files changed, 154 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index c5642bc..ae345f6 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -39,12 +39,18 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 
+#ifndef ELF_COMPAT
+#define ELF_COMPAT 0
+#endif
+
 #ifndef user_long_t
 #define user_long_t long
 #endif
@@ -670,6 +676,111 @@ static unsigned long load_elf_interp(struct elfhdr 
*interp_elf_ex,
  * libraries.  There is no binary dependent code anywhere else.
  */
 
+static int parse_elf_property(const char *data, size_t *off, size_t datasz,
+ struct arch_elf_state *arch,
+ bool have_prev_type, u32 *prev_type)
+{
+   size_t o, step;
+   const struct gnu_property *pr;
+   int ret;
+
+   if (*off == datasz)
+   return -ENOENT;
+
+   if (WARN_ON(*off > datasz || *off % ELF_GNU_PROPERTY_ALIGN))
+   return -EIO;
+   o = *off;
+   datasz -= *off;
+
+   if (datasz < sizeof(*pr))
+   return -EIO;
+   pr = (const struct gnu_property *)(data + o);
+   o += sizeof(*pr);
+   datasz -= sizeof(*pr);
+
+   if (pr->pr_datasz > datasz)
+   return -EIO;
+
+   WARN_ON(o % ELF_GNU_PROPERTY_ALIGN);
+   step = round_up(pr->pr_datasz, ELF_GNU_PROPERTY_ALIGN);
+   if (step > datasz)
+   return -EIO;
+
+   /* Properties are supposed to be unique and sorted on pr_type: */
+   if (have_prev_type && pr->pr_type <= *prev_type)
+   return -EIO;
+   *prev_type = pr->pr_type;
+
+   ret = arch_parse_elf_property(pr->pr_type, data + o,
+ pr->pr_datasz, ELF_COMPAT, arch);
+   if (ret)
+   return ret;
+
+   *off = o + step;
+   return 0;
+}
+
+#define NOTE_DATA_SZ SZ_1K
+#define GNU_PROPERTY_TYPE_0_NAME "GNU"
+#define NOTE_NAME_SZ (sizeof(GNU_PROPERTY_TYPE_0_NAME))
+
+static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
+   struct arch_elf_state *arch)
+{
+   union {
+   struct elf_note nhdr;
+   char data[NOTE_DATA_SZ];
+   } note;
+   loff_t pos;
+   ssize_t n;
+   size_t off, datasz;
+   int ret;
+   bool have_prev_type;
+   u32 prev_type;
+
+   if (!IS_ENABLED(CONFIG_ARCH_USE_GNU_PROPERTY) || !phdr)
+   return 0;
+
+   /* load_elf_binary() shouldn't call us unless this is true... */
+   if (WARN_ON(phdr->p_type != PT_GNU_PROPERTY))
+   return -EIO;
+
+   /* If the properties are crazy large, that's too bad (for now): */
+   if (phdr->p_filesz > sizeof(note))
+   return -ENOEXEC;
+
+   pos = phdr->p_offset;
+   n = kernel_read(f, &note, phdr->p_filesz, &pos);
+
+   BUILD_BUG_ON(sizeof(note) < sizeof(note.nhdr) + NOTE_NAME_SZ);
+   if (n < 0 || n < sizeof(note.nhdr) + NOTE_NAME_SZ)
+   return -EIO;
+
+   if (note.nhdr.n_type != NT_GNU_PROPERTY_TYPE_0 ||
+   note.nhdr.n_namesz != NOTE_NAME_SZ ||
+   strncmp(note.data + sizeof(note.nhdr),
+   GNU_PROPERTY_TYPE_0_NAME, n - sizeof(note.nhdr)))
+   return -EIO;
+
+   off = round_up(sizeof(note.nhdr) + NOTE_NAME_SZ,
+  ELF_GNU_PROPERTY_ALIGN);
+   if (off > n)
+   return -EIO;
+
+   if (note.nhdr.n_descsz > n - off)
+   return -EIO;
+   datasz = off + note.nhdr.n_descsz;
+
+   have_prev_type = false;
+   do {
+   ret = parse_elf_property(note.data, &off, datasz, arch,
+    have_prev_type, &prev_type);
+   have_prev_type = true;
+   } while (!ret);
+
+   return ret == -ENOENT ? 0 : ret;
+}
+
 static int load_elf_binary(struct linux_binprm *bprm)
 {
struct file *interpreter = NULL; /* to shut gcc up */
@@ -677,6 +788,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
int load_addr_set = 0;
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
+   struct elf_phdr *elf_property_phdata = NULL;
unsigned long elf

[PATCH v3 10/12] arm64: traps: Shuffle code to eliminate forward declarations

2019-10-18 Thread Dave Martin
Hoist the IT state handling code earlier in traps.c, to avoid
accumulating forward declarations.

No functional change.

Signed-off-by: Dave Martin 
---
 arch/arm64/kernel/traps.c | 101 ++
 1 file changed, 49 insertions(+), 52 deletions(-)

diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 44c91d4..3af2768 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -268,7 +268,55 @@ void arm64_notify_die(const char *str, struct pt_regs 
*regs,
}
 }
 
-static void advance_itstate(struct pt_regs *regs);
+#ifdef CONFIG_COMPAT
+#define PSTATE_IT_1_0_SHIFT25
+#define PSTATE_IT_1_0_MASK (0x3 << PSTATE_IT_1_0_SHIFT)
+#define PSTATE_IT_7_2_SHIFT10
+#define PSTATE_IT_7_2_MASK (0x3f << PSTATE_IT_7_2_SHIFT)
+
+static u32 compat_get_it_state(struct pt_regs *regs)
+{
+   u32 it, pstate = regs->pstate;
+
+   it  = (pstate & PSTATE_IT_1_0_MASK) >> PSTATE_IT_1_0_SHIFT;
+   it |= ((pstate & PSTATE_IT_7_2_MASK) >> PSTATE_IT_7_2_SHIFT) << 2;
+
+   return it;
+}
+
+static void compat_set_it_state(struct pt_regs *regs, u32 it)
+{
+   u32 pstate_it;
+
+   pstate_it  = (it << PSTATE_IT_1_0_SHIFT) & PSTATE_IT_1_0_MASK;
+   pstate_it |= ((it >> 2) << PSTATE_IT_7_2_SHIFT) & PSTATE_IT_7_2_MASK;
+
+   regs->pstate &= ~PSR_AA32_IT_MASK;
+   regs->pstate |= pstate_it;
+}
+
+static void advance_itstate(struct pt_regs *regs)
+{
+   u32 it;
+
+   /* ARM mode */
+   if (!(regs->pstate & PSR_AA32_T_BIT) ||
+   !(regs->pstate & PSR_AA32_IT_MASK))
+   return;
+
+   it  = compat_get_it_state(regs);
+
+   /*
+* If this is the last instruction of the block, wipe the IT
+* state. Otherwise advance it.
+*/
+   if (!(it & 7))
+   it = 0;
+   else
+   it = (it & 0xe0) | ((it << 1) & 0x1f);
+
+   compat_set_it_state(regs, it);
+}
 
 void arm64_skip_faulting_instruction(struct pt_regs *regs, unsigned long size)
 {
@@ -563,34 +611,6 @@ static const struct sys64_hook sys64_hooks[] = {
{},
 };
 
-
-#ifdef CONFIG_COMPAT
-#define PSTATE_IT_1_0_SHIFT25
-#define PSTATE_IT_1_0_MASK (0x3 << PSTATE_IT_1_0_SHIFT)
-#define PSTATE_IT_7_2_SHIFT10
-#define PSTATE_IT_7_2_MASK (0x3f << PSTATE_IT_7_2_SHIFT)
-
-static u32 compat_get_it_state(struct pt_regs *regs)
-{
-   u32 it, pstate = regs->pstate;
-
-   it  = (pstate & PSTATE_IT_1_0_MASK) >> PSTATE_IT_1_0_SHIFT;
-   it |= ((pstate & PSTATE_IT_7_2_MASK) >> PSTATE_IT_7_2_SHIFT) << 2;
-
-   return it;
-}
-
-static void compat_set_it_state(struct pt_regs *regs, u32 it)
-{
-   u32 pstate_it;
-
-   pstate_it  = (it << PSTATE_IT_1_0_SHIFT) & PSTATE_IT_1_0_MASK;
-   pstate_it |= ((it >> 2) << PSTATE_IT_7_2_SHIFT) & PSTATE_IT_7_2_MASK;
-
-   regs->pstate &= ~PSR_AA32_IT_MASK;
-   regs->pstate |= pstate_it;
-}
-
 static bool cp15_cond_valid(unsigned int esr, struct pt_regs *regs)
 {
int cond;
@@ -611,29 +631,6 @@ static bool cp15_cond_valid(unsigned int esr, struct 
pt_regs *regs)
return aarch32_opcode_cond_checks[cond](regs->pstate);
 }
 
-static void advance_itstate(struct pt_regs *regs)
-{
-   u32 it;
-
-   /* ARM mode */
-   if (!(regs->pstate & PSR_AA32_T_BIT) ||
-   !(regs->pstate & PSR_AA32_IT_MASK))
-   return;
-
-   it  = compat_get_it_state(regs);
-
-   /*
-* If this is the last instruction of the block, wipe the IT
-* state. Otherwise advance it.
-*/
-   if (!(it & 7))
-   it = 0;
-   else
-   it = (it & 0xe0) | ((it << 1) & 0x1f);
-
-   compat_set_it_state(regs, it);
-}
-
 static void compat_cntfrq_read_handler(unsigned int esr, struct pt_regs *regs)
 {
int reg = (esr & ESR_ELx_CP15_32_ISS_RT_MASK) >> 
ESR_ELx_CP15_32_ISS_RT_SHIFT;
-- 
2.1.4



[PATCH v3 08/12] arm64: BTI: Decode BTYPE bits when printing PSTATE

2019-10-18 Thread Dave Martin
The current code to print PSTATE symbolically when generating
backtraces etc. does not include the BTYPE field used by Branch
Target Identification.

So, decode BTYPE and print it too.

In the interests of human-readability, print the classes of BTI
matched.  The symbolic notation, BTYPE (PSTATE[11:10]) and the
permitted classes of subsequent instruction are:

-- (BTYPE=0b00): any insn
jc (BTYPE=0b01): BTI jc, BTI j, BTI c, PACIxSP
-c (BTYPE=0b10): BTI jc, BTI c, PACIxSP
j- (BTYPE=0b11): BTI jc, BTI j

Signed-off-by: Dave Martin 

---

Changes since v2:

 * Split out the PSR_BTYPE_* definitions for merging into
   "arm64: Basic Branch Target Identification support".
---
 arch/arm64/kernel/process.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 4c78937..a2b555a 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -209,6 +209,15 @@ void machine_restart(char *cmd)
while (1);
 }
 
+#define bstr(suffix, str) [PSR_BTYPE_ ## suffix >> PSR_BTYPE_SHIFT] = str
+static const char *const btypes[] = {
+   bstr(NONE, "--"),
+   bstr(  JC, "jc"),
+   bstr(   C, "-c"),
+   bstr(  J , "j-")
+};
+#undef bstr
+
 static void print_pstate(struct pt_regs *regs)
 {
u64 pstate = regs->pstate;
@@ -227,7 +236,10 @@ static void print_pstate(struct pt_regs *regs)
pstate & PSR_AA32_I_BIT ? 'I' : 'i',
pstate & PSR_AA32_F_BIT ? 'F' : 'f');
} else {
-   printk("pstate: %08llx (%c%c%c%c %c%c%c%c %cPAN %cUAO)\n",
+   const char *btype_str = btypes[(pstate & PSR_BTYPE_MASK) >>
+  PSR_BTYPE_SHIFT];
+
+   printk("pstate: %08llx (%c%c%c%c %c%c%c%c %cPAN %cUAO 
BTYPE=%s)\n",
pstate,
pstate & PSR_N_BIT ? 'N' : 'n',
pstate & PSR_Z_BIT ? 'Z' : 'z',
@@ -238,7 +250,8 @@ static void print_pstate(struct pt_regs *regs)
pstate & PSR_I_BIT ? 'I' : 'i',
pstate & PSR_F_BIT ? 'F' : 'f',
pstate & PSR_PAN_BIT ? '+' : '-',
-   pstate & PSR_UAO_BIT ? '+' : '-');
+   pstate & PSR_UAO_BIT ? '+' : '-',
+   btype_str);
}
 }
 
-- 
2.1.4



[PATCH v3 05/12] arm64: Basic Branch Target Identification support

2019-10-18 Thread Dave Martin
This patch adds the bare minimum required to expose the ARMv8.5
Branch Target Identification feature to userspace.

By itself, this does _not_ automatically enable BTI for any initial
executable pages mapped by execve().  This will come later, but for
now it should be possible to enable BTI manually on those pages by
using mprotect() from within the target process.

Other arches already using the generic mman.h are already using
0x10 for arch-specific prot flags, so we use that for PROT_BTI
here.

For consistency, signal handler entry points in BTI guarded pages
are required to be annotated as such, just like any other function.
This blocks a relatively minor attack vector, but conforming
userspace will have the annotations anyway, so we may as well
enforce them.

Signed-off-by: Dave Martin 

---

**NOTE**

Currently the generic code does not validate user-supplied prot bits
via arch_validate_prot() except for mprotect():  mmap() doesn't
validate them.

This appears harmless and has been the case ever since the validation
helper was originally introduced in v2.6.27, but it is probably a bug
and could use some attention.

**NOTE**

The new Kconfig dependency on CONFIG_ARM64_PTR_AUTH needs further
discussion.

Two conforming hardware implementations containing BTI could
nonetheless have incompatible Pointer auth implementations, meaning
that we expose BTI to userspace but not Pointer auth.

That's stupid hardware design, but the architecture doesn't forbid it
today.  We _could_ detect this and hide BTI from userspace too, but if
a big.LITTLE system contains Pointer auth implementations with
mismatched IMP DEF algorithms, we lose -- we have no direct way to
detect that.

Since BTI still provides some limited value without Pointer auth,
disabling it unnecessarily might be regarded as too heavy-handed.

---

Changes since v2:

 * Fix Kconfig typo that claimed that Pointer authentication is part of
   ARMv8.2.  It's v8.3.

 * Incorporate PSR_BTYPE_* definitions that were previously delayed
   until "arm64: BTI: Decode BYTPE bits when printing PSTATE" for no
   reason.

 * In the interest of making the code easier to follow, rename the arch
   prot bits handling helpers in arch/arm64/asm/mman.h to have the same
   names used at the their generic callsites, rather than hiding them
   behind #defines.  x86 and powerpc already do the same (but not
   sparc).

 * Add Kconfig dependency on CONFIG_ARM64_PTR_AUTH

   During test hacking, I observed that there are situations where
   userspace should be entitled to assume that Pointer auth is present
   if BTI is present.

   Although the kernel BTI support doesn't require any aspect of
   Pointer authentication, there are architectural dependencies:

* ARMv8.5 requires BTI to be implemented. [1]
* BTI requires ARMv8.4-A to be implemented. [1], [2]
* ARMv8.4 requires ARMv8.3 to be implemented. [3]
* ARMv8.3 requires Pointer authentication to be implemented. [4]

   i.e., an implementation that supports BTI but not Pointer auth is
   broken.

   BTI is also designed to be complementary to Pointer authentication:
   without Pointer auth, BTI would offer no protection for function
   returns, seriously undermining the value of the feature.

   See ARM ARM for ARMv8-A (ARM DDI 0487E.a) Sections:

   [1] A2.8.1, "Architectural features added by ARMv8.5"

   [2] A2.2.1, "Permitted implementation of subsets of ARMv8.x and
   ARMv8.(x+1) architectural features"

   [3] A2.6.1, "Architectural features added by Armv8.3"

   [4] A2.6, "The Armv8.3 architecture extension"

 * Add a comment explaining the purpose of setting PSTATE.BTYPE in
   setup_return() during signal delivery.

   If the registered SIGILL handler itself points to BTI-noncompliant
   entry point in a PROT_BTI page, and was registered with SA_NODEFER,
   then we will take another SIGILL immediately on entry to userspace,
   leading to a temporary livelock until enough signal frames have
   been pushed to exhaust the user stack and trigger a SIGSEGV.

   This is too bad; a similar situation already exists if a SIGSEGV
   handler registered with SA_NODEFER itself triggers a SIGSEGV.

   Because of the way signals are prioritised in dequeue_signal(),
   it's possible that the task may temporarily fail to respond to
   SIGKILL or SIGSTOP while in such a spin.  This is not really a
   new issue caused by BTI, due to the existing SIGSEGV case.

   For SIGKILL, I think this prioritisation problem is resolved
   directly on the send_signal() side, but I'm not so sure about
   SIGSTOP -- investigation is probably needed, but in any case this
   issue seems orthogonal to this series.
---
 Documentation/arm64/cpu-feature-registers.rst |  2 ++
 Documentation/arm64/elf_hwcaps.rst|  4 +++
 arch/arm64/Kconfig| 28 
 arch/arm64/include/asm/cpucaps.h  |  3 ++-
 arch/arm64/include/asm/cpufeat

[PATCH v3 00/12] arm64: ARMv8.5-A: Branch Target Identification support

2019-10-18 Thread Dave Martin
This series implements support for ARMv8.5-A Branch Target Identification
(BTI), which is a control flow integrity protection feature introduced
as part of the ARMv8.5-A extensions.

The series is based on v5.4-rc2.

A branch for this series is available in Git [3].

This series supersedes the previous v2 posting [1], and also
incorporates my proposed ELF GNU property parsing implementation.  (See
[2] for the ABI spec describing NT_GNU_PROPERTY_TYPE_0).

Changes:

 * Minor cleanups / nitpick fixes only.

   Since this is an interim update so that Mark Brown can take over
   development of the series, I haven't fully retested.  The series
   builds with defconfig.

   There are some outstanding discussion points: see notes in the
   individual patches, particularly on patch 5.


Notes:

 * No documentation yet.  We could do with some being written before
   this series gets merged.

 * GCC 9 can compile backwards-compatible BTI-enabled code with
   -mbranch-protection=bti or -mbranch-protection=standard.

 * Binutils trunk supports the new ELF note, but this wasn't in a release
   the last time I posted this series.  (The situation _might_ have changed
   in the meantime...)

   Creation of a BTI-enabled binary requires _everything_ linked in to
   be BTI-enabled.  For now ld --force-bti can be used to override this,
   but some things may break until the required C library support is in
   place.

   There is no straightforward way to mark a .s file as BTI-enabled:
   scraping the output from gcc -S works as a quick hack for now.

   readelf -n can be used to examine the program properties in an ELF
   file.

 * Runtime mmap() and mprotect() can be used to enable BTI on a
   page-by-page basis using the new PROT_BTI, but the code in the
   affected pages still needs to be written or compiled to contain the
   appropriate BTI landing pads.
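
   As a purely illustrative example (not part of the series; error
   handling kept minimal, and code/code_size here just stand for some
   existing executable mapping in the process):

	#include <sys/mman.h>

	#ifndef PROT_BTI
	#define PROT_BTI 0x10	/* arm64 arch-specific prot flag, patch 5 */
	#endif

	/*
	 * Re-mark an existing executable mapping as BTI-guarded.  The
	 * code in it must already contain BTI landing pads (e.g. built
	 * with -mbranch-protection=bti), otherwise indirect branches
	 * into it will fault with SIGILL.
	 */
	if (mprotect(code, code_size, PROT_READ | PROT_EXEC | PROT_BTI))
		perror("mprotect(PROT_BTI)");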


[1] [PATCH v2 00/12] arm64: ARMv8.5-A: Branch Target Identification support
https://lore.kernel.org/lkml/1570733080-21015-1-git-send-email-dave.mar...@arm.com/

[2] Linux Extensions to gABI
https://github.com/hjl-tools/linux-abi/wiki/Linux-Extensions-to-gABI

[3] Git branch:
git://linux-arm.org/linux-dm.git arm64/bti/v3/head
http://linux-arm.org/git?p=linux-dm.git;a=shortlog;h=refs/heads/arm64/bti/v3/head


Dave Martin (12):
  ELF: UAPI and Kconfig additions for ELF program properties
  ELF: Add ELF program property parsing support
  mm: Reserve asm-generic prot flag 0x10 for arch use
  arm64: docs: cpu-feature-registers: Document ID_AA64PFR1_EL1
  arm64: Basic Branch Target Identification support
  elf: Allow arch to tweak initial mmap prot flags
  arm64: elf: Enable BTI at exec based on ELF program properties
  arm64: BTI: Decode BTYPE bits when printing PSTATE
  arm64: traps: Fix inconsistent faulting instruction skipping
  arm64: traps: Shuffle code to eliminate forward declarations
  arm64: BTI: Reset BTYPE when skipping emulated instructions
  KVM: arm64: BTI: Reset BTYPE when skipping emulated instructions

 Documentation/arm64/cpu-feature-registers.rst |  17 ++-
 Documentation/arm64/elf_hwcaps.rst|   4 +
 arch/arm64/Kconfig|  31 ++
 arch/arm64/include/asm/cpucaps.h  |   3 +-
 arch/arm64/include/asm/cpufeature.h   |   6 ++
 arch/arm64/include/asm/elf.h  |  50 +
 arch/arm64/include/asm/esr.h  |   2 +-
 arch/arm64/include/asm/hwcap.h|   1 +
 arch/arm64/include/asm/kvm_emulate.h  |   6 +-
 arch/arm64/include/asm/mman.h |  37 +++
 arch/arm64/include/asm/pgtable-hwdef.h|   1 +
 arch/arm64/include/asm/pgtable.h  |   2 +-
 arch/arm64/include/asm/ptrace.h   |   8 ++
 arch/arm64/include/asm/sysreg.h   |   4 +
 arch/arm64/include/uapi/asm/hwcap.h   |   1 +
 arch/arm64/include/uapi/asm/mman.h|   9 ++
 arch/arm64/include/uapi/asm/ptrace.h  |   1 +
 arch/arm64/kernel/cpufeature.c|  33 ++
 arch/arm64/kernel/cpuinfo.c   |   1 +
 arch/arm64/kernel/entry.S |  11 ++
 arch/arm64/kernel/process.c   |  36 ++-
 arch/arm64/kernel/ptrace.c|   2 +-
 arch/arm64/kernel/signal.c|  16 +++
 arch/arm64/kernel/syscall.c   |  18 
 arch/arm64/kernel/traps.c | 126 +++---
 fs/Kconfig.binfmt |   6 ++
 fs/binfmt_elf.c   | 145 --
 fs/compat_binfmt_elf.c|   4 +
 include/linux/elf.h   |  43 
 include/linux/mm.h|   3 +
 include/uapi/asm-generic/mman-common.h|   1 +
 include/uapi/linux/elf.h  |  11 ++
 32 files changed, 560 insertions(+), 79 deletions(-)
 create mode 100644 arch/arm64/include/asm/mman.h
 create mode 100644 arch/arm64

Re: [PATCH v2 09/12] arm64: traps: Fix inconsistent faulting instruction skipping

2019-10-18 Thread Dave Martin
On Tue, Oct 15, 2019 at 05:49:05PM +0100, Dave Martin wrote:
> On Tue, Oct 15, 2019 at 05:42:04PM +0100, Mark Rutland wrote:
> > On Tue, Oct 15, 2019 at 04:21:09PM +0100, Dave Martin wrote:
> > > On Fri, Oct 11, 2019 at 04:24:53PM +0100, Mark Rutland wrote:
> > > > On Thu, Oct 10, 2019 at 07:44:37PM +0100, Dave Martin wrote:
> > > > > Correct skipping of an instruction on AArch32 works a bit
> > > > > differently from AArch64, mainly due to the different CPSR/PSTATE
> > > > > semantics.
> > > > > 
> > > > > There have been various attempts to get this right.  Currently
> > > > > arm64_skip_faulting_instruction() mostly does the right thing, but
> > > > > does not advance the IT state machine for the AArch32 case.
> > > > > 
> > > > > arm64_compat_skip_faulting_instruction() handles the IT state
> > > > > machine but is local to traps.c, and porting other code to use it
> > > > > will make a mess since there are some call sites that apply for
> > > > > both the compat and native cases.
> > > > > 
> > > > > Since manual instruction skipping implies a trap, it's a relatively
> > > > > slow path.
> > > > > 
> > > > > So, make arm64_skip_faulting_instruction() handle both compat and
> > > > > native, and get rid of the arm64_compat_skip_faulting_instruction()
> > > > > special case.
> > > > > 
> > > > > Fixes: 32a3e635fb0e ("arm64: compat: Add CNTFRQ trap handler")
> > > > > Fixes: 1f1c014035a8 ("arm64: compat: Add condition code checks and IT 
> > > > > advance")
> > > > > Fixes: 6436b572 ("arm64: Fix single stepping in kernel traps")
> > > > > Fixes: bd35a4adc413 ("arm64: Port SWP/SWPB emulation support from 
> > > > > arm")
> > > > > Signed-off-by: Dave Martin 
> > > > > ---
> > > > >  arch/arm64/kernel/traps.c | 18 --
> > > > >  1 file changed, 8 insertions(+), 10 deletions(-)
> > > > 
> > > > This looks good to me; it's certainly easier to reason about.
> > > > 
> > > > I couldn't spot a place where we do the wrong thing today, given AFAICT
> > > > all the instances in arch/arm64/kernel/armv8_deprecated.c would be
> > > > UNPREDICTABLE within an IT block.
> > > > 
> > > > It might be worth calling out an example in the commit message to
> > > > justify the fixes tags.
> > > 
> > > IIRC I found no bug here; rather we have pointlessly fragmented code,
> > > so I followed the "if fixing the same bug in multiple places, merge
> > > those places so you need only fix it in one place next time" rule.
> > 
> > Sure thing, that makes sense to me.
> > 
> > > Since arm64_skip_faulting_instruction() is most of the way to being
> > > generically usable anyway, this series merges all the special-case
> > > handling into it.
> > > 
> > > I could add something like
> > > 
> > > --8<--
> > > 
> > > This allows this fiddly operation to be maintained in a single
> > > place rather than trying to maintain fragmented versions spread
> > > around arch/arm64.
> > > 
> > > -->8--
> > > 
> > > Any good?
> > 
> > My big concern is that the commit message reads as a fix, implying that
> > there's an existing correctness bug. I think that simplifying it to make
> > it clearer that it's a cleanup/improvement would be preferable.
> > 
> > How about:
> > 
> > | arm64: unify native/compat instruction skipping
> > |
> > | Skipping of an instruction on AArch32 works a bit differently from
> > | AArch64, mainly due to the different CPSR/PSTATE semantics.
> > |
> > | Currently arm64_skip_faulting_instruction() is only suitable for
> > | AArch64, and arm64_compat_skip_faulting_instruction() handles the IT
> > | state machine but is local to traps.c.
> > | 
> > | Since manual instruction skipping implies a trap, it's a relatively
> > | slow path.
> > | 
> > | So, make arm64_skip_faulting_instruction() handle both compat and
> > | native, and get rid of the arm64_compat_skip_faulting_instruction()
> > | special case.
> > |
> > | Signed-off-by: Dave Martin 
> 
> And drop the Fixes tags.  Yes, I think that's reasonable.
> 
> I think I was probably glossi

Re: [PATCH v2 11/12] arm64: BTI: Reset BTYPE when skipping emulated instructions

2019-10-18 Thread Dave Martin
On Fri, Oct 18, 2019 at 12:04:29PM +0100, Mark Rutland wrote:
> On Fri, Oct 11, 2019 at 03:47:43PM +0100, Dave Martin wrote:
> > On Fri, Oct 11, 2019 at 03:21:58PM +0100, Mark Rutland wrote:
> > > On Thu, Oct 10, 2019 at 07:44:39PM +0100, Dave Martin wrote:
> > > > Since normal execution of any non-branch instruction resets the
> > > > PSTATE BTYPE field to 0, so do the same thing when emulating a
> > > > trapped instruction.
> > > > 
> > > > Branches don't trap directly, so we should never need to assign a
> > > > non-zero value to BTYPE here.
> > > > 
> > > > Signed-off-by: Dave Martin 
> > > > ---
> > > >  arch/arm64/kernel/traps.c | 2 ++
> > > >  1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > > > index 3af2768..4d8ce50 100644
> > > > --- a/arch/arm64/kernel/traps.c
> > > > +++ b/arch/arm64/kernel/traps.c
> > > > @@ -331,6 +331,8 @@ void arm64_skip_faulting_instruction(struct pt_regs 
> > > > *regs, unsigned long size)
> > > >  
> > > > if (regs->pstate & PSR_MODE32_BIT)
> > > > advance_itstate(regs);
> > > > +   else
> > > > +   regs->pstate &= ~(u64)PSR_BTYPE_MASK;
> > > 
> > > This looks good to me, with one nit below.
> > > 
> > > We don't (currently) need the u64 cast here, and it's inconsistent with
> > > what we do elsewhere. If the upper 32-bit of pstate get allocated, we'll
> > > need to fix up all the other masking we do:
> > 
> > Huh, looks like I missed that.  Dang.  Will fix.
> > 
> > > [mark@lakrids:~/src/linux]% git grep 'pstate &= ~'
> > > arch/arm64/kernel/armv8_deprecated.c:   regs->pstate &= 
> > > ~PSR_AA32_E_BIT;
> > > arch/arm64/kernel/cpufeature.c: regs->pstate &= ~PSR_SSBS_BIT;
> > > arch/arm64/kernel/debug-monitors.c: regs->pstate &= ~DBG_SPSR_SS;
> > > arch/arm64/kernel/insn.c:   pstate &= ~(pstate >> 1);   /* 
> > > PSR_C_BIT &= ~PSR_Z_BIT */
> > > arch/arm64/kernel/insn.c:   pstate &= ~(pstate >> 1);   /* 
> > > PSR_C_BIT &= ~PSR_Z_BIT */
> > > arch/arm64/kernel/probes/kprobes.c: regs->pstate &= ~PSR_D_BIT;
> > > arch/arm64/kernel/probes/kprobes.c: regs->pstate &= ~DAIF_MASK;
> > > arch/arm64/kernel/ptrace.c: regs->pstate &= 
> > > ~SPSR_EL1_AARCH32_RES0_BITS;
> > > arch/arm64/kernel/ptrace.c: regs->pstate &= 
> > > ~PSR_AA32_E_BIT;
> > > arch/arm64/kernel/ptrace.c: regs->pstate &= 
> > > ~SPSR_EL1_AARCH64_RES0_BITS;
> > > arch/arm64/kernel/ptrace.c: regs->pstate &= ~DBG_SPSR_SS;
> > > arch/arm64/kernel/ssbd.c:   task_pt_regs(task)->pstate &= ~val;
> > > arch/arm64/kernel/traps.c:  regs->pstate &= ~PSR_AA32_IT_MASK;
> > > 
> > > ... and at that point I'd suggest we should just ensure the bit
> > > definitions are all defined as unsigned long in the first place since
> > > adding casts to each use is error-prone.
> > 
> > Are we concerned about changing the types of UAPI #defines?  That can
> > cause subtle and unexpected breakage, especially when the signedness
> > of a #define changes.
> > 
> > Ideally, we'd just change all these to 1UL << n.
> 
> I agree that's the ideal -- I don't know how concerned we are w.r.t. the
> UAPI headers, I'm afraid.

OK, I'll follow the existing convention for now, keep the #define as
(implicitly) signed, and drop the u64 casts.

At some point in the future we may want to refactor the headers so that
the kernel uses shadow register bit definitions that are always u64.
The new HWCAP definitions provide a reasonable template for doing that
kind of thing.
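
(Purely as an illustration of the sort of thing I mean -- the shadow
name here is invented, and the UAPI definition would stay untouched:

	/* uapi/asm/ptrace.h, unchanged (implicitly signed int): */
	#define PSR_BTYPE_MASK		0x00000c00

	/* hypothetical kernel-internal shadow definition, always 64-bit: */
	#define _PSR_BTYPE_MASK		UL(0x00000c00)

analogous to the existing HWCAP_* / KERNEL_HWCAP_* split.)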

It's probably best not to do anything to alter the types of the UAPI
definitions.

I will shamelessly duck this for now :|

Cheers
---Dave


Re: [PATCH v2 05/12] arm64: Basic Branch Target Identification support

2019-10-18 Thread Dave Martin
On Fri, Oct 18, 2019 at 12:16:03PM +0100, Mark Rutland wrote:
> [adding mm folk]
> 
> On Fri, Oct 11, 2019 at 06:20:15PM +0100, Dave Martin wrote:
> > On Fri, Oct 11, 2019 at 04:10:29PM +0100, Mark Rutland wrote:
> > > On Thu, Oct 10, 2019 at 07:44:33PM +0100, Dave Martin wrote:
> > > > +#define arch_validate_prot(prot, addr) arm64_validate_prot(prot, addr)
> > > > +static inline int arm64_validate_prot(unsigned long prot, unsigned 
> > > > long addr)
> > > > +{
> > > > +   unsigned long supported = PROT_READ | PROT_WRITE | PROT_EXEC | 
> > > > PROT_SEM;
> > > > +
> > > > +   if (system_supports_bti())
> > > > +   supported |= PROT_BTI;
> > > > +
> > > > +   return (prot & ~supported) == 0;
> > > > +}
> > > 
> > > If we have this check, can we ever get into arm64_calc_vm_prot_bits()
> > > with PROT_BIT but !system_supports_bti()?
> > > 
> > > ... or can that become:
> > > 
> > >   return (prot & PROT_BTI) ? VM_ARM64_BTI : 0;
> > 
> > We can reach this via mmap() and friends IIUC.
> > 
> > Since this function only gets called once-ish per vma I have a weak
> > preference for keeping the check here to avoid code fragility.
> > 
> > 
> > It does feel like arch_validate_prot() is supposed to be a generic gate
> > for prot flags coming into the kernel via any route though, but only the
> > mprotect() path actually uses it.
> > 
> > This function originally landed in v2.6.27 as part of the powerpc strong
> > access ordering support (PROT_SAO):
> > 
> > b845f313d78e ("mm: Allow architectures to define additional protection 
> > bits")
> > ef3d3246a0d0 ("powerpc/mm: Add Strong Access Ordering support")
> > 
> > where the mmap() path uses arch_calc_vm_prot_bits() without
> > arch_validate_prot(), just as in the current code.  powerpc's original
> > arch_calc_vm_prot_bits() does no obvious policing.
> > 
> > This might be a bug.  I can draft a patch to add it for the mmap() path
> > for people to comment on ... I can't figure out yet whether or not the
> > difference is intentional or there's some subtlety that I've missed.
> 
> From reading those two commit messages, it looks like this was an
> oversight. I'd expect that we should apply this check for any
> user-provided prot (i.e. it should apply to both mprotect and mmap).
> 
> Ben, Andrew, does that make sense to you?
> 
> ... or was there some reason to only do this for mprotect?
> 
> Thanks,
> Mark.

For now, I'll drop a comment under the tearoff noting this outstanding
question.

The resulting behaviour is slightly odd, but doesn't seem unsafe, and
we can of course tidy it up later.  I think the risk of userspace
becoming dependent on randomly passing PROT_BTI to mprotect() even
when unsupported is low.

[...]

Cheers
---Dave


Re: [PATCH v2 05/12] arm64: Basic Branch Target Identification support

2019-10-18 Thread Dave Martin
On Fri, Oct 18, 2019 at 12:10:03PM +0100, Mark Rutland wrote:
> On Fri, Oct 11, 2019 at 06:20:15PM +0100, Dave Martin wrote:
> > On Fri, Oct 11, 2019 at 04:10:29PM +0100, Mark Rutland wrote:
> > > On Thu, Oct 10, 2019 at 07:44:33PM +0100, Dave Martin wrote:
> > > > +#define arch_calc_vm_prot_bits(prot, pkey) 
> > > > arm64_calc_vm_prot_bits(prot)
> > > > +static inline unsigned long arm64_calc_vm_prot_bits(unsigned long prot)
> > > > +{
> > > > +   if (system_supports_bti() && (prot & PROT_BTI))
> > > > +   return VM_ARM64_BTI;
> > > > +
> > > > +   return 0;
> > > > +}
> > > 
> > > Can we call this arch_calc_vm_prot_bits() directly, with all the
> > > arguments:
> > > 
> > > static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
> > >  unsigned long pkey)
> > > {
> > >   ...
> > > }
> > > #define arch_calc_vm_prot_bits arch_calc_vm_prot_bits
> > > 
> > > ... as that makes it a bit easier to match definition with use, and just
> > > definign the name makes it a bit clearer that that's probably for the
> > > benefit of some ifdeffery.
> > > 
> > > Likewise for the other functions here.
> > > 
> > > > +#define arch_vm_get_page_prot(vm_flags) 
> > > > arm64_vm_get_page_prot(vm_flags)
> > > > +static inline pgprot_t arm64_vm_get_page_prot(unsigned long vm_flags)
> > > > +{
> > > > +   return (vm_flags & VM_ARM64_BTI) ? __pgprot(PTE_GP) : 
> > > > __pgprot(0);
> > > > +}
> > > > +
> > > > +#define arch_validate_prot(prot, addr) arm64_validate_prot(prot, addr)
> > > > +static inline int arm64_validate_prot(unsigned long prot, unsigned 
> > > > long addr)
> > 
> > Can do, though it looks like I used sparc as a template, and that has a
> > sparc_ prefix.
> > 
> > powerpc uses the generic name, as does x86 ... in its UAPI headers.
> > Odd.
> > 
> > I can change the names here, though I'm not sure it adds a lot of value.
> > 
> > If you feel strongly I can do it.
> 
> I'd really prefer it because it minimizes surprises, and makes it much
> easier to hop around the codebase and find the thing you're looking for.

OK, I've no objection in that case.  I'll make the change.
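
For the record, the reworked helper would then look something like this
(untested, just the shape):

	#define arch_calc_vm_prot_bits arch_calc_vm_prot_bits
	static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
							   unsigned long pkey)
	{
		if (system_supports_bti() && (prot & PROT_BTI))
			return VM_ARM64_BTI;

		return 0;
	}

and similarly for arch_vm_get_page_prot() and arch_validate_prot().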

[...]

Cheers
---Dave


Re: [PATCH v2 05/12] arm64: Basic Branch Target Identification support

2019-10-18 Thread Dave Martin
On Fri, Oct 18, 2019 at 12:05:52PM +0100, Mark Rutland wrote:
> On Fri, Oct 11, 2019 at 05:42:00PM +0100, Dave Martin wrote:
> > On Fri, Oct 11, 2019 at 05:01:13PM +0100, Dave Martin wrote:
> > > On Fri, Oct 11, 2019 at 04:44:45PM +0100, Dave Martin wrote:
> > > > On Fri, Oct 11, 2019 at 04:40:43PM +0100, Mark Rutland wrote:
> > > > > On Fri, Oct 11, 2019 at 04:32:26PM +0100, Dave Martin wrote:

[...]

> > > > > > Either way, I feel we should do this: any function in a PROT_BTI 
> > > > > > page
> > > > > > should have a suitable landing pad.  There's no reason I can see why
> > > > > > a protection given to any other callback function should be omitted
> > > > > > for a signal handler.
> > > > > > 
> > > > > > Note, if the signal handler isn't in a PROT_BTI page then overriding
> > > > > > BTYPE here will not trigger a Branch Target exception.
> > > > > > 
> > > > > > I'm happy to drop a brief comment into the code also, once we're
> > > > > > agreed on what the code should be doing.
> > > > > 
> > > > > So long as there's a comment as to why, I have no strong feelings 
> > > > > here.
> > > > > :)
> > > > 
> > > > OK, I think it's worth a brief comment in the code either way, so I'll
> > > > add something.
> > > 
> > > Hmm, come to think of it we do need special logic for a particular case
> > > here:
> > > 
> > > If we are delivering a SIGILL here and the SIGILL handler was registered
> > > with SA_NODEFER then we will get into a spin, repeatedly delivering
> > > the BTI-triggered SIGILL to the same (bad) entry point.
> > > 
> > > Without SA_NODEFER, the SIGILL becomes fatal, which is the desired
> > > behaviour, but we'll need to catch this recursion explicitly.
> > > 
> > > 
> > > It's similar to the special force_sigsegv() case in
> > > linux/kernel/signal.c...
> > > 
> > > Thoughts?
> > 
> > On second thought, maybe we don't need to do anything special.
> > 
> > A SIGSEGV handler registered with (SA_NODEFER & ~SA_RESETHAND) and that
> > dereferences a duff address would spin similarly.
> > 
> > This SIGILL case doesn't really seem different.  Either way it's a
> > livelock of the user task that doesn't compromise the kernel.  There
> > are plenty of ways for such a livelock to happen.
> 
> That sounds reasonable to me.

OK, I guess we can park this discussion for now.
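
For anyone following along, the scenario we were worried about is the
usual SA_NODEFER self-refault pattern; in userspace terms it is just
(illustrative sketch, nothing arm64-specific in the C itself):

	#include <signal.h>

	static void handler(int sig)
	{
		(void)sig;
		/*
		 * If entering this handler itself raises SIGILL again --
		 * e.g. because its entry point lacks a BTI landing pad
		 * ("bti c") and it lives in a PROT_BTI page -- then with
		 * SA_NODEFER the signal is not blocked during delivery,
		 * so the task just re-enters here forever.
		 */
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_handler = handler,
			.sa_flags = SA_NODEFER,	/* don't mask SIGILL while handling it */
		};

		sigaction(SIGILL, &sa, NULL);
		/* ... code that later takes a Branch Target exception ... */
		return 0;
	}

As noted above, the same livelock is already reachable with SIGSEGV, so
it isn't a new class of problem.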

Cheers
---Dave


Re: [PATCH 2/3] arm64: nofpsmid: Clear TIF_FOREIGN_FPSTATE flag for early tasks

2019-10-17 Thread Dave Martin
On Thu, Oct 17, 2019 at 01:42:37PM +0100, Suzuki K Poulose wrote:
> Hi Dave
> 
> Thanks for the comments.
> 
> On 11/10/2019 12:26, Dave Martin wrote:
> >On Thu, Oct 10, 2019 at 06:15:16PM +0100, Suzuki K Poulose wrote:
> >>We detect the absence of FP/SIMD after we boot the SMP CPUs, and by then
> >>we have kernel threads running already with TIF_FOREIGN_FPSTATE set which
> >>could be inherited by early userspace applications (e.g, modprobe triggered
> >>from initramfs). This could end up in the applications stuck in
> >>do_notify_resume() as we never clear the TIF flag, once we now know that
> >>we don't support FP.
> >>
> >>Fix this by making sure that we clear the TIF_FOREIGN_FPSTATE flag
> >>for tasks which may have them set, as we would have done in the normal
> >>case, but avoiding touching the hardware state (since we don't support any).
> >>
> >>Fixes: 82e0191a1aa11abf ("arm64: Support systems without FP/ASIMD")
> >>Cc: Will Deacon 
> >>Cc: Mark Rutland 
> >>Cc: Catalin Marinas 
> >>Signed-off-by: Suzuki K Poulose 
> >>---
> >>  arch/arm64/kernel/fpsimd.c | 26 --
> >>  1 file changed, 16 insertions(+), 10 deletions(-)
> >>
> >>diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> >>index 37d3912cfe06..dfcdd077aeca 100644
> >>--- a/arch/arm64/kernel/fpsimd.c
> >>+++ b/arch/arm64/kernel/fpsimd.c
> >>@@ -1128,12 +1128,19 @@ void fpsimd_bind_state_to_cpu(struct 
> >>user_fpsimd_state *st, void *sve_state,
> >>   */
> >>  void fpsimd_restore_current_state(void)
> >>  {
> >>-   if (!system_supports_fpsimd())
> >>-   return;
> >>-
> >>get_cpu_fpsimd_context();
> >>-
> >>-   if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
> >>+   /*
> >>+* For the tasks that were created before we detected the absence of
> >>+* FP/SIMD, the TIF_FOREIGN_FPSTATE could be set via 
> >>fpsimd_thread_switch()
> >>+* and/or could be inherited from the parent(init_task has this set). 
> >>Even
> >>+* though userspace has not run yet, this could be inherited by the
> >>+* processes forked from one of those tasks (e.g, modprobe from 
> >>initramfs).
> >>+* If the system doesn't support FP/SIMD, we must clear the flag for the
> >>+* tasks mentioned above, to indicate that the FPSTATE is clean (as we
> >>+* can't have one) to avoid looping for ever to clear the flag.
> >>+*/
> >>+   if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE) &&
> >>+   system_supports_fpsimd()) {
> >
> >I'm not too keen on this approach: elsewhere we just stub out all the
> >FPSIMD handling logic if !system_supports_fpsimd() -- I think we should
> >be using this test everywhere rather than relying on TIF_FOREIGN_FPSTATE.
> 
> We used to do this. But the flag is not cleared anymore once we detect
> the absence of FP/SIMD.
> 
> >Rather, I feel that TIF_FOREIGN_FPSTATE means "if this is a user task
> >and this task is current() and the system supports FPSIMD at all, this
> >task's FPSIMD state is not loaded in the cpu".
> 
> Yes, that is  correct. However, we ran some tasks, even before we detected
> that the FPSIMD is missing. So, we need to clear the flag for those tasks
> to make sure the flag state is consistent, as explained in the comment.

I think there's a misunderstanding here somewhere.

What I'm saying is that we shouldn't even look at TIF_FOREIGN_FPSTATE if
!system_supports_fpsimd() -- i.e., when checking whether there is
any FPSIMD context handling work to do, !system_supports_fpsimd()
should take precedence.

Firstly, this replaces the runtime TIF_FOREIGN_FPSTATE check with a
static key check in the !system_supports_fpsimd() case; and secondly,
the "work to do" condition is never wrong, even temporarily.

The "work to do" condition is now

system_supports_fpsimd() && test_thread_flag(TIF_FOREIGN_FPSTATE)

instead of

test_thread_flag(TIF_FOREIGN_FPSTATE).

Code that _only writes_ the TIF_FOREIGN_FPSTATE flag can continue to do
so harmlessly if we do things this way.

In do_notify_resume() this doesn't quite work, but it's acceptable to
fall spuriously into fpsimd_restore_current_state() provided that we
check for !system_supports_fpsimd() in there, which we already do.
In this one case, we should also clear TIF_FOREIGN_FPSTATE, so that
the early-return path doesn't send do_notify_resume() into a spin
waiting for the flag to go clear.
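
Concretely, the shape I have in mind is roughly (sketch only, not the
actual patch):

	/* General pattern for the context-handling call sites: */
	if (system_supports_fpsimd() &&
	    test_thread_flag(TIF_FOREIGN_FPSTATE)) {
		/* ... FPSIMD state handling work ... */
	}

	/*
	 * do_notify_resume() can't easily apply the static key to its
	 * work mask, so let it fall through and handle the special case
	 * here, clearing the flag so it doesn't keep seeing work to do:
	 */
	void fpsimd_restore_current_state(void)
	{
		if (!system_supports_fpsimd()) {
			clear_thread_flag(TIF_FOREIGN_FPSTATE);
			return;
		}

		/* ... existing logic ... */
	}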

Another option is to cle

Re: [PATCH v2 09/12] arm64: traps: Fix inconsistent faulting instruction skipping

2019-10-15 Thread Dave Martin
On Tue, Oct 15, 2019 at 05:42:04PM +0100, Mark Rutland wrote:
> On Tue, Oct 15, 2019 at 04:21:09PM +0100, Dave Martin wrote:
> > On Fri, Oct 11, 2019 at 04:24:53PM +0100, Mark Rutland wrote:
> > > On Thu, Oct 10, 2019 at 07:44:37PM +0100, Dave Martin wrote:
> > > > Correct skipping of an instruction on AArch32 works a bit
> > > > differently from AArch64, mainly due to the different CPSR/PSTATE
> > > > semantics.
> > > > 
> > > > There have been various attempts to get this right.  Currently
> > > > arm64_skip_faulting_instruction() mostly does the right thing, but
> > > > does not advance the IT state machine for the AArch32 case.
> > > > 
> > > > arm64_compat_skip_faulting_instruction() handles the IT state
> > > > machine but is local to traps.c, and porting other code to use it
> > > > will make a mess since there are some call sites that apply for
> > > > both the compat and native cases.
> > > > 
> > > > Since manual instruction skipping implies a trap, it's a relatively
> > > > slow path.
> > > > 
> > > > So, make arm64_skip_faulting_instruction() handle both compat and
> > > > native, and get rid of the arm64_compat_skip_faulting_instruction()
> > > > special case.
> > > > 
> > > > Fixes: 32a3e635fb0e ("arm64: compat: Add CNTFRQ trap handler")
> > > > Fixes: 1f1c014035a8 ("arm64: compat: Add condition code checks and IT 
> > > > advance")
> > > > Fixes: 6436b572 ("arm64: Fix single stepping in kernel traps")
> > > > Fixes: bd35a4adc413 ("arm64: Port SWP/SWPB emulation support from arm")
> > > > Signed-off-by: Dave Martin 
> > > > ---
> > > >  arch/arm64/kernel/traps.c | 18 --
> > > >  1 file changed, 8 insertions(+), 10 deletions(-)
> > > 
> > > This looks good to me; it's certainly easier to reason about.
> > > 
> > > I couldn't spot a place where we do the wrong thing today, given AFAICT
> > > all the instances in arch/arm64/kernel/armv8_deprecated.c would be
> > > UNPREDICTABLE within an IT block.
> > > 
> > > It might be worth calling out an example in the commit message to
> > > justify the fixes tags.
> > 
> > IIRC I found no bug here; rather we have pointlessly fragmented code,
> > so I followed the "if fixing the same bug in multiple places, merge
> > those places so you need only fix it in one place next time" rule.
> 
> Sure thing, that makes sense to me.
> 
> > Since arm64_skip_faulting_instruction() is most of the way to being
> > generically usable anyway, this series merges all the special-case
> > handling into it.
> > 
> > I could add something like
> > 
> > --8<--
> > 
> > This allows this fiddly operation to be maintained in a single
> > place rather than trying to maintain fragmented versions spread
> > around arch/arm64.
> > 
> > -->8--
> > 
> > Any good?
> 
> My big concern is that the commit message reads as a fix, implying that
> there's an existing correctness bug. I think that simplifying it to make
> it clearer that it's a cleanup/improvement would be preferable.
> 
> How about:
> 
> | arm64: unify native/compat instruction skipping
> |
> | Skipping of an instruction on AArch32 works a bit differently from
> | AArch64, mainly due to the different CPSR/PSTATE semantics.
> |
> | Currently arm64_skip_faulting_instruction() is only suitable for
> | AArch64, and arm64_compat_skip_faulting_instruction() handles the IT
> | state machine but is local to traps.c.
> | 
> | Since manual instruction skipping implies a trap, it's a relatively
> | slow path.
> | 
> | So, make arm64_skip_faulting_instruction() handle both compat and
> | native, and get rid of the arm64_compat_skip_faulting_instruction()
> | special case.
> |
> | Signed-off-by: Dave Martin 

And drop the Fixes tags.  Yes, I think that's reasonable.

I think I was probably glossing over the fact that we don't need to get
the ITSTATE machine advance correct for the compat insn emulation; as
you say, I can't see what else this patch "fixes".

> With that, feel free to add:
>
> Reviewed-by: Mark Rutland 

Thanks!

> We could even point out that the armv8_deprecated cases are
> UNPREDICTABLE in an IT block, and correctly emulated either way.

Yes, I'll add something along those lines.

Cheers
---Dave


Re: [PATCH v2 09/12] arm64: traps: Fix inconsistent faulting instruction skipping

2019-10-15 Thread Dave Martin
On Fri, Oct 11, 2019 at 04:24:53PM +0100, Mark Rutland wrote:
> On Thu, Oct 10, 2019 at 07:44:37PM +0100, Dave Martin wrote:
> > Correct skipping of an instruction on AArch32 works a bit
> > differently from AArch64, mainly due to the different CPSR/PSTATE
> > semantics.
> > 
> > There have been various attempts to get this right.  Currently
> > arm64_skip_faulting_instruction() mostly does the right thing, but
> > does not advance the IT state machine for the AArch32 case.
> > 
> > arm64_compat_skip_faulting_instruction() handles the IT state
> > machine but is local to traps.c, and porting other code to use it
> > will make a mess since there are some call sites that apply for
> > both the compat and native cases.
> > 
> > Since manual instruction skipping implies a trap, it's a relatively
> > slow path.
> > 
> > So, make arm64_skip_faulting_instruction() handle both compat and
> > native, and get rid of the arm64_compat_skip_faulting_instruction()
> > special case.
> > 
> > Fixes: 32a3e635fb0e ("arm64: compat: Add CNTFRQ trap handler")
> > Fixes: 1f1c014035a8 ("arm64: compat: Add condition code checks and IT 
> > advance")
> > Fixes: 6436b572 ("arm64: Fix single stepping in kernel traps")
> > Fixes: bd35a4adc413 ("arm64: Port SWP/SWPB emulation support from arm")
> > Signed-off-by: Dave Martin 
> > ---
> >  arch/arm64/kernel/traps.c | 18 --
> >  1 file changed, 8 insertions(+), 10 deletions(-)
> 
> This looks good to me; it's certainly easier to reason about.
> 
> I couldn't spot a place where we do the wrong thing today, given AFAICT
> all the instances in arch/arm64/kernel/armv8_deprecated.c would be
> UNPREDICTABLE within an IT block.
> 
> It might be worth calling out an example in the commit message to
> justify the fixes tags.

IIRC I found no bug here; rather we have pointlessly fragmented code,
so I followed the "if fixing the same bug in multiple places, merge
those places so you need only fix it in one place next time" rule.

Since arm64_skip_faulting_instruction() is most of the way to being
generically usable anyway, this series merges all the special-case
handling into it.

I could add something like

--8<--

This allows this fiddly operation to be maintained in a single
place rather than trying to maintain fragmented versions spread
around arch/arm64.

-->8--

Any good?

Cheers
---Dave

[...]

> > 
> > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > index 15e3c4f..44c91d4 100644
> > --- a/arch/arm64/kernel/traps.c
> > +++ b/arch/arm64/kernel/traps.c
> > @@ -268,6 +268,8 @@ void arm64_notify_die(const char *str, struct pt_regs 
> > *regs,
> > }
> >  }
> >  
> > +static void advance_itstate(struct pt_regs *regs);
> > +
> >  void arm64_skip_faulting_instruction(struct pt_regs *regs, unsigned long 
> > size)
> >  {
> > regs->pc += size;
> > @@ -278,6 +280,9 @@ void arm64_skip_faulting_instruction(struct pt_regs 
> > *regs, unsigned long size)
> >  */
> > if (user_mode(regs))
> > user_fastforward_single_step(current);
> > +
> > +   if (regs->pstate & PSR_MODE32_BIT)
> > +   advance_itstate(regs);
> >  }
> >  
> >  static LIST_HEAD(undef_hook);
> > @@ -629,19 +634,12 @@ static void advance_itstate(struct pt_regs *regs)
> > compat_set_it_state(regs, it);
> >  }
> >  
> > -static void arm64_compat_skip_faulting_instruction(struct pt_regs *regs,
> > -  unsigned int sz)
> > -{
> > -   advance_itstate(regs);
> > -   arm64_skip_faulting_instruction(regs, sz);
> > -}
> > -
> >  static void compat_cntfrq_read_handler(unsigned int esr, struct pt_regs 
> > *regs)
> >  {
> > int reg = (esr & ESR_ELx_CP15_32_ISS_RT_MASK) >> 
> > ESR_ELx_CP15_32_ISS_RT_SHIFT;
> >  
> > pt_regs_write_reg(regs, reg, arch_timer_get_rate());
> > -   arm64_compat_skip_faulting_instruction(regs, 4);
> > +   arm64_skip_faulting_instruction(regs, 4);
> >  }
> >  
> >  static const struct sys64_hook cp15_32_hooks[] = {
> > @@ -661,7 +659,7 @@ static void compat_cntvct_read_handler(unsigned int 
> > esr, struct pt_regs *regs)
> >  
> > pt_regs_write_reg(regs, rt, lower_32_bits(val));
> > pt_regs_write_reg(regs, rt2, upper_32_bits(val));
> > -   arm64_compat_skip_faulting_instruction(regs, 4);
> > +   arm64_skip_faulting_instruction(regs, 4);
> >  }
> >  
> >  s
