Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-15 Thread Mark Rutland
On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > +/**
> > + * enum execmem_type - types of executable memory ranges
> > + *
> > + * There are several subsystems that allocate executable memory.
> > + * Architectures define different restrictions on placement,
> > + * permissions, alignment and other parameters for memory that can be used
> > + * by these subsystems.
> > + * Types in this enum identify subsystems that allocate executable memory
> > + * and let architectures define parameters for ranges suitable for
> > + * allocations by each subsystem.
> > + *
> > + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> > + * are not explicitly defined.
> > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > + * @EXECMEM_KPROBES: parameters for kprobes
> > + * @EXECMEM_FTRACE: parameters for ftrace
> > + * @EXECMEM_BPF: parameters for BPF
> > + * @EXECMEM_TYPE_MAX:
> > + */
> > +enum execmem_type {
> > +   EXECMEM_DEFAULT,
> > +   EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > +   EXECMEM_KPROBES,
> > +   EXECMEM_FTRACE,
> > +   EXECMEM_BPF,
> > +   EXECMEM_TYPE_MAX,
> > +};
> 
> Can we please get a break-down of how all these types are actually
> different from one another?
> 
> I'm thinking some platforms have a tiny immediate space (arm64 comes to
> mind) and has less strict placement constraints for some of them?

Yeah, and really I'd *much* rather deal with that in arch code, as I have said
several times.

For arm64 we have two basic restrictions:

1) Direct branches can go +/-128M
   We can expand this range by having direct branches go to PLTs, at a
   performance cost.

2) PREL32 relocations can go +/-2G
   We cannot expand this further.

* We don't need to allocate memory for ftrace. We do not use trampolines.

* Kprobes XOL areas don't care about either of those; we don't place any
  PC-relative instructions in those. Maybe we want to in future.

* Modules care about both; we'd *prefer* to place them within +/-128M of all
  other kernel/module code, but if there's no space we can use PLTs and expand
  that to +/-2G. Since modules can reference other modules, that ends up
  actually being halved, and modules have to fit within some 2G window that
  also covers the kernel.

* I'm not sure about BPF's requirements; it seems happy doing the same as
  modules.

So if we *must* use a common execmem allocator, what we'd really want is our own
types, e.g.

EXECMEM_ANYWHERE
EXECMEM_NOPLT
EXECMEM_PREL32

... and then we use those in arch code to implement module_alloc() and friends.
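
To make that concrete, a minimal sketch of what the arch side could look like
(everything below is illustrative: execmem_arch_range(), execmem_alloc_range()
and module_region_base are invented names for this example, not code from the
series):

enum arm64_execmem_type { EXECMEM_ANYWHERE, EXECMEM_NOPLT, EXECMEM_PREL32 };

struct execmem_range { unsigned long start, end; };

static struct execmem_range execmem_arch_range(enum arm64_execmem_type type)
{
	switch (type) {
	case EXECMEM_NOPLT:	/* reachable with direct branches: +/-128M */
		return (struct execmem_range){ module_region_base,
					       module_region_base + SZ_128M };
	case EXECMEM_PREL32:	/* PLTs allowed, PREL32 still reachable: 2G */
		return (struct execmem_range){ module_region_base,
					       module_region_base + SZ_2G };
	case EXECMEM_ANYWHERE:	/* e.g. kprobes XOL: no placement constraint */
	default:
		return (struct execmem_range){ VMALLOC_START, VMALLOC_END };
	}
}

void *module_alloc(unsigned long size)
{
	struct execmem_range r = execmem_arch_range(EXECMEM_NOPLT);
	void *p = execmem_alloc_range(size, r.start, r.end);

	/* fall back to the wider window, paying the PLT cost */
	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS)) {
		r = execmem_arch_range(EXECMEM_PREL32);
		p = execmem_alloc_range(size, r.start, r.end);
	}
	return p;
}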

Mark.



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Mark Rutland
On Tue, Mar 26, 2024 at 09:15:14AM -0700, Calvin Owens wrote:
> On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote:
> > On Tue, 26 Mar 2024 14:46:10 +
> > Mark Rutland  wrote:
> > > Different executable allocations can have different requirements. For 
> > > example,
> > > on arm64 modules need to be within 2G of the kernel image, but the 
> > > kprobes XOL
> > > areas can be anywhere in the kernel VA space.
> > > 
> > > Forcing those behind the same interface makes things *harder* for 
> > > architectures
> > > and/or makes the common code more complicated (if that ends up having to 
> > > track
> > > all those different requirements). From my PoV it'd be much better to have
> > > separate kprobes_alloc_*() functions for kprobes which an architecture 
> > > can then
> > > choose to implement using a common library if it wants to.
> > > 
> > > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > > v6,
> > > and it looks pretty clean to me (and works in testing on arm64):
> > > 
> > >   
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > > 
> > > Could we please start with that approach, with kprobe-specific alloc/free 
> > > code
> > > provided by the architecture?
> 
> Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was
> about to send a patch to remove it.
> 
> > OK, as far as I can read the code, this method also works and neat! 
> > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM
> > to user does not help, it should be an internal change. So hiding this 
> > change
> > from user is better choice. Then there is no reason to introduce the new
> > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable.
> 
> I'm happy with this, it solves the first half of my problem. But I want
> eBPF to work in the !MODULES case too.
> 
> I think Mark's approach can work for bpf as well, without needing to
> touch module_alloc() at all? So I might be able to drop that first patch
> entirely.

I'd be very happy with eBPF following the same approach, with BPF-specific
alloc/free functions that we can implement in arch code.

IIUC eBPF code *does* want to be within range of the core kernel image, so for
arm64 we'd want to factor some common logic out of module_alloc() and into
something that module_alloc() and "bpf_alloc()" (or whatever it would be
called) could use. So I don't think we'd necessarily save on touching
module_alloc(), but I think the resulting split would be better.
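
To illustrate the kind of split I mean, a rough sketch (the __arm64_text_alloc()
name is invented, and the exact flags are from memory rather than copied from
the current module_alloc(), so treat this as illustrative only):

/* common placement logic, shared by modules and BPF */
static void *__arm64_text_alloc(unsigned long size)
{
	/* try the +/-128M branch-reachable window first ... */
	void *p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
				       module_alloc_base + SZ_128M, GFP_KERNEL,
				       PAGE_KERNEL, 0, NUMA_NO_NODE,
				       __builtin_return_address(0));

	/* ... and fall back to a 2G window, relying on module PLTs */
	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
		p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
					 module_alloc_base + SZ_2G, GFP_KERNEL,
					 PAGE_KERNEL, 0, NUMA_NO_NODE,
					 __builtin_return_address(0));
	return p;
}

void *module_alloc(unsigned long size)
{
	return __arm64_text_alloc(size);
}

void *bpf_jit_alloc_exec(unsigned long size)
{
	return __arm64_text_alloc(size);
}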

Thanks,
Mark.



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Mark Rutland
On Wed, Mar 27, 2024 at 12:24:03AM +0900, Masami Hiramatsu wrote:
> On Tue, 26 Mar 2024 14:46:10 +
> Mark Rutland  wrote:
> > 
> > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > I think, we'd better to introduce `alloc_execmem()`,
> > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
> > > 
> > >   config HAVE_ALLOC_EXECMEM
> > >   bool
> > > 
> > >   config ALLOC_EXECMEM
> > >   bool "Executable trampline memory allocation"
> > >   depends on MODULES || HAVE_ALLOC_EXECMEM
> > > 
> > > And define fallback macro to module_alloc() like this.
> > > 
> > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > #define alloc_execmem(size, gfp)  module_alloc(size)
> > > #endif
> > 
> > Please can we *not* do this? I think this is abstracting at the wrong level 
> > (as
> > I mentioned on the prior execmem proposals).
> > 
> > Different executable allocations can have different requirements. For 
> > example,
> > on arm64 modules need to be within 2G of the kernel image, but the kprobes 
> > XOL
> > areas can be anywhere in the kernel VA space.
> > 
> > Forcing those behind the same interface makes things *harder* for 
> > architectures
> > and/or makes the common code more complicated (if that ends up having to 
> > track
> > all those different requirements). From my PoV it'd be much better to have
> > separate kprobes_alloc_*() functions for kprobes which an architecture can 
> > then
> > choose to implement using a common library if it wants to.
> > 
> > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > v6,
> > and it looks pretty clean to me (and works in testing on arm64):
> > 
> >   
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > 
> > Could we please start with that approach, with kprobe-specific alloc/free 
> > code
> > provided by the architecture?
> 
> OK, as far as I can read the code, this method also works and neat! 
> (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM
> to user does not help, it should be an internal change. So hiding this change
> from user is better choice. Then there is no reason to introduce the new
> alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable.
> 
> Mark, can you send this series here, so that others can review/test it?

I've written up a cover letter and sent that out:
  
  https://lore.kernel.org/lkml/20240326163624.3253157-1-mark.rutl...@arm.com/

Mark.



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Mark Rutland
Hi Masami,

On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> Hi Jarkko,
> 
> On Sun, 24 Mar 2024 01:29:08 +0200
> Jarkko Sakkinen  wrote:
> 
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible due the kernel module allocator dependency.
> > 
> > Address the issue by allowing architectures to implement module_alloc()
> > and module_memfree() independent of the module subsystem. An arch tree
> > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > 
> > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > and implementing module_memfree().
> 
> Even though, this involves changes in arch-independent part. So it should
> be solved by generic way. Did you checked Calvin's thread?
> 
> https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> 
> I think, we'd better to introduce `alloc_execmem()`,
> CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
> 
>   config HAVE_ALLOC_EXECMEM
>   bool
> 
>   config ALLOC_EXECMEM
>   bool "Executable trampline memory allocation"
>   depends on MODULES || HAVE_ALLOC_EXECMEM
> 
> And define fallback macro to module_alloc() like this.
> 
> #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> #define alloc_execmem(size, gfp)  module_alloc(size)
> #endif

Please can we *not* do this? I think this is abstracting at the wrong level (as
I mentioned on the prior execmem proposals).

Different executable allocations can have different requirements. For example,
on arm64 modules need to be within 2G of the kernel image, but the kprobes XOL
areas can be anywhere in the kernel VA space.

Forcing those behind the same interface makes things *harder* for architectures
and/or makes the common code more complicated (if that ends up having to track
all those different requirements). From my PoV it'd be much better to have
separate kprobes_alloc_*() functions for kprobes which an architecture can then
choose to implement using a common library if it wants to.
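
For comparison, kprobes already has an arch-overridable hook of roughly the
right shape; a sketch from memory (so treat the details as illustrative rather
than a verbatim copy of the current code):

/* kernel/kprobes.c: generic fallback, currently tied to module_alloc() */
void __weak *alloc_insn_page(void)
{
	return module_alloc(PAGE_SIZE);
}

/*
 * arch/arm64/kernel/probes/kprobes.c: arm64's override. XOL slots have no
 * placement constraint, so anywhere in the vmalloc space is fine, and we
 * don't need module_alloc() at all.
 */
void *alloc_insn_page(void)
{
	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
				    GFP_KERNEL, PAGE_KERNEL_ROX,
				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
				    __builtin_return_address(0));
}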

I took a look at doing that using the core ifdeffery fixups from Jarkko's v6,
and it looks pretty clean to me (and works in testing on arm64):

  
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules

Could we please start with that approach, with kprobe-specific alloc/free code
provided by the architecture?

Thanks,
Mark.



Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-20 Thread Mark Rutland
On Thu, Mar 14, 2024 at 04:07:33PM +0100, Björn Töpel wrote:
> After reading Mark's reply, and discussing with OpenJDK folks (who does
> the most crazy text patching on all platforms), having to patch multiple
> instructions (where the address materialization is split over multiple
> instructions) is a no-go. It's just a too big can of worms. So, if we
> can only patch one insn, it's CALL_OPS.
> 
> A couple of options (in addition to Andy's), and all require a
> per-function landing address ala CALL_OPS) tweaking what Mark is doing
> on Arm (given the poor branch range).
> 
> ..and maybe we'll get RISC-V rainbows/unicorns in the future getting
> better reach (full 64b! ;-)).
> 
> A) Use auipc/jalr, only patch jalr to take us to a common
>dispatcher/trampoline
>   
>  |  # probably on a data cache-line != func .text 
> to avoid ping-pong
>  | ...
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   auipc
>  |   nop <=> jalr # Text patch point -> common_dispatch
>  |   ACTUAL_FUNC
>  | 
>  | common_dispatch:
>  |   load  based on ra
>  |   jalr
>  |   ...
> 
> The auipc is never touched, and will be overhead. Also, we need a mv to
> store ra in a scratch register as well -- like Arm. We'll have two insn
> per-caller overhead for a disabled caller.

Is the AUIPC a significant overhead? IIUC that's similar to Arm's ADRP, and I'd
have expected that to be pretty cheap.

IIUC your JALR can choose which destination register to store the return
address in, and if so, you could leave the original ra untouched (and recover
that in the common trampoline). Have I misunderstood that?

Maybe that doesn't play nicely with something else?

> B) Use jal, which can only take us +/-1M, and requires multiple
>dispatchers (and tracking which one to use, and properly distribute
>them. Ick.)
> 
>  |  # probably on a data cache-line != func .text 
> to avoid ping-pong
>  | ...
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   nop <=> jal # Text patch point -> within_1M_to_func_dispatch
>  |   ACTUAL_FUNC
>  | 
>  | within_1M_to_func_dispatch:
>  |   load  based on ra
>  |   jalr
> 
> C) Use jal, which can only take us +/-1M, and use a per-function
>trampoline requires multiple dispatchers (and tracking which one to
>use). Blows up text size A LOT.
> 
>  |  # somewhere, but probably on a different 
> cacheline than the .text to avoid ping-ongs
>  | ...
>  | per_func_dispatch
>  |   load  based on ra
>  |   jalr
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   nop <=> jal # Text patch point -> per_func_dispatch
>  |   ACTUAL_FUNC

Beware that with option (C) you'll need to handle that in your unwinder for
RELIABLE_STACKTRACE. If you don't have a symbol for per_func_dispatch (or
func_trace_target_data_8B), PC values within per_func_dispatch would be
symbolized as the prior function/data.

> It's a bit sad that we'll always have to have a dispatcher/trampoline,
> but it's still better than stop_machine(). (And we'll need a fencei IPI
> as well, but only one. ;-))
> 
> Today, I'm leaning towards A (which is what Mark suggested, and also
> Robbin).. Any other options?

Assuming my understanding of JALR above is correct, I reckon A is the nicest
option out of A/B/C.

Mark.



Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-12 Thread Mark Rutland
Hi Bjorn

(apologies, my corporate mail server has butchered your name here).

There's a big info dump below; I realise this sounds like a sales pitch for
CALL_OPS, but my intent is more to say "here are some dragons you may not have
spotted".

On Thu, Mar 07, 2024 at 08:27:40PM +0100, Bj"orn T"opel wrote:
> Puranjay!
> 
> Puranjay Mohan  writes:
> 
> > This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
> > This allows each ftrace callsite to provide an ftrace_ops to the common
> > ftrace trampoline, allowing each callsite to invoke distinct tracer
> > functions without the need to fall back to list processing or to
> > allocate custom trampolines for each callsite. This significantly speeds
> > up cases where multiple distinct trace functions are used and callsites
> > are mostly traced by a single tracer.
> >
> > The idea and most of the implementation is taken from the ARM64's
> > implementation of the same feature. The idea is to place a pointer to
> > the ftrace_ops as a literal at a fixed offset from the function entry
> > point, which can be recovered by the common ftrace trampoline.
> 
> Not really a review, but some more background; Another rationale (on-top
> of the improved per-call performance!) for CALL_OPS was to use it to
> build ftrace direct call support (which BPF uses a lot!). Mark, please
> correct me if I'm lying here!

Yep; it gives us the ability to do a number of per-callsite things, including
direct calls.

> On Arm64, CALL_OPS makes it possible to implement direct calls, while
> only patching one BL instruction -- nice!

The key thing here isn't that we patch a single instruction (since we have to
patch the ops pointer too!); it's that we can safely patch either of the ops
pointer or BL/NOP at any time while threads are concurrently executing.

If you have a multi-instruction sequence, then threads can be preempted
mid-sequence, and it's very painful/complex to handle all of the races that
entails.

For example, if your callsites use a sequence:

AUIPC , 
JALR , ()

Using stop_machine() won't allow you to patch that safely as some threads
could be stuck mid-sequence, e.g.

AUIPC , 
[ preempted here ]
JALR , ()

... and you can't update the JALR to use a new funcptr immediate until those
have completed the sequence.

There are ways around that, but they're complicated and/or expensive, e.g.

* Use a sequence of multiple patches, starting with replacing the JALR with an
  exception-generating instruction with a fixup handler, which is sort-of what
  x86 does with UD2. This may require multiple passes with
  synchronize_rcu_tasks() to make sure all threads have seen the latest
  instructions, and that cannot be done under stop_machine(), so if you need
  stop_machine() for CMODx reasons, you may need to use that several times with
  intervening calls to synchronize_rcu_tasks().

* Have the patching logic manually go over each thread and fix up the pt_regs
  for the interrupted thread. This is pretty horrid since you could have nested
  exceptions and a task could have several pt_regs which might require
  updating.

The CALL_OPS approach is a bit easier to deal with as we can patch the
per-callsite pointer atomically, then we can (possibly) enable/disable the
callsite's branch, then wait for threads to drain once. 
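
For the unfamiliar, the rough shape of what gets patched (illustrative only:
CALLSITE_OPS_OFFSET and these helper names are made up, and the real arm64
code is not structured like this):

/*
 * With CALL_OPS, each patchable function is preceded by a pointer-sized
 * literal naming its ftrace_ops. The common trampoline recovers the ops
 * from the callsite address alone, and enabling/disabling a callsite is a
 * pair of independent single-copy-atomic updates: the literal, then the
 * BL <-> NOP instruction.
 */
static struct ftrace_ops **callsite_ops_slot(unsigned long callsite_ip)
{
	/* the literal sits at a fixed offset before the patch site */
	return (struct ftrace_ops **)(callsite_ip - CALLSITE_OPS_OFFSET);
}

static void callsite_set_ops(unsigned long callsite_ip, struct ftrace_ops *ops)
{
	/* atomically repoint the callsite at its new ops ... */
	WRITE_ONCE(*callsite_ops_slot(callsite_ip), ops);
	/*
	 * ... and separately toggle the BL/NOP, then wait once for
	 * concurrently-executing threads to drain.
	 */
}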

As a heads-up, there are some latent/generic issues with DYNAMIC_FTRACE
generally in this area (CALL_OPs happens to side-step those, but trampoline
usage is currently affected):

  https://lore.kernel.org/lkml/Zenx_Q0UiwMbSAdP@FVFF77S0Q05N/

... I'm looking into fixing that at the moment, and it looks like that's likely
to require some per-architecture changes.

> On RISC-V we cannot use the same ideas as Arm64 straight off,
> because the range of jal (compare to BL) is simply too short (+/-1M).
> So, on RISC-V we need to use a full auipc/jal pair (the text patching
> story is another chapter, but let's leave that aside for now). Since we
> have to patch multiple instructions, the cmodx situation doesn't really
> improve with CALL_OPS.

The branch range thing is annoying, but I think this boils down to the same
problem as arm64 has with needing a "MOV , LR" instruction that we have to
patch in once at boot time. You could do the same and patch in the AUIPC once,
e.g. have

|   NOP
|   NOP 
| func:
|   AUIPC , 
|   JALR , () // patched with NOP

... which'd look very similar to arm64's sequence:

|   NOP
|   NOP
| func:
|   MOV X9, LR
|   BL ftrace_caller // patched with NOP

... which I think means it *might* be better from a cmodx perspective?

> Let's say that we continue building on your patch and implement direct
> calls on CALL_OPS for RISC-V as well.
> 
> From Florent's commit message for direct calls:
> 
>   |There are a few cases to distinguish:
>   |- If a direct call ops is the only one tracing a function:
>   |  - If the direct called trampoline is 

Re: Question about the ipi_raise filter usage and output

2024-02-05 Thread Mark Rutland
[adding Valentin]

On Mon, Feb 05, 2024 at 08:06:09AM -0500, Steven Rostedt wrote:
> On Mon, 5 Feb 2024 10:28:57 +
> Mark Rutland  wrote:
> 
> > > I try to write below:
> > > echo 'target_cpus == 11 && reason == "Function call interrupts"' >
> > > events/ipi/ipi_raise/filter  
> > 
> > The '=' checks if the target_cpus bitmap *only* contains CPU 11. If the 
> > cpumask
> > contains other CPUs, the filter will skip the call.
> > 
> > I believe you can use '&' to check whether a cpumask contains a CPU, e.g.
> > 
> > 'target_cpus & 11'
> 
> 11 == 0xb = b1011
> 
> So the above would only be true for CPUs 0,1 and 3 ;-)

Sorry, I misunderstood the scalar logic and thought that we treated:

'$mask $OP $scalar', e.g. 'target_cpus & 11'

... as a special case meaning a cpumask with that scalar bit set, i.e.

'$mask $OP CPUS{$scalar}', e.g. 'target_cpus & CPUS{11}'

... but evidently I was wrong.

> I think you meant: 'target_cpus & 0x800' 
> 
> I tried "1 << 11' but it appears to not allow shifts. I wonder if we should 
> add that?

Hmm... shouldn't we make 'CPUS{11}' work for that?

From a quick test (below), that doesn't seem to work, though I think it
probably should?

  # cat /sys/devices/system/cpu/online 
  0-3
  # echo 1 > /sys/kernel/tracing/events/ipi/ipi_raise/enable 
  # echo 'target_cpus & CPUS{3}' > /sys/kernel/tracing/events/ipi/ipi_raise/filter 
  # grep IPI /proc/interrupts 
  IPI0:        54         41         32         42       Rescheduling interrupts
  IPI1:      1202       1035        893        909       Function call interrupts
  IPI2:         0          0          0          0       CPU stop interrupts
  IPI3:         0          0          0          0       CPU stop (for crash dump) interrupts
  IPI4:         0          0          0          0       Timer broadcast interrupts
  IPI5:         0          0          0          0       IRQ work interrupts
  # sleep 1
  # grep IPI /proc/interrupts 
  IPI0:        54         42         32         42       Rescheduling interrupts
  IPI1:      1209       1037        912        927       Function call interrupts
  IPI2:         0          0          0          0       CPU stop interrupts
  IPI3:         0          0          0          0       CPU stop (for crash dump) interrupts
  IPI4:         0          0          0          0       Timer broadcast interrupts
  IPI5:         0          0          0          0       IRQ work interrupts
  # cat /sys/devices/system/cpu/online 
  0-3
  # cat /sys/kernel/tracing/trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 0/0   #P:4
  #
  #                                _-----=> irqs-off/BH-disabled
  #                               / _----=> need-resched
  #                              | / _---=> hardirq/softirq
  #                              || / _--=> preempt-depth
  #                              ||| / _-=> migrate-disable
  #                              |||| /     delay
  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
  #              | |         |   |||||     |         |
  #

More confusingly, if I use '8', I get events with cpumasks which shouldn't
match AFAICT:

  # echo 'target_cpus & 8' > /sys/kernel/tracing/events/ipi/ipi_raise/filter 
  # echo '' > /sys/kernel/tracing/trace
  # grep IPI /proc/interrupts 
  IPI0:        55         46         34         43       Rescheduling interrupts
  IPI1:      1358       1155        994       1021       Function call interrupts
  IPI2:         0          0          0          0       CPU stop interrupts
  IPI3:         0          0          0          0       CPU stop (for crash dump) interrupts
  IPI4:         0          0          0          0       Timer broadcast interrupts
  IPI5:         0          0          0          0       IRQ work interrupts
  # sleep 1
  # grep IPI /proc/interrupts 
  IPI0:        56         46         34         43       Rescheduling interrupts
  IPI1:      1366       1158       1005       1038       Function call interrupts
  IPI2:         0          0          0          0       CPU stop interrupts
  IPI3:         0          0          0          0       CPU stop (for crash dump) interrupts
  IPI4:         0          0          0          0       Timer broadcast interrupts
  IPI5:         0          0          0          0       IRQ work interrupts
  # cat /sys/kernel/tracing/trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 91/91   #P:4
  #
  #                                _-----=> irqs-off/BH-disabled
  #                               / _----=> need-resched
  #                              | / _---=> hardirq/softirq
  #                              || / _--=> preempt-depth
  #                              ||| / _-=> migrate-disab

Re: Question about the ipi_raise filter usage and output

2024-02-05 Thread Mark Rutland
On Mon, Feb 05, 2024 at 05:57:29PM +0800, richard clark wrote:
> Hi guys,
> 
> With the ipi_raise event enabled and filtered with:
> echo 'reason == "Function call interrupts"' > filter, then the 'cat
> trace' output below messages:
> ...
> insmod-3355[010] 1.. 24479.230381: ipi_raise:
> target_mask=,0bff (Function call interrupts)
> ...
> The above output is triggered by my kernel module where it will smp
> cross call a remote function from cpu#10 to cpu#11, for the
> 'target_mask' value, what does the ',0bff' mean?

That's a cpumask bitmap: 0xbff is 0b1011_1111_1111, which is:

 ,- CPU 10
 |
1011_1111_1111
| '__________'
|      |
|      `- CPUs 9 to 0
|
`- CPU 11

Note that bitmap has CPUs 0-9 and CPU 11 set, but CPU 10 is not set.

I suspect your kernel module has generated the bitmap incorrectly; it looks
like you have a mask for CPUs 0-11 minus a mask for CPU 10?

For CPUs 10 and 11, that should be 0xc00 / 0b1100_0000_0000.
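
If it helps, building the mask explicitly should give you the expected 0xc00;
a sketch (I'm guessing at what your module is doing around the cross call):

static void remote_fn(void *info)
{
	/* whatever your module runs remotely */
}

static void kick_cpus_10_and_11(void)
{
	cpumask_var_t mask;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return;

	cpumask_set_cpu(10, mask);
	cpumask_set_cpu(11, mask);	/* mask is now 0xc00 */

	/* smp_call_function_many() skips the calling CPU if it's in the mask */
	preempt_disable();
	smp_call_function_many(mask, remote_fn, NULL, true);
	preempt_enable();

	free_cpumask_var(mask);
}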

>  ~~
> 
> Another question is for the filter, I'd like to catch the IPI only
> happening on cpu#11 *AND* a remote function call, so how to write the
> 'target_cpus' in the filter expression?
> 
> I try to write below:
> echo 'target_cpus == 11 && reason == "Function call interrupts"' >
> events/ipi/ipi_raise/filter

The '=' checks if the target_cpus bitmap *only* contains CPU 11. If the cpumask
contains other CPUs, the filter will skip the call.

I believe you can use '&' to check whether a cpumask contains a CPU, e.g.

'target_cpus & 11'

Thanks,
Mark.



Re: [PATCH v5 11/34] function_graph: Have the instances use their own ftrace_ops for filtering

2024-01-11 Thread Mark Rutland
On Thu, Jan 11, 2024 at 11:15:33AM +0900, Masami Hiramatsu wrote:
> Hi Mark,
> 
> Thanks for the investigation.

Hi!

As a heads-up, I already figured out the problem and sent a fixup at:

  https://lore.kernel.org/lkml/ZZwEz8HsTa2IZE3L@FVFF77S0Q05N/

... and a more refined fix at:

  https://lore.kernel.org/lkml/ZZwOubTSbB_FucVz@FVFF77S0Q05N/

The gist was that before this patch, arm64 used the FP as the 'retp' value, but
this patch changed that to the address of fregs->lr. This meant that the fgraph
ret_stack contained all of the correct return addresses, but when the unwinder
called ftrace_graph_ret_addr() with FP as the 'retp' value, it failed to match
any entry in the ret_stack.

Since the fregs only exist transiently at function entry and exit, I'd prefer
that we still use the FP as the 'retp' value, which is what I proposed in the
fixups above.

Thanks,
Mark.

> On Mon, 8 Jan 2024 12:25:55 +
> Mark Rutland  wrote:
> 
> > Hi,
> > 
> > There's a bit more of an info-dump below; I'll go try to dump the fgraph 
> > shadow
> > stack so that we can analyse this in more detail.
> > 
> > On Mon, Jan 08, 2024 at 10:14:36AM +0900, Masami Hiramatsu wrote:
> > > On Fri, 5 Jan 2024 17:09:10 +
> > > Mark Rutland  wrote:
> > > 
> > > > On Mon, Dec 18, 2023 at 10:13:46PM +0900, Masami Hiramatsu (Google) 
> > > > wrote:
> > > > > From: Steven Rostedt (VMware) 
> > > > > 
> > > > > Allow for instances to have their own ftrace_ops part of the 
> > > > > fgraph_ops
> > > > > that makes the function_graph tracer filter on the set_ftrace_filter 
> > > > > file
> > > > > of the instance and not the top instance.
> > > > > 
> > > > > This also change how the function_graph handles multiple instances on 
> > > > > the
> > > > > shadow stack. Previously we use ARRAY type entries to record which one
> > > > > is enabled, and this makes it a bitmap of the fgraph_array's indexes.
> > > > > Previous function_graph_enter() expects calling back from
> > > > > prepare_ftrace_return() function which is called back only once if it 
> > > > > is
> > > > > enabled. But this introduces different ftrace_ops for each fgraph
> > > > > instance and those are called from ftrace_graph_func() one by one. 
> > > > > Thus
> > > > > we can not loop on the fgraph_array(), and need to reuse the ret_stack
> > > > > pushed by the previous instance. Finding the ret_stack is easy because
> > > > > we can check the ret_stack->func. But that is not enough for the self-
> > > > > recursive tail-call case. Thus fgraph uses the bitmap entry to find it
> > > > > is already set (this means that entry is for previous tail call).
> > > > > 
> > > > > Signed-off-by: Steven Rostedt (VMware) 
> > > > > Signed-off-by: Masami Hiramatsu (Google) 
> > > > 
> > > > As a heads-up, while testing the topic/fprobe-on-fgraph branch on 
> > > > arm64, I get
> > > > a warning which bisets down to this commit:
> > > 
> > > Hmm, so does this happen when enabling function graph tracer?
> > 
> > Yes; I see it during the function_graph boot-time self-test if I also enable
> > CONFIG_IRQSOFF_TRACER=y. I can also trigger it regardless of
> > CONFIG_IRQSOFF_TRACER if I cat /proc/self/stack with the function_graph 
> > tracer
> > enabled (note that I hacked the unwinder to continue after failing to 
> > recover a
> > return address):
> > 
> > | # mount -t tracefs none /sys/kernel/tracing/
> > | # echo function_graph > /sys/kernel/tracing/current_tracer
> > | # cat /proc/self/stack
> > | [   37.469980] [ cut here ]
> > | [   37.471503] WARNING: CPU: 2 PID: 174 at 
> > arch/arm64/kernel/stacktrace.c:84 arch_stack_walk+0x2d8/0x338
> > | [   37.474381] Modules linked in:
> > | [   37.475501] CPU: 2 PID: 174 Comm: cat Not tainted 
> > 6.7.0-rc2-00026-gea1e68a341c2-dirty #15
> > | [   37.478133] Hardware name: linux,dummy-virt (DT)
> > | [   37.479670] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS 
> > BTYPE=--)
> > | [   37.481923] pc : arch_stack_walk+0x2d8/0x338
> > | [   37.483373] lr : arch_stack_walk+0x1bc/0x338
> > | [   37.484818] sp : 8000835f3a90
> > | [   37.485974] x29: 8000835f3a90 x28: 8000835f3b80 x27: 
> > 8000835f3b38
> > | [   37.488405] x26: 04341e00 x25: 8000835f4000 

Re: [PATCH] ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default

2024-01-10 Thread Mark Rutland
On Wed, Jan 10, 2024 at 09:13:06AM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> The commit 60c8971899f3 ("ftrace: Make DIRECT_CALLS work WITH_ARGS
> and !WITH_REGS") changed DIRECT_CALLS to use SAVE_ARGS when there
> are multiple ftrace_ops at the same function, but since the x86 only
> support to jump to direct_call from ftrace_regs_caller, when we set
> the function tracer on the same target function on x86, ftrace-direct
> does not work as below (this actually works on arm64.)
> 
> At first, insmod ftrace-direct.ko to put a direct_call on
> 'wake_up_process()'.
> 
>  # insmod kernel/samples/ftrace/ftrace-direct.ko
>  # less trace
> ...
>   -0   [006] ..s1.   564.686958: my_direct_func: waking up 
> rcu_preempt-17
>   -0   [007] ..s1.   564.687836: my_direct_func: waking up 
> kcompactd0-63
>   -0   [006] ..s1.   564.690926: my_direct_func: waking up 
> rcu_preempt-17
>   -0   [006] ..s1.   564.696872: my_direct_func: waking up 
> rcu_preempt-17
>   -0   [007] ..s1.   565.191982: my_direct_func: waking up 
> kcompactd0-63
> 
> Setup a function filter to the 'wake_up_process' too, and enable it.
> 
>  # cd /sys/kernel/tracing/
>  # echo wake_up_process > set_ftrace_filter
>  # echo function > current_tracer
>  # less trace
> ...
>   -0   [006] ..s3.   686.180972: wake_up_process 
> <-call_timer_fn
>   -0   [006] ..s3.   686.186919: wake_up_process 
> <-call_timer_fn
>   -0   [002] ..s3.   686.264049: wake_up_process 
> <-call_timer_fn
>   -0   [002] d.h6.   686.515216: wake_up_process <-kick_pool
>   -0   [002] d.h6.   686.691386: wake_up_process <-kick_pool
> 
> Then, only function tracer is shown on x86.
> But if you enable 'kprobe on ftrace' event (which uses SAVE_REGS flag)
> on the same function, it is shown again.
> 
>  # echo 'p wake_up_process' >> dynamic_events
>  # echo 1 > events/kprobes/p_wake_up_process_0/enable
>  # echo > trace
>  # less trace
> ...
>   -0   [006] ..s2.  2710.345919: p_wake_up_process_0: 
> (wake_up_process+0x4/0x20)
>   -0   [006] ..s3.  2710.345923: wake_up_process 
> <-call_timer_fn
>   -0   [006] ..s1.  2710.345928: my_direct_func: waking up 
> rcu_preempt-17
>   -0   [006] ..s2.  2710.349931: p_wake_up_process_0: 
> (wake_up_process+0x4/0x20)
>   -0   [006] ..s3.  2710.349934: wake_up_process 
> <-call_timer_fn
>   -0   [006] ..s1.  2710.349937: my_direct_func: waking up 
> rcu_preempt-17
> 
> To fix this issue, use SAVE_REGS flag for multiple ftrace_ops flag of
> direct_call by default.
> 
> Fixes: 60c8971899f3 ("ftrace: Make DIRECT_CALLS work WITH_ARGS and 
> !WITH_REGS")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Masami Hiramatsu (Google) 

Sorry about this; I hadn't realised that x86 only supported direct calls when
SAVE_REGS was requested.

The patch looks good to me. I applied it atop v6.7 and double-checked that this
still works on arm64 as per your examples above, and everything looks good:

# mount -t tracefs none /sys/kernel/tracing/
# insmod ftrace-direct.ko 
# echo wake_up_process > /sys/kernel/tracing/set_ftrace_filter 
# echo function > /sys/kernel/tracing/current_tracer 
# less /sys/kernel/tracing/trace
... 
  -0   [007] ..s3.   172.932840: wake_up_process 
<-process_timeout
  -0   [007] ..s1.   172.932842: my_direct_func: waking up 
kcompactd0-62
  -0   [007] ..s3.   173.444836: wake_up_process 
<-process_timeout
  -0   [007] ..s1.   173.444838: my_direct_func: waking up 
kcompactd0-62
  -0   [001] d.h5.   173.471116: wake_up_process <-kick_pool
  -0   [001] d.h3.   173.471118: my_direct_func: waking up 
kworker/1:1-58

Reviewed-by: Mark Rutland 
Tested-by: Mark Rutland  [arm64]

Thanks,
Mark.

> ---
>  kernel/trace/ftrace.c |   10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index b01ae7d36021..c060d5b47910 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5325,7 +5325,17 @@ static LIST_HEAD(ftrace_direct_funcs);
>  
>  static int register_ftrace_function_nolock(struct ftrace_ops *ops);
>  
> +/*
> + * If there are multiple ftrace_ops, use SAVE_REGS by default, so that direct
> + * call will be jumped from ftrace_regs_caller. Only if the architecture does
> + * not support ftrace_regs_caller but direct_call, use SAVE_ARGS so that it
> + * jumps from ftrace_caller for multiple ftrace_ops.
> + */
> +#ifndef HAVE_DYNAMIC_FTRACE_WITH_REGS
>  #define MULTI_FLAGS (FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_ARGS)
> +#else
> +#define MULTI_FLAGS (FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_REGS)
> +#endif
>  
>  static int check_direct_multi(struct ftrace_ops *ops)
>  {
> 



Re: [PATCH v5 11/34] function_graph: Have the instances use their own ftrace_ops for filtering

2024-01-08 Thread Mark Rutland
On Mon, Jan 08, 2024 at 02:21:03PM +, Mark Rutland wrote:
> On Mon, Jan 08, 2024 at 12:25:55PM +0000, Mark Rutland wrote:
> > We also have HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, but since the return address 
> > is
> > not on the stack at the point function-entry is intercepted we use the FP as
> > the retp value -- in the absence of tail calls this will be different 
> > between a
> > caller and callee.
> 
> Ah; I just spotted that this patch changed that in ftrace_graph_func(), which
> is the source of the bug. 
> 
> As of this patch, we use the address of fregs->lr as the retp value, but the
> unwinder still uses the FP value, and so when unwind_recover_return_address()
> calls ftrace_graph_ret_addr(), the retp value won't match the expected entry 
> on
> the fgraph ret_stack, resulting in failing to find the expected entry.
> 
> Since the ftrace_regs only exist transiently during function entry/exit, it's
> possible for a stackframe to reuse that same address on the stack, which would
> result in finding a different entry by mistake.
> 
> The diff below restores the existing behaviour and fixes the issue for me.
> Could you please fold that into this patch?
> 
> On a separate note, looking at how this patch changed arm64's
> ftrace_graph_func(), do we need similar changes to arm64's
> prepare_ftrace_return() for the old-style mcount based ftrace?
> 
> Mark.
> 
> >8
> diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
> index 205937e04ece..329092ce06ba 100644
> --- a/arch/arm64/kernel/ftrace.c
> +++ b/arch/arm64/kernel/ftrace.c
> @@ -495,7 +495,7 @@ void ftrace_graph_func(unsigned long ip, unsigned long 
> parent_ip,
> if (bit < 0)
> return;
>  
> -   if (!function_graph_enter_ops(*parent, ip, fregs->fp, parent, gops))
> +   if (!function_graph_enter_ops(*parent, ip, fregs->fp, (void 
> *)fregs->fp, gops))
> *parent = (unsigned long)&return_to_handler;
>  
> ftrace_test_recursion_unlock(bit);

Thinking some more, this line gets excessively long when we pass the fregs too,
so it's probably worth adding a local variable for fp, i.e. the diff below.

Mark.

>8
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index 205937e04ece..d4e142ef4686 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -481,8 +481,9 @@ void prepare_ftrace_return(unsigned long self_addr, 
unsigned long *parent,
 void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
   struct ftrace_ops *op, struct ftrace_regs *fregs)
 {
-   unsigned long *parent = &fregs->lr;
struct fgraph_ops *gops = container_of(op, struct fgraph_ops, ops);
+   unsigned long *parent = &fregs->lr;
+   unsigned long fp = fregs->fp;
int bit;
 
if (unlikely(ftrace_graph_is_dead()))
@@ -495,7 +496,7 @@ void ftrace_graph_func(unsigned long ip, unsigned long 
parent_ip,
if (bit < 0)
return;
 
-   if (!function_graph_enter_ops(*parent, ip, fregs->fp, parent, gops))
+   if (!function_graph_enter_ops(*parent, ip, fp, (void *)fp, gops))
*parent = (unsigned long)&return_to_handler;
 
ftrace_test_recursion_unlock(bit);



Re: [PATCH v5 11/34] function_graph: Have the instances use their own ftrace_ops for filtering

2024-01-08 Thread Mark Rutland
On Mon, Jan 08, 2024 at 12:25:55PM +, Mark Rutland wrote:
> We also have HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, but since the return address is
> not on the stack at the point function-entry is intercepted we use the FP as
> the retp value -- in the absence of tail calls this will be different between 
> a
> caller and callee.

Ah; I just spotted that this patch changed that in ftrace_graph_func(), which
is the source of the bug. 

As of this patch, we use the address of fregs->lr as the retp value, but the
unwinder still uses the FP value, and so when unwind_recover_return_address()
calls ftrace_graph_ret_addr(), the retp value won't match the expected entry on
the fgraph ret_stack, resulting in failing to find the expected entry.

Since the ftrace_regs only exist transiently during function entry/exit, it's
possible for a stackframe to reuse that same address on the stack, which would
result in finding a different entry by mistake.

The diff below restores the existing behaviour and fixes the issue for me.
Could you please fold that into this patch?

On a separate note, looking at how this patch changed arm64's
ftrace_graph_func(), do we need similar changes to arm64's
prepare_ftrace_return() for the old-style mcount based ftrace?

Mark.

>8
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index 205937e04ece..329092ce06ba 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -495,7 +495,7 @@ void ftrace_graph_func(unsigned long ip, unsigned long 
parent_ip,
if (bit < 0)
return;
 
-   if (!function_graph_enter_ops(*parent, ip, fregs->fp, parent, gops))
+   if (!function_graph_enter_ops(*parent, ip, fregs->fp, (void 
*)fregs->fp, gops))
*parent = (unsigned long)&return_to_handler;
 
ftrace_test_recursion_unlock(bit);



Re: [PATCH v5 11/34] function_graph: Have the instances use their own ftrace_ops for filtering

2024-01-08 Thread Mark Rutland
Hi,

There's a bit more of an info-dump below; I'll go try to dump the fgraph shadow
stack so that we can analyse this in more detail.

On Mon, Jan 08, 2024 at 10:14:36AM +0900, Masami Hiramatsu wrote:
> On Fri, 5 Jan 2024 17:09:10 +
> Mark Rutland  wrote:
> 
> > On Mon, Dec 18, 2023 at 10:13:46PM +0900, Masami Hiramatsu (Google) wrote:
> > > From: Steven Rostedt (VMware) 
> > > 
> > > Allow for instances to have their own ftrace_ops part of the fgraph_ops
> > > that makes the function_graph tracer filter on the set_ftrace_filter file
> > > of the instance and not the top instance.
> > > 
> > > This also change how the function_graph handles multiple instances on the
> > > shadow stack. Previously we use ARRAY type entries to record which one
> > > is enabled, and this makes it a bitmap of the fgraph_array's indexes.
> > > Previous function_graph_enter() expects calling back from
> > > prepare_ftrace_return() function which is called back only once if it is
> > > enabled. But this introduces different ftrace_ops for each fgraph
> > > instance and those are called from ftrace_graph_func() one by one. Thus
> > > we can not loop on the fgraph_array(), and need to reuse the ret_stack
> > > pushed by the previous instance. Finding the ret_stack is easy because
> > > we can check the ret_stack->func. But that is not enough for the self-
> > > recursive tail-call case. Thus fgraph uses the bitmap entry to find it
> > > is already set (this means that entry is for previous tail call).
> > > 
> > > Signed-off-by: Steven Rostedt (VMware) 
> > > Signed-off-by: Masami Hiramatsu (Google) 
> > 
> > As a heads-up, while testing the topic/fprobe-on-fgraph branch on arm64, I 
> > get
> > a warning which bisets down to this commit:
> 
> Hmm, so does this happen when enabling function graph tracer?

Yes; I see it during the function_graph boot-time self-test if I also enable
CONFIG_IRQSOFF_TRACER=y. I can also trigger it regardless of
CONFIG_IRQSOFF_TRACER if I cat /proc/self/stack with the function_graph tracer
enabled (note that I hacked the unwinder to continue after failing to recover a
return address):

| # mount -t tracefs none /sys/kernel/tracing/
| # echo function_graph > /sys/kernel/tracing/current_tracer
| # cat /proc/self/stack
| [   37.469980] [ cut here ]
| [   37.471503] WARNING: CPU: 2 PID: 174 at arch/arm64/kernel/stacktrace.c:84 
arch_stack_walk+0x2d8/0x338
| [   37.474381] Modules linked in:
| [   37.475501] CPU: 2 PID: 174 Comm: cat Not tainted 
6.7.0-rc2-00026-gea1e68a341c2-dirty #15
| [   37.478133] Hardware name: linux,dummy-virt (DT)
| [   37.479670] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| [   37.481923] pc : arch_stack_walk+0x2d8/0x338
| [   37.483373] lr : arch_stack_walk+0x1bc/0x338
| [   37.484818] sp : 8000835f3a90
| [   37.485974] x29: 8000835f3a90 x28: 8000835f3b80 x27: 
8000835f3b38
| [   37.488405] x26: 04341e00 x25: 8000835f4000 x24: 
80008002df18
| [   37.490842] x23: 80008002df18 x22: 8000835f3b60 x21: 
80008015d240
| [   37.493269] x20: 8000835f3b50 x19: 8000835f3b40 x18: 

| [   37.495704] x17:  x16:  x15: 

| [   37.498144] x14:  x13:  x12: 

| [   37.500579] x11: 800082b4d920 x10: 8000835f3a70 x9 : 
8000800e55a0
| [   37.503021] x8 : 80008002df18 x7 : 04341e00 x6 : 

| [   37.505452] x5 :  x4 : 8000835f3e48 x3 : 
8000835f3b80
| [   37.507888] x2 : 80008002df18 x1 : 07f7b000 x0 : 
80008002df18
| [   37.510319] Call trace:
| [   37.511202]  arch_stack_walk+0x2d8/0x338
| [   37.512541]  stack_trace_save_tsk+0x90/0x110
| [   37.514012]  return_to_handler+0x0/0x48
| [   37.515336]  return_to_handler+0x0/0x48
| [   37.516657]  return_to_handler+0x0/0x48
| [   37.517985]  return_to_handler+0x0/0x48
| [   37.519305]  return_to_handler+0x0/0x48
| [   37.520623]  return_to_handler+0x0/0x48
| [   37.521957]  return_to_handler+0x0/0x48
| [   37.523272]  return_to_handler+0x0/0x48
| [   37.524595]  return_to_handler+0x0/0x48
| [   37.525931]  return_to_handler+0x0/0x48
| [   37.527254]  return_to_handler+0x0/0x48
| [   37.528564]  el0t_64_sync_handler+0x120/0x130
| [   37.530046]  el0t_64_sync+0x190/0x198
| [   37.531310] ---[ end trace  ]---
| [<0>] ftrace_stub_graph+0x8/0x8
| [<0>] ftrace_stub_graph+0x8/0x8
| [<0>] ftrace_stub_graph+0x8/0x8
| [<0>] ftrace_stub_graph+0x8/0x8
| [<0>] ftrace_stub_graph+0x8/0x8
| [<0>] ftrace_stub_graph+0x8/0x8
| [<0>] ftrace_stub_graph+0x8/0x8
| [<0>] ftrace_stub_graph+0x8

Re: [PATCH v5 22/34] tracing: Rename ftrace_regs_return_value to ftrace_regs_get_return_value

2024-01-05 Thread Mark Rutland
On Mon, Dec 18, 2023 at 10:15:59PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> Rename ftrace_regs_return_value to ftrace_regs_get_return_value as same as
> other ftrace_regs_get/set_* APIs.
> 
> Signed-off-by: Masami Hiramatsu (Google) 

Acked-by: Mark Rutland 

Since this is a trivial cleanup, it might make sense to move this to the start
of the series, so that it can be queued even if the later parts need more work.

Mark.

> ---
>  Changes in v3:
>   - Newly added.
> ---
>  arch/loongarch/include/asm/ftrace.h |2 +-
>  arch/powerpc/include/asm/ftrace.h   |2 +-
>  arch/s390/include/asm/ftrace.h  |2 +-
>  arch/x86/include/asm/ftrace.h   |2 +-
>  include/linux/ftrace.h  |2 +-
>  kernel/trace/fgraph.c   |2 +-
>  6 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/loongarch/include/asm/ftrace.h 
> b/arch/loongarch/include/asm/ftrace.h
> index a11996eb5892..a9c3d0f2f941 100644
> --- a/arch/loongarch/include/asm/ftrace.h
> +++ b/arch/loongarch/include/asm/ftrace.h
> @@ -70,7 +70,7 @@ ftrace_regs_set_instruction_pointer(struct ftrace_regs 
> *fregs, unsigned long ip)
>   regs_get_kernel_argument(&(fregs)->regs, n)
>  #define ftrace_regs_get_stack_pointer(fregs) \
>   kernel_stack_pointer(&(fregs)->regs)
> -#define ftrace_regs_return_value(fregs) \
> +#define ftrace_regs_get_return_value(fregs) \
>   regs_return_value(&(fregs)->regs)
>  #define ftrace_regs_set_return_value(fregs, ret) \
>   regs_set_return_value(&(fregs)->regs, ret)
> diff --git a/arch/powerpc/include/asm/ftrace.h 
> b/arch/powerpc/include/asm/ftrace.h
> index 9e5a39b6a311..7e138e0e3baf 100644
> --- a/arch/powerpc/include/asm/ftrace.h
> +++ b/arch/powerpc/include/asm/ftrace.h
> @@ -69,7 +69,7 @@ ftrace_regs_get_instruction_pointer(struct ftrace_regs 
> *fregs)
>   regs_get_kernel_argument(&(fregs)->regs, n)
>  #define ftrace_regs_get_stack_pointer(fregs) \
>   kernel_stack_pointer(&(fregs)->regs)
> -#define ftrace_regs_return_value(fregs) \
> +#define ftrace_regs_get_return_value(fregs) \
>   regs_return_value(&(fregs)->regs)
>  #define ftrace_regs_set_return_value(fregs, ret) \
>   regs_set_return_value(&(fregs)->regs, ret)
> diff --git a/arch/s390/include/asm/ftrace.h b/arch/s390/include/asm/ftrace.h
> index 5a82b08f03cd..01e775c98425 100644
> --- a/arch/s390/include/asm/ftrace.h
> +++ b/arch/s390/include/asm/ftrace.h
> @@ -88,7 +88,7 @@ ftrace_regs_set_instruction_pointer(struct ftrace_regs 
> *fregs,
>   regs_get_kernel_argument(&(fregs)->regs, n)
>  #define ftrace_regs_get_stack_pointer(fregs) \
>   kernel_stack_pointer(&(fregs)->regs)
> -#define ftrace_regs_return_value(fregs) \
> +#define ftrace_regs_get_return_value(fregs) \
>   regs_return_value(&(fregs)->regs)
>  #define ftrace_regs_set_return_value(fregs, ret) \
>   regs_set_return_value(&(fregs)->regs, ret)
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index 0b306c82855d..a061f8832b20 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -64,7 +64,7 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
>   regs_get_kernel_argument(&(fregs)->regs, n)
>  #define ftrace_regs_get_stack_pointer(fregs) \
>   kernel_stack_pointer(&(fregs)->regs)
> -#define ftrace_regs_return_value(fregs) \
> +#define ftrace_regs_get_return_value(fregs) \
>   regs_return_value(&(fregs)->regs)
>  #define ftrace_regs_set_return_value(fregs, ret) \
>   regs_set_return_value(&(fregs)->regs, ret)
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 79875a00c02b..da2a23f5a9ed 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -187,7 +187,7 @@ static __always_inline bool ftrace_regs_has_args(struct 
> ftrace_regs *fregs)
>   regs_get_kernel_argument(ftrace_get_regs(fregs), n)
>  #define ftrace_regs_get_stack_pointer(fregs) \
>   kernel_stack_pointer(ftrace_get_regs(fregs))
> -#define ftrace_regs_return_value(fregs) \
> +#define ftrace_regs_get_return_value(fregs) \
>   regs_return_value(ftrace_get_regs(fregs))
>  #define ftrace_regs_set_return_value(fregs, ret) \
>   regs_set_return_value(ftrace_get_regs(fregs), ret)
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 088432b695a6..9a60acaacc96 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -783,7 +783,7 @@ static void fgraph_call_retfunc(struct ftrace_regs *fregs,
>   trace.rettime = trace_clock_local();
>  #ifdef CONFIG_FUNCTION_GRAPH_RETVAL
>   if (fregs)
> - trace.retval = ftrace_regs_return_value(fregs);
> + trace.retval = ftrace_regs_get_return_value(fregs);
>   else
>   trace.retval = fgraph_ret_regs_return_value(ret_regs);
>  #endif
> 



Re: [PATCH v5 01/34] tracing: Add a comment about ftrace_regs definition

2024-01-05 Thread Mark Rutland
On Mon, Dec 18, 2023 at 10:11:44PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> To clarify what will be expected on ftrace_regs, add a comment to the
> architecture independent definition of the ftrace_regs.
> 
> Signed-off-by: Masami Hiramatsu (Google) 

Acked-by: Mark Rutland 

Mark.

> ---
>  Changes in v3:
>   - Add instruction pointer
>  Changes in v2:
>   - newly added.
> ---
>  include/linux/ftrace.h |   26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index e8921871ef9a..8b48fc621ea0 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -118,6 +118,32 @@ extern int ftrace_enabled;
>  
>  #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
>  
> +/**
> + * ftrace_regs - ftrace partial/optimal register set
> + *
> + * ftrace_regs represents a group of registers which is used at the
> + * function entry and exit. There are three types of registers.
> + *
> + * - Registers for passing the parameters to callee, including the stack
> + *   pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64)
> + * - Registers for passing the return values to caller.
> + *   (e.g. rax and rdx on x86_64)
> + * - Registers for hooking the function call and return including the
> + *   frame pointer (the frame pointer is architecture/config dependent)
> + *   (e.g. rip, rbp and rsp for x86_64)
> + *
> + * Also, architecture dependent fields can be used for internal process.
> + * (e.g. orig_ax on x86_64)
> + *
> + * On the function entry, those registers will be restored except for
> + * the stack pointer, so that user can change the function parameters
> + * and instruction pointer (e.g. live patching.)
> + * On the function exit, only registers which is used for return values
> + * are restored.
> + *
> + * NOTE: user *must not* access regs directly, only do it via APIs, because
> + * the member can be changed according to the architecture.
> + */
>  struct ftrace_regs {
>   struct pt_regs  regs;
>  };
> 



Re: [PATCH v5 11/34] function_graph: Have the instances use their own ftrace_ops for filtering

2024-01-05 Thread Mark Rutland
On Mon, Dec 18, 2023 at 10:13:46PM +0900, Masami Hiramatsu (Google) wrote:
> From: Steven Rostedt (VMware) 
> 
> Allow for instances to have their own ftrace_ops part of the fgraph_ops
> that makes the function_graph tracer filter on the set_ftrace_filter file
> of the instance and not the top instance.
> 
> This also change how the function_graph handles multiple instances on the
> shadow stack. Previously we use ARRAY type entries to record which one
> is enabled, and this makes it a bitmap of the fgraph_array's indexes.
> Previous function_graph_enter() expects calling back from
> prepare_ftrace_return() function which is called back only once if it is
> enabled. But this introduces different ftrace_ops for each fgraph
> instance and those are called from ftrace_graph_func() one by one. Thus
> we can not loop on the fgraph_array(), and need to reuse the ret_stack
> pushed by the previous instance. Finding the ret_stack is easy because
> we can check the ret_stack->func. But that is not enough for the self-
> recursive tail-call case. Thus fgraph uses the bitmap entry to find it
> is already set (this means that entry is for previous tail call).
> 
> Signed-off-by: Steven Rostedt (VMware) 
> Signed-off-by: Masami Hiramatsu (Google) 

As a heads-up, while testing the topic/fprobe-on-fgraph branch on arm64, I get
a warning which bisets down to this commit:

| Testing tracer function_graph: 
| [ cut here ]
| WARNING: CPU: 2 PID: 0 at arch/arm64/kernel/stacktrace.c:84 
arch_stack_walk+0x3c0/0x3d8
| Modules linked in:
| CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.7.0-rc2-00026-gea1e68a341c2 #12
| Hardware name: linux,dummy-virt (DT)
| pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x3c0/0x3d8
| lr : arch_stack_walk+0x260/0x3d8
| sp : 80008318be00
| x29: 80008318be00 x28: 03c0ae80 x27: 
| x26:  x25: 03c0ae80 x24: 
| x23: 8000800234c8 x22: 80008002dc30 x21: 800080035d10
| x20: 80008318bee8 x19: 800080023460 x18: 800083453c68
| x17:  x16: 800083188000 x15: 8ccc5058
| x14: 0004 x13: 800082b8c4f0 x12: 
| x11: 800081fba9b0 x10: 80008318bff0 x9 : 800080010798
| x8 : 80008002dc30 x7 : 03c0ae80 x6 : 
| x5 :  x4 : 8000832a3c18 x3 : 80008318bff0
| x2 : 80008002dc30 x1 : 80008002dc30 x0 : 80008002dc30
| Call trace:
|  arch_stack_walk+0x3c0/0x3d8
|  return_address+0x40/0x80
|  trace_hardirqs_on+0x8c/0x198
|  __do_softirq+0xe8/0x440
| ---[ end trace  ]---

That's a warning in arm64's unwind_recover_return_address() function, which
fires when ftrace_graph_ret_addr() finds return_to_handler:

	if (state->task->ret_stack &&
	    (state->pc == (unsigned long)return_to_handler)) {
		unsigned long orig_pc;
		orig_pc = ftrace_graph_ret_addr(state->task, NULL, state->pc,
						(void *)state->fp);
		if (WARN_ON_ONCE(state->pc == orig_pc))
			return -EINVAL;
		state->pc = orig_pc;
	}

The rationale there is that since tail calls are (currently) disabled on arm64,
the only reason for ftrace_graph_ret_addr() to return return_to_handler is when
it fails to find the original return address.

Does this change make it legitimate for ftrace_graph_ret_addr() to return
return_to_handler in other cases, or is that a bug?

Either way, we'll need *some* way to recover the original return address...

Mark.

> ---
>  Changes in v4:
>   - Simplify get_ret_stack() sanity check and use WARN_ON_ONCE() for
> obviously wrong value.
>   - Do not check ret == return_to_handler but always read the previous
> ret_stack in ftrace_push_return_trace() to check it is reusable.
>   - Set the bit 0 of the bitmap entry always in function_graph_enter()
> because it uses bit 0 to check re-usability.
>   - Fix to ensure the ret_stack entry is bitmap type when checking the
> bitmap.
>  Changes in v3:
>   - Pass current fgraph_ops to the new entry handler
>(function_graph_enter_ops) if fgraph use ftrace.
>   - Add fgraph_ops::idx in this patch.
>   - Replace the array type with the bitmap type so that it can record
> which fgraph is called.
>   - Fix some helper function to use passed task_struct instead of current.
>   - Reduce the ret-index size to 1024 words.
>   - Make the ret-index directly points the ret_stack.
>   - Fix ftrace_graph_ret_addr() to handle tail-call case correctly.
>  Changes in v2:
>   - Use ftrace_graph_func and FTRACE_OPS_GRAPH_STUB instead of
> ftrace_stub and FTRACE_OPS_FL_STUB for new ftrace based fgraph.
> ---
>  arch/arm64/kernel/ftrace.c   |   19 ++
>  arch/x86/kernel/ftrace.c |   19 ++
>  include/linux/ftrace.h   |7 +
>  

Re: ARM64 Livepatch based on SFrame

2023-12-15 Thread Mark Rutland
On Thu, Dec 14, 2023 at 02:49:29PM -0600, Madhavan T. Venkataraman wrote:
> Hi Mark,

Hi Madhavan,

> I attended your presentation in the LPC. You mentioned that you could use
> some help with some pre-requisites for the Livepatch feature.
> I would like to lend a hand.

Cool!

I've been meaning to send a mail round with a summary of the current state of
things, and what needs to be done going forward, but I haven't had the time
since LPC to put that together (as e.g. that requires learning some more about
SFrame).  I'll be disappearing for the holiday shortly, and I intend to pick
that up in the new year.

> What would you like me to implement?

I'm not currently sure exactly what we need/want to implement, and as above I
think that needs to wait until the new year.

However, one thing that you can do that would be very useful is to write up and
report the GCC DWARF issues that you mentioned in:

  
https://lore.kernel.org/linux-arm-kernel/20230202074036.507249-1-madve...@linux.microsoft.com/

... as (depending on exactly what those are) those could also affect SFrame
generation (and thus we'll need to get those fixed in GCC), and regardless it
would be useful information to know.

I understood that you planned to do that from:

  
https://lore.kernel.org/linux-arm-kernel/054ce0d6-70f0-b834-d4e5-1049c8df7...@linux.microsoft.com/

... but I couldn't spot any relevant mails or issues in the GCC bugzilla, so
either I'm failing to search hard enough, or did that get forgotten about?

> I would also like to implement Unwind Hints for the feature. If you want a
> specific format for the hints, let me know.

I will get back to you on that in the new year; I think the specifics we want
are going to depend on other details above we need to analyse first.

Thanks,
Mark.



Re: selftests: ftrace: WARNING: __list_del_entry_valid_or_report (lib/list_debug.c:62 (discriminator 1))

2023-11-23 Thread Mark Rutland
On Wed, Nov 22, 2023 at 10:12:51AM -0500, Steven Rostedt wrote:
> On Wed, 22 Nov 2023 19:49:43 +0530
> Naresh Kamboju  wrote:
> 
> > Hi Steven,
> > 
> > 
> > 
> > On Tue, 21 Nov 2023 at 02:06, Steven Rostedt  wrote:
> > >
> > > On Thu, 16 Nov 2023 18:00:16 +0530
> > > Naresh Kamboju  wrote:
> > [  282.726999] Unexpected kernel BRK exception at EL1
> 
> What's a "BRK exception"?

That's triggered by a BRK instruction (software breakpoint), we use it like UD2
on x86.

> > [  282.731840] Internal error: BRK handler: f20003e8 [#1] PREEMPT 
> > SMP

That lump of hex here means "this was triggered by a BRK #1000" instruction.

That immediate (0x3e8 / 1000) is what GCC/Clang use for __builtin_trap(), which
is generated for a bunch of reasons, usually a sanitizer like UBSAN, or options
to trap on runtime issues.
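
For reference, this is the kind of source construct that ends up as that BRK
(a minimal illustration, not taken from Naresh's report):

  void always_trap(void)
  {
          /*
           * Sanitizers such as UBSAN emit calls to __builtin_trap() on
           * detected runtime errors; per the above, GCC/Clang lower it to
           * "brk #0x3e8" on arm64, i.e. the immediate decoded here.
           */
          __builtin_trap();
  }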

Naresh, where can I find the config used for this run? I can try to reproduce
this and investigate.

Thanks,
Mark.

> > [  282.738752] Modules linked in: ftrace_direct ftrace_direct_too
> > hdlcd tda998x onboard_usb_hub drm_dma_helper cec crct10dif_ce
> > drm_kms_helper fuse drm backlight dm_mod ip_tables x_tables [last
> > unloaded: ftrace_direct]
> > [  282.758152] CPU: 0 PID: 987 Comm: rmmod Not tainted 6.6.2-rc1 #1
> 
> So this happened on removing a module? Unloading ftrace_direct?
> 
> This could be a direct trampoline bug.
> 
> But it doesn't look to be related to eventfs, so I'm going with my other 
> patches.
> 
> Thanks,
> 
> -- Steve
> 
> 
> > [  282.764191] Hardware name: ARM Juno development board (r2) (DT)
> > [  282.770138] pstate: a0c5 (NzCv daIF -PAN -UAO -TCO -DIT -SSBS 
> > BTYPE=--)
> > [  282.777133] pc : ring_buffer_lock_reserve
> > (kernel/trace/ring_buffer.c:3304 (discriminator 1)
> > kernel/trace/ring_buffer.c:3819 (discriminator 1))
> > [  282.782230] lr : trace_function (kernel/trace/trace.c:992
> > kernel/trace/trace.c:3078)
> > [  282.786360] sp : 80008428ba20
> > [  282.789695] x29: 80008428ba20 x28: 00082bb367c0 x27: 
> > 
> > [  282.796898] x26: 800080371d84 x25:  x24: 
> > 00097ed1e4b0
> > [  282.804100] x23: 000800018400 x22:  x21: 
> > 0018
> > [  282.811302] x20: 000800018400 x19: 000800041400 x18: 
> > 
> > [  282.818504] x17:  x16:  x15: 
> > 800081d2ec58
> > [  282.825706] x14:  x13: 80008428beb0 x12: 
> > 0255
> > [  282.832908] x11: 800083343078 x10: 80008428bbc0 x9 : 
> > 8000803e3110
> > [  282.840110] x8 : 80008428ba98 x7 :  x6 : 
> > 0009
> > [  282.847311] x5 : 0081 x4 : fffb x3 : 
> > 0001
> > [  282.854513] x2 :  x1 :  x0 : 
> > 
> > [  282.861715] Call trace:
> > [  282.864179] ring_buffer_lock_reserve
> > (kernel/trace/ring_buffer.c:3304 (discriminator 1)
> > kernel/trace/ring_buffer.c:3819 (discriminator 1))
> > [  282.868920] trace_function (kernel/trace/trace.c:992
> > kernel/trace/trace.c:3078)
> > [  282.872702] tracer_hardirqs_off (kernel/trace/trace_irqsoff.c:279
> > kernel/trace/trace_irqsoff.c:397 kernel/trace/trace_irqsoff.c:619
> > kernel/trace/trace_irqsoff.c:616)
> > [  282.877009] trace_hardirqs_off.part.0
> > (kernel/trace/trace_preemptirq.c:78 (discriminator 1))
> > [  282.881748] trace_hardirqs_off (kernel/trace/trace_preemptirq.c:94)
> > [  282.885791] obj_cgroup_charge (mm/memcontrol.c:3236 (discriminator
> > 1) mm/memcontrol.c:3364 (discriminator 1))
> > [  282.889918] kmem_cache_alloc (mm/slab.h:503 (discriminator 2)
> > mm/slab.h:714 (discriminator 2) mm/slub.c:3460 (discriminator 2)
> > mm/slub.c:3486 (discriminator 2) mm/slub.c:3493 (discriminator 2)
> > mm/slub.c:3502 (discriminator 2))
> > [  282.893873] __anon_vma_prepare (mm/rmap.c:197)
> > [  282.897999] __handle_mm_fault (mm/memory.c:4111 (discriminator 2)
> > mm/memory.c:3670 (discriminator 2) mm/memory.c:4981 (discriminator 2)
> > mm/memory.c:5122 (discriminator 2))
> > [  282.902213] handle_mm_fault (mm/memory.c:5287)
> > [  282.906077] do_page_fault (arch/arm64/mm/fault.c:513
> > arch/arm64/mm/fault.c:626)
> > [  282.909856] do_translation_fault (arch/arm64/mm/fault.c:714)
> > [  282.914068] do_mem_abort (arch/arm64/mm/fault.c:846 (discriminator 1))
> > [  282.917591] el0_da (arch/arm64/include/asm/daifflags.h:28
> > arch/arm64/kernel/entry-common.c:133
> > arch/arm64/kernel/entry-common.c:144
> > arch/arm64/kernel/entry-common.c:547)
> > [  282.920590] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:700)
> > [  282.924894] el0t_64_sync (arch/arm64/kernel/entry.S:595)
> > [ 282.928593] Code: 88037c22 35a3 17f8 94549eb9 (d4207d00)
> > All code
> > 
> >0: 88037c22 stxr w3, w2, [x1]
> >4: 35a3 cbnz w3, 0xfff8
> >8: 17f8 b 0xffe8
> >c: 94549eb9 bl 0x1527af0
> >   10:* d4207d00 brk #0x3e8 <-- 

Re: [PATCH] tracing: fix UAF caused by memory ordering issue

2023-11-13 Thread Mark Rutland
On Sun, Nov 12, 2023 at 11:00:30PM +0800, Kairui Song wrote:
> From: Kairui Song 
> 
> Following kernel panic was observed when doing ftrace stress test:

Can you share some more details:

* What test specifically are you running? Can you share this so that others can
  try to reproduce the issue?

* Which machines are you testing on (i.e. which CPU microarchitecture is this
  seen with) ?

* Which compiler are you using?

* The log shows this is with v6.1.61+. Can you reproduce this with a mainline
  kernel? e.g. v6.6 or v6.7-rc1?

> Unable to handle kernel paging request at virtual address 9699b0f8ece28240
> Mem abort info:
>   ESR = 0x9604
>   EC = 0x25: DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
>   FSC = 0x04: level 0 translation fault
> Data abort info:
>   ISV = 0, ISS = 0x0004
>   CM = 0, WnR = 0
> [9699b0f8ece28240] address between user and kernel address ranges
> Internal error: Oops: 9604 [#1] SMP
> Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill vfat fat loop 
> fuse nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 
> sr_mod cdrom crct10dif_ce ghash_ce sha2_ce virtio_gpu virtio_dma_buf 
> drm_shmem_helper virtio_blk drm_kms_helper syscopyarea sysfillrect sysimgblt 
> fb_sys_fops virtio_console sha256_arm64 sha1_ce drm virtio_scsi i2c_core 
> virtio_net net_failover failover virtio_mmio dm_multipath dm_mod autofs4 
> [last unloaded: ipmi_msghandler]
> CPU: 0 PID: 499719 Comm: sh Kdump: loaded Not tainted 6.1.61+ #2
> Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : __kmem_cache_alloc_node+0x1dc/0x2e4
> lr : __kmem_cache_alloc_node+0xac/0x2e4
> sp : 8ad23aa0
> x29: 8ad23ab0 x28: 0004052b8000 x27: c513863b
> x26: 0040 x25: c51384f21ca4 x24: 
> x23: d615521430b1b1a5 x22: c51386044770 x21: 
> x20: 0cc0 x19: c0001200 x18: 
> x17:  x16:  x15: e65e1630
> x14: 0004 x13: c513863e67a0 x12: c513863af6d8
> x11: 0001 x10: 8ad23aa0 x9 : c51385058078
> x8 : 0018 x7 : 0001 x6 : 0010
> x5 : c09c2280 x4 : c51384f21ca4 x3 : 0040
> x2 : 9699b0f8ece28240 x1 : c09c2280 x0 : 9699b0f8ece28200
> Call trace:
>  __kmem_cache_alloc_node+0x1dc/0x2e4
>  __kmalloc+0x6c/0x1c0
>  func_add+0x1a4/0x200
>  tracepoint_add_func+0x70/0x230
>  tracepoint_probe_register+0x6c/0xb4
>  trace_event_reg+0x8c/0xa0
>  __ftrace_event_enable_disable+0x17c/0x440
>  __ftrace_set_clr_event_nolock+0xe0/0x150
>  system_enable_write+0xe0/0x114
>  vfs_write+0xd0/0x2dc
>  ksys_write+0x78/0x110
>  __arm64_sys_write+0x24/0x30
>  invoke_syscall.constprop.0+0x58/0xf0
>  el0_svc_common.constprop.0+0x54/0x160
>  do_el0_svc+0x2c/0x60
>  el0_svc+0x40/0x1ac
>  el0t_64_sync_handler+0xf4/0x120
>  el0t_64_sync+0x19c/0x1a0
> Code: b9402a63 f9405e77 8b030002 d5384101 (f8636803)
> 
> Panic was caused by corrupted freelist pointer. After more debugging,
> I found the root cause is UAF of slab allocated object in ftrace
> introduced by commit eecb91b9f98d ("tracing: Fix memleak due to race
> between current_tracer and trace"), and so far it's only reproducible
> on some ARM64 machines, the UAF and free stack is:
> 
> UAF:
> kasan_report+0xa8/0x1bc
> __asan_report_load8_noabort+0x28/0x3c
> print_graph_function_flags+0x524/0x5a0
> print_graph_function_event+0x28/0x40
> print_trace_line+0x5c4/0x1030
> s_show+0xf0/0x460
> seq_read_iter+0x930/0xf5c
> seq_read+0x130/0x1d0
> vfs_read+0x288/0x840
> ksys_read+0x130/0x270
> __arm64_sys_read+0x78/0xac
> invoke_syscall.constprop.0+0x90/0x224
> do_el0_svc+0x118/0x3dc
> el0_svc+0x54/0x120
> el0t_64_sync_handler+0xf4/0x120
> el0t_64_sync+0x19c/0x1a0
> 
> Freed by:
> kasan_save_free_info+0x38/0x5c
> __kasan_slab_free+0xe8/0x154
> slab_free_freelist_hook+0xfc/0x1e0
> __kmem_cache_free+0x138/0x260
> kfree+0xd0/0x1d0
> graph_trace_close+0x60/0x90
> s_start+0x610/0x910
> seq_read_iter+0x274/0xf5c
> seq_read+0x130/0x1d0
> vfs_read+0x288/0x840
> ksys_read+0x130/0x270
> __arm64_sys_read+0x78/0xac
> invoke_syscall.constprop.0+0x90/0x224
> do_el0_svc+0x118/0x3dc
> el0_svc+0x54/0x120
> el0t_64_sync_handler+0xf4/0x120
> el0t_64_sync+0x19c/0x1a0
> 
> Despite s_start and s_show being serialized by the seq_file mutex,
> the tracer struct copy in s_start introduced by the commit mentioned
> above is not atomic nor guaranteed to be seen by all CPUs. So the
> following scenario is possible (and actually happened):
> 
> CPU 1 CPU 2
> seq_read_iter seq_read_iter
>   mutex_lock(>lock);
>   s_start
> // iter->trace is graph_trace
> iter->trace->close(iter);
> graph_trace_close
>   kfree(data) <- *** data released here ***
> // 

Re: [RFC PATCH v2 01/31] tracing: Add a comment about ftrace_regs definition

2023-11-10 Thread Mark Rutland
On Thu, Nov 09, 2023 at 08:14:52AM +0900, Masami Hiramatsu wrote:
> On Wed,  8 Nov 2023 23:24:32 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > From: Masami Hiramatsu (Google) 
> > 
> > To clarify what will be expected on ftrace_regs, add a comment to the
> > architecture independent definition of the ftrace_regs.
> > 
> > Signed-off-by: Masami Hiramatsu (Google) 
> > ---
> >  Changes in v2:
> >   - newly added.
> > ---
> >  include/linux/ftrace.h |   25 +
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index e8921871ef9a..b174af91d8be 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -118,6 +118,31 @@ extern int ftrace_enabled;
> >  
> >  #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> >  
> > +/**
> > + * ftrace_regs - ftrace partial/optimal register set
> > + *
> > + * ftrace_regs represents a group of registers which is used at the
> > + * function entry and exit. There are three types of registers.
> > + *
> > + * - Registers for passing the parameters to callee, including the stack
> > + *   pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64)
> > + * - Registers for passing the return values to caller.
> > + *   (e.g. rax and rdx on x86_64)
> > + * - Registers for hooking the function return including the frame pointer
> > + *   (the frame pointer is architecture/config dependent)
> > + *   (e.g. rbp and rsp for x86_64)
> 
> Oops, I found the program counter/instruction pointer must be saved too.
> This is used for live patching. One question is that if the IP is modified
> at the return handler, what should we do? Return to the specified address?

I'm a bit confused here: currently we use fgraph_ret_regs for function
returns; are we going to replace that with ftrace_regs?

I think it makes sense for the PC/IP to be the address the return handler will
eventually return to (and hence allowing it to be overridden), but that does
mean we'll need to go recover the return address *before* we invoke any return
handlers.

Thanks,
Mark.



Re: [PATCH] locking/atomic: sh: Use generic_cmpxchg_local for arch_cmpxchg_local()

2023-10-25 Thread Mark Rutland
On Wed, Oct 25, 2023 at 08:42:55AM +0900, Masami Hiramatsu wrote:
> On Tue, 24 Oct 2023 16:08:12 +0100
> Mark Rutland  wrote:
> 
> > On Tue, Oct 24, 2023 at 11:52:54PM +0900, Masami Hiramatsu (Google) wrote:
> > > From: Masami Hiramatsu (Google) 
> > > 
> > > Use generic_cmpxchg_local() for arch_cmpxchg_local() implementation
> > > in SH architecture because it does not implement arch_cmpxchg_local().
> > 
> > I do not think this is correct.
> > 
> > The implementation in  is UP-only (and it only
> > disables interrupts), whereas arch/sh can be built SMP. We should probably 
> > add
> > some guards into  for that as we have in
> > .
> 
> Isn't cmpxchg_local for data which only needs the cmpxchg to be done on
> the local CPU?
> So I think it doesn't care about the other CPUs (IOW, it should not be
> touched by other CPUs), so it only considers the UP case. E.g. on x86,
> arch_cmpxchg_local() is defined as a raw "cmpxchg" without the lock prefix.
> 
> #define __cmpxchg_local(ptr, old, new, size)\
>     __raw_cmpxchg((ptr), (old), (new), (size), "")
> 

Yes, you're right; sorry for the noise.

For your original patch:

Acked-by: Mark Rutland 

Mark.

> 
> Thank you,
> 
> 
> > 
> > I think the right thing to do here is to define arch_cmpxchg_local() in 
> > terms
> > of arch_cmpxchg(), i.e. at the bottom of arch/sh's  add:
> > 
> > #define arch_cmpxchg_local  arch_cmpxchg
> > 
> > Mark.
> > 
> > > 
> > > Reported-by: kernel test robot 
> > > Closes: 
> > > https://lore.kernel.org/oe-kbuild-all/202310241310.ir5uukog-...@intel.com/
> > > Signed-off-by: Masami Hiramatsu (Google) 
> > > ---
> > >  arch/sh/include/asm/cmpxchg.h |2 ++
> > >  1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/arch/sh/include/asm/cmpxchg.h b/arch/sh/include/asm/cmpxchg.h
> > > index 288f6f38d98f..e920e61fb817 100644
> > > --- a/arch/sh/include/asm/cmpxchg.h
> > > +++ b/arch/sh/include/asm/cmpxchg.h
> > > @@ -71,4 +71,6 @@ static inline unsigned long __cmpxchg(volatile void * 
> > > ptr, unsigned long old,
> > >   (unsigned long)_n_, sizeof(*(ptr))); \
> > >})
> > >  
> > > +#include 
> > > +
> > >  #endif /* __ASM_SH_CMPXCHG_H */
> > > 
> 
> 
> -- 
> Masami Hiramatsu (Google) 



Re: [PATCH] locking/atomic: sh: Use generic_cmpxchg_local for arch_cmpxchg_local()

2023-10-24 Thread Mark Rutland
On Tue, Oct 24, 2023 at 11:52:54PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> Use generic_cmpxchg_local() for arch_cmpxchg_local() implementation
> in SH architecture because it does not implement arch_cmpxchg_local().

I do not think this is correct.

The implementation in  is UP-only (and it only
disables interrupts), whereas arch/sh can be built SMP. We should probably add
some guards into  for that as we have in
.
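
For reference, a stripped-down sketch of what the generic helper does
(assuming the asm-generic implementation, with the size switch elided) shows
why it is UP-only:

  /* Sketch only: masks local interrupts, provides no atomicity vs other CPUs */
  static unsigned long sketch_generic_cmpxchg_local(volatile unsigned long *ptr,
                                                    unsigned long old,
                                                    unsigned long new)
  {
          unsigned long flags, prev;

          local_irq_save(flags);
          prev = *ptr;
          if (prev == old)
                  *ptr = new;
          local_irq_restore(flags);

          return prev;
  }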

I think the right thing to do here is to define arch_cmpxchg_local() in terms
of arch_cmpxchg(), i.e. at the bottom of arch/sh's  add:

#define arch_cmpxchg_local  arch_cmpxchg

Mark.

> 
> Reported-by: kernel test robot 
> Closes: 
> https://lore.kernel.org/oe-kbuild-all/202310241310.ir5uukog-...@intel.com/
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  arch/sh/include/asm/cmpxchg.h |2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/sh/include/asm/cmpxchg.h b/arch/sh/include/asm/cmpxchg.h
> index 288f6f38d98f..e920e61fb817 100644
> --- a/arch/sh/include/asm/cmpxchg.h
> +++ b/arch/sh/include/asm/cmpxchg.h
> @@ -71,4 +71,6 @@ static inline unsigned long __cmpxchg(volatile void * ptr, 
> unsigned long old,
>   (unsigned long)_n_, sizeof(*(ptr))); \
>})
>  
> +#include 
> +
>  #endif /* __ASM_SH_CMPXCHG_H */
> 



Re: [PATCH 1/3] arm64: ptrace: Add is_syscall_success to handle compat

2021-04-16 Thread Mark Rutland
On Fri, Apr 16, 2021 at 01:33:22PM +0100, Catalin Marinas wrote:
> On Fri, Apr 16, 2021 at 03:55:31PM +0800, He Zhe wrote:
> > The general version of is_syscall_success does not handle the 32-bit
> > compat case, which would cause a 32-bit negative return code to be
> > recognized as a positive number later and seen as a "success".
> > 
> > Since is_compat_thread is defined in compat.h, implementing
> > is_syscall_success in ptrace.h would introduce build failure due to
> > recursive inclusion of some basic headers like mutex.h. We put the
> > implementation to ptrace.c
> > 
> > Signed-off-by: He Zhe 
> > ---
> >  arch/arm64/include/asm/ptrace.h |  3 +++
> >  arch/arm64/kernel/ptrace.c  | 10 ++
> >  2 files changed, 13 insertions(+)
> > 
> > diff --git a/arch/arm64/include/asm/ptrace.h 
> > b/arch/arm64/include/asm/ptrace.h
> > index e58bca832dff..3c415e9e5d85 100644
> > --- a/arch/arm64/include/asm/ptrace.h
> > +++ b/arch/arm64/include/asm/ptrace.h
> > @@ -328,6 +328,9 @@ static inline void regs_set_return_value(struct pt_regs 
> > *regs, unsigned long rc)
> > regs->regs[0] = rc;
> >  }
> >  
> > +extern inline int is_syscall_success(struct pt_regs *regs);
> > +#define is_syscall_success(regs) is_syscall_success(regs)
> > +
> >  /**
> >   * regs_get_kernel_argument() - get Nth function argument in kernel
> >   * @regs:  pt_regs of that context
> > diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> > index 170f42fd6101..3266201f8c60 100644
> > --- a/arch/arm64/kernel/ptrace.c
> > +++ b/arch/arm64/kernel/ptrace.c
> > @@ -1909,3 +1909,13 @@ int valid_user_regs(struct user_pt_regs *regs, 
> > struct task_struct *task)
> > else
> > return valid_native_regs(regs);
> >  }
> > +
> > +inline int is_syscall_success(struct pt_regs *regs)
> > +{
> > +   unsigned long val = regs->regs[0];
> > +
> > +   if (is_compat_thread(task_thread_info(current)))
> > +   val = sign_extend64(val, 31);
> > +
> > +   return !IS_ERR_VALUE(val);
> > +}
> 
> It's better to use compat_user_mode(regs) here instead of
> is_compat_thread(). It saves us from worrying whether regs are for the
> current context.
> 
> I think we should change regs_return_value() instead. This function
> seems to be called from several other places and it has the same
> potential problems if called on compat pt_regs.
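
As an illustration, the helper quoted above with that suggestion applied would
look roughly like this (a sketch, not a posted patch):

  static inline int is_syscall_success(struct pt_regs *regs)
  {
          unsigned long val = regs->regs[0];

          /*
           * compat_user_mode() inspects the regs themselves, so this is
           * safe even if regs do not belong to the current task.
           */
          if (compat_user_mode(regs))
                  val = sign_extend64(val, 31);

          return !IS_ERR_VALUE(val);
  }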

I think this is a problem we created for ourselves back in commit:

  15956689a0e60aa0 ("arm64: compat: Ensure upper 32 bits of x0 are zero on 
syscall return")

AFAICT, the perf regs samples are the only place this matters, since for
ptrace the compat regs are implicitly truncated to compat_ulong_t, and
audit expects the non-truncated return value. Other architectures don't
truncate here, so I think we're setting ourselves up for a game of
whack-a-mole to truncate and extend wherever we need to.

Given that, I suspect it'd be better to do something like the below.

Will, thoughts?

Mark.

>8
From df0f7c160240d9ee6f20f87a180326d3253e80fb Mon Sep 17 00:00:00 2001
From: Mark Rutland 
Date: Fri, 16 Apr 2021 13:58:54 +0100
Subject: [PATCH] arm64: perf: truncate compat regs

For compat userspace, it doesn't generally make sense for the upper 32
bits of GPRs to be set, as these bits don't really exist in AArch32.
However, for structural reasons the kernel may transiently set the upper
32 bits of registers in pt_regs at points where a perf sample can be
taken.

We tried to avoid this happening in commit:

  15956689a0e60aa0 ("arm64: compat: Ensure upper 32 bits of x0 are zero on 
syscall return")

... by having invoke_syscall() truncate the return value for compat
tasks, with helpers in  extending the return value when
required.

Unfortunately this is not complete, as there are other places where we
assign the return value, such as when el0_svc_common() sets up a return
of -ENOSYS.

Further, this approach breaks the audit code, which relies on the upper
32 bits of the return value.

Instead, let's have the perf code explicitly truncate the user regs to
32 bits, and otherwise preserve those within the kernel.

Fixes: 15956689a0e60aa0 ("arm64: compat: Ensure upper 32 bits of x0 are zero on 
syscall return")
Signed-off-by: Mark Rutland 
Cc: Will Deacon 
---
 arch/arm64/include/asm/syscall.h | 11 +--
 arch/arm64/kernel/perf_regs.c| 26 --
 arch/arm64/kernel/syscall.c  |  3 ---
 3 files changed, 17 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index cfc0672013f6..0ebeaf6dbd45 100644
--- a/arch/arm64/include/asm/sy
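
The diff is cut short here. As a rough sketch of the direction the commit
message describes (the helper name and body are illustrative assumptions, not
the actual patch), the perf regs code would mask the upper bits for compat
tasks at sample time:

  static u64 sketch_perf_reg_value(struct pt_regs *regs, int idx)
  {
          u64 val = regs->regs[idx];

          /* AArch32 GPRs are 32-bit; drop any transient upper bits */
          if (compat_user_mode(regs))
                  val = lower_32_bits(val);

          return val;
  }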

Re: [PATCH 0/9] kcsan: Add support for reporting observed value changes

2021-04-15 Thread Mark Rutland
On Wed, Apr 14, 2021 at 01:28:16PM +0200, Marco Elver wrote:
> This series adds support for showing observed value changes in reports.
> Several clean up and refactors of KCSAN reporting code are done as a
> pre-requisite.

> This series was originally prepared courtesy of Mark Rutland in
> September 2020.

For anyone looking for the original, it was never posted to a list, but
is sat on my kcsan/rework branch on kernel.org:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kcsan/rework

> Because KCSAN had a few minor changes since the original
> draft of the series, it required a rebase and re-test. To not be
> forgotten and get these changes in sooner than later, Mark kindly agreed
> to me adopting the series and doing the rebase, a few minor tweaks, and
> finally re-test.

Thanks for picking this up!

All your changes look good to me (along with the documentation patch),
so FWIW:

Acked-by: Mark Rutland 

Thanks,
Mark.

> 
> Marco Elver (1):
>   kcsan: Document "value changed" line
> 
> Mark Rutland (8):
>   kcsan: Simplify value change detection
>   kcsan: Distinguish kcsan_report() calls
>   kcsan: Refactor passing watchpoint/other_info
>   kcsan: Fold panic() call into print_report()
>   kcsan: Refactor access_info initialization
>   kcsan: Remove reporting indirection
>   kcsan: Remove kcsan_report_type
>   kcsan: Report observed value changes
> 
>  Documentation/dev-tools/kcsan.rst |  88 +++-
>  kernel/kcsan/core.c   |  53 --
>  kernel/kcsan/kcsan.h  |  39 ---
>  kernel/kcsan/report.c | 169 --
>  4 files changed, 162 insertions(+), 187 deletions(-)
> 
> -- 
> 2.31.1.295.g9ea45b61b8-goog
> 


Re: [PATCH v3] arm64: mte: Move MTE TCF0 check in entry-common

2021-04-09 Thread Mark Rutland
On Fri, Apr 09, 2021 at 03:32:47PM +0100, Mark Rutland wrote:
> Hi Vincenzo,
> 
> On Fri, Apr 09, 2021 at 02:24:19PM +0100, Vincenzo Frascino wrote:
> > The check_mte_async_tcf macro sets the TIF flag non-atomically. This can
> > race with another CPU doing a set_tsk_thread_flag() and all the other flags
> > can be lost in the process.
> > 
> > Move the tcf0 check to enter_from_user_mode() and clear tcf0 in
> > exit_to_user_mode() to address the problem.
> > 
> > Note: Moving the check in entry-common allows to use set_thread_flag()
> > which is safe.

I've dug into this a bit more, and as set_thread_flag() calls some
potentially-instrumented helpers I don't think this is safe after all
(as e.g. those might cause an EL1 exception and clobber the ESR/FAR/etc.
before the EL0 exception handler reads them).

Making that watertight is pretty hairy, as we either need to open-code
set_thread_flag() or go rework a load of core code. If we can use STSET
in the entry asm that'd be simpler; otherwise we'll need something more
involved.

Thanks,
Mark.

> > 
> > Fixes: 637ec831ea4f ("arm64: mte: Handle synchronous and asynchronous tag 
> > check faults")
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> > Cc: sta...@vger.kernel.org
> > Reported-by: Will Deacon 
> > Signed-off-by: Vincenzo Frascino 
> > ---
> >  arch/arm64/include/asm/mte.h |  9 +
> >  arch/arm64/kernel/entry-common.c |  6 ++
> >  arch/arm64/kernel/entry.S| 34 
> >  arch/arm64/kernel/mte.c  | 33 +--
> >  4 files changed, 46 insertions(+), 36 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> > index 9b557a457f24..c7ab681a95c3 100644
> > --- a/arch/arm64/include/asm/mte.h
> > +++ b/arch/arm64/include/asm/mte.h
> > @@ -49,6 +49,9 @@ int mte_ptrace_copy_tags(struct task_struct *child, long 
> > request,
> >  
> >  void mte_assign_mem_tag_range(void *addr, size_t size);
> >  
> > +void noinstr check_mte_async_tcf0(void);
> > +void noinstr clear_mte_async_tcf0(void);
> 
> Can we please put the implementations in the header so that they can be
> inlined? Otherwise when the HW doesn't support MTE we'll always do a pointless
> branch to the out-of-line implementation.
> 
> With that, we can mark them __always_inline to avoid weirdness with an inline
> noinstr function.
> 
> Otherwise, this looks good to me.
> 
> Thanks,
> Mark.
> 
> > +
> >  #else /* CONFIG_ARM64_MTE */
> >  
> >  /* unused if !CONFIG_ARM64_MTE, silence the compiler */
> > @@ -83,6 +86,12 @@ static inline int mte_ptrace_copy_tags(struct 
> > task_struct *child,
> >  {
> > return -EIO;
> >  }
> > +static inline void check_mte_async_tcf0(void)
> > +{
> > +}
> > +static inline void clear_mte_async_tcf0(void)
> > +{
> > +}
> >  
> >  static inline void mte_assign_mem_tag_range(void *addr, size_t size)
> >  {
> > diff --git a/arch/arm64/kernel/entry-common.c 
> > b/arch/arm64/kernel/entry-common.c
> > index 9d3588450473..837d3624a1d5 100644
> > --- a/arch/arm64/kernel/entry-common.c
> > +++ b/arch/arm64/kernel/entry-common.c
> > @@ -289,10 +289,16 @@ asmlinkage void noinstr enter_from_user_mode(void)
> > CT_WARN_ON(ct_state() != CONTEXT_USER);
> > user_exit_irqoff();
> > trace_hardirqs_off_finish();
> > +
> > +   /* Check for asynchronous tag check faults in user space */
> > +   check_mte_async_tcf0();
> 
> 
> 
> >  }
> >  
> >  asmlinkage void noinstr exit_to_user_mode(void)
> >  {
> > +   /* Ignore asynchronous tag check faults in the uaccess routines */
> > +   clear_mte_async_tcf0();
> > +
> > trace_hardirqs_on_prepare();
> > lockdep_hardirqs_on_prepare(CALLER_ADDR0);
> > user_enter_irqoff();
> > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > index a31a0a713c85..fb57df0d453f 100644
> > --- a/arch/arm64/kernel/entry.S
> > +++ b/arch/arm64/kernel/entry.S
> > @@ -34,15 +34,11 @@
> >   * user and kernel mode.
> >   */
> > .macro user_exit_irqoff
> > -#if defined(CONFIG_CONTEXT_TRACKING) || defined(CONFIG_TRACE_IRQFLAGS)
> > bl  enter_from_user_mode
> > -#endif
> > .endm
> >  
> > .macro user_enter_irqoff
> > -#if defined(CONFIG_CONTEXT_TRACKING) || defined(CONFIG_TRACE_IRQFLAGS)
> > bl  exit_to_user_mode
> > -#endif
> > .endm
> >  
> > .macro  clear_gp_re

Re: [PATCH v3] arm64: mte: Move MTE TCF0 check in entry-common

2021-04-09 Thread Mark Rutland
Hi Vincenzo,

On Fri, Apr 09, 2021 at 02:24:19PM +0100, Vincenzo Frascino wrote:
> The check_mte_async_tcf macro sets the TIF flag non-atomically. This can
> race with another CPU doing a set_tsk_thread_flag() and all the other flags
> can be lost in the process.
> 
> Move the tcf0 check to enter_from_user_mode() and clear tcf0 in
> exit_to_user_mode() to address the problem.
> 
> Note: Moving the check in entry-common allows to use set_thread_flag()
> which is safe.
> 
> Fixes: 637ec831ea4f ("arm64: mte: Handle synchronous and asynchronous tag 
> check faults")
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: sta...@vger.kernel.org
> Reported-by: Will Deacon 
> Signed-off-by: Vincenzo Frascino 
> ---
>  arch/arm64/include/asm/mte.h |  9 +
>  arch/arm64/kernel/entry-common.c |  6 ++
>  arch/arm64/kernel/entry.S| 34 
>  arch/arm64/kernel/mte.c  | 33 +--
>  4 files changed, 46 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 9b557a457f24..c7ab681a95c3 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -49,6 +49,9 @@ int mte_ptrace_copy_tags(struct task_struct *child, long 
> request,
>  
>  void mte_assign_mem_tag_range(void *addr, size_t size);
>  
> +void noinstr check_mte_async_tcf0(void);
> +void noinstr clear_mte_async_tcf0(void);

Can we please put the implementations in the header so that they can be
inlined? Otherwise when the HW doesn't support MTE we'll always do a pointless
branch to the out-of-line implementation.

With that, we can mark them __always_inline to avoid weirdness with an inline
noinstr function.
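
As an illustration of the header-inline shape being asked for, based on the
asm check_mte_async_tcf macro removed further down in the patch (a sketch,
not the final code):

  static __always_inline void check_mte_async_tcf0(void)
  {
          if (!system_supports_mte())
                  return;

          /*
           * An asynchronous TCF occurred for a TTBR0 access: latch it in a
           * TIF flag and clear the fault status register.
           */
          if (read_sysreg_s(SYS_TFSRE0_EL1) & BIT(SYS_TFSR_EL1_TF0_SHIFT)) {
                  set_thread_flag(TIF_MTE_ASYNC_FAULT);
                  write_sysreg_s(0, SYS_TFSRE0_EL1);
          }
  }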

Otherwise, this looks good to me.

Thanks,
Mark.

> +
>  #else /* CONFIG_ARM64_MTE */
>  
>  /* unused if !CONFIG_ARM64_MTE, silence the compiler */
> @@ -83,6 +86,12 @@ static inline int mte_ptrace_copy_tags(struct task_struct 
> *child,
>  {
>   return -EIO;
>  }
> +static inline void check_mte_async_tcf0(void)
> +{
> +}
> +static inline void clear_mte_async_tcf0(void)
> +{
> +}
>  
>  static inline void mte_assign_mem_tag_range(void *addr, size_t size)
>  {
> diff --git a/arch/arm64/kernel/entry-common.c 
> b/arch/arm64/kernel/entry-common.c
> index 9d3588450473..837d3624a1d5 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -289,10 +289,16 @@ asmlinkage void noinstr enter_from_user_mode(void)
>   CT_WARN_ON(ct_state() != CONTEXT_USER);
>   user_exit_irqoff();
>   trace_hardirqs_off_finish();
> +
> + /* Check for asynchronous tag check faults in user space */
> + check_mte_async_tcf0();



>  }
>  
>  asmlinkage void noinstr exit_to_user_mode(void)
>  {
> + /* Ignore asynchronous tag check faults in the uaccess routines */
> + clear_mte_async_tcf0();
> +
>   trace_hardirqs_on_prepare();
>   lockdep_hardirqs_on_prepare(CALLER_ADDR0);
>   user_enter_irqoff();
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index a31a0a713c85..fb57df0d453f 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -34,15 +34,11 @@
>   * user and kernel mode.
>   */
>   .macro user_exit_irqoff
> -#if defined(CONFIG_CONTEXT_TRACKING) || defined(CONFIG_TRACE_IRQFLAGS)
>   bl  enter_from_user_mode
> -#endif
>   .endm
>  
>   .macro user_enter_irqoff
> -#if defined(CONFIG_CONTEXT_TRACKING) || defined(CONFIG_TRACE_IRQFLAGS)
>   bl  exit_to_user_mode
> -#endif
>   .endm
>  
>   .macro  clear_gp_regs
> @@ -147,32 +143,6 @@ alternative_cb_end
>  .L__asm_ssbd_skip\@:
>   .endm
>  
> - /* Check for MTE asynchronous tag check faults */
> - .macro check_mte_async_tcf, flgs, tmp
> -#ifdef CONFIG_ARM64_MTE
> -alternative_if_not ARM64_MTE
> - b   1f
> -alternative_else_nop_endif
> - mrs_s   \tmp, SYS_TFSRE0_EL1
> - tbz \tmp, #SYS_TFSR_EL1_TF0_SHIFT, 1f
> - /* Asynchronous TCF occurred for TTBR0 access, set the TI flag */
> - orr \flgs, \flgs, #_TIF_MTE_ASYNC_FAULT
> - str \flgs, [tsk, #TSK_TI_FLAGS]
> - msr_s   SYS_TFSRE0_EL1, xzr
> -1:
> -#endif
> - .endm
> -
> - /* Clear the MTE asynchronous tag check faults */
> - .macro clear_mte_async_tcf
> -#ifdef CONFIG_ARM64_MTE
> -alternative_if ARM64_MTE
> - dsb ish
> - msr_s   SYS_TFSRE0_EL1, xzr
> -alternative_else_nop_endif
> -#endif
> - .endm
> -
>   .macro mte_set_gcr, tmp, tmp2
>  #ifdef CONFIG_ARM64_MTE
>   /*
> @@ -243,8 +213,6 @@ alternative_else_nop_endif
>   ldr x19, [tsk, #TSK_TI_FLAGS]
>   disable_step_tsk x19, x20
>  
> - /* Check for asynchronous tag check faults in user space */
> - check_mte_async_tcf x19, x22
>   apply_ssbd 1, x22, x23
>  
>   ptrauth_keys_install_kernel tsk, x20, x22, x23
> @@ -775,8 +743,6 @@ SYM_CODE_START_LOCAL(ret_to_user)
>   cbnzx2, work_pending
>  

Re: [RFC PATCH v2 3/4] arm64: Detect FTRACE cases that make the stack trace unreliable

2021-04-09 Thread Mark Rutland
On Mon, Apr 05, 2021 at 03:43:12PM -0500, madve...@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" 
> 
> When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
> for a function, the ftrace infrastructure is called for the function at
> the very beginning. Ftrace creates two frames:
> 
>   - One for the traced function
> 
>   - One for the caller of the traced function
> 
> That gives a reliable stack trace while executing in the ftrace code. When
> ftrace returns to the traced function, the frames are popped and everything
> is back to normal.
> 
> However, in cases like live patch, a tracer function may redirect execution
> to a different function when it returns. A stack trace taken while still in
> the tracer function will not show the target function. The target function
> is the real function that we want to track.
> 
> So, if an FTRACE frame is detected on the stack, just mark the stack trace
> as unreliable. The detection is done by checking the return PC against
> FTRACE trampolines.
> 
> Also, the Function Graph Tracer modifies the return address of a traced
> function to a return trampoline to gather tracing data on function return.
> Stack traces taken from that trampoline and functions it calls are
> unreliable as the original return address may not be available in
> that context. Mark the stack trace unreliable accordingly.
> 
> Signed-off-by: Madhavan T. Venkataraman 
> ---
>  arch/arm64/kernel/entry-ftrace.S | 12 +++
>  arch/arm64/kernel/stacktrace.c   | 61 
>  2 files changed, 73 insertions(+)
> 
> diff --git a/arch/arm64/kernel/entry-ftrace.S 
> b/arch/arm64/kernel/entry-ftrace.S
> index b3e4f9a088b1..1f0714a50c71 100644
> --- a/arch/arm64/kernel/entry-ftrace.S
> +++ b/arch/arm64/kernel/entry-ftrace.S
> @@ -86,6 +86,18 @@ SYM_CODE_START(ftrace_caller)
>   b   ftrace_common
>  SYM_CODE_END(ftrace_caller)
>  
> +/*
> + * A stack trace taken from anywhere in the FTRACE trampoline code should be
> + * considered unreliable as a tracer function (patched at ftrace_call) could
> + * potentially set pt_regs->pc and redirect execution to a function different
> + * than the traced function. E.g., livepatch.

IIUC the issue here is that we have two copies of the pc: one in the regs
and one in a frame record, so after the update to the regs the frame
record is stale.

This is something that we could fix by having
ftrace_instruction_pointer_set() set both.

However, as noted elsewhere there are other issues which mean we'd still
need special unwinding code for this.

Thanks,
Mark.

> + *
> + * No stack traces are taken in this FTRACE trampoline assembly code. But
> + * they can be taken from C functions that get called from here. The unwinder
> + * checks if a return address falls in this FTRACE trampoline code. See
> + * stacktrace.c. If function calls in this code are changed, please keep the
> + * special_functions[] array in stacktrace.c in sync.
> + */
>  SYM_CODE_START(ftrace_common)
>   sub x0, x30, #AARCH64_INSN_SIZE // ip (callsite's BL insn)
>   mov x1, x9  // parent_ip (callsite's LR)
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index fb11e4372891..7a3c638d4aeb 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -51,6 +51,52 @@ struct function_range {
>   * unreliable. Breakpoints are used for executing probe code. Stack traces
>   * taken while in the probe code will show an EL1 frame and will be 
> considered
>   * unreliable. This is correct behavior.
> + *
> + * FTRACE
> + * ==
> + *
> + * When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled, the FTRACE trampoline 
> code
> + * is called from a traced function even before the frame pointer prolog.
> + * FTRACE sets up two stack frames (one for the traced function and one for
> + * its caller) so that the unwinder can provide a sensible stack trace for
> + * any tracer function called from the FTRACE trampoline code.
> + *
> + * There are two cases where the stack trace is not reliable.
> + *
> + * (1) The task gets preempted before the two frames are set up. Preemption
> + * involves an interrupt which is an EL1 exception. The unwinder already
> + * handles EL1 exceptions.
> + *
> + * (2) The tracer function that gets called by the FTRACE trampoline code
> + * changes the return PC (e.g., livepatch).
> + *
> + * Not all tracer functions do that. But to err on the side of safety,
> + * consider the stack trace as unreliable in all cases.
> + *
> + * When Function Graph Tracer is used, FTRACE modifies the return address of
> + * the traced function in its stack frame to an FTRACE return trampoline
> + * (return_to_handler). When the traced function returns, control goes to
> + * return_to_handler. return_to_handler calls FTRACE to gather tracing data
> + * and to obtain the original return address. Then, 

Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-09 Thread Mark Rutland
Hi Madhavan,

I've noted some concerns below. At a high-level, I'm not keen on the
blacklisting approach, and I think there's some other preparatory work
that would be more valuable in the short term.

On Mon, Apr 05, 2021 at 03:43:09PM -0500, madve...@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" 
> 
> There are a number of places in kernel code where the stack trace is not
> reliable. Enhance the unwinder to check for those cases and mark the
> stack trace as unreliable. Once all of the checks are in place, the unwinder
> can provide a reliable stack trace. But before this can be used for livepatch,
> some other entity needs to guarantee that the frame pointers are all set up
> correctly in kernel functions. objtool is currently being worked on to
> fill that gap.
> 
> Except for the return address check, all the other checks involve checking
> the return PC of every frame against certain kernel functions. To do this,
> implement some infrastructure code:
> 
>   - Define a special_functions[] array and populate the array with
> the special functions

I'm not too keen on having to manually collate this within the unwinder,
as it's very painful from a maintenance perspective. I'd much rather we
could associate this information with the implementations of these
functions, so that they're more likely to stay in sync.

Further, I believe all the special cases are assembly functions, and
most of those are already in special sections to begin with. I reckon
it'd be simpler and more robust to reject unwinding based on the
section. If we need to unwind across specific functions in those
sections, we could opt-in with some metadata. So e.g. we could reject
all functions in ".entry.text", special casing the EL0 entry functions
if necessary.
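
A rough sketch of that section-based check (the linker symbols are real, but
the helper itself is illustrative rather than part of this series):

  static bool pc_is_entry_text(unsigned long pc)
  {
          /* __entry_text_start/__entry_text_end bound .entry.text */
          return pc >= (unsigned long)__entry_text_start &&
                 pc <  (unsigned long)__entry_text_end;
  }

The unwinder could then mark any trace containing such a PC as unreliable,
with the EL0 entry functions opted back in via metadata if needed.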

As I mentioned before, I'm currently reworking the entry assembly to
make this simpler to do. I'd prefer to not make invasive changes in that
area until that's sorted.

I think there's a lot more code that we cannot unwind, e.g. KVM
exception code, or almost anything marked with SYM_CODE_END().

>   - Using kallsyms_lookup(), lookup the symbol table entries for the
> functions and record their address ranges
> 
>   - Define an is_reliable_function(pc) to match a return PC against
> the special functions.
> 
> The unwinder calls is_reliable_function(pc) for every return PC and marks
> the stack trace as reliable or unreliable accordingly.
> 
> Return address check
> 
> 
> Check the return PC of every stack frame to make sure that it is a valid
> kernel text address (and not some generated code, for example).
> 
> Detect EL1 exception frame
> ==
> 
> EL1 exceptions can happen on any instruction including instructions in
> the frame pointer prolog or epilog. Depending on where exactly they happen,
> they could render the stack trace unreliable.
> 
> Add all of the EL1 exception handlers to special_functions[].
> 
>   - el1_sync()
>   - el1_irq()
>   - el1_error()
>   - el1_sync_invalid()
>   - el1_irq_invalid()
>   - el1_fiq_invalid()
>   - el1_error_invalid()
> 
> Detect ftrace frame
> ===
> 
> When FTRACE executes at the beginning of a traced function, it creates two
> frames and calls the tracer function:
> 
>   - One frame for the traced function
> 
>   - One frame for the caller of the traced function
> 
> That gives a sensible stack trace while executing in the tracer function.
> When FTRACE returns to the traced function, the frames are popped and
> everything is back to normal.
> 
> However, in cases like live patch, the tracer function redirects execution
> to a different function. When FTRACE returns, control will go to that target
> function. A stack trace taken in the tracer function will not show the target
> function. The target function is the real function that we want to track.
> So, the stack trace is unreliable.

This doesn't match my understanding of the reliable stacktrace
requirements, but I might have misunderstood what you're saying here.

IIUC what you're describing here is:

1) A calls B
2) B is traced
3) tracer replaces B with TARGET
4) tracer returns to TARGET

... and if a stacktrace is taken at step 3 (before the return address is
patched), the trace will show B rather than TARGET.

My understanding is that this is legitimate behaviour.

> To detect stack traces from a tracer function, add the following to
> special_functions[]:
> 
>   - ftrace_call + 4
> 
> ftrace_call is the label at which the tracer function is patched in. So,
> ftrace_call + 4 is its return address. This is what will show up in a
> stack trace taken from the tracer function.
> 
> When Function Graph Tracing is on, ftrace_graph_caller is patched in
> at the label ftrace_graph_call. If a tracer function called before it has
> redirected execution as mentioned above, the stack traces taken from within
> ftrace_graph_caller will also 

Re: [PATCH] arm64: mte: Move MTE TCF0 check in entry-common

2021-04-08 Thread Mark Rutland
Hi Vincenzo,

On Thu, Apr 08, 2021 at 03:37:23PM +0100, Vincenzo Frascino wrote:
> The check_mte_async_tcf macro sets the TIF flag non-atomically. This can
> race with another CPU doing a set_tsk_thread_flag() and the flag can be
> lost in the process.
> 
> Move the tcf0 check to enter_from_user_mode() and clear tcf0 in
> exit_to_user_mode() to address the problem.

Beware that these are called at critical points of the entry sequence,
so we need to take care that nothing is instrumented (e.g. we can only
safely use noinstr functions here).

> Note: Moving the check in entry-common allows to use set_thread_flag()
> which is safe.
> 
> Fixes: 637ec831ea4f ("arm64: mte: Handle synchronous and asynchronous
> tag check faults")
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Reported-by: Will Deacon 
> Signed-off-by: Vincenzo Frascino 
> ---
>  arch/arm64/include/asm/mte.h |  8 
>  arch/arm64/kernel/entry-common.c |  6 ++
>  arch/arm64/kernel/entry.S| 30 --
>  arch/arm64/kernel/mte.c  | 25 +++--
>  4 files changed, 37 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 9b557a457f24..188f778c6f7b 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -31,6 +31,8 @@ void mte_invalidate_tags(int type, pgoff_t offset);
>  void mte_invalidate_tags_area(int type);
>  void *mte_allocate_tag_storage(void);
>  void mte_free_tag_storage(char *storage);
> +void check_mte_async_tcf0(void);
> +void clear_mte_async_tcf0(void);
>  
>  #ifdef CONFIG_ARM64_MTE
>  
> @@ -83,6 +85,12 @@ static inline int mte_ptrace_copy_tags(struct task_struct 
> *child,
>  {
>   return -EIO;
>  }
> +void check_mte_async_tcf0(void)
> +{
> +}
> +void clear_mte_async_tcf0(void)
> +{
> +}

Were these meant to be static inline?

>  static inline void mte_assign_mem_tag_range(void *addr, size_t size)
>  {
> diff --git a/arch/arm64/kernel/entry-common.c 
> b/arch/arm64/kernel/entry-common.c
> index 9d3588450473..837d3624a1d5 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -289,10 +289,16 @@ asmlinkage void noinstr enter_from_user_mode(void)
>   CT_WARN_ON(ct_state() != CONTEXT_USER);
>   user_exit_irqoff();
>   trace_hardirqs_off_finish();
> +
> + /* Check for asynchronous tag check faults in user space */
> + check_mte_async_tcf0();
>  }
>  
>  asmlinkage void noinstr exit_to_user_mode(void)
>  {
> + /* Ignore asynchronous tag check faults in the uaccess routines */
> + clear_mte_async_tcf0();
> +
>   trace_hardirqs_on_prepare();
>   lockdep_hardirqs_on_prepare(CALLER_ADDR0);
>   user_enter_irqoff();
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index a31a0a713c85..fafd74ae5021 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -147,32 +147,6 @@ alternative_cb_end
>  .L__asm_ssbd_skip\@:
>   .endm
>  
> - /* Check for MTE asynchronous tag check faults */
> - .macro check_mte_async_tcf, flgs, tmp
> -#ifdef CONFIG_ARM64_MTE
> -alternative_if_not ARM64_MTE
> - b   1f
> -alternative_else_nop_endif
> - mrs_s   \tmp, SYS_TFSRE0_EL1
> - tbz \tmp, #SYS_TFSR_EL1_TF0_SHIFT, 1f
> - /* Asynchronous TCF occurred for TTBR0 access, set the TI flag */
> - orr \flgs, \flgs, #_TIF_MTE_ASYNC_FAULT
> - str \flgs, [tsk, #TSK_TI_FLAGS]
> - msr_s   SYS_TFSRE0_EL1, xzr
> -1:
> -#endif
> - .endm
> -
> - /* Clear the MTE asynchronous tag check faults */
> - .macro clear_mte_async_tcf
> -#ifdef CONFIG_ARM64_MTE
> -alternative_if ARM64_MTE
> - dsb ish
> - msr_s   SYS_TFSRE0_EL1, xzr
> -alternative_else_nop_endif
> -#endif
> - .endm
> -
>   .macro mte_set_gcr, tmp, tmp2
>  #ifdef CONFIG_ARM64_MTE
>   /*
> @@ -243,8 +217,6 @@ alternative_else_nop_endif
>   ldr x19, [tsk, #TSK_TI_FLAGS]
>   disable_step_tsk x19, x20
>  
> - /* Check for asynchronous tag check faults in user space */
> - check_mte_async_tcf x19, x22
>   apply_ssbd 1, x22, x23
>  
>   ptrauth_keys_install_kernel tsk, x20, x22, x23
> @@ -775,8 +747,6 @@ SYM_CODE_START_LOCAL(ret_to_user)
>   cbnzx2, work_pending
>  finish_ret_to_user:
>   user_enter_irqoff
> - /* Ignore asynchronous tag check faults in the uaccess routines */
> - clear_mte_async_tcf
>   enable_step_tsk x19, x2
>  #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
>   bl  stackleak_erase
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index b3c70a612c7a..e759b0eca47e 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -166,14 +166,35 @@ static void set_gcr_el1_excl(u64 excl)
>*/
>  }
>  
> +void check_mte_async_tcf0(void)

As above, this'll need to be noinstr. I also reckon we should put this
in the header so that it can be inlined.

> +{
> + /*
> +  

Re: [PATCH] arm64: mte: Move MTE TCF0 check in entry-common

2021-04-08 Thread Mark Rutland
On Thu, Apr 08, 2021 at 03:56:04PM +0100, Will Deacon wrote:
> On Thu, Apr 08, 2021 at 03:37:23PM +0100, Vincenzo Frascino wrote:
> > The check_mte_async_tcf macro sets the TIF flag non-atomically. This can
> > race with another CPU doing a set_tsk_thread_flag() and the flag can be
> > lost in the process.
> 
> Actually, it's all the *other* flags that get lost!
> 
> > Move the tcf0 check to enter_from_user_mode() and clear tcf0 in
> > exit_to_user_mode() to address the problem.
> > 
> > Note: Moving the check in entry-common allows to use set_thread_flag()
> > which is safe.
> > 
> > Fixes: 637ec831ea4f ("arm64: mte: Handle synchronous and asynchronous
> > tag check faults")
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> > Reported-by: Will Deacon 
> > Signed-off-by: Vincenzo Frascino 
> > ---
> >  arch/arm64/include/asm/mte.h |  8 
> >  arch/arm64/kernel/entry-common.c |  6 ++
> >  arch/arm64/kernel/entry.S| 30 --
> >  arch/arm64/kernel/mte.c  | 25 +++--
> >  4 files changed, 37 insertions(+), 32 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> > index 9b557a457f24..188f778c6f7b 100644
> > --- a/arch/arm64/include/asm/mte.h
> > +++ b/arch/arm64/include/asm/mte.h
> > @@ -31,6 +31,8 @@ void mte_invalidate_tags(int type, pgoff_t offset);
> >  void mte_invalidate_tags_area(int type);
> >  void *mte_allocate_tag_storage(void);
> >  void mte_free_tag_storage(char *storage);
> > +void check_mte_async_tcf0(void);
> > +void clear_mte_async_tcf0(void);
> >  
> >  #ifdef CONFIG_ARM64_MTE
> >  
> > @@ -83,6 +85,12 @@ static inline int mte_ptrace_copy_tags(struct 
> > task_struct *child,
> >  {
> > return -EIO;
> >  }
> > +void check_mte_async_tcf0(void)
> > +{
> > +}
> > +void clear_mte_async_tcf0(void)
> > +{
> > +}
> >  
> >  static inline void mte_assign_mem_tag_range(void *addr, size_t size)
> >  {
> > diff --git a/arch/arm64/kernel/entry-common.c 
> > b/arch/arm64/kernel/entry-common.c
> > index 9d3588450473..837d3624a1d5 100644
> > --- a/arch/arm64/kernel/entry-common.c
> > +++ b/arch/arm64/kernel/entry-common.c
> > @@ -289,10 +289,16 @@ asmlinkage void noinstr enter_from_user_mode(void)
> > CT_WARN_ON(ct_state() != CONTEXT_USER);
> > user_exit_irqoff();
> > trace_hardirqs_off_finish();
> > +
> > +   /* Check for asynchronous tag check faults in user space */
> > +   check_mte_async_tcf0();
> >  }
> 
> Is enter_from_user_mode() always called when we enter the kernel from EL0?
> afaict, some paths (e.g. el0_irq()) only end up calling it if
> CONTEXT_TRACKING or TRACE_IRQFLAGS are enabled.

Currently everything that's in {enter,exit}_from_user_mode() only
matters when either CONTEXT_TRACKING or TRACE_IRQFLAGS is selected (and
expands to an empty stub otherwise).

We could drop the ifdeffery in user_{enter,exit}_irqoff() to have them
called regardless, or add CONFIG_MTE to the list.

> >  asmlinkage void noinstr exit_to_user_mode(void)
> >  {
> > +   /* Ignore asynchronous tag check faults in the uaccess routines */
> > +   clear_mte_async_tcf0();
> > +
> 
> and this one seems to be called even less often.

This is always done in ret_to_user, so (modulo the ifdeffery above) all
returns to EL0 call this.

Thanks,
Mark.


Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

2021-04-08 Thread Mark Rutland
On Wed, Apr 07, 2021 at 01:44:37PM +0100, Will Deacon wrote:
> [Moving Mark to To: since I'd like his view on this]
> 
> On Thu, Apr 01, 2021 at 02:45:21PM -0500, Rob Herring wrote:
> > On Wed, Mar 31, 2021 at 11:01 AM Will Deacon  wrote:
> > >
> > > On Tue, Mar 30, 2021 at 12:09:38PM -0500, Rob Herring wrote:
> > > > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon  wrote:
> > > > >
> > > > > On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> > > > > > From: Raphael Gault 

> > > > > > +static void armv8pmu_event_unmapped(struct perf_event *event, 
> > > > > > struct mm_struct *mm)
> > > > > > +{
> > > > > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > > > > +
> > > > > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > > > + return;
> > > > > > +
> > > > > > + if (atomic_dec_and_test(>context.pmu_direct_access))
> > > > > > + on_each_cpu_mask(>supported_cpus, 
> > > > > > refresh_pmuserenr, mm, 1);
> > > > >
> > > > > Given that the pmu_direct_access field is global per-mm, won't this go
> > > > > wrong if multiple PMUs are opened by the same process but only a 
> > > > > subset
> > > > > are exposed to EL0? Perhaps pmu_direct_access should be treated as a 
> > > > > mask
> > > > > rather than a counter, so that we can 'and' it with the 
> > > > > supported_cpus for
> > > > > the PMU we're dealing with.
> > > >
> > > > It needs to be a count to support multiple events on the same PMU. If
> > > > the event is not enabled for EL0, then we'd exit out on the
> > > > ARMPMU_EL0_RD_CNTR check. So I think we're fine.
> > >
> > > I'm still not convinced; pmu_direct_access is shared between PMUs, so
> > > testing the result of atomic_dec_and_test() just doesn't make sense to
> > > me, as another PMU could be playing with the count.
> > 
> > How is that a problem? Let's make a concrete example:
> > 
> > map PMU1:event2 -> pmu_direct_access = 1 -> enable access
> > map PMU2:event3 -> pmu_direct_access = 2
> > map PMU1:event4 -> pmu_direct_access = 3
> > unmap PMU2:event3 -> pmu_direct_access = 2
> > unmap PMU1:event2 -> pmu_direct_access = 1
> > unmap PMU1:event4 -> pmu_direct_access = 0 -> disable access
> > 
> > The only issue I can see is PMU2 remains enabled for user access until
> > the last unmap. But we're sharing the mm, so who cares? Also, in this
> > scenario it is the user's problem to pin themselves to cores sharing a
> > PMU. If the user doesn't do that, they get to keep the pieces.
> 
> I guess I'm just worried about exposing the counters to userspace after
> the PMU driver (or perf core?) thinks that they're no longer exposed in
> case we leak other events.

IMO that's not practically different from the single-PMU case (i.e.
multi-PMU isn't material; either we have a concern with leaking or we
don't); more on that below.

While it looks odd to place this on the mm, I don't think it's the end
of the world.

> However, I'm not sure how this is supposed to work normally: what
> happens if e.g. a privileged user has a per-cpu counter for a kernel
> event while a task has a counter with direct access -- can that task
> read the kernel event out of the PMU registers from userspace?

Yes -- userspace could go read any counters even though it isn't
supposed to, and could potentially infer information from those. It
won't have access to the config registers or kernel data structures, so
it isn't guaranteed to know what the event is or when it is
context-switched/reprogrammed/etc.

If we believe that's a problem, then it's difficult to do anything
robust other than denying userspace access entirely, since disabling
userspace access while in use would surprise applications, and denying
privileged events would need some global state that we consult at event
creation time (in addition to being an inversion of privilege).

IIRC there was some fuss about this a while back on x86; I'll go dig and
see what I can find, unless Peter has a memory...

Thanks,
Mark.


Re: [PATCH v5 16/18] arm64: ftrace: use function_nocfi for ftrace_call

2021-04-06 Thread Mark Rutland
On Thu, Apr 01, 2021 at 04:32:14PM -0700, Sami Tolvanen wrote:
> With CONFIG_CFI_CLANG, the compiler replaces function pointers with
> jump table addresses, which breaks dynamic ftrace as the address of
> ftrace_call is replaced with the address of ftrace_call.cfi_jt. Use
> function_nocfi() to get the address of the actual function instead.
> 
> Suggested-by: Ben Dai 
> Signed-off-by: Sami Tolvanen 
> ---
>  arch/arm64/kernel/ftrace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
> index 86a5cf9bc19a..b5d3ddaf69d9 100644
> --- a/arch/arm64/kernel/ftrace.c
> +++ b/arch/arm64/kernel/ftrace.c
> @@ -55,7 +55,7 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
>   unsigned long pc;
>   u32 new;
>  
> - pc = (unsigned long)_call;
> +     pc = (unsigned long)function_nocfi(ftrace_call);

Acked-by: Mark Rutland 

Thanks,
Mark.

>   new = aarch64_insn_gen_branch_imm(pc, (unsigned long)func,
> AARCH64_INSN_BRANCH_LINK);
>  
> -- 
> 2.31.0.208.g409f899ff0-goog
> 


Re: [PATCH v5 14/18] arm64: add __nocfi to functions that jump to a physical address

2021-04-06 Thread Mark Rutland
[adding Ard for EFI runtime services bits]

On Thu, Apr 01, 2021 at 04:32:12PM -0700, Sami Tolvanen wrote:
> Disable CFI checking for functions that switch to linear mapping and
> make an indirect call to a physical address, since the compiler only
> understands virtual addresses and the CFI check for such indirect calls
> would always fail.

What does physical vs virtual have to do with this? Does the address
actually matter, or is this just a general thing that when calling an
assembly function we won't have a trampoline that the caller expects?

I wonder if we need to do something with asmlinkage here, perhaps?

I didn't spot anything in the series handling EFI runtime services
calls, and I strongly suspect we need to do something for those, unless
they're handled implicitly by something else.

> Signed-off-by: Sami Tolvanen 
> Reviewed-by: Kees Cook 
> ---
>  arch/arm64/include/asm/mmu_context.h | 2 +-
>  arch/arm64/kernel/cpu-reset.h| 8 
>  arch/arm64/kernel/cpufeature.c   | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mmu_context.h 
> b/arch/arm64/include/asm/mmu_context.h
> index 386b96400a57..d3cef9133539 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -119,7 +119,7 @@ static inline void cpu_install_idmap(void)
>   * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
>   * avoiding the possibility of conflicting TLB entries being allocated.
>   */
> -static inline void cpu_replace_ttbr1(pgd_t *pgdp)
> +static inline void __nocfi cpu_replace_ttbr1(pgd_t *pgdp)

Given these are inlines, what's the effect when these are inlined into a
function that would normally use CFI? Does CFI get suppressed for the
whole function, or just the bit that got inlined?

Is there an attribute that we could place on a function pointer to tell
the compiler to not check calls via that pointer? If that existed we'd
be able to scope this much more tightly.

Thanks,
Mark.

>  {
>   typedef void (ttbr_replace_func)(phys_addr_t);
>   extern ttbr_replace_func idmap_cpu_replace_ttbr1;
> diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
> index f3adc574f969..9a7b1262ef17 100644
> --- a/arch/arm64/kernel/cpu-reset.h
> +++ b/arch/arm64/kernel/cpu-reset.h
> @@ -13,10 +13,10 @@
>  void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
>   unsigned long arg0, unsigned long arg1, unsigned long arg2);
>  
> -static inline void __noreturn cpu_soft_restart(unsigned long entry,
> -unsigned long arg0,
> -unsigned long arg1,
> -unsigned long arg2)
> +static inline void __noreturn __nocfi cpu_soft_restart(unsigned long entry,
> +unsigned long arg0,
> +unsigned long arg1,
> +unsigned long arg2)
>  {
>   typeof(__cpu_soft_restart) *restart;
>  
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 0b2e0d7b13ec..c2f94a5206e0 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1445,7 +1445,7 @@ static bool unmap_kernel_at_el0(const struct 
> arm64_cpu_capabilities *entry,
>  }
>  
>  #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
> -static void
> +static void __nocfi
>  kpti_install_ng_mappings(const struct arm64_cpu_capabilities *__unused)
>  {
>   typedef void (kpti_remap_fn)(int, int, phys_addr_t);
> -- 
> 2.31.0.208.g409f899ff0-goog
> 


Re: [PATCH v5 13/18] arm64: use function_nocfi with __pa_symbol

2021-04-06 Thread Mark Rutland
endianness before jumping. This is mandated by
>* the boot protocol.
>*/
> - writeq_relaxed(__pa_symbol(secondary_holding_pen), release_addr);
> + writeq_relaxed(__pa_symbol(function_nocfi(secondary_holding_pen)),
> +release_addr);

Likewise here? e.g. at the start of the function have:

phys_addr_t pa_holding_pen = 
__pa_symbol(function_nocfi(secondary_holding_pen));

... then here have:

writeq_relaxed(pa_holding_pen, release_addr);

With those:

Acked-by: Mark Rutland 

Thanks,
Mark.

>   __flush_dcache_area((__force void *)release_addr,
>   sizeof(*release_addr));
>  
> -- 
> 2.31.0.208.g409f899ff0-goog
> 


Re: [PATCH v5 12/18] arm64: implement function_nocfi

2021-04-06 Thread Mark Rutland
On Thu, Apr 01, 2021 at 04:32:10PM -0700, Sami Tolvanen wrote:
> With CONFIG_CFI_CLANG, the compiler replaces function addresses in
> instrumented C code with jump table addresses. This change implements
> the function_nocfi() macro, which returns the actual function address
> instead.
> 
> Signed-off-by: Sami Tolvanen 
> Reviewed-by: Kees Cook 

I think that it's unfortunate that we have to drop to assembly here, but
given this is infrequent I agree it's not the end of the world, so:

Acked-by: Mark Rutland 

> ---
>  arch/arm64/include/asm/memory.h | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index 0aabc3be9a75..b55410afd3d1 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -321,6 +321,21 @@ static inline void *phys_to_virt(phys_addr_t x)
>  #define virt_to_pfn(x)   __phys_to_pfn(__virt_to_phys((unsigned 
> long)(x)))
>  #define sym_to_pfn(x)__phys_to_pfn(__pa_symbol(x))
>  
> +#ifdef CONFIG_CFI_CLANG
> +/*
> + * With CONFIG_CFI_CLANG, the compiler replaces function address
> + * references with the address of the function's CFI jump table
> + * entry. The function_nocfi macro always returns the address of the
> + * actual function instead.
> + */
> +#define function_nocfi(x) ({ \
> + void *addr; \
> + asm("adrp %0, " __stringify(x) "\n\t"   \
> + "add  %0, %0, :lo12:" __stringify(x) : "=r" (addr));\

If it's not too painful, could we please move the asm constraint onto its
own line? That makes it slightly easier to read, and aligns with what
we've (mostly) done elsewhere in arm64.
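
i.e. something along these lines (illustrative only, not build-tested):

    #define function_nocfi(x) ({                                        \
            void *addr;                                                 \
            asm("adrp %0, " __stringify(x) "\n\t"                       \
                "add  %0, %0, :lo12:" __stringify(x)                    \
                : "=r" (addr));                                         \
            addr;                                                       \
    })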

Not a big deal either way, and the ack stands regardless.

Thanks,
Mark.

> + addr;   \
> +})
> +#endif
> +
>  /*
>   *  virt_to_page(x)  convert a _valid_ virtual address to struct page *
>   *  virt_addr_valid(x)   indicates whether a virtual address is valid
> -- 
> 2.31.0.208.g409f899ff0-goog
> 


Re: [PATCH v5 11/18] psci: use function_nocfi for cpu_resume

2021-04-06 Thread Mark Rutland
On Thu, Apr 01, 2021 at 04:32:09PM -0700, Sami Tolvanen wrote:
> With CONFIG_CFI_CLANG, the compiler replaces function pointers with
> jump table addresses, which results in __pa_symbol returning the
> physical address of the jump table entry. As the jump table contains
> an immediate jump to an EL1 virtual address, this typically won't
> work as intended. Use function_nocfi to get the actual address of
> cpu_resume.
> 
> Signed-off-by: Sami Tolvanen 
> Reviewed-by: Kees Cook 

Acked-by: Mark Rutland 

Mark.

> ---
>  drivers/firmware/psci/psci.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
> index f5fc429cae3f..64344e84bd63 100644
> --- a/drivers/firmware/psci/psci.c
> +++ b/drivers/firmware/psci/psci.c
> @@ -325,8 +325,9 @@ static int __init psci_features(u32 psci_func_id)
>  static int psci_suspend_finisher(unsigned long state)
>  {
>   u32 power_state = state;
> + phys_addr_t pa_cpu_resume = __pa_symbol(function_nocfi(cpu_resume));
>  
> - return psci_ops.cpu_suspend(power_state, __pa_symbol(cpu_resume));
> + return psci_ops.cpu_suspend(power_state, pa_cpu_resume);
>  }
>  
>  int psci_cpu_suspend_enter(u32 state)
> @@ -344,8 +345,10 @@ int psci_cpu_suspend_enter(u32 state)
>  
>  static int psci_system_suspend(unsigned long unused)
>  {
> + phys_addr_t pa_cpu_resume = __pa_symbol(function_nocfi(cpu_resume));
> +
>   return invoke_psci_fn(PSCI_FN_NATIVE(1_0, SYSTEM_SUSPEND),
> -   __pa_symbol(cpu_resume), 0, 0);
> +   pa_cpu_resume, 0, 0);
>  }
>  
>  static int psci_system_suspend_enter(suspend_state_t state)
> -- 
> 2.31.0.208.g409f899ff0-goog
> 


Re: [PATCH v5 03/18] mm: add generic function_nocfi macro

2021-04-06 Thread Mark Rutland
On Thu, Apr 01, 2021 at 04:32:01PM -0700, Sami Tolvanen wrote:
> With CONFIG_CFI_CLANG, the compiler replaces function addresses
> in instrumented C code with jump table addresses. This means that
> __pa_symbol(function) returns the physical address of the jump table
> entry instead of the actual function, which may not work as the jump
> table code will immediately jump to a virtual address that may not be
> mapped.
> 
> To avoid this address space confusion, this change adds a generic
> definition for function_nocfi(), which architectures that support CFI
> can override. The typical implementation of would use inline assembly
> to take the function address, which avoids compiler instrumentation.
> 
> Signed-off-by: Sami Tolvanen 
> Reviewed-by: Kees Cook 

FWIW:

Acked-by: Mark Rutland 

Mark.

> ---
>  include/linux/mm.h | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8ba434287387..22cce9c7dd05 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -124,6 +124,16 @@ extern int mmap_rnd_compat_bits __read_mostly;
>  #define lm_alias(x)  __va(__pa_symbol(x))
>  #endif
>  
> +/*
> + * With CONFIG_CFI_CLANG, the compiler replaces function addresses in
> + * instrumented C code with jump table addresses. Architectures that
> + * support CFI can define this macro to return the actual function address
> + * when needed.
> + */
> +#ifndef function_nocfi
> +#define function_nocfi(x) (x)
> +#endif
> +
>  /*
>   * To prevent common memory management code establishing
>   * a zero page mapping on a read fault.
> -- 
> 2.31.0.208.g409f899ff0-goog
> 


Re: [RFC PATCH v1 1/1] arm64: Implement stack trace termination record

2021-03-29 Thread Mark Rutland
Hi Madhavan,

Overall this looks pretty good; I have a few comments below.

On Wed, Mar 24, 2021 at 01:46:07PM -0500, madve...@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" 
> 
> The unwinder needs to be able to reliably tell when it has reached the end
> of a stack trace. One way to do this is to have the last stack frame at a
> fixed offset from the base of the task stack. When the unwinder reaches
> that offset, it knows it is done.

To make the relationship with reliable stacktrace clearer, how about:

| Reliable stacktracing requires that we identify when a stacktrace is
| terminated early. We can do this by ensuring all tasks have a final
| frame record at a known location on their task stack, and checking
| that this is the final frame record in the chain.

Currently we use inconsistent terminology to refer to the final frame
record, and it would be good if we could be consistent. The existing
code uses "terminal record" (which I appreciate isn't clear), and this
series largely uses "last frame". It'd be nice to make that consistent.

For clarity could we please use "final" rather than "last"? That avoids
the ambiguity of "last" also meaning "previous".

e.g. below this'd mean having `setup_final_frame`.

> 
> Kernel Tasks
> 
> 
> All tasks except the idle task have a pt_regs structure right after the
> task stack. This is called the task pt_regs. The pt_regs structure has a
> special stackframe field. Make this stackframe field the last frame in the
> task stack. This needs to be done in copy_thread() which initializes a new
> task's pt_regs and initial CPU context.
> 
> For the idle task, there is no task pt_regs. For our purpose, we need one.
> So, create a pt_regs just like other kernel tasks and make
> pt_regs->stackframe the last frame in the idle task stack. This needs to be
> done at two places:
> 
>   - On the primary CPU, the boot task runs. It calls start_kernel()
> and eventually becomes the idle task for the primary CPU. Just
> before start_kernel() is called, set up the last frame.
> 
>   - On each secondary CPU, a startup task runs that calls
> secondary_startup_kernel() and eventually becomes the idle task
> on the secondary CPU. Just before secondary_start_kernel() is
> called, set up the last frame.
> 
> User Tasks
> ==
> 
> User tasks are initially set up like kernel tasks when they are created.
> Then, they return to userland after fork via ret_from_fork(). After that,
> they enter the kernel only on an EL0 exception. (In arm64, system calls are
> also EL0 exceptions). The EL0 exception handler stores state in the task
> pt_regs and calls different functions based on the type of exception. The
> stack trace for an EL0 exception must end at the task pt_regs. So, make
> task pt_regs->stackframe as the last frame in the EL0 exception stack.
> 
> In summary, task pt_regs->stackframe is where a successful stack trace ends.
> 
> Stack trace termination
> ===
> 
> In the unwinder, terminate the stack trace successfully when
> task_pt_regs(task)->stackframe is reached. For stack traces in the kernel,
> this will correctly terminate the stack trace at the right place.
> 
> However, debuggers terminate the stack trace when FP == 0. In the
> pt_regs->stackframe, the PC is 0 as well. So, stack traces taken in the
> debugger may print an extra record 0x0 at the end. While this is not
> pretty, this does not do any harm. This is a small price to pay for
> having reliable stack trace termination in the kernel.
> 
> Signed-off-by: Madhavan T. Venkataraman 
> ---
>  arch/arm64/kernel/entry.S  |  8 +---
>  arch/arm64/kernel/head.S   | 28 
>  arch/arm64/kernel/process.c|  5 +
>  arch/arm64/kernel/stacktrace.c |  8 
>  4 files changed, 38 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index a31a0a713c85..e2dc2e998934 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -261,16 +261,18 @@ alternative_else_nop_endif
>   stp lr, x21, [sp, #S_LR]
>  
>   /*
> -  * For exceptions from EL0, terminate the callchain here.
> +  * For exceptions from EL0, terminate the callchain here at
> +  * task_pt_regs(current)->stackframe.
> +  *
>* For exceptions from EL1, create a synthetic frame record so the
>* interrupted code shows up in the backtrace.
>*/
>   .if \el == 0
> - mov x29, xzr
> + stp xzr, xzr, [sp, #S_STACKFRAME]
>   .else
>   stp x29, x22, [sp, #S_STACKFRAME]
> - add x29, sp, #S_STACKFRAME
>   .endif
> + add x29, sp, #S_STACKFRAME
>  
>  #ifdef CONFIG_ARM64_SW_TTBR0_PAN
>  alternative_if_not ARM64_HAS_PAN
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 840bda1869e9..b8003fb9cfa5 100644
> --- a/arch/arm64/kernel/head.S
> +++ 

Re: [PATCH v3 12/17] arm64: implement __va_function

2021-03-25 Thread Mark Rutland
On Tue, Mar 23, 2021 at 01:39:41PM -0700, Sami Tolvanen wrote:
> With CONFIG_CFI_CLANG, the compiler replaces function addresses in
> instrumented C code with jump table addresses. This change implements
> the __va_function() macro, which returns the actual function address
> instead.
> 
> Signed-off-by: Sami Tolvanen 
> Reviewed-by: Kees Cook 

Is there really no attribute or builtin that can be used to do this
without assembly?

IIUC from other patches the symbol tables will contain the "real"
non-cfi entry points (unless we explicitly asked to make the jump table
address canonical), so AFAICT here the compiler should have all the
necessary information to generate either the CFI or non-CFI entry point
addresses, even if it doesn't expose an interface for that today.

It'd be a lot nicer if we could get the compiler to do this for us.

Thanks,
Mark.

> ---
>  arch/arm64/include/asm/memory.h | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index 0aabc3be9a75..9a4887808681 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -321,6 +321,21 @@ static inline void *phys_to_virt(phys_addr_t x)
>  #define virt_to_pfn(x)   __phys_to_pfn(__virt_to_phys((unsigned 
> long)(x)))
>  #define sym_to_pfn(x)__phys_to_pfn(__pa_symbol(x))
>  
> +#ifdef CONFIG_CFI_CLANG
> +/*
> + * With CONFIG_CFI_CLANG, the compiler replaces function address
> + * references with the address of the function's CFI jump table
> + * entry. The __va_function macro always returns the address of the
> + * actual function instead.
> + */
> +#define __va_function(x) ({  \
> + void *addr; \
> + asm("adrp %0, " __stringify(x) "\n\t"   \
> + "add  %0, %0, :lo12:" __stringify(x) : "=r" (addr));\
> + addr;   \
> +})
> +#endif
> +
>  /*
>   *  virt_to_page(x)  convert a _valid_ virtual address to struct page *
>   *  virt_addr_valid(x)   indicates whether a virtual address is valid
> -- 
> 2.31.0.291.g576ba9dcdaf-goog
> 


Re: [PATCH v3 11/17] psci: use __pa_function for cpu_resume

2021-03-25 Thread Mark Rutland
On Tue, Mar 23, 2021 at 01:39:40PM -0700, Sami Tolvanen wrote:
> With CONFIG_CFI_CLANG, the compiler replaces function pointers with
> jump table addresses, which results in __pa_symbol returning the
> physical address of the jump table entry. As the jump table contains
> an immediate jump to an EL1 virtual address, this typically won't
> work as intended. Use __pa_function instead to get the address to
> cpu_resume.
> 
> Signed-off-by: Sami Tolvanen 
> Reviewed-by: Kees Cook 
> ---
>  drivers/firmware/psci/psci.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
> index f5fc429cae3f..facd3cce3244 100644
> --- a/drivers/firmware/psci/psci.c
> +++ b/drivers/firmware/psci/psci.c
> @@ -326,7 +326,7 @@ static int psci_suspend_finisher(unsigned long state)
>  {
>   u32 power_state = state;
>  
> - return psci_ops.cpu_suspend(power_state, __pa_symbol(cpu_resume));
> + return psci_ops.cpu_suspend(power_state, __pa_function(cpu_resume));

As mentioned on the patch adding __pa_function(), I'd prefer to keep the
whatever->phys conversion separate from the CFI removal, since we have a
number of distinct virtual address ranges with differing conversions to
phys, and I don't think it's clear that __pa_function() only works for
kernel symbols (rather than module addresses, linear map addresses,
etc).

Other than that, I'm happy in principle with wrapping this. I suspect
that for clarity we should add an intermediate variable, e.g.

| phys_addr_t pa_cpu_resume = __pa_symbol(function_nocfi(cpu_resume));
| return psci_ops.cpu_suspend(power_state, pa_cpu_resume);

Thanks,
Mark.

>  }
>  
>  int psci_cpu_suspend_enter(u32 state)
> @@ -345,7 +345,7 @@ int psci_cpu_suspend_enter(u32 state)
>  static int psci_system_suspend(unsigned long unused)
>  {
>   return invoke_psci_fn(PSCI_FN_NATIVE(1_0, SYSTEM_SUSPEND),
> -   __pa_symbol(cpu_resume), 0, 0);
> +   __pa_function(cpu_resume), 0, 0);
>  }
>  
>  static int psci_system_suspend_enter(suspend_state_t state)
> -- 
> 2.31.0.291.g576ba9dcdaf-goog
> 


Re: [PATCH v3 03/17] mm: add generic __va_function and __pa_function macros

2021-03-25 Thread Mark Rutland
On Wed, Mar 24, 2021 at 08:54:18AM -0700, Sami Tolvanen wrote:
> On Wed, Mar 24, 2021 at 12:14 AM Christoph Hellwig  wrote:
> >
> > On Tue, Mar 23, 2021 at 01:39:32PM -0700, Sami Tolvanen wrote:
> > > With CONFIG_CFI_CLANG, the compiler replaces function addresses
> > > in instrumented C code with jump table addresses. This means that
> > > __pa_symbol(function) returns the physical address of the jump table
> > > entry instead of the actual function, which may not work as the jump
> > > table code will immediately jump to a virtual address that may not be
> > > mapped.
> > >
> > > To avoid this address space confusion, this change adds generic
> > > definitions for __va_function and __pa_function, which architectures
> > > that support CFI can override. The typical implementation of the
> > > __va_function macro would use inline assembly to take the function
> > > address, which avoids compiler instrumentation.
> >
> > I think these helper are sensible, but shouldn't they have somewhat
> > less arcane names and proper documentation?
> 
> Good point, I'll add comments in the next version. I thought
> __pa_function would be a fairly straightforward replacement for
> __pa_symbol, but I'm fine with renaming these. Any suggestions for
> less arcane names?

I think dropping 'nocfi' into the name would be clear enough. I think
that given the usual fun with {symbol,module,virt}->phys conversions
it's not worth having the __pa_* form, and we can leave the phys
conversion to the caller that knows where the function lives.

How about we just add `function_nocfi()` ?

Callers can then do `__pa_symbol(function_nocfi(foo))` and similar.
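
As a rough sketch of the shape (assuming the obvious pass-through
fallback; not a concrete patch):

    /* Generic fallback; architectures with CFI provide their own. */
    #ifndef function_nocfi
    #define function_nocfi(x) (x)
    #endif

    /* Caller side, for a symbol in the kernel image: */
    phys_addr_t pa_cpu_resume = __pa_symbol(function_nocfi(cpu_resume));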

Thanks,
Mark.


Re: [PATCH v9 1/7] smccc: Add HVC call variant with result registers other than 0 thru 3

2021-03-25 Thread Mark Rutland
On Thu, Mar 25, 2021 at 04:55:51AM +, Michael Kelley wrote:
> From: Mark Rutland  Sent: Wednesday, March 24, 2021 
> 9:55 AM
> > For the benefit of others here, SMCCCv1.2 allows:
> > 
> > * SMC64/HVC64 to use all of x1-x17 for both parameters and return values
> > * SMC32/HVC32 to use all of r1-r7 for both parameters and return values
> > 
> > The rationale for this was to make it possible to pass a large number of
> > arguments in one call without the hypervisor/firmware needing to access
> > the memory of the caller.
> > 
> > My preference would be to add arm_smccc_1_2_{hvc,smc}() assembly
> > functions which read all the permitted argument registers from a struct,
> > and write all the permitted result registers to a struct, leaving it to
> > callers to set those up and decompose them.
> > 
> > That way we only have to write one implementation that all callers can
> > use, which'll be far easier to maintain. I suspect that in general the
> > cost of temporarily bouncing the values through memory will be dominated
> > by whatever the hypervisor/firmware is going to do, and if it's not we
> > can optimize that away in future.
> 
> Thanks for the feedback, and I'm working on implementing this approach.
> But I've hit a snag in that gcc limits the "asm" statement to 30 arguments,
> which gives us 15 registers as parameters and 15 registers as return
> values, instead of the 18 each allowed by SMCCC v1.2.  I will continue
> with the 15 register limit for now, unless someone knows a way to exceed
> that.  The alternative would be to go to pure assembly language.

I realise in retrospect this is not clear, but when I said "assembly
functions" I had meant raw assembly functions rather than inline
assembly.

We already have __arm_smccc_smc and __arm_smccc_hvc assembly functions
in arch/{arm,arm64}/kernel/smccc-call.S, and I'd expected we'd add the
full fat SMCCCv1.2 variants there.

Thanks,
Mark.


Re: [RFT PATCH v3 13/27] arm64: Add Apple vendor-specific system registers

2021-03-24 Thread Mark Rutland
On Wed, Mar 24, 2021 at 06:38:18PM +, Will Deacon wrote:
> On Fri, Mar 05, 2021 at 06:38:48AM +0900, Hector Martin wrote:
> > Apple ARM64 SoCs have a ton of vendor-specific registers we're going to
> > have to deal with, and those don't really belong in sysreg.h with all
> > the architectural registers. Make a new home for them, and add some
> > registers which are useful for early bring-up.
> > 
> > Signed-off-by: Hector Martin 
> > ---
> >  MAINTAINERS   |  1 +
> >  arch/arm64/include/asm/sysreg_apple.h | 69 +++
> >  2 files changed, 70 insertions(+)
> >  create mode 100644 arch/arm64/include/asm/sysreg_apple.h
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index aec14fbd61b8..3a352c687d4b 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -1646,6 +1646,7 @@ B:https://github.com/AsahiLinux/linux/issues
> >  C: irc://chat.freenode.net/asahi-dev
> >  T: git https://github.com/AsahiLinux/linux.git
> >  F: Documentation/devicetree/bindings/arm/apple.yaml
> > +F: arch/arm64/include/asm/sysreg_apple.h
> 
> (this isn't needed with my suggestion below).
> 
> >  ARM/ARTPEC MACHINE SUPPORT
> >  M: Jesper Nilsson 
> > diff --git a/arch/arm64/include/asm/sysreg_apple.h 
> > b/arch/arm64/include/asm/sysreg_apple.h
> > new file mode 100644
> > index ..48347a51d564
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/sysreg_apple.h
> 
> I doubt apple are the only folks doing this, so can we instead have
> sysreg-impdef.h please, and then have an Apple section in there for these
> registers? That way, we could also have an imp_sys_reg() macro to limit
> CRn to 11 or 15, which is the reserved encoding space for these registers.
> 
> We'll cc you for any patches touching the Apple parts, as we don't have
> the first clue about what's hiding in there.

For existing IMP-DEF sysregs (e.g. the Kryo L2 control registers), we've
put the definitions in the drivers, rather than collating
non-architectural bits under arch/arm64/.

So far we've kept arch/arm64/ largely devoid of IMP-DEF bits, and it
seems a shame to add something with the sole purpose of collating that,
especially given arch code shouldn't need to touch these if FW and
bootloader have done their jobs right.

Can we put the definitions in the relevant drivers? That would sidestep
any pain with MAINTAINERS, too.
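
As a purely hypothetical illustration (the register name and encoding
below are invented), a driver can keep such a definition private using
the existing sys_reg() and read_sysreg_s() helpers:

    #include <asm/sysreg.h>

    /* Hypothetical IMP-DEF register, encoded in the op0=3, CRn={11,15} space. */
    #define SYS_VENDOR_WIDGET_EL1   sys_reg(3, 0, 15, 1, 0)

    static u64 vendor_widget_read(void)
    {
            return read_sysreg_s(SYS_VENDOR_WIDGET_EL1);
    }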

Thanks,
Mark.


Re: [PATCH v9 1/7] smccc: Add HVC call variant with result registers other than 0 thru 3

2021-03-24 Thread Mark Rutland
Hi Michael,

On Mon, Mar 08, 2021 at 11:57:13AM -0800, Michael Kelley wrote:
> Hypercalls to Hyper-V on ARM64 may return results in registers other
> than X0 thru X3, as permitted by the SMCCC spec version 1.2 and later.
> Accommodate this by adding a variant of arm_smccc_1_1_hvc that allows
> the caller to specify which 3 registers are returned in addition to X0.
> 
> Signed-off-by: Michael Kelley 
> ---
> There are several ways to support returning results from registers
> other than X0 thru X3, and Hyper-V usage should be compatible with
> whatever the maintainers prefer.  What's implemented in this patch
> may be the most flexible, but it has the downside of not being a
> true function interface in that args 0 thru 2 must be fixed strings,
> and not general "C" expressions.

For the benefit of others here, SMCCCv1.2 allows:

* SMC64/HVC64 to use all of x1-x17 for both parameters and return values
* SMC32/HVC32 to use all of r1-r7 for both parameters and return values

The rationale for this was to make it possible to pass a large number of
arguments in one call without the hypervisor/firmware needing to access
the memory of the caller.

My preference would be to add arm_smccc_1_2_{hvc,smc}() assembly
functions which read all the permitted argument registers from a struct,
and write all the permitted result registers to a struct, leaving it to
callers to set those up and decompose them.

That way we only have to write one implementation that all callers can
use, which'll be far easier to maintain. I suspect that in general the
cost of temporarily bouncing the values through memory will be dominated
by whatever the hypervisor/firmware is going to do, and if it's not we
can optimize that away in future.
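
Roughly, I'm imagining something like the below (a sketch of the shape,
not the final naming or layout):

    struct arm_smccc_1_2_regs {
            unsigned long a0, a1, a2, a3, a4, a5, a6, a7, a8;
            unsigned long a9, a10, a11, a12, a13, a14, a15, a16, a17;
    };

    asmlinkage void arm_smccc_1_2_hvc(const struct arm_smccc_1_2_regs *args,
                                      struct arm_smccc_1_2_regs *res);
    asmlinkage void arm_smccc_1_2_smc(const struct arm_smccc_1_2_regs *args,
                                      struct arm_smccc_1_2_regs *res);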

> Other alternatives include:
> * Create a variant that hard codes to return X5 thru X7, though
>   in the future there may be Hyper-V hypercalls that need a
>   different hard-coded variant.
> * Return all of X0 thru X7 in a larger result structure. That
>   approach may execute more memory stores, but performance is unlikely
>   to be an issue for the Hyper-V hypercalls that would use it.
>   However, it's possible in the future that Hyper-V results might
>   be beyond X7, as allowed by the SMCCC v1.3 spec.

As above, something of this sort would be my preferred approach.

Thanks,
Mark.

> * The macro __arm_smccc_1_1() could be cloned in Hyper-V specific
>   code and modified to meet Hyper-V specific needs, but this seems
>   undesirable since Hyper-V is operating within the v1.2 spec.
> 
> In any of these cases, the call might be renamed from "_1_1_" to
> "_1_2_" to reflect conformance to the later spec version.
> 
> 
>  include/linux/arm-smccc.h | 29 +++--
>  1 file changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
> index f860645..acda958 100644
> --- a/include/linux/arm-smccc.h
> +++ b/include/linux/arm-smccc.h
> @@ -300,12 +300,12 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, 
> unsigned long a1,
>   * entitled to optimise the whole sequence away. "volatile" is what
>   * makes it stick.
>   */
> -#define __arm_smccc_1_1(inst, ...)   \
> +#define __arm_smccc_1_1(inst, reg1, reg2, reg3, ...) \
>   do {\
>   register unsigned long r0 asm("r0");\
> - register unsigned long r1 asm("r1");\
> - register unsigned long r2 asm("r2");\
> - register unsigned long r3 asm("r3");\
> + register unsigned long r1 asm(reg1);\
> + register unsigned long r2 asm(reg2);\
> + register unsigned long r3 asm(reg3);\
>   __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \
>   asm volatile(inst "\n" :\
>"=r" (r0), "=r" (r1), "=r" (r2), "=r" (r3) \
> @@ -328,7 +328,8 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, 
> unsigned long a1,
>   * to the SMC instruction. The return values are updated with the content
>   * from register 0 to 3 on return from the SMC instruction if not NULL.
>   */
> -#define arm_smccc_1_1_smc(...)   __arm_smccc_1_1(SMCCC_SMC_INST, 
> __VA_ARGS__)
> +#define arm_smccc_1_1_smc(...)\
> + __arm_smccc_1_1(SMCCC_SMC_INST, "r1", "r2", "r3", __VA_ARGS__)
>  
>  /*
>   * arm_smccc_1_1_hvc() - make an SMCCC v1.1 compliant HVC call
> @@ -344,7 +345,23 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, 
> unsigned long a1,
>   * to the HVC instruction. The return values are updated with the content
>   * from register 0 to 3 on return from the HVC instruction if not NULL.
>   */
> -#define arm_smccc_1_1_hvc(...)   __arm_smccc_1_1(SMCCC_HVC_INST, 
> __VA_ARGS__)
> +#define 

Re: [PATCH 2/2] arm64: print alloc free paths for address in registers

2021-03-24 Thread Mark Rutland
Hi,

On Wed, Mar 24, 2021 at 12:24:59PM +0530, Maninder Singh wrote:
> In case of a use after free kernel OOPs, freed path of the object is
> required to debug futher. In most of cases the object address is present
> in one of the registers.
> 
> Thus check the register's address and if it belongs to slab, print its
> alloc and free path.

This path is used for a number of failures that might have nothing to do
with a use-after-free, and from the trimmed example below it looks like
this could significantly bloat the panic and potentially cause important
information to be lost from the log, especially given the large number
of GPRs arm64 has.

Given that, I suspect this is not something we want enabled by default.

When is this logging enabled? I assume the kernel doesn't always record
the alloc/free paths. Is there a boot-time option to control this?

How many lines does this produce on average?

> commit a02a25709155 ("mm/slub: add support for free path information of an 
> object")
> provides free path along with alloc path of object in mem_dump_obj().
> 
> Thus call it with register values same as in ARM with
> commit 14c0508adcdb ("arm: print alloc free paths for address in registers")
> 
> e.g.  in the below issue register x20 belongs to slab, and a use after free
> issue occurred on one of its dereferenced values:
> 
> [   19.516507] Unable to handle kernel paging request at virtual address 
> 006b6b6b6b6b6b73
> ..
> ..
> [   19.528784] Register x10 information: 0-page vmalloc region starting at 
> 0x800011bb allocated at paging_init+0x1d8/0x544
> [   19.529143] Register x11 information: 0-page vmalloc region starting at 
> 0x800011bb allocated at paging_init+0x1d8/0x544
> [   19.529513] Register x12 information: non-paged memory
> ..
> [   19.544953] Register x20 information: slab kmalloc-128 start 
> c3a34280 data offset 128 pointer offset 0 size 128 allocated at 
> meminfo_proc_show+0x44/0x588
> [   19.545432] ___slab_alloc+0x638/0x658
> [   19.545576] __slab_alloc.isra.0+0x2c/0x58
> [   19.545728] kmem_cache_alloc+0x584/0x598
> [   19.545877] meminfo_proc_show+0x44/0x588
> [   19.546022] seq_read_iter+0x258/0x460
> [   19.546160] proc_reg_read_iter+0x90/0xd0
> [   19.546308] generic_file_splice_read+0xd0/0x188
> [   19.546474] do_splice_to+0x90/0xe0
> [   19.546609] splice_direct_to_actor+0xbc/0x240
> [   19.546768] do_splice_direct+0x8c/0xe8
> [   19.546911] do_sendfile+0x2c4/0x500
> [   19.547048] __arm64_sys_sendfile64+0x160/0x168
> [   19.547205] el0_svc_common.constprop.0+0x60/0x120
> [   19.547377] do_el0_svc_compat+0x1c/0x40
> [   19.547524] el0_svc_compat+0x24/0x38
> [   19.547660] el0_sync_compat_handler+0x90/0x158
> [   19.547821]  Free path:
> [   19.547906] __slab_free+0x3dc/0x538
> [   19.548051] kfree+0x2d8/0x310
> [   19.548176] meminfo_proc_show+0x60/0x588
> [   19.548322] seq_read_iter+0x258/0x460
> [   19.548459] proc_reg_read_iter+0x90/0xd0
> [   19.548602] generic_file_splice_read+0xd0/0x188
> [   19.548761] do_splice_to+0x90/0xe0
> [   19.548889] splice_direct_to_actor+0xbc/0x240
> [   19.549040] do_splice_direct+0x8c/0xe8
> [   19.549183] do_sendfile+0x2c4/0x500
> [   19.549319] __arm64_sys_sendfile64+0x160/0x168
> [   19.549477] el0_svc_common.constprop.0+0x60/0x120
> [   19.549646] do_el0_svc_compat+0x1c/0x40
> [   19.549782] el0_svc_compat+0x24/0x38
> [   19.549913] el0_sync_compat_handler+0x90/0x158
> [   19.550067] el0_sync_compat+0x174/0x180
> ..
> 
> Signed-off-by: Vaneet Narang 
> Signed-off-by: Maninder Singh 
> ---
>  arch/arm64/include/asm/system_misc.h |  1 +
>  arch/arm64/kernel/process.c  | 11 +++
>  arch/arm64/kernel/traps.c|  1 +
>  3 files changed, 13 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/system_misc.h 
> b/arch/arm64/include/asm/system_misc.h
> index 673be2d1263c..84d5204cdb80 100644
> --- a/arch/arm64/include/asm/system_misc.h
> +++ b/arch/arm64/include/asm/system_misc.h
> @@ -31,6 +31,7 @@ void hook_debug_fault_code(int nr, int (*fn)(unsigned long, 
> unsigned int,
>  
>  struct mm_struct;
>  extern void __show_regs(struct pt_regs *);
> +extern void __show_regs_alloc_free(struct pt_regs *regs);
>  
>  extern void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
>  
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 6e60aa3b5ea9..d0d0ada332c3 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -306,6 +306,17 @@ void __show_regs(struct pt_regs *regs)
>   }
>  }
>  
> +void __show_regs_alloc_free(struct pt_regs *regs)
> +{
> + int i;
> +
> + /* check for x0 - x29 only */

Why x29? The AAPCS says that's the frame pointer, so much like the SP it
shouldn't point to a heap object.

> + for (i = 0; i <= 29; i++) {
> + pr_alert("Register x%d information:", i);
> +  

Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Mark Rutland
On Tue, Mar 23, 2021 at 12:23:34PM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 12:02 PM, Mark Rutland wrote:

[...]

> I think that I did a bad job of explaining what I wanted to do. It is not
> for any additional protection at all.
> 
> So, let us say we create a field in the task structure:
> 
>   u64 unreliable_stack;
> 
> Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
> set up and pt_regs->stackframe gets chained, increment unreliable_stack.
> On exiting the above, decrement unreliable_stack.
> 
> In arch_stack_walk_reliable(), simply do this check upfront:
> 
>   if (task->unreliable_stack)
>   return -EINVAL;
> 
> This way, the function does not even bother unwinding the stack to find
> exception frames or checking for different return addresses or anything.
> We also don't have to worry about code being reorganized, functions
> being renamed, etc. It also may help in debugging to know if a task is
> experiencing an exception and the level of nesting, etc.

As in my other reply, since this is an optimization that is not
necessary for functional correctness, I would prefer to avoid this for
now. We can reconsider that in future if we encounter performance
problems.

Even with this there will be cases where we have to identify
non-unwindable functions explicitly (e.g. the patchable-function-entry
trampolines, where the real return address is in x9), and I'd prefer
that we use one mechanism consistently.

I suspect that in the future we'll need to unwind across exception
boundaries using metadata, and we can treat the non-unwindable metadata
in the same way.

[...]

> > 3. Figure out exception boundary handling. I'm currently working to
> >simplify the entry assembly down to a uniform set of stubs, and I'd
> >prefer to get that sorted before we teach the unwinder about
> >exception boundaries, as it'll be significantly simpler to reason
> >about and won't end up clashing with the rework.
> 
> So, here is where I still have a question. Is it necessary for the unwinder
> to know the exception boundaries? Is it not enough if it knows if there are
> exceptions present? For instance, using something like num_special_frames
> I suggested above?

I agree that it would be legitimate to bail out early if we knew there
was going to be an exception somewhere in the trace. Regardless, I think
it's simpler overall to identify non-unwindability during the trace, and
doing that during the trace aligns more closely with the structure that
we'll need to permit unwinding across these boundaries in future, so I'd
prefer we do that rather than trying to optimize for early returns
today.

Thanks,
Mark.


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Mark Rutland
On Tue, Mar 23, 2021 at 11:53:04AM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 11:48 AM, Mark Rutland wrote:
> > On Tue, Mar 23, 2021 at 10:26:50AM -0500, Madhavan T. Venkataraman wrote:
> >> So, my next question is - can we define a practical limit for the
> >> nesting so that any nesting beyond that is fatal? The reason I ask is
> >> - if there is a max, then we can allocate an array of stack frames out
> >> of band for the special frames so they are not part of the stack and
> >> will not likely get corrupted.

> >> Also, we don't have to do any special detection. If the number of out
> >> of band frames used is one or more then we have exceptions and the
> >> stack trace is unreliable.
> > 
> > What is this expected to protect against?
> 
> It is not a protection thing. I just wanted a reliable way to tell that there
> is an exception without having to unwind the stack up to the exception frame.
> That is all.

I see.

Given that's an optimization, we can consider doing something like that
after we have the functional bits in place, where we'll be in a
position to see whether this is even a measurable concern in practice.

I suspect that longer-term we'll end up trying to use metadata to unwind
across exception boundaries, since it's possible to get blocked within
those for long periods (e.g. for a uaccess fault), and the larger scale
optimization for patching is to not block the patch.

Thanks,
Mark.


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Mark Rutland
On Tue, Mar 23, 2021 at 11:20:44AM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 10:26 AM, Madhavan T. Venkataraman wrote:
> > On 3/23/21 9:57 AM, Mark Rutland wrote:
> >> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
> > So, my next question is - can we define a practical limit for the
> > nesting so that any nesting beyond that is fatal? The reason I ask
> > is - if there is a max, then we can allocate an array of stack
> > frames out of band for the special frames so they are not part of
> > the stack and will not likely get corrupted.
> > 
> > Also, we don't have to do any special detection. If the number of
> > out of band frames used is one or more then we have exceptions and
> > the stack trace is unreliable.
> 
> Alternatively, if we can just increment a counter in the task
> structure when an exception is entered and decrement it when an
> exception returns, that counter will tell us that the stack trace is
> unreliable.

As I noted earlier, *any* EL1 exception boundary needs to be treated as
unreliable for unwinding, and per my other comments w.r.t.
corrupting the call chain I don't think we need additional protection on
exception boundaries specifically.

> Is this feasible?
> 
> I think I have enough for v3 at this point. If you think that the
> counter idea is OK, I can implement it in v3. Once you confirm, I will
> start working on v3.

Currently, I don't see a compelling reason to need this, and would
prefer to avoid it.

More generally, could we please break this work into smaller steps? I
reckon we can break this down into the following chunks:

1. Add the explicit final frame and associated handling. I suspect that
   this is complicated enough on its own to be an independent series,
   and it's something that we can merge without all the bits and pieces
   necessary for truly reliable stacktracing.

2. Figure out how we must handle kprobes and ftrace. That probably means
   rejecting unwinds from specific places, but we might also want to
   adjust the trampolines if that makes this easier.

3. Figure out exception boundary handling. I'm currently working to
   simplify the entry assembly down to a uniform set of stubs, and I'd
   prefer to get that sorted before we teach the unwinder about
   exception boundaries, as it'll be significantly simpler to reason
   about and won't end up clashing with the rework.

Thanks,
Mark.


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Mark Rutland
On Tue, Mar 23, 2021 at 10:26:50AM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 9:57 AM, Mark Rutland wrote:
> Thanks for explaining the nesting. It is now clear to me.

No problem!

> So, my next question is - can we define a practical limit for the
> nesting so that any nesting beyond that is fatal? The reason I ask is
> - if there is a max, then we can allocate an array of stack frames out
> of band for the special frames so they are not part of the stack and
> will not likely get corrupted.

I suspect we can't define such a fatal limit without introducing a local
DoS vector on some otherwise legitimate workload, and I fear this will
further complicate the entry/exit logic, so I'd prefer to avoid
introducing a new limit.

What exactly do you mean by a "special frame", and why do those need
additional protection over regular frame records?

> Also, we don't have to do any special detection. If the number of out
> of band frames used is one or more then we have exceptions and the
> stack trace is unreliable.

What is this expected to protect against?

Thanks,
Mark.


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Mark Rutland
On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
> Hi Mark,
> 
> I have a general question. When exceptions are nested, how does it work? Let 
> us consider 2 cases:
> 
> 1. Exception in a page fault handler itself. In this case, I guess one more 
> pt_regs will get
>established in the task stack for the second exception.

Generally (ignoring SDEI and stack overflow exceptions) the regs will be
placed on the stack that was in use when the exception occurred, e.g.

  task -> task
  irq -> irq
  overflow -> overflow

For SDEI and stack overflow, we'll place the regs on the relevant SDEI
or overflow stack, e.g.

  task -> overflow
  irq -> overflow

  task -> sdei
  irq -> sdei

I tried to explain the nesting rules in:

  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/stacktrace.c?h=v5.11#n59
  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kernel/stacktrace.c?h=v5.11&id=592700f094be229b5c9cc1192d5cea46eb4c7afc

> 2. Exception in an interrupt handler. Here the interrupt handler is running 
> on the IRQ stack.
>Will the pt_regs get created on the IRQ stack?

For an interrupt the regs will be placed on the stack that was in use
when the interrupt was taken. The kernel switches to the IRQ stack
*after* stacking the registers. e.g.

  task -> task // subsequently switches to IRQ stack
  irq -> irq

> Also, is there a maximum nesting for exceptions?

In practice, yes, but the specific number isn't a constant, so in the
unwind code we have to act as if there is no limit other than stack
sizing.

We try to prevent certain exceptions from nesting (e.g. debug exceptions
cannot nest), but there are still several levels of nesting, and some
exceptions which can be nested safely (like faults). For example, it's
possible to have a chain:

 syscall -> fault -> interrupt -> fault -> pNMI -> fault -> SError -> fault -> 
watchpoint -> fault -> overflow -> fault -> BRK

... and potentially longer than that.

The practical limit is the size of all the stacks, and the unwinder's 
stack monotonicity checks ensure that an unwind will terminate.

Thanks,
Mark.


Re: [RFC PATCH v2 4/8] arm64: Detect an EL1 exception frame and mark a stack trace unreliable

2021-03-23 Thread Mark Rutland
On Tue, Mar 23, 2021 at 08:31:50AM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 8:04 AM, Mark Rutland wrote:
> > On Tue, Mar 23, 2021 at 07:46:10AM -0500, Madhavan T. Venkataraman wrote:
> >> On 3/23/21 5:42 AM, Mark Rutland wrote:
> >>> On Mon, Mar 15, 2021 at 11:57:56AM -0500, madve...@linux.microsoft.com 
> >>> wrote:
> >>>> From: "Madhavan T. Venkataraman" 
> >>>>
> >>>> EL1 exceptions can happen on any instruction including instructions in
> >>>> the frame pointer prolog or epilog. Depending on where exactly they 
> >>>> happen,
> >>>> they could render the stack trace unreliable.
> >>>>
> >>>> If an EL1 exception frame is found on the stack, mark the stack trace as
> >>>> unreliable.
> >>>>
> >>>> Now, the EL1 exception frame is not at any well-known offset on the 
> >>>> stack.
> >>>> It can be anywhere on the stack. In order to properly detect an EL1
> >>>> exception frame the following checks must be done:
> >>>>
> >>>>  - The frame type must be EL1_FRAME.
> >>>>
> >>>>  - When the register state is saved in the EL1 pt_regs, the frame
> >>>>pointer x29 is saved in pt_regs->regs[29] and the return PC
> >>>>is saved in pt_regs->pc. These must match with the current
> >>>>frame.
> >>>
> >>> Before you can do this, you need to reliably identify that you have a
> >>> pt_regs on the stack, but this patch uses a heuristic, which is not
> >>> reliable.
> >>>
> >>> However, instead you can identify whether you're trying to unwind
> >>> through one of the EL1 entry functions, which tells you the same thing
> >>> without even having to look at the pt_regs.
> >>>
> >>> We can do that based on the entry functions all being in .entry.text,
> >>> which we could further sub-divide to split the EL0 and EL1 entry
> >>> functions.
> >>
> >> Yes. I will check the entry functions. But I still think that we should
> >> not rely on just one check. The additional checks will make it robust.
> >> So, I suggest that the return address be checked first. If that passes,
> >> then we can be reasonably sure that there are pt_regs. Then, check
> >> the other things in pt_regs.
> > 
> > What do you think this will catch?
> 
> I am not sure that I have an exact example to mention here. But I will attempt
> one. Let us say that a task has called arch_stack_walk() in the recent past.
> The unwinder may have copied a stack frame onto some location in the stack
> with one of the return addresses we check. Let us assume that there is some
> stack corruption that makes a frame pointer point to that exact record. Now,
> we will get a match on the return address on the next unwind.

I don't see how this is material to the pt_regs case, as either:

* When the unwinder considers this frame, it appears to be in the middle
  of an EL1 entry function, and the unwinder must mark the unwinding as
  unreliable regardless of the contents of any regs (so there's no need
  to look at the regs).

* When the unwinder considers this frame, it does not appear to be in
  the middle of an EL1 entry function, so the unwinder does not think
  there are any regs to consider, and so we cannot detect this case.

... unless I've misunderstood the example?

There's a general problem that it's possible to corrupt any portion of
the chain to skip records, e.g.

  A -> B -> C -> D -> E -> F -> G -> H -> [final]

... could get corrupted to:

  A -> B -> D -> H -> [final]

... regardless of whether C/E/F/G had associated pt_regs. AFAICT there's
no good way to catch this generally unless we have additional metadata
to check the unwinding against.

The likelihood of this happening without triggering other checks is
vanishingly low, and as we don't have a reliable mechanism for detecting
this, I don't think it's worthwhile attempting to do so.

If and when we try to unwind across EL1 exception boundaries, the
potential mismatch between the frame record and regs will be more
significant, and I agree at that point this will need more thought.

> Pardon me if the example is somewhat crude. My point is that it is
> highly unlikely but not impossible for the return address to be on the
> stack and for the unwinder to get an unfortunate match.

I agree that this is possible in theory, but as above I don't think this
is a practical concern.

Thanks,
Mark.


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Mark Rutland
On Tue, Mar 23, 2021 at 07:56:40AM -0500, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/23/21 5:51 AM, Mark Rutland wrote:
> > On Mon, Mar 15, 2021 at 11:57:57AM -0500, madve...@linux.microsoft.com 
> > wrote:
> >> From: "Madhavan T. Venkataraman" 
> >>
> >> When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
> >> for a function, the ftrace infrastructure is called for the function at
> >> the very beginning. Ftrace creates two frames:
> >>
> >>- One for the traced function
> >>
> >>- One for the caller of the traced function
> >>
> >> That gives a reliable stack trace while executing in the ftrace
> >> infrastructure code. When ftrace returns to the traced function, the frames
> >> are popped and everything is back to normal.
> >>
> >> However, in cases like live patch, execution is redirected to a different
> >> function when ftrace returns. A stack trace taken while still in the ftrace
> >> infrastructure code will not show the target function. The target function
> >> is the real function that we want to track.
> >>
> >> So, if an FTRACE frame is detected on the stack, just mark the stack trace
> >> as unreliable.
> > 
> > To identify this case, please identify the ftrace trampolines instead,
> > e.g. ftrace_regs_caller, return_to_handler.
> > 
> 
> Yes. As part of the return address checking, I will check this. IIUC, I think 
> that
> I need to check for the inner labels that are defined at the point where the
> instructions are patched for ftrace. E.g., ftrace_call and ftrace_graph_call.
> 
> SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
> bl  ftrace_stub   <
> 
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
> nop   <===// If enabled, this will be replaced
> // "b ftrace_graph_caller"
> #endif
> 
> For instance, the stack trace I got while tracing do_mmap() with the stack 
> trace
> tracer looks like this:
> 
>...
> [  338.911793]   trace_function+0xc4/0x160
> [  338.911801]   function_stack_trace_call+0xac/0x130
> [  338.911807]   ftrace_graph_call+0x0/0x4
> [  338.911813]   do_mmap+0x8/0x598
> [  338.911820]   vm_mmap_pgoff+0xf4/0x188
> [  338.911826]   ksys_mmap_pgoff+0x1d8/0x220
> [  338.911832]   __arm64_sys_mmap+0x38/0x50
> [  338.911839]   el0_svc_common.constprop.0+0x70/0x1a8
> [  338.911846]   do_el0_svc+0x2c/0x98
> [  338.911851]   el0_svc+0x2c/0x70
> [  338.911859]   el0_sync_handler+0xb0/0xb8
> [  338.911864]   el0_sync+0x180/0x1c0
> 
> > It'd be good to check *exactly* when we need to reject, since IIUC when
> > we have a graph stack entry the unwind will be correct from livepatch's
> > PoV.
> > 
> 
> The current unwinder already handles this like this:
> 
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> if (tsk->ret_stack &&
> (ptrauth_strip_insn_pac(frame->pc) == (unsigned 
> long)return_to_handler)) {
> struct ftrace_ret_stack *ret_stack;
> /*
>  * This is a case where function graph tracer has
>  * modified a return address (LR) in a stack frame
>  * to hook a function return.
>  * So replace it to an original value.
>  */
> ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
> if (WARN_ON_ONCE(!ret_stack))
> return -EINVAL;
> frame->pc = ret_stack->ret;
> }
> #endif /* CONFIG_FUNCTION_GRAPH_TRACER */

Beware that this handles the case where a function will return to
return_to_handler, but doesn't handle unwinding from *within*
return_to_handler, which we can't do reliably today, so that might need
special handling.

> Is there anything else that needs handling here?

I wrote up a few cases to consider in:

https://www.kernel.org/doc/html/latest/livepatch/reliable-stacktrace.html

... e.g. the "Obscuring of return addresses" case.

It might be that we're fine so long as we refuse to unwind across
exception boundaries, but it needs some thought. We probably need to go
over each of the trampolines instruction-by-instruction to consider
that.

As mentioned above, within return_to_handler when we call
ftrace_return_to_handler, there's a period where the real return address
has been removed from the ftrace return stack, but hasn't yet been
placed in x30, and wouldn't show up in a trace (e.g. if we could somehow
hook the return from ftrace_return_to_handler).

We might be saved by the fact we'll mark traces across exception
boundaries as unreliable, but I haven't thought very hard about it. We
might want to explicitly reject unwinds within return_to_handler in case
it's possible to interpose ftrace_return_to_handler somehow.

Thanks,
Mark.


Re: [RFC PATCH v2 4/8] arm64: Detect an EL1 exception frame and mark a stack trace unreliable

2021-03-23 Thread Mark Rutland
On Tue, Mar 23, 2021 at 07:46:10AM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 5:42 AM, Mark Rutland wrote:
> > On Mon, Mar 15, 2021 at 11:57:56AM -0500, madve...@linux.microsoft.com 
> > wrote:
> >> From: "Madhavan T. Venkataraman" 
> >>
> >> EL1 exceptions can happen on any instruction including instructions in
> >> the frame pointer prolog or epilog. Depending on where exactly they happen,
> >> they could render the stack trace unreliable.
> >>
> >> If an EL1 exception frame is found on the stack, mark the stack trace as
> >> unreliable.
> >>
> >> Now, the EL1 exception frame is not at any well-known offset on the stack.
> >> It can be anywhere on the stack. In order to properly detect an EL1
> >> exception frame the following checks must be done:
> >>
> >>- The frame type must be EL1_FRAME.
> >>
> >>- When the register state is saved in the EL1 pt_regs, the frame
> >>  pointer x29 is saved in pt_regs->regs[29] and the return PC
> >>  is saved in pt_regs->pc. These must match with the current
> >>  frame.
> > 
> > Before you can do this, you need to reliably identify that you have a
> > pt_regs on the stack, but this patch uses a heuristic, which is not
> > reliable.
> > 
> > However, instead you can identify whether you're trying to unwind
> > through one of the EL1 entry functions, which tells you the same thing
> > without even having to look at the pt_regs.
> > 
> > We can do that based on the entry functions all being in .entry.text,
> > which we could further sub-divide to split the EL0 and EL1 entry
> > functions.
> 
> Yes. I will check the entry functions. But I still think that we should
> not rely on just one check. The additional checks will make it robust.
> So, I suggest that the return address be checked first. If that passes,
> then we can be reasonably sure that there are pt_regs. Then, check
> the other things in pt_regs.

What do you think this will catch?

The only way to correctly identify whether or not we have a pt_regs is
to check whether we're in specific portions of the EL1 entry assembly
where the regs exist. However, as any EL1<->EL1 transition cannot be
safely unwound, we'd mark any trace going through the EL1 entry assembly
as unreliable.

Given that, I don't think it's useful to check the regs, and I'd prefer
to avoid the subtleties involved in attempting to do so.

[...]

> >> +static void check_if_reliable(unsigned long fp, struct stackframe *frame,
> >> +struct stack_info *info)
> >> +{
> >> +  struct pt_regs *regs;
> >> +  unsigned long regs_start, regs_end;
> >> +
> >> +  /*
> >> +   * If the stack trace has already been marked unreliable, just
> >> +   * return.
> >> +   */
> >> +  if (!frame->reliable)
> >> +  return;
> >> +
> >> +  /*
> >> +   * Assume that this is an intermediate marker frame inside a pt_regs
> >> +   * structure created on the stack and get the pt_regs pointer. Other
> >> +   * checks will be done below to make sure that this is a marker
> >> +   * frame.
> >> +   */
> > 
> > Sorry, but NAK to this approach specifically. This isn't reliable (since
> > it can be influenced by arbitrary data on the stack), and it's far more
> > complicated than identifying the entry functions specifically.
> 
> As I mentioned above, I agree that we should check the return address. But
> just as a precaution, I think we should double check the pt_regs.
> 
> Is that OK with you? It does not take away anything or increase the risk in
> anyway. I think it makes it more robust.

As above, I think that the work necessary to correctly access the regs
means that it's not helpful to check the regs themselves. If you have
something in mind where checking the regs is helpful I'm happy to
consider that, but my general preference would be to stay away from the
regs for now.

Thanks,
Mark.


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Mark Rutland
On Mon, Mar 15, 2021 at 11:57:57AM -0500, madve...@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" 
> 
> When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
> for a function, the ftrace infrastructure is called for the function at
> the very beginning. Ftrace creates two frames:
> 
>   - One for the traced function
> 
>   - One for the caller of the traced function
> 
> That gives a reliable stack trace while executing in the ftrace
> infrastructure code. When ftrace returns to the traced function, the frames
> are popped and everything is back to normal.
> 
> However, in cases like live patch, execution is redirected to a different
> function when ftrace returns. A stack trace taken while still in the ftrace
> infrastructure code will not show the target function. The target function
> is the real function that we want to track.
> 
> So, if an FTRACE frame is detected on the stack, just mark the stack trace
> as unreliable.

To identify this case, please identify the ftrace trampolines instead,
e.g. ftrace_regs_caller, return_to_handler.

It'd be good to check *exactly* when we need to reject, since IIUC when
we have a graph stack entry the unwind will be correct from livepatch's
PoV.
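
As a sketch of the kind of check I mean (this assumes the trampoline
symbols are visible to the unwinder and that CONFIG_KALLSYMS is
available for the symbol sizes; it is not from the posted series):

    #include <linux/kallsyms.h>
    #include <asm/pointer_auth.h>

    extern void ftrace_regs_caller(void);
    extern void return_to_handler(void);

    /* True if 'pc' lies within the symbol starting at 'addr', per kallsyms. */
    static bool pc_within_sym(unsigned long pc, unsigned long addr)
    {
            unsigned long size, offset;

            if (!kallsyms_lookup_size_offset(addr, &size, &offset))
                    return false;

            return pc >= addr && pc < addr + size;
    }

    static bool pc_is_ftrace_trampoline(unsigned long pc)
    {
            pc = ptrauth_strip_insn_pac(pc);

            return pc_within_sym(pc, (unsigned long)ftrace_regs_caller) ||
                   pc_within_sym(pc, (unsigned long)return_to_handler);
    }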

Thanks,
Mark.

> 
> Signed-off-by: Madhavan T. Venkataraman 
> ---
>  arch/arm64/kernel/entry-ftrace.S |  2 ++
>  arch/arm64/kernel/stacktrace.c   | 33 
>  2 files changed, 35 insertions(+)
> 
> diff --git a/arch/arm64/kernel/entry-ftrace.S 
> b/arch/arm64/kernel/entry-ftrace.S
> index b3e4f9a088b1..1ec8c5180fc0 100644
> --- a/arch/arm64/kernel/entry-ftrace.S
> +++ b/arch/arm64/kernel/entry-ftrace.S
> @@ -74,6 +74,8 @@
>   /* Create our frame record within pt_regs. */
>   stp x29, x30, [sp, #S_STACKFRAME]
>   add x29, sp, #S_STACKFRAME
> + ldr w17, =FTRACE_FRAME
> + str w17, [sp, #S_FRAME_TYPE]
>   .endm
>  
>  SYM_CODE_START(ftrace_regs_caller)
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 6ae103326f7b..594806a0c225 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -23,6 +23,7 @@ static void check_if_reliable(unsigned long fp, struct 
> stackframe *frame,
>  {
>   struct pt_regs *regs;
>   unsigned long regs_start, regs_end;
> + unsigned long caller_fp;
>  
>   /*
>* If the stack trace has already been marked unreliable, just
> @@ -68,6 +69,38 @@ static void check_if_reliable(unsigned long fp, struct stackframe *frame,
>   frame->reliable = false;
>   return;
>   }
> +
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> + /*
> +  * When tracing is active for a function, the ftrace code is called
> +  * from the function even before the frame pointer prolog and
> +  * epilog. ftrace creates a pt_regs structure on the stack to save
> +  * register state.
> +  *
> +  * In addition, ftrace sets up two stack frames and chains them
> +  * with other frames on the stack. One frame is pt_regs->stackframe
> +  * that is for the traced function. The other frame is set up right
> +  * after the pt_regs structure and it is for the caller of the
> +  * traced function. This is done to ensure a proper stack trace.
> +  *
> +  * If the ftrace code returns to the traced function, then all is
> +  * fine. But if it transfers control to a different function (like
> +  * in livepatch), then a stack walk performed while still in the
> +  * ftrace code will not find the target function.
> +  *
> +  * So, mark the stack trace as unreliable if an ftrace frame is
> +  * detected.
> +  */
> + if (regs->frame_type == FTRACE_FRAME && frame->fp == regs_end &&
> + frame->fp < info->high) {
> + /* Check the traced function's caller's frame. */
> + caller_fp = READ_ONCE_NOCHECK(*(unsigned long *)(frame->fp));
> + if (caller_fp == regs->regs[29]) {
> + frame->reliable = false;
> + return;
> + }
> + }
> +#endif
>  }
>  
>  /*
> -- 
> 2.25.1
> 


Re: [RFC PATCH v2 4/8] arm64: Detect an EL1 exception frame and mark a stack trace unreliable

2021-03-23 Thread Mark Rutland
On Mon, Mar 15, 2021 at 11:57:56AM -0500, madve...@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" 
> 
> EL1 exceptions can happen on any instruction including instructions in
> the frame pointer prolog or epilog. Depending on where exactly they happen,
> they could render the stack trace unreliable.
> 
> If an EL1 exception frame is found on the stack, mark the stack trace as
> unreliable.
> 
> Now, the EL1 exception frame is not at any well-known offset on the stack.
> It can be anywhere on the stack. In order to properly detect an EL1
> exception frame the following checks must be done:
> 
>   - The frame type must be EL1_FRAME.
> 
>   - When the register state is saved in the EL1 pt_regs, the frame
> pointer x29 is saved in pt_regs->regs[29] and the return PC
> is saved in pt_regs->pc. These must match with the current
> frame.

Before you can do this, you need to reliably identify that you have a
pt_regs on the stack, but this patch uses a heuristic, which is not
reliable.

However, instead you can identify whether you're trying to unwind
through one of the EL1 entry functions, which tells you the same thing
without even having to look at the pt_regs.

We can do that based on the entry functions all being in .entry.text,
which we could further sub-divide to split the EL0 and EL1 entry
functions.
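
For illustration, a minimal version of that check could use the
existing generic section markers (splitting out the EL0 entry points
would need new markers that don't exist today); this is a sketch, not
the final shape:

#include <asm/sections.h>

/*
 * Sketch: any return address inside .entry.text is an exception
 * boundary that a reliable unwinder must not blindly cross.
 */
static bool pc_is_entry_text(unsigned long pc)
{
	return pc >= (unsigned long)__entry_text_start &&
	       pc <  (unsigned long)__entry_text_end;
}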

> 
> Interrupts encountered in kernel code are also EL1 exceptions. At the end
> of an interrupt, the interrupt handler checks if the current task must be
> preempted for any reason. If so, it calls the preemption code which takes
> the task off the CPU. A stack trace taken on the task after the preemption
> will show the EL1 frame and will be considered unreliable. This is correct
> behavior as preemption can happen practically at any point in code
> including the frame pointer prolog and epilog.
> 
> Breakpoints encountered in kernel code are also EL1 exceptions. The probing
> infrastructure uses breakpoints for executing probe code. While in the probe
> code, the stack trace will show an EL1 frame and will be considered
> unreliable. This is also correct behavior.
> 
> Signed-off-by: Madhavan T. Venkataraman 
> ---
>  arch/arm64/include/asm/stacktrace.h |  2 +
>  arch/arm64/kernel/stacktrace.c  | 57 +
>  2 files changed, 59 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
> index eb29b1fe8255..684f65808394 100644
> --- a/arch/arm64/include/asm/stacktrace.h
> +++ b/arch/arm64/include/asm/stacktrace.h
> @@ -59,6 +59,7 @@ struct stackframe {
>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>   int graph;
>  #endif
> + bool reliable;
>  };
>  
>  extern int unwind_frame(struct task_struct *tsk, struct stackframe *frame);
> @@ -169,6 +170,7 @@ static inline void start_backtrace(struct stackframe *frame,
>   bitmap_zero(frame->stacks_done, __NR_STACK_TYPES);
>   frame->prev_fp = 0;
>   frame->prev_type = STACK_TYPE_UNKNOWN;
> + frame->reliable = true;
>  }
>  
>  #endif   /* __ASM_STACKTRACE_H */
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 504cd161339d..6ae103326f7b 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -18,6 +18,58 @@
>  #include 
>  #include 
>  
> +static void check_if_reliable(unsigned long fp, struct stackframe *frame,
> +   struct stack_info *info)
> +{
> + struct pt_regs *regs;
> + unsigned long regs_start, regs_end;
> +
> + /*
> +  * If the stack trace has already been marked unreliable, just
> +  * return.
> +  */
> + if (!frame->reliable)
> + return;
> +
> + /*
> +  * Assume that this is an intermediate marker frame inside a pt_regs
> +  * structure created on the stack and get the pt_regs pointer. Other
> +  * checks will be done below to make sure that this is a marker
> +  * frame.
> +  */

Sorry, but NAK to this approach specifically. This isn't reliable (since
it can be influenced by arbitrary data on the stack), and it's far more
complicated than identifying the entry functions specifically.

Thanks,
Mark.

> + regs_start = fp - offsetof(struct pt_regs, stackframe);
> + if (regs_start < info->low)
> + return;
> + regs_end = regs_start + sizeof(*regs);
> + if (regs_end > info->high)
> + return;
> + regs = (struct pt_regs *) regs_start;
> +
> + /*
> +  * When an EL1 exception happens, a pt_regs structure is created
> +  * on the stack and the register state is recorded. Part of the
> +  * state is the FP and PC at the time of the exception.
> +  *
> +  * In addition, the FP and PC are also stored in pt_regs->stackframe
> +  * and pt_regs->stackframe is chained with other frames on the stack.
> +  * This is so that the interrupted function shows up in the stack
> +  * trace.
> +  

Re: [RFC PATCH v2 3/8] arm64: Terminate the stack trace at TASK_FRAME and EL0_FRAME

2021-03-23 Thread Mark Rutland
On Thu, Mar 18, 2021 at 03:29:19PM -0500, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/18/21 1:26 PM, Mark Brown wrote:
> > On Mon, Mar 15, 2021 at 11:57:55AM -0500, madve...@linux.microsoft.com wrote:
> > 
> >> +  /* Terminal record, nothing to unwind */
> >> +  if (fp == (unsigned long) regs->stackframe) {
> >> +  if (regs->frame_type == TASK_FRAME ||
> >> +  regs->frame_type == EL0_FRAME)
> >> +  return -ENOENT;
> >>return -EINVAL;
> >> +  }
> > 
> > This is conflating the reliable stacktrace checks (which your series
> > will later flag up with frame->reliable) with verifying that we found
> > the bottom of the stack by looking for this terminal stack frame record.
> > For the purposes of determining if the unwinder got to the bottom of the
> > stack we don't care what stack type we're looking at, we just care if it
> > managed to walk to this defined final record.  
> > 
> > At the minute nothing except reliable stack trace has any intention of
> > checking the specific return code but it's clearer to be consistent.
> > 
> 
> So, you are saying that the type check is redundant. OK. I will remove it
> and just return -ENOENT on reaching the final record.

Yes please; and please fold that into the same patch that adds the final
records.
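
i.e. in unwind_frame() the check collapses to something like the below
(sketch, assuming the terminal record stays in pt_regs->stackframe as
in the quoted patch):

	/* Terminal record; nothing left to unwind. */
	if (fp == (unsigned long)task_pt_regs(tsk)->stackframe)
		return -ENOENT;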

Thanks,
Mark.


Re: [RFC PATCH v2 2/8] arm64: Implement frame types

2021-03-23 Thread Mark Rutland
On Mon, Mar 15, 2021 at 11:57:54AM -0500, madve...@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" 
> 
> Apart from the task pt_regs, pt_regs is also created on the stack for other
> other cases:
> 
>   - EL1 exception. A pt_regs is created on the stack to save register
> state. In addition, pt_regs->stackframe is set up for the
> interrupted kernel function so that the function shows up in the
> EL1 exception stack trace.
> 
>   - When a traced function calls the ftrace infrastructure at the
> beginning of the function, ftrace creates a pt_regs on the stack
> at that point to save register state. In addition, it sets up
> pt_regs->stackframe for the traced function so that the traced
> function shows up in the stack trace taken from anywhere in the
> ftrace code after that point. When the ftrace code returns to the
> traced function, the pt_regs is removed from the stack.
> 
> To summarize, pt_regs->stackframe is used (or will be used) as a marker
> frame in stack traces. To enable the unwinder to detect these frames, tag
> each pt_regs->stackframe with a type. To record the type, use the unused2
> field in struct pt_regs and rename it to frame_type. The types are:
> 
> TASK_FRAME
>   Terminating frame for a normal stack trace.
> EL0_FRAME
>   Terminating frame for an EL0 exception.
> EL1_FRAME
>   EL1 exception frame.
> FTRACE_FRAME
>   FTRACE frame.
> 
> These frame types will be used by the unwinder later to validate frames.

I don't think that we need a marker in the pt_regs:

* For kernel tasks and user tasks we just need the terminal frame record
  to be at a known location. We don't need the pt_regs to determine
  this.

* For EL1<->EL1 exception boundaries, we already chain the frame records
  together, and we can identify the entry functions to see that there's
  an exception boundary. We don't need the pt_regs to determine this.

* For ftrace using patchable-function-entry, we can identify the
  trampoline function. I'm also hoping to move away from pt_regs to an
  ftrace_regs here, and I'd like to avoid more strongly coupling this to
  pt_regs.

  Maybe I'm missing something you need for this last case?

> 
> Signed-off-by: Madhavan T. Venkataraman 
> ---
>  arch/arm64/include/asm/ptrace.h | 15 +--
>  arch/arm64/kernel/asm-offsets.c |  1 +
>  arch/arm64/kernel/entry.S   |  4 
>  arch/arm64/kernel/head.S|  2 ++
>  arch/arm64/kernel/process.c |  1 +
>  5 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
> index e58bca832dff..a75211ce009a 100644
> --- a/arch/arm64/include/asm/ptrace.h
> +++ b/arch/arm64/include/asm/ptrace.h
> @@ -117,6 +117,17 @@
>   */
>  #define NO_SYSCALL (-1)
>  
> +/*
> + * pt_regs->stackframe is a marker frame that is used in different
> + * situations. These are the different types of frames. Use patterns
> + * for the frame types instead of (0, 1, 2, 3, ..) so that it is less
> + * likely to find them on the stack.
> + */
> +#define TASK_FRAME   0xDEADBEE0  /* Task stack termination frame */
> +#define EL0_FRAME0xDEADBEE1  /* EL0 exception frame */
> +#define EL1_FRAME0xDEADBEE2  /* EL1 exception frame */
> +#define FTRACE_FRAME 0xDEADBEE3  /* FTrace frame */

This sounds like we're using this as a heuristic, which I don't think we
should do. I'd strongly prefer to avoid magic values here, and if we
cannot be 100% certain of the stack contents, this is not reliable
anyway.

Thanks,
Mark.

>  #ifndef __ASSEMBLY__
>  #include 
>  #include 
> @@ -187,11 +198,11 @@ struct pt_regs {
>   };
>   u64 orig_x0;
>  #ifdef __AARCH64EB__
> - u32 unused2;
> + u32 frame_type;
>   s32 syscallno;
>  #else
>   s32 syscallno;
> - u32 unused2;
> + u32 frame_type;
>  #endif
>   u64 sdei_ttbr1;
>   /* Only valid when ARM64_HAS_IRQ_PRIO_MASKING is enabled. */
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index a36e2fc330d4..43f97dbc7dfc 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -75,6 +75,7 @@ int main(void)
>DEFINE(S_SDEI_TTBR1,   offsetof(struct pt_regs, sdei_ttbr1));
>DEFINE(S_PMR_SAVE, offsetof(struct pt_regs, pmr_save));
>DEFINE(S_STACKFRAME,   offsetof(struct pt_regs, stackframe));
> +  DEFINE(S_FRAME_TYPE,   offsetof(struct pt_regs, frame_type));
>DEFINE(PT_REGS_SIZE,   sizeof(struct pt_regs));
>BLANK();
>  #ifdef CONFIG_COMPAT
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index e2dc2e998934..ecc3507d9cdd 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -269,8 +269,12 @@ alternative_else_nop_endif
>*/
>   .if \el == 0
>   stp xzr, xzr, [sp, #S_STACKFRAME]
> + ldr 

Re: [RFC PATCH v2 1/8] arm64: Implement stack trace termination record

2021-03-23 Thread Mark Rutland
On Fri, Mar 19, 2021 at 05:03:09PM -0500, Madhavan T. Venkataraman wrote:
> I solved this by using existing functions logically instead of inventing a
> dummy function. I initialize pt_regs->stackframe[1] to an existing function
> so that the stack trace will not show a 0x0 entry as well as the kernel and
> gdb will show identical stack traces.
> 
> For all task stack traces including the idle tasks, the stack trace will
> end at copy_thread() as copy_thread() is the function that initializes the
> pt_regs and the first stack frame for a task.

I don't think this is a good idea, as it will mean that copy_thread()
will appear to be live in every thread, and therefore will not be
patchable.

There are other things people need to be aware of when using an external
debugger (e.g. during EL0<->ELx transitions there are periods when X29
and X30 contain the EL0 values, and cannot be used to unwind), so I
don't think there's a strong need to make this look prettier to an
external debugger.

> For EL0 exceptions, the stack trace will end with vectors() as vectors
> entries call the EL0 handlers.
> 
> Here are sample stack traces (I only show the ending of each trace):
> 
> Idle task on primary CPU
> 
> 
>...
> [0.022557]   start_kernel+0x5b8/0x5f4
> [0.022570]   __primary_switched+0xa8/0xb8
> [0.022578]   copy_thread+0x0/0x188
> 
> Idle task on secondary CPU
> ==
> 
>...
> [0.023397]   secondary_start_kernel+0x188/0x1e0
> [0.023406]   __secondary_switched+0x40/0x88
> [0.023415]   copy_thread+0x0/0x188
> 
> All other kernel threads
> 
> 
>...
> [   13.501062]   ret_from_fork+0x10/0x18
> [   13.507998]   copy_thread+0x0/0x188
> 
> User threads (EL0 exception)
> 
> 
> write(2) system call example:
> 
>...
> [  521.686148]   vfs_write+0xc8/0x2c0
> [  521.686156]   ksys_write+0x74/0x108
> [  521.686161]   __arm64_sys_write+0x24/0x30
> [  521.686166]   el0_svc_common.constprop.0+0x70/0x1a8
> [  521.686175]   do_el0_svc+0x2c/0x98
> [  521.686180]   el0_svc+0x2c/0x70
> [  521.686188]   el0_sync_handler+0xb0/0xb8
> [  521.686193]   el0_sync+0x17c/0x180
> [  521.686198]   vectors+0x0/0x7d8

[...]

> If you approve, the above will become RFC Patch v3 1/8 in the next version.

As above, I don't think we should repurpose an existing function here,
and my preference is to use 0x0.

> Let me know.
> 
> Also, I could introduce an extra frame in the EL1 exception stack trace that
> includes vectors so the stack trace would look like this (timer interrupt 
> example):
> 
> call_timer_fn
> run_timer_softirq
> __do_softirq
> irq_exit
> __handle_domain_irq
> gic_handle_irq
> el1_irq
> vectors
> 
> This way, if the unwinder finds vectors, it knows that it is an exception 
> frame.

I can see this might make it simpler to detect exception boundaries, but
I suspect that we need other information anyway, so this doesn't become
all that helpful. For EL0<->EL1 exception boundaries we want to
successfully terminate a robust stacktrace whereas for EL1<->EL1
exception boundaries we want to fail a robust stacktrace.

I reckon we have to figure that out from the el1_* and el0_* entry
points (which I am working to reduce/simplify as part of the entry
assembly conversion to C). With that we can terminate unwind at the
el0_* parts, and reject unwinding across any other bit of .entry.text.
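
Spelling that policy out as code (a sketch only; both helpers are
hypothetical and depend on how .entry.text ends up being subdivided):

/*
 * Sketch: decide what to do when the unwinder encounters an address in
 * the entry code. pc_is_el0_entry_text() and pc_is_entry_text() are
 * assumed helpers, not existing functions.
 */
static int unwind_check_entry_boundary(unsigned long pc)
{
	/* EL0 entry point: legitimate end of a reliable trace. */
	if (pc_is_el0_entry_text(pc))
		return -ENOENT;

	/* Any other .entry.text address: EL1<->EL1 boundary, unreliable. */
	if (pc_is_entry_text(pc))
		return -EINVAL;

	return 0;	/* keep unwinding */
}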

Thanks,
Mark.



Re: [syzbot] upstream boot error: WARNING in __context_tracking_enter

2021-03-22 Thread Mark Rutland
Hi Russell,

On Fri, Mar 19, 2021 at 10:10:43AM +, Russell King - ARM Linux admin wrote:
> On Fri, Mar 19, 2021 at 10:54:48AM +0100, Dmitry Vyukov wrote:
> > .On Fri, Mar 19, 2021 at 10:44 AM syzbot
> >  wrote:
> > > syzbot found the following issue on:
> > >
> > > HEAD commit:8b12a62a Merge tag 'drm-fixes-2021-03-19' of 
> > > git://anongit..
> > > git tree:   upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=17e815aed0
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=cfeed364fc353c32
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=f09a12b2c77bfbbf51bd
> > > userspace arch: arm
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+f09a12b2c77bfbbf5...@syzkaller.appspotmail.com
> > 
> > 
> > +Mark, arm
> > It did not get far with CONFIG_CONTEXT_TRACKING_FORCE (kernel doesn't boot).
> 
> It seems that the path:
> 
> context_tracking_user_enter()
> user_enter()
> context_tracking_enter()
> __context_tracking_enter()
> vtime_user_enter()
> 
> expects preemption to be disabled. It effectively is, because local
> interrupts are disabled by context_tracking_enter().
> 
> However, the requirement for preemption to be disabled is not
> documented... so shrug. Maybe someone can say what the real requirements
> are here.

From dealing with this recently on arm64, this is a bit messy. To
handle this robustly we need to do a few things in sequence, including
using the *_irqoff() variants of the context_tracking_user_*()
functions.

I wrote down the constraints in commit:
  
  23529049c6842382 ("arm64: entry: fix non-NMI user<->kernel transitions")

For user->kernel transitions, the arch code needs the following sequence
before invoking arbitrary kernel C code:

lockdep_hardirqs_off(CALLER_ADDR0);
user_exit_irqoff();
trace_hardirqs_off_finish();

For kernel->user transitions, the arch code needs the following sequence
once it will no longer invoke arbitrary kernel C code, just before
returning to userspace:

trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
user_enter_irqoff();
lockdep_hardirqs_on(CALLER_ADDR0);
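
For example, an architecture might bundle those up as a pair of helpers
along these lines (a sketch; the wrapper names are made up rather than
an existing API):

#include <linux/context_tracking.h>
#include <linux/ftrace.h>	/* CALLER_ADDR0 */
#include <linux/irqflags.h>

/* Run on entry from EL0, before any other kernel C code. */
static __always_inline void arch_enter_from_user_mode(void)
{
	lockdep_hardirqs_off(CALLER_ADDR0);
	user_exit_irqoff();
	trace_hardirqs_off_finish();
}

/* Run just before the final return to userspace. */
static __always_inline void arch_exit_to_user_mode(void)
{
	trace_hardirqs_on_prepare();
	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
	user_enter_irqoff();
	lockdep_hardirqs_on(CALLER_ADDR0);
}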

Thanks,
Mark.


Re: [PATCH] arm64: stacktrace: don't trace arch_stack_walk()

2021-03-22 Thread Mark Rutland
On Fri, Mar 19, 2021 at 07:02:06PM +, Catalin Marinas wrote:
> On Fri, Mar 19, 2021 at 06:41:06PM +0000, Mark Rutland wrote:
> > We recently converted arm64 to use arch_stack_walk() in commit:
> > 
> >   5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> > 
> > The core stacktrace code expects that (when tracing the current task)
> > arch_stack_walk() starts a trace at its caller, and does not include
> > itself in the trace. However, arm64's arch_stack_walk() includes itself,
> > and so traces include one more entry than callers expect. The core
> > stacktrace code which calls arch_stack_walk() tries to skip a number of
> > entries to prevent itself appearing in a trace, and the additional entry
> > prevents skipping one of the core stacktrace functions, leaving this in
> > the trace unexpectedly.
> > 
> > We can fix this by having arm64's arch_stack_walk() begin the trace with
> > its caller. The first value returned by the trace will be
> > __builtin_return_address(0), i.e. the caller of arch_stack_walk(). The
> > first frame record to be unwound will be __builtin_frame_address(1),
> > i.e. the caller's frame record. To prevent surprises, arch_stack_walk()
> > is also marked noinline.

[...]

> > Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> > Signed-off-by: Mark Rutland 
> > Cc: Catalin Marinas 
> > Cc: Chen Jun 
> > Cc: Marco Elver 
> > Cc: Mark Brown 
> > Cc: Will Deacon 
> 
> Thanks Mark. I think we should add a cc stable, just with Fixes doesn't
> always seem to end up in a stable kernel:
> 
> Cc:  # 5.10.x

Makes sense to me, sure.

> With that:
> 
> Reviewed-by: Catalin Marinas 

Thanks!

Will, I assume you're happy to fold in the above when picking this. If
you'd prefer I repost with that folded in, please let me know!

Mark.


Re: [PATCHv3 2/6] arm64: don't use GENERIC_IRQ_MULTI_HANDLER

2021-03-22 Thread Mark Rutland
Hi Christoph,

On Mon, Mar 15, 2021 at 07:28:03PM +, Christoph Hellwig wrote:
> On Mon, Mar 15, 2021 at 11:56:25AM +0000, Mark Rutland wrote:
> > From: Marc Zyngier 
> > 
> > In subsequent patches we want to allow irqchip drivers to register as
> > FIQ handlers, with a set_handle_fiq() function. To keep the IRQ/FIQ
> > paths similar, we want arm64 to provide both set_handle_irq() and
> > set_handle_fiq(), rather than using GENERIC_IRQ_MULTI_HANDLER for the
> > former.
> 
> Having looked through the series I do not understand this rationale
> at all.  You've only added the default_handle_irq logic, which seems
> perfectly suitable and desirable for the generic version. 

The default_handle_irq thing isn't the point of the series, that part is
all preparatory work. I agree that probably makes sense for the generic
code, and I'm happy to update core code with this.

The big thing here is that (unlike most architectures), with arm64 a CPU
has two interrupt pins, IRQ and FIQ, and we need separate root handlers
for these. That's what this series aims to do, and patches 1-5 are all
preparatory work with that appearing in patch 6.

Our initial stab at this did try to add that support to core code, but
that was more painful to deal with, since you either add abstractions to
make this look generic, which makes the code more complex for both the
generic code and arch code, or you place arch-specific assumptions in
the core code. See Marc's earlier stab at this, where in effect we had
to duplicate the logic in the core code so that we didn't adversely
affect existing entry assembly on other architectures due to the way the
function pointers were stored.

> Please don't fork off generic code for no good reason.

I appreciate that this runs counter to the general goal of making things
generic wherever possible, but I do think in this case we have good
reasons, and the duplication is better than adding single-user
abstractions in the generic code that complicate the generic code and
arch code.

Thanks,
Mark.


[PATCH] arm64: stacktrace: don't trace arch_stack_walk()

2021-03-19 Thread Mark Rutland
We recently converted arm64 to use arch_stack_walk() in commit:

  5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")

The core stacktrace code expects that (when tracing the current task)
arch_stack_walk() starts a trace at its caller, and does not include
itself in the trace. However, arm64's arch_stack_walk() includes itself,
and so traces include one more entry than callers expect. The core
stacktrace code which calls arch_stack_walk() tries to skip a number of
entries to prevent itself appearing in a trace, and the additional entry
prevents skipping one of the core stacktrace functions, leaving this in
the trace unexpectedly.

We can fix this by having arm64's arch_stack_walk() begin the trace with
its caller. The first value returned by the trace will be
__builtin_return_address(0), i.e. the caller of arch_stack_walk(). The
first frame record to be unwound will be __builtin_frame_address(1),
i.e. the caller's frame record. To prevent surprises, arch_stack_walk()
is also marked noinline.

While __builtin_frame_address(1) is not safe in portable code, local GCC
developers have confirmed that it is safe on arm64. To find the caller's
frame record, the builtin can safely dereference the current function's
frame record or (in theory) could stash the original FP into another GPR
at function entry time, neither of which are problematic.

Prior to this patch, the tracing code would unexpectedly show up in
traces of the current task, e.g.

| # cat /proc/self/stack
| [<0>] stack_trace_save_tsk+0x98/0x100
| [<0>] proc_pid_stack+0xb4/0x130
| [<0>] proc_single_show+0x60/0x110
| [<0>] seq_read_iter+0x230/0x4d0
| [<0>] seq_read+0xdc/0x130
| [<0>] vfs_read+0xac/0x1e0
| [<0>] ksys_read+0x6c/0xfc
| [<0>] __arm64_sys_read+0x20/0x30
| [<0>] el0_svc_common.constprop.0+0x60/0x120
| [<0>] do_el0_svc+0x24/0x90
| [<0>] el0_svc+0x2c/0x54
| [<0>] el0_sync_handler+0x1a4/0x1b0
| [<0>] el0_sync+0x170/0x180

After this patch, the tracing code will not show up in such traces:

| # cat /proc/self/stack
| [<0>] proc_pid_stack+0xb4/0x130
| [<0>] proc_single_show+0x60/0x110
| [<0>] seq_read_iter+0x230/0x4d0
| [<0>] seq_read+0xdc/0x130
| [<0>] vfs_read+0xac/0x1e0
| [<0>] ksys_read+0x6c/0xfc
| [<0>] __arm64_sys_read+0x20/0x30
| [<0>] el0_svc_common.constprop.0+0x60/0x120
| [<0>] do_el0_svc+0x24/0x90
| [<0>] el0_svc+0x2c/0x54
| [<0>] el0_sync_handler+0x1a4/0x1b0
| [<0>] el0_sync+0x170/0x180

Erring on the side of caution, I've given this a spin with a bunch of
toolchains, verifying the output of /proc/self/stack and checking that
the assembly looked sound. For GCC (where we require version 5.1.0 or
later) I tested with the kernel.org crosstool binaries for versions
5.5.0, 6.4.0, 6.5.0, 7.3.0, 7.5.0, 8.1.0, 8.3.0, 8.4.0, 9.2.0, and
10.1.0. For clang (where we require version 10.0.1 or later) I tested
with the llvm.org binary releases of 11.0.0, and 11.0.1.

Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: Chen Jun 
Cc: Marco Elver 
Cc: Mark Brown 
Cc: Will Deacon 
---
 arch/arm64/kernel/stacktrace.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index ad20981dfda4..d55bdfb7789c 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -194,8 +194,9 @@ void show_stack(struct task_struct *tsk, unsigned long *sp, const char *loglvl)
 
 #ifdef CONFIG_STACKTRACE
 
-void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
-struct task_struct *task, struct pt_regs *regs)
+noinline void arch_stack_walk(stack_trace_consume_fn consume_entry,
+ void *cookie, struct task_struct *task,
+ struct pt_regs *regs)
 {
struct stackframe frame;
 
@@ -203,8 +204,8 @@ void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
start_backtrace(&frame, regs->regs[29], regs->pc);
else if (task == current)
start_backtrace(&frame,
-   (unsigned long)__builtin_frame_address(0),
-   (unsigned long)arch_stack_walk);
+   (unsigned long)__builtin_frame_address(1),
+   (unsigned long)__builtin_return_address(0));
else
start_backtrace(&frame, thread_saved_fp(task),
thread_saved_pc(task));
-- 
2.11.0



Re: [PATCH 2/2] arm64: stacktrace: Add skip when task == current

2021-03-18 Thread Mark Rutland
On Thu, Mar 18, 2021 at 04:17:24PM +, Catalin Marinas wrote:
> On Wed, Mar 17, 2021 at 07:34:16PM +0000, Mark Rutland wrote:
> > On Wed, Mar 17, 2021 at 06:36:36PM +, Catalin Marinas wrote:
> > > On Wed, Mar 17, 2021 at 02:20:50PM +, Chen Jun wrote:
> > > > On ARM64, cat /sys/kernel/debug/page_owner, all pages return the same
> > > > stack:
> > > >  stack_trace_save+0x4c/0x78
> > > >  register_early_stack+0x34/0x70
> > > >  init_page_owner+0x34/0x230
> > > >  page_ext_init+0x1bc/0x1dc
> > > > 
> > > > The reason is that:
> > > > check_recursive_alloc always return 1 because that
> > > > entries[0] is always equal to ip (__set_page_owner+0x3c/0x60).
> > > > 
> > > > The root cause is that:
> > > > commit 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> > > > make the save_trace save 2 more entries.
> > > > 
> > > > Add skip in arch_stack_walk when task == current.
> > > > 
> > > > Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> > > > Signed-off-by: Chen Jun 
> > > > ---
> > > >  arch/arm64/kernel/stacktrace.c | 5 +++--
> > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> > > > index ad20981..c26b0ac 100644
> > > > --- a/arch/arm64/kernel/stacktrace.c
> > > > +++ b/arch/arm64/kernel/stacktrace.c
> > > > @@ -201,11 +201,12 @@ void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
> > > >  
> > > > if (regs)
> > > > start_backtrace(&frame, regs->regs[29], regs->pc);
> > > > -   else if (task == current)
> > > > +   else if (task == current) {
> > > > +   ((struct stacktrace_cookie *)cookie)->skip += 2;
> > > > start_backtrace(&frame,
> > > > (unsigned long)__builtin_frame_address(0),
> > > > (unsigned long)arch_stack_walk);
> > > > -   else
> > > > +   } else
> > > > start_backtrace(&frame, thread_saved_fp(task),
> > > > thread_saved_pc(task));
> > > 
> > > I don't like abusing the cookie here. It's void * as it's meant to be an
> > > opaque type. I'd rather skip the first two frames in walk_stackframe()
> > > instead before invoking fn().
> > 
> > I agree that we shouldn't touch cookie here.
> > 
> > I don't think that it's right to bodge this inside walk_stackframe(),
> > since that'll add bogus skipping for the case starting with regs in the
> > current task. If we need a bodge, it has to live in arch_stack_walk()
> > where we set up the initial unwinding state.
> 
> Good point. However, instead of relying on __builtin_frame_address(1),
> can we add a 'skip' value to struct stackframe via arch_stack_walk() ->
> start_backtrace() that is consumed by walk_stackframe()?

We could, but I'd strongly prefer to use __builtin_frame_address(1) if
we can, as it's much simpler to read and keeps the logic constrained to
the starting function. I'd already hacked that up at:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/unwind=5811a76c1be1dcea7104a9a771fc2604bc2a90ef

... and I'm fairly confident that this works on arm64.

If __builtin_frame_address(1) is truly unreliable, then we could just
manually unwind one step within arch_stack_walk() when unwinding
current, which I think is cleaner than spreading this within
walk_stackframe().
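
Concretely, the "manually unwind one step" variant would be roughly the
below in the task == current path (sketch only, error handling elided):

	/*
	 * Start the walk at arch_stack_walk() itself, then immediately
	 * unwind one step so the first entry reported is our caller.
	 */
	start_backtrace(&frame,
			(unsigned long)__builtin_frame_address(0),
			(unsigned long)arch_stack_walk);
	unwind_frame(task, &frame);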

I can clean up the commit message and post that as a real patch, if you
like?

> > In another thread, we came to the conclusion that arch_stack_walk()
> > should start at its parent, and its parent should add any skipping it
> > requires.
> 
> This makes sense.
> 
> > Currently, arch_stack_walk() is off-by-one, and we can bodge that by
> > using __builtin_frame_address(1), though I'm waiting for some compiler
> > folk to confirm that's sound. Otherwise we need to add an assembly
> > trampoline to snapshot the FP, which is unfortunately convoluted.
> > 
> > This report suggests that a caller of arch_stack_walk() is off-by-one
> > too, which suggests a larger cross-architecture semantic issue. I'll try
> > to take a look tomorrow.
> 
> I don't think the 

Re: [PATCH 2/2] arm64: stacktrace: Add skip when task == current

2021-03-17 Thread Mark Rutland
On Wed, Mar 17, 2021 at 06:36:36PM +, Catalin Marinas wrote:
> On Wed, Mar 17, 2021 at 02:20:50PM +, Chen Jun wrote:
> > On ARM64, cat /sys/kernel/debug/page_owner, all pages return the same
> > stack:
> >  stack_trace_save+0x4c/0x78
> >  register_early_stack+0x34/0x70
> >  init_page_owner+0x34/0x230
> >  page_ext_init+0x1bc/0x1dc
> > 
> > The reason is that:
> > check_recursive_alloc always return 1 because that
> > entries[0] is always equal to ip (__set_page_owner+0x3c/0x60).
> > 
> > The root cause is that:
> > commit 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> > make the save_trace save 2 more entries.
> > 
> > Add skip in arch_stack_walk when task == current.
> > 
> > Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> > Signed-off-by: Chen Jun 
> > ---
> >  arch/arm64/kernel/stacktrace.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> > index ad20981..c26b0ac 100644
> > --- a/arch/arm64/kernel/stacktrace.c
> > +++ b/arch/arm64/kernel/stacktrace.c
> > @@ -201,11 +201,12 @@ void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
> >  
> > if (regs)
> > start_backtrace(&frame, regs->regs[29], regs->pc);
> > -   else if (task == current)
> > +   else if (task == current) {
> > +   ((struct stacktrace_cookie *)cookie)->skip += 2;
> > start_backtrace(&frame,
> > (unsigned long)__builtin_frame_address(0),
> > (unsigned long)arch_stack_walk);
> > -   else
> > +   } else
> > start_backtrace(&frame, thread_saved_fp(task),
> > thread_saved_pc(task));
> 
> I don't like abusing the cookie here. It's void * as it's meant to be an
> opaque type. I'd rather skip the first two frames in walk_stackframe()
> instead before invoking fn().

I agree that we shouldn't touch cookie here.

I don't think that it's right to bodge this inside walk_stackframe(),
since that'll add bogus skipping for the case starting with regs in the
current task. If we need a bodge, it has to live in arch_stack_walk()
where we set up the initial unwinding state.

In another thread, we came to the conclusion that arch_stack_walk()
should start at its parent, and its parent should add any skipping it
requires.

Currently, arch_stack_walk() is off-by-one, and we can bodge that by
using __builtin_frame_address(1), though I'm waiting for some compiler
folk to confirm that's sound. Otherwise we need to add an assembly
trampoline to snapshot the FP, which is unfortunately convoluted.

This report suggests that a caller of arch_stack_walk() is off-by-one
too, which suggests a larger cross-architecture semantic issue. I'll try
to take a look tomorrow.

Thanks,
Mark.

> 
> Prior to the conversion to ARCH_STACKWALK, we were indeed skipping two
> more entries in __save_stack_trace() if tsk == current. Something like
> below, completely untested:
> 
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index ad20981dfda4..2a9f759aa41a 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -115,10 +115,15 @@ NOKPROBE_SYMBOL(unwind_frame);
>  void notrace walk_stackframe(struct task_struct *tsk, struct stackframe *frame,
>bool (*fn)(void *, unsigned long), void *data)
>  {
> + /* for the current task, we don't want this function nor its caller */
> + int skip = tsk == current ? 2 : 0;
> +
>   while (1) {
>   int ret;
>  
> - if (!fn(data, frame->pc))
> + if (skip)
> + skip--;
> + else if (!fn(data, frame->pc))
>   break;
>   ret = unwind_frame(tsk, frame);
>   if (ret < 0)
> 
> 
> -- 
> Catalin


Re: arm64 syzbot instances

2021-03-17 Thread Mark Rutland
On Thu, Mar 11, 2021 at 05:56:46PM +0100, Dmitry Vyukov wrote:
> On Thu, Mar 11, 2021 at 1:33 PM Mark Rutland  wrote:
> > FWIW, I keep my fuzzing config fragment in my fuzzing/* branches on
> > git.kernel.org, and for comparison my fragment for v5.12-rc1 is:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=fuzzing/5.12-rc1=6d9f7f8a2514fe882823fadbe7478228f71d7ab1
> >
> > ... I'm not sure whether there's anything in that which is novel to you.
> 
> Hi Mark,
> 
> I've learned about DEBUG_TIMEKEEPING which we had disabled. I am enabling it.
> We also have CONTEXT_TRACKING_FORCE disabled. I don't completely
> understand what it's doing. Is it also "more debug checks" type of
> config?

Context tracking tracks user<->kernel transitions, and tries to disable
RCU when it is not needed (e.g. while a CPU is in usersspace), to avoid
the need to perturb that CPU with IPIs and so on. Normally this is not
enabled unless CPUs are set aside for NOHZ usage, as there's some
expense in doing this tracking. I haven't measured how expensive it is
in practice.

CONTEXT_TRACKING_FORCE enables that tracking regardless of whether any
CPUs are set aside for NOHZ usage, and makes it easier to find bugs in
that tracking code, or where it is not being used correctly (e.g. missed
calls, or called in the wrong places).

I added it to my debug fragment back when I fixed the arm64 entry code
accounting for lockdep, and I keep it around to make sure that we don't
accidentally regress any of that.
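
For reference, the relevant bits of that fragment boil down to a couple
of options (illustrative; this is the v5.12-era naming):

	CONFIG_CONTEXT_TRACKING_FORCE=y
	CONFIG_DEBUG_TIMEKEEPING=y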

Thanks,
Mark.

> FWIW we have more debug configs:
> https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/debug.yml
> https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/base.yml
> https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/kasan.yml
> https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/kmemleak.yml


[PATCHv3 4/6] arm64: entry: factor irq triage logic into macros

2021-03-15 Thread Mark Rutland
From: Marc Zyngier 

In subsequent patches we'll allow an FIQ handler to be registered, and
FIQ exceptions will need to be triaged very similarly to IRQ exceptions.
So that we can reuse the existing logic, this patch factors the IRQ
triage logic out into macros that can be reused for FIQ.

The macros are named to follow the elX_foo_handler scheme used by the C
exception handlers. For consistency with other top-level exception
handlers, the kernel_entry/kernel_exit logic is not moved into the
macros. As FIQ will use a different C handler, this handler name is
provided as an argument to the macros.

There should be no functional change as a result of this patch.

Signed-off-by: Marc Zyngier 
[Mark: rework macros, commit message, rebase before DAIF rework]
Signed-off-by: Mark Rutland 
Tested-by: Hector Martin 
Cc: Catalin Marinas 
Cc: James Morse 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm64/kernel/entry.S | 80 +--
 1 file changed, 43 insertions(+), 37 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index a31a0a713c85..e235b0e4e468 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -491,8 +491,8 @@ tsk .reqx28 // current thread_info
 /*
  * Interrupt handling.
  */
-   .macro  irq_handler
-   ldr_l   x1, handle_arch_irq
+   .macro  irq_handler, handler:req
+   ldr_l   x1, \handler
mov x0, sp
irq_stack_entry
blr x1
@@ -531,6 +531,45 @@ alternative_endif
 #endif
.endm
 
+   .macro el1_interrupt_handler, handler:req
+   gic_prio_irq_setup pmr=x20, tmp=x1
+   enable_da_f
+
+   mov x0, sp
+   bl  enter_el1_irq_or_nmi
+
+   irq_handler \handler
+
+#ifdef CONFIG_PREEMPTION
+   ldr x24, [tsk, #TSK_TI_PREEMPT] // get preempt count
+alternative_if ARM64_HAS_IRQ_PRIO_MASKING
+   /*
+* DA_F were cleared at start of handling. If anything is set in DAIF,
+* we come back from an NMI, so skip preemption
+*/
+   mrs x0, daif
+   orr x24, x24, x0
+alternative_else_nop_endif
+   cbnz x24, 1f // preempt count != 0 || NMI return path
+   bl  arm64_preempt_schedule_irq  // irq en/disable is done inside
+1:
+#endif
+
+   mov x0, sp
+   bl  exit_el1_irq_or_nmi
+   .endm
+
+   .macro el0_interrupt_handler, handler:req
+   gic_prio_irq_setup pmr=x20, tmp=x0
+   user_exit_irqoff
+   enable_da_f
+
+   tbz x22, #55, 1f
+   bl  do_el0_irq_bp_hardening
+1:
+   irq_handler \handler
+   .endm
+
.text
 
 /*
@@ -660,32 +699,7 @@ SYM_CODE_END(el1_sync)
.align  6
 SYM_CODE_START_LOCAL_NOALIGN(el1_irq)
kernel_entry 1
-   gic_prio_irq_setup pmr=x20, tmp=x1
-   enable_da_f
-
-   mov x0, sp
-   bl  enter_el1_irq_or_nmi
-
-   irq_handler
-
-#ifdef CONFIG_PREEMPTION
-   ldr x24, [tsk, #TSK_TI_PREEMPT] // get preempt count
-alternative_if ARM64_HAS_IRQ_PRIO_MASKING
-   /*
-* DA_F were cleared at start of handling. If anything is set in DAIF,
-* we come back from an NMI, so skip preemption
-*/
-   mrs x0, daif
-   orr x24, x24, x0
-alternative_else_nop_endif
-   cbnz x24, 1f // preempt count != 0 || NMI return path
-   bl  arm64_preempt_schedule_irq  // irq en/disable is done inside
-1:
-#endif
-
-   mov x0, sp
-   bl  exit_el1_irq_or_nmi
-
+   el1_interrupt_handler handle_arch_irq
kernel_exit 1
 SYM_CODE_END(el1_irq)
 
@@ -725,15 +739,7 @@ SYM_CODE_END(el0_error_compat)
 SYM_CODE_START_LOCAL_NOALIGN(el0_irq)
kernel_entry 0
 el0_irq_naked:
-   gic_prio_irq_setup pmr=x20, tmp=x0
-   user_exit_irqoff
-   enable_da_f
-
-   tbz x22, #55, 1f
-   bl  do_el0_irq_bp_hardening
-1:
-   irq_handler
-
+   el0_interrupt_handler handle_arch_irq
b   ret_to_user
 SYM_CODE_END(el0_irq)
 
-- 
2.11.0



[PATCHv3 6/6] arm64: irq: allow FIQs to be handled

2021-03-15 Thread Mark Rutland
On contemporary platforms we don't use FIQ, and treat any stray FIQ as a
fatal event. However, some platforms have an interrupt controller wired
to FIQ, and need to handle FIQ as part of regular operation.

So that we can support both cases dynamically, this patch updates the
FIQ exception handling code to operate the same way as the IRQ handling
code, with its own handle_arch_fiq handler. Where a root FIQ handler is
not registered, an unexpected FIQ exception will trigger the default FIQ
handler, which will panic() as today. Where a root FIQ handler is
registered, handling of the FIQ is deferred to that handler.

As el0_fiq_invalid_compat is supplanted by el0_fiq, the former is
removed. For !CONFIG_COMPAT builds we never expect to take an exception
from AArch32 EL0, so we keep the common el0_fiq_invalid handler.

Signed-off-by: Mark Rutland 
Tested-by: Hector Martin 
Cc: Catalin Marinas 
Cc: James Morse 
Cc: Marc Zyngier 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm64/include/asm/irq.h |  1 +
 arch/arm64/kernel/entry.S| 30 +-
 arch/arm64/kernel/irq.c  | 16 
 3 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index 8391c6f6f746..fac08e18bcd5 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -10,6 +10,7 @@ struct pt_regs;
 
 int set_handle_irq(void (*handle_irq)(struct pt_regs *));
 #define set_handle_irq set_handle_irq
+int set_handle_fiq(void (*handle_fiq)(struct pt_regs *));
 
 static inline int nr_legacy_irqs(void)
 {
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index ce8d4dc416fb..a86f50de2c7b 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -588,18 +588,18 @@ SYM_CODE_START(vectors)
 
kernel_ventry   1, sync // Synchronous EL1h
kernel_ventry   1, irq  // IRQ EL1h
-   kernel_ventry   1, fiq_invalid  // FIQ EL1h
+   kernel_ventry   1, fiq  // FIQ EL1h
kernel_ventry   1, error// Error EL1h
 
kernel_ventry   0, sync // Synchronous 64-bit EL0
kernel_ventry   0, irq  // IRQ 64-bit EL0
-   kernel_ventry   0, fiq_invalid  // FIQ 64-bit EL0
+   kernel_ventry   0, fiq  // FIQ 64-bit EL0
kernel_ventry   0, error// Error 64-bit EL0
 
 #ifdef CONFIG_COMPAT
kernel_ventry   0, sync_compat, 32  // Synchronous 32-bit EL0
kernel_ventry   0, irq_compat, 32   // IRQ 32-bit EL0
-   kernel_ventry   0, fiq_invalid_compat, 32   // FIQ 32-bit EL0
+   kernel_ventry   0, fiq_compat, 32   // FIQ 32-bit EL0
kernel_ventry   0, error_compat, 32 // Error 32-bit EL0
 #else
kernel_ventry   0, sync_invalid, 32 // Synchronous 32-bit EL0
@@ -665,12 +665,6 @@ SYM_CODE_START_LOCAL(el0_error_invalid)
inv_entry 0, BAD_ERROR
 SYM_CODE_END(el0_error_invalid)
 
-#ifdef CONFIG_COMPAT
-SYM_CODE_START_LOCAL(el0_fiq_invalid_compat)
-   inv_entry 0, BAD_FIQ, 32
-SYM_CODE_END(el0_fiq_invalid_compat)
-#endif
-
 SYM_CODE_START_LOCAL(el1_sync_invalid)
inv_entry 1, BAD_SYNC
 SYM_CODE_END(el1_sync_invalid)
@@ -705,6 +699,12 @@ SYM_CODE_START_LOCAL_NOALIGN(el1_irq)
kernel_exit 1
 SYM_CODE_END(el1_irq)
 
+SYM_CODE_START_LOCAL_NOALIGN(el1_fiq)
+   kernel_entry 1
+   el1_interrupt_handler handle_arch_fiq
+   kernel_exit 1
+SYM_CODE_END(el1_fiq)
+
 /*
  * EL0 mode handlers.
  */
@@ -731,6 +731,11 @@ SYM_CODE_START_LOCAL_NOALIGN(el0_irq_compat)
b   el0_irq_naked
 SYM_CODE_END(el0_irq_compat)
 
+SYM_CODE_START_LOCAL_NOALIGN(el0_fiq_compat)
+   kernel_entry 0, 32
+   b   el0_fiq_naked
+SYM_CODE_END(el0_fiq_compat)
+
 SYM_CODE_START_LOCAL_NOALIGN(el0_error_compat)
kernel_entry 0, 32
b   el0_error_naked
@@ -745,6 +750,13 @@ el0_irq_naked:
b   ret_to_user
 SYM_CODE_END(el0_irq)
 
+SYM_CODE_START_LOCAL_NOALIGN(el0_fiq)
+   kernel_entry 0
+el0_fiq_naked:
+   el0_interrupt_handler handle_arch_fiq
+   b   ret_to_user
+SYM_CODE_END(el0_fiq)
+
 SYM_CODE_START_LOCAL(el1_error)
kernel_entry 1
mrs x1, esr_el1
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 2fe0b535de30..bda49430c9ea 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -76,7 +76,13 @@ static void default_handle_irq(struct pt_regs *regs)
panic("IRQ taken without a root IRQ handler\n");
 }
 
+static void default_handle_fiq(struct pt_regs *regs)
+{
+   panic("FIQ taken without a root FIQ handler\n");
+}
+
 void (*handle_arch_irq)(struct pt_regs *) __ro_after_init = default_handle_irq;
+void (*handle_arch

[PATCHv3 5/6] arm64: Always keep DAIF.[IF] in sync

2021-03-15 Thread Mark Rutland
From: Hector Martin 

Apple SoCs (A11 and newer) have some interrupt sources hardwired to the
FIQ line. We implement support for this by simply treating IRQs and FIQs
the same way in the interrupt vectors.

To support these systems, the FIQ mask bit needs to be kept in sync with
the IRQ mask bit, so both kinds of exceptions are masked together. No
other platforms should be delivering FIQ exceptions right now, and we
already unmask FIQ in normal process context, so this should not have an
effect on other systems - if spurious FIQs were arriving, they would
already panic the kernel.

Signed-off-by: Hector Martin 
Signed-off-by: Mark Rutland 
Tested-by: Hector Martin 
Cc: Catalin Marinas 
Cc: James Morse 
Cc: Marc Zyngier 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm64/include/asm/arch_gicv3.h |  2 +-
 arch/arm64/include/asm/assembler.h  |  8 
 arch/arm64/include/asm/daifflags.h  | 10 +-
 arch/arm64/include/asm/irqflags.h   | 16 +++-
 arch/arm64/kernel/entry.S   | 12 +++-
 arch/arm64/kernel/process.c |  2 +-
 arch/arm64/kernel/smp.c |  1 +
 7 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/arch_gicv3.h b/arch/arm64/include/asm/arch_gicv3.h
index 880b9054d75c..934b9be582d2 100644
--- a/arch/arm64/include/asm/arch_gicv3.h
+++ b/arch/arm64/include/asm/arch_gicv3.h
@@ -173,7 +173,7 @@ static inline void gic_pmr_mask_irqs(void)
 
 static inline void gic_arch_enable_irqs(void)
 {
-   asm volatile ("msr daifclr, #2" : : : "memory");
+   asm volatile ("msr daifclr, #3" : : : "memory");
 }
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index ca31594d3d6c..b76a71e84b7c 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -40,9 +40,9 @@
msr daif, \flags
.endm
 
-   /* IRQ is the lowest priority flag, unconditionally unmask the rest. */
-   .macro enable_da_f
-   msr daifclr, #(8 | 4 | 1)
+   /* IRQ/FIQ are the lowest priority flags, unconditionally unmask the rest. */
+   .macro enable_da
+   msr daifclr, #(8 | 4)
.endm
 
 /*
@@ -50,7 +50,7 @@
  */
.macro  save_and_disable_irq, flags
mrs \flags, daif
-   msr daifset, #2
+   msr daifset, #3
.endm
 
.macro  restore_irq, flags
diff --git a/arch/arm64/include/asm/daifflags.h b/arch/arm64/include/asm/daifflags.h
index 1c26d7baa67f..5eb7af9c4557 100644
--- a/arch/arm64/include/asm/daifflags.h
+++ b/arch/arm64/include/asm/daifflags.h
@@ -13,8 +13,8 @@
 #include 
 
 #define DAIF_PROCCTX   0
-#define DAIF_PROCCTX_NOIRQ PSR_I_BIT
-#define DAIF_ERRCTX(PSR_I_BIT | PSR_A_BIT)
+#define DAIF_PROCCTX_NOIRQ (PSR_I_BIT | PSR_F_BIT)
+#define DAIF_ERRCTX(PSR_A_BIT | PSR_I_BIT | PSR_F_BIT)
 #define DAIF_MASK  (PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT)
 
 
@@ -47,7 +47,7 @@ static inline unsigned long local_daif_save_flags(void)
if (system_uses_irq_prio_masking()) {
/* If IRQs are masked with PMR, reflect it in the flags */
if (read_sysreg_s(SYS_ICC_PMR_EL1) != GIC_PRIO_IRQON)
-   flags |= PSR_I_BIT;
+   flags |= PSR_I_BIT | PSR_F_BIT;
}
 
return flags;
@@ -69,7 +69,7 @@ static inline void local_daif_restore(unsigned long flags)
bool irq_disabled = flags & PSR_I_BIT;
 
WARN_ON(system_has_prio_mask_debugging() &&
-   !(read_sysreg(daif) & PSR_I_BIT));
+   (read_sysreg(daif) & (PSR_I_BIT | PSR_F_BIT)) != (PSR_I_BIT | PSR_F_BIT));
 
if (!irq_disabled) {
trace_hardirqs_on();
@@ -86,7 +86,7 @@ static inline void local_daif_restore(unsigned long flags)
 * If interrupts are disabled but we can take
 * asynchronous errors, we can take NMIs
 */
-   flags &= ~PSR_I_BIT;
+   flags &= ~(PSR_I_BIT | PSR_F_BIT);
pmr = GIC_PRIO_IRQOFF;
} else {
pmr = GIC_PRIO_IRQON | GIC_PRIO_PSR_I_SET;
diff --git a/arch/arm64/include/asm/irqflags.h b/arch/arm64/include/asm/irqflags.h
index ff328e5bbb75..b57b9b1e4344 100644
--- a/arch/arm64/include/asm/irqflags.h
+++ b/arch/arm64/include/asm/irqflags.h
@@ -12,15 +12,13 @@
 
 /*
  * Aarch64 has flags for masking: Debug, Asynchronous (serror), Interrupts and
- * FIQ exceptions, in the 'daif' register. We mask and unmask them in 'dai'
+ * FIQ exceptions, in the 'daif' register. We mask and unmask them in 'daif'
  * order:
  * Masking debug exceptions causes all other exceptions to be masked too/
- * Masking SError masks irq, but not debug exceptions. Masking irqs has no
- *

[PATCHv3 3/6] arm64: irq: rework root IRQ handler registration

2021-03-15 Thread Mark Rutland
If we accidentally unmask IRQs before we've registered a root IRQ
handler, handle_arch_irq will be NULL, and the IRQ exception handler
will branch to a bogus address.

To make this easier to debug, this patch initialises handle_arch_irq to
a default handler which will panic(), making such problems easier to
debug. When we add support for FIQ handlers, we can follow the same
approach.

When we add support for a root FIQ handler, it's possible to have a root
IRQ handler without a root FIQ handler, and in theory the inverse is
also possible. To permit this, and to keep the IRQ/FIQ registration
logic similar, this patch removes the panic in the absence of a root IRQ
controller. Instead, set_handle_irq() logs when a handler is registered,
which is sufficient for debug purposes.

Signed-off-by: Mark Rutland 
Tested-by: Hector Martin 
Cc: Catalin Marinas 
Cc: James Morse 
Cc: Marc Zyngier 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm64/kernel/irq.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index ad63bd50fa7b..2fe0b535de30 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -71,14 +71,20 @@ static void init_irq_stacks(void)
 }
 #endif
 
-void (*handle_arch_irq)(struct pt_regs *) __ro_after_init;
+static void default_handle_irq(struct pt_regs *regs)
+{
+   panic("IRQ taken without a root IRQ handler\n");
+}
+
+void (*handle_arch_irq)(struct pt_regs *) __ro_after_init = default_handle_irq;
 
 int __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
 {
-   if (handle_arch_irq)
+   if (handle_arch_irq != default_handle_irq)
return -EBUSY;
 
handle_arch_irq = handle_irq;
+   pr_info("Root IRQ handler: %ps\n", handle_irq);
return 0;
 }
 
@@ -87,8 +93,6 @@ void __init init_IRQ(void)
init_irq_stacks();
init_irq_scs();
irqchip_init();
-   if (!handle_arch_irq)
-   panic("No interrupt controller found.");
 
if (system_uses_irq_prio_masking()) {
/*
-- 
2.11.0



[PATCHv3 1/6] genirq: Allow architectures to override set_handle_irq() fallback

2021-03-15 Thread Mark Rutland
From: Marc Zyngier 

Some architectures want to provide the generic set_handle_irq() API, but
for structural reasons need to provide their own implementation. For
example, arm64 needs to do this to provide uniform set_handle_irq() and
set_handle_fiq() registration functions.

Make this possible by allowing architectures to provide their own
implementation of set_handle_irq when CONFIG_GENERIC_IRQ_MULTI_HANDLER
is not selected.

Signed-off-by: Marc Zyngier 
[Mark: expand commit message]
Signed-off-by: Mark Rutland 
Tested-by: Hector Martin 
Cc: Catalin Marinas 
Cc: James Morse 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 include/linux/irq.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 2efde6a79b7e..9890180b84fd 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1258,11 +1258,13 @@ int __init set_handle_irq(void (*handle_irq)(struct pt_regs *));
  */
 extern void (*handle_arch_irq)(struct pt_regs *) __ro_after_init;
 #else
+#ifndef set_handle_irq
 #define set_handle_irq(handle_irq) \
do {\
(void)handle_irq;   \
WARN_ON(1); \
} while (0)
 #endif
+#endif
 
 #endif /* _LINUX_IRQ_H */
-- 
2.11.0



[PATCHv3 2/6] arm64: don't use GENERIC_IRQ_MULTI_HANDLER

2021-03-15 Thread Mark Rutland
From: Marc Zyngier 

In subsequent patches we want to allow irqchip drivers to register as
FIQ handlers, with a set_handle_fiq() function. To keep the IRQ/FIQ
paths similar, we want arm64 to provide both set_handle_irq() and
set_handle_fiq(), rather than using GENERIC_IRQ_MULTI_HANDLER for the
former.

This patch adds an arm64-specific implementation of set_handle_irq().
There should be no functional change as a result of this patch.

Signed-off-by: Marc Zyngier 
[Mark: use a single handler pointer]
Signed-off-by: Mark Rutland 
Tested-by: Hector Martin 
Cc: Catalin Marinas 
Cc: James Morse 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm64/Kconfig   |  1 -
 arch/arm64/include/asm/irq.h |  3 +++
 arch/arm64/kernel/irq.c  | 11 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5656e7aacd69..e7d2405be71f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -110,7 +110,6 @@ config ARM64
select GENERIC_EARLY_IOREMAP
select GENERIC_IDLE_POLL_SETUP
select GENERIC_IRQ_IPI
-   select GENERIC_IRQ_MULTI_HANDLER
select GENERIC_IRQ_PROBE
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index b2b0c6405eb0..8391c6f6f746 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -8,6 +8,9 @@
 
 struct pt_regs;
 
+int set_handle_irq(void (*handle_irq)(struct pt_regs *));
+#define set_handle_irq set_handle_irq
+
 static inline int nr_legacy_irqs(void)
 {
return 0;
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index dfb1feab867d..ad63bd50fa7b 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -71,6 +71,17 @@ static void init_irq_stacks(void)
 }
 #endif
 
+void (*handle_arch_irq)(struct pt_regs *) __ro_after_init;
+
+int __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
+{
+   if (handle_arch_irq)
+   return -EBUSY;
+
+   handle_arch_irq = handle_irq;
+   return 0;
+}
+
 void __init init_IRQ(void)
 {
init_irq_stacks();
-- 
2.11.0



[PATCHv3 0/6] arm64: Support FIQ controller registration

2021-03-15 Thread Mark Rutland
Hector's M1 support series [1] shows that some platforms have critical
interrupts wired to FIQ, and to support these platforms we need to support
handling FIQ exceptions. Other contemporary platforms don't use FIQ (since e.g.
this is usually routed to EL3), and as we never expect to take an FIQ, we have
the FIQ vector cause a panic.

Since the use of FIQ is a platform integration detail (which can differ across
bare-metal and virtualized environments), we need to be able to explicitly opt in
to handling FIQs while retaining the existing behaviour otherwise. This series
adds a new set_handle_fiq() hook so that the FIQ controller can do so, and
where no controller is registered the default handler will panic(). For
consistency the set_handle_irq() code is made to do the same.

The first four patches move arm64 over to a local set_handle_irq()
implementation, which is written to share code with a set_handle_fiq() function
in the last two patches. This adds a default handler which will directly
panic() rather than branching to NULL if an IRQ is taken unexpectedly, and the
boot-time panic in the absence of a handler is removed (for consistency with
FIQ support added later).

The penultimate patch reworks arm64's IRQ masking to always keep DAIF.[IF] in
sync, so that we can treat IRQ and FIQ as equals. This is cherry-picked from
Hector's reply [2] to the first version of this series.

The final patch adds the low-level FIQ exception handling and registration
mechanism atop the prior rework.

I'm hoping this is ready to be merged into the arm64 tree, given the
preparatory cleanup made it into v5.12-rc3. I've pushed the series out to my
arm64/fiq branch [3] on kernel.org, also tagged as arm64-fiq-20210315, atop
v5.12-rc3.

Since v1 [4]:
* Rebase to v5.12-rc1
* Pick up Hector's latest DAIF.[IF] patch
* Use "root {IRQ,FIQ} handler" rather than "{IRQ,FIQ} controller"
* Remove existing panic per Marc's comments
* Log registered root handlers
* Make default root handlers static
* Remove redundant el0_fiq_invalid_compat, per Joey's comments

Since v2 [5]:
* Fold in Hector's Tested-by tags
* Rebase to v5.12-rc3
* Drop patches merged in v5.12-rc3

[1] https://lore.kernel.org/r/20210215121713.57687-1-mar...@marcan.st
[2] https://lore.kernel.org/r/20210219172530.45805-1-mar...@marcan.st
[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/fiq
[4] https://lore.kernel.org/r/20210219113904.41736-1-mark.rutl...@arm.com
[5] https://lore.kernel.org/r/20210302101211.2328-1-mark.rutl...@arm.com

Thanks,
Mark.

Hector Martin (1):
  arm64: Always keep DAIF.[IF] in sync

Marc Zyngier (3):
  genirq: Allow architectures to override set_handle_irq() fallback
  arm64: don't use GENERIC_IRQ_MULTI_HANDLER
  arm64: entry: factor irq triage logic into macros

Mark Rutland (2):
  arm64: irq: rework root IRQ handler registration
  arm64: irq: allow FIQs to be handled

 arch/arm64/Kconfig  |   1 -
 arch/arm64/include/asm/arch_gicv3.h |   2 +-
 arch/arm64/include/asm/assembler.h  |   8 +--
 arch/arm64/include/asm/daifflags.h  |  10 ++--
 arch/arm64/include/asm/irq.h|   4 ++
 arch/arm64/include/asm/irqflags.h   |  16 +++--
 arch/arm64/kernel/entry.S   | 114 +---
 arch/arm64/kernel/irq.c |  35 ++-
 arch/arm64/kernel/process.c |   2 +-
 arch/arm64/kernel/smp.c |   1 +
 include/linux/irq.h |   2 +
 11 files changed, 125 insertions(+), 70 deletions(-)

-- 
2.11.0



Re: arm64 syzbot instances

2021-03-11 Thread Mark Rutland
On Thu, Mar 11, 2021 at 12:38:21PM +0100, 'Dmitry Vyukov' via syzkaller wrote:
> Hi arm64 maintainers,

Hi Dmitry,

> We now have some syzbot instances testing arm64 (woohoo!) using qemu
> emulation. I wanted to write up the current status.

Nice!

> There are 3 instances, first uses KASAN:
> https://syzkaller.appspot.com/upstream?manager=ci-qemu2-arm64
> second KASAN and 32-bit userspace test load (compat):
> https://syzkaller.appspot.com/upstream?manager=ci-qemu2-arm64-compat
> third uses MTE/KASAN_HWTAGS:
> https://syzkaller.appspot.com/upstream?manager=ci-qemu2-arm64-mte
> 
> Kernel configs:
> https://github.com/google/syzkaller/blob/master/dashboard/config/linux/upstream-arm64-kasan.config
> https://github.com/google/syzkaller/blob/master/dashboard/config/linux/upstream-arm64-mte.config

FWIW, I keep my fuzzing config fragment in my fuzzing/* branches on
git.kernel.org, and for comparison my fragment for v5.12-rc1 is:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=fuzzing/5.12-rc1=6d9f7f8a2514fe882823fadbe7478228f71d7ab1

... I'm not sure whether there's anything in that which is novel to you.

> The instances have KCOV disabled because it slows down execution too
> much (KASAN in qemu emulation is already extremely slow), so no
> coverage guidance and coverage reports for now :(
> 
> The instances found few arm64-specific issues that we have not
> observed on other instances:
> https://syzkaller.appspot.com/bug?id=1d22a2cc3521d5cf6b41bd6b825793c2015f861f
> https://syzkaller.appspot.com/bug?id=bb2c16b0e13b4de4bbf22cf6a4b9b16fb0c20eea
> https://syzkaller.appspot.com/bug?id=b75386f45318ec181b7f49260d619fac9877d456
> https://syzkaller.appspot.com/bug?id=5a1bc29bca656159f95c7c8bb30e3776ca860332
> but mostly re-discovering known bugs we already found on x86.

Likewise, my general experience these days (fuzzing under KVM on a
ThunderX2 host) is that we mostly hit issues in core code or drivers
rather than anything strictly specific to arm64. As my host is ARMv8.1
that might just be by virtue of not exercising many of the new
architectural features.

> The instances use qemu emulation and lots of debug configs, so they
> are quite slow and it makes sense to target them at arm64-specific
> parts of the kernel as much as possible (rather
> than stress generic subsystems that are already stressed on x86).
> So the question is: what arm64-specific parts are there that we can reach
> in qemu?
> Can you think of any qemu flags (cpu features, device emulation, etc)?

Generally, `-cpu max` will expose the more interesting CPU features, and
you already seem to have that, so I think you're mostly there on that
front.

Devices vary a lot between SoCs (and most aren't even emulated), so
unless you have particular platforms in mind I'd suggest it might be
better to just use PV devices and try to focus fuzzing on arch code and
common code like mm rather than drivers.

> Any kernel subsystems with heavy arm-specific parts that we may be missing?

It looks like your configs already have BPF, which is probably one of
the more interesting subsystems with architecture-specific bits, so I
don't have further suggestions on that front.

> Testing some of the arm64 drivers that qemu can emulate may be the
> most profitable thing.
> Currently the instances use the following flags:
> -machine virt,virtualization=on,graphics=on,usb=on -cpu cortex-a57
> -machine virt,virtualization=on,mte=on,graphics=on,usb=on -cpu max

With `-cpu max`, QEMU will use a relatively expensive SW implementation
of pointer authentication (which I found significantly magnified the
cost of instrumentation like kcov), so depending on your priorities you
might want to disable that or (assuming you have a recent enough build
of QEMU) you might want to force the use of a cheaper algorithm by
passing `-cpu max,pauth-impdef`.

The relevant QEMU commit is:

eb94284d0812b4e7 ("target/arm: Add cpu properties to control pauth")

... but it looks like that might not be in a tagged release yet.

Thanks,
Mark.

> mte=on + virtualization=on is broken in the kernel on in the qemu:
> https://lore.kernel.org/lkml/CAAeHK+wDz8aSLyjq1b=q3+hg9ajxxwyr6+gn_ftttmn5osm...@mail.gmail.com/
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to syzkaller+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/syzkaller/CACT4Y%2BbeyZ7rjmy7im0KdSU-Pcqd4Rud3xsxonBbYVk0wU-B9g%40mail.gmail.com.


Re: [PATCH v1] powerpc: Include running function as first entry in save_stack_trace() and friends

2021-03-10 Thread Mark Rutland
On Tue, Mar 09, 2021 at 04:05:32PM -0600, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Mar 09, 2021 at 04:05:23PM +0000, Mark Rutland wrote:
> > On Thu, Mar 04, 2021 at 03:54:48PM -0600, Segher Boessenkool wrote:
> > > On Thu, Mar 04, 2021 at 02:57:30PM +, Mark Rutland wrote:
> > > > It looks like GCC is happy to give us the function-entry-time FP if we 
> > > > use
> > > > __builtin_frame_address(1),
> > > 
> > > From the GCC manual:
> > >  Calling this function with a nonzero argument can have
> > >  unpredictable effects, including crashing the calling program.  As
> > >  a result, calls that are considered unsafe are diagnosed when the
> > >  '-Wframe-address' option is in effect.  Such calls should only be
> > >  made in debugging situations.
> > > 
> > > It *does* warn (the warning is in -Wall btw), on both powerpc and
> > > aarch64.  Furthermore, using this builtin causes lousy code (it forces
> > > the use of a frame pointer, which we normally try very hard to optimise
> > > away, for good reason).
> > > 
> > > And, that warning is not an idle warning.  Non-zero arguments to
> > > __builtin_frame_address can crash the program.  It won't on simpler
> > > functions, but there is no real definition of what a simpler function
> > > *is*.  It is meant for debugging, not for production use (this is also
> > > why no one has bothered to make it faster).
> > >
> > > On Power it should work, but on pretty much any other arch it won't.
> > 
> > I understand this is true generally, and cannot be relied upon in
> > portable code. However as you hint here for Power, I believe that on
> > arm64 __builtin_frame_address(1) shouldn't crash the program due to the
> > way frame records work on arm64, but I'll go check with some local
> > compiler folk. I agree that __builtin_frame_address(2) and beyond
> > certainly can, e.g.  by NULL dereference and similar.
> 
> I still do not know the aarch64 ABI well enough.  If only I had time!
> 
> > For context, why do you think this would work on power specifically? I
> > wonder if our rationale is similar.
> 
> On most 64-bit Power ABIs all stack frames are connected together as a
> linked list (which is updated atomically, importantly).  This makes it
> possible to always find all previous stack frames.

We have something similar on arm64, where the kernel depends on being
built with a frame pointer following the AAPCS frame pointer rules.

Every stack frame contains a "frame record" *somewhere* within that
stack frame, and the frame records are chained together as a linked
list. The frame pointer points at the most recent frame record (and this
is what __builtin_frame_address(0) returns).

The records themselves are basically:

struct record {
struct record *next;
unsigned long ret_addr;
};

At function call boundaries, we know that the FP is the caller's record
(or NULL for the first function), and the LR is the address the current
function should return to. Within a function with a stack frame, we can
access that function's record and the `next` field (equivalent to the FP
at the time of entry to the function) is what __builtin_frame_address(1)
should return.
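
As a rough sketch (mine, not from the original mail) of what that buys us,
assuming a kernel built with frame pointers per the above:

#include <linux/printk.h>

struct record {
	struct record *next;
	unsigned long ret_addr;
};

static noinline void show_caller(void)
{
	/* The FP points at this function's most recent frame record */
	struct record *rec = __builtin_frame_address(0);

	/* The address this function will return to */
	pr_info("returns to: %pS\n", (void *)rec->ret_addr);

	/*
	 * rec->next is the FP at entry to this function, i.e. the caller's
	 * record -- the value __builtin_frame_address(1) should give us.
	 */
	if (rec->next)
		pr_info("caller returns to: %pS\n",
			(void *)rec->next->ret_addr);
}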

> > Are you aware of anything in particular that breaks using
> > __builtin_frame_address(1) in non-portable code, or is this just a
> > general sentiment of this not being a supported use-case?
> 
> It is not supported, and trying to do it anyway can crash: it can use
> random stack contents as pointer!  Not really "random" of course, but
> where it thinks to find a pointer into the previous frame, which is not
> something it can rely on (unless the ABI guarantees it somehow).
> 
> See gcc.gnu.org/PR60109 for example.

Sure; I see that being true generally (and Ramana noted that on 32-bit
arm a frame pointer wasn't mandated), but I think in this case we have a
stronger target (and configuration) specific guarantee.

> > > > Unless we can get some strong guarantees from compiler folk such that we
> > > > can guarantee a specific function acts boundary for unwinding (and
> > > > doesn't itself get split, etc), the only reliable way I can think to
> > > > solve this requires an assembly trampoline. Whatever we do is liable to
> > > > need some invasive rework.
> > > 
> > > You cannot get such a guarantee, other than not letting the compiler
> > > see into the routine at all, like with assembler code (not inline asm,
> > > real assembler code).
> > 
> > If we cannot reliably ensure this then I'm h

Re: [PATCH] arm64: perf: Fix 64-bit event counter read truncation

2021-03-10 Thread Mark Rutland
On Tue, Mar 09, 2021 at 05:44:12PM -0700, Rob Herring wrote:
> Commit 0fdf1bb75953 ("arm64: perf: Avoid PMXEV* indirection") changed
> armv8pmu_read_evcntr() to return a u32 instead of u64. The result is
> silent truncation of the event counter when using 64-bit counters. Given
> the offending commit appears to have passed thru several folks, it seems
> likely this was a bad rebase after v8.5 PMU 64-bit counters landed.

IIRC I wrote the indirection patch first, so this does sound like an
oversight when rebasing or reworking the patch.

Comparing against commit 0fdf1bb75953, this does appear to be the only
point of truncation given read_pmevcntrn() directly returns the result
of read_sysreg(), so:

Acked-by: Mark Rutland 

Will, could you pick this up?

Thanks,
Mark.

> Fixes: 0fdf1bb75953 ("arm64: perf: Avoid PMXEV* indirection")
> Cc: Alexandru Elisei 
> Cc: Julien Thierry 
> Cc: Mark Rutland 
> Cc: Will Deacon 
> Cc: Catalin Marinas 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Alexander Shishkin 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Signed-off-by: Rob Herring 
> ---
>  arch/arm64/kernel/perf_event.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index 7d2318f80955..4658fcf88c2b 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -460,7 +460,7 @@ static inline int armv8pmu_counter_has_overflowed(u32 
> pmnc, int idx)
>   return pmnc & BIT(ARMV8_IDX_TO_COUNTER(idx));
>  }
>  
> -static inline u32 armv8pmu_read_evcntr(int idx)
> +static inline u64 armv8pmu_read_evcntr(int idx)
>  {
>   u32 counter = ARMV8_IDX_TO_COUNTER(idx);
>  
> -- 
> 2.27.0
> 


Re: [PATCH v1] powerpc: Include running function as first entry in save_stack_trace() and friends

2021-03-09 Thread Mark Rutland
On Thu, Mar 04, 2021 at 03:54:48PM -0600, Segher Boessenkool wrote:
> Hi!

Hi Segher,

> On Thu, Mar 04, 2021 at 02:57:30PM +, Mark Rutland wrote:
> > It looks like GCC is happy to give us the function-entry-time FP if we use
> > __builtin_frame_address(1),
> 
> From the GCC manual:
>  Calling this function with a nonzero argument can have
>  unpredictable effects, including crashing the calling program.  As
>  a result, calls that are considered unsafe are diagnosed when the
>  '-Wframe-address' option is in effect.  Such calls should only be
>  made in debugging situations.
> 
> It *does* warn (the warning is in -Wall btw), on both powerpc and
> aarch64.  Furthermore, using this builtin causes lousy code (it forces
> the use of a frame pointer, which we normally try very hard to optimise
> away, for good reason).
> 
> And, that warning is not an idle warning.  Non-zero arguments to
> __builtin_frame_address can crash the program.  It won't on simpler
> functions, but there is no real definition of what a simpler function
> *is*.  It is meant for debugging, not for production use (this is also
> why no one has bothered to make it faster).
>
> On Power it should work, but on pretty much any other arch it won't.

I understand this is true generally, and cannot be relied upon in
portable code. However as you hint here for Power, I believe that on
arm64 __builtin_frame_address(1) shouldn't crash the program due to the
way frame records work on arm64, but I'll go check with some local
compiler folk. I agree that __builtin_frame_address(2) and beyond
certainly can, e.g.  by NULL dereference and similar.

For context, why do you think this would work on power specifically? I
wonder if our rationale is similar.

Are you aware of anything in particular that breaks using
__builtin_frame_address(1) in non-portable code, or is this just a
general sentiment of this not being a supported use-case?

> > Unless we can get some strong guarantees from compiler folk such that we
> > can guarantee a specific function acts boundary for unwinding (and
> > doesn't itself get split, etc), the only reliable way I can think to
> > solve this requires an assembly trampoline. Whatever we do is liable to
> > need some invasive rework.
> 
> You cannot get such a guarantee, other than not letting the compiler
> see into the routine at all, like with assembler code (not inline asm,
> real assembler code).

If we cannot reliably ensure this then I'm happy to go write an assembly
trampoline to snapshot the state at a function call boundary (where our
procedure call standard mandates the state of the LR, FP, and frame
records pointed to by the FP). This'll require reworking a reasonable
amount of code cross-architecture, so I'll need to get some more
concrete justification (e.g. examples of things that can go wrong in
practice).

> The real way forward is to bite the bullet and to no longer pretend you
> can do a full backtrace from just the stack contents.  You cannot.

I think what you mean here is that there's no reliable way to handle the
current/leaf function, right? If so I do agree.

Beyond that I believe that arm64's frame records should be sufficient.

Thanks,
Mark.


Re: [PATCH v14 5/8] arm64: mte: Enable TCO in functions that can read beyond buffer limits

2021-03-08 Thread Mark Rutland
On Mon, Mar 08, 2021 at 04:14:31PM +, Vincenzo Frascino wrote:
> load_unaligned_zeropad() and __get/put_kernel_nofault() functions can
> read passed some buffer limits which may include some MTE granule with a
> different tag.

s/passed/past/

> When MTE async mode is enable, the load operation crosses the boundaries

s/enable/enabled/

> and the next granule has a different tag the PE sets the TFSR_EL1.TF1 bit
> as if an asynchronous tag fault is happened.
> 
> Enable Tag Check Override (TCO) in these functions  before the load and
> disable it afterwards to prevent this to happen.
> 
> Note: The same condition can be hit in MTE sync mode but we deal with it
> through the exception handling.
> In the current implementation, mte_async_mode flag is set only at boot
> time but in future kasan might acquire some runtime features that
> that change the mode dynamically, hence we disable it when sync mode is
> selected for future proof.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Reported-by: Branislav Rankov 
> Tested-by: Branislav Rankov 
> Signed-off-by: Vincenzo Frascino 
> ---
>  arch/arm64/include/asm/uaccess.h| 24 
>  arch/arm64/include/asm/word-at-a-time.h |  4 
>  arch/arm64/kernel/mte.c | 22 ++
>  3 files changed, 50 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/uaccess.h 
> b/arch/arm64/include/asm/uaccess.h
> index 0deb88467111..a857f8f82aeb 100644
> --- a/arch/arm64/include/asm/uaccess.h
> +++ b/arch/arm64/include/asm/uaccess.h
> @@ -188,6 +188,26 @@ static inline void __uaccess_enable_tco(void)
>ARM64_MTE, CONFIG_KASAN_HW_TAGS));
>  }
>  
> +/* Whether the MTE asynchronous mode is enabled. */
> +DECLARE_STATIC_KEY_FALSE(mte_async_mode);

Can we please hide this behind something like:

static inline bool system_uses_mte_async_mode(void)
{
	return IS_ENABLED(CONFIG_KASAN_HW_TAGS) &&
		static_branch_unlikely(&mte_async_mode);
}

... like we do for system_uses_ttbr0_pan()?

That way the callers are easier to read, and kernels built without
CONFIG_KASAN_HW_TAGS don't have the static branch at all. I reckon you
can put that in one of the mte headers and include it where needed.
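
As a sketch (mine, not a literal replacement hunk), the call sites in the
arch/arm64/include/asm/uaccess.h hunk below would then read:

static inline void __uaccess_disable_tco_async(void)
{
	if (system_uses_mte_async_mode())
		__uaccess_disable_tco();
}

static inline void __uaccess_enable_tco_async(void)
{
	if (system_uses_mte_async_mode())
		__uaccess_enable_tco();
}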

Thanks,
Mark.

> +
> +/*
> + * These functions disable tag checking only if in MTE async mode
> + * since the sync mode generates exceptions synchronously and the
> + * nofault or load_unaligned_zeropad can handle them.
> + */
> +static inline void __uaccess_disable_tco_async(void)
> +{
> + if (static_branch_unlikely(&mte_async_mode))
> +  __uaccess_disable_tco();
> +}
> +
> +static inline void __uaccess_enable_tco_async(void)
> +{
> + if (static_branch_unlikely(&mte_async_mode))
> + __uaccess_enable_tco();
> +}
> +
>  static inline void uaccess_disable_privileged(void)
>  {
>   __uaccess_disable_tco();
> @@ -307,8 +327,10 @@ do { 
> \
>  do { \
>   int __gkn_err = 0;  \
>   \
> + __uaccess_enable_tco_async();   \
>   __raw_get_mem("ldr", *((type *)(dst)),  \
> (__force type *)(src), __gkn_err);\
> + __uaccess_disable_tco_async();  \
>   if (unlikely(__gkn_err))\
>   goto err_label; \
>  } while (0)
> @@ -380,8 +402,10 @@ do { 
> \
>  do { \
>   int __pkn_err = 0;  \
>   \
> + __uaccess_enable_tco_async();   \
>   __raw_put_mem("str", *((type *)(src)),  \
> (__force type *)(dst), __pkn_err);\
> + __uaccess_disable_tco_async();  \
>   if (unlikely(__pkn_err))\
>   goto err_label; \
>  } while(0)
> diff --git a/arch/arm64/include/asm/word-at-a-time.h 
> b/arch/arm64/include/asm/word-at-a-time.h
> index 950b5909..c62d9fa791aa 100644
> --- a/arch/arm64/include/asm/word-at-a-time.h
> +++ b/arch/arm64/include/asm/word-at-a-time.h
> @@ -55,6 +55,8 @@ static inline unsigned long load_unaligned_zeropad(const 
> void *addr)
>  {
>   unsigned long ret, offset;
>  
> + __uaccess_enable_tco_async();
> +
>   /* Load word from unaligned pointer addr */
>   asm(
>   "1: ldr %0, %3\n"
> @@ -76,6 +78,8 @@ 

Re: [PATCHv2 0/8] arm64: Support FIQ controller registration

2021-03-08 Thread Mark Rutland
On Fri, Mar 05, 2021 at 07:08:50PM +0900, Hector Martin wrote:
> On 02/03/2021 19.12, Mark Rutland wrote:
> > I'm hoping that we can get the first 2 patches in as a preparatory cleanup 
> > for
> > the next rc or so, and then the rest of the series can be rebased atop that.
> > I've pushed the series out to my arm64/fiq branch [4] on kernel.org, also
> > tagged as arm64-fiq-20210302, atop v5.12-rc1.
> 
> Just a reminder to everyone that filesystems under v5.12-rc1 go explodey if
> you use a swap file [1].
> 
> I don't care for the M1 bring-up series (we don't *have* storage), but it's
> worth pointing out for other people who might test this.
> 
> Modulo that,
> 
> Tested-by: Hector Martin 
> 
> [1] 
> https://lore.kernel.org/lkml/CAHk-=wjnzdlsp3odxhf9emtyo7gf-qjanlbuh1zk3c4a7x7...@mail.gmail.com/

Thanks!

I've folded that in, with the series rebased to v5.12-rc2, tagged as
arm64-fiq-20210308. I'm expecting that Marc will get the first couple of
patches queued by rc4, so there's at least one rebase ahead.

Mark.


Re: [PATCH] arm64/mm: Fix __enable_mmu() for new TGRAN range values

2021-03-08 Thread Mark Rutland
On Mon, Mar 08, 2021 at 01:30:53PM +, Will Deacon wrote:
> On Sun, Mar 07, 2021 at 05:24:21PM +0530, Anshuman Khandual wrote:
> > On 3/5/21 8:21 PM, Mark Rutland wrote:
> > > On Fri, Mar 05, 2021 at 08:06:09PM +0530, Anshuman Khandual wrote:

> > >> +#define ID_AA64MMFR0_TGRAN_2_SUPPORTED_DEFAULT  0x0
> > >> +#define ID_AA64MMFR0_TGRAN_2_SUPPORTED_NONE 0x1
> > >> +#define ID_AA64MMFR0_TGRAN_2_SUPPORTED_MIN  0x2
> > >> +#define ID_AA64MMFR0_TGRAN_2_SUPPORTED_MAX  0x7
> > >
> > > The TGRAN2 fields doesn't quite follow the usual ID scheme rules, so how
> > > do we deteremine the max value? Does the ARM ARM say anything in
> > > particular about them, like we do for some of the PMU ID fields?
> > 
> > Did not find anything in ARM ARM, regarding what scheme TGRAN2 fields
> > actually follow. I had arrived at more restrictive 0x7 value, like the
> > usual signed fields as the TGRAN4 fields definitely do not follow the
> > unsigned ID scheme. Would restricting max value to 0x3 (i.e LPA2) be a
> > better option instead ?
> 
> I don't think it helps much, as TGRAN64_2 doesn't even define 0x3.
> 
> So I think this patch is probably the best we can do, but the Arm ARM could
> really do with describing the scheme here.

I agree, and I've filed a ticket internally to try to get this cleaned
up.

I suspect that the answer is that these are basically unsigned, with
0x2-0xf indicating presence, but I can't guarantee that.

Thanks,
Mark.


Re: [PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86

2021-03-08 Thread Mark Rutland
On Thu, Mar 04, 2021 at 11:06:01AM -0800, Andy Lutomirski wrote:
> In principle, the generic entry code is generic, and the goal is to use it
> in many architectures once it settles down more.  Move CONFIG_DEBUG_ENTRY
> to the generic config so that it can be used in the generic entry code and
> not just in arch/x86.
> 
> Disable it on arm64.  arm64 uses some but not all of the kentry
> code, and trying to debug the resulting state machine will be painful.
> arm64 can turn it back on when it starts using the entire generic
> path.

Can we make this depend on CONFIG_GENERIC_ENTRY instead of !ARM64?
That'd be more in line with "use the generic entry code, get the generic
functionality". Note that arm64 doesn't select CONFIG_GENERIC_ENTRY
today.

I see that s390 selects CONFIG_GENERIC_ENTRY, and either way this will
enable DEBUG_ENTRY for them, so it'd be worth checking whether this is
ok for them.

Sven, thoughts?

Thanks,
Mark.

> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/Kconfig.debug | 10 --
>  lib/Kconfig.debug  | 11 +++
>  2 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
> index 80b57e7f4947..a5a52133730c 100644
> --- a/arch/x86/Kconfig.debug
> +++ b/arch/x86/Kconfig.debug
> @@ -170,16 +170,6 @@ config CPA_DEBUG
>   help
> Do change_page_attr() self-tests every 30 seconds.
>  
> -config DEBUG_ENTRY
> - bool "Debug low-level entry code"
> - depends on DEBUG_KERNEL
> - help
> -   This option enables sanity checks in x86's low-level entry code.
> -   Some of these sanity checks may slow down kernel entries and
> -   exits or otherwise impact performance.
> -
> -   If unsure, say N.
> -
>  config DEBUG_NMI_SELFTEST
>   bool "NMI Selftest"
>   depends on DEBUG_KERNEL && X86_LOCAL_APIC
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 7937265ef879..76549c8afa8a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1411,6 +1411,17 @@ config CSD_LOCK_WAIT_DEBUG
>  
>  endmenu # lock debugging
>  
> +config DEBUG_ENTRY
> + bool "Debug low-level entry code"
> + depends on DEBUG_KERNEL
> + depends on !ARM64
> + help
> +   This option enables sanity checks in the low-level entry code.
> +   Some of these sanity checks may slow down kernel entries and
> +   exits or otherwise impact performance.
> +
> +   If unsure, say N.
> +
>  config TRACE_IRQFLAGS
>   depends on TRACE_IRQFLAGS_SUPPORT
>   bool
> -- 
> 2.29.2
> 


Re: [PATCH v3 07/11] kentry: Make entry/exit_to_user_mode() arm64-only

2021-03-08 Thread Mark Rutland
On Thu, Mar 04, 2021 at 11:06:00AM -0800, Andy Lutomirski wrote:
> exit_to_user_mode() does part, but not all, of the exit-to-user-mode work.
> It's used only by arm64, and arm64 should stop using it (hint, hint!).
> Compile it out on other architectures to minimize the chance of error.

For context, I do plan to move over, but there's a reasonable amount of
preparatory work needed first (e.g. factoring out the remaining asm
entry points, and reworking our pseudo-NMI management to play nicely
with the common entry code).

> enter_from_user_mode() is a legacy way to spell
> kentry_enter_from_user_mode().  It's also only used by arm64.  Give it
> the same treatment.

I think you can remove these entirely, no ifdeffery necessary.

Currently arm64 cannot select CONFIG_GENERIC_ENTRY, so we open-code
copies of these in arch/arm64/kernel/entry-common.c, and don't use
these common versions at all. When we move over to the common code we
can move directly to the kentry_* versions. If we are relying on the
prototypes anywhere, that's a bug in itself.

In retrospect I probably should have given our local copies an arm64_*
prefix. If I can't get rid of them soon I'll add that to lessen the
scope for confusion.

Mark.

> Signed-off-by: Andy Lutomirski 
> ---
>  include/linux/entry-common.h | 34 ++
>  kernel/entry/common.c|  4 
>  2 files changed, 10 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 5287c6c15a66..a75374f87258 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -97,26 +97,15 @@ static inline __must_check int 
> arch_syscall_enter_tracehook(struct pt_regs *regs
>  }
>  #endif
>  
> +#ifdef CONFIG_ARM64
>  /**
>   * enter_from_user_mode - Establish state when coming from user mode
>   *
> - * Syscall/interrupt entry disables interrupts, but user mode is traced as
> - * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
> + * Legacy variant of kentry_enter_from_user_mode().  Used only by arm64.
>   *
> - * 1) Tell lockdep that interrupts are disabled
> - * 2) Invoke context tracking if enabled to reactivate RCU
> - * 3) Trace interrupts off state
> - *
> - * Invoked from architecture specific syscall entry code with interrupts
> - * disabled. The calling code has to be non-instrumentable. When the
> - * function returns all state is correct and interrupts are still
> - * disabled. The subsequent functions can be instrumented.
> - *
> - * This is invoked when there is architecture specific functionality to be
> - * done between establishing state and enabling interrupts. The caller must
> - * enable interrupts before invoking syscall_enter_from_user_mode_work().
>   */
>  void enter_from_user_mode(struct pt_regs *regs);
> +#endif
>  
>  /**
>   * kentry_syscall_begin - Prepare to invoke a syscall handler
> @@ -261,25 +250,14 @@ static inline void arch_syscall_exit_tracehook(struct 
> pt_regs *regs, bool step)
>  }
>  #endif
>  
> +#ifdef CONFIG_ARM64
>  /**
>   * exit_to_user_mode - Fixup state when exiting to user mode
>   *
> - * Syscall/interrupt exit enables interrupts, but the kernel state is
> - * interrupts disabled when this is invoked. Also tell RCU about it.
> - *
> - * 1) Trace interrupts on state
> - * 2) Invoke context tracking if enabled to adjust RCU state
> - * 3) Invoke architecture specific last minute exit code, e.g. speculation
> - *mitigations, etc.: arch_exit_to_user_mode()
> - * 4) Tell lockdep that interrupts are enabled
> - *
> - * Invoked from architecture specific code when syscall_exit_to_user_mode()
> - * is not suitable as the last step before returning to userspace. Must be
> - * invoked with interrupts disabled and the caller must be
> - * non-instrumentable.
> - * The caller has to invoke syscall_exit_to_user_mode_work() before this.
> + * Does the latter part of irqentry_exit_to_user_mode().  Only used by arm64.
>   */
>  void exit_to_user_mode(void);
> +#endif
>  
>  /**
>   * kentry_syscall_end - Finish syscall processing
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 800ad406431b..4ba82c684189 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -25,10 +25,12 @@ static __always_inline void __enter_from_user_mode(struct 
> pt_regs *regs)
>   instrumentation_end();
>  }
>  
> +#ifdef CONFIG_ARM64
>  void noinstr enter_from_user_mode(struct pt_regs *regs)
>  {
>   __enter_from_user_mode(regs);
>  }
> +#endif
>  
>  static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
>  {
> @@ -106,10 +108,12 @@ static __always_inline void __exit_to_user_mode(void)
>   lockdep_hardirqs_on(CALLER_ADDR0);
>  }
>  
> +#ifdef CONFIG_ARM64
>  void noinstr exit_to_user_mode(void)
>  {
>   __exit_to_user_mode();
>  }
> +#endif
>  
>  /* Workaround to allow gradual conversion of architecture code */
>  void __weak arch_do_signal_or_restart(struct pt_regs *regs, 

Re: [PATCH v3 06/11] kentry: Simplify the common syscall API

2021-03-08 Thread Mark Rutland
On Thu, Mar 04, 2021 at 11:05:59AM -0800, Andy Lutomirski wrote:
> The new common syscall API had a large and confusing API surface.  Simplify
> it.  Now there is exactly one way to use it.  It's a bit more verbose than
> the old way for the simple x86_64 native case, but it's much easier to use
> right, and the diffstat should speak for itself.
> 
> Signed-off-by: Andy Lutomirski 

I think that having this more verbose is going to make it easier to
handle ABI warts that differ across architectures, so:

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/x86/entry/common.c  | 57 +++-
>  include/linux/entry-common.h | 86 ++--
>  kernel/entry/common.c| 57 +++-
>  3 files changed, 54 insertions(+), 146 deletions(-)
> 
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index ef1c65938a6b..8710b2300b8d 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -38,9 +38,12 @@
>  #ifdef CONFIG_X86_64
>  __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>  {
> - nr = syscall_enter_from_user_mode(regs, nr);
> -
> + kentry_enter_from_user_mode(regs);
> + local_irq_enable();
>   instrumentation_begin();
> +
> + nr = kentry_syscall_begin(regs, nr);
> +
>   if (likely(nr < NR_syscalls)) {
>   nr = array_index_nospec(nr, NR_syscalls);
>   regs->ax = sys_call_table[nr](regs);
> @@ -52,8 +55,12 @@ __visible noinstr void do_syscall_64(unsigned long nr, 
> struct pt_regs *regs)
>   regs->ax = x32_sys_call_table[nr](regs);
>  #endif
>   }
> +
> + kentry_syscall_end(regs);
> +
> + local_irq_disable();
>   instrumentation_end();
> - syscall_exit_to_user_mode(regs);
> + kentry_exit_to_user_mode(regs);
>  }
>  #endif
>  
> @@ -83,33 +90,34 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs 
> *regs)
>  {
>   unsigned int nr = syscall_32_enter(regs);
>  
> + kentry_enter_from_user_mode(regs);
> + local_irq_enable();
> + instrumentation_begin();
> +
>   /*
>* Subtlety here: if ptrace pokes something larger than 2^32-1 into
>* orig_ax, the unsigned int return value truncates it.  This may
>* or may not be necessary, but it matches the old asm behavior.
>*/
> - nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
> - instrumentation_begin();
> -
> + nr = (unsigned int)kentry_syscall_begin(regs, nr);
>   do_syscall_32_irqs_on(regs, nr);
> + kentry_syscall_end(regs);
>  
> + local_irq_disable();
>   instrumentation_end();
> - syscall_exit_to_user_mode(regs);
> + kentry_exit_to_user_mode(regs);
>  }
>  
>  static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
>  {
>   unsigned int nr = syscall_32_enter(regs);
> + bool ret;
>   int res;
>  
> - /*
> -  * This cannot use syscall_enter_from_user_mode() as it has to
> -  * fetch EBP before invoking any of the syscall entry work
> -  * functions.
> -  */
> - syscall_enter_from_user_mode_prepare(regs);
> -
> + kentry_enter_from_user_mode(regs);
> + local_irq_enable();
>   instrumentation_begin();
> +
>   /* Fetch EBP from where the vDSO stashed it. */
>   if (IS_ENABLED(CONFIG_X86_64)) {
>   /*
> @@ -126,21 +134,23 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs 
> *regs)
>   if (res) {
>   /* User code screwed up. */
>   regs->ax = -EFAULT;
> -
> - instrumentation_end();
> - local_irq_disable();
> - irqentry_exit_to_user_mode(regs);
> - return false;
> + ret = false;
> + goto out;
>   }
>  
>   /* The case truncates any ptrace induced syscall nr > 2^32 -1 */
> - nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
> + nr = (unsigned int)kentry_syscall_begin(regs, nr);
>  
>   /* Now this is just like a normal syscall. */
>   do_syscall_32_irqs_on(regs, nr);
>  
> + kentry_syscall_end(regs);
> + ret = true;
> +
> +out:
> + local_irq_disable();
>   instrumentation_end();
> - syscall_exit_to_user_mode(regs);
> + kentry_exit_to_user_mode(regs);
>   return true;
>  }
>  
> @@ -233,8 +243,11 @@ __visible void noinstr ret_from_fork(struct task_struct 
> *prev,
>   user_regs->ax = 0;
>   }
>  
> + kentry_syscall_end(user_regs);
> +
> + local_irq_disable();
>   instrumentation_end();

Re: [PATCH v3 02/11] kentry: Rename irqentry to kentry

2021-03-08 Thread Mark Rutland
On Thu, Mar 04, 2021 at 11:05:55AM -0800, Andy Lutomirski wrote:
> The common entry functions are mostly named irqentry, and this is
> confusing.  They are used for syscalls, exceptions, NMIs and, yes, IRQs.
> Call them kentry instead, since they are supposed to be usable for any
> entry to the kernel.
> 
> This path doesn't touch the .irqentry section -- someone can contemplate
> changing that later.  That code predates the common entry code.
> 
> Signed-off-by: Andy Lutomirski 

FWIW, I agree this is a better name, so:

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/x86/entry/common.c |  8 ++---
>  arch/x86/include/asm/idtentry.h | 28 +++
>  arch/x86/kernel/cpu/mce/core.c  | 10 +++---
>  arch/x86/kernel/kvm.c   |  6 ++--
>  arch/x86/kernel/nmi.c   |  6 ++--
>  arch/x86/kernel/traps.c | 28 +++
>  arch/x86/mm/fault.c |  6 ++--
>  include/linux/entry-common.h| 60 -
>  kernel/entry/common.c   | 26 +++---
>  9 files changed, 89 insertions(+), 89 deletions(-)
> 
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 8fdb4cb27efe..95776f16c1cb 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -264,9 +264,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
> pt_regs *regs)
>  {
>   struct pt_regs *old_regs;
>   bool inhcall;
> - irqentry_state_t state;
> + kentry_state_t state;
>  
> - state = irqentry_enter(regs);
> + state = kentry_enter(regs);
>   old_regs = set_irq_regs(regs);
>  
>   instrumentation_begin();
> @@ -278,11 +278,11 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
> pt_regs *regs)
>   inhcall = get_and_clear_inhcall();
>   if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
>   instrumentation_begin();
> - irqentry_exit_cond_resched();
> + kentry_exit_cond_resched();
>   instrumentation_end();
>   restore_inhcall(inhcall);
>   } else {
> - irqentry_exit(regs, state);
> + kentry_exit(regs, state);
>   }
>  }
>  #endif /* CONFIG_XEN_PV */
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index f656aabd1545..3ac805d24289 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -40,8 +40,8 @@
>   * The macro is written so it acts as function definition. Append the
>   * body with a pair of curly brackets.
>   *
> - * irqentry_enter() contains common code which has to be invoked before
> - * arbitrary code in the body. irqentry_exit() contains common code
> + * kentry_enter() contains common code which has to be invoked before
> + * arbitrary code in the body. kentry_exit() contains common code
>   * which has to run before returning to the low level assembly code.
>   */
>  #define DEFINE_IDTENTRY(func)
> \
> @@ -49,12 +49,12 @@ static __always_inline void __##func(struct pt_regs 
> *regs);   \
>   \
>  __visible noinstr void func(struct pt_regs *regs)\
>  {\
> - irqentry_state_t state = irqentry_enter(regs);  \
> + kentry_state_t state = kentry_enter(regs);  \
>   \
>   instrumentation_begin();\
>   __##func (regs);\
>   instrumentation_end();  \
> - irqentry_exit(regs, state); \
> + kentry_exit(regs, state);   \
>  }\
>   \
>  static __always_inline void __##func(struct pt_regs *regs)
> @@ -96,12 +96,12 @@ static __always_inline void __##func(struct pt_regs 
> *regs,\
>  __visible noinstr void func(struct pt_regs *regs,\
>   unsigned long error_code)   \
>  {\
> - irqentry_state_t state = irqentry_enter(regs);  \
> + kentry_state_t state = kentry_enter(regs);  \
>   \
>   instrumentation_begin(); 

Re: [PATCH] arm64/mm: Fix __enable_mmu() for new TGRAN range values

2021-03-05 Thread Mark Rutland
On Fri, Mar 05, 2021 at 08:06:09PM +0530, Anshuman Khandual wrote:
> From: James Morse 
> 
> As per ARM ARM DDI 0487G.a, when FEAT_LPA2 is implemented, ID_AA64MMFR0_EL1
> might contain a range of values to describe supported translation granules
> (4K and 16K pages sizes in particular) instead of just enabled or disabled
> values. This changes __enable_mmu() function to handle complete acceptable
> range of values (depending on whether the field is signed or unsigned) now
> represented with ID_AA64MMFR0_TGRAN_SUPPORTED_[MIN..MAX] pair. While here,
> also fix similar situations in EFI stub and KVM as well.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Marc Zyngier 
> Cc: James Morse 
> Cc: Suzuki K Poulose 
> Cc: Ard Biesheuvel 
> Cc: Mark Rutland 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: kvm...@lists.cs.columbia.edu
> Cc: linux-...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: James Morse 
> Signed-off-by: Anshuman Khandual 
> ---
>  arch/arm64/include/asm/sysreg.h   | 20 ++--
>  arch/arm64/kernel/head.S  |  6 --
>  arch/arm64/kvm/reset.c| 23 ---
>  drivers/firmware/efi/libstub/arm64-stub.c |  2 +-
>  4 files changed, 31 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index dfd4edb..d4a5fca9 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -796,6 +796,11 @@
>  #define ID_AA64MMFR0_PARANGE_48  0x5
>  #define ID_AA64MMFR0_PARANGE_52  0x6
>  
> +#define ID_AA64MMFR0_TGRAN_2_SUPPORTED_DEFAULT   0x0
> +#define ID_AA64MMFR0_TGRAN_2_SUPPORTED_NONE  0x1
> +#define ID_AA64MMFR0_TGRAN_2_SUPPORTED_MIN   0x2
> +#define ID_AA64MMFR0_TGRAN_2_SUPPORTED_MAX   0x7

The TGRAN2 fields don't quite follow the usual ID scheme rules, so how
do we determine the max value? Does the ARM ARM say anything in
particular about them, like we do for some of the PMU ID fields?

Otherwise, this patch looks correct to me.

Thanks,
Mark.

> +
>  #ifdef CONFIG_ARM64_PA_BITS_52
>  #define ID_AA64MMFR0_PARANGE_MAX ID_AA64MMFR0_PARANGE_52
>  #else
> @@ -961,14 +966,17 @@
>  #define ID_PFR1_PROGMOD_SHIFT0
>  
>  #if defined(CONFIG_ARM64_4K_PAGES)
> -#define ID_AA64MMFR0_TGRAN_SHIFT ID_AA64MMFR0_TGRAN4_SHIFT
> -#define ID_AA64MMFR0_TGRAN_SUPPORTED ID_AA64MMFR0_TGRAN4_SUPPORTED
> +#define ID_AA64MMFR0_TGRAN_SHIFT ID_AA64MMFR0_TGRAN4_SHIFT
> +#define ID_AA64MMFR0_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_TGRAN4_SUPPORTED
> +#define ID_AA64MMFR0_TGRAN_SUPPORTED_MAX 0x7
>  #elif defined(CONFIG_ARM64_16K_PAGES)
> -#define ID_AA64MMFR0_TGRAN_SHIFT ID_AA64MMFR0_TGRAN16_SHIFT
> -#define ID_AA64MMFR0_TGRAN_SUPPORTED ID_AA64MMFR0_TGRAN16_SUPPORTED
> +#define ID_AA64MMFR0_TGRAN_SHIFT ID_AA64MMFR0_TGRAN16_SHIFT
> +#define ID_AA64MMFR0_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_TGRAN16_SUPPORTED
> +#define ID_AA64MMFR0_TGRAN_SUPPORTED_MAX 0xF
>  #elif defined(CONFIG_ARM64_64K_PAGES)
> -#define ID_AA64MMFR0_TGRAN_SHIFT ID_AA64MMFR0_TGRAN64_SHIFT
> -#define ID_AA64MMFR0_TGRAN_SUPPORTED ID_AA64MMFR0_TGRAN64_SUPPORTED
> +#define ID_AA64MMFR0_TGRAN_SHIFT ID_AA64MMFR0_TGRAN64_SHIFT
> +#define ID_AA64MMFR0_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_TGRAN64_SUPPORTED
> +#define ID_AA64MMFR0_TGRAN_SUPPORTED_MAX 0x7
>  #endif
>  
>  #define MVFR2_FPMISC_SHIFT   4
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 66b0e0b..8b469f1 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -655,8 +655,10 @@ SYM_FUNC_END(__secondary_too_slow)
>  SYM_FUNC_START(__enable_mmu)
>   mrs x2, ID_AA64MMFR0_EL1
>   ubfxx2, x2, #ID_AA64MMFR0_TGRAN_SHIFT, 4
> - cmp x2, #ID_AA64MMFR0_TGRAN_SUPPORTED
> - b.ne__no_granule_support
> + cmp x2, #ID_AA64MMFR0_TGRAN_SUPPORTED_MIN
> + b.lt__no_granule_support
> + cmp x2, #ID_AA64MMFR0_TGRAN_SUPPORTED_MAX
> + b.gt__no_granule_support
>   update_early_cpu_boot_status 0, x2, x3
>   adrpx2, idmap_pg_dir
>   phys_to_ttbr x1, x1
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 47f3f03..fe72bfb 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -286,7 +286,7 @@ u32 get_kvm_ipa_limit(void)
>  
>  int kvm_set_ipa_limit(void)
>  {
> - unsigned int parange, tgran_2;
> + unsigned int parange, tgran_2_shift, tgran_2;
>   u64 mmfr0;
>  
>   mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
> @@ -300,27 +300,28 @@ int k

Re: [PATCH v1] powerpc: Include running function as first entry in save_stack_trace() and friends

2021-03-05 Thread Mark Rutland
On Thu, Mar 04, 2021 at 08:01:29PM +0100, Marco Elver wrote:
> On Thu, 4 Mar 2021 at 19:51, Mark Rutland  wrote:
> > On Thu, Mar 04, 2021 at 07:22:53PM +0100, Marco Elver wrote:

> > > I was having this problem with KCSAN, where the compiler would
> > > tail-call-optimize __tsan_X instrumentation.
> >
> > Those are compiler-generated calls, right? When those are generated the
> > compilation unit (and whatever it has included) might not have provided
> > a prototype anyway, and the compiler has special knowledge of the
> > functions, so it feels like the compiler would need to inhibit TCO here
> > for this to be robust. For their intended usage subjecting them to TCO
> > doesn't seem to make sense AFAICT.
> >
> > I suspect that compilers have some way of handling that; otherwise I'd
> > expect to have heard stories of mcount/fentry calls getting TCO'd and
> > causing problems. So maybe there's an easy fix there?
> 
> I agree, the compiler builtins should be handled by the compiler
> directly, perhaps that was a bad example. But we also have "explicit
> instrumentation", e.g. everything that's in <linux/instrumented.h>.

True -- I agree for those we want something similar, and can see a case for a
no-tco-calls-to-me attribute on functions, as with noreturn.

Maybe for now it's worth adding prevent_tail_call_optimization() to the
instrument_*() call wrappers in <linux/instrumented.h>? As those are
__always_inline, that should keep the function they get inlined in
around. Though we probably want to see if we can replace the mb() in
prevent_tail_call_optimization() with something that doesn't require a
real CPU barrier.
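
A minimal sketch of that idea (mine, nothing like this was posted), taking
instrument_read() from include/linux/instrumented.h as the example wrapper:

#include <linux/kasan-checks.h>
#include <linux/kcsan-checks.h>
#include <linux/compiler.h>	/* prevent_tail_call_optimization() */

static __always_inline void instrument_read(const volatile void *v, size_t size)
{
	kasan_check_read(v, size);
	kcsan_check_read(v, size);
	/* keeps the function this gets inlined into from being TCO'd away */
	prevent_tail_call_optimization();
}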

[...]

> > I reckon for basically any instrumentation we don't want calls to be
> > TCO'd, though I'm not immediately sure of cases beyond sanitizers and
> > mcount/fentry.
> 
> Thinking about this more, I think it's all debugging tools. E.g.
> lockdep, if you lock/unlock at the end of a function, you might tail
> call into lockdep. If the compiler applies TCO, and lockdep determines
> there's a bug and then shows a trace, you'll have no idea where the
> actual bug is. The kernel has lots of debugging facilities that add
> instrumentation in this way. So perhaps it's a general debugging-tool
> problem (rather than just sanitizers).

This makes sense to me.

Thanks,
Mark.


Re: [PATCH v1] powerpc: Include running function as first entry in save_stack_trace() and friends

2021-03-04 Thread Mark Rutland
On Thu, Mar 04, 2021 at 07:22:53PM +0100, Marco Elver wrote:
> On Thu, 4 Mar 2021 at 19:02, Mark Rutland  wrote:
> > On Thu, Mar 04, 2021 at 06:25:33PM +0100, Marco Elver wrote:
> > > On Thu, Mar 04, 2021 at 04:59PM +, Mark Rutland wrote:
> > > > On Thu, Mar 04, 2021 at 04:30:34PM +0100, Marco Elver wrote:
> > > > > On Thu, 4 Mar 2021 at 15:57, Mark Rutland  
> > > > > wrote:
> > > > > > [adding Mark Brown]
> > > > > >
> > > > > > The bigger problem here is that skipping is dodgy to begin with, and
> > > > > > this is still liable to break in some cases. One big concern is that
> > > > > > (especially with LTO) we cannot guarantee the compiler will not 
> > > > > > inline
> > > > > > or outline functions, causing the skipp value to be too large or too
> > > > > > small. That's liable to happen to callers, and in theory (though
> > > > > > unlikely in practice), portions of arch_stack_walk() or
> > > > > > stack_trace_save() could get outlined too.
> > > > > >
> > > > > > Unless we can get some strong guarantees from compiler folk such 
> > > > > > that we
> > > > > > can guarantee a specific function acts boundary for unwinding (and
> > > > > > doesn't itself get split, etc), the only reliable way I can think to
> > > > > > solve this requires an assembly trampoline. Whatever we do is 
> > > > > > liable to
> > > > > > need some invasive rework.
> > > > >
> > > > > Will LTO and friends respect 'noinline'?
> > > >
> > > > I hope so (and suspect we'd have more problems otherwise), but I don't
> > > > know whether they actually so.
> > > >
> > > > I suspect even with 'noinline' the compiler is permitted to outline
> > > > portions of a function if it wanted to (and IIUC it could still make
> > > > specialized copies in the absence of 'noclone').
> > > >
> > > > > One thing I also noticed is that tail calls would also cause the stack
> > > > > trace to appear somewhat incomplete (for some of my tests I've
> > > > > disabled tail call optimizations).
> > > >
> > > > I assume you mean for a chain A->B->C where B tail-calls C, you get a
> > > > trace A->C? ... or is A going missing too?
> > >
> > > Correct, it's just the A->C outcome.
> >
> > I'd assumed that those cases were benign, e.g. for livepatching what
> > matters is what can be returned to, so B disappearing from the trace
> > isn't a problem there.
> >
> > Is the concern debugability, or is there a functional issue you have in
> > mind?
> 
> For me, it's just been debuggability, and reliable test cases.
> 
> > > > > Is there a way to also mark a function non-tail-callable?
> > > >
> > > > I think this can be bodged using __attribute__((optimize("$OPTIONS")))
> > > > on a caller to inhibit TCO (though IIRC GCC doesn't reliably support
> > > > function-local optimization options), but I don't expect there's any way
> > > > to mark a callee as not being tail-callable.
> > >
> > > I don't think this is reliable. It'd be
> > > __attribute__((optimize("-fno-optimize-sibling-calls"))), but doesn't
> > > work if applied to the function we do not want to tail-call-optimize,
> > > but would have to be applied to the function that does the tail-calling.
> >
> > Yup; that's what I meant then I said you could do that on the caller but
> > not the callee.
> >
> > I don't follow why you'd want to put this on the callee, though, so I
> > think I'm missing something. Considering a set of functions in different
> > compilation units:
> >
> >   A->B->C->D->E->F->G->H->I->J->K
> 
> I was having this problem with KCSAN, where the compiler would
> tail-call-optimize __tsan_X instrumentation.

Those are compiler-generated calls, right? When those are generated the
compilation unit (and whatever it has included) might not have provided
a prototype anyway, and the compiler has special knowledge of the
functions, so it feels like the compiler would need to inhibit TCO here
for this to be robust. For their intended usage subjecting them to TCO
doesn't seem to make sense AFAICT.

I suspect that compilers have some way of handling that; otherwise I'd
expect to have heard stories o

Re: [PATCH v1] powerpc: Include running function as first entry in save_stack_trace() and friends

2021-03-04 Thread Mark Rutland
On Thu, Mar 04, 2021 at 06:25:33PM +0100, Marco Elver wrote:
> On Thu, Mar 04, 2021 at 04:59PM +0000, Mark Rutland wrote:
> > On Thu, Mar 04, 2021 at 04:30:34PM +0100, Marco Elver wrote:
> > > On Thu, 4 Mar 2021 at 15:57, Mark Rutland  wrote:
> > > > [adding Mark Brown]
> > > >
> > > > The bigger problem here is that skipping is dodgy to begin with, and
> > > > this is still liable to break in some cases. One big concern is that
> > > > (especially with LTO) we cannot guarantee the compiler will not inline
> > > > or outline functions, causing the skipp value to be too large or too
> > > > small. That's liable to happen to callers, and in theory (though
> > > > unlikely in practice), portions of arch_stack_walk() or
> > > > stack_trace_save() could get outlined too.
> > > >
> > > > Unless we can get some strong guarantees from compiler folk such that we
> > > > can guarantee a specific function acts boundary for unwinding (and
> > > > doesn't itself get split, etc), the only reliable way I can think to
> > > > solve this requires an assembly trampoline. Whatever we do is liable to
> > > > need some invasive rework.
> > > 
> > > Will LTO and friends respect 'noinline'?
> > 
> > I hope so (and suspect we'd have more problems otherwise), but I don't
> > know whether they actually so.
> > 
> > I suspect even with 'noinline' the compiler is permitted to outline
> > portions of a function if it wanted to (and IIUC it could still make
> > specialized copies in the absence of 'noclone').
> > 
> > > One thing I also noticed is that tail calls would also cause the stack
> > > trace to appear somewhat incomplete (for some of my tests I've
> > > disabled tail call optimizations).
> > 
> > I assume you mean for a chain A->B->C where B tail-calls C, you get a
> > trace A->C? ... or is A going missing too?
> 
> Correct, it's just the A->C outcome.

I'd assumed that those cases were benign, e.g. for livepatching what
matters is what can be returned to, so B disappearing from the trace
isn't a problem there.

Is the concern debuggability, or is there a functional issue you have in
mind?

> > > Is there a way to also mark a function non-tail-callable?
> > 
> > I think this can be bodged using __attribute__((optimize("$OPTIONS")))
> > on a caller to inhibit TCO (though IIRC GCC doesn't reliably support
> > function-local optimization options), but I don't expect there's any way
> > to mark a callee as not being tail-callable.
> 
> I don't think this is reliable. It'd be
> __attribute__((optimize("-fno-optimize-sibling-calls"))), but doesn't
> work if applied to the function we do not want to tail-call-optimize,
> but would have to be applied to the function that does the tail-calling.

Yup; that's what I meant when I said you could do that on the caller but
not the callee.

I don't follow why you'd want to put this on the callee, though, so I
think I'm missing something. Considering a set of functions in different
compilation units:

  A->B->C->D->E->F->G->H->I->J->K

... if K were marked in this way, and J was compiled with visibility of
this, J would stick around, but J's callers might not, and so the
trace might see:

  A->J->K

... do you just care about the final caller, i.e. you just need
certainty that J will be in the trace?

If so, we can somewhat bodge that by having K have an __always_inline
wrapper which has a barrier() or similar after the real call to K, so
the call couldn't be TCO'd.
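
Something like this (illustrative only; K() stands for the real callee):

#include <linux/compiler.h>

void K(void);

static __always_inline void K_wrapper(void)
{
	K();
	/*
	 * The barrier() means the call to K() is not in tail position in
	 * whatever function this wrapper gets inlined into, so that caller
	 * can't tail-call K() and will still show up in the trace.
	 */
	barrier();
}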

Otherwise I'd expect we'd probably need to disable TCO generally.

> So it's a bit backwards, even if it worked.
> 
> > Accoding to the GCC documentation, GCC won't TCO noreturn functions, but
> > obviously that's not something we can use generally.
> > 
> > https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
> 
> Perhaps we can ask the toolchain folks to help add such an attribute. Or
> maybe the feature already exists somewhere, but hidden.
> 
> +Cc linux-toolcha...@vger.kernel.org
> 
> > > But I'm also not sure if with all that we'd be guaranteed the code we
> > > want, even though in practice it might.
> > 
> > True! I'd just like to be on the least dodgy ground we can be.
> 
> It's been dodgy for a while, and I'd welcome any low-cost fixes to make
> it less dodgy in the short-term at least. :-)

:)

Thanks,
Mark.


Re: [PATCH v1] powerpc: Include running function as first entry in save_stack_trace() and friends

2021-03-04 Thread Mark Rutland
On Thu, Mar 04, 2021 at 04:30:34PM +0100, Marco Elver wrote:
> On Thu, 4 Mar 2021 at 15:57, Mark Rutland  wrote:
> > [adding Mark Brown]
> >
> > The bigger problem here is that skipping is dodgy to begin with, and
> > this is still liable to break in some cases. One big concern is that
> > (especially with LTO) we cannot guarantee the compiler will not inline
> > or outline functions, causing the skipp value to be too large or too
> > small. That's liable to happen to callers, and in theory (though
> > unlikely in practice), portions of arch_stack_walk() or
> > stack_trace_save() could get outlined too.
> >
> > Unless we can get some strong guarantees from compiler folk such that we
> > can guarantee a specific function acts boundary for unwinding (and
> > doesn't itself get split, etc), the only reliable way I can think to
> > solve this requires an assembly trampoline. Whatever we do is liable to
> > need some invasive rework.
> 
> Will LTO and friends respect 'noinline'?

I hope so (and suspect we'd have more problems otherwise), but I don't
know whether they actually do so.

I suspect even with 'noinline' the compiler is permitted to outline
portions of a function if it wanted to (and IIUC it could still make
specialized copies in the absence of 'noclone').

> One thing I also noticed is that tail calls would also cause the stack
> trace to appear somewhat incomplete (for some of my tests I've
> disabled tail call optimizations).

I assume you mean for a chain A->B->C where B tail-calls C, you get a
trace A->C? ... or is A going missing too?

> Is there a way to also mark a function non-tail-callable?

I think this can be bodged using __attribute__((optimize("$OPTIONS")))
on a caller to inhibit TCO (though IIRC GCC doesn't reliably support
function-local optimization options), but I don't expect there's any way
to mark a callee as not being tail-callable.
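
For example (GCC-specific, and as noted function-local optimize attributes
aren't reliably supported, so this is only a sketch):

extern void callee(void);

static void __attribute__((optimize("-fno-optimize-sibling-calls")))
caller(void)
{
	/* stays a regular call, so caller() remains in the backtrace */
	callee();
}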

According to the GCC documentation, GCC won't TCO noreturn functions, but
obviously that's not something we can use generally.

https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes

> But I'm also not sure if with all that we'd be guaranteed the code we
> want, even though in practice it might.

True! I'd just like to be on the least dodgy ground we can be.

Thanks,
Mark.


Re: [PATCH v1] powerpc: Include running function as first entry in save_stack_trace() and friends

2021-03-04 Thread Mark Rutland
[adding Mark Brown]

On Wed, Mar 03, 2021 at 04:20:43PM +0100, Marco Elver wrote:
> On Wed, Mar 03, 2021 at 03:52PM +0100, Christophe Leroy wrote:
> > On 03/03/2021 at 15:38, Marco Elver wrote:
> > > On Wed, 3 Mar 2021 at 15:09, Christophe Leroy
> > >  wrote:
> > > > 
> > > > It seems like all other sane architectures, namely x86 and arm64
> > > > at least, include the running function as top entry when saving
> > > > stack trace.
> > > > 
> > > > Functionnalities like KFENCE expect it.
> > > > 
> > > > Do the same on powerpc, it allows KFENCE to properly identify the 
> > > > faulting
> > > > function as depicted below. Before the patch KFENCE was identifying
> > > > finish_task_switch.isra as the faulting function.
> > > > 
> > > > [   14.937370] 
> > > > ==
> > > > [   14.948692] BUG: KFENCE: invalid read in 
> > > > test_invalid_access+0x54/0x108
> > > > [   14.948692]
> > > > [   14.956814] Invalid read at 0xdf98800a:
> > > > [   14.960664]  test_invalid_access+0x54/0x108
> > > > [   14.964876]  finish_task_switch.isra.0+0x54/0x23c
> > > > [   14.969606]  kunit_try_run_case+0x5c/0xd0
> > > > [   14.973658]  kunit_generic_run_threadfn_adapter+0x24/0x30
> > > > [   14.979079]  kthread+0x15c/0x174
> > > > [   14.982342]  ret_from_kernel_thread+0x14/0x1c
> > > > [   14.986731]
> > > > [   14.988236] CPU: 0 PID: 111 Comm: kunit_try_catch Tainted: GB
> > > >  5.12.0-rc1-01537-g95f6e2088d7e-dirty #4682
> > > > [   14.999795] NIP:  c016ec2c LR: c02f517c CTR: c016ebd8
> > > > [   15.004851] REGS: e2449d90 TRAP: 0301   Tainted: GB  
> > > > (5.12.0-rc1-01537-g95f6e2088d7e-dirty)
> > > > [   15.015274] MSR:  9032   CR: 2204  XER: 
> > > > 
> > > > [   15.022043] DAR: df98800a DSISR: 2000
> > > > [   15.022043] GPR00: c02f517c e2449e50 c1142080 e100dd24 c084b13c 
> > > > 0008 c084b32b c016ebd8
> > > > [   15.022043] GPR08: c085 df988000 c0d1 e2449eb0 22000288
> > > > [   15.040581] NIP [c016ec2c] test_invalid_access+0x54/0x108
> > > > [   15.046010] LR [c02f517c] kunit_try_run_case+0x5c/0xd0
> > > > [   15.051181] Call Trace:
> > > > [   15.053637] [e2449e50] [c005a68c] 
> > > > finish_task_switch.isra.0+0x54/0x23c (unreliable)
> > > > [   15.061338] [e2449eb0] [c02f517c] kunit_try_run_case+0x5c/0xd0
> > > > [   15.067215] [e2449ed0] [c02f648c] 
> > > > kunit_generic_run_threadfn_adapter+0x24/0x30
> > > > [   15.074472] [e2449ef0] [c004e7b0] kthread+0x15c/0x174
> > > > [   15.079571] [e2449f30] [c001317c] ret_from_kernel_thread+0x14/0x1c
> > > > [   15.085798] Instruction dump:
> > > > [   15.088784] 8129d608 38e7ebd8 81020280 911f004c 3900 995f0024 
> > > > 907f0028 90ff001c
> > > > [   15.096613] 3949000a 915f0020 3d40c0d1 3d00c085 <8929000a> 3908adb0 
> > > > 812a4b98 3d40c02f
> > > > [   15.104612] 
> > > > ==
> > > > 
> > > > Signed-off-by: Christophe Leroy 
> > > 
> > > Acked-by: Marco Elver 
> > > 
> > > Thank you, I think this looks like the right solution. Just a question 
> > > below:
> > > 
> > ...
> > 
> > > > @@ -59,23 +70,26 @@ void save_stack_trace(struct stack_trace *trace)
> > > > 
> > > >  sp = current_stack_frame();
> > > > 
> > > > -   save_context_stack(trace, sp, current, 1);
> > > > +   save_context_stack(trace, sp, (unsigned long)save_stack_trace, current, 1);
> > > 
> > > This causes ip == save_stack_trace and also below for
> > > save_stack_trace_tsk. Does this mean save_stack_trace() is included in
> > > the trace? Looking at kernel/stacktrace.c, I think the library wants
> > > to exclude itself from the trace, as it does '.skip = skipnr + 1' (and
> > > '.skip   = skipnr + (current == tsk)' for the _tsk variant).
> > > 
> > > If the arch-helper here is included, should this use _RET_IP_ instead?
> > > 
> > 
> > Don't really know, I was inspired by arm64 which has:
> > 
> > void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
> >  struct task_struct *task, struct pt_regs *regs)
> > {
> > struct stackframe frame;
> > 
> > if (regs)
> > start_backtrace(&frame, regs->regs[29], regs->pc);
> > else if (task == current)
> > start_backtrace(&frame,
> > (unsigned long)__builtin_frame_address(0),
> > (unsigned long)arch_stack_walk);
> > else
> > start_backtrace(&frame, thread_saved_fp(task),
> > thread_saved_pc(task));
> > 
> > walk_stackframe(task, &frame, consume_entry, cookie);
> > }
> > 
> > But looking at x86 you may be right, so what should be done really ?
> 
> x86:
> 
> [2.843292] calling stack_trace_save:
> [2.843705]  test_func+0x6c/0x118
> [2.844184]  do_one_initcall+0x58/0x270
> [2.844618]  kernel_init_freeable+0x1da/0x23a
> [2.845110]  kernel_init+0xc/0x166
> [2.845494]  
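
(As an aside, a minimal sketch of the _RET_IP_ variant discussed above, reusing
the names from Christophe's patch; untested, shown only to make the alternative
concrete:)

| /* Record the caller of save_stack_trace() rather than save_stack_trace()
|  * itself, so the arch helper excludes itself from the trace. */
| void save_stack_trace(struct stack_trace *trace)
| {
|         unsigned long sp = current_stack_frame();
|
|         /* _RET_IP_ == __builtin_return_address(0), i.e. our caller */
|         save_context_stack(trace, sp, _RET_IP_, current, 1);
| }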

[PATCHv2 8/8] arm64: irq: allow FIQs to be handled

2021-03-02 Thread Mark Rutland
On contemporary platforms we don't use FIQ, and treat any stray FIQ as a
fatal event. However, some platforms have an interrupt controller wired
to FIQ, and need to handle FIQ as part of regular operation.

So that we can support both cases dynamically, this patch updates the
FIQ exception handling code to operate the same way as the IRQ handling
code, with its own handle_arch_fiq handler. Where a root FIQ handler is
not registered, an unexpected FIQ exception will trigger the default FIQ
handler, which will panic() as today. Where a root FIQ handler is
registered, handling of the FIQ is deferred to that handler.

As el0_fiq_invalid_compat is supplanted by el0_fiq, the former is
removed. For !CONFIG_COMPAT builds we never expect to take an exception
from AArch32 EL0, so we keep the common el0_fiq_invalid handler.

Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: Hector Martin 
Cc: James Morse 
Cc: Marc Zyngier 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm64/include/asm/irq.h |  1 +
 arch/arm64/kernel/entry.S| 30 +-
 arch/arm64/kernel/irq.c  | 16 
 3 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index 8391c6f6f746..fac08e18bcd5 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -10,6 +10,7 @@ struct pt_regs;
 
 int set_handle_irq(void (*handle_irq)(struct pt_regs *));
 #define set_handle_irq set_handle_irq
+int set_handle_fiq(void (*handle_fiq)(struct pt_regs *));
 
 static inline int nr_legacy_irqs(void)
 {
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index ce8d4dc416fb..a86f50de2c7b 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -588,18 +588,18 @@ SYM_CODE_START(vectors)
 
kernel_ventry   1, sync // Synchronous EL1h
kernel_ventry   1, irq  // IRQ EL1h
-   kernel_ventry   1, fiq_invalid  // FIQ EL1h
+   kernel_ventry   1, fiq  // FIQ EL1h
kernel_ventry   1, error// Error EL1h
 
kernel_ventry   0, sync // Synchronous 64-bit EL0
kernel_ventry   0, irq  // IRQ 64-bit EL0
-   kernel_ventry   0, fiq_invalid  // FIQ 64-bit EL0
+   kernel_ventry   0, fiq  // FIQ 64-bit EL0
kernel_ventry   0, error// Error 64-bit EL0
 
 #ifdef CONFIG_COMPAT
kernel_ventry   0, sync_compat, 32  // Synchronous 32-bit EL0
kernel_ventry   0, irq_compat, 32   // IRQ 32-bit EL0
-   kernel_ventry   0, fiq_invalid_compat, 32   // FIQ 32-bit EL0
+   kernel_ventry   0, fiq_compat, 32   // FIQ 32-bit EL0
kernel_ventry   0, error_compat, 32 // Error 32-bit EL0
 #else
kernel_ventry   0, sync_invalid, 32 // Synchronous 32-bit EL0
@@ -665,12 +665,6 @@ SYM_CODE_START_LOCAL(el0_error_invalid)
inv_entry 0, BAD_ERROR
 SYM_CODE_END(el0_error_invalid)
 
-#ifdef CONFIG_COMPAT
-SYM_CODE_START_LOCAL(el0_fiq_invalid_compat)
-   inv_entry 0, BAD_FIQ, 32
-SYM_CODE_END(el0_fiq_invalid_compat)
-#endif
-
 SYM_CODE_START_LOCAL(el1_sync_invalid)
inv_entry 1, BAD_SYNC
 SYM_CODE_END(el1_sync_invalid)
@@ -705,6 +699,12 @@ SYM_CODE_START_LOCAL_NOALIGN(el1_irq)
kernel_exit 1
 SYM_CODE_END(el1_irq)
 
+SYM_CODE_START_LOCAL_NOALIGN(el1_fiq)
+   kernel_entry 1
+   el1_interrupt_handler handle_arch_fiq
+   kernel_exit 1
+SYM_CODE_END(el1_fiq)
+
 /*
  * EL0 mode handlers.
  */
@@ -731,6 +731,11 @@ SYM_CODE_START_LOCAL_NOALIGN(el0_irq_compat)
b   el0_irq_naked
 SYM_CODE_END(el0_irq_compat)
 
+SYM_CODE_START_LOCAL_NOALIGN(el0_fiq_compat)
+   kernel_entry 0, 32
+   b   el0_fiq_naked
+SYM_CODE_END(el0_fiq_compat)
+
 SYM_CODE_START_LOCAL_NOALIGN(el0_error_compat)
kernel_entry 0, 32
b   el0_error_naked
@@ -745,6 +750,13 @@ el0_irq_naked:
b   ret_to_user
 SYM_CODE_END(el0_irq)
 
+SYM_CODE_START_LOCAL_NOALIGN(el0_fiq)
+   kernel_entry 0
+el0_fiq_naked:
+   el0_interrupt_handler handle_arch_fiq
+   b   ret_to_user
+SYM_CODE_END(el0_fiq)
+
 SYM_CODE_START_LOCAL(el1_error)
kernel_entry 1
mrs x1, esr_el1
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 2fe0b535de30..bda49430c9ea 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -76,7 +76,13 @@ static void default_handle_irq(struct pt_regs *regs)
panic("IRQ taken without a root IRQ handler\n");
 }
 
+static void default_handle_fiq(struct pt_regs *regs)
+{
+   panic("FIQ taken without a root FIQ handler\n");
+}
+
 void (*handle_arch_irq)(struct pt_regs *) __ro_after_init = default_handle_irq;
void (*handle_arch_fiq)(struct pt_regs *) __ro_after_init = default_handle_fiq;
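
(Usage illustration, not part of the patch: a root FIQ controller driver would
be expected to register itself roughly as below; the driver and handler names
are hypothetical.)

| static void my_fiq_handler(struct pt_regs *regs)
| {
|         /* read the controller's FIQ event register and dispatch ... */
| }
|
| static int __init my_fiq_chip_init(void)
| {
|         /* ... map registers, set up the irq domain, etc. ... */
|         return set_handle_fiq(my_fiq_handler);  /* -EBUSY if already registered */
| }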

[PATCHv2 6/8] arm64: entry: factor irq triage logic into macros

2021-03-02 Thread Mark Rutland
From: Marc Zyngier 

In subsequent patches we'll allow an FIQ handler to be registered, and
FIQ exceptions will need to be triaged very similarly to IRQ exceptions.
So that we can reuse the existing logic, this patch factors the IRQ
triage logic out into macros that can be reused for FIQ.

The macros are named to follow the elX_foo_handler scheme used by the C
exception handlers. For consistency with other top-level exception
handlers, the kernel_entry/kernel_exit logic is not moved into the
macros. As FIQ will use a different C handler, this handler name is
provided as an argument to the macros.

There should be no functional change as a result of this patch.

Signed-off-by: Marc Zyngier 
[Mark: rework macros, commit message, rebase before DAIF rework]
Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: Hector Martin 
Cc: James Morse 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm64/kernel/entry.S | 80 +--
 1 file changed, 43 insertions(+), 37 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index a31a0a713c85..e235b0e4e468 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -491,8 +491,8 @@ tsk .reqx28 // current thread_info
 /*
  * Interrupt handling.
  */
-   .macro  irq_handler
-   ldr_l   x1, handle_arch_irq
+   .macro  irq_handler, handler:req
+   ldr_l   x1, \handler
mov x0, sp
irq_stack_entry
blr x1
@@ -531,6 +531,45 @@ alternative_endif
 #endif
.endm
 
+   .macro el1_interrupt_handler, handler:req
+   gic_prio_irq_setup pmr=x20, tmp=x1
+   enable_da_f
+
+   mov x0, sp
+   bl  enter_el1_irq_or_nmi
+
+   irq_handler \handler
+
+#ifdef CONFIG_PREEMPTION
+   ldr x24, [tsk, #TSK_TI_PREEMPT] // get preempt count
+alternative_if ARM64_HAS_IRQ_PRIO_MASKING
+   /*
+* DA_F were cleared at start of handling. If anything is set in DAIF,
+* we come back from an NMI, so skip preemption
+*/
+   mrs x0, daif
+   orr x24, x24, x0
+alternative_else_nop_endif
+   cbnz    x24, 1f // preempt count != 0 || NMI return path
+   bl  arm64_preempt_schedule_irq  // irq en/disable is done inside
+1:
+#endif
+
+   mov x0, sp
+   bl  exit_el1_irq_or_nmi
+   .endm
+
+   .macro el0_interrupt_handler, handler:req
+   gic_prio_irq_setup pmr=x20, tmp=x0
+   user_exit_irqoff
+   enable_da_f
+
+   tbz x22, #55, 1f
+   bl  do_el0_irq_bp_hardening
+1:
+   irq_handler \handler
+   .endm
+
.text
 
 /*
@@ -660,32 +699,7 @@ SYM_CODE_END(el1_sync)
.align  6
 SYM_CODE_START_LOCAL_NOALIGN(el1_irq)
kernel_entry 1
-   gic_prio_irq_setup pmr=x20, tmp=x1
-   enable_da_f
-
-   mov x0, sp
-   bl  enter_el1_irq_or_nmi
-
-   irq_handler
-
-#ifdef CONFIG_PREEMPTION
-   ldr x24, [tsk, #TSK_TI_PREEMPT] // get preempt count
-alternative_if ARM64_HAS_IRQ_PRIO_MASKING
-   /*
-* DA_F were cleared at start of handling. If anything is set in DAIF,
-* we come back from an NMI, so skip preemption
-*/
-   mrs x0, daif
-   orr x24, x24, x0
-alternative_else_nop_endif
-   cbnz    x24, 1f // preempt count != 0 || NMI return path
-   bl  arm64_preempt_schedule_irq  // irq en/disable is done inside
-1:
-#endif
-
-   mov x0, sp
-   bl  exit_el1_irq_or_nmi
-
+   el1_interrupt_handler handle_arch_irq
kernel_exit 1
 SYM_CODE_END(el1_irq)
 
@@ -725,15 +739,7 @@ SYM_CODE_END(el0_error_compat)
 SYM_CODE_START_LOCAL_NOALIGN(el0_irq)
kernel_entry 0
 el0_irq_naked:
-   gic_prio_irq_setup pmr=x20, tmp=x0
-   user_exit_irqoff
-   enable_da_f
-
-   tbz x22, #55, 1f
-   bl  do_el0_irq_bp_hardening
-1:
-   irq_handler
-
+   el0_interrupt_handler handle_arch_irq
b   ret_to_user
 SYM_CODE_END(el0_irq)
 
-- 
2.11.0



[PATCHv2 7/8] arm64: Always keep DAIF.[IF] in sync

2021-03-02 Thread Mark Rutland
From: Hector Martin 

Apple SoCs (A11 and newer) have some interrupt sources hardwired to the
FIQ line. We implement support for this by simply treating IRQs and FIQs
the same way in the interrupt vectors.

To support these systems, the FIQ mask bit needs to be kept in sync with
the IRQ mask bit, so both kinds of exceptions are masked together. No
other platforms should be delivering FIQ exceptions right now, and we
already unmask FIQ in normal process context, so this should not have an
effect on other systems - if spurious FIQs were arriving, they would
already panic the kernel.

Signed-off-by: Hector Martin 
Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: James Morse 
Cc: Marc Zyngier 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm64/include/asm/arch_gicv3.h |  2 +-
 arch/arm64/include/asm/assembler.h  |  8 
 arch/arm64/include/asm/daifflags.h  | 10 +-
 arch/arm64/include/asm/irqflags.h   | 16 +++-
 arch/arm64/kernel/entry.S   | 12 +++-
 arch/arm64/kernel/process.c |  2 +-
 arch/arm64/kernel/smp.c |  1 +
 7 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/arch_gicv3.h b/arch/arm64/include/asm/arch_gicv3.h
index 880b9054d75c..934b9be582d2 100644
--- a/arch/arm64/include/asm/arch_gicv3.h
+++ b/arch/arm64/include/asm/arch_gicv3.h
@@ -173,7 +173,7 @@ static inline void gic_pmr_mask_irqs(void)
 
 static inline void gic_arch_enable_irqs(void)
 {
-   asm volatile ("msr daifclr, #2" : : : "memory");
+   asm volatile ("msr daifclr, #3" : : : "memory");
 }
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index ca31594d3d6c..b76a71e84b7c 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -40,9 +40,9 @@
msr daif, \flags
.endm
 
-   /* IRQ is the lowest priority flag, unconditionally unmask the rest. */
-   .macro enable_da_f
-   msr daifclr, #(8 | 4 | 1)
+   /* IRQ/FIQ are the lowest priority flags, unconditionally unmask the rest. */
+   .macro enable_da
+   msr daifclr, #(8 | 4)
.endm
 
 /*
@@ -50,7 +50,7 @@
  */
.macro  save_and_disable_irq, flags
mrs \flags, daif
-   msr daifset, #2
+   msr daifset, #3
.endm
 
.macro  restore_irq, flags
diff --git a/arch/arm64/include/asm/daifflags.h b/arch/arm64/include/asm/daifflags.h
index 1c26d7baa67f..5eb7af9c4557 100644
--- a/arch/arm64/include/asm/daifflags.h
+++ b/arch/arm64/include/asm/daifflags.h
@@ -13,8 +13,8 @@
 #include 
 
 #define DAIF_PROCCTX   0
-#define DAIF_PROCCTX_NOIRQ PSR_I_BIT
-#define DAIF_ERRCTX(PSR_I_BIT | PSR_A_BIT)
+#define DAIF_PROCCTX_NOIRQ (PSR_I_BIT | PSR_F_BIT)
+#define DAIF_ERRCTX(PSR_A_BIT | PSR_I_BIT | PSR_F_BIT)
 #define DAIF_MASK  (PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT)
 
 
@@ -47,7 +47,7 @@ static inline unsigned long local_daif_save_flags(void)
if (system_uses_irq_prio_masking()) {
/* If IRQs are masked with PMR, reflect it in the flags */
if (read_sysreg_s(SYS_ICC_PMR_EL1) != GIC_PRIO_IRQON)
-   flags |= PSR_I_BIT;
+   flags |= PSR_I_BIT | PSR_F_BIT;
}
 
return flags;
@@ -69,7 +69,7 @@ static inline void local_daif_restore(unsigned long flags)
bool irq_disabled = flags & PSR_I_BIT;
 
WARN_ON(system_has_prio_mask_debugging() &&
-   !(read_sysreg(daif) & PSR_I_BIT));
+   (read_sysreg(daif) & (PSR_I_BIT | PSR_F_BIT)) != (PSR_I_BIT | PSR_F_BIT));
 
if (!irq_disabled) {
trace_hardirqs_on();
@@ -86,7 +86,7 @@ static inline void local_daif_restore(unsigned long flags)
 * If interrupts are disabled but we can take
 * asynchronous errors, we can take NMIs
 */
-   flags &= ~PSR_I_BIT;
+   flags &= ~(PSR_I_BIT | PSR_F_BIT);
pmr = GIC_PRIO_IRQOFF;
} else {
pmr = GIC_PRIO_IRQON | GIC_PRIO_PSR_I_SET;
diff --git a/arch/arm64/include/asm/irqflags.h b/arch/arm64/include/asm/irqflags.h
index ff328e5bbb75..b57b9b1e4344 100644
--- a/arch/arm64/include/asm/irqflags.h
+++ b/arch/arm64/include/asm/irqflags.h
@@ -12,15 +12,13 @@
 
 /*
  * Aarch64 has flags for masking: Debug, Asynchronous (serror), Interrupts and
- * FIQ exceptions, in the 'daif' register. We mask and unmask them in 'dai'
+ * FIQ exceptions, in the 'daif' register. We mask and unmask them in 'daif'
  * order:
  * Masking debug exceptions causes all other exceptions to be masked too/
- * Masking SError masks irq, but not debug exceptions. Masking irqs has no
- * side effects for other flags. Keeping to
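
(A minimal illustration of the DAIF.[IF] rule the patch above enforces: the I
(bit 1) and F (bit 0) mask bits are always set and cleared together, so an
immediate of #3 covers both. The helper names below are made up.)

| static inline void mask_irq_and_fiq(void)
| {
|         asm volatile("msr daifset, #3" : : : "memory");  /* mask I and F together */
| }
|
| static inline void unmask_irq_and_fiq(void)
| {
|         asm volatile("msr daifclr, #3" : : : "memory");  /* unmask I and F together */
| }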

[PATCHv2 5/8] arm64: irq: rework root IRQ handler registration

2021-03-02 Thread Mark Rutland
If we accidentally unmask IRQs before we've registered a root IRQ
handler, handle_arch_irq will be NULL, and the IRQ exception handler
will branch to a bogus address.

To make this easier to debug, this patch initialises handle_arch_irq to
a default handler which will panic(), making such problems easier to
debug. When we add support for FIQ handlers, we can follow the same
approach.

When we add support for a root FIQ handler, it's possible to have root
IRQ handler without an root FIQ handler, and in theory the inverse is
also possible. To permit this, and to keep the IRQ/FIQ registration
logic similar, this patch removes the panic in the absence of a root IRQ
controller. Instead, set_handle_irq() logs when a handler is registered,
which is sufficient for debug purposes.

Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: Hector Martin 
Cc: James Morse 
Cc: Marc Zyngier 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm64/kernel/irq.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index ad63bd50fa7b..2fe0b535de30 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -71,14 +71,20 @@ static void init_irq_stacks(void)
 }
 #endif
 
-void (*handle_arch_irq)(struct pt_regs *) __ro_after_init;
+static void default_handle_irq(struct pt_regs *regs)
+{
+   panic("IRQ taken without a root IRQ handler\n");
+}
+
+void (*handle_arch_irq)(struct pt_regs *) __ro_after_init = default_handle_irq;
 
 int __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
 {
-   if (handle_arch_irq)
+   if (handle_arch_irq != default_handle_irq)
return -EBUSY;
 
handle_arch_irq = handle_irq;
+   pr_info("Root IRQ handler: %ps\n", handle_irq);
return 0;
 }
 
@@ -87,8 +93,6 @@ void __init init_IRQ(void)
init_irq_stacks();
init_irq_scs();
irqchip_init();
-   if (!handle_arch_irq)
-   panic("No interrupt controller found.");
 
if (system_uses_irq_prio_masking()) {
/*
-- 
2.11.0



[PATCHv2 0/8] arm64: Support FIQ controller registration

2021-03-02 Thread Mark Rutland
Hector's M1 support series [1] shows that some platforms have critical
interrupts wired to FIQ, and to support these platforms we need to support
handling FIQ exceptions. Other contemporary platforms don't use FIQ (since e.g.
this is usually routed to EL3), and as we never expect to take an FIQ, we have
the FIQ vector cause a panic.

Since the use of FIQ is a platform integration detail (which can differ across
bare-metal and virtualized environments), we need to be able to explicitly opt in
to handling FIQs while retaining the existing behaviour otherwise. This series
adds a new set_handle_fiq() hook so that the FIQ controller can do so, and
where no controller is registered the default handler will panic(). For
consistency the set_handle_irq() code is made to do the same.

The first couple of patches are from Marc's irq/drop-generic_irq_multi_handler
branch [2] on kernel.org, and clean up CONFIG_GENERIC_IRQ_MULTI_HANDLER usage.

The next four patches move arm64 over to a local set_handle_irq()
implementation, which is written to share code with a set_handle_fiq() function
in the last two patches. This adds a default handler which will directly
panic() rather than branching to NULL if an IRQ is taken unexpectedly, and the
boot-time panic in the absence of a handler is removed (for consistency with
FIQ support added later).

The penultimate patch reworks arm64's IRQ masking to always keep DAIF.[IF] in
sync, so that we can treat IRQ and FIQ as equals. This is cherry-picked from
Hector's reply [3] to the first version of this series.

The final patch adds the low-level FIQ exception handling and registration
mechanism atop the prior rework.

I'm hoping that we can get the first 2 patches in as a preparatory cleanup for
the next rc or so, and then the rest of the series can be rebased atop that.
I've pushed the series out to my arm64/fiq branch [4] on kernel.org, also
tagged as arm64-fiq-20210302, atop v5.12-rc1.

Since v1 [5]:
* Rebase to v5.12-rc1
* Pick up Hector's latest DAIF.[IF] patch
* Use "root {IRQ,FIQ} handler" rather than "{IRQ,FIQ} controller"
* Remove existing panic per Marc's comments
* Log registered root handlers
* Make default root handlers static
* Remove redundant el0_fiq_invalid_compat, per Joey's comments

Thanks,
Mark.

[1] https://lore.kernel.org/r/20210215121713.57687-1-mar...@marcan.st
[2] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/drop-generic_irq_multi_handler
[3] https://lore.kernel.org/r/20210219172530.45805-1-mar...@marcan.st
[4] https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/fiq
[5] https://lore.kernel.org/r/20210219113904.41736-1-mark.rutl...@arm.com

Hector Martin (1):
  arm64: Always keep DAIF.[IF] in sync

Marc Zyngier (5):
  ARM: ep93xx: Select GENERIC_IRQ_MULTI_HANDLER directly
  irqchip: Do not blindly select CONFIG_GENERIC_IRQ_MULTI_HANDLER
  genirq: Allow architectures to override set_handle_irq() fallback
  arm64: don't use GENERIC_IRQ_MULTI_HANDLER
  arm64: entry: factor irq triage logic into macros

Mark Rutland (2):
  arm64: irq: rework root IRQ handler registration
  arm64: irq: allow FIQs to be handled

 arch/arm/Kconfig|   1 +
 arch/arm64/Kconfig  |   1 -
 arch/arm64/include/asm/arch_gicv3.h |   2 +-
 arch/arm64/include/asm/assembler.h  |   8 +--
 arch/arm64/include/asm/daifflags.h  |  10 ++--
 arch/arm64/include/asm/irq.h|   4 ++
 arch/arm64/include/asm/irqflags.h   |  16 +++--
 arch/arm64/kernel/entry.S   | 114 +---
 arch/arm64/kernel/irq.c |  35 ++-
 arch/arm64/kernel/process.c |   2 +-
 arch/arm64/kernel/smp.c |   1 +
 drivers/irqchip/Kconfig |   9 ---
 include/linux/irq.h |   2 +
 13 files changed, 126 insertions(+), 79 deletions(-)

-- 
2.11.0



[PATCHv2 4/8] arm64: don't use GENERIC_IRQ_MULTI_HANDLER

2021-03-02 Thread Mark Rutland
From: Marc Zyngier 

In subsequent patches we want to allow irqchip drivers to register as
FIQ handlers, with a set_handle_fiq() function. To keep the IRQ/FIQ
paths similar, we want arm64 to provide both set_handle_irq() and
set_handle_fiq(), rather than using GENERIC_IRQ_MULTI_HANDLER for the
former.

This patch adds an arm64-specific implementation of set_handle_irq().
There should be no functional change as a result of this patch.

Signed-off-by: Marc Zyngier 
[Mark: use a single handler pointer]
Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: Hector Martin 
Cc: James Morse 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm64/Kconfig   |  1 -
 arch/arm64/include/asm/irq.h |  3 +++
 arch/arm64/kernel/irq.c  | 11 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1f212b47a48a..bca2fc64b6b6 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -110,7 +110,6 @@ config ARM64
select GENERIC_EARLY_IOREMAP
select GENERIC_IDLE_POLL_SETUP
select GENERIC_IRQ_IPI
-   select GENERIC_IRQ_MULTI_HANDLER
select GENERIC_IRQ_PROBE
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index b2b0c6405eb0..8391c6f6f746 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -8,6 +8,9 @@
 
 struct pt_regs;
 
+int set_handle_irq(void (*handle_irq)(struct pt_regs *));
+#define set_handle_irq set_handle_irq
+
 static inline int nr_legacy_irqs(void)
 {
return 0;
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index dfb1feab867d..ad63bd50fa7b 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -71,6 +71,17 @@ static void init_irq_stacks(void)
 }
 #endif
 
+void (*handle_arch_irq)(struct pt_regs *) __ro_after_init;
+
+int __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
+{
+   if (handle_arch_irq)
+   return -EBUSY;
+
+   handle_arch_irq = handle_irq;
+   return 0;
+}
+
 void __init init_IRQ(void)
 {
init_irq_stacks();
-- 
2.11.0



[PATCHv2 3/8] genirq: Allow architectures to override set_handle_irq() fallback

2021-03-02 Thread Mark Rutland
From: Marc Zyngier 

Some architectures want to provide the generic set_handle_irq() API, but
for structural reasons need to provide their own implementation. For
example, arm64 needs to do this to provide uniform set_handle_irq() and
set_handle_fiq() registration functions.

Make this possible by allowing architectures to provide their own
implementation of set_handle_irq when CONFIG_GENERIC_IRQ_MULTI_HANDLER
is not selected.

Signed-off-by: Marc Zyngier 
[Mark: expand commit message]
Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: Hector Martin 
Cc: James Morse 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 include/linux/irq.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 2efde6a79b7e..9890180b84fd 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1258,11 +1258,13 @@ int __init set_handle_irq(void (*handle_irq)(struct pt_regs *));
  */
 extern void (*handle_arch_irq)(struct pt_regs *) __ro_after_init;
 #else
+#ifndef set_handle_irq
 #define set_handle_irq(handle_irq) \
do {\
(void)handle_irq;   \
WARN_ON(1); \
} while (0)
 #endif
+#endif
 
 #endif /* _LINUX_IRQ_H */
-- 
2.11.0



[PATCHv2 2/8] irqchip: Do not blindly select CONFIG_GENERIC_IRQ_MULTI_HANDLER

2021-03-02 Thread Mark Rutland
From: Marc Zyngier 

Implementing CONFIG_GENERIC_IRQ_MULTI_HANDLER is a decision that is
made at the architecture level, and shouldn't involve the irqchip
at all (we even provide a fallback helper when the option isn't
selected).

Drop all instances of such selection from non-arch code.

Signed-off-by: Marc Zyngier 
Link: https://lore.kernel.org/r/20210217142800.2547737-1-...@kernel.org
Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: Hector Martin 
Cc: James Morse 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 drivers/irqchip/Kconfig | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index e74fa206240a..15536e321df5 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -8,7 +8,6 @@ config IRQCHIP
 config ARM_GIC
bool
select IRQ_DOMAIN_HIERARCHY
-   select GENERIC_IRQ_MULTI_HANDLER
select GENERIC_IRQ_EFFECTIVE_AFF_MASK
 
 config ARM_GIC_PM
@@ -33,7 +32,6 @@ config GIC_NON_BANKED
 
 config ARM_GIC_V3
bool
-   select GENERIC_IRQ_MULTI_HANDLER
select IRQ_DOMAIN_HIERARCHY
select PARTITION_PERCPU
select GENERIC_IRQ_EFFECTIVE_AFF_MASK
@@ -64,7 +62,6 @@ config ARM_NVIC
 config ARM_VIC
bool
select IRQ_DOMAIN
-   select GENERIC_IRQ_MULTI_HANDLER
 
 config ARM_VIC_NR
int
@@ -99,14 +96,12 @@ config ATMEL_AIC_IRQ
bool
select GENERIC_IRQ_CHIP
select IRQ_DOMAIN
-   select GENERIC_IRQ_MULTI_HANDLER
select SPARSE_IRQ
 
 config ATMEL_AIC5_IRQ
bool
select GENERIC_IRQ_CHIP
select IRQ_DOMAIN
-   select GENERIC_IRQ_MULTI_HANDLER
select SPARSE_IRQ
 
 config I8259
@@ -153,7 +148,6 @@ config DW_APB_ICTL
 config FARADAY_FTINTC010
bool
select IRQ_DOMAIN
-   select GENERIC_IRQ_MULTI_HANDLER
select SPARSE_IRQ
 
 config HISILICON_IRQ_MBIGEN
@@ -169,7 +163,6 @@ config IMGPDC_IRQ
 config IXP4XX_IRQ
bool
select IRQ_DOMAIN
-   select GENERIC_IRQ_MULTI_HANDLER
select SPARSE_IRQ
 
 config MADERA_IRQ
@@ -186,7 +179,6 @@ config CLPS711X_IRQCHIP
bool
depends on ARCH_CLPS711X
select IRQ_DOMAIN
-   select GENERIC_IRQ_MULTI_HANDLER
select SPARSE_IRQ
default y
 
@@ -205,7 +197,6 @@ config OMAP_IRQCHIP
 config ORION_IRQCHIP
bool
select IRQ_DOMAIN
-   select GENERIC_IRQ_MULTI_HANDLER
 
 config PIC32_EVIC
bool
-- 
2.11.0



[PATCHv2 1/8] ARM: ep93xx: Select GENERIC_IRQ_MULTI_HANDLER directly

2021-03-02 Thread Mark Rutland
From: Marc Zyngier 

ep93xx currently relies of CONFIG_ARM_VIC to select
GENERIC_IRQ_MULTI_HANDLER. Given that this is logically a platform
architecture property, add the selection of GENERIC_IRQ_MULTI_HANDLER
at the platform level.

Further patches will remove the selection from the irqchip side.

Reported-by: Mark Rutland 
Signed-off-by: Marc Zyngier 
Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: Hector Martin 
Cc: James Morse 
Cc: Thomas Gleixner 
Cc: Will Deacon 
---
 arch/arm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 853aab5ab327..5da96f5df48f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -348,6 +348,7 @@ config ARCH_EP93XX
select ARM_AMBA
imply ARM_PATCH_PHYS_VIRT
select ARM_VIC
+   select GENERIC_IRQ_MULTI_HANDLER
select AUTO_ZRELADDR
select CLKDEV_LOOKUP
select CLKSRC_MMIO
-- 
2.11.0



Re: [RFC PATCH v1 1/1] arm64: Unwinder enhancements for reliable stack trace

2021-02-25 Thread Mark Rutland
On Wed, Feb 24, 2021 at 01:34:09PM -0600, Madhavan T. Venkataraman wrote:
> On 2/24/21 6:17 AM, Mark Rutland wrote:
> > On Tue, Feb 23, 2021 at 12:12:43PM -0600, madve...@linux.microsoft.com 
> > wrote:
> >> From: "Madhavan T. Venkataraman" 
> >>Termination
> >>===
> >>
> >>Currently, the unwinder terminates when both the FP (frame pointer)
> >>and the PC (return address) of a frame are 0. But a frame could get
> >>corrupted and zeroed. There needs to be a better check.
> >>
> >>The following special terminating frame and function have been
> >>defined for this purpose:
> >>
> >>const u64arm64_last_frame[2] __attribute__ ((aligned (16)));
> >>
> >>void arm64_last_func(void)
> >>{
> >>}
> >>
> >>So, set the FP to arm64_last_frame and the PC to arm64_last_func in
> >>the bottom most frame.
> > 
> > My expectation was that we'd do this per-task, creating an empty frame
> > record (i.e. with fp=NULL and lr=NULL) on the task's stack at the
> > instant it was created, and chaining this into x29. That way the address
> > is known (since it can be derived from the task), and the frame will
> > also implicitly check that the callchain terminates on the task stack
> > without loops. That also means that we can use it to detect the entry
> > code going wrong (e.g. if the SP gets corrupted), since in that case the
> > entry code would place the record at a different location.
> 
> That is exactly what this is doing. arm64_last_frame[] is a marker frame
> that contains fp=0 and pc=0.

Almost! What I meant was that each task should have its own final/marker
frame record on its own task stack, rather than sharing a common global
variable.

That way a check for that frame record implicitly checks that a task
started at the expected location *on that stack*, which catches
additional stack corruption cases (e.g. if we left data on the stack
prior to an EL0 entry).
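
(To make that concrete, a rough sketch of the check this buys us; the offset
and helper below are illustrative assumptions, not an actual implementation:)

| /* The terminal record is an all-zero frame record placed at a location
|  * derivable from the task, e.g. just below the top of its stack. */
| struct frame_record {
|         unsigned long fp;       /* 0 terminates the chain */
|         unsigned long lr;       /* 0 */
| };
|
| static bool on_terminal_frame(const struct task_struct *tsk, unsigned long fp)
| {
|         unsigned long top = (unsigned long)task_stack_page(tsk) + THREAD_SIZE;
|
|         return fp == top - sizeof(struct frame_record);
| }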

[...]

> > ... I reckon once we've moved the last of the exception triage out to C
> > this will be relatively simple, since all of the exception handlers will
> > look like:
> > 
> > | SYM_CODE_START_LOCAL(elX_exception)
> > |   kernel_entry X
> > |   mov x0, sp
> > |   bl  elX_exception_handler
> > |   kernel_exit X
> > | SYM_CODE_END(elX_exception)
> > 
> > ... and so we just need to identify the set of elX_exception functions
> > (which we'll never expect to take exceptions from directly). We could be
> > strict and reject unwinding into arbitrary bits of the entry code (e.g.
> > if we took an unexpected exception), and only permit unwinding to the
> > BL.
> > 
> >>It can do this if the FP and PC are also recorded elsewhere in the
> >>pt_regs for comparison. Currently, the FP is also stored in
> >>regs->regs[29]. The PC is stored in regs->pc. However, regs->pc can
> >>be changed by lower level functions.
> >>
> >>Create a new field, pt_regs->orig_pc, and record the return address
> >>PC there. With this, the unwinder can validate the exception frame
> >>and set a flag so that the caller of the unwinder can know when
> >>an exception frame is encountered.
> > 
> > I don't understand the case you're trying to solve here. When is
> > regs->pc changed in a way that's problematic?
> > 
> 
> For instance, I used a test driver in which the driver calls a function
> pointer which is NULL. The low level fault handler sends a signal to the
> task. Looks like it changes regs->pc for this.

I'm struggling to follow what you mean here.

If the kernel branches to NULL, the CPU should raise an instruction
abort from the current EL, and our handling of that should terminate the
thread via die_kernel_fault(), without returning to the faulting
context, and without altering the PC in the faulting context.

Signals are delivered to userspace and alter the userspace PC, not a
kernel PC, so this doesn't seem relevant. Do you mean an exception fixup
handler rather than a signal?

> When I dump the stack from the low level handler, the comparison with
> regs->pc does not work.  But comparison with regs->orig_pc works.

As above, I'm struggling with this; could you share a concrete example? 

Thanks,
Mark.


Re: [PATCH 0/8] arm64: Support FIQ controller registration

2021-02-24 Thread Mark Rutland
On Fri, Feb 19, 2021 at 06:10:56PM +, Marc Zyngier wrote:
> Hi Mark,
> 
> On Fri, 19 Feb 2021 11:38:56 +0000,
> Mark Rutland  wrote:
> > 
> > Hector's M1 support series [1] shows that some platforms have critical
> > interrupts wired to FIQ, and to support these platforms we need to support
> > handling FIQ exceptions. Other contemporary platforms don't use FIQ (since e.g.
> > this is usually routed to EL3), and as we never expect to take an FIQ, we have
> > the FIQ vector cause a panic.
> > 
> > Since the use of FIQ is a platform integration detail (which can differ across
> > bare-metal and virtualized environments), we need to be able to explicitly opt in
> > to handling FIQs while retaining the existing behaviour otherwise. This series
> > adds a new set_handle_fiq() hook so that the FIQ controller can do so, and
> > where no controller is registered the default handler will panic(). For
> > consistency the set_handle_irq() code is made to do the same.
> > 
> > The first couple of patches are from Marc's irq/drop-generic_irq_multi_handler
> > branch [2] on kernel.org, and clean up CONFIG_GENERIC_IRQ_MULTI_HANDLER usage.
> > The next four patches move arm64 over to a local set_handle_irq()
> > implementation, which is written to share code with a set_handle_fiq() function
> > in the last two patches. The only functional difference here is that if an IRQ
> > is somehow taken prior to set_handle_irq() the default handler will directly
> > panic() rather than the vector branching to NULL.
> > 
> > The penultimate patch is cherry-picked from the v2 M1 series, and as per
> > discussion there [3] will need a few additional fixups. I've included it for
> > now as the DAIF.IF alignment is necessary for the FIQ exception handling added
> > in the final patch.
> > 
> > The final patch adds the low-level FIQ exception handling and registration
> > mechanism atop the prior rework.
> 
> Thanks for putting this together. I have an extra patch on top of this
> series[1] that prevents the kernel from catching fire if a FIQ fires
> whilst running a guest. Nothing urgent, we can queue it at a later time.
> 
> Thanks,
> 
>   M.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/fiq

IIUC for that "invalid_vect" should be changed to "valid_vect", to match
the other valid vector entries, but otherwise that looks sane to me.

I guess we could take that as a prerequisite ahead of the rest?

Thanks,
Mark.

