Re: [RESEND PATCH v5 2/5] arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo

2020-01-10 Thread James Morse
Hi Bhupesh,

On 25/12/2019 19:01, Bhupesh Sharma wrote:
> On 12/12/2019 04:02 PM, James Morse wrote:
>> On 29/11/2019 19:59, Bhupesh Sharma wrote:
>>> vabits_actual variable on arm64 indicates the actual VA space size,
>>> and allows a single binary to support both 48-bit and 52-bit VA
>>> spaces.
>>>
>>> If the ARMv8.2-LVA optional feature is present, and we are running
>>> with a 64KB page size; then it is possible to use 52-bits of address
>>> space for both userspace and kernel addresses. However, any kernel
>>> binary that supports 52-bit must also be able to fall back to 48-bit
>>> at early boot time if the hardware feature is not present.
>>>
>>> Since TCR_EL1.T1SZ indicates the size offset of the memory region
>>> addressed by TTBR1_EL1 (and hence can be used for determining the
>>> vabits_actual value) it makes more sense to export the same in
>>> vmcoreinfo rather than vabits_actual variable, as the name of the
>>> variable can change in future kernel versions, but the architectural
>>> constructs like TCR_EL1.T1SZ can be used better to indicate intended
>>> specific fields to user-space.
>>>
>>> User-space utilities like makedumpfile and crash-utility, need to
>>> read/write this value from/to vmcoreinfo
>>
>> (write?)
> 
> Yes, also write, so that the vmcoreinfo from a (crashing) arm64 system can be
> used for analysis of the root cause of a panic/crash on, say, an x86_64 host
> using utilities like crash-utility/gdb.

I read this as "User-space [...] needs to write to vmcoreinfo".


>>> for determining if a virtual address lies in the linear map range.
>>
>> I think this is a fragile example. The debugger shouldn't need to know this.
> 
> Well, that's the current user-space utility design, so I am not sure we can
> tweak that too much.
> 
>>> The user-space computation for determining whether an address lies in
>>> the linear map range is the same as we have in kernel-space:
>>>
>>>    #define __is_lm_address(addr)    (!(((u64)addr) & BIT(vabits_actual - 1)))
>>
>> This was changed with 14c127c957c1 ("arm64: mm: Flip kernel VA space"). If
>> user-space tools rely on 'knowing' the kernel memory layout, they will
>> constantly have to be fixed and updated. This is a poor argument for adding
>> this to something that ends up as ABI.
> 
> See above. The user-space has to rely on some ABI/guaranteed hardware-symbols 
> which can be
> used for 'determining' the kernel memory layout.

I disagree. Everything and anything in the kernel will change. The ABI rules
apply to stuff exposed via syscalls and kernel filesystems. They do not apply
to kernel internals, like the memory layout we used yesterday. 14c127c957c1 is
a case in point.

A debugger trying to rely on this sort of thing would have to play catchup 
whenever it
changes.

I'm looking for a justification that isn't paper-thin. Putting 'for guessing 
the memory
map' in the commit message gives the impression we can support that.


>> I think a better argument is walking the kernel page tables from the core 
>> dump.
>> Core code's vmcoreinfo exports the location of the kernel page tables, but 
>> in the example
>> above you can't walk them without knowing how T1SZ was configured.
> 
> Sure, both makedumpfile and crash-utility (which walks the kernel page tables 
> from the
> core dump) use this (and similar) information currently in the user-space.

[...]

>> (From-memory: one of vmcore/kcore is virtually addressed, the other
>> physically. Does this fix your problem in both cases?)
>>
>>
>>> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
>>> index ca4c3e12d8c5..f78310ba65ea 100644
>>> --- a/arch/arm64/kernel/crash_core.c
>>> +++ b/arch/arm64/kernel/crash_core.c
>>> @@ -7,6 +7,13 @@
>>>   #include 
>>>   #include 
>>
>> You need to include asm/sysreg.h for read_sysreg(), and asm/pgtable-hwdef.h 
>> for the macros
>> you added.
> 
> Ok. Will check, as I did not get any compilation errors without them, and the
> build-bot also did not raise a flag for the missing include files.

Don't trust the header jungle!


>>> +static inline u64 get_tcr_el1_t1sz(void);
> 
>> Why do you need to do this?
> 
> Without this I was getting a missing declaration error, while compiling the 
> code.

Missing declaration?

>>> +static inline u64 get_tcr_el1_t1sz(void)
>>> +{
>>> +    return (read_sysreg(tcr_el1) & TCR_T1SZ_MASK) >> TCR_T1SZ_OFFSET;
>>> +}

Here it is! (I must be going mad...)


Thanks,

James


Re: [RESEND PATCH v5 5/5] Documentation/vmcoreinfo: Add documentation for 'TCR_EL1.T1SZ'

2019-12-12 Thread James Morse
Hi Bhupesh,

On 29/11/2019 19:59, Bhupesh Sharma wrote:
> Add documentation for TCR_EL1.T1SZ variable being added to
> vmcoreinfo.
> 
> It indicates the size offset of the memory region addressed by TTBR1_EL1

> and hence can be used for determining the vabits_actual value.

used for determining random-internal-kernel-variable, that might not exist 
tomorrow.

Could you describe how this is useful/necessary if a debugger wants to walk the 
page
tables from the core file? I think this is a better argument.

Wouldn't the documentation be better as part of the patch that adds the export?
(... unless these have to go via different trees? ..)


> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst 
> b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index 447b64314f56..f9349f9d3345 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -398,6 +398,12 @@ KERNELOFFSET
>  The kernel randomization offset. Used to compute the page offset. If
>  KASLR is disabled, this value is zero.
>  
> +TCR_EL1.T1SZ
> +------------
> +
> +Indicates the size offset of the memory region addressed by TTBR1_EL1

> +and hence can be used for determining the vabits_actual value.

'vabits_actual' may not exist when the next person comes to read this
documentation (it's going to rot really quickly).

I think the first half of this text is enough to say what this is for. You
should include words to the effect that it's the hardware value that goes with
swapper_pg_dir. You may want to point readers to the Arm ARM for more details
on what the value means.


Thanks,

James


Re: [RESEND PATCH v5 2/5] arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo

2019-12-12 Thread James Morse
Hi Bhupesh,

On 29/11/2019 19:59, Bhupesh Sharma wrote:
> vabits_actual variable on arm64 indicates the actual VA space size,
> and allows a single binary to support both 48-bit and 52-bit VA
> spaces.
> 
> If the ARMv8.2-LVA optional feature is present, and we are running
> with a 64KB page size; then it is possible to use 52-bits of address
> space for both userspace and kernel addresses. However, any kernel
> binary that supports 52-bit must also be able to fall back to 48-bit
> at early boot time if the hardware feature is not present.
> 
> Since TCR_EL1.T1SZ indicates the size offset of the memory region
> addressed by TTBR1_EL1 (and hence can be used for determining the
> vabits_actual value) it makes more sense to export the same in
> vmcoreinfo rather than vabits_actual variable, as the name of the
> variable can change in future kernel versions, but the architectural
> constructs like TCR_EL1.T1SZ can be used better to indicate intended
> specific fields to user-space.
> 
> User-space utilities like makedumpfile and crash-utility, need to
> read/write this value from/to vmcoreinfo

(write?)

> for determining if a virtual address lies in the linear map range.

I think this is a fragile example. The debugger shouldn't need to know this.


> The user-space computation for determining whether an address lies in
> the linear map range is the same as we have in kernel-space:
> 
>   #define __is_lm_address(addr)   (!(((u64)addr) & BIT(vabits_actual - 1)))

This was changed with 14c127c957c1 ("arm64: mm: Flip kernel VA space"). If
user-space tools rely on 'knowing' the kernel memory layout, they will
constantly have to be fixed and updated. This is a poor argument for adding
this to something that ends up as ABI.


I think a better argument is walking the kernel page tables from the core dump.
Core code's vmcoreinfo exports the location of the kernel page tables, but in 
the example
above you can't walk them without knowing how T1SZ was configured.

On older kernels, user-space that needs this would have to assume the value it
computes from VA_BITS (also in vmcoreinfo) is the value in use.


---%<---
> I have sent out user-space patches for makedumpfile and crash-utility
> to add features for obtaining vabits_actual value from TCR_EL1.T1SZ (see
> [0] and [1]).
> 
> Akashi reported that he was able to use this patchset and the user-space
> changes to get user-space working fine with the 52-bit kernel VA
> changes (see [2]).
> 
> [0]. http://lists.infradead.org/pipermail/kexec/2019-November/023966.html
> [1]. http://lists.infradead.org/pipermail/kexec/2019-November/024006.html
> [2]. http://lists.infradead.org/pipermail/kexec/2019-November/023992.html
---%<---

This probably belongs in the cover letter instead of the commit log.

(From-memory: one of vmcore/kcore is virtually addressed, the other physically.
Does this fix your problem in both cases?)


> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> index ca4c3e12d8c5..f78310ba65ea 100644
> --- a/arch/arm64/kernel/crash_core.c
> +++ b/arch/arm64/kernel/crash_core.c
> @@ -7,6 +7,13 @@
>  #include 
>  #include 

You need to include asm/sysreg.h for read_sysreg(), and asm/pgtable-hwdef.h for 
the macros
you added.


> +static inline u64 get_tcr_el1_t1sz(void);

Why do you need to do this?


> +static inline u64 get_tcr_el1_t1sz(void)
> +{
> + return (read_sysreg(tcr_el1) & TCR_T1SZ_MASK) >> TCR_T1SZ_OFFSET;
> +}

(We don't modify this one, and it's always the same on every CPU, so this is
fine. This function is only called once when the stringy vmcoreinfo elf_note is
created...)


>  void arch_crash_save_vmcoreinfo(void)
>  {
>   VMCOREINFO_NUMBER(VA_BITS);
> @@ -15,5 +22,7 @@ void arch_crash_save_vmcoreinfo(void)
>   kimage_voffset);
>   vmcoreinfo_append_str("NUMBER(PHYS_OFFSET)=0x%llx\n",
>   PHYS_OFFSET);
> + vmcoreinfo_append_str("NUMBER(tcr_el1_t1sz)=0x%llx\n",
> + get_tcr_el1_t1sz());

You document the name as being upper-case.
The two values either side are upper-case.


>   vmcoreinfo_append_str("KERNELOFFSET=%lx\n", kaslr_offset());
>  }


Thanks,

James


Re: [PATCH RFC 6/7] arm64: kexec: no need to ClearPageReserved()

2018-12-05 Thread James Morse
Hi David,

(CC: +Akashi)

On 05/12/2018 12:28, David Hildenbrand wrote:
> This will already be done by free_reserved_page().

(will already be -> will be ?)

So it does!


> diff --git a/arch/arm64/kernel/machine_kexec.c 
> b/arch/arm64/kernel/machine_kexec.c
> index 922add8adb74..0ef4ea73aa54 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -353,7 +353,6 @@ void crash_free_reserved_phys_range(unsigned long begin, 
> unsigned long end)
>  
>   for (addr = begin; addr < end; addr += PAGE_SIZE) {
>   page = phys_to_page(addr);
> - ClearPageReserved(page);
>   free_reserved_page(page);
>   }
>  }
> 

Acked-by: James Morse 


Thanks,

James


Re: [PATCH 1/5] arm64: entry: isb in el1_irq

2018-04-06 Thread James Morse
Hi Mark,

On 06/04/18 18:22, Mark Rutland wrote:
> Digging a bit, I also think that our ct_user_exit and ct_user_enter
> usage is on dodgy ground today.

[...]

> I think similar applies to SDEI; we don't negotiate with RCU prior to
> invoking handlers, which might need RCU.

The arch code's __sdei_handler() calls nmi_enter() -> rcu_nmi_enter(), which
does a rcu_dynticks_eqs_exit().


Thanks,

James


Re: [PATCH 3/5] arm64: early ISB at exit from extended quiescent state

2018-04-06 Thread James Morse
Hi Yury,

On 05/04/18 18:17, Yury Norov wrote:
> This series enables delaying of kernel memory synchronization
> for CPUs running in extended quiescent state (EQS) till the exit
> of that state.
> 
> ARM64 uses an IPI mechanism to notify all cores in an SMP system that
> kernel text is changed; and IPI handler calls isb() to synchronize.
> 
> If we don't deliver IPI to EQS CPUs anymore, we should add ISB early
> in EQS exit path.
> 
> There are 2 such paths. One starts in do_idle() loop, and other
> in el0_svc entry. For do_idle(), isb() is added in
> arch_cpu_idle_exit() hook. And for SVC handler, isb is called in
> el0_svc_naked.

(I know nothing about this EQS stuff, but) there is a third path that might be
relevant.
From include/linux/context_tracking.h:guest_enter_irqoff():
|* KVM does not hold any references to rcu protected data when it
|* switches CPU into a guest mode. In fact switching to a guest mode
|* is very similar to exiting to userspace from rcu point of view. In
|* addition CPU may stay in a guest mode for quite a long time (up to
|* one time slice). Lets treat guest mode as quiescent state, just like
|* we do with user-mode execution.

For non-VHE systems guest_enter_irqoff() is called just before we jump to EL2.
Coming back gives us an exception-return, so we have a context-synchronisation
event there, and I assume we will never patch the hyp-text on these systems.

But with VHE on the upcoming kernel version we still go on to run code at the
same EL. Do we need an ISB on the path back from the guest once we've told RCU
we've 'exited user-space'?
If this code can be patched, do we have a problem here?


> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index c8d9ec363ddd..b1e1c19b4432 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -48,7 +48,7 @@
>   .endm
>  
>   .macro el0_svc_restore_syscall_args
> -#if defined(CONFIG_CONTEXT_TRACKING)
> +#if !defined(CONFIG_TINY_RCU) || defined(CONFIG_CONTEXT_TRACKING)
>   restore_syscall_args
>  #endif
>   .endm
> @@ -483,6 +483,19 @@ __bad_stack:
>   ASM_BUG()
>   .endm
>  
> +/*
> + * If CPU is in extended quiescent state we need isb to ensure that
> + * possible change of kernel text is visible by the core.
> + */
> + .macro  isb_if_eqs
> +#ifndef CONFIG_TINY_RCU
> + bl  rcu_is_watching
> + cbnzx0, 1f
> + isb // pairs with aarch64_insn_patch_text
> +1:
> +#endif
> + .endm
> +
>  el0_sync_invalid:
>   inv_entry 0, BAD_SYNC
>  ENDPROC(el0_sync_invalid)
> @@ -949,6 +962,7 @@ alternative_else_nop_endif
>  
>  el0_svc_naked:   // compat entry point
>   stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and 
> syscall number
> + isb_if_eqs
>   enable_daif
>   ct_user_exit
>   el0_svc_restore_syscall_args

Shouldn't this be at the point that RCU knows we've exited user-space? Otherwise
there is a gap where RCU thinks we're in user-space, we're not, and we're about
to tell it. Code-patching occurring in this gap would be missed.

This gap only contains 'enable_daif', and any exception that occurs here is
safe, but it's going to give someone a nasty surprise...

Mark points out this ISB needs to be after RCU knows we're not quiescent:
https://lkml.org/lkml/2018/4/3/378

Can't this go in the rcu exit-quiescence code? Isn't this what your
rcu_dynticks_eqs_exit_sync() hook does?


Thanks,

James


Re: [PATCH 1/5] arm64: entry: isb in el1_irq

2018-04-06 Thread James Morse
Hi Yury,

On 05/04/18 18:17, Yury Norov wrote:
> Kernel text patching framework relies on IPI to ensure that other
> SMP cores observe the change. Target core calls isb() in IPI handler

(Odd, if it's just to synchronize the CPU, taking the IPI should be enough).


> path, but not at the beginning of the el1_irq entry. There's a chance
> that a modified instruction will appear prior to the isb(), and so will not
> be observed.
> 
> This patch inserts isb early at el1_irq entry to avoid that chance.


> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index ec2ee720e33e..9c06b4b80060 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -593,6 +593,7 @@ ENDPROC(el1_sync)
>  
>   .align  6
>  el1_irq:
> + isb // pairs with aarch64_insn_patch_text
>   kernel_entry 1
>   enable_da_f
>  #ifdef CONFIG_TRACE_IRQFLAGS
> 

An ISB at the beginning of the vectors? This is odd, taking an IRQ to get in
here would be a context-synchronization-event too, so the ISB is superfluous.

The ARM-ARM has a list of 'Context-Synchronization events' (Glossary on page
6480 of DDI0487B.b), paraphrasing:
* ISB
* Taking an exception
* ERET
* (...loads of debug stuff...)


Thanks,

James


Re: [PATCH] ptrace: Add compat PTRACE_{G,S}ETSIGMASK handlers

2017-07-17 Thread James Morse
Hi Michael,

On 17/07/17 11:17, Michael Ellerman wrote:
> James Morse <james.mo...@arm.com> writes:
>> compat_ptrace_request() lacks handlers for PTRACE_{G,S}ETSIGMASK,
>> instead using those in ptrace_request(). The compat variant should
>> read a compat_sigset_t from userspace instead of ptrace_request()s
>> sigset_t.
>>
>> While compat_sigset_t is the same size as sigset_t, it is defined as
>> 2xu32, instead of a single u64. On a big-endian CPU this means that
>> compat_sigset_t is passed to user-space using middle-endianness,
>> where the least-significant u32 is written most significant byte
>> first.
>>
>> If ptrace_request()s code is used userspace will read the most
>> significant u32 where it expected the least significant.
> 
> But that's what the code has done since 2013.

> So won't changing this break userspace that has been written to work
> around that bug?

Wouldn't the same program then be broken when run natively instead? To work
around it userspace would have to know it was running under compat instead of
natively.
This only affects this exotic ptrace API for big-endian compat users. I think
there are so few users that no-one has noticed it's broken.

I'm only aware of CRIU using this[0], and it doesn't look like powerpc has to
support compat-criu users:
https://github.com/xemul/criu/tree/master/compel/arch
only has a ppc64 entry, for which
https://github.com/xemul/criu/blob/master/compel/arch/ppc64/plugins/include/asm/syscall-types.h
puts 'bits per word' as 64, I don't think it supports ppc32, which is where this
bug would be seen.


> Or do we think nothing actually uses it in the wild and
> we can get away with it?

I think only Zhou Chengming has hit this, and there is no 'in the wild' code
that actually inspects the buffer returned by the call.

Zhou, were you using criu on big-endian ilp32 when you found this? Or was it
some other project that uses this API?
(ilp32 is a second user of compat on arm64)


Thanks,

James



[0]
The commit message that added this code points to CRIU and GDB. GDB doesn't use
this API. Debian's codesearch (which obviously isn't exhaustive) only finds CRIU
and systemtap making these calls.

It looks like criu just save/restores the sigset_t as a blob:
https://sources.debian.net/src/criu/2.12.1-2/criu/parasite-syscall.c/?hl=92#L92

Its sigset_t helpers aren't aware of this bug:
https://github.com/xemul/criu/blob/master/compel/include/uapi/ksigset.h

Systemtap just makes some calls as part of a self test:
https://sources.debian.net/src/systemtap/3.1-2/testsuite/systemtap.syscall/ptrace.c/?hl=198#L198



[PATCH] ptrace: Add compat PTRACE_{G,S}ETSIGMASK handlers

2017-06-29 Thread James Morse
compat_ptrace_request() lacks handlers for PTRACE_{G,S}ETSIGMASK,
instead using those in ptrace_request(). The compat variant should
read a compat_sigset_t from userspace instead of ptrace_request()s
sigset_t.

While compat_sigset_t is the same size as sigset_t, it is defined as
2xu32, instead of a single u64. On a big-endian CPU this means that
compat_sigset_t is passed to user-space using middle-endianness,
where the least-significant u32 is written most significant byte
first.

If ptrace_request()s code is used userspace will read the most
significant u32 where it expected the least significant.

Instead of duplicating ptrace_request()s code as a special case in
the arch code, handle it here.

CC: Yury Norov <yno...@caviumnetworks.com>
CC: Andrey Vagin <ava...@openvz.org>
Reported-by: Zhou Chengming <zhouchengmi...@huawei.com>
Signed-off-by: James Morse <james.mo...@arm.com>
Fixes: 29000caecbe87 ("ptrace: add ability to get/set signal-blocked mask")
---
LTP test case here:
https://lists.linux.it/pipermail/ltp/2017-June/004932.html

 kernel/ptrace.c | 52 
 1 file changed, 40 insertions(+), 12 deletions(-)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 8d2c10714530..a5bebb6713e8 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -843,6 +843,22 @@ static int ptrace_regset(struct task_struct *task, int 
req, unsigned int type,
 EXPORT_SYMBOL_GPL(task_user_regset_view);
 #endif
 
+static int ptrace_setsigmask(struct task_struct *child, sigset_t *new_set)
+{
+   sigdelsetmask(new_set, sigmask(SIGKILL)|sigmask(SIGSTOP));
+
+   /*
+* Every thread does recalc_sigpending() after resume, so
+* retarget_shared_pending() and recalc_sigpending() are not
+* called here.
+*/
+   spin_lock_irq(&child->sighand->siglock);
+   child->blocked = *new_set;
+   spin_unlock_irq(&child->sighand->siglock);
+
+   return 0;
+}
+
 int ptrace_request(struct task_struct *child, long request,
   unsigned long addr, unsigned long data)
 {
@@ -914,18 +930,7 @@ int ptrace_request(struct task_struct *child, long request,
break;
}
 
-   sigdelsetmask(&new_set, sigmask(SIGKILL)|sigmask(SIGSTOP));
-
-   /*
-* Every thread does recalc_sigpending() after resume, so
-* retarget_shared_pending() and recalc_sigpending() are not
-* called here.
-*/
-   spin_lock_irq(&child->sighand->siglock);
-   child->blocked = new_set;
-   spin_unlock_irq(&child->sighand->siglock);
-
-   ret = 0;
-   ret = ptrace_setsigmask(child, &new_set);
break;
}
 
@@ -1149,7 +1154,9 @@ int compat_ptrace_request(struct task_struct *child, 
compat_long_t request,
  compat_ulong_t addr, compat_ulong_t data)
 {
compat_ulong_t __user *datap = compat_ptr(data);
+   compat_sigset_t set32;
compat_ulong_t word;
+   sigset_t new_set;
siginfo_t siginfo;
int ret;
 
@@ -1189,6 +1196,27 @@ int compat_ptrace_request(struct task_struct *child, 
compat_long_t request,
else
ret = ptrace_setsiginfo(child, &siginfo);
break;
+   case PTRACE_GETSIGMASK:
+   if (addr != sizeof(compat_sigset_t))
+   return -EINVAL;
+
+   sigset_to_compat(&set32, &child->blocked);
+
+   if (copy_to_user(datap, &set32, sizeof(set32)))
+   return -EFAULT;
+
+   ret = 0;
+   break;
+   case PTRACE_SETSIGMASK:
+   if (addr != sizeof(compat_sigset_t))
+   return -EINVAL;
+
+   if (copy_from_user(&set32, datap, sizeof(compat_sigset_t)))
+   return -EFAULT;
+
+   sigset_from_compat(&new_set, &set32);
+   ret = ptrace_setsigmask(child, &new_set);
+   break;
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
case PTRACE_GETREGSET:
case PTRACE_SETREGSET:
-- 
2.11.0