Re: [PATCH 1/7] powerpc/build: put sys_call_table in .data.rel.ro if RELOCATABLE

2022-09-14 Thread Christophe Leroy


Le 14/09/2022 à 17:47, Nicholas Piggin a écrit :
> Const function pointers live in .data.rel.ro rather than .rodata because
> they must be relocated. This change prevents powerpc/32 from generating
> R_PPC_UADDR32 relocations (which are not handled). The sys_call_table is
> moved to writeable memory, but a later change will move it back.

Aren't you missing commit c7acee3d2f12 ("powerpc: align syscall table 
for ppc32") ?

I can't see any R_PPC_UADDR32 relocations generated by ppc4xx_defconfig 
+ CONFIG_RELOCATABLE unless I revert that commit.



> 
> After this patch, 44x_defconfig + CONFIG_RELOCATABLE boots to busybox.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>   arch/powerpc/kernel/systbl.S | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/systbl.S b/arch/powerpc/kernel/systbl.S
> index cb3358886203..0bec33e86f50 100644
> --- a/arch/powerpc/kernel/systbl.S
> +++ b/arch/powerpc/kernel/systbl.S
> @@ -12,7 +12,11 @@
>   
>   #include 
>   
> +#ifdef CONFIG_RELOCATABLE
> +.section .data.rel.ro,"aw"
> +#else
>   .section .rodata,"a"
> +#endif
>   
>   #ifdef CONFIG_PPC64
>   .p2align 3

Re: [PATCH 2/7] powerpc: move __end_rodata to cover arch read-only sections

2022-09-14 Thread Christophe Leroy


Le 14/09/2022 à 17:47, Nicholas Piggin a écrit :
> powerpc has a number of read-only sections and tables that are put
> after RO_DATA(). Move the __end_rodata symbol to cover these as well.
> 
> Setting memory to read-only at boot is done using __init_begin,
> change that to use __end_rodata.

I see the idea after looking in more detail at the generated code, but 
I think this commit description needs to be more explanatory.

In mm/pgtable_32.c there was a comment explaining why __init_begin is 
used. I think you need to explain why we don't want to use it anymore 
and why we can now use __end_rodata.

> 
> This also affects boot dmesg, is_kernel_rodata(), and some other checks.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>   arch/powerpc/kernel/vmlinux.lds.S| 3 +++
>   arch/powerpc/mm/book3s32/mmu.c   | 2 +-
>   arch/powerpc/mm/book3s64/hash_pgtable.c  | 2 +-
>   arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
>   arch/powerpc/mm/pgtable_32.c | 5 ++---
>   5 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
> b/arch/powerpc/kernel/vmlinux.lds.S
> index fe22d940412f..90ac5ff73df2 100644
> --- a/arch/powerpc/kernel/vmlinux.lds.S
> +++ b/arch/powerpc/kernel/vmlinux.lds.S
> @@ -210,6 +210,9 @@ SECTIONS
>   }
>   #endif
>   
> + . = ALIGN(PAGE_SIZE);
> + __end_rodata = .;
> +

I think this will likely break block mapping on PPC32.

It needs to be aligned to STRICT_ALIGN_SIZE, like __init_begin is.


>   /*
>* Init sections discarded at runtime
>*/
> diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
> index a96b73006dfb..e13b883e4e5b 100644
> --- a/arch/powerpc/mm/book3s32/mmu.c
> +++ b/arch/powerpc/mm/book3s32/mmu.c
> @@ -240,7 +240,7 @@ void mmu_mark_rodata_ro(void)
>   for (i = 0; i < nb; i++) {
>   struct ppc_bat *bat = BATS[i];
>   
> - if (bat_addrs[i].start < (unsigned long)__init_begin)
> + if (bat_addrs[i].start < (unsigned long)__end_rodata)
>   bat[1].batl = (bat[1].batl & ~BPP_RW) | BPP_RX;
>   }
>   
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c 
> b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index ae008b9df0e6..28332001bd87 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -541,7 +541,7 @@ void hash__mark_rodata_ro(void)
>   unsigned long start, end, pp;
>   
>   start = (unsigned long)_stext;
> - end = (unsigned long)__init_begin;
> + end = (unsigned long)__end_rodata;
>   
>   pp = htab_convert_pte_flags(pgprot_val(PAGE_KERNEL_ROX), 
> HPTE_USE_KERNEL_KEY);
>   
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
> b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 698274109c91..2305f34bcc33 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -228,7 +228,7 @@ void radix__mark_rodata_ro(void)
>   unsigned long start, end;
>   
>   start = (unsigned long)_stext;
> - end = (unsigned long)__init_begin;
> + end = (unsigned long)__end_rodata;
>   
>   radix__change_memory_range(start, end, _PAGE_WRITE);
>   }
> diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
> index 3ac73f9fb5d5..112af8c5447a 100644
> --- a/arch/powerpc/mm/pgtable_32.c
> +++ b/arch/powerpc/mm/pgtable_32.c
> @@ -158,10 +158,9 @@ void mark_rodata_ro(void)
>   }
>   
>   /*
> -  * mark .text and .rodata as read only. Use __init_begin rather than
> -  * __end_rodata to cover NOTES and EXCEPTION_TABLE.
> +  * mark .text and .rodata as read only.
>*/
> - numpages = PFN_UP((unsigned long)__init_begin) -
> + numpages = PFN_UP((unsigned long)__end_rodata) -
>  PFN_DOWN((unsigned long)_stext);
>   
>   set_memory_ro((unsigned long)_stext, numpages);

Re: [PATCH v4 11/20] powerpc: Add ZEROIZE_GPRS macros for register clears

2022-09-14 Thread Rohan McLure



> On 12 Sep 2022, at 9:09 pm, Nicholas Piggin  wrote:
> 
> On Wed Aug 24, 2022 at 12:05 PM AEST, Rohan McLure wrote:
>> Macros for restoring and saving registers to and from the stack exist.
>> Provide macros with the same interface for clearing a range of gprs by
>> setting each register's value in that range to zero.
> 
> Can I bikeshed this a bit more and ask if you would change the order?
> 
> Saving and restoring macros are incidental, and are neither the change
> nor the reason for it. I think the changelog reads better if you state
> the change (or the problem) up front.
> 
> Provide register zeroing macros ... follow the same convention as
> existing register stack save/restore macros ... will be used in later
> change to zero registers.

Thanks for the suggestion, that should read better.

> 
> Or not, if you disagree.
> 
> Reviewed-by: Nicholas Piggin 
> 
>> The resulting macros are called ZEROIZE_GPRS and ZEROIZE_NVGPRS, keeping
>> with the naming of the accompanying restore and save macros, and usage
>> of zeroize to describe this operation elsewhere in the kernel.
>> 
>> Signed-off-by: Rohan McLure 
>> ---
>> V1 -> V2: Change 'ZERO' usage in naming to 'NULLIFY', a more obvious verb
>> V2 -> V3: Change 'NULLIFY' usage in naming to 'ZEROIZE', which has
>> precedent in kernel and explicitly specifies that we are zeroing.
>> V3 -> V4: Update commit message to use zeroize.
>> ---
>> arch/powerpc/include/asm/ppc_asm.h | 22 ++++++++++++++++++++++
>> 1 file changed, 22 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/ppc_asm.h 
>> b/arch/powerpc/include/asm/ppc_asm.h
>> index 83c02f5a7f2a..b95689ada59c 100644
>> --- a/arch/powerpc/include/asm/ppc_asm.h
>> +++ b/arch/powerpc/include/asm/ppc_asm.h
>> @@ -33,6 +33,20 @@
>>  .endr
>> .endm
>> 
>> +/*
>> + * This expands to a sequence of register clears for regs start to end
>> + * inclusive, of the form:
>> + *
>> + *   li rN, 0
>> + */
>> +.macro ZEROIZE_REGS start, end
>> +.Lreg=\start
>> +.rept (\end - \start + 1)
>> +li  .Lreg, 0
>> +.Lreg=.Lreg+1
>> +.endr
>> +.endm
>> +
>> /*
>>  * Macros for storing registers into and loading registers from
>>  * exception frames.
>> @@ -49,6 +63,14 @@
>> #define REST_NVGPRS(base)	REST_GPRS(13, 31, base)
>> #endif
>> 
>> +#define ZEROIZE_GPRS(start, end)	ZEROIZE_REGS start, end
>> +#ifdef __powerpc64__
>> +#define ZEROIZE_NVGPRS()	ZEROIZE_GPRS(14, 31)
>> +#else
>> +#define ZEROIZE_NVGPRS()	ZEROIZE_GPRS(13, 31)
>> +#endif
>> +#define ZEROIZE_GPR(n)	ZEROIZE_GPRS(n, n)
>> +
>> #define SAVE_GPR(n, base)	SAVE_GPRS(n, n, base)
>> #define REST_GPR(n, base)	REST_GPRS(n, n, base)
>> 
>> -- 
>> 2.34.1
> 



Re: [PATCH v4 10/20] powerpc: Use common syscall handler type

2022-09-14 Thread Rohan McLure



> On 12 Sep 2022, at 8:56 pm, Nicholas Piggin  wrote:
> 
> On Wed Aug 24, 2022 at 12:05 PM AEST, Rohan McLure wrote:
>> Cause syscall handlers to be typed as follows when called indirectly
>> throughout the kernel.
>> 
>> typedef long (*syscall_fn)(unsigned long, unsigned long, unsigned long,
>>   unsigned long, unsigned long, unsigned long);
> 
> The point is... better type checking?
> 
>> 
>> Since both 32 and 64-bit abis allow for at least the first six
>> machine-word length parameters to a function to be passed by registers,
>> even handlers which admit fewer than six parameters may be viewed as
>> having the above type.
>> 
>> Fixup comparisons in VDSO to avoid pointer-integer comparison. Introduce
>> explicit cast on systems with SPUs.
>> 
>> Signed-off-by: Rohan McLure 
>> ---
>> V1 -> V2: New patch.
>> V2 -> V3: Remove unnecessary cast from const syscall_fn to syscall_fn
>> ---
>> arch/powerpc/include/asm/syscall.h  | 7 +--
>> arch/powerpc/include/asm/syscalls.h | 1 +
>> arch/powerpc/kernel/systbl.c| 6 +++---
>> arch/powerpc/kernel/vdso.c  | 4 ++--
>> arch/powerpc/platforms/cell/spu_callbacks.c | 6 +++---
>> 5 files changed, 14 insertions(+), 10 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/syscall.h 
>> b/arch/powerpc/include/asm/syscall.h
>> index 25fc8ad9a27a..d2a8dfd5de33 100644
>> --- a/arch/powerpc/include/asm/syscall.h
>> +++ b/arch/powerpc/include/asm/syscall.h
>> @@ -14,9 +14,12 @@
>> #include 
>> #include 
>> 
>> +typedef long (*syscall_fn)(unsigned long, unsigned long, unsigned long,
>> +   unsigned long, unsigned long, unsigned long);
>> +
>> /* ftrace syscalls requires exporting the sys_call_table */
>> -extern const unsigned long sys_call_table[];
>> -extern const unsigned long compat_sys_call_table[];
>> +extern const syscall_fn sys_call_table[];
>> +extern const syscall_fn compat_sys_call_table[];
> 
> Ah you constify it in this patch. I think the previous patch should have
> kept the const, and it should keep the unsigned long type rather than
> use void *. Either that or do this patch first.
> 
>> static inline int syscall_get_nr(struct task_struct *task, struct pt_regs 
>> *regs)
>> {
>> diff --git a/arch/powerpc/include/asm/syscalls.h 
>> b/arch/powerpc/include/asm/syscalls.h
>> index 91417dee534e..e979b7593d2b 100644
>> --- a/arch/powerpc/include/asm/syscalls.h
>> +++ b/arch/powerpc/include/asm/syscalls.h
>> @@ -8,6 +8,7 @@
>> #include 
>> #include 
>> 
>> +#include 
>> #ifdef CONFIG_PPC64
>> #include 
>> #endif
> 
> Is this necessary or should be in another patch?

Good spot. This belongs in the patch that produces systbl.c.

> 
>> diff --git a/arch/powerpc/kernel/systbl.c b/arch/powerpc/kernel/systbl.c
>> index 99ffdfef6b9c..b88a9c2a1f50 100644
>> --- a/arch/powerpc/kernel/systbl.c
>> +++ b/arch/powerpc/kernel/systbl.c
>> @@ -21,10 +21,10 @@
>> #define __SYSCALL(nr, entry) [nr] = __powerpc_##entry,
>> #define __powerpc_sys_ni_syscall sys_ni_syscall
>> #else
>> -#define __SYSCALL(nr, entry) [nr] = entry,
>> +#define __SYSCALL(nr, entry) [nr] = (void *) entry,
>> #endif
> 
> Also perhaps this should have been in the prior pach and this pach
> should change the cast from void to syscall_fn ?

This cast to (void *) kicks in when casting functions with six or fewer
parameters to six-parameter type accepting and returning u64. Sadly I can’t
find a way to avoid -Wcast-function-type even with (__force syscall_fn) short
of an ugly cast to void* here. Any suggestions?

> 
>> 
>> -void *sys_call_table[] = {
>> +const syscall_fn sys_call_table[] = {
>> #ifdef CONFIG_PPC64
>> #include 
>> #else
>> @@ -35,7 +35,7 @@ void *sys_call_table[] = {
>> #ifdef CONFIG_COMPAT
>> #undef __SYSCALL_WITH_COMPAT
>> #define __SYSCALL_WITH_COMPAT(nr, native, compat)	__SYSCALL(nr, compat)
>> -void *compat_sys_call_table[] = {
>> +const syscall_fn compat_sys_call_table[] = {
>> #include 
>> };
>> #endif /* CONFIG_COMPAT */
>> diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
>> index 0da287544054..080c9e7de0c0 100644
>> --- a/arch/powerpc/kernel/vdso.c
>> +++ b/arch/powerpc/kernel/vdso.c
>> @@ -313,10 +313,10 @@ static void __init vdso_setup_syscall_map(void)
>>  unsigned int i;
>> 
>>  for (i = 0; i < NR_syscalls; i++) {
>> -		if (sys_call_table[i] != (unsigned long)&sys_ni_syscall)
>> +		if (sys_call_table[i] != (void *)&sys_ni_syscall)
>> 			vdso_data->syscall_map[i >> 5] |= 0x80000000UL >> (i & 0x1f);
>> 		if (IS_ENABLED(CONFIG_COMPAT) &&
>> -		    compat_sys_call_table[i] != (unsigned long)&sys_ni_syscall)
>> +		    compat_sys_call_table[i] != (void *)&sys_ni_syscall)
> 
> Also these.
> 
> Thanks,
> Nick
> 
>> 			vdso_data->compat_syscall_map[i >> 5] |= 0x80000000UL >> (i & 0x1f);
>>  }
>> }
>> diff --git a/arch/powerpc/platforms/cell/spu_callbacks.c 
>> 

Re: [PATCH v4 06/20] powerpc: Remove direct call to mmap2 syscall handlers

2022-09-14 Thread Rohan McLure



> On 12 Sep 2022, at 7:47 pm, Nicholas Piggin  wrote:
> 
> On Wed Aug 24, 2022 at 12:05 PM AEST, Rohan McLure wrote:
>> Syscall handlers should not be invoked internally by their symbol names,
>> as these symbols are defined by the architecture-defined SYSCALL_DEFINE
>> macro. Move the compatibility syscall definition for mmap2 to
>> syscalls.c, so that all mmap implementations can share an inline helper
>> function, as is done with the personality handlers.
>> 
>> Signed-off-by: Rohan McLure 
> 
> Is there any point to keping sys_ppc32.c at all? Might as well move them
> all to syscall.c IMO.

Currently it serves as a fairly arbitrary distinction between compat calls
and others, noting that a compat variant of personality is in syscalls.c.
May as well get rid of it.

> Reviewed-by: Nicholas Piggin 
> 
>> ---
>> V1 -> V2: Move mmap2 compat implementation to asm/kernel/syscalls.c.
>> V3 -> V4: Move to be applied before syscall wrapper introduced.
>> ---
>> arch/powerpc/kernel/sys_ppc32.c |  9 -
>> arch/powerpc/kernel/syscalls.c  | 11 +++
>> 2 files changed, 11 insertions(+), 9 deletions(-)
>> 
>> diff --git a/arch/powerpc/kernel/sys_ppc32.c 
>> b/arch/powerpc/kernel/sys_ppc32.c
>> index f4edcc9489fb..bc6491ed6454 100644
>> --- a/arch/powerpc/kernel/sys_ppc32.c
>> +++ b/arch/powerpc/kernel/sys_ppc32.c
>> @@ -25,7 +25,6 @@
>> #include 
>> #include 
>> #include 
>> -#include 
>> #include 
>> #include 
>> #include 
>> @@ -48,14 +47,6 @@
>> #include 
>> #include 
>> 
>> -unsigned long compat_sys_mmap2(unsigned long addr, size_t len,
>> -  unsigned long prot, unsigned long flags,
>> -  unsigned long fd, unsigned long pgoff)
>> -{
>> -/* This should remain 12 even if PAGE_SIZE changes */
>> -return sys_mmap(addr, len, prot, flags, fd, pgoff << 12);
>> -}
>> -
>> /* 
>>  * long long munging:
>>  * The 32 bit ABI passes long longs in an odd even register pair.
>> diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
>> index b8461128c8f7..32fadf3c2cd3 100644
>> --- a/arch/powerpc/kernel/syscalls.c
>> +++ b/arch/powerpc/kernel/syscalls.c
>> @@ -56,6 +56,17 @@ SYSCALL_DEFINE6(mmap2, unsigned long, addr, size_t, len,
>>  return do_mmap2(addr, len, prot, flags, fd, pgoff, PAGE_SHIFT-12);
>> }
>> 
>> +#ifdef CONFIG_COMPAT
>> +COMPAT_SYSCALL_DEFINE6(mmap2,
>> +   unsigned long, addr, size_t, len,
>> +   unsigned long, prot, unsigned long, flags,
>> +   unsigned long, fd, unsigned long, pgoff)
>> +{
>> +/* This should remain 12 even if PAGE_SIZE changes */
>> +return do_mmap2(addr, len, prot, flags, fd, pgoff << 12, PAGE_SHIFT-12);
>> +}
>> +#endif
>> +
>> SYSCALL_DEFINE6(mmap, unsigned long, addr, size_t, len,
>>  unsigned long, prot, unsigned long, flags,
>>  unsigned long, fd, off_t, offset)
>> -- 
>> 2.34.1
> 



Re: [PATCH v4 03/20] powerpc/32: Remove powerpc select specialisation

2022-09-14 Thread Rohan McLure



> On 12 Sep 2022, at 7:03 pm, Nicholas Piggin  wrote:
> 
> On Wed Aug 24, 2022 at 12:05 PM AEST, Rohan McLure wrote:
>> Syscall #82 has been implemented for 32-bit platforms in a unique way on
>> powerpc systems. This hack will in effect guess whether the caller is
>> expecting new select semantics or old select semantics, based on the
>> first parameter. In new select, this parameter
>> represents the length of a user-memory array of file descriptors, and in
>> old select this is a pointer to an arguments structure.
>> 
>> The heuristic simply interprets sufficiently large values of its first
>> parameter as being a call to old select. The following is a discussion
>> on how this syscall should be handled.
>> 
>> Link: 
>> https://lore.kernel.org/lkml/13737de5-0eb7-e881-9af0-163b0d29a...@csgroup.eu/
> 
> Seems okay to me, probably Christophe needs to ack it.
> Should some of that history be included directly in this changelog?
> 
> Should ppc64 compat be added back too, if this is being updated instead
> of removed? I don't know much about compat but it seems odd not provide
> it (considering it's just using compat_sys_old_select, isn't it?)

That would make sense to me. I’ll put that in syscall.tbl.

> Reviewed-by: Nicholas Piggin 
> 
>> 
>> As discussed in this thread, the existence of such a hack suggests that for
>> whatever powerpc binaries may predate glibc, it is most likely that they
>> would have made use of the old select semantics. x86 and arm64 both
>> implement this syscall with oldselect semantics.
>> 
>> Remove the powerpc implementation, and update syscall.tbl to emit a
>> reference to sys_old_select for 32-bit binaries, in keeping with how
>> other architectures support syscall #82.
>> 
>> Signed-off-by: Rohan McLure 
>> ---
>> V1 -> V2: Remove arch-specific select handler
>> V2 -> V3: Remove ppc_old_select prototype in . Move to
>> earlier in patch series
>> ---
>> arch/powerpc/include/asm/syscalls.h   |  2 --
>> arch/powerpc/kernel/syscalls.c| 17 -
>> arch/powerpc/kernel/syscalls/syscall.tbl  |  2 +-
>> .../arch/powerpc/entry/syscalls/syscall.tbl   |  2 +-
>> 4 files changed, 2 insertions(+), 21 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/syscalls.h 
>> b/arch/powerpc/include/asm/syscalls.h
>> index 675a8f5ec3ca..739498c358a1 100644
>> --- a/arch/powerpc/include/asm/syscalls.h
>> +++ b/arch/powerpc/include/asm/syscalls.h
>> @@ -18,8 +18,6 @@ long sys_mmap2(unsigned long addr, size_t len,
>> unsigned long fd, unsigned long pgoff);
>> long ppc64_personality(unsigned long personality);
>> long sys_rtas(struct rtas_args __user *uargs);
>> -int ppc_select(int n, fd_set __user *inp, fd_set __user *outp,
>> -   fd_set __user *exp, struct __kernel_old_timeval __user *tvp);
>> long ppc_fadvise64_64(int fd, int advice, u32 offset_high, u32 offset_low,
>>u32 len_high, u32 len_low);
>> 
>> diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
>> index fc999140bc27..ef5896bee818 100644
>> --- a/arch/powerpc/kernel/syscalls.c
>> +++ b/arch/powerpc/kernel/syscalls.c
>> @@ -63,23 +63,6 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, size_t, len,
>>  return do_mmap2(addr, len, prot, flags, fd, offset, PAGE_SHIFT);
>> }
>> 
>> -#ifdef CONFIG_PPC32
>> -/*
>> - * Due to some executables calling the wrong select we sometimes
>> - * get wrong args.  This determines how the args are being passed
>> - * (a single ptr to them all args passed) then calls
>> - * sys_select() with the appropriate args. -- Cort
>> - */
>> -int
>> -ppc_select(int n, fd_set __user *inp, fd_set __user *outp, fd_set __user *exp, struct __kernel_old_timeval __user *tvp)
>> -{
>> -if ((unsigned long)n >= 4096)
>> -return sys_old_select((void __user *)n);
>> -
>> -return sys_select(n, inp, outp, exp, tvp);
>> -}
>> -#endif
>> -
>> #ifdef CONFIG_PPC64
>> long ppc64_personality(unsigned long personality)
>> {
>> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl 
>> b/arch/powerpc/kernel/syscalls/syscall.tbl
>> index 2600b4237292..4cbbb810ae10 100644
>> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
>> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
>> @@ -110,7 +110,7 @@
>> 79	common	settimeofday		sys_settimeofday		compat_sys_settimeofday
>> 80	common	getgroups		sys_getgroups
>> 81	common	setgroups		sys_setgroups
>> -82	32	select			ppc_select			sys_ni_syscall
>> +82	32	select			sys_old_select			sys_ni_syscall
>> 82	64	select			sys_ni_syscall
>> 82	spu	select			sys_ni_syscall
>> 83   common  symlink sys_symlink
>> diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl 
>> 

Re: Fragmented physical memory on powerpc/32

2022-09-14 Thread Pali Rohár
On Wednesday 14 September 2022 16:55:04 Mike Rapoport wrote:
> On September 14, 2022 10:43:52 AM GMT+01:00, Christophe Leroy 
>  wrote:
> >
> >
> >Le 14/09/2022 à 11:32, Mike Rapoport a écrit :
> >> On Tue, Sep 13, 2022 at 02:36:13PM +0200, Christophe Leroy wrote:
> >>>
> >>>
> >>> Le 13/09/2022 à 08:11, Christophe Leroy a écrit :
> 
> 
>  Le 12/09/2022 à 23:16, Pali Rohár a écrit :
> >>
> >> My guess would be that something went wrong in the linear map
> >> setup, but it
> >> won't hurt running with "memblock=debug" added to the kernel
> >> command line
> >> to see if there is anything suspicious there.
> >
> > Here is boot log on serial console with memblock=debug command line:
> >
>  ...
> >
> > Do you need something more for debug?
> 
>  Can you send me the 'vmlinux' used to generate the above Oops so that I
>  can see exactly where we are in function mem_init().
> 
>  And could you also try without CONFIG_HIGHMEM just in case.

Hello! I disabled CONFIG_HIGHMEM and there is no crash anymore. But the
kernel sees only 761160 kB of RAM, which is less than 3GB. I guess this
is expected.

> 
> >>>
> >>> I looked at the vmlinux you sent me, the problem is in the loop for 
> >>> highmem
> >>> in mem_init(). It crashes in the call to free_highmem_page()
> >>>
> >>> #ifdef CONFIG_HIGHMEM
> >>>   {
> >>>   unsigned long pfn, highmem_mapnr;
> >>>
> >>>   highmem_mapnr = lowmem_end_addr >> PAGE_SHIFT;
> >>>   for (pfn = highmem_mapnr; pfn < max_mapnr; ++pfn) {
> >>>   phys_addr_t paddr = (phys_addr_t)pfn << PAGE_SHIFT;
> >>>   struct page *page = pfn_to_page(pfn);
> >>>   if (!memblock_is_reserved(paddr))
> >>>   free_highmem_page(page);
> >>>   }
> >>>   }
> >>> #endif /* CONFIG_HIGHMEM */
> >>>
> >>>
> >>> As far as I can see in the memblock debug lines, the holes don't seem to 
> >>> be
> >>> marked as reserved by memblock. So is the above valid? Other architectures
> >>> seem to do differently.
> >>>
> >>> Can you try by replacing !memblock_is_reserved(paddr) by
> >>> memblock_is_memory(paddr) ?

I enabled CONFIG_HIGHMEM again, applied this change and kernel does not
crash too anymore. And it sees 3062896 kB of memory (in 'free' output).

So seems that this change fixed it.

I tried a simple C program that allocates memory via malloc() and it
successfully allocated 2800 MB. When it tried to request/allocate more,
the kernel started the OOM killer, which killed the program gracefully.
No kernel or HW crash.

Here is bootlog:

[0.00] memblock_alloc_try_nid: 8 bytes align=0x4 nid=-1 from=0x 
max_addr=0x smp_setup_cpu_maps+0x40/0x2b4
[0.00] memblock_reserve: [0x2fff5d74-0x2fff5d7b] 
memblock_alloc_range_nid+0xe8/0x1b0
[0.00] CPU maps initialized for 1 thread per core
[0.00]  (thread shift is 0)
[0.00] memblock_phys_free: [0x2fff5d74-0x2fff5d7b] 
setup_arch+0x1bc/0x318
[0.00] -
[0.00] phys_mem_size = 0xbe50
[0.00] dcache_bsize  = 0x20
[0.00] icache_bsize  = 0x20
[0.00] cpu_features  = 0x10010108
[0.00]   possible= 0x10010108
[0.00]   always  = 0x10010108
[0.00] cpu_user_features = 0x84e08000 0x0800
[0.00] mmu_features  = 0x00020010
[0.00] -
[0.00] memblock_alloc_try_nid: 8192 bytes align=0x2000 nid=-1 
from=0x max_addr=0x alloc_stack+0x2c/0x60
[0.00] memblock_reserve: [0x2fff2000-0x2fff3fff] 
memblock_alloc_range_nid+0xe8/0x1b0
[0.00] memblock_alloc_try_nid: 8192 bytes align=0x2000 nid=-1 
from=0x max_addr=0x alloc_stack+0x2c/0x60
[0.00] memblock_reserve: [0x2fff-0x2fff1fff] 
memblock_alloc_range_nid+0xe8/0x1b0
[0.00] memblock_alloc_try_nid: 8192 bytes align=0x2000 nid=-1 
from=0x max_addr=0x alloc_stack+0x2c/0x60
[0.00] memblock_reserve: [0x2ffee000-0x2ffe] 
memblock_alloc_range_nid+0xe8/0x1b0
[0.00] memblock_alloc_try_nid: 8192 bytes align=0x2000 nid=-1 
from=0x max_addr=0x alloc_stack+0x2c/0x60
[0.00] memblock_reserve: [0x2ffec000-0x2ffedfff] 
memblock_alloc_range_nid+0xe8/0x1b0
[0.00] memblock_alloc_try_nid: 8192 bytes align=0x2000 nid=-1 
from=0x max_addr=0x alloc_stack+0x2c/0x60
[0.00] memblock_reserve: [0x2ffea000-0x2ffebfff] 
memblock_alloc_range_nid+0xe8/0x1b0
[0.00] memblock_alloc_try_nid: 8192 bytes align=0x2000 nid=-1 
from=0x max_addr=0x alloc_stack+0x2c/0x60
[0.00] memblock_reserve: [0x2ffe8000-0x2ffe9fff] 
memblock_alloc_range_nid+0xe8/0x1b0
[0.00] memblock_alloc_try_nid: 8192 bytes align=0x2000 nid=-1 
from=0x 

Re: [RFC] Objtool toolchain proposal: -fannotate-{jump-table,noreturn}

2022-09-14 Thread Segher Boessenkool
On Wed, Sep 14, 2022 at 04:55:27PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 14, 2022 at 02:28:26PM +, Michael Matz wrote:
> > Don't mix DWARF debug info with DWARF-based unwinding info, the latter 
> > doesn't imply the former.  Out of interest: how does ORC get around the 
> > need for CFI annotations (or equivalents to restore registers) and what 
> 
> Objtool 'interprets' the stackops. So it follows the call-graph and is
> an interpreter for all instructions that modify the stack. Doing that it
> konws what the stackframe is at 'most' places.

To get correct backtraces on e.g. PowerPC you need to emulate many of
the integer insns.  That is why GCC enables -fasynchronous-unwind-tables
by default for us.

The same is true for s390, aarch64, and x86 (unless 32-bit w/ frame
pointer).

The problem is that you do not know how to access anything on the stack,
whether in the current frame or in a previous frame, from a random point
in the program.  GDB has many heuristics for this, and it still does not
get it right in all cases.

> > makes it fast?  I want faster unwinding for DWARF as well, when there's 
> > feature parity :-)  Maybe something can be learned for integration into 
> > dwarf-unwind.
> 
> I think we have some details here:
> 
>  https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html

It is faster because it does a whole lot less.  Is that still enough?
It's not clear (to me) what exact information it wants to provide :-(


Segher


Re: [PATCH 8/9] powerpc/ps3: remove orphan declarations from ps3av.h

2022-09-14 Thread Geoff Levand
Hi Gaosheng,

On 9/13/22 00:50, Gaosheng Cui wrote:
> Remove the following orphan declarations from ps3av.h:
> 1. ps3av_dev_open()
> 2. ps3av_dev_close()
> 
> They have been removed since commit 13a5e30cf740 ("[POWERPC] PS3:
> Rework AV settings driver"), so remove them.

I did a test build with this patch applied to v6.0-rc5 and had no errors.

Acked-by: Geoff Levand 



Re: Fragmented physical memory on powerpc/32

2022-09-14 Thread Mike Rapoport



On September 14, 2022 10:43:52 AM GMT+01:00, Christophe Leroy 
 wrote:
>
>
>Le 14/09/2022 à 11:32, Mike Rapoport a écrit :
>> On Tue, Sep 13, 2022 at 02:36:13PM +0200, Christophe Leroy wrote:
>>>
>>>
>>> Le 13/09/2022 à 08:11, Christophe Leroy a écrit :


 Le 12/09/2022 à 23:16, Pali Rohár a écrit :
>>
>> My guess would be that something went wrong in the linear map
>> setup, but it
>> won't hurt running with "memblock=debug" added to the kernel
>> command line
>> to see if there is anything suspicious there.
>
> Here is boot log on serial console with memblock=debug command line:
>
 ...
>
> Do you need something more for debug?

 Can you send me the 'vmlinux' used to generate the above Oops so that I
 can see exactly where we are in function mem_init().

 And could you also try without CONFIG_HIGHMEM just in case.

>>>
>>> I looked at the vmlinux you sent me, the problem is in the loop for highmem
>>> in mem_init(). It crashes in the call to free_highmem_page()
>>>
>>> #ifdef CONFIG_HIGHMEM
>>> {
>>> unsigned long pfn, highmem_mapnr;
>>>
>>> highmem_mapnr = lowmem_end_addr >> PAGE_SHIFT;
>>> for (pfn = highmem_mapnr; pfn < max_mapnr; ++pfn) {
>>> phys_addr_t paddr = (phys_addr_t)pfn << PAGE_SHIFT;
>>> struct page *page = pfn_to_page(pfn);
>>> if (!memblock_is_reserved(paddr))
>>> free_highmem_page(page);
>>> }
>>> }
>>> #endif /* CONFIG_HIGHMEM */
>>>
>>>
>>> As far as I can see in the memblock debug lines, the holes don't seem to be
>>> marked as reserved by memblock. So is the above valid? Other architectures
>>> seem to do differently.
>>>
>>> Can you try by replacing !memblock_is_reserved(paddr) by
>>> memblock_is_memory(paddr) ?
>> 
>> The holes should not be marked as reserved, we just need to loop over the
>> memory ranges rather than over pfns. Then the holes will be taken into
>> account.
>> 
>> I believe arm and xtensa got this right:
>> 
>> (from arch/arm/mm/init.c)
>> 
>> static void __init free_highpages(void)
>> {
>> #ifdef CONFIG_HIGHMEM
>>  unsigned long max_low = max_low_pfn;
>>  phys_addr_t range_start, range_end;
>>  u64 i;
>> 
>>  /* set highmem page free */
>> 	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
>> 				&range_start, &range_end, NULL) {
>>  unsigned long start = PFN_UP(range_start);
>>  unsigned long end = PFN_DOWN(range_end);
>> 
>>  /* Ignore complete lowmem entries */
>>  if (end <= max_low)
>>  continue;
>> 
>>  /* Truncate partial highmem entries */
>>  if (start < max_low)
>>  start = max_low;
>> 
>>  for (; start < end; start++)
>>  free_highmem_page(pfn_to_page(start));
>>  }
>> #endif
>> }
>> 
>
>
>And what about the way MIPS does it ?
>
>static inline void __init mem_init_free_highmem(void)
>{
>#ifdef CONFIG_HIGHMEM
>   unsigned long tmp;
>
>   if (cpu_has_dc_aliases)
>   return;
>
>   for (tmp = highstart_pfn; tmp < highend_pfn; tmp++) {
>   struct page *page = pfn_to_page(tmp);
>
>   if (!memblock_is_memory(PFN_PHYS(tmp)))
>   SetPageReserved(page);
>   else
>   free_highmem_page(page);
>   }
>#endif
>}

This iterates over all PFNs in the highmem range and skips those in holes.
for_each_free_mem_range() skips the holes altogether.

Largely, I think we need to move, say, arm's version to mm/ and use it 
everywhere, except, perhaps, x86.

>Christophe
-- 
Sincerely yours,
Mike


[PATCH 7/7] powerpc/64/build: merge .got and .toc input sections

2022-09-14 Thread Nicholas Piggin
Follow the binutils ld internal linker script and merge .got and .toc
input sections in the .got output section.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/vmlinux.lds.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 404944263db8..f7271bf78150 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -169,13 +169,12 @@ SECTIONS
}
 
.got : AT(ADDR(.got) - LOAD_OFFSET) ALIGN(256) {
-   *(.got)
+   *(.got .toc)
 #ifndef CONFIG_RELOCATABLE
__prom_init_toc_start = .;
arch/powerpc/kernel/prom_init.o*(.toc)
__prom_init_toc_end = .;
 #endif
-   *(.toc)
}
 
SOFT_MASK_TABLE(8)
-- 
2.37.2



[PATCH 6/7] powerpc/64/build: only include .opd with ELFv1

2022-09-14 Thread Nicholas Piggin
ELFv2 does not use function descriptors so .opd is not required.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/vmlinux.lds.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 44050863032e..404944263db8 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -181,11 +181,13 @@ SECTIONS
SOFT_MASK_TABLE(8)
RESTART_TABLE(8)
 
+#ifdef CONFIG_PPC64_ELF_ABI_V1
.opd : AT(ADDR(.opd) - LOAD_OFFSET) {
__start_opd = .;
KEEP(*(.opd))
__end_opd = .;
}
+#endif
 
. = ALIGN(8);
__stf_entry_barrier_fixup : AT(ADDR(__stf_entry_barrier_fixup) - 
LOAD_OFFSET) {
-- 
2.37.2



[PATCH 5/7] powerpc/build: move .data.rel.ro, .sdata2 to read-only

2022-09-14 Thread Nicholas Piggin
.sdata2 is a readonly small data section for ppc32, and .data.rel.ro
is data that needs relocating but is read-only after that so these
can both be moved to the read only memory region.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/vmlinux.lds.S | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 716fff86c3fd..44050863032e 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -131,6 +131,16 @@ SECTIONS
/* Read-only data */
RO_DATA(PAGE_SIZE)
 
+#ifdef CONFIG_PPC32
+   .sdata2 : AT(ADDR(.sdata2) - LOAD_OFFSET) {
+   *(.sdata2)
+   }
+#endif
+
+   .data.rel.ro : AT(ADDR(.data.rel.ro) - LOAD_OFFSET) {
+   *(.data.rel.ro*)
+   }
+
.branch_lt : AT(ADDR(.branch_lt) - LOAD_OFFSET) {
*(.branch_lt)
}
@@ -348,19 +358,13 @@ SECTIONS
. = ALIGN(PAGE_SIZE);
_sdata = .;
 
-#ifdef CONFIG_PPC32
.data : AT(ADDR(.data) - LOAD_OFFSET) {
DATA_DATA
*(.data.rel*)
+#ifdef CONFIG_PPC32
*(SDATA_MAIN)
-   *(.sdata2)
-   }
-#else
-   .data : AT(ADDR(.data) - LOAD_OFFSET) {
-   DATA_DATA
-   *(.data.rel*)
-   }
 #endif
+   }
 
/* The initial task and kernel stack */
INIT_TASK_DATA_SECTION(THREAD_ALIGN)
-- 
2.37.2



[PATCH 4/7] powerpc/build: move got, toc, plt, branch_lt sections to read-only

2022-09-14 Thread Nicholas Piggin
This moves linker-related tables from .data to the read-only area.
Relocations are performed at early boot time before memory is protected,
after which there should be no modifications required.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/vmlinux.lds.S | 42 ---
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 341ac79f49a9..716fff86c3fd 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -131,6 +131,10 @@ SECTIONS
/* Read-only data */
RO_DATA(PAGE_SIZE)
 
+   .branch_lt : AT(ADDR(.branch_lt) - LOAD_OFFSET) {
+   *(.branch_lt)
+   }
+
 #ifdef CONFIG_PPC32
.got1 : AT(ADDR(.got1) - LOAD_OFFSET) {
*(.got1)
@@ -140,7 +144,30 @@ SECTIONS
*(.got2)
__got2_end = .;
}
+   .got : AT(ADDR(.got) - LOAD_OFFSET) SPECIAL {
+   *(.got)
+   *(.got.plt)
+   }
+   .plt : AT(ADDR(.plt) - LOAD_OFFSET) SPECIAL {
+   /* XXX: is .plt (and .got.plt) required? */
+   *(.plt)
+   }
+
 #else /* CONFIG_PPC32 */
+   .toc1 : AT(ADDR(.toc1) - LOAD_OFFSET) {
+   *(.toc1)
+   }
+
+   .got : AT(ADDR(.got) - LOAD_OFFSET) ALIGN(256) {
+   *(.got)
+#ifndef CONFIG_RELOCATABLE
+   __prom_init_toc_start = .;
+   arch/powerpc/kernel/prom_init.o*(.toc)
+   __prom_init_toc_end = .;
+#endif
+   *(.toc)
+   }
+
SOFT_MASK_TABLE(8)
RESTART_TABLE(8)
 
@@ -327,26 +354,11 @@ SECTIONS
*(.data.rel*)
*(SDATA_MAIN)
*(.sdata2)
-   *(.got.plt) *(.got)
-   *(.plt)
-   *(.branch_lt)
}
 #else
.data : AT(ADDR(.data) - LOAD_OFFSET) {
DATA_DATA
*(.data.rel*)
-   *(.toc1)
-   *(.branch_lt)
-   }
-
-   .got : AT(ADDR(.got) - LOAD_OFFSET) ALIGN(256) {
-   *(.got)
-#ifndef CONFIG_RELOCATABLE
-   __prom_init_toc_start = .;
-   arch/powerpc/kernel/prom_init.o*(.toc)
-   __prom_init_toc_end = .;
-#endif
-   *(.toc)
}
 #endif
 
-- 
2.37.2



[PATCH 3/7] powerpc/32/build: move got1/got2 sections out of text

2022-09-14 Thread Nicholas Piggin
Following the example from the binutils default linker script, move
.got1 and .got2 out of .text, to just after RO_DATA where they are in
read-only NX memory.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/vmlinux.lds.S | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 90ac5ff73df2..341ac79f49a9 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -122,14 +122,6 @@ SECTIONS
*(.sfpr);
MEM_KEEP(init.text)
MEM_KEEP(exit.text)
-
-#ifdef CONFIG_PPC32
-   *(.got1)
-   __got2_start = .;
-   *(.got2)
-   __got2_end = .;
-#endif /* CONFIG_PPC32 */
-
} :text
 
. = ALIGN(PAGE_SIZE);
@@ -139,7 +131,16 @@ SECTIONS
/* Read-only data */
RO_DATA(PAGE_SIZE)
 
-#ifdef CONFIG_PPC64
+#ifdef CONFIG_PPC32
+   .got1 : AT(ADDR(.got1) - LOAD_OFFSET) {
+   *(.got1)
+   }
+   .got2 : AT(ADDR(.got2) - LOAD_OFFSET) {
+   __got2_start = .;
+   *(.got2)
+   __got2_end = .;
+   }
+#else /* CONFIG_PPC32 */
SOFT_MASK_TABLE(8)
RESTART_TABLE(8)
 
@@ -190,7 +191,7 @@ SECTIONS
*(__rfi_flush_fixup)
__stop___rfi_flush_fixup = .;
}
-#endif /* CONFIG_PPC64 */
+#endif /* CONFIG_PPC32 */
 
 #ifdef CONFIG_PPC_BARRIER_NOSPEC
. = ALIGN(8);
-- 
2.37.2



[PATCH 2/7] powerpc: move __end_rodata to cover arch read-only sections

2022-09-14 Thread Nicholas Piggin
powerpc has a number of read-only sections and tables that are put
after RO_DATA(). Move the __end_rodata symbol to cover these as well.

Setting memory to read-only at boot is done using __init_begin,
change that to use __end_rodata.

This also affects boot dmesg, is_kernel_rodata(), and some other checks.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/vmlinux.lds.S| 3 +++
 arch/powerpc/mm/book3s32/mmu.c   | 2 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c  | 2 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
 arch/powerpc/mm/pgtable_32.c | 5 ++---
 5 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index fe22d940412f..90ac5ff73df2 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -210,6 +210,9 @@ SECTIONS
}
 #endif
 
+   . = ALIGN(PAGE_SIZE);
+   __end_rodata = .;
+
 /*
  * Init sections discarded at runtime
  */
diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
index a96b73006dfb..e13b883e4e5b 100644
--- a/arch/powerpc/mm/book3s32/mmu.c
+++ b/arch/powerpc/mm/book3s32/mmu.c
@@ -240,7 +240,7 @@ void mmu_mark_rodata_ro(void)
for (i = 0; i < nb; i++) {
struct ppc_bat *bat = BATS[i];
 
-   if (bat_addrs[i].start < (unsigned long)__init_begin)
+   if (bat_addrs[i].start < (unsigned long)__end_rodata)
bat[1].batl = (bat[1].batl & ~BPP_RW) | BPP_RX;
}
 
diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c 
b/arch/powerpc/mm/book3s64/hash_pgtable.c
index ae008b9df0e6..28332001bd87 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -541,7 +541,7 @@ void hash__mark_rodata_ro(void)
unsigned long start, end, pp;
 
start = (unsigned long)_stext;
-   end = (unsigned long)__init_begin;
+   end = (unsigned long)__end_rodata;
 
pp = htab_convert_pte_flags(pgprot_val(PAGE_KERNEL_ROX), 
HPTE_USE_KERNEL_KEY);
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 698274109c91..2305f34bcc33 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -228,7 +228,7 @@ void radix__mark_rodata_ro(void)
unsigned long start, end;
 
start = (unsigned long)_stext;
-   end = (unsigned long)__init_begin;
+   end = (unsigned long)__end_rodata;
 
radix__change_memory_range(start, end, _PAGE_WRITE);
 }
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 3ac73f9fb5d5..112af8c5447a 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -158,10 +158,9 @@ void mark_rodata_ro(void)
}
 
/*
-* mark .text and .rodata as read only. Use __init_begin rather than
-* __end_rodata to cover NOTES and EXCEPTION_TABLE.
+* mark .text and .rodata as read only.
 */
-   numpages = PFN_UP((unsigned long)__init_begin) -
+   numpages = PFN_UP((unsigned long)__end_rodata) -
   PFN_DOWN((unsigned long)_stext);
 
set_memory_ro((unsigned long)_stext, numpages);
-- 
2.37.2



[PATCH 1/7] powerpc/build: put sys_call_table in .data.rel.ro if RELOCATABLE

2022-09-14 Thread Nicholas Piggin
Const function pointers live in .data.rel.ro rather than .rodata because
they must be relocated. This change prevents powerpc/32 from generating
R_PPC_UADDR32 relocations (which are not handled). The sys_call_table is
moved to writeable memory, but a later change will move it back.

After this patch, 44x_defconfig + CONFIG_RELOCATABLE boots to busybox.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/systbl.S | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/systbl.S b/arch/powerpc/kernel/systbl.S
index cb3358886203..0bec33e86f50 100644
--- a/arch/powerpc/kernel/systbl.S
+++ b/arch/powerpc/kernel/systbl.S
@@ -12,7 +12,11 @@
 
 #include 
 
+#ifdef CONFIG_RELOCATABLE
+.section .data.rel.ro,"aw"
+#else
 .section .rodata,"a"
+#endif
 
 #ifdef CONFIG_PPC64
	.p2align 3
-- 
2.37.2



[PATCH 0/7] powerpc: build / linker improvements

2022-09-14 Thread Nicholas Piggin
This series is mainly about moving more things out of writable and
executable memory, and slightly moving the linker script in the
direction of the binutils ld internal linker script as we do.

Thanks,
Nick

Nicholas Piggin (7):
  powerpc/build: put sys_call_table in .data.rel.ro if RELOCATABLE
  powerpc: move __end_rodata to cover arch read-only sections
  powerpc/32/build: move got1/got2 sections out of text
  powerpc/build: move got, toc, plt, branch_lt sections to read-only
  powerpc/build: move .data.rel.ro, .sdata2 to read-only
  powerpc/64/build: only include .opd with ELFv1
  powerpc/64/build: merge .got and .toc input sections

 arch/powerpc/kernel/systbl.S |  4 ++
 arch/powerpc/kernel/vmlinux.lds.S| 85 +++-
 arch/powerpc/mm/book3s32/mmu.c   |  2 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c  |  2 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c |  2 +-
 arch/powerpc/mm/pgtable_32.c |  5 +-
 6 files changed, 62 insertions(+), 38 deletions(-)

-- 
2.37.2



Re: [RFC] Objtool toolchain proposal: -fannotate-{jump-table,noreturn}

2022-09-14 Thread Peter Zijlstra
On Wed, Sep 14, 2022 at 02:28:26PM +, Michael Matz wrote:
> Hello,
> 
> On Wed, 14 Sep 2022, Peter Zijlstra wrote:
> 
> > > Maybe this is semantics, but I wouldn't characterize objtool's existence
> > > as being based on the mistrust of tools.  It's main motivation is to
> > > fill in the toolchain's blind spots in asm and inline-asm, which exist
> > > by design.
> > 
> > That and a fairly deep seated loathing for the regular CFI annotations
> > and DWARF in general. Linus was fairly firm he didn't want anything to
> > do with DWARF for in-kernel unwinding.
> 
> I was referring only to the check-stuff functionality of objtool, not to 
> its other parts.  Although, of course, "deep seated loathing" is a special 
> form of mistrust as well ;-)

Those were born out of the DWARF unwinder itself crashing the kernel due to
its inherent complexity (tracking the whole DWARF state machine and not
being quite robust itself).

That, and the manual CFI annotations were 'always' wrong, due to humans
and no tooling verifying them.

That said, objtool does have a number of annotations as well; mostly
things telling what kind of stackframe stuff starts with.

> > That left us in a spot that we needed unwind information in a 'better'
> > format than DWARF.
> > 
Objtool was born out of those constraints. ORC not needing the CFI
annotations and ORC being *much* faster at unwinding and generation
> > (debug builds are slow) were all good.
> 
> Don't mix DWARF debug info with DWARF-based unwinding info, the latter 
> doesn't imply the former.  Out of interest: how does ORC get around the 
> need for CFI annotations (or equivalents to restore registers) and what 

Objtool 'interprets' the stackops. So it follows the call-graph and is
an interpreter for all instructions that modify the stack. Doing that it
knows what the stackframe is at 'most' places.

> makes it fast?  I want faster unwinding for DWARF as well, when there's 
> feature parity :-)  Maybe something can be learned for integration into 
> dwarf-unwind.

I think we have some details here:

 https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html


Re: [RFC] Objtool toolchain proposal: -fannotate-{jump-table,noreturn}

2022-09-14 Thread Michael Matz
Hello,

On Wed, 14 Sep 2022, Peter Zijlstra wrote:

> > Maybe this is semantics, but I wouldn't characterize objtool's existence
> > as being based on the mistrust of tools.  Its main motivation is to
> > fill in the toolchain's blind spots in asm and inline-asm, which exist
> > by design.
> 
> That and a fairly deep seated loathing for the regular CFI annotations
> and DWARF in general. Linus was fairly firm he didn't want anything to
> do with DWARF for in-kernel unwinding.

I was referring only to the check-stuff functionality of objtool, not to 
its other parts.  Although, of course, "deep seated loathing" is a special 
form of mistrust as well ;-)

> That left us in a spot that we needed unwind information in a 'better'
> format than DWARF.
> 
> Objtool was born out of those constraints. ORC not needing the CFI
> annotations and ORC being *much* faster at unwinding and generation
> (debug builds are slow) were all good.

Don't mix DWARF debug info with DWARF-based unwinding info, the latter 
doesn't imply the former.  Out of interest: how does ORC get around the 
need for CFI annotations (or equivalents to restore registers) and what 
makes it fast?  I want faster unwinding for DWARF as well, when there's 
feature parity :-)  Maybe something can be learned for integration into 
dwarf-unwind.


Ciao,
Michael.


Re: [PATCH linux-next][RFC] powerpc: protect cpu offlining by RCU offline lock

2022-09-14 Thread Zhouyi Zhou
On Wed, Sep 14, 2022 at 8:17 PM Paul E. McKenney  wrote:
>
> On Wed, Sep 14, 2022 at 10:15:28AM +0800, Zhouyi Zhou wrote:
> > During the cpu offlining, the sub functions of xive_teardown_cpu will
> > call __lock_acquire when CONFIG_LOCKDEP=y. The latter function will
> > travel RCU protected list, so "WARNING: suspicious RCU usage" will be
> > triggered.
> >
> > Try to protect cpu offlining by RCU offline lock.
>
Thank you, Paul, for your guidance!
> Rather than acquiring the RCU lock, why not change the functions called
> by xive_teardown_cpu() to avoid calling __lock_acquire()?  For example,
> a call to spin_lock() could be changed to arch_spin_lock().
Great idea!
I will take a try, and perform new rounds of rcutorture tests. I will
submit a new version next week.
Also, thank you, PPC developers, for your patience with me ;-)

Cheers
Zhouyi
>
> Thanx, Paul
>
> > Tested on PPC VM of Open Source Lab of Oregon State University.
> > (Each round of tests takes about 19 hours to finish)
> > Test results show that although "WARNING: suspicious RCU usage" has gone,
> > but there are more "BUG: soft lockup" reports than the original kernel
> > (10 vs 6), so I add a [RFC] to my subject line.
> >
> > Signed-off-by: Zhouyi Zhou 
> > ---
> > [it seems that there are some delivery problem in my previous email,
> >  so I send again via gmail, sorry for the trouble]
> >
> > Dear PPC and RCU developers
> >
> > I found this bug when trying to do rcutorture tests in ppc VM of
> > Open Source Lab of Oregon State University.
> >
> > console.log report following bug:
> > [   37.635545][T0] WARNING: suspicious RCU usage^M
> > [   37.636409][T0] 6.0.0-rc4-next-20220907-dirty #8 Not tainted^M
> > [   37.637575][T0] -^M
> > [   37.638306][T0] kernel/locking/lockdep.c:3723 RCU-list traversed in 
> > non-reader section!!^M
> > [   37.639651][T0] ^M
> > [   37.639651][T0] other info that might help us debug this:^M
> > [   37.639651][T0] ^M
> > [   37.641381][T0] ^M
> > [   37.641381][T0] RCU used illegally from offline CPU!^M
> > [   37.641381][T0] rcu_scheduler_active = 2, debug_locks = 1^M
> > [   37.667170][T0] no locks held by swapper/6/0.^M
> > [   37.668328][T0] ^M
> > [   37.668328][T0] stack backtrace:^M
> > [   37.669995][T0] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 
> > 6.0.0-rc4-next-20220907-dirty #8^M
> > [   37.672777][T0] Call Trace:^M
> > [   37.673729][T0] [c4653920] [c097f9b4] 
> > dump_stack_lvl+0x98/0xe0 (unreliable)^M
> > [   37.678579][T0] [c4653960] [c01f2eb8] 
> > lockdep_rcu_suspicious+0x148/0x16c^M
> > [   37.680425][T0] [c46539f0] [c01ed9b4] 
> > __lock_acquire+0x10f4/0x26e0^M
> > [   37.682450][T0] [c4653b30] [c01efc2c] 
> > lock_acquire+0x12c/0x420^M
> > [   37.684113][T0] [c4653c20] [c10d704c] 
> > _raw_spin_lock_irqsave+0x6c/0xc0^M
> > [   37.686154][T0] [c4653c60] [c00c7b4c] 
> > xive_spapr_put_ipi+0xcc/0x150^M
> > [   37.687879][T0] [c4653ca0] [c10c72a8] 
> > xive_cleanup_cpu_ipi+0xc8/0xf0^M
> > [   37.689856][T0] [c4653cf0] [c10c7370] 
> > xive_teardown_cpu+0xa0/0xf0^M
> > [   37.691877][T0] [c4653d30] [c00fba5c] 
> > pseries_cpu_offline_self+0x5c/0x100^M
> > [   37.693882][T0] [c4653da0] [c005d2c4] 
> > arch_cpu_idle_dead+0x44/0x60^M
> > [   37.695739][T0] [c4653dc0] [c01c740c] 
> > do_idle+0x16c/0x3d0^M
> > [   37.697536][T0] [c4653e70] [c01c7a1c] 
> > cpu_startup_entry+0x3c/0x40^M
> > [   37.699694][T0] [c4653ea0] [c005ca20] 
> > start_secondary+0x6c0/0xb50^M
> > [   37.701742][T0] [c4653f90] [c000d054] 
> > start_secondary_prolog+0x10/0x14^M
> >
> >
> > I am a beginner, hope I can be of some benefit to the community ;-)
> >
> > Thanks
> > Zhouyi
> > --
> >  arch/powerpc/platforms/pseries/hotplug-cpu.c |  5 -
> >  include/linux/rcupdate.h |  3 ++-
> >  kernel/rcu/tree.c| 10 ++
> >  3 files changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
> > b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > index 0f8cd8b06432..ddf66a253c70 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > @@ -64,11 +64,14 @@ static void pseries_cpu_offline_self(void)
> >
> >   local_irq_disable();
> >   idle_task_exit();
> > +
> > + /* Because the cpu is now offline, let rcu know that */
> > + rcu_state_ofl_lock();
> >   if (xive_enabled())
> >   xive_teardown_cpu();
> >   else
> >   xics_teardown_cpu();
> > -
> > + rcu_state_ofl_unlock();
> >   unregister_slb_shadow(hwcpu);
> >  

Re: [RFC] Objtool toolchain proposal: -fannotate-{jump-table,noreturn}

2022-09-14 Thread Peter Zijlstra
On Wed, Sep 14, 2022 at 01:04:16AM +0100, Josh Poimboeuf wrote:

> > I will mention that objtool's existence is based on mistrust, of persons 
> > (not correctly annotating stuff) and of tools (not correctly heeding those 
> > annotations).  The mistrust in persons is understandable and can be dealt 
> > with by tools, but the mistrust in tools can't be fixed by making tools 
> > more complicated by emitting even more information; there's no good reason 
> > to assume that one piece of info can be trusted more than other pieces.  
> > So, if you mistrust the tools you have already lost.  That's somewhat 
> > philosophical, so I won't beat that horse much more either.
> 
> Maybe this is semantics, but I wouldn't characterize objtool's existence
> as being based on the mistrust of tools.  Its main motivation is to
> fill in the toolchain's blind spots in asm and inline-asm, which exist
> by design.

That and a fairly deep seated loathing for the regular CFI annotations
and DWARF in general. Linus was fairly firm he didn't want anything to
do with DWARF for in-kernel unwinding.

That left us in a spot that we needed unwind information in a 'better'
format than DWARF.

Objtool was born out of those constraints. ORC not needing the CFI
annotations and ORC being *much* faster at unwinding and generation
(debug builds are slow) were all good.




Re: [PATCH 0/4] Remove unused macros from asm-offset

2022-09-14 Thread Disha Goel



On 9/13/22 1:45 PM, Christophe Leroy wrote:


Le 13/09/2022 à 09:40, Disha Goel a écrit :

There were commits that refactored code and converted some of the kvm
assembly routines to C. In doing so, many of the asm-offsets macro
definitions were missed and never removed. This patchset removes them.

Patch1 removes usage of KVM_TLB_SETS macro from the asm-offset

Patch2 removes KVM_RADIX macro from the asm-offset.c

Patch3 removes a set of unused kvm vcpu and hstate macros from the
asm-offset.c

Patch4 removes unused HSTATE/host_mmcr references for MMCR3/SIER2/SIER3

I think you can squash all changes to asm-offsets.c into a single patch.
The Fixes: tags are irrelevant, you are not fixing a real bug, it's just
a cleanup.

Then have a second patch that reduces the size of host_mmcr[] in
kvmppc_host_state struct.


Hi Christophe,

Thanks for reviewing the patchset. I will send v2 patchset with 
mentioned changes.



Thanks

Disha


Thanks
Christophe


Link to the script used to get unused macro:
https://github.com/maddy-kerneldev/scripts/blob/master/check_asm-offset.sh

Link to linux-ci job result:
https://github.com/disgoel/linux-ci/actions

Disha Goel (3):
powerpc/asm-offset: Remove unused KVM_TLB_SETS macros
powerpc/asm-offset: Remove unused KVM_RADIX macros
powerpc/kvm: Remove unused macros from asm-offset

Kajol Jain (1):
powerpc/kvm: Remove unused references for MMCR3/SIER2/SIER3 registers

   arch/powerpc/include/asm/kvm_book3s_asm.h |  2 +-
   arch/powerpc/kernel/asm-offsets.c | 25 ---
   2 files changed, 1 insertion(+), 26 deletions(-)


Re: [RFC] Objtool toolchain proposal: -fannotate-{jump-table,noreturn}

2022-09-14 Thread Segher Boessenkool
On Wed, Sep 14, 2022 at 11:21:00AM +0100, Josh Poimboeuf wrote:
> On Mon, Sep 12, 2022 at 06:31:14AM -0500, Segher Boessenkool wrote:
> > On Fri, Sep 09, 2022 at 11:07:04AM -0700, Josh Poimboeuf wrote:
> > > 2) Noreturn functions:
> > >
> > >There's no reliable way to determine which functions are designated
> > >by the compiler to be noreturn (either explictly via function
> > >attribute, or implicitly via a static function which is a wrapper
> > >around a noreturn function.)
> > 
> > Or just a function that does not return for any other reason.
> > 
> > The compiler makes no difference between functions that have the
> > attribute and functions that do not.  There are good reasons to not
> > have the attribute on functions that do in fact not return.  The
> > not-returningness of the function may be just an implementation
> > accident, something you do not want part of the API, so it *should* not
> > have that attribute; or you may want the callers to a function to not be
> > optimised according to this knowledge (you cannot *prevent* that, the
> > compiler can figure it out it other ways, but still) for any other
> > reason.
> 
> Yes, many static functions that are wrappers around noreturn functions
> have this "implicit noreturn" property.

I meant functions that are noreturn intrinsically.  The trivial example:

void f(void)
{
for (;;)
;
}

>  I agree we would need to know
> about those functions (or, as Michael suggested, their call sites) as
> well.

Many "potentially does not return" functions (there are very many such
functions!) turn into "never returns" functions, for some inputs (or
something in the environment).  If the compiler specialises a code path
that does not return, you'll not see that marked up any way.  Of course
such a path should not be taken in the kernel, normally :-)

> > >This information is needed because the
> > >code after the call to such a function is optimized out as
> > >unreachable and objtool has no way of knowing that.
> > 
> > Since June we (GCC) have -funreachable-traps.  This creates a trap insn
> > wherever control flow would otherwise go into limbo.
> 
> Ah, that's interesting, though I'm not sure if we'd be able to
> distinguish between "call doesn't return" traps and other traps or
> reasons for UD2.

The trap handler can see where the trap came from.  And then look up
that address in some tables or such.  Just like __bug_table?


Segher


Re: [PATCH linux-next][RFC] powerpc: protect cpu offlining by RCU offline lock

2022-09-14 Thread Paul E. McKenney
On Wed, Sep 14, 2022 at 10:15:28AM +0800, Zhouyi Zhou wrote:
> During the cpu offlining, the sub functions of xive_teardown_cpu will
> call __lock_acquire when CONFIG_LOCKDEP=y. The latter function will
> travel RCU protected list, so "WARNING: suspicious RCU usage" will be
> triggered.
> 
> Try to protect cpu offlining by RCU offline lock.

Rather than acquiring the RCU lock, why not change the functions called
by xive_teardown_cpu() to avoid calling __lock_acquire()?  For example,
a call to spin_lock() could be changed to arch_spin_lock().

Thanx, Paul

> Tested on PPC VM of Open Source Lab of Oregon State University.
> (Each round of tests takes about 19 hours to finish)
> Test results show that although "WARNING: suspicious RCU usage" has gone,
> but there are more "BUG: soft lockup" reports than the original kernel
> (10 vs 6), so I add a [RFC] to my subject line.
> 
> Signed-off-by: Zhouyi Zhou 
> ---
> [it seems that there are some delivery problem in my previous email,
>  so I send again via gmail, sorry for the trouble]
>  
> Dear PPC and RCU developers
> 
> I found this bug when trying to do rcutorture tests in ppc VM of
> Open Source Lab of Oregon State University.
> 
> console.log report following bug:
> [   37.635545][T0] WARNING: suspicious RCU usage^M
> [   37.636409][T0] 6.0.0-rc4-next-20220907-dirty #8 Not tainted^M
> [   37.637575][T0] -^M
> [   37.638306][T0] kernel/locking/lockdep.c:3723 RCU-list traversed in 
> non-reader section!!^M
> [   37.639651][T0] ^M
> [   37.639651][T0] other info that might help us debug this:^M
> [   37.639651][T0] ^M
> [   37.641381][T0] ^M
> [   37.641381][T0] RCU used illegally from offline CPU!^M
> [   37.641381][T0] rcu_scheduler_active = 2, debug_locks = 1^M
> [   37.667170][T0] no locks held by swapper/6/0.^M
> [   37.668328][T0] ^M
> [   37.668328][T0] stack backtrace:^M
> [   37.669995][T0] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 
> 6.0.0-rc4-next-20220907-dirty #8^M
> [   37.672777][T0] Call Trace:^M
> [   37.673729][T0] [c4653920] [c097f9b4] 
> dump_stack_lvl+0x98/0xe0 (unreliable)^M
> [   37.678579][T0] [c4653960] [c01f2eb8] 
> lockdep_rcu_suspicious+0x148/0x16c^M
> [   37.680425][T0] [c46539f0] [c01ed9b4] 
> __lock_acquire+0x10f4/0x26e0^M
> [   37.682450][T0] [c4653b30] [c01efc2c] 
> lock_acquire+0x12c/0x420^M
> [   37.684113][T0] [c4653c20] [c10d704c] 
> _raw_spin_lock_irqsave+0x6c/0xc0^M
> [   37.686154][T0] [c4653c60] [c00c7b4c] 
> xive_spapr_put_ipi+0xcc/0x150^M
> [   37.687879][T0] [c4653ca0] [c10c72a8] 
> xive_cleanup_cpu_ipi+0xc8/0xf0^M
> [   37.689856][T0] [c4653cf0] [c10c7370] 
> xive_teardown_cpu+0xa0/0xf0^M
> [   37.691877][T0] [c4653d30] [c00fba5c] 
> pseries_cpu_offline_self+0x5c/0x100^M
> [   37.693882][T0] [c4653da0] [c005d2c4] 
> arch_cpu_idle_dead+0x44/0x60^M
> [   37.695739][T0] [c4653dc0] [c01c740c] 
> do_idle+0x16c/0x3d0^M
> [   37.697536][T0] [c4653e70] [c01c7a1c] 
> cpu_startup_entry+0x3c/0x40^M
> [   37.699694][T0] [c4653ea0] [c005ca20] 
> start_secondary+0x6c0/0xb50^M
> [   37.701742][T0] [c4653f90] [c000d054] 
> start_secondary_prolog+0x10/0x14^M
> 
> 
> I am a beginner, hope I can be of some benefit to the community ;-)
> 
> Thanks
> Zhouyi
> --
>  arch/powerpc/platforms/pseries/hotplug-cpu.c |  5 -
>  include/linux/rcupdate.h |  3 ++-
>  kernel/rcu/tree.c| 10 ++
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
> b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 0f8cd8b06432..ddf66a253c70 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -64,11 +64,14 @@ static void pseries_cpu_offline_self(void)
>  
>   local_irq_disable();
>   idle_task_exit();
> +
> + /* Because the cpu is now offline, let rcu know that */
> + rcu_state_ofl_lock();
>   if (xive_enabled())
>   xive_teardown_cpu();
>   else
>   xics_teardown_cpu();
> -
> + rcu_state_ofl_unlock();
>   unregister_slb_shadow(hwcpu);
>   rtas_stop_self();
>  
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 63d2e6a60ad7..d857955a02ba 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1034,5 +1034,6 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, 
> rcu_callback_t f)
>  /* kernel/ksysfs.c definitions */
>  extern int rcu_expedited;
>  extern int rcu_normal;
> -
> +void rcu_state_ofl_lock(void);
> +void rcu_state_ofl_unlock(void);
>  #endif /* 

Re: [RFC] Objtool toolchain proposal: -fannotate-{jump-table,noreturn}

2022-09-14 Thread Michael Matz
Hello,

On Wed, 14 Sep 2022, Josh Poimboeuf wrote:

> > >This information is needed because the
> > >code after the call to such a function is optimized out as
> > >unreachable and objtool has no way of knowing that.
> > 
> > Since June we (GCC) have -funreachable-traps.  This creates a trap insn
> > wherever control flow would otherwise go into limbo.
> 
> Ah, that's interesting, though I'm not sure if we'd be able to
> distinguish between "call doesn't return" traps and other traps or
> reasons for UD2.

There are two reasons (which will turn out to be the same) for a trap (say 
'UD2' on x86-64) directly after a call insn:
1) "the call shall not have returned"
2) something else jumps to that trap because it was __builtin_unreachable 
   (or equivalent), and the compiler happened to put that ud2 directly 
   after the call.  It could have done that only when the call itself was 
   noreturn:
 cmp $foo, %rax
 jne do_trap
 call noret
do_trap:
 ud2

So, it's all the same.  If there's an ud2 (or whatever the trap maker is) 
after a call then it was because it's noreturn.

(But, of course this costs (little) code size, unlike the non-alloc 
checker sections)


Ciao,
Michael.


Re: [RFC] Objtool toolchain proposal: -fannotate-{jump-table,noreturn}

2022-09-14 Thread Josh Poimboeuf
On Mon, Sep 12, 2022 at 06:31:14AM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Sep 09, 2022 at 11:07:04AM -0700, Josh Poimboeuf wrote:
> > 2) Noreturn functions:
> >
> >There's no reliable way to determine which functions are designated
> >by the compiler to be noreturn (either explictly via function
> >attribute, or implicitly via a static function which is a wrapper
> >around a noreturn function.)
> 
> Or just a function that does not return for any other reason.
> 
> The compiler makes no difference between functions that have the
> attribute and functions that do not.  There are good reasons to not
> have the attribute on functions that do in fact not return.  The
> not-returningness of the function may be just an implementation
> accident, something you do not want part of the API, so it *should* not
> have that attribute; or you may want the callers to a function to not be
> optimised according to this knowledge (you cannot *prevent* that, the
> compiler can figure it out it other ways, but still) for any other
> reason.

Yes, many static functions that are wrappers around noreturn functions
have this "implicit noreturn" property.  I agree we would need to know
about those functions (or, as Michael suggested, their call sites) as
well.

> >This information is needed because the
> >code after the call to such a function is optimized out as
> >unreachable and objtool has no way of knowing that.
> 
> Since June we (GCC) have -funreachable-traps.  This creates a trap insn
> wherever control flow would otherwise go into limbo.

Ah, that's interesting, though I'm not sure if we'd be able to
distinguish between "call doesn't return" traps and other traps or
reasons for UD2.

-- 
Josh


Re: [PATCH] hmm-tests: Fix migrate_dirty_page test

2022-09-14 Thread David Hildenbrand

On 13.09.22 10:20, Alistair Popple wrote:


David Hildenbrand  writes:


On 13.09.22 07:22, Alistair Popple wrote:

As noted by John Hubbard the original test relied on side effects of the
implementation of migrate_vma_setup() to detect if pages had been
swapped to disk or not. This is subject to change in future so
explicitly check for swap entries via pagemap instead. Fix a spelling
mistake while we're at it.
Signed-off-by: Alistair Popple 
Fixes: 5cc88e844e87 ("selftests/hmm-tests: add test for dirty bits")
---
   tools/testing/selftests/vm/hmm-tests.c | 50 +++---
   1 file changed, 46 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/vm/hmm-tests.c
b/tools/testing/selftests/vm/hmm-tests.c
index 70fdb49b59ed..b5f6a7dc1f12 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1261,9 +1261,47 @@ static int destroy_cgroup(void)
return 0;
   }
   +static uint64_t get_pfn(int fd, uint64_t ptr)
+{
+   uint64_t pfn;
+   int ret;
+
+   ret = pread(fd, &pfn, sizeof(ptr),
+   (uint64_t) ptr / getpagesize() * sizeof(ptr));
+   if (ret != sizeof(ptr))
+   return 0;
+
+   return pfn;
+}
+
+#define PAGEMAP_SWAPPED (1ULL << 62)
+
+/* Returns true if at least one page in the range is on swap */
+static bool pages_swapped(void *ptr, unsigned long pages)
+{
+   uint64_t pfn;
+   int fd = open("/proc/self/pagemap", O_RDONLY);
+   unsigned long i;
+
+   if (fd < 0)
+   return false;
+
+   for (i = 0; i < pages; i++) {
+   pfn = get_pfn(fd, (uint64_t) ptr + i * getpagesize());
+
+   if (pfn & PAGEMAP_SWAPPED) {
+   close(fd);
+   return true;
+   }


We do have pagemap_get_entry() in vm_util.c to query the pagemap entry.


Thanks. I'd missed that, although `grep pagemap
tools/testing/selftests/vm` suggests I'm not the first to follow a
tradition of open-coding this :-)

But there's no need to perpetuate that tradition, so will redo this to
use vm_util.c instead.


Yeah, we just recently factored stuff out into there. I'll be factoring 
out more in my upcoming tests from the madv_populate tests.


--
Thanks,

David / dhildenb
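
For reference, the pagemap checks discussed in this thread boil down to testing fixed bit positions in each 64-bit `/proc/pid/pagemap` entry. A minimal sketch (bit positions per Documentation/admin-guide/mm/pagemap.rst; helper names are illustrative, not the vm_util.c API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions documented in Documentation/admin-guide/mm/pagemap.rst */
#define PM_SOFT_DIRTY	(1ULL << 55)
#define PM_SWAP		(1ULL << 62)
#define PM_PRESENT	(1ULL << 63)

/* True if the pagemap entry describes a page swapped out to disk. */
static bool entry_is_swapped(uint64_t entry)
{
	return entry & PM_SWAP;
}

/* True if the entry has the soft-dirty tracking bit set. */
static bool entry_is_softdirty(uint64_t entry)
{
	return entry & PM_SOFT_DIRTY;
}
```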



Re: Fragmented physical memory on powerpc/32

2022-09-14 Thread Christophe Leroy


Le 14/09/2022 à 11:32, Mike Rapoport a écrit :
> On Tue, Sep 13, 2022 at 02:36:13PM +0200, Christophe Leroy wrote:
>>
>>
>> Le 13/09/2022 à 08:11, Christophe Leroy a écrit :
>>>
>>>
>>> Le 12/09/2022 à 23:16, Pali Rohár a écrit :
>
> My guess would be that something went wrong in the linear map
> setup, but it
> won't hurt running with "memblock=debug" added to the kernel
> command line
> to see if there is anything suspicious there.

 Here is boot log on serial console with memblock=debug command line:

>>> ...

 Do you need something more for debug?
>>>
>>> Can you send me the 'vmlinux' used to generate the above Oops so that I
>>> can see exactly where we are in function mem_init().
>>>
>>> And could you also try without CONFIG_HIGHMEM just in case.
>>>
>>
>> I looked at the vmlinux you sent me, the problem is in the loop for highmem
>> in mem_init(). It crashes in the call to free_highmem_page()
>>
>> #ifdef CONFIG_HIGHMEM
>>  {
>>  unsigned long pfn, highmem_mapnr;
>>
>>  highmem_mapnr = lowmem_end_addr >> PAGE_SHIFT;
>>  for (pfn = highmem_mapnr; pfn < max_mapnr; ++pfn) {
>>  phys_addr_t paddr = (phys_addr_t)pfn << PAGE_SHIFT;
>>  struct page *page = pfn_to_page(pfn);
>>  if (!memblock_is_reserved(paddr))
>>  free_highmem_page(page);
>>  }
>>  }
>> #endif /* CONFIG_HIGHMEM */
>>
>>
>> As far as I can see in the memblock debug lines, the holes don't seem to be
>> marked as reserved by memblock. So is the above valid? Other architectures
>> seem to do it differently.
>>
>> Can you try by replacing !memblock_is_reserved(paddr) by
>> memblock_is_memory(paddr) ?
> 
> The holes should not be marked as reserved, we just need to loop over the
> memory ranges rather than over pfns. Then the holes will be taken into
> account.
> 
> I believe arm and xtensa got this right:
> 
> (from arch/arm/mm/init.c)
> 
> static void __init free_highpages(void)
> {
> #ifdef CONFIG_HIGHMEM
>   unsigned long max_low = max_low_pfn;
>   phys_addr_t range_start, range_end;
>   u64 i;
> 
>   /* set highmem page free */
>   for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
>   &range_start, &range_end, NULL) {
>   unsigned long start = PFN_UP(range_start);
>   unsigned long end = PFN_DOWN(range_end);
> 
>   /* Ignore complete lowmem entries */
>   if (end <= max_low)
>   continue;
> 
>   /* Truncate partial highmem entries */
>   if (start < max_low)
>   start = max_low;
> 
>   for (; start < end; start++)
>   free_highmem_page(pfn_to_page(start));
>   }
> #endif
> }
> 


And what about the way MIPS does it?

static inline void __init mem_init_free_highmem(void)
{
#ifdef CONFIG_HIGHMEM
unsigned long tmp;

if (cpu_has_dc_aliases)
return;

for (tmp = highstart_pfn; tmp < highend_pfn; tmp++) {
struct page *page = pfn_to_page(tmp);

if (!memblock_is_memory(PFN_PHYS(tmp)))
SetPageReserved(page);
else
free_highmem_page(page);
}
#endif
}


Christophe

Re: Fragmented physical memory on powerpc/32

2022-09-14 Thread Mike Rapoport
On Tue, Sep 13, 2022 at 02:36:13PM +0200, Christophe Leroy wrote:
> 
> 
> Le 13/09/2022 à 08:11, Christophe Leroy a écrit :
> > 
> > 
> > Le 12/09/2022 à 23:16, Pali Rohár a écrit :
> > > > 
> > > > My guess would be that something went wrong in the linear map
> > > > setup, but it
> > > > won't hurt running with "memblock=debug" added to the kernel
> > > > command line
> > > > to see if there is anything suspicious there.
> > > 
> > > Here is boot log on serial console with memblock=debug command line:
> > > 
> > ...
> > > 
> > > Do you need something more for debug?
> > 
> > Can you send me the 'vmlinux' used to generate the above Oops so that I
> > can see exactly where we are in function mem_init().
> > 
> > And could you also try without CONFIG_HIGHMEM just in case.
> > 
> 
> I looked at the vmlinux you sent me, the problem is in the loop for highmem
> in mem_init(). It crashes in the call to free_highmem_page()
> 
> #ifdef CONFIG_HIGHMEM
>   {
>   unsigned long pfn, highmem_mapnr;
> 
>   highmem_mapnr = lowmem_end_addr >> PAGE_SHIFT;
>   for (pfn = highmem_mapnr; pfn < max_mapnr; ++pfn) {
>   phys_addr_t paddr = (phys_addr_t)pfn << PAGE_SHIFT;
>   struct page *page = pfn_to_page(pfn);
>   if (!memblock_is_reserved(paddr))
>   free_highmem_page(page);
>   }
>   }
> #endif /* CONFIG_HIGHMEM */
> 
> 
> As far as I can see in the memblock debug lines, the holes don't seem to be
> marked as reserved by memblock. So is the above valid? Other architectures
> seem to do it differently.
> 
> Can you try by replacing !memblock_is_reserved(paddr) by
> memblock_is_memory(paddr) ?

The holes should not be marked as reserved, we just need to loop over the
memory ranges rather than over pfns. Then the holes will be taken into
account.

I believe arm and xtensa got this right:

(from arch/arm/mm/init.c)

static void __init free_highpages(void)
{
#ifdef CONFIG_HIGHMEM
unsigned long max_low = max_low_pfn;
phys_addr_t range_start, range_end;
u64 i;

/* set highmem page free */
for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
&range_start, &range_end, NULL) {
unsigned long start = PFN_UP(range_start);
unsigned long end = PFN_DOWN(range_end);

/* Ignore complete lowmem entries */
if (end <= max_low)
continue;

/* Truncate partial highmem entries */
if (start < max_low)
start = max_low;

for (; start < end; start++)
free_highmem_page(pfn_to_page(start));
}
#endif
}

 
> Thanks
> Christophe
> 

-- 
Sincerely yours,
Mike.


[PATCH v2] hmm-tests: Fix migrate_dirty_page test

2022-09-14 Thread Alistair Popple
As noted by John Hubbard the original test relied on side effects of the
implementation of migrate_vma_setup() to detect if pages had been
swapped to disk or not. This is subject to change in future so
explicitly check for swap entries via pagemap instead. Fix a spelling
mistake while we're at it.

Signed-off-by: Alistair Popple 
Fixes: 5cc88e844e87 ("selftests/hmm-tests: add test for dirty bits")
Reviewed-by: Mika Penttilä 

---

Changes for v2:

 - Added Mika's Reviewed-by (Thanks!)
 - Moved pagemap checks into vm_util.c as suggested by David H.
---
 tools/testing/selftests/vm/Makefile|  1 +
 tools/testing/selftests/vm/hmm-tests.c | 28 ++
 tools/testing/selftests/vm/vm_util.c   | 22 
 tools/testing/selftests/vm/vm_util.h   |  2 ++
 4 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index d9fa6a9ea584..17e36129efd2 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -100,6 +100,7 @@ include ../lib.mk
 $(OUTPUT)/madv_populate: vm_util.c
 $(OUTPUT)/soft-dirty: vm_util.c
 $(OUTPUT)/split_huge_page_test: vm_util.c
+$(OUTPUT)/hmm-tests: vm_util.c
 
 ifeq ($(MACHINE),x86_64)
 BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
diff --git a/tools/testing/selftests/vm/hmm-tests.c 
b/tools/testing/selftests/vm/hmm-tests.c
index 70fdb49b59ed..509fe8c5158e 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -37,6 +37,7 @@
  */
 #include "../../../../lib/test_hmm_uapi.h"
 #include "../../../../mm/gup_test.h"
+#include "vm_util.h"
 
 struct hmm_buffer {
void*ptr;
@@ -1261,9 +1262,24 @@ static int destroy_cgroup(void)
return 0;
 }
 
+/* Returns true if at least one page in the range is on swap */
+static bool pages_swapped(void *ptr, size_t size)
+{
+   int fd = open("/proc/self/pagemap", O_RDONLY);
+   bool ret;
+
+   if (fd < 0)
+   return false;
+
+   ret = pagemap_range_some_swapped(fd, ptr, size);
+   close(fd);
+
+   return ret;
+}
+
 /*
  * Try and migrate a dirty page that has previously been swapped to disk. This
- * checks that we don't loose dirty bits.
+ * checks that we don't lose dirty bits.
  */
 TEST_F(hmm, migrate_dirty_page)
 {
@@ -1300,6 +1316,10 @@ TEST_F(hmm, migrate_dirty_page)
 
ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
 
+   /* Make sure at least some pages got paged to disk. */
+   if (!pages_swapped(buffer->ptr, size))
+   SKIP(return, "Pages weren't swapped when they should have been");
+
/* Fault pages back in from swap as clean pages */
for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
tmp += ptr[i];
@@ -1309,10 +1329,10 @@ TEST_F(hmm, migrate_dirty_page)
ptr[i] = i;
 
/*
-* Attempt to migrate memory to device, which should fail because
-* hopefully some pages are backed by swap storage.
+* Attempt to migrate memory to device. This might fail if some pages
+* are/were backed by swap but that's ok.
 */
-   ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));
+   hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 
ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
 
diff --git a/tools/testing/selftests/vm/vm_util.c 
b/tools/testing/selftests/vm/vm_util.c
index b58ab11a7a30..2768d4f3de4c 100644
--- a/tools/testing/selftests/vm/vm_util.c
+++ b/tools/testing/selftests/vm/vm_util.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include 
 #include 
 #include 
 #include "../kselftest.h"
@@ -20,6 +21,14 @@ uint64_t pagemap_get_entry(int fd, char *start)
return entry;
 }
 
+bool pagemap_is_swapped(int fd, char *start)
+{
+   uint64_t entry = pagemap_get_entry(fd, start);
+
+   // Check if swap entry bit (62nd bit) is set
+   return entry & 0x4000000000000000ull;
+}
+
 bool pagemap_is_softdirty(int fd, char *start)
 {
uint64_t entry = pagemap_get_entry(fd, start);
@@ -28,6 +37,19 @@ bool pagemap_is_softdirty(int fd, char *start)
return entry & 0x0080000000000000ull;
 }
 
+/* Returns true if at least one page in the range is in swap */
+bool pagemap_range_some_swapped(int fd, char *start, size_t len)
+{
+   unsigned long i;
+   unsigned long npages = len / getpagesize();
+
+   for (i = 0; i < npages; i++)
+   if (pagemap_is_swapped(fd, start + i * getpagesize()))
+   return true;
+
+   return false;
+}
+
 void clear_softdirty(void)
 {
int ret;
diff --git a/tools/testing/selftests/vm/vm_util.h 
b/tools/testing/selftests/vm/vm_util.h
index 2e512bd57ae1..307e11f72341 100644
--- a/tools/testing/selftests/vm/vm_util.h
+++ b/tools/testing/selftests/vm/vm_util.h
@@ -4,6 +4,8 @@
 
 uint64_t pagemap_get_entry(int fd, 

Re: [PATCH] powerpc/pseries: add lparctl driver for platform-specific functions

2022-09-14 Thread Michal Suchánek
On Tue, Sep 13, 2022 at 12:02:42PM -0500, Nathan Lynch wrote:
> Michal Suchánek  writes:
> > On Tue, Sep 13, 2022 at 10:59:56AM -0500, Nathan Lynch wrote:
> >> Michal Suchánek  writes:
> >> 
> >> > On Fri, Aug 12, 2022 at 02:14:21PM -0500, Nathan Lynch wrote:
> >> >> Laurent Dufour  writes:
> >> >> > Le 30/07/2022 à 02:04, Nathan Lynch a écrit :
> >> >> >> +static long lparctl_get_sysparm(struct lparctl_get_system_parameter 
> >> >> >> __user *argp)
> >> >> >> +{
> >> >> >> +struct lparctl_get_system_parameter *gsp;
> >> >> >> +long ret;
> >> >> >> +int fwrc;
> >> >> >> +
> >> >> >> +/*
> >> >> >> + * Special case to allow user space to probe the command.
> >> >> >> + */
> >> >> >> +if (argp == NULL)
> >> >> >> +return 0;
> >> >> >> +
> >> >> >> +gsp = memdup_user(argp, sizeof(*gsp));
> >> >> >> +if (IS_ERR(gsp)) {
> >> >> >> +ret = PTR_ERR(gsp);
> >> >> >> +goto err_return;
> >> >> >> +}
> >> >> >> +
> >> >> >> +ret = -EINVAL;
> >> >> >> +if (gsp->rtas_status != 0)
> >> >> >> +goto err_free;
> >> >> >> +
> >> >> >> +do {
> >> >> >> +static_assert(sizeof(gsp->data) <= 
> >> >> >> sizeof(rtas_data_buf));
> >> >> >> +
> >> >> >> +spin_lock(_data_buf_lock);
> >> >> >> +memset(rtas_data_buf, 0, sizeof(rtas_data_buf));
> >> >> >> +memcpy(rtas_data_buf, gsp->data, sizeof(gsp->data));
> >> >> >> +fwrc = 
> >> >> >> rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
> >> >> >> + NULL, gsp->token, __pa(rtas_data_buf),
> >> >> >> + sizeof(gsp->data));
> >> >> >> +if (fwrc == 0)
> >> >> >> +memcpy(gsp->data, rtas_data_buf, 
> >> >> >> sizeof(gsp->data));
> >> >> >
> >> >> > May be the amount of data copied out to the user space could be
> >> >> > gsp->length. This would prevent copying 4K bytes all the time.
> >> >> >
> >> >> > In a more general way, the size of the RTAS buffer is quite big, and 
> >> >> > I'm
> >> >> > wondering if all the data need to be copied back and forth to the 
> >> >> > kernel.
> >> >> >
> >> >> > Unless there are a high frequency of calls this doesn't make sense, 
> >> >> > and
> >> >> > keeping the code simple might be the best way. Otherwise limiting the 
> >> >> > bytes
> >> >> > copied could help a bit.
> >> >> 
> >> >> This is not intended to be a high-bandwidth interface and I don't think
> >> >> there's much of a performance concern here, so I'd rather just keep the
> >> >> copy sizes involved constant.
> >> >
> >> > But that's absolutely horrible!
> >> 
> >> ?
> >> 
> >> > The user wants the VPD data, all of it. And you only give one page with
> >> > this interface.
> >> 
> >> The code here is for system parameters, which have a known maximum size,
> >> unlike VPD. There's no code for VPD retrieval in this patch.
> >
> > But we do need to support the calls that return multiple pages of data.
> >
> > If the new driver supports only the simple calls it's a failure.
> 
> Michal, will you please moderate your tone? I think you can communicate
> your concerns without calling my work "absolutely horrible" or a
> "failure". Thanks.

Sorry, it's not a good wording.

> Anyway, of course I intend to support the more complex calls, but
> supporting the simple calls actually unbreaks a lot of stuff.

The thing is that supporting calls that return more than one page of
data is absolutely required, and this interface built around fixed size
data transfer can't do it.

So it sounds like a ticket for redoing the driver right after it's
implemented, or ending up with two subtly different interfaces - one for
the calls that can return multiple pages of data, and one for the simple
calls.

That does not sound like a good idea at all to me.

> 
> >> But I'm happy to constructively discuss how a VPD ioctl interface should
> >> work.
> >> 
> >> > Worse, the call is not reentrant so you need to lock against other users
> >> > calling the call while the current caller is retrieving the inidividual
> >> > pagaes.
> >> >
> >> > You could do that per process, but then processes with userspace
> >> > threading would want the data as well so you would have to save the
> >> > arguments of the last call, and compare to arguments of any subsequent
> >> > call to determine if you can let it pass or block.
> >> >
> >> > And when you do all that there will be a process that retrieves a couple
> >> > of pages and goes out for lunch or loses interest completely, blocking
> >> > out everyone from accessing the interface at all.
> >> 
> >> Right, the ibm,get-vpd RTAS function is tricky to expose to user space.
> >> 
> >> It needs to be called repeatedly until all data has been returned, 4KB
> >> at a time.
> >> 
> >> Only one ibm,get-vpd sequence can be in progress at any time. If an
> >> ibm,get-vpd sequence is begun while another sequence is already
> >> outstanding,
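
As a rough illustration of the sequencing problem described above, here is a toy stand-in for such a call-until-done loop. The names, chunk size, and return convention are invented for the sketch; the real ibm,get-vpd RTAS call has its own status codes and continuation token, and — as noted — only one sequence may be outstanding at a time, which this sketch does not attempt to enforce:

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 8	/* tiny chunk instead of 4KB so the sequencing is visible */

static const char fake_vpd[] = "keyword=VALUE;serial=12345";

/* Stub for one call in a get-vpd-style sequence: fills 'buf' with the
 * next chunk of data, returns the number of bytes written, or 0 when
 * the sequence is complete. 'seq' carries continuation state between
 * calls, standing in for the RTAS sequence number. */
static size_t get_vpd_chunk(size_t *seq, char *buf, size_t buflen)
{
	size_t total = sizeof(fake_vpd) - 1;
	size_t n;

	if (*seq >= total)
		return 0;
	n = total - *seq;
	if (n > buflen)
		n = buflen;
	memcpy(buf, fake_vpd + *seq, n);
	*seq += n;
	return n;
}

/* Caller loop: keep calling until the sequence reports completion,
 * accumulating the chunks. Returns the total number of bytes read. */
static size_t read_all_vpd(char *out, size_t outlen)
{
	char chunk[CHUNK];
	size_t seq = 0, off = 0, n;

	while ((n = get_vpd_chunk(&seq, chunk, sizeof(chunk))) > 0) {
		if (off + n > outlen)
			break;
		memcpy(out + off, chunk, n);
		off += n;
	}
	return off;
}
```

The fixed-size, one-shot ioctl in the patch under discussion has no place to carry the `seq` continuation state, which is the structural objection being raised.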