Re: [PATCH] powerpc/64/interrupt: Fix return to masked context after hard-mask irq becomes pending

2022-05-25 Thread Sachin Sant



> On 26-May-2022, at 10:20 AM, Nicholas Piggin  wrote:
> 
> Excerpts from Sachin Sant's message of March 9, 2022 6:37 pm:
>> 
>> 
>>> On 07-Mar-2022, at 8:21 PM, Nicholas Piggin  wrote:
>>> 
>>> When a synchronous interrupt[1] is taken in a local_irq_disable() region
>>> which has MSR[EE]=1, the interrupt handler will enable MSR[EE] as part
>>> of enabling MSR[RI], for performance and profiling reasons.
>>> 
>>> [1] Typically a hash fault, but in error cases this could be a page
>>> fault or facility unavailable as well.
>>> 
>>> If an asynchronous interrupt hits here and its masked handler requires
>>> MSR[EE] to be cleared (it is a PACA_IRQ_MUST_HARD_MASK interrupt), then
>>> MSR[EE] must remain disabled until that pending interrupt is replayed.
>>> The problem is that the MSR of the original context has MSR[EE]=1, so
>>> returning directly to that causes MSR[EE] to be enabled while the
>>> interrupt is still pending.
>>> 
>>> This issue was hacked around in the interrupt return code by just
>>> clearing the hard mask to avoid a warning, and taking the masked
>>> interrupt again immediately in the return context, which would disable
>>> MSR[EE]. However in the case of a pending PMI, it is possible that it is
>>> not masked in the calling context so the full handler will be run while
>>> there is a PMI pending, and this confuses the perf code and causes
>>> warnings with its PMI pending management.
>>> 
>>> Fix this by removing the hack, and adjusting the return MSR if it has
>>> MSR[EE]=1 and there is a PACA_IRQ_MUST_HARD_MASK interrupt pending.
>>> 
>>> Fixes: 4423eb5ae32e ("powerpc/64/interrupt: make normal synchronous 
>>> interrupts enable MSR[EE] if possible")
>>> Signed-off-by: Nicholas Piggin 
>>> ---
>>> arch/powerpc/kernel/interrupt.c | 10 -
>>> arch/powerpc/kernel/interrupt_64.S | 34 +++---
>>> 2 files changed, 31 insertions(+), 13 deletions(-)
>> 
>> With this patch on top of powerpc/merge following rcu stalls are seen while
>> running powerpc selftests (mitigation-patching) on P9. I don’t see this
>> issue on P10.
>> 
>> [ 1841.248838] link-stack-flush: flush disabled.
>> [ 1841.248905] count-cache-flush: software flush enabled.
>> [ 1841.248911] link-stack-flush: software flush enabled.
>> [ 1901.249668] rcu: INFO: rcu_sched self-detected stall on CPU
>> [ 1901.249703] rcu:  12-...!: (5999 ticks this GP) 
>> idle=d0f/1/0x4002 softirq=37019/37027 fqs=0 
>> [ 1901.249720]   (t=6000 jiffies g=106273 q=1624)
>> [ 1901.249729] rcu: rcu_sched kthread starved for 6000 jiffies! g106273 f0x0 
>> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=6
>> [ 1901.249743] rcu:  Unless rcu_sched kthread gets sufficient CPU time, OOM 
>> is now expected behavior.
>> [ 1901.249752] rcu: RCU grace-period kthread stack dump:
>> [ 1901.249759] task:rcu_sched state:R running task stack: 0 pid: 11 ppid: 2 
>> flags:0x0800
>> [ 1901.249775] Call Trace:
>> [ 1901.249781] [c76ab870] [0001] 0x1 (unreliable)
>> [ 1901.249795] [c76aba60] [c001e508] __switch_to+0x288/0x4a0
>> [ 1901.249811] [c76abac0] [c0d15950] __schedule+0x2c0/0x950
>> [ 1901.249824] [c76abb80] [c0d16048] schedule+0x68/0x130
>> [ 1901.249836] [c76abbb0] [c0d1df1c] 
>> schedule_timeout+0x25c/0x3f0
>> [ 1901.249849] [c76abc90] [c021522c] 
>> rcu_gp_fqs_loop+0x2fc/0x3e0
>> [ 1901.249863] [c76abd40] [c021a0fc] 
>> rcu_gp_kthread+0x13c/0x180
>> [ 1901.249875] [c76abdc0] [c018ce94] kthread+0x124/0x130
>> [ 1901.249887] [c76abe10] [c000cec0] 
>> ret_from_kernel_thread+0x5c/0x64
>> [ 1901.249900] rcu: Stack dump where RCU GP kthread last ran:
>> [ 1901.249908] Sending NMI from CPU 12 to CPUs 6:
>> [ 1901.249944] NMI backtrace for cpu 6
>> [ 1901.249957] CPU: 6 PID: 40 Comm: migration/6 Not tainted 
>> 5.17.0-rc6-00327-g782b30d101f6-dirty #3
>> [ 1901.249971] Stopper: multi_cpu_stop+0x0/0x230 <- 
>> stop_machine_cpuslocked+0x188/0x1e0
>> [ 1901.249987] NIP: c0d14e0c LR: c0214280 CTR: 
>> c02914f0
>> [ 1901.249996] REGS: c785b980 TRAP: 0500 Not tainted 
>> (5.17.0-rc6-00327-g782b30d101f6-dirty)
>> [ 1901.250007] MSR: 8280b033  CR: 
>> 48002822 XER: 
>> [ 1901.250038] CFAR:  IRQMASK: 0 
>> [ 1901.250038] GPR00: c029165c c785bc20 c2a2 
>> 0002 
>> [ 1901.250038] GPR04:  c009fb60ab80 c009fb60ab70 
>> c001e508 
>> [ 1901.250038] GPR08:  c009fb68f5a8 0009f94c 
>> 0098967f 
>> [ 1901.250038] GPR12:  c0001ec57a00 c018cd78 
>> c7234f80 
>> [ 1901.250038] GPR16:    
>>  
>> [ 1901.250038] GPR20:    
>> 0001 
>> [ 1901.250038] GPR24: 0002 

Re: [PATCH] powerpc/64/interrupt: Fix return to masked context after hard-mask irq becomes pending

2022-05-25 Thread Nicholas Piggin
Excerpts from Sachin Sant's message of March 9, 2022 6:37 pm:
> 
> 
>> On 07-Mar-2022, at 8:21 PM, Nicholas Piggin  wrote:
>> 
>> When a synchronous interrupt[1] is taken in a local_irq_disable() region
>> which has MSR[EE]=1, the interrupt handler will enable MSR[EE] as part
>> of enabling MSR[RI], for performance and profiling reasons.
>> 
>> [1] Typically a hash fault, but in error cases this could be a page
>>fault or facility unavailable as well.
>> 
>> If an asynchronous interrupt hits here and its masked handler requires
>> MSR[EE] to be cleared (it is a PACA_IRQ_MUST_HARD_MASK interrupt), then
>> MSR[EE] must remain disabled until that pending interrupt is replayed.
>> The problem is that the MSR of the original context has MSR[EE]=1, so
>> returning directly to that causes MSR[EE] to be enabled while the
>> interrupt is still pending.
>> 
>> This issue was hacked around in the interrupt return code by just
>> clearing the hard mask to avoid a warning, and taking the masked
>> interrupt again immediately in the return context, which would disable
>> MSR[EE]. However in the case of a pending PMI, it is possible that it is
>> not masked in the calling context so the full handler will be run while
>> there is a PMI pending, and this confuses the perf code and causes
>> warnings with its PMI pending management.
>> 
>> Fix this by removing the hack, and adjusting the return MSR if it has
>> MSR[EE]=1 and there is a PACA_IRQ_MUST_HARD_MASK interrupt pending.
>> 
>> Fixes: 4423eb5ae32e ("powerpc/64/interrupt: make normal synchronous 
>> interrupts enable MSR[EE] if possible")
>> Signed-off-by: Nicholas Piggin 
>> ---
>> arch/powerpc/kernel/interrupt.c| 10 -
>> arch/powerpc/kernel/interrupt_64.S | 34 +++---
>> 2 files changed, 31 insertions(+), 13 deletions(-)
> 
> With this patch on top of powerpc/merge following rcu stalls are seen while
> running powerpc selftests (mitigation-patching) on P9. I don’t see this
> issue on P10.
> 
> [ 1841.248838] link-stack-flush: flush disabled.
> [ 1841.248905] count-cache-flush: software flush enabled.
> [ 1841.248911] link-stack-flush: software flush enabled.
> [ 1901.249668] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 1901.249703] rcu:   12-...!: (5999 ticks this GP) 
> idle=d0f/1/0x4002 softirq=37019/37027 fqs=0 
> [ 1901.249720](t=6000 jiffies g=106273 q=1624)
> [ 1901.249729] rcu: rcu_sched kthread starved for 6000 jiffies! g106273 f0x0 
> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=6
> [ 1901.249743] rcu:   Unless rcu_sched kthread gets sufficient CPU time, OOM 
> is now expected behavior.
> [ 1901.249752] rcu: RCU grace-period kthread stack dump:
> [ 1901.249759] task:rcu_sched   state:R  running task stack:0 
> pid:   11 ppid: 2 flags:0x0800
> [ 1901.249775] Call Trace:
> [ 1901.249781] [c76ab870] [0001] 0x1 (unreliable)
> [ 1901.249795] [c76aba60] [c001e508] __switch_to+0x288/0x4a0
> [ 1901.249811] [c76abac0] [c0d15950] __schedule+0x2c0/0x950
> [ 1901.249824] [c76abb80] [c0d16048] schedule+0x68/0x130
> [ 1901.249836] [c76abbb0] [c0d1df1c] 
> schedule_timeout+0x25c/0x3f0
> [ 1901.249849] [c76abc90] [c021522c] 
> rcu_gp_fqs_loop+0x2fc/0x3e0
> [ 1901.249863] [c76abd40] [c021a0fc] 
> rcu_gp_kthread+0x13c/0x180
> [ 1901.249875] [c76abdc0] [c018ce94] kthread+0x124/0x130
> [ 1901.249887] [c76abe10] [c000cec0] 
> ret_from_kernel_thread+0x5c/0x64
> [ 1901.249900] rcu: Stack dump where RCU GP kthread last ran:
> [ 1901.249908] Sending NMI from CPU 12 to CPUs 6:
> [ 1901.249944] NMI backtrace for cpu 6
> [ 1901.249957] CPU: 6 PID: 40 Comm: migration/6 Not tainted 
> 5.17.0-rc6-00327-g782b30d101f6-dirty #3
> [ 1901.249971] Stopper: multi_cpu_stop+0x0/0x230 <- 
> stop_machine_cpuslocked+0x188/0x1e0
> [ 1901.249987] NIP:  c0d14e0c LR: c0214280 CTR: 
> c02914f0
> [ 1901.249996] REGS: c785b980 TRAP: 0500   Not tainted  
> (5.17.0-rc6-00327-g782b30d101f6-dirty)
> [ 1901.250007] MSR:  8280b033   CR: 
> 48002822  XER: 
> [ 1901.250038] CFAR:  IRQMASK: 0 
> [ 1901.250038] GPR00: c029165c c785bc20 c2a2 
> 0002 
> [ 1901.250038] GPR04:  c009fb60ab80 c009fb60ab70 
> c001e508 
> [ 1901.250038] GPR08:  c009fb68f5a8 0009f94c 
> 0098967f 
> [ 1901.250038] GPR12:  c0001ec57a00 c018cd78 
> c7234f80 
> [ 1901.250038] GPR16:    
>  
> [ 1901.250038] GPR20:    
> 0001 
> [ 1901.250038] GPR24: 0002 0003  
> c2a62138 
> [ 1901.250038] GPR28: c000ee70faf8 0001 c000ee70fb1c 
> 
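
For illustration only: a minimal userspace sketch of the decision the
commit message describes (clear MSR[EE] in the return MSR when it is set
and a must-hard-mask interrupt is pending). The helper name
sketch_return_msr() is hypothetical, and the PACA_IRQ_* bit values below
are assumptions for the example; the kernel headers hold the real
definitions, and the actual fix lives in the interrupt return path, not
in a helper like this.

```c
#include <stdint.h>

/* Illustrative values only; the kernel headers are authoritative. */
#define MSR_EE                  0x8000ULL	/* external interrupt enable */
#define PACA_IRQ_EE             0x04
#define PACA_IRQ_PMI            0x20
#define PACA_IRQ_MUST_HARD_MASK (PACA_IRQ_EE | PACA_IRQ_PMI)

/*
 * Hypothetical helper: choose the MSR to return with, given the MSR
 * saved from the interrupted context and the pending-interrupt bits.
 */
static uint64_t sketch_return_msr(uint64_t saved_msr, unsigned int irq_happened)
{
	/* A must-hard-mask interrupt is pending: keep EE off until replay. */
	if ((saved_msr & MSR_EE) && (irq_happened & PACA_IRQ_MUST_HARD_MASK))
		return saved_msr & ~MSR_EE;

	/* Nothing forces a hard mask: return to the context unchanged. */
	return saved_msr;
}
```

Only the EE bit is touched, so every other bit of the saved MSR is
returned as it was taken.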

Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-25 Thread Vineet Gupta





On 5/24/22 16:45, Peter Xu wrote:

I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page.  It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

Then after that throttling we return VM_FAULT_RETRY.

We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.

However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.

This not only slows down page faults for shared file-backed mappings, but also
adds mmap lock contention which is in most cases not needed at all.

To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.

To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock.  It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.

This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying a 400MB mmap()ed shmem file; these are the
times it needs:

   Before: 650.980 ms (+-1.94%)
   After:  569.396 ms (+-1.38%)

I believe it could help more than that.

We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault); the rest of the changes in the page fault
handlers should be relatively straightforward.

Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.

Signed-off-by: Peter Xu
---

v3:
- Rebase to akpm/mm-unstable
- Copy arch maintainers
---
   arch/arc/mm/fault.c   |  4 


Acked-by: Vineet Gupta 

Thx,
-Vineet
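
For illustration only: a minimal userspace sketch of how an arch fault
handler might branch on the result once VM_FAULT_COMPLETED exists, as
the patch description outlines. The enum and sketch_after_fault() are
hypothetical names, and the flag values are assumptions for the example
rather than something to rely on.

```c
/* Illustrative flag values; the kernel's mm headers are authoritative. */
#define VM_FAULT_RETRY      0x0400u
#define VM_FAULT_COMPLETED  0x4000u

enum sketch_action {
	SKETCH_DONE_LOCK_RELEASED,	/* fault finished, mmap lock already dropped */
	SKETCH_RETRY,			/* mmap lock dropped, fault must be redone */
	SKETCH_DONE_UNLOCK,		/* fault finished, caller still holds the lock */
};

/* Hypothetical helper: decide what the caller does after the fault returns. */
static enum sketch_action sketch_after_fault(unsigned int fault)
{
	/* Completed takes priority: no retry and no unlock are needed. */
	if (fault & VM_FAULT_COMPLETED)
		return SKETCH_DONE_LOCK_RELEASED;
	if (fault & VM_FAULT_RETRY)
		return SKETCH_RETRY;
	return SKETCH_DONE_UNLOCK;
}
```

The point of the first branch is that the second trip through the fault
path (retaking the mmap read lock and walking the page table) is skipped.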


Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-25 Thread Johannes Weiner


On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page.  It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
> 
> Then after that throttling we return VM_FAULT_RETRY.
> 
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
> 
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
> 
> This not only slows down page faults for shared file-backed mappings, but also
> adds mmap lock contention which is in most cases not needed at all.
> 
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
> 
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock.  It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
> 
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying a 400MB mmap()ed shmem file; these are the
> times it needs:
> 
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
> 
> I believe it could help more than that.
> 
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault); the rest of the changes in the page fault
> handlers should be relatively straightforward.
> 
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
> 
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
> 
> Signed-off-by: Peter Xu 

Acked-by: Johannes Weiner 


Re: [PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds

2022-05-25 Thread Rob Herring
On Tue, May 24, 2022 at 11:20:42AM +0530, Vaibhav Jain wrote:
> Presently ima_get_kexec_buffer() doesn't check if the previous kernel's
> ima-kexec-buffer lies outside the addressable memory range. This can result
> in a kernel panic if the new kernel is booted with 'mem=X' arg and the
> ima-kexec-buffer was allocated beyond that range by the previous kernel.
> The panic is usually of the form below:
> 
> $ sudo kexec --initrd initrd vmlinux --append='mem=16G'
> 
> 
>  BUG: Unable to handle kernel data access on read at 0xc000c01fff7f
>  Faulting instruction address: 0xc0837974
>  Oops: Kernel access of bad area, sig: 11 [#1]
> 
>  NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0
>  LR [c083b55c] ima_load_kexec_buffer+0xac/0x160
>  Call Trace:
>  [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160
>  [c371fb00] [c20512c4] ima_init+0x80/0x108
>  [c371fb70] [c20514dc] init_ima+0x4c/0x120
>  [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0
>  [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec
>  [c371fda0] [c00128a4] kernel_init+0x34/0x1b0
>  [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64
>  Instruction dump:
>  f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330
>  7c0802a6 fb610198 7c9b2378 f80101d0  2c090001 40820614 e9240010
>  ---[ end trace  ]---
> 
> Fix this issue by checking returned PFN range of previous kernel's
> ima-kexec-buffer with pfn_valid to ensure correct memory bounds.
> 
> Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the 
> previous kernel")
> Cc: Frank Rowand 
> Cc: Prakhar Srivastava 
> Cc: Lakshmi Ramasubramanian 
> Cc: Thiago Jung Bauermann 
> Cc: Rob Herring 
> Signed-off-by: Vaibhav Jain 
> 
> ---
> Changelog
> ==
> 
> v2:
> * Instead of using memblock to determine the valid bounds, use pfn_valid() to
> do so, since memblock may not be available late after the kernel init. [ Mpe ]
> * Changed the patch prefix from 'powerpc' to 'of' [ Mpe ]
> * Updated the 'Fixes' tag to point to correct commit that introduced this
> function. [ Rob ]
> * Fixed some whitespace/tab issues in the patch description [ Rob ]
> * Added another check to ensure that 'tmp_size' for ima-kexec-buffer is > 0
> ---
>  drivers/of/kexec.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
> index 8d374cc552be..879e984fe901 100644
> --- a/drivers/of/kexec.c
> +++ b/drivers/of/kexec.c
> @@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
>  {
>   int ret, len;
>   unsigned long tmp_addr;
> + unsigned int start_pfn, end_pfn;
>   size_t tmp_size;
>   const void *prop;
>  
> @@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
>   if (ret)
>   return ret;
>  
> + /* Do some sanity on the returned size for the ima-kexec buffer */
> + if (!tmp_size)
> + return -ENOENT;
> +
> + /*
> +  * Calculate the PFNs for the buffer and ensure
> +  * they are with in addressable memory.
> +  */
> + start_pfn = PHYS_PFN(tmp_addr);
> + end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1);
> + if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) {

pfn_valid() doesn't necessarily mean RAM, only that you have a struct page,
AIUI. Maybe page_is_ram() instead?

Thanks to Robin for this.

Rob
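
For illustration only: a minimal userspace sketch of the range check
under discussion, assuming 4K pages and a single RAM limit such as the
one 'mem=X' imposes. sketch_pfn_is_ram() is a hypothetical stand-in for
whichever predicate ends up being used (pfn_valid() or page_is_ram());
it is not either of those functions. The PFNs are kept in 64-bit
variables here so that large physical addresses are not truncated.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumes 4K pages, for illustration */

/* Hypothetical predicate: is this PFN below the usable RAM limit? */
static bool sketch_pfn_is_ram(uint64_t pfn, uint64_t ram_limit_bytes)
{
	return pfn < (ram_limit_bytes >> SKETCH_PAGE_SHIFT);
}

/* Check that the first and last byte of the buffer both fall in RAM. */
static bool sketch_buffer_in_ram(uint64_t addr, uint64_t size,
				 uint64_t ram_limit_bytes)
{
	uint64_t start_pfn, end_pfn;

	if (size == 0)
		return false;			/* nothing to restore */
	if (size - 1 > UINT64_MAX - addr)
		return false;			/* addr + size - 1 would wrap */

	start_pfn = addr >> SKETCH_PAGE_SHIFT;
	end_pfn = (addr + size - 1) >> SKETCH_PAGE_SHIFT;

	return sketch_pfn_is_ram(start_pfn, ram_limit_bytes) &&
	       sketch_pfn_is_ram(end_pfn, ram_limit_bytes);
}
```

With a 16G limit, a buffer that the previous kernel placed above 16G is
rejected instead of being dereferenced.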


Re: [PATCH V9 20/20] riscv: compat: Add COMPAT Kbuild skeletal support

2022-05-25 Thread Guo Ren
On Thu, May 26, 2022 at 3:37 AM Heiko Stübner  wrote:
>
> On Wednesday, 25 May 2022, 18:08:22 CEST, Guo Ren wrote:
> > Thx Heiko & Guenter,
> >
> > On Wed, May 25, 2022 at 7:10 PM Heiko Stübner  wrote:
> > >
> > > > On Wednesday, 25 May 2022, 12:57:30 CEST, Heiko Stübner wrote:
> > > > > On Wednesday, 25 May 2022, 00:06:46 CEST, Guenter Roeck wrote:
> > > > > On Wed, May 25, 2022 at 01:46:38AM +0800, Guo Ren wrote:
> > > > > [ ... ]
> > > > >
> > > > > > > > The problem comes from "__dls3's vdso decode part in musl's
> > > > > > > > ldso/dynlink.c". The ehdr->e_phnum & ehdr->e_phentsize are wrong.
> > > > > > > >
> > > > > > > > I think the root cause is musl's implementation with the wrong
> > > > > > > > elf parser. I would fix that soon.
> > > > > > > Not elf parser, it's "aux vector just past environ[]". I think I could
> > > > > > > solve this, but anyone who could help dig in is welcome.
> > > > > >
> > > > >
> > > > > I am not sure I understand what you are saying here. Point is that my
> > > > > root file system, generated with musl a year or so ago, crashes with
> > > > > your patch set applied. That is a regression, even if there is a bug
> > > > > in musl.
> > Thx for the report, it's a valuable regression for riscv-compat.
> >
> > > >
> > > > > Also as I said in the other part of the thread, the rootfs seems
> > > > > innocent, as my completely-standard Debian riscv64 rootfs is also
> > > > > affected.
> > > > >
> > > > > The merged version seems to be v12 [0] - not sure how this discussion
> > > > > ended up in v9, but I just tested this revision in two variants:
> > > >
> > > > - v5.17 + this v9 -> works nicely
> > >
> > > I take that back ... now going back to that build I somehow also run into
> > > that issue here ... will investigate more.
> > Yeah, it's my fault. I've fixed it up, please give it a try:
> >
> > https://lore.kernel.org/linux-riscv/20220525160404.2930984-1-guo...@kernel.org/T/#u
>
> very cool that you found the issue.
> I've tested your patch and it seems to fix the issue for me.
>
> Thanks for figuring out the cause
I should thank Guenter Roeck. It just surprised me that compat_vdso
could work with quite a lot of rv64 apps.

> Heiko
>
>
> > > > - v5.18-rc6 + this v9 (rebased onto it) -> breaks the boot
> > > >   The only rebase-conflict was with the introduction of restartable
> > > >   sequences and removal of the tracehook include, but turning 
> > > > CONFIG_RSEQ
> > > >   off doesn't seem to affect the breakage.
> > > >
> > > > So it looks like something changed between 5.17 and 5.18 that causes 
> > > > the issue.
> > > >
> > > >
> > > > Heiko
> > > >
> > > >
> > > > [0] 
> > > > https://lore.kernel.org/all/20220405071314.3225832-1-guo...@kernel.org/
> > > >
> > >
> > >
> > >
> > >
> >
> >
> >
>
>
>
>


-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/


Re: [GIT PULL] Modules fixes for v5.19-rc1

2022-05-25 Thread Luis Chamberlain
Sorry the subject should say "Modules changes".

I also forgot to itemize possible merge conflicts and resolutions
which linux-next reported:

powerpc:
https://lkml.kernel.org/r/20220520154055.7f964...@canb.auug.org.au

kbuild:
https://lkml.kernel.org/r/20220523120859.570f7...@canb.auug.org.au

  Luis


[GIT PULL] Modules fixes for v5.19-rc1

2022-05-25 Thread Luis Chamberlain
OK, finally some changes for modules. It is still pretty boring,
but I am hopeful that the cleanup will yield nice results in the
future as further cleanups will make the code much easier to
read, maintain and test. Perhaps the most exciting thing is
Christophe Leroy's CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC.
In reviewing Rick Edgecombe's prior work on enhancements for
special allocators I suspect this is going to help as module
space was the more complex aspect to deal with in his work.

AFAICT you *may* run into conflicts *if* bpf folks submit the
module_alloc_huge() stuff which I was still reviewing with Rick.
To my taste that effort seems to be going fast and I like to
take time to consider a proper interface for it which aligns well
with what others have in mind, especially considering what
other architectures might need. The VM_FLUSH_RESET_PERMS stuff was
what was loose there. It doesn't seem we can address that stuff in
a generic neat way yet, and so the x86 open codes its own solution
for it.

I suspect we'll also need more tests on the huge page front so that
if more module_alloc() users want to convert we can enable folks to
give more realistic performance information rather than loose
numbers. In the future I suspect we'll just generalize module_alloc()
to vmalloc_exec() as its users are growing and the technical debt
of not drawing a clean API for it is growing.

Let me know if there are any issues.

  Luis

The following changes since commit 3123109284176b1532874591f7c81f3837bbdc17:

  Linux 5.18-rc1 (2022-04-03 14:08:21 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ 
tags/modules-5.19-rc1

for you to fetch changes up to 7390b94a3c2d93272d6da4945b81a9cf78055b7b:

  module: merge check_exported_symbol() into find_exported_symbol_in_section() 
(2022-05-12 10:29:41 -0700)


Modules updates for v5.19-rc1

As promised, for v5.19 I queued up quite a bit of work for modules, but
still with a pretty conservative eye. These changes have been soaking on
modules-next (and so linux-next) for quite some time, the code shift was
merged onto modules-next on March 22, and the last patch was queued on May
5th.

The following are the highlights of what bells and whistles we will get for
v5.19:

 1) It was time to tidy up kernel/module.c and one way of starting with
that effort was to split it up into files. At my request Aaron Tomlin
spearheaded that effort with the goal to not introduce any
functional changes at all during that endeavour.  The penalty for the split
is +1322 bytes total, +112 bytes in data, +1210 bytes in text while
bss is unchanged. One of the benefits of this other than helping
make the code easier to read and review is summoning more help on review
for changes with livepatching so kernel/module/livepatch.c is now
pegged as maintained by the live patching folks.

The before and after with just the move on a defconfig on x86-64:

 $ size kernel/module.o
    text    data     bss     dec     hex filename
   38434    4540     104   43078    a846 kernel/module.o

 $ size -t kernel/module/*.o
    text    data     bss     dec     hex filename
    4785     120       0    4905    1329 kernel/module/kallsyms.o
   28577    4416     104   33097    8149 kernel/module/main.o
    1158       8       0    1166     48e kernel/module/procfs.o
     902     108       0    1010     3f2 kernel/module/strict_rwx.o
    3390       0       0    3390     d3e kernel/module/sysfs.o
     832       0       0     832     340 kernel/module/tree_lookup.o
   39644    4652     104   44400    ad70 (TOTALS)

 2) Aaron added module unload taint tracking (MODULE_UNLOAD_TAINT_TRACKING),
to enable tracking of unloaded modules which tainted the kernel.

 3) Christophe Leroy added CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
which lets architectures request having module data in the vmalloc
area instead of the module area. There are three reasons why an
architecture might want this:

a) On some architectures (like book3s/32) it is not possible to protect
   against execution on a page basis. The exec stuff can be mapped by
   different arch segment sizes (on book3s/32 that is 256M segments). By
   default the module area is in an Exec segment while vmalloc area is in
   a NoExec segment. Using vmalloc lets you muck with module data as
   NoExec on those architectures whereas before you could not.

b) By pushing more module data to vmalloc you also increase the
   probability of module text to remain within a closer distance
   from kernel core text and this reduces trampolines, this has been
   reported on arm first and powerpc folks are following that lead.

c) Free'ing module_alloc() (Exec by default) area leaves this
   exposed as Exec by default, some architectures 

Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]

2022-05-25 Thread Andrew Morton
On Fri, 20 May 2022 14:25:05 -0500 "Eric W. Biederman"  
wrote:

> > I am not strongly against taking off __weak, just wondering if there's
> > chance to fix it in recordmcount, and the cost comparing with kernel fix;
> > except of this issue, any other weakness of __weak. Noticed Andrew has
> > picked this patch, as a witness of this moment, raise a tiny concern.
> 
> I just don't see what else we can realistically do.

I think converting all of the kexec __weaks to use the ifdef approach
makes sense, if only because kexec is now using two different styles.

But for now, I'll send Naveen's v2 patch in to Linus to get us out of
trouble.

I'm thinking that we should add cc:stable to that patch as well, to
reduce the amount of problems which people experience when using newer
binutils on older kernels?
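
For illustration only: a minimal single-file sketch of the "ifdef
approach" mentioned above, where an architecture provides its own hook
plus a same-named macro, and the generic code supplies a default only
when that macro is absent. The sketch_* names are hypothetical and stand
in for the real kexec hooks; in a real tree the two halves would sit in
an arch header and a generic header respectively.

```c
#include <errno.h>

/* --- what an architecture header might provide (hypothetical names) --- */
static inline int sketch_arch_apply_relocations_add(int nrelocs)
{
	return nrelocs;	/* pretend every relocation was applied */
}
/* The same-named macro is how the generic code sees the override. */
#define sketch_arch_apply_relocations_add sketch_arch_apply_relocations_add

/* --- generic header: defaults only for hooks the arch left undefined --- */
#ifndef sketch_arch_apply_relocations_add
static inline int sketch_arch_apply_relocations_add(int nrelocs)
{
	(void)nrelocs;
	return -ENOEXEC;	/* RELA relocations unsupported by default */
}
#endif

#ifndef sketch_arch_apply_relocations
static inline int sketch_arch_apply_relocations(int nrelocs)
{
	(void)nrelocs;
	return -ENOEXEC;	/* REL relocations unsupported by default */
}
#endif
```

Unlike __weak, the choice is made by the preprocessor, so no weak symbol
ever reaches the object file.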



Re: [PATCH V9 20/20] riscv: compat: Add COMPAT Kbuild skeletal support

2022-05-25 Thread Heiko Stübner
On Wednesday, 25 May 2022, 18:08:22 CEST, Guo Ren wrote:
> Thx Heiko & Guenter,
> 
> On Wed, May 25, 2022 at 7:10 PM Heiko Stübner  wrote:
> >
> > > On Wednesday, 25 May 2022, 12:57:30 CEST, Heiko Stübner wrote:
> > > > On Wednesday, 25 May 2022, 00:06:46 CEST, Guenter Roeck wrote:
> > > > On Wed, May 25, 2022 at 01:46:38AM +0800, Guo Ren wrote:
> > > > [ ... ]
> > > >
> > > > > > > The problem comes from "__dls3's vdso decode part in musl's
> > > > > > ldso/dynlink.c". The ehdr->e_phnum & ehdr->e_phentsize are wrong.
> > > > > >
> > > > > > I think the root cause is from musl's implementation with the wrong
> > > > > > elf parser. I would fix that soon.
> > > > > Not elf parser, it's "aux vector just past environ[]". I think I could
> > > > > solve this, but anyone who could help dig in is welcome.
> > > > >
> > > >
> > > > I am not sure I understand what you are saying here. Point is that my
> > > > root file system, generated with musl a year or so ago, crashes with
> > > > your patch set applied. That is a regression, even if there is a bug
> > > > in musl.
> Thx for the report, it's a valuable regression for riscv-compat.
> 
> > >
> > > Also as I said in the other part of the thread, the rootfs seems innocent,
> > > as my completely-standard Debian riscv64 rootfs is also affected.
> > >
> > > > The merged version seems to be v12 [0] - not sure how this discussion
> > > ended up in v9, but I just tested this revision in two variants:
> > >
> > > - v5.17 + this v9 -> works nicely
> >
> > I take that back ... now going back to that build I somehow also run into
> > that issue here ... will investigate more.
> Yeah, it's my fault. I've fixed it up, please give it a try:
> 
> https://lore.kernel.org/linux-riscv/20220525160404.2930984-1-guo...@kernel.org/T/#u

very cool that you found the issue.
I've tested your patch and it seems to fix the issue for me.

Thanks for figuring out the cause
Heiko


> > > - v5.18-rc6 + this v9 (rebased onto it) -> breaks the boot
> > >   The only rebase-conflict was with the introduction of restartable
> > >   sequences and removal of the tracehook include, but turning CONFIG_RSEQ
> > >   off doesn't seem to affect the breakage.
> > >
> > > > So it looks like something changed between 5.17 and 5.18 that causes
> > > > the issue.
> > >
> > >
> > > Heiko
> > >
> > >
> > > [0] 
> > > https://lore.kernel.org/all/20220405071314.3225832-1-guo...@kernel.org/
> > >
> >
> >
> >
> >
> 
> 
> 






Re: [RFC PATCH v2 0/7] objtool: Enable and implement --mcount option on powerpc

2022-05-25 Thread Sathvika Vasireddy



On 25/05/22 23:09, Christophe Leroy wrote:

Hi Sathvika,

On 25/05/2022 at 12:14, Sathvika Vasireddy wrote:

Hi Christophe,

On 24/05/22 18:47, Christophe Leroy wrote:

This draft series adds PPC32 support to Sathvika's series.
Verified on pmac32 on QEMU.

It should in principle also work for PPC64 BE but for the time being
something goes wrong. In the beginning I had a segfault hence the first
patch. But I still get no mcount section in the files.

Since PPC64 BE uses older elfv1 ABI, it prepends a dot to symbols.
And so, the relocation records in case of PPC64BE point to "._mcount",
rather than just "_mcount". We should be looking for "._mcount" to be
able to generate mcount_loc section in the files.

Like:

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 70be5a72e838..7da5bf8c7236 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -2185,7 +2185,7 @@ static int classify_symbols(struct objtool_file
*file)
      if (arch_is_retpoline(func))
      func->retpoline_thunk = true;

-   if ((!strcmp(func->name, "__fentry__")) ||
(!strcmp(func->name, "_mcount")))
+   if ((!strcmp(func->name, "__fentry__")) ||
(!strcmp(func->name, "_mcount")) || (!strcmp(func->name, "._mcount")))
      func->fentry = true;

      if (is_profiling_func(func->name))


With this change, I could see __mcount_loc section being
generated in individual ppc64be object files.


Or should we implement an equivalent of arch_ftrace_match_adjust() in
objtool ?


Yeah, I think it makes more sense if we make it arch specific.
Thanks for the suggestion. I'll make this change in next revision :-)

- Sathvika
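
For illustration only: a minimal sketch of what an arch-specific name
adjustment could look like, in the spirit of the
arch_ftrace_match_adjust() suggestion above. Both function names and the
dot_symbols parameter are hypothetical; this is not the objtool change
itself, only one way to avoid hard-coding "._mcount" in common code.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical per-arch hook: on an ABI that prefixes function entry
 * symbols with a dot, skip that dot before names are compared.
 */
static const char *sketch_arch_adjust_name(const char *name, bool dot_symbols)
{
	if (dot_symbols && name[0] == '.')
		return name + 1;
	return name;
}

/* Common code then only knows the undecorated profiling entry names. */
static bool sketch_is_fentry(const char *name, bool dot_symbols)
{
	name = sketch_arch_adjust_name(name, dot_symbols);
	return !strcmp(name, "__fentry__") || !strcmp(name, "_mcount");
}
```

With dot_symbols set, "._mcount" and "_mcount" are both recognised;
without it, only the plain name matches.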




Re: [RFC PATCH v2 0/7] objtool: Enable and implement --mcount option on powerpc

2022-05-25 Thread Christophe Leroy
Hi Sathvika,

On 25/05/2022 at 12:14, Sathvika Vasireddy wrote:
> Hi Christophe,
> 
> On 24/05/22 18:47, Christophe Leroy wrote:
>> This draft series adds PPC32 support to Sathvika's series.
>> Verified on pmac32 on QEMU.
>>
>> It should in principle also work for PPC64 BE but for the time being
>> something goes wrong. In the beginning I had a segfault hence the first
>> patch. But I still get no mcount section in the files.
> Since PPC64 BE uses older elfv1 ABI, it prepends a dot to symbols.
> And so, the relocation records in case of PPC64BE point to "._mcount",
> rather than just "_mcount". We should be looking for "._mcount" to be
> able to generate mcount_loc section in the files.
> 
> Like:
> 
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 70be5a72e838..7da5bf8c7236 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -2185,7 +2185,7 @@ static int classify_symbols(struct objtool_file *file)
>      if (arch_is_retpoline(func))
>      func->retpoline_thunk = true;
> 
> -   if ((!strcmp(func->name, "__fentry__")) || (!strcmp(func->name, "_mcount")))
> +   if ((!strcmp(func->name, "__fentry__")) || (!strcmp(func->name, "_mcount")) || (!strcmp(func->name, "._mcount")))
>      func->fentry = true;
> 
>      if (is_profiling_func(func->name))
> 
> 
> With this change, I could see __mcount_loc section being
> generated in individual ppc64be object files.
> 

Or should we implement an equivalent of arch_ftrace_match_adjust() in 
objtool ?

Christophe

Re: [RFC PATCH 4/4] objtool/powerpc: Add --mcount specific implementation

2022-05-25 Thread Christophe Leroy




On 24/05/2022 at 15:33, Christophe Leroy wrote:



On 24/05/2022 at 13:00, Sathvika Vasireddy wrote:



+{
+    switch (elf->ehdr.e_machine) {
+    case EM_X86_64:
+    return R_X86_64_64;
+    case EM_PPC64:
+    return R_PPC64_ADDR64;
+    default:
+    WARN("unknown machine...");
+    exit(-1);
+    }
+}

Wouldn't it be better to make that function arch specific ?


This is so that we can support cross architecture builds.




I'm not sure I follow you here.

This is only based on the target; it doesn't depend on the build host, so
I can't see the link with cross-arch builds.

Just as you have arch_decode_instruction(), you could have
arch_elf_reloc_type_long().
It would make sense indeed, because there is no point in supporting X86
relocations when you don't support X86 instruction decoding.



Could simply be some macro defined in 
tools/objtool/arch/powerpc/include/arch/elf.h and 
tools/objtool/arch/x86/include/arch/elf.h


The x86 version would be:

#define R_ADDR(elf) R_X86_64_64

And the powerpc version would be:

#define R_ADDR(elf) (elf->ehdr.e_machine == EM_PPC64 ? R_PPC64_ADDR64 : R_PPC_ADDR32)


Christophe
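
The function form of the same idea can be sketched as below. This is only
an illustration, assuming the powerpc build of objtool provides its own
definition under the arch_elf_reloc_type_long() name suggested above, and
that it takes the ELF header's e_machine value directly (the real objtool
code passes a struct elf pointer). The constants come from <elf.h>.

```c
#include <elf.h>

/*
 * Hypothetical powerpc-only helper: since each objtool build decodes
 * instructions for a single architecture, it only has to choose between
 * the 64-bit and the 32-bit powerpc absolute address relocation.
 */
int arch_elf_reloc_type_long(unsigned int e_machine)
{
	return e_machine == EM_PPC64 ? R_PPC64_ADDR64 : R_PPC_ADDR32;
}
```

An x86 build would provide its own one-line version returning R_X86_64_64,
so no switch over foreign machine types is needed.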


Re: [RFC PATCH v1 1/4] Revert "objtool: Enable objtool to run only on files with ftrace enabled"

2022-05-25 Thread Christophe Leroy


On 25/05/2022 at 18:34, Peter Zijlstra wrote:
> On Wed, May 25, 2022 at 05:58:14PM +0200, Christophe Leroy wrote:
>> This reverts commit cf3013dfad89ad5ac7d16d56dced72d7c138a20e.
>>
>> That commit is problematic as we miss some static calls.
> 
> Revert ?!?! who committed this. And there's a ton more broken than just
> static calls. This must absolutely not be.

No worry, it is just a follow-up of my previous series which includes it.

Re: [RFC PATCH v2 0/7] objtool: Enable and implement --mcount option on powerpc

2022-05-25 Thread Segher Boessenkool
On Wed, May 25, 2022 at 03:44:04PM +0530, Sathvika Vasireddy wrote:
> On 24/05/22 18:47, Christophe Leroy wrote:
> >This draft series adds PPC32 support to Sathvika's series.
> >Verified on pmac32 on QEMU.
> >
> >It should in principle also work for PPC64 BE but for the time being
> >something goes wrong. In the beginning I had a segfault hence the first
> >patch. But I still get no mcount section in the files.
> Since PPC64 BE uses older elfv1 ABI, it prepends a dot to symbols.
> And so, the relocation records in case of PPC64BE point to "._mcount",
> rather than just "_mcount". We should be looking for "._mcount" to be
> able to generate mcount_loc section in the files.

The dotted symbol is on the actual function.  The "normal" symbol is on
the "official procedure descriptor" (opd), which is what you get if you
(in C) take the address of a function.  A procedure descriptor holds one
or two more pointers, the GOT and environment pointers.  We don't use
the environment one, but the GOT pointer is necessary everywhere :-)


Segher


Re: [PATCH 2/2] drm/tiny: Add ofdrm for Open Firmware framebuffers

2022-05-25 Thread Thomas Zimmermann

Hi

On 21.05.22 at 04:49, Benjamin Herrenschmidt wrote:

On Thu, 2022-05-19 at 09:27 +0200, Thomas Zimmermann wrote:


to build without PCI to see what happens.


If you bring any of the "heuristic" and palette support code in, you
need PCI. I don't see any reason to take it out.


Those old Macs use BootX, right? BootX is not supported ATM, as I don't
have the HW to test. Is there an emulator for it?


It isn't ? When did it break ? :-)


I meant that BootX is not (yet) supported by this new driver. The Linux 
kernel overall probably supports it.






If anyone wants to make patches for BootX, I'd be happy to add them.
The offb driver also supports a number of special cases for palette
handling. That might be necessary for ofdrm as well.


The palette handling is useful when using a real Open Firmware
implementation which tends to boot in 8-bit mode, so without palette
things will look ... bad.

It's not necessary when using 16/32 bpp framebuffers which is typically
... what BootX provides :-)


Maybe the odd color formats can be tested via qemu.

I don't mind adding DRM support for BootX displays, but getting the 
necessary test HW with a suitable Linux seems to be laborious. Would a 
G4 Powerbook work?


Best regards
Thomas



Cheers,
Ben.


Best regards
Thomas


Gr{oetje,eeting}s,

  Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
  -- Linus Torvalds


--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev






Re: [RFC PATCH v1 2/4] objtool: Add R_REL32 macro

2022-05-25 Thread Segher Boessenkool
Hi!

On Wed, May 25, 2022 at 05:58:15PM +0200, Christophe Leroy wrote:
> In order to allow architectures other than x86 to use 32-bit
> relative relocations, define an R_REL32 macro that each architecture
> will define, in the same way as already done for R_NONE.

What are the expected semantics of this relocation?  It is PC-relative,
sure, but what is the destination?  S+A-P always?  That works for both
x86-64 and for PowerPC, but it should be written down somewhere :-)


Segher
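
For reference, the S+A-P computation mentioned above can be written out as
a small sketch. The function name rel32_compute() and its signature are
hypothetical; S is the symbol value, A the addend and P the address of the
field being relocated, and the result has to fit in a signed 32-bit field.

```c
#include <stdint.h>

/*
 * Compute a 32-bit PC-relative relocation value as S + A - P.
 * Returns 0 and stores the value on success, or -1 when the result
 * does not fit in a signed 32-bit field.
 */
int rel32_compute(uint64_t sym, int64_t addend, uint64_t place, int32_t *out)
{
	/* Unsigned arithmetic wraps, so the difference is taken modulo 2^64. */
	int64_t v = (int64_t)(sym + (uint64_t)addend - place);

	if (v < INT32_MIN || v > INT32_MAX)
		return -1;

	*out = (int32_t)v;
	return 0;
}
```

For example, a symbol at 0x1000 referenced from a field at 0x2000 with a
zero addend gives -0x1000.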


Re: [RFC PATCH v1 1/4] Revert "objtool: Enable objtool to run only on files with ftrace enabled"

2022-05-25 Thread Peter Zijlstra
On Wed, May 25, 2022 at 05:58:14PM +0200, Christophe Leroy wrote:
> This reverts commit cf3013dfad89ad5ac7d16d56dced72d7c138a20e.
> 
> That commit is problematic as we miss some static calls.

Revert ?!?! who committed this. And there's a ton more broken than just
static calls. This must absolutely not be.


Re: [PATCH V9 20/20] riscv: compat: Add COMPAT Kbuild skeletal support

2022-05-25 Thread Guo Ren
Thx Heiko & Guenter,

On Wed, May 25, 2022 at 7:10 PM Heiko Stübner  wrote:
>
> On Wednesday, 25 May 2022, 12:57:30 CEST, Heiko Stübner wrote:
> > On Wednesday, 25 May 2022, 00:06:46 CEST, Guenter Roeck wrote:
> > > On Wed, May 25, 2022 at 01:46:38AM +0800, Guo Ren wrote:
> > > [ ... ]
> > >
> > > > > The problem comes from "__dls3's vdso decode part in musl's
> > > > > ldso/dynlink.c". The ehdr->e_phnum & ehdr->e_phentsize are wrong.
> > > > >
> > > > > I think the root cause is from musl's implementation with the wrong
> > > > > elf parser. I would fix that soon.
> > > > Not elf parser, it's "aux vector just past environ[]". I think I could
> > > > solve this, but anyone who could help dig in is welcome.
> > > >
> > >
> > > I am not sure I understand what you are saying here. Point is that my
> > > root file system, generated with musl a year or so ago, crashes with
> > > your patch set applied. That is a regression, even if there is a bug
> > > in musl.
Thx for the report, it's a valuable regression for riscv-compat.

> >
> > Also as I said in the other part of the thread, the rootfs seems innocent,
> > as my completely-standard Debian riscv64 rootfs is also affected.
> >
> > The merged version seems to be v12 [0] - not sure how this discussion
> > ended up in v9, but I just tested this revision in two variants:
> >
> > - v5.17 + this v9 -> works nicely
>
> I take that back ... now going back to that build I somehow also run into
> that issue here ... will investigate more.
Yeah, it's my fault. I've fixed it up, please give it a try:

https://lore.kernel.org/linux-riscv/20220525160404.2930984-1-guo...@kernel.org/T/#u

>
>
> > - v5.18-rc6 + this v9 (rebased onto it) -> breaks the boot
> >   The only rebase-conflict was with the introduction of restartable
> >   sequences and removal of the tracehook include, but turning CONFIG_RSEQ
> >   off doesn't seem to affect the breakage.
> >
> > So it looks like something changed between 5.17 and 5.18 that causes the 
> > issue.
> >
> >
> > Heiko
> >
> >
> > [0] https://lore.kernel.org/all/20220405071314.3225832-1-guo...@kernel.org/
> >
>
>
>
>


-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/


[RFC PATCH v1 2/4] objtool: Add R_REL32 macro

2022-05-25 Thread Christophe Leroy
In order to allow architectures other than x86 to use 32-bit
relative relocations, define an R_REL32 macro that each architecture
will define, in the same way as already done for R_NONE.

Signed-off-by: Christophe Leroy 
---
 tools/objtool/arch/x86/include/arch/elf.h |  1 +
 tools/objtool/check.c | 10 +-
 tools/objtool/orc_gen.c   |  2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/tools/objtool/arch/x86/include/arch/elf.h b/tools/objtool/arch/x86/include/arch/elf.h
index 69cc4264b28a..8aa8c29607da 100644
--- a/tools/objtool/arch/x86/include/arch/elf.h
+++ b/tools/objtool/arch/x86/include/arch/elf.h
@@ -2,5 +2,6 @@
 #define _OBJTOOL_ARCH_ELF
 
 #define R_NONE R_X86_64_NONE
+#define R_REL32 R_X86_64_PC32
 
 #endif /* _OBJTOOL_ARCH_ELF */
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 70be5a72e838..1627d14a01c9 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -650,7 +650,7 @@ static int create_static_call_sections(struct objtool_file *file)
/* populate reloc for 'addr' */
if (elf_add_reloc_to_insn(file->elf, sec,
  idx * sizeof(struct static_call_site),
- R_X86_64_PC32,
+ R_REL32,
  insn->sec, insn->offset))
return -1;
 
@@ -691,7 +691,7 @@ static int create_static_call_sections(struct objtool_file *file)
/* populate reloc for 'key' */
if (elf_add_reloc(file->elf, sec,
  idx * sizeof(struct static_call_site) + 4,
- R_X86_64_PC32, key_sym,
+ R_REL32, key_sym,
  is_sibling_call(insn) * STATIC_CALL_SITE_TAIL))
return -1;
 
@@ -735,7 +735,7 @@ static int create_retpoline_sites_sections(struct objtool_file *file)
 
if (elf_add_reloc_to_insn(file->elf, sec,
  idx * sizeof(int),
- R_X86_64_PC32,
+ R_REL32,
  insn->sec, insn->offset)) {
WARN("elf_add_reloc_to_insn: .retpoline_sites");
return -1;
@@ -787,7 +787,7 @@ static int create_ibt_endbr_seal_sections(struct objtool_file *file)
 
if (elf_add_reloc_to_insn(file->elf, sec,
  idx * sizeof(int),
- R_X86_64_PC32,
+ R_REL32,
  insn->sec, insn->offset)) {
WARN("elf_add_reloc_to_insn: .ibt_endbr_seal");
return -1;
@@ -3716,7 +3716,7 @@ static int validate_ibt_insn(struct objtool_file *file, struct instruction *insn
continue;
 
off = reloc->sym->offset;
-   if (reloc->type == R_X86_64_PC32 || reloc->type == R_X86_64_PLT32)
+   if (reloc->type == R_REL32 || reloc->type == R_X86_64_PLT32)
off += arch_dest_reloc_offset(reloc->addend);
else
off += reloc->addend;
diff --git a/tools/objtool/orc_gen.c b/tools/objtool/orc_gen.c
index 1f22b7ebae58..49a877b9c879 100644
--- a/tools/objtool/orc_gen.c
+++ b/tools/objtool/orc_gen.c
@@ -101,7 +101,7 @@ static int write_orc_entry(struct elf *elf, struct section *orc_sec,
orc->bp_offset = bswap_if_needed(elf, orc->bp_offset);
 
/* populate reloc for ip */
-   if (elf_add_reloc_to_insn(elf, ip_sec, idx * sizeof(int), R_X86_64_PC32,
+   if (elf_add_reloc_to_insn(elf, ip_sec, idx * sizeof(int), R_REL32,
  insn_sec, insn_off))
return -1;
 
-- 
2.35.3



[RFC PATCH v1 4/4] powerpc/static_call: Implement inline static calls

2022-05-25 Thread Christophe Leroy
Implement inline static calls:
- Put a 'bl' to the destination function
- Put a 'nop' when the destination function is NULL
- Put a 'li r3,0' when the destination is the RET0 function

For the time being it only works if the destination is
within 32 MB of the caller.
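
The three cases above, and the reach limit, can be illustrated with a
standalone sketch that picks the instruction word for a call site. It is
not the kernel code: static_call_site_insn() and in_bl_range() are
hypothetical names, addresses are plain integers, and the range check is
written to mirror what is_offset_in_branch_range() is assumed to test
(a word-aligned signed 26-bit displacement).

```c
#include <stdbool.h>
#include <stdint.h>

#define PPC_NOP		0x60000000u	/* ori r0,r0,0 */
#define PPC_LI_R3_0	0x38600000u	/* li r3,0, i.e. addi r3,0,0 */
#define PPC_BL		0x48000001u	/* opcode 18 with LK=1, zero offset */

/* A relative branch reaches -32 MB .. +32 MB - 4, in 4-byte steps. */
bool in_bl_range(int64_t offset)
{
	return offset >= -0x2000000 && offset <= 0x1fffffc && !(offset & 3);
}

/*
 * Pick the instruction to place at 'site'. Returns 0 and stores the
 * instruction word, or -1 when the target cannot be reached with 'bl'.
 */
int static_call_site_insn(uint64_t site, uint64_t func, bool is_ret0,
			  uint32_t *insn)
{
	int64_t offset = (int64_t)(func - site);

	if (!func)
		*insn = PPC_NOP;
	else if (is_ret0)
		*insn = PPC_LI_R3_0;
	else if (in_bl_range(offset))
		*insn = PPC_BL | ((uint32_t)offset & 0x03fffffcu);
	else
		return -1;

	return 0;
}
```

The -1 return corresponds to the case where the patch currently calls
panic().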

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/static_call.h|  2 +
 arch/powerpc/kernel/static_call.c | 41 ---
 tools/objtool/arch/powerpc/include/arch/elf.h |  1 +
 4 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 5ef8bf8eb202..3257a1c258d8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -246,6 +246,7 @@ config PPC
select HAVE_STACKPROTECTOR  if PPC32 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r2)
select HAVE_STACKPROTECTOR  if PPC64 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r13)
select HAVE_STATIC_CALL if PPC32
+   select HAVE_STATIC_CALL_INLINE  if PPC32
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HUGETLB_PAGE_SIZE_VARIABLE   if PPC_BOOK3S_64 && HUGETLB_PAGE
diff --git a/arch/powerpc/include/asm/static_call.h b/arch/powerpc/include/asm/static_call.h
index de1018cc522b..e3d5d3823dac 100644
--- a/arch/powerpc/include/asm/static_call.h
+++ b/arch/powerpc/include/asm/static_call.h
@@ -26,4 +26,6 @@
 #define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)   __PPC_SCT(name, "blr")
 #define ARCH_DEFINE_STATIC_CALL_RET0_TRAMP(name)   __PPC_SCT(name, "b .+20")
 
+#define CALL_INSN_SIZE 4
+
 #endif /* _ASM_POWERPC_STATIC_CALL_H */
diff --git a/arch/powerpc/kernel/static_call.c b/arch/powerpc/kernel/static_call.c
index 863a7aa24650..fd25954cfd24 100644
--- a/arch/powerpc/kernel/static_call.c
+++ b/arch/powerpc/kernel/static_call.c
@@ -9,25 +9,38 @@ void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
int err;
bool is_ret0 = (func == __static_call_return0);
unsigned long target = (unsigned long)(is_ret0 ? tramp + PPC_SCT_RET0 : func);
-   bool is_short = is_offset_in_branch_range((long)target - (long)tramp);
-
-   if (!tramp)
-   return;
 
mutex_lock(&text_mutex);
 
-   if (func && !is_short) {
-   err = patch_instruction(tramp + PPC_SCT_DATA, ppc_inst(target));
-   if (err)
-   goto out;
+   if (tramp) {
+   bool is_short = is_offset_in_branch_range((long)target - (long)tramp);
+
+   if (func && !is_short) {
+   err = patch_instruction(tramp + PPC_SCT_DATA, ppc_inst(target));
+   if (err)
+   goto out;
+   }
+
+   if (!func)
+   err = patch_instruction(tramp, ppc_inst(PPC_RAW_BLR()));
+   else if (is_short)
+   err = patch_branch(tramp, target, 0);
+   else
+   err = patch_instruction(tramp, ppc_inst(PPC_RAW_NOP()));
}
 
-   if (!func)
-   err = patch_instruction(tramp, ppc_inst(PPC_RAW_BLR()));
-   else if (is_short)
-   err = patch_branch(tramp, target, 0);
-   else
-   err = patch_instruction(tramp, ppc_inst(PPC_RAW_NOP()));
+   if (site) {
+   bool is_short = is_offset_in_branch_range((long)func - (long)site);
+
+   if (!func)
+   err = patch_instruction(site, ppc_inst(PPC_RAW_NOP()));
+   else if (is_ret0)
+   err = patch_instruction(site, ppc_inst(PPC_RAW_LI(_R3, 0)));
+   else if (is_short)
+   err = patch_branch(site, target, BRANCH_SET_LINK);
+   else
+   panic("%s: function %pS is out of reach of %pS\n", __func__, func, site);
+   }
 out:
mutex_unlock(&text_mutex);
 
diff --git a/tools/objtool/arch/powerpc/include/arch/elf.h b/tools/objtool/arch/powerpc/include/arch/elf.h
index 3c8ebb7d2a6b..18784c764c14 100644
--- a/tools/objtool/arch/powerpc/include/arch/elf.h
+++ b/tools/objtool/arch/powerpc/include/arch/elf.h
@@ -4,5 +4,6 @@
 #define _OBJTOOL_ARCH_ELF
 
 #define R_NONE R_PPC_NONE
+#define R_REL32 R_PPC_REL32
 
 #endif /* _OBJTOOL_ARCH_ELF */
-- 
2.35.3



[RFC PATCH v1 0/4] Implement inline static calls on PPC32

2022-05-25 Thread Christophe Leroy
This is a first draft for implementing inline static calls on PPC32.

This series applies on top of series v2 "objtool: Enable and implement
--mcount option on powerpc".

For the time being only the case where functions are within 'bl' reach
is supported. Otherwise panic() is invoked.

For the other case, we'll need to use the trampoline we have at startup
before initialising inline static calls. But it seems that, for the time
being, once inline static calls are initialised we no longer know
where the trampoline was.
We'd need to keep the information somewhere (in the static_call_key?).
We may also need to keep the information for when the trampoline itself
is out of 'bl' reach; in that case there is a trampoline set up by the
compiler and we'll need to remember the location of that trampoline. I guess
it should get saved somewhere when we initialise inline static calls?

Christophe Leroy (4):
  Revert "objtool: Enable objtool to run only on files with ftrace
enabled"
  objtool: Add R_REL32 macro
  static_call: Call static_call_init() from start_kernel()
  powerpc/static_call: Implement inline static calls

 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/static_call.h|  2 +
 arch/powerpc/kernel/static_call.c | 41 ---
 init/main.c   |  1 +
 scripts/Makefile.build|  4 +-
 tools/objtool/arch/powerpc/include/arch/elf.h |  1 +
 tools/objtool/arch/x86/include/arch/elf.h |  1 +
 tools/objtool/check.c | 10 ++---
 tools/objtool/orc_gen.c   |  2 +-
 9 files changed, 41 insertions(+), 22 deletions(-)

-- 
2.35.3



[RFC PATCH v1 1/4] Revert "objtool: Enable objtool to run only on files with ftrace enabled"

2022-05-25 Thread Christophe Leroy
This reverts commit cf3013dfad89ad5ac7d16d56dced72d7c138a20e.

That commit is problematic as we miss some static calls.

Signed-off-by: Christophe Leroy 
---
 scripts/Makefile.build | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 06ceffd92921..2e0c3f9c1459 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -258,8 +258,8 @@ else
 # 'OBJECT_FILES_NON_STANDARD_foo.o := 'y': skip objtool checking for a file
 # 'OBJECT_FILES_NON_STANDARD_foo.o := 'n': override directory skip for a file
 
-$(obj)/%.o: objtool-enabled = $(and $(if $(filter-out y%, $(OBJECT_FILES_NON_STANDARD_$(basetarget).o)$(OBJECT_FILES_NON_STANDARD)n),y), \
-$(if $(findstring $(strip $(CC_FLAGS_FTRACE)),$(_c_flags)),y),y)
+$(obj)/%.o: objtool-enabled = $(if $(filter-out y%, \
+   $(OBJECT_FILES_NON_STANDARD_$(basetarget).o)$(OBJECT_FILES_NON_STANDARD)n),y)
 
 endif
 
-- 
2.35.3



[RFC PATCH v1 3/4] static_call: Call static_call_init() from start_kernel()

2022-05-25 Thread Christophe Leroy
Call static_call_init() just after jump_label_init().

x86 already called it from setup_arch(). This is not a
problem as static_call_init() is guarded from double call.

Signed-off-by: Christophe Leroy 
---
 init/main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/init/main.c b/init/main.c
index 98182c3c2c4b..b6c49c18ec5d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -962,6 +962,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
pr_notice("Kernel command line: %s\n", saved_command_line);
/* parameters may set static keys */
jump_label_init();
+   static_call_init();
parse_early_param();
after_dashes = parse_args("Booting kernel",
  static_command_line, __start___param,
-- 
2.35.3



[PATCH 5/5] KVM: PPC: Book3S HV: Provide more detailed timings for P9 entry path

2022-05-25 Thread Fabiano Rosas
Alter the data collection points for the debug timing code in the P9
path to be more in line with what the code does. The points where we
accumulate time are now the following:

vcpu_entry: From vcpu_run_hv entry until the start of the inner loop;

guest_entry: From the start of the inner loop until the guest entry in
 asm;

in_guest: From the guest entry in asm until the return to KVM C code;

guest_exit: From the return into KVM C code until the corresponding
hypercall/page fault handling or re-entry into the guest;

hypercall: Time spent handling hcalls in the kernel (hcalls can go to
   QEMU, not accounted here);

page_fault: Time spent handling page faults;

vcpu_exit: vcpu_run_hv exit (almost no code here currently).

Like before, these are exposed in debugfs in a file called
"timings". There are four values:

- number of occurrences of the accumulation point;
- total time the vcpu spent in the phase in ns;
- shortest time the vcpu spent in the phase in ns;
- longest time the vcpu spent in the phase in ns;
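
As an illustration of how the four values relate, here is a simplified
accumulator sketch. The struct phase_stats type and phase_stats_add() are
hypothetical and much simpler than the kernel's struct
kvmhv_tb_accumulator, which also uses a seqcount and works in timebase
ticks.

```c
#include <stdint.h>

/* Hypothetical per-phase statistics, one instance per accumulation point. */
struct phase_stats {
	uint64_t count;		/* number of occurrences */
	uint64_t total_ns;	/* total time spent in the phase */
	uint64_t min_ns;	/* shortest single occurrence */
	uint64_t max_ns;	/* longest single occurrence */
};

/* Record one occurrence of the phase that lasted delta_ns nanoseconds. */
void phase_stats_add(struct phase_stats *s, uint64_t delta_ns)
{
	s->count++;
	s->total_ns += delta_ns;
	if (s->count == 1 || delta_ns < s->min_ns)
		s->min_ns = delta_ns;
	if (delta_ns > s->max_ns)
		s->max_ns = delta_ns;
}
```

With a single occurrence, total, minimum and maximum are all equal, which
is the shape seen in the one-instruction run below.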

===
Before:

  rm_entry: 53132 16793518 256 4060
  rm_intr: 53132 2125914 22 340
  rm_exit: 53132 24108344 374 2180
  guest: 53132 40980507996 404 9997650
  cede: 0 0 0 0

After:

  vcpu_entry: 34637 7716108 178 4416
  guest_entry: 52414 49365608 324 747542
  in_guest: 52411 40828715840 258 9997480
  guest_exit: 52410 19681717182 826 102496674
  vcpu_exit: 34636 1744462 38 182
  hypercall: 45712 22878288 38 1307962
  page_fault: 992 04034 568 168688

  With just one instruction (hcall):

  vcpu_entry: 1 942 942 942
  guest_entry: 1 4044 4044 4044
  in_guest: 1 1540 1540 1540
  guest_exit: 1 3542 3542 3542
  vcpu_exit: 1 80 80 80
  hypercall: 0 0 0 0
  page_fault: 0 0 0 0
===

Signed-off-by: Fabiano Rosas 
---
 arch/powerpc/include/asm/kvm_host.h   | 12 +++-
 arch/powerpc/kvm/Kconfig  |  9 +
 arch/powerpc/kvm/book3s_hv.c  | 23 ++-
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 14 --
 4 files changed, 34 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 37f03665bfa2..de2b226aa350 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -827,11 +827,13 @@ struct kvm_vcpu_arch {
struct kvmhv_tb_accumulator *cur_activity;  /* What we're timing */
u64 cur_tb_start;   /* when it started */
 #ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING
-   struct kvmhv_tb_accumulator rm_entry;   /* real-mode entry code */
-   struct kvmhv_tb_accumulator rm_intr;/* real-mode intr handling */
-   struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */
-   struct kvmhv_tb_accumulator guest_time; /* guest execution */
-   struct kvmhv_tb_accumulator cede_time;  /* time napping inside guest */
+   struct kvmhv_tb_accumulator vcpu_entry;
+   struct kvmhv_tb_accumulator vcpu_exit;
+   struct kvmhv_tb_accumulator in_guest;
+   struct kvmhv_tb_accumulator hcall;
+   struct kvmhv_tb_accumulator pg_fault;
+   struct kvmhv_tb_accumulator guest_entry;
+   struct kvmhv_tb_accumulator guest_exit;
 #else
struct kvmhv_tb_accumulator rm_entry;   /* real-mode entry code */
struct kvmhv_tb_accumulator rm_intr;/* real-mode intr handling */
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 191347f44731..cedf1e0f50e1 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -135,10 +135,11 @@ config KVM_BOOK3S_HV_P9_TIMING
select KVM_BOOK3S_HV_EXIT_TIMING
depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS
help
- Calculate time taken for each vcpu in various parts of the
- code. The total, minimum and maximum times in nanoseconds
- together with the number of executions are reported in debugfs in
- kvm/vm#/vcpu#/timings.
+ Calculate time taken for each vcpu during vcpu entry and
+ exit, time spent inside the guest and time spent handling
+ hypercalls and page faults. The total, minimum and maximum
+ times in nanoseconds together with the number of executions
+ are reported in debugfs in kvm/vm#/vcpu#/timings.
 
  If unsure, say N.
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 69a6b40d58b9..f485632f247a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2654,11 +2654,13 @@ static struct debugfs_timings_element {
size_t offset;
 } timings[] = {
 #ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING
-   {"rm_entry",offsetof(struct kvm_vcpu, arch.rm_entry)},
-   {"rm_intr", offsetof(struct kvm_vcpu, arch.rm_intr)},
-   {"rm_exit", offsetof(struct kvm_vcpu, arch.rm_exit)},
-   {"guest",   offsetof(struct kvm_vcpu, arch.guest_time)},
-   {"cede",offsetof(struct kvm_vcpu, arch.cede_time)},
+   

[PATCH 4/5] KVM: PPC: Book3S HV: Expose timing functions to module code

2022-05-25 Thread Fabiano Rosas
The next patch adds new timing points to the P9 entry path, some of
which are in the module code, so we need to export the timing
functions.

Signed-off-by: Fabiano Rosas 
---
 arch/powerpc/kvm/book3s_hv.h  | 10 ++
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 11 ++-
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.h b/arch/powerpc/kvm/book3s_hv.h
index 6b7f07d9026b..2f2e59d7d433 100644
--- a/arch/powerpc/kvm/book3s_hv.h
+++ b/arch/powerpc/kvm/book3s_hv.h
@@ -40,3 +40,13 @@ void switch_pmu_to_guest(struct kvm_vcpu *vcpu,
struct p9_host_os_sprs *host_os_sprs);
 void switch_pmu_to_host(struct kvm_vcpu *vcpu,
struct p9_host_os_sprs *host_os_sprs);
+
+#ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING
+void accumulate_time(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator *next);
+#define start_timing(vcpu, next) accumulate_time(vcpu, next)
+#define end_timing(vcpu) accumulate_time(vcpu, NULL)
+#else
+#define accumulate_time(vcpu, next) do {} while (0)
+#define start_timing(vcpu, next) do {} while (0)
+#define end_timing(vcpu) do {} while (0)
+#endif
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index f8ce473149b7..8b2a9a360e4e 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -438,7 +438,7 @@ void restore_p9_host_os_sprs(struct kvm_vcpu *vcpu,
 EXPORT_SYMBOL_GPL(restore_p9_host_os_sprs);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING
-static void __accumulate_time(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator *next)
+void accumulate_time(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator *next)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
struct kvmhv_tb_accumulator *curr;
@@ -468,14 +468,7 @@ static void __accumulate_time(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator
smp_wmb();
curr->seqcount = seq + 2;
 }
-
-#define start_timing(vcpu, next) __accumulate_time(vcpu, next)
-#define end_timing(vcpu) __accumulate_time(vcpu, NULL)
-#define accumulate_time(vcpu, next) __accumulate_time(vcpu, next)
-#else
-#define start_timing(vcpu, next) do {} while (0)
-#define end_timing(vcpu) do {} while (0)
-#define accumulate_time(vcpu, next) do {} while (0)
+EXPORT_SYMBOL_GPL(accumulate_time);
 #endif
 
 static inline u64 mfslbv(unsigned int idx)
-- 
2.35.1



[PATCH 3/5] KVM: PPC: Book3S HV: Decouple the debug timing from the P8 entry path

2022-05-25 Thread Fabiano Rosas
We are currently doing the timing for debug purposes of the P9 entry
path using the accumulators and terminology defined by the old entry
path for P8 machines.

Not only are the "real-mode" and "napping" mentions out of place for
the P9 Radix entry path, but we also cannot change them because the
timing code is coupled to the structures defined in struct
kvm_vcpu_arch.

Add a new CONFIG_KVM_BOOK3S_HV_P9_TIMING to enable the timing code for
the P9 entry path. For now, just add the new CONFIG and duplicate the
structures. A subsequent patch will add the P9 changes.

Signed-off-by: Fabiano Rosas 
---
 arch/powerpc/include/asm/kvm_host.h   |  8 
 arch/powerpc/kvm/Kconfig  | 14 +-
 arch/powerpc/kvm/book3s_hv.c  | 13 +++--
 arch/powerpc/kvm/book3s_hv_p9_entry.c |  2 +-
 4 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index faf301d0dec0..37f03665bfa2 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -826,11 +826,19 @@ struct kvm_vcpu_arch {
 #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
struct kvmhv_tb_accumulator *cur_activity;  /* What we're timing */
u64 cur_tb_start;   /* when it started */
+#ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING
struct kvmhv_tb_accumulator rm_entry;   /* real-mode entry code */
struct kvmhv_tb_accumulator rm_intr;/* real-mode intr handling */
struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */
struct kvmhv_tb_accumulator guest_time; /* guest execution */
struct kvmhv_tb_accumulator cede_time;  /* time napping inside guest */
+#else
+   struct kvmhv_tb_accumulator rm_entry;   /* real-mode entry code */
+   struct kvmhv_tb_accumulator rm_intr;/* real-mode intr handling */
+   struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */
+   struct kvmhv_tb_accumulator guest_time; /* guest execution */
+   struct kvmhv_tb_accumulator cede_time;  /* time napping inside guest */
+#endif
 #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
 };
 
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 73f8277df7d1..191347f44731 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -130,10 +130,22 @@ config KVM_BOOK3S_64_PR
 config KVM_BOOK3S_HV_EXIT_TIMING
bool
 
+config KVM_BOOK3S_HV_P9_TIMING
+   bool "Detailed timing for the P9 entry point"
+   select KVM_BOOK3S_HV_EXIT_TIMING
+   depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS
+   help
+ Calculate time taken for each vcpu in various parts of the
+ code. The total, minimum and maximum times in nanoseconds
+ together with the number of executions are reported in debugfs in
+ kvm/vm#/vcpu#/timings.
+
+ If unsure, say N.
+
 config KVM_BOOK3S_HV_P8_TIMING
bool "Detailed timing for hypervisor real-mode code (for POWER8)"
select KVM_BOOK3S_HV_EXIT_TIMING
-   depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS
+   depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS && !KVM_BOOK3S_HV_P9_TIMING
help
  Calculate time taken for each vcpu in the real-mode guest entry,
  exit, and interrupt handling code, plus time spent in the guest
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6fa518f6501d..69a6b40d58b9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2653,11 +2653,19 @@ static struct debugfs_timings_element {
const char *name;
size_t offset;
 } timings[] = {
+#ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING
{"rm_entry",offsetof(struct kvm_vcpu, arch.rm_entry)},
{"rm_intr", offsetof(struct kvm_vcpu, arch.rm_intr)},
{"rm_exit", offsetof(struct kvm_vcpu, arch.rm_exit)},
{"guest",   offsetof(struct kvm_vcpu, arch.guest_time)},
{"cede",offsetof(struct kvm_vcpu, arch.cede_time)},
+#else
+   {"rm_entry",offsetof(struct kvm_vcpu, arch.rm_entry)},
+   {"rm_intr", offsetof(struct kvm_vcpu, arch.rm_intr)},
+   {"rm_exit", offsetof(struct kvm_vcpu, arch.rm_exit)},
+   {"guest",   offsetof(struct kvm_vcpu, arch.guest_time)},
+   {"cede",offsetof(struct kvm_vcpu, arch.cede_time)},
+#endif
 };
 
 #define N_TIMINGS  (ARRAY_SIZE(timings))
@@ -2776,8 +2784,9 @@ static const struct file_operations debugfs_timings_ops = {
 /* Create a debugfs directory for the vcpu */
 static int kvmppc_arch_create_vcpu_debugfs_hv(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
 {
-   debugfs_create_file("timings", 0444, debugfs_dentry, vcpu,
-   &debugfs_timings_ops);
+   if (cpu_has_feature(CPU_FTR_ARCH_300) == IS_ENABLED(CONFIG_KVM_BOOK3S_HV_P9_TIMING))
+   debugfs_create_file("timings", 0444, debugfs_dentry, vcpu,
+   

[PATCH 2/5] KVM: PPC: Book3S HV: Add a new config for P8 debug timing

2022-05-25 Thread Fabiano Rosas
Turn the existing Kconfig KVM_BOOK3S_HV_EXIT_TIMING into
KVM_BOOK3S_HV_P8_TIMING in preparation for the addition of a new
config for P9 timings.

This applies only to P8 code, the generic timing code is still kept
under KVM_BOOK3S_HV_EXIT_TIMING.

Signed-off-by: Fabiano Rosas 
---
 arch/powerpc/kernel/asm-offsets.c   |  2 +-
 arch/powerpc/kvm/Kconfig|  6 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24 
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index eec536aef83a..8c10f536e478 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -379,7 +379,7 @@ int main(void)
OFFSET(VCPU_SPRG2, kvm_vcpu, arch.shregs.sprg2);
OFFSET(VCPU_SPRG3, kvm_vcpu, arch.shregs.sprg3);
 #endif
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
OFFSET(VCPU_TB_RMENTRY, kvm_vcpu, arch.rm_entry);
OFFSET(VCPU_TB_RMINTR, kvm_vcpu, arch.rm_intr);
OFFSET(VCPU_TB_RMEXIT, kvm_vcpu, arch.rm_exit);
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index ddd88179110a..73f8277df7d1 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -128,7 +128,11 @@ config KVM_BOOK3S_64_PR
  and system calls on the host.
 
 config KVM_BOOK3S_HV_EXIT_TIMING
-   bool "Detailed timing for hypervisor real-mode code"
+   bool
+
+config KVM_BOOK3S_HV_P8_TIMING
+   bool "Detailed timing for hypervisor real-mode code (for POWER8)"
+   select KVM_BOOK3S_HV_EXIT_TIMING
depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS
help
  Calculate time taken for each vcpu in the real-mode guest entry,
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index d185dee26026..c34932e31dcd 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -229,14 +229,14 @@ kvm_novcpu_wakeup:
cmpdi   r4, 0
beq kvmppc_primary_no_guest
 
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
addir3, r4, VCPU_TB_RMENTRY
bl  kvmhv_start_timing
 #endif
b   kvmppc_got_guest
 
 kvm_novcpu_exit:
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
ld  r4, HSTATE_KVM_VCPU(r13)
cmpdi   r4, 0
beq 13f
@@ -515,7 +515,7 @@ kvmppc_hv_entry:
li  r6, KVM_GUEST_MODE_HOST_HV
stb r6, HSTATE_IN_GUEST(r13)
 
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
/* Store initial timestamp */
cmpdi   r4, 0
beq 1f
@@ -886,7 +886,7 @@ fast_guest_return:
li  r9, KVM_GUEST_MODE_GUEST_HV
stb r9, HSTATE_IN_GUEST(r13)
 
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
/* Accumulate timing */
addir3, r4, VCPU_TB_GUEST
bl  kvmhv_accumulate_time
@@ -937,7 +937,7 @@ secondary_too_late:
cmpdi   r4, 0
beq 11f
stw r12, VCPU_TRAP(r4)
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
addir3, r4, VCPU_TB_RMEXIT
bl  kvmhv_accumulate_time
 #endif
@@ -951,7 +951,7 @@ hdec_soon:
li  r12, BOOK3S_INTERRUPT_HV_DECREMENTER
 12:stw r12, VCPU_TRAP(r4)
mr  r9, r4
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
addir3, r4, VCPU_TB_RMEXIT
bl  kvmhv_accumulate_time
 #endif
@@ -1048,7 +1048,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
li  r0, MSR_RI
mtmsrd  r0, 1
 
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
addir3, r9, VCPU_TB_RMINTR
mr  r4, r9
bl  kvmhv_accumulate_time
@@ -1127,7 +1127,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 
 guest_exit_cont:   /* r9 = vcpu, r12 = trap, r13 = paca */
 
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
addir3, r9, VCPU_TB_RMEXIT
mr  r4, r9
bl  kvmhv_accumulate_time
@@ -1487,7 +1487,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
mtspr   SPRN_LPCR,r8
isync
 
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
/* Finish timing, if we have a vcpu */
ld  r4, HSTATE_KVM_VCPU(r13)
cmpdi   r4, 0
@@ -2155,7 +2155,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM)
ld  r4, HSTATE_KVM_VCPU(r13)
std r3, VCPU_DEC_EXPIRES(r4)
 
-#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
ld  r4, HSTATE_KVM_VCPU(r13)
addir3, r4, VCPU_TB_CEDE
bl  kvmhv_accumulate_time
@@ -2223,7 +2223,7 @@ kvm_end_cede:
/* get vcpu pointer */
ld  r4, 

[PATCH 1/5] KVM: PPC: Book3S HV: Fix "rm_exit" entry in debugfs timings

2022-05-25 Thread Fabiano Rosas
At debugfs/kvm//vcpu0/timings we show how long each part of the
code takes to run:

$ cat /sys/kernel/debug/kvm/*-*/vcpu0/timings
rm_entry: 123785 49398892 118 4898
rm_intr: 123780 6075890 22 390
rm_exit: 0 0 0 0 <-- NOK
guest: 123780 46732919988 402 9997638
cede: 0 0 0 0 <-- OK, no cede napping in P9

The "rm_exit" is always showing zero because it is the last one and
end_timing does not increment the counter of the previous entry.

We can fix it by calling accumulate_time again instead of
end_timing. That way the counter gets incremented. The rest of the
arithmetic can be ignored because there are no timing points after
this and the accumulators are reset before the next round.

Signed-off-by: Fabiano Rosas 
---
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index a28e5b3daabd..f7591b6c92d1 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -438,15 +438,6 @@ void restore_p9_host_os_sprs(struct kvm_vcpu *vcpu,
 EXPORT_SYMBOL_GPL(restore_p9_host_os_sprs);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
-static void __start_timing(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator 
*next)
-{
-   struct kvmppc_vcore *vc = vcpu->arch.vcore;
-   u64 tb = mftb() - vc->tb_offset_applied;
-
-   vcpu->arch.cur_activity = next;
-   vcpu->arch.cur_tb_start = tb;
-}
-
 static void __accumulate_time(struct kvm_vcpu *vcpu, struct 
kvmhv_tb_accumulator *next)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
@@ -478,8 +469,8 @@ static void __accumulate_time(struct kvm_vcpu *vcpu, struct 
kvmhv_tb_accumulator
curr->seqcount = seq + 2;
 }
 
-#define start_timing(vcpu, next) __start_timing(vcpu, next)
-#define end_timing(vcpu) __start_timing(vcpu, NULL)
+#define start_timing(vcpu, next) __accumulate_time(vcpu, next)
+#define end_timing(vcpu) __accumulate_time(vcpu, NULL)
 #define accumulate_time(vcpu, next) __accumulate_time(vcpu, next)
 #else
 #define start_timing(vcpu, next) do {} while (0)
-- 
2.35.1



[PATCH 0/5] KVM: PPC: Book3S HV: Update debug timing code

2022-05-25 Thread Fabiano Rosas
We have some debug information at /sys/kernel/debug/kvm//vcpu#/timings
which shows the time it takes to run various parts of the code.

That infrastructure was written in the P8 timeframe and wasn't updated
along with the guest entry point changes for P9.

Ideally we would be able to just add new/different accounting points
to the code as it changes over time but since the P8 and P9 entry
points are different code paths we first need to separate them from
each other. This series alters KVM Kconfig to make that distinction.

Currently:
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING - timing infrastructure in asm (P8 only)
   timing infrastructure in C (P9 only)
   generic timing variables (P8/P9)
   debugfs code
   timing points for P8

After this series:
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING - generic timing variables (P8/P9)
   debugfs code

CONFIG_KVM_BOOK3S_HV_P8_TIMING - timing infrastructure in asm (P8 only)
 timing points for P8

CONFIG_KVM_BOOK3S_HV_P9_TIMING - timing infrastructure in C (P9 only)
 timing points for P9

The new Kconfig rules are:

a) CONFIG_KVM_BOOK3S_HV_P8_TIMING selects CONFIG_KVM_BOOK3S_HV_EXIT_TIMING,
   resulting in the previous behavior. Tested on P8.

b) CONFIG_KVM_BOOK3S_HV_P9_TIMING selects CONFIG_KVM_BOOK3S_HV_EXIT_TIMING,
   resulting in the new behavior. Tested on P9.

c) CONFIG_KVM_BOOK3S_HV_P8_TIMING and CONFIG_KVM_BOOK3S_HV_P9_TIMING
   are mutually exclusive. If both are set, P9 takes precedence.

Fabiano Rosas (5):
  KVM: PPC: Book3S HV: Fix "rm_exit" entry in debugfs timings
  KVM: PPC: Book3S HV: Add a new config for P8 debug timing
  KVM: PPC: Book3S HV: Decouple the debug timing from the P8 entry path
  KVM: PPC: Book3S HV: Expose timing functions to module code
  KVM: PPC: Book3S HV: Provide more detailed timings for P9 entry path

 arch/powerpc/include/asm/kvm_host.h | 10 +++
 arch/powerpc/kernel/asm-offsets.c   |  2 +-
 arch/powerpc/kvm/Kconfig| 19 -
 arch/powerpc/kvm/book3s_hv.c| 26 --
 arch/powerpc/kvm/book3s_hv.h| 10 +++
 arch/powerpc/kvm/book3s_hv_p9_entry.c   | 36 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24 -
 7 files changed, 82 insertions(+), 45 deletions(-)

-- 
2.35.1



[PATCH] KVM: PPC: Align pt_regs in kvm_vcpu_arch structure

2022-05-25 Thread Fabiano Rosas
The H_ENTER_NESTED hypercall receives as second parameter the address
of a region of memory containing the values for the nested guest
privileged registers. We currently use the pt_regs structure contained
within kvm_vcpu_arch to that end.

Most hypercalls that receive a memory address expect that region to
not cross a 4k page boundary. We would want H_ENTER_NESTED to follow
the same pattern so this patch ensures the pt_regs structure sits
within a page.

Signed-off-by: Fabiano Rosas 
---
 arch/powerpc/include/asm/kvm_host.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index faf301d0dec0..87eba60f2920 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -519,7 +519,11 @@ struct kvm_vcpu_arch {
struct kvmppc_book3s_shadow_vcpu *shadow_vcpu;
 #endif
 
-   struct pt_regs regs;
+   /*
+* This is passed along to the HV via H_ENTER_NESTED. Align to
+* prevent it crossing a real 4K page.
+*/
+   struct pt_regs regs __aligned(512);
 
struct thread_fp_state fp;
 
-- 
2.35.1



Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-25 Thread Peter Zijlstra
On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page.  It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
> 
> Then after that throttling we return VM_FAULT_RETRY.
> 
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
> 
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
> 
> This not only slows down shared file-backed page faults, but also adds
> mmap lock contention that is in most cases not needed at all.
> 
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
> 
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock.  It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
> 
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program that sequentially dirties a 400MB mmap()ed shmem file; these are
> the times it takes:
> 
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
> 
> I believe it could help more than that.
> 
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault); the rest of the changes in the page
> fault handlers should be relatively straightforward.
> 
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
> 
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
> 
> Signed-off-by: Peter Xu 

Acked-by: Peter Zijlstra (Intel) 


Re: [RFC PATCH 1/4] objtool: Add --mnop as an option to --mcount

2022-05-25 Thread Peter Zijlstra
On Tue, May 24, 2022 at 04:01:48PM +0530, Naveen N. Rao wrote:

> We need to know for sure either way. Nop'ing out the _mcount locations at
> boot allows us to discover existing long branch trampolines. If we want to
> avoid it, we need to note down those locations during build time.
> 
> Do you have a different approach in mind?

If you put _mcount in a separate section then the compiler cannot tell
where it is and is forced to always emit a long branch trampoline.

Does that help?


Re: [PATCH V9 20/20] riscv: compat: Add COMPAT Kbuild skeletal support

2022-05-25 Thread Heiko Stübner
On Wednesday, 25 May 2022, 12:57:30 CEST, Heiko Stübner wrote:
> On Wednesday, 25 May 2022, 00:06:46 CEST, Guenter Roeck wrote:
> > On Wed, May 25, 2022 at 01:46:38AM +0800, Guo Ren wrote:
> > [ ... ]
> > 
> > > > The problem comes from the vdso decode part of __dls3 in musl's
> > > > ldso/dynlink.c. The ehdr->e_phnum & ehdr->e_phentsize are wrong.
> > > >
> > > > I think the root cause is a wrong ELF parser in musl's
> > > > implementation. I will fix that soon.
> > > Not the ELF parser, it's the "aux vector just past environ[]". I think I
> > > could solve this, but anyone who could help dig in is welcome.
> > > 
> > 
> > I am not sure I understand what you are saying here. Point is that my
> > root file system, generated with musl a year or so ago, crashes with
> > your patch set applied. That is a regression, even if there is a bug
> > in musl.
> 
> Also as I said in the other part of the thread, the rootfs seems innocent,
> as my completely-standard Debian riscv64 rootfs is also affected.
> 
> The merged version seems to be v12 [0] - not sure how this discussion
> ended up in v9, but I just tested this revision in two variants:
> 
> - v5.17 + this v9 -> works nicely

I take that back ... now going back to that build I somehow also run into
that issue here ... will investigate more.


> - v5.18-rc6 + this v9 (rebased onto it) -> breaks the boot
>   The only rebase-conflict was with the introduction of restartable
>   sequences and removal of the tracehook include, but turning CONFIG_RSEQ
>   off doesn't seem to affect the breakage.
> 
> So it looks like something changed between 5.17 and 5.18 that causes the 
> issue.
> 
> 
> Heiko
> 
> 
> [0] https://lore.kernel.org/all/20220405071314.3225832-1-guo...@kernel.org/
> 






Re: [RFC PATCH v2 5/7] objtool: Enable objtool to run only on files with ftrace enabled

2022-05-25 Thread Sathvika Vasireddy

Hi Peter,

On 25/05/22 01:20, Peter Zijlstra wrote:

On Tue, May 24, 2022 at 06:59:50PM +, Christophe Leroy wrote:


On 24/05/2022 at 20:02, Peter Zijlstra wrote:

On Tue, May 24, 2022 at 08:01:39PM +0200, Peter Zijlstra wrote:

On Tue, May 24, 2022 at 03:17:45PM +0200, Christophe Leroy wrote:

From: Sathvika Vasireddy

This patch makes sure objtool runs only on the object files
that have ftrace enabled, instead of running on all the object
files.

Signed-off-by: Naveen N. Rao
Signed-off-by: Sathvika Vasireddy
Signed-off-by: Christophe Leroy
---
   scripts/Makefile.build | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 2e0c3f9c1459..06ceffd92921 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -258,8 +258,8 @@ else
   # 'OBJECT_FILES_NON_STANDARD_foo.o := 'y': skip objtool checking for a file
   # 'OBJECT_FILES_NON_STANDARD_foo.o := 'n': override directory skip for a file
   
-$(obj)/%.o: objtool-enabled = $(if $(filter-out y%, \

-   
$(OBJECT_FILES_NON_STANDARD_$(basetarget).o)$(OBJECT_FILES_NON_STANDARD)n),y)
+$(obj)/%.o: objtool-enabled = $(and $(if $(filter-out y%, 
$(OBJECT_FILES_NON_STANDARD_$(basetarget).o)$(OBJECT_FILES_NON_STANDARD)n),y),  
  \
+$(if $(findstring $(strip $(CC_FLAGS_FTRACE)),$(_c_flags)),y),y)

I think this breaks x86, quite a bit of files have ftrace disabled but
very much must run objtool anyway.

Also; since the Changelog gives 0 clue as to what problem it's trying to
solve, I can't suggest anything.

I asked Sathvika on the previous series, see
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220523175548.922671-3...@linux.ibm.com/

He says it is to solve the problem I reported at
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220318105140.43914-4...@linux.ibm.com/#2861128

So on x86 we have:

arch/x86/entry/vdso/Makefile:OBJECT_FILES_NON_STANDARD   := y

to kill objtool for the whole of the VDSO. When we run objtool on
vmlinux it isn't a problem, because the VDSO ends up as a data section
through linker scripts.

Right.. Like you and Christophe mentioned,
arch/powerpc/kernel/vdso/Makefile:OBJECT_FILES_NON_STANDARD := y
should solve it for powerpc as well.


I'll drop this patch and replace it with the above change as part of next
revision series.


Thanks for reviewing!



- Sathvika



Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-25 Thread Geert Uytterhoeven
On Wed, May 25, 2022 at 1:45 AM Peter Xu  wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page.  It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
>
> Then after that throttling we return VM_FAULT_RETRY.
>
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
>
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
>
> This not only slows down shared file-backed page faults, but also adds
> mmap lock contention that is in most cases not needed at all.
>
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
>
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock.  It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
>
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program that sequentially dirties a 400MB mmap()ed shmem file; these are
> the times it takes:
>
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
>
> I believe it could help more than that.
>
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault); the rest of the changes in the page
> fault handlers should be relatively straightforward.
>
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
>
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
>
> Signed-off-by: Peter Xu 

>  arch/m68k/mm/fault.c  |  4 

Acked-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH V9 20/20] riscv: compat: Add COMPAT Kbuild skeletal support

2022-05-25 Thread Heiko Stübner
On Wednesday, 25 May 2022, 00:06:46 CEST, Guenter Roeck wrote:
> On Wed, May 25, 2022 at 01:46:38AM +0800, Guo Ren wrote:
> [ ... ]
> 
> > > The problem comes from the vdso decode part of __dls3 in musl's
> > > ldso/dynlink.c. The ehdr->e_phnum & ehdr->e_phentsize are wrong.
> > >
> > > I think the root cause is a wrong ELF parser in musl's
> > > implementation. I will fix that soon.
> > Not the ELF parser, it's the "aux vector just past environ[]". I think I
> > could solve this, but anyone who could help dig in is welcome.
> > 
> 
> I am not sure I understand what you are saying here. Point is that my
> root file system, generated with musl a year or so ago, crashes with
> your patch set applied. That is a regression, even if there is a bug
> in musl.

Also as I said in the other part of the thread, the rootfs seems innocent,
as my completely-standard Debian riscv64 rootfs is also affected.

The merged version seems to be v12 [0] - not sure how this discussion
ended up in v9, but I just tested this revision in two variants:

- v5.17 + this v9 -> works nicely
- v5.18-rc6 + this v9 (rebased onto it) -> breaks the boot
  The only rebase-conflict was with the introduction of restartable
  sequences and removal of the tracehook include, but turning CONFIG_RSEQ
  off doesn't seem to affect the breakage.

So it looks like something changed between 5.17 and 5.18 that causes the issue.


Heiko


[0] https://lore.kernel.org/all/20220405071314.3225832-1-guo...@kernel.org/





Re: [RFC PATCH v2 0/7] objtool: Enable and implement --mcount option on powerpc

2022-05-25 Thread Sathvika Vasireddy

Hi Christophe,

On 24/05/22 18:47, Christophe Leroy wrote:

This draft series adds PPC32 support to Sathvika's series.
Verified on pmac32 on QEMU.

It should in principle also work for PPC64 BE but for the time being
something goes wrong. In the beginning I had a segfault, hence the first
patch. But I still get no mcount section in the files.

Since PPC64 BE uses older elfv1 ABI, it prepends a dot to symbols.
And so, the relocation records in case of PPC64BE point to "._mcount",
rather than just "_mcount". We should be looking for "._mcount" to be
able to generate mcount_loc section in the files.

Like:

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 70be5a72e838..7da5bf8c7236 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -2185,7 +2185,7 @@ static int classify_symbols(struct objtool_file *file)
    if (arch_is_retpoline(func))
    func->retpoline_thunk = true;

-   if ((!strcmp(func->name, "__fentry__")) || 
(!strcmp(func->name, "_mcount")))
+   if ((!strcmp(func->name, "__fentry__")) || 
(!strcmp(func->name, "_mcount")) || (!strcmp(func->name, "._mcount")))

    func->fentry = true;

    if (is_profiling_func(func->name))


With this change, I could see __mcount_loc section being
generated in individual ppc64be object files.

- Sathvika




Re: [PATCH] powerpc/64: Use tick accounting by default

2022-05-25 Thread Christophe Leroy




On 22/05/2017 at 07:13, Anton Blanchard wrote:

Hi Michael,


ppc64 is the only architecture that turns on
VIRT_CPU_ACCOUNTING_NATIVE by default. The overhead of this option
is extremely high - a context switch microbenchmark using
sched_yield() is almost 20% slower.


Running on what? It should all be nop'ed out unless you're on a
platform that needs it (SPLPAR).


POWERNV native. We don't nop out all the vtime_account_* gunk do we? It
is all those functions that are a large part of the problem.


To get finer grained user/hardirq/softirq statistics, the
IRQ_TIME_ACCOUNTING option can be used instead, which has much lower
overhead.


Can it? We don't select HAVE_IRQ_TIME_ACCOUNTING, so AFAICS it can't
be enabled.


I have a separate patch to enable it.


Doesn't dropping this mean we never count stolen time?


Perhaps. Do we have any applications left that care?



This patch has been superseded by Nick's patch 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220525081346.871535-1-npig...@gmail.com/


Christophe


Re: [PATCH -next v4 3/7] arm64: add support for machine check error safe

2022-05-25 Thread Mark Rutland
On Thu, May 19, 2022 at 02:29:54PM +0800, Tong Tiangen wrote:
> 
> 
> On 2022/5/13 23:26, Mark Rutland wrote:
> > On Wed, Apr 20, 2022 at 03:04:14AM +, Tong Tiangen wrote:
> > > During the processing of arm64 kernel hardware memory errors (do_sea()),
> > > if the error is consumed in the kernel, the current behavior is to panic.
> > > However, this is not optimal.
> > > 
> > > Take uaccess for example: if the uaccess operation fails due to a memory
> > > error, only the user process is affected, so killing the user process
> > > and isolating the user page with the hardware memory error is a better choice.
> > 
> > Conceptually, I'm fine with the idea of constraining what we do for a
> > true uaccess, but I don't like the implementation of this at all, and I
> > think we first need to clean up the arm64 extable usage to clearly
> > distinguish a uaccess from another access.
> 
> OK, using EX_TYPE_UACCESS and making this extable type recoverable is
> more reasonable.

Great.

> For EX_TYPE_UACCESS_ERR_ZERO, today we use it for kernel accesses in a
> couple of cases, such as
> get_user/futex/__user_cache_maint()/__user_swpX_asm(), 

Those are all user accesses.

However, __get_kernel_nofault() and __put_kernel_nofault() use
EX_TYPE_UACCESS_ERR_ZERO by way of __{get,put}_mem_asm(), so we'd need to
refactor that code to split the user/kernel cases higher up the callchain.

> your suggestion is:
> get_user continues to use EX_TYPE_UACCESS_ERR_ZERO and the other cases use
> new type EX_TYPE_FIXUP_ERR_ZERO?

Yes, that's the rough shape. We could make the latter EX_TYPE_KACCESS_ERR_ZERO
to be clearly analogous to EX_TYPE_UACCESS_ERR_ZERO, and with that I suspect we
could remove EX_TYPE_FIXUP.

Thanks,
Mark.


[PATCH] powerpc/64: Remove PPC64 special case for cputime accounting default

2022-05-25 Thread Nicholas Piggin
Distro kernels tend to be moving to VIRT_CPU_ACCOUNTING_GEN, and there
is not much reason why PPC64 should be special here.

VIRT_CPU_ACCOUNTING_NATIVE does provide scaled vtime and stolen time
apportioned between system and user time, and vtime accounting is not
unconditionally enabled, and possibly other things. But it would be
better at this point to extend GEN to cover important missing features
rather than directing users back to a less used option.

Signed-off-by: Nicholas Piggin 
---
After implementing stolen time for GEN for powerpc, can we try this and
see who screams?

Thanks,
Nick

 init/Kconfig | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index ddcbefe535e9..544ed8b0707a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -473,8 +473,7 @@ config VIRT_CPU_ACCOUNTING
 
 choice
prompt "Cputime accounting"
-   default TICK_CPU_ACCOUNTING if !PPC64
-   default VIRT_CPU_ACCOUNTING_NATIVE if PPC64
+   default TICK_CPU_ACCOUNTING
 
 # Kind of a stub config for the pure tick based cputime accounting
 config TICK_CPU_ACCOUNTING
-- 
2.35.1



Re: [PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers

2022-05-25 Thread Guenter Roeck

On 5/24/22 23:35, Alexey Kardashevskiy wrote:



On 5/21/22 04:35, Scott Cheloha wrote:

PAPR v2.12 defines a new hypercall, H_WATCHDOG.  The hypercall permits
guest control of one or more virtual watchdog timers.  The timers have
millisecond granularity.  The guest is terminated when a timer
expires.

This patch adds a watchdog driver for these timers, "pseries-wdt".

pseries_wdt_probe() currently assumes the existence of only one
platform device and always assigns it watchdogNumber 1.  If we ever
expose more than one timer to userspace we will need to devise a way
to assign a distinct watchdogNumber to each platform device at device
registration time.



This one should go before 4/4 in the series for bisectability.

What is platform_device_register_simple("pseries-wdt",...) going to do without 
the driver?



Signed-off-by: Scott Cheloha 
---
  .../watchdog/watchdog-parameters.rst  |  12 +
  drivers/watchdog/Kconfig  |   8 +
  drivers/watchdog/Makefile |   1 +
  drivers/watchdog/pseries-wdt.c    | 337 ++
  4 files changed, 358 insertions(+)
  create mode 100644 drivers/watchdog/pseries-wdt.c

diff --git a/Documentation/watchdog/watchdog-parameters.rst 
b/Documentation/watchdog/watchdog-parameters.rst
index 223c99361a30..4ffe725e796c 100644
--- a/Documentation/watchdog/watchdog-parameters.rst
+++ b/Documentation/watchdog/watchdog-parameters.rst
@@ -425,6 +425,18 @@ pnx833x_wdt:
  -
+pseries-wdt:
+    action:
+    Action taken when watchdog expires: 1 (power off), 2 (restart),
+    3 (dump and restart). (default=2)
+    timeout:
+    Initial watchdog timeout in seconds. (default=60)
+    nowayout:
+    Watchdog cannot be stopped once started.
+    (default=kernel config parameter)
+
+-
+
  rc32434_wdt:
  timeout:
  Watchdog timeout value, in seconds (default=20)
diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig
index c4e82a8d863f..06b412603f3e 100644
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -1932,6 +1932,14 @@ config MEN_A21_WDT
  # PPC64 Architecture
+config PSERIES_WDT
+    tristate "POWER Architecture Platform Watchdog Timer"
+    depends on PPC_PSERIES
+    select WATCHDOG_CORE
+    help
+  Driver for virtual watchdog timers provided by PAPR
+  hypervisors (e.g. PowerVM, KVM).
+
  config WATCHDOG_RTAS
  tristate "RTAS watchdog"
  depends on PPC_RTAS
diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile
index f7da867e8782..f35660409f17 100644
--- a/drivers/watchdog/Makefile
+++ b/drivers/watchdog/Makefile
@@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o
  obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o
  # PPC64 Architecture
+obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o
  obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o
  # S390 Architecture
diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c
new file mode 100644
index ..f41bc4d3b7a2
--- /dev/null
+++ b/drivers/watchdog/pseries-wdt.c
@@ -0,0 +1,337 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2022 International Business Machines, Inc.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DRV_NAME "pseries-wdt"
+
+/*
+ * The PAPR's MSB->LSB bit ordering is 0->63.  These macros simplify
+ * defining bitfields as described in the PAPR without needing to
+ * transpose values to the more C-like 63->0 ordering.
+ */
+#define SETFIELD(_v, _b, _e)    \
+    (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e)))
+#define GETFIELD(_v, _b, _e)    \
+    (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e))
+
+/*
+ * H_WATCHDOG Hypercall Input
+ *
+ * R4: "flags":
+ *
+ * A 64-bit value structured as follows:
+ *
+ * Bits 0-46: Reserved (must be zero).
+ */
+#define PSERIES_WDTF_RESERVED    PPC_BITMASK(0, 46)
+
+/*
+ * Bit 47: "leaveOtherWatchdogsRunningOnTimeout"
+ *
+ * 0  Stop outstanding watchdogs on timeout.
+ * 1  Leave outstanding watchdogs running on timeout.
+ */
+#define PSERIES_WDTF_LEAVE_OTHER    PPC_BIT(47)
+
+/*
+ * Bits 48-55: "operation"
+ *
+ * 0x01  Start Watchdog
+ * 0x02  Stop Watchdog
+ * 0x03  Query Watchdog Capabilities
+ * 0x04  Query Watchdog LPM Requirement
+ */
+#define PSERIES_WDTF_OP(op)    SETFIELD((op), 48, 55)
+#define PSERIES_WDTF_OP_START    PSERIES_WDTF_OP(0x1)
+#define PSERIES_WDTF_OP_STOP    PSERIES_WDTF_OP(0x2)
+#define PSERIES_WDTF_OP_QUERY    PSERIES_WDTF_OP(0x3)
+#define PSERIES_WDTF_OP_QUERY_LPM    PSERIES_WDTF_OP(0x4)
+
+/*
+ * Bits 56-63: "timeoutAction"
+ *
+ * 0x01  Hard poweroff
+ * 0x02  Hard restart
+ * 0x03  Dump restart
+ */
+#define PSERIES_WDTF_ACTION(ac)    

Re: [PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds

2022-05-25 Thread Vaibhav Jain
Hi Ritesh,
thanks for looking into this patch,

Ritesh Harjani  writes:

> Just a minor nit which I noticed.
>
> On 22/05/24 11:20AM, Vaibhav Jain wrote:
>> Presently ima_get_kexec_buffer() doesn't check if the previous kernel's
>> ima-kexec-buffer lies outside the addressable memory range. This can result
>> in a kernel panic if the new kernel is booted with 'mem=X' arg and the
>> ima-kexec-buffer was allocated beyond that range by the previous kernel.
>> The panic is usually of the form below:
>>
>> $ sudo kexec --initrd initrd vmlinux --append='mem=16G'
>>
>> 
>>  BUG: Unable to handle kernel data access on read at 0xc000c01fff7f
>>  Faulting instruction address: 0xc0837974
>>  Oops: Kernel access of bad area, sig: 11 [#1]
>> 
>>  NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0
>>  LR [c083b55c] ima_load_kexec_buffer+0xac/0x160
>>  Call Trace:
>>  [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160
>>  [c371fb00] [c20512c4] ima_init+0x80/0x108
>>  [c371fb70] [c20514dc] init_ima+0x4c/0x120
>>  [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0
>>  [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec
>>  [c371fda0] [c00128a4] kernel_init+0x34/0x1b0
>>  [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64
>>  Instruction dump:
>>  f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330
>>  7c0802a6 fb610198 7c9b2378 f80101d0  2c090001 40820614 e9240010
>>  ---[ end trace  ]---
>>
>> Fix this issue by checking returned PFN range of previous kernel's
>> ima-kexec-buffer with pfn_valid to ensure correct memory bounds.
>>
>> Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the previous kernel")
>> Cc: Frank Rowand 
>> Cc: Prakhar Srivastava 
>> Cc: Lakshmi Ramasubramanian 
>> Cc: Thiago Jung Bauermann 
>> Cc: Rob Herring 
>> Signed-off-by: Vaibhav Jain 
>>
>> ---
>> Changelog
>> ==
>>
>> v2:
>> * Instead of using memblock to determine the valid bounds use pfn_valid() to do
>> so since memblock may not be available late after the kernel init. [ Mpe ]
>> * Changed the patch prefix from 'powerpc' to 'of' [ Mpe ]
>> * Updated the 'Fixes' tag to point to correct commit that introduced this
>> function. [ Rob ]
>> * Fixed some whitespace/tab issues in the patch description [ Rob ]
>> * Added another check that 'tmp_size' for ima-kexec-buffer is > 0
>> ---
>>  drivers/of/kexec.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
>> index 8d374cc552be..879e984fe901 100644
>> --- a/drivers/of/kexec.c
>> +++ b/drivers/of/kexec.c
>> @@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
>>  {
>>  int ret, len;
>>  unsigned long tmp_addr;
>> +unsigned int start_pfn, end_pfn;
>
> ^^^ Shouldn't this be unsigned long?
Thanks for catching this. Yes that should be 'unsigned long'. Will
resend the patch with this fixed.
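
To illustrate the concern, here is a minimal user-space sketch of what happens when a PFN is stored in a 32-bit variable. It assumes 4 KiB pages (a shift of 12) and a 32-bit 'unsigned int'; the demo_* names are hypothetical and not part of the patch. With a larger page size the threshold moves, but the truncation is the same.

```c
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages. */
#define DEMO_PAGE_SHIFT 12

/* Same idea as PHYS_PFN(): physical address -> page frame number. */
static inline uint64_t demo_phys_pfn(uint64_t phys_addr)
{
	return phys_addr >> DEMO_PAGE_SHIFT;
}

/* What an 'unsigned int start_pfn' would end up holding. */
static inline uint32_t demo_pfn_as_uint(uint64_t phys_addr)
{
	return (uint32_t)demo_phys_pfn(phys_addr);
}

/* Returns 1 if the PFN survives being stored in a 32-bit variable. */
static inline int demo_pfn_fits_uint(uint64_t phys_addr)
{
	return demo_phys_pfn(phys_addr) == (uint64_t)demo_pfn_as_uint(phys_addr);
}
```

Under these assumptions a buffer at 16 GiB is fine, but one at 16 TiB has PFN 2^32, which truncates to 0 and would then look like a valid low page to pfn_valid().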

>
> -ritesh
>
>>  size_t tmp_size;
>>  const void *prop;
>>
>> @@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
>>  if (ret)
>>  return ret;
>>
>> +/* Do some sanity checks on the returned size for the ima-kexec buffer */
>> +if (!tmp_size)
>> +return -ENOENT;
>> +
>> +/*
>> + * Calculate the PFNs for the buffer and ensure
>> + * they are within addressable memory.
>> + */
>> +start_pfn = PHYS_PFN(tmp_addr);
>> +end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1);
>> +if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) {
>> +pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n",
>> +tmp_addr, tmp_size);
>> +return -EINVAL;
>> +}
>> +
>>  *addr = __va(tmp_addr);
>>  *size = tmp_size;
>>
>> --
>> 2.35.1
>>

-- 
Cheers
~ Vaibhav


Re: [PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers

2022-05-25 Thread Alexey Kardashevskiy




On 5/21/22 04:35, Scott Cheloha wrote:
> PAPR v2.12 defines a new hypercall, H_WATCHDOG.  The hypercall permits
> guest control of one or more virtual watchdog timers.  The timers have
> millisecond granularity.  The guest is terminated when a timer
> expires.
>
> This patch adds a watchdog driver for these timers, "pseries-wdt".
>
> pseries_wdt_probe() currently assumes the existence of only one
> platform device and always assigns it watchdogNumber 1.  If we ever
> expose more than one timer to userspace we will need to devise a way
> to assign a distinct watchdogNumber to each platform device at device
> registration time.

This one should go before 4/4 in the series for bisectability.

What is platform_device_register_simple("pseries-wdt",...) going to do
without the driver?

Signed-off-by: Scott Cheloha 
---
  .../watchdog/watchdog-parameters.rst  |  12 +
  drivers/watchdog/Kconfig  |   8 +
  drivers/watchdog/Makefile |   1 +
  drivers/watchdog/pseries-wdt.c| 337 ++
  4 files changed, 358 insertions(+)
  create mode 100644 drivers/watchdog/pseries-wdt.c

diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst
index 223c99361a30..4ffe725e796c 100644
--- a/Documentation/watchdog/watchdog-parameters.rst
+++ b/Documentation/watchdog/watchdog-parameters.rst
@@ -425,6 +425,18 @@ pnx833x_wdt:
  
  -
  
+pseries-wdt:
+action:
+   Action taken when watchdog expires: 1 (power off), 2 (restart),
+   3 (dump and restart). (default=2)
+timeout:
+   Initial watchdog timeout in seconds. (default=60)
+nowayout:
+   Watchdog cannot be stopped once started.
+   (default=kernel config parameter)
+
+-
+
  rc32434_wdt:
  timeout:
Watchdog timeout value, in seconds (default=20)
diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig
index c4e82a8d863f..06b412603f3e 100644
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -1932,6 +1932,14 @@ config MEN_A21_WDT
  
  # PPC64 Architecture
  
+config PSERIES_WDT
+   tristate "POWER Architecture Platform Watchdog Timer"
+   depends on PPC_PSERIES
+   select WATCHDOG_CORE
+   help
+ Driver for virtual watchdog timers provided by PAPR
+ hypervisors (e.g. PowerVM, KVM).
+
  config WATCHDOG_RTAS
tristate "RTAS watchdog"
depends on PPC_RTAS
diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile
index f7da867e8782..f35660409f17 100644
--- a/drivers/watchdog/Makefile
+++ b/drivers/watchdog/Makefile
@@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o
  obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o
  
  # PPC64 Architecture
+obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o
  obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o
  
  # S390 Architecture

diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c
new file mode 100644
index ..f41bc4d3b7a2
--- /dev/null
+++ b/drivers/watchdog/pseries-wdt.c
@@ -0,0 +1,337 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2022 International Business Machines, Inc.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DRV_NAME "pseries-wdt"
+
+/*
+ * The PAPR's MSB->LSB bit ordering is 0->63.  These macros simplify
+ * defining bitfields as described in the PAPR without needing to
+ * transpose values to the more C-like 63->0 ordering.
+ */
+#define SETFIELD(_v, _b, _e)   \
+   (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e)))
+#define GETFIELD(_v, _b, _e)   \
+   (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e))
+
+/*
+ * H_WATCHDOG Hypercall Input
+ *
+ * R4: "flags":
+ *
+ * A 64-bit value structured as follows:
+ *
+ * Bits 0-46: Reserved (must be zero).
+ */
+#define PSERIES_WDTF_RESERVED  PPC_BITMASK(0, 46)
+
+/*
+ * Bit 47: "leaveOtherWatchdogsRunningOnTimeout"
+ *
+ * 0  Stop outstanding watchdogs on timeout.
+ * 1  Leave outstanding watchdogs running on timeout.
+ */
+#define PSERIES_WDTF_LEAVE_OTHER   PPC_BIT(47)
+
+/*
+ * Bits 48-55: "operation"
+ *
+ * 0x01  Start Watchdog
+ * 0x02  Stop Watchdog
+ * 0x03  Query Watchdog Capabilities
+ * 0x04  Query Watchdog LPM Requirement
+ */
+#define PSERIES_WDTF_OP(op)  SETFIELD((op), 48, 55)
+#define PSERIES_WDTF_OP_START  PSERIES_WDTF_OP(0x1)
+#define PSERIES_WDTF_OP_STOP   PSERIES_WDTF_OP(0x2)
+#define PSERIES_WDTF_OP_QUERY  PSERIES_WDTF_OP(0x3)
+#define PSERIES_WDTF_OP_QUERY_LPM  PSERIES_WDTF_OP(0x4)
+
+/*
+ * Bits 56-63: "timeoutAction"
+ *
+ * 0x01  Hard poweroff
+ * 0x02  Hard restart
+ * 0x03  Dump restart
+ */
+#define 
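
The MSB-0 helpers that SETFIELD() and GETFIELD() rely on (PPC_BIT, PPC_BITLSHIFT, PPC_BITMASK) are not part of the quoted diff. The following user-space sketch shows how the field macros behave, assuming a 64-bit word and assuming the helpers are defined as below (written from memory of arch/powerpc/include/asm/bitops.h, so treat them as an assumption). demo_wdt_flags() is a hypothetical helper for illustration, not something the driver defines.

```c
#include <stdint.h>

#define DEMO_BITS_PER_LONG 64

/* IBM (MSB-0) bit number -> shift count in the usual LSB-0 numbering. */
#define PPC_BITLSHIFT(be)   (DEMO_BITS_PER_LONG - 1 - (be))
#define PPC_BIT(bit)        (UINT64_C(1) << PPC_BITLSHIFT(bit))
/* Mask covering MSB-0 bits bs..be inclusive (bs <= be). */
#define PPC_BITMASK(bs, be) ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

/* Same shape as the macros in the patch, with uint64_t for portability. */
#define SETFIELD(_v, _b, _e) \
	(((uint64_t)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e)))
#define GETFIELD(_v, _b, _e) \
	(((uint64_t)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e))

/*
 * Hypothetical helper: build a "flags" word from an operation code
 * (bits 48-55) and the leaveOtherWatchdogsRunningOnTimeout bit (bit 47).
 */
static inline uint64_t demo_wdt_flags(uint8_t op, int leave_other)
{
	return SETFIELD(op, 48, 55) | (leave_other ? PPC_BIT(47) : 0);
}
```

Under these definitions the "operation" field occupies mask 0xFF00 in conventional numbering, so the Start operation (0x1) encodes as 0x100 and bit 47 is 0x10000.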

Re: [PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds

2022-05-25 Thread Ritesh Harjani


Just a minor nit which I noticed.

On 22/05/24 11:20AM, Vaibhav Jain wrote:
> Presently ima_get_kexec_buffer() doesn't check if the previous kernel's
> ima-kexec-buffer lies outside the addressable memory range. This can result
> in a kernel panic if the new kernel is booted with 'mem=X' arg and the
> ima-kexec-buffer was allocated beyond that range by the previous kernel.
> The panic is usually of the form below:
>
> $ sudo kexec --initrd initrd vmlinux --append='mem=16G'
>
> 
>  BUG: Unable to handle kernel data access on read at 0xc000c01fff7f
>  Faulting instruction address: 0xc0837974
>  Oops: Kernel access of bad area, sig: 11 [#1]
> 
>  NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0
>  LR [c083b55c] ima_load_kexec_buffer+0xac/0x160
>  Call Trace:
>  [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160
>  [c371fb00] [c20512c4] ima_init+0x80/0x108
>  [c371fb70] [c20514dc] init_ima+0x4c/0x120
>  [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0
>  [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec
>  [c371fda0] [c00128a4] kernel_init+0x34/0x1b0
>  [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64
>  Instruction dump:
>  f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330
>  7c0802a6 fb610198 7c9b2378 f80101d0  2c090001 40820614 e9240010
>  ---[ end trace  ]---
>
> Fix this issue by checking returned PFN range of previous kernel's
> ima-kexec-buffer with pfn_valid to ensure correct memory bounds.
>
> Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the previous kernel")
> Cc: Frank Rowand 
> Cc: Prakhar Srivastava 
> Cc: Lakshmi Ramasubramanian 
> Cc: Thiago Jung Bauermann 
> Cc: Rob Herring 
> Signed-off-by: Vaibhav Jain 
>
> ---
> Changelog
> ==
>
> v2:
> * Instead of using memblock to determine the valid bounds use pfn_valid() to do
> so since memblock may not be available late after the kernel init. [ Mpe ]
> * Changed the patch prefix from 'powerpc' to 'of' [ Mpe ]
> * Updated the 'Fixes' tag to point to correct commit that introduced this
> function. [ Rob ]
> * Fixed some whitespace/tab issues in the patch description [ Rob ]
> * Added another check that 'tmp_size' for ima-kexec-buffer is > 0
> ---
>  drivers/of/kexec.c | 17 +
>  1 file changed, 17 insertions(+)
>
> diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
> index 8d374cc552be..879e984fe901 100644
> --- a/drivers/of/kexec.c
> +++ b/drivers/of/kexec.c
> @@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
>  {
>   int ret, len;
>   unsigned long tmp_addr;
> + unsigned int start_pfn, end_pfn;

^^^ Shouldn't this be unsigned long?

-ritesh

>   size_t tmp_size;
>   const void *prop;
>
> @@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
>   if (ret)
>   return ret;
>
> + /* Do some sanity checks on the returned size for the ima-kexec buffer */
> + if (!tmp_size)
> + return -ENOENT;
> +
> + /*
> +  * Calculate the PFNs for the buffer and ensure
> +  * they are within addressable memory.
> +  */
> + start_pfn = PHYS_PFN(tmp_addr);
> + end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1);
> + if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) {
> + pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n",
> + tmp_addr, tmp_size);
> + return -EINVAL;
> + }
> +
>   *addr = __va(tmp_addr);
>   *size = tmp_size;
>
> --
> 2.35.1
>
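
The bounds check in the quoted hunk can be modelled in user space as follows. This is only a sketch under stated assumptions: 4 KiB pages, 64-bit PFNs (as suggested in the review), and a toy memory model in which every PFN below max_pfn is valid. The demo_* names are hypothetical, and demo_pfn_valid() merely stands in for the kernel's pfn_valid().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages. */
#define DEMO_PAGE_SHIFT 12

/* Toy memory model: PFNs in [0, max_pfn) are considered valid. */
struct demo_mem {
	uint64_t max_pfn;
};

static bool demo_pfn_valid(const struct demo_mem *mem, uint64_t pfn)
{
	return pfn < mem->max_pfn;
}

/* Mirrors the order of checks in the hunk: size first, then both ends. */
static int demo_check_buffer(const struct demo_mem *mem,
			     uint64_t addr, uint64_t size)
{
	uint64_t start_pfn, end_pfn;

	if (!size)
		return -ENOENT;

	start_pfn = addr >> DEMO_PAGE_SHIFT;
	/* Last byte of the buffer, hence the "- 1". */
	end_pfn = (addr + size - 1) >> DEMO_PAGE_SHIFT;

	if (!demo_pfn_valid(mem, start_pfn) || !demo_pfn_valid(mem, end_pfn))
		return -EINVAL;

	return 0;
}
```

With max_pfn set for 16 GiB of memory, a buffer that ends exactly at the 16 GiB boundary passes, while one that starts in the last valid page and spills over it is rejected because its end PFN is out of range.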