Re: [PATCH v4 11/23] powerpc/syscall: Rename syscall_64.c into syscall.c
Le 02/02/2021 à 07:38, Nicholas Piggin a écrit :
> Excerpts from Christophe Leroy's message of February 2, 2021 4:15 pm:
>> Le 28/01/2021 à 00:50, Nicholas Piggin a écrit :
>>> Excerpts from David Laight's message of January 26, 2021 8:28 pm:
>>>> From: Nicholas Piggin
>>>> Sent: 26 January 2021 10:21
>>>>> Excerpts from Christophe Leroy's message of January 26, 2021 12:48 am:
>>>>>> syscall_64.c will be reused almost as is for PPC32.
>>>>>>
>>>>>> Rename it syscall.c
>>>>>
>>>>> Could you rename it to interrupt.c instead? A system call is an
>>>>> interrupt, and the file now also has code to return from other
>>>>> interrupts as well, and it matches the new asm/interrupt.h from
>>>>> the interrupts series.
>>>>
>>>> Hmmm
>>>>
>>>> That might make it harder for someone looking for the system call
>>>> entry code to find it.
>>>
>>> It's very grep'able.
>>>
>>>> In some sense interrupts are the simpler case.
>>>>
>>>> Especially when comparing with other architectures which have
>>>> special instructions for syscall entry.
>>>
>>> powerpc does have a special instruction for syscall, and it causes
>>> a system call interrupt.
>>>
>>> I'm not sure about other architectures, but for powerpc it's more
>>> sensible to call it interrupt.c than syscall.c.
>>
>> Many other architectures have a syscall.c, but for a different
>> purpose: it contains arch-specific system calls. We have that in
>> powerpc as well, it is called syscalls.c
>>
>> So to avoid confusion, I'll rename it. But I think "interrupt" is
>> maybe not the right name. An interrupt most of the time refers to an
>> IRQ.
>
> That depends on what you mean by interrupt and IRQ. Linux kind of
> considers any asynchronous maskable interrupt an irq
> (local_irq_disable()). But if you say irq it's more likely to mean a
> device interrupt, and "interrupt" usually refers to the asynchronous
> ones.
>
> But Linux doesn't really assign names to synchronous interrupts in
> core code. It doesn't say they aren't interrupts, it just doesn't
> really have a convention for them at all. Other architectures, e.g.
> x86, also have things like the interrupt descriptor table for
> synchronous interrupts as well. That's where I got the interrupt
> wrappers code from, actually.
>
> So it's really fine to use the proper arch-specific names for things
> in arch code. I'm trying to slowly change names from exception to
> interrupt.
>
>> For me a system call is not an interrupt, in the sense that it
>> doesn't unexpectedly interrupt the program flow. In powerpc manuals
>> it is generally called exceptions, so I'm more inclined to call it
>> exception.c
>
> Actually that's backwards. Powerpc manuals (at least the ones I look
> at) call them all interrupts, including system calls, and the system
> call interrupt is actually the only one that doesn't appear to be
> associated with an exception.
>
> Also there is no distinction about expected/unexpected -- a data
> storage interrupt is expected if you access a location without the
> right access permissions, for example, but it is still an interrupt.
>
> These handlers very specifically deal with the change to execution
> flow (i.e., the interrupt); they do *not* deal with the exception
> which may be associated with it (that is the job of the handler). And
> on the other hand you can deal with exceptions in some cases without
> taking an interrupt at all. For example, if you had MSR[EE]=0 you
> could change the decrementer, or execute msgclr, or change the HMER
> SPR, etc. to clear various exceptions without ever taking the
> interrupt.

Ok, let's call it interrupt.c then, to be consistent with the interrupt
wrapper story.

Christophe
Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm:
> On Sat, Jan 30, 2021 at 04:11:23PM -0800, Nadav Amit wrote:
>
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 427bfcc6cdec..b97136b7010b 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>>
>>  #ifdef CONFIG_MMU_GATHER_NO_RANGE
>>
>> -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
>> -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma()
>> +#if defined(tlb_flush)
>> +#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
>>  #endif
>>
>>  /*
>> @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
>>
>>  #ifndef tlb_flush
>>
>> -#if defined(tlb_start_vma) || defined(tlb_end_vma)
>> -#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma()
>> -#endif
>
> #ifdef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
> #error
> #endif
>
> goes here...
>
>>  static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
>>  {
>>  	if (tlb->fullmm)
>>  		return;
>>
>> +	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
>> +		return;
>
> Also, can you please stick to the CONFIG_MMU_GATHER_* namespace?
>
> I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
> How about:
>
>   CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH

Yes please, have to have descriptive names.

I didn't quite see why this was much of an improvement though. Maybe
follow up patches take advantage of it? I didn't see how they all fit
together.

Thanks,
Nick
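To make the proposed rename concrete, here is a minimal sketch of the generic tlb_end_vma() with Peter's suggested name folded in. This only illustrates the direction discussed above, not the actual patch: CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH is the proposed symbol, not one that exists yet, and the body mirrors the asm-generic/tlb.h helper quoted in the diff.

/*
 * Sketch only: assumes CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING is
 * renamed to CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH, per the review above.
 */
static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
	if (tlb->fullmm)
		return;

	/* Architectures that batch TLB flushes across VMAs opt out here. */
	if (IS_ENABLED(CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH))
		return;

	/* Otherwise flush the gathered range at the VMA boundary. */
	tlb_flush_mmu_tlbonly(tlb);
}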
[PATCH] powerpc/perf: Fix the guest crash issue with trace-imc
When perf kvm record with a trace_imc event is attached to a guest pid
(with the -p option), the qemu process gets killed with a permission
issue. This happens because trace_imc events require admin privileges
to monitor a process. If qemu creates threads, by default the child
tasks also inherit the counters, and if there is no permission to
monitor the qemu threads, we return permission denied (EACCES).

Fix this by returning EACCES only if there is no CAP_SYS_ADMIN and the
event doesn't have inheritance.

Fixes: 012ae244845f ("powerpc/perf: Trace imc PMU functions")
Signed-off-by: Athira Rajeev
---
 arch/powerpc/perf/imc-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index e106909ff9c3..cc5679bfd28b 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1429,7 +1429,7 @@ static int trace_imc_event_init(struct perf_event *event)
 	if (event->attr.type != event->pmu->type)
 		return -ENOENT;
 
-	if (!perfmon_capable())
+	if (!perfmon_capable() && !event->attr.inherit)
 		return -EACCES;
 
 	/* Return if this is a couting event */
--
1.8.3.1
Re: [PATCH v7 00/42] powerpc: interrupt wrappers
Le 30/01/2021 à 14:08, Nicholas Piggin a écrit :
> This adds interrupt handler wrapper functions, similar to the generic
> / x86 code, and moves several common operations into them from either
> asm or open coded in the individual handlers.
>
> This series is based on powerpc fixes-test tree, there's another
> unrelated pending fix in patch 1 of the series which clashes a bit.

This series trivially conflicts with
https://github.com/linuxppc/linux/commit/11f9c1d2fb497f69f83d4fab6fb7fc8a6884eded
on the powerpc next tree.

> This includes more changes and fixes suggested by Christophe, a few
> minor bug fixes and a compile fix noticed by kbuild, and some NMI
> changes Athira asked about -- PMI interrupts don't block tracing when
> they are soft-NMI.
>
> Since v1:
> - Fixed a couple of compile issues
> - Fixed perf weirdness (sometimes NMI, sometimes not)
> - Also move irq_enter/exit into wrappers
>
> Since v2:
> - Rebased upstream
> - Took code in patch 3 from Christophe
> - Fixed some compile errors from 0day
>
> Since v3:
> - Rebased
> - Split Christophe's 32s DABR patch into its own patch
> - Fixed missing asm from 32s on patch 3 noticed by Christophe.
> - Moved changes around, split out one more patch (patch 9) to make
>   changes more logical and atomic.
> - Add comments explaining _RAW handlers (SLB, HPTE) interrupts better
>
> Since v4:
> - Rebased (on top of scv fallback flush fix)
> - Rearranged a few changes into different patches from Christophe,
>   e.g., the ___do_page_fault change from patch 2 to 10. I didn't do
>   everything (e.g., splitting to update __hash_page to drop the msr
>   argument before the bulk of patch 2 seemed like churn without much
>   improvement), and also other things like removing the new
>   ___do_page_fault variant if we can change hash fault context
>   tracking I didn't get time to completely investigate and implement.
>   I think this shouldn't be a showstopper though, we can make more
>   improvements as we go.
>
> Since v5:
> - Lots of good review suggestions from Christophe, see v5 email
>   threads.
> - Major change being do_break is left in asm and selected early as an
>   alternate interrupt handler now, which is a smaller step and
>   matches other subarchs better.
> - Rearranged patches, split, moved things, bug fixes, etc.
> - Converted a few more missed exception handlers for debug and ras
>
> Since v6:
> - Move related interrupt handler de-argify patches together [Christophe]
> - Split do_bad_page_fault patch [Christophe]
> - Change do_page_fault cleanup patch [Christophe]
> - entry_32.S can't avoid saving r4/r5 until later in the series [Christophe]
> - Soft-NMI decrementer and perf don't block ftrace [Athira]
> - Rebased on some fixes
> - Fixed mismerge / duplicate line in patch 40
> - Fix kbuild hash missing declaration bug
>
> Christophe Leroy (1):
>   powerpc/32s: move DABR match out of handle_page_fault
>
> Nicholas Piggin (41):
>   powerpc/64s: interrupt exit improve bounding of interrupt recursion
>   KVM: PPC: Book3S HV: Context tracking exit guest context before
>     enabling irqs
>   powerpc/64s: move DABR match out of handle_page_fault
>   powerpc/64s: move the hash fault handling logic to C
>   powerpc: remove arguments from fault handler functions
>   powerpc/fsl_booke/32: CacheLockingException remove args
>   powerpc: do_break get registers from regs
>   powerpc: DebugException remove args
>   powerpc/32: transfer can avoid saving r4/r5 over trace call
>   powerpc: bad_page_fault get registers from regs
>   powerpc/64s: add do_bad_page_fault_segv handler
>   powerpc: rearrange do_page_fault error case to be inside
>     exception_enter
>   powerpc/64s: move bad_page_fault handling to C
>   powerpc/64s: split do_hash_fault
>   powerpc/mm: Remove stale do_page_fault comment referring to SLB
>     faults
>   powerpc/64s: slb comment update
>   powerpc/traps: add NOKPROBE_SYMBOL for sreset and mce
>   powerpc/perf: move perf irq/nmi handling details into traps.c
>   powerpc/time: move timer_broadcast_interrupt prototype to asm/time.h
>   powerpc: add and use unknown_async_exception
>   powerpc/cell: tidy up pervasive declarations
>   powerpc: introduce die_mce
>   powerpc/mce: ensure machine check handler always tests RI
>   powerpc: improve handling of unrecoverable system reset
>   powerpc: interrupt handler wrapper functions
>   powerpc: add interrupt wrapper entry / exit stub functions
>   powerpc: convert interrupt handlers to use wrappers
>   powerpc: add interrupt_cond_local_irq_enable helper
>   powerpc/64: context tracking remove _TIF_NOHZ
>   powerpc/64s/hash: improve context tracking of hash faults
>   powerpc/64: context tracking move to interrupt wrappers
>   powerpc/64: add context tracking to asynchronous interrupts
>   powerpc: handle irq_enter/irq_exit in interrupt handler wrappers
>   powerpc/64s: move context tracking exit to interrupt exit path
>   powerpc/64s: reconcile interrupts in C
>   powerpc/64: move account_stolen_time into its own function
>   powerpc/64: entry cpu time accounting in C
>   powerpc: move NMI entry/exit code into wrapper
>   powerpc/64s: move NMI
[PATCH] cpufreq: Remove unused flag CPUFREQ_PM_NO_WARN
This flag is set by one of the drivers but it isn't used in the code
otherwise. Remove the unused flag and update the driver.

Signed-off-by: Viresh Kumar
---
Rebased over:
https://lore.kernel.org/lkml/a59bb322b22c247d570b70a8e94067804287623b.1612241683.git.viresh.ku...@linaro.org/

 drivers/cpufreq/pmac32-cpufreq.c |  3 +--
 include/linux/cpufreq.h          | 13 +++++--------
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/cpufreq/pmac32-cpufreq.c b/drivers/cpufreq/pmac32-cpufreq.c
index 73621bc11976..4f20c6a9108d 100644
--- a/drivers/cpufreq/pmac32-cpufreq.c
+++ b/drivers/cpufreq/pmac32-cpufreq.c
@@ -439,8 +439,7 @@ static struct cpufreq_driver pmac_cpufreq_driver = {
 	.init		= pmac_cpufreq_cpu_init,
 	.suspend	= pmac_cpufreq_suspend,
 	.resume		= pmac_cpufreq_resume,
-	.flags		= CPUFREQ_PM_NO_WARN |
-			  CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING,
+	.flags		= CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING,
 	.attr		= cpufreq_generic_attr,
 	.name		= "powermac",
 };
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index c8e40e91fe9b..353969c7acd3 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -398,8 +398,11 @@ struct cpufreq_driver {
 /* loops_per_jiffy or other kernel "constants" aren't affected by frequency transitions */
 #define CPUFREQ_CONST_LOOPS	BIT(1)
 
-/* don't warn on suspend/resume speed mismatches */
-#define CPUFREQ_PM_NO_WARN	BIT(2)
+/*
+ * Set by drivers that want the core to automatically register the cpufreq
+ * driver as a thermal cooling device.
+ */
+#define CPUFREQ_IS_COOLING_DEV	BIT(2)
 
 /*
  * This should be set by platforms having multiple clock-domains, i.e.
@@ -431,12 +434,6 @@ struct cpufreq_driver {
  */
 #define CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING	BIT(6)
 
-/*
- * Set by drivers that want the core to automatically register the cpufreq
- * driver as a thermal cooling device.
- */
-#define CPUFREQ_IS_COOLING_DEV	BIT(7)
-
 int cpufreq_register_driver(struct cpufreq_driver *driver_data);
 int cpufreq_unregister_driver(struct cpufreq_driver *driver_data);
--
2.25.0.rc1.19.g042ed3e048af
Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
Nicholas Piggin writes:

> Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:
>> Christophe Leroy writes:
>>> +Aneesh
>>>
>>> Le 29/01/2021 à 07:52, Zorro Lang a écrit :
>> ..
>>>> [   96.200296] [ cut here ]
>>>> [   96.200304] Bug: Read fault blocked by KUAP!
>>>> [   96.200309] WARNING: CPU: 3 PID: 1876 at arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
>>>> [   96.200734] NIP [c0849424] fault_in_pages_readable+0x104/0x350
>>>> [   96.200741] LR [c084952c] fault_in_pages_readable+0x20c/0x350
>>>> [   96.200747] --- interrupt: 300
>>>
>>> Problem happens in a section where userspace access is supposed to
>>> be granted, so the patch you proposed is definitely not the right
>>> fix.
>>>
>>> c0849408:  2c 01 00 4c  isync
>>> c084940c:  a6 03 3d 7d  mtspr  29,r9     <== granting userspace access permission
>>> c0849410:  2c 01 00 4c  isync
>>> c0849414:  00 00 36 e9  ld     r9,0(r22)
>>> c0849418:  20 00 29 81  lwz    r9,32(r9)
>>> c084941c:  00 02 29 71  andi.  r9,r9,512
>>> c0849420:  78 d3 5e 7f  mr     r30,r26
>>> ==> c0849424:  00 00 bf 8b  lbz    r29,0(r31)  <== accessing userspace
>>> c0849428:  10 00 82 41  beq    c0849438
>>> c084942c:  2c 01 00 4c  isync
>>> c0849430:  a6 03 bd 7e  mtspr  29,r21    <== clearing userspace access permission
>>> c0849434:  2c 01 00 4c  isync
>>>
>>> My first guess is that the problem is linked to the following
>>> function, see the comment:
>>>
>>> /*
>>>  * For kernel thread that doesn't have thread.regs return
>>>  * default AMR/IAMR values.
>>>  */
>>> static inline u64 current_thread_amr(void)
>>> {
>>> 	if (current->thread.regs)
>>> 		return current->thread.regs->amr;
>>> 	return AMR_KUAP_BLOCKED;
>>> }
>>>
>>> The above function was introduced by commit 48a8ab4eeb82
>>> ("powerpc/book3s64/pkeys: Don't update SPRN_AMR when in kernel mode")
>>
>> Yeah that's a bit of a curly one.
>>
>> At some point io_uring did kthread_use_mm(), which is supposed to
>> mean the kthread can operate on behalf of the original process that
>> submitted the IO.
>>
>> But because KUAP is implemented using memory protection keys, it
>> depends on the value of the AMR register, which is not part of the
>> mm, it's in thread.regs->amr.
>>
>> And what's worse, by the time we're in kthread_use_mm() we no longer
>> have access to the thread.regs->amr of the original process that
>> submitted the IO.
>>
>> We also can't simply move the AMR into the mm, precisely because
>> it's per thread, not per mm.
>>
>> So TBH I don't know how we're going to fix this.
>>
>> I guess we could return AMR=unblocked for kernel threads, but that's
>> arguably a bug because it allows a process to circumvent memory keys
>> by asking the kernel to do the access.
>
> We shouldn't need to inherit AMR, should we? We only need it to be
> locked for kernel threads until it's explicitly unlocked -- nothing
> mm specific there. I think current_thread_amr could return 0 for
> kernel threads? Or I would even avoid using that function for
> allow_user_access and open code the kthread case and remove it from
> current_thread_amr().
>
> Thanks,
> Nick

Can we try this?

commit 9a193d38b1a0a52364bc70ec953e0685993603b6
Author: Aneesh Kumar K.V
Date:   Tue Feb 2 09:23:38 2021 +0530

    powerpc/kuap: Allow kernel thread to access userspace after kthread_use_mm

    This fixes the bad fault reported by KUAP when io_wqe_worker accesses
    userspace.

    Bug: Read fault blocked by KUAP!
    WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 __do_page_fault+0x6b4/0xcd0
    NIP [c009e7e4] __do_page_fault+0x6b4/0xcd0
    LR [c009e7e0] __do_page_fault+0x6b0/0xcd0
    ..
    Call Trace:
    [c00016367330] [c009e7e0] __do_page_fault+0x6b0/0xcd0 (unreliable)
    [c000163673e0] [c009ee3c] do_page_fault+0x3c/0x120
    [c00016367430] [c000c848] handle_page_fault+0x10/0x2c
    --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
    ..
    NIP [c08e8228] iov_iter_fault_in_readable+0x148/0x6f0
    LR [c08e834c] iov_iter_fault_in_readable+0x26c/0x6f0
    interrupt: 300
    [c000163677e0] [c07154a0] iomap_write_actor+0xc0/0x280
    [c00016367880] [c070fc94] iomap_apply+0x1c4/0x780
    [c00016367990] [c0710330] iomap_file_buffered_write+0xa0/0x120
    [c000163679e0] [c0080040791c] xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
    [c00016367a90] [c06d74bc] io_write+0x10c/0x460
    [c00016367bb0] [c06d80e4] io_issue_sqe+0x8d4/0x1200
    [c00016367c70] [c06d8ad0] io_wq_submit_work+0xc0/0x250
    [c00016367cb0] [c06e2578]
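For illustration, a minimal sketch of the direction Nick suggests above: a kernel thread's AMR need not inherit anything from the submitting process, it only needs the usual lock/unlock around user accesses once it has taken a user mm. This is a sketch under those assumptions, not Aneesh's actual v2 patch (which is truncated above).

/*
 * Sketch of the suggestion only: kernel threads have no thread.regs, and
 * returning AMR_KUAP_BLOCKED for them made kthread_use_mm() workers such
 * as io_wqe_worker trip KUAP on legitimate user accesses.
 */
static inline u64 current_thread_amr(void)
{
	if (current->thread.regs)
		return current->thread.regs->amr;

	/* Assumed: kernel thread (e.g. io_uring worker after kthread_use_mm()) */
	return 0;
}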
Re: [PATCH] dma-mapping: remove unneeded semicolon
On 2/1/21 7:41 PM, Yang Li wrote:
> Eliminate the following coccicheck warning:
> ./arch/powerpc/platforms/ps3/system-bus.c:606:2-3: Unneeded semicolon
> ./arch/powerpc/platforms/ps3/system-bus.c:765:2-3: Unneeded semicolon
>
> Reported-by: Abaci Robot
> Signed-off-by: Yang Li
> ---
>  arch/powerpc/platforms/ps3/system-bus.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Thanks for your patch, it looks good.

Acked-by: Geoff Levand

-Geoff
[PATCH] powerpc/book3s64: remove unneeded semicolon
Eliminate the following coccicheck warning:
./arch/powerpc/platforms/pseries/lpar.c:1632:2-3: Unneeded semicolon
./arch/powerpc/platforms/pseries/lpar.c:1663:2-3: Unneeded semicolon

Reported-by: Abaci Robot
Signed-off-by: Yang Li
---
 arch/powerpc/platforms/pseries/lpar.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 764170f..24889b8 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1629,7 +1629,7 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
 		}
 		msleep(delay);
 		rc = plpar_resize_hpt_prepare(0, shift);
-	};
+	}
 
 	switch (rc) {
 	case H_SUCCESS:
@@ -1663,7 +1663,7 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
 			pr_warn("Unexpected error %d from H_RESIZE_HPT_COMMIT\n",
 				state.commit_rc);
 			return -EIO;
-		};
+		}
 	}
 
 	pr_info("HPT resize to shift %lu complete (%lld ms / %lld ms)\n",
--
1.8.3.1
Re: [PATCH] powerpc/eeh: remove unneeded semicolon
On Tue, Feb 2, 2021 at 2:21 PM Yang Li wrote:
>
> Eliminate the following coccicheck warning:
> ./arch/powerpc/kernel/eeh.c:782:2-3: Unneeded semicolon

Doesn't appear to be a load-bearing semicolon. It's hard to tell with EEH.

Reviewed-by: Oliver O'Halloran

>
> Reported-by: Abaci Robot
> Signed-off-by: Yang Li
> ---
>  arch/powerpc/kernel/eeh.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 813713c..02c058d 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -779,7 +779,7 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
>  	default:
>  		eeh_pe_state_clear(pe, EEH_PE_ISOLATED | EEH_PE_CFG_BLOCKED, true);
>  		return -EINVAL;
> -	};
> +	}
>
>  	return 0;
>  }
> --
> 1.8.3.1
[PATCH] dma-mapping: remove unneeded semicolon
Eliminate the following coccicheck warning:
./arch/powerpc/platforms/ps3/system-bus.c:606:2-3: Unneeded semicolon
./arch/powerpc/platforms/ps3/system-bus.c:765:2-3: Unneeded semicolon

Reported-by: Abaci Robot
Signed-off-by: Yang Li
---
 arch/powerpc/platforms/ps3/system-bus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/system-bus.c b/arch/powerpc/platforms/ps3/system-bus.c
index b431f41..5c73926 100644
--- a/arch/powerpc/platforms/ps3/system-bus.c
+++ b/arch/powerpc/platforms/ps3/system-bus.c
@@ -603,7 +603,7 @@ static dma_addr_t ps3_ioc0_map_page(struct device *_dev, struct page *page,
 	default:
 		/* not happned */
 		BUG();
-	};
+	}
 
 	result = ps3_dma_map(dev->d_region, (unsigned long)ptr, size,
 			     &bus_addr, iopte_flag);
@@ -762,7 +762,7 @@ int ps3_system_bus_device_register(struct ps3_system_bus_device *dev)
 		break;
 	default:
 		BUG();
-	};
+	}
 
 	dev->core.of_node = NULL;
 	set_dev_node(&dev->core, 0);
--
1.8.3.1
[PATCH] powerpc/64s: remove unneeded semicolon
Eliminate the following coccicheck warning:
./arch/powerpc/platforms/powernv/setup.c:160:2-3: Unneeded semicolon

Reported-by: Abaci Robot
Signed-off-by: Yang Li
---
 arch/powerpc/platforms/powernv/setup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 4426a10..167d6ed 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -157,7 +157,7 @@ static void __init pnv_check_guarded_cores(void)
 	for_each_node_by_type(dn, "cpu") {
 		if (of_property_match_string(dn, "status", "bad") >= 0)
 			bad_count++;
-	};
+	}
 
 	if (bad_count) {
 		printk(" _ ___\n");
--
1.8.3.1
[PATCH] KVM: PPC: remove unneeded semicolon
Eliminate the following coccicheck warning:
./arch/powerpc/kvm/booke.c:701:2-3: Unneeded semicolon

Reported-by: Abaci Robot
Signed-off-by: Yang Li
---
 arch/powerpc/kvm/booke.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 288a982..f38ae3e 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -698,7 +698,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 		kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
 		r = 1;
-	};
+	}
 
 	return r;
 }
--
1.8.3.1
[PATCH] powerpc/prom_init: remove unneeded semicolon
Eliminate the following coccicheck warning:
./arch/powerpc/kernel/prom_init.c:2990:2-3: Unneeded semicolon

Reported-by: Abaci Robot
Signed-off-by: Yang Li
---
 arch/powerpc/kernel/prom_init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index e9d4eb6..2d111bb 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -2987,7 +2987,7 @@ static void __init fixup_device_tree_efika_add_phy(void)
 			" 0x3 encode-int encode+"
 			" s\" interrupts\" property"
 			" finish-device");
-	};
+	}
 
 	/* Check for a PHY device node - if missing then create one and
 	 * give it's phandle to the ethernet node */
--
1.8.3.1
[PATCH] powerpc/eeh: remove unneeded semicolon
Eliminate the following coccicheck warning:
./arch/powerpc/kernel/eeh.c:782:2-3: Unneeded semicolon

Reported-by: Abaci Robot
Signed-off-by: Yang Li
---
 arch/powerpc/kernel/eeh.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 813713c..02c058d 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -779,7 +779,7 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
 	default:
 		eeh_pe_state_clear(pe, EEH_PE_ISOLATED | EEH_PE_CFG_BLOCKED, true);
 		return -EINVAL;
-	};
+	}
 
 	return 0;
 }
--
1.8.3.1
[PATCH] crypto: powerpc: remove unneeded semicolon
Eliminate the following coccicheck warning:
./arch/powerpc/crypto/sha256-spe-glue.c:132:2-3: Unneeded semicolon

Reported-by: Abaci Robot
Signed-off-by: Yang Li
---
 arch/powerpc/crypto/sha256-spe-glue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/crypto/sha256-spe-glue.c b/arch/powerpc/crypto/sha256-spe-glue.c
index a6e650a..ffedea7 100644
--- a/arch/powerpc/crypto/sha256-spe-glue.c
+++ b/arch/powerpc/crypto/sha256-spe-glue.c
@@ -129,7 +129,7 @@ static int ppc_spe_sha256_update(struct shash_desc *desc, const u8 *data,
 
 		src += bytes;
 		len -= bytes;
-	};
+	}
 
 	memcpy((char *)sctx->buf, src, len);
 	return 0;
--
1.8.3.1
[RFC PATCH 9/9] KVM: PPC: Book3S HV: Implement the rest of the P9 entry/exit handling in C
Almost all logic is moved to C, by introducing a new in_guest mode that
selects and branches very early in the interrupt handler to the P9 exit
code.

The remaining asm is a low level stack setup, with VCPU vs host
register save and restore. Only the GPRs that might be touched by the
compiler are handled in asm.

More consolidation could be made in future by inlining more of
__kvmhv_vcpu_entry_p9 into its callers and handling some bits there
perhaps, although there is something to be said for keeping the P9 and
old path looking similar, so that needs to be done carefully if at all.

Some bits are left out for now, ultravisor return for example. I
haven't added any loop to return quickly back into the guest, no simple
"realmode" hcalls, decrementer checks, etc. It remains to be seen
whether these would actually matter to add -- ISA v3.0 (radix, large
decrementer, etc) + XIVE should cut down on guest exits by a huge
amount, and of those that remain I would say a large portion of them
will need to go back out to Linux. Radix also makes this much more
light-weight because we are already in host MMU mode.

There are a few bits like LPID/LPCR/PID switching and XIVE pulling and
pushing which would still be desirable to avoid, so if it happens a lot
we could probably handle some simple common cases. I hope we won't have
to at all.
---
 arch/powerpc/include/asm/asm-prototypes.h |   2 +-
 arch/powerpc/include/asm/kvm_asm.h        |   3 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   2 +
 arch/powerpc/include/asm/kvm_ppc.h        |   2 +
 arch/powerpc/kvm/Makefile                 |   3 +
 arch/powerpc/kvm/book3s_64_entry.S        | 135 ++
 arch/powerpc/kvm/book3s_hv.c              |  21 ++-
 arch/powerpc/kvm/book3s_hv_interrupt.c    | 164 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 123 ++--
 arch/powerpc/kvm/book3s_xive.c            |  32 +
 10 files changed, 364 insertions(+), 123 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_interrupt.c

diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index d0b832cbbec8..ba15ce78ffe5 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -171,7 +171,7 @@ void kvmhv_load_host_pmu(void);
 void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
 void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
 
-int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);
+void kvmppc_p9_enter_guest(struct kvm_vcpu *vcpu);
 
 long kvmppc_h_set_dabr(struct kvm_vcpu *vcpu, unsigned long dabr);
 long kvmppc_h_set_xdabr(struct kvm_vcpu *vcpu, unsigned long dabr,
diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
index a3633560493b..b4f9996bd331 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -146,7 +146,8 @@
 #define KVM_GUEST_MODE_GUEST	1
 #define KVM_GUEST_MODE_SKIP	2
 #define KVM_GUEST_MODE_GUEST_HV	3
-#define KVM_GUEST_MODE_HOST_HV	4
+#define KVM_GUEST_MODE_GUEST_HV_FAST	4 /* ISA v3.0 with host radix mode */
+#define KVM_GUEST_MODE_HOST_HV	5
 
 #define KVM_INST_FETCH_FAILED	-1
 
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 9bb9bb370b53..7f08f6ed73df 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -153,6 +153,8 @@ static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu *vcpu)
 	return radix;
 }
 
+int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);
+
 #define KVM_DEFAULT_HPT_ORDER	24	/* 16MB HPT by default */
 
 #endif
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0a056c64c317..1699e8ca96b1 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -670,6 +670,7 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
 extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
 			       int level, bool line_status);
 extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
+extern void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu);
 
 static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
 {
@@ -710,6 +711,7 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur
 static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
 				      int level, bool line_status) { return -ENODEV; }
 static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
+static inline void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu) { }
 
 static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
 { return 0; }
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index cdd119028f64..b94be8c9bad1 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -43,6 +43,9 @@
[RFC PATCH 8/9] KVM: PPC: Book3S HV: Minimise hcall handler calling convention differences
Signed-off-by: Nicholas Piggin
---
 arch/powerpc/kernel/exceptions-64s.S | 16 +++-
 arch/powerpc/kvm/book3s_64_entry.S   | 22 +++---
 2 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index bed4499488b3..0844558f1d7c 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1923,8 +1923,22 @@ EXC_VIRT_END(system_call, 0x4c00, 0x100)
 
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 TRAMP_REAL_BEGIN(kvm_hcall)
+	std	r9,PACA_EXGEN+EX_R9(r13)
+	std	r11,PACA_EXGEN+EX_R11(r13)
+	std	r12,PACA_EXGEN+EX_R12(r13)
+	mfcr	r9
 	mfctr	r10
-	SET_SCRATCH0(r10) /* Save r13 in SCRATCH0 */
+	std	r10,PACA_EXGEN+EX_R13(r13)
+	li	r10,0
+	std	r10,PACA_EXGEN+EX_CFAR(r13)
+	std	r10,PACA_EXGEN+EX_CTR(r13)
+BEGIN_FTR_SECTION
+	mfspr	r10,SPRN_PPR
+	std	r10,PACA_EXGEN+EX_PPR(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
+
+	HMT_MEDIUM
+
 #ifdef CONFIG_RELOCATABLE
 	/*
 	 * Requires __LOAD_FAR_HANDLER beause kvmppc_hcall lives
diff --git a/arch/powerpc/kvm/book3s_64_entry.S b/arch/powerpc/kvm/book3s_64_entry.S
index 0d9e1e55c24d..0e39267aef63 100644
--- a/arch/powerpc/kvm/book3s_64_entry.S
+++ b/arch/powerpc/kvm/book3s_64_entry.S
@@ -9,24 +9,9 @@
 .global	kvmppc_hcall
 .balign	IFETCH_ALIGN_BYTES
 kvmppc_hcall:
-	/*
-	 * This is a hcall, so register convention is as
-	 * Documentation/powerpc/papr_hcalls.rst, with these additions:
-	 * R13		= PACA
-	 * guest R13 saved in SPRN_SCRATCH0
-	 * R10		= free
-	 */
-BEGIN_FTR_SECTION
-	mfspr	r10,SPRN_PPR
-	std	r10,HSTATE_PPR(r13)
-END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
-	HMT_MEDIUM
-	mfcr	r10
-	std	r12,HSTATE_SCRATCH0(r13)
-	sldi	r12,r10,32
-	ori	r12,r12,0xc00
-	ld	r10,PACA_EXGEN+EX_R10(r13)
-	b	do_kvm_interrupt
+	ld	r10,PACA_EXGEN+EX_R13(r13)
+	SET_SCRATCH0(r10)
+	li	r10,0xc00
 
 .global	kvmppc_interrupt
 .balign	IFETCH_ALIGN_BYTES
@@ -57,7 +42,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	ld	r10,EX_R10(r11)
 	ld	r11,EX_R11(r11)
 
-do_kvm_interrupt:
 	/*
 	 * Hcalls and other interrupts come here after normalising register
 	 * contents and save locations:
--
2.23.0
[RFC PATCH 7/9] KVM: PPC: Book3S HV: move bad_host_intr check to HV handler
This is not used by PR KVM. Signed-off-by: Nicholas Piggin --- arch/powerpc/kvm/book3s_64_entry.S | 3 --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 +++- arch/powerpc/kvm/book3s_segment.S | 7 +++ 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_entry.S b/arch/powerpc/kvm/book3s_64_entry.S index 5db76c8d4012..0d9e1e55c24d 100644 --- a/arch/powerpc/kvm/book3s_64_entry.S +++ b/arch/powerpc/kvm/book3s_64_entry.S @@ -73,11 +73,8 @@ do_kvm_interrupt: beq maybe_skip no_skip: #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE - cmpwi r9,KVM_GUEST_MODE_HOST_HV - beq kvmppc_bad_host_intr #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE cmpwi r9,KVM_GUEST_MODE_GUEST - ld r9,HSTATE_SCRATCH2(r13) beq kvmppc_interrupt_pr #endif b kvmppc_interrupt_hv diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index b9c4acd747f7..8144c1403203 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1251,6 +1251,7 @@ hdec_soon: kvmppc_interrupt_hv: /* * Register contents: +* R9 = HSTATE_IN_GUEST * R12 = (guest CR << 32) | interrupt vector * R13 = PACA * guest R12 saved in shadow VCPU SCRATCH0 @@ -1258,6 +1259,8 @@ kvmppc_interrupt_hv: * guest R9 saved in HSTATE_SCRATCH2 */ /* We're now back in the host but in guest MMU context */ + cmpwi r9,KVM_GUEST_MODE_HOST_HV + beq kvmppc_bad_host_intr li r9, KVM_GUEST_MODE_HOST_HV stb r9, HSTATE_IN_GUEST(r13) @@ -3245,7 +3248,6 @@ END_FTR_SECTION_IFCLR(CPU_FTR_P9_TM_HV_ASSIST) * cfar is saved in HSTATE_CFAR(r13) * ppr is saved in HSTATE_PPR(r13) */ -.global kvmppc_bad_host_intr kvmppc_bad_host_intr: /* * Switch to the emergency stack, but start half-way down in diff --git a/arch/powerpc/kvm/book3s_segment.S b/arch/powerpc/kvm/book3s_segment.S index 1f492aa4c8d6..ef1d88b869bf 100644 --- a/arch/powerpc/kvm/book3s_segment.S +++ b/arch/powerpc/kvm/book3s_segment.S @@ -167,8 +167,15 @@ kvmppc_interrupt_pr: * R12 = (guest CR << 32) | exit handler id * R13 = PACA * HSTATE.SCRATCH0 = guest R12 +* +* If HV is possible, additionally: +* R9 = HSTATE_IN_GUEST +* HSTATE.SCRATCH2 = guest R9 */ #ifdef CONFIG_PPC64 +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE + ld r9,HSTATE_SCRATCH2(r13) +#endif /* Match 32-bit entry */ rotldi r12, r12, 32 /* Flip R12 halves for stw */ stw r12, HSTATE_SCRATCH1(r13) /* CR is now in the low half */ -- 2.23.0
[RFC PATCH 6/9] KVM: PPC: Book3S HV: Move interrupt early register setup to KVM
Like the earlier patch for hcalls, KVM interrupt entry requires a different calling convention than the Linux interrupt handlers set up. Move the code that converts from one to the other into KVM. Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/exceptions-64s.S | 126 --- arch/powerpc/kvm/book3s_64_entry.S | 37 ++-- 2 files changed, 50 insertions(+), 113 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index c893801eb1d5..bed4499488b3 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -191,7 +191,6 @@ do_define_int n .endif .endm -#ifdef CONFIG_KVM_BOOK3S_64_HANDLER /* * All interrupts which set HSRR registers, as well as SRESET and MCE and * syscall when invoked with "sc 1" switch to MSR[HV]=1 (HVMODE) to be taken, @@ -224,54 +223,25 @@ do_define_int n * to KVM to handle. */ -.macro KVMTEST name +.macro KVMTEST name handler +#ifdef CONFIG_KVM_BOOK3S_64_HANDLER lbz r10,HSTATE_IN_GUEST(r13) cmpwi r10,0 - bne \name\()_kvm -.endm - -.macro GEN_KVM name - .balign IFETCH_ALIGN_BYTES -\name\()_kvm: - -BEGIN_FTR_SECTION - ld r10,IAREA+EX_CFAR(r13) - std r10,HSTATE_CFAR(r13) -END_FTR_SECTION_IFSET(CPU_FTR_CFAR) - - ld r10,IAREA+EX_CTR(r13) - mtctr r10 -BEGIN_FTR_SECTION - ld r10,IAREA+EX_PPR(r13) - std r10,HSTATE_PPR(r13) -END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) - ld r11,IAREA+EX_R11(r13) - ld r12,IAREA+EX_R12(r13) - std r12,HSTATE_SCRATCH0(r13) - sldir12,r9,32 - ld r9,IAREA+EX_R9(r13) - ld r10,IAREA+EX_R10(r13) /* HSRR variants have the 0x2 bit added to their trap number */ .if IHSRR_IF_HVMODE BEGIN_FTR_SECTION - ori r12,r12,(IVEC + 0x2) + li r10,(IVEC + 0x2) FTR_SECTION_ELSE - ori r12,r12,(IVEC) + li r10,(IVEC) ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) .elseif IHSRR - ori r12,r12,(IVEC+ 0x2) + li r10,(IVEC + 0x2) .else - ori r12,r12,(IVEC) + li r10,(IVEC) .endif - b kvmppc_interrupt -.endm - -#else -.macro KVMTEST name -.endm -.macro GEN_KVM name -.endm + bne \handler #endif +.endm /* * This is the BOOK3S interrupt entry code macro. @@ -413,7 +383,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR) DEFINE_FIXED_SYMBOL(\name\()_common_real) \name\()_common_real: .if IKVM_REAL - KVMTEST \name + KVMTEST \name kvm_interrupt .endif ld r10,PACAKMSR(r13) /* get MSR value for kernel */ @@ -436,7 +406,7 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real) DEFINE_FIXED_SYMBOL(\name\()_common_virt) \name\()_common_virt: .if IKVM_VIRT - KVMTEST \name + KVMTEST \name kvm_interrupt 1: .endif .endif /* IVIRT */ @@ -450,7 +420,7 @@ DEFINE_FIXED_SYMBOL(\name\()_common_virt) DEFINE_FIXED_SYMBOL(\name\()_common_real) \name\()_common_real: .if IKVM_REAL - KVMTEST \name + KVMTEST \name kvm_interrupt .endif .endm @@ -1011,8 +981,6 @@ EXC_COMMON_BEGIN(system_reset_common) EXCEPTION_RESTORE_REGS RFI_TO_USER_OR_KERNEL - GEN_KVM system_reset - /** * Interrupt 0x200 - Machine Check Interrupt (MCE). @@ -1196,7 +1164,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) /* * Check if we are coming from guest. If yes, then run the normal * exception handler which will take the -* machine_check_kvm->kvmppc_interrupt branch to deliver the MC event +* machine_check_kvm->kvm_interrupt branch to deliver the MC event * to guest. 
*/ lbz r11,HSTATE_IN_GUEST(r13) @@ -1267,8 +1235,6 @@ EXC_COMMON_BEGIN(machine_check_common) bl machine_check_exception b interrupt_return - GEN_KVM machine_check - #ifdef CONFIG_PPC_P7_NAP /* @@ -1393,8 +1359,6 @@ MMU_FTR_SECTION_ELSE b handle_page_fault ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX) - GEN_KVM data_access - /** * Interrupt 0x380 - Data Segment Interrupt (DSLB). @@ -1449,8 +1413,6 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX) bl do_bad_slb_fault b interrupt_return - GEN_KVM data_access_slb - /** * Interrupt 0x400 - Instruction Storage Interrupt (ISI). @@ -1489,8 +1451,6 @@ MMU_FTR_SECTION_ELSE b handle_page_fault ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX) - GEN_KVM instruction_access - /** * Interrupt 0x480 - Instruction Segment Interrupt (ISLB). @@ -1540,8 +1500,6 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX) bl do_bad_slb_fault b interrupt_return -
[RFC PATCH 5/9] powerpc/64s: Remove EXSLB interrupt save area
SLB faults should not be taken while the PACA save areas are live, so EXSLB is not required. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/paca.h | 3 +-- arch/powerpc/kernel/asm-offsets.c| 1 - arch/powerpc/kernel/exceptions-64s.S | 5 - 3 files changed, 1 insertion(+), 8 deletions(-) diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index 9454d29ff4b4..be0b00cb9fbb 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -108,8 +108,7 @@ struct paca_struct { */ /* used for most interrupts/exceptions */ u64 exgen[EX_SIZE] __attribute__((aligned(0x80))); - u64 exslb[EX_SIZE]; /* used for SLB/segment table misses -* on the linear mapping */ + /* SLB related definitions */ u16 vmalloc_sllp; u8 slb_cache_ptr; diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 489a22cf1a92..51302fb74d14 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -255,7 +255,6 @@ int main(void) #endif /* CONFIG_PPC_MM_SLICES */ OFFSET(PACA_EXGEN, paca_struct, exgen); OFFSET(PACA_EXMC, paca_struct, exmc); - OFFSET(PACA_EXSLB, paca_struct, exslb); OFFSET(PACA_EXNMI, paca_struct, exnmi); #ifdef CONFIG_PPC_PSERIES OFFSET(PACALPPACAPTR, paca_struct, lppaca_ptr); diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index e964ea5149e8..c893801eb1d5 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -1412,13 +1412,9 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX) * on user-handler data structures. * * KVM: Same as 0x300, DSLB must test for KVM guest. - * - * A dedicated save area EXSLB is used (XXX: but it actually need not be - * these days, we could use EXGEN). */ INT_DEFINE_BEGIN(data_access_slb) IVEC=0x380 - IAREA=PACA_EXSLB IRECONCILE=0 IDAR=1 IKVM_REAL=1 @@ -1507,7 +1503,6 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX) */ INT_DEFINE_BEGIN(instruction_access_slb) IVEC=0x480 - IAREA=PACA_EXSLB IRECONCILE=0 IISIDE=1 IDAR=1 -- 2.23.0
[RFC PATCH 4/9] KVM: PPC: Book3S HV: Move hcall early register setup to KVM
System calls / hcalls have a different calling convention than other interrupts, so there is code in the KVMTEST to massage these into the same form as other interrupt handlers. Move this work into the KVM hcall handler. This means teaching KVM a little more about the low level interrupt handler setup, PACA save areas, etc., although that's not obviously worse than the current approach of coming up with an entirely different interrupt register / save convention. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/exception-64s.h | 13 +++ arch/powerpc/kernel/exceptions-64s.S | 44 ++-- arch/powerpc/kvm/book3s_64_entry.S | 17 + 3 files changed, 32 insertions(+), 42 deletions(-) diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h index 1d32b174ab6a..7e5168b83155 100644 --- a/arch/powerpc/include/asm/exception-64s.h +++ b/arch/powerpc/include/asm/exception-64s.h @@ -35,6 +35,19 @@ /* PACA save area size in u64 units (exgen, exmc, etc) */ #define EX_SIZE10 +/* PACA save area offsets */ +#define EX_R9 0 +#define EX_R10 8 +#define EX_R11 16 +#define EX_R12 24 +#define EX_R13 32 +#define EX_DAR 40 +#define EX_DSISR 48 +#define EX_CCR 52 +#define EX_CFAR56 +#define EX_PPR 64 +#define EX_CTR 72 + /* * maximum recursive depth of MCE exceptions */ diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index c25395b5921a..e964ea5149e8 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -21,22 +21,6 @@ #include #include -/* PACA save area offsets (exgen, exmc, etc) */ -#define EX_R9 0 -#define EX_R10 8 -#define EX_R11 16 -#define EX_R12 24 -#define EX_R13 32 -#define EX_DAR 40 -#define EX_DSISR 48 -#define EX_CCR 52 -#define EX_CFAR56 -#define EX_PPR 64 -#define EX_CTR 72 -.if EX_SIZE != 10 - .error "EX_SIZE is wrong" -.endif - /* * Following are fixed section helper macros. * @@ -2000,45 +1984,21 @@ EXC_VIRT_END(system_call, 0x4c00, 0x100) #ifdef CONFIG_KVM_BOOK3S_64_HANDLER TRAMP_REAL_BEGIN(system_call_kvm) - /* -* This is a hcall, so register convention is as above, with these -* differences: -* r13 = PACA -* ctr = orig r13 -* orig r10 saved in PACA -*/ -/* - * Save the PPR (on systems that support it) before changing to - * HMT_MEDIUM. That allows the KVM code to save that value into the - * guest state (it is the guest's PPR value). - */ -BEGIN_FTR_SECTION - mfspr r10,SPRN_PPR - std r10,HSTATE_PPR(r13) -END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) - HMT_MEDIUM mfctr r10 - SET_SCRATCH0(r10) - mfcrr10 - std r12,HSTATE_SCRATCH0(r13) - sldir12,r10,32 - ori r12,r12,0xc00 + SET_SCRATCH0(r10) /* Save r13 in SCRATCH0 */ #ifdef CONFIG_RELOCATABLE /* -* Requires __LOAD_FAR_HANDLER beause kvmppc_interrupt lives +* Requires __LOAD_FAR_HANDLER beause kvmppc_hcall lives * outside the head section. */ __LOAD_FAR_HANDLER(r10, kvmppc_hcall) mtctr r10 - ld r10,PACA_EXGEN+EX_R10(r13) bctr #else - ld r10,PACA_EXGEN+EX_R10(r13) b kvmppc_hcall #endif #endif - /** * Interrupt 0xd00 - Trace Interrupt. 
* This is a synchronous interrupt in response to instruction step or diff --git a/arch/powerpc/kvm/book3s_64_entry.S b/arch/powerpc/kvm/book3s_64_entry.S index 3b894b90862f..df074d33f6d7 100644 --- a/arch/powerpc/kvm/book3s_64_entry.S +++ b/arch/powerpc/kvm/book3s_64_entry.S @@ -12,6 +12,23 @@ .globalkvmppc_hcall .balign IFETCH_ALIGN_BYTES kvmppc_hcall: + /* +* This is a hcall, so register convention is as +* Documentation/powerpc/papr_hcalls.rst, with these additions: +* R13 = PACA +* guest R13 saved in SPRN_SCRATCH0 +* R10 = free +*/ +BEGIN_FTR_SECTION + mfspr r10,SPRN_PPR + std r10,HSTATE_PPR(r13) +END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) + HMT_MEDIUM + mfcrr10 + std r12,HSTATE_SCRATCH0(r13) + sldir12,r10,32 + ori r12,r12,0xc00 + ld r10,PACA_EXGEN+EX_R10(r13) .globalkvmppc_interrupt .balign IFETCH_ALIGN_BYTES -- 2.23.0
[RFC PATCH 3/9] KVM: PPC: Book3S 64: add hcall interrupt handler
Add a separate hcall entry point. This can be used to deal with the different calling convention. Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/exceptions-64s.S | 4 ++-- arch/powerpc/kvm/book3s_64_entry.S | 4 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index e6f7fc7c61a1..c25395b5921a 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -2028,13 +2028,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) * Requires __LOAD_FAR_HANDLER beause kvmppc_interrupt lives * outside the head section. */ - __LOAD_FAR_HANDLER(r10, kvmppc_interrupt) + __LOAD_FAR_HANDLER(r10, kvmppc_hcall) mtctr r10 ld r10,PACA_EXGEN+EX_R10(r13) bctr #else ld r10,PACA_EXGEN+EX_R10(r13) - b kvmppc_interrupt + b kvmppc_hcall #endif #endif diff --git a/arch/powerpc/kvm/book3s_64_entry.S b/arch/powerpc/kvm/book3s_64_entry.S index 8e7216f3c3ee..3b894b90862f 100644 --- a/arch/powerpc/kvm/book3s_64_entry.S +++ b/arch/powerpc/kvm/book3s_64_entry.S @@ -9,6 +9,10 @@ /* * We come here from the first-level interrupt handlers. */ +.globalkvmppc_hcall +.balign IFETCH_ALIGN_BYTES +kvmppc_hcall: + .globalkvmppc_interrupt .balign IFETCH_ALIGN_BYTES kvmppc_interrupt: -- 2.23.0
[RFC PATCH 2/9] KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM
Move the GUEST_MODE_SKIP logic into KVM code. This is quite a KVM internal detail that has no real need to be in common handlers. (XXX: Need to confirm CBE handlers etc) Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/exceptions-64s.S | 64 arch/powerpc/kvm/book3s_64_entry.S | 50 ++ 2 files changed, 42 insertions(+), 72 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 65659ea3cec4..e6f7fc7c61a1 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -133,7 +133,6 @@ name: #define IBRANCH_TO_COMMON .L_IBRANCH_TO_COMMON_\name\() /* ENTRY branch to common */ #define IREALMODE_COMMON .L_IREALMODE_COMMON_\name\() /* Common runs in realmode */ #define IMASK .L_IMASK_\name\() /* IRQ soft-mask bit */ -#define IKVM_SKIP .L_IKVM_SKIP_\name\() /* Generate KVM skip handler */ #define IKVM_REAL .L_IKVM_REAL_\name\() /* Real entry tests KVM */ #define __IKVM_REAL(name) .L_IKVM_REAL_ ## name #define IKVM_VIRT .L_IKVM_VIRT_\name\() /* Virt entry tests KVM */ @@ -191,9 +190,6 @@ do_define_int n .ifndef IMASK IMASK=0 .endif - .ifndef IKVM_SKIP - IKVM_SKIP=0 - .endif .ifndef IKVM_REAL IKVM_REAL=0 .endif @@ -254,15 +250,10 @@ do_define_int n .balign IFETCH_ALIGN_BYTES \name\()_kvm: - .if IKVM_SKIP - cmpwi r10,KVM_GUEST_MODE_SKIP - beq 89f - .else BEGIN_FTR_SECTION ld r10,IAREA+EX_CFAR(r13) std r10,HSTATE_CFAR(r13) END_FTR_SECTION_IFSET(CPU_FTR_CFAR) - .endif ld r10,IAREA+EX_CTR(r13) mtctr r10 @@ -289,27 +280,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) ori r12,r12,(IVEC) .endif b kvmppc_interrupt - - .if IKVM_SKIP -89:mtocrf 0x80,r9 - ld r10,IAREA+EX_CTR(r13) - mtctr r10 - ld r9,IAREA+EX_R9(r13) - ld r10,IAREA+EX_R10(r13) - ld r11,IAREA+EX_R11(r13) - ld r12,IAREA+EX_R12(r13) - .if IHSRR_IF_HVMODE - BEGIN_FTR_SECTION - b kvmppc_skip_Hinterrupt - FTR_SECTION_ELSE - b kvmppc_skip_interrupt - ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) - .elseif IHSRR - b kvmppc_skip_Hinterrupt - .else - b kvmppc_skip_interrupt - .endif - .endif .endm #else @@ -1128,7 +1098,6 @@ INT_DEFINE_BEGIN(machine_check) ISET_RI=0 IDAR=1 IDSISR=1 - IKVM_SKIP=1 IKVM_REAL=1 INT_DEFINE_END(machine_check) @@ -1419,7 +1388,6 @@ INT_DEFINE_BEGIN(data_access) IVEC=0x300 IDAR=1 IDSISR=1 - IKVM_SKIP=1 IKVM_REAL=1 INT_DEFINE_END(data_access) @@ -1469,7 +1437,6 @@ INT_DEFINE_BEGIN(data_access_slb) IAREA=PACA_EXSLB IRECONCILE=0 IDAR=1 - IKVM_SKIP=1 IKVM_REAL=1 INT_DEFINE_END(data_access_slb) @@ -2116,7 +2083,6 @@ INT_DEFINE_BEGIN(h_data_storage) IHSRR=1 IDAR=1 IDSISR=1 - IKVM_SKIP=1 IKVM_REAL=1 IKVM_VIRT=1 INT_DEFINE_END(h_data_storage) @@ -2573,7 +2539,6 @@ EXC_VIRT_NONE(0x5100, 0x100) INT_DEFINE_BEGIN(cbe_system_error) IVEC=0x1200 IHSRR=1 - IKVM_SKIP=1 IKVM_REAL=1 INT_DEFINE_END(cbe_system_error) @@ -2598,7 +2563,6 @@ EXC_VIRT_NONE(0x5200, 0x100) INT_DEFINE_BEGIN(instruction_breakpoint) IVEC=0x1300 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE - IKVM_SKIP=1 IKVM_REAL=1 #endif INT_DEFINE_END(instruction_breakpoint) @@ -2744,7 +2708,6 @@ EXC_COMMON_BEGIN(denorm_exception_common) INT_DEFINE_BEGIN(cbe_maintenance) IVEC=0x1600 IHSRR=1 - IKVM_SKIP=1 IKVM_REAL=1 INT_DEFINE_END(cbe_maintenance) @@ -2797,7 +2760,6 @@ EXC_COMMON_BEGIN(altivec_assist_common) INT_DEFINE_BEGIN(cbe_thermal) IVEC=0x1800 IHSRR=1 - IKVM_SKIP=1 IKVM_REAL=1 INT_DEFINE_END(cbe_thermal) @@ -3081,32 +3043,6 @@ EXPORT_SYMBOL(do_uaccess_flush) MASKED_INTERRUPT MASKED_INTERRUPT hsrr=1 -#ifdef CONFIG_KVM_BOOK3S_64_HANDLER -kvmppc_skip_interrupt: - /* -* Here all GPRs are unchanged 
from when the interrupt happened -* except for r13, which is saved in SPRG_SCRATCH0. -*/ - mfspr r13, SPRN_SRR0 - addir13, r13, 4 - mtspr SPRN_SRR0, r13 - GET_SCRATCH0(r13) - RFI_TO_KERNEL - b . - -kvmppc_skip_Hinterrupt: - /* -* Here all GPRs are unchanged from when the interrupt happened -* except for r13, which is saved in SPRG_SCRATCH0. -*/ - mfspr r13, SPRN_HSRR0 - addir13, r13, 4 - mtspr SPRN_HSRR0, r13 - GET_SCRATCH0(r13) - HRFI_TO_KERNEL - b . -#endif -
[RFC PATCH 1/9] KVM: PPC: Book3S 64: move KVM interrupt entry to a common entry point
Rather than bifurcate the call depending on whether or not HV is possible, and have the HV entry test for PR, just make a single common point which does the demultiplexing. This makes it simpler to add another type of exit handler. Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/exceptions-64s.S| 8 +- arch/powerpc/kvm/Makefile | 3 +++ arch/powerpc/kvm/book3s_64_entry.S | 34 + arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++-- 4 files changed, 40 insertions(+), 16 deletions(-) create mode 100644 arch/powerpc/kvm/book3s_64_entry.S diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index e02ad6fefa46..65659ea3cec4 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -212,7 +212,6 @@ do_define_int n .endm #ifdef CONFIG_KVM_BOOK3S_64_HANDLER -#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE /* * All interrupts which set HSRR registers, as well as SRESET and MCE and * syscall when invoked with "sc 1" switch to MSR[HV]=1 (HVMODE) to be taken, @@ -242,13 +241,8 @@ do_define_int n /* * If an interrupt is taken while a guest is running, it is immediately routed - * to KVM to handle. If both HV and PR KVM arepossible, KVM interrupts go first - * to kvmppc_interrupt_hv, which handles the PR guest case. + * to KVM to handle. */ -#define kvmppc_interrupt kvmppc_interrupt_hv -#else -#define kvmppc_interrupt kvmppc_interrupt_pr -#endif .macro KVMTEST name lbz r10,HSTATE_IN_GUEST(r13) diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 2bfeaa13befb..cdd119028f64 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -59,6 +59,9 @@ kvm-pr-y := \ kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HANDLER) += \ tm.o +kvm-book3s_64-builtin-objs-y += \ + book3s_64_entry.o + ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HANDLER) += \ book3s_rmhandlers.o diff --git a/arch/powerpc/kvm/book3s_64_entry.S b/arch/powerpc/kvm/book3s_64_entry.S new file mode 100644 index ..22e34b95f478 --- /dev/null +++ b/arch/powerpc/kvm/book3s_64_entry.S @@ -0,0 +1,34 @@ +#include +#include +#include +#include +#include +#include + +/* + * We come here from the first-level interrupt handlers. 
+ */ +.globalkvmppc_interrupt +.balign IFETCH_ALIGN_BYTES +kvmppc_interrupt: + /* +* Register contents: +* R12 = (guest CR << 32) | interrupt vector +* R13 = PACA +* guest R12 saved in shadow VCPU SCRATCH0 +* guest R13 saved in SPRN_SCRATCH0 +*/ +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE + std r9, HSTATE_SCRATCH2(r13) + lbz r9, HSTATE_IN_GUEST(r13) + cmpwi r9, KVM_GUEST_MODE_HOST_HV + beq kvmppc_bad_host_intr +#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE + cmpwi r9, KVM_GUEST_MODE_GUEST + ld r9, HSTATE_SCRATCH2(r13) + beq kvmppc_interrupt_pr +#endif + b kvmppc_interrupt_hv +#else + b kvmppc_interrupt_pr +#endif diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 8cf1f69f442e..b9c4acd747f7 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1255,16 +1255,8 @@ kvmppc_interrupt_hv: * R13 = PACA * guest R12 saved in shadow VCPU SCRATCH0 * guest R13 saved in SPRN_SCRATCH0 +* guest R9 saved in HSTATE_SCRATCH2 */ - std r9, HSTATE_SCRATCH2(r13) - lbz r9, HSTATE_IN_GUEST(r13) - cmpwi r9, KVM_GUEST_MODE_HOST_HV - beq kvmppc_bad_host_intr -#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE - cmpwi r9, KVM_GUEST_MODE_GUEST - ld r9, HSTATE_SCRATCH2(r13) - beq kvmppc_interrupt_pr -#endif /* We're now back in the host but in guest MMU context */ li r9, KVM_GUEST_MODE_HOST_HV stb r9, HSTATE_IN_GUEST(r13) @@ -3253,6 +3245,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_P9_TM_HV_ASSIST) * cfar is saved in HSTATE_CFAR(r13) * ppr is saved in HSTATE_PPR(r13) */ +.global kvmppc_bad_host_intr kvmppc_bad_host_intr: /* * Switch to the emergency stack, but start half-way down in -- 2.23.0
[RFC PATCH 0/9] KVM: PPC: Book3S: C-ify the P9 entry/exit code
This series goes on top of the KVM patches I sent for the next merge window. It's pretty rough and incomplete at the moment but has started booting a simple guest into busybox in sim. One of the main points of this is to avoid running KVM in a special mode (aka "realmode") which is hostile to the rest of Linux and can't be instrumented well by core facilities or generally use core code. The asm itself is also tricky to follow. It's nothing compared with the old HV path when you sit down and go through it, but it's not trivial and sharing code paths makes things painful too. One issue is what to do about PR-KVM and pre-ISAv3.0 / HPT HV-KVM. The old path could also be converted similarly to C, although that's a far bigger job. At least removing the asm code sharing makes a (slight) simplification to the old path for now. This change is a pretty clean break in the low level exit/entry code, and the rest of the code already has many divergences, so I don't think this exacerbates the problem, but if those PR / old-HV are to be maintained well then we should eat the cost early to modernise that code too. If they stay walled off with these interface shims then they'll just be harder to maintain and the problem will definitely not get better. Now, the number of people who understand PR-KVM and have time to maintain it is roughly zero, and for the old HV path it's approaching zero. The radix, LPAR-per-thread programming model by comparison is so nice and simple, maybe its better to just keep the rest on life support and keep them going for as long as we can with the bare minimum. Thanks, Nick Nicholas Piggin (9): KVM: PPC: Book3S 64: move KVM interrupt entry to a common entry point KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM KVM: PPC: Book3S 64: add hcall interrupt handler KVM: PPC: Book3S HV: Move hcall early register setup to KVM powerpc/64s: Remove EXSLB interrupt save area KVM: PPC: Book3S HV: Move interrupt early register setup to KVM KVM: PPC: Book3S HV: move bad_host_intr check to HV handler KVM: PPC: Book3S HV: Minimise hcall handler calling convention differences KVM: PPC: Book3S HV: Implement the rest of the P9 entry/exit handling in C arch/powerpc/include/asm/asm-prototypes.h | 2 +- arch/powerpc/include/asm/exception-64s.h | 13 ++ arch/powerpc/include/asm/kvm_asm.h| 3 +- arch/powerpc/include/asm/kvm_book3s_64.h | 2 + arch/powerpc/include/asm/kvm_ppc.h| 2 + arch/powerpc/include/asm/paca.h | 3 +- arch/powerpc/kernel/asm-offsets.c | 1 - arch/powerpc/kernel/exceptions-64s.S | 259 +++--- arch/powerpc/kvm/Makefile | 6 + arch/powerpc/kvm/book3s_64_entry.S| 232 +++ arch/powerpc/kvm/book3s_hv.c | 21 +- arch/powerpc/kvm/book3s_hv_interrupt.c| 164 ++ arch/powerpc/kvm/book3s_hv_rmhandlers.S | 136 ++-- arch/powerpc/kvm/book3s_segment.S | 7 + arch/powerpc/kvm/book3s_xive.c| 32 +++ 15 files changed, 523 insertions(+), 360 deletions(-) create mode 100644 arch/powerpc/kvm/book3s_64_entry.S create mode 100644 arch/powerpc/kvm/book3s_hv_interrupt.c -- 2.23.0
Re: [PATCH net v2] ibmvnic: device remove has higher precedence over reset
Hello: This patch was applied to netdev/net.git (refs/heads/master): On Thu, 28 Jan 2021 22:34:01 -0600 you wrote: > Returning -EBUSY in ibmvnic_remove() does not actually hold the > removal procedure since driver core doesn't care for the return > value (see __device_release_driver() in drivers/base/dd.c > calling dev->bus->remove()) though vio_bus_remove > (in arch/powerpc/platforms/pseries/vio.c) records the > return value and passes it on. [1] > > [...] Here is the summary with links: - [net,v2] ibmvnic: device remove has higher precedence over reset https://git.kernel.org/netdev/net/c/5e9eff5dfa46 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH] ASoC: fsl_spdif: Utilize the defined parameter to clear code
On Thu, Jan 28, 2021 at 7:28 PM Tang Bin wrote: > > Utilize the defined parameter 'dev' to make the code cleaner. > > Signed-off-by: Tang Bin Acked-by: Shengjiu Wang
Re: [PATCH] ASoC: fsl_xcvr: remove unneeded semicolon
On Mon, Feb 1, 2021 at 4:08 PM Yang Li wrote: > > Eliminate the following coccicheck warning: > ./sound/soc/fsl/fsl_xcvr.c:739:2-3: Unneeded semicolon > > Reported-by: Abaci Robot > Signed-off-by: Yang Li Acked-by: Shengjiu Wang
Re: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
On Mon Feb 1, 2021 at 10:54 AM CST, Gabriel Paubert wrote: > On Mon, Feb 01, 2021 at 09:55:44AM -0600, Christopher M. Riedl wrote: > > On Thu Jan 28, 2021 at 4:38 AM CST, David Laight wrote: > > > From: Christopher M. Riedl > > > > Sent: 28 January 2021 04:04 > > > > > > > > Reuse the "safe" implementation from signal.c except for calling > > > > unsafe_copy_from_user() to copy into a local buffer. > > > > > > > > Signed-off-by: Christopher M. Riedl > > > > --- > > > > arch/powerpc/kernel/signal.h | 33 + > > > > 1 file changed, 33 insertions(+) > > > > > > > > diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h > > > > index 2559a681536e..c18402d625f1 100644 > > > > --- a/arch/powerpc/kernel/signal.h > > > > +++ b/arch/powerpc/kernel/signal.h > > > > @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct > > > > task_struct *task, void __user *from); > > > > [i], label);\ > > > > } while (0) > > > > > > > > +#define unsafe_copy_fpr_from_user(task, from, label) do { > > > > \ > > > > + struct task_struct *__t = task; > > > > \ > > > > + u64 __user *__f = (u64 __user *)from; > > > > \ > > > > + u64 buf[ELF_NFPREG]; > > > > \ > > > > > > How big is that buffer? > > > Isn't is likely to be reasonably large compared to a reasonable > > > kernel stack frame. > > > Especially since this isn't even a leaf function. > > > > > > > I think Christophe answered this - I don't really have an opinion either > > way. What would be a 'reasonable' kernel stack frame for reference? > > See include/linux/poll.h, where the limit is of the order of 800 bytes > and the number of entries in an on stack array is chosen at compile time > (different between 32 and 64 bit for example). > > The values are used in do_sys_poll, which, with almost 1000 bytes of > stack footprint, appears close to the top of "make checkstack". In > addition do_sys_poll has to call the ->poll function of every file > descriptor in its table, so it is not a tail function. > > This 264 bytes array looks reasonable, but please use 'make checkstack' > to verify that the function's total stack usage stays within reason. Neat, looks like total usage is a bit larger but still reasonable and less than half of 800B: 0xc0017e900 __unsafe_restore_sigcontext.constprop.0 [vmlinux]:352 Thanks for the tip! > > Gabriel > > > > > > > + int i; > > > > \ > > > > + > > > > \ > > > > + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double), > > > > \ > > > > > > That really ought to be sizeof(buf). > > > > > > > Agreed, I will fix this. Thanks! 
> > > > > David > > > > > > > > > > + label); > > > > \ > > > > + for (i = 0; i < ELF_NFPREG - 1; i++) > > > > \ > > > > + __t->thread.TS_FPR(i) = buf[i]; > > > > \ > > > > + __t->thread.fp_state.fpscr = buf[i]; > > > > \ > > > > +} while (0) > > > > + > > > > +#define unsafe_copy_vsx_from_user(task, from, label) do { > > > > \ > > > > + struct task_struct *__t = task; > > > > \ > > > > + u64 __user *__f = (u64 __user *)from; > > > > \ > > > > + u64 buf[ELF_NVSRHALFREG]; > > > > \ > > > > + int i; > > > > \ > > > > + > > > > \ > > > > + unsafe_copy_from_user(buf, __f, > > > > \ > > > > + ELF_NVSRHALFREG * sizeof(double), > > > > \ > > > > + label); > > > > \ > > > > + for (i = 0; i < ELF_NVSRHALFREG ; i++) > > > > \ > > > > + __t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i]; > > > > \ > > > > +} while (0) > > > > + > > > > + > > > > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM > > > > #define unsafe_copy_ckfpr_to_user(to, task, label) do { > > > > \ > > > > struct task_struct *__t = task; > > > > \ > > > > @@ -80,6 +107,10 @@ unsigned long copy_ckfpr_from_user(struct > > > > task_struct *task, void __user *from); > > > > unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,\ > > > > ELF_NFPREG * sizeof(double), label) > > > > > > > > +#define unsafe_copy_fpr_from_user(task, from, label) > > > > \ > > > > +
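A side note on the arithmetic in the thread above: ELF_NFPREG is 33 on powerpc (32 FPRs plus FPSCR), which is where the quoted 264-byte figure comes from. A trivial standalone check, with the define copied here only for illustration:

#include <stdint.h>

#define ELF_NFPREG 33   /* 32 floating-point registers + fpscr */

int main(void)
{
        uint64_t buf[ELF_NFPREG];       /* the on-stack buffer in question */

        /* 33 * sizeof(u64) = 264 bytes of stack */
        _Static_assert(sizeof(buf) == 264, "264 bytes on stack");
        (void)buf;
        return 0;
}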
[PATCH] powerpc/64/signal: Fix regression in __kernel_sigtramp_rt64 semantics
Tested on powerpc64 and powerpc64le, with a glibc build and running the affected glibc testcase[2], and inspected that glibc's backtrace() now gives the correct result and gdb backtrace also keeps working as before. I believe this should be backported to releases 5.9 and 5.10 as userspace is affected in these releases. 8< A change[1] in the __kernel_sigtramp_rt64 VDSO and trampoline code introduced a regression in the way glibc's backtrace()[2] detects the signal-handler stack frame. Apart from the practical implications, __kernel_sigtramp_rt64 was a VDSO symbol with the semantics that it is a function you can call from userspace to end signal handling. Now these semantics are no longer valid. I believe the aforementioned change affects all releases since 5.9. This patch tries to fix both the semantics and the practical aspect of __kernel_sigtramp_rt64 by returning it to the previous code, whilst keeping the intended behavior from [1] by adding a new symbol to serve as the jump target from the kernel to the trampoline. Now the trampoline has two parts, a new entry point and the old return point. [1] commit 0138ba5783ae0dcc799ad401a1e8ac8333790df9 ("powerpc/64/signal: Balance return predictor stack in signal trampoline") [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2021-January/223194.html Fixes: 0138ba5783ae ("powerpc/64/signal: Balance return predictor stack in signal trampoline") Signed-off-by: Raoni Fassina Firmino --- arch/powerpc/kernel/vdso64/sigtramp.S | 9 ++++++++- arch/powerpc/kernel/vdso64/vdso64.lds.S | 2 +- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/vdso64/sigtramp.S b/arch/powerpc/kernel/vdso64/sigtramp.S index bbf68cd01088..f0fd8d2a9fc4 100644 --- a/arch/powerpc/kernel/vdso64/sigtramp.S +++ b/arch/powerpc/kernel/vdso64/sigtramp.S @@ -15,11 +15,18 @@ .text +/* __kernel_start_sigtramp_rt64 and __kernel_sigtramp_rt64 together + are one function split in two parts. The kernel jumps to the former + and the signal handler indirectly (by blr) returns to the latter. + __kernel_sigtramp_rt64 needs to point to the return address so + glibc can correctly identify the trampoline stack frame. */ .balign 8 .balign IFETCH_ALIGN_BYTES -V_FUNCTION_BEGIN(__kernel_sigtramp_rt64) +V_FUNCTION_BEGIN(__kernel_start_sigtramp_rt64) .Lsigrt_start: bctrl /* call the handler */ +V_FUNCTION_END(__kernel_start_sigtramp_rt64) +V_FUNCTION_BEGIN(__kernel_sigtramp_rt64) addi r1, r1, __SIGNAL_FRAMESIZE li r0,__NR_rt_sigreturn sc diff --git a/arch/powerpc/kernel/vdso64/vdso64.lds.S b/arch/powerpc/kernel/vdso64/vdso64.lds.S index 6164d1a1ba11..2f3c359cacd3 100644 --- a/arch/powerpc/kernel/vdso64/vdso64.lds.S +++ b/arch/powerpc/kernel/vdso64/vdso64.lds.S @@ -131,4 +131,4 @@ VERSION /* * Make the sigreturn code visible to the kernel. */ -VDSO_sigtramp_rt64 = __kernel_sigtramp_rt64; +VDSO_sigtramp_rt64 = __kernel_start_sigtramp_rt64; base-commit: 76c057c84d286140c6c416c3b4ba832cd1d8984e -- 2.26.2
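The failure mode is easy to exercise from userspace. Below is a minimal reproducer sketch in the spirit of the glibc testcase in [2] (this is not the actual testcase); with the regression, the frames of the interrupted code are missing from the output.

#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void handler(int sig)
{
        void *frames[32];
        int n = backtrace(frames, 32);

        /* With a correct sigtramp marker in the VDSO, the frames of
         * the interrupted code (main etc.) appear below the handler. */
        backtrace_symbols_fd(frames, n, STDOUT_FILENO);
        (void)sig;
}

int main(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = handler;
        sigaction(SIGUSR1, &sa, NULL);
        raise(SIGUSR1);
        return 0;
}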
Re: [PATCH v15 09/10] arm64: Call kmalloc() to allocate DTB buffer
Joe Perches writes: > On Thu, 2021-01-28 at 00:52 -0300, Thiago Jung Bauermann wrote: >> The problem is that this patch implements only part of the suggestion, >> which isn't useful in itself. So the patch series should either drop >> this patch or consolidate the FDT allocation between the arches. >> >> I just tested on powernv and pseries platforms and powerpc can use >> vmalloc for the FDT buffer. > > Perhaps more sensible to use kvmalloc/kvfree. That's true. Converting both arm64 and powerpc to kvmalloc/kvfree is a good option. I don't think it's that much better though, because kexec_file_load() is called infrequently and doesn't need to be fast, so the vmalloc() overhead isn't important in practice. -- Thiago Jung Bauermann IBM Linux Technology Center
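For reference, the suggested kvmalloc() pattern would look roughly like this at the allocation site (a minimal sketch, not the actual kexec_file code): kvmalloc() tries kmalloc() first and falls back to vmalloc() for larger allocations, and kvfree() handles either case.

#include <linux/mm.h>
#include <linux/slab.h>

/* Sketch: allocate and free an FDT buffer with the fallback allocator. */
static void *fdt_buf_alloc(size_t size)
{
        return kvmalloc(size, GFP_KERNEL);
}

static void fdt_buf_free(void *fdt)
{
        kvfree(fdt);
}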
RE: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
On Mon Feb 1, 2021 at 11:37 AM CST, David Laight wrote: > From: Christopher M. Riedl > > Sent: 01 February 2021 16:55 > ... > > > > > > + int i; > > > > > > \ > > > > > > + > > > > > > \ > > > > > > + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double), > > > > > > \ > > > > > > + label); > > > > > > \ > > > > > > + for (i = 0; i < ELF_NFPREG - 1; i++) > > > > > > \ > > > > > > + __t->thread.TS_FPR(i) = buf[i]; > > > > > > \ > > > > > > + __t->thread.fp_state.fpscr = buf[i]; > > > > > > \ > > > > > > +} while (0) > > > > > > On further reflection, since you immediately loop through the buffer > > > why not just use user_access_begin() and unsafe_get_user() in the loop. > > > > Christophe had suggested this a few revisions ago as well. When I tried > > this approach, the signal handling performance took a pretty big hit: > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-October/219351.html > > > > I included some numbers on v3 as well but decided to drop the approach > > altogether for this one since it just didn't seem worth the hit. > > Was that using unsafe_get_user (which relies on user_access_begin() > having 'opened up' user accesses) or just get_user() that does > it for every access? > > The former should be ok, the latter will be horrid. It was using unsafe_get_user() whereas unsafe_copy_from_user() will just call the optimized __copy_tofrom_user() a single time - assuming that user access is open. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, > MK1 1PT, UK > Registration No: 1397386 (Wales)
RE: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
From: Christopher M. Riedl > Sent: 01 February 2021 16:55 ... > > > > > + int i; > > > > > \ > > > > > + > > > > > \ > > > > > + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double), > > > > > \ > > > > > + label); > > > > > \ > > > > > + for (i = 0; i < ELF_NFPREG - 1; i++) > > > > > \ > > > > > + __t->thread.TS_FPR(i) = buf[i]; > > > > > \ > > > > > + __t->thread.fp_state.fpscr = buf[i]; > > > > > \ > > > > > +} while (0) > > > > On further reflection, since you immediately loop through the buffer > > why not just use user_access_begin() and unsafe_get_user() in the loop. > > Christophe had suggested this a few revisions ago as well. When I tried > this approach, the signal handling performance took a pretty big hit: > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-October/219351.html > > I included some numbers on v3 as well but decided to drop the approach > altogether for this one since it just didn't seem worth the hit. Was that using unsafe_get_user (which relies on user_access_begin() having 'opened up' user accesses) or just get_user() that does it for every access? The former should be ok, the latter will be horrid. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
On Mon, Feb 01, 2021 at 09:55:44AM -0600, Christopher M. Riedl wrote: > On Thu Jan 28, 2021 at 4:38 AM CST, David Laight wrote: > > From: Christopher M. Riedl > > > Sent: 28 January 2021 04:04 > > > > > > Reuse the "safe" implementation from signal.c except for calling > > > unsafe_copy_from_user() to copy into a local buffer. > > > > > > Signed-off-by: Christopher M. Riedl > > > --- > > > arch/powerpc/kernel/signal.h | 33 + > > > 1 file changed, 33 insertions(+) > > > > > > diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h > > > index 2559a681536e..c18402d625f1 100644 > > > --- a/arch/powerpc/kernel/signal.h > > > +++ b/arch/powerpc/kernel/signal.h > > > @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct > > > *task, void __user *from); > > > [i], label);\ > > > } while (0) > > > > > > +#define unsafe_copy_fpr_from_user(task, from, label) do { > > > \ > > > + struct task_struct *__t = task; \ > > > + u64 __user *__f = (u64 __user *)from; \ > > > + u64 buf[ELF_NFPREG];\ > > > > How big is that buffer? > > Isn't is likely to be reasonably large compared to a reasonable > > kernel stack frame. > > Especially since this isn't even a leaf function. > > > > I think Christophe answered this - I don't really have an opinion either > way. What would be a 'reasonable' kernel stack frame for reference? See include/linux/poll.h, where the limit is of the order of 800 bytes and the number of entries in an on stack array is chosen at compile time (different between 32 and 64 bit for example). The values are used in do_sys_poll, which, with almost 1000 bytes of stack footprint, appears close to the top of "make checkstack". In addition do_sys_poll has to call the ->poll function of every file descriptor in its table, so it is not a tail function. This 264 bytes array looks reasonable, but please use 'make checkstack' to verify that the function's total stack usage stays within reason. Gabriel > > > > + int i; \ > > > + \ > > > + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double),\ > > > > That really ought to be sizeof(buf). > > > > Agreed, I will fix this. Thanks! 
> > > David > > > > > > > + label); \ > > > + for (i = 0; i < ELF_NFPREG - 1; i++)\ > > > + __t->thread.TS_FPR(i) = buf[i]; \ > > > + __t->thread.fp_state.fpscr = buf[i];\ > > > +} while (0) > > > + > > > +#define unsafe_copy_vsx_from_user(task, from, label) do { > > > \ > > > + struct task_struct *__t = task; \ > > > + u64 __user *__f = (u64 __user *)from; \ > > > + u64 buf[ELF_NVSRHALFREG]; \ > > > + int i; \ > > > + \ > > > + unsafe_copy_from_user(buf, __f, \ > > > + ELF_NVSRHALFREG * sizeof(double), \ > > > + label); \ > > > + for (i = 0; i < ELF_NVSRHALFREG ; i++) \ > > > + __t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i]; \ > > > +} while (0) > > > + > > > + > > > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM > > > #define unsafe_copy_ckfpr_to_user(to, task, label) do { > > > \ > > > struct task_struct *__t = task; \ > > > @@ -80,6 +107,10 @@ unsigned long copy_ckfpr_from_user(struct task_struct > > > *task, void __user *from); > > > unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,\ > > > ELF_NFPREG * sizeof(double), label) > > > > > > +#define unsafe_copy_fpr_from_user(task, from, label) > > > \ > > > + unsafe_copy_from_user((task)->thread.fp_state.fpr, from,\ > > > + ELF_NFPREG * sizeof(double), label) > > > + > > > static inline unsigned long > > > copy_fpr_to_user(void __user *to, struct task_struct *task) > > > { > > > @@ -115,6 +146,8 @@ copy_ckfpr_from_user(struct task_struct *task, void > > > __user *from) > > > #else > > > #define unsafe_copy_fpr_to_user(to, task, label) do { } while (0) > > > > > > +#define unsafe_copy_fpr_from_user(task, from, label) do { } while (0) > > > + > > > static inline unsigned long > > > copy_fpr_to_user(void __user *to, struct task_struct *task) > > > { > > > -- > > > 2.26.1 > > > > - > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, > > MK1 1PT, UK > >
RE: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
On Mon Feb 1, 2021 at 10:15 AM CST, David Laight wrote: > From: Christopher M. Riedl > > Sent: 01 February 2021 15:56 > > > > On Thu Jan 28, 2021 at 4:38 AM CST, David Laight wrote: > > > From: Christopher M. Riedl > > > > Sent: 28 January 2021 04:04 > > > > > > > > Reuse the "safe" implementation from signal.c except for calling > > > > unsafe_copy_from_user() to copy into a local buffer. > > > > > > > > Signed-off-by: Christopher M. Riedl > > > > --- > > > > arch/powerpc/kernel/signal.h | 33 + > > > > 1 file changed, 33 insertions(+) > > > > > > > > diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h > > > > index 2559a681536e..c18402d625f1 100644 > > > > --- a/arch/powerpc/kernel/signal.h > > > > +++ b/arch/powerpc/kernel/signal.h > > > > @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct > > > > task_struct *task, void __user *from); > > > > [i], label);\ > > > > } while (0) > > > > > > > > +#define unsafe_copy_fpr_from_user(task, from, label) do { > > > > \ > > > > + struct task_struct *__t = task; > > > > \ > > > > + u64 __user *__f = (u64 __user *)from; > > > > \ > > > > + u64 buf[ELF_NFPREG]; > > > > \ > > > > > > How big is that buffer? > > > Isn't is likely to be reasonably large compared to a reasonable > > > kernel stack frame. > > > Especially since this isn't even a leaf function. > > > > > > > I think Christophe answered this - I don't really have an opinion either > > way. What would be a 'reasonable' kernel stack frame for reference? > > Zero :-) > Hehe good point! > > > > > > + int i; > > > > \ > > > > + > > > > \ > > > > + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double), > > > > \ > > > > + label); > > > > \ > > > > + for (i = 0; i < ELF_NFPREG - 1; i++) > > > > \ > > > > + __t->thread.TS_FPR(i) = buf[i]; > > > > \ > > > > + __t->thread.fp_state.fpscr = buf[i]; > > > > \ > > > > +} while (0) > > On further reflection, since you immediately loop through the buffer > why not just use user_access_begin() and unsafe_get_user() in the loop. Christophe had suggested this a few revisions ago as well. When I tried this approach, the signal handling performance took a pretty big hit: https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-October/219351.html I included some numbers on v3 as well but decided to drop the approach altogether for this one since it just didn't seem worth the hit. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, > MK1 1PT, UK > Registration No: 1397386 (Wales)
Re: [PATCH 05/13] kallsyms: refactor {,module_}kallsyms_on_each_symbol
On Mon, Feb 01, 2021 at 02:37:12PM +0100, Miroslav Benes wrote: > > > This change is not needed. (objname == NULL) means that we are > > > interested only in symbols in "vmlinux". > > > > > > module_kallsyms_on_each_symbol(klp_find_callback, ) > > > will always fail when objname == NULL. > > > > I just tried to keep the old behavior. I can respin it with your > > recommended change noting the change in behavior, though. > > Yes, please. It would be cleaner that way. Let me know if this works for you: --- >From 18af41e88d088cfb8680d1669fcae2bc2ede5328 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 20 Jan 2021 16:23:16 +0100 Subject: kallsyms: refactor {,module_}kallsyms_on_each_symbol Require an explicit call to module_kallsyms_on_each_symbol to look for symbols in modules instead of the call from kallsyms_on_each_symbol, and acquire module_mutex inside of module_kallsyms_on_each_symbol instead of leaving that up to the caller. Note that this slightly changes the behavior for the livepatch code in that the symbols from vmlinux are not iterated anymore if objname is set, but that actually is the desired behavior in this case. Signed-off-by: Christoph Hellwig --- kernel/kallsyms.c | 6 +- kernel/livepatch/core.c | 2 -- kernel/module.c | 13 - 3 files changed, 9 insertions(+), 12 deletions(-) diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index fe9de067771c34..a0d3f0865916f9 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -177,6 +177,10 @@ unsigned long kallsyms_lookup_name(const char *name) return module_kallsyms_lookup_name(name); } +/* + * Iterate over all symbols in vmlinux. For symbols from modules use + * module_kallsyms_on_each_symbol instead. + */ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *, unsigned long), void *data) @@ -192,7 +196,7 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *, if (ret != 0) return ret; } - return module_kallsyms_on_each_symbol(fn, data); + return 0; } static unsigned long get_symbol_pos(unsigned long addr, diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 262cd9b003b9f0..335d988bd81117 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -164,12 +164,10 @@ static int klp_find_object_symbol(const char *objname, const char *name, .pos = sympos, }; - mutex_lock(_mutex); if (objname) module_kallsyms_on_each_symbol(klp_find_callback, ); else kallsyms_on_each_symbol(klp_find_callback, ); - mutex_unlock(_mutex); /* * Ensure an address was found. 
If sympos is 0, ensure symbol is unique; diff --git a/kernel/module.c b/kernel/module.c index 6772fb2680eb3e..25345792c770d1 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -255,11 +255,6 @@ static void mod_update_bounds(struct module *mod) struct list_head *kdb_modules = /* kdb needs the list of modules */ #endif /* CONFIG_KGDB_KDB */ -static void module_assert_mutex(void) -{ - lockdep_assert_held(_mutex); -} - static void module_assert_mutex_or_preempt(void) { #ifdef CONFIG_LOCKDEP @@ -4379,8 +4374,7 @@ int module_kallsyms_on_each_symbol(int (*fn)(void *, const char *, unsigned int i; int ret; - module_assert_mutex(); - + mutex_lock(_mutex); list_for_each_entry(mod, , list) { /* We hold module_mutex: no need for rcu_dereference_sched */ struct mod_kallsyms *kallsyms = mod->kallsyms; @@ -4396,10 +4390,11 @@ int module_kallsyms_on_each_symbol(int (*fn)(void *, const char *, ret = fn(data, kallsyms_symbol_name(kallsyms, i), mod, kallsyms_symbol_value(sym)); if (ret != 0) - return ret; + break; } } - return 0; + mutex_unlock(_mutex); + return ret; } #endif /* CONFIG_KALLSYMS */ -- 2.29.2
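To illustrate the interface being reworked here, a hypothetical caller (count_cb() and count_module_syms() are made up for this example) now calls module_kallsyms_on_each_symbol() directly and no longer takes module_mutex itself:

#include <linux/kallsyms.h>
#include <linux/module.h>

/* Callback with the signature module_kallsyms_on_each_symbol() expects;
 * returning non-zero stops the iteration. */
static int count_cb(void *data, const char *name, struct module *mod,
                    unsigned long addr)
{
        (*(unsigned long *)data)++;
        return 0;
}

static unsigned long count_module_syms(void)
{
        unsigned long count = 0;

        /* After this patch, module_mutex is taken inside the iterator. */
        module_kallsyms_on_each_symbol(count_cb, &count);
        return count;
}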
RE: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
From: Christopher M. Riedl > Sent: 01 February 2021 15:56 > > On Thu Jan 28, 2021 at 4:38 AM CST, David Laight wrote: > > From: Christopher M. Riedl > > > Sent: 28 January 2021 04:04 > > > > > > Reuse the "safe" implementation from signal.c except for calling > > > unsafe_copy_from_user() to copy into a local buffer. > > > > > > Signed-off-by: Christopher M. Riedl > > > --- > > > arch/powerpc/kernel/signal.h | 33 + > > > 1 file changed, 33 insertions(+) > > > > > > diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h > > > index 2559a681536e..c18402d625f1 100644 > > > --- a/arch/powerpc/kernel/signal.h > > > +++ b/arch/powerpc/kernel/signal.h > > > @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct > > > *task, void __user *from); > > > [i], label);\ > > > } while (0) > > > > > > +#define unsafe_copy_fpr_from_user(task, from, label) do { > > > \ > > > + struct task_struct *__t = task; \ > > > + u64 __user *__f = (u64 __user *)from; \ > > > + u64 buf[ELF_NFPREG];\ > > > > How big is that buffer? > > Isn't is likely to be reasonably large compared to a reasonable > > kernel stack frame. > > Especially since this isn't even a leaf function. > > > > I think Christophe answered this - I don't really have an opinion either > way. What would be a 'reasonable' kernel stack frame for reference? Zero :-) > > > > + int i; \ > > > + \ > > > + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double),\ > > > + label); \ > > > + for (i = 0; i < ELF_NFPREG - 1; i++)\ > > > + __t->thread.TS_FPR(i) = buf[i]; \ > > > + __t->thread.fp_state.fpscr = buf[i];\ > > > +} while (0) On further reflection, since you immediately loop through the buffer why not just use user_access_begin() and unsafe_get_user() in the loop. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
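For readers following the thread, the suggested shape is roughly the following (a sketch against the generic uaccess API, not the patch itself; the function name is made up). The point is that user_access_begin() opens the user-access window once, so each unsafe_get_user() avoids the per-access overhead that a plain get_user() pays.

#include <linux/errno.h>
#include <linux/types.h>
#include <linux/uaccess.h>

static int copy_fprs_sketch(u64 *dst, const u64 __user *from, int nregs)
{
        int i;

        if (!user_access_begin(from, nregs * sizeof(u64)))
                return -EFAULT;
        for (i = 0; i < nregs; i++)
                unsafe_get_user(dst[i], &from[i], Efault);
        user_access_end();
        return 0;

Efault:
        user_access_end();
        return -EFAULT;
}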
RE: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
On Thu Jan 28, 2021 at 4:38 AM CST, David Laight wrote: > From: Christopher M. Riedl > > Sent: 28 January 2021 04:04 > > > > Reuse the "safe" implementation from signal.c except for calling > > unsafe_copy_from_user() to copy into a local buffer. > > > > Signed-off-by: Christopher M. Riedl > > --- > > arch/powerpc/kernel/signal.h | 33 + > > 1 file changed, 33 insertions(+) > > > > diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h > > index 2559a681536e..c18402d625f1 100644 > > --- a/arch/powerpc/kernel/signal.h > > +++ b/arch/powerpc/kernel/signal.h > > @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct > > *task, void __user *from); > > [i], label);\ > > } while (0) > > > > +#define unsafe_copy_fpr_from_user(task, from, label) do { > > \ > > + struct task_struct *__t = task; \ > > + u64 __user *__f = (u64 __user *)from; \ > > + u64 buf[ELF_NFPREG];\ > > How big is that buffer? > Isn't is likely to be reasonably large compared to a reasonable > kernel stack frame. > Especially since this isn't even a leaf function. > I think Christophe answered this - I don't really have an opinion either way. What would be a 'reasonable' kernel stack frame for reference? > > + int i; \ > > + \ > > + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double),\ > > That really ought to be sizeof(buf). > Agreed, I will fix this. Thanks! > David > > > > + label); \ > > + for (i = 0; i < ELF_NFPREG - 1; i++)\ > > + __t->thread.TS_FPR(i) = buf[i]; \ > > + __t->thread.fp_state.fpscr = buf[i];\ > > +} while (0) > > + > > +#define unsafe_copy_vsx_from_user(task, from, label) do { > > \ > > + struct task_struct *__t = task; \ > > + u64 __user *__f = (u64 __user *)from; \ > > + u64 buf[ELF_NVSRHALFREG]; \ > > + int i; \ > > + \ > > + unsafe_copy_from_user(buf, __f, \ > > + ELF_NVSRHALFREG * sizeof(double), \ > > + label); \ > > + for (i = 0; i < ELF_NVSRHALFREG ; i++) \ > > + __t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i]; \ > > +} while (0) > > + > > + > > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM > > #define unsafe_copy_ckfpr_to_user(to, task, label) do {\ > > struct task_struct *__t = task; \ > > @@ -80,6 +107,10 @@ unsigned long copy_ckfpr_from_user(struct task_struct > > *task, void __user *from); > > unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,\ > > ELF_NFPREG * sizeof(double), label) > > > > +#define unsafe_copy_fpr_from_user(task, from, label) > > \ > > + unsafe_copy_from_user((task)->thread.fp_state.fpr, from,\ > > + ELF_NFPREG * sizeof(double), label) > > + > > static inline unsigned long > > copy_fpr_to_user(void __user *to, struct task_struct *task) > > { > > @@ -115,6 +146,8 @@ copy_ckfpr_from_user(struct task_struct *task, void > > __user *from) > > #else > > #define unsafe_copy_fpr_to_user(to, task, label) do { } while (0) > > > > +#define unsafe_copy_fpr_from_user(task, from, label) do { } while (0) > > + > > static inline unsigned long > > copy_fpr_to_user(void __user *to, struct task_struct *task) > > { > > -- > > 2.26.1 > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, > MK1 1PT, UK > Registration No: 1397386 (Wales)
Re: [PATCH 05/13] kallsyms: refactor {,module_}kallsyms_on_each_symbol
One more thing... > @@ -4379,8 +4379,7 @@ int module_kallsyms_on_each_symbol(int (*fn)(void *, > const char *, > unsigned int i; > int ret; > > - module_assert_mutex(); > - > + mutex_lock(_mutex); > list_for_each_entry(mod, , list) { > /* We hold module_mutex: no need for rcu_dereference_sched */ > struct mod_kallsyms *kallsyms = mod->kallsyms; This was the last user of module_assert_mutex(), which can be removed now. Miroslav
Re: [PATCH 05/13] kallsyms: refactor {,module_}kallsyms_on_each_symbol
On Mon, 1 Feb 2021, Christoph Hellwig wrote: > On Fri, Jan 29, 2021 at 10:43:36AM +0100, Petr Mladek wrote: > > > --- a/kernel/livepatch/core.c > > > +++ b/kernel/livepatch/core.c > > > @@ -164,12 +164,8 @@ static int klp_find_object_symbol(const char > > > *objname, const char *name, > > > .pos = sympos, > > > }; > > > > > > - mutex_lock(_mutex); > > > - if (objname) > > > + if (objname || !kallsyms_on_each_symbol(klp_find_callback, )) > > > module_kallsyms_on_each_symbol(klp_find_callback, ); > > > - else > > > - kallsyms_on_each_symbol(klp_find_callback, ); > > > - mutex_unlock(_mutex); > > > > This change is not needed. (objname == NULL) means that we are > > interested only in symbols in "vmlinux". > > > > module_kallsyms_on_each_symbol(klp_find_callback, ) > > will always fail when objname == NULL. > > I just tried to keep the old behavior. I can respin it with your > recommended change noting the change in behavior, though. Yes, please. It would be cleaner that way. Miroslav
Re: [PATCH 04/13] module: use RCU to synchronize find_module
On Mon, 1 Feb 2021, Jessica Yu wrote: > +++ Miroslav Benes [29/01/21 16:29 +0100]: > >On Thu, 28 Jan 2021, Christoph Hellwig wrote: > > > >> Allow for a RCU-sched critical section around find_module, following > >> the lower level find_module_all helper, and switch the two callers > >> outside of module.c to use such a RCU-sched critical section instead > >> of module_mutex. > > > >That's a nice idea. > > > >> @@ -57,7 +58,7 @@ static void klp_find_object_module(struct klp_object > >> *obj) > >> if (!klp_is_module(obj)) > >>return; > >> > >> - mutex_lock(_mutex); > >> + rcu_read_lock_sched(); > >> /* > >>* We do not want to block removal of patched modules and therefore > >>* we do not take a reference here. The patches are removed by > >> @@ -74,7 +75,7 @@ static void klp_find_object_module(struct klp_object > >> *obj) > >> if (mod && mod->klp_alive) > > > >RCU always baffles me a bit, so I'll ask. Don't we need > >rcu_dereference_sched() here? "mod" comes from a RCU-protected list, so I > >wonder. > > Same here :-) I had to double check the RCU documentation. For our > modules list case I believe the rcu list API should take care of that > for us. Worth noting is this snippet from Documentation/RCU/whatisRCU.txt: > >rcu_dereference() is typically used indirectly, via the _rcu >list-manipulation primitives, such as list_for_each_entry_rcu() Ok, thanks to both for checking and explanation. Ack to the patch then. Miroslav
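Summarising for reference, the caller-side pattern after this change looks roughly like this (a sketch, with mod->klp_alive as in the livepatch hunk quoted above):

#include <linux/module.h>
#include <linux/rcupdate.h>

static bool klp_module_alive(const char *name)
{
        struct module *mod;
        bool alive;

        /* find_module() walks the modules list with
         * list_for_each_entry_rcu(), which performs the rcu_dereference()
         * internally, so no explicit rcu_dereference_sched() is needed. */
        rcu_read_lock_sched();
        mod = find_module(name);
        alive = mod && mod->klp_alive;
        rcu_read_unlock_sched();

        return alive;
}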
Re: [RFC 00/20] TLB batching consolidation and enhancements
On Sun, Jan 31, 2021 at 07:57:01AM +, Nadav Amit wrote: > > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin wrote: > > I'll go through the patches a bit more closely when they all come > > through. Sparc and powerpc of course need the arch lazy mode to get > > per-page/pte information for operations that are not freeing pages, > > which is what mmu gather is designed for. > > IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE > where no previous PTE was set, right? These are the HASH architectures. Their hardware doesn't walk the page-tables, but it consults a hash-table to resolve page translations. They _MUST_ flush the entries under the PTL to avoid ever seeing conflicting information, which will make them really unhappy. They can do this because they have TLBI broadcast. There's a few more details I'm sure, but those seem to have slipped from my mind.
RE: [PATCH] powerpc/603: Fix protection of user pages mapped with PROT_NONE
Thank you very much, I appreciate your fast responses. Thank you also for the clarification; I completely overlooked the permission settings in the segment setup and expected the fault reaction on the PP bits in the TLB. And I will re-read the chapters, to get deeper into this topic.

Greetings
Christoph

-----Original Message-----
From: Christophe Leroy
Sent: Montag, 1. Februar 2021 12:39
To: PLATTNER Christoph; Benjamin Herrenschmidt; Paul Mackerras; Michael Ellerman
Cc: linux-ker...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; HAMETNER Reinhard; REITHER Robert - Contractor; KOENIG Werner
Subject: Re: [PATCH] powerpc/603: Fix protection of user pages mapped with PROT_NONE

Le 01/02/2021 à 11:22, PLATTNER Christoph a écrit :
> Hello to all, and thank you very much for first and second fast response.
> 
> I do not have a long history on PowerPC MMU environment, I hacked into
> this topic for about 3 months for analyzing that problem - so, sorry,
> if I am wrong in some points ...

Yes you are wrong on some points, sorry, see below.

> What I have learned so far from this MPC5121e (variant of e300c4 core):
> - It uses book3s32 hash-code, but it DOES NOT provide the KEY hash method,
>   so always the branch "if (!Hash)" is taken. So I assume that "key 0" and
>   "key 1" setups are not used on this CPU (not supporting MMU_FTR_HPTE_TABLE).

The hash method is not used, it is SW TLB loading that is used. But still, all the PP and Ks/Kp keys defined in the segment register are used, see e300 core reference manual §6.4.2 Page Memory Protection.

> - The PP bits are NOT checked by the CPU in HW; even if set to 00, the CPU
>   does not react. As far as I have understood, the TLB miss routines are
>   responsible for checking permissions. The TLB miss routines check the
>   Linux-style PTE entries and generate the PP bits for the TLB entry. The
>   PowerPC PP bits are never checked elsewhere on those CPU models ...

PP bits ARE checked, hopefully. If that was not the case, then the TLB miss routines would install a TLB entry on a read, and the user could then do a write without any verification being done. Refer to the e300 Core Reference Manual, §6.1.4 Memory Protection Facilities.

As I explained in the patch, the problem is not that the HW doesn't check the permission. It is that user accesses are done with key 0 as programmed in the segment registers, and PP 00 means RW access with key 0.

> - The PTE entries in Linux are fully "void" in the sense of this CPU type,
>   as this CPU does not read any PTEs from RAM (no HW support, in contrast
>   to x86 or ARM or later ppc ...).

No, the PTEs are read by the TLB miss exception handlers and written into TLB entries.

> In summary - as far as I understand it now - we have to handle the PTE
> bits differently (Linux style) for PROT_NONE permissions - OR - we have
> to expand the permission checking like my proposed experimental patch.
> (PROT_NONE is not NUMA related only, but may not be used very often ...)

Yes, expanding the permission checking is the easiest solution, hence the patch I sent out based on your proposal.

> Another related point:
> According to the e300 RM (manual), the ACCESSED bit in the PTE shall be
> set on TLB miss, as it is an indication that the page is used. In the 4.4
> kernel this write back of the _PAGE_ACCESSED bit was performed after a
> successful permission check:
> 
> 	bne-	DataAddressInvalid	/* return if access not permitted */
> 	ori	r0,r0,_PAGE_ACCESSED	/* set _PAGE_ACCESSED in pte */
> 	/*
> 	 * NOTE! We are assuming this is not an SMP system, otherwise
> 	 * we would need to update the pte atomically with lwarx/stwcx.
> 	 */
> 	stw	r0,0(r2)	/* update PTE (accessed bit) */
> 	/* Convert linux-style PTE to low word of PPC-style PTE */
> 
> The bit is set (ori ...) and written back (stw ...) to the Linux PTE.
> Maybe this is not needed, as the PTE is never seen by the PPC chip. But I
> do not understand WHY _PAGE_ACCESSED is used for the permission check in
> the late 5.4 kernel (not used in the 4.4 kernel):
> 
> 	cmplw	0,r1,r3
> 	mfspr	r2, SPRN_SDR1
> 	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED
> 	rlwinm	r2, r2, 28, 0xf000
> 	bgt-	112f
> 
> What is the reason or relevance of checking this here? It was not checked
> in 4.4, the bit was or-ed in afterwards, as the page is accessed now. Do
> you know the reason for the change on this point?

_PAGE_ACCESSED is important for memory management, the Linux kernel needs it. But instead of spending time at every miss to perform a write which will be a no-op in 99% of cases, we prefer bailing out to the page fault logic when the accessed bit is not set. Then the page fault logic will set the bit. This also allowed simplifying the handling in the __set_pte_at() function by avoiding races in the update of PTEs.

> Another remark to the Core manual relevant for this:
> There is the
Re: [PATCH 04/13] module: use RCU to synchronize find_module
+++ Miroslav Benes [29/01/21 16:29 +0100]:
> On Thu, 28 Jan 2021, Christoph Hellwig wrote:
> 
> > Allow for a RCU-sched critical section around find_module, following
> > the lower level find_module_all helper, and switch the two callers
> > outside of module.c to use such a RCU-sched critical section instead
> > of module_mutex.
> 
> That's a nice idea.
> 
> > @@ -57,7 +58,7 @@ static void klp_find_object_module(struct klp_object *obj)
> >  	if (!klp_is_module(obj))
> >  		return;
> > 
> > -	mutex_lock(&module_mutex);
> > +	rcu_read_lock_sched();
> >  	/*
> >  	 * We do not want to block removal of patched modules and therefore
> >  	 * we do not take a reference here. The patches are removed by
> > @@ -74,7 +75,7 @@ static void klp_find_object_module(struct klp_object *obj)
> >  	if (mod && mod->klp_alive)
> 
> RCU always baffles me a bit, so I'll ask. Don't we need
> rcu_dereference_sched() here? "mod" comes from a RCU-protected list, so I
> wonder.

Same here :-) I had to double check the RCU documentation. For our
modules list case I believe the rcu list API should take care of that
for us. Worth noting is this snippet from Documentation/RCU/whatisRCU.txt:

    rcu_dereference() is typically used indirectly, via the _rcu
    list-manipulation primitives, such as list_for_each_entry_rcu()
Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
On Sat, Jan 30, 2021 at 04:11:23PM -0800, Nadav Amit wrote:

> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 427bfcc6cdec..b97136b7010b 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
> 
>  #ifdef CONFIG_MMU_GATHER_NO_RANGE
> 
> -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
> -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma()
> +#if defined(tlb_flush)
> +#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
>  #endif
> 
>  /*
> @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
> 
>  #ifndef tlb_flush
> 
> -#if defined(tlb_start_vma) || defined(tlb_end_vma)
> -#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma()
> -#endif

#ifdef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
#error
#endif

goes here...

> static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
> {
> 	if (tlb->fullmm)
> 		return;
> 
> +	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
> +		return;

Also, can you please stick to the CONFIG_MMU_GATHER_* namespace?

I also don't think AGGRESSIVE_FLUSH_BATCHING quite captures what it does.
How about:

  CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH

?
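With Peter's proposed name, the generic helper would end up roughly as follows (a sketch assuming the rest of Nadav's patch stays as posted; tlb_flush_mmu_tlbonly() is the existing flush path in asm-generic/tlb.h):

	static inline void tlb_end_vma(struct mmu_gather *tlb,
				       struct vm_area_struct *vma)
	{
		if (tlb->fullmm)
			return;

		/*
		 * Architectures that opt out of per-VMA flushing keep
		 * gathering and flush once at tlb_finish_mmu() time.
		 */
		if (IS_ENABLED(CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH))
			return;

		tlb_flush_mmu_tlbonly(tlb);
	}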
Re: [PATCH 05/13] kallsyms: refactor {,module_}kallsyms_on_each_symbol
On Fri, Jan 29, 2021 at 10:43:36AM +0100, Petr Mladek wrote:
> > --- a/kernel/livepatch/core.c
> > +++ b/kernel/livepatch/core.c
> > @@ -164,12 +164,8 @@ static int klp_find_object_symbol(const char *objname, const char *name,
> >  		.pos = sympos,
> >  	};
> > 
> > -	mutex_lock(&module_mutex);
> > -	if (objname)
> > +	if (objname || !kallsyms_on_each_symbol(klp_find_callback, &args))
> >  		module_kallsyms_on_each_symbol(klp_find_callback, &args);
> > -	else
> > -		kallsyms_on_each_symbol(klp_find_callback, &args);
> > -	mutex_unlock(&module_mutex);
> 
> This change is not needed. (objname == NULL) means that we are
> interested only in symbols in "vmlinux".
> 
> module_kallsyms_on_each_symbol(klp_find_callback, &args)
> will always fail when objname == NULL.

I just tried to keep the old behavior. I can respin it with your
recommended change noting the change in behavior, though.
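The respin Petr asks for would dispatch on objname up front, roughly as below (a sketch; the existing sympos/count validation is elided and the error value is illustrative):

	static int klp_find_object_symbol(const char *objname, const char *name,
					  unsigned long sympos, unsigned long *addr)
	{
		struct klp_find_arg args = {
			.objname = objname,
			.name = name,
			.addr = 0,
			.count = 0,
			.pos = sympos,
		};

		/* objname == NULL means vmlinux: never consult modules then. */
		if (objname)
			module_kallsyms_on_each_symbol(klp_find_callback, &args);
		else
			kallsyms_on_each_symbol(klp_find_callback, &args);

		/* ... existing sympos/count validation follows here ... */
		return args.addr ? 0 : -EINVAL;
	}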
Re: [PATCH 04/13] module: use RCU to synchronize find_module
On Fri, Jan 29, 2021 at 04:29:02PM +0100, Miroslav Benes wrote:
> > -	mutex_lock(&module_mutex);
> > +	rcu_read_lock_sched();
> >  	/*
> >  	 * We do not want to block removal of patched modules and therefore
> >  	 * we do not take a reference here. The patches are removed by
> > @@ -74,7 +75,7 @@ static void klp_find_object_module(struct klp_object *obj)
> >  	if (mod && mod->klp_alive)
> 
> RCU always baffles me a bit, so I'll ask. Don't we need
> rcu_dereference_sched() here? "mod" comes from a RCU-protected list, so I
> wonder.

rcu_dereference* is only used for dereferencing pointers where the
reference itself is RCU protected, that is, the lookup of mod itself
down in find_module_all in this case.
Re: [PATCH] powerpc/603: Fix protection of user pages mapped with PROT_NONE
Le 01/02/2021 à 11:22, PLATTNER Christoph a écrit :
> Hello to all, and thank you very much for first and second fast response.
> 
> I do not have a long history on PowerPC MMU environment, I hacked into
> this topic for about 3 months for analyzing that problem - so, sorry,
> if I am wrong in some points ...

Yes you are wrong on some points, sorry, see below.

> What I have learned so far from this MPC5121e (variant of e300c4 core):
> - It uses book3s32 hash-code, but it DOES NOT provide the KEY hash method,
>   so always the branch "if (!Hash)" is taken. So I assume that "key 0" and
>   "key 1" setups are not used on this CPU (not supporting MMU_FTR_HPTE_TABLE).

The hash method is not used, it is SW TLB loading that is used. But still, all the PP and Ks/Kp keys defined in the segment register are used, see e300 core reference manual §6.4.2 Page Memory Protection.

> - The PP bits are NOT checked by the CPU in HW; even if set to 00, the CPU
>   does not react. As far as I have understood, the TLB miss routines are
>   responsible for checking permissions. The TLB miss routines check the
>   Linux-style PTE entries and generate the PP bits for the TLB entry. The
>   PowerPC PP bits are never checked elsewhere on those CPU models ...

PP bits ARE checked, hopefully. If that was not the case, then the TLB miss routines would install a TLB entry on a read, and the user could then do a write without any verification being done. Refer to the e300 Core Reference Manual, §6.1.4 Memory Protection Facilities.

As I explained in the patch, the problem is not that the HW doesn't check the permission. It is that user accesses are done with key 0 as programmed in the segment registers, and PP 00 means RW access with key 0.

> - The PTE entries in Linux are fully "void" in the sense of this CPU type,
>   as this CPU does not read any PTEs from RAM (no HW support, in contrast
>   to x86 or ARM or later ppc ...).

No, the PTEs are read by the TLB miss exception handlers and written into TLB entries.

> In summary - as far as I understand it now - we have to handle the PTE
> bits differently (Linux style) for PROT_NONE permissions - OR - we have
> to expand the permission checking like my proposed experimental patch.
> (PROT_NONE is not NUMA related only, but may not be used very often ...)

Yes, expanding the permission checking is the easiest solution, hence the patch I sent out based on your proposal.

> Another related point:
> According to the e300 RM (manual), the ACCESSED bit in the PTE shall be
> set on TLB miss, as it is an indication that the page is used. In the 4.4
> kernel this write back of the _PAGE_ACCESSED bit was performed after a
> successful permission check:
> 
> 	bne-	DataAddressInvalid	/* return if access not permitted */
> 	ori	r0,r0,_PAGE_ACCESSED	/* set _PAGE_ACCESSED in pte */
> 	/*
> 	 * NOTE! We are assuming this is not an SMP system, otherwise
> 	 * we would need to update the pte atomically with lwarx/stwcx.
> 	 */
> 	stw	r0,0(r2)	/* update PTE (accessed bit) */
> 	/* Convert linux-style PTE to low word of PPC-style PTE */
> 
> The bit is set (ori ...) and written back (stw ...) to the Linux PTE.
> Maybe this is not needed, as the PTE is never seen by the PPC chip. But I
> do not understand WHY _PAGE_ACCESSED is used for the permission check in
> the late 5.4 kernel (not used in the 4.4 kernel):
> 
> 	cmplw	0,r1,r3
> 	mfspr	r2, SPRN_SDR1
> 	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED
> 	rlwinm	r2, r2, 28, 0xf000
> 	bgt-	112f
> 
> What is the reason or relevance of checking this here? It was not checked
> in 4.4, the bit was or-ed in afterwards, as the page is accessed now. Do
> you know the reason for the change on this point?

_PAGE_ACCESSED is important for memory management, the Linux kernel needs it. But instead of spending time at every miss to perform a write which will be a no-op in 99% of cases, we prefer bailing out to the page fault logic when the accessed bit is not set. Then the page fault logic will set the bit. This also allowed simplifying the handling in the __set_pte_at() function by avoiding races in the update of PTEs.

> Another remark to the Core manual relevant for this:
> There is the reference manual for the e300 core available (e300 RM). It
> includes many remarks in the Memory Management section that many features
> are optional or variable for dedicated implementations. On the other hand,
> the MPC5121e reference manual refers to the e300 core RM, but DOES NOT say
> which of the optional points are implemented or not. According to my
> analysis, the MPC5121e does not include any of the optional features.

Not sure what you mean. As far as I understand, that chapter tells you that some functionalities are optional for the powerpc architecture, and provided (or not) by the e300 core. The MPC5121 supports all the things that are defined by the e300 core.

> Thanks a lot for first reactions

You are welcome, don't hesitate if you have additional questions.

Christophe
RE: [PATCH] powerpc/603: Fix protection of user pages mapped with PROT_NONE
Hello to all, and thank you very much for first and second fast response.

I do not have a long history on PowerPC MMU environment, I hacked into this topic for about 3 months for analyzing that problem - so, sorry, if I am wrong in some points ...

What I have learned so far from this MPC5121e (variant of e300c4 core):

- It uses book3s32 hash-code, but it DOES NOT provide the KEY hash method,
  so always the branch "if (!Hash)" is taken. So I assume that "key 0" and
  "key 1" setups are not used on this CPU (not supporting MMU_FTR_HPTE_TABLE).

- The PP bits are NOT checked by the CPU in HW; even if set to 00, the CPU
  does not react. As far as I have understood, the TLB miss routines are
  responsible for checking permissions. The TLB miss routines check the
  Linux-style PTE entries and generate the PP bits for the TLB entry. The
  PowerPC PP bits are never checked elsewhere on those CPU models ...

- The PTE entries in Linux are fully "void" in the sense of this CPU type,
  as this CPU does not read any PTEs from RAM (no HW support, in contrast
  to x86 or ARM or later ppc ...).

In summary - as far as I understand it now - we have to handle the PTE bits differently (Linux style) for PROT_NONE permissions - OR - we have to expand the permission checking like my proposed experimental patch. (PROT_NONE is not NUMA related only, but may not be used very often ...)

Another related point:

According to the e300 RM (manual), the ACCESSED bit in the PTE shall be set on TLB miss, as it is an indication that the page is used. In the 4.4 kernel this write back of the _PAGE_ACCESSED bit was performed after a successful permission check:

	bne-	DataAddressInvalid	/* return if access not permitted */
	ori	r0,r0,_PAGE_ACCESSED	/* set _PAGE_ACCESSED in pte */
	/*
	 * NOTE! We are assuming this is not an SMP system, otherwise
	 * we would need to update the pte atomically with lwarx/stwcx.
	 */
	stw	r0,0(r2)	/* update PTE (accessed bit) */
	/* Convert linux-style PTE to low word of PPC-style PTE */

The bit is set (ori ...) and written back (stw ...) to the Linux PTE. Maybe this is not needed, as the PTE is never seen by the PPC chip. But I do not understand WHY _PAGE_ACCESSED is used for the permission check in the late 5.4 kernel (not used in the 4.4 kernel):

	cmplw	0,r1,r3
	mfspr	r2, SPRN_SDR1
	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED
	rlwinm	r2, r2, 28, 0xf000
	bgt-	112f

What is the reason or relevance of checking this here? It was not checked in 4.4, the bit was or-ed in afterwards, as the page is accessed now. Do you know the reason for the change on this point?

Another remark to the Core manual relevant for this:

There is the reference manual for the e300 core available (e300 RM). It includes many remarks in the Memory Management section that many features are optional or variable for dedicated implementations. On the other hand, the MPC5121e reference manual refers to the e300 core RM, but DOES NOT say which of the optional points are implemented or not. According to my analysis, the MPC5121e does not include any of the optional features.

Thanks a lot for first reactions

Christoph

-----Original Message-----
From: Christophe Leroy
Sent: Montag, 1. Februar 2021 07:30
To: Benjamin Herrenschmidt; Paul Mackerras; Michael Ellerman; PLATTNER Christoph
Cc: linux-ker...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org
Subject: [PATCH] powerpc/603: Fix protection of user pages mapped with PROT_NONE

On book3s/32, page protection is defined by the PP bits in the PTE, which provide the following protection depending on the access keys defined in the matching segment register:

- PP 00 means RW with key 0 and N/A with key 1.
- PP 01 means RW with key 0 and RO with key 1.
- PP 10 means RW with both key 0 and key 1.
- PP 11 means RO with both key 0 and key 1.

Since the implementation of kernel userspace access protection, PP bits have been set as follows:

- PP00 for pages without _PAGE_USER
- PP01 for pages with _PAGE_USER and _PAGE_RW
- PP11 for pages with _PAGE_USER and without _PAGE_RW

For kernelspace segments, kernel accesses are performed with key 0 and user accesses are performed with key 1. As PP00 is used for non _PAGE_USER pages, user can't access kernel pages not flagged _PAGE_USER while kernel can.

For userspace segments, both kernel and user accesses are performed with key 0, therefore pages not flagged _PAGE_USER are still accessible to the user.

This shouldn't be an issue, because userspace is expected to be accessible to the user. But unlike most other architectures, powerpc implements PROT_NONE protection by removing _PAGE_USER flag instead of flagging the page as not valid. This means that pages in userspace that are not flagged _PAGE_USER shall remain inaccessible.

To get the expected behaviour, just mimic other architectures in the TLB miss handler by checking _PAGE_USER permission on
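As an illustration of the encoding described in the commit message above, the mapping from Linux PTE flags to PP bits could be sketched as a hypothetical helper (not code from the patch):

	/* Hypothetical illustration of the PP encoding described above. */
	static unsigned int pp_bits_for(unsigned long pte_flags)
	{
		if (!(pte_flags & _PAGE_USER))
			return 0;	/* PP 00: kernel RW (key 0), user N/A (key 1) */
		if (pte_flags & _PAGE_RW)
			return 1;	/* PP 01: kernel RW (key 0), user RO (key 1) */
		return 3;		/* PP 11: read-only for both keys */
	}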
Re: FSL P5040: KVM HV doesn't work with the RC5 of kernel 5.11
Hello,

I compiled the RC6 of kernel 5.11 today and KVM HV works again. Therefore I don't need the patch below anymore.

Many thanks for solving the issue,
Christian

On 27 January 2021 at 05:07pm, Christian Zigotzky wrote:

Hello,

I compiled the RC5 of kernel 5.11 today. Unfortunately KVM HV doesn't work anymore on my FSL P5040 board [1]. I tested it with QEMU 5.0.0 today [2]. The virtual e5500 QEMU machine works with the "RC4 with KVM HV" and with the "RC5 without KVM HV". The complete system freezes if I use KVM HV with the RC5.

I have bisected and 785025820a6a565185ce9d47fdd8d23dbf91dee8 (powerpc/mm/highmem: use __set_pte_at() for kmap_local()) [3] is the first bad commit. I was able to revert this bad commit and after a new compile, KVM HV works again. I created a patch for reverting the commit. [4]

Please find attached the kernel config. I use one uImage for the virtual machine and for the P5040 board.

Please check the first bad commit.

Thanks,
Christian

[1] http://wiki.amiga.org/index.php?title=X5000
[2] qemu-system-ppc64 -M ppce500 -cpu e5500 -enable-kvm -m 1024 -kernel uImage-5.11 -drive format=raw,file=MintPPC32-X5000.img,index=0,if=virtio -netdev user,id=mynet0 -device e1000,netdev=mynet0 -append "rw root=/dev/vda" -device virtio-vga -device virtio-mouse-pci -device virtio-keyboard-pci -device pci-ohci,id=newusb -device usb-audio,bus=newusb.0 -smp 4
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.11-rc5&id=785025820a6a565185ce9d47fdd8d23dbf91dee8
[4] diff -rupN a/arch/powerpc/include/asm/highmem.h b/arch/powerpc/include/asm/highmem.h
--- a/arch/powerpc/include/asm/highmem.h	2021-01-27 16:12:40.382164118 +0100
+++ b/arch/powerpc/include/asm/highmem.h	2021-01-27 16:10:54.055249957 +0100
@@ -58,8 +58,6 @@ extern pte_t *pkmap_page_table;
 
 #define flush_cache_kmaps()	flush_cache_all()
 
-#define arch_kmap_local_set_pte(mm, vaddr, ptep, ptev)	\
-	__set_pte_at(mm, vaddr, ptep, ptev, 1)
 #define arch_kmap_local_post_map(vaddr, pteval)	\
 	local_flush_tlb_page(NULL, vaddr)
 #define arch_kmap_local_post_unmap(vaddr)	\
[PATCH] ASoC: fsl_xcvr: remove unneeded semicolon
Eliminate the following coccicheck warning:

./sound/soc/fsl/fsl_xcvr.c:739:2-3: Unneeded semicolon

Reported-by: Abaci Robot
Signed-off-by: Yang Li
---
 sound/soc/fsl/fsl_xcvr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/fsl/fsl_xcvr.c b/sound/soc/fsl/fsl_xcvr.c
index 3d58c88..65b388a 100644
--- a/sound/soc/fsl/fsl_xcvr.c
+++ b/sound/soc/fsl/fsl_xcvr.c
@@ -736,7 +736,7 @@ static int fsl_xcvr_load_firmware(struct fsl_xcvr *xcvr)
 			/* clean current page, including data memory */
 			memset_io(xcvr->ram_addr, 0, size);
 		}
-	};
+	}
 
 err_firmware:
 	release_firmware(fw);
-- 
1.8.3.1