date:20220525

Re: [PATCH] powerpc/64/interrupt: Fix return to masked context after hard-mask irq becomes pending

2022-05-25 Thread Sachin Sant




> On 26-May-2022, at 10:20 AM, Nicholas Piggin  wrote:
> 
> Excerpts from Sachin Sant's message of March 9, 2022 6:37 pm:
>> 
>> 
>>> On 07-Mar-2022, at 8:21 PM, Nicholas Piggin  wrote:
>>> 
>>> When a synchronous interrupt[1] is taken in a local_irq_disable() region
>>> which has MSR[EE]=1, the interrupt handler will enable MSR[EE] as part
>>> of enabling MSR[RI], for performance and profiling reasons.
>>> 
>>> [1] Typically a hash fault, but in error cases this could be a page
>>> fault or facility unavailable as well.
>>> 
>>> If an asynchronous interrupt hits here and its masked handler requires
>>> MSR[EE] to be cleared (it is a PACA_IRQ_MUST_HARD_MASK interrupt), then
>>> MSR[EE] must remain disabled until that pending interrupt is replayed.
>>> The problem is that the MSR of the original context has MSR[EE]=1, so
>>> returning directly to that causes MSR[EE] to be enabled while the
>>> interrupt is still pending.
>>> 
>>> This issue was hacked around in the interrupt return code by just
>>> clearing the hard mask to avoid a warning, and taking the masked
>>> interrupt again immediately in the return context, which would disable
>>> MSR[EE]. However in the case of a pending PMI, it is possible that it is
>>> not masked in the calling context so the full handler will be run while
>>> there is a PMI pending, and this confuses the perf code and causes
>>> warnings with its PMI pending management.
>>> 
>>> Fix this by removing the hack, and adjusting the return MSR if it has
>>> MSR[EE]=1 and there is a PACA_IRQ_MUST_HARD_MASK interrupt pending.
>>> 
>>> Fixes: 4423eb5ae32e ("powerpc/64/interrupt: make normal synchronous 
>>> interrupts enable MSR[EE] if possible")
>>> Signed-off-by: Nicholas Piggin 
>>> ---
>>> arch/powerpc/kernel/interrupt.c | 10 -
>>> arch/powerpc/kernel/interrupt_64.S | 34 +++---
>>> 2 files changed, 31 insertions(+), 13 deletions(-)
>> 
>> With this patch on top of powerpc/merge, the following RCU stalls are seen while
>> running the powerpc selftests (mitigation-patching) on P9. I don’t see this
>> issue on P10.
>> 
>> [ 1841.248838] link-stack-flush: flush disabled.
>> [ 1841.248905] count-cache-flush: software flush enabled.
>> [ 1841.248911] link-stack-flush: software flush enabled.
>> [ 1901.249668] rcu: INFO: rcu_sched self-detected stall on CPU
>> [ 1901.249703] rcu:  12-...!: (5999 ticks this GP) 
>> idle=d0f/1/0x4002 softirq=37019/37027 fqs=0 
>> [ 1901.249720]   (t=6000 jiffies g=106273 q=1624)
>> [ 1901.249729] rcu: rcu_sched kthread starved for 6000 jiffies! g106273 f0x0 
>> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=6
>> [ 1901.249743] rcu:  Unless rcu_sched kthread gets sufficient CPU time, OOM 
>> is now expected behavior.
>> [ 1901.249752] rcu: RCU grace-period kthread stack dump:
>> [ 1901.249759] task:rcu_sched state:R running task stack: 0 pid: 11 ppid: 2 
>> flags:0x0800
>> [ 1901.249775] Call Trace:
>> [ 1901.249781] [c76ab870] [0001] 0x1 (unreliable)
>> [ 1901.249795] [c76aba60] [c001e508] __switch_to+0x288/0x4a0
>> [ 1901.249811] [c76abac0] [c0d15950] __schedule+0x2c0/0x950
>> [ 1901.249824] [c76abb80] [c0d16048] schedule+0x68/0x130
>> [ 1901.249836] [c76abbb0] [c0d1df1c] 
>> schedule_timeout+0x25c/0x3f0
>> [ 1901.249849] [c76abc90] [c021522c] 
>> rcu_gp_fqs_loop+0x2fc/0x3e0
>> [ 1901.249863] [c76abd40] [c021a0fc] 
>> rcu_gp_kthread+0x13c/0x180
>> [ 1901.249875] [c76abdc0] [c018ce94] kthread+0x124/0x130
>> [ 1901.249887] [c76abe10] [c000cec0] 
>> ret_from_kernel_thread+0x5c/0x64
>> [ 1901.249900] rcu: Stack dump where RCU GP kthread last ran:
>> [ 1901.249908] Sending NMI from CPU 12 to CPUs 6:
>> [ 1901.249944] NMI backtrace for cpu 6
>> [ 1901.249957] CPU: 6 PID: 40 Comm: migration/6 Not tainted 
>> 5.17.0-rc6-00327-g782b30d101f6-dirty #3
>> [ 1901.249971] Stopper: multi_cpu_stop+0x0/0x230 <- 
>> stop_machine_cpuslocked+0x188/0x1e0
>> [ 1901.249987] NIP: c0d14e0c LR: c0214280 CTR: 
>> c02914f0
>> [ 1901.249996] REGS: c785b980 TRAP: 0500 Not tainted 
>> (5.17.0-rc6-00327-g782b30d101f6-dirty)
>> [ 1901.250007] MSR: 8280b033  CR: 
>> 48002822 XER: 
>> [ 1901.250038] CFAR:  IRQMASK: 0 
>> [ 1901.250038] GPR00: c029165c c785bc20 c2a2 
>> 0002 
>> [ 1901.250038] GPR04:  c009fb60ab80 c009fb60ab70 
>> c001e508 
>> [ 1901.250038] GPR08:  c009fb68f5a8 0009f94c 
>> 0098967f 
>> [ 1901.250038] GPR12:  c0001ec57a00 c018cd78 
>> c7234f80 
>> [ 1901.250038] GPR16:    
>>  
>> [ 1901.250038] GPR20:    
>> 0001 
>> [ 1901.250038] GPR24: 0002

Re: [PATCH] powerpc/64/interrupt: Fix return to masked context after hard-mask irq becomes pending

2022-05-25 Thread Nicholas Piggin

Excerpts from Sachin Sant's message of March 9, 2022 6:37 pm:
> 
> 
>> On 07-Mar-2022, at 8:21 PM, Nicholas Piggin  wrote:
>> 
>> When a synchronous interrupt[1] is taken in a local_irq_disable() region
>> which has MSR[EE]=1, the interrupt handler will enable MSR[EE] as part
>> of enabling MSR[RI], for performance and profiling reasons.
>> 
>> [1] Typically a hash fault, but in error cases this could be a page
>> fault or facility unavailable as well.
>> 
>> If an asynchronous interrupt hits here and its masked handler requires
>> MSR[EE] to be cleared (it is a PACA_IRQ_MUST_HARD_MASK interrupt), then
>> MSR[EE] must remain disabled until that pending interrupt is replayed.
>> The problem is that the MSR of the original context has MSR[EE]=1, so
>> returning directly to that causes MSR[EE] to be enabled while the
>> interrupt is still pending.
>> 
>> This issue was hacked around in the interrupt return code by just
>> clearing the hard mask to avoid a warning, and taking the masked
>> interrupt again immediately in the return context, which would disable
>> MSR[EE]. However in the case of a pending PMI, it is possible that it is
>> not masked in the calling context so the full handler will be run while
>> there is a PMI pending, and this confuses the perf code and causes
>> warnings with its PMI pending management.
>> 
>> Fix this by removing the hack, and adjusting the return MSR if it has
>> MSR[EE]=1 and there is a PACA_IRQ_MUST_HARD_MASK interrupt pending.
>> 
>> Fixes: 4423eb5ae32e ("powerpc/64/interrupt: make normal synchronous 
>> interrupts enable MSR[EE] if possible")
>> Signed-off-by: Nicholas Piggin 
>> ---
>> arch/powerpc/kernel/interrupt.c | 10 -
>> arch/powerpc/kernel/interrupt_64.S | 34 +++---
>> 2 files changed, 31 insertions(+), 13 deletions(-)
> 
> With this patch on top of powerpc/merge, the following RCU stalls are seen while
> running the powerpc selftests (mitigation-patching) on P9. I don’t see this
> issue on P10.
> 
> [ 1841.248838] link-stack-flush: flush disabled.
> [ 1841.248905] count-cache-flush: software flush enabled.
> [ 1841.248911] link-stack-flush: software flush enabled.
> [ 1901.249668] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 1901.249703] rcu:   12-...!: (5999 ticks this GP) 
> idle=d0f/1/0x4002 softirq=37019/37027 fqs=0 
> [ 1901.249720] (t=6000 jiffies g=106273 q=1624)
> [ 1901.249729] rcu: rcu_sched kthread starved for 6000 jiffies! g106273 f0x0 
> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=6
> [ 1901.249743] rcu:   Unless rcu_sched kthread gets sufficient CPU time, OOM 
> is now expected behavior.
> [ 1901.249752] rcu: RCU grace-period kthread stack dump:
> [ 1901.249759] task:rcu_sched   state:R  running task stack:0 
> pid:   11 ppid: 2 flags:0x0800
> [ 1901.249775] Call Trace:
> [ 1901.249781] [c76ab870] [0001] 0x1 (unreliable)
> [ 1901.249795] [c76aba60] [c001e508] __switch_to+0x288/0x4a0
> [ 1901.249811] [c76abac0] [c0d15950] __schedule+0x2c0/0x950
> [ 1901.249824] [c76abb80] [c0d16048] schedule+0x68/0x130
> [ 1901.249836] [c76abbb0] [c0d1df1c] 
> schedule_timeout+0x25c/0x3f0
> [ 1901.249849] [c76abc90] [c021522c] 
> rcu_gp_fqs_loop+0x2fc/0x3e0
> [ 1901.249863] [c76abd40] [c021a0fc] 
> rcu_gp_kthread+0x13c/0x180
> [ 1901.249875] [c76abdc0] [c018ce94] kthread+0x124/0x130
> [ 1901.249887] [c76abe10] [c000cec0] 
> ret_from_kernel_thread+0x5c/0x64
> [ 1901.249900] rcu: Stack dump where RCU GP kthread last ran:
> [ 1901.249908] Sending NMI from CPU 12 to CPUs 6:
> [ 1901.249944] NMI backtrace for cpu 6
> [ 1901.249957] CPU: 6 PID: 40 Comm: migration/6 Not tainted 
> 5.17.0-rc6-00327-g782b30d101f6-dirty #3
> [ 1901.249971] Stopper: multi_cpu_stop+0x0/0x230 <- 
> stop_machine_cpuslocked+0x188/0x1e0
> [ 1901.249987] NIP:  c0d14e0c LR: c0214280 CTR: 
> c02914f0
> [ 1901.249996] REGS: c785b980 TRAP: 0500   Not tainted  
> (5.17.0-rc6-00327-g782b30d101f6-dirty)
> [ 1901.250007] MSR:  8280b033   CR: 
> 48002822  XER: 
> [ 1901.250038] CFAR:  IRQMASK: 0 
> [ 1901.250038] GPR00: c029165c c785bc20 c2a2 
> 0002 
> [ 1901.250038] GPR04:  c009fb60ab80 c009fb60ab70 
> c001e508 
> [ 1901.250038] GPR08:  c009fb68f5a8 0009f94c 
> 0098967f 
> [ 1901.250038] GPR12:  c0001ec57a00 c018cd78 
> c7234f80 
> [ 1901.250038] GPR16:    
>  
> [ 1901.250038] GPR20:    
> 0001 
> [ 1901.250038] GPR24: 0002 0003  
> c2a62138 
> [ 1901.250038] GPR28: c000ee70faf8 0001 c000ee70fb1c 
>
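
The fix described in the quoted patch description (clear MSR[EE] in the return MSR when the interrupted context had MSR[EE]=1 and a PACA_IRQ_MUST_HARD_MASK interrupt is pending) can be modelled in a few lines of userspace C. This is only a sketch of the decision, not the code from the patch: the helper name adjust_return_msr() is hypothetical, and the bit values below are assumptions for illustration that should be checked against the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, for illustration only; verify against the kernel headers. */
#define MSR_EE                  (1ULL << 15)  /* external interrupt enable */
#define PACA_IRQ_EE             0x04
#define PACA_IRQ_PMI            0x40
#define PACA_IRQ_MUST_HARD_MASK (PACA_IRQ_EE | PACA_IRQ_PMI)

/*
 * Hypothetical helper: given the MSR saved from the interrupted context and
 * the pending-interrupt bits, return the MSR to use on interrupt return.
 * MSR[EE] is cleared only when it was set and a must-hard-mask interrupt is
 * still pending; otherwise the saved MSR is returned unchanged.
 */
static uint64_t adjust_return_msr(uint64_t saved_msr, uint8_t irq_happened)
{
    bool must_hard_mask = (irq_happened & PACA_IRQ_MUST_HARD_MASK) != 0;

    if ((saved_msr & MSR_EE) && must_hard_mask)
        return saved_msr & ~MSR_EE;
    return saved_msr;
}
```

In this model, a pending PMI with MSR[EE]=1 in the saved MSR yields a return MSR with MSR[EE]=0, so the interrupt stays hard-masked until it is replayed; a pending interrupt outside the must-hard-mask set leaves the MSR untouched.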

Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-25 Thread Vineet Gupta


On 5/24/22 16:45, Peter Xu wrote:

I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page. It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

Then after that throttling we return VM_FAULT_RETRY.

We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.

However that's not ideal because it's very likely the fault does not need
to be retried at all, since the pgtable was already installed before the
throttling, so the retried fault (including taking the mmap read lock,
walking the pgtable, etc.) is in most cases unnecessary.

This not only slows down page faults for shared file-backed memory, but also
adds more mmap lock contention, which is in most cases not needed at all.

To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.
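
One way to try the observation above is sketched below, assuming a Linux system where /proc/vmstat is readable. The two helper names are hypothetical. A shared anonymous mapping is used as the shmem-backed memory, which is an assumption of this sketch rather than the setup used for the measurements in this mail.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return the value of a /proc/vmstat counter, or -1 if it cannot be read. */
static long read_vmstat_counter(const char *name)
{
    char key[64];
    long value;
    long result = -1;
    FILE *f = fopen("/proc/vmstat", "r");

    if (!f)
        return -1;
    /* Each line has the form "<name> <value>". */
    while (fscanf(f, "%63s %ld", key, &value) == 2) {
        if (strcmp(key, name) == 0) {
            result = value;
            break;
        }
    }
    fclose(f);
    return result;
}

/*
 * Write one byte to each page of a fresh shared anonymous mapping, so that
 * every page takes a first write fault. Returns 0 on success, -1 on failure.
 */
static int dirty_shared_pages(size_t npages)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t len;
    volatile unsigned char *p;

    if (page_size <= 0 || npages == 0)
        return -1;
    len = npages * (size_t)page_size;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    for (size_t i = 0; i < npages; i++)
        p[i * (size_t)page_size] = 1;
    munmap((void *)p, len);
    return 0;
}
```

Reading read_vmstat_counter("pgfault") before and after dirty_shared_pages(n) and dividing the difference by n gives a rough faults-per-page figure. The counter is system-wide, so other activity on the machine adds noise to the result.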

To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock. It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.
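
The difference between the two return codes can be illustrated with a small userspace model of an arch fault handler loop. This is a sketch under stated assumptions, not code from the patch: model_handle_mm_fault() and model_arch_fault() are hypothetical stand-ins, the flag values are illustrative, and the locking is represented only by a counter.

```c
#include <stdbool.h>

/* Illustrative flag values for this userspace model only. */
#define VM_FAULT_RETRY     0x0400u
#define VM_FAULT_COMPLETED 0x4000u

struct fault_stats {
    unsigned int faults;        /* times the fault path was entered */
    unsigned int lock_acquires; /* times the mmap read lock was taken */
};

/*
 * Hypothetical stand-in for the core fault path on the first write to a
 * shared file-backed page: the first attempt installs the page, then drops
 * the lock for dirty throttling and reports that with one of the two codes.
 * A retried attempt finds the page already in place and returns 0.
 */
static unsigned int model_handle_mm_fault(bool first_attempt, bool use_completed)
{
    if (!first_attempt)
        return 0;
    return use_completed ? VM_FAULT_COMPLETED : VM_FAULT_RETRY;
}

/* Model of an arch fault handler servicing one write fault. */
static struct fault_stats model_arch_fault(bool use_completed)
{
    struct fault_stats st = {0, 0};
    bool first_attempt = true;

    for (;;) {
        unsigned int ret;

        st.lock_acquires++;  /* models taking the mmap read lock */
        st.faults++;
        ret = model_handle_mm_fault(first_attempt, use_completed);

        if (ret & VM_FAULT_COMPLETED)
            break;           /* lock already released, fault is done */
        if (ret & VM_FAULT_RETRY) {
            first_attempt = false;
            continue;        /* lock already released, go around again */
        }
        break;               /* normal path: handler releases the lock here */
    }
    return st;
}
```

In this model, model_arch_fault(false) enters the fault path twice and takes the lock twice, matching the two "pgfault" counts described above, while model_arch_fault(true) does each only once.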

This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program that sequentially dirties a 400MB mmap()ed shmem file. These are
the times it needs:

Before: 650.980 ms (+-1.94%)
After: 569.396 ms (+-1.38%)

I believe it could help more than that.

We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault); the rest of the changes in the page fault
handlers should be relatively straightforward.

Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.
