Re: [PATCH v3] KVM: PPC: Tick accounting should defer vtime accounting 'til after IRQ handling

2021-10-28 Thread Laurent Vivier

On 27/10/2021 16:21, Nicholas Piggin wrote:

From: Laurent Vivier 

Commit 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest
context before enabling irqs") moved guest_exit() into the
interrupt-protected area to avoid a wrong-context warning (or worse).
The problem is that tick-based time accounting has not yet been updated
at this point (because it depends on the timer interrupt firing), so the
guest time gets incorrectly accounted to system time.

To fix the problem, follow the x86 fix in commit 160457140187 ("Defer
vtime accounting 'til after IRQ handling"), and allow host IRQs to run
before accounting the guest exit time.

When vtime accounting is enabled, this is not required, because the
timebase (TB) is used directly for accounting.

Before this patch, with CONFIG_TICK_CPU_ACCOUNTING=y in the host and a
guest running a kernel compile, the 'guest' fields of /proc/stat are
stuck at zero. With the patch they can be observed increasing roughly as
expected.

Fixes: e233d54d4d97 ("KVM: booke: use __kvm_guest_exit")
Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
Cc:  # 5.12
Signed-off-by: Laurent Vivier 
[np: only required for tick accounting, add Book3E fix, tweak changelog]
Signed-off-by: Nicholas Piggin 
---
Since v2:
- I took over the patch with Laurent's blessing.
- Changed to avoid processing IRQs if we do have vtime accounting
   enabled.
- Changed so in either case the accounting is called with irqs disabled.
- Added similar Book3E fix.
- Rebased on upstream, tested, observed bug and confirmed fix.

  arch/powerpc/kvm/book3s_hv.c | 30 ++++++++++++++++++++++++++++--
  arch/powerpc/kvm/booke.c     | 16 +++++++++++++++-
  2 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2acb1c96cfaf..7b74fc0a986b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3726,7 +3726,20 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
  
  	kvmppc_set_host_core(pcpu);
  
-	guest_exit_irqoff();
+	context_tracking_guest_exit();
+	if (!vtime_accounting_enabled_this_cpu()) {
+		local_irq_enable();
+		/*
+		 * Service IRQs here before vtime_account_guest_exit() so any
+		 * ticks that occurred while running the guest are accounted to
+		 * the guest. If vtime accounting is enabled, accounting uses
+		 * TB rather than ticks, so it can be done without enabling
+		 * interrupts here, which has the problem that it accounts
+		 * interrupt processing overhead to the host.
+		 */
+		local_irq_disable();
+	}
+	vtime_account_guest_exit();
  
  	local_irq_enable();
  
@@ -4510,7 +4523,20 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
  
  	kvmppc_set_host_core(pcpu);
  
-	guest_exit_irqoff();
+	context_tracking_guest_exit();
+	if (!vtime_accounting_enabled_this_cpu()) {
+		local_irq_enable();
+		/*
+		 * Service IRQs here before vtime_account_guest_exit() so any
+		 * ticks that occurred while running the guest are accounted to
+		 * the guest. If vtime accounting is enabled, accounting uses
+		 * TB rather than ticks, so it can be done without enabling
+		 * interrupts here, which has the problem that it accounts
+		 * interrupt processing overhead to the host.
+		 */
+		local_irq_disable();
+	}
+	vtime_account_guest_exit();
  
  	local_irq_enable();
  
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 977801c83aff..8c15c90dd3a9 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1042,7 +1042,21 @@ int kvmppc_handle_exit(struct kvm_vcpu *vcpu, unsigned int exit_nr)
}
  
 	trace_kvm_exit(exit_nr, vcpu);
-	guest_exit_irqoff();
+
+	context_tracking_guest_exit();
+	if (!vtime_accounting_enabled_this_cpu()) {
+		local_irq_enable();
+		/*
+		 * Service IRQs here before vtime_account_guest_exit() so any
+		 * ticks that occurred while running the guest are accounted to
+		 * the guest. If vtime accounting is enabled, accounting uses
+		 * TB rather than ticks, so it can be done without enabling
+		 * interrupts here, which has the problem that it accounts
+		 * interrupt processing overhead to the host.
+		 */
+		local_irq_disable();
+	}
+	vtime_account_guest_exit();
  
  	local_irq_enable();
  



I'm wondering if we should put the context_tracking_guest_exit() just after the 
"srcu_read_unlock(&vc->kvm->srcu, srcu_idx);" as it was before 61bd0f66ff92.

Re: [PATCH v3] KVM: PPC: Tick accounting should defer vtime accounting 'til after IRQ handling

2021-10-28 Thread Laurent Vivier

On 27/10/2021 16:21, Nicholas Piggin wrote:




Tested-by: Laurent Vivier 

Checked with mpstat that time is accounted to %guest while a stress-ng test is running in 
the guest. Checked there is no warning in the host kernel logs.


Thanks,
Laurent



Re: [PATCH v2] KVM: PPC: Defer vtime accounting 'til after IRQ handling

2021-10-20 Thread Laurent Vivier

On 15/10/2021 04:23, Nicholas Piggin wrote:

Excerpts from Laurent Vivier's message of October 13, 2021 7:30 pm:

On 13/10/2021 01:18, Michael Ellerman wrote:

Laurent Vivier  writes:

Commit 112665286d08 moved guest_exit() into the interrupt-protected
area to avoid a wrong-context warning (or worse), but then the tick
counter cannot be updated and the guest time is accounted to system
time.

To fix the problem, port to POWER the x86 fix
160457140187 ("Defer vtime accounting 'til after IRQ handling"):

"Defer the call to account guest time until after servicing any IRQ(s)
   that happened in the guest or immediately after VM-Exit.  Tick-based
   accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
   handler runs, and IRQs are blocked throughout the main sequence of
   vcpu_enter_guest(), including the call into vendor code to actually
   enter and exit the guest."

Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
Cc: npig...@gmail.com
Cc:  # 5.12
Signed-off-by: Laurent Vivier 
---

Notes:
  v2: remove reference to commit 61bd0f66ff92
  cc stable 5.12
  add the same comment in the code as for x86

   arch/powerpc/kvm/book3s_hv.c | 24 ++++++++++++++++++++----
   1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2acb1c96cfaf..a694d1a8f6ce 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c

...

@@ -4506,13 +4514,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
 
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 
+	context_tracking_guest_exit();
+
 	set_irq_happened(trap);
 
 	kvmppc_set_host_core(pcpu);
 
-	guest_exit_irqoff();
-
 	local_irq_enable();
+	/*
+	 * Wait until after servicing IRQs to account guest time so that any
+	 * ticks that occurred while running the guest are properly accounted
+	 * to the guest.  Waiting until IRQs are enabled degrades the accuracy
+	 * of accounting via context tracking, but the loss of accuracy is
+	 * acceptable for all known use cases.
+	 */
+	vtime_account_guest_exit();


This pops a warning for me, running guest(s) on Power8:
   
[  270.745303][T16661] [ cut here ]

[  270.745374][T16661] WARNING: CPU: 72 PID: 16661 at arch/powerpc/kernel/time.c:311 vtime_account_kernel+0xe0/0xf0


Thank you, I missed that...

My patch is wrong, I have to add vtime_account_guest_exit() before the 
local_irq_enable().


I thought so, because if we take an interrupt after exiting the guest, that
should be accounted to the kernel, not the guest.



arch/powerpc/kernel/time.c

   305 static unsigned long vtime_delta(struct cpu_accounting_data *acct,
   306  unsigned long *stime_scaled,
   307  unsigned long *steal_time)
   308 {
   309 unsigned long now, stime;
   310
   311 WARN_ON_ONCE(!irqs_disabled());
...

But I don't understand how ticks can be accounted now if irqs are still 
disabled.

Not sure it is as simple as expected...


I don't know all the timer stuff too well. The
!CONFIG_VIRT_CPU_ACCOUNTING case relies on PF_VCPU being set when
the host timer interrupt runs, so that irqtime_account_process_tick()
can accumulate that tick to the guest?

That probably makes sense then, but it seems like we need that in a
different place. Timer interrupts are not guaranteed to be the first one
to occur when interrupts are enabled.

Maybe a new tick_account_guest_exit() and move PF_VCPU clearing to that
for tick based accounting. Call it after local_irq_enable and call the
vtime accounting before it. Would that work?


Hi Nick,

I think I will not have the time to look into fixing that.

Could you try?

Thanks,
Laurent



Re: [PATCH v2] KVM: PPC: Defer vtime accounting 'til after IRQ handling

2021-10-13 Thread Laurent Vivier


Thanks,
Laurent



[PATCH v2] KVM: PPC: Defer vtime accounting 'til after IRQ handling

2021-10-07 Thread Laurent Vivier
Commit 112665286d08 moved guest_exit() into the interrupt-protected
area to avoid a wrong-context warning (or worse), but then the tick
counter cannot be updated and the guest time is accounted to system
time.

To fix the problem, port to POWER the x86 fix
160457140187 ("Defer vtime accounting 'til after IRQ handling"):

"Defer the call to account guest time until after servicing any IRQ(s)
 that happened in the guest or immediately after VM-Exit.  Tick-based
 accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
 handler runs, and IRQs are blocked throughout the main sequence of
 vcpu_enter_guest(), including the call into vendor code to actually
 enter and exit the guest."

Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
Cc: npig...@gmail.com
Cc:  # 5.12
Signed-off-by: Laurent Vivier 
---

Notes:
v2: remove reference to commit 61bd0f66ff92
cc stable 5.12
add the same comment in the code as for x86

 arch/powerpc/kvm/book3s_hv.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2acb1c96cfaf..a694d1a8f6ce 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3695,6 +3695,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 
 	srcu_read_unlock(&vc->kvm->srcu, srcu_idx);
 
+	context_tracking_guest_exit();
+
 	set_irq_happened(trap);
 
 	spin_lock(&vc->lock);
@@ -3726,9 +3728,15 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 
kvmppc_set_host_core(pcpu);
 
-   guest_exit_irqoff();
-
local_irq_enable();
+   /*
+* Wait until after servicing IRQs to account guest time so that any
+* ticks that occurred while running the guest are properly accounted
+* to the guest.  Waiting until IRQs are enabled degrades the accuracy
+* of accounting via context tracking, but the loss of accuracy is
+* acceptable for all known use cases.
+*/
+   vtime_account_guest_exit();
 
/* Let secondaries go back to the offline loop */
for (i = 0; i < controlled_threads; ++i) {
@@ -4506,13 +4514,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
 
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 
+   context_tracking_guest_exit();
+
set_irq_happened(trap);
 
kvmppc_set_host_core(pcpu);
 
-   guest_exit_irqoff();
-
local_irq_enable();
+   /*
+* Wait until after servicing IRQs to account guest time so that any
+* ticks that occurred while running the guest are properly accounted
+* to the guest.  Waiting until IRQs are enabled degrades the accuracy
+* of accounting via context tracking, but the loss of accuracy is
+* acceptable for all known use cases.
+*/
+   vtime_account_guest_exit();
 
 	cpumask_clear_cpu(pcpu, &kvm->arch.cpu_in_guest);
 
-- 
2.31.1



Re: [PATCH] KVM: PPC: Defer vtime accounting 'til after IRQ handling

2021-10-07 Thread Laurent Vivier

On 06/10/2021 12:42, Greg Kurz wrote:

On Wed,  6 Oct 2021 09:37:45 +0200
Laurent Vivier  wrote:


Commit 61bd0f66ff92 moved guest_enter() out of the interrupt-protected
area to be able to have updated tick counters, but commit 112665286d08
moved it back into this area to avoid a wrong-context warning (or worse).

Neither of them is fully correct. To fix the problem, port to POWER the
x86 fix 160457140187 ("KVM: x86: Defer vtime accounting 'til after IRQ
handling"):

"Defer the call to account guest time until after servicing any IRQ(s)
  that happened in the guest or immediately after VM-Exit.  Tick-based
  accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
  handler runs, and IRQs are blocked throughout the main sequence of
  vcpu_enter_guest(), including the call into vendor code to actually
  enter and exit the guest."

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2009312
Fixes: 61bd0f66ff92 ("KVM: PPC: Book3S HV: Fix guest time accounting with VIRT_CPU_ACCOUNTING_GEN")


This patch was merged in linux 4.16 and thus is on the 4.16.y
stable branch and it was backported to stable 4.14.y. It would
make sense to Cc: stable # v4.14 also, but...


Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")


... this one, which was merged in linux 5.12, was never backported
anywhere because it wasn't considered as a fix, as commented here:

https://lore.kernel.org/linuxppc-dev/1610793296.fjhomer31g.astr...@bobo.none/

AFAICT commit 61bd0f66ff92 was never mentioned anywhere in a bug. The
first Fixes: tag thus looks a bit misleading. I'd personally drop it
and Cc: stable # v5.12.



OK, I'll update the comment.


Cc: npig...@gmail.com

Signed-off-by: Laurent Vivier 
---
  arch/powerpc/kvm/book3s_hv.c | 10 ++++++----
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2acb1c96cfaf..43e1ce853785 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3695,6 +3695,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 
 	srcu_read_unlock(&vc->kvm->srcu, srcu_idx);
 
+	context_tracking_guest_exit();
+
 	set_irq_happened(trap);
 
 	spin_lock(&vc->lock);

@@ -3726,9 +3728,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 
 	kvmppc_set_host_core(pcpu);
 
-	guest_exit_irqoff();
-



Change looks ok but it might be a bit confusing for the
occasional reader that guest_enter_irqoff() isn't matched
by a guest_exit_irqoff().


local_irq_enable();
+   vtime_account_guest_exit();
  


Maybe add a comment like in x86 ?



done


/* Let secondaries go back to the offline loop */
for (i = 0; i < controlled_threads; ++i) {
@@ -4506,13 +4507,14 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
 
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 
+	context_tracking_guest_exit();
+
 	set_irq_happened(trap);
 
 	kvmppc_set_host_core(pcpu);
 
-	guest_exit_irqoff();
-
 	local_irq_enable();
+	vtime_account_guest_exit();
 
 	cpumask_clear_cpu(pcpu, &kvm->arch.cpu_in_guest);
  


Same remarks. FWIW a followup was immediately added to x86 to
encapsulate the enter/exit logic in common helpers:

commit bc908e091b3264672889162733020048901021fb
Author: Sean Christopherson 
Date:   Tue May 4 17:27:35 2021 -0700

 KVM: x86: Consolidate guest enter/exit logic to common helpers

This makes the code nicer. Maybe do something similar for POWER ?


I don't like to modify kernel code when it's not needed. I just want to fix a bug; if 
someone wants to make this nicer, I leave that to them...


Thanks,
Laurent



[PATCH] KVM: PPC: Defer vtime accounting 'til after IRQ handling

2021-10-06 Thread Laurent Vivier
Commit 61bd0f66ff92 moved guest_enter() out of the interrupt-protected
area to be able to have updated tick counters, but commit 112665286d08
moved it back into this area to avoid a wrong-context warning (or worse).

Neither of them is fully correct. To fix the problem, port to POWER the
x86 fix 160457140187 ("KVM: x86: Defer vtime accounting 'til after IRQ
handling"):

"Defer the call to account guest time until after servicing any IRQ(s)
 that happened in the guest or immediately after VM-Exit.  Tick-based
 accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
 handler runs, and IRQs are blocked throughout the main sequence of
 vcpu_enter_guest(), including the call into vendor code to actually
 enter and exit the guest."

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2009312
Fixes: 61bd0f66ff92 ("KVM: PPC: Book3S HV: Fix guest time accounting with VIRT_CPU_ACCOUNTING_GEN")
Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
Cc: npig...@gmail.com

Signed-off-by: Laurent Vivier 
---
 arch/powerpc/kvm/book3s_hv.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2acb1c96cfaf..43e1ce853785 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3695,6 +3695,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 
 	srcu_read_unlock(&vc->kvm->srcu, srcu_idx);
 
+   context_tracking_guest_exit();
+
set_irq_happened(trap);
 
 	spin_lock(&vc->lock);
@@ -3726,9 +3728,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 
kvmppc_set_host_core(pcpu);
 
-   guest_exit_irqoff();
-
local_irq_enable();
+   vtime_account_guest_exit();
 
/* Let secondaries go back to the offline loop */
for (i = 0; i < controlled_threads; ++i) {
@@ -4506,13 +4507,14 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
 
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 
+   context_tracking_guest_exit();
+
set_irq_happened(trap);
 
kvmppc_set_host_core(pcpu);
 
-   guest_exit_irqoff();
-
local_irq_enable();
+   vtime_account_guest_exit();
 
 	cpumask_clear_cpu(pcpu, &kvm->arch.cpu_in_guest);
 
-- 
2.31.1



Re: [PATCH v2] powerpc/mm: Fix set_memory_*() against concurrent accesses

2021-08-18 Thread Laurent Vivier
On 18/08/2021 14:05, Michael Ellerman wrote:
> Laurent reported that STRICT_MODULE_RWX was causing intermittent crashes
> on one of his systems:
> 
>   kernel tried to execute exec-protected page (c00804073278) - exploit attempt? (uid: 0)
>   BUG: Unable to handle kernel instruction fetch
>   Faulting instruction address: 0xc00804073278
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>   Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks ...
>   CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12
>   Workqueue: events control_work_handler [virtio_console]
>   NIP:  c00804073278 LR: c00804073278 CTR: c01e9de0
>   REGS: c0002e4ef7e0 TRAP: 0400   Not tainted  (5.14.0-rc4+)
>   MSR:  80004280b033   CR: 24002822 XER: 200400cf
>   ...
>   NIP fill_queue+0xf0/0x210 [virtio_console]
>   LR  fill_queue+0xf0/0x210 [virtio_console]
>   Call Trace:
> fill_queue+0xb4/0x210 [virtio_console] (unreliable)
> add_port+0x1a8/0x470 [virtio_console]
> control_work_handler+0xbc/0x1e8 [virtio_console]
> process_one_work+0x290/0x590
> worker_thread+0x88/0x620
> kthread+0x194/0x1a0
> ret_from_kernel_thread+0x5c/0x64
> 
> Jordan, Fabiano & Murilo were able to reproduce and identify that the
> problem is caused by the call to module_enable_ro() in do_init_module(),
> which happens after the module's init function has already been called.
> 
> Our current implementation of change_page_attr() is not safe against
> concurrent accesses, because it invalidates the PTE before flushing the
> TLB and then installing the new PTE. That leaves a window in time where
> there is no valid PTE for the page, if another CPU tries to access the
> page at that time we see something like the fault above.
> 
> We can't simply switch to set_pte_at()/flush TLB, because our hash MMU
> code doesn't handle a set_pte_at() of a valid PTE. See [1].
> 
> But we do have pte_update(), which replaces the old PTE with the new,
> meaning there's no window where the PTE is invalid. And the hash MMU
> version hash__pte_update() deals with synchronising the hash page table
> correctly.
> 
> [1]: https://lore.kernel.org/linuxppc-dev/87y318wp9r@linux.ibm.com/
> 
> Fixes: 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines")
> Reported-by: Laurent Vivier 
> Signed-off-by: Fabiano Rosas 
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/mm/pageattr.c | 23 ++-
>  1 file changed, 10 insertions(+), 13 deletions(-)
> 
> v2: Use pte_update(..., ~0, pte_val(pte), ...) as suggested by Fabiano,
> and ptep_get() as suggested by Christophe.
> 
> diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
> index 0876216ceee6..edea388e9d3f 100644
> --- a/arch/powerpc/mm/pageattr.c
> +++ b/arch/powerpc/mm/pageattr.c
> @@ -18,16 +18,12 @@
>  /*
>   * Updates the attributes of a page in three steps:
>   *
> - * 1. invalidate the page table entry
> - * 2. flush the TLB
> - * 3. install the new entry with the updated attributes
> - *
> - * Invalidating the pte means there are situations where this will not work
> - * when in theory it should.
> - * For example:
> - * - removing write from page whilst it is being executed
> - * - setting a page read-only whilst it is being read by another CPU
> + * 1. take the page_table_lock
> + * 2. install the new entry with the updated attributes
> + * 3. flush the TLB
>   *
> + * This sequence is safe against concurrent updates, and also allows updating
> + * the attributes of a page currently being executed or accessed.
>   */
>  static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
>  {
> @@ -36,9 +32,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
>  
>   spin_lock(&init_mm.page_table_lock);
>  
> - /* invalidate the PTE so it's safe to modify */
> - pte = ptep_get_and_clear(&init_mm, addr, ptep);
> - flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> + pte = ptep_get(ptep);
>  
>   /* modify the PTE bits as desired, then apply */
>   switch (action) {
> @@ -59,11 +53,14 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
>   break;
>   }
>  
> - set_pte_at(&init_mm, addr, ptep, pte);
> + pte_update(&init_mm, addr, ptep, ~0UL, pte_val(pte), 0);
>  
>   /* See ptesync comment in radix__set_pte_at() */
>   if (radix_enabled())
>   asm volatile("ptesync": : :"memory");
> +
> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +
>   spin_unlock(&init_mm.page_table_lock);
>  
>   return 0;
> 
> base-commit: cbc06f051c524dcfe52ef0d1f30647828e226d30
> 

Tested-by: Laurent Vivier 



Re: [PATCH v2] powerpc/xive: Do not skip CPU-less nodes when creating the IPIs

2021-08-10 Thread Laurent Vivier
On 07/08/2021 09:20, Cédric Le Goater wrote:
> On PowerVM, CPU-less nodes can be populated with hot-plugged CPUs at
> runtime. Today, the IPI is not created for such nodes, and hot-plugged
> CPUs use a bogus IPI, which leads to soft lockups.
> 
> We can not directly allocate and request the IPI on demand because
> bringup_up() is called under the IRQ sparse lock. The alternative is
> to allocate the IPIs for all possible nodes at startup and to request
> the mapping on demand when the first CPU of a node is brought up.
> 
> Fixes: 7dcc37b3eff9 ("powerpc/xive: Map one IPI interrupt per node")
> Cc: sta...@vger.kernel.org # v5.13
> Reported-by: Geetika Moolchandani 
> Cc: Srikar Dronamraju 
> Cc: Laurent Vivier 
> Signed-off-by: Cédric Le Goater 
> Message-Id: <20210629131542.743888-1-...@kaod.org>
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/sysdev/xive/common.c | 35 +--
>  1 file changed, 24 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index dbdbbc2f1dc5..943fd30095af 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -67,6 +67,7 @@ static struct irq_domain *xive_irq_domain;
>  static struct xive_ipi_desc {
>   unsigned int irq;
>   char name[16];
> + atomic_t started;
>  } *xive_ipis;
>  
>  /*
> @@ -1120,7 +1121,7 @@ static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
>   .alloc  = xive_ipi_irq_domain_alloc,
>  };
>  
> -static int __init xive_request_ipi(void)
> +static int __init xive_init_ipis(void)
>  {
>   struct fwnode_handle *fwnode;
>   struct irq_domain *ipi_domain;
> @@ -1144,10 +1145,6 @@ static int __init xive_request_ipi(void)
>   struct xive_ipi_desc *xid = &xive_ipis[node];
>   struct xive_ipi_alloc_info info = { node };
>  
> - /* Skip nodes without CPUs */
> - if (cpumask_empty(cpumask_of_node(node)))
> - continue;
> -
>   /*
>* Map one IPI interrupt per node for all cpus of that node.
>* Since the HW interrupt number doesn't have any meaning,
> @@ -1159,11 +1156,6 @@ static int __init xive_request_ipi(void)
>   xid->irq = ret;
>  
>   snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);
> -
> - ret = request_irq(xid->irq, xive_muxed_ipi_action,
> -   IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL);
> -
> - WARN(ret < 0, "Failed to request IPI %d: %d\n", xid->irq, ret);
>   }
>  
>   return ret;
> @@ -1178,6 +1170,22 @@ static int __init xive_request_ipi(void)
>   return ret;
>  }
>  
> +static int __init xive_request_ipi(unsigned int cpu)
> +{
> + struct xive_ipi_desc *xid = &xive_ipis[early_cpu_to_node(cpu)];
> + int ret;
> +
> + if (atomic_inc_return(&xid->started) > 1)
> + return 0;
> +
> + ret = request_irq(xid->irq, xive_muxed_ipi_action,
> +   IRQF_PERCPU | IRQF_NO_THREAD,
> +   xid->name, NULL);
> +
> + WARN(ret < 0, "Failed to request IPI %d: %d\n", xid->irq, ret);
> + return ret;
> +}
> +
>  static int xive_setup_cpu_ipi(unsigned int cpu)
>  {
>   unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
> @@ -1192,6 +1200,9 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
>   if (xc->hw_ipi != XIVE_BAD_IRQ)
>   return 0;
>  
> + /* Register the IPI */
> + xive_request_ipi(cpu);
> +
>   /* Grab an IPI from the backend, this will populate xc->hw_ipi */
>   if (xive_ops->get_ipi(cpu, xc))
>   return -EIO;
> @@ -1231,6 +1242,8 @@ static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc)
>   if (xc->hw_ipi == XIVE_BAD_IRQ)
>   return;
>  
> + /* TODO: clear IPI mapping */
> +
>   /* Mask the IPI */
>   xive_do_source_set_mask(&xc->ipi_data, true);
>  
> @@ -1253,7 +1266,7 @@ void __init xive_smp_probe(void)
>   smp_ops->cause_ipi = xive_cause_ipi;
>  
>   /* Register the IPI */
> - xive_request_ipi();
> + xive_init_ipis();
>  
>   /* Allocate and setup IPI for the boot CPU */
>   xive_setup_cpu_ipi(smp_processor_id());
> 

Tested-by: Laurent Vivier 



Re: [PATCH v15 7/9] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX

2021-08-05 Thread Laurent Vivier
Hi,

On 09/06/2021 03:34, Jordan Niethe wrote:
> From: Russell Currey 
> 
> To enable strict module RWX on powerpc, set:
> 
> CONFIG_STRICT_MODULE_RWX=y
> 
> You should also have CONFIG_STRICT_KERNEL_RWX=y set to have any real
> security benefit.
> 
> ARCH_HAS_STRICT_MODULE_RWX is set to require ARCH_HAS_STRICT_KERNEL_RWX.
> This is due to a quirk in arch/Kconfig and arch/powerpc/Kconfig that
> makes STRICT_MODULE_RWX *on by default* in configurations where
> STRICT_KERNEL_RWX is *unavailable*.
> 
> Since this doesn't make much sense, and module RWX without kernel RWX
> doesn't make much sense, having the same dependencies as kernel RWX
> works around this problem.
> 
> Book3s/32 603 and 604 core processors are not able to write protect
> kernel pages so do not set ARCH_HAS_STRICT_MODULE_RWX for Book3s/32.
> 
> Reviewed-by: Christophe Leroy 
> Signed-off-by: Russell Currey 
> [jpn: - predicate on !PPC_BOOK3S_604
>   - make module_alloc() use PAGE_KERNEL protection]
> Signed-off-by: Jordan Niethe 
> ---
> v10: - Predicate on !PPC_BOOK3S_604
>  - Make module_alloc() use PAGE_KERNEL protection
> v11: - Neaten up
> v13: Use strict_kernel_rwx_enabled()
> v14: Make changes to module_alloc() its own commit
> v15: - Force STRICT_KERNEL_RWX if STRICT_MODULE_RWX is selected
>  - Predicate on !PPC_BOOK3S_32 instead
> ---
>  arch/powerpc/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index abfe2e9225fa..72f307f1796b 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -142,6 +142,7 @@ config PPC
>   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
> && PPC_BOOK3S_64
>   select ARCH_HAS_SET_MEMORY
>   select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
> !HIBERNATION)
> + select ARCH_HAS_STRICT_MODULE_RWX   if ARCH_HAS_STRICT_KERNEL_RWX 
> && !PPC_BOOK3S_32
>   select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
>   select ARCH_HAS_UACCESS_FLUSHCACHE
>   select ARCH_HAS_UBSAN_SANITIZE_ALL
> @@ -267,6 +268,7 @@ config PPC
>   select PPC_DAWR if PPC64
>   select RTC_LIB
>   select SPARSE_IRQ
> + select STRICT_KERNEL_RWX if STRICT_MODULE_RWX
>   select SYSCTL_EXCEPTION_TRACE
>   select THREAD_INFO_IN_TASK
>   select VIRT_TO_BUS  if !PPC64
> 

Since this patch was merged, my VM has been experiencing a crash at boot (20% of
the time):

[8.496850] kernel tried to execute exec-protected page (c00804073278) - 
exploit
attempt? (uid: 0)
[8.496921] BUG: Unable to handle kernel instruction fetch
[8.496954] Faulting instruction address: 0xc00804073278
[8.496994] Oops: Kernel access of bad area, sig: 11 [#1]
[8.497028] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[8.497071] Modules linked in: drm virtio_console fuse 
drm_panel_orientation_quirks xfs
libcrc32c virtio_net net_failover virtio_blk vmx_crypto failover dm_mirror 
dm_region_hash
dm_log dm_mod
[8.497186] CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12
[8.497228] Workqueue: events control_work_handler [virtio_console]
[8.497272] NIP:  c00804073278 LR: c00804073278 CTR: c01e9de0
[8.497320] REGS: c0002e4ef7e0 TRAP: 0400   Not tainted  (5.14.0-rc4+)
[8.497361] MSR:  80004280b033   CR: 
24002822
XER: 200400cf
[8.497426] CFAR: c01e9e44 IRQMASK: 1
[8.497426] GPR00: c00804073278 c0002e4efa80 c2a26b00 
c00042c39520
[8.497426] GPR04: 0001   
00ff
[8.497426] GPR08: 0001 c00042c39520 0001 
c00804076008
[8.497426] GPR12: c01e9de0 c001fffccb00 c018ba88 
c0002c91d400
[8.497426] GPR16:    

[8.497426] GPR20:    
c00804080340
[8.497426] GPR24: c008040a01e8   
c0002e0975c0
[8.497426] GPR28: c0002ce72940 c00042c39520 0048 
0038
[8.497891] NIP [c00804073278] fill_queue+0xf0/0x210 [virtio_console]
[8.497934] LR [c00804073278] fill_queue+0xf0/0x210 [virtio_console]
[8.497976] Call Trace:
[8.497993] [c0002e4efa80] [c0080407323c] fill_queue+0xb4/0x210
[virtio_console] (unreliable)
[8.498052] [c0002e4efae0] [c00804073a90] add_port+0x1a8/0x470 
[virtio_console]
[8.498102] [c0002e4efbb0] [c008040750f4] 
control_work_handler+0xbc/0x1e8
[virtio_console]
[8.498160] [c0002e4efc60] [c017f4f0] 
process_one_work+0x290/0x590
[8.498212] [c0002e4efd00] [c017f878] worker_thread+0x88/0x620
[8.498256] [c0002e4efda0] [c018bc14] kthread+0x194/0x1a0
[8.498299] [c0002e4efe10] 

Re: [PATCH] powerpc/xive: Do not skip CPU-less nodes when creating the IPIs

2021-07-19 Thread Laurent Vivier
On 29/06/2021 15:15, Cédric Le Goater wrote:
> On PowerVM, CPU-less nodes can be populated with hot-plugged CPUs at
> runtime. Today, the IPI is not created for such nodes, and hot-plugged
> CPUs use a bogus IPI, which leads to soft lockups.
> 
> We could create the node IPI on demand but it is a bit complex because
> this code would be called under bringup_up() and some IRQ locking is
> being done. The simplest solution is to create the IPIs for all nodes
> at startup.
> 
> Fixes: 7dcc37b3eff9 ("powerpc/xive: Map one IPI interrupt per node")
> Cc: sta...@vger.kernel.org # v5.13
> Reported-by: Geetika Moolchandani 
> Cc: Srikar Dronamraju 
> Signed-off-by: Cédric Le Goater 
> ---
> 
> This patch breaks old versions of irqbalance (<= v1.4). Possible nodes
> are collected from /sys/devices/system/node/ but CPU-less nodes are
> not listed there. When interrupts are scanned, the link representing
> the node structure is NULL and segfault occurs.
> 
> Version 1.7 seems immune. 
> 
> ---
>  arch/powerpc/sysdev/xive/common.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index f3b16ed48b05..5d2c58dba57e 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1143,10 +1143,6 @@ static int __init xive_request_ipi(void)
>   struct xive_ipi_desc *xid = &xive_ipis[node];
>   struct xive_ipi_alloc_info info = { node };
>  
> - /* Skip nodes without CPUs */
> - if (cpumask_empty(cpumask_of_node(node)))
> - continue;
> -
>   /*
>* Map one IPI interrupt per node for all cpus of that node.
>* Since the HW interrupt number doesn't have any meaning,
> 

What happened to this fix? Will it be merged?

Thanks,
Laurent



Re: [PATCH] powerpc/xive: Do not skip CPU-less nodes when creating the IPIs

2021-07-07 Thread Laurent Vivier
On 29/06/2021 15:15, Cédric Le Goater wrote:
> On PowerVM, CPU-less nodes can be populated with hot-plugged CPUs at
> runtime. Today, the IPI is not created for such nodes, and hot-plugged
> CPUs use a bogus IPI, which leads to soft lockups.
>
> We could create the node IPI on demand but it is a bit complex because
> this code would be called under bringup_up() and some IRQ locking is
> being done. The simplest solution is to create the IPIs for all nodes
> at startup.
>
> Fixes: 7dcc37b3eff9 ("powerpc/xive: Map one IPI interrupt per node")
> Cc: sta...@vger.kernel.org # v5.13
> Reported-by: Geetika Moolchandani 
> Cc: Srikar Dronamraju 
> Signed-off-by: Cédric Le Goater 
> ---
>
> This patch breaks old versions of irqbalance (<= v1.4). Possible nodes
> are collected from /sys/devices/system/node/ but CPU-less nodes are
> not listed there. When interrupts are scanned, the link representing
> the node structure is NULL and segfault occurs.
>
> Version 1.7 seems immune. 
>
> ---
>  arch/powerpc/sysdev/xive/common.c | 4 
>  1 file changed, 4 deletions(-)
>
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index f3b16ed48b05..5d2c58dba57e 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1143,10 +1143,6 @@ static int __init xive_request_ipi(void)
>   struct xive_ipi_desc *xid = &xive_ipis[node];
>   struct xive_ipi_alloc_info info = { node };
>  
> - /* Skip nodes without CPUs */
> - if (cpumask_empty(cpumask_of_node(node)))
> - continue;
> -
>   /*
>* Map one IPI interrupt per node for all cpus of that node.
>* Since the HW interrupt number doesn't have any meaning,

Tested-by: Laurent Vivier 



Re: [PATCH] powerpc/pseries: Don't enforce MSI affinity with kdump

2021-02-12 Thread Laurent Vivier
On 12/02/2021 17:41, Greg Kurz wrote:
> Depending on the number of online CPUs in the original kernel, it is
> likely for CPU #0 to be offline in a kdump kernel. The associated IRQs
> in the affinity mappings provided by irq_create_affinity_masks() are
> thus not started by irq_startup(), as per design for managed IRQs.
> 
> This can be a problem with multi-queue block devices driven by blk-mq :
> such a non-started IRQ is very likely paired with the single queue
> enforced by blk-mq during kdump (see blk_mq_alloc_tag_set()). This
> causes the device to remain silent and likely hangs the guest at
> some point.
> 
> This is a regression caused by commit 9ea69a55b3b9 ("powerpc/pseries:
> Pass MSI affinity to irq_create_mapping()"). Note that this only happens
> with the XIVE interrupt controller because XICS has a workaround to bypass
> affinity, which is activated during kdump with the "noirqdistrib" kernel
> parameter.
> 
> The issue comes from a combination of factors:
> - discrepancy between the number of queues detected by the multi-queue
>   block driver, that was used to create the MSI vectors, and the single
>   queue mode enforced later on by blk-mq because of kdump (i.e. keeping
>   all queues fixes the issue)
> - CPU#0 offline (i.e. kdump always succeed with CPU#0)
> 
> Given that I couldn't reproduce on x86, which seems to always have CPU#0
> online even during kdump, I'm not sure where this should be fixed. Hence
> going for another approach : fine-grained affinity is for performance
> and we don't really care about that during kdump. Simply revert to the
> previous working behavior of ignoring affinity masks in this case only.
> 
> Fixes: 9ea69a55b3b9 ("powerpc/pseries: Pass MSI affinity to 
> irq_create_mapping()")
> Cc: lviv...@redhat.com
> Cc: sta...@vger.kernel.org
> Signed-off-by: Greg Kurz 
> ---
>  arch/powerpc/platforms/pseries/msi.c | 24 ++--
>  1 file changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/msi.c 
> b/arch/powerpc/platforms/pseries/msi.c
> index b3ac2455faad..29d04b83288d 100644
> --- a/arch/powerpc/platforms/pseries/msi.c
> +++ b/arch/powerpc/platforms/pseries/msi.c
> @@ -458,8 +458,28 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
> nvec_in, int type)
>   return hwirq;
>   }
>  
> - virq = irq_create_mapping_affinity(NULL, hwirq,
> -entry->affinity);
> + /*
> +  * Depending on the number of online CPUs in the original
> +  * kernel, it is likely for CPU #0 to be offline in a kdump
> +  * kernel. The associated IRQs in the affinity mappings
> +  * provided by irq_create_affinity_masks() are thus not
> +  * started by irq_startup(), as per-design for managed IRQs.
> +  * This can be a problem with multi-queue block devices driven
> +  * by blk-mq : such a non-started IRQ is very likely paired
> +  * with the single queue enforced by blk-mq during kdump (see
> +  * blk_mq_alloc_tag_set()). This causes the device to remain
> +  * silent and likely hangs the guest at some point.
> +  *
> +  * We don't really care for fine-grained affinity when doing
> +  * kdump actually : simply ignore the pre-computed affinity
> +  * masks in this case and let the default mask with all CPUs
> +  * be used when creating the IRQ mappings.
> +  */
> + if (is_kdump_kernel())
> + virq = irq_create_mapping(NULL, hwirq);
> + else
> + virq = irq_create_mapping_affinity(NULL, hwirq,
> +entry->affinity);
>  
>   if (!virq) {
>   pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
> 

Reviewed-by: Laurent Vivier 



[PATCH v4 2/2] powerpc/pseries: Pass MSI affinity to irq_create_mapping()

2020-11-26 Thread Laurent Vivier
With virtio multiqueue, normally each queue IRQ is mapped to a CPU.

Commit 0d9f0a52c8b9f ("virtio_scsi: use virtio IRQ affinity") exposed
an existing shortcoming of the arch code by moving virtio_scsi to
the automatic IRQ affinity assignment.

The affinity is correctly computed in msi_desc but this is not applied
to the system IRQs.

It appears the affinity is correctly passed to rtas_setup_msi_irqs() but
lost at this point and never passed to irq_domain_alloc_descs()
(see commit 06ee6d571f0e ("genirq: Add affinity hint to irq allocation"))
because irq_create_mapping() doesn't take an affinity parameter.

As the previous patch has added the affinity parameter to
irq_create_mapping() we can forward the affinity from rtas_setup_msi_irqs()
to irq_domain_alloc_descs().

With this change, the virtqueues are correctly dispatched between the CPUs
on pseries.

Signed-off-by: Laurent Vivier 
Reviewed-by: Greg Kurz 
Acked-by: Michael Ellerman 
---
 arch/powerpc/platforms/pseries/msi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 133f6adcb39c..b3ac2455faad 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -458,7 +458,8 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type)
return hwirq;
}
 
-   virq = irq_create_mapping(NULL, hwirq);
+   virq = irq_create_mapping_affinity(NULL, hwirq,
+  entry->affinity);
 
if (!virq) {
pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
-- 
2.28.0



[PATCH v4 1/2] genirq/irqdomain: Add an irq_create_mapping_affinity() function

2020-11-26 Thread Laurent Vivier
There is currently no way to convey the affinity of an interrupt
via irq_create_mapping(), which creates issues for devices that
expect that affinity to be managed by the kernel.

In order to sort this out, rename irq_create_mapping() to
irq_create_mapping_affinity() with an additional affinity parameter
that can conveniently be passed down to irq_domain_alloc_descs().

irq_create_mapping() is then re-implemented as a wrapper around
irq_create_mapping_affinity().

Signed-off-by: Laurent Vivier 
Reviewed-by: Greg Kurz 
---
 include/linux/irqdomain.h | 12 ++--
 kernel/irq/irqdomain.c| 13 -
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 71535e87109f..ea5a337e0f8b 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -384,11 +384,19 @@ extern void irq_domain_associate_many(struct irq_domain 
*domain,
 extern void irq_domain_disassociate(struct irq_domain *domain,
unsigned int irq);
 
-extern unsigned int irq_create_mapping(struct irq_domain *host,
-  irq_hw_number_t hwirq);
+extern unsigned int irq_create_mapping_affinity(struct irq_domain *host,
+ irq_hw_number_t hwirq,
+ const struct irq_affinity_desc *affinity);
 extern unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec);
 extern void irq_dispose_mapping(unsigned int virq);
 
+static inline unsigned int irq_create_mapping(struct irq_domain *host,
+ irq_hw_number_t hwirq)
+{
+   return irq_create_mapping_affinity(host, hwirq, NULL);
+}
+
+
 /**
  * irq_linear_revmap() - Find a linux irq from a hw irq number.
  * @domain: domain owning this hardware interrupt
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index cf8b374b892d..e4ca69608f3b 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -624,17 +624,19 @@ unsigned int irq_create_direct_mapping(struct irq_domain 
*domain)
 EXPORT_SYMBOL_GPL(irq_create_direct_mapping);
 
 /**
- * irq_create_mapping() - Map a hardware interrupt into linux irq space
+ * irq_create_mapping_affinity() - Map a hardware interrupt into linux irq 
space
  * @domain: domain owning this hardware interrupt or NULL for default domain
  * @hwirq: hardware irq number in that domain space
+ * @affinity: irq affinity
  *
  * Only one mapping per hardware interrupt is permitted. Returns a linux
  * irq number.
  * If the sense/trigger is to be specified, set_irq_type() should be called
  * on the number returned from that call.
  */
-unsigned int irq_create_mapping(struct irq_domain *domain,
-   irq_hw_number_t hwirq)
+unsigned int irq_create_mapping_affinity(struct irq_domain *domain,
+  irq_hw_number_t hwirq,
+  const struct irq_affinity_desc *affinity)
 {
struct device_node *of_node;
int virq;
@@ -660,7 +662,8 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
}
 
/* Allocate a virtual interrupt number */
-   virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), 
NULL);
+   virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node),
+ affinity);
if (virq <= 0) {
pr_debug("-> virq allocation failed\n");
return 0;
@@ -676,7 +679,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 
return virq;
 }
-EXPORT_SYMBOL_GPL(irq_create_mapping);
+EXPORT_SYMBOL_GPL(irq_create_mapping_affinity);
 
 /**
  * irq_create_strict_mappings() - Map a range of hw irqs to fixed linux irqs
-- 
2.28.0



[PATCH v4 0/2] powerpc/pseries: fix MSI/X IRQ affinity on pseries

2020-11-26 Thread Laurent Vivier
With virtio, in multiqueue case, each queue IRQ is normally
bound to a different CPU using the affinity mask.

This works fine on x86_64 but is totally ignored on pseries.

This is not obvious at first look because irqbalance is doing
some balancing to improve that.

It appears that the "managed" flag set in the MSI entry
is never copied to the system IRQ entry.

This series passes the affinity mask from rtas_setup_msi_irqs()
to irq_domain_alloc_descs() by adding an affinity parameter to
irq_create_mapping().

The first patch adds the parameter (no functional change), the
second patch passes the actual affinity mask to irq_create_mapping()
in rtas_setup_msi_irqs().

For instance, with a 32-CPU VM and a 32-queue virtio-scsi interface:

... -smp 32 -device virtio-scsi-pci,id=virtio_scsi_pci0,num_queues=32

for IRQ in $(grep virtio2-request /proc/interrupts |cut -d: -f1); do
for file in /proc/irq/$IRQ/ ; do
echo -n "IRQ: $(basename $file) CPU: " ; cat $file/smp_affinity_list
done
done

Without the patch (and without irqbalanced)

IRQ: 268 CPU: 0-31
IRQ: 269 CPU: 0-31
IRQ: 270 CPU: 0-31
IRQ: 271 CPU: 0-31
IRQ: 272 CPU: 0-31
IRQ: 273 CPU: 0-31
IRQ: 274 CPU: 0-31
IRQ: 275 CPU: 0-31
IRQ: 276 CPU: 0-31
IRQ: 277 CPU: 0-31
IRQ: 278 CPU: 0-31
IRQ: 279 CPU: 0-31
IRQ: 280 CPU: 0-31
IRQ: 281 CPU: 0-31
IRQ: 282 CPU: 0-31
IRQ: 283 CPU: 0-31
IRQ: 284 CPU: 0-31
IRQ: 285 CPU: 0-31
IRQ: 286 CPU: 0-31
IRQ: 287 CPU: 0-31
IRQ: 288 CPU: 0-31
IRQ: 289 CPU: 0-31
IRQ: 290 CPU: 0-31
IRQ: 291 CPU: 0-31
IRQ: 292 CPU: 0-31
IRQ: 293 CPU: 0-31
IRQ: 294 CPU: 0-31
IRQ: 295 CPU: 0-31
IRQ: 296 CPU: 0-31
IRQ: 297 CPU: 0-31
IRQ: 298 CPU: 0-31
IRQ: 299 CPU: 0-31

With the patch:

IRQ: 265 CPU: 0
IRQ: 266 CPU: 1
IRQ: 267 CPU: 2
IRQ: 268 CPU: 3
IRQ: 269 CPU: 4
IRQ: 270 CPU: 5
IRQ: 271 CPU: 6
IRQ: 272 CPU: 7
IRQ: 273 CPU: 8
IRQ: 274 CPU: 9
IRQ: 275 CPU: 10
IRQ: 276 CPU: 11
IRQ: 277 CPU: 12
IRQ: 278 CPU: 13
IRQ: 279 CPU: 14
IRQ: 280 CPU: 15
IRQ: 281 CPU: 16
IRQ: 282 CPU: 17
IRQ: 283 CPU: 18
IRQ: 284 CPU: 19
IRQ: 285 CPU: 20
IRQ: 286 CPU: 21
IRQ: 287 CPU: 22
IRQ: 288 CPU: 23
IRQ: 289 CPU: 24
IRQ: 290 CPU: 25
IRQ: 291 CPU: 26
IRQ: 292 CPU: 27
IRQ: 293 CPU: 28
IRQ: 294 CPU: 29
IRQ: 295 CPU: 30
IRQ: 299 CPU: 31

This matches what we have on an x86_64 system.

v4: update changelog of PATCH 2, add Michael's Acked-by
v3: update changelog of PATCH 1 with comments from Thomas Gleixner and
Marc Zyngier.
v2: add a wrapper around original irq_create_mapping() with the
affinity parameter. Update comments

Laurent Vivier (2):
  genirq/irqdomain: Add an irq_create_mapping_affinity() function
  powerpc/pseries: Pass MSI affinity to irq_create_mapping()

 arch/powerpc/platforms/pseries/msi.c |  3 ++-
 include/linux/irqdomain.h| 12 ++--
 kernel/irq/irqdomain.c   | 13 -
 3 files changed, 20 insertions(+), 8 deletions(-)

-- 
2.28.0




Re: [PATCH v3 2/2] powerpc/pseries: pass MSI affinity to irq_create_mapping()

2020-11-25 Thread Laurent Vivier
On 25/11/2020 17:05, Denis Kirjanov wrote:
> On 11/25/20, Laurent Vivier  wrote:
>> With virtio multiqueue, normally each queue IRQ is mapped to a CPU.
>>
>> But since commit 0d9f0a52c8b9f ("virtio_scsi: use virtio IRQ affinity")
>> this is broken on pseries.
> 
> Please add "Fixes" tag.

In fact, the code in commit 0d9f0a52c8b9f is correct.

The problem is with MSI/X IRQ affinity on pseries, so this patch fixes more than
virtio_scsi. I mentioned that commit because it makes the problem easy to
demonstrate. Perhaps I should remove this line, in fact?

Thanks,
Laurent

> 
> Thanks!
> 
>>
>> The affinity is correctly computed in msi_desc but this is not applied
>> to the system IRQs.
>>
>> It appears the affinity is correctly passed to rtas_setup_msi_irqs() but
>> lost at this point and never passed to irq_domain_alloc_descs()
>> (see commit 06ee6d571f0e ("genirq: Add affinity hint to irq allocation"))
>> because irq_create_mapping() doesn't take an affinity parameter.
>>
>> As the previous patch has added the affinity parameter to
>> irq_create_mapping() we can forward the affinity from rtas_setup_msi_irqs()
>> to irq_domain_alloc_descs().
>>
>> With this change, the virtqueues are correctly dispatched between the CPUs
>> on pseries.
>>
>> BugId: https://bugzilla.redhat.com/show_bug.cgi?id=1702939
>> Signed-off-by: Laurent Vivier 
>> Reviewed-by: Greg Kurz 
>> ---
>>  arch/powerpc/platforms/pseries/msi.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/msi.c
>> b/arch/powerpc/platforms/pseries/msi.c
>> index 133f6adcb39c..b3ac2455faad 100644
>> --- a/arch/powerpc/platforms/pseries/msi.c
>> +++ b/arch/powerpc/platforms/pseries/msi.c
>> @@ -458,7 +458,8 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int
>> nvec_in, int type)
>>  return hwirq;
>>  }
>>
>> -virq = irq_create_mapping(NULL, hwirq);
>> +virq = irq_create_mapping_affinity(NULL, hwirq,
>> +   entry->affinity);
>>
>>  if (!virq) {
>>  pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
>> --
>> 2.28.0
>>
>>
> 



[PATCH v3 1/2] genirq/irqdomain: Add an irq_create_mapping_affinity() function

2020-11-25 Thread Laurent Vivier
There is currently no way to convey the affinity of an interrupt
via irq_create_mapping(), which creates issues for devices that
expect that affinity to be managed by the kernel.

In order to sort this out, rename irq_create_mapping() to
irq_create_mapping_affinity() with an additional affinity parameter
that can conveniently be passed down to irq_domain_alloc_descs().

irq_create_mapping() is then re-implemented as a wrapper around
irq_create_mapping_affinity().

Signed-off-by: Laurent Vivier 
Reviewed-by: Greg Kurz 
---
 include/linux/irqdomain.h | 12 ++--
 kernel/irq/irqdomain.c| 13 -
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 71535e87109f..ea5a337e0f8b 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -384,11 +384,19 @@ extern void irq_domain_associate_many(struct irq_domain 
*domain,
 extern void irq_domain_disassociate(struct irq_domain *domain,
unsigned int irq);
 
-extern unsigned int irq_create_mapping(struct irq_domain *host,
-  irq_hw_number_t hwirq);
+extern unsigned int irq_create_mapping_affinity(struct irq_domain *host,
+ irq_hw_number_t hwirq,
+ const struct irq_affinity_desc *affinity);
 extern unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec);
 extern void irq_dispose_mapping(unsigned int virq);
 
+static inline unsigned int irq_create_mapping(struct irq_domain *host,
+ irq_hw_number_t hwirq)
+{
+   return irq_create_mapping_affinity(host, hwirq, NULL);
+}
+
+
 /**
  * irq_linear_revmap() - Find a linux irq from a hw irq number.
  * @domain: domain owning this hardware interrupt
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index cf8b374b892d..e4ca69608f3b 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -624,17 +624,19 @@ unsigned int irq_create_direct_mapping(struct irq_domain 
*domain)
 EXPORT_SYMBOL_GPL(irq_create_direct_mapping);
 
 /**
- * irq_create_mapping() - Map a hardware interrupt into linux irq space
+ * irq_create_mapping_affinity() - Map a hardware interrupt into linux irq 
space
  * @domain: domain owning this hardware interrupt or NULL for default domain
  * @hwirq: hardware irq number in that domain space
+ * @affinity: irq affinity
  *
  * Only one mapping per hardware interrupt is permitted. Returns a linux
  * irq number.
  * If the sense/trigger is to be specified, set_irq_type() should be called
  * on the number returned from that call.
  */
-unsigned int irq_create_mapping(struct irq_domain *domain,
-   irq_hw_number_t hwirq)
+unsigned int irq_create_mapping_affinity(struct irq_domain *domain,
+  irq_hw_number_t hwirq,
+  const struct irq_affinity_desc *affinity)
 {
struct device_node *of_node;
int virq;
@@ -660,7 +662,8 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
}
 
/* Allocate a virtual interrupt number */
-   virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), 
NULL);
+   virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node),
+ affinity);
if (virq <= 0) {
pr_debug("-> virq allocation failed\n");
return 0;
@@ -676,7 +679,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 
return virq;
 }
-EXPORT_SYMBOL_GPL(irq_create_mapping);
+EXPORT_SYMBOL_GPL(irq_create_mapping_affinity);
 
 /**
  * irq_create_strict_mappings() - Map a range of hw irqs to fixed linux irqs
-- 
2.28.0



[PATCH v3 2/2] powerpc/pseries: pass MSI affinity to irq_create_mapping()

2020-11-25 Thread Laurent Vivier
With virtio multiqueue, normally each queue IRQ is mapped to a CPU.

But since commit 0d9f0a52c8b9f ("virtio_scsi: use virtio IRQ affinity")
this is broken on pseries.

The affinity is correctly computed in msi_desc but this is not applied
to the system IRQs.

It appears the affinity is correctly passed to rtas_setup_msi_irqs() but
lost at this point and never passed to irq_domain_alloc_descs()
(see commit 06ee6d571f0e ("genirq: Add affinity hint to irq allocation"))
because irq_create_mapping() doesn't take an affinity parameter.

As the previous patch has added the affinity parameter to
irq_create_mapping() we can forward the affinity from rtas_setup_msi_irqs()
to irq_domain_alloc_descs().

With this change, the virtqueues are correctly dispatched between the CPUs
on pseries.

BugId: https://bugzilla.redhat.com/show_bug.cgi?id=1702939
Signed-off-by: Laurent Vivier 
Reviewed-by: Greg Kurz 
---
 arch/powerpc/platforms/pseries/msi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 133f6adcb39c..b3ac2455faad 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -458,7 +458,8 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type)
return hwirq;
}
 
-   virq = irq_create_mapping(NULL, hwirq);
+   virq = irq_create_mapping_affinity(NULL, hwirq,
+  entry->affinity);
 
if (!virq) {
pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
-- 
2.28.0



[PATCH v3 0/2] powerpc/pseries: fix MSI/X IRQ affinity on pseries

2020-11-25 Thread Laurent Vivier
With virtio, in multiqueue case, each queue IRQ is normally
bound to a different CPU using the affinity mask.

This works fine on x86_64 but is totally ignored on pseries.

This is not obvious at first look because irqbalance is doing
some balancing to improve that.

It appears that the "managed" flag set in the MSI entry
is never copied to the system IRQ entry.

This series passes the affinity mask from rtas_setup_msi_irqs()
to irq_domain_alloc_descs() by adding an affinity parameter to
irq_create_mapping().

The first patch adds the parameter (no functional change), the
second patch passes the actual affinity mask to irq_create_mapping()
in rtas_setup_msi_irqs().

For instance, with a 32-CPU VM and a 32-queue virtio-scsi interface:

... -smp 32 -device virtio-scsi-pci,id=virtio_scsi_pci0,num_queues=32

for IRQ in $(grep virtio2-request /proc/interrupts |cut -d: -f1); do
for file in /proc/irq/$IRQ/ ; do
echo -n "IRQ: $(basename $file) CPU: " ; cat $file/smp_affinity_list
done
done

Without the patch (and without irqbalanced)

IRQ: 268 CPU: 0-31
IRQ: 269 CPU: 0-31
IRQ: 270 CPU: 0-31
IRQ: 271 CPU: 0-31
IRQ: 272 CPU: 0-31
IRQ: 273 CPU: 0-31
IRQ: 274 CPU: 0-31
IRQ: 275 CPU: 0-31
IRQ: 276 CPU: 0-31
IRQ: 277 CPU: 0-31
IRQ: 278 CPU: 0-31
IRQ: 279 CPU: 0-31
IRQ: 280 CPU: 0-31
IRQ: 281 CPU: 0-31
IRQ: 282 CPU: 0-31
IRQ: 283 CPU: 0-31
IRQ: 284 CPU: 0-31
IRQ: 285 CPU: 0-31
IRQ: 286 CPU: 0-31
IRQ: 287 CPU: 0-31
IRQ: 288 CPU: 0-31
IRQ: 289 CPU: 0-31
IRQ: 290 CPU: 0-31
IRQ: 291 CPU: 0-31
IRQ: 292 CPU: 0-31
IRQ: 293 CPU: 0-31
IRQ: 294 CPU: 0-31
IRQ: 295 CPU: 0-31
IRQ: 296 CPU: 0-31
IRQ: 297 CPU: 0-31
IRQ: 298 CPU: 0-31
IRQ: 299 CPU: 0-31

With the patch:

IRQ: 265 CPU: 0
IRQ: 266 CPU: 1
IRQ: 267 CPU: 2
IRQ: 268 CPU: 3
IRQ: 269 CPU: 4
IRQ: 270 CPU: 5
IRQ: 271 CPU: 6
IRQ: 272 CPU: 7
IRQ: 273 CPU: 8
IRQ: 274 CPU: 9
IRQ: 275 CPU: 10
IRQ: 276 CPU: 11
IRQ: 277 CPU: 12
IRQ: 278 CPU: 13
IRQ: 279 CPU: 14
IRQ: 280 CPU: 15
IRQ: 281 CPU: 16
IRQ: 282 CPU: 17
IRQ: 283 CPU: 18
IRQ: 284 CPU: 19
IRQ: 285 CPU: 20
IRQ: 286 CPU: 21
IRQ: 287 CPU: 22
IRQ: 288 CPU: 23
IRQ: 289 CPU: 24
IRQ: 290 CPU: 25
IRQ: 291 CPU: 26
IRQ: 292 CPU: 27
IRQ: 293 CPU: 28
IRQ: 294 CPU: 29
IRQ: 295 CPU: 30
IRQ: 299 CPU: 31

This matches what we have on an x86_64 system.

v3: update changelog of PATCH 1 with comments from Thomas Gleixner and
Marc Zyngier.
v2: add a wrapper around original irq_create_mapping() with the
affinity parameter. Update comments

Laurent Vivier (2):
  genirq/irqdomain: Add an irq_create_mapping_affinity() function
  powerpc/pseries: pass MSI affinity to irq_create_mapping()

 arch/powerpc/platforms/pseries/msi.c |  3 ++-
 include/linux/irqdomain.h| 12 ++--
 kernel/irq/irqdomain.c   | 13 -
 3 files changed, 20 insertions(+), 8 deletions(-)

-- 
2.28.0




Re: [PATCH v2 1/2] genirq: add an irq_create_mapping_affinity() function

2020-11-25 Thread Laurent Vivier
On 25/11/2020 15:54, Marc Zyngier wrote:
> On 2020-11-25 14:09, Laurent Vivier wrote:
>> On 25/11/2020 14:20, Thomas Gleixner wrote:
>>> Laurent,
>>>
>>> On Wed, Nov 25 2020 at 12:16, Laurent Vivier wrote:
>>>
>>> The proper subsystem prefix is: 'genirq/irqdomain:' and the first letter
>>> after the colon wants to be uppercase.
>>
>> Ok.
>>
>>>> This function adds an affinity parameter to irq_create_mapping().
>>>> This parameter is needed to pass it to irq_domain_alloc_descs().
>>>
>>> A changelog has to explain the WHY. 'The parameter is needed' is not
>>> really useful information.
>>>
>>
>> The reason of this change is explained in PATCH 2.
>>
>> I have two patches, one to change the interface with no functional
>> change (PATCH 1) and
>> one to fix the problem (PATCH 2). Moreover they don't cover the same 
>> subsystems.
>>
>> I can either:
>> - merge the two patches
>> - or make a reference in the changelog of PATCH 1 to PATCH 2
>>   (something like "(see folowing patch "powerpc/pseries: pass MSI affinity to
>>    irq_create_mapping()")")
>> - or copy some information from PATCH 2
>>   (something like "this parameter is needed by rtas_setup_msi_irqs()
>> to pass the affinity
>>    to irq_domain_alloc_descs() to fix multiqueue affinity")
>>
>> What do you prefer?
> 
> How about something like this for the first patch:
> 
> "There is currently no way to convey the affinity of an interrupt
>  via irq_create_mapping(), which creates issues for devices that
>  expect that affinity to be managed by the kernel.
> 
>  In order to sort this out, rename irq_create_mapping() to
>  irq_create_mapping_affinity() with an additional affinity parameter
> that can conveniently be passed down to irq_domain_alloc_descs().
> 
>  irq_create_mapping() is then re-implemented as a wrapper around
>  irq_create_mapping_affinity()."

It looks perfect. I update the changelog with that.

Thanks,
Laurent



Re: [PATCH v2 1/2] genirq: add an irq_create_mapping_affinity() function

2020-11-25 Thread Laurent Vivier
On 25/11/2020 14:20, Thomas Gleixner wrote:
> Laurent,
> 
> On Wed, Nov 25 2020 at 12:16, Laurent Vivier wrote:
> 
> The proper subsystem prefix is: 'genirq/irqdomain:' and the first letter
> after the colon wants to be uppercase.

Ok.

>> This function adds an affinity parameter to irq_create_mapping().
>> This parameter is needed to pass it to irq_domain_alloc_descs().
> 
> A changelog has to explain the WHY. 'The parameter is needed' is not
> really useful information.
> 

The reason of this change is explained in PATCH 2.

I have two patches, one to change the interface with no functional change 
(PATCH 1) and
one to fix the problem (PATCH 2). Moreover they don't cover the same subsystems.

I can either:
- merge the two patches
- or make a reference in the changelog of PATCH 1 to PATCH 2
  (something like "(see following patch "powerpc/pseries: pass MSI affinity to
   irq_create_mapping()")")
- or copy some information from PATCH 2
  (something like "this parameter is needed by rtas_setup_msi_irqs() to pass 
the affinity
   to irq_domain_alloc_descs() to fix multiqueue affinity")

What do you prefer?

Thanks,
Laurent



[PATCH v2 2/2] powerpc/pseries: pass MSI affinity to irq_create_mapping()

2020-11-25 Thread Laurent Vivier
With virtio multiqueue, normally each queue IRQ is mapped to a CPU.

But since commit 0d9f0a52c8b9f ("virtio_scsi: use virtio IRQ affinity")
this is broken on pseries.

The affinity is correctly computed in msi_desc but this is not applied
to the system IRQs.

It appears the affinity is correctly passed to rtas_setup_msi_irqs() but
lost at this point and never passed to irq_domain_alloc_descs()
(see commit 06ee6d571f0e ("genirq: Add affinity hint to irq allocation"))
because irq_create_mapping() doesn't take an affinity parameter.

As the previous patch has added the affinity parameter to
irq_create_mapping() we can forward the affinity from rtas_setup_msi_irqs()
to irq_domain_alloc_descs().

With this change, the virtqueues are correctly dispatched between the CPUs
on pseries.

Signed-off-by: Laurent Vivier 
---
 arch/powerpc/platforms/pseries/msi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index 133f6adcb39c..b3ac2455faad 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -458,7 +458,8 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
return hwirq;
}
 
-   virq = irq_create_mapping(NULL, hwirq);
+   virq = irq_create_mapping_affinity(NULL, hwirq,
+  entry->affinity);
 
if (!virq) {
pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
-- 
2.28.0



[PATCH v2 1/2] genirq: add an irq_create_mapping_affinity() function

2020-11-25 Thread Laurent Vivier
This function adds an affinity parameter to irq_create_mapping().
This parameter is needed to pass it to irq_domain_alloc_descs().

irq_create_mapping() is a wrapper around irq_create_mapping_affinity()
to pass NULL for the affinity parameter.

No functional change.

Signed-off-by: Laurent Vivier 
---
 include/linux/irqdomain.h | 12 ++--
 kernel/irq/irqdomain.c| 13 -
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 71535e87109f..ea5a337e0f8b 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -384,11 +384,19 @@ extern void irq_domain_associate_many(struct irq_domain *domain,
 extern void irq_domain_disassociate(struct irq_domain *domain,
unsigned int irq);
 
-extern unsigned int irq_create_mapping(struct irq_domain *host,
-  irq_hw_number_t hwirq);
+extern unsigned int irq_create_mapping_affinity(struct irq_domain *host,
+ irq_hw_number_t hwirq,
+ const struct irq_affinity_desc *affinity);
 extern unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec);
 extern void irq_dispose_mapping(unsigned int virq);
 
+static inline unsigned int irq_create_mapping(struct irq_domain *host,
+ irq_hw_number_t hwirq)
+{
+   return irq_create_mapping_affinity(host, hwirq, NULL);
+}
+
+
 /**
  * irq_linear_revmap() - Find a linux irq from a hw irq number.
  * @domain: domain owning this hardware interrupt
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index cf8b374b892d..e4ca69608f3b 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -624,17 +624,19 @@ unsigned int irq_create_direct_mapping(struct irq_domain *domain)
 EXPORT_SYMBOL_GPL(irq_create_direct_mapping);
 
 /**
- * irq_create_mapping() - Map a hardware interrupt into linux irq space
+ * irq_create_mapping_affinity() - Map a hardware interrupt into linux irq space
  * @domain: domain owning this hardware interrupt or NULL for default domain
  * @hwirq: hardware irq number in that domain space
+ * @affinity: irq affinity
  *
  * Only one mapping per hardware interrupt is permitted. Returns a linux
  * irq number.
  * If the sense/trigger is to be specified, set_irq_type() should be called
  * on the number returned from that call.
  */
-unsigned int irq_create_mapping(struct irq_domain *domain,
-   irq_hw_number_t hwirq)
+unsigned int irq_create_mapping_affinity(struct irq_domain *domain,
+  irq_hw_number_t hwirq,
+  const struct irq_affinity_desc *affinity)
 {
struct device_node *of_node;
int virq;
@@ -660,7 +662,8 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
}
 
/* Allocate a virtual interrupt number */
-   virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);
+   virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node),
+ affinity);
if (virq <= 0) {
pr_debug("-> virq allocation failed\n");
return 0;
@@ -676,7 +679,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 
return virq;
 }
-EXPORT_SYMBOL_GPL(irq_create_mapping);
+EXPORT_SYMBOL_GPL(irq_create_mapping_affinity);
 
 /**
  * irq_create_strict_mappings() - Map a range of hw irqs to fixed linux irqs
-- 
2.28.0



[PATCH v2 0/2] powerpc/pseries: fix MSI/X IRQ affinity on pseries

2020-11-25 Thread Laurent Vivier
With virtio, in multiqueue case, each queue IRQ is normally
bound to a different CPU using the affinity mask.

This works fine on x86_64 but is totally ignored on pseries.

This is not obvious at first glance because irqbalance does some
balancing that masks the problem.

It appears that the "managed" flag set in the MSI entry
is never copied to the system IRQ entry.

This series passes the affinity mask from rtas_setup_msi_irqs()
to irq_domain_alloc_descs() by adding an affinity parameter to
irq_create_mapping().

The first patch adds the parameter (no functional change), the
second patch passes the actual affinity mask to irq_create_mapping()
in rtas_setup_msi_irqs().

For instance, with a 32-CPU VM and a 32-queue virtio-scsi interface:

... -smp 32 -device virtio-scsi-pci,id=virtio_scsi_pci0,num_queues=32

for IRQ in $(grep virtio2-request /proc/interrupts |cut -d: -f1); do
for file in /proc/irq/$IRQ/ ; do
echo -n "IRQ: $(basename $file) CPU: " ; cat $file/smp_affinity_list
done
done

Without the patch (and without irqbalanced)

IRQ: 268 CPU: 0-31
IRQ: 269 CPU: 0-31
IRQ: 270 CPU: 0-31
IRQ: 271 CPU: 0-31
IRQ: 272 CPU: 0-31
IRQ: 273 CPU: 0-31
IRQ: 274 CPU: 0-31
IRQ: 275 CPU: 0-31
IRQ: 276 CPU: 0-31
IRQ: 277 CPU: 0-31
IRQ: 278 CPU: 0-31
IRQ: 279 CPU: 0-31
IRQ: 280 CPU: 0-31
IRQ: 281 CPU: 0-31
IRQ: 282 CPU: 0-31
IRQ: 283 CPU: 0-31
IRQ: 284 CPU: 0-31
IRQ: 285 CPU: 0-31
IRQ: 286 CPU: 0-31
IRQ: 287 CPU: 0-31
IRQ: 288 CPU: 0-31
IRQ: 289 CPU: 0-31
IRQ: 290 CPU: 0-31
IRQ: 291 CPU: 0-31
IRQ: 292 CPU: 0-31
IRQ: 293 CPU: 0-31
IRQ: 294 CPU: 0-31
IRQ: 295 CPU: 0-31
IRQ: 296 CPU: 0-31
IRQ: 297 CPU: 0-31
IRQ: 298 CPU: 0-31
IRQ: 299 CPU: 0-31

With the patch:

IRQ: 265 CPU: 0
IRQ: 266 CPU: 1
IRQ: 267 CPU: 2
IRQ: 268 CPU: 3
IRQ: 269 CPU: 4
IRQ: 270 CPU: 5
IRQ: 271 CPU: 6
IRQ: 272 CPU: 7
IRQ: 273 CPU: 8
IRQ: 274 CPU: 9
IRQ: 275 CPU: 10
IRQ: 276 CPU: 11
IRQ: 277 CPU: 12
IRQ: 278 CPU: 13
IRQ: 279 CPU: 14
IRQ: 280 CPU: 15
IRQ: 281 CPU: 16
IRQ: 282 CPU: 17
IRQ: 283 CPU: 18
IRQ: 284 CPU: 19
IRQ: 285 CPU: 20
IRQ: 286 CPU: 21
IRQ: 287 CPU: 22
IRQ: 288 CPU: 23
IRQ: 289 CPU: 24
IRQ: 290 CPU: 25
IRQ: 291 CPU: 26
IRQ: 292 CPU: 27
IRQ: 293 CPU: 28
IRQ: 294 CPU: 29
IRQ: 295 CPU: 30
IRQ: 299 CPU: 31

This matches what we have on an x86_64 system.

v2: add a wrapper around original irq_create_mapping() with the
affinity parameter. Update comments

Laurent Vivier (2):
  genirq: add an irq_create_mapping_affinity() function
  powerpc/pseries: pass MSI affinity to irq_create_mapping()

 arch/powerpc/platforms/pseries/msi.c |  3 ++-
 include/linux/irqdomain.h| 12 ++--
 kernel/irq/irqdomain.c   | 13 -
 3 files changed, 20 insertions(+), 8 deletions(-)

-- 
2.28.0




Re: [PATCH 1/2] genirq: add an affinity parameter to irq_create_mapping()

2020-11-24 Thread Laurent Vivier
On 24/11/2020 23:19, Thomas Gleixner wrote:
> On Tue, Nov 24 2020 at 21:03, Laurent Vivier wrote:
>> This parameter is needed to pass it to irq_domain_alloc_descs().
>>
>> This seems to have been missed by
>> commit 06ee6d571f0e ("genirq: Add affinity hint to irq allocation")
> 
> No, this has not been missed at all. There was and is no reason to do
> this.
> 
>> This is needed to implement proper support for multiqueue with
>> pseries.
> 
> And because pseries needs this _all_ callers need to be changed?
> 
>>  123 files changed, 171 insertions(+), 146 deletions(-)
> 
> Lots of churn for nothing. 99% of the callers will never need that.
> 
> What's wrong with simply adding an interface which takes that parameter,
> make the existing one an inline wrapper and leave the rest alone?

Nothing. I'm going to do it like that.

Thank you for your comment.

Laurent



[PATCH 2/2] powerpc/pseries: pass MSI affinity to irq_create_mapping()

2020-11-24 Thread Laurent Vivier
With virtio multiqueue, normally each queue IRQ is mapped to a CPU.

But since commit 0d9f0a52c8b9f ("virtio_scsi: use virtio IRQ affinity")
this is broken on pseries.

The affinity is correctly computed in msi_desc but this is not applied
to the system IRQs.

It appears the affinity is correctly passed to rtas_setup_msi_irqs() but
lost at this point and never passed to irq_domain_alloc_descs()
(see commit 06ee6d571f0e ("genirq: Add affinity hint to irq allocation"))
because irq_create_mapping() doesn't take an affinity parameter.

As the previous patch has added the affinity parameter to
irq_create_mapping() we can forward the affinity from rtas_setup_msi_irqs()
to irq_domain_alloc_descs().

With this change, the virtqueues are correctly dispatched between the CPUs
on pseries.

This problem cannot be shown on x86_64 for two reasons:

- the call path traverses arch_setup_msi_irqs(), which is arch-specific:

   virtscsi_probe()
     virtscsi_init()
       vp_modern_find_vqs()
         vp_find_vqs()
           vp_find_vqs_msix()
             pci_alloc_irq_vectors_affinity()
               __pci_enable_msix_range()
                 pci_msi_setup_msi_irqs()
                   arch_setup_msi_irqs()
                     rtas_setup_msi_irqs()
                       irq_create_mapping()
                         irq_domain_alloc_descs()
                           __irq_alloc_descs()

- and x86_64 has CONFIG_PCI_MSI_IRQ_DOMAIN, which uses another path:

   virtscsi_probe()
     virtscsi_init()
       vp_modern_find_vqs()
         vp_find_vqs()
           vp_find_vqs_msix()
             pci_alloc_irq_vectors_affinity()
               __pci_enable_msix_range()
                 __msi_domain_alloc_irqs()
                   __irq_domain_alloc_irqs()
                     __irq_alloc_descs()

Signed-off-by: Laurent Vivier 
---
 arch/powerpc/platforms/pseries/msi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index 42ba08eaea91..58197f92c6a2 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -458,7 +458,7 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
return hwirq;
}
 
-   virq = irq_create_mapping(NULL, hwirq, NULL);
+   virq = irq_create_mapping(NULL, hwirq, entry->affinity);
 
if (!virq) {
pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
-- 
2.28.0



[PATCH 0/2] powerpc/pseries: fix MSI/X IRQ affinity on pseries

2020-11-24 Thread Laurent Vivier
With virtio, in multiqueue case, each queue IRQ is normally
bound to a different CPU using the affinity mask.

This works fine on x86_64 but is totally ignored on pseries.

This is not obvious at first glance because irqbalance does some
balancing that masks the problem.

It appears that the "managed" flag set in the MSI entry
is never copied to the system IRQ entry.

This series passes the affinity mask from rtas_setup_msi_irqs()
to irq_domain_alloc_descs() by adding an affinity parameter to
irq_create_mapping().

The first patch adds the parameter (no functional change), the
second patch passes the actual affinity mask to irq_create_mapping()
in rtas_setup_msi_irqs().

For instance, with a 32-CPU VM and a 32-queue virtio-scsi interface:

... -smp 32 -device virtio-scsi-pci,id=virtio_scsi_pci0,num_queues=32

for IRQ in $(grep virtio2-request /proc/interrupts |cut -d: -f1); do
for file in /proc/irq/$IRQ/ ; do
echo -n "IRQ: $(basename $file) CPU: " ; cat $file/smp_affinity_list
done
done

Without the patch (and without irqbalanced)

IRQ: 268 CPU: 0-31
IRQ: 269 CPU: 0-31
IRQ: 270 CPU: 0-31
IRQ: 271 CPU: 0-31
IRQ: 272 CPU: 0-31
IRQ: 273 CPU: 0-31
IRQ: 274 CPU: 0-31
IRQ: 275 CPU: 0-31
IRQ: 276 CPU: 0-31
IRQ: 277 CPU: 0-31
IRQ: 278 CPU: 0-31
IRQ: 279 CPU: 0-31
IRQ: 280 CPU: 0-31
IRQ: 281 CPU: 0-31
IRQ: 282 CPU: 0-31
IRQ: 283 CPU: 0-31
IRQ: 284 CPU: 0-31
IRQ: 285 CPU: 0-31
IRQ: 286 CPU: 0-31
IRQ: 287 CPU: 0-31
IRQ: 288 CPU: 0-31
IRQ: 289 CPU: 0-31
IRQ: 290 CPU: 0-31
IRQ: 291 CPU: 0-31
IRQ: 292 CPU: 0-31
IRQ: 293 CPU: 0-31
IRQ: 294 CPU: 0-31
IRQ: 295 CPU: 0-31
IRQ: 296 CPU: 0-31
IRQ: 297 CPU: 0-31
IRQ: 298 CPU: 0-31
IRQ: 299 CPU: 0-31

With the patch:

IRQ: 265 CPU: 0
IRQ: 266 CPU: 1
IRQ: 267 CPU: 2
IRQ: 268 CPU: 3
IRQ: 269 CPU: 4
IRQ: 270 CPU: 5
IRQ: 271 CPU: 6
IRQ: 272 CPU: 7
IRQ: 273 CPU: 8
IRQ: 274 CPU: 9
IRQ: 275 CPU: 10
IRQ: 276 CPU: 11
IRQ: 277 CPU: 12
IRQ: 278 CPU: 13
IRQ: 279 CPU: 14
IRQ: 280 CPU: 15
IRQ: 281 CPU: 16
IRQ: 282 CPU: 17
IRQ: 283 CPU: 18
IRQ: 284 CPU: 19
IRQ: 285 CPU: 20
IRQ: 286 CPU: 21
IRQ: 287 CPU: 22
IRQ: 288 CPU: 23
IRQ: 289 CPU: 24
IRQ: 290 CPU: 25
IRQ: 291 CPU: 26
IRQ: 292 CPU: 27
IRQ: 293 CPU: 28
IRQ: 294 CPU: 29
IRQ: 295 CPU: 30
IRQ: 299 CPU: 31

This matches what we have on an x86_64 system.

Laurent Vivier (2):
  genirq: add an affinity parameter to irq_create_mapping()
  powerpc/pseries: pass MSI affinity to irq_create_mapping()

 arch/arc/kernel/intc-arcv2.c  | 4 ++--
 arch/arc/kernel/mcip.c| 2 +-
 arch/arm/common/sa.c  | 2 +-
 arch/arm/mach-s3c/irq-s3c24xx.c   | 3 ++-
 arch/arm/plat-orion/gpio.c| 2 +-
 arch/mips/ath25/ar2315.c  | 4 ++--
 arch/mips/ath25/ar5312.c  | 4 ++--
 arch/mips/lantiq/irq.c| 2 +-
 arch/mips/pci/pci-ar2315.c| 3 ++-
 arch/mips/pic32/pic32mzda/time.c  | 2 +-
 arch/mips/ralink/irq.c| 2 +-
 arch/powerpc/kernel/pci-common.c  | 2 +-
 arch/powerpc/kvm/book3s_xive.c| 2 +-
 arch/powerpc/platforms/44x/ppc476.c   | 4 ++--
 arch/powerpc/platforms/cell/interrupt.c   | 4 ++--
 arch/powerpc/platforms/cell/iommu.c   | 3 ++-
 arch/powerpc/platforms/cell/pmu.c | 2 +-
 arch/powerpc/platforms/cell/spider-pic.c  | 2 +-
 arch/powerpc/platforms/cell/spu_manage.c  | 6 +++---
 arch/powerpc/platforms/maple/pci.c| 2 +-
 arch/powerpc/platforms/pasemi/dma_lib.c   | 5 +++--
 arch/powerpc/platforms/pasemi/msi.c   | 2 +-
 arch/powerpc/platforms/pasemi/setup.c | 4 ++--
 arch/powerpc/platforms/powermac/pci.c | 2 +-
 arch/powerpc/platforms/powermac/pic.c | 2 +-
 arch/powerpc/platforms/powermac/smp.c | 2 +-
 arch/powerpc/platforms/powernv/opal-irqchip.c | 5 +++--
 arch/powerpc/platforms/powernv/pci.c  | 2 +-
 arch/powerpc/platforms/powernv/vas.c  | 2 +-
 arch/powerpc/platforms/ps3/interrupt.c| 2 +-
 arch/powerpc/platforms/pseries/ibmebus.c  | 2 +-
 arch/powerpc/platforms/pseries/msi.c  | 2 +-
 arch/powerpc/sysdev/fsl_mpic_err.c| 2 +-
 arch/powerpc/sysdev/fsl_msi.c | 2 +-
 arch/powerpc/sysdev/mpic.c| 3 ++-
 arch/powerpc/sysdev/mpic_u3msi.c  | 2 +-
 arch/powerpc/sysdev/xics/xics-common.c| 2 +-
 arch/powerpc/sysdev/xive/common.c | 2 +-
 arch/sh/boards/mach-se/7343/irq.c | 2 +-
 arch/sh/boards/mach-se/7722/irq.c | 2 +-
 arch/sh/boards/mach-x3proto/gpio.c| 2 +-
 arch/xtensa/kernel/perf_event.c   | 2 +-
 arch/xtensa/kernel/smp.c  | 2 +-
 arch/xtensa/kernel/time.c | 2 +-
 drivers/ata/pata_macio.c  | 2 +-
 drivers/base/regmap/regmap-irq.c  | 2 +-
 drivers/bus/moxtet.c  | 2 +-
 drivers

[PATCH 1/2] genirq: add an affinity parameter to irq_create_mapping()

2020-11-24 Thread Laurent Vivier
This parameter is needed to pass it to irq_domain_alloc_descs().

This seems to have been missed by
commit 06ee6d571f0e ("genirq: Add affinity hint to irq allocation")

This is needed to implement proper support for multiqueue with pseries.

All irq_create_mapping() callers have been updated with the help
of the following coccinelle script:
@@
expression a, b;
@@
<...
- irq_create_mapping(a, b)
+ irq_create_mapping(a, b, NULL)
...>

With some manual changes to comply with checkpatch errors.

No functional change.

Signed-off-by: Laurent Vivier 
---
 arch/arc/kernel/intc-arcv2.c  | 4 ++--
 arch/arc/kernel/mcip.c| 2 +-
 arch/arm/common/sa.c  | 2 +-
 arch/arm/mach-s3c/irq-s3c24xx.c   | 3 ++-
 arch/arm/plat-orion/gpio.c| 2 +-
 arch/mips/ath25/ar2315.c  | 4 ++--
 arch/mips/ath25/ar5312.c  | 4 ++--
 arch/mips/lantiq/irq.c| 2 +-
 arch/mips/pci/pci-ar2315.c| 3 ++-
 arch/mips/pic32/pic32mzda/time.c  | 2 +-
 arch/mips/ralink/irq.c| 2 +-
 arch/powerpc/kernel/pci-common.c  | 2 +-
 arch/powerpc/kvm/book3s_xive.c| 2 +-
 arch/powerpc/platforms/44x/ppc476.c   | 4 ++--
 arch/powerpc/platforms/cell/interrupt.c   | 4 ++--
 arch/powerpc/platforms/cell/iommu.c   | 3 ++-
 arch/powerpc/platforms/cell/pmu.c | 2 +-
 arch/powerpc/platforms/cell/spider-pic.c  | 2 +-
 arch/powerpc/platforms/cell/spu_manage.c  | 6 +++---
 arch/powerpc/platforms/maple/pci.c| 2 +-
 arch/powerpc/platforms/pasemi/dma_lib.c   | 5 +++--
 arch/powerpc/platforms/pasemi/msi.c   | 2 +-
 arch/powerpc/platforms/pasemi/setup.c | 4 ++--
 arch/powerpc/platforms/powermac/pci.c | 2 +-
 arch/powerpc/platforms/powermac/pic.c | 2 +-
 arch/powerpc/platforms/powermac/smp.c | 2 +-
 arch/powerpc/platforms/powernv/opal-irqchip.c | 5 +++--
 arch/powerpc/platforms/powernv/pci.c  | 2 +-
 arch/powerpc/platforms/powernv/vas.c  | 2 +-
 arch/powerpc/platforms/ps3/interrupt.c| 2 +-
 arch/powerpc/platforms/pseries/ibmebus.c  | 2 +-
 arch/powerpc/platforms/pseries/msi.c  | 2 +-
 arch/powerpc/sysdev/fsl_mpic_err.c| 2 +-
 arch/powerpc/sysdev/fsl_msi.c | 2 +-
 arch/powerpc/sysdev/mpic.c| 3 ++-
 arch/powerpc/sysdev/mpic_u3msi.c  | 2 +-
 arch/powerpc/sysdev/xics/xics-common.c| 2 +-
 arch/powerpc/sysdev/xive/common.c | 2 +-
 arch/sh/boards/mach-se/7343/irq.c | 2 +-
 arch/sh/boards/mach-se/7722/irq.c | 2 +-
 arch/sh/boards/mach-x3proto/gpio.c| 2 +-
 arch/xtensa/kernel/perf_event.c   | 2 +-
 arch/xtensa/kernel/smp.c  | 2 +-
 arch/xtensa/kernel/time.c | 2 +-
 drivers/ata/pata_macio.c  | 2 +-
 drivers/base/regmap/regmap-irq.c  | 2 +-
 drivers/bus/moxtet.c  | 2 +-
 drivers/clocksource/ingenic-timer.c   | 2 +-
 drivers/clocksource/timer-riscv.c | 2 +-
 drivers/extcon/extcon-max8997.c   | 3 ++-
 drivers/gpio/gpio-bcm-kona.c  | 2 +-
 drivers/gpio/gpio-brcmstb.c   | 2 +-
 drivers/gpio/gpio-davinci.c   | 2 +-
 drivers/gpio/gpio-em.c| 3 ++-
 drivers/gpio/gpio-grgpio.c| 2 +-
 drivers/gpio/gpio-mockup.c| 2 +-
 drivers/gpio/gpio-mpc8xxx.c   | 2 +-
 drivers/gpio/gpio-mvebu.c | 2 +-
 drivers/gpio/gpio-tb10x.c | 2 +-
 drivers/gpio/gpio-tegra.c | 2 +-
 drivers/gpio/gpio-wm831x.c| 2 +-
 drivers/gpio/gpiolib.c| 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c   | 3 ++-
 drivers/gpu/ipu-v3/ipu-common.c   | 2 +-
 drivers/hid/hid-rmi.c | 2 +-
 drivers/i2c/busses/i2c-cht-wc.c   | 2 +-
 drivers/i2c/i2c-core-base.c   | 2 +-
 drivers/i2c/muxes/i2c-mux-pca954x.c   | 2 +-
 drivers/ide/pmac.c| 2 +-
 drivers/iio/dummy/iio_dummy_evgen.c   | 3 ++-
 drivers/input/rmi4/rmi_bus.c  | 2 +-
 drivers/irqchip/irq-ath79-misc.c  | 3 ++-
 drivers/irqchip/irq-bcm2835.c | 3 ++-
 drivers/irqchip/irq-csky-mpintc.c | 2 +-
 drivers/irqchip/irq-eznps.c   | 2 +-
 drivers/irqchip/irq-mips-gic.c| 8 +---
 drivers/irqchip/irq-mmp.c | 4 ++--
 drivers/irqchip/irq-versatile-fpga.c  | 2 +-
 drivers/irqchip/irq-vic.c | 2 +-
 drivers/macintosh/macio_asic.c| 2 +-
 drivers/memory/omap-gpmc.c| 2 +-
 drivers/m

Re: [PATCH] serial: pmac_zilog: don't init if zilog is not available

2020-10-22 Thread Laurent Vivier
On 22/10/2020 at 05:23, Finn Thain wrote:
> On Wed, 21 Oct 2020, Laurent Vivier wrote:
> 
>> On 21/10/2020 at 01:43, Finn Thain wrote:
>>
>>> Laurent, can we avoid the irq == 0 warning splat like this?
>>>
>>> diff --git a/drivers/tty/serial/pmac_zilog.c b/drivers/tty/serial/pmac_zilog.c
>>> index 96e7aa479961..7db600cd8cc7 100644
>>> --- a/drivers/tty/serial/pmac_zilog.c
>>> +++ b/drivers/tty/serial/pmac_zilog.c
>>> @@ -1701,8 +1701,10 @@ static int __init pmz_init_port(struct uart_pmac_port *uap)
>>> int irq;
>>>  
>>> r_ports = platform_get_resource(uap->pdev, IORESOURCE_MEM, 0);
>>> +   if (!r_ports)
>>> +   return -ENODEV;
>>> irq = platform_get_irq(uap->pdev, 0);
>>> -   if (!r_ports || irq <= 0)
>>> +   if (irq <= 0)
>>> return -ENODEV;
>>>  
>>> uap->port.mapbase  = r_ports->start;
>>>
>>
>> No, this doesn't fix the problem.
>>
> 
> Then I had better stop guessing and start up Aranym...
> 
> The patch below seems to fix the problem for me. Does it work on your 
> system(s)?

It works like a charm.

Thank you,
Laurent


Re: [PATCH] serial: pmac_zilog: don't init if zilog is not available

2020-10-21 Thread Laurent Vivier
On 21/10/2020 at 01:43, Finn Thain wrote:
> On Tue, 20 Oct 2020, Brad Boyer wrote:
> 
>>
>> Wouldn't it be better to rearrange this code to only run if the devices 
>> are present? This is a macio driver on pmac and a platform driver on 
>> mac, so shouldn't it be possible to only run this code when the 
>> appropriate entries are present in the right data structures?
>>
>> I didn't look at a lot of the other serial drivers, but some other mac 
>> drivers have recently been updated to no longer have MACH_IS_MAC checks 
>> due to being converted to platform drivers.
>>
> 
> Actually, it's not simply a platform driver or macio driver. I think the 
> console is supposed to be registered before the normal bus matching takes 
> place. Hence this comment in pmac_zilog.c,
> 
> /* 
>  * First, we need to do a direct OF-based probe pass. We
>  * do that because we want serial console up before the
>  * macio stuffs calls us back, and since that makes it
>  * easier to pass the proper number of channels to
>  * uart_register_driver()
>  */
> 
> Laurent, can we avoid the irq == 0 warning splat like this?
> 
> diff --git a/drivers/tty/serial/pmac_zilog.c b/drivers/tty/serial/pmac_zilog.c
> index 96e7aa479961..7db600cd8cc7 100644
> --- a/drivers/tty/serial/pmac_zilog.c
> +++ b/drivers/tty/serial/pmac_zilog.c
> @@ -1701,8 +1701,10 @@ static int __init pmz_init_port(struct uart_pmac_port *uap)
>   int irq;
>  
>   r_ports = platform_get_resource(uap->pdev, IORESOURCE_MEM, 0);
> + if (!r_ports)
> + return -ENODEV;
>   irq = platform_get_irq(uap->pdev, 0);
> - if (!r_ports || irq <= 0)
> + if (irq <= 0)
>   return -ENODEV;
>  
>   uap->port.mapbase  = r_ports->start;
> 

No, this doesn't fix the problem.

The message is still:

[0.00] [ cut here ]
[0.00] WARNING: CPU: 0 PID: 0 at drivers/base/platform.c:224
platform_get_irq_optional+0x7a/0x80
[0.00] 0 is an invalid IRQ number
[0.00] Modules linked in:
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0+ #324
[0.00] Stack from 004e7f24:
   004e7f24 0046c1d3 0046c1d3 0002cb26 004985fb
00e0 0009 
   0002cb6a 004985fb 00e0 002a5b86 0009
 004e7f70 00553cc4
      004985df 004e7f90
004e7ff8 002a5b86 004985fb
   00e0 0009 004985df 004eb290 002a5bd2
004eb290  00553cc4
   0057bb66 00553cc4 00573d6e 004eb290 
00573d38 0021c42c 00573e06
   00553cc4 0046df15 00583a7c 00573e58 00564b74
0005299a 0055ce34 
[0.00] Call Trace: [<0002cb26>] __warn+0xb2/0xb4
[0.00]  [<0002cb6a>] warn_slowpath_fmt+0x42/0x64
[0.00]  [<002a5b86>] platform_get_irq_optional+0x7a/0x80
[0.00]  [<002a5b86>] platform_get_irq_optional+0x7a/0x80
[0.00]  [<002a5bd2>] platform_get_irq+0x16/0x42
[0.00]  [<00573d6e>] pmz_init_port+0x36/0x9e
[0.00]  [<00573d38>] pmz_init_port+0x0/0x9e
[0.00]  [<0021c42c>] strlen+0x0/0x14
[0.00]  [<00573e06>] pmz_probe+0x30/0x7e
[0.00]  [<00573e58>] pmz_console_init+0x4/0x22
[0.00]  [<00564b74>] console_init+0x1e/0x20
[0.00]  [<0005299a>] printk+0x0/0x18
[0.00]  [<0055ce34>] start_kernel+0x332/0x4c4
[0.00]  [<0055b8c6>] _sinittext+0x8c6/0x1268
[0.00] ---[ end trace 32d780b8cd50b829 ]---

Thanks,
Laurent


Re: [PATCH] serial: pmac_zilog: don't init if zilog is not available

2020-10-20 Thread Laurent Vivier
On 20/10/2020 at 20:32, Greg KH wrote:
> On Tue, Oct 20, 2020 at 08:19:26PM +0200, Laurent Vivier wrote:
>> On 20/10/2020 at 19:37, Greg KH wrote:
>>> On Tue, Oct 20, 2020 at 06:37:41PM +0200, Laurent Vivier wrote:
>>>> On 20/10/2020 at 18:28, Greg KH wrote:
>>>>> On Tue, Oct 20, 2020 at 06:23:03PM +0200, Laurent Vivier wrote:
>>>>>> We can avoid to probe for the Zilog device (and generate ugly kernel 
>>>>>> warning)
>>>>>> if kernel is built for Mac but not on a Mac.
>>>>>>
>>>>>> Signed-off-by: Laurent Vivier 
>>>>>> ---
>>>>>>  drivers/tty/serial/pmac_zilog.c | 11 +++
>>>>>>  1 file changed, 11 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/tty/serial/pmac_zilog.c b/drivers/tty/serial/pmac_zilog.c
>>>>>> index 063484b22523..d1d2e55983c3 100644
>>>>>> --- a/drivers/tty/serial/pmac_zilog.c
>>>>>> +++ b/drivers/tty/serial/pmac_zilog.c
>>>>>> @@ -1867,6 +1867,12 @@ static struct platform_driver pmz_driver = {
>>>>>>  static int __init init_pmz(void)
>>>>>>  {
>>>>>>  int rc, i;
>>>>>> +
>>>>>> +#ifdef CONFIG_MAC
>>>>>> +if (!MACH_IS_MAC)
>>>>>> +return -ENODEV;
>>>>>> +#endif
>>>>>
>>>>> Why is the #ifdef needed?
>>>>>
>>>>> We don't like putting #ifdef in .c files for good reasons.  Can you make
>>>>> the api check for this work with and without that #ifdef needed?
>>>>
>>>> The #ifdef is needed because this file can be compiled for PowerMac and
>>>> m68k Mac. For PowerMac, the MACH_IS_MAC is not defined, so we need the
>>>> #ifdef.
>>>>
>>>> We need the MACH_IS_MAC because the same kernel can be used with several
>>>> m68k machines, so the init_pmz can be called on a m68k machine without
>>>> the zilog device (it's a multi-targets kernel).
>>>>
>>>> You can check it's the good way to do by looking inside:
>>>>
>>>> drivers/video/fbdev/valkyriefb.c +317
>>>> drivers/macintosh/adb.c +316
>>>>
>>>> That are two files used by both, mac and pmac.
>>>
>>> Why not fix it to work properly like other arch checks are done
>> I would be happy to do the same.
>>
>>> Put it in a .h file and do the #ifdef there.  Why is this "special"?
>>
>> I don't know.
>>
>> Do you mean something like:
>>
>> drivers/tty/serial/pmac_zilog.h
>> ...
>> #ifndef MACH_IS_MAC
>> #define MACH_IS_MAC (0)
>> #endif
>> ...
>>
>> drivers/tty/serial/pmac_zilog.c
>> ...
>> static int __init pmz_console_init(void)
>> {
>> if (!MACH_IS_MAC)
>> return -ENODEV;
>> ...
> 
> Yup, that would be a good start, but why is the pmac_zilog.h file
> responsible for this?  Shouldn't this be in some arch-specific file
> somewhere?

For m68k, MACH_IS_MAC is defined in arch/m68k/include/asm/setup.h

If I want to define it for other archs, I don't know which file to put it in.

But as m68k Mac only shares drivers with pmac, perhaps we can put this in
arch/powerpc/include/asm/setup.h?

Thanks,
Laurent



Re: [PATCH] serial: pmac_zilog: don't init if zilog is not available

2020-10-20 Thread Laurent Vivier
On 20/10/2020 at 19:37, Greg KH wrote:
> On Tue, Oct 20, 2020 at 06:37:41PM +0200, Laurent Vivier wrote:
>> On 20/10/2020 at 18:28, Greg KH wrote:
>>> On Tue, Oct 20, 2020 at 06:23:03PM +0200, Laurent Vivier wrote:
>>>> We can avoid to probe for the Zilog device (and generate ugly kernel 
>>>> warning)
>>>> if kernel is built for Mac but not on a Mac.
>>>>
>>>> Signed-off-by: Laurent Vivier 
>>>> ---
>>>>  drivers/tty/serial/pmac_zilog.c | 11 +++
>>>>  1 file changed, 11 insertions(+)
>>>>
>>>> diff --git a/drivers/tty/serial/pmac_zilog.c b/drivers/tty/serial/pmac_zilog.c
>>>> index 063484b22523..d1d2e55983c3 100644
>>>> --- a/drivers/tty/serial/pmac_zilog.c
>>>> +++ b/drivers/tty/serial/pmac_zilog.c
>>>> @@ -1867,6 +1867,12 @@ static struct platform_driver pmz_driver = {
>>>>  static int __init init_pmz(void)
>>>>  {
>>>>int rc, i;
>>>> +
>>>> +#ifdef CONFIG_MAC
>>>> +  if (!MACH_IS_MAC)
>>>> +  return -ENODEV;
>>>> +#endif
>>>
>>> Why is the #ifdef needed?
>>>
>>> We don't like putting #ifdef in .c files for good reasons.  Can you make
>>> the api check for this work with and without that #ifdef needed?
>>
>> The #ifdef is needed because this file can be compiled for PowerMac and
>> m68k Mac. For PowerMac, the MACH_IS_MAC is not defined, so we need the
>> #ifdef.
>>
>> We need the MACH_IS_MAC because the same kernel can be used with several
>> m68k machines, so the init_pmz can be called on a m68k machine without
>> the zilog device (it's a multi-targets kernel).
>>
>> You can check it's the good way to do by looking inside:
>>
>> drivers/video/fbdev/valkyriefb.c +317
>> drivers/macintosh/adb.c +316
>>
>> That are two files used by both, mac and pmac.
> 
> Why not fix it to work properly like other arch checks are done
I would be happy to do the same.

> Put it in a .h file and do the #ifdef there.  Why is this "special"?

I don't know.

Do you mean something like:

drivers/tty/serial/pmac_zilog.h
...
#ifndef MACH_IS_MAC
#define MACH_IS_MAC (0)
#endif
...

drivers/tty/serial/pmac_zilog.c
...
static int __init pmz_console_init(void)
{
if (!MACH_IS_MAC)
return -ENODEV;
...

Thanks,
Laurent


Re: [PATCH] serial: pmac_zilog: don't init if zilog is not available

2020-10-20 Thread Laurent Vivier
Le 20/10/2020 à 18:28, Greg KH a écrit :
> On Tue, Oct 20, 2020 at 06:23:03PM +0200, Laurent Vivier wrote:
>> We can avoid to probe for the Zilog device (and generate ugly kernel warning)
>> if kernel is built for Mac but not on a Mac.
>>
>> Signed-off-by: Laurent Vivier 
>> ---
>>  drivers/tty/serial/pmac_zilog.c | 11 +++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/drivers/tty/serial/pmac_zilog.c b/drivers/tty/serial/pmac_zilog.c
>> index 063484b22523..d1d2e55983c3 100644
>> --- a/drivers/tty/serial/pmac_zilog.c
>> +++ b/drivers/tty/serial/pmac_zilog.c
>> @@ -1867,6 +1867,12 @@ static struct platform_driver pmz_driver = {
>>  static int __init init_pmz(void)
>>  {
>>  int rc, i;
>> +
>> +#ifdef CONFIG_MAC
>> +if (!MACH_IS_MAC)
>> +return -ENODEV;
>> +#endif
> 
> Why is the #ifdef needed?
> 
> We don't like putting #ifdef in .c files for good reasons.  Can you make
> the api check for this work with and without that #ifdef needed?

The #ifdef is needed because this file can be compiled for PowerMac and
m68k Mac. For PowerMac, the MACH_IS_MAC is not defined, so we need the
#ifdef.

We need the MACH_IS_MAC because the same kernel can be used with several
m68k machines, so the init_pmz can be called on a m68k machine without
the zilog device (it's a multi-targets kernel).

You can check that this is the right way to do it by looking at:

drivers/video/fbdev/valkyriefb.c +317
drivers/macintosh/adb.c +316

Those are two files used by both mac and pmac.

Thanks,
Laurent


[PATCH] serial: pmac_zilog: don't init if zilog is not available

2020-10-20 Thread Laurent Vivier
We can avoid probing for the Zilog device (and generating an ugly kernel warning)
if the kernel is built for Mac but is not running on a Mac.

Signed-off-by: Laurent Vivier 
---
 drivers/tty/serial/pmac_zilog.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/tty/serial/pmac_zilog.c b/drivers/tty/serial/pmac_zilog.c
index 063484b22523..d1d2e55983c3 100644
--- a/drivers/tty/serial/pmac_zilog.c
+++ b/drivers/tty/serial/pmac_zilog.c
@@ -1867,6 +1867,12 @@ static struct platform_driver pmz_driver = {
 static int __init init_pmz(void)
 {
int rc, i;
+
+#ifdef CONFIG_MAC
+   if (!MACH_IS_MAC)
+   return -ENODEV;
+#endif
+
printk(KERN_INFO "%s\n", version);
 
/* 
@@ -2034,6 +2040,11 @@ static int __init pmz_console_setup(struct console *co, char *options)
 
 static int __init pmz_console_init(void)
 {
+#ifdef CONFIG_MAC
+   if (!MACH_IS_MAC)
+   return -ENODEV;
+#endif
+
/* Probe ports */
pmz_probe();
 
-- 
2.26.2



Re: [PATCH 3/3] KVM: PPC: Book3S HV: Clear pending decr exceptions on nested guest entry

2019-06-20 Thread Laurent Vivier
On 20/06/2019 03:46, Suraj Jitindar Singh wrote:
> If we enter an L1 guest with a pending decrementer exception then this
> is cleared on guest exit if the guest has written a positive value into
> the decrementer (indicating that it handled the decrementer exception)
> since there is no other way to detect that the guest has handled the
> pending exception and that it should be dequeued. In the event that the
> L1 guest tries to run a nested (L2) guest immediately after this and the
> L2 guest decrementer is negative (which is loaded by L1 before making
> the H_ENTER_NESTED hcall), then the pending decrementer exception
> isn't cleared and the L2 entry is blocked since L1 has a pending
> exception, even though L1 may have already handled the exception and
> written a positive value for its decrementer. This results in a loop of
> L1 trying to enter the L2 guest and L0 blocking the entry since L1 has
> an interrupt pending with the outcome being that L2 never gets to run
> and hangs.
> 
> Fix this by clearing any pending decrementer exceptions when L1 makes
> the H_ENTER_NESTED hcall since it won't do this if its decrementer has
> gone negative, and anyway its decrementer has been communicated to L0
> in the hdec_expires field and L0 will return control to L1 when this
> goes negative by delivering an H_DECREMENTER exception.
> 
> Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest entry/exit path 
> on P9 for radix guests"
> 
> Signed-off-by: Suraj Jitindar Singh 
> ---
>  arch/powerpc/kvm/book3s_hv.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 719fd2529eec..4a5eb29b952f 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4128,8 +4128,15 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
>  
>   preempt_enable();
>  
> - /* cancel pending decrementer exception if DEC is now positive */
> - if (get_tb() < vcpu->arch.dec_expires && kvmppc_core_pending_dec(vcpu))
> + /*
> +  * cancel pending decrementer exception if DEC is now positive, or if
> +  * entering a nested guest in which case the decrementer is now owned
> +  * by L2 and the L1 decrementer is provided in hdec_expires
> +  */
> + if (kvmppc_core_pending_dec(vcpu) &&
> + ((get_tb() < vcpu->arch.dec_expires) ||
> +  (trap == BOOK3S_INTERRUPT_SYSCALL &&
> +   kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
>   kvmppc_core_dequeue_dec(vcpu);
>  
>   trace_kvm_guest_exit(vcpu);
> 

Patches 2 and 3: tested. I can boot and run an L2 nested guest with QEMU
v4.0.0 and caps-large-decr=on in a case that previously hung.

Tested-by: Laurent Vivier 



Re: [PATCH 2/3] KVM: PPC: Book3S HV: Sign extend decrementer value if not using large decr

2019-06-20 Thread Laurent Vivier
On 20/06/2019 03:46, Suraj Jitindar Singh wrote:
> On POWER9 the decrementer can operate in large decrementer mode where
> the decrementer is 56 bits and sign-extended to 64 bits. When not
> operating in this mode the decrementer behaves as a 32 bit decrementer
> which is NOT sign-extended (as on POWER8).
> 
> Currently when reading a guest decrementer value we don't take into
> account whether the large decrementer is enabled or not, and this means
> the value will be incorrect when the guest is not using the large
> decrementer. Fix this by sign extending the value read when the guest
> isn't using the large decrementer.
> 
> Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest entry/exit path 
> on P9 for radix guests"
> 
> Signed-off-by: Suraj Jitindar Singh 
> ---
>  arch/powerpc/kvm/book3s_hv.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index d3684509da35..719fd2529eec 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3607,6 +3607,8 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 
> time_limit,
>  
>   vcpu->arch.slb_max = 0;
>   dec = mfspr(SPRN_DEC);
> + if (!(lpcr & LPCR_LD)) /* Sign extend if not using large decrementer */
> + dec = (s32) dec;
>   tb = mftb();
>   vcpu->arch.dec_expires = dec + tb;
>   vcpu->cpu = -1;
> 

Patches 2 and 3: tested. I can boot and run an L2 nested guest with QEMU
v4.0.0 and caps-large-decr=on in a case that previously hung.

Tested-by: Laurent Vivier 


Re: [PATCH v3] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-04-08 Thread Laurent Vivier
On 20/03/2019 13:47, Michael Ellerman wrote:
> Laurent Vivier  writes:
>> Hi Michael,
>>
>> as it seems good now, could you pick up this patch for merging?
> 
> I'll start picking up patches for next starting after rc2, so next week.
> 
> If you think it's a bug fix I can put it into fixes now, but I don't
> think it's a bug fix is it?

Any news about this patch?

Thanks,
Laurent


Re: [PATCH v3] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-03-20 Thread Laurent Vivier
On 20/03/2019 13:47, Michael Ellerman wrote:
> Laurent Vivier  writes:
>> Hi Michael,
>>
>> as it seems good now, could you pick up this patch for merging?
> 
> I'll start picking up patches for next starting after rc2, so next week.
> 
> If you think it's a bug fix I can put it into fixes now, but I don't
> think it's a bug fix is it?

No, it's only cosmetic.

Thanks,
Laurent


Re: [PATCH v3] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-03-19 Thread Laurent Vivier
Hi Michael,

as it seems good now, could you pick up this patch for merging?

Thanks,
Laurent

On 13/03/2019 11:25, Laurent Vivier wrote:
> resize_hpt_for_hotplug() reports a warning when it cannot
> resize the hash page table ("Unable to resize hash page
> table to target order") but in some cases it's not a problem
> and can make the user think something has not worked properly.
> 
> This patch moves the warning to arch_remove_memory() to
> only report the problem when it is needed.
> 
> Reviewed-by: David Gibson 
> Signed-off-by: Laurent Vivier 
> ---
> 
> Notes:
> v3: move "||" to above line and remove parenthesis
> v2: add warning messages for H_PARAMETER and H_RESOURCE
> 
>  arch/powerpc/include/asm/sparsemem.h  |  4 ++--
>  arch/powerpc/mm/hash_utils_64.c   | 19 +++
>  arch/powerpc/mm/mem.c |  3 ++-
>  arch/powerpc/platforms/pseries/lpar.c |  3 ++-
>  4 files changed, 13 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/sparsemem.h 
> b/arch/powerpc/include/asm/sparsemem.h
> index 68da49320592..3192d454a733 100644
> --- a/arch/powerpc/include/asm/sparsemem.h
> +++ b/arch/powerpc/include/asm/sparsemem.h
> @@ -17,9 +17,9 @@ extern int create_section_mapping(unsigned long start, 
> unsigned long end, int ni
>  extern int remove_section_mapping(unsigned long start, unsigned long end);
>  
>  #ifdef CONFIG_PPC_BOOK3S_64
> -extern void resize_hpt_for_hotplug(unsigned long new_mem_size);
> +extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
>  #else
> -static inline void resize_hpt_for_hotplug(unsigned long new_mem_size) { }
> +static inline int resize_hpt_for_hotplug(unsigned long new_mem_size) { 
> return 0; }
>  #endif
>  
>  #ifdef CONFIG_NUMA
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index 0cc7fbc3bd1c..5aa7594ee71b 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -755,12 +755,12 @@ static unsigned long __init htab_get_table_size(void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -void resize_hpt_for_hotplug(unsigned long new_mem_size)
> +int resize_hpt_for_hotplug(unsigned long new_mem_size)
>  {
>   unsigned target_hpt_shift;
>  
>   if (!mmu_hash_ops.resize_hpt)
> - return;
> + return 0;
>  
>   target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
>  
> @@ -772,16 +772,11 @@ void resize_hpt_for_hotplug(unsigned long new_mem_size)
>* reduce unless the target shift is at least 2 below the
>* current shift
>*/
> - if ((target_hpt_shift > ppc64_pft_size)
> - || (target_hpt_shift < (ppc64_pft_size - 1))) {
> - int rc;
> -
> - rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
> - if (rc && (rc != -ENODEV))
> - printk(KERN_WARNING
> -"Unable to resize hash page table to target 
> order %d: %d\n",
> -target_hpt_shift, rc);
> - }
> + if (target_hpt_shift > ppc64_pft_size ||
> + target_hpt_shift < ppc64_pft_size - 1)
> + return mmu_hash_ops.resize_hpt(target_hpt_shift);
> +
> + return 0;
>  }
>  
>  int hash__create_section_mapping(unsigned long start, unsigned long end, int 
> nid)
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 33cc6f676fa6..0d40d970cf4a 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -169,7 +169,8 @@ int __meminit arch_remove_memory(int nid, u64 start, u64 
> size,
>*/
>   vm_unmap_aliases();
>  
> - resize_hpt_for_hotplug(memblock_phys_mem_size());
> + if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
> + pr_warn("Hash collision while resizing HPT\n");
>  
>   return ret;
>  }
> diff --git a/arch/powerpc/platforms/pseries/lpar.c 
> b/arch/powerpc/platforms/pseries/lpar.c
> index f2a9f0adc2d3..1034ef1fe2b4 100644
> --- a/arch/powerpc/platforms/pseries/lpar.c
> +++ b/arch/powerpc/platforms/pseries/lpar.c
> @@ -901,8 +901,10 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
>   break;
>  
>   case H_PARAMETER:
> + pr_warn("Invalid argument from H_RESIZE_HPT_PREPARE\n");
>   return -EINVAL;
>   case H_RESOURCE:
> + pr_warn("Operation not permitted from H_RESIZE_HPT_PREPARE\n");
>   return -EPERM;
>   default:
>   pr_warn("Unexpected error %d from H_RESIZE_HPT_PREPARE\n", rc);
> @@ -918,7 +920,6 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
>   if (rc != 0) {
>   switch (state.commit_rc) {
>   case H_PTEG_FULL:
> - pr_warn("Hash collision while resizing HPT\n");
>   return -ENOSPC;
>  
>   default:
> 



Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-15 Thread Laurent Vivier
On 15/03/2019 13:25, Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 12:12:45PM +0100, Laurent Vivier wrote:
> 
>> Another way to avoid the nodes overlapping for the offline nodes at
>> startup is to ensure the default values don't define a distance that
>> merge all offline nodes into node 0.
>>
>> A powerpc specific patch can workaround the kernel crash by doing this:
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 87f0dd0..3ba29bb 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
>> struct device_node *memory;
>> int default_nid = 0;
>> unsigned long i;
>> +   int nid, dist;
>>
>> if (numa_enabled == 0) {
>> printk(KERN_WARNING "NUMA disabled by user\n");
>> @@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
>>
>> dbg("NUMA associativity depth for CPU/Memory: %d\n",
>> min_common_depth);
>>
>> +   for (nid = 0; nid < MAX_NUMNODES; nid ++)
>> +   for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
>> +   distance_lookup_table[nid][dist] = nid;
>> +
>> /*
>>  * Even though we connect cpus to numa domains later in SMP
>>  * init, we need to know the node ids now. This is because
> 
> What does that actually do? That is, what does it make the distance
> table look like before and after you bring up the CPUs?

By default the table is filled with zeros. When a CPU is brought up, the
value is read from the device tree and the table is updated. What I've
seen is that the value is common to two nodes at a given level if they
share that level. So, since the table is initialized with zeros, all
offline nodes (no memory, no CPU) are merged with node 0.
My fix initializes the table with unique values for each node, so by
default no nodes are merged.

> 
>> Any comment?
> 
> Well, I had a few questions here:
> 
>   20190305115952.gh32...@hirez.programming.kicks-ass.net
> 
> that I've not yet seen answers to.

I didn't answer because:

- I thought this was not the right way to fix the problem, as you said "it
seems very fragile and unfortunate",

- I don't have the answers; I'd really like someone from IBM who knows
the NUMA part of powerpc well to answer these questions... and perhaps
find a better solution.

Thanks,
Laurent





Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-15 Thread Laurent Vivier
On 04/03/2019 20:59, Laurent Vivier wrote:
> When we hotplug a CPU in a memoryless/cpuless node,
> the kernel crashes when it rebuilds the sched_domains data.
> 
> I reproduce this problem on POWER and with a pseries VM, with the following
> QEMU parameters:
> 
>   -machine pseries -enable-kvm -m 8192 \
>   -smp 2,maxcpus=8,sockets=4,cores=2,threads=1 \
>   -numa node,nodeid=0,cpus=0-1,mem=0 \
>   -numa node,nodeid=1,cpus=2-3,mem=8192 \
>   -numa node,nodeid=2,cpus=4-5,mem=0 \
>   -numa node,nodeid=3,cpus=6-7,mem=0
> 
> Then I can trigger the crash by hotplugging a CPU on node-id 3:
> 
>   (qemu) device_add host-spapr-cpu-core,core-id=7,node-id=3
> 
> Built 2 zonelists, mobility grouping on.  Total pages: 130162
> Policy zone: Normal
> WARNING: workqueue cpumask: online intersect > possible intersect
> BUG: Kernel NULL pointer dereference at 0x0400
> Faulting instruction address: 0xc0170edc
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT 
> nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute 
> bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security 
> ip6table_raw iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 
> nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter 
> ebtables ip6table_filter ip6_tables iptable_filter xts vmx_crypto ip_tables 
> xfs libcrc32c virtio_net net_failover failover virtio_blk virtio_pci 
> virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
> CPU: 2 PID: 5661 Comm: kworker/2:0 Not tainted 5.0.0-rc6+ #20
> Workqueue: events cpuset_hotplug_workfn
> NIP:  c0170edc LR: c0170f98 CTR: 
> REGS: c3e931a0 TRAP: 0380   Not tainted  (5.0.0-rc6+)
> MSR:  80009033   CR: 22284028  XER: 
> CFAR: c0170f20 IRQMASK: 0
> GPR00: c0170f98 c3e93430 c11ac500 c001efe22000
> GPR04: 0001   0010
> GPR08: 0001 0400  
> GPR12: 8800 c0003fffd680 c001f14b c11e1bf0
> GPR16: c11e61f4 c001efe22200 c001efe22020 c001fba8
> GPR20: c001ff567a80 0001 c0e27a80 e830
> GPR24: ec30 102f 102f c001efca1000
> GPR28: c001efca0400 c001efe22000 c001efe23bff c001efe22a00
> NIP [c0170edc] free_sched_groups+0x5c/0xf0
> LR [c0170f98] destroy_sched_domain+0x28/0x90
> Call Trace:
> [c3e93430] [102f] 0x102f (unreliable)
> [c3e93470] [c0170f98] destroy_sched_domain+0x28/0x90
> [c3e934a0] [c01716e0] cpu_attach_domain+0x100/0x920
> [c3e93600] [c0173128] build_sched_domains+0x1228/0x1370
> [c3e93740] [c017429c] partition_sched_domains+0x23c/0x400
> [c3e937e0] [c01f5ec8] 
> rebuild_sched_domains_locked+0x78/0xe0
> [c3e93820] [c01f9ff0] rebuild_sched_domains+0x30/0x50
> [c3e93850] [c01fa1c0] cpuset_hotplug_workfn+0x1b0/0xb70
> [c3e93c80] [c012e5a0] process_one_work+0x1b0/0x480
> [c3e93d20] [c012e8f8] worker_thread+0x88/0x540
> [c3e93db0] [c013714c] kthread+0x15c/0x1a0
> [c3e93e20] [c000b55c] ret_from_kernel_thread+0x5c/0x80
> Instruction dump:
> 2e24 f8010010 f821ffc1 409e0014 4880 7fbdf040 7fdff378 419e0074
> ebdf 4192002c e93f0010 7c0004ac <7d404828> 314a 7d40492d 40c2fff4
> ---[ end trace f992c4a7d47d602a ]---
> 
> Kernel panic - not syncing: Fatal exception
> 
> This happens in free_sched_groups() because the linked list of the
> sched_groups is corrupted. Here what happens when we hotplug the CPU:
> 
>  - build_sched_groups() builds a sched_groups linked list for
>sched_domain D1, with only one entry A, refcount=1
> 
>D1: A(ref=1)
> 
>  - build_sched_groups() builds a sched_groups linked list for
>sched_domain D2, with the same entry A
> 
>D2: A(ref=2)
> 
>  - build_sched_groups() builds a sched_groups linked list for
>sched_domain D3, with the same entry A and a new entry B:
> 
>D3: A(ref=3) -> B(ref=1)
> 
>  - destroy_sched_domain() is called for D1:
> 
>D1: A(ref=3) -> B(ref=1) and as ref is 1, memory of B is released,
>  but A->next always points to the freed B.

[PATCH v3] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-03-13 Thread Laurent Vivier
resize_hpt_for_hotplug() reports a warning when it cannot
resize the hash page table ("Unable to resize hash page
table to target order") but in some cases it's not a problem
and can make the user think something has not worked properly.

This patch moves the warning to arch_remove_memory() to
only report the problem when it is needed.

Reviewed-by: David Gibson 
Signed-off-by: Laurent Vivier 
---

Notes:
v3: move "||" to above line and remove parenthesis
v2: add warning messages for H_PARAMETER and H_RESOURCE

 arch/powerpc/include/asm/sparsemem.h  |  4 ++--
 arch/powerpc/mm/hash_utils_64.c   | 19 +++
 arch/powerpc/mm/mem.c |  3 ++-
 arch/powerpc/platforms/pseries/lpar.c |  3 ++-
 4 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/sparsemem.h 
b/arch/powerpc/include/asm/sparsemem.h
index 68da49320592..3192d454a733 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -17,9 +17,9 @@ extern int create_section_mapping(unsigned long start, 
unsigned long end, int ni
 extern int remove_section_mapping(unsigned long start, unsigned long end);
 
 #ifdef CONFIG_PPC_BOOK3S_64
-extern void resize_hpt_for_hotplug(unsigned long new_mem_size);
+extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
 #else
-static inline void resize_hpt_for_hotplug(unsigned long new_mem_size) { }
+static inline int resize_hpt_for_hotplug(unsigned long new_mem_size) { return 
0; }
 #endif
 
 #ifdef CONFIG_NUMA
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0cc7fbc3bd1c..5aa7594ee71b 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -755,12 +755,12 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-void resize_hpt_for_hotplug(unsigned long new_mem_size)
+int resize_hpt_for_hotplug(unsigned long new_mem_size)
 {
unsigned target_hpt_shift;
 
if (!mmu_hash_ops.resize_hpt)
-   return;
+   return 0;
 
target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
 
@@ -772,16 +772,11 @@ void resize_hpt_for_hotplug(unsigned long new_mem_size)
 * reduce unless the target shift is at least 2 below the
 * current shift
 */
-   if ((target_hpt_shift > ppc64_pft_size)
-   || (target_hpt_shift < (ppc64_pft_size - 1))) {
-   int rc;
-
-   rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
-   if (rc && (rc != -ENODEV))
-   printk(KERN_WARNING
-  "Unable to resize hash page table to target 
order %d: %d\n",
-  target_hpt_shift, rc);
-   }
+   if (target_hpt_shift > ppc64_pft_size ||
+   target_hpt_shift < ppc64_pft_size - 1)
+   return mmu_hash_ops.resize_hpt(target_hpt_shift);
+
+   return 0;
 }
 
 int hash__create_section_mapping(unsigned long start, unsigned long end, int 
nid)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 33cc6f676fa6..0d40d970cf4a 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -169,7 +169,8 @@ int __meminit arch_remove_memory(int nid, u64 start, u64 
size,
 */
vm_unmap_aliases();
 
-   resize_hpt_for_hotplug(memblock_phys_mem_size());
+   if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
+   pr_warn("Hash collision while resizing HPT\n");
 
return ret;
 }
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index f2a9f0adc2d3..1034ef1fe2b4 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -901,8 +901,10 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
break;
 
case H_PARAMETER:
+   pr_warn("Invalid argument from H_RESIZE_HPT_PREPARE\n");
return -EINVAL;
case H_RESOURCE:
+   pr_warn("Operation not permitted from H_RESIZE_HPT_PREPARE\n");
return -EPERM;
default:
pr_warn("Unexpected error %d from H_RESIZE_HPT_PREPARE\n", rc);
@@ -918,7 +920,6 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
if (rc != 0) {
switch (state.commit_rc) {
case H_PTEG_FULL:
-   pr_warn("Hash collision while resizing HPT\n");
return -ENOSPC;
 
default:
-- 
2.20.1



Re: [PATCH v2] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-03-13 Thread Laurent Vivier
On 13/03/2019 09:28, Christophe Leroy wrote:
> 
> 
> Le 13/03/2019 à 09:01, Laurent Vivier a écrit :
>> On 13/03/2019 07:03, Christophe Leroy wrote:
>>>
>>>
>>> Le 08/03/2019 à 11:54, Laurent Vivier a écrit :
>>>> resize_hpt_for_hotplug() reports a warning when it cannot
>>>> resize the hash page table ("Unable to resize hash page
>>>> table to target order") but in some cases it's not a problem
>>>> and can make the user think something has not worked properly.
>>>>
>>>> This patch moves the warning to arch_remove_memory() to
>>>> only report the problem when it is needed.
>>>>
>>>> Signed-off-by: Laurent Vivier 
>>>> ---
>>>>    arch/powerpc/include/asm/sparsemem.h  |  4 ++--
>>>>    arch/powerpc/mm/hash_utils_64.c   | 17 ++---
>>>>    arch/powerpc/mm/mem.c |  3 ++-
>>>>    arch/powerpc/platforms/pseries/lpar.c |  3 ++-
>>>>    4 files changed, 12 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/sparsemem.h
>>>> b/arch/powerpc/include/asm/sparsemem.h
>>>> index 68da49320592..3192d454a733 100644
>>>> --- a/arch/powerpc/include/asm/sparsemem.h
>>>> +++ b/arch/powerpc/include/asm/sparsemem.h
>>>> @@ -17,9 +17,9 @@ extern int create_section_mapping(unsigned long
>>>> start, unsigned long end, int ni
>>>>    extern int remove_section_mapping(unsigned long start, unsigned long
>>>> end);
>>>>      #ifdef CONFIG_PPC_BOOK3S_64
>>>> -extern void resize_hpt_for_hotplug(unsigned long new_mem_size);
>>>> +extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
>>>>    #else
>>>> -static inline void resize_hpt_for_hotplug(unsigned long new_mem_size)
>>>> { }
>>>> +static inline int resize_hpt_for_hotplug(unsigned long new_mem_size)
>>>> { return 0; }
>>>>    #endif
>>>>      #ifdef CONFIG_NUMA
>>>> diff --git a/arch/powerpc/mm/hash_utils_64.c
>>>> b/arch/powerpc/mm/hash_utils_64.c
>>>> index 0cc7fbc3bd1c..40bb2a8326bb 100644
>>>> --- a/arch/powerpc/mm/hash_utils_64.c
>>>> +++ b/arch/powerpc/mm/hash_utils_64.c
>>>> @@ -755,12 +755,12 @@ static unsigned long __init
>>>> htab_get_table_size(void)
>>>>    }
>>>>      #ifdef CONFIG_MEMORY_HOTPLUG
>>>> -void resize_hpt_for_hotplug(unsigned long new_mem_size)
>>>> +int resize_hpt_for_hotplug(unsigned long new_mem_size)
>>>>    {
>>>>    unsigned target_hpt_shift;
>>>>      if (!mmu_hash_ops.resize_hpt)
>>>> -    return;
>>>> +    return 0;
>>>>      target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
>>>>    @@ -773,15 +773,10 @@ void resize_hpt_for_hotplug(unsigned long
>>>> new_mem_size)
>>>>     * current shift
>>>>     */
>>>>    if ((target_hpt_shift > ppc64_pft_size)
>>>> -    || (target_hpt_shift < (ppc64_pft_size - 1))) {
>>>> -    int rc;
>>>> -
>>>> -    rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
>>>> -    if (rc && (rc != -ENODEV))
>>>> -    printk(KERN_WARNING
>>>> -   "Unable to resize hash page table to target order
>>>> %d: %d\n",
>>>> -   target_hpt_shift, rc);
>>>> -    }
>>>> +    || (target_hpt_shift < (ppc64_pft_size - 1)))
>>>
>>> The || should go on the line above and the two (target_hpt... should be
>>> aligned, and the () after the < are superfluous.
>>>
>>> And indeed, we should (in another patch) rename 'target_hpt_shift' with
>>> a shorter name, this would avoid multiple lines.
>>>
>>> Ref
>>> https://www.kernel.org/doc/html/latest/process/coding-style.html#naming
>>> :
>>>
>>> LOCAL variable names should be short, and to the point. If you have some
>>> random integer loop counter, it should probably be called i. Calling it
>>> loop_counter is non-productive, if there is no chance of it being
>>> mis-understood. Similarly, tmp can be just about any type of variable
>>> that is used to hold a temporary value.
>>>
>>> If you are afraid to mix up your local variable names, you have another
problem, which is called the function-growth-hormone-imbalance syndrome.

Re: [PATCH v2] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-03-13 Thread Laurent Vivier
On 13/03/2019 07:03, Christophe Leroy wrote:
> 
> 
> Le 08/03/2019 à 11:54, Laurent Vivier a écrit :
>> resize_hpt_for_hotplug() reports a warning when it cannot
>> resize the hash page table ("Unable to resize hash page
>> table to target order") but in some cases it's not a problem
>> and can make the user think something has not worked properly.
>>
>> This patch moves the warning to arch_remove_memory() to
>> only report the problem when it is needed.
>>
>> Signed-off-by: Laurent Vivier 
>> ---
>>   arch/powerpc/include/asm/sparsemem.h  |  4 ++--
>>   arch/powerpc/mm/hash_utils_64.c   | 17 ++---
>>   arch/powerpc/mm/mem.c |  3 ++-
>>   arch/powerpc/platforms/pseries/lpar.c |  3 ++-
>>   4 files changed, 12 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/sparsemem.h
>> b/arch/powerpc/include/asm/sparsemem.h
>> index 68da49320592..3192d454a733 100644
>> --- a/arch/powerpc/include/asm/sparsemem.h
>> +++ b/arch/powerpc/include/asm/sparsemem.h
>> @@ -17,9 +17,9 @@ extern int create_section_mapping(unsigned long
>> start, unsigned long end, int ni
>>   extern int remove_section_mapping(unsigned long start, unsigned long
>> end);
>>     #ifdef CONFIG_PPC_BOOK3S_64
>> -extern void resize_hpt_for_hotplug(unsigned long new_mem_size);
>> +extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
>>   #else
>> -static inline void resize_hpt_for_hotplug(unsigned long new_mem_size)
>> { }
>> +static inline int resize_hpt_for_hotplug(unsigned long new_mem_size)
>> { return 0; }
>>   #endif
>>     #ifdef CONFIG_NUMA
>> diff --git a/arch/powerpc/mm/hash_utils_64.c
>> b/arch/powerpc/mm/hash_utils_64.c
>> index 0cc7fbc3bd1c..40bb2a8326bb 100644
>> --- a/arch/powerpc/mm/hash_utils_64.c
>> +++ b/arch/powerpc/mm/hash_utils_64.c
>> @@ -755,12 +755,12 @@ static unsigned long __init
>> htab_get_table_size(void)
>>   }
>>     #ifdef CONFIG_MEMORY_HOTPLUG
>> -void resize_hpt_for_hotplug(unsigned long new_mem_size)
>> +int resize_hpt_for_hotplug(unsigned long new_mem_size)
>>   {
>>   unsigned target_hpt_shift;
>>     if (!mmu_hash_ops.resize_hpt)
>> -    return;
>> +    return 0;
>>     target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
>>   @@ -773,15 +773,10 @@ void resize_hpt_for_hotplug(unsigned long
>> new_mem_size)
>>    * current shift
>>    */
>>   if ((target_hpt_shift > ppc64_pft_size)
>> -    || (target_hpt_shift < (ppc64_pft_size - 1))) {
>> -    int rc;
>> -
>> -    rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
>> -    if (rc && (rc != -ENODEV))
>> -    printk(KERN_WARNING
>> -   "Unable to resize hash page table to target order
>> %d: %d\n",
>> -   target_hpt_shift, rc);
>> -    }
>> +    || (target_hpt_shift < (ppc64_pft_size - 1)))
> 
> The || should go on the line above and the two (target_hpt... should be
> aligned, and the () after the < are superfluous.
> 
> And indeed, we should (in another patch) rename 'target_hpt_shift' with
> a shorter name, this would avoid multiple lines.
> 
> Ref
> https://www.kernel.org/doc/html/latest/process/coding-style.html#naming :
> 
> LOCAL variable names should be short, and to the point. If you have some
> random integer loop counter, it should probably be called i. Calling it
> loop_counter is non-productive, if there is no chance of it being
> mis-understood. Similarly, tmp can be just about any type of variable
> that is used to hold a temporary value.
> 
> If you are afraid to mix up your local variable names, you have another
> problem, which is called the function-growth-hormone-imbalance syndrome.
> See chapter 6 (Functions).
> 

I'm only removing a warning. Do we really need to rewrite all the code
around for that?

Thanks,
Laurent



Re: [PATCH v2] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-03-08 Thread Laurent Vivier
I forgot the version change note:

  v2: add warning messages for H_PARAMETER and H_RESOURCE

Thanks,
Laurent

On 08/03/2019 11:54, Laurent Vivier wrote:
> resize_hpt_for_hotplug() reports a warning when it cannot
> resize the hash page table ("Unable to resize hash page
> table to target order") but in some cases it's not a problem
> and can make the user think something has not worked properly.
> 
> This patch moves the warning to arch_remove_memory() to
> only report the problem when it is needed.
> 
> Signed-off-by: Laurent Vivier 
> ---
>  arch/powerpc/include/asm/sparsemem.h  |  4 ++--
>  arch/powerpc/mm/hash_utils_64.c   | 17 ++---
>  arch/powerpc/mm/mem.c |  3 ++-
>  arch/powerpc/platforms/pseries/lpar.c |  3 ++-
>  4 files changed, 12 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/sparsemem.h 
> b/arch/powerpc/include/asm/sparsemem.h
> index 68da49320592..3192d454a733 100644
> --- a/arch/powerpc/include/asm/sparsemem.h
> +++ b/arch/powerpc/include/asm/sparsemem.h
> @@ -17,9 +17,9 @@ extern int create_section_mapping(unsigned long start, 
> unsigned long end, int ni
>  extern int remove_section_mapping(unsigned long start, unsigned long end);
>  
>  #ifdef CONFIG_PPC_BOOK3S_64
> -extern void resize_hpt_for_hotplug(unsigned long new_mem_size);
> +extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
>  #else
> -static inline void resize_hpt_for_hotplug(unsigned long new_mem_size) { }
> +static inline int resize_hpt_for_hotplug(unsigned long new_mem_size) { 
> return 0; }
>  #endif
>  
>  #ifdef CONFIG_NUMA
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index 0cc7fbc3bd1c..40bb2a8326bb 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -755,12 +755,12 @@ static unsigned long __init htab_get_table_size(void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -void resize_hpt_for_hotplug(unsigned long new_mem_size)
> +int resize_hpt_for_hotplug(unsigned long new_mem_size)
>  {
>   unsigned target_hpt_shift;
>  
>   if (!mmu_hash_ops.resize_hpt)
> - return;
> + return 0;
>  
>   target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
>  
> @@ -773,15 +773,10 @@ void resize_hpt_for_hotplug(unsigned long new_mem_size)
>* current shift
>*/
>   if ((target_hpt_shift > ppc64_pft_size)
> - || (target_hpt_shift < (ppc64_pft_size - 1))) {
> - int rc;
> -
> - rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
> - if (rc && (rc != -ENODEV))
> - printk(KERN_WARNING
> -"Unable to resize hash page table to target 
> order %d: %d\n",
> -target_hpt_shift, rc);
> - }
> + || (target_hpt_shift < (ppc64_pft_size - 1)))
> + return mmu_hash_ops.resize_hpt(target_hpt_shift);
> +
> + return 0;
>  }
>  
>  int hash__create_section_mapping(unsigned long start, unsigned long end, int 
> nid)
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 33cc6f676fa6..0d40d970cf4a 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -169,7 +169,8 @@ int __meminit arch_remove_memory(int nid, u64 start, u64 
> size,
>*/
>   vm_unmap_aliases();
>  
> - resize_hpt_for_hotplug(memblock_phys_mem_size());
> + if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
> + pr_warn("Hash collision while resizing HPT\n");
>  
>   return ret;
>  }
> diff --git a/arch/powerpc/platforms/pseries/lpar.c 
> b/arch/powerpc/platforms/pseries/lpar.c
> index f2a9f0adc2d3..1034ef1fe2b4 100644
> --- a/arch/powerpc/platforms/pseries/lpar.c
> +++ b/arch/powerpc/platforms/pseries/lpar.c
> @@ -901,8 +901,10 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
>   break;
>  
>   case H_PARAMETER:
> + pr_warn("Invalid argument from H_RESIZE_HPT_PREPARE\n");
>   return -EINVAL;
>   case H_RESOURCE:
> + pr_warn("Operation not permitted from H_RESIZE_HPT_PREPARE\n");
>   return -EPERM;
>   default:
>   pr_warn("Unexpected error %d from H_RESIZE_HPT_PREPARE\n", rc);
> @@ -918,7 +920,6 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
>   if (rc != 0) {
>   switch (state.commit_rc) {
>   case H_PTEG_FULL:
> - pr_warn("Hash collision while resizing HPT\n");
>   return -ENOSPC;
>  
>   default:
> 



[PATCH v2] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-03-08 Thread Laurent Vivier
resize_hpt_for_hotplug() reports a warning when it cannot
resize the hash page table ("Unable to resize hash page
table to target order") but in some cases it's not a problem
and can make the user think something has not worked properly.

This patch moves the warning to arch_remove_memory() to
only report the problem when it is needed.
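As a side note, the error-propagation pattern this patch applies — the helper returns an errno and the caller decides which return values deserve a warning — can be sketched in plain C. Everything below is illustrative (the fake resize hook and the shift values are stand-ins, not the kernel's actual symbols):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Stand-in for mmu_hash_ops.resize_hpt(): returns 0 on success or a
 * negative errno on failure. */
static int fake_resize_hpt(int target_shift)
{
    if (target_shift < 0)
        return -ENOSPC;   /* pretend a hash collision occurred */
    return 0;
}

/* After the patch: propagate the error code, do not print here. */
static int resize_hpt_for_hotplug(int target_shift)
{
    return fake_resize_hpt(target_shift);
}

/* The caller (arch_remove_memory() in the patch) owns the warning and
 * only reports the case it actually cares about. */
static int remove_memory(int target_shift)
{
    int rc = resize_hpt_for_hotplug(target_shift);

    if (rc == -ENOSPC)
        printf("Hash collision while resizing HPT\n");
    return rc;
}
```

This way harmless failures stay silent at the helper level, and each call site decides its own reporting policy.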

Signed-off-by: Laurent Vivier 
---
 arch/powerpc/include/asm/sparsemem.h  |  4 ++--
 arch/powerpc/mm/hash_utils_64.c   | 17 ++---
 arch/powerpc/mm/mem.c |  3 ++-
 arch/powerpc/platforms/pseries/lpar.c |  3 ++-
 4 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
index 68da49320592..3192d454a733 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -17,9 +17,9 @@ extern int create_section_mapping(unsigned long start, unsigned long end, int ni
 extern int remove_section_mapping(unsigned long start, unsigned long end);
 
 #ifdef CONFIG_PPC_BOOK3S_64
-extern void resize_hpt_for_hotplug(unsigned long new_mem_size);
+extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
 #else
-static inline void resize_hpt_for_hotplug(unsigned long new_mem_size) { }
+static inline int resize_hpt_for_hotplug(unsigned long new_mem_size) { return 0; }
 #endif
 
 #ifdef CONFIG_NUMA
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0cc7fbc3bd1c..40bb2a8326bb 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -755,12 +755,12 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-void resize_hpt_for_hotplug(unsigned long new_mem_size)
+int resize_hpt_for_hotplug(unsigned long new_mem_size)
 {
unsigned target_hpt_shift;
 
if (!mmu_hash_ops.resize_hpt)
-   return;
+   return 0;
 
target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
 
@@ -773,15 +773,10 @@ void resize_hpt_for_hotplug(unsigned long new_mem_size)
 * current shift
 */
if ((target_hpt_shift > ppc64_pft_size)
-   || (target_hpt_shift < (ppc64_pft_size - 1))) {
-   int rc;
-
-   rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
-   if (rc && (rc != -ENODEV))
-   printk(KERN_WARNING
-  "Unable to resize hash page table to target order %d: %d\n",
-  target_hpt_shift, rc);
-   }
+   || (target_hpt_shift < (ppc64_pft_size - 1)))
+   return mmu_hash_ops.resize_hpt(target_hpt_shift);
+
+   return 0;
 }
 
 int hash__create_section_mapping(unsigned long start, unsigned long end, int nid)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 33cc6f676fa6..0d40d970cf4a 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -169,7 +169,8 @@ int __meminit arch_remove_memory(int nid, u64 start, u64 size,
 */
vm_unmap_aliases();
 
-   resize_hpt_for_hotplug(memblock_phys_mem_size());
+   if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
+   pr_warn("Hash collision while resizing HPT\n");
 
return ret;
 }
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index f2a9f0adc2d3..1034ef1fe2b4 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -901,8 +901,10 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
break;
 
case H_PARAMETER:
+   pr_warn("Invalid argument from H_RESIZE_HPT_PREPARE\n");
return -EINVAL;
case H_RESOURCE:
+   pr_warn("Operation not permitted from H_RESIZE_HPT_PREPARE\n");
return -EPERM;
default:
pr_warn("Unexpected error %d from H_RESIZE_HPT_PREPARE\n", rc);
@@ -918,7 +920,6 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
if (rc != 0) {
switch (state.commit_rc) {
case H_PTEG_FULL:
-   pr_warn("Hash collision while resizing HPT\n");
return -ENOSPC;
 
default:
-- 
2.20.1



[PATCH] powerpc/mm: move warning from resize_hpt_for_hotplug()

2019-03-07 Thread Laurent Vivier
resize_hpt_for_hotplug() reports a warning when it cannot
resize the hash page table ("Unable to resize hash page
table to target order"), but in some cases this is not a problem
and can make the user think something has not worked properly.

This patch moves the warning to arch_remove_memory() to
only report the problem when it is needed.

Signed-off-by: Laurent Vivier 
---
 arch/powerpc/include/asm/sparsemem.h  |  4 ++--
 arch/powerpc/mm/hash_utils_64.c   | 17 ++---
 arch/powerpc/mm/mem.c |  3 ++-
 arch/powerpc/platforms/pseries/lpar.c |  1 -
 4 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
index 68da49320592..3192d454a733 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -17,9 +17,9 @@ extern int create_section_mapping(unsigned long start, unsigned long end, int ni
 extern int remove_section_mapping(unsigned long start, unsigned long end);
 
 #ifdef CONFIG_PPC_BOOK3S_64
-extern void resize_hpt_for_hotplug(unsigned long new_mem_size);
+extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
 #else
-static inline void resize_hpt_for_hotplug(unsigned long new_mem_size) { }
+static inline int resize_hpt_for_hotplug(unsigned long new_mem_size) { return 0; }
 #endif
 
 #ifdef CONFIG_NUMA
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0cc7fbc3bd1c..40bb2a8326bb 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -755,12 +755,12 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-void resize_hpt_for_hotplug(unsigned long new_mem_size)
+int resize_hpt_for_hotplug(unsigned long new_mem_size)
 {
unsigned target_hpt_shift;
 
if (!mmu_hash_ops.resize_hpt)
-   return;
+   return 0;
 
target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
 
@@ -773,15 +773,10 @@ void resize_hpt_for_hotplug(unsigned long new_mem_size)
 * current shift
 */
if ((target_hpt_shift > ppc64_pft_size)
-   || (target_hpt_shift < (ppc64_pft_size - 1))) {
-   int rc;
-
-   rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
-   if (rc && (rc != -ENODEV))
-   printk(KERN_WARNING
-  "Unable to resize hash page table to target order %d: %d\n",
-  target_hpt_shift, rc);
-   }
+   || (target_hpt_shift < (ppc64_pft_size - 1)))
+   return mmu_hash_ops.resize_hpt(target_hpt_shift);
+
+   return 0;
 }
 
 int hash__create_section_mapping(unsigned long start, unsigned long end, int nid)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 33cc6f676fa6..0d40d970cf4a 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -169,7 +169,8 @@ int __meminit arch_remove_memory(int nid, u64 start, u64 size,
 */
vm_unmap_aliases();
 
-   resize_hpt_for_hotplug(memblock_phys_mem_size());
+   if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
+   pr_warn("Hash collision while resizing HPT\n");
 
return ret;
 }
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index f2a9f0adc2d3..b407034a80ba 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -918,7 +918,6 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
if (rc != 0) {
switch (state.commit_rc) {
case H_PTEG_FULL:
-   pr_warn("Hash collision while resizing HPT\n");
return -ENOSPC;
 
default:
-- 
2.20.1



[RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-04 Thread Laurent Vivier
d instead of
build_sched_groups() and in this case the reference counters have all the
same value and the linked list can be correctly unallocated.
The involved commit introduced the node domain, and in the case of
powerpc the node domains can overlap, which should never happen.

This happens because initially powerpc code computes
sched_domains_numa_masks of offline nodes as if they were merged with
node 0 (because firmware doesn't provide the distance information for
memoryless/cpuless nodes):

  node   0   1   2   3
0:  10  40  10  10
1:  40  10  40  40
2:  10  40  10  10
3:  10  40  10  10

We should have:

  node   0   1   2   3
0:  10  40  40  40
1:  40  10  40  40
2:  40  40  10  40
3:  40  40  40  10

And once a new CPU is added and its node is onlined, the NUMA masks are
updated, but the initially set bits are not cleared. This explains why nodes can overlap.

This patch changes the initial code to not initialize the distance for
offline nodes. The distances will be updated when a node becomes online
(on CPU hotplug), as is already done.
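The overlap can be shown with a toy model of the mask building (a simplified simulation, not the scheduler's actual data structures): with the bogus distance table above, where offline nodes 2 and 3 look merged with node 0, the level-0 ("local") mask for node 0 wrongly includes nodes 2 and 3; skipping offline nodes keeps the mask to the node itself.

```c
#include <assert.h>
#include <stdbool.h>

#define NODES 4
#define LOCAL 10

/* Distance table as computed before the fix: offline nodes 2 and 3
 * appear merged with node 0 (distance 10 instead of 40). */
static const int bogus_dist[NODES][NODES] = {
    { 10, 40, 10, 10 },
    { 40, 10, 40, 40 },
    { 10, 40, 10, 10 },
    { 10, 40, 10, 10 },
};

/* Build the bitmask of nodes within LOCAL distance of node j,
 * optionally skipping offline nodes as the patch does. */
static unsigned int build_mask(int j, const int dist[NODES][NODES],
                               const bool *online, bool skip_offline)
{
    unsigned int mask = 0;

    for (int k = 0; k < NODES; k++) {
        if (skip_offline && !online[k])
            continue;
        if (dist[j][k] <= LOCAL)
            mask |= 1u << k;
    }
    return mask;
}
```

With nodes 2 and 3 offline, the unfixed mask for node 0 is 0b1101 (nodes 0, 2 and 3 overlap), while the fixed one is 0b0001 (identity only).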

This patch has been tested on powerpc but not on the other architectures,
which are also affected because the modified code is common code.
All comments are welcome (on how to move the change to powerpc-specific code,
or on whether the other architectures can work with this change).

Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
Cc: Suravee Suthikulpanit 
Cc: Srikar Dronamraju 
Cc: Borislav Petkov 
Cc: David Gibson 
Cc: Michael Ellerman 
Cc: Nathan Fontenot 
Cc: Michael Bringmann 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Signed-off-by: Laurent Vivier 
---

Notes:
v3: fix the root cause of the problem (sched numa mask initialization)
v2: add scheduler maintainers in the CC: list

 kernel/sched/topology.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3f35ba1d8fde..24831b86533b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1622,8 +1622,10 @@ void sched_init_numa(void)
return;
 
sched_domains_numa_masks[i][j] = mask;
+   if (!node_state(j, N_ONLINE))
+   continue;
 
-   for_each_node(k) {
+   for_each_online_node(k) {
if (node_distance(j, k) > sched_domains_numa_distance[i])
continue;
 
-- 
2.20.1



Re: [PATCH v2] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-02-20 Thread Laurent Vivier
On 20/02/2019 18:08, Peter Zijlstra wrote:
> On Wed, Feb 20, 2019 at 05:55:20PM +0100, Laurent Vivier wrote:
>> index 3f35ba1d8fde..372278605f0d 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1651,6 +1651,7 @@ void sched_init_numa(void)
>>   */
>>  tl[i++] = (struct sched_domain_topology_level){
>>  .mask = sd_numa_mask,
>> +.flags = SDTL_OVERLAP,
> 
> This makes no sense what so ever. The numa identify node should not have
> overlap with other domains.
> 
> Are you sure this is not because of the utterly broken powerpc nonsense
> where they move CPUs between nodes?

No, I'm not sure. That is why I've Cc'ed the powerpc folks. My conclusion
is only based on the before/after changes.

I've tested some patches from powerpc ML, but they don't fix this problem:
  powerpc/numa: Perform full re-add of CPU for PRRN/VPHN topology update
  powerpc/pseries: Perform full re-add of CPU for topology update
post-migration

So the only reason I can see for a corrupted sched_group list is that the
sched_domain_span() function doesn't return a correct cpumask for the
domain once a new CPU is added.

Thanks,
Laurent


[PATCH v2] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-02-20 Thread Laurent Vivier
d instead of
build_sched_groups() and in this case the reference counters have all the
same value and the linked list can be correctly unallocated.

The problem happens because patch "sched/topology: Introduce NUMA
identity node sched domain" has removed the SDTL_OVERLAP flag
of the first topology level when it has introduced the NODE domain (and thus
build_sched_groups() is used instead of build_overlap_sched_groups()).

As I don't see any reason (and it is not documented in the involved commit)
to remove this flag, this patch re-introduces the SDTL_OVERLAP flag for the
first level. This fixes the problem described above, and a CPU can be hotplugged
again without a kernel crash.
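The effect of the flag is a simple dispatch: topology levels carrying SDTL_OVERLAP get their groups built by the overlap-capable builder, the others by the plain one. A minimal sketch of that selection (names mirror the scheduler's, but the logic here is deliberately simplified and illustrative):

```c
#include <assert.h>

#define SDTL_OVERLAP 0x1

/* Simplified stand-in for struct sched_domain_topology_level. */
struct topology_level {
    int flags;
};

enum builder { BUILD_GROUPS, BUILD_OVERLAP_GROUPS };

/* Levels flagged SDTL_OVERLAP use the overlap-capable group builder;
 * everything else uses the plain one. */
static enum builder pick_builder(const struct topology_level *tl)
{
    return (tl->flags & SDTL_OVERLAP) ? BUILD_OVERLAP_GROUPS
                                      : BUILD_GROUPS;
}
```

So re-adding the flag to the NODE level routes it back through the overlap builder, whose reference counting tolerates the span changes described above.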

Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
Cc: Suravee Suthikulpanit 
Cc: Srikar Dronamraju 
Cc: Borislav Petkov 
Cc: David Gibson 
Cc: Michael Ellerman 
Cc: Nathan Fontenot 
Cc: Michael Bringmann 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Signed-off-by: Laurent Vivier 
---

Notes:
v2: add scheduler maintainers in the CC: list

 kernel/sched/topology.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3f35ba1d8fde..372278605f0d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1651,6 +1651,7 @@ void sched_init_numa(void)
 */
tl[i++] = (struct sched_domain_topology_level){
.mask = sd_numa_mask,
+   .flags = SDTL_OVERLAP,
.numa_level = 0,
SD_INIT_NAME(NODE)
};
-- 
2.20.1



[PATCH] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-02-20 Thread Laurent Vivier
d instead of
build_sched_groups() and in this case the reference counters have all the
same value and the linked list can be correctly unallocated.

The problem happens because patch "sched/topology: Introduce NUMA
identity node sched domain" has removed the SDTL_OVERLAP flag
of the first topology level when it has introduced the NODE domain (and thus
build_sched_groups() is used instead of build_overlap_sched_groups()).

As I don't see any reason (and it is not documented in the involved commit)
to remove this flag, this patch re-introduces the SDTL_OVERLAP flag for the
first level. This fixes the problem described above, and a CPU can be hotplugged
again without a kernel crash.

Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
Cc: Suravee Suthikulpanit 
Cc: Srikar Dronamraju 
Cc: Borislav Petkov 
CC: David Gibson 
CC: Michael Ellerman 
CC: Nathan Fontenot 
CC: Michael Bringmann 
CC: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Laurent Vivier 
---
 kernel/sched/topology.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3f35ba1d8fde..372278605f0d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1651,6 +1651,7 @@ void sched_init_numa(void)
 */
tl[i++] = (struct sched_domain_topology_level){
.mask = sd_numa_mask,
+   .flags = SDTL_OVERLAP,
.numa_level = 0,
SD_INIT_NAME(NODE)
};
-- 
2.20.1



Re: [PATCH v2] powerpc/mm: move a KERN_WARNING message to pr_debug()

2019-02-07 Thread Laurent Vivier
On 07/02/2019 04:03, David Gibson wrote:
> On Tue, Feb 05, 2019 at 09:21:33PM +0100, Laurent Vivier wrote:
>> resize_hpt_for_hotplug() reports a warning when it cannot
>> increase the hash page table ("Unable to resize hash page
>> table to target order") but this is not blocking and
>> can make the user think something has not worked properly.
>> As we move the message to the debug area, report again the
>> ENODEV error.
>>
>> If the operation cannot be done the real error message
>> will be reported by arch_add_memory() if create_section_mapping()
>> fails.
>>
>> Fixes: 7339390d772dd
>>powerpc/pseries: Don't give a warning when HPT resizing isn't 
>> available
>> Signed-off-by: Laurent Vivier 
> 
> Sorry, I'm pretty dubious about this.  It's true that in the case for
> which this bug was filed this is a harmless situation which deserves a
> pr_debug() at most.
> 
> But that's not necessarily true in all paths leading to this message.
> It will also trip if we fail to reshrink the HPT after genuinely
> hotunplugging a bunch of memory, in which case failing to release
> expected resources does deserve a warning.

But if there is a real problem this function should return an error and
this error should be managed by the caller.

Moreover, the function that can fail (pseries_lpar_resize_hpt()) already
has a warning for each error case:

ETIMEDOUT: "Unexpected error %d cancelling timed out HPT resize\n"
EIO:   "Unexpected error %d from H_RESIZE_HPT_PREPARE\n"
   "Unexpected error %d from H_RESIZE_HPT_COMMIT\n"
ENOSPC:"Hash collision while resizing HPT\n"

ENODEV has no error message but is already silently ignored.
EINVAL and EPERM have no message, but these happen when the hcall is not
used correctly and deserve only a pr_debug(), I think.
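The policy being argued for can be sketched as a single mapping from hcall return codes to Linux errnos, warning only where something genuinely went wrong. This is an illustrative sketch of that policy, not the kernel's code, and the H_* values below are placeholders:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Placeholder hcall return codes (illustrative values). */
enum { H_SUCCESS = 0, H_PARAMETER = -4, H_RESOURCE = -5, H_PTEG_FULL = -6 };

/* Map an H_RESIZE_HPT_PREPARE return code to an errno, warning only
 * for the cases that indicate a real problem. */
static int hpt_prepare_rc_to_errno(int rc)
{
    switch (rc) {
    case H_SUCCESS:
        return 0;
    case H_PARAMETER:
        return -EINVAL;   /* bad argument: pr_debug() material at most */
    case H_RESOURCE:
        return -EPERM;    /* not permitted: caller's problem */
    case H_PTEG_FULL:
        printf("Hash collision while resizing HPT\n");
        return -ENOSPC;   /* worth warning about */
    default:
        printf("Unexpected error %d from H_RESIZE_HPT_PREPARE\n", rc);
        return -EIO;      /* genuinely unexpected */
    }
}
```

Expected or caller-induced failures propagate an errno silently; only the surprising cases print.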

Thanks,
Laurent


Re: [PATCH v2] powerpc/mm: move a KERN_WARNING message to pr_debug()

2019-02-07 Thread Laurent Vivier
On 07/02/2019 05:33, Michael Ellerman wrote:
> Hi Laurent,
> 
> I'm not sure I'm convinced about this one. It seems like we're just
> throwing away the warning because it's annoying.
> 
> Laurent Vivier  writes:
>> resize_hpt_for_hotplug() reports a warning when it cannot
>> increase the hash page table ("Unable to resize hash page
>> table to target order") but this is not blocking and
>> can make the user think something has not worked properly.
> 
> Something did not work properly, the resize didn't work properly. Right?
> 
>> As we move the message to the debug area, report again the
>> ENODEV error.
>>
>> If the operation cannot be done the real error message
>> will be reported by arch_add_memory() if create_section_mapping()
>> fails.
> 
> Can you explain that more. Isn't the fact that the resize failed "the
> real error message"?

In arch_add_memory(), after calling resize_hpt_for_hotplug(), it calls
create_section_mapping(), and if create_section_mapping() fails we will
get the error message "Unable to create mapping for hot added memory"
and the real error code (EFAULT).

> 
>> Fixes: 7339390d772dd
>>powerpc/pseries: Don't give a warning when HPT resizing isn't 
>> available
> 
> This should all be on one line, and formatted as:
> 
> Fixes: 7339390d772d ("powerpc/pseries: Don't give a warning when HPT resizing isn't available")
> 
> See Documentation/process/submitting-patches.rst for more info and how
> to configure git to do it automatically for you.

Thank you, I didn't know the format was documented.

Thanks,
Laurent



[PATCH v2] powerpc/mm: move a KERN_WARNING message to pr_debug()

2019-02-05 Thread Laurent Vivier
resize_hpt_for_hotplug() reports a warning when it cannot
increase the hash page table ("Unable to resize hash page
table to target order"), but this is not blocking and
can make the user think something has not worked properly.
As we move the message to the debug area, report the
ENODEV error again as well.

If the operation cannot be done the real error message
will be reported by arch_add_memory() if create_section_mapping()
fails.

Fixes: 7339390d772dd
   powerpc/pseries: Don't give a warning when HPT resizing isn't available
Signed-off-by: Laurent Vivier 
---

Notes:
v2:
 - use pr_debug instead of printk(KERN_DEBUG
 - remove check for ENODEV

 arch/powerpc/mm/hash_utils_64.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0cc7fbc3bd1c..6a0cc4eb2c83 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -777,10 +777,9 @@ void resize_hpt_for_hotplug(unsigned long new_mem_size)
int rc;
 
rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
-   if (rc && (rc != -ENODEV))
-   printk(KERN_WARNING
-  "Unable to resize hash page table to target order %d: %d\n",
-  target_hpt_shift, rc);
+   if (rc)
+   pr_debug("Unable to resize hash page table to target order %d: %d\n",
+target_hpt_shift, rc);
}
 }
 
-- 
2.20.1



[PATCH] powerpc/mm: move a KERN_WARNING message to KERN_DEBUG

2019-02-05 Thread Laurent Vivier
resize_hpt_for_hotplug() reports a warning when it cannot
increase the hash page table ("Unable to resize hash page
table to target order"), but this is not blocking and
can make the user think something has not worked properly.

If the operation cannot be done the real error message
will be reported by arch_add_memory() if create_section_mapping()
fails.

Signed-off-by: Laurent Vivier 
---
 arch/powerpc/mm/hash_utils_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0cc7fbc3bd1c..b762bdceb510 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -778,7 +778,7 @@ void resize_hpt_for_hotplug(unsigned long new_mem_size)
 
rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
if (rc && (rc != -ENODEV))
-   printk(KERN_WARNING
+   printk(KERN_DEBUG
   "Unable to resize hash page table to target order %d: %d\n",
   target_hpt_shift, rc);
}
-- 
2.20.1



Re: [PATCH] powerpc/numa: fix hot-added CPU on memory-less node

2018-11-21 Thread Laurent Vivier
On 15/11/2018 10:19, Satheesh Rajendran wrote:
> On Wed, Nov 14, 2018 at 06:03:19PM +0100, Laurent Vivier wrote:
>> Trying to hotplug a CPU on an empty NUMA node (without
>> memory or CPU) crashes the kernel when the CPU is onlined.
>>
>> During the onlining process, the kernel calls start_secondary()
>> that ends by calling
>> set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]))
>> that relies on NODE_DATA(nid)->node_zonelists and in our case
>> NODE_DATA(nid) is NULL.
>>
>> To fix that, add the same checking as we already have in
>> find_and_online_cpu_nid(): if NODE_DATA() is NULL, use
>> the first online node.
>>
>> Bug: https://github.com/linuxppc/linux/issues/184
>> Fixes: ea05ba7c559c8e5a5946c3a94a2a266e9a6680a6 ("powerpc/numa: Ensure nodes initialized for hotplug")
>> Signed-off-by: Laurent Vivier 
>> ---
>>  arch/powerpc/mm/numa.c | 9 +
>>  1 file changed, 9 insertions(+)
> 
> This patch causes a regression for the cold-plug NUMA case (Case 1) and
> the hotplug + reboot case (Case 2), adding all vCPUs to node 0.
> 
> 
> Env: HW: Power8 Host.
> Kernel: 4.20-rc2 + this patch
> 
> Case 1:
> 1. boot a guest with 8 vcpus(all available), spreadout in 4 numa nodes.
> 8
> ...
>
>   
>   
>   
>   
> 
> 
> 2. Check lscpu --- all vcpus are added to node0 --> NOK
> 
> # lscpu
> Architecture:ppc64le
> Byte Order:  Little Endian
> CPU(s):  8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  1
> Core(s) per socket:  8
> Socket(s):   1
> NUMA node(s):4
> Model:   2.1 (pvr 004b 0201)
> Model name:  POWER8 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:   64K
> L1i cache:   32K
> NUMA node0 CPU(s):   0-7
> NUMA node1 CPU(s):   
> NUMA node2 CPU(s):   
> NUMA node3 CPU(s): 
> 
> without this patch it was working fine.
> # lscpu
> Architecture:ppc64le
> Byte Order:  Little Endian
> CPU(s):  8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  1
> Core(s) per socket:  8
> Socket(s):   1
> NUMA node(s):4
> Model:   2.1 (pvr 004b 0201)
> Model name:  POWER8 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:   64K
> L1i cache:   32K
> NUMA node0 CPU(s):   0,1
> NUMA node1 CPU(s):   2,3
> NUMA node2 CPU(s):   4,5
> NUMA node3 CPU(s):   6,7
> 
> 
> Case 2:
> 1. boot a guest with 8 vcpus(2 available, 6 possible), spreadout in 4 numa 
> nodes.
> 8
> ...
>
>   
>   
>   
>   
> 
> 
> 2. Hotplug all vcpus
> # lscpu
> Architecture:ppc64le
> Byte Order:  Little Endian
> CPU(s):  8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  1
> Core(s) per socket:  8
> Socket(s):   1
> NUMA node(s):2
> Model:   2.1 (pvr 004b 0201)
> Model name:  POWER8 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:   64K
> L1i cache:   32K
> NUMA node0 CPU(s):   0,1,4-7
> NUMA node1 CPU(s):   2,3
> 
> 
> 3. reboot the guest
> # lscpu
> Architecture:ppc64le
> Byte Order:  Little Endian
> CPU(s):  8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  1
> Core(s) per socket:  8
> Socket(s):   1
> NUMA node(s):4
> Model:   2.1 (pvr 004b 0201)
> Model name:  POWER8 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:   64K
> L1i cache:   32K
> NUMA node0 CPU(s):   0-7
> NUMA node1 CPU(s):
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> 
> 
> Without this patch, Case 2 crashes the guest during hotplug, i.e
> issue reported in https://github.com/linuxppc/linux/issues/184
> 
> Regards,
> -Satheesh.
> 
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 3a048e98a132..1b2d25a3c984 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -483,6 +483,15 @@ static int numa_setup_cpu(unsigned long lcpu)
>>  if (nid < 0 || !node_possible(nid))
>>  nid = first_online_node;
>>
>> +if (NODE_DATA(nid) == NULL) {
>> +/*
>> + * Default to using the nearest node that has memory installed.
>> + * Otherwise, it would be necessary to pat

Re: [PATCH] powerpc/numa: fix hot-added CPU on memory-less node

2018-11-15 Thread Laurent Vivier
On 15/11/2018 10:19, Satheesh Rajendran wrote:
> On Wed, Nov 14, 2018 at 06:03:19PM +0100, Laurent Vivier wrote:
>> Trying to hotplug a CPU on an empty NUMA node (without
>> memory or CPU) crashes the kernel when the CPU is onlined.
>>
>> During the onlining process, the kernel calls start_secondary()
>> that ends by calling
>> set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]))
>> that relies on NODE_DATA(nid)->node_zonelists and in our case
>> NODE_DATA(nid) is NULL.
>>
>> To fix that, add the same checking as we already have in
>> find_and_online_cpu_nid(): if NODE_DATA() is NULL, use
>> the first online node.
>>
>> Bug: https://github.com/linuxppc/linux/issues/184
>> Fixes: ea05ba7c559c8e5a5946c3a94a2a266e9a6680a6 ("powerpc/numa: Ensure nodes initialized for hotplug")
>> Signed-off-by: Laurent Vivier 
>> ---
>>  arch/powerpc/mm/numa.c | 9 +
>>  1 file changed, 9 insertions(+)
> 
> This patch causes a regression for the cold-plug NUMA case (Case 1) and
> the hotplug + reboot case (Case 2), adding all vCPUs to node 0.
> 
> 
> Env: HW: Power8 Host.
> Kernel: 4.20-rc2 + this patch
> 
> Case 1:
> 1. boot a guest with 8 vcpus(all available), spreadout in 4 numa nodes.
> 8
> ...
>
>   
>   
>   
>   
> 
> 
> 2. Check lscpu --- all vcpus are added to node0 --> NOK
> 
> # lscpu
...
> NUMA node0 CPU(s):   0-7
> NUMA node1 CPU(s):   
> NUMA node2 CPU(s):   
> NUMA node3 CPU(s): 
> 
> without this patch it was working fine.
> # lscpu
...
> NUMA node0 CPU(s):   0,1
> NUMA node1 CPU(s):   2,3
> NUMA node2 CPU(s):   4,5
> NUMA node3 CPU(s):   6,7
> 

Good point. Thank you.

I'm going to look into what happens, and how the cold-plug case manages to
online CPUs on nodes with NODE_DATA() set to NULL (because that is what
the patch changes).

Thanks,
Laurent


[PATCH] powerpc/numa: fix hot-added CPU on memory-less node

2018-11-14 Thread Laurent Vivier
Trying to hotplug a CPU on an empty NUMA node (without
memory or CPU) crashes the kernel when the CPU is onlined.

During the onlining process, the kernel calls start_secondary()
that ends by calling
set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]))
that relies on NODE_DATA(nid)->node_zonelists and in our case
NODE_DATA(nid) is NULL.

To fix that, add the same checking as we already have in
find_and_online_cpu_nid(): if NODE_DATA() is NULL, use
the first online node.
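The fallback the patch adds can be sketched as a small node-selection helper (the data structures below are stand-ins for the kernel's NODE_DATA() and first_online_node, purely for illustration):

```c
#include <assert.h>

#define MAX_NODES 4

/* Stand-in for per-node pgdat pointers: 0 means NODE_DATA() == NULL. */
static int node_data[MAX_NODES];
static int first_online_node = 0;

/* Pick the node a hot-added CPU should be mapped to: fall back to the
 * first online node when the looked-up nid is impossible or when the
 * node has no pgdat (the memoryless/cpuless case that crashed). */
static int pick_node(int nid)
{
    if (nid < 0 || nid >= MAX_NODES)   /* !node_possible(nid) */
        nid = first_online_node;
    if (node_data[nid] == 0)           /* NODE_DATA(nid) == NULL */
        nid = first_online_node;
    return nid;
}
```

A node with an allocated pgdat is used as-is; an empty node is redirected, so set_numa_mem() never dereferences a NULL NODE_DATA().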

Bug: https://github.com/linuxppc/linux/issues/184
Fixes: ea05ba7c559c8e5a5946c3a94a2a266e9a6680a6 ("powerpc/numa: Ensure nodes initialized for hotplug")
Signed-off-by: Laurent Vivier 
---
 arch/powerpc/mm/numa.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3a048e98a132..1b2d25a3c984 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -483,6 +483,15 @@ static int numa_setup_cpu(unsigned long lcpu)
if (nid < 0 || !node_possible(nid))
nid = first_online_node;
 
+   if (NODE_DATA(nid) == NULL) {
+   /*
+* Default to using the nearest node that has memory installed.
+* Otherwise, it would be necessary to patch the kernel MM code
+* to deal with more memoryless-node error conditions.
+*/
+   nid = first_online_node;
+   }
+
map_cpu_to_node(lcpu, nid);
of_node_put(cpu);
 out:
-- 
2.17.2



Re: [PATCH v2 08/12] macintosh/via-pmu68k: Don't load driver on unsupported hardware

2018-06-12 Thread Laurent Vivier
On 12/06/2018 01:47, Finn Thain wrote:
> On Sun, 10 Jun 2018, Benjamin Herrenschmidt wrote:
...
> I don't know what the bootloader situation is, but it looks messy...
> http://nubus-pmac.sourceforge.net/#booters
> 
> Laurent, does Emile work on these machines?
> 

No, Emile doesn't work on pmac-nubus. I tried to implement the switch
from m68k to ppc, but it has never worked.
Laurent


[PATCH] KVM: PPC: Book3S HV: Fix guest time accounting with VIRT_CPU_ACCOUNTING_GEN

2018-03-02 Thread Laurent Vivier
Since commit 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing
for signals on guest entry"), if CONFIG_VIRT_CPU_ACCOUNTING_GEN is set, the
guest time is not accounted to guest time and user time, but instead to
system time.

This is because guest_enter()/guest_exit() are called while interrupts
are disabled and the tick counter cannot be updated between them.

To fix that, move guest_exit() after local_irq_enable(), and as
guest_enter() is called with IRQs disabled, call guest_enter_irqoff()
instead.
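The race can be illustrated with a toy model (a simulation, not kernel code): a pending tick can only fire once IRQs are re-enabled, so whether it is charged to guest or system time depends entirely on whether guest_exit() runs before or after local_irq_enable():

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_enabled, in_guest, tick_pending;
static int guest_time, system_time;

/* A pending tick fires only while IRQs are enabled, and is charged to
 * whichever context we appear to be in at that moment. */
static void maybe_fire_tick(void)
{
    if (irqs_enabled && tick_pending) {
        if (in_guest)
            guest_time++;
        else
            system_time++;
        tick_pending = false;
    }
}

static void sim_local_irq_enable(void) { irqs_enabled = true; maybe_fire_tick(); }
static void sim_guest_exit(void)       { in_guest = false; }

/* Run one simulated guest exit with a tick pending; returns the guest
 * time that got accounted. */
static int run_guest(bool exit_after_irq_enable)
{
    irqs_enabled = false;
    in_guest = true;
    tick_pending = true;
    guest_time = system_time = 0;

    if (exit_after_irq_enable) {   /* fixed ordering */
        sim_local_irq_enable();
        sim_guest_exit();
    } else {                       /* buggy ordering */
        sim_guest_exit();
        sim_local_irq_enable();
    }
    return guest_time;
}
```

With the buggy ordering the tick fires after guest_exit() and is charged to system time; with the fixed ordering it lands in guest time as expected.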

Fixes: 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry")
Signed-off-by: Laurent Vivier <lviv...@redhat.com>
---
 arch/powerpc/kvm/book3s_hv.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 89707354c2ef..8274c2807202 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2885,7 +2885,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 */
trace_hardirqs_on();
 
-   guest_enter();
+   guest_enter_irqoff();
 
srcu_idx = srcu_read_lock(&vc->kvm->srcu);
 
@@ -2893,8 +2893,6 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 
srcu_read_unlock(&vc->kvm->srcu, srcu_idx);
 
-   guest_exit();
-
trace_hardirqs_off();
set_irq_happened(trap);
 
@@ -2937,6 +2935,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
kvmppc_set_host_core(pcpu);
 
local_irq_enable();
+   guest_exit();
 
/* Let secondaries go back to the offline loop */
for (i = 0; i < controlled_threads; ++i) {
-- 
2.14.3



Re: [PATCH v2] powerpc/via-pmu: Fix section mismatch warning

2018-02-15 Thread Laurent Vivier
On 14/02/2018 22:15, Mathieu Malaterre wrote:
> Make the struct via_pmu_driver const to avoid following warning:
> 
> WARNING: vmlinux.o(.data+0x4739c): Section mismatch in reference from the 
> variable via_pmu_driver to the function .init.text:pmu_init()
> The variable via_pmu_driver references
> the function __init pmu_init()
> If the reference is valid then annotate the
> variable with __init* or __refdata (see linux/init.h) or name the variable:
> *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console
> 
> Signed-off-by: Mathieu Malaterre <ma...@debian.org>
> Suggested-by: Laurent Vivier <lviv...@redhat.com>
> ---
> v2: pmu_init() is really an init function, leave __init marker
> 
>  drivers/macintosh/via-pmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
> index 94c0f3f7df69..fc56c7067732 100644
> --- a/drivers/macintosh/via-pmu.c
> +++ b/drivers/macintosh/via-pmu.c
> @@ -198,7 +198,7 @@ static const struct file_operations pmu_battery_proc_fops;
>  static const struct file_operations pmu_options_proc_fops;
>  
>  #ifdef CONFIG_ADB
> -struct adb_driver via_pmu_driver = {
> +const struct adb_driver via_pmu_driver = {
>   "PMU",
>   pmu_probe,
>   pmu_init,
> 

Reviewed-by: Laurent Vivier <lviv...@redhat.com>




Re: [PATCH] powerpc/via-pmu: Fix section mismatch warning

2018-02-13 Thread Laurent Vivier
On 07/02/2018 20:44, Mathieu Malaterre wrote:
> Remove the __init annotation from pmu_init() to avoid the
> following warning.
> 
> WARNING: vmlinux.o(.data+0x4739c): Section mismatch in reference from the 
> variable via_pmu_driver to the function .init.text:pmu_init()
> The variable via_pmu_driver references
> the function __init pmu_init()
> If the reference is valid then annotate the
> variable with __init* or __refdata (see linux/init.h) or name the variable:
> *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console
> 
> Signed-off-by: Mathieu Malaterre 
> ---
>  drivers/macintosh/via-pmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
> index 08849e33c567..5f378272d5b2 100644
> --- a/drivers/macintosh/via-pmu.c
> +++ b/drivers/macintosh/via-pmu.c
> @@ -378,7 +378,7 @@ static int pmu_probe(void)
>   return vias == NULL? -ENODEV: 0;
>  }
>  
> -static int __init pmu_init(void)
> +static int pmu_init(void)
>  {
>   if (vias == NULL)
>   return -ENODEV;
> 

pmu_init() is really an init function only called by another init
function (adb_init()).

So I think it would be good to keep the __init marker.

Did you try:

--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -198,7 +198,7 @@ static const struct file_operations
pmu_battery_proc_fops;
 static const struct file_operations pmu_options_proc_fops;

 #ifdef CONFIG_ADB
-struct adb_driver via_pmu_driver = {
+const struct adb_driver via_pmu_driver = {
"PMU",
pmu_probe,
pmu_init,


Thanks,
Laurent


Re: [PATCH kernel v2] powerpc/mm: Flush radix process translations when setting MMU type

2018-02-13 Thread Laurent Vivier
On 07/02/2018 18:49, Daniel Henrique Barboza wrote:
> 
> 
> On 02/07/2018 12:33 PM, Laurent Vivier wrote:
>> On 01/02/2018 06:09, Alexey Kardashevskiy wrote:
>>> Radix guests do normally invalidate process-scoped translations when
>>> a new pid is allocated but migrated guests do not invalidate these so
>>> migrated guests crash sometime, especially easy to reproduce with
>>> migration happening within first 10 seconds after the guest boot
>>> start on
>>> the same machine.
>>>
>>> This adds the "Invalidate process-scoped translations" flush to fix
>>> radix guests migration.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>> ---
>>> Changes:
>>> v2:
>>> * removed PPC_TLBIE_5() from the !(old_HR) case as it is pointless
>>> on hash
>>>
>>> ---
>>>
>>>
>>> Not so sure that "process-scoped translations" only require flushing
>>> at pid allocation and migration.
>>>
>>> ---
>>>   arch/powerpc/mm/pgtable_64.c | 2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>>> index c9a623c..d75dd52 100644
>>> --- a/arch/powerpc/mm/pgtable_64.c
>>> +++ b/arch/powerpc/mm/pgtable_64.c
>>> @@ -471,6 +471,8 @@ void mmu_partition_table_set_entry(unsigned int
>>> lpid, unsigned long dw0,
>>>   if (old & PATB_HR) {
>>>   asm volatile(PPC_TLBIE_5(%0,%1,2,0,1) : :
>>>    "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
>>> +    asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
>>> + "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
>>>   trace_tlbie(lpid, 0, TLBIEL_INVAL_SET_LPID, lpid, 2, 0, 1);
>>>   } else {
>>>   asm volatile(PPC_TLBIE_5(%0,%1,2,0,0) : :
>>>
>> This patch fixes for me a VM migration crash on POWER9.
> 
> Same here.
> 
> Tested-by: Daniel Henrique Barboza <danie...@linux.vnet.ibm.com>
> 
>>
>> Tested-by: Laurent Vivier <lviv...@redhat.com>

Any hope to have this patch merged soon?

It fixes a real problem, and VM migration is not reliable without it.

Thanks,
Laurent



Re: [PATCH kernel v2] powerpc/mm: Flush radix process translations when setting MMU type

2018-02-07 Thread Laurent Vivier
On 01/02/2018 06:09, Alexey Kardashevskiy wrote:
> Radix guests do normally invalidate process-scoped translations when
> a new pid is allocated but migrated guests do not invalidate these so
> migrated guests crash sometime, especially easy to reproduce with
> migration happening within first 10 seconds after the guest boot start on
> the same machine.
> 
> This adds the "Invalidate process-scoped translations" flush to fix
> radix guests migration.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> ---
> Changes:
> v2:
> * removed PPC_TLBIE_5() from the !(old_HR) case as it is pointless
> on hash
> 
> ---
> 
> 
> Not so sure that "process-scoped translations" only require flushing
> at pid allocation and migration.
> 
> ---
>  arch/powerpc/mm/pgtable_64.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index c9a623c..d75dd52 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -471,6 +471,8 @@ void mmu_partition_table_set_entry(unsigned int lpid, 
> unsigned long dw0,
>   if (old & PATB_HR) {
>   asm volatile(PPC_TLBIE_5(%0,%1,2,0,1) : :
>"r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
> + asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
> +  "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
>   trace_tlbie(lpid, 0, TLBIEL_INVAL_SET_LPID, lpid, 2, 0, 1);
>   } else {
>   asm volatile(PPC_TLBIE_5(%0,%1,2,0,0) : :
> 

This patch fixes for me a VM migration crash on POWER9.

Tested-by: Laurent Vivier <lviv...@redhat.com>

Thanks,
Laurent


Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-21 Thread Laurent Vivier
On 22/12/2017 08:54, Paul Mackerras wrote:
> On Fri, Dec 22, 2017 at 03:34:20PM +1100, Michael Ellerman wrote:
>> Laurent Vivier <lviv...@redhat.com> writes:
>>
>>> On 12/12/2017 13:02, Cédric Le Goater wrote:
>>>> When restoring a pending interrupt, we are setting the Q bit to force
>>>> a retrigger in xive_finish_unmask(). But we also need to force an EOI
>>>> in this case to reach the same initial state : P=1, Q=0.
>>>>
>>>> This can be done by not setting 'old_p' for pending interrupts which
>>>> will inform xive_finish_unmask() that an EOI needs to be sent.
>>>>
>>>> Suggested-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
>>>> Signed-off-by: Cédric Le Goater <c...@kaod.org>
>>>> ---
>>>>
>>>>  Tested with a guest running iozone.
>>>>
>>>>  arch/powerpc/kvm/book3s_xive.c | 4 ++--
>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> We really need this patch to fix VM migration on POWER9.
>>> When will it be merged?
>>
>> Paul is away, so I'll merge it via the powerpc tree.
>>
>> I'll mark it:
>>
>>   Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE 
>> interrupt controller")
>>   Cc: sta...@vger.kernel.org # v4.12+
> 
> Thanks for doing that.
> 
> If you felt like merging Alexey's patch "KVM: PPC: Book3S PR: Fix WIMG
> handling under pHyp" with my acked-by, that would be fine too.  The
> commit message needs a little work - the reason for using HPTE_R_M is
> not just because it seems to work, but because current POWER
> processors require M set on mappings for normal pages, and pHyp
> enforces that.

We also need:

KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()

Thanks,
Laurent


Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-20 Thread Laurent Vivier
On 12/12/2017 13:02, Cédric Le Goater wrote:
> When restoring a pending interrupt, we are setting the Q bit to force
> a retrigger in xive_finish_unmask(). But we also need to force an EOI
> in this case to reach the same initial state : P=1, Q=0.
> 
> This can be done by not setting 'old_p' for pending interrupts which
> will inform xive_finish_unmask() that an EOI needs to be sent.
> 
> Suggested-by: Benjamin Herrenschmidt 
> Signed-off-by: Cédric Le Goater 
> ---
> 
>  Tested with a guest running iozone.
> 
>  arch/powerpc/kvm/book3s_xive.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

We really need this patch to fix VM migration on POWER9.
When will it be merged?

Thanks,
Laurent


[PATCH] KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()

2017-12-12 Thread Laurent Vivier
When we migrate a VM from a POWER8 host (XICS) to a POWER9 host
(XICS-on-XIVE), we have an error:

qemu-kvm: Unable to restore KVM interrupt controller state \
  (0xff00) for CPU 0: Invalid argument

This is because kvmppc_xics_set_icp() checks that the new state
is internally consistent, in particular:

...
   1129 if (xisr == 0) {
   1130 if (pending_pri != 0xff)
   1131 return -EINVAL;
...

On the other side, kvmppc_xive_get_icp() sets neither the
pending_pri value nor the xisr value (which is left at 0),
and kvmppc_xive_set_icp() ignores the pending_pri value.

As xisr is 0, pending_pri must be set to 0xff.

Signed-off-by: Laurent Vivier <lviv...@redhat.com>
---
 arch/powerpc/kvm/book3s_xive.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index bf457843e032..e79062573f8c 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -725,7 +725,8 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu)
 
/* Return the per-cpu state for state saving/migration */
return (u64)xc->cppr << KVM_REG_PPC_ICP_CPPR_SHIFT |
-  (u64)xc->mfrr << KVM_REG_PPC_ICP_MFRR_SHIFT;
+  (u64)xc->mfrr << KVM_REG_PPC_ICP_MFRR_SHIFT |
+  (u64)0xff << KVM_REG_PPC_ICP_PPRI_SHIFT;
 }
 
 int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
-- 
2.14.3



Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-12 Thread Laurent Vivier
On 12/12/2017 13:02, Cédric Le Goater wrote:
> When restoring a pending interrupt, we are setting the Q bit to force
> a retrigger in xive_finish_unmask(). But we also need to force an EOI
> in this case to reach the same initial state : P=1, Q=0.
> 
> This can be done by not setting 'old_p' for pending interrupts which
> will inform xive_finish_unmask() that an EOI needs to be sent.
> 
> Suggested-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> Signed-off-by: Cédric Le Goater <c...@kaod.org>
> ---
> 
>  Tested with a guest running iozone.
> 
>  arch/powerpc/kvm/book3s_xive.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index bf457843e032..b5e6d227a034 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -1558,7 +1558,7 @@ static int xive_set_source(struct kvmppc_xive *xive, 
> long irq, u64 addr)
>  
>   /*
>* Restore P and Q. If the interrupt was pending, we
> -  * force both P and Q, which will trigger a resend.
> +  * force Q and !P, which will trigger a resend.
>*
>* That means that a guest that had both an interrupt
>* pending (queued) and Q set will restore with only
> @@ -1566,7 +1566,7 @@ static int xive_set_source(struct kvmppc_xive *xive, 
> long irq, u64 addr)
>* is perfectly fine as coalescing interrupts that haven't
>* been presented yet is always allowed.
>*/
> - if (val & KVM_XICS_PRESENTED || val & KVM_XICS_PENDING)
> + if (val & KVM_XICS_PRESENTED && !(val & KVM_XICS_PENDING))
>       state->old_p = true;
>   if (val & KVM_XICS_QUEUED || val & KVM_XICS_PENDING)
>   state->old_q = true;
> 

Reviewed-by: Laurent Vivier <lviv...@redhat.com>
Tested-by: Laurent Vivier <lviv...@redhat.com>



Re: [PATCH] powerpc/xive: store server for masked interrupt in kvmppc_xive_set_xive()

2017-12-05 Thread Laurent Vivier
On 05/12/2017 04:05, Paul Mackerras wrote:
> On Fri, Nov 24, 2017 at 07:38:13AM +1100, Benjamin Herrenschmidt wrote:
>> On Thu, 2017-11-23 at 10:06 +0100, Laurent Vivier wrote:
>>> This is needed to map kvmppc_xive_set_xive() behavior
>>> to kvmppc_xics_set_xive().
>>>
>>> As we store the server, kvmppc_xive_get_xive() can return
>>> the good value and we can also allow kvmppc_xive_int_on().
>>>
>>> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
>>> ---
>>>  arch/powerpc/kvm/book3s_xive.c | 20 
>>>  1 file changed, 8 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
>>> index bf457843e032..2781b8733038 100644
>>> --- a/arch/powerpc/kvm/book3s_xive.c
>>> +++ b/arch/powerpc/kvm/book3s_xive.c
>>> @@ -584,10 +584,14 @@ int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, 
>>> u32 server,
>>>  *   we could initialize interrupts with valid default
>>>  */
>>>  
>>> -   if (new_act_prio != MASKED &&
>>> -   (state->act_server != server ||
>>> -state->act_priority != new_act_prio))
>>> -   rc = xive_target_interrupt(kvm, state, server, new_act_prio);
>>> +   if (state->act_server != server ||
>>> +   state->act_priority != new_act_prio) {
>>> +   if (new_act_prio != MASKED)
>>> +   rc = xive_target_interrupt(kvm, state, server,
>>> +  new_act_prio);
>>> +   if (!rc)
>>> +   state->act_server = server;
>>> +   }
>>
>> That leads to another problem with this code. My current implementation
>> is such that is a target queue is full, it will pick another target.
>> But here, we still update act_server to the passed-in server and
>> not the actual target...
> 
> So does that amount to a NAK?
> 
>>> /*
>>>  * Perform the final unmasking of the interrupt source
>>> @@ -646,14 +650,6 @@ int kvmppc_xive_int_on(struct kvm *kvm, u32 irq)
>>>  
>>> pr_devel("int_on(irq=0x%x)\n", irq);
>>>  
>>> -   /*
>>> -* Check if interrupt was not targetted
>>> -*/
>>> -   if (state->act_priority == MASKED) {
>>> -   pr_devel("int_on on untargetted interrupt\n");
>>> -   return -EINVAL;
>>> -   }
>>> -
>>
>> So my thinking here was that act_priority was never going to be MASKED
>> except if the interrupt had never been targetted anywhere at machine
>> startup time. Thus if act_priority is masked, the act_server field
>> cannot be trusted.
>>
>>> /* If saved_priority is 0xff, do nothing */
>>> if (state->saved_priority == MASKED)
>>> return 0;
> 
> How do you think this should be fixed?
> 
> Laurent, are you reworking the patch at the moment?

Not for the moment.

The easy way is to forbid setting an interrupt to the MASKED priority
with xive_set_xive(). I think that's allowed by the specs.

I've got another bug in the XICS emulation: when we migrate a guest
under stress, the pending interrupt is lost and the guest hangs on the
destination side. I'm trying to understand why.

Thanks,
Laurent



[PATCH] powerpc/xive: store server for masked interrupt in kvmppc_xive_set_xive()

2017-11-23 Thread Laurent Vivier
This is needed to make kvmppc_xive_set_xive() behave like
kvmppc_xics_set_xive().

As we store the server, kvmppc_xive_get_xive() can return
the correct value and we can also allow kvmppc_xive_int_on().

Signed-off-by: Laurent Vivier <lviv...@redhat.com>
---
 arch/powerpc/kvm/book3s_xive.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index bf457843e032..2781b8733038 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -584,10 +584,14 @@ int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 
server,
 *   we could initialize interrupts with valid default
 */
 
-   if (new_act_prio != MASKED &&
-   (state->act_server != server ||
-state->act_priority != new_act_prio))
-   rc = xive_target_interrupt(kvm, state, server, new_act_prio);
+   if (state->act_server != server ||
+   state->act_priority != new_act_prio) {
+   if (new_act_prio != MASKED)
+   rc = xive_target_interrupt(kvm, state, server,
+  new_act_prio);
+   if (!rc)
+   state->act_server = server;
+   }
 
/*
 * Perform the final unmasking of the interrupt source
@@ -646,14 +650,6 @@ int kvmppc_xive_int_on(struct kvm *kvm, u32 irq)
 
pr_devel("int_on(irq=0x%x)\n", irq);
 
-   /*
-* Check if interrupt was not targetted
-*/
-   if (state->act_priority == MASKED) {
-   pr_devel("int_on on untargetted interrupt\n");
-   return -EINVAL;
-   }
-
/* If saved_priority is 0xff, do nothing */
if (state->saved_priority == MASKED)
return 0;
-- 
2.13.6



Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

2017-08-24 Thread Laurent Vivier
On 23/08/2017 15:26, Tejun Heo wrote:
> Hello, Michael.
> 
> On Wed, Aug 23, 2017 at 09:00:39PM +1000, Michael Ellerman wrote:
>>> I don't think that's true.  The CPU id used in kernel doesn't have to
>>> match the physical one and arch code should be able to pre-map CPU IDs
>>> to nodes and use the matching one when hotplugging CPUs.  I'm not
>>> saying that's the best way to solve the problem tho.
>>
>> We already virtualise the CPU numbers, but not the node IDs. And it's
>> the node IDs that are really the problem.
> 
> Yeah, it just needs to match up new cpus to the cpu ids assigned to
> the right node.

We are not able to assign the cpu ids to the right node before the CPU
is present, because the firmware doesn't provide the CPU <-> node id
mapping before that point.

>>> It could be that the best way forward is making cpu <-> node mapping
>>> dynamic and properly synchronized.
>>
>> We don't need it to be dynamic (at least for this bug).
> 
> The node mapping for that cpu id changes *dynamically* while the
> system is running and that can race with node-affinity sensitive
> operations such as memory allocations.

Memory is mapped to a node through its own firmware entry, so I don't
think a cpu id change can affect memory affinity; and before we know the
node id of a CPU, the CPU is not present and thus can't use memory.

>> Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
>> that because some CPUs aren't present at boot we don't know what the
>> node mapping is. (Correct me if I'm wrong Laurent).
>>
>> So all we need is:
>>  - the workqueue code to cope with CPUs that are possible but not online
>>having NUMA_NO_NODE to begin with.
>>  - a way to update the workqueue cpumask when the CPU comes online.
>>
>> Which seems reasonable to me?
> 
> Please take a step back and think through the problem again.  You
> can't bandaid it this way.

Could you give some ideas or proposals?
As the firmware doesn't provide the information before the CPU is
actually plugged, I don't see how to handle this problem.

Thanks,
Laurent



Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

2017-08-23 Thread Laurent Vivier
On 23/08/2017 13:00, Michael Ellerman wrote:
> Hi Tejun,
> 
> Tejun Heo  writes:
>> Hello, Michael.
>>
>> On Tue, Aug 22, 2017 at 11:41:41AM +1000, Michael Ellerman wrote:
 This is something powerpc needs to fix.
>>>
>>> There is no way for us to fix it.
>>
>> I don't think that's true.  The CPU id used in kernel doesn't have to
>> match the physical one and arch code should be able to pre-map CPU IDs
>> to nodes and use the matching one when hotplugging CPUs.  I'm not
>> saying that's the best way to solve the problem tho.
> 
> We already virtualise the CPU numbers, but not the node IDs. And it's
> the node IDs that are really the problem.
> 
> So yeah I guess we might be able to make that work, but I'd have to
> think about it a bit more.
> 
>> It could be that the best way forward is making cpu <-> node mapping
>> dynamic and properly synchronized.
> 
> We don't need it to be dynamic (at least for this bug).
> 
> Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
> that because some CPUs aren't present at boot we don't know what the
> node mapping is. (Correct me if I'm wrong Laurent).

You're correct.

Qemu is started with:

-numa node,cpus=0-1 -numa node,cpus=2-3 \
-smp 2,maxcpus=4,cores=4,threads=1,sockets=1

Which means we have 2 nodes with cpu ids 0 and 1 on node 0, and cpu ids
2 and 3 on node 1, but at boot only 2 CPUs are present.

The problem I am trying to fix with this series occurs when we hotplug
a third CPU, with id 2, on node 1.
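For illustration, with the topology above, the third vCPU can be hotplugged through the QEMU monitor. This is a hypothetical session: the device name and id below are assumptions for the pseries machine type, and the exact syntax depends on the QEMU version.

```
# QEMU monitor, topology from the command line above:
# core-id=2 falls in "-numa node,cpus=2-3", i.e. node 1
(qemu) device_add host-spapr-cpu-core,id=core2,core-id=2
```

At this point the guest kernel onlines CPU id 2, but the workqueue code still has that cpu id recorded on node 0.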

Thanks,
Laurent


[PATCH 2/2] blk-mq: don't use WORK_CPU_UNBOUND

2017-08-21 Thread Laurent Vivier
cpumask is the list of CPUs present when the queue is built.
If a new CPU is hotplugged, this list is not updated, so when
the scheduler asks for a CPU id, blk_mq_hctx_next_cpu()
can return WORK_CPU_UNBOUND. In this case __blk_mq_run_hw_queue()
can be executed by the new CPU (which is not present in cpumask)
and raises the following warning:

WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
cpu_online(hctx->next_cpu));

To fix this problem, this patch modifies blk_mq_hctx_next_cpu()
to only return a CPU id present in cpumask.

Signed-off-by: Laurent Vivier <lviv...@redhat.com>
---
 block/blk-mq.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..bdac1e654814 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1126,9 +1126,6 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
-   if (hctx->queue->nr_hw_queues == 1)
-   return WORK_CPU_UNBOUND;
-
if (--hctx->next_cpu_batch <= 0) {
int next_cpu;
 
-- 
2.13.5



[PATCH 1/2] powerpc/workqueue: update list of possible CPUs

2017-08-21 Thread Laurent Vivier
In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
is built.

Unfortunately, on powerpc, the Firmware is only able to provide the
node of a CPU if the CPU is present. So, in our case (possible CPUs),
the CPU ids are known, but as the CPUs are not present their node ids
are unknown, and all the unplugged CPUs are attached to node 0.

When we hotplug a CPU, there can be an inconsistency between
the node id known by the workqueue and the real node id of
the CPU.

This patch adds a hotplug handler so that the hotplugged CPU
updates its node entry in the wq_numa_possible_cpumask array.

Signed-off-by: Laurent Vivier <lviv...@redhat.com>
---
 arch/powerpc/kernel/smp.c |  1 +
 include/linux/workqueue.h |  1 +
 kernel/workqueue.c| 21 +
 3 files changed, 23 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8d3320562c70..abc533146514 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -964,6 +964,7 @@ void start_secondary(void *unused)
 
set_numa_node(numa_cpu_lookup_table[cpu]);
set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
+   wq_numa_add_possible_cpu(cpu);
 
smp_wmb();
notify_cpu_starting(cpu);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index db6dc9dc0482..4ef030dae22c 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -644,6 +644,7 @@ int workqueue_online_cpu(unsigned int cpu);
 int workqueue_offline_cpu(unsigned int cpu);
 #endif
 
+void wq_numa_add_possible_cpu(unsigned int cpu);
 int __init workqueue_init_early(void);
 int __init workqueue_init(void);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ca937b0c3a96..d1a99e25d5da 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5486,6 +5486,27 @@ static inline void wq_watchdog_init(void) { }
 
 #endif /* CONFIG_WQ_WATCHDOG */
 
+void wq_numa_add_possible_cpu(unsigned int cpu)
+{
+   int node;
+
+   if (num_possible_nodes() <= 1)
+   return;
+
+   if (wq_disable_numa)
+   return;
+
+   if (!wq_numa_enabled)
+   return;
+
+   node = cpu_to_node(cpu);
+   if (node == NUMA_NO_NODE)
+   return;
+
+   cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+}
+EXPORT_SYMBOL_GPL(wq_numa_add_possible_cpu);
+
 static void __init wq_numa_init(void)
 {
cpumask_var_t *tbl;
-- 
2.13.5



Re: [PATCH] powerpc/pseries: Check memory device state before onlining/offlining

2017-08-10 Thread Laurent Vivier
On 02/08/2017 20:03, Nathan Fontenot wrote:
> When DLPAR adding or removing memory we need to check the device
> offline status before trying to online/offline the memory. This is
> needed because calls to device_online() and device_offline() will return
> non-zero for memory that is already online and offline respectively.
> 
> This update resolves two scenarios. First, for kernel built with
> auto-online memory enabled, memory will be onlined as part of calls
> to add_memory(). After adding the memory the pseries dlpar code tries
> to online it and fails since the memory is already online. The dlpar
> code then tries to remove the memory which produces the oops message
> below because the memory is not offline.
> 
> The second scenario occurs when removing memory that is already offline,
> i.e. marking memory offline (via sysfs) and then trying to remove that
> memory. This doesn't work because offlining the already offline memory
> does not succeed and the dlpar code then fails the dlpar remove operation.
> 
> The fix for both scenarios is to check the device.offline status before
> making the calls to device_online() or device_offline().
> 
> kernel BUG at mm/memory_hotplug.c:2189!
> Oops: Exception in kernel mode, sig: 5 [#1]
> SMP NR_CPUS=2048
> NUMA
> pSeries
> CPU: 0 PID: 5 Comm: kworker/u129:0 Not tainted 4.12.0-rc3 #272
> Workqueue: pseries hotplug workque .pseries_hp_work_fn
> task: c003f9c89200 task.stack: c003f9d1
> NIP: c02ca428 LR: c02ca3cc CTR: c0ba16a0
> REGS: c003f9d13630 TRAP: 0700   Not tainted  (4.12.0-rc3)
> MSR: 8282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
>   CR: 22002024  XER: 000a
> CFAR: c02ca3d0 SOFTE: 1
> GPR00: c02ca3cc c003f9d138b0 c1bb0200 0001
> GPR04: c003fb143c80 c003fef21630 0003 0002
> GPR08: 0003 0003 0003 31b1
> GPR12: 28002042 cfd8 c0118ae0 c003fb170180
> GPR16:  0004 0010 c00379c8
> GPR20: c0037b68 c003f728ff84 0002 0010
> GPR24: 0002 c003f728ff80 0002 0001
> GPR28: c003fb143c38 0002 1000 2000
> NIP [c02ca428] .remove_memory+0xb8/0xc0
> LR [c02ca3cc] .remove_memory+0x5c/0xc0
> Call Trace:
> [c003f9d138b0] [c02ca3cc] .remove_memory+0x5c/0xc0 (unreliable)
> [c003f9d13940] [c00938a4] .dlpar_add_lmb+0x384/0x400
> [c003f9d13a30] [c009456c] .dlpar_memory+0x5dc/0xca0
> [c003f9d13af0] [c008ce84] .handle_dlpar_errorlog+0x74/0xe0
> [c003f9d13b70] [c008cf1c] .pseries_hp_work_fn+0x2c/0x90
> [c003f9d13bf0] [c0110a5c] .process_one_work+0x17c/0x460
> [c003f9d13c90] [c0110dc8] .worker_thread+0x88/0x500
> [c003f9d13d70] [c0118c3c] .kthread+0x15c/0x1a0
> [c003f9d13e30] [c000ba18] .ret_from_kernel_thread+0x58/0xc0
> Instruction dump:
> 7fe3fb78 4bd7c845 6000 7fa3eb78 4bfdd3c9 38210090 e8010010 eba1ffe8
> ebc1fff0 ebe1fff8 7c0803a6 4bfdc2ac <0fe0>  7c0802a6 fb01ffc0
> 
> Fixes: 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged 
> memory'")
> Signed-off-by: Nathan Fontenot <nf...@linux.vnet.ibm.com>

Tested the first scenario with 4.13.0-rc4 and qemu 2.10.0-rc2.

Tested-by: Laurent Vivier <lviv...@redhat.com>
Reviewed-by: Laurent Vivier <lviv...@redhat.com>


Re: [PATCH v2] powerpc/mm: Check for _PAGE_PTE in *_devmap()

2017-07-27 Thread Laurent Vivier
On 27/07/2017 17:35, ooh...@gmail.com (Oliver O'Halloran) wrote:
> The ISA radix translation tree contains two different types of entry,
> directories and leaves. The formats of the two entries are different
> with the directory entries containing no spare bits for use by software.
> As a result we need to ensure that the *_devmap() family of functions
> check fail for everything except leaf (PTE) entries.
> 
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> ---
> "i'll just tweak the mbox before i sent it, what's the worst that can happen"
> *completely breaks KVM*
> "..."
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index d1da415..6bc6248 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -610,7 +610,9 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>  
>  static inline int pte_devmap(pte_t pte)
>  {
> - return !!(pte_raw(pte) & cpu_to_be64(_PAGE_DEVMAP));
> + uint64_t mask = cpu_to_be64(_PAGE_DEVMAP | _PAGE_PTE);
> +
> + return (pte_raw(pte) & mask) == mask;
>  }
>  
>  static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> 

Tested-by: Laurent Vivier <lviv...@redhat.com>




[PATCH] powerpc/pseries: Fix of_node_put() underflow during pseries remove

2017-07-21 Thread Laurent Vivier
As for commit 68baf692c435 ("powerpc/pseries: Fix of_node_put()
underflow during DLPAR remove"), the call to of_node_put()
must be removed from pSeries_reconfig_remove_node().

dlpar_detach_node() and pSeries_reconfig_remove_node() call
of_detach_node(), and thus the node should not be released
in this case too.

Signed-off-by: Laurent Vivier <lviv...@redhat.com>
---
 arch/powerpc/platforms/pseries/reconfig.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/reconfig.c 
b/arch/powerpc/platforms/pseries/reconfig.c
index e5bf1e8..011ef21 100644
--- a/arch/powerpc/platforms/pseries/reconfig.c
+++ b/arch/powerpc/platforms/pseries/reconfig.c
@@ -82,7 +82,6 @@ static int pSeries_reconfig_remove_node(struct device_node 
*np)
 
of_detach_node(np);
of_node_put(parent);
-   of_node_put(np); /* Must decrement the refcount */
return 0;
 }
 
-- 
2.9.4



Re: [PATCH] kvm-pr: manage illegal instructions

2016-05-17 Thread Laurent Vivier


On 17/05/2016 10:37, Alexander Graf wrote:
> On 05/17/2016 10:35 AM, Laurent Vivier wrote:
>>
>> On 12/05/2016 16:23, Laurent Vivier wrote:
>>>
>>> On 12/05/2016 11:27, Alexander Graf wrote:
>>>> On 05/12/2016 11:10 AM, Laurent Vivier wrote:
>>>>> On 11/05/2016 13:49, Alexander Graf wrote:
>>>>>> On 05/11/2016 01:14 PM, Laurent Vivier wrote:
>>>>>>> On 11/05/2016 12:35, Alexander Graf wrote:
>>>>>>>> On 03/15/2016 09:18 PM, Laurent Vivier wrote:
>>>>>>>>> While writing some instruction tests for kvm-unit-tests for
>>>>>>>>> powerpc,
>>>>>>>>> I've found that illegal instructions are not managed correctly
>>>>>>>>> with
>>>>>>>>> kvm-pr,
>>>>>>>>> while it is fine with kvm-hv.
>>>>>>>>>
>>>>>>>>> When an illegal instruction (like ".long 0") is processed by
>>>>>>>>> kvm-pr,
>>>>>>>>> the kernel logs are filled with:
>>>>>>>>>
>>>>>>>>>  Couldn't emulate instruction 0x00000000 (op 0 xop 0)
>>>>>>>>>  kvmppc_handle_exit_pr: emulation at 700 failed (00000000)
>>>>>>>>>
>>>>>>>>> While the exception handler receives an interrupt for each
>>>>>>>>> instruction
>>>>>>>>> executed after the illegal instruction.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
>>>>>>>>> ---
>>>>>>>>>  arch/powerpc/kvm/book3s_emulate.c | 4 +++-
>>>>>>>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/arch/powerpc/kvm/book3s_emulate.c
>>>>>>>>> b/arch/powerpc/kvm/book3s_emulate.c
>>>>>>>>> index 2afdb9c..4ee969d 100644
>>>>>>>>> --- a/arch/powerpc/kvm/book3s_emulate.c
>>>>>>>>> +++ b/arch/powerpc/kvm/book3s_emulate.c
>>>>>>>>> @@ -99,7 +99,6 @@ int kvmppc_core_emulate_op_pr(struct kvm_run
>>>>>>>>> *run,
>>>>>>>>> struct kvm_vcpu *vcpu,
>>>>>>>>>switch (get_op(inst)) {
>>>>>>>>>  case 0:
>>>>>>>>> -emulated = EMULATE_FAIL;
>>>>>>>>>  if ((kvmppc_get_msr(vcpu) & MSR_LE) &&
>>>>>>>>>  (inst == swab32(inst_sc))) {
>>>>>>>>>  /*
>>>>>>>>> @@ -112,6 +111,9 @@ int kvmppc_core_emulate_op_pr(struct kvm_run
>>>>>>>>> *run,
>>>>>>>>> struct kvm_vcpu *vcpu,
>>>>>>>>>  kvmppc_set_gpr(vcpu, 3, EV_UNIMPLEMENTED);
>>>>>>>>>  kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
>>>>>>>>>  emulated = EMULATE_DONE;
>>>>>>>>> +} else {
>>>>>>>>> +kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
>>>>>>>> But isn't that exactly what the semantic of EMULATE_FAIL is?
>>>>>>>> Fixing it
>>>>>>>> up in book3s_emulate.c is definitely the wrong spot.
>>>>>>>>
>>>>>>>> So what is the problem you're trying to solve? Is the SRR0 at the
>>>>>>>> wrong
>>>>>>>> spot or are the log messages the problem?
>>>>>>> No, the problem is the host kernel logs are filled by the message
>>>>>>> and
>>>>>>> the execution hangs. And the host becomes unresponsiveness, even
>>>>>>> after
>>>>>>> the end of the tests.
>>>>>>>
>>>>>>> Please, try to run kvm-unit-tests (the emulator test) on a KVM-PR
>>>>>>> host,
>>>>>>> and check the kernel logs (dmesg), then try to ssh to the host...
>>>>>> Ok, so the log messages are the problem. Please fix the message
>>>>>> output
>>>>>> then - or remove it altogether.

Re: [PATCH] kvm-pr: manage illegal instructions

2016-05-17 Thread Laurent Vivier


On 12/05/2016 16:23, Laurent Vivier wrote:
> 
> 
> On 12/05/2016 11:27, Alexander Graf wrote:
>> On 05/12/2016 11:10 AM, Laurent Vivier wrote:
>>>
>>> On 11/05/2016 13:49, Alexander Graf wrote:
>>>> On 05/11/2016 01:14 PM, Laurent Vivier wrote:
>>>>> On 11/05/2016 12:35, Alexander Graf wrote:
>>>>>> On 03/15/2016 09:18 PM, Laurent Vivier wrote:
>>>>>>> While writing some instruction tests for kvm-unit-tests for powerpc,
>>>>>>> I've found that illegal instructions are not managed correctly with
>>>>>>> kvm-pr,
>>>>>>> while it is fine with kvm-hv.
>>>>>>>
>>>>>>> When an illegal instruction (like ".long 0") is processed by kvm-pr,
>>>>>>> the kernel logs are filled with:
>>>>>>>
>>>>>>> Couldn't emulate instruction 0x00000000 (op 0 xop 0)
>>>>>>> kvmppc_handle_exit_pr: emulation at 700 failed (00000000)
>>>>>>>
>>>>>>> While the exception handler receives an interrupt for each
>>>>>>> instruction
>>>>>>> executed after the illegal instruction.
>>>>>>>
>>>>>>> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
>>>>>>> ---
>>>>>>> arch/powerpc/kvm/book3s_emulate.c | 4 +++-
>>>>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/kvm/book3s_emulate.c
>>>>>>> b/arch/powerpc/kvm/book3s_emulate.c
>>>>>>> index 2afdb9c..4ee969d 100644
>>>>>>> --- a/arch/powerpc/kvm/book3s_emulate.c
>>>>>>> +++ b/arch/powerpc/kvm/book3s_emulate.c
>>>>>>> @@ -99,7 +99,6 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run,
>>>>>>> struct kvm_vcpu *vcpu,
>>>>>>>   switch (get_op(inst)) {
>>>>>>> case 0:
>>>>>>> -emulated = EMULATE_FAIL;
>>>>>>> if ((kvmppc_get_msr(vcpu) & MSR_LE) &&
>>>>>>> (inst == swab32(inst_sc))) {
>>>>>>> /*
>>>>>>> @@ -112,6 +111,9 @@ int kvmppc_core_emulate_op_pr(struct kvm_run
>>>>>>> *run,
>>>>>>> struct kvm_vcpu *vcpu,
>>>>>>> kvmppc_set_gpr(vcpu, 3, EV_UNIMPLEMENTED);
>>>>>>> kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
>>>>>>> emulated = EMULATE_DONE;
>>>>>>> +} else {
>>>>>>> +kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
>>>>>> But isn't that exactly what the semantic of EMULATE_FAIL is? Fixing it
>>>>>> up in book3s_emulate.c is definitely the wrong spot.
>>>>>>
>>>>>> So what is the problem you're trying to solve? Is the SRR0 at the
>>>>>> wrong
>>>>>> spot or are the log messages the problem?
>>>>> No, the problem is that the host kernel logs are filled with the message
>>>>> and the execution hangs. And the host becomes unresponsive, even after
>>>>> the end of the tests.
>>>>>
>>>>> Please try to run kvm-unit-tests (the emulator test) on a KVM-PR host,
>>>>> and check the kernel logs (dmesg), then try to ssh to the host...
>>>> Ok, so the log messages are the problem. Please fix the message output
>>>> then - or remove it altogether. Or if you like, create a module
>>>> parameter that allows you to emit them.
>>>>
>>>> I personally think the best solution would be to just convert the
>>>> message into a trace point.
>>>>
>>>> While at it, please see whether the guest can trigger similar host log
>>>> output excess in other code paths.
>>> The problem is not really with the log messages: they are a consequence
>>> of the bug I am trying to fix.
>>>
>>> What happens is that once kvm-pr decodes an invalid instruction, all the
>>> following valid instructions trigger a Program exception to the guest (but
>>> are executed correctly). It has no real consequence on a big machine like
>>> POWER8, except that the guest becomes very slow and the log files of th

Re: [PATCH] kvm-pr: manage illegal instructions

2016-05-12 Thread Laurent Vivier


On 12/05/2016 11:27, Alexander Graf wrote:
> On 05/12/2016 11:10 AM, Laurent Vivier wrote:
>>
>> On 11/05/2016 13:49, Alexander Graf wrote:
>>> On 05/11/2016 01:14 PM, Laurent Vivier wrote:
>>>> On 11/05/2016 12:35, Alexander Graf wrote:
>>>>> On 03/15/2016 09:18 PM, Laurent Vivier wrote:
>>>>>> While writing some instruction tests for kvm-unit-tests for powerpc,
>>>>>> I've found that illegal instructions are not managed correctly with
>>>>>> kvm-pr,
>>>>>> while it is fine with kvm-hv.
>>>>>>
>>>>>> When an illegal instruction (like ".long 0") is processed by kvm-pr,
>>>>>> the kernel logs are filled with:
>>>>>>
>>>>>> Couldn't emulate instruction 0x00000000 (op 0 xop 0)
>>>>>> kvmppc_handle_exit_pr: emulation at 700 failed (00000000)
>>>>>>
>>>>>> While the exception handler receives an interrupt for each
>>>>>> instruction
>>>>>> executed after the illegal instruction.
>>>>>>
>>>>>> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
>>>>>> ---
>>>>>> arch/powerpc/kvm/book3s_emulate.c | 4 +++-
>>>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/kvm/book3s_emulate.c
>>>>>> b/arch/powerpc/kvm/book3s_emulate.c
>>>>>> index 2afdb9c..4ee969d 100644
>>>>>> --- a/arch/powerpc/kvm/book3s_emulate.c
>>>>>> +++ b/arch/powerpc/kvm/book3s_emulate.c
>>>>>> @@ -99,7 +99,6 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run,
>>>>>> struct kvm_vcpu *vcpu,
>>>>>>   switch (get_op(inst)) {
>>>>>> case 0:
>>>>>> -emulated = EMULATE_FAIL;
>>>>>> if ((kvmppc_get_msr(vcpu) & MSR_LE) &&
>>>>>> (inst == swab32(inst_sc))) {
>>>>>> /*
>>>>>> @@ -112,6 +111,9 @@ int kvmppc_core_emulate_op_pr(struct kvm_run
>>>>>> *run,
>>>>>> struct kvm_vcpu *vcpu,
>>>>>> kvmppc_set_gpr(vcpu, 3, EV_UNIMPLEMENTED);
>>>>>> kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
>>>>>> emulated = EMULATE_DONE;
>>>>>> +} else {
>>>>>> +kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
>>>>> But isn't that exactly what the semantic of EMULATE_FAIL is? Fixing it
>>>>> up in book3s_emulate.c is definitely the wrong spot.
>>>>>
>>>>> So what is the problem you're trying to solve? Is the SRR0 at the
>>>>> wrong
>>>>> spot or are the log messages the problem?
>>>> No, the problem is that the host kernel logs are filled with the message
>>>> and the execution hangs. And the host becomes unresponsive, even after
>>>> the end of the tests.
>>>>
>>>> Please try to run kvm-unit-tests (the emulator test) on a KVM-PR host,
>>>> and check the kernel logs (dmesg), then try to ssh to the host...
>>> Ok, so the log messages are the problem. Please fix the message output
>>> then - or remove it altogether. Or if you like, create a module
>>> parameter that allows you to emit them.
>>>
>>> I personally think the best solution would be to just convert the
>>> message into a trace point.
>>>
>>> While at it, please see whether the guest can trigger similar host log
>>> output excess in other code paths.
>> The problem is not really with the log messages: they are a consequence
>> of the bug I am trying to fix.
>>
>> What happens is that once kvm-pr decodes an invalid instruction, all the
>> following valid instructions trigger a Program exception to the guest (but
>> are executed correctly). It has no real consequence on a big machine like
>> POWER8, except that the guest becomes very slow and the log files of the
>> host are filled with messages (and qemu uses 100% of the CPU). On a
>> smaller machine like a PowerMac G5, the machine becomes simply unusable.
> 
> It's probably more related to the verbosity level of your kernel messages.
> If you pass loglevel=0 (or quiet) on your kernel cmdline, you won't get
> the messages printed to serial, which is what's slowing you down.
> 
> The other problem sounds pretty severe, but the only thing your patch
> does differently from the current code flow is the patch below.
> Or did I miss anything?
> 
> diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
> index 5cc2e7a..4672bc2 100644
> --- a/arch/powerpc/kvm/emulate.c
> +++ b/arch/powerpc/kvm/emulate.c
> @@ -302,7 +302,11 @@ int kvmppc_emulate_instruction(struct kvm_run *run,
> struct kvm_vcpu *vcpu)
> advance = 0;
> printk(KERN_ERR "Couldn't emulate instruction
> 0x%08x "
>"(op %d xop %d)\n", inst, get_op(inst),
> get_xop(inst));
> +#ifdef CONFIG_PPC_BOOK3S
> +   kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
> +#else
> kvmppc_core_queue_program(vcpu, 0);
> +#endif
> }
> }
> 

OK, that fixes the problem. Thanks. :)

Could you apply it?

Laurent
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] kvm-pr: manage illegal instructions

2016-05-12 Thread Laurent Vivier


On 11/05/2016 13:49, Alexander Graf wrote:
> On 05/11/2016 01:14 PM, Laurent Vivier wrote:
>>
>> On 11/05/2016 12:35, Alexander Graf wrote:
>>> On 03/15/2016 09:18 PM, Laurent Vivier wrote:
>>>> While writing some instruction tests for kvm-unit-tests for powerpc,
>>>> I've found that illegal instructions are not managed correctly with
>>>> kvm-pr,
>>>> while it is fine with kvm-hv.
>>>>
>>>> When an illegal instruction (like ".long 0") is processed by kvm-pr,
>>>> the kernel logs are filled with:
>>>>
>>>>    Couldn't emulate instruction 0x00000000 (op 0 xop 0)
>>>>    kvmppc_handle_exit_pr: emulation at 700 failed (00000000)
>>>>
>>>> While the exception handler receives an interrupt for each instruction
>>>> executed after the illegal instruction.
>>>>
>>>> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
>>>> ---
>>>>arch/powerpc/kvm/book3s_emulate.c | 4 +++-
>>>>1 file changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_emulate.c
>>>> b/arch/powerpc/kvm/book3s_emulate.c
>>>> index 2afdb9c..4ee969d 100644
>>>> --- a/arch/powerpc/kvm/book3s_emulate.c
>>>> +++ b/arch/powerpc/kvm/book3s_emulate.c
>>>> @@ -99,7 +99,6 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run,
>>>> struct kvm_vcpu *vcpu,
>>>>  switch (get_op(inst)) {
>>>>case 0:
>>>> -emulated = EMULATE_FAIL;
>>>>if ((kvmppc_get_msr(vcpu) & MSR_LE) &&
>>>>(inst == swab32(inst_sc))) {
>>>>/*
>>>> @@ -112,6 +111,9 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run,
>>>> struct kvm_vcpu *vcpu,
>>>>kvmppc_set_gpr(vcpu, 3, EV_UNIMPLEMENTED);
>>>>kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
>>>>emulated = EMULATE_DONE;
>>>> +} else {
>>>> +kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
>>> But isn't that exactly what the semantic of EMULATE_FAIL is? Fixing it
>>> up in book3s_emulate.c is definitely the wrong spot.
>>>
>>> So what is the problem you're trying to solve? Is the SRR0 at the wrong
>>> spot or are the log messages the problem?
>> No, the problem is that the host kernel logs are filled with the message
>> and the execution hangs. And the host becomes unresponsive, even after
>> the end of the tests.
>>
>> Please try to run kvm-unit-tests (the emulator test) on a KVM-PR host,
>> and check the kernel logs (dmesg), then try to ssh to the host...
> 
> Ok, so the log messages are the problem. Please fix the message output
> then - or remove it altogether. Or if you like, create a module
> parameter that allows you to emit them.
> 
> I personally think the best solution would be to just convert the
> message into a trace point.
> 
> While at it, please see whether the guest can trigger similar host log
> output excess in other code paths.

The problem is not really with the log messages: they are a consequence
of the bug I am trying to fix.

What happens is that once kvm-pr decodes an invalid instruction, all the
following valid instructions trigger a Program exception to the guest (but
are executed correctly). It has no real consequence on a big machine like
POWER8, except that the guest becomes very slow and the log files of the
host are filled with messages (and qemu uses 100% of the CPU). On a
smaller machine like a PowerMac G5, the machine becomes simply unusable.

Please try kvm-unit-tests to see what happens:

qemu-system-ppc64 -machine pseries,accel=kvm,kvm-type=PR -bios
powerpc/boot_rom.bin -display none -serial stdio -kernel
powerpc/emulator.elf -smp 1 --append "-v"


Thanks,
Laurent

Re: [PATCH] kvm-pr: manage illegal instructions

2016-05-11 Thread Laurent Vivier


On 11/05/2016 12:35, Alexander Graf wrote:
> On 03/15/2016 09:18 PM, Laurent Vivier wrote:
>> While writing some instruction tests for kvm-unit-tests for powerpc,
>> I've found that illegal instructions are not managed correctly with
>> kvm-pr,
>> while it is fine with kvm-hv.
>>
>> When an illegal instruction (like ".long 0") is processed by kvm-pr,
>> the kernel logs are filled with:
>>
>>   Couldn't emulate instruction 0x00000000 (op 0 xop 0)
>>   kvmppc_handle_exit_pr: emulation at 700 failed (00000000)
>>
>> While the exception handler receives an interrupt for each instruction
>> executed after the illegal instruction.
>>
>> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
>> ---
>>   arch/powerpc/kvm/book3s_emulate.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_emulate.c
>> b/arch/powerpc/kvm/book3s_emulate.c
>> index 2afdb9c..4ee969d 100644
>> --- a/arch/powerpc/kvm/book3s_emulate.c
>> +++ b/arch/powerpc/kvm/book3s_emulate.c
>> @@ -99,7 +99,6 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run,
>> struct kvm_vcpu *vcpu,
>> switch (get_op(inst)) {
>>   case 0:
>> -emulated = EMULATE_FAIL;
>>   if ((kvmppc_get_msr(vcpu) & MSR_LE) &&
>>   (inst == swab32(inst_sc))) {
>>   /*
>> @@ -112,6 +111,9 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run,
>> struct kvm_vcpu *vcpu,
>>   kvmppc_set_gpr(vcpu, 3, EV_UNIMPLEMENTED);
>>   kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
>>   emulated = EMULATE_DONE;
>> +} else {
>> +kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
> 
> But isn't that exactly what the semantic of EMULATE_FAIL is? Fixing it
> up in book3s_emulate.c is definitely the wrong spot.
> 
> So what is the problem you're trying to solve? Is the SRR0 at the wrong
> spot or are the log messages the problem?

No, the problem is that the host kernel logs are filled with the message
and the execution hangs. And the host becomes unresponsive, even after
the end of the tests.

Please try to run kvm-unit-tests (the emulator test) on a KVM-PR host,
and check the kernel logs (dmesg), then try to ssh to the host...

Laurent

[PATCH v2] kvm-pr: manage single-step mode

2016-04-08 Thread Laurent Vivier
Until now, when we connect gdb to the QEMU gdb-server, the
single-step mode is not managed.

This patch adds this, only for kvm-pr:

If KVM_GUESTDBG_SINGLESTEP is set, we enable single-step trace bit in the
MSR (MSR_SE) just before the __kvmppc_vcpu_run(), and disable it just after.
In kvmppc_handle_exit_pr, instead of routing the interrupt to
the guest, we return to host, with KVM_EXIT_DEBUG reason.

Signed-off-by: Laurent Vivier <lviv...@redhat.com>
---
v2: split BOOK3S_INTERRUPT_MACHINE_CHECK and BOOK3S_INTERRUPT_TRACE

 arch/powerpc/kvm/book3s_pr.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 95bceca..8129b0d 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -882,6 +882,24 @@ void kvmppc_set_fscr(struct kvm_vcpu *vcpu, u64 fscr)
 }
 #endif
 
+static void kvmppc_setup_debug(struct kvm_vcpu *vcpu)
+{
+   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
+   u64 msr = kvmppc_get_msr(vcpu);
+
+   kvmppc_set_msr(vcpu, msr | MSR_SE);
+   }
+}
+
+static void kvmppc_clear_debug(struct kvm_vcpu *vcpu)
+{
+   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
+   u64 msr = kvmppc_get_msr(vcpu);
+
+   kvmppc_set_msr(vcpu, msr & ~MSR_SE);
+   }
+}
+
 int kvmppc_handle_exit_pr(struct kvm_run *run, struct kvm_vcpu *vcpu,
  unsigned int exit_nr)
 {
@@ -1207,10 +1225,18 @@ program_interrupt:
break;
 #endif
case BOOK3S_INTERRUPT_MACHINE_CHECK:
-   case BOOK3S_INTERRUPT_TRACE:
kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
r = RESUME_GUEST;
break;
+   case BOOK3S_INTERRUPT_TRACE:
+   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
+   run->exit_reason = KVM_EXIT_DEBUG;
+   r = RESUME_HOST;
+   } else {
+   kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
+   r = RESUME_GUEST;
+   }
+   break;
default:
{
ulong shadow_srr1 = vcpu->arch.shadow_srr1;
@@ -1479,6 +1505,8 @@ static int kvmppc_vcpu_run_pr(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
goto out;
}
 
+   kvmppc_setup_debug(vcpu);
+
/*
 * Interrupts could be timers for the guest which we have to inject
 * again, so let's postpone them until we're in the guest and if we
@@ -1501,6 +1529,8 @@ static int kvmppc_vcpu_run_pr(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
 
ret = __kvmppc_vcpu_run(kvm_run, vcpu);
 
+   kvmppc_clear_debug(vcpu);
+
/* No need for kvm_guest_exit. It's done in handle_exit.
   We also get here with interrupts enabled. */
 
-- 
2.5.5


Re: [PATCH] kvm-pr: manage single-step mode

2016-04-08 Thread Laurent Vivier


On 08/04/2016 09:44, Thomas Huth wrote:
> On 08.04.2016 08:58, Laurent Vivier wrote:
>>
>>
>> On 08/04/2016 08:23, Thomas Huth wrote:
>>> On 22.03.2016 15:53, Laurent Vivier wrote:
>>>> Until now, when we connect gdb to the QEMU gdb-server, the
>>>> single-step mode is not managed.
>>>>
>>>> This patch adds this, only for kvm-pr:
>>>>
>>>> If KVM_GUESTDBG_SINGLESTEP is set, we enable single-step trace bit in the
>>>> MSR (MSR_SE) just before the __kvmppc_vcpu_run(), and disable it just 
>>>> after.
>>>> In kvmppc_handle_exit_pr, instead of routing the interrupt to
>>>> the guest, we return to host, with KVM_EXIT_DEBUG reason.
>>>>
>>>> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
>>>> ---
>>>>  arch/powerpc/kvm/book3s_pr.c | 31 +--
>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
>>>> index 95bceca..e6896f4 100644
>>>> --- a/arch/powerpc/kvm/book3s_pr.c
>>>> +++ b/arch/powerpc/kvm/book3s_pr.c
>>>> @@ -882,6 +882,24 @@ void kvmppc_set_fscr(struct kvm_vcpu *vcpu, u64 fscr)
>>>>  }
>>>>  #endif
>>>>  
>>>> +static void kvmppc_setup_debug(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +  if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
>>>> +  u64 msr = kvmppc_get_msr(vcpu);
>>>> +
>>>> +  kvmppc_set_msr(vcpu, msr | MSR_SE);
>>>> +  }
>>>> +}
>>>> +
>>>> +static void kvmppc_clear_debug(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +  if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
>>>> +  u64 msr = kvmppc_get_msr(vcpu);
>>>> +
>>>> +  kvmppc_set_msr(vcpu, msr & ~MSR_SE);
>>>> +  }
>>>> +}
>>>> +
>>>>  int kvmppc_handle_exit_pr(struct kvm_run *run, struct kvm_vcpu *vcpu,
>>>>  unsigned int exit_nr)
>>>>  {
>>>> @@ -1208,8 +1226,13 @@ program_interrupt:
>>>>  #endif
>>>>case BOOK3S_INTERRUPT_MACHINE_CHECK:
>>>>case BOOK3S_INTERRUPT_TRACE:
>>>> -  kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
>>>> -  r = RESUME_GUEST;
>>>> +  if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
>>>> +  run->exit_reason = KVM_EXIT_DEBUG;
>>>> +  r = RESUME_HOST;
>>>> +  } else {
>>>> +  kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
>>>> +  r = RESUME_GUEST;
>>>> +  }
>>>
>>> Should the new code rather be limited to the BOOK3S_INTERRUPT_TRACE case
>>> only? I mean, this way, you never can deliver a machine check interrupt
>>> to the guest if singlestep debugging is enabled on the host, can you?
>>
>> You're right, but it adds complexity, and it would only be useful for
>> single-stepping the single-step mode of the guest.
>>
>> It's hard to imagine a developer single-stepping the guest kernel while
>> he is single-stepping a user application in the guest.
> 
> Hmm, not sure whether you've got me right ;-) I rather meant: What

Yes, I missed what you meant. :(
Thank you for trying again :)

> happens when a machine check is supposed to happen in the guest while
> single stepping is enabled at the host level? IMHO it would be better to
> shape the code like this:
> 
>   case BOOK3S_INTERRUPT_MACHINE_CHECK:
>   kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
>   r = RESUME_GUEST;
>   break;
>   case BOOK3S_INTERRUPT_TRACE:
>   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
>   run->exit_reason = KVM_EXIT_DEBUG;
>   r = RESUME_HOST;
>   } else {
>   kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
>   r = RESUME_GUEST;
>   }
> 
> That means, split the two cases, to keep the old behavior for the
> MACHINE_CHECK case. That's also not too much of additional complexity,
> is it?

Yes, you're right.

Thanks,
Laurent

Re: [PATCH] kvm-pr: manage single-step mode

2016-04-08 Thread Laurent Vivier


On 08/04/2016 08:23, Thomas Huth wrote:
> On 22.03.2016 15:53, Laurent Vivier wrote:
>> Until now, when we connect gdb to the QEMU gdb-server, the
>> single-step mode is not managed.
>>
>> This patch adds this, only for kvm-pr:
>>
>> If KVM_GUESTDBG_SINGLESTEP is set, we enable single-step trace bit in the
>> MSR (MSR_SE) just before the __kvmppc_vcpu_run(), and disable it just after.
>> In kvmppc_handle_exit_pr, instead of routing the interrupt to
>> the guest, we return to host, with KVM_EXIT_DEBUG reason.
>>
>> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
>> ---
>>  arch/powerpc/kvm/book3s_pr.c | 31 +--
>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
>> index 95bceca..e6896f4 100644
>> --- a/arch/powerpc/kvm/book3s_pr.c
>> +++ b/arch/powerpc/kvm/book3s_pr.c
>> @@ -882,6 +882,24 @@ void kvmppc_set_fscr(struct kvm_vcpu *vcpu, u64 fscr)
>>  }
>>  #endif
>>  
>> +static void kvmppc_setup_debug(struct kvm_vcpu *vcpu)
>> +{
>> +if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
>> +u64 msr = kvmppc_get_msr(vcpu);
>> +
>> +kvmppc_set_msr(vcpu, msr | MSR_SE);
>> +}
>> +}
>> +
>> +static void kvmppc_clear_debug(struct kvm_vcpu *vcpu)
>> +{
>> +if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
>> +u64 msr = kvmppc_get_msr(vcpu);
>> +
>> +kvmppc_set_msr(vcpu, msr & ~MSR_SE);
>> +}
>> +}
>> +
>>  int kvmppc_handle_exit_pr(struct kvm_run *run, struct kvm_vcpu *vcpu,
>>unsigned int exit_nr)
>>  {
>> @@ -1208,8 +1226,13 @@ program_interrupt:
>>  #endif
>>  case BOOK3S_INTERRUPT_MACHINE_CHECK:
>>  case BOOK3S_INTERRUPT_TRACE:
>> -kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
>> -r = RESUME_GUEST;
>> +if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
>> +run->exit_reason = KVM_EXIT_DEBUG;
>> +r = RESUME_HOST;
>> +} else {
>> +kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
>> +r = RESUME_GUEST;
>> +}
> 
> Should the new code rather be limited to the BOOK3S_INTERRUPT_TRACE case
> only? I mean, this way, you never can deliver a machine check interrupt
> to the guest if singlestep debugging is enabled on the host, can you?

You're right, but it adds complexity, and it would only be useful for
single-stepping the single-step mode of the guest.

It's hard to imagine a developer single-stepping the guest kernel while
he is single-stepping a user application in the guest.

That's why I have completely bypassed this case.

Thanks,
Laurent

Re: [PATCH] kvm-pr: manage single-step mode

2016-04-07 Thread Laurent Vivier
Ping?

On 22/03/2016 15:53, Laurent Vivier wrote:
> Until now, when we connect gdb to the QEMU gdb-server, the
> single-step mode is not managed.
> 
> This patch adds this, only for kvm-pr:
> 
> If KVM_GUESTDBG_SINGLESTEP is set, we enable single-step trace bit in the
> MSR (MSR_SE) just before the __kvmppc_vcpu_run(), and disable it just after.
> In kvmppc_handle_exit_pr, instead of routing the interrupt to
> the guest, we return to host, with KVM_EXIT_DEBUG reason.
> 
> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
> ---
>  arch/powerpc/kvm/book3s_pr.c | 31 +--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
> index 95bceca..e6896f4 100644
> --- a/arch/powerpc/kvm/book3s_pr.c
> +++ b/arch/powerpc/kvm/book3s_pr.c
> @@ -882,6 +882,24 @@ void kvmppc_set_fscr(struct kvm_vcpu *vcpu, u64 fscr)
>  }
>  #endif
>  
> +static void kvmppc_setup_debug(struct kvm_vcpu *vcpu)
> +{
> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
> + u64 msr = kvmppc_get_msr(vcpu);
> +
> + kvmppc_set_msr(vcpu, msr | MSR_SE);
> + }
> +}
> +
> +static void kvmppc_clear_debug(struct kvm_vcpu *vcpu)
> +{
> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
> + u64 msr = kvmppc_get_msr(vcpu);
> +
> + kvmppc_set_msr(vcpu, msr & ~MSR_SE);
> + }
> +}
> +
>  int kvmppc_handle_exit_pr(struct kvm_run *run, struct kvm_vcpu *vcpu,
> unsigned int exit_nr)
>  {
> @@ -1208,8 +1226,13 @@ program_interrupt:
>  #endif
>   case BOOK3S_INTERRUPT_MACHINE_CHECK:
>   case BOOK3S_INTERRUPT_TRACE:
> - kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
> - r = RESUME_GUEST;
> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
> + run->exit_reason = KVM_EXIT_DEBUG;
> + r = RESUME_HOST;
> + } else {
> + kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
> + r = RESUME_GUEST;
> + }
>   break;
>   default:
>   {
> @@ -1479,6 +1502,8 @@ static int kvmppc_vcpu_run_pr(struct kvm_run *kvm_run, 
> struct kvm_vcpu *vcpu)
>   goto out;
>   }
>  
> + kvmppc_setup_debug(vcpu);
> +
>   /*
>* Interrupts could be timers for the guest which we have to inject
>* again, so let's postpone them until we're in the guest and if we
> @@ -1501,6 +1526,8 @@ static int kvmppc_vcpu_run_pr(struct kvm_run *kvm_run, 
> struct kvm_vcpu *vcpu)
>  
>   ret = __kvmppc_vcpu_run(kvm_run, vcpu);
>  
> + kvmppc_clear_debug(vcpu);
> +
>   /* No need for kvm_guest_exit. It's done in handle_exit.
>  We also get here with interrupts enabled. */
>  
> 

Re: [PATCH] kvm-pr: manage illegal instructions

2016-04-07 Thread Laurent Vivier
Ping?

On 15/03/2016 21:18, Laurent Vivier wrote:
> While writing some instruction tests for kvm-unit-tests for powerpc,
> I've found that illegal instructions are not managed correctly with kvm-pr,
> while it is fine with kvm-hv.
> 
> When an illegal instruction (like ".long 0") is processed by kvm-pr,
> the kernel logs are filled with:
> 
>  Couldn't emulate instruction 0x00000000 (op 0 xop 0)
>  kvmppc_handle_exit_pr: emulation at 700 failed (00000000)
> 
> While the exception handler receives an interrupt for each instruction
> executed after the illegal instruction.
> 
> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
> ---
>  arch/powerpc/kvm/book3s_emulate.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_emulate.c 
> b/arch/powerpc/kvm/book3s_emulate.c
> index 2afdb9c..4ee969d 100644
> --- a/arch/powerpc/kvm/book3s_emulate.c
> +++ b/arch/powerpc/kvm/book3s_emulate.c
> @@ -99,7 +99,6 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run, struct 
> kvm_vcpu *vcpu,
>  
>   switch (get_op(inst)) {
>   case 0:
> - emulated = EMULATE_FAIL;
>   if ((kvmppc_get_msr(vcpu) & MSR_LE) &&
>   (inst == swab32(inst_sc))) {
>   /*
> @@ -112,6 +111,9 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run, struct 
> kvm_vcpu *vcpu,
>   kvmppc_set_gpr(vcpu, 3, EV_UNIMPLEMENTED);
>   kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
>   emulated = EMULATE_DONE;
> + } else {
> + kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
> + emulated = EMULATE_AGAIN;
>   }
>   break;
>   case 19:
> 

[PATCH] kvm-pr: manage single-step mode

2016-03-22 Thread Laurent Vivier
Until now, when we connect gdb to the QEMU gdb-server, the
single-step mode is not managed.

This patch adds this, only for kvm-pr:

If KVM_GUESTDBG_SINGLESTEP is set, we enable single-step trace bit in the
MSR (MSR_SE) just before the __kvmppc_vcpu_run(), and disable it just after.
In kvmppc_handle_exit_pr, instead of routing the interrupt to
the guest, we return to host, with KVM_EXIT_DEBUG reason.

Signed-off-by: Laurent Vivier <lviv...@redhat.com>
---
 arch/powerpc/kvm/book3s_pr.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 95bceca..e6896f4 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -882,6 +882,24 @@ void kvmppc_set_fscr(struct kvm_vcpu *vcpu, u64 fscr)
 }
 #endif
 
+static void kvmppc_setup_debug(struct kvm_vcpu *vcpu)
+{
+   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
+   u64 msr = kvmppc_get_msr(vcpu);
+
+   kvmppc_set_msr(vcpu, msr | MSR_SE);
+   }
+}
+
+static void kvmppc_clear_debug(struct kvm_vcpu *vcpu)
+{
+   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
+   u64 msr = kvmppc_get_msr(vcpu);
+
+   kvmppc_set_msr(vcpu, msr & ~MSR_SE);
+   }
+}
+
 int kvmppc_handle_exit_pr(struct kvm_run *run, struct kvm_vcpu *vcpu,
  unsigned int exit_nr)
 {
@@ -1208,8 +1226,13 @@ program_interrupt:
 #endif
case BOOK3S_INTERRUPT_MACHINE_CHECK:
case BOOK3S_INTERRUPT_TRACE:
-   kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
-   r = RESUME_GUEST;
+   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
+   run->exit_reason = KVM_EXIT_DEBUG;
+   r = RESUME_HOST;
+   } else {
+   kvmppc_book3s_queue_irqprio(vcpu, exit_nr);
+   r = RESUME_GUEST;
+   }
break;
default:
{
@@ -1479,6 +1502,8 @@ static int kvmppc_vcpu_run_pr(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
goto out;
}
 
+   kvmppc_setup_debug(vcpu);
+
/*
 * Interrupts could be timers for the guest which we have to inject
 * again, so let's postpone them until we're in the guest and if we
@@ -1501,6 +1526,8 @@ static int kvmppc_vcpu_run_pr(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
 
ret = __kvmppc_vcpu_run(kvm_run, vcpu);
 
+   kvmppc_clear_debug(vcpu);
+
/* No need for kvm_guest_exit. It's done in handle_exit.
   We also get here with interrupts enabled. */
 
-- 
2.5.0


Re: [PATCH] kvm-pr: manage illegal instructions

2016-03-22 Thread Laurent Vivier
Hi,

as Paolo has merged the test into kvm-unit-tests, this patch (and
original bug) can be now tested with it.

git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git

at least:

be9b007 powerpc: add test to check invalid instruction trap

Run this with KVM-PR and check your dmesg:

qemu-system-ppc64 -machine pseries,accel=kvm \
  -bios powerpc/boot_rom.bin \
  -display none -serial stdio \
  -kernel powerpc/emulator.elf -smp 1

Laurent

On 15/03/2016 21:18, Laurent Vivier wrote:
> While writing some instruction tests for kvm-unit-tests for powerpc,
> I've found that illegal instructions are not managed correctly with kvm-pr,
> while it is fine with kvm-hv.
> 
> When an illegal instruction (like ".long 0") is processed by kvm-pr,
> the kernel logs are filled with:
> 
>  Couldn't emulate instruction 0x00000000 (op 0 xop 0)
>  kvmppc_handle_exit_pr: emulation at 700 failed (00000000)
> 
> While the exception handler receives an interrupt for each instruction
> executed after the illegal instruction.
> 
> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
> ---
>  arch/powerpc/kvm/book3s_emulate.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_emulate.c 
> b/arch/powerpc/kvm/book3s_emulate.c
> index 2afdb9c..4ee969d 100644
> --- a/arch/powerpc/kvm/book3s_emulate.c
> +++ b/arch/powerpc/kvm/book3s_emulate.c
> @@ -99,7 +99,6 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run, struct 
> kvm_vcpu *vcpu,
>  
>   switch (get_op(inst)) {
>   case 0:
> - emulated = EMULATE_FAIL;
>   if ((kvmppc_get_msr(vcpu) & MSR_LE) &&
>   (inst == swab32(inst_sc))) {
>   /*
> @@ -112,6 +111,9 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run, struct 
> kvm_vcpu *vcpu,
>   kvmppc_set_gpr(vcpu, 3, EV_UNIMPLEMENTED);
>   kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
>   emulated = EMULATE_DONE;
> + } else {
> + kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
> + emulated = EMULATE_AGAIN;
>   }
>   break;
>   case 19:
> 

[PATCH] kvm-pr: manage illegal instructions

2016-03-15 Thread Laurent Vivier
While writing some instruction tests for kvm-unit-tests for powerpc,
I've found that illegal instructions are not managed correctly with kvm-pr,
while it is fine with kvm-hv.

When an illegal instruction (like ".long 0") is processed by kvm-pr,
the kernel logs are filled with:

 Couldn't emulate instruction 0x00000000 (op 0 xop 0)
 kvmppc_handle_exit_pr: emulation at 700 failed (00000000)

While the exception handler receives an interrupt for each instruction
executed after the illegal instruction.

Signed-off-by: Laurent Vivier <lviv...@redhat.com>
---
 arch/powerpc/kvm/book3s_emulate.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 2afdb9c..4ee969d 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -99,7 +99,6 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run, struct kvm_vcpu *vcpu,
 
switch (get_op(inst)) {
case 0:
-   emulated = EMULATE_FAIL;
if ((kvmppc_get_msr(vcpu) & MSR_LE) &&
(inst == swab32(inst_sc))) {
/*
@@ -112,6 +111,9 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run, struct kvm_vcpu *vcpu,
kvmppc_set_gpr(vcpu, 3, EV_UNIMPLEMENTED);
kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
emulated = EMULATE_DONE;
+   } else {
+   kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
+   emulated = EMULATE_AGAIN;
}
break;
case 19:
-- 
2.5.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powerpc: allow cross-compilation of ppc64 kernel

2015-11-09 Thread Laurent Vivier


On 10/11/2015 01:29, Michael Ellerman wrote:
> On Sat, 2015-11-07 at 12:35 +0100, Laurent Vivier wrote:
>> On 07/11/2015 00:24, Scott Wood wrote:
>>> On Fri, 2015-11-06 at 23:22 +0100, Laurent Vivier wrote:
>>>> On 06/11/2015 22:09, Scott Wood wrote:
>>>>> On Thu, 2015-11-05 at 12:47 +0100, Laurent Vivier wrote:
>>>>>> When I try to cross-compile a ppc64 kernel, it generally
>>>>>> fails at the VDSO stage. This is true with a powerpc64
>>>>>> cross-compiler, but also when I try to build a ppc64le kernel
>>>>>> on a ppc64 host.
>>>>>>
>>>>>> VDSO64L fails:
>>>>>>
>>>>>>   VDSO64L arch/powerpc/kernel/vdso64/vdso64.so.dbg
>>>>>> /usr/bin/powerpc64-linux-gnu-ld: arch/powerpc/kernel/vdso64/sigtramp.o:
>>>>>> file class ELFCLASS64 incompatible with ELFCLASS32
>>>>>> /usr/bin/powerpc64-linux-gnu-ld: final link failed: File in wrong format
>>>>>>
>>>>>> This fails because gcc calls "collect2" with
>>>>>> "--oformat elf32-powerpcle" for ppc64 objects; without
>>>>>> "--oformat", ld works well because it uses the format of the
>>>>>> first object as the output format.
>>>>>>
>>>>>> As this case is handled correctly when building the other kernel
>>>>>> objects, this patch replaces $(CC) by $(LD) to generate the
>>>>>> VDSO objects.
>>>
>>>> So at this point, I can:
>>>>
>>>> 1- either fix my compiler,
>>>> 2- or fix the vdso64 linker command.
>>>
>> Thank you Scott. With the help of the comment from Segher, I can choose
>> #1 now :)
> 
> What distro/toolchain did you hit this on?

Fedora 23.

> I actually wrote this same patch a while back, but from memory it broke in 
> some
> configurations. So I never merged it. I'll see if I can find my notes and work
> out exactly why it didn't work, maybe we should merge it anyway, even though
> the real bug is a toolchain bug.

On Fedora, there is also a bug in the binutils configuration where ld is
not able to handle ppc64le binaries on a ppc64 host, but I've already
sent a fix for that and it is now in Rawhide (it will be in Fedora 24).

https://bugzilla.redhat.com/show_bug.cgi?id=1275709

Laurent

Re: [PATCH] powerpc: allow cross-compilation of ppc64 kernel

2015-11-07 Thread Laurent Vivier


On 07/11/2015 00:12, Benjamin Herrenschmidt wrote:
> On Fri, 2015-11-06 at 15:09 -0600, Scott Wood wrote:
>> On Thu, 2015-11-05 at 12:47 +0100, Laurent Vivier wrote:
>>> When I try to cross-compile a ppc64 kernel, it generally
>>> fails at the VDSO stage. This is true with a powerpc64
>>> cross-compiler, but also when I try to build a ppc64le kernel
>>> on a ppc64 host.
>>>
>>> VDSO64L fails:
>>>
>>>   VDSO64L arch/powerpc/kernel/vdso64/vdso64.so.dbg
>>> /usr/bin/powerpc64-linux-gnu-ld:
>>> arch/powerpc/kernel/vdso64/sigtramp.o:
>>> file class ELFCLASS64 incompatible with ELFCLASS32
>>> /usr/bin/powerpc64-linux-gnu-ld: final link failed: File in wrong
>>> format
>>>
>>> This fails because gcc calls "collect2" with
>>> "--oformat elf32-powerpcle" for ppc64 objects; without
>>> "--oformat", ld works well because it uses the format of the
>>> first object as the output format.
>>>
>>> As this case is handled correctly when building the other kernel
>>> objects, this patch replaces $(CC) by $(LD) to generate the
>>> VDSO objects.
> 
> This is LE? I think that's a bug in binutils or gcc ... I remember we
> fought that a while ago for the openpower builds. It might have been
> fixed in upstream toolchain.

Yes, Segher has given me the commit id. It is just what I need.

Laurent

Re: [PATCH] powerpc: allow cross-compilation of ppc64 kernel

2015-11-07 Thread Laurent Vivier


On 07/11/2015 00:32, Segher Boessenkool wrote:
> On Fri, Nov 06, 2015 at 04:55:49PM -0600, Segher Boessenkool wrote:
>> On Fri, Nov 06, 2015 at 03:09:40PM -0600, Scott Wood wrote:
>>> Why is GCC building ppc64 object files but telling the linker
>>> --oformat elf32-powerpcle?  Are different options somehow being
>>> passed to GCC in one case versus the other?
>>
>> This was changed for GCC 6 in , could one of
>> you check if that fixes this problem?  If so we'll need to backport it.
> 
> Actually it already has been backported, r228089.

Thank you, I'm going to see if it can be included in the distro I use.

Laurent

Re: [PATCH] powerpc: allow cross-compilation of ppc64 kernel

2015-11-06 Thread Laurent Vivier


On 06/11/2015 22:09, Scott Wood wrote:
> On Thu, 2015-11-05 at 12:47 +0100, Laurent Vivier wrote:
>> When I try to cross-compile a ppc64 kernel, it generally
>> fails at the VDSO stage. This is true with a powerpc64
>> cross-compiler, but also when I try to build a ppc64le kernel
>> on a ppc64 host.
>>
>> VDSO64L fails:
>>
>>   VDSO64L arch/powerpc/kernel/vdso64/vdso64.so.dbg
>> /usr/bin/powerpc64-linux-gnu-ld: arch/powerpc/kernel/vdso64/sigtramp.o:
>> file class ELFCLASS64 incompatible with ELFCLASS32
>> /usr/bin/powerpc64-linux-gnu-ld: final link failed: File in wrong format
>>
>> This fails because gcc calls "collect2" with
>> "--oformat elf32-powerpcle" for ppc64 objects; without
>> "--oformat", ld works well because it uses the format of the
>> first object as the output format.
>>
>> As this case is handled correctly when building the other kernel
>> objects, this patch replaces $(CC) by $(LD) to generate the
>> VDSO objects.
> 
> I cross-compile ppc64 kernels and have not seen this problem.  I do need to 
> pass in -m64 as part of $(CC) if it's not the toolchain default, which is not 
> nice, but the proper fix for that is to add -m64 in the makefiles -- and if I 
> don't it fails way before VDSO.
> 
> Why is GCC building ppc64 object files but telling the linker
> --oformat elf32-powerpcle?  Are different options somehow being passed
> to GCC in one case versus the other?

In fact, for all the other parts of the kernel, gcc is called with
"-mlittle-endian -m64" and ld with "-EL -m elf64lppc", which produces
the correct objects and links them with the correct options
("elf64lppc"). I think gcc is never used to link, only to compile.
This, I think, comes from:

arch/powerpc/Makefile:

ifeq ($(CONFIG_CPU_LITTLE_ENDIAN),y)
override CC += -mlittle-endian
override LD += -EL
...
ifeq ($(HAS_BIARCH),y)
override CC += -m$(CONFIG_WORD_SIZE)
override LD += -m elf$(CONFIG_WORD_SIZE)$(LDEMULATION)

But in the case of vdso64, the ld command is gcc:

arch/powerpc/kernel/vdso64/Makefile:

cmd_vdso64ld = $(CC) $(c_flags) -Wl,-T $^ -o $@

So, to link, we use "gcc -mlittle-endian -m64"

and strace of "gcc -mlittle-endian -m64" gives me:

"/usr/libexec/gcc/ppc64-redhat-linux/5.1.1/collect2", [ ... "--oformat",
"elf32-powerpcle", "-m", "elf64lppc",...

So, by default, the format used to link is "elf32-powerpcle" (with the
elf64lppc emulation given by "-mlittle-endian -m64"; I agree it seems
strange).

I think this is coming from the configuration of my gcc, "-dumpspecs"
gives me:

*link_target:
%{mlittle|mlittle-endian: --oformat
elf32-powerpcle;mbig|mbig-endian:;mcall-i960-old: --oformat
elf32-powerpcle;:}

When "ld" is called without "--oformat", it takes the format of the
first processed object ("elf64-powerpcle" in our case), which is why it
works for the other binaries of the kernel.

So at this point, I can:

1- either fix my compiler,
2- or fix the vdso64 linker command.

As all the other objects of the kernel are generated and linked
correctly with this build env, there is no reason not to choose 2-, as
1- is much more complex (at least for me: rebuilding gcc, whereas I just
want to build the kernel).

But, more generally, what I'm wondering is why we are using CC instead
of LD to link objects...

Laurent

[PATCH] powerpc/boot: allow wrapper to work on non-english system

2015-11-05 Thread Laurent Vivier
If the language is not English, objdump output is not parsed correctly
and format is "". Later, "ld -m $format" fails.

This patch adds "LANG=C" to force English output from objdump.

Signed-off-by: Laurent Vivier <laur...@vivier.eu>
---
 arch/powerpc/boot/wrapper | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/wrapper b/arch/powerpc/boot/wrapper
index 3f50c27..1c0b6ec 100755
--- a/arch/powerpc/boot/wrapper
+++ b/arch/powerpc/boot/wrapper
@@ -137,7 +137,7 @@ if [ -z "$kernel" ]; then
 kernel=vmlinux
 fi
 
-elfformat="`${CROSS}objdump -p "$kernel" | grep 'file format' | awk '{print $4}'`"
+LANG=C elfformat="`${CROSS}objdump -p "$kernel" | grep 'file format' | awk '{print $4}'`"
 case "$elfformat" in
 elf64-powerpcle)   format=elf64lppc;;
 elf64-powerpc) format=elf32ppc ;;
-- 
2.5.0
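The objdump parsing that this patch protects can be sketched standalone. The snippet below is only an illustration, not the wrapper itself: the objdump output lines are canned samples (no cross toolchain is run), and the localized line merely mimics what a non-English objdump might print.

```shell
# Sketch of the wrapper's format detection. The objdump output is canned
# (no real ${CROSS}objdump run); the second sample mimics a localized
# objdump header, which is what the LANG=C fix guards against.
english='vmlinux:     file format elf64-powerpcle'
localized='vmlinux:     format de fichier elf64-powerpcle'

detect() {
    printf '%s\n' "$1" | grep 'file format' | awk '{print $4}'
}

fmt=$(detect "$english")
echo "english: '$fmt'"        # english: 'elf64-powerpcle'
fmt=$(detect "$localized")
echo "localized: '$fmt'"      # localized: '' (nothing matched)
```

With the empty result, the later "ld -m $format" invocation has no usable emulation argument and aborts, which is the failure the patch description reports.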


[PATCH] powerpc: allow cross-compilation of ppc64 kernel

2015-11-05 Thread Laurent Vivier
When I try to cross-compile a ppc64 kernel, it generally
fails at the VDSO stage. This is true with a powerpc64
cross-compiler, but also when I try to build a ppc64le kernel
on a ppc64 host.

VDSO64L fails:

  VDSO64L arch/powerpc/kernel/vdso64/vdso64.so.dbg
/usr/bin/powerpc64-linux-gnu-ld: arch/powerpc/kernel/vdso64/sigtramp.o:
file class ELFCLASS64 incompatible with ELFCLASS32
/usr/bin/powerpc64-linux-gnu-ld: final link failed: File in wrong format

This fails because gcc calls "collect2" with
"--oformat elf32-powerpcle" for ppc64 objects; without
"--oformat", ld works well because it uses the format of the
first object as the output format.

As this case is handled correctly when building the other kernel
objects, this patch replaces $(CC) by $(LD) to generate the
VDSO objects.

Signed-off-by: Laurent Vivier <laur...@vivier.eu>
---
 arch/powerpc/kernel/vdso64/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/vdso64/Makefile b/arch/powerpc/kernel/vdso64/Makefile
index effca94..618c420 100644
--- a/arch/powerpc/kernel/vdso64/Makefile
+++ b/arch/powerpc/kernel/vdso64/Makefile
@@ -36,7 +36,7 @@ $(obj-vdso64): %.o: %.S
 
 # actual build commands
 quiet_cmd_vdso64ld = VDSO64L $@
-  cmd_vdso64ld = $(CC) $(c_flags) -Wl,-T $^ -o $@
+  cmd_vdso64ld = $(LD) $(LDFLAGS) $(ldflags-y) -T $^ -o $@
 quiet_cmd_vdso64as = VDSO64A $@
   cmd_vdso64as = $(CC) $(a_flags) -c -o $@ $<
 
-- 
2.5.0


Re: [PATCH] powerpc: on crash, kexec'ed kernel needs all CPUs are online

2015-11-04 Thread Laurent Vivier


On 04/11/2015 13:34, Hari Bathini wrote:
> On 10/16/2015 12:30 AM, Laurent Vivier wrote:
>> On kexec, all offline secondary CPUs are onlined before
>> starting the new kernel; this is not done in the case of kdump.
>>
>> If kdump is configured and a kernel crash occurs while
>> some secondary CPUs are offline (SMT=off),
>> the new kernel is not able to start them and it displays
>> "Processor X is stuck." messages.
>>
>> Starting with POWER8, subcore logic relies on all threads of a
>> core being booted. So, on startup, the kernel tries to start all
>> threads, and asks OPAL (or RTAS) to start all CPUs (including
>> threads). If a CPU has been offlined by the previous kernel,
>> it has not been returned to OPAL, and thus OPAL cannot restart
>> it: this CPU has been lost...
>>
>> Signed-off-by: Laurent Vivier<lviv...@redhat.com>
> 
> 
> Hi Laurent,

Hi Hari,

> Sorry for jumping too late into this.

better late than never :)

> Are you seeing this issue even with the below patches:
> 
> pseries:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c1caae3de46a072d0855729aed6e793e536a4a55
> 
> 
> opal/powernv:
> https://github.com/open-power/skiboot/commit/9ee56b5

Very interesting. Is there a way to get a firmware build with the fix?

Thanks,
Laurent

> Thanks
> Hari
> 
>> ---
>>   arch/powerpc/kernel/crash.c | 20 
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/crash.c b/arch/powerpc/kernel/crash.c
>> index 51dbace..3ca9452 100644
>> --- a/arch/powerpc/kernel/crash.c
>> +++ b/arch/powerpc/kernel/crash.c
>> @@ -19,6 +19,7 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>> #include 
>>   #include 
>> @@ -299,11 +300,30 @@ int crash_shutdown_unregister(crash_shutdown_t
>> handler)
>>   }
>>   EXPORT_SYMBOL(crash_shutdown_unregister);
>>   +/*
>> + * The next kernel will try to start all secondary CPUs and if
>> + * they are not online it will fail to start them.
>> + *
>> + */
>> +static void wake_offline_cpus(void)
>> +{
>> +int cpu = 0;
>> +
>> +for_each_present_cpu(cpu) {
>> +if (!cpu_online(cpu)) {
>> +pr_info("kexec: Waking offline cpu %d.\n", cpu);
>> +cpu_up(cpu);
>> +}
>> +}
>> +}
>> +
>>   void default_machine_crash_shutdown(struct pt_regs *regs)
>>   {
>>   unsigned int i;
>>   int (*old_handler)(struct pt_regs *regs);
>>   +wake_offline_cpus();
>> +
>>   /*
>>* This function is only called after the system
>>* has panicked or is otherwise in a critical state.
> 
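The wake_offline_cpus() loop in the patch above has a userspace analogue in the sysfs CPU-hotplug interface. The sketch below is illustrative only: it operates on a fake sysfs tree built in a temporary directory so it is safe to run anywhere; pointing SYSCPU at /sys/devices/system/cpu (as root) would perform real onlining.

```shell
# Userspace analogue of wake_offline_cpus(): online every present CPU
# that is currently offline. Runs against a fake sysfs tree so the
# sketch is harmless; SYSCPU=/sys/devices/system/cpu would be the real
# thing (and would need root).
SYSCPU=$(mktemp -d)
mkdir -p "$SYSCPU/cpu0" "$SYSCPU/cpu1" "$SYSCPU/cpu2"
echo 1 > "$SYSCPU/cpu1/online"    # cpu1: online
echo 0 > "$SYSCPU/cpu2/online"    # cpu2: offline (the SMT=off case)

for path in "$SYSCPU"/cpu[0-9]*; do
    [ -f "$path/online" ] || continue      # the boot CPU has no 'online' file
    if [ "$(cat "$path/online")" = 0 ]; then
        echo "onlining ${path##*/}"        # analogue of cpu_up()
        echo 1 > "$path/online"
    fi
done
```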

Re: [PATCH] powerpc: on crash, kexec'ed kernel needs all CPUs are online

2015-10-16 Thread Laurent Vivier


On 16/10/2015 04:14, Michael Ellerman wrote:
> On Thu, 2015-10-15 at 21:00 +0200, Laurent Vivier wrote:
>> On kexec, all offline secondary CPUs are onlined before
>> starting the new kernel; this is not done in the case of kdump.
>>
>> If kdump is configured and a kernel crash occurs while
>> some secondary CPUs are offline (SMT=off),
>> the new kernel is not able to start them and it displays
>> "Processor X is stuck." messages.
> 
> Do we know why they are stuck?

Yes, we know :)

On the crash, as the CPUs are offline, the kernel doesn't call
opal_return_cpu(), so from OPAL's point of view all these CPUs are still
in the kernel.

When the new kernel starts, it calls opal_query_cpu_status() to find out
which CPUs are available. As they were not returned to OPAL, these CPUs
are not available; but since the kernel logic relies on them being
available (the logic assumes SMT is on), it waits for them to start,
forever...

When the kernel starts, the secondary processors are started by calling
__cpu_up() for each of them:

__cpu_up()

...
cpu_callin_map[cpu] = 0;
...
rc = smp_ops->kick_cpu(cpu);
...wait...
if (!cpu_callin_map[cpu]) {
printk(KERN_ERR "Processor %u is stuck.\n", cpu);
...

on powernv, kick_cpu() is pnv_smp_kick_cpu():

pnv_smp_kick_cpu()

...
unsigned long start_here =
        __pa(ppc_function_entry(generic_secondary_smp_init));
...
/*
 * Already started, just kick it, probably coming from
 * kexec and spinning
 */
rc = opal_query_cpu_status(pcpu, );
...
if (status == OPAL_THREAD_STARTED)
goto kick;
...
rc = opal_start_cpu(pcpu, start_here);
...
kick:
...

generic_secondary_smp_init() is a function in assembly language that
eventually calls start_secondary():

start_secondary()

...
cpu_callin_map[cpu] = 1;
...
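The kick/callin protocol walked through above can be modeled with two toy processes. This is only an analogy (a flag file instead of cpu_callin_map, a background subshell instead of a secondary thread; START is an invented knob), but it shows why a thread that OPAL never actually starts ends in "Processor X is stuck.".

```shell
# Toy model of the __cpu_up()/start_secondary() handshake: the boot CPU
# kicks a secondary, then polls a callin flag. START=0 simulates the
# kdump case where OPAL never really starts the thread.
flag=$(mktemp)

kick_cpu() {                       # analogue of smp_ops->kick_cpu()
    if [ "${START:-1}" = 1 ]; then
        ( echo 1 > "$flag" ) &     # the "secondary" sets the callin flag
    fi
}

wait_callin() {                    # analogue of the wait loop in __cpu_up()
    i=0
    while [ "$i" -lt 50 ]; do
        [ "$(cat "$flag")" = 1 ] && { echo "cpu came online"; return 0; }
        i=$((i + 1))
        sleep 0.01
    done
    echo "Processor is stuck."
}

echo 0 > "$flag"; kick_cpu; wait_callin              # normal boot
echo 0 > "$flag"; START=0 kick_cpu; wait_callin      # kdump case
```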

So the processors are stuck because start_secondary() is never called.

start_secondary() is never called because OPAL cpu status is
OPAL_THREAD_STARTED.

Secondary CPUs are in "OPAL_THREAD_STARTED" state because they have not
been returned to OPAL on crash.

CPUs are returned to OPAL by pnv_kexec_cpu_down(), which is called by
crash_ipi_callback() (for secondary CPUs)... except when the CPU is not
online.

As the CPUs are offline, they are not returned to OPAL, and thus the
new kernel can't restart them.


> I really don't like this fix. The reason we're doing a kdump is because the
> first kernel has panicked, possibly with locks held or data structures
> corrupted. Calling cpu_up() then goes and tries to run a bunch of code in the
> crashed kernel, which increases the chance of us just wedging completely.

I agree, but the whole logic of the POWER kernel assumes that all the
threads are available.

Moreover, the kernel parameter "maxcpus" is ignored if it is not a
multiple of the number of threads per core:

...
static int subcore_init(void)
{
if (!cpu_has_feature(CPU_FTR_ARCH_207S))
return 0;

/*
 * We need all threads in a core to be present to split/unsplit so
 * continue only if max_cpus are aligned to threads_per_core.
 */
if (setup_max_cpus % threads_per_core)
return 0;

...
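The alignment guard quoted above is simple arithmetic; a sketch follows, with the threads-per-core value of 8 assumed as the usual POWER8 SMT8 figure (not taken from this thread).

```shell
# Mirror of the subcore_init() guard: maxcpus= values that are not a
# multiple of threads_per_core make the kernel skip subcore setup.
threads_per_core=8    # typical POWER8 SMT8 core (assumed)
for setup_max_cpus in 8 16 12 1; do
    if [ $((setup_max_cpus % threads_per_core)) -ne 0 ]; then
        echo "maxcpus=$setup_max_cpus: not aligned, subcore init skipped"
    else
        echo "maxcpus=$setup_max_cpus: aligned, subcore init continues"
    fi
done
```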

Re: [PATCH] powerpc: on crash, kexec'ed kernel needs all CPUs are online

2015-10-16 Thread Laurent Vivier



On 16/10/2015 04:29, David Gibson wrote:
> On Thu, 15 Oct 2015 21:00:58 +0200 Laurent Vivier
> <lviv...@redhat.com> wrote:
> 
>> On kexec, all offline secondary CPUs are onlined before starting
>> the new kernel; this is not done in the case of kdump.
>> 
>> If kdump is configured and a kernel crash occurs while some
>> secondary CPUs are offline (SMT=off), the new kernel is not
>> able to start them and it displays "Processor X is stuck." messages.
>> 
>> Starting with POWER8, subcore logic relies on all threads of a core
>> being booted. So, on startup, the kernel tries to start all threads,
>> and asks OPAL (or RTAS) to start all CPUs (including threads). If
>> a CPU has been offlined by the previous kernel, it has not been
>> returned to OPAL, and thus OPAL cannot restart it: this CPU has
>> been lost...
>> 
>> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
> 
> Nice analysis of the problem.  But, I'm a bit uneasy about this
> approach to fixing it: Onlining potentially hundreds of CPU threads
> seems like a risky operation in a kernel that's already crashed.

I agree.

> I don't have a terribly clear idea of what is the best way to
> address this.  Here's a few ideas in the right general direction:
> 
> * I'm already looking into a kdump userspace fixes to stop it 
> attempting to bring up secondary CPUs
> 
> * A working kernel option to say "only allow this many online cpus 
> ever" which we could pass to the kdump kernel would be nice
> 
> * Paulus had an idea about offline threads returning themselves 
> directly to OPAL by kicking a flag at kdump/kexec time.

For me the problem is: as these CPUs are offline, I guess the core has
been switched to 1 thread per core, so the CPUs (1 to 7 for core 0)
don't exist anymore; how can we return them to OPAL?

> 
> BenH, Paulus,
> 
> OPAL <-> kernel cpu transitions don't seem to work quite how I
> thought they would.  IIUC there's a register we can use to directly
> control which threads on a core are active.  Given that I would
> have thought cpu "ownership" OPAL vs. kernel would be on a
> per-core, rather than per-thread basis.
> 
> Is there some way we can change the CPU onlining / offlining code
> so that if threads aren't in OPAL, we directly enable them, rather
> than just hoping they're in a nap loop somewhere?
> 

Laurent

  1   2   >