from:"Michael Neuling"

Re: [PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10

2020-07-23 Thread Michael Neuling

On Fri, 2020-07-10 at 10:52 +0530, Pratik Rajesh Sampat wrote:
> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
> stop levels < 4.
> Therefore save the values of these SPRs before entering a  "stop"
> state and restore their values on wakeup.
> 
> Signed-off-by: Pratik Rajesh Sampat 
> ---
>  arch/powerpc/platforms/powernv/idle.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c
> b/arch/powerpc/platforms/powernv/idle.c
> index 19d94d021357..f2e2a6a4c274 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -600,6 +600,8 @@ struct p9_sprs {
>   u64 iamr;
>   u64 amor;
>   u64 uamor;
> + u64 dawr0;
> + u64 dawrx0;
>  };
>  
>  static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> @@ -687,6 +689,10 @@ static unsigned long power9_idle_stop(unsigned long
> psscr, bool mmu_on)
>   sprs.iamr   = mfspr(SPRN_IAMR);
>   sprs.amor   = mfspr(SPRN_AMOR);
>   sprs.uamor  = mfspr(SPRN_UAMOR);
> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {

Can you add a comment here saying that even though DAWR0 is ARCH_30, it's only
required to be saved on ARCH_31? Otherwise this looks pretty odd.

> + sprs.dawr0 = mfspr(SPRN_DAWR0);
> + sprs.dawrx0 = mfspr(SPRN_DAWRX0);
> + }
>  
>   srr1 = isa300_idle_stop_mayloss(psscr); /* go idle */
>  
> @@ -710,6 +716,10 @@ static unsigned long power9_idle_stop(unsigned long
> psscr, bool mmu_on)
>   mtspr(SPRN_IAMR,sprs.iamr);
>   mtspr(SPRN_AMOR,sprs.amor);
>   mtspr(SPRN_UAMOR,   sprs.uamor);
> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> + mtspr(SPRN_DAWR0, sprs.dawr0);
> + mtspr(SPRN_DAWRX0, sprs.dawrx0);
> + }
>  
>   /*
>* Workaround for POWER9 DD2.0, if we lost resources, the ERAT

CVE-2019-15031: Linux kernel: powerpc: data leak with FP/VMX triggerable by interrupt in transaction

2019-09-10 Thread Michael Neuling

The Linux kernel for powerpc since v4.15 has a bug in its TM handling during
interrupts where any user can read the FP/VMX registers of a different user's
process. Users of TM + FP/VMX can also experience corruption of their FP/VMX
state.

To trigger the bug, a process starts a transaction with FP/VMX off and then
takes an interrupt. Due to the kernel's incorrect handling of the interrupt,
FP/VMX is turned on but the checkpointed state is not updated. If this
transaction then rolls back, the checkpointed state may contain the state of a
different process. This checkpointed state can then be read by the process hence
leaking data from one process to another.

The trigger for this bug is an interrupt inside a transaction where FP/VMX is
off, hence the process needs FP/VMX off when starting the transaction. FP/VMX
availability is under the control of the kernel and is transparent to the user,
hence the user has to retry the transaction many times to trigger this bug. High
interrupt loads also help trigger this bug.

All 64-bit machines where TM is present are affected. This includes all POWER8
variants and POWER9 VMs under KVM or LPARs under PowerVM. POWER9 bare metal
doesn't support TM and hence is not affected.

The bug was introduced in commit:
fa7771176b439 ("powerpc: Don't enable FP/Altivec if not checkpointed")
Which was originally merged in v4.15

The upstream fix is here:
https://git.kernel.org/torvalds/c/a8318c13e79badb92bc6640704a64cc022a6eb97

The fix can be verified by running the tm-poison test from the kernel
selftests. This test is in a patch here:
https://patchwork.ozlabs.org/patch/1157467/
which should eventually end up here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/powerpc/tm/tm-poison.c

cheers
Mikey

CVE-2019-15030: Linux kernel: powerpc: data leak with FP/VMX triggerable by unavailable exception in transaction

2019-09-10 Thread Michael Neuling

The Linux kernel for powerpc since v4.12 has a bug in its TM handling where any
user can read the FP/VMX registers of a different user's process. Users of TM +
FP/VMX can also experience corruption of their FP/VMX state.

To trigger the bug, a process starts a transaction and reads a FP/VMX register.
This transaction can then fail which causes a rollback to the checkpointed
state. Due to the kernel taking an FP/VMX unavailable exception inside a
transaction and the kernel's incorrect handling of this, the checkpointed state
can be set to the FP/VMX registers of another process. This checkpointed state
can then be read by the process hence leaking data from one process to another.

The trigger for this bug is an FP/VMX unavailable exception inside a
transaction, hence the process needs FP/VMX off when starting the transaction.
FP/VMX availability is under the control of the kernel and is transparent to the
user, hence the user has to retry the transaction many times to trigger this
bug.

The bug was introduced in commit:
f48e91e87e67 ("powerpc/tm: Fix FP and VMX register corruption")
Which was originally merged in v4.12

The upstream fix is here:
https://git.kernel.org/torvalds/c/8205d5d98ef7f155de211f5e2eb6ca03d95a5a60

cheers
Mikey

CVE-2019-13648: Linux kernel: powerpc: kernel crash in TM handling triggerable by any local user

2019-07-29 Thread Michael Neuling

The Linux kernel for powerpc since v3.9 has a bug in the TM handling where any
unprivileged local user may crash the operating system.

This bug affects machines using 64-bit CPUs where Transactional Memory (TM) is
not present or has been disabled (see below for more details on affected CPUs).

To trigger the bug a process constructs a signal context which still has the MSR
TS bits set. That process then passes this signal context to the sigreturn()
system call. When returning back to userspace, the kernel then crashes with a
bad TM transition (TM Bad Thing) or by executing TM code on a non-TM system.
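
As a rough illustration of the condition described above (not the upstream
fix), the check below rejects a saved MSR that still has the TS bits set when
TM is not usable. The bit positions are assumed to match MSR_TS_S_LG and
MSR_TS_T_LG in arch/powerpc/include/asm/reg.h; check_sigreturn_msr() is a
hypothetical helper name used only for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match arch/powerpc/include/asm/reg.h */
#define MSR_TS_S_LG	33			/* Transaction Suspended */
#define MSR_TS_T_LG	34			/* Transaction Transactional */
#define MSR_TS_S	(1ULL << MSR_TS_S_LG)
#define MSR_TS_T	(1ULL << MSR_TS_T_LG)
#define MSR_TS_MASK	(MSR_TS_T | MSR_TS_S)

/* True if the MSR value claims an active (suspended or running) transaction */
static bool msr_tm_active(uint64_t msr)
{
	return (msr & MSR_TS_MASK) != 0;
}

/*
 * Hypothetical validation step: a signal context whose MSR has TS bits set
 * is only acceptable when TM is actually available on this system.
 * Returns 0 if the context may be restored, -1 if it must be rejected.
 */
static int check_sigreturn_msr(uint64_t msr, bool tm_available)
{
	if (!tm_available && msr_tm_active(msr))
		return -1;
	return 0;
}
```

With tm_available false, a context carrying MSR_TS_S or MSR_TS_T is refused
rather than restored, which is the class of input the bug report describes.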

All 64-bit machines where TM is not present are affected. This includes PowerPC
970 (G5), PA6T, POWER5/6/7 VMs under KVM or LPARs under PowerVM and POWER9 bare
metal. 

Additionally systems with TM hardware but where TM is disabled in software (via
ppc_tm=off kernel cmdline) are also affected. This includes POWER8/9 VMs under
KVM or LPARs under PowerVM and POWER8 bare metal.

The bug was introduced in commit:
  2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context")

Which was originally merged in v3.9. 

The upstream fix is here:
  https://git.kernel.org/torvalds/c/f16d80b75a096c52354c6e0a574993f3b0dfbdfe

The fix can be verified by running `sigfuz -m` from the kernel selftests:
 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/powerpc/signal/sigfuz.c?h=v5.2

cheers
Mikey

Re: [PATCH 5/5] Powerpc/Watchpoint: Fix length calculation for unaligned target

2019-06-18 Thread Michael Neuling

On Tue, 2019-06-18 at 09:57 +0530, Ravi Bangoria wrote:
> Watchpoint match range is always doubleword(8 bytes) aligned on
> powerpc. If the given range is crossing doubleword boundary, we
> need to increase the length such that next doubleword also get
> covered. Ex,
> 
>   address   len = 6 bytes
> |=.
>|v--|--v|
>| | | | | | | | | | | | | | | | |
>|---|---|
> <---8 bytes--->
> 
> In such case, current code configures hw as:
>   start_addr = address & ~HW_BREAKPOINT_ALIGN
>   len = 8 bytes
> 
> And thus read/write in last 4 bytes of the given range is ignored.
> Fix this by including next doubleword in the length. Watchpoint
> exception handler already ignores extraneous exceptions, so no
> changes required for that.

Nice catch. Thanks.
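
For reference, the arithmetic in the quoted description can be worked through
with a small user-space sketch. This is not hw_breakpoint_get_final_len() from
the patch; wp_aligned_range() and wp_fits_dawr() are hypothetical helpers, and
only HW_BREAKPOINT_ALIGN and the 512-byte DAWR rule are taken from the quoted
code.

```c
#include <stdbool.h>

#define HW_BREAKPOINT_ALIGN	0x7UL	/* doubleword alignment mask, as in the patch */

struct wp_range {
	unsigned long start;	/* first byte covered by the hardware */
	unsigned long end;	/* last byte covered by the hardware */
	unsigned long len;	/* covered length in bytes */
};

/* Hypothetical helper: widen [addr, addr + len) to whole doublewords */
static struct wp_range wp_aligned_range(unsigned long addr, unsigned long len)
{
	struct wp_range r;

	r.start = addr & ~HW_BREAKPOINT_ALIGN;
	r.end = (addr + len - 1) | HW_BREAKPOINT_ALIGN;
	r.len = r.end - r.start + 1;
	return r;
}

/* A DAWR region must not cross a 512-byte boundary */
static bool wp_fits_dawr(const struct wp_range *r)
{
	return (r->start >> 9) == (r->end >> 9);
}
```

For a 6-byte target at 0x1006, the range covers 0x1000 to 0x100f (16 bytes)
rather than just the first doubleword, so the last 4 bytes are no longer
missed.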

I assume this has been broken forever? Should we be CCing stable? If so, it
would be nice to have this self-contained (separate from the refactor) so we
can more easily backport it.

Also, can you update 
tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c to catch this issue?

A couple more comments below.

> 
> Signed-off-by: Ravi Bangoria 
> ---
>  arch/powerpc/include/asm/hw_breakpoint.h |  7 ++--
>  arch/powerpc/kernel/hw_breakpoint.c  | 44 +---
>  arch/powerpc/kernel/process.c| 34 --
>  3 files changed, 60 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hw_breakpoint.h
> b/arch/powerpc/include/asm/hw_breakpoint.h
> index 8acbbdd4a2d5..749a357164d5 100644
> --- a/arch/powerpc/include/asm/hw_breakpoint.h
> +++ b/arch/powerpc/include/asm/hw_breakpoint.h
> @@ -34,6 +34,8 @@ struct arch_hw_breakpoint {
>  #define HW_BRK_TYPE_PRIV_ALL (HW_BRK_TYPE_USER | HW_BRK_TYPE_KERNEL | \
>HW_BRK_TYPE_HYP)
>  
> +#define HW_BREAKPOINT_ALIGN 0x7
> +
>  #ifdef CONFIG_HAVE_HW_BREAKPOINT
>  #include 
>  #include 
> @@ -45,8 +47,6 @@ struct pmu;
>  struct perf_sample_data;
>  struct task_struct;
>  
> -#define HW_BREAKPOINT_ALIGN 0x7
> -
>  extern int hw_breakpoint_slots(int type);
>  extern int arch_bp_generic_fields(int type, int *gen_bp_type);
>  extern int arch_check_bp_in_kernelspace(struct arch_hw_breakpoint *hw);
> @@ -76,7 +76,8 @@ static inline void hw_breakpoint_disable(void)
>  }
>  extern void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs);
>  int hw_breakpoint_handler(struct die_args *args);
> -
> +extern u16 hw_breakpoint_get_final_len(struct arch_hw_breakpoint *brk,
> + unsigned long *start_addr, unsigned long *end_addr);
>  extern int set_dawr(struct arch_hw_breakpoint *brk);
>  extern bool dawr_force_enable;
>  static inline bool dawr_enabled(void)
> diff --git a/arch/powerpc/kernel/hw_breakpoint.c
> b/arch/powerpc/kernel/hw_breakpoint.c
> index 36bcf705df65..c122fd55aa44 100644
> --- a/arch/powerpc/kernel/hw_breakpoint.c
> +++ b/arch/powerpc/kernel/hw_breakpoint.c
> @@ -126,6 +126,28 @@ int arch_bp_generic_fields(int type, int *gen_bp_type)
>   return 0;
>  }
>  
> +/* Maximum len for DABR is 8 bytes and DAWR is 512 bytes */
> +static int hw_breakpoint_validate_len(struct arch_hw_breakpoint *hw)
> +{
> + u16 length_max = 8;
> + u16 final_len;
> + unsigned long start_addr, end_addr;
> +
> + final_len = hw_breakpoint_get_final_len(hw, &start_addr, &end_addr);
> +
> + if (dawr_enabled()) {
> + length_max = 512;
> + /* DAWR region can't cross 512 bytes boundary */
> + if ((start_addr >> 9) != (end_addr >> 9))
> + return -EINVAL;
> + }
> +
> + if (final_len > length_max)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
>  /*
>   * Validate the arch-specific HW Breakpoint register settings
>   */
> @@ -133,12 +155,10 @@ int hw_breakpoint_arch_parse(struct perf_event *bp,
>const struct perf_event_attr *attr,
>struct arch_hw_breakpoint *hw)
>  {
> - int length_max;
> -
>   if (!ppc_breakpoint_available())
>   return -ENODEV;
>  
> - if (!bp)
> + if (!bp || !attr->bp_len)
>   return -EINVAL;
>  
>   hw->type = HW_BRK_TYPE_TRANSLATE;
> @@ -160,23 +180,7 @@ int hw_breakpoint_arch_parse(struct perf_event *bp,
>   hw->address = attr->bp_addr;
>   hw->len = attr->bp_len;
>  
> - length_max = 8; /* DABR */
> - if (dawr_enabled()) {
> - length_max = 512 ; /* 64 doublewords */
> - /* DAWR region can't cross 512 bytes boundary */
> - if ((hw->address >> 9) != ((hw->address + hw->len - 1) >> 9))
> - return -EINVAL;
> - }
> -
> - /*
> -  * Since breakpoint length can be a maximum of length_max and
> -  * breakpoint addresses are aligned to nearest double-word
> -  * HW_BREAKPOINT_ALIGN by rounding off to the lower address,
> -  * the 'symbolsize' should satisfy the check

Re: [PATCH 0/5] Powerpc/hw-breakpoint: Fixes plus Code refactor

2019-06-18 Thread Michael Neuling

On Tue, 2019-06-18 at 08:01 +0200, Christophe Leroy wrote:
> 
> Le 18/06/2019 à 06:27, Ravi Bangoria a écrit :
> > patch 1-3: Code refactor
> > patch 4: Speedup disabling breakpoint
> > patch 5: Fix length calculation for unaligned targets
> 
> While you are playing with hw breakpoints, did you have a look at 
> https://github.com/linuxppc/issues/issues/38 ?

Agreed and also: 

https://github.com/linuxppc/issues/issues/170

https://github.com/linuxppc/issues/issues/128 

Mikey

Re: [PATCH 4/5] Powerpc/hw-breakpoint: Optimize disable path

2019-06-18 Thread Michael Neuling

On Tue, 2019-06-18 at 09:57 +0530, Ravi Bangoria wrote:
> Directly setting dawr and dawrx with 0 should be enough to
> disable watchpoint. No need to reset individual bits in
> variable and then set in hw.

This seems like a pointless optimisation to me. 

I'm all for adding more code/complexity if it buys us some performance, but I
can't imagine this is a fast path (nor have you stated any performance
benefits). 

Mikey

> 
> Signed-off-by: Ravi Bangoria 
> ---
>  arch/powerpc/include/asm/hw_breakpoint.h |  3 ++-
>  arch/powerpc/kernel/process.c| 12 
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/hw_breakpoint.h
> b/arch/powerpc/include/asm/hw_breakpoint.h
> index 78202d5fb13a..8acbbdd4a2d5 100644
> --- a/arch/powerpc/include/asm/hw_breakpoint.h
> +++ b/arch/powerpc/include/asm/hw_breakpoint.h
> @@ -19,6 +19,7 @@ struct arch_hw_breakpoint {
>  /* Note: Don't change the the first 6 bits below as they are in the same
> order
>   * as the dabr and dabrx.
>   */
> +#define HW_BRK_TYPE_DISABLE  0x00
>  #define HW_BRK_TYPE_READ 0x01
>  #define HW_BRK_TYPE_WRITE0x02
>  #define HW_BRK_TYPE_TRANSLATE0x04
> @@ -68,7 +69,7 @@ static inline void hw_breakpoint_disable(void)
>   struct arch_hw_breakpoint brk;
>  
>   brk.address = 0;
> - brk.type = 0;
> + brk.type = HW_BRK_TYPE_DISABLE;
>   brk.len = 0;
>   if (ppc_breakpoint_available())
>   __set_breakpoint(&brk);
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index f002d286..265fac9fb3a4 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -793,10 +793,22 @@ static inline int set_dabr(struct arch_hw_breakpoint
> *brk)
>   return __set_dabr(dabr, dabrx);
>  }
>  
> +static int disable_dawr(void)
> +{
> + if (ppc_md.set_dawr)
> + return ppc_md.set_dawr(0, 0);
> +
> + mtspr(SPRN_DAWRX, 0);
> + return 0;
> +}
> +
>  int set_dawr(struct arch_hw_breakpoint *brk)
>  {
>   unsigned long dawr, dawrx, mrd;
>  
> + if (brk->type == HW_BRK_TYPE_DISABLE)
> + return disable_dawr();
> +
>   dawr = brk->address;
>  
>   dawrx  = (brk->type & HW_BRK_TYPE_RDWR) << (63 - 58);

Re: [PATCH 3/5] Powerpc/hw-breakpoint: Refactor set_dawr()

2019-06-18 Thread Michael Neuling

This is going to collide with this patch 
https://patchwork.ozlabs.org/patch/1109594/

Mikey


On Tue, 2019-06-18 at 09:57 +0530, Ravi Bangoria wrote:
> Remove unnecessary comments. Code itself is self explanatory.
> And, ISA already talks about MRD field. I Don't think we need
> to re-describe it.
> 
> Signed-off-by: Ravi Bangoria 
> ---
>  arch/powerpc/kernel/process.c | 17 +
>  1 file changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index f0fbbf6a6a1f..f002d286 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -799,18 +799,11 @@ int set_dawr(struct arch_hw_breakpoint *brk)
>  
>   dawr = brk->address;
>  
> - dawrx  = (brk->type & (HW_BRK_TYPE_READ | HW_BRK_TYPE_WRITE)) \
> -<< (63 - 58); //* read/write bits */
> - dawrx |= ((brk->type & (HW_BRK_TYPE_TRANSLATE)) >> 2) \
> -<< (63 - 59); //* translate */
> - dawrx |= (brk->type & (HW_BRK_TYPE_PRIV_ALL)) \
> ->> 3; //* PRIM bits */
> - /* dawr length is stored in field MDR bits 48:53.  Matches range in
> -doublewords (64 bits) baised by -1 eg. 0b00=1DW and
> -0b11=64DW.
> -brk->len is in bytes.
> -This aligns up to double word size, shifts and does the bias.
> - */
> + dawrx  = (brk->type & HW_BRK_TYPE_RDWR) << (63 - 58);
> + dawrx |= ((brk->type & HW_BRK_TYPE_TRANSLATE) >> 2) << (63 - 59);
> + dawrx |= (brk->type & HW_BRK_TYPE_PRIV_ALL) >> 3;
> +
> + /* brk->len is in bytes. */
>   mrd = ((brk->len + 7) >> 3) - 1;
>   dawrx |= (mrd & 0x3f) << (63 - 53);
>
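
As a worked example of the encoding in the quoted hunk, the user-space sketch
below mirrors the refactored DAWRX computation. The READ/WRITE/TRANSLATE
values appear in the quoted header; the USER/KERNEL/HYP values are assumed
from arch/powerpc/include/asm/hw_breakpoint.h, and dawrx_encode() is a
hypothetical name for illustration only.

```c
#include <stdint.h>

#define HW_BRK_TYPE_READ	0x01
#define HW_BRK_TYPE_WRITE	0x02
#define HW_BRK_TYPE_TRANSLATE	0x04
#define HW_BRK_TYPE_USER	0x08	/* assumed value */
#define HW_BRK_TYPE_KERNEL	0x10	/* assumed value */
#define HW_BRK_TYPE_HYP		0x20	/* assumed value */
#define HW_BRK_TYPE_RDWR	(HW_BRK_TYPE_READ | HW_BRK_TYPE_WRITE)
#define HW_BRK_TYPE_PRIV_ALL	(HW_BRK_TYPE_USER | HW_BRK_TYPE_KERNEL | \
				 HW_BRK_TYPE_HYP)

/* Hypothetical helper mirroring the dawrx computation in set_dawr() */
static uint64_t dawrx_encode(unsigned int type, unsigned int len)
{
	uint64_t dawrx, mrd;

	dawrx  = (uint64_t)(type & HW_BRK_TYPE_RDWR) << (63 - 58);
	dawrx |= (uint64_t)((type & HW_BRK_TYPE_TRANSLATE) >> 2) << (63 - 59);
	dawrx |= (type & HW_BRK_TYPE_PRIV_ALL) >> 3;

	/*
	 * len is in bytes. MRD holds the number of doublewords biased by -1,
	 * so round up to a doubleword count and subtract one.
	 */
	mrd = ((len + 7) >> 3) - 1;
	dawrx |= (mrd & 0x3f) << (63 - 53);

	return dawrx;
}
```

For a user read/write/translate watchpoint, an 8-byte length gives MRD = 0 and
a 16-byte length gives MRD = 1, so the two results differ only in the MRD
field.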

Re: [PATCH 1/5] Powerpc/hw-breakpoint: Replace stale do_dabr() with do_break()

2019-06-18 Thread Michael Neuling

> Subject: Powerpc/hw-breakpoint: Replace stale do_dabr() with do_break()

Can you add the word "comment" to this subject? Currently it implies there are
code changes here.

Mikey


On Tue, 2019-06-18 at 09:57 +0530, Ravi Bangoria wrote:
> do_dabr() was renamed with do_break() long ago. But I still see
> some comments mentioning do_dabr(). Replace it.
> 
> Signed-off-by: Ravi Bangoria 
> ---
>  arch/powerpc/kernel/hw_breakpoint.c | 2 +-
>  arch/powerpc/kernel/ptrace.c| 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/hw_breakpoint.c
> b/arch/powerpc/kernel/hw_breakpoint.c
> index a293a53b4365..1908e4fcc132 100644
> --- a/arch/powerpc/kernel/hw_breakpoint.c
> +++ b/arch/powerpc/kernel/hw_breakpoint.c
> @@ -232,7 +232,7 @@ int hw_breakpoint_handler(struct die_args *args)
>* Return early after invoking user-callback function without restoring
>* DABR if the breakpoint is from ptrace which always operates in
>* one-shot mode. The ptrace-ed process will receive the SIGTRAP signal
> -  * generated in do_dabr().
> +  * generated in do_break().
>*/
>   if (bp->overflow_handler == ptrace_triggered) {
>   perf_bp_event(bp, regs);
> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> index 684b0b315c32..44b823e5e8c8 100644
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -2373,7 +2373,7 @@ void ptrace_triggered(struct perf_event *bp,
>   /*
>* Disable the breakpoint request here since ptrace has defined a
>* one-shot behaviour for breakpoint exceptions in PPC64.
> -  * The SIGTRAP signal is generated automatically for us in do_dabr().
> +  * The SIGTRAP signal is generated automatically for us in do_break().
>* We don't have to do anything about that here
>*/
>   attr = bp->attr;

Re: [PATCH] Powerpc/Watchpoint: Restore nvgprs while returning from exception

2019-06-06 Thread Michael Neuling

On Thu, 2019-06-06 at 12:59 +0530, Ravi Bangoria wrote:
> Powerpc hw triggers watchpoint before executing the instruction.
> To make trigger-after-execute behavior, kernel emulates the
> instruction. If the instruction is 'load something into non-
> volatile register', exception handler should restore emulated
> register state while returning back, otherwise there will be
> register state corruption. Ex, Adding a watchpoint on a list
> can corrput the list:
> 
>   # cat /proc/kallsyms | grep kthread_create_list
>   c121c8b8 d kthread_create_list
> 
> Add watchpoint on kthread_create_list->next:
> 
>   # perf record -e mem:0xc121c8c0
> 
> Run some workload such that new kthread gets invoked. Ex, I
> just logged out from console:
> 
>   list_add corruption. next->prev should be prev (c1214e00), \
>   but was c121c8b8. (next=c121c8b8).
>   WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0
>   CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69
>   ...
>   NIP __list_add_valid+0xb4/0xc0
>   LR __list_add_valid+0xb0/0xc0
>   Call Trace:
>   __list_add_valid+0xb0/0xc0 (unreliable)
>   __kthread_create_on_node+0xe0/0x260
>   kthread_create_on_node+0x34/0x50
>   create_worker+0xe8/0x260
>   worker_thread+0x444/0x560
>   kthread+0x160/0x1a0
>   ret_from_kernel_thread+0x5c/0x70
> 
> Signed-off-by: Ravi Bangoria 

How long has this been around? Should we be CCing stable?

Mikey

> ---
>  arch/powerpc/kernel/exceptions-64s.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64s.S
> b/arch/powerpc/kernel/exceptions-64s.S
> index 9481a11..96de0d1 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -1753,7 +1753,7 @@ handle_dabr_fault:
>   ld  r5,_DSISR(r1)
>   addir3,r1,STACK_FRAME_OVERHEAD
>   bl  do_break
> -12:  b   ret_from_except_lite
> +12:  b   ret_from_except
>  
>  
>  #ifdef CONFIG_PPC_BOOK3S_64

Re: [PATCH 0/7] powerpc: Modernize unhandled signals message

2018-07-25 Thread Michael Neuling

 On Tue, 2018-07-24 at 16:27 -0300, Murilo Opsfelder Araujo wrote:
> Hi, everyone.
> 
> This series was inspired by the need to modernize and display more
> informative messages about unhandled signals.
> 
> The "unhandled signal NN" is not very informative.  We thought it would
> be helpful adding a human-readable message describing what the signal
> number means, printing the VMA address, and dumping the instructions.
> 
> We can add more informative messages, like informing what each code of a
> SIGSEGV signal means.  We are open to suggestions.
> 
> I have collected some early feedback from Michael Ellerman about this
> series and would love to hear more feedback from you all.

Nice.. the instruction dump would have been very handy when debugging the PCR
init issue I had a month or so back.

> Before this series:
> 
> Jul 24 13:01:07 localhost kernel: pandafault[5989]: unhandled signal 11 
> at 17d0 nip 161c lr 3fff85a75100 code 2
> 
> After this series:
> 
> Jul 24 13:08:01 localhost kernel: pandafault[10758]: segfault (11) at 
> 17d0 nip 161c lr 7fffabc85100 code 2 in 
> pandafault[1000+1]
> Jul 24 13:08:01 localhost kernel: Instruction dump:
> Jul 24 13:08:01 localhost kernel: 4bfffeec 4bfffee8 3c401002 38427f00 
> fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
> Jul 24 13:08:01 localhost kernel: 392988d0 f93f0020 e93f0020 39400048 
> <9949> 3920 7d234b78 383f0040

What happens if we get a sudden flood of these from different processes that
overlap their output? Are we going to be able to match up the process with
instruction dump?

Should we prefix every line with the PID to avoid this?
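
A minimal user-space sketch of what such a prefix could look like, assuming a
"comm[pid]: " format like the first line of the existing message.
prefix_line() is a hypothetical helper and not part of the series.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: prepend "comm[pid]: " to one line of output so that
 * interleaved dumps from different processes can still be told apart.
 * Returns what snprintf() returns.
 */
static int prefix_line(char *out, size_t outsz, const char *comm, int pid,
		       const char *line)
{
	return snprintf(out, outsz, "%s[%d]: %s", comm, pid, line);
}
```

Applied to the "Instruction dump:" line and to each row of opcodes, every line
of the report would then carry the same process identifier.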

Mikey

Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.

2018-07-19 Thread Michael Neuling

On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote:
> Hello Mikey,
> 
> On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote:
> > 
> > >   DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
> > > diff --git a/arch/powerpc/kernel/idle_book3s.S
> > > b/arch/powerpc/kernel/idle_book3s.S
> > > index d85d551..5069d42 100644
> > > --- a/arch/powerpc/kernel/idle_book3s.S
> > > +++ b/arch/powerpc/kernel/idle_book3s.S
> > > @@ -120,6 +120,9 @@ power9_save_additional_sprs:
> > >   mfspr   r4, SPRN_MMCR2
> > >   std r3, STOP_MMCR1(r13)
> > >   std r4, STOP_MMCR2(r13)
> > > +
> > > + mfspr   r3, SPRN_SPRG3
> > > + std r3, STOP_SPRG3(r13)
> > 
> > We don't need to save it.  Just restore it from paca->sprg_vdso which should
> > never change.
> 
> Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso.
> 
> > 
> > How can we do better at catching these missing SPRGs?
> 
> We can go through the list of SPRs from the POWER9 User Manual and
> document explicitly why we don't have to save/restore certain SPRs
> during the execution of the stop instruction. Does this sound ok ?
> 
> (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual
> accessible from
> https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual)

I was thinking of a boot time test case built into Linux. Linux has some boot
time test cases which you can enable via CONFIG options.

Firstly you could see if an SPR exists using the same trick xmon does in
dump_one_spr(). Then once you have a list of usable SPRs, you could write all
the known ones (I assume you'd have to leave out some, like the PSSCR), then set
the appropriate stop level, make sure you got into that stop level, and then see
if that register was changed. Then you'd have an automated list of registers you
need to make sure you save/restore at each stop level.
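
The flow above can be modelled in user space as a rough sketch. Everything
here is hypothetical: struct fake_cpu stands in for the real SPR file,
fake_stop() stands in for entering and leaving a stop state, and skip_mask
plays the role of registers like the PSSCR that must not be written.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_FAKE_SPRS	8

/* Hypothetical stand-in for the set of usable SPRs on a thread */
struct fake_cpu {
	uint64_t spr[NR_FAKE_SPRS];
};

/* Non-zero, per-register pattern so that a cleared register is detectable */
static uint64_t test_pattern(size_t i)
{
	return 0xA5A5000000000000ULL | (i + 1);
}

/* Stand-in for a stop state: every SPR flagged in lost_mask comes back as 0 */
static void fake_stop(struct fake_cpu *cpu, uint32_t lost_mask)
{
	for (size_t i = 0; i < NR_FAKE_SPRS; i++)
		if (lost_mask & (1u << i))
			cpu->spr[i] = 0;
}

/*
 * Write a known pattern to every SPR not in skip_mask, go through the
 * simulated stop, and return a bitmask of the SPRs whose value changed.
 */
static uint32_t find_lost_sprs(struct fake_cpu *cpu, uint32_t skip_mask,
			       uint32_t lost_mask)
{
	uint32_t changed = 0;

	for (size_t i = 0; i < NR_FAKE_SPRS; i++)
		if (!(skip_mask & (1u << i)))
			cpu->spr[i] = test_pattern(i);

	fake_stop(cpu, lost_mask);

	for (size_t i = 0; i < NR_FAKE_SPRS; i++)
		if (!(skip_mask & (1u << i)) && cpu->spr[i] != test_pattern(i))
			changed |= 1u << i;

	return changed;
}
```

The returned mask is the automated list: one bit per register that would need
save/restore at the stop level being tested.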

Could something like that work?

Mikey

Re: [PATCH 2/2] powerpc: Enable ASYM_SMT on interleaved big-core systems

2018-05-13 Thread Michael Neuling

On Fri, 2018-05-11 at 16:47 +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> Each of the SMT4 cores forming a fused-core are more or less
> independent units. Thus when multiple tasks are scheduled to run on
> the fused core, we get the best performance when the tasks are spread
> across the pair of SMT4 cores.
> 
> Since the threads in the pair of SMT4 cores of an interleaved big-core
> are numbered {0,2,4,6} and {1,3,5,7} respectively, enable ASYM_SMT on
> such interleaved big-cores that will bias the load-balancing of tasks
> on smaller numbered threads, which will automatically result in
> spreading the tasks uniformly across the associated pair of SMT4
> cores.
> 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  arch/powerpc/kernel/smp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 9ca7148..0153f01 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1082,7 +1082,7 @@ static int powerpc_smt_flags(void)
>  {
>   int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
>  
> - if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
> + if (cpu_has_feature(CPU_FTR_ASYM_SMT) || has_interleaved_big_core) {

Shouldn't we just set CPU_FTR_ASYM_SMT and leave this code unchanged?


>   printk_once(KERN_INFO "Enabling Asymmetric SMT
> scheduling\n");
>   flags |= SD_ASYM_PACKING;
>   }
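
To illustrate why biasing towards smaller numbered threads spreads tasks on an
interleaved big-core, here is a simplified user-space model. It is not
scheduler code: small_core_of() and place_on_lowest_threads() are hypothetical
helpers, and the only input taken from the patch description is the
{0,2,4,6} / {1,3,5,7} numbering.

```c
#define THREADS_PER_BIG_CORE	8
#define NR_SMALL_CORES		2

/* With interleaved numbering, even threads sit on one SMT4 core, odd on the other */
static int small_core_of(int tid)
{
	return tid % NR_SMALL_CORES;
}

/*
 * Model of packing: put nr_tasks tasks on the lowest numbered threads and
 * count how many land on each component small core.
 */
static void place_on_lowest_threads(int nr_tasks, int per_core[NR_SMALL_CORES])
{
	int tid;

	for (tid = 0; tid < NR_SMALL_CORES; tid++)
		per_core[tid] = 0;

	for (tid = 0; tid < nr_tasks && tid < THREADS_PER_BIG_CORE; tid++)
		per_core[small_core_of(tid)]++;
}
```

In this model two tasks end up on threads 0 and 1, one per small core, and any
task count stays balanced to within one task between the two SMT4 cores.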

Re: [PATCH 1/2] powerpc: Detect the presence of big-core with interleaved threads

2018-05-13 Thread Michael Neuling

Thanks for posting this... A couple of comments below.

On Fri, 2018-05-11 at 16:47 +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> A pair of IBM POWER9 SMT4 cores can be fused together to form a
> big-core with 8 SMT threads. This can be discovered via the
> "ibm,thread-groups" CPU property in the device tree which will
> indicate which group of threads that share the L1 cache, translation
> cache and instruction data flow.  If there are multiple such group of
> threads, then the core is a big-core. The thread-ids of the threads of
> the big-core can be obtained by interleaving the thread-ids of the
> thread-groups (component small core).
> 
> Eg: Threads in the pair of component SMT4 cores of an interleaved
> big-core are numbered {0,2,4,6} and {1,3,5,7} respectively.
> 
> This patch introduces a function to check if a given device tree node
> corresponding to a CPU node represents an interleaved big-core.
> 
> This function is invoked during the boot-up to detect the presence of
> interleaved big-cores. The presence of such an interleaved big-core is
> recorded in a global variable for later use.
> 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  arch/powerpc/include/asm/cputhreads.h |  8 +++--
>  arch/powerpc/kernel/setup-common.c| 63 +-
> -
>  2 files changed, 66 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cputhreads.h
> b/arch/powerpc/include/asm/cputhreads.h
> index d71a909..b706f0a 100644
> --- a/arch/powerpc/include/asm/cputhreads.h
> +++ b/arch/powerpc/include/asm/cputhreads.h
> @@ -23,11 +23,13 @@
>  extern int threads_per_core;
>  extern int threads_per_subcore;
>  extern int threads_shift;
> +extern bool has_interleaved_big_core;
>  extern cpumask_t threads_core_mask;
>  #else
> -#define threads_per_core           1
> -#define threads_per_subcore        1
> -#define threads_shift              0
> +#define threads_per_core           1
> +#define threads_per_subcore        1
> +#define threads_shift              0
> +#define has_interleaved_big_core   0
>  #define threads_core_mask          (*get_cpu_mask(0))
>  #endif
>  
> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-
> common.c
> index 0af5c11..884dff2 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -408,10 +408,12 @@ void __init check_for_initrd(void)
>  #ifdef CONFIG_SMP
>  
>  int threads_per_core, threads_per_subcore, threads_shift;
> +bool has_interleaved_big_core;
>  cpumask_t threads_core_mask;
>  EXPORT_SYMBOL_GPL(threads_per_core);
>  EXPORT_SYMBOL_GPL(threads_per_subcore);
>  EXPORT_SYMBOL_GPL(threads_shift);
> +EXPORT_SYMBOL_GPL(has_interleaved_big_core);
>  EXPORT_SYMBOL_GPL(threads_core_mask);
>  
>  static void __init cpu_init_thread_core_maps(int tpc)
> @@ -436,8 +438,56 @@ static void __init cpu_init_thread_core_maps(int tpc)
>   printk(KERN_DEBUG " (thread shift is %d)\n", threads_shift);
>  }
>  
> -
>  u32 *cpu_to_phys_id = NULL;
> +/*
> + * check_for_interleaved_big_core - Checks if the core represented by
> + *dn is a big-core whose threads are interleavings of the
> + *threads of the component small cores.
> + *
> + * @dn: device node corresponding to the core.
> + *
> + * Returns true if the core is an interleaved big-core.
> + * Returns false otherwise.
> + */
> +static inline bool check_for_interleaved_big_core(struct device_node *dn)
> +{
> + int len, nr_groups, threads_per_group;
> + const __be32 *thread_groups;
> + __be32 *thread_list, *first_cpu_idx;
> + int cur_cpu, next_cpu, i, j;
> +
> + thread_groups = of_get_property(dn, "ibm,thread-groups", &len);
> + if (!thread_groups)
> + return false;

Can you document what this property looks like? Seems to be nr_groups,
threads_per_group, thread_list. Can you explain what each of these mean?

If we get configured with an SMT2 big-core (ie. two interleaved SMT1 normal
cores), will this code also work there?

> +
> + nr_groups = be32_to_cpu(*(thread_groups + 1));
> + if (nr_groups <= 1)
> + return false;
> +
> + threads_per_group = be32_to_cpu(*(thread_groups + 2));
> + thread_list = (__be32 *)thread_groups + 3;
> +
> + /*
> +  * In case of an interleaved big-core, the thread-ids of the
> +  * big-core can be obtained by interleaving the thread-ids
> +  * of the component small cores.
> +  *
> +  * Eg: On a 8-thread big-core with two SMT4 small cores, the
> +  * threads of the two component small cores will be
> +  * {0, 2, 4, 6} and {1, 3, 5, 7}.
> +  */
> + for (i = 0; i < nr_groups; i++) {
> + first_cpu_idx = thread_list + i * threads_per_group;
> +
> + for (j = 0; j < threads_per_group - 1; j++) {
> + cur_cpu = be32_to_cpu(*(first_cpu_idx + j));
> + next_cpu = be32_to_cpu(*(first_cpu_idx + j + 1));
> + if (next_cpu !=
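
The property layout asked about above can be sketched as a stand-alone check. This is a minimal user-space model under stated assumptions, not the patch itself: it assumes the cells are already in CPU byte order, that cell 0 is skipped (as in the quoted code), that cell 1 is nr_groups, cell 2 is threads_per_group, and that the thread ids follow group by group. The comparison rule (consecutive ids in a group differ by nr_groups) is also an assumption, since the quoted loop is cut off before its condition. The name is_interleaved_big_core() and the nr_cells bounds check are hypothetical additions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed layout of the property, already converted to CPU endianness:
 *   prop[0]   skipped, not interpreted here
 *   prop[1]   nr_groups
 *   prop[2]   threads_per_group
 *   prop[3..] nr_groups * threads_per_group thread ids, group by group
 */
static bool is_interleaved_big_core(const uint32_t *prop, size_t nr_cells)
{
	uint32_t nr_groups, threads_per_group, i, j;
	const uint32_t *thread_list;

	if (!prop || nr_cells < 3)
		return false;

	nr_groups = prop[1];
	threads_per_group = prop[2];
	if (nr_groups <= 1 || threads_per_group == 0)
		return false;

	/* Reject a property too short to hold every thread id. */
	if ((nr_cells - 3) / threads_per_group < nr_groups)
		return false;

	thread_list = prop + 3;
	for (i = 0; i < nr_groups; i++) {
		const uint32_t *group = thread_list + (size_t)i * threads_per_group;

		/* Assumed rule: ids within a group are nr_groups apart. */
		for (j = 0; j + 1 < threads_per_group; j++) {
			if (group[j + 1] != group[j] + nr_groups)
				return false;
		}
	}
	return true;
}
```

Under these assumptions an SMT2 big-core made of two single-thread groups passes trivially, because the inner loop never runs when threads_per_group is 1. Whether that is the intended answer to the SMT2 question is not something the quoted code settles.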

[PATCH] nvme-pci: Fix NULL ptr deref in EEH code

2018-03-19 Thread Michael Neuling

On powerpc on boot we can take an EEH event which results in this oops.

cpu 0x23: Vector: 300 (Data Access) at [c00ff50f3800]
pc: c008089a0eb0: nvme_error_detected+0x4c/0x90 [nvme]
lr: c0026564: eeh_report_error+0xe0/0x110
sp: c00ff50f3a80
msr: 90009033
dar: 400
dsisr: 4000
current = 0xc00ff507c000
paca = 0xcfdc9d80 softe: 0 irq_happened: 0x01
pid = 782, comm = eehd
Linux version 4.15.6-openpower1 (smc@smc-desktop) (gcc version 6.4.0 (Buildroot 
2017.11.2-8-g4b6188e)) #2 SMP Tue Feb 27 12:33:27 PST 2018
enter ? for help
[c00ff50f3af0] c0026564 eeh_report_error+0xe0/0x110
[c00ff50f3b30] c0025520 eeh_pe_dev_traverse+0xc0/0xdc
[c00ff50f3bc0] c0026bd0 eeh_handle_normal_event+0x184/0x4c4
[c00ff50f3c70] c0026ff4 eeh_handle_event+0x30/0x288
[c00ff50f3d10] c002758c eeh_event_handler+0x124/0x170
[c00ff50f3dc0] c008fed0 kthread+0x14c/0x154
[c00ff50f3e30] c000b594 ret_from_kernel_thread+0x5c/0xc8

This fixes the NULL ptr deref.

Signed-off-by: Michael Neuling <mi...@neuling.org>
---
 drivers/nvme/host/pci.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b6f43b738f..404b346e3c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2626,6 +2626,9 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
 {
struct nvme_dev *dev = pci_get_drvdata(pdev);
 
+   if (!dev)
+   return PCI_ERS_RESULT_NEED_RESET;
+
/*
 * A frozen channel requires a reset. When detected, this method will
 * shutdown the controller to quiesce. The controller will be restarted
-- 
2.14.1
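
The guard added by the patch can be modelled outside the kernel as below. This is only a sketch: the types are simplified stand-ins rather than the real kernel definitions, the handler takes one argument instead of two, and the "online" field is hypothetical. The idea that the driver data is still unset because the event arrives before probe has finished is an assumption; the patch description only says the event can be taken on boot.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; not the real definitions. */
typedef enum {
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
} pci_ers_result_t;

struct nvme_dev { int online; };        /* hypothetical field */
struct pci_dev { void *driver_data; };

static void *pci_get_drvdata(struct pci_dev *pdev)
{
	return pdev->driver_data;
}

static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	/* No driver data attached to the device yet: ask for a reset. */
	if (!dev)
		return PCI_ERS_RESULT_NEED_RESET;

	/* Only reached with a valid pointer, so this dereference is safe. */
	return dev->online ? PCI_ERS_RESULT_CAN_RECOVER
			   : PCI_ERS_RESULT_NEED_RESET;
}
```

Without the early return, the dereference in the last statement is where a model like this would fault when driver_data is NULL.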

Re: [PATCH v2 15/18] powerpc: Emulate paste instruction

2017-10-09 Thread Michael Neuling

> @@ -1924,6 +1987,7 @@ struct ppc_emulated ppc_emulated = {
>   WARN_EMULATED_SETUP(mfdscr),
>   WARN_EMULATED_SETUP(mtdscr),
>   WARN_EMULATED_SETUP(lq_stq),
> + WARN_EMULATED_SETUP(paste),

You'll need to rebase this on powerpc/next as this has changed upstream.

Mikey

Re: [PATCH v8 00/10] Enable VAS

2017-09-07 Thread Michael Neuling

So this is upstream now but it will cause a crash on boot with older skiboots
with: 

powernv-cpufreq: cpufreq pstate min 101 nominal 50 max 0
powernv-cpufreq: Workload Optimized Frequency is enabled in the platform
Disabling lock debugging due to kernel taint
Severe Machine check interrupt [Not recovered]
  NIP [c0098530]: reset_window_regs+0x20/0x220
  Initiator: CPU
  Error type: Unknown
opal: Machine check interrupt unrecoverable: MSR(RI=0)
opal: Hardware platform error: Unrecoverable Machine Check exception
CPU: 1 PID: 1 Comm: swapper/0 Tainted: G   M
4.13.0-rc7-00708-g8b680911e774-dirty #10
task: c00f2268 task.stack: c00f2270
NIP:  c0098530 LR: c0098758 CTR: 
REGS: c0003ffebd80 TRAP: 0200   Tainted: G   M 
(4.13.0-rc7-00708-g8b680911e774-dirty)
MSR:  98349031 <SF,HV,EE,ME,IR,DR,LE>  CR: 24000224  XER: 
CFAR: c0098754 DAR: 100bef30 DSISR: 4000 SOFTE: 0 
GPR00: c0098f44 c00f22703a00 c0eff200 c00f1cf861e0 
GPR04: c00f22703a50 0001 0fff 0003 
GPR08: c00c842f 0001   
GPR12:  cfd40580 c0c03590 c0c1f428 
GPR16: c0c4a640 c0c31360 c0c92738 c0c92690 
GPR20: c0c926a0 c0c926f0 c00f1d672940  
GPR24: c0c92638  c0e888b0 c0dce428 
GPR28: 0002 c00f0e80 c00f22703a50 c00f1cf861e0 
NIP [c0098530] reset_window_regs+0x20/0x220
LR [c0098758] init_winctx_regs+0x28/0x6c0
Call Trace:
[c00f22703a00] [0002] 0x2 (unreliable)
[c00f22703a30] [c0098f44] vas_rx_win_open.part.11+0x154/0x210
[c00f22703ae0] [c0d668e8] nx842_powernv_init+0x6b4/0x824
[c00f22703c40] [c000ca60] do_one_initcall+0x60/0x1c0
[   38.412765557,0] OPAL: Reboot requested due to Platform error.
[   38.412828287,3] OPAL: Reboot requested due to Platform error.

If you see this you need a new skiboot with at least these two patches:

b503dcf16d vas: Set mmio enable bits in DD2
a5c124072f vas: Set FIRs according to workbook

This is a community announcement brought to you by OzLabs. 
  OzLabs: Making Linux better since 1999

Mikey


On Mon, 2017-08-28 at 23:23 -0700, Sukadev Bhattiprolu wrote:
> Power9 introduces a hardware subsystem referred to as the Virtual
> Accelerator Switchboard (VAS). VAS allows kernel subsystems and user
> space processes to directly access the Nest Accelerator (NX) engines
> which implement compression and encryption algorithms in the hardware.
> 
> NX has been in Power processors since Power7+, but access to the NX
> engines was through the 'icswx' instruction which is only available
> to the kernel/hypervisor. Starting with Power9, access to the NX
> engines is provided to both kernel and user space processes through
> VAS.
> 
> The switchboard (i.e. VAS) multiplexes accesses between "receivers" and
> "senders", where the "receivers" are typically the NX engines and
> "senders" are the kernel subsystems and user processes that wish to
> access the receivers (NX engines).  Once a sender is "connected" to
> a receiver through the switchboard, the senders can submit compression/
> encryption requests to the hardware using the new (PowerISA 3.0)
> "copy" and "paste" instructions.
> 
> In the initial OPAL and PowerNV kernel patchsets, the "senders" can
> only be kernel subsystems (eg NX-842 driver) and receivers can only
> be the NX-842 engine. Follow-on patch sets will allow senders/receivers
> to be user-space processes and receivers to be NX-GZIP engines.
> 
> Provides:
> 
>   This kernel patch set configures the VAS subsystems and provides
>   kernel interfaces to drivers like NX-842 to open receive and send
>   windows in VAS and to submit compression requests to the NX engine.
> 
> Requires:
> 
>   This patch set needs corresponding VAS/NX skiboot patches which
>   were merged into skiboot tree. i.e skiboot must include:
>   commit b503dcf ("vas: Set mmio enable bits in DD2")
> 
> Tests:
> In-kernel compression requests were tested on DD1 and DD2 POWER9
>   hardware using compression self-test module and the following
>   NX-842 patch set from Haren Myneni:
> 
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-July/160620.html
> 
>   and by dropping the last parameters to both vas_copy_crb() and
>   vas_paste_crb() calls in drivers/crypto/nx/nx-842-powernv.c.
>   See also PATCH 10/10.
> 
> Git Tree:
> 
> https://github.com/sukadev/linux/ 
>   Branch: vas-kern-v

Re: Status of reverted Linux patch "tty: Fix ldisc crash on reopened tty", Linux 4.9 kernel frequent crashes

2017-08-30 Thread Michael Neuling

On Thu, 2017-08-31 at 06:36 +0200, Greg Kroah-Hartman wrote:
> On Wed, Aug 30, 2017 at 11:10:14PM +0300, Pasi Kärkkäinen wrote:
> > Hello everyone,
> > 
> > Recently Nathan March reported on centos-virt list he's getting frequent
> > Linux kernel crashes with Linux 4.9 LTS kernel because of the missing patch
> > "tty: Fix ldisc crash on reopened tty".
> 
> Crashes with "normal" operation, or crashes when running a fuzzer or
> other type of program?

For me it crashed on boot.

> 
> > The patch was already merged upstream here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=71472fa9c52b1da27663c275d416d8654b905f05
> > 
> > but then reverted here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=896d81fefe5d1919537db2c2150ab6384e4a6610
> > 
> > Nathan confirmed if he applies the patch from
> > 71472fa9c52b1da27663c275d416d8654b905f05 to his Linux 4.9 LTS kernel the
> > bug/problem goes away, so the patch (or similar fix) is still needed, at
> > least for 4.9 LTS kernel.
> > 
> > 
> > Mikulas reported he's able to trigger the same crash on Linux 4.10:
> > https://www.spinics.net/lists/kernel/msg2440637.html
> > https://lists.gt.net/linux/kernel/2664604?search_string=ldisc%20reopened;#2664604
> > 
> > Michael Neuling reported he's able to trigger the bug on PowerPC:
> > https://lkml.org/lkml/2017/3/10/1582
> > 
> > 
> > So now the question is.. is anyone currently working on getting this patch
> > fixed and applied upstream? I think one of the problems earlier was being
> > able to reliably reproduce the crash.. Nathan says he's able to reproduce it
> > many times per week on his environment on x86_64.
> 
> I don't know of anyone working on it, want to do it yourself?

I'm not anymore. We found it was only triggered on a bogus CONFIG option
combination.  Once we removed that, it no longer happened.

The underlying bug was still there though.

Mikey

Re: [PATCH v6 17/17] powerpc/vas: Document FTW API/usage

2017-08-14 Thread Michael Neuling

On Tue, 2017-08-08 at 16:07 -0700, Sukadev Bhattiprolu wrote:
> Document the usage of the VAS Fast thread-wakeup API.
> 
> Thanks for input/comments from Benjamin Herrenschmidt, Michael Neuling,
> Michael Ellerman, Robert Blackmore, Ian Munsie, Haren Myneni, Paul Mackerras.
> 
> Cc:Ian Munsie <imun...@au1.ibm.com>
> Cc:Paul Mackerras <pau...@ozlabs.org>
> Signed-off-by: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
> ---
>  Documentation/powerpc/ftw-api.txt | 373
> ++
>  1 file changed, 373 insertions(+)
>  create mode 100644 Documentation/powerpc/ftw-api.txt
> 
> diff --git a/Documentation/powerpc/ftw-api.txt b/Documentation/powerpc/ftw-
> api.txt
> new file mode 100644
> index 000..0b3f16f
> --- /dev/null
> +++ b/Documentation/powerpc/ftw-api.txt
> @@ -0,0 +1,373 @@
> +Virtual Accelerator Switchboard and Fast Thread-Wakeup API
> +
> +Power9 processor supports a hardware subsystem known as the Virtual
> +Accelerator Switchboard (VAS) which allows two entities in the Power9
> +system to efficiently exchange messages. Messages must be formatted as
> +Coprocessor Request Blocks (CRB) and be submitted using the COPY/PASTE
> +instructions (new in Power9).
> +
> +Usage of VAS depends on the entities exchanging the messages and
> +currently two usages have been identified.
> +
> +First usage of VAS, referred to as VAS/NX involves a software thread
> +submitting data compression requests to a co-processor (hardware/nest
> +accelerator) aka NX engine. The API for this usage is described in the
> +VAS/NX API document.
> +
> +Alternatively, VAS can be used by two software threads to efficiently
> +exchange messages. Initially, this mechanism is intended to wake up a
> +waiting thread quickly - i.e "fast thread wake-up (FTW)". This document
> +describes the user API for this VAS/FTW mechanism.
> +
> +Application access to the FTW mechanism is provided through the NX-FTW
> +device node (/dev/crypto/nx-ftw) implemented by the VAS/FTW device
> +driver.

crypto?

> +
> +A software thread T1 that intends to wait for an event must first set up
> +a receive window, by opening the NX-FTW device and using the
> +VAS_RX_WIN_OPEN ioctl. Upon successful return from the VAS_RX_WIN_OPEN
> +ioctl, an rx_win_handle is returned.

I realise there is a window here as part of the hardware implementation, but the
users don't care about the window on the receive side. It's hidden from them. 
It's just an rx handle IMHO.

The sender certainly has a window that users care about since they have to mmap
it.

> +
> +A software thread T2 that intends to wake up T1 at some point, must first
> +set up a "send window" using the VAS_TX_WIN_OPEN ioctl and specify the
> +rx_win_handle obtained by T1. After a successful VAS_TX_WIN_OPEN ioctl the
> +send window of T2 is considered paired with the receive window of T1. The
> +thread T2 must then use mmap() to obtain a "paste address" for the send
> +window.


> +With this set up, thread T1 can wait for an event using the WAIT
> +instruction.
> +
> +Thread T2 can wake up T1 by using the "COPY/PASTE" instructions and
> +submitting an empty/NULL CRB to the send window's paste address. The
> +wait/wake up process can be repeated as long as the threads have the
> +send/receive windows open.



> +1. NX-FTW Device Node
> +
> +There is one /dev/crypto/nx-ftw node in the system and it provides
> +access to the VAS/FTW functionality.


> +The only valid operations on the NX-FTW node are:
> +
> +- open() the device for read and write.
> +
> +- issue either VAS_RX_WIN_OPEN or VAS_TX_WIN_OPEN ioctls to set up
> +  receive or send (only one of them per open).
> +
> +- if the open is associated with send window (i.e VAS_TX_WIN_OPEN
> +  ioctl was issued) mmap() the send window into the application's
> +  virtual address space. (i.e get a 'paste_address' for the send
> +  window).
> +
> +- close the device node.
> +
> +Other file operations on the NX-FTW node are undefined.
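
The rules in the list above (one of the two ioctls per open, mmap() only for a send window) can be modelled as a small state check. This is a hypothetical user-space model for illustration, not the driver: struct ftw_open, ftw_win_open(), ftw_mmap() and the error codes are all assumptions; only the ioctl names in the comments come from the quoted document.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-open state; the driver's real bookkeeping is not shown. */
enum ftw_mode { FTW_NONE, FTW_RX, FTW_TX };

struct ftw_open {
	enum ftw_mode mode;
	bool mapped;
};

/* Models VAS_RX_WIN_OPEN / VAS_TX_WIN_OPEN: only one of them per open. */
static int ftw_win_open(struct ftw_open *f, enum ftw_mode want)
{
	if (want == FTW_NONE)
		return -EINVAL;
	if (f->mode != FTW_NONE)
		return -EBUSY;
	f->mode = want;
	return 0;
}

/* Models mmap(): only meaningful once a send window exists on this open. */
static int ftw_mmap(struct ftw_open *f)
{
	if (f->mode != FTW_TX)
		return -EINVAL;
	f->mapped = true;
	return 0;
}
```

In this model a receiver never maps anything, which matches the point made earlier in the review that only the sender has a window users need to care about.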
> +
> +Note that the COPY and PASTE operations go directly to the hardware
> +and do not go through the NX-FTW device.

I don't understand this statement

> +
> +Although a system may have several instances of the VAS in the system
> +(typically, one per P9 chip) there is just one NX-FTW device node in
> +the system.

> + When the NX-FTW device node is opened, the kernel assigns a suitable
> + instance of VAS to the process. Kernel will make a best-effort attempt
> + to assign an optimal instance

Re: [PATCH v6 17/17] powerpc/vas: Document FTW API/usage

2017-08-14 Thread Michael Neuling

On Tue, 2017-08-08 at 16:07 -0700, Sukadev Bhattiprolu wrote:
> Document the usage of the VAS Fast thread-wakeup API.
> 
> Thanks for input/comments from Benjamin Herrenschmidt, Michael Neuling,
> Michael Ellerman, Robert Blackmore, Ian Munsie, Haren Myneni, Paul Mackerras.
> 
> Cc:Ian Munsie 
> Cc:Paul Mackerras 
> Signed-off-by: Sukadev Bhattiprolu 
> ---
>  Documentation/powerpc/ftw-api.txt | 373
> ++
>  1 file changed, 373 insertions(+)
>  create mode 100644 Documentation/powerpc/ftw-api.txt
> 
> diff --git a/Documentation/powerpc/ftw-api.txt b/Documentation/powerpc/ftw-
> api.txt
> new file mode 100644
> index 000..0b3f16f
> --- /dev/null
> +++ b/Documentation/powerpc/ftw-api.txt
> @@ -0,0 +1,373 @@
> +Virtual Accelerator Switchboard and Fast Thread-Wakeup API
> +
> +Power9 processor supports a hardware subystem known as the Virtual
> +Accelerator Switchboard (VAS) which allows two entities in the Power9
> +system to efficiently exchange messages. Messages must be formatted as
> +Coprocessor Reqeust Blocks (CRB) and be submitted using the COPY/PASTE
> +instructions (new in Power9).
> +
> +Usage of VAS depends on the entities exchanging the messages and
> +currently two usages have been identified.
> +
> +First usage of VAS, referred to as VAS/NX involves a software thread
> +submitting data compression requests to a co-processor (hardware/nest
> +accelerator) aka NX engine. The API for this usage is described in the
> +VAS/NX API document.
> +
> +Alternatively, VAS can be used by two software threads to efficiently
> +exchange messages. Initially, this mechanism is intended to wake up a
> +waiting thread quickly - i.e "fast thread wake-up (FTW)". This document
> +describes the user API for this VAS/FTW mechanism.
> +
> +Application access to the FTW mechanism is provided through the NX-FTW
> +device node (/dev/crypto/nx-ftw) implemented by the VAS/FTW device
> +driver.

crypto?

> +
> +A software thread T1 that intends to wait for an event must first setup
> +a receive window, by opening the NX-FTW device and using the
> +VAS_RX_WIN_OPEN ioctl. Upon successful return from the VAS_RX_WIN_OPEN
> +ioctl, an rx_win_handle is returned.

I realise there is a window here as part of the hardware implementation, but the
users don't care about the window on the receive side. It's hidden from them. 
It's just an rx handle IMHO.

The sender certainly has a window that users care about since they have to mmap
it.

> +
> +A software thread T2 that intends to wake up T1 at some point, must first
> +set up a "send window" using the VAS_TX_WIN_OPEN ioctl and specify the
> +rx_win_handle obtained by T1. After a successful VAS_TX_WIN_OPEN ioctl
> the
> +send window of T2 is considered paired with the receive window of T1. The
> +thread T2 must then use mmap() to obtain a "paste address" for the send
> +window.


> +With this set up, thread T1 can wait for an event using the WAIT
> +instruction.
> +
> +Thread T2 can wake up T1 by using the "COPY/PASTE" instructions and
> +submitting an empty/NULL CRB to the send window's paste address. The
> +wait/wake up process can be repeated as long as the threads have the
> +send/receive windows open.



> +1. NX-FTW Device Node
> +
> +There is one /dev/crypto/nx-ftw node in the system and it provides
> +access to the VAS/FTW functionality.


> +The only valid operations on the NX-FTW node are:
> +
> +- open() the device for read and write.
> +
> +- issue either VAS_RX_WIN_OPEN or VAS_TX_WIN_OPEN ioctls to set up
> +  receive or send (only one of them per open).
> +
> +- if the open is associated with send window (i.e VAS_TX_WIN_OPEN
> +  ioctl was issued) mmap() the send window into the application's
> +  virtual address space. (i.e get a 'paste_address' for the send
> +  window).
> +
> +- close the device node.
> +
> +Other file operations on the NX-FTW node are undefined.
> +
> +Note that the COPY and PASTE operations go directly to the hardware
> +and do not go through the NX-FTW device.

I don't understand this statement

> +
> +Although a system may have several instances of the VAS in the system
> +(typically, one per P9 chip) there is just one NX-FTW device node in
> +the system.

> + When the NX-FTW device node is opened, the kernel assigns a suitable
> + instance of VAS to the process. Kernel will make a best-effort attempt
> + to assign an optimal instance

Re: [PATCH v6 14/17] powerpc: Add support for setting SPRN_TIDR

2017-08-14 Thread Michael Neuling

On Tue, 2017-08-08 at 16:06 -0700, Sukadev Bhattiprolu wrote:
> We need the SPRN_TIDR to be set for use with fast thread-wakeup
> (core-to-core wakeup).  Each thread in a process needs to have a
> unique id within the process but as explained below, for now, we
> assign globally unique thread ids to all threads in the system.
> 
> Signed-off-by: Sukadev Bhattiprolu 
> ---
>  arch/powerpc/include/asm/processor.h |  4 ++
>  arch/powerpc/kernel/process.c| 74
> 
>  2 files changed, 78 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/processor.h
> b/arch/powerpc/include/asm/processor.h
> index fab7ff8..bf6ba63 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -232,6 +232,10 @@ struct debug_reg {
>  struct thread_struct {
>   unsigned long   ksp;/* Kernel stack pointer */
>  
> +#ifdef CONFIG_PPC_VAS

I'm tempted to have this always, or a new feature CONFIG_PPC_TID that's PPC_VAS
depends on.

> + unsigned long   tidr;

> +#endif
> +
>  #ifdef CONFIG_PPC64
>   unsigned long   ksp_vsid;
>  #endif
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 9f3e2c9..6123859 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1213,6 +1213,16 @@ struct task_struct *__switch_to(struct task_struct
> *prev,
>   hard_irq_disable();
>   }
>  
> +#ifdef CONFIG_PPC_VAS
> + mtspr(SPRN_TIDR, new->thread.tidr);

How much does this hurt our context_switch benchmark in
tools/testing/selftests/powerpc/benchmarks/context_switch.c?

Also, you need a CPU_FTR_ARCH_300 test here (and elsewhere).
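
A rough userspace sketch of the kind of guard being asked for here. Note
that cpu_has_feature(), mtspr(), the feature bit and the SPR number below
are local stand-ins so the fragment compiles on its own; they are
assumptions for illustration, not the kernel definitions.

```c
#include <stdbool.h>

/* Stand-ins for kernel definitions; values are illustrative only. */
#define CPU_FTR_ARCH_300  (1UL << 0)
#define SPRN_TIDR         144

static unsigned long cur_cpu_features;  /* hypothetical feature mask */
static unsigned long fake_tidr;         /* models the hardware register */
static int tidr_writes;                 /* counts writes to the model SPR */

static bool cpu_has_feature(unsigned long feature)
{
	return (cur_cpu_features & feature) != 0;
}

static void mtspr(int spr, unsigned long val)
{
	if (spr == SPRN_TIDR) {
		fake_tidr = val;
		tidr_writes++;
	}
}

/* Only touch TIDR when the CPU advertises the ISA 3.0 feature bit. */
static void switch_tidr(unsigned long new_tidr)
{
	if (cpu_has_feature(CPU_FTR_ARCH_300))
		mtspr(SPRN_TIDR, new_tidr);
}
```

With the feature bit clear the write is skipped entirely, which is the
behaviour wanted on pre-POWER9 parts.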

> +#endif
> + /*
> +  * We can't take a PMU exception inside _switch() since there is a
> +  * window where the kernel stack SLB and the kernel stack are out
> +  * of sync. Hard disable here.
> +  */
> + hard_irq_disable();
> +

What is this?

>   /*
>    * Call restore_sprs() before calling _switch(). If we move it after
>    * _switch() then we miss out on calling it for new tasks. The reason
> @@ -1449,9 +1459,70 @@ void flush_thread(void)
>  #endif /* CONFIG_HAVE_HW_BREAKPOINT */
>  }
>  
> +#ifdef CONFIG_PPC_VAS
> +static DEFINE_SPINLOCK(vas_thread_id_lock);
> +static DEFINE_IDA(vas_thread_ida);

This IDA should be per process, not global.
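
To make the per-process idea concrete, here is a small userspace model.
It deliberately does not use the kernel IDA API: struct proc_tids,
proc_tid_alloc() and proc_tid_free() are hypothetical names, the 64-id
limit is arbitrary, and locking is left out.

```c
#include <stdint.h>

#define TIDS_PER_PROC 64  /* arbitrary limit for the sketch */

/* One of these would hang off each process, not off the whole system. */
struct proc_tids {
	uint64_t used;  /* bit n set => thread id n + 1 is in use */
};

/* Returns an id in 1..TIDS_PER_PROC, or -1 if this process has none left. */
static int proc_tid_alloc(struct proc_tids *p)
{
	for (int i = 0; i < TIDS_PER_PROC; i++) {
		if (!(p->used & (UINT64_C(1) << i))) {
			p->used |= UINT64_C(1) << i;
			return i + 1;
		}
	}
	return -1;
}

/* Releases an id so that the same process can hand it out again. */
static void proc_tid_free(struct proc_tids *p, int id)
{
	if (id >= 1 && id <= TIDS_PER_PROC)
		p->used &= ~(UINT64_C(1) << (id - 1));
}
```

Two processes can both hold id 1 at the same time, since ids only need
to be unique within a process.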

> +
> +/*
> + * We need to assign an unique thread id to each thread in a process. This
> + * thread id is intended to be used with the Fast Thread-wakeup (aka Core-
> + * to-core wakeup) mechanism being implemented on top of Virtual Accelerator
> + * Switchboard (VAS).
> + *
> + * To get a unique thread-id per process we could simply use task_pid_nr()
> + * but the problem is that task_pid_nr() is not yet available for the thread
> + * when copy_thread() is called. Fixing that would require changing more
> + * intrusive arch-neutral code in code path in copy_process()?.
> + *
> + * Further, to assign unique thread ids within each process, we need an
> + * atomic field (or an IDR) in task_struct, which again intrudes into the
> + * arch-neutral code.

Really?

> + * So try to assign globally unique thread ids for now.

Yuck!

> + */
> +static int assign_thread_id(void)
> +{
> + int index;
> + int err;
> +
> +again:
> + if (!ida_pre_get(&vas_thread_ida, GFP_KERNEL))
> + return -ENOMEM;
> +
> + spin_lock(&vas_thread_id_lock);
> + err = ida_get_new_above(&vas_thread_ida, 1, &index);

We can't use 0 or 1?

> + spin_unlock(&vas_thread_id_lock);
> +
> + if (err == -EAGAIN)
> + goto again;
> + else if (err)
> + return err;
> +
> + if (index > MAX_USER_CONTEXT) {
> + spin_lock(&vas_thread_id_lock);
> + ida_remove(&vas_thread_ida, index);
> + spin_unlock(&vas_thread_id_lock);
> + return -ENOMEM;
> + }
> +
> + return index;
> +}
> +
> +static void free_thread_id(int id)
> +{
> + spin_lock(&vas_thread_id_lock);
> + ida_remove(&vas_thread_ida, id);
> + spin_unlock(&vas_thread_id_lock);
> +}
> +#endif /* CONFIG_PPC_VAS */
> +
> +
>  void
>  release_thread(struct task_struct *t)
>  {
> +#ifdef CONFIG_PPC_VAS
> + free_thread_id(t->thread.tidr);
> +#endif

Can you restructure this to avoid the #ifdef ugliness?

>  }
>  
>  /*
> @@ -1587,6 +1658,9 @@ int copy_thread(unsigned long clone_flags, unsigned long
> usp,
>  #endif
>  
>   setup_ksp_vsid(p, sp);
> +#ifdef CONFIG_PPC_VAS
> + p->thread.tidr = assign_thread_id();
> +#endif

Same here... 
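
One common way to get rid of the #ifdefs at the call sites is to provide
empty stubs when the option is off. The fragment below is a standalone
sketch of that pattern: struct thread_model, the *_model() callers and
the toy allocator are made up for illustration, and it assumes the tidr
field is always present in the thread structure.

```c
/* #define CONFIG_PPC_VAS 1 */  /* uncomment to build the "real" versions */

#ifdef CONFIG_PPC_VAS
static int next_id = 1;  /* toy allocator standing in for the IDA */
static int assign_thread_id(void) { return next_id++; }
static void free_thread_id(int id) { (void)id; }
#else
/* Stubs: callers compile unchanged and the calls fold away. */
static inline int assign_thread_id(void) { return 0; }
static inline void free_thread_id(int id) { (void)id; }
#endif

struct thread_model {
	int tidr;
};

/* No #ifdef needed here ... */
static void copy_thread_model(struct thread_model *t)
{
	t->tidr = assign_thread_id();
}

/* ... or here. */
static void release_thread_model(struct thread_model *t)
{
	free_thread_id(t->tidr);
}
```

The conditional compilation then lives in one place next to the
definitions rather than being repeated in every caller.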

>  
>  #ifdef CONFIG_PPC64 
>   if (cpu_has_feature(CPU_FTR_DSCR)) {

Re: [PATCH 2/3] powernv:idle: Decouple TB restore & Per-core SPRs restore

2017-04-13 Thread Michael Neuling

On Wed, 2017-04-12 at 17:16 +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> The idle-exit code assumes that if Timebase is not lost, then neither
> are the per-core hypervisor resources lost. 

Double negative!  How about:

The idle-exit code assumes that if the timebase is restored, then the
per-core hypervisor resources are also restored.

> This was true on POWER8
> where fast-sleep lost only TB but not per-core resources, and winkle
> lost both.
> 
> This assumption is not true for POWER9 however, since there can be
> states which do not lose timebase but can lose per-core SPRs.
> 
> Hence check if we need to restore the per-core hypervisor state even
> if timebase is not lost.

I think I understand what you're doing, just seems awkwardly worded.

Is this actually what the patch is doing?  It seems to be just changing one
branch.

Mikey

> 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  arch/powerpc/kernel/idle_book3s.S | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/idle_book3s.S
> b/arch/powerpc/kernel/idle_book3s.S
> index 9b747e9..6a9bd28 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -723,13 +723,14 @@ timebase_resync:
>    * Use cr3 which indicates that we are waking up with atleast partial
>    * hypervisor state loss to determine if TIMEBASE RESYNC is needed.
>    */
> - ble cr3,clear_lock
> + ble cr3,.Ltb_resynced
>   /* Time base re-sync */
>   bl  opal_resync_timebase;
>   /*
> -  * If waking up from sleep, per core state is not lost, skip to
> -  * clear_lock.
> +  * If waking up from sleep (POWER8), per core state
> +  * is not lost, skip to clear_lock.
>    */
> +.Ltb_resynced:
>   blt cr4,clear_lock
>  
>   /*

Re: [PATCH 1/3] powernv:idle: Use correct IDLE_THREAD_BITS in POWER8/9

2017-04-13 Thread Michael Neuling

On Wed, 2017-04-12 at 17:16 +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> This patch ensures that POWER8 and POWER9 processors use the correct
> value of IDLE_THREAD_BITS as POWER8 has 8 threads per core and hence
> the IDLE_THREAD_BITS should be 0xFF while POWER9 has only 4 threads
> per core and hence the IDLE_THREAD_BITS should be 0xF.

Why don't we derive this from the device tree rather than hard wiring it per cpu
type?
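
As a sketch of what deriving it could look like: if threads_per_core is
already populated from the device tree, the mask is just the low
threads_per_core bits. core_idle_thread_bits() is a hypothetical helper
name, and this only covers the C side; the assembly paths that use the
constant as an immediate would need a different approach.

```c
#include <stdint.h>

/*
 * Build a mask with one bit per hardware thread in the core.
 * In the layout quoted below only the low 8 bits are thread bits,
 * so threads_per_core is expected to be at most 8 in practice.
 */
static uint32_t core_idle_thread_bits(unsigned int threads_per_core)
{
	/* Guard the shift: shifting a 32-bit value by 32 is undefined. */
	if (threads_per_core >= 32)
		return UINT32_MAX;
	return (UINT32_C(1) << threads_per_core) - 1;
}
```

For 8 threads this gives 0xFF and for 4 threads 0xF, matching the two
hard-wired constants in the patch.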

Mikey

> 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  arch/powerpc/include/asm/cpuidle.h| 3 ++-
>  arch/powerpc/kernel/idle_book3s.S | 9 ++---
>  arch/powerpc/platforms/powernv/idle.c | 5 -
>  3 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cpuidle.h
> b/arch/powerpc/include/asm/cpuidle.h
> index 52586f9..fece6ca 100644
> --- a/arch/powerpc/include/asm/cpuidle.h
> +++ b/arch/powerpc/include/asm/cpuidle.h
> @@ -34,7 +34,8 @@
>  #define PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT   8
>  #define PNV_CORE_IDLE_THREAD_WINKLE_BITS 0xFF00
>  
> -#define PNV_CORE_IDLE_THREAD_BITS    0x00FF
> +#define PNV_CORE_IDLE_4THREAD_BITS   0x000F
> +#define PNV_CORE_IDLE_8THREAD_BITS   0x00FF
>  
>  /*
>   *  NOTE =
> diff --git a/arch/powerpc/kernel/idle_book3s.S
> b/arch/powerpc/kernel/idle_book3s.S
> index 2b13fe2..9b747e9 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -223,7 +223,7 @@ lwarx_loop1:
>   add r15,r15,r5  /* Add if winkle */
>   andc    r15,r15,r7  /* Clear thread bit */
>  
> - andi.   r9,r15,PNV_CORE_IDLE_THREAD_BITS
> + andi.   r9,r15,PNV_CORE_IDLE_8THREAD_BITS
>  
>  /*
>   * If cr0 = 0, then current thread is the last thread of the core entering
> @@ -582,8 +582,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
>   stwcx.  r15,0,r14
>   bne-1b
>   isync
> -
> - andi.   r9,r15,PNV_CORE_IDLE_THREAD_BITS
> +BEGIN_FTR_SECTION
> + andi.   r9,r15,PNV_CORE_IDLE_4THREAD_BITS
> +FTR_SECTION_ELSE
> + andi.   r9,r15,PNV_CORE_IDLE_8THREAD_BITS
> +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
>   cmpwi   cr2,r9,0
>  
>   /*
> diff --git a/arch/powerpc/platforms/powernv/idle.c
> b/arch/powerpc/platforms/powernv/idle.c
> index 445f30a..d46920b 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -112,7 +112,10 @@ static void pnv_alloc_idle_core_states(void)
>   size_t paca_ptr_array_size;
>  
>   core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL,
> node);
> - *core_idle_state = PNV_CORE_IDLE_THREAD_BITS;
> + if (cpu_has_feature(CPU_FTR_ARCH_300))
> + *core_idle_state = PNV_CORE_IDLE_4THREAD_BITS;
> + else
> + *core_idle_state = PNV_CORE_IDLE_8THREAD_BITS;
>   paca_ptr_array_size = (threads_per_core *
>      sizeof(struct paca_struct *));
>

Re: [PATCH 3/3] powernv:idle: Set LPCR_UPRT on wakeup from deep-stop

2017-04-13 Thread Michael Neuling

On Thu, 2017-04-13 at 14:12 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2017-04-13 at 09:28 +0530, Aneesh Kumar K.V wrote:
> > >   #endif
> > >    mtctr   r12
> > >    bctrl
> > > +/*
> > > + * cur_cpu_spec->cpu_restore would restore LPCR to a
> > > + * sane value that is set at early boot time,
> > > + * thereby clearing LPCR_UPRT.
> > > + * LPCR_UPRT is required if we are running in Radix mode.
> > > + * Set it here if that be the case.
> > > + */
> > > +BEGIN_MMU_FTR_SECTION
> > > + mfspr   r3, SPRN_LPCR
> > > + LOAD_REG_IMMEDIATE(r4, LPCR_UPRT)
> > > + or  r3, r3, r4
> > > + mtspr   SPRN_LPCR, r3
> > > +END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
> 
> We are probably better off saving the value somewhere during boot
> and just "blasting" it whole back.

We seem to touch LPCR in a bunch of places these days.  Not sure when "somewhere
during boot" should actually be.

Mikey

Re: [PATCH] tty:tty_ldisc: add tty_ldisc_lock|unlock to prevent concurrent update to ldisc in tty_ldisc_deinit

2017-04-09 Thread Michael Neuling

Wang,

Applying this with the other one on top doesn't fix the problem (applied on
next-20170405). I also tried each patch by itself, with the same bad result.

Thanks for the help, but the backtrace is the same:

Unable to handle kernel paging request for data at address 0x2260
Faulting instruction address: 0xc0568800
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=32 
NUMA 
PowerNV
Modules linked in:
CPU: 6 PID: 177 Comm: kworker/u56:1 Not tainted 
4.11.0-rc5-next-20170405-2-g34d2ff03e6 #9
Workqueue: events_unbound flush_to_ldisc
task: c077c498a280 task.stack: c077c49f8000
NIP: c0568800 LR: c05687e8 CTR: c0569310
REGS: c077c49fb890 TRAP: 0300   Not tainted  
(4.11.0-rc5-next-20170405-2-g34d2ff03e6)
MSR: 9280b033 
  CR: 24042428  XER: 
CFAR: c0956adc DAR: 2260 DSISR: 4000 SOFTE: 1 
GPR00: c05687e8 c077c49fbb10 c0f3cb00 c077c32710d8 
GPR04: c077bf556c20 c077bf556d20 0100 0001 
GPR08: c077c32710d8 c077c3271220 c077c3271248 c07995c28508 
GPR12: 84002428 cfff7e00 c00f2e08 c077c48c4040 
GPR16:   c079940102a8 c07994010078 
GPR20: c07994010020   0001 
GPR24:   c077bf556c20 c077bf556d20 
GPR28: 0100 0100 c077bf556d20 c077c3271000 
NIP [c0568800] n_tty_receive_buf_common+0xb0/0xbc0
LR [c05687e8] n_tty_receive_buf_common+0x98/0xbc0
Call Trace:
[c077c49fbb10] [c05687e8] n_tty_receive_buf_common+0x98/0xbc0 
(unreliable)
[c077c49fbbe0] [c056d02c] tty_ldisc_receive_buf+0x3c/0xd0
[c077c49fbc10] [c056dedc] tty_port_default_receive_buf+0x5c/0xe0
[c077c49fbc50] [c056d340] flush_to_ldisc+0x110/0x130
[c077c49fbca0] [c00ea88c] process_one_work+0x1dc/0x550
[c077c49fbd30] [c00eac88] worker_thread+0x88/0x5c0
[c077c49fbdc0] [c00f2f60] kthread+0x160/0x1a0
[c077c49fbe30] [c000bc60] ret_from_kernel_thread+0x5c/0x7c
Instruction dump:
fba1ffe8 fbc1fff0 f821ff31 f9010030 eb3f0280 483ee2a5 6000 393f0220 
395f0248 f9210020 f9410028 6042  7c2004ac 80ff0130 e8d9 
---[ end trace b30eea9f71cf8d4a ]---


Thanks for the help
Mikey

On Mon, 2017-04-10 at 00:59 +0800, Wang YanQing wrote:
> This patch could fix the issue that free_tty_struct in tty_io
> calling tty_ldisc_deinit without holding tty->ldisc_sem.
> 
> Signed-off-by: Wang YanQing 
> ---
>  drivers/tty/tty_ldisc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
> index b1f7fa5..674421b 100644
> --- a/drivers/tty/tty_ldisc.c
> +++ b/drivers/tty/tty_ldisc.c
> @@ -771,7 +771,9 @@ void tty_ldisc_init(struct tty_struct *tty)
>   */
>  void tty_ldisc_deinit(struct tty_struct *tty)
>  {
> + tty_ldisc_lock(tty, MAX_SCHEDULE_TIMEOUT);
>   if (tty->ldisc)
>   tty_ldisc_put(tty->ldisc);
>   tty->ldisc = NULL;
> + tty_ldisc_unlock(tty);
>  }

Re: [PATCH] tty: Fix crash with flush_to_ldisc()

2017-04-06 Thread Michael Neuling

Al,

On Fri, 2017-04-07 at 05:12 +0100, Al Viro wrote:
> On Fri, Apr 07, 2017 at 01:50:53PM +1000, Michael Neuling wrote:
> 
> > diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
> > index bdf0e6e899..a2a9832a42 100644
> > --- a/drivers/tty/n_tty.c
> > +++ b/drivers/tty/n_tty.c
> > @@ -1668,11 +1668,17 @@ static int
> >  n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
> >      char *fp, int count, int flow)
> >  {
> > -   struct n_tty_data *ldata = tty->disc_data;
> > +   struct n_tty_data *ldata;
> >     int room, n, rcvd = 0, overflow;
> >  
> >     down_read(&tty->termios_rwsem);
> >  
> > +   ldata = tty->disc_data;
> > +   if (!ldata) {
> > +   up_read(&tty->termios_rwsem);
> 
> I very much doubt that it's correct.  It shouldn't have been called after
> the n_tty_close(); apparently it has been.  ->termios_rwsem won't serialize
> against it, and something apparently has gone wrong with the exclusion there.
> At the very least I would like to see what's to prevent n_tty_close() from
> overlapping the execution of this function - if *that* is what broke, your
> patch will only paper over the problem.

It does seem like I'm papering over a problem. Would you be happy with the patch
if we add a WARN_ON_ONCE()?
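
For illustration, the shape of that check in a standalone userspace
model. warn_once() is a hypothetical stand-in for WARN_ON_ONCE() (it
uses a single shared flag, whereas the kernel macro tracks each call
site separately), and struct tty_model / receive_buf_model() are made-up
names, not the n_tty code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct tty_model {
	void *disc_data;  /* NULL once the line discipline has been closed */
};

static int warnings_printed;

/* Returns cond; prints a warning only the first time cond is true. */
static bool warn_once(bool cond, const char *what)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Returns the number of bytes "received"; 0 if there is no ldisc data. */
static int receive_buf_model(struct tty_model *tty, int count)
{
	void *ldata = tty->disc_data;

	if (warn_once(ldata == NULL, "disc_data is NULL in receive path"))
		return 0;
	return count;
}
```

This still only papers over the underlying exclusion problem; the
warning just makes sure it stays visible.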

I think the problem is permanent rather than a transient race: if we read
disc_data again later, it's still NULL.

Benh and I looked at this a bunch and we did notice tty_ldisc_reinit() was being
called without the tty lock in one location.  We tried the below patch
but it didn't help (not an upstreamable patch, just a test).

There have been a few attempts at fixing this, but none have worked for
me:
https://lkml.org/lkml/2017/3/23/569
and 
https://patchwork.kernel.org/patch/9114561/

I'm not that familiar with the tty layer (and I value my sanity) so I'm
struggling to root cause it by myself.

Mikey


diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 734a635e73..121402ff25 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1454,6 +1454,9 @@ static void tty_driver_remove_tty(struct tty_driver 
*driver, struct tty_struct *
driver->ttys[tty->index] = NULL;
 }
 
+extern int tty_ldisc_lock(struct tty_struct *tty, unsigned long timeout);
+extern void tty_ldisc_unlock(struct tty_struct *tty);
+
 /*
  * tty_reopen()- fast re-open of an open tty
  * @tty- the tty to open
@@ -1466,6 +1469,7 @@ static void tty_driver_remove_tty(struct tty_driver 
*driver, struct tty_struct *
 static int tty_reopen(struct tty_struct *tty)
 {
struct tty_driver *driver = tty->driver;
+   int rc = 0;
 
if (driver->type == TTY_DRIVER_TYPE_PTY &&
driver->subtype == PTY_TYPE_MASTER)
@@ -1479,10 +1483,12 @@ static int tty_reopen(struct tty_struct *tty)
 
tty->count++;
 
+   tty_ldisc_lock(tty, MAX_SCHEDULE_TIMEOUT);
if (!tty->ldisc)
-   return tty_ldisc_reinit(tty, tty->termios.c_line);
+   rc =  tty_ldisc_reinit(tty, tty->termios.c_line);
+   tty_ldisc_unlock(tty);
 
-   return 0;
+   return rc;
 }
 
 /**
diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
index d0e84b6226..3b13ff11c5 100644
--- a/drivers/tty/tty_ldisc.c
+++ b/drivers/tty/tty_ldisc.c
@@ -334,7 +334,7 @@ static inline void __tty_ldisc_unlock(struct tty_struct 
*tty)
ldsem_up_write(&tty->ldisc_sem);
 }
 
-static int tty_ldisc_lock(struct tty_struct *tty, unsigned long timeout)
+int tty_ldisc_lock(struct tty_struct *tty, unsigned long timeout)
 {
int ret;
 
@@ -345,7 +345,7 @@ static int tty_ldisc_lock(struct tty_struct *tty, unsigned 
long timeout)
return 0;
 }
 
-static void tty_ldisc_unlock(struct tty_struct *tty)
+void tty_ldisc_unlock(struct tty_struct *tty)
 {
clear_bit(TTY_LDISC_HALTED, &tty->flags);
__tty_ldisc_unlock(tty);

[PATCH] tty: Fix crash with flush_to_ldisc()

2017-04-06 Thread Michael Neuling

When reiniting a tty we can end up with:

[  417.514499] Unable to handle kernel paging request for data at address 
0x2260
[  417.515361] Faulting instruction address: 0xc06fad80
cpu 0x15: Vector: 300 (Data Access) at [c0799411f890]
pc: c06fad80: n_tty_receive_buf_common+0xc0/0xbd0
lr: c06fad5c: n_tty_receive_buf_common+0x9c/0xbd0
sp: c0799411fb10
   msr: 9280b033
   dar: 2260
 dsisr: 4000
  current = 0xc079675d1e00
  paca= 0xcfb0d200   softe: 0irq_happened: 0x01
pid   = 5, comm = kworker/u56:0
Linux version 4.11.0-rc5-next-20170405 (mikey@bml86) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Thu Apr 6 00:36:46 CDT 
2017
enter ? for help
[c0799411fbe0] c06ff968 tty_ldisc_receive_buf+0x48/0xe0
[c0799411fc10] c07009d8 tty_port_default_receive_buf+0x68/0xe0
[c0799411fc50] c06ffce4 flush_to_ldisc+0x114/0x130
[c0799411fca0] c010a0fc process_one_work+0x1ec/0x580
[c0799411fd30] c010a528 worker_thread+0x98/0x5d0
[c0799411fdc0] c011343c kthread+0x16c/0x1b0
[c0799411fe30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

This is due to a NULL pointer dereference of tty->disc_data in
n_tty_receive_buf_common(), reached via tty_ldisc_receive_buf() from
flush_to_ldisc().

This fixes the issue by moving the disc_data read to after we take the
semaphore. Then, when disc_data is NULL, we return 0 (no data processed)
rather than dereferencing it.

Cc: <sta...@vger.kernel.org> [4.10+]
Signed-off-by: Michael Neuling <mi...@neuling.org>
---
 drivers/tty/n_tty.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index bdf0e6e899..a2a9832a42 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -1668,11 +1668,17 @@ static int
 n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
 char *fp, int count, int flow)
 {
-   struct n_tty_data *ldata = tty->disc_data;
+   struct n_tty_data *ldata;
int room, n, rcvd = 0, overflow;
 
down_read(&tty->termios_rwsem);
 
+   ldata = tty->disc_data;
+   if (!ldata) {
+   up_read(&tty->termios_rwsem);
+   return 0;
+   }
+
while (1) {
/*
 * When PARMRK is set, each input char may take up to 3 chars
-- 
2.9.3

Re: tty crash in tty_ldisc_receive_buf()

2017-04-06 Thread Michael Neuling


> > +   /* This probably shouldn't happen, but return 0 data processed */
> > +   if (!ldata)
> > +   return 0;
> > +
> >     while (1) {
> >     /*
> >      * When PARMRK is set, each input char may take up to 3
> > chars
> 
> Maybe your patch should look like:
> + /* This probably shouldn't happen, but return 0 data processed */
> + if (!ldata) {
> +   up_read(&tty->termios_rwsem);
> + return 0;
> +   }

Oops, nice catch.. Thanks!

That does indeed fix the problem now without the softlockup.  I'm not sure it's
the right fix, but full patch below.

Anyone see a problem with this approach? Am I just papering over a real issue?

> Maybe below patch should work:
> @@ -1668,11 +1668,12 @@ static int
>  n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
>   char *fp, int count, int flow)
>  {
> -   struct n_tty_data *ldata = tty->disc_data;
> +   struct n_tty_data *ldata;
>      int room, n, rcvd = 0, overflow;
> 
> down_read(&tty->termios_rwsem);
> 
> +   ldata = tty->disc_data;

I did try just that alone and it didn't help.

Mikey


------------
From 75c2a0369450692946ca8cc7ac148a98deaecd2a Mon Sep 17 00:00:00 2001
From: Michael Neuling <mi...@neuling.org>
Date: Fri, 7 Apr 2017 11:31:02 +1000
Subject: [PATCH] tty: fix regression in flush_to_ldisc

When reiniting a tty we can end up with:

[  417.514499] Unable to handle kernel paging request for data at address 
0x2260
[  417.515361] Faulting instruction address: 0xc06fad80
cpu 0x15: Vector: 300 (Data Access) at [c0799411f890]
pc: c06fad80: n_tty_receive_buf_common+0xc0/0xbd0
lr: c06fad5c: n_tty_receive_buf_common+0x9c/0xbd0
sp: c0799411fb10
   msr: 9280b033
   dar: 2260
 dsisr: 4000
  current = 0xc079675d1e00
  paca= 0xcfb0d200   softe: 0irq_happened: 0x01
pid   = 5, comm = kworker/u56:0
Linux version 4.11.0-rc5-next-20170405 (mikey@bml86) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Thu Apr 6 00:36:46 CDT 
2017
enter ? for help
[c0799411fbe0] c06ff968 tty_ldisc_receive_buf+0x48/0xe0
[c0799411fc10] c07009d8 tty_port_default_receive_buf+0x68/0xe0
[c0799411fc50] c06ffce4 flush_to_ldisc+0x114/0x130
[c0799411fca0] c010a0fc process_one_work+0x1ec/0x580
[c0799411fd30] c010a528 worker_thread+0x98/0x5d0
[c0799411fdc0] c011343c kthread+0x16c/0x1b0
[c0799411fe30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

This is due to a NULL pointer dereference of tty->disc_data.

This fixes the issue by moving the disc_data read to after we take the
semaphore, then returning 0 data processed when NULL.

Cc: <sta...@vger.kernel.org>[4.10+]
Signed-off-by: Michael Neuling <mi...@neuling.org>
---
 drivers/tty/n_tty.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index bdf0e6e899..a2a9832a42 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -1668,11 +1668,17 @@ static int
 n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
 char *fp, int count, int flow)
 {
-   struct n_tty_data *ldata = tty->disc_data;
+   struct n_tty_data *ldata;
int room, n, rcvd = 0, overflow;
 
down_read(&tty->termios_rwsem);
 
+   ldata = tty->disc_data;
+   if (!ldata) {
+   up_read(&tty->termios_rwsem);
+   return 0;
+   }
+
while (1) {
/*
 * When PARMRK is set, each input char may take up to 3 chars
-- 
2.9.3

Re: tty crash in tty_ldisc_receive_buf()

2017-04-06 Thread Michael Neuling


> > If anyone has an idea, I'm happy to try a patch.
> 
> Can you try this one [1].

Rob, I'm still hitting it when I apply that on next-20170405. Crash below..

Any other clues?

[  229.422825] Unable to handle kernel paging request for data at address 
0x2260
[  229.423681] Faulting instruction address: 0xc06fad80
cpu 0x13: Vector: 300 (Data Access) at [c0799411f8a0]
pc: c06fad80: n_tty_receive_buf_common+0xc0/0xbd0
lr: c06fad5c: n_tty_receive_buf_common+0x9c/0xbd0
sp: c0799411fb20
   msr: 9280b033
   dar: 2260
 dsisr: 4000
  current = 0xc079665d1e00
  paca= 0xcfb0be00   softe: 0    irq_happened: 0x01
pid   = 5, comm = kworker/u56:0
Linux version 4.11.0-rc5-next-20170405+ (mikey@bml86) (gcc version 5.4.0
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #4 SMP Thu Apr 6 19:13:58 CDT
2017
enter ? for help
[c0799411fbf0] c06ff968 tty_ldisc_receive_buf+0x48/0xe0
[c0799411fc20] c07009fc tty_port_default_receive_buf+0x2c/0x40
[c0799411fc40] c0700278 flush_to_ldisc+0x168/0x190
[c0799411fca0] c010a0fc process_one_work+0x1ec/0x580
[c0799411fd30] c010a528 worker_thread+0x98/0x5d0
[c0799411fdc0] c011343c kthread+0x16c/0x1b0
[c0799411fe30] c000b4e8 ret_from_kernel_thread+0x5c/0x74
13:mon> r
R00 = c06fad5c   R16 = 
R01 = c0799411fb20   R17 = 
R02 = c14b9200   R18 = c07994060ea8
R03 =    R19 = c07994060c78
R04 = c21c2420   R20 = c07994060c20
R05 = c21c2520   R21 = 
R06 = 0100   R22 = 
R07 = 0001   R23 = 0001
R08 =    R24 = 
R09 = c000fc459600   R25 = 
R10 = c000fc459628   R26 = c21c2420
R11 = 000201010005   R27 = c21c2520
R12 = 24002828   R28 = 0100
R13 = cfb0be00   R29 = c000fc459400
R14 = c01132d8   R30 = 0001
R15 = c079941f0ec0   R31 = c000fc459400
pc  = c06fad80 n_tty_receive_buf_common+0xc0/0xbd0
cfar= c0b98e10 down_read+0x70/0x90
lr  = c06fad5c n_tty_receive_buf_common+0x9c/0xbd0
msr = 9280b033   cr  = 24042828
ctr = c06fb890   xer =    trap =  300
dar = 2260   dsisr = 4000

tty crash in tty_ldisc_receive_buf()

2017-04-06 Thread Michael Neuling

Hi all,

We are seeing the following crash (in linux-next but has been around since at
least v4.10). 

[  417.514499] Unable to handle kernel paging request for data at address 
0x2260
[  417.515361] Faulting instruction address: 0xc06fad80
cpu 0x15: Vector: 300 (Data Access) at [c0799411f890]
pc: c06fad80: n_tty_receive_buf_common+0xc0/0xbd0
lr: c06fad5c: n_tty_receive_buf_common+0x9c/0xbd0
sp: c0799411fb10
   msr: 9280b033
   dar: 2260
 dsisr: 4000
  current = 0xc079675d1e00
  paca= 0xcfb0d200   softe: 0    irq_happened: 0x01
pid   = 5, comm = kworker/u56:0
Linux version 4.11.0-rc5-next-20170405 (mikey@bml86) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Thu Apr 6 00:36:46 CDT 
2017
enter ? for help
[c0799411fbe0] c06ff968 tty_ldisc_receive_buf+0x48/0xe0
[c0799411fc10] c07009d8 tty_port_default_receive_buf+0x68/0xe0
[c0799411fc50] c06ffce4 flush_to_ldisc+0x114/0x130
[c0799411fca0] c010a0fc process_one_work+0x1ec/0x580
[c0799411fd30] c010a528 worker_thread+0x98/0x5d0
[c0799411fdc0] c011343c kthread+0x16c/0x1b0
[c0799411fe30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

It seems the null ptr deref is in n_tty_receive_buf_common() where we do:

size_t tail = smp_load_acquire(&ldata->read_tail);

ldata is NULL.

We usually see this on boot, but can also see it if we kill a getty attached to a
tty (which is then respawned by systemd).  It seems like we are flushing data to
a tty at the same time as it's being torn down and restarted.

I did try the below patch which avoids the crash but locks up one of the CPUs. I
guess the data never gets flushed if we say nothing is processed.

This is on powerpc but has also been reported on parisc.

I'm not at all familiar with the tty layer and looking at the locks, mutexes,
semaphores and reference counting in there scares the hell out of me. 

If anyone has an idea, I'm happy to try a patch.

Regards,
Mikey

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index bdf0e6e899..99dd757aa4 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -1673,6 +1673,10 @@ n_tty_receive_buf_common(struct tty_struct *tty, const 
unsigned char *cp,
 
down_read(&tty->termios_rwsem);
 
+   /* This probably shouldn't happen, but return 0 data processed */
+   if (!ldata)
+   return 0;
+
while (1) {
/*
 * When PARMRK is set, each input char may take up to 3 chars

Re: linux-next: manual merge of the tty tree with the tty.current tree

2017-03-29 Thread Michael Neuling

On Mon, 2017-03-20 at 10:26 +0100, Dmitry Vyukov wrote:
> On Mon, Mar 20, 2017 at 10:21 AM, Dmitry Vyukov  wrote:
> > On Mon, Mar 20, 2017 at 3:28 AM, Stephen Rothwell 
> > wrote:
> > > Hi Greg,
> > > 
> > > Today's linux-next merge of the tty tree got a conflict in:
> > > 
> > >   drivers/tty/tty_ldisc.c
> > > 
> > > between commit:
> > > 
> > >   5362544bebe8 ("tty: don't panic on OOM in tty_set_ldisc()")
> > > 
> > > from the tty.current tree and commit:
> > > 
> > >   71472fa9c52b ("tty: Fix ldisc crash on reopened tty")
> > > 
> > > from the tty tree.
> > > 
> > > I fixed it up (see below) and can carry the fix as necessary. This
> > > is now fixed as far as linux-next is concerned, but any non trivial
> > > conflicts should be mentioned to your upstream maintainer when your tree
> > > is submitted for merging.  You may also want to consider cooperating
> > > with the maintainer of the conflicting tree to minimise any particularly
> > > complex conflicts.
> > > 
> > > --
> > > Cheers,
> > > Stephen Rothwell
> > > 
> > > diff --cc drivers/tty/tty_ldisc.c
> > > index b0500a0a87b8,4ee7742dced3..
> > > --- a/drivers/tty/tty_ldisc.c
> > > +++ b/drivers/tty/tty_ldisc.c
> > > @@@ -621,14 -669,17 +621,15 @@@ int tty_ldisc_reinit(struct tty_struct
> > > tty_ldisc_put(tty->ldisc);
> > > }
> > > 
> > > -   /* switch the line discipline */
> > > -   tty->ldisc = ld;
> > > tty_set_termios_ldisc(tty, disc);
> > > -   retval = tty_ldisc_open(tty, tty->ldisc);
> > > +   retval = tty_ldisc_open(tty, ld);
> > > if (retval) {
> > > -   tty_ldisc_put(tty->ldisc);
> > > -   tty->ldisc = NULL;
> > >  -  if (!WARN_ON(disc == N_TTY)) {
> > >  -  tty_ldisc_put(ld);
> > >  -  ld = NULL;
> > >  -  }
> > > ++  tty_ldisc_put(ld);
> > > ++  ld = NULL;
> > > }
> > > +
> > > +   /* switch the line discipline */
> > > +   smp_store_release(&tty->ldisc, ld);
> > > return retval;
> > >   }
> > > 
> > 
> > 
> > Peter,
> > 
> > Looking at your patch "tty: Fix ldisc crash on reopened tty", I think
> > there is a missed barrier in tty_ldisc_ref. A single barrier does not
> > have any effect, they always need to be in pairs. So I think we also
> > need at least:
> > 
> > @@ -295,7 +295,8 @@ struct tty_ldisc *tty_ldisc_ref(struct tty_struct *tty)
> > struct tty_ldisc *ld = NULL;
> > 
> > if (ldsem_down_read_trylock(&tty->ldisc_sem)) {
> > -   ld = tty->ldisc;
> > +   ld = READ_ONCE(tty->ldisc);
> > +   read_barrier_depends();
> > if (!ld)
> > ldsem_up_read(&tty->ldisc_sem);
> > }
> > 
> > 
> > Or simply:
> > 
> > @@ -295,7 +295,8 @@ struct tty_ldisc *tty_ldisc_ref(struct tty_struct *tty)
> > struct tty_ldisc *ld = NULL;
> > 
> > if (ldsem_down_read_trylock(&tty->ldisc_sem)) {
> > -   ld = tty->ldisc;
> > +   /* pairs with smp_store_release in tty_ldisc_reinit */
> > +   ld = smp_load_acquire(&tty->ldisc);
> > if (!ld)
> > ldsem_up_read(&tty->ldisc_sem);
> > }
> 
> 
> 
> 
> I am also surprised that callers of tty_ldisc_reinit don't hold
> ldisc_sem. I thought that ldisc_sem is what's supposed to protect
> changes to ldisc. That would also auto fix the crash without any
> tricky barriers as flush_to_ldisc uses tty_ldisc_ref.

Dmitry,

Thanks for the help.  Peter doesn't seem to be responding to email any more.

I'm not familiar with the tty layer, but the issue that patch was supposed to fix
had a similar signature to the below oops we are seeing on powerpc on boot.
(sorry I don't have a repro on mainline or linux-next). Hence why I pushed on
it.

In the below crash the call to tty_ldisc_reinit is coming from a workqueue, so
requiring the callers to hold the ldisc_sem is more tricky.

Could we just hold the ldisc_sem inside tty_ldisc_reinit()?

Regards,
Mikey

[ 9.021567] Unable to handle kernel paging request for data at address 
0x2260
[ 9.022501] Faulting instruction address: 0xc06c7770
[ 9.023105] Oops: Kernel access of bad area, sig: 11 [#1]
[ 9.023674] SMP NR_CPUS=2048
[ 9.023676] NUMA
[ 9.023970] PowerNV
[ 9.024372] Modules linked in: ofpart cmdlinepart ipmi_powernv powernv_flash 
ipmi_devintf mtd ipmi_msghandler ibmpowernv opal_prd uio_pdrv_genirq uio 
vmx_crypto ip_tables x_tables autofs4 ast i2c_algo_bit ttm drm_kms_helper 
syscopyarea sysfillrect sysimgblt crc32c_vpmsum fb_sys_fops drm ahci libahci tg3
[ 9.027146] CPU: 15 PID: 354 Comm: kworker/u64:2 Not tainted 4.10.0-8-generic 
#10-Ubuntu
[ 9.027978] Workqueue: events_unbound flush_to_ldisc
[ 9.028468] task: c016a7758c00 task.stack: c000fd084000
[ 9.029055] NIP: c06c7770 LR: c06c7758 CTR: c06c84b0
[ 9.029767] REGS: c000fd0878c0 TRAP: 0300 Not tainted

[PATCH] tty: Fix ldisc crash on reopened tty

2017-03-15 Thread Michael Neuling

From: Peter Hurley <pe...@hurleysoftware.com>

If the tty has been hungup, the ldisc instance may have been destroyed.
Continued input to the tty will be ignored as long as the ldisc instance
is not visible to the flush_to_ldisc kworker. However, when the tty
is reopened and a new ldisc instance is created, the flush_to_ldisc
kworker can obtain an ldisc reference before the new ldisc is
completely initialized. This will likely crash:

 BUG: unable to handle kernel paging request at 2260
 IP: [] n_tty_receive_buf_common+0x6d/0xb80
 PGD 2ab581067 PUD 290c11067 PMD 0
 Oops:  [#1] PREEMPT SMP
 Modules linked in: nls_iso8859_1 ip6table_filter [.]
 CPU: 2 PID: 103 Comm: kworker/u16:1 Not tainted 4.6.0-rc7+wip-xeon+debug 
#rc7+wip
 Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 
04/30/2012
 Workqueue: events_unbound flush_to_ldisc
 task: 8802ad16d100 ti: 8802ad31c000 task.ti: 8802ad31c000
 RIP: 0010:[]  [] 
n_tty_receive_buf_common+0x6d/0xb80
 RSP: 0018:8802ad31fc70  EFLAGS: 00010296
 RAX:  RBX: 8802aaddd800 RCX: 0001
 RDX:  RSI: 810db48f RDI: 0246
 RBP: 8802ad31fd08 R08:  R09: 0001
 R10: 8802aadddb28 R11: 0001 R12: 8800ba6da808
 R13: 8802ad18be80 R14: 8800ba6da858 R15: 8800ba6da800
 FS:  () GS:8802b0a0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 2260 CR3: 00028ee5d000 CR4: 06e0
 Stack:
  81531219 8802aadddab8 8802aae0 8802aa78
  0001 8800ba6da858 8800ba6da860 8802ad31fd30
  81885f78 81531219  0002
 Call Trace:
  [] ? flush_to_ldisc+0x49/0xd0
  [] ? mutex_lock_nested+0x2c8/0x430
  [] ? flush_to_ldisc+0x49/0xd0
  [] n_tty_receive_buf2+0x14/0x20
  [] tty_ldisc_receive_buf+0x22/0x50
  [] flush_to_ldisc+0xbe/0xd0
  [] process_one_work+0x1ed/0x6e0
  [] ? process_one_work+0x16f/0x6e0
  [] worker_thread+0x4e/0x490
  [] ? process_one_work+0x6e0/0x6e0
  [] kthread+0xf2/0x110
  [] ? preempt_count_sub+0x4c/0x80
  [] ret_from_fork+0x22/0x50
  [] ? kthread_create_on_node+0x220/0x220
 Code: ff ff e8 27 a0 35 00 48 8d 83 78 05 00 00 c7 45 c0 00 00 00 00 48 89 45 
80 48
   8d 83 e0 05 00 00 48 89 85 78 ff ff ff 48 8b 45 b8 <48> 8b b8 60 22 00 
00 48
   8b 30 89 f8 8b 8b 88 04 00 00 29 f0 8d
 RIP  [] n_tty_receive_buf_common+0x6d/0xb80
  RSP 
 CR2: 2260

Ensure the kworker cannot obtain the ldisc reference until the new ldisc
is completely initialized.

Fixes: 892d1fa7eaae ("tty: Destroy ldisc instance on hangup")
Reported-by: Mikulas Patocka <mpato...@redhat.com>
Signed-off-by: Peter Hurley <pe...@hurleysoftware.com>
Signed-off-by: Michael Neuling <mi...@neuling.org>
---

gregkh, can you take this? It never made it upstream and Peter Hurley
doesn't seem to be responding to email since mid 2016.

I'm reposting this from https://patchwork.kernel.org/patch/9114561/

Fixes an issue on powerpc too.
---
 drivers/tty/tty_ldisc.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
index 68947f6de5..4ee7742dce 100644
--- a/drivers/tty/tty_ldisc.c
+++ b/drivers/tty/tty_ldisc.c
@@ -669,16 +669,17 @@ int tty_ldisc_reinit(struct tty_struct *tty, int disc)
 		tty_ldisc_put(tty->ldisc);
 	}
 
-	/* switch the line discipline */
-	tty->ldisc = ld;
 	tty_set_termios_ldisc(tty, disc);
-	retval = tty_ldisc_open(tty, tty->ldisc);
+	retval = tty_ldisc_open(tty, ld);
 	if (retval) {
 		if (!WARN_ON(disc == N_TTY)) {
-			tty_ldisc_put(tty->ldisc);
-			tty->ldisc = NULL;
+			tty_ldisc_put(ld);
+			ld = NULL;
 		}
 	}
+
+	/* switch the line discipline */
+	smp_store_release(&tty->ldisc, ld);
 	return retval;
 }
 
-- 
2.9.3
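
For readers following the ordering argument in the patch above, here is a
minimal userspace sketch of the same publish-after-init idea, written with C11
atomics rather than the kernel's smp_store_release()/smp_load_acquire(). All
names in it (demo_ldisc, published_ldisc, demo_reinit, demo_receive) are
hypothetical stand-ins and are not part of the tty layer; it only illustrates
that the pointer is stored last, with release semantics, so a reader that sees
a non-NULL pointer also sees the completed initialisation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a line discipline instance. */
struct demo_ldisc {
	int ready;	/* set once initialisation has finished */
	int buf_size;	/* example field that a reader dereferences */
};

/* Hypothetical stand-in for tty->ldisc: NULL until an instance is published. */
static _Atomic(struct demo_ldisc *) published_ldisc;

/* Writer: finish initialisation first, publish the pointer last. */
static void demo_reinit(struct demo_ldisc *ld, int buf_size)
{
	ld->buf_size = buf_size;
	ld->ready = 1;
	/* Release: the stores above are visible to whoever observes the pointer. */
	atomic_store_explicit(&published_ldisc, ld, memory_order_release);
}

/* Reader: a NULL pointer means "no ldisc yet", so input is simply dropped. */
static int demo_receive(void)
{
	struct demo_ldisc *ld =
		atomic_load_explicit(&published_ldisc, memory_order_acquire);

	if (!ld)
		return -1;
	/* Acquire pairs with the release store, so initialisation is complete. */
	assert(ld->ready);
	return ld->buf_size;
}
```

Before demo_reinit() runs, demo_receive() returns -1; afterwards it returns the
buffer size that was set during initialisation. The old code corresponds to
storing the pointer before the fields are filled in, which is the window the
flush_to_ldisc kworker was hitting.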

Re: tty crash in Linux 4.6

2017-03-10 Thread Michael Neuling

> This patch works, I've had no tty crashes since applying it.
>
> I've seen that you haven't sent this patch yet to Linux-4.7-rc and
> Linux-4.6-stable. Will you? Or did you create a different patch?

We are hitting this now on powerpc.  This patch never seemed to make
it upstream (drivers/tty/tty_ldisc.c hasn't been touched in 1 year).

Peter, can we take this patch as is, or do you have an updated version?

Mikey

> Mikulas
>
>
> On Tue, 17 May 2016, Peter Hurley wrote:
>
> > On 05/17/2016 08:57 AM, Peter Hurley wrote:
> > > On 05/16/2016 04:36 PM, Peter Hurley wrote:
> > >> > Hi Mikulas,
> > >> >
> > >> > On 05/16/2016 01:12 PM, Mikulas Patocka wrote:
> > >>> >> Hi
> > >>> >>
> > >>> >> In the kernel 4.6 I get crashes in the tty layer. I can reproduce the
> > >>> >> crash by logging into the machine with ssh and typing before the 
> > >>> >> prompt
> > >>> >> appears.
> > >> >
> > >> > Thanks for the report.
> > >> > I tried to reproduce this a number of times on different machines
> > >> > with no luck.
> > >
> > > I was able to reproduce this crash with a test jig.
> > > The patch below fixed it, but I'm testing a better patch now, which
> > > I'll get to you asap.
> >
> > --- >% ---
> > Subject: [PATCH] tty: Fix ldisc crash on reopened tty
> >
> > If the tty has been hungup, the ldisc instance may have been destroyed.
> > Continued input to the tty will be ignored as long as the ldisc instance
> > is not visible to the flush_to_ldisc kworker. However, when the tty
> > is reopened and a new ldisc instance is created, the flush_to_ldisc
> > kworker can obtain an ldisc reference before the new ldisc is
> > completely initialized. This will likely crash:
> >
> >  BUG: unable to handle kernel paging request at 2260
> >  IP: [] n_tty_receive_buf_common+0x6d/0xb80
> >  PGD 2ab581067 PUD 290c11067 PMD 0
> >  Oops:  [#1] PREEMPT SMP
> >  Modules linked in: nls_iso8859_1 ip6table_filter [.]
> >  CPU: 2 PID: 103 Comm: kworker/u16:1 Not tainted 4.6.0-rc7+wip-xeon+debug 
> > #rc7+wip
> >  Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 
> > 04/30/2012
> >  Workqueue: events_unbound flush_to_ldisc
> >  task: 8802ad16d100 ti: 8802ad31c000 task.ti: 8802ad31c000
> >  RIP: 0010:[]  [] 
> > n_tty_receive_buf_common+0x6d/0xb80
> >  RSP: 0018:8802ad31fc70  EFLAGS: 00010296
> >  RAX:  RBX: 8802aaddd800 RCX: 0001
> >  RDX:  RSI: 810db48f RDI: 0246
> >  RBP: 8802ad31fd08 R08:  R09: 0001
> >  R10: 8802aadddb28 R11: 0001 R12: 8800ba6da808
> >  R13: 8802ad18be80 R14: 8800ba6da858 R15: 8800ba6da800
> >  FS:  () GS:8802b0a0() 
> > knlGS:
> >  CS:  0010 DS:  ES:  CR0: 80050033
> >  CR2: 2260 CR3: 00028ee5d000 CR4: 06e0
> >  Stack:
> >   81531219 8802aadddab8 8802aae0 8802aa78
> >   0001 8800ba6da858 8800ba6da860 8802ad31fd30
> >   81885f78 81531219  0002
> >  Call Trace:
> >   [] ? flush_to_ldisc+0x49/0xd0
> >   [] ? mutex_lock_nested+0x2c8/0x430
> >   [] ? flush_to_ldisc+0x49/0xd0
> >   [] n_tty_receive_buf2+0x14/0x20
> >   [] tty_ldisc_receive_buf+0x22/0x50
> >   [] flush_to_ldisc+0xbe/0xd0
> >   [] process_one_work+0x1ed/0x6e0
> >   [] ? process_one_work+0x16f/0x6e0
> >   [] worker_thread+0x4e/0x490
> >   [] ? process_one_work+0x6e0/0x6e0
> >   [] kthread+0xf2/0x110
> >   [] ? preempt_count_sub+0x4c/0x80
> >   [] ret_from_fork+0x22/0x50
> >   [] ? kthread_create_on_node+0x220/0x220
> >  Code: ff ff e8 27 a0 35 00 48 8d 83 78 05 00 00 c7 45 c0 00 00 00 00 48 89 
> > 45 80 48
> >8d 83 e0 05 00 00 48 89 85 78 ff ff ff 48 8b 45 b8 <48> 8b b8 60 22 
> > 00 00 48
> >8b 30 89 f8 8b 8b 88 04 00 00 29 f0 8d
> >  RIP  [] n_tty_receive_buf_common+0x6d/0xb80
> >   RSP 
> >  CR2: 2260
> >
> > Ensure the kworker cannot obtain the ldisc reference until the new ldisc
> > is completely initialized.
> >
> > Fixes: 892d1fa7eaae ("tty: Destroy ldisc instance on hangup")
> > Reported-by: Mikulas Patocka 
> > Signed-off-by: Peter Hurley 
> > ---
> >  drivers/tty/tty_ldisc.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
> > index cdd063f..bda0c85 100644
> > --- a/drivers/tty/tty_ldisc.c
> > +++ b/drivers/tty/tty_ldisc.c
> > @@ -669,16 +669,17 @@ int tty_ldisc_reinit(struct tty_struct *tty, int disc)
> >   tty_ldisc_put(tty->ldisc);
> >   }
> >
> > - /* switch the line discipline */
> > - tty->ldisc = ld;
> >   tty_set_termios_ldisc(tty, disc);
> > - retval = tty_ldisc_open(tty, tty->ldisc);
> > + retval = tty_ldisc_open(tty, ld);
> >   if (retval) {
> >

Re: [PATCH 1/2] mm/autonuma: Let architecture override how the write bit should be stashed in a protnone pte.

2017-02-13 Thread Michael Neuling

On Thu, 2017-02-09 at 08:30 +0530, Aneesh Kumar K.V wrote:
> Autonuma preserves the write permission across a NUMA fault to avoid taking
> a write fault after a NUMA fault (commit b191f9b106ea "mm: numa: preserve PTE
> write permissions across a NUMA hinting fault"). Architectures can implement
> protnone in different ways, and some may choose to implement it by clearing
> the Read/Write/Exec bits of the pte. Setting the write bit on such a pte can
> result in wrong behaviour. Fix this by allowing the arch to override how the
> write bit is saved on a protnone pte.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.ku...@linux.vnet.ibm.com>

FWIW this is pretty simple and helps us on powerpc...

Acked-By: Michael Neuling <mi...@neuling.org>

> ---
>  include/asm-generic/pgtable.h | 16 ++++++++++++++++
>  mm/huge_memory.c  |  4 ++--
>  mm/memory.c   |  2 +-
>  mm/mprotect.c |  4 ++--
>  4 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 18af2bcefe6a..b6f3a8a4b738 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -192,6 +192,22 @@ static inline void ptep_set_wrprotect(struct mm_struct
> *mm, unsigned long addres
>  }
>  #endif
>  
> +#ifndef pte_savedwrite
> +#define pte_savedwrite pte_write
> +#endif
> +
> +#ifndef pte_mk_savedwrite
> +#define pte_mk_savedwrite pte_mkwrite
> +#endif
> +
> +#ifndef pmd_savedwrite
> +#define pmd_savedwrite pmd_write
> +#endif
> +
> +#ifndef pmd_mk_savedwrite
> +#define pmd_mk_savedwrite pmd_mkwrite
> +#endif
> +
>  #ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9a6bd6c8d55a..2f0f855ec911 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1300,7 +1300,7 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t
> pmd)
>   goto out;
>  clear_pmdnuma:
>   BUG_ON(!PageLocked(page));
> - was_writable = pmd_write(pmd);
> + was_writable = pmd_savedwrite(pmd);
>   pmd = pmd_modify(pmd, vma->vm_page_prot);
>   pmd = pmd_mkyoung(pmd);
>   if (was_writable)
> @@ -1555,7 +1555,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t
> *pmd,
>   entry = pmdp_huge_get_and_clear_notify(mm, addr,
> pmd);
>   entry = pmd_modify(entry, newprot);
>   if (preserve_write)
> - entry = pmd_mkwrite(entry);
> + entry = pmd_mk_savedwrite(entry);
>   ret = HPAGE_PMD_NR;
>   set_pmd_at(mm, addr, pmd, entry);
>   BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
> diff --git a/mm/memory.c b/mm/memory.c
> index e78bf72f30dd..88c24f89d6d3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3388,7 +3388,7 @@ static int do_numa_page(struct vm_fault *vmf)
>   int target_nid;
>   bool migrated = false;
>   pte_t pte;
> - bool was_writable = pte_write(vmf->orig_pte);
> + bool was_writable = pte_savedwrite(vmf->orig_pte);
>   int flags = 0;
>  
>   /*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index f9c07f54dd62..15f5c174a7c1 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -113,13 +113,13 @@ static unsigned long change_pte_range(struct
> vm_area_struct *vma, pmd_t *pmd,
>   ptent = ptep_modify_prot_start(mm, addr, pte);
>   ptent = pte_modify(ptent, newprot);
>   if (preserve_write)
> - ptent = pte_mkwrite(ptent);
> + ptent = pte_mk_savedwrite(ptent);
>  
>   /* Avoid taking write faults for known dirty pages */
>   if (dirty_accountable && pte_dirty(ptent) &&
>   (pte_soft_dirty(ptent) ||
>    !(vma->vm_flags & VM_SOFTDIRTY))) {
> - ptent = pte_mkwrite(ptent);
> + ptent = pte_mk_savedwrite(ptent);
>   }
>   ptep_modify_prot_commit(mm, addr, pte, ptent);
>   pages++;
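
The "#ifndef ... #define" fallback used in the generic header above can be
tried outside the kernel. The sketch below is only an illustration under
assumed names: demo_pte_t, DEMO_PAGE_WRITE and the demo_* helpers are
hypothetical and do not exist in the kernel. It shows the generic default
resolving to the plain write-bit helpers when no architecture override has
been defined earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical PTE type and bit layout, for illustration only. */
typedef struct { unsigned long val; } demo_pte_t;
#define DEMO_PAGE_WRITE 0x2UL

static inline bool demo_pte_write(demo_pte_t pte)
{
	return pte.val & DEMO_PAGE_WRITE;
}

static inline demo_pte_t demo_pte_mkwrite(demo_pte_t pte)
{
	pte.val |= DEMO_PAGE_WRITE;
	return pte;
}

/*
 * An architecture header would provide its own helper before this point and
 * announce it with "#define demo_pte_savedwrite demo_pte_savedwrite".
 * Nothing does so here, so the generic fallbacks below take effect.
 */
#ifndef demo_pte_savedwrite
#define demo_pte_savedwrite demo_pte_write
#endif

#ifndef demo_pte_mk_savedwrite
#define demo_pte_mk_savedwrite demo_pte_mkwrite
#endif

/* Loosely mirrors the change_pte_range() call site: keep the write bit if asked. */
static inline demo_pte_t demo_change_prot(demo_pte_t pte, bool preserve_write)
{
	pte.val &= ~DEMO_PAGE_WRITE;	/* stand-in for pte_modify() dropping write */
	if (preserve_write)
		pte = demo_pte_mk_savedwrite(pte);
	assert(demo_pte_savedwrite(pte) == preserve_write);
	return pte;
}
```

Callers only ever use the *_savedwrite names, so an architecture that encodes
protnone differently can swap in its own helpers without touching the call
sites in mm/.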

Re: [PATCH 2/2] powerpc/mm/autonuma: Switch ppc64 to its own implementeation of saved write

2017-02-13 Thread Michael Neuling

On Thu, 2017-02-09 at 08:30 +0530, Aneesh Kumar K.V wrote:
> With this our protnone becomes a present pte with the READ/WRITE/EXEC bits
> cleared. By default we also set _PAGE_PRIVILEGED on such a pte. This is now
> used to help us identify a protnone pte that has a saved write bit. For such
> a pte, we will clear the _PAGE_PRIVILEGED bit. The pte still remains
> inaccessible from both user and kernel.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.ku...@linux.vnet.ibm.com>


FWIW I've tested this, so:

Acked-By: Michael Neuling <mi...@neuling.org>

> ---
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  3 +++
>  arch/powerpc/include/asm/book3s/64/pgtable.h  | 32 +-
> -
>  2 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index 0735d5a8049f..8720a406bbbe 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -16,6 +16,9 @@
>  #include 
>  #include 
>  
> +#ifndef __ASSEMBLY__
> +#include 
> +#endif
>  /*
>   * This is necessary to get the definition of PGTABLE_RANGE which we
>   * need for various slices related matters. Note that this isn't the
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index e91ada786d48..efff910a84b1 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -443,8 +443,8 @@ static inline pte_t pte_clear_soft_dirty(pte_t pte)
>   */
>  static inline int pte_protnone(pte_t pte)
>  {
> - return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED))
> ==
> - cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED);
> + return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_RWX)) ==
> + cpu_to_be64(_PAGE_PRESENT);
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> @@ -514,6 +514,32 @@ static inline pte_t pte_mkhuge(pte_t pte)
>   return pte;
>  }
>  
> +#define pte_mk_savedwrite pte_mk_savedwrite
> +static inline pte_t pte_mk_savedwrite(pte_t pte)
> +{
> + /*
> +  * Used by Autonuma subsystem to preserve the write bit
> +  * while marking the pte PROT_NONE. Only allow this
> +  * on PROT_NONE pte
> +  */
> + VM_BUG_ON((pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_RWX |
> _PAGE_PRIVILEGED)) !=
> +   cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED));
> + return __pte(pte_val(pte) & ~_PAGE_PRIVILEGED);
> +}
> +
> +#define pte_savedwrite pte_savedwrite
> +static inline bool pte_savedwrite(pte_t pte)
> +{
> + /*
> +  * Saved write ptes are prot none ptes that don't have the
> +  * privileged bit set. We mark prot none as one which has the
> +  * present and privileged bits set and RWX cleared. To mark a
> +  * protnone which used to have _PAGE_WRITE set we clear
> +  * the privileged bit.
> +  */
> + return !(pte_raw(pte) & cpu_to_be64(_PAGE_RWX | _PAGE_PRIVILEGED));
> +}
> +
>  static inline pte_t pte_mkdevmap(pte_t pte)
>  {
>   return __pte(pte_val(pte) | _PAGE_SPECIAL|_PAGE_DEVMAP);
> @@ -885,6 +911,7 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
>  #define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
>  #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
>  #define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
> +#define pmd_mk_savedwrite(pmd)   pte_pmd(pte_mk_savedwrite(pmd_pte(pmd))
> )
>  
>  #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
>  #define pmd_soft_dirty(pmd)pte_soft_dirty(pmd_pte(pmd))
> @@ -901,6 +928,7 @@ static inline int pmd_protnone(pmd_t pmd)
>  
>  #define __HAVE_ARCH_PMD_WRITE
>  #define pmd_write(pmd)   pte_write(pmd_pte(pmd))
> +#define pmd_savedwrite(pmd)  pte_savedwrite(pmd_pte(pmd))
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
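
The encoding described in this patch can be checked with a few lines of
userspace C. This is a sketch under stated assumptions: the bit positions
below are made up (the real values are in the powerpc headers), the
cpu_to_be64() byte-order handling is left out, and every ppc_demo_* and
PPC_DEMO_* name is hypothetical. It only demonstrates the three states: an
ordinary pte, a protnone pte, and a protnone pte with the write bit stashed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, not the real powerpc values. */
#define PPC_DEMO_PRESENT	(1ULL << 0)
#define PPC_DEMO_READ		(1ULL << 1)
#define PPC_DEMO_WRITE		(1ULL << 2)
#define PPC_DEMO_EXEC		(1ULL << 3)
#define PPC_DEMO_PRIVILEGED	(1ULL << 4)
#define PPC_DEMO_RWX	(PPC_DEMO_READ | PPC_DEMO_WRITE | PPC_DEMO_EXEC)

/* protnone: present, with all of R/W/X cleared. */
static bool ppc_demo_protnone(uint64_t pte)
{
	return (pte & (PPC_DEMO_PRESENT | PPC_DEMO_RWX)) == PPC_DEMO_PRESENT;
}

/* Stash "was writable" by clearing PRIVILEGED on a protnone pte. */
static uint64_t ppc_demo_mk_savedwrite(uint64_t pte)
{
	/* Only valid on a protnone pte that still has PRIVILEGED set. */
	assert((pte & (PPC_DEMO_PRESENT | PPC_DEMO_RWX | PPC_DEMO_PRIVILEGED)) ==
	       (PPC_DEMO_PRESENT | PPC_DEMO_PRIVILEGED));
	return pte & ~PPC_DEMO_PRIVILEGED;
}

/* Saved write: neither RWX nor PRIVILEGED is set. */
static bool ppc_demo_savedwrite(uint64_t pte)
{
	return !(pte & (PPC_DEMO_RWX | PPC_DEMO_PRIVILEGED));
}
```

A pte built as PPC_DEMO_PRESENT | PPC_DEMO_PRIVILEGED is protnone without a
saved write bit; after ppc_demo_mk_savedwrite() it is still protnone, and
ppc_demo_savedwrite() reports true.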

Re: [PATCH] powernv: Clear SPRN_PSSCR when a POWER9 CPU comes online

2016-11-22 Thread Michael Neuling

On Tue, 2016-11-22 at 23:36 +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com>
> 
> Ensure that PSSCR is set to a safe value corresponding to no
> state-loss each time a POWER9 CPU comes online.
> 
> Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>

Tested here on my configuration... FWIW

Acked-By: Michael Neuling <mi...@neuling.org>

> ---
>  arch/powerpc/kernel/cpu_setup_power.S | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/cpu_setup_power.S
> b/arch/powerpc/kernel/cpu_setup_power.S
> index 52ff3f0..37ad045 100644
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -96,6 +96,7 @@ _GLOBAL(__setup_cpu_power9)
>   mtlr    r11
>   beqlr
>   li  r0,0
> + mtspr   SPRN_PSSCR,r0
>   mtspr   SPRN_LPID,r0
>   mfspr   r3,SPRN_LPCR
>   ori r3, r3, LPCR_PECEDH
> @@ -116,6 +117,7 @@ _GLOBAL(__restore_cpu_power9)
>   mtlr    r11
>   beqlr
>   li  r0,0
> + mtspr   SPRN_PSSCR,r0
>   mtspr   SPRN_LPID,r0
>   mfspr   r3,SPRN_LPCR
>   ori r3, r3, LPCR_PECEDH

Re: [PATCH] powernv: Clear SPRN_PSSCR when a POWER9 CPU comes online

2016-11-22 Thread Michael Neuling

On Wed, 2016-11-23 at 10:30 +1100, Michael Ellerman wrote:
> "Gautham R. Shenoy"  writes:
> 
> > From: "Gautham R. Shenoy" 
> > 
> > Ensure that PSSCR is set to a safe value corresponding to no
> > state-loss each time a POWER9 CPU comes online.
> 
> Is this a bug fix? I can't tell from the change log.

There are no known bugs it's fixing.  

It's just safer to run with a known default value, rather than what we randomly
inherit from previous firmware.

Mikey

Re: [PATCH v7 07/11] powerpc/powernv: Add platform support for stop instruction

2016-07-07 Thread Michael Neuling


> > > 
> > > @@ -439,7 +540,18 @@ timebase_resync:
> > >    */
> > >   bne cr4,clear_lock
> > >  
> > > - /* Restore per core state */
> > > + /*
> > > +  * First thread in the core to wake up and its waking up
> > > with
> > > +  * complete hypervisor state loss. Restore per core
> > > hypervisor
> > > +  * state.
> > > +  */
> > > +BEGIN_FTR_SECTION
> > > + ld  r4,_PTCR(r1)
> > > + mtspr   SPRN_PTCR,r4
> > > + ld  r4,_RPR(r1)
> > > + mtspr   SPRN_RPR,r4
> > RPR looks wrong here.  This should be on POWER8 too.
> > 
> > This has changed since v6 and not noted in the v7 comments.  Why are
> > you
> > changing this now?
> > 
> RPR is a per-core resource in P9. So with this patch, RPR will continue
> to be restored per-subcore in P8 and will be restored once per core in P9.

Ok, thanks for the explanation.

Mikey

Re: [PATCH v7 00/11] powerpc/powernv/cpuidle: Add support for POWER ISA v3 idle states

2016-07-07 Thread Michael Neuling

Except for the issue with patch 7 that I've already commented on, the rest
of this series is good with me.  FWIW:

Acked-by: Michael Neuling <mi...@neuling.org>

Thanks.

On Fri, 2016-07-08 at 02:17 +0530, Shreyas B. Prabhu wrote:
> POWER ISA v3 defines a new idle processor core mechanism. In summary,
>  a) new instruction named stop is added. This instruction replaces
>   instructions like nap, sleep, rvwinkle.
>  b) new per thread SPR named PSSCR is added which controls the behavior
>   of stop instruction. 
>   
> PSSCR has following key fields
>   Bits 0:3  - Power-Saving Level Status. This field indicates the
>   lowest power-saving state the thread entered since stop
>   instruction was last executed.
>   
>   Bit 42 - Enable State Loss  
>   0 - No state is lost irrespective of other fields  
>   1 - Allows state loss
>   
>   Bits 44:47 - Power-Saving Level Limit  
>   This limits the power-saving level that can be entered into.
>   
>   Bits 60:63 - Requested Level  
>   Used to specify which power-saving level must be entered on
>   executing stop instruction
>   
> Stop idle states and their properties like name, latency, target
> residency, psscr value are exposed via device tree.
> 
> This patch series adds support for this new mechanism.
> 
> Patches 1-6 are cleanups and code movement.
> Patch 7 adds platform specific support for stop and psscr handling.
> Patch 8 and 9 are minor cleanup in cpuidle driver.
> Patch 10 adds cpuidle driver support.
> Patch 11 makes offlined cpu use deepest stop state.
> 
> Note: Documentation for the device tree bindings is posted here-
> http://patchwork.ozlabs.org/patch/629125/
> 
> Changes in v7
> =
>  - File renamed to idle_book3s.S instead of idle_power_common.S
>  - Comment changes
>  - power_stop0, power_stop renamed to power9_idle and power_idle_stop
>  - PSSCR template is now a macro instead of storing in paca
>  - power9_idle in C file instead of assembly
>  - Fixed TOC related bug
>  - Handling subcore within FTR section
>  - Functions in idle.c reordered and broken into multiple functions
>  - calling __restore_cpu_power8/9 via cur_cpu_spec->cpu_restore 
>  - Added a minor patch with minor cleanups in cpuidle-powernv.c . This
>    was mainly to make the existing code consistent with the review
>    comments for new code
>  - Using stack for variables while probing for idle states instead of
>    kzalloc/kcalloc
> 
> Changes in v6
> =
>  - Restore new POWER ISA v3 SPRS when waking up from deep idle
> 
> Changes in v5
> =
>  - Use generic cpuidle constant CPUIDLE_NAME_LEN
>  - Fix return code handling for of_property_read_string_array
>  - Use DT flags to determine if are using stop instruction, instead of
>    cpu_has_feature
>  - Removed unnecessary cast with names
>  - &stop_loop -> stop_loop
>  - Added POWERNV_THRESHOLD_LATENCY_NS to filter out idle states with high 
> latency
> 
> Changes in v4
> =
>  - Added a patch to use PNV_THREAD_WINKLE macro while requesting for winkle
>  - Moved power7_powersave_common rename to more appropriate patch
>  - renaming power7_enter_nap_mode to pnv_enter_arch207_idle_mode
>  - Added PSSCR layout to Patch 7's commit message
>  - Improved / Fixed comments
>  - Fixed whitespace error in paca.h
>  - Using MAX_POSSIBLE_STOP_STATE macro instead of hardcoding 0xF as
>    max possible stop state
> 
> Changes in v3
> =
>  - Rebased on powerpc-next
>  - Dropping patch 1 since we are not adding a new file for P9 idle support
>  - Improved comments in multiple places
>  - Moved GET_PACA from power7_restore_hyp_resource to System Reset
>  - Instead of moving few functions from idle_power7 to idle_power_common,
>    renaming idle_power7.S to idle_power_common.S
>  - Moved HSTATE_HWTHREAD_STATE updation to power_powersave_common
>  - Dropped earlier patch 5 which moved few macros from idle_power_common to
>    asm/cpuidle.h. 
>  - Added a patch to rename reusable power7_* idle functions to pnv_*
>  - Added new patch that creates abstraction for saving SPRs before
>    entering deep idle states
>  - Instead of introducing new file idle_power_stop.S, P9 idle support
>    is added to idle_power_common.S using CPU_FTR sections.
>  - Fixed r4 reg clobbering in power_stop0
> 
> Changes in v2
> =
>  - Rebased on v4.6-rc6
>  - Using CPU_FTR_ARCH_300 bit instead of CPU_FTR_STOP_INST
> 
> Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
> Cc: Daniel Lezcano <daniel.lezc...@linaro.org>
> Cc: l
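
The PSSCR fields listed in the cover letter above can be turned into masks with a little arithmetic. A minimal sketch, assuming IBM bit numbering (bit 0 is the most significant bit of the 64-bit register); the `IBM_*` and `SKETCH_*` names and both helpers are hypothetical and exist only for this illustration:

```c
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit SPR. */
#define IBM_BIT(n)          (1ULL << (63 - (n)))
/* Mask covering bits first..last inclusive, first being the more significant. */
#define IBM_MASK(first, last) \
	((IBM_BIT(first) - IBM_BIT(last)) | IBM_BIT(first))

#define SKETCH_PSSCR_PLS_MASK   IBM_MASK(0, 3)    /* Power-Saving Level Status */
#define SKETCH_PSSCR_PLS_SHIFT  60
#define SKETCH_PSSCR_ESL        IBM_BIT(42)       /* Enable State Loss */
#define SKETCH_PSSCR_PSLL_MASK  IBM_MASK(44, 47)  /* Power-Saving Level Limit */
#define SKETCH_PSSCR_RL_MASK    IBM_MASK(60, 63)  /* Requested Level */

/* Lowest power-saving state entered since stop was last executed. */
static inline uint64_t sketch_psscr_get_pls(uint64_t psscr)
{
	return (psscr & SKETCH_PSSCR_PLS_MASK) >> SKETCH_PSSCR_PLS_SHIFT;
}

/* Replace only the Requested Level field, leaving the other fields alone. */
static inline uint64_t sketch_psscr_set_rl(uint64_t psscr, uint64_t level)
{
	return (psscr & ~SKETCH_PSSCR_RL_MASK) | (level & SKETCH_PSSCR_RL_MASK);
}
```

Under that numbering, bit 42 works out to 0x200000, bits 44:47 to 0xF0000 and bits 60:63 to 0xF.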

Re: [PATCH v7 09/11] cpuidle/powernv: cleanup powernv_add_idle_states

2016-07-07 Thread Michael Neuling

>   /*
> @@ -230,7 +238,7 @@ static int powernv_add_idle_states(void)
>   strcpy(powernv_states[nr_idle_states].desc, 
> "FastSleep");
>   powernv_states[nr_idle_states].flags = 
> CPUIDLE_FLAG_TIMER_STOP;
>   powernv_states[nr_idle_states].target_residency = 
> 30;
> - powernv_states[nr_idle_states].enter = &fastsleep_loop;
> + powernv_states[nr_idle_states].enter = fastsleep_loop;

You can change this code too with the same thing.

static struct cpuidle_state powernv_states[CPUIDLE_STATE_MAX] = {
{ /* Snooze */
.name = "snooze",
.desc = "snooze",
.exit_latency = 0,
.target_residency = 0,
.enter = &snooze_loop },
};

Mikey

Re: [PATCH v7 07/11] powerpc/powernv: Add platform support for stop instruction

2016-07-07 Thread Michael Neuling


> diff --git a/arch/powerpc/include/asm/cpuidle.h 
> b/arch/powerpc/include/asm/cpuidle.h
> index d2f99ca..3d7fc06 100644
> --- a/arch/powerpc/include/asm/cpuidle.h
> +++ b/arch/powerpc/include/asm/cpuidle.h
> @@ -13,6 +13,8 @@
>  #ifndef __ASSEMBLY__
>  extern u32 pnv_fastsleep_workaround_at_entry[];
>  extern u32 pnv_fastsleep_workaround_at_exit[];
> +
> +extern u64 pnv_first_deep_stop_state;

mpe asked a question about this which you neither answered nor addressed.
"Should this have some safe initial value?"

I'm thinking we could do this which is what you have in the init call.
   u64 pnv_first_deep_stop_state = MAX_STOP_STATE;
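
To spell out why the initialiser matters, here is a rough user-space sketch. It assumes `MAX_STOP_STATE` is 0xF (the largest value a 4-bit level field can hold) and that wakeup code treats any level at or above `pnv_first_deep_stop_state` as deep; the helper `stop_level_is_deep()` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: the deepest level a 4-bit requested-level field can name. */
#define MAX_STOP_STATE 0xFULL

/* Safe default until the device tree has been parsed. */
static uint64_t pnv_first_deep_stop_state = MAX_STOP_STATE;

/* Hypothetical helper: does waking from this level need the full restore path? */
static bool stop_level_is_deep(uint64_t requested_level)
{
	return requested_level >= pnv_first_deep_stop_state;
}
```

With a zero-initialised variable every level, including level 0, would compare as deep before init has run. Starting at `MAX_STOP_STATE` keeps the shallow levels on the cheap path until the real threshold is known.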


> @@ -439,7 +540,18 @@ timebase_resync:
>    */
>   bne cr4,clear_lock
>  
> - /* Restore per core state */
> + /*
> +  * First thread in the core to wake up and its waking up with
> +  * complete hypervisor state loss. Restore per core hypervisor
> +  * state.
> +  */
> +BEGIN_FTR_SECTION
> + ld  r4,_PTCR(r1)
> + mtspr   SPRN_PTCR,r4
> + ld  r4,_RPR(r1)
> + mtspr   SPRN_RPR,r4

RPR looks wrong here.  This should be on POWER8 too.

This has changed since v6 and not noted in the v7 comments.  Why are you
changing this now?

> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
> +
>   ld  r4,_TSCR(r1)
>   mtspr   SPRN_TSCR,r4
>   ld  r4,_WORC(r1)
> @@ -461,9 +573,7 @@ common_exit:
>  
>   /* Waking up from winkle */
>  
> - /* Restore per thread state */
> - bl  __restore_cpu_power8
> -
> +BEGIN_MMU_FTR_SECTION
>   /* Restore SLB  from PACA */
>   ld  r8,PACA_SLBSHADOWPTR(r13)
>  
> @@ -477,6 +587,9 @@ common_exit:
>   slbmte  r6,r5
>  1:   addi    r8,r8,16
>   .endr
> +END_MMU_FTR_SECTION_IFCLR(MMU_FTR_RADIX)
> +
> + /* Restore per thread state */

This FTR section is too big.  It ends up at 25 instructions with the loop.
Probably better like this:

BEGIN_MMU_FTR_SECTION
b   no_segments
END_MMU_FTR_SECTION_IFSET(MMU_FTR_RADIX)
/* Restore SLB  from PACA */
ld  r8,PACA_SLBSHADOWPTR(r13)

.rept   SLB_NUM_BOLTED
li  r3, SLBSHADOW_SAVEAREA
LDX_BE  r5, r8, r3
addi    r3, r3, 8
LDX_BE  r6, r8, r3
andis.  r7,r5,SLB_ESID_V@h
beq 1f
slbmte  r6,r5
1:  addi    r8,r8,16
.endr

no_segments:

Re: [PATCH v6 07/11] powerpc/powernv: set power_save func after the idle states are initialized

2016-06-21 Thread Michael Neuling

On Wed, 2016-06-22 at 11:54 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2016-06-08 at 11:54 -0500, Shreyas B. Prabhu wrote:
> > 
> > pnv_init_idle_states discovers supported idle states from the
> > device tree and does the required initialization. Set power_save
> > function pointer only after this initialization is done
> > 
> > Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
> > Signed-off-by: Shreyas B. Prabhu <shre...@linux.vnet.ibm.com>
> Acked-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> 
> Please merge that one as-is now, no need to wait for the rest, as
> otherwise power9 crashes at boot. It doesn't need to wait for the
> rest of the series.

Acked-by: Michael Neuling <mi...@neuling.org>

For the same reason. Without this we need powersave=off on the cmdline on
POWER9.
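
The ordering the patch enforces can be modelled in a few lines of C. This is a stand-alone sketch, not the kernel code: `ppc_md` here is a one-member stand-in, `power7_idle()` only counts calls, the flag value is illustrative, and `init_idle_states()` / `arch_idle_once()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag value only. */
#define OPAL_PM_NAP_ENABLED 0x00010000u

static int idle_calls;

/* Stand-in for the real idle routine; it only counts invocations. */
static void power7_idle(void)
{
	idle_calls++;
}

/* Stand-in for ppc_md: no idle hook until init says it is safe. */
static struct {
	void (*power_save)(void);
} ppc_md = { .power_save = NULL };

/* Called once the supported idle states have been discovered. */
static void init_idle_states(uint32_t supported_cpuidle_states)
{
	if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
		ppc_md.power_save = power7_idle;
}

/* One pass of an idle loop: use the hook only if one was installed. */
static void arch_idle_once(void)
{
	if (ppc_md.power_save)
		ppc_md.power_save();
}
```

Because the hook starts out NULL, a CPU whose firmware does not advertise nap never reaches `power7_idle()`, which is the situation the ack above refers to on POWER9.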

Mikey

> 
> Cheers,
> Ben.
> 
> > 
> > ---
> > - No changes since v1
> > 
> >  arch/powerpc/platforms/powernv/idle.c  | 3 +++
> >  arch/powerpc/platforms/powernv/setup.c | 2 +-
> >  2 files changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/idle.c
> > b/arch/powerpc/platforms/powernv/idle.c
> > index fcc8b68..fbb09fb 100644
> > --- a/arch/powerpc/platforms/powernv/idle.c
> > +++ b/arch/powerpc/platforms/powernv/idle.c
> > @@ -285,6 +285,9 @@ static int __init pnv_init_idle_states(void)
> >     }
> >  
> >     pnv_alloc_idle_core_states();
> > +
> > +   if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
> > +   ppc_md.power_save = power7_idle;
> >  out_free:
> >     kfree(flags);
> >  out:
> > diff --git a/arch/powerpc/platforms/powernv/setup.c
> > b/arch/powerpc/platforms/powernv/setup.c
> > index ee6430b..8492bbb 100644
> > --- a/arch/powerpc/platforms/powernv/setup.c
> > +++ b/arch/powerpc/platforms/powernv/setup.c
> > @@ -315,7 +315,7 @@ define_machine(powernv) {
> >     .get_proc_freq  = pnv_get_proc_freq,
> >     .progress   = pnv_progress,
> >     .machine_shutdown   = pnv_shutdown,
> > -   .power_save = power7_idle,
> > +   .power_save = NULL,
> >     .calibrate_decr = generic_calibrate_decr,
> >  #ifdef CONFIG_KEXEC
> >     .kexec_cpu_down = pnv_kexec_cpu_down,

Re: [v6, 08/11] powerpc/powernv: Add platform support for stop instruction

2016-06-21 Thread Michael Neuling


> > > +#define OPAL_PM_TIMEBASE_STOP 0x0002
> > > +#define OPAL_PM_LOSE_HYP_CONTEXT 0x2000
> > > +#define OPAL_PM_LOSE_FULL_CONTEXT 0x4000
> > >  #define OPAL_PM_NAP_ENABLED  0x0001
> > >  #define OPAL_PM_SLEEP_ENABLED0x0002
> > >  #define OPAL_PM_WINKLE_ENABLED   0x0004
> > >  #define OPAL_PM_SLEEP_ENABLED_ER1 0x0008 /* with
> > > workaround */
> > > +#define OPAL_PM_STOP_INST_FAST   0x0010
> > > +#define OPAL_PM_STOP_INST_DEEP   0x0020
> > I don't see the above in skiboot yet?
> I've posted it here -
> http://patchwork.ozlabs.org/patch/617828/

FWIW, this is in now.

https://github.com/open-power/skiboot/commit/952daa69baca407383bc900911f6c40718a0e289

> > 
> > 
> > > 
> > > diff --git a/arch/powerpc/include/asm/paca.h
> > > b/arch/powerpc/include/asm/paca.h
> > > index 546540b..ae91b44 100644
> > > --- a/arch/powerpc/include/asm/paca.h
> > > +++ b/arch/powerpc/include/asm/paca.h
> > > @@ -171,6 +171,8 @@ struct paca_struct {
> > >   /* Mask to denote subcore sibling threads */
> > >   u8 subcore_sibling_mask;
> > >  #endif
> > > + /* Template for PSSCR with EC, ESL, TR, PSLL, MTL fields set
> > > */
> > > + u64 thread_psscr;
> > I'm not entirely clear on why that needs to be in the paca. Could it
> > not be global?
> > 
> While we use Requested Level (RL) field of PSSCR to request a stop
> level, other fields in the SPR like EC, ESL, TR, PSLL, MTL can be
> modified by individual threads less frequently to alter the behaviour of
> stop. So the idea was to have a per-thread variable with all (except RL)
> fields of PSSCR set appropriately. Threads at the time of entering idle,
> can modify the RL field in the variable and execute stop instruction.

But we don't do any of this currently? This is setup at init
in pnv_init_idle_states() and only the RL is changed in power_stop().

So it can still be a global.  It could just be a constant currently even.
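
As a sketch of what "just a constant" could look like: the template becomes a compile-time value and only the Requested Level is filled in per request. The field encodings below are assumptions made for this example and should be checked against the real headers; `PSSCR_TEMPLATE` and `psscr_for_level()` are hypothetical names.

```c
#include <stdint.h>

/* Assumed field encodings; check them against the real headers before use. */
#define PSSCR_RL_MASK   0x0000000FULL
#define PSSCR_MTL_MASK  0x000000F0ULL
#define PSSCR_TR_MASK   0x00000300ULL
#define PSSCR_PSLL_MASK 0x000F0000ULL
#define PSSCR_EC        0x00100000ULL
#define PSSCR_ESL       0x00200000ULL

/* Everything except RL, fixed at build time instead of stored per thread. */
#define PSSCR_TEMPLATE (PSSCR_ESL | PSSCR_EC | PSSCR_PSLL_MASK | \
			PSSCR_TR_MASK | PSSCR_MTL_MASK)

/* Hypothetical helper: build the PSSCR value for one stop request. */
static inline uint64_t psscr_for_level(uint64_t requested_level)
{
	return PSSCR_TEMPLATE | (requested_level & PSSCR_RL_MASK);
}
```

Nothing per-thread is left to store in this form, so the paca field would only be needed again if individual threads started changing the other fields.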

>   .text
> > >  
> > >  /*
> > > @@ -61,8 +75,19 @@ save_sprs_to_stack:
> > >    * Note all register i.e per-core, per-subcore or per-thread 
> > > is saved
> > >    * here since any thread in the core might wake up first
> > >    */
> > > +BEGIN_FTR_SECTION
> > > + mfspr   r3,SPRN_PTCR
> > > + std r3,_PTCR(r1)
> > > + mfspr   r3,SPRN_LMRR
> > > + std r3,_LMRR(r1)
> > > + mfspr   r3,SPRN_LMSER
> > > + std r3,_LMSER(r1)
> > > + mfspr   r3,SPRN_ASDR
> > > + std r3,_ASDR(r1)
> > > +FTR_SECTION_ELSE
> > A comment here saying that SDR1 is removed in ISA 3.0 would be helpful.
> > 
> Ok.

I thought we decided we didn't need LMRR and LMSER:

https://lkml.org/lkml/2016/6/8/1121

ASDR isn't actually used at all yet and is only valid for some page
faults, so we don't need it here either.

> +END_MMU_FTR_SECTION_IFCLR(MMU_FTR_RADIX)
> > > +
> > > + /* Restore per thread state */
> > > +BEGIN_FTR_SECTION
> > > + bl  __restore_cpu_power9
> > > +
> > > + ld  r4,_LMRR(r1)
> > > + mtspr   SPRN_LMRR,r4
> > > + ld  r4,_LMSER(r1)
> > > + mtspr   SPRN_LMSER,r4
> > > + ld  r4,_ASDR(r1)
> > > + mtspr   SPRN_ASDR,r4
> > Should those be in __restore_cpu_power9 ?
> I was not sure how these registers will be used, but after speaking to
> Aneesh and Mikey I realized these registers will not need restoring.
> LMRR and LMSER are associated with the context and ASDR will be consumed
> before entering stop. So I'll be dropping this hunk in next revision.

Yep.

> 
> > >   pnv_alloc_idle_core_states();
> > >  
> > > + if (supported_cpuidle_states & OPAL_PM_STOP_INST_FAST)
> > > + for_each_possible_cpu(i) {
> > > +
> > > + u64 psscr_init_val = PSSCR_ESL | PSSCR_EC |
> > > + PSSCR_PSLL_MASK | PSSCR_TR_MASK |
> > > + PSSCR_MTL_MASK;
> > > +
> > > + paca[i].thread_psscr = psscr_init_val;

This seems to be the only place you set this.  Why put it in the paca, why
not just make this a constant? 

Mikey

Re: [v6, 08/11] powerpc/powernv: Add platform support for stop instruction

2016-06-21 Thread Michael Neuling


> > > +#define OPAL_PM_TIMEBASE_STOP0x0002
> > > +#define OPAL_PM_LOSE_HYP_CONTEXT 0x2000
> > > +#define OPAL_PM_LOSE_FULL_CONTEXT0x4000
> > >  #define OPAL_PM_NAP_ENABLED  0x0001
> > >  #define OPAL_PM_SLEEP_ENABLED0x0002
> > >  #define OPAL_PM_WINKLE_ENABLED   0x0004
> > >  #define OPAL_PM_SLEEP_ENABLED_ER10x0008 /* with
> > > workaround */
> > > +#define OPAL_PM_STOP_INST_FAST   0x0010
> > > +#define OPAL_PM_STOP_INST_DEEP   0x0020
> > I don't see the above in skiboot yet?
> I've posted it here -
> http://patchwork.ozlabs.org/patch/617828/

FWIW, this is in now.

https://github.com/open-power/skiboot/commit/952daa69baca407383bc900911f6c40718a0e289

> > 
> > 
> > > 
> > > diff --git a/arch/powerpc/include/asm/paca.h
> > > b/arch/powerpc/include/asm/paca.h
> > > index 546540b..ae91b44 100644
> > > --- a/arch/powerpc/include/asm/paca.h
> > > +++ b/arch/powerpc/include/asm/paca.h
> > > @@ -171,6 +171,8 @@ struct paca_struct {
> > >   /* Mask to denote subcore sibling threads */
> > >   u8 subcore_sibling_mask;
> > >  #endif
> > > + /* Template for PSSCR with EC, ESL, TR, PSLL, MTL fields set
> > > */
> > > + u64 thread_psscr;
> > I'm not entirely clear on why that needs to be in the paca. Could it
> > not be global?
> > 
> While we use Requested Level (RL) field of PSSCR to request a stop
> level, other fields in the SPR like EC, ESL, TR, PSLL, MTL can be
> modified by individual threads less frequently to alter the behaviour of
> stop. So the idea was to have a per-thread variable with all (except RL)
> fields of PSSCR set appropriately. Threads at the time of entering idle,
> can modify the RL field in the variable and execute stop instruction.

But we don't do any of this currently? This is setup at init
in pnv_init_idle_states() and only the RL is changed in power_stop().

So it can still be a global.  It could just be a constant currently even.

>   .text
> > >  
> > >  /*
> > > @@ -61,8 +75,19 @@ save_sprs_to_stack:
> > >    * Note all register i.e per-core, per-subcore or per-thread 
> > > is saved
> > >    * here since any thread in the core might wake up first
> > >    */
> > > +BEGIN_FTR_SECTION
> > > + mfspr   r3,SPRN_PTCR
> > > + std r3,_PTCR(r1)
> > > + mfspr   r3,SPRN_LMRR
> > > + std r3,_LMRR(r1)
> > > + mfspr   r3,SPRN_LMSER
> > > + std r3,_LMSER(r1)
> > > + mfspr   r3,SPRN_ASDR
> > > + std r3,_ASDR(r1)
> > > +FTR_SECTION_ELSE
> > A comment here saying that SDR1 is removed in ISA 3.0 would be helpful.
> > 
> Ok.

I thought we decided we didn't need LMRR, LMSR, 

https://lkml.org/lkml/2016/6/8/1121

or ASDR isn't actually used at all yet and is only valid for some page
faults, so we don't need it here also.

> +END_MMU_FTR_SECTION_IFCLR(MMU_FTR_RADIX)
> > > +
> > > + /* Restore per thread state */
> > > +BEGIN_FTR_SECTION
> > > + bl  __restore_cpu_power9
> > > +
> > > + ld  r4,_LMRR(r1)
> > > + mtspr   SPRN_LMRR,r4
> > > + ld  r4,_LMSER(r1)
> > > + mtspr   SPRN_LMSER,r4
> > > + ld  r4,_ASDR(r1)
> > > + mtspr   SPRN_ASDR,r4
> > Should those be in __restore_cpu_power9 ?
> I was not sure how these registers will be used, but after speaking to
> Aneesh and Mikey I realized these registers will not need restoring.
> LMRR and LMSER are associated with the context and ADSR will be consumed
> before entering stop. So I'll be dropping the this hunk in next revision.

Yep.

> 
> > >   pnv_alloc_idle_core_states();
> > >  
> > > + if (supported_cpuidle_states & OPAL_PM_STOP_INST_FAST)
> > > + for_each_possible_cpu(i) {
> > > +
> > > + u64 psscr_init_val = PSSCR_ESL | PSSCR_EC |
> > > + PSSCR_PSLL_MASK | PSSCR_TR_MASK |
> > > + PSSCR_MTL_MASK;
> > > +
> > > + paca[i].thread_psscr = psscr_init_val;

This seems to be the only place you set this.  Why put it in the paca, why
not just make this a constant? 
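
A minimal sketch of the constant approach, in C. The mask values are
placeholders chosen for illustration and do not reflect the real PSSCR bit
layout; ex_psscr_for_level() is a hypothetical helper, not an existing
kernel function.

```c
#include <stdint.h>

/*
 * Placeholder values for illustration only. They do not match the
 * real PSSCR field positions from the ISA or the kernel headers.
 */
#define EX_PSSCR_RL_MASK   0x000000000000000FULL /* Requested Level field */
#define EX_PSSCR_TEMPLATE  0x0000000000FF0000ULL /* stands in for ESL|EC|PSLL|TR|MTL */

/* Build the PSSCR value: the fixed template plus the requested level. */
static inline uint64_t ex_psscr_for_level(uint64_t requested_level)
{
	return EX_PSSCR_TEMPLATE | (requested_level & EX_PSSCR_RL_MASK);
}
```

With this shape nothing needs to live in the paca: the only per-call input
is the requested level.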

Mikey

Re: [PATCH v5 08/11] powerpc/powernv: Add platform support for stop instruction

2016-06-08 Thread Michael Neuling

On Wed, 2016-06-08 at 22:31 +0530, Shreyas B Prabhu wrote:
> Hi Ben,
> 
> Sorry for the delayed response.
> 
> On 06/06/2016 03:58 AM, Benjamin Herrenschmidt wrote:
> > 
> > On Thu, 2016-06-02 at 07:38 -0500, Shreyas B. Prabhu wrote:
> > > 
> > > @@ -61,8 +72,13 @@ save_sprs_to_stack:
> > >  * Note all register i.e per-core, per-subcore or per-thread
> > > is saved
> > >  * here since any thread in the core might wake up first
> > >  */
> > > +BEGIN_FTR_SECTION
> > > +   mfspr   r3,SPRN_PTCR
> > > +   std r3,_PTCR(r1)
> > > +FTR_SECTION_ELSE
> > > mfspr   r3,SPRN_SDR1
> > > std r3,_SDR1(r1)
> > > +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
> > This is the only new SPR we care about in P9 ?
> > 
> After reviewing ISA again, I've identified LMRR, LMSER and ASDR also
> need to be restored. I've fixed this in v6.

LMRR and LMSER are used by the load monitored patch set.  There they will get
restored when we context switch back to userspace.  It probably doesn't
hurt that much but you don't need to restore them here. 

They are not used in the kernel.

It escapes me what ASDR is right now.

Mikey

Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared

2016-05-12 Thread Michael Neuling

On Thu, 2016-05-12 at 13:33 +0200, Peter Zijlstra wrote:
> On Thu, May 12, 2016 at 09:07:52PM +1000, Michael Neuling wrote:
> > 
> > On Thu, 2016-05-12 at 07:07 +0200, Peter Zijlstra wrote:
> > 
> > > 
> > > But as per the above, Power7 and Power8 have explicit logic to share
> > > the
> > > per-core L3 with the other cores.
> > > 
> > > How effective is that? From some of the slides/documents i've looked
> > > at
> > > the L3s are connected with a high-speed fabric. Suggesting that the
> > > cross-core sharing should be fairly efficient.
> > I'm not sure.  I thought it was mostly private but if another core was
> > sleeping or not experiencing much cache pressure, another core could
> > use it
> > for some things. But I'm fuzzy on the the exact properties, sorry.
> Right; I'm going by bits and pieces found on the tubes, so I'm just
> guessing ;-)
> 
> But it sounds like these L3s are nowhere close to what Intel does with
> their L3, where each core has an L3 slice, and slices are connected on a
> ring to form a unified/shared cache across all cores.
> 
> http://www.realworldtech.com/sandy-bridge/8/

The POWER8 user manual is what you want to look at:

https://www.setphaserstostun.org/power8/POWER8_UM_v1.3_16MAR2016_pub.pdf

Section 10, "L3 Cache Overview", starts on page 128.  It describes L3.0,
which is the local core's L3, and L3.1, which is some other core's L3.

Once the L3.0 is full, we can cast out to an L3.1 (ie. the cache on another
core).  L3.1 can also provide data for reads.

ECO mode (section 10.4) is what I was talking about for sleeping/unused
cores.  That's more of a boot time (firmware option) than something we can
dynamically play with at runtime (I believe), so it's not something I think
is relevant here.

> > 
> > > 
> > > In which case it would make sense to treat/model the combined L3 as a
> > > single large LLC covering all cores.
> > Are you thinking it would be much cheaper to migrate a task to another
> > core
> > inside this chip, than to off chip?
> Basically; and if so, if its cheap enough to shoot a task to an idle
> core to avoid queueing. Assuming there still is some cache residency on
> the old core, the inter-core fill should be much cheaper than fetching
> it off package (either remote cache or dram).

So I think that will apply on POWER8.

In 10.4.2 it says "The L3.1 ECO Caches will be snooped and provide
intervention data similar to the L2 and L3.0 caches on the
chip"  That should be much faster than going to another chip or DIMM.

So migrating to another core on the same chip should be faster than off
chip.
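
The reasoning above can be sketched in C as a classification of where a
migrated task's still-cached data would be filled from. This is only an
illustration of the argument: the struct, the enum and the function name are
hypothetical, and it assumes the data is still resident in the old core's L3.

```c
/* Where a migrated task's cached data would be fetched from. */
enum ex_fill_source {
	EX_FILL_LOCAL_L3,   /* same core: its own L3 (L3.0) */
	EX_FILL_ON_CHIP_L3, /* other core, same chip: intervention from that core's L3 */
	EX_FILL_OFF_CHIP    /* another chip or DIMM */
};

/* Hypothetical location of a CPU: which chip and which core on it. */
struct ex_cpu_loc {
	int chip_id;
	int core_id;
};

static enum ex_fill_source
ex_fill_after_migration(struct ex_cpu_loc from, struct ex_cpu_loc to)
{
	if (from.chip_id != to.chip_id)
		return EX_FILL_OFF_CHIP;
	if (from.core_id != to.core_id)
		return EX_FILL_ON_CHIP_L3;
	return EX_FILL_LOCAL_L3;
}
```

The enum is ordered from cheapest to most expensive fill, which matches the
claim that an on-chip migration should cost less than an off-chip one.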

Mikey

> Or at least; so goes my reasoning based on my google results.
>

Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared

2016-05-12 Thread Michael Neuling

On Thu, 2016-05-12 at 07:07 +0200, Peter Zijlstra wrote:
> On Thu, May 12, 2016 at 12:05:37PM +1000, Michael Neuling wrote:
> > 
> > On Wed, 2016-05-11 at 20:24 +0200, Peter Zijlstra wrote:
> > > 
> > > On Wed, May 11, 2016 at 02:33:45PM +0200, Peter Zijlstra wrote:
> > > > 
> > > > 
> > > > Hmm, PPC folks; what does your topology look like?
> > > > 
> > > > Currently your sched_domain_topology, as per
> > > > arch/powerpc/kernel/smp.c
> > > > seems to suggest your cores do not share cache at all.
> > > > 
> > > > https://en.wikipedia.org/wiki/POWER7 seems to agree and states
> > > > 
> > > >   "4 MB L3 cache per C1 core"
> > > > 
> > > > And http://www-03.ibm.com/systems/resources/systems_power_software_
> > > > i_pe
> > > > rfmgmt_underthehood.pdf
> > > > also explicitly draws pictures with the L3 per core.
> > > > 
> > > > _however_, that same document describes L3 inter-core fill and
> > > > lateral
> > > > cast-out, which sounds like the L3s work together to form a node
> > > > wide
> > > > caching system.
> > > > 
> > > > Do we want to model this co-operative L3 slices thing as a sort of
> > > > node-wide LLC for the purpose of the scheduler ?
> > > Going back a generation; Power6 seems to have a shared L3 (off
> > > package)
> > > between the two cores on the package. The current topology does not
> > > reflect that at all.
> > > 
> > > And going forward a generation; Power8 seems to share the per-core
> > > (chiplet) L3 amonst all cores (chiplets) + is has the centaur (memory
> > > controller) 16M L4.
> > Yep, L1/L2/L3 is per core on POWER8 and POWER7.  POWER6 and POWER5
> > (both
> > dual core chips) had a shared off chip cache
> But as per the above, Power7 and Power8 have explicit logic to share the
> per-core L3 with the other cores.
> 
> How effective is that? From some of the slides/documents i've looked at
> the L3s are connected with a high-speed fabric. Suggesting that the
> cross-core sharing should be fairly efficient.

I'm not sure.  I thought it was mostly private but if another core was
sleeping or not experiencing much cache pressure, another core could use it
for some things. But I'm fuzzy on the exact properties, sorry.

> In which case it would make sense to treat/model the combined L3 as a
> single large LLC covering all cores.

Are you thinking it would be much cheaper to migrate a task to another core
inside this chip, than to off chip?

Mikey

Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared

2016-05-11 Thread Michael Neuling

On Wed, 2016-05-11 at 20:24 +0200, Peter Zijlstra wrote:
> On Wed, May 11, 2016 at 02:33:45PM +0200, Peter Zijlstra wrote:
> > 
> > Hmm, PPC folks; what does your topology look like?
> > 
> > Currently your sched_domain_topology, as per arch/powerpc/kernel/smp.c
> > seems to suggest your cores do not share cache at all.
> > 
> > https://en.wikipedia.org/wiki/POWER7 seems to agree and states
> > 
> >   "4 MB L3 cache per C1 core"
> > 
> > And http://www-03.ibm.com/systems/resources/systems_power_software_i_pe
> > rfmgmt_underthehood.pdf
> > also explicitly draws pictures with the L3 per core.
> > 
> > _however_, that same document describes L3 inter-core fill and lateral
> > cast-out, which sounds like the L3s work together to form a node wide
> > caching system.
> > 
> > Do we want to model this co-operative L3 slices thing as a sort of
> > node-wide LLC for the purpose of the scheduler ?
> Going back a generation; Power6 seems to have a shared L3 (off package)
> between the two cores on the package. The current topology does not
> reflect that at all.
> 
> And going forward a generation; Power8 seems to share the per-core
> (chiplet) L3 amonst all cores (chiplets) + is has the centaur (memory
> controller) 16M L4.

Yep, L1/L2/L3 is per core on POWER8 and POWER7.  POWER6 and POWER5 (both
dual core chips) had a shared off chip cache.

The POWER8 L4 is really a bit different as it's out in the memory
controller.  It's more of a memory DIMM buffer as it can only cache data
associated with the physical addresses on those DIMMs.

> So it seems the current topology setup is not describing these chips
> very well. Also note that the arch topology code can runtime select a
> topology, so you could make that topo setup micro-arch specific.

We are planning on making some topology changes for the upcoming P9 which
will share L2/L3 amongst pairs of cores (24 cores per chip).

FWIW our P9 upstreaming is still in its infancy since P9 is not released
yet.

Mike

Re: [PATCH 4.4 60/67] powerpc/tm: Check for already reclaimed tasks

2016-05-03 Thread Michael Neuling

On Tue, 2016-05-03 at 08:32 +0200, Jiri Slaby wrote:
> On 01/27/2016, 07:12 PM, Greg Kroah-Hartman wrote:
> > 
> > 4.4-stable review patch.  If anyone has any objections, please let me
> > know.
> > 
> > ----------
> > 
> > From: Michael Neuling <mi...@neuling.org>
> > 
> > commit 7f821fc9c77a9b01fe7b1d6e72717b33d8d64142 upstream.
> > 
> > Currently we can hit a scenario where we'll tm_reclaim() twice.  This
> > results in a TM bad thing exception because the second reclaim occurs
> > when not in suspend mode.
> > 
> > The scenario in which this can happen is the following.  We attempt to
> > deliver a signal to userspace.  To do this we need obtain the stack
> > pointer to write the signal context.  To get this stack pointer we
> > must tm_reclaim() in case we need to use the checkpointed stack
> > pointer (see get_tm_stackpointer()).  Normally we'd then return
> > directly to userspace to deliver the signal
> > without going through
> > __switch_to().
> > 
> > Unfortunatley, if at this point we get an error (such as a bad
> > userspace stack pointer), we need to exit the process.  The exit will
> > result in a __switch_to().  __switch_to() will attempt to save the
> > process state which results in another tm_reclaim().  This
> > tm_reclaim() now causes a TM Bad Thing exception as this state has
> > already been saved and the processor is no longer in TM suspend mode.
> > Whee!
> > 
> > This patch checks the state of the MSR to ensure we are TM suspended
> > before we attempt the tm_reclaim().  If we've already saved the state
> > away, we should no longer be in TM suspend mode.  This has the
> > additional advantage of checking for a potential TM Bad Thing
> > exception.
> > 
> > Found using syscall fuzzer.
> > 
> > Fixes: fb09692e71f1 ("powerpc: Add reclaim and recheckpoint functions
> > for context switching transactional memory processes")
> > Signed-off-by: Michael Neuling <mi...@neuling.org>
> > Signed-off-by: Michael Ellerman <m...@ellerman.id.au>
> > Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
> > 
> > ---
> >  arch/powerpc/kernel/process.c |   18 ++
> >  1 file changed, 18 insertions(+)
> > 
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -569,6 +569,24 @@ static void tm_reclaim_thread(struct thr
> >     if (!MSR_TM_SUSPENDED(mfmsr()))
> >     return;
> >  
> > +   /*
> > +    * Use the current MSR TM suspended bit to track if we have
> > +    * checkpointed state outstanding.
> > +    * On signal delivery, we'd normally reclaim the checkpointed
> > +    * state to obtain stack pointer (see:get_tm_stackpointer()).
> > +    * This will then directly return to userspace without going
> > +    * through __switch_to(). However, if the stack frame is bad,
> > +    * we need to exit this thread which calls __switch_to() which
> > +    * will again attempt to reclaim the already saved tm state.
> > +    * Hence we need to check that we've not already reclaimed
> > +    * this state.
> > +    * We do this using the current MSR, rather tracking it in
> > +    * some specific thread_struct bit, as it has the additional
> > +    * benifit of checking for a potential TM bad thing exception.
> > +    */
> > +   if (!MSR_TM_SUSPENDED(mfmsr()))
> > +   return;
> 
> This one should have not been applied to 4.4. The patch is in mainline
> since 4.4-rc6. Hence the check is duplicated as can be seen above.

Greg, surely your scripts could check for that?

> It is harmless though, it seems?

Yes, that should be harmless, other than a small performance penalty.
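
A small C model of why the duplicate is harmless. It is a sketch only: the
struct and function below are hypothetical stand-ins, with a boolean in
place of the MSR TM suspended bit and a counter in place of the real
reclaim.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the thread's TM state. */
struct ex_thread {
	bool tm_suspended; /* models MSR_TM_SUSPENDED(mfmsr()) */
	int reclaims;      /* how many times the reclaim really ran */
};

static void ex_tm_reclaim_thread(struct ex_thread *t)
{
	if (!t->tm_suspended) /* check already present in 4.4 */
		return;

	if (!t->tm_suspended) /* same check added again by the stable patch */
		return;

	t->reclaims++;           /* stands in for the actual tm_reclaim() */
	t->tm_suspended = false; /* state saved, no longer suspended */
}
```

The second test can never change the outcome because nothing modifies the
state between the two checks; it only costs one extra read and compare.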

Mikey

Re: [PATCH 7/9] powerpc/powernv: Add platform support for stop instruction

2016-05-02 Thread Michael Neuling


> diff --git a/arch/powerpc/include/asm/cputable.h 
> b/arch/powerpc/include/asm/cputable.h
> index df4fb5f..a4739a1 100644
> --- a/arch/powerpc/include/asm/cputable.h
> +++ b/arch/powerpc/include/asm/cputable.h
> @@ -205,6 +205,7 @@ enum {
>  #define CPU_FTR_DABRX         LONG_ASM_CONST(0x0800)
>  #define CPU_FTR_PMAO_BUG      LONG_ASM_CONST(0x1000)
>  #define CPU_FTR_SUBCORE       LONG_ASM_CONST(0x2000)
> +#define CPU_FTR_STOP_INST     LONG_ASM_CONST(0x4000)

In general, we are putting all the POWER9 features under CPU_FTR_ARCH_300.
Is there a reason you need this separate bit?

CPU_FTR bits are fairly scarce these days.
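
As a rough C sketch of the alternative, the stop instruction can be keyed
off the architecture-level bit rather than a dedicated one. The bit value
and both helpers are hypothetical placeholders, not the kernel's
definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit for illustration; not the kernel's CPU_FTR_ARCH_300 value. */
#define EX_FTR_ARCH_300 (1ULL << 0)

static inline bool ex_cpu_has_feature(uint64_t cpu_features, uint64_t feature)
{
	return (cpu_features & feature) != 0;
}

/* No dedicated bit: the instruction is implied by the architecture level. */
static inline bool ex_cpu_has_stop_insn(uint64_t cpu_features)
{
	return ex_cpu_has_feature(cpu_features, EX_FTR_ARCH_300);
}
```

Callers still get a descriptive predicate, but no extra feature bit is
consumed.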

Mikey

Re: [PATCH 1/3] sched/fair: Fix asym packing to select correct cpu

2016-03-23 Thread Michael Neuling

On Wed, 2016-03-23 at 17:04 +0530, Srikar Dronamraju wrote:
> If asymmetric packing is used when target cpu is busy,
> update_sd_pick_busiest(), can select a lightly loaded cpu.
> find_busiest_group() has checks to ensure asym packing is only used
> when target cpu is not busy.  However it may not be able to avoid a
> lightly loaded cpu selected by update_sd_pick_busiest from being
> selected as source cpu for eventual load balancing.
> 
> Also when using asymmetric scheduling, always select higher cpu as
> source cpu for load balancing.
> 
> While doing this change, move the check to see if target cpu is busy
> into check_asym_packing().
> 
> 1. Record per second ebizzy (32 threads) on a 64 cpu power 7 box. (5 
> iterations)
> 4.5.0-master/ebizzy_32.out
>     N       Min       Max    Median       Avg    Stddev
> x   5   5205896  17260530  12141204  10759008   4765419
> 
> 4.5.0-master-asym-changes/ebizzy_32.out
>     N       Min       Max    Median       Avg    Stddev
> x   5   7044349  19112876  17440718  14947658   5263970
> 
> 2. Record per second ebizzy (32 threads) on a 64 cpu power 7 box. (5 
> iterations)
> 4.5.0-master/ebizzy_64.out
>     N       Min       Max    Median       Avg    Stddev
> x   5   5400083  14091418   8673907 8872662.4 3389746.8
> 
> 4.5.0-master-asym-changes/ebizzy_64.out
>     N       Min       Max    Median       Avg    Stddev
> x   5   7533907  17232623  15083583  13364894 3776877.9
> 
> 3. Record per second ebizzy (32 threads) on a 64 cpu power 7 box. (5 
> iterations)
> 4.5.0-master/ebizzy_128.out
>     N       Min       Max    Median       Avg    Stddev
> x   5  35328039  41642699  37564951  38378409   2671280
> 
> 4.5.0-master-asym-changes/ebizzy_128.out
>     N       Min       Max    Median       Avg    Stddev
> x   5  37102220  42736809  38442478  39529626 2298389.4

I'm not sure how to interpret these.  Any chance you can give a summary of
what these results mean?
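
One possible form for such a summary is the relative change in the average
between the two kernels. The helper below is a hypothetical sketch in C, not
something taken from the patch or from ebizzy.

```c
/* Relative change from 'before' to 'after', in percent. */
static double ex_percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For the first pair of runs, ex_percent_change(10759008, 14947658) comes out
at roughly 39%. The reported standard deviations are of the same order as
the differences in the averages, so a figure like this would need to be read
with that spread in mind.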

> Signed-off-by: Srikar Dronamraju 

FWIW, this still passes my scheduler tests on POWER7, but they weren't 
failing before anyway.

Mikey

> ---
>  kernel/sched/fair.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 56b7d4b..9abfb16 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6517,6 +6517,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   if (!(env->sd->flags & SD_ASYM_PACKING))
>   return true;
>  
> + if (env->idle == CPU_NOT_IDLE)
> + return true;
>   /*
>* ASYM_PACKING needs to move all the work to the lowest
>* numbered CPUs in the group, therefore mark all groups
> @@ -6526,7 +6528,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   if (!sds->busiest)
>   return true;
>  
> - if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
> + if (group_first_cpu(sds->busiest) < group_first_cpu(sg))
>   return true;
>   }
>  
> @@ -6672,6 +6674,9 @@ static int check_asym_packing(struct lb_env *env, 
> struct sd_lb_stats *sds)
>   if (!(env->sd->flags & SD_ASYM_PACKING))
>   return 0;
>  
> + if (env->idle == CPU_NOT_IDLE)
> + return 0;
> +
>   if (!sds->busiest)
>   return 0;
>  
> @@ -6864,8 +6869,7 @@ static struct sched_group *find_busiest_group(struct 
> lb_env *env)
>   busiest = &sds.busiest_stat;
>  
>   /* ASYM feature bypasses nice load balance check */
> - if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
> - check_asym_packing(env, &sds))
> + if (check_asym_packing(env, &sds))
>   return sds.busiest;
>  
>   /* There is no busy sibling group to pull tasks from */

Re: [PATCH 1/3] sched/fair: Fix asym packing to select correct cpu

2016-03-23 Thread Michael Neuling

On Wed, 2016-03-23 at 17:04 +0530, Srikar Dronamraju wrote:
> If asymmetric packing is used when target cpu is busy,
> update_sd_pick_busiest(), can select a lightly loaded cpu.
> find_busiest_group() has checks to ensure asym packing is only used
> when target cpu is not busy.  However it may not be able to avoid a
> lightly loaded cpu selected by update_sd_pick_busiest from being
> selected as source cpu for eventual load balancing.
> 
> Also when using asymmetric scheduling, always select higher cpu as
> source cpu for load balancing.
> 
> While doing this change, move the check to see if target cpu is busy
> into check_asym_packing().
> 
> 1. Record per second ebizzy (32 threads) on a 64 cpu power 7 box. (5 
> iterations)
> 4.5.0-master/ebizzy_32.out
> N   Min   MaxMedian   AvgStddev
> x   5   5205896  17260530  12141204  10759008   4765419
> 
> 4.5.0-master-asym-changes/ebizzy_32.out
> N   Min   MaxMedian   AvgStddev
> x   5   7044349  19112876  17440718  14947658   5263970
> 
> 2. Record per second ebizzy (32 threads) on a 64 cpu power 7 box. (5 
> iterations)
> 4.5.0-master/ebizzy_64.out
> N   Min   MaxMedian   AvgStddev
> x   5   5400083  14091418   8673907 8872662.4 3389746.8
> 
> 4.5.0-master-asym-changes/ebizzy_64.out
> N   Min   MaxMedian   AvgStddev
> x   5   7533907  17232623  15083583  13364894 3776877.9
> 
> 3. Record per second ebizzy (32 threads) on a 64 cpu power 7 box. (5 
> iterations)
> 4.5.0-master/ebizzy_128.out
> N   Min   MaxMedian   AvgStddev
> x   5  35328039  41642699  37564951  38378409   2671280
> 
> 4.5.0-master-asym-changes/ebizzy_128.out
> N   Min   MaxMedian   AvgStddev
> x   5  37102220  42736809  38442478  39529626 2298389.4

I'm not sure how to interpret these.  Any chance you can give a summary of
what these results mean?

> Signed-off-by: Srikar Dronamraju 

FWIW, this still passes my scheduler tests on POWER7, but they weren't 
failing before anyway.

Mikey

> ---
>  kernel/sched/fair.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 56b7d4b..9abfb16 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6517,6 +6517,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   if (!(env->sd->flags & SD_ASYM_PACKING))
>   return true;
>  
> + if (env->idle == CPU_NOT_IDLE)
> + return true;
>   /*
>* ASYM_PACKING needs to move all the work to the lowest
>* numbered CPUs in the group, therefore mark all groups
> @@ -6526,7 +6528,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   if (!sds->busiest)
>   return true;
>  
> - if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
> + if (group_first_cpu(sds->busiest) < group_first_cpu(sg))
>   return true;
>   }
>  
> @@ -6672,6 +6674,9 @@ static int check_asym_packing(struct lb_env *env, 
> struct sd_lb_stats *sds)
>   if (!(env->sd->flags & SD_ASYM_PACKING))
>   return 0;
>  
> + if (env->idle == CPU_NOT_IDLE)
> + return 0;
> +
>   if (!sds->busiest)
>   return 0;
>  
> @@ -6864,8 +6869,7 @@ static struct sched_group *find_busiest_group(struct 
> lb_env *env)
>   busiest = &sds.busiest_stat;
>  
>   /* ASYM feature bypasses nice load balance check */
> - if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
> - check_asym_packing(env, &sds))
> + if (check_asym_packing(env, &sds))
>   return sds.busiest;
>  
>   /* There is no busy sibling group to pull tasks from */
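
To make the effect of the two update_sd_pick_busiest() hunks easier to
see, here is a minimal user-space sketch of just the logic visible in
the quoted diff.  It is not the kernel function: asym_prefers_candidate()
and its parameters are names made up for illustration, the enum is a
simplified stand-in, and the surrounding condition that guards this
branch in fair.c is left out.

```c
#include <stdbool.h>

/* Simplified stand-in for the kernel's idle states (names mirror the patch). */
enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE, CPU_NEWLY_IDLE };

/*
 * Hypothetical model of the asym-packing branch after the patch.
 * Returns true when the candidate group should become the busiest group.
 */
static bool asym_prefers_candidate(enum cpu_idle_type idle,
				   bool have_busiest,
				   int busiest_first_cpu,
				   int candidate_first_cpu)
{
	/* Added by the patch: a busy target CPU skips the asym ordering. */
	if (idle == CPU_NOT_IDLE)
		return true;

	/* No busiest group recorded yet: take the candidate. */
	if (!have_busiest)
		return true;

	/* Flipped comparison: prefer the group whose first CPU is higher. */
	return busiest_first_cpu < candidate_first_cpu;
}
```

With an idle target and a recorded busiest group whose first CPU is 4,
a candidate group starting at CPU 8 is preferred, while one starting at
CPU 0 is not.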

Re: [PATCH v8 3/6] cpufreq: powernv: Remove cpu_to_chip_id() from hot-path

2016-03-19 Thread Michael Neuling

On Fri, 2016-03-18 at 15:04 +1100, Michael Neuling wrote:
> On Wed, 2016-02-03 at 01:11 +0530, Shilpasri G Bhat wrote:
> > cpu_to_chip_id() does a DT walk through to find out the chip id by
> > taking a contended device tree lock. This adds an unnecessary overhead
> > in a hot path. So instead of calling cpu_to_chip_id() everytime cache
> > the chip ids for all cores in the array 'core_to_chip_map' and use it
> > in the hotpath.
> > 
> > Reported-by: Anton Blanchard <an...@samba.org>
> > Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
> > Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
> > Acked-by: Viresh Kumar <viresh.ku...@linaro.org>
> > ---
> > No changes from v7.
> 
> How about this instead?  It removes the linear lookup and seems a lot
> less complex.

BTW we never init nr_chips before using it.  We also need something
like this:

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index d63d2cb..c819ed4 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -556,6 +556,8 @@ static int init_chip_info(void)
unsigned int cpu, i;
unsigned int prev_chip_id = UINT_MAX;
 
+   nr_chips = 0;
+
for_each_possible_cpu(cpu) {
unsigned int id = cpu_to_chip_id(cpu);

Re: [PATCH v8 3/6] cpufreq: powernv: Remove cpu_to_chip_id() from hot-path

2016-03-18 Thread Michael Neuling

On Wed, 2016-02-03 at 01:11 +0530, Shilpasri G Bhat wrote:

> cpu_to_chip_id() does a DT walk through to find out the chip id by
> taking a contended device tree lock. This adds an unnecessary overhead
> in a hot path. So instead of calling cpu_to_chip_id() everytime cache
> the chip ids for all cores in the array 'core_to_chip_map' and use it
> in the hotpath.
> 
> Reported-by: Anton Blanchard 
> Signed-off-by: Shilpasri G Bhat 
> Reviewed-by: Gautham R. Shenoy 
> Acked-by: Viresh Kumar 
> ---
> No changes from v7.

How about this instead?  It removes the linear lookup and seems a lot
less complex.

Mikey

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 547890f..d63d2cb 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -52,6 +52,7 @@ static struct chip {
 } *chips;
 
 static int nr_chips;
+static DEFINE_PER_CPU(unsigned int, chip_id);
 
 /*
  * Note: The set of pstates consists of contiguous integers, the
@@ -317,9 +318,7 @@ static void powernv_cpufreq_throttle_check(void *data)
 
pmsr = get_pmspr(SPRN_PMSR);
 
-   for (i = 0; i < nr_chips; i++)
-   if (chips[i].id == cpu_to_chip_id(cpu))
-   break;
+   i = this_cpu_read(chip_id);
 
/* Check for Pmax Capping */
pmsr_pmax = (s8)PMSR_MAX(pmsr);
@@ -560,6 +559,7 @@ static int init_chip_info(void)
for_each_possible_cpu(cpu) {
unsigned int id = cpu_to_chip_id(cpu);
 
+   per_cpu(chip_id, cpu) = nr_chips;
if (prev_chip_id != id) {
prev_chip_id = id;
chip[nr_chips++] = id;
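
The idea in the diff above, replacing a per-call search with a value
cached at init time, can be modelled in plain C.  This is only a sketch
under stated assumptions: plain arrays stand in for DEFINE_PER_CPU() and
cpu_to_chip_id(), the sizes and the names model_init_chip_info() and
cpu_chip_slot() are made up, and CPUs of one chip are assumed to be
numbered contiguously.  The sketch stores the slot after the new-chip
check so that every CPU of a chip gets the same slot; that ordering is a
choice made here, not something taken from the posted patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPUS  64	/* illustrative sizes, not taken from the driver */
#define MAX_CHIPS 8

static unsigned int chip_ids[MAX_CHIPS];	/* chip id per slot */
static unsigned int chip_idx[MAX_CPUS];		/* cached slot per CPU */
static unsigned int nr_chips;

/* Build both tables once; cpu_chip[] stands in for cpu_to_chip_id(). */
static void model_init_chip_info(const unsigned int *cpu_chip, size_t ncpus)
{
	unsigned int prev_chip_id = (unsigned int)-1;	/* same as UINT_MAX */
	size_t cpu;

	assert(ncpus <= MAX_CPUS);
	nr_chips = 0;

	for (cpu = 0; cpu < ncpus; cpu++) {
		unsigned int id = cpu_chip[cpu];

		if (prev_chip_id != id) {
			assert(nr_chips < MAX_CHIPS);
			prev_chip_id = id;
			chip_ids[nr_chips++] = id;
		}
		/* Slot of the chip this CPU belongs to. */
		chip_idx[cpu] = nr_chips - 1;
	}
}

/* Hot path: one array read instead of a linear search over the chips. */
static unsigned int cpu_chip_slot(size_t cpu)
{
	return chip_idx[cpu];
}
```

The hot path then costs a single indexed load, which is the point of
the this_cpu_read() change.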

Re: [PATCH v8 3/6] cpufreq: powernv: Remove cpu_to_chip_id() from hot-path

2016-03-18 Thread Michael Neuling

On Sat, 2016-03-19 at 09:37 +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2016-03-18 at 15:04 +1100, Michael Neuling wrote:
> > 
> >  static int nr_chips;
> > +static DEFINE_PER_CPU(unsigned int, chip_id);
> >  
> >  /*
> >   * Note: The set of pstates consists of contiguous integers, the
> > @@ -317,9 +318,7 @@ static void powernv_cpufreq_throttle_check(void
> > *data)
> >  
> > pmsr = get_pmspr(SPRN_PMSR);
> >  
> > -   for (i = 0; i < nr_chips; i++)
> > -   if (chips[i].id == cpu_to_chip_id(cpu))
> > -   break;
> > +   i = this_cpu_read(chip_id);
> 
> Except it's not a chip_id, so your patch confused me for a good 2mn
> ...
> Call it chip_idx maybe ? ie, index.

Yeah, it was a badly named variable.  I've since changed it further, and
Shilpasri rebased it here:

http://patchwork.ozlabs.org/patch/599523/

Mikey

Re: [PATCH v3 1/2] cxl: Add mechanism for delivering AFU driver specific events

2016-03-09 Thread Michael Neuling

On Wed, 2016-03-09 at 20:07 +0530, Vaibhav Jain wrote:
> Hi Ian,
> 
> Sorry for getting into this discussion late. I have few suggestions.
> 
> Ian Munsie  writes:
> > 
> > diff --git a/drivers/misc/cxl/Kconfig b/drivers/misc/cxl/Kconfig
> > index 8756d06..560412c 100644
> > --- a/drivers/misc/cxl/Kconfig
> > +++ b/drivers/misc/cxl/Kconfig
> > @@ -15,12 +15,17 @@ config CXL_EEH
> > bool
> > default n
> >  
> > +config CXL_AFU_DRIVER_OPS
> > +   bool
> > +   default n
> > +
> >  config CXL
> > tristate "Support for IBM Coherent Accelerators (CXL)"
> > depends on PPC_POWERNV && PCI_MSI && EEH
> > select CXL_BASE
> > select CXL_KERNEL_API
> > select CXL_EEH
> > +   select CXL_AFU_DRIVER_OPS
> I suggest wrapping the driver_ops struct definition and other related
> functions inside a #ifdef CONFIG_CXL_AFU_DRIVER_OPS.

These symbols are here to enable the feature in other drivers.  cxlflash
(or whoever) can put their code in via the linux-scsi tree, but that new
piece is only enabled when CXL_AFU_DRIVER_OPS is present (i.e. once this
is merged upstream).  If it's not present, their code still compiles.

Hence their code compiles in linux-scsi and our code compiles in
linux-ppc, but the full feature is only enabled once the two are
together.  We avoid a nasty dependency where linux-scsi has to pull in
linux-ppc, or vice versa, before the merge window.  Everyone works
independently and it all comes together in Linus' tree.

Eventually, when everyone has all the code merged upstream, we can
remove these config options.  Actually, we should be able to remove
CXL_KERNEL_API and CXL_EEH now!

So no, we shouldn't wrap the actual code.

Mikey
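
A minimal sketch of what the dependent driver side can look like under
this scheme.  afu_events_enabled() is a hypothetical function, not part
of cxlflash or cxl; the only name taken from the patch is the
CONFIG_CXL_AFU_DRIVER_OPS symbol.

```c
#include <stdbool.h>

/*
 * In a kernel build CONFIG_CXL_AFU_DRIVER_OPS comes from Kconfig.  To model
 * a tree where the cxl side is merged, pass -DCONFIG_CXL_AFU_DRIVER_OPS to
 * the compiler; without it this file still compiles.
 */

/* Hypothetical hook in a dependent driver. */
static bool afu_events_enabled(void)
{
#ifdef CONFIG_CXL_AFU_DRIVER_OPS
	/* cxl provides the driver-ops API: the new code path is built in. */
	return true;
#else
	/* API not in this tree: the driver builds, the feature stays off. */
	return false;
#endif
}
```

The guard lives in the code that consumes the API, which is why the cxl
side itself does not need to be wrapped.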

Re: [PATCH] cxl: Add alternate MMIO error handling

2015-08-06 Thread Michael Neuling



On Thu, 2015-07-23 at 16:43 +1000, Ian Munsie wrote:
> From: Ian Munsie 
> 
> userspace programs using cxl currently have to use two strategies for
> dealing with MMIO errors simultaneously. They have to check every read
> for a return of all Fs in case the adapter has gone away and the kernel
> has not yet noticed, and they have to deal with SIGBUS in case the
> kernel has already noticed, invalidated the mapping and marked the
> context as failed.
> 
> In order to simplify things, this patch adds an alternative approach
> where the kernel will return a page filled with Fs instead of delivering
> a SIGBUS. This allows userspace to only need to deal with one of these
> two error paths, and is intended for use in libraries that use cxl
> transparently and may not be able to safely install a signal handler.
> 
> This approach will only work if certain constraints are met. Namely, if
> the application is both reading and writing to an address in the problem
> state area it cannot assume that a non-FF read is OK, as it may just be
> reading out a value it has previously written. Further - since only one
> page is used per context a write to a given offset would be visible when
> reading the same offset from a different page in the mapping (this only
> applies within a single context, not between contexts).
> 
> An application could deal with this by e.g. making sure it also reads
> from a read-only offset after any reads to a read/write offset.
> 
> Due to these constraints, this functionality must be explicitly
> requested by userspace when starting the context by passing in the
> CXL_START_WORK_ERR_FF flag.
> 
> Signed-off-by: Ian Munsie 

Acked-by: Michael Neuling 
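
The read-only-offset workaround described in the quoted commit message
could look roughly like this from userspace.  It is a sketch under
assumptions: mmio_read_checked() is a made-up helper, regs stands for
the mmap()ed problem state area, and ro_word is assumed to index a
register that the AFU never returns as all Fs while healthy and whose
in-page offset the application never writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical checked read for a context started with
 * CXL_START_WORK_ERR_FF.  Returns false if the adapter looks gone.
 */
static bool mmio_read_checked(const volatile uint64_t *regs, size_t word,
			      size_t ro_word, uint64_t *out)
{
	uint64_t val, probe;

	assert(regs && out);

	val = regs[word];

	/*
	 * A non-FF value at a read/write offset may just be something we
	 * wrote earlier, so follow up with a read of the read-only register.
	 */
	probe = regs[ro_word];
	if (probe == UINT64_MAX)
		return false;

	*out = val;
	return true;
}
```

The extra read costs one more MMIO access per checked read, which is
the price for not installing a SIGBUS handler.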

> ---
>  drivers/misc/cxl/context.c | 14 ++
>  drivers/misc/cxl/cxl.h |  4 +++-
>  drivers/misc/cxl/file.c|  4 +++-
>  include/uapi/misc/cxl.h|  4 +++-
>  4 files changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
> index 1287148..6570f10 100644
> --- a/drivers/misc/cxl/context.c
> +++ b/drivers/misc/cxl/context.c
> @@ -126,6 +126,18 @@ static int cxl_mmap_fault(struct vm_area_struct *vma, 
> struct vm_fault *vmf)
>   if (ctx->status != STARTED) {
>   mutex_unlock(&ctx->status_mutex);
>   pr_devel("%s: Context not started, failing problem state 
> access\n", __func__);
> + if (ctx->mmio_err_ff) {
> + if (!ctx->ff_page) {
> + ctx->ff_page = alloc_page(GFP_USER);
> + if (!ctx->ff_page)
> + return VM_FAULT_OOM;
> + memset(page_address(ctx->ff_page), 0xff, 
> PAGE_SIZE);
> + }
> + get_page(ctx->ff_page);
> + vmf->page = ctx->ff_page;
> + vma->vm_page_prot = pgprot_cached(vma->vm_page_prot);
> + return 0;
> + }
>   return VM_FAULT_SIGBUS;
>   }
>  
> @@ -253,6 +265,8 @@ static void reclaim_ctx(struct rcu_head *rcu)
>   struct cxl_context *ctx = container_of(rcu, struct cxl_context, rcu);
>  
>   free_page((u64)ctx->sstp);
> + if (ctx->ff_page)
> + __free_page(ctx->ff_page);
>   ctx->sstp = NULL;
>  
>   kfree(ctx);
> diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
> index 4fd66ca..b7293a4 100644
> --- a/drivers/misc/cxl/cxl.h
> +++ b/drivers/misc/cxl/cxl.h
> @@ -34,7 +34,7 @@ extern uint cxl_verbose;
>   * Bump version each time a user API change is made, whether it is
>   * backwards compatible ot not.
>   */
> -#define CXL_API_VERSION 1
> +#define CXL_API_VERSION 2
>  #define CXL_API_VERSION_COMPATIBLE 1
>  
>  /*
> @@ -418,6 +418,8 @@ struct cxl_context {
>   /* Used to unmap any mmaps when force detaching */
>   struct address_space *mapping;
>   struct mutex mapping_lock;
> + struct page *ff_page;
> + bool mmio_err_ff;
>  
>   spinlock_t sste_lock; /* Protects segment table entries */
>   struct cxl_sste *sstp;
> diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
> index e3f4b69..34c7a5e 100644
> --- a/drivers/misc/cxl/file.c
> +++ b/drivers/misc/cxl/file.c
> @@ -179,6 +179,8 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
>   if (work.flags & CXL_START_WORK_AMR)
>   amr = work.amr & mfspr(SPRN_UAMOR);
>  
> + ctx->mmio_err_ff = !!(work.flags & CXL_START_WORK_ERR_FF);
> +
>   /*
>* We grab the PID here and not in the file open to allow for the case
>* where a process (master, some daemon, etc) has opened the chardev on
> @@ -519,7 +521,7 @@ int __init cxl_file_init(void)
>* If these change we really need to update API.  Either change some
>* flags or update API version number CXL_API_VERSION.
>*/
> - BUILD_BUG_ON(CXL_API_VERSION != 1);
> + BUILD_BUG_ON(CXL_API_VERSION != 2);
>   BUILD_BUG_ON(sizeof(struct cxl_ioctl_start_work) != 64);
>   BUILD_BUG_ON(sizeof

Re: [PATCH 1/8] misc: cxl: clean up afu_read_config()

2015-08-05 Thread Michael Neuling

On Mon, 2015-07-27 at 00:18 +0300, Vladimir Zapolskiy wrote:
> The sanity checks for overflow are not needed, because this is done on
> caller side in fs/sysfs/file.c
> 
> Signed-off-by: Vladimir Zapolskiy 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: Ian Munsie 
> Cc: Michael Neuling 

Acked-by: Michael Neuling 
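
For reference, the kind of bounds handling the commit message says the
caller already performs can be modelled as below.  This is a
hypothetical stand-alone model, not a copy of fs/sysfs/file.c;
clamp_read() and its parameter names are made up.

```c
#include <stddef.h>

/*
 * Hypothetical model of caller-side clamping before a binary attribute's
 * read handler runs.  Returns how many bytes the handler may be asked
 * for; 0 means the handler is not called at all.
 */
static size_t clamp_read(size_t attr_size, size_t off, size_t count)
{
	/* Reads starting at or past the end return nothing. */
	if (off >= attr_size)
		return 0;

	/* Trim reads that would run past the end. */
	if (count > attr_size - off)
		count = attr_size - off;

	return count;
}
```

With an 8-byte attribute, a 16-byte read at offset 4 is trimmed to 4
bytes, and a read at offset 8 is not passed to the handler at all.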

> ---
>  drivers/misc/cxl/sysfs.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/drivers/misc/cxl/sysfs.c b/drivers/misc/cxl/sysfs.c
> index 31f38bc..87cd747 100644
> --- a/drivers/misc/cxl/sysfs.c
> +++ b/drivers/misc/cxl/sysfs.c
> @@ -443,12 +443,7 @@ static ssize_t afu_read_config(struct file *filp, struct 
> kobject *kobj,
>   struct afu_config_record *cr = to_cr(kobj);
>   struct cxl_afu *afu = to_cxl_afu(container_of(kobj->parent, struct 
> device, kobj));
>  
> - u64 i, j, val, size = afu->crs_len;
> -
> - if (off > size)
> - return 0;
> - if (off + count > size)
> - count = size - off;
> + u64 i, j, val;
>  
>   for (i = 0; i < count;) {
>   val = cxl_afu_cr_read64(afu, cr->cr, off & ~0x7);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
