Re: [PATCH v2 0/3] powerpc: build out kprobes blacklist

2017-05-03 Thread Naveen N. Rao
On 2017/04/27 02:06PM, Naveen N. Rao wrote:
> v2 changes:
> - Patches 3 and 4 from the previous series have been merged.
> - Updated to no longer blacklist functions involved with stolen time
>   accounting.
> 
> v1:
> https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg117514.html
> --
> This is the second in the series of patches to build out an appropriate
> kprobes blacklist. This series blacklists system_call() and functions
> involved when handling the trap itself. Not everything is covered, but
> this is the first set of functions that I have tested with. More
> patches to follow once I expand my tests.
> 
> I have converted many labels to private ones -- these are labels that I
> felt are not necessary for reading stack traces. If any of those are
> important to have, please let me know.

Hi Anton,
Can you please take a look at this series and let me know if you have any
concerns, especially around some of the labels that I have made private?

Thanks,
Naveen



[PATCH v2] drivers:soc:fsl:qbman:qman.c: Sleep instead of stuck hacking jiffies.

2017-05-03 Thread Karim Eshapa
Avoid getting stuck busy-waiting on jiffies for a long time. Instead of
spinning for a certain number of cycles without a factor of safety, use
msleep() inside a while loop bounded by a longer delay that does account
for the factor of safety: after each short sleep we check whether the
msg has arrived and, if so, break out of the big delay loop early.

Signed-off-by: Karim Eshapa 
---
 drivers/soc/fsl/qbman/qman.c | 47 ++--
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index 6f509f6..4f99298 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -1067,32 +1067,33 @@ static irqreturn_t portal_isr(int irq, void *ptr)
 static int drain_mr_fqrni(struct qm_portal *p)
 {
const union qm_mr_entry *msg;
+   unsigned long stop;
+   unsigned int timeout = jiffies_to_msecs(1000);
 loop:
msg = qm_mr_current(p);
-   if (!msg) {
-   /*
-* if MR was full and h/w had other FQRNI entries to produce, we
-* need to allow it time to produce those entries once the
-* existing entries are consumed. A worst-case situation
-* (fully-loaded system) means h/w sequencers may have to do 3-4
-* other things before servicing the portal's MR pump, each of
-* which (if slow) may take ~50 qman cycles (which is ~200
-* processor cycles). So rounding up and then multiplying this
-* worst-case estimate by a factor of 10, just to be
-* ultra-paranoid, goes as high as 10,000 cycles. NB, we consume
-* one entry at a time, so h/w has an opportunity to produce new
-* entries well before the ring has been fully consumed, so
-* we're being *really* paranoid here.
-*/
-   u64 now, then = jiffies;
-
-   do {
-   now = jiffies;
-   } while ((then + 1) > now);
+   stop = jiffies + 1;
+   /*
+* if MR was full and h/w had other FQRNI entries to produce, we
+* need to allow it time to produce those entries once the
+* existing entries are consumed. A worst-case situation
+* (fully-loaded system) means h/w sequencers may have to do 3-4
+* other things before servicing the portal's MR pump, each of
+* which (if slow) may take ~50 qman cycles (which is ~200
+* processor cycles). So rounding up and then multiplying this
+* worst-case estimate by a factor of 10, just to be
+* ultra-paranoid, goes as high as 10,000 cycles. NB, we consume
+* one entry at a time, so h/w has an opportunity to produce new
+* entries well before the ring has been fully consumed, so
+* we're being *really* paranoid here.
+*/
+   do {
+   if (msg)
+   break;
+   msleep(timeout);
msg = qm_mr_current(p);
-   if (!msg)
-   return 0;
-   }
+   } while (time_before(jiffies, stop));
+   if (!msg)
+   return 0;
if ((msg->verb & QM_MR_VERB_TYPE_MASK) != QM_MR_VERB_FQRNI) {
/* We aren't draining anything but FQRNIs */
pr_err("Found verb 0x%x in MR\n", msg->verb);
-- 
2.7.4
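As an aside, the canonical kernel idiom for this kind of sleep-based
polling builds the deadline with msecs_to_jiffies() -- the inverse of
the jiffies_to_msecs() conversion used in the patch above. A minimal
sketch, reusing qm_portal/qm_mr_current() from the patch and an
arbitrary 1000ms budget (illustrative only, not the submitted code):

#include <linux/jiffies.h>
#include <linux/delay.h>
#include <linux/errno.h>

static int wait_for_mr_entry(struct qm_portal *p)
{
	unsigned long deadline = jiffies + msecs_to_jiffies(1000);

	while (!qm_mr_current(p)) {
		if (time_after(jiffies, deadline))
			return -ETIMEDOUT;
		msleep(1);	/* sleep instead of spinning on jiffies */
	}
	return 0;
}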


[PATCH 8/8] powerpc/xmon: Disable function_graph tracing while in xmon

2017-05-03 Thread Naveen N. Rao
HAVE_FUNCTION_GRAPH_FP_TEST reveals another area (apart from jprobes)
that conflicts with the function_graph tracer: xmon. This is due to the
use of longjmp() in various places in xmon.

To address this, pause function_graph tracing while in xmon.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/xmon/xmon.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index da5619847517..22db68347ff6 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -459,6 +460,7 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
 
local_irq_save(flags);
hard_irq_disable();
+   pause_graph_tracing();
 
bp = in_breakpoint_table(regs->nip, );
if (bp != NULL) {
@@ -655,6 +657,7 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
 
touch_nmi_watchdog();
local_irq_restore(flags);
+   unpause_graph_tracing();
 
return cmd != 'X' && cmd != EOF;
 }
-- 
2.12.2



[PATCH 6/8] powerpc/ftrace: Add support for HAVE_FUNCTION_GRAPH_FP_TEST for -mprofile-kernel

2017-05-03 Thread Naveen N. Rao
This is very handy to catch potential crashes due to unexpected
interactions of the function_graph tracer with weird things like
jprobes.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/asm-prototypes.h  | 3 ++-
 arch/powerpc/include/asm/ftrace.h  | 3 +++
 arch/powerpc/kernel/trace/ftrace.c | 4 ++--
 arch/powerpc/kernel/trace/ftrace_64.S  | 1 +
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 1 +
 5 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index 7330150bfe34..14964ab80a53 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -124,6 +124,7 @@ extern int __ucmpdi2(u64, u64);
 
 /* tracing */
 void _mcount(void);
-unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip);
+unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip,
+   unsigned long fp);
 
 #endif /* _ASM_POWERPC_ASM_PROTOTYPES_H */
diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h
index 686c5f70eb84..2b89043b0c61 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -6,6 +6,9 @@
 #ifdef CONFIG_FUNCTION_TRACER
 #define MCOUNT_ADDR((unsigned long)(_mcount))
 #define MCOUNT_INSN_SIZE   4 /* sizeof mcount call */
+#ifdef CONFIG_MPROFILE_KERNEL
+#define HAVE_FUNCTION_GRAPH_FP_TEST
+#endif
 
 #ifdef __ASSEMBLY__
 
diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index 32509de6ce4c..7e3e099cdfe4 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -572,7 +572,7 @@ int ftrace_disable_ftrace_graph_caller(void)
  * Hook the return address and push it in the stack of return addrs
  * in current thread info. Return the address we want to divert to.
  */
-unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip)
+unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip, unsigned long fp)
 {
struct ftrace_graph_ent trace;
unsigned long return_hooker;
@@ -592,7 +592,7 @@ unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip)
if (!ftrace_graph_entry())
goto out;
 
-   if (ftrace_push_return_trace(parent, ip, , 0,
+   if (ftrace_push_return_trace(parent, ip, , fp,
 NULL) == -EBUSY)
goto out;
 
diff --git a/arch/powerpc/kernel/trace/ftrace_64.S b/arch/powerpc/kernel/trace/ftrace_64.S
index e5ccea19821e..57ab4f73a5c4 100644
--- a/arch/powerpc/kernel/trace/ftrace_64.S
+++ b/arch/powerpc/kernel/trace/ftrace_64.S
@@ -68,6 +68,7 @@ _GLOBAL(return_to_handler)
 */
ld  r2, PACATOC(r13)
 
+   addi    r3, r1, 112
bl  ftrace_return_to_handler
nop
 
diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index 981b99dc3029..fb6910e48f22 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -241,6 +241,7 @@ _GLOBAL(ftrace_graph_caller)
 
ld  r3, _LINK(r1)
	subi    r4, r14, MCOUNT_INSN_SIZE   /* load saved original NIP */
+   addi    r5, r1, SWITCH_FRAME_SIZE
 
bl  prepare_ftrace_return
nop
-- 
2.12.2
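For readers new to HAVE_FUNCTION_GRAPH_FP_TEST, a conceptual sketch of
the check it enables, using simplified stand-in names (the real test
lives in the generic function_graph code): the stack pointer recorded
when the return address is hooked at entry must match the one seen when
the hook fires at return, so anything that skips the normal return path
(longjmp(), jprobe_return()) is caught instead of crashing:

struct ret_entry {
	unsigned long ret;	/* real return address saved at entry */
	unsigned long fp;	/* stack pointer recorded at entry */
};

/* entry side: prepare_ftrace_return() records fp with the hook */
static void fgraph_push(struct ret_entry *e, unsigned long parent,
			unsigned long fp)
{
	e->ret = parent;
	e->fp = fp;
}

/* exit side: return_to_handler passes the current fp back in */
static unsigned long fgraph_pop(struct ret_entry *e, unsigned long fp)
{
	if (e->fp != fp) {
		/* real code: ftrace_graph_stop() + WARN("Bad frame pointer") */
		return 0;
	}
	return e->ret;
}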



[PATCH 5/8] powerpc/ftrace: Eliminate duplicate stack setup for ftrace_graph_caller()

2017-05-03 Thread Naveen N. Rao
Currently, ftrace_graph_caller() sets up a separate stack frame to save
some state before calling into prepare_ftrace_return(). Eliminate this
by calling into ftrace_graph_caller() right after the ftrace handler.

We make a few changes to accommodate this:
1. Currently, due to where it is present, ftrace_graph_caller() is not
invoked if the NIP is modified by the ftrace handler, indicating either
a livepatched function or a jprobe'd function. To achieve the same
behavior, we now save/compare against original NIP if function graph is
enabled, in addition to CONFIG_LIVEPATCH.

2. We use r14 for saving the original NIP and r15 for storing the
possibly modified NIP. r15 is later used to determine if the function
has been livepatched.

3. To re-use the same stack frame setup/teardown code, we have
ftrace_graph_caller() save the modified LR in pt_regs.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 70 --
 1 file changed, 21 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index b1bad68ea6db..981b99dc3029 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -72,9 +72,10 @@ _GLOBAL(ftrace_caller)
	addi    r3,r3,function_trace_op@toc@l
ld  r5,0(r3)
 
-#ifdef CONFIG_LIVEPATCH
+#if defined(CONFIG_FUNCTION_GRAPH_TRACER) || defined(CONFIG_LIVEPATCH)
mr  r14,r7  /* remember old NIP */
 #endif
+
/* Calculate ip from nip-4 into r3 for call below */
	subi    r3, r7, MCOUNT_INSN_SIZE
 
@@ -97,10 +98,19 @@ ftrace_call:
nop
 
/* Load ctr with the possibly modified NIP */
-   ld  r3, _NIP(r1)
-   mtctr   r3
+   ld  r15, _NIP(r1)
+
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+.globl ftrace_graph_call
+ftrace_graph_call:
+   b   ftrace_graph_stub
+_GLOBAL(ftrace_graph_stub)
+#endif
+
+   mtctr   r15
+
 #ifdef CONFIG_LIVEPATCH
-   cmpd    r14,r3  /* has NIP been altered? */
+   cmpd    r14, r15    /* has NIP been altered? */
 #endif
 
/* Restore gprs */
@@ -124,13 +134,6 @@ ftrace_call:
	bne-    livepatch_handler
 #endif
 
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-.globl ftrace_graph_call
-ftrace_graph_call:
-   b   ftrace_graph_stub
-_GLOBAL(ftrace_graph_stub)
-#endif
-
	bctr    /* jump after _mcount site */
 
 _GLOBAL(ftrace_stub)
@@ -233,50 +236,19 @@ livepatch_handler:
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 _GLOBAL(ftrace_graph_caller)
-   stdu    r1, -112(r1)
-   /* with -mprofile-kernel, parameter regs are still alive at _mcount */
-   std r10, 104(r1)
-   std r9, 96(r1)
-   std r8, 88(r1)
-   std r7, 80(r1)
-   std r6, 72(r1)
-   std r5, 64(r1)
-   std r4, 56(r1)
-   std r3, 48(r1)
+   cmpd    r14, r15    /* has NIP been altered? */
+   bne-    ftrace_graph_stub
 
-   /* Save callee's TOC in the ABI compliant location */
-   std r2, 24(r1)
-   ld  r2, PACATOC(r13)/* get kernel TOC in r2 */
-
-   mfctr   r4  /* ftrace_caller has moved local addr here */
-   std r4, 40(r1)
-   mflr    r3  /* ftrace_caller has restored LR from stack */
-   subi    r4, r4, MCOUNT_INSN_SIZE
+   ld  r3, _LINK(r1)
+   subi    r4, r14, MCOUNT_INSN_SIZE   /* load saved original NIP */
 
bl  prepare_ftrace_return
nop
 
/*
 * prepare_ftrace_return gives us the address we divert to.
-* Change the LR to this.
+* We save it in pt_regs to be used when we go back.
 */
-   mtlr    r3
-
-   ld  r0, 40(r1)
-   mtctr   r0
-   ld  r10, 104(r1)
-   ld  r9, 96(r1)
-   ld  r8, 88(r1)
-   ld  r7, 80(r1)
-   ld  r6, 72(r1)
-   ld  r5, 64(r1)
-   ld  r4, 56(r1)
-   ld  r3, 48(r1)
-
-   /* Restore callee's TOC */
-   ld  r2, 24(r1)
-
-   addi    r1, r1, 112
-   mflr    r0
-   bctr
+   std r3, _LINK(r1)
+   b   ftrace_graph_stub
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
-- 
2.12.2



[PATCH 7/8] powerpc/livepatch: Clarify location of mcount call site

2017-05-03 Thread Naveen N. Rao
The existing comment about the location of the mcount() call site in a
function is a bit confusing.

Depending on the gcc version, the mcount() call can either be the fourth
(gcc v6 and later) or the fifth instruction (gcc v5 and earlier) into a
function. So, the mcount call is actually within the first _20_ bytes of
a function.

However, ftrace_location_range() does an inclusive search and hence
passing (addr + 16) is still accurate.

Clarify the same by updating comments around this.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/livepatch.h | 4 ++--
 arch/powerpc/kernel/kprobes.c| 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/livepatch.h b/arch/powerpc/include/asm/livepatch.h
index 47a03b9b528b..62f98d977f59 100644
--- a/arch/powerpc/include/asm/livepatch.h
+++ b/arch/powerpc/include/asm/livepatch.h
@@ -37,8 +37,8 @@ static inline void klp_arch_set_pc(struct pt_regs *regs, unsigned long ip)
 static inline unsigned long klp_get_ftrace_location(unsigned long faddr)
 {
/*
-* Live patch works only with -mprofile-kernel on PPC. In this case,
-* the ftrace location is always within the first 16 bytes.
+* Live patch only works with -mprofile-kernel on PPC. In this case, the
+* ftrace location always starts within the first 16 bytes (inclusive).
 */
return ftrace_location_range(faddr, faddr + 16);
 }
diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 562d18f456d7..4398ea60b4e0 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -62,8 +62,8 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset)
 #ifdef CONFIG_KPROBES_ON_FTRACE
unsigned long faddr;
/*
-* Per livepatch.h, ftrace location is always within the first
-* 16 bytes of a function on powerpc with -mprofile-kernel.
+* Per livepatch.h, ftrace location always starts within the first
+* 16 bytes (inclusive) of a function with -mprofile-kernel.
 */
faddr = ftrace_location_range((unsigned long)addr,
  (unsigned long)addr + 16);
-- 
2.12.2
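To make the arithmetic above concrete, with 4-byte powerpc instructions:

	offset  0: insn 1
	offset  4: insn 2
	offset  8: insn 3
	offset 12: insn 4	<- mcount call, gcc v6 and later
	offset 16: insn 5	<- mcount call, gcc v5 and earlier

The fifth instruction starts at byte offset 16, i.e. within the first
20 bytes of the function, and ftrace_location_range(faddr, faddr + 16)
still finds it because the search range is inclusive of both endpoints.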



[PATCH 3/8] powerpc/ftrace: Remove redundant saving of LR in ftrace[_graph]_caller

2017-05-03 Thread Naveen N. Rao
For -mprofile-kernel, there is no need to save LR into the ABI save
location on entry into ftrace_caller(). It is sufficient to have the
(possibly updated) return address in LR (and r0?). On return from
ftrace, this value is stored in the ABI save location if necessary.

Furthermore, we can also remove the redundant saving of LR in
ftrace_graph_caller() for similar reasons. It is sufficient to ensure
LR and r0 point to the new return address.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index fa0921410fa4..d8d75f4eb853 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -38,9 +38,6 @@
  * and then arrange for the ftrace function to be called.
  */
 _GLOBAL(ftrace_caller)
-   /* Save the original return address in A's stack frame */
-   std r0,LRSAVE(r1)
-
/* Create our stack frame + pt_regs */
	stdu    r1,-SWITCH_FRAME_SIZE(r1)
 
@@ -271,6 +268,5 @@ _GLOBAL(ftrace_graph_caller)
 
	addi    r1, r1, 112
	mflr    r0
-   std r0, LRSAVE(r1)
bctr
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
-- 
2.12.2



[PATCH 4/8] powerpc/kprobes_on_ftrace: Skip livepatch_handler() for jprobes

2017-05-03 Thread Naveen N. Rao
ftrace_caller() depends on a modified regs->nip to detect if a certain
function has been livepatched. However, with KPROBES_ON_FTRACE, it is
possible for regs->nip to have been modified by the jprobes pre_handler.
In this case, we do not want to invoke the livepatch_handler so as not
to consume the livepatch stack.

To distinguish between the two (jprobes and livepatch), we compare the
returned NIP with R12. The jprobes setjmp_pre_handler() sets up both NIP
and R12 to the global entry point of the jprobes hook, while the
livepatch path at this stage only sets up NIP (R12 is set up later, in
livepatch_handler()).

So, if NIP == R12, we know we came here due to jprobes and we just
branch to the new IP. Otherwise, we continue with livepatch processing
as usual.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index d8d75f4eb853..b1bad68ea6db 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -153,8 +153,18 @@ _GLOBAL(ftrace_stub)
 *
 * r0 can't be used as the base register for a DS-form load or store, so
 * we temporarily shuffle r1 (stack pointer) into r0 and then put it back.
+*
+* But, before we can do all that, we first need to confirm that we are
+* indeed here due to livepatch and not due to jprobes. The difference
+* between the two handlers is that jprobes additionally sets up r12.
+* So, we can compare nip with r12 as a test to determine if we are here
+* due to livepatch or due to jprobes.
 */
 livepatch_handler:
+   mfctr   r0
	cmpd    r0, r12
+   beqctr
+
CURRENT_THREAD_INFO(r12, r1)
 
/* Save stack pointer into r0 */
-- 
2.12.2



[PATCH 2/8] powerpc/ftrace: Pass the correct stack pointer for DYNAMIC_FTRACE_WITH_REGS

2017-05-03 Thread Naveen N. Rao
For DYNAMIC_FTRACE_WITH_REGS, we should be passing in the original set
of registers in pt_regs, to capture the state _before_ ftrace_caller.
However, we are instead passing the stack pointer *after* allocating a
stack frame in ftrace_caller. Fix this by saving the proper value of r1
in pt_regs. Also, use SAVE_10GPRS() to simplify the code.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index 7c933a99f5d5..fa0921410fa4 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -45,10 +45,14 @@ _GLOBAL(ftrace_caller)
	stdu    r1,-SWITCH_FRAME_SIZE(r1)
 
/* Save all gprs to pt_regs */
-   SAVE_8GPRS(0,r1)
-   SAVE_8GPRS(8,r1)
-   SAVE_8GPRS(16,r1)
-   SAVE_8GPRS(24,r1)
+   SAVE_GPR(0, r1)
+   SAVE_10GPRS(2, r1)
+   SAVE_10GPRS(12, r1)
+   SAVE_10GPRS(22, r1)
+
+   /* Save previous stack pointer (r1) */
+   addi    r8, r1, SWITCH_FRAME_SIZE
+   std r8, GPR1(r1)
 
/* Load special regs for save below */
mfmsr   r8
@@ -103,10 +107,10 @@ ftrace_call:
 #endif
 
/* Restore gprs */
-   REST_8GPRS(0,r1)
-   REST_8GPRS(8,r1)
-   REST_8GPRS(16,r1)
-   REST_8GPRS(24,r1)
+   REST_GPR(0,r1)
+   REST_10GPRS(2,r1)
+   REST_10GPRS(12,r1)
+   REST_10GPRS(22,r1)
 
/* Restore possibly modified LR */
ld  r0, _LINK(r1)
-- 
2.12.2



[PATCH 1/8] powerpc/kprobes: Pause function_graph tracing during jprobes handling

2017-05-03 Thread Naveen N. Rao
This fixes a crash when function_graph and jprobes are used together.
This is essentially commit 237d28db036e ("ftrace/jprobes/x86: Fix
conflict between jprobes and function graph tracing"), but for powerpc.

Jprobes breaks function_graph tracing since the jprobe hook needs to use
jprobe_return(), which never returns back to the hook, but instead to
the original jprobe'd function. The solution is to momentarily pause
function_graph tracing before invoking the jprobe hook and re-enable it
when returning back to the original jprobe'd function.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/kprobes.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 255d28d31ca1..562d18f456d7 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -609,6 +609,15 @@ int setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs)
regs->gpr[2] = (unsigned long)(((func_descr_t *)jp->entry)->toc);
 #endif
 
+   /*
+* jprobes use jprobe_return() which skips the normal return
+* path of the function, and this messes up the accounting of the
+* function graph tracer.
+*
+* Pause function graph tracing while performing the jprobe function.
+*/
+   pause_graph_tracing();
+
return 1;
 }
 NOKPROBE_SYMBOL(setjmp_pre_handler);
@@ -634,6 +643,8 @@ int longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
 * saved regs...
 */
memcpy(regs, >jprobe_saved_regs, sizeof(struct pt_regs));
+   /* It's OK to start function graph tracing again */
+   unpause_graph_tracing();
preempt_enable_no_resched();
return 1;
 }
-- 
2.12.2
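For context, a minimal jprobe module in the style of
samples/kprobes/jprobe_example.c (the probed symbol and handler body
here are illustrative only). Note how the hook must end in
jprobe_return(), which is why control never takes the normal return
path that function_graph accounting expects:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>

/* Mirrors the prototype of the probed function */
static long jdo_fork(unsigned long clone_flags, unsigned long stack_start,
		     unsigned long stack_size, int __user *parent_tidptr,
		     int __user *child_tidptr)
{
	pr_info("jprobe: clone_flags = 0x%lx\n", clone_flags);

	/*
	 * jprobe_return() restores the saved registers and resumes the
	 * original function, so execution never falls out of this hook.
	 */
	jprobe_return();
	return 0;
}

static struct jprobe my_jprobe = {
	.entry = jdo_fork,
	.kp = { .symbol_name = "do_fork" },
};

static int __init jprobe_init(void)
{
	return register_jprobe(&my_jprobe);
}

static void __exit jprobe_exit(void)
{
	unregister_jprobe(&my_jprobe);
}

module_init(jprobe_init);
module_exit(jprobe_exit);
MODULE_LICENSE("GPL");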



[PATCH 0/8] powerpc: Various fixes and enhancements for kprobes and ftrace

2017-05-03 Thread Naveen N. Rao
[Reposting this as I missed copying the distribution list in my
previous posting. For those who were copied previously, sorry for the spam]

This series initially started out as a fix for a single bug I found:
kernel Oops when jprobes is used with the function_graph tracer. Though
that particular bug turned out to be a solved problem on other
architectures (as seen in patch 1), my investigation revealed quite a
few other bugs as well as scope for optimizing our ftrace handlers.

This series includes all the fixes and enhancements I have made so far.
There are a few more bugs and fixes that I will be coding up and sending
across in a day or two.

This series has been run through ftrace selftests.

- Naveen

Naveen N. Rao (8):
  powerpc/kprobes: Pause function_graph tracing during jprobes handling
  powerpc/ftrace: Pass the correct stack pointer for
DYNAMIC_FTRACE_WITH_REGS
  powerpc/ftrace: Remove redundant saving of LR in ftrace[_graph]_caller
  powerpc/kprobes_on_ftrace: Skip livepatch_handler() for jprobes
  powerpc/ftrace: Eliminate duplicate stack setup for
ftrace_graph_caller()
  powerpc/ftrace: Add support for HAVE_FUNCTION_GRAPH_FP_TEST for
-mprofile-kernel
  powerpc/livepatch: Clarify location of mcount call site
  powerpc/xmon: Disable function_graph tracing while in xmon

 arch/powerpc/include/asm/asm-prototypes.h  |   3 +-
 arch/powerpc/include/asm/ftrace.h  |   3 +
 arch/powerpc/include/asm/livepatch.h   |   4 +-
 arch/powerpc/kernel/kprobes.c  |  15 +++-
 arch/powerpc/kernel/trace/ftrace.c |   4 +-
 arch/powerpc/kernel/trace/ftrace_64.S  |   1 +
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 105 +++--
 arch/powerpc/xmon/xmon.c   |   3 +
 8 files changed, 70 insertions(+), 68 deletions(-)

-- 
2.12.2



Re: [PATCH 1/8] powerpc/kprobes: Pause function_graph tracing during jprobes handling

2017-05-03 Thread Naveen N. Rao
[Copying linuxppc-dev list which I missed cc'ing initially]

On 2017/05/03 03:58PM, Steven Rostedt wrote:
> On Wed,  3 May 2017 23:43:41 +0530
> "Naveen N. Rao"  wrote:
> 
> > This fixes a crash when function_graph and jprobes are used together.
> > This is essentially commit 237d28db036e ("ftrace/jprobes/x86: Fix
> > conflict between jprobes and function graph tracing"), but for powerpc.
> > 
> > Jprobes breaks function_graph tracing since the jprobe hook needs to use
> > jprobe_return(), which never returns back to the hook, but instead to
> > the original jprobe'd function. The solution is to momentarily pause
> > function_graph tracing before invoking the jprobe hook and re-enable it
> > when returning back to the original jprobe'd function.
> 
> I wonder if any of this code could be made arch agnostic?

I don't see a way to do that as the jprobe handlers are all 
arch-specific.

> 
> Anyway, Acked-by: Steven Rostedt (VMware) 

Thanks,
- Naveen



Re: powerpc/mm/radix: Drop support for CPUs without lockless tlbie

2017-05-03 Thread Michael Ellerman
On Wed, 2017-05-03 at 04:44:59 UTC, Michael Ellerman wrote:
> Currently the radix TLB code includes support for CPUs that do *not*
> have MMU_FTR_LOCKLESS_TLBIE. On those CPUs we are required to take a
> global spinlock before issuing a tlbie.
> 
> Radix can only be built for 64-bit Book3s CPUs, and of those, only
> POWER4, 970, Cell and PA6T do not have MMU_FTR_LOCKLESS_TLBIE. Although
> it's possible to build a kernel with Radix support that can also boot on
> those CPUs, we happen to know that in reality none of those CPUs support
> the Radix MMU, so the code can never actually run on those CPUs.
> 
> So remove the native_tlbie_lock in the Radix TLB code.
> 
> Note that there is another lock of the same name in the hash code, which
> is unaffected by this patch.
> 
> Signed-off-by: Michael Ellerman 
> Reviewed-by: Nicholas Piggin 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/3c9ac2bcc35453141f82461c71ed10

cheers


Re: powerpc/powernv: Fix TCE kill on NVLink2

2017-05-03 Thread Michael Ellerman
On Wed, 2017-05-03 at 03:24:08 UTC, Alistair Popple wrote:
> Commit 616badd2fb49 ("powerpc/powernv: Use OPAL call for TCE kill on
> NVLink2") forced all TCE kills to go via the OPAL call for
> NVLink2. However the PHB3 implementation of TCE kill was still being
> called directly from some functions which in some circumstances caused
> a machine check.
> 
> This patch adds an equivalent IODA2 version of the function which uses
> the correct invalidation method depending on PHB model and changes all
> external callers to use it instead.
> 
> Fixes: 616badd2fb49 ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2")
> Signed-off-by: Alistair Popple 
> Cc: sta...@vger.kernel.org

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/6b3d12a948d27977816a15eb48409a

cheers


Re: [v4] cxl: mask slice error interrupts after first occurrence

2017-05-03 Thread Michael Ellerman
On Mon, 2017-05-01 at 00:53:31 UTC, Alastair D'Silva wrote:
> From: Alastair D'Silva 
> 
> In some situations, a faulty AFU slice may create an interrupt storm of
> slice errors, rendering the machine unusable. Since these interrupts are
> informational only, present the interrupt once, then mask it off to
> prevent it from being retriggered until the AFU is reset.
> 
> Signed-off-by: Alastair D'Silva 
> Reviewed-by: Andrew Donnellan 
> Reviewed-by: Vaibhav Jain 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a715626a8e904e7226915d1bc48853

cheers


Re: [PATCH-RESEND] cxl: Route eeh events to all drivers in cxl_pci_error_detected()

2017-05-03 Thread Michael Ellerman
On Thu, 2017-04-27 at 05:28:22 UTC, Vaibhav Jain wrote:
> Fix a boundary condition where in some cases an eeh event that results
> in card reset isn't passed on to a driver attached to the virtual PCI
> device associated with a slice. This will happen in case when a slice
> attached device driver returns a value other than
> PCI_ERS_RESULT_NEED_RESET from the eeh error_detected() callback. This
> would result in an early return from cxl_pci_error_detected() and
> other drivers attached to other AFUs on the card won't be notified.
> 
> The patch fixes this by making sure that all slice attached
> device-drivers are notified and the return values from
> error_detected() callback are aggregated in a scheme where request for
> 'disconnect' trumps all and 'none' trumps 'need_reset'.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 9e8df8a21963 ("cxl: EEH support")
> Based-on: https://patchwork.ozlabs.org/patch/755799/
> Signed-off-by: Vaibhav Jain 
> Reviewed-by: Andrew Donnellan 
> Acked-by: Frederic Barrat 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/4f58f0bf155e87dda31a3088b1e107

cheers


Re: [PATCH-RESEND,v4] cxl: Force context lock during EEH flow

2017-05-03 Thread Michael Ellerman
On Thu, 2017-04-27 at 05:23:25 UTC, Vaibhav Jain wrote:
> During an eeh event when the cxl card is fenced and card sysfs attr
> perst_reloads_same_image is set, the following warning message is seen in the
> kernel logs:
> 
>  [   60.622727] Adapter context unlocked with 0 active contexts
>  [   60.622762] [ cut here ]
>  [   60.622771] WARNING: CPU: 12 PID: 627 at
>  ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl]
> 
> Even though this warning is harmless, it clutters the kernel log
> during an eeh event. This warning is triggered as the EEH callback
> cxl_pci_error_detected doesn't obtain a context-lock before forcibly
> detaching all active context and when context-lock is released during
> call to cxl_configure_adapter from cxl_pci_slot_reset, a warning in
> cxl_adapter_context_unlock is triggered.
> 
> To fix this warning, we acquire the adapter context-lock via
> cxl_adapter_context_lock() in the eeh callback
> cxl_pci_error_detected() once all the virtual AFU PHBs are notified
> and their contexts detached. The context-lock is released in
> cxl_pci_slot_reset() after the adapter is successfully reconfigured
> and before we call the slot_reset callback on slice attached
> device-drivers.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 70b565bbdb91 ("cxl: Prevent adapter reset if an active context exists")
> Reported-by: Andrew Donnellan 
> Signed-off-by: Vaibhav Jain 
> Acked-by: Frederic Barrat 
> Reviewed-by: Matthew R. Ochs 
> Tested-by: Uma Krishnan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/ea9a26d117cf0637c71d3e0076f4a1

cheers


Re: powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body

2017-05-03 Thread Michael Ellerman
On Mon, 2017-04-24 at 10:46:59 UTC, Michael Ellerman wrote:
> The entire body of unregister_cpu_online() is inside an #ifdef
> CONFIG_HOTPLUG_CPU block. This is ugly and means we create an empty function
> when hotplug is disabled for no reason.
> 
> Instead move the #ifdef out of the function body and define the function to be
> NULL in the else case. This means we'll pass NULL to cpuhp_setup_state(), but
> that's fine because it accepts NULL to mean there is no teardown callback, 
> which
> is exactly what we want.
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/3f2290e1b5192fdfa74f012220a8d9

cheers


Re: [v3, 2/2] powerpc/eeh: Clean up and document event handling functions

2017-05-03 Thread Michael Ellerman
On Wed, 2017-04-19 at 07:39:27 UTC, Russell Currey wrote:
> Remove unnecessary tags in eeh_handle_normal_event(), and add function
> comments for eeh_handle_normal_event() and eeh_handle_special_event().
> 
> The only functional difference is that in the case of a PE reaching the
> maximum number of failures, rather than one message telling you of this
> and suggesting you reseat the device, there are two separate messages.
> 
> Suggested-by: Alexey Kardashevskiy 
> Signed-off-by: Russell Currey 
> Reviewed-by: Andrew Donnellan 
> Reviewed-by: Gavin Shan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c0b64978f09195e00d6649ca0ad024

cheers


Re: [v3, 1/2] powerpc/eeh: Avoid use after free in eeh_handle_special_event()

2017-05-03 Thread Michael Ellerman
On Wed, 2017-04-19 at 07:39:26 UTC, Russell Currey wrote:
> eeh_handle_special_event() is called when an EEH event is detected but
> can't be narrowed down to a specific PE.  This function looks through
> every PE to find one in an erroneous state, then calls the regular event
> handler eeh_handle_normal_event() once it knows which PE has an error.
> 
> However, if eeh_handle_normal_event() found that the PE cannot possibly
> be recovered, it will free it, rendering the passed PE stale.
> This leads to a use after free in eeh_handle_special_event() as it attempts to
> clear the "recovering" state on the PE after eeh_handle_normal_event() 
> returns.
> 
> Thus, make sure the PE is valid when attempting to clear state in
> eeh_handle_special_event().
> 
> Cc:  #3.10+
> Reported-by: Alexey Kardashevskiy 
> Signed-off-by: Russell Currey 
> Reviewed-by: Gavin Shan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/daeba2956f32f91f3493788ff6ee02

cheers


Re: [2/4] powerpc/64s: POWER9 no LPCR VRMASD bits

2017-05-03 Thread Michael Ellerman
On Tue, 2017-04-18 at 19:12:17 UTC, Nicholas Piggin wrote:
> POWER9/ISAv3 has no VRMASD field in LPCR. Don't set reserved bits.
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/700b7eadd5625d22b8235fb21259b3

cheers


Re: [v2] powerpc/book3s: mce: Move add_taint() later in virtual mode.

2017-05-03 Thread Michael Ellerman
On Tue, 2017-04-18 at 16:38:17 UTC, Mahesh Salgaonkar wrote:
> From: Mahesh Salgaonkar 
> 
> machine_check_early() gets called in real mode. The very first time when
> add_taint() is called, it prints a warning which ends up calling opal
> call (that uses OPAL_CALL wrapper) for writing it to console. If we get a
> very first machine check while we are in opal we are doomed. OPAL_CALL
> overwrites the PACASAVEDMSR in r13 and in this case when we are done with
> MCE handling the original opal call will use this new MSR on it's way
> back to opal_return. This usually leads to unexpected behaviour or a kernel
> panic. Instead move the add_taint() call later in the virtual mode
> where it is safe to call.
> 
> This is broken with current FW level. We got lucky so far for not getting
> very first MCE hit while in OPAL. But easily reproducible on Mambo.
> This should go to stable.
> 
> Fixes: 27ea2c420cad ("powerpc: Set the correct kernel taint on machine check errors")
> Signed-off-by: Mahesh Salgaonkar 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d93b0ac01a9ce276ec39644be47001

cheers


Re: powerpc: Teach oops about radix vectors

2017-05-03 Thread Michael Ellerman
On Thu, 2017-03-16 at 03:04:40 UTC, Michael Neuling wrote:
> Currently if we oops caused by an 0x380 or 0x480 exception, we get a
> print which assumes SLB problems. With radix, these vectors have
> different meanings.
> 
> This patch updates the oops message to reflect these different
> meanings.
> 
> Signed-off-by: Michael Neuling 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8915bcd68b2547c819b8a2a33c0981

cheers


Re: powerpc/mpc52xx: Don't select user-visible RTAS_PROC

2017-05-03 Thread Michael Ellerman
On Fri, 2017-02-17 at 06:31:22 UTC, Michael Ellerman wrote:
> Otherwise we might select it when its dependencies aren't enabled,
> leading to a build break.
> 
> It's default y anyway, so will be on unless someone disables it
> manually.
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/0cc68bfad45b47167c9d99cf560b50

cheers


Re: powerpc: Document irq enable/disable toggle in migrate_irqs()

2017-05-03 Thread Michael Ellerman
On Wed, 2017-02-15 at 09:49:54 UTC, Michael Ellerman wrote:
> This code is undocumented and unclear, and we've already had one patch
> sent to remove it because it's "paradoxical and unnecessary". So
> document it to save our future selves from puzzling over it.
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/687b8f24f14db842c8c3f8cb8b24c9

cheers


Re: [7/7] powerpc/64: allow CONFIG_RELOCATABLE if COMPILE_TEST

2017-05-03 Thread Michael Ellerman
On Wed, 2016-10-19 at 03:16:00 UTC, Nicholas Piggin wrote:
> This is cruft to work around allmodconfig build breakage.
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/084a275e4c9477c432b05e872c3a29

cheers


Re: powerpc/powernv: document cxl dependency on special case in pnv_eeh_reset()

2017-05-03 Thread Michael Ellerman
On Fri, 2016-07-22 at 07:16:35 UTC, Andrew Donnellan wrote:
> pnv_eeh_reset() has special handling for PEs whose primary bus is the root
> bus or the bus immediately underneath the root port.
> 
> The cxl bi-modal card support added in b0b5e5918ad1 ("cxl: Add
> cxl_check_and_switch_mode() API to switch bi-modal cards") relies on this
> behaviour when hot-resetting the CAPI adapter following a mode switch.
> Document this in pnv_eeh_reset() so we don't accidentally break it.
> 
> Suggested-by: Gavin Shan 
> Signed-off-by: Andrew Donnellan 
> Reviewed-by: Gavin Shan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/b7da1230532984fc00962b2c8be94a

cheers


[PATCH v5 2/2] powerpc/fadump: update documentation about 'fadump_append=' parameter

2017-05-03 Thread Hari Bathini
With the introduction of the 'fadump_append=' parameter to pass additional
parameters to the fadump (capture) kernel, update the documentation accordingly.

Signed-off-by: Hari Bathini 
---

Changes from v4:
* Based on top of patchset that includes
  
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=akpm=05f383cdfba8793240e73f9a9fbff4e25d66003f


 Documentation/powerpc/firmware-assisted-dump.txt |   10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
index 8394bc8..6327193 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -162,7 +162,15 @@ How to enable firmware-assisted dump (fadump):
 
 1. Set config option CONFIG_FA_DUMP=y and build kernel.
 2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
-3. Optionally, user can also set 'crashkernel=' kernel cmdline
+3. A user can pass additional command line parameters as a comma
+   separated list through 'fadump_append=' parameter, to be enforced
+   when fadump is active. For example, if parameters like nr_cpus=1,
+   numa=off & udev.children-max=2 are to be enforced when fadump is
+   active, 'fadump_append=nr_cpus=1,numa=off,udev.children-max=2'
+   can be passed in command line, which will be replaced with
+   "nr_cpus=1 numa=off udev.children-max=2" when fadump is active.
+   This helps in reducing memory consumption during dump capture.
+4. Optionally, user can also set 'crashkernel=' kernel cmdline
to specify size of the memory to reserve for boot memory dump
preservation.
 



[PATCH v5 1/2] powerpc/fadump: reduce memory consumption for capture kernel

2017-05-03 Thread Hari Bathini
With the fadump (dump capture) kernel booting like a regular kernel, it almost
needs the same amount of memory to boot as the production kernel, which is
unwarranted for a dump capture kernel. But with no option to disable some
of the unnecessary subsystems in the fadump kernel, that much memory is wasted
on fadump, depriving the production kernel of that memory.

Introduce kernel parameter 'fadump_append=' that would take regular kernel
parameters as a comma separated list, to be enforced when fadump is active.
This 'fadump_append=' parameter can be leveraged to pass parameters like
nr_cpus=1, cgroup_disable=memory and numa=off, to disable unwarranted
resources/subsystems.

Also, ensure the log "Firmware-assisted dump is active" is printed early
in the boot process to put the subsequent fadump messages in context.

Suggested-by: Michael Ellerman 
Signed-off-by: Hari Bathini 
---

Changes from v4:
* Reverted to using comma-separated list for fadump_append parameters.
* Modifying command line to replace 'fadump_append=param1,param2' with
  "param1 param2" to enforce them when fadump is active.
* Based on top of patchset that includes
  
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=akpm=c2afb7ce9d088696427018ba2135b898058507b8


 arch/powerpc/include/asm/fadump.h |2 +
 arch/powerpc/kernel/fadump.c  |   72 +++--
 arch/powerpc/kernel/prom.c|4 ++
 3 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 60b9108..2fadae5 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -204,11 +204,13 @@ extern int early_init_dt_scan_fw_dump(unsigned long node,
const char *uname, int depth, void *data);
 extern int fadump_reserve_mem(void);
 extern int setup_fadump(void);
+extern void update_command_line_with_fadump_append(char *cmdline);
 extern int is_fadump_active(void);
 extern void crash_fadump(struct pt_regs *, const char *);
 extern void fadump_cleanup(void);
 
 #else  /* CONFIG_FA_DUMP */
+static inline void update_command_line_with_fadump_append(char *cmdline) { }
 static inline int is_fadump_active(void) { return 0; }
 static inline void crash_fadump(struct pt_regs *regs, const char *str) { }
 #endif
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index e013f8f..b6f64c6 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -79,8 +79,10 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 * dump data waiting for us.
 */
fdm_active = of_get_flat_dt_prop(node, "ibm,kernel-dump", NULL);
-   if (fdm_active)
+   if (fdm_active) {
+   pr_info("Firmware-assisted dump is active.\n");
fw_dump.dump_active = 1;
+   }
 
/* Get the sizes required to store dump data for the firmware provided
 * dump sections.
@@ -263,8 +265,12 @@ int __init fadump_reserve_mem(void)
 {
unsigned long base, size, memory_boundary;
 
-   if (!fw_dump.fadump_enabled)
+   if (!fw_dump.fadump_enabled) {
+   if (fw_dump.dump_active)
+   pr_warn("Firmware-assisted dump was active but kernel"
+   " booted with fadump disabled!\n");
return 0;
+   }
 
if (!fw_dump.fadump_supported) {
printk(KERN_INFO "Firmware-assisted dump is not supported on"
@@ -304,7 +310,6 @@ int __init fadump_reserve_mem(void)
memory_boundary = memblock_end_of_DRAM();
 
if (fw_dump.dump_active) {
-   printk(KERN_INFO "Firmware-assisted dump is active.\n");
/*
 * If last boot has crashed then reserve all the memory
 * above boot_memory_size so that we don't touch it until
@@ -359,6 +364,67 @@ static int __init early_fadump_param(char *p)
 }
 early_param("fadump", early_fadump_param);
 
+#define FADUMP_APPEND_CMDLINE_PREFIX   "fadump_append="
+
+static __init char *get_last_fadump_append(char *p)
+{
+   char *fadump_cmdline = NULL;
+
+   /* find fadump_append and use the last one if there are more */
+   p = strstr(p, FADUMP_APPEND_CMDLINE_PREFIX);
+   while (p) {
+   fadump_cmdline = p;
+   p = strstr(p+1, FADUMP_APPEND_CMDLINE_PREFIX);
+   }
+
+   return fadump_cmdline;
+}
+
+/*
+ * Replace "fadump_append=param1,param2,param3" in cmdline with
+ * "param1 param2 param3". This will ensure that the additional
+ * parameters passed with 'fadump_append=' are enforced.
+ */
+void __init update_command_line_with_fadump_append(char *cmdline)
+{
+   static char fadump_cmdline[COMMAND_LINE_SIZE] __initdata;
+   static char append_cmdline[COMMAND_LINE_SIZE] __initdata;
+   size_t fadump_append_size = 0;
+   char *last, *p, 
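[The message is truncated at this point in the archive.]

For illustration only -- this is not the missing kernel code -- here is
a self-contained user-space sketch of the transformation the commit
message describes, i.e. replacing "fadump_append=a,b,c" in a command
line with "a b c":

#include <stdio.h>
#include <string.h>

static void apply_fadump_append(char *cmdline)
{
	const char *prefix = "fadump_append=";
	char *p = strstr(cmdline, prefix);
	char *val, *end;

	if (!p)
		return;
	/* drop the prefix by shifting the value left */
	val = p + strlen(prefix);
	memmove(p, val, strlen(val) + 1);
	/* the appended parameters end at the next space or end of string */
	end = strchr(p, ' ');
	if (!end)
		end = p + strlen(p);
	/* the comma separators become spaces */
	for (; p < end; p++)
		if (*p == ',')
			*p = ' ';
}

int main(void)
{
	char cmdline[] = "root=/dev/sda2 fadump=on "
			 "fadump_append=nr_cpus=1,numa=off,udev.children-max=2";

	apply_fadump_append(cmdline);
	/* prints: root=/dev/sda2 fadump=on nr_cpus=1 numa=off udev.children-max=2 */
	printf("%s\n", cmdline);
	return 0;
}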

Re: [RFC][PATCH] powerpc/64s: Leave IRQs hard enabled over context switch

2017-05-03 Thread Benjamin Herrenschmidt
On Wed, 2017-05-03 at 20:26 +1000, Michael Ellerman wrote:
> Couldn't we avoid the whole problem by just having two bolted slots for
> the stack, meaning we could have the current and next stack bolted at
> all times.
> 
> That would mean we'd be using 4 slots for bolted entries, which is one
> more than 3 - but it might be a good trade off.
> 
> Or we could make the SLB insertion algorithm smarter so that we could
> later free the slot that was used for previous kernel stack.

Changing the insertion algo is a bit nasty, it's nice and simple as-is,
an option would be to replace the bolted "threshold" by a bolted "map"
with a bit (or a byte) per entry.

Sadly, we have no way that I know of to search an slb entry that tells
you in which slot it is, do we ? So even if the "new" stack is already
somewhere in there, we can't know it and just "mark it bolted". We'd
still have to invalidate just in case before bolting it in.

Ben.


Re: [v2 3/5] mm: add "zero" argument to vmemmap allocators

2017-05-03 Thread Pasha Tatashin

Hi Dave,

Thank you for the review. I will address your comment and update the patchset.

Pasha

On 05/03/2017 10:34 AM, David Miller wrote:

From: Pavel Tatashin 
Date: Fri, 24 Mar 2017 15:19:50 -0400


Allow clients to request non-zeroed memory from vmemmap allocator.
The following two public functions have a new boolean argument called zero:

__vmemmap_alloc_block_buf()
vmemmap_alloc_block()

When zero is true, memory that is allocated by the memblock allocator is zeroed
(the current behavior); when the argument is false, the memory is not zeroed.

This change allows for optimizations where client knows when it is better
to zero memory: may be later when other CPUs are started, or may be client
is going to set every byte in the allocated memory, so no need to zero
memory beforehand.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Shannon Nelson 


I think when you add a new argument that can adjust behavior, you
should add the new argument but retain exactly the current behavior in
the existing calls.

Then later you can piece by piece change behavior, and document properly
in the commit message what is happening and why the transformation is
legal.

Here, you are adding the new boolean to __earlyonly_bootmem_alloc() and
then making sparse_mem_maps_populate_node() pass false, which changes
behavior such that it doesn't get zero'd memory any more.

Please make one change at a time.  Otherwise review and bisection is
going to be difficult.




Re: [v2 3/5] mm: add "zero" argument to vmemmap allocators

2017-05-03 Thread David Miller
From: Pavel Tatashin 
Date: Fri, 24 Mar 2017 15:19:50 -0400

> Allow clients to request non-zeroed memory from vmemmap allocator.
> The following two public functions have a new boolean argument called zero:
> 
> __vmemmap_alloc_block_buf()
> vmemmap_alloc_block()
> 
> When zero is true, memory that is allocated by the memblock allocator is zeroed
> (the current behavior); when the argument is false, the memory is not zeroed.
> 
> This change allows for optimizations where client knows when it is better
> to zero memory: may be later when other CPUs are started, or may be client
> is going to set every byte in the allocated memory, so no need to zero
> memory beforehand.
> 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Shannon Nelson 

I think when you add a new argument that can adjust behavior, you
should add the new argument but retain exactly the current behavior in
the existing calls.

Then later you can piece by piece change behavior, and document properly
in the commit message what is happening and why the transformation is
legal.

Here, you are adding the new boolean to __earlyonly_bootmem_alloc() and
then making sparse_mem_maps_populate_node() pass false, which changes
behavior such that it doesn't get zero'd memory any more.

Please make one change at a time.  Otherwise review and bisection is
going to be difficult.
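To spell out the two-step approach being asked for, a toy stand-in
(plain C with illustrative names; the real functions are the kernel's
vmemmap allocators under discussion):

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Step 1: grow the signature, but keep behavior bit-for-bit identical
 * by having every existing caller pass zero = true.
 */
static void *alloc_block(size_t size, bool zero)
{
	void *p = malloc(size);

	if (p && zero)
		memset(p, 0, size);	/* the previous unconditional zeroing */
	return p;
}

static void *existing_caller(size_t size)
{
	return alloc_block(size, true);	/* semantics unchanged */
}

/* Step 2, in a later patch with its own justification: a caller that
 * provably initializes every byte itself may switch to false.
 */
static void *optimized_caller(size_t size)
{
	void *p = alloc_block(size, false);

	if (p)
		memset(p, 0xff, size);	/* writes every byte anyway */
	return p;
}

int main(void)
{
	free(existing_caller(16));
	free(optimized_caller(16));
	return 0;
}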



[RFC 0/2] powerpc/mm: Mark memory contexts requiring global TLBIs

2017-05-03 Thread Frederic Barrat
capi2 and opencapi require that the TLB invalidations sent for
addresses used on the cxl adapter or opencapi device be global, as
there's a translation cache in the PSL (for capi2) or NPU (for
opencapi). The CAPP (for the PSL) and the NPU snoop the power bus.

This is not new: for the hash memory model, as soon as the cxl driver
is active, all local TLBIs become global. We need a similar mechanism
for the radix memory model. This patch tries to improve things a bit
by flagging the contexts requiring global TLBIs, therefore limiting
the "upgrade" and not affecting contexts not used by the card.

Alistair: for nvlink2, it is my understanding that all the required
invalidations are already in place through software mmio/ATSD, i.e. this
patch is not useful for you.

Submitting as an RFC, since I don't get to touch mmu.h everyday and
would like to probe people's reaction.



Frederic Barrat (2):
  powerpc/mm: Add marker for contexts requiring global TLB invalidations
  cxl: Mark context requiring global TLBIs

 arch/powerpc/include/asm/book3s/64/mmu.h |  9 +
 arch/powerpc/include/asm/tlb.h   | 10 --
 arch/powerpc/mm/mmu_context_book3s64.c   |  1 +
 drivers/misc/cxl/api.c   |  5 -
 drivers/misc/cxl/file.c  |  5 -
 5 files changed, 26 insertions(+), 4 deletions(-)

-- 
2.11.0



[RFC 2/2] cxl: Mark context requiring global TLBIs

2017-05-03 Thread Frederic Barrat
The PSL needs to see all TLBIs pertinent to the memory contexts used on
the cxl adapter. For the hash memory model, this was done by making all
TLBIs global as soon as the cxl driver is in use. For radix, we need
something similar, but we can refine it and only convert to global the
invalidations for contexts actually used by the device.

So mark the contexts being attached to the cxl adapter as requiring
global TLBIs.

Signed-off-by: Frederic Barrat 
---
 drivers/misc/cxl/api.c  | 5 -
 drivers/misc/cxl/file.c | 5 -
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 1a138c83f877..86b2ad86fdee 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -347,7 +347,10 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
cxl_context_mm_count_put(ctx);
goto out;
}
-
+#ifdef CONFIG_PPC_BOOK3S_64
+   if (ctx->mm)
+   mm_context_set_global_tlbi(>mm->context);
+#endif
ctx->status = STARTED;
 out:
mutex_unlock(>status_mutex);
diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index 17b433f1ce23..c7ead488eee2 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -239,7 +239,10 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
cxl_context_mm_count_put(ctx);
goto out;
}
-
+#ifdef CONFIG_PPC_BOOK3S_64
+   if (ctx->mm)
+   mm_context_set_global_tlbi(>mm->context);
+#endif
ctx->status = STARTED;
rc = 0;
 out:
-- 
2.11.0



[RFC 1/2] powerpc/mm: Add marker for contexts requiring global TLB invalidations

2017-05-03 Thread Frederic Barrat
Introduce a new 'flags' attribute per context and define its first bit
to be a marker requiring all TLBIs for that context to be broadcasted
globally. Once that marker is set on a context, it cannot be removed.

Such a marker is useful for memory contexts used by devices behind the
NPU and CAPP/PSL. The NPU and the PSL keep their own
translation cache so they need to see all the TLBIs for those
contexts.

Signed-off-by: Frederic Barrat 
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  9 +
 arch/powerpc/include/asm/tlb.h   | 10 --
 arch/powerpc/mm/mmu_context_book3s64.c   |  1 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 77529a3e3811..7b640ab1cbeb 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -78,8 +78,12 @@ struct spinlock;
 /* Maximum possible number of NPUs in a system. */
 #define NV_MAX_NPUS 8
 
+/* Bits definition for the context flags */
+#define MM_CONTEXT_GLOBAL_TLBI 1   /* TLBI must be global */
+
 typedef struct {
mm_context_id_t id;
+   unsigned long flags;
u16 user_psize; /* page size index */
 
/* NPU NMMU context */
@@ -164,5 +168,10 @@ extern void radix_init_pseries(void);
 static inline void radix_init_pseries(void) { };
 #endif
 
+static inline void mm_context_set_global_tlbi(mm_context_t *ctx)
+{
+   set_bit(MM_CONTEXT_GLOBAL_TLBI, >flags);
+}
+
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_MMU_H_ */
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 609557569f65..bd18ed083011 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -71,8 +71,14 @@ static inline int mm_is_core_local(struct mm_struct *mm)
 
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
-   return cpumask_equal(mm_cpumask(mm),
- cpumask_of(smp_processor_id()));
+   int rc;
+
+   rc = cpumask_equal(mm_cpumask(mm),
+   cpumask_of(smp_processor_id()));
+#ifdef CONFIG_PPC_BOOK3S_64
+   rc = rc && !test_bit(MM_CONTEXT_GLOBAL_TLBI, >context.flags);
+#endif
+   return rc;
 }
 
 #else
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index c6dca2ae78ef..53983a0c220c 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -156,6 +156,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
return index;
 
mm->context.id = index;
+   mm->context.flags = 0;
 #ifdef CONFIG_PPC_ICSWX
mm->context.cop_lockp = kmalloc(sizeof(spinlock_t), GFP_KERNEL);
if (!mm->context.cop_lockp) {
-- 
2.11.0
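For context, a sketch of how a TLB flush path consumes
mm_is_thread_local() once the flag is in place -- modeled loosely on the
radix flush helpers (_tlbiel_pid()/_tlbie_pid() in mm/tlb-radix.c), with
details elided:

static void flush_tlb_mm_sketch(struct mm_struct *mm)
{
	unsigned long pid = mm->context.id;

	preempt_disable();
	if (mm_is_thread_local(mm))
		_tlbiel_pid(pid, RIC_FLUSH_TLB);  /* local form */
	else
		_tlbie_pid(pid, RIC_FLUSH_TLB);   /* broadcast, seen by PSL/NPU */
	preempt_enable();
}

Once a context has MM_CONTEXT_GLOBAL_TLBI set, mm_is_thread_local()
returns false even for a single-threaded mm, so the broadcast (global)
form is always used for it.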



Re: [PATCH 2/2] v1 powerpc/powernv: Enable removal of memory for in memory tracing

2017-05-03 Thread Anshuman Khandual
On 05/03/2017 09:22 AM, Rashmica Gupta wrote:
> On 28/04/17 19:52, Anshuman Khandual wrote:
>> On 04/28/2017 11:12 AM, Rashmica Gupta wrote:
>>> Some powerpc hardware features may want to gain access to a chunk of
>> What kind of features ? Please add specifics.
>>
>>> undisturbed real memory.  This update provides a means to unplug said
>>> memory
>> Undisturbed ? Meaning part of memblock and currently inside the buddy
>> allocator which we are trying to hot unplug out ?
>>
>>> from the kernel with a set of debugfs calls.  By writing an integer
>>> containing
>>>   the size of memory to be unplugged into
>> Does the size has some constraints like aligned with memblock section
>> size ? LMB size ? page block size ? etc. Please add the details.
> 
> Will do.
> 
>>
>>> /sys/kernel/debug/powerpc/memtrace/enable, the code will remove that
>>> much
>>> memory from the end of each available chip's memory space (ie each
>>> memory node).
>>  amount (I guess bytes in this case) of memory will be removed
>> from the end of the NUMA node ? Whats the guarantee that they would be
>> free at that time and not being pinned by some process ? If its not
>> guaranteed to be freed, then interface description should state that
>> clearly.
> 
> We start looking from the end of the NUMA node but of course there is no
> guarantee
> that we will always be able to find some memory there that we are able
> to remove.


Okay. Do we have an interface for giving this memory back to the buddy
allocator again when we are done with HW tracing? If not, we need to
add one.

> 
>>> In addition, the means to read out the contents of the unplugged
>>> memory is also
>>> provided by reading out the
>>> /sys/kernel/debug/powerpc/memtrace//trace
>>> file.
>> All of the debugfs file interfaces added here should be documented some
>> where in detail.
>>
>>> Signed-off-by: Anton Blanchard 
>>> Signed-off-by: Rashmica Gupta 
>>>
>>> ---
>>> This requires the 'Wire up hpte_removebolted for powernv' patch.
>>>
>>> RFC -> v1: Added in two missing locks. Replaced the open-coded
>>> flush_memory_region() with the existing
>>> flush_inval_dcache_range(start, end).
>>>
>>> memtrace_offline_pages() is open-coded because offline_pages is
>>> designed to be
>>> called through the sysfs interface - not directly.
>>>
>>> We could move the offlining of pages to userspace, which removes some
>>> of this
>>> open-coding. This would then require passing info to the kernel such
>>> that it
>>> can then remove the memory that has been offlined. This could be done
>>> using
>>> notifiers, but this isn't simple due to locking (remove_memory needs
>>> mem_hotplug_begin() which the sysfs interface already has). This
>>> could also be
>>> done through the debugfs interface (similar to what is done here).
>>> Either way,
>>> this would require the process that needs the memory to have
>>> open-coded code
>>> which it shouldn't really be involved with.
>>>
>>> As the current remove_memory() function requires the memory to
>>> already be
>>> offlined, it makes sense to keep the offlining and removal of memory
>>> functionality grouped together so that a process can simply make one
>>> request to
>>> unplug some memory. Ideally there would be a kernel function we could
>>> call that
>>> would offline the memory and then remove it.
>>>
>>>
>>>   arch/powerpc/platforms/powernv/memtrace.c | 276
>>> ++
>>>   1 file changed, 276 insertions(+)
>>>   create mode 100644 arch/powerpc/platforms/powernv/memtrace.c
>>>
>>> diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
>>> new file mode 100644
>>> index 000..86184b1
>>> --- /dev/null
>>> +++ b/arch/powerpc/platforms/powernv/memtrace.c
>>> @@ -0,0 +1,276 @@
>>> +/*
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License as published by
>>> + * the Free Software Foundation; either version 2 of the License, or
>>> + * (at your option) any later version.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>> + * GNU General Public License for more details.
>>> + *
>>> + * Copyright (C) IBM Corporation, 2014
>>> + *
>>> + * Author: Anton Blanchard 
>>> + */
>>> +
>>> +#define pr_fmt(fmt) "powernv-memtrace: " fmt
>>> +
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +
>>> +struct memtrace_entry {
>>> +	void *mem;
>>> +	u64 start;
>>> +	u64 size;
>>> +	u32 nid;
>>> +	struct dentry *dir;
>>> +	char name[16];
>>> +};
>> A little bit of description about the structure here would help.
> 
> 

Re: [PATCH 2/2] v1 powerpc/powernv: Enable removal of memory for in memory tracing

2017-05-03 Thread Anshuman Khandual
On 05/03/2017 11:38 AM, Rashmica Gupta wrote:
> 
> 
> On 03/05/17 13:52, Rashmica Gupta wrote:
>> On 28/04/17 19:52, Anshuman Khandual wrote:
>> 
+static int check_memblock_online(struct memory_block *mem, void *arg)
+{
+	if (mem->state != MEM_ONLINE)
+		return -1;
+
+	return 0;
+}
+
+static int change_memblock_state(struct memory_block *mem, void *arg)
+{
+	unsigned long state = (unsigned long)arg;
+
+	mem->state = state;
+	return 0;
+}
+
+static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
+{
+	u64 end_pfn = start_pfn + nr_pages - 1;
+
+	if (walk_memory_range(start_pfn, end_pfn, NULL,
+			check_memblock_online))
+		return false;
+
+	walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
+			  change_memblock_state);
+
>>> walk_memory_range() might be expensive, can't we just change the state
>>> to MEM_GOING_OFFLINE while checking the state for MEM_ONLINE during
>>> the first loop, and bail out if any of the memblocks is not MEM_ONLINE
>>> in the first place?
>>
>> Good idea.
>>
> 
> This is assuming that it's more likely that the state of memory will be
> MEM_ONLINE rather than anything else (if the state isn't MEM_ONLINE we
> will still have to do a second call of walk_memory_range() to revert the
> state of any memory blocks that we changed). Seems like a reasonable
> assumption to me, thoughts?

Reverting the state of the memory blocks we changed, up to the point
where we discover that something is not MEM_ONLINE and decide to abort,
would mean remembering every change made so far in order to undo it.
Let's keep it as it is.
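
To illustrate the trade-off, here is a minimal sketch of the suggested
single-pass variant (an illustration only, not code from the patch; the
callback name is hypothetical):

static int check_and_mark_online(struct memory_block *mem, void *arg)
{
	/* Abort the walk on the first block that is not online. */
	if (mem->state != MEM_ONLINE)
		return -1;

	/*
	 * Any block already switched below stays marked, so an abort
	 * still needs a second walk (or a remembered list of blocks)
	 * to restore MEM_ONLINE on everything visited earlier.
	 */
	mem->state = MEM_GOING_OFFLINE;
	return 0;
}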




Re: [PATCH] powerpc/mm: Fix mapped range information print during boot

2017-05-03 Thread Anshuman Khandual
On 05/03/2017 03:02 PM, Michael Ellerman wrote:
> Anshuman Khandual  writes:
> 
>> This is a trivial fix patch regarding mapped ranges in radix MMU
>> environment during boot. No functional change.
>>
>> Before the patch:
>>
>> $dmesg | grep Mapped
>> [0.00] Mapped range 0x0 - 0x20 with 0x4000
>> [0.00] Mapped range 0x2000 - 0x2020 with 0x4000
>>
>> After the patch:
>>
>> $dmesg | grep Mapped
>> [0.00] Mapped range 0x - 0x0020 with 1024 MB
>> [0.00] Mapped range 0x2000 - 0x2020 with 1024 MB
> 
> It's a bit better, how about this instead?

Yeah, this is better. I had tried to fix it in place.



Re: [RFC][PATCH] powerpc/64s: Leave IRQs hard enabled over context switch

2017-05-03 Thread Michael Ellerman
Benjamin Herrenschmidt  writes:

> On Wed, 2017-05-03 at 17:34 +1000, Nicholas Piggin wrote:
>> Extending the soft IRQ disable to cover PMU interrupts will allow this
>> hard disable to be removed from hash based kernels too, but they will
>> still have to soft-disable PMU interrupts.
>> 
>> - Q1: Can we do this? It gives nice profiles of context switch code
>>   rather than assigning it all to local_irq_enable.
>
> Probably ok with radix yes.
>
>> - Q2: What is the unrecoverable SLB miss on exception entry? Is there
>>   anywhere we access the kernel stack with RI disabled? Something else?
>
> Not sure what you mean by Q2, but the original problem is an occurrence
> of what we call the 'megabug' which hit us in different forms over the
> years, and happens when we get a kernel stack SLB entry wrong.
>
> Normally, the segment containing the current kernel stack is always
> bolted in the SLB in a specific slot. If we accidentally lose that
> "bolt", we can end up faulting it into the wrong slot, thus making it
> possible for it to get evicted later on etc...
>
> That in turn hits the exception return path which accesses the kernel
> stack after clearing RI and setting SRR0/1 to the return values.

Couldn't we avoid the whole problem by just having two bolted slots for
the stack, meaning we could have the current and next stack bolted at
all times?

That would mean we'd be using 4 slots for bolted entries, which is one
more than 3 - but it might be a good trade-off.

Or we could make the SLB insertion algorithm smarter, so that we could
later free the slot that was used for the previous kernel stack.

cheers


Re: [PATCH -next] soc/qbman: fix implicit header dependency now causing build fails

2017-05-03 Thread Joerg Roedel
Hi Stephen,

On Wed, May 03, 2017 at 07:15:24PM +1000, Stephen Rothwell wrote:
> It looks like there is at least one more:
> 
> drivers/soc/fsl/qbman/qman.c: In function 'qman_init_fq':
> drivers/soc/fsl/qbman/qman.c:1787:4: error: implicit declaration of function 'dma_map_single' [-Werror=implicit-function-declaration]
> drivers/soc/fsl/qbman/qman.c:1788:21: error: 'DMA_TO_DEVICE' undeclared (first use in this function)
> drivers/soc/fsl/qbman/qman.c:1789:4: error: implicit declaration of function 'dma_mapping_error' [-Werror=implicit-function-declaration]
> 
> This is from a powerpc corenet64_smp_defconfig build of today's
> linux-next.

Thanks, I'll fix that up too later today. Please let me know if you find
more of that in your compile-testing.


Joerg



Re: [PATCH] powerpc/mm: Fix mapped range information print during boot

2017-05-03 Thread Michael Ellerman
Anshuman Khandual  writes:

> This is a trivial fix patch regarding mapped ranges in radix MMU
> environment during boot. No functional change.
>
> Before the patch:
>
> $dmesg | grep Mapped
> [0.00] Mapped range 0x0 - 0x20 with 0x4000
> [0.00] Mapped range 0x2000 - 0x2020 with 0x4000
>
> After the patch:
>
> $dmesg | grep Mapped
> [0.00] Mapped range 0x - 0x0020 with 1024 MB
> [0.00] Mapped range 0x2000 - 0x2020 with 1024 MB

It's a bit better, how about this instead?

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index c28165d8970b..519cfef569d1 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -8,9 +8,14 @@
  * as published by the Free Software Foundation; either version
  * 2 of the License, or (at your option) any later version.
  */
+
+#define pr_fmt(fmt) "radix-mmu: " fmt
+
+#include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -112,10 +117,14 @@ static inline void __meminit print_mapping(unsigned long start,
					   unsigned long end,
					   unsigned long size)
 {
+   char buf[10];
+
if (end <= start)
return;
 
-   pr_info("Mapped range 0x%lx - 0x%lx with 0x%lx\n", start, end, size);
+   string_get_size(size, 1, STRING_UNITS_2, buf, sizeof(buf));
+
+   pr_info("Mapped 0x%016lx-0x%016lx with %s pages\n", start, end, buf);
 }
 
 static int __meminit create_physical_mapping(unsigned long start,
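
For illustration, the first boot line from the example above would then
come out roughly as follows (reconstructed, assuming the 1024 MB mapping
size shown earlier, which string_get_size() renders in binary units, and
with the new "radix-mmu: " pr_fmt prefix):

radix-mmu: Mapped 0x0000000000000000-0x0000002000000000 with 1.00 GiB pages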


cheers


Re: [RFC][PATCH] powerpc/64s: Leave IRQs hard enabled over context switch

2017-05-03 Thread Nicholas Piggin
On Wed, 03 May 2017 10:28:27 +0200
Benjamin Herrenschmidt  wrote:

> On Wed, 2017-05-03 at 17:34 +1000, Nicholas Piggin wrote:
> > Extending the soft IRQ disable to cover PMU interrupts will allow this
> > hard disable to be removed from hash based kernels too, but they will
> > still have to soft-disable PMU interrupts.
> > 
> > - Q1: Can we do this? It gives nice profiles of context switch code
> >   rather than assigning it all to local_irq_enable.  
> 
> Probably ok with radix yes.

Cool.

> > - Q2: What is the unrecoverable SLB miss on exception entry? Is there
> >   anywhere we access the kernel stack with RI disabled? Something else?  
> 
> Not sure what you mean by Q2, but the original problem is an occurrence
> of what we call the 'megabug' which hit us in different forms over the
> years, and happens when we get a kernel stack SLB entry wrong.
> 
> Normally, the segment containing the current kernel stack is always
> bolted in the SLB in a specific slot. If we accidentally lose that
> "bolt", we can end up faulting it into the wrong slot, thus making it
> possible for it to get evicted later on etc...
> 
> That in turn hits the exception return path which accesses the kernel
> stack after clearing RI and setting SRR0/1 to the return values.

This is exactly my question. The original patch said there was an
unrecoverable SLB miss at exception entry. Either I missed that or
it has since been removed. The stack access at exit would do it. I
didn't want to change anything for hash, just wondering where the
bug was (subject line got truncated but is supposed to say "for radix").

Thanks,
Nick


Re: [PATCH -next] soc/qbman: fix implicit header dependency now causing build fails

2017-05-03 Thread Joerg Roedel
Hi Paul,

On Tue, May 02, 2017 at 06:21:12PM -0400, Paul Gortmaker wrote:
> In commit 461a6946b1f9 ("iommu: Remove pci.h include from
> trace/events/iommu.h") that header shuffle uncovered an implicit
> include in this driver, manifesting as:
> 
> CC  drivers/soc/fsl/qbman/qman_portal.o
> drivers/soc/fsl/qbman/qman_portal.c: In function 'qman_portal_probe':
> drivers/soc/fsl/qbman/qman_portal.c:299:2: error: implicit declaration of function 'dma_set_mask'
> drivers/soc/fsl/qbman/qman_portal.c:299:2: error: implicit declaration of function 'DMA_BIT_MASK'
> if (dma_set_mask(dev, DMA_BIT_MASK(40))) {
> ^
> 
> on the corenet32_smp_defconfig (and 64-bit, respectively).  The above
> commit was singled out via git bisect.
> 
> The header it was implicitly relying on getting was dma-mapping.h - so
> we explicitly add it here.
> 
> Fixes: 461a6946b1f9 ("iommu: Remove pci.h include from trace/events/iommu.h")
> Cc: Joerg Roedel 
> Cc: Scott Wood 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Paul Gortmaker 

Thanks for catching that, I thought I had found all breakages caused by
removing this include. Obviously this wasn't true :)

I applied the fix to the iommu/core branch.


Joerg



Re: [PATCH -next] soc/qbman: fix implicit header dependency now causing build fails

2017-05-03 Thread Stephen Rothwell
Hi Joerg,

On Wed, 3 May 2017 10:42:40 +0200 Joerg Roedel  wrote:
>
> On Tue, May 02, 2017 at 06:21:12PM -0400, Paul Gortmaker wrote:
> > In commit 461a6946b1f9 ("iommu: Remove pci.h include from
> > trace/events/iommu.h") that header shuffle uncovered an implicit
> > include in this driver, manifesting as:
> > 
> > CC  drivers/soc/fsl/qbman/qman_portal.o
> > drivers/soc/fsl/qbman/qman_portal.c: In function 'qman_portal_probe':
> > drivers/soc/fsl/qbman/qman_portal.c:299:2: error: implicit declaration of function 'dma_set_mask'
> > drivers/soc/fsl/qbman/qman_portal.c:299:2: error: implicit declaration of function 'DMA_BIT_MASK'
> > if (dma_set_mask(dev, DMA_BIT_MASK(40))) {
> > ^
> > 
> > on the corenet32_smp_defconfig (and 64-bit, respectively).  The above
> > commit was singled out via git bisect.
> > 
> > The header it was implicitly relying on getting was dma-mapping.h - so
> > we explicitly add it here.
> > 
> > Fixes: 461a6946b1f9 ("iommu: Remove pci.h include from trace/events/iommu.h")
> > Cc: Joerg Roedel 
> > Cc: Scott Wood 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Paul Gortmaker   
> 
> Thanks for catching that, I thought I had found all breakages caused by
> removing this include. Obviously this wasn't true :)
> 
> I applied the fix to the iommu/core branch.

Thanks.

It looks like there is at least one more:

drivers/soc/fsl/qbman/qman.c: In function 'qman_init_fq':
drivers/soc/fsl/qbman/qman.c:1787:4: error: implicit declaration of function 'dma_map_single' [-Werror=implicit-function-declaration]
drivers/soc/fsl/qbman/qman.c:1788:21: error: 'DMA_TO_DEVICE' undeclared (first use in this function)
drivers/soc/fsl/qbman/qman.c:1789:4: error: implicit declaration of function 'dma_mapping_error' [-Werror=implicit-function-declaration]

This is from a powerpc corenet64_smp_defconfig build of today's
linux-next.
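
All three identifiers come from linux/dma-mapping.h, so the follow-up fix
is presumably the same one-liner as the qman_portal.c patch (a sketch; the
exact placement among qman.c's existing includes is an assumption):

#include <linux/dma-mapping.h>	/* dma_map_single(), DMA_TO_DEVICE, dma_mapping_error() */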
-- 
Cheers,
Stephen Rothwell


Re: [PATCH v4 RFT 1/2] powerpc/fadump: reduce memory consumption for capture kernel

2017-05-03 Thread Hari Bathini



On Wednesday 03 May 2017 12:43 PM, Hari Bathini wrote:



On Tuesday 02 May 2017 09:26 PM, Michal Suchanek wrote:
With fadump (dump capture) kernel booting like a regular kernel, it almost
needs the same amount of memory to boot as the production kernel, which is
unwarranted for a dump capture kernel. But with no option to disable some
of the unnecessary subsystems in fadump kernel, that much memory is wasted
on fadump, depriving the production kernel of that memory.

Introduce kernel parameter 'fadump_append=' that would take regular kernel
parameters to be appended when fadump is active.
This 'fadump_append=' parameter can be leveraged to pass parameters like
nr_cpus=1, cgroup_disable=memory and numa=off, to disable unwarranted
resources/subsystems.

Also, ensure the log "Firmware-assisted dump is active" is printed early
in the boot process to put the subsequent fadump messages in context.

Suggested-by: Michael Ellerman 
Signed-off-by: Hari Bathini 
Signed-off-by: Michal Suchanek 
---
v4:
  - use space separated arguments instead of comma separated
  - do not append parameters when fadump is disabled
---
  arch/powerpc/kernel/fadump.c | 27 ++++++++++++++++++++++++---
  1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index ebf2e9c..e0c728a 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -79,8 +79,10 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 	 * dump data waiting for us.
 	 */
 	fdm_active = of_get_flat_dt_prop(node, "ibm,kernel-dump", NULL);
-	if (fdm_active)
+	if (fdm_active) {
+		pr_info("Firmware-assisted dump is active.\n");
 		fw_dump.dump_active = 1;
+	}
 
 	/* Get the sizes required to store dump data for the firmware provided
 	 * dump sections.
@@ -263,8 +265,12 @@ int __init fadump_reserve_mem(void)
 {
 	unsigned long base, size, memory_boundary;
 
-	if (!fw_dump.fadump_enabled)
+	if (!fw_dump.fadump_enabled) {
+		if (fw_dump.dump_active)
+			pr_warn("Firmware-assisted dump was active but kernel"
+				" booted with fadump disabled!\n");
 		return 0;
+	}
 
 	if (!fw_dump.fadump_supported) {
 		printk(KERN_INFO "Firmware-assisted dump is not supported on"
@@ -304,7 +310,6 @@ int __init fadump_reserve_mem(void)
 	memory_boundary = memblock_end_of_DRAM();
 
 	if (fw_dump.dump_active) {
-		printk(KERN_INFO "Firmware-assisted dump is active.\n");
 		/*
 		 * If last boot has crashed then reserve all the memory
 		 * above boot_memory_size so that we don't touch it until
@@ -363,6 +368,22 @@ unsigned long __init arch_reserved_kernel_pages(void)
 	return memblock_reserved_size() / PAGE_SIZE;
 }
 
+/* Look for fadump_append= cmdline option. */
+static int __init early_fadump_append_param(char *p)
+{
+	if (!p)
+		return 1;
+
+	if (fw_dump.fadump_enabled && fw_dump.dump_active) {
+		pr_info("enforcing additional parameters (%s) passed through "
+			"'fadump_append=' parameter\n", p);
+		parse_early_options(p);

While testing this on a system with the yaboot bootloader, it seems that
a parameter with syntax like fadump_append="nr_cpus numa=off" is not
being parsed in the intended way.

fadump_append="nr_cpus numa=off" above should have read
fadump_append="nr_cpus=1 numa=off". Nonetheless, considering the different
environments this is supposed to work in, I was trying to convey that a
comma-separated append list may be better than a space-separated one.
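
For what it's worth, a comma-separated list could stay compatible with
parse_early_options() by rewriting the commas first. A minimal sketch of
that idea (an illustration, not the posted patch):

/* Accept fadump_append=nr_cpus=1,numa=off and convert the commas to
 * spaces so parse_early_options() can consume the string, sidestepping
 * bootloaders that mishandle quoted spaces on the command line.
 */
static int __init early_fadump_append_param(char *p)
{
	char *c;

	if (!p)
		return 1;

	if (fw_dump.fadump_enabled && fw_dump.dump_active) {
		for (c = p; *c; c++)
			if (*c == ',')
				*c = ' ';
		pr_info("enforcing additional parameters (%s) passed through "
			"'fadump_append=' parameter\n", p);
		parse_early_options(p);
	}
	return 0;
}
early_param("fadump_append", early_fadump_append_param);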

Thanks
Hari



Re: [RFC][PATCH] powerpc/64s: Leave IRQs hard enabled over context switch

2017-05-03 Thread Benjamin Herrenschmidt
On Wed, 2017-05-03 at 17:34 +1000, Nicholas Piggin wrote:
> Extending the soft IRQ disable to cover PMU interrupts will allow this
> hard disable to be removed from hash based kernels too, but they will
> still have to soft-disable PMU interrupts.
> 
> - Q1: Can we do this? It gives nice profiles of context switch code
>   rather than assigning it all to local_irq_enable.

Probably ok with radix yes.

> - Q2: What is the unrecoverable SLB miss on exception entry? Is there
>   anywhere we access the kernel stack with RI disabled? Something else?

Not sure what you mean by Q2, but the original problem is an occurrence
of what we call the 'megabug' which hit us in different forms over the
years, and happens when we get a kernel stack SLB entry wrong.

Normally, the segment containing the current kernel stack is always
bolted in the SLB in a specific slot. If we accidentally lose that
"bolt", we can end up faulting it into the wrong slot, thus making it
possible for it to get evicted later on etc...

That in turn hits the exception return path which accesses the kernel
stack after clearing RI and setting SRR0/1 to the return values.

Cheers,
Ben.



[RFC][PATCH] powerpc/64s: Leave IRQs hard enabled over context switch

2017-05-03 Thread Nicholas Piggin
Commit 4387e9ff25 ("[POWERPC] Fix PMU + soft interrupt disable bug")
hard disabled interrupts over the low level context switch, because
the SLB management can't cope with a PMU interrupt accessing the stack
in that window.

Radix based kernel mapping does not use the SLB so it does not require
interrupts disabled here. This is worth a % or so in context switch
performance, and also allows the low level context switch code to be
profiled.

Extending the soft IRQ disable to cover PMU interrupts will allow this
hard disable to be removed from hash based kernels too, but they will
still have to soft-disable PMU interrupts.

- Q1: Can we do this? It gives nice profiles of context switch code
  rather than assigning it all to local_irq_enable.

- Q2: What is the unrecoverable SLB miss on exception entry? Is there
  anywhere we access the kernel stack with RI disabled? Something else?

---
 arch/powerpc/kernel/process.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index d645da302bf2..915ec20a18a9 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1176,12 +1176,14 @@ struct task_struct *__switch_to(struct task_struct *prev,
 
__switch_to_tm(prev, new);
 
-   /*
-* We can't take a PMU exception inside _switch() since there is a
-* window where the kernel stack SLB and the kernel stack are out
-* of sync. Hard disable here.
-*/
-   hard_irq_disable();
+   if (!radix_enabled()) {
+   /*
+* We can't take a PMU exception inside _switch() since there
+* is a window where the kernel stack SLB and the kernel stack
+* are out of sync. Hard disable here.
+*/
+   hard_irq_disable();
+   }
 
/*
 * Call restore_sprs() before calling _switch(). If we move it after
-- 
2.11.0



Re: [PATCH v4 RFT 1/2] powerpc/fadump: reduce memory consumption for capture kernel

2017-05-03 Thread Hari Bathini



On Tuesday 02 May 2017 09:26 PM, Michal Suchanek wrote:

With fadump (dump capture) kernel booting like a regular kernel, it almost
needs the same amount of memory to boot as the production kernel, which is
unwarranted for a dump capture kernel. But with no option to disable some
of the unnecessary subsystems in fadump kernel, that much memory is wasted
on fadump, depriving the production kernel of that memory.

Introduce kernel parameter 'fadump_append=' that would take regular kernel
parameters to be appended when fadump is active.
This 'fadump_append=' parameter can be leveraged to pass parameters like
nr_cpus=1, cgroup_disable=memory and numa=off, to disable unwarranted
resources/subsystems.

Also, ensure the log "Firmware-assisted dump is active" is printed early
in the boot process to put the subsequent fadump messages in context.

Suggested-by: Michael Ellerman 
Signed-off-by: Hari Bathini 
Signed-off-by: Michal Suchanek 
---
v4:
  - use space separated arguments instead of comma separated
  - do not append parameters when fadump is disabled
---
  arch/powerpc/kernel/fadump.c | 27 ++++++++++++++++++++++++---
  1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index ebf2e9c..e0c728a 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -79,8 +79,10 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 * dump data waiting for us.
 */
fdm_active = of_get_flat_dt_prop(node, "ibm,kernel-dump", NULL);
-   if (fdm_active)
+   if (fdm_active) {
+   pr_info("Firmware-assisted dump is active.\n");
fw_dump.dump_active = 1;
+   }

/* Get the sizes required to store dump data for the firmware provided
 * dump sections.
@@ -263,8 +265,12 @@ int __init fadump_reserve_mem(void)
  {
unsigned long base, size, memory_boundary;

-   if (!fw_dump.fadump_enabled)
+   if (!fw_dump.fadump_enabled) {
+   if (fw_dump.dump_active)
+   pr_warn("Firmware-assisted dump was active but kernel"
+   " booted with fadump disabled!\n");
return 0;
+   }

if (!fw_dump.fadump_supported) {
printk(KERN_INFO "Firmware-assisted dump is not supported on"
@@ -304,7 +310,6 @@ int __init fadump_reserve_mem(void)
memory_boundary = memblock_end_of_DRAM();

if (fw_dump.dump_active) {
-   printk(KERN_INFO "Firmware-assisted dump is active.\n");
/*
 * If last boot has crashed then reserve all the memory
 * above boot_memory_size so that we don't touch it until
@@ -363,6 +368,22 @@ unsigned long __init arch_reserved_kernel_pages(void)
return memblock_reserved_size() / PAGE_SIZE;
  }

+/* Look for fadump_append= cmdline option. */
+static int __init early_fadump_append_param(char *p)
+{
+   if (!p)
+   return 1;
+
+   if (fw_dump.fadump_enabled && fw_dump.dump_active) {
+   pr_info("enforcing additional parameters (%s) passed through "
+   "'fadump_append=' parameter\n", p);
+   parse_early_options(p);


While testing this on a system with the yaboot bootloader, it seems that
a parameter with syntax like fadump_append="nr_cpus numa=off" is not
being parsed in the intended way.
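
(An illustrative guess at the failure mode, assuming yaboot splits the
command line on spaces before the kernel sees it: fadump_append= would
then receive only the first word, and numa=off would be parsed as a
separate top-level parameter, never reaching parse_early_options().)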

Thanks
Hari



Re: [PATCH 2/2] v1 powerpc/powernv: Enable removal of memory for in memory tracing

2017-05-03 Thread Rashmica Gupta



On 03/05/17 13:52, Rashmica Gupta wrote:

On 28/04/17 19:52, Anshuman Khandual wrote:


+static int check_memblock_online(struct memory_block *mem, void *arg)
+{
+	if (mem->state != MEM_ONLINE)
+		return -1;
+
+	return 0;
+}
+
+static int change_memblock_state(struct memory_block *mem, void *arg)
+{
+	unsigned long state = (unsigned long)arg;
+
+	mem->state = state;
+	return 0;
+}
+
+static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
+{
+	u64 end_pfn = start_pfn + nr_pages - 1;
+
+	if (walk_memory_range(start_pfn, end_pfn, NULL,
+			check_memblock_online))
+		return false;
+
+	walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
+			  change_memblock_state);
+

walk_memory_range() might be expensive, can't we just change the state
to MEM_GOING_OFFLINE while checking the state for MEM_ONLINE during
the first loop, and bail out if any of the memblocks is not MEM_ONLINE
in the first place?


Good idea.



This is assuming that it's more likely that the state of memory will be 
MEM_ONLINE rather than anything else (if the state isn't MEM_ONLINE we 
will still have to do a second call of walk_memory_range() to revert the 
state of any memory blocks that we changed). Seems like a reasonable 
assumption to me, thoughts?





