[PATCH 3/3] powerpc/lib/sstep: Fix fixed-point shift instructions that set CA32

2017-09-28 Thread Sandipan Das
This fixes the emulated behaviour of the existing fixed-point shift right
algebraic instructions that are supposed to set both the CA and CA32
bits of XER when running on a system that is compliant with POWER ISA
v3.0, independent of whether the system is executing in 32-bit or
64-bit mode. The following instructions are affected:
  * Shift Right Algebraic Word Immediate (srawi[.])
  * Shift Right Algebraic Word (sraw[.])
  * Shift Right Algebraic Doubleword Immediate (sradi[.])
  * Shift Right Algebraic Doubleword (srad[.])
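
For reference, a minimal user-space sketch of the carry rule these
instructions follow (names here are illustrative, not from the kernel):
CA -- and, on ISA v3.0, CA32 -- is set by a shift right algebraic when
the source is negative and any 1-bits are shifted out.

#include <stdbool.h>
#include <stdint.h>

/* Carry produced by srawi rs,sh: set when the algebraic shift is inexact,
 * i.e. the source is negative and non-zero bits fall off the right end. */
static bool srawi_carry(int32_t rs, unsigned int sh)
{
	uint32_t lost = sh ? ((uint32_t)rs & ((1u << sh) - 1)) : 0;

	return rs < 0 && lost != 0;
}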

Signed-off-by: Sandipan Das 
---
 arch/powerpc/lib/sstep.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index fe1910733e55..5118110c3983 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1804,6 +1804,7 @@ int analyse_instr(struct instruction_op *op, const struct 
pt_regs *regs,
op->xerval |= XER_CA;
else
op->xerval &= ~XER_CA;
+   set_ca32(op, op->xerval & XER_CA);
goto logical_done;
 
case 824:   /* srawi */
@@ -1816,6 +1817,7 @@ int analyse_instr(struct instruction_op *op, const struct 
pt_regs *regs,
op->xerval |= XER_CA;
else
op->xerval &= ~XER_CA;
+   set_ca32(op, op->xerval & XER_CA);
goto logical_done;
 
 #ifdef __powerpc64__
@@ -1845,6 +1847,7 @@ int analyse_instr(struct instruction_op *op, const struct 
pt_regs *regs,
op->xerval |= XER_CA;
else
op->xerval &= ~XER_CA;
+   set_ca32(op, op->xerval & XER_CA);
goto logical_done;
 
case 826:   /* sradi with sh_5 = 0 */
@@ -1858,6 +1861,7 @@ int analyse_instr(struct instruction_op *op, const struct 
pt_regs *regs,
op->xerval |= XER_CA;
else
op->xerval &= ~XER_CA;
+   set_ca32(op, op->xerval & XER_CA);
goto logical_done;
 #endif /* __powerpc64__ */
 
-- 
2.13.5



[PATCH 2/3] powerpc/lib/sstep: Fix fixed-point arithmetic instructions that set CA32

2017-09-28 Thread Sandipan Das
There are existing fixed-point arithmetic instructions that always set the
CA bit of XER to reflect the carry out of bit 0 in 64-bit mode and out of
bit 32 in 32-bit mode. In ISA v3.0, these instructions also always set the
CA32 bit of XER to reflect the carry out of bit 32.

This fixes the emulated behaviour of such instructions when running on a
system that is compliant with POWER ISA v3.0. The following instructions
are affected:
  * Add Immediate Carrying (addic)
  * Add Immediate Carrying and Record (addic.)
  * Subtract From Immediate Carrying (subfic)
  * Add Carrying (addc[.])
  * Subtract From Carrying (subfc[.])
  * Add Extended (adde[.])
  * Subtract From Extended (subfe[.])
  * Add to Minus One Extended (addme[.])
  * Subtract From Minus One Extended (subfme[.])
  * Add to Zero Extended (addze[.])
  * Subtract From Zero Extended (subfze[.])
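
A minimal sketch of the CA32 rule applied by this patch (it mirrors the
check added to add_with_carry() in the diff below; the helper name is
illustrative): carry out of bit 32 is simply the unsigned carry of the
low 32-bit words.

#include <stdbool.h>
#include <stdint.h>

/* Carry out of the low 32 bits of "a + b + carry_in", as reflected in CA32. */
static bool carry32(uint32_t a, uint32_t b, bool carry_in)
{
	uint32_t sum = a + b + carry_in;

	/* Same form as the sstep.c check: wrap-around, or an exact wrap
	 * caused solely by the carry in. */
	return sum < a || (carry_in && sum == a);
}

/* Example: 0xffffffff + 1 wraps the low word, so carry32() returns true
 * and CA32 would be set. */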

Signed-off-by: Sandipan Das 
---
 arch/powerpc/lib/sstep.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 16814bfc01da..fe1910733e55 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -964,6 +964,16 @@ static nokprobe_inline void set_cr0(const struct pt_regs 
*regs,
op->ccval |= 0x20000000;
 }
 
+static nokprobe_inline void set_ca32(struct instruction_op *op, bool val)
+{
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   if (val)
+   op->xerval |= XER_CA32;
+   else
+   op->xerval &= ~XER_CA32;
+   }
+}
+
 static nokprobe_inline void add_with_carry(const struct pt_regs *regs,
 struct instruction_op *op, int rd,
 unsigned long val1, unsigned long val2,
@@ -987,6 +997,9 @@ static nokprobe_inline void add_with_carry(const struct 
pt_regs *regs,
op->xerval |= XER_CA;
else
op->xerval &= ~XER_CA;
+
+   set_ca32(op, (unsigned int)val < (unsigned int)val1 ||
+   (carry_in && (unsigned int)val == (unsigned int)val1));
 }
 
 static nokprobe_inline void do_cmp_signed(const struct pt_regs *regs,
-- 
2.13.5



[PATCH 1/3] powerpc/lib/sstep: Add XER bits introduced in POWER ISA v3.0

2017-09-28 Thread Sandipan Das
This adds definitions for the OV32 and CA32 bits of XER that
were introduced in POWER ISA v3.0. There are some existing
instructions that currently set the OV and CA bits based on
certain conditions.

The emulation behaviour of all these instructions needs to
be updated to set these new bits accordingly.
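
As a quick illustration of how an emulator would use the new bit (user-space
sketch, names illustrative; the actual kernel helper, set_ca32(), is added in
patch 2/3 of this series):

#include <stdbool.h>
#include <stdint.h>

#define XER_CA    0x20000000U
#define XER_CA32  0x00040000U

/* On an ISA v3.0 machine, mirror the 32-bit carry into CA32 alongside the
 * existing CA update; on older ISAs leave XER untouched. */
static uint64_t update_ca32(uint64_t xer, bool carry32, bool isa_v3)
{
	if (!isa_v3)
		return xer;
	return carry32 ? (xer | XER_CA32) : (xer & ~XER_CA32);
}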

Signed-off-by: Sandipan Das 
---
 arch/powerpc/lib/sstep.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 5e8418c28bd8..16814bfc01da 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -31,6 +31,8 @@ extern char system_call_common[];
 #define XER_SO 0x80000000U
 #define XER_OV 0x40000000U
 #define XER_CA 0x20000000U
+#define XER_OV32   0x00080000U
+#define XER_CA32   0x00040000U
 
 #ifdef CONFIG_PPC_FPU
 /*
-- 
2.13.5



Re: [PATCH] powerpc/powernv: Make opal_event_shutdown() callable from IRQ context

2017-09-28 Thread Balbir Singh
On Fri, 29 Sep 2017 13:58:02 +1000
Michael Ellerman  wrote:

> In opal_event_shutdown() we free all the IRQs hanging off the
> opal_event_irqchip. However it's not safe to do so if we're called
> from IRQ context, because free_irq() wants to synchronise versus IRQ
> context. This can lead to warnings and a stuck system.
> 
> For example from sysrq-b:
> 
>   Trying to free IRQ 17 from IRQ context!
>   [ cut here ]
>   WARNING: CPU: 0 PID: 0 at kernel/irq/manage.c:1461 __free_irq+0x398/0x8d0
>   ...
>   NIP __free_irq+0x398/0x8d0
>   LR __free_irq+0x394/0x8d0
>   Call Trace:
> __free_irq+0x394/0x8d0 (unreliable)
> free_irq+0xa4/0x140
> opal_event_shutdown+0x128/0x180
> opal_shutdown+0x1c/0xb0
> pnv_shutdown+0x20/0x40
> machine_restart+0x38/0x90
> emergency_restart+0x28/0x40
> sysrq_handle_reboot+0x24/0x40
> __handle_sysrq+0x198/0x590
> hvc_poll+0x48c/0x8c0
> hvc_handle_interrupt+0x1c/0x50
> __handle_irq_event_percpu+0xe8/0x6e0
> handle_irq_event_percpu+0x34/0xe0
> handle_irq_event+0xc4/0x210
> handle_level_irq+0x250/0x770
> generic_handle_irq+0x5c/0xa0
> opal_handle_events+0x11c/0x240
> opal_interrupt+0x38/0x50
> __handle_irq_event_percpu+0xe8/0x6e0
> handle_irq_event_percpu+0x34/0xe0
> handle_irq_event+0xc4/0x210
> handle_fasteoi_irq+0x174/0xa10
> generic_handle_irq+0x5c/0xa0
> __do_irq+0xbc/0x4e0
> call_do_irq+0x14/0x24
> do_IRQ+0x18c/0x540
> hardware_interrupt_common+0x158/0x180
> 
> We can avoid that by using disable_irq_nosync() rather than
> free_irq(). Although it doesn't fully free the IRQ, it should be
> sufficient when we're shutting down, particularly in an emergency.
> 
> Add an in_interrupt() check and use free_irq() when we're shutting
> down normally. It's probably OK to use disable_irq_nosync() in that
> case too, but for now it's safer to leave that behaviour as-is.
> 
> Fixes: 9f0fd0499d30 ("powerpc/powernv: Add a virtual irqchip for opal events")
> Signed-off-by: Michael Ellerman 
> ---

Acked-by: Balbir Singh 


[PATCH v4 5/5] powerpc/mce: hookup memory_failure for UE errors

2017-09-28 Thread Balbir Singh
If we are in user space and hit a UE error, we now have the
basic infrastructure to walk the page tables and find out
the effective address that was accessed, since the DAR
is not valid.

We use a work_queue context to hook up the bad pfn; any
other context causes problems, since memory_failure() itself
can call into schedule() via the lru_drain_ bits.

We could probably poison the struct page to avoid a race
between detection and taking corrective action.
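
A minimal sketch of the deferral pattern described above (structure only;
the real patch below uses a per-cpu event queue, and the names here are
illustrative): memory_failure() may sleep, so it is queued to a workqueue
rather than called from the machine check path.

#include <linux/workqueue.h>
#include <linux/mm.h>
#include <linux/signal.h>

static unsigned long pending_pfn;	/* single-slot queue, for illustration */

static void ue_work_fn(struct work_struct *work)
{
	/* Process context: memory_failure() is allowed to sleep here;
	 * same (pfn, SIGBUS, 0) arguments as used in the patch below. */
	memory_failure(pending_pfn, SIGBUS, 0);
}
static DECLARE_WORK(ue_work, ue_work_fn);

/* Called from the MCE path, which must not sleep. */
static void queue_ue_pfn(unsigned long pfn)
{
	pending_pfn = pfn;
	schedule_work(&ue_work);
}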

Signed-off-by: Balbir Singh 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce.c | 70 +--
 1 file changed, 67 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ee148f4..299553e 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -39,11 +39,21 @@ static DEFINE_PER_CPU(struct 
machine_check_event[MAX_MC_EVT], mce_event);
 static DEFINE_PER_CPU(int, mce_queue_count);
 static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT], mce_event_queue);
 
+/* Queue for delayed MCE UE events. */
+static DEFINE_PER_CPU(int, mce_ue_count);
+static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT],
+   mce_ue_event_queue);
+
 static void machine_check_process_queued_event(struct irq_work *work);
+void machine_check_ue_event(struct machine_check_event *evt);
+static void machine_process_ue_event(struct work_struct *work);
+
 static struct irq_work mce_event_process_work = {
 .func = machine_check_process_queued_event,
 };
 
+DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -143,6 +153,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
if (phys_addr != ULONG_MAX) {
mce->u.ue_error.physical_address_provided = true;
mce->u.ue_error.physical_address = phys_addr;
+   machine_check_ue_event(mce);
}
}
return;
@@ -197,6 +208,26 @@ void release_mce_event(void)
get_mce_event(NULL, true);
 }
 
+
+/*
+ * Queue up the MCE event which then can be handled later.
+ */
+void machine_check_ue_event(struct machine_check_event *evt)
+{
+   int index;
+
+   index = __this_cpu_inc_return(mce_ue_count) - 1;
+   /* If queue is full, just return for now. */
+   if (index >= MAX_MC_EVT) {
+   __this_cpu_dec(mce_ue_count);
+   return;
+   }
+   memcpy(this_cpu_ptr(&mce_ue_event_queue[index]), evt, sizeof(*evt));
+
+   /* Queue work to process this event later. */
+   schedule_work(&mce_ue_event_work);
+}
+
 /*
  * Queue up the MCE event which then can be handled later.
  */
@@ -219,7 +250,39 @@ void machine_check_queue_event(void)
/* Queue irq work to process this event later. */
irq_work_queue(&mce_event_process_work);
 }
-
+/*
+ * process pending MCE event from the mce event queue. This function will be
+ * called during syscall exit.
+ */
+static void machine_process_ue_event(struct work_struct *work)
+{
+   int index;
+   struct machine_check_event *evt;
+
+   while (__this_cpu_read(mce_ue_count) > 0) {
+   index = __this_cpu_read(mce_ue_count) - 1;
+   evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+#ifdef CONFIG_MEMORY_FAILURE
+   /*
+* This should probably queued elsewhere, but
+* oh! well
+*/
+   if (evt->error_type == MCE_ERROR_TYPE_UE) {
+   if (evt->u.ue_error.physical_address_provided) {
+   unsigned long pfn;
+
+   pfn = evt->u.ue_error.physical_address >>
+   PAGE_SHIFT;
+   memory_failure(pfn, SIGBUS, 0);
+   } else
+   pr_warn("Failed to identify bad address from "
+   "where the uncorrectable error (UE) "
+   "was generated\n");
+   }
+#endif
+   __this_cpu_dec(mce_ue_count);
+   }
+}
 /*
  * process pending MCE event from the mce event queue. This function will be
  * called during syscall exit.
@@ -227,6 +290,7 @@ void machine_check_queue_event(void)
 static void machine_check_process_queued_event(struct irq_work *work)
 {
int index;
+   struct machine_check_event *evt;
 
add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
 
@@ -236,8 +300,8 @@ static void machine_check_process_queued_event(struct 
irq_work *work)
 */
while (__this_cpu_read(mce_queue_count) > 0) {
index = __this_cpu_read(mce_queue_count) - 1;
-   machine_check_print_event_info(
-   

[PATCH v4 4/5] powerpc/mce: Hookup ierror (instruction) UE errors

2017-09-28 Thread Balbir Singh
Hook up instruction errors (UE) for memory offlining via memory_failure()
in a manner similar to load/store errors (derror). Since we have access
to the NIP, the conversion is a one-step process in this case.
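
The "one step" referred to above, sketched with the helpers visible in this
series (sketch only; the real hunk below also checks severity, error type
and the MCE nesting depth before doing this):

static void ierror_fill_phys_addr(struct pt_regs *regs, uint64_t *phys_addr)
{
	/* The faulting address is the NIP itself, so a single page-table
	 * walk yields the physical address. */
	unsigned long pfn = addr_to_pfn(regs, regs->nip);

	if (pfn != ULONG_MAX)
		*phys_addr = pfn << PAGE_SHIFT;
}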

Signed-off-by: Balbir Singh 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce_power.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 0e584d5..2a888f6 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -488,7 +488,8 @@ static int mce_find_instr_ea_and_pfn(struct pt_regs *regs, 
uint64_t *addr,
 
 static int mce_handle_ierror(struct pt_regs *regs,
const struct mce_ierror_table table[],
-   struct mce_error_info *mce_err, uint64_t *addr)
+   struct mce_error_info *mce_err, uint64_t *addr,
+   uint64_t *phys_addr)
 {
uint64_t srr1 = regs->msr;
int handled = 0;
@@ -540,8 +541,22 @@ static int mce_handle_ierror(struct pt_regs *regs,
}
mce_err->severity = table[i].severity;
mce_err->initiator = table[i].initiator;
-   if (table[i].nip_valid)
+   if (table[i].nip_valid) {
*addr = regs->nip;
+   if (mce_err->severity == MCE_SEV_ERROR_SYNC &&
+   table[i].error_type == MCE_ERROR_TYPE_UE) {
+   unsigned long pfn;
+
+   if (get_paca()->in_mce < MAX_MCE_DEPTH) {
+   pfn = addr_to_pfn(regs, regs->nip);
+   if (pfn != ULONG_MAX) {
+   *phys_addr =
+   (pfn << PAGE_SHIFT);
+   handled = 1;
+   }
+   }
+   }
+   }
return handled;
}
 
@@ -676,7 +691,8 @@ static long mce_handle_error(struct pt_regs *regs,
handled = mce_handle_derror(regs, dtable, &mce_err, &addr,
&phys_addr);
else
-   handled = mce_handle_ierror(regs, itable, &mce_err, &addr);
+   handled = mce_handle_ierror(regs, itable, &mce_err, &addr,
+   &phys_addr);
 
if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
handled = mce_handle_ue_error(regs);
-- 
2.9.5



[PATCH v4 3/5] powerpc/mce: Hookup derror (load/store) UE errors

2017-09-28 Thread Balbir Singh
Extract physical_address for UE errors by walking the page
tables for the mm and address at the NIP, to extract the
instruction. Then use the instruction to find the effective
address via analyse_instr().

We might have page table walking races, but we expect them to
be rare; the physical address extraction is best effort. The idea
is to then hook up this infrastructure to memory_failure() eventually.
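
A condensed sketch of that flow, using the helpers introduced in this
series (the real mce_find_instr_ea_and_pfn() in the diff below differs in
details such as error handling; analyse_instr() is the existing sstep.c
decoder, which for load/store ops returns 0 with the computed effective
address in op.ea):

static unsigned long ue_derror_pfn(struct pt_regs *regs)
{
	unsigned long pfn, instr_addr;
	unsigned int instr;
	struct instruction_op op;

	/* 1. Walk the page tables for the NIP and read the instruction
	 *    through its physical address (we are in real mode here). */
	pfn = addr_to_pfn(regs, regs->nip);
	if (pfn == ULONG_MAX)
		return ULONG_MAX;
	instr_addr = (pfn << PAGE_SHIFT) + (regs->nip & ~PAGE_MASK);
	instr = *(unsigned int *)instr_addr;

	/* 2. Decode it to recover the effective address it was accessing. */
	if (analyse_instr(&op, regs, instr) == 0)
		/* 3. Walk the page tables again for that effective address. */
		return addr_to_pfn(regs, op.ea);

	return ULONG_MAX;
}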

Signed-off-by: Balbir Singh 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/exception-64s.h |  5 ++
 arch/powerpc/include/asm/mce.h   |  2 +-
 arch/powerpc/kernel/exceptions-64s.S |  2 +-
 arch/powerpc/kernel/mce.c|  6 ++-
 arch/powerpc/kernel/mce_power.c  | 87 ++--
 5 files changed, 94 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index 9a31897..b272052 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -55,6 +55,11 @@
 #endif
 
 /*
+ * maximum recursive depth of MCE exceptions
+ */
+#define MAX_MCE_DEPTH  4
+
+/*
  * EX_LR is only used in EXSLB and where it does not overlap with EX_DAR
  * EX_CCR similarly with DSISR, but being 4 byte registers there is a hole
  * in the save area so it's not necessary to overlap them. Could be used
diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 75292c7..3a1226e 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -204,7 +204,7 @@ struct mce_error_info {
 
 extern void save_mce_event(struct pt_regs *regs, long handled,
   struct mce_error_info *mce_err, uint64_t nip,
-  uint64_t addr);
+  uint64_t addr, uint64_t phys_addr);
 extern int get_mce_event(struct machine_check_event *mce, bool release);
 extern void release_mce_event(void);
 extern void machine_check_queue_event(void);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 48da0f5..3a6c8c8 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -232,7 +232,7 @@ BEGIN_FTR_SECTION
addi    r10,r10,1   /* increment paca->in_mce */
sth r10,PACA_IN_MCE(r13)
/* Limit nested MCE to level 4 to avoid stack overflow */
-   cmpwi   r10,4
+   cmpwi   r10,MAX_MCE_DEPTH
bgt 2f  /* Check if we hit limit of 4 */
std r11,GPR1(r1)/* Save r1 on the stack. */
std r11,0(r1)   /* make stack chain pointer */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index fef1408..ee148f4 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -82,7 +82,7 @@ static void mce_set_error_info(struct machine_check_event 
*mce,
  */
 void save_mce_event(struct pt_regs *regs, long handled,
struct mce_error_info *mce_err,
-   uint64_t nip, uint64_t addr)
+   uint64_t nip, uint64_t addr, uint64_t phys_addr)
 {
int index = __this_cpu_inc_return(mce_nest_count) - 1;
struct machine_check_event *mce = this_cpu_ptr(&mce_event[index]);
@@ -140,6 +140,10 @@ void save_mce_event(struct pt_regs *regs, long handled,
} else if (mce->error_type == MCE_ERROR_TYPE_UE) {
mce->u.ue_error.effective_address_provided = true;
mce->u.ue_error.effective_address = addr;
+   if (phys_addr != ULONG_MAX) {
+   mce->u.ue_error.physical_address_provided = true;
+   mce->u.ue_error.physical_address = phys_addr;
+   }
}
return;
 }
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index b76ca19..0e584d5 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -27,6 +27,36 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Convert an address related to an mm to a PFN. NOTE: we are in real
+ * mode, we could potentially race with page table updates.
+ */
+static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
+{
+   pte_t *ptep;
+   unsigned long flags;
+   struct mm_struct *mm;
+
+   if (user_mode(regs))
+   mm = current->mm;
+   else
+   mm = &init_mm;
+
+   local_irq_save(flags);
+   if (mm == current->mm)
+   ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
+   else
+   ptep = find_init_mm_pte(addr, NULL);
+   local_irq_restore(flags);
+   if (!ptep || pte_special(*ptep))
+   return ULONG_MAX;
+   return pte_pfn(*ptep);
+}
 
 static void flush_tlb_206(unsigned int num_sets, unsigned int action)
 {
@@ -421,6 +451,41 @@ static const struct mce_derror_table 

[PATCH v4 2/5] powerpc/mce: align the print of physical address better

2017-09-28 Thread Balbir Singh
Use the same alignment as the effective address and rename
physical address to Page Frame Number.

Signed-off-by: Balbir Singh 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index e254399..fef1408 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -340,7 +340,7 @@ void machine_check_print_event_info(struct 
machine_check_event *evt,
printk("%sEffective address: %016llx\n",
   level, evt->u.ue_error.effective_address);
if (evt->u.ue_error.physical_address_provided)
-   printk("%s  Physical address: %016llx\n",
+   printk("%sPhysical address:  %016llx\n",
   level, evt->u.ue_error.physical_address);
break;
case MCE_ERROR_TYPE_SLB:
-- 
2.9.5



[PATCH v4 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr()

2017-09-28 Thread Balbir Singh
There are no users of get_mce_fault_addr()

Fixes: b63a0ff ("powerpc/powernv: Machine check exception handling.")

Signed-off-by: Balbir Singh 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/mce.h |  2 --
 arch/powerpc/kernel/mce.c  | 39 ---
 2 files changed, 41 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 190d69a..75292c7 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -210,6 +210,4 @@ extern void release_mce_event(void);
 extern void machine_check_queue_event(void);
 extern void machine_check_print_event_info(struct machine_check_event *evt,
   bool user_mode);
-extern uint64_t get_mce_fault_addr(struct machine_check_event *evt);
-
 #endif /* __ASM_PPC64_MCE_H__ */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 9b2ea7e..e254399 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -411,45 +411,6 @@ void machine_check_print_event_info(struct 
machine_check_event *evt,
 }
 EXPORT_SYMBOL_GPL(machine_check_print_event_info);
 
-uint64_t get_mce_fault_addr(struct machine_check_event *evt)
-{
-   switch (evt->error_type) {
-   case MCE_ERROR_TYPE_UE:
-   if (evt->u.ue_error.effective_address_provided)
-   return evt->u.ue_error.effective_address;
-   break;
-   case MCE_ERROR_TYPE_SLB:
-   if (evt->u.slb_error.effective_address_provided)
-   return evt->u.slb_error.effective_address;
-   break;
-   case MCE_ERROR_TYPE_ERAT:
-   if (evt->u.erat_error.effective_address_provided)
-   return evt->u.erat_error.effective_address;
-   break;
-   case MCE_ERROR_TYPE_TLB:
-   if (evt->u.tlb_error.effective_address_provided)
-   return evt->u.tlb_error.effective_address;
-   break;
-   case MCE_ERROR_TYPE_USER:
-   if (evt->u.user_error.effective_address_provided)
-   return evt->u.user_error.effective_address;
-   break;
-   case MCE_ERROR_TYPE_RA:
-   if (evt->u.ra_error.effective_address_provided)
-   return evt->u.ra_error.effective_address;
-   break;
-   case MCE_ERROR_TYPE_LINK:
-   if (evt->u.link_error.effective_address_provided)
-   return evt->u.link_error.effective_address;
-   break;
-   default:
-   case MCE_ERROR_TYPE_UNKNOWN:
-   break;
-   }
-   return 0;
-}
-EXPORT_SYMBOL(get_mce_fault_addr);
-
 /*
  * This function is called in real mode. Strictly no printk's please.
  *
-- 
2.9.5



[PATCH v4 0/5] Revisit MCE handling for UE Errors

2017-09-28 Thread Balbir Singh
This patch series is designed to hook up memory_failure on
UE errors; this is especially helpful for user-mode UE errors.

The first two patches clean up bits and remove dead code.
I could not find any users of get_mce_fault_addr().
The second one improves the printing of the physical address.

The third patch walks kernel/user mode page tables in
real mode to extract the effective address of the instruction
that caused the UE error and the effective address it was
trying to access (for load/store). The fourth patch hooks
up the pfn for instruction UE errors (ierror).

The fifth patch hooks up memory_failure to the MCE path.

TODO:
Log the address in NVRAM, so that we can recover from
bad pages at boot and keep the blacklist persistent.

Changelog v4:
- Add a #define for MAX_MCE_DEPTH instead of hard
coding the value
- Refactor addr_to_pfn to deduce the right parameters
Changelog v3:
- Check for recursive MCE invocations (suggested
by Nick)
Changelog v2:
- Remove pr_warn from real mode to a more delayed
context, where its OK to warn.
- Add a trivial patch to align prints of physical
and effective address
Changelog v1:
- Address review comments from Nick and Mahesh
(initialization of pfn and more comments on failure
when addr_to_pfn() or anaylse_instr() fail)
- Hookup ierrors to the framework as well
(comments from Mahesh)

Balbir Singh (5):
  powerpc/mce.c: Remove unused function get_mce_fault_addr()
  powerpc/mce: align the print of physical address better
  powerpc/mce: Hookup derror (load/store) UE errors
  powerpc/mce: Hookup ierror (instruction) UE errors
  powerpc/mce: hookup memory_failure for UE errors

 arch/powerpc/include/asm/exception-64s.h |   5 ++
 arch/powerpc/include/asm/mce.h   |   4 +-
 arch/powerpc/kernel/exceptions-64s.S |   2 +-
 arch/powerpc/kernel/mce.c| 117 +++
 arch/powerpc/kernel/mce_power.c  | 109 +---
 5 files changed, 181 insertions(+), 56 deletions(-)

-- 
2.9.5



[PATCH] powerpc/powernv: Make opal_event_shutdown() callable from IRQ context

2017-09-28 Thread Michael Ellerman
In opal_event_shutdown() we free all the IRQs hanging off the
opal_event_irqchip. However it's not safe to do so if we're called
from IRQ context, because free_irq() wants to synchronise versus IRQ
context. This can lead to warnings and a stuck system.

For example from sysrq-b:

  Trying to free IRQ 17 from IRQ context!
  [ cut here ]
  WARNING: CPU: 0 PID: 0 at kernel/irq/manage.c:1461 __free_irq+0x398/0x8d0
  ...
  NIP __free_irq+0x398/0x8d0
  LR __free_irq+0x394/0x8d0
  Call Trace:
__free_irq+0x394/0x8d0 (unreliable)
free_irq+0xa4/0x140
opal_event_shutdown+0x128/0x180
opal_shutdown+0x1c/0xb0
pnv_shutdown+0x20/0x40
machine_restart+0x38/0x90
emergency_restart+0x28/0x40
sysrq_handle_reboot+0x24/0x40
__handle_sysrq+0x198/0x590
hvc_poll+0x48c/0x8c0
hvc_handle_interrupt+0x1c/0x50
__handle_irq_event_percpu+0xe8/0x6e0
handle_irq_event_percpu+0x34/0xe0
handle_irq_event+0xc4/0x210
handle_level_irq+0x250/0x770
generic_handle_irq+0x5c/0xa0
opal_handle_events+0x11c/0x240
opal_interrupt+0x38/0x50
__handle_irq_event_percpu+0xe8/0x6e0
handle_irq_event_percpu+0x34/0xe0
handle_irq_event+0xc4/0x210
handle_fasteoi_irq+0x174/0xa10
generic_handle_irq+0x5c/0xa0
__do_irq+0xbc/0x4e0
call_do_irq+0x14/0x24
do_IRQ+0x18c/0x540
hardware_interrupt_common+0x158/0x180

We can avoid that by using disable_irq_nosync() rather than
free_irq(). Although it doesn't fully free the IRQ, it should be
sufficient when we're shutting down, particularly in an emergency.

Add an in_interrupt() check and use free_irq() when we're shutting
down normally. It's probably OK to use disable_irq_nosync() in that
case too, but for now it's safer to leave that behaviour as-is.

Fixes: 9f0fd0499d30 ("powerpc/powernv: Add a virtual irqchip for opal events")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/platforms/powernv/opal-irqchip.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/opal-irqchip.c 
b/arch/powerpc/platforms/powernv/opal-irqchip.c
index ecdcba9d1220..9d1b8c0aaf93 100644
--- a/arch/powerpc/platforms/powernv/opal-irqchip.c
+++ b/arch/powerpc/platforms/powernv/opal-irqchip.c
@@ -174,8 +174,14 @@ void opal_event_shutdown(void)
 
/* First free interrupts, which will also mask them */
for (i = 0; i < opal_irq_count; i++) {
-   if (opal_irqs[i])
+   if (!opal_irqs[i])
+   continue;
+
+   if (in_interrupt())
+   disable_irq_nosync(opal_irqs[i]);
+   else
free_irq(opal_irqs[i], NULL);
+
opal_irqs[i] = 0;
}
 }
-- 
2.7.4



[PATCH] powerpc: Fix workaround for spurious MCE on POWER9

2017-09-28 Thread Michael Neuling
In the recent commit:
  d8bd9f3f09 ("powerpc: Handle MCE on POWER9 with only DSISR bit 30 set")
I screwed up the bit. It should be bit 25 (IBM bit 38).
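
For the bit arithmetic behind the fix (a quick user-space sanity check, not
kernel code): bit 25 counting from the least-significant bit and IBM bit 38
counting from the most-significant bit of the containing 64-bit register
name the same DSISR mask, 0x02000000.

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t bit25 = 1u << 25;		/* conventional LSB=0 numbering */
	uint32_t ibm38 = 1u << (63 - 38);	/* IBM MSB=0 numbering          */

	assert(bit25 == 0x02000000u);
	assert(bit25 == ibm38);
	return 0;
}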

Signed-off-by: Michael Neuling 
---
 arch/powerpc/kernel/mce_power.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index f523125b9d..72f153c6f3 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -626,7 +626,7 @@ long __machine_check_early_realmode_p9(struct pt_regs *regs)
 {
/*
 * On POWER9 DD2.1 and below, it's possible to get a machine check
-* caused by a paste instruction where only DSISR bit 30 is set. This
+* caused by a paste instruction where only DSISR bit 25 is set. This
 * will result in the MCE handler seeing an unknown event and the kernel
 * crashing. An MCE that occurs like this is spurious, so we don't need
 * to do anything in terms of servicing it. If there is something that
@@ -634,7 +634,7 @@ long __machine_check_early_realmode_p9(struct pt_regs *regs)
 * correct DSISR so that it can be serviced properly. So detect this
 * case and mark it as handled.
 */
-   if (SRR1_MC_LOADSTORE(regs->msr) && regs->dsisr == 0x40000000)
+   if (SRR1_MC_LOADSTORE(regs->msr) && regs->dsisr == 0x02000000)
return 1;
 
return mce_handle_error(regs, mce_p9_derror_table, mce_p9_ierror_table);
-- 
2.11.0



[PATCH v2 6/6] powerpc/powernv: implement NMI IPI with OPAL_SIGNAL_SYSTEM_RESET

2017-09-28 Thread Nicholas Piggin
This allows MSR[EE]=0 lockups to be detected on an OPAL (bare metal)
system similarly to the hcall NMI IPI on pseries guests, when the
platform/firmware supports it.
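
The powernv/smp.c hunk wiring this up is not reproduced in full below, so
here is a minimal sketch of the idea (the function name, hardware-id lookup
and error handling are assumptions, not taken from the patch): the NMI IPI
is delivered by asking OPAL to system-reset the target CPU.

#include <asm/opal.h>	/* opal_signal_system_reset(), OPAL_SUCCESS */
#include <asm/smp.h>	/* get_hard_smp_processor_id() */

/* Sketch only: illustrative name and logic. */
static int pnv_cause_nmi_ipi(int cpu)
{
	int64_t rc;

	/* OPAL addresses CPUs by hardware id, not the Linux cpu number. */
	rc = opal_signal_system_reset(get_hard_smp_processor_id(cpu));

	return rc == OPAL_SUCCESS;
}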

This is an example of CPU10 spinning with interrupts hard disabled:

Watchdog CPU:32 detected Hard LOCKUP other CPUS:10
Watchdog CPU:10 Hard LOCKUP
CPU: 10 PID: 4410 Comm: bash Not tainted 4.13.0-rc7-00074-ge89ce1f89f62-dirty 
#34
task: c003a82b4400 task.stack: c003af55c000
NIP: c00a7b38 LR: c0659044 CTR: c00a7b00
REGS: cfd23d80 TRAP: 0100   Not tainted  
(4.13.0-rc7-00074-ge89ce1f89f62-dirty)
MSR: 900c1033 
CR: 2842  XER: 2000
CFAR: c00a7b38 SOFTE: 0
GPR00: c0659044 c003af55fbb0 c1072a00 0078
GPR04: c003c81b5c80 c003c81cc7e8 90009033 
GPR08:  c00a7b00 0001 90001003
GPR12: c00a7b00 cfd83200 10180df8 10189e60
GPR16: 10189ed8 10151270 1018bd88 1018de78
GPR20: 370a0668 0001 101645e0 10163c10
GPR24: 7fffd14d6294 7fffd14d6290 c0fba6f0 0004
GPR28: c0f351d8 0078 c0f4095c 
NIP [c00a7b38] sysrq_handle_xmon+0x38/0x40
LR [c0659044] __handle_sysrq+0xe4/0x270
Call Trace:
[c003af55fbd0] [c0659044] __handle_sysrq+0xe4/0x270
[c003af55fc70] [c0659810] write_sysrq_trigger+0x70/0xa0
[c003af55fca0] [c03da650] proc_reg_write+0xb0/0x110
[c003af55fcf0] [c03423bc] __vfs_write+0x6c/0x1b0
[c003af55fd90] [c0344398] vfs_write+0xd8/0x240
[c003af55fde0] [c034632c] SyS_write+0x6c/0x110
[c003af55fe30] [c000b220] system_call+0x58/0x6c

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/opal-api.h|  1 +
 arch/powerpc/include/asm/opal.h|  2 +
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 arch/powerpc/platforms/powernv/setup.c |  1 +
 arch/powerpc/platforms/powernv/smp.c   | 52 ++
 5 files changed, 57 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 450a60b81d2a..9d191ebea706 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -188,6 +188,7 @@
 #define OPAL_XIVE_DUMP 142
 #define OPAL_XIVE_RESERVED3143
 #define OPAL_XIVE_RESERVED4144
+#define OPAL_SIGNAL_SYSTEM_RESET   145
 #define OPAL_NPU_INIT_CONTEXT  146
 #define OPAL_NPU_DESTROY_CONTEXT   147
 #define OPAL_NPU_MAP_LPAR  148
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 726c23304a57..7d7613c49f2b 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -281,6 +281,8 @@ int opal_get_power_shift_ratio(u32 handle, int token, u32 
*psr);
 int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
 int opal_sensor_group_clear(u32 group_hndl, int token);
 
+int64_t opal_signal_system_reset(int32_t cpu);
+
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
   int depth, void *data);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S 
b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 8c1ede2d3f7e..37cd170201a2 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -307,6 +307,7 @@ OPAL_CALL(opal_xive_get_vp_info,
OPAL_XIVE_GET_VP_INFO);
 OPAL_CALL(opal_xive_set_vp_info,   OPAL_XIVE_SET_VP_INFO);
 OPAL_CALL(opal_xive_sync,  OPAL_XIVE_SYNC);
 OPAL_CALL(opal_xive_dump,  OPAL_XIVE_DUMP);
+OPAL_CALL(opal_signal_system_reset,   OPAL_SIGNAL_SYSTEM_RESET);
 OPAL_CALL(opal_npu_init_context,   OPAL_NPU_INIT_CONTEXT);
 OPAL_CALL(opal_npu_destroy_context,OPAL_NPU_DESTROY_CONTEXT);
 OPAL_CALL(opal_npu_map_lpar,   OPAL_NPU_MAP_LPAR);
diff --git a/arch/powerpc/platforms/powernv/setup.c 
b/arch/powerpc/platforms/powernv/setup.c
index 897aa1400eb8..cf52d53da460 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -282,6 +282,7 @@ static void __init pnv_setup_machdep_opal(void)
ppc_md.restart = pnv_restart;
pm_power_off = pnv_power_off;
ppc_md.halt = pnv_halt;
+   /* ppc_md.system_reset_exception gets filled in by pnv_smp_init() */
ppc_md.machine_check_exception = opal_machine_check;
ppc_md.mce_check_early_recovery = opal_mce_check_early_recovery;
ppc_md.hmi_exception_early = opal_hmi_exception_early;
diff --git 

[PATCH v2 5/6] powerpc/64s: Implement system reset idle wakeup reason

2017-09-28 Thread Nicholas Piggin
It is possible to wake from idle due to a system reset exception, in
which case the CPU takes a system reset interrupt to wake from idle,
with system reset as the wakeup reason.

The regular (not idle wakeup) system reset interrupt handler must be
invoked in this case, otherwise the system reset interrupt is lost.

Handle the system reset interrupt immediately after CPU state has been
restored.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/irq.c | 41 ++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 4e65bf82f5e0..4813b83b22aa 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -394,11 +394,19 @@ bool prep_irq_for_idle_irqsoff(void)
 /*
  * Take the SRR1 wakeup reason, index into this table to find the
  * appropriate irq_happened bit.
+ *
+ * Sytem reset exceptions taken in idle state also come through here,
+ * but they are NMI interrupts so do not need to wait for IRQs to be
+ * restored, and should be taken as early as practical. These are marked
+ * with 0xff in the table. The Power ISA specifies 0100b as the system
+ * reset interrupt reason.
  */
+#define IRQ_SYSTEM_RESET   0xff
+
 static const u8 srr1_to_lazyirq[0x10] = {
0, 0, 0,
PACA_IRQ_DBELL,
-   0,
+   IRQ_SYSTEM_RESET,
PACA_IRQ_DBELL,
PACA_IRQ_DEC,
0,
@@ -407,15 +415,42 @@ static const u8 srr1_to_lazyirq[0x10] = {
PACA_IRQ_HMI,
0, 0, 0, 0, 0 };
 
+static noinline void replay_system_reset(void)
+{
+   struct pt_regs regs;
+
+   ppc_save_regs(&regs);
+   regs.trap = 0x100;
+   get_paca()->in_nmi = 1;
+   system_reset_exception(&regs);
+   get_paca()->in_nmi = 0;
+}
+
 void irq_set_pending_from_srr1(unsigned long srr1)
 {
unsigned int idx = (srr1 & SRR1_WAKEMASK_P8) >> 18;
+   u8 reason = srr1_to_lazyirq[idx];
+
+   /*
+* Take the system reset now, which is immediately after registers
+* are restored from idle. It's an NMI, so interrupts need not be
+* re-enabled before it is taken.
+*/
+   if (unlikely(reason == IRQ_SYSTEM_RESET)) {
+   replay_system_reset();
+   return;
+   }
 
/*
 * The 0 index (SRR1[42:45]=b0000) must always evaluate to 0,
-* so this can be called unconditionally with srr1 wake reason.
+* so this can be called unconditionally with the SRR1 wake
+* reason as returned by the idle code, which uses 0 to mean no
+* interrupt.
+*
+* If a future CPU was to designate this as an interrupt reason,
+* then a new index for no interrupt must be assigned.
 */
-   local_paca->irq_happened |= srr1_to_lazyirq[idx];
+   local_paca->irq_happened |= reason;
 }
 #endif /* CONFIG_PPC_BOOK3S */
 
-- 
2.13.3



[PATCH v2 4/6] powerpc/xmon: avoid tripping SMP hardlockup watchdog

2017-09-28 Thread Nicholas Piggin
The SMP hardlockup watchdog cross-checks other CPUs for lockups, which
causes xmon headaches because xmon assumes that having interrupts hard
disabled means no watchdog trouble. Try to improve that by calling
touch_nmi_watchdog() in obvious places where secondaries are spinning.

Also annotate these spin loops with spin_begin/end calls.
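
The idiom applied below, shown in isolation (the function and condition
names are placeholders; only the pattern matters): wrap the busy-wait in
spin_begin()/spin_end(), yield with spin_cpu_relax(), and keep poking the
watchdog so a long wait inside xmon is not flagged as a hard lockup.

#include <linux/nmi.h>		/* touch_nmi_watchdog() */
#include <asm/processor.h>	/* spin_begin/spin_cpu_relax/spin_end */

static void wait_for_event(volatile int *event)
{
	spin_begin();
	while (!*event) {
		spin_cpu_relax();
		touch_nmi_watchdog();
	}
	spin_end();
}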

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/xmon/xmon.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 33351c6704b1..d9a12102b111 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -530,14 +530,19 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
 
  waiting:
secondary = 1;
+   spin_begin();
while (secondary && !xmon_gate) {
if (in_xmon == 0) {
-   if (fromipi)
+   if (fromipi) {
+   spin_end();
goto leave;
+   }
secondary = test_and_set_bit(0, &in_xmon);
}
-   barrier();
+   spin_cpu_relax();
+   touch_nmi_watchdog();
}
+   spin_end();
 
if (!secondary && !xmon_gate) {
/* we are the first cpu to come in */
@@ -568,21 +573,25 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
mb();
xmon_gate = 1;
barrier();
+   touch_nmi_watchdog();
}
 
  cmdloop:
while (in_xmon) {
if (secondary) {
+   spin_begin();
if (cpu == xmon_owner) {
if (!test_and_set_bit(0, &xmon_taken)) {
secondary = 0;
+   spin_end();
continue;
}
/* missed it */
while (cpu == xmon_owner)
-   barrier();
+   spin_cpu_relax();
}
-   barrier();
+   spin_cpu_relax();
+   touch_nmi_watchdog();
} else {
cmd = cmds(regs);
if (cmd != 0) {
-- 
2.13.3



[PATCH v2 3/6] powerpc/watchdog: do not trigger SMP crash from touch_nmi_watchdog

2017-09-28 Thread Nicholas Piggin
In xmon, touch_nmi_watchdog() is not expected to be checking that
other CPUs have not touched the watchdog, so the code will just
call touch_nmi_watchdog() once before re-enabling hard interrupts.

Just update our CPU's state, and ignore apparently stuck SMP threads.

Arguably touch_nmi_watchdog should check for SMP lockups, and
callers should be fixed, but that's not trivial for the input code
of xmon.
---
 arch/powerpc/kernel/watchdog.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index 920e61c79f47..1fb9379dc683 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -277,9 +277,12 @@ void arch_touch_nmi_watchdog(void)
 {
unsigned long ticks = tb_ticks_per_usec * wd_timer_period_ms * 1000;
int cpu = smp_processor_id();
+   u64 tb = get_tb();
 
-   if (get_tb() - per_cpu(wd_timer_tb, cpu) >= ticks)
-   watchdog_timer_interrupt(cpu);
+   if (tb - per_cpu(wd_timer_tb, cpu) >= ticks) {
+   per_cpu(wd_timer_tb, cpu) = tb;
+   wd_smp_clear_cpu_pending(cpu, tb);
+   }
 }
 EXPORT_SYMBOL(arch_touch_nmi_watchdog);
 
-- 
2.13.3



[PATCH v2 2/6] powerpc/watchdog: do not backtrace locked CPUs twice if allcpus backtrace is enabled

2017-09-28 Thread Nicholas Piggin
If sysctl_hardlockup_all_cpu_backtrace is enabled, there is no need to
IPI stuck CPUs for backtrace before trigger_allbutself_cpu_backtrace(),
which does the same thing again.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/watchdog.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index 532a1adbe89b..920e61c79f47 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -133,15 +133,18 @@ static void watchdog_smp_panic(int cpu, u64 tb)
pr_emerg("Watchdog CPU:%d detected Hard LOCKUP other CPUS:%*pbl\n",
cpu, cpumask_pr_args(&wd_smp_cpus_pending));
 
-   /*
-* Try to trigger the stuck CPUs.
-*/
-   for_each_cpu(c, &wd_smp_cpus_pending) {
-   if (c == cpu)
-   continue;
-   smp_send_nmi_ipi(c, wd_lockup_ipi, 100);
+   if (!sysctl_hardlockup_all_cpu_backtrace) {
+   /*
+* Try to trigger the stuck CPUs, unless we are going to
+* get a backtrace on all of them anyway.
+*/
+   for_each_cpu(c, &wd_smp_cpus_pending) {
+   if (c == cpu)
+   continue;
+   smp_send_nmi_ipi(c, wd_lockup_ipi, 100);
+   }
+   smp_flush_nmi_ipi(100);
}
-   smp_flush_nmi_ipi(100);
 
/* Take the stuck CPUs out of the watch group */
set_cpumask_stuck(&wd_smp_cpus_pending, tb);
-- 
2.13.3



[PATCH v2 1/6] powerpc/watchdog: do not panic from locked CPU's IPI handler

2017-09-28 Thread Nicholas Piggin
The SMP watchdog will detect locked CPUs and IPI them to print a
backtrace and registers. If panic on hard lockup is enabled, do
not panic from this handler, because that can cause recursion into
the IPI layer during the panic.

The caller already panics in this case.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/watchdog.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index 2f6eadd9408d..532a1adbe89b 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -97,8 +97,7 @@ static void wd_lockup_ipi(struct pt_regs *regs)
else
dump_stack();
 
-   if (hardlockup_panic)
-   nmi_panic(regs, "Hard LOCKUP");
+   /* Do not panic from here because that can recurse into NMI IPI layer */
 }
 
 static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
-- 
2.13.3



[PATCH v2 0/6] More NMI IPI enablement work

2017-09-28 Thread Nicholas Piggin
Since last post, the first 4 patches are unchanged.

Split the last patch into 2, tidied up a few things, and removed
the DD1 workarounds because firmware is not going to support DD1.

Then re-tested with upstream firmware which has now merged support
for OPAL_SIGNAL_SYSTEM_RESET on POWER9 DD2.


Nicholas Piggin (6):
  powerpc/watchdog: do not panic from locked CPU's IPI handler
  powerpc/watchdog: do not backtrace locked CPUs twice if allcpus
backtrace is enabled
  powerpc/watchdog: do not trigger SMP crash from touch_nmi_watchdog
  powerpc/xmon: avoid tripping SMP hardlockup watchdog
  powerpc/64s: Implement system reset idle wakeup reason
  powerpc/powernv: implement NMI IPI with OPAL_SIGNAL_SYSTEM_RESET

 arch/powerpc/include/asm/opal-api.h|  1 +
 arch/powerpc/include/asm/opal.h|  2 +
 arch/powerpc/kernel/irq.c  | 41 ++--
 arch/powerpc/kernel/watchdog.c | 29 --
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 arch/powerpc/platforms/powernv/setup.c |  1 +
 arch/powerpc/platforms/powernv/smp.c   | 52 ++
 arch/powerpc/xmon/xmon.c   | 17 +++--
 8 files changed, 125 insertions(+), 19 deletions(-)

-- 
2.13.3



Re: [PATCH kernel v3] vfio/spapr: Add cond_resched() for huge updates

2017-09-28 Thread David Gibson
On Thu, Sep 28, 2017 at 07:16:12PM +1000, Alexey Kardashevskiy wrote:
> Clearing very big IOMMU tables can trigger soft lockups. This adds
> cond_resched() to allow the scheduler to do context switching when
> it decides to.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
> 
> The testcase is POWER9 box with 264GB guest, 4 VFIO devices from
> independent IOMMU groups, 64K IOMMU pages. This configuration produces
> 4325376 TCE entries, each entry update incurs 4 OPAL calls to update
> an individual PE TCE cache; this produced lockups for more than 20s.
> Reducing table size to 4194304 (i.e. 256GB guest) or removing one
> of 4 VFIO devices makes the problem go away.
> 
> ---
> Changes:
> v3:
> * cond_resched() checks for should_resched() so we just call resched()
> and let the cpu scheduler decide whether to switch or not
> 
> v2:
> * replaced with time based solution
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 63112c36ab2d..759a5bdd40e1 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -507,6 +507,8 @@ static int tce_iommu_clear(struct tce_container 
> *container,
>   enum dma_data_direction direction;
>  
>   for ( ; pages; --pages, ++entry) {
> + cond_resched();
> +
>   direction = DMA_NONE;
>   oldhpa = 0;
>   ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


[PATCH] powerpc/vio: dispose of virq mapping on vdevice unregister

2017-09-28 Thread Tyrel Datwyler
When a vdevice is DLPAR removed from the system the vio subsystem doesn't
bother unmapping the virq from the irq_domain. As a result we have a virq
mapped to a hardware irq that is no longer valid for the irq_domain. A side
effect is that we are left with /proc/irq/ affinity entries, and
attempts to modify the smp_affinity of the irq will fail.

In the following observed example the kernel log is spammed by
ics_rtas_set_affinity errors after the removal of a VSCSI adapter. This is a
result of irqbalance trying to adjust the affinity every 10 seconds.

rpadlpar_io: slot U8408.E8E.10A7ACV-V5-C25 removed
ics_rtas_set_affinity: ibm,set-xive irq=655385 returns -3
ics_rtas_set_affinity: ibm,set-xive irq=655385 returns -3

This patch fixes the issue by calling irq_dispose_mapping() on the virq of the
viodev on unregister.

Cc: sta...@vger.kernel.org
Signed-off-by: Tyrel Datwyler 
---
 arch/powerpc/kernel/vio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 5f8dcda..a3228d6 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1577,6 +1577,8 @@ static struct device_attribute vio_dev_attrs[] = {
 void vio_unregister_device(struct vio_dev *viodev)
 {
device_unregister(&viodev->dev);
+   if (viodev->family == VDEVICE)
+   irq_dispose_mapping(viodev->irq);
 }
 EXPORT_SYMBOL(vio_unregister_device);
 
-- 
1.7.12.4



Re: [PATCH v3 00/20] Speculative page faults

2017-09-28 Thread Andrew Morton
On Thu, 28 Sep 2017 14:29:02 +0200 Laurent Dufour  
wrote:

> > Laurent's [0/n] provides some nice-looking performance benefits for
> > workloads which are chosen to show performance benefits(!) but, alas,
> > no quantitative testing results for workloads which we may suspect will
> > be harmed by the changes(?).  Even things as simple as impact upon
> > single-threaded pagefault-intensive workloads and its effect upon
> > CONFIG_SMP=n .text size?
> 
> I forgot to mention in my previous email the impact on the .text section.
> 
> Here are the metrics I got :
> 
> .text size    UP        SMP       Delta
> 4.13-mmotm    8444201   8964137   6.16%
> '' +spf       8452041   8971929   6.15%
>   Delta       0.09%     0.09%
> 
> No major impact as you could see.

8k text increase seems rather a lot actually.  That's a lot more
userspace cachelines that get evicted during a fault...

Is the feature actually beneficial on uniprocessor?


[RFC PATCH for 4.14] membarrier powerpc: Move hook to switch_mm()

2017-09-28 Thread Mathieu Desnoyers
Nick has a valid point that the sched_in() hook is a fast-path compared
to switch_mm(). Adding an extra TIF test in a fast-path to save a
barrier in a comparatively slow-path is therefore not such a good idea
overall.

Therefore, move the architecture hook to switch_mm() instead.

[ This patch is aimed at Paul's tree. It applies on top of
  "membarrier: Provide register expedited private command (v4)" and
  "membarrier: Document scheduler barrier requirements (v4)". ]

Signed-off-by: Mathieu Desnoyers 
CC: Peter Zijlstra 
CC: Paul E. McKenney 
CC: Nicholas Piggin 
CC: Boqun Feng 
CC: Andrew Hunter 
CC: Maged Michael 
CC: gro...@google.com
CC: Avi Kivity 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Dave Watson 
CC: Alan Stern 
CC: Will Deacon 
CC: Andy Lutomirski 
CC: Ingo Molnar 
CC: Alexander Viro 
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
---
 arch/powerpc/include/asm/membarrier.h |  9 -
 arch/powerpc/mm/mmu_context.c |  7 +++
 include/linux/sched/mm.h  | 13 -
 kernel/sched/core.c   |  6 ++
 4 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/membarrier.h 
b/arch/powerpc/include/asm/membarrier.h
index 588154c1cf57..61152a7a3cf9 100644
--- a/arch/powerpc/include/asm/membarrier.h
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -1,8 +1,8 @@
 #ifndef _ASM_POWERPC_MEMBARRIER_H
 #define _ASM_POWERPC_MEMBARRIER_H
 
-static inline void membarrier_arch_sched_in(struct task_struct *prev,
-   struct task_struct *next)
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+   struct mm_struct *next, struct task_struct *tsk)
 {
/*
 * Only need the full barrier when switching between processes.
@@ -11,9 +11,8 @@ static inline void membarrier_arch_sched_in(struct 
task_struct *prev,
 * when switching from userspace to kernel is not needed after
 * store to rq->curr.
 */
-   if (likely(!test_ti_thread_flag(task_thread_info(next),
-   TIF_MEMBARRIER_PRIVATE_EXPEDITED)
-   || !prev->mm || prev->mm == next->mm))
+   if (likely(!test_ti_thread_flag(task_thread_info(tsk),
+   TIF_MEMBARRIER_PRIVATE_EXPEDITED) || !prev))
return;
 
/*
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 0f613bc63c50..22f5c91cdc38 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -12,6 +12,7 @@
 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -67,6 +68,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 *
 * On the read side the barrier is in pte_xchg(), which orders
 * the store to the PTE vs the load of mm_cpumask.
+*
+* This full barrier is needed by membarrier when switching
+* between processes after store to rq->curr, before user-space
+* memory accesses.
 */
smp_mb();
 
@@ -89,6 +94,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 
if (new_on_cpu)
radix_kvm_prefetch_workaround(next);
+   else
+   membarrier_arch_switch_mm(prev, next, tsk);
 
/*
 * The actual HW switching method differs between the various
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 1bd10c2c0893..d5a9ab8f3836 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -215,8 +215,8 @@ static inline void memalloc_noreclaim_restore(unsigned int 
flags)
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
 #include 
 #else
-static inline void membarrier_arch_sched_in(struct task_struct *prev,
-   struct task_struct *next)
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+   struct mm_struct *next, struct task_struct *tsk)
 {
 }
 static inline void membarrier_arch_fork(struct task_struct *t,
@@ -232,11 +232,6 @@ static inline void 
membarrier_arch_register_private_expedited(
 }
 #endif
 
-static inline void membarrier_sched_in(struct task_struct *prev,
-   struct task_struct *next)
-{
-   membarrier_arch_sched_in(prev, next);
-}
 static inline void membarrier_fork(struct task_struct *t,
unsigned long clone_flags)
 {
@@ -252,8 +247,8 @@ static inline void membarrier_execve(struct task_struct *t)
membarrier_arch_execve(t);
 }
 #else
-static inline void 

[PATCH v2] powerpc: fix compile error on 64K pages on 40x, 44x

2017-09-28 Thread Christian Lamparter
The mmu context on the 40x, 44x does not define a pte_frag
entry. This causes gcc to abort the compilation due to:

setup-common.c: In function ‘setup_arch’:
setup-common.c:908: error: ‘mm_context_t’ has no ‘pte_frag’

This patch fixes the issue by adding additional guard
conditions, that limit the initialization to just
platforms that define the pte_frag variable in the
mmu context.

Signed-off-by: Christian Lamparter 
---
 arch/powerpc/kernel/setup-common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 0ac741fae90e..4a22ec6b34de 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -904,7 +904,8 @@ void __init setup_arch(char **cmdline_p)
 #endif
 #endif
 
-#ifdef CONFIG_PPC_64K_PAGES
+#if defined(CONFIG_PPC_64K_PAGES) && \
+(defined(CONFIG_PPC_BOOK3S_64) || defined(CONFIG_PPC_BOOK3E_MMU))
init_mm.context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-- 
2.14.2



Re: [PATCH v4 for 4.14 1/3] membarrier: Provide register expedited private command

2017-09-28 Thread Mathieu Desnoyers
- On Sep 28, 2017, at 12:16 PM, Nicholas Piggin npig...@gmail.com wrote:

> On Thu, 28 Sep 2017 15:29:50 + (UTC)
> Mathieu Desnoyers  wrote:
> 
>> - On Sep 28, 2017, at 11:01 AM, Nicholas Piggin npig...@gmail.com wrote:
>> 
>> > On Thu, 28 Sep 2017 13:31:36 + (UTC)
>> > Mathieu Desnoyers  wrote:
>> >   
>> >> - On Sep 27, 2017, at 9:04 AM, Nicholas Piggin npig...@gmail.com 
>> >> wrote:
>> >>   
> 
> [snip]
> 
>> >> So I don't see much point in trying to remove that registration step.
>> > 
>> > I don't follow you. You are talking about the concept of registering
>> > intention to use a different function? And the registration API is not
>> > merged yet?
>> 
>> Yes, I'm talking about requiring processes to invoke membarrier cmd
>> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED before they can successfully
>> invoke membarrier cmd MEMBARRIER_CMD_PRIVATE_EXPEDITED.
>> 
>> > Let me say I'm not completely against the idea of a registration API. But
>> > don't think registration for this expedited command is necessary.
>> 
>> Given that we have the powerpc lack-of-full-barrier-on-return-to-userspace
>> case now, and we foresee x86-sysexit, sparc, and alpha also requiring
>> special treatment when we introduce the MEMBARRIER_FLAG_SYNC_CORE behavior
>> in the next release, it seems that we'll have a hard time handling
>> architecture special cases efficiently if we don't expose the registration
>> API right away.
> 
> But SYNC_CORE is a different functionality, right? You can add the
> registration API for it when that goes in.

Sure, I could. However, I was hoping to re-use the same command, with a
"SYNC_CORE" flag, and I would have liked to have consistent behavior
for same commands used with different flags.

> 
>> > But (aside) let's say a tif flag turns out to be a good diea for your
>> > second case, why not just check the flag in the membarrier sys call and
>> > do the registration the first time it uses it?
>> 
>> We also considered that option. It's mainly about guaranteeing that
>> an expedited membarrier command never blocks. If we introduce this
>> "lazy auto-registration" behavior, we end up blocking the process
>> at a random point in its execution so we can issue a synchronize_sched().
>> By exposing an explicit registration, we can control where this delay
>> occurs, and even allow library constructors to invoke the registration
>> while the process is a single threaded, therefore allowing us to completely
>> skip synchronize_sched().
> 
> Okay I guess that could be a good reason. As I said I'm not opposed to
> the concept. I suppose you could even have a registration for expedited
> private even if it's a no-op on all architectures, just in case some new
> ways of implementing it can be done in future.

That's an approach I would be OK with too. Mandating explicit registration
will give us much more flexibility.

> I suppose I'm more objecting to the added complexity for powerpc, and
> more code in the fastpath to make the slowpath faster.

Just to make sure I understand your concern here. The "fastpath" you
refer to is the TIF flag test in membarrier_sched_in() within
finish_task_switch(), and the "slowpath" is switch_mm() which lacks
the required full barrier now, am I correct ?

Would it help if we invoke the membarrier hook from switch_mm()
instead ? We'd therefore only add the TIF flag test in switch_mm(),
rather than for every context switch.

Thanks,

Mathieu


> 
> Thanks,
> Nick

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [EXTERNAL]Re: FSL serial 166550 errata broken?

2017-09-28 Thread Joakim Tjernlund
On Thu, 2017-09-28 at 17:54 +0200, Joakim Tjernlund wrote:
> On Wed, 2017-09-27 at 15:32 +, York Sun wrote:
> > On 09/27/2017 04:03 AM, Joakim Tjernlund wrote:
> > > On Mon, 2017-09-25 at 17:26 +, York Sun wrote:
> > > > On 09/25/2017 09:55 AM, Joakim Tjernlund wrote:
> > > > > We got some "broken" boards(mpx8321) where UART RX is held low(BREAK)
> > > > > There we get a few:
> > > > > serial8250: too much work for irq18
> > > > > and the board freezes.
> > > > > 
> > > > > Looking into the driver/CPU there is an erratum (A-004737) w.r.t. BREAK 
> > > > > handling
> > > > > and I can see we are hitting the irq function fsl8250_handle_irq() 
> > > > > added
> > > > > in commit: 9deaa53ac7fa373623123aa4f18828dd62292b1a
> > > > >"serial: add irq handler for Freescale 16550 errata."
> > > > > For all I can tell this workaround is broken and I cannot figure out 
> > > > > why.
> > > > > Any pointers?
> > > > > 
> > > > 
> > > > Jocke,
> > > > 
> > > > Did you mean MPC8321?
> > > > 
> > > > I personally don't have experience debugging this erratum. Have you
> > > > tried to contact the patch author Paul Gortmaker to see how he managed
> > > > to get it work?
> > > 
> > > No, but I just found out it is u-boot related. If I use a very old 
> > > u-boot it works.
> > > Between then and now we did a massive upgrade of u-boot and now it 
> > > breaks. And no,
> > > bisect not possible due to local u-boot mods :)
> > 
> > How old? It is a good sign that an old U-Boot works. Git bisect would be 
> > a good tool if practical. Sometimes I have to reapply some local changes 
> > for every step of bisect to make it useful. You mileage may vary though.
> > 
> > > 
> > > Any idea what could be different? I cannot see and I have tested
> > > what I could see/think of but now I am out of ideas.
> > 
> > I don't believe we have much change for MPC8321 for a long time. If any 
> > change has impact on kernel but not U-Boot itself, it may be other 
> > errata workarounds.
> > 
> > I presume this happens on your own board. If I am in your shoes, I would 
> > try to reduce the number of local commits and reapply them to various 
> > U-Boot tags to further narrow down the search scope. In the mean time, 
> > getting register dump to compare the good and bad may be another way to go.
> 
> God, this was no fun exercise but I think I found the offending commit: 
> 82dda962f0a6449672df3378bb6b5fe54372a927
> serial: Unconditionally enable CONFIG_SERIAL_MULTI
> 
> Enable CONFIG_SERIAL_MULTI for all builds of U-Boot. That includes
> both SPL builds and non-SPL builds, everything. To avoid poluting
> this patch with removal of ifdef-endif constructions containing
> CONFIG_SERIAL_MULTI, the CONFIG_SERIAL_MULTI is temporarily added
> into CPPFLAGS in config.mk . This will be again removed in following
> patch.
> 

Ok, unless I init ttyS1 in u-boot too, an IRQ storm will ensue if BREAK is
present when opening ttyS1:
+   /* Must init the second RS232 or IRQ storm happens
+* when BREAK is constant on while opening ttyS1 */
+   NS16550_init((NS16550_t)CONFIG_SYS_NS16550_COM2,
+ns16550_calc_divisor((NS16550_t)CONFIG_SYS_NS16550_COM2,
+ CONFIG_SYS_NS16550_CLK,
+ CONFIG_BAUDRATE));

I guess this is a kernel bug too; the driver should clear/init the needed
state before starting the device. Fixing that is not on my menu though :)
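
For reference, a rough sketch (untested, hypothetical helper, assuming it sits
in the 8250 driver where serial_port_in() is available) of the kind of
clear/init I mean on the kernel side:

#include <linux/serial_reg.h>

/* Drain stale RX/BREAK state before enabling interrupts, so a line held
 * in BREAK cannot storm the handler as soon as the port is opened.
 * Reading LSR in the loop also clears the latched BI/error bits. */
static void drain_stale_rx(struct uart_port *port)
{
	unsigned int limit = 256;

	while ((serial_port_in(port, UART_LSR) & UART_LSR_DR) && --limit)
		serial_port_in(port, UART_RX);
}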

I also noted that the FSL custom irq handler, fsl8250_handle_irq, does not
handle DMA like the standard one does. I guess it needs some love too.

 Jocke

Re: [PATCH v4 for 4.14 1/3] membarrier: Provide register expedited private command

2017-09-28 Thread Nicholas Piggin
On Thu, 28 Sep 2017 17:51:15 +0200
Peter Zijlstra  wrote:

> On Fri, Sep 29, 2017 at 01:01:12AM +1000, Nicholas Piggin wrote:
> > That's fine. If a user is not bound to a subset of CPUs, they could
> > also cause disturbances with other syscalls and faults, taking locks,
> > causing tlb flushes and IPIs and things.  
> 
> So on the big SGI class machines we've had trouble with
> for_each_cpu() loops before, and IIRC the biggest Power box is not too
> far from that 1-2K CPUs.

This is a loop in process context, interrupts on, can reschedule, no
locks held though.

The biggest power boxes are more tightly coupled than those big
SGI systems, but even so just plodding along taking and releasing
locks in turn would be fine on those SGI ones as well really. Not DoS
level. This is not a single mega hot cache line or lock that is
bouncing over the entire machine, but one process grabbing a line and
lock from each of 1000 CPUs.

Slight disturbance sure, but each individual CPU will see it as 1/1000th
of a disturbance, most of the cost will be concentrated in the syscall
caller.

> 
> Bouncing that lock across the machine is *painful*, I have vague
> memories of cases where the lock ping-pong was most of the time spent.
> 
> But only Power needs this, all the other architectures are fine with the
> lockless approach for MEMBAR_EXPEDITED_PRIVATE.

Yes, we can add an iterator function that power can override in a few
lines. Less arch specific code than this proposal.

> 
> The ISYNC variant of the same however appears to want TIF flags or
> something to aid a number of archs, the rq->lock will not help there.

The SYNC_CORE? Yes it seems different. I think that's another issue
though.

Thanks,
Nick


Re: [PATCH v4 for 4.14 1/3] membarrier: Provide register expedited private command

2017-09-28 Thread Nicholas Piggin
On Thu, 28 Sep 2017 15:29:50 + (UTC)
Mathieu Desnoyers  wrote:

> - On Sep 28, 2017, at 11:01 AM, Nicholas Piggin npig...@gmail.com wrote:
> 
> > On Thu, 28 Sep 2017 13:31:36 + (UTC)
> > Mathieu Desnoyers  wrote:
> >   
> >> - On Sep 27, 2017, at 9:04 AM, Nicholas Piggin npig...@gmail.com wrote:
> >>   

[snip]

> >> So I don't see much point in trying to remove that registration step.  
> > 
> > I don't follow you. You are talking about the concept of registering
> > intention to use a different function? And the registration API is not
> > merged yet?  
> 
> Yes, I'm talking about requiring processes to invoke membarrier cmd
> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED before they can successfully
> invoke membarrier cmd MEMBARRIER_CMD_PRIVATE_EXPEDITED.
> 
> > Let me say I'm not completely against the idea of a registration API. But
> > don't think registration for this expedited command is necessary.  
> 
> Given that we have the powerpc lack-of-full-barrier-on-return-to-userspace
> case now, and we foresee x86-sysexit, sparc, and alpha also requiring
> special treatment when we introduce the MEMBARRIER_FLAG_SYNC_CORE behavior
> in the next release, it seems that we'll have a hard time handling
> architecture special cases efficiently if we don't expose the registration
> API right away.

But SYNC_CORE is a different functionality, right? You can add the
registration API for it when that goes in.

> > But (aside) let's say a tif flag turns out to be a good idea for your 
> > second case, why not just check the flag in the membarrier sys call and
> > do the registration the first time it uses it?  
> 
> We also considered that option. It's mainly about guaranteeing that
> an expedited membarrier command never blocks. If we introduce this
> "lazy auto-registration" behavior, we end up blocking the process
> at a random point in its execution so we can issue a synchronize_sched().
> By exposing an explicit registration, we can control where this delay
> occurs, and even allow library constructors to invoke the registration
> while the process is still single-threaded, therefore allowing us to completely
> skip synchronize_sched().

Okay I guess that could be a good reason. As I said I'm not opposed to
the concept. I suppose you could even have a registration for expedited
private even if it's a no-op on all architectures, just in case some new
ways of implementing it can be done in future.

I suppose I'm more objecting to the added complexity for powerpc, and
more code in the fastpath to make the slowpath faster.

Thanks,
Nick


Re: [EXTERNAL] Re: FSL serial 16550 errata broken?

2017-09-28 Thread Joakim Tjernlund
On Wed, 2017-09-27 at 15:32 +, York Sun wrote:
> On 09/27/2017 04:03 AM, Joakim Tjernlund wrote:
> > On Mon, 2017-09-25 at 17:26 +, York Sun wrote:
> > > On 09/25/2017 09:55 AM, Joakim Tjernlund wrote:
> > > > We got some "broken" boards(mpx8321) where UART RX is held low(BREAK)
> > > > There we get a few:
> > > > serial8250: too much work for irq18
> > > > and the board freezes.
> > > > 
> > > > Looking into the driver/CPU there is an erratum (A-004737) w.r.t. BREAK handling
> > > > and I can see we are hitting the irq function fsl8250_handle_irq() added
> > > > in commit: 9deaa53ac7fa373623123aa4f18828dd62292b1a
> > > >"serial: add irq handler for Freescale 16550 errata."
> > > > As far as I can tell this workaround is broken and I cannot figure out why.
> > > > Any pointers?
> > > > 
> > > 
> > > Jocke,
> > > 
> > > Did you mean MPC8321?
> > > 
> > > I personally don't have experience debugging this erratum. Have you
> > > tried to contact the patch author Paul Gortmaker to see how he managed
> > > to get it to work?
> > 
> > No, but I just found out it is u-boot related. If I use a very old u-boot 
> > it works.
> > Between then and now we did a massive upgrade of u-boot and now it breaks. 
> > And no,
> > bisect not possible due to local u-boot mods :)
> 
> How old? It is a good sign that an old U-Boot works. Git bisect would be 
> a good tool if practical. Sometimes I have to reapply some local changes 
> for every step of bisect to make it useful. Your mileage may vary though.
> 
> > 
> > Any idea what could be different? I cannot see and I have tested
> > what I could see/think of but now I am out of ideas.
> 
> I don't believe we have had many changes for MPC8321 in a long time. If any 
> change has impact on kernel but not U-Boot itself, it may be other 
> errata workarounds.
> 
> I presume this happens on your own board. If I were in your shoes, I would 
> try to reduce the number of local commits and reapply them to various 
> U-Boot tags to further narrow down the search scope. In the mean time, 
> getting register dump to compare the good and bad may be another way to go.

God, this was no fun exercise but I think I found the offending commit: 
82dda962f0a6449672df3378bb6b5fe54372a927
serial: Unconditionally enable CONFIG_SERIAL_MULTI

Enable CONFIG_SERIAL_MULTI for all builds of U-Boot. That includes
both SPL builds and non-SPL builds, everything. To avoid poluting
this patch with removal of ifdef-endif constructions containing
CONFIG_SERIAL_MULTI, the CONFIG_SERIAL_MULTI is temporarily added
into CPPFLAGS in config.mk . This will be again removed in following
patch.



Re: [PATCH v4 for 4.14 1/3] membarrier: Provide register expedited private command

2017-09-28 Thread Peter Zijlstra
On Fri, Sep 29, 2017 at 01:01:12AM +1000, Nicholas Piggin wrote:
> That's fine. If a user is not bound to a subset of CPUs, they could
> also cause disturbances with other syscalls and faults, taking locks,
> causing tlb flushes and IPIs and things.

So on the big SGI class machines we've had trouble with
for_each_cpu() loops before, and IIRC the biggest Power box is not too
far from that 1-2K CPUs.

Bouncing that lock across the machine is *painful*, I have vague
memories of cases where the lock ping-pong was most of the time spent.

But only Power needs this, all the other architectures are fine with the
lockless approach for MEMBAR_EXPEDITED_PRIVATE.

The ISYNC variant of the same however appears to want TIF flags or
something to aid a number of archs, the rq->lock will not help there.


Re: [PATCH v4 for 4.14 1/3] membarrier: Provide register expedited private command

2017-09-28 Thread Mathieu Desnoyers
- On Sep 28, 2017, at 11:01 AM, Nicholas Piggin npig...@gmail.com wrote:

> On Thu, 28 Sep 2017 13:31:36 + (UTC)
> Mathieu Desnoyers  wrote:
> 
>> - On Sep 27, 2017, at 9:04 AM, Nicholas Piggin npig...@gmail.com wrote:
>> 
>> > On Tue, 26 Sep 2017 20:43:28 + (UTC)
>> > Mathieu Desnoyers  wrote:
>> >   
>> >> - On Sep 26, 2017, at 1:51 PM, Mathieu Desnoyers
>> >> mathieu.desnoy...@efficios.com wrote:
>> >>
   
[...]

>> Therefore,
>> you end up with the same rq lock disruption as if you would iterate on all
>> online CPUs. If userspace does that in a loop, you end up, in PeterZ's words,
>> with an Insta-DoS.
> 
> I really don't see how that can be true. A spinlock by definition is for
> sharing of resources, it's not an insta-DoS just because you take shared
> spinlocks!

[...]

>> 
>> > 
>> > For the powerpc approach, yes there is some controversy about using
>> > runqueue locks even for cpus that we already can interfere with, but I
>> > think we have a lot of options we could look at *after* it ever shows
>> > up as a problem.
>> 
>> The DoS argument from Peter seems to be a strong opposition to grabbing
>> the rq locks.
> 
> Well if I still can't unconvince you, then we should try testing that
> theory.

[ I'll let PeterZ pitch in on this part of the discussion ]

> 
>> 
>> Here is another point in favor of having a register command for the
>> private membarrier: This gives us greater flexibility to improve the
>> kernel scheduler and return-to-userspace barriers if need be in the
>> future.
>> 
>> For instance, I plan to propose a "MEMBARRIER_FLAG_SYNC_CORE" flag
>> that will also provide guarantees about context synchronization of
>> all cores for memory reclaim performed by JIT for the next merge
>> window. So far, the following architectures seems to have the proper
>> core serializing instructions already in place when returning to
>> user-space: x86 (iret), powerpc (rfi), arm32/64 (return from exception,
>> eret), s390/x (lpswe), ia64 (rfi), parisc (issue at least 7 instructions
>> while singing around a bonfire), and mips SMP (eret).
>> 
>> So far, AFAIU, only x86 (eventually going through sysexit), alpha
>> (appears to require an explicit imb), and sparc (explicit flush + 5
>> instructions around a similar bonfire as parisc) appear to require special
>> handling.
>> 
>> I therefore plan to use the registration step with a
>> MEMBARRIER_FLAG_SYNC_CORE flag set to set TIF flags and add the
>> required context synchronizing barriers on sched_in() only for
>> processes wishing to use private expedited membarrier.
>> 
>> So I don't see much point in trying to remove that registration step.
> 
> I don't follow you. You are talking about the concept of registering
> intention to use a different function? And the registration API is not
> merged yet?

Yes, I'm talking about requiring processes to invoke membarrier cmd
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED before they can successfully
invoke membarrier cmd MEMBARRIER_CMD_PRIVATE_EXPEDITED.

> Let me say I'm not completely against the idea of a registration API. But
> don't think registration for this expedited command is necessary.

Given that we have the powerpc lack-of-full-barrier-on-return-to-userspace
case now, and we foresee x86-sysexit, sparc, and alpha also requiring
special treatment when we introduce the MEMBARRIER_FLAG_SYNC_CORE behavior
in the next release, it seems that we'll have a hard time handling
architecture special cases efficiently if we don't expose the registration
API right away.

> 
> But (aside) let's say a tif flag turns out to be a good idea for your
> second case, why not just check the flag in the membarrier sys call and
> do the registration the first time it uses it?

We also considered that option. It's mainly about guaranteeing that
an expedited membarrier command never blocks. If we introduce this
"lazy auto-registration" behavior, we end up blocking the process
at a random point in its execution so we can issue a synchronize_sched().
By exposing an explicit registration, we can control where this delay
occurs, and even allow library constructors to invoke the registration
while the process is still single-threaded, therefore allowing us to completely
skip synchronize_sched().
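
As an illustration of that usage model, a minimal userspace sketch (assuming
the new command ends up exported through the uapi header as proposed in this
series):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* Run from a library constructor, while the process is still
 * single-threaded, so later expedited calls never have to block. */
__attribute__((constructor))
static void register_membarrier_private_expedited(void)
{
	membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
}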

Thanks,

Mathieu

> 
> Thanks,
> Nick

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v4 for 4.14 1/3] membarrier: Provide register expedited private command

2017-09-28 Thread Nicholas Piggin
On Thu, 28 Sep 2017 13:31:36 + (UTC)
Mathieu Desnoyers  wrote:

> - On Sep 27, 2017, at 9:04 AM, Nicholas Piggin npig...@gmail.com wrote:
> 
> > On Tue, 26 Sep 2017 20:43:28 + (UTC)
> > Mathieu Desnoyers  wrote:
> >   
> >> - On Sep 26, 2017, at 1:51 PM, Mathieu Desnoyers
> >> mathieu.desnoy...@efficios.com wrote:
> >>   
> >> > Provide a new command allowing processes to register their intent to use
> >> > the private expedited command.
> >> > 
> >> 
> >> I missed a few maintainers that should have been CC'd. Adding them now.
> >> This patch is aimed to go through Paul E. McKenney's tree.  
> > 
> > Honestly this is pretty ugly new user API and fairly large amount of
> > complexity just to avoid the powerpc barrier. And you end up with arch
> > specific hooks anyway!
> > 
> > So my plan was to add an arch-overridable loop primitive that iterates
> > over all running threads for an mm.  
> 
> Iterating over all threads for an mm is one possible solution that has
> been listed at the membarrier BOF at LPC. We ended up dismissing this
> solution because it would be inefficient for processes which have
> lots of threads (e.g. Java).

Sorry I meant hardware threads, I should have said "CPUs".

> 
> > powerpc will use its mm_cpumask for
> > iterating and use runqueue locks to avoid the barrier.  
> 
> This is another solution which has been rejected during the LPC BOF.
> 
> What I gather from past threads is that the mm_cpumask's bits on powerpc
> are pretty much only being set, never much cleared. Therefore, over the
> lifetime of a thread which is not affined to specific processors, chances
> are that this cpumask will end up having all cores on the system.

That's fine. If a user is not bound to a subset of CPUs, they could
also cause disturbances with other syscalls and faults, taking locks,
causing tlb flushes and IPIs and things.

> Therefore,
> you end up with the same rq lock disruption as if you would iterate on all
> online CPUs. If userspace does that in a loop, you end up, in PeterZ's words,
> with an Insta-DoS.

I really don't see how that can be true. A spinlock by definition is for
sharing of resources, it's not an insta-DoS just because you take shared
spinlocks!

You can contend other common locks or resources too. Linux simply does not
have guaranteed strict isolation particularly when you allow threads to be
scheduled together on CPUs, so this can't be used arbitrarily.

If it was taking all locks at once that's one thing which has always been
good policy to avoid. But it's not, any single rq lock will only be taken
and released for a very short time, far shorter than a normal context
switch. And the entire operation will be moderated by the cost of the
syscall, and the number of runqueues it has to iterate.
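
To make the shape of that concrete, a rough sketch (scheduler-internal
helpers, error handling and the actual ordering logic elided; not the real
patch) of walking the mm_cpumask and taking each runqueue lock in turn:

static void membarrier_rq_lock_iterate(struct mm_struct *mm)
{
	int cpu;

	for_each_cpu(cpu, mm_cpumask(mm)) {
		struct rq *rq = cpu_rq(cpu);
		struct rq_flags rf;

		/* Each lock is held only briefly; taking it orders us
		 * against that CPU's context switch without an IPI. */
		rq_lock_irqsave(rq, &rf);
		rq_unlock_irqrestore(rq, &rf);
		cond_resched();
	}
}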

There's better ways to cause DoS. Start lots of threads and burn cycles,
bounce between CPUs with affinity, sleep and wake one another between remote
CPUs etc. Run out of memory, cause hash collisions on various hashes that
are easy to control, cause lots of TLB flush IPIs...

The runqueue lock is not really special. Might equally complain about a
zone page allocator or lru lock, or an inode mutex or common dentry lock.

> The name may sound cool, but I gather that this is not
> a service the kernel willingly wants to provide to userspace.
> 
> A cunning process could easily find a way to fill its mm_cpumask and then
> issue membarrier in a loop to bring a large system to its knees.

I still don't see how. Nothing that you couldn't do with other resources or
configurations of threads and system calls. Sure you might cause a
disturbance and admin might notice and kill it.

> 
> > x86 will most
> > likely want to use its mm_cpumask to iterate.  
> 
> Iterating on mm_cpumask rather than online cpus adds complexity wrt memory
> barriers (unless we go for rq locks approach). We'd need, in addition to
> ensure that we have the proper barriers before/after store to rq->curr,
> to also ensure that we have the proper barriers between mm_cpumask
> updates and user-space loads/stores. Considering that we're not aiming
> at taking the rq locks anyway, iteration over all online cpus seems
> less complex than iterating on mm_cpumask on the architectures that
> keep track of it.

Well x86 does not *have* to implement it. I thought it probably would
like to, and I didn't think its barriers would have been all that
complex when I last looked, but didn't look too closely.

> 
> > 
> > For the powerpc approach, yes there is some controversy about using
> > runqueue locks even for cpus that we already can interfere with, but I
> > think we have a lot of options we could look at *after* it ever shows
> > up as a problem.  
> 
> The DoS argument from Peter seems to be a strong opposition to grabbing
> the rq locks.

Well if I still can't unconvince you, then we should try testing that
theory.

> 
> Here is another point in 

[PATCH] selftests/powerpc: Use snprintf to construct DSCR sysfs interface paths

2017-09-28 Thread Seth Forshee
Currently sprintf is used, and while paths should never exceed
the size of the buffer, it is theoretically possible since
dirent.d_name is 256 bytes. As a result this trips
-Wformat-overflow, and since the test is built with -Wall -Werror
this causes the build to fail. Switch to using snprintf and skip
any paths which are too long for the filename buffer.
Signed-off-by: Seth Forshee 
---
 tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c
index 17fb1b43c320..1899bd85121f 100644
--- a/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c
+++ b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c
@@ -53,6 +53,8 @@ static int check_all_cpu_dscr_defaults(unsigned long val)
}
 
while ((dp = readdir(sysfs))) {
+   int len;
+
if (!(dp->d_type & DT_DIR))
continue;
if (!strcmp(dp->d_name, "cpuidle"))
@@ -60,7 +62,9 @@ static int check_all_cpu_dscr_defaults(unsigned long val)
if (!strstr(dp->d_name, "cpu"))
continue;
 
-   sprintf(file, "%s%s/dscr", CPU_PATH, dp->d_name);
+   len = snprintf(file, LEN_MAX, "%s%s/dscr", CPU_PATH, dp->d_name);
+   if (len >= LEN_MAX)
+   continue;
if (access(file, F_OK))
continue;
 
-- 
2.14.1



[PATCH] powerpc: Always initialize input array when calling epapr_hypercall()

2017-09-28 Thread Seth Forshee
Several callers to epapr_hypercall() pass an uninitialized stack-allocated
array for the input arguments, presumably because they have no input
arguments. However, this can produce errors like this one:

 arch/powerpc/include/asm/epapr_hcalls.h:470:42: error: 'in' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  unsigned long register r3 asm("r3") = in[0];
~~^~~

Fix callers to this function to always zero-initialize the input
arguments array to prevent this.

Signed-off-by: Seth Forshee 
---
 arch/powerpc/include/asm/epapr_hcalls.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/epapr_hcalls.h b/arch/powerpc/include/asm/epapr_hcalls.h
index 334459ad145b..90863245df53 100644
--- a/arch/powerpc/include/asm/epapr_hcalls.h
+++ b/arch/powerpc/include/asm/epapr_hcalls.h
@@ -508,7 +508,7 @@ static unsigned long epapr_hypercall(unsigned long *in,
 
 static inline long epapr_hypercall0_1(unsigned int nr, unsigned long *r2)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
unsigned long r;
 
@@ -520,7 +520,7 @@ static inline long epapr_hypercall0_1(unsigned int nr, 
unsigned long *r2)
 
 static inline long epapr_hypercall0(unsigned int nr)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
 
return epapr_hypercall(in, out, nr);
@@ -528,7 +528,7 @@ static inline long epapr_hypercall0(unsigned int nr)
 
 static inline long epapr_hypercall1(unsigned int nr, unsigned long p1)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
 
in[0] = p1;
@@ -538,7 +538,7 @@ static inline long epapr_hypercall1(unsigned int nr, 
unsigned long p1)
 static inline long epapr_hypercall2(unsigned int nr, unsigned long p1,
unsigned long p2)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
 
in[0] = p1;
@@ -549,7 +549,7 @@ static inline long epapr_hypercall2(unsigned int nr, 
unsigned long p1,
 static inline long epapr_hypercall3(unsigned int nr, unsigned long p1,
unsigned long p2, unsigned long p3)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
 
in[0] = p1;
@@ -562,7 +562,7 @@ static inline long epapr_hypercall4(unsigned int nr, 
unsigned long p1,
unsigned long p2, unsigned long p3,
unsigned long p4)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
 
in[0] = p1;
-- 
2.14.1



Re: [PATCH v4 for 4.14 1/3] membarrier: Provide register expedited private command

2017-09-28 Thread Mathieu Desnoyers
- On Sep 27, 2017, at 9:04 AM, Nicholas Piggin npig...@gmail.com wrote:

> On Tue, 26 Sep 2017 20:43:28 + (UTC)
> Mathieu Desnoyers  wrote:
> 
>> - On Sep 26, 2017, at 1:51 PM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>> > Provide a new command allowing processes to register their intent to use
>> > the private expedited command.
>> >   
>> 
>> I missed a few maintainers that should have been CC'd. Adding them now.
>> This patch is aimed to go through Paul E. McKenney's tree.
> 
> Honestly this is pretty ugly new user API and fairly large amount of
> complexity just to avoid the powerpc barrier. And you end up with arch
> specific hooks anyway!
> 
> So my plan was to add an arch-overridable loop primitive that iterates
> over all running threads for an mm.

Iterating over all threads for an mm is one possible solution that has
been listed at the membarrier BOF at LPC. We ended up dismissing this
solution because it would be inefficient for processes which have
lots of threads (e.g. Java).

> powerpc will use its mm_cpumask for
> iterating and use runqueue locks to avoid the barrier.

This is another solution which has been rejected during the LPC BOF.

What I gather from past threads is that the mm_cpumask's bits on powerpc
are pretty much only being set, never much cleared. Therefore, over the
lifetime of a thread which is not affined to specific processors, chances
are that this cpumask will end up having all cores on the system. Therefore,
you end up with the same rq lock disruption as if you would iterate on all
online CPUs. If userspace does that in a loop, you end up, in PeterZ's words,
with an Insta-DoS. The name may sound cool, but I gather that this is not
a service the kernel willingly wants to provide to userspace.

A cunning process could easily find a way to fill its mm_cpumask and then
issue membarrier in a loop to bring a large system to its knees.

> x86 will most
> likely want to use its mm_cpumask to iterate.

Iterating on mm_cpumask rather than online cpus adds complexity wrt memory
barriers (unless we go for rq locks approach). We'd need, in addition to
ensure that we have the proper barriers before/after store to rq->curr,
to also ensure that we have the proper barriers between mm_cpumask
updates and user-space loads/stores. Considering that we're not aiming
at taking the rq locks anyway, iteration over all online cpus seems
less complex than iterating on mm_cpumask on the architectures that
keep track of it.

> 
> For the powerpc approach, yes there is some controversy about using
> runqueue locks even for cpus that we already can interfere with, but I
> think we have a lot of options we could look at *after* it ever shows
> up as a problem.

The DoS argument from Peter seems to be a strong opposition to grabbing
the rq locks.

Here is another point in favor of having a register command for the
private membarrier: This gives us greater flexibility to improve the
kernel scheduler and return-to-userspace barriers if need be in the
future.

For instance, I plan to propose a "MEMBARRIER_FLAG_SYNC_CORE" flag
that will also provide guarantees about context synchronization of
all cores for memory reclaim performed by JIT for the next merge
window. So far, the following architectures seems to have the proper
core serializing instructions already in place when returning to
user-space: x86 (iret), powerpc (rfi), arm32/64 (return from exception,
eret), s390/x (lpswe), ia64 (rfi), parisc (issue at least 7 instructions
while singing around a bonfire), and mips SMP (eret).

So far, AFAIU, only x86 (eventually going through sysexit), alpha
(appears to require an explicit imb), and sparc (explicit flush + 5
instructions around a similar bonfire as parisc) appear to require special
handling.

I therefore plan to use the registration step with a
MEMBARRIER_FLAG_SYNC_CORE flag set to set TIF flags and add the
required context synchronizing barriers on sched_in() only for
processes wishing to use private expedited membarrier.

So I don't see much point in trying to remove that registration step.

Thanks,

Mathieu


> 
> Thanks,
> Nick

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] powerpc/powernv: Use early_radix_enabled in POWER9 tlb flush

2017-09-28 Thread Meng YK Li
Nick,
 
I applied your patch to Linux kernel 4.13, rebuilt it, installed it, and ran
a test; the OS still cannot boot. The system hangs at the same place as before.
Would you please have a look?

The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
[  104.755810] kexec_core: Starting new kernel
[  126.157029744,5] OPAL: Switch to big-endian OS

Best Regards!
Meng Li
IBM OpenPOWER Application Engineer

- Original message -
From: Nicholas Piggin 
To: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin , Jeremy Kerr , Meng YK Li 
Subject: [PATCH] powerpc/powernv: Use early_radix_enabled in POWER9 tlb flush
Date: Wed, Sep 27, 2017 1:46 PM

This code is used at boot and machine checks, so it should be using
early_radix_enabled() (which is usable any time).

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce_power.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index b76ca198e09c..d37e612050b5 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -128,7 +128,7 @@ void __flush_tlb_power9(unsigned int action)
 {
 	unsigned int num_sets;
 
-	if (radix_enabled())
+	if (early_radix_enabled())
 		num_sets = POWER9_TLB_SETS_RADIX;
 	else
 		num_sets = POWER9_TLB_SETS_HASH;
-- 
2.13.3
 



Re: [PATCH v3 00/20] Speculative page faults

2017-09-28 Thread Laurent Dufour

Hi Andrew,

On 26/09/2017 01:34, Andrew Morton wrote:

On Mon, 25 Sep 2017 09:27:43 -0700 Alexei Starovoitov 
 wrote:


On Mon, Sep 18, 2017 at 12:15 AM, Laurent Dufour
 wrote:

Despite the unprovable lockdep warning raised by Sergey, I didn't get any
feedback on this series.

Is there a chance to get it moved upstream ?


what is the status ?
We're eagerly looking forward for this set to land,
since we have several use cases for tracing that
will build on top of this set as discussed at Plumbers.


There has been sadly little review and testing so far :(

I'll be taking a close look at it all over the next couple of weeks.

One terribly important thing (especially for a patchset this large and
intrusive) is the rationale for merging it: the justification, usually
in the form of end-user benefit.

Laurent's [0/n] provides some nice-looking performance benefits for
workloads which are chosen to show performance benefits(!) but, alas,
no quantitative testing results for workloads which we may suspect will
be harmed by the changes(?).  Even things as simple as impact upon
single-threaded pagefault-intensive workloads and its effect upon
CONFIG_SMP=n .text size?


I forgot to mention in my previous email the impact on the .text section.

Here are the metrics I got :

.text size    UP       SMP      Delta
4.13-mmotm    8444201  8964137  6.16%
  '' + spf    8452041  8971929  6.15%
Delta         0.09%    0.09%

No major impact, as you can see.

Thanks,
Laurent


If you have additional usecases then please, spell them out for us in
full detail so we can better understand the benefits which this
patchset provides.





Re: [PATCH v3 00/20] Speculative page faults

2017-09-28 Thread Laurent Dufour

Hi,

On 26/09/2017 01:34, Andrew Morton wrote:

On Mon, 25 Sep 2017 09:27:43 -0700 Alexei Starovoitov 
 wrote:


On Mon, Sep 18, 2017 at 12:15 AM, Laurent Dufour
 wrote:

Despite the unprovable lockdep warning raised by Sergey, I didn't get any
feedback on this series.

Is there a chance to get it moved upstream ?


what is the status ?
We're eagerly looking forward for this set to land,
since we have several use cases for tracing that
will build on top of this set as discussed at Plumbers.


There has been sadly little review and testing so far :(


I do agree, and I can only encourage people to do so :/


I'll be taking a close look at it all over the next couple of weeks.


Thanks Andrew for giving it a close look.


One terribly important thing (especially for a patchset this large and
intrusive) is the rationale for merging it: the justification, usually
in the form of end-user benefit.


The benefit is only for multi-threaded processes. But even on *small* 
systems with 16 CPUs, there is a real benefit.




Laurent's [0/n] provides some nice-looking performance benefits for
workloads which are chosen to show performance benefits(!) but, alas,
no quantitative testing results for workloads which we may suspect will
be harmed by the changes(?).


I did test with kernbench, which involves gcc/ld (not multi-threaded, 
AFAIK), and I didn't see any impact.

But if you know additional test I should give a try, please advise.

Regarding ebizzy, it was designed to simulate a web server's activity, so 
I guess there will be improvements when running real web servers.



 Even things as simple as impact upon
single-threaded pagefault-intensive workloads and its effect upon
CONFIG_SMP=n .text size?

If you have additional usecases then please, spell them out for us in
full detail so we can better understand the benefits which this
patchset provides.


The other use-case I'm aware of is an in-memory database, where the 
performance improvement is really significant, as I mentioned in the 
header of my series.


Cheers,
Laurent.



RE: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparison

2017-09-28 Thread David Laight
From: Simon Guo
> Sent: 27 September 2017 19:34
...
> > On X86 all the AVX registers are caller saved, the system call
> > entry could issue the instruction that invalidates them all.
> > Kernel code running in the context of a user process could then
> > use the registers without saving them.
> > It would only need to set a mark to ensure they are invalidated
> > again on return to user (might be cheap enough to do anyway).
> > Dunno about PPC though.
> 
> I am not aware of any ppc instruction which can set a "mark" or provide
> a fine-grained flag for the validity of a single vector register or a
> subgroup of them. But ppc experts may want to correct me.

I was just thinking of a software flag.

David



[PATCH kernel v3] vfio/spapr: Add cond_resched() for huge updates

2017-09-28 Thread Alexey Kardashevskiy
Clearing very big IOMMU tables can trigger soft lockups. This adds
cond_resched() to allow the scheduler to do context switching when
it decides to.

Signed-off-by: Alexey Kardashevskiy 
---

The testcase is POWER9 box with 264GB guest, 4 VFIO devices from
independent IOMMU groups, 64K IOMMU pages. This configuration produces
4325376 TCE entries, each entry update incurs 4 OPAL calls to update
an individual PE TCE cache; this produced lockups for more than 20s.
Reducing table size to 4194304 (i.e. 256GB guest) or removing one
of 4 VFIO devices makes the problem go away.
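
For reference, the numbers follow directly from that geometry:
264 GiB / 64 KiB = 4325376 TCEs and 256 GiB / 64 KiB = 4194304 TCEs; at
4 OPAL calls per entry the full clear issues roughly 17.3 million OPAL
calls back to back, which is consistent with the >20s stalls observed.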

---
Changes:
v3:
* cond_resched() checks should_resched() internally, so we just call
cond_resched() and let the cpu scheduler decide whether to switch or not

v2:
* replaced with time based solution
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 63112c36ab2d..759a5bdd40e1 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -507,6 +507,8 @@ static int tce_iommu_clear(struct tce_container *container,
enum dma_data_direction direction;
 
for ( ; pages; --pages, ++entry) {
+   cond_resched();
+
direction = DMA_NONE;
oldhpa = 0;
ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction);
-- 
2.11.0



Re: [PATCH 1/1] KVM: PPC: Book3S: Fix server always zero from kvmppc_xive_get_xive()

2017-09-28 Thread Benjamin Herrenschmidt
On Thu, 2017-09-28 at 11:45 +1000, David Gibson wrote:
> On Tue, Sep 26, 2017 at 04:47:04PM +1000, Sam Bobroff wrote:
> > In KVM's XICS-on-XIVE emulation, kvmppc_xive_get_xive() returns the
> > value of state->guest_server as "server". However, this value is not
> > set by it's counterpart kvmppc_xive_set_xive(). When the guest uses
> > this interface to migrate interrupts away from a CPU that is going
> > offline, it sees all interrupts as belonging to CPU 0, so they are
> > left assigned to (now) offline CPUs.
> > 
> > This patch removes the guest_server field from the state, and returns
> > act_server in its place (that is, the CPU actually handling the
> > interrupt, which may differ from the one requested).
> > 
> > Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE
> > interrupt controller")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Sam Bobroff 
> > ---
> > The other obvious way to patch this would be to set state->guest_server in
> > kvmppc_xive_set_xive() and that does also work because act_server is usually
> > equal to guest_server.
> > 
> > However, in the cases where guest_server differed from act_server, the guest
> > would only move IRQs correctly if it got act_server (the CPU actually 
> > handling
> > the interrupt) here. So, that approach seemed better.
> 
> Paolo, again this is a pretty urgent fix for KVM on Power and Paulus
> is away.  We're hoping BenH will ack shortly (he's the logical
> technical reviewer), after which can you merge this direct into the
> KVM staging tree?  (RHBZ 1477391, and we suspect several more are
> related).

Acked-by: Benjamin Herrenschmidt 

As a subsequent cleanup we should probably rename act_server to server.

Note: We know of a remaining theoretical race with CPU unplug that isn't
fixed yet. If an interrupt is already in the queue of the CPU calling
xics_migrate_irqs_away (guest), then that irq never gets pulled out of
that queue and thus the bug this patch is fixing will re-occur.

The fix isn't trivial; I'm working on it, though I'm tempted to make some
assumptions about how Linux does things to keep it (much) simpler.

I'll elaborate later (at Kernel Recipes right now)

Cheers,
Ben.

> >  arch/powerpc/kvm/book3s_xive.c | 5 ++---
> >  arch/powerpc/kvm/book3s_xive.h | 1 -
> >  2 files changed, 2 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> > index 13304622ab1c..bf457843e032 100644
> > --- a/arch/powerpc/kvm/book3s_xive.c
> > +++ b/arch/powerpc/kvm/book3s_xive.c
> > @@ -622,7 +622,7 @@ int kvmppc_xive_get_xive(struct kvm *kvm, u32 irq, u32 
> > *server,
> > return -EINVAL;
> > state = >irq_state[idx];
> > arch_spin_lock(>lock);
> > -   *server = state->guest_server;
> > +   *server = state->act_server;
> > *priority = state->guest_priority;
> > arch_spin_unlock(>lock);
> >  
> > @@ -1331,7 +1331,7 @@ static int xive_get_source(struct kvmppc_xive *xive, 
> > long irq, u64 addr)
> > xive->saved_src_count++;
> >  
> > /* Convert saved state into something compatible with xics */
> > -   val = state->guest_server;
> > +   val = state->act_server;
> > prio = state->saved_scan_prio;
> >  
> > if (prio == MASKED) {
> > @@ -1507,7 +1507,6 @@ static int xive_set_source(struct kvmppc_xive *xive, 
> > long irq, u64 addr)
> > /* First convert prio and mark interrupt as untargetted */
> > act_prio = xive_prio_from_guest(guest_prio);
> > state->act_priority = MASKED;
> > -   state->guest_server = server;
> >  
> > /*
> >  * We need to drop the lock due to the mutex below. Hopefully
> > diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> > index 5938f7644dc1..6ba63f8e8a61 100644
> > --- a/arch/powerpc/kvm/book3s_xive.h
> > +++ b/arch/powerpc/kvm/book3s_xive.h
> > @@ -35,7 +35,6 @@ struct kvmppc_xive_irq_state {
> > struct xive_irq_data *pt_data;  /* XIVE Pass-through associated data */
> >  
> > /* Targetting as set by guest */
> > -   u32 guest_server;   /* Current guest selected target */
> > u8 guest_priority;  /* Guest set priority */
> > u8 saved_priority;  /* Saved priority when masking */
> >  
> 
>