Re: [PATCH] mm: add account_locked_vm utility function

2019-05-23 Thread Alexey Kardashevskiy



On 21/05/2019 01:30, Daniel Jordan wrote:
> On Mon, May 20, 2019 at 04:19:34PM +1000, Alexey Kardashevskiy wrote:
>> On 04/05/2019 06:16, Daniel Jordan wrote:
>>> locked_vm accounting is done roughly the same way in five places, so
>>> unify them in a helper.  Standardize the debug prints, which vary
>>> slightly.
>>
>> And I rather liked that prints were different and tell precisely which
>> one of three each printk is.
> 
> I'm not following.  One of three...callsites?  But there were five callsites.


Well, 3 of them are mine, I was referring to them :)


> Anyway, I added a _RET_IP_ to the debug print so you can differentiate.


I did not know that existed, cool!


> 
>> I commented below but in general this seems working.
>>
>> Tested-by: Alexey Kardashevskiy 
> 
> Thanks!  And for the review as well.
> 
>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>> index 6b64e45a5269..d39a1b830d82 100644
>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>> @@ -34,49 +35,13 @@
>>>  static void tce_iommu_detach_group(void *iommu_data,
>>> struct iommu_group *iommu_group);
>>>  
>>> -static long try_increment_locked_vm(struct mm_struct *mm, long npages)
>>> +static int tce_account_locked_vm(struct mm_struct *mm, unsigned long npages,
>>> +bool inc)
>>>  {
>>> -   long ret = 0, locked, lock_limit;
>>> -
>>> if (WARN_ON_ONCE(!mm))
>>> return -EPERM;
>>
>>
>> If this WARN_ON is the only reason for having tce_account_locked_vm()
>> instead of calling account_locked_vm() directly, you can then ditch the
>> check as I have never ever seen this triggered.
> 
> Great, will do.
> 
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index d0f731c9920a..15ac76171ccd 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -273,25 +273,14 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>> return -ESRCH; /* process exited */
>>>  
>>> ret = down_write_killable(&mm->mmap_sem);
>>> -   if (!ret) {
>>> -   if (npage > 0) {
>>> -   if (!dma->lock_cap) {
>>> -   unsigned long limit;
>>> -
>>> -   limit = task_rlimit(dma->task,
>>> -   RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>>> -
>>> -   if (mm->locked_vm + npage > limit)
>>> -   ret = -ENOMEM;
>>> -   }
>>> -   }
>>> +   if (ret)
>>> +   goto out;
>>
>>
>> A single "goto" to jump just 3 lines below seems unnecessary.
> 
> No strong preference here, I'll take out the goto.
> 
>>> +int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
>>> +   struct task_struct *task, bool bypass_rlim)
>>> +{
>>> +   unsigned long locked_vm, limit;
>>> +   int ret = 0;
>>> +
>>> +   locked_vm = mm->locked_vm;
>>> +   if (inc) {
>>> +   if (!bypass_rlim) {
>>> +   limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>>> +   if (locked_vm + pages > limit) {
>>> +   ret = -ENOMEM;
>>> +   goto out;
>>> +   }
>>> +   }
>>
>> Nit:
>>
>> if (!ret)
>>
>> and then you don't need "goto out".
> 
> Ok, sure.
> 
>>> +   mm->locked_vm = locked_vm + pages;
>>> +   } else {
>>> +   WARN_ON_ONCE(pages > locked_vm);
>>> +   mm->locked_vm = locked_vm - pages;
>>
>>
>> Can go negative here. Not a huge deal but inaccurate imo.
> 
> I hear you, but setting a negative value to zero, as we had done previously,
> doesn't make much sense to me.


Ok then. I have not seen these WARN_ON for a very long time anyway.
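For reference, here is a rough sketch of how the helper could look with the
review comments applied (no goto, rlimit check folded into an "if (!ret)"
test). This is only an illustration assembled from the hunks quoted above plus
the suggestions, not the final merged code; the debug print is omitted:

int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
			struct task_struct *task, bool bypass_rlim)
{
	unsigned long locked_vm, limit;
	int ret = 0;

	/* the caller is expected to hold mm->mmap_sem for writing */
	locked_vm = mm->locked_vm;
	if (inc) {
		if (!bypass_rlim) {
			limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
			if (locked_vm + pages > limit)
				ret = -ENOMEM;
		}
		if (!ret)
			mm->locked_vm = locked_vm + pages;
	} else {
		WARN_ON_ONCE(pages > locked_vm);
		mm->locked_vm = locked_vm - pages;
	}

	/* pr_debug() including _RET_IP_ would go here, as discussed above */
	return ret;
}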


-- 
Alexey


Re: [BISECTED] kexec regression on PowerBook G4

2019-05-23 Thread Christophe Leroy




Le 24/05/2019 à 07:46, Christophe Leroy a écrit :

Hi

Le 24/05/2019 à 00:23, Aaro Koskinen a écrit :

Hi,

On Thu, May 23, 2019 at 08:58:11PM +0200, Christophe Leroy wrote:

Le 23/05/2019 à 19:27, Aaro Koskinen a écrit :

On Thu, May 23, 2019 at 07:33:38AM +0200, Christophe Leroy wrote:

Ok, the Oops confirms that the error is due to executing the kexec control
code which is located outside the kernel text area.

My yesterday's proposed change doesn't work because on book3S/32, NX
protection is based on setting segments to NX, and using IBATs for kernel
text.

Can you try the patch I sent out a few minutes ago ?
(https://patchwork.ozlabs.org/patch/1103827/)


It now crashes with "BUG: Unable to handle kernel instruction fetch"
and the faulting address is 0xef13a000.


Ok.

Can you try with both changes at the same time, ie the mtsrin(...) and the
change_page_attr() ?

I suspect that although the HW is not able to check the EXEC flag, the SW will
check it before loading the hash entry.


Unfortunately still no luck... The crash is pretty much the same with both
changes.


Right. In fact change_page_attr() does nothing because this part of RAM
is mapped by DBATs so v_block_mapped() returns non-NULL.

So, we have to set an IBAT for this area. I'll try and send you a new
patch for that before noon (CET).


Patch sent out. In the patch I have also added a printk to print the
buffer address, so if the problem still occurs, we'll know whether the
problem is really at the address of the buffer or if we were wrong from
the beginning.


Christophe


[RFC PATCH v2] powerpc: fix kexec failure on book3s/32

2019-05-23 Thread Christophe Leroy
Fixes: 63b2bc619565 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/machine_kexec_32.c | 8 
 arch/powerpc/mm/book3s32/mmu.c | 7 +--
 arch/powerpc/mm/mmu_decl.h | 2 ++
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec_32.c b/arch/powerpc/kernel/machine_kexec_32.c
index affe5dcce7f4..83e61a8f8468 100644
--- a/arch/powerpc/kernel/machine_kexec_32.c
+++ b/arch/powerpc/kernel/machine_kexec_32.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 typedef void (*relocate_new_kernel_t)(
unsigned long indirection_page,
@@ -35,6 +36,8 @@ void default_machine_kexec(struct kimage *image)
unsigned long page_list;
unsigned long reboot_code_buffer, reboot_code_buffer_phys;
relocate_new_kernel_t rnk;
+   unsigned long bat_size = 128 << 10;
+   unsigned long bat_mask = ~(bat_size - 1);
 
/* Interrupts aren't acceptable while we reboot */
local_irq_disable();
@@ -54,6 +57,11 @@ void default_machine_kexec(struct kimage *image)
memcpy((void *)reboot_code_buffer, relocate_new_kernel,
relocate_new_kernel_size);
 
+   printk(KERN_INFO "Reboot code buffer at %lx\n", reboot_code_buffer);
+   mtsrin(mfsrin(reboot_code_buffer) & ~SR_NX, reboot_code_buffer);
+   setibat(7, reboot_code_buffer & bat_mask, reboot_code_buffer_phys & bat_mask,
+   bat_size, PAGE_KERNEL_TEXT);
+
flush_icache_range(reboot_code_buffer,
reboot_code_buffer + KEXEC_CONTROL_PAGE_SIZE);
printk(KERN_INFO "Bye!\n");
diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
index fc073cb2c517..7124700edb0f 100644
--- a/arch/powerpc/mm/book3s32/mmu.c
+++ b/arch/powerpc/mm/book3s32/mmu.c
@@ -124,8 +124,8 @@ static unsigned int block_size(unsigned long base, unsigned long top)
  * of 2 between 128k and 256M.
  * Only for 603+ ...
  */
-static void setibat(int index, unsigned long virt, phys_addr_t phys,
-   unsigned int size, pgprot_t prot)
+void setibat(int index, unsigned long virt, phys_addr_t phys,
+unsigned int size, pgprot_t prot)
 {
unsigned int bl = (size >> 17) - 1;
int wimgxpp;
@@ -197,6 +197,9 @@ void mmu_mark_initmem_nx(void)
if (cpu_has_feature(CPU_FTR_601))
return;
 
+   if (IS_ENABLED(CONFIG_KEXEC))
+   nb--;
+
for (i = 0; i < nb - 1 && base < top && top - base > (128 << 10);) {
size = block_size(base, top);
setibat(i++, PAGE_OFFSET + base, base, size, PAGE_KERNEL_TEXT);
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index 7bac0aa2026a..478584d50cf2 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -103,6 +103,8 @@ void print_system_hash_info(void);
 extern void mapin_ram(void);
 extern void setbat(int index, unsigned long virt, phys_addr_t phys,
   unsigned int size, pgprot_t prot);
+void setibat(int index, unsigned long virt, phys_addr_t phys,
+unsigned int size, pgprot_t prot);
 
 extern int __map_without_bats;
 extern unsigned int rtas_data, rtas_size;
-- 
2.13.3



Re: [BISECTED] kexec regression on PowerBook G4

2019-05-23 Thread Christophe Leroy

Hi

Le 24/05/2019 à 00:23, Aaro Koskinen a écrit :

Hi,

On Thu, May 23, 2019 at 08:58:11PM +0200, Christophe Leroy wrote:

Le 23/05/2019 à 19:27, Aaro Koskinen a écrit :

On Thu, May 23, 2019 at 07:33:38AM +0200, Christophe Leroy wrote:

Ok, the Oops confirms that the error is due to executing the kexec control
code which is located outside the kernel text area.

My yesterday's proposed change doesn't work because on book3S/32, NX
protection is based on setting segments to NX, and using IBATs for kernel
text.

Can you try the patch I sent out a few minutes ago ?
(https://patchwork.ozlabs.org/patch/1103827/)


It now crashes with "BUG: Unable to handle kernel instruction fetch"
and the faulting address is 0xef13a000.


Ok.

Can you try with both changes at the same time, ie the mtsrin(...) and the
change_page_attr() ?

I suspect that although the HW is not able to check the EXEC flag, the SW will
check it before loading the hash entry.


Unfortunately still no luck... The crash is pretty much the same with both
changes.


Right. In fact change_page_attr() does nothing because this part of RAM 
is mapped by DBATs so v_block_mapped() returns not NULL.


So, we have to set an IBAT for this area. I'll try and send you a new 
patch for that before noon (CET).


Christophe


[PATCH] powerpc/powernv: Update firmware archaeology around OPAL_HANDLE_HMI

2019-05-23 Thread Stewart Smith
The first machines to ship with OPAL firmware all got firmware updates
that have the new call, but just in case someone is foolish enough to
believe the first 4 months of firmware is the best, we keep this code
around.

Comment is updated to not refer to late 2014 as recent or the future.

Signed-off-by: Stewart Smith 
---
 arch/powerpc/platforms/powernv/opal.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index f2b063b027f0..89b6ddc3ed38 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -206,16 +206,18 @@ static int __init opal_register_exception_handlers(void)
glue = 0x7000;
 
/*
-* Check if we are running on newer firmware that exports
-* OPAL_HANDLE_HMI token. If yes, then don't ask OPAL to patch
-* the HMI interrupt and we catch it directly in Linux.
+* Only ancient OPAL firmware requires this.
+* Specifically, firmware from FW810.00 (released June 2014)
+* through FW810.20 (Released October 2014).
 *
-* For older firmware (i.e currently released POWER8 System Firmware
-* as of today <= SV810_087), we fallback to old behavior and let OPAL
-* patch the HMI vector and handle it inside OPAL firmware.
+* Check if we are running on newer (post Oct 2014) firmware that
+* exports the OPAL_HANDLE_HMI token. If yes, then don't ask OPAL to
+* patch the HMI interrupt and we catch it directly in Linux.
 *
-* For newer firmware (in development/yet to be released) we will
-* start catching/handling HMI directly in Linux.
+* For older firmware (i.e < FW810.20), we fallback to old behavior and
+* let OPAL patch the HMI vector and handle it inside OPAL firmware.
+*
+* For newer firmware we catch/handle the HMI directly in Linux.
 */
if (!opal_check_token(OPAL_HANDLE_HMI)) {
pr_info("Old firmware detected, OPAL handles HMIs.\n");
@@ -225,6 +227,11 @@ static int __init opal_register_exception_handlers(void)
glue += 128;
}
 
+   /*
+* Only applicable to ancient firmware, all modern
+* (post March 2015/skiboot 5.0) firmware will just return
+* OPAL_UNSUPPORTED.
+*/
opal_register_exception_handler(OPAL_SOFTPATCH_HANDLER, 0, glue);
 #endif
 
-- 
2.21.0



[RFC] powerpc/xmon: restrict when kernel is locked down

2019-05-23 Thread Christopher M. Riedl
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.

Put xmon into read-only mode for lockdown=integrity and completely
disable xmon when lockdown=confidentiality. Xmon checks the lockdown
state and takes appropriate action:

 (1) during xmon_setup to prevent early xmon'ing

 (2) when triggered via sysrq

 (3) when toggled via debugfs

 (4) when triggered via a previously enabled breakpoint

The following lockdown state transitions are handled:

 (1) lockdown=none -> lockdown=integrity
 clear all breakpoints, set xmon read-only mode

 (2) lockdown=none -> lockdown=confidentiality
 clear all breakpoints, prevent re-entry into xmon

 (3) lockdown=integrity -> lockdown=confidentiality
 prevent re-entry into xmon

Suggested-by: Andrew Donnellan 
Signed-off-by: Christopher M. Riedl 
---
Applies on top of this series:
https://patchwork.kernel.org/patch/10870173/

I've done some limited testing using a single CPU QEMU config.

 arch/powerpc/xmon/xmon.c | 56 +++-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 3e7be19aa208..8c4a5a0c28f0 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -191,6 +191,9 @@ static void dump_tlb_44x(void);
 static void dump_tlb_book3e(void);
 #endif
 
+static void clear_all_bpt(void);
+static void xmon_init(int);
+
 #ifdef CONFIG_PPC64
 #define REG"%.16lx"
 #else
@@ -291,6 +294,39 @@ Commands:\n\
   zh   halt\n"
 ;
 
+#ifdef CONFIG_LOCK_DOWN_KERNEL
+static bool xmon_check_lockdown(void)
+{
+   static bool lockdown = false;
+
+   if (!lockdown) {
+   lockdown = kernel_is_locked_down("Using xmon",
+LOCKDOWN_CONFIDENTIALITY);
+   if (lockdown) {
+   printf("xmon: Disabled by strict kernel lockdown\n");
+   xmon_on = 0;
+   xmon_init(0);
+   }
+   }
+
+   if (!xmon_is_ro) {
+   xmon_is_ro = kernel_is_locked_down("Using xmon write-access",
+  LOCKDOWN_INTEGRITY);
+   if (xmon_is_ro) {
+   printf("xmon: Read-only due to kernel lockdown\n");
+   clear_all_bpt();
+   }
+   }
+
+   return lockdown;
+}
+#else
+inline static bool xmon_check_lockdown(void)
+{
+   return false;
+}
+#endif /* CONFIG_LOCK_DOWN_KERNEL */
+
 static struct pt_regs *xmon_regs;
 
 static inline void sync(void)
@@ -708,6 +744,9 @@ static int xmon_bpt(struct pt_regs *regs)
struct bpt *bp;
unsigned long offset;
 
+   if (xmon_check_lockdown())
+   return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
 
@@ -739,6 +778,9 @@ static int xmon_sstep(struct pt_regs *regs)
 
 static int xmon_break_match(struct pt_regs *regs)
 {
+   if (xmon_check_lockdown())
+   return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
if (dabr.enabled == 0)
@@ -749,6 +791,9 @@ static int xmon_break_match(struct pt_regs *regs)
 
 static int xmon_iabr_match(struct pt_regs *regs)
 {
+   if (xmon_check_lockdown())
+   return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
if (iabr == NULL)
@@ -3742,6 +3787,9 @@ static void xmon_init(int enable)
 #ifdef CONFIG_MAGIC_SYSRQ
 static void sysrq_handle_xmon(int key)
 {
+   if (xmon_check_lockdown())
+   return;
+
/* ensure xmon is enabled */
xmon_init(1);
debugger(get_irq_regs());
@@ -3763,7 +3811,6 @@ static int __init setup_xmon_sysrq(void)
 device_initcall(setup_xmon_sysrq);
 #endif /* CONFIG_MAGIC_SYSRQ */
 
-#ifdef CONFIG_DEBUG_FS
 static void clear_all_bpt(void)
 {
int i;
@@ -3785,8 +3832,12 @@ static void clear_all_bpt(void)
printf("xmon: All breakpoints cleared\n");
 }
 
+#ifdef CONFIG_DEBUG_FS
 static int xmon_dbgfs_set(void *data, u64 val)
 {
+   if (xmon_check_lockdown())
+   return 0;
+
xmon_on = !!val;
xmon_init(xmon_on);
 
@@ -3845,6 +3896,9 @@ early_param("xmon", early_parse_xmon);
 
 void __init xmon_setup(void)
 {
+   if (xmon_check_lockdown())
+   return;
+
if (xmon_on)
xmon_init(1);
if (xmon_early)
-- 
2.21.0



Re: [PATCHv2] kernel/crash: make parse_crashkernel()'s return value more indicant

2019-05-23 Thread Pingfan Liu
Matthias, ping? Any suggestions?

Thanks,
Pingfan


On Thu, May 2, 2019 at 2:22 PM Pingfan Liu  wrote:
>
> On Thu, Apr 25, 2019 at 4:20 PM Pingfan Liu  wrote:
> >
> > On Wed, Apr 24, 2019 at 4:31 PM Matthias Brugger  wrote:
> > >
> > >
> > [...]
> > > > @@ -139,6 +141,8 @@ static int __init parse_crashkernel_simple(char *cmdline,
> > > >   pr_warn("crashkernel: unrecognized char: %c\n", *cur);
> > > >   return -EINVAL;
> > > >   }
> > > > + if (*crash_size == 0)
> > > > + return -EINVAL;
> > >
> > > This covers the case where I pass an argument like "crashkernel=0M" ?
> > > Can't we fix that by using kstrtoull() in memparse and check if the return value
> > > is < 0? In that case we could return without updating the retptr and we will be fine.
> After some further work, I realized that it cannot be done
> this way. "0M" causes kstrtoull() to return -EINVAL, but the error is
> caused by the "M", not the "0". If "0" alone is passed to kstrtoull(), it
> returns 0 on success.
>
> > >
> > It seems that kstrtoull() treats 0M as invalid parameter, while
> > simple_strtoull() does not.
> >
> That was carelessness on my part when going through the code. I also tested
> with a valid value, "256M", using kstrtoull(), and it returned -EINVAL as well.
>
> So I think there is no way to distinguish 0 from a positive value
> inside this basic math function.
> Do I miss anything?
>
> Thanks and regards,
> Pingfan
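For readers following the thread, here is a small userspace sketch (not kernel
code; memparse_like() below is a simplified stand-in for the kernel's
memparse(), which is built on simple_strtoull()) showing why "0M" parses to 0
without reporting any error, and hence why parse_crashkernel_simple() needs the
explicit *crash_size == 0 check:

#include <stdio.h>
#include <stdlib.h>

/* Simplified memparse(): parse leading digits like simple_strtoull(),
 * then apply an optional K/M/G suffix. No error is returned for "0M". */
static unsigned long long memparse_like(const char *s, char **retptr)
{
	char *end;
	unsigned long long val = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		val <<= 10;	/* fall through */
	case 'M': case 'm':
		val <<= 10;	/* fall through */
	case 'K': case 'k':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (retptr)
		*retptr = end;
	return val;
}

int main(void)
{
	char *cur;
	unsigned long long size = memparse_like("0M", &cur);

	/* Prints size=0 with nothing left over: the parse itself succeeds,
	 * so only an explicit "size == 0" check can reject crashkernel=0M.
	 * By contrast, kstrtoull("256M", 0, &v) in the kernel returns -EINVAL
	 * because it rejects any trailing non-digit characters. */
	printf("size=%llu, rest='%s'\n", size, cur);
	return 0;
}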


[PATCH v5] powerpc/64s: support nospectre_v2 cmdline option

2019-05-23 Thread Christopher M. Riedl
Add support for disabling the kernel implemented spectre v2 mitigation
(count cache flush on context switch) via the nospectre_v2 and
mitigations=off cmdline options.

Suggested-by: Michael Ellerman 
Signed-off-by: Christopher M. Riedl 
Reviewed-by: Andrew Donnellan 
---
v4->v5:
Fix checkpatch complaint
 arch/powerpc/kernel/security.c | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index e1c9cf079503..7cfcb294b11c 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -28,7 +28,7 @@ static enum count_cache_flush_type count_cache_flush_type = 
COUNT_CACHE_FLUSH_NO
 bool barrier_nospec_enabled;
 static bool no_nospec;
 static bool btb_flush_enabled;
-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_BOOK3S_64)
 static bool no_spectrev2;
 #endif
 
@@ -114,7 +114,7 @@ static __init int security_feature_debugfs_init(void)
 device_initcall(security_feature_debugfs_init);
 #endif /* CONFIG_DEBUG_FS */
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_BOOK3S_64)
 static int __init handle_nospectre_v2(char *p)
 {
no_spectrev2 = true;
@@ -122,6 +122,9 @@ static int __init handle_nospectre_v2(char *p)
return 0;
 }
 early_param("nospectre_v2", handle_nospectre_v2);
+#endif /* CONFIG_PPC_FSL_BOOK3E || CONFIG_PPC_BOOK3S_64 */
+
+#ifdef CONFIG_PPC_FSL_BOOK3E
 void setup_spectre_v2(void)
 {
if (no_spectrev2 || cpu_mitigations_off())
@@ -399,7 +402,17 @@ static void toggle_count_cache_flush(bool enable)
 
 void setup_count_cache_flush(void)
 {
-   toggle_count_cache_flush(true);
+   bool enable = true;
+
+   if (no_spectrev2 || cpu_mitigations_off()) {
+   if (security_ftr_enabled(SEC_FTR_BCCTRL_SERIALISED) ||
+   security_ftr_enabled(SEC_FTR_COUNT_CACHE_DISABLED))
+   pr_warn("Spectre v2 mitigations not under software 
control, can't disable\n");
+
+   enable = false;
+   }
+
+   toggle_count_cache_flush(enable);
 }
 
 #ifdef CONFIG_DEBUG_FS
-- 
2.21.0



[RFC PATCH v4 10/21] watchdog/hardlockup: Add function to enable NMI watchdog on all allowed CPUs at once

2019-05-23 Thread Ricardo Neri
When there is more than one implementation of the NMI watchdog, there may
be situations in which switching from one to another is needed (e.g., if
the time-stamp counter becomes unstable, the HPET-based NMI watchdog can
no longer be used).

The perf-based implementation of the hardlockup detector makes use of
various per-CPU variables which are accessed via this_cpu operations.
Hence, each CPU needs to enable its own NMI watchdog if using the perf
implementation.

Add functionality to switch from one NMI watchdog to another and do it
from each allowed CPU.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h |  2 ++
 kernel/watchdog.c   | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index e5f1a86e20b7..6d828334348b 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -83,9 +83,11 @@ static inline void reset_hung_task_detector(void) { }
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR)
 extern void hardlockup_detector_disable(void);
+extern void hardlockup_start_all(void);
 extern unsigned int hardlockup_panic;
 #else
 static inline void hardlockup_detector_disable(void) {}
+static inline void hardlockup_start_all(void) {}
 #endif
 
 #if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7f9e7b9306fe..be589001200a 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -566,6 +566,21 @@ int lockup_detector_offline_cpu(unsigned int cpu)
return 0;
 }
 
+static int hardlockup_start_fn(void *data)
+{
+   watchdog_nmi_enable(smp_processor_id());
+   return 0;
+}
+
+void hardlockup_start_all(void)
+{
+   int cpu;
+
+   cpumask_copy(&watchdog_allowed_mask, &watchdog_cpumask);
+   for_each_cpu(cpu, &watchdog_allowed_mask)
+   smp_call_on_cpu(cpu, hardlockup_start_fn, NULL, false);
+}
+
 static void lockup_detector_reconfigure(void)
 {
cpus_read_lock();
-- 
2.17.1



[RFC PATCH v4 08/21] watchdog/hardlockup: Decouple the hardlockup detector from perf

2019-05-23 Thread Ricardo Neri
The current default implementation of the hardlockup detector assumes that
it is implemented using perf events. However, the hardlockup detector can
be driven by other sources of non-maskable interrupts (e.g., a properly
configured timer).

Group and wrap in #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF all the code
specific to perf: create and manage perf events, stop and start the perf-
based detector.

The generic portion of the detector (monitor the timers' thresholds, check
timestamps and detect hardlockups as well as the implementation of
arch_touch_nmi_watchdog()) is now selected with the new intermediate config
symbol CONFIG_HARDLOCKUP_DETECTOR_CORE.

The perf-based implementation of the detector selects the new intermediate
symbol. Other implementations should do the same.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: "Rafael J. Wysocki" 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Alexei Starovoitov 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: Byungchul Park 
Cc: "Paul E. McKenney" 
Cc: "Luis R. Rodriguez" 
Cc: Waiman Long 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Davidlohr Bueso 
Cc: Marc Zyngier 
Cc: Kai-Heng Feng 
Cc: Konrad Rzeszutek Wilk 
Cc: David Rientjes 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h   |  5 -
 kernel/Makefile   |  2 +-
 kernel/watchdog_hld.c | 32 
 lib/Kconfig.debug |  4 
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 5a8b19749769..e5f1a86e20b7 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -94,8 +94,11 @@ static inline void hardlockup_detector_disable(void) {}
 # define NMI_WATCHDOG_SYSCTL_PERM  0444
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_CORE)
 extern void arch_touch_nmi_watchdog(void);
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
 extern void hardlockup_detector_perf_disable(void);
diff --git a/kernel/Makefile b/kernel/Makefile
index 33824f0385b3..d07d52a03cc9 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -83,7 +83,7 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_CORE) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index b352e507b17f..bb6435978c46 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -22,12 +22,8 @@
 
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
-static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
-static DEFINE_PER_CPU(struct perf_event *, dead_event);
-static struct cpumask dead_events_mask;
 
 static unsigned long hardlockup_allcpu_dumped;
-static atomic_t watchdog_cpus = ATOMIC_INIT(0);
 
 notrace void arch_touch_nmi_watchdog(void)
 {
@@ -98,14 +94,6 @@ static inline bool watchdog_check_timestamp(void)
 }
 #endif
 
-static struct perf_event_attr wd_hw_attr = {
-   .type   = PERF_TYPE_HARDWARE,
-   .config = PERF_COUNT_HW_CPU_CYCLES,
-   .size   = sizeof(struct perf_event_attr),
-   .pinned = 1,
-   .disabled   = 1,
-};
-
 void inspect_for_hardlockups(struct pt_regs *regs)
 {
if (__this_cpu_read(watchdog_nmi_touch) == true) {
@@ -157,6 +145,24 @@ void inspect_for_hardlockups(struct pt_regs *regs)
return;
 }
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
+#undef pr_fmt
+#define pr_fmt(fmt) "NMI perf watchdog: " fmt
+
+static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
+static DEFINE_PER_CPU(struct perf_event *, dead_event);
+static struct cpumask dead_events_mask;
+
+static atomic_t watchdog_cpus = ATOMIC_INIT(0);
+
+static struct perf_event_attr wd_hw_attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .size   = sizeof(struct perf_event_attr),
+   .pinned = 1,
+   .disabled   = 1,
+};
+
 /* Callback function for perf event subsystem */
 static void watchdog_overflow_callback(struct perf_event *event,
   struct perf_sample_data *data,
@@ -298,3 +304,5 @@ int __init hardlockup_detector_perf_init(void)
}
return ret

[RFC PATCH v4 07/21] watchdog/hardlockup: Define a generic function to detect hardlockups

2019-05-23 Thread Ricardo Neri
The procedure to detect hardlockups is independent of the underlying
mechanism that generates the non-maskable interrupt used to drive the
detector. Thus, it can be put in a separate, generic function. In this
manner, it can be invoked by various implementations of the NMI watchdog.

For this purpose, move the bulk of watchdog_overflow_callback() to the
new function inspect_for_hardlockups(). This function can then be called
from the applicable NMI handlers.

Cc: "H. Peter Anvin" 
Cc: Ashok Raj 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Don Zickus 
Cc: Nicholas Piggin 
Cc: Michael Ellerman 
Cc: Frederic Weisbecker 
Cc: Babu Moger 
Cc: "David S. Miller" 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Mathieu Desnoyers 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Philippe Ombredanne 
Cc: Colin Ian King 
Cc: "Luis R. Rodriguez" 
Cc: Stephane Eranian 
Cc: Suravee Suthikulpanit 
Cc: "Ravi V. Shankar" 
Cc: x...@kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri 
---
 include/linux/nmi.h   |  1 +
 kernel/watchdog_hld.c | 18 +++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 9003e29cde46..5a8b19749769 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -212,6 +212,7 @@ extern int proc_watchdog_thresh(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
 extern int proc_watchdog_cpumask(struct ctl_table *, int,
 void __user *, size_t *, loff_t *);
+void inspect_for_hardlockups(struct pt_regs *regs);
 
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 #include 
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..b352e507b17f 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -106,14 +106,8 @@ static struct perf_event_attr wd_hw_attr = {
.disabled   = 1,
 };
 
-/* Callback function for perf event subsystem */
-static void watchdog_overflow_callback(struct perf_event *event,
-  struct perf_sample_data *data,
-  struct pt_regs *regs)
+void inspect_for_hardlockups(struct pt_regs *regs)
 {
-   /* Ensure the watchdog never gets throttled */
-   event->hw.interrupts = 0;
-
if (__this_cpu_read(watchdog_nmi_touch) == true) {
__this_cpu_write(watchdog_nmi_touch, false);
return;
@@ -163,6 +157,16 @@ static void watchdog_overflow_callback(struct perf_event *event,
return;
 }
 
+/* Callback function for perf event subsystem */
+static void watchdog_overflow_callback(struct perf_event *event,
+  struct perf_sample_data *data,
+  struct pt_regs *regs)
+{
+   /* Ensure the watchdog never gets throttled */
+   event->hw.interrupts = 0;
+   inspect_for_hardlockups(regs);
+}
+
 static int hardlockup_detector_event_create(void)
 {
unsigned int cpu = smp_processor_id();
-- 
2.17.1



Re: [PATCH] ASoC: fsl_esai: fix the channel swap issue after xrun

2019-05-23 Thread Nicolin Chen
On Thu, May 23, 2019 at 11:04:03AM +, S.j. Wang wrote:
> > On Thu, May 23, 2019 at 09:53:42AM +, S.j. Wang wrote:
> > > > > + /*
> > > > > +  * Add fifo reset here, because the regcache_sync will
> > > > > +  * write one more data to ETDR.
> > > > > +  * Which will cause channel shift.
> > > >
> > > > Sounds like a bug to me...should fix it first by marking the data
> > > > registers as volatile.
> > > >
> > > The ETDR is a writable register, it is not volatile. Even if we change it
> > > to volatile, I don't think we can avoid this issue: for regcache_sync(),
> > > writing this register is the correct behavior.
> > 
> > Is that so? Quoting the comments of regcache_sync():
> > "* regcache_sync - Sync the register cache with the hardware.
> >  *
> >  * @map: map to configure.
> >  *
> >  * Any registers that should not be synced should be marked as
> >  * volatile."
> > 
> > If regcache_sync() does sync volatile registers too as you said, I don't 
> > mind
> > having this FIFO reset WAR for now, though I think this mismatch between
> > the comments and the actual behavior then should get people's attention.
> > 
> > Thank you
> 
> ETDR is not volatile; if we mark it as volatile, is that correct?

Well, you have a point -- it might not be ideally true, but it sounds
like a correct fix to me according to those comments.

We can wait for Mark's comments or just send a patch to the mailing list
for review.

Thank you
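For context, here is a rough sketch of what marking the data registers
volatile in the driver's regmap config could look like. The register names
(REG_ESAI_ETDR/REG_ESAI_ERDR) and the cache type are assumptions for
illustration, not a statement of what the final fix should be:

#include <linux/regmap.h>

/* Sketch only: a volatile_reg() callback makes regcache_sync() skip the
 * ESAI data FIFOs, so a resync can never push a stale sample into ETDR
 * and shift the channels. Register names here are assumed. */
static bool fsl_esai_data_volatile_reg(struct device *dev, unsigned int reg)
{
	switch (reg) {
	case REG_ESAI_ETDR:		/* transmit data register */
	case REG_ESAI_ERDR:		/* receive data register */
		return true;		/* never cached, never written on sync */
	default:
		return false;
	}
}

static const struct regmap_config fsl_esai_regmap_sketch = {
	.reg_bits	= 32,
	.val_bits	= 32,
	.reg_stride	= 4,
	.volatile_reg	= fsl_esai_data_volatile_reg,
	.cache_type	= REGCACHE_FLAT,	/* whichever cache the driver already uses */
};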


Re: [BISECTED] kexec regression on PowerBook G4

2019-05-23 Thread Aaro Koskinen
Hi,

On Thu, May 23, 2019 at 08:58:11PM +0200, Christophe Leroy wrote:
> Le 23/05/2019 à 19:27, Aaro Koskinen a écrit :
> >On Thu, May 23, 2019 at 07:33:38AM +0200, Christophe Leroy wrote:
> >>Ok, the Oops confirms that the error is due to executing the kexec control
> >>code which is located outside the kernel text area.
> >>
> >>My yesterday's proposed change doesn't work because on book3S/32, NX
> >>protection is based on setting segments to NX, and using IBATs for kernel
> >>text.
> >>
> >>Can you try the patch I sent out a few minutes ago ?
> >>(https://patchwork.ozlabs.org/patch/1103827/)
> >
> >It now crashes with "BUG: Unable to handle kernel instruction fetch"
> >and the faulting address is 0xef13a000.
> 
> Ok.
> 
> Can you try with both changes at the same time, ie the mtsrin(...) and the
> change_page_attr() ?
> 
> I suspect that although the HW is not able to check the EXEC flag, the SW will
> check it before loading the hash entry.

Unfortunately still no luck... The crash is pretty much the same with both
changes.

A.


[PATCH 5.1 084/122] x86/mpx, mm/core: Fix recursive munmap() corruption

2019-05-23 Thread Greg Kroah-Hartman
From: Dave Hansen 

commit 5a28fc94c9143db766d1ba5480cae82d856ad080 upstream.

This is a bit of a mess, to put it mildly.  But, it's a bug
that only seems to have showed up in 4.20 but wasn't noticed
until now, because nobody uses MPX.

MPX has the arch_unmap() hook inside of munmap() because MPX
uses bounds tables that protect other areas of memory.  When
memory is unmapped, there is also a need to unmap the MPX
bounds tables.  Barring this, unused bounds tables can eat 80%
of the address space.

But, the recursive do_munmap() that gets called via arch_unmap()
wreaks havoc with __do_munmap()'s state.  It can result in
freeing populated page tables, accessing bogus VMA state,
double-freed VMAs and more.

See the "long story" further below for the gory details.

To fix this, call arch_unmap() before __do_unmap() has a chance
to do anything meaningful.  Also, remove the 'vma' argument
and force the MPX code to do its own, independent VMA lookup.

== UML / unicore32 impact ==

Remove unused 'vma' argument to arch_unmap().  No functional
change.

I compile tested this on UML but not unicore32.

== powerpc impact ==

powerpc uses arch_unmap() well to watch for munmap() on the
VDSO and zeroes out 'current->mm->context.vdso_base'.  Moving
arch_unmap() makes this happen earlier in __do_munmap().  But,
'vdso_base' seems to only be used in perf and in the signal
delivery that happens near the return to userspace.  I can not
find any likely impact to powerpc, other than the zeroing
happening a little earlier.

powerpc does not use the 'vma' argument and is unaffected by
its removal.

I compile-tested a 64-bit powerpc defconfig.

== x86 impact ==

For the common success case this is functionally identical to
what was there before.  For the munmap() failure case, it's
possible that some MPX tables will be zapped for memory that
continues to be in use.  But, this is an extraordinarily
unlikely scenario and the harm would be that MPX provides no
protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.

Because if you're doing munmap(), you are *done* with the
memory.  There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as
well as the existing mpx selftests/.

The long story:

munmap() has a couple of pieces:

 1. Find the affected VMA(s)
 2. Split the start/end one(s) if necessary
 3. Pull the VMAs out of the rbtree
 4. Actually zap the memory via unmap_region(), including
freeing page tables (or queueing them to be freed).
 5. Fix up some of the accounting (like fput()) and actually
free the VMA itself.

This specific ordering was actually introduced by:

  dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

during the 4.20 merge window.  The previous __do_munmap() code
was actually safe because the only thing after arch_unmap() was
remove_vma_list().  arch_unmap() could not see 'vma' in the
rbtree because it was detached, so it is not even capable of
doing operations unsafe for remove_vma_list()'s use of 'vma'.

Richard Biener reported a test that shows this in dmesg:

  [1216548.787498] BUG: Bad rss-counter state mm:17ce560b idx:1 val:551
  [1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via
arch_unmap().  It was freeing page tables that had not been
properly zapped.

But, the problem was bigger than this.  For one, arch_unmap()
can free VMAs.  But, the calling __do_munmap() has variables
that *point* to VMAs and obviously can't handle them just
getting freed while the pointer is still in use.

I tried a couple of things here.  First, I tried to fix the page
table freeing problem in isolation, but I then found the VMA
issue.  I also tried having the MPX code return a flag if it
modified the rbtree which would force __do_munmap() to re-walk
to restart.  That spiralled out of control in complexity pretty
fast.

Just moving arch_unmap() and accepting that the bonkers failure
case might eat some bounds tables seems like the simplest viable
fix.

This was also reported in the following kernel bugzilla entry:

  https://bugzilla.kernel.org/show_bug.cgi?id=203123

There are some reports that this commit triggered this bug:

  dd2283f2605 ("mm: mmap: zap pages with read mmap_sem in munmap")

While that commit certainly made the issues easier to hit, I believe
the fundamental issue has been with us as long as MPX itself, thus
the Fixes: tag below is for one of the original MPX commits.

[ mingo: Minor edits to the changelog and the patch. ]

Reported-by: Richard Biener 
Reported-by: H.J. Lu 
Signed-off-by: Dave Hansen 
Reviewed-by Thomas Gleixner 
Reviewed-by: Yang Shi 
Acked-by: Michael Ellerman 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Anton Ivanov 
Cc: Benjamin Her

[PATCH 5.0 075/139] x86/mpx, mm/core: Fix recursive munmap() corruption

2019-05-23 Thread Greg Kroah-Hartman
From: Dave Hansen 

commit 5a28fc94c9143db766d1ba5480cae82d856ad080 upstream.

This is a bit of a mess, to put it mildly.  But, it's a bug
that only seems to have showed up in 4.20 but wasn't noticed
until now, because nobody uses MPX.

MPX has the arch_unmap() hook inside of munmap() because MPX
uses bounds tables that protect other areas of memory.  When
memory is unmapped, there is also a need to unmap the MPX
bounds tables.  Barring this, unused bounds tables can eat 80%
of the address space.

But, the recursive do_munmap() that gets called via arch_unmap()
wreaks havoc with __do_munmap()'s state.  It can result in
freeing populated page tables, accessing bogus VMA state,
double-freed VMAs and more.

See the "long story" further below for the gory details.

To fix this, call arch_unmap() before __do_unmap() has a chance
to do anything meaningful.  Also, remove the 'vma' argument
and force the MPX code to do its own, independent VMA lookup.

== UML / unicore32 impact ==

Remove unused 'vma' argument to arch_unmap().  No functional
change.

I compile tested this on UML but not unicore32.

== powerpc impact ==

powerpc uses arch_unmap() well to watch for munmap() on the
VDSO and zeroes out 'current->mm->context.vdso_base'.  Moving
arch_unmap() makes this happen earlier in __do_munmap().  But,
'vdso_base' seems to only be used in perf and in the signal
delivery that happens near the return to userspace.  I can not
find any likely impact to powerpc, other than the zeroing
happening a little earlier.

powerpc does not use the 'vma' argument and is unaffected by
its removal.

I compile-tested a 64-bit powerpc defconfig.

== x86 impact ==

For the common success case this is functionally identical to
what was there before.  For the munmap() failure case, it's
possible that some MPX tables will be zapped for memory that
continues to be in use.  But, this is an extraordinarily
unlikely scenario and the harm would be that MPX provides no
protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.

Because if you're doing munmap(), you are *done* with the
memory.  There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as
well as the existing mpx selftests/.

The long story:

munmap() has a couple of pieces:

 1. Find the affected VMA(s)
 2. Split the start/end one(s) if necessary
 3. Pull the VMAs out of the rbtree
 4. Actually zap the memory via unmap_region(), including
freeing page tables (or queueing them to be freed).
 5. Fix up some of the accounting (like fput()) and actually
free the VMA itself.

This specific ordering was actually introduced by:

  dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

during the 4.20 merge window.  The previous __do_munmap() code
was actually safe because the only thing after arch_unmap() was
remove_vma_list().  arch_unmap() could not see 'vma' in the
rbtree because it was detached, so it is not even capable of
doing operations unsafe for remove_vma_list()'s use of 'vma'.

Richard Biener reported a test that shows this in dmesg:

  [1216548.787498] BUG: Bad rss-counter state mm:17ce560b idx:1 val:551
  [1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via
arch_unmap().  It was freeing page tables that had not been
properly zapped.

But, the problem was bigger than this.  For one, arch_unmap()
can free VMAs.  But, the calling __do_munmap() has variables
that *point* to VMAs and obviously can't handle them just
getting freed while the pointer is still in use.

I tried a couple of things here.  First, I tried to fix the page
table freeing problem in isolation, but I then found the VMA
issue.  I also tried having the MPX code return a flag if it
modified the rbtree which would force __do_munmap() to re-walk
to restart.  That spiralled out of control in complexity pretty
fast.

Just moving arch_unmap() and accepting that the bonkers failure
case might eat some bounds tables seems like the simplest viable
fix.

This was also reported in the following kernel bugzilla entry:

  https://bugzilla.kernel.org/show_bug.cgi?id=203123

There are some reports that this commit triggered this bug:

  dd2283f2605 ("mm: mmap: zap pages with read mmap_sem in munmap")

While that commit certainly made the issues easier to hit, I believe
the fundamental issue has been with us as long as MPX itself, thus
the Fixes: tag below is for one of the original MPX commits.

[ mingo: Minor edits to the changelog and the patch. ]

Reported-by: Richard Biener 
Reported-by: H.J. Lu 
Signed-off-by: Dave Hansen 
Reviewed-by Thomas Gleixner 
Reviewed-by: Yang Shi 
Acked-by: Michael Ellerman 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Anton Ivanov 
Cc: Benjamin Her

Re: [BISECTED] kexec regression on PowerBook G4

2019-05-23 Thread Christophe Leroy




Le 23/05/2019 à 19:27, Aaro Koskinen a écrit :

Hi,

On Thu, May 23, 2019 at 07:33:38AM +0200, Christophe Leroy wrote:

Ok, the Oops confirms that the error is due to executing the kexec control
code which is located outside the kernel text area.

My yesterday's proposed change doesn't work because on book3S/32, NX
protection is based on setting segments to NX, and using IBATs for kernel
text.

Can you try the patch I sent out a few minutes ago ?
(https://patchwork.ozlabs.org/patch/1103827/)


It now crashes with "BUG: Unable to handle kernel instruction fetch"
and the faulting address is 0xef13a000.



Ok.

Can you try with both changes at the same time, ie the mtsrin(...) and 
the change_page_attr() ?


I suspect that although the HW is not able to check the EXEC flag, the SW
will check it before loading the hash entry.


Christophe


Re: Failure to boot G4: dt_headr_start=0x01501000

2019-05-23 Thread Christophe Leroy




On 05/23/2019 10:16 AM, Mathieu Malaterre wrote:

On Thu, May 23, 2019 at 11:45 AM Christophe Leroy
 wrote:




Le 23/05/2019 à 10:53, Mathieu Malaterre a écrit :


I confirm powerpc/merge does not boot for me (same config). Commit id:

a27eaa62326d (powerpc/merge) Automatic merge of branches 'master',
'next' and 'fixes' into merge


I see in the config you sent me that you have selected CONFIG_KASAN,
which is a big new feature.

Can you try without it ?


With same config but CONFIG_KASAN=n (on top of a27eaa62326d), I can
reproduce the boot failure (no change).

Time for bisect ?



I found the issue. In order to be able to support KASAN, the setup of 
segments has moved earlier in the boot. Your problem is a side effect 
of this change.

Function setup_disp_bat() is supposed to set up BAT3 for the btext data.
But setup_disp_bat() relies on someone setting in disp_BAT the values to 
be loaded into the BATs. This is done by btext_prepare_BAT(), which is called 
by bootx_init().
The problem is that bootx_init() is never called, so setup_disp_bat() 
does nothing, and the access to btext data is only possible because the 
bootloader has set an entry for it in the hash table.


But by setting up the segments earlier, we break the bootloader's hash 
table, which wouldn't be an issue if the BATs were set up properly as 
expected.


The problematic commit is 215b823707ce ("powerpc/32s: set up an early 
static hash table for KASAN").


Here is a dirty fix that works for me when CONFIG_KASAN is NOT set.
Of course, the real fix has to be to setup the BATs properly, but I 
won't have time to look at that before June. Maybe you can ?


diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 755fab9641d6..fba16970c028 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -162,7 +162,6 @@ __after_mmu_off:
bl  flush_tlbs

bl  initial_bats
-   bl  load_segment_registers
 #ifdef CONFIG_KASAN
bl  early_hash_table
 #endif
@@ -920,6 +919,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE)
RFI
 /* Load up the kernel context */
 2: bl  load_up_mmu
+   bl  load_segment_registers

 #ifdef CONFIG_BDI_SWITCH
/* Add helper information for the Abatron bdiGDB debugger.

Christophe


Re: [BISECTED] kexec regression on PowerBook G4

2019-05-23 Thread Aaro Koskinen
Hi,

On Thu, May 23, 2019 at 07:33:38AM +0200, Christophe Leroy wrote:
> Ok, the Oops confirms that the error is due to executing the kexec control
> code which is located outside the kernel text area.
> 
> My yesterday's proposed change doesn't work because on book3S/32, NX
> protection is based on setting segments to NX, and using IBATs for kernel
> text.
> 
> Can you try the patch I sent out a few minutes ago ?
> (https://patchwork.ozlabs.org/patch/1103827/)

It now crashes with "BUG: Unable to handle kernel instruction fetch"
and the faulting address is 0xef13a000.

A.


Re: [PATCH v2 1/2] open: add close_range()

2019-05-23 Thread Christian Brauner
On Thu, May 23, 2019 at 06:20:05PM +0200, Oleg Nesterov wrote:
> On 05/23, Christian Brauner wrote:
> >
> > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > +{
> > +   unsigned int cur_max;
> > +
> > +   if (fd > max_fd)
> > +   return -EINVAL;
> > +
> > +   rcu_read_lock();
> > +   cur_max = files_fdtable(files)->max_fds;
> > +   rcu_read_unlock();
> > +
> > +   /* cap to last valid index into fdtable */
> > +   max_fd = max(max_fd, (cur_max - 1));
>  ^^^
> 
> Hmm. min() ?

Yes, thanks! Massive brainf*rt on my end, sorry.

Christian


Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Christian Brauner
On Thu, May 23, 2019 at 07:22:17PM +0300, Konstantin Khlebnikov wrote:
> On 22.05.2019 18:52, Christian Brauner wrote:
> > This adds the close_range() syscall. It allows to efficiently close a range
> > of file descriptors up to all file descriptors of a calling task.
> >
> > The syscall came up in a recent discussion around the new mount API and
> > making new file descriptor types cloexec by default. During this
> > discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> > syscall in this manner has been requested by various people over time.
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> >  /* that exec is sensitive */
> >  unshare(CLONE_FILES);
> >  /* we don't want anything past stderr here */
> >  close_range(3, ~0U);
> >  execve();
> >
> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. daemons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> >   task is multi-threaded and shares the file descriptor table with another
> >   thread in which case two threads could race with one thread allocating
> >   file descriptors and the other one closing them via close_range(). For the
> >   general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc//fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> >- Python (cf. [2])
> >- Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
> >
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > static int close_all_fds(void)
> > {
> >  int dir_fd;
> >  DIR *dir;
> >  struct dirent *direntp;
> >
> >  dir = opendir("/proc/self/fd");
> >  if (!dir)
> >  return -1;
> >  dir_fd = dirfd(dir);
> >  while ((direntp = readdir(dir))) {
> >  int fd;
> >  if (strcmp(direntp->d_name, ".") == 0)
> >  continue;
> >  if (strcmp(direntp->d_name, "..") == 0)
> >  continue;
> >  fd = atoi(direntp->d_name);
> >  if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> >  continue;
> >  close(fd);
> >  }
> >  closedir(dir);
> >  return 0;
> > }
> >
> > to close_range() yields:
> > 1. closing 4 open files:
> > - close_all_fds(): ~280 us
> > - close_range():~24 us
> >
> > 2. closing 1000 open files:
> > - close_all_fds(): ~5000 us
> > - close_range():   ~800 us
> >
> > close_range() is designed to allow for some flexibility. Specifically, it
> > does not simply always close all open file descriptors of a task. Instead,
> > callers can specify an upper bound.
> > This is e.g. useful for scenarios where specific file descriptors are
> > created with well-known numbers that are supposed to be excluded from
> > getting closed.
> > For extra paranoia close_range() comes with a flags argument. This can e.g.
> > be used to implement extensions. One can imagine userspace wanting to stop
> > at the first error instead of ignoring errors under certain circumstances.
> 
> > There might be other valid ideas in the future. In any case, a flag
> > argument doesn't hurt and keeps us on the safe side.
> 
> Here is another strange but real-life scenario: a crash handler for dumping 
> core.
> 
> If an application has network connections it would be better to close them all,
> otherwise clients will wait until the end of the dumping process or time out.
> Also closing normal files might be a good idea for releasing locks.

RE: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread David Laight
From:  Konstantin Khlebnikov
> Sent: 23 May 2019 17:22

>  > In addition, the syscall will also work for tasks that do not have procfs
>  > mounted and on kernels that do not have procfs support compiled in. In such
>  > situations the only way to make sure that all file descriptors are closed
>  > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
>  > OPEN_MAX trickery (cf. comment [8] on Rust).

Code using RLIMIT_NOFILE is broken.
It is easy to reduce the hard limit below that of an open fd.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Konstantin Khlebnikov

On 22.05.2019 18:52, Christian Brauner wrote:
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
>
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
>
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
>
>  /* that exec is sensitive */
>  unshare(CLONE_FILES);
>  /* we don't want anything past stderr here */
>  close_range(3, ~0U);
>  execve();
>
> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For
> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. daemons, service managers, programming
> language standard libraries, container managers etc.).
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
>   task is multi-threaded and shares the file descriptor table with another
>   thread in which case two threads could race with one thread allocating
>   file descriptors and the other one closing them via close_range(). For the
>   general case close_range() before the execve() is sufficient.)
>
> Second, it allows userspace to avoid implementing closing all file
> descriptors by parsing through /proc//fd/* and calling close() on each
> file descriptor. From looking at various large(ish) userspace code bases
> this or similar patterns are very common in:
> - service managers (cf. [4])
> - libcs (cf. [6])
> - container runtimes (cf. [5])
> - programming language runtimes/standard libraries
>- Python (cf. [2])
>- Rust (cf. [7], [8])
> As Dmitry pointed out there's even a long-standing glibc bug about missing
> kernel support for this task (cf. [3]).
> In addition, the syscall will also work for tasks that do not have procfs
> mounted and on kernels that do not have procfs support compiled in. In such
> situations the only way to make sure that all file descriptors are closed
> is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> OPEN_MAX trickery (cf. comment [8] on Rust).
>
> The performance is striking. For good measure, comparing the following
> simple close_all_fds() userspace implementation that is essentially just
> glibc's version in [6]:
>
> static int close_all_fds(void)
> {
>  int dir_fd;
>  DIR *dir;
>  struct dirent *direntp;
>
>  dir = opendir("/proc/self/fd");
>  if (!dir)
>  return -1;
>  dir_fd = dirfd(dir);
>  while ((direntp = readdir(dir))) {
>  int fd;
>  if (strcmp(direntp->d_name, ".") == 0)
>  continue;
>  if (strcmp(direntp->d_name, "..") == 0)
>  continue;
>  fd = atoi(direntp->d_name);
>  if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
>  continue;
>  close(fd);
>  }
>  closedir(dir);
>  return 0;
> }
>
> to close_range() yields:
> 1. closing 4 open files:
> - close_all_fds(): ~280 us
> - close_range():~24 us
>
> 2. closing 1000 open files:
> - close_all_fds(): ~5000 us
> - close_range():   ~800 us
>
> close_range() is designed to allow for some flexibility. Specifically, it
> does not simply always close all open file descriptors of a task. Instead,
> callers can specify an upper bound.
> This is e.g. useful for scenarios where specific file descriptors are
> created with well-known numbers that are supposed to be excluded from
> getting closed.
> For extra paranoia close_range() comes with a flags argument. This can e.g.
> be used to implement extensions. One can imagine userspace wanting to stop
> at the first error instead of ignoring errors under certain circumstances.

> There might be other valid ideas in the future. In any case, a flag
> argument doesn't hurt and keeps us on the safe side.

Here is another strange but real-life scenario: a crash handler for dumping core.

If the application has network connections it would be better to close them all,
otherwise clients will wait until the dump finishes or they time out.
Also closing normal files might be a good idea, to release locks.

But simply closing might race with other threads - a closed fd could be reused
while some code still thinks it refers to the original file.

Our solution closes files without freeing the fds: it opens /dev/null and
replaces all open descriptors with it using dup2().

So, a special flag for close_range() could cover this case too.
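
As a rough userspace illustration of the dup2-over-/dev/null approach described
above (a sketch only: RLIMIT_NOFILE is assumed as the upper bound, error handling
is minimal, and neutralize_fds() is just a made-up name):

    #include <fcntl.h>
    #include <sys/resource.h>
    #include <unistd.h>

    /*
     * Replace every open descriptor at or above lowfd with /dev/null.
     * The fd numbers stay allocated, so other threads that still hold
     * them cannot have them silently reused for unrelated files.
     */
    static void neutralize_fds(int lowfd)
    {
            struct rlimit rl;
            int fd, nullfd;

            nullfd = open("/dev/null", O_RDWR);
            if (nullfd < 0 || getrlimit(RLIMIT_NOFILE, &rl) < 0)
                    return;

            for (fd = lowfd; fd < (int)rl.rlim_cur; fd++) {
                    if (fd == nullfd)
                            continue;
                    if (fcntl(fd, F_GETFD) < 0)
                            continue;       /* not open, leave the slot free */
                    dup2(nullfd, fd);       /* atomically swap in /dev/null */
            }
    }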

Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Oleg Nesterov
On 05/23, Christian Brauner wrote:
>
> So given that we would really need another find_next_open_fd() I think
> sticking to the simple cond_resched() version I sent before is better
> for now until we see real-world performance issues.

OK, agreed.

Oleg.



Re: [PATCH v2 1/2] open: add close_range()

2019-05-23 Thread Oleg Nesterov
On 05/23, Christian Brauner wrote:
>
> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> + unsigned int cur_max;
> +
> + if (fd > max_fd)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + cur_max = files_fdtable(files)->max_fds;
> + rcu_read_unlock();
> +
> + /* cap to last valid index into fdtable */
> + max_fd = max(max_fd, (cur_max - 1));
 ^^^

Hmm. min() ?

Oleg.
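
(For clarity, the cap Oleg is pointing at presumably wants to read as follows;
this is a sketch of the intended logic, not the code that was posted:

    /* cap to last valid index into fdtable */
    max_fd = min(max_fd, cur_max - 1);
)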



[PATCH v2 2/2] tests: add close_range() tests

2019-05-23 Thread Christian Brauner
This adds basic tests for the new close_range() syscall.
- test that no invalid flags can be passed
- test that a range of file descriptors is correctly closed
- test that a range of file descriptors is correctly closed if there
  are already closed file descriptors in the range
- test that max_fd is correctly capped to the current fdtable maximum

Signed-off-by: Christian Brauner 
Cc: Arnd Bergmann 
Cc: Jann Horn 
Cc: David Howells 
Cc: Dmitry V. Levin 
Cc: Oleg Nesterov 
Cc: Linus Torvalds 
Cc: Florian Weimer 
Cc: linux-...@vger.kernel.org
---
v1: unchanged
v2:
- Christian Brauner :
  - verify that close_range() correctly closes a single file descriptor
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/core/.gitignore   |   1 +
 tools/testing/selftests/core/Makefile |   6 +
 .../testing/selftests/core/close_range_test.c | 142 ++
 4 files changed, 150 insertions(+)
 create mode 100644 tools/testing/selftests/core/.gitignore
 create mode 100644 tools/testing/selftests/core/Makefile
 create mode 100644 tools/testing/selftests/core/close_range_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..06e57fabbff9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += bpf
 TARGETS += breakpoints
 TARGETS += capabilities
 TARGETS += cgroup
+TARGETS += core
 TARGETS += cpufreq
 TARGETS += cpu-hotplug
 TARGETS += drivers/dma-buf
diff --git a/tools/testing/selftests/core/.gitignore 
b/tools/testing/selftests/core/.gitignore
new file mode 100644
index ..6e6712ce5817
--- /dev/null
+++ b/tools/testing/selftests/core/.gitignore
@@ -0,0 +1 @@
+close_range_test
diff --git a/tools/testing/selftests/core/Makefile 
b/tools/testing/selftests/core/Makefile
new file mode 100644
index ..de3ae68aa345
--- /dev/null
+++ b/tools/testing/selftests/core/Makefile
@@ -0,0 +1,6 @@
+CFLAGS += -g -I../../../../usr/include/ -I../../../../include
+
+TEST_GEN_PROGS := close_range_test
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/core/close_range_test.c 
b/tools/testing/selftests/core/close_range_test.c
new file mode 100644
index ..d6e6079d3d53
--- /dev/null
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/kernel.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
+ unsigned int flags)
+{
+   return syscall(__NR_close_range, fd, max_fd, flags);
+}
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+int main(int argc, char **argv)
+{
+   const char *test_name = "close_range";
+   int i, ret;
+   int open_fds[101];
+   int fd_max, fd_mid, fd_min;
+
+   ksft_set_plan(9);
+
+   for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
+   int fd;
+
+   fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
+   if (fd < 0) {
+   if (errno == ENOENT)
+   ksft_exit_skip(
+   "%s test: skipping test since /dev/null 
does not exist\n",
+   test_name);
+
+   ksft_exit_fail_msg(
+   "%s test: %s - failed to open /dev/null\n",
+   strerror(errno), test_name);
+   }
+
+   open_fds[i] = fd;
+   }
+
+   fd_min = open_fds[0];
+   fd_max = open_fds[99];
+
+   ret = sys_close_range(fd_min, fd_max, 1);
+   if (!ret)
+   ksft_exit_fail_msg(
+   "%s test: managed to pass invalid flag value\n",
+   test_name);
+   ksft_test_result_pass("do not allow invalid flag values for 
close_range()\n");
+
+   fd_mid = open_fds[50];
+   ret = sys_close_range(fd_min, fd_mid, 0);
+   if (ret < 0)
+   ksft_exit_fail_msg(
+   "%s test: Failed to close range of file descriptors 
from %d to %d\n",
+   test_name, fd_min, fd_mid);
+   ksft_test_result_pass("close_range() from %d to %d\n", fd_min, fd_mid);
+
+   for (i = 0; i <= 50; i++) {
+   ret = fcntl(open_fds[i], F_GETFL);
+   if (ret >= 0)
+   ksft_exit_fail_msg(
+   "%s test: Failed to close range of file 
descriptors from %d to %d\n",
+   test_name, fd_min, fd_mid);
+   }
+   ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", 
fd_min, fd_mid);
+
+   /* create a couple of gaps */
+   close(57);
+   close(78);
+   close(81);
+   close(82);
+   close(84);
+  

[PATCH v2 1/2] open: add close_range()

2019-05-23 Thread Christian Brauner
This adds the close_range() syscall. It allows efficiently closing a range
of file descriptors, up to all file descriptors of the calling task.

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve();

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. daemons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
 task is multi-threaded and shares the file descriptor table with another
 thread in which case two threads could race with one thread allocating
 file descriptors and the other one closing them via close_range(). For the
 general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
int dir_fd;
DIR *dir;
struct dirent *direntp;

dir = opendir("/proc/self/fd");
if (!dir)
return -1;
dir_fd = dirfd(dir);
while ((direntp = readdir(dir))) {
int fd;
if (strcmp(direntp->d_name, ".") == 0)
continue;
if (strcmp(direntp->d_name, "..") == 0)
continue;
fd = atoi(direntp->d_name);
if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
continue;
close(fd);
}
closedir(dir);
return 0;
}

to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():   ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
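
One way a caller might use that (a userspace sketch, not part of this patch;
__NR_close_range comes from the patched headers and a flags value of 0 is
assumed, and the "status pipe at fd 3" is only an example):

    /* keep stdio plus a status pipe parked at fd 3, drop everything else */
    if (syscall(__NR_close_range, 4, ~0U, 0) < 0)
            perror("close_range");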
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extensions. One can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the files_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  files_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to th

[PATCH v2 0/2] close_range()

2019-05-23 Thread Christian Brauner
Hey,

This is v2 of this patchset.

In accordance with some comments, there's a cond_resched() added to the
close loop, similar to what is done for close_files().
A common helper pick_file() for __close_fd() and __close_range() has
been split out. This allows us to only make a cond_resched() call when
filp_close() has been called, similar to what is done in close_files().
Maybe that's not worth it. Jann mentioned that cond_resched() looks
rather cheap.
So it may be that we could simply do:

while (fd <= max_fd) {
   __close(files, fd++);
   cond_resched();
}

I also added a missing test for close_range(fd, fd, 0).

Thanks!
Christian

Christian Brauner (2):
  open: add close_range()
  tests: add close_range() tests

 arch/alpha/kernel/syscalls/syscall.tbl|   1 +
 arch/arm/tools/syscall.tbl|   1 +
 arch/arm64/include/asm/unistd32.h |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl   |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl  |   1 +
 arch/s390/kernel/syscalls/syscall.tbl |   1 +
 arch/sh/kernel/syscalls/syscall.tbl   |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl|   1 +
 arch/x86/entry/syscalls/syscall_32.tbl|   1 +
 arch/x86/entry/syscalls/syscall_64.tbl|   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl   |   1 +
 fs/file.c |  62 +++-
 fs/open.c |  20 +++
 include/linux/fdtable.h   |   2 +
 include/linux/syscalls.h  |   2 +
 include/uapi/asm-generic/unistd.h |   4 +-
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/core/.gitignore   |   1 +
 tools/testing/selftests/core/Makefile |   6 +
 .../testing/selftests/core/close_range_test.c | 142 ++
 26 files changed, 249 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/core/.gitignore
 create mode 100644 tools/testing/selftests/core/Makefile
 create mode 100644 tools/testing/selftests/core/close_range_test.c

-- 
2.21.0



[PATCH v3 3/4] mm: introduce ARCH_HAS_PTE_DEVMAP

2019-05-23 Thread Robin Murphy
ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
with the long-out-of-date comment can lead to the impression that an
architecture may just enable it (since __add_pages() now "comprehends
device memory" for itself) and expect things to work.

In practice, however, ZONE_DEVICE users have little chance of
functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
dependency so the real situation is clearer.

Cc: x...@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Michael Ellerman 
Acked-by: Dan Williams 
Reviewed-by: Ira Weiny 
Acked-by: Oliver O'Halloran 
Reviewed-by: Anshuman Khandual 
Signed-off-by: Robin Murphy 
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h | 1 -
 arch/x86/Kconfig | 2 +-
 arch/x86/include/asm/pgtable.h   | 4 ++--
 arch/x86/include/asm/pgtable_types.h | 1 -
 include/linux/mm.h   | 4 ++--
 include/linux/pfn_t.h| 4 ++--
 mm/Kconfig   | 5 ++---
 mm/gup.c | 2 +-
 9 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8c1c636308c8..1120ff8ac715 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -128,6 +128,7 @@ config PPC
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_APIif PPC64
+   select ARCH_HAS_PTE_DEVMAP  if PPC_BOOK3S_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC64
@@ -135,7 +136,6 @@ config PPC
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE  if PPC64
select ARCH_HAS_UBSAN_SANITIZE_ALL
-   select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_KEEP_MEMBLOCK
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 7dede2e34b70..c6c2bdfb369b 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -90,7 +90,6 @@
 #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty tracking 
*/
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
 #define _PAGE_DEVMAP   _RPAGE_SW1 /* software: ZONE_DEVICE page */
-#define __HAVE_ARCH_PTE_DEVMAP
 
 /*
  * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1ba31..57c4e80bd368 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -69,6 +69,7 @@ config X86
select ARCH_HAS_KCOVif X86_64
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_PMEM_APIif X86_64
+   select ARCH_HAS_PTE_DEVMAP  if X86_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_REFCOUNT
select ARCH_HAS_UACCESS_FLUSHCACHE  if X86_64
@@ -79,7 +80,6 @@ config X86
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_UBSAN_SANITIZE_ALL
-   select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5e0509b41986..0bc530c4eb13 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -271,7 +271,7 @@ static inline int has_transparent_hugepage(void)
return boot_cpu_has(X86_FEATURE_PSE);
 }
 
-#ifdef __HAVE_ARCH_PTE_DEVMAP
+#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pmd_devmap(pmd_t pmd)
 {
return !!(pmd_val(pmd) & _PAGE_DEVMAP);
@@ -732,7 +732,7 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
-#ifdef __HAVE_ARCH_PTE_DEVMAP
+#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pte_devmap(pte_t a)
 {
return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index d6ff0bbdb394..b5e49e6bac63 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -103,7 +103,6 @@
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX   (_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP   (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
-#define __HAVE_ARCH_PTE_DEVMAP
 #else
 #define _PAGE_NX   (_AT(pteval_t, 0))
 #define _PAGE_DEVMAP   (_AT(pteval_t

Re: [PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs

2019-05-23 Thread Frederic Barrat




Le 23/05/2019 à 15:45, Michael Ellerman a écrit :

Frederic Barrat  writes:


If the kernel is notified of an HMI caused by the NPU2, it's currently
not being recognized and it logs the default message:

 Unknown Malfunction Alert of type 3

The NPU on Power 9 has 3 Fault Isolation Registers, so that's a lot of
possible causes, but we should at least log that it's an NPU problem
and report which FIR and which bit were raised if opal gave us the
information.

Signed-off-by: Frederic Barrat 
---

Could be merged independently from (the opal-api.h change is already
in the skiboot tree), but works better with, the matching skiboot
change:
http://patchwork.ozlabs.org/patch/1104076/


Well it *must* work with or without the skiboot change, because old/new
kernels will run on old/new skiboots.

It looks like it will work fine, we just won't get any extra information
in xstop_reason, right?



Yes, that's understood, and it was tested. On an old skiboot, we're now 
printing that we got an NPU checkstop (instead of the "unknown 
malfunction alert"), we just won't have the extra FIR info. That's what 
I meant by "works better with the skiboot patch".


  Fred




cheers


diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index e1577cfa7186..2492fe248e1e 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -568,6 +568,7 @@ enum OpalHMI_XstopType {
CHECKSTOP_TYPE_UNKNOWN  =   0,
CHECKSTOP_TYPE_CORE =   1,
CHECKSTOP_TYPE_NX   =   2,
+   CHECKSTOP_TYPE_NPU  =   3
  };
  
  enum OpalHMI_CoreXstopReason {

diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c 
b/arch/powerpc/platforms/powernv/opal-hmi.c
index 586ec71a4e17..de12a240b477 100644
--- a/arch/powerpc/platforms/powernv/opal-hmi.c
+++ b/arch/powerpc/platforms/powernv/opal-hmi.c
@@ -149,6 +149,43 @@ static void print_nx_checkstop_reason(const char *level,
xstop_reason[i].description);
  }
  
+static void print_npu_checkstop_reason(const char *level,

+   struct OpalHMIEvent *hmi_evt)
+{
+   uint8_t reason, reason_count, i;
+
+   /*
+* We may not have a checkstop reason on some combination of
+* hardware and/or skiboot version
+*/
+   if (!hmi_evt->u.xstop_error.xstop_reason) {
+   printk("%s NPU checkstop on chip %x\n", level,
+   be32_to_cpu(hmi_evt->u.xstop_error.u.chip_id));
+   return;
+   }
+
+   /*
+* NPU2 has 3 FIRs. Reason encoded on a byte as:
+*   2 bits for the FIR number
+*   6 bits for the bit number
+* It may be possible to find several reasons.
+*
+* We don't display a specific message per FIR bit as there
+* are too many and most are meaningless without the workbook
+* and/or hw team help anyway.
+*/
+   reason_count = sizeof(hmi_evt->u.xstop_error.xstop_reason) /
+   sizeof(reason);
+   for (i = 0; i < reason_count; i++) {
+   reason = (hmi_evt->u.xstop_error.xstop_reason >> (8 * i)) & 
0xFF;
+   if (reason)
+   printk("%s NPU checkstop on chip %x: FIR%d bit %d is 
set\n",
+   level,
+   be32_to_cpu(hmi_evt->u.xstop_error.u.chip_id),
+   reason >> 6, reason & 0x3F);
+   }
+}
+
  static void print_checkstop_reason(const char *level,
struct OpalHMIEvent *hmi_evt)
  {
@@ -160,6 +197,9 @@ static void print_checkstop_reason(const char *level,
case CHECKSTOP_TYPE_NX:
print_nx_checkstop_reason(level, hmi_evt);
break;
+   case CHECKSTOP_TYPE_NPU:
+   print_npu_checkstop_reason(level, hmi_evt);
+   break;
default:
printk("%s Unknown Malfunction Alert of type %d\n",
   level, type);
--
2.21.0






Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Christian Brauner
On Thu, May 23, 2019 at 04:32:14PM +0200, Jann Horn wrote:
> On Thu, May 23, 2019 at 1:51 PM Christian Brauner  
> wrote:
> [...]
> > I kept it dumb and was about to reply that your solution introduces more
> > code when it seemed we wanted to keep this very simple for now.
> > But then I saw that find_next_opened_fd() already exists as
> > find_next_fd(). So it's actually not bad compared to what I sent in v1.
> > So - with some small tweaks (need to test it and all now) - how do we
> > feel about?:
> [...]
> > static int __close_next_open_fd(struct files_struct *files, unsigned 
> > *curfd, unsigned maxfd)
> > {
> > struct file *file = NULL;
> > unsigned fd;
> > struct fdtable *fdt;
> >
> > spin_lock(&files->file_lock);
> > fdt = files_fdtable(files);
> > fd = find_next_fd(fdt, *curfd);
> 
> find_next_fd() finds free fds, not used ones.
> 
> > if (fd >= fdt->max_fds || fd > maxfd)
> > goto out_unlock;
> >
> > file = fdt->fd[fd];
> > rcu_assign_pointer(fdt->fd[fd], NULL);
> > __put_unused_fd(files, fd);
> 
> You can't do __put_unused_fd() if the old pointer in fdt->fd[fd] was
> NULL - because that means that the fd has been reserved by another
> thread that is about to put a file pointer in there, and if you
> release the fd here, that messes up the refcounting (or hits the
> BUG_ON() in __fd_install()).
> 
> > out_unlock:
> > spin_unlock(&files->file_lock);
> >
> > if (!file)
> > return -EBADF;
> >
> > *curfd = fd;
> > filp_close(file, files);
> > return 0;
> > }
> >
> > int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > {
> > if (fd > max_fd)
> > return -EINVAL;
> >
> > while (fd <= max_fd) {
> 
> Note that with a pattern like this, you have to be careful about what
> happens if someone gives you max_fd==0x - then this condition
> is always true and the loop can not terminate this way.
> 
> > if (__close_next_fd(files, &fd, maxfd))
> > break;
> 
> (obviously it can still terminate this way)

Yup, this was only a quick draft.
I think the dumb simple thing that I did before was the best way to do
it for now.
I first thought that the find_next_open_fd() function already exists but
when I went to write a POC for testing I realized it doesn't.
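
To make Jann's point above about max_fd == ~0U concrete (a sketch; close_one()
is a made-up stand-in for the pick_file()/filp_close() step):

    unsigned int fd = 3, max_fd = ~0U;      /* UINT_MAX */

    /*
     * "fd <= max_fd" can never become false: once fd reaches ~0U the
     * increment wraps it back to 0, so the loop may only exit via the
     * explicit break when close_one() reports that no fd is left.
     */
    while (fd <= max_fd) {
            if (close_one(fd))
                    break;
            fd++;
    }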


Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Christian Brauner
On Thu, May 23, 2019 at 04:14:47PM +0200, Christian Brauner wrote:
> On Thu, May 23, 2019 at 01:51:18PM +0200, Christian Brauner wrote:
> > On Wed, May 22, 2019 at 06:57:37PM +0200, Oleg Nesterov wrote:
> > > On 05/22, Christian Brauner wrote:
> > > >
> > > > +static struct file *pick_file(struct files_struct *files, unsigned fd)
> > > >  {
> > > > -   struct file *file;
> > > > +   struct file *file = NULL;
> > > > struct fdtable *fdt;
> > > >  
> > > > spin_lock(&files->file_lock);
> > > > @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, 
> > > > unsigned fd)
> > > > goto out_unlock;
> > > > rcu_assign_pointer(fdt->fd[fd], NULL);
> > > > __put_unused_fd(files, fd);
> > > > -   spin_unlock(&files->file_lock);
> > > > -   return filp_close(file, files);
> > > >  
> > > >  out_unlock:
> > > > spin_unlock(&files->file_lock);
> > > > -   return -EBADF;
> > > > +   return file;
> > > 
> > > ...
> > > 
> > > > +int __close_range(struct files_struct *files, unsigned fd, unsigned 
> > > > max_fd)
> > > > +{
> > > > +   unsigned int cur_max;
> > > > +
> > > > +   if (fd > max_fd)
> > > > +   return -EINVAL;
> > > > +
> > > > +   rcu_read_lock();
> > > > +   cur_max = files_fdtable(files)->max_fds;
> > > > +   rcu_read_unlock();
> > > > +
> > > > +   /* cap to last valid index into fdtable */
> > > > +   if (max_fd >= cur_max)
> > > > +   max_fd = cur_max - 1;
> > > > +
> > > > +   while (fd <= max_fd) {
> > > > +   struct file *file;
> > > > +
> > > > +   file = pick_file(files, fd++);
> > > 
> > > Well, how about something like
> > > 
> > >   static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned 
> > > start)
> > >   {
> > >   unsigned int maxfd = fdt->max_fds;
> > >   unsigned int maxbit = maxfd / BITS_PER_LONG;
> > >   unsigned int bitbit = start / BITS_PER_LONG;
> > > 
> > >   bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * 
> > > BITS_PER_LONG;
> > >   if (bitbit > maxfd)
> > >   return maxfd;
> > >   if (bitbit > start)
> > >   start = bitbit;
> > >   return find_next_bit(fdt->open_fds, maxfd, start);
> > >   }
> > 
> > > 
> > >   unsigned close_next_fd(struct files_struct *files, unsigned start, 
> > > unsigned maxfd)
> > >   {
> > >   unsigned fd;
> > >   struct file *file;
> > >   struct fdtable *fdt;
> > >   
> > >   spin_lock(&files->file_lock);
> > >   fdt = files_fdtable(files);
> > >   fd = find_next_opened_fd(fdt, start);
> > >   if (fd >= fdt->max_fds || fd > maxfd) {
> > >   fd = -1;
> > >   goto out;
> > >   }
> > > 
> > >   file = fdt->fd[fd];
> > >   rcu_assign_pointer(fdt->fd[fd], NULL);
> > >   __put_unused_fd(files, fd);
> > >   out:
> > >   spin_unlock(&files->file_lock);
> > > 
> > >   if (fd == -1u)
> > >   return fd;
> > > 
> > >   filp_close(file, files);
> > >   return fd + 1;
> > >   }
> > 
> > Thanks, Oleg!
> > 
> > I kept it dumb and was about to reply that your solution introduces more
> > code when it seemed we wanted to keep this very simple for now.
> > But then I saw that find_next_opened_fd() already exists as
> > find_next_fd(). So it's actually not bad compared to what I sent in v1.
> > So - with some small tweaks (need to test it and all now) - how do we
> > feel about?:
> 
> That's obviously not correct atm but I'll send out a tweaked version in
> a bit.

So given that we would really need another find_next_open_fd() I think
sticking to the simple cond_resched() version I sent before is better
for now until we see real-world performance issues.
I was however missing a test for close_range(fd, fd, 0) anyway so I'll
need to send a v2 with this test added.

Christian


Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Christian Brauner
On Thu, May 23, 2019 at 01:51:18PM +0200, Christian Brauner wrote:
> On Wed, May 22, 2019 at 06:57:37PM +0200, Oleg Nesterov wrote:
> > On 05/22, Christian Brauner wrote:
> > >
> > > +static struct file *pick_file(struct files_struct *files, unsigned fd)
> > >  {
> > > - struct file *file;
> > > + struct file *file = NULL;
> > >   struct fdtable *fdt;
> > >  
> > >   spin_lock(&files->file_lock);
> > > @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned 
> > > fd)
> > >   goto out_unlock;
> > >   rcu_assign_pointer(fdt->fd[fd], NULL);
> > >   __put_unused_fd(files, fd);
> > > - spin_unlock(&files->file_lock);
> > > - return filp_close(file, files);
> > >  
> > >  out_unlock:
> > >   spin_unlock(&files->file_lock);
> > > - return -EBADF;
> > > + return file;
> > 
> > ...
> > 
> > > +int __close_range(struct files_struct *files, unsigned fd, unsigned 
> > > max_fd)
> > > +{
> > > + unsigned int cur_max;
> > > +
> > > + if (fd > max_fd)
> > > + return -EINVAL;
> > > +
> > > + rcu_read_lock();
> > > + cur_max = files_fdtable(files)->max_fds;
> > > + rcu_read_unlock();
> > > +
> > > + /* cap to last valid index into fdtable */
> > > + if (max_fd >= cur_max)
> > > + max_fd = cur_max - 1;
> > > +
> > > + while (fd <= max_fd) {
> > > + struct file *file;
> > > +
> > > + file = pick_file(files, fd++);
> > 
> > Well, how about something like
> > 
> > static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned 
> > start)
> > {
> > unsigned int maxfd = fdt->max_fds;
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> > 
> > bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * 
> > BITS_PER_LONG;
> > if (bitbit > maxfd)
> > return maxfd;
> > if (bitbit > start)
> > start = bitbit;
> > return find_next_bit(fdt->open_fds, maxfd, start);
> > }
> 
> > 
> > unsigned close_next_fd(struct files_struct *files, unsigned start, 
> > unsigned maxfd)
> > {
> > unsigned fd;
> > struct file *file;
> > struct fdtable *fdt;
> > 
> > spin_lock(&files->file_lock);
> > fdt = files_fdtable(files);
> > fd = find_next_opened_fd(fdt, start);
> > if (fd >= fdt->max_fds || fd > maxfd) {
> > fd = -1;
> > goto out;
> > }
> > 
> > file = fdt->fd[fd];
> > rcu_assign_pointer(fdt->fd[fd], NULL);
> > __put_unused_fd(files, fd);
> > out:
> > spin_unlock(&files->file_lock);
> > 
> > if (fd == -1u)
> > return fd;
> > 
> > filp_close(file, files);
> > return fd + 1;
> > }
> 
> Thanks, Oleg!
> 
> I kept it dumb and was about to reply that your solution introduces more
> code when it seemed we wanted to keep this very simple for now.
> But then I saw that find_next_opened_fd() already exists as
> find_next_fd(). So it's actually not bad compared to what I sent in v1.
> So - with some small tweaks (need to test it and all now) - how do we
> feel about?:

That's obviously not correct atm but I'll send out a tweaked version in
a bit.

Christian


Applied "spi: spi-fsl-spi: call spi_finalize_current_message() at the end" to the spi tree

2019-05-23 Thread Mark Brown
The patch

   spi: spi-fsl-spi: call spi_finalize_current_message() at the end

has been applied to the spi tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-5.2

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From 44a042182cb1e9f7916e015c836967bf638b33c4 Mon Sep 17 00:00:00 2001
From: Christophe Leroy 
Date: Wed, 22 May 2019 11:00:36 +
Subject: [PATCH] spi: spi-fsl-spi: call spi_finalize_current_message() at the
 end

spi_finalize_current_message() shall be called once all
actions are finished, otherwise the last actions might
step over a newly started transfer.

Fixes: c592becbe704 ("spi: fsl-(e)spi: migrate to generic master queueing")
Signed-off-by: Christophe Leroy 
Signed-off-by: Mark Brown 
---
 drivers/spi/spi-fsl-spi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/spi/spi-fsl-spi.c b/drivers/spi/spi-fsl-spi.c
index b36ac6aa3b1f..7fbdaf066719 100644
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -432,7 +432,6 @@ static int fsl_spi_do_one_msg(struct spi_master *master,
}
 
m->status = status;
-   spi_finalize_current_message(master);
 
if (status || !cs_change) {
ndelay(nsecs);
@@ -440,6 +439,7 @@ static int fsl_spi_do_one_msg(struct spi_master *master,
}
 
fsl_spi_setup_transfer(spi, NULL);
+   spi_finalize_current_message(master);
return 0;
 }
 
-- 
2.20.1



Re: [PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs

2019-05-23 Thread Michael Ellerman
Frederic Barrat  writes:

> If the kernel is notified of an HMI caused by the NPU2, it's currently
> not being recognized and it logs the default message:
>
> Unknown Malfunction Alert of type 3
>
> The NPU on Power 9 has 3 Fault Isolation Registers, so that's a lot of
> possible causes, but we should at least log that it's an NPU problem
> and report which FIR and which bit were raised if opal gave us the
> information.
>
> Signed-off-by: Frederic Barrat 
> ---
>
> Could be merged independently from (the opal-api.h change is already
> in the skiboot tree), but works better with, the matching skiboot
> change:
> http://patchwork.ozlabs.org/patch/1104076/

Well it *must* work with or without the skiboot change, because old/new
kernels will run on old/new skiboots.

It looks like it will work fine, we just won't get any extra information
in xstop_reason, right?

cheers

> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index e1577cfa7186..2492fe248e1e 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -568,6 +568,7 @@ enum OpalHMI_XstopType {
>   CHECKSTOP_TYPE_UNKNOWN  =   0,
>   CHECKSTOP_TYPE_CORE =   1,
>   CHECKSTOP_TYPE_NX   =   2,
> + CHECKSTOP_TYPE_NPU  =   3
>  };
>  
>  enum OpalHMI_CoreXstopReason {
> diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c 
> b/arch/powerpc/platforms/powernv/opal-hmi.c
> index 586ec71a4e17..de12a240b477 100644
> --- a/arch/powerpc/platforms/powernv/opal-hmi.c
> +++ b/arch/powerpc/platforms/powernv/opal-hmi.c
> @@ -149,6 +149,43 @@ static void print_nx_checkstop_reason(const char *level,
>   xstop_reason[i].description);
>  }
>  
> +static void print_npu_checkstop_reason(const char *level,
> + struct OpalHMIEvent *hmi_evt)
> +{
> + uint8_t reason, reason_count, i;
> +
> + /*
> +  * We may not have a checkstop reason on some combination of
> +  * hardware and/or skiboot version
> +  */
> + if (!hmi_evt->u.xstop_error.xstop_reason) {
> + printk("%s  NPU checkstop on chip %x\n", level,
> + be32_to_cpu(hmi_evt->u.xstop_error.u.chip_id));
> + return;
> + }
> +
> + /*
> +  * NPU2 has 3 FIRs. Reason encoded on a byte as:
> +  *   2 bits for the FIR number
> +  *   6 bits for the bit number
> +  * It may be possible to find several reasons.
> +  *
> +  * We don't display a specific message per FIR bit as there
> +  * are too many and most are meaningless without the workbook
> +  * and/or hw team help anyway.
> +  */
> + reason_count = sizeof(hmi_evt->u.xstop_error.xstop_reason) /
> + sizeof(reason);
> + for (i = 0; i < reason_count; i++) {
> + reason = (hmi_evt->u.xstop_error.xstop_reason >> (8 * i)) & 
> 0xFF;
> + if (reason)
> + printk("%s  NPU checkstop on chip %x: FIR%d bit %d 
> is set\n",
> + level,
> + be32_to_cpu(hmi_evt->u.xstop_error.u.chip_id),
> + reason >> 6, reason & 0x3F);
> + }
> +}
> +
>  static void print_checkstop_reason(const char *level,
>   struct OpalHMIEvent *hmi_evt)
>  {
> @@ -160,6 +197,9 @@ static void print_checkstop_reason(const char *level,
>   case CHECKSTOP_TYPE_NX:
>   print_nx_checkstop_reason(level, hmi_evt);
>   break;
> + case CHECKSTOP_TYPE_NPU:
> + print_npu_checkstop_reason(level, hmi_evt);
> + break;
>   default:
>   printk("%s  Unknown Malfunction Alert of type %d\n",
>  level, type);
> -- 
> 2.21.0


[PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs

2019-05-23 Thread Frederic Barrat
If the kernel is notified of an HMI caused by the NPU2, it's currently
not being recognized and it logs the default message:

Unknown Malfunction Alert of type 3

The NPU on Power 9 has 3 Fault Isolation Registers, so that's a lot of
possible causes, but we should at least log that it's an NPU problem
and report which FIR and which bit were raised if opal gave us the
information.

Signed-off-by: Frederic Barrat 
---

Could be merged independently of the matching skiboot change (the
opal-api.h change is already in the skiboot tree), but works better
with it:
http://patchwork.ozlabs.org/patch/1104076/
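
As a worked example of the encoding handled in the patch below: a reason byte
of 0x83 (binary 10 000011) decodes as FIR2, bit 3 (sketch, example value only):

    uint8_t reason = 0x83;                  /* hypothetical reason byte */
    unsigned int fir = reason >> 6;         /* top 2 bits: FIR number -> 2 */
    unsigned int bit = reason & 0x3F;       /* low 6 bits: bit number -> 3 */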


 arch/powerpc/include/asm/opal-api.h   |  1 +
 arch/powerpc/platforms/powernv/opal-hmi.c | 40 +++
 2 files changed, 41 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index e1577cfa7186..2492fe248e1e 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -568,6 +568,7 @@ enum OpalHMI_XstopType {
CHECKSTOP_TYPE_UNKNOWN  =   0,
CHECKSTOP_TYPE_CORE =   1,
CHECKSTOP_TYPE_NX   =   2,
+   CHECKSTOP_TYPE_NPU  =   3
 };
 
 enum OpalHMI_CoreXstopReason {
diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c 
b/arch/powerpc/platforms/powernv/opal-hmi.c
index 586ec71a4e17..de12a240b477 100644
--- a/arch/powerpc/platforms/powernv/opal-hmi.c
+++ b/arch/powerpc/platforms/powernv/opal-hmi.c
@@ -149,6 +149,43 @@ static void print_nx_checkstop_reason(const char *level,
xstop_reason[i].description);
 }
 
+static void print_npu_checkstop_reason(const char *level,
+   struct OpalHMIEvent *hmi_evt)
+{
+   uint8_t reason, reason_count, i;
+
+   /*
+* We may not have a checkstop reason on some combination of
+* hardware and/or skiboot version
+*/
+   if (!hmi_evt->u.xstop_error.xstop_reason) {
+   printk("%s  NPU checkstop on chip %x\n", level,
+   be32_to_cpu(hmi_evt->u.xstop_error.u.chip_id));
+   return;
+   }
+
+   /*
+* NPU2 has 3 FIRs. Reason encoded on a byte as:
+*   2 bits for the FIR number
+*   6 bits for the bit number
+* It may be possible to find several reasons.
+*
+* We don't display a specific message per FIR bit as there
+* are too many and most are meaningless without the workbook
+* and/or hw team help anyway.
+*/
+   reason_count = sizeof(hmi_evt->u.xstop_error.xstop_reason) /
+   sizeof(reason);
+   for (i = 0; i < reason_count; i++) {
+   reason = (hmi_evt->u.xstop_error.xstop_reason >> (8 * i)) & 0xFF;
+   if (reason)
+   printk("%s  NPU checkstop on chip %x: FIR%d bit %d 
is set\n",
+   level,
+   be32_to_cpu(hmi_evt->u.xstop_error.u.chip_id),
+   reason >> 6, reason & 0x3F);
+   }
+}
+
 static void print_checkstop_reason(const char *level,
struct OpalHMIEvent *hmi_evt)
 {
@@ -160,6 +197,9 @@ static void print_checkstop_reason(const char *level,
case CHECKSTOP_TYPE_NX:
print_nx_checkstop_reason(level, hmi_evt);
break;
+   case CHECKSTOP_TYPE_NPU:
+   print_npu_checkstop_reason(level, hmi_evt);
+   break;
default:
printk("%s  Unknown Malfunction Alert of type %d\n",
   level, type);
-- 
2.21.0



Re: [PATCH] powerpc/power: Expose pfn_is_nosave prototype

2019-05-23 Thread Christophe Leroy




Le 23/05/2019 à 13:47, Mathieu Malaterre a écrit :

The declaration for pfn_is_nosave is only available in
kernel/power/power.h. Since this function can be overridden by the arch,
expose it globally. Having a prototype will make sure to avoid warnings
(sometimes treated as errors with W=1) such as:

   arch/powerpc/kernel/suspend.c:18:5: error: no previous prototype for 'pfn_is_nosave' [-Werror=missing-prototypes]

This moves the declaration into a globally visible header file and adds
the missing include to avoid a warning in powerpc.


Then you should also drop it from kernel/power/power.h and 
arch/s390/kernel/entry.h


Christophe



Signed-off-by: Mathieu Malaterre 
---
  arch/powerpc/kernel/suspend.c | 1 +
  include/linux/suspend.h   | 1 +
  2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/suspend.c b/arch/powerpc/kernel/suspend.c
index a531154cc0f3..9e1b6b894245 100644
--- a/arch/powerpc/kernel/suspend.c
+++ b/arch/powerpc/kernel/suspend.c
@@ -8,6 +8,7 @@
   */
  
  #include 

+#include 
  #include 
  #include 
  
diff --git a/include/linux/suspend.h b/include/linux/suspend.h

index 3f529ad9a9d2..2660bbdf5230 100644
--- a/include/linux/suspend.h
+++ b/include/linux/suspend.h
@@ -395,6 +395,7 @@ extern bool system_entering_hibernation(void);
  extern bool hibernation_available(void);
  asmlinkage int swsusp_save(void);
  extern struct pbe *restore_pblist;
+int pfn_is_nosave(unsigned long pfn);
  #else /* CONFIG_HIBERNATION */
  static inline void register_nosave_region(unsigned long b, unsigned long e) {}
  static inline void register_nosave_region_late(unsigned long b, unsigned long 
e) {}



Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Christian Brauner
On Wed, May 22, 2019 at 06:57:37PM +0200, Oleg Nesterov wrote:
> On 05/22, Christian Brauner wrote:
> >
> > +static struct file *pick_file(struct files_struct *files, unsigned fd)
> >  {
> > -   struct file *file;
> > +   struct file *file = NULL;
> > struct fdtable *fdt;
> >  
> > spin_lock(&files->file_lock);
> > @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned 
> > fd)
> > goto out_unlock;
> > rcu_assign_pointer(fdt->fd[fd], NULL);
> > __put_unused_fd(files, fd);
> > -   spin_unlock(&files->file_lock);
> > -   return filp_close(file, files);
> >  
> >  out_unlock:
> > spin_unlock(&files->file_lock);
> > -   return -EBADF;
> > +   return file;
> 
> ...
> 
> > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > +{
> > +   unsigned int cur_max;
> > +
> > +   if (fd > max_fd)
> > +   return -EINVAL;
> > +
> > +   rcu_read_lock();
> > +   cur_max = files_fdtable(files)->max_fds;
> > +   rcu_read_unlock();
> > +
> > +   /* cap to last valid index into fdtable */
> > +   if (max_fd >= cur_max)
> > +   max_fd = cur_max - 1;
> > +
> > +   while (fd <= max_fd) {
> > +   struct file *file;
> > +
> > +   file = pick_file(files, fd++);
> 
> Well, how about something like
> 
>   static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned 
> start)
>   {
>   unsigned int maxfd = fdt->max_fds;
>   unsigned int maxbit = maxfd / BITS_PER_LONG;
>   unsigned int bitbit = start / BITS_PER_LONG;
> 
>   bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * 
> BITS_PER_LONG;
>   if (bitbit > maxfd)
>   return maxfd;
>   if (bitbit > start)
>   start = bitbit;
>   return find_next_bit(fdt->open_fds, maxfd, start);
>   }

> 
>   unsigned close_next_fd(struct files_struct *files, unsigned start, 
> unsigned maxfd)
>   {
>   unsigned fd;
>   struct file *file;
>   struct fdtable *fdt;
>   
>   spin_lock(&files->file_lock);
>   fdt = files_fdtable(files);
>   fd = find_next_opened_fd(fdt, start);
>   if (fd >= fdt->max_fds || fd > maxfd) {
>   fd = -1;
>   goto out;
>   }
> 
>   file = fdt->fd[fd];
>   rcu_assign_pointer(fdt->fd[fd], NULL);
>   __put_unused_fd(files, fd);
>   out:
>   spin_unlock(&files->file_lock);
> 
>   if (fd == -1u)
>   return fd;
> 
>   filp_close(file, files);
>   return fd + 1;
>   }

Thanks, Oleg!

I kept it dumb and was about to reply that your solution introduces more
code when it seemed we wanted to keep this very simple for now.
But then I saw that find_next_opened_fd() already exists as
find_next_fd(). So it's actually not bad compared to what I sent in v1.
So - with some small tweaks (need to test it and all now) - how do we
feel about?:

/**
 * __close_next_open_fd() - Close the nearest open fd.
 *
 * @curfd: lowest file descriptor to consider
 * @maxfd: highest file descriptor to consider
 *
 * This function will close the nearest open fd, i.e. it will either
 * close @curfd if it is open or the closest open file descriptor
 * greater than @curfd that is smaller than or equal to @maxfd.
 * If the function found a file descriptor to close it will return 0 and
 * place the file descriptor it closed in @curfd. If it did not find a
 * file descriptor to close it will return -EBADF.
 */
static int __close_next_open_fd(struct files_struct *files, unsigned *curfd, 
unsigned maxfd)
{
struct file *file = NULL;
unsigned fd;
struct fdtable *fdt;

spin_lock(&files->file_lock);
fdt = files_fdtable(files);
fd = find_next_fd(fdt, *curfd);
if (fd >= fdt->max_fds || fd > maxfd)
goto out_unlock;

file = fdt->fd[fd];
rcu_assign_pointer(fdt->fd[fd], NULL);
__put_unused_fd(files, fd);

out_unlock:
spin_unlock(&files->file_lock);

if (!file)
return -EBADF;

*curfd = fd;
filp_close(file, files);
return 0;
}

int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
{
if (fd > max_fd)
return -EINVAL;

while (fd <= max_fd) {
if (__close_next_fd(files, &fd, maxfd))
break;

cond_resched();
fd++;
}

return 0;
}

SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
unsigned int, flags)
{
if (flags)
return -EINVAL;

return __close_range(current->files, fd, max_fd);
}


Re: [PATCH v2] powerpc/32: sstep: Move variable `rc` within CONFIG_PPC64 sentinels

2019-05-23 Thread Mathieu Malaterre
ping ?

On Tue, Mar 12, 2019 at 10:23 PM Mathieu Malaterre  wrote:
>
> Fix warnings treated as errors with W=1:
>
>   arch/powerpc/lib/sstep.c:1172:31: error: variable 'rc' set but not used 
> [-Werror=unused-but-set-variable]
>
> Suggested-by: Christophe Leroy 
> Signed-off-by: Mathieu Malaterre 
> ---
> v2: as suggested prefer CONFIG_PPC64 sentinel instead of unused keyword
>
>  arch/powerpc/lib/sstep.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index 3d33fb509ef4..9996dc7a0b46 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -1169,7 +1169,10 @@ static nokprobe_inline int trap_compare(long v1, long 
> v2)
>  int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
>   unsigned int instr)
>  {
> -   unsigned int opcode, ra, rb, rc, rd, spr, u;
> +   unsigned int opcode, ra, rb, rd, spr, u;
> +#ifdef CONFIG_PPC64
> +   unsigned int rc;
> +#endif
> unsigned long int imm;
> unsigned long int val, val2;
> unsigned int mb, me, sh;
> @@ -1292,7 +1295,9 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
> rd = (instr >> 21) & 0x1f;
> ra = (instr >> 16) & 0x1f;
> rb = (instr >> 11) & 0x1f;
> +#ifdef CONFIG_PPC64
> rc = (instr >> 6) & 0x1f;
> +#endif
>
> switch (opcode) {
>  #ifdef __powerpc64__
> --
> 2.20.1
>


[PATCH] powerpc/power: Expose pfn_is_nosave prototype

2019-05-23 Thread Mathieu Malaterre
The declaration for pfn_is_nosave is only available in
kernel/power/power.h. Since this function can be overridden by the arch,
expose it globally. Having a prototype will make sure to avoid warnings
(sometimes treated as errors with W=1) such as:

  arch/powerpc/kernel/suspend.c:18:5: error: no previous prototype for 'pfn_is_nosave' [-Werror=missing-prototypes]

This moves the declaration into a globally visible header file and adds
the missing include to avoid a warning in powerpc.

Signed-off-by: Mathieu Malaterre 
---
 arch/powerpc/kernel/suspend.c | 1 +
 include/linux/suspend.h   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/suspend.c b/arch/powerpc/kernel/suspend.c
index a531154cc0f3..9e1b6b894245 100644
--- a/arch/powerpc/kernel/suspend.c
+++ b/arch/powerpc/kernel/suspend.c
@@ -8,6 +8,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 
diff --git a/include/linux/suspend.h b/include/linux/suspend.h
index 3f529ad9a9d2..2660bbdf5230 100644
--- a/include/linux/suspend.h
+++ b/include/linux/suspend.h
@@ -395,6 +395,7 @@ extern bool system_entering_hibernation(void);
 extern bool hibernation_available(void);
 asmlinkage int swsusp_save(void);
 extern struct pbe *restore_pblist;
+int pfn_is_nosave(unsigned long pfn);
 #else /* CONFIG_HIBERNATION */
 static inline void register_nosave_region(unsigned long b, unsigned long e) {}
 static inline void register_nosave_region_late(unsigned long b, unsigned long 
e) {}
-- 
2.20.1



Patch "x86/mpx, mm/core: Fix recursive munmap() corruption" has been added to the 5.1-stable tree

2019-05-23 Thread gregkh


This is a note to let you know that I've just added the patch titled

x86/mpx, mm/core: Fix recursive munmap() corruption

to the 5.1-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 x86-mpx-mm-core-fix-recursive-munmap-corruption.patch
and it can be found in the queue-5.1 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From 5a28fc94c9143db766d1ba5480cae82d856ad080 Mon Sep 17 00:00:00 2001
From: Dave Hansen 
Date: Fri, 19 Apr 2019 12:47:47 -0700
Subject: x86/mpx, mm/core: Fix recursive munmap() corruption

From: Dave Hansen 

commit 5a28fc94c9143db766d1ba5480cae82d856ad080 upstream.

This is a bit of a mess, to put it mildly.  But, it's a bug
that only seems to have showed up in 4.20 but wasn't noticed
until now, because nobody uses MPX.

MPX has the arch_unmap() hook inside of munmap() because MPX
uses bounds tables that protect other areas of memory.  When
memory is unmapped, there is also a need to unmap the MPX
bounds tables.  Barring this, unused bounds tables can eat 80%
of the address space.

But, the recursive do_munmap() that gets called via arch_unmap()
wreaks havoc with __do_munmap()'s state.  It can result in
freeing populated page tables, accessing bogus VMA state,
double-freed VMAs and more.

See the "long story" further below for the gory details.

To fix this, call arch_unmap() before __do_unmap() has a chance
to do anything meaningful.  Also, remove the 'vma' argument
and force the MPX code to do its own, independent VMA lookup.

== UML / unicore32 impact ==

Remove unused 'vma' argument to arch_unmap().  No functional
change.

I compile tested this on UML but not unicore32.

== powerpc impact ==

powerpc uses arch_unmap() well to watch for munmap() on the
VDSO and zeroes out 'current->mm->context.vdso_base'.  Moving
arch_unmap() makes this happen earlier in __do_munmap().  But,
'vdso_base' seems to only be used in perf and in the signal
delivery that happens near the return to userspace.  I can not
find any likely impact to powerpc, other than the zeroing
happening a little earlier.

powerpc does not use the 'vma' argument and is unaffected by
its removal.

I compile-tested a 64-bit powerpc defconfig.

== x86 impact ==

For the common success case this is functionally identical to
what was there before.  For the munmap() failure case, it's
possible that some MPX tables will be zapped for memory that
continues to be in use.  But, this is an extraordinarily
unlikely scenario and the harm would be that MPX provides no
protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.

Because if you're doing munmap(), you are *done* with the
memory.  There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as
well as the existing mpx selftests/.

The long story:

munmap() has a couple of pieces:

 1. Find the affected VMA(s)
 2. Split the start/end one(s) if necessary
 3. Pull the VMAs out of the rbtree
 4. Actually zap the memory via unmap_region(), including
freeing page tables (or queueing them to be freed).
 5. Fix up some of the accounting (like fput()) and actually
free the VMA itself.

This specific ordering was actually introduced by:

  dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

during the 4.20 merge window.  The previous __do_munmap() code
was actually safe because the only thing after arch_unmap() was
remove_vma_list().  arch_unmap() could not see 'vma' in the
rbtree because it was detached, so it is not even capable of
doing operations unsafe for remove_vma_list()'s use of 'vma'.

Richard Biener reported a test that shows this in dmesg:

  [1216548.787498] BUG: Bad rss-counter state mm:17ce560b idx:1 val:551
  [1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via
arch_unmap().  It was freeing page tables that had not been
properly zapped.

But, the problem was bigger than this.  For one, arch_unmap()
can free VMAs.  But, the calling __do_munmap() has variables
that *point* to VMAs and obviously can't handle them just
getting freed while the pointer is still in use.

I tried a couple of things here.  First, I tried to fix the page
table freeing problem in isolation, but I then found the VMA
issue.  I also tried having the MPX code return a flag if it
modified the rbtree which would force __do_munmap() to re-walk
to restart.  That spiralled out of control in complexity pretty
fast.

Just moving arch_unmap() and accepting that the bonkers failure
case might eat some bounds tables seems like the simplest viable
fix.

This was also reported in the following

Patch "x86/mpx, mm/core: Fix recursive munmap() corruption" has been added to the 5.0-stable tree

2019-05-23 Thread gregkh


This is a note to let you know that I've just added the patch titled

x86/mpx, mm/core: Fix recursive munmap() corruption

to the 5.0-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 x86-mpx-mm-core-fix-recursive-munmap-corruption.patch
and it can be found in the queue-5.0 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From 5a28fc94c9143db766d1ba5480cae82d856ad080 Mon Sep 17 00:00:00 2001
From: Dave Hansen 
Date: Fri, 19 Apr 2019 12:47:47 -0700
Subject: x86/mpx, mm/core: Fix recursive munmap() corruption

From: Dave Hansen 

commit 5a28fc94c9143db766d1ba5480cae82d856ad080 upstream.

This is a bit of a mess, to put it mildly.  But, it's a bug
that only seems to have showed up in 4.20 but wasn't noticed
until now, because nobody uses MPX.

MPX has the arch_unmap() hook inside of munmap() because MPX
uses bounds tables that protect other areas of memory.  When
memory is unmapped, there is also a need to unmap the MPX
bounds tables.  Barring this, unused bounds tables can eat 80%
of the address space.

But, the recursive do_munmap() that gets called via arch_unmap()
wreaks havoc with __do_munmap()'s state.  It can result in
freeing populated page tables, accessing bogus VMA state,
double-freed VMAs and more.

See the "long story" further below for the gory details.

To fix this, call arch_unmap() before __do_unmap() has a chance
to do anything meaningful.  Also, remove the 'vma' argument
and force the MPX code to do its own, independent VMA lookup.

== UML / unicore32 impact ==

Remove unused 'vma' argument to arch_unmap().  No functional
change.

I compile tested this on UML but not unicore32.

== powerpc impact ==

powerpc uses arch_unmap() well to watch for munmap() on the
VDSO and zeroes out 'current->mm->context.vdso_base'.  Moving
arch_unmap() makes this happen earlier in __do_munmap().  But,
'vdso_base' seems to only be used in perf and in the signal
delivery that happens near the return to userspace.  I can not
find any likely impact to powerpc, other than the zeroing
happening a little earlier.

powerpc does not use the 'vma' argument and is unaffected by
its removal.

I compile-tested a 64-bit powerpc defconfig.

== x86 impact ==

For the common success case this is functionally identical to
what was there before.  For the munmap() failure case, it's
possible that some MPX tables will be zapped for memory that
continues to be in use.  But, this is an extraordinarily
unlikely scenario and the harm would be that MPX provides no
protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.

Because if you're doing munmap(), you are *done* with the
memory.  There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as
well as the existing mpx selftests/.

The long story:

munmap() has a couple of pieces:

 1. Find the affected VMA(s)
 2. Split the start/end one(s) if necessary
 3. Pull the VMAs out of the rbtree
 4. Actually zap the memory via unmap_region(), including
freeing page tables (or queueing them to be freed).
 5. Fix up some of the accounting (like fput()) and actually
free the VMA itself.

This specific ordering was actually introduced by:

  dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

during the 4.20 merge window.  The previous __do_munmap() code
was actually safe because the only thing after arch_unmap() was
remove_vma_list().  arch_unmap() could not see 'vma' in the
rbtree because it was detached, so it is not even capable of
doing operations unsafe for remove_vma_list()'s use of 'vma'.

Richard Biener reported a test that shows this in dmesg:

  [1216548.787498] BUG: Bad rss-counter state mm:17ce560b idx:1 val:551
  [1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via
arch_unmap().  It was freeing page tables that had not been
properly zapped.

But, the problem was bigger than this.  For one, arch_unmap()
can free VMAs.  But, the calling __do_munmap() has variables
that *point* to VMAs and obviously can't handle them just
getting freed while the pointer is still in use.

I tried a couple of things here.  First, I tried to fix the page
table freeing problem in isolation, but I then found the VMA
issue.  I also tried having the MPX code return a flag if it
modified the rbtree which would force __do_munmap() to re-walk
to restart.  That spiralled out of control in complexity pretty
fast.

Just moving arch_unmap() and accepting that the bonkers failure
case might eat some bounds tables seems like the simplest viable
fix.

This was also reported in the following

[PATCH v4 3/3] kselftest: Extend vDSO selftest to clock_getres

2019-05-23 Thread Vincenzo Frascino
The current version of the multiarch vDSO selftest verifies only
gettimeofday.

Extend the vDSO selftest to clock_getres, to verify that the
syscall and the vDSO library function return the same information.

The extension has been used to verify the hrtimer_resolution fix.

Cc: Shuah Khan 
Signed-off-by: Vincenzo Frascino 
---

Note: This patch is independent from the others in this series, hence it
can be merged singularly by the kselftest maintainers.

 tools/testing/selftests/vDSO/Makefile |   2 +
 .../selftests/vDSO/vdso_clock_getres.c| 124 ++
 2 files changed, 126 insertions(+)
 create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c

diff --git a/tools/testing/selftests/vDSO/Makefile 
b/tools/testing/selftests/vDSO/Makefile
index 9e03d61f52fd..d5c5bfdf1ac1 100644
--- a/tools/testing/selftests/vDSO/Makefile
+++ b/tools/testing/selftests/vDSO/Makefile
@@ -5,6 +5,7 @@ uname_M := $(shell uname -m 2>/dev/null || echo not)
 ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
 
 TEST_GEN_PROGS := $(OUTPUT)/vdso_test
+TEST_GEN_PROGS += $(OUTPUT)/vdso_clock_getres
 ifeq ($(ARCH),x86)
 TEST_GEN_PROGS += $(OUTPUT)/vdso_standalone_test_x86
 endif
@@ -18,6 +19,7 @@ endif
 
 all: $(TEST_GEN_PROGS)
 $(OUTPUT)/vdso_test: parse_vdso.c vdso_test.c
+$(OUTPUT)/vdso_clock_getres: vdso_clock_getres.c
 $(OUTPUT)/vdso_standalone_test_x86: vdso_standalone_test_x86.c parse_vdso.c
$(CC) $(CFLAGS) $(CFLAGS_vdso_standalone_test_x86) \
vdso_standalone_test_x86.c parse_vdso.c \
diff --git a/tools/testing/selftests/vDSO/vdso_clock_getres.c 
b/tools/testing/selftests/vDSO/vdso_clock_getres.c
new file mode 100644
index ..b62d8d4f7c38
--- /dev/null
+++ b/tools/testing/selftests/vDSO/vdso_clock_getres.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+/*
+ * vdso_clock_getres.c: Sample code to test clock_getres.
+ * Copyright (c) 2019 Arm Ltd.
+ *
+ * Compile with:
+ * gcc -std=gnu99 vdso_clock_getres.c
+ *
+ * Tested on ARM, ARM64, MIPS32, x86 (32-bit and 64-bit),
+ * Power (32-bit and 64-bit), S390x (32-bit and 64-bit).
+ * Might work on other architectures.
+ */
+
+#define _GNU_SOURCE
+#include <elf.h>
+#include <err.h>
+#include <fcntl.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/auxv.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+static long syscall_clock_getres(clockid_t _clkid, struct timespec *_ts)
+{
+   long ret;
+
+   ret = syscall(SYS_clock_getres, _clkid, _ts);
+
+   return ret;
+}
+
+const char *vdso_clock_name[12] = {
+   "CLOCK_REALTIME",
+   "CLOCK_MONOTONIC",
+   "CLOCK_PROCESS_CPUTIME_ID",
+   "CLOCK_THREAD_CPUTIME_ID",
+   "CLOCK_MONOTONIC_RAW",
+   "CLOCK_REALTIME_COARSE",
+   "CLOCK_MONOTONIC_COARSE",
+   "CLOCK_BOOTTIME",
+   "CLOCK_REALTIME_ALARM",
+   "CLOCK_BOOTTIME_ALARM",
+   "CLOCK_SGI_CYCLE",
+   "CLOCK_TAI",
+};
+
+/*
+ * This function calls clock_getres in vdso and by system call
+ * with different values for clock_id.
+ *
+ * Example of output:
+ *
+ * clock_id: CLOCK_REALTIME [PASS]
+ * clock_id: CLOCK_BOOTTIME [PASS]
+ * clock_id: CLOCK_TAI [PASS]
+ * clock_id: CLOCK_REALTIME_COARSE [PASS]
+ * clock_id: CLOCK_MONOTONIC [PASS]
+ * clock_id: CLOCK_MONOTONIC_RAW [PASS]
+ * clock_id: CLOCK_MONOTONIC_COARSE [PASS]
+ */
+static inline int vdso_test_clock(unsigned int clock_id)
+{
+   struct timespec x, y;
+
+   printf("clock_id: %s", vdso_clock_name[clock_id]);
+   clock_getres(clock_id, &x);
+   syscall_clock_getres(clock_id, &y);
+
+   if ((x.tv_sec != y.tv_sec) || (x.tv_nsec != y.tv_nsec)) {
+   printf(" [FAIL]\n");
+   return KSFT_FAIL;
+   }
+
+   printf(" [PASS]\n");
+   return KSFT_PASS;
+}
+
+int main(int argc, char **argv)
+{
+   int ret = 0;
+
+#if _POSIX_TIMERS > 0
+
+#ifdef CLOCK_REALTIME
+   ret = vdso_test_clock(CLOCK_REALTIME);
+#endif
+
+#ifdef CLOCK_BOOTTIME
+   ret += vdso_test_clock(CLOCK_BOOTTIME);
+#endif
+
+#ifdef CLOCK_TAI
+   ret += vdso_test_clock(CLOCK_TAI);
+#endif
+
+#ifdef CLOCK_REALTIME_COARSE
+   ret += vdso_test_clock(CLOCK_REALTIME_COARSE);
+#endif
+
+#ifdef CLOCK_MONOTONIC
+   ret += vdso_test_clock(CLOCK_MONOTONIC);
+#endif
+
+#ifdef CLOCK_MONOTONIC_RAW
+   ret += vdso_test_clock(CLOCK_MONOTONIC_RAW);
+#endif
+
+#ifdef CLOCK_MONOTONIC_COARSE
+   ret += vdso_test_clock(CLOCK_MONOTONIC_COARSE);
+#endif
+
+#endif
+   if (ret > 0)
+   return KSFT_FAIL;
+
+   return KSFT_PASS;
+}
-- 
2.21.0



[PATCH v4 2/3] s390: Fix vDSO clock_getres()

2019-05-23 Thread Vincenzo Frascino
clock_getres in the vDSO library has to preserve the same behaviour
of posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on the enablement of the high
resolution timers that can happen either at compile or at run time.

Fix the s390 vdso implementation of clock_getres keeping a copy of
hrtimer_resolution in vdso data and using that directly.

Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Signed-off-by: Vincenzo Frascino 
Acked-by: Martin Schwidefsky 
---

Note: This patch is independent from the others in this series, hence it
can be merged singularly by the s390 maintainers.

 arch/s390/include/asm/vdso.h   |  1 +
 arch/s390/kernel/asm-offsets.c |  2 +-
 arch/s390/kernel/time.c|  1 +
 arch/s390/kernel/vdso32/clock_getres.S | 12 +++-
 arch/s390/kernel/vdso64/clock_getres.S | 10 +-
 5 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/s390/include/asm/vdso.h b/arch/s390/include/asm/vdso.h
index 169d7604eb80..f3ba84fa9bd1 100644
--- a/arch/s390/include/asm/vdso.h
+++ b/arch/s390/include/asm/vdso.h
@@ -36,6 +36,7 @@ struct vdso_data {
__u32 tk_shift; /* Shift used for xtime_nsec0x60 */
__u32 ts_dir;   /* TOD steering direction   0x64 */
__u64 ts_end;   /* TOD steering end 0x68 */
+   __u32 hrtimer_res;  /* hrtimer resolution   0x70 */
 };
 
 struct vdso_per_cpu_data {
diff --git a/arch/s390/kernel/asm-offsets.c b/arch/s390/kernel/asm-offsets.c
index 41ac4ad21311..4a229a60b24a 100644
--- a/arch/s390/kernel/asm-offsets.c
+++ b/arch/s390/kernel/asm-offsets.c
@@ -76,6 +76,7 @@ int main(void)
OFFSET(__VDSO_TK_SHIFT, vdso_data, tk_shift);
OFFSET(__VDSO_TS_DIR, vdso_data, ts_dir);
OFFSET(__VDSO_TS_END, vdso_data, ts_end);
+   OFFSET(__VDSO_CLOCK_REALTIME_RES, vdso_data, hrtimer_res);
OFFSET(__VDSO_ECTG_BASE, vdso_per_cpu_data, ectg_timer_base);
OFFSET(__VDSO_ECTG_USER, vdso_per_cpu_data, ectg_user_time);
OFFSET(__VDSO_CPU_NR, vdso_per_cpu_data, cpu_nr);
@@ -87,7 +88,6 @@ int main(void)
DEFINE(__CLOCK_REALTIME_COARSE, CLOCK_REALTIME_COARSE);
DEFINE(__CLOCK_MONOTONIC_COARSE, CLOCK_MONOTONIC_COARSE);
DEFINE(__CLOCK_THREAD_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID);
-   DEFINE(__CLOCK_REALTIME_RES, MONOTONIC_RES_NSEC);
DEFINE(__CLOCK_COARSE_RES, LOW_RES_NSEC);
BLANK();
/* idle data offsets */
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index e8766beee5ad..8ea9db599d38 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -310,6 +310,7 @@ void update_vsyscall(struct timekeeper *tk)
 
vdso_data->tk_mult = tk->tkr_mono.mult;
vdso_data->tk_shift = tk->tkr_mono.shift;
+   vdso_data->hrtimer_res = hrtimer_resolution;
smp_wmb();
++vdso_data->tb_update_count;
 }
diff --git a/arch/s390/kernel/vdso32/clock_getres.S 
b/arch/s390/kernel/vdso32/clock_getres.S
index eaf9cf1417f6..fecd7684c645 100644
--- a/arch/s390/kernel/vdso32/clock_getres.S
+++ b/arch/s390/kernel/vdso32/clock_getres.S
@@ -18,20 +18,22 @@
 __kernel_clock_getres:
CFI_STARTPROC
basr%r1,0
-   la  %r1,4f-.(%r1)
+10:al  %r1,4f-10b(%r1)
+   l   %r0,__VDSO_CLOCK_REALTIME_RES(%r1)
chi %r2,__CLOCK_REALTIME
je  0f
chi %r2,__CLOCK_MONOTONIC
je  0f
-   la  %r1,5f-4f(%r1)
+   basr%r1,0
+   la  %r1,5f-.(%r1)
+   l   %r0,0(%r1)
chi %r2,__CLOCK_REALTIME_COARSE
je  0f
chi %r2,__CLOCK_MONOTONIC_COARSE
jne 3f
 0: ltr %r3,%r3
jz  2f  /* res == NULL */
-1: l   %r0,0(%r1)
-   xc  0(4,%r3),0(%r3) /* set tp->tv_sec to zero */
+1: xc  0(4,%r3),0(%r3) /* set tp->tv_sec to zero */
st  %r0,4(%r3)  /* store tp->tv_usec */
 2: lhi %r2,0
br  %r14
@@ -39,6 +41,6 @@ __kernel_clock_getres:
svc 0
br  %r14
CFI_ENDPROC
-4: .long   __CLOCK_REALTIME_RES
+4: .long   _vdso_data - 10b
 5: .long   __CLOCK_COARSE_RES
.size   __kernel_clock_getres,.-__kernel_clock_getres
diff --git a/arch/s390/kernel/vdso64/clock_getres.S 
b/arch/s390/kernel/vdso64/clock_getres.S
index 081435398e0a..022b58c980db 100644
--- a/arch/s390/kernel/vdso64/clock_getres.S
+++ b/arch/s390/kernel/vdso64/clock_getres.S
@@ -17,12 +17,14 @@
.type  __kernel_clock_getres,@function
 __kernel_clock_getres:
CFI_STARTPROC
-   larl%r1,4f
+   larl%r1,3f
+   lg  %r0,0(%r1)
cghi%r2,__CLOCK_REALTIME_COARSE
je  0f
cghi%r2,__CLOCK_MONOTONIC_COARSE
je  0f
-

[PATCH v4 1/3] powerpc: Fix vDSO clock_getres()

2019-05-23 Thread Vincenzo Frascino
clock_getres in the vDSO library has to preserve the same behaviour
of posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on the enablement of the high
resolution timers that can happen either at compile or at run time.

Fix the powerpc vdso implementation of clock_getres keeping a copy of
hrtimer_resolution in vdso data and using that directly.

Fixes: a7f290dad32e ("[PATCH] powerpc: Merge vdso's and add vdso support
to 32 bits kernel")
Cc: sta...@vger.kernel.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Signed-off-by: Vincenzo Frascino 
Reviewed-by: Christophe Leroy 
---

Note: This patch is independent from the others in this series, hence it
can be merged singularly by the powerpc maintainers.

 arch/powerpc/include/asm/vdso_datapage.h  | 2 ++
 arch/powerpc/kernel/asm-offsets.c | 2 +-
 arch/powerpc/kernel/time.c| 1 +
 arch/powerpc/kernel/vdso32/gettimeofday.S | 7 +--
 arch/powerpc/kernel/vdso64/gettimeofday.S | 7 +--
 5 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/vdso_datapage.h 
b/arch/powerpc/include/asm/vdso_datapage.h
index bbc06bd72b1f..4333b9a473dc 100644
--- a/arch/powerpc/include/asm/vdso_datapage.h
+++ b/arch/powerpc/include/asm/vdso_datapage.h
@@ -86,6 +86,7 @@ struct vdso_data {
__s32 wtom_clock_nsec;  /* Wall to monotonic clock nsec 
*/
__s64 wtom_clock_sec;   /* Wall to monotonic clock sec 
*/
struct timespec stamp_xtime;/* xtime as at tb_orig_stamp */
+   __u32 hrtimer_res;  /* hrtimer resolution */
__u32 syscall_map_64[SYSCALL_MAP_SIZE]; /* map of syscalls  */
__u32 syscall_map_32[SYSCALL_MAP_SIZE]; /* map of syscalls */
 };
@@ -107,6 +108,7 @@ struct vdso_data {
__s32 wtom_clock_nsec;
struct timespec stamp_xtime;/* xtime as at tb_orig_stamp */
__u32 stamp_sec_fraction;   /* fractional seconds of stamp_xtime */
+   __u32 hrtimer_res;  /* hrtimer resolution */
__u32 syscall_map_32[SYSCALL_MAP_SIZE]; /* map of syscalls */
__u32 dcache_block_size;/* L1 d-cache block size */
__u32 icache_block_size;/* L1 i-cache block size */
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 8e02444e9d3d..dfc40f29f2b9 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -389,6 +389,7 @@ int main(void)
OFFSET(WTOM_CLOCK_NSEC, vdso_data, wtom_clock_nsec);
OFFSET(STAMP_XTIME, vdso_data, stamp_xtime);
OFFSET(STAMP_SEC_FRAC, vdso_data, stamp_sec_fraction);
+   OFFSET(CLOCK_REALTIME_RES, vdso_data, hrtimer_res);
OFFSET(CFG_ICACHE_BLOCKSZ, vdso_data, icache_block_size);
OFFSET(CFG_DCACHE_BLOCKSZ, vdso_data, dcache_block_size);
OFFSET(CFG_ICACHE_LOGBLOCKSZ, vdso_data, icache_log_block_size);
@@ -419,7 +420,6 @@ int main(void)
DEFINE(CLOCK_REALTIME_COARSE, CLOCK_REALTIME_COARSE);
DEFINE(CLOCK_MONOTONIC_COARSE, CLOCK_MONOTONIC_COARSE);
DEFINE(NSEC_PER_SEC, NSEC_PER_SEC);
-   DEFINE(CLOCK_REALTIME_RES, MONOTONIC_RES_NSEC);
 
 #ifdef CONFIG_BUG
DEFINE(BUG_ENTRY_SIZE, sizeof(struct bug_entry));
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 325d60633dfa..4ea4e9d7a58e 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -963,6 +963,7 @@ void update_vsyscall(struct timekeeper *tk)
vdso_data->wtom_clock_nsec = tk->wall_to_monotonic.tv_nsec;
vdso_data->stamp_xtime = xt;
vdso_data->stamp_sec_fraction = frac_sec;
+   vdso_data->hrtimer_res = hrtimer_resolution;
smp_wmb();
++(vdso_data->tb_update_count);
 }
diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S 
b/arch/powerpc/kernel/vdso32/gettimeofday.S
index afd516b572f8..2b5f9e83c610 100644
--- a/arch/powerpc/kernel/vdso32/gettimeofday.S
+++ b/arch/powerpc/kernel/vdso32/gettimeofday.S
@@ -160,12 +160,15 @@ V_FUNCTION_BEGIN(__kernel_clock_getres)
crorcr0*4+eq,cr0*4+eq,cr1*4+eq
bne cr0,99f
 
+   mflrr12
+  .cfi_register lr,r12
+   bl  __get_datapage@local
+   lwz r5,CLOCK_REALTIME_RES(r3)
+   mtlrr12
li  r3,0
cmpli   cr0,r4,0
crclr   cr0*4+so
beqlr
-   lis r5,CLOCK_REALTIME_RES@h
-   ori r5,r5,CLOCK_REALTIME_RES@l
stw r3,TSPC32_TV_SEC(r4)
stw r5,TSPC32_TV_NSEC(r4)
blr
diff --git a/arch/powerpc/kernel/vdso64/gettimeofday.S 
b/arch/powerpc/kernel/vdso64/gettimeofday.S
index 1f324c28705b..f07730f73d5e 100644
--- a/arch/powerpc/kernel/vdso64/gettimeofday.S
+++ b/arch/powerpc/kernel/vdso64/gettimeofday.S
@@ -190,12 +190,15 @@ V_FUNCTION_BEGIN(__kernel_clock_getres)
crorcr0*4+eq,cr0*4+

[PATCH v4 0/3] Fix vDSO clock_getres()

2019-05-23 Thread Vincenzo Frascino
clock_getres in the vDSO library has to preserve the same behaviour
of posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on the enablement of the high
resolution timers that can happen either at compile or at run time.
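
For reference, the syscall-side behaviour being matched is roughly the
following (a sketch of posix_get_hrtimer_res() shown for context; the exact
body may differ slightly between kernel versions):

	static int posix_get_hrtimer_res(clockid_t which_clock,
					 struct timespec64 *tp)
	{
		/* What clock_getres(CLOCK_REALTIME/MONOTONIC, ...) reports. */
		tp->tv_sec = 0;
		tp->tv_nsec = hrtimer_resolution;
		return 0;
	}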

A possible fix is to change the vdso implementation of clock_getres,
keeping a copy of hrtimer_resolution in vdso data and using that
directly [1].

This patchset implements the proposed fix for arm64, powerpc, s390,
nds32 and adds a test to verify that the syscall and the vdso library
implementation of clock_getres return the same values.

Even if these patches are unified by the same topic, there is no
dependency between them, hence they can be merged singularly by each
arch maintainer.

Note: arm64 and nds32 respective fixes have been merged in 5.2-rc1,
hence they have been removed from this series.

[1] https://marc.info/?l=linux-arm-kernel&m=155110381930196&w=2

Changes:

v4:
  - Address review comments.
v3:
  - Rebased on 5.2-rc1.
  - Address review comments.
v2:
  - Rebased on 5.1-rc5.
  - Addressed review comments.

Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Shuah Khan 
Cc: Thomas Gleixner 
Cc: Arnd Bergmann 
Signed-off-by: Vincenzo Frascino 

Vincenzo Frascino (3):
  powerpc: Fix vDSO clock_getres()
  s390: Fix vDSO clock_getres()
  kselftest: Extend vDSO selftest to clock_getres

 arch/powerpc/include/asm/vdso_datapage.h  |   2 +
 arch/powerpc/kernel/asm-offsets.c |   2 +-
 arch/powerpc/kernel/time.c|   1 +
 arch/powerpc/kernel/vdso32/gettimeofday.S |   7 +-
 arch/powerpc/kernel/vdso64/gettimeofday.S |   7 +-
 arch/s390/include/asm/vdso.h  |   1 +
 arch/s390/kernel/asm-offsets.c|   2 +-
 arch/s390/kernel/time.c   |   1 +
 arch/s390/kernel/vdso32/clock_getres.S|  12 +-
 arch/s390/kernel/vdso64/clock_getres.S|  10 +-
 tools/testing/selftests/vDSO/Makefile |   2 +
 .../selftests/vDSO/vdso_clock_getres.c| 124 ++
 12 files changed, 155 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c

-- 
2.21.0



Re: [PATCH] ASoC: fsl_esai: fix the channel swap issue after xrun

2019-05-23 Thread S.j. Wang
Hi

> On Thu, May 23, 2019 at 09:53:42AM +, S.j. Wang wrote:
> > > > + /*
> > > > +  * Add fifo reset here, because the regcache_sync will
> > > > +  * write one more data to ETDR.
> > > > +  * Which will cause channel shift.
> > >
> > > Sounds like a bug to me...should fix it first by marking the data
> > > registers as volatile.
> > >
> > The ETDR is a writable register, it is not volatile. Even if we change it
> > to volatile, I don't think we can avoid this issue: regcache_sync just
> > writes this register, which is its correct behavior.
> 
> Is that so? Quoting the comments of regcache_sync():
> "* regcache_sync - Sync the register cache with the hardware.
>  *
>  * @map: map to configure.
>  *
>  * Any registers that should not be synced should be marked as
>  * volatile."
> 
> If regcache_sync() does sync volatile registers too as you said, I don't mind
> having this FIFO reset WAR for now, though I think this mismatch between
> the comments and the actual behavior then should get people's attention.
> 
> Thank you

ETDR is not volatile; if we mark it as volatile, is that correct?

Best regards
Wang shengjiu



[PATCH] powerpc: Remove variable ‘path’ since not used

2019-05-23 Thread Mathieu Malaterre
In commit eab00a208eb6 ("powerpc: Move `path` variable inside
DEBUG_PROM") DEBUG_PROM sentinels were added to silence a warning
(treated as error with W=1):

  arch/powerpc/kernel/prom_init.c:1388:8: error: variable ‘path’ set but not 
used [-Werror=unused-but-set-variable]

Rework the original patch and simplify the code, by removing the
variable ‘path’ completely. Fix line over 90 characters.

Suggested-by: Michael Ellerman 
Signed-off-by: Mathieu Malaterre 
---
 arch/powerpc/kernel/prom_init.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 7edb23861162..f6df4ddebb82 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1566,9 +1566,6 @@ static void __init reserve_mem(u64 base, u64 size)
 static void __init prom_init_mem(void)
 {
phandle node;
-#ifdef DEBUG_PROM
-   char *path;
-#endif
char type[64];
unsigned int plen;
cell_t *p, *endp;
@@ -1590,9 +1587,6 @@ static void __init prom_init_mem(void)
prom_debug("root_size_cells: %x\n", rsc);
 
prom_debug("scanning memory:\n");
-#ifdef DEBUG_PROM
-   path = prom_scratch;
-#endif
 
for (node = 0; prom_next_node(&node); ) {
type[0] = 0;
@@ -1617,9 +1611,10 @@ static void __init prom_init_mem(void)
endp = p + (plen / sizeof(cell_t));
 
 #ifdef DEBUG_PROM
-   memset(path, 0, sizeof(prom_scratch));
-   call_prom("package-to-path", 3, 1, node, path, 
sizeof(prom_scratch) - 1);
-   prom_debug("  node %s :\n", path);
+   memset(prom_scratch, 0, sizeof(prom_scratch));
+   call_prom("package-to-path", 3, 1, node, prom_scratch,
+ sizeof(prom_scratch) - 1);
+   prom_debug("  node %s :\n", prom_scratch);
 #endif /* DEBUG_PROM */
 
while ((endp - p) >= (rac + rsc)) {
-- 
2.20.1



Re: Failure to boot G4: dt_headr_start=0x01501000

2019-05-23 Thread Christophe Leroy




On 05/23/2019 10:05 AM, Christophe Leroy wrote:



On 05/23/2019 09:59 AM, Christophe Leroy wrote:



On 05/23/2019 09:45 AM, Christophe Leroy wrote:



Le 23/05/2019 à 10:53, Mathieu Malaterre a écrit :

Commit id is:

e93c9c99a629 (tag: v5.1) Linux 5.1


Did you try latest powerpc/merge branch ?


Will try that next.


I confirm powerpc/merge does not boot for me (same config). Commit id:

a27eaa62326d (powerpc/merge) Automatic merge of branches 'master',
'next' and 'fixes' into merge


I see in the config you sent me that you have selected CONFIG_KASAN, 
which is a big new feature.


Can you try without it ?


While building with your config, I get a huge amount of:

ppc-linux-ld: warning: orphan section `.data..LASAN0' from 
`lib/xarray.o' being placed in section `.data..LASAN0'.

   SORTEX  vmlinux



I see you have also selected CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y

I guess nobody has ever tried both this and CONFIG_KASAN together on 
ppc32. I'll give it a try.



And you also have CONFIG_FTRACE.

In a recent patch implementing KASAN on PPC64, Daniel says that KASAN 
and FTRACE don't go together well 
(https://patchwork.ozlabs.org/patch/1103826/)


If you find out that it works without KASAN, can you then try with KASAN 
but without FTRACE ?




I tried your config in Qemu, looks I'm getting a recursive Oops:

#50 0xc0066af0 in do_exit (code=0xb) at kernel/exit.c:787
#51 0xc0013984 in oops_end (flags=, regs=, 
signr=0xb) at arch/powerpc/kernel/traps.c:253

#52 0xc001c30c in handle_page_fault () at arch/powerpc/kernel/entry_32.S:637
#53 0x20302e30 in ?? ()
#54 0xc001cb60 in btext_drawchar (c=0x0) at arch/powerpc/kernel/btext.c:522
#55 0xc00167cc in udbg_write (s=0xc113ae22  "   0.00] CPU: 0 
PID: 0 Comm: swapper Not tainted 5.1.0+ #1647\n0\n", n=0x37) at 
arch/powerpc/kernel/udbg.c:114
#56 0xc00d43f0 in call_console_drivers (ext_text=, 
text=, len=, ext_len=) at 
kernel/printk/printk.c:1780

#57 console_unlock () at kernel/printk/printk.c:2462
#58 0xc00d6630 in console_flush_on_panic () at kernel/printk/printk.c:2552
#59 0xc00618a0 in panic (fmt=0xc10f459f  "!") at kernel/panic.c:280
#60 0xc0066af0 in do_exit (code=0xb) at kernel/exit.c:787
#61 0xc0013984 in oops_end (flags=, regs=, 
signr=0xb) at arch/powerpc/kernel/traps.c:253

#62 0xc001c30c in handle_page_fault () at arch/powerpc/kernel/entry_32.S:637
#63 0x20302e30 in ?? ()
#64 0xc001cb60 in btext_drawchar (c=0x0) at arch/powerpc/kernel/btext.c:522
#65 0xc00167cc in udbg_write (s=0xc113ae22  "   0.00] CPU: 0 
PID: 0 Comm: swapper Not tainted 5.1.0+ #1647\n0\n", n=0x45) at 
arch/powerpc/kernel/udbg.c:114
#66 0xc00d43f0 in call_console_drivers (ext_text=, 
text=, len=, ext_len=) at 
kernel/printk/printk.c:1780

#67 console_unlock () at kernel/printk/printk.c:2462
#68 0xc00d6630 in console_flush_on_panic () at kernel/printk/printk.c:2552
#69 0xc00618a0 in panic (fmt=0xc10f459f  "!") at kernel/panic.c:280
#70 0xc0066af0 in do_exit (code=0xb) at kernel/exit.c:787
#71 0xc0013984 in oops_end (flags=, regs=, 
signr=0xb) at arch/powerpc/kernel/traps.c:253

#72 0xc001c30c in handle_page_fault () at arch/powerpc/kernel/entry_32.S:637
#73 0x20302e30 in ?? ()
#74 0xc001cb60 in btext_drawchar (c=0x0) at arch/powerpc/kernel/btext.c:522
#75 0xc00167cc in udbg_write (s=0xc113ae22  "   0.00] CPU: 0 
PID: 0 Comm: swapper Not tainted 5.1.0+ #1647\n0\n", n=0x32) at 
arch/powerpc/kernel/udbg.c:114
#76 0xc00d43f0 in call_console_drivers (ext_text=, 
text=, len=, ext_len=) at 
kernel/printk/printk.c:1780

#77 console_unlock () at kernel/printk/printk.c:2462
#78 0xc00d68d8 in vprintk_emit (facility=, 
level=, dict=0x0, dictlen=0x0, fmt=0xc085e4c0 
"\001\066printk: %sconsole [%s%d] enabled\n",

args=0xc10cff30) at kernel/printk/printk.c:1985
#79 0xc00d69d8 in vprintk_default (fmt=, args=out>) at kernel/printk/printk.c:2012
#80 0xc00d7a40 in vprintk_func (fmt=, args=out>) at kernel/printk/printk_safe.c:398
#81 0xc00d2638 in printk (fmt=) at 
kernel/printk/printk.c:2045
#82 0xc00d4ef8 in register_console (newcon=0xc0cb9a20 ) at 
kernel/printk/printk.c:2777
#83 0xc0b79ed0 in machine_init (dt_ptr=) at 
arch/powerpc/kernel/setup_32.c:83

#84 0xc000347c in start_here () at arch/powerpc/kernel/head_32.S:901

Christophe


Re: Failure to boot G4: dt_headr_start=0x01501000

2019-05-23 Thread Mathieu Malaterre
On Thu, May 23, 2019 at 11:45 AM Christophe Leroy
 wrote:
>
>
>
> Le 23/05/2019 à 10:53, Mathieu Malaterre a écrit :
> > On Thu, May 23, 2019 at 10:29 AM Mathieu Malaterre  wrote:
> >>
> >> On Thu, May 23, 2019 at 8:39 AM Christophe Leroy
> >>  wrote:
> >>>
> >>> Salut Mathieu,
> >>>
> >>> Le 23/05/2019 à 08:24, Mathieu Malaterre a écrit :
>  Salut Christophe,
> 
>  On Wed, May 22, 2019 at 2:20 PM Christophe Leroy
>   wrote:
> >
> >
> >
> > Le 22/05/2019 à 14:15, Mathieu Malaterre a écrit :
> >> Hi all,
> >>
> >> I have not boot my G4 in a while, today using master here is what I 
> >> see:
> >>
> >> done
> >> Setting btext !
> >> W=640 H=488 LB=768 addr=0x9c008000
> >> copying OF device tree...
> >> starting device tree allocs at 01401000
> >> otloc_up(0010, 0013d948)
> >>  trying: 0x01401000
> >>  trying: 0x01501000
> >> -› 01501000
> >>  alloc_bottom : 01601000
> >>  alloc_top: 2000
> >>  alloc_top_hi : 2000
> >>  nmo_top  : 2000
> >>  ram_top  : 2000
> >> Building dt strings...
> >> Building dt structure...
> >> reserved memory map:
> >>  00d4 - 006c1000
> >> Device tree strings 0x01502000 -> 0x0007
> >> Device tree struct 0x01503000 -> 0x0007
> >> Quiescing Open Firmware ...
> >> Booting Linux via __start() @ 0x00140
> >> ->dt_headr_start=0x01501000
> >>
> >> Any suggestions before I start a bisect ?
> >>
> >
> > Have you tried without CONFIG_PPC_KUEP and CONFIG_PPC_KUAP ?
> 
>  Using locally:
> 
>  diff --git a/arch/powerpc/configs/g4_defconfig
>  b/arch/powerpc/configs/g4_defconfig
>  index 14d0376f637d..916bce8ce9c3 100644
>  --- a/arch/powerpc/configs/g4_defconfig
>  +++ b/arch/powerpc/configs/g4_defconfig
>  @@ -32,6 +32,8 @@ CONFIG_USERFAULTFD=y
> # CONFIG_COMPAT_BRK is not set
> CONFIG_PROFILING=y
> CONFIG_G4_CPU=y
>  +# CONFIG_PPC_KUEP is not set
>  +# CONFIG_PPC_KUAP is not set
> CONFIG_PANIC_TIMEOUT=0
> # CONFIG_PPC_CHRP is not set
> CONFIG_CPU_FREQ=y
> 
> 
>  Leads to almost the same error (some values have changed):
> >>>
> >>> Ok.
> >>>
> >>> When you say you are using 'master', what do you mean ? Can you give the
> >>> commit Id ?
> >>>
> >>> Does it boots with Kernel 5.1.4 ?
> >>
> >> I was able to boot v5.1:
> >>
> >> $ dmesg | head
> >> [0.00] printk: bootconsole [udbg0] enabled
> >> [0.00] Total memory = 512MB; using 1024kB for hash table (at 
> >> (ptrval))
> >> [0.00] Linux version 5.1.0+ (ma...@debian.org) (gcc version
> >> 8.3.0 (Debian 8.3.0-7)) #8 Thu May 23 06:26:38 UTC 2019
> >>
> >> Commit id is:
> >>
> >> e93c9c99a629 (tag: v5.1) Linux 5.1
> >>
> >>> Did you try latest powerpc/merge branch ?
> >>
> >> Will try that next.
> >
> > I confirm powerpc/merge does not boot for me (same config). Commit id:
> >
> > a27eaa62326d (powerpc/merge) Automatic merge of branches 'master',
> > 'next' and 'fixes' into merge
>
> I see in the config you sent me that you have selected CONFIG_KASAN,
> which is a big new feature.
>
> Can you try without it ?

With same config but CONFIG_KASAN=n (on top of a27eaa62326d), I can
reproduce the boot failure (no change).

Time for bisect ?

> Christophe
>
> >
> >
> >>> Can you send your full .config ?
> >>
> >> Config is attached.
> >>
> >> Thanks,
> >>
> >>> Christophe
> >>>
> 
>  done
>  Setting btext !
>  W=640 H=488 LB=768 addr=0x9c008000
>  copying OF device tree...
>  starting device tree allocs at 0130
>  alloc_up(0010, 0013d948)
>  trying: 0x0130
>  trying: 0x0140
> -› 0140
>  alloc_bottom : 0150
>  alloc_top: 2000
>  alloc_top_hi : 2000
>  nmo_top  : 2000
>  ram_top  : 2000
>  Building dt strings...
>  Building dt structure...
>  reserved memory map:
>  00c4 - 006c
>  Device tree strings 0x01401000 -> 0x0007
>  Device tree struct 0x01402000 -> 0x0007
>  Quiescing Open Firmware ...
>  Booting Linux via __start() @ 0x00140
>  ->dt_headr_start=0x0140
> 
>  Thanks anyway,
> 


Re: [PATCH] ASoC: fsl_esai: fix the channel swap issue after xrun

2019-05-23 Thread Nicolin Chen
Hello Shengjiu,

On Thu, May 23, 2019 at 09:53:42AM +, S.j. Wang wrote:
> > > + /*
> > > +  * Add fifo reset here, because the regcache_sync will
> > > +  * write one more data to ETDR.
> > > +  * Which will cause channel shift.
> > 
> > Sounds like a bug to me...should fix it first by marking the data registers 
> > as
> > volatile.
> > 
> The ETDR is a writable register, it is not volatile. Even if we change it to
> volatile, I don't think we can avoid this issue: regcache_sync just writes
> this register, which is its correct behavior.

Is that so? Quoting the comments of regcache_sync():
"* regcache_sync - Sync the register cache with the hardware.
 *
 * @map: map to configure.
 *
 * Any registers that should not be synced should be marked as
 * volatile."

If regcache_sync() does sync volatile registers too as you said,
I don't mind having this FIFO reset WAR for now, though I think
this mismatch between the comments and the actual behavior then
should get people's attention.

Thank you
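
For context, regmap decides what regcache_sync() replays through the
regmap_config->volatile_reg callback. A minimal sketch of what "marking the
data register volatile" would mean is below; REG_ESAI_ETDR is taken from the
driver's header, while the callback and config names here are made up for
illustration and all other config fields are omitted:

	#include <linux/regmap.h>
	#include "fsl_esai.h"		/* for REG_ESAI_ETDR */

	/* Volatile registers bypass the cache, so regcache_sync() skips them. */
	static bool esai_data_reg_volatile(struct device *dev, unsigned int reg)
	{
		return reg == REG_ESAI_ETDR;	/* transmit data FIFO */
	}

	static const struct regmap_config esai_regmap_sketch = {
		/* ...existing fields from the driver's regmap config... */
		.volatile_reg = esai_data_reg_volatile,
	};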


Re: Failure to boot G4: dt_headr_start=0x01501000

2019-05-23 Thread Christophe Leroy




On 05/23/2019 09:59 AM, Christophe Leroy wrote:



On 05/23/2019 09:45 AM, Christophe Leroy wrote:



Le 23/05/2019 à 10:53, Mathieu Malaterre a écrit :

Commit id is:

e93c9c99a629 (tag: v5.1) Linux 5.1


Did you try latest powerpc/merge branch ?


Will try that next.


I confirm powerpc/merge does not boot for me (same config). Commit id:

a27eaa62326d (powerpc/merge) Automatic merge of branches 'master',
'next' and 'fixes' into merge


I see in the config you sent me that you have selected CONFIG_KASAN, 
which is a big new feature.


Can you try without it ?


While building with your config, I get a huge amount of:

ppc-linux-ld: warning: orphan section `.data..LASAN0' from 
`lib/xarray.o' being placed in section `.data..LASAN0'.

   SORTEX  vmlinux



I see you have also selected CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y

I guess nobody has ever tried both this and CONFIG_KASAN together on 
ppc32. I'll give it a try.



And you also have CONFIG_FTRACE.

In a recent patch implementing KASAN on PPC64, Daniel says that KASAN 
and FTRACE don't go together well 
(https://patchwork.ozlabs.org/patch/1103826/)


If you find out that it works without KASAN, can you then try with KASAN 
but without FTRACE ?


Christophe



Re: Failure to boot G4: dt_headr_start=0x01501000

2019-05-23 Thread Christophe Leroy




On 05/23/2019 09:45 AM, Christophe Leroy wrote:



Le 23/05/2019 à 10:53, Mathieu Malaterre a écrit :

Commit id is:

e93c9c99a629 (tag: v5.1) Linux 5.1


Did you try latest powerpc/merge branch ?


Will try that next.


I confirm powerpc/merge does not boot for me (same config). Commit id:

a27eaa62326d (powerpc/merge) Automatic merge of branches 'master',
'next' and 'fixes' into merge


I see in the config you sent me that you have selected CONFIG_KASAN, 
which is a big new feature.


Can you try without it ?


While building with your config, I get a huge amount of:

ppc-linux-ld: warning: orphan section `.data..LASANLOC10' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC10'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC11' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC11'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC12' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC12'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC13' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC13'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC14' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC14'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC15' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC15'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC16' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC16'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC1' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC1'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC2' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC2'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC3' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC3'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC4' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC4'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC5' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC5'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC6' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC6'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC7' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC7'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC8' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC8'.
ppc-linux-ld: warning: orphan section `.data..LASANLOC9' from 
`lib/vsprintf.o' being placed in section `.data..LASANLOC9'.
ppc-linux-ld: warning: orphan section `.data..LASAN0' from 
`lib/xarray.o' being placed in section `.data..LASAN0'.

  SORTEX  vmlinux



I see you have also selected CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y

I guess nobody has ever tried both this and CONFIG_KASAN together on 
ppc32. I'll give it a try.


Christophe



Christophe





Can you send your full .config ?


Config is attached.

Thanks,


Christophe



done
Setting btext !
W=640 H=488 LB=768 addr=0x9c008000
copying OF device tree...
starting device tree allocs at 0130
alloc_up(0010, 0013d948)
    trying: 0x0130
    trying: 0x0140
   -› 0140
    alloc_bottom : 0150
    alloc_top    : 2000
    alloc_top_hi : 2000
    nmo_top  : 2000
    ram_top  : 2000
Building dt strings...
Building dt structure...
reserved memory map:
    00c4 - 006c
Device tree strings 0x01401000 -> 0x0007
Device tree struct 0x01402000 -> 0x0007
Quiescing Open Firmware ...
Booting Linux via __start() @ 0x00140
->dt_headr_start=0x0140

Thanks anyway,



Re: [PATCH] ASoC: fsl_esai: fix the channel swap issue after xrun

2019-05-23 Thread S.j. Wang
Hi

> > + /*
> > +  * Add fifo reset here, because the regcache_sync will
> > +  * write one more data to ETDR.
> > +  * Which will cause channel shift.
> 
> Sounds like a bug to me...should fix it first by marking the data registers as
> volatile.
> 

The ETDR is a writable register, it is not volatile. Even if we change it to
volatile, I don't think we can avoid this issue: regcache_sync just writes
this register, which is its correct behavior.

Best regards
Wang shengjiu


Re: Failure to boot G4: dt_headr_start=0x01501000

2019-05-23 Thread Christophe Leroy




Le 23/05/2019 à 10:53, Mathieu Malaterre a écrit :

On Thu, May 23, 2019 at 10:29 AM Mathieu Malaterre  wrote:


On Thu, May 23, 2019 at 8:39 AM Christophe Leroy
 wrote:


Salut Mathieu,

Le 23/05/2019 à 08:24, Mathieu Malaterre a écrit :

Salut Christophe,

On Wed, May 22, 2019 at 2:20 PM Christophe Leroy
 wrote:




Le 22/05/2019 à 14:15, Mathieu Malaterre a écrit :

Hi all,

I have not boot my G4 in a while, today using master here is what I see:

done
Setting btext !
W=640 H=488 LB=768 addr=0x9c008000
copying OF device tree...
starting device tree allocs at 01401000
otloc_up(0010, 0013d948)
 trying: 0x01401000
 trying: 0x01501000
-› 01501000
 alloc_bottom : 01601000
 alloc_top: 2000
 alloc_top_hi : 2000
 nmo_top  : 2000
 ram_top  : 2000
Building dt strings...
Building dt structure...
reserved memory map:
 00d4 - 006c1000
Device tree strings 0x01502000 -> 0x0007
Device tree struct 0x01503000 -> 0x0007
Quiescing Open Firmware ...
Booting Linux via __start() @ 0x00140
->dt_headr_start=0x01501000

Any suggestions before I start a bisect ?



Have you tried without CONFIG_PPC_KUEP and CONFIG_PPC_KUAP ?


Using locally:

diff --git a/arch/powerpc/configs/g4_defconfig
b/arch/powerpc/configs/g4_defconfig
index 14d0376f637d..916bce8ce9c3 100644
--- a/arch/powerpc/configs/g4_defconfig
+++ b/arch/powerpc/configs/g4_defconfig
@@ -32,6 +32,8 @@ CONFIG_USERFAULTFD=y
   # CONFIG_COMPAT_BRK is not set
   CONFIG_PROFILING=y
   CONFIG_G4_CPU=y
+# CONFIG_PPC_KUEP is not set
+# CONFIG_PPC_KUAP is not set
   CONFIG_PANIC_TIMEOUT=0
   # CONFIG_PPC_CHRP is not set
   CONFIG_CPU_FREQ=y


Leads to almost the same error (some values have changed):


Ok.

When you say you are using 'master', what do you mean ? Can you give the
commit Id ?

Does it boots with Kernel 5.1.4 ?


I was able to boot v5.1:

$ dmesg | head
[0.00] printk: bootconsole [udbg0] enabled
[0.00] Total memory = 512MB; using 1024kB for hash table (at (ptrval))
[0.00] Linux version 5.1.0+ (ma...@debian.org) (gcc version
8.3.0 (Debian 8.3.0-7)) #8 Thu May 23 06:26:38 UTC 2019

Commit id is:

e93c9c99a629 (tag: v5.1) Linux 5.1


Did you try latest powerpc/merge branch ?


Will try that next.


I confirm powerpc/merge does not boot for me (same config). Commit id:

a27eaa62326d (powerpc/merge) Automatic merge of branches 'master',
'next' and 'fixes' into merge


I see in the config you sent me that you have selected CONFIG_KASAN, 
which is a big new feature.


Can you try without it ?

Christophe





Can you send your full .config ?


Config is attached.

Thanks,


Christophe



done
Setting btext !
W=640 H=488 LB=768 addr=0x9c008000
copying OF device tree...
starting device tree allocs at 0130
alloc_up(0010, 0013d948)
trying: 0x0130
trying: 0x0140
   -› 0140
alloc_bottom : 0150
alloc_top: 2000
alloc_top_hi : 2000
nmo_top  : 2000
ram_top  : 2000
Building dt strings...
Building dt structure...
reserved memory map:
00c4 - 006c
Device tree strings 0x01401000 -> 0x0007
Device tree struct 0x01402000 -> 0x0007
Quiescing Open Firmware ...
Booting Linux via __start() @ 0x00140
->dt_headr_start=0x0140

Thanks anyway,



Re: Failure to boot G4: dt_headr_start=0x01501000

2019-05-23 Thread Mathieu Malaterre
On Thu, May 23, 2019 at 10:29 AM Mathieu Malaterre  wrote:
>
> On Thu, May 23, 2019 at 8:39 AM Christophe Leroy
>  wrote:
> >
> > Salut Mathieu,
> >
> > Le 23/05/2019 à 08:24, Mathieu Malaterre a écrit :
> > > Salut Christophe,
> > >
> > > On Wed, May 22, 2019 at 2:20 PM Christophe Leroy
> > >  wrote:
> > >>
> > >>
> > >>
> > >> Le 22/05/2019 à 14:15, Mathieu Malaterre a écrit :
> > >>> Hi all,
> > >>>
> > >>> I have not boot my G4 in a while, today using master here is what I see:
> > >>>
> > >>> done
> > >>> Setting btext !
> > >>> W=640 H=488 LB=768 addr=0x9c008000
> > >>> copying OF device tree...
> > >>> starting device tree allocs at 01401000
> > >>> otloc_up(0010, 0013d948)
> > >>> trying: 0x01401000
> > >>> trying: 0x01501000
> > >>>-› 01501000
> > >>> alloc_bottom : 01601000
> > >>> alloc_top: 2000
> > >>> alloc_top_hi : 2000
> > >>> nmo_top  : 2000
> > >>> ram_top  : 2000
> > >>> Building dt strings...
> > >>> Building dt structure...
> > >>> reserved memory map:
> > >>> 00d4 - 006c1000
> > >>> Device tree strings 0x01502000 -> 0x0007
> > >>> Device tree struct 0x01503000 -> 0x0007
> > >>> Quiescing Open Firmware ...
> > >>> Booting Linux via __start() @ 0x00140
> > >>> ->dt_headr_start=0x01501000
> > >>>
> > >>> Any suggestions before I start a bisect ?
> > >>>
> > >>
> > >> Have you tried without CONFIG_PPC_KUEP and CONFIG_PPC_KUAP ?
> > >
> > > Using locally:
> > >
> > > diff --git a/arch/powerpc/configs/g4_defconfig
> > > b/arch/powerpc/configs/g4_defconfig
> > > index 14d0376f637d..916bce8ce9c3 100644
> > > --- a/arch/powerpc/configs/g4_defconfig
> > > +++ b/arch/powerpc/configs/g4_defconfig
> > > @@ -32,6 +32,8 @@ CONFIG_USERFAULTFD=y
> > >   # CONFIG_COMPAT_BRK is not set
> > >   CONFIG_PROFILING=y
> > >   CONFIG_G4_CPU=y
> > > +# CONFIG_PPC_KUEP is not set
> > > +# CONFIG_PPC_KUAP is not set
> > >   CONFIG_PANIC_TIMEOUT=0
> > >   # CONFIG_PPC_CHRP is not set
> > >   CONFIG_CPU_FREQ=y
> > >
> > >
> > > Leads to almost the same error (some values have changed):
> >
> > Ok.
> >
> > When you say you are using 'master', what do you mean ? Can you give the
> > commit Id ?
> >
> > Does it boots with Kernel 5.1.4 ?
>
> I was able to boot v5.1:
>
> $ dmesg | head
> [0.00] printk: bootconsole [udbg0] enabled
> [0.00] Total memory = 512MB; using 1024kB for hash table (at (ptrval))
> [0.00] Linux version 5.1.0+ (ma...@debian.org) (gcc version
> 8.3.0 (Debian 8.3.0-7)) #8 Thu May 23 06:26:38 UTC 2019
>
> Commit id is:
>
> e93c9c99a629 (tag: v5.1) Linux 5.1
>
> > Did you try latest powerpc/merge branch ?
>
> Will try that next.

I confirm powerpc/merge does not boot for me (same config). Commit id:

a27eaa62326d (powerpc/merge) Automatic merge of branches 'master',
'next' and 'fixes' into merge


> > Can you send your full .config ?
>
> Config is attached.
>
> Thanks,
>
> > Christophe
> >
> > >
> > > done
> > > Setting btext !
> > > W=640 H=488 LB=768 addr=0x9c008000
> > > copying OF device tree...
> > > starting device tree allocs at 0130
> > > alloc_up(0010, 0013d948)
> > >trying: 0x0130
> > >trying: 0x0140
> > >   -› 0140
> > >alloc_bottom : 0150
> > >alloc_top: 2000
> > >alloc_top_hi : 2000
> > >nmo_top  : 2000
> > >ram_top  : 2000
> > > Building dt strings...
> > > Building dt structure...
> > > reserved memory map:
> > >00c4 - 006c
> > > Device tree strings 0x01401000 -> 0x0007
> > > Device tree struct 0x01402000 -> 0x0007
> > > Quiescing Open Firmware ...
> > > Booting Linux via __start() @ 0x00140
> > > ->dt_headr_start=0x0140
> > >
> > > Thanks anyway,
> > >


Re: Failure to boot G4: dt_headr_start=0x01501000

2019-05-23 Thread Mathieu Malaterre
On Thu, May 23, 2019 at 10:29 AM Mathieu Malaterre  wrote:
>
> On Thu, May 23, 2019 at 8:39 AM Christophe Leroy
>  wrote:
> >
> > Salut Mathieu,
> >
> > Le 23/05/2019 à 08:24, Mathieu Malaterre a écrit :
> > > Salut Christophe,
> > >
> > > On Wed, May 22, 2019 at 2:20 PM Christophe Leroy
> > >  wrote:
> > >>
> > >>
> > >>
> > >> Le 22/05/2019 à 14:15, Mathieu Malaterre a écrit :
> > >>> Hi all,
> > >>>
> > >>> I have not boot my G4 in a while, today using master here is what I see:
> > >>>
> > >>> done
> > >>> Setting btext !
> > >>> W=640 H=488 LB=768 addr=0x9c008000
> > >>> copying OF device tree...
> > >>> starting device tree allocs at 01401000
> > >>> otloc_up(0010, 0013d948)
> > >>> trying: 0x01401000
> > >>> trying: 0x01501000
> > >>>-› 01501000
> > >>> alloc_bottom : 01601000
> > >>> alloc_top: 2000
> > >>> alloc_top_hi : 2000
> > >>> nmo_top  : 2000
> > >>> ram_top  : 2000
> > >>> Building dt strings...
> > >>> Building dt structure...
> > >>> reserved memory map:
> > >>> 00d4 - 006c1000
> > >>> Device tree strings 0x01502000 -> 0x0007
> > >>> Device tree struct 0x01503000 -> 0x0007
> > >>> Quiescing Open Firmware ...
> > >>> Booting Linux via __start() @ 0x00140
> > >>> ->dt_headr_start=0x01501000
> > >>>
> > >>> Any suggestions before I start a bisect ?
> > >>>
> > >>
> > >> Have you tried without CONFIG_PPC_KUEP and CONFIG_PPC_KUAP ?
> > >
> > > Using locally:
> > >
> > > diff --git a/arch/powerpc/configs/g4_defconfig
> > > b/arch/powerpc/configs/g4_defconfig
> > > index 14d0376f637d..916bce8ce9c3 100644
> > > --- a/arch/powerpc/configs/g4_defconfig
> > > +++ b/arch/powerpc/configs/g4_defconfig
> > > @@ -32,6 +32,8 @@ CONFIG_USERFAULTFD=y
> > >   # CONFIG_COMPAT_BRK is not set
> > >   CONFIG_PROFILING=y
> > >   CONFIG_G4_CPU=y
> > > +# CONFIG_PPC_KUEP is not set
> > > +# CONFIG_PPC_KUAP is not set
> > >   CONFIG_PANIC_TIMEOUT=0
> > >   # CONFIG_PPC_CHRP is not set
> > >   CONFIG_CPU_FREQ=y
> > >
> > >
> > > Leads to almost the same error (some values have changed):
> >
> > Ok.
> >
> > When you say you are using 'master', what do you mean ? Can you give the
> > commit Id ?

Sorry about that. The problematic commit for me is:

54dee406374c Merge tag 'arm64-fixes' of
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux


> >
> > Does it boots with Kernel 5.1.4 ?
>
> I was able to boot v5.1:
>
> $ dmesg | head
> [0.00] printk: bootconsole [udbg0] enabled
> [0.00] Total memory = 512MB; using 1024kB for hash table (at (ptrval))
> [0.00] Linux version 5.1.0+ (ma...@debian.org) (gcc version
> 8.3.0 (Debian 8.3.0-7)) #8 Thu May 23 06:26:38 UTC 2019
>
> Commit id is:
>
> e93c9c99a629 (tag: v5.1) Linux 5.1
>
> > Did you try latest powerpc/merge branch ?
>
> Will try that next.
>
> > Can you send your full .config ?
>
> Config is attached.
>
> Thanks,
>
> > Christophe
> >
> > >
> > > done
> > > Setting btext !
> > > W=640 H=488 LB=768 addr=0x9c008000
> > > copying OF device tree...
> > > starting device tree allocs at 0130
> > > alloc_up(0010, 0013d948)
> > >trying: 0x0130
> > >trying: 0x0140
> > >   -› 0140
> > >alloc_bottom : 0150
> > >alloc_top: 2000
> > >alloc_top_hi : 2000
> > >nmo_top  : 2000
> > >ram_top  : 2000
> > > Building dt strings...
> > > Building dt structure...
> > > reserved memory map:
> > >00c4 - 006c
> > > Device tree strings 0x01401000 -> 0x0007
> > > Device tree struct 0x01402000 -> 0x0007
> > > Quiescing Open Firmware ...
> > > Booting Linux via __start() @ 0x00140
> > > ->dt_headr_start=0x0140
> > >
> > > Thanks anyway,
> > >


[PATCH] powerpc/32: fix build failure on book3e with KVM

2019-05-23 Thread Christophe Leroy
A build failure was introduced by the commit identified below,
due to a missed macro expansion leading to a wrong called function name.

arch/powerpc/kernel/head_fsl_booke.o: In function `SystemCall':
arch/powerpc/kernel/head_fsl_booke.S:416: undefined reference to 
`kvmppc_handler_BOOKE_INTERRUPT_SYSCALL_SPRN_SRR1'
Makefile:1052: recipe for target 'vmlinux' failed

The called function should be kvmppc_handler_8_0x01B(). This patch fixes it.

Reported-by: Paul Mackerras 
Fixes: 1a4b739bbb4f ("powerpc/32: implement fast entry for syscalls on BOOKE")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_booke.h | 4 ++--
 arch/powerpc/kernel/head_fsl_booke.S | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index bfeb469e8106..dec0912a6508 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -83,7 +83,7 @@ END_BTB_FLUSH_SECTION
SAVE_4GPRS(3, r11);  \
SAVE_2GPRS(7, r11)
 
-.macro SYSCALL_ENTRY trapno intno
+.macro SYSCALL_ENTRY trapno intno srr1
mfspr   r10, SPRN_SPRG_THREAD
 #ifdef CONFIG_KVM_BOOKE_HV
 BEGIN_FTR_SECTION
@@ -94,7 +94,7 @@ BEGIN_FTR_SECTION
mfspr   r11, SPRN_SRR1
mtocrf  0x80, r11   /* check MSR[GS] without clobbering reg */
bf  3, 1975f
-   b   kvmppc_handler_BOOKE_INTERRUPT_\intno\()_SPRN_SRR1
+   b   kvmppc_handler_\intno\()_\srr1
 1975:
mr  r12, r13
lwz r13, THREAD_NORMSAVE(2)(r10)
diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
b/arch/powerpc/kernel/head_fsl_booke.S
index 6621f230cc37..2b39f42c3676 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -413,7 +413,7 @@ interrupt_base:
 
/* System Call Interrupt */
START_EXCEPTION(SystemCall)
-   SYSCALL_ENTRY   0xc00 SYSCALL
+   SYSCALL_ENTRY   0xc00 BOOKE_INTERRUPT_SYSCALL SPRN_SRR1
 
/* Auxiliary Processor Unavailable Interrupt */
EXCEPTION(0x2900, AP_UNAVAIL, AuxillaryProcessorUnavailable, \
-- 
2.13.3



kmemleak: 1157 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

2019-05-23 Thread Mathieu Malaterre
Hi there,

Is there a way to dump more context (somewhere in OF tree
flattening?). I cannot make sense of the following:

kmemleak: 1157 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

Where:

# head -40 /sys/kernel/debug/kmemleak
unreferenced object 0xdf44d180 (size 8):
  comm "swapper", pid 1, jiffies 4294892297 (age 4766.460s)
  hex dump (first 8 bytes):
62 61 73 65 00 00 00 00  base
  backtrace:
[<0ca59825>] kstrdup+0x4c/0xb8
[] kobject_set_name_vargs+0x34/0xc8
[<661b4c86>] kobject_add+0x78/0x120
[] __of_attach_node_sysfs+0xa0/0x14c
[<2a143d10>] of_core_init+0x90/0x114
[] driver_init+0x30/0x48
[<84ed01b1>] kernel_init_freeable+0xfc/0x3fc
[] kernel_init+0x20/0x110
[] ret_from_kernel_thread+0x14/0x1c
unreferenced object 0xdf44d178 (size 8):
  comm "swapper", pid 1, jiffies 4294892297 (age 4766.460s)
  hex dump (first 8 bytes):
6d 6f 64 65 6c 00 97 c8  model...
  backtrace:
[<0ca59825>] kstrdup+0x4c/0xb8
[<0eeb0a3b>] __of_add_property_sysfs+0x88/0x12c
[] __of_attach_node_sysfs+0xcc/0x14c
[<2a143d10>] of_core_init+0x90/0x114
[] driver_init+0x30/0x48
[<84ed01b1>] kernel_init_freeable+0xfc/0x3fc
[] kernel_init+0x20/0x110
[] ret_from_kernel_thread+0x14/0x1c
unreferenced object 0xdf4021e0 (size 16):
  comm "swapper", pid 1, jiffies 4294892297 (age 4766.460s)
  hex dump (first 16 bytes):
63 6f 6d 70 61 74 69 62 6c 65 00 01 00 00 00 00  compatible..
  backtrace:
[<0ca59825>] kstrdup+0x4c/0xb8
[<0eeb0a3b>] __of_add_property_sysfs+0x88/0x12c
[] __of_attach_node_sysfs+0xcc/0x14c
[<2a143d10>] of_core_init+0x90/0x114
[] driver_init+0x30/0x48
[<84ed01b1>] kernel_init_freeable+0xfc/0x3fc
[] kernel_init+0x20/0x110
[] ret_from_kernel_thread+0x14/0x1c


Re: Failure to boot G4: dt_headr_start=0x01501000

2019-05-23 Thread Mathieu Malaterre
On Thu, May 23, 2019 at 8:39 AM Christophe Leroy
 wrote:
>
> Salut Mathieu,
>
> Le 23/05/2019 à 08:24, Mathieu Malaterre a écrit :
> > Salut Christophe,
> >
> > On Wed, May 22, 2019 at 2:20 PM Christophe Leroy
> >  wrote:
> >>
> >>
> >>
> >> Le 22/05/2019 à 14:15, Mathieu Malaterre a écrit :
> >>> Hi all,
> >>>
> >>> I have not boot my G4 in a while, today using master here is what I see:
> >>>
> >>> done
> >>> Setting btext !
> >>> W=640 H=488 LB=768 addr=0x9c008000
> >>> copying OF device tree...
> >>> starting device tree allocs at 01401000
> >>> otloc_up(0010, 0013d948)
> >>> trying: 0x01401000
> >>> trying: 0x01501000
> >>>-› 01501000
> >>> alloc_bottom : 01601000
> >>> alloc_top: 2000
> >>> alloc_top_hi : 2000
> >>> nmo_top  : 2000
> >>> ram_top  : 2000
> >>> Building dt strings...
> >>> Building dt structure...
> >>> reserved memory map:
> >>> 00d4 - 006c1000
> >>> Device tree strings 0x01502000 -> 0x0007
> >>> Device tree struct 0x01503000 -> 0x0007
> >>> Quiescing Open Firmware ...
> >>> Booting Linux via __start() @ 0x00140
> >>> ->dt_headr_start=0x01501000
> >>>
> >>> Any suggestions before I start a bisect ?
> >>>
> >>
> >> Have you tried without CONFIG_PPC_KUEP and CONFIG_PPC_KUAP ?
> >
> > Using locally:
> >
> > diff --git a/arch/powerpc/configs/g4_defconfig
> > b/arch/powerpc/configs/g4_defconfig
> > index 14d0376f637d..916bce8ce9c3 100644
> > --- a/arch/powerpc/configs/g4_defconfig
> > +++ b/arch/powerpc/configs/g4_defconfig
> > @@ -32,6 +32,8 @@ CONFIG_USERFAULTFD=y
> >   # CONFIG_COMPAT_BRK is not set
> >   CONFIG_PROFILING=y
> >   CONFIG_G4_CPU=y
> > +# CONFIG_PPC_KUEP is not set
> > +# CONFIG_PPC_KUAP is not set
> >   CONFIG_PANIC_TIMEOUT=0
> >   # CONFIG_PPC_CHRP is not set
> >   CONFIG_CPU_FREQ=y
> >
> >
> > Leads to almost the same error (some values have changed):
>
> Ok.
>
> When you say you are using 'master', what do you mean ? Can you give the
> commit Id ?
>
> Does it boots with Kernel 5.1.4 ?

I was able to boot v5.1:

$ dmesg | head
[0.00] printk: bootconsole [udbg0] enabled
[0.00] Total memory = 512MB; using 1024kB for hash table (at (ptrval))
[0.00] Linux version 5.1.0+ (ma...@debian.org) (gcc version
8.3.0 (Debian 8.3.0-7)) #8 Thu May 23 06:26:38 UTC 2019

Commit id is:

e93c9c99a629 (tag: v5.1) Linux 5.1

> Did you try latest powerpc/merge branch ?

Will try that next.

> Can you send your full .config ?

Config is attached.

Thanks,

> Christophe
>
> >
> > done
> > Setting btext !
> > W=640 H=488 LB=768 addr=0x9c008000
> > copying OF device tree...
> > starting device tree allocs at 0130
> > alloc_up(0010, 0013d948)
> >trying: 0x0130
> >trying: 0x0140
> >   -› 0140
> >alloc_bottom : 0150
> >alloc_top: 2000
> >alloc_top_hi : 2000
> >nmo_top  : 2000
> >ram_top  : 2000
> > Building dt strings...
> > Building dt structure...
> > reserved memory map:
> >00c4 - 006c
> > Device tree strings 0x01401000 -> 0x0007
> > Device tree struct 0x01402000 -> 0x0007
> > Quiescing Open Firmware ...
> > Booting Linux via __start() @ 0x00140
> > ->dt_headr_start=0x0140
> >
> > Thanks anyway,
> >


g4_defconfig
Description: Binary data


Re: [PATCH v3 14/16] powerpc/32: implement fast entry for syscalls on BOOKE

2019-05-23 Thread Christophe Leroy




Le 23/05/2019 à 09:00, Christophe Leroy a écrit :

[...]






arch/powerpc/kernel/head_fsl_booke.o: In function `SystemCall':
arch/powerpc/kernel/head_fsl_booke.S:416: undefined reference to 
`kvmppc_handler_BOOKE_INTERRUPT_SYSCALL_SPRN_SRR1'

Makefile:1052: recipe for target 'vmlinux' failed


+.macro SYSCALL_ENTRY trapno intno
+    mfspr    r10, SPRN_SPRG_THREAD
+#ifdef CONFIG_KVM_BOOKE_HV
+BEGIN_FTR_SECTION
+    mtspr    SPRN_SPRG_WSCRATCH0, r10
+    stw    r11, THREAD_NORMSAVE(0)(r10)
+    stw    r13, THREAD_NORMSAVE(2)(r10)
+    mfcr    r13    /* save CR in r13 for now   */
+    mfspr    r11, SPRN_SRR1
+    mtocrf    0x80, r11    /* check MSR[GS] without clobbering reg */
+    bf    3, 1975f
+    b    kvmppc_handler_BOOKE_INTERRUPT_\intno\()_SPRN_SRR1


It seems to me that the "_SPRN_SRR1" on the end of this line
isn't meant to be there...  However, it still fails to link with that
removed.


It looks like I missed the macro expansion.

The called function should be kvmppc_handler_8_0x01B

Seems like kisskb doesn't build any config like this.

Christophe



This SYSCALL_ENTRY macro is a slimmed version of NORMAL_EXCEPTION_PROLOG()

In NORMAL_EXCEPTION_PROLOG(), we have:
 DO_KVM    BOOKE_INTERRUPT_##intno SPRN_SRR1;

The _SPRN_SRR1 comes from there


Then in /arch/powerpc/include/asm/kvm_booke_hv_asm.h:

.macro DO_KVM intno srr1
#ifdef CONFIG_KVM_BOOKE_HV
BEGIN_FTR_SECTION
 mtocrf    0x80, r11    /* check MSR[GS] without clobbering reg */
 bf    3, 1975f
 b    kvmppc_handler_\intno\()_\srr1
1975:
END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
#endif
.endm
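
To make the expansion concrete (a sketch; the numeric values assume
BOOKE_INTERRUPT_SYSCALL is 8 and SPRN_SRR1 is 0x01B, which is what the
kvmppc_handler_8_0x01B target named in the fix implies):

	/*
	 * Working path (NORMAL_EXCEPTION_PROLOG, or SYSCALL_ENTRY after the
	 * fix): the C preprocessor expands the defines first, then the
	 * assembler macro concatenates the resulting tokens:
	 *
	 *   DO_KVM BOOKE_INTERRUPT_SYSCALL SPRN_SRR1
	 *     -> DO_KVM 8 0x01B
	 *     -> b kvmppc_handler_8_0x01B
	 *
	 * Broken SYSCALL_ENTRY: the names were pasted literally inside the
	 * assembler macro, so no preprocessor expansion happened:
	 *
	 *     -> b kvmppc_handler_BOOKE_INTERRUPT_SYSCALL_SPRN_SRR1  (undefined)
	 */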


Christophe


Re: [PATCH 1/3] powerpc/powernv: remove the unused pnv_pci_set_p2p function

2019-05-23 Thread Christoph Hellwig
On Mon, May 06, 2019 at 10:46:11AM +0200, Frederic Barrat wrote:
> Hi,
>
> The PCI p2p and tunnel code is used by the Mellanox CX5 driver, at least 
> their latest, out of tree version, which is used for CORAL. My 
> understanding is that they'll upstream it at some point, though I don't 
> know what their schedule is like.

FYI, Max, who wrote (at least large parts of) that code and is on Cc, agreed
that all P2P code should go through the kernel P2P infrastructure and
might be able to spend some cycles on it.

Which still doesn't change anything about that fact that we [1]
generally don't add infrastructure for anything that is not in the
tree.

[1] well, powernv seems to have handled this a little oddly, and is now
on my special watchlist.


ppc85xx_basic_defconfig is buggy ?

2019-05-23 Thread Christophe Leroy

ppc85xx_basic_defconfig does not select CONFIG_PPC_85xx.

Is that expected ?

Christophe


[PATCH 4/4] powerpc/powernv: remove the unused vas_win_paste_addr and vas_win_id functions

2019-05-23 Thread Christoph Hellwig
These two functions have never been used since they were added to the
kernel.

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/include/asm/vas.h  | 10 --
 arch/powerpc/platforms/powernv/vas-window.c | 19 ---
 arch/powerpc/platforms/powernv/vas.h| 20 
 3 files changed, 49 deletions(-)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index 771456227496..9b5b7261df7b 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -167,14 +167,4 @@ int vas_copy_crb(void *crb, int offset);
  */
 int vas_paste_crb(struct vas_window *win, int offset, bool re);
 
-/*
- * Return a system-wide unique id for the VAS window @win.
- */
-extern u32 vas_win_id(struct vas_window *win);
-
-/*
- * Return the power bus paste address associated with @win so the caller
- * can map that address into their address space.
- */
-extern u64 vas_win_paste_addr(struct vas_window *win);
 #endif /* __ASM_POWERPC_VAS_H */
diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
b/arch/powerpc/platforms/powernv/vas-window.c
index e59e0e60e5b5..e48c44cb3a16 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -44,16 +44,6 @@ static void compute_paste_address(struct vas_window *window, 
u64 *addr, int *len
pr_debug("Txwin #%d: Paste addr 0x%llx\n", winid, *addr);
 }
 
-u64 vas_win_paste_addr(struct vas_window *win)
-{
-   u64 addr;
-
-   compute_paste_address(win, &addr, NULL);
-
-   return addr;
-}
-EXPORT_SYMBOL(vas_win_paste_addr);
-
 static inline void get_hvwc_mmio_bar(struct vas_window *window,
u64 *start, int *len)
 {
@@ -1268,12 +1258,3 @@ int vas_win_close(struct vas_window *window)
return 0;
 }
 EXPORT_SYMBOL_GPL(vas_win_close);
-
-/*
- * Return a system-wide unique window id for the window @win.
- */
-u32 vas_win_id(struct vas_window *win)
-{
-   return encode_pswid(win->vinst->vas_id, win->winid);
-}
-EXPORT_SYMBOL_GPL(vas_win_id);
diff --git a/arch/powerpc/platforms/powernv/vas.h 
b/arch/powerpc/platforms/powernv/vas.h
index f5493dbdd7ff..551affaddd59 100644
--- a/arch/powerpc/platforms/powernv/vas.h
+++ b/arch/powerpc/platforms/powernv/vas.h
@@ -448,26 +448,6 @@ static inline u64 read_hvwc_reg(struct vas_window *win,
return in_be64(win->hvwc_map+reg);
 }
 
-/*
- * Encode/decode the Partition Send Window ID (PSWID) for a window in
- * a way that we can uniquely identify any window in the system. i.e.
- * we should be able to locate the 'struct vas_window' given the PSWID.
- *
- * Bits    Usage
- * 0:7     VAS id (8 bits)
- * 8:15    Unused, 0 (3 bits)
- * 16:31   Window id (16 bits)
- */
-static inline u32 encode_pswid(int vasid, int winid)
-{
-   u32 pswid = 0;
-
-   pswid |= vasid << (31 - 7);
-   pswid |= winid;
-
-   return pswid;
-}
-
 static inline void decode_pswid(u32 pswid, int *vasid, int *winid)
 {
if (vasid)
-- 
2.20.1



[PATCH 3/4] powerpc/powernv: remove dead NPU DMA code

2019-05-23 Thread Christoph Hellwig
None of these routines were ever used since they were added to the
kernel.

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/include/asm/book3s/64/mmu.h |   2 -
 arch/powerpc/include/asm/powernv.h   |  22 -
 arch/powerpc/mm/book3s64/mmu_context.c   |   1 -
 arch/powerpc/platforms/powernv/npu-dma.c | 556 ---
 4 files changed, 581 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 74d24201fc4f..23b83d3593e2 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -116,8 +116,6 @@ typedef struct {
/* Number of users of the external (Nest) MMU */
atomic_t copros;
 
-   /* NPU NMMU context */
-   struct npu_context *npu_context;
struct hash_mm_context *hash_context;
 
unsigned long vdso_base;
diff --git a/arch/powerpc/include/asm/powernv.h 
b/arch/powerpc/include/asm/powernv.h
index 05b552418519..40f868c5e93c 100644
--- a/arch/powerpc/include/asm/powernv.h
+++ b/arch/powerpc/include/asm/powernv.h
@@ -11,35 +11,13 @@
 #define _ASM_POWERNV_H
 
 #ifdef CONFIG_PPC_POWERNV
-#define NPU2_WRITE 1
 extern void powernv_set_nmmu_ptcr(unsigned long ptcr);
-extern struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
-   unsigned long flags,
-   void (*cb)(struct npu_context *, void *),
-   void *priv);
-extern void pnv_npu2_destroy_context(struct npu_context *context,
-   struct pci_dev *gpdev);
-extern int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea,
-   unsigned long *flags, unsigned long *status,
-   int count);
 
 void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val);
 
 void pnv_tm_init(void);
 #else
 static inline void powernv_set_nmmu_ptcr(unsigned long ptcr) { }
-static inline struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
-   unsigned long flags,
-   struct npu_context *(*cb)(struct npu_context *, void *),
-   void *priv) { return ERR_PTR(-ENODEV); }
-static inline void pnv_npu2_destroy_context(struct npu_context *context,
-   struct pci_dev *gpdev) { }
-
-static inline int pnv_npu2_handle_fault(struct npu_context *context,
-   uintptr_t *ea, unsigned long *flags,
-   unsigned long *status, int count) {
-   return -ENODEV;
-}
 
 static inline void pnv_tm_init(void) { }
 #endif
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
b/arch/powerpc/mm/book3s64/mmu_context.c
index cb2b08635508..0dd3e631cf3e 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -140,7 +140,6 @@ static int radix__init_new_context(struct mm_struct *mm)
 */
asm volatile("ptesync;isync" : : : "memory");
 
-   mm->context.npu_context = NULL;
mm->context.hash_context = NULL;
 
return index;
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 495550432f3d..4ed24132bb7c 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -22,12 +22,6 @@
 
 #include "pci.h"
 
-/*
- * spinlock to protect initialisation of an npu_context for a particular
- * mm_struct.
- */
-static DEFINE_SPINLOCK(npu_context_lock);
-
 static struct pci_dev *get_pci_dev(struct device_node *dn)
 {
struct pci_dn *pdn = PCI_DN(dn);
@@ -362,15 +356,6 @@ struct npu_comp {
 /* An NPU descriptor, valid for POWER9 only */
 struct npu {
int index;
-   __be64 *mmio_atsd_regs[NV_NMMU_ATSD_REGS];
-   unsigned int mmio_atsd_count;
-
-   /* Bitmask for MMIO register usage */
-   unsigned long mmio_atsd_usage;
-
-   /* Do we need to explicitly flush the nest mmu? */
-   bool nmmu_flush;
-
struct npu_comp npucomp;
 };
 
@@ -627,534 +612,8 @@ struct iommu_table_group *pnv_npu_compound_attach(struct 
pnv_ioda_pe *pe)
 }
 #endif /* CONFIG_IOMMU_API */
 
-/* Maximum number of nvlinks per npu */
-#define NV_MAX_LINKS 6
-
-/* Maximum index of npu2 hosts in the system. Always < NV_MAX_NPUS */
-static int max_npu2_index;
-
-struct npu_context {
-   struct mm_struct *mm;
-   struct pci_dev *npdev[NV_MAX_NPUS][NV_MAX_LINKS];
-   struct mmu_notifier mn;
-   struct kref kref;
-   bool nmmu_flush;
-
-   /* Callback to stop translation requests on a given GPU */
-   void (*release_cb)(struct npu_context *context, void *priv);
-
-   /*
-* Private pointer passed to the above callback for usage by
-* device drivers.
-*/
-   void *priv;
-};
-
-struct mmio_atsd_reg {
-   struct npu *npu;
-   int reg;
-};
-
-/*
- * Find a free MMIO ATSD register and mark it in use. Return

remove dead powernv code v2

2019-05-23 Thread Christoph Hellwig
Hi all,

the powerpc powernv port has a fairly large chunk of code that never
had any upstream user.  We generally strive not to keep dead code
around, and this was affirmed at last year's Maintainer summit.

Changes since v1:
 - rebased to v5.2-rc1
 - remove even more dead code


[PATCH 2/4] powerpc/powernv: remove the unused tunneling exports

2019-05-23 Thread Christoph Hellwig
These have been unused ever since they were added to the kernel.

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/include/asm/pnv-pci.h|  4 --
 arch/powerpc/platforms/powernv/pci-ioda.c |  4 +-
 arch/powerpc/platforms/powernv/pci.c  | 71 ---
 arch/powerpc/platforms/powernv/pci.h  |  1 -
 4 files changed, 3 insertions(+), 77 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-pci.h 
b/arch/powerpc/include/asm/pnv-pci.h
index 9fcb0bc462c6..1ab4b0111abc 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -27,12 +27,8 @@ extern int pnv_pci_get_power_state(uint64_t id, uint8_t 
*state);
 extern int pnv_pci_set_power_state(uint64_t id, uint8_t state,
   struct opal_msg *msg);
 
-extern int pnv_pci_enable_tunnel(struct pci_dev *dev, uint64_t *asnind);
-extern int pnv_pci_disable_tunnel(struct pci_dev *dev);
 extern int pnv_pci_set_tunnel_bar(struct pci_dev *dev, uint64_t addr,
  int enable);
-extern int pnv_pci_get_as_notify_info(struct task_struct *task, u32 *lpid,
- u32 *pid, u32 *tid);
 int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode);
 int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
   unsigned int virq);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 126602b4e399..6b0caa2d0425 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -54,6 +54,8 @@
 static const char * const pnv_phb_names[] = { "IODA1", "IODA2", "NPU_NVLINK",
  "NPU_OCAPI" };
 
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
+
 void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
const char *fmt, ...)
 {
@@ -2360,7 +2362,7 @@ static long pnv_pci_ioda2_set_window(struct 
iommu_table_group *table_group,
return 0;
 }
 
-void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index 8d28f2932c3b..fc69f5611020 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -868,54 +868,6 @@ struct device_node *pnv_pci_get_phb_node(struct pci_dev 
*dev)
 }
 EXPORT_SYMBOL(pnv_pci_get_phb_node);
 
-int pnv_pci_enable_tunnel(struct pci_dev *dev, u64 *asnind)
-{
-   struct device_node *np;
-   const __be32 *prop;
-   struct pnv_ioda_pe *pe;
-   uint16_t window_id;
-   int rc;
-
-   if (!radix_enabled())
-   return -ENXIO;
-
-   if (!(np = pnv_pci_get_phb_node(dev)))
-   return -ENXIO;
-
-   prop = of_get_property(np, "ibm,phb-indications", NULL);
-   of_node_put(np);
-
-   if (!prop || !prop[1])
-   return -ENXIO;
-
-   *asnind = (u64)be32_to_cpu(prop[1]);
-   pe = pnv_ioda_get_pe(dev);
-   if (!pe)
-   return -ENODEV;
-
-   /* Increase real window size to accept as_notify messages. */
-   window_id = (pe->pe_number << 1 ) + 1;
-   rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id, pe->pe_number,
-window_id, pe->tce_bypass_base,
-(uint64_t)1 << 48);
-   return opal_error_code(rc);
-}
-EXPORT_SYMBOL_GPL(pnv_pci_enable_tunnel);
-
-int pnv_pci_disable_tunnel(struct pci_dev *dev)
-{
-   struct pnv_ioda_pe *pe;
-
-   pe = pnv_ioda_get_pe(dev);
-   if (!pe)
-   return -ENODEV;
-
-   /* Restore default real window size. */
-   pnv_pci_ioda2_set_bypass(pe, true);
-   return 0;
-}
-EXPORT_SYMBOL_GPL(pnv_pci_disable_tunnel);
-
 int pnv_pci_set_tunnel_bar(struct pci_dev *dev, u64 addr, int enable)
 {
__be64 val;
@@ -970,29 +922,6 @@ int pnv_pci_set_tunnel_bar(struct pci_dev *dev, u64 addr, 
int enable)
 }
 EXPORT_SYMBOL_GPL(pnv_pci_set_tunnel_bar);
 
-#ifdef CONFIG_PPC64 /* for thread.tidr */
-int pnv_pci_get_as_notify_info(struct task_struct *task, u32 *lpid, u32 *pid,
-  u32 *tid)
-{
-   struct mm_struct *mm = NULL;
-
-   if (task == NULL)
-   return -EINVAL;
-
-   mm = get_task_mm(task);
-   if (mm == NULL)
-   return -EINVAL;
-
-   *pid = mm->context.id;
-   mmput(mm);
-
-   *tid = task->thread.tidr;
-   *lpid = mfspr(SPRN_LPID);
-   return 0;
-}
-EXPORT_SYMBOL_GPL(pnv_pci_get_as_notify_info);
-#endif
-
 void pnv_pci_shutdown(void)
 {
struct pci_controller *hose;
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index 4f11c077af6

[PATCH 1/4] powerpc/powernv: remove the unused pnv_pci_set_p2p function

2019-05-23 Thread Christoph Hellwig
This function has never been used since it was added to the tree.
We also now have proper PCIe P2P APIs in the core kernel, and any new
P2P support should be using those.

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/include/asm/opal-api.h|  6 --
 arch/powerpc/include/asm/opal.h|  2 -
 arch/powerpc/include/asm/pnv-pci.h |  2 -
 arch/powerpc/platforms/powernv/opal-call.c |  1 -
 arch/powerpc/platforms/powernv/pci.c   | 74 --
 arch/powerpc/platforms/powernv/pci.h   |  5 --
 6 files changed, 90 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index e1577cfa7186..cd34c328774d 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -1132,12 +1132,6 @@ enum {
OPAL_IMC_COUNTERS_TRACE = 3,
 };
 
-
-/* PCI p2p descriptor */
-#define OPAL_PCI_P2P_ENABLE0x1
-#define OPAL_PCI_P2P_LOAD  0x2
-#define OPAL_PCI_P2P_STORE 0x4
-
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 4cc37e708bc7..15c488ce4225 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -287,8 +287,6 @@ int64_t opal_xive_set_queue_state(uint64_t vp, uint32_t 
prio,
  uint32_t qtoggle,
  uint32_t qindex);
 int64_t opal_xive_get_vp_state(uint64_t vp, __be64 *out_w01);
-int64_t opal_pci_set_p2p(uint64_t phb_init, uint64_t phb_target,
-   uint64_t desc, uint16_t pe_number);
 
 int64_t opal_imc_counters_init(uint32_t type, uint64_t address,
uint64_t cpu_pir);
diff --git a/arch/powerpc/include/asm/pnv-pci.h 
b/arch/powerpc/include/asm/pnv-pci.h
index 630eb8b1b7ed..9fcb0bc462c6 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -26,8 +26,6 @@ extern int pnv_pci_get_presence_state(uint64_t id, uint8_t 
*state);
 extern int pnv_pci_get_power_state(uint64_t id, uint8_t *state);
 extern int pnv_pci_set_power_state(uint64_t id, uint8_t state,
   struct opal_msg *msg);
-extern int pnv_pci_set_p2p(struct pci_dev *initiator, struct pci_dev *target,
-  u64 desc);
 
 extern int pnv_pci_enable_tunnel(struct pci_dev *dev, uint64_t *asnind);
 extern int pnv_pci_disable_tunnel(struct pci_dev *dev);
diff --git a/arch/powerpc/platforms/powernv/opal-call.c 
b/arch/powerpc/platforms/powernv/opal-call.c
index 36c8fa3647a2..29ca523c1c79 100644
--- a/arch/powerpc/platforms/powernv/opal-call.c
+++ b/arch/powerpc/platforms/powernv/opal-call.c
@@ -273,7 +273,6 @@ OPAL_CALL(opal_npu_map_lpar,
OPAL_NPU_MAP_LPAR);
 OPAL_CALL(opal_imc_counters_init,  OPAL_IMC_COUNTERS_INIT);
 OPAL_CALL(opal_imc_counters_start, OPAL_IMC_COUNTERS_START);
 OPAL_CALL(opal_imc_counters_stop,  OPAL_IMC_COUNTERS_STOP);
-OPAL_CALL(opal_pci_set_p2p,OPAL_PCI_SET_P2P);
 OPAL_CALL(opal_get_powercap,   OPAL_GET_POWERCAP);
 OPAL_CALL(opal_set_powercap,   OPAL_SET_POWERCAP);
 OPAL_CALL(opal_get_power_shift_ratio,  OPAL_GET_POWER_SHIFT_RATIO);
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index ef9448a907c6..8d28f2932c3b 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -38,7 +38,6 @@
 #include "powernv.h"
 #include "pci.h"
 
-static DEFINE_MUTEX(p2p_mutex);
 static DEFINE_MUTEX(tunnel_mutex);
 
 int pnv_pci_get_slot_id(struct device_node *np, uint64_t *id)
@@ -861,79 +860,6 @@ void pnv_pci_dma_bus_setup(struct pci_bus *bus)
}
 }
 
-int pnv_pci_set_p2p(struct pci_dev *initiator, struct pci_dev *target, u64 
desc)
-{
-   struct pci_controller *hose;
-   struct pnv_phb *phb_init, *phb_target;
-   struct pnv_ioda_pe *pe_init;
-   int rc;
-
-   if (!opal_check_token(OPAL_PCI_SET_P2P))
-   return -ENXIO;
-
-   hose = pci_bus_to_host(initiator->bus);
-   phb_init = hose->private_data;
-
-   hose = pci_bus_to_host(target->bus);
-   phb_target = hose->private_data;
-
-   pe_init = pnv_ioda_get_pe(initiator);
-   if (!pe_init)
-   return -ENODEV;
-
-   /*
-* Configuring the initiator's PHB requires to adjust its
-* TVE#1 setting. Since the same device can be an initiator
-* several times for different target devices, we need to keep
-* a reference count to know when we can restore the default
-* bypass setting on its TVE#1 when disabling. Opal is not
-* tracking PE states, so we add a reference count on the PE
-* in linux.
-*
-* For the target, the configuration is per PHB, so we keep a
-* target reference count o

Re: [PATCH] powerpc/powernv: fix variable "c" set but not used

2019-05-23 Thread Christoph Hellwig
On Thu, May 23, 2019 at 09:26:53AM +0200, Christophe Leroy wrote:
> You are not fixing the problem, you are just hiding it.
> 
> If the result of __get_user() is unneeded, it means __get_user() is not the
> right function to use.
> 
> Should use fault_in_pages_readable() instead.

Also it is not just the variable that is unused, but the whole
function.  I'll resend my series to remove it in a bit.


Re: [PATCH] powerpc/powernv: fix variable "c" set but not used

2019-05-23 Thread Christophe Leroy




Le 23/05/2019 à 04:31, Qian Cai a écrit :

The commit 58629c0dc349 ("powerpc/powernv/npu: Fault user page into the
hypervisor's pagetable") introduced a variable "c" to be used in
__get_user() and __get_user_nocheck(), which need to stay as macros for
performance reasons; "c" is not actually used in pnv_npu2_handle_fault(),

arch/powerpc/platforms/powernv/npu-dma.c: In function 'pnv_npu2_handle_fault':
arch/powerpc/platforms/powernv/npu-dma.c:1122:7: warning: variable 'c'
set but not used [-Wunused-but-set-variable]

Fix it by adding the __maybe_unused attribute, so that compilers
ignore it.


You are not fixing the problem, you are just hiding it.

If the result of __get_user() is unneeded, it means __get_user() is not
the right function to use.


Should use fault_in_pages_readable() instead.

A similar warning was fixed in commit 9f9eae5ce717 ("powerpc/kvm: Prefer 
fault_in_pages_readable function")


See 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc?id=9f9eae5ce




Signed-off-by: Qian Cai 


You should add a Fixes: tag

58629c0dc349 ("powerpc/powernv/npu: Fault user page into the 
hypervisor's pagetable")


Christophe


---
  arch/powerpc/platforms/powernv/npu-dma.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 495550432f3d..5bbe59573ee6 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -1119,7 +1119,8 @@ int pnv_npu2_handle_fault(struct npu_context *context, 
uintptr_t *ea,
int i, is_write;
struct page *page[1];
const char __user *u;
-   char c;
+   /* To silence a -Wunused-but-set-variable warning. */
+   char c __maybe_unused;
  
  	/* mmap_sem should be held so the struct_mm must be present */

struct mm_struct *mm = context->mm;



Re: [PATCH] crypto: talitos - fix skcipher failure due to wrong output IV

2019-05-23 Thread Herbert Xu
On Wed, May 15, 2019 at 12:29:03PM +, Christophe Leroy wrote:
> Selftests report the following:
> 
> [2.984845] alg: skcipher: cbc-aes-talitos encryption test failed (wrong 
> output IV) on test vector 0, cfg="in-place"
> [2.995377] : 3d af ba 42 9d 9e b4 30 b4 22 da 80 2c 9f ac 41
> [3.032673] alg: skcipher: cbc-des-talitos encryption test failed (wrong 
> output IV) on test vector 0, cfg="in-place"
> [3.043185] : fe dc ba 98 76 54 32 10
> [3.063238] alg: skcipher: cbc-3des-talitos encryption test failed (wrong 
> output IV) on test vector 0, cfg="in-place"
> [3.073818] : 7d 33 88 93 0f 93 b2 42
> 
> The above dumps show that the actual output IV is indeed the input IV.
> This is due to the IV not being copied back into the request.
> 
> This patch fixes that.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  drivers/crypto/talitos.c | 4 
>  1 file changed, 4 insertions(+)
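
For context, the general shape of such a fix in a skcipher driver's completion
path looks roughly like this (a sketch of the CBC output-IV convention for the
encryption case, not the actual talitos hunks; cipher_copy_output_iv() is a
made-up name):

#include <crypto/scatterwalk.h>
#include <crypto/skcipher.h>

/*
 * Sketch only: for CBC encryption the skcipher API expects req->iv to
 * hold the output IV (the last ciphertext block) once the request
 * completes, so the driver has to copy it back into the request.
 */
static void cipher_copy_output_iv(struct skcipher_request *req,
				  unsigned int ivsize)
{
	scatterwalk_map_and_copy(req->iv, req->dst,
				 req->cryptlen - ivsize, ivsize, 0);
}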

Patch applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v3 14/16] powerpc/32: implement fast entry for syscalls on BOOKE

2019-05-23 Thread Christophe Leroy




Le 23/05/2019 à 08:14, Paul Mackerras a écrit :

On Tue, Apr 30, 2019 at 12:39:03PM +, Christophe Leroy wrote:

This patch implements a fast entry for syscalls.

Syscalls don't have to preserve non-volatile registers except LR.

This patch therefore implements a fast entry for syscalls, where
volatile registers get clobbered.

As this entry is dedicated to syscalls, it always sets MSR_EE
and warns in case MSR_EE was previously off.

It also assumes that the call always comes from user mode; system calls
from the kernel are unexpected.


This is now upstream as commit 1a4b739bbb4f.  On the e500mc test
config that I use, I'm getting this build failure:


Is that a standard defconfig? If not, can you provide your .config?



arch/powerpc/kernel/head_fsl_booke.o: In function `SystemCall':
arch/powerpc/kernel/head_fsl_booke.S:416: undefined reference to 
`kvmppc_handler_BOOKE_INTERRUPT_SYSCALL_SPRN_SRR1'
Makefile:1052: recipe for target 'vmlinux' failed


+.macro SYSCALL_ENTRY trapno intno
+   mfspr   r10, SPRN_SPRG_THREAD
+#ifdef CONFIG_KVM_BOOKE_HV
+BEGIN_FTR_SECTION
+   mtspr   SPRN_SPRG_WSCRATCH0, r10
+   stw r11, THREAD_NORMSAVE(0)(r10)
+   stw r13, THREAD_NORMSAVE(2)(r10)
+   mfcrr13 /* save CR in r13 for now  */
+   mfspr   r11, SPRN_SRR1
+   mtocrf  0x80, r11   /* check MSR[GS] without clobbering reg */
+   bf  3, 1975f
+   b   kvmppc_handler_BOOKE_INTERRUPT_\intno\()_SPRN_SRR1


It seems to me that the "_SPRN_SRR1" on the end of this line
isn't meant to be there...  However, it still fails to link with that
removed.


This SYSCALL_ENTRY macro is a slimmed version of NORMAL_EXCEPTION_PROLOG()

In NORMAL_EXCEPTION_PROLOG(), we have:
DO_KVM  BOOKE_INTERRUPT_##intno SPRN_SRR1;  

The _SPRN_SRR1 comes from there


Then in /arch/powerpc/include/asm/kvm_booke_hv_asm.h:

.macro DO_KVM intno srr1
#ifdef CONFIG_KVM_BOOKE_HV
BEGIN_FTR_SECTION
mtocrf  0x80, r11   /* check MSR[GS] without clobbering reg */
bf  3, 1975f
b   kvmppc_handler_\intno\()_\srr1
1975:
END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
#endif
.endm


Christophe


Re: [RFC PATCH 6/7] kasan: allow arches to hook into global registration

2019-05-23 Thread Daniel Axtens
Christophe Leroy  writes:

> Le 23/05/2019 à 07:21, Daniel Axtens a écrit :
>> Not all arches have a specific space carved out for modules -
>> some, such as powerpc, just use regular vmalloc space. Therefore,
>> globals in these modules cannot be backed by real shadow memory.
>
> Can you explain in more details the reason why ?

At this point, purely simplicity. As you discuss below, it's possible to
do better.

>
> PPC32 also uses regular vmalloc space, and it has been possible to manage
> globals on it simply by implementing a module_alloc() function.
>
> See 
> https://elixir.bootlin.com/linux/v5.2-rc1/source/arch/powerpc/mm/kasan/kasan_init_32.c#L135
>
> It is also possible to define a different area for modules, by replacing
> the call to vmalloc_exec() with a call to __vmalloc_node_range(), in the
> same way vmalloc_exec() does it but with bounds other than
> VMALLOC_START/VMALLOC_END (a sketch follows after this quoted block).
>
> See https://elixir.bootlin.com/linux/v5.2-rc1/source/mm/vmalloc.c#L2633
>
> Today in PPC64 (unlike PPC32), there is already a split between VMALLOC 
> space and IOREMAP space. I'm sure it would be easy to split it once more 
> for modules.
>
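
A minimal sketch of the second option above, modelled on the v5.2-rc1
vmalloc_exec() that the link points to (MODULES_VADDR and MODULES_END are
assumed bounds for a dedicated module area, not existing powerpc symbols):

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Sketch only: place module allocations in a dedicated range instead of
 * the generic VMALLOC_START/VMALLOC_END space, so their KASAN shadow can
 * be handled separately from the rest of vmalloc space.
 */
void *module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC,
				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
				    __builtin_return_address(0));
}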

OK, good to know, I'll look into one of those approaches for the next
spin!

Regards,
Daniel


> Christophe
>
>> 
>> In order to allow arches to perform this check, add a hook.
>> 
>> Signed-off-by: Daniel Axtens 
>> ---
>>   include/linux/kasan.h | 5 +
>>   mm/kasan/generic.c| 3 +++
>>   2 files changed, 8 insertions(+)
>> 
>> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
>> index dfee2b42d799..4752749e4797 100644
>> --- a/include/linux/kasan.h
>> +++ b/include/linux/kasan.h
>> @@ -18,6 +18,11 @@ struct task_struct;
>>   static inline bool kasan_arch_is_ready(void)   { return true; }
>>   #endif
>>   
>> +#ifndef kasan_arch_can_register_global
>> +static inline bool kasan_arch_can_register_global(const void * addr)
>> { return true; }
>> +#endif
>> +
>> +
>>   #ifndef ARCH_HAS_KASAN_EARLY_SHADOW
>>   extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
>>   extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
>> diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
>> index 0336f31bbae3..935b06f659a0 100644
>> --- a/mm/kasan/generic.c
>> +++ b/mm/kasan/generic.c
>> @@ -208,6 +208,9 @@ static void register_global(struct kasan_global *global)
>>   {
>>  size_t aligned_size = round_up(global->size, KASAN_SHADOW_SCALE_SIZE);
>>   
>> +if (!kasan_arch_can_register_global(global->beg))
>> +return;
>> +
>>  kasan_unpoison_shadow(global->beg, global->size);
>>   
>>  kasan_poison_shadow(global->beg + aligned_size,
>>