Re: [PATCH 1/6] powerpc/64s: Add barrier_nospec

2018-04-23 Thread Nicholas Piggin
On Tue, 24 Apr 2018 14:15:54 +1000
Michael Ellerman  wrote:

> From: Michal Suchanek 
> 
> A no-op form of ori (an OR-immediate of 0 into r31, storing the result
> back in r31) has been re-tasked as a speculation barrier. The instruction
> only acts as a barrier on newer machines with appropriate firmware
> support. On older CPUs it remains a harmless no-op.
> 
> Implement barrier_nospec using this instruction.
> 
> mpe: The semantics of the instruction are believed to be that it
> prevents execution of subsequent instructions until preceding branches
> have been fully resolved and are no longer executing speculatively.
> There is no further documentation available at this time.
> 
> Signed-off-by: Michal Suchanek 
> Signed-off-by: Michael Ellerman 
> ---
> mpe: Make it Book3S64 only, update comment & change log, add a
>  memory clobber to the asm.

These all seem good to me. Thanks Michal.

We should (eventually) work on the module patching problem too.

Thanks,
Nick



Re: [PATCH 3/5] powerpc: use time64_t in read_persistent_clock

2018-04-23 Thread kbuild test robot
Hi Arnd,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.17-rc2 next-20180423]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Arnd-Bergmann/powerpc-always-enable-RTC_LIB/20180423-223504
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc64-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc64

All errors (new ones prefixed by >>):

   arch/powerpc/platforms/maple/time.c: In function 'maple_get_boot_time':
>> arch/powerpc/platforms/maple/time.c:173:9: error: implicit declaration of function 'rtc_tm_to_time66'; did you mean 'rtc_tm_to_time64'? [-Werror=implicit-function-declaration]
      return rtc_tm_to_time66(&tm);
             ^~~~~~~~~~~~~~~~
             rtc_tm_to_time64
   cc1: all warnings being treated as errors

vim +173 arch/powerpc/platforms/maple/time.c

   139  
   140  time64_t __init maple_get_boot_time(void)
   141  {
   142  struct rtc_time tm;
   143  struct device_node *rtcs;
   144  
   145  rtcs = of_find_compatible_node(NULL, "rtc", "pnpPNP,b00");
   146  if (rtcs) {
   147  struct resource r;
   148  if (of_address_to_resource(rtcs, 0, &r)) {
   149  printk(KERN_EMERG "Maple: Unable to translate RTC"
   150 " address\n");
   151  goto bail;
   152  }
   153  if (!(r.flags & IORESOURCE_IO)) {
   154  printk(KERN_EMERG "Maple: RTC address isn't PIO!\n");
   155  goto bail;
   156  }
   157  maple_rtc_addr = r.start;
   158  printk(KERN_INFO "Maple: Found RTC at IO 0x%x\n",
   159 maple_rtc_addr);
   160  }
   161   bail:
   162  if (maple_rtc_addr == 0) {
   163  maple_rtc_addr = RTC_PORT(0); /* legacy address */
   164  printk(KERN_INFO "Maple: No device node for RTC, assuming "
   165 "legacy address (0x%x)\n", maple_rtc_addr);
   166  }
   167  
   168  rtc_iores.start = maple_rtc_addr;
   169  rtc_iores.end = maple_rtc_addr + 7;
   170  request_resource(&ioport_resource, &rtc_iores);
   171  
   172  maple_get_rtc_time(&tm);
 > 173  return rtc_tm_to_time66(&tm);
   174  }
   175  
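The compiler's suggestion points directly at the fix: the patch under test
spells the helper rtc_tm_to_time66 instead of rtc_tm_to_time64. A minimal
sketch of the corrected line (my assumption; the fixed revision is not part
of this report):

    /* rtc_tm_to_time64() converts a struct rtc_time to a time64_t */
    return rtc_tm_to_time64(&tm);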

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH] cpufreq: powernv: Fix the hardlockup by synchronous smp_call in timer interrupt

2018-04-23 Thread Shilpasri G Bhat
Hi,

On 04/24/2018 10:40 AM, Stewart Smith wrote:
> Shilpasri G Bhat  writes:
>> gpstate_timer_handler() uses synchronous smp_call to set the pstate
>> on the requested core. This causes the below hard lockup:
>>
>> [c03fe566b320] [c01d5340] smp_call_function_single+0x110/0x180 
>> (unreliable)
>> [c03fe566b390] [c01d55e0] smp_call_function_any+0x180/0x250
>> [c03fe566b3f0] [c0acd3e8] gpstate_timer_handler+0x1e8/0x580
>> [c03fe566b4a0] [c01b46b0] call_timer_fn+0x50/0x1c0
>> [c03fe566b520] [c01b4958] expire_timers+0x138/0x1f0
>> [c03fe566b590] [c01b4bf8] run_timer_softirq+0x1e8/0x270
>> [c03fe566b630] [c0d0d6c8] __do_softirq+0x158/0x3e4
>> [c03fe566b710] [c0114be8] irq_exit+0xe8/0x120
>> [c03fe566b730] [c0024d0c] timer_interrupt+0x9c/0xe0
>> [c03fe566b760] [c0009014] decrementer_common+0x114/0x120
>> --- interrupt: 901 at doorbell_global_ipi+0x34/0x50
>> LR = arch_send_call_function_ipi_mask+0x120/0x130
>> [c03fe566ba50] [c004876c] 
>> arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
>> [c03fe566ba90] [c01d59f0] smp_call_function_many+0x340/0x450
>> [c03fe566bb00] [c0075f18] pmdp_invalidate+0x98/0xe0
>> [c03fe566bb30] [c03a1120] change_huge_pmd+0xe0/0x270
>> [c03fe566bba0] [c0349278] change_protection_range+0xb88/0xe40
>> [c03fe566bcf0] [c03496c0] mprotect_fixup+0x140/0x340
>> [c03fe566bdb0] [c0349a74] SyS_mprotect+0x1b4/0x350
>> [c03fe566be30] [c000b184] system_call+0x58/0x6c
>>
>> Fix this by using the asynchronous smp_call in the timer interrupt handler.
>> We don't have to wait in this handler until the pstates are changed on
>> the core. This change will not have any impact on the global pstate
>> ramp-down algorithm.
>>
>> Reported-by: Nicholas Piggin 
>> Reported-by: Pridhiviraj Paidipeddi 
>> Signed-off-by: Shilpasri G Bhat 
>> ---
>>  drivers/cpufreq/powernv-cpufreq.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
>> b/drivers/cpufreq/powernv-cpufreq.c
>> index 0591874..7e0c752 100644
>> --- a/drivers/cpufreq/powernv-cpufreq.c
>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>> @@ -721,7 +721,7 @@ void gpstate_timer_handler(struct timer_list *t)
>>  	spin_unlock(&gpstates->gpstate_lock);
>>
>>  	/* Timer may get migrated to a different cpu on cpu hot unplug */
>> -	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
>> +	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 0);
>>  }
> 
> Should this have:
> Fixes: eaa2c3aeef83f
> and CC stable v4.7+ ?
> 

Yeah, this is required.

Fixes: eaa2c3aeef83 ("cpufreq: powernv: Ramp-down global pstate slower than local-pstate")

Thanks and Regards,
Shilpa



Re: [PATCH] cpufreq: powernv: Fix the hardlockup by synchronous smp_call in timer interrupt

2018-04-23 Thread Stewart Smith
Shilpasri G Bhat  writes:
> gpstate_timer_handler() uses synchronous smp_call to set the pstate
> on the requested core. This causes the below hard lockup:
>
> [c03fe566b320] [c01d5340] smp_call_function_single+0x110/0x180 
> (unreliable)
> [c03fe566b390] [c01d55e0] smp_call_function_any+0x180/0x250
> [c03fe566b3f0] [c0acd3e8] gpstate_timer_handler+0x1e8/0x580
> [c03fe566b4a0] [c01b46b0] call_timer_fn+0x50/0x1c0
> [c03fe566b520] [c01b4958] expire_timers+0x138/0x1f0
> [c03fe566b590] [c01b4bf8] run_timer_softirq+0x1e8/0x270
> [c03fe566b630] [c0d0d6c8] __do_softirq+0x158/0x3e4
> [c03fe566b710] [c0114be8] irq_exit+0xe8/0x120
> [c03fe566b730] [c0024d0c] timer_interrupt+0x9c/0xe0
> [c03fe566b760] [c0009014] decrementer_common+0x114/0x120
> --- interrupt: 901 at doorbell_global_ipi+0x34/0x50
> LR = arch_send_call_function_ipi_mask+0x120/0x130
> [c03fe566ba50] [c004876c] 
> arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
> [c03fe566ba90] [c01d59f0] smp_call_function_many+0x340/0x450
> [c03fe566bb00] [c0075f18] pmdp_invalidate+0x98/0xe0
> [c03fe566bb30] [c03a1120] change_huge_pmd+0xe0/0x270
> [c03fe566bba0] [c0349278] change_protection_range+0xb88/0xe40
> [c03fe566bcf0] [c03496c0] mprotect_fixup+0x140/0x340
> [c03fe566bdb0] [c0349a74] SyS_mprotect+0x1b4/0x350
> [c03fe566be30] [c000b184] system_call+0x58/0x6c
>
> Fix this by using the asynchronous smp_call in the timer interrupt handler.
> We don't have to wait in this handler until the pstates are changed on
> the core. This change will not have any impact on the global pstate
> ramp-down algorithm.
>
> Reported-by: Nicholas Piggin 
> Reported-by: Pridhiviraj Paidipeddi 
> Signed-off-by: Shilpasri G Bhat 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 0591874..7e0c752 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -721,7 +721,7 @@ void gpstate_timer_handler(struct timer_list *t)
>  	spin_unlock(&gpstates->gpstate_lock);
>
>  	/* Timer may get migrated to a different cpu on cpu hot unplug */
> -	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
> +	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 0);
>  }

Should this have:
Fixes: eaa2c3aeef83f
and CC stable v4.7+ ?

-- 
Stewart Smith
OPAL Architect, IBM.



[PATCH] cpufreq: powernv: Fix the hardlockup by synchronous smp_call in timer interrupt

2018-04-23 Thread Shilpasri G Bhat
gpstate_timer_handler() uses synchronous smp_call to set the pstate
on the requested core. This causes the below hard lockup:

[c03fe566b320] [c01d5340] smp_call_function_single+0x110/0x180 
(unreliable)
[c03fe566b390] [c01d55e0] smp_call_function_any+0x180/0x250
[c03fe566b3f0] [c0acd3e8] gpstate_timer_handler+0x1e8/0x580
[c03fe566b4a0] [c01b46b0] call_timer_fn+0x50/0x1c0
[c03fe566b520] [c01b4958] expire_timers+0x138/0x1f0
[c03fe566b590] [c01b4bf8] run_timer_softirq+0x1e8/0x270
[c03fe566b630] [c0d0d6c8] __do_softirq+0x158/0x3e4
[c03fe566b710] [c0114be8] irq_exit+0xe8/0x120
[c03fe566b730] [c0024d0c] timer_interrupt+0x9c/0xe0
[c03fe566b760] [c0009014] decrementer_common+0x114/0x120
--- interrupt: 901 at doorbell_global_ipi+0x34/0x50
LR = arch_send_call_function_ipi_mask+0x120/0x130
[c03fe566ba50] [c004876c] 
arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
[c03fe566ba90] [c01d59f0] smp_call_function_many+0x340/0x450
[c03fe566bb00] [c0075f18] pmdp_invalidate+0x98/0xe0
[c03fe566bb30] [c03a1120] change_huge_pmd+0xe0/0x270
[c03fe566bba0] [c0349278] change_protection_range+0xb88/0xe40
[c03fe566bcf0] [c03496c0] mprotect_fixup+0x140/0x340
[c03fe566bdb0] [c0349a74] SyS_mprotect+0x1b4/0x350
[c03fe566be30] [c000b184] system_call+0x58/0x6c

Fix this by using the asynchronous smp_call in the timer interrupt handler.
We don't have to wait in this handler until the pstates are changed on
the core. This change will not have any impact on the global pstate
ramp-down algorithm.

Reported-by: Nicholas Piggin 
Reported-by: Pridhiviraj Paidipeddi 
Signed-off-by: Shilpasri G Bhat 
---
 drivers/cpufreq/powernv-cpufreq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 0591874..7e0c752 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -721,7 +721,7 @@ void gpstate_timer_handler(struct timer_list *t)
 	spin_unlock(&gpstates->gpstate_lock);
 
 	/* Timer may get migrated to a different cpu on cpu hot unplug */
-	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
+	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 0);
 }
 
 /*
-- 
1.8.3.1
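
For reference, the helper whose wait flag is flipped here has this prototype
(as in include/linux/smp.h of this era; shown as a sketch of why wait=1 is
dangerous in this context):

    /*
     * With wait=1 the caller spins until func has completed on the chosen
     * CPU. From a timer (hardirq) handler that spin can deadlock: the target
     * CPU may itself be blocked waiting on this CPU (see the
     * smp_call_function_many() in the second half of the backtrace above),
     * so neither side makes progress. With wait=0 the IPI is only queued.
     */
    int smp_call_function_any(const struct cpumask *mask,
                              smp_call_func_t func, void *info, int wait);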



[PATCH 6/6] powerpc/64: Use barrier_nospec in syscall entry

2018-04-23 Thread Michael Ellerman
Our syscall entry is done in assembly so patch in an explicit
barrier_nospec.

Based on a patch by Michal Suchanek.

Signed-off-by: Michal Suchanek 
Signed-off-by: Michael Ellerman 
---
mpe: Move the barrier to immediately prior to the vulnerable load,
 and add a comment trying to explain why. Drop the barrier from
 syscall_dotrace, because that syscall number comes from the
 kernel.
---
 arch/powerpc/kernel/entry_64.S | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 51695608c68b..de30f9a34c0c 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #ifdef CONFIG_PPC_BOOK3S
 #include 
@@ -178,6 +179,15 @@ system_call:		/* label this so stack traces look sane */
 	clrldi	r8,r8,32
 15:
 	slwi	r0,r0,4
+
+   barrier_nospec_asm
+   /*
+* Prevent the load of the handler below (based on the user-passed
+* system call number) being speculatively executed until the test
+* against NR_syscalls and branch to .Lsyscall_enosys above has
+* committed.
+*/
+
ldx r12,r11,r0  /* Fetch system call handler [ptr] */
mtctr   r12
bctrl   /* Call handler */
-- 
2.14.1
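
In C terms, the pattern being hardened looks roughly like the sketch below;
this is an illustration of the assembly above, not code from the patch:

    /*
     * r0 holds the user-supplied syscall number. The bounds-check branch may
     * be predicted not-taken, letting the table load execute speculatively
     * with an out-of-range index; barrier_nospec() holds the load back until
     * the branch has actually resolved.
     */
    if (nr >= NR_syscalls)
            return -ENOSYS;
    barrier_nospec();                       /* patched ori 31,31,0 */
    handler = sys_call_table[nr];           /* the vulnerable load */
    ret = handler(a0, a1, a2, a3, a4, a5);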



[PATCH 5/6] powerpc: Use barrier_nospec in copy_from_user()

2018-04-23 Thread Michael Ellerman
Based on the x86 commit doing the same.

See commit 304ec1b05031 ("x86/uaccess: Use __uaccess_begin_nospec()
and uaccess_try_nospec") and b3bbfb3fb5d2 ("x86: Introduce
__uaccess_begin_nospec() and uaccess_try_nospec") for more detail.

In all cases we are ordering the load from the potentially
user-controlled pointer vs a previous branch based on an access_ok()
check or similar.

Based on a patch from Michal Suchanek.

Signed-off-by: Michal Suchanek 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/uaccess.h | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index a62ee663b2c8..6dc3d2eeea4a 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -252,6 +252,7 @@ do {								\
__chk_user_ptr(ptr);\
if (!is_kernel_addr((unsigned long)__gu_addr))  \
might_fault();  \
+   barrier_nospec();   \
__get_user_size(__gu_val, __gu_addr, (size), __gu_err); \
(x) = (__typeof__(*(ptr)))__gu_val; \
__gu_err;   \
@@ -263,8 +264,10 @@ do {								\
unsigned long  __gu_val = 0;\
const __typeof__(*(ptr)) __user *__gu_addr = (ptr); \
might_fault();  \
-   if (access_ok(VERIFY_READ, __gu_addr, (size)))  \
+   if (access_ok(VERIFY_READ, __gu_addr, (size))) {\
+   barrier_nospec();   \
__get_user_size(__gu_val, __gu_addr, (size), __gu_err); \
+   }   \
	(x) = (__force __typeof__(*(ptr)))__gu_val;		\
__gu_err;   \
 })
@@ -275,6 +278,7 @@ do {								\
unsigned long __gu_val; \
const __typeof__(*(ptr)) __user *__gu_addr = (ptr); \
__chk_user_ptr(ptr);\
+   barrier_nospec();   \
__get_user_size(__gu_val, __gu_addr, (size), __gu_err); \
(x) = (__force __typeof__(*(ptr)))__gu_val; \
__gu_err;   \
@@ -302,15 +306,19 @@ static inline unsigned long raw_copy_from_user(void *to,
 
switch (n) {
case 1:
+   barrier_nospec();
__get_user_size(*(u8 *)to, from, 1, ret);
break;
case 2:
+   barrier_nospec();
__get_user_size(*(u16 *)to, from, 2, ret);
break;
case 4:
+   barrier_nospec();
__get_user_size(*(u32 *)to, from, 4, ret);
break;
case 8:
+   barrier_nospec();
__get_user_size(*(u64 *)to, from, 8, ret);
break;
}
@@ -318,6 +326,7 @@ static inline unsigned long raw_copy_from_user(void *to,
return 0;
}
 
+   barrier_nospec();
return __copy_tofrom_user((__force void __user *)to, from, n);
 }
 
-- 
2.14.1



[PATCH 4/6] powerpc/64s: Enable barrier_nospec based on firmware settings

2018-04-23 Thread Michael Ellerman
From: Michal Suchanek 

Check what firmware told us and enable/disable the barrier_nospec as
appropriate.

We err on the side of enabling the barrier, as it's a no-op on older
systems; see the comment for more detail.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/setup.h   |  1 +
 arch/powerpc/kernel/security.c | 60 ++
 arch/powerpc/platforms/powernv/setup.c |  1 +
 arch/powerpc/platforms/pseries/setup.c |  1 +
 4 files changed, 63 insertions(+)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index 4335cddc1cf2..aeb175e8a525 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -52,6 +52,7 @@ enum l1d_flush_type {
 
 void setup_rfi_flush(enum l1d_flush_type, bool enable);
 void do_rfi_flush_fixups(enum l1d_flush_type types);
+void setup_barrier_nospec(void);
 void do_barrier_nospec_fixups(bool enable);
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index b963eae0b0a0..d1b9639e5e24 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 
@@ -22,6 +23,65 @@ static void enable_barrier_nospec(bool enable)
do_barrier_nospec_fixups(enable);
 }
 
+void setup_barrier_nospec(void)
+{
+   bool enable;
+
+   /*
+* It would make sense to check SEC_FTR_SPEC_BAR_ORI31 below as well.
+* But there's a good reason not to. The two flags we check below are
+* both enabled by default in the kernel, so if the hcall is not
+* functional they will be enabled.
+* On a system where the host firmware has been updated (so the ori
+* functions as a barrier), but on which the hypervisor (KVM/Qemu) has
+* not been updated, we would like to enable the barrier. Dropping the
+* check for SEC_FTR_SPEC_BAR_ORI31 achieves that. The only downside is
+* we potentially enable the barrier on systems where the host firmware
+* is not updated, but that's harmless as it's a no-op.
+*/
+   enable = security_ftr_enabled(SEC_FTR_FAVOUR_SECURITY) &&
+security_ftr_enabled(SEC_FTR_BNDS_CHK_SPEC_BAR);
+
+   enable_barrier_nospec(enable);
+}
+
+#ifdef CONFIG_DEBUG_FS
+static int barrier_nospec_set(void *data, u64 val)
+{
+   switch (val) {
+   case 0:
+   case 1:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   if (!!val == !!barrier_nospec_enabled)
+   return 0;
+
+   enable_barrier_nospec(!!val);
+
+   return 0;
+}
+
+static int barrier_nospec_get(void *data, u64 *val)
+{
+   *val = barrier_nospec_enabled ? 1 : 0;
+   return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_barrier_nospec,
+   barrier_nospec_get, barrier_nospec_set, "%llu\n");
+
+static __init int barrier_nospec_debugfs_init(void)
+{
+   debugfs_create_file("barrier_nospec", 0600, powerpc_debugfs_root, NULL,
+   &fops_barrier_nospec);
+   return 0;
+}
+device_initcall(barrier_nospec_debugfs_init);
+#endif /* CONFIG_DEBUG_FS */
+
 ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, char *buf)
 {
bool thread_priv;
diff --git a/arch/powerpc/platforms/powernv/setup.c 
b/arch/powerpc/platforms/powernv/setup.c
index ef8c9ce53a61..e2ca5f77a55f 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -124,6 +124,7 @@ static void pnv_setup_rfi_flush(void)
  security_ftr_enabled(SEC_FTR_L1D_FLUSH_HV));
 
setup_rfi_flush(type, enable);
+   setup_barrier_nospec();
 }
 
 static void __init pnv_setup_arch(void)
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index b55ad4286dc7..63b1f0d10ef0 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -534,6 +534,7 @@ void pseries_setup_rfi_flush(void)
 security_ftr_enabled(SEC_FTR_L1D_FLUSH_PR);
 
setup_rfi_flush(types, enable);
+   setup_barrier_nospec();
 }
 
 #ifdef CONFIG_PCI_IOV
-- 
2.14.1
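
Assuming powerpc_debugfs_root is the usual /sys/kernel/debug/powerpc, the new
knob can be exercised at runtime roughly as follows (illustration only,
written as a comment since it is a shell interface):

    /*
     * # cat /sys/kernel/debug/powerpc/barrier_nospec      -> prints 0 or 1
     * # echo 0 > /sys/kernel/debug/powerpc/barrier_nospec -> patch in nops
     *
     * barrier_nospec_set() rejects anything other than 0 or 1 with -EINVAL,
     * and returns early when the requested state already matches, so
     * re-patching only happens on a real transition.
     */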



[PATCH 3/6] powerpc/64s: Patch barrier_nospec in modules

2018-04-23 Thread Michael Ellerman
From: Michal Suchanek 

Note that unlike RFI, which is patched only in the kernel image, the nospec
state of a module reflects the settings at the time the module was loaded.

Iterating all modules and re-patching every time the settings change
is not implemented.

Based on lwsync patching.

Signed-off-by: Michal Suchanek 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/setup.h  |  6 ++
 arch/powerpc/kernel/module.c  |  6 ++
 arch/powerpc/lib/feature-fixups.c | 16 +---
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index afc7280cce3b..4335cddc1cf2 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -54,6 +54,12 @@ void setup_rfi_flush(enum l1d_flush_type, bool enable);
 void do_rfi_flush_fixups(enum l1d_flush_type types);
 void do_barrier_nospec_fixups(bool enable);
 
+#ifdef CONFIG_PPC_BOOK3S_64
+void do_barrier_nospec_fixups_range(bool enable, void *start, void *end);
+#else
+static inline void do_barrier_nospec_fixups_range(bool enable, void *start, void *end) { };
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_SETUP_H */
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 3f7ba0f5bf29..a72698cd3dd0 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -72,6 +72,12 @@ int module_finalize(const Elf_Ehdr *hdr,
do_feature_fixups(powerpc_firmware_features,
  (void *)sect->sh_addr,
  (void *)sect->sh_addr + sect->sh_size);
+
+   sect = find_section(hdr, sechdrs, "__spec_barrier_fixup");
+   if (sect != NULL)
+   do_barrier_nospec_fixups_range(true,
+ (void *)sect->sh_addr,
+ (void *)sect->sh_addr + sect->sh_size);
 #endif
 
sect = find_section(hdr, sechdrs, "__lwsync_fixup");
diff --git a/arch/powerpc/lib/feature-fixups.c 
b/arch/powerpc/lib/feature-fixups.c
index 093c1d2ea5fd..3b37529f82f8 100644
--- a/arch/powerpc/lib/feature-fixups.c
+++ b/arch/powerpc/lib/feature-fixups.c
@@ -163,14 +163,14 @@ void do_rfi_flush_fixups(enum l1d_flush_type types)
: "unknown");
 }
 
-void do_barrier_nospec_fixups(bool enable)
+void do_barrier_nospec_fixups_range(bool enable, void *fixup_start, void *fixup_end)
 {
unsigned int instr, *dest;
long *start, *end;
int i;
 
-   start = PTRRELOC(&__start___barrier_nospec_fixup),
-   end = PTRRELOC(&__stop___barrier_nospec_fixup);
+   start = fixup_start;
+   end = fixup_end;
 
 instr = 0x60000000; /* nop */
 
@@ -189,6 +189,16 @@ void do_barrier_nospec_fixups(bool enable)
printk(KERN_DEBUG "barrier-nospec: patched %d locations\n", i);
 }
 
+void do_barrier_nospec_fixups(bool enable)
+{
+   void *start, *end;
+
+   start = PTRRELOC(&__start___barrier_nospec_fixup),
+   end = PTRRELOC(&__stop___barrier_nospec_fixup);
+
+   do_barrier_nospec_fixups_range(enable, start, end);
+}
+
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 void do_lwsync_fixups(unsigned long value, void *fixup_start, void *fixup_end)
-- 
2.14.1



[PATCH 2/6] powerpc/64s: Add support for ori barrier_nospec patching

2018-04-23 Thread Michael Ellerman
From: Michal Suchanek 

Based on the RFI patching. This is required to be able to disable the
speculation barrier.

Only one barrier type is supported and it does nothing when the
firmware does not enable it. Also, re-patching modules is not supported,
so the only meaningful thing that can be done is patching out the
speculation barrier at boot when the user says it is not wanted.

Signed-off-by: Michal Suchanek 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/barrier.h|  2 +-
 arch/powerpc/include/asm/feature-fixups.h |  9 +
 arch/powerpc/include/asm/setup.h  |  1 +
 arch/powerpc/kernel/security.c|  9 +
 arch/powerpc/kernel/vmlinux.lds.S |  7 +++
 arch/powerpc/lib/feature-fixups.c | 27 +++
 6 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/barrier.h 
b/arch/powerpc/include/asm/barrier.h
index e582d2c88092..f67b3f6e36be 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -81,7 +81,7 @@ do {								\
  * Prevent execution of subsequent instructions until preceding branches have
  * been fully resolved and are no longer executing speculatively.
  */
-#define barrier_nospec_asm ori 31,31,0
+#define barrier_nospec_asm NOSPEC_BARRIER_FIXUP_SECTION; nop
 
 // This also acts as a compiler barrier due to the memory clobber.
 #define barrier_nospec() asm (stringify_in_c(barrier_nospec_asm) ::: "memory")
diff --git a/arch/powerpc/include/asm/feature-fixups.h 
b/arch/powerpc/include/asm/feature-fixups.h
index 1e82eb3caabd..86ac59e75f36 100644
--- a/arch/powerpc/include/asm/feature-fixups.h
+++ b/arch/powerpc/include/asm/feature-fixups.h
@@ -195,11 +195,20 @@ label##3: \
FTR_ENTRY_OFFSET 951b-952b; \
.popsection;
 
+#define NOSPEC_BARRIER_FIXUP_SECTION   \
+953:   \
+   .pushsection __barrier_nospec_fixup,"a";\
+   .align 2;   \
+954:   \
+   FTR_ENTRY_OFFSET 953b-954b; \
+   .popsection;
+
 
 #ifndef __ASSEMBLY__
 #include 
 
 extern long __start___rfi_flush_fixup, __stop___rfi_flush_fixup;
+extern long __start___barrier_nospec_fixup, __stop___barrier_nospec_fixup;
 
 void apply_feature_fixups(void);
 void setup_feature_keys(void);
diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index 27fa52ed6d00..afc7280cce3b 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -52,6 +52,7 @@ enum l1d_flush_type {
 
 void setup_rfi_flush(enum l1d_flush_type, bool enable);
 void do_rfi_flush_fixups(enum l1d_flush_type types);
+void do_barrier_nospec_fixups(bool enable);
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index bab5a27ea805..b963eae0b0a0 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -9,10 +9,19 @@
 #include 
 
 #include 
+#include 
 
 
 unsigned long powerpc_security_features __read_mostly = SEC_FTR_DEFAULT;
 
+static bool barrier_nospec_enabled;
+
+static void enable_barrier_nospec(bool enable)
+{
+   barrier_nospec_enabled = enable;
+   do_barrier_nospec_fixups(enable);
+}
+
 ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, char *buf)
 {
bool thread_priv;
diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index c8af90ff49f0..ff73f498568c 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -139,6 +139,13 @@ SECTIONS
*(__rfi_flush_fixup)
__stop___rfi_flush_fixup = .;
}
+
+   . = ALIGN(8);
+   __spec_barrier_fixup : AT(ADDR(__spec_barrier_fixup) - LOAD_OFFSET) {
+   __start___barrier_nospec_fixup = .;
+   *(__barrier_nospec_fixup)
+   __stop___barrier_nospec_fixup = .;
+   }
 #endif
 
EXCEPTION_TABLE(0)
diff --git a/arch/powerpc/lib/feature-fixups.c 
b/arch/powerpc/lib/feature-fixups.c
index 288fe4f0db4e..093c1d2ea5fd 100644
--- a/arch/powerpc/lib/feature-fixups.c
+++ b/arch/powerpc/lib/feature-fixups.c
@@ -162,6 +162,33 @@ void do_rfi_flush_fixups(enum l1d_flush_type types)
(types &  L1D_FLUSH_MTTRIG) ? "mttrig type"
: "unknown");
 }
+
+void do_barrier_nospec_fixups(bool enable)
+{
+   unsigned int instr, *dest;
+   long *start, *end;
+   int i;
+
+   start = PTRRELOC(&__start___barrier_nospec_fixup),
+   end = PTRRELOC(&__stop___barrier_nospec_fixup);
+
+   instr = 0x60000000; /* nop */
+
+   if 
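
(The message is truncated here in the archive.) For orientation, a sketch of
how offset-based fixup entries like these are typically walked, assuming the
same pattern as the lwsync fixups this code is modelled on (see
do_lwsync_fixups() in the same file):

    /*
     * Each long in the __barrier_nospec_fixup section stores the offset from
     * the entry itself to the instruction to patch, so the site is recovered
     * by adding the entry's address to its contents.
     */
    for (i = 0; start < end; start++, i++) {
            dest = (void *)start + *start;   /* entry + stored offset */
            patch_instruction(dest, instr);  /* nop, or the barrier */
    }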

[PATCH 1/6] powerpc/64s: Add barrier_nospec

2018-04-23 Thread Michael Ellerman
From: Michal Suchanek 

A no-op form of ori (an OR-immediate of 0 into r31, storing the result
back in r31) has been re-tasked as a speculation barrier. The instruction
only acts as a barrier on newer machines with appropriate firmware
support. On older CPUs it remains a harmless no-op.

Implement barrier_nospec using this instruction.

mpe: The semantics of the instruction are believed to be that it
prevents execution of subsequent instructions until preceding branches
have been fully resolved and are no longer executing speculatively.
There is no further documentation available at this time.

Signed-off-by: Michal Suchanek 
Signed-off-by: Michael Ellerman 
---
mpe: Make it Book3S64 only, update comment & change log, add a
 memory clobber to the asm.
---
 arch/powerpc/include/asm/barrier.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/powerpc/include/asm/barrier.h 
b/arch/powerpc/include/asm/barrier.h
index c7c63959ba91..e582d2c88092 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -76,6 +76,21 @@ do {								\
___p1;  \
 })
 
+#ifdef CONFIG_PPC_BOOK3S_64
+/*
+ * Prevent execution of subsequent instructions until preceding branches have
+ * been fully resolved and are no longer executing speculatively.
+ */
+#define barrier_nospec_asm ori 31,31,0
+
+// This also acts as a compiler barrier due to the memory clobber.
+#define barrier_nospec() asm (stringify_in_c(barrier_nospec_asm) ::: "memory")
+
+#else /* !CONFIG_PPC_BOOK3S_64 */
+#define barrier_nospec_asm
+#define barrier_nospec()
+#endif
+
 #include 
 
 #endif /* _ASM_POWERPC_BARRIER_H */
-- 
2.14.1
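
For reference, the raw encodings involved; the ori form follows from the
Power ISA encoding of ori (opcode 24, rS = rA = 31, UI = 0), and the macro
names here are illustrative, not taken from the patch:

    #define PPC_INST_NOP            0x60000000      /* ori 0,0,0   */
    #define PPC_INST_SPEC_BARRIER   0x63ff0000      /* ori 31,31,0 */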



Re: [PATCH RFC 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space

2018-04-23 Thread David Gibson
On Tue, Apr 24, 2018 at 01:19:15PM +1000, Sam Bobroff wrote:
> On Mon, Apr 23, 2018 at 11:06:35AM +0200, Cédric Le Goater wrote:
> > On 04/16/2018 06:09 AM, David Gibson wrote:
> > > On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
> > >> It is not currently possible to create the full number of possible
> > >> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
> > >> threads per core than its core stride (or "VSMT mode"). This is
> > >> because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
> > >> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
> > >>
> > >> To address this, "pack" the VCORE ID and XIVE offsets by using
> > >> knowledge of the way the VCPU IDs will be used when there are fewer
> > >> guest threads per core than the core stride. The primary thread of
> > >> each core will always be used first. Then, if the guest uses more than
> > >> one thread per core, these secondary threads will sequentially follow
> > >> the primary in each core.
> > >>
> > >> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
> > >> VCPUs are being spaced apart, so at least half of each core is empty
> > >> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> > >> into the second half of each core (4..7, in an 8-thread core).
> > >>
> > >> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> > >> each core is being left empty, and we can map down into the second and
> > >> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
> > >>
> > >> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> > >> threads are being used and 7/8 of the core is empty, allowing use of
> > >> the 1, 3, 5 and 7 thread slots.
> > >>
> > >> (Strides less than 8 are handled similarly.)
> > >>
> > >> This allows the VCORE ID or offset to be calculated quickly from the
> > >> VCPU ID or XIVE server numbers, without access to the VCPU structure.
> > >>
> > >> Signed-off-by: Sam Bobroff 
> > >> ---
> > >> Hello everyone,
> > >>
> > >> I've tested this on P8 and P9, in lots of combinations of host and guest
> > >> threading modes and it has been fine but it does feel like a "tricky"
> > >> approach, so I still feel somewhat wary about it.
> > 
> > Have you done any migration ? 
> 
> No, but I will :-)
> 
> > >> I've posted it as an RFC because I have not tested it with guest 
> > >> native-XIVE,
> > >> and I suspect that it will take some work to support it.
> > 
> > The KVM XIVE device will be different for XIVE exploitation mode, same 
> > structures 
> > though. I will send a patchset shortly. 
> 
> Great. This is probably where conflicts between the host and guest
> numbers will show up. (See dwg's question below.)
> 
> > >>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++
> > >>  arch/powerpc/kvm/book3s_hv.c  | 14 ++
> > >>  arch/powerpc/kvm/book3s_xive.c|  9 +++--
> > >>  3 files changed, 36 insertions(+), 6 deletions(-)
> > >>
> > >> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> > >> b/arch/powerpc/include/asm/kvm_book3s.h
> > >> index 376ae803b69c..1295056d564a 100644
> > >> --- a/arch/powerpc/include/asm/kvm_book3s.h
> > >> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> > >> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
> > >>  #define SPLIT_HACK_MASK 0xff000000
> > >>  #define SPLIT_HACK_OFFS 0xfb000000
> > >>  
> > >> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
> > >> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core 
> > >> stride
> > >> + * (but not its actual threading mode, which is not available) to avoid
> > >> + * collisions.
> > >> + */
> > >> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
> > >> +{
> > >> +const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
> > > 
> > > I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
> > > roughly the same thing, but I think makes the pattern more obvious.
> 
> OK.
> 
> > >> +int stride = kvm->arch.emul_smt_mode > 1 ?
> > >> + kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
> > > 
> > > AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
> > > always be 1 when this is called, so the conditional here doesn't seem
> > > useful.
> 
> Ah yes, right. (That was an older version when I was thinking of using
> it for P8 as well but that didn't seem to be a good idea.)
> 
> > >> +int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
> > >> +u32 packed_id;
> > >> +
> > >> +BUG_ON(block >= MAX_SMT_THREADS);
> > >> +packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
> > >> +BUG_ON(packed_id >= KVM_MAX_VCPUS);
> > >> +return packed_id;
> > >> +}
> > > 
> > > It took me a while to wrap my head around the packing 
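
(The message is truncated here in the archive.) A standalone rendition of the
packing function for experimentation; this is a sketch with a shrunk
KVM_MAX_VCPUS, not the kernel code itself:

    #include <stdio.h>

    #define MAX_SMT_THREADS 8
    #define KVM_MAX_VCPUS   16      /* shrunk for illustration */

    static unsigned int pack_vcpu_id(unsigned int id, int stride)
    {
            const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
            int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);

            return (id % KVM_MAX_VCPUS) + block_offsets[block];
    }

    int main(void)
    {
            /* stride 8, 1 thread/core: IDs 0,8,16,24 pack to 0,8,4,12,
             * i.e. IDs above KVM_MAX_VCPUS land in the unused second half
             * of each core, as the commit message describes. */
            for (unsigned int id = 0; id < 2 * KVM_MAX_VCPUS; id += 8)
                    printf("vcpu id %2u -> packed %2u\n", id,
                           pack_vcpu_id(id, 8));
            return 0;
    }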

Re: [1/2] powernv/npu: Do a PID GPU TLB flush when invalidating a large address range

2018-04-23 Thread Michael Ellerman
On Tue, 2018-04-17 at 09:11:28 UTC, Alistair Popple wrote:
> The NPU has a limited number of address translation shootdown (ATSD)
> registers and the GPU has limited bandwidth to process ATSDs. This can
> result in contention of ATSD registers leading to soft lockups on some
> threads, particularly when invalidating a large address range in
> pnv_npu2_mn_invalidate_range().
> 
> At some threshold it becomes more efficient to flush the entire GPU TLB for
> the given MM context (PID) than individually flushing each address in the
> range. This patch will result in ranges greater than 2MB being converted
> from 32+ ATSDs into a single ATSD which will flush the TLB for the given
> PID on each GPU.
> 
> Signed-off-by: Alistair Popple 
> Acked-by: Balbir Singh 
> Tested-by: Balbir Singh 

Patch 1 applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/d0cf9b561ca97d5245bb9e0c4774b7

cheers


Re: [1/2] powernv/npu: Add lock to prevent race in concurrent context init/destroy

2018-04-23 Thread Michael Ellerman
On Wed, 2018-04-11 at 06:38:54 UTC, Alistair Popple wrote:
> The pnv_npu2_init_context() and pnv_npu2_destroy_context() functions are
> used to allocate/free contexts to allow address translation and shootdown
> by the NPU on a particular GPU. Context initialisation is implicitly safe
> as it is protected by the requirement that mmap_sem be held in write mode,
> however pnv_npu2_destroy_context() does not require mmap_sem to be held and
> it is not safe to call with a concurrent initialisation for a different
> GPU.
> 
> It was assumed the driver would ensure destruction was not called
> concurrently with initialisation. However the driver may be simplified by
> allowing concurrent initialisation and destruction for different GPUs. As
> npu context creation/destruction is not a performance critical path and the
> critical section is not large, a single spinlock is used for simplicity.
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Mark Hairgrove 
> Tested-by: Mark Hairgrove 
> Reviewed-by: Balbir Singh 

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/28a5933e8d362766462ea9e5f135e1

cheers


Re: [1/2] powerpc/mm: Flush cache on memory hot(un)plug

2018-04-23 Thread Michael Ellerman
On Fri, 2018-04-06 at 05:24:23 UTC, Balbir Singh wrote:
> This patch adds support for flushing potentially dirty
> cache lines when memory is hot-plugged/hot-un-plugged.
> The support is currently limited to 64 bit systems.
> 
> The bug was exposed when mappings for a device were
> actually hot-unplugged and plugged in back later.
> A similar issue was observed during the development
> of memtrace, but memtrace does its own flushing of
> region via a custom routine.
> 
> These patches do a flush both on hotplug/unplug to
> clear any stale data in the cache w.r.t. mappings;
> there is a small race window where a clean cache
> line may be created again just prior to tearing
> down the mapping.
> 
> The patches were tested by disabling the flush
> routines in memtrace and doing I/O on the trace
> file. The system immediately checkstops (quite
> reliably if, prior to the hot-unplug of the memtrace
> region, we memset the regions we are about to
> hot unplug). After these patches no custom flushing
> is needed in the memtrace code.
> 
> Signed-off-by: Balbir Singh 
> Acked-by: Reza Arbab 
> Reviewed-by: Rashmica Gupta 

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/fb5924fddf9ee31db04da7ad4e8c34

cheers


Re: [PATCH RFC 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space

2018-04-23 Thread Sam Bobroff
On Mon, Apr 23, 2018 at 11:06:35AM +0200, Cédric Le Goater wrote:
> On 04/16/2018 06:09 AM, David Gibson wrote:
> > On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
> >> It is not currently possible to create the full number of possible
> >> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
> >> threads per core than its core stride (or "VSMT mode"). This is
> >> because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
> >> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
> >>
> >> To address this, "pack" the VCORE ID and XIVE offsets by using
> >> knowledge of the way the VCPU IDs will be used when there are fewer
> >> guest threads per core than the core stride. The primary thread of
> >> each core will always be used first. Then, if the guest uses more than
> >> one thread per core, these secondary threads will sequentially follow
> >> the primary in each core.
> >>
> >> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
> >> VCPUs are being spaced apart, so at least half of each core is empty
> >> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> >> into the second half of each core (4..7, in an 8-thread core).
> >>
> >> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> >> each core is being left empty, and we can map down into the second and
> >> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
> >>
> >> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> >> threads are being used and 7/8 of the core is empty, allowing use of
> >> the 1, 3, 5 and 7 thread slots.
> >>
> >> (Strides less than 8 are handled similarly.)
> >>
> >> This allows the VCORE ID or offset to be calculated quickly from the
> >> VCPU ID or XIVE server numbers, without access to the VCPU structure.
> >>
> >> Signed-off-by: Sam Bobroff 
> >> ---
> >> Hello everyone,
> >>
> >> I've tested this on P8 and P9, in lots of combinations of host and guest
> >> threading modes and it has been fine but it does feel like a "tricky"
> >> approach, so I still feel somewhat wary about it.
> 
> Have you done any migration ? 

No, but I will :-)

> >> I've posted it as an RFC because I have not tested it with guest 
> >> native-XIVE,
> >> and I suspect that it will take some work to support it.
> 
> The KVM XIVE device will be different for XIVE exploitation mode, same 
> structures 
> though. I will send a patchset shortly. 

Great. This is probably where conflicts between the host and guest
numbers will show up. (See dwg's question below.)

> >>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++
> >>  arch/powerpc/kvm/book3s_hv.c  | 14 ++
> >>  arch/powerpc/kvm/book3s_xive.c|  9 +++--
> >>  3 files changed, 36 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> >> b/arch/powerpc/include/asm/kvm_book3s.h
> >> index 376ae803b69c..1295056d564a 100644
> >> --- a/arch/powerpc/include/asm/kvm_book3s.h
> >> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> >> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
> >>  #define SPLIT_HACK_MASK   0xff000000
> >>  #define SPLIT_HACK_OFFS   0xfb000000
> >>  
> >> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
> >> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core 
> >> stride
> >> + * (but not its actual threading mode, which is not available) to avoid
> >> + * collisions.
> >> + */
> >> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
> >> +{
> >> +  const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
> > 
> > I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
> > roughly the same thing, but I think makes the pattern more obvious.

OK.

> >> +  int stride = kvm->arch.emul_smt_mode > 1 ?
> >> +   kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
> > 
> > AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
> > always be 1 when this is called, so the conditional here doesn't seem
> > useful.

Ah yes, right. (That was an older version when I was thinking of using
it for P8 as well but that didn't seem to be a good idea.)

> >> +  int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
> >> +  u32 packed_id;
> >> +
> >> +  BUG_ON(block >= MAX_SMT_THREADS);
> >> +  packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
> >> +  BUG_ON(packed_id >= KVM_MAX_VCPUS);
> >> +  return packed_id;
> >> +}
> > 
> > It took me a while to wrap my head around the packing function, but I
> > think I got there in the end.  It's pretty clever.

Thanks, I'll try to add a better description as well :-)

> > One thing bothers me, though.  This certainly packs things under
> > KVM_MAX_VCPUS, but not necessarily under the actual number of vcpus.
> > e.g. KVM_MAX_VCPUS==16, 8 vcpus total, stride 8, 2 

Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

2018-04-23 Thread Balbir Singh
On Mon, 2018-04-23 at 10:29 +0530, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar 
> 
> The current code extracts the physical address for UE errors and then
> hooks it up into the memory failure infrastructure. On successful extraction
> of the physical address it wrongly sets "handled = 1", which means this UE error
> has been recovered. Since the MCE handler gets the return value handled = 1, it
> assumes that the error has been recovered and goes back to the same NIP. This
> causes an MCE interrupt again and again in a loop, leading to a hard lockup.
> 
> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
> an undesired page to hwpoison.
> 
> Without this patch we see:
> [ 1476.541984] Severe Machine check interrupt [Recovered]
> [ 1476.541985]   NIP: [1002588c] PID: 7109 Comm: find
> [ 1476.541986]   Initiator: CPU
> [ 1476.541987]   Error type: UE [Load/Store]
> [ 1476.541988] Effective address: 7fffd2755940
> [ 1476.541989] Physical address:  20181a08
> [...]
> [ 1476.542003] Severe Machine check interrupt [Recovered]
> [ 1476.542004]   NIP: [1002588c] PID: 7109 Comm: find
> [ 1476.542005]   Initiator: CPU
> [ 1476.542006]   Error type: UE [Load/Store]
> [ 1476.542006] Effective address: 7fffd2755940
> [ 1476.542007] Physical address:  20181a08
> [ 1476.542010] Severe Machine check interrupt [Recovered]
> [ 1476.542012]   NIP: [1002588c] PID: 7109 Comm: find
> [ 1476.542013]   Initiator: CPU
> [ 1476.542014]   Error type: UE [Load/Store]
> [ 1476.542015] Effective address: 7fffd2755940
> [ 1476.542016] Physical address:  20181a08
> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU 
> page: Recovered
> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
> [...]
> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
> 
> After this patch we see:
> 
> [  325.384336] Severe Machine check interrupt [Not recovered]
> [  325.384338]   NIP: [7fffaae585f4] PID: 7168 Comm: find
> [  325.384339]   Initiator: CPU
> [  325.384341]   Error type: UE [Load/Store]
> [  325.384343] Effective address: 7fffaafe28ac
> [  325.384345] Physical address:  2017c0bd
> [  325.384350] find[7168]: unhandled signal 7 at 7fffaae585f4 nip 
> 7fffaae585f4 lr 7fffaae585e0 code 4
> [  325.388574] Memory failure: 0x2017c0bd: recovery action for dirty LRU 
> page: Recovered
> 
> Fixes: 01eaac2b0591 ("powerpc/mce: Hookup ierror (instruction) UE errors")
> Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/kernel/mce_power.c |8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index fe6fc63251fe..63b58ae5d601 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -441,7 +441,6 @@ static int mce_handle_ierror(struct pt_regs *regs,
>   if (pfn != ULONG_MAX) {
>   *phys_addr =
>   (pfn << PAGE_SHIFT);
> - handled = 1;
>   }
>   }
>   }
> @@ -532,9 +531,8 @@ static int mce_handle_derror(struct pt_regs *regs,
>* kernel/exception-64s.h
>*/
>   if (get_paca()->in_mce < MAX_MCE_DEPTH)
> - if (!mce_find_instr_ea_and_pfn(regs, addr,
> - phys_addr))
> - handled = 1;
> + mce_find_instr_ea_and_pfn(regs, addr,
> + phys_addr);
>   }
>   found = 1;
>   }
> @@ -572,7 +570,7 @@ static long mce_handle_error(struct pt_regs *regs,
>   const struct mce_ierror_table itable[])
>  {
>   struct mce_error_info mce_err = { 0 };
> - uint64_t addr, phys_addr;
> + uint64_t addr, phys_addr = ULONG_MAX;
>   uint64_t srr1 = regs->msr;
>   long handled;
>  
> 

Reviewed-by: Balbir Singh 



Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

2018-04-23 Thread Balbir Singh
On Mon, 2018-04-23 at 23:01 +1000, Nicholas Piggin wrote:
> On Mon, 23 Apr 2018 21:14:12 +1000
> Balbir Singh  wrote:
> 
> > On Mon, Apr 23, 2018 at 8:33 PM, Mahesh Jagannath Salgaonkar
> >  wrote:
> > > On 04/23/2018 12:21 PM, Balbir Singh wrote:  
> > > > On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
> > > >  wrote:  
> > > > > From: Mahesh Salgaonkar 
> > > > > 
> > > > > The current code extracts the physical address for UE errors and then
> > > > > hooks it up into the memory failure infrastructure. On successful
> > > > > extraction of the physical address it wrongly sets "handled = 1",
> > > > > which means this UE error has been recovered. Since the MCE handler
> > > > > gets the return value handled = 1, it assumes that the error has been
> > > > > recovered and goes back to the same NIP. This causes an MCE interrupt
> > > > > again and again in a loop, leading to a hard lockup.
> > > > > 
> > > > > Also, initialize phys_addr to ULONG_MAX so that we don't end up 
> > > > > queuing
> > > > > an undesired page to hwpoison.
> > > > > 
> > > > > Without this patch we see:
> > > > > [ 1476.541984] Severe Machine check interrupt [Recovered]
> > > > > [ 1476.541985]   NIP: [1002588c] PID: 7109 Comm: find
> > > > > [ 1476.541986]   Initiator: CPU
> > > > > [ 1476.541987]   Error type: UE [Load/Store]
> > > > > [ 1476.541988] Effective address: 7fffd2755940
> > > > > [ 1476.541989] Physical address:  20181a08
> > > > > [...]
> > > > > [ 1476.542003] Severe Machine check interrupt [Recovered]
> > > > > [ 1476.542004]   NIP: [1002588c] PID: 7109 Comm: find
> > > > > [ 1476.542005]   Initiator: CPU
> > > > > [ 1476.542006]   Error type: UE [Load/Store]
> > > > > [ 1476.542006] Effective address: 7fffd2755940
> > > > > [ 1476.542007] Physical address:  20181a08
> > > > > [ 1476.542010] Severe Machine check interrupt [Recovered]
> > > > > [ 1476.542012]   NIP: [1002588c] PID: 7109 Comm: find
> > > > > [ 1476.542013]   Initiator: CPU
> > > > > [ 1476.542014]   Error type: UE [Load/Store]
> > > > > [ 1476.542015] Effective address: 7fffd2755940
> > > > > [ 1476.542016] Physical address:  20181a08
> > > > > [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty 
> > > > > LRU page: Recovered
> > > > > [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [...]
> > > > > [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
> > > > > 
> > > > > After this patch we see:
> > > > > 
> > > > > [  325.384336] Severe Machine check interrupt [Not recovered]  
> > > > 
> > > > How did you test for this?  
> > > 
> > > By injecting cache SUE using L2 FIR register (0x1001080c).
> > >  
> > > > If the error was recovered, shouldn't the
> > > > process have gotten
> > > > a SIGBUS and we should have prevented further access as a part of the 
> > > > handling
> > > > (memory_failure()). Do we just need a MF_MUST_KILL in the flags?  
> > > 
> > > We hook it up to memory_failure() through a work queue, and by the time
> > > the work queue kicks in, the application continues to restart and hit the
> > > same NIP again and again. Every MCE again hooks the same address to the
> > > memory failure work queue and throws multiple recovered MCE messages for
> > > the same address. Once memory_failure() hwpoisons the page, the
> > > application gets SIGBUS and then we are fine.
> > >  
> > 
> > That seems quite broken, and "not recovered" is very confusing. So effectively
> > we can never recover from a MCE UE. I think we need a notion of delayed
> > recovery then? Where we do recover, but mark it as recovered with delays?
> > We might want to revisit our recovery process and see if the recovery
> > requires turning the MMU on, but that is for later, I suppose.
> 
> The notion of being handled in the machine check return value is not
> whether the failing resource is later de-allocated or fixed, but if
> *this* particular exception was able to be corrected / processing
> resume as normal without further action.
> 
> The MCE UE is not recovered just by finding its address here, so I think
> Mahesh's patch is right.
> 

OK. It would be nice to see a "recovered" in the output as opposed to the
process being killed, but the MCE is "not recovered"? The kernel bits do sound sane.

Balbir Singh.



Re: [PATCH v1 1/1] misc: IBM Virtual Management Channel Driver

2018-04-23 Thread Randy Dunlap
On 04/23/18 12:53, Greg KH wrote:
> On Mon, Apr 23, 2018 at 11:38:18AM -0700, Randy Dunlap wrote:
>> On 04/23/18 07:46, Bryant G. Ly wrote:
>>> This driver is a logical device which provides an
>>> interface between the hypervisor and a management
>>> partition.
>>>
>>> This driver is to be used for the POWER Virtual
>>> Management Channel Virtual Adapter on the PowerVM
>>> platform. It provides both request/response and
>>> async message support through the /dev/ibmvmc node.
>>>
>>> Signed-off-by: Bryant G. Ly 
>>> Reviewed-by: Steven Royer 
>>> Reviewed-by: Adam Reznechek 
>>> Tested-by: Taylor Jakobson 
>>> Tested-by: Brad Warrum 
>>> Cc: Greg Kroah-Hartman 
>>> Cc: Arnd Bergmann 
>>> Cc: Benjamin Herrenschmidt 
>>> Cc: Michael Ellerman 
>>> ---
>>>  Documentation/ioctl/ioctl-number.txt  |1 +
>>>  Documentation/misc-devices/ibmvmc.txt |  161 +++
>>>  MAINTAINERS   |6 +
>>>  arch/powerpc/include/asm/hvcall.h |1 +
>>>  drivers/misc/Kconfig  |   14 +
>>>  drivers/misc/Makefile |1 +
>>>  drivers/misc/ibmvmc.c | 2415 +
>>>  drivers/misc/ibmvmc.h |  209 +++
>>>  8 files changed, 2808 insertions(+)
>>>  create mode 100644 Documentation/misc-devices/ibmvmc.txt
>>>  create mode 100644 drivers/misc/ibmvmc.c
>>>  create mode 100644 drivers/misc/ibmvmc.h
>>
>>> diff --git a/Documentation/misc-devices/ibmvmc.txt 
>>> b/Documentation/misc-devices/ibmvmc.txt
>>> new file mode 100644
>>> index 000..bae1064
>>> --- /dev/null
>>> +++ b/Documentation/misc-devices/ibmvmc.txt
> 
> Aren't we doing new documentation in .rst format instead of .txt?

I am not aware that .rst format is a *requirement* for new documentation.

??

-- 
~Randy


Re: [PATCH v2] powerpc/mm/radix: use do/while(0) trick for single statement block

2018-04-23 Thread Randy Dunlap
On 04/23/18 12:36, Mathieu Malaterre wrote:
> In commit 7a22d6321c3d ("powerpc/mm/radix: Update command line parsing for
> disable_radix") an `if` statement was added for a possible empty body
> (prom_debug).
> 
> Fix the following warning, treated as error with W=1:
> 
>   arch/powerpc/kernel/prom_init.c:656:46: error: suggest braces around empty body in an ‘if’ statement [-Werror=empty-body]
> 
> Suggested-by: Randy Dunlap 
> Signed-off-by: Mathieu Malaterre 

Acked-by: Randy Dunlap 

Thanks.

> ---
> v2: update macro directly
> 
>  arch/powerpc/kernel/prom_init.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 5ae153b97d0a..7edf3aa5bc6d 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -103,7 +103,7 @@ int of_workarounds;
>  #ifdef DEBUG_PROM
>  #define prom_debug(x...) prom_printf(x)
>  #else
> -#define prom_debug(x...)
> +#define prom_debug(x...) do { } while (0)
>  #endif
>  
>  
> 


-- 
~Randy


Re: [PATCH v1 1/1] misc: IBM Virtual Management Channel Driver

2018-04-23 Thread Bryant G. Ly
On 4/23/18 2:53 PM, Greg KH wrote:

> On Mon, Apr 23, 2018 at 11:38:18AM -0700, Randy Dunlap wrote:
>> On 04/23/18 07:46, Bryant G. Ly wrote:
>>> This driver is a logical device which provides an
>>> interface between the hypervisor and a management
>>> partition.
>>>
>>> This driver is to be used for the POWER Virtual
>>> Management Channel Virtual Adapter on the PowerVM
>>> platform. It provides both request/response and
>>> async message support through the /dev/ibmvmc node.
>>>
>>> Signed-off-by: Bryant G. Ly 
>>> Reviewed-by: Steven Royer 
>>> Reviewed-by: Adam Reznechek 
>>> Tested-by: Taylor Jakobson 
>>> Tested-by: Brad Warrum 
>>> Cc: Greg Kroah-Hartman 
>>> Cc: Arnd Bergmann 
>>> Cc: Benjamin Herrenschmidt 
>>> Cc: Michael Ellerman 
>>> ---
>>>  Documentation/ioctl/ioctl-number.txt  |1 +
>>>  Documentation/misc-devices/ibmvmc.txt |  161 +++
>>>  MAINTAINERS   |6 +
>>>  arch/powerpc/include/asm/hvcall.h |1 +
>>>  drivers/misc/Kconfig  |   14 +
>>>  drivers/misc/Makefile |1 +
>>>  drivers/misc/ibmvmc.c | 2415 +
>>>  drivers/misc/ibmvmc.h |  209 +++
>>>  8 files changed, 2808 insertions(+)
>>>  create mode 100644 Documentation/misc-devices/ibmvmc.txt
>>>  create mode 100644 drivers/misc/ibmvmc.c
>>>  create mode 100644 drivers/misc/ibmvmc.h
>>> diff --git a/Documentation/misc-devices/ibmvmc.txt 
>>> b/Documentation/misc-devices/ibmvmc.txt
>>> new file mode 100644
>>> index 000..bae1064
>>> --- /dev/null
>>> +++ b/Documentation/misc-devices/ibmvmc.txt
> Aren't we doing new documentation in .rst format instead of .txt?
>
> thanks,
>
> greg k-h
>
I will convert to .rst and fix the comments from Randy in the next version.

Thanks, 

Bryant



Re: [PATCH v1 1/1] misc: IBM Virtual Management Channel Driver

2018-04-23 Thread Greg KH
On Mon, Apr 23, 2018 at 11:38:18AM -0700, Randy Dunlap wrote:
> On 04/23/18 07:46, Bryant G. Ly wrote:
> > This driver is a logical device which provides an
> > interface between the hypervisor and a management
> > partition.
> > 
> > This driver is to be used for the POWER Virtual
> > Management Channel Virtual Adapter on the PowerVM
> > platform. It provides both request/response and
> > async message support through the /dev/ibmvmc node.
> > 
> > Signed-off-by: Bryant G. Ly 
> > Reviewed-by: Steven Royer 
> > Reviewed-by: Adam Reznechek 
> > Tested-by: Taylor Jakobson 
> > Tested-by: Brad Warrum 
> > Cc: Greg Kroah-Hartman 
> > Cc: Arnd Bergmann 
> > Cc: Benjamin Herrenschmidt 
> > Cc: Michael Ellerman 
> > ---
> >  Documentation/ioctl/ioctl-number.txt  |1 +
> >  Documentation/misc-devices/ibmvmc.txt |  161 +++
> >  MAINTAINERS   |6 +
> >  arch/powerpc/include/asm/hvcall.h |1 +
> >  drivers/misc/Kconfig  |   14 +
> >  drivers/misc/Makefile |1 +
> >  drivers/misc/ibmvmc.c | 2415 +
> >  drivers/misc/ibmvmc.h |  209 +++
> >  8 files changed, 2808 insertions(+)
> >  create mode 100644 Documentation/misc-devices/ibmvmc.txt
> >  create mode 100644 drivers/misc/ibmvmc.c
> >  create mode 100644 drivers/misc/ibmvmc.h
> 
> > diff --git a/Documentation/misc-devices/ibmvmc.txt 
> > b/Documentation/misc-devices/ibmvmc.txt
> > new file mode 100644
> > index 000..bae1064
> > --- /dev/null
> > +++ b/Documentation/misc-devices/ibmvmc.txt

Aren't we doing new documentation in .rst format instead of .txt?

thanks,

greg k-h


[PATCH] powerpc/wii: Make hlwd_pic_init function static

2018-04-23 Thread Mathieu Malaterre
The function hlwd_pic_init can be made static, so do it. Fix the following
warning treated as error (W=1):

../arch/powerpc/platforms/embedded6xx/hlwd-pic.c:158:20: error: no previous 
prototype for ‘hlwd_pic_init’ [-Werror=missing-prototypes]

Signed-off-by: Mathieu Malaterre 
---
 arch/powerpc/platforms/embedded6xx/hlwd-pic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/embedded6xx/hlwd-pic.c 
b/arch/powerpc/platforms/embedded6xx/hlwd-pic.c
index 89c54de88b7a..e3e3af73e9d8 100644
--- a/arch/powerpc/platforms/embedded6xx/hlwd-pic.c
+++ b/arch/powerpc/platforms/embedded6xx/hlwd-pic.c
@@ -155,7 +155,7 @@ static void __hlwd_quiesce(void __iomem *io_base)
out_be32(io_base + HW_BROADWAY_ICR, 0x);
 }
 
-struct irq_domain *hlwd_pic_init(struct device_node *np)
+static struct irq_domain *hlwd_pic_init(struct device_node *np)
 {
struct irq_domain *irq_domain;
struct resource res;
-- 
2.11.0
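
As an aside, a minimal standalone illustration of the warning class this
patch silences (a sketch, not kernel code; the names are made up):

	/* demo.c, built with: gcc -Wmissing-prototypes -c demo.c */
	static int used_locally(void)	/* internal linkage: no warning */
	{
		return 42;
	}

	int visible_elsewhere(void)	/* warns unless declared in a header */
	{
		return used_locally();
	}

Marking a function static both removes the warning and tells the compiler
that no other translation unit can call it.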



[PATCH v2] powerpc/mm/radix: use do/while(0) trick for single statement block

2018-04-23 Thread Mathieu Malaterre
In commit 7a22d6321c3d ("powerpc/mm/radix: Update command line parsing for
disable_radix") an `if` statement was added for a possible empty body
(prom_debug).

Fix the following warning, treated as error with W=1:

  arch/powerpc/kernel/prom_init.c:656:46: error: suggest braces around empty 
body in an ‘if’ statement [-Werror=empty-body]

Suggested-by: Randy Dunlap 
Signed-off-by: Mathieu Malaterre 
---
v2: update macro directly

 arch/powerpc/kernel/prom_init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 5ae153b97d0a..7edf3aa5bc6d 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -103,7 +103,7 @@ int of_workarounds;
 #ifdef DEBUG_PROM
 #define prom_debug(x...)   prom_printf(x)
 #else
-#define prom_debug(x...)
+#define prom_debug(x...)   do { } while (0)
 #endif
 
 
-- 
2.11.0



Re: [PATCH] powerpc/mm/radix: add missing braces for single statement block

2018-04-23 Thread Mathieu Malaterre
On Mon, Apr 23, 2018 at 9:31 PM, Mathieu Malaterre  wrote:
> On Sun, Apr 8, 2018 at 10:34 PM, Randy Dunlap  wrote:
>> On 04/08/2018 12:44 PM, Mathieu Malaterre wrote:
>>> In commit 7a22d6321c3d ("powerpc/mm/radix: Update command line parsing for
>>> disable_radix") an `if` statement was added for a possible empty body
>>> (prom_debug).
>>>
>>> Fix the following warning, treated as error with W=1:
>>>
>>>   arch/powerpc/kernel/prom_init.c:656:46: error: suggest braces around 
>>> empty body in an ‘if’ statement [-Werror=empty-body]
>>>
>>> Signed-off-by: Mathieu Malaterre 
>>> ---
>>>  arch/powerpc/kernel/prom_init.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/kernel/prom_init.c 
>>> b/arch/powerpc/kernel/prom_init.c
>>> index 5ae153b97d0a..f0e802495530 100644
>>> --- a/arch/powerpc/kernel/prom_init.c
>>> +++ b/arch/powerpc/kernel/prom_init.c
>>> @@ -652,8 +652,9 @@ static void __init early_cmdline_parse(void)
>>>   } else
>>>   prom_radix_disable = true;
>>>   }
>>> - if (prom_radix_disable)
>>> + if (prom_radix_disable) {
>>>   prom_debug("Radix disabled from cmdline\n");
>>
>> Looks like the macro for #prom_debug() should be fixed instead.
>
> Well if I try instead:
>
> @@ -101,9 +101,9 @@ int of_workarounds;
>  } while (0)
>
>  #ifdef DEBUG_PROM
> -#define prom_debug(x...) prom_printf(x)
> +#define prom_debug(x...) do { prom_printf(x); } while (0)
>  #else
> -#define prom_debug(x...)
> +#define prom_debug(x...) do { } while (0)
>  #endif
>
>
> the checkpatch script returns:
>
> WARNING: Single statement macros should not use a do {} while (0) loop
> #33: FILE: arch/powerpc/kernel/prom_init.c:104:
> +#define prom_debug(x...) do { prom_printf(x); } while (0)
>
>
> So I suspect we cannot do much better than my original patch unfortunately.

Right, it seems I cannot read the correct line number for a warning.
An updated patch is coming.

>>> + }
>>>  }
>>>
>>>  #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
>>>
>>
>>
>> --
>> ~Randy


Re: [PATCH] powerpc/mm/radix: add missing braces for single statement block

2018-04-23 Thread Mathieu Malaterre
On Sun, Apr 8, 2018 at 10:34 PM, Randy Dunlap  wrote:
> On 04/08/2018 12:44 PM, Mathieu Malaterre wrote:
>> In commit 7a22d6321c3d ("powerpc/mm/radix: Update command line parsing for
>> disable_radix") an `if` statement was added for a possible empty body
>> (prom_debug).
>>
>> Fix the following warning, treated as error with W=1:
>>
>>   arch/powerpc/kernel/prom_init.c:656:46: error: suggest braces around empty 
>> body in an ‘if’ statement [-Werror=empty-body]
>>
>> Signed-off-by: Mathieu Malaterre 
>> ---
>>  arch/powerpc/kernel/prom_init.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/prom_init.c 
>> b/arch/powerpc/kernel/prom_init.c
>> index 5ae153b97d0a..f0e802495530 100644
>> --- a/arch/powerpc/kernel/prom_init.c
>> +++ b/arch/powerpc/kernel/prom_init.c
>> @@ -652,8 +652,9 @@ static void __init early_cmdline_parse(void)
>>   } else
>>   prom_radix_disable = true;
>>   }
>> - if (prom_radix_disable)
>> + if (prom_radix_disable) {
>>   prom_debug("Radix disabled from cmdline\n");
>
> Looks like the macro for #prom_debug() should be fixed instead.

Well if I try instead:

@@ -101,9 +101,9 @@ int of_workarounds;
 } while (0)

 #ifdef DEBUG_PROM
-#define prom_debug(x...) prom_printf(x)
+#define prom_debug(x...) do { prom_printf(x); } while (0)
 #else
-#define prom_debug(x...)
+#define prom_debug(x...) do { } while (0)
 #endif


the checkpatch script returns:

WARNING: Single statement macros should not use a do {} while (0) loop
#33: FILE: arch/powerpc/kernel/prom_init.c:104:
+#define prom_debug(x...) do { prom_printf(x); } while (0)


So I suspect we cannot do much better than my original patch unfortunately.

>> + }
>>  }
>>
>>  #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
>>
>
>
> --
> ~Randy
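
As an aside, a minimal sketch of why -Werror=empty-body fires on the old
macro and why the do { } while (0) form avoids it (standalone example,
not the kernel code):

	#define dbg_old(x...)			/* expands to nothing */
	#define dbg_new(x...)	do { } while (0)

	void f(int cond)
	{
		if (cond)
			dbg_old("x");	/* becomes `if (cond) ;` -> warning */
		if (cond)
			dbg_new("x");	/* a real (empty) statement: no warning */
	}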


Re: [PATCH v1 1/1] misc: IBM Virtual Management Channel Driver

2018-04-23 Thread Randy Dunlap
On 04/23/18 07:46, Bryant G. Ly wrote:
> This driver is a logical device which provides an
> interface between the hypervisor and a management
> partition.
> 
> This driver is to be used for the POWER Virtual
> Management Channel Virtual Adapter on the PowerVM
> platform. It provides both request/response and
> async message support through the /dev/ibmvmc node.
> 
> Signed-off-by: Bryant G. Ly 
> Reviewed-by: Steven Royer 
> Reviewed-by: Adam Reznechek 
> Tested-by: Taylor Jakobson 
> Tested-by: Brad Warrum 
> Cc: Greg Kroah-Hartman 
> Cc: Arnd Bergmann 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> ---
>  Documentation/ioctl/ioctl-number.txt  |1 +
>  Documentation/misc-devices/ibmvmc.txt |  161 +++
>  MAINTAINERS   |6 +
>  arch/powerpc/include/asm/hvcall.h |1 +
>  drivers/misc/Kconfig  |   14 +
>  drivers/misc/Makefile |1 +
>  drivers/misc/ibmvmc.c | 2415 
> +
>  drivers/misc/ibmvmc.h |  209 +++
>  8 files changed, 2808 insertions(+)
>  create mode 100644 Documentation/misc-devices/ibmvmc.txt
>  create mode 100644 drivers/misc/ibmvmc.c
>  create mode 100644 drivers/misc/ibmvmc.h

> diff --git a/Documentation/misc-devices/ibmvmc.txt 
> b/Documentation/misc-devices/ibmvmc.txt
> new file mode 100644
> index 000..bae1064
> --- /dev/null
> +++ b/Documentation/misc-devices/ibmvmc.txt
> @@ -0,0 +1,161 @@
> +Kernel Driver ibmvmc
> +
> +
> +Authors:
> + Dave Engebretsen 
> + Adam Reznechek 
> + Steven Royer 
> + Bryant G. Ly 
> +
> +Description
> +===
> +

...

> +
> +Virtual Management Channel (VMC)
> +A logical device, called the virtual management channel (VMC), is defined
> +for communicating between the Novalink application and the hypervisor.
> +This device, similar to a VSCSI server device, is presented to a designated
> +management partition as a virtual device and is only presented when the
> +system is not HMC managed.
> +This communication device borrows aspects from both VSCSI and ILLAN devices
> +and is implemented using the CRQ and the RDMA interfaces. A three-way
> +handshake is defined that must take place to establish that both the
> +hypervisor and management partition sides of the channel are running prior
> +to sending/receiving any of the protocol messages.
> +This driver also utilizes Transport Event CRQs. CRQ messages that are sent

Drop "that"? Otherwise the sentence construction is ... odd.

> +when the hypervisor detects one of the peer partitions has abnormally
> +terminated, or one side has called H_FREE_CRQ to close their CRQ.
> +Two new classes of CRQ messages are introduced for the VMC device. VMC
> +Administrative messages are used for each partition using the VMC to
> +communicate capabilities to their partner. HMC Interface messages are used
> +for the actual flow of HMC messages between the management partition and
> +the hypervisor. As most HMC messages far exceed the size of a CRQ bugger,

what is a bugger?
[reads more]
oh, buffer?

> +a virtual DMA (RMDA) of the HMC message data is done prior to each HMC
> +Interface CRQ message. Only the management partition drives RDMA
> +operations; hypervisors never directly causes the movement of message data.
> +
> +Example Management Partition VMC Driver Interface
> +=

...


Looks pretty good.  In general (IMO), it could use more white space (like blank
lines after section names.)

If you fix those small items (buggers) above, you can add:
Reviewed-by: Randy Dunlap 

thanks,
-- 
~Randy


Re: [PATCH v10 01/25] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-04-23 Thread Laurent Dufour
On 23/04/2018 07:58, Minchan Kim wrote:
> Hi Laurent,
> 
> I guess it's good timing to review. I guess LSF/MM is going on, so things
> might change a lot after that. :) Anyway, I grabbed some time to review.

Hi,

Thanks a lot for reviewing this series.

> On Tue, Apr 17, 2018 at 04:33:07PM +0200, Laurent Dufour wrote:
>> This configuration variable will be used to build the code needed to
>> handle speculative page fault.
>>
>> By default it is turned off, and activated depending on architecture
>> support, SMP and MMU.
> 
> Can we have a description in here of why it depends on the architecture?

The reason is that the per architecture page fault code must handle the
speculative page fault. It is done in this series for x86 and ppc64.

I'll make it explicit here.
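
Roughly, the arch handler would try the speculative path first and fall
back on failure; a hedged sketch (function names approximate the series
and may differ):

	/* in the arch page fault handler -- illustrative only */
	struct vm_area_struct *vma;
	int fault;

	fault = handle_speculative_fault(mm, address, flags);
	if (fault & VM_FAULT_RETRY) {
		/*
		 * Concurrent change detected, or PMD/PTE tables not yet
		 * allocated: fall back to the classic path under mmap_sem.
		 */
		down_read(&mm->mmap_sem);
		vma = find_vma(mm, address);
		fault = handle_mm_fault(vma, address, flags);
		up_read(&mm->mmap_sem);
	}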

> 
>>
>> Suggested-by: Thomas Gleixner 
>> Suggested-by: David Rientjes 
>> Signed-off-by: Laurent Dufour 
>> ---
>>  mm/Kconfig | 22 ++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index d5004d82a1d6..5484dca11199 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -752,3 +752,25 @@ config GUP_BENCHMARK
>>performance of get_user_pages_fast().
>>  
>>See tools/testing/selftests/vm/gup_benchmark.c
>> +
>> +config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>> +   def_bool n
>> +
>> +config SPECULATIVE_PAGE_FAULT
>> +   bool "Speculative page faults"
>> +   default y
>> +   depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>> +   depends on MMU && SMP
>> +   help
>> + Try to handle user space page faults without holding the mmap_sem.
>> +
>> + This should allow better concurrency for massively threaded processes
>> + since the page fault handler will not wait for other threads' memory
>> + layout changes to be done, assuming that the change is done in another
>> + part of the process's memory space. This type of page fault is named
>> + speculative page fault.
>> +
>> + If the speculative page fault fails because a concurrent change is
>> + detected or because the underlying PMD or PTE tables are not yet
>> + allocated, the speculative handling fails and a classic page fault
>> + is then tried.
>> -- 
>> 2.7.4
>>
> 



[PATCH v1 1/1] misc: IBM Virtual Management Channel Driver

2018-04-23 Thread Bryant G. Ly
This driver is a logical device which provides an
interface between the hypervisor and a management
partition.

This driver is to be used for the POWER Virtual
Management Channel Virtual Adapter on the PowerVM
platform. It provides both request/response and
async message support through the /dev/ibmvmc node.

Signed-off-by: Bryant G. Ly 
Reviewed-by: Steven Royer 
Reviewed-by: Adam Reznechek 
Tested-by: Taylor Jakobson 
Tested-by: Brad Warrum 
Cc: Greg Kroah-Hartman 
Cc: Arnd Bergmann 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
---
 Documentation/ioctl/ioctl-number.txt  |1 +
 Documentation/misc-devices/ibmvmc.txt |  161 +++
 MAINTAINERS   |6 +
 arch/powerpc/include/asm/hvcall.h |1 +
 drivers/misc/Kconfig  |   14 +
 drivers/misc/Makefile |1 +
 drivers/misc/ibmvmc.c | 2415 +
 drivers/misc/ibmvmc.h |  209 +++
 8 files changed, 2808 insertions(+)
 create mode 100644 Documentation/misc-devices/ibmvmc.txt
 create mode 100644 drivers/misc/ibmvmc.c
 create mode 100644 drivers/misc/ibmvmc.h

diff --git a/Documentation/ioctl/ioctl-number.txt 
b/Documentation/ioctl/ioctl-number.txt
index 84bb74d..9851cee 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -329,6 +329,7 @@ Code  Seq#(hex) Include FileComments
 0xCA   80-BF   uapi/scsi/cxlflash_ioctl.h
 0xCB   00-1F   CBM serial IEC bus  in development:


+0xCC   00-0F   drivers/misc/ibmvmc.h   pseries VMC driver
 0xCD   01  linux/reiserfs_fs.h
 0xCF   02  fs/cifs/ioctl.c
 0xDB   00-0F   drivers/char/mwave/mwavepub.h
diff --git a/Documentation/misc-devices/ibmvmc.txt 
b/Documentation/misc-devices/ibmvmc.txt
new file mode 100644
index 000..bae1064
--- /dev/null
+++ b/Documentation/misc-devices/ibmvmc.txt
@@ -0,0 +1,161 @@
+Kernel Driver ibmvmc
+
+
+Authors:
+   Dave Engebretsen 
+   Adam Reznechek 
+   Steven Royer 
+   Bryant G. Ly 
+
+Description
+===
+
+The Virtual Management Channel (VMC) is a logical device which provides an
+interface between the hypervisor and a management partition. This
+management partition is intended to provide an alternative to HMC-based
+system management. In the management partition, a Novalink application
+exists which enables a system administrator to configure the system’s
+partitioning characteristics via a command line interface (CLI) or
+Representational State Transfer Application (REST). You can also manage the
+server by using PowerVC or other OpenStack solution. Support for
+conventional HMC management of the system may still be provided on a
+system; however, when an HMC is attached to the system, the VMC
+interface is disabled by the hypervisor.
+
+NovaLink runs on a Linux logical partition on a POWER8 or newer processor-
+based server that is virtualized by PowerVM. System configuration,
+maintenance, and control functions which traditionally require an HMC can
+be implemented in the Novalink using a combination of HMC to hypervisor
+interfaces and existing operating system methods. This tool provides a
+subset of the functions implemented by the HMC and enables basic partition
+configuration. The set of HMC to hypervisor messages supported by the
+Novalink component are passed to the hypervisor over a VMC interface, which
+is defined below.
+
+Virtual Management Channel (VMC)
+A logical device, called the virtual management channel (VMC), is defined
+for communicating between the Novalink application and the hypervisor.
+This device, similar to a VSCSI server device, is presented to a designated
+management partition as a virtual device and is only presented when the
+system is not HMC managed.
+This communication device borrows aspects from both VSCSI and ILLAN devices
+and is implemented using the CRQ and the RDMA interfaces. A three-way
+handshake is defined that must take place to establish that both the
+hypervisor and management partition sides of the channel are running prior
+to sending/receiving any of the protocol messages.
+This driver also utilizes Transport Event CRQs. CRQ messages that are sent
+when the hypervisor detects one of the peer partitions has abnormally
+terminated, or one side has called H_FREE_CRQ to close their CRQ.
+Two new classes of CRQ messages are introduced for the VMC device. VMC
+Administrative messages are used for each partition using the VMC to
+communicate capabilities to their partner. HMC Interface messages are used
+for the actual flow of HMC messages between the management partition and
+the hypervisor.
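
For orientation, a minimal userspace sketch of talking to the character
device described above with plain POSIX calls (the message framing and
any ioctls are omitted; illustrative only, not the driver's exact
interface):

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		struct pollfd pfd;
		int fd = open("/dev/ibmvmc", O_RDWR);

		if (fd < 0) {
			perror("open /dev/ibmvmc");
			return 1;
		}

		/* wait for an async message from the hypervisor side */
		pfd.fd = fd;
		pfd.events = POLLIN;
		if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
			ssize_t n = read(fd, buf, sizeof(buf));
			if (n > 0)
				printf("received %zd bytes\n", n);
		}

		close(fd);
		return 0;
	}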

[PATCH v1 0/1] misc: IBM Virtual Management Channel Driver

2018-04-23 Thread Bryant G. Ly
Steven Royer had previously attempted to upstream this
driver two years ago, but never got the chance to address
the concerns from Greg Kroah-Hartman.

The thread with the initial upstream is:
https://lkml.org/lkml/2016/2/16/918

I have addressed the following:
- Documentation
- Use of dev_dbg instead of pr_dbg
- Change to misc class
- Fixed memory barrier usages
- Addressed styling, checkpatch, renaming of functions
- General fixes to the driver to make it more inline with
  existing upstream drivers.

Bryant G. Ly (1):
  misc: IBM Virtual Management Channel Driver

 Documentation/ioctl/ioctl-number.txt  |1 +
 Documentation/misc-devices/ibmvmc.txt |  161 +++
 MAINTAINERS   |6 +
 arch/powerpc/include/asm/hvcall.h |1 +
 drivers/misc/Kconfig  |9 +
 drivers/misc/Makefile |1 +
 drivers/misc/ibmvmc.c | 2413 +
 drivers/misc/ibmvmc.h |  215 +++
 8 files changed, 2807 insertions(+)
 create mode 100644 Documentation/misc-devices/ibmvmc.txt
 create mode 100644 drivers/misc/ibmvmc.c
 create mode 100644 drivers/misc/ibmvmc.h

-- 
2.7.2



Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

2018-04-23 Thread Mahesh Jagannath Salgaonkar
On 04/23/2018 04:44 PM, Balbir Singh wrote:
> On Mon, Apr 23, 2018 at 8:33 PM, Mahesh Jagannath Salgaonkar
>  wrote:
>> On 04/23/2018 12:21 PM, Balbir Singh wrote:
>>> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
>>>  wrote:
 From: Mahesh Salgaonkar 

 The current code extracts the physical address for UE errors and then
 hooks it up into memory failure infrastructure. On successful extraction
 of physical address it wrongly sets "handled = 1" which means this UE error
 has been recovered. Since MCE handler gets return value as handled = 1, it
 assumes that error has been recovered and goes back to same NIP. This 
 causes
 MCE interrupt again and again in a loop leading to hard lockup.

 Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
 undesired page to hwpoison.

 Without this patch we see:
 [ 1476.541984] Severe Machine check interrupt [Recovered]
 [ 1476.541985]   NIP: [1002588c] PID: 7109 Comm: find
 [ 1476.541986]   Initiator: CPU
 [ 1476.541987]   Error type: UE [Load/Store]
 [ 1476.541988] Effective address: 7fffd2755940
 [ 1476.541989] Physical address:  20181a08
 [...]
 [ 1476.542003] Severe Machine check interrupt [Recovered]
 [ 1476.542004]   NIP: [1002588c] PID: 7109 Comm: find
 [ 1476.542005]   Initiator: CPU
 [ 1476.542006]   Error type: UE [Load/Store]
 [ 1476.542006] Effective address: 7fffd2755940
 [ 1476.542007] Physical address:  20181a08
 [ 1476.542010] Severe Machine check interrupt [Recovered]
 [ 1476.542012]   NIP: [1002588c] PID: 7109 Comm: find
 [ 1476.542013]   Initiator: CPU
 [ 1476.542014]   Error type: UE [Load/Store]
 [ 1476.542015] Effective address: 7fffd2755940
 [ 1476.542016] Physical address:  20181a08
 [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU 
 page: Recovered
 [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
 [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
 [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
 [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
 [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
 [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
 [...]
 [ 1490.972174] Watchdog CPU:38 Hard LOCKUP

 After this patch we see:

 [  325.384336] Severe Machine check interrupt [Not recovered]
>>>
>>> How did you test for this?
>>
>> By injecting cache SUE using L2 FIR register (0x1001080c).
>>
>>> If the error was recovered, shouldn't the
>>> process have gotten
>>> a SIGBUS and we should have prevented further access as a part of the 
>>> handling
>>> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?
>>
>> We hook it up to memory_failure() through a work queue and by the time
>> work queue kicks in, the application continues to restart and hit same
>> NIP again and again. Every MCE again hooks the same address to memory
>> failure work queue and throws multiple recovered MCE messages for same
>> address. Once the memory_failure() hwpoisons the page, application gets
>> SIGBUS and then we are fine.
>>
> 
> That seems quite broken and not recovered is very confusing. So effectively
> we can never recover from a MCE UE. 

By not setting handled = 1, the recovery code will fall through
machine_check_exception()->opal_machine_check() and then SIGBUS is sent
to this process to recover, OR we head to the panic path for a kernel UE. We have
already hooked up the physical address to memory_failure() which will
later hwpoison the page whenever work queue kicks in. This patch makes
sure this happens.

> I think we need a notion of delayed
> recovery then? Where we do recover, but mark it as recovered with delays?

Yeah, maybe we can set the disposition for the userspace mce event as
recovery in progress/delayed and then print the mce event again from the
work queue by looking at the return value from memory_failure(). What do
you think?

> We might want to revisit our recovery process and see if the recovery requires
> to turn the MMU on, but that is for later, I suppose.
> 
>> But in case of UE in kernel space, if early machine_check handler
>> "machine_check_early()" returns as recovered then
>> machine_check_handle_early() queues up the MCE event and continues from
>> NIP assuming it is safe causing a MCE loop. So, for UE in kernel we end
>> up in hard lockup.
>>
> 
> Yeah for the kernel, we need to definitely cause a panic for now, I've got 
> other
> patches for things we need to do for pmem that would allow potential recovery.
> 
> Balbir Singh
> 



Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

2018-04-23 Thread Nicholas Piggin
On Mon, 23 Apr 2018 21:14:12 +1000
Balbir Singh  wrote:

> On Mon, Apr 23, 2018 at 8:33 PM, Mahesh Jagannath Salgaonkar
>  wrote:
> > On 04/23/2018 12:21 PM, Balbir Singh wrote:  
> >> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
> >>  wrote:  
> >>> From: Mahesh Salgaonkar 
> >>>
> >>> The current code extracts the physical address for UE errors and then
> >>> hooks it up into memory failure infrastructure. On successful extraction
> >>> of physical address it wrongly sets "handled = 1" which means this UE 
> >>> error
> >>> has been recovered. Since MCE handler gets return value as handled = 1, it
> >>> assumes that error has been recovered and goes back to same NIP. This 
> >>> causes
> >>> MCE interrupt again and again in a loop leading to hard lockup.
> >>>
> >>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
> >>> undesired page to hwpoison.
> >>>
> >>> Without this patch we see:
> >>> [ 1476.541984] Severe Machine check interrupt [Recovered]
> >>> [ 1476.541985]   NIP: [1002588c] PID: 7109 Comm: find
> >>> [ 1476.541986]   Initiator: CPU
> >>> [ 1476.541987]   Error type: UE [Load/Store]
> >>> [ 1476.541988] Effective address: 7fffd2755940
> >>> [ 1476.541989] Physical address:  20181a08
> >>> [...]
> >>> [ 1476.542003] Severe Machine check interrupt [Recovered]
> >>> [ 1476.542004]   NIP: [1002588c] PID: 7109 Comm: find
> >>> [ 1476.542005]   Initiator: CPU
> >>> [ 1476.542006]   Error type: UE [Load/Store]
> >>> [ 1476.542006] Effective address: 7fffd2755940
> >>> [ 1476.542007] Physical address:  20181a08
> >>> [ 1476.542010] Severe Machine check interrupt [Recovered]
> >>> [ 1476.542012]   NIP: [1002588c] PID: 7109 Comm: find
> >>> [ 1476.542013]   Initiator: CPU
> >>> [ 1476.542014]   Error type: UE [Load/Store]
> >>> [ 1476.542015] Effective address: 7fffd2755940
> >>> [ 1476.542016] Physical address:  20181a08
> >>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU 
> >>> page: Recovered
> >>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
> >>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
> >>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
> >>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
> >>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
> >>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
> >>> [...]
> >>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
> >>>
> >>> After this patch we see:
> >>>
> >>> [  325.384336] Severe Machine check interrupt [Not recovered]  
> >>
> >> How did you test for this?  
> >
> > By injecting cache SUE using L2 FIR register (0x1001080c).
> >  
> >> If the error was recovered, shouldn't the
> >> process have gotten
> >> a SIGBUS and we should have prevented further access as a part of the 
> >> handling
> >> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?  
> >
> > We hook it up to memory_failure() through a work queue and by the time
> > work queue kicks in, the application continues to restart and hit same
> > NIP again and again. Every MCE again hooks the same address to memory
> > failure work queue and throws multiple recovered MCE messages for same
> > address. Once the memory_failure() hwpoisons the page, application gets
> > SIGBUS and then we are fine.
> >  
> 
> That seems quite broken and not recovered is very confusing. So effectively
> we can never recover from a MCE UE. I think we need a notion of delayed
> recovery then? Where we do recover, but mark it as recovered with delays?
> We might want to revisit our recovery process and see if the recovery requires
> to turn the MMU on, but that is for later, I suppose.

The notion of being handled in the machine check return value is not
whether the failing resource is later de-allocated or fixed, but whether
*this* particular exception was able to be corrected and processing can
resume as normal without further action.

The MCE UE is not recovered just by finding its address here, so I think
Mahesh's patch is right.

You can still recover it with further action later.
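
Concretely, a sketch of the contract (not the exact powerpc code):

	/*
	 * handled == 1 only if execution can safely resume at the
	 * interrupted NIP with no further action; queuing the page for
	 * memory_failure() later does not by itself make the exception
	 * "handled".
	 */
	handled = mce_handle_error(regs, &evt);	/* illustrative signature */
	if (!handled && user_mode(regs))
		_exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);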

> 
> > But in case of UE in kernel space, if early machine_check handler
> > "machine_check_early()" returns as recovered then
> > machine_check_handle_early() queues up the MCE event and continues from
> > NIP assuming it is safe causing a MCE loop. So, for UE in kernel we end
> > up in hard lockup.
> >  
> 
> Yeah for the kernel, we need to definitely cause a panic for now, I've got 
> other
> patches for things we need to do for pmem that would allow potential recovery.
> 
> Balbir Singh



Re: [PATCH v4 4/7] powerpc/fadump: Reservationless firmware assisted dump

2018-04-23 Thread Hari Bathini



On Friday 20 April 2018 10:34 AM, Mahesh J Salgaonkar wrote:

From: Mahesh Salgaonkar 

One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. On large
systems with TeraBytes of memory, this reservation can be quite
significant.

In some cases, fadump fails if the memory reserved is insufficient, or
if the reserved memory was DLPAR hot-removed.

In the normal case, post reboot, the preserved memory is filtered to
extract only relevant areas of interest using the makedumpfile tool.
While the tool provides flexibility to determine what needs to be part
of the dump and what memory to filter out, all supported distributions
default this to "Capture only kernel data and nothing else".

We take advantage of this default and the Linux kernel's Contiguous
Memory Allocator (CMA) to fundamentally change the memory reservation
model for fadump.

Instead of setting aside a significant chunk of memory nobody can use,
this patch uses CMA instead, to reserve a significant chunk of memory
that the kernel is prevented from using (due to MIGRATE_CMA), but
applications are free to use it. With this, fadump will still be able
to capture all of the kernel memory and most of the user space memory,
except the user pages that were present in the CMA region.

Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:
[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           7557         193        6822          12         541        6725
Swap:          4095           0        4095

With this patch:
[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           8133         194        7464          12         475        7338
Swap:          4095           0        4095

Changes made here are completely transparent to how fadump has
traditionally worked.

Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand
CMA and its usage.

TODO:
- Handle case where CMA reservation spans nodes.

Signed-off-by: Ananth N Mavinakayanahalli 
Signed-off-by: Mahesh Salgaonkar 
---
  arch/powerpc/kernel/fadump.c |  120 --
  1 file changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 16b3e8c5cae0..7f76924ab190 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -34,6 +34,7 @@
  #include 
  #include 
  #include 
+#include <linux/cma.h>

  #include 
  #include 
@@ -45,11 +46,57 @@
  static struct fw_dump fw_dump;
  static struct fadump_mem_struct fdm;
  static const struct fadump_mem_struct *fdm_active;
+static struct cma *fadump_cma;

  static DEFINE_MUTEX(fadump_mutex);
  struct fad_crash_memory_ranges crash_memory_ranges[INIT_CRASHMEM_RANGES];
  int crash_mem_ranges;

+/*
+ * fadump_cma_reserve() - reserve area for fadump memory reservation
+ *
+ * This function reserves memory from early allocator. It should be
+ * called by arch specific code once the memblock allocator
+ * has been activated.
+ */
+int __init fadump_cma_reserve(void)
+{
+   unsigned long long base, size;
+   int rc;
+
+   if (!fw_dump.fadump_enabled)
+   return 0;
+
+   base = fw_dump.reserve_dump_area_start;
+   size = fw_dump.reserve_dump_area_size;


Mahesh, How about moving sections around instead:

Old:
  1. cpu state data region
  2. hpte region
  3. real memory region

New:
  2. cpu state data region
  3. hpte region
  1. real memory region

and using only boot memory size for cma reserve. The other regions,
crashinfo header & elfcorehdrs, can still use memblock_reserve.

This achieves two things. One, it ensures we don't waste memory in
alignment, as cma uses hugepage(16MB)/maxorder as the default alignment
(we need to ensure boot memory size is aligned by hugepage(16MB)/maxorder
though). Two, we don't have to move around metadata from end to start
(patch 1/7).

To differentiate the old and new section order, we can overload the crash
info magic (FADUMPINF -> FADUMPIV2), I guess. That differentiation may be
needed for re-registering after dump capture.


+   pr_debug("Original reserve area base %ld, size %ld\n",
+   (unsigned long)base >> 20,
+   (unsigned long)size >> 20);
+   if (!size)
+   return 0;
+
+   rc = cma_declare_contiguous(base, size, 0, 0, 0, false,
+   "fadump_cma", &fadump_cma);


Compilation fails when CONFIG_CMA is not set. A fallback for when
CONFIG_CMA is not set, or a dependency enforced for the FA_DUMP config
option, seems to be missing.
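
A hedged sketch of one possible shape for that fallback (illustrative
only, not a tested fix):

	#ifdef CONFIG_CMA
		rc = cma_declare_contiguous(base, size, 0, 0, 0, false,
					    "fadump_cma", &fadump_cma);
	#else
		/* no CMA: fall back to a plain early reservation */
		rc = memblock_reserve(base, size);
	#endif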


Also, considering we already deduce the base by looking for holes in 
fadump code, we could
have a 'fixed' ('true' for 6th parameter) cma region? Again, we 

Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

2018-04-23 Thread Balbir Singh
On Mon, Apr 23, 2018 at 8:33 PM, Mahesh Jagannath Salgaonkar
 wrote:
> On 04/23/2018 12:21 PM, Balbir Singh wrote:
>> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
>>  wrote:
>>> From: Mahesh Salgaonkar 
>>>
>>> The current code extracts the physical address for UE errors and then
>>> hooks it up into memory failure infrastructure. On successful extraction
>>> of physical address it wrongly sets "handled = 1" which means this UE error
>>> has been recovered. Since MCE handler gets return value as handled = 1, it
>>> assumes that error has been recovered and goes back to same NIP. This causes
>>> MCE interrupt again and again in a loop leading to hard lockup.
>>>
>>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
>>> undesired page to hwpoison.
>>>
>>> Without this patch we see:
>>> [ 1476.541984] Severe Machine check interrupt [Recovered]
>>> [ 1476.541985]   NIP: [1002588c] PID: 7109 Comm: find
>>> [ 1476.541986]   Initiator: CPU
>>> [ 1476.541987]   Error type: UE [Load/Store]
>>> [ 1476.541988] Effective address: 7fffd2755940
>>> [ 1476.541989] Physical address:  20181a08
>>> [...]
>>> [ 1476.542003] Severe Machine check interrupt [Recovered]
>>> [ 1476.542004]   NIP: [1002588c] PID: 7109 Comm: find
>>> [ 1476.542005]   Initiator: CPU
>>> [ 1476.542006]   Error type: UE [Load/Store]
>>> [ 1476.542006] Effective address: 7fffd2755940
>>> [ 1476.542007] Physical address:  20181a08
>>> [ 1476.542010] Severe Machine check interrupt [Recovered]
>>> [ 1476.542012]   NIP: [1002588c] PID: 7109 Comm: find
>>> [ 1476.542013]   Initiator: CPU
>>> [ 1476.542014]   Error type: UE [Load/Store]
>>> [ 1476.542015] Effective address: 7fffd2755940
>>> [ 1476.542016] Physical address:  20181a08
>>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU 
>>> page: Recovered
>>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
>>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
>>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
>>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
>>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
>>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
>>> [...]
>>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>>>
>>> After this patch we see:
>>>
>>> [  325.384336] Severe Machine check interrupt [Not recovered]
>>
>> How did you test for this?
>
> By injecting cache SUE using L2 FIR register (0x1001080c).
>
>> If the error was recovered, shouldn't the
>> process have gotten
>> a SIGBUS and we should have prevented further access as a part of the 
>> handling
>> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?
>
> We hook it up to memory_failure() through a work queue and by the time
> work queue kicks in, the application continues to restart and hit same
> NIP again and again. Every MCE again hooks the same address to memory
> failure work queue and throws multiple recovered MCE messages for same
> address. Once the memory_failure() hwpoisons the page, application gets
> SIGBUS and then we are fine.
>

That seems quite broken and not recovered is very confusing. So effectively
we can never recover from a MCE UE. I think we need a notion of delayed
recovery then? Where we do recover, but mark it as recovered with delays?
We might want to revisit our recovery process and see if the recovery requires
to turn the MMU on, but that is for later, I suppose.

> But in case of UE in kernel space, if early machine_check handler
> "machine_check_early()" returns as recovered then
> machine_check_handle_early() queues up the MCE event and continues from
> NIP assuming it is safe causing a MCE loop. So, for UE in kernel we end
> up in hard lockup.
>

Yeah for the kernel, we need to definitely cause a panic for now, I've got other
patches for things we need to do for pmem that would allow potential recovery.

Balbir Singh


Re: [PATCH 4/4] powerpc: Allow LD_DEAD_CODE_DATA_ELIMINATION to be selected

2018-04-23 Thread Mathieu Malaterre
Christophe,

On Sat, Apr 21, 2018 at 9:16 AM, christophe leroy
 wrote:
>
>
> Le 20/04/2018 à 22:08, Mathieu Malaterre a écrit :
>>
>> On Fri, Apr 20, 2018 at 12:41 PM, Nicholas Piggin 
>> wrote:
>>>
>>> On Fri, 20 Apr 2018 12:00:49 +0200
>>> Mathieu Malaterre  wrote:
>>>
 On Fri, Apr 20, 2018 at 9:34 AM, Nicholas Piggin 
 wrote:
>
> This requires further changes to linker script to KEEP some tables
> and wildcard compiler generated sections into the right place. This
> includes pp32 modifications from Christophe Leroy.
>
> When compiling powernv_defconfig with this option:
>
> text       data      bss       dec        filename
> 11827621   4810490   1341080   17979191   vmlinux
> 11752437   4598858   1338776   17690071   vmlinux.dcde
>
> Resulting kernel is almost 400kB smaller (and still boots).
>
> [ppc32 numbers here]


 ^^^

 Do you want somebody else to provide those numbers ?
>>>
>>>
>>> If you have a booting kernel, yes some more numbers would be good.
>>
>>
>> I've used /boot/config-4.15.0-2-powerpc from my current debian
>> package. Rebuild master with and without option, boot ok, load/unload
>> module ok.
>>
>> $ size nick/vmlinux.with*
>>    text    data     bss     dec     hex filename
>> 7386425 2364370 1425432 11176227 aa8923 nick/vmlinux.with
>> 7461457 2475122 1428064 11364643 ad6923 nick/vmlinux.without
>>
>> It is not clear why, with the option, the size of the kernel is slightly bigger:
>
>
> The file contains also debug symbols, which might differ.
> Only the LOAD part of the file is interesting, that's the part you get when
> doing ppc-linux-objcopy vmlinux vmlinux.bin -O binary
>
> You can see it with readelf -l vmlinux

Here is what I see:

$ diff -u <(readelf -l nick/vmlinux.with) <(readelf -l nick/vmlinux.without)
--- /proc/self/fd/11 2018-04-23 12:59:52.413101612 +0200
+++ /proc/self/fd/12 2018-04-23 12:59:52.417101567 +0200
@@ -5,9 +5,9 @@

 Program Headers:
   Type   Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
-  LOAD   0x01 0xc000 0x 0x950f34 0xaad018 RWE 0x1
-  NOTE   0x867640 0xc0857640 0x00857640 0x00024 0x00024 0x4
-  NOTE   0x867664 0xc0857664 0x00857664 0xc 0xc 0x1
+  LOAD   0x01 0xc000 0x 0x97ef34 0xadba68 RWE 0x1
+  NOTE   0x87f640 0xc086f640 0x0086f640 0x00024 0x00024 0x4
+  NOTE   0x87f664 0xc086f664 0x0086f664 0xc 0xc 0x1

  Section to Segment mapping:
   Segment Sections...


> Christophe
>
>>
>> $ du -sk nick/vmlinux.with*
>> 124488 nick/vmlinux.with
>> 124004 nick/vmlinux.without
>>
>>
>>> Thanks,
>>> Nick
>
>
> ---
> The absence of viruses in this e-mail has been checked by the Avast
> antivirus software.
> https://www.avast.com/antivirus
>


Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

2018-04-23 Thread Mahesh Jagannath Salgaonkar
On 04/23/2018 12:21 PM, Balbir Singh wrote:
> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
>  wrote:
>> From: Mahesh Salgaonkar 
>>
>> The current code extracts the physical address for UE errors and then
>> hooks it up into memory failure infrastructure. On successful extraction
>> of physical address it wrongly sets "handled = 1" which means this UE error
>> has been recovered. Since MCE handler gets return value as handled = 1, it
>> assumes that error has been recovered and goes back to same NIP. This causes
>> MCE interrupt again and again in a loop leading to hard lockup.
>>
>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
>> undesired page to hwpoison.
>>
>> Without this patch we see:
>> [ 1476.541984] Severe Machine check interrupt [Recovered]
>> [ 1476.541985]   NIP: [1002588c] PID: 7109 Comm: find
>> [ 1476.541986]   Initiator: CPU
>> [ 1476.541987]   Error type: UE [Load/Store]
>> [ 1476.541988] Effective address: 7fffd2755940
>> [ 1476.541989] Physical address:  20181a08
>> [...]
>> [ 1476.542003] Severe Machine check interrupt [Recovered]
>> [ 1476.542004]   NIP: [1002588c] PID: 7109 Comm: find
>> [ 1476.542005]   Initiator: CPU
>> [ 1476.542006]   Error type: UE [Load/Store]
>> [ 1476.542006] Effective address: 7fffd2755940
>> [ 1476.542007] Physical address:  20181a08
>> [ 1476.542010] Severe Machine check interrupt [Recovered]
>> [ 1476.542012]   NIP: [1002588c] PID: 7109 Comm: find
>> [ 1476.542013]   Initiator: CPU
>> [ 1476.542014]   Error type: UE [Load/Store]
>> [ 1476.542015] Effective address: 7fffd2755940
>> [ 1476.542016] Physical address:  20181a08
>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU 
>> page: Recovered
>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
>> [...]
>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>>
>> After this patch we see:
>>
>> [  325.384336] Severe Machine check interrupt [Not recovered]
> 
> How did you test for this? 

By injecting cache SUE using L2 FIR register (0x1001080c).

> If the error was recovered, shouldn't the
> process have gotten
> a SIGBUS and we should have prevented further access as a part of the handling
> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?

We hook it up to memory_failure() through a work queue and by the time
work queue kicks in, the application continues to restart and hit same
NIP again and again. Every MCE again hooks the same address to memory
failure work queue and throws multiple recovered MCE messages for same
address. Once the memory_failure() hwpoisons the page, application gets
SIGBUS and then we are fine.

But in case of UE in kernel space, if early machine_check handler
"machine_check_early()" returns as recovered then
machine_check_handle_early() queues up the MCE event and continues from
NIP assuming it is safe causing a MCE loop. So, for UE in kernel we end
up in hard lockup.

> Why shouldn't we treat it as handled if we isolate the page?

Yes we should, but I think not until the page is actually hwpoisoned OR
until we send SIGBUS to the process.

> 
> Thanks,
> Balbir Singh.
> 



Re: [PATCH RFC 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space

2018-04-23 Thread Cédric Le Goater
On 04/16/2018 06:09 AM, David Gibson wrote:
> On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
>> It is not currently possible to create the full number of possible
>> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
>> threads per core than it's core stride (or "VSMT mode"). This is
>> because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
>> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
>>
>> To address this, "pack" the VCORE ID and XIVE offsets by using
>> knowledge of the way the VCPU IDs will be used when there are less
>> guest threads per core than the core stride. The primary thread of
>> each core will always be used first. Then, if the guest uses more than
>> one thread per core, these secondary threads will sequentially follow
>> the primary in each core.
>>
>> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
>> VCPUs are being spaced apart, so at least half of each core is empty
>> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
>> into the second half of each core (4..7, in an 8-thread core).
>>
>> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
>> each core is being left empty, and we can map down into the second and
>> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
>>
>> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
>> threads are being used and 7/8 of the core is empty, allowing use of
>> the 1, 3, 5 and 7 thread slots.
>>
>> (Strides less than 8 are handled similarly.)
>>
>> This allows the VCORE ID or offset to be calculated quickly from the
>> VCPU ID or XIVE server numbers, without access to the VCPU structure.
>>
>> Signed-off-by: Sam Bobroff 
>> ---
>> Hello everyone,
>>
>> I've tested this on P8 and P9, in lots of combinations of host and guest
>> threading modes and it has been fine but it does feel like a "tricky"
>> approach, so I still feel somewhat wary about it.

Have you done any migration ? 

>> I've posted it as an RFC because I have not tested it with guest native-XIVE,
>> and I suspect that it will take some work to support it.

The KVM XIVE device will be different for XIVE exploitation mode, same 
structures 
though. I will send a patchset shortly. 

>>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++
>>  arch/powerpc/kvm/book3s_hv.c  | 14 ++
>>  arch/powerpc/kvm/book3s_xive.c|  9 +++--
>>  3 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
>> b/arch/powerpc/include/asm/kvm_book3s.h
>> index 376ae803b69c..1295056d564a 100644
>> --- a/arch/powerpc/include/asm/kvm_book3s.h
>> +++ b/arch/powerpc/include/asm/kvm_book3s.h
>> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu 
>> *vcpu);
>>  #define SPLIT_HACK_MASK 0xff00
>>  #define SPLIT_HACK_OFFS 0xfb00
>>  
>> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
>> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core 
>> stride
>> + * (but not its actual threading mode, which is not available) to avoid
>> + * collisions.
>> + */
>> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
>> +{
>> +const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
> 
> I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
> roughly the same thing, but I think makes the pattern more obvious.
> 
>> +int stride = kvm->arch.emul_smt_mode > 1 ?
>> + kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
> 
> AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
> always be 1 when this is called, so the conditional here doesn't seem
> useful.
> 
>> +int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
>> +u32 packed_id;
>> +
>> +BUG_ON(block >= MAX_SMT_THREADS);
>> +packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
>> +BUG_ON(packed_id >= KVM_MAX_VCPUS);
>> +return packed_id;
>> +}
> 
> It took me a while to wrap my head around the packing function, but I
> think I got there in the end.  It's pretty clever.
> 
> One thing bothers me, though.  This certainly packs things under
> KVM_MAX_VCPUS, but not necessarily under the actual number of vcpus.
> e.g. KVM_MAX_VCPUS==16, 8 vcpus total, stride 8, 2 vthreads/vcore (as
> qemu sees it), gives both unpacked IDs (0, 1, 8, 9, 16, 17, 24, 25)
> and packed ids of (0, 1, 8, 9, 4, 5, 12, 13) - leaving 2, 3, 6, 7
> etc. unused.
> 
> So again, the question is what exactly are these remapped IDs useful
> for.  If we're indexing into a bare array of structures of size
> KVM_MAX_VCPUS then we're *already* wasting a bunch of space by having
> more entries than vcpus.  If we're indexing into something sparser,
> then why is the remapping worthwhile?
> 
> 
> 
>> +
>>  #endif /* __ASM_KVM_BOOK3S_H__ */
>> diff --git 
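
For readers following along, a standalone sketch that reproduces David's
example above (KVM_MAX_VCPUS == 16 and an 8-thread core are assumed just
for the illustration):

	#include <stdio.h>

	#define KVM_MAX_VCPUS	16	/* assumed for the example */
	#define MAX_SMT_THREADS	8

	static int pack(int id, int stride)
	{
		const int block_offsets[MAX_SMT_THREADS] =
			{0, 4, 2, 6, 1, 5, 3, 7};
		int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);

		return (id % KVM_MAX_VCPUS) + block_offsets[block];
	}

	int main(void)
	{
		int ids[] = {0, 1, 8, 9, 16, 17, 24, 25};
		int i;

		for (i = 0; i < 8; i++)	/* prints: 0 1 8 9 4 5 12 13 */
			printf("%d ", pack(ids[i], 8));
		printf("\n");
		return 0;
	}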

Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

2018-04-23 Thread Balbir Singh
On Mon, Apr 23, 2018 at 4:51 PM, Balbir Singh  wrote:
> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
>  wrote:
>> From: Mahesh Salgaonkar 
>>
>> The current code extracts the physical address for UE errors and then
>> hooks it up into memory failure infrastructure. On successful extraction
>> of physical address it wrongly sets "handled = 1" which means this UE error
>> has been recovered. Since MCE handler gets return value as handled = 1, it
>> assumes that error has been recovered and goes back to same NIP. This causes
>> MCE interrupt again and again in a loop leading to hard lockup.
>>
>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
>> undesired page to hwpoison.
>>
>> Without this patch we see:
>> [ 1476.541984] Severe Machine check interrupt [Recovered]
>> [ 1476.541985]   NIP: [1002588c] PID: 7109 Comm: find
>> [ 1476.541986]   Initiator: CPU
>> [ 1476.541987]   Error type: UE [Load/Store]
>> [ 1476.541988] Effective address: 7fffd2755940
>> [ 1476.541989] Physical address:  20181a08
>> [...]
>> [ 1476.542003] Severe Machine check interrupt [Recovered]
>> [ 1476.542004]   NIP: [1002588c] PID: 7109 Comm: find
>> [ 1476.542005]   Initiator: CPU
>> [ 1476.542006]   Error type: UE [Load/Store]
>> [ 1476.542006] Effective address: 7fffd2755940
>> [ 1476.542007] Physical address:  20181a08
>> [ 1476.542010] Severe Machine check interrupt [Recovered]
>> [ 1476.542012]   NIP: [1002588c] PID: 7109 Comm: find
>> [ 1476.542013]   Initiator: CPU
>> [ 1476.542014]   Error type: UE [Load/Store]
>> [ 1476.542015] Effective address: 7fffd2755940
>> [ 1476.542016] Physical address:  20181a08
>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU 
>> page: Recovered
>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
>> [...]
>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>>
>> After this patch we see:
>>
>> [  325.384336] Severe Machine check interrupt [Not recovered]
>
> How did you test for this? If the error was recovered, shouldn't the
> process have gotten
> a SIGBUS and we should have prevented further access as a part of the handling
> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?
>
> Why shouldn't we treat it as handled if we isolate the page?

Not-yet-signed-off patch attached for testing. Mahesh, please check/confirm

Thanks,
Balbir

>
> Thanks,
> Balbir Singh.
From b297f2ad8473eea7755bcc239e7de21227438065 Mon Sep 17 00:00:00 2001
From: Balbir Singh 
Date: Mon, 23 Apr 2018 19:21:27 +1000
Subject: [PATCH] powerpc/mce: force a KILL on user page MCE

Experimental fix for MCE error, force a kill
because that's our current recovery mechanism

Not tested/compiled yet

Signed-off-by: Balbir Singh 
---
 arch/powerpc/kernel/mce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index efdd16a79075..a8441f75861e 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -273,7 +273,7 @@ static void machine_process_ue_event(struct work_struct *work)
 
 pfn = evt->u.ue_error.physical_address >>
 	PAGE_SHIFT;
-memory_failure(pfn, 0);
+memory_failure(pfn, MF_MUST_KILL);
 			} else
 pr_warn("Failed to identify bad address from "
 	"where the uncorrectable error (UE) "
-- 
2.17.0



[PATCH 5/5] powerpc: remove unused to_tm() helper

2018-04-23 Thread Arnd Bergmann
to_tm() is now completely unused, the only reference being in the
_dump_time() helper that is also unused. This removes both, leaving
the rest of the powerpc RTC code y2038 safe as far as the hardware
supports.
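
(For reference, a minimal sketch of the generic replacement for the
removed helper; rtc_time64_to_tm() from rtc-lib performs the same
conversion without the 2038 overflow:)

	#include <linux/rtc.h>

	/* sketch: seconds since the epoch to broken-down time */
	static void example_to_tm(time64_t secs, struct rtc_time *tm)
	{
		rtc_time64_to_tm(secs, tm);
	}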

Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/include/asm/time.h   |  2 --
 arch/powerpc/kernel/time.c| 50 ---
 arch/powerpc/platforms/ps3/time.c | 24 ---
 3 files changed, 76 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index db546c034905..79bc9c3e4325 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -26,8 +26,6 @@ extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
 
-struct rtc_time;
-extern void to_tm(int tim, struct rtc_time * tm);
 extern void tick_broadcast_ipi_handler(void);
 
 extern void generic_calibrate_decr(void);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f9b0baa3fa2b..79bdeea85ab4 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -1138,56 +1138,6 @@ void __init time_init(void)
 #endif
 }
 
-
-#define FEBRUARY		2
-#define STARTOFTIME		1970
-#define SECDAY			86400L
-#define SECYR			(SECDAY * 365)
-#define leapyear(year)		((year) % 4 == 0 && \
-				 ((year) % 100 != 0 || (year) % 400 == 0))
-#define days_in_year(a)		(leapyear(a) ? 366 : 365)
-#define days_in_month(a)	(month_days[(a) - 1])
-
-static int month_days[12] = {
-   31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
-};
-
-void to_tm(int tim, struct rtc_time * tm)
-{
-   register int i;
-   register long   hms, day;
-
-   day = tim / SECDAY;
-   hms = tim % SECDAY;
-
-   /* Hours, minutes, seconds are easy */
-   tm->tm_hour = hms / 3600;
-   tm->tm_min = (hms % 3600) / 60;
-   tm->tm_sec = (hms % 3600) % 60;
-
-   /* Number of years in days */
-   for (i = STARTOFTIME; day >= days_in_year(i); i++)
-   day -= days_in_year(i);
-   tm->tm_year = i;
-
-   /* Number of months in days left */
-   if (leapyear(tm->tm_year))
-   days_in_month(FEBRUARY) = 29;
-   for (i = 1; day >= days_in_month(i); i++)
-   day -= days_in_month(i);
-   days_in_month(FEBRUARY) = 28;
-   tm->tm_mon = i;
-
-   /* Days are what is left over (+1) from all that. */
-   tm->tm_mday = day + 1;
-
-   /*
-* No-one uses the day of the week.
-*/
-   tm->tm_wday = -1;
-}
-EXPORT_SYMBOL(to_tm);
-
 /*
  * Divide a 128-bit dividend by a 32-bit divisor, leaving a 128 bit
  * result.
diff --git a/arch/powerpc/platforms/ps3/time.c 
b/arch/powerpc/platforms/ps3/time.c
index 9dac125c997e..08ca76e23d09 100644
--- a/arch/powerpc/platforms/ps3/time.c
+++ b/arch/powerpc/platforms/ps3/time.c
@@ -28,30 +28,6 @@
 
 #include "platform.h"
 
-#define dump_tm(_a) _dump_tm(_a, __func__, __LINE__)
-static void _dump_tm(const struct rtc_time *tm, const char* func, int line)
-{
-   pr_debug("%s:%d tm_sec  %d\n", func, line, tm->tm_sec);
-   pr_debug("%s:%d tm_min  %d\n", func, line, tm->tm_min);
-   pr_debug("%s:%d tm_hour %d\n", func, line, tm->tm_hour);
-   pr_debug("%s:%d tm_mday %d\n", func, line, tm->tm_mday);
-   pr_debug("%s:%d tm_mon  %d\n", func, line, tm->tm_mon);
-   pr_debug("%s:%d tm_year %d\n", func, line, tm->tm_year);
-   pr_debug("%s:%d tm_wday %d\n", func, line, tm->tm_wday);
-}
-
-#define dump_time(_a) _dump_time(_a, __func__, __LINE__)
-static void __maybe_unused _dump_time(int time, const char *func,
-   int line)
-{
-   struct rtc_time tm;
-
-   to_tm(time, &tm);
-
-   pr_debug("%s:%d time %d\n", func, line, time);
-   _dump_tm(&tm, func, line);
-}
-
 void __init ps3_calibrate_decr(void)
 {
int result;
-- 
2.9.0



[PATCH 1/5] powerpc: always enable RTC_LIB

2018-04-23 Thread Arnd Bergmann
In order to use the rtc_tm_to_time64() and rtc_time64_to_tm()
helper functions in later patches, we have to ensure that
CONFIG_RTC_LIB is always built-in.

Note that this symbol only controls a couple of helper functions,
not the actual RTC subsystem, which remains optional and is
enabled with CONFIG_RTC_CLASS.

Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..de2cda320fdd 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -232,6 +232,7 @@ config PPC
select OF_RESERVED_MEM
select OLD_SIGACTION   if PPC32
select OLD_SIGSUSPEND
+   select RTC_LIB
select SPARSE_IRQ
select SYSCTL_EXCEPTION_TRACE
select VIRT_TO_BUS  if !PPC64
-- 
2.9.0



[PATCH 3/5] powerpc: use time64_t in read_persistent_clock

2018-04-23 Thread Arnd Bergmann
Looking through the remaining users of the deprecated mktime()
function, I found the powerpc rtc handlers, which use it in
place of rtc_tm_to_time64().

To clean this up, I'm changing over the read_persistent_clock()
function to the read_persistent_clock64() variant, and change
all the platform specific handlers along with it.
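
(Background, as a sketch: mktime() returns unsigned long, which is 32 bits
on 32-bit kernels, while mktime64() and rtc_tm_to_time64() return time64_t
everywhere:)

	time64_t t1 = mktime64(2040, 1, 1, 0, 0, 0);	/* safe on 32-bit too */

	struct rtc_time tm = { .tm_year = 140, .tm_mon = 0, .tm_mday = 1 };
	time64_t t2 = rtc_tm_to_time64(&tm);		/* the same instant */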

Signed-off-by: Arnd Bergmann 
---
v2: fix a typo: rtc_tm_to_time66 -> rtc_tm_to_time64
---
 arch/powerpc/include/asm/machdep.h|  2 +-
 arch/powerpc/include/asm/opal.h   |  2 +-
 arch/powerpc/include/asm/rtas.h   |  2 +-
 arch/powerpc/kernel/rtas-rtc.c|  4 ++--
 arch/powerpc/kernel/time.c|  7 +++
 arch/powerpc/platforms/8xx/m8xx_setup.c   |  4 +---
 arch/powerpc/platforms/maple/maple.h  |  2 +-
 arch/powerpc/platforms/maple/time.c   |  5 ++---
 arch/powerpc/platforms/pasemi/pasemi.h|  2 +-
 arch/powerpc/platforms/pasemi/time.c  |  4 ++--
 arch/powerpc/platforms/powermac/pmac.h|  2 +-
 arch/powerpc/platforms/powermac/time.c| 31 +++
 arch/powerpc/platforms/powernv/opal-rtc.c |  5 ++---
 arch/powerpc/platforms/ps3/platform.h |  2 +-
 arch/powerpc/platforms/ps3/time.c |  2 +-
 15 files changed, 31 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index ffe7c71e1132..a47de82fb8e2 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -83,7 +83,7 @@ struct machdep_calls {
 
int (*set_rtc_time)(struct rtc_time *);
void(*get_rtc_time)(struct rtc_time *);
-   unsigned long   (*get_boot_time)(void);
+   time64_t   (*get_boot_time)(void);
unsigned char   (*rtc_read_val)(int addr);
void(*rtc_write_val)(int addr, unsigned char val);
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 03e1a920491e..fc211bd98e0f 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -325,7 +325,7 @@ extern int opal_async_wait_response_interruptible(uint64_t 
token,
 extern int opal_get_sensor_data(u32 sensor_hndl, u32 *sensor_data);
 
 struct rtc_time;
-extern unsigned long opal_get_boot_time(void);
+extern time64_t opal_get_boot_time(void);
 extern void opal_nvram_init(void);
 extern void opal_flash_update_init(void);
 extern void opal_flash_update_print_message(void);
diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index ec9dd79398ee..71e393c46a49 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -361,7 +361,7 @@ extern int rtas_offline_cpus_mask(cpumask_var_t cpus);
 extern int rtas_ibm_suspend_me(u64 handle);
 
 struct rtc_time;
-extern unsigned long rtas_get_boot_time(void);
+extern time64_t rtas_get_boot_time(void);
 extern void rtas_get_rtc_time(struct rtc_time *rtc_time);
 extern int rtas_set_rtc_time(struct rtc_time *rtc_time);
 
diff --git a/arch/powerpc/kernel/rtas-rtc.c b/arch/powerpc/kernel/rtas-rtc.c
index 49600985c7ef..a28239b8b0c0 100644
--- a/arch/powerpc/kernel/rtas-rtc.c
+++ b/arch/powerpc/kernel/rtas-rtc.c
@@ -13,7 +13,7 @@
 
 #define MAX_RTC_WAIT 5000  /* 5 sec */
 #define RTAS_CLOCK_BUSY (-2)
-unsigned long __init rtas_get_boot_time(void)
+time64_t __init rtas_get_boot_time(void)
 {
int ret[8];
int error;
@@ -38,7 +38,7 @@ unsigned long __init rtas_get_boot_time(void)
return 0;
}
 
-   return mktime(ret[0], ret[1], ret[2], ret[3], ret[4], ret[5]);
+   return mktime64(ret[0], ret[1], ret[2], ret[3], ret[4], ret[5]);
 }
 
 /* NOTE: get_rtc_time will get an error if executed in interrupt context
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 360e71d455cc..afb27962b396 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -795,7 +795,7 @@ int update_persistent_clock(struct timespec now)
	return ppc_md.set_rtc_time(&tm);
 }
 
-static void __read_persistent_clock(struct timespec *ts)
+static void __read_persistent_clock(struct timespec64 *ts)
 {
struct rtc_time tm;
static int first = 1;
@@ -819,11 +819,10 @@ static void __read_persistent_clock(struct timespec *ts)
}
	ppc_md.get_rtc_time(&tm);
 
-   ts->tv_sec = mktime(tm.tm_year+1900, tm.tm_mon+1, tm.tm_mday,
-   tm.tm_hour, tm.tm_min, tm.tm_sec);
+   ts->tv_sec = rtc_tm_to_time64(&tm);
 }
 
-void read_persistent_clock(struct timespec *ts)
+void read_persistent_clock64(struct timespec64 *ts)
 {
__read_persistent_clock(ts);
 
diff --git a/arch/powerpc/platforms/8xx/m8xx_setup.c 
b/arch/powerpc/platforms/8xx/m8xx_setup.c
index 2188d691a40f..d76daa90647b 100644
--- a/arch/powerpc/platforms/8xx/m8xx_setup.c
+++ b/arch/powerpc/platforms/8xx/m8xx_setup.c
@@ -192,9 +192,7 @@ void mpc8xx_get_rtc_time(struct rtc_time *tm)
 
/* Get time 

[PATCH 2/5] powerpc: rtas: clean up time handling

2018-04-23 Thread Arnd Bergmann
The to_tm() helper function operates on a signed integer for the time,
so it will suffer from overflow in 2038, even on 64-bit kernels.
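
(Concretely: to_tm() takes the time as a signed 32-bit int of seconds since
1970, and 2^31 - 1 = 2147483647 seconds after the epoch falls on
2038-01-19 03:14:07 UTC, so the conversion wraps there regardless of the
kernel's word size.)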

Rather than fix that function, this replaces its use in the rtas
procfs implementation with the standard rtc_time64_to_tm() helper
that is very similar but is not affected by the overflow.

In order to actually support long times, the parser function gets
changed to 64-bit user input and output as well. Note that the tm_mon
and tm_year representation is slightly different, so we have to manually
add an offset here.
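
To make the offset concrete, a sketch of the conversion this patch does
(rtas_year/rtas_mon are illustrative locals, not names from the patch):

    struct rtc_time tm;
    int rtas_year, rtas_mon;

    rtc_time64_to_tm(nowtime, &tm);
    /* struct rtc_time counts years from 1900 and months 0-11;
     * RTAS expects the full calendar year and a 1-12 month. */
    rtas_year = tm.tm_year + 1900;      /* e.g. 118 -> 2018 */
    rtas_mon  = tm.tm_mon + 1;          /* e.g. 3   -> April */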

Signed-off-by: Arnd Bergmann 
---
v2: fix up rtc_time64_to_tm() calling conventions
---
 arch/powerpc/kernel/rtas-proc.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/rtas-proc.c b/arch/powerpc/kernel/rtas-proc.c
index fb070d8cad07..815027476600 100644
--- a/arch/powerpc/kernel/rtas-proc.c
+++ b/arch/powerpc/kernel/rtas-proc.c
@@ -280,7 +280,7 @@ static int __init proc_rtas_init(void)
 
 __initcall(proc_rtas_init);
 
-static int parse_number(const char __user *p, size_t count, unsigned long *val)
+static int parse_number(const char __user *p, size_t count, u64 *val)
 {
char buf[40];
char *end;
@@ -293,7 +293,7 @@ static int parse_number(const char __user *p, size_t count, 
unsigned long *val)
 
buf[count] = 0;
 
-   *val = simple_strtoul(buf, , 10);
+   *val = simple_strtoull(buf, , 10);
if (*end && *end != '\n')
return -EINVAL;
 
@@ -307,17 +307,17 @@ static ssize_t ppc_rtas_poweron_write(struct file *file,
const char __user *buf, size_t count, loff_t *ppos)
 {
struct rtc_time tm;
-   unsigned long nowtime;
+   time64_t nowtime;
	int error = parse_number(buf, count, &nowtime);
if (error)
return error;
 
power_on_time = nowtime; /* save the time */
 
-   to_tm(nowtime, &tm);
+   rtc_time64_to_tm(nowtime, &tm);
 
error = rtas_call(rtas_token("set-time-for-power-on"), 7, 1, NULL, 
-   tm.tm_year, tm.tm_mon, tm.tm_mday, 
+   tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
tm.tm_hour, tm.tm_min, tm.tm_sec, 0 /* nano */);
if (error)
printk(KERN_WARNING "error: setting poweron time returned: 
%s\n", 
@@ -373,14 +373,14 @@ static ssize_t ppc_rtas_clock_write(struct file *file,
const char __user *buf, size_t count, loff_t *ppos)
 {
struct rtc_time tm;
-   unsigned long nowtime;
+   time64_t nowtime;
	int error = parse_number(buf, count, &nowtime);
if (error)
return error;
 
-   to_tm(nowtime, &tm);
+   rtc_time64_to_tm(nowtime, &tm);
error = rtas_call(rtas_token("set-time-of-day"), 7, 1, NULL, 
-   tm.tm_year, tm.tm_mon, tm.tm_mday, 
+   tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
tm.tm_hour, tm.tm_min, tm.tm_sec, 0);
if (error)
printk(KERN_WARNING "error: setting the clock returned: %s\n", 
@@ -401,8 +401,8 @@ static int ppc_rtas_clock_show(struct seq_file *m, void *v)
unsigned int year, mon, day, hour, min, sec;
year = ret[0]; mon  = ret[1]; day  = ret[2];
hour = ret[3]; min  = ret[4]; sec  = ret[5];
-   seq_printf(m, "%lu\n",
-   mktime(year, mon, day, hour, min, sec));
+   seq_printf(m, "%lld\n",
+   mktime64(year, mon, day, hour, min, sec));
}
return 0;
 }
@@ -731,7 +731,7 @@ static void get_location_code(struct seq_file *m, struct 
individual_sensor *s,
 static ssize_t ppc_rtas_tone_freq_write(struct file *file,
const char __user *buf, size_t count, loff_t *ppos)
 {
-   unsigned long freq;
+   u64 freq;
	int error = parse_number(buf, count, &freq);
if (error)
return error;
@@ -756,7 +756,7 @@ static int ppc_rtas_tone_freq_show(struct seq_file *m, void 
*v)
 static ssize_t ppc_rtas_tone_volume_write(struct file *file,
const char __user *buf, size_t count, loff_t *ppos)
 {
-   unsigned long volume;
+   u64 volume;
	int error = parse_number(buf, count, &volume);
if (error)
return error;
-- 
2.9.0



[PATCH 4/5] powerpc: use time64_t in update_persistent_clock

2018-04-23 Thread Arnd Bergmann
update_persistent_clock() is deprecated because it suffers from overflow
in 2038 on 32-bit architectures. This changes powerpc to use the
update_persistent_clock64() replacement, and to pass down 64-bit
timestamps consistently.

This is now simpler, as we no longer have to worry about the offset
numbers in tm_year and tm_mon that are different between the Linux
conventions and RTAS.

Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/kernel/time.c  |  6 ++
 arch/powerpc/platforms/8xx/m8xx_setup.c |  7 +++
 arch/powerpc/platforms/powermac/time.c  | 17 -
 3 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index afb27962b396..f9b0baa3fa2b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -781,16 +781,14 @@ void __init generic_calibrate_decr(void)
}
 }
 
-int update_persistent_clock(struct timespec now)
+int update_persistent_clock64(struct timespec64 now)
 {
struct rtc_time tm;
 
if (!ppc_md.set_rtc_time)
return -ENODEV;
 
-   to_tm(now.tv_sec + 1 + timezone_offset, &tm);
-   tm.tm_year -= 1900;
-   tm.tm_mon -= 1;
+   rtc_time64_to_tm(now.tv_sec + 1 + timezone_offset, &tm);
 
	return ppc_md.set_rtc_time(&tm);
 }
diff --git a/arch/powerpc/platforms/8xx/m8xx_setup.c 
b/arch/powerpc/platforms/8xx/m8xx_setup.c
index d76daa90647b..027c42d8966c 100644
--- a/arch/powerpc/platforms/8xx/m8xx_setup.c
+++ b/arch/powerpc/platforms/8xx/m8xx_setup.c
@@ -169,15 +169,14 @@ int mpc8xx_set_rtc_time(struct rtc_time *tm)
 {
sitk8xx_t __iomem *sys_tmr1;
sit8xx_t __iomem *sys_tmr2;
-   int time;
+   time64_t time;
 
sys_tmr1 = immr_map(im_sitk);
sys_tmr2 = immr_map(im_sit);
-   time = mktime(tm->tm_year+1900, tm->tm_mon+1, tm->tm_mday,
- tm->tm_hour, tm->tm_min, tm->tm_sec);
+   time = rtc_tm_to_time64(tm);
 
	out_be32(&sys_tmr1->sitk_rtck, KAPWR_KEY);
-   out_be32(&sys_tmr2->sit_rtc, time);
+   out_be32(&sys_tmr2->sit_rtc, (u32)time);
	out_be32(&sys_tmr1->sitk_rtck, ~KAPWR_KEY);
 
immr_unmap(sys_tmr2);
diff --git a/arch/powerpc/platforms/powermac/time.c 
b/arch/powerpc/platforms/powermac/time.c
index d5d1c452038e..7c968e46736f 100644
--- a/arch/powerpc/platforms/powermac/time.c
+++ b/arch/powerpc/platforms/powermac/time.c
@@ -84,15 +84,6 @@ long __init pmac_time_init(void)
return delta;
 }
 
-#if defined(CONFIG_ADB_CUDA) || defined(CONFIG_ADB_PMU) || \
-defined(CONFIG_PMAC_SMU)
-static unsigned long from_rtc_time(struct rtc_time *tm)
-{
-   return mktime(tm->tm_year+1900, tm->tm_mon+1, tm->tm_mday,
- tm->tm_hour, tm->tm_min, tm->tm_sec);
-}
-#endif
-
 #ifdef CONFIG_ADB_CUDA
 static time64_t cuda_get_time(void)
 {
@@ -115,10 +106,10 @@ static time64_t cuda_get_time(void)
 
 static int cuda_set_rtc_time(struct rtc_time *tm)
 {
-   unsigned int nowtime;
+   time64_t nowtime;
struct adb_request req;
 
-   nowtime = from_rtc_time(tm) + RTC_OFFSET;
+   nowtime = rtc_tm_to_time64(tm) + RTC_OFFSET;
	if (cuda_request(&req, NULL, 6, CUDA_PACKET, CUDA_SET_TIME,
 nowtime >> 24, nowtime >> 16, nowtime >> 8,
 nowtime) < 0)
@@ -158,10 +149,10 @@ static time64_t pmu_get_time(void)
 
 static int pmu_set_rtc_time(struct rtc_time *tm)
 {
-   unsigned int nowtime;
+   time64_t nowtime;
struct adb_request req;
 
-   nowtime = from_rtc_time(tm) + RTC_OFFSET;
+   nowtime = rtc_tm_to_time64(tm) + RTC_OFFSET;
	if (pmu_request(&req, NULL, 5, PMU_SET_RTC, nowtime >> 24,
nowtime >> 16, nowtime >> 8, nowtime) < 0)
return -ENXIO;
-- 
2.9.0



Re: [PATCH 2/5] powerpc: rtas: clean up time handling

2018-04-23 Thread Arnd Bergmann
On Mon, Apr 23, 2018 at 10:10 AM, Arnd Bergmann  wrote:

> @@ -307,17 +307,17 @@ static ssize_t ppc_rtas_poweron_write(struct file *file,
> const char __user *buf, size_t count, loff_t *ppos)
>  {
> struct rtc_time tm;
> -   unsigned long nowtime;
> +   time64_t nowtime;
> int error = parse_number(buf, count, &nowtime);
> if (error)
> return error;
>
> power_on_time = nowtime; /* save the time */
>
> -   to_tm(nowtime, &tm);
> +   rtc_time64_to_tm(nowtime, 0, &tm);
>

Here was the other obvious typo: I had originally changed this to
'time64_to_tm()', which has slightly different calling conventions to
'rtc_time64_to_tm()', and messed it up when I changed it back
after it turned out I needed to select RTC_LIB anyway.
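
For reference, the two signatures in question, as found in the kernel tree
(time64_to_tm() in kernel/time/timeconv.c, rtc_time64_to_tm() in
drivers/rtc/rtc-lib.c):

    void time64_to_tm(time64_t totalsecs, int offset, struct tm *result);
    void rtc_time64_to_tm(time64_t time, struct rtc_time *tm);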

v2 coming.

Arnd


Re: [PATCH 3/5] powerpc: use time64_t in read_persistent_clock

2018-04-23 Thread Arnd Bergmann
On Mon, Apr 23, 2018 at 10:10 AM, Arnd Bergmann  wrote:

> @@ -170,7 +170,6 @@ unsigned long __init maple_get_boot_time(void)
> request_resource(&ioport_resource, &rtc_iores);
>
> maple_get_rtc_time(&tm);
> -   return mktime(tm.tm_year+1900, tm.tm_mon+1, tm.tm_mday,
> - tm.tm_hour, tm.tm_min, tm.tm_sec);
> +   return rtc_tm_to_time66(&tm);
>  }

No, it's not rtc_tm_to_time66 ;(

I did some build testing over the weekend to make sure I found all the stupid
bugs, but then sent out the version without the two fixes. Please ignore all
five patches; I will follow up with a v2 that fixes this one and another
equally trivial bug.

   Arnd


[PATCH 2/5] powerpc: rtas: clean up time handling

2018-04-23 Thread Arnd Bergmann
The to_tm() helper function operates on a signed integer for the time,
so it will suffer from overflow in 2038, even on 64-bit kernels.

Rather than fix that function, this replaces its use in the rtas
procfs implementation with the standard rtc_time64_to_tm() helper
that is very similar but is not affected by the overflow.

In order to actually support long times, the parser function gets
changed to 64-bit user input and output as well. Note that the tm_mon
and tm_year representation is slightly different, so we have to manually
add an offset here.

Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/kernel/rtas-proc.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/rtas-proc.c b/arch/powerpc/kernel/rtas-proc.c
index fb070d8cad07..80864b787745 100644
--- a/arch/powerpc/kernel/rtas-proc.c
+++ b/arch/powerpc/kernel/rtas-proc.c
@@ -280,7 +280,7 @@ static int __init proc_rtas_init(void)
 
 __initcall(proc_rtas_init);
 
-static int parse_number(const char __user *p, size_t count, unsigned long *val)
+static int parse_number(const char __user *p, size_t count, u64 *val)
 {
char buf[40];
char *end;
@@ -293,7 +293,7 @@ static int parse_number(const char __user *p, size_t count, 
unsigned long *val)
 
buf[count] = 0;
 
-   *val = simple_strtoul(buf, , 10);
+   *val = simple_strtoull(buf, , 10);
if (*end && *end != '\n')
return -EINVAL;
 
@@ -307,17 +307,17 @@ static ssize_t ppc_rtas_poweron_write(struct file *file,
const char __user *buf, size_t count, loff_t *ppos)
 {
struct rtc_time tm;
-   unsigned long nowtime;
+   time64_t nowtime;
	int error = parse_number(buf, count, &nowtime);
if (error)
return error;
 
power_on_time = nowtime; /* save the time */
 
-   to_tm(nowtime, &tm);
+   rtc_time64_to_tm(nowtime, 0, &tm);
 
error = rtas_call(rtas_token("set-time-for-power-on"), 7, 1, NULL, 
-   tm.tm_year, tm.tm_mon, tm.tm_mday, 
+   tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
tm.tm_hour, tm.tm_min, tm.tm_sec, 0 /* nano */);
if (error)
printk(KERN_WARNING "error: setting poweron time returned: 
%s\n", 
@@ -373,14 +373,14 @@ static ssize_t ppc_rtas_clock_write(struct file *file,
const char __user *buf, size_t count, loff_t *ppos)
 {
struct rtc_time tm;
-   unsigned long nowtime;
+   time64_t nowtime;
	int error = parse_number(buf, count, &nowtime);
if (error)
return error;
 
-   to_tm(nowtime, &tm);
+   rtc_time64_to_tm(nowtime, 0, &tm);
error = rtas_call(rtas_token("set-time-of-day"), 7, 1, NULL, 
-   tm.tm_year, tm.tm_mon, tm.tm_mday, 
+   tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
tm.tm_hour, tm.tm_min, tm.tm_sec, 0);
if (error)
printk(KERN_WARNING "error: setting the clock returned: %s\n", 
@@ -401,8 +401,8 @@ static int ppc_rtas_clock_show(struct seq_file *m, void *v)
unsigned int year, mon, day, hour, min, sec;
year = ret[0]; mon  = ret[1]; day  = ret[2];
hour = ret[3]; min  = ret[4]; sec  = ret[5];
-   seq_printf(m, "%lu\n",
-   mktime(year, mon, day, hour, min, sec));
+   seq_printf(m, "%lld\n",
+   mktime64(year, mon, day, hour, min, sec));
}
return 0;
 }
@@ -731,7 +731,7 @@ static void get_location_code(struct seq_file *m, struct 
individual_sensor *s,
 static ssize_t ppc_rtas_tone_freq_write(struct file *file,
const char __user *buf, size_t count, loff_t *ppos)
 {
-   unsigned long freq;
+   u64 freq;
	int error = parse_number(buf, count, &freq);
if (error)
return error;
@@ -756,7 +756,7 @@ static int ppc_rtas_tone_freq_show(struct seq_file *m, void 
*v)
 static ssize_t ppc_rtas_tone_volume_write(struct file *file,
const char __user *buf, size_t count, loff_t *ppos)
 {
-   unsigned long volume;
+   u64 volume;
	int error = parse_number(buf, count, &volume);
if (error)
return error;
-- 
2.9.0



[PATCH 5/5] powerpc: remove unused to_tm() helper

2018-04-23 Thread Arnd Bergmann
to_tm() is now completely unused, the only reference being in the
_dump_time() helper that is also unused. This removes both, leaving
the rest of the powerpc RTC code y2038-safe as far as the hardware
supports.

Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/include/asm/time.h   |  2 --
 arch/powerpc/kernel/time.c| 50 ---
 arch/powerpc/platforms/ps3/time.c | 24 ---
 3 files changed, 76 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index db546c034905..79bc9c3e4325 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -26,8 +26,6 @@ extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
 
-struct rtc_time;
-extern void to_tm(int tim, struct rtc_time * tm);
 extern void tick_broadcast_ipi_handler(void);
 
 extern void generic_calibrate_decr(void);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f9b0baa3fa2b..79bdeea85ab4 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -1138,56 +1138,6 @@ void __init time_init(void)
 #endif
 }
 
-
-#define FEBRUARY   2
-#defineSTARTOFTIME 1970
-#define SECDAY 86400L
-#define SECYR  (SECDAY * 365)
-#defineleapyear(year)  ((year) % 4 == 0 && \
-((year) % 100 != 0 || (year) % 400 == 0))
-#definedays_in_year(a) (leapyear(a) ? 366 : 365)
-#definedays_in_month(a)(month_days[(a) - 1])
-
-static int month_days[12] = {
-   31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
-};
-
-void to_tm(int tim, struct rtc_time * tm)
-{
-   register inti;
-   register long   hms, day;
-
-   day = tim / SECDAY;
-   hms = tim % SECDAY;
-
-   /* Hours, minutes, seconds are easy */
-   tm->tm_hour = hms / 3600;
-   tm->tm_min = (hms % 3600) / 60;
-   tm->tm_sec = (hms % 3600) % 60;
-
-   /* Number of years in days */
-   for (i = STARTOFTIME; day >= days_in_year(i); i++)
-   day -= days_in_year(i);
-   tm->tm_year = i;
-
-   /* Number of months in days left */
-   if (leapyear(tm->tm_year))
-   days_in_month(FEBRUARY) = 29;
-   for (i = 1; day >= days_in_month(i); i++)
-   day -= days_in_month(i);
-   days_in_month(FEBRUARY) = 28;
-   tm->tm_mon = i;
-
-   /* Days are what is left over (+1) from all that. */
-   tm->tm_mday = day + 1;
-
-   /*
-* No-one uses the day of the week.
-*/
-   tm->tm_wday = -1;
-}
-EXPORT_SYMBOL(to_tm);
-
 /*
  * Divide a 128-bit dividend by a 32-bit divisor, leaving a 128 bit
  * result.
diff --git a/arch/powerpc/platforms/ps3/time.c 
b/arch/powerpc/platforms/ps3/time.c
index 9dac125c997e..08ca76e23d09 100644
--- a/arch/powerpc/platforms/ps3/time.c
+++ b/arch/powerpc/platforms/ps3/time.c
@@ -28,30 +28,6 @@
 
 #include "platform.h"
 
-#define dump_tm(_a) _dump_tm(_a, __func__, __LINE__)
-static void _dump_tm(const struct rtc_time *tm, const char* func, int line)
-{
-   pr_debug("%s:%d tm_sec  %d\n", func, line, tm->tm_sec);
-   pr_debug("%s:%d tm_min  %d\n", func, line, tm->tm_min);
-   pr_debug("%s:%d tm_hour %d\n", func, line, tm->tm_hour);
-   pr_debug("%s:%d tm_mday %d\n", func, line, tm->tm_mday);
-   pr_debug("%s:%d tm_mon  %d\n", func, line, tm->tm_mon);
-   pr_debug("%s:%d tm_year %d\n", func, line, tm->tm_year);
-   pr_debug("%s:%d tm_wday %d\n", func, line, tm->tm_wday);
-}
-
-#define dump_time(_a) _dump_time(_a, __func__, __LINE__)
-static void __maybe_unused _dump_time(int time, const char *func,
-   int line)
-{
-   struct rtc_time tm;
-
-   to_tm(time, &tm);
-
-   pr_debug("%s:%d time%d\n", func, line, time);
-   _dump_tm(&tm, func, line);
-}
-
 void __init ps3_calibrate_decr(void)
 {
int result;
-- 
2.9.0



[PATCH 3/5] powerpc: use time64_t in read_persistent_clock

2018-04-23 Thread Arnd Bergmann
Looking through the remaining users of the deprecated mktime()
function, I found the powerpc rtc handlers, which use it in
place of rtc_tm_to_time64().

To clean this up, I'm changing over the read_persistent_clock()
function to the read_persistent_clock64() variant, and change
all the platform specific handlers along with it.

Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/include/asm/machdep.h|  2 +-
 arch/powerpc/include/asm/opal.h   |  2 +-
 arch/powerpc/include/asm/rtas.h   |  2 +-
 arch/powerpc/kernel/rtas-rtc.c|  4 ++--
 arch/powerpc/kernel/time.c|  7 +++
 arch/powerpc/platforms/8xx/m8xx_setup.c   |  4 +---
 arch/powerpc/platforms/maple/maple.h  |  2 +-
 arch/powerpc/platforms/maple/time.c   |  5 ++---
 arch/powerpc/platforms/pasemi/pasemi.h|  2 +-
 arch/powerpc/platforms/pasemi/time.c  |  4 ++--
 arch/powerpc/platforms/powermac/pmac.h|  2 +-
 arch/powerpc/platforms/powermac/time.c| 31 +++
 arch/powerpc/platforms/powernv/opal-rtc.c |  5 ++---
 arch/powerpc/platforms/ps3/platform.h |  2 +-
 arch/powerpc/platforms/ps3/time.c |  2 +-
 15 files changed, 31 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index ffe7c71e1132..a47de82fb8e2 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -83,7 +83,7 @@ struct machdep_calls {
 
int (*set_rtc_time)(struct rtc_time *);
void(*get_rtc_time)(struct rtc_time *);
-   unsigned long   (*get_boot_time)(void);
+   time64_t(*get_boot_time)(void);
unsigned char   (*rtc_read_val)(int addr);
void(*rtc_write_val)(int addr, unsigned char val);
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 03e1a920491e..fc211bd98e0f 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -325,7 +325,7 @@ extern int opal_async_wait_response_interruptible(uint64_t 
token,
 extern int opal_get_sensor_data(u32 sensor_hndl, u32 *sensor_data);
 
 struct rtc_time;
-extern unsigned long opal_get_boot_time(void);
+extern time64_t opal_get_boot_time(void);
 extern void opal_nvram_init(void);
 extern void opal_flash_update_init(void);
 extern void opal_flash_update_print_message(void);
diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index ec9dd79398ee..71e393c46a49 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -361,7 +361,7 @@ extern int rtas_offline_cpus_mask(cpumask_var_t cpus);
 extern int rtas_ibm_suspend_me(u64 handle);
 
 struct rtc_time;
-extern unsigned long rtas_get_boot_time(void);
+extern time64_t rtas_get_boot_time(void);
 extern void rtas_get_rtc_time(struct rtc_time *rtc_time);
 extern int rtas_set_rtc_time(struct rtc_time *rtc_time);
 
diff --git a/arch/powerpc/kernel/rtas-rtc.c b/arch/powerpc/kernel/rtas-rtc.c
index 49600985c7ef..a28239b8b0c0 100644
--- a/arch/powerpc/kernel/rtas-rtc.c
+++ b/arch/powerpc/kernel/rtas-rtc.c
@@ -13,7 +13,7 @@
 
 #define MAX_RTC_WAIT 5000  /* 5 sec */
 #define RTAS_CLOCK_BUSY (-2)
-unsigned long __init rtas_get_boot_time(void)
+time64_t __init rtas_get_boot_time(void)
 {
int ret[8];
int error;
@@ -38,7 +38,7 @@ unsigned long __init rtas_get_boot_time(void)
return 0;
}
 
-   return mktime(ret[0], ret[1], ret[2], ret[3], ret[4], ret[5]);
+   return mktime64(ret[0], ret[1], ret[2], ret[3], ret[4], ret[5]);
 }
 
 /* NOTE: get_rtc_time will get an error if executed in interrupt context
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 360e71d455cc..afb27962b396 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -795,7 +795,7 @@ int update_persistent_clock(struct timespec now)
	return ppc_md.set_rtc_time(&tm);
 }
 
-static void __read_persistent_clock(struct timespec *ts)
+static void __read_persistent_clock(struct timespec64 *ts)
 {
struct rtc_time tm;
static int first = 1;
@@ -819,11 +819,10 @@ static void __read_persistent_clock(struct timespec *ts)
}
	ppc_md.get_rtc_time(&tm);
 
-   ts->tv_sec = mktime(tm.tm_year+1900, tm.tm_mon+1, tm.tm_mday,
-   tm.tm_hour, tm.tm_min, tm.tm_sec);
+   ts->tv_sec = rtc_tm_to_time64(&tm);
 }
 
-void read_persistent_clock(struct timespec *ts)
+void read_persistent_clock64(struct timespec64 *ts)
 {
__read_persistent_clock(ts);
 
diff --git a/arch/powerpc/platforms/8xx/m8xx_setup.c 
b/arch/powerpc/platforms/8xx/m8xx_setup.c
index 2188d691a40f..d76daa90647b 100644
--- a/arch/powerpc/platforms/8xx/m8xx_setup.c
+++ b/arch/powerpc/platforms/8xx/m8xx_setup.c
@@ -192,9 +192,7 @@ void mpc8xx_get_rtc_time(struct rtc_time *tm)
 
/* Get time from the RTC. */
	data = in_be32(&sys_tmr->sit_rtc);
- 

Re: [PATCH v10 12/25] mm: cache some VMA fields in the vm_fault structure

2018-04-23 Thread Minchan Kim
On Tue, Apr 17, 2018 at 04:33:18PM +0200, Laurent Dufour wrote:
> When handling a speculative page fault, the vma->vm_flags and
> vma->vm_page_prot fields are read once the page table lock is released, so
> there is no longer any guarantee that these fields will not change behind
> our back. They will be saved in the vm_fault structure before the VMA is
> checked for changes.

Sorry, I cannot understand this.
If it is changed under us, what happens? If it's critical, why can't we
check with the seqcounter?
Clearly, I'm not understanding the logic here. However, it's a global
change even without CONFIG_SPF, so I want to be more careful.
It would be better to describe why we need to snapshot those values
into vm_fault rather than prevent the race.
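
For reference, the window the patch description alludes to, as an
illustrative timeline (whether a seqcount re-check could cover it is
exactly the question above):

    /*
     *   SPF handler                        another thread
     *   -----------                        --------------
     *   vmf->vma_flags = vma->vm_flags;
     *                                      mprotect() updates vma->vm_flags
     *   ...fault continues using the
     *   snapshotted vmf->vma_flags...
     */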

Thanks.

> 
> This patch also sets the fields in hugetlb_no_page() and
> __collapse_huge_page_swapin() even if they are not needed by the callee.
> 
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h | 10 --
>  mm/huge_memory.c   |  6 +++---
>  mm/hugetlb.c   |  2 ++
>  mm/khugepaged.c|  2 ++
>  mm/memory.c| 50 ++
>  mm/migrate.c   |  2 +-
>  6 files changed, 42 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f6edd15563bc..c65205c8c558 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -367,6 +367,12 @@ struct vm_fault {
>* page table to avoid allocation from
>* atomic context.
>*/
> + /*
> +  * These entries are required when handling speculative page fault.
> +  * This way the page handling is done using consistent field values.
> +  */
> + unsigned long vma_flags;
> + pgprot_t vma_page_prot;
>  };
>  
>  /* page entry size for vm->huge_fault() */
> @@ -687,9 +693,9 @@ void free_compound_page(struct page *page);
>   * pte_mkwrite.  But get_user_pages can cause write faults for mappings
>   * that do not have writing enabled, when used by access_process_vm.
>   */
> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
>  {
> - if (likely(vma->vm_flags & VM_WRITE))
> + if (likely(vma_flags & VM_WRITE))
>   pte = pte_mkwrite(pte);
>   return pte;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a3a1815f8e11..da2afda67e68 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1194,8 +1194,8 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault 
> *vmf, pmd_t orig_pmd,
>  
>   for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
>   pte_t entry;
> - entry = mk_pte(pages[i], vma->vm_page_prot);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + entry = mk_pte(pages[i], vmf->vma_page_prot);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>   memcg = (void *)page_private(pages[i]);
>   set_page_private(pages[i], 0);
>   page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
> @@ -2168,7 +2168,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   entry = pte_swp_mksoft_dirty(entry);
>   } else {
>   entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
> - entry = maybe_mkwrite(entry, vma);
> + entry = maybe_mkwrite(entry, vma->vm_flags);
>   if (!write)
>   entry = pte_wrprotect(entry);
>   if (!young)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 218679138255..774864153407 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3718,6 +3718,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   .vma = vma,
>   .address = address,
>   .flags = flags,
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,
>   /*
>* Hard to debug if it ends up being
>* used by a callee that assumes
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 0b28af4b950d..2b02a9f9589e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -887,6 +887,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct 
> *mm,
>   .flags = FAULT_FLAG_ALLOW_RETRY,
>   .pmd = pmd,
>   .pgoff = linear_page_index(vma, address),
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,
>   };
>  
>   /* we only decide to swapin, if there is enough young ptes */
> diff --git 

Re: [PATCH v10 09/25] mm: protect VMA modifications using VMA sequence count

2018-04-23 Thread Minchan Kim
On Tue, Apr 17, 2018 at 04:33:15PM +0200, Laurent Dufour wrote:
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
> 
> This patch provides protection against the VMA modifications done in:
>   - madvise()
>   - mpol_rebind_policy()
>   - vma_replace_policy()
>   - change_prot_numa()
>   - mlock(), munlock()
>   - mprotect()
>   - mmap_region()
>   - collapse_huge_page()
>   - userfaultd registering services
> 
> In addition, VMA fields which will be read during the speculative fault
> path need to be written using WRITE_ONCE to prevent writes from being
> split and intermediate values from being made visible to other CPUs.
> 
> Signed-off-by: Laurent Dufour 
> ---
>  fs/proc/task_mmu.c |  5 -
>  fs/userfaultfd.c   | 17 +
>  mm/khugepaged.c|  3 +++
>  mm/madvise.c   |  6 +-
>  mm/mempolicy.c | 51 ++-
>  mm/mlock.c | 13 -
>  mm/mmap.c  | 22 +-
>  mm/mprotect.c  |  4 +++-
>  mm/swap_state.c|  8 ++--
>  9 files changed, 89 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c486ad4b43f0..aeb417f28839 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, 
> const char __user *buf,
>   goto out_mm;
>   }
>   for (vma = mm->mmap; vma; vma = vma->vm_next) {
> - vma->vm_flags &= ~VM_SOFTDIRTY;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags,
> +vma->vm_flags & 
> ~VM_SOFTDIRTY);
>   vma_set_page_prot(vma);
> + vm_write_end(vma);

trivial:

I think it's tricky to maintain the rule that VMA fields read during SPF must
be accessed with READ_ONCE/WRITE_ONCE. I think we need some accessor to
read/write them rather than raw accesses like in vma_set_page_prot(). Maybe
an spf prefix would be helpful.

vma_spf_set_value(vma, vm_flags, val);

We could also add some markers to vm_area_struct's fields to indicate that
people shouldn't access those fields directly.

Just a thought.
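
A rough sketch of what such an accessor could look like, building on the
vm_write_begin()/vm_write_end() helpers from this series (the name and shape
are hypothetical, just to make the idea concrete):

    /* hypothetical: update a VMA field that SPF readers may access
     * locklessly, keeping the WRITE_ONCE in one place */
    #define vma_spf_set_value(vma, field, val)                      \
            do {                                                    \
                    vm_write_begin(vma);                            \
                    WRITE_ONCE((vma)->field, (val));                \
                    vm_write_end(vma);                              \
            } while (0)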


>   }
>   downgrade_write(&mm->mmap_sem);


> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index fe079756bb18..8a8a402ed59f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -575,6 +575,10 @@ static unsigned long swapin_nr_pages(unsigned long 
> offset)
>   * the readahead.
>   *
>   * Caller must hold down_read on the vma->vm_mm if vmf->vma is not NULL.
> + * This is needed to ensure the VMA will not be freed in our back. In the 
> case
> + * of the speculative page fault handler, this cannot happen, even if we 
> don't
> + * hold the mmap_sem. Callees are assumed to take care of reading VMA's 
> fields

I guess a reader would be curious about *why* this is safe with SPF.
A comment about the why would be helpful for reviewers.

> + * using READ_ONCE() to read consistent values.
>   */
>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>   struct vm_fault *vmf)
> @@ -668,9 +672,9 @@ static inline void swap_ra_clamp_pfn(struct 
> vm_area_struct *vma,
>unsigned long *start,
>unsigned long *end)
>  {
> - *start = max3(lpfn, PFN_DOWN(vma->vm_start),
> + *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
> PFN_DOWN(faddr & PMD_MASK));
> - *end = min3(rpfn, PFN_DOWN(vma->vm_end),
> + *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>   PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>  }
>  
> -- 
> 2.7.4
> 


Re: [PATCH v2 5/7] ocxl: Expose the thread_id needed for wait on p9

2018-04-23 Thread Andrew Donnellan

On 18/04/18 11:08, Alastair D'Silva wrote:

From: Alastair D'Silva 

In order to successfully issue as_notify, an AFU needs to know the TID
to notify, which in turn means that this information should be
available in userspace so it can be communicated to the AFU.
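
A hedged sketch of the intended userspace flow (afu_fd and afu_program_tid()
are hypothetical; the ioctl and struct come from this patch's uapi
additions):

    #include <sys/ioctl.h>
    #include <misc/ocxl.h>

    struct ocxl_ioctl_p9_wait arg;

    if (ioctl(afu_fd, OCXL_IOCTL_ENABLE_P9_WAIT, &arg) == 0)
            /* hand the TID to the AFU so it can target as_notify */
            afu_program_tid(arg.thread_id);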

Signed-off-by: Alastair D'Silva 


nitpicks below

Acked-by: Andrew Donnellan 


---
  drivers/misc/ocxl/context.c   |  5 +++-
  drivers/misc/ocxl/file.c  | 53 +++
  drivers/misc/ocxl/link.c  | 36 ++
  drivers/misc/ocxl/ocxl_internal.h |  1 +
  include/misc/ocxl.h   |  9 +++
  include/uapi/misc/ocxl.h  | 10 
  6 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
index 909e8807824a..95f74623113e 100644
--- a/drivers/misc/ocxl/context.c
+++ b/drivers/misc/ocxl/context.c
@@ -34,6 +34,8 @@ int ocxl_context_init(struct ocxl_context *ctx, struct 
ocxl_afu *afu,
	mutex_init(&ctx->xsl_error_lock);
	mutex_init(&ctx->irq_lock);
	idr_init(&ctx->irq_idr);
+   ctx->tidr = 0;
+
/*
 * Keep a reference on the AFU to make sure it's valid for the
 * duration of the life of the context
@@ -65,6 +67,7 @@ int ocxl_context_attach(struct ocxl_context *ctx, u64 amr)
  {
int rc;
  
+	// Locks both status & tidr

	mutex_lock(&ctx->status_mutex);
if (ctx->status != OPENED) {
rc = -EIO;
@@ -72,7 +75,7 @@ int ocxl_context_attach(struct ocxl_context *ctx, u64 amr)
}
  
  	rc = ocxl_link_add_pe(ctx->afu->fn->link, ctx->pasid,

-   current->mm->context.id, 0, amr, current->mm,
+   current->mm->context.id, ctx->tidr, amr, current->mm,
xsl_fault_error, ctx);
if (rc)
goto out;
diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
index 038509e5d031..eb409a469f21 100644
--- a/drivers/misc/ocxl/file.c
+++ b/drivers/misc/ocxl/file.c
@@ -5,6 +5,8 @@
  #include 
  #include 
  #include 
+#include 
+#include 
  #include "ocxl_internal.h"
  
  
@@ -123,11 +125,55 @@ static long afu_ioctl_get_metadata(struct ocxl_context *ctx,

return 0;
  }
  
+#ifdef CONFIG_PPC64

+static long afu_ioctl_enable_p9_wait(struct ocxl_context *ctx,
+   struct ocxl_ioctl_p9_wait __user *uarg)
+{
+   struct ocxl_ioctl_p9_wait arg;
+
+   memset(&arg, 0, sizeof(arg));
+
+   if (cpu_has_feature(CPU_FTR_P9_TIDR)) {
+   enum ocxl_context_status status;
+
+   // Locks both status & tidr
+   mutex_lock(&ctx->status_mutex);
+   if (!ctx->tidr) {
+   if (set_thread_tidr(current))
+   return -ENOENT;
+
+   ctx->tidr = current->thread.tidr;
+   }
+
+   status = ctx->status;
+   mutex_unlock(&ctx->status_mutex);
+
+   if (status == ATTACHED) {
+   int rc;
+   struct link *link = ctx->afu->fn->link;


Declarations at the top


+
+   rc = ocxl_link_update_pe(link, ctx->pasid, ctx->tidr);
+   if (rc)
+   return rc;
+   }
+
+   arg.thread_id = ctx->tidr;
+   } else
+   return -ENOENT;
+
+   if (copy_to_user(uarg, &arg, sizeof(arg)))
+   return -EFAULT;
+
+   return 0;
+}
+#endif
+
  #define CMD_STR(x) (x == OCXL_IOCTL_ATTACH ? "ATTACH" : \
x == OCXL_IOCTL_IRQ_ALLOC ? "IRQ_ALLOC" : \
x == OCXL_IOCTL_IRQ_FREE ? "IRQ_FREE" :   \
x == OCXL_IOCTL_IRQ_SET_FD ? "IRQ_SET_FD" :   \
x == OCXL_IOCTL_GET_METADATA ? "GET_METADATA" :   \
+   x == OCXL_IOCTL_ENABLE_P9_WAIT ? "ENABLE_P9_WAIT" :   \
"UNKNOWN")
  
  static long afu_ioctl(struct file *file, unsigned int cmd,

@@ -186,6 +232,13 @@ static long afu_ioctl(struct file *file, unsigned int cmd,
(struct ocxl_ioctl_metadata __user *) args);
break;
  
+#ifdef CONFIG_PPC64

+   case OCXL_IOCTL_ENABLE_P9_WAIT:
+   rc = afu_ioctl_enable_p9_wait(ctx,
+   (struct ocxl_ioctl_p9_wait __user *) args);
+   break;
+#endif
+
default:
rc = -EINVAL;
}
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 656e8610eec2..88876ae8f330 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -544,6 +544,42 @@ int ocxl_link_add_pe(void *link_handle, int pasid, u32 
pidr, u32 tidr,
  }
  EXPORT_SYMBOL_GPL(ocxl_link_add_pe);
  
+int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid)

+{
+   

Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

2018-04-23 Thread Balbir Singh
On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
 wrote:
> From: Mahesh Salgaonkar 
>
> The current code extracts the physical address for UE errors and then
> hooks it up into the memory failure infrastructure. On successful extraction
> of the physical address it wrongly sets "handled = 1", which means this UE
> error has been recovered. Since the MCE handler gets handled = 1 as the
> return value, it assumes that the error has been recovered and goes back to
> the same NIP. This causes the MCE interrupt to fire again and again in a
> loop, leading to a hard lockup.
>
> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing an
> undesired page for hwpoison.
>
> Without this patch we see:
> [ 1476.541984] Severe Machine check interrupt [Recovered]
> [ 1476.541985]   NIP: [1002588c] PID: 7109 Comm: find
> [ 1476.541986]   Initiator: CPU
> [ 1476.541987]   Error type: UE [Load/Store]
> [ 1476.541988] Effective address: 7fffd2755940
> [ 1476.541989] Physical address:  20181a08
> [...]
> [ 1476.542003] Severe Machine check interrupt [Recovered]
> [ 1476.542004]   NIP: [1002588c] PID: 7109 Comm: find
> [ 1476.542005]   Initiator: CPU
> [ 1476.542006]   Error type: UE [Load/Store]
> [ 1476.542006] Effective address: 7fffd2755940
> [ 1476.542007] Physical address:  20181a08
> [ 1476.542010] Severe Machine check interrupt [Recovered]
> [ 1476.542012]   NIP: [1002588c] PID: 7109 Comm: find
> [ 1476.542013]   Initiator: CPU
> [ 1476.542014]   Error type: UE [Load/Store]
> [ 1476.542015] Effective address: 7fffd2755940
> [ 1476.542016] Physical address:  20181a08
> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU 
> page: Recovered
> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
> [...]
> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>
> After this patch we see:
>
> [  325.384336] Severe Machine check interrupt [Not recovered]

How did you test this? If the error was recovered, shouldn't the process
have gotten a SIGBUS, and shouldn't we have prevented further access as
part of the handling (memory_failure())? Do we just need MF_MUST_KILL in
the flags?

Why shouldn't we treat it as handled if we isolate the page?
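
For concreteness, the suggestion would amount to something like the
following sketch when hooking the page into memory_failure() (assuming the
4.17-era two-argument API and the physical address already extracted from
the MCE event):

    /* force a SIGBUS to the owner rather than only isolating the page */
    memory_failure(evt->u.ue_error.physical_address >> PAGE_SHIFT,
                   MF_MUST_KILL);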

Thanks,
Balbir Singh.


Re: [PATCH v10 08/25] mm: VMA sequence count

2018-04-23 Thread Minchan Kim
On Tue, Apr 17, 2018 at 04:33:14PM +0200, Laurent Dufour wrote:
> From: Peter Zijlstra 
> 
> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> counts such that we can easily test if a VMA is changed.

So, is the seqcount there to protect modification of all attributes of the vma?

> 
> The unmap_page_range() one allows us to make assumptions about
> page-tables; when we find the seqcount hasn't changed we can assume
> page-tables are still valid.

Hmm, the seqcount covers the page tables, too.
Please describe what the seqcount is meant to protect.
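
For context, the reader side this pairs with follows the classic seqcount
pattern; a hedged sketch, not the exact code from the series:

    unsigned int seq;

    seq = raw_read_seqcount(&vma->vm_sequence);
    /* ... speculatively read vma fields / walk the page tables ... */
    if (read_seqcount_retry(&vma->vm_sequence, seq))
            goto fallback;  /* vma changed under us; retry under mmap_sem */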

> 
> The flip side is that we cannot distinguish between a vma_adjust() and
> the unmap_page_range() -- where with the former we could have
> re-checked the vma bounds against the address.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> 
> [Port to 4.12 kernel]
> [Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
> [Introduce vm_write_* inline function depending on
>  CONFIG_SPECULATIVE_PAGE_FAULT]
> [Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
>  using vm_raw_write* functions]
> [Fix a lock dependency warning in mmap_region() when entering the error
>  path]
> [move sequence initialisation INIT_VMA()]
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h   | 44 
>  include/linux/mm_types.h |  3 +++
>  mm/memory.c  |  2 ++
>  mm/mmap.c| 31 +++
>  4 files changed, 80 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index efc1248b82bd..988daf7030c9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1264,6 +1264,9 @@ struct zap_details {
>  static inline void INIT_VMA(struct vm_area_struct *vma)
>  {
>   INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqcount_init(&vma->vm_sequence);
> +#endif
>  }
>  
>  struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> @@ -1386,6 +1389,47 @@ static inline void unmap_shared_mapping_range(struct 
> address_space *mapping,
>   unmap_mapping_range(mapping, holebegin, holelen, 0);
>  }
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static inline void vm_write_begin(struct vm_area_struct *vma)
> +{
> + write_seqcount_begin(&vma->vm_sequence);
> +}
> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> +  int subclass)
> +{
> + write_seqcount_begin_nested(&vma->vm_sequence, subclass);
> +}
> +static inline void vm_write_end(struct vm_area_struct *vma)
> +{
> + write_seqcount_end(&vma->vm_sequence);
> +}
> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> +{
> + raw_write_seqcount_begin(&vma->vm_sequence);
> +}
> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> +{
> + raw_write_seqcount_end(&vma->vm_sequence);
> +}
> +#else
> +static inline void vm_write_begin(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> +  int subclass)
> +{
> +}
> +static inline void vm_write_end(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> +{
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
>  extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
>   void *buf, int len, unsigned int gup_flags);
>  extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 21612347d311..db5e9d630e7a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -335,6 +335,9 @@ struct vm_area_struct {
>   struct mempolicy *vm_policy;/* NUMA policy for the VMA */
>  #endif
>   struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqcount_t vm_sequence;
> +#endif
>  } __randomize_layout;
>  
>  struct core_thread {
> diff --git a/mm/memory.c b/mm/memory.c
> index f86efcb8e268..f7fed053df80 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1503,6 +1503,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>   unsigned long next;
>  
>   BUG_ON(addr >= end);

A comment saying that this aims for page-table stability would help.

> + vm_write_begin(vma);
>   tlb_start_vma(tlb, vma);
>   pgd = pgd_offset(vma->vm_mm, addr);
>   do {
> @@ -1512,6 +1513,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>   next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
>   } while (pgd++, addr = next, addr != end);
>   tlb_end_vma(tlb, vma);
> + vm_write_end(vma);
>  }
>  
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8bd9ae1dfacc..813e49589ea1 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -692,6 +692,30 @@ int __vma_adjust(struct 

Re: [PATCH v10 06/25] mm: make pte_unmap_same compatible with SPF

2018-04-23 Thread Minchan Kim
On Tue, Apr 17, 2018 at 04:33:12PM +0200, Laurent Dufour wrote:
> pte_unmap_same() is making the assumption that the page tables are still
> around because the mmap_sem is held.
> This is no longer the case when running a speculative page fault, and an
> additional check must be made to ensure that the final page tables are
> still there.
> 
> This is now done by calling pte_spinlock() to check the VMA's consistency
> while taking the page table lock.
> 
> This requires passing a vm_fault structure to pte_unmap_same(), which
> contains all the needed parameters.
> 
> As pte_spinlock() may fail in the case of a speculative page fault, if the
> VMA has been touched behind our back, pte_unmap_same() now distinguishes 3
> cases:
>   1. the PTEs are the same (0)
>   2. the PTEs are different (VM_FAULT_PTNOTSAME)
>   3. a change to the VMA has been detected (VM_FAULT_RETRY)
> 
> Case 2 is handled by the introduction of a new VM_FAULT flag named
> VM_FAULT_PTNOTSAME, which is then trapped in cow_user_page().

I don't see such logic in this patch.
Maybe you introduce it later? If so, please comment on it.
Or just return 0 in case 2, without introducing VM_FAULT_PTNOTSAME.

> If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
> page fault while holding the mmap_sem.
> 
> Acked-by: David Rientjes 
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c| 39 ---
>  2 files changed, 29 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4d1aff80669c..714da99d77a3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1208,6 +1208,7 @@ static inline void clear_page_pfmemalloc(struct page 
> *page)
>  #define VM_FAULT_NEEDDSYNC  0x2000   /* ->fault did not modify page tables
>* and needs fsync() to complete (for
>* synchronous page faults in DAX) */
> +#define VM_FAULT_PTNOTSAME 0x4000/* Page table entries have changed */
>  
>  #define VM_FAULT_ERROR   (VM_FAULT_OOM | VM_FAULT_SIGBUS | 
> VM_FAULT_SIGSEGV | \
>VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
> diff --git a/mm/memory.c b/mm/memory.c
> index 0b9a51f80e0e..f86efcb8e268 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2309,21 +2309,29 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
>   * parts, do_swap_page must check under lock before unmapping the pte and
>   * proceeding (but do_wp_page is only called after already making such a 
> check;
>   * and do_anonymous_page can safely check later on).
> + *
> + * pte_unmap_same() returns:
> + *   0   if the PTE are the same
> + *   VM_FAULT_PTNOTSAME  if the PTE are different
> + *   VM_FAULT_RETRY  if the VMA has changed in our back during
> + *   a speculative page fault handling.
>   */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> - pte_t *page_table, pte_t orig_pte)
> +static inline int pte_unmap_same(struct vm_fault *vmf)
>  {
> - int same = 1;
> + int ret = 0;
> +
>  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
>   if (sizeof(pte_t) > sizeof(unsigned long)) {
> - spinlock_t *ptl = pte_lockptr(mm, pmd);
> - spin_lock(ptl);
> - same = pte_same(*page_table, orig_pte);
> - spin_unlock(ptl);
> + if (pte_spinlock(vmf)) {
> + if (!pte_same(*vmf->pte, vmf->orig_pte))
> + ret = VM_FAULT_PTNOTSAME;
> + spin_unlock(vmf->ptl);
> + } else
> + ret = VM_FAULT_RETRY;
>   }
>  #endif
> - pte_unmap(page_table);
> - return same;
> + pte_unmap(vmf->pte);
> + return ret;
>  }
>  
>  static inline void cow_user_page(struct page *dst, struct page *src, 
> unsigned long va, struct vm_area_struct *vma)
> @@ -2912,10 +2920,19 @@ int do_swap_page(struct vm_fault *vmf)
>   pte_t pte;
>   int locked;
>   int exclusive = 0;
> - int ret = 0;
> + int ret;
>  
> - if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> + ret = pte_unmap_same(vmf);
> + if (ret) {
> + /*
> +  * If pte != orig_pte, this means another thread did the
> +  * swap operation in our back.
> +  * So nothing else to do.
> +  */
> + if (ret == VM_FAULT_PTNOTSAME)
> + ret = 0;
>   goto out;
> + }
>  
>   entry = pte_to_swp_entry(vmf->orig_pte);
>   if (unlikely(non_swap_entry(entry))) {
> -- 
> 2.7.4
>