date:20180313

[PATCH] cxl: Add new kernel traces

2018-03-13 Thread Christophe Lombard

This patch adds new kernel traces to the current in-kernel 'library',
which other drivers can call to interact with an IBM XSL on a POWER9
system.

Kernel traces already exist in the 'normal path' that handles a page or
segment fault, but some are missing when a page fault is handled
through cxllib.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/cxllib.c |   3 ++
 drivers/misc/cxl/fault.c  |   2 +
 drivers/misc/cxl/irq.c    |   2 +-
 drivers/misc/cxl/trace.h  | 115 ++
 4 files changed, 72 insertions(+), 50 deletions(-)

diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
index 30ccba4..91cfb69 100644
--- a/drivers/misc/cxl/cxllib.c
+++ b/drivers/misc/cxl/cxllib.c
@@ -13,6 +13,7 @@
 #include 
 
 #include "cxl.h"
+#include "trace.h"
 
 #define CXL_INVALID_DRA ~0ull
 #define CXL_DUMMY_READ_SIZE 128
@@ -218,6 +219,8 @@ int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
if (mm == NULL)
return -EFAULT;
 
+   trace_cxl_lib_handle_fault(addr, size, flags);
+
down_read(&mm->mmap_sem);
 
vma = find_vma(mm, addr);
diff --git a/drivers/misc/cxl/fault.c b/drivers/misc/cxl/fault.c
index 70dbb6d..1c4fd74 100644
--- a/drivers/misc/cxl/fault.c
+++ b/drivers/misc/cxl/fault.c
@@ -138,6 +138,8 @@ int cxl_handle_mm_fault(struct mm_struct *mm, u64 dsisr, u64 dar)
int result;
unsigned long access, flags, inv_flags = 0;
 
+   trace_cxl_handle_mm_fault(dsisr, dar);
+
/*
 * Add the fault handling cpu to task mm cpumask so that we
 * can do a safe lockless page table walk when inserting the
diff --git a/drivers/misc/cxl/irq.c b/drivers/misc/cxl/irq.c
index ce08a9f..79b8b49 100644
--- a/drivers/misc/cxl/irq.c
+++ b/drivers/misc/cxl/irq.c
@@ -41,7 +41,7 @@ irqreturn_t cxl_irq_psl9(int irq, struct cxl_context *ctx, struct cxl_irq_info *
dsisr = irq_info->dsisr;
dar = irq_info->dar;
 
-   trace_cxl_psl9_irq(ctx, irq, dsisr, dar);
+   trace_cxl_psl_irq(ctx, irq, dsisr, dar);
 
pr_devel("CXL interrupt %i for afu pe: %i DSISR: %#llx DAR: %#llx\n", irq, ctx->pe, dsisr, dar);
 
diff --git a/drivers/misc/cxl/trace.h b/drivers/misc/cxl/trace.h
index b8e300a..8eb2607 100644
--- a/drivers/misc/cxl/trace.h
+++ b/drivers/misc/cxl/trace.h
@@ -26,19 +26,20 @@
{ CXL_PSL9_DSISR_An_OC, "OC" }, \
{ CXL_PSL9_DSISR_An_S,  "S" })
 
-#define DSISR_FLAGS \
-   { CXL_PSL_DSISR_An_DS,  "DS" }, \
-   { CXL_PSL_DSISR_An_DM,  "DM" }, \
-   { CXL_PSL_DSISR_An_ST,  "ST" }, \
-   { CXL_PSL_DSISR_An_UR,  "UR" }, \
-   { CXL_PSL_DSISR_An_PE,  "PE" }, \
-   { CXL_PSL_DSISR_An_AE,  "AE" }, \
-   { CXL_PSL_DSISR_An_OC,  "OC" }, \
-   { CXL_PSL_DSISR_An_M,   "M" }, \
-   { CXL_PSL_DSISR_An_P,   "P" }, \
-   { CXL_PSL_DSISR_An_A,   "A" }, \
-   { CXL_PSL_DSISR_An_S,   "S" }, \
-   { CXL_PSL_DSISR_An_K,   "K" }
+#define dsisr_psl8_flags(flags) \
+   __print_flags(flags, "|", \
+   { CXL_PSL_DSISR_An_DS,  "DS" }, \
+   { CXL_PSL_DSISR_An_DM,  "DM" }, \
+   { CXL_PSL_DSISR_An_ST,  "ST" }, \
+   { CXL_PSL_DSISR_An_UR,  "UR" }, \
+   { CXL_PSL_DSISR_An_PE,  "PE" }, \
+   { CXL_PSL_DSISR_An_AE,  "AE" }, \
+   { CXL_PSL_DSISR_An_OC,  "OC" }, \
+   { CXL_PSL_DSISR_An_M,   "M" }, \
+   { CXL_PSL_DSISR_An_P,   "P" }, \
+   { CXL_PSL_DSISR_An_A,   "A" }, \
+   { CXL_PSL_DSISR_An_S,   "S" }, \
+   { CXL_PSL_DSISR_An_K,   "K" })
 
 #define TFC_FLAGS \
{ CXL_PSL_TFC_An_A, "A" }, \
@@ -163,7 +164,7 @@ TRACE_EVENT(cxl_afu_irq,
)
 );
 
-TRACE_EVENT(cxl_psl9_irq,
+TRACE_EVENT(cxl_psl_irq,
TP_PROTO(struct cxl_context *ctx, int irq, u64 dsisr, u64 dar),
 
TP_ARGS(ctx, irq, dsisr, dar),
@@ -192,40 +193,8 @@ TRACE_EVENT(cxl_psl9_irq,
__entry->pe,
__entry->irq,
__entry->dsisr,
-   dsisr_psl9_flags(__entry->dsisr),
-   __entry->dar
-   )
-);
-
-TRACE_EVENT(cxl_psl_irq,
-   TP_PROTO(struct cxl_context *ctx, int irq, u64 dsisr, u64 dar),
-
-   TP_ARGS(ctx, irq, dsisr, dar),
-
-   TP_STRUCT__entry(
-   __field(u8, card)
-   __field(u8, afu)
-   __field(u16, pe)
-   __field(int, irq)
-   __field(u64, dsisr)
-   __field(u64, dar)
-   ),
-
-   TP_fast_assign(
-   __entry->card = ctx->afu->adapter->adapter_num;
-   __entry->afu = ctx->afu->slice;
-   __entry->pe = ctx->pe;
-   __entry->irq = irq;
-   __entry->dsisr = dsisr;
-   __entry->dar = dar;
-   ),
-
-   TP_printk("afu%i.%i pe=%i irq=%i dsisr=%s dar=0x%016llx",
-

Re: [PATCH 2/2] misc: ocxl: use put_device() instead of device_unregister()

2018-03-13 Thread Frederic Barrat




On 12/03/2018 at 12:36, Arvind Yadav wrote:

if device_register() returned an error! Always use put_device()
to give up the reference initialized.

Signed-off-by: Arvind Yadav 
---


OK, device_unregister() calls put_device() but also performs other
actions that we can skip in this case.
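
To illustrate the point, here is a toy user-space model of the reference counting involved. It is only a sketch: struct toy_device and the toy_*() helpers are hypothetical stand-ins, not the kernel's struct device or driver-core API, and the model assumes only that registration is "initialize, then add" and that unregistration is "remove, then drop the reference".

```c
/* Toy model of a refcounted device; NOT the kernel's struct device. */
struct toy_device {
	int refcount;	/* references currently held */
	int added;	/* 1 once the device is visible in the toy hierarchy */
	int released;	/* 1 once the release step has run */
};

/* Drop one reference; the last put triggers the release step. */
void toy_put_device(struct toy_device *dev)
{
	if (--dev->refcount == 0)
		dev->released = 1;	/* stands in for the release callback */
}

/* Models registration: initialize (refcount = 1), then try to add. */
int toy_device_register(struct toy_device *dev, int add_fails)
{
	dev->refcount = 1;
	dev->added = 0;
	dev->released = 0;
	if (add_fails)
		return -1;	/* the initial reference is still held */
	dev->added = 1;
	return 0;
}

/* Models unregistration: undo the add, then drop the reference. */
void toy_device_unregister(struct toy_device *dev)
{
	dev->added = 0;		/* stands in for the removal step */
	toy_put_device(dev);
}
```

In this model, a failed toy_device_register() leaves one reference and nothing added, so a single toy_put_device() is enough to release the device; the removal step has nothing to undo.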


Acked-by: Frederic Barrat 



  drivers/misc/ocxl/pci.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/ocxl/pci.c b/drivers/misc/ocxl/pci.c
index 0051d9e..21f4254 100644
--- a/drivers/misc/ocxl/pci.c
+++ b/drivers/misc/ocxl/pci.c
@@ -519,7 +519,7 @@ static struct ocxl_fn *init_function(struct pci_dev *dev)
rc = device_register(&fn->dev);
if (rc) {
deconfigure_function(fn);
-   device_unregister(&fn->dev);
+   put_device(&fn->dev);
return ERR_PTR(rc);
}
return fn;

Re: [PATCH] cxl: Perform NULL check for 'cxl_afu *' at various places in cxl

2018-03-13 Thread Frederic Barrat




On 08/03/2018 at 11:05, Vaibhav Jain wrote:

It is possible for a CXL card to have a valid PSL but no valid
AFUs. When this happens we have a valid instance of 'struct cxl'
representing the adapter but with its member 'struct cxl_afu *afu[]'
empty. Unfortunately, at many places within the cxl code (especially
during an EEH) the elements of this array are passed on to various
other cxl functions, which may result in a kernel oops/panic when this
'struct cxl_afu *' is dereferenced.

So this patch puts a NULL check at the beginning of various cxl
functions that accept 'struct cxl_afu *' as a formal argument and are
called from within a loop of the form:

for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
/* call some function with 'afu' */
}



So we are calling functions with an invalid afu argument. We can verify 
in the callees the value of the afu pointer, like you're doing here, but 
why not tackle it at source and avoid calling the function in the first 
place? It would have the nice side effect of reminding developers that 
the AFU array can be empty.
We already have a few checks in place in the "for (i = 0; i < 
adapter->slices; i++)" loops, but it was overlooked when eeh support was 
added. I think we should fix that.
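
A minimal user-space sketch of the caller-side check described above, for illustration only. The structures are simplified stand-ins, and MAX_SLICES, detach_all() and detach_adapter() are hypothetical names, not the driver's API:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
#define MAX_SLICES 4

struct cxl_afu {
	int slice;
	int detached;
};

struct cxl {
	int slices;
	struct cxl_afu *afu[MAX_SLICES];
};

/* Callee that assumes a valid AFU pointer and does no NULL check. */
static void detach_all(struct cxl_afu *afu)
{
	afu->detached = 1;
}

/*
 * Walk the adapter's AFU array and skip empty slots at the call site,
 * so the callee is never handed a NULL pointer.
 * Returns the number of AFUs actually handled.
 */
int detach_adapter(struct cxl *adapter)
{
	int i, handled = 0;

	for (i = 0; i < adapter->slices; i++) {
		struct cxl_afu *afu = adapter->afu[i];

		if (afu == NULL)	/* the slot may be empty: skip it */
			continue;

		detach_all(afu);
		handled++;
	}
	return handled;
}
```

With the check in the loop, the callee can keep assuming a valid pointer, and the loop itself documents that a slot may be empty.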


  Fred





Signed-off-by: Vaibhav Jain 
---
  drivers/misc/cxl/api.c |  2 +-
  drivers/misc/cxl/context.c |  3 +++
  drivers/misc/cxl/guest.c   |  4 
  drivers/misc/cxl/main.c|  3 +++
  drivers/misc/cxl/native.c  |  4 
  drivers/misc/cxl/pci.c | 13 -
  6 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 753b1a698fc4..3466ef8b9e86 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -128,7 +128,7 @@ struct cxl_context *cxl_dev_context_init(struct pci_dev *dev)
int rc;

afu = cxl_pci_to_afu(dev);
-   if (IS_ERR(afu))
+   if (IS_ERR_OR_NULL(afu))
return ERR_CAST(afu);

ctx = cxl_context_alloc();
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 7ff315ad3692..3957e6e7d187 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -303,6 +303,9 @@ void cxl_context_detach_all(struct cxl_afu *afu)
struct cxl_context *ctx;
int tmp;

+   if (afu == NULL)
+   return;
+
mutex_lock(&afu->contexts_lock);
idr_for_each_entry(&afu->contexts_idr, ctx, tmp) {
/*
diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index f58b4b6c79f2..8165f6f26704 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -760,6 +760,8 @@ static int activate_afu_directed(struct cxl_afu *afu)

  static int guest_afu_activate_mode(struct cxl_afu *afu, int mode)
  {
+   if (afu == NULL)
+   return -EINVAL;
if (!mode)
return 0;
if (!(mode & afu->modes_supported))
@@ -791,6 +793,8 @@ static int deactivate_afu_directed(struct cxl_afu *afu)

  static int guest_afu_deactivate_mode(struct cxl_afu *afu, int mode)
  {
+   if (afu == NULL)
+   return -EINVAL;
if (!mode)
return 0;
if (!(mode & afu->modes_supported))
diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c
index c1ba0d42cbc8..296a71ca6f2e 100644
--- a/drivers/misc/cxl/main.c
+++ b/drivers/misc/cxl/main.c
@@ -271,6 +271,9 @@ struct cxl_afu *cxl_alloc_afu(struct cxl *adapter, int slice)

  int cxl_afu_select_best_mode(struct cxl_afu *afu)
  {
+   if (afu == NULL)
+   return -EINVAL;
+
if (afu->modes_supported & CXL_MODE_DIRECTED)
return cxl_ops->afu_activate_mode(afu, CXL_MODE_DIRECTED);

diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 1b3d7c65ea3f..d46415b19b71 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -971,6 +971,8 @@ static int deactivate_dedicated_process(struct cxl_afu *afu)

  static int native_afu_deactivate_mode(struct cxl_afu *afu, int mode)
  {
+   if (afu == NULL)
+   return -EINVAL;
if (mode == CXL_MODE_DIRECTED)
return deactivate_afu_directed(afu);
if (mode == CXL_MODE_DEDICATED)
@@ -980,6 +982,8 @@ static int native_afu_deactivate_mode(struct cxl_afu *afu, int mode)

  static int native_afu_activate_mode(struct cxl_afu *afu, int mode)
  {
+   if (!afu)
+   return -EINVAL;
if (!mode)
return 0;
if (!(mode & afu->modes_supported))
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 758842f65a1b..8c87d9fdcf5a 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1295,6 +1295,9 @@ static int pci_configure_afu(struct cxl_afu *afu, struct cxl *adapter, struct pc
  {
int rc;

+   if (afu == NULL)
+   return -EINVAL;
+
if ((rc = pci_map_slice_regs(afu, adap

[PATCH] powerpc/numa: Correct kernel message severity

2018-03-13 Thread Vipin K Parashar

printk in unmap_cpu_from_node() uses KERN_ERR message severity
for a WARNING message. Correct message severity to KERN_WARNING.

Signed-off-by: Vipin K Parashar 
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index edd8d0b..79c94cc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -163,7 +163,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
} else {
-   printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
+   printk(KERN_WARNING "WARNING: cpu %lu not found in node %d\n",
   cpu, node);
}
 }
-- 
2.7.4

Re: [PATCH] powerpc/numa: Correct kernel message severity

2018-03-13 Thread Christophe LEROY




On 13/03/2018 at 11:11, Vipin K Parashar wrote:

printk in unmap_cpu_from_node() uses KERN_ERR message severity
for a WARNING message. Correct message severity to KERN_WARNING.

Signed-off-by: Vipin K Parashar 
---
  arch/powerpc/mm/numa.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index edd8d0b..79c94cc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -163,7 +163,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
} else {
-   printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
+   printk(KERN_WARNING "WARNING: cpu %lu not found in node %d\n",

>   cpu, node);

Why not take the opportunity to use pr_warn() instead, which would also
put the cpu and node vars back on the same line?
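
A rough sketch of what that could look like. pr_warn() is stubbed here with fprintf() so the fragment builds outside the kernel, and warn_cpu_not_in_node() is a hypothetical wrapper used only for illustration, not a function in numa.c:

```c
#include <stdio.h>

/* User-space stand-in for the kernel's pr_warn(); an assumption of this sketch. */
#define pr_warn(...) fprintf(stderr, __VA_ARGS__)

/*
 * Emit the warning with format string and arguments on one line.
 * Returns the number of characters written (the fprintf() result).
 */
int warn_cpu_not_in_node(unsigned long cpu, int node)
{
	return pr_warn("WARNING: cpu %lu not found in node %d\n", cpu, node);
}
```

In the kernel, pr_warn() supplies the warning log level itself, so no explicit KERN_WARNING is needed at the call site.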


Christophe


}
  }

Re: [PATCH] powerpc/powernv : Add support to enable sensor groups

2018-03-13 Thread Michael Ellerman

Shilpasri G Bhat  writes:
> On 12/04/2017 10:11 AM, Stewart Smith wrote:
>> Shilpasri G Bhat  writes:
>>> On 11/28/2017 05:07 PM, Michael Ellerman wrote:
>>>> Shilpasri G Bhat writes:
>>>>
>>>>> Adds support to enable/disable a sensor group. This can be used to
>>>>> select the sensor groups that need to be copied to main memory by
>>>>> OCC. Sensor groups like power, temperature, current, voltage,
>>>>> frequency, utilization can be enabled/disabled at runtime.
>>>>>
>>>>> Signed-off-by: Shilpasri G Bhat
>>>>> ---
>>>>> The skiboot patch for the opal call is posted below:
>>>>> https://lists.ozlabs.org/pipermail/skiboot/2017-November/009713.html
>>>>
>>>> Can you remind me why we're doing this with a completely bespoke sysfs
>>>> API, rather than using some generic sensors API?
>>>
>>> Disabling/enabling sensor groups is not supported in the current generic
>>> sensors API. Also, we don't export all types of sensors in HWMON, as not
>>> all of them are environment sensors (for example, performance sensors).
>>
>> Are there barriers to adding such concepts to the generic sensors API?
>
> Yes.
>
> HWMON does not support attributes for a sensor group. If we are to extend
> HWMON to add new per-sensor attributes to disable/enable, then we need to
> do either of the below:
>
> 1) If any one of the sensors is disabled then all the sensors belonging to
> that group will be disabled. OR
>
> 2) To disable a sensor group we need to disable all the sensors belonging to
> that group.

Either of those sounds doable, the first is probably simpler, as long as
there's some way for userspace to understand that it is modifying the
state of the whole group.

> Another problem is hwmon categorizes the sensor-groups based on the type of
> sensors like power, temp. If OCC allows multiple groups of the same type then
> this approach adds some more complexity to the user to identify the sensors
> belonging to correct group.

I don't really understand this one, what do you mean by "If OCC allows"?

Also do we really expect users to be using this API? Or rather tools?

> And lastly HWMON does not allow platform specific non-standard sensor groups
> like CSM, job-scheduler, profiler.

Have we actually made specific proposals to the hwmon maintainers on
adding/changing any of the above? Have they rejected those proposals and
told us to go away?

cheers

Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again

2018-03-13 Thread Michael Ellerman

Mauricio Faria de Oliveira  writes:

> Hi Michael and Michal,
>
> Got back to this; sorry for the delay.
>
> On 03/06/2018 09:55 AM, Michal Suchánek wrote:
>> Michael Ellerman  wrote:
>
>>> I *think* the patch below is all we need, as well as some tweaking of
>>> patch 2, are you able to test and repost?
>
>> Enabling the fallback flush always looks a bit dodgy but
>> do_rfi_flush_fixups will overwrite the jump so long as any other fixup is
>> enabled.
>
> I agree; the 'Using fallback displacement flush' message is misleading
> (is the system slower/fallback or not? Ô_o)

That message is actually just wrong.

It still prints that even if enable=false.

So we should change all those messages, perhaps:

pr_info("rfi-flush: fallback displacement flush available\n");
pr_info("rfi-flush: ori type flush available\n");
pr_info("rfi-flush: mttrig type flush available\n");


> So I wrote something with a new function parameter to force the init of
> the fallback flush area (true in pseries, false in powernv).  Not that
> contained, but it seemed to convey the intent here in a clear way.
>
> That's v2, just sent.

OK thanks. I don't really like it :D - sorry!

It's a lot of plumbing of that bool just to avoid the message, whereas I
think we could just change the message like above.

cheers

OK to merge via powerpc? (was Re: [PATCH 05/14] mm: make memblock_alloc_base_nid non-static)

2018-03-13 Thread Michael Ellerman

Anyone object to us merging the following patch via the powerpc tree?

Full series is here if anyone's interested:
  http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=28377&state=*

cheers

Nicholas Piggin  writes:
> This will be used by powerpc to allocate per-cpu stacks and other
> data structures node-local where possible.
>
> Signed-off-by: Nicholas Piggin 
> ---
>  include/linux/memblock.h | 5 -
>  mm/memblock.c            | 2 +-
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 8be5077efb5f..8cab51398705 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -316,9 +316,12 @@ static inline bool memblock_bottom_up(void)
>  #define MEMBLOCK_ALLOC_ANYWHERE  (~(phys_addr_t)0)
>  #define MEMBLOCK_ALLOC_ACCESSIBLE 0
>  
> -phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
> +phys_addr_t memblock_alloc_range(phys_addr_t size, phys_addr_t align,
>   phys_addr_t start, phys_addr_t end,
>   ulong flags);
> +phys_addr_t memblock_alloc_base_nid(phys_addr_t size,
> + phys_addr_t align, phys_addr_t max_addr,
> + int nid, ulong flags);
>  phys_addr_t memblock_alloc_base(phys_addr_t size, phys_addr_t align,
>   phys_addr_t max_addr);
>  phys_addr_t __memblock_alloc_base(phys_addr_t size, phys_addr_t align,
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 5a9ca2a1751b..cea2af494da0 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1190,7 +1190,7 @@ phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
>   flags);
>  }
>  
> -static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
> +phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
>   phys_addr_t align, phys_addr_t max_addr,
>   int nid, ulong flags)
>  {
> -- 
> 2.16.1

Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again

2018-03-13 Thread Mauricio Faria de Oliveira


Hi Michael,

On 03/13/2018 08:39 AM, Michael Ellerman wrote:

>> I agree; the 'Using fallback displacement flush' message is misleading
>> (is the system slower/fallback or not? Ô_o)
>
> That message is actually just wrong.
>
> It still prints that even if enable=false.
>
> So we should change all those messages, perhaps:
>
> pr_info("rfi-flush: fallback displacement flush available\n");
> pr_info("rfi-flush: ori type flush available\n");
> pr_info("rfi-flush: mttrig type flush available\n");

Indeed.

>> So I wrote something with a new function parameter to force the init of
>> the fallback flush area (true in pseries, false in powernv).  Not that
>> contained, but it seemed to convey the intent here in a clear way.
>>
>> That's v2, just sent.
>
> OK thanks. I don't really like it :D - sorry!

No worries :) fair enough. Well, I didn't like it much, either, TBH.

> It's a lot of plumbing of that bool just to avoid the message, whereas I
> think we could just change the message like above.


Yup.

And what do you think about a more descriptive confirmation of which flush
instructions/methods are _actually_ being used?

Currently, and with your suggestion above, all that is known is what is
_available_, not what has gone into (or out of, in the disable case) the
nop slots.
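
To make the idea concrete, here is a small user-space sketch of a single status line that reports both which flush types are available and whether they were patched in. Everything in it is an assumption for illustration: the enum values, the function name rfi_flush_status() and the message wording are hypothetical and are not taken from the kernel source.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; the kernel's own enum may differ. */
enum flush_type {
	FLUSH_FALLBACK = 0x1,
	FLUSH_ORI      = 0x2,
	FLUSH_MTTRIG   = 0x4,
};

#define FLUSH_ALL (FLUSH_FALLBACK | FLUSH_ORI | FLUSH_MTTRIG)

/*
 * Format one status line that lists the available flush types and says
 * whether they were patched into the nop slots or the slots were left alone.
 * Returns the snprintf() result.
 */
int rfi_flush_status(char *buf, size_t len, unsigned int types, int enable)
{
	return snprintf(buf, len, "rfi-flush: available:%s%s%s%s, %s\n",
			(types & FLUSH_FALLBACK) ? " fallback" : "",
			(types & FLUSH_ORI) ? " ori" : "",
			(types & FLUSH_MTTRIG) ? " mttrig" : "",
			(types & FLUSH_ALL) ? "" : " none",
			enable ? "patched in" : "left as nops");
}
```

Keeping "available" and "patched in" in the same line is one way to avoid the ambiguity discussed above; whether that belongs in one message or two is a matter of taste.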

cheers,
mauricio

Re: [PATCH 07/14] powerpc/64: move default SPR recording

2018-03-13 Thread Michael Ellerman

Nicholas Piggin  writes:

> Move this into the early setup code, and don't iterate over CPU masks.
> We don't want to call into sysfs so early from setup, and a future patch
> won't initialize CPU masks by the time this is called.
> ---
>  arch/powerpc/kernel/paca.c |  3 +++
>  arch/powerpc/kernel/setup.h|  9 +++--
>  arch/powerpc/kernel/setup_64.c |  8 
>  arch/powerpc/kernel/sysfs.c| 18 +++---
>  4 files changed, 21 insertions(+), 17 deletions(-)

This patch, and 8, 9, 10, aren't signed-off by you.

I'll assume you just forgot and add it.

cheers

Re: [PATCH 03/14] powerpc/64s: allocate lppacas individually

2018-03-13 Thread Michael Ellerman

Nicholas Piggin  writes:

> diff --git a/arch/powerpc/platforms/pseries/kexec.c b/arch/powerpc/platforms/pseries/kexec.c
> index eeb13429d685..3fe126796975 100644
> --- a/arch/powerpc/platforms/pseries/kexec.c
> +++ b/arch/powerpc/platforms/pseries/kexec.c
> @@ -23,7 +23,12 @@
>  
>  void pseries_kexec_cpu_down(int crash_shutdown, int secondary)
>  {
> - /* Don't risk a hypervisor call if we're crashing */
> + /*
> +  * Don't risk a hypervisor call if we're crashing
> +  * XXX: Why? The hypervisor is not crashing. It might be better
> +  * to at least attempt unregister to avoid the hypervisor stepping
> +  * on our memory.
> +  */

Because every extra line of code we run in the crashed kernel is another
opportunity to screw up and not make it into the kdump kernel.

For example the hcalls we do to unregister the VPA might trigger hcall
tracing which runs a bunch of code and might trip up on something. We
could modify those hcalls to not be traced, but then we can't trace them
in normal operation.

And the hypervisor might continue to write to the VPA, but that's OK
because it's the VPA of the crashing kernel, the kdump kernel runs in a
separate reserved memory region.

Possibly we could fix the hcall tracing issues etc, but this code has
not given us any problems for quite a while (~13 years) - ie. there
seems to be no issue with re-registering the VPAs etc. in the kdump
kernel.

cheers

Re: [PATCH 03/14] powerpc/64s: allocate lppacas individually

2018-03-13 Thread Nicholas Piggin

On Tue, 13 Mar 2018 23:41:46 +1100
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > diff --git a/arch/powerpc/platforms/pseries/kexec.c b/arch/powerpc/platforms/pseries/kexec.c
> > index eeb13429d685..3fe126796975 100644
> > --- a/arch/powerpc/platforms/pseries/kexec.c
> > +++ b/arch/powerpc/platforms/pseries/kexec.c
> > @@ -23,7 +23,12 @@
> >  
> >  void pseries_kexec_cpu_down(int crash_shutdown, int secondary)
> >  {
> > -   /* Don't risk a hypervisor call if we're crashing */
> > +   /*
> > +* Don't risk a hypervisor call if we're crashing
> > +* XXX: Why? The hypervisor is not crashing. It might be better
> > +* to at least attempt unregister to avoid the hypervisor stepping
> > +* on our memory.
> > +*/  
> 
> Because every extra line of code we run in the crashed kernel is another
> opportunity to screw up and not make it into the kdump kernel.
> 
> For example the hcalls we do to unregister the VPA might trigger hcall
> tracing which runs a bunch of code and might trip up on something. We
> could modify those hcalls to not be traced, but then we can't trace them
> in normal operation.

We really make no other hcalls in a crash? I didn't think of that.

> 
> And the hypervisor might continue to write to the VPA, but that's OK
> because it's the VPA of the crashing kernel, the kdump kernel runs in a
> separate reserved memory region.

Well that takes care of that concern.

> Possibly we could fix the hcall tracing issues etc, but this code has
> not given us any problems for quite a while (~13 years) - ie. there
> seems to be no issue with re-registering the VPAs etc. in the kdump
> kernel.

No I think it's okay then, if you could drop that hunk...

Thanks,
Nick

Re: [PATCH 07/14] powerpc/64: move default SPR recording

2018-03-13 Thread Nicholas Piggin

On Tue, 13 Mar 2018 23:25:05 +1100
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > Move this into the early setup code, and don't iterate over CPU masks.
> > We don't want to call into sysfs so early from setup, and a future patch
> > won't initialize CPU masks by the time this is called.
> > ---
> >  arch/powerpc/kernel/paca.c |  3 +++
> >  arch/powerpc/kernel/setup.h|  9 +++--
> >  arch/powerpc/kernel/setup_64.c |  8 
> >  arch/powerpc/kernel/sysfs.c| 18 +++---
> >  4 files changed, 21 insertions(+), 17 deletions(-)  
> 
> This patch, and 8, 9, 10, aren't signed-off by you.
> 
> I'll assume you just forgot and add it.

Yes I did.

Thanks,
Nick

Re: [PATCH 4.14 1/4] powerpc/mm/slice: Remove intermediate bitmap copy

2018-03-13 Thread Michael Ellerman

Greg Kroah-Hartman  writes:
> On Sat, Mar 10, 2018 at 05:14:22PM +0100, christophe leroy wrote:
>> On 10/03/2018 at 15:52, Greg Kroah-Hartman wrote:
>> > On Sat, Mar 10, 2018 at 08:27:54AM +0100, christophe leroy wrote:
>> > > On 10/03/2018 at 01:10, Greg Kroah-Hartman wrote:
>> > > > On Fri, Mar 09, 2018 at 04:48:59PM +0100, Christophe Leroy wrote:
>> > > > > Upstream 326691ad4f179e6edc7eb1271e618dd673e4736d
>> > > > 
>> > > > There is no such git commit id in Linus's tree :(
>> > > > 
>> > > > Please fix up and resend the series.
>> > > 
>> > > I checked again, it is there
>> > > 
>> > > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/arch/powerpc/mm/slice.c?h=next-20180309&id=326691ad4f179e6edc7eb1271e618dd673e4736d
>> > 
>> > That is linux-next, which has everything and the kitchen sink.  It is
>> > not Linus's tree.  Please wait for these things to be merged into
>> > Linus's tree before asking for them to be merged into the stable tree.
>> > That's a requirement.
>> 
>> Oops, sorry, I thought everything on kernel.org was official.
>
> That would be a whole lot of "official" :)
>
> Please read:
> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for what the rules are here, if you haven't already.
>
>> Once it is in, do I resend the patches or do I just ping you ?
>
> You would need to resend the patches (if they need backporting
> manually), or just send a list of the git commit ids that are needed to
> be applied (usually easier.)
>
> Also, why were these patches not tagged with the stable tag to start
> with?  That way they would be automatically included in the stable tree
> when they hit Linus's tree.

Because they're fairly large and invasive and not well tested on other
platforms, so the maintainer is not comfortable with them going straight
to stable :)

Once they've had some testing in Linus' tree at least, then we'll ask
for a backport if there's no issues.

Sorry for the confusion.

cheers

Re: [PATCH v4 02/10] include: Move compat_timespec/ timeval to compat_time.h

2018-03-13 Thread kbuild test robot

Hi Deepa,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on ]

url:
https://github.com/0day-ci/linux/commits/Deepa-Dinamani/posix_clocks-Prepare-syscalls-for-64-bit-time_t-conversion/20180313-203305
base:
config: arm64-allnoconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm64 

All errors (new ones prefixed by >>):

   arch/arm64/kernel/process.c: In function 'copy_thread':
>> arch/arm64/kernel/process.c:342:8: error: implicit declaration of function 'is_compat_thread'; did you mean 'is_compat_task'? [-Werror=implicit-function-declaration]
   if (is_compat_thread(task_thread_info(p)))
   ^~~~
   is_compat_task
   cc1: some warnings being treated as errors

vim +342 arch/arm64/kernel/process.c

b3901d54d Catalin Marinas  2012-03-05  307
b3901d54d Catalin Marinas  2012-03-05  308  int copy_thread(unsigned long clone_flags, unsigned long stack_start,
afa86fc42 Al Viro          2012-10-22  309  		unsigned long stk_sz, struct task_struct *p)
b3901d54d Catalin Marinas  2012-03-05  310  {
b3901d54d Catalin Marinas  2012-03-05  311  	struct pt_regs *childregs = task_pt_regs(p);
b3901d54d Catalin Marinas  2012-03-05  312
c34501d21 Catalin Marinas  2012-10-05  313  	memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
c34501d21 Catalin Marinas  2012-10-05  314
bc0ee4760 Dave Martin      2017-10-31  315  	/*
bc0ee4760 Dave Martin      2017-10-31  316  	 * Unalias p->thread.sve_state (if any) from the parent task
bc0ee4760 Dave Martin      2017-10-31  317  	 * and disable discard SVE state for p:
bc0ee4760 Dave Martin      2017-10-31  318  	 */
bc0ee4760 Dave Martin      2017-10-31  319  	clear_tsk_thread_flag(p, TIF_SVE);
bc0ee4760 Dave Martin      2017-10-31  320  	p->thread.sve_state = NULL;
bc0ee4760 Dave Martin      2017-10-31  321
071b6d4a5 Dave Martin      2017-12-05  322  	/*
071b6d4a5 Dave Martin      2017-12-05  323  	 * In case p was allocated the same task_struct pointer as some
071b6d4a5 Dave Martin      2017-12-05  324  	 * other recently-exited task, make sure p is disassociated from
071b6d4a5 Dave Martin      2017-12-05  325  	 * any cpu that may have run that now-exited task recently.
071b6d4a5 Dave Martin      2017-12-05  326  	 * Otherwise we could erroneously skip reloading the FPSIMD
071b6d4a5 Dave Martin      2017-12-05  327  	 * registers for p.
071b6d4a5 Dave Martin      2017-12-05  328  	 */
071b6d4a5 Dave Martin      2017-12-05  329  	fpsimd_flush_task_state(p);
071b6d4a5 Dave Martin      2017-12-05  330
9ac080021 Al Viro          2012-10-21  331  	if (likely(!(p->flags & PF_KTHREAD))) {
9ac080021 Al Viro          2012-10-21  332  		*childregs = *current_pt_regs();
b3901d54d Catalin Marinas  2012-03-05  333  		childregs->regs[0] = 0;
d00a3810c Will Deacon      2015-05-27  334
b3901d54d Catalin Marinas  2012-03-05  335  		/*
b3901d54d Catalin Marinas  2012-03-05  336  		 * Read the current TLS pointer from tpidr_el0 as it may be
b3901d54d Catalin Marinas  2012-03-05  337  		 * out-of-sync with the saved value.
b3901d54d Catalin Marinas  2012-03-05  338  		 */
adf758999 Mark Rutland     2016-09-08  339  		*task_user_tls(p) = read_sysreg(tpidr_el0);
d00a3810c Will Deacon      2015-05-27  340
e0fd18ce1 Al Viro          2012-10-18  341  		if (stack_start) {
d00a3810c Will Deacon      2015-05-27 @342  			if (is_compat_thread(task_thread_info(p)))
d00a3810c Will Deacon      2015-05-27  343  				childregs->compat_sp = stack_start;
d00a3810c Will Deacon      2015-05-27  344  			else
b3901d54d Catalin Marinas  2012-03-05  345  				childregs->sp = stack_start;
b3901d54d Catalin Marinas  2012-03-05  346  		}
d00a3810c Will Deacon      2015-05-27  347
c34501d21 Catalin Marinas  2012-10-05  348  		/*
c34501d21 Catalin Marinas  2012-10-05  349  		 * If a TLS pointer was passed to clone (4th argument), use it
c34501d21 Catalin Marinas  2012-10-05  350  		 * for the new thread.
c34501d21 Catalin Marinas  2012-10-05  351  		 */
b3901d54d Catalin Marinas  2012-03-05  352  		if (clone_flags & CLONE_SETTLS)
d00a3810c Will Deacon      2015-05-27  353  			p->thread.tp_value = childregs->regs[3];
c34501d21 Catalin Marinas  2012-10-05  354  	} else {
c34501d21 Catalin Marinas  2012-10-05  355  		memset(childregs, 0, sizeo

Re: [PATCH v4 02/10] include: Move compat_timespec/ timeval to compat_time.h

2018-03-13 Thread kbuild test robot

Hi Deepa,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on ]

url:
https://github.com/0day-ci/linux/commits/Deepa-Dinamani/posix_clocks-Prepare-syscalls-for-64-bit-time_t-conversion/20180313-203305
base:
config: powerpc-iss476-smp_defconfig (attached as .config)
compiler: powerpc-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   arch/powerpc/oprofile/backtrace.c: In function 'user_getsp32':
>> arch/powerpc/oprofile/backtrace.c:31:19: error: implicit declaration of function 'compat_ptr'; did you mean 'complete'? [-Werror=implicit-function-declaration]
 void __user *p = compat_ptr(sp);
  ^~
  complete
>> arch/powerpc/oprofile/backtrace.c:31:19: error: initialization makes pointer from integer without a cast [-Werror=int-conversion]
   cc1: all warnings being treated as errors

vim +31 arch/powerpc/oprofile/backtrace.c

6c6bd754 Brian Rogan 2006-03-27  27
6c6bd754 Brian Rogan 2006-03-27  28  static unsigned int user_getsp32(unsigned int sp, int is_first)
6c6bd754 Brian Rogan 2006-03-27  29  {
6c6bd754 Brian Rogan 2006-03-27  30  	unsigned int stack_frame[2];
62034f03 Al Viro     2006-09-23 @31  	void __user *p = compat_ptr(sp);
6c6bd754 Brian Rogan 2006-03-27  32
62034f03 Al Viro     2006-09-23  33  	if (!access_ok(VERIFY_READ, p, sizeof(stack_frame)))
6c6bd754 Brian Rogan 2006-03-27  34  		return 0;
6c6bd754 Brian Rogan 2006-03-27  35
6c6bd754 Brian Rogan 2006-03-27  36  	/*
6c6bd754 Brian Rogan 2006-03-27  37  	 * The most likely reason for this is that we returned -EFAULT,
6c6bd754 Brian Rogan 2006-03-27  38  	 * which means that we've done all that we can do from
6c6bd754 Brian Rogan 2006-03-27  39  	 * interrupt context.
6c6bd754 Brian Rogan 2006-03-27  40  	 */
62034f03 Al Viro     2006-09-23  41  	if (__copy_from_user_inatomic(stack_frame, p, sizeof(stack_frame)))
6c6bd754 Brian Rogan 2006-03-27  42  		return 0;
6c6bd754 Brian Rogan 2006-03-27  43
6c6bd754 Brian Rogan 2006-03-27  44  	if (!is_first)
6c6bd754 Brian Rogan 2006-03-27  45  		oprofile_add_trace(STACK_LR32(stack_frame));
6c6bd754 Brian Rogan 2006-03-27  46
6c6bd754 Brian Rogan 2006-03-27  47  	/*
6c6bd754 Brian Rogan 2006-03-27  48  	 * We do not enforce increasing stack addresses here because
6c6bd754 Brian Rogan 2006-03-27  49  	 * we may transition to a different stack, eg a signal handler.
6c6bd754 Brian Rogan 2006-03-27  50  	 */
6c6bd754 Brian Rogan 2006-03-27  51  	return STACK_SP(stack_frame);
6c6bd754 Brian Rogan 2006-03-27  52  }
6c6bd754 Brian Rogan 2006-03-27  53

:: The code at line 31 was first introduced by commit
:: 62034f03380a64c0144b6721f4a2aa55d65346c1 [POWERPC] powerpc oprofile 
__user annotations

:: TO: Al Viro 
:: CC: Paul Mackerras 

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: application/gzip

Re: [PATCH 07/14] powerpc/64: move default SPR recording

2018-03-13 Thread Nicholas Piggin

On Tue, 13 Mar 2018 23:25:05 +1100
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > Move this into the early setup code, and don't iterate over CPU masks.
> > We don't want to call into sysfs so early from setup, and a future patch
> > won't initialize CPU masks by the time this is called.
> > ---
> >  arch/powerpc/kernel/paca.c |  3 +++
> >  arch/powerpc/kernel/setup.h|  9 +++--
> >  arch/powerpc/kernel/setup_64.c |  8 
> >  arch/powerpc/kernel/sysfs.c| 18 +++---
> >  4 files changed, 21 insertions(+), 17 deletions(-)  
> 
> This patch, and 8, 9, 10, aren't signed-off by you.
> 
> I'll assume you just forgot and add it.

Can I give you an incremental fix for this patch? dscr_default
is zero at this point, so set it from spr_default_dscr before
setting pacas.

Remove the assignment from initialise_paca -- this happens too
early and gets overwritten anyway.

---
 arch/powerpc/kernel/paca.c  | 3 ---
 arch/powerpc/kernel/sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 2f3187501a36..7736188c764f 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -133,9 +133,6 @@ void __init initialise_paca(struct paca_struct *new_paca, 
int cpu)
new_paca->kexec_state = KEXEC_STATE_NONE;
new_paca->__current = &init_task;
new_paca->data_offset = 0xfeeeULL;
-#ifdef CONFIG_PPC64
-   new_paca->dscr_default = spr_default_dscr;
-#endif
 #ifdef CONFIG_PPC_BOOK3S_64
new_paca->slb_shadow_ptr = NULL;
 #endif
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index aaab582a640c..755dc98a57ae 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -20,6 +20,7 @@
 #include 
 
 #include "cacheinfo.h"
+#include "setup.h"
 
 #ifdef CONFIG_PPC64
 #include 
@@ -592,6 +593,7 @@ static void sysfs_create_dscr_default(void)
int err = 0;
int cpu;
 
+   dscr_default = spr_default_dscr;
for_each_possible_cpu(cpu)
paca_ptrs[cpu]->dscr_default = dscr_default;
 
-- 
2.16.1

Re: [PATCH] powerpc/powernv : Add support to enable sensor groups

2018-03-13 Thread Guenter Roeck

On Tue, Mar 13, 2018 at 10:02:09PM +1100, Michael Ellerman wrote:
> Shilpasri G Bhat  writes:
> > On 12/04/2017 10:11 AM, Stewart Smith wrote:
> >> Shilpasri G Bhat  writes:
> >>> On 11/28/2017 05:07 PM, Michael Ellerman wrote:
>  Shilpasri G Bhat  writes:
> 
> > Adds support to enable/disable a sensor group. This can be used to
> > select the sensor groups that needs to be copied to main memory by
> > OCC. Sensor groups like power, temperature, current, voltage,
> > frequency, utilization can be enabled/disabled at runtime.
> >
> > Signed-off-by: Shilpasri G Bhat 
> > ---
> > The skiboot patch for the opal call is posted below:
> > https://lists.ozlabs.org/pipermail/skiboot/2017-November/009713.html
> 
>  Can you remind me why we're doing this with a completely bespoke sysfs
>  API, rather than using some generic sensors API?
> >>>
> >>> Disabling/Enabling sensor groups is not supported in the current generic 
> >>> sensors
> >>> API. And also we dont export all type of sensors in HWMON as not all of 
> >>> them are
> >>> environment sensors (like performance).
> >> 
> >> Are there barriers to adding such concepts to the generic sensors API?
> >
> > Yes.
> >
> > HWMON does not support attributes for a sensor-group. If we are to extend 
> > HWMON
> > to add new per-sensor attributes to disable/enable, then we need to do 
> > either of
> > the below:
> >
> > 1) If any one of the sensor is disabled then all the sensors belonging to 
> > that
> > group will be disabled. OR
> >
> > 2) To disable a sensor group we need to disable all the sensors belonging to
> > that group.
> 
> Either of those sound doable, the first is probably simpler, as long as
> there's some way for userspace to understand that it is modifying the
> state of the whole group.
> 
> > Another problem is hwmon categorizes the sensor-groups based on the type of
> > sensors like power, temp. If OCC allows multiple groups of the same type 
> > then
> > this approach adds some more complexity to the user to identify the sensors
> > belonging to correct group.
> 
> I don't really understand this one, what do you mean by "If OCC allows"?
> 
> Also do we really expect users to be using this API? Or rather tools?
> 
> > And lastly HWMON does not allow platform specific non-standard sensor groups
> > like CSM, job-scheduler, profiler.
> 
> Have we actually made specific proposals to the hwmon maintainers on
> adding/changing any of the above? Have they rejected those proposals and
> told us to go away?
> 
Those don't really sound like sensor groups at all. What does "job-scheduler"
or "profiler" have to do with hardware monitoring ? We do allow additional
attributes if it makes sense, but those should be hardware monitoring related.
We also allow registration with other subsystems (such as gpio) if a hardware
monitoring device also has gpio pins and it seems too cumbersome to request
that an mfd driver is written. However, I am not convinced that completely
unrelated attributes should be handled through the hwmon subsystem; if this
is deemed necessary, it rather seems that hardware monitoring is one of many
functionalities of a given chip, and such functionality should be handled
elsewhere.

For the rest (enabling or disabling sensors dynamically), I am not specifically
opposed to improving the hwmon core to add such support, but someone would have
to make a specific proposal. One key problem is that the hwmon API assumes
'static' sensor allocation. The behavior of sensors appearing or disappearing
at runtime (even though it happens) is not well defined. Any proposal along
that line will need to ensure that userspace behavior is well documented.

Thanks,
Guenter

Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again

2018-03-13 Thread Michal Suchánek

On Tue, 13 Mar 2018 09:14:39 -0300
Mauricio Faria de Oliveira  wrote:

> Hi Michael,
> 
> On 03/13/2018 08:39 AM, Michael Ellerman wrote:
> >> I agree; the 'Using fallback displacement flush' message is
> >> misleading (is the system slower/fallback or not? Ô_o)  
> 
> > That message is actually just wrong.
> > 
> > It still prints that even if enable=false.
> > 
> > So we should change all those messages, perhaps:
> > 
> > pr_info("rfi-flush: fallback displacement flush
> > available\n"); pr_info("rfi-flush: ori type flush available\n");
> > pr_info("rfi-flush: mttrig type flush available\n");  
> 
> Indeed.

Maybe it would make more sense to move the messages to the function
that actually patches in the instructions?

Thanks

Michal

[PATCH v9 00/24] Speculative page faults

2018-03-13 Thread Laurent Dufour

This is a port to kernel 4.16 of the work done by Peter Zijlstra to
handle page faults without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
processes, since the page fault handler will not wait for other threads'
memory layout changes to complete, assuming that those changes are made in
another part of the process's memory space. This type of page fault is
named a speculative page fault. If the speculative page fault fails, because
a concurrent change is detected or because the underlying PMD or PTE tables
are not yet allocated, its processing is abandoned and a classic page fault
is then tried.

The speculative page fault (SPF) has to look for the VMA matching the fault
address without holding the mmap_sem. This is done by introducing a rwlock
which protects access to the mm_rb tree. Previously this was done using
SRCU, but it introduced a lot of scheduling to process the VMA freeing
operations, which hurt performance by 20% as reported by Kemi Wang [2].
Using a rwlock to protect access to the mm_rb tree limits the lock
contention to these operations, which are expected to be O(log n).

In addition, to ensure that the VMA is not freed behind our back, a
reference count is added and 2 services (get_vma() and put_vma()) are
introduced to handle it. When a VMA is fetched from the RB tree using
get_vma(), it must later be released using put_vma(). Furthermore, to allow
the VMA to be used again by the classic page fault handler, a service named
can_reuse_spf_vma() is introduced. This service is expected to be called
with the mmap_sem held. It checks that the VMA still matches the specified
address and releases its reference count; as the mmap_sem is held, the VMA
is guaranteed not to be freed behind our back. In general, the VMA's
reference count can be decremented while holding the mmap_sem, but it should
not be incremented, as holding the mmap_sem already ensures that the VMA is
stable. I no longer see the overhead I previously got with the
will-it-scale benchmark.
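
As an illustration of the get/put discipline described above, here is a
minimal userspace sketch. It is not the code from this series: struct
toy_vma, toy_get_vma() and toy_put_vma() are hypothetical names, the mm_rb
rwlock taken around the lookup is left out, and "freeing" is modelled by a
flag so the behaviour can be inspected.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a VMA; not the kernel structure. */
struct toy_vma {
	unsigned long start, end;	/* covers [start, end) */
	atomic_int refcount;		/* 1 while linked in the lookup tree */
	bool freed;			/* set instead of calling free() */
};

/*
 * Take a reference if addr falls inside the VMA, return NULL otherwise.
 * Assumption: in the real series the lookup runs under the mm_rb rwlock,
 * which is omitted from this sketch.
 */
static struct toy_vma *toy_get_vma(struct toy_vma *vma, unsigned long addr)
{
	if (addr < vma->start || addr >= vma->end)
		return NULL;
	atomic_fetch_add(&vma->refcount, 1);
	return vma;
}

/* Drop a reference; the last holder is the one that "frees" the VMA. */
static void toy_put_vma(struct toy_vma *vma)
{
	if (atomic_fetch_sub(&vma->refcount, 1) == 1)
		vma->freed = true;
}
```

With this shape, unlinking the VMA from the tree only drops the tree's own
reference, so a fault handler that still holds a reference keeps the
structure alive until its own put.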

The VMA's attributes checked during the speculative page fault processing
have to be protected against parallel changes. This is done by using a per
VMA sequence lock. This sequence lock allows the speculative page fault
handler to quickly check for parallel changes in progress and to abort the
speculative page fault in that case.
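
The check-and-abort pattern can be sketched in userspace with C11 atomics.
This is only a model under stated assumptions: the toy_* names are
hypothetical, memory ordering is left at the sequentially consistent
default, and the kernel's seqcount_t API (write_seqcount_begin() and
friends) is what the series actually uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a sequence count: an odd value means a writer is active. */
struct toy_seqcount {
	atomic_uint seq;
};

/* Writer side: bump the count on entry and on exit. */
static void toy_write_begin(struct toy_seqcount *s)
{
	atomic_fetch_add(&s->seq, 1);
}

static void toy_write_end(struct toy_seqcount *s)
{
	atomic_fetch_add(&s->seq, 1);
}

/* Reader side: take a snapshot without waiting for a writer to finish. */
static unsigned int toy_read_begin(struct toy_seqcount *s)
{
	return atomic_load(&s->seq);
}

/*
 * Returns true when the reader must abort: either a writer was active when
 * the snapshot was taken (odd value) or the count moved since then.
 */
static bool toy_read_retry(struct toy_seqcount *s, unsigned int snap)
{
	return (snap & 1) || atomic_load(&s->seq) != snap;
}
```

A speculative handler would take the snapshot before reading the VMA
attributes and call the retry check before committing; a true result maps
to "abort and fall back to the classic page fault".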

Once the VMA is found, the speculative page fault handler checks the
VMA's attributes to verify whether the page fault can be handled
correctly or not. The VMA is thus protected through a sequence lock which
allows fast detection of concurrent VMA changes. If such a change is
detected, the speculative page fault is aborted and a *classic* page fault
is tried. VMA sequence locking is added wherever VMA attributes that are
checked during the page fault are modified.

When the PTE is fetched, the VMA is checked to see if it has been changed.
Once the page table is locked, the VMA is known to be valid: any other
change touching this PTE would need to lock the page table, so no
parallel change is possible at this time.

The locking of the PTE is done with interrupts disabled, which allows
checking the PMD to ensure that there is no ongoing collapsing
operation. Since khugepaged first sets the PMD to pmd_none and then
waits for the other CPUs to have caught the IPI, if the PMD is
valid at the time the PTE is locked, we have the guarantee that the
collapsing operation will have to wait on the PTE lock to move forward. This
allows the SPF handler to map the PTE safely. If the PMD value is different
from the one recorded at the beginning of the SPF operation, the classic
page fault handler will be called to handle the operation while holding the
mmap_sem. As the PTE lock is taken with interrupts disabled, the lock is
taken using spin_trylock() to avoid deadlock when handling a page fault
while a TLB invalidate is requested by another CPU holding the PTE lock.
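
The trylock-or-fall-back behaviour can be illustrated with a small
userspace sketch. Everything here is hypothetical (toy_ptl,
toy_speculative_fault() and the result enum are made-up names, and an
atomic_flag stands in for the page table lock); it only shows that the
speculative path never spins and reports a retry when the lock is busy.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical outcome of a speculative attempt. */
enum toy_fault_result {
	TOY_FAULT_DONE,		/* handled speculatively */
	TOY_FAULT_RETRY,	/* caller should run the classic path */
};

/* Stand-in for the page table lock. */
struct toy_ptl {
	atomic_flag locked;
};

static bool toy_spin_trylock(struct toy_ptl *l)
{
	/* test_and_set returns the previous state: false means we got it. */
	return !atomic_flag_test_and_set(&l->locked);
}

static void toy_spin_unlock(struct toy_ptl *l)
{
	atomic_flag_clear(&l->locked);
}

/*
 * Speculative path: never wait on the lock. If another CPU holds it, give
 * up immediately instead of spinning with interrupts disabled.
 */
static enum toy_fault_result toy_speculative_fault(struct toy_ptl *l,
						   int *pte, int val)
{
	if (!toy_spin_trylock(l))
		return TOY_FAULT_RETRY;
	*pte = val;		/* "install" the entry under the lock */
	toy_spin_unlock(l);
	return TOY_FAULT_DONE;
}
```

The design point is that a failed trylock is cheap: the fault is simply
replayed through the classic handler, which may sleep and take the
mmap_sem.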

THP is not supported because, when checking the PMD, we can be
confused by an in-progress collapsing operation done by khugepaged. The
issue is that pmd_none() could be true either if the PMD is not yet
populated or if the underlying PTEs are about to be collapsed. So we
cannot safely allocate a PMD if pmd_none() is true.

This series adds a new software performance event named 'speculative-faults'
or 'spf'. It counts the number of page fault events successfully handled in
a speculative way. When recording 'faults,spf' events, the 'faults' one
counts the total number of page fault events while 'spf' only counts
the part of the faults processed in a speculative way.

There are some trace events introduced by this series. They allow
identifying why page faults were not processed in a speculative way. This
doesn't take into account the faults generated by a monothreaded process
which direc

[PATCH v9 01/24] mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-03-13 Thread Laurent Dufour

This configuration variable will be used to build the code needed to
handle speculative page faults.

By default it is turned off, and activated depending on architecture
support.

Suggested-by: Thomas Gleixner 
Signed-off-by: Laurent Dufour 
---
 mm/Kconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index abefa573bcd8..07c566c88faf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -759,3 +759,6 @@ config GUP_BENCHMARK
  performance of get_user_pages_fast().
 
  See tools/testing/selftests/vm/gup_benchmark.c
+
+config SPECULATIVE_PAGE_FAULT
+   bool
-- 
2.7.4

[PATCH v9 02/24] x86/mm: Define CONFIG_SPECULATIVE_PAGE_FAULT

2018-03-13 Thread Laurent Dufour

Introduce CONFIG_SPECULATIVE_PAGE_FAULT which turns on the Speculative Page
Fault handler when building for 64-bit with SMP.

Cc: Thomas Gleixner 
Signed-off-by: Laurent Dufour 
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a0a777ce4c7c..4c018c48d414 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -29,6 +29,7 @@ config X86_64
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
select X86_DEV_DMA_OPS
+   select SPECULATIVE_PAGE_FAULT if SMP
 
 #
 # Arch settings
-- 
2.7.4

[PATCH v9 03/24] powerpc/mm: Define CONFIG_SPECULATIVE_PAGE_FAULT

2018-03-13 Thread Laurent Dufour

Define CONFIG_SPECULATIVE_PAGE_FAULT for BOOK3S_64 and SMP. This enables
the Speculative Page Fault handler.

Support is only provided for BOOK3S_64 currently because:
- it requires CONFIG_PPC_STD_MMU because of the checks done in
  set_access_flags_filter()
- it requires BOOK3S because we can't support book3e_hugetlb_preload()
  called by update_mmu_cache()

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 73ce5dd07642..acf2696a6505 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -233,6 +233,7 @@ config PPC
select OLD_SIGACTIONif PPC32
select OLD_SIGSUSPEND
select SPARSE_IRQ
+   select SPECULATIVE_PAGE_FAULT   if PPC_BOOK3S_64 && SMP
select SYSCTL_EXCEPTION_TRACE
select VIRT_TO_BUS  if !PPC64
#
-- 
2.7.4

[PATCH v9 04/24] mm: Prepare for FAULT_FLAG_SPECULATIVE

2018-03-13 Thread Laurent Dufour

From: Peter Zijlstra 

When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.

Signed-off-by: Peter Zijlstra (Intel) 

[Port to 4.12 kernel]
[Remove the comment about the fault_env structure which has been
 implemented as the vm_fault structure in the kernel]
[move pte_map_lock()'s definition upper in the file]
Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h |  1 +
 mm/memory.c| 56 ++
 2 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4d02524a7998..2f3e98edc94a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -300,6 +300,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER0x40/* The fault originated in 
userspace */
 #define FAULT_FLAG_REMOTE  0x80/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100  /* The fault was during an instruction 
fetch */
+#define FAULT_FLAG_SPECULATIVE 0x200   /* Speculative fault, not holding 
mmap_sem */
 
 #define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
diff --git a/mm/memory.c b/mm/memory.c
index e0ae4999c824..8ac241b9f370 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
+static bool pte_map_lock(struct vm_fault *vmf)
+{
+   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+  vmf->address, &vmf->ptl);
+   return true;
+}
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry which was
  * read non-atomically.  Before making any commitment, on those architectures
@@ -2477,6 +2484,7 @@ static int wp_page_copy(struct vm_fault *vmf)
const unsigned long mmun_start = vmf->address & PAGE_MASK;
const unsigned long mmun_end = mmun_start + PAGE_SIZE;
struct mem_cgroup *memcg;
+   int ret = VM_FAULT_OOM;
 
if (unlikely(anon_vma_prepare(vma)))
goto oom;
@@ -2504,7 +2512,11 @@ static int wp_page_copy(struct vm_fault *vmf)
/*
 * Re-check the pte - we dropped the lock
 */
-   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+   if (!pte_map_lock(vmf)) {
+   mem_cgroup_cancel_charge(new_page, memcg, false);
+   ret = VM_FAULT_RETRY;
+   goto oom_free_new;
+   }
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -2596,7 +2608,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 oom:
if (old_page)
put_page(old_page);
-   return VM_FAULT_OOM;
+   return ret;
 }
 
 /**
@@ -2617,8 +2629,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
-   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
-  &vmf->ptl);
+   if (!pte_map_lock(vmf))
+   return VM_FAULT_RETRY;
/*
 * We might have raced with another page fault while we released the
 * pte_offset_map_lock.
@@ -2736,8 +2748,11 @@ static int do_wp_page(struct vm_fault *vmf)
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
lock_page(vmf->page);
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
+   if (!pte_map_lock(vmf)) {
+   unlock_page(vmf->page);
+   put_page(vmf->page);
+   return VM_FAULT_RETRY;
+   }
if (!pte_same(*vmf->pte, vmf->orig_pte)) {
unlock_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2947,8 +2962,10 @@ int do_swap_page(struct vm_fault *vmf)
 * Back out if somebody else faulted in this pte
 * while we released the pte lock.
 */
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
+   if (!pte_map_lock(vmf)) {
+   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+   return VM_FAULT_RETRY;
+   }

[PATCH v9 05/24] mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE

2018-03-13 Thread Laurent Dufour

When handling a page fault without holding the mmap_sem, the fetch of the
pte lock pointer and the locking will have to be done while ensuring
that the VMA is not touched behind our back.

So move the fetch and locking operations into a dedicated function.

Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8ac241b9f370..21b1212a0892 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
+static bool pte_spinlock(struct vm_fault *vmf)
+{
+   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+   spin_lock(vmf->ptl);
+   return true;
+}
+
 static bool pte_map_lock(struct vm_fault *vmf)
 {
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -3798,8 +3805,8 @@ static int do_numa_page(struct vm_fault *vmf)
 * validation through pte_unmap_same(). It's of NUMA type but
 * the pfn may be screwed if the read is non atomic.
 */
-   vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -3992,8 +3999,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
 
-   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
-- 
2.7.4

[PATCH v9 07/24] mm: VMA sequence count

2018-03-13 Thread Laurent Dufour

From: Peter Zijlstra 

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA is changed.

The unmap_page_range() one allows us to make assumptions about
page-tables; when we find the seqcount hasn't changed we can assume
page-tables are still valid.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.

Signed-off-by: Peter Zijlstra (Intel) 

[Port to 4.12 kernel]
[Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
[Introduce vm_write_* inline function depending on
 CONFIG_SPECULATIVE_PAGE_FAULT]
[Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
 using vm_raw_write* functions]
Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h   | 41 +
 include/linux/mm_types.h |  3 +++
 mm/memory.c  |  2 ++
 mm/mmap.c| 35 +++
 4 files changed, 81 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b6432a261e63..88042d843668 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1372,6 +1372,47 @@ static inline void unmap_shared_mapping_range(struct 
address_space *mapping,
unmap_mapping_range(mapping, holebegin, holelen, 0);
 }
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+   write_seqcount_begin(&vma->vm_sequence);
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+int subclass)
+{
+   write_seqcount_begin_nested(&vma->vm_sequence, subclass);
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+   write_seqcount_end(&vma->vm_sequence);
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+   raw_write_seqcount_begin(&vma->vm_sequence);
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+   raw_write_seqcount_end(&vma->vm_sequence);
+}
+#else
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+int subclass)
+{
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
 extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b9591d..34fde7111e88 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -333,6 +333,9 @@ struct vm_area_struct {
struct mempolicy *vm_policy;/* NUMA policy for the VMA */
 #endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   seqcount_t vm_sequence;
+#endif
 } __randomize_layout;
 
 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index 4bc7b0bdcb40..d57749966fb8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1503,6 +1503,7 @@ void unmap_page_range(struct mmu_gather *tlb,
unsigned long next;
 
BUG_ON(addr >= end);
+   vm_write_begin(vma);
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -1512,6 +1513,7 @@ void unmap_page_range(struct mmu_gather *tlb,
next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
} while (pgd++, addr = next, addr != end);
tlb_end_vma(tlb, vma);
+   vm_write_end(vma);
 }
 
 
diff --git a/mm/mmap.c b/mm/mmap.c
index faf85699f1a1..5898255d0aeb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -558,6 +558,10 @@ void __vma_link_rb(struct mm_struct *mm, struct 
vm_area_struct *vma,
else
mm->highest_vm_end = vm_end_gap(vma);
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   seqcount_init(&vma->vm_sequence);
+#endif
+
/*
 * vma->vm_prev wasn't known when we followed the rbtree to find the
 * correct insertion point for that vma. As a result, we could not
@@ -692,6 +696,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
long adjust_next = 0;
int remove_next = 0;
 
+   /*
+* Why using vm_raw_write*() functions here to avoid lockdep's warning ?
+*
+* Locked is complaining about a theoretical lock dependency, involving
+* 3 locks:
+*   mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim
+*
+* Here are the major path leading to this dependency :
+*  1. __vma_adjust() mmap_sem  -> vm_sequence -> i_mmap_rwsem
+*  2. move_vmap() mmap_sem -> vm_sequence -

[PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF

2018-03-13 Thread Laurent Dufour

pte_unmap_same() makes the assumption that the page tables are still
around because the mmap_sem is held.
This is no longer the case when running a speculative page fault, and an
additional check must be made to ensure that the final page tables are
still there.

This is now done by calling pte_spinlock() to check the VMA's
consistency while locking the page tables.

This requires passing a vm_fault structure to pte_unmap_same(), which
contains all the needed parameters.

As pte_spinlock() may fail in the case of a speculative page fault, if the
VMA has been touched behind our back, pte_unmap_same() should now handle 3
cases:
1. ptes are the same (0)
2. ptes are different (VM_FAULT_PTNOTSAME)
3. a change to the VMA has been detected (VM_FAULT_RETRY)

Case 2 is handled by the introduction of a new VM_FAULT flag named
VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
page fault while holding the mmap_sem.
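
The three outcomes listed above can be modelled with a short userspace
sketch. It is not the kernel function: struct toy_fault and the TOY_*
constants are hypothetical, the vma_changed flag stands in for
pte_spinlock() failing, and the sketch ignores that the real code only
takes the lock when a pte_t is wider than an unsigned long.

```c
#include <stdbool.h>

/* Hypothetical result codes; only the PTNOTSAME value comes from the diff. */
#define TOY_VM_FAULT_RETRY	0x0400	/* illustrative value (assumption) */
#define TOY_VM_FAULT_PTNOTSAME	0x4000	/* value used by the patch */

/* Minimal stand-in for the parts of struct vm_fault used here. */
struct toy_fault {
	unsigned long pte;	/* entry currently in the page table */
	unsigned long orig_pte;	/* entry sampled when the fault started */
	bool vma_changed;	/* models pte_spinlock() returning false */
};

/*
 * Returns 0 when the entry is unchanged, TOY_VM_FAULT_PTNOTSAME when it
 * differs, and TOY_VM_FAULT_RETRY when the VMA changed underneath us.
 */
static int toy_pte_unmap_same(const struct toy_fault *f)
{
	if (f->vma_changed)
		return TOY_VM_FAULT_RETRY;
	if (f->pte != f->orig_pte)
		return TOY_VM_FAULT_PTNOTSAME;
	return 0;
}
```

Note the inversion compared with the old helper: the previous version
returned non-zero for "same", while this one returns 0 for "same" so that
any non-zero value can be treated as a reason to leave the handler.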

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h |  1 +
 mm/memory.c| 29 +++--
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2f3e98edc94a..b6432a261e63 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1199,6 +1199,7 @@ static inline void clear_page_pfmemalloc(struct page 
*page)
 #define VM_FAULT_NEEDDSYNC  0x2000 /* ->fault did not modify page tables
 * and needs fsync() to complete (for
 * synchronous page faults in DAX) */
+#define VM_FAULT_PTNOTSAME 0x4000  /* Page table entries have changed */
 
 #define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
diff --git a/mm/memory.c b/mm/memory.c
index 21b1212a0892..4bc7b0bdcb40 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
  * parts, do_swap_page must check under lock before unmapping the pte and
  * proceeding (but do_wp_page is only called after already making such a check;
  * and do_anonymous_page can safely check later on).
+ *
+ * pte_unmap_same() returns:
+ * 0   if the PTE are the same
+ * VM_FAULT_PTNOTSAME  if the PTE are different
+ * VM_FAULT_RETRY  if the VMA has changed in our back during
+ * a speculative page fault handling.
  */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-   pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
 {
-   int same = 1;
+   int ret = 0;
+
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
if (sizeof(pte_t) > sizeof(unsigned long)) {
-   spinlock_t *ptl = pte_lockptr(mm, pmd);
-   spin_lock(ptl);
-   same = pte_same(*page_table, orig_pte);
-   spin_unlock(ptl);
+   if (pte_spinlock(vmf)) {
+   if (!pte_same(*vmf->pte, vmf->orig_pte))
+   ret = VM_FAULT_PTNOTSAME;
+   spin_unlock(vmf->ptl);
+   } else
+   ret = VM_FAULT_RETRY;
}
 #endif
-   pte_unmap(page_table);
-   return same;
+   pte_unmap(vmf->pte);
+   return ret;
 }
 
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
@@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
int exclusive = 0;
int ret = 0;
 
-   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+   ret = pte_unmap_same(vmf);
+   if (ret)
goto out;
 
entry = pte_to_swp_entry(vmf->orig_pte);
-- 
2.7.4

[PATCH v9 09/24] mm: protect mremap() against SPF handler

2018-03-13 Thread Laurent Dufour

If a thread is remapping an area while another one is faulting on the
destination area, the SPF handler may fetch the vma from the RB tree before
the ptes have been moved by the other thread. This means that the moved ptes
will overwrite those created by the page fault handler, leading to leaked
pages.

CPU 1                           CPU 2
enter mremap()
unmap the dest area
copy_vma()                      Enter speculative page fault handler
   >> at this time the dest area is present in the RB tree
                                fetch the vma matching dest area
                                create a pte as the VMA matched
                                Exit the SPF handler

move_ptes()
  > it is assumed that the dest area is empty,
  > the move ptes overwrite the page mapped by the CPU2.

To prevent that, when the VMA matching the dest area is extended or created
by copy_vma(), it should be marked as not available to the SPF handler.
The usual way to do so is to rely on vm_write_begin()/end().
This is already done in __vma_adjust() called by copy_vma() (through
vma_merge()). But __vma_adjust() calls vm_write_end() before returning,
which creates a window for another thread.
This patch adds a new parameter to vma_merge() which is passed down to
vma_adjust().
The assumption is that copy_vma() returns a vma which should be
released by the caller, by calling vm_raw_write_end(), once the ptes have
been moved.
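
The effect of the new parameter can be pictured with a small userspace
model. This is a sketch under assumptions, not the patch itself: struct
toy_area and the toy_area_* helpers are hypothetical, and a plain counter
with the "odd means writer active" convention stands in for vm_sequence.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: an odd sequence value means the area is being changed. */
struct toy_area {
	atomic_uint seq;
	unsigned long end;
};

/* What a speculative reader would test before trusting the area. */
static bool toy_area_is_stable(struct toy_area *a)
{
	return (atomic_load(&a->seq) & 1) == 0;
}

/*
 * Models __vma_adjust(): with keep_locked set, the write section is left
 * open on return so that no reader can use the area yet.
 */
static void toy_area_adjust(struct toy_area *a, unsigned long new_end,
			    bool keep_locked)
{
	atomic_fetch_add(&a->seq, 1);		/* like vm_raw_write_begin() */
	a->end = new_end;
	if (!keep_locked)
		atomic_fetch_add(&a->seq, 1);	/* like vm_raw_write_end() */
}

/* Called by the owner once the ptes have been moved. */
static void toy_area_write_end(struct toy_area *a)
{
	atomic_fetch_add(&a->seq, 1);
}
```

In this model the window described above corresponds to the area being
"stable" between the adjust and the pte move; keep_locked closes it by
deferring the final increment to the caller.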

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h | 16 
 mm/mmap.c  | 47 ---
 mm/mremap.c| 13 +
 3 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 88042d843668..ef6ef0627090 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2189,16 +2189,24 @@ void anon_vma_interval_tree_verify(struct 
anon_vma_chain *node);
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int 
cap_sys_admin);
 extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-   struct vm_area_struct *expand);
+   struct vm_area_struct *expand, bool keep_locked);
 static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
 {
-   return __vma_adjust(vma, start, end, pgoff, insert, NULL);
+   return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
 }
-extern struct vm_area_struct *vma_merge(struct mm_struct *,
+extern struct vm_area_struct *__vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-   struct mempolicy *, struct vm_userfaultfd_ctx);
+   struct mempolicy *, struct vm_userfaultfd_ctx, bool keep_locked);
+static inline struct vm_area_struct *vma_merge(struct mm_struct *vma,
+   struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+   unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+   pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
+{
+   return __vma_merge(vma, prev, addr, end, vm_flags, anon, file, off,
+  pol, uff, false);
+}
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
diff --git a/mm/mmap.c b/mm/mmap.c
index d6533cb85213..ac32b577a0c9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -684,7 +684,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
  */
 int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-   struct vm_area_struct *expand)
+   struct vm_area_struct *expand, bool keep_locked)
 {
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
@@ -996,7 +996,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 
if (next && next != vma)
vm_raw_write_end(next);
-   vm_raw_write_end(vma);
+   if (!keep_locked)
+   vm_raw_write_end(vma);
 
validate_mm(mm);
 
@@ -1132,12 +1133,13 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
  * parameter) may establish ptes with the wrong permissions of 
  * instead of the right permissions of .
  */
-struct vm_area_struct *vma_merge(struct mm_struct *mm,
+struct vm_area_struct *__vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr,
unsigned long end, unsigned long vm_flags,

[PATCH v9 10/24] mm: Protect SPF handler against anon_vma changes

2018-03-13 Thread Laurent Dufour

The speculative page fault handler must be protected against anon_vma
changes. This is because page_add_new_anon_rmap() is called during the
speculative path.

In addition, don't try a speculative page fault if the VMA doesn't have an
anon_vma structure allocated, because its allocation should be
protected by the mmap_sem.

In __vma_adjust() when importer->anon_vma is set, there is no need to
protect against speculative page faults since speculative page fault
is aborted if the vma->anon_vma is not set.

When calling page_add_new_anon_rmap(), vma->anon_vma is necessarily
valid since we checked for it when locking the pte, and the anon_vma is
removed only once the pte is unlocked. So even if the speculative page
fault handler is running concurrently with do_unmap(), the change will
be detected when locking the pte: the pte is locked in unmap_region()
(through unmap_vmas()) and the anon_vma is unlinked later, and the vma
sequence counter is updated in unmap_page_range() before the pte is
locked, and then again in free_pgtables().

Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index d57749966fb8..0200340ef089 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -624,7 +624,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 * Hide vma from rmap and truncate_pagecache before freeing
 * pgtables
 */
+   vm_write_begin(vma);
unlink_anon_vmas(vma);
+   vm_write_end(vma);
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
@@ -638,7 +640,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
   && !is_vm_hugetlb_page(next)) {
vma = next;
next = vma->vm_next;
+   vm_write_begin(vma);
unlink_anon_vmas(vma);
+   vm_write_end(vma);
unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
-- 
2.7.4

[PATCH v9 12/24] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

2018-03-13 Thread Laurent Dufour

migrate_misplaced_page() is only called during the page fault handling so
it's better to pass the pointer to the struct vm_fault instead of the vma.

This way during the speculative page fault path the saved vma->vm_flags
could be used.

Signed-off-by: Laurent Dufour 
---
 include/linux/migrate.h | 4 ++--
 mm/memory.c | 2 +-
 mm/migrate.c| 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f2b4abbca55e..fd4c3ab7bd9c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -126,14 +126,14 @@ static inline void __ClearPageMovable(struct page *page)
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node);
+ struct vm_fault *vmf, int node);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
 static inline int migrate_misplaced_page(struct page *page,
-struct vm_area_struct *vma, int node)
+struct vm_fault *vmf, int node)
 {
return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index 46fe92b93682..412014d5785b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3880,7 +3880,7 @@ static int do_numa_page(struct vm_fault *vmf)
}
 
/* Migrate to the requested node */
-   migrated = migrate_misplaced_page(page, vma, target_nid);
+   migrated = migrate_misplaced_page(page, vmf, target_nid);
if (migrated) {
page_nid = target_nid;
flags |= TNF_MIGRATED;
diff --git a/mm/migrate.c b/mm/migrate.c
index 5d0dc7b85f90..ad8692ca6a4f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1900,7 +1900,7 @@ bool pmd_trans_migrating(pmd_t pmd)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+int migrate_misplaced_page(struct page *page, struct vm_fault *vmf,
   int node)
 {
pg_data_t *pgdat = NODE_DATA(node);
@@ -1913,7 +1913,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 * with execute permissions as they are probably shared libraries.
 */
if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
-   (vma->vm_flags & VM_EXEC))
+   (vmf->vma_flags & VM_EXEC))
goto out;
 
/*
-- 
2.7.4

[PATCH v9 11/24] mm: Cache some VMA fields in the vm_fault structure

2018-03-13 Thread Laurent Dufour

When handling a speculative page fault, the vma->vm_flags and
vma->vm_page_prot fields are read once the page table lock is released. So
there is no longer any guarantee that these fields will not change behind
our back. They will be saved in the vm_fault structure before the VMA is
checked for changes.
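
The caching idea can be sketched in plain C as follows. This is only an
illustration under simplifying assumptions: the toy_* types and the
TOY_VM_WRITE flag are hypothetical, and the real code would also need
READ_ONCE() style accesses, which are left out here.

```c
#include <assert.h>

#define TOY_VM_WRITE 0x2UL	/* hypothetical flag value */

struct toy_vma {
	unsigned long vm_flags;
	unsigned long vm_page_prot;
};

struct toy_fault {
	struct toy_vma *vma;
	/* Snapshot taken once, before the VMA is checked for changes. */
	unsigned long vma_flags;
	unsigned long vma_page_prot;
};

static void toy_fault_init(struct toy_fault *vmf, struct toy_vma *vma)
{
	vmf->vma = vma;
	vmf->vma_flags = vma->vm_flags;
	vmf->vma_page_prot = vma->vm_page_prot;
}

/* Later decisions use the snapshot, never vmf->vma->vm_flags. */
static int toy_fault_may_write(const struct toy_fault *vmf)
{
	return (vmf->vma_flags & TOY_VM_WRITE) != 0;
}
```

Every decision taken during the fault then sees the same values, even if
another thread modifies the VMA in the meantime.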

This patch also sets the fields in hugetlb_no_page() and
__collapse_huge_page_swapin() even if it is not needed by the callee.

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h |  6 ++
 mm/hugetlb.c   |  2 ++
 mm/khugepaged.c|  2 ++
 mm/memory.c| 38 --
 4 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef6ef0627090..dfa81a638b7c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -359,6 +359,12 @@ struct vm_fault {
 * page table to avoid allocation from
 * atomic context.
 */
+   /*
+* These entries are required when handling speculative page fault.
+* This way the page handling is done using consistent field values.
+*/
+   unsigned long vma_flags;
+   pgprot_t vma_page_prot;
 };
 
 /* page entry size for vm->huge_fault() */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 446427cafa19..f71db2b42b30 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3717,6 +3717,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
.vma = vma,
.address = address,
.flags = flags,
+   .vma_flags = vma->vm_flags,
+   .vma_page_prot = vma->vm_page_prot,
/*
 * Hard to debug if it ends up being
 * used by a callee that assumes
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 32314e9e48dd..a946d5306160 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -882,6 +882,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
.pgoff = linear_page_index(vma, address),
+   .vma_flags = vma->vm_flags,
+   .vma_page_prot = vma->vm_page_prot,
};
 
/* we only decide to swapin, if there is enough young ptes */
diff --git a/mm/memory.c b/mm/memory.c
index 0200340ef089..46fe92b93682 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2615,7 +2615,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 * Don't let another task, with possibly unlocked vma,
 * keep the mlocked page.
 */
-   if (page_copied && (vma->vm_flags & VM_LOCKED)) {
+   if (page_copied && (vmf->vma_flags & VM_LOCKED)) {
lock_page(old_page);/* LRU manipulation */
if (PageMlocked(old_page))
munlock_vma_page(old_page);
@@ -2649,7 +2649,7 @@ static int wp_page_copy(struct vm_fault *vmf)
  */
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
-   WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
+   WARN_ON_ONCE(!(vmf->vma_flags & VM_SHARED));
if (!pte_map_lock(vmf))
return VM_FAULT_RETRY;
/*
@@ -2751,7 +2751,7 @@ static int do_wp_page(struct vm_fault *vmf)
 * We should not cow pages in a shared writeable mapping.
 * Just mark the pages writable and/or call ops->pfn_mkwrite.
 */
-   if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+   if ((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
 (VM_WRITE|VM_SHARED))
return wp_pfn_shared(vmf);
 
@@ -2798,7 +2798,7 @@ static int do_wp_page(struct vm_fault *vmf)
return VM_FAULT_WRITE;
}
unlock_page(vmf->page);
-   } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+   } else if (unlikely((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(vmf);
}
@@ -3067,7 +3067,7 @@ int do_swap_page(struct vm_fault *vmf)
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
-   pte = mk_pte(page, vma->vm_page_prot);
+   pte = mk_pte(page, vmf->vma_page_prot);
if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
@@ -3093,7 +3093,7 @@ int do_swap_page(struct vm_fault *vmf)
 
swap_free(entry);
if (mem_cgroup_swap_full(page) ||
-   (vma->vm_flags & VM_LOCKED) || PageMlocked(page

[PATCH v9 14/24] mm: Introduce __maybe_mkwrite()

2018-03-13 Thread Laurent Dufour

The current maybe_mkwrite() is getting passed the pointer to the vma
structure to fetch the vm_flags field.

When dealing with the speculative page fault handler, it will be better to
rely on the cached vm_flags value stored in the vm_fault structure.

This patch introduces a __maybe_mkwrite() service which can be called by
passing the value of the vm_flags field.

No functional change is expected for the other callers of
maybe_mkwrite().

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h | 9 +++--
 mm/memory.c| 6 +++---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dfa81a638b7c..a84ddc218bbd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -684,13 +684,18 @@ void free_compound_page(struct page *page);
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
  * that do not have writing enabled, when used by access_process_vm.
  */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+static inline pte_t __maybe_mkwrite(pte_t pte, unsigned long vma_flags)
 {
-   if (likely(vma->vm_flags & VM_WRITE))
+   if (likely(vma_flags & VM_WRITE))
pte = pte_mkwrite(pte);
return pte;
 }
 
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+   return __maybe_mkwrite(pte, vma->vm_flags);
+}
+
 int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
struct page *page);
 int finish_fault(struct vm_fault *vmf);
diff --git a/mm/memory.c b/mm/memory.c
index 0a0a483d9a65..af0338fbc34d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2472,7 +2472,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
entry = pte_mkyoung(vmf->orig_pte);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2549,8 +2549,8 @@ static int wp_page_copy(struct vm_fault *vmf)
inc_mm_counter_fast(mm, MM_ANONPAGES);
}
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
-   entry = mk_pte(new_page, vma->vm_page_prot);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   entry = mk_pte(new_page, vmf->vma_page_prot);
+   entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
/*
 * Clear the pte entry and flush it first, before updating the
 * pte with the new entry. This will avoid a race condition
-- 
2.7.4

[PATCH v9 13/24] mm: Introduce __lru_cache_add_active_or_unevictable

2018-03-13 Thread Laurent Dufour

The speculative page fault handler, which runs without holding the
mmap_sem, calls lru_cache_add_active_or_unevictable(), but the vm_flags
value is not guaranteed to remain constant.
Introduce __lru_cache_add_active_or_unevictable(), which takes the vma
flags value as a parameter instead of the vma pointer.

Signed-off-by: Laurent Dufour 
---
 include/linux/swap.h | 10 --
 mm/memory.c  |  8 
 mm/swap.c|  6 +++---
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1985940af479..a7dc37e0e405 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -338,8 +338,14 @@ extern void deactivate_file_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void lru_cache_add_active_or_unevictable(struct page *page,
-   struct vm_area_struct *vma);
+extern void __lru_cache_add_active_or_unevictable(struct page *page,
+   unsigned long vma_flags);
+
+static inline void lru_cache_add_active_or_unevictable(struct page *page,
+   struct vm_area_struct *vma)
+{
+   return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
+}
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/mm/memory.c b/mm/memory.c
index 412014d5785b..0a0a483d9a65 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2560,7 +2560,7 @@ static int wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(new_page, vma);
+   __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
 * We call the notify macro here because, when using secondary
 * mmu page tables (such as kvm shadow page tables), we want the
@@ -3083,7 +3083,7 @@ int do_swap_page(struct vm_fault *vmf)
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
@@ -3234,7 +3234,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3486,7 +3486,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/swap.c b/mm/swap.c
index 3dd518832096..f2f9c587246f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -455,12 +455,12 @@ void lru_cache_add(struct page *page)
  * directly back onto it's zone's unevictable list, it does NOT use a
  * per cpu pagevec.
  */
-void lru_cache_add_active_or_unevictable(struct page *page,
-struct vm_area_struct *vma)
+void __lru_cache_add_active_or_unevictable(struct page *page,
+  unsigned long vma_flags)
 {
VM_BUG_ON_PAGE(PageLRU(page), page);
 
-   if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
+   if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
SetPageActive(page);
else if (!TestSetPageMlocked(page)) {
/*
-- 
2.7.4

[PATCH v9 15/24] mm: Introduce __vm_normal_page()

2018-03-13 Thread Laurent Dufour

When dealing with the speculative fault path we should use the cached
values of the VMA's fields stored in the vm_fault structure.

Currently vm_normal_page() uses the pointer to the VMA to fetch the
vm_flags value. This patch provides a new __vm_normal_page() which
receives the vm_flags value as a parameter.

Note: The speculative path is turned on for architectures providing support
for the special PTE flag. So only the first block of vm_normal_page() is
used during the speculative path.

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h |  7 +--
 mm/memory.c| 18 ++
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a84ddc218bbd..73b8b99f482b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1263,8 +1263,11 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap */
 };
 
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-pte_t pte, bool with_public_device);
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags);
+#define _vm_normal_page(vma, addr, pte, with_public_device) \
+   __vm_normal_page(vma, addr, pte, with_public_device, (vma)->vm_flags)
 #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
 
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index af0338fbc34d..184a0d663a76 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -826,8 +826,9 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 #else
 # define HAVE_PTE_SPECIAL 0
 #endif
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-pte_t pte, bool with_public_device)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags)
 {
unsigned long pfn = pte_pfn(pte);
 
@@ -836,7 +837,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
goto check_pfn;
if (vma->vm_ops && vma->vm_ops->find_special_page)
return vma->vm_ops->find_special_page(vma, addr);
-   if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+   if (vma_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
if (is_zero_pfn(pfn))
return NULL;
@@ -868,8 +869,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 
/* !HAVE_PTE_SPECIAL case follows: */
 
-   if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
-   if (vma->vm_flags & VM_MIXEDMAP) {
+   if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+   if (vma_flags & VM_MIXEDMAP) {
if (!pfn_valid(pfn))
return NULL;
goto out;
@@ -878,7 +879,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
off = (addr - vma->vm_start) >> PAGE_SHIFT;
if (pfn == vma->vm_pgoff + off)
return NULL;
-   if (!is_cow_mapping(vma->vm_flags))
+   if (!is_cow_mapping(vma_flags))
return NULL;
}
}
@@ -2742,7 +2743,8 @@ static int do_wp_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
 
-   vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+   vmf->page = __vm_normal_page(vma, vmf->address, vmf->orig_pte, false,
+vmf->vma_flags);
if (!vmf->page) {
/*
 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -3839,7 +3841,7 @@ static int do_numa_page(struct vm_fault *vmf)
ptep_modify_prot_commit(vma->vm_mm, vmf->address, vmf->pte, pte);
update_mmu_cache(vma, vmf->address, vmf->pte);
 
-   page = vm_normal_page(vma, vmf->address, pte);
+   page = __vm_normal_page(vma, vmf->address, pte, false, vmf->vma_flags);
if (!page) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
return 0;
-- 
2.7.4

[PATCH v9 16/24] mm: Introduce __page_add_new_anon_rmap()

2018-03-13 Thread Laurent Dufour

When dealing with the speculative page fault handler, we may race with the
VMA being split or merged. In this case the vma->vm_start and vma->vm_end
fields may not match the address at which the page fault is occurring.

This can only happen when the VMA is split, but in that case the
anon_vma pointer of the new VMA will be the same as the original one,
because in __split_vma() new->anon_vma is set to src->anon_vma when
*new = *vma.

So even if the VMA boundaries are not correct, the anon_vma pointer is
still valid.

If the VMA has been merged, then the VMA in which it has been merged
must have the same anon_vma pointer otherwise the merge can't be done.

So in all cases we know that the anon_vma is valid: we have checked
before starting the speculative page fault that the anon_vma pointer
is valid for this VMA, and since there is an anon_vma, a page has been
backed at some point. Before the VMA is cleaned, the page table lock
would have to be grabbed to clean the PTE, and the anon_vma field is
checked once the PTE is locked.

This patch introduces a new __page_add_new_anon_rmap() service which
doesn't check the VMA boundaries, and creates a new inline one
which does the check.

When called from a page fault handler, if this is not a speculative one,
there is a guarantee that vm_start and vm_end match the faulting address,
so this check is useless. In the context of the speculative page fault
handler, this check may be wrong but anon_vma is still valid as explained
above.

Signed-off-by: Laurent Dufour 
---
 include/linux/rmap.h | 12 ++--
 mm/memory.c  |  8 
 mm/rmap.c|  5 ++---
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..a5d282573093 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -174,8 +174,16 @@ void page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-   unsigned long, bool);
+void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
+static inline void page_add_new_anon_rmap(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address, bool compound)
+{
+   VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+   __page_add_new_anon_rmap(page, vma, address, compound);
+}
+
 void page_add_file_rmap(struct page *, bool);
 void page_remove_rmap(struct page *, bool);
 
diff --git a/mm/memory.c b/mm/memory.c
index 184a0d663a76..66517535514b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2559,7 +2559,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 * thread doing COW.
 */
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-   page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
__lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
@@ -3083,7 +3083,7 @@ int do_swap_page(struct vm_fault *vmf)
 
/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
@@ -3234,7 +3234,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
}
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
@@ -3486,7 +3486,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
/* copy-on-write page */
if (write && !(vmf->vma_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
diff --git a/mm/rmap.c b/mm/rmap.c
index 9eaa6354fe70..e028d660c304 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1136,7 +1136,7 @@ void do_page_add_

[PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock

2018-03-13 Thread Laurent Dufour

This change is inspired by Peter's proposed patch [1], which was
protecting the VMA using SRCU. Unfortunately, SRCU does not scale well in
that particular case, and it introduces a major performance degradation
due to excessive scheduling operations.

To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
protects its access using a rwlock.  As a lookup in the mm_rb tree is an
O(log n) search, it is safe to protect it using such a lock.  The VMA cache
is not protected by the new rwlock and it should not be used without
holding the mmap_sem.

To allow the picked VMA structure to be used once the rwlock is released, a
use count is added to the VMA structure. When the VMA is allocated it is
set to 1.  Each time the VMA is picked with the rwlock held its use count
is incremented. Each time the VMA is released it is decremented. When the
use count hits zero, the VMA is no longer used and should be
freed.
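
The use count life cycle described above can be sketched in userspace C with
C11 atomics. This is a simplified model, not the patch's implementation: the
toy_* names are hypothetical, the rwlock that protects the lookup is left
out, and toy_freed only exists to make the final free observable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_vma {
	atomic_int ref_count;
};

static int toy_freed;	/* counts frees, for demonstration only */

/* The use count starts at 1 when the VMA is allocated. */
static struct toy_vma *toy_vma_alloc(void)
{
	struct toy_vma *vma = malloc(sizeof(*vma));

	if (vma)
		atomic_init(&vma->ref_count, 1);
	return vma;
}

/* In the patch the lookup side takes this reference with the rwlock held. */
static void toy_vma_get(struct toy_vma *vma)
{
	atomic_fetch_add(&vma->ref_count, 1);
}

/* atomic_fetch_sub() returns the previous value: 1 means last user. */
static void toy_vma_put(struct toy_vma *vma)
{
	if (atomic_fetch_sub(&vma->ref_count, 1) == 1) {
		free(vma);
		toy_freed++;
	}
}
```

The initial reference belongs to the tree itself, so a VMA removed from the
tree stays alive until the last speculative user drops its reference.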

This patch prepares for 2 kinds of VMA access:
 - as usual, under the control of the mmap_sem,
 - without holding the mmap_sem, for the speculative page fault handler.

Accesses done under the control of the mmap_sem don't need to grab the
rwlock to protect read access to the mm_rb tree, but write accesses must
be done under the protection of the rwlock too. This affects inserting and
removing elements in the RB tree.

The patch introduces 2 new functions:
 - get_vma() to find a VMA based on an address while holding the new rwlock.
 - put_vma() to release the VMA when it is no longer used.
These services are designed to be used when accesses are made to the RB
tree without holding the mmap_sem.

When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
we rely on the WMB done when releasing the rwlock to serialize the write
with the RMB done in a later patch to check for the VMA's validity.

When free_vma is called, the file associated with the VMA is closed
immediately, but the policy and the file structure remain in use until
the VMA's use count reaches 0, which may happen later when exiting an
in-progress speculative page fault.

[1] https://patchwork.kernel.org/patch/5108281/

Cc: Peter Zijlstra (Intel) 
Cc: Matthew Wilcox 
Signed-off-by: Laurent Dufour 
---
 include/linux/mm_types.h |   4 ++
 kernel/fork.c|   3 ++
 mm/init-mm.c |   3 ++
 mm/internal.h|   6 +++
 mm/mmap.c| 122 ++-
 5 files changed, 106 insertions(+), 32 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 34fde7111e88..28c763ea1036 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,6 +335,7 @@ struct vm_area_struct {
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
seqcount_t vm_sequence;
+   atomic_t vm_ref_count;  /* see vma_get(), vma_put() */
 #endif
 } __randomize_layout;
 
@@ -353,6 +354,9 @@ struct kioctx_table;
 struct mm_struct {
struct vm_area_struct *mmap;/* list of VMAs */
struct rb_root mm_rb;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   rwlock_t mm_rb_lock;
+#endif
u32 vmacache_seqnum;   /* per-thread vmacache */
 #ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index a32e1c4311b2..9ecac4f725b9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -889,6 +889,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
mm->vmacache_seqnum = 0;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   rwlock_init(&mm->mm_rb_lock);
+#endif
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f94d5d15ebc0..e71ac37a98c4 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -17,6 +17,9 @@
 
 struct mm_struct init_mm = {
.mm_rb  = RB_ROOT,
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   .mm_rb_lock = __RW_LOCK_UNLOCKED(init_mm.mm_rb_lock),
+#endif
.pgd= swapper_pg_dir,
.mm_users   = ATOMIC_INIT(2),
.mm_count   = ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 62d8c34e63d5..fb2667b20f0a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,12 @@ void page_writeback_init(void);
 
 int do_swap_page(struct vm_fault *vmf);
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern struct vm_area_struct *get_vma(struct mm_struct *mm,
+ unsigned long addr);
+extern void put_vma(struct vm_area_struct *vma);
+#endif
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index ac32b577a0c9..182359a5445c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -160,6 +160,27

[PATCH v9 18/24] mm: Provide speculative fault infrastructure

2018-03-13 Thread Laurent Dufour

From: Peter Zijlstra 

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

Not holding mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.
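
The snapshot/validate/retry flow can be illustrated with a single-threaded
userspace sketch. It is only a model under stated assumptions: the toy_*
names are hypothetical, the race callback stands in for a concurrent writer,
and the real handler relies on the PTE lock and memory barriers that are not
shown here.

```c
#include <assert.h>

struct toy_vma {
	unsigned int seq;	/* bumped by writers, odd while they run */
	unsigned long flags;
};

enum toy_result { TOY_DONE, TOY_RETRY };

/*
 * Speculative attempt: snapshot the sequence, prepare the new entry
 * without any lock, then validate the snapshot before publishing it.
 */
static enum toy_result toy_speculative_fault(struct toy_vma *vma,
					     unsigned long *pte,
					     void (*race)(struct toy_vma *))
{
	unsigned int seq = vma->seq;
	unsigned long new_pte;

	if (seq & 1)
		return TOY_RETRY;	/* writer in progress, don't wait */

	new_pte = vma->flags | 0x1000;	/* stands for "obtain the page" */

	if (race)
		race(vma);		/* simulated concurrent change */

	if (vma->seq != seq)
		return TOY_RETRY;	/* state changed, fall back */

	*pte = new_pte;
	return TOY_DONE;
}

/* Simulates a complete write section run by another thread. */
static void toy_concurrent_change(struct toy_vma *vma)
{
	vma->seq += 2;
	vma->flags = 0;
}
```

A TOY_RETRY result corresponds to failing the fault with VM_FAULT_RETRY, so
that the regular path can redo it while holding the mmap_sem.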

Signed-off-by: Peter Zijlstra (Intel) 

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched behind our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Fetch p4d and pud]
[Set vmf.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support the
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
 the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
[Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Add mem cgroup oom check]
[Use READ_ONCE to access p*d entries]
[Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
[Don't fetch pte again in handle_pte_fault() when running the speculative
 path]
[Check PMD against concurrent collapsing operation]
[Try to spin lock the PTE during the speculative path to avoid a deadlock
 with other CPUs invalidating the TLB and requiring this CPU to catch the
 inter-processor interrupt]
Signed-off-by: Laurent Dufour 
---
 include/linux/hugetlb_inline.h |   2 +-
 include/linux/mm.h |   8 +
 include/linux/pagemap.h|   4 +-
 mm/internal.h  |  16 +-
 mm/memory.c| 342 -
 5 files changed, 364 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..9e25283d6fc9 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -8,7 +8,7 @@
 
 static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
-   return !!(vma->vm_flags & VM_HUGETLB);
+   return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
 }
 
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73b8b99f482b..1acc3f4e07d1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -329,6 +329,10 @@ struct vm_fault {
gfp_t gfp_mask; /* gfp mask to be used for allocations 
*/
pgoff_t pgoff;  /* Logical page offset based on vma */
unsigned long address;  /* Faulting virtual address */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   unsigned int sequence;
+   pmd_t orig_pmd; /* value of PMD at the time of fault */
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
 * the 'address' */
pud_t *pud; /* Pointer to pud entry matching
@@ -1351,6 +1355,10 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern int handle_speculative_fault(struct mm_struct *mm,
+   unsigned long address, unsigned int flags);
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 34ce3ebf97d5..70e4d2688e7b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -456,8 +456,8 @@ static inline pgoff_t linear_page_index(struct 
vm_area_struct *vma,
pgoff_t pgoff;
if (unlikely(is_vm_hugetlb_page(vma)))
r

[PATCH v9 20/24] perf: Add a speculative page fault sw event

2018-03-13 Thread Laurent Dufour

Add a new software event to count successful speculative page faults.

Signed-off-by: Laurent Dufour 
---
 include/uapi/linux/perf_event.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 6f873503552d..a6ddab9edeec 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
-- 
2.7.4

[PATCH v9 21/24] perf tools: Add support for the SPF perf event

2018-03-13 Thread Laurent Dufour

Add support for the new speculative faults event.

Signed-off-by: Laurent Dufour 
---
 tools/include/uapi/linux/perf_event.h | 1 +
 tools/perf/util/evsel.c   | 1 +
 tools/perf/util/parse-events.c| 4 
 tools/perf/util/parse-events.l| 1 +
 tools/perf/util/python.c  | 1 +
 5 files changed, 8 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 6f873503552d..a6ddab9edeec 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index ef351688b797..45b954019118 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -428,6 +428,7 @@ const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX] = {
"alignment-faults",
"emulation-faults",
"dummy",
+   "speculative-faults",
 };
 
 static const char *__perf_evsel__sw_name(u64 config)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 34589c427e52..2a8189c6d5fc 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -140,6 +140,10 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
.symbol = "bpf-output",
.alias  = "",
},
+   [PERF_COUNT_SW_SPF] = {
+   .symbol = "speculative-faults",
+   .alias  = "spf",
+   },
 };
 
 #define __PERF_EVENT_FIELD(config, name) \
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 655ecff636a8..5d6782426b30 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -308,6 +308,7 @@ emulation-faults{ return 
sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EM
 dummy  { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 duration_time  { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 bpf-output { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); }
+speculative-faults|spf { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_SPF); }
 
/*
 * We have to handle the kernel PMU event 
cycles-ct/cycles-t/mem-loads/mem-stores separately.
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index 2918cac7a142..00dd227959e6 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -1174,6 +1174,7 @@ static struct {
PERF_CONST(COUNT_SW_ALIGNMENT_FAULTS),
PERF_CONST(COUNT_SW_EMULATION_FAULTS),
PERF_CONST(COUNT_SW_DUMMY),
+   PERF_CONST(COUNT_SW_SPF),
 
PERF_CONST(SAMPLE_IP),
PERF_CONST(SAMPLE_TID),
-- 
2.7.4

[PATCH v9 22/24] mm: Speculative page fault handler return VMA

2018-03-13 Thread Laurent Dufour

When the speculative page fault handler returns VM_FAULT_RETRY, there is a
chance that the VMA fetched without grabbing the mmap_sem can be reused by
the legacy page fault handler.  By reusing it, we avoid calling find_vma()
again. To achieve that, we must ensure that the VMA structure will not be
freed behind our back. This is done by taking a reference on it (get_vma())
and by assuming that the caller will call the new service
can_reuse_spf_vma() once it has grabbed the mmap_sem.

can_reuse_spf_vma() first checks that the VMA is still in the RB tree,
then that the VMA's boundaries match the passed address, and finally
releases the reference on the VMA so that it can be freed if needed.

If the VMA has been freed, can_reuse_spf_vma() returns false because the
VMA is no longer in the RB tree.
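
The checks described above can be sketched roughly as follows. This is a
userspace illustration, not the code from this patch: struct reuse_vma, its
in_tree flag (standing in for "still linked in the RB tree") and the plain
integer refcount are all assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a VMA returned by the speculative handler. */
struct reuse_vma {
    unsigned long start;    /* first address covered */
    unsigned long end;      /* first address past the range */
    bool in_tree;           /* stand-in for "still linked in the RB tree" */
    int refcount;           /* reference taken by the speculative path */
};

/* Drop the reference taken during the speculative path. */
static void reuse_put_vma(struct reuse_vma *vma)
{
    vma->refcount--;
}

/*
 * Meant to be called once the caller holds the (modelled) mmap_sem.
 * The reference is released whether or not the VMA is reusable.
 */
static bool reuse_can_reuse_vma(struct reuse_vma *vma, unsigned long address)
{
    bool ok = vma->in_tree &&
              vma->start <= address && address < vma->end;

    reuse_put_vma(vma);
    return ok;
}
```

When the function returns false, the caller is expected to fall back to a
regular lookup, which is what find_vma() does in the arch handlers.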

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h |   5 +-
 mm/memory.c| 136 +
 2 files changed, 88 insertions(+), 53 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1acc3f4e07d1..38a8c0041fd0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1357,7 +1357,10 @@ extern int handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
unsigned int flags);
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
 extern int handle_speculative_fault(struct mm_struct *mm,
-   unsigned long address, unsigned int flags);
+   unsigned long address, unsigned int flags,
+   struct vm_area_struct **vma);
+extern bool can_reuse_spf_vma(struct vm_area_struct *vma,
+ unsigned long address);
 #endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
diff --git a/mm/memory.c b/mm/memory.c
index f39c4a4df703..16d3f5f4ffdd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4292,13 +4292,22 @@ static int __handle_mm_fault(struct vm_area_struct 
*vma, unsigned long address,
 /* This is required by vm_normal_page() */
 #error "Speculative page fault handler requires __HAVE_ARCH_PTE_SPECIAL"
 #endif
-
 /*
  * vm_normal_page() adds some processing which should be done while
  * hodling the mmap_sem.
  */
+
+/*
+ * Tries to handle the page fault in a speculative way, without grabbing the
+ * mmap_sem.
+ * When VM_FAULT_RETRY is returned, the vma pointer is valid and this vma must
+ * be checked later when the mmap_sem has been grabbed by calling
+ * can_reuse_spf_vma().
+ * This is needed as the returned vma is kept in memory until the call to
+ * can_reuse_spf_vma() is made.
+ */
 int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
-unsigned int flags)
+unsigned int flags, struct vm_area_struct **vma)
 {
struct vm_fault vmf = {
.address = address,
@@ -4307,7 +4316,6 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
p4d_t *p4d, p4dval;
pud_t pudval;
int seq, ret = VM_FAULT_RETRY;
-   struct vm_area_struct *vma;
 #ifdef CONFIG_NUMA
struct mempolicy *pol;
 #endif
@@ -4316,14 +4324,16 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
flags |= FAULT_FLAG_SPECULATIVE;
 
-   vma = get_vma(mm, address);
-   if (!vma)
+   *vma = get_vma(mm, address);
+   if (!*vma)
return ret;
+   vmf.vma = *vma;
 
-   seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> 
seqlock,vma_rb_erase() */
+   /* rmb <-> seqlock,vma_rb_erase() */
+   seq = raw_read_seqcount(&vmf.vma->vm_sequence);
if (seq & 1) {
-   trace_spf_vma_changed(_RET_IP_, vma, address);
-   goto out_put;
+   trace_spf_vma_changed(_RET_IP_, vmf.vma, address);
+   return ret;
}
 
/*
@@ -4331,9 +4341,9 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
 * with the VMA.
 * This include huge page from hugetlbfs.
 */
-   if (vma->vm_ops) {
-   trace_spf_vma_notsup(_RET_IP_, vma, address);
-   goto out_put;
+   if (vmf.vma->vm_ops) {
+   trace_spf_vma_notsup(_RET_IP_, vmf.vma, address);
+   return ret;
}
 
/*
@@ -4341,18 +4351,18 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
 * because vm_next and vm_prev must be safe. This can't be guaranteed
 * in the speculative path.
 */
-   if (unlikely(!vma->anon_vma)) {
-   trace_spf_vma_notsup(_RET_IP_, vma, address);
-   goto out_put;
+   if (unlikely(!vmf.vma->anon_vma)) {
+   trace_spf_vma_notsup(_RET_IP

[PATCH v9 23/24] x86/mm: Add speculative pagefault handling

2018-03-13 Thread Laurent Dufour

From: Peter Zijlstra 

Try a speculative fault before acquiring mmap_sem. If it returns
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.
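
A rough userspace sketch of that control flow is given below, assuming the
policy used later in this patch (user faults of multi-threaded processes
only, and errors retried on the classic path). All flow_* names and the demo
handlers are hypothetical and exist only for the illustration.

```c
#include <stdbool.h>

enum flow_result { FLOW_OK, FLOW_RETRY, FLOW_ERROR };
enum flow_path { FLOW_PATH_SPECULATIVE, FLOW_PATH_CLASSIC };

struct flow_outcome {
    enum flow_path path;        /* which path completed the fault */
    enum flow_result result;
};

typedef enum flow_result (*flow_handler)(unsigned long address);

/* Demo handlers, only here so the sketch can be exercised. */
static enum flow_result flow_demo_spec_ok(unsigned long address)
{
    (void)address;
    return FLOW_OK;
}

static enum flow_result flow_demo_spec_retry(unsigned long address)
{
    (void)address;
    return FLOW_RETRY;
}

static enum flow_result flow_demo_classic_ok(unsigned long address)
{
    (void)address;
    return FLOW_OK;
}

/*
 * Try the lockless path first, but only for user faults of processes
 * with more than one user of the mm. Anything other than success falls
 * back to the classic path, which would take the mmap_sem.
 */
static struct flow_outcome flow_do_fault(bool user_fault, int mm_users,
                                         unsigned long address,
                                         flow_handler speculative,
                                         flow_handler classic)
{
    struct flow_outcome out = { FLOW_PATH_CLASSIC, FLOW_RETRY };

    if (user_fault && mm_users > 1) {
        enum flow_result r = speculative(address);

        if (r == FLOW_OK) {
            out.path = FLOW_PATH_SPECULATIVE;
            out.result = r;
            return out;
        }
        /* FLOW_RETRY or FLOW_ERROR: redo the fault on the locked path. */
    }

    out.result = classic(address);
    return out;
}
```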

Signed-off-by: Peter Zijlstra (Intel) 

[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
 handle_speculative_fault()]
[Retry with usual fault path in the case VM_FAULT_ERROR is returned by
 handle_speculative_fault(). This allows the signal to be delivered]
[Don't build SPF call if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Try speculative fault path only for multi threaded processes]
[Try to reuse the VMA fetched during the speculative path in case of retry]
Signed-off-by: Laurent Dufour 
---
 arch/x86/mm/fault.c | 38 +-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e6af2b464c3d..a73cf227edd6 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1239,6 +1239,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
unsigned long address)
 {
struct vm_area_struct *vma;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   struct vm_area_struct *spf_vma = NULL;
+#endif
struct task_struct *tsk;
struct mm_struct *mm;
int fault, major = 0;
@@ -1332,6 +1335,27 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
if (error_code & X86_PF_INSTR)
flags |= FAULT_FLAG_INSTRUCTION;
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if ((error_code & X86_PF_USER) && (atomic_read(&mm->mm_users) > 1)) {
+   fault = handle_speculative_fault(mm, address, flags,
+&spf_vma);
+
+   if (!(fault & VM_FAULT_RETRY)) {
+   if (!(fault & VM_FAULT_ERROR)) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1,
+ regs, address);
+   goto done;
+   }
+   /*
+* In case of error we need the pkey value, but
+* can't get it from the spf_vma as it is only returned
+* when VM_FAULT_RETRY is returned. So we have to
+* retry the page fault with the mmap_sem grabbed.
+*/
+   }
+   }
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/*
 * When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in
@@ -1365,7 +1389,16 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
might_sleep();
}
 
-   vma = find_vma(mm, address);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (spf_vma) {
+   if (can_reuse_spf_vma(spf_vma, address))
+   vma = spf_vma;
+   else
+   vma = find_vma(mm, address);
+   spf_vma = NULL;
+   } else
+#endif
+   vma = find_vma(mm, address);
if (unlikely(!vma)) {
bad_area(regs, error_code, address);
return;
@@ -1451,6 +1484,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
return;
}
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+done:
+#endif
/*
 * Major/minor page fault accounting. If any of the events
 * returned VM_FAULT_MAJOR, we account it as a major fault.
-- 
2.7.4

[PATCH v9 24/24] powerpc/mm: Add speculative page fault

2018-03-13 Thread Laurent Dufour

This patch enables the speculative page fault on the PowerPC
architecture.

This will try a speculative page fault without holding the mmap_sem.
If it returns VM_FAULT_RETRY, the mmap_sem is acquired and the
traditional page fault processing is done.

The speculative path is only tried for multithreaded processes as there is
no risk of contention on the mmap_sem otherwise.

Built only if CONFIG_SPECULATIVE_PAGE_FAULT is defined (currently for
BOOK3S_64 && SMP).

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/mm/fault.c | 31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 866446cf2d9a..104f3cc86b51 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -392,6 +392,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
   unsigned long error_code)
 {
struct vm_area_struct * vma;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   struct vm_area_struct *spf_vma = NULL;
+#endif
struct mm_struct *mm = current->mm;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
int is_exec = TRAP(regs) == 0x400;
@@ -459,6 +462,20 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (is_user && (atomic_read(&mm->mm_users) > 1)) {
+   /* let's try a speculative page fault without grabbing the
+* mmap_sem.
+*/
+   fault = handle_speculative_fault(mm, address, flags, &spf_vma);
+   if (!(fault & VM_FAULT_RETRY)) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1,
+ regs, address);
+   goto done;
+   }
+   }
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/* When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in the
 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -489,7 +506,16 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
might_sleep();
}
 
-   vma = find_vma(mm, address);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (spf_vma) {
+   if (can_reuse_spf_vma(spf_vma, address))
+   vma = spf_vma;
+   else
+   vma =  find_vma(mm, address);
+   spf_vma = NULL;
+   } else
+#endif
+   vma = find_vma(mm, address);
if (unlikely(!vma))
return bad_area(regs, address);
if (likely(vma->vm_start <= address))
@@ -568,6 +594,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
 
up_read(&current->mm->mmap_sem);
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+done:
+#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
 
-- 
2.7.4

[PATCH v9 08/24] mm: Protect VMA modifications using VMA sequence count

2018-03-13 Thread Laurent Dufour

The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modifications done in:
- madvise()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultfd registering services

In addition, VMA fields which will be read during the speculative fault
path need to be written using WRITE_ONCE() to prevent writes from being
split and intermediate values from being pushed to other CPUs.
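
The writer side can be pictured with the userspace sketch below. It is only
an approximation: wr_write_begin()/wr_write_end() are hypothetical stand-ins
for vm_write_begin()/vm_write_end(), a relaxed atomic store stands in for
WRITE_ONCE(), and writers are assumed to be serialized by an outer lock
(the mmap_sem held for write in the kernel).

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the VMA fields touched by a writer. */
struct wr_vma {
    atomic_uint sequence;   /* odd while a modification is in progress */
    atomic_ulong flags;     /* read locklessly by the speculative path */
};

/* Make the sequence odd before touching any protected field. */
static void wr_write_begin(struct wr_vma *vma)
{
    atomic_fetch_add_explicit(&vma->sequence, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

/* Make the sequence even again once all field updates are done. */
static void wr_write_end(struct wr_vma *vma)
{
    atomic_fetch_add_explicit(&vma->sequence, 1, memory_order_release);
}

/* Clear some flag bits with a single, non-torn store. */
static void wr_clear_flags(struct wr_vma *vma, unsigned long mask)
{
    unsigned long old;

    wr_write_begin(vma);
    old = atomic_load_explicit(&vma->flags, memory_order_relaxed);
    atomic_store_explicit(&vma->flags, old & ~mask, memory_order_relaxed);
    wr_write_end(vma);
}
```

A lockless reader that samples the sequence before and after reading flags
sees either an odd value or a changed value while wr_clear_flags() runs, and
can then give up on the speculative path.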

Signed-off-by: Laurent Dufour 
---
 fs/proc/task_mmu.c |  5 -
 fs/userfaultfd.c   | 17 +
 mm/khugepaged.c|  3 +++
 mm/madvise.c   |  6 +-
 mm/mempolicy.c | 51 ++-
 mm/mlock.c | 13 -
 mm/mmap.c  | 17 ++---
 mm/mprotect.c  |  4 +++-
 mm/swap_state.c|  8 ++--
 9 files changed, 86 insertions(+), 38 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 65ae54659833..a2d9c87b7b0b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, const 
char __user *buf,
goto out_mm;
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
-   vma->vm_flags &= ~VM_SOFTDIRTY;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & 
~VM_SOFTDIRTY);
vma_set_page_prot(vma);
+   vm_write_end(vma);
}
downgrade_write(&mm->mmap_sem);
break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index cec550c8468f..b8212ba17695 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -659,8 +659,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct 
list_head *fcs)
 
octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+   vm_write_begin(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-   vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+   vm_write_end(vma);
return 0;
}
 
@@ -885,8 +888,10 @@ static int userfaultfd_release(struct inode *inode, struct 
file *file)
vma = prev;
else
prev = vma;
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
}
up_write(&mm->mmap_sem);
mmput(mm);
@@ -1434,8 +1439,10 @@ static int userfaultfd_register(struct userfaultfd_ctx 
*ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx.ctx = ctx;
+   vm_write_end(vma);
 
skip:
prev = vma;
@@ -1592,8 +1599,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx 
*ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
 
skip:
prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b7e2268dfc9a..32314e9e48dd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1006,6 +1006,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (mm_find_pmd(mm, address) != pmd)
goto out;
 
+   vm_write_begin(vma);
anon_vma_lock_write(vma->anon_vma);
 
pte = pte_offset_map(pmd, address);
@@ -1041,6 +1042,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
+   vm_write_end(vma);
result = SCAN_FAIL;
goto out;
}
@@ -1075,6 +1

[PATCH v9 19/24] mm: Adding speculative page fault failure trace events

2018-03-13 Thread Laurent Dufour

This patch adds a set of new trace events to collect the speculative page
fault event failures.

Signed-off-by: Laurent Dufour 
---
 include/trace/events/pagefault.h | 87 
 mm/memory.c  | 62 ++--
 2 files changed, 136 insertions(+), 13 deletions(-)
 create mode 100644 include/trace/events/pagefault.h

diff --git a/include/trace/events/pagefault.h b/include/trace/events/pagefault.h
new file mode 100644
index ..1d793f8c739b
--- /dev/null
+++ b/include/trace/events/pagefault.h
@@ -0,0 +1,87 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pagefault
+
+#if !defined(_TRACE_PAGEFAULT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEFAULT_H
+
+#include 
+#include 
+
+DECLARE_EVENT_CLASS(spf,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, caller)
+   __field(unsigned long, vm_start)
+   __field(unsigned long, vm_end)
+   __field(unsigned long, address)
+   ),
+
+   TP_fast_assign(
+   __entry->caller = caller;
+   __entry->vm_start   = vma->vm_start;
+   __entry->vm_end = vma->vm_end;
+   __entry->address= address;
+   ),
+
+   TP_printk("ip:%lx vma:%lx-%lx address:%lx",
+ __entry->caller, __entry->vm_start, __entry->vm_end,
+ __entry->address)
+);
+
+DEFINE_EVENT(spf, spf_pte_lock,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_noanon,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_notsup,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_access,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_pmd_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+#endif /* _TRACE_PAGEFAULT_H */
+
+/* This part must be outside protection */
+#include 
diff --git a/mm/memory.c b/mm/memory.c
index f0f2caa11282..f39c4a4df703 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -80,6 +80,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include 
+
 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for 
last_cpupid.
 #endif
@@ -2312,8 +2315,10 @@ static bool pte_spinlock(struct vm_fault *vmf)
}
 
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2321,16 +2326,21 @@ static bool pte_spinlock(struct vm_fault *vmf)
 * is not a huge collapse operation in progress in our back.
 */
pmdval = READ_ONCE(*vmf->pmd);
-   if (!pmd_same(pmdval, vmf->orig_pmd))
+   if (!pmd_same(pmdval, vmf->orig_pmd)) {
+   trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 #endif
 
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   if (unlikely(!spin_trylock(vmf->ptl)))
+   if (unlikely(!spin_trylock(vmf->ptl))) {
+   trace_spf_pte_lock(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
if (vma_has_changed(vmf)) {
spin_unlock(vmf->ptl);
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
 
@@ -2363,8 +2373,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
 * block on the PTL and thus we're safe.
 */
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2372,8 +2384,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
 * is not a huge collapse operation in progress in our back.
 */
pmdval = READ_ONCE(*vmf->pmd);
-   if (!pmd_same(pmd

Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again

2018-03-13 Thread Mauricio Faria de Oliveira


On 03/13/2018 02:59 PM, Michal Suchánek wrote:

Maybe it would make more sense to move the messages to the function
that actually patches in the instructions?


That helps, but if the instructions are not patched (e.g., no_rfi_flush)
then there is no information about what the system actually supports,
which is useful for diagnostics/debugging (and patch verification! :-) )

cheers,
mauricio

[PATCH RFC 1/8] powerpc: Add barrier_nospec

2018-03-13 Thread Michal Suchanek

Copypasta from original gmb() and rfi implementation

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/include/asm/barrier.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/include/asm/barrier.h 
b/arch/powerpc/include/asm/barrier.h
index 10daa1d56e0a..8e47b3abe405 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -75,6 +75,15 @@ do { 
\
___p1;  \
 })
 
+/* TODO: add patching so this can be disabled */
+/* Prevent speculative execution past this barrier. */
+#define barrier_nospec_asm ori 31,31,0
+#ifdef __ASSEMBLY__
+#define barrier_nospec barrier_nospec_asm
+#else
+#define barrier_nospec() __asm__ __volatile__ 
(stringify_in_c(barrier_nospec_asm) : : :)
+#endif
+
 #include 
 
 #endif /* _ASM_POWERPC_BARRIER_H */
-- 
2.13.6

[PATCH RFC 0/8] powerpc barrier_nospec

2018-03-13 Thread Michal Suchanek

Hello,

this is a patchset adding barrier_nospec on powerpc. It is based on the
out-of-tree gmb() patch and the existing rfi patches.

I do not have the tests for the Spectre/Meltdown issues available so this is
untested.

Feedback on the general approach as well as actual effectiveness is welcome.

Thanks

Michal


Michal Suchanek (8):
  powerpc: Add barrier_nospec
  powerpc: Use barrier_nospec in copy_from_user
  powerpc/64: Use barrier_nospec in syscall entry
  powerpc/64s: Add support for ori barrier_nospec
  powerpc/64: Patch barrier_nospec in modules
  powerpc/64: barrier_nospec: Add debugfs trigger
  powerpc/64s: barrier_nospec: Add hcall triggerr
  powerpc/64: barrier_nospec: Add commandline trigger

 arch/powerpc/include/asm/barrier.h|  9 
 arch/powerpc/include/asm/feature-fixups.h |  9 
 arch/powerpc/include/asm/setup.h  | 11 +
 arch/powerpc/include/asm/uaccess.h| 11 -
 arch/powerpc/kernel/entry_64.S|  3 ++
 arch/powerpc/kernel/module.c  |  6 +++
 arch/powerpc/kernel/setup_64.c| 72 +++
 arch/powerpc/kernel/vmlinux.lds.S |  7 +++
 arch/powerpc/lib/feature-fixups.c | 38 
 arch/powerpc/platforms/pseries/setup.c| 38 ++--
 10 files changed, 190 insertions(+), 14 deletions(-)

-- 
2.13.6

[PATCH RFC 3/8] powerpc/64: Use barrier_nospec in syscall entry

2018-03-13 Thread Michal Suchanek

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/kernel/entry_64.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 2cb5109a7ea3..7bfc4cf48af2 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #ifdef CONFIG_PPC_BOOK3S
 #include 
@@ -159,6 +160,7 @@ system_call:/* label this so stack 
traces look sane */
andi.   r11,r10,_TIF_SYSCALL_DOTRACE
bne .Lsyscall_dotrace   /* does not return */
cmpldi  0,r0,NR_syscalls
+   barrier_nospec
bge-.Lsyscall_enosys
 
 .Lsyscall:
@@ -319,6 +321,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
ld  r10,TI_FLAGS(r10)
 
cmpldi  r0,NR_syscalls
+   barrier_nospec
blt+.Lsyscall
 
/* Return code is already in r3 thanks to do_syscall_trace_enter() */
-- 
2.13.6

[PATCH RFC 2/8] powerpc: Use barrier_nospec in copy_from_user

2018-03-13 Thread Michal Suchanek

Copypasta from x86.

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/include/asm/uaccess.h | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 51bfeb8777f0..af9b0e731f46 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -248,6 +248,7 @@ do {
\
__chk_user_ptr(ptr);\
if (!is_kernel_addr((unsigned long)__gu_addr))  \
might_fault();  \
+   barrier_nospec();   \
__get_user_size(__gu_val, __gu_addr, (size), __gu_err); \
(x) = (__typeof__(*(ptr)))__gu_val; \
__gu_err;   \
@@ -258,8 +259,10 @@ do {   
\
long __gu_err = -EFAULT;\
unsigned long  __gu_val = 0;\
const __typeof__(*(ptr)) __user *__gu_addr = (ptr); \
+   int can_access = access_ok(VERIFY_READ, __gu_addr, (size)); \
might_fault();  \
-   if (access_ok(VERIFY_READ, __gu_addr, (size)))  \
+   barrier_nospec();   \
+   if (can_access) \
__get_user_size(__gu_val, __gu_addr, (size), __gu_err); \
(x) = (__force __typeof__(*(ptr)))__gu_val; 
\
__gu_err;   \
@@ -271,6 +274,7 @@ do {
\
unsigned long __gu_val; \
const __typeof__(*(ptr)) __user *__gu_addr = (ptr); \
__chk_user_ptr(ptr);\
+   barrier_nospec();   \
__get_user_size(__gu_val, __gu_addr, (size), __gu_err); \
(x) = (__force __typeof__(*(ptr)))__gu_val; \
__gu_err;   \
@@ -298,15 +302,19 @@ static inline unsigned long raw_copy_from_user(void *to,
 
switch (n) {
case 1:
+   barrier_nospec();
__get_user_size(*(u8 *)to, from, 1, ret);
break;
case 2:
+   barrier_nospec();
__get_user_size(*(u16 *)to, from, 2, ret);
break;
case 4:
+   barrier_nospec();
__get_user_size(*(u32 *)to, from, 4, ret);
break;
case 8:
+   barrier_nospec();
__get_user_size(*(u64 *)to, from, 8, ret);
break;
}
@@ -314,6 +322,7 @@ static inline unsigned long raw_copy_from_user(void *to,
return 0;
}
 
+   barrier_nospec();
return __copy_tofrom_user((__force void __user *)to, from, n);
 }
 
-- 
2.13.6

[PATCH RFC 4/8] powerpc/64s: Add support for ori barrier_nospec

2018-03-13 Thread Michal Suchanek

Copypasta from rfi implementation

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/include/asm/barrier.h|  4 ++--
 arch/powerpc/include/asm/feature-fixups.h |  9 +
 arch/powerpc/include/asm/setup.h  |  8 
 arch/powerpc/kernel/setup_64.c| 29 +
 arch/powerpc/kernel/vmlinux.lds.S |  7 +++
 arch/powerpc/lib/feature-fixups.c | 27 +++
 6 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/barrier.h 
b/arch/powerpc/include/asm/barrier.h
index 8e47b3abe405..4079a95e84c2 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -75,9 +75,9 @@ do {  
\
___p1;  \
 })
 
-/* TODO: add patching so this can be disabled */
 /* Prevent speculative execution past this barrier. */
-#define barrier_nospec_asm ori 31,31,0
+#define barrier_nospec_asm SPEC_BARRIER_FIXUP_SECTION; \
+   nop
 #ifdef __ASSEMBLY__
 #define barrier_nospec barrier_nospec_asm
 #else
diff --git a/arch/powerpc/include/asm/feature-fixups.h 
b/arch/powerpc/include/asm/feature-fixups.h
index 1e82eb3caabd..9d3382618ffd 100644
--- a/arch/powerpc/include/asm/feature-fixups.h
+++ b/arch/powerpc/include/asm/feature-fixups.h
@@ -195,11 +195,20 @@ label##3: \
FTR_ENTRY_OFFSET 951b-952b; \
.popsection;
 
+#define SPEC_BARRIER_FIXUP_SECTION \
+953:   \
+   .pushsection __spec_barrier_fixup,"a";  \
+   .align 2;   \
+954:   \
+   FTR_ENTRY_OFFSET 953b-954b; \
+   .popsection;
+
 
 #ifndef __ASSEMBLY__
 #include 
 
 extern long __start___rfi_flush_fixup, __stop___rfi_flush_fixup;
+extern long __start___spec_barrier_fixup, __stop___spec_barrier_fixup;
 
 void apply_feature_fixups(void);
 void setup_feature_keys(void);
diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index 469b7fdc9be4..486d02e4a310 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -49,8 +49,16 @@ enum l1d_flush_type {
L1D_FLUSH_MTTRIG= 0x8,
 };
 
+/* These are bit flags */
+enum spec_barrier_type {
+   SPEC_BARRIER_NONE   = 0x1,
+   SPEC_BARRIER_ORI= 0x2,
+};
+
 void __init setup_rfi_flush(enum l1d_flush_type, bool enable);
 void do_rfi_flush_fixups(enum l1d_flush_type types);
+void __init setup_barrier_nospec(enum spec_barrier_type, bool enable);
+void do_barrier_nospec_fixups(enum spec_barrier_type type);
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index c388cc3357fa..09f21a954bfc 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -815,6 +815,10 @@ static enum l1d_flush_type enabled_flush_types;
 static void *l1d_flush_fallback_area;
 static bool no_rfi_flush;
 bool rfi_flush;
+enum spec_barrier_type powerpc_barrier_nospec;
+static enum spec_barrier_type barrier_nospec_type;
+static bool no_nospec;
+bool barrier_nospec_enabled;
 
 static int __init handle_no_rfi_flush(char *p)
 {
@@ -899,6 +903,31 @@ void __init setup_rfi_flush(enum l1d_flush_type types, 
bool enable)
rfi_flush_enable(enable);
 }
 
+void barrier_nospec_enable(bool enable)
+{
+   barrier_nospec_enabled = enable;
+
+   if (enable) {
+   powerpc_barrier_nospec = barrier_nospec_type;
+   do_barrier_nospec_fixups(powerpc_barrier_nospec);
+   on_each_cpu(do_nothing, NULL, 1);
+   } else {
+   powerpc_barrier_nospec = SPEC_BARRIER_NONE;
+   do_barrier_nospec_fixups(powerpc_barrier_nospec);
+   }
+}
+
+void __init setup_barrier_nospec(enum spec_barrier_type type, bool enable)
+{
+   if (type & SPEC_BARRIER_ORI)
+   pr_info("barrier_nospec: Using ori type flush\n");
+
+   barrier_nospec_type = type;
+
+   if (!no_nospec)
+   barrier_nospec_enable(enable);
+}
+
 #ifdef CONFIG_DEBUG_FS
 static int rfi_flush_set(void *data, u64 val)
 {
diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index c8af90ff49f0..744b58ff77f1 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -139,6 +139,13 @@ SECTIONS
*(__rfi_flush_fixup)
__stop___rfi_flush_fixup = .;
}
+
+   . = ALIGN(8);
+   __spec_barrier_fixup : AT(ADDR(__spec_barrier_fixup) - LOAD_OFFSET) {
+   __start___spec_barrier_fixup = .;
+   *(__spec_barrier_fixup)
+   __stop___spec_barrier_fixup = .;
+   }

[PATCH RFC 6/8] powerpc/64: barrier_nospec: Add debugfs trigger

2018-03-13 Thread Michal Suchanek

Copypasta from rfi implementation

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/kernel/setup_64.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index d1d9f047161e..4b67b7b877d9 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -955,6 +955,41 @@ static __init int rfi_flush_debugfs_init(void)
return 0;
 }
 device_initcall(rfi_flush_debugfs_init);
+
+static int barrier_nospec_set(void *data, u64 val)
+{
+   switch (val) {
+   case 0:
+   case 1:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   if (!!val == !!barrier_nospec_enabled)
+   return 0;
+
+   barrier_nospec_enable(!!val);
+
+   return 0;
+}
+
+static int barrier_nospec_get(void *data, u64 *val)
+{
+   *val = barrier_nospec_enabled ? 1 : 0;
+   return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_barrier_nospec,
+   barrier_nospec_get, barrier_nospec_set, "%llu\n");
+
+static __init int barrier_nospec_debugfs_init(void)
+{
+   debugfs_create_file("barrier_nospec", 0600, powerpc_debugfs_root, NULL,
+   &fops_barrier_nospec);
+   return 0;
+}
+device_initcall(barrier_nospec_debugfs_init);
 #endif
 
 ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, 
char *buf)
-- 
2.13.6
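
The set/get semantics of the debugfs hook above can be sketched in userspace C. This is a minimal model, not the kernel code: the `model_*` names are hypothetical stand-ins, and `model_enable()` only counts calls where the real `barrier_nospec_enable()` would re-patch the kernel text.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's barrier_nospec_enabled flag. */
static bool model_enabled;
/* Counts how often the (expensive) re-patching path would run. */
static int model_patch_calls;

/* Hypothetical stand-in for barrier_nospec_enable(). */
static void model_enable(bool enable)
{
	model_enabled = enable;
	model_patch_calls++;
}

/* Mirrors barrier_nospec_set(): only 0 and 1 are valid writes. */
static int model_set(uint64_t val)
{
	if (val > 1)
		return -EINVAL;

	/* Writing the current state is a no-op, so nothing is re-patched. */
	if (!!val == model_enabled)
		return 0;

	model_enable(!!val);
	return 0;
}

/* Mirrors barrier_nospec_get(): report the state as 0 or 1. */
static int model_get(uint64_t *val)
{
	*val = model_enabled ? 1 : 0;
	return 0;
}
```

On a kernel carrying this patch, and assuming debugfs is mounted at /sys/kernel/debug, the file would presumably appear as powerpc/barrier_nospec next to the existing rfi_flush file.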

[PATCH RFC 7/8] powerpc/64s: barrier_nospec: Add hcall trigger

2018-03-13 Thread Michal Suchanek

Copypasta from rfi implementation

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/platforms/pseries/setup.c | 38 ++
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 1a527625acf7..b779ddb8e250 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -459,38 +459,50 @@ static void __init find_and_init_phbs(void)
of_pci_check_probe_only();
 }
 
-static void pseries_setup_rfi_flush(void)
+static void pseries_setup_rfi_nospec(void)
 {
struct h_cpu_char_result result;
-   enum l1d_flush_type types;
-   bool enable;
+   enum l1d_flush_type flush_types;
+   enum spec_barrier_type barrier_type;
+   bool flush_enable;
+   bool barrier_enable;
long rc;
 
/* Enable by default */
-   enable = true;
+   flush_enable = true;
+   barrier_enable = true;
+   /* no fallback if the firmware does not tell us */
+   barrier_type = SPEC_BARRIER_NONE;
 
rc = plpar_get_cpu_characteristics(&result);
if (rc == H_SUCCESS) {
-   types = L1D_FLUSH_NONE;
+   flush_types = L1D_FLUSH_NONE;
 
if (result.character & H_CPU_CHAR_L1D_FLUSH_TRIG2)
-   types |= L1D_FLUSH_MTTRIG;
+   flush_types |= L1D_FLUSH_MTTRIG;
if (result.character & H_CPU_CHAR_L1D_FLUSH_ORI30)
-   types |= L1D_FLUSH_ORI;
+   flush_types |= L1D_FLUSH_ORI;
+   if (result.character & H_CPU_CHAR_SPEC_BAR_ORI31)
+   barrier_type |= SPEC_BARRIER_ORI;
 
/* Use fallback if nothing set in hcall */
-   if (types == L1D_FLUSH_NONE)
-   types = L1D_FLUSH_FALLBACK;
+   if (flush_types == L1D_FLUSH_NONE)
+   flush_types = L1D_FLUSH_FALLBACK;
 
if ((!(result.behaviour & H_CPU_BEHAV_L1D_FLUSH_PR)) ||
(!(result.behaviour & H_CPU_BEHAV_FAVOUR_SECURITY)))
-   enable = false;
+   flush_enable = false;
+
+   if ((!(result.behaviour & H_CPU_BEHAV_BNDS_CHK_SPEC_BAR)) ||
+   (!(result.behaviour & H_CPU_BEHAV_FAVOUR_SECURITY)))
+   barrier_enable = false;
} else {
/* Default to fallback if case hcall is not available */
-   types = L1D_FLUSH_FALLBACK;
+   flush_types = L1D_FLUSH_FALLBACK;
}
 
-   setup_rfi_flush(types, enable);
+   setup_barrier_nospec(barrier_type, barrier_enable);
+   setup_rfi_flush(flush_types, flush_enable);
 }
 
 #ifdef CONFIG_PCI_IOV
@@ -666,7 +678,7 @@ static void __init pSeries_setup_arch(void)
 
fwnmi_init();
 
-   pseries_setup_rfi_flush();
+   pseries_setup_rfi_nospec();
 
/* By default, only probe PCI (can be overridden by rtas_pci) */
pci_add_flags(PCI_PROBE_ONLY);
-- 
2.13.6
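
The barrier selection in pseries_setup_rfi_nospec() can be sketched as a pure function. This is a rough model under stated assumptions: the `MODEL_*` bit positions are hypothetical (the real H_CPU_CHAR_* and H_CPU_BEHAV_* values are not shown in this thread), and only the enum values are taken from patch 4/8. It also makes one detail visible: because SPEC_BARRIER_NONE is 0x1 rather than 0, OR-ing in SPEC_BARRIER_ORI yields 0x3, and the later `type & SPEC_BARRIER_ORI` test still matches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the enum added in patch 4/8. */
enum spec_barrier_type {
	SPEC_BARRIER_NONE = 0x1,
	SPEC_BARRIER_ORI  = 0x2,
};

/* HYPOTHETICAL bit positions, for illustration only. */
#define MODEL_CHAR_SPEC_BAR_ORI31     (1ull << 0)
#define MODEL_BEHAV_FAVOUR_SECURITY   (1ull << 0)
#define MODEL_BEHAV_BNDS_CHK_SPEC_BAR (1ull << 1)

struct model_result {
	unsigned int type; /* bitmask of enum spec_barrier_type */
	bool enable;
};

/* Mirrors the barrier half of pseries_setup_rfi_nospec(). */
static struct model_result model_pick_barrier(bool hcall_ok,
					      uint64_t character,
					      uint64_t behaviour)
{
	/* Enabled by default, no fallback barrier type. */
	struct model_result r = { .type = SPEC_BARRIER_NONE, .enable = true };

	/* Without firmware information the defaults are kept as they are. */
	if (!hcall_ok)
		return r;

	if (character & MODEL_CHAR_SPEC_BAR_ORI31)
		r.type |= SPEC_BARRIER_ORI;

	/* Both behaviour bits must be set for the barrier to stay enabled. */
	if (!(behaviour & MODEL_BEHAV_BNDS_CHK_SPEC_BAR) ||
	    !(behaviour & MODEL_BEHAV_FAVOUR_SECURITY))
		r.enable = false;

	return r;
}
```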

[PATCH RFC 5/8] powerpc/64: Patch barrier_nospec in modules

2018-03-13 Thread Michal Suchanek

Copypasta from lwsync patching.

Note that unlike the RFI flush, which is patched only in the kernel, the
nospec state of a module reflects the settings at the time the module was
loaded.

Iterating all modules and re-patching every time the settings change is
not implemented.

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/include/asm/setup.h  |  5 -
 arch/powerpc/kernel/module.c  |  6 ++
 arch/powerpc/kernel/setup_64.c|  4 ++--
 arch/powerpc/lib/feature-fixups.c | 17 ++---
 4 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index 486d02e4a310..7e3a41248810 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -58,7 +58,10 @@ enum spec_barrier_type {
 void __init setup_rfi_flush(enum l1d_flush_type, bool enable);
 void do_rfi_flush_fixups(enum l1d_flush_type types);
 void __init setup_barrier_nospec(enum spec_barrier_type, bool enable);
-void do_barrier_nospec_fixups(enum spec_barrier_type type);
+void do_barrier_nospec_fixups_kernel(enum spec_barrier_type type);
+void do_barrier_nospec_fixups(enum spec_barrier_type type,
+ void *start, void *end);
+extern enum spec_barrier_type powerpc_barrier_nospec;
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 3f7ba0f5bf29..7b6d0ec06a21 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -72,6 +72,12 @@ int module_finalize(const Elf_Ehdr *hdr,
do_feature_fixups(powerpc_firmware_features,
  (void *)sect->sh_addr,
  (void *)sect->sh_addr + sect->sh_size);
+
+   sect = find_section(hdr, sechdrs, "__spec_barrier_fixup");
+   if (sect != NULL)
+   do_barrier_nospec_fixups(powerpc_barrier_nospec,
+ (void *)sect->sh_addr,
+ (void *)sect->sh_addr + sect->sh_size);
 #endif
 
sect = find_section(hdr, sechdrs, "__lwsync_fixup");
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 09f21a954bfc..d1d9f047161e 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -909,11 +909,11 @@ void barrier_nospec_enable(bool enable)
 
if (enable) {
powerpc_barrier_nospec = barrier_nospec_type;
-   do_barrier_nospec_fixups(powerpc_barrier_nospec);
+   do_barrier_nospec_fixups_kernel(powerpc_barrier_nospec);
on_each_cpu(do_nothing, NULL, 1);
} else {
powerpc_barrier_nospec = SPEC_BARRIER_NONE;
-   do_barrier_nospec_fixups(powerpc_barrier_nospec);
+   do_barrier_nospec_fixups_kernel(powerpc_barrier_nospec);
}
 }
 
diff --git a/arch/powerpc/lib/feature-fixups.c 
b/arch/powerpc/lib/feature-fixups.c
index 000e153184ad..b59ebc2215e8 100644
--- a/arch/powerpc/lib/feature-fixups.c
+++ b/arch/powerpc/lib/feature-fixups.c
@@ -156,14 +156,15 @@ void do_rfi_flush_fixups(enum l1d_flush_type types)
printk(KERN_DEBUG "rfi-flush: patched %d locations\n", i);
 }
 
-void do_barrier_nospec_fixups(enum spec_barrier_type type)
+void do_barrier_nospec_fixups(enum spec_barrier_type type,
+ void *fixup_start, void *fixup_end)
 {
unsigned int instr, *dest;
long *start, *end;
int i;
 
-   start = PTRRELOC(&__start___spec_barrier_fixup),
-   end = PTRRELOC(&__stop___spec_barrier_fixup);
+   start = fixup_start;
+   end = fixup_end;
 
instr = 0x60000000; /* nop */
 
@@ -182,6 +183,16 @@ void do_barrier_nospec_fixups(enum spec_barrier_type type)
printk(KERN_DEBUG "barrier-nospec: patched %d locations\n", i);
 }
 
+void do_barrier_nospec_fixups_kernel(enum spec_barrier_type type)
+{
+   void *start, *end;
+
+   start = PTRRELOC(&__start___spec_barrier_fixup),
+   end = PTRRELOC(&__stop___spec_barrier_fixup);
+
+   do_barrier_nospec_fixups(type, start, end);
+}
+
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 void do_lwsync_fixups(unsigned long value, void *fixup_start, void *fixup_end)
-- 
2.13.6
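
The fixup-table walk that this patch makes reusable for modules can be sketched in userspace C. This is only a sketch under assumptions: `struct model_image` and the `model_*` helpers are hypothetical, the table entries are self-relative byte offsets in the spirit of FTR_ENTRY_OFFSET, and the choice between the two instruction words is assumed, since that hunk is not visible here. The encodings used are the PowerPC `nop` (0x60000000) and `ori 31,31,0` (0x63ff0000).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INSN_NOP    0x60000000u /* PowerPC nop (ori 0,0,0) */
#define INSN_ORI_31 0x63ff0000u /* PowerPC ori 31,31,0 */

#define MODEL_TEXT_WORDS 8
#define MODEL_FIXUPS     3

/* One object holding both "text" and its fixup table. */
struct model_image {
	uint32_t text[MODEL_TEXT_WORDS]; /* pretend instruction stream */
	long fixups[MODEL_FIXUPS];       /* self-relative byte offsets */
};

/* Record text[idx] as a patch site, roughly what the assembler emits. */
static void model_add_fixup(struct model_image *img, size_t slot, size_t idx)
{
	char *entry, *site;

	assert(slot < MODEL_FIXUPS && idx < MODEL_TEXT_WORDS);
	entry = (char *)&img->fixups[slot];
	site = (char *)&img->text[idx];
	img->fixups[slot] = (long)(site - entry);
}

/* Walk [start, end), write insn at every site, return the site count. */
static int model_apply_fixups(uint32_t insn, long *start, long *end)
{
	int i;

	for (i = 0; start < end; start++, i++) {
		/* Each entry stores the distance from itself to its site. */
		uint32_t *dest = (uint32_t *)((char *)start + *start);

		*dest = insn;
	}
	return i;
}
```

Passing explicit start and end pointers is what lets the same walker serve both the kernel's linker-provided section bounds and a module's `__spec_barrier_fixup` section header.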

[PATCH RFC 8/8] powerpc/64: barrier_nospec: Add commandline trigger

2018-03-13 Thread Michal Suchanek

Copypasta from rfi implementation

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/kernel/setup_64.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 4b67b7b877d9..257f0e6be107 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -840,6 +840,14 @@ static int __init handle_no_pti(char *p)
 }
 early_param("nopti", handle_no_pti);
 
+static int __init handle_no_nospec(char *p)
+{
+   pr_info("barrier_nospec: disabled on command line.\n");
+   no_nospec = true;
+   return 0;
+}
+early_param("no_nospec", handle_no_nospec);
+
 static void do_nothing(void *unused)
 {
/*
-- 
2.13.6

Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again

2018-03-13 Thread Michal Suchánek

On Tue, 13 Mar 2018 15:13:11 -0300
Mauricio Faria de Oliveira  wrote:

> On 03/13/2018 02:59 PM, Michal Suchánek wrote:
> > Maybe it would make more sense to move the messages to the function
> > that actually patches in the instructions?  
> 
> That helps, but if the instructions are not patched (e.g.,
> no_rfi_flush) then there is no information about what the system
> actually supports, which is useful for diagnostics/debugging (and
> patch verification! :-) )

Can't you patch with debugfs in that case?

Thanks

Michal

Re: OK to merge via powerpc? (was Re: [PATCH 05/14] mm: make memblock_alloc_base_nid non-static)

2018-03-13 Thread Andrew Morton

On Tue, 13 Mar 2018 23:06:35 +1100 Michael Ellerman  wrote:

> Anyone object to us merging the following patch via the powerpc tree?
> 
> Full series is here if anyone's interested:
>   http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=28377&state=*
> 

Yup, please go ahead.

I assume the change to the memblock_alloc_range() declaration was an
unrelated, unchangelogged cleanup.

Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again

2018-03-13 Thread Mauricio Faria de Oliveira


On 03/13/2018 03:36 PM, Michal Suchánek wrote:
> On Tue, 13 Mar 2018 15:13:11 -0300
> Mauricio Faria de Oliveira  wrote:
>
> > On 03/13/2018 02:59 PM, Michal Suchánek wrote:
> > > Maybe it would make more sense to move the messages to the function
> > > that actually patches in the instructions?
> >
> > That helps, but if the instructions are not patched (e.g.,
> > no_rfi_flush) then there is no information about what the system
> > actually supports, which is useful for diagnostics/debugging (and
> > patch verification! :-) )
>
> Can't you patch with debugfs in that case?


For development purposes, yes, sure; but unfortunately sometimes only a
dmesg output or other offline/postmortem data is available.

And there's the case where the user is not aware of, willing to use, or
allowed to use the debugfs switch.

I still think the correct, informative messages are a good way to go :)

cheers,
mauricio

Re: [PATCH RFC 1/8] powerpc: Add barrier_nospec

2018-03-13 Thread Peter Zijlstra

On Tue, Mar 13, 2018 at 07:32:59PM +0100, Michal Suchanek wrote:
> Copypasta from original gmb() and rfi implementation

Actual real changelogs would be very welcome. Seeing as I've not seen
these mythical RFI patches, this leaves one quite puzzled :-)

[GIT PULL] Please pull JSON files for POWER9 PMU events

2018-03-13 Thread Sukadev Bhattiprolu


Hi Arnaldo,

Please pull an update to the JSON files for POWER9 PMU events.

The following changes since commit 90d2614c4d10c2f9d0ada9a3b01e5f43ca8d1ae3:

  perf test: Fix exit code for record+probe_libc_inet_pton.sh (2018-03-13 
15:14:43 -0300)

are available in the git repository at:

  https://github.com/sukadev/linux/ p9-json-v5

for you to fetch changes up to 99c9dff949f2502964005f9afa8d60c89b446f2c:

  perf vendor events: Update POWER9 events (2018-03-13 16:48:12 -0500)


Sukadev Bhattiprolu (1):
  perf vendor events: Update POWER9 events

 .../perf/pmu-events/arch/powerpc/power9/cache.json |  25 ---
 .../pmu-events/arch/powerpc/power9/frontend.json   |  10 -
 .../pmu-events/arch/powerpc/power9/marked.json |   5 -
 .../pmu-events/arch/powerpc/power9/memory.json |   5 -
 .../perf/pmu-events/arch/powerpc/power9/other.json | 241 ++---
 .../pmu-events/arch/powerpc/power9/pipeline.json   |  50 ++---
 tools/perf/pmu-events/arch/powerpc/power9/pmc.json |   5 -
 .../arch/powerpc/power9/translation.json   |  10 +-
 8 files changed, 178 insertions(+), 173 deletions(-)

[PATCH] powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

2018-03-13 Thread Michael Ellerman

When running virtualised the powerpc kernel is able to run the system
in "compat mode" - which means the kernel and hardware are pretending
to userspace that the CPU is an older version than it actually is.

AT_BASE_PLATFORM is an AUXV entry that we export to userspace for use
when we're running in that mode, which tells userspace the "platform"
string for the real CPU version, as opposed to the faked version.

Although we don't support compat mode when using DT CPU features, and
arguably don't need to set AT_BASE_PLATFORM, the existing cputable
based code always sets it even when we're running bare metal. That
means the lack of AT_BASE_PLATFORM is a user-visible artifact of the
fact that the kernel is using DT CPU features, which we don't want.

So set it in the DT CPU features code also.

This results in eg:
  $ LD_SHOW_AUXV=1 /bin/true | grep "AT_.*PLATFORM"
  AT_PLATFORM: power9
  AT_BASE_PLATFORM:power9

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/dt_cpu_ftrs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 945e2c29ad2d..0bcfb0f256e1 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -720,6 +720,9 @@ static void __init cpufeatures_setup_finished(void)
cur_cpu_spec->cpu_features |= CPU_FTR_HVMODE;
}
 
+   /* Make sure powerpc_base_platform is non-NULL */
+   powerpc_base_platform = cur_cpu_spec->platform;
+
system_registers.lpcr = mfspr(SPRN_LPCR);
system_registers.hfscr = mfspr(SPRN_HFSCR);
system_registers.fscr = mfspr(SPRN_FSCR);
-- 
2.14.1
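
The same check as the LD_SHOW_AUXV example can be done from a program with glibc's getauxval(). A minimal sketch, where the helper names `auxv_string()` and `or_unset()` are made up for illustration:

```c
#include <elf.h>      /* AT_PLATFORM, AT_BASE_PLATFORM */
#include <stddef.h>
#include <sys/auxv.h> /* getauxval() */

/* Return the string behind an AUXV entry, or NULL if it is absent. */
static const char *auxv_string(unsigned long type)
{
	unsigned long v = getauxval(type);

	return v ? (const char *)v : NULL;
}

/* Make a missing entry printable. */
static const char *or_unset(const char *s)
{
	return s ? s : "(not set)";
}
```

Printing `or_unset(auxv_string(AT_BASE_PLATFORM))` with printf() should then show the platform string on a kernel with this fix, and "(not set)" where the entry is missing.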

Re: [PATCH] powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

2018-03-13 Thread Nicholas Piggin

On Wed, 14 Mar 2018 10:14:11 +1100
Michael Ellerman  wrote:

> When running virtualised the powerpc kernel is able to run the system
> in "compat mode" - which means the kernel and hardware are pretending
> to userspace that the CPU is an older version than it actually is.
> 
> AT_BASE_PLATFORM is an AUXV entry that we export to userspace for use
> when we're running in that mode, which tells userspace the "platform"
> string for the real CPU version, as opposed to the faked version.
> 
> Although we don't support compat mode when using DT CPU features, and
> arguably don't need to set AT_BASE_PLATFORM, the existing cputable
> based code always sets it even when we're running bare metal. That
> means the lack of AT_BASE_PLATFORM is a user-visible artifact of the
> fact that the kernel is using DT CPU features, which we don't want.
> 
> So set it in the DT CPU features code also.
> 
> This results in eg:
>   $ LD_SHOW_AUXV=1 /bin/true | grep "AT_.*PLATFORM"
>   AT_PLATFORM: power9
>   AT_BASE_PLATFORM:power9
> 
> Signed-off-by: Michael Ellerman 

Thanks, I missed this one. Seems fine to me.

Reviewed-by: Nicholas Piggin 

> ---
>  arch/powerpc/kernel/dt_cpu_ftrs.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
> b/arch/powerpc/kernel/dt_cpu_ftrs.c
> index 945e2c29ad2d..0bcfb0f256e1 100644
> --- a/arch/powerpc/kernel/dt_cpu_ftrs.c
> +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
> @@ -720,6 +720,9 @@ static void __init cpufeatures_setup_finished(void)
>   cur_cpu_spec->cpu_features |= CPU_FTR_HVMODE;
>   }
>  
> + /* Make sure powerpc_base_platform is non-NULL */
> + powerpc_base_platform = cur_cpu_spec->platform;
> +
>   system_registers.lpcr = mfspr(SPRN_LPCR);
>   system_registers.hfscr = mfspr(SPRN_HFSCR);
>   system_registers.fscr = mfspr(SPRN_FSCR);

Re: OK to merge via powerpc? (was Re: [PATCH 05/14] mm: make memblock_alloc_base_nid non-static)

2018-03-13 Thread Nicholas Piggin

On Tue, 13 Mar 2018 12:41:28 -0700
Andrew Morton  wrote:

> On Tue, 13 Mar 2018 23:06:35 +1100 Michael Ellerman  
> wrote:
> 
> > Anyone object to us merging the following patch via the powerpc tree?
> > 
> > Full series is here if anyone's interested:
> >   
> > http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=28377&state=*
> >   
> 
> Yup, please go ahead.
> 
> I assume the change to the memblock_alloc_range() declaration was an
> unrelated, unchangelogged cleanup.
> 

It is. I'm trying to get better at that. Michael might drop that bit if
he's not already sick of fixing up my patches...

Thanks,
Nick

Re: [PATCHv4 1/3] powerpc, cpu: partially unbind the mapping between cpu logical id and its seq in dt

2018-03-13 Thread Pingfan Liu

On Tue, Mar 13, 2018 at 10:58 AM, Benjamin Herrenschmidt
 wrote:
> On Mon, 2018-03-12 at 12:43 +0800, Pingfan Liu wrote:
>> For kexec -p, the boot cpu can be not the cpu0, this causes the problem
>> to alloc paca[]. In theory, there is no requirement to assign cpu's logical
>> id as its present seq by device tree. But we have something like
>> cpu_first_thread_sibling(), which makes assumption on the mapping inside
>> a core. Hence partially changing the mapping, i.e. unbind the mapping of
>> core while keep the mapping inside a core. After this patch, boot-cpu
>> will always be mapped into the range [0,threads_per_core).
>
> I'm ok with the idea but not fan of the implementation:
>
>> Signed-off-by: Pingfan Liu 
>> ---
>>  arch/powerpc/include/asm/smp.h |  1 +
>>  arch/powerpc/kernel/prom.c | 25 ++---
>>  arch/powerpc/kernel/setup-common.c | 21 +
>>  3 files changed, 36 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
>> index fac963e..1299100 100644
>> --- a/arch/powerpc/include/asm/smp.h
>> +++ b/arch/powerpc/include/asm/smp.h
>> @@ -30,6 +30,7 @@
>>  #include 
>>
>>  extern int boot_cpuid;
>> +extern int boot_cpuhwid;
>>  extern int spinning_secondaries;
>>
>>  extern void cpu_die(void);
>> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>> index da67606..d0ebb25 100644
>> --- a/arch/powerpc/kernel/prom.c
>> +++ b/arch/powerpc/kernel/prom.c
>> @@ -315,8 +315,7 @@ static int __init early_init_dt_scan_cpus(unsigned long 
>> node,
>>   const __be32 *intserv;
>>   int i, nthreads;
>>   int len;
>> - int found = -1;
>> - int found_thread = 0;
>> + bool found = false;
>>
>>   /* We are scanning "cpu" nodes only */
>>   if (type == NULL || strcmp(type, "cpu") != 0)
>> @@ -341,8 +340,11 @@ static int __init early_init_dt_scan_cpus(unsigned long 
>> node,
>>   if (fdt_version(initial_boot_params) >= 2) {
>>   if (be32_to_cpu(intserv[i]) ==
>>   fdt_boot_cpuid_phys(initial_boot_params)) {
>> - found = boot_cpu_count;
>> - found_thread = i;
>> + /* always map the boot-cpu logical id into the
>> +  * the range of [0, thread_per_core)
>> +  */
>> + boot_cpuid = i;
>> + found = true;
>>   }
>
> Call it boot_thread_id
>
But I think boot_cpuid has the meaning of global index, while the
thread_id has the meaning of index in a core.
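
To illustrate the distinction, here is a small userspace sketch of the mask arithmetic that helpers like cpu_first_thread_sibling() rely on, as far as I understand it. The `model_*` names are hypothetical, and the sketch assumes threads_per_core is a power of two.

```c
#include <assert.h>

/* Logical id of the first thread in the core that owns this cpu. */
static int model_first_thread_sibling(int cpu, int threads_per_core)
{
	/* The mask trick only works for power-of-two thread counts. */
	assert(threads_per_core > 0 &&
	       (threads_per_core & (threads_per_core - 1)) == 0);
	return cpu & ~(threads_per_core - 1);
}

/* Index of this cpu inside its core. */
static int model_thread_in_core(int cpu, int threads_per_core)
{
	return cpu & (threads_per_core - 1);
}
```

With 8 threads per core, logical cpu 13 is thread 5 of the core starting at cpu 8. A boot cpu that is thread 5 of its core and gets logical id 5 therefore lands in [0, threads_per_core), and its first sibling is cpu 0.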

>>   } else {
>>   /*
>> @@ -351,8 +353,10 @@ static int __init early_init_dt_scan_cpus(unsigned long 
>> node,
>>* off secondary threads.
>>*/
>>   if (of_get_flat_dt_prop(node,
>> - "linux,boot-cpu", NULL) != NULL)
>> - found = boot_cpu_count;
>> + "linux,boot-cpu", NULL) != NULL) {
>> + boot_cpuid = i;
>> + found = true;
>> + }
>>   }
>>  #ifdef CONFIG_SMP
>>   /* logical cpu id is always 0 on UP kernels */
>> @@ -361,13 +365,12 @@ static int __init early_init_dt_scan_cpus(unsigned 
>> long node,
>>   }
>>
>>   /* Not the boot CPU */
>> - if (found < 0)
>> + if (!found)
>>   return 0;
>>
>> - DBG("boot cpu: logical %d physical %d\n", found,
>> - be32_to_cpu(intserv[found_thread]));
>> - boot_cpuid = found;
>> - set_hard_smp_processor_id(found, be32_to_cpu(intserv[found_thread]));
>> + boot_cpuhwid = be32_to_cpu(intserv[boot_cpuid]);
>> + DBG("boot cpu: logical %d physical %d\n", boot_cpuid, boot_cpuhwid);
>> + set_hard_smp_processor_id(boot_cpuid, boot_cpuhwid);
>>
>>   /*
>>* PAPR defines "logical" PVR values for cpus that
>> diff --git a/arch/powerpc/kernel/setup-common.c 
>> b/arch/powerpc/kernel/setup-common.c
>> index 66f7cc6..1a67344 100644
>> --- a/arch/powerpc/kernel/setup-common.c
>> +++ b/arch/powerpc/kernel/setup-common.c
>> @@ -86,6 +86,7 @@ struct machdep_calls *machine_id;
>>  EXPORT_SYMBOL(machine_id);
>>
>>  int boot_cpuid = -1;
>> +int boot_cpuhwid = -1;
>>  EXPORT_SYMBOL_GPL(boot_cpuid);
>>
>>  /*
>> @@ -459,11 +460,17 @@ static void __init cpu_init_thread_core_maps(int tpc)
>>  void __init smp_setup_cpu_maps(void)
>>  {
>>   struct device_node *dn;
>> + struct device_node *boot_dn = NULL;
>> + bool handling_bootdn = true;
>>   int cpu = 0;
>>   int nthreads = 1;
>>
>>   DBG("smp_setup_cpu_maps()\n");
>>
>> +again:
>> + /* E.g. kexec will not boot from the 1st core. So firstly loop to find 
>> out
>> +  * the dn of boot-cpu, and map them on

Re: [PATCH kernel] powerpc/npu: Do not try invalidating 32bit table when 64bit table is enabled

2018-03-13 Thread Alexey Kardashevskiy

On 7/3/18 2:40 pm, Alexey Kardashevskiy wrote:
> On 13/02/18 16:51, Alexey Kardashevskiy wrote:
>> GPUs and the corresponding NVLink bridges get different PEs as they have
>> separate translation validation entries (TVEs). We put these PEs to
>> the same IOMMU group so they cannot be passed through separately.
>> So the iommu_table_group_ops::set_window/unset_window for GPUs do set
>> tables to the NPU PEs as well which means that iommu_table's list of
>> attached PEs (iommu_table_group_link) has both GPU and NPU PEs linked.
>> This list is used for TCE cache invalidation.
>>
>> The problem is that NPU PE has just a single TVE and can be programmed
>> to point to 32bit or 64bit windows while GPU PE has two (as any other PCI
>> device). So we end up having an 32bit iommu_table struct linked to both
>> PEs even though only the 64bit TCE table cache can be invalidated on NPU.
>> And a relatively recent skiboot detects this and prints errors.
>>
>> This changes GPU's iommu_table_group_ops::set_window/unset_window to make
>> sure that NPU PE is only linked to the table actually used by the hardware.
>> If there are two tables used by an IOMMU group, the NPU PE will use
>> the last programmed one which with the current use scenarios is expected
>> to be a 64bit one.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> --
>>
>> Do we need BUG_ON(IOMMU_TABLE_GROUP_MAX_TABLES != 2)?
> 
> 
> 
> Ping?


Anyone? Alistair? :)


> 
> 
> 
>>
>>
>> This is an example for:
>>
>> 0004:04:00.0 3D: NVIDIA Corporation Device 1db1 (rev a1)
>> 0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>> 0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>>
>> Before the patch (npu2_tce_kill messages are from skiboot):
>>
>> pci 0004:04 : [PE# 00] Setting up window#0 0..3fff pg=1000
>> pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fff pg=1000
>> pci 0004:04 : [PE# 00] Setting up window#1 
>> 800..800 pg=1
>> pci 0006:00:00.0: [PE# 0d] Setting up window 
>> 800..800 pg=1
>> NPU6: npu2_tce_kill: Unexpected TCE size (got 0x1000 expected 0x1)
>> NPU6: npu2_tce_kill: Unexpected TCE size (got 0x1000 expected 0x1)
>> NPU6: npu2_tce_kill: Unexpected TCE size (got 0x1000 expected 0x1)
>> NPU6: npu2_tce_kill: Unexpected TCE size (got 0x1000 expected 0x1)
>> NPU6: npu2_tce_kill: Unexpected TCE size (got 0x1000 expected 0x1)
>> ...
>> pci 0004:04 : [PE# 00] Removing DMA window #0
>> pci 0006:00:00.0: [PE# 0d] Removing DMA window
>> pci 0004:04 : [PE# 00] Removing DMA window #1
>> pci 0006:00:00.0: [PE# 0d] Removing DMA window
>> pci 0004:04 : [PE# 00] Setting up window#0 0..3fff pg=1000
>> pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fff pg=1000
>> pci 0004:04 : [PE# 00] Setting up window#1 
>> 800..800 pg=1
>> pci 0006:00:00.0: [PE# 0d] Setting up window 
>> 800..800 pg=1
>>
>> After the patch (no errors here):
>>
>> pci 0004:04 : [PE# 00] Setting up window#0 0..3fff pg=1000
>> pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fff pg=1000
>> pci 0004:04 : [PE# 00] Setting up window#1 
>> 800..800 pg=1
>> pci 0006:00:00.0: [PE# 0d] Removing DMA window
>> pci 0006:00:00.0: [PE# 0d] Setting up window 
>> 800..800 pg=1
>> pci 0004:04 : [PE# 00] Removing DMA window #0
>> pci 0004:04 : [PE# 00] Removing DMA window #1
>> pci 0006:00:00.0: [PE# 0d] Removing DMA window
>> pci 0004:04 : [PE# 00] Setting up window#0 0..3fff pg=1000
>> pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fff pg=1000
>> pci 0004:04 : [PE# 00] Setting up window#1 
>> 800..800 pg=1
>> pci 0006:00:00.0: [PE# 0d] Removing DMA window
>> pci 0006:00:00.0: [PE# 0d] Setting up window 
>> 800..800 pg=1
>> ---
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 27 ---
>>  1 file changed, 24 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
>> b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 496e476..2f91815 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -2681,14 +2681,23 @@ static struct pnv_ioda_pe *gpe_table_group_to_npe(
>>  static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group 
>> *table_group,
>>  int num, struct iommu_table *tbl)
>>  {
>> +struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
>> +int num2 = (num == 0) ? 1 : 0;
>>  long ret = pnv_pci_ioda2_set_window(table_group, num, tbl);
>>  
>>  if (ret)
>>  return ret;
>>  
>> -ret = pnv_npu_set_window(gpe_table_group_to_npe(table_group), num, tbl);
>> -if (ret)
>> +if (table_group->tables[num2])
>> +pnv_npu_unset_window(npe, num2);
>> +
>> +ret = pnv_npu_set_window(npe, num, tbl);
>> +if (ret) {
>>

Re: [Y2038] [PATCH v4 02/10] include: Move compat_timespec/ timeval to compat_time.h

2018-03-13 Thread Deepa Dinamani

The file arch/arm64/kernel/process.c needs asm/compat.h also to be
included directly since this is included conditionally from
include/compat.h. This does seem to be typical of arm64 as I was not
completely able to get rid of asm/compat.h includes for arm64 in this
series. My plan is to have separate patches to get rid of asm/compat.h
includes for the architectures that are not straightforward, to keep
this series simple.
I will fix this and update the series.

-Deepa


On Tue, Mar 13, 2018 at 8:22 AM, kbuild test robot  wrote:
> Hi Deepa,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on ]
>
> url:
> https://github.com/0day-ci/linux/commits/Deepa-Dinamani/posix_clocks-Prepare-syscalls-for-64-bit-time_t-conversion/20180313-203305
> base:
> config: arm64-allnoconfig (attached as .config)
> compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=arm64
>
> All errors (new ones prefixed by >>):
>
>arch/arm64/kernel/process.c: In function 'copy_thread':
>>> arch/arm64/kernel/process.c:342:8: error: implicit declaration of function 
>>> 'is_compat_thread'; did you mean 'is_compat_task'? 
>>> [-Werror=implicit-function-declaration]
>if (is_compat_thread(task_thread_info(p)))
>^~~~
>is_compat_task
>cc1: some warnings being treated as errors
>
> vim +342 arch/arm64/kernel/process.c
>
> b3901d54d Catalin Marinas  2012-03-05  307
> b3901d54d Catalin Marinas  2012-03-05  308  int copy_thread(unsigned long 
> clone_flags, unsigned long stack_start,
> afa86fc42 Al Viro  2012-10-22  309  unsigned long stk_sz, 
> struct task_struct *p)
> b3901d54d Catalin Marinas  2012-03-05  310  {
> b3901d54d Catalin Marinas  2012-03-05  311  struct pt_regs *childregs = 
> task_pt_regs(p);
> b3901d54d Catalin Marinas  2012-03-05  312
> c34501d21 Catalin Marinas  2012-10-05  313  
> memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
> c34501d21 Catalin Marinas  2012-10-05  314
> bc0ee4760 Dave Martin  2017-10-31  315  /*
> bc0ee4760 Dave Martin  2017-10-31  316   * Unalias 
> p->thread.sve_state (if any) from the parent task
> bc0ee4760 Dave Martin  2017-10-31  317   * and disable discard SVE 
> state for p:
> bc0ee4760 Dave Martin  2017-10-31  318   */
> bc0ee4760 Dave Martin  2017-10-31  319  clear_tsk_thread_flag(p, 
> TIF_SVE);
> bc0ee4760 Dave Martin  2017-10-31  320  p->thread.sve_state = NULL;
> bc0ee4760 Dave Martin  2017-10-31  321
> 071b6d4a5 Dave Martin  2017-12-05  322  /*
> 071b6d4a5 Dave Martin  2017-12-05  323   * In case p was allocated 
> the same task_struct pointer as some
> 071b6d4a5 Dave Martin  2017-12-05  324   * other recently-exited 
> task, make sure p is disassociated from
> 071b6d4a5 Dave Martin  2017-12-05  325   * any cpu that may have run 
> that now-exited task recently.
> 071b6d4a5 Dave Martin  2017-12-05  326   * Otherwise we could 
> erroneously skip reloading the FPSIMD
> 071b6d4a5 Dave Martin  2017-12-05  327   * registers for p.
> 071b6d4a5 Dave Martin  2017-12-05  328   */
> 071b6d4a5 Dave Martin  2017-12-05  329  fpsimd_flush_task_state(p);
> 071b6d4a5 Dave Martin  2017-12-05  330
> 9ac080021 Al Viro  2012-10-21  331  if (likely(!(p->flags & 
> PF_KTHREAD))) {
> 9ac080021 Al Viro  2012-10-21  332  *childregs = 
> *current_pt_regs();
> b3901d54d Catalin Marinas  2012-03-05  333  childregs->regs[0] = 
> 0;
> d00a3810c Will Deacon  2015-05-27  334
> b3901d54d Catalin Marinas  2012-03-05  335  /*
> b3901d54d Catalin Marinas  2012-03-05  336   * Read the current 
> TLS pointer from tpidr_el0 as it may be
> b3901d54d Catalin Marinas  2012-03-05  337   * out-of-sync with 
> the saved value.
> b3901d54d Catalin Marinas  2012-03-05  338   */
> adf758999 Mark Rutland 2016-09-08  339  *task_user_tls(p) = 
> read_sysreg(tpidr_el0);
> d00a3810c Will Deacon  2015-05-27  340
> e0fd18ce1 Al Viro  2012-10-18  341  if (stack_start) {
> d00a3810c Will Deacon  2015-05-27 @342  if 
> (is_compat_thread(task_thread_info(p)))
> d00a3810c Will Deacon  2015-05-27  343  
> childregs->compat_

Re: [PATCH v4 02/10] include: Move compat_timespec/ timeval to compat_time.h

2018-03-13 Thread Deepa Dinamani

This is again a tricky include file ordering issue when linux/compat.h is
included instead of asm/compat.h. is_compat_task() is unconditionally
defined in linux/compat.h as a macro, which conflicts with the inline
function defined in asm/compat.h for this arch.
As before, I will do the simple thing here and leave the asm/compat.h
include in place to keep this series simple.
I will submit follow up patches to eliminate direct inclusion of asm/compat.h.

I will include this also in the update.

-Deepa

On Tue, Mar 13, 2018 at 8:30 AM, kbuild test robot  wrote:
> Hi Deepa,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on ]
>
> url:
> https://github.com/0day-ci/linux/commits/Deepa-Dinamani/posix_clocks-Prepare-syscalls-for-64-bit-time_t-conversion/20180313-203305
> base:
> config: powerpc-iss476-smp_defconfig (attached as .config)
> compiler: powerpc-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=powerpc
>
> All errors (new ones prefixed by >>):
>
>arch/powerpc/oprofile/backtrace.c: In function 'user_getsp32':
>>> arch/powerpc/oprofile/backtrace.c:31:19: error: implicit declaration of 
>>> function 'compat_ptr'; did you mean 'complete'? 
>>> [-Werror=implicit-function-declaration]
>  void __user *p = compat_ptr(sp);
>   ^~
>   complete
>>> arch/powerpc/oprofile/backtrace.c:31:19: error: initialization makes 
>>> pointer from integer without a cast [-Werror=int-conversion]
>cc1: all warnings being treated as errors
>
> vim +31 arch/powerpc/oprofile/backtrace.c
>
> 6c6bd754 Brian Rogan 2006-03-27  27
> 6c6bd754 Brian Rogan 2006-03-27  28  static unsigned int 
> user_getsp32(unsigned int sp, int is_first)
> 6c6bd754 Brian Rogan 2006-03-27  29  {
> 6c6bd754 Brian Rogan 2006-03-27  30 unsigned int stack_frame[2];
> 62034f03 Al Viro 2006-09-23 @31 void __user *p = compat_ptr(sp);
> 6c6bd754 Brian Rogan 2006-03-27  32
> 62034f03 Al Viro 2006-09-23  33 if (!access_ok(VERIFY_READ, p, 
> sizeof(stack_frame)))
> 6c6bd754 Brian Rogan 2006-03-27  34 return 0;
> 6c6bd754 Brian Rogan 2006-03-27  35
> 6c6bd754 Brian Rogan 2006-03-27  36 /*
> 6c6bd754 Brian Rogan 2006-03-27  37  * The most likely reason for this is 
> that we returned -EFAULT,
> 6c6bd754 Brian Rogan 2006-03-27  38  * which means that we've done all 
> that we can do from
> 6c6bd754 Brian Rogan 2006-03-27  39  * interrupt context.
> 6c6bd754 Brian Rogan 2006-03-27  40  */
> 62034f03 Al Viro 2006-09-23  41 if 
> (__copy_from_user_inatomic(stack_frame, p, sizeof(stack_frame)))
> 6c6bd754 Brian Rogan 2006-03-27  42 return 0;
> 6c6bd754 Brian Rogan 2006-03-27  43
> 6c6bd754 Brian Rogan 2006-03-27  44 if (!is_first)
> 6c6bd754 Brian Rogan 2006-03-27  45 
> oprofile_add_trace(STACK_LR32(stack_frame));
> 6c6bd754 Brian Rogan 2006-03-27  46
> 6c6bd754 Brian Rogan 2006-03-27  47 /*
> 6c6bd754 Brian Rogan 2006-03-27  48  * We do not enforce increasing stack 
> addresses here because
> 6c6bd754 Brian Rogan 2006-03-27  49  * we may transition to a different 
> stack, eg a signal handler.
> 6c6bd754 Brian Rogan 2006-03-27  50  */
> 6c6bd754 Brian Rogan 2006-03-27  51 return STACK_SP(stack_frame);
> 6c6bd754 Brian Rogan 2006-03-27  52  }
> 6c6bd754 Brian Rogan 2006-03-27  53
>
> :: The code at line 31 was first introduced by commit
> :: 62034f03380a64c0144b6721f4a2aa55d65346c1 [POWERPC] powerpc oprofile 
> __user annotations
>
> :: TO: Al Viro 
> :: CC: Paul Mackerras 
>
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation

[PATCH v5 00/10] posix_clocks: Prepare syscalls for 64 bit time_t conversion

2018-03-13 Thread Deepa Dinamani

The series is a preparation series for individual architectures
to use 64 bit time_t syscalls in compat and 32 bit emulation modes.

This is a follow up to the series Arnd Bergmann posted:
https://sourceware.org/ml/libc-alpha/2015-05/msg00070.html [1]

Thomas, Arnd, this seems ready to be merged now.
Can you help get this merged?

Big picture is as per the lwn article:
https://lwn.net/Articles/643234/ [2]

The series is directed at converting posix clock syscalls:
clock_gettime, clock_settime, clock_getres and clock_nanosleep
to use a new data structure __kernel_timespec at syscall boundaries.
__kernel_timespec maintains 64 bit time_t across all execution modes.
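
As a rough illustration of the idea (not the definition added by this
series), a timespec built only from fixed-width 64 bit fields has the
same layout whether it is compiled for a 32 bit or a 64 bit task. The
demo_ names below are hypothetical and exist only for this sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in; NOT the kernel's __kernel_timespec. */
struct demo_kernel_timespec {
	int64_t tv_sec;   /* seconds, 64 bits in every execution mode */
	int64_t tv_nsec;  /* nanoseconds, also 64 bits wide */
};

/* Size in bytes; fixed-width members keep it independent of sizeof(long). */
size_t demo_kernel_timespec_size(void)
{
	return sizeof(struct demo_kernel_timespec);
}

/* Returns 1 when tv_nsec is a valid nanosecond count, 0 otherwise. */
int demo_timespec_is_valid(const struct demo_kernel_timespec *ts)
{
	assert(ts != NULL);
	return ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000LL;
}
```

A value one second past INT32_MAX (the y2038 limit of a 32 bit time_t)
is representable in tv_sec here without truncation.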

vdso will be handled as part of each architecture when they enable
support for 64 bit time_t.

The compat syscalls are repurposed to provide backward compatibility
by using them as native syscalls as well for 32 bit architectures.
They will continue to use timespec at syscall boundaries.

CONFIG_64BIT_TIME controls whether the syscalls use __kernel_timespec
or timespec at syscall boundaries.

The series does the following:
1. Enable compat syscalls on 32 bit architectures.
2. Add a new __kernel_timespec type to be used as the data structure
   for all the new syscalls.
3. Add new config CONFIG_64BIT_TIME (instead of the CONFIG_COMPAT_TIME in
   [1] and [2]) to switch to the new definition of __kernel_timespec. It is
   the same as struct timespec otherwise.
4. Add new CONFIG_32BIT_TIME to conditionally compile compat syscalls.

* Changes since v4:
 * Fixed up kbuild errors for arm64 and powerpc non compat configs
* Changes since v3:
 * Updated include file ordering
* Changes since v2:
 * Dropped the ARCH_HAS_64BIT_TIME config.
 * Fixed zeroing out of higher order bits of tv_nsec for real.
 * Addressed minor review comments from v1.
* Changes since v1:
 * Introduce CONFIG_32BIT_TIME
 * Fixed zeroing out of higher order bits of tv_nsec
 * Included Arnd's changes to fix up use of compat headers
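
The tv_nsec item above can be pictured with a small sketch. It assumes
(this is an assumption, not taken from the patch text) that a 32 bit
task only fills the low 32 bits of a 64 bit tv_nsec slot, so the upper
half has to be cleared before range checking. All demo_ names are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000LL

/*
 * Hypothetical helper: raw_nsec is the 64 bit tv_nsec slot as copied from
 * user memory, from_32bit_task is 1 when the caller is a 32 bit task.
 */
int64_t demo_sanitize_nsec(uint64_t raw_nsec, int from_32bit_task)
{
	assert(from_32bit_task == 0 || from_32bit_task == 1);
	if (from_32bit_task)
		raw_nsec &= UINT64_C(0xFFFFFFFF); /* drop the upper 32 bits */
	return (int64_t)raw_nsec;
}

/* Returns 1 when nsec is a valid nanosecond count, 0 otherwise. */
int demo_nsec_is_valid(int64_t nsec)
{
	return nsec >= 0 && nsec < DEMO_NSEC_PER_SEC;
}
```

With stray bits in the upper half, the value passes the range check
only after masking; a native 64 bit caller is checked unmodified.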

I decided against using LEGACY_TIME_SYSCALLS to conditionally compile
legacy time syscalls such as sys_nanosleep because this will need to
enclose compat_sys_nanosleep as well. So, defining it as 

config LEGACY_TIME_SYSCALLS
 def_bool 64BIT || !64BIT_TIME

will not include compat_sys_nanosleep. We will instead need a new config to
exclusively mark legacy syscalls.

Deepa Dinamani (10):
  compat: Make compat helpers independent of CONFIG_COMPAT
  include: Move compat_timespec/ timeval to compat_time.h
  compat: enable compat_get/put_timespec64 always
  arch: introduce CONFIG_64BIT_TIME
  arch: Introduce CONFIG_COMPAT_32BIT_TIME
  posix-clocks: Make compat syscalls depend on CONFIG_COMPAT_32BIT_TIME
  include: Add new y2038 safe __kernel_timespec
  fix get_timespec64() for y2038 safe compat interfaces
  change time types to new y2038 safe __kernel_* types
  nanosleep: change time types to safe __kernel_* types

 arch/Kconfig   | 15 +
 arch/arm64/include/asm/compat.h| 11 ---
 arch/arm64/include/asm/stat.h  |  1 +
 arch/arm64/kernel/hw_breakpoint.c  |  1 -
 arch/arm64/kernel/perf_regs.c  |  2 +-
 arch/mips/include/asm/compat.h | 11 ---
 arch/mips/kernel/signal32.c|  2 +-
 arch/parisc/include/asm/compat.h   | 11 ---
 arch/powerpc/include/asm/compat.h  | 11 ---
 arch/powerpc/kernel/asm-offsets.c  |  2 +-
 arch/powerpc/oprofile/backtrace.c  |  1 +
 arch/s390/hypfs/hypfs_sprp.c   |  1 -
 arch/s390/include/asm/compat.h | 11 ---
 arch/s390/include/asm/elf.h|  4 +--
 arch/s390/kvm/priv.c   |  1 -
 arch/s390/pci/pci_clp.c|  1 -
 arch/sparc/include/asm/compat.h| 11 ---
 arch/tile/include/asm/compat.h | 11 ---
 arch/x86/events/core.c |  2 +-
 arch/x86/include/asm/compat.h  | 11 ---
 arch/x86/include/asm/ftrace.h  |  2 +-
 arch/x86/include/asm/sys_ia32.h|  2 +-
 arch/x86/kernel/sys_x86_64.c   |  2 +-
 drivers/s390/block/dasd_ioctl.c|  1 -
 drivers/s390/char/fs3270.c |  1 -
 drivers/s390/char/sclp_ctl.c   |  1 -
 drivers/s390/char/vmcp.c   |  1 -
 drivers/s390/cio/chsc_sch.c|  1 -
 drivers/s390/net/qeth_core_main.c  |  2 +-
 include/linux/compat.h | 11 ---
 include/linux/compat_time.h| 23 ++
 include/linux/restart_block.h  |  7 ++--
 include/linux/syscalls.h   | 12 +++
 include/linux/time.h   |  4 +--
 include/linux/time64.h | 10 +-
 include/uapi/asm-generic/posix_types.h |  1 +
 include/uapi/linux/time.h  |  7 
 kernel/compat.c| 52 +-
 kernel/time/hrtimer.c  | 10 --
 kernel/time/posix-stubs.c  | 12 ---
 kernel/time/posix-timers.c | 24 ++
 kernel

[PATCH v5 02/10] include: Move compat_timespec/ timeval to compat_time.h

2018-03-13 Thread Deepa Dinamani

All the current architecture specific defines for these
are the same. Refactor these common defines to a common
header file.

The new common linux/compat_time.h is also useful as it
will eventually be used to hold all the defines that
are needed for compat time types that support non y2038
safe types. New architectures need not define these
new types as they will only use new y2038 safe syscalls.
This file can be deleted after y2038 when we stop supporting
non y2038 safe syscalls.
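
For orientation, the per-architecture definitions being removed all
have the shape below. This is a userspace sketch using stdint types and
hypothetical demo_ names rather than the contents of the new header; it
also shows why these types are not y2038 safe:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors "typedef s32 compat_time_t" from the removed arch headers. */
typedef int32_t demo_compat_time_t;

struct demo_compat_timespec {
	demo_compat_time_t tv_sec;
	int32_t            tv_nsec;
};

struct demo_compat_timeval {
	demo_compat_time_t tv_sec;
	int32_t            tv_usec;
};

/* Returns 1 if a 64 bit seconds value fits the 32 bit compat field. */
int demo_fits_compat_time(int64_t sec)
{
	return sec >= INT32_MIN && sec <= INT32_MAX;
}

/* Narrow to the compat layout; the caller must check the range first. */
struct demo_compat_timespec demo_to_compat(int64_t sec, int32_t nsec)
{
	struct demo_compat_timespec out;

	assert(demo_fits_compat_time(sec));
	out.tv_sec = (demo_compat_time_t)sec;
	out.tv_nsec = nsec;
	return out;
}
```

The largest tv_sec that fits is 2147483647, which corresponds to
2038-01-19 03:14:07 UTC.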

The patch also requires an operation similar to:

git grep "asm/compat\.h" | cut -d ":" -f 1 | xargs -n 1 sed -i -e "s%asm/compat.h%linux/compat.h%g"

Cc: a...@kernel.org
Cc: b...@kernel.crashing.org
Cc: borntrae...@de.ibm.com
Cc: catalin.mari...@arm.com
Cc: cmetc...@mellanox.com
Cc: coh...@redhat.com
Cc: da...@davemloft.net
Cc: del...@gmx.de
Cc: de...@driverdev.osuosl.org
Cc: gerald.schae...@de.ibm.com
Cc: gre...@linuxfoundation.org
Cc: heiko.carst...@de.ibm.com
Cc: hoepp...@linux.vnet.ibm.com
Cc: h...@zytor.com
Cc: j...@parisc-linux.org
Cc: j...@linux.vnet.ibm.com
Cc: linux-ker...@vger.kernel.org
Cc: linux-m...@linux-mips.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: mark.rutl...@arm.com
Cc: mi...@redhat.com
Cc: m...@ellerman.id.au
Cc: ober...@linux.vnet.ibm.com
Cc: oprofile-l...@lists.sf.net
Cc: pau...@samba.org
Cc: pet...@infradead.org
Cc: r...@linux-mips.org
Cc: rost...@goodmis.org
Cc: r...@kernel.org
Cc: schwidef...@de.ibm.com
Cc: seb...@linux.vnet.ibm.com
Cc: sparcli...@vger.kernel.org
Cc: s...@linux.vnet.ibm.com
Cc: ubr...@linux.vnet.ibm.com
Cc: will.dea...@arm.com
Cc: x...@kernel.org
Signed-off-by: Arnd Bergmann 
Signed-off-by: Deepa Dinamani 
Acked-by: Steven Rostedt (VMware) 
Acked-by: Catalin Marinas 
Acked-by: James Hogan 
Acked-by: Helge Deller 
---
 arch/arm64/include/asm/compat.h   | 11 ---
 arch/arm64/include/asm/stat.h |  1 +
 arch/arm64/kernel/hw_breakpoint.c |  1 -
 arch/arm64/kernel/perf_regs.c |  2 +-
 arch/mips/include/asm/compat.h| 11 ---
 arch/mips/kernel/signal32.c   |  2 +-
 arch/parisc/include/asm/compat.h  | 11 ---
 arch/powerpc/include/asm/compat.h | 11 ---
 arch/powerpc/kernel/asm-offsets.c |  2 +-
 arch/powerpc/oprofile/backtrace.c |  1 +
 arch/s390/hypfs/hypfs_sprp.c  |  1 -
 arch/s390/include/asm/compat.h| 11 ---
 arch/s390/include/asm/elf.h   |  4 ++--
 arch/s390/kvm/priv.c  |  1 -
 arch/s390/pci/pci_clp.c   |  1 -
 arch/sparc/include/asm/compat.h   | 11 ---
 arch/tile/include/asm/compat.h| 11 ---
 arch/x86/events/core.c|  2 +-
 arch/x86/include/asm/compat.h | 11 ---
 arch/x86/include/asm/ftrace.h |  2 +-
 arch/x86/include/asm/sys_ia32.h   |  2 +-
 arch/x86/kernel/sys_x86_64.c  |  2 +-
 drivers/s390/block/dasd_ioctl.c   |  1 -
 drivers/s390/char/fs3270.c|  1 -
 drivers/s390/char/sclp_ctl.c  |  1 -
 drivers/s390/char/vmcp.c  |  1 -
 drivers/s390/cio/chsc_sch.c   |  1 -
 drivers/s390/net/qeth_core_main.c |  2 +-
 include/linux/compat.h|  1 +
 include/linux/compat_time.h   | 19 +++
 30 files changed, 32 insertions(+), 107 deletions(-)
 create mode 100644 include/linux/compat_time.h

diff --git a/arch/arm64/include/asm/compat.h b/arch/arm64/include/asm/compat.h
index c00c62e1a4a3..0030f79808b3 100644
--- a/arch/arm64/include/asm/compat.h
+++ b/arch/arm64/include/asm/compat.h
@@ -34,7 +34,6 @@
 
 typedef u32compat_size_t;
 typedef s32compat_ssize_t;
-typedef s32compat_time_t;
 typedef s32compat_clock_t;
 typedef s32compat_pid_t;
 typedef u16__compat_uid_t;
@@ -66,16 +65,6 @@ typedef u32  compat_ulong_t;
 typedef u64compat_u64;
 typedef u32compat_uptr_t;
 
-struct compat_timespec {
-   compat_time_t   tv_sec;
-   s32 tv_nsec;
-};
-
-struct compat_timeval {
-   compat_time_t   tv_sec;
-   s32 tv_usec;
-};
-
 struct compat_stat {
 #ifdef __AARCH64EB__
short   st_dev;
diff --git a/arch/arm64/include/asm/stat.h b/arch/arm64/include/asm/stat.h
index 15e35598ac40..eab738019707 100644
--- a/arch/arm64/include/asm/stat.h
+++ b/arch/arm64/include/asm/stat.h
@@ -20,6 +20,7 @@
 
 #ifdef CONFIG_COMPAT
 
+#include 
 #include 
 
 /*
diff --git a/arch/arm64/kernel/hw_breakpoint.c 
b/arch/arm64/kernel/hw_breakpoint.c
index 74bb56f656ef..413dbe530da8 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -30,7 +30,6 @@
 #include 
 #include 
 
-#include 
 #include 
 #include 
 #include 
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index 1d091d048d04..0bbac612146e 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -1,11 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
+#include 
 #include 
 #include 
 #include

dtc warnings

2018-03-13 Thread Stephen Rothwell

Hi all,


I get the following from a powerpc_ppc44x defconfig build in current
linux-next:

arch/powerpc/boot/ebony.dtb: Warning (pci_bridge): /plb/pci@20ec0: missing bus-range for PCI bridge
arch/powerpc/boot/ebony.dtb: Warning (pci_device_bus_num): Failed prerequisite 'pci_bridge'
arch/powerpc/boot/ebony.dtb: Warning (chosen_node_stdout_path): /chosen:linux,stdout-path: Use 'stdout-path' instead
arch/powerpc/boot/sequoia.dtb: Warning (pci_bridge): /plb/pci@1ec00: missing bus-range for PCI bridge
arch/powerpc/boot/sequoia.dtb: Warning (pci_device_bus_num): Failed prerequisite 'pci_bridge'
arch/powerpc/boot/sequoia.dtb: Warning (chosen_node_stdout_path): /chosen:linux,stdout-path: Use 'stdout-path' instead
arch/powerpc/boot/sam440ep.dtb: Warning (pci_bridge): /plb/pci@ec00: missing bus-range for PCI bridge
arch/powerpc/boot/sam440ep.dtb: Warning (pci_device_bus_num): Failed prerequisite 'pci_bridge'
arch/powerpc/boot/sam440ep.dtb: Warning (chosen_node_stdout_path): /chosen:linux,stdout-path: Use 'stdout-path' instead
arch/powerpc/boot/rainier.dtb: Warning (pci_bridge): /plb/pci@1ec00: missing bus-range for PCI bridge
arch/powerpc/boot/rainier.dtb: Warning (pci_device_bus_num): Failed prerequisite 'pci_bridge'
arch/powerpc/boot/rainier.dtb: Warning (chosen_node_stdout_path): /chosen:linux,stdout-path: Use 'stdout-path' instead
arch/powerpc/boot/taishan.dtb: Warning (pci_bridge): /plb/pci@20ec0: missing bus-range for PCI bridge
arch/powerpc/boot/taishan.dtb: Warning (pci_device_bus_num): Failed prerequisite 'pci_bridge'
arch/powerpc/boot/taishan.dtb: Warning (chosen_node_stdout_path): /chosen:linux,stdout-path: Use 'stdout-path' instead
arch/powerpc/boot/bamboo.dtb: Warning (pci_bridge): /plb/pci@ec00: missing bus-range for PCI bridge
arch/powerpc/boot/bamboo.dtb: Warning (pci_device_bus_num): Failed prerequisite 'pci_bridge'
arch/powerpc/boot/bamboo.dtb: Warning (chosen_node_stdout_path): /chosen:linux,stdout-path: Use 'stdout-path' instead
arch/powerpc/boot/katmai.dtb: Warning (pci_bridge): /plb/pciex@d: node name is not "pci" or "pcie"
arch/powerpc/boot/katmai.dtb: Warning (pci_bridge): /plb/pciex@d2000: node name is not "pci" or "pcie"
arch/powerpc/boot/katmai.dtb: Warning (pci_bridge): /plb/pciex@d4000: node name is not "pci" or "pcie"
arch/powerpc/boot/katmai.dtb: Warning (pci_device_bus_num): Failed prerequisite 'pci_bridge'
arch/powerpc/boot/katmai.dtb: Warning (chosen_node_stdout_path): /chosen:linux,stdout-path: Use 'stdout-path' instead
arch/powerpc/boot/warp.dtb: Warning (chosen_node_stdout_path): /chosen:linux,stdout-path: Use 'stdout-path' instead
arch/powerpc/boot/yosemite.dtb: Warning (pci_bridge): /plb/pci@ec00: missing bus-range for PCI bridge
arch/powerpc/boot/yosemite.dtb: Warning (pci_device_bus_num): Failed prerequisite 'pci_bridge'
arch/powerpc/boot/yosemite.dtb: Warning (chosen_node_stdout_path): /chosen:linux,stdout-path: Use 'stdout-path' instead

I thought someone might like to do something about them ...
-- 
Cheers,
Stephen Rothwell



more warnings

2018-03-13 Thread Stephen Rothwell

Hi all,

I also get these from a powerpc allyesconfig build of linux-next:

WARNING: vmlinux.o(.text+0x7c81c): Section mismatch in reference from the function .stop_machine_change_mapping() to the function .meminit.text:.create_physical_mapping()
The function .stop_machine_change_mapping() references
the function __meminit .create_physical_mapping().
This is often because .stop_machine_change_mapping lacks a __meminit 
annotation or the annotation of .create_physical_mapping is wrong.

WARNING: vmlinux.o(.text+0x7c828): Section mismatch in reference from the function .stop_machine_change_mapping() to the function .meminit.text:.create_physical_mapping()
The function .stop_machine_change_mapping() references
the function __meminit .create_physical_mapping().
This is often because .stop_machine_change_mapping lacks a __meminit 
annotation or the annotation of .create_physical_mapping is wrong.

-- 
Cheers,
Stephen Rothwell

