Re: [PATCH 12/14] powerpc: pass node id into create_section_mapping
Nicholas Piggin writes:

> diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> index 328ff9abc333..435b19e74508 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -862,9 +862,9 @@ static void remove_pagetable(unsigned long start, unsigned long end)
>  	radix__flush_tlb_kernel_range(start, end);
>  }
>  
> -int __ref radix__create_section_mapping(unsigned long start, unsigned long end)
> +int __ref radix__create_section_mapping(unsigned long start, unsigned long end, int nid)
>  {
> -	return create_physical_mapping(start, end);
> +	return create_physical_mapping(start, end, nid);
>  }

This got a little muddled. We add the nid argument here, but
create_physical_mapping() doesn't take it until patch 14.

I managed to fix it by rearranging the last three patches and fiddling
things a bit. If you can check the result once I push that would be good.

cheers
Re: [PATCH 09/14] powerpc/64: defer paca allocation until memory topology is discovered
Nicholas Piggin writes:

> ---
>  arch/powerpc/include/asm/paca.h    |  3 +-
>  arch/powerpc/kernel/paca.c         | 90 --
>  arch/powerpc/kernel/prom.c         |  5 ++-
>  arch/powerpc/kernel/setup-common.c | 24 +++---
>  4 files changed, 51 insertions(+), 71 deletions(-)

Added SOB.

cheers
Re: [PATCH 08/14] powerpc/setup: cpu_to_phys_id array
Nicholas Piggin writes:

> Build an array that finds hardware CPU number from logical CPU
> number in firmware CPU discovery. Use that rather than setting
> paca of other CPUs directly, to begin with. Subsequent patch will
> not have pacas allocated at this point.
> ---
>  arch/powerpc/include/asm/smp.h     |  1 +
>  arch/powerpc/kernel/prom.c         |  7 +++
>  arch/powerpc/kernel/setup-common.c | 15 ++-
>  3 files changed, 22 insertions(+), 1 deletion(-)

Added SOB.

> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 4dffef947b8a..5979e34ba90e 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -874,5 +874,12 @@ EXPORT_SYMBOL(cpu_to_chip_id);
>  
>  bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
>  {
> +	/*
> +	 * Early firmware scanning must use this rather than
> +	 * get_hard_smp_processor_id because we don't have pacas allocated
> +	 * until memory topology is discovered.
> +	 */
> +	if (cpu_to_phys_id != NULL)
> +		return (int)phys_id == cpu_to_phys_id[cpu];

This needed an #ifdef CONFIG_SMP.

cheers
Re: [PATCH 10/14] powerpc/64: allocate pacas per node
Nicholas Piggin writes:

> Per-node allocations are possible on 64s with radix that does
> not have the bolted SLB limitation.
>
> Hash would be able to do the same if all CPUs had the bottom of
> their node-local memory bolted as well. This is left as an
> exercise for the reader.
> ---
>  arch/powerpc/kernel/paca.c     | 41 +++--
>  arch/powerpc/kernel/setup_64.c |  4 
>  2 files changed, 39 insertions(+), 6 deletions(-)

Added SOB.

cheers
[PATCH] powerpc/fscr: Enable interrupts earlier before calling get_user()
The function get_user() can sleep while trying to fetch an instruction
from the user address space, which causes the following warning from the
scheduler:

  BUG: sleeping function called from invalid context

Interrupts do get enabled again, but only a bit later, after get_user()
has already been called. This change moves the enabling of these
interrupts earlier, so that it covers the call to get_user(). While at
it, also check for kernel mode and crash, as this interrupt should never
have been triggered from kernel context.

Signed-off-by: Anshuman Khandual
---
 arch/powerpc/kernel/traps.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 1e48d15..4d5a55e 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1611,6 +1611,13 @@ void facility_unavailable_exception(struct pt_regs *regs)
 	else
 		value = mfspr(SPRN_FSCR);
 
+	/* We should not have taken this interrupt in kernel */
+	BUG_ON(!user_mode(regs));
+
+	/* We restore the interrupt state now */
+	if (!arch_irq_disabled_regs(regs))
+		local_irq_enable();
+
 	status = value >> 56;
 	if (status == FSCR_DSCR_LG) {
 		/*
@@ -1683,10 +1690,6 @@ void facility_unavailable_exception(struct pt_regs *regs)
 		facility_strings[status])
 		facility = facility_strings[status];
 
-	/* We restore the interrupt state now */
-	if (!arch_irq_disabled_regs(regs))
-		local_irq_enable();
-
 	pr_err_ratelimited("%sFacility '%s' unavailable (%d), exception at 0x%lx, MSR=%lx\n",
 		hv ? "Hypervisor " : "", facility, status, regs->nip, regs->msr);
-- 
2.9.3
Re: [PATCH] powerpc/powernv/nvram: opal_nvram_write handle unknown OPAL errors
Nicholas Piggin writes:

> opal_nvram_write currently just assumes success if it encounters an
> error other than OPAL_BUSY or OPAL_BUSY_EVENT. Have it return -EIO
> on other errors instead.
>
> Signed-off-by: Nicholas Piggin
> ---
>  arch/powerpc/platforms/powernv/opal-nvram.c | 2 ++
>  1 file changed, 2 insertions(+)

Acked-by: Stewart Smith

-- 
Stewart Smith
OPAL Architect, IBM.
[PATCH v2 4/4] powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
From: Mahesh Salgaonkar

For fadump to work successfully there should not be any holes in the
reserved memory ranges where the kernel has asked firmware to move the
contents of old kernel memory in the event of a crash. But this memory
area is currently not protected from hot-remove operations. Hence, the
fadump service can fail to re-register after a hot-remove operation if
the hot-removed memory belongs to the fadump reserved region. To avoid
this, make sure that memory from the fadump reserved area is not
hot-removable if fadump is registered.

However, if the user still wants to remove that memory, he can do so by
manually stopping the fadump service before the hot-remove operation.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/include/asm/fadump.h               |    2 +-
 arch/powerpc/kernel/fadump.c                    |   10 --
 arch/powerpc/platforms/pseries/hotplug-memory.c |    7 +--
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 776cba0baec4..bd84b496d2b8 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -206,7 +206,7 @@ struct fad_crash_memory_ranges {
 	unsigned long long	size;
 };
 
-extern int is_fadump_boot_memory_area(u64 addr, ulong size);
+extern int is_fadump_memory_area(u64 addr, ulong size);
 extern int early_init_dt_scan_fw_dump(unsigned long node,
 		const char *uname, int depth, void *data);
 extern int fadump_reserve_mem(void);
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 0268a32b632e..331066eefaee 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -162,13 +162,19 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 
 /*
  * If fadump is registered, check if the memory provided
- * falls within boot memory area.
+ * falls within boot memory area and reserved memory area.
  */
-int is_fadump_boot_memory_area(u64 addr, ulong size)
+int is_fadump_memory_area(u64 addr, ulong size)
 {
+	u64 d_start = fw_dump.reserve_dump_area_start;
+	u64 d_end = d_start + fw_dump.reserve_dump_area_size;
+
 	if (!fw_dump.dump_registered)
 		return 0;
 
+	if (((addr + size) > d_start) && (addr <= d_end))
+		return 1;
+
 	return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
 }
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c1578f54c626..e4c658cda3a7 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -389,8 +389,11 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
 	phys_addr = lmb->base_addr;
 
 #ifdef CONFIG_FA_DUMP
-	/* Don't hot-remove memory that falls in fadump boot memory area */
-	if (is_fadump_boot_memory_area(phys_addr, block_sz))
+	/*
+	 * Don't hot-remove memory that falls in fadump boot memory area
+	 * and memory that is reserved for capturing old kernel memory.
+	 */
+	if (is_fadump_memory_area(phys_addr, block_sz))
 		return false;
 #endif
[PATCH v2 3/4] powerpc/fadump: throw proper error message on fadump registration failure.
From: Mahesh Salgaonkar

fadump fails to register when there are holes in the reserved memory
area. This can happen if the user has hot-removed memory that falls in
the fadump reserved memory area. Throw a meaningful error message to the
user in such a case.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 9d75619cac28..0268a32b632e 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -216,6 +216,36 @@ static int is_boot_memory_area_contiguous(void)
 	return ret;
 }
 
+/*
+ * Returns 1, if there are no holes in reserved memory area,
+ * 0 otherwise.
+ */
+static int is_reserved_memory_area_contiguous(void)
+{
+	struct memblock_region *reg;
+	unsigned long start, end;
+	unsigned long d_start = fw_dump.reserve_dump_area_start;
+	unsigned long d_end = d_start + fw_dump.reserve_dump_area_size;
+	int ret = 0;
+
+	for_each_memblock(memory, reg) {
+		start = max(d_start, (unsigned long)reg->base);
+		end = min(d_end, (unsigned long)(reg->base + reg->size));
+		if (d_start < end) {
+			/* Memory hole from d_start to start */
+			if (start > d_start)
+				break;
+
+			if (end == d_end) {
+				ret = 1;
+				break;
+			}
+			d_start = end + 1;
+		}
+	}
+	return ret;
+}
+
 /* Print firmware assisted dump configurations for debugging purpose. */
 static void fadump_show_config(void)
 {
@@ -605,6 +635,9 @@ static int register_fw_dump(struct fadump_mem_struct *fdm)
 		if (!is_boot_memory_area_contiguous())
 			pr_err("Can't have holes in boot memory area while "
			       "registering fadump\n");
+		else if (!is_reserved_memory_area_contiguous())
+			pr_err("Can't have holes in reserved memory area while"
+			       " registering fadump\n");
 		printk(KERN_ERR "Failed to register firmware-assisted kernel"
 			" dump. Parameter Error(%d).\n", rc);
[PATCH v2 2/4] powerpc/fadump: exclude memory holes while reserving memory in second kernel.
From: Mahesh Salgaonkar

The second kernel, during early boot after the crash, reserves the rest
of the memory above the boot memory size to make sure it does not touch
any of the dump memory area. It uses memblock_reserve(), which reserves
the specified memory region irrespective of memory holes present within
that region. There are chances the previous kernel would have
hot-removed some of its memory, leaving memory holes behind. In such
cases the fadump kernel reports an incorrect number of reserved pages
through the arch_reserved_kernel_pages() hook, causing the kernel to
hang or panic.

Fix this by excluding memory holes while reserving the rest of the
memory above the boot memory size during second kernel boot after a
crash.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 2098583c935f..9d75619cac28 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -434,6 +434,23 @@ static inline unsigned long get_fadump_metadata_base(
 		be64_to_cpu(fdm_active->rmr_region.source_len));
 }
 
+static void fadump_memblock_reserve(unsigned long base, unsigned long size)
+{
+	struct memblock_region *reg;
+	unsigned long start, end;
+
+	for_each_memblock(memory, reg) {
+		start = (unsigned long)reg->base;
+		end = start + (unsigned long)reg->size;
+
+		if ((start >= base) && (start < (base + size))) {
+			if (end > (base + size))
+				end = base + size;
+			memblock_reserve(start, end - start);
+		}
+	}
+}
+
 int __init fadump_reserve_mem(void)
 {
 	unsigned long base, size, memory_boundary;
@@ -488,7 +505,7 @@ int __init fadump_reserve_mem(void)
 	 */
 	base = fw_dump.boot_memory_size;
 	size = memory_boundary - base;
-	memblock_reserve(base, size);
+	fadump_memblock_reserve(base, size);
 	printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
 		"for saving crash dump\n", (unsigned long)(size >> 20),
[PATCH v2 1/4] powerpc/fadump: Reservationless firmware assisted dump
From: Mahesh Salgaonkar

One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. On large
systems with terabytes of memory, this reservation can be quite
significant. In some cases, fadump fails if the memory reserved is
insufficient, or if the reserved memory was DLPAR hot-removed.

In the normal case, post reboot, the preserved memory is filtered to
extract only relevant areas of interest using the makedumpfile tool.
While the tool provides flexibility to determine what needs to be part
of the dump and what memory to filter out, all supported distributions
default this to "Capture only kernel data and nothing else".

We take advantage of this default and the Linux kernel's Contiguous
Memory Allocator (CMA) to fundamentally change the memory reservation
model for fadump. Fadump can now only capture kernel memory.

Instead of setting aside a significant chunk of memory nobody can use,
this patch uses CMA instead, to reserve a significant chunk of memory
that the kernel is prevented from using (due to MIGRATE_CMA), but that
applications are free to use.

Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:

[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           7557         193        6822          12         541        6725
Swap:          4095           0        4095

With this patch:

[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           8133         194        7464          12         475        7338
Swap:          4095           0        4095

Changes made here are completely transparent to how fadump has
traditionally worked.

Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand
CMA and its usage.

TODO:
- Handle the case where the CMA reservation spans nodes.
Signed-off-by: Ananth N Mavinakayanahalli
Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/include/asm/fadump.h |    3 +
 arch/powerpc/kernel/fadump.c      |  179 +++--
 2 files changed, 154 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 5a23010af600..776cba0baec4 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -119,6 +119,7 @@ struct fadump_mem_struct {
 	struct fadump_section		cpu_state_data;
 	struct fadump_section		hpte_region;
 	struct fadump_section		rmr_region;
+	struct fadump_section		metadata_region;
 };
 
 /* Firmware-assisted dump configuration details. */
@@ -141,6 +142,8 @@ struct fw_dump {
 	unsigned long	fadump_supported:1;
 	unsigned long	dump_active:1;
 	unsigned long	dump_registered:1;
+	/* flag to indicate fadump metadata area is cma allocated */
+	unsigned long	cma_alloc:1;
 };
 
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3c2c2688918f..2098583c935f 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -45,11 +46,57 @@
 static struct fw_dump fw_dump;
 static struct fadump_mem_struct fdm;
 static const struct fadump_mem_struct *fdm_active;
+static struct cma *fadump_cma;
 
 static DEFINE_MUTEX(fadump_mutex);
 struct fad_crash_memory_ranges crash_memory_ranges[INIT_CRASHMEM_RANGES];
 int crash_mem_ranges;
 
+/*
+ * fadump_cma_reserve() - reserve area for fadump memory reservation
+ *
+ * This function reserves memory from early allocator. It should be
+ * called by arch specific code once the memblock allocator
+ * has been activated.
+ */
+int __init fadump_cma_reserve(void)
+{
+	unsigned long long base, size;
+	int rc;
+
+	if (!fw_dump.fadump_enabled)
+		return 0;
+
+	base = fw_dump.reserve_dump_area_start;
+	size = fw_dump.reserve_dump_area_size;
+	pr_debug("Original reserve area base %ld, size %ld\n",
+		 (unsigned long)base >> 20,
+		 (unsigned long)size >> 20);
+	if (!size)
+		return 0;
+
+	rc = cma_declare_contiguous(base, size, 0, 0, 0, false,
+				    "fadump_cma", &fadump_cma);
+	if (rc) {
+		printk(KERN_ERR "fadump: Failed to reserve cma area for "
+		       "firmware-assisted dump, %d\n", rc);
+		fw_dump.reserve_dump_area_size = 0;
+		return 0;
+	}
+	/*
+	 * So we now have cma area reserved for fadump. base may be different
+	 * from what we requested.
+	 */
+	fw_dump.reserve_dump_area_start = cma_get_base(fadump_cma);
+
Re: [PATCH 6/6] doc/devicetree: NVDIMM region documentation
On Thu, Mar 29, 2018 at 4:06 AM, Rob Herring wrote:
> On Tue, Mar 27, 2018 at 9:53 AM, Oliver wrote:
>> On Tue, Mar 27, 2018 at 9:24 AM, Rob Herring wrote:
>>> On Fri, Mar 23, 2018 at 07:12:09PM +1100, Oliver O'Halloran wrote:
>>>> Add device-tree binding documentation for the nvdimm region driver.
>>>>
>>>> Cc: devicet...@vger.kernel.org
>>>> Signed-off-by: Oliver O'Halloran
>>>> ---
>>>>  .../devicetree/bindings/nvdimm/nvdimm-region.txt | 45 ++
>>>>  1 file changed, 45 insertions(+)
>>>>  create mode 100644 Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt
>>>>
>>>> diff --git a/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt b/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt
>>>> new file mode 100644
>>>> index ..02091117ff16
>>>> --- /dev/null
>>>> +++ b/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt
>>>> @@ -0,0 +1,45 @@
>>>> +Device-tree bindings for NVDIMM memory regions
>>>> +-
>>>> +
>>>> +Non-volatile DIMMs are memory modules used to provide (cacheable) main memory
>>>
>>> Are DIMMs always going to be the only form factor for NV memory?
>>>
>>> And if you have multiple DIMMs, does each DT node correspond to a DIMM?
>>
>> A nvdimm-region might correspond to a single NVDIMM, a set of
>> interleaved NVDIMMs, or it might just be a chunk of normal memory that
>> you want treated as a NVDIMM for some reason. The last case is useful
>> for provisioning install media on servers since it allows you to
>> download a DVD image, turn it into an nvdimm-region, and kexec into
>> the installer which can use it as a root disk. That may seem a little
>> esoteric, but it's handy and we're using a full linux environment for
>> our boot loader so it's easy to make use of.
>
> I'm really just asking if we should drop the "dimm" name because it is
> not always a DIMM. Maybe pmem instead? I don't know, naming is
> hard(TM).

pmem is probably a better name. I'll fix that up.

>>> If not, then what if we want/need to provide power control to a DIMM?
>>
>> That would require a DIMM (and probably memory controller) specific
>> driver. I've deliberately left out how regions are mapped back to
>> DIMMs from the binding since it's not really clear to me how that
>> should work. A phandle array pointing to each DIMM device (which could
>> be anything) would do the trick, but I've found that a bit awkward to
>> plumb into the model that libnvdimm expects.

>>>> +that retains its contents across power cycles. In more practical terms, they
>>>> +are a kind of storage device where the contents can be accessed by the CPU
>>>> +directly, rather than indirectly via a storage controller or similar. An
>>>> +nvdimm-region specifies a physical address range that is hosted on an NVDIMM
>>>> +device.
>>>> +
>>>> +Bindings for the region nodes:
>>>> +-
>>>> +
>>>> +Required properties:
>>>> + - compatible = "nvdimm-region"
>>>> +
>>>> + - reg = ;
>>>> +   The system physical address range of this nvdimm region.
>>>> +
>>>> +Optional properties:
>>>> + - Any relevant NUMA associativity properties for the target platform.
>>>> + - A "volatile" property indicating that this region is actually in
>>>> +   normal DRAM and does not require cache flushes after each write.
>>>> +
>>>> +A complete example:
>>>> +
>>>> +/ {
>>>> +	#size-cells = <2>;
>>>> +	#address-cells = <2>;
>>>> +
>>>> +	platform {
>>>
>>> Perhaps we need a more well defined node here. Like we have 'memory' for
>>> memory nodes.
>>
>> I think treating it as a platform device is fine. Memory nodes are
>
> Platform device is a Linux term...
>
>> special since the OS needs to know where it can allocate early in boot
>> and I don't see non-volatile memory as being similarly significant.
>> Fundamentally an NVDIMM is just a memory mapped storage device so we
>> should be able to defer looking at them until later in boot.
>
> It's not clear if 'platform' is just an example or random name or what
> the node is required to be called. In the latter case, we should be
> much more specific because 'platform' could be anything. In the former
> case, then we have no way to find or validate the node because the
> name could be anything and there's no compatible property either.

Sorry, the platform node is just there as an example. I'll remove it.

> "region" is pretty generic too.

It is, but I didn't see a compelling reason to call it something else.

>> That said you might have problems with XIP kernels and what not. I
>> think that problem is better solved through other means though.

>>>> +	region@5000 {
>>>> +		compatible = "nvdimm-region";
>>>> +		reg = <0x0001 0x
Re: [PATCH v8 22/24] mm: Speculative page fault handler return VMA
2018-03-29 10:26 GMT+08:00 Ganesh Mahendran: > Hi, Laurent > > 2018-02-16 23:25 GMT+08:00 Laurent Dufour : >> When the speculative page fault handler is returning VM_RETRY, there is a >> chance that VMA fetched without grabbing the mmap_sem can be reused by the >> legacy page fault handler. By reusing it, we avoid calling find_vma() >> again. To achieve, that we must ensure that the VMA structure will not be >> freed in our back. This is done by getting the reference on it (get_vma()) >> and by assuming that the caller will call the new service >> can_reuse_spf_vma() once it has grabbed the mmap_sem. >> >> can_reuse_spf_vma() is first checking that the VMA is still in the RB tree >> , and then that the VMA's boundaries matched the passed address and release >> the reference on the VMA so that it can be freed if needed. >> >> In the case the VMA is freed, can_reuse_spf_vma() will have returned false >> as the VMA is no more in the RB tree. > > when I applied this patch to arm64, I got a crash: > > [6.088296] Unable to handle kernel NULL pointer dereference at > virtual address > [6.088307] pgd = ff9d67735000 > [6.088313] [] *pgd=0001795e3003, > *pud=0001795e3003, *pmd= > [6.088372] [ cut here ] > [6.088377] Kernel BUG at ff9d64f65960 [verbose debug info unavailable] > [6.088384] Internal error: Oops - BUG: 9645 [#1] PREEMPT SMP > [6.088389] BUG: Bad rss-counter state mm:ffe8f3861040 idx:0 val:90 > [6.088393] BUG: Bad rss-counter state mm:ffe8f3861040 idx:1 val:58 > [6.088398] Modules linked in: > [6.088408] CPU: 1 PID: 621 Comm: qseecomd Not tainted 4.4.78-perf+ #88 > [6.088413] Hardware name: Qualcomm Technologies, Inc. 
SDM 636 > PM660 + PM660L MTP E7S (DT) > [6.088419] task: ffe8f6208000 ti: ffe872a8c000 task.ti: > ffe872a8c000 > [6.088432] PC is at __rb_erase_color+0x108/0x240 > [6.088441] LR is at vma_interval_tree_remove+0x244/0x24c > [6.088447] pc : [] lr : [] > pstate: 604001c5 > [6.088451] sp : ffe872a8fa50 > [6.088455] x29: ffe872a8fa50 x28: 0008 > [6.088462] x27: 0009 x26: > [6.088470] x25: ffe8f458fb80 x24: 00768ff87000 > [6.088477] x23: x22: > [6.088484] x21: ff9d64d9be7c x20: ffe8f3ff0680 > [6.088492] x19: ffe8f212e9b0 x18: 0074 > [6.088499] x17: 0007 x16: 000e > [6.088507] x15: ff9d65c88000 x14: 0001 > [6.088514] x13: 00192d76 x12: 00989680 > [6.088521] x11: 001f x10: ff9d661ded1b > [6.088528] x9 : 007691759000 x8 : 07691759 > [6.088535] x7 : x6 : ffe871ebada8 > [6.088541] x5 : 00e1 x4 : ffe8f212e958 > [6.088548] x3 : 00e9 x2 : > [6.088555] x1 : ffe8f212f110 x0 : ffe8f212e9b1 > [6.088564] > [6.088564] PC: 0xff9d64f65920: > [6.088568] 5920 f902 aa0103e0 aa1603e1 d63f02a0 aa1603e1 > f9400822 f9000662 f9000833 > [6.088590] 5940 143b f9400a61 f9400020 370002c0 f9400436 > b2400260 f9000a76 f9000433 > [6.088610] 5960 f90002c0 f9400260 f920 f9000261 f27ef400 > 54000100 f9400802 eb13005f > [6.088630] 5980 5461 f9000801 1404 f9000401 1402 > f9000281 aa1303e0 d63f02a0 > [6.088652] > [6.088652] LR: 0xff9d64d9c298: > [6.088656] c298 f9403083 b483 f9400c63 eb03005f 9a832042 > f9403883 eb02007f 54a0 > [6.088676] c2b8 f9003882 f9402c82 927ef442 b5fffd22 b480 > f0e2 9139f042 94072561 > [6.088695] c2d8 a8c17bfd d65f03c0 a9bf7bfd 910003fd f943 > d280 b4e3 f9400c65 > [6.088715] c2f8 d1016063 eb0100bf 5463 aa0303e0 97fffef2 > a8c17bfd d65f03c0 a9bf7bfd > [6.088735] > [6.088735] SP: 0xffe872a8fa10: > [6.088740] fa10 64d9c2d8 ff9d 72a8fa50 ffe8 64f65960 > ff9d 604001c5 > [6.088759] fa30 71d67d70 ffe8 71c281e8 ffe8 > 0080 64daa90c ff9d > [6.088779] fa50 72a8fa90 ffe8 64d9c2d8 ff9d 71ebada8 > ffe8 f3ff0678 ffe8 > [6.088799] fa70 72a8fb80 ffe8 > 0001 > [6.088818] > [6.088823] Process qseecomd 
(pid: 621, stack limit = 0xffe872a8c028) > [6.088828] Call trace: > [6.088834] Exception stack(0xffe872a8f860 to 0xffe872a8f990) > [6.088841] f860: ffe8f212e9b0 0080 > 82b37000 ff9d64f65960 > [6.088848] f880: 604001c5 ff9d672c8680 > ff9d672c9c00 ff9d672d3ab7 > [6.088855] f8a0: ffe872a8f8f0 ff9d64db9bfc > ffe8f9402c00 > [6.088861] f8c0: ffe872a8c000 > ffe872a8f920 ff9d64db9bfc >
Re: [PATCH v8 22/24] mm: Speculative page fault handler return VMA
Hi, Laurent 2018-02-16 23:25 GMT+08:00 Laurent Dufour: > When the speculative page fault handler is returning VM_RETRY, there is a > chance that VMA fetched without grabbing the mmap_sem can be reused by the > legacy page fault handler. By reusing it, we avoid calling find_vma() > again. To achieve, that we must ensure that the VMA structure will not be > freed in our back. This is done by getting the reference on it (get_vma()) > and by assuming that the caller will call the new service > can_reuse_spf_vma() once it has grabbed the mmap_sem. > > can_reuse_spf_vma() is first checking that the VMA is still in the RB tree > , and then that the VMA's boundaries matched the passed address and release > the reference on the VMA so that it can be freed if needed. > > In the case the VMA is freed, can_reuse_spf_vma() will have returned false > as the VMA is no more in the RB tree. when I applied this patch to arm64, I got a crash: [6.088296] Unable to handle kernel NULL pointer dereference at virtual address [6.088307] pgd = ff9d67735000 [6.088313] [] *pgd=0001795e3003, *pud=0001795e3003, *pmd= [6.088372] [ cut here ] [6.088377] Kernel BUG at ff9d64f65960 [verbose debug info unavailable] [6.088384] Internal error: Oops - BUG: 9645 [#1] PREEMPT SMP [6.088389] BUG: Bad rss-counter state mm:ffe8f3861040 idx:0 val:90 [6.088393] BUG: Bad rss-counter state mm:ffe8f3861040 idx:1 val:58 [6.088398] Modules linked in: [6.088408] CPU: 1 PID: 621 Comm: qseecomd Not tainted 4.4.78-perf+ #88 [6.088413] Hardware name: Qualcomm Technologies, Inc. 
SDM 636 PM660 + PM660L MTP E7S (DT) [6.088419] task: ffe8f6208000 ti: ffe872a8c000 task.ti: ffe872a8c000 [6.088432] PC is at __rb_erase_color+0x108/0x240 [6.088441] LR is at vma_interval_tree_remove+0x244/0x24c [6.088447] pc : [] lr : [] pstate: 604001c5 [6.088451] sp : ffe872a8fa50 [6.088455] x29: ffe872a8fa50 x28: 0008 [6.088462] x27: 0009 x26: [6.088470] x25: ffe8f458fb80 x24: 00768ff87000 [6.088477] x23: x22: [6.088484] x21: ff9d64d9be7c x20: ffe8f3ff0680 [6.088492] x19: ffe8f212e9b0 x18: 0074 [6.088499] x17: 0007 x16: 000e [6.088507] x15: ff9d65c88000 x14: 0001 [6.088514] x13: 00192d76 x12: 00989680 [6.088521] x11: 001f x10: ff9d661ded1b [6.088528] x9 : 007691759000 x8 : 07691759 [6.088535] x7 : x6 : ffe871ebada8 [6.088541] x5 : 00e1 x4 : ffe8f212e958 [6.088548] x3 : 00e9 x2 : [6.088555] x1 : ffe8f212f110 x0 : ffe8f212e9b1 [6.088564] [6.088564] PC: 0xff9d64f65920: [6.088568] 5920 f902 aa0103e0 aa1603e1 d63f02a0 aa1603e1 f9400822 f9000662 f9000833 [6.088590] 5940 143b f9400a61 f9400020 370002c0 f9400436 b2400260 f9000a76 f9000433 [6.088610] 5960 f90002c0 f9400260 f920 f9000261 f27ef400 54000100 f9400802 eb13005f [6.088630] 5980 5461 f9000801 1404 f9000401 1402 f9000281 aa1303e0 d63f02a0 [6.088652] [6.088652] LR: 0xff9d64d9c298: [6.088656] c298 f9403083 b483 f9400c63 eb03005f 9a832042 f9403883 eb02007f 54a0 [6.088676] c2b8 f9003882 f9402c82 927ef442 b5fffd22 b480 f0e2 9139f042 94072561 [6.088695] c2d8 a8c17bfd d65f03c0 a9bf7bfd 910003fd f943 d280 b4e3 f9400c65 [6.088715] c2f8 d1016063 eb0100bf 5463 aa0303e0 97fffef2 a8c17bfd d65f03c0 a9bf7bfd [6.088735] [6.088735] SP: 0xffe872a8fa10: [6.088740] fa10 64d9c2d8 ff9d 72a8fa50 ffe8 64f65960 ff9d 604001c5 [6.088759] fa30 71d67d70 ffe8 71c281e8 ffe8 0080 64daa90c ff9d [6.088779] fa50 72a8fa90 ffe8 64d9c2d8 ff9d 71ebada8 ffe8 f3ff0678 ffe8 [6.088799] fa70 72a8fb80 ffe8 0001 [6.088818] [6.088823] Process qseecomd (pid: 621, stack limit = 0xffe872a8c028) [6.088828] Call trace: [6.088834] Exception stack(0xffe872a8f860 to 
0xffe872a8f990) [6.088841] f860: ffe8f212e9b0 0080 82b37000 ff9d64f65960 [6.088848] f880: 604001c5 ff9d672c8680 ff9d672c9c00 ff9d672d3ab7 [6.088855] f8a0: ffe872a8f8f0 ff9d64db9bfc ffe8f9402c00 [6.088861] f8c0: ffe872a8c000 ffe872a8f920 ff9d64db9bfc [6.088867] f8e0: ffe8f9402b00 ffe872a8fa10 ff9d64dba568 [6.088874] f900: ffbe61c759c0 ffe871d67d70 ffe8f9402c00 1de56fb006cba396 [6.01] f920: ffe8f212e9b1 ffe8f212f110
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Wed, Mar 28, 2018 at 09:55:07AM -0700, Kees Cook wrote:
>On Wed, Mar 28, 2018 at 8:26 AM, Shea Levy wrote:
>> Now only those architectures that have custom initrd free requirements
>> need to define free_initrd_mem.
>>
>> Signed-off-by: Shea Levy
>
>Yay consolidation! :)
>
>> --- a/usr/Kconfig
>> +++ b/usr/Kconfig
>> @@ -233,3 +233,7 @@ config INITRAMFS_COMPRESSION
>>  	default ".lzma" if RD_LZMA
>>  	default ".bz2"  if RD_BZIP2
>>  	default ""
>> +
>> +config HAVE_ARCH_FREE_INITRD_MEM
>> +	bool
>> +	default n
>
>If you keep the Kconfig, you can leave off "default n", and I'd
>suggest adding a help section just to describe what the per-arch
>responsibilities are when select-ing the config. (See
>HAVE_ARCH_SECCOMP_FILTER for an example.)

One question about this change. The original code would "select"
HAVE_ARCH_FREE_INITRD_MEM on those archs. After this change, we need to
manually "select" this?

>-Kees
>
>-- 
>Kees Cook
>Pixel Security

-- 
Wei Yang
Help you, Help me
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Thu, 29 Mar 2018 09:37:52 +1100 Oliver wrote:

> On Thu, Mar 29, 2018 at 9:14 AM, Russell King - ARM Linux wrote:
> > On Wed, Mar 28, 2018 at 02:04:22PM -0500, Rob Landley wrote:
> >> On 03/28/2018 11:48 AM, Russell King - ARM Linux wrote:
> >> > On Wed, Mar 28, 2018 at 10:58:51AM -0500, Rob Landley wrote:
> >> >> On 03/28/2018 10:26 AM, Shea Levy wrote:
> >> >>> Now only those architectures that have custom initrd free requirements
> >> >>> need to define free_initrd_mem.
> >> >> ...
> >> >>> --- a/arch/arc/mm/init.c
> >> >>> +++ b/arch/arc/mm/init.c
> >> >>> @@ -229,10 +229,3 @@ void __ref free_initmem(void)
> >> >>>  {
> >> >>>  	free_initmem_default(-1);
> >> >>>  }
> >> >>> -
> >> >>> -#ifdef CONFIG_BLK_DEV_INITRD
> >> >>> -void __init free_initrd_mem(unsigned long start, unsigned long end)
> >> >>> -{
> >> >>> -	free_reserved_area((void *)start, (void *)end, -1, "initrd");
> >> >>> -}
> >> >>> -#endif
> >> >>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> >> >>> index 3f972e83909b..19d1c5594e2d 100644
> >> >>> --- a/arch/arm/Kconfig
> >> >>> +++ b/arch/arm/Kconfig
> >> >>> @@ -47,6 +47,7 @@ config ARM
> >> >>>  	select HARDIRQS_SW_RESEND
> >> >>>  	select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT)
> >> >>>  	select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6
> >> >>> +	select HAVE_ARCH_FREE_INITRD_MEM
> >> >>>  	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
> >> >>>  	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
> >> >>>  	select HAVE_ARCH_MMAP_RND_BITS if MMU
> >> >>
> >> >> Isn't this why weak symbols were invented?
> >> >
> >> > Weak symbols means that we end up with both the weakly-referenced code
> >> > and the arch code in the kernel image. That's fine if the weak code
> >> > is small.
> >>
> >> The kernel's been able to build with link time garbage collection since 2016:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b67067f1176d
> >>
> >> Wouldn't that remove the unused one?
> >
> > Probably, if anyone bothered to use that, which they don't.
> >
> > LD_DEAD_CODE_DATA_ELIMINATION is a symbol without a prompt, and from
> > what I can see, nothing selects it. Therefore, the symbol is always
> > disabled, and so the feature never gets used in mainline kernels.
> >
> > Brings up the obvious question - why is it there if it's completely
> > unused? (Maybe to cause confusion, and allowing a justification
> > for __weak ?)

Well, weak symbols have been used long before it was added.

> IIRC Nick had some patches to do the arch enablement for powerpc, but
> I'm not sure what happened to them though. I suspect it just fell down
> Nick's ever growing TODO list.

Yeah, I had started some patches for powerpc and x86 that have ended up
on the back burner. There's been some MIPS people playing with it too.

For the kernel, LD_DEAD_CODE_DATA_ELIMINATION is not great. It can save
a little, but you get issues like any exception table entry or bug table
entry in a function will create a reference back to the function, so the
linker can't trim it away even if nothing else references it.

I'll try to take another look at it within the next few months and
remove it if I can't make progress. Nicolas Pitre has been doing some
much better work on dead code using real LTO.

Thanks,
Nick
[PATCH] macintosh/adb: Use C99 initializers for struct adb_driver instances
No change to object files. Cc: Benjamin HerrenschmidtSigned-off-by: Finn Thain --- drivers/macintosh/adb-iop.c| 14 +++--- drivers/macintosh/macio-adb.c | 15 +++ drivers/macintosh/via-macii.c | 14 +++--- drivers/macintosh/via-pmu.c| 14 +++--- drivers/macintosh/via-pmu68k.c | 14 +++--- 5 files changed, 35 insertions(+), 36 deletions(-) diff --git a/drivers/macintosh/adb-iop.c b/drivers/macintosh/adb-iop.c index 15db69d8ba69..ca623e6446e4 100644 --- a/drivers/macintosh/adb-iop.c +++ b/drivers/macintosh/adb-iop.c @@ -53,13 +53,13 @@ static void adb_iop_poll(void); static int adb_iop_reset_bus(void); struct adb_driver adb_iop_driver = { - "ISM IOP", - adb_iop_probe, - adb_iop_init, - adb_iop_send_request, - adb_iop_autopoll, - adb_iop_poll, - adb_iop_reset_bus + .name = "ISM IOP", + .probe= adb_iop_probe, + .init = adb_iop_init, + .send_request = adb_iop_send_request, + .autopoll = adb_iop_autopoll, + .poll = adb_iop_poll, + .reset_bus= adb_iop_reset_bus }; static void adb_iop_end_req(struct adb_request *req, int state) diff --git a/drivers/macintosh/macio-adb.c b/drivers/macintosh/macio-adb.c index 9a6223add30e..eb3adfb7f88d 100644 --- a/drivers/macintosh/macio-adb.c +++ b/drivers/macintosh/macio-adb.c @@ -70,14 +70,13 @@ static void macio_adb_poll(void); static int macio_adb_reset_bus(void); struct adb_driver macio_adb_driver = { - "MACIO", - macio_probe, - macio_init, - macio_send_request, - /*macio_write,*/ - macio_adb_autopoll, - macio_adb_poll, - macio_adb_reset_bus + .name = "MACIO", + .probe= macio_probe, + .init = macio_init, + .send_request = macio_send_request, + .autopoll = macio_adb_autopoll, + .poll = macio_adb_poll, + .reset_bus= macio_adb_reset_bus, }; int macio_probe(void) diff --git a/drivers/macintosh/via-macii.c b/drivers/macintosh/via-macii.c index 4ba06a1695ea..cf6f7d52d6be 100644 --- a/drivers/macintosh/via-macii.c +++ b/drivers/macintosh/via-macii.c @@ -91,13 +91,13 @@ static void macii_poll(void); static int macii_reset_bus(void); struct 
adb_driver via_macii_driver = { - "Mac II", - macii_probe, - macii_init, - macii_send_request, - macii_autopoll, - macii_poll, - macii_reset_bus + .name = "Mac II", + .probe= macii_probe, + .init = macii_init, + .send_request = macii_send_request, + .autopoll = macii_autopoll, + .poll = macii_poll, + .reset_bus= macii_reset_bus, }; static enum macii_state { diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c index c4c2b3b85ebc..8da474899d79 100644 --- a/drivers/macintosh/via-pmu.c +++ b/drivers/macintosh/via-pmu.c @@ -199,13 +199,13 @@ static const struct file_operations pmu_options_proc_fops; #ifdef CONFIG_ADB struct adb_driver via_pmu_driver = { - "PMU", - pmu_probe, - pmu_init, - pmu_send_request, - pmu_adb_autopoll, - pmu_poll_adb, - pmu_adb_reset_bus + .name = "PMU", + .probe= pmu_probe, + .init = pmu_init, + .send_request = pmu_send_request, + .autopoll = pmu_adb_autopoll, + .poll = pmu_poll_adb, + .reset_bus= pmu_adb_reset_bus, }; #endif /* CONFIG_ADB */ diff --git a/drivers/macintosh/via-pmu68k.c b/drivers/macintosh/via-pmu68k.c index 7d9c4baf8c11..d545ed45e482 100644 --- a/drivers/macintosh/via-pmu68k.c +++ b/drivers/macintosh/via-pmu68k.c @@ -120,13 +120,13 @@ static void pmu_enable_backlight(int on); static void pmu_set_brightness(int level); struct adb_driver via_pmu_driver = { - "68K PMU", - pmu_probe, - pmu_init, - pmu_send_request, - pmu_autopoll, - pmu_poll, - pmu_reset_bus + .name = "68K PMU", + .probe= pmu_probe, + .init = pmu_init, + .send_request = pmu_send_request, + .autopoll = pmu_autopoll, + .poll = pmu_poll, + .reset_bus= pmu_reset_bus, }; /*
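For readers unfamiliar with the conversion above: C99 designated initializers bind each value to a field by name, so the initializer keeps working if fields are added or reordered, which positional initializers do not, and omitted members are zero-initialized. The struct below is a cut-down stand-in for `struct adb_driver`, not the real definition.

```c
#include <string.h>

/* Simplified stand-in for struct adb_driver. */
struct demo_driver {
	const char *name;
	int (*probe)(void);
	int (*init)(void);
};

static int demo_probe(void)
{
	return 1;
}

/* Positional (pre-patch style): order must match the struct exactly,
 * and inserting a new field silently shifts every later value. */
static struct demo_driver positional = { "demo", demo_probe, NULL };

/* Designated (post-patch style): each value is tied to its field by
 * name; .init is omitted and therefore NULL. */
static struct demo_driver designated = {
	.name  = "demo",
	.probe = demo_probe,
};

int drivers_equivalent(void)
{
	return strcmp(positional.name, designated.name) == 0 &&
	       positional.probe == designated.probe &&
	       positional.init == designated.init;
}
```

Both forms produce identical objects here, which is why the patch changes no object files; the designated form is simply safer against future struct changes.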
Re: [PATCH v12 07/22] selftests/vm: fixed bugs in pkey_disable_clear()
Dave Hansen writes: > On 03/28/2018 01:47 PM, Thiago Jung Bauermann wrote: if (flags) - assert(rdpkey_reg() > orig_pkey_reg); + assert(rdpkey_reg() < orig_pkey_reg); } void pkey_write_allow(int pkey) >>> This seems so horribly wrong that I wonder how it worked in the first >>> place. Any idea? >> The code simply wasn't used. pkey_disable_clear() is called by >> pkey_write_allow() and pkey_access_allow(), but before this patch series >> nothing called either of these functions. >> > > Ahh, that explains it. Can that get stuck in the changelog, please? Yes, will be in the next version. -- Thiago Jung Bauermann IBM Linux Technology Center
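The corrected assertion encodes a simple invariant: pkey_disable_clear() only clears per-key disable bits, so the register value can only go down, never up. A toy model of that invariant (names and the two-bits-per-key layout follow the x86 PKRU convention for illustration; this is not the selftest's actual code):

```c
#define PKEY_DISABLE_ACCESS	0x1u
#define PKEY_DISABLE_WRITE	0x2u

/* Stand-in for the hardware pkey register read by rdpkey_reg(). */
unsigned int pkey_reg = 0xffffffffu;	/* start with every key disabled */

/* Clear the given disable flags for one pkey.  Since bits are only
 * ever cleared, the new value is always <= the old one, which is
 * what the fixed assert checks. */
unsigned int pkey_disable_clear(int pkey, unsigned int flags)
{
	pkey_reg &= ~(flags << (2 * pkey));
	return pkey_reg;
}
```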
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Thu, Mar 29, 2018 at 9:14 AM, Russell King - ARM Linuxwrote: > On Wed, Mar 28, 2018 at 02:04:22PM -0500, Rob Landley wrote: >> >> >> On 03/28/2018 11:48 AM, Russell King - ARM Linux wrote: >> > On Wed, Mar 28, 2018 at 10:58:51AM -0500, Rob Landley wrote: >> >> On 03/28/2018 10:26 AM, Shea Levy wrote: >> >>> Now only those architectures that have custom initrd free requirements >> >>> need to define free_initrd_mem. >> >> ... >> >>> --- a/arch/arc/mm/init.c >> >>> +++ b/arch/arc/mm/init.c >> >>> @@ -229,10 +229,3 @@ void __ref free_initmem(void) >> >>> { >> >>> free_initmem_default(-1); >> >>> } >> >>> - >> >>> -#ifdef CONFIG_BLK_DEV_INITRD >> >>> -void __init free_initrd_mem(unsigned long start, unsigned long end) >> >>> -{ >> >>> - free_reserved_area((void *)start, (void *)end, -1, "initrd"); >> >>> -} >> >>> -#endif >> >>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig >> >>> index 3f972e83909b..19d1c5594e2d 100644 >> >>> --- a/arch/arm/Kconfig >> >>> +++ b/arch/arm/Kconfig >> >>> @@ -47,6 +47,7 @@ config ARM >> >>> select HARDIRQS_SW_RESEND >> >>> select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) >> >>> select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 >> >>> + select HAVE_ARCH_FREE_INITRD_MEM >> >>> select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU >> >>> select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU >> >>> select HAVE_ARCH_MMAP_RND_BITS if MMU >> >> >> >> Isn't this why weak symbols were invented? >> > >> > Weak symbols means that we end up with both the weakly-referenced code >> > and the arch code in the kernel image. That's fine if the weak code >> > is small. >> >> The kernel's been able to build with link time garbage collection since 2016: >> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b67067f1176d >> >> Wouldn't that remove the unused one? > > Probably, if anyone bothered to use that, which they don't. 
> > LD_DEAD_CODE_DATA_ELIMINATION is a symbol without a prompt, and from > what I can see, nothing selects it. Therefore, the symbol is always > disabled, and so the feature never gets used in mainline kernels. > > Brings up the obvious question - why is it there if it's completely > unused? (Maybe to cause confusion, and allowing a justification > for __weak ?) IIRC Nick had some patches to do the arch enablement for powerpc, but I'm not sure what happened to them though. I suspect it just fell down Nick's ever growing TODO list.
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Wed, Mar 28, 2018 at 02:04:22PM -0500, Rob Landley wrote: > > > On 03/28/2018 11:48 AM, Russell King - ARM Linux wrote: > > On Wed, Mar 28, 2018 at 10:58:51AM -0500, Rob Landley wrote: > >> On 03/28/2018 10:26 AM, Shea Levy wrote: > >>> Now only those architectures that have custom initrd free requirements > >>> need to define free_initrd_mem. > >> ... > >>> --- a/arch/arc/mm/init.c > >>> +++ b/arch/arc/mm/init.c > >>> @@ -229,10 +229,3 @@ void __ref free_initmem(void) > >>> { > >>> free_initmem_default(-1); > >>> } > >>> - > >>> -#ifdef CONFIG_BLK_DEV_INITRD > >>> -void __init free_initrd_mem(unsigned long start, unsigned long end) > >>> -{ > >>> - free_reserved_area((void *)start, (void *)end, -1, "initrd"); > >>> -} > >>> -#endif > >>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig > >>> index 3f972e83909b..19d1c5594e2d 100644 > >>> --- a/arch/arm/Kconfig > >>> +++ b/arch/arm/Kconfig > >>> @@ -47,6 +47,7 @@ config ARM > >>> select HARDIRQS_SW_RESEND > >>> select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) > >>> select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 > >>> + select HAVE_ARCH_FREE_INITRD_MEM > >>> select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU > >>> select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU > >>> select HAVE_ARCH_MMAP_RND_BITS if MMU > >> > >> Isn't this why weak symbols were invented? > > > > Weak symbols means that we end up with both the weakly-referenced code > > and the arch code in the kernel image. That's fine if the weak code > > is small. > > The kernel's been able to build with link time garbage collection since 2016: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b67067f1176d > > Wouldn't that remove the unused one? Probably, if anyone bothered to use that, which they don't. LD_DEAD_CODE_DATA_ELIMINATION is a symbol without a prompt, and from what I can see, nothing selects it. 
Therefore, the symbol is always disabled, and so the feature never gets used in mainline kernels. Brings up the obvious question - why is it there if it's completely unused? (Maybe to cause confusion, and allowing a justification for __weak ?) -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up
Re: [RFC PATCH v2 0/2] Randomization of address chosen by mmap.
> On 28 Mar 2018, at 02:49, Matthew Wilcox wrote: > > On Tue, Mar 27, 2018 at 03:53:53PM -0700, Kees Cook wrote: >> I agree: pushing this off to libc leaves a lot of things unprotected. >> I think this should live in the kernel. The question I have is about >> making it maintainable/readable/etc. >> >> The state-of-the-art for ASLR is moving to finer granularity (over >> just base-address offset), so I'd really like to see this supported in >> the kernel. We'll be getting there for other things in the future, and >> I'd like to have a working production example for researchers to >> study, etc. > > One thing we need is to limit the fragmentation of this approach. > Even on 64-bit systems, we can easily get into a situation where there isn't > space to map a contiguous terabyte. As I wrote before, shift_random was introduced to act as the fragmentation limit. Even without it, the main question here is: 'if we can't allocate N bytes, how many bytes have we already allocated?'. From that point of view, I already showed in the previous version of the patch that if an application uses modest allocations, it will have enough memory. If it uses tens of gigabytes or terabytes of memory, that application has every chance of being exploited with or without full randomization, since it is much easier to find (or guess) a usable pointer. For instance, with only 128 terabytes of address space for user space, the probability of exploiting such an application is on the order of 1/128, which is not secure at all. This is a very rough estimate, but I am trying to make things easier to understand. Best regards, Ilya
RE: [RFC PATCH v2 0/2] Randomization of address chosen by mmap.
> The default limit of only 65536 VMAs will also quickly come into play > if consecutive anon mmaps don't get merged. Of course this can be > raised, but it has significant resource and performance (fork) costs. Could the random mmap address chooser look for how many existing VMAs have space before/after and the right attributes to merge with the new one you want to create? If this is above some threshold (100?) then pick one of them randomly and allocate the new address so that it will merge from below/above with an existing one. That should still give you a very high degree of randomness, but prevent out of control numbers of VMAs from being created. -Tony
Re: [PATCH] Extract initrd free logic from arch-specific code.
On 03/28/2018 11:48 AM, Russell King - ARM Linux wrote: > On Wed, Mar 28, 2018 at 10:58:51AM -0500, Rob Landley wrote: >> On 03/28/2018 10:26 AM, Shea Levy wrote: >>> Now only those architectures that have custom initrd free requirements >>> need to define free_initrd_mem. >> ... >>> --- a/arch/arc/mm/init.c >>> +++ b/arch/arc/mm/init.c >>> @@ -229,10 +229,3 @@ void __ref free_initmem(void) >>> { >>> free_initmem_default(-1); >>> } >>> - >>> -#ifdef CONFIG_BLK_DEV_INITRD >>> -void __init free_initrd_mem(unsigned long start, unsigned long end) >>> -{ >>> - free_reserved_area((void *)start, (void *)end, -1, "initrd"); >>> -} >>> -#endif >>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig >>> index 3f972e83909b..19d1c5594e2d 100644 >>> --- a/arch/arm/Kconfig >>> +++ b/arch/arm/Kconfig >>> @@ -47,6 +47,7 @@ config ARM >>> select HARDIRQS_SW_RESEND >>> select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) >>> select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 >>> + select HAVE_ARCH_FREE_INITRD_MEM >>> select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU >>> select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU >>> select HAVE_ARCH_MMAP_RND_BITS if MMU >> >> Isn't this why weak symbols were invented? > > Weak symbols means that we end up with both the weakly-referenced code > and the arch code in the kernel image. That's fine if the weak code > is small. The kernel's been able to build with link time garbage collection since 2016: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b67067f1176d Wouldn't that remove the unused one? Rob
Re: [RFC PATCH v2 0/2] Randomization of address chosen by mmap.
> On 28 Mar 2018, at 01:16, Theodore Y. Ts'o wrote: > > On Tue, Mar 27, 2018 at 04:51:08PM +0300, Ilya Smith wrote: >>> /dev/[u]random is not sufficient? >> >> Using /dev/[u]random makes 3 syscalls - open, read, close. This is a >> performance >> issue. > > You may want to take a look at the getrandom(2) system call, which is > the recommended way of getting secure random numbers from the kernel. > >>> Well, I am pretty sure userspace can implement proper free ranges >>> tracking… >> >> I think we need to know what libc developers will say on implementing ASLR >> in >> user-mode. I am pretty sure they will say 'never' or 'some-day'. And the >> problem >> of ASLR will stay forever. > > Why can't you send patches to the libc developers? > > Regards, > > - Ted I still believe the issue is on the kernel side, not in the library. Best regards, Ilya
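For reference, the getrandom(2) call Ted mentions replaces the open/read/close triple with a single syscall; glibc has exposed a wrapper for it since version 2.25. A minimal sketch:

```c
#include <stdint.h>
#include <sys/random.h>

/* Fill a 64-bit value from the kernel's urandom pool in one syscall;
 * loop to handle short reads or EINTR. */
uint64_t get_random_u64(void)
{
	uint64_t r = 0;

	while (getrandom(&r, sizeof(r), 0) != (ssize_t)sizeof(r))
		;
	return r;
}
```

With flags = 0 this reads from the urandom pool and blocks only until the pool is initialized, so it avoids both the syscall overhead Ilya objects to and the early-boot weakness of /dev/urandom.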
Re: [RFC PATCH v2 0/2] Randomization of address chosen by mmap.
> On 27 Mar 2018, at 17:38, Michal Hockowrote: > > On Tue 27-03-18 16:51:08, Ilya Smith wrote: >> >>> On 27 Mar 2018, at 10:24, Michal Hocko wrote: >>> >>> On Mon 26-03-18 22:45:31, Ilya Smith wrote: > On 26 Mar 2018, at 11:46, Michal Hocko wrote: > > On Fri 23-03-18 20:55:49, Ilya Smith wrote: >> >>> On 23 Mar 2018, at 15:48, Matthew Wilcox wrote: >>> >>> On Thu, Mar 22, 2018 at 07:36:36PM +0300, Ilya Smith wrote: Current implementation doesn't randomize address returned by mmap. All the entropy ends with choosing mmap_base_addr at the process creation. After that mmap build very predictable layout of address space. It allows to bypass ASLR in many cases. This patch make randomization of address on any mmap call. >>> >>> Why should this be done in the kernel rather than libc? libc is >>> perfectly >>> capable of specifying random numbers in the first argument of mmap. >> Well, there is following reasons: >> 1. It should be done in any libc implementation, what is not possible >> IMO; > > Is this really so helpful? Yes, ASLR is one of very important mitigation techniques which are really used to protect applications. If there is no ASLR, it is very easy to exploit vulnerable application and compromise the system. We can’t just fix all the vulnerabilities right now, thats why we have mitigations - techniques which are makes exploitation more hard or impossible in some cases. Thats why it is helpful. >>> >>> I am not questioning ASLR in general. I am asking whether we really need >>> per mmap ASLR in general. I can imagine that some environments want to >>> pay the additional price and other side effects, but considering this >>> can be achieved by libc, why to add more code to the kernel? >> >> I believe this is the only one right place for it. Adding these 200+ lines >> of >> code we give this feature for any user - on desktop, on server, on IoT >> device, >> on SCADA, etc. 
But if only glibc implements 'user-mode ASLR', IoT and >> SCADA >> devices will never get it. > I guess it would really help if you could be more specific about the > class of security issues this would help to mitigate. My first > understanding was that we need some randomization between program > executable segments to reduce the attack space when a single address > leaks and you know the segments layout (ordering). But why do we need > _all_ mmaps to be randomized? Because that complicates the > implementation considerably, for the different reasons you have mentioned > earlier. > There are the following reasons: 1) To protect the layout if one region has leaked (as you said). 2) To protect against exploitation of out-of-bounds vulnerabilities in some cases (CWE-125, CWE-787). 3) To protect against exploitation of buffer overflows in some cases (CWE-120). 4) To protect the application in cases when the attacker needs to guess an address (see the ASLR-NG paper by Hector Marco-Gisbert and Ismael Ripoll-Ripoll). And maybe more cases. > Do you have any specific CVE that would be mitigated by this > randomization approach? > I am sorry, I am not a security expert to see all the consequences, but a > vague 'the more randomization the better' sounds rather weak to me. It is hard to name a concrete CVE number, sorry. Mitigations are made to prevent exploitation, not to fix vulnerabilities; a good mitigation will make a vulnerable application crash rather than be compromised in most cases. This means the better the randomization, the lower the successful exploitation rate. Thanks, Ilya
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Wed, Mar 28, 2018 at 10:58:51AM -0500, Rob Landley wrote: > On 03/28/2018 10:26 AM, Shea Levy wrote: > > Now only those architectures that have custom initrd free requirements > > need to define free_initrd_mem. > ... > > --- a/arch/arc/mm/init.c > > +++ b/arch/arc/mm/init.c > > @@ -229,10 +229,3 @@ void __ref free_initmem(void) > > { > > free_initmem_default(-1); > > } > > - > > -#ifdef CONFIG_BLK_DEV_INITRD > > -void __init free_initrd_mem(unsigned long start, unsigned long end) > > -{ > > - free_reserved_area((void *)start, (void *)end, -1, "initrd"); > > -} > > -#endif > > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig > > index 3f972e83909b..19d1c5594e2d 100644 > > --- a/arch/arm/Kconfig > > +++ b/arch/arm/Kconfig > > @@ -47,6 +47,7 @@ config ARM > > select HARDIRQS_SW_RESEND > > select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) > > select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 > > + select HAVE_ARCH_FREE_INITRD_MEM > > select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU > > select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU > > select HAVE_ARCH_MMAP_RND_BITS if MMU > > Isn't this why weak symbols were invented? Weak symbols means that we end up with both the weakly-referenced code and the arch code in the kernel image. That's fine if the weak code is small. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up
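For comparison, the weak-symbol alternative Rob is alluding to would look roughly like this: a generic default definition that any architecture can override simply by providing a strong definition of the same symbol. As Russell notes, the cost is that an overridden weak body still occupies space in the image unless something discards it. Function names here are illustrative.

```c
#include <stdio.h>

/* Generic default: the linker uses this whenever no strong definition
 * of the same symbol exists.  An arch overrides it by defining a
 * non-weak free_initrd_mem_demo() in its own object file - but the
 * weak body below still ends up in the final image. */
__attribute__((weak)) int free_initrd_mem_demo(void)
{
	puts("generic initrd free path");
	return 0;
}

int run_initrd_free(void)
{
	/* Resolves to the strong definition if one was linked in,
	 * otherwise to the weak default above. */
	return free_initrd_mem_demo();
}
```

This is the trade-off the thread is weighing: a Kconfig symbol makes the choice explicit at configure time, while __weak makes it implicit at link time at the cost of carrying dead default code.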
Re: [PATCH] Extract initrd free logic from arch-specific code.
Hi Rob, Rob Landley writes: > On 03/28/2018 10:26 AM, Shea Levy wrote: >> Now only those architectures that have custom initrd free requirements >> need to define free_initrd_mem. > ... >> --- a/arch/arc/mm/init.c >> +++ b/arch/arc/mm/init.c >> @@ -229,10 +229,3 @@ void __ref free_initmem(void) >> { >> free_initmem_default(-1); >> } >> - >> -#ifdef CONFIG_BLK_DEV_INITRD >> -void __init free_initrd_mem(unsigned long start, unsigned long end) >> -{ >> -free_reserved_area((void *)start, (void *)end, -1, "initrd"); >> -} >> -#endif >> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig >> index 3f972e83909b..19d1c5594e2d 100644 >> --- a/arch/arm/Kconfig >> +++ b/arch/arm/Kconfig >> @@ -47,6 +47,7 @@ config ARM >> select HARDIRQS_SW_RESEND >> select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) >> select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 >> +select HAVE_ARCH_FREE_INITRD_MEM >> select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU >> select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU >> select HAVE_ARCH_MMAP_RND_BITS if MMU > > Isn't this why weak symbols were invented? > This approach was suggested by Christoph Hellwig upthread, and seems to have some precedent elsewhere (e.g. strncasecmp), but I agree weak symbols seem appropriate here. I'm happy to implement either approach! > > Confused, > > Rob Thanks, Shea
Re: [PATCH] Extract initrd free logic from arch-specific code.
On 03/28/2018 10:26 AM, Shea Levy wrote: > Now only those architectures that have custom initrd free requirements > need to define free_initrd_mem. ... > --- a/arch/arc/mm/init.c > +++ b/arch/arc/mm/init.c > @@ -229,10 +229,3 @@ void __ref free_initmem(void) > { > free_initmem_default(-1); > } > - > -#ifdef CONFIG_BLK_DEV_INITRD > -void __init free_initrd_mem(unsigned long start, unsigned long end) > -{ > - free_reserved_area((void *)start, (void *)end, -1, "initrd"); > -} > -#endif > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig > index 3f972e83909b..19d1c5594e2d 100644 > --- a/arch/arm/Kconfig > +++ b/arch/arm/Kconfig > @@ -47,6 +47,7 @@ config ARM > select HARDIRQS_SW_RESEND > select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) > select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 > + select HAVE_ARCH_FREE_INITRD_MEM > select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU > select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU > select HAVE_ARCH_MMAP_RND_BITS if MMU Isn't this why weak symbols were invented? Confused, Rob
[PATCH] Extract initrd free logic from arch-specific code.
Now only those architectures that have custom initrd free requirements need to define free_initrd_mem. Signed-off-by: Shea Levy--- arch/alpha/mm/init.c | 8 arch/arc/mm/init.c| 7 --- arch/arm/Kconfig | 1 + arch/arm64/Kconfig| 1 + arch/blackfin/Kconfig | 1 + arch/c6x/mm/init.c| 7 --- arch/cris/Kconfig | 1 + arch/frv/mm/init.c| 11 --- arch/h8300/mm/init.c | 7 --- arch/hexagon/Kconfig | 1 + arch/ia64/Kconfig | 1 + arch/m32r/Kconfig | 1 + arch/m32r/mm/init.c | 11 --- arch/m68k/mm/init.c | 7 --- arch/metag/Kconfig| 1 + arch/microblaze/mm/init.c | 7 --- arch/mips/Kconfig | 1 + arch/mn10300/Kconfig | 1 + arch/nios2/mm/init.c | 7 --- arch/openrisc/mm/init.c | 7 --- arch/parisc/mm/init.c | 7 --- arch/powerpc/mm/mem.c | 7 --- arch/riscv/mm/init.c | 6 -- arch/s390/Kconfig | 1 + arch/score/Kconfig| 1 + arch/sh/mm/init.c | 7 --- arch/sparc/Kconfig| 1 + arch/tile/Kconfig | 1 + arch/um/kernel/mem.c | 7 --- arch/unicore32/Kconfig| 1 + arch/x86/Kconfig | 1 + arch/xtensa/Kconfig | 1 + init/initramfs.c | 7 +++ usr/Kconfig | 4 34 files changed, 28 insertions(+), 113 deletions(-) diff --git a/arch/alpha/mm/init.c b/arch/alpha/mm/init.c index 9d74520298ab..55f7c8efa962 100644 --- a/arch/alpha/mm/init.c +++ b/arch/alpha/mm/init.c @@ -291,11 +291,3 @@ free_initmem(void) { free_initmem_default(-1); } - -#ifdef CONFIG_BLK_DEV_INITRD -void -free_initrd_mem(unsigned long start, unsigned long end) -{ - free_reserved_area((void *)start, (void *)end, -1, "initrd"); -} -#endif diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c index ba145065c579..7bcf23ab1756 100644 --- a/arch/arc/mm/init.c +++ b/arch/arc/mm/init.c @@ -229,10 +229,3 @@ void __ref free_initmem(void) { free_initmem_default(-1); } - -#ifdef CONFIG_BLK_DEV_INITRD -void __init free_initrd_mem(unsigned long start, unsigned long end) -{ - free_reserved_area((void *)start, (void *)end, -1, "initrd"); -} -#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 3f972e83909b..19d1c5594e2d 100644 --- a/arch/arm/Kconfig +++ 
b/arch/arm/Kconfig @@ -47,6 +47,7 @@ config ARM select HARDIRQS_SW_RESEND select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 + select HAVE_ARCH_FREE_INITRD_MEM select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_MMAP_RND_BITS if MMU diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index cb03e93f03cf..de93620870af 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -85,6 +85,7 @@ config ARM64 select HAVE_ALIGNED_STRUCT_PAGE if SLUB select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_BITREVERSE + select HAVE_ARCH_FREE_INITRD_MEM select HAVE_ARCH_HUGE_VMAP select HAVE_ARCH_JUMP_LABEL select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48) diff --git a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig index d9c2866ba618..6c6dae9fe894 100644 --- a/arch/blackfin/Kconfig +++ b/arch/blackfin/Kconfig @@ -15,6 +15,7 @@ config BLACKFIN def_bool y select HAVE_ARCH_KGDB select HAVE_ARCH_TRACEHOOK + select HAVE_ARCH_FREE_INITRD_MEM select HAVE_DYNAMIC_FTRACE select HAVE_FTRACE_MCOUNT_RECORD select HAVE_FUNCTION_GRAPH_TRACER diff --git a/arch/c6x/mm/init.c b/arch/c6x/mm/init.c index 4cc72b0d1c1d..a11cb657182a 100644 --- a/arch/c6x/mm/init.c +++ b/arch/c6x/mm/init.c @@ -66,13 +66,6 @@ void __init mem_init(void) mem_init_print_info(NULL); } -#ifdef CONFIG_BLK_DEV_INITRD -void __init free_initrd_mem(unsigned long start, unsigned long end) -{ - free_reserved_area((void *)start, (void *)end, -1, "initrd"); -} -#endif - void __init free_initmem(void) { free_initmem_default(-1); diff --git a/arch/cris/Kconfig b/arch/cris/Kconfig index cd5a0865c97f..5425f77e5664 100644 --- a/arch/cris/Kconfig +++ b/arch/cris/Kconfig @@ -76,6 +76,7 @@ config CRIS select HAVE_DEBUG_BUGVERBOSE if ETRAX_ARCH_V32 select HAVE_NMI select DMA_DIRECT_OPS if PCI + select HAVE_ARCH_FREE_INITRD_MEM config HZ int diff --git a/arch/frv/mm/init.c 
b/arch/frv/mm/init.c index cf464100e838..345edc4dc462 100644 --- a/arch/frv/mm/init.c +++ b/arch/frv/mm/init.c @@ -131,14 +131,3 @@ void free_initmem(void) free_initmem_default(-1); #endif } /* end free_initmem() */ - -/*/ -/* - * free the initial ramdisk memory - */
Re: Linux 4.16: Reported regressions as of Tuesday, 2018-03-27 (Was: Linux 4.16-rc7)
On Tue, Mar 27, 2018 at 09:13:32PM +0200, Thorsten Leemhuis wrote: > On 26.03.2018 01:37, Linus Torvalds wrote: > > […] Anyway. Go out and test. And let's hope next week is nice and calm and > > I can release the final 4.16 next Sunday without any extra rc's. > > > >Linus > > Hi! Find below my sixth regression report for Linux 4.16. It lists 7 > regressions I'm currently aware of. 2 were fixed since last weeks > report; 2 are new. > > Are you aware of any other regressions that got introduced this > development cycle? Then please let me know by mail (a simple bounce or > forward to the sender of this email address is enough!). And please tell > me if there is anything in the report that shouldn't be there. > > Ciao, Thorsten > > == Current regressions == > > Dell R640 does not boot due to SCSI/SATA failure > - Status: Afaics still unfixed; lost track, ask reporter for an update > on Monday morning, no reply yet > - Cause: https://git.kernel.org/torvalds/c/84676c1f21e8 > - Reported: 2018-02-22 > https://marc.info/?l=linux-kernel=151931128006031 > - Note: Issue understood and even (kind of accidentally) fixed by a > patch series that was proposed for 4.17 (see links) > - Last known developer activity: 2018-03-14 > https://marc.info/?l=linux-block=152102086831636=2 > - Other relevant links: > https://marc.info/?l=linux-block=152051511802229=2 > https://marc.info/?l=linux-kernel=152026091325037 I'm not the original reporter but I can confirm this regression has been fixed by git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes (tag 0c30612d535460af7fe16e7e6ed8b97defe4adbe) The machine has a HP array and stopped booting some time around rc1. I tried to bisect it and in the end it was the commit 84676c1f21e8 that I reverted for testing builds, so I'm sure it's the same issue as you've been tracking.
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Wed, Mar 28, 2018 at 8:26 AM, Shea Levy wrote: > Now only those architectures that have custom initrd free requirements > need to define free_initrd_mem. > > Signed-off-by: Shea Levy Yay consolidation! :) > --- a/usr/Kconfig > +++ b/usr/Kconfig > @@ -233,3 +233,7 @@ config INITRAMFS_COMPRESSION > default ".lzma" if RD_LZMA > default ".bz2" if RD_BZIP2 > default "" > + > +config HAVE_ARCH_FREE_INITRD_MEM > + bool > + default n If you keep the Kconfig, you can leave off "default n", and I'd suggest adding a help section just to describe what the per-arch responsibilities are when select-ing the config. (See HAVE_ARCH_SECCOMP_FILTER for an example.) -Kees -- Kees Cook Pixel Security
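Concretely, Kees's suggestion would turn the hunk into something like the following; the help-text wording is a sketch of what such a section might say, not his proposed text:

```
config HAVE_ARCH_FREE_INITRD_MEM
	bool
	help
	  Architectures that select this symbol must provide their own
	  implementation of free_initrd_mem() instead of using the
	  generic one in init/initramfs.c.
```

Since `bool` symbols default to `n` anyway, dropping `default n` loses nothing.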
Re: RFC on writel and writel_relaxed
On Thu, 29 Mar 2018 08:31:32 +1100 Benjamin Herrenschmidt wrote: > On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote: > > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) > > David Miller wrote: > > > > > From: Benjamin Herrenschmidt > > > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > > > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > > > the archs are unused or dead anyway. > > > > > > Agreed. > > > > While we're making decrees here, can we do something about mmiowb? > > The semantics are basically indecipherable. > > I was going to tackle that next :-) > > > This is a variation on the mandatory write barrier that causes writes to > > weakly > > ordered I/O regions to be partially ordered. Its effects may go beyond > > the > > CPU->Hardware interface and actually affect the hardware at some level. > > > > How can a driver writer possibly get that right? > > > > IIRC it was added for some big ia64 system that was really expensive > > to implement the proper wmb() semantics on. So wmb() semantics were > > quietly downgraded, then the subsequently broken drivers they cared > > about were fixed by adding the stronger mmiowb(). > > > > What should have happened was wmb and writel remained correct, sane, and > > expensive, and they add an mmio_wmb() to order MMIO stores made by the > > writel_relaxed accessors, then use that to speed up the few drivers they > > care about. > > > > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just > > make wmb ordering talk about stores to the device, not to some > > intermediate stage of the interconnect where it can be subsequently > > reordered wrt the device? Drivers can be converted back to using wmb > > or writel gradually. > > I was under the impression that mmiowb was specifically about ordering > writel's with a subsequent spin_unlock; without it, MMIOs from > different CPUs (within the same lock) would still arrive out of order.
Yes more or less, and I think that until mmiowb was introduced, wmb or writel was sufficient for this. Thanks, Nick
Re: [PATCH v3 00/41] cxlflash: OCXL transport support and miscellaneous fixes
Uma, > This patch series adds OCXL support to the cxlflash driver. With this > support, new devices using the OCXL transport will be supported by the > cxlflash driver along with the existing CXL devices. An effort is made > to keep this transport specific function independent of the existing > core driver that communicates with the AFU. > > The first three patches contain a minor fix and staging improvements. > > This series is intended for 4.17 and is bisectable. Something this big really needs to be submitted around rc2/rc3. It's way too late in the cycle to ensure proper zeroday coverage for all these commits. I have started a 4.18/scsi-queue branch to hold this series for now. The 4.18 branch will be rebased once 4.17rc1 is out in a few weeks. Your changes won't show up in for-next until then either. Thanks! -- Martin K. Petersen Oracle Linux Engineering
Re: RFC on writel and writel_relaxed
On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote: > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) > David Miller wrote: > > > From: Benjamin Herrenschmidt > > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > > the archs are unused or dead anyway. > > > > Agreed. > > While we're making decrees here, can we do something about mmiowb? > The semantics are basically indecipherable. I was going to tackle that next :-) > This is a variation on the mandatory write barrier that causes writes to > weakly > ordered I/O regions to be partially ordered. Its effects may go beyond the > CPU->Hardware interface and actually affect the hardware at some level. > > How can a driver writer possibly get that right? > > IIRC it was added for some big ia64 system that was really expensive > to implement the proper wmb() semantics on. So wmb() semantics were > quietly downgraded, then the subsequently broken drivers they cared > about were fixed by adding the stronger mmiowb(). > > What should have happened was wmb and writel remained correct, sane, and > expensive, and they add an mmio_wmb() to order MMIO stores made by the > writel_relaxed accessors, then use that to speed up the few drivers they > care about. > > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just > make wmb ordering talk about stores to the device, not to some > intermediate stage of the interconnect where it can be subsequently > reordered wrt the device? Drivers can be converted back to using wmb > or writel gradually. I was under the impression that mmiowb was specifically about ordering writel's with a subsequent spin_unlock, without it, MMIOs from different CPUs (within the same lock) would still arrive OO. If that's indeed the case, I would suggest ia64 switches to a similar per-cpu flag trick powerpc uses. Cheers, Ben. > Thanks, > Nick
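For readers following the thread, the pattern Ben describes can be sketched in kernel-style pseudocode. This is purely illustrative (the driver, `struct hw_ring`, and its `doorbell` field are invented; only `spin_lock`/`writel`/`mmiowb` are real kernel APIs): two CPUs serialized by the same spinlock can still have their MMIO writes reach the device out of order on the platforms under discussion.

```
/* Pseudocode sketch of a hypothetical driver -- not from any real one. */
static DEFINE_SPINLOCK(hw_lock);

static void hw_submit(struct hw_ring *ring, u32 tail)
{
	spin_lock(&hw_lock);
	writel(tail, ring->doorbell);	/* posted MMIO write */
	/*
	 * The unlock below only orders the CPU's own stores.  On the
	 * affected platforms the MMIO write may still be in flight in
	 * the interconnect, so the next CPU to take hw_lock can have
	 * its doorbell write arrive at the device first.  mmiowb()
	 * pushes the MMIO write out before the lock is released.
	 */
	mmiowb();
	spin_unlock(&hw_lock);
}
```

The per-cpu flag trick Ben mentions makes the unlock path itself perform the flush, but only when an MMIO write actually happened inside the critical section, which is roughly how powerpc avoids requiring explicit mmiowb() calls in drivers.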
Re: [PATCH v9 09/24] mm: protect mremap() against SPF hanlder
On Wed, 28 Mar 2018, Laurent Dufour wrote: > >> @@ -326,7 +336,10 @@ static unsigned long move_vma(struct vm_area_struct > >> *vma, > >>mremap_userfaultfd_prep(new_vma, uf); > >>arch_remap(mm, old_addr, old_addr + old_len, > >> new_addr, new_addr + new_len); > >> + if (vma != new_vma) > >> + vm_raw_write_end(vma); > >>} > >> + vm_raw_write_end(new_vma); > > > > Just do > > > > vm_raw_write_end(vma); > > vm_raw_write_end(new_vma); > > > > here. > > Are you sure ? we can have vma = new_vma done if (unlikely(err)) > Sorry, what I meant was do if (vma != new_vma) vm_raw_write_end(vma); vm_raw_write_end(new_vma); after the conditional. Having the locking unnecessarily embedded in the conditional has been an issue in the past with other areas of core code, unless you have a strong reason for it.
Re: [PATCH v9 01/24] mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT
On Wed, 28 Mar 2018, Laurent Dufour wrote: > > Putting this in mm/Kconfig is definitely the right way to go about it > > instead of any generic option in arch/*. > > > > My question, though, was making this configurable by the user: > > > > config SPECULATIVE_PAGE_FAULT > > bool "Speculative page faults" > > depends on X86_64 || PPC > > default y > > help > > .. > > > > It's a question about whether we want this always enabled on x86_64 and > > power or whether the user should be able to disable it (right now they > > can't). With a large feature like this, you may want to offer something > > simple (disable CONFIG_SPECULATIVE_PAGE_FAULT) if someone runs into > > regressions. > > I agree, but I think it would be important to get the per architecture > enablement to avoid complex check here. For instance in the case of powerPC > this is only supported for PPC_BOOK3S_64. > > To avoid exposing such per architecture define here, what do you think about > having supporting architectures setting ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT > and the SPECULATIVE_PAGE_FAULT depends on this, like this: > > In mm/Kconfig: > config SPECULATIVE_PAGE_FAULT > bool "Speculative page faults" > depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT && SMP > default y > help > ... > > In arch/powerpc/Kconfig: > config PPC > ... > select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64 > > In arch/x86/Kconfig: > config X86_64 > ... > select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT > > Looks good to me! It feels like this will add more assurance that if things regress for certain workloads that it can be disabled. I don't feel strongly about the default value, I'm ok with it being enabled by default.
Re: [PATCH v12 07/22] selftests/vm: fixed bugs in pkey_disable_clear()
On 03/28/2018 01:47 PM, Thiago Jung Bauermann wrote: >>> if (flags) >>> - assert(rdpkey_reg() > orig_pkey_reg); >>> + assert(rdpkey_reg() < orig_pkey_reg); >>> } >>> >>> void pkey_write_allow(int pkey) >> This seems so horribly wrong that I wonder how it worked in the first >> place. Any idea? > The code simply wasn't used. pkey_disable_clear() is called by > pkey_write_allow() and pkey_access_allow(), but before this patch series > nothing called either of these functions. > Ahh, that explains it. Can that get stuck in the changelog, please?
Re: [PATCH v4 14/16] powerpc: Use generic free_initrd_mem.
Joe Perches writes: > On Wed, 2018-03-28 at 16:36 -0400, Shea Levy wrote: >> Signed-off-by: Shea Levy > > Most people seem to want some form of commit message > and not just your sign-off. > Ah, if the subject is insufficient I can add some more detail. > > And btw: > > It seems you used get_maintainer to determine who to > send these patches to. > > I suggest you add --nogit and --nogit-fallback to the > get_maintainer command line you use to avoid sending > these patches to people like me that have done drive-by > cleanup work on these files. Whoops, thanks for the tip and sorry for the noise!
Re: [PATCH v12 07/22] selftests/vm: fixed bugs in pkey_disable_clear()
Dave Hansen writes: > On 02/21/2018 05:55 PM, Ram Pai wrote: >> --- a/tools/testing/selftests/vm/protection_keys.c >> +++ b/tools/testing/selftests/vm/protection_keys.c >> @@ -461,7 +461,7 @@ void pkey_disable_clear(int pkey, int flags) >> pkey, pkey, pkey_rights); >> pkey_assert(pkey_rights >= 0); >> >> -pkey_rights |= flags; >> +pkey_rights &= ~flags; >> >> ret = pkey_set(pkey, pkey_rights, 0); >> /* pkey_reg and flags have the same format */ >> @@ -475,7 +475,7 @@ void pkey_disable_clear(int pkey, int flags) >> dprintf1("%s(%d) pkey_reg: 0x%016lx\n", __func__, >> pkey, rdpkey_reg()); >> if (flags) >> -assert(rdpkey_reg() > orig_pkey_reg); >> +assert(rdpkey_reg() < orig_pkey_reg); >> } >> >> void pkey_write_allow(int pkey) > > This seems so horribly wrong that I wonder how it worked in the first > place. Any idea? The code simply wasn't used. pkey_disable_clear() is called by pkey_write_allow() and pkey_access_allow(), but before this patch series nothing called either of these functions. -- Thiago Jung Bauermann IBM Linux Technology Center
Re: [PATCH v4 14/16] powerpc: Use generic free_initrd_mem.
On Wed, 2018-03-28 at 16:36 -0400, Shea Levy wrote: > Signed-off-by: Shea Levy
Most people seem to want some form of commit message and not just your sign-off. And btw: It seems you used get_maintainer to determine who to send these patches to. I suggest you add --nogit and --nogit-fallback to the get_maintainer command line you use to avoid sending these patches to people like me that have done drive-by cleanup work on these files.
Re: [PATCH] powerpc/64: Fix checksum folding in csum_add
On Tue, Mar 27, 2018 at 05:22:32PM +0200, LEROY Christophe wrote: > Shile Zhang wrote: > > >fix the missed point in Paul's patch: > >"powerpc/64: Fix checksum folding in csum_tcpudp_nofold and > >ip_fast_csum_nofold" > > > >Signed-off-by: Shile Zhang > >--- > > arch/powerpc/include/asm/checksum.h | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > >diff --git a/arch/powerpc/include/asm/checksum.h > >b/arch/powerpc/include/asm/checksum.h > >index 5b1a6e3..430d038 100644 > >--- a/arch/powerpc/include/asm/checksum.h > >+++ b/arch/powerpc/include/asm/checksum.h > >@@ -108,7 +108,7 @@ static inline __wsum csum_add(__wsum csum, __wsum addend) > > > > #ifdef __powerpc64__ > > res += (__force u64)addend; > >-return (__force __wsum)((u32)res + (res >> 32)); > >+return (__force __wsum) from64to32(res); > > Did you encounter a bug due to that ? > As far as i understand, csum and addend are 32 bits so can't exceed 0x > Then their sum won't exceed 0x1fffe. So the sum of upper and lower part > won't carry
If the sum of the two halves was 0x1fffe, then that previously got truncated to 32 bits and returned as 0xfffe, which is wrong - the result should be 0x. Paul.
[PATCH v4 14/16] powerpc: Use generic free_initrd_mem.
Signed-off-by: Shea Levy --- arch/powerpc/mm/mem.c | 7 --- 1 file changed, 7 deletions(-) diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index fe8c61149fb8..e85b2a3cd264 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -404,13 +404,6 @@ void free_initmem(void) free_initmem_default(POISON_FREE_INITMEM); } -#ifdef CONFIG_BLK_DEV_INITRD -void __init free_initrd_mem(unsigned long start, unsigned long end) -{ - free_reserved_area((void *)start, (void *)end, -1, "initrd"); -} -#endif - /* * This is called when a page has been modified by the kernel. * It just marks the page as not i-cache clean. We do the i-cache -- 2.16.2
[PATCH v2 04/19] powerpc/kvm: Prefer fault_in_pages_readable function
Directly use fault_in_pages_readable instead of manual __get_user code. Fix warning treated as error with W=1: arch/powerpc/kernel/kvm.c:675:6: error: variable ‘tmp’ set but not used [-Werror=unused-but-set-variable] Suggested-by: Christophe Leroy Signed-off-by: Mathieu Malaterre --- v2: use fault_in_pages_readable instead arch/powerpc/kernel/kvm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c index 9ad37f827a97..124c51030b75 100644 --- a/arch/powerpc/kernel/kvm.c +++ b/arch/powerpc/kernel/kvm.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -672,14 +673,13 @@ static void kvm_use_magic_page(void) { u32 *p; u32 *start, *end; - u32 tmp; u32 features; /* Tell the host to map the magic page to -4096 on all CPUs */ on_each_cpu(kvm_map_magic_page, , 1); /* Quick self-test to see if the mapping works */ - if (__get_user(tmp, (u32*)KVM_MAGIC_PAGE)) { + if (!fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) { kvm_patching_worked = false; return; } -- 2.11.0
[PATCH v2 07/19] powerpc/powermac: Make some functions static
These functions can all be static, make it so. Fix warnings treated as errors with W=1: arch/powerpc/platforms/powermac/pci.c:1022:6: error: no previous prototype for ‘pmac_pci_fixup_ohci’ [-Werror=missing-prototypes] arch/powerpc/platforms/powermac/pci.c:1057:6: error: no previous prototype for ‘pmac_pci_fixup_cardbus’ [-Werror=missing-prototypes] arch/powerpc/platforms/powermac/pci.c:1094:6: error: no previous prototype for ‘pmac_pci_fixup_pciata’ [-Werror=missing-prototypes] Remove has_address declaration and assignment since not used. Also add gcc attribute unused to fix a warning treated as error with W=1: arch/powerpc/platforms/powermac/pci.c:784:19: error: variable ‘has_address’ set but not used [-Werror=unused-but-set-variable] arch/powerpc/platforms/powermac/pci.c:907:22: error: variable ‘ht’ set but not used [-Werror=unused-but-set-variable] Suggested-by: Christophe Leroy Signed-off-by: Mathieu Malaterre --- v2: remove has_address variable since not used arch/powerpc/platforms/powermac/pci.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powermac/pci.c b/arch/powerpc/platforms/powermac/pci.c index 0b8174a79993..67c497093e0a 100644 --- a/arch/powerpc/platforms/powermac/pci.c +++ b/arch/powerpc/platforms/powermac/pci.c @@ -781,12 +781,12 @@ static int __init pmac_add_bridge(struct device_node *dev) struct resource rsrc; char *disp_name; const int *bus_range; - int primary = 1, has_address = 0; + int primary = 1; DBG("Adding PCI host bridge %pOF\n", dev); /* Fetch host bridge registers address */ - has_address = (of_address_to_resource(dev, 0, ) == 0); + of_address_to_resource(dev, 0, ); /* Get bus range if any */ bus_range = of_get_property(dev, "bus-range", ); @@ -904,7 +904,7 @@ static int pmac_pci_root_bridge_prepare(struct pci_host_bridge *bridge) void __init pmac_pci_init(void) { struct device_node *np, *root; - struct device_node *ht = NULL; + struct device_node *ht __maybe_unused = NULL; 
pci_set_flags(PCI_CAN_SKIP_ISA_ALIGN); @@ -1019,7 +1019,7 @@ static bool pmac_pci_enable_device_hook(struct pci_dev *dev) return true; } -void pmac_pci_fixup_ohci(struct pci_dev *dev) +static void pmac_pci_fixup_ohci(struct pci_dev *dev) { struct device_node *node = pci_device_to_OF_node(dev); @@ -1054,7 +1054,7 @@ void __init pmac_pcibios_after_init(void) } } -void pmac_pci_fixup_cardbus(struct pci_dev* dev) +static void pmac_pci_fixup_cardbus(struct pci_dev *dev) { if (!machine_is(powermac)) return; @@ -1091,7 +1091,7 @@ void pmac_pci_fixup_cardbus(struct pci_dev* dev) DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_TI, PCI_ANY_ID, pmac_pci_fixup_cardbus); -void pmac_pci_fixup_pciata(struct pci_dev* dev) +static void pmac_pci_fixup_pciata(struct pci_dev *dev) { u8 progif = 0; -- 2.11.0
[PATCH v2 05/19] powerpc/chrp/setup: Add attribute unused and make some functions static
Remove variable declaration idu_size and associated code since not used. These functions can all be static, make it so. Fix warnings treated as errors with W=1: arch/powerpc/platforms/chrp/setup.c:97:6: error: no previous prototype for ‘chrp_show_cpuinfo’ [-Werror=missing-prototypes] arch/powerpc/platforms/chrp/setup.c:302:13: error: no previous prototype for ‘chrp_setup_arch’ [-Werror=missing-prototypes] arch/powerpc/platforms/chrp/setup.c:385:16: error: variable ‘idu_size’ set but not used [-Werror=unused-but-set-variable] arch/powerpc/platforms/chrp/setup.c:526:13: error: no previous prototype for ‘chrp_init_IRQ’ [-Werror=missing-prototypes] arch/powerpc/platforms/chrp/setup.c:559:1: error: no previous prototype for ‘chrp_init2’ [-Werror=missing-prototypes] Suggested-by: Christophe Leroy Signed-off-by: Mathieu Malaterre --- v2: Remove totally variable idu_size arch/powerpc/platforms/chrp/setup.c | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/platforms/chrp/setup.c b/arch/powerpc/platforms/chrp/setup.c index 481ed133e04b..d6d8ffc0271e 100644 --- a/arch/powerpc/platforms/chrp/setup.c +++ b/arch/powerpc/platforms/chrp/setup.c @@ -94,7 +94,7 @@ static const char *chrp_names[] = { "Total Impact Briq" }; -void chrp_show_cpuinfo(struct seq_file *m) +static void chrp_show_cpuinfo(struct seq_file *m) { int i, sdramen; unsigned int t; @@ -299,7 +299,7 @@ static __init void chrp_init(void) of_node_put(node); } -void __init chrp_setup_arch(void) +static void __init chrp_setup_arch(void) { struct device_node *root = of_find_node_by_path("/"); const char *machine = NULL; @@ -382,7 +382,7 @@ static void __init chrp_find_openpic(void) { struct device_node *np, *root; int len, i, j; - int isu_size, idu_size; + int isu_size; const unsigned int *iranges, *opprop = NULL; int oplen = 0; unsigned long opaddr; @@ -427,11 +427,9 @@ static void __init chrp_find_openpic(void) } isu_size = 0; - idu_size = 0; if (len > 0 && iranges[1] != 0) { 
printk(KERN_INFO "OpenPIC irqs %d..%d in IDU\n", iranges[0], iranges[0] + iranges[1] - 1); - idu_size = iranges[1]; } if (len > 1) isu_size = iranges[3]; @@ -523,7 +521,7 @@ static void __init chrp_find_8259(void) } } -void __init chrp_init_IRQ(void) +static void __init chrp_init_IRQ(void) { #if defined(CONFIG_VT) && defined(CONFIG_INPUT_ADBHID) && defined(CONFIG_XMON) struct device_node *kbd; @@ -555,7 +553,7 @@ void __init chrp_init_IRQ(void) #endif } -void __init +static void __init chrp_init2(void) { #ifdef CONFIG_NVRAM -- 2.11.0
[PATCH v2 03/19] powerpc: Mark variables as unused
Add gcc attribute unused for two variables. Fix warnings treated as errors with W=1: arch/powerpc/kernel/prom_init.c:1388:8: error: variable ‘path’ set but not used [-Werror=unused-but-set-variable] Suggested-by: Christophe Leroy Signed-off-by: Mathieu Malaterre --- v2: move path within ifdef DEBUG_PROM arch/powerpc/kernel/prom_init.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c index acf4b2e0530c..4163b11abb6c 100644 --- a/arch/powerpc/kernel/prom_init.c +++ b/arch/powerpc/kernel/prom_init.c @@ -603,7 +603,7 @@ static void __init early_cmdline_parse(void) const char *opt; char *p; - int l = 0; + int l __maybe_unused = 0; prom_cmd_line[0] = 0; p = prom_cmd_line; @@ -1385,7 +1385,7 @@ static void __init reserve_mem(u64 base, u64 size) static void __init prom_init_mem(void) { phandle node; - char *path, type[64]; + char *path __maybe_unused, type[64]; unsigned int plen; cell_t *p, *endp; __be32 val; @@ -1406,7 +1406,6 @@ static void __init prom_init_mem(void) prom_debug("root_size_cells: %x\n", rsc); prom_debug("scanning memory:\n"); - path = prom_scratch; for (node = 0; prom_next_node(); ) { type[0] = 0; @@ -1431,6 +1430,7 @@ static void __init prom_init_mem(void) endp = p + (plen / sizeof(cell_t)); #ifdef DEBUG_PROM + path = prom_scratch; memset(path, 0, PROM_SCRATCH_SIZE); call_prom("package-to-path", 3, 1, node, path, PROM_SCRATCH_SIZE-1); prom_debug(" node %s :\n", path); -- 2.11.0
[PATCH v2 02/19] powerpc/powermac: Mark variable x as unused
Since the value of x is never intended to be read, remove it. Fix warning treated as error with W=1: arch/powerpc/platforms/powermac/udbg_scc.c:76:9: error: variable ‘x’ set but not used [-Werror=unused-but-set-variable] Suggested-by: Christophe Leroy Signed-off-by: Mathieu Malaterre --- v2: remove x completely arch/powerpc/platforms/powermac/udbg_scc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/powermac/udbg_scc.c b/arch/powerpc/platforms/powermac/udbg_scc.c index d83135a9830e..8901973ed683 100644 --- a/arch/powerpc/platforms/powermac/udbg_scc.c +++ b/arch/powerpc/platforms/powermac/udbg_scc.c @@ -73,7 +73,7 @@ void udbg_scc_init(int force_scc) struct device_node *stdout = NULL, *escc = NULL, *macio = NULL; struct device_node *ch, *ch_def = NULL, *ch_a = NULL; const char *path; - int i, x; + int i; escc = of_find_node_by_name(NULL, "escc"); if (escc == NULL) @@ -120,7 +120,7 @@ void udbg_scc_init(int force_scc) mb(); for (i = 2; i != 0; --i) - x = in_8(sccc); + in_8(sccc); out_8(sccc, 0x09); /* reset A or B side */ out_8(sccc, 0xc0); -- 2.11.0
[PATCH v2 01/19] powerpc/powermac: Mark variable x as unused
Since the value of x is never intended to be read, declare it with gcc attribute as unused. Fix warning treated as error with W=1: arch/powerpc/platforms/powermac/bootx_init.c:471:21: error: variable ‘x’ set but not used [-Werror=unused-but-set-variable] Signed-off-by: Mathieu Malaterre --- v2: move x variable within its local scope arch/powerpc/platforms/powermac/bootx_init.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/powermac/bootx_init.c b/arch/powerpc/platforms/powermac/bootx_init.c index c3c9bbb3573a..d44e8571c1ec 100644 --- a/arch/powerpc/platforms/powermac/bootx_init.c +++ b/arch/powerpc/platforms/powermac/bootx_init.c @@ -468,7 +468,7 @@ void __init bootx_init(unsigned long r3, unsigned long r4) boot_infos_t *bi = (boot_infos_t *) r4; unsigned long hdr; unsigned long space; - unsigned long ptr, x; + unsigned long ptr; char *model; unsigned long offset = reloc_offset(); @@ -562,6 +562,7 @@ void __init bootx_init(unsigned long r3, unsigned long r4) * MMU switched OFF, so this should not be useful anymore. */ if (bi->version < 4) { + unsigned long x __maybe_unused; bootx_printf("Touching pages...\n"); /* -- 2.11.0
Re: [PATCH 15/19] powerpc: Add missing prototype
On Fri, Mar 23, 2018 at 1:20 PM, christophe leroy wrote: > > > On 22/03/2018 21:20, Mathieu Malaterre wrote: >> >> Add one missing prototype for function rh_dump_blk. Fix warning treated as >> error in W=1: >> >>arch/powerpc/lib/rheap.c:740:6: error: no previous prototype for >> ‘rh_dump_blk’ [-Werror=missing-prototypes] >> >> Signed-off-by: Mathieu Malaterre >> --- >> arch/powerpc/include/asm/rheap.h | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/arch/powerpc/include/asm/rheap.h >> b/arch/powerpc/include/asm/rheap.h >> index 172381769cfc..e75d96de19a0 100644 >> --- a/arch/powerpc/include/asm/rheap.h >> +++ b/arch/powerpc/include/asm/rheap.h >> @@ -83,6 +83,9 @@ extern int rh_get_stats(rh_info_t * info, int what, int >> max_stats, >> /* Simple dump of remote heap info */ >> extern void rh_dump(rh_info_t * info); >> +/* Simple dump of remote info block */ >> +extern void rh_dump_blk(rh_info_t *info, rh_block_t *blk); >> + > > > Only used in one place, should be static Well here is what I see over here: $ git grep rh_dump_blk ... arch/powerpc/lib/rheap.c:EXPORT_SYMBOL_GPL(rh_dump_blk); > Christophe > >> /* Set owner of taken block */ >> extern int rh_set_owner(rh_info_t * info, unsigned long start, const >> char *owner); >>
Re: [PATCH 11/19] powerpc/powermac: Move pmac_pfunc_base_install prototype to header file
On Fri, Mar 23, 2018 at 1:13 PM, christophe leroy wrote: > > > On 22/03/2018 21:19, Mathieu Malaterre wrote: >> >> The pmac_pfunc_base_install prototype was declared in powermac/smp.c since >> function was used there, move it to pmac_pfunc.h header to be visible in >> pfunc_base.c. Fix a warning treated as error with W=1: >> >>arch/powerpc/platforms/powermac/pfunc_base.c:330:12: error: no previous >> prototype for ‘pmac_pfunc_base_install’ [-Werror=missing-prototypes] >> >> Signed-off-by: Mathieu Malaterre >> --- >> arch/powerpc/include/asm/pmac_pfunc.h | 1 + >> arch/powerpc/platforms/powermac/smp.c | 1 - >> 2 files changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/include/asm/pmac_pfunc.h >> b/arch/powerpc/include/asm/pmac_pfunc.h >> index 73bd8f28f2a8..99f7a288789a 100644 >> --- a/arch/powerpc/include/asm/pmac_pfunc.h >> +++ b/arch/powerpc/include/asm/pmac_pfunc.h >> @@ -245,6 +245,7 @@ extern void pmf_put_function(struct pmf_function >> *func); >> extern int pmf_call_one(struct pmf_function *func, struct pmf_args >> *args); >> +extern int pmac_pfunc_base_install(void); > > > > extern keyword is not needed I understand; but for consistency every single prototype in this header file actually uses the extern keyword. Is there a guide/best practice to refer to in this case ? > Christophe > >> /* Suspend/resume code called by via-pmu directly for now */ >> extern void pmac_pfunc_base_suspend(void); >> diff --git a/arch/powerpc/platforms/powermac/smp.c >> b/arch/powerpc/platforms/powermac/smp.c >> index 95275e0e2efa..447da6db450a 100644 >> --- a/arch/powerpc/platforms/powermac/smp.c >> +++ b/arch/powerpc/platforms/powermac/smp.c >> @@ -65,7 +65,6 @@ >> #endif >> extern void __secondary_start_pmac_0(void); >> -extern int pmac_pfunc_base_install(void); >> static void (*pmac_tb_freeze)(int freeze); >> static u64 timebase; >>
[PATCH v3] powerpc/altivec: Add missing prototypes for altivec
Some functions prototypes were missing for the non-altivec code. Add the missing prototypes in a new header file, fix warnings treated as errors with W=1: arch/powerpc/lib/xor_vmx_glue.c:18:6: error: no previous prototype for ‘xor_altivec_2’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:29:6: error: no previous prototype for ‘xor_altivec_3’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:40:6: error: no previous prototype for ‘xor_altivec_4’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:52:6: error: no previous prototype for ‘xor_altivec_5’ [-Werror=missing-prototypes] The prototypes were already present in but this header file is meant to be included after . Trying to re-use directly would lead to warnings such as: arch/powerpc/include/asm/xor.h:39:15: error: variable ‘xor_block_altivec’ has initializer but incomplete type Trying to re-use after in xor_vmx_glue.c would in turn trigger the following warnings: include/asm-generic/xor.h:688:34: error: ‘xor_block_32regs’ defined but not used [-Werror=unused-variable] Signed-off-by: Mathieu Malaterre --- v3: missing changelog, no other change v2: new version where a separate header was introduced. Also update commit message explaining history. 
arch/powerpc/include/asm/xor.h | 12 +--- arch/powerpc/include/asm/xor_altivec.h | 19 +++ arch/powerpc/lib/xor_vmx_glue.c| 1 + 3 files changed, 21 insertions(+), 11 deletions(-) create mode 100644 arch/powerpc/include/asm/xor_altivec.h diff --git a/arch/powerpc/include/asm/xor.h b/arch/powerpc/include/asm/xor.h index a36c2069d8ed..7d6dc503349d 100644 --- a/arch/powerpc/include/asm/xor.h +++ b/arch/powerpc/include/asm/xor.h @@ -24,17 +24,7 @@ #include #include - -void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in); -void xor_altivec_3(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, unsigned long *v3_in); -void xor_altivec_4(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, unsigned long *v3_in, - unsigned long *v4_in); -void xor_altivec_5(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, unsigned long *v3_in, - unsigned long *v4_in, unsigned long *v5_in); +#include static struct xor_block_template xor_block_altivec = { .name = "altivec", diff --git a/arch/powerpc/include/asm/xor_altivec.h b/arch/powerpc/include/asm/xor_altivec.h new file mode 100644 index ..6ca923510b59 --- /dev/null +++ b/arch/powerpc/include/asm/xor_altivec.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_XOR_ALTIVEC_H +#define _ASM_POWERPC_XOR_ALTIVEC_H + +#ifdef CONFIG_ALTIVEC + +void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in); +void xor_altivec_3(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in); +void xor_altivec_4(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in, + unsigned long *v4_in); +void xor_altivec_5(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in, + unsigned long *v4_in, unsigned long *v5_in); + +#endif +#endif /* _ASM_POWERPC_XOR_ALTIVEC_H */ diff --git a/arch/powerpc/lib/xor_vmx_glue.c 
b/arch/powerpc/lib/xor_vmx_glue.c index 6521fe5e8cef..dab2b6bfcf36 100644 --- a/arch/powerpc/lib/xor_vmx_glue.c +++ b/arch/powerpc/lib/xor_vmx_glue.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "xor_vmx.h" void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, -- 2.11.0
[PATCH v2] powerpc/altivec: Add missing prototypes for altivec
Some functions prototypes were missing for the non-altivec code. Add the missing prototypes in a new header file, fix warnings treated as errors with W=1: arch/powerpc/lib/xor_vmx_glue.c:18:6: error: no previous prototype for ‘xor_altivec_2’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:29:6: error: no previous prototype for ‘xor_altivec_3’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:40:6: error: no previous prototype for ‘xor_altivec_4’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:52:6: error: no previous prototype for ‘xor_altivec_5’ [-Werror=missing-prototypes] The prototypes were already present in but this header file is meant to be included after . Trying to re-use directly would lead to warnings such as: arch/powerpc/include/asm/xor.h:39:15: error: variable ‘xor_block_altivec’ has initializer but incomplete type Trying to re-use after in xor_vmx_glue.c would in turn trigger the following warnings: include/asm-generic/xor.h:688:34: error: ‘xor_block_32regs’ defined but not used [-Werror=unused-variable] Signed-off-by: Mathieu Malaterre --- arch/powerpc/include/asm/xor.h | 12 +--- arch/powerpc/include/asm/xor_altivec.h | 19 +++ arch/powerpc/lib/xor_vmx_glue.c| 1 + 3 files changed, 21 insertions(+), 11 deletions(-) create mode 100644 arch/powerpc/include/asm/xor_altivec.h diff --git a/arch/powerpc/include/asm/xor.h b/arch/powerpc/include/asm/xor.h index a36c2069d8ed..7d6dc503349d 100644 --- a/arch/powerpc/include/asm/xor.h +++ b/arch/powerpc/include/asm/xor.h @@ -24,17 +24,7 @@ #include #include - -void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in); -void xor_altivec_3(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, unsigned long *v3_in); -void xor_altivec_4(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, unsigned long *v3_in, - unsigned long *v4_in); -void xor_altivec_5(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, 
unsigned long *v3_in, - unsigned long *v4_in, unsigned long *v5_in); +#include static struct xor_block_template xor_block_altivec = { .name = "altivec", diff --git a/arch/powerpc/include/asm/xor_altivec.h b/arch/powerpc/include/asm/xor_altivec.h new file mode 100644 index ..6ca923510b59 --- /dev/null +++ b/arch/powerpc/include/asm/xor_altivec.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_XOR_ALTIVEC_H +#define _ASM_POWERPC_XOR_ALTIVEC_H + +#ifdef CONFIG_ALTIVEC + +void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in); +void xor_altivec_3(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in); +void xor_altivec_4(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in, + unsigned long *v4_in); +void xor_altivec_5(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in, + unsigned long *v4_in, unsigned long *v5_in); + +#endif +#endif /* _ASM_POWERPC_XOR_ALTIVEC_H */ diff --git a/arch/powerpc/lib/xor_vmx_glue.c b/arch/powerpc/lib/xor_vmx_glue.c index 6521fe5e8cef..dab2b6bfcf36 100644 --- a/arch/powerpc/lib/xor_vmx_glue.c +++ b/arch/powerpc/lib/xor_vmx_glue.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "xor_vmx.h" void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, -- 2.11.0
Re: [PATCH v9 09/24] mm: protect mremap() against SPF hanlder
On 28/03/2018 00:12, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 88042d843668..ef6ef0627090 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -2189,16 +2189,24 @@ void anon_vma_interval_tree_verify(struct >> anon_vma_chain *node); >> extern int __vm_enough_memory(struct mm_struct *mm, long pages, int >> cap_sys_admin); >> extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start, >> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert, >> -struct vm_area_struct *expand); >> +struct vm_area_struct *expand, bool keep_locked); >> static inline int vma_adjust(struct vm_area_struct *vma, unsigned long >> start, >> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert) >> { >> -return __vma_adjust(vma, start, end, pgoff, insert, NULL); >> +return __vma_adjust(vma, start, end, pgoff, insert, NULL, false); >> } >> -extern struct vm_area_struct *vma_merge(struct mm_struct *, >> +extern struct vm_area_struct *__vma_merge(struct mm_struct *, >> struct vm_area_struct *prev, unsigned long addr, unsigned long end, >> unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t, >> -struct mempolicy *, struct vm_userfaultfd_ctx); >> +struct mempolicy *, struct vm_userfaultfd_ctx, bool keep_locked); >> +static inline struct vm_area_struct *vma_merge(struct mm_struct *vma, >> +struct vm_area_struct *prev, unsigned long addr, unsigned long end, >> +unsigned long vm_flags, struct anon_vma *anon, struct file *file, >> +pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff) >> +{ >> +return __vma_merge(vma, prev, addr, end, vm_flags, anon, file, off, >> + pol, uff, false); >> +} > > The first formal to vma_merge() is an mm, not a vma. Oops ! > This area could use an uncluttering. 
> >> extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); >> extern int __split_vma(struct mm_struct *, struct vm_area_struct *, >> unsigned long addr, int new_below); >> diff --git a/mm/mmap.c b/mm/mmap.c >> index d6533cb85213..ac32b577a0c9 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -684,7 +684,7 @@ static inline void __vma_unlink_prev(struct mm_struct >> *mm, >> */ >> int __vma_adjust(struct vm_area_struct *vma, unsigned long start, >> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert, >> -struct vm_area_struct *expand) >> +struct vm_area_struct *expand, bool keep_locked) >> { >> struct mm_struct *mm = vma->vm_mm; >> struct vm_area_struct *next = vma->vm_next, *orig_vma = vma; >> @@ -996,7 +996,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned >> long start, >> >> if (next && next != vma) >> vm_raw_write_end(next); >> -vm_raw_write_end(vma); >> +if (!keep_locked) >> +vm_raw_write_end(vma); >> >> validate_mm(mm); >> > > This will require a fixup for the following patch where a retval from > anon_vma_close() can also return without vma locked even though > keep_locked == true. Yes, I saw your previous email about that. > > How does vma_merge() handle that error wrt vm_raw_write_begin(vma)? The unwritten assumption is that when __vma_adjust() returns an error, the vma is no longer sequence locked. That needs to be clarified by a big comment near the __vma_adjust() definition. This being said, in that case __vma_merge() returns NULL, which means it didn't do the job (the caller doesn't know whether this is due to an error or because the VMAs cannot be merged). Again the assumption is that no VMA is left locked in that case, since the merge operation was not done. But this is not documented at all and I have to fix that.
In addition, the caller copy_vma(), which is the only one calling __vma_merge() with keep_locked=true, assumes that in that case the VMA is not locked; it will allocate a new VMA which will be locked before it is inserted in the RB tree. So it should work, but I have to make a serious effort to document all of that in the code. Thanks a lot for raising this! > >> @@ -1132,12 +1133,13 @@ can_vma_merge_after(struct vm_area_struct *vma, >> unsigned long vm_flags, >> * parameter) may establish ptes with the wrong permissions of >> * instead of the right permissions of . >> */ >> -struct vm_area_struct *vma_merge(struct mm_struct *mm, >> +struct vm_area_struct *__vma_merge(struct mm_struct *mm, >> struct vm_area_struct *prev, unsigned long addr, >> unsigned long end, unsigned long vm_flags, >> struct anon_vma *anon_vma, struct file *file, >> pgoff_t pgoff, struct mempolicy *policy, >> -struct vm_userfaultfd_ctx vm_userfaultfd_ctx) >> +struct vm_userfaultfd_ctx vm_userfaultfd_ctx, >>
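The locking contract sketched in these replies could eventually be written down next to `__vma_adjust()` roughly as follows. This is only a guess at what such a comment might say, based purely on the discussion above, not actual kernel text:

```c
/*
 * Sequence-lock (vm_sequence) contract for __vma_adjust()/__vma_merge():
 *
 *  - On success with keep_locked == true, 'vma' is returned with its
 *    vm_sequence still write-locked; the caller must eventually call
 *    vm_raw_write_end(vma) itself.
 *
 *  - On error from __vma_adjust(), or when __vma_merge() returns NULL
 *    (no merge performed), no VMA is left sequence-locked, regardless
 *    of keep_locked.
 *
 * copy_vma() is the only keep_locked == true caller: on a NULL return
 * it allocates a fresh VMA and write-locks it before inserting it into
 * the RB tree, so either outcome leaves it holding exactly one locked VMA.
 */
```

Writing the invariant this way makes both halves of the NULL-return ambiguity (error vs. not mergeable) harmless to the caller, which is the property copy_vma() relies on.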
Re: [PATCH v9 07/24] mm: VMA sequence count
On 27/03/2018 23:30, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > >> diff --git a/mm/mmap.c b/mm/mmap.c >> index faf85699f1a1..5898255d0aeb 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -558,6 +558,10 @@ void __vma_link_rb(struct mm_struct *mm, struct >> vm_area_struct *vma, >> else >> mm->highest_vm_end = vm_end_gap(vma); >> >> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT >> +seqcount_init(>vm_sequence); >> +#endif >> + >> /* >> * vma->vm_prev wasn't known when we followed the rbtree to find the >> * correct insertion point for that vma. As a result, we could not >> @@ -692,6 +696,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned >> long start, >> long adjust_next = 0; >> int remove_next = 0; >> >> +/* >> + * Why using vm_raw_write*() functions here to avoid lockdep's warning ? >> + * >> + * Locked is complaining about a theoretical lock dependency, involving >> + * 3 locks: >> + * mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim >> + * >> + * Here are the major path leading to this dependency : >> + * 1. __vma_adjust() mmap_sem -> vm_sequence -> i_mmap_rwsem >> + * 2. move_vmap() mmap_sem -> vm_sequence -> fs_reclaim >> + * 3. __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem >> + * 4. unmap_mapping_range() i_mmap_rwsem -> vm_sequence >> + * >> + * So there is no way to solve this easily, especially because in >> + * unmap_mapping_range() the i_mmap_rwsem is grab while the impacted >> + * VMAs are not yet known. >> + * However, the way the vm_seq is used is guarantying that we will >> + * never block on it since we just check for its value and never wait >> + * for it to move, see vma_has_changed() and handle_speculative_fault(). 
>> + */ >> +vm_raw_write_begin(vma); >> +if (next) >> +vm_raw_write_begin(next); >> + >> if (next && !insert) { >> struct vm_area_struct *exporter = NULL, *importer = NULL; >> > > Eek, what about later on: > > /* >* Easily overlooked: when mprotect shifts the boundary, >* make sure the expanding vma has anon_vma set if the >* shrinking vma had, to cover any anon pages imported. >*/ > if (exporter && exporter->anon_vma && !importer->anon_vma) { > int error; > > importer->anon_vma = exporter->anon_vma; > error = anon_vma_clone(importer, exporter); > if (error) > return error; > } > > This needs > > if (error) { > if (next && next != vma) > vm_raw_write_end(next); > vm_raw_write_end(vma); > return error; > } Nice catch ! Thanks, Laurent.
Re: [PATCH 6/6] doc/devicetree: NVDIMM region documentation
On Wed, Mar 28, 2018 at 10:06 AM, Rob Herringwrote: [..] > >> Are DIMMs always going to be the only form factor for NV memory? > >> > >> And if you have multiple DIMMs, does each DT node correspond to a DIMM? > > > > A nvdimm-region might correspond to a single NVDIMM, a set of > > interleaved NVDIMMs, or it might just be a chunk of normal memory that > > you want treated as a NVDIMM for some reason. The last case is useful > > for provisioning install media on servers since it allows you do > > download a DVD image, turn it into an nvdimm-region, and kexec into > > the installer which can use it as a root disk. That may seem a little > > esoteric, but it's handy and we're using a full linux environment for > > our boot loader so it's easy to make use of. > > I'm really just asking if we should drop the "dimm" name because it is > not always a DIMM. Maybe pmem instead? I don't know, naming is > hard(TM). The Linux enabling uses the term "memory device". The Linux device object name for memory devices is "nmem". [..] > > special since the OS needs to know where it can allocate early in boot > > and I don't see non-volatile memory as being similarly significant. > > Fundamentally an NVDIMM is just a memory mapped storage device so we > > should be able to defer looking at them until later in boot. > > It's not clear if 'platform' is just an example or random name or what > the node is required to be called. In the latter case, we should be > much more specific because 'platform' could be anything. In the former > case, then we have no way to find or validate the node because the > name could be anything and there's no compatible property either. > > "region" is pretty generic too. > The term "nvdimm-region" has specific meaning to the libnvdimm sub-system. It is a contiguous physical address range backed by one or more memory devices, DIMMs, in an interleaved configuration (interleave set). 
One feature that is currently missing from libnvdimm is a management interface to change an interleave configuration. To date, Linux only reads a static region configuration published by platform firmware. Linux can provide dynamic provisioning of namespaces out of those regions, but interleave configuration has been left to vendor specific tooling to date. It would be great to start incorporating generic Linux support for that capability across platform firmware implementations.
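Pulling the thread's suggestions together — flat top-level nodes rather than a `platform` container, the `nvdimm-region` compatible, and the optional `volatile` property — a binding usage might look like the sketch below. Node names, addresses, and sizes are invented for illustration:

```dts
/ {
	#address-cells = <2>;
	#size-cells = <2>;

	/* One node per contiguous region (interleave set), not per DIMM. */
	pmem@100000000 {
		compatible = "nvdimm-region";
		reg = <0x1 0x00000000 0x0 0x40000000>;
	};

	/* Ordinary DRAM exposed as a region, e.g. for kexec install media.
	 * "volatile" tells the OS no cache flushing is needed after writes. */
	pmem@200000000 {
		compatible = "nvdimm-region";
		reg = <0x2 0x00000000 0x0 0x40000000>;
		volatile;
	};
};
```

This follows the structure of ordinary memory nodes, as suggested at the end of the thread, and sidesteps the question of naming the container node.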
Re: [PATCH] powerpc/powernv/nvram: opal_nvram_write handle unknown OPAL errors
On 03/27/2018 01:08 PM, Nicholas Piggin wrote:

On Tue, 27 Mar 2018 12:47:31 +0530 Vasant Hegde wrote:

On 03/26/2018 08:32 PM, Nicholas Piggin wrote:

opal_nvram_write currently just assumes success if it encounters an error other than OPAL_BUSY or OPAL_BUSY_EVENT. Have it return -EIO on other errors instead.

Signed-off-by: Nicholas Piggin
---
 arch/powerpc/platforms/powernv/opal-nvram.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-nvram.c b/arch/powerpc/platforms/powernv/opal-nvram.c
index 9db4398ded5d..13bf625dc3e8 100644
--- a/arch/powerpc/platforms/powernv/opal-nvram.c
+++ b/arch/powerpc/platforms/powernv/opal-nvram.c
@@ -59,6 +59,8 @@ static ssize_t opal_nvram_write(char *buf, size_t count, loff_t *index)
 		if (rc == OPAL_BUSY_EVENT)
 			opal_poll_events(NULL);

The current code does a continuous poll here. Maybe we should have a small breathing time here. What do you say?

Yeah, that's probably not a bad idea. I cc'ed the skiboot list -- what's a reasonable delay between retries?

I think it depends on the individual API. In the case of dump retrieval I have a 20ms delay, as the FSP takes time to copy data to host memory. But maybe here we can have a smaller delay.

Linux has a bunch of similar kinds of loops if you grep for opal_poll_events and OPAL_BUSY. It would be good to convert them all to a standard form with a standard delay as recommended by OPAL, and where specific calls have a different delay for a good reason, that would be documented in the OPAL API docs.

Yes. We should update the API documentation.

-Vasant
Re: [PATCH v9 08/24] mm: Protect VMA modifications using VMA sequence count
On 27/03/2018 23:57, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > >> diff --git a/mm/mmap.c b/mm/mmap.c >> index 5898255d0aeb..d6533cb85213 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -847,17 +847,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned >> long start, >> } >> >> if (start != vma->vm_start) { >> -vma->vm_start = start; >> +WRITE_ONCE(vma->vm_start, start); >> start_changed = true; >> } >> if (end != vma->vm_end) { >> -vma->vm_end = end; >> +WRITE_ONCE(vma->vm_end, end); >> end_changed = true; >> } >> -vma->vm_pgoff = pgoff; >> +WRITE_ONCE(vma->vm_pgoff, pgoff); >> if (adjust_next) { >> -next->vm_start += adjust_next << PAGE_SHIFT; >> -next->vm_pgoff += adjust_next; >> +WRITE_ONCE(next->vm_start, >> + next->vm_start + (adjust_next << PAGE_SHIFT)); >> +WRITE_ONCE(next->vm_pgoff, next->vm_pgoff + adjust_next); >> } >> >> if (root) { >> @@ -1781,6 +1782,7 @@ unsigned long mmap_region(struct file *file, unsigned >> long addr, >> out: >> perf_event_mmap(vma); >> >> +vm_write_begin(vma); >> vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT); >> if (vm_flags & VM_LOCKED) { >> if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || >> @@ -1803,6 +1805,7 @@ unsigned long mmap_region(struct file *file, unsigned >> long addr, >> vma->vm_flags |= VM_SOFTDIRTY; >> >> vma_set_page_prot(vma); >> +vm_write_end(vma); >> >> return addr; >> > > Shouldn't this also protect vma->vm_flags? Nice catch ! I just found that too while reviewing the entire patch to answer your previous email. 
> > diff --git a/mm/mmap.c b/mm/mmap.c > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1796,7 +1796,8 @@ unsigned long mmap_region(struct file *file, unsigned > long addr, > vma == get_gate_vma(current->mm))) > mm->locked_vm += (len >> PAGE_SHIFT); > else > - vma->vm_flags &= VM_LOCKED_CLEAR_MASK; > + WRITE_ONCE(vma->vm_flags, > +vma->vm_flags & VM_LOCKED_CLEAR_MASK); > } > > if (file) > @@ -1809,7 +1810,7 @@ unsigned long mmap_region(struct file *file, unsigned > long addr, >* then new mapped in-place (which must be aimed as >* a completely new data area). >*/ > - vma->vm_flags |= VM_SOFTDIRTY; > + WRITE_ONCE(vma->vm_flags, vma->vm_flags | VM_SOFTDIRTY); > > vma_set_page_prot(vma); > vm_write_end(vma); >
Re: [RFC PATCH] powerpc/xmon: Use OPAL_DEBUG to debug srest in OPAL
On 03/27/2018 12:58 PM, Nicholas Piggin wrote:

On Tue, 27 Mar 2018 12:42:32 +0530 Vasant Hegde wrote:

On 03/26/2018 08:39 PM, Nicholas Piggin wrote:

xmon can be entered via sreset NMI (from a management sreset, or an NMI IPI), which can interrupt OPAL. Add checks to xmon to see if pc or sp are within OPAL memory, and if so, then use OPAL_DEBUG to print the OPAL stack and return the Linux stack, which can then be dumped by xmon.

Nick, OPAL uses the FSP/cronus interface for many of its debug interfaces (like OPAL assert, getting the OPAL console, triggering FSP R/R, etc). Maybe in future we will add new debug capabilities.

It would be good to ensure an API could accommodate them, or at least not get in the way.

Agree. Once secureboot is enabled none of these interfaces work and we have limited debug capability. Here you are using a very generic API name (OPAL_DEBUG). Maybe we should have a generic interface (exported via debugfs?) here rather than an sreset-specific one.

OPAL_DEBUG here actually uses the sub-function OPAL_DEBUG_DUMP_STACK (1), but I didn't bring that constant across from skiboot, which I should have.

Nick, maybe we should define the sub-function usage. Also, the current API limits the number of arguments and their types. Maybe we should have argument 2 as "void *"? Something like: s64 opal_debug(u32 debug_type, void *data, u64 dsize); That way each individual sub-function can parse/use it based on its need.

But I don't think this is sreset specific. If Linux has any way to get an r1 for a CPU in OPAL, then it can use this function. If it does not, then it simply can't use it. I haven't really followed what's happening with secure boot, but presumably we can still get NMIs (at least machine check, even if all system reset sources are suppressed).

AFAIK secureboot won't block us here. It mostly blocks external entities (like FSP/cronus) from accessing host memory (they can't directly read or write host memory, SCOM operations are restricted, etc).

-Vasant
Re: [PATCH 6/6] doc/devicetree: NVDIMM region documentation
On Tue, Mar 27, 2018 at 9:53 AM, Oliverwrote: > On Tue, Mar 27, 2018 at 9:24 AM, Rob Herring wrote: >> On Fri, Mar 23, 2018 at 07:12:09PM +1100, Oliver O'Halloran wrote: >>> Add device-tree binding documentation for the nvdimm region driver. >>> >>> Cc: devicet...@vger.kernel.org >>> Signed-off-by: Oliver O'Halloran >>> --- >>> .../devicetree/bindings/nvdimm/nvdimm-region.txt | 45 >>> ++ >>> 1 file changed, 45 insertions(+) >>> create mode 100644 >>> Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt >>> >>> diff --git a/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt >>> b/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt >>> new file mode 100644 >>> index ..02091117ff16 >>> --- /dev/null >>> +++ b/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt >>> @@ -0,0 +1,45 @@ >>> +Device-tree bindings for NVDIMM memory regions >>> +- >>> + >>> +Non-volatile DIMMs are memory modules used to provide (cacheable) main >>> memory >> >> Are DIMMs always going to be the only form factor for NV memory? >> >> And if you have multiple DIMMs, does each DT node correspond to a DIMM? > > A nvdimm-region might correspond to a single NVDIMM, a set of > interleaved NVDIMMs, or it might just be a chunk of normal memory that > you want treated as a NVDIMM for some reason. The last case is useful > for provisioning install media on servers since it allows you do > download a DVD image, turn it into an nvdimm-region, and kexec into > the installer which can use it as a root disk. That may seem a little > esoteric, but it's handy and we're using a full linux environment for > our boot loader so it's easy to make use of. I'm really just asking if we should drop the "dimm" name because it is not always a DIMM. Maybe pmem instead? I don't know, naming is hard(TM). >> If not, then what if we want/need to provide power control to a DIMM? > > That would require a DIMM (and probably memory controller) specific > driver. 
I've deliberately left out how regions are mapped back to > DIMMs from the binding since it's not really clear to me how that > should work. A phandle array pointing to each DIMM device (which could > be anything) would do the trick, but I've found that a bit awkward to > plumb into the model that libnvdimm expects. > >>> +that retains its contents across power cycles. In more practical terms, >>> they >>> +are kind of storage device where the contents can be accessed by the CPU >>> +directly, rather than indirectly via a storage controller or similar. The >>> an >>> +nvdimm-region specifies a physical address range that is hosted on an >>> NVDIMM >>> +device. >>> + >>> +Bindings for the region nodes: >>> +- >>> + >>> +Required properties: >>> + - compatible = "nvdimm-region" >>> + >>> + - reg = ; >>> + The system physical address range of this nvdimm region. >>> + >>> +Optional properties: >>> + - Any relevant NUMA assocativity properties for the target platform. >>> + - A "volatile" property indicating that this region is actually in >>> + normal DRAM and does not require cache flushes after each write. >>> + >>> +A complete example: >>> + >>> + >>> +/ { >>> + #size-cells = <2>; >>> + #address-cells = <2>; >>> + >>> + platform { >> >> Perhaps we need a more well defined node here. Like we have 'memory' for >> memory nodes. > > I think treating it as a platform device is fine. Memory nodes are Platform device is a Linux term... > special since the OS needs to know where it can allocate early in boot > and I don't see non-volatile memory as being similarly significant. > Fundamentally an NVDIMM is just a memory mapped storage device so we > should be able to defer looking at them until later in boot. It's not clear if 'platform' is just an example or random name or what the node is required to be called. In the latter case, we should be much more specific because 'platform' could be anything. 
In the former case, then we have no way to find or validate the node because the name could be anything and there's no compatible property either. "region" is pretty generic too. > > That said you might have problems with XIP kernels and what not. I > think that problem is better solved through other means though. > >>> + region@5000 { >>> + compatible = "nvdimm-region; >>> + reg = <0x0001 0x 0x 0x4000> >>> + >>> + }; >>> + >>> + region@6000 { >>> + compatible = "nvdimm-region"; >>> + reg = <0x0001 0x 0x 0x4000> Thinking about this some more, the 2 levels of nodes is pointless. Just follow memory nodes structure. nv-memory@1 { compatible = "nvdimm-region"; reg =
Re: RFC on writel and writel_relaxed
On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote: > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote: > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > > > powerpc and ARM can't quite make them synchronous I think, but at least > > > > they should have the same semantics as writel. > > > > > > One thing that ARM does IIRC is that it only guarantees to order writel() > > > within > > > one device, and the memory mapped PCI I/O space window almost certainly > > > counts as a separate device to the CPU. > > > > That sounds bogus. > > To elaborate, if you do the following on arm: > > writel(DEVICE_FOO); > writel(DEVICE_BAR); > > we generally cannot guarantee in which order those accesses will hit the > devices even if we add every barrier under the sun. You'd need something > in between, specific to DEVICE_FOO (probably a read-back) to really push > the first write out. This doesn't sound like it would be that uncommon to > me. The PCI posted write does not require the above to execute 'in order' only that any bus segment shared by the two devices have the writes issued in CPU order. ie at a shared PCI root port for instance. If I recall this is very similar to the ordering that ARM's on-chip AXI interconnect is supposed to provide.. So I'd be very surprised if a modern ARM64 has an meaningful difference from x86 here. When talking about ordering between the devices, the relevant question is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA from the DEVICE_FOO. 'ordered' means that in this case writel(DEVICE_FOO) must be presented to FOO before anything generated by BAR. Jason
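The read-back workaround Will alludes to can be made concrete. The sketch below uses made-up device names and register offsets; it shows how a driver would force the posted write to DEVICE_FOO to complete before kicking DEVICE_BAR into DMAing from it:

```c
/* All names here (foo_base, FOO_DESC, BAR_DOORBELL, ...) are hypothetical. */
writel(desc_addr, foo_base + FOO_DESC);   /* posted write toward FOO */
(void)readl(foo_base + FOO_STATUS);       /* read-back from the SAME device:
                                           * it cannot complete until the
                                           * posted write has reached FOO */
writel(1, bar_base + BAR_DOORBELL);       /* BAR may now safely DMA from FOO */
```

The read-back works because a read response must come from the device itself, which (per PCI ordering rules) pushes any posted writes ahead of it; nothing weaker than that pins down *when* the first write lands.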
Re: [PATCH v9 08/24] mm: Protect VMA modifications using VMA sequence count
On 27/03/2018 23:45, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > >> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >> index 65ae54659833..a2d9c87b7b0b 100644 >> --- a/fs/proc/task_mmu.c >> +++ b/fs/proc/task_mmu.c >> @@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, >> const char __user *buf, >> goto out_mm; >> } >> for (vma = mm->mmap; vma; vma = vma->vm_next) { >> -vma->vm_flags &= ~VM_SOFTDIRTY; >> +vm_write_begin(vma); >> +WRITE_ONCE(vma->vm_flags, >> + vma->vm_flags & >> ~VM_SOFTDIRTY); >> vma_set_page_prot(vma); >> +vm_write_end(vma); >> } >> downgrade_write(>mmap_sem); >> break; >> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c >> index cec550c8468f..b8212ba17695 100644 >> --- a/fs/userfaultfd.c >> +++ b/fs/userfaultfd.c >> @@ -659,8 +659,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct >> list_head *fcs) >> >> octx = vma->vm_userfaultfd_ctx.ctx; >> if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) { >> +vm_write_begin(vma); >> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; >> -vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING); >> +WRITE_ONCE(vma->vm_flags, >> + vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING)); >> +vm_write_end(vma); >> return 0; >> } >> > > In several locations in this patch vm_write_begin(vma) -> > vm_write_end(vma) is nesting things other than vma->vm_flags, > vma->vm_policy, etc. I think it's better to do vm_write_end(vma) as soon > as the members that the seqcount protects are modified. In other words, > this isn't offering protection for vma->vm_userfaultfd_ctx. There are > several examples of this in the patch. That's true in this particular case, and I could change that to not include the change to vm_userfaultfd_ctx. This being said, I don't think this will have a major impact, but I'll make a close review on this patch to be sure there is too large protected part of code. 
>> @@ -885,8 +888,10 @@ static int userfaultfd_release(struct inode *inode, >> struct file *file) >> vma = prev; >> else >> prev = vma; >> -vma->vm_flags = new_flags; >> +vm_write_begin(vma); >> +WRITE_ONCE(vma->vm_flags, new_flags); >> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; >> +vm_write_end(vma); >> } >> up_write(>mmap_sem); >> mmput(mm); >> @@ -1434,8 +1439,10 @@ static int userfaultfd_register(struct >> userfaultfd_ctx *ctx, >> * the next vma was merged into the current one and >> * the current one has not been updated yet. >> */ >> -vma->vm_flags = new_flags; >> +vm_write_begin(vma); >> +WRITE_ONCE(vma->vm_flags, new_flags); >> vma->vm_userfaultfd_ctx.ctx = ctx; >> +vm_write_end(vma); >> >> skip: >> prev = vma; >> @@ -1592,8 +1599,10 @@ static int userfaultfd_unregister(struct >> userfaultfd_ctx *ctx, >> * the next vma was merged into the current one and >> * the current one has not been updated yet. >> */ >> -vma->vm_flags = new_flags; >> +vm_write_begin(vma); >> +WRITE_ONCE(vma->vm_flags, new_flags); >> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; >> +vm_write_end(vma); >> >> skip: >> prev = vma; >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index b7e2268dfc9a..32314e9e48dd 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -1006,6 +1006,7 @@ static void collapse_huge_page(struct mm_struct *mm, >> if (mm_find_pmd(mm, address) != pmd) >> goto out; >> >> +vm_write_begin(vma); >> anon_vma_lock_write(vma->anon_vma); >> >> pte = pte_offset_map(pmd, address); >> @@ -1041,6 +1042,7 @@ static void collapse_huge_page(struct mm_struct *mm, >> pmd_populate(mm, pmd, pmd_pgtable(_pmd)); >> spin_unlock(pmd_ptl); >> anon_vma_unlock_write(vma->anon_vma); >> +vm_write_end(vma); >> result = SCAN_FAIL; >> goto out; >> } >> @@ -1075,6 +1077,7 @@ static void collapse_huge_page(struct mm_struct *mm, >> set_pmd_at(mm, address, pmd, _pmd); >> update_mmu_cache_pmd(vma, address, pmd); >> spin_unlock(pmd_ptl); >> +vm_write_end(vma); >> >> *hpage = NULL; >> >> 
diff --git a/mm/madvise.c
Re: RFC on writel and writel_relaxed
On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) David Miller wrote:

> From: Benjamin Herrenschmidt
> Date: Thu, 29 Mar 2018 02:13:16 +1100
>
> > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > the archs are unused or dead anyway.
>
> Agreed.

While we're making decrees here, can we do something about mmiowb? The semantics are basically indecipherable. This is a variation on the mandatory write barrier that causes writes to weakly ordered I/O regions to be partially ordered. Its effects may go beyond the CPU->Hardware interface and actually affect the hardware at some level. How can a driver writer possibly get that right?

IIRC it was added for some big ia64 system on which it was really expensive to implement the proper wmb() semantics. So the wmb() semantics were quietly downgraded, then the subsequently broken drivers they cared about were fixed by adding the stronger mmiowb().

What should have happened was that wmb and writel remained correct, sane, and expensive, and they added an mmio_wmb() to order MMIO stores made by the writel_relaxed accessors, then used that to speed up the few drivers they care about.

Now that ia64 doesn't matter too much, can we deprecate mmiowb and just make wmb ordering talk about stores to the device, not to some intermediate stage of the interconnect where it can be subsequently reordered wrt the device? Drivers can be converted back to using wmb or writel gradually.

Thanks,
Nick
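For context, the pattern mmiowb() was introduced for looks roughly like this (a sketch with invented names; on most architectures mmiowb() is a no-op, but on the large ia64 systems in question it kept one CPU's MMIO stores from being overtaken on the interconnect after the spinlock was handed to another CPU):

```c
spin_lock(&dev->hw_lock);
writel(val, dev->regs + REG_CMD);   /* MMIO store to a weakly ordered window */
mmiowb();                           /* order the store before the unlock, so
                                     * the next hw_lock holder's MMIO cannot
                                     * reach the device ahead of this one */
spin_unlock(&dev->hw_lock);
```

The complaint above is precisely that this burden falls on every driver writer instead of being folded into writel()/spin_unlock() semantics.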
RE: RFC on writel and writel_relaxed
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 16:13
...
> > I've always wondered exactly what the twi;isync were for - always seemed
> > very heavy handed for most mmio reads.
> > Particularly if you are doing mmio reads from a data fifo.
>
> If you do that you should use the "s" version of the accessors. Those
> will only do the above trick at the end of the access series. Also a
> FIFO needs special care about endianness anyway, so you should use
> those accessors regardless. (Hint: you never endian swap a FIFO even on
> BE on a LE device, unless something's been wired very badly in HW).

That was actually a 64 bit wide fifo connected to a 16 bit wide PIO interface. Reading the high address 'clocked' the fifo, so the first 3 reads could happen in any order, but the 4th had to be last. This is a small ppc and we shovel a lot of data through that fifo.

Whether it needed byteswapping depended completely on how our hardware people had built the pcb (not made easy by some docs using the ibm bit numbering). In fact it didn't.

While that driver only had to run on a very specific small ppc, generic drivers might have similar issues.

I suspect that writel() is always (or should always be):

	barrier_before_writel()
	writel_relaxed()
	barrier_after_writel()

So if a driver needs to do multiple writes (without strong ordering) it should be able to repeat the writel_relaxed() with only one set of barriers. Similarly for readl(). In addition, a lesser barrier is probably enough between a readl_relaxed() and a writel_relaxed() that is conditional on the read value.

David
Re: RFC on writel and writel_relaxed
From: Benjamin Herrenschmidt
Date: Thu, 29 Mar 2018 02:13:16 +1100

> Let's fix all archs, it's way easier than fixing all drivers. Half of
> the archs are unused or dead anyway.

Agreed.
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Wed, Mar 28, 2018 at 04:36:05PM +0300, Yury Norov wrote: > On Mon, Mar 26, 2018 at 05:45:55AM -0700, Paul E. McKenney wrote: > > On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote: > > > On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote: > > > > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote: > > > > > kick_all_cpus_sync() forces all CPUs to sync caches by sending > > > > > broadcast IPI. > > > > > If CPU is in extended quiescent state (idle task or nohz_full > > > > > userspace), this > > > > > work may be done at the exit of this state. Delaying synchronization > > > > > helps to > > > > > save power if CPU is in idle state and decrease latency for real-time > > > > > tasks. > > > > > > > > > > This patch introduces kick_active_cpus_sync() and uses it in mm/slab > > > > > and arm64 > > > > > code to delay syncronization. > > > > > > > > > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to the > > > > > CPU running > > > > > isolated task would be fatal, as it breaks isolation. The approach > > > > > with delaying > > > > > of synchronization work helps to maintain isolated state. > > > > > > > > > > I've tested it with test from task isolation series on ThunderX2 for > > > > > more than > > > > > 10 hours (10k giga-ticks) without breaking isolation. > > > > > > > > > > Signed-off-by: Yury Norov> > > > > --- > > > > > arch/arm64/kernel/insn.c | 2 +- > > > > > include/linux/smp.h | 2 ++ > > > > > kernel/smp.c | 24 > > > > > mm/slab.c| 2 +- > > > > > 4 files changed, 28 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c > > > > > index 2718a77da165..9d7c492e920e 100644 > > > > > --- a/arch/arm64/kernel/insn.c > > > > > +++ b/arch/arm64/kernel/insn.c > > > > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void > > > > > *addrs[], u32 insns[], int cnt) > > > > >* synchronization. 
> > > > >*/ > > > > > ret = aarch64_insn_patch_text_nosync(addrs[0], > > > > > insns[0]); > > > > > - kick_all_cpus_sync(); > > > > > + kick_active_cpus_sync(); > > > > > return ret; > > > > > } > > > > > } > > > > > diff --git a/include/linux/smp.h b/include/linux/smp.h > > > > > index 9fb239e12b82..27215e22240d 100644 > > > > > --- a/include/linux/smp.h > > > > > +++ b/include/linux/smp.h > > > > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct cpumask > > > > > *mask, > > > > > smp_call_func_t func, void *info, int wait); > > > > > > > > > > void kick_all_cpus_sync(void); > > > > > +void kick_active_cpus_sync(void); > > > > > void wake_up_all_idle_cpus(void); > > > > > > > > > > /* > > > > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask *mask, > > > > > smp_call_func_t func, > > > > > } > > > > > > > > > > static inline void kick_all_cpus_sync(void) { } > > > > > +static inline void kick_active_cpus_sync(void) { } > > > > > static inline void wake_up_all_idle_cpus(void) { } > > > > > > > > > > #ifdef CONFIG_UP_LATE_INIT > > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > > > > index 084c8b3a2681..0358d6673850 100644 > > > > > --- a/kernel/smp.c > > > > > +++ b/kernel/smp.c > > > > > @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void) > > > > > } > > > > > EXPORT_SYMBOL_GPL(kick_all_cpus_sync); > > > > > > > > > > +/** > > > > > + * kick_active_cpus_sync - Force CPUs that are not in extended > > > > > + * quiescent state (idle or nohz_full userspace) sync by sending > > > > > + * IPI. Extended quiescent state CPUs will sync at the exit of > > > > > + * that state. 
> > > > > + */ > > > > > +void kick_active_cpus_sync(void) > > > > > +{ > > > > > + int cpu; > > > > > + struct cpumask kernel_cpus; > > > > > + > > > > > + smp_mb(); > > > > > + > > > > > + cpumask_clear(_cpus); > > > > > + preempt_disable(); > > > > > + for_each_online_cpu(cpu) { > > > > > + if (!rcu_eqs_special_set(cpu)) > > > > > > > > If we get here, the CPU is not in a quiescent state, so we therefore > > > > must IPI it, correct? > > > > > > > > But don't you also need to define rcu_eqs_special_exit() so that RCU > > > > can invoke it when it next leaves its quiescent state? Or are you able > > > > to ignore the CPU in that case? (If you are able to ignore the CPU in > > > > that case, I could give you a lower-cost function to get your job done.) > > > > > > > > Thanx, Paul > > > > > > What's actually needed for synchronization is issuing memory barrier on > > > target > > > CPUs before we start executing kernel code. > > > > > > smp_mb() is implicitly called in smp_call_function*() path for it. In > > >
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 07:41 -0400, ok...@codeaurora.org wrote:
> Yes, we want to get there indeed. It is because of some arch not
> implementing writel properly. Maintainers want to play safe.
>
> That is why I asked if IA64 and other well known archs follow the
> strongly ordered rule at this moment like PPC and ARM.
>
> Or should we go and inform every arch about this before yanking wmb()?
>
> Maintainers are afraid of introducing a regression.

Let's fix all archs, it's way easier than fixing all drivers. Half of
the archs are unused or dead anyway.

> > The above code makes no sense, and just looks stupid to me. It also
> > generates pointlessly bad code on x86, so it's bad there too.
> >
> > Linus
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 11:30 +, David Laight wrote:
> From: Benjamin Herrenschmidt
> > Sent: 28 March 2018 10:56
> > ...
> > For example, let's say I have a device with a reset bit and the spec
> > says the reset bit needs to be set for at least 10us.
> >
> > This is wrong:
> >
> > writel(1, RESET_REG);
> > usleep(10);
> > writel(0, RESET_REG);
> >
> > Because of write posting, the first write might arrive to the device
> > right before the second one.
> >
> > The typical "fix" is to turn that into:
> >
> > writel(1, RESET_REG);
> > readl(RESET_REG); /* Flush posted writes */
>
> Would a writel(1, RESET_REG) here provide enough synchronisation?

Probably yes. It's one of those things where you try to deal with the
fact that 90% of driver writers barely understand the basic stuff and so
you need the "default" accessors to be hardened as much as possible.

We still need to get a reasonable definition of the semantics of the
relaxed ones vs. WC memory, but let's get through that exercise first
and hopefully for the last time.

> > usleep(10);
> > writel(0, RESET_REG);
> >
> > *However* the issue here, at least on power, is that the CPU can issue
> > that readl but doesn't necessarily wait for it to complete (ie, the
> > data to return), before proceeding to the usleep. Now a usleep contains
> > a bunch of loads and stores and is probably fine, but a udelay which
> > just loops on the timebase may not be.
> >
> > Thus we may still violate the timing requirement.
>
> I've seen that sort of code (with udelay() and no read back) quite often.
> How many were in linux I don't know.
>
> For small delays I usually fix it by repeated writes (of the same value)
> to the device register. That can guarantee very short intervals.

As long as you know the bus frequency...

> The only time I've actually seen buffered writes break timing was
> between a 286 and an 8259 interrupt controller.

:-)

The problem for me is not so much what I've seen, I realize that most of
the issues we are talking about are the kind that will hit once in a
thousand times or less. But we *can* reason about them in a way that can
effectively prevent the problem completely, and when your cluster has
1 machine, 1/1000 starts becoming significant.

These days the vast majority of IO devices either are 99% DMA driven so
that a bit of overhead on MMIOs is irrelevant, or have one fast path
(low latency IB etc...) that needs some finer control, and the rest is
all setup which can be paranoid at will.

So I think we should aim for the "default" accessors most people use to
be as hardened as we can think of. I favor correctness over performance
in all cases.

But then we also define a reasonable semantic for the relaxed ones
(well, we sort-of do have one, we might have to make it a bit more
precise in some areas) that allows the few MMIO fast paths that care to
be optimized.

> If you wrote to the mask then enabled interrupts the first IACK cycle
> could be too close to write and break the cycle recovery time.
> That clobbered most of the interrupt controller registers.
> That probably affected every 286 board ever built!
> Not sure how much software added the required extra bus cycle.
>
> > What we did inside readl, with the twi;isync sequence (which basically
> > means, trap on return value with "trap never" as a condition, followed
> > by isync that ensures all exception conditions are resolved), is force
> > the CPU to "consume" the data from the read before moving on.
> >
> > This effectively makes readl fully synchronous (we would probably avoid
> > that if we were to implement a readl_relaxed).
>
> I've always wondered exactly what the twi;isync were for - always seemed
> very heavy handed for most mmio reads.
> Particularly if you are doing mmio reads from a data fifo.

If you do that you should use the "s" version of the accessors.

Those will only do the above trick at the end of the access series.

Also a FIFO needs special care about endianness anyway, so you should
use those accessors regardless. (Hint: you never endian swap a FIFO even
on BE on a LE device, unless something's been wired very badly in HW).

> Perhaps there should be a writel_wait() that is allowed to do a read back
> for such code paths?

I think what we have is fine: we just define the standard writel/readl
as fairly simple and hardened, and we look at providing a somewhat
reasonable set of relaxed variants for optimizing fast paths. We pretty
much already are there, we just need to be better at defining the
semantics.

And for the super high perf case, which thankfully is either seldom
(server high perf network stuff) or very arch specific (ARM SoC stuff),
arch specific driver hacks will always remain the norm.

Cheers,
Ben.

> David
>
[RFC PATCH 3/3] powerpc/64s: always flush non-local CPUs from single threaded mms
Go one step further, if we're going to put a tlbie on the bus at all,
make it count. Always flush all others and restore our mm to a local
one.
---
 arch/powerpc/mm/tlb-radix.c | 45 +++--
 1 file changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index f00acacf48f1..ba48539e799e 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -424,10 +424,16 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
 		return;
 
 	preempt_disable();
-	if (!mm_is_thread_local(mm))
-		_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
-	else
+	if (mm_is_thread_local(mm)) {
 		_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
+	} else {
+		if (mm_is_singlethreaded(mm)) {
+			_tlbie_pid(pid, RIC_FLUSH_ALL);
+			mm_reset_thread_local(mm);
+		} else {
+			_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
+		}
+	}
 	preempt_enable();
 }
 
@@ -496,14 +502,14 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 			nr_pages > tlb_single_page_flush_ceiling);
 	}
 
-	if (full) {
+	if (!local && mm_is_singlethreaded(mm)) {
+		_tlbie_pid(pid, RIC_FLUSH_ALL);
+		mm_reset_thread_local(mm);
+	} else if (full) {
 		if (local) {
 			_tlbiel_pid(pid, RIC_FLUSH_TLB);
 		} else {
-			if (mm_is_singlethreaded(mm)) {
-				_tlbie_pid(pid, RIC_FLUSH_ALL);
-				mm_reset_thread_local(mm);
-			} else if (mm_needs_flush_escalation(mm)) {
+			if (mm_needs_flush_escalation(mm)) {
 				_tlbie_pid(pid, RIC_FLUSH_ALL);
 			} else {
 				_tlbie_pid(pid, RIC_FLUSH_TLB);
@@ -618,19 +624,17 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 			nr_pages > tlb_single_page_flush_ceiling);
 	}
 
-	if (full) {
+	if (!local && mm_is_singlethreaded(mm)) {
+		_tlbie_pid(pid, RIC_FLUSH_ALL);
+		mm_reset_thread_local(mm);
+	} else if (full) {
 		if (local) {
 			_tlbiel_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
 		} else {
-			if (mm_is_singlethreaded(mm)) {
-				_tlbie_pid(pid, RIC_FLUSH_ALL);
-				mm_reset_thread_local(mm);
-			} else {
-				if (mm_needs_flush_escalation(mm))
-					also_pwc = true;
+			if (mm_needs_flush_escalation(mm))
+				also_pwc = true;
 
-				_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
-			}
+			_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
 		}
 	} else {
 		if (local)
@@ -676,7 +680,12 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
 	if (mm_is_thread_local(mm)) {
 		_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
 	} else {
-		_tlbie_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
+		if (mm_is_singlethreaded(mm)) {
+			_tlbie_pid(pid, RIC_FLUSH_ALL);
+			mm_reset_thread_local(mm);
+		} else {
+			_tlbie_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
+		}
 	}
 	preempt_enable();
-- 
2.16.1
[RFC PATCH 2/3] powerpc/64s/radix: reset mm_cpumask for single thread process when possible
When a single-threaded process has a non-local mm_cpumask and requires
a full PID tlbie invalidation, use that as an opportunity to reset the
cpumask back to the current CPU we're running on.

No other thread can concurrently switch to this mm, because it must
have had a reference on mm_users before it could use_mm. mm_users can
be asynchronously incremented e.g., by mmget_not_zero, but those users
should not be doing use_mm.
---
 arch/powerpc/include/asm/mmu_context.h | 33 +++-
 arch/powerpc/include/asm/tlb.h | 7 +
 arch/powerpc/mm/tlb-radix.c | 57 +-
 3 files changed, 75 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 3a15b6db9501..9d373c8fe9fa 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -6,6 +6,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -102,13 +103,6 @@ static inline void mm_context_add_copro(struct mm_struct *mm)
 
 static inline void mm_context_remove_copro(struct mm_struct *mm)
 {
-	int c;
-
-	c = atomic_dec_if_positive(&mm->context.copros);
-
-	/* Detect imbalance between add and remove */
-	WARN_ON(c < 0);
-
 	/*
 	 * Need to broadcast a global flush of the full mm before
 	 * decrementing active_cpus count, as the next TLBI may be
@@ -119,10 +113,15 @@ static inline void mm_context_remove_copro(struct mm_struct *mm)
 	 * for the time being. Invalidations will remain global if
 	 * used on hash.
 	 */
-	if (c == 0 && radix_enabled()) {
+	if (radix_enabled()) {
 		flush_all_mm(mm);
 		dec_mm_active_cpus(mm);
 	}
+
+	/* Detect imbalance between add and remove */
+	WARN_ON(!atomic_read(&mm->context.copros));
+	atomic_dec(&mm->context.copros);
+
 }
 #else
 static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
@@ -162,6 +161,24 @@ static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 static inline void enter_lazy_tlb(struct mm_struct *mm,
 				  struct task_struct *tsk)
 {
+#ifdef CONFIG_PPC_BOOK3S_64
+	/*
+	 * Under radix, we do not want to keep lazy PIDs around because
+	 * even if the CPU does not access userspace, it can still bring
+	 * in translations through speculation and prefetching.
+	 *
+	 * Switching away here allows us to trim back the mm_cpumask in
+	 * cases where we know the process is not running on some CPUs
+	 * (see mm/tlb-radix.c).
+	 */
+	if (radix_enabled() && mm != &init_mm) {
+		mmgrab(&init_mm);
+		tsk->active_mm = &init_mm;
+		switch_mm_irqs_off(mm, tsk->active_mm, tsk);
+		mmdrop(mm);
+	}
+#endif
+
 	/* 64-bit Book3E keeps track of current PGD in the PACA */
 #ifdef CONFIG_PPC_BOOK3E_64
 	get_paca()->pgd = NULL;
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index a7eabff27a0f..c4f43bcc6494 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -76,6 +76,13 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
 		return false;
 	return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
+static inline void mm_reset_thread_local(struct mm_struct *mm)
+{
+	WARN_ON(!(atomic_read(&mm->mm_users) == 1 && current->mm == mm));
+	atomic_set(&mm->context.active_cpus, 1);
+	cpumask_clear(mm_cpumask(mm));
+	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
+}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index a07f5372a4bf..f00acacf48f1 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -341,6 +341,15 @@ void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmadd
 }
 EXPORT_SYMBOL(radix__local_flush_tlb_page);
 
+static bool mm_is_singlethreaded(struct mm_struct *mm)
+{
+	if (atomic_read(&mm->context.copros) > 0)
+		return false;
+	if (atomic_read(&mm->mm_users) == 1 && current->mm == mm)
+		return true;
+	return false;
+}
+
 static bool mm_needs_flush_escalation(struct mm_struct *mm)
 {
 	/*
@@ -348,7 +357,9 @@ static bool mm_needs_flush_escalation(struct mm_struct *mm)
 	 * caching PTEs and not flushing them properly when
 	 * RIC = 0 for a PID/LPID invalidate
 	 */
-	return atomic_read(&mm->context.copros) != 0;
+	if (atomic_read(&mm->context.copros) > 0)
+		return true;
+	return false;
 }
 
 #ifdef CONFIG_SMP
@@ -362,12 +373,17 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 
 	preempt_disable();
 	if (!mm_is_thread_local(mm)) {
-		if
[RFC PATCH 1/3] powerpc/64s: do not flush TLB when relaxing access
Book3S does not require TLB flushing when protection is being relaxed.
From Power ISA v3.0B, p.1090:

  Setting a Reference or Change Bit or Upgrading Access Authority
  (PTE Subject to Atomic Hardware Updates)

  If the only change being made to a valid PTE that is subject to
  atomic hardware updates is to set the Reference or Change bit to 1
  or to add access authorities, a simpler sequence suffices because
  the translation hardware will refetch the PTE if an access is
  attempted for which the only problems were reference and/or change
  bits needing to be set or insufficient access authority.

Signed-off-by: Nicholas Piggin
---
 arch/powerpc/mm/pgtable-book3s64.c | 1 -
 arch/powerpc/mm/pgtable.c | 3 ++-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 422e80253a33..de49cedcbc84 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -40,7 +40,6 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 	if (changed) {
 		__ptep_set_access_flags(vma->vm_mm, pmdp_ptep(pmdp),
 					pmd_pte(entry), address);
-		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	}
 	return changed;
 }
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9f361ae571e9..5b07a626df5b 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -224,7 +224,8 @@ int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 		if (!is_vm_hugetlb_page(vma))
 			assert_pte_locked(vma->vm_mm, address);
 		__ptep_set_access_flags(vma->vm_mm, ptep, entry, address);
-		flush_tlb_page(vma, address);
+		if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64))
+			flush_tlb_page(vma, address);
 	}
 	return changed;
 }
-- 
2.16.1
[RFC PATCH 0/3] powerpc tlbie reductions again
Last time this came up, there was concern about whether we can trim the
mm cpumask, and what concurrency there is vs use_mm(). I've had more of
a look and still think this is okay. I haven't thought of a good way to
add debug checks to ensure it though.

When doing a parallel kernel build on a 2 socket P9 system, this series
causes tlbie (broadcast) to go from 1.37 million (22k/sec) to 181
thousand (3k/sec). tlbiel (local) flushes increase from 20.2 to 23.7
million, because flushing a PID requires 128 tlbiel (vs 1 tlbie), and
also because we set the cutoff for switching from va range to full PID
flush higher when doing tlbiel.

End result performance was very little changed, a tiny improvement maybe
but well under 1%. Kernel compile mostly stays off the interconnect,
this is a small system, and there is no nMMU involvement; any of these
factors could make broadcast tlbie reduction more important.

Remaining work: ensuring correctness of this stuff, implementations for
hash, understanding and testing nMMU cases better, using IPIs for
some/all types of invalidations, then possibly looking at doing
something more fancy with the PID allocator.

Nicholas Piggin (3):
  powerpc/64s: do not flush TLB when relaxing access
  powerpc/64s/radix: reset mm_cpumask for single thread process when
    possible
  powerpc/64s: always flush non-local CPUs from single threaded mms

 arch/powerpc/include/asm/mmu_context.h | 33 ++
 arch/powerpc/include/asm/tlb.h | 7 +++
 arch/powerpc/mm/pgtable-book3s64.c | 1 -
 arch/powerpc/mm/pgtable.c | 3 +-
 arch/powerpc/mm/tlb-radix.c | 78 +-
 5 files changed, 92 insertions(+), 30 deletions(-)

-- 
2.16.1
Re: [PATCH v3 41/41] cxlflash: Handle spurious interrupts
On Mon, Mar 26, 2018 at 11:35:42AM -0500, Uma Krishnan wrote:
> The following Oops can occur when there is heavy I/O traffic and the host
> is reset by a tool such as sg_reset.
>
> [c000200fff3fbc90] c0081690117c process_cmd_doneq+0x104/0x500 [cxlflash] (unreliable)
> [c000200fff3fbd80] c00816901648 cxlflash_rrq_irq+0xd0/0x150 [cxlflash]
> [c000200fff3fbde0] c0193130 __handle_irq_event_percpu+0xa0/0x310
> [c000200fff3fbea0] c01933d8 handle_irq_event_percpu+0x38/0x90
> [c000200fff3fbee0] c0193494 handle_irq_event+0x64/0xb0
> [c000200fff3fbf10] c0198ea0 handle_fasteoi_irq+0xc0/0x230
> [c000200fff3fbf40] c019182c generic_handle_irq+0x4c/0x70
> [c000200fff3fbf60] c001794c __do_irq+0x7c/0x1c0
> [c000200fff3fbf90] c002a390 call_do_irq+0x14/0x24
> [c000200e5828fab0] c0017b2c do_IRQ+0x9c/0x130
> [c000200e5828fb00] c0009b04 h_virt_irq_common+0x114/0x120
>
> When a context is reset, the pending commands are flushed and the AFU
> is notified. Before the AFU handles this request there could be command
> completion interrupts queued to PHB which are yet to be delivered to the
> context. In this scenario, a context could receive an interrupt for a
> command that has been flushed, leading to a possible crash when the memory
> for the flushed command is accessed.
>
> To resolve this problem, a boolean will indicate if the hardware queue is
> ready to process interrupts or not. This can be evaluated in the interrupt
> handler before processing an interrupt.
>
> Signed-off-by: Uma Krishnan

Acked-by: Matthew R. Ochs
Re: [PATCH v3 40/41] cxlflash: Remove commands from pending list on timeout
On Mon, Mar 26, 2018 at 11:35:34AM -0500, Uma Krishnan wrote:
> The following Oops can occur if an internal command sent to the AFU does
> not complete within the timeout:
>
> [c00ff101b810] c00816020d94 term_mc+0xfc/0x1b0 [cxlflash]
> [c00ff101b8a0] c00816020fb0 term_afu+0x168/0x280 [cxlflash]
> [c00ff101b930] c008160232ec cxlflash_pci_error_detected+0x184/0x230 [cxlflash]
> [c00ff101b9e0] c0080d95d468 cxl_vphb_error_detected+0x90/0x150 [cxl]
> [c00ff101ba20] c0080d95f27c cxl_pci_error_detected+0xa4/0x240 [cxl]
> [c00ff101bac0] c003eaf8 eeh_report_error+0xd8/0x1b0
> [c00ff101bb20] c003d0b8 eeh_pe_dev_traverse+0x98/0x170
> [c00ff101bbb0] c003f438 eeh_handle_normal_event+0x198/0x580
> [c00ff101bc60] c003fba4 eeh_handle_event+0x2a4/0x338
> [c00ff101bd10] c00400b8 eeh_event_handler+0x1f8/0x200
> [c00ff101bdc0] c013da48 kthread+0x1a8/0x1b0
> [c00ff101be30] c000b528 ret_from_kernel_thread+0x5c/0xb4
>
> When an internal command times out, the command buffer is freed while it
> is still in the pending commands list of the context. This corrupts the
> list and when the context is cleaned up, a crash is encountered.
>
> To resolve this issue, when an AFU command or TMF command times out, the
> command should be deleted from the hardware queue pending command list
> before freeing the buffer.
>
> Signed-off-by: Uma Krishnan

Acked-by: Matthew R. Ochs
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Wed, Mar 28, 2018 at 05:41:40PM +0300, Yury Norov wrote: > On Wed, Mar 28, 2018 at 06:56:17AM -0700, Paul E. McKenney wrote: > > On Wed, Mar 28, 2018 at 04:36:05PM +0300, Yury Norov wrote: > > > On Mon, Mar 26, 2018 at 05:45:55AM -0700, Paul E. McKenney wrote: > > > > On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote: > > > > > On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote: > > > > > > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote: > > > > > > > kick_all_cpus_sync() forces all CPUs to sync caches by sending > > > > > > > broadcast IPI. > > > > > > > If CPU is in extended quiescent state (idle task or nohz_full > > > > > > > userspace), this > > > > > > > work may be done at the exit of this state. Delaying > > > > > > > synchronization helps to > > > > > > > save power if CPU is in idle state and decrease latency for > > > > > > > real-time tasks. > > > > > > > > > > > > > > This patch introduces kick_active_cpus_sync() and uses it in > > > > > > > mm/slab and arm64 > > > > > > > code to delay syncronization. > > > > > > > > > > > > > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to > > > > > > > the CPU running > > > > > > > isolated task would be fatal, as it breaks isolation. The > > > > > > > approach with delaying > > > > > > > of synchronization work helps to maintain isolated state. > > > > > > > > > > > > > > I've tested it with test from task isolation series on ThunderX2 > > > > > > > for more than > > > > > > > 10 hours (10k giga-ticks) without breaking isolation. 
> > > > > > > > > > > > > > Signed-off-by: Yury Norov> > > > > > > --- > > > > > > > arch/arm64/kernel/insn.c | 2 +- > > > > > > > include/linux/smp.h | 2 ++ > > > > > > > kernel/smp.c | 24 > > > > > > > mm/slab.c| 2 +- > > > > > > > 4 files changed, 28 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c > > > > > > > index 2718a77da165..9d7c492e920e 100644 > > > > > > > --- a/arch/arm64/kernel/insn.c > > > > > > > +++ b/arch/arm64/kernel/insn.c > > > > > > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void > > > > > > > *addrs[], u32 insns[], int cnt) > > > > > > >* synchronization. > > > > > > >*/ > > > > > > > ret = aarch64_insn_patch_text_nosync(addrs[0], > > > > > > > insns[0]); > > > > > > > - kick_all_cpus_sync(); > > > > > > > + kick_active_cpus_sync(); > > > > > > > return ret; > > > > > > > } > > > > > > > } > > > > > > > diff --git a/include/linux/smp.h b/include/linux/smp.h > > > > > > > index 9fb239e12b82..27215e22240d 100644 > > > > > > > --- a/include/linux/smp.h > > > > > > > +++ b/include/linux/smp.h > > > > > > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct > > > > > > > cpumask *mask, > > > > > > > smp_call_func_t func, void *info, int wait); > > > > > > > > > > > > > > void kick_all_cpus_sync(void); > > > > > > > +void kick_active_cpus_sync(void); > > > > > > > void wake_up_all_idle_cpus(void); > > > > > > > > > > > > > > /* > > > > > > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask > > > > > > > *mask, smp_call_func_t func, > > > > > > > } > > > > > > > > > > > > > > static inline void kick_all_cpus_sync(void) { } > > > > > > > +static inline void kick_active_cpus_sync(void) { } > > > > > > > static inline void wake_up_all_idle_cpus(void) { } > > > > > > > > > > > > > > #ifdef CONFIG_UP_LATE_INIT > > > > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > > > > > > index 084c8b3a2681..0358d6673850 100644 > > > > > > > --- 
a/kernel/smp.c > > > > > > > +++ b/kernel/smp.c > > > > > > > @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void) > > > > > > > } > > > > > > > EXPORT_SYMBOL_GPL(kick_all_cpus_sync); > > > > > > > > > > > > > > +/** > > > > > > > + * kick_active_cpus_sync - Force CPUs that are not in extended > > > > > > > + * quiescent state (idle or nohz_full userspace) sync by sending > > > > > > > + * IPI. Extended quiescent state CPUs will sync at the exit of > > > > > > > + * that state. > > > > > > > + */ > > > > > > > +void kick_active_cpus_sync(void) > > > > > > > +{ > > > > > > > + int cpu; > > > > > > > + struct cpumask kernel_cpus; > > > > > > > + > > > > > > > + smp_mb(); > > > > > > > + > > > > > > > + cpumask_clear(&kernel_cpus); > > > > > > > + preempt_disable(); > > > > > > > + for_each_online_cpu(cpu) { > > > > > > > + if (!rcu_eqs_special_set(cpu)) > > > > > > > > > > > > If we get here, the CPU is not in a quiescent state, so we therefore > > > > > > must IPI it, correct? > > > > > > > > > > > > But don't you also need to define rcu_eqs_special_exit() so that RCU > > > > > > can invoke it when it next leaves its quiescent state? Or are you > > > > > >
Re: [PATCH v3 39/41] cxlflash: Synchronize reset and remove ops
On Mon, Mar 26, 2018 at 11:35:27AM -0500, Uma Krishnan wrote:
> The following Oops can be encountered if a device removal or system
> shutdown is initiated while an EEH recovery is in process:
>
> [c00ff2f479c0] c00815256f18 cxlflash_pci_slot_reset+0xa0/0x100 [cxlflash]
> [c00ff2f47a30] c0080dae22e0 cxl_pci_slot_reset+0x168/0x290 [cxl]
> [c00ff2f47ae0] c003ef1c eeh_report_reset+0xec/0x170
> [c00ff2f47b20] c003d0b8 eeh_pe_dev_traverse+0x98/0x170
> [c00ff2f47bb0] c003f80c eeh_handle_normal_event+0x56c/0x580
> [c00ff2f47c60] c003fba4 eeh_handle_event+0x2a4/0x338
> [c00ff2f47d10] c00400b8 eeh_event_handler+0x1f8/0x200
> [c00ff2f47dc0] c013da48 kthread+0x1a8/0x1b0
> [c00ff2f47e30] c000b528 ret_from_kernel_thread+0x5c/0xb4
>
> The remove handler frees AFU memory while the EEH recovery is in progress,
> leading to a race condition. This can result in a crash if the recovery
> thread tries to access this memory.
>
> To resolve this issue, the cxlflash remove handler will evaluate the
> device state and yield to any active reset or probing threads.
>
> Signed-off-by: Uma Krishnan

Looks good!

Acked-by: Matthew R. Ochs
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Wed, Mar 28, 2018 at 06:56:17AM -0700, Paul E. McKenney wrote: > On Wed, Mar 28, 2018 at 04:36:05PM +0300, Yury Norov wrote: > > On Mon, Mar 26, 2018 at 05:45:55AM -0700, Paul E. McKenney wrote: > > > On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote: > > > > On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote: > > > > > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote: > > > > > > kick_all_cpus_sync() forces all CPUs to sync caches by sending > > > > > > broadcast IPI. > > > > > > If CPU is in extended quiescent state (idle task or nohz_full > > > > > > userspace), this > > > > > > work may be done at the exit of this state. Delaying > > > > > > synchronization helps to > > > > > > save power if CPU is in idle state and decrease latency for > > > > > > real-time tasks. > > > > > > > > > > > > This patch introduces kick_active_cpus_sync() and uses it in > > > > > > mm/slab and arm64 > > > > > > code to delay syncronization. > > > > > > > > > > > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to > > > > > > the CPU running > > > > > > isolated task would be fatal, as it breaks isolation. The approach > > > > > > with delaying > > > > > > of synchronization work helps to maintain isolated state. > > > > > > > > > > > > I've tested it with test from task isolation series on ThunderX2 > > > > > > for more than > > > > > > 10 hours (10k giga-ticks) without breaking isolation. 
> > > > > > > > > > > > Signed-off-by: Yury Norov> > > > > > --- > > > > > > arch/arm64/kernel/insn.c | 2 +- > > > > > > include/linux/smp.h | 2 ++ > > > > > > kernel/smp.c | 24 > > > > > > mm/slab.c| 2 +- > > > > > > 4 files changed, 28 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c > > > > > > index 2718a77da165..9d7c492e920e 100644 > > > > > > --- a/arch/arm64/kernel/insn.c > > > > > > +++ b/arch/arm64/kernel/insn.c > > > > > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void > > > > > > *addrs[], u32 insns[], int cnt) > > > > > > * synchronization. > > > > > > */ > > > > > > ret = aarch64_insn_patch_text_nosync(addrs[0], > > > > > > insns[0]); > > > > > > - kick_all_cpus_sync(); > > > > > > + kick_active_cpus_sync(); > > > > > > return ret; > > > > > > } > > > > > > } > > > > > > diff --git a/include/linux/smp.h b/include/linux/smp.h > > > > > > index 9fb239e12b82..27215e22240d 100644 > > > > > > --- a/include/linux/smp.h > > > > > > +++ b/include/linux/smp.h > > > > > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct cpumask > > > > > > *mask, > > > > > > smp_call_func_t func, void *info, int wait); > > > > > > > > > > > > void kick_all_cpus_sync(void); > > > > > > +void kick_active_cpus_sync(void); > > > > > > void wake_up_all_idle_cpus(void); > > > > > > > > > > > > /* > > > > > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask > > > > > > *mask, smp_call_func_t func, > > > > > > } > > > > > > > > > > > > static inline void kick_all_cpus_sync(void) { } > > > > > > +static inline void kick_active_cpus_sync(void) { } > > > > > > static inline void wake_up_all_idle_cpus(void) { } > > > > > > > > > > > > #ifdef CONFIG_UP_LATE_INIT > > > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > > > > > index 084c8b3a2681..0358d6673850 100644 > > > > > > --- a/kernel/smp.c > > > > > > +++ b/kernel/smp.c > > > > > > @@ -724,6 +724,30 @@ void 
kick_all_cpus_sync(void) > > > > > > } > > > > > > EXPORT_SYMBOL_GPL(kick_all_cpus_sync); > > > > > > > > > > > > +/** > > > > > > + * kick_active_cpus_sync - Force CPUs that are not in extended > > > > > > + * quiescent state (idle or nohz_full userspace) sync by sending > > > > > > + * IPI. Extended quiescent state CPUs will sync at the exit of > > > > > > + * that state. > > > > > > + */ > > > > > > +void kick_active_cpus_sync(void) > > > > > > +{ > > > > > > + int cpu; > > > > > > + struct cpumask kernel_cpus; > > > > > > + > > > > > > + smp_mb(); > > > > > > + > > > > > > + cpumask_clear(&kernel_cpus); > > > > > > + preempt_disable(); > > > > > > + for_each_online_cpu(cpu) { > > > > > > + if (!rcu_eqs_special_set(cpu)) > > > > > > > > > > If we get here, the CPU is not in a quiescent state, so we therefore > > > > > must IPI it, correct? > > > > > > > > > > But don't you also need to define rcu_eqs_special_exit() so that RCU > > > > > can invoke it when it next leaves its quiescent state? Or are you > > > > > able > > > > > to ignore the CPU in that case? (If you are able to ignore the CPU in > > > > > that case, I could give you a lower-cost function to get your job > > > > > done.) > > > > >
Re: [v2, 01/10] powerpc: Add security feature flags for Spectre/Meltdown
On Tue, 2018-03-27 at 12:01:44 UTC, Michael Ellerman wrote:
> This commit adds security feature flags to reflect the settings we
> receive from firmware regarding Spectre/Meltdown mitigations.
>
> The feature names reflect the names we are given by firmware on bare
> metal machines. See the hostboot source for details.
>
> Arguably these could be firmware features, but that then requires them
> to be read early in boot so they're available prior to asm feature
> patching, but we don't actually want to use them for patching. We may
> also want to dynamically update them in future, which would be
> incompatible with the way firmware features work (at the moment at
> least). So for now just make them separate flags.
>
> Signed-off-by: Michael Ellerman

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/9a868f634349e62922c226834aa23e

cheers
Re: [v3,1/8] powerpc: Add ppc_breakpoint_available()
On Tue, 2018-03-27 at 04:37:17 UTC, Michael Neuling wrote:
> Add ppc_breakpoint_available() to determine if a breakpoint is
> available currently via the DAWR or DABR.
>
> Signed-off-by: Michael Neuling

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/404b27d66ed657ebccb08a9c8f8f65

cheers
Re: [v2] powerpc/perf: Fix the kernel address leak to userspace via SDAR
On Wed, 2018-03-21 at 11:40:26 UTC, Madhavan Srinivasan wrote:
> Sampled Data Address Register (SDAR) is a 64-bit
> register that contains the effective address of
> the storage operand of an instruction that was
> being executed, possibly out-of-order, at or around
> the time that the Performance Monitor alert occurred.
>
> In certain scenarios SDAR happens to contain a kernel
> address even for userspace-only sampling. Add checks
> to prevent it.
>
> Signed-off-by: Madhavan Srinivasan

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/cd1231d7035fea894118d5155ff984

cheers
Re: [v2] powerpc/perf: Fix kernel address leak to userspace via BHRB buffer
On Wed, 2018-03-21 at 11:40:25 UTC, Madhavan Srinivasan wrote:
> The current Branch History Rolling Buffer (BHRB) code does
> not check for any privilege levels before updating the data
> from BHRB. This leaks kernel addresses to userspace even when
> profiling only with userspace privileges. Add proper checks
> to prevent it.
>
> Acked-by: Balbir Singh
> Signed-off-by: Madhavan Srinivasan

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/bb19af816025d495376bd76bf6fbcf

cheers
Re: [v2] powerpc/perf: Fix kernel address leaks via Sampling registers
On Wed, 2018-03-21 at 11:40:24 UTC, Madhavan Srinivasan wrote: > From: Michael Ellerman> > Current code in power_pmu_disable() does not clear the sampling > registers like Sampling Instruction Address Register (SIAR) and > Sampling Data Address Register (SDAR) after disabling the PMU. > Since these are userspace readable and could contain a kernel > address, add code to explicitly clear the content of these registers. > Patch also adds a "context synchronizing instruction" to enforce > no further updates to these registers as mandated by PowerISA. > > "If an mtspr instruction is executed that changes the > value of a Performance Monitor register other than > SIAR, SDAR, and SIER, the change is not guaranteed > to have taken effect until after a subsequent context > synchronizing instruction has been executed (see > Chapter 11. "Synchronization Requirements for Context Alterations" on page 1133)." > > Signed-off-by: Madhavan Srinivasan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/e1ebd0e5b9d0a10ba65e63a3514b6d cheers
Re: [v2,1/9] powerpc/eeh: Remove eeh_handle_event()
On Mon, 2018-03-19 at 02:46:20 UTC, Sam Bobroff wrote: > The function eeh_handle_event(pe) does nothing other than switching > between calling eeh_handle_normal_event(pe) and > eeh_handle_special_event(). However it is only called in two places, > one where pe can't be NULL and the other where it must be NULL (see > eeh_event_handler()) so it does nothing but obscure the flow of > control. > > So, remove it. > > Signed-off-by: Sam Bobroff> Reviewed-by: Alexey Kardashevskiy Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/68701780712f7ddb2fa81032aa1b4a cheers
Re: [v3, 1/5] rfi-flush: Move the logic to avoid a redo into the debugfs code
On Wed, 2018-03-14 at 22:40:38 UTC, Mauricio Faria de Oliveira wrote: > From: Michael Ellerman> > rfi_flush_enable() includes a check to see if we're already > enabled (or disabled), and in that case does nothing. > > But that means calling setup_rfi_flush() a 2nd time doesn't actually > work, which is a bit confusing. > > Move that check into the debugfs code, where it really belongs. > > Signed-off-by: Michael Ellerman > Signed-off-by: Mauricio Faria de Oliveira Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/1e2a9fc7496955faacbbed49461d61 cheers
Re: powerpc/mm: Fix section mismatch warning in stop_machine_change_mapping()
On Fri, 2018-03-09 at 20:45:58 UTC, Mauricio Faria de Oliveira wrote: > Fix the warning messages for stop_machine_change_mapping(), and a number > of other affected functions in its call chain. > > All modified functions are under CONFIG_MEMORY_HOTPLUG, so __meminit > is okay (keeps them / does not discard them). > > Boot-tested on powernv/power9/radix-mmu and pseries/power8/hash-mmu. > > $ make -j$(nproc) CONFIG_DEBUG_SECTION_MISMATCH=y vmlinux > ... > MODPOST vmlinux.o > WARNING: vmlinux.o(.text+0x6b130): Section mismatch in reference from the > function stop_machine_change_mapping() to the function > .meminit.text:create_physical_mapping() > The function stop_machine_change_mapping() references > the function __meminit create_physical_mapping(). > This is often because stop_machine_change_mapping lacks a __meminit > annotation or the annotation of create_physical_mapping is wrong. > > WARNING: vmlinux.o(.text+0x6b13c): Section mismatch in reference from the > function stop_machine_change_mapping() to the function > .meminit.text:create_physical_mapping() > The function stop_machine_change_mapping() references > the function __meminit create_physical_mapping(). > This is often because stop_machine_change_mapping lacks a __meminit > annotation or the annotation of create_physical_mapping is wrong. > ... > > Signed-off-by: Mauricio Faria de Oliveira> Acked-by: Balbir Singh Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/bde709a70884bfc790da6fbc4467c9 cheers
Re: [1/3] powerpc/perf: Infrastructure to support addition of blacklisted events
On Sun, 2018-03-04 at 11:56:26 UTC, Madhavan Srinivasan wrote: > Introduce code to support addition of blacklisted events for a > processor version. A 'pointer' and 'int' variable to hold the > number of events are added to 'struct power_pmu', along with a > generic function to loop through the list to validate the given > event. Generic function 'is_event_blacklisted' is called in > power_pmu_event_init() to detect and reject early. > > Signed-off-by: Madhavan Srinivasan Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/b58064da046243f0c988afd939997e cheers
Re: [kernel] powerpc/npu: Do not try invalidating 32bit table when 64bit table is enabled
On Tue, 2018-02-13 at 05:51:35 UTC, Alexey Kardashevskiy wrote: > GPUs and the corresponding NVLink bridges get different PEs as they have > separate translation validation entries (TVEs). We put these PEs to > the same IOMMU group so they cannot be passed through separately. > So the iommu_table_group_ops::set_window/unset_window for GPUs do set > tables to the NPU PEs as well which means that iommu_table's list of > attached PEs (iommu_table_group_link) has both GPU and NPU PEs linked. > This list is used for TCE cache invalidation. > > The problem is that NPU PE has just a single TVE and can be programmed > to point to 32bit or 64bit windows while GPU PE has two (as any other PCI > device). So we end up having a 32bit iommu_table struct linked to both > PEs even though only the 64bit TCE table cache can be invalidated on NPU. > And a relatively recent skiboot detects this and prints errors. > > This changes GPU's iommu_table_group_ops::set_window/unset_window to make > sure that NPU PE is only linked to the table actually used by the hardware. > If there are two tables used by an IOMMU group, the NPU PE will use > the last programmed one which with the current use scenarios is expected > to be a 64bit one. > > Signed-off-by: Alexey Kardashevskiy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/d41ce7b1bcc3e1d02cc9da3b83c0fe cheers
Re: [kernel] powerpc/mm: Fix typo in comments
On Thu, 2018-02-01 at 05:07:25 UTC, Alexey Kardashevskiy wrote: > Fixes: 912cc87a6 "powerpc/mm/radix: Add LPID based tlb flush helpers" > Signed-off-by: Alexey Kardashevskiy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/b574df94883df4d37f1b9d648867d6 cheers
Re: [kernel] powerpc/lpar/debug: Initialize flags before printing debug message
On Tue, 2018-01-09 at 05:52:14 UTC, Alexey Kardashevskiy wrote: > With DEBUG enabled, there is a compile error: > "error: 'flags' is used uninitialized in this function". > > This moves the pr_devel() a little further down, to where @flags is initialized. > > Signed-off-by: Alexey Kardashevskiy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/a8c0bf3c621e0acc01451e27fe47c4 cheers
Re: [kernel] powerpc/init: Do not advertise radix during client-architecture-support
On Tue, 2018-01-09 at 05:45:20 UTC, Alexey Kardashevskiy wrote: > Currently the pseries kernel advertises radix MMU support even if > the actual support is disabled via the CONFIG_PPC_RADIX_MMU option. > > This adds a check for CONFIG_PPC_RADIX_MMU to avoid advertising radix > to the hypervisor. > > Suggested-by: Paul Mackerras> Signed-off-by: Alexey Kardashevskiy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/79b4686857029cdea97d0102d179ae cheers
Re: powerpc/64: Call H_REGISTER_PROC_TBL when running as a HPT guest on POWER9
On Thu, 2017-02-16 at 05:03:39 UTC, Paul Mackerras wrote: > On POWER9, since commit cc3d2940133d ("powerpc/64: Enable use of radix > MMU under hypervisor on POWER9", 2017-01-30), we set both the radix and > HPT bits in the client-architecture-support (CAS) vector, which tells > the hypervisor that we can do either radix or HPT. According to PAPR, > if we use this combination we are promising to do a H_REGISTER_PROC_TBL > hcall later on to let the hypervisor know whether we are doing radix > or HPT. We currently do this call if we are doing radix but not if > we are doing HPT. If the hypervisor is able to support both radix > and HPT guests, it would be entitled to defer allocation of the HPT > until the H_REGISTER_PROC_TBL call, and to fail any attempts to create > HPTEs until the H_REGISTER_PROC_TBL call. Thus we need to do a > H_REGISTER_PROC_TBL call when we are doing HPT; otherwise we may > crash at boot time. > > This adds the code to call H_REGISTER_PROC_TBL in this case, before > we attempt to create any HPT entries using H_ENTER. > > Fixes: cc3d2940133d ("powerpc/64: Enable use of radix MMU under hypervisor on > POWER9") > Signed-off-by: Paul Mackerras> Reviewed-by: Suraj Jitindar Singh Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/dbfcf3cb9c681aa0c5d0bb46068f98 cheers
Re: [PATCH] powerpc: Fix smp_wmb barrier definition to use lwsync consistently
On Wed, 28 Mar 2018 23:40:05 +1100 Michael Ellerman wrote: > Nicholas Piggin writes: > > > asm/barrier.h is not always included after asm/synch.h, which meant > > it was missing __SUBARCH_HAS_LWSYNC, so in some files smp_wmb() would > > be eieio when it should be lwsync. kernel/time/hrtimer.c is one case. > > Wow nice catch. Only broken since 2008 presumably. > > Some days I think maybe we aren't very good at this writing software > thing, good to have some certainty :) Yeah, I only caught it by luck when looking through instruction traces. The pipeline model just happens to make eieio look different to most other instructions (which is likely a bug in the model) which made me look closer at it. Could have been with us for another 10 years. > > __SUBARCH_HAS_LWSYNC is only used in one place, so just fold it in > > to where it's used. Previously with my small simulator config, 377 > > instances of eieio in the tree. After this patch there are 55. > > At least for Book3S this isn't actually a terrible bug AFAICS: > > - smp_wmb() is only defined to order accesses to cacheable memory. > - smp_wmb() only orders prior stores vs later stores. > - eieio orders all prior stores vs all later stores for cacheable > memory. > - lwsync orders everything except prior stores vs later loads for > cacheable memory. > > So eieio and lwsync are both valid to use as smp_wmb(), but it's still > terribly fishy that we were using both in different places depending on > include ordering. Oh yeah, it's not a bug in the sense that it would cause a violation of memory ordering, only performance (and expectations when debugging and observing things I guess). eieio works fine for smp_wmb(). > I'm inclined to tag this for stable unless anyone can think of a reason > not to? I think that would be good. Thanks, Nick
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Mon, Mar 26, 2018 at 05:45:55AM -0700, Paul E. McKenney wrote: > On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote: > > On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote: > > > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote: > > > > kick_all_cpus_sync() forces all CPUs to sync caches by sending > > > > broadcast IPI. > > > > If CPU is in extended quiescent state (idle task or nohz_full > > > > userspace), this > > > > work may be done at the exit of this state. Delaying synchronization > > > > helps to > > > > save power if CPU is in idle state and decrease latency for real-time > > > > tasks. > > > > > > > > This patch introduces kick_active_cpus_sync() and uses it in mm/slab > > > > and arm64 > > > > code to delay synchronization. > > > > > > > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to the > > > > CPU running > > > > isolated task would be fatal, as it breaks isolation. The approach with > > > > delaying > > > > of synchronization work helps to maintain isolated state. > > > > > > > > I've tested it with test from task isolation series on ThunderX2 for > > > > more than > > > > 10 hours (10k giga-ticks) without breaking isolation. > > > > > > > > Signed-off-by: Yury Norov> > > > --- > > > > arch/arm64/kernel/insn.c | 2 +- > > > > include/linux/smp.h | 2 ++ > > > > kernel/smp.c | 24 > > > > mm/slab.c | 2 +- > > > > 4 files changed, 28 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c > > > > index 2718a77da165..9d7c492e920e 100644 > > > > --- a/arch/arm64/kernel/insn.c > > > > +++ b/arch/arm64/kernel/insn.c > > > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void > > > > *addrs[], u32 insns[], int cnt) > > > > * synchronization.
> > > > */ > > > > ret = aarch64_insn_patch_text_nosync(addrs[0], > > > > insns[0]); > > > > - kick_all_cpus_sync(); > > > > + kick_active_cpus_sync(); > > > > return ret; > > > > } > > > > } > > > > diff --git a/include/linux/smp.h b/include/linux/smp.h > > > > index 9fb239e12b82..27215e22240d 100644 > > > > --- a/include/linux/smp.h > > > > +++ b/include/linux/smp.h > > > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct cpumask > > > > *mask, > > > > smp_call_func_t func, void *info, int wait); > > > > > > > > void kick_all_cpus_sync(void); > > > > +void kick_active_cpus_sync(void); > > > > void wake_up_all_idle_cpus(void); > > > > > > > > /* > > > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask *mask, > > > > smp_call_func_t func, > > > > } > > > > > > > > static inline void kick_all_cpus_sync(void) { } > > > > +static inline void kick_active_cpus_sync(void) { } > > > > static inline void wake_up_all_idle_cpus(void) { } > > > > > > > > #ifdef CONFIG_UP_LATE_INIT > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > > > index 084c8b3a2681..0358d6673850 100644 > > > > --- a/kernel/smp.c > > > > +++ b/kernel/smp.c > > > > @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void) > > > > } > > > > EXPORT_SYMBOL_GPL(kick_all_cpus_sync); > > > > > > > > +/** > > > > + * kick_active_cpus_sync - Force CPUs that are not in extended > > > > + * quiescent state (idle or nohz_full userspace) sync by sending > > > > + * IPI. Extended quiescent state CPUs will sync at the exit of > > > > + * that state. > > > > + */ > > > > +void kick_active_cpus_sync(void) > > > > +{ > > > > + int cpu; > > > > + struct cpumask kernel_cpus; > > > > + > > > > + smp_mb(); > > > > + > > > > + cpumask_clear(_cpus); > > > > + preempt_disable(); > > > > + for_each_online_cpu(cpu) { > > > > + if (!rcu_eqs_special_set(cpu)) > > > > > > If we get here, the CPU is not in a quiescent state, so we therefore > > > must IPI it, correct? 
> > > > > > But don't you also need to define rcu_eqs_special_exit() so that RCU > > > can invoke it when it next leaves its quiescent state? Or are you able > > > to ignore the CPU in that case? (If you are able to ignore the CPU in > > > that case, I could give you a lower-cost function to get your job done.) > > > > > > Thanx, Paul > > > > What's actually needed for synchronization is issuing memory barrier on > > target > > CPUs before we start executing kernel code. > > > > smp_mb() is implicitly called in smp_call_function*() path for it. In > > rcu_eqs_special_set() -> rcu_dynticks_eqs_exit() path, > > smp_mb__after_atomic() > > is called just before rcu_eqs_special_exit(). > > > > So I think, rcu_eqs_special_exit() may be left untouched. Empty > > rcu_eqs_special_exit() in new RCU path corresponds empty
Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key
On 26/03/2018 00:10, David Rientjes wrote: > On Wed, 21 Mar 2018, Laurent Dufour wrote: > >> I found the root cause of this lockdep warning. >> >> In mmap_region(), unmap_region() may be called while vma_link() has not been >> called. This happens during the error path if call_mmap() failed. >> >> The only way to fix that particular case is to call >> seqcount_init(&vma->vm_sequence) when initializing the vma in mmap_region(). >> > > Ack, although that would require a fixup to dup_mmap() as well. You're right, I'll fix that too. Thanks a lot. Laurent.
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Mon, Mar 26, 2018 at 02:57:35PM -0400, Steven Rostedt wrote: > On Mon, 26 Mar 2018 10:53:13 +0200 > Andrea Parri wrote: > > > > --- a/kernel/smp.c > > > +++ b/kernel/smp.c > > > @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void) > > > } > > > EXPORT_SYMBOL_GPL(kick_all_cpus_sync); > > > > > > +/** > > > + * kick_active_cpus_sync - Force CPUs that are not in extended > > > + * quiescent state (idle or nohz_full userspace) sync by sending > > > + * IPI. Extended quiescent state CPUs will sync at the exit of > > > + * that state. > > > + */ > > > +void kick_active_cpus_sync(void) > > > +{ > > > + int cpu; > > > + struct cpumask kernel_cpus; > > > + > > > + smp_mb(); > > > > (A general remark only:) > > > > checkpatch.pl should have warned about the fact that this barrier is > > missing an accompanying comment (which accesses are being "ordered", > > what is the pairing barrier, etc.). > > He could have simply copied the comment above the smp_mb() for > kick_all_cpus_sync(): > > /* Make sure the change is visible before we kick the cpus */ > > The kick itself is pretty much a synchronization primitive. > > That is, you make some changes and then you need all CPUs to see it, > and you call: kick_active_cpus_sync(), which is the barrier to make > sure your previous changes are seen on all CPUs before you proceed > further. Note, the matching barrier is implicit in the IPI itself. > > -- Steve I know that I had to copy the comment from kick_all_cpus_sync(), but I don't like copy-pasting in general, and as Steven said, this smp_mb() is already inside a synchronization routine, so we may hope that users of kick_*_cpus_sync() will explain better what for they need it... > > > > > Moreover if, as your reply above suggested, your patch is relying on > > "implicit barriers" (something I would not recommend) then even more > > so you should comment on these requirements.
> > > > This could: (a) force you to reason about the memory ordering stuff, > > (b) ease the task of reviewing and adopting your patch, (c) ease the > > task of preserving those requirements (as implementations change). > > > > Andrea I need v2 anyway, and I will add comments to address all questions in this thread. I also hope that we'll agree that for powerpc it's also safe to delay synchronization, and if so, we will have no users of kick_all_cpus_sync(), and can drop it. (It looks like this, because a nohz_full userspace CPU cannot have pending IPIs, but I'd like to get confirmation from powerpc people.) Would it make sense to rename kick_all_cpus_sync() to smp_mb_sync(), which would stand for 'synchronous memory barrier on all online CPUs'? Yury
[GIT PULL] Please pull powerpc/linux.git powerpc-4.16-6 tag
Hi Linus, Please pull some more powerpc fixes for 4.16. Apologies if this is a bit big at rc7, but they're all reasonably important fixes. None are actually for new code, so they aren't indicative of 4.16 being in bad shape from our point of view. cheers The following changes since commit b0c41b8b6e43120d7c35e4709508a3d90a09646e: powerpc/pseries: Fix vector5 in ibm architecture vector table (2018-03-06 23:05:38 +1100) are available in the git repository at: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.16-6 for you to fetch changes up to 52396500f97c53860164debc7d4f759077853423: powerpc/64s: Fix i-side SLB miss bad address handler saving nonvolatile GPRs (2018-03-26 07:40:17 +1100) - powerpc fixes for 4.16 #6 These are actually all fixes for pre-4.16 code, or new hardware workarounds. Fix missing AT_BASE_PLATFORM (in auxv) when we're using a new firmware interface for describing CPU features. Fix lost pending interrupts due to a race in our interrupt soft-masking code. A workaround for a nest MMU bug with TLB invalidations on Power9. A workaround for broadcast TLB invalidations on Power9. Fix a bug in our instruction SLB miss handler, when handling bad addresses (eg. >= TASK_SIZE), which could corrupt non-volatile user GPRs. Thanks to: Aneesh Kumar K.V, Balbir Singh, Benjamin Herrenschmidt, Nicholas Piggin.
Aneesh Kumar K.V (3):
      powerpc/mm/radix: Remove unused code
      powerpc/mm/radix: Move the functions that does the actual tlbie closer
      powerpc/mm: Fixup tlbie vs store ordering issue on POWER9

Benjamin Herrenschmidt (2):
      powerpc/mm: Add tracking of the number of coprocessors using a context
      powerpc/mm: Workaround Nest MMU bug with TLB invalidations

Michael Ellerman (1):
      powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

Nicholas Piggin (2):
      powerpc/64s: Fix lost pending interrupt due to race causing lost update to irq_happened
      powerpc/64s: Fix i-side SLB miss bad address handler saving nonvolatile GPRs

 arch/powerpc/include/asm/book3s/64/mmu.h           |   3 +
 .../powerpc/include/asm/book3s/64/tlbflush-radix.h |   3 -
 arch/powerpc/include/asm/cputable.h                |   3 +-
 arch/powerpc/include/asm/mmu_context.h             |  18 ++-
 arch/powerpc/kernel/dt_cpu_ftrs.c                  |   6 +
 arch/powerpc/kernel/exceptions-64s.S               |   2 +-
 arch/powerpc/kernel/irq.c                          |   8 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c             |   3 +
 arch/powerpc/kvm/book3s_hv_rm_mmu.c                |  11 ++
 arch/powerpc/mm/hash_native_64.c                   |  16 +-
 arch/powerpc/mm/mmu_context_book3s64.c             |   1 +
 arch/powerpc/mm/pgtable_64.c                       |   1 +
 arch/powerpc/mm/tlb-radix.c                        | 169 +++--
 13 files changed, 154 insertions(+), 90 deletions(-)
Re: [PATCH] powerpc: Fix smp_wmb barrier definition to use lwsync consistently
Nicholas Piggin writes: > asm/barrier.h is not always included after asm/synch.h, which meant > it was missing __SUBARCH_HAS_LWSYNC, so in some files smp_wmb() would > be eieio when it should be lwsync. kernel/time/hrtimer.c is one case. Wow nice catch. Only broken since 2008 presumably. Some days I think maybe we aren't very good at this writing software thing, good to have some certainty :) > __SUBARCH_HAS_LWSYNC is only used in one place, so just fold it in > to where it's used. Previously with my small simulator config, 377 > instances of eieio in the tree. After this patch there are 55. At least for Book3S this isn't actually a terrible bug AFAICS: - smp_wmb() is only defined to order accesses to cacheable memory. - smp_wmb() only orders prior stores vs later stores. - eieio orders all prior stores vs all later stores for cacheable memory. - lwsync orders everything except prior stores vs later loads for cacheable memory. So eieio and lwsync are both valid to use as smp_wmb(), but it's still terribly fishy that we were using both in different places depending on include ordering. I'm inclined to tag this for stable unless anyone can think of a reason not to?
cheers > diff --git a/arch/powerpc/include/asm/barrier.h > b/arch/powerpc/include/asm/barrier.h > index 10daa1d56e0a..c7c63959ba91 100644 > --- a/arch/powerpc/include/asm/barrier.h > +++ b/arch/powerpc/include/asm/barrier.h > @@ -35,7 +35,8 @@ > #define rmb() __asm__ __volatile__ ("sync" : : : "memory") > #define wmb() __asm__ __volatile__ ("sync" : : : "memory") > > -#ifdef __SUBARCH_HAS_LWSYNC > +/* The sub-arch has lwsync */ > +#if defined(__powerpc64__) || defined(CONFIG_PPC_E500MC) > #define SMPWMB LWSYNC > #else > #define SMPWMB eieio > diff --git a/arch/powerpc/include/asm/synch.h > b/arch/powerpc/include/asm/synch.h > index 63e7f5a1f105..6ec546090ba1 100644 > --- a/arch/powerpc/include/asm/synch.h > +++ b/arch/powerpc/include/asm/synch.h > @@ -6,10 +6,6 @@ > #include > #include > > -#if defined(__powerpc64__) || defined(CONFIG_PPC_E500MC) > -#define __SUBARCH_HAS_LWSYNC > -#endif > - > #ifndef __ASSEMBLY__ > extern unsigned int __start___lwsync_fixup, __stop___lwsync_fixup; > extern void do_lwsync_fixups(unsigned long value, void *fixup_start, > -- > 2.16.1
Re: [PATCH 01/16] initrd: Add generic code path for common initrd unloading logic.
On Wed, Mar 28, 2018 at 2:04 PM, Christoph Hellwig wrote: >> +#ifdef CONFIG_INITRAMFS_GENERIC_UNLOAD >> +void free_initrd_mem(unsigned long start, unsigned long end) >> +{ >> + free_reserved_area((void *)start, (void *)end, -1, "initrd"); >> +} >> +#endif > > Given how trivial this is and how many architectures can use it I'd > reverse the polarity and add a CONFIG_HAVE_ARCH_FREE_INITRD_MEM > instead. And while adding "special" functionality to the generic version, more and more users of CONFIG_HAVE_ARCH_FREE_INITRD_MEM will be removed. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH 01/16] initrd: Add generic code path for common initrd unloading logic.
> +#ifdef CONFIG_INITRAMFS_GENERIC_UNLOAD > +void free_initrd_mem(unsigned long start, unsigned long end) > +{ > + free_reserved_area((void *)start, (void *)end, -1, "initrd"); > +} > +#endif Given how trivial this is and how many architectures can use it I'd reverse the polarity and add a CONFIG_HAVE_ARCH_FREE_INITRD_MEM instead.
Re: [PATCH 14/19] powerpc/altivec: Add missing prototypes for altivec
On Wed, Mar 28, 2018 at 9:26 AM, Mathieu Malaterre wrote: > On Tue, Mar 27, 2018 at 7:33 PM, LEROY Christophe > wrote: >> LEROY Christophe wrote: >> >> >>> Mathieu Malaterre wrote: >>> Christophe, On Sat, Mar 24, 2018 at 9:10 PM, LEROY Christophe wrote: > > Mathieu Malaterre wrote: > > >> On Fri, Mar 23, 2018 at 1:19 PM, christophe leroy >> wrote: >>> >>> >>> >>> >>> On 22/03/2018 at 21:20, Mathieu Malaterre wrote: Some function prototypes were missing for the non-altivec code. Add the missing prototypes directly in xor_vmx, fix warnings treated as errors with W=1: arch/powerpc/lib/xor_vmx_glue.c:18:6: error: no previous prototype for ‘xor_altivec_2’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:29:6: error: no previous prototype for ‘xor_altivec_3’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:40:6: error: no previous prototype for ‘xor_altivec_4’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:52:6: error: no previous prototype for ‘xor_altivec_5’ [-Werror=missing-prototypes] Signed-off-by: Mathieu Malaterre --- arch/powerpc/lib/xor_vmx.h | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/powerpc/lib/xor_vmx.h b/arch/powerpc/lib/xor_vmx.h index 5c2b0839b179..2173e3c84151 100644 --- a/arch/powerpc/lib/xor_vmx.h +++ b/arch/powerpc/lib/xor_vmx.h @@ -19,3 +19,17 @@ void __xor_altivec_4(unsigned long bytes, unsigned long *v1_in, void __xor_altivec_5(unsigned long bytes, unsigned long *v1_in, unsigned long *v2_in, unsigned long *v3_in, unsigned long *v4_in, unsigned long *v5_in); + +void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, +unsigned long *v2_in); + >>> >>> >>> >>> >>> Only used in one place, should be static instead of adding it in a .h >>> >>> Same for the other ones. >> >> >> >> $ git grep xor_altivec_2 >> [...] >> arch/powerpc/lib/xor_vmx_glue.c:EXPORT_SYMBOL(xor_altivec_2); >> >> Are you sure I can change this function to static ? > > > > Yes you are right.
But in fact those functions are already defined in > asm/xor.h > So you just need to add the missing #include I originally tried it, but this leads to: CC arch/powerpc/lib/xor_vmx_glue.o In file included from arch/powerpc/lib/xor_vmx_glue.c:16:0: ./arch/powerpc/include/asm/xor.h:39:15: error: variable ‘xor_block_altivec’ has initializer but incomplete type static struct xor_block_template xor_block_altivec = { ^~ ./arch/powerpc/include/asm/xor.h:40:2: error: unknown field ‘name’ specified in initializer .name = "altivec", ^ [...] The file (powerpc) is pretty much expected to be included after . I did not want to tweak to test for #ifdef _XOR_H just before #ifdef _XOR_H static struct xor_block_template xor_block_altivec = { [...] since this seems like a hack to me. Is it OK to test for #ifdef _XOR_H in ? >>> >>> What about including linux/raid/xor.h in asm/xor.h ? > > This leads to: > > CALL ../arch/powerpc/kernel/systbl_chk.sh > In file included from ../arch/powerpc/include/asm/xor.h:57:0, > from ../arch/powerpc/lib/xor_vmx_glue.c:17: > ../include/asm-generic/xor.h:688:34: error: ‘xor_block_32regs’ defined > but not used [-Werror=unused-variable] > static struct xor_block_template xor_block_32regs = { > ^~~~ > ../include/asm-generic/xor.h:680:34: error: ‘xor_block_8regs’ defined > but not used [-Werror=unused-variable] > static struct xor_block_template xor_block_8regs = { > ^~~ > In file included from ../arch/powerpc/lib/xor_vmx_glue.c:17:0: > ../arch/powerpc/include/asm/xor.h:39:34: error: ‘xor_block_altivec’ > defined but not used [-Werror=unused-variable] > static struct xor_block_template xor_block_altivec = { > ^ > CALL ../arch/powerpc/kernel/prom_init_check.sh > I'll prepare a patch which moves the prototypes from arch/powerpc/include/asm/xor.h to arch/powerpc/include/asm/xor_altivec.h
Re: RFC on writel and writel_relaxed
On 2018-03-28 02:14, Linus Torvalds wrote: On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya wrote: Basically changing it to dma_buffer->foo = 1; /* WB */ wmb(); writel_relaxed(KICK, DMA_KICK_REGISTER); /* UC */ mmiowb() Why? Why not just remove the wmb(), and keep the barrier in the writel()? Yes, we want to get there indeed. It is because of some arch not implementing writel properly. Maintainers want to play safe. That is why I asked if IA64 and other well known archs follow the strongly ordered rule at this moment like PPC and ARM. Or should we go and inform every arch about this before yanking wmb()? Maintainers are afraid of introducing a regression. The above code makes no sense, and just looks stupid to me. It also generates pointlessly bad code on x86, so it's bad there too. Linus
RE: RFC on writel and writel_relaxed
From: Benjamin Herrenschmidt > Sent: 28 March 2018 10:56 ... > For example, let's say I have a device with a reset bit and the spec > says the reset bit needs to be set for at least 10us. > > This is wrong: > > writel(1, RESET_REG); > usleep(10); > writel(0, RESET_REG); > > Because of write posting, the first write might arrive at the device > right before the second one. > > The typical "fix" is to turn that into: > > writel(1, RESET_REG); > readl(RESET_REG); /* Flush posted writes */ Would a writel(1, RESET_REG) here provide enough synchronisation? > usleep(10); > writel(0, RESET_REG); > > *However* the issue here, at least on power, is that the CPU can issue > that readl but doesn't necessarily wait for it to complete (ie, the > data to return), before proceeding to the usleep. Now a usleep contains > a bunch of loads and stores and is probably fine, but a udelay which > just loops on the timebase may not be. > > Thus we may still violate the timing requirement. I've seen that sort of code (with udelay() and no read back) quite often. How many were in linux I don't know. For small delays I usually fix it by repeated writes (of the same value) to the device register. That can guarantee very short intervals. The only time I've actually seen buffered writes break timing was between a 286 and an 8259 interrupt controller. If you wrote to the mask then enabled interrupts the first IACK cycle could be too close to the write and break the cycle recovery time. That clobbered most of the interrupt controller registers. That probably affected every 286 board ever built! Not sure how much software added the required extra bus cycle. > What we did inside readl, with the twi;isync sequence (which basically > means, trap on return value with "trap never" as a condition, followed > by isync that ensures all exception conditions are resolved), is force > the CPU to "consume" the data from the read before moving on.
> > This effectively makes readl fully synchronous (we would probably avoid > that if we were to implement a readl_relaxed). I've always wondered exactly what the twi;isync were for - always seemed very heavy handed for most mmio reads. Particularly if you are doing mmio reads from a data fifo. Perhaps there should be a writel_wait() that is allowed to do a read back for such code paths? David