Re: [PATCH 2/3] powerpc/mm: Rename find_linux_pte_or_hugepte

2017-05-16 Thread Madhavan Srinivasan



On Wednesday 17 May 2017 10:27 AM, Benjamin Herrenschmidt wrote:

> On Wed, 2017-05-17 at 08:57 +0530, Aneesh Kumar K.V wrote:
> > Benjamin Herrenschmidt writes:
> > 
> > > On Tue, 2017-05-16 at 14:56 +0530, Aneesh Kumar K.V wrote:
> > > > +static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea,
> > > > +   bool *is_thp, unsigned *hshift)
> > > > +{
> > > > +   VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()) ,
> > > > +   "%s called with irq enabled\n", __func__);
> > > > +   return __find_linux_pte(pgdir, ea, is_thp, hshift);
> > > > +}
> > > > +
> > > 
> > > When is arch_irqs_disabled() not sufficient ?
> > 
> > We can do lockless page table walk in interrupt handlers where we find
> > MSR_EE = 0.
> 
> Such as ?


> > I was not sure we mark softenabled 0 there. What I wanted to
> > indicate in the patch is that we are safe with either softenable = 0 or
> > MSR_EE = 0

> Reading the MSR is expensive...
> 
> Can you find a case where we are hard disabled and not soft disabled in
> C code ? I can't think of one off-hand ... I know we have some asm that
> can do that very temporarily but I wouldn't think we have anything at
> runtime.
> 
> Talking of which, we have this in irq.c:
> 
> #ifdef CONFIG_TRACE_IRQFLAGS
> else {
> /*
>  * We should already be hard disabled here. We had bugs
>  * where that wasn't the case so let's dbl check it and
>  * warn if we are wrong. Only do that when IRQ tracing
>  * is enabled as mfmsr() can be costly.
>  */
> if (WARN_ON(mfmsr() & MSR_EE))
> __hard_irq_disable();
> }
> #endif
> 
> I think we should move that to a new CONFIG_PPC_DEBUG_LAZY_IRQ because
> distros are likely to have CONFIG_TRACE_IRQFLAGS these days no ?


Yes, CONFIG_TRACE_IRQFLAGS is enabled. So in my local_t patchset,
I have added a patch to do the same with a flag "CONFIG_IRQ_DEBUG_SUPPORT".

mpe reported a boot hang with the current version of the
local_t patchset on a BookE system, and I have a fix for the
same which is being tested. Will post a newer version
once the patch is verified.

Maddy


> Also we could add additional checks, such as MSR_EE matching
> paca->irq_happened, or the above you mentioned, ie, WARN if we find a case
> where IRQs are hard disabled but soft enabled.
> 
> If we find these, I think we should fix them.
> 
> Cheers,
> Ben.





Re: [PATCH] powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line

2017-05-16 Thread Anshuman Khandual
On 05/16/2017 02:54 PM, Aneesh Kumar K.V wrote:
> +void __init reserve_hugetlb_gpages(void)
> +{
> + char buf[10];
> + phys_addr_t base;
> + unsigned long gpage_size = 1UL << 34;
> + static __initdata char cmdline[COMMAND_LINE_SIZE];
> +
> + if (radix_enabled())
> + gpage_size = 1UL << 30;
> +
> + strlcpy(cmdline, boot_command_line, COMMAND_LINE_SIZE);
> + parse_args("hugetlb gpages", cmdline, NULL, 0, 0, 0,
> +NULL, &do_gpage_early_setup);
> +
> + if (!gpage_npages)
> + return;
> +
> + string_get_size(gpage_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> + pr_info("Trying to reserve %ld %s pages\n", gpage_npages, buf);
> +
> + /* Allocate one page at a time */
> + while(gpage_npages) {
> + base = memblock_alloc_base(gpage_size, gpage_size,
> +MEMBLOCK_ALLOC_ANYWHERE);
> + add_gpage(base, gpage_size, 1);

For 16GB pages (1UL << 34) on POWER8, we already do these functions
inside htab_dt_scan_hugepage_blocks(). IIUC this happens just by
scanning DT without even specifying any gpages in kernel command
line.

memblock_reserve()
add_gpage()

Then attempting to allocate from memblock and adding it again into the
gigantic pages list won't collide? Moreover, it's trying to allocate
across all of RAM, not specifically from the gpages mentioned in the
device tree by the platform. Are we trying to support 16GB pages from
any memory, without platform notification through the DT?



Re: [PATCH 2/3] powerpc/mm: Rename find_linux_pte_or_hugepte

2017-05-16 Thread Benjamin Herrenschmidt
On Wed, 2017-05-17 at 08:57 +0530, Aneesh Kumar K.V wrote:
> Benjamin Herrenschmidt  writes:
> 
> > On Tue, 2017-05-16 at 14:56 +0530, Aneesh Kumar K.V wrote:
> > > +static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea,
> > > +   bool *is_thp, unsigned *hshift)
> > > +{
> > > +   VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()) ,
> > > +   "%s called with irq enabled\n", __func__);
> > > +   return __find_linux_pte(pgdir, ea, is_thp, hshift);
> > > +}
> > > +
> > 
> > When is arch_irqs_disabled() not sufficient ?
> 
> We can do lockless page table walk in interrupt handlers where we find
> MSR_EE = 0. 

Such as ?

> I was not sure we mark softenabled 0 there. What I wanted to
> indicate in the patch is that we are safe with either softenable = 0 or
> MSR_EE = 0

Reading the MSR is expensive...

Can you find a case where we are hard disabled and not soft disabled in
C code ? I can't think of one off-hand ... I know we have some asm that
can do that very temporarily but I wouldn't think we have anything at
runtime.

Talking of which, we have this in irq.c:


#ifdef CONFIG_TRACE_IRQFLAGS
else {
/*
 * We should already be hard disabled here. We had bugs
 * where that wasn't the case so let's dbl check it and
 * warn if we are wrong. Only do that when IRQ tracing
 * is enabled as mfmsr() can be costly.
 */
if (WARN_ON(mfmsr() & MSR_EE))
__hard_irq_disable();
}
#endif

I think we should move that to a new CONFIG_PPC_DEBUG_LAZY_IRQ because
distros are likely to have CONFIG_TRACE_IRQFLAGS these days no ?

Also we could add additional checks, such as MSR_EE matching
paca->irq_happened or the above you mentioned, ie, WARN if we find a case
where IRQs are hard disabled but soft enabled.
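
A minimal sketch of what such a check could look like (using the
hypothetical CONFIG_PPC_DEBUG_LAZY_IRQ name from above; mfmsr(), MSR_EE
and arch_irqs_disabled() are existing powerpc primitives):

#ifdef CONFIG_PPC_DEBUG_LAZY_IRQ
static inline void debug_check_lazy_irq_state(void)
{
	/* Hard disabled (MSR_EE clear) but soft enabled should never happen */
	WARN_ON(!(mfmsr() & MSR_EE) && !arch_irqs_disabled());
}
#endif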

If we find these, I think we should fix them.

Cheers,
Ben.



[PATCH v3 2/2] powerpc/mm/hugetlb: Add support for 1G huge pages

2017-05-16 Thread Aneesh Kumar K.V
POWER9 supports hugepages of size 2M and 1G in radix MMU mode. This patch
enables the usage of the 1G page size for hugetlbfs. This also updates the
helpers such that we can do 1G page allocation at runtime.

We still don't enable the 1G page size on DD1. This is to avoid the
workaround mentioned in commit 6d3a0379ebdc8 ("powerpc/mm: Add
radix__tlb_flush_pte_p9_dd1()").

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 10 ++
 arch/powerpc/mm/hugetlbpage.c|  7 +--
 arch/powerpc/platforms/Kconfig.cputype   |  1 +
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index cd366596..5c28bd6f2ae1 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -50,4 +50,14 @@ static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
else
return entry;
 }
+
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void)
+{
+   if (radix_enabled())
+   return true;
+   return false;
+}
+#endif
+
 #endif
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a4f33de4008e..80f6d2ed551a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -763,8 +763,11 @@ static int __init add_huge_page_size(unsigned long long size)
 * Hash: 16M and 16G
 */
if (radix_enabled()) {
-   if (mmu_psize != MMU_PAGE_2M)
-   return -EINVAL;
+   if (mmu_psize != MMU_PAGE_2M) {
+   if (cpu_has_feature(CPU_FTR_POWER9_DD1) ||
+   (mmu_psize != MMU_PAGE_1G))
+   return -EINVAL;
+   }
} else {
if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
return -EINVAL;
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 684e886eaae4..b76ef6637016 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -344,6 +344,7 @@ config PPC_STD_MMU_64
 config PPC_RADIX_MMU
bool "Radix MMU Support"
depends on PPC_BOOK3S_64
+   select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
default y
help
  Enable support for the Power ISA 3.0 Radix style MMU. Currently this
-- 
2.7.4



[PATCH v3 1/2] mm/hugetlb: Cleanup ARCH_HAS_GIGANTIC_PAGE

2017-05-16 Thread Aneesh Kumar K.V
This moves the #ifdef in C code to a Kconfig dependency. We also move the
gigantic_page_supported() function to be arch specific. This gives an arch
the ability to conditionally enable runtime allocation of gigantic huge
pages. Architectures like ppc64 support different gigantic huge page sizes
(16G and 1G) based on the translation mode selected. This provides an
opportunity for ppc64 to enable runtime allocation only w.r.t. the 1G
hugepage size.

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
Changes from V2:
* Fix build error with x86
* Update the Kconfig change to match the C #ifdef

 arch/arm64/Kconfig   | 2 +-
 arch/arm64/include/asm/hugetlb.h | 4 
 arch/s390/Kconfig| 2 +-
 arch/s390/include/asm/hugetlb.h  | 3 +++
 arch/x86/Kconfig | 2 +-
 arch/x86/include/asm/hugetlb.h   | 4 
 mm/hugetlb.c | 7 ++-
 7 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3741859765cf..87240dcb6a07 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -11,7 +11,7 @@ config ARM64
select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_GCOV_PROFILE_ALL
-   select ARCH_HAS_GIGANTIC_PAGE
+   select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index bbc1e35aa601..793bd73b0d07 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -83,4 +83,8 @@ extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
 extern void huge_ptep_clear_flush(struct vm_area_struct *vma,
  unsigned long addr, pte_t *ptep);
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void) { return true; }
+#endif
+
 #endif /* __ASM_HUGETLB_H */
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index a2dcef0aacc7..f3637b641d7e 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -67,7 +67,7 @@ config S390
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_GCOV_PROFILE_ALL
-   select ARCH_HAS_GIGANTIC_PAGE
+   select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index cd546a245c68..89057b2cc8fe 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -112,4 +112,7 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
return pte_modify(pte, newprot);
 }
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void) { return true; }
+#endif
 #endif /* _ASM_S390_HUGETLB_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..e39b3b6b7d16 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,7 +22,7 @@ config X86_64
def_bool y
depends on 64BIT
# Options that are inherently 64-bit kernel only:
-   select ARCH_HAS_GIGANTIC_PAGE
+   select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_SUPPORTS_INT128
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index 3a106165e03a..535af0f2d8ac 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -85,4 +85,8 @@ static inline void arch_clear_hugepage_flags(struct page *page)
 {
 }
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void) { return true; }
+#endif
+
 #endif /* _ASM_X86_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d0aab9ee80d..ce090186b992 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1024,9 +1024,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
((node = hstate_next_node_to_free(hs, mask)) || 1); \
nr_nodes--)
 
-#if defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) && \
-   ((defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || \
-   defined(CONFIG_CMA))
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 static void destroy_compound_gigantic_page(struct page *page,
unsigned int order)
 {
@@ -1158,8 +1156,7 @@ static int alloc_fresh_gigantic_page(struct hstate *h,
return 0;
 }
 
-static inline bool gigantic_page_supported(void) { return true; }
-#else
+#else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */
 static inline bool gigantic_page_supported(void) { return false; }
 static inline void free_gigantic_page(struct page *page, unsigned int order) { }
 static inline void 

Re: [PATCH 2/3] powerpc/mm: Rename find_linux_pte_or_hugepte

2017-05-16 Thread Aneesh Kumar K.V
Benjamin Herrenschmidt  writes:

> On Tue, 2017-05-16 at 14:56 +0530, Aneesh Kumar K.V wrote:
>> +static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea,
>> +   bool *is_thp, unsigned *hshift)
>> +{
>> +   VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()) ,
>> +   "%s called with irq enabled\n", __func__);
>> +   return __find_linux_pte(pgdir, ea, is_thp, hshift);
>> +}
>> +
>
> When is arch_irqs_disabled() not sufficient ?

We can do lockless page table walk in interrupt handlers where we find
MSR_EE = 0. I was not sure we mark softenabled 0 there. What I wanted to
indicate in the patch is that we are safe with either softenable = 0 or
MSR_EE = 0
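
(__hard_irqs_disabled() here is the helper from the patch; presumably it
reduces to an MSR read, along these lines. A sketch, assuming it just
tests MSR_EE, which is also what makes it costly:)

static inline bool __hard_irqs_disabled(void)
{
	return !(mfmsr() & MSR_EE);	/* mfmsr() is the expensive part */
}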

-aneesh



Re: [RFC 0/2] Consolidate patch_instruction

2017-05-16 Thread Balbir Singh
On Tue, 2017-05-16 at 22:20 +0200, LEROY Christophe wrote:
> Balbir Singh wrote:
> 
> > patch_instruction is enhanced in this RFC to support
> > patching via a different virtual address (text_poke_area).
> > The mapping of text_poke_area->addr is RW and not RWX.
> > This way the mapping allows write for patching and then we tear
> > down the mapping. The downside is that we introduce a spinlock
> > which serializes our patching to one patch at a time.
> 
> Very nice patch, would fit great with my patch implementing
> CONFIG_DEBUG_RODATA (https://patchwork.ozlabs.org/patch/754289).
> Would avoid having to set the text area back to RW for patching
> 

Awesome! It seems like you have some of the work for CONFIG_STRICT_KERNEL_RWX;
any reason why this is under CONFIG_DEBUG_RODATA? But I think there is
reuse capability across the future patches and the current set.

Cheers,
Balbir Singh.


Re: [RFC 2/2] powerpc/kprobes: Move kprobes over to patch_instruction

2017-05-16 Thread Balbir Singh
On Tue, 2017-05-16 at 19:05 +0530, Naveen N. Rao wrote:
> On 2017/05/16 01:49PM, Balbir Singh wrote:
> > arch_arm/disarm_probe use direct assignment for copying
> > instructions, replace them with patch_instruction
> 
> Thanks for doing this!
> 
> We will also have to convert optprobes and ftrace to use 
> patch_instruction, but that can be done once the basic infrastructure is 
> in.
>

I think these patches can go in without even patch 1. I looked quickly at
optprobes and ftrace and thought they were already using patch_instruction
(ftrace_modify_code() and arch_optimize_kprobes()); are there other paths
I missed?

Balbir Singh
 


Re: [RFC 0/2] Consolidate patch_instruction

2017-05-16 Thread Balbir Singh
On Tue, 2017-05-16 at 19:11 +0530, Naveen N. Rao wrote:
> On 2017/05/16 10:56AM, Anshuman Khandual wrote:
> > On 05/16/2017 09:19 AM, Balbir Singh wrote:
> > > patch_instruction is enhanced in this RFC to support
> > > patching via a different virtual address (text_poke_area).
> > 
> > Why is writing the instruction directly into the address not
> > sufficient, and why do we need to go through this virtual address?
> 
> To enable KERNEL_STRICT_RWX and map all of kernel text to be read-only?
>

Precisely, the rest of the bits are still being developed.
 
> > 
> > > The mapping of text_poke_area->addr is RW and not RWX.
> > > This way the mapping allows write for patching and then we tear
> > > down the mapping. The downside is that we introduce a spinlock
> > > which serializes our patching to one patch at a time.
> > 
> > So what's the benefit we get otherwise in this approach when
> > we are adding a new lock into the equation?
> 
> Instruction patching isn't performance critical, so the slow down is 
> likely not noticeable. Marking kernel text read-only helps harden the 
> kernel by catching unintended code modifications whether through 
> exploits or through bugs.
>

Precisely!

Balbir Singh. 


Re: [PATCH 2/2] powerpc/jprobes: Validate break handler invocation as being due to a jprobe_return()

2017-05-16 Thread Masami Hiramatsu
On Mon, 15 May 2017 23:35:04 +0530
"Naveen N. Rao"  wrote:

> Fix a circa 2005 FIXME by implementing a check to ensure that we
> actually got into the jprobe break handler() due to the trap in
> jprobe_return().
> 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/kprobes.c | 20 +---
>  1 file changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 19b053475758..1ebeb8c482db 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -627,25 +627,23 @@ NOKPROBE_SYMBOL(setjmp_pre_handler);
>  
>  void __used jprobe_return(void)
>  {
> - asm volatile("trap" ::: "memory");
> + asm volatile("jprobe_return_trap:\n"
> +  "trap\n"
> +  ::: "memory");
>  }
>  NOKPROBE_SYMBOL(jprobe_return);
>  
> -static void __used jprobe_return_end(void)
> -{
> -}
> -NOKPROBE_SYMBOL(jprobe_return_end);
> -
>  int longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
>  {
>   struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
>   unsigned long sp;
>  
> - /*
> -  * FIXME - we should ideally be validating that we got here 'cos
> -  * of the "trap" in jprobe_return() above, before restoring the
> -  * saved regs...
> -  */
> + if (regs->nip != ppc_kallsyms_lookup_name("jprobe_return_trap")) {
> + WARN(1, "longjmp_break_handler NIP (0x%lx) does not match jprobe_return_trap (0x%lx)\n",
> + regs->nip, ppc_kallsyms_lookup_name("jprobe_return_trap"));
> + return 0;

If you don't handle this break, you shouldn't warn about it, because
program_check_exception() will continue to find how to handle it
via notify_die(). IOW, please return silently, or just add a
debug message.
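
Something like this, perhaps (a sketch of the quiet-return variant,
with pr_debug() instead of WARN()):

	if (regs->nip != ppc_kallsyms_lookup_name("jprobe_return_trap")) {
		/* Not our trap: let notify_die() try the other handlers */
		pr_debug("longjmp_break_handler: NIP 0x%lx is not jprobe_return_trap\n",
			 regs->nip);
		return 0;
	}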

Thank you,

> + }
> +
> + memcpy(regs, &kcb->jprobe_saved_regs, sizeof(struct pt_regs));
> + sp = kernel_stack_pointer(regs);
> + memcpy((void *)sp, &kcb->jprobes_stack, MIN_STACK_SIZE(sp));
> -- 
> 2.12.2
> 


-- 
Masami Hiramatsu 


Re: [PATCH 1/2] powerpc/jprobes: Save and restore the parameter save area

2017-05-16 Thread Masami Hiramatsu
On Mon, 15 May 2017 23:35:03 +0530
"Naveen N. Rao"  wrote:
> diff --git a/arch/powerpc/include/asm/kprobes.h b/arch/powerpc/include/asm/kprobes.h
> index a83821f33ea3..b6960ef213ac 100644
> --- a/arch/powerpc/include/asm/kprobes.h
> +++ b/arch/powerpc/include/asm/kprobes.h
> @@ -61,6 +61,15 @@ extern kprobe_opcode_t optprobe_template_end[];
>  #define MAX_OPTINSN_SIZE (optprobe_template_end - optprobe_template_entry)
>  #define RELATIVEJUMP_SIZEsizeof(kprobe_opcode_t) /* 4 bytes */
>  
> +/* Save upto 16 parameters along with the stack frame header */
> +#define MAX_STACK_SIZE   (STACK_FRAME_PARM_SAVE + (16 * sizeof(unsigned long)))
> +#define MIN_STACK_SIZE(ADDR)\
> + (((MAX_STACK_SIZE) < (((unsigned long)current_thread_info()) + \
> +   THREAD_SIZE - (unsigned long)(ADDR)))\
> +  ? (MAX_STACK_SIZE)\
> +  : (((unsigned long)current_thread_info()) +   \
> + THREAD_SIZE - (unsigned long)(ADDR)))

Could you add CUR_STACK_SIZE(addr) as x86 does instead of repeating similar
code?
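
Along the lines of (a sketch mirroring x86's kprobes.c, reusing the same
current_thread_info()/THREAD_SIZE bound as the patch above):

#define CUR_STACK_SIZE(ADDR)						\
	((unsigned long)current_thread_info() + THREAD_SIZE -		\
	 (unsigned long)(ADDR))

#define MIN_STACK_SIZE(ADDR)						\
	min_t(unsigned long, MAX_STACK_SIZE, CUR_STACK_SIZE(ADDR))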

Thank you,


-- 
Masami Hiramatsu 


Re: [v3 0/9] parallelized "struct page" zeroing

2017-05-16 Thread Benjamin Herrenschmidt
On Fri, 2017-05-12 at 13:37 -0400, David Miller wrote:
> > Right now it is larger, but what I suggested is to add a new optimized
> > routine just for this case, which would do STBI for 64-bytes but
> > without membar (do membar at the end of memmap_init_zone() and
> > deferred_init_memmap())
> > 
> > #define struct_page_clear(page) \
> >  __asm__ __volatile__(   \
> >  "stxa   %%g0, [%0]%2\n" \
> >  "stxa   %%g0, [%0 + %1]%2\n"   \
> >  : /* No output */   \
> >  : "r" (page), "r" (0x20), "i"(ASI_BLK_INIT_QUAD_LDD_P))
> > 
> > And insert it into __init_single_page() instead of memset()
> > 
> > The final result is 4.01s/T which is even faster compared to current
> > 4.97s/T
> 
> Ok, indeed, that would work.

On ppc64, that might not. We have a dcbz instruction that clears an
entire cache line at once; that's what we use for memsets and page
clearing. However, 64 bytes is half a cache line on modern processors,
so we can't use it with that semantic and would have to fall back to
the slower stores.
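
For illustration, the dcbz idiom looks like this (sketch only; the real
page-clearing loops take the line size from the CPU feature tables):

/* Zero the full cache line containing 'addr' (128 bytes on recent POWER) */
static inline void clear_cacheline(void *addr)
{
	__asm__ __volatile__("dcbz 0,%0" : : "r" (addr) : "memory");
}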

Cheers,
Ben.



Re: RFC: better timer interface

2017-05-16 Thread Arnd Bergmann
On Tue, May 16, 2017 at 5:51 PM, Christoph Hellwig  wrote:
> On Tue, May 16, 2017 at 05:45:07PM +0200, Arnd Bergmann wrote:
>> This looks really nice, but what is the long-term plan for the interface?
>> Do you expect that we will eventually change all 700+ users of timer_list
>> to the new type, or do we keep both variants around indefinitely to avoid
>> having to do mass-conversions?
>
> I think we should eventually move everyone over, but it might take
> some time.

Ok.

>> If we are going to touch them all in the end, we might want to think
>> about other changes that could be useful here. The main one I have
>> in mind would be moving away from 'jiffies + timeout' as the interface,
>> and instead passing a relative number of milliseconds (or seconds)
>> into a mod_timer() variant. This is what most drivers want anyway,
>> and if we have both changes (callback argument and expiration
>> time) in place, we modernize the API one driver at a time with both
>> changes at once.
>
> Yes, that sounds useful to me as well.  As you said it's an independent
> but somewhat related change.  I can add it to my series, but I'll
> need a suggestions for a good and short name.  That already was the
> hardest part for the setup side :)

If we keep the unusual *_timer() naming (rather than timer_*() as hrtimer
has), we could use one of

a) start_timer(struct timer_list *timer, unsigned long ms);
b) restart_timer(struct timer_list *timer, unsigned long ms);
c) mod_timer_ms(struct timer_list *timer, unsigned long ms);
   mod_timer_sec(struct timer_list *timer, unsigned long sec);

The first is slightly shorter but conflicts with three files that use
the same name for a local function name. The third one fits
well with the existing interfaces and provides both millisecond
and second versions, I'd probably go with that.

We could consider even passing a default interval as another
argument to prepare_timer(), and using that in add_timer(),
but that would only help in those cases that have a constant interval
(maybe about half of the users) and would be a bit surprising
to readers that are only familiar with the existing interfaces.

One final option would be a larger-scale replacement of
the API by mirroring the hrtimer style where possible while
staying compatible with the existing calls, e.g. timer_prepare(),
timer_add_expires(), timer_start(), ...

   Arnd


Re: [RFC 0/2] Consolidate patch_instruction

2017-05-16 Thread LEROY Christophe

Balbir Singh wrote:


patch_instruction is enhanced in this RFC to support
patching via a different virtual address (text_poke_area).
The mapping of text_poke_area->addr is RW and not RWX.
This way the mapping allows write for patching and then we tear
down the mapping. The downside is that we introduce a spinlock
which serializes our patching to one patch at a time.


Very nice patch, would fit great with my patch implementing
CONFIG_DEBUG_RODATA (https://patchwork.ozlabs.org/patch/754289).

Would avoid having to set the text area back to RW for patching

Christophe



In this patchset we also consolidate instruction changes
in kprobes to use patch_instruction().

Balbir Singh (2):
  powerpc/lib/code-patching: Enhance code patching
  powerpc/kprobes: Move kprobes over to patch_instruction

 arch/powerpc/kernel/kprobes.c|  4 +-
 arch/powerpc/lib/code-patching.c | 88 ++--

 2 files changed, 86 insertions(+), 6 deletions(-)

--
2.9.3





Re: [PATCH 2/9] timers: provide a "modern" variant of timers

2017-05-16 Thread Arnd Bergmann
On Tue, May 16, 2017 at 1:48 PM, Christoph Hellwig  wrote:

> unsigned long   expires;
> -   void(*function)(unsigned long);
> +   union {
> +   void(*func)(struct timer_list *timer);
> +   void(*function)(unsigned long);
> +   };
...
> +#define INIT_TIMER(_func, _expires, _flags)\
> +{  \
> +   .entry = { .next = TIMER_ENTRY_STATIC },\
> +   .func = (_func),\
> +   .expires = (_expires),  \
> +   .flags = TIMER_MODERN | (_flags),   \
> +   __TIMER_LOCKDEP_MAP_INITIALIZER(__FILE__ ":" __stringify(__LINE__)) \
> +}

If I remember correctly, this will fail with gcc-4.5 and earlier, which can't
use named initializers for anonymous unions. One of these two should
work, but they are both ugly:

a) don't use a named initializer for the union (a bit fragile)

 +#define INIT_TIMER(_func, _expires, _flags)\
 +{  \
 +   .entry = { .next = TIMER_ENTRY_STATIC },\
 +   .expires = (_expires),  \
 +   { .func = (_func) },\
 +   .flags = TIMER_MODERN | (_flags),   \
 +   __TIMER_LOCKDEP_MAP_INITIALIZER(__FILE__ ":" __stringify(__LINE__)) \
 +}

b) give the union a name (breaks any reference to timer_list->func in C code):

 +   union {
 +   void(*func)(struct timer_list *timer);
 +   void(*function)(unsigned long);
 +   } u;
...
 +#define INIT_TIMER(_func, _expires, _flags)\
 +{  \
 +   .entry = { .next = TIMER_ENTRY_STATIC },\
 +   .u.func = (_func),\
 +   .expires = (_expires),  \
 +   .flags = TIMER_MODERN | (_flags),   \
 +   __TIMER_LOCKDEP_MAP_INITIALIZER(__FILE__ ":" __stringify(__LINE__)) \
 +}

> +/**
> + * prepare_timer - initialize a timer before first use
> + * @timer: timer structure to prepare
> + * @func:  callback to be called when the timer expires
> + * @flags  %TIMER_* flags that control timer behavior
> + *
> + * This function initializes a timer_list structure so that it can
> + * be used (by calling add_timer() or mod_timer()).
> + */
> +static inline void prepare_timer(struct timer_list *timer,
> +   void (*func)(struct timer_list *timer), u32 flags)
> +{
> +   __init_timer(timer, TIMER_MODERN | flags);
> +   timer->func = func;
> +}
> +
> +static inline void prepare_timer_on_stack(struct timer_list *timer,
> +   void (*func)(struct timer_list *timer), u32 flags)
> +{
> +   __init_timer_on_stack(timer, TIMER_MODERN | flags);
> +   timer->func = func;
> +}

I fear this breaks lockdep output, which turns the name of
the timer into a string that gets printed later. It should work
when these are macros, or a macro wrapping an inline function
like __init_timer is.
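
E.g. a macro variant along these lines (a sketch reusing the patch's own
__init_timer/TIMER_MODERN; untested):

#define prepare_timer(_timer, _func, _flags)				\
	do {								\
		__init_timer((_timer), TIMER_MODERN | (_flags));	\
		(_timer)->func = (_func);				\
	} while (0)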

  Arnd


Re: [PATCH 9/9] timers: remove old timer initialization macros

2017-05-16 Thread Arnd Bergmann
On Tue, May 16, 2017 at 1:48 PM, Christoph Hellwig  wrote:
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/timer.h | 22 +++---
>  1 file changed, 3 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/timer.h b/include/linux/timer.h
> index 87afe52c8349..9c6694d3f66a 100644
> --- a/include/linux/timer.h
> +++ b/include/linux/timer.h
> @@ -80,35 +80,19 @@ struct timer_list {
> struct timer_list _name = INIT_TIMER(_func, _expires, _flags)
>
>  /*
> - * Don't use the macros below, use DECLARE_TIMER and INIT_TIMER with their
> + * Don't use the macro below, use DECLARE_TIMER and INIT_TIMER with their
>   * improved callback signature above.
>   */
> -#define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
> +#define DEFINE_TIMER(_name, _function, _expires, _data)\
> +   struct timer_list _name = { \
> .entry = { .next = TIMER_ENTRY_STATIC },\
> .function = (_function),\
> .expires = (_expires),  \
> .data = (_data),\
> -   .flags = (_flags),  \
> __TIMER_LOCKDEP_MAP_INITIALIZER(\
> __FILE__ ":" __stringify(__LINE__)) \
> }

Not sure what to do about it, but I notice that the '_expires' argument is
completely bogus: I don't see any way it could be used in a meaningful way,
and the only user that passes anything other than zero is
arch/mips/mti-malta/malta-display.c, and that seems to be unintentional.

  Arnd


Re: [PATCH 2/9] timers: provide a "modern" variant of timers

2017-05-16 Thread Randy Dunlap
On 05/16/17 04:48, Christoph Hellwig wrote:

> diff --git a/include/linux/timer.h b/include/linux/timer.h
> index e6789b8757d5..87afe52c8349 100644
> --- a/include/linux/timer.h
> +++ b/include/linux/timer.h
> @@ -126,6 +146,32 @@ static inline void init_timer_on_stack_key(struct timer_list *timer,
>   init_timer_on_stack_key((_timer), (_flags), NULL, NULL)
>  #endif
>  
> +/**
> + * prepare_timer - initialize a timer before first use
> + * @timer:   timer structure to prepare
> + * @func:callback to be called when the timer expires
> + * @flags%TIMER_* flags that control timer behavior

missing ':' on @flags:

> + *
> + * This function initializes a timer_list structure so that it can
> + * be used (by calling add_timer() or mod_timer()).
> + */
> +static inline void prepare_timer(struct timer_list *timer,
> + void (*func)(struct timer_list *timer), u32 flags)
> +{



-- 
~Randy


[patch V2 06/17] powerpc: Adjust system_state check

2017-05-16 Thread Thomas Gleixner
To enable smp_processor_id() and might_sleep() debug checks earlier, it's
required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

Adjust the system_state check in smp_generic_cpu_bootable() to handle the
extra states.
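
For reference, the current enum in include/linux/kernel.h; the series
inserts the new state(s) between the first two entries (exact names per
the series, shown here only as a placeholder):

enum system_states {
	SYSTEM_BOOTING,
	/* new intermediate state(s) from this series land here */
	SYSTEM_RUNNING,
	SYSTEM_HALT,
	SYSTEM_POWER_OFF,
	SYSTEM_RESTART,
};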

Signed-off-by: Thomas Gleixner 
Acked-by: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/smp.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -97,7 +97,7 @@ int smp_generic_cpu_bootable(unsigned in
/* Special case - we inhibit secondary thread startup
 * during boot if the user requests it.
 */
-   if (system_state == SYSTEM_BOOTING && cpu_has_feature(CPU_FTR_SMT)) {
+   if (system_state < SYSTEM_RUNNING && cpu_has_feature(CPU_FTR_SMT)) {
if (!smt_enabled_at_boot && cpu_thread_in_core(nr) != 0)
return 0;
if (smt_enabled_at_boot




[patch V2 09/17] cpufreq/pasemi: Adjust system_state check

2017-05-16 Thread Thomas Gleixner
To enable smp_processor_id() and might_sleep() debug checks earlier, it's
required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

Adjust the system_state check in pas_cpufreq_cpu_exit() to handle the extra
states.

Signed-off-by: Thomas Gleixner 
Acked-by: Viresh Kumar 
Cc: "Rafael J. Wysocki" 
Cc: linuxppc-dev@lists.ozlabs.org
---
 drivers/cpufreq/pasemi-cpufreq.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -226,7 +226,7 @@ static int pas_cpufreq_cpu_exit(struct c
 * We don't support CPU hotplug. Don't unmap after the system
 * has already made it to a running state.
 */
-   if (system_state != SYSTEM_BOOTING)
+   if (system_state >= SYSTEM_RUNNING)
return 0;
 
if (sdcasr_mapbase)




Re: kernel BUG at mm/usercopy.c:72!

2017-05-16 Thread Breno Leitao
On Tue, May 16, 2017 at 09:02:29PM +1000, Michael Ellerman wrote:
> Breno Leitao  writes:
> 
> > Hello,
> >
> > Kernel 4.12-rc1 is showing a bug when I try it on a POWER8 virtual
> > machine. Just SSHing into the machine causes this issue.
> >
> > [23.138124] usercopy: kernel memory overwrite attempt detected to 
> > d3d80030 (mm_struct) (560 bytes)
> > [23.138195] [ cut here ]
> > [23.138229] kernel BUG at mm/usercopy.c:72!
> > [23.138252] Oops: Exception in kernel mode, sig: 5 [#3]
> > [23.138280] SMP NR_CPUS=2048 
> > [23.138280] NUMA 
> > [23.138302] pSeries
> > [23.138330] Modules linked in:
> > [23.138354] CPU: 4 PID: 2215 Comm: sshd Tainted: G  D 
> > 4.12.0-rc1+ #9
> > [23.138395] task: c001e272dc00 task.stack: c001e27b
> > [23.138430] NIP: c0342358 LR: c0342354 CTR: 
> > c06eb060
> > [23.138472] REGS: c001e27b3a00 TRAP: 0700   Tainted: G  D   
> >(4.12.0-rc1+)
> > [23.138513] MSR: 80029033 
> > [23.138517]   CR: 28004222  XER: 2000
> > [23.138565] CFAR: c0b34500 SOFTE: 1 
> > [23.138565] GPR00: c0342354 c001e27b3c80 c142a000 
> > 005e 
> > [23.138565] GPR04: c001ffe0ade8 c001ffe21bf8 2920283536302062 
> > 79746573290d0a74 
> > [23.138565] GPR08: 0007 c0f61864 0001feeb 
> > 3064206f74206465 
> > [23.138565] GPR12: 4400 cfb42600 0015 
> > 545bdc40 
> > [23.138565] GPR16: 545c49c8 01000b4b8890 778c26f0 
> > 545cf000 
> > [23.138565] GPR20: 546109c8 c7e8 54610010 
> > 778c22e8 
> > [23.138565] GPR24: 545c8c40 c000ff6bcef0 c01e5220 
> > 0230 
> > [23.138565] GPR28: d3d80260  0230 
> > d3d80030 
> > [23.138920] NIP [c0342358] __check_object_size+0x88/0x2d0
> > [23.138956] LR [c0342354] __check_object_size+0x84/0x2d0
> > [23.138990] Call Trace:
> > [23.139006] [c001e27b3c80] [c0342354] 
> > __check_object_size+0x84/0x2d0 (unreliable)
> > [23.139056] [c001e27b3d00] [c09f5ba8] 
> > bpf_prog_create_from_user+0xa8/0x1a0
> > [23.139099] [c001e27b3d60] [c01e5d30] do_seccomp+0x120/0x720
> > [23.139136] [c001e27b3dd0] [c00fd53c] SyS_prctl+0x2ac/0x6b0
> > [23.139172] [c001e27b3e30] [c000af84] system_call+0x38/0xe0
> > [23.139218] Instruction dump:
> > [23.139240] 6000 6042 3c82ff94 3ca2ff9d 38841788 38a5e868 
> > 3c62ff95 7fc8f378 
> > [23.139283] 7fe6fb78 386310c0 487f2169 6000 <0fe0> 6042 
> > 2ba30010 409d018c 
> > [23.139328] ---[ end trace 1a1dc952a4b7c4af ]---
> 
> Do you have any idea what is calling seccomp() and triggering the bug?

This bug is hit via several paths, not only via seccomp. Here is
another path, via vfs_read, that triggers the bug:

[  370.154307] usercopy: kernel memory exposure attempt detected from 
d3d6007c (vm_area_struct) (6 bytes)
[  370.154373] [ cut here ]
[  370.154402] kernel BUG at mm/usercopy.c:72!  
  
[  370.154425] Oops: Exception in kernel mode, sig: 5 [#4]

[370.155220] [c001d30efab0] [c0342354] 
__check_object_size+0x84/0x2b0 (unreliable)
[370.155272] [c001d30efb30] [c06c96cc] 
copy_from_read_buf+0xac/0x1e0
[370.155315] [c001d30efba0] [c06ccbc4] 
n_tty_read+0x324/0x920
[370.155351] [c001d30efcb0] [c06c4c50] tty_read+0xc0/0x180  

[370.155387] [c001d30efd00] [c0347f64] __vfs_read+0x44/0x1a0
[370.155424] [c001d30efd90] [c03499ac] vfs_read+0xbc/0x1b0
[370.155460] [c001d30efde0] [c034b6f8] SyS_read+0x68/0x110
[370.155497] [c001d30efe30] [c000af84] system_call+0x38/0xe0

Anyway, I see the seccomp() path issue when I log into the system using SSH,
and the issue with tty_read() just during the system boot.

> I run the BPF and seccomp test suites, and I haven't seen this.

Do you have the hardening options enabled? For example, I do not
reproduce this problem if I do not set CONFIG_HARDENED_USERCOPY=y.


Re: RFC: better timer interface

2017-05-16 Thread Christoph Hellwig
On Tue, May 16, 2017 at 05:45:07PM +0200, Arnd Bergmann wrote:
> This looks really nice, but what is the long-term plan for the interface?
> Do you expect that we will eventually change all 700+ users of timer_list
> to the new type, or do we keep both variants around indefinitely to avoid
> having to do mass-conversions?

I think we should eventually move everyone over, but it might take
some time.

> If we are going to touch them all in the end, we might want to think
> about other changes that could be useful here. The main one I have
> in mind would be moving away from 'jiffies + timeout' as the interface,
> and instead passing a relative number of milliseconds (or seconds)
> into a mod_timer() variant. This is what most drivers want anyway,
> and if we have both changes (callback argument and expiration
> time) in place, we modernize the API one driver at a time with both
> changes at once.

Yes, that sounds useful to me as well.  As you said it's an independent
but somewhat related change.  I can add it to my series, but I'll
need a suggestions for a good and short name.  That already was the
hardest part for the setup side :)


Re: RFC: better timer interface

2017-05-16 Thread Arnd Bergmann
On Tue, May 16, 2017 at 1:48 PM, Christoph Hellwig  wrote:
> Hi all,
>
> this series attempts to provide a "modern" timer interface where the
> callback gets the timer_list structure as an argument so that it
> can use container_of instead of having to cast to/from unsigned long
> all the time (or even worse use function pointer casts, we have quite
> a few of those as well).
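
The conversion pattern this enables, roughly (an illustrative driver-side
sketch, not taken from the series):

struct foo {
	struct timer_list timer;
	int pending;
};

/* old style: the driver casts an unsigned long back to its structure */
static void foo_timeout_old(unsigned long data)
{
	struct foo *f = (struct foo *)data;
	f->pending = 0;
}

/* new style: container_of on the timer_list pointer */
static void foo_timeout(struct timer_list *timer)
{
	struct foo *f = container_of(timer, struct foo, timer);
	f->pending = 0;
}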

This looks really nice, but what is the long-term plan for the interface?
Do you expect that we will eventually change all 700+ users of timer_list
to the new type, or do we keep both variants around indefinitely to avoid
having to do mass-conversions?

If we are going to touch them all in the end, we might want to think
about other changes that could be useful here. The main one I have
in mind would be moving away from 'jiffies + timeout' as the interface,
and instead passing a relative number of milliseconds (or seconds)
into a mod_timer() variant. This is what most drivers want anyway,
and if we have both changes (callback argument and expiration
time) in place, we modernize the API one driver at a time with both
changes at once.

  Arnd


Re: [PATCH v2 1/2] mm/hugetlb: Cleanup ARCH_HAS_GIGANTIC_PAGE

2017-05-16 Thread Aneesh Kumar K.V



On Tuesday 16 May 2017 03:52 PM, Anshuman Khandual wrote:

On 05/16/2017 02:47 PM, Aneesh Kumar K.V wrote:

This moves the #ifdef in C code to a Kconfig dependency. We also move the
gigantic_page_supported() function to be arch specific. This gives an arch
the ability to conditionally enable runtime allocation of gigantic huge
pages. Architectures like ppc64 support different gigantic huge page sizes
(16G and 1G) based on the translation mode selected. This provides an
opportunity for ppc64 to enable runtime allocation only w.r.t. the 1G
hugepage size.


Right.



No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/arm64/Kconfig   | 2 +-
 arch/arm64/include/asm/hugetlb.h | 4 
 arch/s390/Kconfig| 2 +-
 arch/s390/include/asm/hugetlb.h  | 3 +++
 arch/x86/Kconfig | 2 +-
 mm/hugetlb.c | 7 ++-
 6 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3741859765cf..1f8c1f73aada 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -11,7 +11,7 @@ config ARM64
select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_GCOV_PROFILE_ALL
-   select ARCH_HAS_GIGANTIC_PAGE
+   select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA
select ARCH_HAS_KCOV
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index bbc1e35aa601..793bd73b0d07 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -83,4 +83,8 @@ extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
 extern void huge_ptep_clear_flush(struct vm_area_struct *vma,
  unsigned long addr, pte_t *ptep);

+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void) { return true; }
+#endif
+
 #endif /* __ASM_HUGETLB_H */
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index a2dcef0aacc7..a41bbf420dda 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -67,7 +67,7 @@ config S390
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_GCOV_PROFILE_ALL
-   select ARCH_HAS_GIGANTIC_PAGE
+   select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA
select ARCH_HAS_KCOV
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index cd546a245c68..89057b2cc8fe 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -112,4 +112,7 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t 
newprot)
return pte_modify(pte, newprot);
 }

+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void) { return true; }
+#endif
 #endif /* _ASM_S390_HUGETLB_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..30a6328136ac 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,7 +22,7 @@ config X86_64
def_bool y
depends on 64BIT
# Options that are inherently 64-bit kernel only:
-   select ARCH_HAS_GIGANTIC_PAGE
+   select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA
select ARCH_SUPPORTS_INT128
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY


Shouldn't we define the gigantic_page_supported() function for x86 as well,
like the other two archs above?



Yes. Will update the patch.

-aneesh



Re: kernel BUG at mm/usercopy.c:72!

2017-05-16 Thread Laura Abbott
On 05/16/2017 07:32 AM, Kees Cook wrote:
> On Tue, May 16, 2017 at 4:09 AM, Michael Ellerman  wrote:
>> [Cc'ing the relevant folks]
>>
>> Breno Leitao  writes:
>>> Hello,
>>>
>>> Kernel 4.12-rc1 is showing a bug when I try it on a POWER8 virtual
>>> machine. Just SSHing into the machine causes this issue.
>>>
>>>   [23.138124] usercopy: kernel memory overwrite attempt detected to 
>>> d3d80030 (mm_struct) (560 bytes)
>>>   [23.138195] [ cut here ]
>>>   [23.138229] kernel BUG at mm/usercopy.c:72!
>>>   [23.138252] Oops: Exception in kernel mode, sig: 5 [#3]
>>>   [23.138280] SMP NR_CPUS=2048
>>>   [23.138280] NUMA
>>>   [23.138302] pSeries
>>>   [23.138330] Modules linked in:
>>>   [23.138354] CPU: 4 PID: 2215 Comm: sshd Tainted: G  D 
>>> 4.12.0-rc1+ #9
>>>   [23.138395] task: c001e272dc00 task.stack: c001e27b
>>>   [23.138430] NIP: c0342358 LR: c0342354 CTR: 
>>> c06eb060
>>>   [23.138472] REGS: c001e27b3a00 TRAP: 0700   Tainted: G  D 
>>>  (4.12.0-rc1+)
>>>   [23.138513] MSR: 80029033 
>>>   [23.138517]   CR: 28004222  XER: 2000
>>>   [23.138565] CFAR: c0b34500 SOFTE: 1
>>>   [23.138565] GPR00: c0342354 c001e27b3c80 c142a000 
>>> 005e
>>>   [23.138565] GPR04: c001ffe0ade8 c001ffe21bf8 2920283536302062 
>>> 79746573290d0a74
>>>   [23.138565] GPR08: 0007 c0f61864 0001feeb 
>>> 3064206f74206465
>>>   [23.138565] GPR12: 4400 cfb42600 0015 
>>> 545bdc40
>>>   [23.138565] GPR16: 545c49c8 01000b4b8890 778c26f0 
>>> 545cf000
>>>   [23.138565] GPR20: 546109c8 c7e8 54610010 
>>> 778c22e8
>>>   [23.138565] GPR24: 545c8c40 c000ff6bcef0 c01e5220 
>>> 0230
>>>   [23.138565] GPR28: d3d80260  0230 
>>> d3d80030
>>>   [23.138920] NIP [c0342358] __check_object_size+0x88/0x2d0
>>>   [23.138956] LR [c0342354] __check_object_size+0x84/0x2d0
>>>   [23.138990] Call Trace:
>>>   [23.139006] [c001e27b3c80] [c0342354] 
>>> __check_object_size+0x84/0x2d0 (unreliable)
>>>   [23.139056] [c001e27b3d00] [c09f5ba8] 
>>> bpf_prog_create_from_user+0xa8/0x1a0
>>>   [23.139099] [c001e27b3d60] [c01e5d30] 
>>> do_seccomp+0x120/0x720
>>>   [23.139136] [c001e27b3dd0] [c00fd53c] 
>>> SyS_prctl+0x2ac/0x6b0
>>>   [23.139172] [c001e27b3e30] [c000af84] 
>>> system_call+0x38/0xe0
>>>   [23.139218] Instruction dump:
>>>   [23.139240] 6000 6042 3c82ff94 3ca2ff9d 38841788 38a5e868 
>>> 3c62ff95 7fc8f378
>>>   [23.139283] 7fe6fb78 386310c0 487f2169 6000 <0fe0> 6042 
>>> 2ba30010 409d018c
>>>   [23.139328] ---[ end trace 1a1dc952a4b7c4af ]---
>>>
>>> I found that kernel 4.11 does not have this issue. I also found that, if
>>> I revert 517e1fbeb65f5eade8d14f46ac365db6c75aea9b, I do not see the
>>> problem.
>>>
>>> On the other side, if I cherry-pick commit
>>> 517e1fbeb65f5eade8d14f46ac365db6c75aea9b into 4.11, I start seeing the
>>> same issue also on 4.11.
>>
>> Yeah it looks like powerpc also suffers from the same bug that arm64
>> used to, ie. virt_addr_valid() will return true for some vmalloc
>> addresses.
>>
>> virt_addr_valid() is used pretty widely, I'm not sure if we can just fix
>> it without other fallout. I'll dig a bit more tomorrow if no one beats
>> me to it.
>>
>> Kees, depending on how that turns out we may ask you to revert
>> 517e1fbeb65f ("mm/usercopy: Drop extra is_vmalloc_or_module() check").
> 
> That's fine by me. Let me know what you think would be best.
> 
> Laura, I don't see much harm in putting this back in place. It seems
> like it's just a matter of efficiency to have it removed?
> 
> -Kees
> 

Yes, there shouldn't be any harm if we need to bring it back.
Perhaps I should submit a follow-on patch to rename virt_addr_valid to
virt_addr_valid_except_where_it_isnt.
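
For reference, putting it back would be roughly the guard that
517e1fbeb65f removed from check_heap_object() (a sketch from memory,
not the exact revert):

	/* Ahead of the virt_addr_valid() test in check_heap_object(): */
	if (is_vmalloc_or_module_addr(ptr))
		return NULL;	/* vmalloc/module space is not a heap object */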

Thanks,
Laura


Re: kernel BUG at mm/usercopy.c:72!

2017-05-16 Thread Kees Cook
On Tue, May 16, 2017 at 4:09 AM, Michael Ellerman  wrote:
> [Cc'ing the relevant folks]
>
> Breno Leitao  writes:
>> Hello,
>>
>> Kernel 4.12-rc1 is showing a bug when I try it on a POWER8 virtual
>> machine. Just SSHing into the machine causes this issue.
>>
>>   [23.138124] usercopy: kernel memory overwrite attempt detected to 
>> d3d80030 (mm_struct) (560 bytes)
>>   [23.138195] [ cut here ]
>>   [23.138229] kernel BUG at mm/usercopy.c:72!
>>   [23.138252] Oops: Exception in kernel mode, sig: 5 [#3]
>>   [23.138280] SMP NR_CPUS=2048
>>   [23.138280] NUMA
>>   [23.138302] pSeries
>>   [23.138330] Modules linked in:
>>   [23.138354] CPU: 4 PID: 2215 Comm: sshd Tainted: G  D 
>> 4.12.0-rc1+ #9
>>   [23.138395] task: c001e272dc00 task.stack: c001e27b
>>   [23.138430] NIP: c0342358 LR: c0342354 CTR: 
>> c06eb060
>>   [23.138472] REGS: c001e27b3a00 TRAP: 0700   Tainted: G  D  
>> (4.12.0-rc1+)
>>   [23.138513] MSR: 80029033 
>>   [23.138517]   CR: 28004222  XER: 2000
>>   [23.138565] CFAR: c0b34500 SOFTE: 1
>>   [23.138565] GPR00: c0342354 c001e27b3c80 c142a000 
>> 005e
>>   [23.138565] GPR04: c001ffe0ade8 c001ffe21bf8 2920283536302062 
>> 79746573290d0a74
>>   [23.138565] GPR08: 0007 c0f61864 0001feeb 
>> 3064206f74206465
>>   [23.138565] GPR12: 4400 cfb42600 0015 
>> 545bdc40
>>   [23.138565] GPR16: 545c49c8 01000b4b8890 778c26f0 
>> 545cf000
>>   [23.138565] GPR20: 546109c8 c7e8 54610010 
>> 778c22e8
>>   [23.138565] GPR24: 545c8c40 c000ff6bcef0 c01e5220 
>> 0230
>>   [23.138565] GPR28: d3d80260  0230 
>> d3d80030
>>   [23.138920] NIP [c0342358] __check_object_size+0x88/0x2d0
>>   [23.138956] LR [c0342354] __check_object_size+0x84/0x2d0
>>   [23.138990] Call Trace:
>>   [23.139006] [c001e27b3c80] [c0342354] 
>> __check_object_size+0x84/0x2d0 (unreliable)
>>   [23.139056] [c001e27b3d00] [c09f5ba8] 
>> bpf_prog_create_from_user+0xa8/0x1a0
>>   [23.139099] [c001e27b3d60] [c01e5d30] 
>> do_seccomp+0x120/0x720
>>   [23.139136] [c001e27b3dd0] [c00fd53c] SyS_prctl+0x2ac/0x6b0
>>   [23.139172] [c001e27b3e30] [c000af84] system_call+0x38/0xe0
>>   [23.139218] Instruction dump:
>>   [23.139240] 6000 6042 3c82ff94 3ca2ff9d 38841788 38a5e868 
>> 3c62ff95 7fc8f378
>>   [23.139283] 7fe6fb78 386310c0 487f2169 6000 <0fe0> 6042 
>> 2ba30010 409d018c
>>   [23.139328] ---[ end trace 1a1dc952a4b7c4af ]---
>>
>> I found that kernel 4.11 does not have this issue. I also found that, if
>> I revert 517e1fbeb65f5eade8d14f46ac365db6c75aea9b, I do not see the
>> problem.
>>
>> On the other side, if I cherry-pick commit
>> 517e1fbeb65f5eade8d14f46ac365db6c75aea9b into 4.11, I start seeing the
>> same issue also on 4.11.
>
> Yeah it looks like powerpc also suffers from the same bug that arm64
> used to, ie. virt_addr_valid() will return true for some vmalloc
> addresses.
>
> virt_addr_valid() is used pretty widely, I'm not sure if we can just fix
> it without other fallout. I'll dig a bit more tomorrow if no one beats
> me to it.
>
> Kees, depending on how that turns out we may ask you to revert
> 517e1fbeb65f ("mm/usercopy: Drop extra is_vmalloc_or_module() check").

That's fine by me. Let me know what you think would be best.

Laura, I don't see much harm in putting this back in place. It seems
like it's just a matter of efficiency to have it removed?

-Kees

-- 
Kees Cook
Pixel Security


Re: [RFC 0/2] Consolidate patch_instruction

2017-05-16 Thread Naveen N. Rao
On 2017/05/16 10:56AM, Anshuman Khandual wrote:
> On 05/16/2017 09:19 AM, Balbir Singh wrote:
> > patch_instruction is enhanced in this RFC to support
> > patching via a different virtual address (text_poke_area).
> 
> Why is writing the instruction directly into the address not
> sufficient, and why do we need to go through this virtual address?

To enable KERNEL_STRICT_RWX and map all of kernel text to be read-only?

> 
> > The mapping of text_poke_area->addr is RW and not RWX.
> > This way the mapping allows write for patching and then we tear
> > down the mapping. The downside is that we introduce a spinlock
> > which serializes our patching to one patch at a time.
> 
> So what's the benefit we get otherwise in this approach when
> we are adding a new lock into the equation?

Instruction patching isn't performance critical, so the slow down is 
likely not noticeable. Marking kernel text read-only helps harden the 
kernel by catching unintended code modifications whether through 
exploits or through bugs.
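
The flow being discussed, roughly (a sketch; map_poke_page() and
unmap_poke_page() are hypothetical stand-ins for the RFC's
text_poke_area handling):

static DEFINE_SPINLOCK(text_poke_lock);

static int poke_text(unsigned int *exec_addr, unsigned int instr)
{
	unsigned long flags;
	unsigned int *alias;

	spin_lock_irqsave(&text_poke_lock, flags);
	/* hypothetical helpers: RW (not RWX) alias of the text page */
	alias = map_poke_page(exec_addr);
	*alias = instr;			/* write through the alias */
	unmap_poke_page(alias);		/* tear the mapping down again */
	spin_unlock_irqrestore(&text_poke_lock, flags);

	flush_icache_range((unsigned long)exec_addr,
			   (unsigned long)exec_addr + sizeof(instr));
	return 0;
}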

- Naveen



Re: [RFC 2/2] powerpc/kprobes: Move kprobes over to patch_instruction

2017-05-16 Thread Naveen N. Rao
On 2017/05/16 01:49PM, Balbir Singh wrote:
> arch_arm/disarm_probe use direct assignment for copying
> instructions, replace them with patch_instruction

Thanks for doing this!

We will also have to convert optprobes and ftrace to use 
patch_instruction, but that can be done once the basic infrastructure is 
in.

Regards,
Naveen



[PATCH 9/9] timers: remove old timer initialization macros

2017-05-16 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 include/linux/timer.h | 22 +++---
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 87afe52c8349..9c6694d3f66a 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -80,35 +80,19 @@ struct timer_list {
struct timer_list _name = INIT_TIMER(_func, _expires, _flags)
 
 /*
- * Don't use the macros below, use DECLARE_TIMER and INIT_TIMER with their
+ * Don't use the macro below, use DECLARE_TIMER and INIT_TIMER with their
  * improved callback signature above.
  */
-#define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
+#define DEFINE_TIMER(_name, _function, _expires, _data)\
+   struct timer_list _name = { \
.entry = { .next = TIMER_ENTRY_STATIC },\
.function = (_function),\
.expires = (_expires),  \
.data = (_data),\
-   .flags = (_flags),  \
__TIMER_LOCKDEP_MAP_INITIALIZER(\
__FILE__ ":" __stringify(__LINE__)) \
}
 
-#define TIMER_INITIALIZER(_function, _expires, _data)  \
-   __TIMER_INITIALIZER((_function), (_expires), (_data), 0)
-
-#define TIMER_PINNED_INITIALIZER(_function, _expires, _data)   \
-   __TIMER_INITIALIZER((_function), (_expires), (_data), TIMER_PINNED)
-
-#define TIMER_DEFERRED_INITIALIZER(_function, _expires, _data) \
-   __TIMER_INITIALIZER((_function), (_expires), (_data), TIMER_DEFERRABLE)
-
-#define TIMER_PINNED_DEFERRED_INITIALIZER(_function, _expires, _data)  \
-   __TIMER_INITIALIZER((_function), (_expires), (_data), TIMER_DEFERRABLE 
| TIMER_PINNED)
-
-#define DEFINE_TIMER(_name, _function, _expires, _data)\
-   struct timer_list _name =   \
-   TIMER_INITIALIZER(_function, _expires, _data)
-
 void init_timer_key(struct timer_list *timer, unsigned int flags,
const char *name, struct lock_class_key *key);
 
-- 
2.11.0



[PATCH 8/9] tlclk: switch switchover_timer to a modern timer

2017-05-16 Thread Christoph Hellwig
And remove a superfluous double-initialization.

Signed-off-by: Christoph Hellwig 
---
 drivers/char/tlclk.c | 36 +++-
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/char/tlclk.c b/drivers/char/tlclk.c
index 572a51704e67..7144016da82c 100644
--- a/drivers/char/tlclk.c
+++ b/drivers/char/tlclk.c
@@ -184,10 +184,14 @@ static unsigned int telclk_interrupt;
 static int int_events; /* Event that generate a interrupt */
 static int got_event;  /* if events processing have been done */
 
-static void switchover_timeout(unsigned long data);
-static struct timer_list switchover_timer =
-   TIMER_INITIALIZER(switchover_timeout , 0, 0);
-static unsigned long tlclk_timer_data;
+static void switchover_timeout(struct timer_list *timer);
+
+static struct switchover_timer {
+   struct timer_list timer;
+   unsigned long data;
+} switchover_timer = {
+   .timer = INIT_TIMER(switchover_timeout, 0, TIMER_DEFERRABLE),
+};
 
 static struct tlclk_alarms *alarm_events;
 
@@ -805,8 +809,6 @@ static int __init tlclk_init(void)
goto out3;
}
 
-   init_timer(&switchover_timer);
-
ret = misc_register(&tlclk_miscdev);
if (ret < 0) {
printk(KERN_ERR "tlclk: misc_register returns %d.\n", ret);
@@ -850,25 +852,26 @@ static void __exit tlclk_cleanup(void)
unregister_chrdev(tlclk_major, "telco_clock");
 
release_region(TLCLK_BASE, 8);
-   del_timer_sync(&switchover_timer);
+   del_timer_sync(&switchover_timer.timer);
kfree(alarm_events);
 
 }
 
-static void switchover_timeout(unsigned long data)
+static void switchover_timeout(struct timer_list *timer)
 {
-   unsigned long flags = *(unsigned long *) data;
+   struct switchover_timer *s =
+   container_of(timer, struct switchover_timer, timer);
 
-   if ((flags & 1)) {
-   if ((inb(TLCLK_REG1) & 0x08) != (flags & 0x08))
+   if ((s->data & 1)) {
+   if ((inb(TLCLK_REG1) & 0x08) != (s->data & 0x08))
alarm_events->switchover_primary++;
} else {
-   if ((inb(TLCLK_REG1) & 0x08) != (flags & 0x08))
+   if ((inb(TLCLK_REG1) & 0x08) != (s->data & 0x08))
alarm_events->switchover_secondary++;
}
 
/* Alarm processing is done, wake up read task */
-   del_timer(&switchover_timer);
+   del_timer(&switchover_timer.timer);
got_event = 1;
wake_up(&wqueue);
 }
@@ -920,10 +923,9 @@ static irqreturn_t tlclk_interrupt(int irq, void *dev_id)
alarm_events->pll_holdover++;
 
/* TIMEOUT in ~10ms */
-   switchover_timer.expires = jiffies + msecs_to_jiffies(10);
-   tlclk_timer_data = inb(TLCLK_REG1);
-   switchover_timer.data = (unsigned long) &tlclk_timer_data;
-   mod_timer(&switchover_timer, switchover_timer.expires);
+   switchover_timer.data = inb(TLCLK_REG1);
+   mod_timer(&switchover_timer.timer,
+   jiffies + msecs_to_jiffies(10));
} else {
got_event = 1;
wake_up(&wqueue);
-- 
2.11.0



[PATCH 7/9] s390: switch lgr timer to a modern timer

2017-05-16 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 arch/s390/kernel/lgr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kernel/lgr.c b/arch/s390/kernel/lgr.c
index ae7dff110054..147124c05f28 100644
--- a/arch/s390/kernel/lgr.c
+++ b/arch/s390/kernel/lgr.c
@@ -153,14 +153,14 @@ static void lgr_timer_set(void);
 /*
  * LGR timer callback
  */
-static void lgr_timer_fn(unsigned long ignored)
+static void lgr_timer_fn(struct timer_list *timer)
 {
lgr_info_log();
lgr_timer_set();
 }
 
 static struct timer_list lgr_timer =
-   TIMER_DEFERRED_INITIALIZER(lgr_timer_fn, 0, 0);
+   INIT_TIMER(lgr_timer_fn, 0, TIMER_DEFERRABLE);
 
 /*
  * Setup next LGR timer
-- 
2.11.0



[PATCH 6/9] s390: switch topology_timer to a modern timer

2017-05-16 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 arch/s390/kernel/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index bb47c92476f0..4a0e867fca2b 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -289,7 +289,7 @@ void topology_schedule_update(void)
schedule_work(_work);
 }
 
-static void topology_timer_fn(unsigned long ignored)
+static void topology_timer_fn(struct timer_list *timer)
 {
if (ptf(PTF_CHECK))
topology_schedule_update();
@@ -297,7 +297,7 @@ static void topology_timer_fn(unsigned long ignored)
 }
 
 static struct timer_list topology_timer =
-   TIMER_DEFERRED_INITIALIZER(topology_timer_fn, 0, 0);
+   INIT_TIMER(topology_timer_fn, 0, TIMER_DEFERRABLE);
 
 static atomic_t topology_poll = ATOMIC_INIT(0);
 
-- 
2.11.0



[PATCH 5/9] powerpc/numa: switch topology_timer to modern timer

2017-05-16 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/mm/numa.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 371792e4418f..93a11227716b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1437,7 +1437,7 @@ static void topology_schedule_update(void)
schedule_work(_work);
 }
 
-static void topology_timer_fn(unsigned long ignored)
+static void topology_timer_fn(struct timer_list *timer)
 {
if (prrn_enabled && cpumask_weight(_associativity_changes_mask))
topology_schedule_update();
@@ -1447,8 +1447,7 @@ static void topology_timer_fn(unsigned long ignored)
reset_topology_timer();
}
 }
-static struct timer_list topology_timer =
-   TIMER_INITIALIZER(topology_timer_fn, 0, 0);
+static struct timer_list topology_timer = INIT_TIMER(topology_timer_fn, 0, 0);
 
 static void reset_topology_timer(void)
 {
-- 
2.11.0



[PATCH 4/9] workqueue: switch to modern timers

2017-05-16 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 include/linux/workqueue.h| 16 ++--
 kernel/workqueue.c   | 14 +++---
 .../rcutorture/formal/srcu-cbmc/src/workqueues.h |  2 +-
 3 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index c102ef65cb64..59c889bf601e 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -17,7 +17,7 @@ struct workqueue_struct;
 
 struct work_struct;
 typedef void (*work_func_t)(struct work_struct *work);
-void delayed_work_timer_fn(unsigned long __data);
+void delayed_work_timer_fn(struct timer_list *timer);
 
 /*
  * The first word is the work queue pointer and the flags rolled into
@@ -175,9 +175,8 @@ struct execute_work {
 
 #define __DELAYED_WORK_INITIALIZER(n, f, tflags) { \
.work = __WORK_INITIALIZER((n).work, (f)),  \
-   .timer = __TIMER_INITIALIZER(delayed_work_timer_fn, \
-0, (unsigned long)&(n),\
-(tflags) | TIMER_IRQSAFE), \
+   .timer = INIT_TIMER(delayed_work_timer_fn, 0,   \
+   (tflags) | TIMER_IRQSAFE), \
}
 
 #define DECLARE_WORK(n, f) \
@@ -241,18 +240,15 @@ static inline unsigned int work_static(struct work_struct 
*work) { return 0; }
 #define __INIT_DELAYED_WORK(_work, _func, _tflags) \
do {\
INIT_WORK(&(_work)->work, (_func)); \
-   __setup_timer(&(_work)->timer, delayed_work_timer_fn,   \
- (unsigned long)(_work),   \
+   prepare_timer(&(_work)->timer, delayed_work_timer_fn,   \
  (_tflags) | TIMER_IRQSAFE);   \
} while (0)
 
 #define __INIT_DELAYED_WORK_ONSTACK(_work, _func, _tflags) \
do {\
INIT_WORK_ONSTACK(&(_work)->work, (_func)); \
-   __setup_timer_on_stack(&(_work)->timer, \
-  delayed_work_timer_fn,   \
-  (unsigned long)(_work),  \
-  (_tflags) | TIMER_IRQSAFE);  \
+   prepare_timer_on_stack(&(_work)->timer, delayed_work_timer_fn, \
+  (_tflags) | TIMER_IRQSAFE);  \
} while (0)
 
 #define INIT_DELAYED_WORK(_work, _func)
\
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c74bf39ef764..ba2cd509902f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1492,9 +1492,10 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL(queue_work_on);
 
-void delayed_work_timer_fn(unsigned long __data)
+void delayed_work_timer_fn(struct timer_list *timer)
 {
-   struct delayed_work *dwork = (struct delayed_work *)__data;
+   struct delayed_work *dwork =
+   container_of(timer, struct delayed_work, timer);
 
/* should have been called from irqsafe timer with irq already off */
	__queue_work(dwork->cpu, dwork->wq, &dwork->work);
@@ -1508,8 +1509,7 @@ static void __queue_delayed_work(int cpu, struct 
workqueue_struct *wq,
	struct work_struct *work = &dwork->work;
 
WARN_ON_ONCE(!wq);
-   WARN_ON_ONCE(timer->function != delayed_work_timer_fn ||
-timer->data != (unsigned long)dwork);
+   WARN_ON_ONCE(timer->func != delayed_work_timer_fn);
WARN_ON_ONCE(timer_pending(timer));
	WARN_ON_ONCE(!list_empty(&work->entry));
 
@@ -5335,11 +5335,11 @@ static void workqueue_sysfs_unregister(struct 
workqueue_struct *wq) { }
  */
 #ifdef CONFIG_WQ_WATCHDOG
 
-static void wq_watchdog_timer_fn(unsigned long data);
+static void wq_watchdog_timer_fn(struct timer_list *timer);
 
 static unsigned long wq_watchdog_thresh = 30;
 static struct timer_list wq_watchdog_timer =
-   TIMER_DEFERRED_INITIALIZER(wq_watchdog_timer_fn, 0, 0);
+   INIT_TIMER(wq_watchdog_timer_fn, 0, TIMER_DEFERRABLE);
 
 static unsigned long wq_watchdog_touched = INITIAL_JIFFIES;
 static DEFINE_PER_CPU(unsigned long, wq_watchdog_touched_cpu) = 
INITIAL_JIFFIES;
@@ -5353,7 +5353,7 @@ static void wq_watchdog_reset_touched(void)
per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
 }
 
-static void wq_watchdog_timer_fn(unsigned long data)
+static void wq_watchdog_timer_fn(struct timer_list *timer)
 {
unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
bool lockup_detected = false;
diff --git 

[PATCH 2/9] timers: provide a "modern" variant of timers

2017-05-16 Thread Christoph Hellwig
The new callback gets a pointer to the timer_list itself, which can
then be used to get the containing structure using container_of
instead of casting from and to unsigned long all the time.

The setup helpers take a flags argument instead of needing countless
variants.

Note: this further reduces space for the cpumask.  By the time we need
the additional cpumask space, getting rid of the old-style timers will
hopefully be finished.
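
To make the conversion pattern concrete, here is what a hypothetical
driver would look like with the helpers added below (struct foo_dev and
the foo_* functions are invented for the example):

	#include <linux/timer.h>

	struct foo_dev {
		struct timer_list timer;
		int pending;
	};

	/* old style: the callback gets a casted unsigned long */
	static void foo_timeout_old(unsigned long data)
	{
		struct foo_dev *foo = (struct foo_dev *)data;

		foo->pending = 0;
	}

	/* modern style: the callback gets the timer itself and
	 * container_of recovers the containing structure */
	static void foo_timeout(struct timer_list *timer)
	{
		struct foo_dev *foo = container_of(timer, struct foo_dev, timer);

		foo->pending = 0;
	}

	static void foo_start(struct foo_dev *foo)
	{
		/* prepare_timer() takes the timer flags directly */
		prepare_timer(&foo->timer, foo_timeout, 0);
		mod_timer(&foo->timer, jiffies + HZ);
	}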

Signed-off-by: Christoph Hellwig 
---
 include/linux/timer.h | 50 --
 kernel/time/timer.c   | 24 ++--
 2 files changed, 62 insertions(+), 12 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index e6789b8757d5..87afe52c8349 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -16,7 +16,10 @@ struct timer_list {
 */
struct hlist_node   entry;
unsigned long   expires;
-   void(*function)(unsigned long);
+   union {
+   void(*func)(struct timer_list *timer);
+   void(*function)(unsigned long);
+   };
unsigned long   data;
u32 flags;
 
@@ -52,7 +55,8 @@ struct timer_list {
  * workqueue locking issues. It's not meant for executing random crap
  * with interrupts disabled. Abuse is monitored!
  */
-#define TIMER_CPUMASK  0x0003
+#define TIMER_CPUMASK  0x0001
+#define TIMER_MODERN   0x0002
 #define TIMER_MIGRATING0x0004
 #define TIMER_BASEMASK (TIMER_CPUMASK | TIMER_MIGRATING)
 #define TIMER_DEFERRABLE   0x0008
@@ -63,6 +67,22 @@ struct timer_list {
 
 #define TIMER_TRACE_FLAGMASK   (TIMER_MIGRATING | TIMER_DEFERRABLE | 
TIMER_PINNED | TIMER_IRQSAFE)
 
+#define INIT_TIMER(_func, _expires, _flags)\
+{  \
+   .entry = { .next = TIMER_ENTRY_STATIC },\
+   .func = (_func),\
+   .expires = (_expires),  \
+   .flags = TIMER_MODERN | (_flags),   \
+   __TIMER_LOCKDEP_MAP_INITIALIZER(__FILE__ ":" __stringify(__LINE__)) \
+}
+
+#define DECLARE_TIMER(_name, _func, _expires, _flags)  \
+   struct timer_list _name = INIT_TIMER(_func, _expires, _flags)
+
+/*
+ * Don't use the macros below, use DECLARE_TIMER and INIT_TIMER with their
+ * improved callback signature above.
+ */
 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
.entry = { .next = TIMER_ENTRY_STATIC },\
.function = (_function),\
@@ -126,6 +146,32 @@ static inline void init_timer_on_stack_key(struct 
timer_list *timer,
init_timer_on_stack_key((_timer), (_flags), NULL, NULL)
 #endif
 
+/**
+ * prepare_timer - initialize a timer before first use
+ * @timer: timer structure to prepare
+ * @func:  callback to be called when the timer expires
+ * @flags:	%TIMER_* flags that control timer behavior
+ *
+ * This function initializes a timer_list structure so that it can
+ * be used (by calling add_timer() or mod_timer()).
+ */
+static inline void prepare_timer(struct timer_list *timer,
+   void (*func)(struct timer_list *timer), u32 flags)
+{
+   __init_timer(timer, TIMER_MODERN | flags);
+   timer->func = func;
+}
+
+static inline void prepare_timer_on_stack(struct timer_list *timer,
+   void (*func)(struct timer_list *timer), u32 flags)
+{
+   __init_timer_on_stack(timer, TIMER_MODERN | flags);
+   timer->func = func;
+}
+
+/*
+ * Don't use - use prepare_timer above for new code instead.
+ */
 #define init_timer(timer)  \
__init_timer((timer), 0)
 #define init_timer_pinned(timer)   \
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index c7978fcdbbea..48d8450cfa5f 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -579,7 +579,7 @@ static struct debug_obj_descr timer_debug_descr;
 
 static void *timer_debug_hint(void *addr)
 {
-   return ((struct timer_list *) addr)->function;
+   return ((struct timer_list *) addr)->func;
 }
 
 static bool timer_is_static_object(void *addr)
@@ -930,7 +930,7 @@ __mod_timer(struct timer_list *timer, unsigned long 
expires, bool pending_only)
unsigned long clk = 0, flags;
int ret = 0;
 
-   BUG_ON(!timer->function);
+   BUG_ON(!timer->func && !timer->function);
 
/*
 * This is a common optimization triggered by the networking code - if
@@ -1064,12 +1064,12 @@ EXPORT_SYMBOL(mod_timer);
  * add_timer - start a timer
  * @timer: the timer to be added
  *
- * The kernel will do a ->function(->data) callback from the
- * timer interrupt at the ->expires point in the future. The
- * current time is 'jiffies'.
+ 

[PATCH 3/9] kthread: remove unused macros

2017-05-16 Thread Christoph Hellwig
KTHREAD_DELAYED_WORK_INIT and DEFINE_KTHREAD_DELAYED_WORK are unused
and are using a timer helper that's about to go away.

Signed-off-by: Christoph Hellwig 
---
 include/linux/kthread.h | 11 ---
 1 file changed, 11 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 4fec8b775895..acb6edb4b4b4 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -114,23 +114,12 @@ struct kthread_delayed_work {
.func = (fn),   \
}
 
-#define KTHREAD_DELAYED_WORK_INIT(dwork, fn) { \
-   .work = KTHREAD_WORK_INIT((dwork).work, (fn)),  \
-   .timer = __TIMER_INITIALIZER(kthread_delayed_work_timer_fn, \
-0, (unsigned long)&(dwork),\
-TIMER_IRQSAFE),\
-   }
-
 #define DEFINE_KTHREAD_WORKER(worker)  \
struct kthread_worker worker = KTHREAD_WORKER_INIT(worker)
 
 #define DEFINE_KTHREAD_WORK(work, fn)  \
struct kthread_work work = KTHREAD_WORK_INIT(work, fn)
 
-#define DEFINE_KTHREAD_DELAYED_WORK(dwork, fn) \
-   struct kthread_delayed_work dwork = \
-   KTHREAD_DELAYED_WORK_INIT(dwork, fn)
-
 /*
  * kthread_worker.lock needs its own lockdep class key when defined on
  * stack with lockdep enabled.  Use the following macros in such cases.
-- 
2.11.0



[PATCH 1/9] timers: remove the fn and data arguments to call_timer_fn

2017-05-16 Thread Christoph Hellwig
And just move the dereferences inline, given that the timer gets
passed as an argument.

Signed-off-by: Christoph Hellwig 
---
 kernel/time/timer.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 152a706ef8b8..c7978fcdbbea 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1240,8 +1240,7 @@ int del_timer_sync(struct timer_list *timer)
 EXPORT_SYMBOL(del_timer_sync);
 #endif
 
-static void call_timer_fn(struct timer_list *timer, void (*fn)(unsigned long),
- unsigned long data)
+static void call_timer_fn(struct timer_list *timer)
 {
int count = preempt_count();
 
@@ -1265,14 +1264,14 @@ static void call_timer_fn(struct timer_list *timer, 
void (*fn)(unsigned long),
	lock_map_acquire(&lockdep_map);
 
trace_timer_expire_entry(timer);
-   fn(data);
+   timer->function(timer->data);
trace_timer_expire_exit(timer);
 
	lock_map_release(&lockdep_map);
 
if (count != preempt_count()) {
WARN_ONCE(1, "timer: %pF preempt leak: %08x -> %08x\n",
- fn, count, preempt_count());
+ timer->function, count, preempt_count());
/*
 * Restore the preempt count. That gives us a decent
 * chance to survive and extract information. If the
@@ -1287,24 +1286,19 @@ static void expire_timers(struct timer_base *base, 
struct hlist_head *head)
 {
while (!hlist_empty(head)) {
struct timer_list *timer;
-   void (*fn)(unsigned long);
-   unsigned long data;
 
timer = hlist_entry(head->first, struct timer_list, entry);
 
base->running_timer = timer;
detach_timer(timer, true);
 
-   fn = timer->function;
-   data = timer->data;
-
if (timer->flags & TIMER_IRQSAFE) {
			spin_unlock(&base->lock);
-			call_timer_fn(timer, fn, data);
+			call_timer_fn(timer);
			spin_lock(&base->lock);
		} else {
			spin_unlock_irq(&base->lock);
-			call_timer_fn(timer, fn, data);
+			call_timer_fn(timer);
			spin_lock_irq(&base->lock);
}
}
-- 
2.11.0



RFC: better timer interface

2017-05-16 Thread Christoph Hellwig
Hi all,

this series attempts to provide a "modern" timer interface where the
callback gets the timer_list structure as an argument so that it
can use container_of instead of having to cast to/from unsigned long
all the time (or even worse use function pointer casts, we have quite
a few of those as well).

For that it steals another bit from the cpu mask to add a modern flag,
and if that flag is set the different new function prototype is used.
Last but not least, new helpers to initialize these modern timers are added.
Instead of having a larger number of initialization macros we simply
pass the timer flags to them.
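
At a use site the conversion of a static timer then becomes (hypothetical
example, mirroring the s390 and powerpc conversions elsewhere in the
series):

	-static void foo_fn(unsigned long ignored);
	-static struct timer_list foo_timer =
	-	TIMER_DEFERRED_INITIALIZER(foo_fn, 0, 0);
	+static void foo_fn(struct timer_list *timer);
	+static DECLARE_TIMER(foo_timer, foo_fn, 0, TIMER_DEFERRABLE);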



Re: [PATCH 2/3] powerpc/mm: Rename find_linux_pte_or_hugepte

2017-05-16 Thread Benjamin Herrenschmidt
On Tue, 2017-05-16 at 14:56 +0530, Aneesh Kumar K.V wrote:
> +static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea,
> +   bool *is_thp, unsigned *hshift)
> +{
> +   VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()) ,
> +   "%s called with irq enabled\n", __func__);
> +   return __find_linux_pte(pgdir, ea, is_thp, hshift);
> +}
> +

When is arch_irqs_disabled() not sufficient ?

Cheers,
Ben.



Re: [PATCH 1/3] powerpc: Add __hard_irqs_disabled()

2017-05-16 Thread Benjamin Herrenschmidt
On Tue, 2017-05-16 at 14:56 +0530, Aneesh Kumar K.V wrote:
>  
> +static inline bool __hard_irqs_disabled(void)
> +{
> +   unsigned long flags = mfmsr();
> +   return (flags & MSR_EE) == 0;
> +}
> +

Reading the MSR has a cost. Can't we rely on paca->irq_happened being
non-0 ?

(If you are paranoid, add a test of msr as well and warn if there's
a mismatch ...)

Cheers,
Ben.



Re: Mainline build breaks on powerpc with error : fs/xfs/xfs_iomap.c:1152: undefined reference to `.put_dax

2017-05-16 Thread Arnd Bergmann
On Tue, May 16, 2017 at 1:02 PM, Abdul Haleem
 wrote:
> Hi,
>
> Today's mainline 4.12-rc1 fails to build for the attached configuration
> file on Power7 box with below errors.
>
> $ make
> fs/built-in.o: In function `xfs_file_iomap_end':
> fs/xfs/xfs_iomap.c:1152: undefined reference to `.put_dax'
> fs/built-in.o: In function `xfs_file_iomap_begin':
> fs/xfs/xfs_iomap.c:1071: undefined reference to `.dax_get_by_host'
>
> Also reproducible on latest linux-next, and the last successful build
> was at next-20170510.

This should be fixed by https://patchwork.kernel.org/patch/9725515/

  Arnd


Re: kernel BUG at mm/usercopy.c:72!

2017-05-16 Thread Michael Ellerman
[Cc'ing the relevant folks]

Breno Leitao  writes:
> Hello,
>
> Kernel 4.12-rc1 is showing a bug when I try it on a POWER8 virtual
> machine. Just SSHing into the machine causes this issue.
>
>   [23.138124] usercopy: kernel memory overwrite attempt detected to 
> d3d80030 (mm_struct) (560 bytes)
>   [23.138195] [ cut here ]
>   [23.138229] kernel BUG at mm/usercopy.c:72!
>   [23.138252] Oops: Exception in kernel mode, sig: 5 [#3]
>   [23.138280] SMP NR_CPUS=2048 
>   [23.138280] NUMA 
>   [23.138302] pSeries
>   [23.138330] Modules linked in:
>   [23.138354] CPU: 4 PID: 2215 Comm: sshd Tainted: G  D 
> 4.12.0-rc1+ #9
>   [23.138395] task: c001e272dc00 task.stack: c001e27b
>   [23.138430] NIP: c0342358 LR: c0342354 CTR: 
> c06eb060
>   [23.138472] REGS: c001e27b3a00 TRAP: 0700   Tainted: G  D   
>(4.12.0-rc1+)
>   [23.138513] MSR: 80029033 
>   [23.138517]   CR: 28004222  XER: 2000
>   [23.138565] CFAR: c0b34500 SOFTE: 1 
>   [23.138565] GPR00: c0342354 c001e27b3c80 c142a000 
> 005e 
>   [23.138565] GPR04: c001ffe0ade8 c001ffe21bf8 2920283536302062 
> 79746573290d0a74 
>   [23.138565] GPR08: 0007 c0f61864 0001feeb 
> 3064206f74206465 
>   [23.138565] GPR12: 4400 cfb42600 0015 
> 545bdc40 
>   [23.138565] GPR16: 545c49c8 01000b4b8890 778c26f0 
> 545cf000 
>   [23.138565] GPR20: 546109c8 c7e8 54610010 
> 778c22e8 
>   [23.138565] GPR24: 545c8c40 c000ff6bcef0 c01e5220 
> 0230 
>   [23.138565] GPR28: d3d80260  0230 
> d3d80030 
>   [23.138920] NIP [c0342358] __check_object_size+0x88/0x2d0
>   [23.138956] LR [c0342354] __check_object_size+0x84/0x2d0
>   [23.138990] Call Trace:
>   [23.139006] [c001e27b3c80] [c0342354] 
> __check_object_size+0x84/0x2d0 (unreliable)
>   [23.139056] [c001e27b3d00] [c09f5ba8] 
> bpf_prog_create_from_user+0xa8/0x1a0
>   [23.139099] [c001e27b3d60] [c01e5d30] do_seccomp+0x120/0x720
>   [23.139136] [c001e27b3dd0] [c00fd53c] SyS_prctl+0x2ac/0x6b0
>   [23.139172] [c001e27b3e30] [c000af84] system_call+0x38/0xe0
>   [23.139218] Instruction dump:
>   [23.139240] 6000 6042 3c82ff94 3ca2ff9d 38841788 38a5e868 
> 3c62ff95 7fc8f378 
>   [23.139283] 7fe6fb78 386310c0 487f2169 6000 <0fe0> 6042 
> 2ba30010 409d018c 
>   [23.139328] ---[ end trace 1a1dc952a4b7c4af ]---
>   
> I found that kernel 4.11 does not have this issue. I also found that, if
> I revert 517e1fbeb65f5eade8d14f46ac365db6c75aea9b, I do not see the
> problem.
>
> On the other side, if I cherry-pick commit
> 517e1fbeb65f5eade8d14f46ac365db6c75aea9b into 4.11, I start seeing the
> same issue also on 4.11.

Yeah it looks like powerpc also suffers from the same bug that arm64
used to, ie. virt_addr_valid() will return true for some vmalloc
addresses.
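
For reference, powerpc's definition is currently just (from
arch/powerpc/include/asm/page.h, quoting from memory):

	#define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)

__pa() is plain arithmetic against the linear map offset, so feeding it
a vmalloc address can still produce a pfn that happens to be valid, and
the check passes for an address that is not in the linear mapping. IIRC
the arm64 fix was to additionally require the address to be a linear map
address before doing the pfn_valid() check.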

virt_addr_valid() is used pretty widely, I'm not sure if we can just fix
it without other fallout. I'll dig a bit more tomorrow if no one beats
me to it.

Kees, depending on how that turns out we may ask you to revert
517e1fbeb65f ("mm/usercopy: Drop extra is_vmalloc_or_module() check").

cheers


Mainline build breaks on powerpc with error : fs/xfs/xfs_iomap.c:1152: undefined reference to `.put_dax

2017-05-16 Thread Abdul Haleem
Hi,

Today's mainline 4.12-rc1 fails to build for the attached configuration
file on Power7 box with below errors.

$ make
fs/built-in.o: In function `xfs_file_iomap_end':
fs/xfs/xfs_iomap.c:1152: undefined reference to `.put_dax'
fs/built-in.o: In function `xfs_file_iomap_begin':
fs/xfs/xfs_iomap.c:1071: undefined reference to `.dax_get_by_host'

Also reproducible on latest linux-next, and the last successful build
was at next-20170510.

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre


#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 4.10.0-rc5 Kernel Configuration
#
CONFIG_PPC64=y

#
# Processor support
#
CONFIG_PPC_BOOK3S_64=y
# CONFIG_PPC_BOOK3E_64 is not set
CONFIG_GENERIC_CPU=y
# CONFIG_CELL_CPU is not set
# CONFIG_POWER4_CPU is not set
# CONFIG_POWER5_CPU is not set
# CONFIG_POWER6_CPU is not set
# CONFIG_POWER7_CPU is not set
# CONFIG_POWER8_CPU is not set
CONFIG_PPC_BOOK3S=y
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_VSX=y
CONFIG_PPC_ICSWX=y
# CONFIG_PPC_ICSWX_PID is not set
# CONFIG_PPC_ICSWX_USE_SIGILL is not set
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_STD_MMU_64=y
CONFIG_PPC_RADIX_MMU=y
CONFIG_PPC_MM_SLICES=y
CONFIG_PPC_HAVE_PMU_SUPPORT=y
CONFIG_PPC_PERF_CTRS=y
CONFIG_SMP=y
CONFIG_NR_CPUS=2048
CONFIG_PPC_DOORBELL=y
CONFIG_VDSO32=y
CONFIG_CPU_BIG_ENDIAN=y
# CONFIG_CPU_LITTLE_ENDIAN is not set
CONFIG_64BIT=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_MMU=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NR_IRQS=512
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_ILOG2_U32=y
CONFIG_ARCH_HAS_ILOG2_U64=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK=y
CONFIG_PPC=y
# CONFIG_GENERIC_CSUM is not set
CONFIG_EARLY_PRINTK=y
CONFIG_PANIC_TIMEOUT=180
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_PPC_UDBG_16550=y
# CONFIG_GENERIC_TBSYNC is not set
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
CONFIG_EPAPR_BOOT=y
# CONFIG_DEFAULT_UIMAGE is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_PPC_DCR_NATIVE is not set
# CONFIG_PPC_DCR_MMIO is not set
# CONFIG_PPC_OF_PLATFORM_PCI is not set
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_PPC_EMULATE_SSTEP=y
CONFIG_ZONE_DMA32=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_XZ is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_SHOW_LEVEL=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_GENERIC_MSI_IRQ=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_GENERIC_TIME_VSYSCALL_OLD=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_ARCH_HAS_TICK_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
# CONFIG_NO_HZ_FULL_ALL is not set
# CONFIG_NO_HZ_FULL_SYSIDLE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_CONTEXT_TRACKING=y
# CONFIG_CONTEXT_TRACKING_FORCE is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_NONE is not set
# CONFIG_RCU_NOCB_CPU_ZERO is not set
CONFIG_RCU_NOCB_CPU_ALL=y
# CONFIG_BUILD_BIN2C is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=20
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
# CONFIG_NUMA_BALANCING is not set
# CONFIG_NUMA_BALANCING_DEFAULT_ENABLED is not set
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_SWAP_ENABLED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_CGROUP_PIDS is not set
# 

Re: kernel BUG at mm/usercopy.c:72!

2017-05-16 Thread Michael Ellerman
Breno Leitao  writes:

> Hello,
>
> Kernel 4.12-rc1 is showing a bug when I try it on a POWER8 virtual
> machine. Just SSHing into the machine causes this issue.
>
>   [23.138124] usercopy: kernel memory overwrite attempt detected to 
> d3d80030 (mm_struct) (560 bytes)
>   [23.138195] [ cut here ]
>   [23.138229] kernel BUG at mm/usercopy.c:72!
>   [23.138252] Oops: Exception in kernel mode, sig: 5 [#3]
>   [23.138280] SMP NR_CPUS=2048 
>   [23.138280] NUMA 
>   [23.138302] pSeries
>   [23.138330] Modules linked in:
>   [23.138354] CPU: 4 PID: 2215 Comm: sshd Tainted: G  D 
> 4.12.0-rc1+ #9
>   [23.138395] task: c001e272dc00 task.stack: c001e27b
>   [23.138430] NIP: c0342358 LR: c0342354 CTR: 
> c06eb060
>   [23.138472] REGS: c001e27b3a00 TRAP: 0700   Tainted: G  D   
>(4.12.0-rc1+)
>   [23.138513] MSR: 80029033 
>   [23.138517]   CR: 28004222  XER: 2000
>   [23.138565] CFAR: c0b34500 SOFTE: 1 
>   [23.138565] GPR00: c0342354 c001e27b3c80 c142a000 
> 005e 
>   [23.138565] GPR04: c001ffe0ade8 c001ffe21bf8 2920283536302062 
> 79746573290d0a74 
>   [23.138565] GPR08: 0007 c0f61864 0001feeb 
> 3064206f74206465 
>   [23.138565] GPR12: 4400 cfb42600 0015 
> 545bdc40 
>   [23.138565] GPR16: 545c49c8 01000b4b8890 778c26f0 
> 545cf000 
>   [23.138565] GPR20: 546109c8 c7e8 54610010 
> 778c22e8 
>   [23.138565] GPR24: 545c8c40 c000ff6bcef0 c01e5220 
> 0230 
>   [23.138565] GPR28: d3d80260  0230 
> d3d80030 
>   [23.138920] NIP [c0342358] __check_object_size+0x88/0x2d0
>   [23.138956] LR [c0342354] __check_object_size+0x84/0x2d0
>   [23.138990] Call Trace:
>   [23.139006] [c001e27b3c80] [c0342354] 
> __check_object_size+0x84/0x2d0 (unreliable)
>   [23.139056] [c001e27b3d00] [c09f5ba8] 
> bpf_prog_create_from_user+0xa8/0x1a0
>   [23.139099] [c001e27b3d60] [c01e5d30] do_seccomp+0x120/0x720
>   [23.139136] [c001e27b3dd0] [c00fd53c] SyS_prctl+0x2ac/0x6b0
>   [23.139172] [c001e27b3e30] [c000af84] system_call+0x38/0xe0
>   [23.139218] Instruction dump:
>   [23.139240] 6000 6042 3c82ff94 3ca2ff9d 38841788 38a5e868 
> 3c62ff95 7fc8f378 
>   [23.139283] 7fe6fb78 386310c0 487f2169 6000 <0fe0> 6042 
> 2ba30010 409d018c 
>   [23.139328] ---[ end trace 1a1dc952a4b7c4af ]---

Do you have any idea what is calling seccomp() and triggering the bug?

I run the BPF and seccomp test suites, and I haven't seen this.

cheers


[PATCH] powerpc/mm: Fix crash in page table dump with huge pages

2017-05-16 Thread Michael Ellerman
The page table dump code doesn't know about huge pages, so currently
it crashes (or walks random memory, usually leading to a crash), if it
finds a huge page. On Book3S we only see huge pages in the Linux page
tables when we're using the P9 Radix MMU.

Teaching the code to properly handle huge pages is a bit more involved,
so for now just prevent the crash.

Cc: sta...@vger.kernel.org # v4.10+
Fixes: 8eb07b187000 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/mm/dump_linuxpagetables.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/dump_linuxpagetables.c 
b/arch/powerpc/mm/dump_linuxpagetables.c
index d659345a98d6..6070d3d60ef1 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -391,7 +391,7 @@ static void walk_pmd(struct pg_state *st, pud_t *pud, 
unsigned long start)
 
for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
addr = start + i * PMD_SIZE;
-   if (!pmd_none(*pmd))
+   if (!pmd_none(*pmd) && !pmd_huge(*pmd))
/* pmd exists */
walk_pte(st, pmd, addr);
else
@@ -407,7 +407,7 @@ static void walk_pud(struct pg_state *st, pgd_t *pgd, 
unsigned long start)
 
for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
addr = start + i * PUD_SIZE;
-   if (!pud_none(*pud))
+   if (!pud_none(*pud) && !pud_huge(*pud))
/* pud exists */
walk_pmd(st, pud, addr);
else
@@ -427,7 +427,7 @@ static void walk_pagetables(struct pg_state *st)
 */
for (i = 0; i < PTRS_PER_PGD; i++, pgd++) {
addr = KERN_VIRT_START + i * PGDIR_SIZE;
-   if (!pgd_none(*pgd))
+   if (!pgd_none(*pgd) && !pgd_huge(*pgd))
/* pgd exists */
walk_pud(st, pgd, addr);
else
-- 
2.7.4



Re: [PATCH v2 2/2] powerpc/mm/hugetlb: Add support for 1G huge pages

2017-05-16 Thread Anshuman Khandual
On 05/16/2017 02:47 PM, Aneesh Kumar K.V wrote:
> POWER9 supports hugepages of size 2M and 1G in radix MMU mode. This patch
> enables the usage of 1G page size for hugetlbfs. This also updates the
> helpers such that we can do 1G page allocation at runtime.
> 
> We still don't enable 1G page size on DD1. This is to avoid the workaround
> mentioned in commit: 6d3a0379ebdc8 (powerpc/mm: Add
> radix__tlb_flush_pte_p9_dd1()
> 
> Signed-off-by: Aneesh Kumar K.V 

Sounds good.

Reviewed-by: Anshuman Khandual 
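
For reference, runtime allocation of the 1G pages enabled here goes
through the standard hugetlb sysfs interface, e.g. on a radix system:

	echo 2 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages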



Re: [PATCH v2 1/2] mm/hugetlb: Cleanup ARCH_HAS_GIGANTIC_PAGE

2017-05-16 Thread Anshuman Khandual
On 05/16/2017 02:47 PM, Aneesh Kumar K.V wrote:
> This moves the #ifdef in C code to a Kconfig dependency. Also we move the
> gigantic_page_supported() function to be arch specific. This allows an arch
> to conditionally enable runtime allocation of gigantic huge pages.
> Architectures like ppc64 support different gigantic huge page sizes (16G and
> 1G) based on the translation mode selected. This provides an opportunity for
> ppc64 to enable runtime allocation only w.r.t. the 1G hugepage.

Right.

> 
> No functional change in this patch.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/arm64/Kconfig   | 2 +-
>  arch/arm64/include/asm/hugetlb.h | 4 
>  arch/s390/Kconfig| 2 +-
>  arch/s390/include/asm/hugetlb.h  | 3 +++
>  arch/x86/Kconfig | 2 +-
>  mm/hugetlb.c | 7 ++-
>  6 files changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 3741859765cf..1f8c1f73aada 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -11,7 +11,7 @@ config ARM64
>   select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
>   select ARCH_HAS_ELF_RANDOMIZE
>   select ARCH_HAS_GCOV_PROFILE_ALL
> - select ARCH_HAS_GIGANTIC_PAGE
> + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA
>   select ARCH_HAS_KCOV
>   select ARCH_HAS_SET_MEMORY
>   select ARCH_HAS_SG_CHAIN
> diff --git a/arch/arm64/include/asm/hugetlb.h 
> b/arch/arm64/include/asm/hugetlb.h
> index bbc1e35aa601..793bd73b0d07 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -83,4 +83,8 @@ extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
>  extern void huge_ptep_clear_flush(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep);
>  
> +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> +static inline bool gigantic_page_supported(void) { return true; }
> +#endif
> +
>  #endif /* __ASM_HUGETLB_H */
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index a2dcef0aacc7..a41bbf420dda 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -67,7 +67,7 @@ config S390
>   select ARCH_HAS_DEVMEM_IS_ALLOWED
>   select ARCH_HAS_ELF_RANDOMIZE
>   select ARCH_HAS_GCOV_PROFILE_ALL
> - select ARCH_HAS_GIGANTIC_PAGE
> + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA
>   select ARCH_HAS_KCOV
>   select ARCH_HAS_SET_MEMORY
>   select ARCH_HAS_SG_CHAIN
> diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
> index cd546a245c68..89057b2cc8fe 100644
> --- a/arch/s390/include/asm/hugetlb.h
> +++ b/arch/s390/include/asm/hugetlb.h
> @@ -112,4 +112,7 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t 
> newprot)
>   return pte_modify(pte, newprot);
>  }
>  
> +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> +static inline bool gigantic_page_supported(void) { return true; }
> +#endif
>  #endif /* _ASM_S390_HUGETLB_H */
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index cc98d5a294ee..30a6328136ac 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -22,7 +22,7 @@ config X86_64
>   def_bool y
>   depends on 64BIT
>   # Options that are inherently 64-bit kernel only:
> - select ARCH_HAS_GIGANTIC_PAGE
> + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA
>   select ARCH_SUPPORTS_INT128
>   select ARCH_USE_CMPXCHG_LOCKREF
>   select HAVE_ARCH_SOFT_DIRTY

Should not we define gigantic_page_supported() function for X86 as well
like the other two archs above ?



[PATCH 2/3] powerpc/mm: Rename find_linux_pte_or_hugepte

2017-05-16 Thread Aneesh Kumar K.V
No functional change. Add newer helpers with additional warnings and use those.
---
 arch/powerpc/include/asm/pgtable.h | 10 +
 arch/powerpc/include/asm/pte-walk.h| 38 ++
 arch/powerpc/kernel/eeh.c  |  4 ++--
 arch/powerpc/kernel/io-workarounds.c   |  5 +++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  5 +++--
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 33 -
 arch/powerpc/kvm/book3s_64_vio_hv.c|  3 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c| 12 ---
 arch/powerpc/kvm/e500_mmu_host.c   |  3 ++-
 arch/powerpc/mm/hash_utils_64.c|  5 +++--
 arch/powerpc/mm/hugetlbpage.c  | 24 -
 arch/powerpc/mm/tlb_hash64.c   |  6 --
 arch/powerpc/perf/callchain.c  |  3 ++-
 13 files changed, 97 insertions(+), 54 deletions(-)
 create mode 100644 arch/powerpc/include/asm/pte-walk.h

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index dd01212935ac..9fa263ad7cb3 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -66,16 +66,8 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_large(pmd) 0
 #endif
-pte_t *__find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
-  bool *is_thp, unsigned *shift);
-static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
-  bool *is_thp, unsigned *shift)
-{
-   VM_WARN(!arch_irqs_disabled(),
-   "%s called with irq enabled\n", __func__);
-   return __find_linux_pte_or_hugepte(pgdir, ea, is_thp, shift);
-}
 
+/* can we use this in kvm */
 unsigned long vmalloc_to_phys(void *vmalloc_addr);
 
 void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
diff --git a/arch/powerpc/include/asm/pte-walk.h 
b/arch/powerpc/include/asm/pte-walk.h
new file mode 100644
index ..ea30c4ddd211
--- /dev/null
+++ b/arch/powerpc/include/asm/pte-walk.h
@@ -0,0 +1,38 @@
+#ifndef _ASM_POWERPC_PTE_WALK_H
+#define _ASM_POWERPC_PTE_WALK_H
+
+#ifndef __ASSEMBLY__
+#include <linux/sched.h>
+
+/* Don't use this directly */
+extern pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
+  bool *is_thp, unsigned *hshift);
+
+static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea,
+   bool *is_thp, unsigned *hshift)
+{
+   VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()) ,
+   "%s called with irq enabled\n", __func__);
+   return __find_linux_pte(pgdir, ea, is_thp, hshift);
+}
+
+static inline pte_t *find_init_mm_pte(unsigned long ea, unsigned *hshift)
+{
+   pgd_t *pgdir = init_mm.pgd;
+   return __find_linux_pte(pgdir, ea, NULL, hshift);
+}
+/*
+ * This is what we should always use. Any other lockless page table lookup 
needs
+ * careful audit against THP split.
+ */
+static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
+bool *is_thp, unsigned *hshift)
+{
+   VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()) ,
+   "%s called with irq enabled\n", __func__);
+   VM_WARN(pgdir != current->mm->pgd,
+   "%s lock less page table lookup called on wrong mm\n", 
__func__);
+   return __find_linux_pte(pgdir, ea, is_thp, hshift);
+}
+#endif
+#endif
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 63992b2d8e15..5e6887c40528 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include <asm/pte-walk.h>
 
 
 /** Overview:
@@ -352,8 +353,7 @@ static inline unsigned long eeh_token_to_phys(unsigned long 
token)
 * worried about _PAGE_SPLITTING/collapse. Also we will not hit
 * page table free, because of init_mm.
 */
-   ptep = __find_linux_pte_or_hugepte(init_mm.pgd, token,
-  NULL, &hugepage_shift);
+   ptep = find_init_mm_pte(token, &hugepage_shift);
if (!ptep)
return token;
WARN_ON(hugepage_shift);
diff --git a/arch/powerpc/kernel/io-workarounds.c 
b/arch/powerpc/kernel/io-workarounds.c
index a582e0d42525..bbe85f5aea71 100644
--- a/arch/powerpc/kernel/io-workarounds.c
+++ b/arch/powerpc/kernel/io-workarounds.c
@@ -19,6 +19,8 @@
 #include 
 #include 
 #include 
+#include <asm/pte-walk.h>
+
 
 #define IOWA_MAX_BUS   8
 
@@ -75,8 +77,7 @@ struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
 * We won't find huge pages here (iomem). Also can't hit
 * a page table free due to init_mm
 */
-   ptep = __find_linux_pte_or_hugepte(init_mm.pgd, vaddr,
-  NULL, &hugepage_shift);
+   ptep = find_init_mm_pte(vaddr, &hugepage_shift);
if (ptep 

[PATCH 3/3] powerpc/mm: Don't send IPI to all cpus on THP updates

2017-05-16 Thread Aneesh Kumar K.V
Now that we have made sure that the lockless walk of the linux page table is
mostly limited to the current task (current->mm->pgdir), we can update the THP
update sequence to only send an IPI to the cpus on which this task has run.
This helps in reducing the IPI overload on systems with a large number of CPUs.

W.r.t. kvm, even though kvm walks the page table with vcpu->arch.pgdir, it is
done only on secondary cpus and in that case we have the primary cpu added to
the task's mm cpumask. Sending an IPI to the primary will force the secondary
to do a vm exit and hence this mm cpumask usage is safe here.

W.r.t. CAPI, we still end up walking the linux page table with the capi
context MM. For now the pte lookup serialization sends an IPI to all cpus if
CAPI is in use. We can further improve this by adding the CAPI interrupt
handling cpu to the task's mm cpumask. That will be done in a later patch.
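
To spell out why an empty IPI is sufficient: the lockless walkers run
with interrupts disabled, so the synchronous smp_call_function_many()
cannot return until every CPU in the mask has taken the interrupt, i.e.
until any walker that could have seen the old pmd has finished. Roughly
(a sketch, not literal code; names as in the patch below):

	/* reader side, lockless walk */
	local_irq_save(flags);
	ptep = find_current_mm_pte(mm->pgd, ea, &is_thp, &shift);
	/* ... dereference ptep, no locks taken ... */
	local_irq_restore(flags);

	/* writer side, e.g. THP collapse */
	old_pmd = pmdp_collapse_flush(vma, addr, pmdp);	/* hide old entry */
	serialize_against_pte_lookup(vma->vm_mm);	/* wait out readers */
	/* now safe to publish the entry in the new format */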

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  1 +
 arch/powerpc/mm/pgtable-book3s64.c   | 32 +++-
 arch/powerpc/mm/pgtable-hash64.c |  8 +++
 arch/powerpc/mm/pgtable-radix.c  |  8 +++
 4 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 85bc9875c3be..d8c3c18e220d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1145,6 +1145,7 @@ static inline bool arch_needs_pgtable_deposit(void)
return false;
return true;
 }
+extern void serialize_against_pte_lookup(struct mm_struct *mm);
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
b/arch/powerpc/mm/pgtable-book3s64.c
index 5fcb3dd74c13..2679f57b90e2 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include <misc/cxl-base.h>
 
 #include 
 #include 
@@ -64,6 +65,35 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
trace_hugepage_set_pmd(addr, pmd_val(pmd));
return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
 }
+
+static void do_nothing(void *unused)
+{
+
+}
+/*
+ * Serialize against find_current_mm_pte which does lock-less
+ * lookup in page tables with local interrupts disabled. For huge pages
+ * it casts pmd_t to pte_t. Since format of pte_t is different from
+ * pmd_t we want to prevent transit from pmd pointing to page table
+ * to pmd pointing to huge page (and back) while interrupts are disabled.
+ * We clear pmd to possibly replace it with page table pointer in
+ * different code paths. So make sure we wait for the parallel
+ * find_current_mm_pte to finish.
+ */
+void serialize_against_pte_lookup(struct mm_struct *mm)
+{
+   smp_mb();
+   /*
+* Cxl fault handling requires us to do a lockless page table
+* walk while inserting hash page table entry with mm tracked
+* in cxl context. Hence we need to do a global flush.
+*/
+   if (cxl_ctx_in_use())
+   smp_call_function(do_nothing, NULL, 1);
+   else
+   smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+}
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
@@ -77,7 +107,7 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned 
long address,
 * This ensures that generic code that rely on IRQ disabling
 * to prevent a parallel THP split work as expected.
 */
-   kick_all_cpus_sync();
+   serialize_against_pte_lookup(vma->vm_mm);
 }
 
 static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index 8b85a14b08ea..f6313cc29ae4 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -159,7 +159,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, 
unsigned long addres
 * by sending an IPI to all the cpus and executing a dummy
 * function there.
 */
-   kick_all_cpus_sync();
+   serialize_against_pte_lookup(vma->vm_mm);
/*
 * Now invalidate the hpte entries in the range
 * covered by pmd. This make sure we take a
@@ -299,16 +299,16 @@ pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
 */
memset(pgtable, 0, PTE_FRAG_SIZE);
/*
-* Serialize against find_linux_pte_or_hugepte which does lock-less
+* Serialize against find_current_mm_pte variants which does lock-less
 * lookup in page tables with local interrupts disabled. For huge pages
 * it casts pmd_t to pte_t. Since format of pte_t is different from
 * pmd_t we want to prevent transit from pmd pointing to page table
 * to pmd pointing to huge page (and back) while interrupts are 
disabled.
 * We 

[PATCH 1/3] powerpc: Add __hard_irqs_disabled()

2017-05-16 Thread Aneesh Kumar K.V
Add __hard_irqs_disabled() similar to arch_irqs_disabled to check whether irqs
are hard disabled.
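
The intended user is the lockless page table walk code; patch 2/3 then
does checks like:

	VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()),
		"%s called with irq enabled\n", __func__);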

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/hw_irq.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index eba60416536e..541bd42f902f 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -88,6 +88,12 @@ static inline bool arch_irqs_disabled(void)
return arch_irqs_disabled_flags(arch_local_save_flags());
 }
 
+static inline bool __hard_irqs_disabled(void)
+{
+   unsigned long flags = mfmsr();
+   return (flags & MSR_EE) == 0;
+}
+
 #ifdef CONFIG_PPC_BOOK3E
 #define __hard_irq_enable()asm volatile("wrteei 1" : : : "memory")
 #define __hard_irq_disable()   asm volatile("wrteei 0" : : : "memory")
@@ -197,6 +203,7 @@ static inline bool arch_irqs_disabled(void)
 }
 
 #define hard_irq_disable() arch_local_irq_disable()
+#define __hard_irqs_disabled() arch_irqs_disabled()
 
 static inline bool arch_irq_disabled_regs(struct pt_regs *regs)
 {
-- 
2.7.4



[PATCH] powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line

2017-05-16 Thread Aneesh Kumar K.V
We use the kernel command line to do the reservation of hugetlb pages. The
code duplication here is mostly to keep things simple. With 64 bit book3s, we
need to support either 16G or 1G gigantic hugepages, whereas the FSL_BOOK3E
implementation needs to support multiple gigantic hugepage sizes. We avoid the
gpage_npages array and use a gpage_npages count for ppc64. We also cannot use
the generic code to do the gigantic page allocation because that would require
a conditional to handle the pseries allocation, where the memory is already
reserved by the hypervisor.

In order to keep it simple, book3s 64 implements a version that also works
with pseries.
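
Concretely, with this patch the reservation is requested with the regular
hugetlb parameters on the kernel command line, e.g.:

	hugepagesz=16G hugepages=2	(hash translation, 16G pages)
	hugepagesz=1G hugepages=2	(radix, 1G pages)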

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/hugetlb.h |  8 +---
 arch/powerpc/mm/hugetlbpage.c  | 78 ++
 2 files changed, 79 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index 7f4025a6c69e..03401a17d1da 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -218,13 +218,7 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned 
long addr,
 }
 #endif /* CONFIG_HUGETLB_PAGE */
 
-/*
- * FSL Book3E platforms require special gpage handling - the gpages
- * are reserved early in the boot process by memblock instead of via
- * the .dts as on IBM platforms.
- */
-#if defined(CONFIG_HUGETLB_PAGE) && (defined(CONFIG_PPC_FSL_BOOK3E) || \
-defined(CONFIG_PPC_8xx))
+#ifdef CONFIG_HUGETLB_PAGE
 extern void __init reserve_hugetlb_gpages(void);
 #else
 static inline void reserve_hugetlb_gpages(void)
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 1816b965a142..4ebaa18f2495 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -373,6 +374,83 @@ int alloc_bootmem_huge_page(struct hstate *hstate)
m->hstate = hstate;
return 1;
 }
+
+static unsigned long gpage_npages;
+static int __init do_gpage_early_setup(char *param, char *val,
+  const char *unused, void *arg)
+{
+   unsigned long npages;
+   static unsigned long size = 0;
+   unsigned long gpage_size = 1UL << 34;
+
+   if (radix_enabled())
+   gpage_size = 1UL << 30;
+
+   /*
+* The hugepagesz and hugepages cmdline options are interleaved.  We
+* use the size variable to keep track of whether or not this was done
+* properly and skip over instances where it is incorrect.  Other
+* command-line parsing code will issue warnings, so we don't need to.
+*
+*/
+   if ((strcmp(param, "default_hugepagesz") == 0) ||
+   (strcmp(param, "hugepagesz") == 0)) {
+   size = memparse(val, NULL);
+   /*
+* We only want to handle the gigantic huge page size here.
+*/
+   if (size != gpage_size)
+   size = 0;
+   } else if (strcmp(param, "hugepages") == 0) {
+   if (size != 0) {
+   if (sscanf(val, "%lu", &npages) <= 0)
+   npages = 0;
+   if (npages > MAX_NUMBER_GPAGES) {
+   pr_warn("MMU: %lu 16GB pages requested, "
+   "limiting to %d pages\n", npages,
+   MAX_NUMBER_GPAGES);
+   npages = MAX_NUMBER_GPAGES;
+   }
+   gpage_npages = npages;
+   size = 0;
+   }
+   }
+   return 0;
+}
+
+/*
+ * This will just do the necessary memblock reservations. Everything else is
+ * done by core, based on kernel command line parsing.
+ */
+void __init reserve_hugetlb_gpages(void)
+{
+   char buf[10];
+   phys_addr_t base;
+   unsigned long gpage_size = 1UL << 34;
+   static __initdata char cmdline[COMMAND_LINE_SIZE];
+
+   if (radix_enabled())
+   gpage_size = 1UL << 30;
+
+   strlcpy(cmdline, boot_command_line, COMMAND_LINE_SIZE);
+   parse_args("hugetlb gpages", cmdline, NULL, 0, 0, 0,
+  NULL, &do_gpage_early_setup);
+
+   if (!gpage_npages)
+   return;
+
+   string_get_size(gpage_size, 1, STRING_UNITS_2, buf, sizeof(buf));
+   pr_info("Trying to reserve %ld %s pages\n", gpage_npages, buf);
+
+   /* Allocate one page at a time */
+   while(gpage_npages) {
+   base = memblock_alloc_base(gpage_size, gpage_size,
+  MEMBLOCK_ALLOC_ANYWHERE);
+   add_gpage(base, gpage_size, 1);
+   gpage_npages--;
+   }
+}
+
 #endif
 
 #if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
-- 
2.7.4



[PATCH v2 8/9] powerpc/mm/hugetlb: Remove follow_huge_addr for powerpc

2017-05-16 Thread Aneesh Kumar K.V
With generic code now handling hugetlb entries at the pgd level and also
supporting the hugepage directory format, we can now remove the powerpc
specific follow_huge_addr implementation.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hugetlbpage.c | 64 ---
 1 file changed, 64 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 5c829a83a4cc..1816b965a142 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -619,11 +619,6 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
} while (addr = next, addr != end);
 }
 
-/*
- * 64 bit book3s use generic follow_page_mask
- */
-#ifdef CONFIG_PPC_BOOK3S_64
-
 struct page *follow_huge_pd(struct vm_area_struct *vma,
unsigned long address, hugepd_t hpd,
int flags, int pdshift)
@@ -657,65 +652,6 @@ struct page *follow_huge_pd(struct vm_area_struct *vma,
return page;
 }
 
-#else /* !CONFIG_PPC_BOOK3S_64 */
-
-/*
- * We are holding mmap_sem, so a parallel huge page collapse cannot run.
- * To prevent hugepage split, disable irq.
- */
-struct page *
-follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
-{
-   bool is_thp;
-   pte_t *ptep, pte;
-   unsigned shift;
-   unsigned long mask, flags;
-   struct page *page = ERR_PTR(-EINVAL);
-
-   local_irq_save(flags);
-   ptep = find_linux_pte_or_hugepte(mm->pgd, address, &is_thp, &shift);
-   if (!ptep)
-   goto no_page;
-   pte = READ_ONCE(*ptep);
-   /*
-* Verify it is a huge page else bail.
-* Transparent hugepages are handled by generic code. We can skip them
-* here.
-*/
-   if (!shift || is_thp)
-   goto no_page;
-
-   if (!pte_present(pte)) {
-   page = NULL;
-   goto no_page;
-   }
-   mask = (1UL << shift) - 1;
-   page = pte_page(pte);
-   if (page)
-   page += (address & mask) / PAGE_SIZE;
-
-no_page:
-   local_irq_restore(flags);
-   return page;
-}
-
-struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-   pmd_t *pmd, int write)
-{
-   BUG();
-   return NULL;
-}
-
-struct page *
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-   pud_t *pud, int write)
-{
-   BUG();
-   return NULL;
-}
-#endif
-
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
 {
-- 
2.7.4



[PATCH v2 9/9] powerpc/hugetlb: Enable hugetlb migration for ppc64

2017-05-16 Thread Aneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/Kconfig.cputype | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 8017542d..8acc4f27d101 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -351,6 +351,11 @@ config PPC_RADIX_MMU
  is only implemented by IBM Power9 CPUs, if you don't have one of them
  you can probably disable this.
 
+config ARCH_ENABLE_HUGEPAGE_MIGRATION
+   def_bool y
+   depends on PPC_BOOK3S_64 && HUGETLB_PAGE && MIGRATION
+
+
 config PPC_MMU_NOHASH
def_bool y
depends on !PPC_STD_MMU
-- 
2.7.4



[PATCH v2 2/9] mm/follow_page_mask: Split follow_page_mask to smaller functions.

2017-05-16 Thread Aneesh Kumar K.V
Makes the code easier to read. No functional changes in this patch. In a
followup patch, we will update follow_page_mask to handle the hugetlb hugepd
format so that archs like ppc64 can switch to the generic version. This split
helps in doing that nicely.

Reviewed-by: Naoya Horiguchi 
Signed-off-by: Aneesh Kumar K.V 
---
 mm/gup.c | 148 +++
 1 file changed, 91 insertions(+), 57 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 04aa405350dc..73d46f9f7b81 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -208,68 +208,16 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
return no_page_table(vma, flags);
 }
 
-/**
- * follow_page_mask - look up a page descriptor from a user-virtual address
- * @vma: vm_area_struct mapping @address
- * @address: virtual address to look up
- * @flags: flags modifying lookup behaviour
- * @page_mask: on output, *page_mask is set according to the size of the page
- *
- * @flags can have FOLL_ flags set, defined in 
- *
- * Returns the mapped (struct page *), %NULL if no mapping exists, or
- * an error pointer if there is a mapping to something not represented
- * by a page descriptor (see also vm_normal_page()).
- */
-struct page *follow_page_mask(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags,
- unsigned int *page_mask)
+static struct page *follow_pmd_mask(struct vm_area_struct *vma,
+   unsigned long address, pud_t *pudp,
+   unsigned int flags, unsigned int *page_mask)
 {
-   pgd_t *pgd;
-   p4d_t *p4d;
-   pud_t *pud;
pmd_t *pmd;
spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
 
-   *page_mask = 0;
-
-   page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
-   if (!IS_ERR(page)) {
-   BUG_ON(flags & FOLL_GET);
-   return page;
-   }
-
-   pgd = pgd_offset(mm, address);
-   if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
-   return no_page_table(vma, flags);
-   p4d = p4d_offset(pgd, address);
-   if (p4d_none(*p4d))
-   return no_page_table(vma, flags);
-   BUILD_BUG_ON(p4d_huge(*p4d));
-   if (unlikely(p4d_bad(*p4d)))
-   return no_page_table(vma, flags);
-   pud = pud_offset(p4d, address);
-   if (pud_none(*pud))
-   return no_page_table(vma, flags);
-   if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
-   page = follow_huge_pud(mm, address, pud, flags);
-   if (page)
-   return page;
-   return no_page_table(vma, flags);
-   }
-   if (pud_devmap(*pud)) {
-   ptl = pud_lock(mm, pud);
-   page = follow_devmap_pud(vma, address, pud, flags);
-   spin_unlock(ptl);
-   if (page)
-   return page;
-   }
-   if (unlikely(pud_bad(*pud)))
-   return no_page_table(vma, flags);
-
-   pmd = pmd_offset(pud, address);
+   pmd = pmd_offset(pudp, address);
if (pmd_none(*pmd))
return no_page_table(vma, flags);
if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
@@ -319,13 +267,99 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
return ret ? ERR_PTR(ret) :
follow_page_pte(vma, address, pmd, flags);
}
-
page = follow_trans_huge_pmd(vma, address, pmd, flags);
spin_unlock(ptl);
*page_mask = HPAGE_PMD_NR - 1;
return page;
 }
 
+
+static struct page *follow_pud_mask(struct vm_area_struct *vma,
+   unsigned long address, p4d_t *p4dp,
+   unsigned int flags, unsigned int *page_mask)
+{
+   pud_t *pud;
+   spinlock_t *ptl;
+   struct page *page;
+   struct mm_struct *mm = vma->vm_mm;
+
+   pud = pud_offset(p4dp, address);
+   if (pud_none(*pud))
+   return no_page_table(vma, flags);
+   if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
+   page = follow_huge_pud(mm, address, pud, flags);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
+   if (pud_devmap(*pud)) {
+   ptl = pud_lock(mm, pud);
+   page = follow_devmap_pud(vma, address, pud, flags);
+   spin_unlock(ptl);
+   if (page)
+   return page;
+   }
+   if (unlikely(pud_bad(*pud)))
+   return no_page_table(vma, flags);
+
+   return follow_pmd_mask(vma, address, pud, flags, page_mask);
+}
+
+
+static struct page *follow_p4d_mask(struct vm_area_struct *vma,
+   unsigned long 

[PATCH v2 6/9] mm/follow_page_mask: Add support for hugepage directory entry

2017-05-16 Thread Aneesh Kumar K.V
Architectures like ppc64 support hugepage sizes that are not mapped to any of
the page table levels. Instead they add an alternate page table entry format
called hugepage directory (hugepd). A hugepd indicates that the page table
entry maps to a set of hugetlb pages. Add support for this in the generic
follow_page_mask code. We already support this format in the generic gup code.

The default implementation prints a warning and returns NULL. We will add
ppc64 support in later patches.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h |  4 
 mm/gup.c| 33 +
 mm/hugetlb.c|  8 
 3 files changed, 45 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f66c1d4e0d1f..caee7c4664c8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -141,6 +141,9 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long 
addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
  int write);
+struct page *follow_huge_pd(struct vm_area_struct *vma,
+   unsigned long address, hugepd_t hpd,
+   int flags, int pdshift);
 struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int flags);
 struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
@@ -175,6 +178,7 @@ static inline void hugetlb_report_meminfo(struct seq_file 
*m)
 static inline void hugetlb_show_meminfo(void)
 {
 }
+#define follow_huge_pd(vma, addr, hpd, flags, pdshift) NULL
 #define follow_huge_pmd(mm, addr, pmd, flags)  NULL
 #define follow_huge_pud(mm, addr, pud, flags)  NULL
 #define follow_huge_pgd(mm, addr, pgd, flags)  NULL
diff --git a/mm/gup.c b/mm/gup.c
index 65255389620a..a7f5b82e15f3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -226,6 +226,14 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
return page;
return no_page_table(vma, flags);
}
+   if (is_hugepd(__hugepd(pmd_val(*pmd)))) {
+   page = follow_huge_pd(vma, address,
+ __hugepd(pmd_val(*pmd)), flags,
+ PMD_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -292,6 +300,14 @@ static struct page *follow_pud_mask(struct vm_area_struct 
*vma,
return page;
return no_page_table(vma, flags);
}
+   if (is_hugepd(__hugepd(pud_val(*pud)))) {
+   page = follow_huge_pd(vma, address,
+ __hugepd(pud_val(*pud)), flags,
+ PUD_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags);
@@ -311,6 +327,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct 
*vma,
unsigned int flags, unsigned int *page_mask)
 {
p4d_t *p4d;
+   struct page *page;
 
p4d = p4d_offset(pgdp, address);
if (p4d_none(*p4d))
@@ -319,6 +336,14 @@ static struct page *follow_p4d_mask(struct vm_area_struct 
*vma,
if (unlikely(p4d_bad(*p4d)))
return no_page_table(vma, flags);
 
+   if (is_hugepd(__hugepd(p4d_val(*p4d)))) {
+   page = follow_huge_pd(vma, address,
+ __hugepd(p4d_val(*p4d)), flags,
+ P4D_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
return follow_pud_mask(vma, address, p4d, flags, page_mask);
 }
 
@@ -363,6 +388,14 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
return page;
return no_page_table(vma, flags);
}
+   if (is_hugepd(__hugepd(pgd_val(*pgd)))) {
+   page = follow_huge_pd(vma, address,
+ __hugepd(pgd_val(*pgd)), flags,
+ PGDIR_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
 
return follow_p4d_mask(vma, address, pgd, flags, page_mask);
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a12d3cab04fe..58307d62ac37 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4643,6 +4643,14 @@ follow_huge_addr(struct 

[PATCH v2 7/9] powerpc/hugetlb: Add follow_huge_pd implementation for ppc64.

2017-05-16 Thread Aneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hugetlbpage.c | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 80f6d2ed551a..5c829a83a4cc 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -618,6 +620,46 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 }
 
 /*
+ * 64 bit book3s use generic follow_page_mask
+ */
+#ifdef CONFIG_PPC_BOOK3S_64
+
+struct page *follow_huge_pd(struct vm_area_struct *vma,
+   unsigned long address, hugepd_t hpd,
+   int flags, int pdshift)
+{
+   pte_t *ptep;
+   spinlock_t *ptl;
+   struct page *page = NULL;
+   unsigned long mask;
+   int shift = hugepd_shift(hpd);
+   struct mm_struct *mm = vma->vm_mm;
+
+retry:
+   ptl = &mm->page_table_lock;
+   spin_lock(ptl);
+
+   ptep = hugepte_offset(hpd, address, pdshift);
+   if (pte_present(*ptep)) {
+   mask = (1UL << shift) - 1;
+   page = pte_page(*ptep);
+   page += ((address & mask) >> PAGE_SHIFT);
+   if (flags & FOLL_GET)
+   get_page(page);
+   } else {
+   if (is_hugetlb_entry_migration(*ptep)) {
+   spin_unlock(ptl);
+   __migration_entry_wait(mm, ptep, ptl);
+   goto retry;
+   }
+   }
+   spin_unlock(ptl);
+   return page;
+}
+
+#else /* !CONFIG_PPC_BOOK3S_64 */
+
+/*
  * We are holding mmap_sem, so a parallel huge page collapse cannot run.
  * To prevent hugepage split, disable irq.
  */
@@ -672,6 +714,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
BUG();
return NULL;
 }
+#endif
 
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
-- 
2.7.4



[PATCH v2 5/9] mm/hugetlb: Move default definition of hugepd_t earlier in the header

2017-05-16 Thread Aneesh Kumar K.V
This enables the hugepd_t type to be used early. No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h | 47 ---
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edab98f0a7b8..f66c1d4e0d1f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -14,6 +14,30 @@ struct ctl_table;
 struct user_struct;
 struct mmu_gather;
 
+#ifndef is_hugepd
+/*
+ * Some architectures requires a hugepage directory format that is
+ * required to support multiple hugepage sizes. For example
+ * a4fe3ce76 "powerpc/mm: Allow more flexible layouts for hugepage pagetables"
+ * introduced the same on powerpc. This allows for a more flexible hugepage
+ * pagetable layout.
+ */
+typedef struct { unsigned long pd; } hugepd_t;
+#define is_hugepd(hugepd) (0)
+#define __hugepd(x) ((hugepd_t) { (x) })
+static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
+ unsigned pdshift, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+   return 0;
+}
+#else
+extern int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
+  unsigned pdshift, unsigned long end,
+  int write, struct page **pages, int *nr);
+#endif
+
+
 #ifdef CONFIG_HUGETLB_PAGE
 
 #include 
@@ -222,29 +246,6 @@ static inline int pud_write(pud_t pud)
 }
 #endif
 
-#ifndef is_hugepd
-/*
- * Some architectures requires a hugepage directory format that is
- * required to support multiple hugepage sizes. For example
- * a4fe3ce76 "powerpc/mm: Allow more flexible layouts for hugepage pagetables"
- * introduced the same on powerpc. This allows for a more flexible hugepage
- * pagetable layout.
- */
-typedef struct { unsigned long pd; } hugepd_t;
-#define is_hugepd(hugepd) (0)
-#define __hugepd(x) ((hugepd_t) { (x) })
-static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
- unsigned pdshift, unsigned long end,
- int write, struct page **pages, int *nr)
-{
-   return 0;
-}
-#else
-extern int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
-  unsigned pdshift, unsigned long end,
-  int write, struct page **pages, int *nr);
-#endif
-
 #define HUGETLB_ANON_FILE "anon_hugepage"
 
 enum {
-- 
2.7.4



[PATCH v2 4/9] mm/follow_page_mask: Add support for hugetlb pgd entries.

2017-05-16 Thread Aneesh Kumar K.V
From: Anshuman Khandual 

ppc64 supports pgd hugetlb entries. Add code to handle hugetlb pgd entries in
follow_page_mask so that ppc64 can switch to it to handle hugetlb entries.

Signed-off-by: Anshuman Khandual 
Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h | 4 
 mm/gup.c| 7 +++
 mm/hugetlb.c| 9 +
 3 files changed, 20 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fddf6cf403d5..edab98f0a7b8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -121,6 +121,9 @@ struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int flags);
 struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags);
+struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
+pgd_t *pgd, int flags);
+
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
@@ -150,6 +153,7 @@ static inline void hugetlb_show_meminfo(void)
 }
 #define follow_huge_pmd(mm, addr, pmd, flags)  NULL
 #define follow_huge_pud(mm, addr, pud, flags)  NULL
+#define follow_huge_pgd(mm, addr, pgd, flags)  NULL
 #define prepare_hugepage_range(file, addr, len)(-EINVAL)
 #define pmd_huge(x)0
 #define pud_huge(x)0
diff --git a/mm/gup.c b/mm/gup.c
index 73d46f9f7b81..65255389620a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -357,6 +357,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
return no_page_table(vma, flags);
 
+   if (pgd_huge(*pgd)) {
+   page = follow_huge_pgd(mm, address, pgd, flags);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
+
return follow_p4d_mask(vma, address, pgd, flags, page_mask);
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 25e2ee888a90..a12d3cab04fe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4687,6 +4687,15 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
 }
 
+struct page * __weak
+follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
+{
+   if (flags & FOLL_GET)
+   return NULL;
+
+   return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 
 /*
-- 
2.7.4



[PATCH v2 3/9] mm/hugetlb: export hugetlb_entry_migration helper

2017-05-16 Thread Aneesh Kumar K.V
We will be using this later from the ppc64 code. Change the return type to bool.

Reviewed-by: Naoya Horiguchi 
Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c| 8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b857fc8cc2ec..fddf6cf403d5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -126,6 +126,7 @@ int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
 
+bool is_hugetlb_entry_migration(pte_t pte);
 #else /* !CONFIG_HUGETLB_PAGE */
 
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ce090186b992..25e2ee888a90 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3182,17 +3182,17 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma,
update_mmu_cache(vma, address, ptep);
 }
 
-static int is_hugetlb_entry_migration(pte_t pte)
+bool is_hugetlb_entry_migration(pte_t pte)
 {
swp_entry_t swp;
 
if (huge_pte_none(pte) || pte_present(pte))
-   return 0;
+   return false;
swp = pte_to_swp_entry(pte);
if (non_swap_entry(swp) && is_migration_entry(swp))
-   return 1;
+   return true;
else
-   return 0;
+   return false;
 }
 
 static int is_hugetlb_entry_hwpoisoned(pte_t pte)
-- 
2.7.4



[PATCH v2 1/9] mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at

2017-05-16 Thread Aneesh Kumar K.V
The right interface to use to set a hugetlb pte entry is set_huge_pte_at. Use
that instead of set_pte_at.

Reviewed-by: Naoya Horiguchi 
Signed-off-by: Aneesh Kumar K.V 
---
 mm/migrate.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 9a0897a14d37..4c272ac6fe53 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -224,25 +224,26 @@ static int remove_migration_pte(struct page *page, struct vm_area_struct *vma,
if (is_write_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
 
+   flush_dcache_page(new);
 #ifdef CONFIG_HUGETLB_PAGE
if (PageHuge(new)) {
pte = pte_mkhuge(pte);
pte = arch_make_huge_pte(pte, vma, new, 0);
-   }
-#endif
-   flush_dcache_page(new);
-   set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
-
-   if (PageHuge(new)) {
+   set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
if (PageAnon(new))
hugepage_add_anon_rmap(new, vma, pvmw.address);
else
page_dup_rmap(new, true);
-   } else if (PageAnon(new))
-   page_add_anon_rmap(new, vma, pvmw.address, false);
-   else
-   page_add_file_rmap(new, false);
+   } else
+#endif
+   {
+   set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 
+   if (PageAnon(new))
+   page_add_anon_rmap(new, vma, pvmw.address, false);
+   else
+   page_add_file_rmap(new, false);
+   }
if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
mlock_vma_page(new);
 
-- 
2.7.4



[PATCH v2 0/9] HugeTLB migration support for PPC64

2017-05-16 Thread Aneesh Kumar K.V
HugeTLB migration support for PPC64

Changes from V1:
* Added Reviewed-by:
* Drop follow_huge_addr from powerpc

Aneesh Kumar K.V (8):
  mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at
  mm/follow_page_mask: Split follow_page_mask to smaller functions.
  mm/hugetlb: export hugetlb_entry_migration helper
  mm/hugetlb: Move default definition of hugepd_t earlier in the header
  mm/follow_page_mask: Add support for hugepage directory entry
  powerpc/hugetlb: Add follow_huge_pd implementation for ppc64.
  powerpc/mm/hugetlb: Remove follow_huge_addr for powerpc
  powerpc/hugetlb: Enable hugetlb migration for ppc64

Anshuman Khandual (1):
  mm/follow_page_mask: Add support for hugetlb pgd entries.

 arch/powerpc/mm/hugetlbpage.c  |  81 ++
 arch/powerpc/platforms/Kconfig.cputype |   5 +
 include/linux/hugetlb.h|  56 ++
 mm/gup.c   | 186 +++--
 mm/hugetlb.c   |  25 -
 mm/migrate.c   |  21 ++--
 6 files changed, 230 insertions(+), 144 deletions(-)

-- 
2.7.4



[PATCH v2 2/2] powerpc/mm/hugetlb: Add support for 1G huge pages

2017-05-16 Thread Aneesh Kumar K.V
POWER9 supports hugepages of size 2M and 1G in radix MMU mode. This patch
enables the usage of the 1G page size for hugetlbfs. It also updates the
helpers so that 1G page allocation can be done at runtime.

We still don't enable the 1G page size on DD1. This is to avoid the
workaround mentioned in commit 6d3a0379ebdc8 ("powerpc/mm: Add
radix__tlb_flush_pte_p9_dd1()").
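
For illustration, the resulting radix-mode size check can be modeled by the
stand-alone C sketch below (not the kernel code; the enum values and the
helper name are made up for the example): 2M is always accepted, and 1G
only when the DD1 workaround is not needed.

#include <stdbool.h>
#include <stdio.h>

enum { MMU_PAGE_2M, MMU_PAGE_1G, MMU_PAGE_16M, MMU_PAGE_16G };

/* Condensed model of the nested check added to add_huge_page_size(). */
static bool radix_hugepage_size_ok(int mmu_psize, bool is_power9_dd1)
{
	if (mmu_psize == MMU_PAGE_2M)
		return true;
	return mmu_psize == MMU_PAGE_1G && !is_power9_dd1;
}

int main(void)
{
	printf("1G on DD1:  %d\n", radix_hugepage_size_ok(MMU_PAGE_1G, true));  /* 0 */
	printf("1G on DD2+: %d\n", radix_hugepage_size_ok(MMU_PAGE_1G, false)); /* 1 */
	return 0;
}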

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 10 ++
 arch/powerpc/mm/hugetlbpage.c|  7 +--
 arch/powerpc/platforms/Kconfig.cputype   |  1 +
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index cd366596..5c28bd6f2ae1 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -50,4 +50,14 @@ static inline pte_t arch_make_huge_pte(pte_t entry, struct 
vm_area_struct *vma,
else
return entry;
 }
+
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void)
+{
+   if (radix_enabled())
+   return true;
+   return false;
+}
+#endif
+
 #endif
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a4f33de4008e..80f6d2ed551a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -763,8 +763,11 @@ static int __init add_huge_page_size(unsigned long long size)
 * Hash: 16M and 16G
 */
if (radix_enabled()) {
-   if (mmu_psize != MMU_PAGE_2M)
-   return -EINVAL;
+   if (mmu_psize != MMU_PAGE_2M) {
+   if (cpu_has_feature(CPU_FTR_POWER9_DD1) ||
+   (mmu_psize != MMU_PAGE_1G))
+   return -EINVAL;
+   }
} else {
if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
return -EINVAL;
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 684e886eaae4..8017542d 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -344,6 +344,7 @@ config PPC_STD_MMU_64
 config PPC_RADIX_MMU
bool "Radix MMU Support"
depends on PPC_BOOK3S_64
+   select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA
default y
help
  Enable support for the Power ISA 3.0 Radix style MMU. Currently this
-- 
2.7.4



[PATCH v2 1/2] mm/hugetlb: Cleanup ARCH_HAS_GIGANTIC_PAGE

2017-05-16 Thread Aneesh Kumar K.V
This moves the #ifdef in C code to a Kconfig dependency. We also make the
gigantic_page_supported() function arch specific, which allows an arch to
conditionally enable runtime allocation of gigantic huge pages.
Architectures like ppc64 support different gigantic huge page sizes (16G
and 1G) depending on the translation mode selected. This gives ppc64 the
opportunity to enable runtime allocation only for the 1G hugepage size.

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/arm64/Kconfig   | 2 +-
 arch/arm64/include/asm/hugetlb.h | 4 
 arch/s390/Kconfig| 2 +-
 arch/s390/include/asm/hugetlb.h  | 3 +++
 arch/x86/Kconfig | 2 +-
 mm/hugetlb.c | 7 ++-
 6 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3741859765cf..1f8c1f73aada 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -11,7 +11,7 @@ config ARM64
select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_GCOV_PROFILE_ALL
-   select ARCH_HAS_GIGANTIC_PAGE
+   select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA
select ARCH_HAS_KCOV
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index bbc1e35aa601..793bd73b0d07 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -83,4 +83,8 @@ extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
 extern void huge_ptep_clear_flush(struct vm_area_struct *vma,
  unsigned long addr, pte_t *ptep);
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void) { return true; }
+#endif
+
 #endif /* __ASM_HUGETLB_H */
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index a2dcef0aacc7..a41bbf420dda 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -67,7 +67,7 @@ config S390
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_GCOV_PROFILE_ALL
-   select ARCH_HAS_GIGANTIC_PAGE
+   select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA
select ARCH_HAS_KCOV
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index cd546a245c68..89057b2cc8fe 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -112,4 +112,7 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
return pte_modify(pte, newprot);
 }
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void) { return true; }
+#endif
 #endif /* _ASM_S390_HUGETLB_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..30a6328136ac 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,7 +22,7 @@ config X86_64
def_bool y
depends on 64BIT
# Options that are inherently 64-bit kernel only:
-   select ARCH_HAS_GIGANTIC_PAGE
+   select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA
select ARCH_SUPPORTS_INT128
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d0aab9ee80d..ce090186b992 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1024,9 +1024,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
((node = hstate_next_node_to_free(hs, mask)) || 1); \
nr_nodes--)
 
-#if defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) && \
-   ((defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || \
-   defined(CONFIG_CMA))
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 static void destroy_compound_gigantic_page(struct page *page,
unsigned int order)
 {
@@ -1158,8 +1156,7 @@ static int alloc_fresh_gigantic_page(struct hstate *h,
return 0;
 }
 
-static inline bool gigantic_page_supported(void) { return true; }
-#else
+#else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */
 static inline bool gigantic_page_supported(void) { return false; }
 static inline void free_gigantic_page(struct page *page, unsigned int order) { }
 static inline void destroy_compound_gigantic_page(struct page *page,
-- 
2.7.4



[PATCH 0/6] Enable support for deep-stop states on POWER9

2017-05-16 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

This patch series contains some of the fixes required for enabling
support for deep stop states such as STOP4 and STOP11 via CPU-Hotplug.

These fixes mainly ensure that some of the hypervisor resources which
are lost during the deep stop state are correctly restored on a
wakeup.

There are 6 patches in the series.

Patch 1 correctly initializes the core_idle_state_ptr based on the
threads_per_core. core_idle_state_ptr is used to determine if a thread
is the last thread entering a deep stop state or a first thread waking
up from deep stop state in order to save/restore per-core resources.

Patch 2 decouples restoring timebase from restoring hypervisor
resources, as there are stop states which lose hypervisor state but
not the timebase.

Patch 3 saves the LPCR value before executing deep stop and restores
it back to the saved value on the wakeup from stop.

Patch 4 programs the restoration of some of the one-time initialized
SPRs via the stop-api.

Patch 5 provides a workaround for a hardware issue on POWER9 DD1 chips
where the PLS value cannot be relied upon on a wakeup from deep stop.

Patch 6 fixes the cpuidle-powernv initialization code to allow deep
states that don't lose timebase.

These patches are based on the Linux upstream and have been tested
with the corresponding skiboot patches in
https://lists.ozlabs.org/pipermail/skiboot/2017-May/007183.html to get
STOP4 working via CPU-Hotplug.

Akshay Adiga (1):
  powernv:idle: Restore SPRs for deep idle states via stop API.

Gautham R. Shenoy (5):
  powernv:idle: Correctly initialize core_idle_state_ptr
  powernv:idle: Decouple Timebase restore & Per-core SPRs restore
  powernv:idle: Restore LPCR on wakeup from deep-stop
  powernv:idle: Use Requested Level for restoring state on P9 DD1
  cpuidle-powernv: Allow Deep stop states that don't stop time

 arch/powerpc/include/asm/paca.h   |   2 +
 arch/powerpc/kernel/asm-offsets.c |   1 +
 arch/powerpc/kernel/idle_book3s.S |  33 +++---
 arch/powerpc/platforms/powernv/idle.c | 112 +-
 drivers/cpuidle/cpuidle-powernv.c |  16 +++--
 5 files changed, 110 insertions(+), 54 deletions(-)

-- 
1.8.3.1



[PATCH 2/6] powernv:idle: Decouple Timebase restore & Per-core SPRs restore

2017-05-16 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On POWER8:
   -  nap: both timebase and hypervisor state are retained.
   -  fast-sleep: timebase is lost, but the hypervisor state is retained.
   -  winkle: both timebase and hypervisor state are lost.

Hence, the current code for handling exit from an idle state assumes
that if the timebase value is retained, then so is the hypervisor
state. Thus, the current code doesn't restore per-core hypervisor
state in such cases.

But that is no longer the case on POWER9 where we do have stop states
in which timebase value is retained, but the hypervisor state is
lost. So we have to ensure that the per-core hypervisor state gets
restored in such cases.

Fix this by ensuring that even in the case when timebase is retained,
we explicitly check if we are waking up from a deep stop that loses
per-core hypervisor state (indicated by cr4 being eq or gt), and if
this is the case, we restore the per-core hypervisor state.
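
For clarity, the decoupled wakeup logic can be modeled by the following
stand-alone C sketch (an illustration only; the real logic lives in
idle_book3s.S and is driven by the cr3/cr4 condition registers):

#include <stdbool.h>

struct wakeup_state {
	bool timebase_lost;       /* cr3 test: timebase needs a resync */
	bool core_hyp_state_lost; /* cr4 eq/gt: per-core state was lost */
};

static void resync_timebase(void)        { /* opal_resync_timebase() */ }
static void restore_core_hyp_state(void) { /* per-core SPR restore */ }

/* The two restores are now decided independently of each other. */
static void wakeup_restore(const struct wakeup_state *w)
{
	if (w->timebase_lost)
		resync_timebase();
	if (w->core_hyp_state_lost)
		restore_core_hyp_state();
}

int main(void)
{
	/* A POWER9 deep stop that retains the timebase: */
	struct wakeup_state w = { .timebase_lost = false,
				  .core_hyp_state_lost = true };
	wakeup_restore(&w); /* per-core state restored, no TB resync */
	return 0;
}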

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/idle_book3s.S | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 4898d67..afd029f 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -731,13 +731,14 @@ timebase_resync:
 * Use cr3 which indicates that we are waking up with atleast partial
 * hypervisor state loss to determine if TIMEBASE RESYNC is needed.
 */
-   ble cr3,clear_lock
+   ble cr3,.Ltb_resynced
/* Time base re-sync */
bl  opal_resync_timebase;
/*
-* If waking up from sleep, per core state is not lost, skip to
-* clear_lock.
+* If waking up from sleep (POWER8), per core state
+* is not lost, skip to clear_lock.
 */
+.Ltb_resynced:
blt cr4,clear_lock
 
/*
-- 
1.8.3.1



[PATCH 3/6] powernv:idle: Restore LPCR on wakeup from deep-stop

2017-05-16 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On wakeup from a deep stop state which is supposed to lose the
hypervisor state, we don't restore the LPCR to the old value but set
it to a "sane" value via cur_cpu_spec->cpu_restore().

The problem is that the "sane" value doesn't include UPRT and the HR
bits which are required to run correctly in Radix mode.

Fix this on POWER9 onwards by restoring the LPCR to whatever value it
held before executing the stop instruction.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/idle_book3s.S | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index afd029f..6c9920d 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -31,6 +31,7 @@
  * registers for winkle support.
  */
 #define _SDR1  GPR3
+#define _PTCR  GPR3
 #define _RPR   GPR4
 #define _SPURR GPR5
 #define _PURR  GPR6
@@ -39,7 +40,7 @@
 #define _AMOR  GPR9
 #define _WORT  GPR10
 #define _WORC  GPR11
-#define _PTCR  GPR12
+#define _LPCR  GPR12
 
 #define PSSCR_EC_ESL_MASK_SHIFTED  (PSSCR_EC | PSSCR_ESL) >> 16
 
@@ -55,12 +56,14 @@ save_sprs_to_stack:
 * here since any thread in the core might wake up first
 */
 BEGIN_FTR_SECTION
-   mfspr   r3,SPRN_PTCR
-   std r3,_PTCR(r1)
/*
 * Note - SDR1 is dropped in Power ISA v3. Hence not restoring
 * SDR1 here
 */
+   mfspr   r3,SPRN_PTCR
+   std r3,_PTCR(r1)
+   mfspr   r3,SPRN_LPCR
+   std r3,_LPCR(r1)
 FTR_SECTION_ELSE
mfspr   r3,SPRN_SDR1
std r3,_SDR1(r1)
@@ -813,6 +816,10 @@ no_segments:
mtctr   r12
bctrl
 
+BEGIN_FTR_SECTION
+   ld  r4,_LPCR(r1)
+   mtspr   SPRN_LPCR,r4
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 hypervisor_state_restored:
 
mtspr   SPRN_SRR1,r16
-- 
1.8.3.1



[PATCH 4/6] powernv:idle: Restore SPRs for deep idle states via stop API.

2017-05-16 Thread Gautham R. Shenoy
From: Akshay Adiga 

Some of the SPR values (HID0, MSR, SPRG0) don't change during the run
time of a booted kernel, once they have been initialized.

The contents of these SPRs are lost when the CPUs enter deep stop
states. So instead of saving and restoring SPRs from the kernel, use the
stop-api provided by the firmware, through which the firmware can restore
the contents of these SPRs to their initialized values after wakeup
from a deep stop state.

Apart from these, program the PSSCR value of the deepest stop state via
the stop-api. This will be used to indicate to the underlying firmware
the stop state into which to put the threads that have been woken up by
a special-wakeup.

And while we are at programming SPRs via the stop-api, ensure that the
HID1, HID4 and HID5 registers, which are only available on POWER8, are
not requested to be restored by the firmware on POWER9.

Signed-off-by: Akshay Adiga 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/platforms/powernv/idle.c | 83 ++-
 1 file changed, 52 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 84eb9bc..4deac0d 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -30,8 +30,33 @@
 /* Power ISA 3.0 allows for stop states 0x0 - 0xF */
 #define MAX_STOP_STATE 0xF
 
+#define P9_STOP_SPR_MSR 2000
+#define P9_STOP_SPR_PSSCR  855
+
 static u32 supported_cpuidle_states;
 
+/*
+ * The default stop state that will be used by ppc_md.power_save
+ * function on platforms that support stop instruction.
+ */
+static u64 pnv_default_stop_val;
+static u64 pnv_default_stop_mask;
+static bool default_stop_found;
+
+/*
+ * First deep stop state. Used to figure out when to save/restore
+ * hypervisor context.
+ */
+u64 pnv_first_deep_stop_state = MAX_STOP_STATE;
+
+/*
+ * psscr value and mask of the deepest stop idle state.
+ * Used when a cpu is offlined.
+ */
+static u64 pnv_deepest_stop_psscr_val;
+static u64 pnv_deepest_stop_psscr_mask;
+static bool deepest_stop_found;
+
 static int pnv_save_sprs_for_deep_states(void)
 {
int cpu;
@@ -48,6 +73,8 @@ static int pnv_save_sprs_for_deep_states(void)
uint64_t hid4_val = mfspr(SPRN_HID4);
uint64_t hid5_val = mfspr(SPRN_HID5);
uint64_t hmeer_val = mfspr(SPRN_HMEER);
+   uint64_t msr_val = MSR_IDLE;
+   uint64_t psscr_val = pnv_deepest_stop_psscr_val;
 
for_each_possible_cpu(cpu) {
uint64_t pir = get_hard_smp_processor_id(cpu);
@@ -61,6 +88,18 @@ static int pnv_save_sprs_for_deep_states(void)
if (rc != 0)
return rc;
 
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   rc = opal_slw_set_reg(pir, P9_STOP_SPR_MSR, msr_val);
+   if (rc)
+   return rc;
+
+   rc = opal_slw_set_reg(pir,
+ P9_STOP_SPR_PSSCR, psscr_val);
+
+   if (rc)
+   return rc;
+   }
+
/* HIDs are per core registers */
if (cpu_thread_in_core(cpu) == 0) {
 
@@ -72,17 +111,21 @@ static int pnv_save_sprs_for_deep_states(void)
if (rc != 0)
return rc;
 
-   rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
-   if (rc != 0)
-   return rc;
+   /* Only p8 needs to set extra HID registers */
+   if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
 
-   rc = opal_slw_set_reg(pir, SPRN_HID4, hid4_val);
-   if (rc != 0)
-   return rc;
+   rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
+   if (rc != 0)
+   return rc;
 
-   rc = opal_slw_set_reg(pir, SPRN_HID5, hid5_val);
-   if (rc != 0)
-   return rc;
+   rc = opal_slw_set_reg(pir, SPRN_HID4, hid4_val);
+   if (rc != 0)
+   return rc;
+
+   rc = opal_slw_set_reg(pir, SPRN_HID5, hid5_val);
+   if (rc != 0)
+   return rc;
+   }
}
}
 
@@ -241,14 +284,6 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600,
store_fastsleep_workaround_applyonce);
 
 /*
- * The default stop state that will be used by ppc_md.power_save
- * function on platforms that support stop instruction.
- */
-static u64 pnv_default_stop_val;
-static u64 

[PATCH 1/6] powernv:idle: Correctly initialize core_idle_state_ptr

2017-05-16 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The lower 8 bits of core_idle_state_ptr track the non-idle threads in
the core. They are supposed to be initialized to a bitmap corresponding
to threads_per_core. However, currently they are initialized to
PNV_CORE_IDLE_THREAD_BITS (0xFF). This is correct for POWER8, which has
8 threads per core, but not for POWER9, which has 4 threads per core.

As a result, on POWER9, core_idle_state_ptr gets initialized to 0xFF.
Even when all the threads of the core are idle, the bits for the
non-existent threads 4-7 remain set, so the idle entry/exit code fails
to save/restore per-core hypervisor state since it assumes that there
are threads in the core which are still active.

Fix this by correctly initializing the lower bits of
core_idle_state_ptr on the basis of threads_per_core.
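
As a quick worked example of the corrected mask computation (stand-alone,
illustrative only):

#include <stdio.h>

int main(void)
{
	/* (1 << threads_per_core) - 1 sets one bit per existing thread. */
	for (int threads_per_core = 4; threads_per_core <= 8; threads_per_core += 4)
		printf("threads_per_core=%d -> thread mask=0x%02x\n",
		       threads_per_core, (1 << threads_per_core) - 1);
	/* Prints 0x0f for POWER9 (4 threads) and 0xff for POWER8 (8 threads). */
	return 0;
}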

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/platforms/powernv/idle.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 445f30a..84eb9bc 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -96,15 +96,24 @@ static void pnv_alloc_idle_core_states(void)
u32 *core_idle_state;
 
/*
-* core_idle_state - First 8 bits track the idle state of each thread
-* of the core. The 8th bit is the lock bit. Initially all thread bits
-* are set. They are cleared when the thread enters deep idle state
-* like sleep and winkle. Initially the lock bit is cleared.
-* The lock bit has 2 purposes
-* a. While the first thread is restoring core state, it prevents
-* other threads in the core from switching to process context.
-* b. While the last thread in the core is saving the core state, it
-* prevents a different thread from waking up.
+* core_idle_state - The lower 8 bits track the idle state of
+* each thread of the core.
+*
+* The most significant bit is the lock bit.
+*
+* Initially all the bits corresponding to threads_per_core
+* are set. They are cleared when the thread enters deep idle
+* state like sleep and winkle/stop.
+*
+* Initially the lock bit is cleared.  The lock bit has 2
+* purposes:
+*  a. While the first thread in the core waking up from
+* idle is restoring core state, it prevents other
+* threads in the core from switching to process
+* context.
+*  b. While the last thread in the core is saving the
+* core state, it prevents a different thread from
+* waking up.
 */
for (i = 0; i < nr_cores; i++) {
int first_cpu = i * threads_per_core;
@@ -112,7 +121,7 @@ static void pnv_alloc_idle_core_states(void)
size_t paca_ptr_array_size;
 
core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
-   *core_idle_state = PNV_CORE_IDLE_THREAD_BITS;
+   *core_idle_state = (1 << threads_per_core) - 1;
paca_ptr_array_size = (threads_per_core *
   sizeof(struct paca_struct *));
 
-- 
1.8.3.1



[PATCH 5/6] powernv:idle: Use Requested Level for restoring state on P9 DD1

2017-05-16 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On Power9 DD1 due to a hardware bug the Power-Saving Level Status
field (PLS) of the PSSCR for a thread waking up from a deep state can
under-report if some other thread in the core is in a shallow stop
state. The scenario in which this can manifest is as follows:

   1) All the threads of the core are in deep stop.
   2) One of the threads is woken up. The PLS for this thread will
  correctly reflect that it is waking up from deep stop.
   3) The thread that has woken up now executes a shallow stop.
   4) When some other thread in the core is woken, its PLS will reflect
  the shallow stop state.

Thus, the subsequent thread for which the PLS is under-reporting the
wakeup state will not restore the hypervisor resources.

Hence, on DD1 systems, use the Requested Level (RL) field as a
workaround to restore the contents of the hypervisor resources on the
wakeup from the stop state.
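
The two PSSCR fields involved can be modeled in C as follows (a
stand-alone sketch assuming the ISA 3.0 PSSCR layout, with PLS in the top
four bits and RL in the bottom four bits of the 64-bit register; the
values are made up):

#include <stdint.h>
#include <stdio.h>

static uint64_t psscr_pls(uint64_t psscr) { return psscr >> 60; }    /* rldicl r5,r5,4,60 */
static uint64_t psscr_rl(uint64_t psscr)  { return psscr & 0xFull; } /* rldicl r5,r5,0,60 */

int main(void)
{
	/* PLS under-reports a shallow state (2) while RL requested deep (7). */
	uint64_t psscr = (2ULL << 60) | 7ULL;

	printf("PLS=%llu RL=%llu\n",
	       (unsigned long long)psscr_pls(psscr),
	       (unsigned long long)psscr_rl(psscr));
	return 0;
}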

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/paca.h   |  2 ++
 arch/powerpc/kernel/asm-offsets.c |  1 +
 arch/powerpc/kernel/idle_book3s.S | 13 -
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1c09f8f..77f60a0 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -177,6 +177,8 @@ struct paca_struct {
 * to the sibling threads' paca.
 */
struct paca_struct **thread_sibling_pacas;
+   /* The PSSCR value that the kernel requested before going to stop */
+   u64 requested_psscr;
 #endif
 
 #ifdef CONFIG_PPC_STD_MMU_64
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 709e234..e15c178 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -742,6 +742,7 @@ int main(void)
OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask);
OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
OFFSET(PACA_SIBLING_PACA_PTRS, paca_struct, thread_sibling_pacas);
+   OFFSET(PACA_REQ_PSSCR, paca_struct, requested_psscr);
 #endif
 
DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 6c9920d..98a6d07 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -379,6 +379,7 @@ _GLOBAL(power9_idle_stop)
mfspr   r5,SPRN_PSSCR
andcr5,r5,r4
or  r3,r3,r5
+   std r3, PACA_REQ_PSSCR(r13)
mtspr   SPRN_PSSCR,r3
LOAD_REG_ADDR(r5,power_enter_stop)
li  r4,1
@@ -498,12 +499,22 @@ pnv_restore_hyp_resource_arch300:
LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
ld  r4,ADDROFF(pnv_first_deep_stop_state)(r5)
 
-   mfspr   r5,SPRN_PSSCR
+BEGIN_FTR_SECTION_NESTED(71)
+   /*
+* Assume that we are waking up from the same state as the
+* Requested Level (RL) in the PSSCR, which is in bits 60-63.
+*/
+   ld  r5,PACA_REQ_PSSCR(r13)
+   rldicl  r5,r5,0,60
+FTR_SECTION_ELSE_NESTED(71)
/*
 * 0-3 bits correspond to Power-Saving Level Status
 * which indicates the idle state we are waking up from
 */
+   mfspr   r5, SPRN_PSSCR
rldicl  r5,r5,4,60
+ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_POWER9_DD1, 71)
cmpdcr4,r5,r4
bge cr4,pnv_wakeup_tb_loss /* returns to caller */
 
-- 
1.8.3.1



[PATCH 6/6] cpuidle-powernv: Allow Deep stop states that don't stop time

2017-05-16 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The current code in the cpuidle-powernv initialization only allows deep
stop states (indicated by OPAL_PM_STOP_INST_DEEP) which lose the timebase
(indicated by OPAL_PM_TIMEBASE_STOP). This assumption goes back to
POWER8 times, when deep states used to lose the timebase. However, on
POWER9, we do have stop states that are deep (they lose hypervisor
state) but retain the timebase.

Fix the initialization code in the cpuidle-powernv driver to allow
such deep states.

Further, there is a bug in the cpuidle-powernv driver with
CONFIG_TICK_ONESHOT=n where we end up incrementing nr_idle_states
even if a platform idle state which loses the timebase was not added to
the cpuidle table.

Fix this by ensuring that the nr_idle_states variable gets incremented
only when the platform idle state was added to the cpuidle table.
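
The resulting classification of stop states can be condensed into the
following stand-alone C model (hypothetical helper, not the driver code;
nap and fastsleep are left out for brevity):

#include <stdbool.h>
#include <stdio.h>

enum action { ADD_PLAIN, ADD_TIMER_STOP, SKIP };

static enum action classify(bool has_stop_states, bool stops_timebase,
			     bool tick_oneshot)
{
	if (has_stop_states && !stops_timebase)
		return ADD_PLAIN;      /* deep or shallow, timebase kept */
	if (tick_oneshot && has_stop_states && stops_timebase)
		return ADD_TIMER_STOP; /* loses timebase */
	return SKIP;                   /* not added, so not counted */
}

int main(void)
{
	printf("%d %d %d\n",
	       classify(true, false, false), /* 0: ADD_PLAIN */
	       classify(true, true, true),   /* 1: ADD_TIMER_STOP */
	       classify(true, true, false)); /* 2: SKIP without oneshot */
	return 0;
}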

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-powernv.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 12409a5..45eaf06 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -354,6 +354,7 @@ static int powernv_add_idle_states(void)
 
for (i = 0; i < dt_idle_states; i++) {
unsigned int exit_latency, target_residency;
+   bool stops_timebase = false;
/*
 * If an idle state has exit latency beyond
 * POWERNV_THRESHOLD_LATENCY_NS then don't use it
@@ -381,6 +382,9 @@ static int powernv_add_idle_states(void)
}
}
 
+   if (flags[i] & OPAL_PM_TIMEBASE_STOP)
+   stops_timebase = true;
+
/*
 * For nap and fastsleep, use default target_residency
 * values if f/w does not expose it.
@@ -392,8 +396,7 @@ static int powernv_add_idle_states(void)
add_powernv_state(nr_idle_states, "Nap",
  CPUIDLE_FLAG_NONE, nap_loop,
  target_residency, exit_latency, 0, 0);
-   } else if ((flags[i] & OPAL_PM_STOP_INST_FAST) &&
-   !(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
+   } else if (has_stop_states && !stops_timebase) {
add_powernv_state(nr_idle_states, names[i],
  CPUIDLE_FLAG_NONE, stop_loop,
  target_residency, exit_latency,
@@ -405,8 +408,8 @@ static int powernv_add_idle_states(void)
 * within this config dependency check.
 */
 #ifdef CONFIG_TICK_ONESHOT
-   if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
-   flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
+   else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
+flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
if (!rc)
target_residency = 30;
/* Add FASTSLEEP state */
@@ -414,14 +417,15 @@ static int powernv_add_idle_states(void)
  CPUIDLE_FLAG_TIMER_STOP,
  fastsleep_loop,
  target_residency, exit_latency, 0, 0);
-   } else if ((flags[i] & OPAL_PM_STOP_INST_DEEP) &&
-   (flags[i] & OPAL_PM_TIMEBASE_STOP)) {
+   } else if (has_stop_states && stops_timebase) {
add_powernv_state(nr_idle_states, names[i],
  CPUIDLE_FLAG_TIMER_STOP, stop_loop,
  target_residency, exit_latency,
  psscr_val[i], psscr_mask[i]);
}
 #endif
+   else
+   continue;
nr_idle_states++;
}
 out:
-- 
1.8.3.1



Re: [v3 0/9] parallelized "struct page" zeroing

2017-05-16 Thread Michal Hocko
On Mon 15-05-17 16:44:26, Pasha Tatashin wrote:
> On 05/15/2017 03:38 PM, Michal Hocko wrote:
> >I do not think this is the right approach. Your measurements just show
> >that sparc could have a more optimized memset for small sizes. If you
> >keep the same memset only for the parallel initialization then you
> >just hide this fact. I wouldn't worry about other architectures. All
> >sane architectures should simply work reasonably well when touching a
> >single or only few cache lines at the same time. If some arches really
> >suffer from small memsets then the initialization should be driven by a
> >specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend
> >on DEFERRED_INIT. Or if you are too worried then make it opt-in and make
> >it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and
> >sparc after memset optimization.
> 
> OK, I will think about this.
> 
> I do not really like adding new configs because they tend to clutter the
> code. This is why,

Yes I hate adding new (arch) config options as well. And I still believe
we do not need any here either...

> I wanted to rely on already existing config that I know benefits all
> platforms that use it.

I wouldn't be so sure about this. If any other platform has similar
issues with small memsets as sparc then the overhead is just papered over
by parallel initialization.

> Eventually,
> "CONFIG_DEFERRED_STRUCT_PAGE_INIT" is going to become the default
> everywhere, as there should not be a drawback of using it even on small
> machines.

Maybe and I would highly appreciate that.
-- 
Michal Hocko
SUSE Labs