Re: [PATCH 2/3] powerpc/mm: Rename find_linux_pte_or_hugepte
On Wednesday 17 May 2017 10:27 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2017-05-17 at 08:57 +0530, Aneesh Kumar K.V wrote:
> > Benjamin Herrenschmidt writes:
> > > On Tue, 2017-05-16 at 14:56 +0530, Aneesh Kumar K.V wrote:
> > > > +static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea,
> > > > +				     bool *is_thp, unsigned *hshift)
> > > > +{
> > > > +	VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()),
> > > > +		"%s called with irq enabled\n", __func__);
> > > > +	return __find_linux_pte(pgdir, ea, is_thp, hshift);
> > > > +}
> > > > +
> > >
> > > When is arch_irqs_disabled() not sufficient ?
> >
> > We can do lockless page table walk in interrupt handlers where we find
> > MSR_EE = 0.
>
> Such as ?
>
> > I was not sure we mark softenabled 0 there. What I wanted to
> > indicate in the patch is that we are safe with either softenable = 0 or
> > MSR_EE = 0
>
> Reading the MSR is expensive... Can you find a case where we are hard
> disabled and not soft disabled in C code ? I can't think of one off-hand
> ... I know we have some asm that can do that very temporarily but I
> wouldn't think we have anything at runtime.
>
> Talking of which, we have this in irq.c:
>
> #ifdef CONFIG_TRACE_IRQFLAGS
> 	else {
> 		/*
> 		 * We should already be hard disabled here. We had bugs
> 		 * where that wasn't the case so let's dbl check it and
> 		 * warn if we are wrong. Only do that when IRQ tracing
> 		 * is enabled as mfmsr() can be costly.
> 		 */
> 		if (WARN_ON(mfmsr() & MSR_EE))
> 			__hard_irq_disable();
> 	}
> #endif
>
> I think we should move that to a new CONFIG_PPC_DEBUG_LAZY_IRQ because
> distros are likely to have CONFIG_TRACE_IRQFLAGS these days no ?

Yes, CONFIG_TRACE_IRQFLAGS is enabled. So in my local_t patchset, I have
added a patch to do the same under a flag, "CONFIG_IRQ_DEBUG_SUPPORT".

mpe reported a boot hang with the current version of the local_t patchset
on a Booke system; I have a fix for that and it is being tested. Will
post a newer version once the patch is verified.

Maddy

> Also we could add additional checks, such as MSR_EE matching
> paca->irq_happened or the above you mentioned, ie, WARN if we find a case
> where IRQs are hard disabled but soft enabled. If we find these, I think
> we should fix them.
>
> Cheers,
> Ben.
Re: [PATCH] powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line
On 05/16/2017 02:54 PM, Aneesh Kumar K.V wrote:
> +void __init reserve_hugetlb_gpages(void)
> +{
> +	char buf[10];
> +	phys_addr_t base;
> +	unsigned long gpage_size = 1UL << 34;
> +	static __initdata char cmdline[COMMAND_LINE_SIZE];
> +
> +	if (radix_enabled())
> +		gpage_size = 1UL << 30;
> +
> +	strlcpy(cmdline, boot_command_line, COMMAND_LINE_SIZE);
> +	parse_args("hugetlb gpages", cmdline, NULL, 0, 0, 0,
> +		   NULL, _gpage_early_setup);
> +
> +	if (!gpage_npages)
> +		return;
> +
> +	string_get_size(gpage_size, 1, STRING_UNITS_2, buf, sizeof(buf));
> +	pr_info("Trying to reserve %ld %s pages\n", gpage_npages, buf);
> +
> +	/* Allocate one page at a time */
> +	while (gpage_npages) {
> +		base = memblock_alloc_base(gpage_size, gpage_size,
> +					   MEMBLOCK_ALLOC_ANYWHERE);
> +		add_gpage(base, gpage_size, 1);

For 16GB pages (1UL << 34) on POWER8, we already do these functions
inside htab_dt_scan_hugepage_blocks(). IIUC this happens just by
scanning the DT, without even specifying any gpages on the kernel
command line:

	memblock_reserve()
	add_gpage()

Then won't attempting to allocate from memblock and adding the page
again into the gigantic pages list collide? Moreover, this tries to
allocate from anywhere in RAM, not specifically from the gpages
mentioned in the device tree by the platform. Are we trying to support
16GB pages from any memory, without platform notification through the
DT?
Re: [PATCH 2/3] powerpc/mm: Rename find_linux_pte_or_hugepte
On Wed, 2017-05-17 at 08:57 +0530, Aneesh Kumar K.V wrote:
> Benjamin Herrenschmidt writes:
> > On Tue, 2017-05-16 at 14:56 +0530, Aneesh Kumar K.V wrote:
> > > +static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea,
> > > +				     bool *is_thp, unsigned *hshift)
> > > +{
> > > +	VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()),
> > > +		"%s called with irq enabled\n", __func__);
> > > +	return __find_linux_pte(pgdir, ea, is_thp, hshift);
> > > +}
> > > +
> >
> > When is arch_irqs_disabled() not sufficient ?
>
> We can do lockless page table walk in interrupt handlers where we find
> MSR_EE = 0.

Such as ?

> I was not sure we mark softenabled 0 there. What I wanted to
> indicate in the patch is that we are safe with either softenable = 0 or
> MSR_EE = 0

Reading the MSR is expensive... Can you find a case where we are hard
disabled and not soft disabled in C code ? I can't think of one off-hand
... I know we have some asm that can do that very temporarily but I
wouldn't think we have anything at runtime.

Talking of which, we have this in irq.c:

#ifdef CONFIG_TRACE_IRQFLAGS
	else {
		/*
		 * We should already be hard disabled here. We had bugs
		 * where that wasn't the case so let's dbl check it and
		 * warn if we are wrong. Only do that when IRQ tracing
		 * is enabled as mfmsr() can be costly.
		 */
		if (WARN_ON(mfmsr() & MSR_EE))
			__hard_irq_disable();
	}
#endif

I think we should move that to a new CONFIG_PPC_DEBUG_LAZY_IRQ because
distros are likely to have CONFIG_TRACE_IRQFLAGS these days no ?

Also we could add additional checks, such as MSR_EE matching
paca->irq_happened or the above you mentioned, ie, WARN if we find a case
where IRQs are hard disabled but soft enabled. If we find these, I think
we should fix them.

Cheers,
Ben.
[PATCH v3 2/2] powerpc/mm/hugetlb: Add support for 1G huge pages
POWER9 supports hugepages of size 2M and 1G in radix MMU mode. This
patch enables the usage of the 1G page size for hugetlbfs, and updates
the helpers so that we can do 1G page allocation at runtime.

We still don't enable the 1G page size on DD1 versions. This is to
avoid doing the workaround mentioned in commit 6d3a0379ebdc8
("powerpc/mm: Add radix__tlb_flush_pte_p9_dd1()").

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 10 ++++++++++
 arch/powerpc/mm/hugetlbpage.c                |  7 +++++--
 arch/powerpc/platforms/Kconfig.cputype       |  1 +
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index cd366596..5c28bd6f2ae1 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -50,4 +50,14 @@ static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
 	else
 		return entry;
 }
+
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void)
+{
+	if (radix_enabled())
+		return true;
+	return false;
+}
+#endif
+
 #endif
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a4f33de4008e..80f6d2ed551a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -763,8 +763,11 @@ static int __init add_huge_page_size(unsigned long long size)
 	 * Hash: 16M and 16G
 	 */
 	if (radix_enabled()) {
-		if (mmu_psize != MMU_PAGE_2M)
-			return -EINVAL;
+		if (mmu_psize != MMU_PAGE_2M) {
+			if (cpu_has_feature(CPU_FTR_POWER9_DD1) ||
+			    (mmu_psize != MMU_PAGE_1G))
+				return -EINVAL;
+		}
 	} else {
 		if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
 			return -EINVAL;
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 684e886eaae4..b76ef6637016 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -344,6 +344,7 @@ config PPC_STD_MMU_64
 config PPC_RADIX_MMU
 	bool "Radix MMU Support"
 	depends on PPC_BOOK3S_64
+	select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
 	default y
 	help
 	  Enable support for the Power ISA 3.0 Radix style MMU. Currently this
--
2.7.4
[PATCH v3 1/2] mm/hugetlb: Cleanup ARCH_HAS_GIGANTIC_PAGE
This moves the #ifdef in C code to a Kconfig dependency. We also move
the gigantic_page_supported() function to be arch specific. This gives
the arch the opportunity to conditionally enable runtime allocation of
gigantic huge pages. Architectures like ppc64 support different
gigantic huge page sizes (16G and 1G) based on the translation mode
selected. This provides an opportunity for ppc64 to enable runtime
allocation only for the 1G hugepage.

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V
---
Changes from V2:
* Fix build error with x86
* Update the Kconfig change to match the C #ifdef

 arch/arm64/Kconfig               | 2 +-
 arch/arm64/include/asm/hugetlb.h | 4 ++++
 arch/s390/Kconfig                | 2 +-
 arch/s390/include/asm/hugetlb.h  | 3 +++
 arch/x86/Kconfig                 | 2 +-
 arch/x86/include/asm/hugetlb.h   | 4 ++++
 mm/hugetlb.c                     | 7 ++-----
 7 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3741859765cf..87240dcb6a07 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -11,7 +11,7 @@ config ARM64
 	select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_GCOV_PROFILE_ALL
-	select ARCH_HAS_GIGANTIC_PAGE
+	select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index bbc1e35aa601..793bd73b0d07 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -83,4 +83,8 @@ extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
 extern void huge_ptep_clear_flush(struct vm_area_struct *vma,
 				  unsigned long addr, pte_t *ptep);
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void) { return true; }
+#endif
+
 #endif /* __ASM_HUGETLB_H */
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index a2dcef0aacc7..f3637b641d7e 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -67,7 +67,7 @@ config S390
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_GCOV_PROFILE_ALL
-	select ARCH_HAS_GIGANTIC_PAGE
+	select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index cd546a245c68..89057b2cc8fe 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -112,4 +112,7 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
 	return pte_modify(pte, newprot);
 }
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void) { return true; }
+#endif
 #endif /* _ASM_S390_HUGETLB_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..e39b3b6b7d16 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,7 +22,7 @@ config X86_64
 	def_bool y
 	depends on 64BIT
 	# Options that are inherently 64-bit kernel only:
-	select ARCH_HAS_GIGANTIC_PAGE
+	select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
 	select ARCH_SUPPORTS_INT128
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select HAVE_ARCH_SOFT_DIRTY
diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index 3a106165e03a..535af0f2d8ac 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -85,4 +85,8 @@ static inline void arch_clear_hugepage_flags(struct page *page)
 {
 }
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static inline bool gigantic_page_supported(void) { return true; }
+#endif
+
 #endif /* _ASM_X86_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d0aab9ee80d..ce090186b992 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1024,9 +1024,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
 		nr_nodes--)
 
-#if defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) && \
-	((defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || \
-	defined(CONFIG_CMA))
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 static void destroy_compound_gigantic_page(struct page *page,
 					unsigned int order)
 {
@@ -1158,8 +1156,7 @@ static int alloc_fresh_gigantic_page(struct hstate *h,
 	return 0;
 }
 
-static inline bool gigantic_page_supported(void) { return true; }
-#else
+#else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */
 static inline bool gigantic_page_supported(void) { return false; }
 static inline void free_gigantic_page(struct page *page, unsigned int order) { }
 static inline void
Re: [PATCH 2/3] powerpc/mm: Rename find_linux_pte_or_hugepte
Benjamin Herrenschmidt writes:
> On Tue, 2017-05-16 at 14:56 +0530, Aneesh Kumar K.V wrote:
>> +static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea,
>> +				     bool *is_thp, unsigned *hshift)
>> +{
>> +	VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()),
>> +		"%s called with irq enabled\n", __func__);
>> +	return __find_linux_pte(pgdir, ea, is_thp, hshift);
>> +}
>> +
>
> When is arch_irqs_disabled() not sufficient ?

We can do lockless page table walk in interrupt handlers where we find
MSR_EE = 0. I was not sure we mark softenabled 0 there. What I wanted to
indicate in the patch is that we are safe with either softenable = 0 or
MSR_EE = 0.

-aneesh
Re: [RFC 0/2] Consolidate patch_instruction
On Tue, 2017-05-16 at 22:20 +0200, LEROY Christophe wrote:
> Balbir Singh wrote:
>
> > patch_instruction is enhanced in this RFC to support
> > patching via a different virtual address (text_poke_area).
> > The mapping of text_poke_area->addr is RW and not RWX.
> > This way the mapping allows write for patching and then we tear
> > down the mapping. The downside is that we introduce a spinlock
> > which serializes our patching to one patch at a time.
>
> Very nice patch, would fit great with my patch implementing
> CONFIG_DEBUG_RODATA (https://patchwork.ozlabs.org/patch/754289 ).
> Would avoid having to set the text area back to RW for patching.

Awesome! It seems like you have some of the work for
CONFIG_STRICT_KERNEL_RWX; any reason why this is under
CONFIG_DEBUG_RODATA? But I think there is reuse capability across the
future patches and the current set.

Cheers,
Balbir Singh.
Re: [RFC 2/2] powerpc/kprobes: Move kprobes over to patch_instruction
On Tue, 2017-05-16 at 19:05 +0530, Naveen N. Rao wrote:
> On 2017/05/16 01:49PM, Balbir Singh wrote:
> > arch_arm/disarm_probe use direct assignment for copying
> > instructions, replace them with patch_instruction
>
> Thanks for doing this!
>
> We will also have to convert optprobes and ftrace to use
> patch_instruction, but that can be done once the basic infrastructure is
> in.

I think these patches can go in without even patch 1. I looked quickly
at optprobes and ftrace and thought they were already using
patch_instruction (ftrace_modify_code() and arch_optimize_kprobes());
are there other paths I missed?

Balbir Singh
Re: [RFC 0/2] Consolidate patch_instruction
On Tue, 2017-05-16 at 19:11 +0530, Naveen N. Rao wrote:
> On 2017/05/16 10:56AM, Anshuman Khandual wrote:
> > On 05/16/2017 09:19 AM, Balbir Singh wrote:
> > > patch_instruction is enhanced in this RFC to support
> > > patching via a different virtual address (text_poke_area).
> >
> > Why is writing the instruction directly into the address not
> > sufficient? Why do we need to go through this virtual address?
>
> To enable KERNEL_STRICT_RWX and map all of kernel text to be read-only?

Precisely, the rest of the bits are still being developed.

> > > The mapping of text_poke_area->addr is RW and not RWX.
> > > This way the mapping allows write for patching and then we tear
> > > down the mapping. The downside is that we introduce a spinlock
> > > which serializes our patching to one patch at a time.
> >
> > So what's the benefit we get otherwise in this approach when
> > we are adding a new lock into the equation?
>
> Instruction patching isn't performance critical, so the slow down is
> likely not noticeable. Marking kernel text read-only helps harden the
> kernel by catching unintended code modifications whether through
> exploits or through bugs.

Precisely!

Balbir Singh.
Re: [PATCH 2/2] powerpc/jprobes: Validate break handler invocation as being due to a jprobe_return()
On Mon, 15 May 2017 23:35:04 +0530 "Naveen N. Rao" wrote:

> Fix a circa 2005 FIXME by implementing a check to ensure that we
> actually got into the jprobe break handler() due to the trap in
> jprobe_return().
>
> Signed-off-by: Naveen N. Rao
> ---
>  arch/powerpc/kernel/kprobes.c | 20 +---
>  1 file changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 19b053475758..1ebeb8c482db 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -627,25 +627,23 @@ NOKPROBE_SYMBOL(setjmp_pre_handler);
>  
>  void __used jprobe_return(void)
>  {
> -	asm volatile("trap" ::: "memory");
> +	asm volatile("jprobe_return_trap:\n"
> +		     "trap\n"
> +		     ::: "memory");
>  }
>  NOKPROBE_SYMBOL(jprobe_return);
>  
> -static void __used jprobe_return_end(void)
> -{
> -}
> -NOKPROBE_SYMBOL(jprobe_return_end);
> -
>  int longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
>  {
>  	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
>  	unsigned long sp;
>  
> -	/*
> -	 * FIXME - we should ideally be validating that we got here 'cos
> -	 * of the "trap" in jprobe_return() above, before restoring the
> -	 * saved regs...
> -	 */
> +	if (regs->nip != ppc_kallsyms_lookup_name("jprobe_return_trap")) {
> +		WARN(1, "longjmp_break_handler NIP (0x%lx) does not match jprobe_return_trap (0x%lx)\n",
> +		     regs->nip, ppc_kallsyms_lookup_name("jprobe_return_trap"));
> +		return 0;

If you don't handle this break, you shouldn't warn about it, because
program_check_exception() will continue to find out how to handle it via
notify_die(). IOW, please return silently, or just add a debug message.

Thank you,

> +	}
> +
>  	memcpy(regs, &kcb->jprobe_saved_regs, sizeof(struct pt_regs));
>  	sp = kernel_stack_pointer(regs);
>  	memcpy((void *)sp, &kcb->jprobes_stack, MIN_STACK_SIZE(sp));
> --
> 2.12.2

--
Masami Hiramatsu
Re: [PATCH 1/2] powerpc/jprobes: Save and restore the parameter save area
On Mon, 15 May 2017 23:35:03 +0530 "Naveen N. Rao" wrote:

> diff --git a/arch/powerpc/include/asm/kprobes.h b/arch/powerpc/include/asm/kprobes.h
> index a83821f33ea3..b6960ef213ac 100644
> --- a/arch/powerpc/include/asm/kprobes.h
> +++ b/arch/powerpc/include/asm/kprobes.h
> @@ -61,6 +61,15 @@ extern kprobe_opcode_t optprobe_template_end[];
>  #define MAX_OPTINSN_SIZE	(optprobe_template_end - optprobe_template_entry)
>  #define RELATIVEJUMP_SIZE	sizeof(kprobe_opcode_t)	/* 4 bytes */
>  
> +/* Save up to 16 parameters along with the stack frame header */
> +#define MAX_STACK_SIZE	(STACK_FRAME_PARM_SAVE + (16 * sizeof(unsigned long)))
> +#define MIN_STACK_SIZE(ADDR)						\
> +	(((MAX_STACK_SIZE) < (((unsigned long)current_thread_info()) +	\
> +			      THREAD_SIZE - (unsigned long)(ADDR)))	\
> +	 ? (MAX_STACK_SIZE)						\
> +	 : (((unsigned long)current_thread_info()) +			\
> +	    THREAD_SIZE - (unsigned long)(ADDR)))

Could you add CUR_STACK_SIZE(addr) as x86 does, instead of repeating
similar code?

Thank you,

--
Masami Hiramatsu
Re: [v3 0/9] parallelized "struct page" zeroing
On Fri, 2017-05-12 at 13:37 -0400, David Miller wrote:
> > Right now it is larger, but what I suggested is to add a new optimized
> > routine just for this case, which would do STBI for 64 bytes but
> > without membar (do membar at the end of memmap_init_zone() and
> > deferred_init_memmap())
> >
> > #define struct_page_clear(page)				\
> > 	__asm__ __volatile__(				\
> > 	"stxa	%%g0, [%0]%2\n"				\
> > 	"stxa	%%g0, [%0 + %1]%2\n"			\
> > 	: /* No output */				\
> > 	: "r" (page), "r" (0x20), "i" (ASI_BLK_INIT_QUAD_LDD_P))
> >
> > And insert it into __init_single_page() instead of memset()
> >
> > The final result is 4.01s/T which is even faster compared to the
> > current 4.97s/T
>
> Ok, indeed, that would work.

On ppc64, that might not. We have a dcbz instruction that clears an
entire cache line at once. That's what we use for memsets and page
clearing. However, 64 bytes is half a cache line on modern processors,
so we can't use it with that semantic and would have to fall back to
the slower stores.

Cheers,
Ben.
Re: RFC: better timer interface
On Tue, May 16, 2017 at 5:51 PM, Christoph Hellwig wrote:
> On Tue, May 16, 2017 at 05:45:07PM +0200, Arnd Bergmann wrote:
> > This looks really nice, but what is the long-term plan for the interface?
> > Do you expect that we will eventually change all 700+ users of timer_list
> > to the new type, or do we keep both variants around indefinitely to avoid
> > having to do mass-conversions?
>
> I think we should eventually move everyone over, but it might take
> some time.

Ok.

> > If we are going to touch them all in the end, we might want to think
> > about other changes that could be useful here. The main one I have
> > in mind would be moving away from 'jiffies + timeout' as the interface,
> > and instead passing a relative number of milliseconds (or seconds)
> > into a mod_timer() variant. This is what most drivers want anyway,
> > and if we have both changes (callback argument and expiration
> > time) in place, we modernize the API one driver at a time with both
> > changes at once.
>
> Yes, that sounds useful to me as well. As you said it's an independent
> but somewhat related change. I can add it to my series, but I'll
> need a suggestion for a good and short name. That already was the
> hardest part for the setup side :)

If we keep the unusual *_timer() naming (rather than timer_*() as
hrtimer has), we could use one of:

a) start_timer(struct timer_list *timer, unsigned long ms);

b) restart_timer(struct timer_list *timer, unsigned long ms);

c) mod_timer_ms(struct timer_list *timer, unsigned long ms);
   mod_timer_sec(struct timer_list *timer, unsigned long sec);

The first is slightly shorter but conflicts with three files that use
the same name for a local function. The third one fits well with the
existing interfaces and provides both millisecond and second versions;
I'd probably go with that.

We could consider even passing a default interval as another argument
to prepare_timer(), and using that in add_timer(), but that would only
help in those cases that have a constant interval (maybe about half of
the users) and would be a bit surprising to readers that are only
familiar with the existing interfaces.

One final option would be a larger-scale replacement of the API by
mirroring the hrtimer style where possible while staying compatible
with the existing calls, e.g. timer_prepare(), timer_add_expires(),
timer_start(), ...

Arnd
Re: [RFC 0/2] Consolidate patch_instruction
Balbir Singh wrote:

> patch_instruction is enhanced in this RFC to support
> patching via a different virtual address (text_poke_area).
> The mapping of text_poke_area->addr is RW and not RWX.
> This way the mapping allows write for patching and then we tear
> down the mapping. The downside is that we introduce a spinlock
> which serializes our patching to one patch at a time.

Very nice patch, would fit great with my patch implementing
CONFIG_DEBUG_RODATA (https://patchwork.ozlabs.org/patch/754289 ).
Would avoid having to set the text area back to RW for patching.

Christophe

> In this patchset we also consolidate instruction changes in kprobes
> to use patch_instruction().
>
> Balbir Singh (2):
>   powerpc/lib/code-patching: Enhance code patching
>   powerpc/kprobes: Move kprobes over to patch_instruction
>
>  arch/powerpc/kernel/kprobes.c    |  4 +-
>  arch/powerpc/lib/code-patching.c | 88 ++--
>  2 files changed, 86 insertions(+), 6 deletions(-)
>
> --
> 2.9.3
Re: [PATCH 2/9] timers: provide a "modern" variant of timers
On Tue, May 16, 2017 at 1:48 PM, Christoph Hellwig wrote:
>  	unsigned long		expires;
> -	void			(*function)(unsigned long);
> +	union {
> +		void		(*func)(struct timer_list *timer);
> +		void		(*function)(unsigned long);
> +	};
...
> +#define INIT_TIMER(_func, _expires, _flags)				\
> +{									\
> +	.entry = { .next = TIMER_ENTRY_STATIC },			\
> +	.func = (_func),						\
> +	.expires = (_expires),						\
> +	.flags = TIMER_MODERN | (_flags),				\
> +	__TIMER_LOCKDEP_MAP_INITIALIZER(__FILE__ ":" __stringify(__LINE__)) \
> +}

If I remember correctly, this will fail with gcc-4.5 and earlier, which
can't use named initializers for anonymous unions. One of these two
should work, but they are both ugly:

a) don't use a named initializer for the union (a bit fragile)

+#define INIT_TIMER(_func, _expires, _flags)				\
+{									\
+	.entry = { .next = TIMER_ENTRY_STATIC },			\
+	.expires = (_expires),						\
+	{ .func = (_func) },						\
+	.flags = TIMER_MODERN | (_flags),				\
+	__TIMER_LOCKDEP_MAP_INITIALIZER(__FILE__ ":" __stringify(__LINE__)) \
+}

b) give the union a name (breaks any reference to timer_list->func in C
code):

+	union {
+		void		(*func)(struct timer_list *timer);
+		void		(*function)(unsigned long);
+	} u;
...
+#define INIT_TIMER(_func, _expires, _flags)				\
+{									\
+	.entry = { .next = TIMER_ENTRY_STATIC },			\
+	.u.func = (_func),						\
+	.expires = (_expires),						\
+	.flags = TIMER_MODERN | (_flags),				\
+	__TIMER_LOCKDEP_MAP_INITIALIZER(__FILE__ ":" __stringify(__LINE__)) \
+}

> +/**
> + * prepare_timer - initialize a timer before first use
> + * @timer:	timer structure to prepare
> + * @func:	callback to be called when the timer expires
> + * @flags	%TIMER_* flags that control timer behavior
> + *
> + * This function initializes a timer_list structure so that it can
> + * be used (by calling add_timer() or mod_timer()).
> + */
> +static inline void prepare_timer(struct timer_list *timer,
> +		void (*func)(struct timer_list *timer), u32 flags)
> +{
> +	__init_timer(timer, TIMER_MODERN | flags);
> +	timer->func = func;
> +}
> +
> +static inline void prepare_timer_on_stack(struct timer_list *timer,
> +		void (*func)(struct timer_list *timer), u32 flags)
> +{
> +	__init_timer_on_stack(timer, TIMER_MODERN | flags);
> +	timer->func = func;
> +}

I fear this breaks lockdep output, which turns the name of the timer
into a string that gets printed later. It should work when these are
macros, or a macro wrapping an inline function like __init_timer is.

Arnd
Re: [PATCH 9/9] timers: remove old timer initialization macros
On Tue, May 16, 2017 at 1:48 PM, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig
> ---
>  include/linux/timer.h | 22 +++---
>  1 file changed, 3 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/timer.h b/include/linux/timer.h
> index 87afe52c8349..9c6694d3f66a 100644
> --- a/include/linux/timer.h
> +++ b/include/linux/timer.h
> @@ -80,35 +80,19 @@ struct timer_list {
>  	struct timer_list _name = INIT_TIMER(_func, _expires, _flags)
>  
>  /*
> - * Don't use the macros below, use DECLARE_TIMER and INIT_TIMER with their
> + * Don't use the macro below, use DECLARE_TIMER and INIT_TIMER with their
>   * improved callback signature above.
>   */
> -#define __TIMER_INITIALIZER(_function, _expires, _data, _flags) {	\
> +#define DEFINE_TIMER(_name, _function, _expires, _data)		\
> +	struct timer_list _name = {					\
>  		.entry = { .next = TIMER_ENTRY_STATIC },		\
>  		.function = (_function),				\
>  		.expires = (_expires),					\
>  		.data = (_data),					\
> -		.flags = (_flags),					\
>  		__TIMER_LOCKDEP_MAP_INITIALIZER(			\
>  			__FILE__ ":" __stringify(__LINE__))		\
>  	}

Not sure what to do about it, but I notice that the '_expires' argument
is completely bogus. I don't see any way it could be used in a
meaningful way, and the only user that passes anything other than zero
is arch/mips/mti-malta/malta-display.c, and that seems to be
unintentional.

Arnd
Re: [PATCH 2/9] timers: provide a "modern" variant of timers
On 05/16/17 04:48, Christoph Hellwig wrote:
> diff --git a/include/linux/timer.h b/include/linux/timer.h
> index e6789b8757d5..87afe52c8349 100644
> --- a/include/linux/timer.h
> +++ b/include/linux/timer.h
> @@ -126,6 +146,32 @@ static inline void init_timer_on_stack_key(struct timer_list *timer,
>  	init_timer_on_stack_key((_timer), (_flags), NULL, NULL)
>  #endif
>  
> +/**
> + * prepare_timer - initialize a timer before first use
> + * @timer:	timer structure to prepare
> + * @func:	callback to be called when the timer expires
> + * @flags	%TIMER_* flags that control timer behavior

missing ':' on @flags:

> + *
> + * This function initializes a timer_list structure so that it can
> + * be used (by calling add_timer() or mod_timer()).
> + */
> +static inline void prepare_timer(struct timer_list *timer,
> +		void (*func)(struct timer_list *timer), u32 flags)
> +{

--
~Randy
[patch V2 06/17] powerpc: Adjust system_state check
To enable smp_processor_id() and might_sleep() debug checks earlier,
it's required to add system states between SYSTEM_BOOTING and
SYSTEM_RUNNING.

Adjust the system_state check in smp_generic_cpu_bootable() to handle
the extra states.

Signed-off-by: Thomas Gleixner
Acked-by: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -97,7 +97,7 @@ int smp_generic_cpu_bootable(unsigned in
 	/* Special case - we inhibit secondary thread startup
 	 * during boot if the user requests it.
 	 */
-	if (system_state == SYSTEM_BOOTING && cpu_has_feature(CPU_FTR_SMT)) {
+	if (system_state < SYSTEM_RUNNING && cpu_has_feature(CPU_FTR_SMT)) {
 		if (!smt_enabled_at_boot && cpu_thread_in_core(nr) != 0)
 			return 0;
 		if (smt_enabled_at_boot
[patch V2 09/17] cpufreq/pasemi: Adjust system_state check
To enable smp_processor_id() and might_sleep() debug checks earlier,
it's required to add system states between SYSTEM_BOOTING and
SYSTEM_RUNNING.

Adjust the system_state check in pas_cpufreq_cpu_exit() to handle the
extra states.

Signed-off-by: Thomas Gleixner
Acked-by: Viresh Kumar
Cc: "Rafael J. Wysocki"
Cc: linuxppc-dev@lists.ozlabs.org
---
 drivers/cpufreq/pasemi-cpufreq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -226,7 +226,7 @@ static int pas_cpufreq_cpu_exit(struct c
 	 * We don't support CPU hotplug. Don't unmap after the system
 	 * has already made it to a running state.
 	 */
-	if (system_state != SYSTEM_BOOTING)
+	if (system_state >= SYSTEM_RUNNING)
 		return 0;
 
 	if (sdcasr_mapbase)
Re: kernel BUG at mm/usercopy.c:72!
On Tue, May 16, 2017 at 09:02:29PM +1000, Michael Ellerman wrote: > Breno Leitaowrites: > > > Hello, > > > > Kernel 4.12-rc1 is showing a bug when I try it on a POWER8 virtual > > machine. Justing SSHing into the machine causes this issue. > > > > [23.138124] usercopy: kernel memory overwrite attempt detected to > > d3d80030 (mm_struct) (560 bytes) > > [23.138195] [ cut here ] > > [23.138229] kernel BUG at mm/usercopy.c:72! > > [23.138252] Oops: Exception in kernel mode, sig: 5 [#3] > > [23.138280] SMP NR_CPUS=2048 > > [23.138280] NUMA > > [23.138302] pSeries > > [23.138330] Modules linked in: > > [23.138354] CPU: 4 PID: 2215 Comm: sshd Tainted: G D > > 4.12.0-rc1+ #9 > > [23.138395] task: c001e272dc00 task.stack: c001e27b > > [23.138430] NIP: c0342358 LR: c0342354 CTR: > > c06eb060 > > [23.138472] REGS: c001e27b3a00 TRAP: 0700 Tainted: G D > >(4.12.0-rc1+) > > [23.138513] MSR: 80029033 > > [23.138517] CR: 28004222 XER: 2000 > > [23.138565] CFAR: c0b34500 SOFTE: 1 > > [23.138565] GPR00: c0342354 c001e27b3c80 c142a000 > > 005e > > [23.138565] GPR04: c001ffe0ade8 c001ffe21bf8 2920283536302062 > > 79746573290d0a74 > > [23.138565] GPR08: 0007 c0f61864 0001feeb > > 3064206f74206465 > > [23.138565] GPR12: 4400 cfb42600 0015 > > 545bdc40 > > [23.138565] GPR16: 545c49c8 01000b4b8890 778c26f0 > > 545cf000 > > [23.138565] GPR20: 546109c8 c7e8 54610010 > > 778c22e8 > > [23.138565] GPR24: 545c8c40 c000ff6bcef0 c01e5220 > > 0230 > > [23.138565] GPR28: d3d80260 0230 > > d3d80030 > > [23.138920] NIP [c0342358] __check_object_size+0x88/0x2d0 > > [23.138956] LR [c0342354] __check_object_size+0x84/0x2d0 > > [23.138990] Call Trace: > > [23.139006] [c001e27b3c80] [c0342354] > > __check_object_size+0x84/0x2d0 (unreliable) > > [23.139056] [c001e27b3d00] [c09f5ba8] > > bpf_prog_create_from_user+0xa8/0x1a0 > > [23.139099] [c001e27b3d60] [c01e5d30] do_seccomp+0x120/0x720 > > [23.139136] [c001e27b3dd0] [c00fd53c] SyS_prctl+0x2ac/0x6b0 > > [23.139172] [c001e27b3e30] [c000af84] 
system_call+0x38/0xe0 > > [23.139218] Instruction dump: > > [23.139240] 6000 6042 3c82ff94 3ca2ff9d 38841788 38a5e868 > > 3c62ff95 7fc8f378 > > [23.139283] 7fe6fb78 386310c0 487f2169 6000 <0fe0> 6042 > > 2ba30010 409d018c > > [23.139328] ---[ end trace 1a1dc952a4b7c4af ]---

> Do you have any idea what is calling seccomp() and triggering the bug?

This bug is hit using several paths, not only via seccomp. This is another path, via vfs_read, that triggers the bug:

[ 370.154307] usercopy: kernel memory exposure attempt detected from d3d6007c (vm_area_struct) (6 bytes)
[ 370.154373] [ cut here ]
[ 370.154402] kernel BUG at mm/usercopy.c:72!
[ 370.154425] Oops: Exception in kernel mode, sig: 5 [#4]
[ 370.155220] [c001d30efab0] [c0342354] __check_object_size+0x84/0x2b0 (unreliable)
[ 370.155272] [c001d30efb30] [c06c96cc] copy_from_read_buf+0xac/0x1e0
[ 370.155315] [c001d30efba0] [c06ccbc4] n_tty_read+0x324/0x920
[ 370.155351] [c001d30efcb0] [c06c4c50] tty_read+0xc0/0x180
[ 370.155387] [c001d30efd00] [c0347f64] __vfs_read+0x44/0x1a0
[ 370.155424] [c001d30efd90] [c03499ac] vfs_read+0xbc/0x1b0
[ 370.155460] [c001d30efde0] [c034b6f8] SyS_read+0x68/0x110
[ 370.155497] [c001d30efe30] [c000af84] system_call+0x38/0xe0

Anyway, I see the seccomp() path issue when I log into the system using SSH, and the issue with tty_read() just during the system boot.

> I run the BPF and seccomp test suites, and I haven't seen this.

Do you have the hardening options enabled? For example, I do not reproduce this problem if I do not set CONFIG_HARDENED_USERCOPY=y.
Re: RFC: better timer interface
On Tue, May 16, 2017 at 05:45:07PM +0200, Arnd Bergmann wrote:
> This looks really nice, but what is the long-term plan for the interface?
> Do you expect that we will eventually change all 700+ users of timer_list
> to the new type, or do we keep both variants around indefinitely to avoid
> having to do mass-conversions?

I think we should eventually move everyone over, but it might take some time.

> If we are going to touch them all in the end, we might want to think
> about other changes that could be useful here. The main one I have
> in mind would be moving away from 'jiffies + timeout' as the interface,
> and instead passing a relative number of milliseconds (or seconds)
> into a mod_timer() variant. This is what most drivers want anyway,
> and if we have both changes (callback argument and expiration
> time) in place, we modernize the API one driver at a time with both
> changes at once.

Yes, that sounds useful to me as well. As you said, it's an independent but somewhat related change. I can add it to my series, but I'll need a suggestion for a good and short name. That already was the hardest part for the setup side :)
Re: RFC: better timer interface
On Tue, May 16, 2017 at 1:48 PM, Christoph Hellwig wrote:
> Hi all,
>
> this series attempts to provide a "modern" timer interface where the
> callback gets the timer_list structure as an argument so that it
> can use container_of instead of having to cast to/from unsigned long
> all the time (or even worse use function pointer casts, we have quite
> a few of those as well).

This looks really nice, but what is the long-term plan for the interface? Do you expect that we will eventually change all 700+ users of timer_list to the new type, or do we keep both variants around indefinitely to avoid having to do mass-conversions?

If we are going to touch them all in the end, we might want to think about other changes that could be useful here. The main one I have in mind would be moving away from 'jiffies + timeout' as the interface, and instead passing a relative number of milliseconds (or seconds) into a mod_timer() variant. This is what most drivers want anyway, and if we have both changes (callback argument and expiration time) in place, we modernize the API one driver at a time with both changes at once.

Arnd
Re: [PATCH v2 1/2] mm/hugetlb: Cleanup ARCH_HAS_GIGANTIC_PAGE
On Tuesday 16 May 2017 03:52 PM, Anshuman Khandual wrote:

On 05/16/2017 02:47 PM, Aneesh Kumar K.V wrote:

This moves the #ifdef in C code to a Kconfig dependency. Also we move the gigantic_page_supported() function to be arch specific. This gives the arch the option to conditionally enable runtime allocation of gigantic huge pages. Architectures like ppc64 support different gigantic huge page sizes (16G and 1G) based on the translation mode selected. This provides an opportunity for ppc64 to enable runtime allocation only for the 1G hugepage size.

Right.

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V--- arch/arm64/Kconfig | 2 +- arch/arm64/include/asm/hugetlb.h | 4 arch/s390/Kconfig| 2 +- arch/s390/include/asm/hugetlb.h | 3 +++ arch/x86/Kconfig | 2 +- mm/hugetlb.c | 7 ++- 6 files changed, 12 insertions(+), 8 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 3741859765cf..1f8c1f73aada 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -11,7 +11,7 @@ config ARM64 select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_GCOV_PROFILE_ALL - select ARCH_HAS_GIGANTIC_PAGE + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA select ARCH_HAS_KCOV select ARCH_HAS_SET_MEMORY select ARCH_HAS_SG_CHAIN diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h index bbc1e35aa601..793bd73b0d07 100644 --- a/arch/arm64/include/asm/hugetlb.h +++ b/arch/arm64/include/asm/hugetlb.h @@ -83,4 +83,8 @@ extern void huge_ptep_set_wrprotect(struct mm_struct *mm, extern void huge_ptep_clear_flush(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep); +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE +static inline bool gigantic_page_supported(void) { return true; } +#endif + #endif /* __ASM_HUGETLB_H */ diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index a2dcef0aacc7..a41bbf420dda 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -67,7 +67,7 @@ config S390 select ARCH_HAS_DEVMEM_IS_ALLOWED select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_GCOV_PROFILE_ALL - select ARCH_HAS_GIGANTIC_PAGE + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA select ARCH_HAS_KCOV select ARCH_HAS_SET_MEMORY select ARCH_HAS_SG_CHAIN diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h index cd546a245c68..89057b2cc8fe 100644 --- a/arch/s390/include/asm/hugetlb.h +++ b/arch/s390/include/asm/hugetlb.h @@ -112,4 +112,7 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot) return pte_modify(pte, newprot); } +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE +static inline bool gigantic_page_supported(void) { return true; } +#endif #endif /* _ASM_S390_HUGETLB_H */ diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index cc98d5a294ee..30a6328136ac 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -22,7 +22,7 @@ config X86_64 def_bool y depends on 64BIT # Options that are inherently 64-bit kernel only: - select ARCH_HAS_GIGANTIC_PAGE + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA select ARCH_SUPPORTS_INT128 select ARCH_USE_CMPXCHG_LOCKREF select HAVE_ARCH_SOFT_DIRTY

Shouldn't we define the gigantic_page_supported() function for x86 as well, like the other two archs above?

Yes, will update the patch.

-aneesh
Re: kernel BUG at mm/usercopy.c:72!
On 05/16/2017 07:32 AM, Kees Cook wrote: > On Tue, May 16, 2017 at 4:09 AM, Michael Ellermanwrote: >> [Cc'ing the relevant folks] >> >> Breno Leitao writes: >>> Hello, >>> >>> Kernel 4.12-rc1 is showing a bug when I try it on a POWER8 virtual >>> machine. Justing SSHing into the machine causes this issue. >>> >>> [23.138124] usercopy: kernel memory overwrite attempt detected to >>> d3d80030 (mm_struct) (560 bytes) >>> [23.138195] [ cut here ] >>> [23.138229] kernel BUG at mm/usercopy.c:72! >>> [23.138252] Oops: Exception in kernel mode, sig: 5 [#3] >>> [23.138280] SMP NR_CPUS=2048 >>> [23.138280] NUMA >>> [23.138302] pSeries >>> [23.138330] Modules linked in: >>> [23.138354] CPU: 4 PID: 2215 Comm: sshd Tainted: G D >>> 4.12.0-rc1+ #9 >>> [23.138395] task: c001e272dc00 task.stack: c001e27b >>> [23.138430] NIP: c0342358 LR: c0342354 CTR: >>> c06eb060 >>> [23.138472] REGS: c001e27b3a00 TRAP: 0700 Tainted: G D >>> (4.12.0-rc1+) >>> [23.138513] MSR: 80029033 >>> [23.138517] CR: 28004222 XER: 2000 >>> [23.138565] CFAR: c0b34500 SOFTE: 1 >>> [23.138565] GPR00: c0342354 c001e27b3c80 c142a000 >>> 005e >>> [23.138565] GPR04: c001ffe0ade8 c001ffe21bf8 2920283536302062 >>> 79746573290d0a74 >>> [23.138565] GPR08: 0007 c0f61864 0001feeb >>> 3064206f74206465 >>> [23.138565] GPR12: 4400 cfb42600 0015 >>> 545bdc40 >>> [23.138565] GPR16: 545c49c8 01000b4b8890 778c26f0 >>> 545cf000 >>> [23.138565] GPR20: 546109c8 c7e8 54610010 >>> 778c22e8 >>> [23.138565] GPR24: 545c8c40 c000ff6bcef0 c01e5220 >>> 0230 >>> [23.138565] GPR28: d3d80260 0230 >>> d3d80030 >>> [23.138920] NIP [c0342358] __check_object_size+0x88/0x2d0 >>> [23.138956] LR [c0342354] __check_object_size+0x84/0x2d0 >>> [23.138990] Call Trace: >>> [23.139006] [c001e27b3c80] [c0342354] >>> __check_object_size+0x84/0x2d0 (unreliable) >>> [23.139056] [c001e27b3d00] [c09f5ba8] >>> bpf_prog_create_from_user+0xa8/0x1a0 >>> [23.139099] [c001e27b3d60] [c01e5d30] >>> do_seccomp+0x120/0x720 >>> [23.139136] [c001e27b3dd0] [c00fd53c] >>> 
SyS_prctl+0x2ac/0x6b0 >>> [23.139172] [c001e27b3e30] [c000af84] >>> system_call+0x38/0xe0 >>> [23.139218] Instruction dump: >>> [23.139240] 6000 6042 3c82ff94 3ca2ff9d 38841788 38a5e868 >>> 3c62ff95 7fc8f378 >>> [23.139283] 7fe6fb78 386310c0 487f2169 6000 <0fe0> 6042 >>> 2ba30010 409d018c >>> [23.139328] ---[ end trace 1a1dc952a4b7c4af ]--- >>> >>> I found that kernel 4.11 does not have this issue. I also found that, if >>> I revert 517e1fbeb65f5eade8d14f46ac365db6c75aea9b, I do not see the >>> problem. >>> >>> On the other side, if I cherry-pick commit >>> 517e1fbeb65f5eade8d14f46ac365db6c75aea9b into 4.11, I start seeing the >>> same issue also on 4.11. >> >> Yeah it looks like powerpc also suffers from the same bug that arm64 >> used to, ie. virt_addr_valid() will return true for some vmalloc >> addresses. >> >> virt_addr_valid() is used pretty widely, I'm not sure if we can just fix >> it without other fallout. I'll dig a bit more tomorrow if no one beats >> me to it. >> >> Kees, depending on how that turns out we may ask you to revert >> 517e1fbeb65f ("mm/usercopy: Drop extra is_vmalloc_or_module() check"). > > That's fine by me. Let me know what you think would be best. > > Laura, I don't see much harm in putting this back in place. It seems > like it's just a matter of efficiency to have it removed? > > -Kees > Yes, there shouldn't be any harm if we need to bring it back. Perhaps I should submit a follow on patch to rename virt_addr_valid to virt_addr_valid_except_where_it_isnt. Thanks, Laura
Re: kernel BUG at mm/usercopy.c:72!
On Tue, May 16, 2017 at 4:09 AM, Michael Ellermanwrote: > [Cc'ing the relevant folks] > > Breno Leitao writes: >> Hello, >> >> Kernel 4.12-rc1 is showing a bug when I try it on a POWER8 virtual >> machine. Justing SSHing into the machine causes this issue. >> >> [23.138124] usercopy: kernel memory overwrite attempt detected to >> d3d80030 (mm_struct) (560 bytes) >> [23.138195] [ cut here ] >> [23.138229] kernel BUG at mm/usercopy.c:72! >> [23.138252] Oops: Exception in kernel mode, sig: 5 [#3] >> [23.138280] SMP NR_CPUS=2048 >> [23.138280] NUMA >> [23.138302] pSeries >> [23.138330] Modules linked in: >> [23.138354] CPU: 4 PID: 2215 Comm: sshd Tainted: G D >> 4.12.0-rc1+ #9 >> [23.138395] task: c001e272dc00 task.stack: c001e27b >> [23.138430] NIP: c0342358 LR: c0342354 CTR: >> c06eb060 >> [23.138472] REGS: c001e27b3a00 TRAP: 0700 Tainted: G D >> (4.12.0-rc1+) >> [23.138513] MSR: 80029033 >> [23.138517] CR: 28004222 XER: 2000 >> [23.138565] CFAR: c0b34500 SOFTE: 1 >> [23.138565] GPR00: c0342354 c001e27b3c80 c142a000 >> 005e >> [23.138565] GPR04: c001ffe0ade8 c001ffe21bf8 2920283536302062 >> 79746573290d0a74 >> [23.138565] GPR08: 0007 c0f61864 0001feeb >> 3064206f74206465 >> [23.138565] GPR12: 4400 cfb42600 0015 >> 545bdc40 >> [23.138565] GPR16: 545c49c8 01000b4b8890 778c26f0 >> 545cf000 >> [23.138565] GPR20: 546109c8 c7e8 54610010 >> 778c22e8 >> [23.138565] GPR24: 545c8c40 c000ff6bcef0 c01e5220 >> 0230 >> [23.138565] GPR28: d3d80260 0230 >> d3d80030 >> [23.138920] NIP [c0342358] __check_object_size+0x88/0x2d0 >> [23.138956] LR [c0342354] __check_object_size+0x84/0x2d0 >> [23.138990] Call Trace: >> [23.139006] [c001e27b3c80] [c0342354] >> __check_object_size+0x84/0x2d0 (unreliable) >> [23.139056] [c001e27b3d00] [c09f5ba8] >> bpf_prog_create_from_user+0xa8/0x1a0 >> [23.139099] [c001e27b3d60] [c01e5d30] >> do_seccomp+0x120/0x720 >> [23.139136] [c001e27b3dd0] [c00fd53c] SyS_prctl+0x2ac/0x6b0 >> [23.139172] [c001e27b3e30] [c000af84] system_call+0x38/0xe0 >> [23.139218] 
Instruction dump: >> [23.139240] 6000 6042 3c82ff94 3ca2ff9d 38841788 38a5e868 >> 3c62ff95 7fc8f378 >> [23.139283] 7fe6fb78 386310c0 487f2169 6000 <0fe0> 6042 >> 2ba30010 409d018c >> [23.139328] ---[ end trace 1a1dc952a4b7c4af ]--- >> >> I found that kernel 4.11 does not have this issue. I also found that, if >> I revert 517e1fbeb65f5eade8d14f46ac365db6c75aea9b, I do not see the >> problem. >> >> On the other side, if I cherry-pick commit >> 517e1fbeb65f5eade8d14f46ac365db6c75aea9b into 4.11, I start seeing the >> same issue also on 4.11. > > Yeah it looks like powerpc also suffers from the same bug that arm64 > used to, ie. virt_addr_valid() will return true for some vmalloc > addresses. > > virt_addr_valid() is used pretty widely, I'm not sure if we can just fix > it without other fallout. I'll dig a bit more tomorrow if no one beats > me to it. > > Kees, depending on how that turns out we may ask you to revert > 517e1fbeb65f ("mm/usercopy: Drop extra is_vmalloc_or_module() check"). That's fine by me. Let me know what you think would be best. Laura, I don't see much harm in putting this back in place. It seems like it's just a matter of efficiency to have it removed? -Kees -- Kees Cook Pixel Security
Re: [RFC 0/2] Consolidate patch_instruction
On 2017/05/16 10:56AM, Anshuman Khandual wrote:
> On 05/16/2017 09:19 AM, Balbir Singh wrote:
> > patch_instruction is enhanced in this RFC to support
> > patching via a different virtual address (text_poke_area).
>
> Why is writing the instruction directly into the address not
> sufficient? Why does it need to go through this virtual address?

To enable KERNEL_STRICT_RWX and map all of kernel text to be read-only?

> > The mapping of text_poke_area->addr is RW and not RWX.
> > This way the mapping allows write for patching and then we tear
> > down the mapping. The downside is that we introduce a spinlock
> > which serializes our patching to one patch at a time.
>
> So what's the benefit we get otherwise in this approach when
> we are adding a new lock into the equation?

Instruction patching isn't performance critical, so the slowdown is likely not noticeable. Marking kernel text read-only helps harden the kernel by catching unintended code modifications, whether through exploits or through bugs.

- Naveen
Re: [RFC 2/2] powerpc/kprobes: Move kprobes over to patch_instruction
On 2017/05/16 01:49PM, Balbir Singh wrote: > arch_arm/disarm_probe use direct assignment for copying > instructions, replace them with patch_instruction Thanks for doing this! We will also have to convert optprobes and ftrace to use patch_instruction, but that can be done once the basic infrastructure is in. Regards, Naveen
[PATCH 9/9] timers: remove old timer initialization macros
Signed-off-by: Christoph Hellwig--- include/linux/timer.h | 22 +++--- 1 file changed, 3 insertions(+), 19 deletions(-) diff --git a/include/linux/timer.h b/include/linux/timer.h index 87afe52c8349..9c6694d3f66a 100644 --- a/include/linux/timer.h +++ b/include/linux/timer.h @@ -80,35 +80,19 @@ struct timer_list { struct timer_list _name = INIT_TIMER(_func, _expires, _flags) /* - * Don't use the macros below, use DECLARE_TIMER and INIT_TIMER with their + * Don't use the macro below, use DECLARE_TIMER and INIT_TIMER with their * improved callback signature above. */ -#define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \ +#define DEFINE_TIMER(_name, _function, _expires, _data)\ + struct timer_list _name = { \ .entry = { .next = TIMER_ENTRY_STATIC },\ .function = (_function),\ .expires = (_expires), \ .data = (_data),\ - .flags = (_flags), \ __TIMER_LOCKDEP_MAP_INITIALIZER(\ __FILE__ ":" __stringify(__LINE__)) \ } -#define TIMER_INITIALIZER(_function, _expires, _data) \ - __TIMER_INITIALIZER((_function), (_expires), (_data), 0) - -#define TIMER_PINNED_INITIALIZER(_function, _expires, _data) \ - __TIMER_INITIALIZER((_function), (_expires), (_data), TIMER_PINNED) - -#define TIMER_DEFERRED_INITIALIZER(_function, _expires, _data) \ - __TIMER_INITIALIZER((_function), (_expires), (_data), TIMER_DEFERRABLE) - -#define TIMER_PINNED_DEFERRED_INITIALIZER(_function, _expires, _data) \ - __TIMER_INITIALIZER((_function), (_expires), (_data), TIMER_DEFERRABLE | TIMER_PINNED) - -#define DEFINE_TIMER(_name, _function, _expires, _data)\ - struct timer_list _name = \ - TIMER_INITIALIZER(_function, _expires, _data) - void init_timer_key(struct timer_list *timer, unsigned int flags, const char *name, struct lock_class_key *key); -- 2.11.0
[PATCH 8/9] tlclk: switch switchover_timer to a modern timer
And remove a superfluous double-initialization. Signed-off-by: Christoph Hellwig--- drivers/char/tlclk.c | 36 +++- 1 file changed, 19 insertions(+), 17 deletions(-) diff --git a/drivers/char/tlclk.c b/drivers/char/tlclk.c index 572a51704e67..7144016da82c 100644 --- a/drivers/char/tlclk.c +++ b/drivers/char/tlclk.c @@ -184,10 +184,14 @@ static unsigned int telclk_interrupt; static int int_events; /* Event that generate a interrupt */ static int got_event; /* if events processing have been done */ -static void switchover_timeout(unsigned long data); -static struct timer_list switchover_timer = - TIMER_INITIALIZER(switchover_timeout , 0, 0); -static unsigned long tlclk_timer_data; +static void switchover_timeout(struct timer_list *timer); + +static struct switchover_timer { + struct timer_list timer; + unsigned long data; +} switchover_timer = { + .timer = INIT_TIMER(switchover_timeout, 0, TIMER_DEFERRABLE), +}; static struct tlclk_alarms *alarm_events; @@ -805,8 +809,6 @@ static int __init tlclk_init(void) goto out3; } - init_timer(_timer); - ret = misc_register(_miscdev); if (ret < 0) { printk(KERN_ERR "tlclk: misc_register returns %d.\n", ret); @@ -850,25 +852,26 @@ static void __exit tlclk_cleanup(void) unregister_chrdev(tlclk_major, "telco_clock"); release_region(TLCLK_BASE, 8); - del_timer_sync(_timer); + del_timer_sync(_timer.timer); kfree(alarm_events); } -static void switchover_timeout(unsigned long data) +static void switchover_timeout(struct timer_list *timer) { - unsigned long flags = *(unsigned long *) data; + struct switchover_timer *s = + container_of(timer, struct switchover_timer, timer); - if ((flags & 1)) { - if ((inb(TLCLK_REG1) & 0x08) != (flags & 0x08)) + if ((s->data & 1)) { + if ((inb(TLCLK_REG1) & 0x08) != (s->data & 0x08)) alarm_events->switchover_primary++; } else { - if ((inb(TLCLK_REG1) & 0x08) != (flags & 0x08)) + if ((inb(TLCLK_REG1) & 0x08) != (s->data & 0x08)) alarm_events->switchover_secondary++; } /* Alarm processing is done, wake up
read task */ - del_timer(_timer); + del_timer(_timer.timer); got_event = 1; wake_up(); } @@ -920,10 +923,9 @@ static irqreturn_t tlclk_interrupt(int irq, void *dev_id) alarm_events->pll_holdover++; /* TIMEOUT in ~10ms */ - switchover_timer.expires = jiffies + msecs_to_jiffies(10); - tlclk_timer_data = inb(TLCLK_REG1); - switchover_timer.data = (unsigned long) _timer_data; - mod_timer(_timer, switchover_timer.expires); + switchover_timer.data = inb(TLCLK_REG1); + mod_timer(_timer.timer, + jiffies + msecs_to_jiffies(10)); } else { got_event = 1; wake_up(); -- 2.11.0
[PATCH 7/9] s390: switch lgr timer to a modern timer
Signed-off-by: Christoph Hellwig--- arch/s390/kernel/lgr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/s390/kernel/lgr.c b/arch/s390/kernel/lgr.c index ae7dff110054..147124c05f28 100644 --- a/arch/s390/kernel/lgr.c +++ b/arch/s390/kernel/lgr.c @@ -153,14 +153,14 @@ static void lgr_timer_set(void); /* * LGR timer callback */ -static void lgr_timer_fn(unsigned long ignored) +static void lgr_timer_fn(struct timer_list *timer) { lgr_info_log(); lgr_timer_set(); } static struct timer_list lgr_timer = - TIMER_DEFERRED_INITIALIZER(lgr_timer_fn, 0, 0); + INIT_TIMER(lgr_timer_fn, 0, TIMER_DEFERRABLE); /* * Setup next LGR timer -- 2.11.0
[PATCH 6/9] s390: switch topology_timer to a modern timer
Signed-off-by: Christoph Hellwig--- arch/s390/kernel/topology.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c index bb47c92476f0..4a0e867fca2b 100644 --- a/arch/s390/kernel/topology.c +++ b/arch/s390/kernel/topology.c @@ -289,7 +289,7 @@ void topology_schedule_update(void) schedule_work(_work); } -static void topology_timer_fn(unsigned long ignored) +static void topology_timer_fn(struct timer_list *timer) { if (ptf(PTF_CHECK)) topology_schedule_update(); @@ -297,7 +297,7 @@ static void topology_timer_fn(unsigned long ignored) } static struct timer_list topology_timer = - TIMER_DEFERRED_INITIALIZER(topology_timer_fn, 0, 0); + INIT_TIMER(topology_timer_fn, 0, TIMER_DEFERRABLE); static atomic_t topology_poll = ATOMIC_INIT(0); -- 2.11.0
[PATCH 5/9] powerpc/numa: switch topology_timer to modern timer
Signed-off-by: Christoph Hellwig--- arch/powerpc/mm/numa.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 371792e4418f..93a11227716b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1437,7 +1437,7 @@ static void topology_schedule_update(void) schedule_work(_work); } -static void topology_timer_fn(unsigned long ignored) +static void topology_timer_fn(struct timer_list *timer) { if (prrn_enabled && cpumask_weight(_associativity_changes_mask)) topology_schedule_update(); @@ -1447,8 +1447,7 @@ static void topology_timer_fn(unsigned long ignored) reset_topology_timer(); } } -static struct timer_list topology_timer = - TIMER_INITIALIZER(topology_timer_fn, 0, 0); +static struct timer_list topology_timer = INIT_TIMER(topology_timer_fn, 0, 0); static void reset_topology_timer(void) { -- 2.11.0
[PATCH 4/9] workqueue: switch to modern timers
Signed-off-by: Christoph Hellwig--- include/linux/workqueue.h| 16 ++-- kernel/workqueue.c | 14 +++--- .../rcutorture/formal/srcu-cbmc/src/workqueues.h | 2 +- 3 files changed, 14 insertions(+), 18 deletions(-) diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index c102ef65cb64..59c889bf601e 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -17,7 +17,7 @@ struct workqueue_struct; struct work_struct; typedef void (*work_func_t)(struct work_struct *work); -void delayed_work_timer_fn(unsigned long __data); +void delayed_work_timer_fn(struct timer_list *timer); /* * The first word is the work queue pointer and the flags rolled into @@ -175,9 +175,8 @@ struct execute_work { #define __DELAYED_WORK_INITIALIZER(n, f, tflags) { \ .work = __WORK_INITIALIZER((n).work, (f)), \ - .timer = __TIMER_INITIALIZER(delayed_work_timer_fn, \ -0, (unsigned long)&(n),\ -(tflags) | TIMER_IRQSAFE), \ + .timer = INIT_TIMER(delayed_work_timer_fn, 0, \ + (tflags) | TIMER_IRQSAFE), \ } #define DECLARE_WORK(n, f) \ @@ -241,18 +240,15 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; } #define __INIT_DELAYED_WORK(_work, _func, _tflags) \ do {\ INIT_WORK(&(_work)->work, (_func)); \ - __setup_timer(&(_work)->timer, delayed_work_timer_fn, \ - (unsigned long)(_work), \ + prepare_timer(&(_work)->timer, delayed_work_timer_fn, \ (_tflags) | TIMER_IRQSAFE); \ } while (0) #define __INIT_DELAYED_WORK_ONSTACK(_work, _func, _tflags) \ do {\ INIT_WORK_ONSTACK(&(_work)->work, (_func)); \ - __setup_timer_on_stack(&(_work)->timer, \ - delayed_work_timer_fn, \ - (unsigned long)(_work), \ - (_tflags) | TIMER_IRQSAFE); \ + prepare_timer_on_stack(&(_work)->timer, delayed_work_timer_fn, \ + (_tflags) | TIMER_IRQSAFE); \ } while (0) #define INIT_DELAYED_WORK(_work, _func) \ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index c74bf39ef764..ba2cd509902f 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1492,9 +1492,10 @@ bool 
queue_work_on(int cpu, struct workqueue_struct *wq, } EXPORT_SYMBOL(queue_work_on); -void delayed_work_timer_fn(unsigned long __data) +void delayed_work_timer_fn(struct timer_list *timer) { - struct delayed_work *dwork = (struct delayed_work *)__data; + struct delayed_work *dwork = + container_of(timer, struct delayed_work, timer); /* should have been called from irqsafe timer with irq already off */ __queue_work(dwork->cpu, dwork->wq, >work); @@ -1508,8 +1509,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq, struct work_struct *work = >work; WARN_ON_ONCE(!wq); - WARN_ON_ONCE(timer->function != delayed_work_timer_fn || -timer->data != (unsigned long)dwork); + WARN_ON_ONCE(timer->func != delayed_work_timer_fn); WARN_ON_ONCE(timer_pending(timer)); WARN_ON_ONCE(!list_empty(>entry)); @@ -5335,11 +5335,11 @@ static void workqueue_sysfs_unregister(struct workqueue_struct *wq) { } */ #ifdef CONFIG_WQ_WATCHDOG -static void wq_watchdog_timer_fn(unsigned long data); +static void wq_watchdog_timer_fn(struct timer_list *timer); static unsigned long wq_watchdog_thresh = 30; static struct timer_list wq_watchdog_timer = - TIMER_DEFERRED_INITIALIZER(wq_watchdog_timer_fn, 0, 0); + INIT_TIMER(wq_watchdog_timer_fn, 0, TIMER_DEFERRABLE); static unsigned long wq_watchdog_touched = INITIAL_JIFFIES; static DEFINE_PER_CPU(unsigned long, wq_watchdog_touched_cpu) = INITIAL_JIFFIES; @@ -5353,7 +5353,7 @@ static void wq_watchdog_reset_touched(void) per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies; } -static void wq_watchdog_timer_fn(unsigned long data) +static void wq_watchdog_timer_fn(struct timer_list *timer) { unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ; bool lockup_detected = false; diff --git
[PATCH 2/9] timers: provide a "modern" variant of timers
The new callback gets a pointer to the timer_list itself, which can then be used to get the containing structure using container_of instead of casting from and to unsigned long all the time. The setup helpers take a flags argument instead of needing countless variants. Note: this further reduces space for the cpumask. By the time we'll need the additional cpumask space getting rid of the old-style timers will hopefully be finished. Signed-off-by: Christoph Hellwig--- include/linux/timer.h | 50 -- kernel/time/timer.c | 24 ++-- 2 files changed, 62 insertions(+), 12 deletions(-) diff --git a/include/linux/timer.h b/include/linux/timer.h index e6789b8757d5..87afe52c8349 100644 --- a/include/linux/timer.h +++ b/include/linux/timer.h @@ -16,7 +16,10 @@ struct timer_list { */ struct hlist_node entry; unsigned long expires; - void(*function)(unsigned long); + union { + void(*func)(struct timer_list *timer); + void(*function)(unsigned long); + }; unsigned long data; u32 flags; @@ -52,7 +55,8 @@ struct timer_list { * workqueue locking issues. It's not meant for executing random crap * with interrupts disabled. Abuse is monitored! */ -#define TIMER_CPUMASK 0x0003 +#define TIMER_CPUMASK 0x0001 +#define TIMER_MODERN 0x0002 #define TIMER_MIGRATING0x0004 #define TIMER_BASEMASK (TIMER_CPUMASK | TIMER_MIGRATING) #define TIMER_DEFERRABLE 0x0008 @@ -63,6 +67,22 @@ struct timer_list { #define TIMER_TRACE_FLAGMASK (TIMER_MIGRATING | TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE) +#define INIT_TIMER(_func, _expires, _flags)\ +{ \ + .entry = { .next = TIMER_ENTRY_STATIC },\ + .func = (_func),\ + .expires = (_expires), \ + .flags = TIMER_MODERN | (_flags), \ + __TIMER_LOCKDEP_MAP_INITIALIZER(__FILE__ ":" __stringify(__LINE__)) \ +} + +#define DECLARE_TIMER(_name, _func, _expires, _flags) \ + struct timer_list _name = INIT_TIMER(_func, _expires, _flags) + +/* + * Don't use the macros below, use DECLARE_TIMER and INIT_TIMER with their + * improved callback signature above. 
+ */ #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \ .entry = { .next = TIMER_ENTRY_STATIC },\ .function = (_function),\ @@ -126,6 +146,32 @@ static inline void init_timer_on_stack_key(struct timer_list *timer, init_timer_on_stack_key((_timer), (_flags), NULL, NULL) #endif +/** + * prepare_timer - initialize a timer before first use + * @timer: timer structure to prepare + * @func: callback to be called when the timer expires + * @flags %TIMER_* flags that control timer behavior + * + * This function initializes a timer_list structure so that it can + * be used (by calling add_timer() or mod_timer()). + */ +static inline void prepare_timer(struct timer_list *timer, + void (*func)(struct timer_list *timer), u32 flags) +{ + __init_timer(timer, TIMER_MODERN | flags); + timer->func = func; +} + +static inline void prepare_timer_on_stack(struct timer_list *timer, + void (*func)(struct timer_list *timer), u32 flags) +{ + __init_timer_on_stack(timer, TIMER_MODERN | flags); + timer->func = func; +} + +/* + * Don't use - use prepare_timer above for new code instead. 
+ */ #define init_timer(timer) \ __init_timer((timer), 0) #define init_timer_pinned(timer) \ diff --git a/kernel/time/timer.c b/kernel/time/timer.c index c7978fcdbbea..48d8450cfa5f 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -579,7 +579,7 @@ static struct debug_obj_descr timer_debug_descr; static void *timer_debug_hint(void *addr) { - return ((struct timer_list *) addr)->function; + return ((struct timer_list *) addr)->func; } static bool timer_is_static_object(void *addr) @@ -930,7 +930,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) unsigned long clk = 0, flags; int ret = 0; - BUG_ON(!timer->function); + BUG_ON(!timer->func && !timer->function); /* * This is a common optimization triggered by the networking code - if @@ -1064,12 +1064,12 @@ EXPORT_SYMBOL(mod_timer); * add_timer - start a timer * @timer: the timer to be added * - * The kernel will do a ->function(->data) callback from the - * timer interrupt at the ->expires point in the future. The - * current time is 'jiffies'. +
[PATCH 3/9] kthread: remove unused macros
KTHREAD_DELAYED_WORK_INIT and DEFINE_KTHREAD_DELAYED_WORK are unused and are using a timer helper that's about to go away. Signed-off-by: Christoph Hellwig--- include/linux/kthread.h | 11 --- 1 file changed, 11 deletions(-) diff --git a/include/linux/kthread.h b/include/linux/kthread.h index 4fec8b775895..acb6edb4b4b4 100644 --- a/include/linux/kthread.h +++ b/include/linux/kthread.h @@ -114,23 +114,12 @@ struct kthread_delayed_work { .func = (fn), \ } -#define KTHREAD_DELAYED_WORK_INIT(dwork, fn) { \ - .work = KTHREAD_WORK_INIT((dwork).work, (fn)), \ - .timer = __TIMER_INITIALIZER(kthread_delayed_work_timer_fn, \ -0, (unsigned long)&(dwork),\ -TIMER_IRQSAFE),\ - } - #define DEFINE_KTHREAD_WORKER(worker) \ struct kthread_worker worker = KTHREAD_WORKER_INIT(worker) #define DEFINE_KTHREAD_WORK(work, fn) \ struct kthread_work work = KTHREAD_WORK_INIT(work, fn) -#define DEFINE_KTHREAD_DELAYED_WORK(dwork, fn) \ - struct kthread_delayed_work dwork = \ - KTHREAD_DELAYED_WORK_INIT(dwork, fn) - /* * kthread_worker.lock needs its own lockdep class key when defined on * stack with lockdep enabled. Use the following macros in such cases. -- 2.11.0
[PATCH 1/9] timers: remove the fn and data arguments to call_timer_fn
And just move the dereferences inline, given that the timer gets passed as an argument. Signed-off-by: Christoph Hellwig--- kernel/time/timer.c | 16 +--- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 152a706ef8b8..c7978fcdbbea 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1240,8 +1240,7 @@ int del_timer_sync(struct timer_list *timer) EXPORT_SYMBOL(del_timer_sync); #endif -static void call_timer_fn(struct timer_list *timer, void (*fn)(unsigned long), - unsigned long data) +static void call_timer_fn(struct timer_list *timer) { int count = preempt_count(); @@ -1265,14 +1264,14 @@ static void call_timer_fn(struct timer_list *timer, void (*fn)(unsigned long), lock_map_acquire(_map); trace_timer_expire_entry(timer); - fn(data); + timer->function(timer->data); trace_timer_expire_exit(timer); lock_map_release(_map); if (count != preempt_count()) { WARN_ONCE(1, "timer: %pF preempt leak: %08x -> %08x\n", - fn, count, preempt_count()); + timer->function, count, preempt_count()); /* * Restore the preempt count. That gives us a decent * chance to survive and extract information. If the @@ -1287,24 +1286,19 @@ static void expire_timers(struct timer_base *base, struct hlist_head *head) { while (!hlist_empty(head)) { struct timer_list *timer; - void (*fn)(unsigned long); - unsigned long data; timer = hlist_entry(head->first, struct timer_list, entry); base->running_timer = timer; detach_timer(timer, true); - fn = timer->function; - data = timer->data; - if (timer->flags & TIMER_IRQSAFE) { spin_unlock(>lock); - call_timer_fn(timer, fn, data); + call_timer_fn(timer); spin_lock(>lock); } else { spin_unlock_irq(>lock); - call_timer_fn(timer, fn, data); + call_timer_fn(timer); spin_lock_irq(>lock); } } -- 2.11.0
RFC: better timer interface
Hi all,

this series attempts to provide a "modern" timer interface where the callback gets the timer_list structure as an argument, so that it can use container_of instead of having to cast to/from unsigned long all the time (or, even worse, use function pointer casts; we have quite a few of those as well). For that it steals another bit from the cpu mask to add a modern flag, and if that flag is set the new, different function prototype is used. Last but not least, new helpers to initialize these modern timers are added. Instead of having a larger number of initialization macros we simply pass the timer flags to them.
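The container_of pattern the cover letter argues for can be sketched in plain userspace C. Everything here (`struct my_device`, `run` via `demo()`, the stripped-down `timer_list`) is an illustrative stand-in, not the proposed kernel API:

```c
#include <stddef.h>

/* Sketch of the proposed callback signature: the timer itself is passed
 * to the handler, which recovers its enclosing structure with
 * container_of() instead of casting an unsigned long back and forth. */

struct timer_list {
	void (*function)(struct timer_list *);
};

struct my_device {
	int pending;
	struct timer_list timer;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void my_timeout(struct timer_list *t)
{
	/* Type-safe recovery of the per-device state, no casts needed */
	struct my_device *dev = container_of(t, struct my_device, timer);
	dev->pending = 0;
}

/* Mimics what expire_timers()/call_timer_fn() would do after this series */
static int demo(void)
{
	struct my_device dev = { .pending = 1 };

	dev.timer.function = my_timeout;
	dev.timer.function(&dev.timer);
	return dev.pending;
}
```

The gain over the old `void (*fn)(unsigned long data)` style is that the compiler checks the callback type and the handler cannot be paired with the wrong `data` value.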
Re: [PATCH 2/3] powerpc/mm: Rename find_linux_pte_or_hugepte
On Tue, 2017-05-16 at 14:56 +0530, Aneesh Kumar K.V wrote: > +static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea, > + bool *is_thp, unsigned *hshift) > +{ > + VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()) , > + "%s called with irq enabled\n", __func__); > + return __find_linux_pte(pgdir, ea, is_thp, hshift); > +} > + When is arch_irqs_disabled() not sufficient ? Cheers, Ben.
Re: [PATCH 1/3] powerpc: Add __hard_irqs_disabled()
On Tue, 2017-05-16 at 14:56 +0530, Aneesh Kumar K.V wrote: > > +static inline bool __hard_irqs_disabled(void) > +{ > + unsigned long flags = mfmsr(); > + return (flags & MSR_EE) == 0; > +} > + Reading the MSR has a cost. Can't we rely on paca->irq_happened being non-0 ? (If you are paranoid, add a test of msr as well and warn if there's a mismatch ...) Cheers, Ben.
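The two checks under discussion (reading the MSR directly versus trusting cached soft-disable state, with a warning on mismatch) can be modelled in a few lines. This is a mock, not powerpc code: `mock_msr` and `soft_disabled` stand in for `mfmsr()` and the PACA state; `MSR_EE` is bit 15 as on powerpc, but nothing else here is real:

```c
#include <stdbool.h>

#define MSR_EE (1UL << 15)

static unsigned long mock_msr;
static bool soft_disabled;

/* The mfmsr()-based check __hard_irqs_disabled() performs */
static bool hard_irqs_disabled(void)
{
	return (mock_msr & MSR_EE) == 0;
}

/* Returns whether interrupts count as disabled for a lockless PTE walk,
 * and flags the suspicious state Ben suggests warning about:
 * hard disabled (MSR_EE clear) while soft enabled. */
static bool irqs_disabled_for_pte_walk(bool *mismatch)
{
	bool hard = hard_irqs_disabled();

	*mismatch = hard && !soft_disabled;
	return hard || soft_disabled;
}
```

The cost argument is that `hard_irqs_disabled()` needs an `mfmsr` every time, while the soft-disable flag is a cheap memory read; the mismatch flag is the debug-only consistency check.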
Re: Mainline build breaks on powerpc with error : fs/xfs/xfs_iomap.c:1152: undefined reference to `.put_dax
On Tue, May 16, 2017 at 1:02 PM, Abdul Haleem wrote:
> Hi,
>
> Today's mainline 4.12-rc1 fails to build for the attached configuration
> file on Power7 box with below errors.
>
> $ make
> fs/built-in.o: In function `xfs_file_iomap_end':
> fs/xfs/xfs_iomap.c:1152: undefined reference to `.put_dax'
> fs/built-in.o: In function `xfs_file_iomap_begin':
> fs/xfs/xfs_iomap.c:1071: undefined reference to `.dax_get_by_host'
>
> Also reproducible on latest linux-next, and the last successful build
> was at next-20170510.

This should be fixed by https://patchwork.kernel.org/patch/9725515/

Arnd
Re: kernel BUG at mm/usercopy.c:72!
[Cc'ing the relevant folks] Breno Leitaowrites: > Hello, > > Kernel 4.12-rc1 is showing a bug when I try it on a POWER8 virtual > machine. Justing SSHing into the machine causes this issue. > > [23.138124] usercopy: kernel memory overwrite attempt detected to > d3d80030 (mm_struct) (560 bytes) > [23.138195] [ cut here ] > [23.138229] kernel BUG at mm/usercopy.c:72! > [23.138252] Oops: Exception in kernel mode, sig: 5 [#3] > [23.138280] SMP NR_CPUS=2048 > [23.138280] NUMA > [23.138302] pSeries > [23.138330] Modules linked in: > [23.138354] CPU: 4 PID: 2215 Comm: sshd Tainted: G D > 4.12.0-rc1+ #9 > [23.138395] task: c001e272dc00 task.stack: c001e27b > [23.138430] NIP: c0342358 LR: c0342354 CTR: > c06eb060 > [23.138472] REGS: c001e27b3a00 TRAP: 0700 Tainted: G D >(4.12.0-rc1+) > [23.138513] MSR: 80029033 > [23.138517] CR: 28004222 XER: 2000 > [23.138565] CFAR: c0b34500 SOFTE: 1 > [23.138565] GPR00: c0342354 c001e27b3c80 c142a000 > 005e > [23.138565] GPR04: c001ffe0ade8 c001ffe21bf8 2920283536302062 > 79746573290d0a74 > [23.138565] GPR08: 0007 c0f61864 0001feeb > 3064206f74206465 > [23.138565] GPR12: 4400 cfb42600 0015 > 545bdc40 > [23.138565] GPR16: 545c49c8 01000b4b8890 778c26f0 > 545cf000 > [23.138565] GPR20: 546109c8 c7e8 54610010 > 778c22e8 > [23.138565] GPR24: 545c8c40 c000ff6bcef0 c01e5220 > 0230 > [23.138565] GPR28: d3d80260 0230 > d3d80030 > [23.138920] NIP [c0342358] __check_object_size+0x88/0x2d0 > [23.138956] LR [c0342354] __check_object_size+0x84/0x2d0 > [23.138990] Call Trace: > [23.139006] [c001e27b3c80] [c0342354] > __check_object_size+0x84/0x2d0 (unreliable) > [23.139056] [c001e27b3d00] [c09f5ba8] > bpf_prog_create_from_user+0xa8/0x1a0 > [23.139099] [c001e27b3d60] [c01e5d30] do_seccomp+0x120/0x720 > [23.139136] [c001e27b3dd0] [c00fd53c] SyS_prctl+0x2ac/0x6b0 > [23.139172] [c001e27b3e30] [c000af84] system_call+0x38/0xe0 > [23.139218] Instruction dump: > [23.139240] 6000 6042 3c82ff94 3ca2ff9d 38841788 38a5e868 > 3c62ff95 7fc8f378 > [23.139283] 7fe6fb78 
386310c0 487f2169 6000 <0fe0> 6042 > 2ba30010 409d018c > [23.139328] ---[ end trace 1a1dc952a4b7c4af ]--- > > I found that kernel 4.11 does not have this issue. I also found that, if > I revert 517e1fbeb65f5eade8d14f46ac365db6c75aea9b, I do not see the > problem. > > On the other side, if I cherry-pick commit > 517e1fbeb65f5eade8d14f46ac365db6c75aea9b into 4.11, I start seeing the > same issue also on 4.11. Yeah it looks like powerpc also suffers from the same bug that arm64 used to, ie. virt_addr_valid() will return true for some vmalloc addresses. virt_addr_valid() is used pretty widely, I'm not sure if we can just fix it without other fallout. I'll dig a bit more tomorrow if no one beats me to it. Kees, depending on how that turns out we may ask you to revert 517e1fbeb65f ("mm/usercopy: Drop extra is_vmalloc_or_module() check"). cheers
Mainline build breaks on powerpc with error : fs/xfs/xfs_iomap.c:1152: undefined reference to `.put_dax
Hi, Today's mainline 4.12-rc1 fails to build for the attached configuration file on Power7 box with below errors. $ make fs/built-in.o: In function `xfs_file_iomap_end': fs/xfs/xfs_iomap.c:1152: undefined reference to `.put_dax' fs/built-in.o: In function `xfs_file_iomap_begin': fs/xfs/xfs_iomap.c:1071: undefined reference to `.dax_get_by_host' Also reproducible on latest linux-next, and the last successful build was at next-20170510. -- Regard's Abdul Haleem IBM Linux Technology Centre # # Automatically generated file; DO NOT EDIT. # Linux/powerpc 4.10.0-rc5 Kernel Configuration # CONFIG_PPC64=y # # Processor support # CONFIG_PPC_BOOK3S_64=y # CONFIG_PPC_BOOK3E_64 is not set CONFIG_GENERIC_CPU=y # CONFIG_CELL_CPU is not set # CONFIG_POWER4_CPU is not set # CONFIG_POWER5_CPU is not set # CONFIG_POWER6_CPU is not set # CONFIG_POWER7_CPU is not set # CONFIG_POWER8_CPU is not set CONFIG_PPC_BOOK3S=y CONFIG_PPC_FPU=y CONFIG_ALTIVEC=y CONFIG_VSX=y CONFIG_PPC_ICSWX=y # CONFIG_PPC_ICSWX_PID is not set # CONFIG_PPC_ICSWX_USE_SIGILL is not set CONFIG_PPC_STD_MMU=y CONFIG_PPC_STD_MMU_64=y CONFIG_PPC_RADIX_MMU=y CONFIG_PPC_MM_SLICES=y CONFIG_PPC_HAVE_PMU_SUPPORT=y CONFIG_PPC_PERF_CTRS=y CONFIG_SMP=y CONFIG_NR_CPUS=2048 CONFIG_PPC_DOORBELL=y CONFIG_VDSO32=y CONFIG_CPU_BIG_ENDIAN=y # CONFIG_CPU_LITTLE_ENDIAN is not set CONFIG_64BIT=y CONFIG_ARCH_PHYS_ADDR_T_64BIT=y CONFIG_ARCH_DMA_ADDR_T_64BIT=y CONFIG_MMU=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NR_IRQS=512 CONFIG_STACKTRACE_SUPPORT=y CONFIG_TRACE_IRQFLAGS_SUPPORT=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_ARCH_HAS_ILOG2_U32=y CONFIG_ARCH_HAS_ILOG2_U64=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK=y CONFIG_PPC=y # CONFIG_GENERIC_CSUM is not set CONFIG_EARLY_PRINTK=y CONFIG_PANIC_TIMEOUT=180 CONFIG_COMPAT=y CONFIG_SYSVIPC_COMPAT=y CONFIG_SCHED_OMIT_FRAME_POINTER=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_PPC_UDBG_16550=y # CONFIG_GENERIC_TBSYNC is 
not set CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y CONFIG_EPAPR_BOOT=y # CONFIG_DEFAULT_UIMAGE is not set CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y # CONFIG_PPC_DCR_NATIVE is not set # CONFIG_PPC_DCR_MMIO is not set # CONFIG_PPC_OF_PLATFORM_PCI is not set CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_PPC_EMULATE_SSTEP=y CONFIG_ZONE_DMA32=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_XZ is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y # CONFIG_USELIB is not set CONFIG_AUDIT=y CONFIG_HAVE_ARCH_AUDITSYSCALL=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_WATCH=y CONFIG_AUDIT_TREE=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_SHOW_LEVEL=y CONFIG_HARDIRQS_SW_RESEND=y CONFIG_IRQ_DOMAIN=y CONFIG_GENERIC_MSI_IRQ=y # CONFIG_IRQ_DOMAIN_DEBUG is not set CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_GENERIC_TIME_VSYSCALL_OLD=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_ARCH_HAS_TICK_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set # CONFIG_NO_HZ_IDLE is not set CONFIG_NO_HZ_FULL=y # CONFIG_NO_HZ_FULL_ALL is not set # CONFIG_NO_HZ_FULL_SYSIDLE is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_VIRT_CPU_ACCOUNTING=y CONFIG_VIRT_CPU_ACCOUNTING_GEN=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y 
CONFIG_TASK_IO_ACCOUNTING=y # # RCU Subsystem # CONFIG_TREE_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y CONFIG_TASKS_RCU=y CONFIG_RCU_STALL_COMMON=y CONFIG_CONTEXT_TRACKING=y # CONFIG_CONTEXT_TRACKING_FORCE is not set # CONFIG_TREE_RCU_TRACE is not set CONFIG_RCU_NOCB_CPU=y # CONFIG_RCU_NOCB_CPU_NONE is not set # CONFIG_RCU_NOCB_CPU_ZERO is not set CONFIG_RCU_NOCB_CPU_ALL=y # CONFIG_BUILD_BIN2C is not set # CONFIG_IKCONFIG is not set CONFIG_LOG_BUF_SHIFT=20 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13 CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y #CONFIG_NUMA_BALANCING is not set #CONFIG_NUMA_BALANCING_DEFAULT_ENABLED is not set CONFIG_CGROUPS=y CONFIG_PAGE_COUNTER=y CONFIG_MEMCG=y CONFIG_MEMCG_SWAP=y CONFIG_MEMCG_SWAP_ENABLED=y CONFIG_BLK_CGROUP=y # CONFIG_DEBUG_BLK_CGROUP is not set CONFIG_CGROUP_WRITEBACK=y CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y CONFIG_CFS_BANDWIDTH=y CONFIG_RT_GROUP_SCHED=y # CONFIG_CGROUP_PIDS is not set #
Re: kernel BUG at mm/usercopy.c:72!
Breno Leitaowrites: > Hello, > > Kernel 4.12-rc1 is showing a bug when I try it on a POWER8 virtual > machine. Justing SSHing into the machine causes this issue. > > [23.138124] usercopy: kernel memory overwrite attempt detected to > d3d80030 (mm_struct) (560 bytes) > [23.138195] [ cut here ] > [23.138229] kernel BUG at mm/usercopy.c:72! > [23.138252] Oops: Exception in kernel mode, sig: 5 [#3] > [23.138280] SMP NR_CPUS=2048 > [23.138280] NUMA > [23.138302] pSeries > [23.138330] Modules linked in: > [23.138354] CPU: 4 PID: 2215 Comm: sshd Tainted: G D > 4.12.0-rc1+ #9 > [23.138395] task: c001e272dc00 task.stack: c001e27b > [23.138430] NIP: c0342358 LR: c0342354 CTR: > c06eb060 > [23.138472] REGS: c001e27b3a00 TRAP: 0700 Tainted: G D >(4.12.0-rc1+) > [23.138513] MSR: 80029033 > [23.138517] CR: 28004222 XER: 2000 > [23.138565] CFAR: c0b34500 SOFTE: 1 > [23.138565] GPR00: c0342354 c001e27b3c80 c142a000 > 005e > [23.138565] GPR04: c001ffe0ade8 c001ffe21bf8 2920283536302062 > 79746573290d0a74 > [23.138565] GPR08: 0007 c0f61864 0001feeb > 3064206f74206465 > [23.138565] GPR12: 4400 cfb42600 0015 > 545bdc40 > [23.138565] GPR16: 545c49c8 01000b4b8890 778c26f0 > 545cf000 > [23.138565] GPR20: 546109c8 c7e8 54610010 > 778c22e8 > [23.138565] GPR24: 545c8c40 c000ff6bcef0 c01e5220 > 0230 > [23.138565] GPR28: d3d80260 0230 > d3d80030 > [23.138920] NIP [c0342358] __check_object_size+0x88/0x2d0 > [23.138956] LR [c0342354] __check_object_size+0x84/0x2d0 > [23.138990] Call Trace: > [23.139006] [c001e27b3c80] [c0342354] > __check_object_size+0x84/0x2d0 (unreliable) > [23.139056] [c001e27b3d00] [c09f5ba8] > bpf_prog_create_from_user+0xa8/0x1a0 > [23.139099] [c001e27b3d60] [c01e5d30] do_seccomp+0x120/0x720 > [23.139136] [c001e27b3dd0] [c00fd53c] SyS_prctl+0x2ac/0x6b0 > [23.139172] [c001e27b3e30] [c000af84] system_call+0x38/0xe0 > [23.139218] Instruction dump: > [23.139240] 6000 6042 3c82ff94 3ca2ff9d 38841788 38a5e868 > 3c62ff95 7fc8f378 > [23.139283] 7fe6fb78 386310c0 487f2169 6000 
<0fe0> 6042 > 2ba30010 409d018c > [23.139328] ---[ end trace 1a1dc952a4b7c4af ]--- Do you have any idea what is calling seccomp() and triggering the bug? I run the BPF and seccomp test suites, and I haven't seen this. cheers
[PATCH] powerpc/mm: Fix crash in page table dump with huge pages
The page table dump code doesn't know about huge pages, so currently it crashes (or walks random memory, usually leading to a crash), if it finds a huge page. On Book3S we only see huge pages in the Linux page tables when we're using the P9 Radix MMU. Teaching the code to properly handle huge pages is a bit more involved, so for now just prevent the crash.

Cc: sta...@vger.kernel.org # v4.10+
Fixes: 8eb07b187000 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: Michael Ellerman
---
 arch/powerpc/mm/dump_linuxpagetables.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
index d659345a98d6..6070d3d60ef1 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -391,7 +391,7 @@ static void walk_pmd(struct pg_state *st, pud_t *pud, unsigned long start)
 	for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
 		addr = start + i * PMD_SIZE;
-		if (!pmd_none(*pmd))
+		if (!pmd_none(*pmd) && !pmd_huge(*pmd))
 			/* pmd exists */
 			walk_pte(st, pmd, addr);
 		else
@@ -407,7 +407,7 @@ static void walk_pud(struct pg_state *st, pgd_t *pgd, unsigned long start)
 	for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
 		addr = start + i * PUD_SIZE;
-		if (!pud_none(*pud))
+		if (!pud_none(*pud) && !pud_huge(*pud))
 			/* pud exists */
 			walk_pmd(st, pud, addr);
 		else
@@ -427,7 +427,7 @@ static void walk_pagetables(struct pg_state *st)
 	 */
 	for (i = 0; i < PTRS_PER_PGD; i++, pgd++) {
 		addr = KERN_VIRT_START + i * PGDIR_SIZE;
-		if (!pgd_none(*pgd))
+		if (!pgd_none(*pgd) && !pgd_huge(*pgd))
 			/* pgd exists */
 			walk_pud(st, pgd, addr);
 		else
-- 
2.7.4
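The crash mode the patch prevents can be demonstrated with a toy two-level table: a "huge" entry is a leaf mapping, and descending into it as if it pointed at a lower-level table is exactly the bug. This is a hypothetical userspace sketch, not the kernel walker:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy two-level table: an entry is either a leaf ("huge") or a pointer
 * to a lower-level table of the same width. */
struct entry {
	bool present;
	bool huge;          /* leaf mapping at this level */
	struct entry *next; /* lower level, valid only when !huge */
};

static int count_leaves(struct entry *table, size_t n)
{
	int leaves = 0;

	for (size_t i = 0; i < n; i++) {
		if (!table[i].present)
			continue;
		if (table[i].huge) {	/* the pmd_huge()-style check */
			leaves++;
			continue;	/* never descend into a leaf */
		}
		leaves += count_leaves(table[i].next, n);
	}
	return leaves;
}

static int demo(void)
{
	struct entry lower[2] = {
		{ .present = true, .huge = true },
		{ .present = false },
	};
	struct entry top[2] = {
		{ .present = true, .huge = false, .next = lower },
		/* a naive walker would dereference .next here (NULL) */
		{ .present = true, .huge = true },
	};

	return count_leaves(top, 2);
}
```

Without the `huge` test, the second top-level entry would be treated as a table pointer and the walk would chase garbage, which mirrors the "walks random memory" behaviour in the commit message.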
Re: [PATCH v2 2/2] powerpc/mm/hugetlb: Add support for 1G huge pages
On 05/16/2017 02:47 PM, Aneesh Kumar K.V wrote:
> POWER9 supports hugepages of size 2M and 1G in radix MMU mode. This patch
> enables the usage of 1G page size for hugetlbfs. This also update the helper
> such we can do 1G page allocation at runtime.
>
> We still don't enable 1G page size on DD1 version. This is to avoid doing
> workaround mentioned in commit: 6d3a0379ebdc8 (powerpc/mm: Add
> radix__tlb_flush_pte_p9_dd1()
>
> Signed-off-by: Aneesh Kumar K.V

Sounds good.

Reviewed-by: Anshuman Khandual
Re: [PATCH v2 1/2] mm/hugetlb: Cleanup ARCH_HAS_GIGANTIC_PAGE
On 05/16/2017 02:47 PM, Aneesh Kumar K.V wrote: > This moves the #ifdef in C code to a Kconfig dependency. Also we move the > gigantic_page_supported() function to be arch specific. This gives arch to > conditionally enable runtime allocation of gigantic huge page. Architectures > like ppc64 supports different gigantic huge page size (16G and 1G) based on > the > translation mode selected. This provides an opportunity for ppc64 to enable > runtime allocation only w.r.t 1G hugepage. Right. > > No functional change in this patch. > > Signed-off-by: Aneesh Kumar K.V> --- > arch/arm64/Kconfig | 2 +- > arch/arm64/include/asm/hugetlb.h | 4 > arch/s390/Kconfig| 2 +- > arch/s390/include/asm/hugetlb.h | 3 +++ > arch/x86/Kconfig | 2 +- > mm/hugetlb.c | 7 ++- > 6 files changed, 12 insertions(+), 8 deletions(-) > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index 3741859765cf..1f8c1f73aada 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -11,7 +11,7 @@ config ARM64 > select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI > select ARCH_HAS_ELF_RANDOMIZE > select ARCH_HAS_GCOV_PROFILE_ALL > - select ARCH_HAS_GIGANTIC_PAGE > + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA > select ARCH_HAS_KCOV > select ARCH_HAS_SET_MEMORY > select ARCH_HAS_SG_CHAIN > diff --git a/arch/arm64/include/asm/hugetlb.h > b/arch/arm64/include/asm/hugetlb.h > index bbc1e35aa601..793bd73b0d07 100644 > --- a/arch/arm64/include/asm/hugetlb.h > +++ b/arch/arm64/include/asm/hugetlb.h > @@ -83,4 +83,8 @@ extern void huge_ptep_set_wrprotect(struct mm_struct *mm, > extern void huge_ptep_clear_flush(struct vm_area_struct *vma, > unsigned long addr, pte_t *ptep); > > +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE > +static inline bool gigantic_page_supported(void) { return true; } > +#endif > + > #endif /* __ASM_HUGETLB_H */ > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig > index a2dcef0aacc7..a41bbf420dda 100644 > --- a/arch/s390/Kconfig > +++ b/arch/s390/Kconfig > @@ -67,7 
+67,7 @@ config S390 > select ARCH_HAS_DEVMEM_IS_ALLOWED > select ARCH_HAS_ELF_RANDOMIZE > select ARCH_HAS_GCOV_PROFILE_ALL > - select ARCH_HAS_GIGANTIC_PAGE > + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA > select ARCH_HAS_KCOV > select ARCH_HAS_SET_MEMORY > select ARCH_HAS_SG_CHAIN > diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h > index cd546a245c68..89057b2cc8fe 100644 > --- a/arch/s390/include/asm/hugetlb.h > +++ b/arch/s390/include/asm/hugetlb.h > @@ -112,4 +112,7 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t > newprot) > return pte_modify(pte, newprot); > } > > +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE > +static inline bool gigantic_page_supported(void) { return true; } > +#endif > #endif /* _ASM_S390_HUGETLB_H */ > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index cc98d5a294ee..30a6328136ac 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -22,7 +22,7 @@ config X86_64 > def_bool y > depends on 64BIT > # Options that are inherently 64-bit kernel only: > - select ARCH_HAS_GIGANTIC_PAGE > + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA > select ARCH_SUPPORTS_INT128 > select ARCH_USE_CMPXCHG_LOCKREF > select HAVE_ARCH_SOFT_DIRTY Should not we define gigantic_page_supported() function for X86 as well like the other two archs above ?
[PATCH 2/3] powerpc/mm: Rename find_linux_pte_or_hugepte
No functional change. Add newer helpers with additional warnings and use those.
---
 arch/powerpc/include/asm/pgtable.h     | 10 +
 arch/powerpc/include/asm/pte-walk.h    | 38 ++
 arch/powerpc/kernel/eeh.c              |  4 ++--
 arch/powerpc/kernel/io-workarounds.c   |  5 +++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  5 +++--
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 33 -
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  3 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c    | 12 ---
 arch/powerpc/kvm/e500_mmu_host.c       |  3 ++-
 arch/powerpc/mm/hash_utils_64.c        |  5 +++--
 arch/powerpc/mm/hugetlbpage.c          | 24 -
 arch/powerpc/mm/tlb_hash64.c           |  6 --
 arch/powerpc/perf/callchain.c          |  3 ++-
 13 files changed, 97 insertions(+), 54 deletions(-)
 create mode 100644 arch/powerpc/include/asm/pte-walk.h

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index dd01212935ac..9fa263ad7cb3 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -66,16 +66,8 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_large(pmd)		0
 #endif
-pte_t *__find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
-				   bool *is_thp, unsigned *shift);
-static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
-					       bool *is_thp, unsigned *shift)
-{
-	VM_WARN(!arch_irqs_disabled(),
-		"%s called with irq enabled\n", __func__);
-	return __find_linux_pte_or_hugepte(pgdir, ea, is_thp, shift);
-}
 
+/* can we use this in kvm */
 unsigned long vmalloc_to_phys(void *vmalloc_addr);
 
 void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
diff --git a/arch/powerpc/include/asm/pte-walk.h b/arch/powerpc/include/asm/pte-walk.h
new file mode 100644
index ..ea30c4ddd211
--- /dev/null
+++ b/arch/powerpc/include/asm/pte-walk.h
@@ -0,0 +1,38 @@
+#ifndef _ASM_POWERPC_PTE_WALK_H
+#define _ASM_POWERPC_PTE_WALK_H
+
+#ifndef __ASSEMBLY__
+#include
+
+/* Don't use this directly */
+extern pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
+			       bool *is_thp, unsigned *hshift);
+
+static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea,
+				    bool *is_thp, unsigned *hshift)
+{
+	VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()),
+		"%s called with irq enabled\n", __func__);
+	return __find_linux_pte(pgdir, ea, is_thp, hshift);
+}
+
+static inline pte_t *find_init_mm_pte(unsigned long ea, unsigned *hshift)
+{
+	pgd_t *pgdir = init_mm.pgd;
+	return __find_linux_pte(pgdir, ea, NULL, hshift);
+}
+/*
+ * This is what we should always use. Any other lockless page table lookup needs
+ * careful audit against THP split.
+ */
+static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
+					 bool *is_thp, unsigned *hshift)
+{
+	VM_WARN((!arch_irqs_disabled() && !__hard_irqs_disabled()),
+		"%s called with irq enabled\n", __func__);
+	VM_WARN(pgdir != current->mm->pgd,
+		"%s lock less page table lookup called on wrong mm\n", __func__);
+	return __find_linux_pte(pgdir, ea, is_thp, hshift);
+}
+#endif
+#endif
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 63992b2d8e15..5e6887c40528 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -44,6 +44,7 @@
 #include
 #include
 #include
+#include
 
 /** Overview:
@@ -352,8 +353,7 @@ static inline unsigned long eeh_token_to_phys(unsigned long token)
 	 * worried about _PAGE_SPLITTING/collapse. Also we will not hit
 	 * page table free, because of init_mm.
 	 */
-	ptep = __find_linux_pte_or_hugepte(init_mm.pgd, token,
-					   NULL, &hugepage_shift);
+	ptep = find_init_mm_pte(token, &hugepage_shift);
 	if (!ptep)
 		return token;
 	WARN_ON(hugepage_shift);
diff --git a/arch/powerpc/kernel/io-workarounds.c b/arch/powerpc/kernel/io-workarounds.c
index a582e0d42525..bbe85f5aea71 100644
--- a/arch/powerpc/kernel/io-workarounds.c
+++ b/arch/powerpc/kernel/io-workarounds.c
@@ -19,6 +19,8 @@
 #include
 #include
 #include
+#include
+
 #define IOWA_MAX_BUS 8
 
@@ -75,8 +77,7 @@ struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
 	 * We won't find huge pages here (iomem). Also can't hit
 	 * a page table free due to init_mm
 	 */
-	ptep = __find_linux_pte_or_hugepte(init_mm.pgd, vaddr,
-					   NULL, &hugepage_shift);
+	ptep = find_init_mm_pte(vaddr, &hugepage_shift);
 	if (ptep
[PATCH 3/3] powerpc/mm: Don't send IPI to all cpus on THP updates
Now that we made sure that lockless walk of linux page table is mostly limited to current task (current->mm->pgdir) we can update the THP update sequence to only send IPI to cpus on which this task has run. This helps in reducing the IPI overload on systems with large number of CPUs.

W.r.t kvm even though kvm is walking page table with vcpu->arch.pgdir, it is done only on secondary cpus and in that case we have primary cpu added to task's mm cpumask. Sending an IPI to primary will force the secondary to do a vm exit and hence this mm cpumask usage is safe here.

W.r.t CAPI, we still end up walking linux page table with capi context MM. For now the pte lookup serialization sends an IPI to all cpus if CAPI is in use. We can further improve this by adding the CAPI interrupt handling cpu to task mm cpumask. That will be done in a later patch.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  1 +
 arch/powerpc/mm/pgtable-book3s64.c           | 32 +++-
 arch/powerpc/mm/pgtable-hash64.c             |  8 +++
 arch/powerpc/mm/pgtable-radix.c              |  8 +++
 4 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 85bc9875c3be..d8c3c18e220d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1145,6 +1145,7 @@ static inline bool arch_needs_pgtable_deposit(void)
 		return false;
 	return true;
 }
+extern void serialize_against_pte_lookup(struct mm_struct *mm);
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 5fcb3dd74c13..2679f57b90e2 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -9,6 +9,7 @@
 
 #include
 #include
+#include
 #include
 #include
 
@@ -64,6 +65,35 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 	trace_hugepage_set_pmd(addr, pmd_val(pmd));
 	return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
 }
+
+static void do_nothing(void *unused)
+{
+
+}
+/*
+ * Serialize against find_current_mm_pte which does lock-less
+ * lookup in page tables with local interrupts disabled. For huge pages
+ * it casts pmd_t to pte_t. Since format of pte_t is different from
+ * pmd_t we want to prevent transit from pmd pointing to page table
+ * to pmd pointing to huge page (and back) while interrupts are disabled.
+ * We clear pmd to possibly replace it with page table pointer in
+ * different code paths. So make sure we wait for the parallel
+ * find_current_mm_pte to finish.
+ */
+void serialize_against_pte_lookup(struct mm_struct *mm)
+{
+	smp_mb();
+	/*
+	 * Cxl fault handling requires us to do a lockless page table
+	 * walk while inserting hash page table entry with mm tracked
+	 * in cxl context. Hence we need to do a global flush.
+	 */
+	if (cxl_ctx_in_use())
+		smp_call_function(do_nothing, NULL, 1);
+	else
+		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+}
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
@@ -77,7 +107,7 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 	 * This ensures that generic code that rely on IRQ disabling
 	 * to prevent a parallel THP split work as expected.
 	 */
-	kick_all_cpus_sync();
+	serialize_against_pte_lookup(vma->vm_mm);
 }
 
 static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index 8b85a14b08ea..f6313cc29ae4 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -159,7 +159,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres
 	 * by sending an IPI to all the cpus and executing a dummy
 	 * function there.
 	 */
-	kick_all_cpus_sync();
+	serialize_against_pte_lookup(vma->vm_mm);
 	/*
 	 * Now invalidate the hpte entries in the range
 	 * covered by pmd. This make sure we take a
@@ -299,16 +299,16 @@ pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
 	 */
 	memset(pgtable, 0, PTE_FRAG_SIZE);
 	/*
-	 * Serialize against find_linux_pte_or_hugepte which does lock-less
+	 * Serialize against find_current_mm_pte variants which does lock-less
 	 * lookup in page tables with local interrupts disabled. For huge pages
 	 * it casts pmd_t to pte_t. Since format of pte_t is different from
 	 * pmd_t we want to prevent transit from pmd pointing to page table
 	 * to pmd pointing to huge page (and back) while interrupts are disabled.
 	 * We
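The targeting decision in serialize_against_pte_lookup() can be modelled with a bitmask standing in for mm_cpumask(): IPI only the CPUs the task has run on, unless a CXL context may be doing lookups, in which case fall back to all CPUs. `serialize_model` and `NR_CPUS` here are a hypothetical sketch, not the kernel function:

```c
#include <stdbool.h>

#define NR_CPUS 8

/* Returns the number of dummy IPIs "sent". mm_cpumask is one bit per
 * CPU; cxl_in_use models cxl_ctx_in_use() forcing the global fallback. */
static int serialize_model(unsigned long mm_cpumask, bool cxl_in_use)
{
	int ipis = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		/* stand-in for smp_call_function{,_many}(do_nothing, ...) */
		if (cxl_in_use || (mm_cpumask & (1UL << cpu)))
			ipis++;
	}
	return ipis;
}
```

The point of the patch is the saving in the common case: a task that has only run on two of a machine's CPUs costs two IPIs instead of one per online CPU, since the dummy IPI only needs to reach CPUs that could be in the middle of a lockless walk of this mm's page tables.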
[PATCH 1/3] powerpc: Add __hard_irqs_disabled()
Add __hard_irqs_disabled() similar to arch_irqs_disabled to check whether irqs are hard disabled.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/include/asm/hw_irq.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index eba60416536e..541bd42f902f 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -88,6 +88,12 @@ static inline bool arch_irqs_disabled(void)
 	return arch_irqs_disabled_flags(arch_local_save_flags());
 }
 
+static inline bool __hard_irqs_disabled(void)
+{
+	unsigned long flags = mfmsr();
+	return (flags & MSR_EE) == 0;
+}
+
 #ifdef CONFIG_PPC_BOOK3E
 #define __hard_irq_enable()	asm volatile("wrteei 1" : : : "memory")
 #define __hard_irq_disable()	asm volatile("wrteei 0" : : : "memory")
@@ -197,6 +203,7 @@ static inline bool arch_irqs_disabled(void)
 }
 
 #define hard_irq_disable()		arch_local_irq_disable()
+#define __hard_irqs_disabled()		arch_irqs_disabled()
 
 static inline bool arch_irq_disabled_regs(struct pt_regs *regs)
 {
-- 
2.7.4
[PATCH] powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line
We use the kernel command line to do reservation of hugetlb pages. The code duplication here is mostly to make it simpler. With 64 bit book3s, we need to support either 16G or 1G gigantic hugepage. Whereas the FSL_BOOK3E implementation needs to support multiple gigantic hugepage sizes. We avoid the gpage_npages array and use a gpage_npages count for ppc64. We also cannot use the generic code to do the gigantic page allocation because that will require a conditional to handle the pseries allocation, where the memory is already reserved by the hypervisor. In order to keep things simple, book3s 64 implements its own version that also works with pseries.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/include/asm/hugetlb.h |  8 +---
 arch/powerpc/mm/hugetlbpage.c      | 78 ++
 2 files changed, 79 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index 7f4025a6c69e..03401a17d1da 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -218,13 +218,7 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
 }
 #endif /* CONFIG_HUGETLB_PAGE */
 
-/*
- * FSL Book3E platforms require special gpage handling - the gpages
- * are reserved early in the boot process by memblock instead of via
- * the .dts as on IBM platforms.
- */
-#if defined(CONFIG_HUGETLB_PAGE) && (defined(CONFIG_PPC_FSL_BOOK3E) || \
-    defined(CONFIG_PPC_8xx))
+#ifdef CONFIG_HUGETLB_PAGE
 extern void __init reserve_hugetlb_gpages(void);
 #else
 static inline void reserve_hugetlb_gpages(void)
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 1816b965a142..4ebaa18f2495 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -373,6 +374,83 @@ int alloc_bootmem_huge_page(struct hstate *hstate)
 	m->hstate = hstate;
 	return 1;
 }
+
+static unsigned long gpage_npages;
+static int __init do_gpage_early_setup(char *param, char *val,
+				       const char *unused, void *arg)
+{
+	unsigned long npages;
+	static unsigned long size = 0;
+	unsigned long gpage_size = 1UL << 34;
+
+	if (radix_enabled())
+		gpage_size = 1UL << 30;
+
+	/*
+	 * The hugepagesz and hugepages cmdline options are interleaved. We
+	 * use the size variable to keep track of whether or not this was done
+	 * properly and skip over instances where it is incorrect. Other
+	 * command-line parsing code will issue warnings, so we don't need to.
+	 */
+	if ((strcmp(param, "default_hugepagesz") == 0) ||
+	    (strcmp(param, "hugepagesz") == 0)) {
+		size = memparse(val, NULL);
+		/*
+		 * We only want to handle the 16GB gigantic huge page here.
+		 */
+		if (size != gpage_size)
+			size = 0;
+	} else if (strcmp(param, "hugepages") == 0) {
+		if (size != 0) {
+			if (sscanf(val, "%lu", &npages) <= 0)
+				npages = 0;
+			if (npages > MAX_NUMBER_GPAGES) {
+				pr_warn("MMU: %lu 16GB pages requested, "
+					"limiting to %d pages\n", npages,
+					MAX_NUMBER_GPAGES);
+				npages = MAX_NUMBER_GPAGES;
+			}
+			gpage_npages = npages;
+			size = 0;
+		}
+	}
+	return 0;
+}
+
+/*
+ * This will just do the necessary memblock reservations. Everything else is
+ * done by core, based on kernel command line parsing.
+ */
+void __init reserve_hugetlb_gpages(void)
+{
+	char buf[10];
+	phys_addr_t base;
+	unsigned long gpage_size = 1UL << 34;
+	static __initdata char cmdline[COMMAND_LINE_SIZE];
+
+	if (radix_enabled())
+		gpage_size = 1UL << 30;
+
+	strlcpy(cmdline, boot_command_line, COMMAND_LINE_SIZE);
+	parse_args("hugetlb gpages", cmdline, NULL, 0, 0, 0,
+		   NULL, &do_gpage_early_setup);
+
+	if (!gpage_npages)
+		return;
+
+	string_get_size(gpage_size, 1, STRING_UNITS_2, buf, sizeof(buf));
+	pr_info("Trying to reserve %ld %s pages\n", gpage_npages, buf);
+
+	/* Allocate one page at a time */
+	while (gpage_npages) {
+		base = memblock_alloc_base(gpage_size, gpage_size,
+					   MEMBLOCK_ALLOC_ANYWHERE);
+		add_gpage(base, gpage_size, 1);
+		gpage_npages--;
+	}
+}
+
 #endif
 
 #if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
-- 
2.7.4
[PATCH v2 8/9] powerpc/mm/hugetlb: Remove follow_huge_addr for powerpc
With generic code now handling hugetlb entries at pgd level and also supporting hugepage directory format, we can now remove the powerpc specific follow_huge_addr implementation. Signed-off-by: Aneesh Kumar K.V--- arch/powerpc/mm/hugetlbpage.c | 64 --- 1 file changed, 64 deletions(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 5c829a83a4cc..1816b965a142 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -619,11 +619,6 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, } while (addr = next, addr != end); } -/* - * 64 bit book3s use generic follow_page_mask - */ -#ifdef CONFIG_PPC_BOOK3S_64 - struct page *follow_huge_pd(struct vm_area_struct *vma, unsigned long address, hugepd_t hpd, int flags, int pdshift) @@ -657,65 +652,6 @@ struct page *follow_huge_pd(struct vm_area_struct *vma, return page; } -#else /* !CONFIG_PPC_BOOK3S_64 */ - -/* - * We are holding mmap_sem, so a parallel huge page collapse cannot run. - * To prevent hugepage split, disable irq. - */ -struct page * -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write) -{ - bool is_thp; - pte_t *ptep, pte; - unsigned shift; - unsigned long mask, flags; - struct page *page = ERR_PTR(-EINVAL); - - local_irq_save(flags); - ptep = find_linux_pte_or_hugepte(mm->pgd, address, &is_thp, &shift); - if (!ptep) - goto no_page; - pte = READ_ONCE(*ptep); - /* -* Verify it is a huge page else bail. -* Transparent hugepages are handled by generic code. We can skip them -* here.
-*/ - if (!shift || is_thp) - goto no_page; - - if (!pte_present(pte)) { - page = NULL; - goto no_page; - } - mask = (1UL << shift) - 1; - page = pte_page(pte); - if (page) - page += (address & mask) / PAGE_SIZE; - -no_page: - local_irq_restore(flags); - return page; -} - -struct page * -follow_huge_pmd(struct mm_struct *mm, unsigned long address, - pmd_t *pmd, int write) -{ - BUG(); - return NULL; -} - -struct page * -follow_huge_pud(struct mm_struct *mm, unsigned long address, - pud_t *pud, int write) -{ - BUG(); - return NULL; -} -#endif - static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, unsigned long sz) { -- 2.7.4
[PATCH v2 9/9] powerpc/hugetlb: Enable hugetlb migration for ppc64
Signed-off-by: Aneesh Kumar K.V--- arch/powerpc/platforms/Kconfig.cputype | 5 + 1 file changed, 5 insertions(+) diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype index 8017542d..8acc4f27d101 100644 --- a/arch/powerpc/platforms/Kconfig.cputype +++ b/arch/powerpc/platforms/Kconfig.cputype @@ -351,6 +351,11 @@ config PPC_RADIX_MMU is only implemented by IBM Power9 CPUs, if you don't have one of them you can probably disable this. +config ARCH_ENABLE_HUGEPAGE_MIGRATION + def_bool y + depends on PPC_BOOK3S_64 && HUGETLB_PAGE && MIGRATION + + config PPC_MMU_NOHASH def_bool y depends on !PPC_STD_MMU -- 2.7.4
[PATCH v2 2/9] mm/follow_page_mask: Split follow_page_mask to smaller functions.
Makes code reading easy. No functional changes in this patch. In a followup patch, we will be updating the follow_page_mask to handle hugetlb hugepd format so that archs like ppc64 can switch to the generic version. This split helps in doing that nicely. Reviewed-by: Naoya HoriguchiSigned-off-by: Aneesh Kumar K.V --- mm/gup.c | 148 +++ 1 file changed, 91 insertions(+), 57 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 04aa405350dc..73d46f9f7b81 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -208,68 +208,16 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, return no_page_table(vma, flags); } -/** - * follow_page_mask - look up a page descriptor from a user-virtual address - * @vma: vm_area_struct mapping @address - * @address: virtual address to look up - * @flags: flags modifying lookup behaviour - * @page_mask: on output, *page_mask is set according to the size of the page - * - * @flags can have FOLL_ flags set, defined in - * - * Returns the mapped (struct page *), %NULL if no mapping exists, or - * an error pointer if there is a mapping to something not represented - * by a page descriptor (see also vm_normal_page()). 
- */ -struct page *follow_page_mask(struct vm_area_struct *vma, - unsigned long address, unsigned int flags, - unsigned int *page_mask) +static struct page *follow_pmd_mask(struct vm_area_struct *vma, + unsigned long address, pud_t *pudp, + unsigned int flags, unsigned int *page_mask) { - pgd_t *pgd; - p4d_t *p4d; - pud_t *pud; pmd_t *pmd; spinlock_t *ptl; struct page *page; struct mm_struct *mm = vma->vm_mm; - *page_mask = 0; - - page = follow_huge_addr(mm, address, flags & FOLL_WRITE); - if (!IS_ERR(page)) { - BUG_ON(flags & FOLL_GET); - return page; - } - - pgd = pgd_offset(mm, address); - if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) - return no_page_table(vma, flags); - p4d = p4d_offset(pgd, address); - if (p4d_none(*p4d)) - return no_page_table(vma, flags); - BUILD_BUG_ON(p4d_huge(*p4d)); - if (unlikely(p4d_bad(*p4d))) - return no_page_table(vma, flags); - pud = pud_offset(p4d, address); - if (pud_none(*pud)) - return no_page_table(vma, flags); - if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) { - page = follow_huge_pud(mm, address, pud, flags); - if (page) - return page; - return no_page_table(vma, flags); - } - if (pud_devmap(*pud)) { - ptl = pud_lock(mm, pud); - page = follow_devmap_pud(vma, address, pud, flags); - spin_unlock(ptl); - if (page) - return page; - } - if (unlikely(pud_bad(*pud))) - return no_page_table(vma, flags); - - pmd = pmd_offset(pud, address); + pmd = pmd_offset(pudp, address); if (pmd_none(*pmd)) return no_page_table(vma, flags); if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) { @@ -319,13 +267,99 @@ struct page *follow_page_mask(struct vm_area_struct *vma, return ret ? 
ERR_PTR(ret) : follow_page_pte(vma, address, pmd, flags); } - page = follow_trans_huge_pmd(vma, address, pmd, flags); spin_unlock(ptl); *page_mask = HPAGE_PMD_NR - 1; return page; } + +static struct page *follow_pud_mask(struct vm_area_struct *vma, + unsigned long address, p4d_t *p4dp, + unsigned int flags, unsigned int *page_mask) +{ + pud_t *pud; + spinlock_t *ptl; + struct page *page; + struct mm_struct *mm = vma->vm_mm; + + pud = pud_offset(p4dp, address); + if (pud_none(*pud)) + return no_page_table(vma, flags); + if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) { + page = follow_huge_pud(mm, address, pud, flags); + if (page) + return page; + return no_page_table(vma, flags); + } + if (pud_devmap(*pud)) { + ptl = pud_lock(mm, pud); + page = follow_devmap_pud(vma, address, pud, flags); + spin_unlock(ptl); + if (page) + return page; + } + if (unlikely(pud_bad(*pud))) + return no_page_table(vma, flags); + + return follow_pmd_mask(vma, address, pud, flags, page_mask); +} + + +static struct page *follow_p4d_mask(struct vm_area_struct *vma, + unsigned long
[PATCH v2 6/9] mm/follow_page_mask: Add support for hugepage directory entry
Architectures like ppc64 support hugepage sizes that are not mapped to any of the page table levels. Instead they add an alternate page table entry format called hugepage directory (hugepd). hugepd indicates that the page table entry maps to a set of hugetlb pages. Add support for this in generic follow_page_mask code. We already support this format in the generic gup code. The default implementation prints a warning and returns NULL. We will add ppc64 support in later patches. Signed-off-by: Aneesh Kumar K.V--- include/linux/hugetlb.h | 4 mm/gup.c| 33 + mm/hugetlb.c| 8 3 files changed, 45 insertions(+) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index f66c1d4e0d1f..caee7c4664c8 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -141,6 +141,9 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr); int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep); struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address, int write); +struct page *follow_huge_pd(struct vm_area_struct *vma, + unsigned long address, hugepd_t hpd, + int flags, int pdshift); struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int flags); struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address, @@ -175,6 +178,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m) static inline void hugetlb_show_meminfo(void) { } +#define follow_huge_pd(vma, addr, hpd, flags, pdshift) NULL #define follow_huge_pmd(mm, addr, pmd, flags) NULL #define follow_huge_pud(mm, addr, pud, flags) NULL #define follow_huge_pgd(mm, addr, pgd, flags) NULL diff --git a/mm/gup.c b/mm/gup.c index 65255389620a..a7f5b82e15f3 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -226,6 +226,14 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, return page; return no_page_table(vma, flags); } + if (is_hugepd(__hugepd(pmd_val(*pmd)))) { + page = follow_huge_pd(vma, address, +
__hugepd(pmd_val(*pmd)), flags, + PMD_SHIFT); + if (page) + return page; + return no_page_table(vma, flags); + } if (pmd_devmap(*pmd)) { ptl = pmd_lock(mm, pmd); page = follow_devmap_pmd(vma, address, pmd, flags); @@ -292,6 +300,14 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma, return page; return no_page_table(vma, flags); } + if (is_hugepd(__hugepd(pud_val(*pud)))) { + page = follow_huge_pd(vma, address, + __hugepd(pud_val(*pud)), flags, + PUD_SHIFT); + if (page) + return page; + return no_page_table(vma, flags); + } if (pud_devmap(*pud)) { ptl = pud_lock(mm, pud); page = follow_devmap_pud(vma, address, pud, flags); @@ -311,6 +327,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma, unsigned int flags, unsigned int *page_mask) { p4d_t *p4d; + struct page *page; p4d = p4d_offset(pgdp, address); if (p4d_none(*p4d)) @@ -319,6 +336,14 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma, if (unlikely(p4d_bad(*p4d))) return no_page_table(vma, flags); + if (is_hugepd(__hugepd(p4d_val(*p4d)))) { + page = follow_huge_pd(vma, address, + __hugepd(p4d_val(*p4d)), flags, + P4D_SHIFT); + if (page) + return page; + return no_page_table(vma, flags); + } return follow_pud_mask(vma, address, p4d, flags, page_mask); } @@ -363,6 +388,14 @@ struct page *follow_page_mask(struct vm_area_struct *vma, return page; return no_page_table(vma, flags); } + if (is_hugepd(__hugepd(pgd_val(*pgd)))) { + page = follow_huge_pd(vma, address, + __hugepd(pgd_val(*pgd)), flags, + PGDIR_SHIFT); + if (page) + return page; + return no_page_table(vma, flags); + } return follow_p4d_mask(vma, address, pgd, flags, page_mask); } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a12d3cab04fe..58307d62ac37 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4643,6 +4643,14 @@ follow_huge_addr(struct
[PATCH v2 7/9] powerpc/hugetlb: Add follow_huge_pd implementation for ppc64.
Signed-off-by: Aneesh Kumar K.V--- arch/powerpc/mm/hugetlbpage.c | 43 +++ 1 file changed, 43 insertions(+) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 80f6d2ed551a..5c829a83a4cc 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -17,6 +17,8 @@ #include #include #include +#include +#include #include #include #include @@ -618,6 +620,46 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, } /* + * 64 bit book3s use generic follow_page_mask + */ +#ifdef CONFIG_PPC_BOOK3S_64 + +struct page *follow_huge_pd(struct vm_area_struct *vma, + unsigned long address, hugepd_t hpd, + int flags, int pdshift) +{ + pte_t *ptep; + spinlock_t *ptl; + struct page *page = NULL; + unsigned long mask; + int shift = hugepd_shift(hpd); + struct mm_struct *mm = vma->vm_mm; + +retry: + ptl = &mm->page_table_lock; + spin_lock(ptl); + + ptep = hugepte_offset(hpd, address, pdshift); + if (pte_present(*ptep)) { + mask = (1UL << shift) - 1; + page = pte_page(*ptep); + page += ((address & mask) >> PAGE_SHIFT); + if (flags & FOLL_GET) + get_page(page); + } else { + if (is_hugetlb_entry_migration(*ptep)) { + spin_unlock(ptl); + __migration_entry_wait(mm, ptep, ptl); + goto retry; + } + } + spin_unlock(ptl); + return page; +} + +#else /* !CONFIG_PPC_BOOK3S_64 */ + +/* * We are holding mmap_sem, so a parallel huge page collapse cannot run. * To prevent hugepage split, disable irq. */ @@ -672,6 +714,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address, BUG(); return NULL; } +#endif static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, unsigned long sz) -- 2.7.4
[PATCH v2 5/9] mm/hugetlb: Move default definition of hugepd_t earlier in the header
This enables use of the hugepd_t type early. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V--- include/linux/hugetlb.h | 47 --- 1 file changed, 24 insertions(+), 23 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index edab98f0a7b8..f66c1d4e0d1f 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -14,6 +14,30 @@ struct ctl_table; struct user_struct; struct mmu_gather; +#ifndef is_hugepd +/* + * Some architectures requires a hugepage directory format that is + * required to support multiple hugepage sizes. For example + * a4fe3ce76 "powerpc/mm: Allow more flexible layouts for hugepage pagetables" + * introduced the same on powerpc. This allows for a more flexible hugepage + * pagetable layout. + */ +typedef struct { unsigned long pd; } hugepd_t; +#define is_hugepd(hugepd) (0) +#define __hugepd(x) ((hugepd_t) { (x) }) +static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr, + unsigned pdshift, unsigned long end, + int write, struct page **pages, int *nr) +{ + return 0; +} +#else +extern int gup_huge_pd(hugepd_t hugepd, unsigned long addr, + unsigned pdshift, unsigned long end, + int write, struct page **pages, int *nr); +#endif + + #ifdef CONFIG_HUGETLB_PAGE #include @@ -222,29 +246,6 @@ static inline int pud_write(pud_t pud) } #endif -#ifndef is_hugepd -/* - * Some architectures requires a hugepage directory format that is - * required to support multiple hugepage sizes. For example - * a4fe3ce76 "powerpc/mm: Allow more flexible layouts for hugepage pagetables" - * introduced the same on powerpc. This allows for a more flexible hugepage - * pagetable layout.
- */ -typedef struct { unsigned long pd; } hugepd_t; -#define is_hugepd(hugepd) (0) -#define __hugepd(x) ((hugepd_t) { (x) }) -static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr, - unsigned pdshift, unsigned long end, - int write, struct page **pages, int *nr) -{ - return 0; -} -#else -extern int gup_huge_pd(hugepd_t hugepd, unsigned long addr, - unsigned pdshift, unsigned long end, - int write, struct page **pages, int *nr); -#endif - #define HUGETLB_ANON_FILE "anon_hugepage" enum { -- 2.7.4
[PATCH v2 4/9] mm/follow_page_mask: Add support for hugetlb pgd entries.
From: Anshuman Khandual
ppc64 supports pgd hugetlb entries. Add code to handle hugetlb pgd entries to follow_page_mask so that ppc64 can switch to it to handle hugetlb entries. Signed-off-by: Anshuman Khandual Signed-off-by: Aneesh Kumar K.V --- include/linux/hugetlb.h | 4 mm/gup.c| 7 +++ mm/hugetlb.c| 9 + 3 files changed, 20 insertions(+) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index fddf6cf403d5..edab98f0a7b8 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -121,6 +121,9 @@ struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int flags); struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address, pud_t *pud, int flags); +struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address, +pgd_t *pgd, int flags); + int pmd_huge(pmd_t pmd); int pud_huge(pud_t pud); unsigned long hugetlb_change_protection(struct vm_area_struct *vma, @@ -150,6 +153,7 @@ static inline void hugetlb_show_meminfo(void) } #define follow_huge_pmd(mm, addr, pmd, flags) NULL #define follow_huge_pud(mm, addr, pud, flags) NULL +#define follow_huge_pgd(mm, addr, pgd, flags) NULL #define prepare_hugepage_range(file, addr, len)(-EINVAL) #define pmd_huge(x)0 #define pud_huge(x)0 diff --git a/mm/gup.c b/mm/gup.c index 73d46f9f7b81..65255389620a 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -357,6 +357,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma, if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) return no_page_table(vma, flags); + if (pgd_huge(*pgd)) { + page = follow_huge_pgd(mm, address, pgd, flags); + if (page) + return page; + return no_page_table(vma, flags); + } + return follow_p4d_mask(vma, address, pgd, flags, page_mask); } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 25e2ee888a90..a12d3cab04fe 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4687,6 +4687,15 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address, return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >>
PAGE_SHIFT); } +struct page * __weak +follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags) +{ + if (flags & FOLL_GET) + return NULL; + + return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT); +} + #ifdef CONFIG_MEMORY_FAILURE /* -- 2.7.4
[PATCH v2 3/9] mm/hugetlb: export hugetlb_entry_migration helper
We will be using this later from the ppc64 code. Change the return type to bool. Reviewed-by: Naoya HoriguchiSigned-off-by: Aneesh Kumar K.V --- include/linux/hugetlb.h | 1 + mm/hugetlb.c| 8 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index b857fc8cc2ec..fddf6cf403d5 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -126,6 +126,7 @@ int pud_huge(pud_t pud); unsigned long hugetlb_change_protection(struct vm_area_struct *vma, unsigned long address, unsigned long end, pgprot_t newprot); +bool is_hugetlb_entry_migration(pte_t pte); #else /* !CONFIG_HUGETLB_PAGE */ static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ce090186b992..25e2ee888a90 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3182,17 +3182,17 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma, update_mmu_cache(vma, address, ptep); } -static int is_hugetlb_entry_migration(pte_t pte) +bool is_hugetlb_entry_migration(pte_t pte) { swp_entry_t swp; if (huge_pte_none(pte) || pte_present(pte)) - return 0; + return false; swp = pte_to_swp_entry(pte); if (non_swap_entry(swp) && is_migration_entry(swp)) - return 1; + return true; else - return 0; + return false; } static int is_hugetlb_entry_hwpoisoned(pte_t pte) -- 2.7.4
[PATCH v2 1/9] mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at
The right interface to use to set a hugetlb pte entry is set_huge_pte_at. Use that instead of set_pte_at. Reviewed-by: Naoya HoriguchiSigned-off-by: Aneesh Kumar K.V --- mm/migrate.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 9a0897a14d37..4c272ac6fe53 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -224,25 +224,26 @@ static int remove_migration_pte(struct page *page, struct vm_area_struct *vma, if (is_write_migration_entry(entry)) pte = maybe_mkwrite(pte, vma); + flush_dcache_page(new); #ifdef CONFIG_HUGETLB_PAGE if (PageHuge(new)) { pte = pte_mkhuge(pte); pte = arch_make_huge_pte(pte, vma, new, 0); - } -#endif - flush_dcache_page(new); - set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); - - if (PageHuge(new)) { + set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); if (PageAnon(new)) hugepage_add_anon_rmap(new, vma, pvmw.address); else page_dup_rmap(new, true); - } else if (PageAnon(new)) - page_add_anon_rmap(new, vma, pvmw.address, false); - else - page_add_file_rmap(new, false); + } else +#endif + { + set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); + if (PageAnon(new)) + page_add_anon_rmap(new, vma, pvmw.address, false); + else + page_add_file_rmap(new, false); + } if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new)) mlock_vma_page(new); -- 2.7.4
[PATCH v2 0/9] HugeTLB migration support for PPC64
HugeTLB migration support for PPC64 Changes from V1: * Added Reviewed-by: * Drop follow_huge_addr from powerpc Aneesh Kumar K.V (8): mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at mm/follow_page_mask: Split follow_page_mask to smaller functions. mm/hugetlb: export hugetlb_entry_migration helper mm/hugetlb: Move default definition of hugepd_t earlier in the header mm/follow_page_mask: Add support for hugepage directory entry powerpc/hugetlb: Add follow_huge_pd implementation for ppc64. powerpc/mm/hugetlb: Remove follow_huge_addr for powerpc powerpc/hugetlb: Enable hugetlb migration for ppc64 Anshuman Khandual (1): mm/follow_page_mask: Add support for hugetlb pgd entries. arch/powerpc/mm/hugetlbpage.c | 81 ++ arch/powerpc/platforms/Kconfig.cputype | 5 + include/linux/hugetlb.h| 56 ++ mm/gup.c | 186 +++-- mm/hugetlb.c | 25 - mm/migrate.c | 21 ++-- 6 files changed, 230 insertions(+), 144 deletions(-) -- 2.7.4
[PATCH v2 2/2] powerpc/mm/hugetlb: Add support for 1G huge pages
POWER9 supports hugepages of size 2M and 1G in radix MMU mode. This patch enables the usage of 1G page size for hugetlbfs. This also updates the helpers so that we can do 1G page allocation at runtime. We still don't enable 1G page size on the DD1 version. This is to avoid doing the workaround mentioned in commit: 6d3a0379ebdc8 (powerpc/mm: Add radix__tlb_flush_pte_p9_dd1()) Signed-off-by: Aneesh Kumar K.V--- arch/powerpc/include/asm/book3s/64/hugetlb.h | 10 ++ arch/powerpc/mm/hugetlbpage.c| 7 +-- arch/powerpc/platforms/Kconfig.cputype | 1 + 3 files changed, 16 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h index cd366596..5c28bd6f2ae1 100644 --- a/arch/powerpc/include/asm/book3s/64/hugetlb.h +++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h @@ -50,4 +50,14 @@ static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, else return entry; } + +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE +static inline bool gigantic_page_supported(void) +{ + if (radix_enabled()) + return true; + return false; +} +#endif + #endif diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index a4f33de4008e..80f6d2ed551a 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -763,8 +763,11 @@ static int __init add_huge_page_size(unsigned long long size) * Hash: 16M and 16G */ if (radix_enabled()) { - if (mmu_psize != MMU_PAGE_2M) - return -EINVAL; + if (mmu_psize != MMU_PAGE_2M) { + if (cpu_has_feature(CPU_FTR_POWER9_DD1) || + (mmu_psize != MMU_PAGE_1G)) + return -EINVAL; + } } else { if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G) return -EINVAL; diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype index 684e886eaae4..8017542d 100644 --- a/arch/powerpc/platforms/Kconfig.cputype +++ b/arch/powerpc/platforms/Kconfig.cputype @@ -344,6 +344,7 @@ config PPC_STD_MMU_64 config PPC_RADIX_MMU bool "Radix MMU
Support" depends on PPC_BOOK3S_64 + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA default y help Enable support for the Power ISA 3.0 Radix style MMU. Currently this -- 2.7.4
[PATCH v2 1/2] mm/hugetlb: Cleanup ARCH_HAS_GIGANTIC_PAGE
This moves the #ifdef in C code to a Kconfig dependency. Also we move the gigantic_page_supported() function to be arch specific. This allows an arch to conditionally enable runtime allocation of gigantic huge pages. Architectures like ppc64 support different gigantic huge page sizes (16G and 1G) based on the translation mode selected. This provides an opportunity for ppc64 to enable runtime allocation only for the 1G hugepage. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V--- arch/arm64/Kconfig | 2 +- arch/arm64/include/asm/hugetlb.h | 4 arch/s390/Kconfig| 2 +- arch/s390/include/asm/hugetlb.h | 3 +++ arch/x86/Kconfig | 2 +- mm/hugetlb.c | 7 ++- 6 files changed, 12 insertions(+), 8 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 3741859765cf..1f8c1f73aada 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -11,7 +11,7 @@ config ARM64 select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_GCOV_PROFILE_ALL - select ARCH_HAS_GIGANTIC_PAGE + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA select ARCH_HAS_KCOV select ARCH_HAS_SET_MEMORY select ARCH_HAS_SG_CHAIN diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h index bbc1e35aa601..793bd73b0d07 100644 --- a/arch/arm64/include/asm/hugetlb.h +++ b/arch/arm64/include/asm/hugetlb.h @@ -83,4 +83,8 @@ extern void huge_ptep_set_wrprotect(struct mm_struct *mm, extern void huge_ptep_clear_flush(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep); +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE +static inline bool gigantic_page_supported(void) { return true; } +#endif + #endif /* __ASM_HUGETLB_H */ diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index a2dcef0aacc7..a41bbf420dda 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -67,7 +67,7 @@ config S390 select ARCH_HAS_DEVMEM_IS_ALLOWED select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_GCOV_PROFILE_ALL - select ARCH_HAS_GIGANTIC_PAGE + select
ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA select ARCH_HAS_KCOV select ARCH_HAS_SET_MEMORY select ARCH_HAS_SG_CHAIN diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h index cd546a245c68..89057b2cc8fe 100644 --- a/arch/s390/include/asm/hugetlb.h +++ b/arch/s390/include/asm/hugetlb.h @@ -112,4 +112,7 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot) return pte_modify(pte, newprot); } +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE +static inline bool gigantic_page_supported(void) { return true; } +#endif #endif /* _ASM_S390_HUGETLB_H */ diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index cc98d5a294ee..30a6328136ac 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -22,7 +22,7 @@ config X86_64 def_bool y depends on 64BIT # Options that are inherently 64-bit kernel only: - select ARCH_HAS_GIGANTIC_PAGE + select ARCH_HAS_GIGANTIC_PAGE if MEMORY_ISOLATION && COMPACTION && CMA select ARCH_SUPPORTS_INT128 select ARCH_USE_CMPXCHG_LOCKREF select HAVE_ARCH_SOFT_DIRTY diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3d0aab9ee80d..ce090186b992 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1024,9 +1024,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) ((node = hstate_next_node_to_free(hs, mask)) || 1); \ nr_nodes--) -#if defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) && \ - ((defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || \ - defined(CONFIG_CMA)) +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE static void destroy_compound_gigantic_page(struct page *page, unsigned int order) { @@ -1158,8 +1156,7 @@ static int alloc_fresh_gigantic_page(struct hstate *h, return 0; } -static inline bool gigantic_page_supported(void) { return true; } -#else +#else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */ static inline bool gigantic_page_supported(void) { return false; } static inline void free_gigantic_page(struct page *page, unsigned int order) { } static inline void 
destroy_compound_gigantic_page(struct page *page, -- 2.7.4
[PATCH 0/6] Enable support for deep-stop states on POWER9
From: "Gautham R. Shenoy"Hi, This patch series contains some of the fixes required for enabling support for deep stop states such as STOP4 and STOP11 via CPU-Hotplug. These fixes mainly ensure that some of the hypervisor resources which are lost during the deep stop state are correctly restored on a wakeup. There are 6 patches in the series. Patch 1 correctly initializes the core_idle_state_ptr based on the threads_per_core. core_idle_state_ptr is used to determine if a thread is the last thread entering a deep stop state or a first thread waking up from deep stop state in order to save/restore per-core resources. Patch 2 decouples restoring timebase from restoring hypervisor resources, as there are stop states which lose hypervisor state but not the timebase. Patch 3 saves the LPCR value before executing deep stop and restores it back to the saved value on the wakeup from stop. Patch 4 programs the restoration of some of one-time initialized SPRs via the stop-api. Patch 5 provides a workaround for a hardware issue on POWER9 DD1 chips where the PLS value cannot be relied upon on a wakeup from deep stop. Patch 6 fixes the cpuidle-powernv initialization code to allow deep states that don't lose timebase. These patches are based on the Linux upstream and have been tested with the corresponding skiboot patches in https://lists.ozlabs.org/pipermail/skiboot/2017-May/007183.html to get STOP4 working via CPU-Hotplug. Akshay Adiga (1): powernv:idle: Restore SPRs for deep idle states via stop API. Gautham R. 
Shenoy (5): powernv:idle: Correctly initialize core_idle_state_ptr powernv:idle: Decouple Timebase restore & Per-core SPRs restore powernv:idle: Restore LPCR on wakeup from deep-stop powernv:idle: Use Requested Level for restoring state on P9 DD1 cpuidle-powernv: Allow Deep stop states that don't stop time arch/powerpc/include/asm/paca.h | 2 + arch/powerpc/kernel/asm-offsets.c | 1 + arch/powerpc/kernel/idle_book3s.S | 33 +++--- arch/powerpc/platforms/powernv/idle.c | 112 +- drivers/cpuidle/cpuidle-powernv.c | 16 +++-- 5 files changed, 110 insertions(+), 54 deletions(-) -- 1.8.3.1
[PATCH 2/6] powernv:idle: Decouple Timebase restore & Per-core SPRs restore
From: "Gautham R. Shenoy"
On POWER8, in case of - nap: both timebase and hypervisor state are retained. - fast-sleep: timebase is lost. But the hypervisor state is retained. - winkle: timebase and hypervisor state are lost. Hence, the current code for handling exit from an idle state assumes that if the timebase value is retained, then so is the hypervisor state. Thus, the current code doesn't restore per-core hypervisor state in such cases. But that is no longer the case on POWER9 where we do have stop states in which timebase value is retained, but the hypervisor state is lost. So we have to ensure that the per-core hypervisor state gets restored in such cases. Fix this by ensuring that even in the case when timebase is retained, we explicitly check if we are waking up from a deep stop that loses per-core hypervisor state (indicated by cr4 being eq or gt), and if this is the case, we restore the per-core hypervisor state. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/kernel/idle_book3s.S | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S index 4898d67..afd029f 100644 --- a/arch/powerpc/kernel/idle_book3s.S +++ b/arch/powerpc/kernel/idle_book3s.S @@ -731,13 +731,14 @@ timebase_resync: * Use cr3 which indicates that we are waking up with atleast partial * hypervisor state loss to determine if TIMEBASE RESYNC is needed. */ - ble cr3,clear_lock + ble cr3,.Ltb_resynced /* Time base re-sync */ bl opal_resync_timebase; /* -* If waking up from sleep, per core state is not lost, skip to -* clear_lock. +* If waking up from sleep (POWER8), per core state +* is not lost, skip to clear_lock. */ +.Ltb_resynced: blt cr4,clear_lock /* -- 1.8.3.1
[PATCH 3/6] powernv:idle: Restore LPCR on wakeup from deep-stop
From: "Gautham R. Shenoy"On wakeup from a deep stop state which is supposed to lose the hypervisor state, we don't restore the LPCR to the old value but set it to a "sane" value via cur_cpu_spec->cpu_restore(). The problem is that the "sane" value doesn't include UPRT and the HR bits which are required to run correctly in Radix mode. Fix this on POWER9 onwards by restoring the LPCR value whatever it was before executing the stop instruction. Signed-off-by: Gautham R. Shenoy --- arch/powerpc/kernel/idle_book3s.S | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S index afd029f..6c9920d 100644 --- a/arch/powerpc/kernel/idle_book3s.S +++ b/arch/powerpc/kernel/idle_book3s.S @@ -31,6 +31,7 @@ * registers for winkle support. */ #define _SDR1 GPR3 +#define _PTCR GPR3 #define _RPR GPR4 #define _SPURR GPR5 #define _PURR GPR6 @@ -39,7 +40,7 @@ #define _AMOR GPR9 #define _WORT GPR10 #define _WORC GPR11 -#define _PTCR GPR12 +#define _LPCR GPR12 #define PSSCR_EC_ESL_MASK_SHIFTED (PSSCR_EC | PSSCR_ESL) >> 16 @@ -55,12 +56,14 @@ save_sprs_to_stack: * here since any thread in the core might wake up first */ BEGIN_FTR_SECTION - mfspr r3,SPRN_PTCR - std r3,_PTCR(r1) /* * Note - SDR1 is dropped in Power ISA v3. Hence not restoring * SDR1 here */ + mfspr r3,SPRN_PTCR + std r3,_PTCR(r1) + mfspr r3,SPRN_LPCR + std r3,_LPCR(r1) FTR_SECTION_ELSE mfspr r3,SPRN_SDR1 std r3,_SDR1(r1) @@ -813,6 +816,10 @@ no_segments: mtctr r12 bctrl +BEGIN_FTR_SECTION + ld r4,_LPCR(r1) + mtspr SPRN_LPCR,r4 +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300) hypervisor_state_restored: mtspr SPRN_SRR1,r16 -- 1.8.3.1
[PATCH 4/6] powernv:idle: Restore SPRs for deep idle states via stop API.
From: Akshay Adiga

Some of the SPR values (HID0, MSR, SPRG0) don't change during the run
time of a booted kernel once they have been initialized. The contents
of these SPRs are lost when the CPUs enter deep stop states. So
instead of saving and restoring the SPRs from the kernel, use the
stop-api provided by the firmware, by which the firmware can restore
the contents of these SPRs to their initialized values after wakeup
from a deep stop state.

Apart from these, program the PSSCR value to that of the deepest stop
state via the stop-api. This will be used to indicate to the
underlying firmware what stop state to put the threads that have been
woken up by a special-wakeup into.

And while we are at programming SPRs via the stop-api, ensure that the
HID1, HID4 and HID5 registers, which are only available on POWER8, are
not requested to be restored by the firmware on POWER9.

Signed-off-by: Akshay Adiga
Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/platforms/powernv/idle.c | 83 ++++++++++++++++++-----------
 1 file changed, 52 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 84eb9bc..4deac0d 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -30,8 +30,33 @@
 /* Power ISA 3.0 allows for stop states 0x0 - 0xF */
 #define MAX_STOP_STATE	0xF

+#define P9_STOP_SPR_MSR 2000
+#define P9_STOP_SPR_PSSCR 855
+
 static u32 supported_cpuidle_states;

+/*
+ * The default stop state that will be used by ppc_md.power_save
+ * function on platforms that support stop instruction.
+ */
+static u64 pnv_default_stop_val;
+static u64 pnv_default_stop_mask;
+static bool default_stop_found;
+
+/*
+ * First deep stop state. Used to figure out when to save/restore
+ * hypervisor context.
+ */
+u64 pnv_first_deep_stop_state = MAX_STOP_STATE;
+
+/*
+ * psscr value and mask of the deepest stop idle state.
+ * Used when a cpu is offlined.
+ */
+static u64 pnv_deepest_stop_psscr_val;
+static u64 pnv_deepest_stop_psscr_mask;
+static bool deepest_stop_found;
+
 static int pnv_save_sprs_for_deep_states(void)
 {
	int cpu;
@@ -48,6 +73,8 @@ static int pnv_save_sprs_for_deep_states(void)
	uint64_t hid4_val = mfspr(SPRN_HID4);
	uint64_t hid5_val = mfspr(SPRN_HID5);
	uint64_t hmeer_val = mfspr(SPRN_HMEER);
+	uint64_t msr_val = MSR_IDLE;
+	uint64_t psscr_val = pnv_deepest_stop_psscr_val;

	for_each_possible_cpu(cpu) {
		uint64_t pir = get_hard_smp_processor_id(cpu);
@@ -61,6 +88,18 @@ static int pnv_save_sprs_for_deep_states(void)
		if (rc != 0)
			return rc;

+		if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+			rc = opal_slw_set_reg(pir, P9_STOP_SPR_MSR, msr_val);
+			if (rc)
+				return rc;
+
+			rc = opal_slw_set_reg(pir,
+					      P9_STOP_SPR_PSSCR, psscr_val);
+
+			if (rc)
+				return rc;
+		}
+
		/* HIDs are per core registers */
		if (cpu_thread_in_core(cpu) == 0) {

@@ -72,17 +111,21 @@ static int pnv_save_sprs_for_deep_states(void)
			if (rc != 0)
				return rc;

-			rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
-			if (rc != 0)
-				return rc;
+			/* Only p8 needs to set extra HID registers */
+			if (!cpu_has_feature(CPU_FTR_ARCH_300)) {

-			rc = opal_slw_set_reg(pir, SPRN_HID4, hid4_val);
-			if (rc != 0)
-				return rc;
+				rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
+				if (rc != 0)
+					return rc;

-			rc = opal_slw_set_reg(pir, SPRN_HID5, hid5_val);
-			if (rc != 0)
-				return rc;
+				rc = opal_slw_set_reg(pir, SPRN_HID4, hid4_val);
+				if (rc != 0)
+					return rc;
+
+				rc = opal_slw_set_reg(pir, SPRN_HID5, hid5_val);
+				if (rc != 0)
+					return rc;
+			}
		}
	}

@@ -241,14 +284,6 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600,
			store_fastsleep_workaround_applyonce);

 /*
- * The default stop state that will be used by ppc_md.power_save
- * function on platforms that support stop instruction.
- */
-static u64 pnv_default_stop_val;
-static u64
[PATCH 1/6] powernv:idle: Correctly initialize core_idle_state_ptr
From: "Gautham R. Shenoy"

The lower 8 bits of core_idle_state_ptr track the number of non-idle
threads in the core. This is supposed to be initialized to a bit-map
corresponding to threads_per_core. However, it is currently
initialized to PNV_CORE_IDLE_THREAD_BITS (0xFF). This is correct for
POWER8, which has 8 threads per core, but not for POWER9, which has 4
threads per core.

As a result, on POWER9, core_idle_state_ptr gets initialized to 0xFF,
so even when all the threads of the core are idle, some of the bits
tracking the idle threads remain non-zero. Consequently, the idle
entry/exit code fails to save/restore per-core hypervisor state, since
it assumes that there are threads in the core which are still active.

Fix this by correctly initializing the lower bits of
core_idle_state_ptr on the basis of threads_per_core.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/platforms/powernv/idle.c | 29 +++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 445f30a..84eb9bc 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -96,15 +96,24 @@ static void pnv_alloc_idle_core_states(void)
	u32 *core_idle_state;

	/*
-	 * core_idle_state - First 8 bits track the idle state of each thread
-	 * of the core. The 8th bit is the lock bit. Initially all thread bits
-	 * are set. They are cleared when the thread enters deep idle state
-	 * like sleep and winkle. Initially the lock bit is cleared.
-	 * The lock bit has 2 purposes
-	 * a. While the first thread is restoring core state, it prevents
-	 * other threads in the core from switching to process context.
-	 * b. While the last thread in the core is saving the core state, it
-	 * prevents a different thread from waking up.
+	 * core_idle_state - The lower 8 bits track the idle state of
+	 * each thread of the core.
+	 *
+	 * The most significant bit is the lock bit.
+	 *
+	 * Initially all the bits corresponding to threads_per_core
+	 * are set. They are cleared when the thread enters deep idle
+	 * state like sleep and winkle/stop.
+	 *
+	 * Initially the lock bit is cleared. The lock bit has 2
+	 * purposes:
+	 * a. While the first thread in the core waking up from
+	 *    idle is restoring core state, it prevents other
+	 *    threads in the core from switching to process
+	 *    context.
+	 * b. While the last thread in the core is saving the
+	 *    core state, it prevents a different thread from
+	 *    waking up.
	 */
	for (i = 0; i < nr_cores; i++) {
		int first_cpu = i * threads_per_core;
@@ -112,7 +121,7 @@ static void pnv_alloc_idle_core_states(void)
		size_t paca_ptr_array_size;

		core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
-		*core_idle_state = PNV_CORE_IDLE_THREAD_BITS;
+		*core_idle_state = (1 << threads_per_core) - 1;
		paca_ptr_array_size = (threads_per_core *
				       sizeof(struct paca_struct *));
--
1.8.3.1
[PATCH 5/6] powernv:idle: Use Requested Level for restoring state on P9 DD1
From: "Gautham R. Shenoy"

On POWER9 DD1, due to a hardware bug, the Power-Saving Level Status
field (PLS) of the PSSCR for a thread waking up from a deep state can
under-report if some other thread in the core is in a shallow stop
state. The scenario in which this can manifest is as follows:

 1) All the threads of the core are in deep stop.
 2) One of the threads is woken up. The PLS for this thread will
    correctly reflect that it is waking up from deep stop.
 3) The thread that has woken up now executes a shallow stop.
 4) When some other thread in the core is woken, its PLS will reflect
    the shallow stop state.

Thus, the subsequent thread for which the PLS is under-reporting the
wakeup state will not restore the hypervisor resources.

Hence, on DD1 systems, use the Requested Level (RL) field as a
workaround to restore the contents of the hypervisor resources on
wakeup from the stop state.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/include/asm/paca.h   |  2 ++
 arch/powerpc/kernel/asm-offsets.c |  1 +
 arch/powerpc/kernel/idle_book3s.S | 13 ++++++++++++-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1c09f8f..77f60a0 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -177,6 +177,8 @@ struct paca_struct {
	 * to the sibling threads' paca.
	 */
	struct paca_struct **thread_sibling_pacas;
+	/* The PSSCR value that the kernel requested before going to stop */
+	u64 requested_psscr;
 #endif

 #ifdef CONFIG_PPC_STD_MMU_64
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 709e234..e15c178 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -742,6 +742,7 @@ int main(void)
	OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask);
	OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
	OFFSET(PACA_SIBLING_PACA_PTRS, paca_struct, thread_sibling_pacas);
+	OFFSET(PACA_REQ_PSSCR, paca_struct, requested_psscr);
 #endif

	DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 6c9920d..98a6d07 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -379,6 +379,7 @@ _GLOBAL(power9_idle_stop)
	mfspr	r5,SPRN_PSSCR
	andc	r5,r5,r4
	or	r3,r3,r5
+	std	r3, PACA_REQ_PSSCR(r13)
	mtspr	SPRN_PSSCR,r3
	LOAD_REG_ADDR(r5,power_enter_stop)
	li	r4,1
@@ -498,12 +499,22 @@ pnv_restore_hyp_resource_arch300:
	LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
	ld	r4,ADDROFF(pnv_first_deep_stop_state)(r5)

-	mfspr	r5,SPRN_PSSCR
+BEGIN_FTR_SECTION_NESTED(71)
+	/*
+	 * Assume that we are waking up from the state
+	 * same as the Requested Level (RL) in the PSSCR
+	 * which are Bits 60-63
+	 */
+	ld	r5,PACA_REQ_PSSCR(r13)
+	rldicl	r5,r5,0,60
+FTR_SECTION_ELSE_NESTED(71)
	/*
	 * 0-3 bits correspond to Power-Saving Level Status
	 * which indicates the idle state we are waking up from
	 */
+	mfspr	r5, SPRN_PSSCR
	rldicl	r5,r5,4,60
+ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_POWER9_DD1, 71)
	cmpd	cr4,r5,r4
	bge	cr4,pnv_wakeup_tb_loss /* returns to caller */
--
1.8.3.1
[PATCH 6/6] cpuidle-powernv: Allow Deep stop states that don't stop time
From: "Gautham R. Shenoy"

The current code in the cpuidle-powernv initialization only allows
deep stop states (indicated by OPAL_PM_STOP_INST_DEEP) which lose the
timebase (indicated by OPAL_PM_TIMEBASE_STOP). This assumption goes
back to POWER8, where deep states used to lose the timebase. However,
on POWER9 we do have stop states that are deep (they lose hypervisor
state) but retain the timebase.

Fix the initialization code in the cpuidle-powernv driver to allow
such deep states.

Further, there is a bug in the cpuidle-powernv driver with
CONFIG_TICK_ONESHOT=n where we end up incrementing nr_idle_states even
if a platform idle state which loses the timebase was not added to the
cpuidle table. Fix this by ensuring that the nr_idle_states variable
gets incremented only when the platform idle state was added to the
cpuidle table.

Signed-off-by: Gautham R. Shenoy
---
 drivers/cpuidle/cpuidle-powernv.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 12409a5..45eaf06 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -354,6 +354,7 @@ static int powernv_add_idle_states(void)

	for (i = 0; i < dt_idle_states; i++) {
		unsigned int exit_latency, target_residency;
+		bool stops_timebase = false;
		/*
		 * If an idle state has exit latency beyond
		 * POWERNV_THRESHOLD_LATENCY_NS then don't use it
@@ -381,6 +382,9 @@ static int powernv_add_idle_states(void)
			}
		}

+		if (flags[i] & OPAL_PM_TIMEBASE_STOP)
+			stops_timebase = true;
+
		/*
		 * For nap and fastsleep, use default target_residency
		 * values if f/w does not expose it.
@@ -392,8 +396,7 @@ static int powernv_add_idle_states(void)
			add_powernv_state(nr_idle_states, "Nap",
					  CPUIDLE_FLAG_NONE, nap_loop,
					  target_residency, exit_latency, 0, 0);
-		} else if ((flags[i] & OPAL_PM_STOP_INST_FAST) &&
-				!(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
+		} else if (has_stop_states && !stops_timebase) {
			add_powernv_state(nr_idle_states, names[i],
					  CPUIDLE_FLAG_NONE, stop_loop,
					  target_residency, exit_latency,
@@ -405,8 +408,8 @@ static int powernv_add_idle_states(void)
		 * within this config dependency check.
		 */
 #ifdef CONFIG_TICK_ONESHOT
-		if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
-			flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
+		else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
+			 flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
			if (!rc)
				target_residency = 30;
			/* Add FASTSLEEP state */
@@ -414,14 +417,15 @@
					  CPUIDLE_FLAG_TIMER_STOP,
					  fastsleep_loop,
					  target_residency, exit_latency, 0, 0);
-		} else if ((flags[i] & OPAL_PM_STOP_INST_DEEP) &&
-				(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
+		} else if (has_stop_states && stops_timebase) {
			add_powernv_state(nr_idle_states, names[i],
					  CPUIDLE_FLAG_TIMER_STOP, stop_loop,
					  target_residency, exit_latency,
					  psscr_val[i], psscr_mask[i]);
		}
 #endif
+		else
+			continue;
		nr_idle_states++;
	}
 out:
--
1.8.3.1
Re: [v3 0/9] parallelized "struct page" zeroing
On Mon 15-05-17 16:44:26, Pasha Tatashin wrote:
> On 05/15/2017 03:38 PM, Michal Hocko wrote:
> > I do not think this is the right approach. Your measurements just show
> > that sparc could have a more optimized memset for small sizes. If you
> > keep the same memset only for the parallel initialization then you
> > just hide this fact. I wouldn't worry about other architectures. All
> > sane architectures should simply work reasonably well when touching a
> > single or only few cache lines at the same time. If some arches really
> > suffer from small memsets then the initialization should be driven by a
> > specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend
> > on DEFERRED_INIT. Or if you are too worried then make it opt-in and make
> > it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and
> > sparc after memset optimization.
>
> OK, I will think about this.
>
> I do not really like adding new configs because they tend to clutter the
> code. This is why,

Yes, I hate adding new (arch) config options as well. And I still
believe we do not need any here either...

> I wanted to rely on already existing config that I know benefits all
> platforms that use it.

I wouldn't be so sure about this. If any other platform has similar
issues with small memset as sparc then the overhead is just papered
over by parallel initialization.

> Eventually, "CONFIG_DEFERRED_STRUCT_PAGE_INIT" is going to become the
> default everywhere, as there should not be a drawback of using it even
> on small machines.

Maybe, and I would highly appreciate that.

--
Michal Hocko
SUSE Labs