Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> This patch series includes fixes for the debug_vm_pgtable test code so that
> it follows page table update rules correctly. The first two patches
> introduce changes w.r.t ppc64. They are included in this series for
> completeness; we can merge them via the ppc64 tree if required.
> 
> The hugetlb test is disabled on ppc64 because it needs a larger change to
> satisfy page table update rules.
> 
> These tests are broken w.r.t page table update rules and result in kernel
> crashes as below.
> 
> [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
> pc: c009a5ec: assert_pte_locked+0x14c/0x380
> lr: c05c: pte_update+0x11c/0x190
> sp: c00c6d1e7950
>    msr: 82029033
>   current = 0xc00c6d172c80
>   paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
> pid   = 1, comm = swapper/0
> kernel BUG at arch/powerpc/mm/pgtable.c:304!
> [link register   ] c05c pte_update+0x11c/0x190
> [c00c6d1e7950] 0001 (unreliable)
> [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
> [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
> [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
> [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
> [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
> [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
> [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> 
> With DEBUG_VM disabled
> 
> [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x
> [   20.530183] Faulting instruction address: 0xc00df330
> cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700]
> pc: c00df330: memset+0x68/0x104
> lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> sp: c00c6d19f990
>    msr: 82009033
>    dar: 0
>   current = 0xc00c6d177480
>   paca= 0xc0001ec4f400   irqmask: 0x03   irq_happened: 0x01
> pid   = 1, comm = swapper/0
> [link register   ] c009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> [c00c6d19f990] c009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
> [c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378
> [c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244
> [c00c6d19fba0] c00116ec do_one_initcall+0xac/0x5f0
> [c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4
> [c00c6d19fdb0] c0012474 kernel_init+0x24/0x160
> [c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> 
> Changes from v3:
> * Address review feedback
> * Move the page table deposit and withdraw patch after adding the pmd lock
> to avoid a bisect failure.

This version

- Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE)
- Runs on arm64 and x86 without any regression, at least nothing that I have noticed
- It would be great if this could get tested on the s390, arc, riscv and ppc32 platforms as well

+ linux-riscv 
+ linux-snps-...@lists.infradead.org 
+ linux-s...@vger.kernel.org
+ Gerald Schaefer 
+ Vineet Gupta 

There is still an open git bisect issue on the arm64 platform which ideally
should be fixed.

- Anshuman


Re: [PATCH 12/14] x86: remove address space overrides using set_fs()

2020-09-03 Thread Christoph Hellwig
On Fri, Sep 04, 2020 at 03:55:10AM +0100, Al Viro wrote:
> On Thu, Sep 03, 2020 at 04:22:40PM +0200, Christoph Hellwig wrote:
> 
> > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> > index c8a85b512796e1..94f7be4971ed04 100644
> > --- a/arch/x86/lib/getuser.S
> > +++ b/arch/x86/lib/getuser.S
> > @@ -35,10 +35,19 @@
> >  #include 
> >  #include 
> >  
> > +#ifdef CONFIG_X86_5LEVEL
> > +#define LOAD_TASK_SIZE_MINUS_N(n) \
> > +   ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
> > +   "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
> > +#else
> > +#define LOAD_TASK_SIZE_MINUS_N(n) \
> > +   mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
> > +#endif
> 
> Wait a sec... how is that supposed to build with X86_5LEVEL?  Do you mean
> 
> #define LOAD_TASK_SIZE_MINUS_N(n) \
>   ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
>   __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), 
> X86_FEATURE_LA57
> 
> there?

Don't ask me about the how, but it builds and works with X86_5LEVEL,
and the style is copied from elsewhere..


Re: [PATCH v4 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> The test seems to be missing quite a lot of details w.r.t allocating
> the correct pgtable_t page (huge_pte_alloc()), holding the right
> lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.
> 
> ppc64 does have runtime checks within CONFIG_DEBUG_VM for most of these.
> Hence disable the test on ppc64.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index b53903fdee85..9afa1354326b 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -811,6 +811,7 @@ static void __init hugetlb_basic_tests(unsigned long pfn, 
> pgprot_t prot)
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>  }
>  
> +#ifndef CONFIG_PPC_BOOK3S_64
>  static void __init hugetlb_advanced_tests(struct mm_struct *mm,
> struct vm_area_struct *vma,
> pte_t *ptep, unsigned long pfn,
> @@ -853,6 +854,7 @@ static void __init hugetlb_advanced_tests(struct 
> mm_struct *mm,
>   pte = huge_ptep_get(ptep);
>   WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
>  }
> +#endif
>  #else  /* !CONFIG_HUGETLB_PAGE */
>  static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { }
>  static void __init hugetlb_advanced_tests(struct mm_struct *mm,
> @@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void)
>   pud_populate_tests(mm, pudp, saved_pmdp);
>   spin_unlock(ptl);
>  
> +#ifndef CONFIG_PPC_BOOK3S_64
>   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> +#endif
>  
>   spin_lock(&mm->page_table_lock);
>   p4d_clear_tests(mm, p4dp);
> 

Is this still required now that DEBUG_VM_PGTABLE has been dropped from
powerpc, or would you like to re-enable it?

https://lore.kernel.org/linuxppc-dev/159913592797.5893.5829441560236719450.b4...@ellerman.id.au/T/#m6d890e2fe84cf180cb875fae5f791e9c83db8d30


Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

2020-09-03 Thread Leonardo Bras
On Thu, 2020-09-03 at 14:41 +1000, Alexey Kardashevskiy wrote:
> > I am new to this, so I am trying to understand how a memory page mapped
> > as DMA, and used for something else could be a problem.
> 
> From the device perspective, there is PCI space and everything from 0 
> till 1<<64 is accessible and what is that mapped to - the device does 
> not know. PHB's IOMMU is the thing to notice invalid access and raise 
> EEH but PHB only knows about PCI->physical memory mapping (with IOMMU 
> pages) but nothing about the host kernel pages. Does this help? Thanks,

According to our conversation on Slack:
1- There is a problem if a hypervisor gives its VMs contiguous
memory blocks that are not aligned to IOMMU pages, because then an 
iommu_map_page() could map some memory in this VM and some memory in
another VM / process.
2- To guarantee this, we should have system pagesize >= iommu_pagesize 

One way to get (2) is by doing this in enable_ddw():

	/* Pick the largest IOMMU page size offered that does not exceed
	 * the system page size:
	 */
	if ((query.page_size & 4) && PAGE_SHIFT >= 24) {
		page_shift = 24; /* 16MB */
	} else if ((query.page_size & 2) && PAGE_SHIFT >= 16) {
		page_shift = 16; /* 64kB */
	} else if ((query.page_size & 1) && PAGE_SHIFT >= 12) {
		page_shift = 12; /* 4kB */
	[...]

Another way of solving this would be adding to the LoPAR documentation
that the block size of contiguous memory the hypervisor gives a VM
should always be aligned to the IOMMU page size offered.

I think the best approach would be first sending the above patch, which
is faster, and then work on adding that to the documentation, so
hypervisors guarantee this.

If this gets into the docs, we can revert the patch.

What do you think?

Best regards!



Re: [PATCH v4 11/13] mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> pmd_clear() should not be used to clear pmd level pte entries.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 26023d990bd0..b53903fdee85 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -196,6 +196,8 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(pmd_young(pmd));
>  
> + /*  Clear the pte entries  */
> + pmdp_huge_get_and_clear(mm, vaddr, pmdp);
>   pgtable = pgtable_trans_huge_withdraw(mm, pmdp);
>  }
>  
> @@ -319,6 +321,8 @@ static void __init pud_advanced_tests(struct mm_struct 
> *mm,
>   pudp_test_and_clear_young(vma, vaddr, pudp);
>   pud = READ_ONCE(*pudp);
>   WARN_ON(pud_young(pud));
> +
> + pudp_huge_get_and_clear(mm, vaddr, pudp);
>  }
>  
>  static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot)
> @@ -442,8 +446,6 @@ static void __init pud_populate_tests(struct mm_struct 
> *mm, pud_t *pudp,
>* This entry points to next level page table page.
>* Hence this must not qualify as pud_bad().
>*/
> - pmd_clear(pmdp);
> - pud_clear(pudp);
>   pud_populate(mm, pudp, pmdp);
>   pud = READ_ONCE(*pudp);
>   WARN_ON(pud_bad(pud));
> @@ -575,7 +577,6 @@ static void __init pmd_populate_tests(struct mm_struct 
> *mm, pmd_t *pmdp,
>* This entry points to next level page table page.
>* Hence this must not qualify as pmd_bad().
>*/
> - pmd_clear(pmdp);
>   pmd_populate(mm, pmdp, pgtable);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(pmd_bad(pmd));
> 

Why can't pxxp_huge_get_and_clear() be called inside the pxx_populate_tests()
functions themselves? Nonetheless, this does not seem to cause any problem.


Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-03 Thread Ingo Molnar


* Christoph Hellwig  wrote:

> Hi all,
> 
> this series removes the last set_fs() used to force a kernel address
> space for the uaccess code in the kernel read/write/splice code, and then
> stops implementing the address space overrides entirely for x86 and
> powerpc.

Cool! For the x86 bits:

  Acked-by: Ingo Molnar 

Thanks,

Ingo


Re: [PATCH v4 09/13] mm/debug_vm_pgtable/locks: Take correct page table lock

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> Make sure we call pte accessors with correct lock held.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 35 ++-
>  1 file changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index f59cf6a9b05e..2bc1952e5f83 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -1035,30 +1035,39 @@ static int __init debug_vm_pgtable(void)
>  
>   hugetlb_basic_tests(pte_aligned, prot);
>  
> - pte_clear_tests(mm, ptep, vaddr);
> - pmd_clear_tests(mm, pmdp);
> - pud_clear_tests(mm, pudp);
> - p4d_clear_tests(mm, p4dp);
> - pgd_clear_tests(mm, pgdp);
> + /*
> +  * Page table modifying tests. They need to hold
> +  * proper page table lock.
> +  */
>  
>   ptl = pte_lockptr(mm, pmdp);
>   spin_lock(ptl);
> -
> + pte_clear_tests(mm, ptep, vaddr);
>   pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
> - pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
> - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> -
> + pte_unmap_unlock(ptep, ptl);
>  
> + ptl = pmd_lock(mm, pmdp);
> + pmd_clear_tests(mm, pmdp);
> + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
>   pmd_huge_tests(pmdp, pmd_aligned, prot);
> + pmd_populate_tests(mm, pmdp, saved_ptep);
> + spin_unlock(ptl);
> +
> + ptl = pud_lock(mm, pudp);
> + pud_clear_tests(mm, pudp);
> + pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
>   pud_huge_tests(pudp, pud_aligned, prot);
> + pud_populate_tests(mm, pudp, saved_pmdp);
> + spin_unlock(ptl);
>  
> - pte_unmap_unlock(ptep, ptl);
> + hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>  
> - pmd_populate_tests(mm, pmdp, saved_ptep);
> - pud_populate_tests(mm, pudp, saved_pmdp);
> + spin_lock(&mm->page_table_lock);
> + p4d_clear_tests(mm, p4dp);
> + pgd_clear_tests(mm, pgdp);
>   p4d_populate_tests(mm, p4dp, saved_pudp);
>   pgd_populate_tests(mm, pgdp, saved_p4dp);
> + spin_unlock(&mm->page_table_lock);
>  
>   p4d_free(mm, saved_p4dp);
>   pud_free(mm, saved_pudp);
> 

This patch in itself looks good, but will probably require folding with
the previous one to prevent the git bisect problem on arm64.


Re: [PATCH v4 07/13] mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an existing pte entry

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> set_pte_at() should not be used to set a pte entry at locations that
> already hold a valid pte entry. Architectures like ppc64 don't do a TLB
> invalidate in set_pte_at() and hence expect it to only be used to set
> locations that do not hold a valid PTE.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 35 +++
>  1 file changed, 15 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 9cafed39c236..de333871f407 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -79,15 +79,18 @@ static void __init pte_advanced_tests(struct mm_struct 
> *mm,
>  {
>   pte_t pte = pfn_pte(pfn, prot);
>  
> + /*
> +  * Architectures optimize set_pte_at by avoiding TLB flush.
> +  * This requires set_pte_at to be not used to update an
> +  * existing pte entry. Clear pte before we do set_pte_at
> +  */
> +
>   pr_debug("Validating PTE advanced\n");
>   pte = pfn_pte(pfn, prot);
>   set_pte_at(mm, vaddr, ptep, pte);
>   ptep_set_wrprotect(mm, vaddr, ptep);
>   pte = ptep_get(ptep);
>   WARN_ON(pte_write(pte));
> -
> - pte = pfn_pte(pfn, prot);
> - set_pte_at(mm, vaddr, ptep, pte);
>   ptep_get_and_clear(mm, vaddr, ptep);
>   pte = ptep_get(ptep);
>   WARN_ON(!pte_none(pte));
> @@ -101,13 +104,11 @@ static void __init pte_advanced_tests(struct mm_struct 
> *mm,
>   ptep_set_access_flags(vma, vaddr, ptep, pte, 1);
>   pte = ptep_get(ptep);
>   WARN_ON(!(pte_write(pte) && pte_dirty(pte)));
> -
> - pte = pfn_pte(pfn, prot);
> - set_pte_at(mm, vaddr, ptep, pte);
>   ptep_get_and_clear_full(mm, vaddr, ptep, 1);
>   pte = ptep_get(ptep);
>   WARN_ON(!pte_none(pte));
>  
> + pte = pfn_pte(pfn, prot);
>   pte = pte_mkyoung(pte);
>   set_pte_at(mm, vaddr, ptep, pte);
>   ptep_test_and_clear_young(vma, vaddr, ptep);
> @@ -169,9 +170,6 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
>   pmdp_set_wrprotect(mm, vaddr, pmdp);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(pmd_write(pmd));
> -
> - pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
> - set_pmd_at(mm, vaddr, pmdp, pmd);
>   pmdp_huge_get_and_clear(mm, vaddr, pmdp);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(!pmd_none(pmd));
> @@ -185,13 +183,11 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
>   pmdp_set_access_flags(vma, vaddr, pmdp, pmd, 1);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(!(pmd_write(pmd) && pmd_dirty(pmd)));
> -
> - pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
> - set_pmd_at(mm, vaddr, pmdp, pmd);
>   pmdp_huge_get_and_clear_full(vma, vaddr, pmdp, 1);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(!pmd_none(pmd));
>  
> + pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>   pmd = pmd_mkyoung(pmd);
>   set_pmd_at(mm, vaddr, pmdp, pmd);
>   pmdp_test_and_clear_young(vma, vaddr, pmdp);
> @@ -292,17 +288,9 @@ static void __init pud_advanced_tests(struct mm_struct 
> *mm,
>   WARN_ON(pud_write(pud));
>  
>  #ifndef __PAGETABLE_PMD_FOLDED
> - pud = pud_mkhuge(pfn_pud(pfn, prot));
> - set_pud_at(mm, vaddr, pudp, pud);
>   pudp_huge_get_and_clear(mm, vaddr, pudp);
>   pud = READ_ONCE(*pudp);
>   WARN_ON(!pud_none(pud));
> -
> - pud = pud_mkhuge(pfn_pud(pfn, prot));
> - set_pud_at(mm, vaddr, pudp, pud);
> - pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1);
> - pud = READ_ONCE(*pudp);
> - WARN_ON(!pud_none(pud));
>  #endif /* __PAGETABLE_PMD_FOLDED */
>  
>   pud = pud_mkhuge(pfn_pud(pfn, prot));
> @@ -315,6 +303,13 @@ static void __init pud_advanced_tests(struct mm_struct 
> *mm,
>   pud = READ_ONCE(*pudp);
>   WARN_ON(!(pud_write(pud) && pud_dirty(pud)));
>  
> +#ifndef __PAGETABLE_PMD_FOLDED
> + pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1);
> + pud = READ_ONCE(*pudp);
> + WARN_ON(!pud_none(pud));
> +#endif /* __PAGETABLE_PMD_FOLDED */
> +
> + pud = pud_mkhuge(pfn_pud(pfn, prot));
>   pud = pud_mkyoung(pud);
>   set_pud_at(mm, vaddr, pudp, pud);
>   pudp_test_and_clear_young(vma, vaddr, pudp);
> 

Reviewed-by: Anshuman Khandual 


Re: [PATCH v4 06/13] mm/debug_vm_pgtable/THP: Mark the pte entry huge before using set_pmd/pud_at

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> The kernel expects entries to be marked huge before we use
> set_pmd_at()/set_pud_at().
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 20 +++-
>  1 file changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 8704901f6bd8..9cafed39c236 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -155,7 +155,7 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
> unsigned long pfn, unsigned long vaddr,
> pgprot_t prot)
>  {
> - pmd_t pmd = pfn_pmd(pfn, prot);
> + pmd_t pmd;
>  
>   if (!has_transparent_hugepage())
>   return;
> @@ -164,19 +164,19 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
>   /* Align the address wrt HPAGE_PMD_SIZE */
>   vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE;
>  
> - pmd = pfn_pmd(pfn, prot);
> + pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>   set_pmd_at(mm, vaddr, pmdp, pmd);
>   pmdp_set_wrprotect(mm, vaddr, pmdp);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(pmd_write(pmd));
>  
> - pmd = pfn_pmd(pfn, prot);
> + pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>   set_pmd_at(mm, vaddr, pmdp, pmd);
>   pmdp_huge_get_and_clear(mm, vaddr, pmdp);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(!pmd_none(pmd));
>  
> - pmd = pfn_pmd(pfn, prot);
> + pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>   pmd = pmd_wrprotect(pmd);
>   pmd = pmd_mkclean(pmd);
>   set_pmd_at(mm, vaddr, pmdp, pmd);
> @@ -236,7 +236,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned 
> long pfn, pgprot_t prot)
>  
>  static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
>  {
> - pmd_t pmd = pfn_pmd(pfn, prot);
> + pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>  
>   if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
>   return;
> @@ -276,7 +276,7 @@ static void __init pud_advanced_tests(struct mm_struct 
> *mm,
> unsigned long pfn, unsigned long vaddr,
> pgprot_t prot)
>  {
> - pud_t pud = pfn_pud(pfn, prot);
> + pud_t pud;
>  
>   if (!has_transparent_hugepage())
>   return;
> @@ -285,25 +285,27 @@ static void __init pud_advanced_tests(struct mm_struct 
> *mm,
>   /* Align the address wrt HPAGE_PUD_SIZE */
>   vaddr = (vaddr & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE;
>  
> + pud = pud_mkhuge(pfn_pud(pfn, prot));
>   set_pud_at(mm, vaddr, pudp, pud);
>   pudp_set_wrprotect(mm, vaddr, pudp);
>   pud = READ_ONCE(*pudp);
>   WARN_ON(pud_write(pud));
>  
>  #ifndef __PAGETABLE_PMD_FOLDED
> - pud = pfn_pud(pfn, prot);
> + pud = pud_mkhuge(pfn_pud(pfn, prot));
>   set_pud_at(mm, vaddr, pudp, pud);
>   pudp_huge_get_and_clear(mm, vaddr, pudp);
>   pud = READ_ONCE(*pudp);
>   WARN_ON(!pud_none(pud));
>  
> - pud = pfn_pud(pfn, prot);
> + pud = pud_mkhuge(pfn_pud(pfn, prot));
>   set_pud_at(mm, vaddr, pudp, pud);
>   pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1);
>   pud = READ_ONCE(*pudp);
>   WARN_ON(!pud_none(pud));
>  #endif /* __PAGETABLE_PMD_FOLDED */
> - pud = pfn_pud(pfn, prot);
> +
> + pud = pud_mkhuge(pfn_pud(pfn, prot));
>   pud = pud_wrprotect(pud);
>   pud = pud_mkclean(pud);
>   set_pud_at(mm, vaddr, pudp, pud);
> 

Reviewed-by: Anshuman Khandual 


Re: [PATCH 12/14] x86: remove address space overrides using set_fs()

2020-09-03 Thread Al Viro
On Fri, Sep 04, 2020 at 03:55:10AM +0100, Al Viro wrote:
> On Thu, Sep 03, 2020 at 04:22:40PM +0200, Christoph Hellwig wrote:
> 
> > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> > index c8a85b512796e1..94f7be4971ed04 100644
> > --- a/arch/x86/lib/getuser.S
> > +++ b/arch/x86/lib/getuser.S
> > @@ -35,10 +35,19 @@
> >  #include 
> >  #include 
> >  
> > +#ifdef CONFIG_X86_5LEVEL
> > +#define LOAD_TASK_SIZE_MINUS_N(n) \
> > +   ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
> > +   "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
> > +#else
> > +#define LOAD_TASK_SIZE_MINUS_N(n) \
> > +   mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
> > +#endif
> 
> Wait a sec... how is that supposed to build with X86_5LEVEL?  Do you mean
> 
> #define LOAD_TASK_SIZE_MINUS_N(n) \
>   ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
>   __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), 
> X86_FEATURE_LA57
> 
> there?

Pushed out with the following folded in.

diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index 94f7be4971ed..2f052bc96866 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -37,8 +37,8 @@
 
 #ifdef CONFIG_X86_5LEVEL
 #define LOAD_TASK_SIZE_MINUS_N(n) \
-   ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
-   "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
+   ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
+   __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), 
X86_FEATURE_LA57
 #else
 #define LOAD_TASK_SIZE_MINUS_N(n) \
mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S
index 445374885153..358239d77dff 100644
--- a/arch/x86/lib/putuser.S
+++ b/arch/x86/lib/putuser.S
@@ -33,8 +33,8 @@
 
 #ifdef CONFIG_X86_5LEVEL
 #define LOAD_TASK_SIZE_MINUS_N(n) \
-   ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rbx", \
-   "mov $((1 << 56) - 4096 - (n)),%rbx", X86_FEATURE_LA57
+   ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rbx), \
+   __stringify(mov $((1 << 56) - 4096 - (n)),%rbx), 
X86_FEATURE_LA57
 #else
 #define LOAD_TASK_SIZE_MINUS_N(n) \
mov $(TASK_SIZE_MAX - (n)),%_ASM_BX
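For readers following along, a minimal illustration (not from the thread) of
why __stringify() is needed here: C macro parameters are not expanded inside
string literals, so the plain quoted form would emit a literal "(n)" into the
instruction stream, while __stringify() expands the argument first and then
quotes it.

	/* The kernel's usual stringify idiom (include/linux/stringify.h): */
	#define __stringify_1(x...)	#x
	#define __stringify(x...)	__stringify_1(x)

	/* Hypothetical minimal example, not the actual getuser.S macro: */
	#define BAD(n)	"mov $(4096 - (n)),%rdx"	/* "(n)" stays literal */
	#define GOOD(n)	__stringify(mov $(4096 - (n)),%rdx)	/* n expands first */

	/* GOOD(8) expands to the string "mov $(4096 - (8)),%rdx" */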


Re: [PATCH v4 10/13] mm/debug_vm_pgtable/thp: Use page table deposit/withdraw with THP

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> Architectures like ppc64 use a deposited page table while updating
> huge pte entries.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 2bc1952e5f83..26023d990bd0 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -154,7 +154,7 @@ static void __init pmd_basic_tests(unsigned long pfn, 
> pgprot_t prot)
>  static void __init pmd_advanced_tests(struct mm_struct *mm,
> struct vm_area_struct *vma, pmd_t *pmdp,
> unsigned long pfn, unsigned long vaddr,
> -   pgprot_t prot)
> +   pgprot_t prot, pgtable_t pgtable)
>  {
>   pmd_t pmd;
>  
> @@ -165,6 +165,8 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
>   /* Align the address wrt HPAGE_PMD_SIZE */
>   vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE;
>  
> + pgtable_trans_huge_deposit(mm, pmdp, pgtable);
> +
>   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>   set_pmd_at(mm, vaddr, pmdp, pmd);
>   pmdp_set_wrprotect(mm, vaddr, pmdp);
> @@ -193,6 +195,8 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
>   pmdp_test_and_clear_young(vma, vaddr, pmdp);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(pmd_young(pmd));
> +
> + pgtable = pgtable_trans_huge_withdraw(mm, pmdp);
>  }
>  
>  static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot)
> @@ -371,7 +375,7 @@ static void __init pud_basic_tests(unsigned long pfn, 
> pgprot_t prot) { }
>  static void __init pmd_advanced_tests(struct mm_struct *mm,
> struct vm_area_struct *vma, pmd_t *pmdp,
> unsigned long pfn, unsigned long vaddr,
> -   pgprot_t prot)
> +   pgprot_t prot, pgtable_t pgtable)
>  {
>  }
>  static void __init pud_advanced_tests(struct mm_struct *mm,
> @@ -1048,7 +1052,7 @@ static int __init debug_vm_pgtable(void)
>  
>   ptl = pmd_lock(mm, pmdp);
>   pmd_clear_tests(mm, pmdp);
> - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
> + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
>   pmd_huge_tests(pmdp, pmd_aligned, prot);
>   pmd_populate_tests(mm, pmdp, saved_ptep);
>   spin_unlock(ptl);
> 

Moving this down further in the series has not really helped the possible
git bisect problem on arm64. Nonetheless, the patch in itself makes sense.

Reviewed-by: Anshuman Khandual 


Re: [PATCH v4 13/13] mm/debug_vm_pgtable: Avoid none pte in pte_clear_test

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:16 PM, Aneesh Kumar K.V wrote:
> pte_clear_tests operates on an existing pte entry. Make sure that it
> is not a none pte entry.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 9afa1354326b..c36530c69e33 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -542,9 +542,10 @@ static void __init pgd_populate_tests(struct mm_struct 
> *mm, pgd_t *pgdp,
>  #endif /* PAGETABLE_P4D_FOLDED */
>  
>  static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep,
> -unsigned long vaddr)
> +unsigned long pfn, unsigned long vaddr,
> +pgprot_t prot)
>  {
> - pte_t pte = ptep_get(ptep);
> + pte_t pte = pfn_pte(pfn, prot);
>  
>   pr_debug("Validating PTE clear\n");
>   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> @@ -1049,7 +1050,7 @@ static int __init debug_vm_pgtable(void)
>  
>   ptl = pte_lockptr(mm, pmdp);
>   spin_lock(ptl);
> - pte_clear_tests(mm, ptep, vaddr);
> + pte_clear_tests(mm, ptep, pte_aligned, vaddr, prot);
>   pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>   pte_unmap_unlock(ptep, ptl);
>  
> 

Reviewed-by: Anshuman Khandual 


Re: [PATCH v4 05/13] mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with CONFIG_NUMA_BALANCING

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> Saved write support was added to track the write bit of a pte after
> marking the pte protnone. This was done so that AUTONUMA can convert
> a write pte to protnone and still track the old write bit. When converting
> it back we set the pte write bit correctly thereby avoiding a write fault
> again. Hence enable the test only when CONFIG_NUMA_BALANCING is enabled and
> use protnone protflags.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 4c73e63b4ceb..8704901f6bd8 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -119,10 +119,14 @@ static void __init pte_savedwrite_tests(unsigned long 
> pfn, pgprot_t prot)
>  {
>   pte_t pte = pfn_pte(pfn, prot);
>  
> + if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
> + return;
> +
>   pr_debug("Validating PTE saved write\n");
>   WARN_ON(!pte_savedwrite(pte_mk_savedwrite(pte_clear_savedwrite(pte;
>   WARN_ON(pte_savedwrite(pte_clear_savedwrite(pte_mk_savedwrite(pte;
>  }
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot)
>  {
> @@ -234,6 +238,9 @@ static void __init pmd_savedwrite_tests(unsigned long 
> pfn, pgprot_t prot)
>  {
>   pmd_t pmd = pfn_pmd(pfn, prot);
>  
> + if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
> + return;
> +
>   pr_debug("Validating PMD saved write\n");
>   WARN_ON(!pmd_savedwrite(pmd_mk_savedwrite(pmd_clear_savedwrite(pmd;
>   WARN_ON(pmd_savedwrite(pmd_clear_savedwrite(pmd_mk_savedwrite(pmd;
> @@ -1019,8 +1026,8 @@ static int __init debug_vm_pgtable(void)
>   pmd_huge_tests(pmdp, pmd_aligned, prot);
>   pud_huge_tests(pudp, pud_aligned, prot);
>  
> - pte_savedwrite_tests(pte_aligned, prot);
> - pmd_savedwrite_tests(pmd_aligned, prot);
> + pte_savedwrite_tests(pte_aligned, protnone);
> + pmd_savedwrite_tests(pmd_aligned, protnone);
>  
>   pte_unmap_unlock(ptep, ptl);
>  
> 

No more checkpatch.pl warnings.

Reviewed-by: Anshuman Khandual 


Re: [PATCH v4 04/13] mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support.

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> ppc64 supports huge vmap only with radix translation. Hence use the arch
> helper to determine huge vmap support.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 00649b47f6e0..4c73e63b4ceb 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -206,11 +207,12 @@ static void __init pmd_leaf_tests(unsigned long pfn, 
> pgprot_t prot)
>   WARN_ON(!pmd_leaf(pmd));
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>  static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t 
> prot)
>  {
>   pmd_t pmd;
>  
> - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
> + if (!arch_ioremap_pmd_supported())
>   return;
>  
>   pr_debug("Validating PMD huge\n");
> @@ -224,6 +226,9 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned 
> long pfn, pgprot_t prot)
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(!pmd_none(pmd));
>  }
> +#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
> +static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t 
> prot) { }
> +#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
>  
>  static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
>  {
> @@ -320,11 +325,12 @@ static void __init pud_leaf_tests(unsigned long pfn, 
> pgprot_t prot)
>   WARN_ON(!pud_leaf(pud));
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>  static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t 
> prot)
>  {
>   pud_t pud;
>  
> - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
> + if (!arch_ioremap_pud_supported())
>   return;
>  
>   pr_debug("Validating PUD huge\n");
> @@ -338,6 +344,10 @@ static void __init pud_huge_tests(pud_t *pudp, unsigned 
> long pfn, pgprot_t prot)
>   pud = READ_ONCE(*pudp);
>   WARN_ON(!pud_none(pud));
>  }
> +#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
> +static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t 
> prot) { }
> +#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
> +
>  #else  /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>  static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { }
>  static void __init pud_advanced_tests(struct mm_struct *mm,
> 

Reviewed-by: Anshuman Khandual 


Re: [PATCH v4 03/13] mm/debug_vm_pgtable/ppc64: Avoid setting top bits in random value

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> ppc64 uses bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting
> that bit in the random value.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 086309fb9b6f..00649b47f6e0 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -44,10 +44,17 @@
>   * entry type. But these bits might affect the ability to clear entries with
>   * pxx_clear() because of how dynamic page table folding works on s390. So
>   * while loading up the entries do not change the lower 4 bits. It does not
> - * have affect any other platform.
> + * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
> + * used to mark a pte entry.
>   */
> -#define S390_MASK_BITS   4
> -#define RANDOM_ORVALUE   GENMASK(BITS_PER_LONG - 1, S390_MASK_BITS)
> +#define S390_SKIP_MASK   GENMASK(3, 0)
> +#if __BITS_PER_LONG == 64
> +#define PPC64_SKIP_MASK  GENMASK(62, 62)
> +#else
> +#define PPC64_SKIP_MASK  0x0
> +#endif
> +#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
> +#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
>  #define RANDOM_NZVALUE   GENMASK(7, 0)
>  
>  static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
> 

Reviewed-by: Anshuman Khandual 
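For reference, the values the new defines work out to on a 64-bit platform
(derived directly from the patch above):

	S390_SKIP_MASK  = 0x000000000000000f	/* low 4 bits, s390 folding */
	PPC64_SKIP_MASK = 0x4000000000000000	/* bit 62, _PAGE_PTE on ppc64 */
	RANDOM_ORVALUE  = 0xbffffffffffffff0	/* all bits except the above */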


Re: [PATCH v4 08/13] mm/debug_vm_pgtable/locks: Move non page table modifying test together

2020-09-03 Thread Anshuman Khandual



On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> This will help in adding proper locks in a later patch
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 51 ---
>  1 file changed, 28 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index de333871f407..f59cf6a9b05e 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -986,7 +986,7 @@ static int __init debug_vm_pgtable(void)
>   p4dp = p4d_alloc(mm, pgdp, vaddr);
>   pudp = pud_alloc(mm, p4dp, vaddr);
>   pmdp = pmd_alloc(mm, pudp, vaddr);
> - ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
> + ptep = pte_alloc_map(mm, pmdp, vaddr);
>  
>   /*
>* Save all the page table page addresses as the page table
> @@ -1006,33 +1006,12 @@ static int __init debug_vm_pgtable(void)
>   p4d_basic_tests(p4d_aligned, prot);
>   pgd_basic_tests(pgd_aligned, prot);
>  
> - pte_clear_tests(mm, ptep, vaddr);
> - pmd_clear_tests(mm, pmdp);
> - pud_clear_tests(mm, pudp);
> - p4d_clear_tests(mm, p4dp);
> - pgd_clear_tests(mm, pgdp);
> -
> - pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
> - pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
> - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> -
>   pmd_leaf_tests(pmd_aligned, prot);
>   pud_leaf_tests(pud_aligned, prot);
>  
> - pmd_huge_tests(pmdp, pmd_aligned, prot);
> - pud_huge_tests(pudp, pud_aligned, prot);
> -
>   pte_savedwrite_tests(pte_aligned, protnone);
>   pmd_savedwrite_tests(pmd_aligned, protnone);
>  
> - pte_unmap_unlock(ptep, ptl);
> -
> - pmd_populate_tests(mm, pmdp, saved_ptep);
> - pud_populate_tests(mm, pudp, saved_pmdp);
> - p4d_populate_tests(mm, p4dp, saved_pudp);
> - pgd_populate_tests(mm, pgdp, saved_p4dp);
> -
>   pte_special_tests(pte_aligned, prot);
>   pte_protnone_tests(pte_aligned, protnone);
>   pmd_protnone_tests(pmd_aligned, protnone);
> @@ -1050,11 +1029,37 @@ static int __init debug_vm_pgtable(void)
>   pmd_swap_tests(pmd_aligned, prot);
>  
>   swap_migration_tests();
> - hugetlb_basic_tests(pte_aligned, prot);
>  
>   pmd_thp_tests(pmd_aligned, prot);
>   pud_thp_tests(pud_aligned, prot);
>  
> + hugetlb_basic_tests(pte_aligned, prot);
> +
> + pte_clear_tests(mm, ptep, vaddr);
> + pmd_clear_tests(mm, pmdp);
> + pud_clear_tests(mm, pudp);
> + p4d_clear_tests(mm, p4dp);
> + pgd_clear_tests(mm, pgdp);
> +
> + ptl = pte_lockptr(mm, pmdp);
> + spin_lock(ptl);
> +
> + pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
> + pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
> + hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> +
> +
> + pmd_huge_tests(pmdp, pmd_aligned, prot);
> + pud_huge_tests(pudp, pud_aligned, prot);
> +
> + pte_unmap_unlock(ptep, ptl);
> +
> + pmd_populate_tests(mm, pmdp, saved_ptep);
> + pud_populate_tests(mm, pudp, saved_pmdp);
> + p4d_populate_tests(mm, p4dp, saved_pudp);
> + pgd_populate_tests(mm, pgdp, saved_p4dp);
> +
>   p4d_free(mm, saved_p4dp);
>   pud_free(mm, saved_pudp);
>   pmd_free(mm, saved_pmdp);
>

Patches applied up to here [i.e. PATCH_1..PATCH_8] do not boot on the arm64
platform, which might create a potential git bisect problem later on.

static void __init pte_advanced_tests(struct mm_struct *mm,
  struct vm_area_struct *vma, pte_t *ptep,
  unsigned long pfn, unsigned long vaddr,
  pgprot_t prot)
{
pte_t pte = pfn_pte(pfn, prot);

/*
 * Architectures optimize set_pte_at by avoiding TLB flush.
 * This requires set_pte_at to be not used to update an
 * existing pte entry. Clear pte before we do set_pte_at
 */

pr_debug("Validating PTE advanced\n");
pte = pfn_pte(pfn, prot);
set_pte_at(mm, vaddr, ptep, pte);   --> Crashed here.



[   19.031831] Unable to handle kernel paging request at virtual address fdfffde00028
[   19.033181] Mem abort info:
[   19.033627]   ESR = 0x9606
[   19.034129]   EC = 0x25: DABT (current EL), IL = 32 bits
[   19.035002]   SET = 0, FnV = 0
[   19.035523]   EA = 0, S1PTW = 0
[   19.036054] Data abort info:
[   19.036538]   ISV = 0, ISS = 0x0006
[   19.037170]   CM = 0, WnR = 0
[   19.037662] swapper pgtable: 4k pages, 48-bit VAs, pgdp=8158b000
[   19.038749] [fdfffde00028] pgd=81d69003, p4d=81d69003, pud=81d6a003, pmd=
[   19.040560] Internal error: Oops: 9606 [#1] PREEMPT SMP
[   19.04146

Re: [PATCH 12/14] x86: remove address space overrides using set_fs()

2020-09-03 Thread Al Viro
On Thu, Sep 03, 2020 at 04:22:40PM +0200, Christoph Hellwig wrote:

> diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> index c8a85b512796e1..94f7be4971ed04 100644
> --- a/arch/x86/lib/getuser.S
> +++ b/arch/x86/lib/getuser.S
> @@ -35,10 +35,19 @@
>  #include 
>  #include 
>  
> +#ifdef CONFIG_X86_5LEVEL
> +#define LOAD_TASK_SIZE_MINUS_N(n) \
> + ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
> + "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
> +#else
> +#define LOAD_TASK_SIZE_MINUS_N(n) \
> + mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
> +#endif

Wait a sec... how is that supposed to build with X86_5LEVEL?  Do you mean

#define LOAD_TASK_SIZE_MINUS_N(n) \
ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
__stringify(mov $((1 << 56) - 4096 - (n)),%rdx), 
X86_FEATURE_LA57

there?


Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-09-03 Thread Alexey Kardashevskiy




On 02/09/2020 16:11, Leonardo Bras wrote:

On Mon, 2020-08-31 at 14:35 +1000, Alexey Kardashevskiy wrote:


On 29/08/2020 04:36, Leonardo Bras wrote:

On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:

On 18/08/2020 09:40, Leonardo Bras wrote:

As of today, if the biggest DDW that can be created can't map the whole
partition, its creation is skipped and the default DMA window
"ibm,dma-window" is used instead.

DDW is 16x bigger than the default DMA window,


16x only under very specific circumstances which are
1. phyp
2. sriov
3. device class in hmc (or what that priority number is in the lpar config).


Yeah, missing details.


having the same amount of
pages, but increasing the page size to 64k.
Besides larger DMA window,


"Besides being larger"?


You are right there.


it performs better for allocations over 4k,


Better how?


I was thinking of allocations larger than (512 * 4k), since >2
hypercalls are needed there, while with 64k pages it would still be just 1
hypercall up to (512 * 64k).
But yeah, not the usual case anyway.


Yup.



so it would be nice to use it instead.


I'd rather say something like:
===
So far we assumed we can map the guest RAM 1:1 to the bus, which worked
with a small number of devices. SRIOV changes it as the user can
configure hundreds of VFs and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit waste of physical pages.
===


I mixed this in my commit message, it looks like this:

===
powerpc/pseries/iommu: Make use of DDW for indirect mapping

So far it's been assumed possible to map the guest RAM 1:1 to the bus,
which works with a small number of devices. SRIOV changes it as the user
can configure hundreds of VFs and since phyp preallocates TCEs and does
not allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW
creation is skipped and the default DMA window "ibm,dma-window" is used
instead.

The default DMA window uses 4k pages instead of 64k pages, and since
the amount of pages is the same,


Is the amount really the same? I thought you can prioritize some VFs
over others (== allocate different number of TCEs). Does it really
matter if it is the same?


In a conversation with Travis Pizel, he explained how it's supposed to
work, and I understood this:

When a VF is created, it will be assigned a capacity, like 4%, 20%, and
so on. The number of 'TCE entries' available to that partition
is proportional to that capacity.

If we use the default DMA window, the IOMMU pagesize/entry will be 4k,
and if we use DDW, we will get a 64k pagesize. As the number of entries
will be the same (for the same capacity), the total space that can be
addressed by the IOMMU will be 16 times bigger. This sometimes enables
direct mapping, but sometimes it's still not enough.
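(Worked example of the 16x factor, with illustrative numbers: 131072 TCE
entries x 4 KiB per entry = 512 MiB under the default window, while the same
131072 entries x 64 KiB per entry = 8 GiB under DDW, matching the figures
quoted below.)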



Good to know. This is still an implementation detail, QEMU does not 
allocate TCEs like this.





In Travis' words:
"A low capacity VF, with less resources available, will certainly have
less DMA window capability than a high capacity VF. But, an 8GB DMA
window (with 64k pages) is still 16x larger than an 512MB window (with
4K pages).
A high capacity VF - for example, one that Leonardo has in his scenario
- will go from 8GB (using 4K pages) to 128GB (using 64K pages) - again,
16x larger - but it's obviously still possible to create a partition
that exceeds 128GB of memory in size."



Right except the default dma window is not 8GB, it is <=2GB.








making use of DDW instead of the
default DMA window for indirect mapping will expand by 16x the amount
of memory that can be mapped for DMA.


Stop saying "16x", it is not guaranteed by anything :)



The DDW created will be used for direct mapping by default. [...]
===

What do you think?


The DDW created will be used for direct mapping by default.
If it's not available, indirect mapping will be used instead.

For indirect mapping, it's necessary to update the iommu_table so
iommu_alloc() can use the DDW created. For this,
iommu_table_update_window() is called when everything else succeeds
at enable_ddw().

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

As there will never be both direct and indirect mappings at the same
time, the same property name can be used for the created DDW.

So renaming
define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
to
define DMA64_PROPNAME "linux,dma64-ddr-window-info"
looks like the right thing to do.


I know I suggested this but this does not look so good anymore as I
suspect it breaks kexec (from older kernel to this one) so you either
need to check for both DT names or just keep the old one. Changing the
macro name is fine.



Yeah, having 'direct' in the name doesn't really make sense if i

[PATCH] spi: fsl-espi: Only process interrupts for expected events

2020-09-03 Thread Chris Packham
The SPIE register contains counts for the TX FIFO so any time the irq
handler was invoked we would attempt to process the RX/TX fifos. Use the
SPIM value to mask the events so that we only process interrupts that
were expected.

This was a latent issue exposed by commit 3282a3da25bd ("powerpc/64:
Implement soft interrupt replay in C").

Signed-off-by: Chris Packham 
Cc: sta...@vger.kernel.org
---

Notes:
I've tested this on a T2080RDB and a custom board using the T2081 SoC. With
this change I don't see any spurious instances of the "Transfer done but
SPIE_DON isn't set!" or "Transfer done but rx/tx fifo's aren't empty!" 
messages
and the updates to spi flash are successful.

I think this should go into the stable trees that contain 3282a3da25bd but I
haven't added a Fixes: tag because I think 3282a3da25bd exposed the issue as
opposed to causing it.

 drivers/spi/spi-fsl-espi.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
index 7e7c92cafdbb..cb120b68c0e2 100644
--- a/drivers/spi/spi-fsl-espi.c
+++ b/drivers/spi/spi-fsl-espi.c
@@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi *espi, u32 
events)
 static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
 {
struct fsl_espi *espi = context_data;
-   u32 events;
+   u32 events, mask;
 
spin_lock(&espi->lock);
 
/* Get interrupt events(tx/rx) */
events = fsl_espi_read_reg(espi, ESPI_SPIE);
-   if (!events) {
+   mask = fsl_espi_read_reg(espi, ESPI_SPIM);
+   if (!(events & mask)) {
spin_unlock(&espi->lock);
return IRQ_NONE;
}
-- 
2.28.0



Re: fsl_espi errors on v5.7.15

2020-09-03 Thread Chris Packham

On 1/09/20 6:14 pm, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:59, Chris Packham wrote:
 On 31/08/20 9:41 am, Heiner Kallweit wrote:
> On 30.08.2020 23:00, Chris Packham wrote:
>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>> 
>>
 I've also now seen the RX FIFO not empty error on the T2080RDB

 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't 
 empty!
 fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

 With my current workaround of emptying the RX FIFO. It seems
 survivable. Interestingly it only ever seems to be 1 extra byte in 
 the
 RX FIFO and it seems to be after either a READ_SR or a READ_FSR.

 fsl_espi ffe11.spi: tx 70
 fsl_espi ffe11.spi: rx 03
 fsl_espi ffe11.spi: Extra RX 00
 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't 
 empty!
 fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
 fsl_espi ffe11.spi: tx 05
 fsl_espi ffe11.spi: rx 00
 fsl_espi ffe11.spi: Extra RX 03
 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't 
 empty!
 fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
 fsl_espi ffe11.spi: tx 05
 fsl_espi ffe11.spi: rx 00
 fsl_espi ffe11.spi: Extra RX 03

  From all the Micron SPI-NOR datasheets I've got access to it 
 is
 possible to continually read the SR/FSR. But I've no idea why it
 happens some times and not others.
>>> So I think I've got a reproduction and I think I've bisected the 
>>> problem
>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt 
>>> replay in
>>> C"). My day is just finishing now so I haven't applied too much 
>>> scrutiny
>>> to this result. Given the various rabbit holes I've been down on 
>>> this
>>> issue already I'd take this information with a good degree of 
>>> skepticism.
>>>
>> OK, so an easy test should be to re-test with a 5.4 kernel.
>> It doesn't have yet the change you're referring to, and the fsl-espi 
>> driver
>> is basically the same as in 5.7 (just two small changes in 5.7).
> There's 6cc0c16d82f88 and maybe also other interrupt related patches
> around this time that could affect book E, so it's good if that exact
> patch is confirmed.
 My confirmation is basically that I can induce the issue in a 5.4 
 kernel
 by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
 5.9-rc2 by reverting that one commit.

 I both cases it's not exactly a clean cherry-pick/revert so I also
 confirmed the bisection result by building at 3282a3da25bd (which sees
 the issue) and the commit just before (which does not).
>>> Thanks for testing, that confirms it well.
>>>
>>> [snip patch]
>>>
 I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
 didn't report anything (either with or without the change above).
>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>> else has changed.
>>>
>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>> higher interrupt latency?
>>>
>>> I don't think the patch should cause significantly worse latency,
>>> (it's supposed to be a bit better if anything because it doesn't set
>>> up the full interrupt frame). But it's possible.
>> My working theory is that the SPI_DON indication is all about the TX
>> direction and now that the interrupts are faster we're hitting an error
>> because there is still RX activity going on. Heiner disagrees with my
>> interpretation of the SPI_DON indication and the fact that it doesn't
>> happen every time does throw doubt on it.
>>
> It's right that the eSPI spec can be interpreted such that SPI_DON refers to
> TX only. However this wouldn't really make sense, because for RX we also
> program the frame len

Re: [PATCH 12/14] x86: remove address space overrides using set_fs()

2020-09-03 Thread Linus Torvalds
On Thu, Sep 3, 2020 at 2:30 PM David Laight  wrote:
>
> A non-canonical (is that the right term) address between the highest
> valid user address and the lowest valid kernel address (7ffe to fffe?)
> will fault anyway.

Yes.

But we actually warn against that fault, because it's been a good way
to catch places that didn't use the proper "access_ok()" pattern.

See ex_handler_uaccess() and the

WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in
user access. Non-canonical address?");

warning. It's been good for randomized testing - a missing range check
on a user address will often hit this.

Of course, you should never see it in real life (and hopefully not in
testing either any more). But belt-and-suspenders..

  Linus


Re: [PATCH] mm: check for memory's node later during boot

2020-09-03 Thread Andrew Morton
On Wed,  2 Sep 2020 11:09:11 +0200 Laurent Dufour  wrote:

> register_mem_sect_under_node() is checking the memory block's node id only
> if the system state is "SYSTEM_BOOTING". On PowerPC, the memory blocks are
> registered while the system state is "SYSTEM_SCHEDULING", the one before
> SYSTEM_RUNNING.
> 
> The consequence on a PowerPC guest with interleaved memory node ranges is
> that some memory blocks could be assigned to multiple nodes in sysfs. This
> later prevents some memory hot-plug and hot-unplug operations from
> succeeding because stale links remain. Such a panic is then displayed:
> 
> [ cut here ]
> kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
> Oops: Exception in kernel mode, sig: 5 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto 
> gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
> CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
> NIP:  c0403f34 LR: c0403f2c CTR: 
> REGS: c004876e3660 TRAP: 0700   Not tainted  (5.9.0-rc1+)
> MSR:  8282b033   CR: 24000448  XER: 
> 2004
> CFAR: c0846d20 IRQMASK: 0
> GPR00: c0403f2c c004876e38f0 c12f6f00 ffef
> GPR04: 0227 c004805ae680  0004886f
> GPR08: 0226 0003 0002 fffd
> GPR12: 88000484 c0001ec96280  
> GPR16:   0004 0003
> GPR20: c0047814ffe0 c0077c08 0010 c13332c8
> GPR24:  c11f6cc0  
> GPR28: ffef 0001 00015000 1000
> NIP [c0403f34] add_memory_resource+0x244/0x340
> LR [c0403f2c] add_memory_resource+0x23c/0x340
> Call Trace:
> [c004876e38f0] [c0403f2c] add_memory_resource+0x23c/0x340 
> (unreliable)
> [c004876e39c0] [c040408c] __add_memory+0x5c/0xf0
> [c004876e39f0] [c00e2b94] dlpar_add_lmb+0x1b4/0x500
> [c004876e3ad0] [c00e3888] dlpar_memory+0x1f8/0xb80
> [c004876e3b60] [c00dc0d0] handle_dlpar_errorlog+0xc0/0x190
> [c004876e3bd0] [c00dc398] dlpar_store+0x198/0x4a0
> [c004876e3c90] [c072e630] kobj_attr_store+0x30/0x50
> [c004876e3cb0] [c051f954] sysfs_kf_write+0x64/0x90
> [c004876e3cd0] [c051ee40] kernfs_fop_write+0x1b0/0x290
> [c004876e3d20] [c0438dd8] vfs_write+0xe8/0x290
> [c004876e3d70] [c04391ac] ksys_write+0xdc/0x130
> [c004876e3dc0] [c0034e40] system_call_exception+0x160/0x270
> [c004876e3e20] [c000d740] system_call_common+0xf0/0x27c
> Instruction dump:
> 48442e35 6000 0b03 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14
> 78a58402 48442db1 6000 7c7c1b78 <0b03> 7f23cb78 4bda371d 6000
> ---[ end trace 562fd6c109cd0fb2 ]---
> 
> To prevent these multiple links, perform the node check for all states
> prior to SYSTEM_RUNNING.

Did you consider adding a cc:stable to this fix?
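For context, a minimal sketch of the change being described (assuming the
check lives in register_mem_sect_under_node(); the exact hunk is not quoted
in this thread):

	/* Hypothetical shape of the fix: widen the boot-only check so it
	 * also covers SYSTEM_SCHEDULING, the state in which PowerPC
	 * registers its memory blocks.
	 */
	-	if (system_state == SYSTEM_BOOTING)
	+	if (system_state < SYSTEM_RUNNING)
			/* ... check the memory block's node id ... */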


RE: [PATCH 12/14] x86: remove address space overrides using set_fs()

2020-09-03 Thread David Laight
From: Christoph Hellwig
> Sent: 03 September 2020 15:23
> 
> Stop providing the possibility to override the address space using
> set_fs() now that there is no need for that any more.  To properly
> handle the TASK_SIZE_MAX checking for 4 vs 5-level page tables on
> x86 a new alternative is introduced, which just like the one in
> entry_64.S has to use the hardcoded virtual address bits to escape
> the fact that TASK_SIZE_MAX isn't actually a constant when 5-level
> page tables are enabled.

Why does it matter whether 4 or 5 level page tables are in use?
Surely all access_ok() needs to do is ensure that a valid kernel
address isn't supplied.
A non-canonical (is that the right term) address between the highest
valid user address and the lowest valid kernel address (7ffe to fffe?)
will fault anyway.
So any limit between the valid user and kernel addresses should
work?
So a limit of 1<<63 would seem appropriate.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)
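A rough sketch of the comparison being discussed (simplified and
hypothetical, not the actual x86 implementation):

	/*
	 * With a single limit between the user and kernel ranges, any
	 * pointer that passes the check is either a valid user address or
	 * a non-canonical one, and the latter faults on access anyway.
	 */
	static inline bool addr_range_ok(const void __user *p, unsigned long size)
	{
		unsigned long a = (unsigned long)p;
		unsigned long limit = 1UL << 63;	/* the suggested limit */

		return size <= limit && a <= limit - size;
	}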



Re: [PATCH] powerpc/boot/dts: Fix dtc "pciex" warnings

2020-09-03 Thread Christian Lamparter

On 2020-06-23 15:03, Michael Ellerman wrote:

With CONFIG_OF_ALL_DTBS=y, as set by eg. allmodconfig, we see lots of
warnings about our dts files, such as:

   arch/powerpc/boot/dts/glacier.dts:492.26-532.5:
   Warning (pci_bridge): /plb/pciex@d: node name is not "pci"
   or "pcie"

The node name should not particularly matter, it's just a name, and
AFAICS there's no kernel code that cares whether nodes are *named*
"pciex" or "pcie". So shutup these warnings by converting to the name
dtc wants.

As always there's some risk this could break something obscure that
does rely on the name, in which case we can revert.


Hmm, I noticed this when I was looking up why nobody commented
on my series adding more devices to the APM82181/bluestone:


(I'll post a v3 "soonish".)


Unfortunately yes. This patch will break the u-boot code on the Meraki
MX60(W) / MX60:

https://github.com/riptidewave93/meraki-uboot/blob/mx60w-20180413/board/amcc/bluestone/bluestone.c#L1178

| if (!pci_available()) {
|     fdt_find_and_setprop(blob, "/plb/pciex@d", "status",
|   "disabled", sizeof("disabled"), 1);
| }


Backstory: There are two versions of the Meraki MX60: the MX60
and the MX60W. The difference is that the MX60W has a populated
mini-pcie slot on the PCB for a WiFi card.
Signed-off-by: Michael Ellerman 
---

diff --git a/arch/powerpc/boot/dts/bluestone.dts 
b/arch/powerpc/boot/dts/bluestone.dts
index cc965a1816b6..aa1ae94cd776 100644
--- a/arch/powerpc/boot/dts/bluestone.dts
+++ b/arch/powerpc/boot/dts/bluestone.dts
@@ -325,7 +325,7 @@ EMAC0: ethernet@ef600c00 {
};
};
  
-		PCIE0: pciex@d {
+		PCIE0: pcie@d {
device_type = "pci";
#interrupt-cells = <1>;
#size-cells = <2>;




[powerpc:fixes-test] BUILD SUCCESS 4c62285439f80f8996c38e0bda79b1125a192365

2020-09-03 Thread kernel test robot
lmodconfig
powerpc defconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
x86_64   randconfig-a004-20200903
x86_64   randconfig-a006-20200903
x86_64   randconfig-a003-20200903
x86_64   randconfig-a005-20200903
x86_64   randconfig-a001-20200903
x86_64   randconfig-a002-20200903
i386 randconfig-a004-20200902
i386 randconfig-a005-20200902
i386 randconfig-a006-20200902
i386 randconfig-a002-20200902
i386 randconfig-a001-20200902
i386 randconfig-a003-20200902
i386 randconfig-a004-20200903
i386 randconfig-a005-20200903
i386 randconfig-a006-20200903
i386 randconfig-a002-20200903
i386 randconfig-a001-20200903
i386 randconfig-a003-20200903
x86_64   randconfig-a013-20200902
x86_64   randconfig-a016-20200902
x86_64   randconfig-a011-20200902
x86_64   randconfig-a012-20200902
x86_64   randconfig-a015-20200902
x86_64   randconfig-a014-20200902
i386 randconfig-a016-20200903
i386 randconfig-a015-20200903
i386 randconfig-a011-20200903
i386 randconfig-a013-20200903
i386 randconfig-a014-20200903
i386 randconfig-a012-20200903
i386 randconfig-a016-20200902
i386 randconfig-a015-20200902
i386 randconfig-a011-20200902
i386 randconfig-a013-20200902
i386 randconfig-a014-20200902
i386 randconfig-a012-20200902
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscvallmodconfig
x86_64   rhel
x86_64   allyesconfig
x86_64rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

clang tested configs:
x86_64   randconfig-a004-20200902
x86_64   randconfig-a006-20200902
x86_64   randconfig-a003-20200902
x86_64   randconfig-a005-20200902
x86_64   randconfig-a001-20200902
x86_64   randconfig-a002-20200902
x86_64   randconfig-a013-20200903
x86_64   randconfig-a016-20200903
x86_64   randconfig-a011-20200903
x86_64   randconfig-a012-20200903
x86_64   randconfig-a015-20200903
x86_64   randconfig-a014-20200903

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[powerpc:next-test] BUILD SUCCESS c3ea77ab83a1cd36cca6a54206657a4aeb98c49c

2020-09-03 Thread kernel test robot
 allyesconfig
parisc   allyesconfig
s390defconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc   allnoconfig
x86_64   randconfig-a004-20200903
x86_64   randconfig-a006-20200903
x86_64   randconfig-a003-20200903
x86_64   randconfig-a005-20200903
x86_64   randconfig-a001-20200903
x86_64   randconfig-a002-20200903
i386 randconfig-a004-20200902
i386 randconfig-a005-20200902
i386 randconfig-a006-20200902
i386 randconfig-a002-20200902
i386 randconfig-a001-20200902
i386 randconfig-a003-20200902
i386 randconfig-a004-20200903
i386 randconfig-a005-20200903
i386 randconfig-a006-20200903
i386 randconfig-a002-20200903
i386 randconfig-a001-20200903
i386 randconfig-a003-20200903
x86_64   randconfig-a013-20200902
x86_64   randconfig-a016-20200902
x86_64   randconfig-a011-20200902
x86_64   randconfig-a012-20200902
x86_64   randconfig-a015-20200902
x86_64   randconfig-a014-20200902
i386 randconfig-a016-20200902
i386 randconfig-a015-20200902
i386 randconfig-a011-20200902
i386 randconfig-a013-20200902
i386 randconfig-a014-20200902
i386 randconfig-a012-20200902
i386 randconfig-a016-20200903
i386 randconfig-a015-20200903
i386 randconfig-a011-20200903
i386 randconfig-a013-20200903
i386 randconfig-a014-20200903
i386 randconfig-a012-20200903
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscvallmodconfig
x86_64   rhel
x86_64   allyesconfig
x86_64rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

clang tested configs:
x86_64   randconfig-a004-20200902
x86_64   randconfig-a006-20200902
x86_64   randconfig-a003-20200902
x86_64   randconfig-a005-20200902
x86_64   randconfig-a001-20200902
x86_64   randconfig-a002-20200902
x86_64   randconfig-a013-20200903
x86_64   randconfig-a016-20200903
x86_64   randconfig-a011-20200903
x86_64   randconfig-a012-20200903
x86_64   randconfig-a015-20200903
x86_64   randconfig-a014-20200903

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH 1/2] dma-mapping: introduce dma_get_seg_boundary_nr_pages()

2020-09-03 Thread Christophe Leroy




Le 03/09/2020 à 18:12, Christoph Hellwig a écrit :

On Thu, Sep 03, 2020 at 01:57:39PM +0300, Andy Shevchenko wrote:

+{
+   if (!dev)
+   return (U32_MAX >> page_shift) + 1;
+   return (dma_get_seg_boundary(dev) >> page_shift) + 1;


Can it be better to do something like
   unsigned long boundary = dev ? dma_get_seg_boundary(dev) : U32_MAX;

   return (boundary >> page_shift) + 1;

?


I don't really see what that would buy us.



I guess it would avoid the duplication of the ">> page_shift) + 1" part.
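
i.e. the helper would collapse to something like (sketch):

static inline unsigned long dma_get_seg_boundary_nr_pages(struct device *dev,
		unsigned int page_shift)
{
	unsigned long boundary = dev ? dma_get_seg_boundary(dev) : U32_MAX;

	return (boundary >> page_shift) + 1;
}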

Christophe


Re: [PATCH 0/2] dma-mapping: update default segment_boundary_mask

2020-09-03 Thread Christoph Hellwig
Applied with the recommendation from Michael folded in.


Re: [PATCH 1/2] dma-mapping: introduce dma_get_seg_boundary_nr_pages()

2020-09-03 Thread Christoph Hellwig
On Thu, Sep 03, 2020 at 01:57:39PM +0300, Andy Shevchenko wrote:
> > +{
> > +   if (!dev)
> > +   return (U32_MAX >> page_shift) + 1;
> > +   return (dma_get_seg_boundary(dev) >> page_shift) + 1;
> 
> Can it be better to do something like
>   unsigned long boundary = dev ? dma_get_seg_boundary(dev) : U32_MAX;
> 
>   return (boundary >> page_shift) + 1;
> 
> ?

I don't really see what that would buy us.


Re: [PATCH 14/14] powerpc: remove address space overrides using set_fs()

2020-09-03 Thread Christophe Leroy




Le 03/09/2020 à 17:56, Christoph Hellwig a écrit :

On Thu, Sep 03, 2020 at 05:49:09PM +0200, Christoph Hellwig wrote:

On Thu, Sep 03, 2020 at 05:43:25PM +0200, Christophe Leroy wrote:



Le 03/09/2020 à 16:22, Christoph Hellwig a écrit :

Stop providing the possibility to override the address space using
set_fs() now that there is no need for that any more.

Signed-off-by: Christoph Hellwig 
---




   -static inline int __access_ok(unsigned long addr, unsigned long size,
-   mm_segment_t seg)
+static inline bool __access_ok(unsigned long addr, unsigned long size)
   {
-   if (addr > seg.seg)
-   return 0;
-   return (size == 0 || size - 1 <= seg.seg - addr);
+   if (addr >= TASK_SIZE_MAX)
+   return false;
+   return size == 0 || size <= TASK_SIZE_MAX - addr;
   }


You don't need to test size == 0 anymore. It used to be necessary because
of the 'size - 1', as size is unsigned.

Now you can directly do

return size <= TASK_SIZE_MAX - addr;

If size is 0, this will always be true (because you already know that addr
is not >= TASK_SIZE_MAX).


True.  What do you think of Linus' comment about always using the
ppc32 version on ppc64 as well with this?


I have nothing against it. That's only adding a subtraction; all args are 
already in registers, so it will be in the noise on a modern CPU.




i.e. something like this folded in:

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 5363f7fc6dd06c..be070254e50943 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -11,26 +11,14 @@
  #ifdef __powerpc64__
  /* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */
  #define TASK_SIZE_MAX TASK_SIZE_USER64
-
-/*
- * This check is sufficient because there is a large enough gap between user
- * addresses and the kernel addresses.
- */
-static inline bool __access_ok(unsigned long addr, unsigned long size)
-{
-   return addr < TASK_SIZE_MAX && size < TASK_SIZE_MAX;
-}
-
  #else
  #define TASK_SIZE_MAX TASK_SIZE
+#endif
  
  static inline bool __access_ok(unsigned long addr, unsigned long size)

  {
-   if (addr >= TASK_SIZE_MAX)
-   return false;
-   return size == 0 || size <= TASK_SIZE_MAX - addr;
+   return addr < TASK_SIZE_MAX && size <= TASK_SIZE_MAX - addr;
  }
-#endif /* __powerpc64__ */
  
  #define access_ok(addr, size)		\

(__chk_user_ptr(addr),  \




Christophe


Re: [PATCH 14/14] powerpc: remove address space overrides using set_fs()

2020-09-03 Thread Christoph Hellwig
On Thu, Sep 03, 2020 at 05:49:09PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 03, 2020 at 05:43:25PM +0200, Christophe Leroy wrote:
> >
> >
> > Le 03/09/2020 à 16:22, Christoph Hellwig a écrit :
> >> Stop providing the possibility to override the address space using
> >> set_fs() now that there is no need for that any more.
> >>
> >> Signed-off-by: Christoph Hellwig 
> >> ---
> >
> >
> >>   -static inline int __access_ok(unsigned long addr, unsigned long size,
> >> -  mm_segment_t seg)
> >> +static inline bool __access_ok(unsigned long addr, unsigned long size)
> >>   {
> >> -  if (addr > seg.seg)
> >> -  return 0;
> >> -  return (size == 0 || size - 1 <= seg.seg - addr);
> >> +  if (addr >= TASK_SIZE_MAX)
> >> +  return false;
> >> +  return size == 0 || size <= TASK_SIZE_MAX - addr;
> >>   }
> >
> > You don't need to test size == 0 anymore. It used to be necessary because 
> > of the 'size - 1', as size is unsigned.
> >
> > Now you can directly do
> >
> > return size <= TASK_SIZE_MAX - addr;
> >
> > If size is 0, this will always be true (because you already know that addr 
> > is not >= TASK_SIZE_MAX).
> 
> True.  What do you think of Linus' comment about always using the
> ppc32 version on ppc64 as well with this?

i.e. something like this folded in:

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 5363f7fc6dd06c..be070254e50943 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -11,26 +11,14 @@
 #ifdef __powerpc64__
 /* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */
 #define TASK_SIZE_MAX  TASK_SIZE_USER64
-
-/*
- * This check is sufficient because there is a large enough gap between user
- * addresses and the kernel addresses.
- */
-static inline bool __access_ok(unsigned long addr, unsigned long size)
-{
-   return addr < TASK_SIZE_MAX && size < TASK_SIZE_MAX;
-}
-
 #else
 #define TASK_SIZE_MAX  TASK_SIZE
+#endif
 
 static inline bool __access_ok(unsigned long addr, unsigned long size)
 {
-   if (addr >= TASK_SIZE_MAX)
-   return false;
-   return size == 0 || size <= TASK_SIZE_MAX - addr;
+   return addr < TASK_SIZE_MAX && size <= TASK_SIZE_MAX - addr;
 }
-#endif /* __powerpc64__ */
 
 #define access_ok(addr, size)  \
(__chk_user_ptr(addr),  \


Re: [PATCH v2] cpuidle-pseries: Fix CEDE latency conversion from tb to us

2020-09-03 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2020-09-03 14:57:27]:

> From: "Gautham R. Shenoy" 
> 
> commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> of the Extended CEDE states advertised by the platform. The values
> advertised by the platform are in timebase ticks. However the cpuidle
> framework requires the latency values in microseconds.
> 
> If the tb-ticks value advertised by the platform correspond to a value
> smaller than 1us, during the conversion from tb-ticks to microseconds,
> in the current code, the result becomes zero. This is incorrect as it
> puts a CEDE state on par with the snooze state.
> 
> This patch fixes this by rounding up the result obtained while
> converting the latency value from tb-ticks to microseconds. It also
> prints a warning in case we discover an extended-cede state with
> wakeup latency to be 0. In such a case, ensure that CEDE(0) has a
> non-zero wakeup latency.
> 
> Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)")
> 
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
> v1-->v2: Added a warning if a CEDE state has 0 wakeup latency (Suggested by 
> Joel Stanley)
>  Also added code to ensure that CEDE(0) has a non-zero wakeup 
> latency. 
>  drivers/cpuidle/cpuidle-pseries.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> b/drivers/cpuidle/cpuidle-pseries.c
> index ff6d99e..a2b5c6f 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -361,7 +361,10 @@ static void __init fixup_cede0_latency(void)
>   for (i = 0; i < nr_xcede_records; i++) {
>   struct xcede_latency_record *record = &payload->records[i];
>   u64 latency_tb = be64_to_cpu(record->latency_ticks);
> - u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
> + u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), 
> NSEC_PER_USEC);

This would fix the issue of rounding down to zero.
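
For example, with the usual 512 MHz timebase a latency of 100 tb-ticks
is ~195 ns: the old conversion truncated that to 0us, while
DIV_ROUND_UP_ULL() now yields 1us.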

> +
> + if (latency_us == 0)
> + pr_warn("cpuidle: xcede record %d has an unrealistic 
> latency of 0us.\n", i);

+1  This should not happen.

>   if (latency_us < min_latency_us)
>   min_latency_us = latency_us;
> @@ -378,10 +381,14 @@ static void __init fixup_cede0_latency(void)
>* Perform the fix-up.
>*/
>   if (min_latency_us < dedicated_states[1].exit_latency) {
> - u64 cede0_latency = min_latency_us - 1;
> + /*
> +  * We set a minimum of 1us wakeup latency for cede0 to
> +  * distinguish it from snooze
> +  */
> + u64 cede0_latency = 1;
> 
> - if (cede0_latency <= 0)
> - cede0_latency = min_latency_us;
> + if (min_latency_us > cede0_latency)
> + cede0_latency = min_latency_us - 1;

Good checks to expect cede latency of 1us or more.

>   dedicated_states[1].exit_latency = cede0_latency;
>   dedicated_states[1].target_residency = 10 * (cede0_latency);

--Vaidy



Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-03 Thread Linus Torvalds
On Thu, Sep 3, 2020 at 7:22 AM Christoph Hellwig  wrote:
>
> [Note to Linus: I'd like to get this into linux-next rather earlier
> than later.  Do you think it is ok to add this tree to linux-next?]

This whole series looks really good to me now, with each patch looking
like a clear cleanup on its own.

So ack on the whole series, and yes, please get this into linux-next
and ready for 5.10. Whether through Al or something else.

Thanks,

   Linus


Re: [PATCH 14/14] powerpc: remove address space overrides using set_fs()

2020-09-03 Thread Christoph Hellwig
On Thu, Sep 03, 2020 at 05:43:25PM +0200, Christophe Leroy wrote:
>
>
> Le 03/09/2020 à 16:22, Christoph Hellwig a écrit :
>> Stop providing the possibility to override the address space using
>> set_fs() now that there is no need for that any more.
>>
>> Signed-off-by: Christoph Hellwig 
>> ---
>
>
>>   -static inline int __access_ok(unsigned long addr, unsigned long size,
>> -mm_segment_t seg)
>> +static inline bool __access_ok(unsigned long addr, unsigned long size)
>>   {
>> -if (addr > seg.seg)
>> -return 0;
>> -return (size == 0 || size - 1 <= seg.seg - addr);
>> +if (addr >= TASK_SIZE_MAX)
>> +return false;
>> +return size == 0 || size <= TASK_SIZE_MAX - addr;
>>   }
>
> You don't need to test size == 0 anymore. It used to be necessary because 
> of the 'size - 1', as size is unsigned.
>
> Now you can directly do
>
>   return size <= TASK_SIZE_MAX - addr;
>
> If size is 0, this will always be true (because you already know that addr 
> is not >= TASK_SIZE_MAX).

True.  What do you think of Linus' comment about always using the
ppc32 version on ppc64 as well with this?


Re: [PATCH 14/14] powerpc: remove address space overrides using set_fs()

2020-09-03 Thread Christophe Leroy




Le 03/09/2020 à 16:22, Christoph Hellwig a écrit :

Stop providing the possibility to override the address space using
set_fs() now that there is no need for that any more.

Signed-off-by: Christoph Hellwig 
---



  
-static inline int __access_ok(unsigned long addr, unsigned long size,

-   mm_segment_t seg)
+static inline bool __access_ok(unsigned long addr, unsigned long size)
  {
-   if (addr > seg.seg)
-   return 0;
-   return (size == 0 || size - 1 <= seg.seg - addr);
+   if (addr >= TASK_SIZE_MAX)
+   return false;
+   return size == 0 || size <= TASK_SIZE_MAX - addr;
  }


You don't need to test size == 0 anymore. It used to be necessary 
because of the 'size - 1', as size is unsigned.


Now you can directly do

return size <= TASK_SIZE_MAX - addr;

If size is 0, this will always be true (because you already know that 
addr is not >= TASK_SIZE_MAX).


Christophe


Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-03 Thread Christoph Hellwig
On Thu, Sep 03, 2020 at 03:36:29PM +0100, Al Viro wrote:
> FWIW, vfs.git#for-next is always a merge of independent branches; I don't
> put stuff directly into #for-next - too easy to lose that way.
> 
> IOW, that would be something like #base.set_fs, included into #for-next
> merge set.  And I've no problem with never-rebased branches...
> 
> Where in the mainline are the most recent prereqs of this series?

I can't think of anything past -rc1, but I haven't actually checked.


Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-03 Thread Al Viro
On Thu, Sep 03, 2020 at 04:30:03PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 03, 2020 at 03:28:03PM +0100, Al Viro wrote:
> > On Thu, Sep 03, 2020 at 04:22:28PM +0200, Christoph Hellwig wrote:
> > 
> > > Besides x86 and powerpc I plan to eventually convert all other
> > > architectures, although this will be a slow process, starting with the
> > > easier ones once the infrastructure is merged.  The process to convert
> > > architectures is roughly:
> > > 
> > >  (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
> > >  (2) implement __get_kernel_nofault and __put_kernel_nofault
> > >  (3) remove the arch specific address limitation functionality
> > 
> > The one to really watch out for is sparc; I have something in that
> > direction, will resurrect as soon as I'm done with eventpoll analysis...
> > 
> > I can live with this series; do you want that in vfs.git#for-next?
> 
> Either that or a separate tree is fine with me.  It would be good to
> eventually have a non-rebased stable tree so that other arch trees
> can work from it, though.

FWIW, vfs.git#for-next is always a merge of independent branches; I don't
put stuff directly into #for-next - too easy to lose that way.

IOW, that would be something like #base.set_fs, included into #for-next
merge set.  And I've no problem with never-rebased branches...

Where in the mainline are the most recent prereqs of this series?


Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-03 Thread Christoph Hellwig
On Thu, Sep 03, 2020 at 03:28:03PM +0100, Al Viro wrote:
> On Thu, Sep 03, 2020 at 04:22:28PM +0200, Christoph Hellwig wrote:
> 
> > Besides x86 and powerpc I plan to eventually convert all other
> > architectures, although this will be a slow process, starting with the
> > easier ones once the infrastructure is merged.  The process to convert
> > architectures is roughly:
> > 
> >  (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
> >  (2) implement __get_kernel_nofault and __put_kernel_nofault
> >  (3) remove the arch specific address limitation functionality
> 
> The one to really watch out for is sparc; I have something in that
> direction, will resurrect as soon as I'm done with eventpoll analysis...
> 
> I can live with this series; do you want that in vfs.git#for-next?

Either that or a separate tree is fine with me.  It would be good to
eventually have a non-rebased stable tree so that other arch trees
can work from it, though.


Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-03 Thread Al Viro
On Thu, Sep 03, 2020 at 04:22:28PM +0200, Christoph Hellwig wrote:

> Besides x86 and powerpc I plan to eventually convert all other
> architectures, although this will be a slow process, starting with the
> easier ones once the infrastructure is merged.  The process to convert
> architectures is roughly:
> 
>  (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
>  (2) implement __get_kernel_nofault and __put_kernel_nofault
>  (3) remove the arch specific address limitation functionality

The one to really watch out for is sparc; I have something in that
direction, will resurrect as soon as I'm done with eventpoll analysis...

I can live with this series; do you want that in vfs.git#for-next?


[PATCH 14/14] powerpc: remove address space overrides using set_fs()

2020-09-03 Thread Christoph Hellwig
Stop providing the possibility to override the address space using
set_fs() now that there is no need for that any more.

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/Kconfig   |  1 -
 arch/powerpc/include/asm/processor.h   |  7 
 arch/powerpc/include/asm/thread_info.h |  5 +--
 arch/powerpc/include/asm/uaccess.h | 57 +++---
 arch/powerpc/kernel/signal.c   |  3 --
 arch/powerpc/lib/sstep.c   |  6 +--
 6 files changed, 18 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 4d1c18a3b83977..65bed1fdeaad71 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -249,7 +249,6 @@ config PPC
select PCI_SYSCALL  if PCI
select PPC_DAWR if PPC64
select RTC_LIB
-   select SET_FS
select SPARSE_IRQ
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index ed0d633ab5aa42..f01e4d650c520a 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -83,10 +83,6 @@ struct task_struct;
 void start_thread(struct pt_regs *regs, unsigned long fdptr, unsigned long sp);
 void release_thread(struct task_struct *);
 
-typedef struct {
-   unsigned long seg;
-} mm_segment_t;
-
 #define TS_FPR(i) fp_state.fpr[i][TS_FPROFFSET]
 #define TS_CKFPR(i) ckfp_state.fpr[i][TS_FPROFFSET]
 
@@ -148,7 +144,6 @@ struct thread_struct {
unsigned long   ksp_vsid;
 #endif
struct pt_regs  *regs;  /* Pointer to saved register state */
-   mm_segment_taddr_limit; /* for get_fs() validation */
 #ifdef CONFIG_BOOKE
/* BookE base exception scratch space; align on cacheline */
unsigned long   normsave[8] cacheline_aligned;
@@ -295,7 +290,6 @@ struct thread_struct {
 #define INIT_THREAD { \
.ksp = INIT_SP, \
.ksp_limit = INIT_SP_LIMIT, \
-   .addr_limit = KERNEL_DS, \
.pgdir = swapper_pg_dir, \
.fpexc_mode = MSR_FE0 | MSR_FE1, \
SPEFSCR_INIT \
@@ -303,7 +297,6 @@ struct thread_struct {
 #else
 #define INIT_THREAD  { \
.ksp = INIT_SP, \
-   .addr_limit = KERNEL_DS, \
.fpexc_mode = 0, \
 }
 #endif
diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index ca6c9702570494..46a210b03d2b80 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -90,7 +90,6 @@ void arch_setup_new_exec(void);
 #define TIF_SYSCALL_TRACE  0   /* syscall trace active */
 #define TIF_SIGPENDING 1   /* signal pending */
 #define TIF_NEED_RESCHED   2   /* rescheduling necessary */
-#define TIF_FSCHECK3   /* Check FS is USER_DS on return */
 #define TIF_SYSCALL_EMU4   /* syscall emulation active */
 #define TIF_RESTORE_TM 5   /* need to restore TM FP/VEC/VSX */
 #define TIF_PATCH_PENDING  6   /* pending live patching update */
@@ -130,7 +129,6 @@ void arch_setup_new_exec(void);
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
 #include 
 
-/*
- * The fs value determines whether argument validity checking should be
- * performed or not.  If get_fs() == USER_DS, checking is performed, with
- * get_fs() == KERNEL_DS, checking is bypassed.
- *
- * For historical reasons, these macros are grossly misnamed.
- *
- * The fs/ds values are now the highest legal address in the "segment".
- * This simplifies the checking in the routines below.
- */
-
-#define MAKE_MM_SEG(s)  ((mm_segment_t) { (s) })
-
-#define KERNEL_DS  MAKE_MM_SEG(~0UL)
 #ifdef __powerpc64__
 /* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */
-#define USER_DSMAKE_MM_SEG(TASK_SIZE_USER64 - 1)
-#else
-#define USER_DSMAKE_MM_SEG(TASK_SIZE - 1)
-#endif
-
-#define get_fs()   (current->thread.addr_limit)
-
-static inline void set_fs(mm_segment_t fs)
-{
-   current->thread.addr_limit = fs;
-   /* On user-mode return check addr_limit (fs) is correct */
-   set_thread_flag(TIF_FSCHECK);
-}
-
-#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg)
-#define user_addr_max()(get_fs().seg)
+#define TASK_SIZE_MAX  TASK_SIZE_USER64
 
-#ifdef __powerpc64__
 /*
- * This check is sufficient because there is a large enough
- * gap between user addresses and the kernel addresses
+ * This check is sufficient because there is a large enough gap between user
+ * addresses and the kernel addresses.
  */
-#define __access_ok(addr, size, segment)   \
-   (((addr) <= (segment).seg) && ((size) <= (segment).seg))
+static inline bool __access_ok(unsigned long addr, unsigned long size)
+{
+   return addr < TASK_SIZE_MAX && size < TASK_SIZE_MAX;
+}
 
 #else
+#define TASK_SIZE_MAX  TASK_SIZE
 
-static inline int __access_ok(unsigned long addr, unsigned long size,
-					mm_segment_t seg)

[PATCH 13/14] powerpc: use non-set_fs based maccess routines

2020-09-03 Thread Christoph Hellwig
Provide __get_kernel_nofault and __put_kernel_nofault routines to
implement the maccess routines without messing with set_fs and without
opening up access to user space.

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/include/asm/uaccess.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 00699903f1efca..7fe3531ad36a77 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -623,4 +623,20 @@ do {   
\
__put_user_goto(*(u8*)(_src + _i), (u8 __user *)(_dst + _i), 
e);\
 } while (0)
 
+#define HAVE_GET_KERNEL_NOFAULT
+
+#define __get_kernel_nofault(dst, src, type, err_label)
\
+do {   \
+   int __kr_err;   \
+   \
+   __get_user_size_allowed(*((type *)(dst)), (__force type __user *)(src),\
+   sizeof(type), __kr_err);\
+   if (unlikely(__kr_err)) \
+   goto err_label; \
+} while (0)
+
+#define __put_kernel_nofault(dst, src, type, err_label)
\
+   __put_user_size_goto(*((type *)(src)),  \
+   (__force type __user *)(dst), sizeof(type), err_label)
+
 #endif /* _ARCH_POWERPC_UACCESS_H */
-- 
2.28.0
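
For reference, a minimal sketch of a caller exercising the new hooks
(the helper name is made up; real users go through
copy_{from,to}_kernel_nofault()):

static int read_kernel_byte(const void *addr, u8 *val)
{
	pagefault_disable();
	__get_kernel_nofault(val, addr, u8, Efault);
	pagefault_enable();
	return 0;
Efault:
	pagefault_enable();
	return -EFAULT;
}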



[PATCH 07/14] uaccess: add infrastructure for kernel builds with set_fs()

2020-09-03 Thread Christoph Hellwig
Add a CONFIG_SET_FS option that is selected by architectures that
implement set_fs, which is all of them initially.  If the option is not
set, stubs for the routines related to overriding the address space are
provided so that architectures can start to opt out of providing set_fs.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Kees Cook 
---
 arch/Kconfig|  3 +++
 arch/alpha/Kconfig  |  1 +
 arch/arc/Kconfig|  1 +
 arch/arm/Kconfig|  1 +
 arch/arm64/Kconfig  |  1 +
 arch/c6x/Kconfig|  1 +
 arch/csky/Kconfig   |  1 +
 arch/h8300/Kconfig  |  1 +
 arch/hexagon/Kconfig|  1 +
 arch/ia64/Kconfig   |  1 +
 arch/m68k/Kconfig   |  1 +
 arch/microblaze/Kconfig |  1 +
 arch/mips/Kconfig   |  1 +
 arch/nds32/Kconfig  |  1 +
 arch/nios2/Kconfig  |  1 +
 arch/openrisc/Kconfig   |  1 +
 arch/parisc/Kconfig |  1 +
 arch/powerpc/Kconfig|  1 +
 arch/riscv/Kconfig  |  1 +
 arch/s390/Kconfig   |  1 +
 arch/sh/Kconfig |  1 +
 arch/sparc/Kconfig  |  1 +
 arch/um/Kconfig |  1 +
 arch/x86/Kconfig|  1 +
 arch/xtensa/Kconfig |  1 +
 include/linux/uaccess.h | 18 ++
 26 files changed, 45 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493fc..3fab619a6aa51a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -24,6 +24,9 @@ config KEXEC_ELF
 config HAVE_IMA_KEXEC
bool
 
+config SET_FS
+   bool
+
 config HOTPLUG_SMT
bool
 
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 9c5f06e8eb9bc0..d6e9fc7a7b19e2 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -39,6 +39,7 @@ config ALPHA
select OLD_SIGSUSPEND
select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67
select MMU_GATHER_NO_RANGE
+   select SET_FS
help
  The Alpha is a 64-bit general-purpose processor designed and
  marketed by the Digital Equipment Corporation of blessed memory,
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index ba00c4e1e1c271..c49f5754a11e40 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -48,6 +48,7 @@ config ARC
select PCI_SYSCALL if PCI
select PERF_USE_VMALLOC if ARC_CACHE_VIPT_ALIASING
select HAVE_ARCH_JUMP_LABEL if ISA_ARCV2 && !CPU_ENDIAN_BE32
+   select SET_FS
 
 config ARCH_HAS_CACHE_LINE_SIZE
def_bool y
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e00d94b1665876..87e1478a42dc4f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -118,6 +118,7 @@ config ARM
select PCI_SYSCALL if PCI
select PERF_USE_VMALLOC
select RTC_LIB
+   select SET_FS
select SYS_SUPPORTS_APM_EMULATION
# Above selects are sorted alphabetically; please add new ones
# according to that.  Thanks.
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbeee8..fbd9e35bef096f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -192,6 +192,7 @@ config ARM64
select PCI_SYSCALL if PCI
select POWER_RESET
select POWER_SUPPLY
+   select SET_FS
select SPARSE_IRQ
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
diff --git a/arch/c6x/Kconfig b/arch/c6x/Kconfig
index 6444ebfd06a665..48d66bf0465d68 100644
--- a/arch/c6x/Kconfig
+++ b/arch/c6x/Kconfig
@@ -22,6 +22,7 @@ config C6X
select GENERIC_CLOCKEVENTS
select MODULES_USE_ELF_RELA
select MMU_GATHER_NO_RANGE if MMU
+   select SET_FS
 
 config MMU
def_bool n
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 3d5afb5f568543..2836f6e76fdb2d 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -78,6 +78,7 @@ config CSKY
select PCI_DOMAINS_GENERIC if PCI
select PCI_SYSCALL if PCI
select PCI_MSI if PCI
+   select SET_FS
 
 config LOCKDEP_SUPPORT
def_bool y
diff --git a/arch/h8300/Kconfig b/arch/h8300/Kconfig
index d11666d538fea8..7945de067e9fcc 100644
--- a/arch/h8300/Kconfig
+++ b/arch/h8300/Kconfig
@@ -25,6 +25,7 @@ config H8300
select HAVE_ARCH_KGDB
select HAVE_ARCH_HASH
select CPU_NO_EFFICIENT_FFS
+   select SET_FS
select UACCESS_MEMCPY
 
 config CPU_BIG_ENDIAN
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index 667cfc511cf999..f2afabbadd430e 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -31,6 +31,7 @@ config HEXAGON
select GENERIC_CLOCKEVENTS_BROADCAST
select MODULES_USE_ELF_RELA
select GENERIC_CPU_DEVICES
+   select SET_FS
help
  Qualcomm Hexagon is a processor architecture designed for high
  performance and low power across a wide variety of applications.
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 5b4ec80bf5863a..22a6853840e235 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -56,6 +56,7 @@ config IA64
select NEED_DMA_MAP_STATE
select NEED_SG_DMA_LENGTH
select NUMA if !FLATMEM
+   select SET_FS
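
The include/linux/uaccess.h side of this amounts to roughly the
following (a reconstructed sketch, not the verbatim hunk):

#ifndef CONFIG_SET_FS
typedef struct {
	/* empty dummy */
} mm_segment_t;

#ifndef TASK_SIZE_MAX
#define TASK_SIZE_MAX			TASK_SIZE
#endif

#define uaccess_kernel()		(false)
#define user_addr_max()			(TASK_SIZE_MAX)

static inline mm_segment_t force_uaccess_begin(void)
{
	return (mm_segment_t) { };
}

static inline void force_uaccess_end(mm_segment_t oldfs)
{
}
#endif /* CONFIG_SET_FS */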
  

[PATCH 12/14] x86: remove address space overrides using set_fs()

2020-09-03 Thread Christoph Hellwig
Stop providing the possibility to override the address space using
set_fs() now that there is no need for that any more.  To properly
handle the TASK_SIZE_MAX checking for 4 vs 5-level page tables on
x86 a new alternative is introduced, which just like the one in
entry_64.S has to use the hardcoded virtual address bits to escape
the fact that TASK_SIZE_MAX isn't actually a constant when 5-level
page tables are enabled.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Kees Cook 
---
 arch/x86/Kconfig   |  1 -
 arch/x86/ia32/ia32_aout.c  |  1 -
 arch/x86/include/asm/processor.h   | 11 +--
 arch/x86/include/asm/thread_info.h |  2 --
 arch/x86/include/asm/uaccess.h | 26 +
 arch/x86/kernel/asm-offsets.c  |  3 --
 arch/x86/lib/getuser.S | 47 +++---
 arch/x86/lib/putuser.S | 25 
 8 files changed, 39 insertions(+), 77 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f85c13355732fe..7101ac64bb209d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -237,7 +237,6 @@ config X86
select HAVE_ARCH_KCSAN  if X86_64
select X86_FEATURE_NAMESif PROC_FS
select PROC_PID_ARCH_STATUS if PROC_FS
-   select SET_FS
imply IMA_SECURE_AND_OR_TRUSTED_BOOTif EFI
 
 config INSTRUCTION_DECODER
diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index ca8a657edf5977..a09fc37ead9d47 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -239,7 +239,6 @@ static int load_aout_binary(struct linux_binprm *bprm)
(regs)->ss = __USER32_DS;
regs->r8 = regs->r9 = regs->r10 = regs->r11 =
regs->r12 = regs->r13 = regs->r14 = regs->r15 = 0;
-   set_fs(USER_DS);
return 0;
 }
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 1618eeb08361a9..189573d95c3af6 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -482,10 +482,6 @@ extern unsigned int fpu_user_xstate_size;
 
 struct perf_event;
 
-typedef struct {
-   unsigned long   seg;
-} mm_segment_t;
-
 struct thread_struct {
/* Cached TLS descriptors: */
struct desc_struct  tls_array[GDT_ENTRY_TLS_ENTRIES];
@@ -538,8 +534,6 @@ struct thread_struct {
 */
unsigned long   iopl_emul;
 
-   mm_segment_taddr_limit;
-
unsigned intsig_on_uaccess_err:1;
 
/* Floating point and extended processor state */
@@ -785,15 +779,12 @@ static inline void spin_lock_prefetch(const void *x)
 #define INIT_THREAD  {   \
.sp0= TOP_OF_INIT_STACK,  \
.sysenter_cs= __KERNEL_CS,\
-   .addr_limit = KERNEL_DS,  \
 }
 
 #define KSTK_ESP(task) (task_pt_regs(task)->sp)
 
 #else
-#define INIT_THREAD  { \
-   .addr_limit = KERNEL_DS,\
-}
+#define INIT_THREAD { }
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
 
diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index 267701ae3d86dd..44733a4bfc4294 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -102,7 +102,6 @@ struct thread_info {
 #define TIF_SYSCALL_TRACEPOINT 28  /* syscall tracepoint instrumentation */
 #define TIF_ADDR32 29  /* 32-bit address space on 64 bits */
 #define TIF_X3230  /* 32-bit native x86-64 binary 
*/
-#define TIF_FSCHECK31  /* Check FS is USER_DS on return */
 
 #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -131,7 +130,6 @@ struct thread_info {
 #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32(1 << TIF_ADDR32)
 #define _TIF_X32   (1 << TIF_X32)
-#define _TIF_FSCHECK   (1 << TIF_FSCHECK)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE   \
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index ecefaffd15d4c8..a4ceda0510ea87 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -12,30 +12,6 @@
 #include 
 #include 
 
-/*
- * The fs value determines whether argument validity checking should be
- * performed or not.  If get_fs() == USER_DS, checking is performed, with
- * get_fs() == KERNEL_DS, checking is bypassed.
- *
- * For historical reasons, these macros are grossly misnamed.
- */
-
-#define MAKE_MM_SEG(s) ((mm_segment_t) { (s) })
-
-#define KERNEL_DS  MAKE_MM_SEG(-1UL)
-#define USER_DSMAKE_MM_SEG(TASK_SIZE_MAX)
-
-#define get_fs()	(current->thread.addr_limit)

[PATCH 11/14] x86: make TASK_SIZE_MAX usable from assembly code

2020-09-03 Thread Christoph Hellwig
For 64-bit the only thing missing was a strategic _AC, and for 32-bit we
need to use __PAGE_OFFSET instead of PAGE_OFFSET in the TASK_SIZE
definition to escape the explicit unsigned long cast.  This just works
because __PAGE_OFFSET is defined using _AC itself and thus never needs
the cast anyway.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Kees Cook 
---
 arch/x86/include/asm/page_32_types.h | 4 ++--
 arch/x86/include/asm/page_64_types.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/page_32_types.h 
b/arch/x86/include/asm/page_32_types.h
index 26236925fb2c36..f462895a33e452 100644
--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -44,8 +44,8 @@
 /*
  * User space process size: 3GB (default).
  */
-#define IA32_PAGE_OFFSET   PAGE_OFFSET
-#define TASK_SIZE  PAGE_OFFSET
+#define IA32_PAGE_OFFSET   __PAGE_OFFSET
+#define TASK_SIZE  __PAGE_OFFSET
 #define TASK_SIZE_LOW  TASK_SIZE
 #define TASK_SIZE_MAX  TASK_SIZE
 #define DEFAULT_MAP_WINDOW TASK_SIZE
diff --git a/arch/x86/include/asm/page_64_types.h 
b/arch/x86/include/asm/page_64_types.h
index 996595c9897e0a..838515daf87b36 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -76,7 +76,7 @@
  *
  * With page table isolation enabled, we map the LDT in ... [stay tuned]
  */
-#define TASK_SIZE_MAX  ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+#define TASK_SIZE_MAX  ((_AC(1,UL) << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
 
 #define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
 
-- 
2.28.0
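
For reference, _AC comes from include/uapi/linux/const.h: in C it
pastes the type suffix onto the constant, while in assembly it leaves
the bare number alone, which is what makes TASK_SIZE_MAX usable from
both:

#ifdef __ASSEMBLY__
#define _AC(X,Y)	X
#else
#define __AC(X,Y)	(X##Y)
#define _AC(X,Y)	__AC(X,Y)
#endif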



[PATCH 10/14] x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32, 64}_types.h

2020-09-03 Thread Christoph Hellwig
At least for 64-bit this moves them closer to some of the defines
they are based on, and it prepares for using the TASK_SIZE_MAX
definition from assembly.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Kees Cook 
---
 arch/x86/include/asm/page_32_types.h | 11 +++
 arch/x86/include/asm/page_64_types.h | 38 +
 arch/x86/include/asm/processor.h | 49 
 3 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/page_32_types.h 
b/arch/x86/include/asm/page_32_types.h
index 565ad755c785e2..26236925fb2c36 100644
--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -41,6 +41,17 @@
 #define __VIRTUAL_MASK_SHIFT   32
 #endif /* CONFIG_X86_PAE */
 
+/*
+ * User space process size: 3GB (default).
+ */
+#define IA32_PAGE_OFFSET   PAGE_OFFSET
+#define TASK_SIZE  PAGE_OFFSET
+#define TASK_SIZE_LOW  TASK_SIZE
+#define TASK_SIZE_MAX  TASK_SIZE
+#define DEFAULT_MAP_WINDOW TASK_SIZE
+#define STACK_TOP  TASK_SIZE
+#define STACK_TOP_MAX  STACK_TOP
+
 /*
  * Kernel image size is limited to 512 MB (see in arch/x86/kernel/head_32.S)
  */
diff --git a/arch/x86/include/asm/page_64_types.h 
b/arch/x86/include/asm/page_64_types.h
index 288b065955b729..996595c9897e0a 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -58,6 +58,44 @@
 #define __VIRTUAL_MASK_SHIFT   47
 #endif
 
+/*
+ * User space process size.  This is the first address outside the user range.
+ * There are a few constraints that determine this:
+ *
+ * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
+ * address, then that syscall will enter the kernel with a
+ * non-canonical return address, and SYSRET will explode dangerously.
+ * We avoid this particular problem by preventing anything executable
+ * from being mapped at the maximum canonical address.
+ *
+ * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
+ * CPUs malfunction if they execute code from the highest canonical page.
+ * They'll speculate right off the end of the canonical space, and
+ * bad things happen.  This is worked around in the same way as the
+ * Intel problem.
+ *
+ * With page table isolation enabled, we map the LDT in ... [stay tuned]
+ */
+#define TASK_SIZE_MAX  ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
+
+/* This decides where the kernel will search for a free chunk of vm
+ * space during mmap's.
+ */
+#define IA32_PAGE_OFFSET   ((current->personality & ADDR_LIMIT_3GB) ? \
+   0xc000 : 0xe000)
+
+#define TASK_SIZE_LOW  (test_thread_flag(TIF_ADDR32) ? \
+   IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
+#define TASK_SIZE  (test_thread_flag(TIF_ADDR32) ? \
+   IA32_PAGE_OFFSET : TASK_SIZE_MAX)
+#define TASK_SIZE_OF(child)((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
+   IA32_PAGE_OFFSET : TASK_SIZE_MAX)
+
+#define STACK_TOP  TASK_SIZE_LOW
+#define STACK_TOP_MAX  TASK_SIZE_MAX
+
 /*
  * Maximum kernel image size is limited to 1 GiB, due to the fixmap living
  * in the next 1 GiB (see level2_kernel_pgt in arch/x86/kernel/head_64.S).
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 97143d87994c24..1618eeb08361a9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -782,17 +782,6 @@ static inline void spin_lock_prefetch(const void *x)
 })
 
 #ifdef CONFIG_X86_32
-/*
- * User space process size: 3GB (default).
- */
-#define IA32_PAGE_OFFSET   PAGE_OFFSET
-#define TASK_SIZE  PAGE_OFFSET
-#define TASK_SIZE_LOW  TASK_SIZE
-#define TASK_SIZE_MAX  TASK_SIZE
-#define DEFAULT_MAP_WINDOW TASK_SIZE
-#define STACK_TOP  TASK_SIZE
-#define STACK_TOP_MAX  STACK_TOP
-
 #define INIT_THREAD  {   \
.sp0= TOP_OF_INIT_STACK,  \
.sysenter_cs= __KERNEL_CS,\
@@ -802,44 +791,6 @@ static inline void spin_lock_prefetch(const void *x)
 #define KSTK_ESP(task) (task_pt_regs(task)->sp)
 
 #else
-/*
- * User space process size.  This is the first address outside the user range.
- * There are a few constraints that determine this:
- *
- * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
- * address, then that syscall will enter the kernel with a
- * non-canonical return address, and SYSRET will explode dangerously.
- * We avoid this particular problem by preventing anything executable
- * from being mapped at the maximum canonical address.
- *
- * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
- * CPUs malfunction if they execute code from the highest canonical page.

[PATCH 09/14] lkdtm: remove set_fs-based tests

2020-09-03 Thread Christoph Hellwig
Once we can't manipulate the address limit, we also can't test what
happens when the manipulation is abused.

Signed-off-by: Christoph Hellwig 
---
 drivers/misc/lkdtm/bugs.c   | 10 --
 drivers/misc/lkdtm/core.c   |  2 --
 drivers/misc/lkdtm/lkdtm.h  |  2 --
 drivers/misc/lkdtm/usercopy.c   | 15 ---
 tools/testing/selftests/lkdtm/tests.txt |  2 --
 5 files changed, 31 deletions(-)

diff --git a/drivers/misc/lkdtm/bugs.c b/drivers/misc/lkdtm/bugs.c
index 4dfbfd51bdf774..a0675d4154d2fd 100644
--- a/drivers/misc/lkdtm/bugs.c
+++ b/drivers/misc/lkdtm/bugs.c
@@ -312,16 +312,6 @@ void lkdtm_CORRUPT_LIST_DEL(void)
pr_err("list_del() corruption not detected!\n");
 }
 
-/* Test if unbalanced set_fs(KERNEL_DS)/set_fs(USER_DS) check exists. */
-void lkdtm_CORRUPT_USER_DS(void)
-{
-   pr_info("setting bad task size limit\n");
-   set_fs(KERNEL_DS);
-
-   /* Make sure we do not keep running with a KERNEL_DS! */
-   force_sig(SIGKILL);
-}
-
 /* Test that VMAP_STACK is actually allocating with a leading guard page */
 void lkdtm_STACK_GUARD_PAGE_LEADING(void)
 {
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index a5e344df916632..97803f213d9d45 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -112,7 +112,6 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(CORRUPT_STACK_STRONG),
CRASHTYPE(CORRUPT_LIST_ADD),
CRASHTYPE(CORRUPT_LIST_DEL),
-   CRASHTYPE(CORRUPT_USER_DS),
CRASHTYPE(STACK_GUARD_PAGE_LEADING),
CRASHTYPE(STACK_GUARD_PAGE_TRAILING),
CRASHTYPE(UNSET_SMEP),
@@ -172,7 +171,6 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(USERCOPY_STACK_FRAME_FROM),
CRASHTYPE(USERCOPY_STACK_BEYOND),
CRASHTYPE(USERCOPY_KERNEL),
-   CRASHTYPE(USERCOPY_KERNEL_DS),
CRASHTYPE(STACKLEAK_ERASING),
CRASHTYPE(CFI_FORWARD_PROTO),
 #ifdef CONFIG_X86_32
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 8878538b2c1322..6dec4c9b442ff3 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -27,7 +27,6 @@ void lkdtm_OVERFLOW_UNSIGNED(void);
 void lkdtm_ARRAY_BOUNDS(void);
 void lkdtm_CORRUPT_LIST_ADD(void);
 void lkdtm_CORRUPT_LIST_DEL(void);
-void lkdtm_CORRUPT_USER_DS(void);
 void lkdtm_STACK_GUARD_PAGE_LEADING(void);
 void lkdtm_STACK_GUARD_PAGE_TRAILING(void);
 void lkdtm_UNSET_SMEP(void);
@@ -96,7 +95,6 @@ void lkdtm_USERCOPY_STACK_FRAME_TO(void);
 void lkdtm_USERCOPY_STACK_FRAME_FROM(void);
 void lkdtm_USERCOPY_STACK_BEYOND(void);
 void lkdtm_USERCOPY_KERNEL(void);
-void lkdtm_USERCOPY_KERNEL_DS(void);
 
 /* lkdtm_stackleak.c */
 void lkdtm_STACKLEAK_ERASING(void);
diff --git a/drivers/misc/lkdtm/usercopy.c b/drivers/misc/lkdtm/usercopy.c
index b833367a45d053..109e8d4302c113 100644
--- a/drivers/misc/lkdtm/usercopy.c
+++ b/drivers/misc/lkdtm/usercopy.c
@@ -325,21 +325,6 @@ void lkdtm_USERCOPY_KERNEL(void)
vm_munmap(user_addr, PAGE_SIZE);
 }
 
-void lkdtm_USERCOPY_KERNEL_DS(void)
-{
-   char __user *user_ptr =
-   (char __user *)(0xFUL << (sizeof(unsigned long) * 8 - 4));
-   mm_segment_t old_fs = get_fs();
-   char buf[10] = {0};
-
-   pr_info("attempting copy_to_user() to noncanonical address: %px\n",
-   user_ptr);
-   set_fs(KERNEL_DS);
-   if (copy_to_user(user_ptr, buf, sizeof(buf)) == 0)
-   pr_err("copy_to_user() to noncanonical address succeeded!?\n");
-   set_fs(old_fs);
-}
-
 void __init lkdtm_usercopy_init(void)
 {
/* Prepare cache that lacks SLAB_USERCOPY flag. */
diff --git a/tools/testing/selftests/lkdtm/tests.txt 
b/tools/testing/selftests/lkdtm/tests.txt
index 9d266e79c6a270..74a8d329a72c80 100644
--- a/tools/testing/selftests/lkdtm/tests.txt
+++ b/tools/testing/selftests/lkdtm/tests.txt
@@ -9,7 +9,6 @@ EXCEPTION
 #CORRUPT_STACK_STRONG Crashes entire system on success
 CORRUPT_LIST_ADD list_add corruption
 CORRUPT_LIST_DEL list_del corruption
-CORRUPT_USER_DS Invalid address limit on user-mode return
 STACK_GUARD_PAGE_LEADING
 STACK_GUARD_PAGE_TRAILING
 UNSET_SMEP CR4 bits went missing
@@ -67,6 +66,5 @@ USERCOPY_STACK_FRAME_TO
 USERCOPY_STACK_FRAME_FROM
 USERCOPY_STACK_BEYOND
 USERCOPY_KERNEL
-USERCOPY_KERNEL_DS
 STACKLEAK_ERASING OK: the rest of the thread stack is properly erased
 CFI_FORWARD_PROTO
-- 
2.28.0



[PATCH 08/14] test_bitmap: remove user bitmap tests

2020-09-03 Thread Christoph Hellwig
We can't run the tests for userspace bitmap parsing if set_fs() doesn't
exist, and it is about to go away for x86 and powerpc, with other major
architectures to follow.

Signed-off-by: Christoph Hellwig 
---
 lib/test_bitmap.c | 91 +++
 1 file changed, 21 insertions(+), 70 deletions(-)

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index df903c53952bb9..4425a1dd4ef1c7 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -354,50 +354,37 @@ static const struct test_bitmap_parselist 
parselist_tests[] __initconst = {
 
 };
 
-static void __init __test_bitmap_parselist(int is_user)
+static void __init test_bitmap_parselist(void)
 {
int i;
int err;
ktime_t time;
DECLARE_BITMAP(bmap, 2048);
-   char *mode = is_user ? "_user"  : "";
 
for (i = 0; i < ARRAY_SIZE(parselist_tests); i++) {
 #define ptest parselist_tests[i]
 
-   if (is_user) {
-   mm_segment_t orig_fs = get_fs();
-   size_t len = strlen(ptest.in);
-
-   set_fs(KERNEL_DS);
-   time = ktime_get();
-   err = bitmap_parselist_user((__force const char __user 
*)ptest.in, len,
-   bmap, ptest.nbits);
-   time = ktime_get() - time;
-   set_fs(orig_fs);
-   } else {
-   time = ktime_get();
-   err = bitmap_parselist(ptest.in, bmap, ptest.nbits);
-   time = ktime_get() - time;
-   }
+   time = ktime_get();
+   err = bitmap_parselist(ptest.in, bmap, ptest.nbits);
+   time = ktime_get() - time;
 
if (err != ptest.errno) {
-   pr_err("parselist%s: %d: input is %s, errno is %d, 
expected %d\n",
-   mode, i, ptest.in, err, ptest.errno);
+   pr_err("parselist: %d: input is %s, errno is %d, 
expected %d\n",
+   i, ptest.in, err, ptest.errno);
continue;
}
 
if (!err && ptest.expected
 && !__bitmap_equal(bmap, ptest.expected, ptest.nbits)) 
{
-   pr_err("parselist%s: %d: input is %s, result is 0x%lx, 
expected 0x%lx\n",
-   mode, i, ptest.in, bmap[0],
+   pr_err("parselist: %d: input is %s, result is 0x%lx, 
expected 0x%lx\n",
+   i, ptest.in, bmap[0],
*ptest.expected);
continue;
}
 
if (ptest.flags & PARSE_TIME)
-   pr_err("parselist%s: %d: input is '%s' OK, Time: 
%llu\n",
-   mode, i, ptest.in, time);
+   pr_err("parselist: %d: input is '%s' OK, Time: %llu\n",
+   i, ptest.in, time);
 
 #undef ptest
}
@@ -443,75 +430,41 @@ static const struct test_bitmap_parselist parse_tests[] 
__initconst = {
 #undef step
 };
 
-static void __init __test_bitmap_parse(int is_user)
+static void __init test_bitmap_parse(void)
 {
int i;
int err;
ktime_t time;
DECLARE_BITMAP(bmap, 2048);
-   char *mode = is_user ? "_user"  : "";
 
for (i = 0; i < ARRAY_SIZE(parse_tests); i++) {
struct test_bitmap_parselist test = parse_tests[i];
+   size_t len = test.flags & NO_LEN ? UINT_MAX : strlen(test.in);
 
-   if (is_user) {
-   size_t len = strlen(test.in);
-   mm_segment_t orig_fs = get_fs();
-
-   set_fs(KERNEL_DS);
-   time = ktime_get();
-   err = bitmap_parse_user((__force const char __user 
*)test.in, len,
-   bmap, test.nbits);
-   time = ktime_get() - time;
-   set_fs(orig_fs);
-   } else {
-   size_t len = test.flags & NO_LEN ?
-   UINT_MAX : strlen(test.in);
-   time = ktime_get();
-   err = bitmap_parse(test.in, len, bmap, test.nbits);
-   time = ktime_get() - time;
-   }
+   time = ktime_get();
+   err = bitmap_parse(test.in, len, bmap, test.nbits);
+   time = ktime_get() - time;
 
if (err != test.errno) {
-   pr_err("parse%s: %d: input is %s, errno is %d, expected 
%d\n",
-   mode, i, test.in, err, test.errno);
+   pr_err("parse: %d: input is %s, errno is %d, expected 
%d\n",
+   i, test.in, err,

[PATCH 06/14] fs: don't allow splice read/write without explicit ops

2020-09-03 Thread Christoph Hellwig
default_file_splice_write is the last piece of generic code that uses
set_fs to make the uaccess routines operate on kernel pointers.  It
implements a "fallback loop" for splicing from files that do not actually
provide a proper splice_read method.  The usual file systems and other
high bandwidth instances all provide a ->splice_read, so this just removes
support for various device drivers and procfs/debugfs files.  If splice
support for any of those turns out to be important it can be added back
by switching them to the iter ops and using generic_file_splice_read.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Kees Cook 
---
 fs/read_write.c|   2 +-
 fs/splice.c| 130 +
 include/linux/fs.h |   2 -
 3 files changed, 15 insertions(+), 119 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 702c4301d9eb6b..8c61f67453e3d3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1077,7 +1077,7 @@ ssize_t vfs_iter_write(struct file *file, struct iov_iter 
*iter, loff_t *ppos,
 }
 EXPORT_SYMBOL(vfs_iter_write);
 
-ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
+static ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
  unsigned long vlen, loff_t *pos, rwf_t flags)
 {
struct iovec iovstack[UIO_FASTIOV];
diff --git a/fs/splice.c b/fs/splice.c
index d7c8a7c4db07ff..412df7b48f9eb7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -342,89 +342,6 @@ const struct pipe_buf_operations nosteal_pipe_buf_ops = {
 };
 EXPORT_SYMBOL(nosteal_pipe_buf_ops);
 
-static ssize_t kernel_readv(struct file *file, const struct kvec *vec,
-   unsigned long vlen, loff_t offset)
-{
-   mm_segment_t old_fs;
-   loff_t pos = offset;
-   ssize_t res;
-
-   old_fs = get_fs();
-   set_fs(KERNEL_DS);
-   /* The cast to a user pointer is valid due to the set_fs() */
-   res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
-   set_fs(old_fs);
-
-   return res;
-}
-
-static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
-struct pipe_inode_info *pipe, size_t len,
-unsigned int flags)
-{
-   struct kvec *vec, __vec[PIPE_DEF_BUFFERS];
-   struct iov_iter to;
-   struct page **pages;
-   unsigned int nr_pages;
-   unsigned int mask;
-   size_t offset, base, copied = 0;
-   ssize_t res;
-   int i;
-
-   if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-   return -EAGAIN;
-
-   /*
-* Try to keep page boundaries matching to source pagecache ones -
-* it probably won't be much help, but...
-*/
-   offset = *ppos & ~PAGE_MASK;
-
-   iov_iter_pipe(&to, READ, pipe, len + offset);
-
-   res = iov_iter_get_pages_alloc(&to, &pages, len + offset, &base);
-   if (res <= 0)
-   return -ENOMEM;
-
-   nr_pages = DIV_ROUND_UP(res + base, PAGE_SIZE);
-
-   vec = __vec;
-   if (nr_pages > PIPE_DEF_BUFFERS) {
-   vec = kmalloc_array(nr_pages, sizeof(struct kvec), GFP_KERNEL);
-   if (unlikely(!vec)) {
-   res = -ENOMEM;
-   goto out;
-   }
-   }
-
-   mask = pipe->ring_size - 1;
-   pipe->bufs[to.head & mask].offset = offset;
-   pipe->bufs[to.head & mask].len -= offset;
-
-   for (i = 0; i < nr_pages; i++) {
-   size_t this_len = min_t(size_t, len, PAGE_SIZE - offset);
-   vec[i].iov_base = page_address(pages[i]) + offset;
-   vec[i].iov_len = this_len;
-   len -= this_len;
-   offset = 0;
-   }
-
-   res = kernel_readv(in, vec, nr_pages, *ppos);
-   if (res > 0) {
-   copied = res;
-   *ppos += res;
-   }
-
-   if (vec != __vec)
-   kfree(vec);
-out:
-   for (i = 0; i < nr_pages; i++)
-   put_page(pages[i]);
-   kvfree(pages);
-   iov_iter_advance(&to, copied);  /* truncates and discards */
-   return res;
-}
-
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
  * using sendpage(). Return the number of bytes sent.
@@ -788,33 +705,6 @@ iter_file_splice_write(struct pipe_inode_info *pipe, 
struct file *out,
 
 EXPORT_SYMBOL(iter_file_splice_write);
 
-static int write_pipe_buf(struct pipe_inode_info *pipe, struct pipe_buffer 
*buf,
- struct splice_desc *sd)
-{
-   int ret;
-   void *data;
-   loff_t tmp = sd->pos;
-
-   data = kmap(buf->page);
-   ret = __kernel_write(sd->u.file, data + buf->offset, sd->len, &tmp);
-   kunmap(buf->page);
-
-   return ret;
-}
-
-static ssize_t default_file_splice_write(struct pipe_inode_info *pipe,
-struct file *out, loff_t *ppos,
-size_t len, unsigned int flags)
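
A driver that loses splice support here and wants it back would switch
to iter ops as the commit message suggests; a hypothetical sketch (the
foo_* names are made up):

static const struct file_operations foo_fops = {
	.owner		= THIS_MODULE,
	.read_iter	= foo_read_iter,	/* converted from .read */
	.splice_read	= generic_file_splice_read,
};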

[PATCH 05/14] fs: don't allow kernel reads and writes without iter ops

2020-09-03 Thread Christoph Hellwig
Don't allow calling ->read or ->write with set_fs as a preparation for
killing off set_fs.  All the instances that we use kernel_read/write on
are using the iter ops already.

If a file has both the regular ->read/->write methods and the iter
variants, those could have different semantics for sufficiently messed-up
drivers, so also fail the kernel access to them in that case.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Kees Cook 
---
 fs/read_write.c | 67 +++--
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 5db58b8c78d0dd..702c4301d9eb6b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -419,27 +419,41 @@ static ssize_t new_sync_read(struct file *filp, char 
__user *buf, size_t len, lo
return ret;
 }
 
+static int warn_unsupported(struct file *file, const char *op)
+{
+   pr_warn_ratelimited(
+   "kernel %s not supported for file %pD4 (pid: %d comm: %.20s)\n",
+   op, file, current->pid, current->comm);
+   return -EINVAL;
+}
+
 ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
 {
-   mm_segment_t old_fs = get_fs();
+   struct kvec iov = {
+   .iov_base   = buf,
+   .iov_len= min_t(size_t, count, MAX_RW_COUNT),
+   };
+   struct kiocb kiocb;
+   struct iov_iter iter;
ssize_t ret;
 
if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))
return -EINVAL;
if (!(file->f_mode & FMODE_CAN_READ))
return -EINVAL;
+   /*
+* Also fail if ->read_iter and ->read are both wired up as that
+* implies very convoluted semantics.
+*/
+   if (unlikely(!file->f_op->read_iter || file->f_op->read))
+   return warn_unsupported(file, "read");
 
-   if (count > MAX_RW_COUNT)
-   count =  MAX_RW_COUNT;
-   set_fs(KERNEL_DS);
-   if (file->f_op->read)
-   ret = file->f_op->read(file, (void __user *)buf, count, pos);
-   else if (file->f_op->read_iter)
-   ret = new_sync_read(file, (void __user *)buf, count, pos);
-   else
-   ret = -EINVAL;
-   set_fs(old_fs);
+   init_sync_kiocb(&kiocb, file);
+   kiocb.ki_pos = *pos;
+   iov_iter_kvec(&iter, READ, &iov, 1, iov.iov_len);
+   ret = file->f_op->read_iter(&kiocb, &iter);
if (ret > 0) {
+   *pos = kiocb.ki_pos;
fsnotify_access(file);
add_rchar(current, ret);
}
@@ -510,28 +524,31 @@ static ssize_t new_sync_write(struct file *filp, const 
char __user *buf, size_t
 /* caller is responsible for file_start_write/file_end_write */
 ssize_t __kernel_write(struct file *file, const void *buf, size_t count, 
loff_t *pos)
 {
-   mm_segment_t old_fs;
-   const char __user *p;
+   struct kvec iov = {
+   .iov_base   = (void *)buf,
+   .iov_len= min_t(size_t, count, MAX_RW_COUNT),
+   };
+   struct kiocb kiocb;
+   struct iov_iter iter;
ssize_t ret;
 
if (WARN_ON_ONCE(!(file->f_mode & FMODE_WRITE)))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;
+   /*
+* Also fail if ->write_iter and ->write are both wired up as that
+* implies very convoluted semantics.
+*/
+   if (unlikely(!file->f_op->write_iter || file->f_op->write))
+   return warn_unsupported(file, "write");
 
-   old_fs = get_fs();
-   set_fs(KERNEL_DS);
-   p = (__force const char __user *)buf;
-   if (count > MAX_RW_COUNT)
-   count =  MAX_RW_COUNT;
-   if (file->f_op->write)
-   ret = file->f_op->write(file, p, count, pos);
-   else if (file->f_op->write_iter)
-   ret = new_sync_write(file, p, count, pos);
-   else
-   ret = -EINVAL;
-   set_fs(old_fs);
+   init_sync_kiocb(&kiocb, file);
+   kiocb.ki_pos = *pos;
+   iov_iter_kvec(&iter, WRITE, &iov, 1, iov.iov_len);
+   ret = file->f_op->write_iter(&kiocb, &iter);
if (ret > 0) {
+   *pos = kiocb.ki_pos;
fsnotify_modify(file);
add_wchar(current, ret);
}
-- 
2.28.0
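
Callers of kernel_read()/kernel_write() are unaffected by the
conversion; e.g. an illustrative helper:

static ssize_t read_head(struct file *file, void *buf, size_t size)
{
	loff_t pos = 0;

	/* kernel pointer, no set_fs() needed anywhere */
	return kernel_read(file, buf, size, &pos);
}

(The only visible change is that files wiring up both ->read and
->read_iter, or neither, now get -EINVAL.)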



[PATCH 03/14] proc: add a read_iter method to proc proc_ops

2020-09-03 Thread Christoph Hellwig
This will allow proc files to implement iter read semantics.

Signed-off-by: Christoph Hellwig 
---
 fs/proc/inode.c | 53 ++---
 include/linux/proc_fs.h |  1 +
 2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 93dd2045737504..58c075e2a452d6 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -297,6 +297,21 @@ static loff_t proc_reg_llseek(struct file *file, loff_t 
offset, int whence)
return rv;
 }
 
+static ssize_t proc_reg_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+   struct proc_dir_entry *pde = PDE(file_inode(iocb->ki_filp));
+   ssize_t ret;
+
+   if (pde_is_permanent(pde))
+   return pde->proc_ops->proc_read_iter(iocb, iter);
+
+   if (!use_pde(pde))
+   return -EIO;
+   ret = pde->proc_ops->proc_read_iter(iocb, iter);
+   unuse_pde(pde);
+   return ret;
+}
+
static ssize_t pde_read(struct proc_dir_entry *pde, struct file *file, char __user *buf, size_t count, loff_t *ppos)
 {
typeof_member(struct proc_ops, proc_read) read;
@@ -578,6 +593,18 @@ static const struct file_operations proc_reg_file_ops = {
.release= proc_reg_release,
 };
 
+static const struct file_operations proc_iter_file_ops = {
+   .llseek = proc_reg_llseek,
+   .read_iter  = proc_reg_read_iter,
+   .write  = proc_reg_write,
+   .poll   = proc_reg_poll,
+   .unlocked_ioctl = proc_reg_unlocked_ioctl,
+   .mmap   = proc_reg_mmap,
+   .get_unmapped_area = proc_reg_get_unmapped_area,
+   .open   = proc_reg_open,
+   .release= proc_reg_release,
+};
+
 #ifdef CONFIG_COMPAT
 static const struct file_operations proc_reg_file_ops_compat = {
.llseek = proc_reg_llseek,
@@ -591,6 +618,19 @@ static const struct file_operations proc_reg_file_ops_compat = {
.open   = proc_reg_open,
.release= proc_reg_release,
 };
+
+static const struct file_operations proc_iter_file_ops_compat = {
+   .llseek = proc_reg_llseek,
+   .read_iter  = proc_reg_read_iter,
+   .write  = proc_reg_write,
+   .poll   = proc_reg_poll,
+   .unlocked_ioctl = proc_reg_unlocked_ioctl,
+   .compat_ioctl   = proc_reg_compat_ioctl,
+   .mmap   = proc_reg_mmap,
+   .get_unmapped_area = proc_reg_get_unmapped_area,
+   .open   = proc_reg_open,
+   .release= proc_reg_release,
+};
 #endif
 
 static void proc_put_link(void *p)
@@ -642,10 +682,17 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 
if (S_ISREG(inode->i_mode)) {
inode->i_op = de->proc_iops;
-   inode->i_fop = &proc_reg_file_ops;
+   if (de->proc_ops->proc_read_iter)
+   inode->i_fop = &proc_iter_file_ops;
+   else
+   inode->i_fop = &proc_reg_file_ops;
 #ifdef CONFIG_COMPAT
-   if (de->proc_ops->proc_compat_ioctl)
-   inode->i_fop = &proc_reg_file_ops_compat;
+   if (de->proc_ops->proc_compat_ioctl) {
+   if (de->proc_ops->proc_read_iter)
+   inode->i_fop = &proc_iter_file_ops_compat;
+   else
+   inode->i_fop = &proc_reg_file_ops_compat;
+   }
 #endif
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = de->proc_iops;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 2df965cd09742d..270cab43ca3dad 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -30,6 +30,7 @@ struct proc_ops {
unsigned int proc_flags;
int (*proc_open)(struct inode *, struct file *);
ssize_t (*proc_read)(struct file *, char __user *, size_t, loff_t *);
+   ssize_t (*proc_read_iter)(struct kiocb *, struct iov_iter *);
ssize_t (*proc_write)(struct file *, const char __user *, size_t, loff_t *);
loff_t  (*proc_lseek)(struct file *, loff_t, int);
int (*proc_release)(struct inode *, struct file *);
-- 
2.28.0
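
For context, a sketch of what a proc file wired up through the new hook
could look like; the "demo" names are hypothetical and only the
.proc_read_iter member is new with this patch:

#include <linux/proc_fs.h>
#include <linux/uio.h>

static ssize_t demo_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
	static const char msg[] = "hello\n";
	size_t len = sizeof(msg) - 1;
	size_t copied;

	if (iocb->ki_pos >= len)
		return 0;
	/* copy_to_iter() returns the number of bytes actually copied */
	copied = copy_to_iter(msg + iocb->ki_pos, len - iocb->ki_pos, iter);
	iocb->ki_pos += copied;
	return copied;
}

static const struct proc_ops demo_proc_ops = {
	.proc_read_iter	= demo_read_iter,
	.proc_lseek	= default_llseek,
};

With this in place, proc_get_inode() above selects proc_iter_file_ops
for the entry, so splice and in-kernel reads work without a ->read
wrapper.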



remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-03 Thread Christoph Hellwig
Hi all,

this series removes the last set_fs() used to force a kernel address
space for the uaccess code in the kernel read/write/splice code, and then
stops implementing the address space overrides entirely for x86 and
powerpc.

[Note to Linus: I'd like to get this into linux-next rather earlier
than later.  Do you think it is ok to add this tree to linux-next?]

The file system part has been posted a few times, and the read/write side
has been pretty much unchanged.  For splice this series drops the
conversion of the seq_file and sysctl code to the iter ops, and thus loses
the splice support for them.  The reason for that is that it caused a lot
of churn for not much use - splice for these small files really isn't much
of a win, even if existing userspace uses it.  All callers I found do the
proper fallback, but if this turns out to be an issue the conversion can
be resurrected.

Besides x86 and powerpc I plan to eventually convert all other
architectures, although this will be a slow process, starting with the
easier ones once the infrastructure is merged.  The process to convert
architectures is roughly:

 (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
 (2) implement __get_kernel_nofault and __put_kernel_nofault
 (3) remove the arch specific address limitation functionality
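
Step (2) is the per-architecture heavy lifting.  A hedged sketch of the
shape such an implementation takes; arch_load_nofault is a hypothetical
stand-in for the architecture's exception-table-protected load, which
real ports express as inline asm plus an extable entry:

/* Fault-safe kernel-space load: on a bad address, control transfers to
 * err_label instead of oopsing. */
#define __get_kernel_nofault(dst, src, type, err_label)			\
do {									\
	type __v;							\
	if (unlikely(arch_load_nofault(&__v, (src), sizeof(type))))	\
		goto err_label;						\
	*(type *)(dst) = __v;						\
} while (0)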

Changes since v2:
 - add back the patch to support splice through read_iter/write_iter
   on /proc/sys/*
 - entirely remove the tests that depend on set_fs.  Note that for
   lkdtm the maintainer (Kees) disagrees with this request from Linus
 - fix a wrong check in the powerpc access_ok, and drop a few spurious
   cleanups there

Changes since v1:
 - drop the patch to remove the non-iter ops for /dev/zero and
   /dev/null as they caused a performance regression
 - don't enable user access in __get_kernel on powerpc
 - xfail the set_fs() based lkdtm tests

Diffstat:


[PATCH 04/14] sysctl: Convert to iter interfaces

2020-09-03 Thread Christoph Hellwig
From: "Matthew Wilcox (Oracle)" 

Using the read_iter/write_iter interfaces allows for in-kernel users
to set sysctls without using set_fs().  Also, the buffer is a string,
so give it the real type of 'char *', not void *.

Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Christoph Hellwig 
---
 fs/proc/proc_sysctl.c  | 46 ++
 include/linux/bpf-cgroup.h |  2 +-
 kernel/bpf/cgroup.c|  2 +-
 3 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 6c1166ccdaea57..a4a3122f8a584a 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -540,13 +541,14 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
return err;
 }
 
-static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf,
-   size_t count, loff_t *ppos, int write)
+static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
+   int write)
 {
-   struct inode *inode = file_inode(filp);
+   struct inode *inode = file_inode(iocb->ki_filp);
struct ctl_table_header *head = grab_header(inode);
struct ctl_table *table = PROC_I(inode)->sysctl_entry;
-   void *kbuf;
+   size_t count = iov_iter_count(iter);
+   char *kbuf;
ssize_t error;
 
if (IS_ERR(head))
@@ -569,32 +571,30 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf,
error = -ENOMEM;
if (count >= KMALLOC_MAX_SIZE)
goto out;
+   kbuf = kzalloc(count + 1, GFP_KERNEL);
+   if (!kbuf)
+   goto out;
 
if (write) {
-   kbuf = memdup_user_nul(ubuf, count);
-   if (IS_ERR(kbuf)) {
-   error = PTR_ERR(kbuf);
-   goto out;
-   }
-   } else {
-   kbuf = kzalloc(count, GFP_KERNEL);
-   if (!kbuf)
-   goto out;
+   error = -EFAULT;
+   if (!copy_from_iter_full(kbuf, count, iter))
+   goto out_free_buf;
+   kbuf[count] = '\0';
}
 
error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, &kbuf, &count,
-  ppos);
+  &iocb->ki_pos);
if (error)
goto out_free_buf;
 
/* careful: calling conventions are nasty here */
-   error = table->proc_handler(table, write, kbuf, &count, ppos);
+   error = table->proc_handler(table, write, kbuf, &count, &iocb->ki_pos);
if (error)
goto out_free_buf;
 
if (!write) {
error = -EFAULT;
-   if (copy_to_user(ubuf, kbuf, count))
+   if (copy_to_iter(kbuf, count, iter) < count)
goto out_free_buf;
}
 
@@ -607,16 +607,14 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf,
return error;
 }
 
-static ssize_t proc_sys_read(struct file *filp, char __user *buf,
-   size_t count, loff_t *ppos)
+static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter)
 {
-   return proc_sys_call_handler(filp, (void __user *)buf, count, ppos, 0);
+   return proc_sys_call_handler(iocb, iter, 0);
 }
 
-static ssize_t proc_sys_write(struct file *filp, const char __user *buf,
-   size_t count, loff_t *ppos)
+static ssize_t proc_sys_write(struct kiocb *iocb, struct iov_iter *iter)
 {
-   return proc_sys_call_handler(filp, (void __user *)buf, count, ppos, 1);
+   return proc_sys_call_handler(iocb, iter, 1);
 }
 
 static int proc_sys_open(struct inode *inode, struct file *filp)
@@ -853,8 +851,8 @@ static int proc_sys_getattr(const struct path *path, struct kstat *stat,
 static const struct file_operations proc_sys_file_operations = {
.open   = proc_sys_open,
.poll   = proc_sys_poll,
-   .read   = proc_sys_read,
-   .write  = proc_sys_write,
+   .read_iter  = proc_sys_read,
+   .write_iter = proc_sys_write,
.llseek = default_llseek,
 };
 
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 64f367044e25f4..82b26a1386d85e 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -136,7 +136,7 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 
 int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
   struct ctl_table *table, int write,
-  void **buf, size_t *pcount, loff_t *ppos,
+  char **buf, size_t *pcount, loff_t *ppos,
   enum bpf_attach_type type);
 
 in
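
One detail worth noting in the conversion above: copy_from_iter_full()
returns a bool and succeeds only if every byte was copied, while
copy_to_iter() returns the number of bytes it managed to copy -- hence
the two different error checks in the patch.  The pattern in isolation:

	/* write path: all-or-nothing */
	if (!copy_from_iter_full(kbuf, count, iter))
		return -EFAULT;

	/* read path: a short copy is treated as a fault */
	if (copy_to_iter(kbuf, count, iter) < count)
		return -EFAULT;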

[PATCH 01/14] proc: remove a level of indentation in proc_get_inode

2020-09-03 Thread Christoph Hellwig
Just return early on inode allocation failure.

Signed-off-by: Christoph Hellwig 
---
 fs/proc/inode.c | 72 +
 1 file changed, 37 insertions(+), 35 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 28d6105e908e4c..016b1302cbabc0 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -619,42 +619,44 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 {
struct inode *inode = new_inode(sb);
 
-   if (inode) {
-   inode->i_ino = de->low_ino;
-   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
-   PROC_I(inode)->pde = de;
-
-   if (is_empty_pde(de)) {
-   make_empty_dir_inode(inode);
-   return inode;
-   }
-   if (de->mode) {
-   inode->i_mode = de->mode;
-   inode->i_uid = de->uid;
-   inode->i_gid = de->gid;
-   }
-   if (de->size)
-   inode->i_size = de->size;
-   if (de->nlink)
-   set_nlink(inode, de->nlink);
-
-   if (S_ISREG(inode->i_mode)) {
-   inode->i_op = de->proc_iops;
-   inode->i_fop = &proc_reg_file_ops;
+   if (!inode) {
+   pde_put(de);
+   return NULL;
+   }
+
+   inode->i_ino = de->low_ino;
+   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+   PROC_I(inode)->pde = de;
+   if (is_empty_pde(de)) {
+   make_empty_dir_inode(inode);
+   return inode;
+   }
+
+   if (de->mode) {
+   inode->i_mode = de->mode;
+   inode->i_uid = de->uid;
+   inode->i_gid = de->gid;
+   }
+   if (de->size)
+   inode->i_size = de->size;
+   if (de->nlink)
+   set_nlink(inode, de->nlink);
+
+   if (S_ISREG(inode->i_mode)) {
+   inode->i_op = de->proc_iops;
+   inode->i_fop = &proc_reg_file_ops;
 #ifdef CONFIG_COMPAT
-   if (!de->proc_ops->proc_compat_ioctl) {
-   inode->i_fop = &proc_reg_file_ops_no_compat;
-   }
+   if (!de->proc_ops->proc_compat_ioctl)
+   inode->i_fop = &proc_reg_file_ops_no_compat;
 #endif
-   } else if (S_ISDIR(inode->i_mode)) {
-   inode->i_op = de->proc_iops;
-   inode->i_fop = de->proc_dir_ops;
-   } else if (S_ISLNK(inode->i_mode)) {
-   inode->i_op = de->proc_iops;
-   inode->i_fop = NULL;
-   } else
-   BUG();
-   } else
-  pde_put(de);
+   } else if (S_ISDIR(inode->i_mode)) {
+   inode->i_op = de->proc_iops;
+   inode->i_fop = de->proc_dir_ops;
+   } else if (S_ISLNK(inode->i_mode)) {
+   inode->i_op = de->proc_iops;
+   inode->i_fop = NULL;
+   } else {
+   BUG();
+   }
return inode;
 }
-- 
2.28.0



[PATCH 02/14] proc: cleanup the compat vs no compat file ops

2020-09-03 Thread Christoph Hellwig
Instead of providing a special no-compat version provide a special
compat version for operations with ->compat_ioctl.

Signed-off-by: Christoph Hellwig 
---
 fs/proc/inode.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 016b1302cbabc0..93dd2045737504 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -572,9 +572,6 @@ static const struct file_operations proc_reg_file_ops = {
.write  = proc_reg_write,
.poll   = proc_reg_poll,
.unlocked_ioctl = proc_reg_unlocked_ioctl,
-#ifdef CONFIG_COMPAT
-   .compat_ioctl   = proc_reg_compat_ioctl,
-#endif
.mmap   = proc_reg_mmap,
.get_unmapped_area = proc_reg_get_unmapped_area,
.open   = proc_reg_open,
@@ -582,12 +579,13 @@ static const struct file_operations proc_reg_file_ops = {
 };
 
 #ifdef CONFIG_COMPAT
-static const struct file_operations proc_reg_file_ops_no_compat = {
+static const struct file_operations proc_reg_file_ops_compat = {
.llseek = proc_reg_llseek,
.read   = proc_reg_read,
.write  = proc_reg_write,
.poll   = proc_reg_poll,
.unlocked_ioctl = proc_reg_unlocked_ioctl,
+   .compat_ioctl   = proc_reg_compat_ioctl,
.mmap   = proc_reg_mmap,
.get_unmapped_area = proc_reg_get_unmapped_area,
.open   = proc_reg_open,
@@ -646,8 +644,8 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
inode->i_op = de->proc_iops;
inode->i_fop = &proc_reg_file_ops;
 #ifdef CONFIG_COMPAT
-   if (!de->proc_ops->proc_compat_ioctl)
-   inode->i_fop = &proc_reg_file_ops_no_compat;
+   if (de->proc_ops->proc_compat_ioctl)
+   inode->i_fop = &proc_reg_file_ops_compat;
 #endif
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = de->proc_iops;
-- 
2.28.0



Re: [PATCH] selftests/powerpc: Skip PROT_SAO test in guests/LPARS

2020-09-03 Thread Michael Ellerman
On Tue, 1 Sep 2020 22:46:53 +1000, Michael Ellerman wrote:
> In commit 9b725a90a8f1 ("powerpc/64s: Disallow PROT_SAO in LPARs by
> default") PROT_SAO was disabled in guests/LPARs by default. So skip
> the test if we are running in a guest to avoid a spurious failure.

Applied to powerpc/fixes.

[1/1] selftests/powerpc: Skip PROT_SAO test in guests/LPARS
  https://git.kernel.org/powerpc/c/fc1f178cdb31783ff37296ecae817a1045a1a513

cheers


Re: [PATCH v3] powerpc/mm: Remove DEBUG_VM_PGTABLE support on powerpc

2020-09-03 Thread Michael Ellerman
On Wed, 2 Sep 2020 09:31:22 +0530, Aneesh Kumar K.V wrote:
> The test is broken w.r.t page table update rules and results in kernel
> crash as below. Disable the support until we get the tests updated.
> 
> [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
> pc: c009a5ec: assert_pte_locked+0x14c/0x380
> lr: c05c: pte_update+0x11c/0x190
> sp: c00c6d1e7950
>msr: 82029033
>   current = 0xc00c6d172c80
>   paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
> pid   = 1, comm = swapper/0
> kernel BUG at arch/powerpc/mm/pgtable.c:304!
> [link register   ] c05c pte_update+0x11c/0x190
> [c00c6d1e7950] 0001 (unreliable)
> [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
> [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
> [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
> [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
> [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
> [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
> [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/mm: Remove DEBUG_VM_PGTABLE support on powerpc
  https://git.kernel.org/powerpc/c/675bceb097e6fb9769309093ec6f3d33a2fed9cc

cheers


Re: [PATCH v2] powerpc/book3s64/radix: Fix boot failure with large amount of guest memory

2020-09-03 Thread Michael Ellerman
On Fri, 28 Aug 2020 15:38:52 +0530, Aneesh Kumar K.V wrote:
> If the hypervisor doesn't support hugepages, the kernel ends up allocating a large
> number of page table pages. The early page table allocation was wrongly
> setting the max memblock limit to ppc64_rma_size with radix translation
> which resulted in boot failure as shown below.
> 
> Kernel panic - not syncing:
> early_alloc_pgtable: Failed to allocate 16777216 bytes align=0x100 nid=-1 from=0x max_addr=0x
>  CPU: 0 PID: 0 Comm: swapper Not tainted 5.8.0-24.9-default+ #2
>  Call Trace:
>  [c16f3d00] [c07c6470] dump_stack+0xc4/0x114 (unreliable)
>  [c16f3d40] [c014c78c] panic+0x164/0x418
>  [c16f3dd0] [c0098890] early_alloc_pgtable+0xe0/0xec
>  [c16f3e60] [c10a5440] radix__early_init_mmu+0x360/0x4b4
>  [c16f3ef0] [c1099bac] early_init_mmu+0x1c/0x3c
>  [c16f3f10] [c109a320] early_setup+0x134/0x170
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/book3s64/radix: Fix boot failure with large amount of guest memory
  https://git.kernel.org/powerpc/c/103a8542cb35b5130f732d00b0419a594ba1b517

cheers


Re: [PATCH 1/2] dma-mapping: introduce dma_get_seg_boundary_nr_pages()

2020-09-03 Thread Andy Shevchenko
On Wed, Sep 2, 2020 at 1:20 AM Nicolin Chen  wrote:
>
> We found that callers of dma_get_seg_boundary mostly do an ALIGN
> with page mask and then do a page shift to get number of pages:
> ALIGN(boundary + 1, 1 << shift) >> shift
>
> However, the boundary might be as large as ULONG_MAX, which means
> that a device has no specific boundary limit. So either "+ 1" or
> passing it to ALIGN() would potentially overflow.
>
> According to kernel defines:
> #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
> #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)
>
> We can simplify the logic here into a helper function doing:
>   ALIGN(boundary + 1, 1 << shift) >> shift
> = ALIGN_MASK(b + 1, (1 << s) - 1) >> s
> = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
> = [b + 1 + (1 << s) - 1] >> s
> = [b + (1 << s)] >> s
> = (b >> s) + 1
>
> This patch introduces and applies dma_get_seg_boundary_nr_pages()
> as an overflow-free helper for the dma_get_seg_boundary() callers
> to get numbers of pages. It also takes care of the NULL dev case
> for non-DMA API callers.

...

> +static inline unsigned long dma_get_seg_boundary_nr_pages(struct device *dev,
> +   unsigned int page_shift)
> +{
> +   if (!dev)
> +   return (U32_MAX >> page_shift) + 1;
> +   return (dma_get_seg_boundary(dev) >> page_shift) + 1;

Can it be better to do something like
  unsigned long boundary = dev ? dma_get_seg_boundary(dev) : U32_MAX;

  return (boundary >> page_shift) + 1;

?

> +}

-- 
With Best Regards,
Andy Shevchenko
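
For reference, a sketch of the helper with Andy's restructuring folded
in; the behaviour is intended to be identical to the version in the
patch:

static inline unsigned long dma_get_seg_boundary_nr_pages(struct device *dev,
		unsigned int page_shift)
{
	/* NULL dev (non-DMA API callers) falls back to a 32-bit boundary */
	unsigned long boundary = dev ? dma_get_seg_boundary(dev) : U32_MAX;

	/* (b >> s) + 1 is the overflow-free form of ALIGN(b + 1, 1 << s) >> s */
	return (boundary >> page_shift) + 1;
}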


[PATCH v2] cpuidle-pseries: Fix CEDE latency conversion from tb to us

2020-09-03 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
of the Extended CEDE states advertised by the platform. The values
advertised by the platform are in timebase ticks. However the cpuidle
framework requires the latency values in microseconds.

If the tb-ticks value advertised by the platform corresponds to a
latency smaller than 1us, the current conversion from tb-ticks to
microseconds truncates the result to zero. This is incorrect as it
puts a CEDE state on par with the snooze state.

This patch fixes this by rounding up the result obtained while
converting the latency value from tb-ticks to microseconds. It also
prints a warning if we discover an extended-cede state with a wakeup
latency of 0. In such a case, we ensure that CEDE(0) has a non-zero
wakeup latency.

Fixes: d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)")

Signed-off-by: Gautham R. Shenoy 
---
v1-->v2: Added a warning if a CEDE state has 0 wakeup latency (Suggested by Joel Stanley)
 Also added code to ensure that CEDE(0) has a non-zero wakeup latency.  
 
 drivers/cpuidle/cpuidle-pseries.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index ff6d99e..a2b5c6f 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -361,7 +361,10 @@ static void __init fixup_cede0_latency(void)
for (i = 0; i < nr_xcede_records; i++) {
struct xcede_latency_record *record = &payload->records[i];
u64 latency_tb = be64_to_cpu(record->latency_ticks);
-   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
+   u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), NSEC_PER_USEC);
+
+   if (latency_us == 0)
+   pr_warn("cpuidle: xcede record %d has an unrealistic latency of 0us.\n", i);
 
if (latency_us < min_latency_us)
min_latency_us = latency_us;
@@ -378,10 +381,14 @@ static void __init fixup_cede0_latency(void)
 * Perform the fix-up.
 */
if (min_latency_us < dedicated_states[1].exit_latency) {
-   u64 cede0_latency = min_latency_us - 1;
+   /*
+* We set a minimum of 1us wakeup latency for cede0 to
+* distinguish it from snooze
+*/
+   u64 cede0_latency = 1;
 
-   if (cede0_latency <= 0)
-   cede0_latency = min_latency_us;
+   if (min_latency_us > cede0_latency)
+   cede0_latency = min_latency_us - 1;
 
dedicated_states[1].exit_latency = cede0_latency;
dedicated_states[1].target_residency = 10 * (cede0_latency);
-- 
1.9.4
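
To make the truncation concrete, a standalone illustration (plain
userspace C; the 512MHz timebase and the tick count are assumptions for
the example, not values from the patch):

#include <stdio.h>

#define NSEC_PER_USEC		1000ULL
#define DIV_ROUND_UP_ULL(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long long latency_tb = 100;		/* ticks, illustrative */
	unsigned long long tb_hz = 512000000ULL;	/* typical pseries timebase */
	/* equivalent of tb_to_ns(): 100 ticks at 512MHz is ~195ns */
	unsigned long long ns = latency_tb * 1000000000ULL / tb_hz;

	printf("truncated:  %lluus\n", ns / NSEC_PER_USEC);		/* prints 0 */
	printf("rounded up: %lluus\n",
	       DIV_ROUND_UP_ULL(ns, NSEC_PER_USEC));			/* prints 1 */
	return 0;
}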



Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-03 Thread Christophe Leroy

On 03/09/2020 09:11, Christoph Hellwig wrote:


Except that we do not actually have such a patch.  For normal user
writes we only use ->write_iter if ->write is not present.  But what
shows up in the profile is that /dev/zero only has a read_iter op and
not a normal read.  I've added a patch below that implements a normal
read which might help a tad with this workload, but should not be part
of a regression.

With that patch below, throughput is 113.5MB/s (instead of 99.9MB/s).
So a 14% improvement. That's not bad.

Christophe



---
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index abd4ffdc8cdebc..1dc99ab158457a 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -726,6 +726,27 @@ static ssize_t read_iter_zero(struct kiocb *iocb, struct iov_iter *iter)
return written;
  }
  
+static ssize_t read_zero(struct file *file, char __user *buf,
+size_t count, loff_t *ppos)
+{
+   size_t cleared = 0;
+
+   while (count) {
+   size_t chunk = min_t(size_t, count, PAGE_SIZE);
+
+   if (clear_user(buf + cleared, chunk))
+   return cleared ? cleared : -EFAULT;
+   cleared += chunk;
+   count -= chunk;
+
+   if (signal_pending(current))
+   return cleared ? cleared : -ERESTARTSYS;
+   cond_resched();
+   }
+
+   return cleared;
+}
+
  static int mmap_zero(struct file *file, struct vm_area_struct *vma)
  {
  #ifndef CONFIG_MMU
@@ -921,6 +942,7 @@ static const struct file_operations zero_fops = {
.llseek = zero_lseek,
.write  = write_zero,
.read_iter  = read_iter_zero,
+   .read   = read_zero,
.write_iter = write_iter_zero,
.mmap   = mmap_zero,
.get_unmapped_area = get_unmapped_area_zero,



Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-03 Thread Christophe Leroy

On 03/09/2020 09:11, Christoph Hellwig wrote:

On Wed, Sep 02, 2020 at 11:02:22AM -0700, Linus Torvalds wrote:

I don't see why this change would make any difference.


Me neither, but while looking at a different project I did spot places
that actually do an access_ok with len 0, that's why I wanted him to
try.

That being said: Christophe, are these numbers stable?  Do you get
similar numbers with multiple runs?


Yes the numbers are similar with multiple runs and multiple reboots.

And btw, why do the 32-bit and 64-bit checks even differ? It's not
like the extra (single) instruction should even matter. I think the
main reason is that the simpler 64-bit case could stay as a macro
(because it only uses "addr" and "size" once), but honestly, that
"simplification" doesn't help when you then need to have that #ifdef
for the 32-bit case and an inline function anyway.


I'll have to leave that to the powerpc folks.  The intent was to not
change the behavior (and I even fucked that up for the size == 0
case).


However, I suspect a bigger reason for the actual performance
degradation would be the patch that makes things use "write_iter()"
for writing, even when a simpler "write()" exists.


Except that we do not actually have such a patch.  For normal user
writes we only use ->write_iter if ->write is not present.  But what
shows up in the profile is that /dev/zero only has a read_iter op and
not a normal read.  I've added a patch below that implements a normal
read which might help a tad with this workload, but should not be part
of a regression.

Also Christophe:  can you bisect which patch starts this?  Is it really
this last patch in the series?


5.9-rc2: 91.5MB/s
Patch 1: 74.9MB/s
Patch 2: 97.9MB/s
Patch 3: 97.7MB/s
Patch 4 to 9: 97.9MB/s
Patch 10: 85.3MB/s
Patch 11: 75.4MB/s

See my other mail, when removing CONFIG_STACKPROTECTOR, I get a stable 
99.8MB/s throughput.


Christophe


Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-03 Thread Christophe Leroy

On 02/09/2020 20:02, Linus Torvalds wrote:

On Wed, Sep 2, 2020 at 8:17 AM Christophe Leroy
 wrote:

With this fix, I get

root@vgoippro:~# time dd if=/dev/zero of=/dev/null count=1M
536870912 bytes (512.0MB) copied, 6.776327 seconds, 75.6MB/s

That's still far from the 91.7MB/s I get with 5.9-rc2, but better than
the 65.8MB/s I got yesterday with your series. Still some way to go, though.


I don't see why this change would make any difference.

Neither do I.

Looks like nowadays CONFIG_STACKPROTECTOR has become a default.
I rebuilt the kernel without it, and I now get a throughput of 99.8MB/s
both without and with this series.


Looking at the generated code (GCC 10.1), a small change in a function 
seems to make large changes in the generated code when 
CONFIG_STACKPROTECTOR is set.


In addition to that, trivial functions which don't use the stack at all
get a stack frame anyway when CONFIG_STACKPROTECTOR is set, although
that's only with -fstack-protector-strong. And there is no canary check.


Without CONFIG_STACKPROTECTOR:

c01572a0 :
c01572a0:   38 60 ff ff li  r3,-1
c01572a4:   38 80 ff e3 li  r4,-29
c01572a8:   4e 80 00 20 blr

With CONFIG_STACKPROTECTOR (regardless of CONFIG_STACKPROTECTOR_STRONG 
or not):


c0164e08 :
c0164e08:   94 21 ff f0 stwur1,-16(r1)
c0164e0c:   38 60 ff ff li  r3,-1
c0164e10:   38 80 ff e3 li  r4,-29
c0164e14:   38 21 00 10 addir1,r1,16
c0164e18:   4e 80 00 20 blr

Wondering why CONFIG_STACKPROTECTOR has become the default. It seems to
imply a 10% performance loss even in the best case (91.7MB/s versus
99.8MB/s).


Note that without CONFIG_STACKPROTECTOR_STRONG, I'm at 99.3MB/s, so 
that's really the _STRONG alternative that hurts.


Christophe
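
For anyone wanting to reproduce the instrumentation difference outside
the kernel, a small illustration of which functions
-fstack-protector-strong picks; compile with
gcc -O2 -fstack-protector-strong -S and compare the output.  fill() and
the function names are assumptions for the example, chosen so the array
escapes and cannot be optimized away:

void fill(char *buf);		/* external, defeats optimization */

int no_array(int x)
{
	return x + 1;		/* no local array, no address-taken local */
}

int with_array(void)
{
	char buf[16];		/* local array: -strong inserts a canary */

	fill(buf);
	return buf[0];
}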


Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-03 Thread Christoph Hellwig
On Wed, Sep 02, 2020 at 11:02:22AM -0700, Linus Torvalds wrote:
> I don't see why this change would make any difference.

Me neither, but while looking at a different project I did spot places
that actually do an access_ok with len 0, that's why I wanted him to
try.

That being said: Christophe, are these numbers stable?  Do you get
similar numbers with multiple runs?

> And btw, why do the 32-bit and 64-bit checks even differ? It's not
> like the extra (single) instruction should even matter. I think the
> main reason is that the simpler 64-bit case could stay as a macro
> (because it only uses "addr" and "size" once), but honestly, that
> "simplification" doesn't help when you then need to have that #ifdef
> for the 32-bit case and an inline function anyway.

I'll have to leave that to the powerpc folks.  The intent was to not
change the behavior (and I even fucked that up for the size == 0
case).

> However, I suspect a bigger reason for the actual performance
> degradation would be the patch that makes things use "write_iter()"
> for writing, even when a simpler "write()" exists.

Except that we do not actually have such a patch.  For normal user
writes we only use ->write_iter if ->write is not present.  But what
shows up in the profile is that /dev/zero only has a read_iter op and
not a normal read.  I've added a patch below that implements a normal
read which might help a tad with this workload, but should not be part
of a regression.

Also Christophe:  can you bisect which patch starts this?  Is it really
this last patch in the series?

---
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index abd4ffdc8cdebc..1dc99ab158457a 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -726,6 +726,27 @@ static ssize_t read_iter_zero(struct kiocb *iocb, struct iov_iter *iter)
return written;
 }
 
+static ssize_t read_zero(struct file *file, char __user *buf,
+size_t count, loff_t *ppos)
+{
+   size_t cleared = 0;
+
+   while (count) {
+   size_t chunk = min_t(size_t, count, PAGE_SIZE);
+
+   if (clear_user(buf + cleared, chunk))
+   return cleared ? cleared : -EFAULT;
+   cleared += chunk;
+   count -= chunk;
+
+   if (signal_pending(current))
+   return cleared ? cleared : -ERESTARTSYS;
+   cond_resched();
+   }
+
+   return cleared;
+}
+
 static int mmap_zero(struct file *file, struct vm_area_struct *vma)
 {
 #ifndef CONFIG_MMU
@@ -921,6 +942,7 @@ static const struct file_operations zero_fops = {
.llseek = zero_lseek,
.write  = write_zero,
.read_iter  = read_iter_zero,
+   .read   = read_zero,
.write_iter = write_iter_zero,
.mmap   = mmap_zero,
.get_unmapped_area = get_unmapped_area_zero,
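
The regression argument hinges on the dispatch order in vfs_read(): the
plain ->read method wins when both are present, so wiring up read_zero
restores the cheap path for userspace while read_iter_zero keeps serving
splice and in-kernel users.  Roughly, from fs/read_write.c:

	if (file->f_op->read)
		ret = file->f_op->read(file, buf, count, pos);
	else if (file->f_op->read_iter)
		ret = new_sync_read(file, buf, count, pos);
	else
		ret = -EINVAL;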


[PATCH v5] soc: fsl: enable acpi support in RCPM driver

2020-09-03 Thread Ran Wang
From: Peng Ma 

This patch enables ACPI support in RCPM driver.

Signed-off-by: Peng Ma 
Signed-off-by: Ran Wang 
---
Change in v5:
 - Fix panic when dev->of_node is null

Change in v4:
 - Make commit subject more accurate
 - Remove unrelated new blank line

Change in v3:
 - Add #ifdef CONFIG_ACPI for acpi_device_id
 - Rename rcpm_acpi_imx_ids to rcpm_acpi_ids

Change in v2:
 - Update acpi_device_id to fix conflict with other driver

 drivers/soc/fsl/rcpm.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/soc/fsl/rcpm.c b/drivers/soc/fsl/rcpm.c
index a093dbe..58870f4 100644
--- a/drivers/soc/fsl/rcpm.c
+++ b/drivers/soc/fsl/rcpm.c
@@ -2,7 +2,7 @@
 //
 // rcpm.c - Freescale QorIQ RCPM driver
 //
-// Copyright 2019 NXP
+// Copyright 2019-2020 NXP
 //
 // Author: Ran Wang 
 
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define RCPM_WAKEUP_CELL_MAX_SIZE  7
 
@@ -56,10 +57,14 @@ static int rcpm_pm_prepare(struct device *dev)
"fsl,rcpm-wakeup", value,
rcpm->wakeup_cells + 1);
 
-   /*  Wakeup source should refer to current rcpm device */
-   if (ret || (np->phandle != value[0]))
+   if (ret)
continue;
 
+   if (is_of_node(dev->fwnode))
+   /*  Should refer to current rcpm device */
+   if (np->phandle != value[0])
+   continue;
+
/* Property "#fsl,rcpm-wakeup-cells" of rcpm node defines the
 * number of IPPDEXPCR register cells, and "fsl,rcpm-wakeup"
 * of wakeup source IP contains an integer array: