Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-09 Thread Gerald Schaefer
On Wed, 9 Sep 2020 13:38:25 +0530
Anshuman Khandual  wrote:

> 
> 
> On 09/04/2020 08:56 PM, Gerald Schaefer wrote:
> > On Fri, 4 Sep 2020 12:18:05 +0530
> > Anshuman Khandual  wrote:
> > 
> >>
> >>
> >> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> >>> This patch series includes fixes for the debug_vm_pgtable test code so
> >>> that it follows page table update rules correctly. The first two patches
> >>> introduce changes w.r.t. ppc64; they are included in this series for
> >>> completeness, and we can merge them via the ppc64 tree if required.
> >>>
> >>> The hugetlb test is disabled on ppc64 because it needs a larger change
> >>> to satisfy the page table update rules.
> >>>
> >>> These tests are broken w.r.t. page table update rules and result in a
> >>> kernel crash as below.
> >>>
> >>> [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> >>> cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
> >>> pc: c009a5ec: assert_pte_locked+0x14c/0x380
> >>> lr: c05c: pte_update+0x11c/0x190
> >>> sp: c00c6d1e7950
> >>>msr: 82029033
> >>>   current = 0xc00c6d172c80
> >>>   paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
> >>> pid   = 1, comm = swapper/0
> >>> kernel BUG at arch/powerpc/mm/pgtable.c:304!
> >>> [link register   ] c05c pte_update+0x11c/0x190
> >>> [c00c6d1e7950] 0001 (unreliable)
> >>> [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
> >>> [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
> >>> [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
> >>> [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
> >>> [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
> >>> [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
> >>> [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> >>>
> >>> With DEBUG_VM disabled
> >>>
> >>> [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x
> >>> [   20.530183] Faulting instruction address: 0xc00df330
> >>> cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700]
> >>> pc: c00df330: memset+0x68/0x104
> >>> lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> >>> sp: c00c6d19f990
> >>>msr: 82009033
> >>>dar: 0
> >>>   current = 0xc00c6d177480
> >>>   paca= 0xc0001ec4f400   irqmask: 0x03   irq_happened: 0x01
> >>> pid   = 1, comm = swapper/0
> >>> [link register   ] c009f6d8 
> >>> hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> >>> [c00c6d19f990] c009f748 
> >>> hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
> >>> [c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378
> >>> [c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244
> >>> [c00c6d19fba0] c00116ec do_one_initcall+0xac/0x5f0
> >>> [c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4
> >>> [c00c6d19fdb0] c0012474 kernel_init+0x24/0x160
> >>> [c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> >>>
> >>> Changes from v3:
> >>> * Address review feedback
> >>> * Move the page table deposit and withdraw patch after adding pmdlock,
> >>> to avoid a bisect failure.
> >>
> >> This version
> >>
> >> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with 
> >> DEBUG_VM_PGTABLE)
> >> - Runs on arm64 and x86 without any regression, at least nothing that I
> >> have noticed
> >> - It would be great if this could get tested on s390, arc, riscv and
> >> ppc32 platforms as well
> > 
> > When I quickly tested v3, it worked fine, but now it turned out to
> > only work fine "sometimes", both v3 and v4. I need to look into it
> > further, but so far it seems related to the hugetlb_advanced_tests().
> > 
> > I guess there was already some discussion on this test, but we did
> > not receive all of the thread(s). Please always add at least
> > linux-s...@vger.kernel.org and maybe myself and Vasily Gorbik
> > for further discussions.

Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-09 Thread Gerald Schaefer
On Wed, 09 Sep 2020 11:38:39 +0530
"Aneesh Kumar K.V"  wrote:

> Gerald Schaefer  writes:
> 
> > On Fri, 4 Sep 2020 18:01:15 +0200
> > Gerald Schaefer  wrote:
> >
> > [...]
> >> 
> >> BTW2, a quick test with this change (so far) made the issues on s390
> >> go away:
> >> 
> >> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void)
> >> spin_unlock(ptl);
> >> 
> >>  #ifndef CONFIG_PPC_BOOK3S_64
> >> -   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> >> +   hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, 
> >> vaddr, prot);
> >>  #endif
> >> 
> >> spin_lock(&mm->page_table_lock);
> >> 
> >> That would more match the "pte_t pointer" usage for hugetlb code,
> >> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> >> but I think the root cause is the pte_t pointer.
> >> 
> >> Not entirely sure though if that would really be the correct fix.
> >> I somehow lost whatever little track I had about what these tests
> >> really want to check, and if that would still be valid with that
> >> change.
> >
> > Uh oh, wasn't aware that this (or some predecessor) already went
> > upstream, and broke our debug kernel today.
> 
> Not sure I followed the above. Are you seeing that s390 kernel crash
> after this patch series, or with the original patchset? As noted in my
> patch, the hugetlb test is broken and we should fix that. A quick fix is
> to comment out that test for s390 too, as I have done for PPC64.

We see it with both; it is basically broken because there is a hugetlb
test using real pte pointers. It doesn't always show, depending on the
random vaddr, so it slipped through earlier testing.

I guess we also would have had one or the other chance to notice
that earlier, through better review, or better reading of previous
mails. I must admit that I neglected this a bit.

___
linux-snps-arc mailing list
linux-snps-arc@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-snps-arc


Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-09 Thread Gerald Schaefer
On Wed, 9 Sep 2020 13:45:48 +0530
Anshuman Khandual  wrote:

[...]
> > 
> > That would more match the "pte_t pointer" usage for hugetlb code,
> > i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> > but I think the root cause is the pte_t pointer.
> 
> Ideally, the pte_t pointer used here should be from huge_pte_alloc()
> not from pte_alloc_map_lock() as the case currently.

Ah, good point. I assumed that this would also always return casted
pmd etc. pointers, and never pte pointers. Unfortunately, that doesn't
seem to be true for all architectures, e.g. ia64, parisc, (some) powerpc,
where they really do a pte_alloc_map() for some reason.

I guess that means you cannot simply cast the pmd pointer, as suggested,
although I really do not understand how any architecture can work with
real ptes for hugepages. But that's fair, s390 also does some things that
nobody would expect or understand for other architectures...

So, for using huge_pte_alloc() you'd also need some size, maybe
iterating over hstates with for_each_hstate() could be an option,
if they are already initialized at that point. Then you have the
size(s) with huge_page_size(hstate) and can actually call the
hugetlb tests for all supported sizes, and with proper pointer
from huge_pte_alloc().



Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-08 Thread Gerald Schaefer
On Fri, 4 Sep 2020 18:01:15 +0200
Gerald Schaefer  wrote:

[...]
> 
> BTW2, a quick test with this change (so far) made the issues on s390
> go away:
> 
> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void)
> spin_unlock(ptl);
> 
>  #ifndef CONFIG_PPC_BOOK3S_64
> -   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> +   hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, 
> prot);
>  #endif
> 
> spin_lock(&mm->page_table_lock);
> 
> That would more match the "pte_t pointer" usage for hugetlb code,
> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> but I think the root cause is the pte_t pointer.
> 
> Not entirely sure though if that would really be the correct fix.
> I somehow lost whatever little track I had about what these tests
> really want to check, and if that would still be valid with that
> change.

Uh oh, wasn't aware that this (or some predecessor) already went
upstream, and broke our debug kernel today.

I found out now what goes (horribly) wrong on s390, see below for
more details. In short, using hugetlb primitives with ptep pointers
that do _not_ point to a pmd or pud entry will not work on s390.
It also seems to make no sense to verify / test such a thing in general,
as it would also be a severe bug if any kernel code did that.
After all, with hugepages, there are no pte tables, only pmd etc.
tables.

My change above would fix the issue for s390, but I can still not
completely judge if that would not break other things for your
tests. In general, for normal kernel code, much of what you do would
be very broken, but I guess your tests are doing such "special" things
because they can. E.g. because they operate on some "sandbox" mm
and page tables, and you also do not need properly populated page
tables for some exit / free cleanup, you just throw them away
explicitly with pXd_free at the end. So it might just be "the right
thing" to pass a casted pmd pointer to hugetlb_advanced_tests(),
to simulate and test (proper) usage of the hugetlb primitives.
I also see no other way to make this work for s390, than using a
proper pmd/pud pointer. If not possible, please add us to the
#ifndef.

So, for all those interested, here is what goes wrong on s390.
huge_ptep_get_and_clear() uses the "idte" instruction for the
clearing (and TLB invalidation) part. That instruction expects
a "region or segment table" origin, which is a pmd/pud/p4d/pgd,
but not a pte table. Even worse, when we calculate the table
origin from the given ptep (which *should* not point to a pte),
due to different table sizes for pte / pXd tables, we end up
at some place before the given pte table.

The "idte" instruction also gets the virtual address, and does
corresponding index addition to the given table origin. Depending
on the pmd_index we now end up either within the pte table again,
in which case we see a panic because idte complains about seeing
a pte value. If we are unlucky, then we end up outside the pte
table, and depending on the content of that memory location, idte
might succeed, effectively corrupting that memory.

That explains why we only see the panic sometimes, depending on
random vaddr, other symptoms other times, and probably completely
silent memory corruption for the rest...



Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-04 Thread Gerald Schaefer
On Fri, 4 Sep 2020 18:01:15 +0200
Gerald Schaefer  wrote:

> On Fri, 4 Sep 2020 17:26:47 +0200
> Gerald Schaefer  wrote:
> 
> > On Fri, 4 Sep 2020 12:18:05 +0530
> > Anshuman Khandual  wrote:
> > 
> > > 
> > > 
> > > On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> > > > This patch series includes fixes for the debug_vm_pgtable test code so
> > > > that it follows page table update rules correctly. The first two patches
> > > > introduce changes w.r.t. ppc64; they are included in this series for
> > > > completeness, and we can merge them via the ppc64 tree if required.
> > > >
> > > > The hugetlb test is disabled on ppc64 because it needs a larger change
> > > > to satisfy the page table update rules.
> > > >
> > > > These tests are broken w.r.t. page table update rules and result in a
> > > > kernel crash as below.
> > > > 
> > > > [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > > > cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
> > > > pc: c009a5ec: assert_pte_locked+0x14c/0x380
> > > > lr: c05c: pte_update+0x11c/0x190
> > > > sp: c00c6d1e7950
> > > >msr: 82029033
> > > >   current = 0xc00c6d172c80
> > > >   paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
> > > > pid   = 1, comm = swapper/0
> > > > kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > > > [link register   ] c05c pte_update+0x11c/0x190
> > > > [c00c6d1e7950] 0001 (unreliable)
> > > > [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
> > > > [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
> > > > [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
> > > > [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
> > > > [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
> > > > [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
> > > > [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > > > 
> > > > With DEBUG_VM disabled
> > > > 
> > > > [   20.530152] BUG: Kernel NULL pointer dereference on read at 
> > > > 0x
> > > > [   20.530183] Faulting instruction address: 0xc00df330
> > > > cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700]
> > > > pc: c00df330: memset+0x68/0x104
> > > > lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > > > sp: c00c6d19f990
> > > >msr: 82009033
> > > >dar: 0
> > > >   current = 0xc00c6d177480
> > > >   paca= 0xc0001ec4f400   irqmask: 0x03   irq_happened: 0x01
> > > > pid   = 1, comm = swapper/0
> > > > [link register   ] c009f6d8 
> > > > hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > > > [c00c6d19f990] c009f748 
> > > > hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
> > > > [c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378
> > > > [c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244
> > > > [c00c6d19fba0] c00116ec do_one_initcall+0xac/0x5f0
> > > > [c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4
> > > > [c00c6d19fdb0] c0012474 kernel_init+0x24/0x160
> > > > [c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > > > 
> > > > Changes from v3:
> > > > * Address review feedback
> > > > * Move the page table deposit and withdraw patch after adding pmdlock,
> > > > to avoid a bisect failure.
> > > 
> > > This version
> > > 
> > > - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with 
> > > DEBUG_VM_PGTABLE)
> > > - Runs on arm64 and x86 without any regression, at least nothing that I
> > > have noticed
> > > - It would be great if this could get tested on s390, arc, riscv and
> > > ppc32 platforms as well
> > 
> > When I quickly tested v3, it worked fine, but now it turned out to
> > only work fine "sometimes", both v3 and v4. I need to look into it
> > further, but so far it seems related to the hugetlb_advanced_tests().
> > 
> >

Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-04 Thread Gerald Schaefer
On Fri, 4 Sep 2020 17:26:47 +0200
Gerald Schaefer  wrote:

> On Fri, 4 Sep 2020 12:18:05 +0530
> Anshuman Khandual  wrote:
> 
> > 
> > 
> > On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> > > This patch series includes fixes for the debug_vm_pgtable test code so
> > > that it follows page table update rules correctly. The first two patches
> > > introduce changes w.r.t. ppc64; they are included in this series for
> > > completeness, and we can merge them via the ppc64 tree if required.
> > >
> > > The hugetlb test is disabled on ppc64 because it needs a larger change
> > > to satisfy the page table update rules.
> > >
> > > These tests are broken w.r.t. page table update rules and result in a
> > > kernel crash as below.
> > > 
> > > [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > > cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
> > > pc: c009a5ec: assert_pte_locked+0x14c/0x380
> > > lr: c05c: pte_update+0x11c/0x190
> > > sp: c00c6d1e7950
> > >msr: 82029033
> > >   current = 0xc00c6d172c80
> > >   paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
> > > pid   = 1, comm = swapper/0
> > > kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > > [link register   ] c05c pte_update+0x11c/0x190
> > > [c00c6d1e7950] 0001 (unreliable)
> > > [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
> > > [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
> > > [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
> > > [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
> > > [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
> > > [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
> > > [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > > 
> > > With DEBUG_VM disabled
> > > 
> > > [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x
> > > [   20.530183] Faulting instruction address: 0xc00df330
> > > cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700]
> > > pc: c00df330: memset+0x68/0x104
> > > lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > > sp: c00c6d19f990
> > >msr: 82009033
> > >dar: 0
> > >   current = 0xc00c6d177480
> > >   paca= 0xc0001ec4f400   irqmask: 0x03   irq_happened: 0x01
> > > pid   = 1, comm = swapper/0
> > > [link register   ] c009f6d8 
> > > hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > > [c00c6d19f990] c009f748 
> > > hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
> > > [c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378
> > > [c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244
> > > [c00c6d19fba0] c00116ec do_one_initcall+0xac/0x5f0
> > > [c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4
> > > [c00c6d19fdb0] c0012474 kernel_init+0x24/0x160
> > > [c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > > 
> > > Changes from v3:
> > > * Address review feedback
> > > * Move the page table deposit and withdraw patch after adding pmdlock,
> > > to avoid a bisect failure.
> > 
> > This version
> > 
> > - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with 
> > DEBUG_VM_PGTABLE)
> > - Runs on arm64 and x86 without any regression, at least nothing that I
> > have noticed
> > - It would be great if this could get tested on s390, arc, riscv and
> > ppc32 platforms as well
> 
> When I quickly tested v3, it worked fine, but now it turned out to
> only work fine "sometimes", both v3 and v4. I need to look into it
> further, but so far it seems related to the hugetlb_advanced_tests().
> 
> I guess there was already some discussion on this test, but we did
> not receive all of the thread(s). Please always add at least
> linux-s...@vger.kernel.org and maybe myself and Vasily Gorbik 
> 
> for further discussions.

BTW, with myself I mean the new address gerald.schae...@linux.ibm.com.
The old gerald.schae...@de.ibm.com seems to work (again), but is not
very reliable.

BTW2, a quick test with this change (so far) made the issues on s390
go away:

@@ -1069,7 +1074,7 @@ static int __in

Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-04 Thread Gerald Schaefer
On Fri, 4 Sep 2020 12:18:05 +0530
Anshuman Khandual  wrote:

> 
> 
> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> > This patch series includes fixes for the debug_vm_pgtable test code so
> > that it follows page table update rules correctly. The first two patches
> > introduce changes w.r.t. ppc64; they are included in this series for
> > completeness, and we can merge them via the ppc64 tree if required.
> >
> > The hugetlb test is disabled on ppc64 because it needs a larger change
> > to satisfy the page table update rules.
> >
> > These tests are broken w.r.t. page table update rules and result in a
> > kernel crash as below.
> > 
> > [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
> > pc: c009a5ec: assert_pte_locked+0x14c/0x380
> > lr: c05c: pte_update+0x11c/0x190
> > sp: c00c6d1e7950
> >msr: 82029033
> >   current = 0xc00c6d172c80
> >   paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
> > pid   = 1, comm = swapper/0
> > kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > [link register   ] c05c pte_update+0x11c/0x190
> > [c00c6d1e7950] 0001 (unreliable)
> > [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
> > [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
> > [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
> > [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
> > [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
> > [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
> > [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > 
> > With DEBUG_VM disabled
> > 
> > [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x
> > [   20.530183] Faulting instruction address: 0xc00df330
> > cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700]
> > pc: c00df330: memset+0x68/0x104
> > lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > sp: c00c6d19f990
> >msr: 82009033
> >dar: 0
> >   current = 0xc00c6d177480
> >   paca= 0xc0001ec4f400   irqmask: 0x03   irq_happened: 0x01
> > pid   = 1, comm = swapper/0
> > [link register   ] c009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > [c00c6d19f990] c009f748 
> > hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
> > [c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378
> > [c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244
> > [c00c6d19fba0] c00116ec do_one_initcall+0xac/0x5f0
> > [c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4
> > [c00c6d19fdb0] c0012474 kernel_init+0x24/0x160
> > [c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > 
> > Changes from v3:
> > * Address review feedback
> > * Move the page table deposit and withdraw patch after adding pmdlock,
> > to avoid a bisect failure.
> 
> This version
> 
> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with 
> DEBUG_VM_PGTABLE)
> - Runs on arm64 and x86 without any regression, at least nothing that I
> have noticed
> - It would be great if this could get tested on s390, arc, riscv and
> ppc32 platforms as well

When I quickly tested v3, it worked fine, but now it turned out to
only work fine "sometimes", both v3 and v4. I need to look into it
further, but so far it seems related to the hugetlb_advanced_tests().

I guess there was already some discussion on this test, but we did
not receive all of the thread(s). Please always add at least
linux-s...@vger.kernel.org and maybe myself and Vasily Gorbik 

for further discussions.

That being said, sorry for duplications, this might already have been
discussed. Preliminary analysis showed that it only seems to go wrong
for certain random vaddr values. I cannot make any sense of that yet,
but what seems strange to me is that the hugetlb_advanced_tests()
take a (real) pte_t pointer as input, and also use that for all
kinds of operations (set_huge_pte_at, huge_ptep_get_and_clear, etc.).

Although all the hugetlb code in the kernel is (mis)using pte_t
pointers instead of the correct pmd/pud_t pointers like THP, that
is just for historic reasons. The pointers will actually never point
to a real pte_t (i.e. page table entry), but of course to a pmd
or pud entry, depending on hugepage size.

What is passed in as ptep to hugetlb_advanced_tests() seems to be
the result from the previous ptep = pte_alloc_map(mm, pmdp, vaddr),
so I would expect that it points to a real page table entry. Need
to investigate further, but IIUC, using such a pointer for adding
large pte entries (i.e. pmd/pud entries) at least feels very wrong
to me, and I assume it is related to the issues we see on s390.

We actually see different issues, e.g. once a panic directly in
hugetlb_advanced_tests() -> hug

Re: [PATCH V3 0/4] mm/debug_vm_pgtable: Add some more tests

2020-06-24 Thread Gerald Schaefer
On Wed, 24 Jun 2020 13:05:39 +0200
Alexander Gordeev  wrote:

> On Wed, Jun 24, 2020 at 08:43:10AM +0530, Anshuman Khandual wrote:
> 
> [...]
> 
> > Hello Gerald/Christophe/Vineet,
> > 
> > It would be really great if you could give this series a quick test
> > on s390/ppc/arc platforms respectively. Thank you.
> 
> That worked for me with the default and debug s390 configurations.
> Would you like to try with some particular options or combinations
> of the options?

It will be enabled automatically on all archs that set
ARCH_HAS_DEBUG_VM_PGTABLE, which we do for s390 unconditionally.
Also, DEBUG_VM has to be set, which we have only in the debug config.
So only the s390 debug config will have it enabled, you can check
dmesg for "debug_vm_pgtable" to see when / where it was run, and if it
triggered any warnings.

I also checked with the v3 series, and it works fine for s390.



Re: [PATCH V2 0/3] mm/debug: Add more arch page table helper tests

2020-04-08 Thread Gerald Schaefer
On Wed, 8 Apr 2020 12:41:51 +0530
Anshuman Khandual  wrote:

[...]
> >   
> >>
> >> Some thing like this instead.
> >>
> >> pte_t pte = READ_ONCE(*ptep);
> >> pte = pte_mkhuge(__pte((pte_val(pte) | RANDOM_ORVALUE) & PMD_MASK));
> >>
> >> We cannot use mk_pte_phys() as it is defined only on some platforms
> >> without any generic fallback for others.  
> > 
> > Oh, didn't know that, sorry. What about using mk_pte() instead, at least
> > it would result in a present pte:
> > 
> > pte = pte_mkhuge(mk_pte(phys_to_page(RANDOM_ORVALUE & PMD_MASK), prot));  
> 
> Lets use mk_pte() here but can we do this instead
> 
> paddr = (__pfn_to_phys(pfn) | RANDOM_ORVALUE) & PMD_MASK;
> pte = pte_mkhuge(mk_pte(phys_to_page(paddr), prot));
> 

Sure, that will also work.

BTW, this RANDOM_ORVALUE is not really very random, the way it is
defined. For s390 we already changed it to mask out some arch bits,
but I guess there are other archs and bits that would always be
set with this "not so random" value, and I wonder if/how that would
affect all the tests using this value, see also below.

> > 
> > And if you also want to do some with the existing value, which seems
> > to be an empty pte, then maybe just check if writing and reading that
> > value with set_huge_pte_at() / huge_ptep_get() returns the same,
> > i.e. initially w/o RANDOM_ORVALUE.
> > 
> > So, in combination, like this (BTW, why is the barrier() needed, it
> > is not used for the other set_huge_pte_at() calls later?):  
> 
> Ahh, missed that, will add them. Earlier we faced a problem without it
> after set_pte_at() in a test on the powerpc (64) platform. Hence just
> added it here to be extra careful.
> 
> > 
> > @@ -733,24 +733,28 @@ static void __init hugetlb_advanced_test
> > struct page *page = pfn_to_page(pfn);
> > pte_t pte = READ_ONCE(*ptep);
> >  
> > -   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> > +   set_huge_pte_at(mm, vaddr, ptep, pte);
> > +   WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));
> > +
> > +   pte = pte_mkhuge(mk_pte(phys_to_page(RANDOM_ORVALUE & PMD_MASK), 
> > prot));
> > set_huge_pte_at(mm, vaddr, ptep, pte);
> > barrier();
> > WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));
> > 
> > This would actually add a new test "write empty pte with
> > set_huge_pte_at(), then verify with huge_ptep_get()", which happens
> > to trigger a warning on s390 :-)  
> 
> On arm64 as well, which checks for pte_present() in set_huge_pte_at().
> But the pte_present() check is not done in every set_huge_pte_at()
> implementation, especially without __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT.
> Hence wondering if we should add this new test here, which will keep
> giving warnings on s390 and arm64 (at the least).

Hmm, interesting. I forgot about huge swap / migration, which is not
(and probably cannot be) supported on s390. The pte_present() check
on arm64 seems to check for such huge swap / migration entries,
according to the comment.

The new test "write empty pte with set_huge_pte_at(), then verify
with huge_ptep_get()" would then probably trigger the
WARN_ON(!pte_present(pte)) in arm64 code. So I guess "writing empty
ptes with set_huge_pte_at()" is not really a valid use case in practice,
or else you would have seen this warning before. In that case, it
might not be a good idea to add this test.

I also do wonder now, why the original test with
"pte = __pte(pte_val(pte) | RANDOM_ORVALUE);"
did not also trigger that warning on arm64. On s390 this test failed
exactly because the constructed pte was not present (initially empty,
or'ing RANDOM_ORVALUE does not make it present for s390). I guess this
just worked by chance on arm64, because the bits from RANDOM_ORVALUE
also happened to mark the pte present for arm64.

This brings us back to the question above, regarding the "randomness"
of RANDOM_ORVALUE. Not really sure what the intention behind that was,
but maybe it would make sense to restrict this RANDOM_ORVALUE to
non-arch-specific bits, i.e. only bits that would be part of the
address value within a page table entry? Or was it intentionally
chosen to also mess with other bits?

Regards,
Gerald




Re: [PATCH V2 0/3] mm/debug: Add more arch page table helper tests

2020-04-07 Thread Gerald Schaefer
On Sun, 5 Apr 2020 17:58:14 +0530
Anshuman Khandual  wrote:

[...]
> > 
> > Could be fixed like this (the first de-reference is a bit special,
> > because at that point *ptep does not really point to a large (pmd) entry
> > yet, it is initially an invalid pte entry, which breaks our huge_ptep_get() 
> >  
> 
> There seems to be an inconsistency on the s390 platform. Even though it
> defines a huge_ptep_get() override, it does not subscribe to
> __HAVE_ARCH_HUGE_PTEP_GET, which should have forced it to fall back on
> the generic huge_ptep_get(), but it does not :) Then I realized that
> __HAVE_ARCH_HUGE_PTEP_GET only makes sense when an arch uses the generic
> hugetlb.h. s390 does not use that and hence gets away with its own
> huge_ptep_get() without __HAVE_ARCH_HUGE_PTEP_GET. Sounds confusing?
> But I might not have the entire context here.

Yes, that sounds very confusing. Also a bit ironic, since huge_ptep_get()
was initially introduced because of s390, and now we don't select
__HAVE_ARCH_HUGE_PTEP_GET...

As you realized, I guess this is because we do not use generic hugetlb.h.
And when __HAVE_ARCH_HUGE_PTEP_GET was introduced with commit 544db7597ad
("hugetlb: introduce generic version of huge_ptep_get"), that was probably
the reason why we did not get our share of __HAVE_ARCH_HUGE_PTEP_GET.

Nothing really wrong with that, but yes, very confusing. Maybe we could
also select it for s390, even though it wouldn't have any functional
impact (so far), just for less confusion. Maybe also thinking about
using the generic hugetlb.h, not sure if the original reasons for not
doing so would still apply. Now I only need to find the time...

> 
> > conversion logic. I also added PMD_MASK alignment for RANDOM_ORVALUE,
> > because we do have some special bits there in our large pmds. It seems
> > to also work w/o that alignment, but it feels a bit wrong):  
> 
> Sure, we can accommodate that.
> 
> > 
> > @@ -731,26 +731,26 @@ static void __init hugetlb_advanced_test
> >   unsigned long vaddr, pgprot_t 
> > prot)
> >  {
> > struct page *page = pfn_to_page(pfn);
> > -   pte_t pte = READ_ONCE(*ptep);
> > +   pte_t pte;
> > 
> > -   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> > +   pte = pte_mkhuge(mk_pte_phys(RANDOM_ORVALUE & PMD_MASK, prot));  
> 
> So that keeps the existing value in 'ptep' pointer at bay and instead
> construct a PTE from scratch. I would rather have READ_ONCE(*ptep) at
> least provide the seed that can be ORed with RANDOM_ORVALUE before
> being masked with PMD_MASK. Do you see any problem ?

Yes, unfortunately. The problem is that the resulting pte is not marked
as present. The conversion pte -> (huge) pmd, which is done in
set_huge_pte_at() for s390, will establish an empty pmd for non-present
ptes, all the RANDOM_ORVALUE stuff is lost. And a subsequent
huge_ptep_get() will not result in the same original pte value. If you
want to preserve and check the RANDOM_ORVALUE, it has to be a present
pte, hence the mk_pte(_phys).

> 
> Some thing like this instead.
> 
> pte_t pte = READ_ONCE(*ptep);
> pte = pte_mkhuge(__pte((pte_val(pte) | RANDOM_ORVALUE) & PMD_MASK));
> 
> We cannot use mk_pte_phys() as it is defined only on some platforms
> without any generic fallback for others.

Oh, didn't know that, sorry. What about using mk_pte() instead, at least
it would result in a present pte:

pte = pte_mkhuge(mk_pte(phys_to_page(RANDOM_ORVALUE & PMD_MASK), prot));

And if you also want to do something with the existing value, which seems
to be an empty pte, then maybe just check if writing and reading that
value with set_huge_pte_at() / huge_ptep_get() returns the same,
i.e. initially w/o RANDOM_ORVALUE.

So, in combination, like this (BTW, why is the barrier() needed? It
is not used for the other set_huge_pte_at() calls later):

@@ -733,24 +733,28 @@ static void __init hugetlb_advanced_test
struct page *page = pfn_to_page(pfn);
pte_t pte = READ_ONCE(*ptep);
 
-   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
+   set_huge_pte_at(mm, vaddr, ptep, pte);
+   WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));
+
+   pte = pte_mkhuge(mk_pte(phys_to_page(RANDOM_ORVALUE & PMD_MASK), prot));
set_huge_pte_at(mm, vaddr, ptep, pte);
barrier();
WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));

This would actually add a new test "write empty pte with
set_huge_pte_at(), then verify with huge_ptep_get()", which happens
to trigger a warning on s390 :-)

That (new) warning actually points to misbehavior on s390, we do not
write a correct empty pmd in this case, but one that is empty and also
marked as large. huge_ptep_get() will then not correctly recognize it
as empty and will do a wrong conversion. It is also not consistent with
huge_ptep_get_and_clear(), where we write the empty pmd w/o marking
as large. Last but not least it would also break our pmd_protnone()
logic (see below). Another nice finding on s390 :-)

I don't think this has any effect in practice (y

Re: [PATCH V2 0/3] mm/debug: Add more arch page table helper tests

2020-03-31 Thread Gerald Schaefer
On Tue, 24 Mar 2020 10:52:52 +0530
Anshuman Khandual  wrote:

> This series adds more arch page table helper tests. The new tests here are
> either related to core memory functions or advanced arch pgtable helpers.
> This also creates a documentation file enlisting all expected semantics as
> suggested by Mike Rapoport (https://lkml.org/lkml/2020/1/30/40).
> 
> This series has been tested on arm64 and x86 platforms. There is just one
> expected failure on arm64 that will be fixed when we enable THP migration.
> 
> [   21.741634] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:782
> 
> which corresponds to
> 
> WARN_ON(!pmd_present(pmd_mknotpresent(pmd_mkhuge(pmd
> 
> There are many TRANSPARENT_HUGEPAGE and ARCH_HAS_TRANSPARENT_HUGEPAGE_PUD
> ifdefs scattered across the test. But consolidating all the fallback stubs
> is not very straight forward because ARCH_HAS_TRANSPARENT_HUGEPAGE_PUD is
> not explicitly dependent on ARCH_HAS_TRANSPARENT_HUGEPAGE.
> 
> This series has been build tested on many platforms including the ones that
> subscribe the test through ARCH_HAS_DEBUG_VM_PGTABLE.
> 

Hi Anshuman,

thanks for the update. There are a couple of issues on s390; some might
also affect other archs.

1) The pxd_huge_tests are using pxd_set/clear_huge, which default to
returning 0 if !CONFIG_HAVE_ARCH_HUGE_VMAP. As a result, the checks for
!pxd_test/clear_huge in the pxd_huge_tests will always trigger the
warning. This should affect all archs w/o CONFIG_HAVE_ARCH_HUGE_VMAP.
Could be fixed like this:

@@ -923,8 +923,10 @@ void __init debug_vm_pgtable(void)
pmd_leaf_tests(pmd_aligned, prot);
pud_leaf_tests(pud_aligned, prot);
 
-   pmd_huge_tests(pmdp, pmd_aligned, prot);
-   pud_huge_tests(pudp, pud_aligned, prot);
+   if (IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) {
+   pmd_huge_tests(pmdp, pmd_aligned, prot);
+   pud_huge_tests(pudp, pud_aligned, prot);
+   }
 
pte_savedwrite_tests(pte_aligned, prot);
pmd_savedwrite_tests(pmd_aligned, prot);

BTW, please add some comments to the various #ifdef/#else stuff, especially
when the different parts are far away and/or nested.

2) The hugetlb_advanced_test will fail because it directly de-references
huge *ptep pointers instead of using huge_ptep_get() for this. We have
very different pagetable entry layout for pte and (large) pmd on s390,
and unfortunately the whole hugetlbfs code is using pte_t instead of pmd_t
like THP. For this reason, huge_ptep_get() was introduced, which will
return a "converted" pte, because directly reading from a *ptep (pointing
to a large pmd) will not return a proper pte. Only ARM also has an
implementation of huge_ptep_get(), so they could be affected, depending
on what exactly they need it for.

Could be fixed like this (the first de-reference is a bit special,
because at that point *ptep does not really point to a large (pmd) entry
yet; it is initially an invalid pte entry, which breaks our huge_ptep_get()
conversion logic. I also added PMD_MASK alignment for RANDOM_ORVALUE,
because we do have some special bits there in our large pmds. It seems
to also work w/o that alignment, but it feels a bit wrong):

@@ -731,26 +731,26 @@ static void __init hugetlb_advanced_test
  unsigned long vaddr, pgprot_t prot)
 {
struct page *page = pfn_to_page(pfn);
-   pte_t pte = READ_ONCE(*ptep);
+   pte_t pte;

-   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
+   pte = pte_mkhuge(mk_pte_phys(RANDOM_ORVALUE & PMD_MASK, prot));
set_huge_pte_at(mm, vaddr, ptep, pte);
barrier();
WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));
huge_pte_clear(mm, vaddr, ptep, PMD_SIZE);
-   pte = READ_ONCE(*ptep);
+   pte = huge_ptep_get(ptep);
WARN_ON(!huge_pte_none(pte));
 
pte = mk_huge_pte(page, prot);
set_huge_pte_at(mm, vaddr, ptep, pte);
huge_ptep_set_wrprotect(mm, vaddr, ptep);
-   pte = READ_ONCE(*ptep);
+   pte = huge_ptep_get(ptep);
WARN_ON(huge_pte_write(pte));
 
pte = mk_huge_pte(page, prot);
set_huge_pte_at(mm, vaddr, ptep, pte);
huge_ptep_get_and_clear(mm, vaddr, ptep);
-   pte = READ_ONCE(*ptep);
+   pte = huge_ptep_get(ptep);
WARN_ON(!huge_pte_none(pte));
 
pte = mk_huge_pte(page, prot);
@@ -759,7 +759,7 @@ static void __init hugetlb_advanced_test
pte = huge_pte_mkwrite(pte);
pte = huge_pte_mkdirty(pte);
huge_ptep_set_access_flags(vma, vaddr, ptep, pte, 1);
-   pte = READ_ONCE(*ptep);
+   pte = huge_ptep_get(ptep);
WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
 }
 #else

3) The pmd_protnone_tests() has an issue, because it passes a pmd to
pmd_protnone() which has not been marked as large. We check for large
pmd in the s390 implementation of pmd_protnone(), and will fail if a
pmd is not large. We had similar issues before, in other helpers, where
I 

Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-02-12 Thread Gerald Schaefer
On Wed, 12 Feb 2020 15:12:54 +0530
Anshuman Khandual  wrote:

> >> +/*
> >> + * On s390 platform, the lower 12 bits are used to identify given page 
> >> table
> >> + * entry type and for other arch specific requirements. But these bits 
> >> might
> >> + * affect the ability to clear entries with pxx_clear(). So while loading 
> >> up
> >> + * the entries skip all lower 12 bits in order to accommodate s390 
> >> platform.
> >> + * It does not have affect any other platform.
> >> + */
> >> +#define RANDOM_ORVALUE(0xf000UL)  
> > 
> > I'd suggest you generate this mask with something like
> > GENMASK(BITS_PER_LONG, PAGE_SHIFT).  
> 
> IIRC the lower 12 bits constrains on s390 platform might not be really related
> to it's PAGE_SHIFT which can be a variable, but instead just a constant 
> number.
> But can definitely use GENMASK or it's variants here.
> 
> https://lkml.org/lkml/2019/9/5/862

PAGE_SHIFT would be fine, it is 12 on s390. However, in order to be
more precise, we do not really need all 12 bits, only the last 4 bits.
So, something like this would work:

#define RANDOM_ORVALUE GENMASK(BITS_PER_LONG - 1, 4)

The text in the comment could then also be changed from 12 to 4, and
be a bit more specific on the fact that the impact on pxx_clear()
results from the dynamic page table folding logic on s390:

/*
 * On the s390 platform, the lower 4 bits are used to identify the given page
 * table entry type. But these bits might affect the ability to clear entries
 * with pxx_clear() because of how dynamic page table folding works on s390.
 * So, while loading up the entries, do not change the lower 4 bits.
 * It does not affect any other platform.
 */


___
linux-snps-arc mailing list
linux-snps-arc@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-snps-arc


Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-29 Thread Gerald Schaefer
On Mon, 27 Jan 2020 22:33:08 -0500
Qian Cai  wrote:

> > 
> >> Did those tests ever find any regression or this is almost only useful for 
> >> new
> > 
> > The test has already found problems with s390 page table helpers.
> 
> Hmm, that is pretty weak where s390 is not even official supported with this 
> version.
> 

I first had to get the three patches upstream, each fixing non-conforming
behavior on s390, and each issue was found by this extremely useful test:

2416cefc504b s390/mm: add mm_pxd_folded() checks to pxd_free()
2d1fc1eb9b54 s390/mm: simplify page table helpers for large entries
1c27a4bc817b s390/mm: make pmd/pud_bad() report large entries as bad

I did not see any direct effect of this misbehavior yet, but I am
very happy that this could be found and fixed in order to prevent
future issues. And this is exactly the value of this test, to make
sure that all architectures have a common understanding of how
the various page table helpers are supposed to work.

For example, who would have thought that pXd_bad() is supposed to
report large entries as bad? It's not really documented anywhere,
so we just checked them for sanity like normal entries, which
apparently worked fine so far, but for how long?




Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-29 Thread Gerald Schaefer
On Tue, 28 Jan 2020 06:57:53 +0530
Anshuman Khandual  wrote:

> This adds tests which will validate architecture page table helpers and
> other accessors in their compliance with expected generic MM semantics.
> This will help various architectures in validating changes to existing
> page table helpers or addition of new ones.
> 
> This test covers basic page table entry transformations including but not
> limited to old, young, dirty, clean, write, write protect etc at various
> level along with populating intermediate entries with next page table page
> and validating them.
> 
> Test page table pages are allocated from system memory with required size
> and alignments. The mapped pfns at page table levels are derived from a
> real pfn representing a valid kernel text symbol. This test gets called
> right after page_alloc_init_late().
> 
> This gets build and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
> CONFIG_VM_DEBUG. Architectures willing to subscribe this test also need to
> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE which for now is limited to x86 and
> arm64. Going forward, other architectures too can enable this after fixing
> build or runtime problems (if any) with their page table helpers.
> 
> Folks interested in making sure that a given platform's page table helpers
> conform to expected generic MM semantics should enable the above config
> which will just trigger this test during boot. Any non conformity here will
> be reported as an warning which would need to be fixed. This test will help
> catch any changes to the agreed upon semantics expected from generic MM and
> enable platforms to accommodate it thereafter.
> 

[...]

> 
> Tested-by: Christophe Leroy  #PPC32

Tested-by: Gerald Schaefer  # s390

Thanks again for this effort, and for keeping up the spirit against
all odds and even after 12 iterations :-)

> 
> diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt 
> b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
> new file mode 100644
> index ..f3f8111edbe3
> --- /dev/null
> +++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
> @@ -0,0 +1,35 @@
> +#
> +# Feature name:  debug-vm-pgtable
> +# Kconfig:   ARCH_HAS_DEBUG_VM_PGTABLE
> +# description:   arch supports pgtable tests for semantics compliance
> +#
> +---
> +| arch |status|
> +---
> +|   alpha: | TODO |
> +| arc: |  ok  |
> +| arm: | TODO |
> +|   arm64: |  ok  |
> +| c6x: | TODO |
> +|csky: | TODO |
> +|   h8300: | TODO |
> +| hexagon: | TODO |
> +|ia64: | TODO |
> +|m68k: | TODO |
> +|  microblaze: | TODO |
> +|mips: | TODO |
> +|   nds32: | TODO |
> +|   nios2: | TODO |
> +|openrisc: | TODO |
> +|  parisc: | TODO |
> +|  powerpc/32: |  ok  |
> +|  powerpc/64: | TODO |
> +|   riscv: | TODO |
> +|s390: | TODO |

s390 is ok now, with my patches included in v5.5-rc1. So you can now add

--- a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
+++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
@@ -25,7 +25,7 @@
 |  powerpc/32: |  ok  |
 |  powerpc/64: | TODO |
 |   riscv: | TODO |
-|s390: | TODO |
+|s390: |  ok  |
 |  sh: | TODO |
 |   sparc: | TODO |
 |  um: | TODO |
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -59,6 +59,7 @@ config KASAN_SHADOW_OFFSET
 config S390
def_bool y
select ARCH_BINFMT_ELF_STATE
+   select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE




Re: [PATCH V8] mm/debug: Add tests validating architecture page table helpers

2019-11-05 Thread Gerald Schaefer
On Mon, 28 Oct 2019 10:59:22 +0530
Anshuman Khandual  wrote:

> This adds tests which will validate architecture page table helpers and
> other accessors in their compliance with expected generic MM semantics.
> This will help various architectures in validating changes to existing
> page table helpers or addition of new ones.
> 
> This test covers basic page table entry transformations including but not
> limited to old, young, dirty, clean, write, write protect etc at various
> level along with populating intermediate entries with next page table page
> and validating them.
> 
> Test page table pages are allocated from system memory with required size
> and alignments. The mapped pfns at page table levels are derived from a
> real pfn representing a valid kernel text symbol. This test gets called
> right after page_alloc_init_late().
> 
> This gets build and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
> CONFIG_VM_DEBUG. Architectures willing to subscribe this test also need to
> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE which for now is limited to x86 and
> arm64. Going forward, other architectures too can enable this after fixing
> build or runtime problems (if any) with their page table helpers.

I've prepared a couple of commits to our arch code to make this work on s390,
they will go upstream in the next merge window. After that, we can add s390
to the supported architectures.

We had some issues, e.g. because we do not report large entries as bad in
pxd_bad(), do not check for folded page tables in pxd_free(), or assume
that primitives like pmd_mkdirty() will only be called after pmd_mkhuge().
None of those should have any impact on current code, but your test module
revealed that we do not behave like other architectures in some aspects,
and it's good to find and fix such things to prevent possible future issues.

Thanks a lot for the effort!

Regards,
Gerald




Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-18 Thread Gerald Schaefer
On Wed, 18 Sep 2019 18:26:03 +0200
Christophe Leroy  wrote:

[..] 
> My suggestion was not to completely drop the #ifdef but to do like you 
> did in pgd_clear_tests() for instance, ie to add the following test on 
> top of the function:
> 
>   if (mm_pud_folded(mm) || is_defined(__ARCH_HAS_5LEVEL_HACK))
>   return;
> 

Ah, very nice, this would also fix the remaining issues for s390. Since
we have dynamic page table folding, neither __PAGETABLE_PXX_FOLDED nor
__ARCH_HAS_XLEVEL_HACK is defined, but mm_pxx_folded() will work.

mm_alloc() returns with a 3-level page table by default on s390, so we
will run into issues in p4d_clear/populate_tests(), and also at the end
with p4d/pud_free() (double free).

So, adding the mm_pud_folded() check to p4d_clear/populate_tests(),
and also adding mm_p4d/pud_folded() checks at the end before calling
p4d/pud_free(), would make it all work on s390.

BTW, regarding p4d/pud_free(), I'm not sure if we should rather check
the folding inside our s390 functions, similar to how we do it for
p4d/pud_free_tlb(), instead of relying on not being called for folded
p4d/pud. So far, I see no problem with this behavior, all callers of
p4d/pud_free() should be fine because of our folding check within
p4d/pud_present/none(). But that doesn't mean that it is correct not
to check for the folding inside p4d/pud_free(). At least, with this
test module we do now have a caller of p4d/pud_free() on potentially
folded entries, so instead of adding pxx_folded() checks to this
test module, we could add them to our p4d/pud_free() functions.
Any thoughts on this?

Regards,
Gerald




Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-09 Thread Gerald Schaefer
On Mon, 9 Sep 2019 11:56:50 +0530
Anshuman Khandual  wrote:

[..]
> > 
> > Hmm, I simply used this on my system to make pud_clear_tests() work, not
> > sure if it works on all archs:
> > 
> > pud_val(*pudp) |= RANDOM_NZVALUE;  
> 
> Which compiles on arm64 but then fails on x86 because of the way pmd_val()
> has been defined there. on arm64 and s390 (with many others) pmd_val() is
> a macro which still got the variable that can be used as lvalue but that is
> not true for some other platforms like x86.
> 
> arch/arm64/include/asm/pgtable-types.h:   #define pmd_val(x)  
> ((x).pmd)
> arch/s390/include/asm/page.h: #define pmd_val(x)  ((x).pmd)
> arch/x86/include/asm/pgtable.h:   #define pmd_val(x)   
> native_pmd_val(x)
> 
> static inline pmdval_t native_pmd_val(pmd_t pmd)
> {
> return pmd.pmd;
> }
> 
> Unless I am mistaken, the return value from this function can not be used as
> lvalue for future assignments.
> 
> mm/arch_pgtable_test.c: In function ‘pud_clear_tests’:
> mm/arch_pgtable_test.c:156:17: error: lvalue required as left operand of 
> assignment
>   pud_val(*pudp) |= RANDOM_ORVALUE;
>  ^~
> AFAICS pxx_val() were never intended to be used as lvalue and using it that 
> way
> might just happen to work on all those platforms which define them as macros.
> They meant to just provide values for an entry as being determined by the 
> platform.
> 
> In principle pxx_val() on an entry was not supposed to be modified directly 
> from
> generic code without going through (again) platform helpers for any specific 
> state
> change (write, old, dirty, special, huge etc). The current use case is a 
> deviation
> for that.
> 
> I originally went with memset() just to load up the entries with non-zero 
> value so
> that we know pxx_clear() are really doing the clearing. The same is being 
> followed
> for all pxx_same() checks.
> 
> Another way for fixing the problem would be to mark them with known attributes
> like write/young/huge etc instead which for sure will create non-zero entries.
> We can do that for pxx_clear() and pxx_same() tests and drop RANDOM_NZVALUE
> completely. Does that sound good ?

Umm, not really. Those mkwrite/young/huge etc. helpers only exist for
page table levels where we can also have large mappings, at least on s390.
Also, we do (on s390) again check for certain sanity before actually setting
the bits.
Good news is that at least for the pxx_same() checks the memset() is no
problem, because pxx_same() does not do any checks other than the same check.

For the pxx_clear_tests(), maybe it could be an option to put them behind the
pxx_populate_tests(), and rely on them having properly populated (non-clear)
values after that?

[...]
> > 
> > Actually, using get_unmapped_area() as suggested by Kirill could also
> > solve this issue. We do create a new mm with 3-level page tables on s390,
> > and the dynamic upgrade to 4 or 5 levels is then triggered exactly by
> > arch_get_unmapped_area(), depending on the addr. But I currently don't
> > see how / where arch_get_unmapped_area() is set up for such a dummy mm
> > created by mm_alloc().  
> 
> Normally they are set during program loading but we can set it up explicitly
> for the test mm_struct if we need to but there are some other challenges.
> 
> load_[aout|elf|flat|..]_binary()
>   setup_new_exec()
>   arch_pick_mmap_layout().
> 
> I did some initial experiments around get_unmapped_area(). Seems bit tricky
> to get it working on a pure 'test' mm_struct. It expects a real user context
> in the form of current->mm.

Yes, that's where I stopped because it looked rather complicated :-)
Not sure why Kirill suggested it initially, but if using get_unmapped_area()
would only be necessary to get properly initialized page table levels
on s390, you could also defer this to a later add-on patch.

Regards,
Gerald




Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-06 Thread Gerald Schaefer
On Fri, 6 Sep 2019 11:58:59 +0530
Anshuman Khandual  wrote:

> On 09/05/2019 10:36 PM, Gerald Schaefer wrote:
> > On Thu, 5 Sep 2019 14:48:14 +0530
> > Anshuman Khandual  wrote:
> >   
> >>> [...]
> >>>> +
> >>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> >>>> +static void pud_clear_tests(pud_t *pudp)
> >>>> +{
> >>>> +memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> >>>> +pud_clear(pudp);
> >>>> +WARN_ON(!pud_none(READ_ONCE(*pudp)));
> >>>> +}
> >>>
> >>> For pgd/p4d/pud_clear(), we only clear if the page table level is present
> >>> and not folded. The memset() here overwrites the table type bits, so
> >>> pud_clear() will not clear anything on s390 and the pud_none() check will
> >>> fail.
> >>> Would it be possible to OR a (larger) random value into the table, so that
> >>> the lower 12 bits would be preserved?
> >>
> >> So the suggestion is instead of doing memset() on entry with 
> >> RANDOM_NZVALUE,
> >> it should OR a large random value preserving lower 12 bits. Hmm, this 
> >> should
> >> still do the trick for other platforms, they just need non zero value. So 
> >> on
> >> s390, the lower 12 bits on the page table entry already has valid value 
> >> while
> >> entering this function which would make sure that pud_clear() really does
> >> clear the entry ?  
> > 
> > Yes, in theory the table entry on s390 would have the type set in the last
> > 4 bits, so preserving those would be enough. If it does not conflict with
> > others, I would still suggest preserving all 12 bits since those would 
> > contain
> > arch-specific flags in general, just to be sure. For s390, the pte/pmd tests
> > would also work with the memset, but for consistency I think the same logic
> > should be used in all pxd_clear_tests.  
> 
> Makes sense but..
> 
> There is a small challenge with this. Modifying individual bits on a given
> page table entry from generic code like this test case is bit tricky. That
> is because there are not enough helpers to create entries with an absolute
> value. This would have been easier if all the platforms provided functions
> like __pxx() which is not the case now. Otherwise something like this should
> have worked.
> 
> 
> pud_t pud = READ_ONCE(*pudp);
> pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
> WRITE_ONCE(*pudp, pud);
> 
> But __pud() will fail to build in many platforms.

Hmm, I simply used this on my system to make pud_clear_tests() work, not
sure if it works on all archs:

pud_val(*pudp) |= RANDOM_NZVALUE;

> 
> The other alternative will be to make sure memset() happens on all other
> bits except the lower 12 bits which will depend on endianness. If s390
> has a fixed endianness, we can still use either of them which will hold
> good for others as well.
> 
> memset(pudp, RANDOM_NZVALUE, sizeof(pud_t) - 3);
> 
> OR
> 
> memset(pudp + 3, RANDOM_NZVALUE, sizeof(pud_t) - 3);
> 
> > 
> > However, there is another issue on s390 which will make this only work
> > for pud_clear_tests(), and not for the p4d/pgd_tests. The problem is that
> > mm_alloc() will only give you a 3-level page table initially on s390.
> > This means that pudp == p4dp == pgdp, and so the p4d/pgd_tests will
> > both see the pud level (of course this also affects other tests).  
> 
> Got it.
> 
> > 
> > Not sure yet how to fix this, i.e. how to initialize/update the page table
> > to 5 levels. We can handle 5 level page tables, and it would be good if
> > all levels could be tested, but using mm_alloc() to establish the page
> > tables might not work on s390. One option could be to provide an arch-hook
> > or weak function to allocate/initialize the mm.  
> 
> Sure, got it. Though I plan to do add some arch specific tests or init 
> sequence
> like the above later on but for now the idea is to get the smallest possible 
> set
> of test cases which builds and runs on all platforms without requiring any 
> arch
> specific hooks or special casing (#ifdef) to be agreed upon broadly and 
> accepted.
> 
> Do you think this is absolutely necessary on s390 for the very first set of 
> test
> cases or we can add this later on as an improvement ?

It can be added later, no problem. I did not expect this to work flawlessly
on s390 right from the start anyway, with all our peculiarities, so don't
let this hinder you. I might come up with an add-on patch later.

Actually, using get_unmapped_area() as suggested by Kirill could also
solve this issue. We do create a new mm with 3-level page tables on s390,
and the dynamic upgrade to 4 or 5 levels is then triggered exactly by
arch_get_unmapped_area(), depending on the addr. But I currently don't
see how / where arch_get_unmapped_area() is set up for such a dummy mm
created by mm_alloc().

Regards,
Gerald




Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-05 Thread Gerald Schaefer
On Thu, 5 Sep 2019 14:48:14 +0530
Anshuman Khandual  wrote:

> > [...]  
> >> +
> >> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> >> +static void pud_clear_tests(pud_t *pudp)
> >> +{
> >> +  memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> >> +  pud_clear(pudp);
> >> +  WARN_ON(!pud_none(READ_ONCE(*pudp)));
> >> +}  
> > 
> > For pgd/p4d/pud_clear(), we only clear if the page table level is present
> > and not folded. The memset() here overwrites the table type bits, so
> > pud_clear() will not clear anything on s390 and the pud_none() check will
> > fail.
> > Would it be possible to OR a (larger) random value into the table, so that
> > the lower 12 bits would be preserved?  
> 
> So the suggestion is instead of doing memset() on entry with RANDOM_NZVALUE,
> it should OR a large random value preserving lower 12 bits. Hmm, this should
> still do the trick for other platforms, they just need non zero value. So on
> s390, the lower 12 bits on the page table entry already has valid value while
> entering this function which would make sure that pud_clear() really does
> clear the entry ?

Yes, in theory the table entry on s390 would have the type set in the last
4 bits, so preserving those would be enough. If it does not conflict with
others, I would still suggest preserving all 12 bits since those would contain
arch-specific flags in general, just to be sure. For s390, the pte/pmd tests
would also work with the memset, but for consistency I think the same logic
should be used in all pxd_clear_tests.

However, there is another issue on s390 which will make this only work
for pud_clear_tests(), and not for the p4d/pgd_tests. The problem is that
mm_alloc() will only give you a 3-level page table initially on s390.
This means that pudp == p4dp == pgdp, and so the p4d/pgd_tests will
both see the pud level (of course this also affects other tests).

Not sure yet how to fix this, i.e. how to initialize/update the page table
to 5 levels. We can handle 5 level page tables, and it would be good if
all levels could be tested, but using mm_alloc() to establish the page
tables might not work on s390. One option could be to provide an arch-hook
or weak function to allocate/initialize the mm.

IIUC, the (dummy) mm is really only needed to provide an mm->pgd as starting
point, right?

Regards,
Gerald




Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-04 Thread Gerald Schaefer
On Tue,  3 Sep 2019 13:31:46 +0530
Anshuman Khandual  wrote:

> This adds a test module which will validate architecture page table helpers
> and accessors regarding compliance with generic MM semantics expectations.
> This will help various architectures in validating changes to the existing
> page table helpers or addition of new ones.
> 
> Test page table and memory pages creating its entries at various levels are
> all allocated from system memory with required alignments. If memory pages
> with required size and alignment could not be allocated, then all depending
> individual tests are skipped.

This looks very useful, thanks. Of course, s390 is quite special and does
not work nicely with this patch (yet), mostly because of our dynamic page
table levels/folding. Still need to figure out what can be fixed in the arch
code and what would need to be changed in the test module. See below for some
generic comments/questions.

At least one real bug in the s390 code was already revealed by this, which
is very nice. In pmd/pud_bad(), we also check large pmds/puds for sanity,
instead of reporting them as bad, which is apparently not what is expected.

[...]
> +/*
> + * Basic operations
> + *
> + * mkold(entry)  = An old and not a young entry
> + * mkyoung(entry)= A young and not an old entry
> + * mkdirty(entry)= A dirty and not a clean entry
> + * mkclean(entry)= A clean and not a dirty entry
> + * mkwrite(entry)= A write and not a write protected entry
> + * wrprotect(entry)  = A write protected and not a write entry
> + * pxx_bad(entry)= A mapped and non-table entry
> + * pxx_same(entry1, entry2)  = Both entries hold the exact same value
> + */
> +#define VADDR_TEST   (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)

Why is P4D_SIZE missing in the VADDR_TEST calculation?

[...]
> +
> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> +static void pud_clear_tests(pud_t *pudp)
> +{
> + memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> + pud_clear(pudp);
> + WARN_ON(!pud_none(READ_ONCE(*pudp)));
> +}

For pgd/p4d/pud_clear(), we only clear if the page table level is present
and not folded. The memset() here overwrites the table type bits, so
pud_clear() will not clear anything on s390 and the pud_none() check will
fail.
Would it be possible to OR a (larger) random value into the table, so that
the lower 12 bits would be preserved?

> +
> +static void pud_populate_tests(struct mm_struct *mm, pud_t *pudp, pmd_t 
> *pmdp)
> +{
> + /*
> +  * This entry points to next level page table page.
> +  * Hence this must not qualify as pud_bad().
> +  */
> + pmd_clear(pmdp);
> + pud_clear(pudp);
> + pud_populate(mm, pudp, pmdp);
> + WARN_ON(pud_bad(READ_ONCE(*pudp)));
> +}

This will populate the pud with a pmd pointer that does not point to the
beginning of the pmd table, but to the second entry (because of how
VADDR_TEST is constructed). This will result in failing pud_bad() check
on s390. Not sure why/how it works on other archs, but would it be possible
to align pmdp down to the beginning of the pmd table (and similar for the
other pxd_populate_tests)?

[...]
> +
> + p4d_free(mm, saved_p4dp);
> + pud_free(mm, saved_pudp);
> + pmd_free(mm, saved_pmdp);
> + pte_free(mm, (pgtable_t) virt_to_page(saved_ptep));

pgtable_t is arch-specific, and on s390 it is not a struct page pointer
but a pte pointer. So this will go wrong on s390, and also on any other
arch where pgtable_t is not a struct page pointer.
Would it be possible to use pte_free_kernel() instead, and just pass
saved_ptep directly?

Regards,
Gerald


___
linux-snps-arc mailing list
linux-snps-arc@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-snps-arc


Re: [RFC PATCH 0/9] Enable THP migration for all possible architectures

2018-04-27 Thread Gerald Schaefer
On Thu, 26 Apr 2018 10:27:55 -0400
Zi Yan  wrote:

> From: Zi Yan 
> 
> Hi all,
> 
> THP migration is only enabled on x86_64 with a special
> ARCH_ENABLE_THP_MIGRATION macro. This patchset enables THP migration for
> all architectures that uses transparent hugepage, so that special macro can
> be dropped. Instead, THP migration is enabled/disabled via
> /sys/kernel/mm/transparent_hugepage/enable_thp_migration.
> 
> I grepped for TRANSPARENT_HUGEPAGE in arch folder and got 9 architectures
> that are supporting transparent hugepage. I mechanically add
> __pmd_to_swp_entry() and __swp_entry_to_pmd() based on existing
> __pte_to_swp_entry() and __swp_entry_to_pte() for all these architectures,
> except tile which is going to be dropped.

This will not work on s390, the pmd layout is very different from the pte
layout. Using __swp_entry/type/offset() on a pmd will go horribly wrong.
I currently don't see a chance to make this work for us, so please make/keep
this configurable, and do not configure it for s390.

Regards,
Gerald

