Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-07-02 Thread Larry Finger

On 07/01/2018 11:16 PM, Michael Ellerman wrote:

Linus Torvalds  writes:

On Fri, Jun 29, 2018 at 1:42 PM Larry Finger  wrote:


I have more information regarding this BUG. Line 700 of page-flags.h is the
macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
include/linux/mm.h. I also added a printk call to PageTable() that logs
page->page_type. The routine was called twice. The first had page_type of
0xfbff, which would have been expected for a . The second call had
0x, which led to the BUG.


So it looks to me like the tear-down of the page tables first found a
page that is indeed a page table, and cleared the page table bit
(well, it set it - the bits are reversed).

...


That said, can some ppc person who knows the 32-bit ppc code and maybe
knows what that "interrupt: 700" means talk about that oddity in the
trace, please?


I think everyone else answered your questions here, and it should be
fixed now in your tree.

Larry let me know if you're still seeing a crash with 4.18-rc3.


The problem is fixed in 4.18-rc3. Thanks to all that helped.

Larry



Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-07-01 Thread Michael Ellerman
Linus Torvalds  writes:
> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger wrote:
>>
>> I have more information regarding this BUG. Line 700 of page-flags.h is the
>> macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
>> the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
>> in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
>> include/linux/mm.h. I also added a printk call to PageTable() that logs
>> page->page_type. The routine was called twice. The first had page_type of
>> 0xfbff, which would have been expected for a . The second call had
>> 0x, which led to the BUG.
>
> So it looks to me like the tear-down of the page tables first found a
> page that is indeed a page table, and cleared the page table bit
> (well, it set it - the bits are reversed).
...
>
> That said, can some ppc person who knows the 32-bit ppc code and maybe
> knows what that "interrupt: 700" means talk about that oddity in the
> trace, please?

I think everyone else answered your questions here, and it should be
fixed now in your tree.

Larry let me know if you're still seeing a crash with 4.18-rc3.

cheers


Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-30 Thread Larry Finger

On 06/30/2018 04:31 AM, christophe leroy wrote:



On 29/06/2018 at 22:42, Larry Finger wrote:
My PowerBook G4 Aluminum crashes on boot with 4.18-rcX kernels with a kernel 
BUG at include/linux/page-flags.h:700! The problem was bisected to commit 
1d40a5ea01d5 ("mm: mark pages in use for page tables"). It is not possible to 
capture the bug with anything other than a camera. The first few lines of the 
traceback are as follows:


free_pgd_range+0x19c/0x30c (unreliable)
free_pgtables+0xa0/0xb0
exit_mmap+0xf4/0x16c
mmput+0x64/0xf0
do_exit+0x33c/0x89c
oops_end+0x13c/0x144
_exception_pkey+0x58/0x128
ret_from_except_full+0x0/0x4
--- interrupt: 700 at free_pgd_range+0x19c/0x30c
 LR = free_pgd_range+0x19c/0x30c
free_pgtables+0xa0/0xb0
exit_mmap+0xf4/0x16c
mmput+0x64/0xf0
flush_old_exec+0x490/0x550

I have more information regarding this BUG. Line 700 of page-flags.h is the 
macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded 
the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), 
page) in routine __ClearPageTable(), which is called from pgtable_page_dtor() 
in include/linux/mm.h. I also added a printk call to PageTable() that logs 
page->page_type. The routine was called twice. The first had page_type of 
0xfbff, which would have been expected for a . The second call had 
0x, which led to the BUG.




Oh, seems to be the one I noticed and told Aneesh about 
(https://patchwork.ozlabs.org/patch/922771/)


Aneesh provided the patch https://patchwork.ozlabs.org/patch/934111/ for it, 
does it help ?


Yes, those changes fix the problem.

Larry



Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-30 Thread christophe leroy




On 29/06/2018 at 22:42, Larry Finger wrote:
My PowerBook G4 Aluminum crashes on boot with 4.18-rcX kernels with a 
kernel BUG at include/linux/page-flags.h:700! The problem was bisected 
to commit 1d40a5ea01d5 ("mm: mark pages in use for page tables"). It is 
not possible to capture the bug with anything other than a camera. The 
first few lines of the traceback are as follows:


free_pgd_range+0x19c/0x30c (unreliable)
free_pgtables+0xa0/0xb0
exit_mmap+0xf4/0x16c
mmput+0x64/0xf0
do_exit+0x33c/0x89c
oops_end+0x13c/0x144
_exception_pkey+0x58/0x128
ret_from_except_full+0x0/0x4
--- interrupt: 700 at free_pgd_range+0x19c/0x30c
     LR = free_pgd_range+0x19c/0x30c
free_pgtables+0xa0/0xb0
exit_mmap+0xf4/0x16c
mmput+0x64/0xf0
flush_old_exec+0x490/0x550

I have more information regarding this BUG. Line 700 of page-flags.h is 
the macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually 
expanded the macro, and found that the bug line is 
VM_BUG_ON_PAGE(!PageTable(page), page) in routine __ClearPageTable(), 
which is called from pgtable_page_dtor() in include/linux/mm.h. I also 
added a printk call to PageTable() that logs page->page_type. The 
routine was called twice. The first had page_type of 0xfbff, which 
would have been expected for a . The second call had 0x, which 
led to the BUG.




Oh, seems to be the one I noticed and told Aneesh about 
(https://patchwork.ozlabs.org/patch/922771/)


Aneesh provided the patch https://patchwork.ozlabs.org/patch/934111/ for 
it, does it help ?


Christophe


Larry





Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-30 Thread Aneesh Kumar K.V

On 06/30/2018 03:16 AM, Kirill A. Shutemov wrote:

On Fri, Jun 29, 2018 at 02:01:46PM -0700, Linus Torvalds wrote:

On Fri, Jun 29, 2018 at 1:42 PM Larry Finger  wrote:


I have more information regarding this BUG. Line 700 of page-flags.h is the
macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
include/linux/mm.h. I also added a printk call to PageTable() that logs
page->page_type. The routine was called twice. The first had page_type of
0xfbff, which would have been expected for a . The second call had
0x, which led to the BUG.


So it looks to me like the tear-down of the page tables first found a
page that is indeed a page table, and cleared the page table bit
(well, it set it - the bits are reversed).

Then it took an exception (that "interrupt: 700") and that causes
do_exit() again, and it tries to free the same page table - and now
it's no longer marked as a page table, because it already went through
the __ClearPageTable() dance once.

So on the second path through, it catches that "the bit already said
it wasn't a page table" and does the BUG.

But the real question is what the problem was the *first* time around.


+Aneesh.

Looks like pgtable_page_dtor() gets called in __pte_free_tlb() path twice.
Once in __pte_free_tlb() itself and the second time in pgtable_free().

Would this help?

diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 6a6673907e45..e7a2f0e6b695 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -137,7 +137,6 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
 {
-	pgtable_page_dtor(table);
 	pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_BOOK3S_32_PGALLOC_H */
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 1707781d2f20..30a13b80fd58 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -139,7 +139,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
 {
 	tlb_flush_pgtable(tlb, address);
-	pgtable_page_dtor(table);
 	pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_PGALLOC_32_H */




https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-June/175015.html

Also part of pull request from Michael Ellerman

-aneesh



Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-29 Thread Denise Finger

On 06/29/2018 04:01 PM, Linus Torvalds wrote:

On Fri, Jun 29, 2018 at 1:42 PM Larry Finger  wrote:


I have more information regarding this BUG. Line 700 of page-flags.h is the
macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
include/linux/mm.h. I also added a printk call to PageTable() that logs
page->page_type. The routine was called twice. The first had page_type of
0xfbff, which would have been expected for a . The second call had
0x, which led to the BUG.


So it looks to me like the tear-down of the page tables first found a
page that is indeed a page table, and cleared the page table bit
(well, it set it - the bits are reversed).

Then it took an exception (that "interrupt: 700") and that causes
do_exit() again, and it tries to free the same page table - and now
it's no longer marked as a page table, because it already went through
the __ClearPageTable() dance once.

So on the second path through, it catches that "the bit already said
it wasn't a page table" and does the BUG.

But the real question is what the problem was the *first* time around.
I assume that has scrolled off the screen? This part:

   _exception_pkey+0x58/0x128
   ret_from_except_full+0x0/0x4
   --- interrupt: 700 at free_pgd_range+0x19c/0x30c
LR = free_pgd_range+0x19c/0x30c
   free_pgtables+0xa0/0xb0
   exit_mmap+0xf4/0x16c
   mmput+0x64/0xf0

Does reverting that commit 1d40a5ea01d5 make everything work for you?
Because if so, judging by the deafening silence on this so far, I
think that's what we should do.

That said, can some ppc person who knows the 32-bit ppc code and maybe
knows what that "interrupt: 700" means talk about that oddity in the
trace, please?


The deafening silence may be due to my having an old Microsoft address for 
Matthew Wilcox in my first posting. He should now have received the BUG report, 
and he may have some suggestions. Yes, reverting commit 1d40a5ea01d5 does permit 
the box to boot.


Kirill's patch also works, which seems like a better solution. If any other 
architecture bugs on boot, at least we will know where to look. :)


@Kirill: You may add a Reported-by: and Tested-by: Larry Finger 
 to the patch.


Thanks for the help,

Larry



Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-29 Thread Linus Torvalds
On Fri, Jun 29, 2018 at 2:46 PM Kirill A. Shutemov  wrote:
>
> Looks like pgtable_page_dtor() gets called in __pte_free_tlb() path twice.
> Once in __pte_free_tlb() itself and the second time in pgtable_free().

Ahh, that would certainly do it, and explains why this hits ppc32 but
not x86, for example.

Linus


Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-29 Thread Segher Boessenkool
On Fri, Jun 29, 2018 at 02:01:46PM -0700, Linus Torvalds wrote:
> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger wrote:
> But the real question is what the problem was the *first* time around.
> I assume that has scrolled off the screen? This part:
> 
>   _exception_pkey+0x58/0x128
>   ret_from_except_full+0x0/0x4
>   --- interrupt: 700 at free_pgd_range+0x19c/0x30c
>    LR = free_pgd_range+0x19c/0x30c
>   free_pgtables+0xa0/0xb0
>   exit_mmap+0xf4/0x16c
>   mmput+0x64/0xf0
> 
> Does reverting that commit 1d40a5ea01d5 make everything work for you?
> Because if so, judging by the deafening silence on this so far, I
> think that's what we should do.
> 
> That said, can some ppc person who knows the 32-bit ppc code and maybe
> knows what that "interrupt: 700" means talk about that oddity in the
> trace, please?

700 is "program interrupt"; here it probably means a BUG() happened (which
does a trap instruction, which causes a 700).  The stuff that scrolled away
should tell more.
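
As a rough illustration of the point above, the sketch below maps the classic PowerPC exception vector offsets to their usual names, so that the "interrupt: 700" in the trace can be read as a program interrupt (the vector a trap instruction, and therefore BUG(), lands on). The table and the helper names ppc_vector_name() and self_check() are hypothetical and exist only for this example; they are not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table: classic 32-bit PowerPC exception vector offsets. */
struct vec_name {
	unsigned int vec;
	const char *name;
};

static const struct vec_name ppc_vectors[] = {
	{ 0x100, "system reset" },
	{ 0x200, "machine check" },
	{ 0x300, "data storage" },
	{ 0x400, "instruction storage" },
	{ 0x500, "external" },
	{ 0x600, "alignment" },
	{ 0x700, "program" },	/* trap instructions, hence BUG(), end up here */
	{ 0x800, "floating-point unavailable" },
	{ 0x900, "decrementer" },
	{ 0xc00, "system call" },
};

/* Return the name for a vector offset, or "unknown" if it is not in the table. */
static const char *ppc_vector_name(unsigned int vec)
{
	size_t i;

	for (i = 0; i < sizeof(ppc_vectors) / sizeof(ppc_vectors[0]); i++)
		if (ppc_vectors[i].vec == vec)
			return ppc_vectors[i].name;
	return "unknown";
}

/* Sanity check for the one entry that matters in this thread. */
static void self_check(void)
{
	assert(strcmp(ppc_vector_name(0x700), "program") == 0);
}
```

Calling ppc_vector_name(0x700) yields "program", which matches the reading that the nested frame in the trace is the BUG() trap itself rather than a separate fault.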


Segher


Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-29 Thread Kirill A. Shutemov
On Fri, Jun 29, 2018 at 02:01:46PM -0700, Linus Torvalds wrote:
> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger wrote:
> >
> > I have more information regarding this BUG. Line 700 of page-flags.h is the
> > macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
> > the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
> > in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
> > include/linux/mm.h. I also added a printk call to PageTable() that logs
> > page->page_type. The routine was called twice. The first had page_type of
> > 0xfbff, which would have been expected for a . The second call had
> > 0x, which led to the BUG.
> 
> So it looks to me like the tear-down of the page tables first found a
> page that is indeed a page table, and cleared the page table bit
> (well, it set it - the bits are reversed).
> 
> Then it took an exception (that "interrupt: 700") and that causes
> do_exit() again, and it tries to free the same page table - and now
> it's no longer marked as a page table, because it already went through
> the __ClearPageTable() dance once.
> 
> So on the second path through, it catches that "the bit already said
> it wasn't a page table" and does the BUG.
> 
> But the real question is what the problem was the *first* time around.

+Aneesh.

Looks like pgtable_page_dtor() gets called in __pte_free_tlb() path twice.
Once in __pte_free_tlb() itself and the second time in pgtable_free().

Would this help?

diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 6a6673907e45..e7a2f0e6b695 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -137,7 +137,6 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
 {
-	pgtable_page_dtor(table);
 	pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_BOOK3S_32_PGALLOC_H */
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 1707781d2f20..30a13b80fd58 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -139,7 +139,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
 {
 	tlb_flush_pgtable(tlb, address);
-	pgtable_page_dtor(table);
 	pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_PGALLOC_32_H */
-- 
 Kirill A. Shutemov
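
For readers following along, the double-destructor path described in this message can be modelled in a few lines of user-space C. This is only a sketch under simplifying assumptions: struct fake_pt and every fake_* function are hypothetical stand-ins for struct page, pgtable_page_dtor(), pgtable_free() and __pte_free_tlb(), and the destructor returns -1 where the real kernel would hit VM_BUG_ON_PAGE().

```c
#include <assert.h>

/* Hypothetical stand-in for a page-table page. */
struct fake_pt {
	int is_table;	/* 1 while the page is marked as a page table */
	int dtor_calls;	/* how many times the destructor has run */
};

/* Models pgtable_page_dtor(): returns -1 where the kernel would BUG. */
static int fake_dtor(struct fake_pt *pt)
{
	pt->dtor_calls++;
	if (!pt->is_table)
		return -1;	/* already cleared: second call trips the check */
	pt->is_table = 0;
	return 0;
}

/* Models pgtable_free(), which already runs the destructor itself. */
static int fake_pgtable_free(struct fake_pt *pt)
{
	return fake_dtor(pt);
}

/* Before the patch: destructor in __pte_free_tlb() and again when freeing. */
static int fake_pte_free_tlb_before(struct fake_pt *pt)
{
	fake_dtor(pt);
	return fake_pgtable_free(pt);
}

/* After the patch: the extra call is gone, so the destructor runs once. */
static int fake_pte_free_tlb_after(struct fake_pt *pt)
{
	return fake_pgtable_free(pt);
}
```

With a freshly marked page, fake_pte_free_tlb_before() returns -1 after two destructor calls, while fake_pte_free_tlb_after() returns 0 after one. The model ignores the deferred (RCU or batched) freeing that the real pgtable_free_tlb() may perform.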


Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-29 Thread Linus Torvalds
On Fri, Jun 29, 2018 at 1:42 PM Larry Finger  wrote:
>
> I have more information regarding this BUG. Line 700 of page-flags.h is the
> macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
> the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
> in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
> include/linux/mm.h. I also added a printk call to PageTable() that logs
> page->page_type. The routine was called twice. The first had page_type of
> 0xfbff, which would have been expected for a . The second call had
> 0x, which led to the BUG.

So it looks to me like the tear-down of the page tables first found a
page that is indeed a page table, and cleared the page table bit
(well, it set it - the bits are reversed).

Then it took an exception (that "interrupt: 700") and that causes
do_exit() again, and it tries to free the same page table - and now
it's no longer marked as a page table, because it already went through
the __ClearPageTable() dance once.

So on the second path through, it catches that "the bit already said
it wasn't a page table" and does the BUG.
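
The "bits are reversed" remark can be made concrete with a small user-space model. The sketch below assumes the 4.18-era definitions in include/linux/page-flags.h (PAGE_TYPE_BASE of 0xf0000000, PG_table of 0x00000400, and page_type starting out as all ones); struct fake_page and the fake_* helpers are hypothetical names for this example, and the clear helper returns -1 where the kernel would hit VM_BUG_ON_PAGE().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, modelled on include/linux/page-flags.h around 4.18. */
#define PAGE_TYPE_BASE	0xf0000000u
#define PG_table	0x00000400u

/* Hypothetical stand-in for the page_type field of struct page. */
struct fake_page {
	uint32_t page_type;
};

/* A page has the type when the flag bit is CLEAR and the base bits are set. */
static int fake_PageTable(const struct fake_page *p)
{
	return (p->page_type & (PAGE_TYPE_BASE | PG_table)) == PAGE_TYPE_BASE;
}

/* "Setting" the type clears the bit. */
static void fake_SetPageTable(struct fake_page *p)
{
	p->page_type &= ~PG_table;
}

/* "Clearing" the type sets the bit; -1 marks where the kernel would BUG. */
static int fake_ClearPageTable(struct fake_page *p)
{
	if (!fake_PageTable(p))
		return -1;
	p->page_type |= PG_table;
	return 0;
}
```

Under those assumed definitions, a page freshly marked as a page table reads 0xfffffbff and an unmarked one reads 0xffffffff, so a second fake_ClearPageTable() on the same page fails the check in the same way the second pass through the destructor does in the trace.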

But the real question is what the problem was the *first* time around.
I assume that has scrolled off the screen? This part:

  _exception_pkey+0x58/0x128
  ret_from_except_full+0x0/0x4
  --- interrupt: 700 at free_pgd_range+0x19c/0x30c
   LR = free_pgd_range+0x19c/0x30c
  free_pgtables+0xa0/0xb0
  exit_mmap+0xf4/0x16c
  mmput+0x64/0xf0

Does reverting that commit 1d40a5ea01d5 make everything work for you?
Because if so, judging by the deafening silence on this so far, I
think that's what we should do.

That said, can some ppc person who knows the 32-bit ppc code and maybe
knows what that "interrupt: 700" means talk about that oddity in the
trace, please?

Linus


[Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-29 Thread Larry Finger
My PowerBook G4 Aluminum crashes on boot with 4.18-rcX kernels with a kernel BUG 
at include/linux/page-flags.h:700! The problem was bisected to commit 
1d40a5ea01d5 ("mm: mark pages in use for page tables"). It is not possible to 
capture the bug with anything other than a camera. The first few lines of the 
traceback are as follows:


free_pgd_range+0x19c/0x30c (unreliable)
free_pgtables+0xa0/0xb0
exit_mmap+0xf4/0x16c
mmput+0x64/0xf0
do_exit+0x33c/0x89c
oops_end+0x13c/0x144
_exception_pkey+0x58/0x128
ret_from_except_full+0x0/0x4
--- interrupt: 700 at free_pgd_range+0x19c/0x30c
LR = free_pgd_range+0x19c/0x30c
free_pgtables+0xa0/0xb0
exit_mmap+0xf4/0x16c
mmput+0x64/0xf0
flush_old_exec+0x490/0x550

I have more information regarding this BUG. Line 700 of page-flags.h is the 
macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded 
the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page) 
in routine __ClearPageTable(), which is called from pgtable_page_dtor() in 
include/linux/mm.h. I also added a printk call to PageTable() that logs 
page->page_type. The routine was called twice. The first had page_type of 
0xfbff, which would have been expected for a . The second call had 
0x, which led to the BUG.


Larry