Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-19 Thread Vlastimil Babka
On 8/17/19 2:20 AM, Petr Vandrovec wrote:
> Vlastimil Babka wrote on 8/16/2019 5:47 AM:
>> On 8/15/19 9:13 PM, Petr Vandrovec wrote:
>>> With iommu=off disks are visible, but USB keyboard (and other USB
>>> devices) does not work:
>>
>> I've been told iommu=soft might help.
> 
> Thanks.  I've rebuilt kernel without IOMMU.
> 
> Unfortunately I was not able to reproduce original problem with latest 
> kernel - neither with CMA nor without CMA.  System seems stable as a rock.

That's rather fortunate I would say :) But if you have any unrelated
regressions in 5.3, please report them accordingly.

> This is the change on which I've tried to reproduce it with your 
> debugging patches:
> 
> commit 41de59634046b19cd53a1983594a95135c656997 (HEAD -> master, 
> origin/master, origin/HEAD)
> Merge: e22a97a2a85d 1ee1119d184b
> Author: Linus Torvalds 
> Date:   Wed Aug 14 15:29:53 2019 -0700
> 
>  Merge tag 'Wimplicit-fallthrough-5.3-rc5' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
> 
>   Petr
> 

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-17 Thread Petr Vandrovec

Vlastimil Babka wrote on 8/16/2019 5:47 AM:

On 8/15/19 9:13 PM, Petr Vandrovec wrote:

With iommu=off disks are visible, but USB keyboard (and other USB
devices) does not work:


I've been told iommu=soft might help.


Thanks.  I've rebuilt kernel without IOMMU.

Unfortunately I was not able to reproduce original problem with latest 
kernel - neither with CMA nor without CMA.  System seems stable as a rock.


This is the change on which I've tried to reproduce it with your 
debugging patches:


commit 41de59634046b19cd53a1983594a95135c656997 (HEAD -> master, 
origin/master, origin/HEAD)

Merge: e22a97a2a85d 1ee1119d184b
Author: Linus Torvalds 
Date:   Wed Aug 14 15:29:53 2019 -0700

Merge tag 'Wimplicit-fallthrough-5.3-rc5' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux


Petr
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-17 Thread Joerg Roedel
On Fri, Aug 16, 2019 at 02:47:53PM +0200, Vlastimil Babka wrote:
> On 8/15/19 9:13 PM, Petr Vandrovec wrote:
> > [   18.110985] DMAR: [DMA Write] Request device [07:00.1] fault addr 
> > fffe [fault reason 02] Present bit in context entry is clear
> 
> Worth reporting as well, not nice regression.

Is that a regression between 5.3-rc3 and 5.3-rc4 or is it already broken
since -rc1? The 5.3-rc5 kernel will contains some VT-d fixes that are
worth a try here too. If you can test latest linus/master branch that
would be great, otherwise -rc5 is fine too.


Regards,

Joerg
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-17 Thread Petr Vandrovec

Vlastimil Babka wrote on 8/15/2019 7:32 AM:


Does the issue still happen with rc4? Could you apply the 3 attached
patches (work in progress), configure-enable CONFIG_DEBUG_PAGEALLOC and
CONFIG_PAGE_OWNER and boot kernel with debug_pagealloc=on page_owner=on
parameters? That should print stacktraces of allocation and first
freeing (assuming this is a double free).


Unfortunately -rc4 does not find any my SATA disks due to some 
misunderstanding between AHCI driver and HPT642L adapter (there is no 
device 07:00.1, HPT is single-function device at 07:00.0):


[   18.003015] scsi host6: ahci
[   18.006605] DMAR: DRHD: handling fault status reg 2
[   18.006619] DMAR: [DMA Write] Request device [07:00.1] fault addr 
fffe [fault reason 02] Present bit in context entry is clear

[   18.076616] DMAR: DRHD: handling fault status reg 102
[   18.085910] DMAR: [DMA Write] Request device [07:00.1] fault addr 
fffa [fault reason 02] Present bit in context entry is clear

[   18.100989] DMAR: DRHD: handling fault status reg 202
[   18.110985] DMAR: [DMA Write] Request device [07:00.1] fault addr 
fffe [fault reason 02] Present bit in context entry is clear


With iommu=off disks are visible, but USB keyboard (and other USB 
devices) does not work:


[   18.174802] ehci-pci :00:1a.0: swiotlb buffer is full (sz: 8 
bytes), total 0 (slots), used 0 (slots)
[   18.174804] ehci-pci :00:1a.0: overflow 0x000ffdc75ae8+8 of 
DMA mask  bus mask 0
[   18.174815] WARNING: CPU: 2 PID: 508 at kernel/dma/direct.c:35 
report_addr+0x2e/0x50

[   18.174816] Modules linked in:
[   18.174818] CPU: 2 PID: 508 Comm: kworker/2:1 Tainted: G 
  T 5.3.0-rc4-64-00058-gd717b092e0b2 #77
[   18.174819] Hardware name: Dell Inc. Precision T3610/09M8Y8, BIOS A16 
02/05/2018

[   18.174822] Workqueue: usb_hub_wq hub_event

I'll try to find -rc4 configuration that has enabled debugging and can boot. 



Petr

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-16 Thread Vlastimil Babka
On 8/15/19 9:13 PM, Petr Vandrovec wrote:
> Vlastimil Babka wrote on 8/15/2019 7:32 AM:
>>
>> Does the issue still happen with rc4? Could you apply the 3 attached
>> patches (work in progress), configure-enable CONFIG_DEBUG_PAGEALLOC and
>> CONFIG_PAGE_OWNER and boot kernel with debug_pagealloc=on page_owner=on
>> parameters? That should print stacktraces of allocation and first
>> freeing (assuming this is a double free).
> 
> Unfortunately -rc4 does not find any my SATA disks due to some 
> misunderstanding between AHCI driver and HPT642L adapter (there is no 
> device 07:00.1, HPT is single-function device at 07:00.0):
> 
> [   18.003015] scsi host6: ahci
> [   18.006605] DMAR: DRHD: handling fault status reg 2
> [   18.006619] DMAR: [DMA Write] Request device [07:00.1] fault addr 
> fffe [fault reason 02] Present bit in context entry is clear
> [   18.076616] DMAR: DRHD: handling fault status reg 102
> [   18.085910] DMAR: [DMA Write] Request device [07:00.1] fault addr 
> fffa [fault reason 02] Present bit in context entry is clear
> [   18.100989] DMAR: DRHD: handling fault status reg 202
> [   18.110985] DMAR: [DMA Write] Request device [07:00.1] fault addr 
> fffe [fault reason 02] Present bit in context entry is clear

Worth reporting as well, not nice regression.

> With iommu=off disks are visible, but USB keyboard (and other USB 
> devices) does not work:

I've been told iommu=soft might help.

> [   18.174802] ehci-pci :00:1a.0: swiotlb buffer is full (sz: 8 
> bytes), total 0 (slots), used 0 (slots)
> [   18.174804] ehci-pci :00:1a.0: overflow 0x000ffdc75ae8+8 of 
> DMA mask  bus mask 0
> [   18.174815] WARNING: CPU: 2 PID: 508 at kernel/dma/direct.c:35 
> report_addr+0x2e/0x50
> [   18.174816] Modules linked in:
> [   18.174818] CPU: 2 PID: 508 Comm: kworker/2:1 Tainted: G 
>    T 5.3.0-rc4-64-00058-gd717b092e0b2 #77
> [   18.174819] Hardware name: Dell Inc. Precision T3610/09M8Y8, BIOS A16 
> 02/05/2018
> [   18.174822] Workqueue: usb_hub_wq hub_event
> 
> I'll try to find -rc4 configuration that has enabled debugging and can boot. 
> 
> 
> Petr
> 
> 

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-15 Thread Vlastimil Babka
On 8/3/19 1:29 AM, Petr Vandrovec wrote:
> On Fri, Aug 2, 2019, 3:59 PM Matthew Wilcox  > wrote:
> 
> That doesn't help because we call reset_page_owner() in the free
> page path.
> 
> We could turn on tracing because we call trace_mm_page_free() in this
> path.  That requires the reporter to be able to reproduce the problem,
> and it's not clear to me whether this is a "happened once" or "every
> time I do this, it happens" problem.
> 
> 
> It happened on 3 of the boots with that kernel.  4th time box either
> spontaneously rebooted when X started, or watchdog restarted box shortly
> after starting X server.
> 
> So I believe I should be able to reproduce it with additional patches or
> extra flags enabled.

Does the issue still happen with rc4? Could you apply the 3 attached
patches (work in progress), configure-enable CONFIG_DEBUG_PAGEALLOC and
CONFIG_PAGE_OWNER and boot kernel with debug_pagealloc=on page_owner=on
parameters? That should print stacktraces of allocation and first
freeing (assuming this is a double free).

Vlastimil

>From 5b4c46cb1d7a8bca3e8d98433b19e60b28fb5796 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka 
Date: Thu, 15 Aug 2019 14:06:50 +0200
Subject: [PATCH 1/3] mm, page_owner: record page owner for each subpage

Currently, page owner info is only recorded for the first page of a high-order
allocation, and copied to tail pages in the event of a split page. With the
plan to keep previous owner info after freeing the page, it would be benefical
to record page owner for each subpage upon allocation. This increases the
overhead for high orders, but that should be acceptable for a debugging option.

Signed-off-by: Vlastimil Babka 
---
 mm/page_owner.c | 33 +
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index addcbb2ae4e4..695cd5fdf7fd 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -154,18 +154,23 @@ static noinline depot_stack_handle_t save_stack(gfp_t flags)
 	return handle;
 }
 
-static inline void __set_page_owner_handle(struct page_ext *page_ext,
-	depot_stack_handle_t handle, unsigned int order, gfp_t gfp_mask)
+static inline void __set_page_owner_handle(struct page *page,
+	struct page_ext *page_ext, depot_stack_handle_t handle,
+	unsigned int order, gfp_t gfp_mask)
 {
 	struct page_owner *page_owner;
+	int i;
 
-	page_owner = get_page_owner(page_ext);
-	page_owner->handle = handle;
-	page_owner->order = order;
-	page_owner->gfp_mask = gfp_mask;
-	page_owner->last_migrate_reason = -1;
+	for (i = 0; i < (1 << order); i++) {
+		page_owner = get_page_owner(page_ext);
+		page_owner->handle = handle;
+		page_owner->order = order;
+		page_owner->gfp_mask = gfp_mask;
+		page_owner->last_migrate_reason = -1;
+		__set_bit(PAGE_EXT_OWNER, _ext->flags);
 
-	__set_bit(PAGE_EXT_OWNER, _ext->flags);
+		page_ext = lookup_page_ext(page + i);
+	}
 }
 
 noinline void __set_page_owner(struct page *page, unsigned int order,
@@ -178,7 +183,7 @@ noinline void __set_page_owner(struct page *page, unsigned int order,
 		return;
 
 	handle = save_stack(gfp_mask);
-	__set_page_owner_handle(page_ext, handle, order, gfp_mask);
+	__set_page_owner_handle(page, page_ext, handle, order, gfp_mask);
 }
 
 void __set_page_owner_migrate_reason(struct page *page, int reason)
@@ -204,8 +209,11 @@ void __split_page_owner(struct page *page, unsigned int order)
 
 	page_owner = get_page_owner(page_ext);
 	page_owner->order = 0;
-	for (i = 1; i < (1 << order); i++)
-		__copy_page_owner(page, page + i);
+	for (i = 1; i < (1 << order); i++) {
+		page_ext = lookup_page_ext(page + i);
+		page_owner = get_page_owner(page_ext);
+		page_owner->order = 0;
+	}
 }
 
 void __copy_page_owner(struct page *oldpage, struct page *newpage)
@@ -562,7 +570,8 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
 continue;
 
 			/* Found early allocated page */
-			__set_page_owner_handle(page_ext, early_handle, 0, 0);
+			__set_page_owner_handle(page, page_ext, early_handle,
+		0, 0);
 			count++;
 		}
 		cond_resched();
-- 
2.22.0

>From 063f233cc154819bb39a005726260496eea12265 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka 
Date: Thu, 15 Aug 2019 14:48:54 +0200
Subject: [PATCH 2/3] mm, page_owner: keep owner info when freeing the page

For debugging purposes it might be useful to keep the owner info even after
page has been freed and include it in e.g. dump_page() when detecting a bad
page state. For that, change the PAGE_EXT_OWNER flag meaning to "page owner
info has been set at least once" and add new PAGE_EXT_OWNER_ACTIVE for tracking
whether page is supposed to be currently allocated or free. Adjust dump_page()
and page_owner file printing accordingly.

Signed-off-by: Vlastimil Babka 
---
 include/linux/page_ext.h |  1 +
 mm/page_owner.c  | 38 +-
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git 

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-05 Thread Petr Vandrovec
On Fri, Aug 2, 2019, 3:59 PM Matthew Wilcox  wrote:

> That doesn't help because we call reset_page_owner() in the free page path.
>
> We could turn on tracing because we call trace_mm_page_free() in this
> path.  That requires the reporter to be able to reproduce the problem,
> and it's not clear to me whether this is a "happened once" or "every
> time I do this, it happens" problem.
>

It happened on 3 of the boots with that kernel.  4th time box either
spontaneously rebooted when X started, or watchdog restarted box shortly
after starting X server.

So I believe I should be able to reproduce it with additional patches or
extra flags enabled.

Petr

>
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-05 Thread Petr Vandrovec

Andrew Morton wrote on 8/2/2019 1:23 PM:

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 01 Aug 2019 22:34:16 + bugzilla-dae...@bugzilla.kernel.org wrote:


[259701.549382] Code: 10 e9 67 ff ff ff 0f 1f 44 00 00 48 8b 15 b1 6c 0c 00 f7
d8 64 89 02 48 c7 c0 ff ff ff ff e9 6b ff ff ff b8 0b 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 89 6c 0c 00 f7 d8 64 89 01 48
[259701.549382] RSP: 002b:7ffe529db138 EFLAGS: 0206 ORIG_RAX:
000b
[259701.549382] RAX: ffda RBX: 564a5eabce70 RCX:
7f504d0ec1d7
[259701.549382] RDX: 7ffe529db140 RSI: 0040 RDI:
7f5044b65000
[259701.549382] RBP: 564a5eafe460 R08: 000b R09:
00010283e000
[259701.549382] R10: 0001 R11: 0206 R12:
564a5e475b08
[259701.549382] R13: 564a5e475c80 R14: 7ffe529db190 R15:
0c80
[259701.707238] Disabling lock debugging due to kernel taint


I assume the above is misbehaviour in the DRM code?


Most probably.

When I switched back to 5.2, crashes were gone, but log was filled with 
11 millions of messages complaining about alloc_contig_range problems:


[22042.108043] alloc_contig_range: [106f118, 106f119) PFNs busy
[22042.114400] alloc_contig_range: [106f11c, 106f11d) PFNs busy
[22042.120787] alloc_contig_range: [106f11d, 106f11e) PFNs busy
[22047.093057] alloc_contig_range: 47963 callbacks suppressed
[22047.093058] alloc_contig_range: [106f117, 106f118) PFNs busy
[22047.105576] alloc_contig_range: [106f118, 106f119) PFNs busy
[22047.111937] alloc_contig_range: [106f11c, 106f11d) PFNs busy
[22047.118329] alloc_contig_range: [106f11d, 106f11e) PFNs busy

In total there is 9735 messages logged individually, and 11 million 
suppressed:


petr-dev3:~$ dmesg | grep alloc_contig.*PFNs | wc -l
9735
petr-dev3:~$ expr `echo \`dmesg | grep alloc_contig.*callbacks | cut -d' 
' -f3\` | sed -e 's/ / + /'g`

11333722

So it could be my problems are caused by new Xorg driver.

After I disabled CMA system is stable on 5.2.  I did not try 5.3-rc2 yet.

Petr
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-05 Thread Matthew Wilcox
On Fri, Aug 02, 2019 at 05:17:30PM -0400, Qian Cai wrote:
> On Fri, 2019-08-02 at 13:33 -0700, Matthew Wilcox wrote:
> > It occurs to me that when a page is freed, we could record some useful bits
> > of information in the page from the stack trace to help debug double-free 
> > situations.  Even just stashing __builtin_return_address in page->mapping
> > would be helpful, I think.
> 
> Sounds like need to enable "page_owner", so it will do  __dump_page_owner().

That doesn't help because we call reset_page_owner() in the free page path.

We could turn on tracing because we call trace_mm_page_free() in this
path.  That requires the reporter to be able to reproduce the problem,
and it's not clear to me whether this is a "happened once" or "every
time I do this, it happens" problem.

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-05 Thread Qian Cai
On Fri, 2019-08-02 at 13:33 -0700, Matthew Wilcox wrote:
> On Fri, Aug 02, 2019 at 01:23:06PM -0700, Andrew Morton wrote:
> > > [259701.387365] BUG: Bad page state in process Xorg  pfn:2a300
> > > [259701.393593] page:eaa8c000 refcount:0 mapcount:-128
> > > mapping: index:0x0
> 
> mapcount -128 is PAGE_MAPCOUNT_RESERVE, aka PageBuddy.  I think somebody
> called put_page() once more than they should have.  The one before this
> caused it to be freed to the page allocator, which set PageBuddy.  Then
> this one happened and we got a complaint.
> 
> > > [259701.402832] flags: 0x2000()
> > > [259701.407426] raw: 2000 822ab778 eaa8f208
> > > 
> > > [259701.415900] raw:  0003 ff7f
> > > 
> > > [259701.424373] page dumped because: nonzero mapcount
> 
> It occurs to me that when a page is freed, we could record some useful bits
> of information in the page from the stack trace to help debug double-free 
> situations.  Even just stashing __builtin_return_address in page->mapping
> would be helpful, I think.

Sounds like need to enable "page_owner", so it will do  __dump_page_owner().

> 
> > > [259701.549382] Call Trace:
> > > [259701.549382]  dump_stack+0x46/0x60
> > > [259701.549382]  bad_page.cold.28+0x81/0xb4
> > > [259701.549382]  __free_pages_ok+0x236/0x240
> > > [259701.549382]  __ttm_dma_free_page+0x2f/0x40
> > > [259701.549382]  ttm_dma_unpopulate+0x29b/0x370
> > > [259701.549382]  ttm_tt_destroy.part.6+0x44/0x50
> > > [259701.549382]  ttm_bo_cleanup_memtype_use+0x29/0x70
> > > [259701.549382]  ttm_bo_put+0x225/0x280
> > > [259701.549382]  ttm_bo_vm_close+0x10/0x20
> > > [259701.549382]  remove_vma+0x20/0x40
> > > [259701.549382]  __do_munmap+0x2da/0x420
> > > [259701.549382]  __vm_munmap+0x66/0xc0
> > > [259701.549382]  __x64_sys_munmap+0x22/0x30
> > > [259701.549382]  do_syscall_64+0x5e/0x1a0
> > > [259701.549382]  ? prepare_exit_to_usermode+0x75/0xa0
> > > [259701.549382]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > [259701.549382] RIP: 0033:0x7f504d0ec1d7
> > > [259701.549382] Code: 10 e9 67 ff ff ff 0f 1f 44 00 00 48 8b 15 b1 6c 0c
> > > 00 f7
> > > d8 64 89 02 48 c7 c0 ff ff ff ff e9 6b ff ff ff b8 0b 00 00 00 0f 05 <48>
> > > 3d 01
> > > f0 ff ff 73 01 c3 48 8b 0d 89 6c 0c 00 f7 d8 64 89 01 48
> > > [259701.549382] RSP: 002b:7ffe529db138 EFLAGS: 0206 ORIG_RAX:
> > > 000b
> > > [259701.549382] RAX: ffda RBX: 564a5eabce70 RCX:
> > > 7f504d0ec1d7
> > > [259701.549382] RDX: 7ffe529db140 RSI: 0040 RDI:
> > > 7f5044b65000
> > > [259701.549382] RBP: 564a5eafe460 R08: 000b R09:
> > > 00010283e000
> > > [259701.549382] R10: 0001 R11: 0206 R12:
> > > 564a5e475b08
> > > [259701.549382] R13: 564a5e475c80 R14: 7ffe529db190 R15:
> > > 0c80
> > > [259701.707238] Disabling lock debugging due to kernel taint
> > 
> > I assume the above is misbehaviour in the DRM code?
> 
> 
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-05 Thread Matthew Wilcox
On Fri, Aug 02, 2019 at 01:23:06PM -0700, Andrew Morton wrote:
> > [259701.387365] BUG: Bad page state in process Xorg  pfn:2a300
> > [259701.393593] page:eaa8c000 refcount:0 mapcount:-128
> > mapping: index:0x0

mapcount -128 is PAGE_MAPCOUNT_RESERVE, aka PageBuddy.  I think somebody
called put_page() once more than they should have.  The one before this
caused it to be freed to the page allocator, which set PageBuddy.  Then
this one happened and we got a complaint.

> > [259701.402832] flags: 0x2000()
> > [259701.407426] raw: 2000 822ab778 eaa8f208
> > 
> > [259701.415900] raw:  0003 ff7f
> > 
> > [259701.424373] page dumped because: nonzero mapcount

It occurs to me that when a page is freed, we could record some useful bits
of information in the page from the stack trace to help debug double-free 
situations.  Even just stashing __builtin_return_address in page->mapping
would be helpful, I think.

> > [259701.549382] Call Trace:
> > [259701.549382]  dump_stack+0x46/0x60
> > [259701.549382]  bad_page.cold.28+0x81/0xb4
> > [259701.549382]  __free_pages_ok+0x236/0x240
> > [259701.549382]  __ttm_dma_free_page+0x2f/0x40
> > [259701.549382]  ttm_dma_unpopulate+0x29b/0x370
> > [259701.549382]  ttm_tt_destroy.part.6+0x44/0x50
> > [259701.549382]  ttm_bo_cleanup_memtype_use+0x29/0x70
> > [259701.549382]  ttm_bo_put+0x225/0x280
> > [259701.549382]  ttm_bo_vm_close+0x10/0x20
> > [259701.549382]  remove_vma+0x20/0x40
> > [259701.549382]  __do_munmap+0x2da/0x420
> > [259701.549382]  __vm_munmap+0x66/0xc0
> > [259701.549382]  __x64_sys_munmap+0x22/0x30
> > [259701.549382]  do_syscall_64+0x5e/0x1a0
> > [259701.549382]  ? prepare_exit_to_usermode+0x75/0xa0
> > [259701.549382]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [259701.549382] RIP: 0033:0x7f504d0ec1d7
> > [259701.549382] Code: 10 e9 67 ff ff ff 0f 1f 44 00 00 48 8b 15 b1 6c 0c 00 
> > f7
> > d8 64 89 02 48 c7 c0 ff ff ff ff e9 6b ff ff ff b8 0b 00 00 00 0f 05 <48> 
> > 3d 01
> > f0 ff ff 73 01 c3 48 8b 0d 89 6c 0c 00 f7 d8 64 89 01 48
> > [259701.549382] RSP: 002b:7ffe529db138 EFLAGS: 0206 ORIG_RAX:
> > 000b
> > [259701.549382] RAX: ffda RBX: 564a5eabce70 RCX:
> > 7f504d0ec1d7
> > [259701.549382] RDX: 7ffe529db140 RSI: 0040 RDI:
> > 7f5044b65000
> > [259701.549382] RBP: 564a5eafe460 R08: 000b R09:
> > 00010283e000
> > [259701.549382] R10: 0001 R11: 0206 R12:
> > 564a5e475b08
> > [259701.549382] R13: 564a5e475c80 R14: 7ffe529db190 R15:
> > 0c80
> > [259701.707238] Disabling lock debugging due to kernel taint
> 
> I assume the above is misbehaviour in the DRM code?
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [Bug 204407] New: Bad page state in process Xorg

2019-08-02 Thread Andrew Morton
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 01 Aug 2019 22:34:16 + bugzilla-dae...@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=204407
> 
> Bug ID: 204407
>Summary: Bad page state in process Xorg
>Product: Memory Management
>Version: 2.5
> Kernel Version: 5.3.0-rc2-00013
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: Page Allocator
>   Assignee: a...@linux-foundation.org
>   Reporter: p...@vandrovec.name
> Regression: No
> 
> Created attachment 284081
>   --> https://bugzilla.kernel.org/attachment.cgi?id=284081=edit
> dmesg
> 
> I've upgraded from 5.3-rc1 to 5.3-rc2, and when I started X server, system
> became unhappy:
> 
> [259701.387365] BUG: Bad page state in process Xorg  pfn:2a300
> [259701.393593] page:eaa8c000 refcount:0 mapcount:-128
> mapping: index:0x0
> [259701.402832] flags: 0x2000()
> [259701.407426] raw: 2000 822ab778 eaa8f208
> 
> [259701.415900] raw:  0003 ff7f
> 
> [259701.424373] page dumped because: nonzero mapcount
> [259701.429847] Modules linked in: af_packet xt_REDIRECT nft_compat x_tables
> nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv4 nf_tables
> nfnetlink ppdev parport fuse autofs4 binfmt_misc uinput twofish_generic
> twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common
> camellia_generic camellia_aesni_avx_x86_64 camellia_x86_64 serpent_avx_x86_64
> serpent_sse2_x86_64 serpent_generic blowfish_generic blowfish_x86_64
> blowfish_common cast5_avx_x86_64 cast5_generic cast_common des_generic cmac
> xcbc rmd160 af_key xfrm_algo rpcsec_gss_krb5 nfsv4 nls_iso8859_2 cifs libarc4
> nfsv3 nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ipv6 crc_ccitt
> nf_defrag_ipv6 snd_hda_codec_hdmi pktcdvd coretemp hwmon intel_rapl_common
> iosf_mbi x86_pkg_temp_thermal snd_hda_codec_realtek snd_hda_codec_generic
> ledtrig_audio snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul
> snd_hwdep crc32c_intel snd_hda_core ghash_clmulni_intel snd_pcm_oss uas
> iTCO_wdt aesni_intel e1000e
> [259701.429873]  snd_mixer_oss iTCO_vendor_support aes_x86_64 snd_pcm
> crypto_simd ptp mei_me dcdbas lpc_ich sr_mod cryptd usb_storage snd_timer
> glue_helper mfd_core input_leds pps_core tpm_tis cdrom i2c_i801 snd mei
> tpm_tis_core sg tpm [last unloaded: parport_pc]
> [259701.539387] CPU: 10 PID: 4860 Comm: Xorg Tainted: GT
> 5.3.0-rc2-64-00013-g03f05a670a3d #69
> [259701.549382] Hardware name: Dell Inc. Precision T3610/09M8Y8, BIOS A16
> 02/05/2018
> [259701.549382] Call Trace:
> [259701.549382]  dump_stack+0x46/0x60
> [259701.549382]  bad_page.cold.28+0x81/0xb4
> [259701.549382]  __free_pages_ok+0x236/0x240
> [259701.549382]  __ttm_dma_free_page+0x2f/0x40
> [259701.549382]  ttm_dma_unpopulate+0x29b/0x370
> [259701.549382]  ttm_tt_destroy.part.6+0x44/0x50
> [259701.549382]  ttm_bo_cleanup_memtype_use+0x29/0x70
> [259701.549382]  ttm_bo_put+0x225/0x280
> [259701.549382]  ttm_bo_vm_close+0x10/0x20
> [259701.549382]  remove_vma+0x20/0x40
> [259701.549382]  __do_munmap+0x2da/0x420
> [259701.549382]  __vm_munmap+0x66/0xc0
> [259701.549382]  __x64_sys_munmap+0x22/0x30
> [259701.549382]  do_syscall_64+0x5e/0x1a0
> [259701.549382]  ? prepare_exit_to_usermode+0x75/0xa0
> [259701.549382]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [259701.549382] RIP: 0033:0x7f504d0ec1d7
> [259701.549382] Code: 10 e9 67 ff ff ff 0f 1f 44 00 00 48 8b 15 b1 6c 0c 00 f7
> d8 64 89 02 48 c7 c0 ff ff ff ff e9 6b ff ff ff b8 0b 00 00 00 0f 05 <48> 3d 
> 01
> f0 ff ff 73 01 c3 48 8b 0d 89 6c 0c 00 f7 d8 64 89 01 48
> [259701.549382] RSP: 002b:7ffe529db138 EFLAGS: 0206 ORIG_RAX:
> 000b
> [259701.549382] RAX: ffda RBX: 564a5eabce70 RCX:
> 7f504d0ec1d7
> [259701.549382] RDX: 7ffe529db140 RSI: 0040 RDI:
> 7f5044b65000
> [259701.549382] RBP: 564a5eafe460 R08: 000b R09:
> 00010283e000
> [259701.549382] R10: 0001 R11: 0206 R12:
> 564a5e475b08
> [259701.549382] R13: 564a5e475c80 R14: 7ffe529db190 R15:
> 0c80
> [259701.707238] Disabling lock debugging due to kernel taint

I assume the above is misbehaviour in the DRM code?

> 
> Also - maybe related, maybe not - I've got three userspace crashes earlier on
> this kernel (but never before):
> 
> [77154.886836] iscons.py[12441]: segfault at 2c ip 080cf0b5 sp
> f773fb60 error 4 in python[8048000+11a000]
> [77154.898376] Code: 02 0f 84 4a 2e 00 00 8b 4d 08 8b bd 04 ff ff ff 8b 59 38
> 8b 57 20 8b 7b 10 85 ff 0f 84 ee 22 00 00 8b 8d 04 ff ff ff 8b 59 08 <8b> 43 
> 2c
> 85