Re: Re: unused swap offset / bad page map.

2013-11-12 Thread Alin Dobre

On 27/08/13 17:30, Dave Jones wrote:
> Seems to do the trick.


We are running many virtualization hosts with Linux 3.11.3, qemu 1.6.1 + 
kvm and ksm. The hosts have 128GB RAM, 10GB swap and 24x AMD Opteron 
6238 cores.


Several times a few weeks ago, we saw the OOM killer come to life and
quickly kill a large number of VMs on a host, even though there appeared
to be free memory on that host when the killing started.


However, the OOM killings are preceded by other traces, similar to the
ones Dave reported a couple of months ago in this very thread
(https://lkml.org/lkml/2013/8/7/27).


The relevant kernel log lines read:

20:30:44 kernel: swap_free: Unused swap file entry 2000200
20:30:44 kernel: BUG: Bad page map in process qemu-system-x86 pte:00040002 pmd:1ecc0d4067
20:30:44 kernel: addr:7f5b8b404000 vm_flags:80100073 anon_vma:880ff0e9df00 mapping:  (null) index:7f5b8b404
20:30:44 kernel: CPU: 9 PID: 22652 Comm: qemu-system-x86 Not tainted 3.11.2-elastic #2
20:30:44 kernel: Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 2.0b   03/01/2012
20:30:44 kernel: 7f5b8b404000 8807b76b1ab8 817ee7a6 000400f6
20:30:44 kernel: 880ea36a0e60 8807b76b1b08 81135ed5 000e
20:30:44 kernel: 0007f5b8b404 8807b76b1b08 7f5b8b404000 880ea36a0e60
20:30:44 kernel: Call Trace:
20:30:44 kernel: [817ee7a6] dump_stack+0x55/0x86
20:30:44 kernel: [81135ed5] print_bad_pte+0x1f5/0x213
20:30:44 kernel: [811379fd] unmap_single_vma+0x509/0x6d6
20:30:44 kernel: [81138291] unmap_vmas+0x4d/0x80
20:30:44 kernel: [8113e615] exit_mmap+0x93/0x11e
20:30:44 kernel: [810bc2fb] mmput+0x51/0xdb
20:30:44 kernel: [810c00b1] do_exit+0x33c/0x8a2
20:30:44 kernel: [810f58ab] ? get_futex_key+0x87/0x20c
20:30:44 kernel: [810c7215] ? __dequeue_signal+0x16/0x114
20:30:44 kernel: [810c06af] do_group_exit+0x6a/0x9d
20:30:44 kernel: [810c956a] get_signal_to_deliver+0x488/0x4a7
20:30:44 kernel: [81032db9] do_signal+0x47/0x48f
20:30:44 kernel: [8110dc29] ? rcu_eqs_enter+0x7d/0x82
20:30:44 kernel: [810e0ff4] ? account_user_time+0x6a/0x95
20:30:44 kernel: [810e13b6] ? vtime_account_user+0x5d/0x65
20:30:44 kernel: [81033229] do_notify_resume+0x28/0x6a
20:30:44 kernel: [817f6358] int_signal+0x12/0x17
20:30:44 kernel: Disabling lock debugging due to kernel taint
20:30:44 kernel: 33550335 pages RAM
20:30:44 kernel: 561601 pages reserved
20:30:44 kernel: 24628376 pages shared
20:30:44 kernel: 7190750 pages non-shared

Since we are using a 3.11.3 kernel, it already contains Cyrill's fix. 
However, our kernel log is very similar to Dave's report, so we are 
wondering if our mass OOM kill is another problem in the same area?


Any thoughts on this? I can provide more information from the logs, if 
necessary, and my colleague Richard originally reported the mass OOM 
kill in detail at http://article.gmane.org/gmane.linux.kernel.mm/108703.


Cheers,
Alin.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: unused swap offset / bad page map.

2013-08-27 Thread Cyrill Gorcunov
On Tue, Aug 27, 2013 at 12:24:27PM -0400, Dave Jones wrote:
>  > 
>  > I managed to trigger the issue as well. The patch below fixes it.
>  > Dave, could you please give it a shot once time permits?
> 
> Seems to do the trick.
> 
> Tested-by: Dave Jones <da...@fedoraproject.org>

Thanks a lot, Dave!


Re: unused swap offset / bad page map.

2013-08-27 Thread Dave Jones
On Tue, Aug 27, 2013 at 12:37:18PM +0400, Cyrill Gorcunov wrote:
 > On Mon, Aug 26, 2013 at 06:28:33PM -0400, Dave Jones wrote:
 > >  > 
 > >  > I've not tried matching up bits with Dave's reports, and just going
 > >  > into a meeting now, but this patch looks worth a try: probably Cyrill
 > >  > can improve it meanwhile to what he actually wants there (I'm
 > >  > surprised anything special is needed for just moving a pte).
 > >  > 
 > >  > Hugh
 > >  > 
 > >  > --- 3.11-rc7/mm/mremap.c	2013-07-14 17:10:16.640003652 -0700
 > >  > +++ linux/mm/mremap.c	2013-08-26 14:46:14.460027627 -0700
 > >  > @@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_str
 > >  >  			continue;
 > >  >  		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 > >  >  		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 > >  > -		set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
 > >  > +		set_pte_at(mm, new_addr, new_pte, pte);
 > >  >  	}
 > > 
 > > I'll give this a shot once I'm done with the bisect.
 > 
 > I managed to trigger the issue as well. The patch below fixes it.
 > Dave, could you please give it a shot once time permits?

Seems to do the trick.

Tested-by: Dave Jones <da...@fedoraproject.org>

Dave



Re: unused swap offset / bad page map.

2013-08-27 Thread Cyrill Gorcunov
On Mon, Aug 26, 2013 at 06:28:33PM -0400, Dave Jones wrote:
>  > 
>  > I've not tried matching up bits with Dave's reports, and just going
>  > into a meeting now, but this patch looks worth a try: probably Cyrill
>  > can improve it meanwhile to what he actually wants there (I'm
>  > surprised anything special is needed for just moving a pte).
>  > 
>  > Hugh
>  > 
>  > --- 3.11-rc7/mm/mremap.c	2013-07-14 17:10:16.640003652 -0700
>  > +++ linux/mm/mremap.c	2013-08-26 14:46:14.460027627 -0700
>  > @@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_str
>  >  			continue;
>  >  		pte = ptep_get_and_clear(mm, old_addr, old_pte);
>  >  		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
>  > -		set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
>  > +		set_pte_at(mm, new_addr, new_pte, pte);
>  >  	}
> 
> I'll give this a shot once I'm done with the bisect.

I managed to trigger the issue as well. The patch below fixes it.
Dave, could you please give it a shot once time permits?

Pavel, I kept the 'make it dirty on move' logic, but I somewhat doubt
it; won't plain pte copying (as in Hugh's patch) work for us?
---
From: Cyrill Gorcunov <gorcu...@gmail.com>
Subject: [PATCH] mm: move_ptes -- Set soft dirty bit depending on pte type

Dave reported corrupted swap entries

 | [ 4588.541886] swap_free: Unused swap offset entry 2d15
 | [ 4588.541952] BUG: Bad page map in process trinity-kid12  pte:005a2a80 
pmd:22c01f067

and Hugh pointed out that move_ptes sets the _PAGE_SOFT_DIRTY bit
regardless of the type of entry the pte holds. The trick here is
that when we carry soft dirty status in swap entries we have to
use _PAGE_SWP_SOFT_DIRTY instead, because this is the only place
in the pte which can be used for our own needs without intersecting
with bits owned by the swap entry type/offset.

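Reported-by: Dave Jones <da...@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcu...@openvz.org>
Cc: Pavel Emelyanov <xe...@parallels.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Hugh Dickins <hu...@google.com>
Cc: Hillf Danton <dhi...@gmail.com>
Cc: Andrew Morton <a...@linux-foundation.org>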
---
 mm/mremap.c |   21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

Index: linux-2.6.git/mm/mremap.c
===================================================================
--- linux-2.6.git.orig/mm/mremap.c
+++ linux-2.6.git/mm/mremap.c
@@ -15,6 +15,7 @@
 #include <linux/swap.h>
 #include <linux/capability.h>
 #include <linux/fs.h>
+#include <linux/swapops.h>
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
@@ -69,6 +70,23 @@ static pmd_t *alloc_new_pmd(struct mm_st
 	return pmd;
 }
 
+static pte_t move_soft_dirty_pte(pte_t pte)
+{
+	/*
+	 * Set soft dirty bit so we can notice
+	 * in userspace the ptes were moved.
+	 */
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	if (pte_present(pte))
+		pte = pte_mksoft_dirty(pte);
+	else if (is_swap_pte(pte))
+		pte = pte_swp_mksoft_dirty(pte);
+	else if (pte_file(pte))
+		pte = pte_file_mksoft_dirty(pte);
+#endif
+	return pte;
+}
+
 static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		unsigned long old_addr, unsigned long old_end,
 		struct vm_area_struct *new_vma, pmd_t *new_pmd,
@@ -126,7 +144,8 @@ static void move_ptes(struct vm_area_str
 			continue;
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
-		set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
+		pte = move_soft_dirty_pte(pte);
+		set_pte_at(mm, new_addr, new_pte, pte);
 	}
 
 	arch_leave_lazy_mmu_mode();
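
The difference between the old and the patched behaviour can be sketched in
user space. This is only a model under stated assumptions, not kernel code:
every `model_` name is hypothetical, the soft-dirty bit numbers (11 for
present ptes, 7 for swap ptes) are the ones quoted elsewhere in this thread,
and the swap offset is assumed to start at bit 9 of a non-present pte. The
file-pte branch of the patch is left out to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, used by this model only. */
#define MODEL_PAGE_PRESENT         (UINT64_C(1) << 0)
#define MODEL_PAGE_SWP_SOFT_DIRTY  (UINT64_C(1) << 7)
#define MODEL_PAGE_SOFT_DIRTY      (UINT64_C(1) << 11)
#define MODEL_SWP_OFFSET_SHIFT     9	/* assumption, see lead-in */

static bool model_pte_none(uint64_t pte)
{
	return pte == 0;
}

static bool model_pte_present(uint64_t pte)
{
	return (pte & MODEL_PAGE_PRESENT) != 0;
}

/* Old behaviour: the same bit is set whatever the pte encodes. */
static uint64_t model_mark_unconditional(uint64_t pte)
{
	return pte | MODEL_PAGE_SOFT_DIRTY;
}

/* Shape of the patched behaviour: pick the bit by pte type. */
static uint64_t model_mark_by_type(uint64_t pte)
{
	assert(!model_pte_none(pte));	/* empty slots are skipped by the caller */
	if (model_pte_present(pte))
		return pte | MODEL_PAGE_SOFT_DIRTY;
	/* Anything else is treated as a swap entry in this model. */
	return pte | MODEL_PAGE_SWP_SOFT_DIRTY;
}

/* Swap offset as decoded from a non-present pte. */
static uint64_t model_swp_offset(uint64_t pte)
{
	return pte >> MODEL_SWP_OFFSET_SHIFT;
}
```

For a swap pte built as `0x2d11 << 9`, `model_mark_unconditional()` changes
the decoded offset to 0x2d15, while `model_mark_by_type()` leaves it at
0x2d11. A present pte gets bit 11 in both cases.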


Re: unused swap offset / bad page map.

2013-08-26 Thread Cyrill Gorcunov
On Mon, Aug 26, 2013 at 04:15:00PM -0700, Linus Torvalds wrote:
> On Mon, Aug 26, 2013 at 3:08 PM, Hugh Dickins  wrote:
> >
> > I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
> > a line in mremap which worries me.  That set_pte_at() is operating
> > on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
> > prone to corrupt a swap entry.
> 
> Uhhuh. I think you hit the nail on the head here.
> 
> I checked all the pte_swp_*soft_dirty() users (they should be used on
> swp entries), because that came up in another thread. But you're
> right, the non-swp ones only work on present pte entries (or on
> file-offset entries, I guess), and at least that mremap() case seems
> bogus.

Oh my :( Indeed it sets _PAGE_SOFT_DIRTY unconditionally, sigh. This
nit comes from the earlier soft-dirty commit. Let me check all the other
places where we set the soft-dirty bit (Pavel CC'ed).

> I'm not seeing the point of marking the thing soft-dirty at all,
> although I guess it's "dirty" in the sense that it changed the
> contents at that virtual address. But for that code to work, it would
> have to have the same bit for swap entries as for present pages (and
> for file mapping entries), and that's not true. They are two different
> bits (_PAGE_SOFT_DIRTY is bit #11 vs _PAGE_SWP_SOFT_DIRTY is bit #7).
> 
> Ugh. Cyrill, this is a mess.

Linus, I simply had no place in the pte entry to carry soft-dirty status
when the pte is encoded in swap format, so it was an unpleasant but
necessary decision. That's why the bit accesses are wrapped in their own
macros with the 'swp' prefix, so a reader can easily grep for them.


Re: unused swap offset / bad page map.

2013-08-26 Thread Linus Torvalds
On Mon, Aug 26, 2013 at 3:08 PM, Hugh Dickins  wrote:
>
> I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
> a line in mremap which worries me.  That set_pte_at() is operating
> on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
> prone to corrupt a swap entry.

Uhhuh. I think you hit the nail on the head here.

I checked all the pte_swp_*soft_dirty() users (they should be used on
swp entries), because that came up in another thread. But you're
right, the non-swp ones only work on present pte entries (or on
file-offset entries, I guess), and at least that mremap() case seems
bogus.

I'm not seeing the point of marking the thing soft-dirty at all,
although I guess it's "dirty" in the sense that it changed the
contents at that virtual address. But for that code to work, it would
have to have the same bit for swap entries as for present pages (and
for file mapping entries), and that's not true. They are two different
bits (_PAGE_SOFT_DIRTY is bit #11 vs _PAGE_SWP_SOFT_DIRTY is bit #7).

Ugh. Cyrill, this is a mess.

Linus
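
The clash between the two bits can be checked against the pte value from
Dave's report. The following is a minimal user-space sketch, not kernel code.
Only the bit numbers 11 and 7 come from the message above; the rest of the
layout (swap type in bits 1-5, swap offset starting at bit 9 of a non-present
pte) is an assumption about x86-64 at the time, and all `model_` names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers quoted in the thread. */
#define MODEL_PAGE_SOFT_DIRTY      (UINT64_C(1) << 11)
#define MODEL_PAGE_SWP_SOFT_DIRTY  (UINT64_C(1) << 7)

/* Assumed layout of a swap-format pte (hypothetical constants). */
#define MODEL_PAGE_PRESENT         (UINT64_C(1) << 0)
#define MODEL_SWP_TYPE_SHIFT       1
#define MODEL_SWP_TYPE_MASK        UINT64_C(0x1f)
#define MODEL_SWP_OFFSET_SHIFT     9

/* Decode the swap type from a non-present pte value. */
static uint64_t model_swp_type(uint64_t pte)
{
	assert(!(pte & MODEL_PAGE_PRESENT));
	return (pte >> MODEL_SWP_TYPE_SHIFT) & MODEL_SWP_TYPE_MASK;
}

/* Decode the swap offset from a non-present pte value. */
static uint64_t model_swp_offset(uint64_t pte)
{
	assert(!(pte & MODEL_PAGE_PRESENT));
	return pte >> MODEL_SWP_OFFSET_SHIFT;
}
```

Under these assumptions, Dave's `pte:005a2a80` decodes to swap type 0 and
offset 0x2d15, which is the value in his `swap_free` message, and bit 11 is
set in that pte. If that bit was the one added by `pte_mksoft_dirty()`, the
offset before the move would have been 0x2d11. Setting bit 7 instead does not
touch the offset field at all.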


Re: unused swap offset / bad page map.

2013-08-26 Thread Dave Jones
On Mon, Aug 26, 2013 at 03:08:45PM -0700, Hugh Dickins wrote:
 
 > > That said, google does find "swap_free: Unused swap offset entry"
 > > reports from over the years. Most of them seem to be single-bit
 > > errors, though (ie when the entry is 0100 or similar I'm more
 > > inclined to blame a bit error
 > 
 > Yes, historically they have usually represented either single-bit
 > errors, or corruption of page tables by other kernel data.  The
 > swap subsystem discovers it, but it's rarely an error of swap.
 
Just to rule out bad hardware, I've seen this on two systems
(admittedly the exact same spec, but still..)

 > So I don't care for Dave's suggestion much earlier in this thread,
 > that swapoff should fail with -EINVAL if there has been a bad page
 > taint: that doesn't necessarily interfere with swapoff at all.
 > 
 > And besides, swapoff is killable: yes, if counts go wrong, it
 > can cycle around endlessly, but it checks for signal_pending()
 > each time around the loop.

It might be killable, but if I've done /sbin/reboot, and the
kernel dies in sys_swapoff because of the corruption, I won't
get a chance to kill it, because at that point the shutdown process
has killed my shell, sshd, and just about everything else.
It means a grumpy walk to the other side of the house to prod a
reset button.  So yeah, it might not be a mergeable thing, but
at least while bisecting it's pretty much a must-have.

 > I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
 > a line in mremap which worries me.  That set_pte_at() is operating
 > on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
 > prone to corrupt a swap entry.
 > 
 > I've not tried matching up bits with Dave's reports, and just going
 > into a meeting now, but this patch looks worth a try: probably Cyrill
 > can improve it meanwhile to what he actually wants there (I'm
 > surprised anything special is needed for just moving a pte).
 > 
 > Hugh
 > 
 > --- 3.11-rc7/mm/mremap.c	2013-07-14 17:10:16.640003652 -0700
 > +++ linux/mm/mremap.c	2013-08-26 14:46:14.460027627 -0700
 > @@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_str
 >  			continue;
 >  		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 >  		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 > -		set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
 > +		set_pte_at(mm, new_addr, new_pte, pte);
 >  	}

I'll give this a shot once I'm done with the bisect.

Dave



Re: unused swap offset / bad page map.

2013-08-26 Thread Hugh Dickins
On Mon, 26 Aug 2013, Linus Torvalds wrote:
> On Mon, Aug 26, 2013 at 1:15 PM, Linus Torvalds
>  wrote:
> >
> > So I'm almost likely to think that we are more likely to have
> > something wrong in the messy magical special cases.
> 
> Of course, the good news would be if it actually ends up being the
> soft-dirty stuff, and bisection hits something recent.

I suspect so.

> 
> So maybe I'm overly pessimistic. That messy swap_map[] code really
> _is_ messy, but at the same time it should also be pretty well-tested.
> I don't think it's been touched in years.

Blame me for the byte-instead-of-short continuation stuff.
But it's never yet shown any problem (okay, perhaps that's
because it's so rare to need any continuation anyway).

> 
> That said, google does find "swap_free: Unused swap offset entry"
> reports from over the years. Most of them seem to be single-bit
> errors, though (ie when the entry is 0100 or similar I'm more
> inclined to blame a bit error

Yes, historically they have usually represented either single-bit
errors, or corruption of page tables by other kernel data.  The
swap subsystem discovers it, but it's rarely an error of swap.

So I don't care for Dave's suggestion much earlier in this thread,
that swapoff should fail with -EINVAL if there has been a bad page
taint: that doesn't necessarily interfere with swapoff at all.

And besides, swapoff is killable: yes, if counts go wrong, it
can cycle around endlessly, but it checks for signal_pending()
each time around the loop.

> - in contrast your values look like "real" swap entries).

Indeed they do.

I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
a line in mremap which worries me.  That set_pte_at() is operating
on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
prone to corrupt a swap entry.

I've not tried matching up bits with Dave's reports, and just going
into a meeting now, but this patch looks worth a try: probably Cyrill
can improve it meanwhile to what he actually wants there (I'm
surprised anything special is needed for just moving a pte).

Hugh

--- 3.11-rc7/mm/mremap.c	2013-07-14 17:10:16.640003652 -0700
+++ linux/mm/mremap.c	2013-08-26 14:46:14.460027627 -0700
@@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_str
 			continue;
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
-		set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
+		set_pte_at(mm, new_addr, new_pte, pte);
 	}
 
 	arch_leave_lazy_mmu_mode();
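
The point above about swapoff being killable describes a common pattern: a
loop that may never finish on its own polls for a pending signal on every
pass. A rough user-space analogue, assuming a flag set from a signal handler
stands in for the kernel's signal_pending() check, could look like this. All
`model_` names and both parameters are hypothetical and do not mirror the
real swapoff code.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>

static volatile sig_atomic_t model_signal_seen;

static void model_on_signal(int signo)
{
	(void)signo;
	model_signal_seen = 1;
}

/* Install the handler for signo; returns 0 on success, -1 on failure. */
static int model_install_handler(int signo)
{
	return signal(signo, model_on_signal) == SIG_ERR ? -1 : 0;
}

/*
 * Keep scanning until work_left reaches zero. If the counts are wrong
 * and no progress is made, the loop would cycle endlessly, so it checks
 * for a signal on every pass and gives up with -EINTR.
 */
static int model_scan_until_done(long work_left, long progress_per_pass)
{
	assert(progress_per_pass >= 0);
	while (work_left > 0) {
		if (model_signal_seen)
			return -EINTR;
		work_left -= progress_per_pass;
	}
	return 0;
}
```

A caller that makes progress gets 0 back. A caller stuck with
`progress_per_pass == 0` only gets out once a signal arrives, which is why
the check helps little when nothing is left running to send that signal.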


Re: unused swap offset / bad page map.

2013-08-26 Thread Dave Jones
On Tue, Aug 27, 2013 at 01:49:40AM +0400, Cyrill Gorcunov wrote:
 > On Mon, Aug 26, 2013 at 05:42:44PM -0400, Dave Jones wrote:
 > > 
 > > Yeah, for reproducing this bug, I'd stick to running it as a user, without --dangerous.
 > > you might still hit a few fairly-easy to trigger warn-on/printks. I run with
 > > this applied: http://paste.fedoraproject.org/34960/55323613/raw/ to make things
 > > a little less noisy.
 > 
 > Ah, thanks, pulling it in. Btw, have you seen this problem earlier than -rc4 at all?

I just hit it on 3.11rc1. Couldn't reproduce on 3.10.

Dave



Re: unused swap offset / bad page map.

2013-08-26 Thread Cyrill Gorcunov
On Mon, Aug 26, 2013 at 05:42:44PM -0400, Dave Jones wrote:
> 
> Yeah, for reproducing this bug, I'd stick to running it as a user, without --dangerous.
> you might still hit a few fairly-easy to trigger warn-on/printks. I run with
> this applied: http://paste.fedoraproject.org/34960/55323613/raw/ to make things
> a little less noisy.

Ah, thanks, pulling it in. Btw, have you seen this problem earlier than -rc4 at all?


Re: unused swap offset / bad page map.

2013-08-26 Thread Dave Jones
On Tue, Aug 27, 2013 at 01:37:54AM +0400, Cyrill Gorcunov wrote:
 > On Tue, Aug 27, 2013 at 12:42:03AM +0400, Cyrill Gorcunov wrote:
 > > On Mon, Aug 26, 2013 at 04:37:02PM -0400, Dave Jones wrote:
 > > > 
 > > > Try adding the -C64 to the invocation in scripts/test-multi.sh,
 > > > and perhaps up'ing the NR_PROCESSES variable there too.
 > > 
 > > Thanks! I'll ping you if I manage to crash my instance.
 > 
 > So trinity tainted the kernel, but definitely not in the place I'm interested in.
 > 
 > [  320.904506] raw_sendmsg: trinity-child14 forgot to set AF_INET. Fix it!
 > [  329.570812] ------------[ cut here ]------------
 > [  329.571650] WARNING: CPU: 0 PID: 1982 at kernel/lockdep.c:3552 check_flags+0x18a/0x1c1()
 > [  329.571650] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled)
 > [  329.571650] Modules linked in:
 > [  329.571650] CPU: 0 PID: 1982 Comm: trinity-child4 Not tainted 3.11.0-rc6-dirty #386
 > [  329.571650] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 > [  329.571650]  0009 88001ee03b10 8157ac8a 0006
 > [  329.571650]  88001ee03b60 88001ee03b50 81045bb2 81583840
 > [  329.571650]  81092620 880002b48000 0046 81a2f750
 > [  329.571650] Call Trace:
 > [  329.571650][] dump_stack+0x4f/0x84
 > [  329.571650]  [] warn_slowpath_common+0x81/0x9b
 > [  329.571650]  [] ? ftrace_call+0x5/0x2f
 > [  329.571650]  [] ? check_flags+0x18a/0x1c1
 > [  329.571650]  [] warn_slowpath_fmt+0x46/0x48
 > [  329.571650]  [] ? warn_slowpath_fmt+0x5/0x48
 > [  329.571650]  [] check_flags+0x18a/0x1c1
 > [  329.571650]  [] lock_is_held+0x30/0x5f
 > [  329.571650]  [] rcu_read_lock_held+0x36/0x38
 > [  329.571650]  [] perf_tp_event+0x92/0x220
 > [  329.571650]  [] ? perf_tp_event+0x20e/0x220
 > [  329.571650]  [] ? __local_bh_enable+0x9a/0x9e
 > [  329.571650]  [] ? get_parent_ip+0x3f/0x3f
 > [  329.571650]  [] ? __local_bh_enable+0x9a/0x9e
 > [  329.571650]  [] perf_ftrace_function_call+0xce/0xdc

when it rains, it pours.. 
 
 > (Since my config is pretty similar to yours, I tried to run trinity without
 >  recompiling the kernel. First I loaded the swap space with crap data
 > 
 > [root@ovz trinity]# free
 >              total       used       free     shared    buffers     cached
 > Mem:        493228     480188      13040          0       2912      12112
 > -/+ buffers/cache:     465164      28064
 > Swap:      2063356    1741304     322052
 > 
 > then ran it as
 > 
 > [root@ovz trinity]# ./trinity -C64 --dangerous)

Yeah, for reproducing this bug, I'd stick to running it as a user, without
--dangerous.
You might still hit a few fairly easy-to-trigger warn-on/printks. I run with
this applied: http://paste.fedoraproject.org/34960/55323613/raw/ to make things
a little less noisy.

Dave



Re: unused swap offset / bad page map.

2013-08-26 Thread Cyrill Gorcunov
On Tue, Aug 27, 2013 at 12:42:03AM +0400, Cyrill Gorcunov wrote:
> On Mon, Aug 26, 2013 at 04:37:02PM -0400, Dave Jones wrote:
> > 
> > Try adding the -C64 to the invocation in scripts/test-multi.sh,
> > and perhaps up'ing the NR_PROCESSES variable there too.
> 
> Thanks! I'll ping you if I manage to crash my instance.

So trinity tainted the kernel, but definitely not in the place I'm interested in.

[  320.904506] raw_sendmsg: trinity-child14 forgot to set AF_INET. Fix it!
[  329.570812] ------------[ cut here ]------------
[  329.571650] WARNING: CPU: 0 PID: 1982 at kernel/lockdep.c:3552 check_flags+0x18a/0x1c1()
[  329.571650] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled)
[  329.571650] Modules linked in:
[  329.571650] CPU: 0 PID: 1982 Comm: trinity-child4 Not tainted 3.11.0-rc6-dirty #386
[  329.571650] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  329.571650]  0009 88001ee03b10 8157ac8a 0006
[  329.571650]  88001ee03b60 88001ee03b50 81045bb2 81583840
[  329.571650]  81092620 880002b48000 0046 81a2f750
[  329.571650] Call Trace:
[  329.571650][] dump_stack+0x4f/0x84
[  329.571650]  [] warn_slowpath_common+0x81/0x9b
[  329.571650]  [] ? ftrace_call+0x5/0x2f
[  329.571650]  [] ? check_flags+0x18a/0x1c1
[  329.571650]  [] warn_slowpath_fmt+0x46/0x48
[  329.571650]  [] ? warn_slowpath_fmt+0x5/0x48
[  329.571650]  [] check_flags+0x18a/0x1c1
[  329.571650]  [] lock_is_held+0x30/0x5f
[  329.571650]  [] rcu_read_lock_held+0x36/0x38
[  329.571650]  [] perf_tp_event+0x92/0x220
[  329.571650]  [] ? perf_tp_event+0x20e/0x220
[  329.571650]  [] ? __local_bh_enable+0x9a/0x9e
[  329.571650]  [] ? get_parent_ip+0x3f/0x3f
[  329.571650]  [] ? __local_bh_enable+0x9a/0x9e
[  329.571650]  [] perf_ftrace_function_call+0xce/0xdc

...

(since my config is pretty similar to yours I tried to run trinity without
 kernel recompilation. At first I loaded swap space with crap data

[root@ovz trinity]# free 
             total       used       free     shared    buffers     cached
Mem:        493228     480188      13040          0       2912      12112
-/+ buffers/cache:     465164      28064
Swap:      2063356    1741304     322052

then ran it as

[root@ovz trinity]# ./trinity -C64 --dangerous)

I'll continue tomorrow with your config and test-multi.sh.


Re: unused swap offset / bad page map.

2013-08-26 Thread Linus Torvalds
On Mon, Aug 26, 2013 at 1:15 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> So I'm almost likely to think that we are more likely to have
> something wrong in the messy magical special cases.

Of course, the good news would be if it actually ends up being the
soft-dirty stuff, and bisection hits something recent.

So maybe I'm overly pessimistic. That messy swap_map[] code really
_is_ messy, but at the same time it should also be pretty well-tested.
I don't think it's been touched in years.

That said, google does find "swap_free: Unused swap offset entry"
reports from over the years. Most of them seem to be single-bit
errors, though (ie when the entry is 0100 or similar I'm more
inclined to blame a bit error - in contrast your values look like
"real" swap entries).

Linus


Re: unused swap offset / bad page map.

2013-08-26 Thread Cyrill Gorcunov
On Mon, Aug 26, 2013 at 04:37:02PM -0400, Dave Jones wrote:
> 
> Try adding the -C64 to the invocation in scripts/test-multi.sh,
> and perhaps up'ing the NR_PROCESSES variable there too.

Thanks! I'll ping you if I manage to crash my instance.


Re: unused swap offset / bad page map.

2013-08-26 Thread Dave Jones
On Tue, Aug 27, 2013 at 12:18:46AM +0400, Cyrill Gorcunov wrote:
 > On Mon, Aug 26, 2013 at 03:08:22PM -0400, Dave Jones wrote:
 > > On Mon, Aug 26, 2013 at 11:45:53AM +0800, Hillf Danton wrote:
 > >  > On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones  wrote:
 > >  > >
 > >  > > It actually seems worse, seems I can trigger it even easier now, as if
 > >  > > there's a leak.
 > >  > >
 > >  > Can you please try the new fix for TLB flush?
 > >  > 
 > >  > commit  2b047252d087be7f2ba
 > >  > Fix TLB gather virtual address range invalidation corner cases
 > > 
 > > No luck.
 > 
 > Hi Dave, could you please put your .config somewhere so i would try
 > to repeat this problem? (i've tried trinity with -C64 but it didn't
 > trigger the issue).

http://paste.fedoraproject.org/34944/77549285
machine I'm using has 8gb ram, 8gb swap, and 4 cores.

Try adding the -C64 to the invocation in scripts/test-multi.sh,
and perhaps up'ing the NR_PROCESSES variable there too.

Dave



Re: unused swap offset / bad page map.

2013-08-26 Thread Cyrill Gorcunov
On Mon, Aug 26, 2013 at 03:08:22PM -0400, Dave Jones wrote:
> On Mon, Aug 26, 2013 at 11:45:53AM +0800, Hillf Danton wrote:
>  > On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones  wrote:
>  > >
>  > > It actually seems worse, seems I can trigger it even easier now, as if
>  > > there's a leak.
>  > >
>  > Can you please try the new fix for TLB flush?
>  > 
>  > commit  2b047252d087be7f2ba
>  > Fix TLB gather virtual address range invalidation corner cases
> 
> No luck.

Hi Dave, could you please put your .config somewhere so i would try
to repeat this problem? (i've tried trinity with -C64 but it didn't
trigger the issue).


Re: unused swap offset / bad page map.

2013-08-26 Thread Linus Torvalds
On Mon, Aug 26, 2013 at 12:08 PM, Dave Jones <da...@redhat.com> wrote:
>
> [ 4588.541886] swap_free: Unused swap offset entry 2d15
> [ 4588.541952] BUG: Bad page map in process trinity-kid12  pte:005a2a80 
> pmd:22c01f067
>
> I can reproduce this pretty quickly by driving the system into swapping using
> a few instances of 'trinity -C64' (this creates 64 threads)
>
> I'm not sure how far back this bug goes, so I'll try some older kernels
> and see if I can bisect it, because we don't seem to be getting closer
> to figuring out what's actually happening..

Bisecting would indeed be good. But I get the feeling that you'll need
to go back a *long* time, because the swap_map[] code hasn't changed
in ages.

I'm adding Hugh Dickins to the cc just in case he hasn't seen this on
linux-mm, because the swap_map[] code is complex as hell, and Hugh did
touch some of it last. The whole swap_map[] thing is complicated by:

 - it's a single byte per swap entry
 - it's not even a *structured* byte, but a single counter that has
several "fields" by hand
 - it has a count in the low 6 bits, with a magic "bad" value (which
is also a magic "continuation" value if one of the high bits are set)
 - it has two magic bits: HAS_CACHE and CONTINUED
 - it has a _third_ magic value (SWAP_MAP_SHMEM) which is "CONTINUED+BAD"
 - we increment this nasty pseudo-counter wildly hackily, and have
magic special case checks for the odd cases

and if we get any of the special cases wrong, we'll
increment/decrement it wrong, and we're screwed.
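
The byte layout listed above can be sketched in user-space C. This is only an illustration under stated assumptions: the constant values are quoted from memory of include/linux/swap.h of that era and should be checked against the actual tree, and `struct swap_map_view` and `decode_swap_map()` are hypothetical names made up for this sketch, not kernel code.

```c
#include <stdbool.h>

/* Assumed values, recalled from include/linux/swap.h around 3.11. */
#define SWAP_MAP_BAD    0x3f  /* magic "bad" value in the low 6 bits */
#define SWAP_HAS_CACHE  0x40  /* magic bit: page is in the swap cache */
#define COUNT_CONTINUED 0x80  /* magic bit: count continues elsewhere */
#define SWAP_MAP_SHMEM  0xbf  /* third magic value: CONTINUED + BAD */

/* Hypothetical decoded view of one swap_map[] byte. */
struct swap_map_view {
	unsigned count;   /* reference count held in the low 6 bits */
	bool has_cache;
	bool continued;
	bool bad;
	bool shmem;
};

static struct swap_map_view decode_swap_map(unsigned char ent)
{
	struct swap_map_view v;
	/* Strip the cache bit first; the magic values ignore it. */
	unsigned char no_cache = ent & ~SWAP_HAS_CACHE;

	v.has_cache = (ent & SWAP_HAS_CACHE) != 0;
	v.shmem = (no_cache == SWAP_MAP_SHMEM);
	v.bad = (no_cache == SWAP_MAP_BAD);
	/* CONTINUED only means "continuation" when it is not SHMEM. */
	v.continued = (ent & COUNT_CONTINUED) != 0 && !v.shmem;
	/* The magic values carry no usable count. */
	v.count = (v.shmem || v.bad) ? 0 : (unsigned)(ent & 0x3f);
	return v;
}
```

Even in this toy form, every field needs its own special case before the count can be trusted, which is the hazard being described: get one special case wrong and the increment or decrement goes wrong with it.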

The *locking* looks pretty simple, though. It's a simple spinlock. We
do some optimistic tests outside the spinlock, but the actual
allocation and modification seem to all be inside the lock and
re-check any optimistic values afaik.

So I'm almost likely to think that we are more likely to have
something wrong in the messy magical special cases. I'm wondering if
we should get rid of the continuation crap, for example, and expand
the "one byte per swap page" to two bytes instead.

Hugh, I think you know this code best, because you added the last
special case (that SWAP_MAP_SHMEM value). Comments?

  Linus


Re: unused swap offset / bad page map.

2013-08-26 Thread Dave Jones
On Mon, Aug 26, 2013 at 11:45:53AM +0800, Hillf Danton wrote:
 > On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones  wrote:
 > >
 > > It actually seems worse, seems I can trigger it even easier now, as if
 > > there's a leak.
 > >
 > Can you please try the new fix for TLB flush?
 > 
 > commit  2b047252d087be7f2ba
 > Fix TLB gather virtual address range invalidation corner cases

No luck.

[ 4588.541886] swap_free: Unused swap offset entry 2d15
[ 4588.541952] BUG: Bad page map in process trinity-kid12  pte:005a2a80 
pmd:22c01f067
[ 4588.541979] addr:7f0e95fa8000 vm_flags:00100073 
anon_vma:880217665550 mapping:  (null) index:1a42
[ 4588.542011] Modules linked in: snd_seq_dummy fuse hidp bnep 
scsi_transport_iscsi rfcomm ipt_ULOG can_bcm can_raw nfnetlink nfc caif_socket 
caif af_802154 phonet af_rxrpc bluetooth rfkill can llc2 pppoe pppox 
ppp_generic slhc irda crc_ccitt rds af_key rose x25 atm netrom appletalk ipx 
p8023 psnap p8022 llc ax25 xfs libcrc32c snd_hda_codec_realtek snd_hda_intel 
e1000e snd_hda_codec snd_hwdep ptp snd_seq snd_seq_device snd_pcm usb_debug 
pps_core pcspkr snd_page_alloc snd_timer snd soundcore
[ 4588.542245] CPU: 2 PID: 25390 Comm: trinity-kid12 Not tainted 3.11.0-rc7+ 
#13 
[ 4588.542321]   88021ba33c98 816f9ddf 
7f0e95fa8000
[ 4588.542354]  88021ba33ce0 81177047 005a2a80 
1a42
[ 4588.542386]  7f0e9600 88022c01fd40 005a2a80 
88021ba33e00
[ 4588.542418] Call Trace:
[ 4588.542435]  [] dump_stack+0x54/0x74
[ 4588.542457]  [] print_bad_pte+0x187/0x220
[ 4588.542478]  [] unmap_single_vma+0x524/0x850
[ 4588.542500]  [] unmap_vmas+0x49/0x90
[ 4588.542521]  [] exit_mmap+0xc5/0x170
[ 4588.542542]  [] mmput+0x77/0x100
[ 4588.542562]  [] do_exit+0x28d/0xcd0
[ 4588.542583]  [] ? trace_hardirqs_on_caller+0x115/0x1e0
[ 4588.542607]  [] ? trace_hardirqs_on+0xd/0x10
[ 4588.542629]  [] do_group_exit+0x4c/0xc0
[ 4588.543534]  [] SyS_exit_group+0x14/0x20
[ 4588.544438]  [] tracesys+0xdd/0xe2

I can reproduce this pretty quickly by driving the system into swapping using
a few instances of 'trinity -C64' (this creates 64 threads) 

I'm not sure how far back this bug goes, so I'll try some older kernels
and see if I can bisect it, because we don't seem to be getting closer
to figuring out what's actually happening..

Dave



Re: unused swap offset / bad page map.

2013-08-26 Thread Dave Jones
On Mon, Aug 26, 2013 at 11:45:53AM +0800, Hillf Danton wrote:
 > On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones <da...@redhat.com> wrote:
 > >
 > > It actually seems worse, seems I can trigger it even easier now, as if
 > > there's a leak.
 > >
 > Can you please try the new fix for TLB flush?
 > 
 > commit  2b047252d087be7f2ba
 > Fix TLB gather virtual address range invalidation corner cases

No luck.

[ 4588.541886] swap_free: Unused swap offset entry 2d15
[ 4588.541952] BUG: Bad page map in process trinity-kid12  pte:005a2a80 
pmd:22c01f067
[ 4588.541979] addr:7f0e95fa8000 vm_flags:00100073 
anon_vma:880217665550 mapping:  (null) index:1a42
[ 4588.542011] Modules linked in: snd_seq_dummy fuse hidp bnep 
scsi_transport_iscsi rfcomm ipt_ULOG can_bcm can_raw nfnetlink nfc caif_socket 
caif af_802154 phonet af_rxrpc bluetooth rfkill can llc2 pppoe pppox 
ppp_generic slhc irda crc_ccitt rds af_key rose x25 atm netrom appletalk ipx 
p8023 psnap p8022 llc ax25 xfs libcrc32c snd_hda_codec_realtek snd_hda_intel 
e1000e snd_hda_codec snd_hwdep ptp snd_seq snd_seq_device snd_pcm usb_debug 
pps_core pcspkr snd_page_alloc snd_timer snd soundcore
[ 4588.542245] CPU: 2 PID: 25390 Comm: trinity-kid12 Not tainted 3.11.0-rc7+ 
#13 
[ 4588.542321]   88021ba33c98 816f9ddf 
7f0e95fa8000
[ 4588.542354]  88021ba33ce0 81177047 005a2a80 
1a42
[ 4588.542386]  7f0e9600 88022c01fd40 005a2a80 
88021ba33e00
[ 4588.542418] Call Trace:
[ 4588.542435]  [816f9ddf] dump_stack+0x54/0x74
[ 4588.542457]  [81177047] print_bad_pte+0x187/0x220
[ 4588.542478]  [81178874] unmap_single_vma+0x524/0x850
[ 4588.542500]  [81179ac9] unmap_vmas+0x49/0x90
[ 4588.542521]  [811822c5] exit_mmap+0xc5/0x170
[ 4588.542542]  [8104ffb7] mmput+0x77/0x100
[ 4588.542562]  [8105465d] do_exit+0x28d/0xcd0
[ 4588.542583]  [810c0085] ? trace_hardirqs_on_caller+0x115/0x1e0
[ 4588.542607]  [810c015d] ? trace_hardirqs_on+0xd/0x10
[ 4588.542629]  [8105643c] do_group_exit+0x4c/0xc0
[ 4588.543534]  [810564c4] SyS_exit_group+0x14/0x20
[ 4588.544438]  [8170d554] tracesys+0xdd/0xe2

I can reproduce this pretty quickly by driving the system into swapping using
a few instances of 'trinity -C64' (this creates 64 threads) 

I'm not sure how far back this bug goes, so I'll try some older kernels
and see if I can bisect it, because we don't seem to be getting closer
to figuring out what's actually happening..

Dave



Re: unused swap offset / bad page map.

2013-08-26 Thread Linus Torvalds
On Mon, Aug 26, 2013 at 12:08 PM, Dave Jones <da...@redhat.com> wrote:
>
> [ 4588.541886] swap_free: Unused swap offset entry 2d15
> [ 4588.541952] BUG: Bad page map in process trinity-kid12  pte:005a2a80 
> pmd:22c01f067
>
> I can reproduce this pretty quickly by driving the system into swapping using
> a few instances of 'trinity -C64' (this creates 64 threads)
>
> I'm not sure how far back this bug goes, so I'll try some older kernels
> and see if I can bisect it, because we don't seem to be getting closer
> to figuring out what's actually happening..

Bisecting would indeed be good. But I get the feeling that you'll need
to go back a *long* time, because the swap_map[] code hasn't changed
in ages.

I'm adding Hugh Dickins to the cc just in case he hasn't seen this on
linux-mm, because the swap_map[] code is complex as hell, and Hugh did
touch some of it last. The whole swap_map[] thing is complicated by:

 - it's a single byte per swap entry
 - it's not even a *structured* byte, but a single counter that has
several "fields" by hand
 - it has a count in the low 6 bits, with a magic "bad" value (which
is also a magic "continuation" value if one of the high bits are set)
 - it has two magic bits: HAS_CACHE and CONTINUED
 - it has a _third_ magic value (SWAP_MAP_SHMEM) which is "CONTINUED+BAD"
 - we increment this nasty pseudo-counter wildly hackily, and have
magic special case checks for the odd cases

and if we get any of the special cases wrong, we'll
increment/decrement it wrong, and we're screwed.

The *locking* looks pretty simple, though. It's a simple spinlock. We
do some optimistic tests outside the spinlock, but the actual
allocation and modification seem to all be inside the lock and
re-check any optimistic values afaik.

So I'm almost likely to think that we are more likely to have
something wrong in the messy magical special cases. I'm wondering if
we should get rid of the continuation crap, for example, and expand
the "one byte per swap page" to two bytes instead.

Hugh, I think you know this code best, because you added the last
special case (that SWAP_MAP_SHMEM value). Comments?

  Linus


Re: unused swap offset / bad page map.

2013-08-26 Thread Cyrill Gorcunov
On Mon, Aug 26, 2013 at 03:08:22PM -0400, Dave Jones wrote:
> On Mon, Aug 26, 2013 at 11:45:53AM +0800, Hillf Danton wrote:
>  > On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones <da...@redhat.com> wrote:
>  > >
>  > > It actually seems worse, seems I can trigger it even easier now, as if
>  > > there's a leak.
>  > >
>  > Can you please try the new fix for TLB flush?
>  > 
>  > commit  2b047252d087be7f2ba
>  > Fix TLB gather virtual address range invalidation corner cases
> 
> No luck.

Hi Dave, could you please put your .config somewhere so i would try
to repeat this problem? (i've tried trinity with -C64 but it didn't
trigger the issue).


Re: unused swap offset / bad page map.

2013-08-26 Thread Dave Jones
On Tue, Aug 27, 2013 at 12:18:46AM +0400, Cyrill Gorcunov wrote:
 > On Mon, Aug 26, 2013 at 03:08:22PM -0400, Dave Jones wrote:
 > > On Mon, Aug 26, 2013 at 11:45:53AM +0800, Hillf Danton wrote:
 > >  > On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones <da...@redhat.com> wrote:
 > >  > >
 > >  > > It actually seems worse, seems I can trigger it even easier now, as if
 > >  > > there's a leak.
 > >  > >
 > >  > Can you please try the new fix for TLB flush?
 > >  > 
 > >  > commit  2b047252d087be7f2ba
 > >  > Fix TLB gather virtual address range invalidation corner cases
 > > 
 > > No luck.
 > 
 > Hi Dave, could you please put your .config somewhere so i would try
 > to repeat this problem? (i've tried trinity with -C64 but it didn't
 > trigger the issue).

http://paste.fedoraproject.org/34944/77549285
machine I'm using has 8gb ram, 8gb swap, and 4 cores.

Try adding the -C64 to the invocation in scripts/test-multi.sh,
and perhaps up'ing the NR_PROCESSES variable there too.

Dave



Re: unused swap offset / bad page map.

2013-08-26 Thread Cyrill Gorcunov
On Mon, Aug 26, 2013 at 04:37:02PM -0400, Dave Jones wrote:
 
> Try adding the -C64 to the invocation in scripts/test-multi.sh,
> and perhaps up'ing the NR_PROCESSES variable there too.

Thanks! I'll ping you if I manage to crash my instance.


Re: unused swap offset / bad page map.

2013-08-26 Thread Linus Torvalds
On Mon, Aug 26, 2013 at 1:15 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> So I'm almost likely to think that we are more likely to have
> something wrong in the messy magical special cases.

Of course, the good news would be if it actually ends up being the
soft-dirty stuff, and bisection hits something recent.

So maybe I'm overly pessimistic. That messy swap_map[] code really
_is_ messy, but at the same time it should also be pretty well-tested.
I don't think it's been touched in years.

That said, google does find "swap_free: Unused swap offset entry"
reports from over the years. Most of them seem to be single-bit
errors, though (ie when the entry is 0100 or similar I'm more
inclined to blame a bit error - in contrast your values look like
"real" swap entries).

Linus


Re: unused swap offset / bad page map.

2013-08-26 Thread Cyrill Gorcunov
On Tue, Aug 27, 2013 at 12:42:03AM +0400, Cyrill Gorcunov wrote:
> On Mon, Aug 26, 2013 at 04:37:02PM -0400, Dave Jones wrote:
> > 
> > Try adding the -C64 to the invocation in scripts/test-multi.sh,
> > and perhaps up'ing the NR_PROCESSES variable there too.
> 
> Thanks! I'll ping you if I manage to crash my instance.

So trinity tainted the kernel, but definitely not in the place I'm interested in.

[  320.904506] raw_sendmsg: trinity-child14 forgot to set AF_INET. Fix it!
[  329.570812] [ cut here ]
[  329.571650] WARNING: CPU: 0 PID: 1982 at kernel/lockdep.c:3552 
check_flags+0x18a/0x1c1()
[  329.571650] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled)
[  329.571650] Modules linked in:
[  329.571650] CPU: 0 PID: 1982 Comm: trinity-child4 Not tainted 
3.11.0-rc6-dirty #386
[  329.571650] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  329.571650]  0009 88001ee03b10 8157ac8a 
0006
[  329.571650]  88001ee03b60 88001ee03b50 81045bb2 
81583840
[  329.571650]  81092620 880002b48000 0046 
81a2f750
[  329.571650] Call Trace:
[  329.571650]  <IRQ>  [8157ac8a] dump_stack+0x4f/0x84
[  329.571650]  [81045bb2] warn_slowpath_common+0x81/0x9b
[  329.571650]  [81583840] ? ftrace_call+0x5/0x2f
[  329.571650]  [81092620] ? check_flags+0x18a/0x1c1
[  329.571650]  [81045c6f] warn_slowpath_fmt+0x46/0x48
[  329.571650]  [81045c2e] ? warn_slowpath_fmt+0x5/0x48
[  329.571650]  [81092620] check_flags+0x18a/0x1c1
[  329.571650]  [81093595] lock_is_held+0x30/0x5f
[  329.571650]  [810eb19e] rcu_read_lock_held+0x36/0x38
[  329.571650]  [810f1b92] perf_tp_event+0x92/0x220
[  329.571650]  [810f1d0e] ? perf_tp_event+0x20e/0x220
[  329.571650]  [81049f6c] ? __local_bh_enable+0x9a/0x9e
[  329.571650]  [810712f3] ? get_parent_ip+0x3f/0x3f
[  329.571650]  [81049f6c] ? __local_bh_enable+0x9a/0x9e
[  329.571650]  [810e3af1] perf_ftrace_function_call+0xce/0xdc

...

(since my config is pretty similar to yours I tried to run trinity without
 kernel recompilation. At first I loaded swap space with crap data

[root@ovz trinity]# free 
             total       used       free     shared    buffers     cached
Mem:        493228     480188      13040          0       2912      12112
-/+ buffers/cache:     465164      28064
Swap:      2063356    1741304     322052

then ran it as

[root@ovz trinity]# ./trinity -C64 --dangerous)

I'll continue tomorrow with your config and test-multi.sh.


Re: unused swap offset / bad page map.

2013-08-26 Thread Dave Jones
On Tue, Aug 27, 2013 at 01:37:54AM +0400, Cyrill Gorcunov wrote:
 > On Tue, Aug 27, 2013 at 12:42:03AM +0400, Cyrill Gorcunov wrote:
 > > On Mon, Aug 26, 2013 at 04:37:02PM -0400, Dave Jones wrote:
 > > > 
 > > > Try adding the -C64 to the invocation in scripts/test-multi.sh,
 > > > and perhaps up'ing the NR_PROCESSES variable there too.
 > > 
 > > Thanks! I'll ping you if I manage to crash my instance.
 > 
 > So trinity tainted the kernel, but definitely not in the place I'm interested in.
 > 
 > [  320.904506] raw_sendmsg: trinity-child14 forgot to set AF_INET. Fix it!
 > [  329.570812] [ cut here ]
 > [  329.571650] WARNING: CPU: 0 PID: 1982 at kernel/lockdep.c:3552 
 > check_flags+0x18a/0x1c1()
 > [  329.571650] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled)
 > [  329.571650] Modules linked in:
 > [  329.571650] CPU: 0 PID: 1982 Comm: trinity-child4 Not tainted 
 > 3.11.0-rc6-dirty #386
 > [  329.571650] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 > [  329.571650]  0009 88001ee03b10 8157ac8a 
 > 0006
 > [  329.571650]  88001ee03b60 88001ee03b50 81045bb2 
 > 81583840
 > [  329.571650]  81092620 880002b48000 0046 
 > 81a2f750
 > [  329.571650] Call Trace:
 > [  329.571650]  <IRQ>  [8157ac8a] dump_stack+0x4f/0x84
 > [  329.571650]  [81045bb2] warn_slowpath_common+0x81/0x9b
 > [  329.571650]  [81583840] ? ftrace_call+0x5/0x2f
 > [  329.571650]  [81092620] ? check_flags+0x18a/0x1c1
 > [  329.571650]  [81045c6f] warn_slowpath_fmt+0x46/0x48
 > [  329.571650]  [81045c2e] ? warn_slowpath_fmt+0x5/0x48
 > [  329.571650]  [81092620] check_flags+0x18a/0x1c1
 > [  329.571650]  [81093595] lock_is_held+0x30/0x5f
 > [  329.571650]  [810eb19e] rcu_read_lock_held+0x36/0x38
 > [  329.571650]  [810f1b92] perf_tp_event+0x92/0x220
 > [  329.571650]  [810f1d0e] ? perf_tp_event+0x20e/0x220
 > [  329.571650]  [81049f6c] ? __local_bh_enable+0x9a/0x9e
 > [  329.571650]  [810712f3] ? get_parent_ip+0x3f/0x3f
 > [  329.571650]  [81049f6c] ? __local_bh_enable+0x9a/0x9e
 > [  329.571650]  [810e3af1] perf_ftrace_function_call+0xce/0xdc

when it rains, it pours.. 
 
 > (since my config is pretty similar to yours I tried to run trinity without
 >  kernel recompilation. At first I loaded swap space with crap data
 > 
 > [root@ovz trinity]# free 
 >              total       used       free     shared    buffers     cached
 > Mem:        493228     480188      13040          0       2912      12112
 > -/+ buffers/cache:     465164      28064
 > Swap:      2063356    1741304     322052
 > 
 > then ran it as
 > 
 > [root@ovz trinity]# ./trinity -C64 --dangerous)

Yeah, for reproducing this bug, I'd stick to running it as a user, without 
--dangerous.
You might still hit a few fairly easy-to-trigger warn-on/printks. I run with
this applied: http://paste.fedoraproject.org/34960/55323613/raw/ to make things
a little less noisy.

Dave



Re: unused swap offset / bad page map.

2013-08-26 Thread Cyrill Gorcunov
On Mon, Aug 26, 2013 at 05:42:44PM -0400, Dave Jones wrote:
 
> Yeah, for reproducing this bug, I'd stick to running it as a user, without 
> --dangerous.
> You might still hit a few fairly easy-to-trigger warn-on/printks. I run with
> this applied: http://paste.fedoraproject.org/34960/55323613/raw/ to make things
> a little less noisy.

Ah, thanks, pulling it in. Btw, have you seen this problem earlier than -rc4
at all?


Re: unused swap offset / bad page map.

2013-08-26 Thread Dave Jones
On Tue, Aug 27, 2013 at 01:49:40AM +0400, Cyrill Gorcunov wrote:
 > On Mon, Aug 26, 2013 at 05:42:44PM -0400, Dave Jones wrote:
 > > 
 > > Yeah, for reproducing this bug, I'd stick to running it as a user, without 
 > > --dangerous.
 > > You might still hit a few fairly easy-to-trigger warn-on/printks. I run with
 > > this applied: http://paste.fedoraproject.org/34960/55323613/raw/ to make things
 > > a little less noisy.
 > 
 > Ah, thanks, pulling it in. Btw, have you seen this problem earlier than -rc4
 > at all?

I just hit it on 3.11-rc1. Couldn't reproduce on 3.10.

Dave



Re: unused swap offset / bad page map.

2013-08-26 Thread Hugh Dickins
On Mon, 26 Aug 2013, Linus Torvalds wrote:
> On Mon, Aug 26, 2013 at 1:15 PM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
> >
> > So I'm almost likely to think that we are more likely to have
> > something wrong in the messy magical special cases.
> 
> Of course, the good news would be if it actually ends up being the
> soft-dirty stuff, and bisection hits something recent.

I suspect so.

 
> So maybe I'm overly pessimistic. That messy swap_map[] code really
> _is_ messy, but at the same time it should also be pretty well-tested.
> I don't think it's been touched in years.

Blame me for the byte-instead-of-short continuation stuff.
But it's never yet shown any problem (okay, perhaps that's
because it's so rare to need any continuation anyway).

 
> That said, google does find "swap_free: Unused swap offset entry"
> reports from over the years. Most of them seem to be single-bit
> errors, though (ie when the entry is 0100 or similar I'm more
> inclined to blame a bit error

Yes, historically they have usually represented either single-bit
errors, or corruption of page tables by other kernel data.  The
swap subsystem discovers it, but it's rarely an error of swap.

So I don't care for Dave's suggestion much earlier in this thread,
that swapoff should fail with -EINVAL if there has been a bad page
taint: that doesn't necessarily interfere with swapoff at all.

And besides, swapoff is killable: yes, if counts go wrong, it
can cycle around endlessly, but it checks for signal_pending()
each time around the loop.

> - in contrast your values look like "real" swap entries).

Indeed they do.

I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
a line in mremap which worries me.  That set_pte_at() is operating
on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
prone to corrupt a swap entry.

I've not tried matching up bits with Dave's reports, and just going
into a meeting now, but this patch looks worth a try: probably Cyrill
can improve it meanwhile to what he actually wants there (I'm
surprised anything special is needed for just moving a pte).

Hugh

--- 3.11-rc7/mm/mremap.c	2013-07-14 17:10:16.640003652 -0700
+++ linux/mm/mremap.c	2013-08-26 14:46:14.460027627 -0700
@@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_str
 			continue;
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
-		set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
+		set_pte_at(mm, new_addr, new_pte, pte);
 	}
 
 	arch_leave_lazy_mmu_mode();


Re: unused swap offset / bad page map.

2013-08-26 Thread Dave Jones
On Mon, Aug 26, 2013 at 03:08:45PM -0700, Hugh Dickins wrote:
 
 > > That said, google does find "swap_free: Unused swap offset entry"
 > > reports from over the years. Most of them seem to be single-bit
 > > errors, though (ie when the entry is 0100 or similar I'm more
 > > inclined to blame a bit error
 > 
 > Yes, historically they have usually represented either single-bit
 > errors, or corruption of page tables by other kernel data.  The
 > swap subsystem discovers it, but it's rarely an error of swap.
 
Just to rule out bad hardware, I've seen this on two systems
(admittedly the exact same spec, but still..)

 > So I don't care for Dave's suggestion much earlier in this thread,
 > that swapoff should fail with -EINVAL if there has been a bad page
 > taint: that doesn't necessarily interfere with swapoff at all.
 > 
 > And besides, swapoff is killable: yes, if counts go wrong, it
 > can cycle around endlessly, but it checks for signal_pending()
 > each time around the loop.

It might be killable, but if I've done /sbin/reboot, and the
kernel dies in sys_swapoff because of the corruption, I won't
get a chance to kill it, because at that point the shutdown process
has killed my shell, sshd, and just about everything else.
It means a grumpy walk to the other side of the house to prod a
reset button.  So yeah, it might not be a mergeable thing, but
at least while bisecting it's pretty much a must-have.

 > I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
 > a line in mremap which worries me.  That set_pte_at() is operating
 > on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
 > prone to corrupt a swap entry.
 > 
 > I've not tried matching up bits with Dave's reports, and just going
 > into a meeting now, but this patch looks worth a try: probably Cyrill
 > can improve it meanwhile to what he actually wants there (I'm
 > surprised anything special is needed for just moving a pte).
 > 
 > Hugh
 > 
 > --- 3.11-rc7/mm/mremap.c	2013-07-14 17:10:16.640003652 -0700
 > +++ linux/mm/mremap.c	2013-08-26 14:46:14.460027627 -0700
 > @@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_str
 >  			continue;
 >  		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 >  		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 > -		set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
 > +		set_pte_at(mm, new_addr, new_pte, pte);
 >  	}

I'll give this a shot once I'm done with the bisect.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: unused swap offset / bad page map.

2013-08-26 Thread Linus Torvalds
On Mon, Aug 26, 2013 at 3:08 PM, Hugh Dickins <hu...@google.com> wrote:
>
> I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
> a line in mremap which worries me.  That set_pte_at() is operating
> on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
> prone to corrupt a swap entry.

Uhhuh. I think you hit the nail on the head here.

I checked all the pte_swp_*soft_dirty() users (they should be used on
swp entries), because that came up in another thread. But you're
right, the non-swp ones only work on present pte entries (or on
file-offset entries, I guess), and at least that mremap() case seems
bogus.

I'm not seeing the point of marking the thing soft-dirty at all,
although I guess it's dirty in the sense that it changed the
contents at that virtual address. But for that code to work, it would
have to have the same bit for swap entries as for present pages (and
for file mapping entries), and that's not true. They are two different
bits (_PAGE_SOFT_DIRTY is bit #11 vs _PAGE_SWP_SOFT_DIRTY is bit #7).

Ugh. Cyrill, this is a mess.

Linus
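
To make the bit mismatch described above concrete, here is a minimal user-space sketch in C. It is not kernel code. The two soft-dirty bit numbers (11 and 7) come from the message above; the position of the swap offset inside a non-present pte (bit 9 upward) is an assumption made only so the effect is visible, and every toy_* name is hypothetical.

```c
#include <stdint.h>

/* Bit numbers as quoted in the message above. */
#define SOFT_DIRTY_BIT      11  /* _PAGE_SOFT_DIRTY: present (and file) ptes */
#define SWP_SOFT_DIRTY_BIT   7  /* _PAGE_SWP_SOFT_DIRTY: swap-format ptes */

/* ASSUMPTION (not stated in the thread): the swap offset occupies the
 * bits from bit 9 upward in a non-present pte. */
#define TOY_SWP_OFFSET_SHIFT 9

/* Build a toy swap-format pte holding only an offset (type field left 0). */
static uint64_t toy_mk_swp_pte(uint64_t offset)
{
    return offset << TOY_SWP_OFFSET_SHIFT;
}

/* Recover the swap offset from a toy swap-format pte. */
static uint64_t toy_swp_offset(uint64_t pte)
{
    return pte >> TOY_SWP_OFFSET_SHIFT;
}

/* Model of the present-pte helper: sets bit 11. */
static uint64_t toy_mksoft_dirty(uint64_t pte)
{
    return pte | (UINT64_C(1) << SOFT_DIRTY_BIT);
}

/* Model of the swap-pte helper: sets bit 7. */
static uint64_t toy_swp_mksoft_dirty(uint64_t pte)
{
    return pte | (UINT64_C(1) << SWP_SOFT_DIRTY_BIT);
}
```

Under that assumed layout, applying the present-pte helper to a swap-format pte lands bit 11 inside the offset field and turns it into a different swap offset, while the swap-specific helper leaves the offset untouched.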


Re: unused swap offset / bad page map.

2013-08-26 Thread Cyrill Gorcunov
On Mon, Aug 26, 2013 at 04:15:00PM -0700, Linus Torvalds wrote:
> On Mon, Aug 26, 2013 at 3:08 PM, Hugh Dickins <hu...@google.com> wrote:
> >
> > I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
> > a line in mremap which worries me.  That set_pte_at() is operating
> > on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
> > prone to corrupt a swap entry.
> >
> Uhhuh. I think you hit the nail on the head here.
>
> I checked all the pte_swp_*soft_dirty() users (they should be used on
> swp entries), because that came up in another thread. But you're
> right, the non-swp ones only work on present pte entries (or on
> file-offset entries, I guess), and at least that mremap() case seems
> bogus.

Oh my :( Indeed it sets _PAGE_SOFT_DIRTY unconditionally, sigh. This
nit comes from the earlier soft-dirty commit. Let me check all the other
places where we set the soft-dirty bit (Pavel CC'ed).

> I'm not seeing the point of marking the thing soft-dirty at all,
> although I guess it's dirty in the sense that it changed the
> contents at that virtual address. But for that code to work, it would
> have to have the same bit for swap entries as for present pages (and
> for file mapping entries), and that's not true. They are two different
> bits (_PAGE_SOFT_DIRTY is bit #11 vs _PAGE_SWP_SOFT_DIRTY is bit #7).
>
> Ugh. Cyrill, this is a mess.

Linus, I simply had no place in the pte entry to carry soft-dirty status
when the pte is encoded in swap format, so it was an unpleasant but
necessary decision. That's why the bit accesses are wrapped in their own
macros with a 'swp' prefix, so a reader can easily grep for them.
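
Since the bit differs by pte format, a caller that can see either format has to pick the helper by format. The following is a rough user-space sketch of that selection, not the kernel's code and not necessarily the fix that was eventually merged. The bit-0 "present" test is a simplifying assumption, file-offset entries are ignored, and all toy_* and TOY_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PTE_PRESENT        (UINT64_C(1) << 0)   /* assumed: bit 0 marks a present pte */
#define TOY_PTE_SOFT_DIRTY     (UINT64_C(1) << 11)  /* bit for present ptes */
#define TOY_PTE_SWP_SOFT_DIRTY (UINT64_C(1) << 7)   /* bit for swap-format ptes */

static bool toy_pte_none(uint64_t pte)
{
    return pte == 0;
}

static bool toy_pte_present(uint64_t pte)
{
    return (pte & TOY_PTE_PRESENT) != 0;
}

/* Mark a moved pte soft-dirty using the bit that matches its format. */
static uint64_t toy_move_soft_dirty(uint64_t pte)
{
    if (toy_pte_none(pte))
        return pte;                          /* empty slot: nothing to mark */
    if (toy_pte_present(pte))
        return pte | TOY_PTE_SOFT_DIRTY;     /* present pte: bit 11 */
    return pte | TOY_PTE_SWP_SOFT_DIRTY;     /* treated as swap entry: bit 7 */
}
```

The point of the sketch is only that the branch on format keeps bit 11 out of non-present entries; a real implementation would also have to handle file-offset entries and any other non-present encodings.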


Re: unused swap offset / bad page map.

2013-08-25 Thread Hillf Danton
On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones <da...@redhat.com> wrote:
>
> It actually seems worse, seems I can trigger it even easier now, as if
> there's a leak.
>
Can you please try the new fix for TLB flush?

commit  2b047252d087be7f2ba
Fix TLB gather virtual address range invalidation corner cases


Re: unused swap offset / bad page map.

2013-08-22 Thread Dave Jones
On Fri, Aug 23, 2013 at 11:27:29AM +0800, Hillf Danton wrote:
 > On Fri, Aug 23, 2013 at 11:21 AM, Dave Jones <da...@redhat.com> wrote:
 > >
 > > I still see the swap_free messages with this applied.
 > >
 > Decremented?

It actually seems worse, seems I can trigger it even easier now, as if
there's a leak.

Dave



Re: unused swap offset / bad page map.

2013-08-22 Thread Dave Jones
On Thu, Aug 22, 2013 at 11:21:28AM +0800, Hillf Danton wrote:
 > On Thu, Aug 22, 2013 at 4:49 AM, Dave Jones <da...@redhat.com> wrote:
 > >
 > > didn't hit the bug_on, but got a bunch of
 > >
 > > [  424.077993] swap_free: Unused swap offset entry 000187d5
 > > [  439.377194] swap_free: Unused swap offset entry 000187e7
 > > [  441.998411] swap_free: Unused swap offset entry 000187ee
 > > [  446.956551] swap_free: Unused swap offset entry 245f
 > >
 > If page is reused, its swap entry is freed.
 > 
 > reuse_swap_page()
 >   delete_from_swap_cache()
 > swapcache_free()
 >   count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
 > 
 > If count drops to zero, then swap_free() gives warning.
 > 
 > 
 > --- a/mm/memory.c Wed Aug  7 16:29:34 2013
 > +++ b/mm/memory.c Thu Aug 22 10:44:32 2013
 > @@ -3123,6 +3123,7 @@ static int do_swap_page(struct mm_struct
 >   /* It's better to call commit-charge after rmap is established */
 >   mem_cgroup_commit_charge_swapin(page, ptr);
 > 
 > + if (!exclusive)
 >   swap_free(entry);
 >   if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
 >   try_to_free_swap(page);
 > --

I still see the swap_free messages with this applied.

Dave



Re: unused swap offset / bad page map.

2013-08-21 Thread Hillf Danton
On Thu, Aug 22, 2013 at 4:49 AM, Dave Jones <da...@redhat.com> wrote:
>
> didn't hit the bug_on, but got a bunch of
>
> [  424.077993] swap_free: Unused swap offset entry 000187d5
> [  439.377194] swap_free: Unused swap offset entry 000187e7
> [  441.998411] swap_free: Unused swap offset entry 000187ee
> [  446.956551] swap_free: Unused swap offset entry 245f
>
If page is reused, its swap entry is freed.

reuse_swap_page()
  delete_from_swap_cache()
swapcache_free()
  count = swap_entry_free(p, entry, SWAP_HAS_CACHE);

If count drops to zero, then swap_free() gives warning.


--- a/mm/memory.c Wed Aug  7 16:29:34 2013
+++ b/mm/memory.c Thu Aug 22 10:44:32 2013
@@ -3123,6 +3123,7 @@ static int do_swap_page(struct mm_struct
  /* It's better to call commit-charge after rmap is established */
  mem_cgroup_commit_charge_swapin(page, ptr);

+ if (!exclusive)
  swap_free(entry);
  if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
  try_to_free_swap(page);
--
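
For readers following the count argument, here is a toy user-space model of a per-slot swap use count. It is only a sketch: the kernel's swap_map also carries flag bits such as SWAP_HAS_CACHE, which are ignored here, and every toy_* name and the slot count are hypothetical. The "unused" message is modelled on the check visible in the debug patch elsewhere in this thread, where a zero swap_map[offset] sends swap_info_get() to bad_free.

```c
#include <stdio.h>

#define TOY_NR_SLOTS 8  /* hypothetical tiny swap area */

struct toy_swap {
    unsigned char swap_map[TOY_NR_SLOTS];  /* use count per swap slot */
};

/* Take one more reference on a slot. Returns 0 on success, -1 on a bad offset. */
static int toy_swap_dup(struct toy_swap *s, unsigned long offset)
{
    if (offset >= TOY_NR_SLOTS)
        return -1;
    s->swap_map[offset]++;
    return 0;
}

/*
 * Drop one reference on a slot. Freeing a slot whose count is already
 * zero is the situation that produces the "Unused swap offset entry"
 * message. Returns 0 on success, -1 on error.
 */
static int toy_swap_free(struct toy_swap *s, unsigned long offset)
{
    if (offset >= TOY_NR_SLOTS) {
        fprintf(stderr, "toy_swap_free: Bad swap offset entry %08lx\n", offset);
        return -1;
    }
    if (s->swap_map[offset] == 0) {
        fprintf(stderr, "toy_swap_free: Unused swap offset entry %08lx\n", offset);
        return -1;
    }
    s->swap_map[offset]--;
    return 0;
}
```

In this model the message can come from a genuine double free, or from a caller that was handed the wrong offset to begin with (for example, one read from a damaged pte).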


Re: unused swap offset / bad page map.

2013-08-21 Thread Hillf Danton
On Thu, Aug 22, 2013 at 4:49 AM, Dave Jones <da...@redhat.com> wrote:
>
> didn't hit the bug_on, but got a bunch of
>
> [  424.077993] swap_free: Unused swap offset entry 000187d5
> [  439.377194] swap_free: Unused swap offset entry 000187e7
> [  441.998411] swap_free: Unused swap offset entry 000187ee
> [  446.956551] swap_free: Unused swap offset entry 245f
>
Related to the regression reported?

Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use
https://lkml.org/lkml/2013/8/21/294


Re: unused swap offset / bad page map.

2013-08-21 Thread Dave Jones
On Tue, Aug 20, 2013 at 12:39:05PM +0800, Hillf Danton wrote:
 > On Tue, Aug 20, 2013 at 7:18 AM, Dave Jones <da...@redhat.com> wrote:
 > 
 > --- a/mm/memory.c Wed Aug  7 16:29:34 2013
 > +++ b/mm/memory.c Tue Aug 20 11:13:06 2013
 > @@ -933,8 +933,10 @@ again:
 >   if (progress >= 32) {
 >   progress = 0;
 >   if (need_resched() ||
 > -spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
 > +spin_needbreak(src_ptl) || spin_needbreak(dst_ptl)) {
 > + BUG_ON(entry.val);
 >   break;
 > + }
 >   }
 >   if (pte_none(*src_pte)) {
 >   progress++;

didn't hit the bug_on, but got a bunch of 

[  424.077993] swap_free: Unused swap offset entry 000187d5
[  439.377194] swap_free: Unused swap offset entry 000187e7
[  441.998411] swap_free: Unused swap offset entry 000187ee
[  446.956551] swap_free: Unused swap offset entry 245f

Dave



Re: unused swap offset / bad page map.

2013-08-19 Thread Hillf Danton
On Tue, Aug 20, 2013 at 7:18 AM, Dave Jones <da...@redhat.com> wrote:
>
> btw, anyone have thoughts on a patch something like below ?

And another (sorry if the message is reformatted by the mail agent;
it took me an hour trying to get the agent back to the correct format, but I
failed. Thanks a lot for any how-to on sending plain-text messages).

Hillf

--- a/mm/memory.c Wed Aug  7 16:29:34 2013
+++ b/mm/memory.c Tue Aug 20 11:13:06 2013
@@ -933,8 +933,10 @@ again:
  if (progress >= 32) {
  progress = 0;
  if (need_resched() ||
-spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
+spin_needbreak(src_ptl) || spin_needbreak(dst_ptl)) {
+ BUG_ON(entry.val);
  break;
+ }
  }
  if (pte_none(*src_pte)) {
  progress++;
--


> It's really annoying to debug stuff like this and have to walk
> over to the machine and reboot it by hand after it wedges during swapoff.
>
> Dave
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6cf2e60..bbb1192 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1587,6 +1587,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         if (!capable(CAP_SYS_ADMIN))
>                 return -EPERM;
>
> +       /* If we have hit memory corruption, we could hang during swapoff, so don't even try. */
> +       if (test_taint(TAINT_BAD_PAGE))
> +               return -EINVAL;
> +
>         BUG_ON(!current->mm);
>
>         pathname = getname(specialfile);


Re: unused swap offset / bad page map.

2013-08-19 Thread Dave Jones
On Thu, Aug 08, 2013 at 11:20:28PM +0800, Hillf Danton wrote:
 > On Wed, Aug 7, 2013 at 11:30 PM, Dave Jones <da...@redhat.com> wrote:
 > > printk didn't trigger.
 > >
 > Is a corrupted page table entry encountered, according to the
 > comment of swap_duplicate()?
 > 
 > 
 > --- a/mm/swapfile.c  Wed Aug  7 17:27:22 2013
 > +++ b/mm/swapfile.c  Thu Aug  8 23:12:30 2013
 > @@ -770,6 +770,7 @@ int free_swap_and_cache(swp_entry_t entr
 >  unlock_page(page);
 >  page_cache_release(page);
 >  }
 > +return 1;
 >  return p != NULL;
 >  }
 > 
 > --

[sorry for delay, been travelling]

With this applied, I no longer see the 'bad page' warning, but 
I do still get a bunch of messages like..

[  340.342436] swap_free: Unused swap offset entry 3bb4
[  340.952980] swap_free: Unused swap offset entry 298d
[  340.953016] swap_free: Unused swap offset entry 2996
[  340.953048] swap_free: Unused swap offset entry 299d


btw, anyone have thoughts on a patch something like below ?
It's really annoying to debug stuff like this and have to walk
over to the machine and reboot it by hand after it wedges during swapoff.

Dave

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6cf2e60..bbb1192 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1587,6 +1587,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;
 
+       /* If we have hit memory corruption, we could hang during swapoff, so don't even try. */
+       if (test_taint(TAINT_BAD_PAGE))
+               return -EINVAL;
+
        BUG_ON(!current->mm);
 
        pathname = getname(specialfile);
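
For anyone who cannot carry a kernel patch like this, a similar guard can be approximated from user space before swapoff is invoked. The sketch below is a different technique from the patch above: it reads the taint mask from /proc/sys/kernel/tainted and tests the bad-page bit. The bit index 5 for TAINT_BAD_PAGE is an assumption about this kernel generation, and both function names are hypothetical.

```c
#include <stdio.h>

/* ASSUMPTION: TAINT_BAD_PAGE is bit 5 of the kernel taint mask. */
#define TOY_TAINT_BAD_PAGE_BIT 5

/* Return 1 if the given taint mask has the bad-page bit set, else 0. */
static int taint_mask_has_bad_page(unsigned long mask)
{
    return (int)((mask >> TOY_TAINT_BAD_PAGE_BIT) & 1UL);
}

/*
 * Read the running kernel's taint mask.
 * Returns 1 if bad-page tainted, 0 if not, -1 if the mask cannot be read.
 */
static int kernel_has_bad_page_taint(void)
{
    unsigned long mask;
    FILE *f = fopen("/proc/sys/kernel/tainted", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%lu", &mask) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return taint_mask_has_bad_page(mask);
}
```

A shutdown wrapper could call kernel_has_bad_page_taint() and skip swapoff when it returns 1. That only helps if the shutdown scripts are under your control, which is the case the in-kernel check avoids depending on.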


Re: unused swap offset / bad page map.

2013-08-08 Thread Dave Jones
On Thu, Aug 08, 2013 at 11:20:28PM +0800, Hillf Danton wrote:
 > On Wed, Aug 7, 2013 at 11:30 PM, Dave Jones <da...@redhat.com> wrote:
 > > printk didn't trigger.
 > >
 > Is a corrupted page table entry encountered, according to the
 > comment of swap_duplicate()?
 > 
 > 
 > --- a/mm/swapfile.c  Wed Aug  7 17:27:22 2013
 > +++ b/mm/swapfile.c  Thu Aug  8 23:12:30 2013
 > @@ -770,6 +770,7 @@ int free_swap_and_cache(swp_entry_t entr
 >  unlock_page(page);
 >  page_cache_release(page);
 >  }
 > +return 1;
 >  return p != NULL;
 >  }

Travelling for a week, I'll check it out when I get back.

Dave



Re: unused swap offset / bad page map.

2013-08-08 Thread Hillf Danton
On Wed, Aug 7, 2013 at 11:30 PM, Dave Jones <da...@redhat.com> wrote:
> printk didn't trigger.
>
Is a corrupted page table entry encountered, according to the
comment of swap_duplicate()?


--- a/mm/swapfile.c Wed Aug  7 17:27:22 2013
+++ b/mm/swapfile.c Thu Aug  8 23:12:30 2013
@@ -770,6 +770,7 @@ int free_swap_and_cache(swp_entry_t entr
unlock_page(page);
page_cache_release(page);
}
+   return 1;
return p != NULL;
 }

--


Re: unused swap offset / bad page map.

2013-08-07 Thread Dave Jones

void __lru_cache_add(struct page *page)
{
        struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

        page_cache_get(page);
        if (!pagevec_space(pvec))
                __pagevec_lru_add(pvec);
        pagevec_add(pvec, page);
        put_cpu_var(lru_add_pvec);
}

I added a printk, and found that pagevec_add frequently returns 0. Is that ok ?

What happens to 'page' in this case ?

Dave
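
One way to reason about the question above is with a toy user-space model of the page vector. It assumes that pagevec_add() stores the page and then returns the space remaining, so a return of 0 would mean "now full" rather than "not added"; that reading of the helper is an assumption here, as is the vector size of 14. All toy_* names are hypothetical.

```c
#define TOY_PAGEVEC_SIZE 14  /* ASSUMPTION: capacity of one page vector */

struct toy_pagevec {
    unsigned nr;                          /* number of pages currently held */
    const void *pages[TOY_PAGEVEC_SIZE];
};

/* Free slots left in the vector. */
static unsigned toy_pagevec_space(const struct toy_pagevec *pvec)
{
    return TOY_PAGEVEC_SIZE - pvec->nr;
}

/* Store the page, then report how much space is left afterwards. */
static unsigned toy_pagevec_add(struct toy_pagevec *pvec, const void *page)
{
    pvec->pages[pvec->nr++] = page;
    return toy_pagevec_space(pvec);
}

/* Stand-in for flushing the vector to the LRU: just empty it. */
static void toy_pagevec_drain(struct toy_pagevec *pvec)
{
    pvec->nr = 0;
}

/* Same shape as the function quoted above: drain first if full, then add. */
static void toy_lru_cache_add(struct toy_pagevec *pvec, const void *page)
{
    if (!toy_pagevec_space(pvec))
        toy_pagevec_drain(pvec);
    toy_pagevec_add(pvec, page);
}
```

In this model a 0 return from the add still leaves the page in the vector; it is flushed by the drain on the next call. Whether that matches the real helper is worth confirming against the pagevec header.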



Re: unused swap offset / bad page map.

2013-08-07 Thread Dave Jones
On Wed, Aug 07, 2013 at 06:04:20PM +0800, Hillf Danton wrote:
 > > There were a slew of these. same trace, different addr/anon_vma/index.
 > > mapping always null.
 > >
 > Would you please run again with the debug info added?
 > ---
 > --- a/mm/swapfile.c  Wed Aug  7 17:27:22 2013
 > +++ b/mm/swapfile.c  Wed Aug  7 17:57:20 2013
 > @@ -509,6 +509,7 @@ static struct swap_info_struct *swap_inf
 >  {
 >  struct swap_info_struct *p;
 >  unsigned long offset, type;
 > +int race = 0;
 > 
 >  if (!entry.val)
 >  goto out;
 > @@ -524,10 +525,17 @@ static struct swap_info_struct *swap_inf
 >  if (!p->swap_map[offset])
 >  goto bad_free;
 >  spin_lock(&p->lock);
 > +if (!p->swap_map[offset]) {
 > +race = 1;
 > +spin_unlock(&p->lock);
 > +goto bad_free;
 > +}
 >  return p;
 > 
 >  bad_free:
 >  printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
 > +if (race)
 > +printk(KERN_ERR "but due to race\n");
 >  goto out;
 >  bad_offset:
 >  printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
 > --

printk didn't trigger.
This time around the oom killer was going off the same time.
I'm wondering if we have some allocations somewhere in the swap code that
don't handle failure correctly.

Dave



Re: unused swap offset / bad page map.

2013-08-07 Thread Hillf Danton
Hello Dave

On Wed, Aug 7, 2013 at 1:51 PM, Dave Jones <da...@redhat.com> wrote:
> Seen while fuzzing with lots of child processes.
>
> swap_free: Unused swap offset entry 001263f5
> BUG: Bad page map in process trinity-child29  pte:24c7ea00 pmd:09fec067
> addr:7f9db958d000 vm_flags:00100073 anon_vma:88022c004ba0 mapping:
>   (null) index:f99
> Modules linked in: fuse ipt_ULOG snd_seq_dummy tun sctp scsi_transport_iscsi 
> can_raw can_bcm rfcomm bnep nfnetlink hidp appletalk bluetooth rose can 
> af_802154 phonet x25 af_rxrpc llc2 nfc rfkill af_key pppoe rds pppox 
> ppp_generic slhc caif_socket caif irda crc_ccitt atm netrom ax25 ipx p8023 
> psnap p8022 llc snd_hda_codec_realtek pcspkr usb_debug snd_seq snd_seq_device 
> snd_hda_intel snd_hda_codec snd_hwdep e1000e snd_pcm ptp pps_core 
> snd_page_alloc snd_timer snd soundcore xfs libcrc32c
> CPU: 1 PID: 2624 Comm: trinity-child29 Not tainted 3.11.0-rc4+ #1
>   8801fd7ddc90 81700f2c 7f9db958d000
>  8801fd7ddcd8 8117cba7 24c7ea00 0f99
>  7f9db960 880009fecc68 24c7ea00 8801fd7dde00
> Call Trace:
>  [81700f2c] dump_stack+0x4e/0x82
>  [8117cba7] print_bad_pte+0x187/0x220
>  [8117e415] unmap_single_vma+0x535/0x890
>  [8117f719] unmap_vmas+0x49/0x90
>  [81187ef1] exit_mmap+0xc1/0x170
>  [810510ef] mmput+0x6f/0x100
>  [81055818] do_exit+0x288/0xcd0
>  [810c1da5] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [810c1e7d] ? trace_hardirqs_on+0xd/0x10
>  [810575dc] do_group_exit+0x4c/0xc0
>  [81057664] SyS_exit_group+0x14/0x20
>  [81713dd4] tracesys+0xdd/0xe2
>
> There were a slew of these. same trace, different addr/anon_vma/index.
> mapping always null.
>
Would you please run again with the debug info added?
---
--- a/mm/swapfile.c Wed Aug  7 17:27:22 2013
+++ b/mm/swapfile.c Wed Aug  7 17:57:20 2013
@@ -509,6 +509,7 @@ static struct swap_info_struct *swap_inf
 {
struct swap_info_struct *p;
unsigned long offset, type;
+   int race = 0;

if (!entry.val)
goto out;
@@ -524,10 +525,17 @@ static struct swap_info_struct *swap_inf
if (!p->swap_map[offset])
goto bad_free;
spin_lock(&p->lock);
+   if (!p->swap_map[offset]) {
+   race = 1;
+   spin_unlock(&p->lock);
+   goto bad_free;
+   }
return p;

 bad_free:
printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
+   if (race)
+   printk(KERN_ERR "but due to race\n");
goto out;
 bad_offset:
printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
--


Re: unused swap offset / bad page map.

2013-08-07 Thread Hillf Danton
Hello Dave

On Wed, Aug 7, 2013 at 1:51 PM, Dave Jones <da...@redhat.com> wrote:
> Seen while fuzzing with lots of child processes.
>
> swap_free: Unused swap offset entry 001263f5
> BUG: Bad page map in process trinity-child29  pte:24c7ea00 pmd:09fec067
> addr:7f9db958d000 vm_flags:00100073 anon_vma:88022c004ba0 mapping:
>   (null) index:f99
> Modules linked in: fuse ipt_ULOG snd_seq_dummy tun sctp scsi_transport_iscsi
> can_raw can_bcm rfcomm bnep nfnetlink hidp appletalk bluetooth rose can
> af_802154 phonet x25 af_rxrpc llc2 nfc rfkill af_key pppoe rds pppox
> ppp_generic slhc caif_socket caif irda crc_ccitt atm netrom ax25 ipx p8023
> psnap p8022 llc snd_hda_codec_realtek pcspkr usb_debug snd_seq snd_seq_device
> snd_hda_intel snd_hda_codec snd_hwdep e1000e snd_pcm ptp pps_core
> snd_page_alloc snd_timer snd soundcore xfs libcrc32c
> CPU: 1 PID: 2624 Comm: trinity-child29 Not tainted 3.11.0-rc4+ #1
>   8801fd7ddc90 81700f2c 7f9db958d000
>  8801fd7ddcd8 8117cba7 24c7ea00 0f99
>  7f9db960 880009fecc68 24c7ea00 8801fd7dde00
> Call Trace:
>  [81700f2c] dump_stack+0x4e/0x82
>  [8117cba7] print_bad_pte+0x187/0x220
>  [8117e415] unmap_single_vma+0x535/0x890
>  [8117f719] unmap_vmas+0x49/0x90
>  [81187ef1] exit_mmap+0xc1/0x170
>  [810510ef] mmput+0x6f/0x100
>  [81055818] do_exit+0x288/0xcd0
>  [810c1da5] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [810c1e7d] ? trace_hardirqs_on+0xd/0x10
>  [810575dc] do_group_exit+0x4c/0xc0
>  [81057664] SyS_exit_group+0x14/0x20
>  [81713dd4] tracesys+0xdd/0xe2
>
> There were a slew of these. same trace, different addr/anon_vma/index.
> mapping always null.

Would you please run again with the debug info added?
---
--- a/mm/swapfile.c	Wed Aug  7 17:27:22 2013
+++ b/mm/swapfile.c	Wed Aug  7 17:57:20 2013
@@ -509,6 +509,7 @@ static struct swap_info_struct *swap_inf
 {
 	struct swap_info_struct *p;
 	unsigned long offset, type;
+	int race = 0;

 	if (!entry.val)
 		goto out;
@@ -524,10 +525,17 @@ static struct swap_info_struct *swap_inf
 	if (!p->swap_map[offset])
 		goto bad_free;
 	spin_lock(&p->lock);
+	if (!p->swap_map[offset]) {
+		race = 1;
+		spin_unlock(&p->lock);
+		goto bad_free;
+	}
 	return p;

 bad_free:
 	printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
+	if (race)
+		printk(KERN_ERR "but due to race\n");
 	goto out;
 bad_offset:
 	printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
--
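
The debug patch above follows a check, lock, re-check pattern: swap_map[offset] is tested once without the lock and again with p->lock held, and the extra printk fires only when the second test fails. Below is a minimal userspace sketch of that pattern, assuming a C11 atomic_flag as a stand-in for the kernel spinlock. struct swap_model, slot_lookup() and the between_checks hook are hypothetical names used for illustration; none of them are kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SLOTS 8

enum lookup_result { LOOKUP_OK, LOOKUP_UNUSED, LOOKUP_UNUSED_RACE };

struct swap_model {
    atomic_flag lock;            /* stand-in for p->lock */
    unsigned char map[SLOTS];    /* stand-in for p->swap_map; 0 = unused */
};

static void model_lock(struct swap_model *m)
{
    while (atomic_flag_test_and_set(&m->lock))
        ;                        /* spin until the flag was clear */
}

static void model_unlock(struct swap_model *m)
{
    atomic_flag_clear(&m->lock);
}

/* Simulates another CPU freeing the slot; usable as the between_checks hook. */
static void concurrent_free(struct swap_model *m, size_t off)
{
    model_lock(m);
    m->map[off] = 0;
    model_unlock(m);
}

/*
 * Returns LOOKUP_OK with the lock still held (the caller unlocks), or one
 * of the two "unused" results with the lock released.  between_checks is
 * a hypothetical hook that runs in the window between the unlocked test
 * and the locked re-test; pass NULL for no interference.
 */
static enum lookup_result
slot_lookup(struct swap_model *m, size_t off,
            void (*between_checks)(struct swap_model *, size_t))
{
    assert(off < SLOTS);
    if (!m->map[off])
        return LOOKUP_UNUSED;        /* already zero before locking */
    if (between_checks)
        between_checks(m, off);
    model_lock(m);
    if (!m->map[off]) {              /* went to zero inside the window */
        model_unlock(m);
        return LOOKUP_UNUSED_RACE;
    }
    return LOOKUP_OK;
}
```

In this model, LOOKUP_UNUSED corresponds to the plain "Unused swap offset" message and LOOKUP_UNUSED_RACE to the additional "but due to race" line.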


Re: unused swap offset / bad page map.

2013-08-07 Thread Dave Jones
On Wed, Aug 07, 2013 at 06:04:20PM +0800, Hillf Danton wrote:
 > > There were a slew of these. same trace, different addr/anon_vma/index.
 > > mapping always null.
 >
 > Would you please run again with the debug info added?
 > ---
 > --- a/mm/swapfile.c	Wed Aug  7 17:27:22 2013
 > +++ b/mm/swapfile.c	Wed Aug  7 17:57:20 2013
 > @@ -509,6 +509,7 @@ static struct swap_info_struct *swap_inf
 >  {
 >  	struct swap_info_struct *p;
 >  	unsigned long offset, type;
 > +	int race = 0;
 >
 >  	if (!entry.val)
 >  		goto out;
 > @@ -524,10 +525,17 @@ static struct swap_info_struct *swap_inf
 >  	if (!p->swap_map[offset])
 >  		goto bad_free;
 >  	spin_lock(&p->lock);
 > +	if (!p->swap_map[offset]) {
 > +		race = 1;
 > +		spin_unlock(&p->lock);
 > +		goto bad_free;
 > +	}
 >  	return p;
 >
 >  bad_free:
 >  	printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
 > +	if (race)
 > +		printk(KERN_ERR "but due to race\n");
 >  	goto out;
 >  bad_offset:
 >  	printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
 > --

printk didn't trigger.
This time around the oom killer was going off at the same time.
I'm wondering if we have some allocations somewhere in the swap code that
don't handle failure correctly.

Dave



Re: unused swap offset / bad page map.

2013-08-07 Thread Dave Jones

void __lru_cache_add(struct page *page)
{
	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

	page_cache_get(page);
	if (!pagevec_space(pvec))
		__pagevec_lru_add(pvec);
	pagevec_add(pvec, page);
	put_cpu_var(lru_add_pvec);
}

I added a printk, and found that pagevec_add frequently returns 0. Is that ok ?

What happens to 'page' in this case ?

Dave
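
For what it is worth, pagevec_add() in kernels of this era appears to store the page unconditionally and then return the number of free slots that remain, so a return value of 0 would mean "the pagevec is now full", not "the page was dropped". The page stays batched until the next caller finds no space and drains the vector. Below is a simplified userspace model of that behaviour. The struct layouts, pagevec_drain() and lru_cache_add_model() are stand-ins invented for illustration, and PAGEVEC_SIZE 14 is assumed, not taken from this thread.

```c
#include <assert.h>

#define PAGEVEC_SIZE 14   /* assumed batch size; not stated in the thread */

struct page {
    int refcount;
    int on_lru;
};

struct pagevec {
    unsigned nr;
    struct page *pages[PAGEVEC_SIZE];
};

static unsigned pagevec_space(const struct pagevec *pvec)
{
    return PAGEVEC_SIZE - pvec->nr;
}

/* Stores the page first, then reports how many slots are left. */
static unsigned pagevec_add(struct pagevec *pvec, struct page *page)
{
    assert(pvec->nr < PAGEVEC_SIZE);   /* the caller must have made room */
    pvec->pages[pvec->nr++] = page;
    return pagevec_space(pvec);
}

/* Stand-in for __pagevec_lru_add(): move every batched page to the LRU. */
static void pagevec_drain(struct pagevec *pvec)
{
    for (unsigned i = 0; i < pvec->nr; i++)
        pvec->pages[i]->on_lru = 1;
    pvec->nr = 0;
}

/* Same shape as __lru_cache_add() quoted above, minus per-CPU handling. */
static unsigned lru_cache_add_model(struct pagevec *pvec, struct page *page)
{
    page->refcount++;                  /* models page_cache_get() */
    if (!pagevec_space(pvec))
        pagevec_drain(pvec);
    return pagevec_add(pvec, page);
}
```

In this model the 14th consecutive add returns 0 with the page sitting in the last slot, and the 15th add drains the first 14 pages before storing its own.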



unused swap offset / bad page map.

2013-08-06 Thread Dave Jones
Seen while fuzzing with lots of child processes.

swap_free: Unused swap offset entry 001263f5
BUG: Bad page map in process trinity-child29  pte:24c7ea00 pmd:09fec067
addr:7f9db958d000 vm_flags:00100073 anon_vma:88022c004ba0 mapping:  
(null) index:f99
Modules linked in: fuse ipt_ULOG snd_seq_dummy tun sctp scsi_transport_iscsi 
can_raw can_bcm rfcomm bnep nfnetlink hidp appletalk bluetooth rose can 
af_802154 phonet x25 af_rxrpc llc2 nfc rfkill af_key pppoe rds pppox 
ppp_generic slhc caif_socket caif irda crc_ccitt atm netrom ax25 ipx p8023 
psnap p8022 llc snd_hda_codec_realtek pcspkr usb_debug snd_seq snd_seq_device 
snd_hda_intel snd_hda_codec snd_hwdep e1000e snd_pcm ptp pps_core 
snd_page_alloc snd_timer snd soundcore xfs libcrc32c
CPU: 1 PID: 2624 Comm: trinity-child29 Not tainted 3.11.0-rc4+ #1
  8801fd7ddc90 81700f2c 7f9db958d000
 8801fd7ddcd8 8117cba7 24c7ea00 0f99
 7f9db960 880009fecc68 24c7ea00 8801fd7dde00
Call Trace:
 [81700f2c] dump_stack+0x4e/0x82
 [8117cba7] print_bad_pte+0x187/0x220
 [8117e415] unmap_single_vma+0x535/0x890
 [8117f719] unmap_vmas+0x49/0x90
 [81187ef1] exit_mmap+0xc1/0x170
 [810510ef] mmput+0x6f/0x100
 [81055818] do_exit+0x288/0xcd0
 [810c1da5] ? trace_hardirqs_on_caller+0x115/0x1e0
 [810c1e7d] ? trace_hardirqs_on+0xd/0x10
 [810575dc] do_group_exit+0x4c/0xc0
 [81057664] SyS_exit_group+0x14/0x20
 [81713dd4] tracesys+0xdd/0xe2

There were a slew of these. same trace, different addr/anon_vma/index.
mapping always null.

Dave
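
A side note on reading the two numbers in the report: the pte value and the swap entry appear to describe the same slot. Assuming the x86_64 swap PTE layout of 3.11-era kernels (swap type in bits 1..5, swap offset from bit 9 upward; both are assumptions here and not taken from this thread), shifting pte:24c7ea00 right by 9 gives 1263f5, the value printed by swap_free. The helper and macro names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 swap PTE layout for 3.11-era kernels (not from the thread). */
#define PTE_PRESENT_BIT       0
#define PTE_SWP_TYPE_SHIFT    1   /* type field starts just above the present bit */
#define PTE_SWP_TYPE_BITS     5
#define PTE_SWP_OFFSET_SHIFT  9   /* offset field starts here */

/* A PTE only encodes a swap entry when its present bit is clear. */
static int pte_is_nonpresent(uint64_t pte)
{
    return !(pte & (1ull << PTE_PRESENT_BIT));
}

static unsigned swap_pte_type(uint64_t pte)
{
    assert(pte_is_nonpresent(pte));
    return (unsigned)((pte >> PTE_SWP_TYPE_SHIFT) &
                      ((1u << PTE_SWP_TYPE_BITS) - 1));
}

static uint64_t swap_pte_offset(uint64_t pte)
{
    assert(pte_is_nonpresent(pte));
    return pte >> PTE_SWP_OFFSET_SHIFT;
}
```

Under those assumptions the logged PTE decodes to swap type 0 and offset 0x1263f5, which would mean the PTE still names a swap slot whose swap_map count had already dropped to zero.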

