Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu 02-08-12 14:33:10, Mel Gorman wrote:
> On Thu, Aug 02, 2012 at 02:36:58PM +0200, Michal Hocko wrote:
> > On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> > > On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> > [...]
> > > > On the other hand, mine is more coupled with the sharing code so it
> > > > makes the code easier to follow and also makes the sharing more
> > > > effective because racing processes see pmd populated when checking for
> > > > shareable mappings.
> > >
> > > It could do with a small comment above huge_pmd_share() explaining that
> > > calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
> > > parallel faults missing a sharing opportunity with each other but it's
> > > not mandatory.
> >
> > Sure, that's a good idea. What about the following:
> >
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index 40b2500..51839d1 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> >  }
> >
> >  /*
> > - * search for a shareable pmd page for hugetlb.
> > + * search for a shareable pmd page for hugetlb. In any case calls
> > + * pmd_alloc and returns the corresponding pte. While this not necessary
> > + * for the !shared pmd case because we can allocate the pmd later as
> > + * well it makes the code much cleaner. pmd allocation is essential for
> > + * the shared case though because pud has to be populated inside the
> > + * same i_mmap_mutex section otherwise racing tasks could either miss
> > + * the sharing (see huge_pte_offset) or selected a bad pmd for sharing.
> >  */
> >  static pte_t*
> >  huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >
>
> Looks reasonable to me.

OK, added to the patch. I will send it to Andrew now. Thanks a lot!
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu, Aug 02, 2012 at 02:36:58PM +0200, Michal Hocko wrote:
> On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> > On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> [...]
> > > On the other hand, mine is more coupled with the sharing code so it
> > > makes the code easier to follow and also makes the sharing more
> > > effective because racing processes see pmd populated when checking for
> > > shareable mappings.
> >
> > It could do with a small comment above huge_pmd_share() explaining that
> > calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
> > parallel faults missing a sharing opportunity with each other but it's
> > not mandatory.
>
> Sure, that's a good idea. What about the following:
>
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 40b2500..51839d1 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  }
>
>  /*
> - * search for a shareable pmd page for hugetlb.
> + * search for a shareable pmd page for hugetlb. In any case calls
> + * pmd_alloc and returns the corresponding pte. While this not necessary
> + * for the !shared pmd case because we can allocate the pmd later as
> + * well it makes the code much cleaner. pmd allocation is essential for
> + * the shared case though because pud has to be populated inside the
> + * same i_mmap_mutex section otherwise racing tasks could either miss
> + * the sharing (see huge_pte_offset) or selected a bad pmd for sharing.
>  */
>  static pte_t*
>  huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>

Looks reasonable to me.

--
Mel Gorman
SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> [...]
> > On the other hand, mine is more coupled with the sharing code so it
> > makes the code easier to follow and also makes the sharing more
> > effective because racing processes see pmd populated when checking for
> > shareable mappings.
>
> It could do with a small comment above huge_pmd_share() explaining that
> calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
> parallel faults missing a sharing opportunity with each other but it's
> not mandatory.

Sure, that's a good idea. What about the following:

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 40b2500..51839d1 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 }

 /*
- * search for a shareable pmd page for hugetlb.
+ * search for a shareable pmd page for hugetlb. In any case calls
+ * pmd_alloc and returns the corresponding pte. While this not necessary
+ * for the !shared pmd case because we can allocate the pmd later as
+ * well it makes the code much cleaner. pmd allocation is essential for
+ * the shared case though because pud has to be populated inside the
+ * same i_mmap_mutex section otherwise racing tasks could either miss
+ * the sharing (see huge_pte_offset) or selected a bad pmd for sharing.
  */
 static pte_t*
 huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)

> > So I am more inclined to mine but I don't want to push it because both
> > are good and make sense. What other people think?
>
> I vote yours
>
> Reviewed-by: Mel Gorman mgor...@suse.de

Thanks!

--
Michal Hocko
SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> Hi Larry,
>
> On Wed 01-08-12 11:06:33, Larry Woodman wrote:
> > On 08/01/2012 08:32 AM, Michal Hocko wrote:
> > >
> > > I am really lame :/. The previous patch is wrong as well for the goto out
> > > branch. The updated patch is as follows:
> > This patch worked fine Michal!
>
> Thanks for the good news!
>
> > You and Mel can duke it out over whose is best. :)
>
> The answer is clear here ;) Mel did the hard work of identifying the
> culprit so kudos go to him.

I'm happy once it's fixed!

> I just tried to solve the issue more inside the x86 arch code. The pmd
> allocation outside of the sharing code seemed strange to me for quite
> some time; I just underestimated its consequences completely.
>
> Both approaches have some pros. Mel's patch is more resistant to other
> not-yet-discovered races and it also makes the arch independent code
> more robust because relying on the pmd trick is not ideal.

If there is another race then it is best to hear about it, understand it
and fix the underlying problem. More importantly, your patch ensures
that two processes faulting at the same time will share page tables with
each other. My patch only noted that this missed opportunity could cause
problems with fork.

> On the other hand, mine is more coupled with the sharing code so it
> makes the code easier to follow and also makes the sharing more
> effective because racing processes see pmd populated when checking for
> shareable mappings.

It could do with a small comment above huge_pmd_share() explaining that
calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
parallel faults missing a sharing opportunity with each other but it's
not mandatory.

> So I am more inclined to mine but I don't want to push it because both
> are good and make sense. What other people think?

I vote yours

Reviewed-by: Mel Gorman mgor...@suse.de

--
Mel Gorman
SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
Hi Larry,

On Wed 01-08-12 11:06:33, Larry Woodman wrote:
> On 08/01/2012 08:32 AM, Michal Hocko wrote:
> >
> > I am really lame :/. The previous patch is wrong as well for the goto out
> > branch. The updated patch is as follows:
> This patch worked fine Michal!

Thanks for the good news!

> You and Mel can duke it out over whose is best. :)

The answer is clear here ;) Mel did the hard work of identifying the
culprit so kudos go to him. I just tried to solve the issue more inside
the x86 arch code. The pmd allocation outside of the sharing code seemed
strange to me for quite some time; I just underestimated its
consequences completely.

Both approaches have some pros. Mel's patch is more resistant to other
not-yet-discovered races and it also makes the arch independent code
more robust because relying on the pmd trick is not ideal.

On the other hand, mine is more coupled with the sharing code so it
makes the code easier to follow and also makes the sharing more
effective because racing processes see pmd populated when checking for
shareable mappings.

So I am more inclined to mine but I don't want to push it because both
are good and make sense. What other people think?

> Larry

--
Michal Hocko
SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 08/01/2012 08:32 AM, Michal Hocko wrote:
>
> I am really lame :/. The previous patch is wrong as well for the goto out
> branch. The updated patch is as follows:

This patch worked fine Michal!  You and Mel can duke it out over whose
is best. :)

Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Wed 01-08-12 10:20:36, Michal Hocko wrote:
> On Tue 31-07-12 22:45:43, Larry Woodman wrote:
> > On 07/31/2012 04:06 PM, Michal Hocko wrote:
> > > On Tue 31-07-12 13:49:21, Larry Woodman wrote:
> > > > On 07/31/2012 08:46 AM, Mel Gorman wrote:
> > > > > Fundamentally I think the problem is that we are not correctly
> > > > > detecting that page table sharing took place during huge_pte_alloc().
> > > > > This patch is longer and makes an API change but if I'm right, it
> > > > > addresses the underlying problem. The first VM_MAYSHARE patch is
> > > > > still necessary but would you mind testing this on top please?
> > > > Hi Mel, yes this does work just fine. It ran for hours without a panic
> > > > so I'll Ack this one if you send it to the list.
> > > Hi Larry, thanks for testing! I have a different patch which tries to
> > > address this very same issue. I am not saying it is better or that it
> > > should be merged instead of Mel's one but I would be really happy if
> > > you could give it a try. We can discuss (dis)advantages of both
> > > approaches later.
> > >
> > > Thanks!
> >
> > Hi Michal, the system hung when I tested this patch on top of the
> > latest 3.5 kernel. I won't have AltSysrq access to the system until
> > tomorrow AM.
>
> Please hold on. The patch is crap. I forgot about
> 	if (!vma_shareable(vma, addr))
> 		return;
> case so somebody got an uninitialized pmd. The patch below handles
> that.

I am really lame :/. The previous patch is wrong as well for the goto out
branch. The updated patch is as follows:
---
From 886b79204491b500437e156aa8eb35e776a4bb07 Mon Sep 17 00:00:00 2001
From: Michal Hocko mho...@suse.cz
Date: Tue, 31 Jul 2012 15:00:26 +0200
Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd

Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straight-forward but hugetlbfs page table sharing is different. The
page table pages at the PMD level are reference counted while the
mapcount remains the same.

If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:

[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002
[ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0
[ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00
[ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003
[ 1106.253876] R10: 000d R11: R12: 880428708150
[ 1106.261826] R13: 880428708150 R14: R15: ea0006b8
[ 1106.269780] FS: () GS:88042fd6() knlGS:
[ 1106.278794] CS: 0010 DS: ES: CR0: 80050033
[ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0
[ 1106.293149] DR0: DR1: DR2:
[ 1106.301097] DR3: DR6: 0ff0 DR7: 0400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690]  ea0006b8 880428973bc8 8112d040
[ 1106.328958]  880428973bc8 02ab 02a0 880428973c18
[ 1106.337234]  880428973cc8 8125b405 88040001
[ 1106.345513] Call Trace:
[ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80
[ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0
[ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615] [811ab1af] evict+0x9f/0x1b0
[ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0
[ 1106.379773] [811ab4de] iput+0x3e/0x50
[ 1106.384922] [811a8e18] d_kill+0xf8/0x110
[ 1106.390356] [811a8f12] dput+0xe2/0x1b0
[ 1106.395595] [81193612] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is
that if the PMD page is the same, they must be shared. This assumes
that the sharing is between the parent and child. However, if the
sharing is with a different process entirely then this check fails as
in this diagram:

	parent
	   |
	   ---->pmd
	src_pte--------> data page
	                   ^
	other---->pmd------|
	           ^
	child------|
	dst_pte

For this situation to occur, it must be possible for Parent and Other
to have faulted and failed to share page tables with each other. This
is possible due to the following style of race.

PROC A
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 22:45:43, Larry Woodman wrote:
> On 07/31/2012 04:06 PM, Michal Hocko wrote:
> > On Tue 31-07-12 13:49:21, Larry Woodman wrote:
> > > On 07/31/2012 08:46 AM, Mel Gorman wrote:
> > > > Fundamentally I think the problem is that we are not correctly
> > > > detecting that page table sharing took place during huge_pte_alloc().
> > > > This patch is longer and makes an API change but if I'm right, it
> > > > addresses the underlying problem. The first VM_MAYSHARE patch is
> > > > still necessary but would you mind testing this on top please?
> > > Hi Mel, yes this does work just fine. It ran for hours without a panic
> > > so I'll Ack this one if you send it to the list.
> > Hi Larry, thanks for testing! I have a different patch which tries to
> > address this very same issue. I am not saying it is better or that it
> > should be merged instead of Mel's one but I would be really happy if you
> > could give it a try. We can discuss (dis)advantages of both approaches
> > later.
> >
> > Thanks!
>
> Hi Michal, the system hung when I tested this patch on top of the
> latest 3.5 kernel. I won't have AltSysrq access to the system until
> tomorrow AM.

Please hold on. The patch is crap. I forgot about
	if (!vma_shareable(vma, addr))
		return;
case so somebody got an uninitialized pmd. The patch below handles
that.

Sorry about that and thanks to Mel for noticing this.
---
From 14d1cfcb7e19369653e2367c38204b1012a398f9 Mon Sep 17 00:00:00 2001
From: Michal Hocko mho...@suse.cz
Date: Tue, 31 Jul 2012 15:00:26 +0200
Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd

Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straight-forward but hugetlbfs page table sharing is different. The
page table pages at the PMD level are reference counted while the
mapcount remains the same.

If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:

[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002
[ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0
[ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00
[ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003
[ 1106.253876] R10: 000d R11: R12: 880428708150
[ 1106.261826] R13: 880428708150 R14: R15: ea0006b8
[ 1106.269780] FS: () GS:88042fd6() knlGS:
[ 1106.278794] CS: 0010 DS: ES: CR0: 80050033
[ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0
[ 1106.293149] DR0: DR1: DR2:
[ 1106.301097] DR3: DR6: 0ff0 DR7: 0400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690]  ea0006b8 880428973bc8 8112d040
[ 1106.328958]  880428973bc8 02ab 02a0 880428973c18
[ 1106.337234]  880428973cc8 8125b405 88040001
[ 1106.345513] Call Trace:
[ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80
[ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0
[ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615] [811ab1af] evict+0x9f/0x1b0
[ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0
[ 1106.379773] [811ab4de] iput+0x3e/0x50
[ 1106.384922] [811a8e18] d_kill+0xf8/0x110
[ 1106.390356] [811a8f12] dput+0xe2/0x1b0
[ 1106.395595] [81193612] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is
that if the PMD page is the same, they must be shared. This assumes
that the sharing is between the parent and child. However, if the
sharing is with a different process entirely then this check fails as
in this diagram:

	parent
	   |
	   ---->pmd
	src_pte--------> data page
	                   ^
	other---->pmd------|
	           ^
	child------|
	dst_pte

For this situation to occur, it must be possible for Parent and Other
to have faulted and failed to share page tables with each other. This
is possible due to the following style of race.

PROC A				PROC B
copy_hugetlb_page_range		copy_hugetlb_page_range
src_pte == huge_pte_offset
				src_pte
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 22:45:43, Larry Woodman wrote: On 07/31/2012 04:06 PM, Michal Hocko wrote: On Tue 31-07-12 13:49:21, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! Hi Michal, the system hung when I tested this patch on top of the latest 3.5 kernel. I wont have AltSysrq access to the system until tomorrow AM. Please hold on. The patch is crap. I forgot about if (!vma_shareable(vma, addr)) return; case so somebody got an uninitialized pmd. The patch bellow handles that. Sorry about that and thanks to Mel to notice this. --- From 14d1cfcb7e19369653e2367c38204b1012a398f9 Mon Sep 17 00:00:00 2001 From: Michal Hocko mho...@suse.cz Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! 
[ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram. 
parent
  |
  |--------> pmd
src_pte -------------> data page
                           ^
other -------> pmd --------|
                ^
child ----------|
dst_pte

For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race.

PROC A                              PROC B
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Wed 01-08-12 10:20:36, Michal Hocko wrote: [...] Please hold on. The patch is crap. I forgot about the if (!vma_shareable(vma, addr)) return; case so somebody got an uninitialized pmd. The patch below handles that. I am really lame :/. The previous patch is wrong as well for the goto out branch. The updated patch follows: --- From 886b79204491b500437e156aa8eb35e776a4bb07 Mon Sep 17 00:00:00 2001 From: Michal Hocko mho...@suse.cz Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd [... changelog and oops trace identical to the previous version, trimmed ...]
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 08/01/2012 08:32 AM, Michal Hocko wrote: I am really lame :/. The previous patch is wrong as well for the goto out branch. The updated patch follows: This patch worked fine Michal! You and Mel can duke it out over whose is best. :) Larry -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 04:06 PM, Michal Hocko wrote: [...] Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! Hi Michal, the system hung when I tested this patch on top of the latest 3.5 kernel. I won't have AltSysrq access to the system until tomorrow AM. I'll retry this kernel and get AltSysrq output and let you know what's happening in the morning. Larry --- From 8cbf3bd27125fc0a2a46cd5b1085d9e63f9c01fd Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd [... changelog and oops trace identical to the version quoted earlier, trimmed ...]
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 04:06 PM, Michal Hocko wrote: On Tue 31-07-12 13:49:21, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! Sure, it will take me a day since I keep losing the hardware to reproduce the problem with. I'll report back tomorrow. Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 13:49:21, Larry Woodman wrote: > On 07/31/2012 08:46 AM, Mel Gorman wrote: > > > >Fundamentally I think the problem is that we are not correctly detecting > >that page table sharing took place during huge_pte_alloc(). This patch is > >longer and makes an API change but if I'm right, it addresses the underlying > >problem. The first VM_MAYSHARE patch is still necessary but would you mind > >testing this on top please? > Hi Mel, yes this does work just fine. It ran for hours without a panic so > I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! --- >From 8cbf3bd27125fc0a2a46cd5b1085d9e63f9c01fd Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd [... changelog, oops trace, and race diagram identical to the version quoted earlier, trimmed ...]
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Larry ---8<--- mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a process's address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman: [... oops trace identical to the one quoted earlier, trimmed ...] During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is: if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram.

parent
  |
  |--------> pmd
src_pte -------------> data page
                           ^
other -------> pmd --------|
                ^
child ----------|
dst_pte

For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race.
PROC A                              PROC B
copy_hugetlb_page_range             copy_hugetlb_page_range
  src_pte == huge_pte_offset          src_pte == huge_pte_offset
  !src_pte so no sharing              !src_pte so no sharing

(time passes)

hugetlb_fault                       hugetlb_fault
  huge_pte_alloc                      huge_pte_alloc
    huge_pmd_share                      huge_pmd_share
      LOCK(i_mmap_mutex)
      find nothing, no sharing
      UNLOCK(i_mmap_mutex)
                                          LOCK(i_mmap_mutex)
                                          find nothing, no sharing
                                          UNLOCK(i_mmap_mutex)
    pmd_alloc                           pmd_alloc
    LOCK(instantiation_mutex)
    fault
    UNLOCK(instantiation_mutex)
                                        LOCK(instantiation_mutex)
                                        fault
                                        UNLOCK(instantiation_mutex)

These two processes are not going to
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a process's address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel -- All rights reversed
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, Jul 31, 2012 at 09:07:14AM -0400, Larry Woodman wrote: > On 07/31/2012 08:46 AM, Mel Gorman wrote: > >On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote: > >>> > >>>That is a surprise. Can you try your test case on 3.4 and tell us if the > >>>patch fixes the problem there? I would like to rule out the possibility > >>>that the locking rules are slightly different in RHEL. If it hits on 3.4 > >>>then it's also possible you are seeing a different bug, more on this later. > >>> > >>Sorry for the delay Mel, here is the BUG() traceback from the 3.4 > >>kernel with your > >>patches: > >> > >>[... oops trace identical to the one quoted earlier, trimmed ...] > >Thanks, looks very similar. > > > >>[ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW > >>3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 > >You say this was a 3.4 kernel but the message says 3.3. Probably not > >relevant, just interesting. > > > Oh, sorry I posted the wrong traceback. I tested both 3.3 & 3.4 and > had the same results. > I'll do it again and post the 3.4 traceback for you, It'll probably be the same. The likelihood is that the bug is really old and did not change between 3.3 and 3.4. I mentioned it in case you accidentally tested with an old kernel that was not patched or patched with something different. I considered this to be very unlikely though and you already said that RHEL was affected so it's probably the same bug seen in all three. -- Mel Gorman SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 13:46:50, Mel Gorman wrote: [...] > mm: hugetlbfs: Correctly detect if page tables have just been shared > > [... changelog, oops trace, and race diagram identical to the version quoted earlier, trimmed ...] Makes sense. I was wondering how the child could share with somebody else and pmd would be different from the parent, because the parent is blocked by mmap_sem (for write) so no concurrent faults are allowed. Other process should see the parent's pmd if it is shareable. I obviously
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote: That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel with your patches: [... oops trace identical to the one quoted earlier, trimmed ...] Thanks, looks very similar. [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 You say this was a 3.4 kernel but the message says 3.3. Probably not relevant, just interesting. Oh, sorry I posted the wrong traceback. I tested both 3.3 & 3.4 and had the same results. I'll do it again and post the 3.4 traceback for you, Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote: > > > >That is a surprise. Can you try your test case on 3.4 and tell us if the > >patch fixes the problem there? I would like to rule out the possibility > >that the locking rules are slightly different in RHEL. If it hits on 3.4 > >then it's also possible you are seeing a different bug, more on this later. > > > > Sorry for the delay Mel, here is the BUG() traceback from the 3.4 > kernel with your > patches: > > > [ 1106.156569] [ cut here ] > [ 1106.161731] kernel BUG at mm/filemap.c:135! > [ 1106.166395] invalid opcode: [#1] SMP > [ 1106.170975] CPU 22 > [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc > dcdbas microcode pcspkr acpi_pad acpi] > [ 1106.201770] Thanks, looks very similar. > [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW > 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 You say this was a 3.4 kernel but the message says 3.3. Probably not relevant, just interesting. > I'll see if I can distribute the program that causes the panic, I > don't have source, only binary. > > Larry > > > BTW, the only way I've been able to get the panic to stop is: > > -- > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index c36febb..cc023b8 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct > *dst, struct mm_struct *src, > goto nomem; > > /* If the pagetables are shared don't copy or take > references */ > - if (dst_pte == src_pte) > + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) > continue; > spin_lock(>page_table_lock); I think this is papering over the problem. Basically this check works because after page table sharing, the parent and child are pointing to the same data page even if they are not sharing page tables. As it's during fork(), we cannot have faulted in parallel so the populated PTE must be due to page table sharing. If the parent has not faulted the page, then sharing is not attempted and again the problem is avoided. It would be a new instance though of hugetlbfs just happening to work because of its limitations - in this case, it works because we only share page tables for MAP_SHARED. Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? ---8<--- mm: hugetlbfs: Correctly detect if page tables have just been shared [... changelog and oops trace identical to the version quoted earlier, trimmed ...]
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, Jul 31, 2012 at 3:11 AM, Larry Woodman wrote: > [ 1106.156569] [ cut here ] > [ 1106.161731] kernel BUG at mm/filemap.c:135! > [... register dump identical to the one quoted earlier, trimmed ...] > [ 1106.345513] Call Trace: > [ 1106.348235] [] delete_from_page_cache+0x40/0x80 > [ 1106.355128] [] truncate_hugepages+0x115/0x1f0 > [ 1106.361826] [] hugetlbfs_evict_inode+0x18/0x30 > [ 1106.368615] [] evict+0x9f/0x1b0 > [ 1106.373951] [] iput_final+0xe3/0x1e0 > [ 1106.379773] [] iput+0x3e/0x50 > [ 1106.384922] [] d_kill+0xf8/0x110 > [ 1106.390356] [] dput+0xe2/0x1b0 > [ 1106.395595] [] __fput+0x162/0x240 > [ 1106.401124] [] fput+0x25/0x30 > [ 1106.406265] [] filp_close+0x63/0x90 > [ 1106.411997] [] put_files_struct+0x7f/0xf0 > [ 1106.418302] [] exit_files+0x4c/0x60 > [ 1106.424025] [] do_exit+0x1a7/0x470 > [ 1106.429652] [] do_group_exit+0x55/0xd0 > [ 1106.435665] [] sys_exit_group+0x17/0x20 > [ 1106.441777] [] system_call_fastpath+0x16/0x1b Perhaps we have to remove rmap when evicting inode. --- a/fs/hugetlbfs/inode.c Tue Jul 31 19:59:32 2012 +++ b/fs/hugetlbfs/inode.c Tue Jul 31 20:04:14 2012 @@ -390,9 +390,11 @@ static void truncate_hugepages(struct in hugetlb_unreserve_pages(inode, start, freed); } +static int hugetlb_vmtruncate(struct inode *, loff_t); + static void hugetlbfs_evict_inode(struct inode *inode) { - truncate_hugepages(inode, 0); + hugetlb_vmtruncate(inode, 0); clear_inode(inode); } --
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote:

SNIP

That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later.

Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel with your patches:

[ 1106.156569] [ cut here ]
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]

Thanks, looks very similar.

[ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2

You say this was a 3.4 kernel but the message says 3.3. Probably not relevant, just interesting.

I'll see if I can distribute the program that causes the panic, I don't have source, only binary.

Larry

BTW, the only way I've been able to get the panic to stop is:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c36febb..cc023b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			goto nomem;
 		/* If the pagetables are shared don't copy or take references */
-		if (dst_pte == src_pte)
+		if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte)
 			continue;
 		spin_lock(&dst->page_table_lock);

I think this is papering over the problem. Basically this check works because after page table sharing, the parent and child are pointing to the same data page even if they are not sharing page tables. As it's during fork(), we cannot have faulted in parallel so the populated PTE must be due to page table sharing. If the parent has not faulted the page, then sharing is not attempted and again the problem is avoided.
It would be a new instance though of hugetlbfs just happening to work because of its limitations - in this case, it works because we only share page tables for MAP_SHARED. Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? ---8--- mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. 
PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote: SNIP That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel with your patches: [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] Thanks, looks very similar. [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 You say this was a 3.4 kernel but the message says 3.3. Probably not relevant, just interesting.

Oh, sorry I posted the wrong traceback. I tested both 3.3 and 3.4 and had the same results. I'll do it again and post the 3.4 traceback for you,

Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 13:46:50, Mel Gorman wrote: [...] mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] 
iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram.

    parent ----> pmd
                   src_pte --------> data page
                                         ^
    other  ----> other-pmd --------------+
                     ^
    child -----------+
                   dst_pte

For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race.

    PROC A                                PROC B
    copy_hugetlb_page_range               copy_hugetlb_page_range
      src_pte == huge_pte_offset            src_pte == huge_pte_offset
      !src_pte so no sharing                !src_pte so no sharing

    (time passes)

    hugetlb_fault                         hugetlb_fault
      huge_pte_alloc                        huge_pte_alloc
        huge_pmd_share                        huge_pmd_share
          LOCK(i_mmap_mutex)
          find nothing, no sharing
          UNLOCK(i_mmap_mutex)
                                              LOCK(i_mmap_mutex)
                                              find nothing, no sharing
                                              UNLOCK(i_mmap_mutex)
        pmd_alloc                           pmd_alloc
      LOCK(instantiation_mutex)
      fault
      UNLOCK(instantiation_mutex)
                                          LOCK(instantiation_mutex)
                                          fault
                                          UNLOCK(instantiation_mutex)

Makes sense. I was wondering how the child could share with somebody else and pmd would be different from the parent because parent is blocked by mmap_sem (for write) so no concurrent faults are allowed. Other process
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, Jul 31, 2012 at 09:07:14AM -0400, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote: SNIP That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel with your patches: [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] Thanks, looks very similar. [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 You say this was a 3.4 kernel but the message says 3.3. Probably not relevant, just interesting. Oh, sorry I posted the wrong traceback. I tested both 3.3 and 3.4 and had the same results. I'll do it again and post the 3.4 traceback for you,

It'll probably be the same. The likelihood is that the bug is really old and did not change between 3.3 and 3.4. I mentioned it in case you accidentally tested with an old kernel that was not patched or patched with something different. I considered this to be very unlikely though and you already said that RHEL was affected so it's probably the same bug seen in all three.

-- Mel Gorman SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman Signed-off-by: Mel Gorman mgor...@suse.de

Reviewed-by: Rik van Riel r...@redhat.com

-- All rights reversed
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Larry ---8--- mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. 
PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram. parent | pmd src_pte-- data page ^ other-pmd| ^ child---| dst_pte For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race. 
    PROC A                                PROC B
    copy_hugetlb_page_range               copy_hugetlb_page_range
      src_pte == huge_pte_offset            src_pte == huge_pte_offset
      !src_pte so no sharing                !src_pte so no sharing

    (time passes)

    hugetlb_fault                         hugetlb_fault
      huge_pte_alloc                        huge_pte_alloc
        huge_pmd_share                        huge_pmd_share
          LOCK(i_mmap_mutex)
          find nothing, no sharing
          UNLOCK(i_mmap_mutex)
                                              LOCK(i_mmap_mutex)
                                              find nothing, no sharing
                                              UNLOCK(i_mmap_mutex)
        pmd_alloc                           pmd_alloc
      LOCK(instantiation_mutex)
      fault
      UNLOCK(instantiation_mutex)
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 13:49:21, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! --- From 8cbf3bd27125fc0a2a46cd5b1085d9e63f9c01fd Mon Sep 17 00:00:00 2001 From: Michal Hocko mho...@suse.cz Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. 
PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram. parent | pmd src_pte-- data page ^ other-pmd| ^ child---| dst_pte For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race. 
    PROC A                                PROC B
    copy_hugetlb_page_range               copy_hugetlb_page_range
      src_pte == huge_pte_offset            src_pte == huge_pte_offset
      !src_pte so no sharing                !src_pte so no sharing

    (time passes)

    hugetlb_fault                         hugetlb_fault
      huge_pte_alloc                        huge_pte_alloc
        huge_pmd_share                        huge_pmd_share
          LOCK(i_mmap_mutex)
          find nothing, no
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 04:06 PM, Michal Hocko wrote: On Tue 31-07-12 13:49:21, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks!

Sure, it will take me a day since I keep losing the hardware to reproduce the problem with. I'll report back tomorrow.

Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 04:06 PM, Michal Hocko wrote: On Tue 31-07-12 13:49:21, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! Hi Michal, the system hung when I tested this patch on top of the latest 3.5 kernel. I wont have AltSysrq access to the system until tomorrow AM. I'll retry this kernel and get AltSysrq output and let you know whats happening in the morning. Larry --- From 8cbf3bd27125fc0a2a46cd5b1085d9e63f9c01fd Mon Sep 17 00:00:00 2001 From: Michal Hockomho...@suse.cz Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! 
[ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram. 
    parent ----> pmd
                   src_pte --------> data page
                                         ^
    other  ----> other-pmd --------------+
                     ^
    child -----------+
                   dst_pte

For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race.

    PROC A                                PROC B
    copy_hugetlb_page_range               copy_hugetlb_page_range
      src_pte == huge_pte_offset            src_pte == huge_pte_offset
      !src_pte so no sharing
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/27/2012 06:23 AM, Mel Gorman wrote: On Thu, Jul 26, 2012 at 11:48:56PM -0400, Larry Woodman wrote: On 07/26/2012 02:37 PM, Rik van Riel wrote: On 07/23/2012 12:04 AM, Hugh Dickins wrote: I spent hours trying to dream up a better patch, trying various approaches. I think I have a nice one now, what do you think? And more importantly, does it work? I have not tried to test it at all, that I'm hoping to leave to you, I'm sure you'll attack it with gusto! If you like it, please take it over and add your comments and signoff and send it in. The second part won't come up in your testing, and could be made a separate patch if you prefer: it's a related point that struck me while I was playing with a different approach. I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but that too would be unfair. Subject-to-your-testing- Signed-off-by: Hugh Dickins hu...@google.com This patch looks good to me. Larry, does Hugh's patch survive your testing? Like I said earlier, no. That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel with your patches: [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. 
PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240 [ 1106.401124] [81193715] fput+0x25/0x30 [ 1106.406265] [8118f6c3] filp_close+0x63/0x90 [ 1106.411997] [8106058f] put_files_struct+0x7f/0xf0 [ 1106.418302] [8106064c] exit_files+0x4c/0x60 [ 1106.424025] [810629d7] do_exit+0x1a7/0x470 [ 1106.429652] [81062cf5] do_group_exit+0x55/0xd0 [ 1106.435665] [81062d87] sys_exit_group+0x17/0x20 [ 1106.441777] [815d0229] system_call_fastpath+0x16/0x1b [ 1106.448474] Code: 66 0f 1f 44 00 00 48 8b 47 08 48 8b 00 48 8b 40 28 44 8b 80 38 03 00 00 45 85 c0 0f [ 1106.470022] RIP [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.477693] RSP 880428973b88 -- I'll see if I can distribute the program 
that causes the panic, I don't have source, only binary. Larry BTW, the only way I've been able to get the panic to stop is: -- diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c36febb..cc023b8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, goto nomem; /* If the pagetables are shared don't copy or take references */ - if (dst_pte == src_pte) + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) continue; spin_lock(&dst->page_table_lock);
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/27/2012 06:23 AM, Mel Gorman wrote: On Thu, Jul 26, 2012 at 11:48:56PM -0400, Larry Woodman wrote: On 07/26/2012 02:37 PM, Rik van Riel wrote: On 07/23/2012 12:04 AM, Hugh Dickins wrote: I spent hours trying to dream up a better patch, trying various approaches. I think I have a nice one now, what do you think? And more importantly, does it work? I have not tried to test it at all, that I'm hoping to leave to you, I'm sure you'll attack it with gusto! If you like it, please take it over and add your comments and signoff and send it in. The second part won't come up in your testing, and could be made a separate patch if you prefer: it's a related point that struck me while I was playing with a different approach. I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but that too would be unfair. Subject-to-your-testing- Signed-off-by: Hugh Dickins This patch looks good to me. Larry, does Hugh's patch survive your testing? Like I said earlier, no. That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. Sure, it will take me a little while because the machine is shared between several users. However, I finally set up a reproducer that only takes a few seconds on a large system and this totally fixes the problem: The other possibility is that your reproducer case is triggering a different race to mine. Would it be possible to post? Let me ask, I only have the binary and don't know if it's OK to distribute so I don't know exactly what is going on. I did some tracing and saw forking, group exits, multi-threading, hugetlbfs file creation, mmap'ng munmap'ng & deleting the hugetlbfs files. 
- diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c36febb..cc023b8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, goto nomem; /* If the pagetables are shared don't copy or take references */ - if (dst_pte == src_pte) + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) continue; spin_lock(&dst->page_table_lock); --- When we compare what the src_pte & dst_pte point to instead of their addresses everything works, The dst_pte and src_pte are pointing to the PMD page though which is what we're meant to be checking. Your patch appears to change that to check if they are sharing data which is quite different. This is functionally similar to if you just checked VM_MAYSHARE at the start of the function and bailed if so. The PTEs would be populated at fault time instead. I suspect there is a missing memory barrier somewhere ??? Possibly but hard to tell whether it's barriers that are the real problem during fork. The copy routine is suspicious. On the barrier side - in normal PTE alloc routines there is a write barrier which is documented in __pte_alloc. If hugepage table sharing is successful, there is no similar barrier in huge_pmd_share before the PUD is populated. By rights, there should be a smp_wmb() before the page table spinlock is taken in huge_pmd_share(). The lack of a write barrier leads to possible snarls between fork() and fault. Take three processes, parent, child and other. Parent is forking to create child. Other is calling fault. Other faults hugetlb_fault()->huge_pte_alloc->allocate a PMD (write barrier) It is about to enter hugetlb_no_fault() Parent forks() runs at the same time Child shares a page table page but NOT with the forking process (dst_pte != src_pte) and calls huge_pte_offset. 
As it's not reading the contents of the PMD page, there is no implicit read barrier to pair with the write barrier from hugetlb_fault that updates the PMD page and they are not serialised by the page table lock. Hard to see exactly where that would cause a problem though. Thing is, in this scenario I think it's possible that page table sharing is not correctly detected by that dst_pte == src_pte check. dst_pte != src_pte but that does not mean it's not sharing with somebody! If it's sharing and it falls through then it copies the src PTE even though the dst PTE could already be populated and updates the mapcount accordingly. That would be a mess in its own right. I think this is exactly what is happening. I'll put more cave-man debugging code in and let you know. Larry There might be two bugs here.
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu, Jul 26, 2012 at 11:48:56PM -0400, Larry Woodman wrote: > On 07/26/2012 02:37 PM, Rik van Riel wrote: > >On 07/23/2012 12:04 AM, Hugh Dickins wrote: > > > >>I spent hours trying to dream up a better patch, trying various > >>approaches. I think I have a nice one now, what do you think? And > >>more importantly, does it work? I have not tried to test it at all, > >>that I'm hoping to leave to you, I'm sure you'll attack it with gusto! > >> > >>If you like it, please take it over and add your comments and signoff > >>and send it in. The second part won't come up in your testing, > >>and could > >>be made a separate patch if you prefer: it's a related point that struck > >>me while I was playing with a different approach. > >> > >>I'm sorely tempted to leave a dangerous pair of eyes off the Cc, > >>but that too would be unfair. > >> > >>Subject-to-your-testing- > >>Signed-off-by: Hugh Dickins > > > >This patch looks good to me. > > > >Larry, does Hugh's patch survive your testing? > > > > > > Like I said earlier, no. That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. > However, I finally set up a reproducer > that only takes a few seconds > on a large system and this totally fixes the problem: > The other possibility is that your reproducer case is triggering a different race to mine. Would it be possible to post? 
> - > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index c36febb..cc023b8 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct > *dst, struct mm_struct *src, > goto nomem; > > /* If the pagetables are shared don't copy or take references > */ > - if (dst_pte == src_pte) > + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) > continue; > > spin_lock(&dst->page_table_lock); > --- > > When we compare what the src_pte & dst_pte point to instead of their > addresses everything works, The dst_pte and src_pte are pointing to the PMD page though which is what we're meant to be checking. Your patch appears to change that to check if they are sharing data which is quite different. This is functionally similar to if you just checked VM_MAYSHARE at the start of the function and bailed if so. The PTEs would be populated at fault time instead. > I suspect there is a missing memory barrier somewhere ??? > Possibly but hard to tell whether it's barriers that are the real problem during fork. The copy routine is suspicious. On the barrier side - in normal PTE alloc routines there is a write barrier which is documented in __pte_alloc. If hugepage table sharing is successful, there is no similar barrier in huge_pmd_share before the PUD is populated. By rights, there should be a smp_wmb() before the page table spinlock is taken in huge_pmd_share(). The lack of a write barrier leads to possible snarls between fork() and fault. Take three processes, parent, child and other. Parent is forking to create child. Other is calling fault. Other faults hugetlb_fault()->huge_pte_alloc->allocate a PMD (write barrier) It is about to enter hugetlb_no_fault() Parent forks() runs at the same time Child shares a page table page but NOT with the forking process (dst_pte != src_pte) and calls huge_pte_offset. 
As it's not reading the contents of the PMD page, there is no implicit read barrier to pair with the write barrier from hugetlb_fault that updates the PMD page and they are not serialised by the page table lock. Hard to see exactly where that would cause a problem though. Thing is, in this scenario I think it's possible that page table sharing is not correctly detected by that dst_pte == src_pte check. dst_pte != src_pte but that does not mean it's not sharing with somebody! If it's sharing and it falls through then it copies the src PTE even though the dst PTE could already be populated and updates the mapcount accordingly. That would be a mess in its own right. There might be two bugs here. -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
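[Editorial note: Mel's suggested write barrier could be sketched as the following untested hunk against the 3.4-era arch/x86/mm/hugetlbpage.c. The surrounding context lines and exact placement are reproduced for illustration only; this is not a patch from the thread.]

```diff
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ huge_pmd_share()
 	if (!spte)
 		goto out;
 
+	/*
+	 * Ensure the writes initialising the shared PMD page are visible
+	 * before the PUD is populated and other tasks can reach the page
+	 * table via huge_pte_offset().  Analogous to the smp_wmb() that
+	 * __pte_alloc() documents for normal page table allocation.
+	 */
+	smp_wmb();
 	spin_lock(&mm->page_table_lock);
 	if (pud_none(*pud))
 		pud_populate(mm, pud,
 			     (pmd_t *)((unsigned long)spte & PAGE_MASK));
```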
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/26/2012 11:48 PM, Larry Woodman wrote: Mel, did you see this??? Larry This patch looks good to me. Larry, does Hugh's patch survive your testing? Like I said earlier, no. However, I finally set up a reproducer that only takes a few seconds on a large system and this totally fixes the problem: - diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c36febb..cc023b8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, goto nomem; /* If the pagetables are shared don't copy or take references */ - if (dst_pte == src_pte) + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) continue; spin_lock(&dst->page_table_lock); --- When we compare what the src_pte & dst_pte point to instead of their addresses everything works, I suspect there is a missing memory barrier somewhere ??? Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu 26-07-12 14:31:50, Rik van Riel wrote: > On 07/20/2012 10:36 AM, Michal Hocko wrote: > > >--- a/arch/x86/mm/hugetlbpage.c > >+++ b/arch/x86/mm/hugetlbpage.c > >@@ -81,7 +81,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned > >long addr, pud_t *pud) > > if (saddr) { > > spte = huge_pte_offset(svma->vm_mm, saddr); > > if (spte) { > >-get_page(virt_to_page(spte)); > >+struct page *spte_page = virt_to_page(spte); > >+if (!is_hugetlb_pmd_page_valid(spte_page)) { > > What prevents somebody else from marking the hugetlb > pmd invalid, between here... > > >+spte = NULL; > >+continue; > >+} > > ... and here? huge_ptep_get_and_clear is (should be) called inside i_mmap_mutex, which is not the case right now as Mel already pointed out in another email -- Michal Hocko SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu, Jul 26, 2012 at 01:42:26PM -0400, Rik van Riel wrote: > On 07/23/2012 12:04 AM, Hugh Dickins wrote: > > >Please don't be upset if I say that I don't like either of your patches. > >Mainly for obvious reasons - I don't like Mel's because anything with > >trylock retries and nested spinlocks worries me before I can even start > >to think about it; and I don't like Michal's for the same reason as Mel, > >that it spreads more change around in common paths than we would like. > > I have a naive question. > > In huge_pmd_share, we protect ourselves by taking > the mapping->i_mmap_mutex. > > Is there any reason we could not take the i_mmap_mutex > in the huge_pmd_unshare path? > We do, in 3.4 at least - callers of __unmap_hugepage_range hold the i_mmap_mutex. Locking changes in mmotm and there is a patch there that needs to be reverted. What tree are you looking at? -- Mel Gorman SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/26/2012 02:37 PM, Rik van Riel wrote: On 07/23/2012 12:04 AM, Hugh Dickins wrote: I spent hours trying to dream up a better patch, trying various approaches. I think I have a nice one now, what do you think? And more importantly, does it work? I have not tried to test it at all, that I'm hoping to leave to you, I'm sure you'll attack it with gusto! If you like it, please take it over and add your comments and signoff and send it in. The second part won't come up in your testing, and could be made a separate patch if you prefer: it's a related point that struck me while I was playing with a different approach. I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but that too would be unfair. Subject-to-your-testing- Signed-off-by: Hugh Dickins This patch looks good to me. Larry, does Hugh's patch survive your testing? Like I said earlier, no. However, I finally set up a reproducer that only takes a few seconds on a large system and this totally fixes the problem: - diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c36febb..cc023b8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, goto nomem; /* If the pagetables are shared don't copy or take references */ - if (dst_pte == src_pte) + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) continue; spin_lock(&dst->page_table_lock); --- When we compare what the src_pte & dst_pte point to instead of their addresses everything works, I suspect there is a missing memory barrier somewhere ??? Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/26/2012 02:37 PM, Rik van Riel wrote: On 07/23/2012 12:04 AM, Hugh Dickins wrote: I spent hours trying to dream up a better patch, trying various approaches. I think I have a nice one now, what do you think? And more importantly, does it work? I have not tried to test it at all, that I'm hoping to leave to you, I'm sure you'll attack it with gusto! If you like it, please take it over and add your comments and signoff and send it in. The second part won't come up in your testing, and could be made a separate patch if you prefer: it's a related point that struck me while I was playing with a different approach. I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but that too would be unfair. Subject-to-your-testing- Signed-off-by: Hugh Dickins This patch looks good to me. Larry, does Hugh's patch survive your testing? It doesn't. However it's got a slightly different footprint because this is RHEL6 and there have been changes to the hugetlbfs inode code. Also, we are seeing the problem via group_exit() rather than shmdt(). Also, I print out the actual _mapcount at the BUG and most of the time it's 1 but have seen it as high as 6. dell-per620-01.lab.bos.redhat.com login: MAPCOUNT = 2 [ cut here ] kernel BUG at mm/filemap.c:131! invalid opcode: [#1] SMP last sysfs file: /sys/devices/system/cpu/cpu23/cache/index2/shared_cpu_map CPU 8 Modules linked in: autofs4 sunrpc ipv6 acpi_pad power_meter dcdbas microcode sb_edac edac_core iTCO_wdt i] Pid: 3106, comm: mpitest Not tainted 2.6.32-289.el6.sharedpte.x86_64 #17 Dell Inc. 
PowerEdge R620/07NDJ2 RIP: 0010:[] [] __remove_from_page_cache+0xe2/0x100 RSP: 0018:880434897b78 EFLAGS: 00010002 RAX: 0001 RBX: ea00074ec000 RCX: 10f6 RDX: RSI: 0046 RDI: 0046 RBP: 880434897b88 R08: 81c01a00 R09: R10: R11: 0004 R12: 880432683d98 R13: 880432683db0 R14: R15: ea00074ec000 FS: () GS:88002828() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 003a1d38c4a8 CR3: 01a85000 CR4: 000406e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process mpitest (pid: 3106, threadinfo 880434896000, task 880431abb500) Stack: ea00074ec000 880434897bb8 81114ab4 880434897bb8 02ab 02a0 880434897c08 880434897cb8 811f758d 88022dd8 Call Trace: [] remove_from_page_cache+0x54/0x90 [] truncate_hugepages+0x11d/0x200 [] ? hugetlbfs_delete_inode+0x0/0x30 [] hugetlbfs_delete_inode+0x18/0x30 [] generic_delete_inode+0xde/0x1d0 [] hugetlbfs_drop_inode+0x5d/0x70 [] iput+0x62/0x70 [] dentry_iput+0x90/0x100 [] d_kill+0x31/0x60 [] dput+0x7c/0x150 [] __fput+0x189/0x210 [] fput+0x25/0x30 [] filp_close+0x5d/0x90 [] put_files_struct+0x7f/0xf0 [] exit_files+0x53/0x70 [] do_exit+0x18d/0x870 [] ? audit_syscall_entry+0x272/0x2a0 [] do_group_exit+0x58/0xd0 [] sys_exit_group+0x17/0x20 [] system_call_fastpath+0x16/0x1b -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/23/2012 12:04 AM, Hugh Dickins wrote:
> I spent hours trying to dream up a better patch, trying various approaches.
> I think I have a nice one now, what do you think? And more importantly, does
> it work? I have not tried to test it at all, that I'm hoping to leave to you,
> I'm sure you'll attack it with gusto! If you like it, please take it over and
> add your comments and signoff and send it in. The second part won't come up
> in your testing, and could be made a separate patch if you prefer: it's a
> related point that struck me while I was playing with a different approach.
> I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but that too
> would be unfair. Subject-to-your-testing-
> Signed-off-by: Hugh Dickins

This patch looks good to me. Larry, does Hugh's patch survive your testing?
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/20/2012 10:36 AM, Michal Hocko wrote:
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -81,7 +81,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	if (saddr) {
>  		spte = huge_pte_offset(svma->vm_mm, saddr);
>  		if (spte) {
> -			get_page(virt_to_page(spte));
> +			struct page *spte_page = virt_to_page(spte);
> +			if (!is_hugetlb_pmd_page_valid(spte_page)) {

What prevents somebody else from marking the hugetlb pmd invalid, between here...

> +				spte = NULL;
> +				continue;
> +			}

... and here?

> +			get_page(spte_page);
>  			break;
>  		}

I think we need to take the refcount before checking whether the hugetlb pmd is still valid.

Also, disregard my previous email in this thread, I just read Mel's detailed explanation and wrapped my brain around the bug :)
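Rik's ordering point can be sketched outside the kernel. This is a hedged illustration — the type and function names are hypothetical, not the kernel's API — of the pattern he is asking for, in the spirit of get_page_unless_zero(): take the reference first, then re-check validity, because validating before taking the reference leaves a window in which another CPU can invalidate and free the pmd page.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pmd page's lifetime state. */
struct pmd_page {
    atomic_int  refcount;
    atomic_bool valid;
};

/* Take a reference only if the page is still live, THEN check the
 * validity flag. If it was invalidated while we raced, drop the
 * reference we just took and report failure. */
static bool try_get_pmd_page(struct pmd_page *p)
{
    int c = atomic_load(&p->refcount);
    while (c > 0) {
        /* Only bump a refcount that is still non-zero. */
        if (atomic_compare_exchange_weak(&p->refcount, &c, c + 1)) {
            if (atomic_load(&p->valid))
                return true;                    /* ref held, still valid */
            atomic_fetch_sub(&p->refcount, 1);  /* invalidated: back off */
            return false;
        }
        /* CAS failed: c was reloaded; retry while still positive. */
    }
    return false;                               /* already freed */
}
```

The key property is that between the successful increment and the validity check, the page cannot be freed out from under us, which is exactly what checking validity before get_page() fails to guarantee.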
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/23/2012 12:04 AM, Hugh Dickins wrote:
> Please don't be upset if I say that I don't like either of your patches.
> Mainly for obvious reasons - I don't like Mel's because anything with
> trylock retries and nested spinlocks worries me before I can even start to
> think about it; and I don't like Michal's for the same reason as Mel, that
> it spreads more change around in common paths than we would like.

I have a naive question. In huge_pmd_share, we protect ourselves by taking the mapping->i_mmap_mutex. Is there any reason we could not take the i_mmap_mutex in the huge_pmd_unshare path?

I see that hugetlb_change_protection already takes that lock. Is there something preventing __unmap_hugepage_range from also taking mapping->i_mmap_mutex?

That way the sharing and the unsharing code are protected by the same, per shm segment, lock.
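The locking symmetry Rik is asking about can be sketched with an ordinary mutex. This is a hedged illustration with invented names (the struct, functions, and counter are not the kernel's): if both the sharing and the unsharing paths serialize on the same per-mapping lock — i_mmap_mutex in the kernel — neither can observe the other's half-done state.

```c
#include <pthread.h>

/* Illustrative per-mapping state; shared_users is a stand-in for the
 * shared pmd page's reference count. */
struct mapping {
    pthread_mutex_t i_mmap_mutex;
    int shared_users;
};

static void share_pmd(struct mapping *m)    /* huge_pmd_share analogue */
{
    pthread_mutex_lock(&m->i_mmap_mutex);
    m->shared_users++;
    pthread_mutex_unlock(&m->i_mmap_mutex);
}

static void unshare_pmd(struct mapping *m)  /* huge_pmd_unshare analogue */
{
    pthread_mutex_lock(&m->i_mmap_mutex);
    if (m->shared_users > 0)
        m->shared_users--;
    pthread_mutex_unlock(&m->i_mmap_mutex);
}
```

The design point is simply that a single lock guarding both directions makes the share/unshare pair mutually exclusive, which is why the question of whether every unshare path already holds i_mmap_mutex matters.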
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/26/2012 01:42 PM, Rik van Riel wrote:
> On 07/23/2012 12:04 AM, Hugh Dickins wrote:
>> Please don't be upset if I say that I don't like either of your patches.
>> Mainly for obvious reasons - I don't like Mel's because anything with
>> trylock retries and nested spinlocks worries me before I can even start to
>> think about it; and I don't like Michal's for the same reason as Mel, that
>> it spreads more change around in common paths than we would like.
>
> I have a naive question. In huge_pmd_share, we protect ourselves by taking
> the mapping->i_mmap_mutex. Is there any reason we could not take the
> i_mmap_mutex in the huge_pmd_unshare path?

I think it is already taken on every path into huge_pmd_unshare().

Larry

> I see that hugetlb_change_protection already takes that lock. Is there
> something preventing __unmap_hugepage_range from also taking
> mapping->i_mmap_mutex? That way the sharing and the unsharing code are
> protected by the same, per shm segment, lock.
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, Jul 24, 2012 at 12:23:58PM -0700, Hugh Dickins wrote:
> On Tue, 24 Jul 2012, Mel Gorman wrote:
> > On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote:
> > >
> > > So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.
> > >
> >
> > I agree with you. When I was thinking about the potential problems, I was
> > thinking of them in the general context of the core VM and what we normally
> > take into account.
> >
> > I confess that I really find this working-by-coincidence very icky and am
> > uncomfortable with it but your patch is the only patch that contains the
> > mess to hugetlbfs. I fixed exit_mmap() for my version but only by changing
> > the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb
> > VMA is found so I also affected the core.
>
> "icky" is not quite the word I'd use, but yes, it feels like you only
> have to dislodge a stone somewhere at the other end of the kernel,
> and the whole lot would come tumbling down.
>
> If I could think of a suitable VM_BUG_ON to insert next to the ~VM_MAYSHARE,
> I would: to warn us when assumptions change. If we were prepared to waste
> another vm_flag on it (and just because there's now a type which lets them
> expand does not mean we can be profligate with them), then you can imagine
> a VM_GOINGAWAY flag set in unmap_region() and exit_mmap(), and we key off
> that instead; or something of that kind.

A new VM flag would be overkill for this right now.

> But I'm afraid I see that as TODO-list material: the one-liner is pretty
> good for stable backporting, and I felt smiled-upon when it turned out to
> be workable (and not even needing a change in arch/x86/mm, that really
> surprised me). It seems ungrateful not to seize the simple fix it offers,
> which I found much easier to understand than the alternatives.

That's fair enough.

> > So, let's go with your patch but with all this documented! I stuck a
> > changelog and an additional comment onto your patch and this is the end
> > result.
>
> Okay, thanks. (I think you've copied rather more of my previous mail
> into the commit description than it deserves, but it looks like you
> like more words where I like less!)

I did copy more than was necessary, I'll fix it.

> > Do you want to pick this up and send it to Andrew or will I?
>
> Oh, please change your Reviewed-by to Signed-off-by: almost all of the
> work and description comes from you and Michal; then please, you send it
> in to Andrew - sorry, I really need to turn my attention to other things.

That's fine, I'll pick it. Thanks for working on this.

--
Mel Gorman
SUSE Labs
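The "&= ~VM_MAYSHARE" one-liner discussed above boils down to a flag-clearing pattern. This is an illustrative sketch only — the flag values below are invented, not the kernel's — showing the bit manipulation: clearing VM_MAYSHARE on the final-teardown path marks the VMA as no longer eligible for pmd sharing without spending a new VM_GOINGAWAY bit.

```c
/* Hypothetical flag values for illustration; only the pattern of
 * clearing one eligibility bit while preserving the rest matters. */
#define VM_SHARED   0x00000008UL
#define VM_MAYSHARE 0x00000080UL

static unsigned long mark_vma_going_away(unsigned long vm_flags)
{
    return vm_flags & ~VM_MAYSHARE;   /* the "&= ~VM_MAYSHARE" fix */
}
```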
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, 24 Jul 2012, Mel Gorman wrote:
> On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote:
> >
> > So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.
> >
>
> I agree with you. When I was thinking about the potential problems, I was
> thinking of them in the general context of the core VM and what we normally
> take into account.
>
> I confess that I really find this working-by-coincidence very icky and am
> uncomfortable with it but your patch is the only patch that contains the
> mess to hugetlbfs. I fixed exit_mmap() for my version but only by changing
> the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb
> VMA is found so I also affected the core.

"icky" is not quite the word I'd use, but yes, it feels like you only have to dislodge a stone somewhere at the other end of the kernel, and the whole lot would come tumbling down.

If I could think of a suitable VM_BUG_ON to insert next to the ~VM_MAYSHARE, I would: to warn us when assumptions change. If we were prepared to waste another vm_flag on it (and just because there's now a type which lets them expand does not mean we can be profligate with them), then you can imagine a VM_GOINGAWAY flag set in unmap_region() and exit_mmap(), and we key off that instead; or something of that kind.

But I'm afraid I see that as TODO-list material: the one-liner is pretty good for stable backporting, and I felt smiled-upon when it turned out to be workable (and not even needing a change in arch/x86/mm, that really surprised me). It seems ungrateful not to seize the simple fix it offers, which I found much easier to understand than the alternatives.

> So, let's go with your patch but with all this documented! I stuck a
> changelog and an additional comment onto your patch and this is the end
> result.

Okay, thanks. (I think you've copied rather more of my previous mail into the commit description than it deserves, but it looks like you like more words where I like less!)

> Do you want to pick this up and send it to Andrew or will I?

Oh, please change your Reviewed-by to Signed-off-by: almost all of the work and description comes from you and Michal; then please, you send it in to Andrew - sorry, I really need to turn my attention to other things.

But I hadn't realized how messy it's going to be: I was concentrating on 3.5, not the mmotm tree, which as Michal points out is fairly different. Yes, it definitely needs to revert to holding i_mmap_mutex when unsharing, it was a mistake to have removed that (what stabilizes the page_count 1 check in huge_pmd_unshare? only i_mmap_mutex). I guess this fix needs to go in to 3.6 early, and "someone" rejig the hugetlb area of mmotm before that goes on to Linus. Urggh. AArgh64. Sorry, I'm not volunteering.

But an interesting aspect of the hugetlb changes there, is that mmu_gather is now being used by __unmap_hugepage_range: I did want that for one of the solutions to this bug that I was toying with. Although earlier I had been afraid of "doing free_pgtables work" down in unshare, it occurred to me later that already we do that (pud_clear) with no harm observed, and free_pgtables does not depend on having entries still present at the lower levels. It may be that there's a less tricky fix available once the dust has settled here.

> Thanks Hugh!
>
> ---8<---
> mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
>
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
>
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this;
>
> [ ..] Lots of bad pmd messages followed by this
> [ 127.164256] mm/memory.c:391: bad pmd 880412e04fe8(8003de4000e7).
> [ 127.164257] mm/memory.c:391: bad pmd 880412e04ff0(8003de6000e7).
> [ 127.164258] mm/memory.c:391: bad pmd 880412e04ff8(8003dee7).
> [ 127.186778] ------------[ cut here ]------------
> [ 127.186781] kernel BUG at mm/filemap.c:134!
> [ 127.186782] invalid opcode: [#1] SMP
> [ 127.186783] CPU 7
> [ 127.186784] Modules linked in: af_packet cpufreq_conservative
> cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod
> coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801
> r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw
> cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode
> ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 24-07-12 10:34:06, Mel Gorman wrote:
[...]
> ---8<---
> mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
>
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
>
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this;
>
> [ ..] Lots of bad pmd messages followed by this
> [ 127.164256] mm/memory.c:391: bad pmd 880412e04fe8(8003de4000e7).
> [ 127.164257] mm/memory.c:391: bad pmd 880412e04ff0(8003de6000e7).
> [ 127.164258] mm/memory.c:391: bad pmd 880412e04ff8(8003dee7).
> [ 127.186778] ------------[ cut here ]------------
> [ 127.186781] kernel BUG at mm/filemap.c:134!
> [ 127.186782] invalid opcode: [#1] SMP
> [ 127.186783] CPU 7
> [ 127.186784] Modules linked in: af_packet cpufreq_conservative
> cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod
> coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801
> r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw
> cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode
> ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm
> i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button
> i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon
> ata_generic pata_atiixp libata scsi_mod
> [ 127.186801]
> [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild
> #53 Dell Inc. OptiPlex 990/06D7TR
> [ 127.186804] RIP: 0010:[] [] __delete_from_page_cache+0x15e/0x160
> [ 127.186809] RSP: :8804144b5c08 EFLAGS: 00010002
> [ 127.186810] RAX: 0001 RBX: ea000a5c9000 RCX: ffc0
> [ 127.186811] RDX: RSI: 0009 RDI: 88042dfdad00
> [ 127.186812] RBP: 8804144b5c18 R08: 0009 R09: 0003
> [ 127.186813] R10: R11: 002d R12: 880412ff83d8
> [ 127.186814] R13: 880412ff83d8 R14: R15: 880412ff83d8
> [ 127.186815] FS: 7fe18ed2c700() GS:88042dce() knlGS:
> [ 127.186816] CS: 0010 DS: ES: CR0: 8005003b
> [ 127.186817] CR2: 7fe34503 CR3: 000417a14000 CR4: 000407e0
> [ 127.186818] DR0: DR1: DR2:
> [ 127.186819] DR3: DR6: 0ff0 DR7: 0400
> [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo 8804144b4000, task 880417f803c0)
> [ 127.186821] Stack:
> [ 127.186822] ea000a5c9000 8804144b5c48 810ed83b
> [ 127.186824] 8804144b5c48 138a 1387 8804144b5c98
> [ 127.186825] 8804144b5d48 811bc925 8804144b5cb8
> [ 127.186827] Call Trace:
> [ 127.186829] [] delete_from_page_cache+0x3b/0x80
> [ 127.186832] [] truncate_hugepages+0x115/0x220
> [ 127.186834] [] hugetlbfs_evict_inode+0x13/0x30
> [ 127.186837] [] evict+0xa7/0x1b0
> [ 127.186839] [] iput_final+0xd3/0x1f0
> [ 127.186840] [] iput+0x39/0x50
> [ 127.186842] [] d_kill+0xf8/0x130
> [ 127.186843] [] dput+0xd2/0x1a0
> [ 127.186845] [] __fput+0x170/0x230
> [ 127.186848] [] ? rb_erase+0xce/0x150
> [ 127.186849] [] fput+0x1d/0x30
> [ 127.186851] [] remove_vma+0x37/0x80
> [ 127.186853] [] do_munmap+0x2d2/0x360
> [ 127.186855] [] sys_shmdt+0xc9/0x170
> [ 127.186857] [] system_call_fastpath+0x16/0x1b
> [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40
> 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f>
> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [ 127.186868] RIP [] __delete_from_page_cache+0x15e/0x160
> [ 127.186870] RSP
> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---
>
> The bug is a race and not always easy to reproduce. To reproduce it I was
> doing the following on a single socket I7-based machine with 16G of RAM.
>
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
> $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
>
> On my particular machine, it usually triggers within 10 minutes but enabling
> debug options can change the timing such that it never hits. Once the bug is
> triggered, the machine is in
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote: > On Mon, 23 Jul 2012, Mel Gorman wrote: > > On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote: > > > On Fri, 20 Jul 2012, Mel Gorman wrote: > > > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote: > > > > I like it in that it's simple and I can confirm it works for the test case > > of interest. > > Phew, I'm glad to hear that, thanks. > > > > > However, is your patch not vunerable to truncate issues? > > madvise()/truncate() issues was the main reason why I was wary of VMA tricks > > as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is > > ignored for hugetlbfs but I think truncate is still problematic. Lets say > > we mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason. > > > > invalidate_inode_pages2 > > invalidate_inode_pages2_range > > unmap_mapping_range_vma > > zap_page_range_single > > unmap_single_vma > > __unmap_hugepage_range (removes VM_MAYSHARE) > > > > The VMA still exists so the consequences for this would be varied but > > minimally fault is going to be "interesting". > > You had me worried there, I hadn't considered truncation or invalidation2 > at all. > > But actually, I don't think they do pose any problem for my patch. They > would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range() > as you show above; but no, I'm removing it in unmap_hugepage_range(). > True, I hadn't considered the distinction to be relevant. My train of thought was "we potentially end up calling these functions and affect VMAs that are not being torn down". This set off an alarm but I didn't follow through properly. > That's only called by unmap_single_vma(): which is called via > unmap_vmas() by unmap_region() or exit_mmap() just before free_pgtables() > (the problem cases); or by madvise_dontneed() via zap_page_range(), which > as you note is disallowed on VM_HUGETLB; or by zap_page_range_single(). 
The madvise(DONTNEED) case also made me worry that if this fix worked, it worked by coincidence. If someone "fixes" hugetlbfs to support DONTNEED (which would be nonsensical) or direct IO (which would be weird) then this fix could potentially regress.

> zap_page_range_single() is called by zap_vma_ptes(), which is only allowed on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was going to deadlock on i_mmap_mutex (with or without my patch) until I realized that hugetlbfs has its own hugetlbfs_setattr() and hugetlb_vmtruncate() which don't use unmap_mapping_range() at all.

Yep, it happens to work.

> invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(), but hugetlbfs doesn't support direct_IO, and otherwise I think they're called by a filesystem directly on its own inodes, which hugetlbfs does not. Anyway, if there's a deadlock on i_mmap_mutex somewhere in there, it's not introduced by the proposed patch.
>
> So, unmap_hugepage_range() is only being called in the problem cases, just before free_pgtables(), when unmapping a vma (with mmap_sem held), or when exiting (when we have the last reference to mm): in each case, the vma is on its way out, and VM_MAYSHARE no longer of interest to others.
>
> I spent a while concerned that I'd overlooked the truncation case, before realizing that it's not a problem: the issue comes when we free_pgtables(), which truncation makes no attempt to do.
>
> So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

I agree with you. When I was thinking about the potential problems, I was thinking of them in the general context of the core VM and what we normally take into account. I confess that I really find this working-by-coincidence very icky and am uncomfortable with it but your patch is the only patch that contains the mess to hugetlbfs.
I fixed exit_mmap() for my version but only by changing the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb VMA is found, so I also affected the core. So, let's go with your patch but with all this documented!

I stuck a changelog and an additional comment onto your patch and this is the end result. Do you want to pick this up and send it to Andrew or will I?

Thanks Hugh!

---8<---
mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables

If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages look like this; [
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon 23-07-12 18:08:05, Hugh Dickins wrote:
> On Mon, 23 Jul 2012, Mel Gorman wrote:
> > On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> > > On Fri, 20 Jul 2012, Mel Gorman wrote:
> > > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> >
> > I like it in that it's simple and I can confirm it works for the test case of interest.
>
> Phew, I'm glad to hear that, thanks.
>
> > However, is your patch not vulnerable to truncate issues? madvise()/truncate() issues was the main reason why I was wary of VMA tricks as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is ignored for hugetlbfs but I think truncate is still problematic. Let's say we mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason.
> >
> > invalidate_inode_pages2
> > invalidate_inode_pages2_range
> > unmap_mapping_range_vma
> > zap_page_range_single
> > unmap_single_vma
> > __unmap_hugepage_range (removes VM_MAYSHARE)
> >
> > The VMA still exists so the consequences for this would be varied but minimally fault is going to be "interesting".
>
> You had me worried there, I hadn't considered truncation or invalidation2 at all.
>
> But actually, I don't think they do pose any problem for my patch. They would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range() as you show above; but no, I'm removing it in unmap_hugepage_range().
>
> That's only called by unmap_single_vma(): which is called via unmap_vmas() by unmap_region() or exit_mmap() just before free_pgtables() (the problem cases); or by madvise_dontneed() via zap_page_range(), which as you note is disallowed on VM_HUGETLB; or by zap_page_range_single().
> zap_page_range_single() is called by zap_vma_ptes(), which is only allowed on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was going to deadlock on i_mmap_mutex (with or without my patch) until I realized that hugetlbfs has its own hugetlbfs_setattr() and hugetlb_vmtruncate() which don't use unmap_mapping_range() at all.
>
> invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(), but hugetlbfs doesn't support direct_IO, and otherwise I think they're called by a filesystem directly on its own inodes, which hugetlbfs does not.

Good point, I didn't get this while looking into the code, so I introduced the `last' parameter, which told me that I was in the correct path. Thanks for the clarification.

> Anyway, if there's a deadlock on i_mmap_mutex somewhere in there, it's not introduced by the proposed patch.
>
> So, unmap_hugepage_range() is only being called in the problem cases, just before free_pgtables(), when unmapping a vma (with mmap_sem held), or when exiting (when we have the last reference to mm): in each case, the vma is on its way out, and VM_MAYSHARE no longer of interest to others.
>
> I spent a while concerned that I'd overlooked the truncation case, before realizing that it's not a problem: the issue comes when we free_pgtables(), which truncation makes no attempt to do.
>
> So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

Yes, this is convincing (and subtle ;)) and much less polluting. You can add my Reviewed-by (with the above reasoning in the patch description).

Anyway, the patch for mmotm needs an update because there was a reorganization in the area. First, we need to revert "hugetlb: avoid taking i_mmap_mutex in unmap_single_vma() for hugetlb" (80f408f5 in memcg-devel) and then push your code into unmap_single_vma. All the above is still valid AFAICS.

> Hugh

Thanks a lot Hugh!
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 24-07-12 10:34:06, Mel Gorman wrote:
[...]
---8<---
mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables

If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages look like this;

[...] Lots of bad pmd messages followed by this
[ 127.164256] mm/memory.c:391: bad pmd 880412e04fe8(8003de4000e7).
[ 127.164257] mm/memory.c:391: bad pmd 880412e04ff0(8003de6000e7).
[ 127.164258] mm/memory.c:391: bad pmd 880412e04ff8(8003dee7).
[ 127.186778] [ cut here ]
[ 127.186781] kernel BUG at mm/filemap.c:134!
[ 127.186782] invalid opcode: [#1] SMP
[ 127.186783] CPU 7
[ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[ 127.186801]
[ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc.
OptiPlex 990/06D7TR
[ 127.186804] RIP: 0010:[810ed6ce] [810ed6ce] __delete_from_page_cache+0x15e/0x160
[ 127.186809] RSP: :8804144b5c08 EFLAGS: 00010002
[ 127.186810] RAX: 0001 RBX: ea000a5c9000 RCX: ffc0
[ 127.186811] RDX: RSI: 0009 RDI: 88042dfdad00
[ 127.186812] RBP: 8804144b5c18 R08: 0009 R09: 0003
[ 127.186813] R10: R11: 002d R12: 880412ff83d8
[ 127.186814] R13: 880412ff83d8 R14: R15: 880412ff83d8
[ 127.186815] FS: 7fe18ed2c700() GS:88042dce() knlGS:
[ 127.186816] CS: 0010 DS: ES: CR0: 8005003b
[ 127.186817] CR2: 7fe34503 CR3: 000417a14000 CR4: 000407e0
[ 127.186818] DR0: DR1: DR2:
[ 127.186819] DR3: DR6: 0ff0 DR7: 0400
[ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo 8804144b4000, task 880417f803c0)
[ 127.186821] Stack:
[ 127.186822] ea000a5c9000 8804144b5c48 810ed83b
[ 127.186824] 8804144b5c48 138a 1387 8804144b5c98
[ 127.186825] 8804144b5d48 811bc925 8804144b5cb8
[ 127.186827] Call Trace:
[ 127.186829] [810ed83b] delete_from_page_cache+0x3b/0x80
[ 127.186832] [811bc925] truncate_hugepages+0x115/0x220
[ 127.186834] [811bca43] hugetlbfs_evict_inode+0x13/0x30
[ 127.186837] [811655c7] evict+0xa7/0x1b0
[ 127.186839] [811657a3] iput_final+0xd3/0x1f0
[ 127.186840] [811658f9] iput+0x39/0x50
[ 127.186842] [81162708] d_kill+0xf8/0x130
[ 127.186843] [81162812] dput+0xd2/0x1a0
[ 127.186845] [8114e2d0] __fput+0x170/0x230
[ 127.186848] [81236e0e] ?
rb_erase+0xce/0x150
[ 127.186849] [8114e3ad] fput+0x1d/0x30
[ 127.186851] [81117db7] remove_vma+0x37/0x80
[ 127.186853] [81119182] do_munmap+0x2d2/0x360
[ 127.186855] [811cc639] sys_shmdt+0xc9/0x170
[ 127.186857] [81410a39] system_call_fastpath+0x16/0x1b
[ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff 0f 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[ 127.186868] RIP [810ed6ce] __delete_from_page_cache+0x15e/0x160
[ 127.186870] RSP 8804144b5c08
[ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---

The bug is a race and not always easy to reproduce. To reproduce it I was doing the following on a single socket I7-based machine with 16G of RAM.

$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, 24 Jul 2012, Mel Gorman wrote:
> On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote:
> > So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.
>
> I agree with you. When I was thinking about the potential problems, I was thinking of them in the general context of the core VM and what we normally take into account. I confess that I really find this working-by-coincidence very icky and am uncomfortable with it but your patch is the only patch that contains the mess to hugetlbfs. I fixed exit_mmap() for my version but only by changing the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb VMA is found so I also affected the core.

"Icky" is not quite the word I'd use, but yes, it feels like you only have to dislodge a stone somewhere at the other end of the kernel, and the whole lot would come tumbling down.

If I could think of a suitable VM_BUG_ON to insert next to the ~VM_MAYSHARE, I would: to warn us when assumptions change. If we were prepared to waste another vm_flag on it (and just because there's now a type which lets them expand does not mean we can be profligate with them), then you can imagine a VM_GOINGAWAY flag set in unmap_region() and exit_mmap(), and we key off that instead; or something of that kind.

But I'm afraid I see that as TODO-list material: the one-liner is pretty good for stable backporting, and I felt smiled-upon when it turned out to be workable (and not even needing a change in arch/x86/mm, that really surprised me). It seems ungrateful not to seize the simple fix it offers, which I found much easier to understand than the alternatives.

> So, let's go with your patch but with all this documented! I stuck a changelog and an additional comment onto your patch and this is the end result.

Okay, thanks. (I think you've copied rather more of my previous mail into the commit description than it deserves, but it looks like you like more words where I like less!)
> Do you want to pick this up and send it to Andrew or will I?

Oh, please change your Reviewed-by to Signed-off-by: almost all of the work and description comes from you and Michal; then please, you send it in to Andrew - sorry, I really need to turn my attention to other things.

But I hadn't realized how messy it's going to be: I was concentrating on 3.5, not the mmotm tree, which as Michal points out is fairly different. Yes, it definitely needs to revert to holding i_mmap_mutex when unsharing, it was a mistake to have removed that (what stabilizes the page_count 1 check in huge_pmd_unshare? only i_mmap_mutex). I guess this fix needs to go in to 3.6 early, and someone rejig the hugetlb area of mmotm before that goes on to Linus. Urggh. AArgh64. Sorry, I'm not volunteering.

But an interesting aspect of the hugetlb changes there, is that mmu_gather is now being used by __unmap_hugepage_range: I did want that for one of the solutions to this bug that I was toying with. Although earlier I had been afraid of doing free_pgtables work down in unshare, it occurred to me later that already we do that (pud_clear) with no harm observed, and free_pgtables does not depend on having entries still present at the lower levels. It may be that there's a less tricky fix available once the dust has settled here.

> Thanks Hugh!
>
> ---8<---
> mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
>
> If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c.
>
> This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page].
> The messages look like this;
>
> [...] Lots of bad pmd messages followed by this
> [ 127.164256] mm/memory.c:391: bad pmd 880412e04fe8(8003de4000e7).
> [ 127.164257] mm/memory.c:391: bad pmd 880412e04ff0(8003de6000e7).
> [ 127.164258] mm/memory.c:391: bad pmd 880412e04ff8(8003dee7).
> [ 127.186778] [ cut here ]
> [ 127.186781] kernel BUG at mm/filemap.c:134!
> [ 127.186782] invalid opcode: [#1] SMP
> [ 127.186783] CPU 7
> [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon, 23 Jul 2012, Mel Gorman wrote:
> On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> > On Fri, 20 Jul 2012, Mel Gorman wrote:
> > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
>
> I like it in that it's simple and I can confirm it works for the test case of interest.

Phew, I'm glad to hear that, thanks.

> However, is your patch not vulnerable to truncate issues? madvise()/truncate() issues was the main reason why I was wary of VMA tricks as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is ignored for hugetlbfs but I think truncate is still problematic. Let's say we mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason.
>
> invalidate_inode_pages2
> invalidate_inode_pages2_range
> unmap_mapping_range_vma
> zap_page_range_single
> unmap_single_vma
> __unmap_hugepage_range (removes VM_MAYSHARE)
>
> The VMA still exists so the consequences for this would be varied but minimally fault is going to be "interesting".

You had me worried there, I hadn't considered truncation or invalidation2 at all.

But actually, I don't think they do pose any problem for my patch. They would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range() as you show above; but no, I'm removing it in unmap_hugepage_range().

That's only called by unmap_single_vma(): which is called via unmap_vmas() by unmap_region() or exit_mmap() just before free_pgtables() (the problem cases); or by madvise_dontneed() via zap_page_range(), which as you note is disallowed on VM_HUGETLB; or by zap_page_range_single().

zap_page_range_single() is called by zap_vma_ptes(), which is only allowed on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was going to deadlock on i_mmap_mutex (with or without my patch) until I realized that hugetlbfs has its own hugetlbfs_setattr() and hugetlb_vmtruncate() which don't use unmap_mapping_range() at all.
invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(), but hugetlbfs doesn't support direct_IO, and otherwise I think they're called by a filesystem directly on its own inodes, which hugetlbfs does not. Anyway, if there's a deadlock on i_mmap_mutex somewhere in there, it's not introduced by the proposed patch.

So, unmap_hugepage_range() is only being called in the problem cases, just before free_pgtables(), when unmapping a vma (with mmap_sem held), or when exiting (when we have the last reference to mm): in each case, the vma is on its way out, and VM_MAYSHARE no longer of interest to others.

I spent a while concerned that I'd overlooked the truncation case, before realizing that it's not a problem: the issue comes when we free_pgtables(), which truncation makes no attempt to do.

So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

Hugh
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> On Fri, 20 Jul 2012, Mel Gorman wrote:
> > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> > > And here is my attempt for the fix (Hugh mentioned something similar earlier but he suggested using special flags in ptes or VMAs). I still owe doc. update and it hasn't been tested with too many configs and I could have missed some definition updates. I also think that the changelog could be much better, I will add (steal) the full bug description if people think that this way is worth going rather than the one suggested by Mel. To be honest I am not quite happy how I had to pollute generic mm code with something that is specific to a single architecture. Mel hammered it with the test case and it survived.
> >
> > Tested-by: Mel Gorman
> >
> > This approach looks more or less like what I was expecting. I like that the trick was applied to the page table page instead of using PTE tricks or by bodging it with a VMA flag like I was thinking so kudos for that. I also prefer this approach to trying to free the page tables on or near huge_pmd_unshare().
> >
> > In general I think this patch would execute better than mine because it is far less heavy-handed but I share your concern that it changes the core MM quite a bit for a corner case that only one architecture cares about. I am completely biased of course, but I still prefer my patch because other than an API change it keeps the bulk of the madness in arch/x86/mm/hugetlbpage.c. I am also not concerned with the scalability of how quickly we can set up page table sharing.
> >
> > Hugh, I'm afraid you get to choose :)
>
> Thank you for bestowing that honour upon me :)

Just so you know, there was a ceremonial gong when it happened.
> Seriously, though, you were quite right to Cc me on this, it is one of those areas I ought to know something about (unlike hugetlb reservations, for example).
>
> Please don't be upset if I say that I don't like either of your patches.

I can live with that :) It would not be the first time we found the right patch out of dislike for the first proposed, and getting the fix is what's important.

> Mainly for obvious reasons - I don't like Mel's because anything with trylock retries and nested spinlocks worries me before I can even start to think about it;

That's a reasonable objection. The trylock could be avoided by always falling through at the cost of reducing the amount of sharing opportunities but the nested locking is unavoidable. I agree with you that nested locking like this should always be a cause for concern.

> and I don't like Michal's for the same reason as Mel, that it spreads more change around in common paths than we would like.
>
> But I didn't spend much time thinking through either of them, they just seemed more complicated than should be needed. I cannot confirm or deny whether they're correct - though I still do not understand how mmap_sem can help you, Mel. I can see that it will help in your shmdt()ing test, but if you leave the area mapped on exit, then mmap_sem is not taken in the exit_mmap() path, so how does it help?

It certainly helps in the shmdt case which is what the test case focused on because that is what the application that triggered this bug was doing. However, you're right in that the exit_mmap() path is still vulnerable because it does not take mmap_sem. I'll think about that a bit more.

> I spent hours trying to dream up a better patch, trying various approaches. I think I have a nice one now, what do you think? And more importantly, does it work? I have not tried to test it at all, that I'm hoping to leave to you, I'm sure you'll attack it with gusto!
> > If you like it, please take it over and add your comments and signoff > and send it in. > I like it in that it's simple and I can confirm it works for the test case of interest. However, is your patch not vunerable to truncate issues? madvise()/truncate() issues was the main reason why I was wary of VMA tricks as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is ignored for hugetlbfs but I think truncate is still problematic. Lets say we mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason. invalidate_inode_pages2 invalidate_inode_pages2_range unmap_mapping_range_vma zap_page_range_single unmap_single_vma __unmap_hugepage_range (removes VM_MAYSHARE) The VMA still exists so the consequences for this would be varied but minimally fault is going to be "interesting". I think that potentially we could work around this but it may end up stomping on the core MM and not necessarily be any better than Michal's patch. > The second part won't come up in your testing, and could > be made a separate patch if you prefer: it's a related point that struck > me while I was
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon, 23 Jul 2012, Mel Gorman wrote:
> On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> > On Fri, 20 Jul 2012, Mel Gorman wrote:
> > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
>
> I like it in that it's simple and I can confirm it works for the test
> case of interest.

Phew, I'm glad to hear that, thanks.

> However, is your patch not vulnerable to truncate issues?
> madvise()/truncate() issues were the main reason why I was wary of VMA
> tricks as a solution.
>
> As it turns out, madvise(DONTNEED) is not a problem as it is ignored for
> hugetlbfs, but I think truncate is still problematic. Let's say we
> mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason:
>
>   invalidate_inode_pages2
>     invalidate_inode_pages2_range
>       unmap_mapping_range_vma
>         zap_page_range_single
>           unmap_single_vma
>             __unmap_hugepage_range (removes VM_MAYSHARE)
>
> The VMA still exists, so the consequences of this would be varied, but
> minimally fault is going to be "interesting".

You had me worried there; I hadn't considered truncation or invalidation2
at all. But actually, I don't think they do pose any problem for my patch.

They would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range()
as you show above; but no, I'm removing it in unmap_hugepage_range(). That's
only called by unmap_single_vma(): which is called via unmap_vmas() by
unmap_region() or exit_mmap() just before free_pgtables() (the problem
cases); or by madvise_dontneed() via zap_page_range(), which as you note is
disallowed on VM_HUGETLB; or by zap_page_range_single().

zap_page_range_single() is called by zap_vma_ptes(), which is only allowed
on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was
going to deadlock on i_mmap_mutex (with or without my patch) until I
realized that hugetlbfs has its own hugetlbfs_setattr() and
hugetlb_vmtruncate(), which don't use unmap_mapping_range() at all.

invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(), but
hugetlbfs doesn't support direct_IO, and otherwise I think they're called
by a filesystem directly on its own inodes, which hugetlbfs does not do.
Anyway, if there's a deadlock on i_mmap_mutex somewhere in there, it's not
introduced by the proposed patch.

So, unmap_hugepage_range() is only being called in the problem cases, just
before free_pgtables(), when unmapping a vma (with mmap_sem held), or when
exiting (when we have the last reference to mm): in each case, the vma is
on its way out, and VM_MAYSHARE is no longer of interest to others.

I spent a while concerned that I'd overlooked the truncation case, before
realizing that it's not a problem: the issue comes when we free_pgtables(),
which truncation makes no attempt to do.

So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Fri, 20 Jul 2012, Mel Gorman wrote:
> On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> > And here is my attempt for the fix (Hugh mentioned something similar
> > earlier but he suggested using special flags in ptes or VMAs). I still
> > owe a doc update, it hasn't been tested with too many configs, and I
> > could have missed some definition updates.
> > I also think that the changelog could be much better; I will add (steal)
> > the full bug description if people think that this way is worth going
> > rather than the one suggested by Mel.
> > To be honest I am not quite happy how I had to pollute generic mm code
> > with something that is specific to a single architecture.
> > Mel hammered it with the test case and it survived.
>
> Tested-by: Mel Gorman
>
> This approach looks more or less like what I was expecting. I like that
> the trick was applied to the page table page instead of using PTE tricks
> or by bodging it with a VMA flag like I was thinking, so kudos for that. I
> also prefer this approach to trying to free the page tables on or near
> huge_pmd_unshare()
>
> In general I think this patch would execute better than mine because it is
> far less heavy-handed, but I share your concern that it changes the core MM
> quite a bit for a corner case that only one architecture cares about. I am
> completely biased of course, but I still prefer my patch because, other
> than an API change, it keeps the bulk of the madness in
> arch/x86/mm/hugetlbpage.c. I am also not concerned with the scalability of
> how quickly we can set up page table sharing.
>
> Hugh, I'm afraid you get to choose :)

Thank you for bestowing that honour upon me :)

Seriously, though, you were quite right to Cc me on this, it is one of
those areas I ought to know something about (unlike hugetlb reservations,
for example).

Please don't be upset if I say that I don't like either of your patches.
Mainly for obvious reasons - I don't like Mel's because anything with
trylock retries and nested spinlocks worries me before I can even start
to think about it; and I don't like Michal's for the same reason as Mel,
that it spreads more change around in common paths than we would like.

But I didn't spend much time thinking through either of them, they just
seemed more complicated than should be needed. I cannot confirm or deny
whether they're correct - though I still do not understand how mmap_sem
can help you, Mel. I can see that it will help in your shmdt()ing test,
but if you leave the area mapped on exit, then mmap_sem is not taken in
the exit_mmap() path, so how does it help?

I spent hours trying to dream up a better patch, trying various
approaches. I think I have a nice one now, what do you think? And
more importantly, does it work? I have not tried to test it at all,
that I'm hoping to leave to you, I'm sure you'll attack it with gusto!

If you like it, please take it over and add your comments and signoff
and send it in. The second part won't come up in your testing, and could
be made a separate patch if you prefer: it's a related point that struck
me while I was playing with a different approach.

I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but
that too would be unfair.

Subject-to-your-testing-
Signed-off-by: Hugh Dickins
---
 mm/hugetlb.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

--- v3.5/mm/hugetlb.c	2012-07-21 13:58:29.000000000 -0700
+++ linux/mm/hugetlb.c	2012-07-22 20:28:59.858077817 -0700
@@ -2393,6 +2393,15 @@ void unmap_hugepage_range(struct vm_area
 {
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	__unmap_hugepage_range(vma, start, end, ref_page);
+	/*
+	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
+	 * test will fail on a vma being torn down, and not grab a page table
+	 * on its way out.  We're lucky that the flag has such an appropriate
+	 * name, and can in fact be safely cleared here.  We could clear it
+	 * before the __unmap_hugepage_range above, but all that's necessary
+	 * is to clear it before releasing the i_mmap_mutex below.
+	 */
+	vma->vm_flags &= ~VM_MAYSHARE;
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }

@@ -2959,9 +2968,14 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-
+	/*
+	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
+	 * may have cleared our pud entry and done put_page on the page table:
+	 * once we release i_mmap_mutex, another task can do the final put_page
+	 * and that page table be reused and filled with junk.
+	 */
 	flush_tlb_range(vma, start, end);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }

 int hugetlb_reserve_pages(struct inode *inode,
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> And here is my attempt for the fix (Hugh mentioned something similar
> earlier but he suggested using special flags in ptes or VMAs). I still
> owe a doc update, it hasn't been tested with too many configs, and I
> could have missed some definition updates.
> I also think that the changelog could be much better; I will add (steal)
> the full bug description if people think that this way is worth going
> rather than the one suggested by Mel.
> To be honest I am not quite happy how I had to pollute generic mm code
> with something that is specific to a single architecture.
> Mel hammered it with the test case and it survived.

Tested-by: Mel Gorman

This approach looks more or less like what I was expecting. I like that
the trick was applied to the page table page instead of using PTE tricks
or by bodging it with a VMA flag like I was thinking, so kudos for that. I
also prefer this approach to trying to free the page tables on or near
huge_pmd_unshare()

In general I think this patch would execute better than mine because it is
far less heavy-handed, but I share your concern that it changes the core MM
quite a bit for a corner case that only one architecture cares about. I am
completely biased of course, but I still prefer my patch because, other
than an API change, it keeps the bulk of the madness in
arch/x86/mm/hugetlbpage.c. I am also not concerned with the scalability of
how quickly we can set up page table sharing.

Hugh, I'm afraid you get to choose :)

--
Mel Gorman
SUSE Labs
[PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
And here is my attempt for the fix (Hugh mentioned something similar
earlier but he suggested using special flags in ptes or VMAs). I still
owe a doc update, it hasn't been tested with too many configs, and I
could have missed some definition updates.
I also think that the changelog could be much better; I will add (steal)
the full bug description if people think that this way is worth going
rather than the one suggested by Mel.
To be honest I am not quite happy how I had to pollute generic mm code
with something that is specific to a single architecture.
Mel hammered it with the test case and it survived.
---
From d71cd88da83c669beced2aa752847265e896b89b Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Fri, 20 Jul 2012 15:10:40 +0200
Subject: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs

The primary problem is that huge_pte_offset called from huge_pmd_share
might return a spte which is about to be deallocated, and the caller
doesn't have any way to find this out. This means that huge_pmd_share
might reuse the spte and increase the reference count on its page.
i_mmap_mutex is not sufficient because the spte is still available after
unmap_hugepage_range, and free_pgtables, which takes care of spte
teardown, doesn't use any locking.

This patch addresses the issue by marking the spte's page as being due
for removal (misusing page->mapping for that, because it is not used for
an spte page during its whole life cycle). huge_pte_offset then checks
is_hugetlb_pmd_page_valid and ignores such sptes.

The spte page is invalidated after the last pte is unshared and only if
unmap_vmas is followed by free_pgtables. This is all done under
i_mmap_mutex so we cannot race. The spte page is cleaned up later in
free_pmd_range, after the pud is cleared and before it is handed over to
pmd_free_tlb, which frees the page. A write memory barrier makes sure
that other CPUs always see either the cleared pud (thus a NULL pmd) or
the invalidated spte page.

[motivated by Hugh's idea]
Signed-off-by: Michal Hocko
---
 arch/x86/include/asm/hugetlb.h |   30 ++
 arch/x86/mm/hugetlbpage.c      |    7 ++-
 fs/hugetlbfs/inode.c           |    2 +-
 include/linux/hugetlb.h        |    7 ---
 include/linux/mm.h             |    2 +-
 mm/hugetlb.c                   |   10 ++
 mm/memory.c                    |   17 +++--
 mm/mmap.c                      |    4 ++--
 8 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index 439a9ac..eaff713 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -48,6 +48,36 @@ static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 	return ptep_get_and_clear(mm, addr, ptep);
 }

+static inline bool is_hugetlb_pmd_page_valid(struct page *page)
+{
+	smp_rmb();
+	return page->mapping == NULL;
+}
+
+static inline void invalidate_hugetlb_pmd_page(pmd_t *pmd)
+{
+	struct page *pmd_page = virt_to_page(pmd);
+
+	/* TODO add comment about the race */
+	pmd_page->mapping = (struct address_space *)(-1UL);
+	smp_wmb();
+}
+
+#define ARCH_HAVE_CLEAR_HUGETLB_PMD_PAGE 1
+static inline void clear_hugetlb_pmd_page(pmd_t *pmd)
+{
+	struct page *pmd_page = virt_to_page(pmd);
+	if (!is_hugetlb_pmd_page_valid(pmd_page)) {
+		/*
+		 * Make sure that pud_clear is visible before we remove the
+		 * invalidate flag here so huge_pte_offset either returns NULL
+		 * or invalidated pmd page during the tear down.
+		 */
+		smp_wmb();
+		pmd_page->mapping = NULL;
+	}
+}
+
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
 					 unsigned long addr, pte_t *ptep)
 {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..c040089 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -81,7 +81,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		if (saddr) {
 			spte = huge_pte_offset(svma->vm_mm, saddr);
 			if (spte) {
-				get_page(virt_to_page(spte));
+				struct page *spte_page = virt_to_page(spte);
+				if (!is_hugetlb_pmd_page_valid(spte_page)) {
+					spte = NULL;
+					continue;
+				}
+				get_page(spte_page);
 				break;
 			}
 		}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 001ef01..0d0c235 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -417,7 +417,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)