Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu 02-08-12 14:33:10, Mel Gorman wrote:
> On Thu, Aug 02, 2012 at 02:36:58PM +0200, Michal Hocko wrote:
> > On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> > > On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> > [...]
> > > > On the other hand, mine is more coupled with the sharing code so it
> > > > makes the code easier to follow and also makes the sharing more
> > > > effective because racing processes see pmd populated when checking for
> > > > shareable mappings.
> > >
> > > It could do with a small comment above huge_pmd_share() explaining that
> > > calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
> > > parallel faults missing a sharing opportunity with each other but it's
> > > not mandatory.
> >
> > Sure, that's a good idea. What about the following:
> >
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index 40b2500..51839d1 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> >  }
> >
> >  /*
> > - * search for a shareable pmd page for hugetlb.
> > + * search for a shareable pmd page for hugetlb. In any case calls
> > + * pmd_alloc and returns the corresponding pte. While this not necessary
> > + * for the !shared pmd case because we can allocate the pmd later as
> > + * well it makes the code much cleaner. pmd allocation is essential for
> > + * the shared case though because pud has to be populated inside the
> > + * same i_mmap_mutex section otherwise racing tasks could either miss
> > + * the sharing (see huge_pte_offset) or selected a bad pmd for sharing.
> >  */
> >  static pte_t*
> >  huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >
>
> Looks reasonable to me.

OK, added to the patch. I will send it to Andrew now. Thanks a lot!
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu, Aug 02, 2012 at 02:36:58PM +0200, Michal Hocko wrote:
> On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> > On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> [...]
> > > On the other hand, mine is more coupled with the sharing code so it
> > > makes the code easier to follow and also makes the sharing more
> > > effective because racing processes see pmd populated when checking for
> > > shareable mappings.
> >
> > It could do with a small comment above huge_pmd_share() explaining that
> > calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
> > parallel faults missing a sharing opportunity with each other but it's
> > not mandatory.
>
> Sure, that's a good idea. What about the following:
>
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 40b2500..51839d1 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  }
>
>  /*
> - * search for a shareable pmd page for hugetlb.
> + * search for a shareable pmd page for hugetlb. In any case calls
> + * pmd_alloc and returns the corresponding pte. While this not necessary
> + * for the !shared pmd case because we can allocate the pmd later as
> + * well it makes the code much cleaner. pmd allocation is essential for
> + * the shared case though because pud has to be populated inside the
> + * same i_mmap_mutex section otherwise racing tasks could either miss
> + * the sharing (see huge_pte_offset) or selected a bad pmd for sharing.
>  */
>  static pte_t*
>  huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>

Looks reasonable to me.

--
Mel Gorman
SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> [...]
> > On the other hand, mine is more coupled with the sharing code so it
> > makes the code easier to follow and also makes the sharing more
> > effective because racing processes see pmd populated when checking for
> > shareable mappings.
>
> It could do with a small comment above huge_pmd_share() explaining that
> calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
> parallel faults missing a sharing opportunity with each other but it's
> not mandatory.

Sure, that's a good idea. What about the following:

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 40b2500..51839d1 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 }

 /*
- * search for a shareable pmd page for hugetlb.
+ * search for a shareable pmd page for hugetlb. In any case calls
+ * pmd_alloc and returns the corresponding pte. While this not necessary
+ * for the !shared pmd case because we can allocate the pmd later as
+ * well it makes the code much cleaner. pmd allocation is essential for
+ * the shared case though because pud has to be populated inside the
+ * same i_mmap_mutex section otherwise racing tasks could either miss
+ * the sharing (see huge_pte_offset) or selected a bad pmd for sharing.
  */
 static pte_t*
 huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)

> > So I am more inclined to mine but I don't want to push it because both
> > are good and make sense. What other people think?
>
> I vote yours
>
> Reviewed-by: Mel Gorman mgor...@suse.de

Thanks!

--
Michal Hocko
SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> Hi Larry,
>
> On Wed 01-08-12 11:06:33, Larry Woodman wrote:
> > On 08/01/2012 08:32 AM, Michal Hocko wrote:
> > >
> > > I am really lame :/. The previous patch is wrong as well for the goto out
> > > branch. The updated patch is as follows:
> > This patch worked fine Michal!
>
> Thanks for the good news!
>
> > You and Mel can duke it out over whose is best. :)
>
> The answer is clear here ;) Mel did the hard work of identifying the
> culprit so kudos go to him.

I'm happy once it's fixed!

> I just tried to solve the issue more inside the x86 arch code. The pmd
> allocation outside of the sharing code seemed strange to me for quite
> some time; I just underestimated its consequences completely.
>
> Both approaches have some pros. Mel's patch is more resistant to other
> not-yet-discovered races and it also makes the arch independent code
> more robust because relying on the pmd trick is not ideal.

If there is another race then it is best to hear about it, understand it
and fix the underlying problem. More importantly, your patch ensures
that two processes faulting at the same time will share page tables with
each other. My patch only noted that this missed opportunity could cause
problems with fork.

> On the other hand, mine is more coupled with the sharing code so it
> makes the code easier to follow and also makes the sharing more
> effective because racing processes see pmd populated when checking for
> shareable mappings.

It could do with a small comment above huge_pmd_share() explaining that
calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
parallel faults missing a sharing opportunity with each other but it's
not mandatory.

> So I am more inclined to mine but I don't want to push it because both
> are good and make sense. What other people think?

I vote yours

Reviewed-by: Mel Gorman mgor...@suse.de

--
Mel Gorman
SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
Hi Larry,

On Wed 01-08-12 11:06:33, Larry Woodman wrote:
> On 08/01/2012 08:32 AM, Michal Hocko wrote:
> >
> > I am really lame :/. The previous patch is wrong as well for the goto out
> > branch. The updated patch is as follows:
> This patch worked fine Michal!

Thanks for the good news!

> You and Mel can duke it out over whose is best. :)

The answer is clear here ;) Mel did the hard work of identifying the
culprit so kudos go to him. I just tried to solve the issue more inside
the x86 arch code. The pmd allocation outside of the sharing code seemed
strange to me for quite some time; I just underestimated its
consequences completely.

Both approaches have some pros. Mel's patch is more resistant to other
not-yet-discovered races and it also makes the arch independent code
more robust because relying on the pmd trick is not ideal.

On the other hand, mine is more coupled with the sharing code so it
makes the code easier to follow and also makes the sharing more
effective because racing processes see pmd populated when checking for
shareable mappings.

So I am more inclined to mine but I don't want to push it because both
are good and make sense. What other people think?

> Larry

--
Michal Hocko
SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 08/01/2012 08:32 AM, Michal Hocko wrote:
>
> I am really lame :/. The previous patch is wrong as well for the goto out
> branch. The updated patch is as follows:

This patch worked fine Michal!  You and Mel can duke it out over whose
is best. :)

Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Wed 01-08-12 10:20:36, Michal Hocko wrote:
> On Tue 31-07-12 22:45:43, Larry Woodman wrote:
> > On 07/31/2012 04:06 PM, Michal Hocko wrote:
> > > On Tue 31-07-12 13:49:21, Larry Woodman wrote:
> > > > On 07/31/2012 08:46 AM, Mel Gorman wrote:
> > > > > Fundamentally I think the problem is that we are not correctly
> > > > > detecting that page table sharing took place during huge_pte_alloc().
> > > > > This patch is longer and makes an API change but if I'm right, it
> > > > > addresses the underlying problem. The first VM_MAYSHARE patch is
> > > > > still necessary but would you mind testing this on top please?
> > > > Hi Mel, yes this does work just fine. It ran for hours without a panic
> > > > so I'll Ack this one if you send it to the list.
> > > Hi Larry, thanks for testing! I have a different patch which tries to
> > > address this very same issue. I am not saying it is better or that it
> > > should be merged instead of Mel's one but I would be really happy if
> > > you could give it a try. We can discuss (dis)advantages of both
> > > approaches later.
> > >
> > > Thanks!
> >
> > Hi Michal, the system hung when I tested this patch on top of the
> > latest 3.5 kernel. I won't have AltSysrq access to the system until
> > tomorrow AM.
>
> Please hold on. The patch is crap. I forgot about
> 	if (!vma_shareable(vma, addr))
> 		return;
> case so somebody got an uninitialized pmd. The patch below handles
> that.

I am really lame :/. The previous patch is wrong as well for the goto out
branch. The updated patch is as follows:
---
From 886b79204491b500437e156aa8eb35e776a4bb07 Mon Sep 17 00:00:00 2001
From: Michal Hocko mho...@suse.cz
Date: Tue, 31 Jul 2012 15:00:26 +0200
Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd

Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straight-forward but hugetlbfs page table sharing is different. The
page table pages at the PMD level are reference counted while the
mapcount remains the same.

If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:

[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002
[ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0
[ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00
[ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003
[ 1106.253876] R10: 000d R11: R12: 880428708150
[ 1106.261826] R13: 880428708150 R14: R15: ea0006b8
[ 1106.269780] FS: () GS:88042fd6() knlGS:
[ 1106.278794] CS: 0010 DS: ES: CR0: 80050033
[ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0
[ 1106.293149] DR0: DR1: DR2:
[ 1106.301097] DR3: DR6: 0ff0 DR7: 0400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690]  ea0006b8 880428973bc8 8112d040
[ 1106.328958]  880428973bc8 02ab 02a0 880428973c18
[ 1106.337234]  880428973cc8 8125b405 88040001
[ 1106.345513] Call Trace:
[ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80
[ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0
[ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615] [811ab1af] evict+0x9f/0x1b0
[ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0
[ 1106.379773] [811ab4de] iput+0x3e/0x50
[ 1106.384922] [811a8e18] d_kill+0xf8/0x110
[ 1106.390356] [811a8f12] dput+0xe2/0x1b0
[ 1106.395595] [81193612] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is
that if the PMD page is the same, they must be shared. This assumes
that the sharing is between the parent and child. However, if the
sharing is with a different process entirely then this check fails as
in this diagram:

	parent
	   |
	   ---->pmd
	src_pte--------> data page
	                   ^
	other---->pmd------|
	           ^
	child------|
	dst_pte

For this situation to occur, it must be possible for Parent and Other
to have faulted and failed to share page tables with each other. This
is possible due to the following style of race.

PROC A
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 22:45:43, Larry Woodman wrote:
> On 07/31/2012 04:06 PM, Michal Hocko wrote:
> > On Tue 31-07-12 13:49:21, Larry Woodman wrote:
> > > On 07/31/2012 08:46 AM, Mel Gorman wrote:
> > > > Fundamentally I think the problem is that we are not correctly
> > > > detecting that page table sharing took place during huge_pte_alloc().
> > > > This patch is longer and makes an API change but if I'm right, it
> > > > addresses the underlying problem. The first VM_MAYSHARE patch is
> > > > still necessary but would you mind testing this on top please?
> > > Hi Mel, yes this does work just fine. It ran for hours without a panic
> > > so I'll Ack this one if you send it to the list.
> > Hi Larry, thanks for testing! I have a different patch which tries to
> > address this very same issue. I am not saying it is better or that it
> > should be merged instead of Mel's one but I would be really happy if you
> > could give it a try. We can discuss (dis)advantages of both approaches
> > later.
> >
> > Thanks!
>
> Hi Michal, the system hung when I tested this patch on top of the
> latest 3.5 kernel. I won't have AltSysrq access to the system until
> tomorrow AM.

Please hold on. The patch is crap. I forgot about
	if (!vma_shareable(vma, addr))
		return;
case so somebody got an uninitialized pmd. The patch below handles
that.

Sorry about that and thanks to Mel for noticing this.
---
From 14d1cfcb7e19369653e2367c38204b1012a398f9 Mon Sep 17 00:00:00 2001
From: Michal Hocko mho...@suse.cz
Date: Tue, 31 Jul 2012 15:00:26 +0200
Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd

Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straight-forward but hugetlbfs page table sharing is different. The
page table pages at the PMD level are reference counted while the
mapcount remains the same.

If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:

[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002
[ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0
[ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00
[ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003
[ 1106.253876] R10: 000d R11: R12: 880428708150
[ 1106.261826] R13: 880428708150 R14: R15: ea0006b8
[ 1106.269780] FS: () GS:88042fd6() knlGS:
[ 1106.278794] CS: 0010 DS: ES: CR0: 80050033
[ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0
[ 1106.293149] DR0: DR1: DR2:
[ 1106.301097] DR3: DR6: 0ff0 DR7: 0400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690]  ea0006b8 880428973bc8 8112d040
[ 1106.328958]  880428973bc8 02ab 02a0 880428973c18
[ 1106.337234]  880428973cc8 8125b405 88040001
[ 1106.345513] Call Trace:
[ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80
[ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0
[ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615] [811ab1af] evict+0x9f/0x1b0
[ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0
[ 1106.379773] [811ab4de] iput+0x3e/0x50
[ 1106.384922] [811a8e18] d_kill+0xf8/0x110
[ 1106.390356] [811a8f12] dput+0xe2/0x1b0
[ 1106.395595] [81193612] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is
that if the PMD page is the same, they must be shared. This assumes
that the sharing is between the parent and child. However, if the
sharing is with a different process entirely then this check fails as
in this diagram:

	parent
	   |
	   ---->pmd
	src_pte--------> data page
	                   ^
	other---->pmd------|
	           ^
	child------|
	dst_pte

For this situation to occur, it must be possible for Parent and Other
to have faulted and failed to share page tables with each other. This
is possible due to the following style of race.

PROC A				PROC B
copy_hugetlb_page_range		copy_hugetlb_page_range
src_pte == huge_pte_offset
				src_pte
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 22:45:43, Larry Woodman wrote: On 07/31/2012 04:06 PM, Michal Hocko wrote: On Tue 31-07-12 13:49:21, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! Hi Michal, the system hung when I tested this patch on top of the latest 3.5 kernel. I wont have AltSysrq access to the system until tomorrow AM. Please hold on. The patch is crap. I forgot about if (!vma_shareable(vma, addr)) return; case so somebody got an uninitialized pmd. The patch bellow handles that. Sorry about that and thanks to Mel to notice this. --- From 14d1cfcb7e19369653e2367c38204b1012a398f9 Mon Sep 17 00:00:00 2001 From: Michal Hocko mho...@suse.cz Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! 
[ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram. 
parent
  |
  |--------> pmd
src_pte -------------> data page
                           ^
other -------> pmd --------|
                ^
child ----------|
dst_pte

For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race.

PROC A                              PROC B
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Wed 01-08-12 10:20:36, Michal Hocko wrote: [...] Please hold on. The patch is crap. I forgot about the if (!vma_shareable(vma, addr)) return; case so somebody got an uninitialized pmd. The patch below handles that. I am really lame :/. The previous patch is wrong as well for the goto out branch. The updated patch follows: --- From 886b79204491b500437e156aa8eb35e776a4bb07 Mon Sep 17 00:00:00 2001 From: Michal Hocko mho...@suse.cz Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd [... changelog and oops trace identical to the previous version, trimmed ...]
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 08/01/2012 08:32 AM, Michal Hocko wrote: I am really lame :/. The previous patch is wrong as well for the goto out branch. The updated patch follows: This patch worked fine Michal! You and Mel can duke it out over whose is best. :) Larry -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 04:06 PM, Michal Hocko wrote: [...] Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! Hi Michal, the system hung when I tested this patch on top of the latest 3.5 kernel. I won't have AltSysrq access to the system until tomorrow AM. I'll retry this kernel and get AltSysrq output and let you know what's happening in the morning. Larry --- From 8cbf3bd27125fc0a2a46cd5b1085d9e63f9c01fd Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd [... changelog and oops trace identical to the version quoted earlier, trimmed ...]
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 04:06 PM, Michal Hocko wrote: On Tue 31-07-12 13:49:21, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! Sure, it will take me a day since I keep losing the hardware to reproduce the problem with. I'll report back tomorrow. Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 13:49:21, Larry Woodman wrote: > On 07/31/2012 08:46 AM, Mel Gorman wrote: > > > >Fundamentally I think the problem is that we are not correctly detecting > >that page table sharing took place during huge_pte_alloc(). This patch is > >longer and makes an API change but if I'm right, it addresses the underlying > >problem. The first VM_MAYSHARE patch is still necessary but would you mind > >testing this on top please? > Hi Mel, yes this does work just fine. It ran for hours without a panic so > I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! --- >From 8cbf3bd27125fc0a2a46cd5b1085d9e63f9c01fd Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd [... changelog, oops trace, and race diagram identical to the version quoted earlier, trimmed ...]
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Larry ---8<--- mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a process's address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman: [... oops trace identical to the one quoted earlier, trimmed ...] During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is: if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram.

parent
  |
  |--------> pmd
src_pte -------------> data page
                           ^
other -------> pmd --------|
                ^
child ----------|
dst_pte

For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race.
PROC A                              PROC B
copy_hugetlb_page_range             copy_hugetlb_page_range
  src_pte == huge_pte_offset          src_pte == huge_pte_offset
  !src_pte so no sharing              !src_pte so no sharing

(time passes)

hugetlb_fault                       hugetlb_fault
  huge_pte_alloc                      huge_pte_alloc
    huge_pmd_share                      huge_pmd_share
      LOCK(i_mmap_mutex)
      find nothing, no sharing
      UNLOCK(i_mmap_mutex)
                                          LOCK(i_mmap_mutex)
                                          find nothing, no sharing
                                          UNLOCK(i_mmap_mutex)
    pmd_alloc                           pmd_alloc
    LOCK(instantiation_mutex)
    fault
    UNLOCK(instantiation_mutex)
                                        LOCK(instantiation_mutex)
                                        fault
                                        UNLOCK(instantiation_mutex)

These two processes are not going to
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a process's address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel -- All rights reversed
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, Jul 31, 2012 at 09:07:14AM -0400, Larry Woodman wrote: > On 07/31/2012 08:46 AM, Mel Gorman wrote: > >On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote: > >>> > >>>That is a surprise. Can you try your test case on 3.4 and tell us if the > >>>patch fixes the problem there? I would like to rule out the possibility > >>>that the locking rules are slightly different in RHEL. If it hits on 3.4 > >>>then it's also possible you are seeing a different bug, more on this later. > >>> > >>Sorry for the delay Mel, here is the BUG() traceback from the 3.4 > >>kernel with your > >>patches: > >> > >>[... oops trace identical to the one quoted earlier, trimmed ...] > >Thanks, looks very similar. > > > >>[ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW > >>3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 > >You say this was a 3.4 kernel but the message says 3.3. Probably not > >relevant, just interesting. > > > Oh, sorry I posted the wrong traceback. I tested both 3.3 & 3.4 and > had the same results. > I'll do it again and post the 3.4 traceback for you, It'll probably be the same. The likelihood is that the bug is really old and did not change between 3.3 and 3.4. I mentioned it in case you accidentally tested with an old kernel that was not patched or patched with something different. I considered this to be very unlikely though and you already said that RHEL was affected so it's probably the same bug seen in all three. -- Mel Gorman SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 13:46:50, Mel Gorman wrote: [...] > mm: hugetlbfs: Correctly detect if page tables have just been shared > > [... changelog, oops trace, and race diagram identical to the version quoted earlier, trimmed ...] Makes sense. I was wondering how the child could share with somebody else and pmd would be different from the parent, because the parent is blocked by mmap_sem (for write) so no concurrent faults are allowed. Other process should see the parent's pmd if it is shareable. I obviously
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote: That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel with your patches: [... oops trace identical to the one quoted earlier, trimmed ...] Thanks, looks very similar. [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 You say this was a 3.4 kernel but the message says 3.3. Probably not relevant, just interesting. Oh, sorry I posted the wrong traceback. I tested both 3.3 & 3.4 and had the same results. I'll do it again and post the 3.4 traceback for you, Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote: > > > >That is a surprise. Can you try your test case on 3.4 and tell us if the > >patch fixes the problem there? I would like to rule out the possibility > >that the locking rules are slightly different in RHEL. If it hits on 3.4 > >then it's also possible you are seeing a different bug, more on this later. > > > > Sorry for the delay Mel, here is the BUG() traceback from the 3.4 > kernel with your > patches: > > > [ 1106.156569] [ cut here ] > [ 1106.161731] kernel BUG at mm/filemap.c:135! > [ 1106.166395] invalid opcode: [#1] SMP > [ 1106.170975] CPU 22 > [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc > dcdbas microcode pcspkr acpi_pad acpi] > [ 1106.201770] Thanks, looks very similar. > [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW > 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 You say this was a 3.4 kernel but the message says 3.3. Probably not relevant, just interesting. > I'll see if I can distribute the program that causes the panic, I > don't have source, only binary. > > Larry > > > BTW, the only way I've been able to get the panic to stop is: > > -- > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index c36febb..cc023b8 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct > *dst, struct mm_struct *src, > goto nomem; > > /* If the pagetables are shared don't copy or take > references */ > - if (dst_pte == src_pte) > + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) > continue; > spin_lock(>page_table_lock); I think this is papering over the problem. Basically this check works because after page table sharing, the parent and child are pointing to the same data page even if they are not sharing page tables. As it's during fork(), we cannot have faulted in parallel so the populated PTE must be due to page table sharing. If the parent has not faulted the page, then sharing is not attempted and again the problem is avoided. It would be a new instance though of hugetlbfs just happening to work because of its limitations - in this case, it works because we only share page tables for MAP_SHARED. Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? ---8<--- mm: hugetlbfs: Correctly detect if page tables have just been shared [... changelog and oops trace identical to the version quoted earlier, trimmed ...]
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, Jul 31, 2012 at 3:11 AM, Larry Woodman wrote: > [ 1106.156569] [ cut here ] > [ 1106.161731] kernel BUG at mm/filemap.c:135! > [... register dump identical to the one quoted earlier, trimmed ...] > [ 1106.345513] Call Trace: > [ 1106.348235] [] delete_from_page_cache+0x40/0x80 > [ 1106.355128] [] truncate_hugepages+0x115/0x1f0 > [ 1106.361826] [] hugetlbfs_evict_inode+0x18/0x30 > [ 1106.368615] [] evict+0x9f/0x1b0 > [ 1106.373951] [] iput_final+0xe3/0x1e0 > [ 1106.379773] [] iput+0x3e/0x50 > [ 1106.384922] [] d_kill+0xf8/0x110 > [ 1106.390356] [] dput+0xe2/0x1b0 > [ 1106.395595] [] __fput+0x162/0x240 > [ 1106.401124] [] fput+0x25/0x30 > [ 1106.406265] [] filp_close+0x63/0x90 > [ 1106.411997] [] put_files_struct+0x7f/0xf0 > [ 1106.418302] [] exit_files+0x4c/0x60 > [ 1106.424025] [] do_exit+0x1a7/0x470 > [ 1106.429652] [] do_group_exit+0x55/0xd0 > [ 1106.435665] [] sys_exit_group+0x17/0x20 > [ 1106.441777] [] system_call_fastpath+0x16/0x1b Perhaps we have to remove rmap when evicting inode. --- a/fs/hugetlbfs/inode.c Tue Jul 31 19:59:32 2012 +++ b/fs/hugetlbfs/inode.c Tue Jul 31 20:04:14 2012 @@ -390,9 +390,11 @@ static void truncate_hugepages(struct in hugetlb_unreserve_pages(inode, start, freed); } +static int hugetlb_vmtruncate(struct inode *, loff_t); + static void hugetlbfs_evict_inode(struct inode *inode) { - truncate_hugepages(inode, 0); + hugetlb_vmtruncate(inode, 0); clear_inode(inode); } --
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote:

SNIP

That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later.

Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel with your patches:

[ 1106.156569] [ cut here ]
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]

Thanks, looks very similar.

[ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2

You say this was a 3.4 kernel but the message says 3.3. Probably not relevant, just interesting.

I'll see if I can distribute the program that causes the panic, I don't have source, only binary.

Larry

BTW, the only way I've been able to get the panic to stop is:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c36febb..cc023b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			goto nomem;
 		/* If the pagetables are shared don't copy or take references */
-		if (dst_pte == src_pte)
+		if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte)
 			continue;
 		spin_lock(&dst->page_table_lock);

I think this is papering over the problem. Basically this check works because after page table sharing, the parent and child are pointing to the same data page even if they are not sharing page tables. As it's during fork(), we cannot have faulted in parallel so the populated PTE must be due to page table sharing. If the parent has not faulted the page, then sharing is not attempted and again the problem is avoided.
It would be a new instance though of hugetlbfs just happening to work because of its limitations - in this case, it works because we only share page tables for MAP_SHARED. Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? ---8--- mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. 
PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote: SNIP That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel with your patches: [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] Thanks, looks very similar. [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 You say this was a 3.4 kernel but the message says 3.3. Probably not relevant, just interesting.

Oh, sorry I posted the wrong traceback. I tested both 3.3 and 3.4 and had the same results. I'll do it again and post the 3.4 traceback for you,

Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 13:46:50, Mel Gorman wrote: [...] mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] 
iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram.

    parent ----> pmd
                   src_pte --------> data page
                                         ^
    other  ----> other-pmd --------------+
                     ^
    child -----------+
                   dst_pte

For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race.

    PROC A                                PROC B
    copy_hugetlb_page_range               copy_hugetlb_page_range
      src_pte == huge_pte_offset            src_pte == huge_pte_offset
      !src_pte so no sharing                !src_pte so no sharing

    (time passes)

    hugetlb_fault                         hugetlb_fault
      huge_pte_alloc                        huge_pte_alloc
        huge_pmd_share                        huge_pmd_share
          LOCK(i_mmap_mutex)
          find nothing, no sharing
          UNLOCK(i_mmap_mutex)
                                              LOCK(i_mmap_mutex)
                                              find nothing, no sharing
                                              UNLOCK(i_mmap_mutex)
        pmd_alloc                           pmd_alloc
      LOCK(instantiation_mutex)
      fault
      UNLOCK(instantiation_mutex)
                                          LOCK(instantiation_mutex)
                                          fault
                                          UNLOCK(instantiation_mutex)

Makes sense. I was wondering how the child could share with somebody else and pmd would be different from the parent because parent is blocked by mmap_sem (for write) so no concurrent faults are allowed. Other process
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, Jul 31, 2012 at 09:07:14AM -0400, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote: SNIP That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel with your patches: [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] Thanks, looks very similar. [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 You say this was a 3.4 kernel but the message says 3.3. Probably not relevant, just interesting. Oh, sorry I posted the wrong traceback. I tested both 3.3 and 3.4 and had the same results. I'll do it again and post the 3.4 traceback for you,

It'll probably be the same. The likelihood is that the bug is really old and did not change between 3.3 and 3.4. I mentioned it in case you accidentally tested with an old kernel that was not patched or patched with something different. I considered this to be very unlikely though and you already said that RHEL was affected so it's probably the same bug seen in all three.

-- Mel Gorman SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman Signed-off-by: Mel Gorman mgor...@suse.de

Reviewed-by: Rik van Riel r...@redhat.com

-- All rights reversed
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Larry ---8--- mm: hugetlbfs: Correctly detect if page tables have just been shared Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. 
PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram. parent | pmd src_pte-- data page ^ other-pmd| ^ child---| dst_pte For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race. 
    PROC A                                PROC B
    copy_hugetlb_page_range               copy_hugetlb_page_range
      src_pte == huge_pte_offset            src_pte == huge_pte_offset
      !src_pte so no sharing                !src_pte so no sharing

    (time passes)

    hugetlb_fault                         hugetlb_fault
      huge_pte_alloc                        huge_pte_alloc
        huge_pmd_share                        huge_pmd_share
          LOCK(i_mmap_mutex)
          find nothing, no sharing
          UNLOCK(i_mmap_mutex)
                                              LOCK(i_mmap_mutex)
                                              find nothing, no sharing
                                              UNLOCK(i_mmap_mutex)
        pmd_alloc                           pmd_alloc
      LOCK(instantiation_mutex)
      fault
      UNLOCK(instantiation_mutex)
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 31-07-12 13:49:21, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! --- From 8cbf3bd27125fc0a2a46cd5b1085d9e63f9c01fd Mon Sep 17 00:00:00 2001 From: Michal Hocko mho...@suse.cz Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. 
PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram. parent | pmd src_pte-- data page ^ other-pmd| ^ child---| dst_pte For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race. 
    PROC A                                PROC B
    copy_hugetlb_page_range               copy_hugetlb_page_range
      src_pte == huge_pte_offset            src_pte == huge_pte_offset
      !src_pte so no sharing                !src_pte so no sharing

    (time passes)

    hugetlb_fault                         hugetlb_fault
      huge_pte_alloc                        huge_pte_alloc
        huge_pmd_share                        huge_pmd_share
          LOCK(i_mmap_mutex)
          find nothing, no
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 04:06 PM, Michal Hocko wrote: On Tue 31-07-12 13:49:21, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks!

Sure, it will take me a day since I keep losing the hardware to reproduce the problem with. I'll report back tomorrow.

Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/31/2012 04:06 PM, Michal Hocko wrote: On Tue 31-07-12 13:49:21, Larry Woodman wrote: On 07/31/2012 08:46 AM, Mel Gorman wrote: Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please? Hi Mel, yes this does work just fine. It ran for hours without a panic so I'll Ack this one if you send it to the list. Hi Larry, thanks for testing! I have a different patch which tries to address this very same issue. I am not saying it is better or that it should be merged instead of Mel's one but I would be really happy if you could give it a try. We can discuss (dis)advantages of both approaches later. Thanks! Hi Michal, the system hung when I tested this patch on top of the latest 3.5 kernel. I wont have AltSysrq access to the system until tomorrow AM. I'll retry this kernel and get AltSysrq output and let you know whats happening in the morning. Larry --- From 8cbf3bd27125fc0a2a46cd5b1085d9e63f9c01fd Mon Sep 17 00:00:00 2001 From: Michal Hockomho...@suse.cz Date: Tue, 31 Jul 2012 15:00:26 +0200 Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd Each page mapped in a processes address space must be correctly accounted for in _mapcount. Normally the rules for this are straight-forward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! 
[ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: GW3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram. 
    parent ----> pmd
                   src_pte --------> data page
                                         ^
    other  ----> other-pmd --------------+
                     ^
    child -----------+
                   dst_pte

For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race.

    PROC A                                PROC B
    copy_hugetlb_page_range               copy_hugetlb_page_range
      src_pte == huge_pte_offset            src_pte == huge_pte_offset
      !src_pte so no sharing
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/27/2012 06:23 AM, Mel Gorman wrote: On Thu, Jul 26, 2012 at 11:48:56PM -0400, Larry Woodman wrote: On 07/26/2012 02:37 PM, Rik van Riel wrote: On 07/23/2012 12:04 AM, Hugh Dickins wrote: I spent hours trying to dream up a better patch, trying various approaches. I think I have a nice one now, what do you think? And more importantly, does it work? I have not tried to test it at all, that I'm hoping to leave to you, I'm sure you'll attack it with gusto! If you like it, please take it over and add your comments and signoff and send it in. The second part won't come up in your testing, and could be made a separate patch if you prefer: it's a related point that struck me while I was playing with a different approach. I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but that too would be unfair. Subject-to-your-testing- Signed-off-by: Hugh Dickins hu...@google.com This patch looks good to me. Larry, does Hugh's patch survive your testing? Like I said earlier, no. That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel with your patches: [ 1106.156569] [ cut here ] [ 1106.161731] kernel BUG at mm/filemap.c:135! [ 1106.166395] invalid opcode: [#1] SMP [ 1106.170975] CPU 22 [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] [ 1106.201770] [ 1106.203426] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. 
PowerEdge R620/07NDJ2 [ 1106.213822] RIP: 0010:[8112cfed] [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.224117] RSP: 0018:880428973b88 EFLAGS: 00010002 [ 1106.230032] RAX: 0001 RBX: ea0006b8 RCX: ffb0 [ 1106.237979] RDX: 00016df1 RSI: 0009 RDI: 88043ffd9e00 [ 1106.245927] RBP: 880428973b98 R08: 0050 R09: 0003 [ 1106.253876] R10: 000d R11: R12: 880428708150 [ 1106.261826] R13: 880428708150 R14: R15: ea0006b8 [ 1106.269780] FS: () GS:88042fd6() knlGS: [ 1106.278794] CS: 0010 DS: ES: CR0: 80050033 [ 1106.285193] CR2: 003a1d38c4a8 CR3: 0187d000 CR4: 000406e0 [ 1106.293149] DR0: DR1: DR2: [ 1106.301097] DR3: DR6: 0ff0 DR7: 0400 [ 1106.309046] Process mpitest (pid: 18001, threadinfo 880428972000, task 880428b5cc20) [ 1106.318447] Stack: [ 1106.320690] ea0006b8 880428973bc8 8112d040 [ 1106.328958] 880428973bc8 02ab 02a0 880428973c18 [ 1106.337234] 880428973cc8 8125b405 88040001 [ 1106.345513] Call Trace: [ 1106.348235] [8112d040] delete_from_page_cache+0x40/0x80 [ 1106.355128] [8125b405] truncate_hugepages+0x115/0x1f0 [ 1106.361826] [8125b4f8] hugetlbfs_evict_inode+0x18/0x30 [ 1106.368615] [811ab1af] evict+0x9f/0x1b0 [ 1106.373951] [811ab3a3] iput_final+0xe3/0x1e0 [ 1106.379773] [811ab4de] iput+0x3e/0x50 [ 1106.384922] [811a8e18] d_kill+0xf8/0x110 [ 1106.390356] [811a8f12] dput+0xe2/0x1b0 [ 1106.395595] [81193612] __fput+0x162/0x240 [ 1106.401124] [81193715] fput+0x25/0x30 [ 1106.406265] [8118f6c3] filp_close+0x63/0x90 [ 1106.411997] [8106058f] put_files_struct+0x7f/0xf0 [ 1106.418302] [8106064c] exit_files+0x4c/0x60 [ 1106.424025] [810629d7] do_exit+0x1a7/0x470 [ 1106.429652] [81062cf5] do_group_exit+0x55/0xd0 [ 1106.435665] [81062d87] sys_exit_group+0x17/0x20 [ 1106.441777] [815d0229] system_call_fastpath+0x16/0x1b [ 1106.448474] Code: 66 0f 1f 44 00 00 48 8b 47 08 48 8b 00 48 8b 40 28 44 8b 80 38 03 00 00 45 85 c0 0f [ 1106.470022] RIP [8112cfed] __delete_from_page_cache+0x15d/0x170 [ 1106.477693] RSP 880428973b88 -- I'll see if I can distribute the program 
that causes the panic, I don't have source, only binary. Larry BTW, the only way I've been able to get the panic to stop is: -- diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c36febb..cc023b8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, goto nomem; /* If the pagetables are shared don't copy or take references */ - if (dst_pte == src_pte) + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) continue; spin_lock(&dst->page_table_lock);
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/27/2012 06:23 AM, Mel Gorman wrote: On Thu, Jul 26, 2012 at 11:48:56PM -0400, Larry Woodman wrote: On 07/26/2012 02:37 PM, Rik van Riel wrote: On 07/23/2012 12:04 AM, Hugh Dickins wrote: I spent hours trying to dream up a better patch, trying various approaches. I think I have a nice one now, what do you think? And more importantly, does it work? I have not tried to test it at all, that I'm hoping to leave to you, I'm sure you'll attack it with gusto! If you like it, please take it over and add your comments and signoff and send it in. The second part won't come up in your testing, and could be made a separate patch if you prefer: it's a related point that struck me while I was playing with a different approach. I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but that too would be unfair. Subject-to-your-testing- Signed-off-by: Hugh Dickins This patch looks good to me. Larry, does Hugh's patch survive your testing? Like I said earlier, no. That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. Sure, it will take me a little while because the machine is shared between several users. However, I finally set up a reproducer that only takes a few seconds on a large system and this totally fixes the problem: The other possibility is that your reproducer case is triggering a different race to mine. Would it be possible to post? Let me ask, I only have the binary and don't know if it's OK to distribute so I don't know exactly what is going on. I did some tracing and saw forking, group exits, multi-threading, hugetlbfs file creation, mmap'ng munmap'ng & deleting the hugetlbfs files. 
- diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c36febb..cc023b8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, goto nomem; /* If the pagetables are shared don't copy or take references */ - if (dst_pte == src_pte) + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) continue; spin_lock(&dst->page_table_lock); --- When we compare what the src_pte & dst_pte point to instead of their addresses everything works, The dst_pte and src_pte are pointing to the PMD page though which is what we're meant to be checking. Your patch appears to change that to check if they are sharing data which is quite different. This is functionally similar to if you just checked VM_MAYSHARE at the start of the function and bailed if so. The PTEs would be populated at fault time instead. I suspect there is a missing memory barrier somewhere ??? Possibly but hard to tell whether it's barriers that are the real problem during fork. The copy routine is suspicious. On the barrier side - in normal PTE alloc routines there is a write barrier which is documented in __pte_alloc. If hugepage table sharing is successful, there is no similar barrier in huge_pmd_share before the PUD is populated. By rights, there should be a smp_wmb() before the page table spinlock is taken in huge_pmd_share(). The lack of a write barrier leads to possible snarls between fork() and fault. Take three processes, parent, child and other. Parent is forking to create child. Other is calling fault. Other faults hugetlb_fault()->huge_pte_alloc->allocate a PMD (write barrier) It is about to enter hugetlb_no_fault() Parent forks() runs at the same time Child shares a page table page but NOT with the forking process (dst_pte != src_pte) and calls huge_pte_offset. 
As it's not reading the contents of the PMD page, there is no implicit read barrier to pair with the write barrier from hugetlb_fault that updates the PMD page and they are not serialised by the page table lock. Hard to see exactly where that would cause a problem though. Thing is, in this scenario I think it's possible that page table sharing is not correctly detected by that dst_pte == src_pte check. dst_pte != src_pte but that does not mean it's not sharing with somebody! If it's sharing and it falls through then it copies the src PTE even though the dst PTE could already be populated and updates the mapcount accordingly. That would be a mess in its own right. I think this is exactly what is happening. I'll put more cave-man debugging code in and let you know. Larry There might be two bugs here.
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu, Jul 26, 2012 at 11:48:56PM -0400, Larry Woodman wrote: > On 07/26/2012 02:37 PM, Rik van Riel wrote: > >On 07/23/2012 12:04 AM, Hugh Dickins wrote: > > > >>I spent hours trying to dream up a better patch, trying various > >>approaches. I think I have a nice one now, what do you think? And > >>more importantly, does it work? I have not tried to test it at all, > >>that I'm hoping to leave to you, I'm sure you'll attack it with gusto! > >> > >>If you like it, please take it over and add your comments and signoff > >>and send it in. The second part won't come up in your testing, > >>and could > >>be made a separate patch if you prefer: it's a related point that struck > >>me while I was playing with a different approach. > >> > >>I'm sorely tempted to leave a dangerous pair of eyes off the Cc, > >>but that too would be unfair. > >> > >>Subject-to-your-testing- > >>Signed-off-by: Hugh Dickins > > > >This patch looks good to me. > > > >Larry, does Hugh's patch survive your testing? > > > > > > Like I said earlier, no. That is a surprise. Can you try your test case on 3.4 and tell us if the patch fixes the problem there? I would like to rule out the possibility that the locking rules are slightly different in RHEL. If it hits on 3.4 then it's also possible you are seeing a different bug, more on this later. > However, I finally set up a reproducer > that only takes a few seconds > on a large system and this totally fixes the problem: > The other possibility is that your reproducer case is triggering a different race to mine. Would it be possible to post? 
> - > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index c36febb..cc023b8 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct > *dst, struct mm_struct *src, > goto nomem; > > /* If the pagetables are shared don't copy or take references > */ > - if (dst_pte == src_pte) > + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) > continue; > > spin_lock(&dst->page_table_lock); > --- > > When we compare what the src_pte & dst_pte point to instead of their > addresses everything works, The dst_pte and src_pte are pointing to the PMD page though which is what we're meant to be checking. Your patch appears to change that to check if they are sharing data which is quite different. This is functionally similar to if you just checked VM_MAYSHARE at the start of the function and bailed if so. The PTEs would be populated at fault time instead. > I suspect there is a missing memory barrier somewhere ??? > Possibly but hard to tell whether it's barriers that are the real problem during fork. The copy routine is suspicious. On the barrier side - in normal PTE alloc routines there is a write barrier which is documented in __pte_alloc. If hugepage table sharing is successful, there is no similar barrier in huge_pmd_share before the PUD is populated. By rights, there should be a smp_wmb() before the page table spinlock is taken in huge_pmd_share(). The lack of a write barrier leads to possible snarls between fork() and fault. Take three processes, parent, child and other. Parent is forking to create child. Other is calling fault. Other faults hugetlb_fault()->huge_pte_alloc->allocate a PMD (write barrier) It is about to enter hugetlb_no_fault() Parent forks() runs at the same time Child shares a page table page but NOT with the forking process (dst_pte != src_pte) and calls huge_pte_offset. 
As it's not reading the contents of the PMD page, there is no implicit read barrier to pair with the write barrier from hugetlb_fault that updates the PMD page and they are not serialised by the page table lock. Hard to see exactly where that would cause a problem though. Thing is, in this scenario I think it's possible that page table sharing is not correctly detected by that dst_pte == src_pte check. dst_pte != src_pte but that does not mean it's not sharing with somebody! If it's sharing and it falls through then it copies the src PTE even though the dst PTE could already be populated and updates the mapcount accordingly. That would be a mess in its own right. There might be two bugs here. -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
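[Editorial note: Mel's suggested write barrier could be sketched as the following untested hunk against the 3.4-era arch/x86/mm/hugetlbpage.c. The surrounding context lines and exact placement are reproduced for illustration only; this is not a patch from the thread.]

```diff
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ huge_pmd_share()
 	if (!spte)
 		goto out;
 
+	/*
+	 * Ensure the writes initialising the shared PMD page are visible
+	 * before the PUD is populated and other tasks can reach the page
+	 * table via huge_pte_offset().  Analogous to the smp_wmb() that
+	 * __pte_alloc() documents for normal page table allocation.
+	 */
+	smp_wmb();
 	spin_lock(&mm->page_table_lock);
 	if (pud_none(*pud))
 		pud_populate(mm, pud,
 			     (pmd_t *)((unsigned long)spte & PAGE_MASK));
```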
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/26/2012 11:48 PM, Larry Woodman wrote: Mel, did you see this??? Larry This patch looks good to me. Larry, does Hugh's patch survive your testing? Like I said earlier, no. However, I finally set up a reproducer that only takes a few seconds on a large system and this totally fixes the problem: - diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c36febb..cc023b8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, goto nomem; /* If the pagetables are shared don't copy or take references */ - if (dst_pte == src_pte) + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) continue; spin_lock(&dst->page_table_lock); --- When we compare what the src_pte & dst_pte point to instead of their addresses everything works, I suspect there is a missing memory barrier somewhere ??? Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu 26-07-12 14:31:50, Rik van Riel wrote: > On 07/20/2012 10:36 AM, Michal Hocko wrote: > > >--- a/arch/x86/mm/hugetlbpage.c > >+++ b/arch/x86/mm/hugetlbpage.c > >@@ -81,7 +81,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned > >long addr, pud_t *pud) > > if (saddr) { > > spte = huge_pte_offset(svma->vm_mm, saddr); > > if (spte) { > >-get_page(virt_to_page(spte)); > >+struct page *spte_page = virt_to_page(spte); > >+if (!is_hugetlb_pmd_page_valid(spte_page)) { > > What prevents somebody else from marking the hugetlb > pmd invalid, between here... > > >+spte = NULL; > >+continue; > >+} > > ... and here? huge_ptep_get_and_clear is (should be) called inside i_mmap_mutex, which is not the case right now as Mel already pointed out in another email -- Michal Hocko SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Thu, Jul 26, 2012 at 01:42:26PM -0400, Rik van Riel wrote: > On 07/23/2012 12:04 AM, Hugh Dickins wrote: > > >Please don't be upset if I say that I don't like either of your patches. > >Mainly for obvious reasons - I don't like Mel's because anything with > >trylock retries and nested spinlocks worries me before I can even start > >to think about it; and I don't like Michal's for the same reason as Mel, > >that it spreads more change around in common paths than we would like. > > I have a naive question. > > In huge_pmd_share, we protect ourselves by taking > the mapping->i_mmap_mutex. > > Is there any reason we could not take the i_mmap_mutex > in the huge_pmd_unshare path? > We do, in 3.4 at least - callers of __unmap_hugepage_range hold the i_mmap_mutex. Locking changes in mmotm and there is a patch there that needs to be reverted. What tree are you looking at? -- Mel Gorman SUSE Labs
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/26/2012 02:37 PM, Rik van Riel wrote: On 07/23/2012 12:04 AM, Hugh Dickins wrote: I spent hours trying to dream up a better patch, trying various approaches. I think I have a nice one now, what do you think? And more importantly, does it work? I have not tried to test it at all, that I'm hoping to leave to you, I'm sure you'll attack it with gusto! If you like it, please take it over and add your comments and signoff and send it in. The second part won't come up in your testing, and could be made a separate patch if you prefer: it's a related point that struck me while I was playing with a different approach. I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but that too would be unfair. Subject-to-your-testing- Signed-off-by: Hugh Dickins This patch looks good to me. Larry, does Hugh's patch survive your testing? Like I said earlier, no. However, I finally set up a reproducer that only takes a few seconds on a large system and this totally fixes the problem: - diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c36febb..cc023b8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, goto nomem; /* If the pagetables are shared don't copy or take references */ - if (dst_pte == src_pte) + if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte) continue; spin_lock(&dst->page_table_lock); --- When we compare what the src_pte & dst_pte point to instead of their addresses everything works, I suspect there is a missing memory barrier somewhere ??? Larry
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/26/2012 02:37 PM, Rik van Riel wrote: On 07/23/2012 12:04 AM, Hugh Dickins wrote: I spent hours trying to dream up a better patch, trying various approaches. I think I have a nice one now, what do you think? And more importantly, does it work? I have not tried to test it at all, that I'm hoping to leave to you, I'm sure you'll attack it with gusto! If you like it, please take it over and add your comments and signoff and send it in. The second part won't come up in your testing, and could be made a separate patch if you prefer: it's a related point that struck me while I was playing with a different approach. I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but that too would be unfair. Subject-to-your-testing- Signed-off-by: Hugh Dickins This patch looks good to me. Larry, does Hugh's patch survive your testing? It doesn't. However it's got a slightly different footprint because this is RHEL6 and there have been changes to the hugetlbfs inode code. Also, we are seeing the problem via group_exit() rather than shmdt(). Also, I print out the actual _mapcount at the BUG and most of the time it's 1 but have seen it as high as 6. dell-per620-01.lab.bos.redhat.com login: MAPCOUNT = 2 [ cut here ] kernel BUG at mm/filemap.c:131! invalid opcode: [#1] SMP last sysfs file: /sys/devices/system/cpu/cpu23/cache/index2/shared_cpu_map CPU 8 Modules linked in: autofs4 sunrpc ipv6 acpi_pad power_meter dcdbas microcode sb_edac edac_core iTCO_wdt i] Pid: 3106, comm: mpitest Not tainted 2.6.32-289.el6.sharedpte.x86_64 #17 Dell Inc. 
PowerEdge R620/07NDJ2 RIP: 0010:[] [] __remove_from_page_cache+0xe2/0x100 RSP: 0018:880434897b78 EFLAGS: 00010002 RAX: 0001 RBX: ea00074ec000 RCX: 10f6 RDX: RSI: 0046 RDI: 0046 RBP: 880434897b88 R08: 81c01a00 R09: R10: R11: 0004 R12: 880432683d98 R13: 880432683db0 R14: R15: ea00074ec000 FS: () GS:88002828() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 003a1d38c4a8 CR3: 01a85000 CR4: 000406e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process mpitest (pid: 3106, threadinfo 880434896000, task 880431abb500) Stack: ea00074ec000 880434897bb8 81114ab4 880434897bb8 02ab 02a0 880434897c08 880434897cb8 811f758d 88022dd8 Call Trace: [] remove_from_page_cache+0x54/0x90 [] truncate_hugepages+0x11d/0x200 [] ? hugetlbfs_delete_inode+0x0/0x30 [] hugetlbfs_delete_inode+0x18/0x30 [] generic_delete_inode+0xde/0x1d0 [] hugetlbfs_drop_inode+0x5d/0x70 [] iput+0x62/0x70 [] dentry_iput+0x90/0x100 [] d_kill+0x31/0x60 [] dput+0x7c/0x150 [] __fput+0x189/0x210 [] fput+0x25/0x30 [] filp_close+0x5d/0x90 [] put_files_struct+0x7f/0xf0 [] exit_files+0x53/0x70 [] do_exit+0x18d/0x870 [] ? audit_syscall_entry+0x272/0x2a0 [] do_group_exit+0x58/0xd0 [] sys_exit_group+0x17/0x20 [] system_call_fastpath+0x16/0x1b -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/23/2012 12:04 AM, Hugh Dickins wrote:
> I spent hours trying to dream up a better patch, trying various approaches.
> I think I have a nice one now, what do you think? And more importantly, does
> it work? I have not tried to test it at all, that I'm hoping to leave to you,
> I'm sure you'll attack it with gusto! If you like it, please take it over and
> add your comments and signoff and send it in. The second part won't come up
> in your testing, and could be made a separate patch if you prefer: it's a
> related point that struck me while I was playing with a different approach.
> I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but that too
> would be unfair. Subject-to-your-testing-
> Signed-off-by: Hugh Dickins

This patch looks good to me. Larry, does Hugh's patch survive your testing?
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/20/2012 10:36 AM, Michal Hocko wrote:
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -81,7 +81,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	if (saddr) {
>  		spte = huge_pte_offset(svma->vm_mm, saddr);
>  		if (spte) {
> -			get_page(virt_to_page(spte));
> +			struct page *spte_page = virt_to_page(spte);
> +			if (!is_hugetlb_pmd_page_valid(spte_page)) {

What prevents somebody else from marking the hugetlb pmd invalid, between here...

> +				spte = NULL;
> +				continue;
> +			}

... and here?

> +			get_page(spte_page);
>  			break;
>  		}

I think we need to take the refcount before checking whether the hugetlb pmd is still valid.

Also, disregard my previous email in this thread, I just read Mel's detailed explanation and wrapped my brain around the bug :)
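Rik's ordering point can be sketched outside the kernel. This is a hedged illustration — the type and function names are hypothetical, not the kernel's API — of the pattern he is asking for, in the spirit of get_page_unless_zero(): take the reference first, then re-check validity, because validating before taking the reference leaves a window in which another CPU can invalidate and free the pmd page.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pmd page's lifetime state. */
struct pmd_page {
    atomic_int  refcount;
    atomic_bool valid;
};

/* Take a reference only if the page is still live, THEN check the
 * validity flag. If it was invalidated while we raced, drop the
 * reference we just took and report failure. */
static bool try_get_pmd_page(struct pmd_page *p)
{
    int c = atomic_load(&p->refcount);
    while (c > 0) {
        /* Only bump a refcount that is still non-zero. */
        if (atomic_compare_exchange_weak(&p->refcount, &c, c + 1)) {
            if (atomic_load(&p->valid))
                return true;                    /* ref held, still valid */
            atomic_fetch_sub(&p->refcount, 1);  /* invalidated: back off */
            return false;
        }
        /* CAS failed: c was reloaded; retry while still positive. */
    }
    return false;                               /* already freed */
}
```

The key property is that between the successful increment and the validity check, the page cannot be freed out from under us, which is exactly what checking validity before get_page() fails to guarantee.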
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/23/2012 12:04 AM, Hugh Dickins wrote:
> Please don't be upset if I say that I don't like either of your patches.
> Mainly for obvious reasons - I don't like Mel's because anything with
> trylock retries and nested spinlocks worries me before I can even start to
> think about it; and I don't like Michal's for the same reason as Mel, that
> it spreads more change around in common paths than we would like.

I have a naive question. In huge_pmd_share, we protect ourselves by taking the mapping->i_mmap_mutex. Is there any reason we could not take the i_mmap_mutex in the huge_pmd_unshare path?

I see that hugetlb_change_protection already takes that lock. Is there something preventing __unmap_hugepage_range from also taking mapping->i_mmap_mutex?

That way the sharing and the unsharing code are protected by the same, per shm segment, lock.
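The locking symmetry Rik is asking about can be sketched with an ordinary mutex. This is a hedged illustration with invented names (the struct, functions, and counter are not the kernel's): if both the sharing and the unsharing paths serialize on the same per-mapping lock — i_mmap_mutex in the kernel — neither can observe the other's half-done state.

```c
#include <pthread.h>

/* Illustrative per-mapping state; shared_users is a stand-in for the
 * shared pmd page's reference count. */
struct mapping {
    pthread_mutex_t i_mmap_mutex;
    int shared_users;
};

static void share_pmd(struct mapping *m)    /* huge_pmd_share analogue */
{
    pthread_mutex_lock(&m->i_mmap_mutex);
    m->shared_users++;
    pthread_mutex_unlock(&m->i_mmap_mutex);
}

static void unshare_pmd(struct mapping *m)  /* huge_pmd_unshare analogue */
{
    pthread_mutex_lock(&m->i_mmap_mutex);
    if (m->shared_users > 0)
        m->shared_users--;
    pthread_mutex_unlock(&m->i_mmap_mutex);
}
```

The design point is simply that a single lock guarding both directions makes the share/unshare pair mutually exclusive, which is why the question of whether every unshare path already holds i_mmap_mutex matters.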
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On 07/26/2012 01:42 PM, Rik van Riel wrote:
> On 07/23/2012 12:04 AM, Hugh Dickins wrote:
>> Please don't be upset if I say that I don't like either of your patches.
>> Mainly for obvious reasons - I don't like Mel's because anything with
>> trylock retries and nested spinlocks worries me before I can even start to
>> think about it; and I don't like Michal's for the same reason as Mel, that
>> it spreads more change around in common paths than we would like.
>
> I have a naive question. In huge_pmd_share, we protect ourselves by taking
> the mapping->i_mmap_mutex. Is there any reason we could not take the
> i_mmap_mutex in the huge_pmd_unshare path?

I think it is already taken on every path into huge_pmd_unshare().

Larry

> I see that hugetlb_change_protection already takes that lock. Is there
> something preventing __unmap_hugepage_range from also taking
> mapping->i_mmap_mutex? That way the sharing and the unsharing code are
> protected by the same, per shm segment, lock.
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, Jul 24, 2012 at 12:23:58PM -0700, Hugh Dickins wrote:
> On Tue, 24 Jul 2012, Mel Gorman wrote:
> > On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote:
> > >
> > > So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.
> > >
> >
> > I agree with you. When I was thinking about the potential problems, I was
> > thinking of them in the general context of the core VM and what we normally
> > take into account.
> >
> > I confess that I really find this working-by-coincidence very icky and am
> > uncomfortable with it but your patch is the only patch that contains the
> > mess to hugetlbfs. I fixed exit_mmap() for my version but only by changing
> > the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb
> > VMA is found so I also affected the core.
>
> "icky" is not quite the word I'd use, but yes, it feels like you only
> have to dislodge a stone somewhere at the other end of the kernel,
> and the whole lot would come tumbling down.
>
> If I could think of a suitable VM_BUG_ON to insert next to the ~VM_MAYSHARE,
> I would: to warn us when assumptions change. If we were prepared to waste
> another vm_flag on it (and just because there's now a type which lets them
> expand does not mean we can be profligate with them), then you can imagine
> a VM_GOINGAWAY flag set in unmap_region() and exit_mmap(), and we key off
> that instead; or something of that kind.

A new VM flag would be overkill for this right now.

> But I'm afraid I see that as TODO-list material: the one-liner is pretty
> good for stable backporting, and I felt smiled-upon when it turned out to
> be workable (and not even needing a change in arch/x86/mm, that really
> surprised me). It seems ungrateful not to seize the simple fix it offers,
> which I found much easier to understand than the alternatives.

That's fair enough.

> > So, let's go with your patch but with all this documented! I stuck a
> > changelog and an additional comment onto your patch and this is the end
> > result.
>
> Okay, thanks. (I think you've copied rather more of my previous mail
> into the commit description than it deserves, but it looks like you
> like more words where I like less!)

I did copy more than was necessary, I'll fix it.

> > Do you want to pick this up and send it to Andrew or will I?
>
> Oh, please change your Reviewed-by to Signed-off-by: almost all of the
> work and description comes from you and Michal; then please, you send it
> in to Andrew - sorry, I really need to turn my attention to other things.

That's fine, I'll pick it. Thanks for working on this.

--
Mel Gorman
SUSE Labs
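The "&= ~VM_MAYSHARE" one-liner discussed above boils down to a flag-clearing pattern. This is an illustrative sketch only — the flag values below are invented, not the kernel's — showing the bit manipulation: clearing VM_MAYSHARE on the final-teardown path marks the VMA as no longer eligible for pmd sharing without spending a new VM_GOINGAWAY bit.

```c
/* Hypothetical flag values for illustration; only the pattern of
 * clearing one eligibility bit while preserving the rest matters. */
#define VM_SHARED   0x00000008UL
#define VM_MAYSHARE 0x00000080UL

static unsigned long mark_vma_going_away(unsigned long vm_flags)
{
    return vm_flags & ~VM_MAYSHARE;   /* the "&= ~VM_MAYSHARE" fix */
}
```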
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, 24 Jul 2012, Mel Gorman wrote:
> On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote:
> >
> > So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.
> >
>
> I agree with you. When I was thinking about the potential problems, I was
> thinking of them in the general context of the core VM and what we normally
> take into account.
>
> I confess that I really find this working-by-coincidence very icky and am
> uncomfortable with it but your patch is the only patch that contains the
> mess to hugetlbfs. I fixed exit_mmap() for my version but only by changing
> the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb
> VMA is found so I also affected the core.

"icky" is not quite the word I'd use, but yes, it feels like you only have to dislodge a stone somewhere at the other end of the kernel, and the whole lot would come tumbling down.

If I could think of a suitable VM_BUG_ON to insert next to the ~VM_MAYSHARE, I would: to warn us when assumptions change. If we were prepared to waste another vm_flag on it (and just because there's now a type which lets them expand does not mean we can be profligate with them), then you can imagine a VM_GOINGAWAY flag set in unmap_region() and exit_mmap(), and we key off that instead; or something of that kind.

But I'm afraid I see that as TODO-list material: the one-liner is pretty good for stable backporting, and I felt smiled-upon when it turned out to be workable (and not even needing a change in arch/x86/mm, that really surprised me). It seems ungrateful not to seize the simple fix it offers, which I found much easier to understand than the alternatives.

> So, let's go with your patch but with all this documented! I stuck a
> changelog and an additional comment onto your patch and this is the end
> result.

Okay, thanks. (I think you've copied rather more of my previous mail into the commit description than it deserves, but it looks like you like more words where I like less!)

> Do you want to pick this up and send it to Andrew or will I?

Oh, please change your Reviewed-by to Signed-off-by: almost all of the work and description comes from you and Michal; then please, you send it in to Andrew - sorry, I really need to turn my attention to other things.

But I hadn't realized how messy it's going to be: I was concentrating on 3.5, not the mmotm tree, which as Michal points out is fairly different. Yes, it definitely needs to revert to holding i_mmap_mutex when unsharing, it was a mistake to have removed that (what stabilizes the page_count 1 check in huge_pmd_unshare? only i_mmap_mutex). I guess this fix needs to go in to 3.6 early, and "someone" rejig the hugetlb area of mmotm before that goes on to Linus. Urggh. AArgh64. Sorry, I'm not volunteering.

But an interesting aspect of the hugetlb changes there, is that mmu_gather is now being used by __unmap_hugepage_range: I did want that for one of the solutions to this bug that I was toying with. Although earlier I had been afraid of "doing free_pgtables work" down in unshare, it occurred to me later that already we do that (pud_clear) with no harm observed, and free_pgtables does not depend on having entries still present at the lower levels. It may be that there's a less tricky fix available once the dust has settled here.

> Thanks Hugh!
>
> ---8<---
> mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
>
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
>
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this;
>
> [ ..] Lots of bad pmd messages followed by this
> [ 127.164256] mm/memory.c:391: bad pmd 880412e04fe8(8003de4000e7).
> [ 127.164257] mm/memory.c:391: bad pmd 880412e04ff0(8003de6000e7).
> [ 127.164258] mm/memory.c:391: bad pmd 880412e04ff8(8003dee7).
> [ 127.186778] ------------[ cut here ]------------
> [ 127.186781] kernel BUG at mm/filemap.c:134!
> [ 127.186782] invalid opcode: [#1] SMP
> [ 127.186783] CPU 7
> [ 127.186784] Modules linked in: af_packet cpufreq_conservative
> cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod
> coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801
> r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw
> cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode
> ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 24-07-12 10:34:06, Mel Gorman wrote:
[...]
> ---8<---
> mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
>
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
>
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this;
>
> [ ..] Lots of bad pmd messages followed by this
> [ 127.164256] mm/memory.c:391: bad pmd 880412e04fe8(8003de4000e7).
> [ 127.164257] mm/memory.c:391: bad pmd 880412e04ff0(8003de6000e7).
> [ 127.164258] mm/memory.c:391: bad pmd 880412e04ff8(8003dee7).
> [ 127.186778] ------------[ cut here ]------------
> [ 127.186781] kernel BUG at mm/filemap.c:134!
> [ 127.186782] invalid opcode: [#1] SMP
> [ 127.186783] CPU 7
> [ 127.186784] Modules linked in: af_packet cpufreq_conservative
> cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod
> coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801
> r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw
> cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode
> ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm
> i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button
> i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon
> ata_generic pata_atiixp libata scsi_mod
> [ 127.186801]
> [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild
> #53 Dell Inc. OptiPlex 990/06D7TR
> [ 127.186804] RIP: 0010:[] [] __delete_from_page_cache+0x15e/0x160
> [ 127.186809] RSP: :8804144b5c08 EFLAGS: 00010002
> [ 127.186810] RAX: 0001 RBX: ea000a5c9000 RCX: ffc0
> [ 127.186811] RDX: RSI: 0009 RDI: 88042dfdad00
> [ 127.186812] RBP: 8804144b5c18 R08: 0009 R09: 0003
> [ 127.186813] R10: R11: 002d R12: 880412ff83d8
> [ 127.186814] R13: 880412ff83d8 R14: R15: 880412ff83d8
> [ 127.186815] FS: 7fe18ed2c700() GS:88042dce() knlGS:
> [ 127.186816] CS: 0010 DS: ES: CR0: 8005003b
> [ 127.186817] CR2: 7fe34503 CR3: 000417a14000 CR4: 000407e0
> [ 127.186818] DR0: DR1: DR2:
> [ 127.186819] DR3: DR6: 0ff0 DR7: 0400
> [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo 8804144b4000, task 880417f803c0)
> [ 127.186821] Stack:
> [ 127.186822] ea000a5c9000 8804144b5c48 810ed83b
> [ 127.186824] 8804144b5c48 138a 1387 8804144b5c98
> [ 127.186825] 8804144b5d48 811bc925 8804144b5cb8
> [ 127.186827] Call Trace:
> [ 127.186829] [] delete_from_page_cache+0x3b/0x80
> [ 127.186832] [] truncate_hugepages+0x115/0x220
> [ 127.186834] [] hugetlbfs_evict_inode+0x13/0x30
> [ 127.186837] [] evict+0xa7/0x1b0
> [ 127.186839] [] iput_final+0xd3/0x1f0
> [ 127.186840] [] iput+0x39/0x50
> [ 127.186842] [] d_kill+0xf8/0x130
> [ 127.186843] [] dput+0xd2/0x1a0
> [ 127.186845] [] __fput+0x170/0x230
> [ 127.186848] [] ? rb_erase+0xce/0x150
> [ 127.186849] [] fput+0x1d/0x30
> [ 127.186851] [] remove_vma+0x37/0x80
> [ 127.186853] [] do_munmap+0x2d2/0x360
> [ 127.186855] [] sys_shmdt+0xc9/0x170
> [ 127.186857] [] system_call_fastpath+0x16/0x1b
> [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40
> 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f>
> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [ 127.186868] RIP [] __delete_from_page_cache+0x15e/0x160
> [ 127.186870] RSP
> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---
>
> The bug is a race and not always easy to reproduce. To reproduce it I was
> doing the following on a single socket I7-based machine with 16G of RAM.
>
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
> $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
>
> On my particular machine, it usually triggers within 10 minutes but enabling
> debug options can change the timing such that it never hits. Once the bug is
> triggered, the machine is in
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote: > On Mon, 23 Jul 2012, Mel Gorman wrote: > > On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote: > > > On Fri, 20 Jul 2012, Mel Gorman wrote: > > > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote: > > > > I like it in that it's simple and I can confirm it works for the test case > > of interest. > > Phew, I'm glad to hear that, thanks. > > > > > However, is your patch not vunerable to truncate issues? > > madvise()/truncate() issues was the main reason why I was wary of VMA tricks > > as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is > > ignored for hugetlbfs but I think truncate is still problematic. Lets say > > we mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason. > > > > invalidate_inode_pages2 > > invalidate_inode_pages2_range > > unmap_mapping_range_vma > > zap_page_range_single > > unmap_single_vma > > __unmap_hugepage_range (removes VM_MAYSHARE) > > > > The VMA still exists so the consequences for this would be varied but > > minimally fault is going to be "interesting". > > You had me worried there, I hadn't considered truncation or invalidation2 > at all. > > But actually, I don't think they do pose any problem for my patch. They > would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range() > as you show above; but no, I'm removing it in unmap_hugepage_range(). > True, I hadn't considered the distinction to be relevant. My train of thought was "we potentially end up calling these functions and affect VMAs that are not being torn down". This set off an alarm but I didn't follow through properly. > That's only called by unmap_single_vma(): which is called via > unmap_vmas() by unmap_region() or exit_mmap() just before free_pgtables() > (the problem cases); or by madvise_dontneed() via zap_page_range(), which > as you note is disallowed on VM_HUGETLB; or by zap_page_range_single(). 
The madvise(DONTNEED) case also made me worry that if this fix worked, it worked by coincidence. If someone "fixes" hugetlbfs to support DONTNEED (which would be nonsensical) or direct IO (which would be weird) then this fix could potentially regress.

> zap_page_range_single() is called by zap_vma_ptes(), which is only allowed on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was going to deadlock on i_mmap_mutex (with or without my patch) until I realized that hugetlbfs has its own hugetlbfs_setattr() and hugetlb_vmtruncate() which don't use unmap_mapping_range() at all.

Yep, it happens to work.

> invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(), but hugetlbfs doesn't support direct_IO, and otherwise I think they're called by a filesystem directly on its own inodes, which hugetlbfs does not. Anyway, if there's a deadlock on i_mmap_mutex somewhere in there, it's not introduced by the proposed patch.
>
> So, unmap_hugepage_range() is only being called in the problem cases, just before free_pgtables(), when unmapping a vma (with mmap_sem held), or when exiting (when we have the last reference to mm): in each case, the vma is on its way out, and VM_MAYSHARE no longer of interest to others.
>
> I spent a while concerned that I'd overlooked the truncation case, before realizing that it's not a problem: the issue comes when we free_pgtables(), which truncation makes no attempt to do.
>
> So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

I agree with you. When I was thinking about the potential problems, I was thinking of them in the general context of the core VM and what we normally take into account. I confess that I really find this working-by-coincidence very icky and am uncomfortable with it but your patch is the only patch that contains the mess to hugetlbfs.
I fixed exit_mmap() for my version but only by changing the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb VMA is found, so I also affected the core. So, let's go with your patch but with all this documented!

I stuck a changelog and an additional comment onto your patch and this is the end result. Do you want to pick this up and send it to Andrew or will I?

Thanks Hugh!

---8<---
mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables

If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages look like this; [
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon 23-07-12 18:08:05, Hugh Dickins wrote:
> On Mon, 23 Jul 2012, Mel Gorman wrote:
> > On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> > > On Fri, 20 Jul 2012, Mel Gorman wrote:
> > > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> >
> > I like it in that it's simple and I can confirm it works for the test case of interest.
>
> Phew, I'm glad to hear that, thanks.
>
> > However, is your patch not vulnerable to truncate issues? madvise()/truncate() issues was the main reason why I was wary of VMA tricks as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is ignored for hugetlbfs but I think truncate is still problematic. Let's say we mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason.
> >
> > invalidate_inode_pages2
> > invalidate_inode_pages2_range
> > unmap_mapping_range_vma
> > zap_page_range_single
> > unmap_single_vma
> > __unmap_hugepage_range (removes VM_MAYSHARE)
> >
> > The VMA still exists so the consequences for this would be varied but minimally fault is going to be "interesting".
>
> You had me worried there, I hadn't considered truncation or invalidation2 at all.
>
> But actually, I don't think they do pose any problem for my patch. They would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range() as you show above; but no, I'm removing it in unmap_hugepage_range().
>
> That's only called by unmap_single_vma(): which is called via unmap_vmas() by unmap_region() or exit_mmap() just before free_pgtables() (the problem cases); or by madvise_dontneed() via zap_page_range(), which as you note is disallowed on VM_HUGETLB; or by zap_page_range_single().
> zap_page_range_single() is called by zap_vma_ptes(), which is only allowed on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was going to deadlock on i_mmap_mutex (with or without my patch) until I realized that hugetlbfs has its own hugetlbfs_setattr() and hugetlb_vmtruncate() which don't use unmap_mapping_range() at all.
>
> invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(), but hugetlbfs doesn't support direct_IO, and otherwise I think they're called by a filesystem directly on its own inodes, which hugetlbfs does not.

Good point, I didn't get this while looking into the code, so I introduced the `last' parameter, which told me that I was in the correct path. Thanks for the clarification.

> Anyway, if there's a deadlock on i_mmap_mutex somewhere in there, it's not introduced by the proposed patch.
>
> So, unmap_hugepage_range() is only being called in the problem cases, just before free_pgtables(), when unmapping a vma (with mmap_sem held), or when exiting (when we have the last reference to mm): in each case, the vma is on its way out, and VM_MAYSHARE no longer of interest to others.
>
> I spent a while concerned that I'd overlooked the truncation case, before realizing that it's not a problem: the issue comes when we free_pgtables(), which truncation makes no attempt to do.
>
> So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

Yes, this is convincing (and subtle ;)) and much less polluting. You can add my Reviewed-by (with the above reasoning in the patch description).

Anyway, the patch for mmotm needs an update because there was a reorganization in the area. First, we need to revert "hugetlb: avoid taking i_mmap_mutex in unmap_single_vma() for hugetlb" (80f408f5 in memcg-devel) and then push your code into unmap_single_vma. All the above is still valid AFAICS.

> Hugh

Thanks a lot Hugh!
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue 24-07-12 10:34:06, Mel Gorman wrote:
[...]
---8<---
mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables

If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages look like this;

[...] Lots of bad pmd messages followed by this
[ 127.164256] mm/memory.c:391: bad pmd 880412e04fe8(8003de4000e7).
[ 127.164257] mm/memory.c:391: bad pmd 880412e04ff0(8003de6000e7).
[ 127.164258] mm/memory.c:391: bad pmd 880412e04ff8(8003dee7).
[ 127.186778] [ cut here ]
[ 127.186781] kernel BUG at mm/filemap.c:134!
[ 127.186782] invalid opcode: [#1] SMP
[ 127.186783] CPU 7
[ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[ 127.186801]
[ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc.
OptiPlex 990/06D7TR
[ 127.186804] RIP: 0010:[810ed6ce] [810ed6ce] __delete_from_page_cache+0x15e/0x160
[ 127.186809] RSP: :8804144b5c08 EFLAGS: 00010002
[ 127.186810] RAX: 0001 RBX: ea000a5c9000 RCX: ffc0
[ 127.186811] RDX: RSI: 0009 RDI: 88042dfdad00
[ 127.186812] RBP: 8804144b5c18 R08: 0009 R09: 0003
[ 127.186813] R10: R11: 002d R12: 880412ff83d8
[ 127.186814] R13: 880412ff83d8 R14: R15: 880412ff83d8
[ 127.186815] FS: 7fe18ed2c700() GS:88042dce() knlGS:
[ 127.186816] CS: 0010 DS: ES: CR0: 8005003b
[ 127.186817] CR2: 7fe34503 CR3: 000417a14000 CR4: 000407e0
[ 127.186818] DR0: DR1: DR2:
[ 127.186819] DR3: DR6: 0ff0 DR7: 0400
[ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo 8804144b4000, task 880417f803c0)
[ 127.186821] Stack:
[ 127.186822] ea000a5c9000 8804144b5c48 810ed83b
[ 127.186824] 8804144b5c48 138a 1387 8804144b5c98
[ 127.186825] 8804144b5d48 811bc925 8804144b5cb8
[ 127.186827] Call Trace:
[ 127.186829] [810ed83b] delete_from_page_cache+0x3b/0x80
[ 127.186832] [811bc925] truncate_hugepages+0x115/0x220
[ 127.186834] [811bca43] hugetlbfs_evict_inode+0x13/0x30
[ 127.186837] [811655c7] evict+0xa7/0x1b0
[ 127.186839] [811657a3] iput_final+0xd3/0x1f0
[ 127.186840] [811658f9] iput+0x39/0x50
[ 127.186842] [81162708] d_kill+0xf8/0x130
[ 127.186843] [81162812] dput+0xd2/0x1a0
[ 127.186845] [8114e2d0] __fput+0x170/0x230
[ 127.186848] [81236e0e] ?
rb_erase+0xce/0x150
[ 127.186849] [8114e3ad] fput+0x1d/0x30
[ 127.186851] [81117db7] remove_vma+0x37/0x80
[ 127.186853] [81119182] do_munmap+0x2d2/0x360
[ 127.186855] [811cc639] sys_shmdt+0xc9/0x170
[ 127.186857] [81410a39] system_call_fastpath+0x16/0x1b
[ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff 0f 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[ 127.186868] RIP [810ed6ce] __delete_from_page_cache+0x15e/0x160
[ 127.186870] RSP 8804144b5c08
[ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---

The bug is a race and not always easy to reproduce. To reproduce it I was doing the following on a single socket I7-based machine with 16G of RAM.

$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Tue, 24 Jul 2012, Mel Gorman wrote:
> On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote:
> > So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.
>
> I agree with you. When I was thinking about the potential problems, I was thinking of them in the general context of the core VM and what we normally take into account. I confess that I really find this working-by-coincidence very icky and am uncomfortable with it but your patch is the only patch that contains the mess to hugetlbfs. I fixed exit_mmap() for my version but only by changing the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb VMA is found so I also affected the core.

"Icky" is not quite the word I'd use, but yes, it feels like you only have to dislodge a stone somewhere at the other end of the kernel, and the whole lot would come tumbling down.

If I could think of a suitable VM_BUG_ON to insert next to the ~VM_MAYSHARE, I would: to warn us when assumptions change. If we were prepared to waste another vm_flag on it (and just because there's now a type which lets them expand does not mean we can be profligate with them), then you can imagine a VM_GOINGAWAY flag set in unmap_region() and exit_mmap(), and we key off that instead; or something of that kind.

But I'm afraid I see that as TODO-list material: the one-liner is pretty good for stable backporting, and I felt smiled-upon when it turned out to be workable (and not even needing a change in arch/x86/mm, that really surprised me). It seems ungrateful not to seize the simple fix it offers, which I found much easier to understand than the alternatives.

> So, let's go with your patch but with all this documented! I stuck a changelog and an additional comment onto your patch and this is the end result.

Okay, thanks. (I think you've copied rather more of my previous mail into the commit description than it deserves, but it looks like you like more words where I like less!)
> Do you want to pick this up and send it to Andrew or will I?

Oh, please change your Reviewed-by to Signed-off-by: almost all of the work and description comes from you and Michal; then please, you send it in to Andrew - sorry, I really need to turn my attention to other things.

But I hadn't realized how messy it's going to be: I was concentrating on 3.5, not the mmotm tree, which as Michal points out is fairly different. Yes, it definitely needs to revert to holding i_mmap_mutex when unsharing, it was a mistake to have removed that (what stabilizes the page_count 1 check in huge_pmd_unshare? only i_mmap_mutex). I guess this fix needs to go in to 3.6 early, and someone rejig the hugetlb area of mmotm before that goes on to Linus. Urggh. AArgh64. Sorry, I'm not volunteering.

But an interesting aspect of the hugetlb changes there, is that mmu_gather is now being used by __unmap_hugepage_range: I did want that for one of the solutions to this bug that I was toying with. Although earlier I had been afraid of doing free_pgtables work down in unshare, it occurred to me later that already we do that (pud_clear) with no harm observed, and free_pgtables does not depend on having entries still present at the lower levels. It may be that there's a less tricky fix available once the dust has settled here.

> Thanks Hugh!
>
> ---8<---
> mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
>
> If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c.
>
> This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page].
> The messages look like this;
>
> [...] Lots of bad pmd messages followed by this
> [ 127.164256] mm/memory.c:391: bad pmd 880412e04fe8(8003de4000e7).
> [ 127.164257] mm/memory.c:391: bad pmd 880412e04ff0(8003de6000e7).
> [ 127.164258] mm/memory.c:391: bad pmd 880412e04ff8(8003dee7).
> [ 127.186778] [ cut here ]
> [ 127.186781] kernel BUG at mm/filemap.c:134!
> [ 127.186782] invalid opcode: [#1] SMP
> [ 127.186783] CPU 7
> [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon, 23 Jul 2012, Mel Gorman wrote:
> On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> > On Fri, 20 Jul 2012, Mel Gorman wrote:
> > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
>
> I like it in that it's simple and I can confirm it works for the test case of interest.

Phew, I'm glad to hear that, thanks.

> However, is your patch not vulnerable to truncate issues? madvise()/truncate() issues was the main reason why I was wary of VMA tricks as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is ignored for hugetlbfs but I think truncate is still problematic. Let's say we mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason.
>
> invalidate_inode_pages2
> invalidate_inode_pages2_range
> unmap_mapping_range_vma
> zap_page_range_single
> unmap_single_vma
> __unmap_hugepage_range (removes VM_MAYSHARE)
>
> The VMA still exists so the consequences for this would be varied but minimally fault is going to be "interesting".

You had me worried there, I hadn't considered truncation or invalidation2 at all.

But actually, I don't think they do pose any problem for my patch. They would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range() as you show above; but no, I'm removing it in unmap_hugepage_range().

That's only called by unmap_single_vma(): which is called via unmap_vmas() by unmap_region() or exit_mmap() just before free_pgtables() (the problem cases); or by madvise_dontneed() via zap_page_range(), which as you note is disallowed on VM_HUGETLB; or by zap_page_range_single().

zap_page_range_single() is called by zap_vma_ptes(), which is only allowed on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was going to deadlock on i_mmap_mutex (with or without my patch) until I realized that hugetlbfs has its own hugetlbfs_setattr() and hugetlb_vmtruncate() which don't use unmap_mapping_range() at all.
invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(), but hugetlbfs doesn't support direct_IO, and otherwise I think they're called by a filesystem directly on its own inodes, which hugetlbfs does not. Anyway, if there's a deadlock on i_mmap_mutex somewhere in there, it's not introduced by the proposed patch.

So, unmap_hugepage_range() is only being called in the problem cases, just before free_pgtables(), when unmapping a vma (with mmap_sem held), or when exiting (when we have the last reference to mm): in each case, the vma is on its way out, and VM_MAYSHARE no longer of interest to others.

I spent a while concerned that I'd overlooked the truncation case, before realizing that it's not a problem: the issue comes when we free_pgtables(), which truncation makes no attempt to do.

So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

Hugh
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> On Fri, 20 Jul 2012, Mel Gorman wrote:
> > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> > > And here is my attempt for the fix (Hugh mentioned something similar earlier but he suggested using special flags in ptes or VMAs). I still owe doc. update and it hasn't been tested with too many configs and I could have missed some definition updates. I also think that the changelog could be much better, I will add (steal) the full bug description if people think that this way is worth going rather than the one suggested by Mel. To be honest I am not quite happy how I had to pollute generic mm code with something that is specific to a single architecture. Mel hammered it with the test case and it survived.
> >
> > Tested-by: Mel Gorman
> >
> > This approach looks more or less like what I was expecting. I like that the trick was applied to the page table page instead of using PTE tricks or by bodging it with a VMA flag like I was thinking so kudos for that. I also prefer this approach to trying to free the page tables on or near huge_pmd_unshare().
> >
> > In general I think this patch would execute better than mine because it is far less heavy-handed but I share your concern that it changes the core MM quite a bit for a corner case that only one architecture cares about. I am completely biased of course, but I still prefer my patch because other than an API change it keeps the bulk of the madness in arch/x86/mm/hugetlbpage.c. I am also not concerned with the scalability of how quickly we can set up page table sharing.
> >
> > Hugh, I'm afraid you get to choose :)
>
> Thank you for bestowing that honour upon me :)

Just so you know, there was a ceremonial gong when it happened.
> Seriously, though, you were quite right to Cc me on this, it is one of those areas I ought to know something about (unlike hugetlb reservations, for example).
>
> Please don't be upset if I say that I don't like either of your patches.

I can live with that :) It would not be the first time we found the right patch out of dislike for the first proposed, and getting the fix is what's important.

> Mainly for obvious reasons - I don't like Mel's because anything with trylock retries and nested spinlocks worries me before I can even start to think about it;

That's a reasonable objection. The trylock could be avoided by always falling through at the cost of reducing the amount of sharing opportunities but the nested locking is unavoidable. I agree with you that nested locking like this should always be a cause for concern.

> and I don't like Michal's for the same reason as Mel, that it spreads more change around in common paths than we would like.
>
> But I didn't spend much time thinking through either of them, they just seemed more complicated than should be needed. I cannot confirm or deny whether they're correct - though I still do not understand how mmap_sem can help you, Mel. I can see that it will help in your shmdt()ing test, but if you leave the area mapped on exit, then mmap_sem is not taken in the exit_mmap() path, so how does it help?

It certainly helps in the shmdt case which is what the test case focused on because that is what the application that triggered this bug was doing. However, you're right in that the exit_mmap() path is still vulnerable because it does not take mmap_sem. I'll think about that a bit more.

> I spent hours trying to dream up a better patch, trying various approaches. I think I have a nice one now, what do you think? And more importantly, does it work? I have not tried to test it at all, that I'm hoping to leave to you, I'm sure you'll attack it with gusto!
> > If you like it, please take it over and add your comments and signoff > and send it in. > I like it in that it's simple and I can confirm it works for the test case of interest. However, is your patch not vunerable to truncate issues? madvise()/truncate() issues was the main reason why I was wary of VMA tricks as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is ignored for hugetlbfs but I think truncate is still problematic. Lets say we mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason. invalidate_inode_pages2 invalidate_inode_pages2_range unmap_mapping_range_vma zap_page_range_single unmap_single_vma __unmap_hugepage_range (removes VM_MAYSHARE) The VMA still exists so the consequences for this would be varied but minimally fault is going to be "interesting". I think that potentially we could work around this but it may end up stomping on the core MM and not necessarily be any better than Michal's patch. > The second part won't come up in your testing, and could > be made a separate patch if you prefer: it's a related point that struck > me while I was
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Mon, 23 Jul 2012, Mel Gorman wrote:
> On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> > On Fri, 20 Jul 2012, Mel Gorman wrote:
> > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
>
> I like it in that it's simple and I can confirm it works for the test
> case of interest.

Phew, I'm glad to hear that, thanks.

> However, is your patch not vulnerable to truncate issues?
> madvise()/truncate() issues were the main reason why I was wary of VMA
> tricks as a solution.
>
> As it turns out, madvise(DONTNEED) is not a problem as it is ignored for
> hugetlbfs, but I think truncate is still problematic. Let's say we
> mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason:
>
>   invalidate_inode_pages2
>     invalidate_inode_pages2_range
>       unmap_mapping_range_vma
>         zap_page_range_single
>           unmap_single_vma
>             __unmap_hugepage_range (removes VM_MAYSHARE)
>
> The VMA still exists, so the consequences of this would be varied, but
> minimally fault is going to be "interesting".

You had me worried there; I hadn't considered truncation or invalidation2
at all. But actually, I don't think they do pose any problem for my patch.

They would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range()
as you show above; but no, I'm removing it in unmap_hugepage_range(). That's
only called by unmap_single_vma(): which is called via unmap_vmas() by
unmap_region() or exit_mmap() just before free_pgtables() (the problem
cases); or by madvise_dontneed() via zap_page_range(), which as you note is
disallowed on VM_HUGETLB; or by zap_page_range_single().

zap_page_range_single() is called by zap_vma_ptes(), which is only allowed
on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was
going to deadlock on i_mmap_mutex (with or without my patch) until I
realized that hugetlbfs has its own hugetlbfs_setattr() and
hugetlb_vmtruncate(), which don't use unmap_mapping_range() at all.

invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(), but
hugetlbfs doesn't support direct_IO, and otherwise I think they're called
by a filesystem directly on its own inodes, which hugetlbfs does not do.
Anyway, if there's a deadlock on i_mmap_mutex somewhere in there, it's not
introduced by the proposed patch.

So, unmap_hugepage_range() is only being called in the problem cases, just
before free_pgtables(), when unmapping a vma (with mmap_sem held), or when
exiting (when we have the last reference to mm): in each case, the vma is
on its way out, and VM_MAYSHARE is no longer of interest to others.

I spent a while concerned that I'd overlooked the truncation case, before
realizing that it's not a problem: the issue comes when we free_pgtables(),
which truncation makes no attempt to do.

So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Fri, 20 Jul 2012, Mel Gorman wrote:
> On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> > And here is my attempt for the fix (Hugh mentioned something similar
> > earlier but he suggested using special flags in ptes or VMAs). I still
> > owe a doc update, it hasn't been tested with too many configs, and I
> > could have missed some definition updates.
> > I also think that the changelog could be much better; I will add (steal)
> > the full bug description if people think that this way is worth going
> > rather than the one suggested by Mel.
> > To be honest I am not quite happy how I had to pollute generic mm code
> > with something that is specific to a single architecture.
> > Mel hammered it with the test case and it survived.
>
> Tested-by: Mel Gorman
>
> This approach looks more or less like what I was expecting. I like that
> the trick was applied to the page table page instead of using PTE tricks
> or by bodging it with a VMA flag like I was thinking, so kudos for that. I
> also prefer this approach to trying to free the page tables on or near
> huge_pmd_unshare()
>
> In general I think this patch would execute better than mine because it is
> far less heavy-handed, but I share your concern that it changes the core MM
> quite a bit for a corner case that only one architecture cares about. I am
> completely biased of course, but I still prefer my patch because, other
> than an API change, it keeps the bulk of the madness in
> arch/x86/mm/hugetlbpage.c. I am also not concerned with the scalability of
> how quickly we can set up page table sharing.
>
> Hugh, I'm afraid you get to choose :)

Thank you for bestowing that honour upon me :)

Seriously, though, you were quite right to Cc me on this, it is one of
those areas I ought to know something about (unlike hugetlb reservations,
for example).

Please don't be upset if I say that I don't like either of your patches.
Mainly for obvious reasons - I don't like Mel's because anything with
trylock retries and nested spinlocks worries me before I can even start
to think about it; and I don't like Michal's for the same reason as Mel,
that it spreads more change around in common paths than we would like.

But I didn't spend much time thinking through either of them, they just
seemed more complicated than should be needed. I cannot confirm or deny
whether they're correct - though I still do not understand how mmap_sem
can help you, Mel. I can see that it will help in your shmdt()ing test,
but if you leave the area mapped on exit, then mmap_sem is not taken in
the exit_mmap() path, so how does it help?

I spent hours trying to dream up a better patch, trying various
approaches. I think I have a nice one now, what do you think? And
more importantly, does it work? I have not tried to test it at all,
that I'm hoping to leave to you, I'm sure you'll attack it with gusto!

If you like it, please take it over and add your comments and signoff
and send it in. The second part won't come up in your testing, and could
be made a separate patch if you prefer: it's a related point that struck
me while I was playing with a different approach.

I'm sorely tempted to leave a dangerous pair of eyes off the Cc, but
that too would be unfair.

Subject-to-your-testing-
Signed-off-by: Hugh Dickins
---
 mm/hugetlb.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

--- v3.5/mm/hugetlb.c	2012-07-21 13:58:29.000000000 -0700
+++ linux/mm/hugetlb.c	2012-07-22 20:28:59.858077817 -0700
@@ -2393,6 +2393,15 @@ void unmap_hugepage_range(struct vm_area
 {
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	__unmap_hugepage_range(vma, start, end, ref_page);
+	/*
+	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
+	 * test will fail on a vma being torn down, and not grab a page table
+	 * on its way out.  We're lucky that the flag has such an appropriate
+	 * name, and can in fact be safely cleared here.  We could clear it
+	 * before the __unmap_hugepage_range above, but all that's necessary
+	 * is to clear it before releasing the i_mmap_mutex below.
+	 */
+	vma->vm_flags &= ~VM_MAYSHARE;
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }

@@ -2959,9 +2968,14 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-
+	/*
+	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
+	 * may have cleared our pud entry and done put_page on the page table:
+	 * once we release i_mmap_mutex, another task can do the final put_page
+	 * and that page table be reused and filled with junk.
+	 */
 	flush_tlb_range(vma, start, end);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }

 int hugetlb_reserve_pages(struct inode *inode,
Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> And here is my attempt for the fix (Hugh mentioned something similar
> earlier but he suggested using special flags in ptes or VMAs). I still
> owe a doc update, it hasn't been tested with too many configs, and I
> could have missed some definition updates.
> I also think that the changelog could be much better; I will add (steal)
> the full bug description if people think that this way is worth going
> rather than the one suggested by Mel.
> To be honest I am not quite happy how I had to pollute generic mm code
> with something that is specific to a single architecture.
> Mel hammered it with the test case and it survived.

Tested-by: Mel Gorman

This approach looks more or less like what I was expecting. I like that
the trick was applied to the page table page instead of using PTE tricks
or by bodging it with a VMA flag like I was thinking, so kudos for that. I
also prefer this approach to trying to free the page tables on or near
huge_pmd_unshare()

In general I think this patch would execute better than mine because it is
far less heavy-handed, but I share your concern that it changes the core MM
quite a bit for a corner case that only one architecture cares about. I am
completely biased of course, but I still prefer my patch because, other
than an API change, it keeps the bulk of the madness in
arch/x86/mm/hugetlbpage.c. I am also not concerned with the scalability of
how quickly we can set up page table sharing.

Hugh, I'm afraid you get to choose :)

--
Mel Gorman
SUSE Labs
[PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
And here is my attempt for the fix (Hugh mentioned something similar
earlier but he suggested using special flags in ptes or VMAs). I still
owe a doc update, it hasn't been tested with too many configs, and I
could have missed some definition updates.
I also think that the changelog could be much better; I will add (steal)
the full bug description if people think that this way is worth going
rather than the one suggested by Mel.
To be honest I am not quite happy how I had to pollute generic mm code
with something that is specific to a single architecture.
Mel hammered it with the test case and it survived.
---
From d71cd88da83c669beced2aa752847265e896b89b Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Fri, 20 Jul 2012 15:10:40 +0200
Subject: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs

The primary problem is that huge_pte_offset called from huge_pmd_share
might return a spte which is about to be deallocated, and the caller
doesn't have any way to find this out. This means that huge_pmd_share
might reuse the spte and increase the reference count on its page.
i_mmap_mutex is not sufficient because the spte is still available after
unmap_hugepage_range, and free_pgtables, which takes care of spte
teardown, doesn't use any locking.

This patch addresses the issue by marking the spte's page as being due
for removal (misusing page->mapping for that, because it is not used for
an spte page during its whole life cycle). huge_pte_offset then checks
is_hugetlb_pmd_page_valid and ignores such sptes.

The spte page is invalidated after the last pte is unshared and only if
unmap_vmas is followed by free_pgtables. This is all done under
i_mmap_mutex so we cannot race. The spte page is cleaned up later in
free_pmd_range, after the pud is cleared and before it is handed over to
pmd_free_tlb, which frees the page. A write memory barrier makes sure
that other CPUs always see either the cleared pud (thus a NULL pmd) or
the invalidated spte page.

[motivated by Hugh's idea]
Signed-off-by: Michal Hocko
---
 arch/x86/include/asm/hugetlb.h |   30 ++
 arch/x86/mm/hugetlbpage.c      |    7 ++-
 fs/hugetlbfs/inode.c           |    2 +-
 include/linux/hugetlb.h        |    7 ---
 include/linux/mm.h             |    2 +-
 mm/hugetlb.c                   |   10 ++
 mm/memory.c                    |   17 +++--
 mm/mmap.c                      |    4 ++--
 8 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index 439a9ac..eaff713 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -48,6 +48,36 @@ static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 	return ptep_get_and_clear(mm, addr, ptep);
 }

+static inline bool is_hugetlb_pmd_page_valid(struct page *page)
+{
+	smp_rmb();
+	return page->mapping == NULL;
+}
+
+static inline void invalidate_hugetlb_pmd_page(pmd_t *pmd)
+{
+	struct page *pmd_page = virt_to_page(pmd);
+
+	/* TODO add comment about the race */
+	pmd_page->mapping = (struct address_space *)(-1UL);
+	smp_wmb();
+}
+
+#define ARCH_HAVE_CLEAR_HUGETLB_PMD_PAGE 1
+static inline void clear_hugetlb_pmd_page(pmd_t *pmd)
+{
+	struct page *pmd_page = virt_to_page(pmd);
+	if (!is_hugetlb_pmd_page_valid(pmd_page)) {
+		/*
+		 * Make sure that pud_clear is visible before we remove the
+		 * invalidate flag here so huge_pte_offset either returns NULL
+		 * or invalidated pmd page during the tear down.
+		 */
+		smp_wmb();
+		pmd_page->mapping = NULL;
+	}
+}
+
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
 					 unsigned long addr, pte_t *ptep)
 {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..c040089 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -81,7 +81,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		if (saddr) {
 			spte = huge_pte_offset(svma->vm_mm, saddr);
 			if (spte) {
-				get_page(virt_to_page(spte));
+				struct page *spte_page = virt_to_page(spte);
+				if (!is_hugetlb_pmd_page_valid(spte_page)) {
+					spte = NULL;
+					continue;
+				}
+				get_page(spte_page);
 				break;
 			}
 		}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 001ef01..0d0c235 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -417,7 +417,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)