Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
Ping? Can someone take it before it's lost? On Wed, Oct 31, 2012 at 2:29 PM, Sasha Levin wrote: > On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra wrote: >> On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: >>> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra >>> wrote: >>> > >>> > So I think the below should work, we hold the spinlock over both rb-tree >>> > modification as sp free, this makes mpol_shared_policy_lookup() which >>> > returns the policy with an incremented refcount work with just the >>> > spinlock. >>> > >>> > Comments? >>> >>> Looks reasonable, if annoyingly complex for something that shouldn't >>> be important enough for this. Oh well. >> >> I agree with that.. Its just that when doing numa placement one needs to >> respect the pre-existing placement constraints. I've not seen a way >> around this. >> >>> However, please check me on this: the need for this is only for >>> linux-next right now, correct? All the current users in my tree are ok >>> with just the mutex, no? >> >> Yes, the need comes from the numa stuff and I'll stick this patch in >> there. >> >> I completely missed Mel's patch turning it into a mutex, but I guess >> that's what -next is for :-). > > So I've been fuzzing with it for the past couple of days and it's been > looking fine with it. Can someone grab it into his tree please? > > > Thanks, > Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
Ping? Can someone take it before it's lost? On Wed, Oct 31, 2012 at 2:29 PM, Sasha Levin levinsasha...@gmail.com wrote: On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra pet...@infradead.org wrote: On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra pet...@infradead.org wrote: So I think the below should work, we hold the spinlock over both rb-tree modification as sp free, this makes mpol_shared_policy_lookup() which returns the policy with an incremented refcount work with just the spinlock. Comments? Looks reasonable, if annoyingly complex for something that shouldn't be important enough for this. Oh well. I agree with that.. Its just that when doing numa placement one needs to respect the pre-existing placement constraints. I've not seen a way around this. However, please check me on this: the need for this is only for linux-next right now, correct? All the current users in my tree are ok with just the mutex, no? Yes, the need comes from the numa stuff and I'll stick this patch in there. I completely missed Mel's patch turning it into a mutex, but I guess that's what -next is for :-). So I've been fuzzing with it for the past couple of days and it's been looking fine with it. Can someone grab it into his tree please? Thanks, Sasha -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra wrote: > On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: >> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: >> > >> > So I think the below should work, we hold the spinlock over both rb-tree >> > modification as sp free, this makes mpol_shared_policy_lookup() which >> > returns the policy with an incremented refcount work with just the >> > spinlock. >> > >> > Comments? >> >> Looks reasonable, if annoyingly complex for something that shouldn't >> be important enough for this. Oh well. > > I agree with that.. Its just that when doing numa placement one needs to > respect the pre-existing placement constraints. I've not seen a way > around this. > >> However, please check me on this: the need for this is only for >> linux-next right now, correct? All the current users in my tree are ok >> with just the mutex, no? > > Yes, the need comes from the numa stuff and I'll stick this patch in > there. > > I completely missed Mel's patch turning it into a mutex, but I guess > that's what -next is for :-). So I've been fuzzing with it for the past couple of days and it's been looking fine with it. Can someone grab it into his tree please? Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra pet...@infradead.org wrote: On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra pet...@infradead.org wrote: So I think the below should work, we hold the spinlock over both rb-tree modification as sp free, this makes mpol_shared_policy_lookup() which returns the policy with an incremented refcount work with just the spinlock. Comments? Looks reasonable, if annoyingly complex for something that shouldn't be important enough for this. Oh well. I agree with that.. Its just that when doing numa placement one needs to respect the pre-existing placement constraints. I've not seen a way around this. However, please check me on this: the need for this is only for linux-next right now, correct? All the current users in my tree are ok with just the mutex, no? Yes, the need comes from the numa stuff and I'll stick this patch in there. I completely missed Mel's patch turning it into a mutex, but I guess that's what -next is for :-). So I've been fuzzing with it for the past couple of days and it's been looking fine with it. Can someone grab it into his tree please? Thanks, Sasha -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: > On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: > > > > So I think the below should work, we hold the spinlock over both rb-tree > > modification as sp free, this makes mpol_shared_policy_lookup() which > > returns the policy with an incremented refcount work with just the > > spinlock. > > > > Comments? > > Looks reasonable, if annoyingly complex for something that shouldn't > be important enough for this. Oh well. I agree with that.. Its just that when doing numa placement one needs to respect the pre-existing placement constraints. I've not seen a way around this. > However, please check me on this: the need for this is only for > linux-next right now, correct? All the current users in my tree are ok > with just the mutex, no? Yes, the need comes from the numa stuff and I'll stick this patch in there. I completely missed Mel's patch turning it into a mutex, but I guess that's what -next is for :-). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra pet...@infradead.org wrote: So I think the below should work, we hold the spinlock over both rb-tree modification as sp free, this makes mpol_shared_policy_lookup() which returns the policy with an incremented refcount work with just the spinlock. Comments? Looks reasonable, if annoyingly complex for something that shouldn't be important enough for this. Oh well. I agree with that.. Its just that when doing numa placement one needs to respect the pre-existing placement constraints. I've not seen a way around this. However, please check me on this: the need for this is only for linux-next right now, correct? All the current users in my tree are ok with just the mutex, no? Yes, the need comes from the numa stuff and I'll stick this patch in there. I completely missed Mel's patch turning it into a mutex, but I guess that's what -next is for :-). -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: > > So I think the below should work, we hold the spinlock over both rb-tree > modification as sp free, this makes mpol_shared_policy_lookup() which > returns the policy with an incremented refcount work with just the > spinlock. > > Comments? Looks reasonable, if annoyingly complex for something that shouldn't be important enough for this. Oh well. However, please check me on this: the need for this is only for linux-next right now, correct? All the current users in my tree are ok with just the mutex, no? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Thu, 25 Oct 2012, Peter Zijlstra wrote: > So I think the below should work, we hold the spinlock over both rb-tree > modification as sp free, this makes mpol_shared_policy_lookup() which > returns the policy with an incremented refcount work with just the > spinlock. > > Comments? > It's rather unfortunate that we need to protect modification with a spinlock and a mutex but since sharing was removed in commit 869833f2c5c6 ("mempolicy: remove mempolicy sharing") it requires that sp_alloc() is blockable to do the whole mpol_new() and rebind if necessary, which could require mm->mmap_sem; it's not as simple as just converting all the allocations to GFP_ATOMIC. It looks as though there is no other alternative other than protecting modification with both the spinlock and mutex, which is a clever solution, so it looks good to me, thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On 10/25/2012 10:39 AM, Peter Zijlstra wrote: > On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote: >> On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: >>> Ok, this looks the same but it's actually a different issue: >>> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, >>> calls get_vma_policy() which may take the shared policy mutex. This >>> happens while holding page_table_lock from do_huge_pmd_numa_page() but >>> also from do_numa_page() while holding a spinlock on the ptl, which is >>> coming from the sched/numa branch. >>> >>> Is there anyway that we can avoid changing the shared policy mutex back >>> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race >>> in shared_policy_replace()"])? >>> >>> Adding Peter, Rik, and Mel to the cc. >> >> Urgh, crud I totally missed that. >> >> So the problem is that we need to compute if the current page is placed >> 'right' while holding pte_lock in order to avoid multiple pte_lock >> acquisitions on the 'fast' path. >> >> I'll look into this in a bit, but one thing that comes to mind is having >> both a spnilock and a mutex and require holding both for modification >> while either one is sufficient for read. >> >> That would allow sp_lookup() to use the spinlock, while insert and >> replace can hold both. >> >> Not sure it will work for this, need to stare at this code a little >> more. > > So I think the below should work, we hold the spinlock over both rb-tree > modification as sp free, this makes mpol_shared_policy_lookup() which > returns the policy with an incremented refcount work with just the > spinlock. > > Comments? > > --- It made the warnings I've reported go away. Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote: > On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: > > Ok, this looks the same but it's actually a different issue: > > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, > > calls get_vma_policy() which may take the shared policy mutex. This > > happens while holding page_table_lock from do_huge_pmd_numa_page() but > > also from do_numa_page() while holding a spinlock on the ptl, which is > > coming from the sched/numa branch. > > > > Is there anyway that we can avoid changing the shared policy mutex back > > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race > > in shared_policy_replace()"])? > > > > Adding Peter, Rik, and Mel to the cc. > > Urgh, crud I totally missed that. > > So the problem is that we need to compute if the current page is placed > 'right' while holding pte_lock in order to avoid multiple pte_lock > acquisitions on the 'fast' path. > > I'll look into this in a bit, but one thing that comes to mind is having > both a spnilock and a mutex and require holding both for modification > while either one is sufficient for read. > > That would allow sp_lookup() to use the spinlock, while insert and > replace can hold both. > > Not sure it will work for this, need to stare at this code a little > more. So I think the below should work, we hold the spinlock over both rb-tree modification as sp free, this makes mpol_shared_policy_lookup() which returns the policy with an incremented refcount work with just the spinlock. Comments? --- include/linux/mempolicy.h |1 + mm/mempolicy.c| 23 ++- 2 files changed, 19 insertions(+), 5 deletions(-) --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -133,6 +133,7 @@ struct sp_node { struct shared_policy { struct rb_root root; + spinlock_t lock; struct mutex mutex; }; --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2099,12 +2099,20 @@ bool __mpol_equal(struct mempolicy *a, s * * Remember policies even when nobody has shared memory mapped. * The policies are kept in Red-Black tree linked from the inode. - * They are protected by the sp->lock spinlock, which should be held - * for any accesses to the tree. + * + * The rb-tree is locked using both a mutex and a spinlock. Every modification + * to the tree must hold both the mutex and the spinlock, lookups can hold + * either to observe a stable tree. + * + * In particular, sp_insert() and sp_delete() take the spinlock, whereas + * sp_lookup() doesn't, this so users have choice. + * + * shared_policy_replace() and mpol_free_shared_policy() take the mutex + * and call sp_insert(), sp_delete(). */ /* lookup first element intersecting start-end */ -/* Caller holds sp->mutex */ +/* Caller holds either sp->lock and/or sp->mutex */ static struct sp_node * sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end) { @@ -2143,6 +2151,7 @@ static void sp_insert(struct shared_poli struct rb_node *parent = NULL; struct sp_node *nd; + spin_lock(>lock); while (*p) { parent = *p; nd = rb_entry(parent, struct sp_node, nd); @@ -2155,6 +2164,7 @@ static void sp_insert(struct shared_poli } rb_link_node(>nd, parent, p); rb_insert_color(>nd, >root); + spin_unlock(>lock); pr_debug("inserting %lx-%lx: %d\n", new->start, new->end, new->policy ? new->policy->mode : 0); } @@ -2168,13 +2178,13 @@ mpol_shared_policy_lookup(struct shared_ if (!sp->root.rb_node) return NULL; - mutex_lock(>mutex); + spin_lock(>lock); sn = sp_lookup(sp, idx, idx+1); if (sn) { mpol_get(sn->policy); pol = sn->policy; } - mutex_unlock(>mutex); + spin_unlock(>lock); return pol; } @@ -2295,8 +2305,10 @@ int mpol_misplaced(struct page *page, st static void sp_delete(struct shared_policy *sp, struct sp_node *n) { pr_debug("deleting %lx-l%lx\n", n->start, n->end); + spin_lock(>lock); rb_erase(>nd, >root); sp_free(n); + spin_unlock(>lock); } static struct sp_node *sp_alloc(unsigned long start, unsigned long end, @@ -2381,6 +2393,7 @@ void mpol_shared_policy_init(struct shar int ret; sp->root = RB_ROOT; /* empty tree == default mempolicy */ + spin_lock_init(>lock); mutex_init(>mutex); if (mpol) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: > Ok, this looks the same but it's actually a different issue: > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, > calls get_vma_policy() which may take the shared policy mutex. This > happens while holding page_table_lock from do_huge_pmd_numa_page() but > also from do_numa_page() while holding a spinlock on the ptl, which is > coming from the sched/numa branch. > > Is there anyway that we can avoid changing the shared policy mutex back > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race > in shared_policy_replace()"])? > > Adding Peter, Rik, and Mel to the cc. Urgh, crud I totally missed that. So the problem is that we need to compute if the current page is placed 'right' while holding pte_lock in order to avoid multiple pte_lock acquisitions on the 'fast' path. I'll look into this in a bit, but one thing that comes to mind is having both a spnilock and a mutex and require holding both for modification while either one is sufficient for read. That would allow sp_lookup() to use the spinlock, while insert and replace can hold both. Not sure it will work for this, need to stare at this code a little more. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: Ok, this looks the same but it's actually a different issue: mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, calls get_vma_policy() which may take the shared policy mutex. This happens while holding page_table_lock from do_huge_pmd_numa_page() but also from do_numa_page() while holding a spinlock on the ptl, which is coming from the sched/numa branch. Is there anyway that we can avoid changing the shared policy mutex back into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race in shared_policy_replace()])? Adding Peter, Rik, and Mel to the cc. Urgh, crud I totally missed that. So the problem is that we need to compute if the current page is placed 'right' while holding pte_lock in order to avoid multiple pte_lock acquisitions on the 'fast' path. I'll look into this in a bit, but one thing that comes to mind is having both a spnilock and a mutex and require holding both for modification while either one is sufficient for read. That would allow sp_lookup() to use the spinlock, while insert and replace can hold both. Not sure it will work for this, need to stare at this code a little more. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote: On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: Ok, this looks the same but it's actually a different issue: mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, calls get_vma_policy() which may take the shared policy mutex. This happens while holding page_table_lock from do_huge_pmd_numa_page() but also from do_numa_page() while holding a spinlock on the ptl, which is coming from the sched/numa branch. Is there anyway that we can avoid changing the shared policy mutex back into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race in shared_policy_replace()])? Adding Peter, Rik, and Mel to the cc. Urgh, crud I totally missed that. So the problem is that we need to compute if the current page is placed 'right' while holding pte_lock in order to avoid multiple pte_lock acquisitions on the 'fast' path. I'll look into this in a bit, but one thing that comes to mind is having both a spnilock and a mutex and require holding both for modification while either one is sufficient for read. That would allow sp_lookup() to use the spinlock, while insert and replace can hold both. Not sure it will work for this, need to stare at this code a little more. So I think the below should work, we hold the spinlock over both rb-tree modification as sp free, this makes mpol_shared_policy_lookup() which returns the policy with an incremented refcount work with just the spinlock. Comments? --- include/linux/mempolicy.h |1 + mm/mempolicy.c| 23 ++- 2 files changed, 19 insertions(+), 5 deletions(-) --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -133,6 +133,7 @@ struct sp_node { struct shared_policy { struct rb_root root; + spinlock_t lock; struct mutex mutex; }; --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2099,12 +2099,20 @@ bool __mpol_equal(struct mempolicy *a, s * * Remember policies even when nobody has shared memory mapped. * The policies are kept in Red-Black tree linked from the inode. - * They are protected by the sp-lock spinlock, which should be held - * for any accesses to the tree. + * + * The rb-tree is locked using both a mutex and a spinlock. Every modification + * to the tree must hold both the mutex and the spinlock, lookups can hold + * either to observe a stable tree. + * + * In particular, sp_insert() and sp_delete() take the spinlock, whereas + * sp_lookup() doesn't, this so users have choice. + * + * shared_policy_replace() and mpol_free_shared_policy() take the mutex + * and call sp_insert(), sp_delete(). */ /* lookup first element intersecting start-end */ -/* Caller holds sp-mutex */ +/* Caller holds either sp-lock and/or sp-mutex */ static struct sp_node * sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end) { @@ -2143,6 +2151,7 @@ static void sp_insert(struct shared_poli struct rb_node *parent = NULL; struct sp_node *nd; + spin_lock(sp-lock); while (*p) { parent = *p; nd = rb_entry(parent, struct sp_node, nd); @@ -2155,6 +2164,7 @@ static void sp_insert(struct shared_poli } rb_link_node(new-nd, parent, p); rb_insert_color(new-nd, sp-root); + spin_unlock(sp-lock); pr_debug(inserting %lx-%lx: %d\n, new-start, new-end, new-policy ? new-policy-mode : 0); } @@ -2168,13 +2178,13 @@ mpol_shared_policy_lookup(struct shared_ if (!sp-root.rb_node) return NULL; - mutex_lock(sp-mutex); + spin_lock(sp-lock); sn = sp_lookup(sp, idx, idx+1); if (sn) { mpol_get(sn-policy); pol = sn-policy; } - mutex_unlock(sp-mutex); + spin_unlock(sp-lock); return pol; } @@ -2295,8 +2305,10 @@ int mpol_misplaced(struct page *page, st static void sp_delete(struct shared_policy *sp, struct sp_node *n) { pr_debug(deleting %lx-l%lx\n, n-start, n-end); + spin_lock(sp-lock); rb_erase(n-nd, sp-root); sp_free(n); + spin_unlock(sp-lock); } static struct sp_node *sp_alloc(unsigned long start, unsigned long end, @@ -2381,6 +2393,7 @@ void mpol_shared_policy_init(struct shar int ret; sp-root = RB_ROOT; /* empty tree == default mempolicy */ + spin_lock_init(sp-lock); mutex_init(sp-mutex); if (mpol) { -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On 10/25/2012 10:39 AM, Peter Zijlstra wrote: On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote: On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: Ok, this looks the same but it's actually a different issue: mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, calls get_vma_policy() which may take the shared policy mutex. This happens while holding page_table_lock from do_huge_pmd_numa_page() but also from do_numa_page() while holding a spinlock on the ptl, which is coming from the sched/numa branch. Is there anyway that we can avoid changing the shared policy mutex back into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race in shared_policy_replace()])? Adding Peter, Rik, and Mel to the cc. Urgh, crud I totally missed that. So the problem is that we need to compute if the current page is placed 'right' while holding pte_lock in order to avoid multiple pte_lock acquisitions on the 'fast' path. I'll look into this in a bit, but one thing that comes to mind is having both a spnilock and a mutex and require holding both for modification while either one is sufficient for read. That would allow sp_lookup() to use the spinlock, while insert and replace can hold both. Not sure it will work for this, need to stare at this code a little more. So I think the below should work, we hold the spinlock over both rb-tree modification as sp free, this makes mpol_shared_policy_lookup() which returns the policy with an incremented refcount work with just the spinlock. Comments? --- It made the warnings I've reported go away. Thanks, Sasha -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Thu, 25 Oct 2012, Peter Zijlstra wrote: So I think the below should work, we hold the spinlock over both rb-tree modification as sp free, this makes mpol_shared_policy_lookup() which returns the policy with an incremented refcount work with just the spinlock. Comments? It's rather unfortunate that we need to protect modification with a spinlock and a mutex but since sharing was removed in commit 869833f2c5c6 (mempolicy: remove mempolicy sharing) it requires that sp_alloc() is blockable to do the whole mpol_new() and rebind if necessary, which could require mm-mmap_sem; it's not as simple as just converting all the allocations to GFP_ATOMIC. It looks as though there is no other alternative other than protecting modification with both the spinlock and mutex, which is a clever solution, so it looks good to me, thanks! -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra pet...@infradead.org wrote: So I think the below should work, we hold the spinlock over both rb-tree modification as sp free, this makes mpol_shared_policy_lookup() which returns the policy with an incremented refcount work with just the spinlock. Comments? Looks reasonable, if annoyingly complex for something that shouldn't be important enough for this. Oh well. However, please check me on this: the need for this is only for linux-next right now, correct? All the current users in my tree are ok with just the mutex, no? Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 24 Oct 2012, KOSAKI Motohiro wrote: > Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter, > I guess you commited it, right? If so, may I review your mempolicy > changes? Now mempolicy has a lot of horrible buggy code and I hope to > maintain carefully. Which tree should i see? > Check out sched/numa from git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git $ git diff v3.7-rc2.. mm/mempolicy.c | diffstat mempolicy.c | 444 +--- 1 file changed, 277 insertions(+), 167 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 24, 2012 at 8:08 PM, David Rientjes wrote: > On Wed, 24 Oct 2012, Sasha Levin wrote: > >> > This should be fixed by 9e7814404b77 ("hold task->mempolicy while >> > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading >> > /proc/pid/numa_maps on that kernel? >> >> I was actually referring to the warnings Dave Jones saw when fuzzing >> with trinity after the >> original patch was applied. >> >> I still see the following when fuzzing: >> >> [ 338.467156] BUG: sleeping function called from invalid context at >> kernel/mutex.c:269 >> [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: >> trinity-main >> [ 338.481199] 2 locks held by trinity-main/6361: >> [ 338.486629] #0: (>mmap_sem){++}, at: [] >> __do_page_fault+0x1e4/0x4f0 >> [ 338.498783] #1: (&(>page_table_lock)->rlock){+.+...}, at: >> [] handle_pte_fault+0x3f7/0x6a0 >> [ 338.511409] Pid: 6361, comm: trinity-main Tainted: GW >> 3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74 >> [ 338.530318] Call Trace: >> [ 338.534088] [] __might_sleep+0x1c3/0x1e0 >> [ 338.539358] [] mutex_lock_nested+0x29/0x50 >> [ 338.545253] [] mpol_shared_policy_lookup+0x2e/0x90 >> [ 338.545258] [] shmem_get_policy+0x2e/0x30 >> [ 338.545264] [] get_vma_policy+0x5a/0xa0 >> [ 338.545267] [] mpol_misplaced+0x41/0x1d0 >> [ 338.545272] [] handle_pte_fault+0x465/0x6a0 >> [ 338.545278] [] ? __rcu_read_unlock+0x44/0xb0 >> [ 338.545282] [] handle_mm_fault+0x32a/0x360 >> [ 338.545286] [] __do_page_fault+0x480/0x4f0 >> [ 338.545293] [] ? del_timer+0x26/0x80 >> [ 338.545298] [] ? rcu_cleanup_after_idle+0x23/0x170 >> [ 338.545302] [] ? rcu_eqs_exit_common+0x64/0x3a0 >> [ 338.545305] [] ? rcu_eqs_enter_common+0x7c6/0x970 >> [ 338.545309] [] ? rcu_eqs_exit+0x9c/0xb0 >> [ 338.545312] [] do_page_fault+0x26/0x40 >> [ 338.545317] [] do_async_page_fault+0x30/0xa0 >> [ 338.545321] [] async_page_fault+0x28/0x30 >> > > Ok, this looks the same but it's actually a different issue: > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, > calls get_vma_policy() which may take the shared policy mutex. This > happens while holding page_table_lock from do_huge_pmd_numa_page() but > also from do_numa_page() while holding a spinlock on the ptl, which is > coming from the sched/numa branch. > > Is there anyway that we can avoid changing the shared policy mutex back > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race > in shared_policy_replace()"])? > > Adding Peter, Rik, and Mel to the cc. Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter, I guess you commited it, right? If so, may I review your mempolicy changes? Now mempolicy has a lot of horrible buggy code and I hope to maintain carefully. Which tree should i see? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 24 Oct 2012, Sasha Levin wrote: > > This should be fixed by 9e7814404b77 ("hold task->mempolicy while > > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading > > /proc/pid/numa_maps on that kernel? > > I was actually referring to the warnings Dave Jones saw when fuzzing > with trinity after the > original patch was applied. > > I still see the following when fuzzing: > > [ 338.467156] BUG: sleeping function called from invalid context at > kernel/mutex.c:269 > [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: > trinity-main > [ 338.481199] 2 locks held by trinity-main/6361: > [ 338.486629] #0: (>mmap_sem){++}, at: [] > __do_page_fault+0x1e4/0x4f0 > [ 338.498783] #1: (&(>page_table_lock)->rlock){+.+...}, at: > [] handle_pte_fault+0x3f7/0x6a0 > [ 338.511409] Pid: 6361, comm: trinity-main Tainted: GW > 3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74 > [ 338.530318] Call Trace: > [ 338.534088] [] __might_sleep+0x1c3/0x1e0 > [ 338.539358] [] mutex_lock_nested+0x29/0x50 > [ 338.545253] [] mpol_shared_policy_lookup+0x2e/0x90 > [ 338.545258] [] shmem_get_policy+0x2e/0x30 > [ 338.545264] [] get_vma_policy+0x5a/0xa0 > [ 338.545267] [] mpol_misplaced+0x41/0x1d0 > [ 338.545272] [] handle_pte_fault+0x465/0x6a0 > [ 338.545278] [] ? __rcu_read_unlock+0x44/0xb0 > [ 338.545282] [] handle_mm_fault+0x32a/0x360 > [ 338.545286] [] __do_page_fault+0x480/0x4f0 > [ 338.545293] [] ? del_timer+0x26/0x80 > [ 338.545298] [] ? rcu_cleanup_after_idle+0x23/0x170 > [ 338.545302] [] ? rcu_eqs_exit_common+0x64/0x3a0 > [ 338.545305] [] ? rcu_eqs_enter_common+0x7c6/0x970 > [ 338.545309] [] ? rcu_eqs_exit+0x9c/0xb0 > [ 338.545312] [] do_page_fault+0x26/0x40 > [ 338.545317] [] do_async_page_fault+0x30/0xa0 > [ 338.545321] [] async_page_fault+0x28/0x30 > Ok, this looks the same but it's actually a different issue: mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, calls get_vma_policy() which may take the shared policy mutex. This happens while holding page_table_lock from do_huge_pmd_numa_page() but also from do_numa_page() while holding a spinlock on the ptl, which is coming from the sched/numa branch. Is there anyway that we can avoid changing the shared policy mutex back into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race in shared_policy_replace()"])? Adding Peter, Rik, and Mel to the cc. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 24, 2012 at 7:34 PM, David Rientjes wrote: > On Wed, 24 Oct 2012, Sasha Levin wrote: > >> I'm not sure about the status of the patch, but it doesn't apply on >> top of -next, and I still >> see the warnings when fuzzing on -next. >> > > This should be fixed by 9e7814404b77 ("hold task->mempolicy while > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading > /proc/pid/numa_maps on that kernel? I was actually referring to the warnings Dave Jones saw when fuzzing with trinity after the original patch was applied. I still see the following when fuzzing: [ 338.467156] BUG: sleeping function called from invalid context at kernel/mutex.c:269 [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main [ 338.481199] 2 locks held by trinity-main/6361: [ 338.486629] #0: (>mmap_sem){++}, at: [] __do_page_fault+0x1e4/0x4f0 [ 338.498783] #1: (&(>page_table_lock)->rlock){+.+...}, at: [] handle_pte_fault+0x3f7/0x6a0 [ 338.511409] Pid: 6361, comm: trinity-main Tainted: GW 3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74 [ 338.530318] Call Trace: [ 338.534088] [] __might_sleep+0x1c3/0x1e0 [ 338.539358] [] mutex_lock_nested+0x29/0x50 [ 338.545253] [] mpol_shared_policy_lookup+0x2e/0x90 [ 338.545258] [] shmem_get_policy+0x2e/0x30 [ 338.545264] [] get_vma_policy+0x5a/0xa0 [ 338.545267] [] mpol_misplaced+0x41/0x1d0 [ 338.545272] [] handle_pte_fault+0x465/0x6a0 [ 338.545278] [] ? __rcu_read_unlock+0x44/0xb0 [ 338.545282] [] handle_mm_fault+0x32a/0x360 [ 338.545286] [] __do_page_fault+0x480/0x4f0 [ 338.545293] [] ? del_timer+0x26/0x80 [ 338.545298] [] ? rcu_cleanup_after_idle+0x23/0x170 [ 338.545302] [] ? rcu_eqs_exit_common+0x64/0x3a0 [ 338.545305] [] ? rcu_eqs_enter_common+0x7c6/0x970 [ 338.545309] [] ? rcu_eqs_exit+0x9c/0xb0 [ 338.545312] [] do_page_fault+0x26/0x40 [ 338.545317] [] do_async_page_fault+0x30/0xa0 [ 338.545321] [] async_page_fault+0x28/0x30 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 24 Oct 2012, Sasha Levin wrote: > I'm not sure about the status of the patch, but it doesn't apply on > top of -next, and I still > see the warnings when fuzzing on -next. > This should be fixed by 9e7814404b77 ("hold task->mempolicy while numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading /proc/pid/numa_maps on that kernel? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 17, 2012 at 1:24 AM, David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > >> BUG: sleeping function called from invalid context at kernel/mutex.c:269 >> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 >> 3 locks on stack by trinity-child2/8558: >> #0: held: (>lock){+.+.+.}, instance: 88010c9a00b0, at: >> [] seq_lseek+0x3f/0x120 >> #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: >> [] m_start+0xa7/0x190 >> #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: >> 88011fc64f30, at: [] show_numa_map+0x14f/0x610 >> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 >> Call Trace: >> [] __might_sleep+0x14c/0x200 >> [] mutex_lock_nested+0x2e/0x50 >> [] mpol_shared_policy_lookup+0x33/0x90 >> [] shmem_get_policy+0x33/0x40 >> [] get_vma_policy+0x3a/0x90 >> [] show_numa_map+0x163/0x610 >> [] ? pid_maps_open+0x20/0x20 >> [] ? pagemap_hugetlb_range+0xf0/0xf0 >> [] show_pid_numa_map+0x13/0x20 >> [] traverse+0xf2/0x230 >> [] seq_lseek+0xab/0x120 >> [] sys_lseek+0x7b/0xb0 >> [] tracesys+0xe1/0xe6 >> > > Hmm, looks like we need to change the refcount semantics entirely. We'll > need to make get_vma_policy() always take a reference and then drop it > accordingly. This work sif get_vma_policy() can grab a reference while > holding task_lock() for the task policy fallback case. > > Comments on this approach? > --- [snip] I'm not sure about the status of the patch, but it doesn't apply on top of -next, and I still see the warnings when fuzzing on -next. Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 17, 2012 at 1:24 AM, David Rientjes rient...@google.com wrote: On Wed, 17 Oct 2012, Dave Jones wrote: BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 3 locks on stack by trinity-child2/8558: #0: held: (p-lock){+.+.+.}, instance: 88010c9a00b0, at: [8120cd1f] seq_lseek+0x3f/0x120 #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: [81254437] m_start+0xa7/0x190 #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 88011fc64f30, at: [81254f8f] show_numa_map+0x14f/0x610 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 Call Trace: [810ae4ec] __might_sleep+0x14c/0x200 [816bdf4e] mutex_lock_nested+0x2e/0x50 [811c43a3] mpol_shared_policy_lookup+0x33/0x90 [8118d5c3] shmem_get_policy+0x33/0x40 [811c31fa] get_vma_policy+0x3a/0x90 [81254fa3] show_numa_map+0x163/0x610 [81255b10] ? pid_maps_open+0x20/0x20 [81255980] ? pagemap_hugetlb_range+0xf0/0xf0 [81255483] show_pid_numa_map+0x13/0x20 [8120c902] traverse+0xf2/0x230 [8120cd8b] seq_lseek+0xab/0x120 [811e6c0b] sys_lseek+0x7b/0xb0 [816ca088] tracesys+0xe1/0xe6 Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This work sif get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? --- [snip] I'm not sure about the status of the patch, but it doesn't apply on top of -next, and I still see the warnings when fuzzing on -next. Thanks, Sasha -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 24 Oct 2012, Sasha Levin wrote: I'm not sure about the status of the patch, but it doesn't apply on top of -next, and I still see the warnings when fuzzing on -next. This should be fixed by 9e7814404b77 (hold task-mempolicy while numa_maps scans.) in 3.7-rc2, can you reproduce any issues reading /proc/pid/numa_maps on that kernel? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 24, 2012 at 7:34 PM, David Rientjes rient...@google.com wrote: On Wed, 24 Oct 2012, Sasha Levin wrote: I'm not sure about the status of the patch, but it doesn't apply on top of -next, and I still see the warnings when fuzzing on -next. This should be fixed by 9e7814404b77 (hold task-mempolicy while numa_maps scans.) in 3.7-rc2, can you reproduce any issues reading /proc/pid/numa_maps on that kernel? I was actually referring to the warnings Dave Jones saw when fuzzing with trinity after the original patch was applied. I still see the following when fuzzing: [ 338.467156] BUG: sleeping function called from invalid context at kernel/mutex.c:269 [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main [ 338.481199] 2 locks held by trinity-main/6361: [ 338.486629] #0: (mm-mmap_sem){++}, at: [810aa314] __do_page_fault+0x1e4/0x4f0 [ 338.498783] #1: ((mm-page_table_lock)-rlock){+.+...}, at: [8122f017] handle_pte_fault+0x3f7/0x6a0 [ 338.511409] Pid: 6361, comm: trinity-main Tainted: GW 3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74 [ 338.530318] Call Trace: [ 338.534088] [8114e393] __might_sleep+0x1c3/0x1e0 [ 338.539358] [83ae5209] mutex_lock_nested+0x29/0x50 [ 338.545253] [8124fc3e] mpol_shared_policy_lookup+0x2e/0x90 [ 338.545258] [81219ebe] shmem_get_policy+0x2e/0x30 [ 338.545264] [8124e99a] get_vma_policy+0x5a/0xa0 [ 338.545267] [8124fce1] mpol_misplaced+0x41/0x1d0 [ 338.545272] [8122f085] handle_pte_fault+0x465/0x6a0 [ 338.545278] [81131e04] ? __rcu_read_unlock+0x44/0xb0 [ 338.545282] [81230baa] handle_mm_fault+0x32a/0x360 [ 338.545286] [810aa5b0] __do_page_fault+0x480/0x4f0 [ 338.545293] [8111a706] ? del_timer+0x26/0x80 [ 338.545298] [811c7313] ? rcu_cleanup_after_idle+0x23/0x170 [ 338.545302] [811ca9a4] ? rcu_eqs_exit_common+0x64/0x3a0 [ 338.545305] [811c8c66] ? rcu_eqs_enter_common+0x7c6/0x970 [ 338.545309] [811cafdc] ? rcu_eqs_exit+0x9c/0xb0 [ 338.545312] [810aa666] do_page_fault+0x26/0x40 [ 338.545317] [810a3a40] do_async_page_fault+0x30/0xa0 [ 338.545321] [83ae9268] async_page_fault+0x28/0x30 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 24 Oct 2012, Sasha Levin wrote: This should be fixed by 9e7814404b77 (hold task-mempolicy while numa_maps scans.) in 3.7-rc2, can you reproduce any issues reading /proc/pid/numa_maps on that kernel? I was actually referring to the warnings Dave Jones saw when fuzzing with trinity after the original patch was applied. I still see the following when fuzzing: [ 338.467156] BUG: sleeping function called from invalid context at kernel/mutex.c:269 [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main [ 338.481199] 2 locks held by trinity-main/6361: [ 338.486629] #0: (mm-mmap_sem){++}, at: [810aa314] __do_page_fault+0x1e4/0x4f0 [ 338.498783] #1: ((mm-page_table_lock)-rlock){+.+...}, at: [8122f017] handle_pte_fault+0x3f7/0x6a0 [ 338.511409] Pid: 6361, comm: trinity-main Tainted: GW 3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74 [ 338.530318] Call Trace: [ 338.534088] [8114e393] __might_sleep+0x1c3/0x1e0 [ 338.539358] [83ae5209] mutex_lock_nested+0x29/0x50 [ 338.545253] [8124fc3e] mpol_shared_policy_lookup+0x2e/0x90 [ 338.545258] [81219ebe] shmem_get_policy+0x2e/0x30 [ 338.545264] [8124e99a] get_vma_policy+0x5a/0xa0 [ 338.545267] [8124fce1] mpol_misplaced+0x41/0x1d0 [ 338.545272] [8122f085] handle_pte_fault+0x465/0x6a0 [ 338.545278] [81131e04] ? __rcu_read_unlock+0x44/0xb0 [ 338.545282] [81230baa] handle_mm_fault+0x32a/0x360 [ 338.545286] [810aa5b0] __do_page_fault+0x480/0x4f0 [ 338.545293] [8111a706] ? del_timer+0x26/0x80 [ 338.545298] [811c7313] ? rcu_cleanup_after_idle+0x23/0x170 [ 338.545302] [811ca9a4] ? rcu_eqs_exit_common+0x64/0x3a0 [ 338.545305] [811c8c66] ? rcu_eqs_enter_common+0x7c6/0x970 [ 338.545309] [811cafdc] ? rcu_eqs_exit+0x9c/0xb0 [ 338.545312] [810aa666] do_page_fault+0x26/0x40 [ 338.545317] [810a3a40] do_async_page_fault+0x30/0xa0 [ 338.545321] [83ae9268] async_page_fault+0x28/0x30 Ok, this looks the same but it's actually a different issue: mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, calls get_vma_policy() which may take the shared policy mutex. This happens while holding page_table_lock from do_huge_pmd_numa_page() but also from do_numa_page() while holding a spinlock on the ptl, which is coming from the sched/numa branch. Is there anyway that we can avoid changing the shared policy mutex back into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race in shared_policy_replace()])? Adding Peter, Rik, and Mel to the cc. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 24, 2012 at 8:08 PM, David Rientjes rient...@google.com wrote: On Wed, 24 Oct 2012, Sasha Levin wrote: This should be fixed by 9e7814404b77 (hold task-mempolicy while numa_maps scans.) in 3.7-rc2, can you reproduce any issues reading /proc/pid/numa_maps on that kernel? I was actually referring to the warnings Dave Jones saw when fuzzing with trinity after the original patch was applied. I still see the following when fuzzing: [ 338.467156] BUG: sleeping function called from invalid context at kernel/mutex.c:269 [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main [ 338.481199] 2 locks held by trinity-main/6361: [ 338.486629] #0: (mm-mmap_sem){++}, at: [810aa314] __do_page_fault+0x1e4/0x4f0 [ 338.498783] #1: ((mm-page_table_lock)-rlock){+.+...}, at: [8122f017] handle_pte_fault+0x3f7/0x6a0 [ 338.511409] Pid: 6361, comm: trinity-main Tainted: GW 3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74 [ 338.530318] Call Trace: [ 338.534088] [8114e393] __might_sleep+0x1c3/0x1e0 [ 338.539358] [83ae5209] mutex_lock_nested+0x29/0x50 [ 338.545253] [8124fc3e] mpol_shared_policy_lookup+0x2e/0x90 [ 338.545258] [81219ebe] shmem_get_policy+0x2e/0x30 [ 338.545264] [8124e99a] get_vma_policy+0x5a/0xa0 [ 338.545267] [8124fce1] mpol_misplaced+0x41/0x1d0 [ 338.545272] [8122f085] handle_pte_fault+0x465/0x6a0 [ 338.545278] [81131e04] ? __rcu_read_unlock+0x44/0xb0 [ 338.545282] [81230baa] handle_mm_fault+0x32a/0x360 [ 338.545286] [810aa5b0] __do_page_fault+0x480/0x4f0 [ 338.545293] [8111a706] ? del_timer+0x26/0x80 [ 338.545298] [811c7313] ? rcu_cleanup_after_idle+0x23/0x170 [ 338.545302] [811ca9a4] ? rcu_eqs_exit_common+0x64/0x3a0 [ 338.545305] [811c8c66] ? rcu_eqs_enter_common+0x7c6/0x970 [ 338.545309] [811cafdc] ? rcu_eqs_exit+0x9c/0xb0 [ 338.545312] [810aa666] do_page_fault+0x26/0x40 [ 338.545317] [810a3a40] do_async_page_fault+0x30/0xa0 [ 338.545321] [83ae9268] async_page_fault+0x28/0x30 Ok, this looks the same but it's actually a different issue: mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, calls get_vma_policy() which may take the shared policy mutex. This happens while holding page_table_lock from do_huge_pmd_numa_page() but also from do_numa_page() while holding a spinlock on the ptl, which is coming from the sched/numa branch. Is there anyway that we can avoid changing the shared policy mutex back into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race in shared_policy_replace()])? Adding Peter, Rik, and Mel to the cc. Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter, I guess you commited it, right? If so, may I review your mempolicy changes? Now mempolicy has a lot of horrible buggy code and I hope to maintain carefully. Which tree should i see? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 24 Oct 2012, KOSAKI Motohiro wrote: Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter, I guess you commited it, right? If so, may I review your mempolicy changes? Now mempolicy has a lot of horrible buggy code and I hope to maintain carefully. Which tree should i see? Check out sched/numa from git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git $ git diff v3.7-rc2.. mm/mempolicy.c | diffstat mempolicy.c | 444 +--- 1 file changed, 277 insertions(+), 167 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: > > Um, this was just changed to a mutex last week in commit b22d127a39dd > > ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc() > > can be done with GFP_KERNEL, so I didn't consider reverting that behavior. > > Are you nacking that patch, which you acked, now? > > Yes, sadly. /proc usage is a corner case issue. It's not worth to > strike main path. It also simplifies the fastpath since we can now unconditionally drop the reference. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 17, 2012 at 3:50 PM, David Rientjes wrote: > On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: > >> > I think this refcounting is better than using task_lock(). >> >> I don't think so. get_vma_policy() is used from fast path. In other >> words, number of >> atomic ops is sensible for allocation performance. > > There are enhancements that we can make with refcounting: for instance, we > may want to avoid doing it in the super-fast path when the policy is > default_policy and then just do > > if (mpol != _policy) > mpol_put(mpol); > >> Instead, I'd like >> to use spinlock >> for shared mempolicy instead of mutex. >> > > Um, this was just changed to a mutex last week in commit b22d127a39dd > ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc() > can be done with GFP_KERNEL, so I didn't consider reverting that behavior. > Are you nacking that patch, which you acked, now? Yes, sadly. /proc usage is a corner case issue. It's not worth to strike main path. see commit 52cd3b0740 and around patches. That explain why we avoided your approach. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: > > I think this refcounting is better than using task_lock(). > > I don't think so. get_vma_policy() is used from fast path. In other > words, number of > atomic ops is sensible for allocation performance. There are enhancements that we can make with refcounting: for instance, we may want to avoid doing it in the super-fast path when the policy is default_policy and then just do if (mpol != _policy) mpol_put(mpol); > Instead, I'd like > to use spinlock > for shared mempolicy instead of mutex. > Um, this was just changed to a mutex last week in commit b22d127a39dd ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc() can be done with GFP_KERNEL, so I didn't consider reverting that behavior. Are you nacking that patch, which you acked, now? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 17, 2012 at 12:38:55PM -0700, David Rientjes wrote: > > > Sounds good. Is it possible to verify that policy_cache isn't getting > > > larger than normal in /proc/slabinfo, i.e. when all processes with a > > > task mempolicy or shared vma policy have exited, are there still a > > > significant number of active objects? > > > > Killing the fuzzer caused it to drop dramatically. > > > Excellent, thanks. This shows that the refcounting is working properly > and we're not leaking any references as a result of this change causing > the mempolicies to never be freed. ("numa_policy" turns out to be > policy_cache in the code, so thanks for checking both of them.) > > Could I add your tested-by? Sure. Here's a fresh one I just baked. Tested-by: Dave Jones Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 17 Oct 2012, Dave Jones wrote: > > Sounds good. Is it possible to verify that policy_cache isn't getting > > larger than normal in /proc/slabinfo, i.e. when all processes with a > > task mempolicy or shared vma policy have exited, are there still a > > significant number of active objects? > > Killing the fuzzer caused it to drop dramatically. > > Before: > (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep > policy > shared_policy_node 2931 2967376 434 : tunables000 : > slabdata 69 69 0 > numa_policy 2971 6545464 354 : tunables000 : > slabdata187187 0 > > After: > (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep > policy > shared_policy_node 0215376 434 : tunables000 : > slabdata 5 5 0 > numa_policy 15175464 354 : tunables000 : > slabdata 5 5 0 > Excellent, thanks. This shows that the refcounting is working properly and we're not leaking any references as a result of this change causing the mempolicies to never be freed. ("numa_policy" turns out to be policy_cache in the code, so thanks for checking both of them.) Could I add your tested-by? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 17, 2012 at 12:21:10PM -0700, David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > > > On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: > > > On Wed, 17 Oct 2012, Dave Jones wrote: > > > > > > > BUG: sleeping function called from invalid context at > > kernel/mutex.c:269 > > > > > > Hmm, looks like we need to change the refcount semantics entirely. > > We'll > > > need to make get_vma_policy() always take a reference and then drop it > > > accordingly. This work sif get_vma_policy() can grab a reference while > > > holding task_lock() for the task policy fallback case. > > > > > > Comments on this approach? > > > > Seems to be surviving my testing at least.. > > > > Sounds good. Is it possible to verify that policy_cache isn't getting > larger than normal in /proc/slabinfo, i.e. when all processes with a > task mempolicy or shared vma policy have exited, are there still a > significant number of active objects? Killing the fuzzer caused it to drop dramatically. Before: (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy shared_policy_node 2931 2967376 434 : tunables000 : slabdata 69 69 0 numa_policy 2971 6545464 354 : tunables000 : slabdata187187 0 After: (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy shared_policy_node 0215376 434 : tunables000 : slabdata 5 5 0 numa_policy 15175464 354 : tunables000 : slabdata 5 5 0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 17 Oct 2012, Dave Jones wrote: > On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: > > On Wed, 17 Oct 2012, Dave Jones wrote: > > > > > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > > > > Hmm, looks like we need to change the refcount semantics entirely. We'll > > need to make get_vma_policy() always take a reference and then drop it > > accordingly. This work sif get_vma_policy() can grab a reference while > > holding task_lock() for the task policy fallback case. > > > > Comments on this approach? > > Seems to be surviving my testing at least.. > Sounds good. Is it possible to verify that policy_cache isn't getting larger than normal in /proc/slabinfo, i.e. when all processes with a task mempolicy or shared vma policy have exited, are there still a significant number of active objects? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > > > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > > Hmm, looks like we need to change the refcount semantics entirely. We'll > need to make get_vma_policy() always take a reference and then drop it > accordingly. This work sif get_vma_policy() can grab a reference while > holding task_lock() for the task policy fallback case. > > Comments on this approach? Seems to be surviving my testing at least.. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 17, 2012 at 1:42 AM, Kamezawa Hiroyuki wrote: > (2012/10/17 14:24), David Rientjes wrote: >> >> On Wed, 17 Oct 2012, Dave Jones wrote: >> >>> BUG: sleeping function called from invalid context at kernel/mutex.c:269 >>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 >>> 3 locks on stack by trinity-child2/8558: >>> #0: held: (>lock){+.+.+.}, instance: 88010c9a00b0, at: >>> [] seq_lseek+0x3f/0x120 >>> #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: >>> [] m_start+0xa7/0x190 >>> #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: >>> 88011fc64f30, at: [] show_numa_map+0x14f/0x610 >>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 >>> Call Trace: >>> [] __might_sleep+0x14c/0x200 >>> [] mutex_lock_nested+0x2e/0x50 >>> [] mpol_shared_policy_lookup+0x33/0x90 >>> [] shmem_get_policy+0x33/0x40 >>> [] get_vma_policy+0x3a/0x90 >>> [] show_numa_map+0x163/0x610 >>> [] ? pid_maps_open+0x20/0x20 >>> [] ? pagemap_hugetlb_range+0xf0/0xf0 >>> [] show_pid_numa_map+0x13/0x20 >>> [] traverse+0xf2/0x230 >>> [] seq_lseek+0xab/0x120 >>> [] sys_lseek+0x7b/0xb0 >>> [] tracesys+0xe1/0xe6 >>> >> >> Hmm, looks like we need to change the refcount semantics entirely. We'll >> need to make get_vma_policy() always take a reference and then drop it >> accordingly. This work sif get_vma_policy() can grab a reference while >> holding task_lock() for the task policy fallback case. >> >> Comments on this approach? > > > I think this refcounting is better than using task_lock(). I don't think so. get_vma_policy() is used from fast path. In other words, number of atomic ops is sensible for allocation performance. Instead, I'd like to use spinlock for shared mempolicy instead of mutex. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 17, 2012 at 1:42 AM, Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: (2012/10/17 14:24), David Rientjes wrote: On Wed, 17 Oct 2012, Dave Jones wrote: BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 3 locks on stack by trinity-child2/8558: #0: held: (p-lock){+.+.+.}, instance: 88010c9a00b0, at: [8120cd1f] seq_lseek+0x3f/0x120 #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: [81254437] m_start+0xa7/0x190 #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 88011fc64f30, at: [81254f8f] show_numa_map+0x14f/0x610 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 Call Trace: [810ae4ec] __might_sleep+0x14c/0x200 [816bdf4e] mutex_lock_nested+0x2e/0x50 [811c43a3] mpol_shared_policy_lookup+0x33/0x90 [8118d5c3] shmem_get_policy+0x33/0x40 [811c31fa] get_vma_policy+0x3a/0x90 [81254fa3] show_numa_map+0x163/0x610 [81255b10] ? pid_maps_open+0x20/0x20 [81255980] ? pagemap_hugetlb_range+0xf0/0xf0 [81255483] show_pid_numa_map+0x13/0x20 [8120c902] traverse+0xf2/0x230 [8120cd8b] seq_lseek+0xab/0x120 [811e6c0b] sys_lseek+0x7b/0xb0 [816ca088] tracesys+0xe1/0xe6 Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This work sif get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? I think this refcounting is better than using task_lock(). I don't think so. get_vma_policy() is used from fast path. In other words, number of atomic ops is sensible for allocation performance. Instead, I'd like to use spinlock for shared mempolicy instead of mutex. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: On Wed, 17 Oct 2012, Dave Jones wrote: BUG: sleeping function called from invalid context at kernel/mutex.c:269 Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This work sif get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? Seems to be surviving my testing at least.. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 17 Oct 2012, Dave Jones wrote: On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: On Wed, 17 Oct 2012, Dave Jones wrote: BUG: sleeping function called from invalid context at kernel/mutex.c:269 Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This work sif get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? Seems to be surviving my testing at least.. Sounds good. Is it possible to verify that policy_cache isn't getting larger than normal in /proc/slabinfo, i.e. when all processes with a task mempolicy or shared vma policy have exited, are there still a significant number of active objects? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 17, 2012 at 12:21:10PM -0700, David Rientjes wrote: On Wed, 17 Oct 2012, Dave Jones wrote: On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: On Wed, 17 Oct 2012, Dave Jones wrote: BUG: sleeping function called from invalid context at kernel/mutex.c:269 Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This work sif get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? Seems to be surviving my testing at least.. Sounds good. Is it possible to verify that policy_cache isn't getting larger than normal in /proc/slabinfo, i.e. when all processes with a task mempolicy or shared vma policy have exited, are there still a significant number of active objects? Killing the fuzzer caused it to drop dramatically. Before: (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy shared_policy_node 2931 2967376 434 : tunables000 : slabdata 69 69 0 numa_policy 2971 6545464 354 : tunables000 : slabdata187187 0 After: (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy shared_policy_node 0215376 434 : tunables000 : slabdata 5 5 0 numa_policy 15175464 354 : tunables000 : slabdata 5 5 0 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 17 Oct 2012, Dave Jones wrote: Sounds good. Is it possible to verify that policy_cache isn't getting larger than normal in /proc/slabinfo, i.e. when all processes with a task mempolicy or shared vma policy have exited, are there still a significant number of active objects? Killing the fuzzer caused it to drop dramatically. Before: (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy shared_policy_node 2931 2967376 434 : tunables000 : slabdata 69 69 0 numa_policy 2971 6545464 354 : tunables000 : slabdata187187 0 After: (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy shared_policy_node 0215376 434 : tunables000 : slabdata 5 5 0 numa_policy 15175464 354 : tunables000 : slabdata 5 5 0 Excellent, thanks. This shows that the refcounting is working properly and we're not leaking any references as a result of this change causing the mempolicies to never be freed. (numa_policy turns out to be policy_cache in the code, so thanks for checking both of them.) Could I add your tested-by? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 17, 2012 at 12:38:55PM -0700, David Rientjes wrote: Sounds good. Is it possible to verify that policy_cache isn't getting larger than normal in /proc/slabinfo, i.e. when all processes with a task mempolicy or shared vma policy have exited, are there still a significant number of active objects? Killing the fuzzer caused it to drop dramatically. Excellent, thanks. This shows that the refcounting is working properly and we're not leaking any references as a result of this change causing the mempolicies to never be freed. (numa_policy turns out to be policy_cache in the code, so thanks for checking both of them.) Could I add your tested-by? Sure. Here's a fresh one I just baked. Tested-by: Dave Jones da...@redhat.com Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: I think this refcounting is better than using task_lock(). I don't think so. get_vma_policy() is used from fast path. In other words, number of atomic ops is sensible for allocation performance. There are enhancements that we can make with refcounting: for instance, we may want to avoid doing it in the super-fast path when the policy is default_policy and then just do if (mpol != default_policy) mpol_put(mpol); Instead, I'd like to use spinlock for shared mempolicy instead of mutex. Um, this was just changed to a mutex last week in commit b22d127a39dd (mempolicy: fix a race in shared_policy_replace()) so that sp_alloc() can be done with GFP_KERNEL, so I didn't consider reverting that behavior. Are you nacking that patch, which you acked, now? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, Oct 17, 2012 at 3:50 PM, David Rientjes rient...@google.com wrote: On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: I think this refcounting is better than using task_lock(). I don't think so. get_vma_policy() is used from fast path. In other words, number of atomic ops is sensible for allocation performance. There are enhancements that we can make with refcounting: for instance, we may want to avoid doing it in the super-fast path when the policy is default_policy and then just do if (mpol != default_policy) mpol_put(mpol); Instead, I'd like to use spinlock for shared mempolicy instead of mutex. Um, this was just changed to a mutex last week in commit b22d127a39dd (mempolicy: fix a race in shared_policy_replace()) so that sp_alloc() can be done with GFP_KERNEL, so I didn't consider reverting that behavior. Are you nacking that patch, which you acked, now? Yes, sadly. /proc usage is a corner case issue. It's not worth to strike main path. see commit 52cd3b0740 and around patches. That explain why we avoided your approach. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: Um, this was just changed to a mutex last week in commit b22d127a39dd (mempolicy: fix a race in shared_policy_replace()) so that sp_alloc() can be done with GFP_KERNEL, so I didn't consider reverting that behavior. Are you nacking that patch, which you acked, now? Yes, sadly. /proc usage is a corner case issue. It's not worth to strike main path. It also simplifies the fastpath since we can now unconditionally drop the reference. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
(2012/10/17 14:24), David Rientjes wrote: On Wed, 17 Oct 2012, Dave Jones wrote: BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 3 locks on stack by trinity-child2/8558: #0: held: (>lock){+.+.+.}, instance: 88010c9a00b0, at: [] seq_lseek+0x3f/0x120 #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: [] m_start+0xa7/0x190 #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: 88011fc64f30, at: [] show_numa_map+0x14f/0x610 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 Call Trace: [] __might_sleep+0x14c/0x200 [] mutex_lock_nested+0x2e/0x50 [] mpol_shared_policy_lookup+0x33/0x90 [] shmem_get_policy+0x33/0x40 [] get_vma_policy+0x3a/0x90 [] show_numa_map+0x163/0x610 [] ? pid_maps_open+0x20/0x20 [] ? pagemap_hugetlb_range+0xf0/0xf0 [] show_pid_numa_map+0x13/0x20 [] traverse+0xf2/0x230 [] seq_lseek+0xab/0x120 [] sys_lseek+0x7b/0xb0 [] tracesys+0xe1/0xe6 Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This work sif get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? I think this refcounting is better than using task_lock(). Thanks, -Kame -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 17 Oct 2012, Dave Jones wrote: > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 > 3 locks on stack by trinity-child2/8558: > #0: held: (>lock){+.+.+.}, instance: 88010c9a00b0, at: > [] seq_lseek+0x3f/0x120 > #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: > [] m_start+0xa7/0x190 > #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: > 88011fc64f30, at: [] show_numa_map+0x14f/0x610 > Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 > Call Trace: > [] __might_sleep+0x14c/0x200 > [] mutex_lock_nested+0x2e/0x50 > [] mpol_shared_policy_lookup+0x33/0x90 > [] shmem_get_policy+0x33/0x40 > [] get_vma_policy+0x3a/0x90 > [] show_numa_map+0x163/0x610 > [] ? pid_maps_open+0x20/0x20 > [] ? pagemap_hugetlb_range+0xf0/0xf0 > [] show_pid_numa_map+0x13/0x20 > [] traverse+0xf2/0x230 > [] seq_lseek+0xab/0x120 > [] sys_lseek+0x7b/0xb0 > [] tracesys+0xe1/0xe6 > Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This work sif get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? --- fs/proc/task_mmu.c |4 +--- include/linux/mm.h |3 +-- mm/hugetlb.c |4 ++-- mm/mempolicy.c | 41 ++--- 4 files changed, 26 insertions(+), 26 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); - mpol_cond_put(pol); - task_unlock(task); + __mpol_put(pol); seq_printf(m, "%08lx %s", vma->vm_start, buffer); diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -216,8 +216,7 @@ struct vm_operations_struct { * get_policy() op must add reference [mpol_get()] to any policy at * (vma,addr) marked as MPOL_SHARED. The shared policy infrastructure * in mm/mempolicy.c will do this automatically. -* get_policy() must NOT add a ref if the policy at (vma,addr) is not -* marked as MPOL_SHARED. vma policies are protected by the mmap_sem. +* vma policies are protected by the mmap_sem. * If no [shared/vma] mempolicy exists at the addr, get_policy() op * must return NULL--i.e., do not "fallback" to task or system default * policy. diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -568,13 +568,13 @@ retry_cpuset: } } - mpol_cond_put(mpol); + __mpol_put(mpol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; err: - mpol_cond_put(mpol); + __mpol_put(mpol); return NULL; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1536,39 +1536,41 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, * * Returns effective policy for a VMA at specified address. * Falls back to @task or system default policy, as necessary. - * Current or other task's task mempolicy and non-shared vma policies must be - * protected by task_lock(task) by the caller. - * Shared policies [those marked as MPOL_F_SHARED] require an extra reference - * count--added by the get_policy() vm_op, as appropriate--to protect against - * freeing by another task. It is the caller's responsibility to free the - * extra reference for shared policies. + * Increments the reference count of the returned mempolicy, it is the caller's + * responsibility to decrement with __mpol_put(). + * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes + * task_lock(task) for task policy fallback. */ struct mempolicy *get_vma_policy(struct task_struct *task, struct vm_area_struct *vma, unsigned long addr) { - struct mempolicy *pol = task->mempolicy; + struct mempolicy *pol; + + task_lock(task); + pol = task->mempolicy; + mpol_get(pol); + task_unlock(task); if (vma) { if (vma->vm_ops && vma->vm_ops->get_policy) { struct mempolicy *vpol = vma->vm_ops->get_policy(vma, addr); - if (vpol) + if (vpol) { + mpol_put(pol); pol = vpol; + if (!mpol_needs_cond_ref(pol)) +
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Tue, Oct 16, 2012 at 05:31:23PM -0700, David Rientjes wrote: > -pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > +task_lock(task); > +pol = get_vma_policy(task, vma, vma->vm_start); > mpol_to_str(buffer, sizeof(buffer), pol, 0); > mpol_cond_put(pol); > +task_unlock(task); This seems to cause some fallout for me.. BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 3 locks on stack by trinity-child2/8558: #0: held: (>lock){+.+.+.}, instance: 88010c9a00b0, at: [] seq_lseek+0x3f/0x120 #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: [] m_start+0xa7/0x190 #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: 88011fc64f30, at: [] show_numa_map+0x14f/0x610 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 Call Trace: [] __might_sleep+0x14c/0x200 [] mutex_lock_nested+0x2e/0x50 [] mpol_shared_policy_lookup+0x33/0x90 [] shmem_get_policy+0x33/0x40 [] get_vma_policy+0x3a/0x90 [] show_numa_map+0x163/0x610 [] ? pid_maps_open+0x20/0x20 [] ? pagemap_hugetlb_range+0xf0/0xf0 [] show_pid_numa_map+0x13/0x20 [] traverse+0xf2/0x230 [] seq_lseek+0xab/0x120 [] sys_lseek+0x7b/0xb0 [] tracesys+0xe1/0xe6 same problem, different syscall.. BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 21996, name: trinity-child3 3 locks on stack by trinity-child3/21996: #0: held: (>lock){+.+.+.}, instance: 88008d712c08, at: [] seq_read+0x3d/0x3e0 #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: [] m_start+0xa7/0x190 #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: 88011fc64f30, at: [] show_numa_map+0x14f/0x610 Pid: 21996, comm: trinity-child3 Not tainted 3.7.0-rc1+ #32 Call Trace: [] __might_sleep+0x14c/0x200 [] mutex_lock_nested+0x2e/0x50 [] mpol_shared_policy_lookup+0x33/0x90 [] shmem_get_policy+0x33/0x40 [] get_vma_policy+0x3a/0x90 [] show_numa_map+0x163/0x610 [] ? pid_maps_open+0x20/0x20 [] ? pagemap_hugetlb_range+0xf0/0xf0 [] show_pid_numa_map+0x13/0x20 [] traverse+0xf2/0x230 [] seq_read+0x34b/0x3e0 [] ? seq_lseek+0x120/0x120 [] do_loop_readv_writev+0x5a/0x90 [] do_readv_writev+0x1c1/0x1e0 [] ? get_parent_ip+0x11/0x50 [] vfs_readv+0x35/0x60 [] sys_preadv+0xc2/0xe0 [] tracesys+0xe1/0xe6 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Tue, Oct 16, 2012 at 9:49 PM, David Rientjes wrote: > On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> > index 0b78fb9..d04a8a5 100644 >> > --- a/mm/mempolicy.c >> > +++ b/mm/mempolicy.c >> > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t >> > start, compat_ulong_t len, >> > * >> > * Returns effective policy for a VMA at specified address. >> > * Falls back to @task or system default policy, as necessary. >> > - * Current or other task's task mempolicy and non-shared vma policies >> > - * are protected by the task's mmap_sem, which must be held for read by >> > - * the caller. >> > + * Current or other task's task mempolicy and non-shared vma policies >> > must be >> > + * protected by task_lock(task) by the caller. >> >> This is not correct. mmap_sem is needed for protecting vma. task_lock() >> is needed to close vs exit race only when task != current. In other word, >> caller must held both mmap_sem and task_lock if task != current. > > The comment is specifically addressing non-shared vma policies, you do not > need to hold mmap_sem to access another thread's mempolicy. I didn't say old comment is true. I just only your new comment also false. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index 0b78fb9..d04a8a5 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t > > start, compat_ulong_t len, > > * > > * Returns effective policy for a VMA at specified address. > > * Falls back to @task or system default policy, as necessary. > > - * Current or other task's task mempolicy and non-shared vma policies > > - * are protected by the task's mmap_sem, which must be held for read by > > - * the caller. > > + * Current or other task's task mempolicy and non-shared vma policies must > > be > > + * protected by task_lock(task) by the caller. > > This is not correct. mmap_sem is needed for protecting vma. task_lock() > is needed to close vs exit race only when task != current. In other word, > caller must held both mmap_sem and task_lock if task != current. > The comment is specifically addressing non-shared vma policies, you do not need to hold mmap_sem to access another thread's mempolicy. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Tue, Oct 16, 2012 at 8:31 PM, David Rientjes wrote: > When reading /proc/pid/numa_maps, it's possible to return the contents of > the stack where the mempolicy string should be printed if the policy gets > freed from beneath us. > > This happens because mpol_to_str() may return an error the > stack-allocated buffer is then printed without ever being stored. > > There are two possible error conditions in mpol_to_str(): > > - if the buffer allocated is insufficient for the string to be stored, >and > > - if the mempolicy has an invalid mode. > > The first error condition is not triggered in any of the callers to > mpol_to_str(): at least 50 bytes is always allocated on the stack and this > is sufficient for the string to be written. A future patch should convert > this into BUILD_BUG_ON() since we know the maximum strlen possible, but > that's not -rc material. > > The second error condition is possible if a race occurs in dropping a > reference to a task's mempolicy causing it to be freed during the read(). > The slab poison value is then used for the mode and mpol_to_str() returns > -EINVAL. > > This race is only possible because get_vma_policy() believes that > mm->mmap_sem protects task->mempolicy, which isn't true. The exit path > does not hold mm->mmap_sem when dropping the reference or setting > task->mempolicy to NULL: it uses task_lock(task) instead. > > Thus, it's required for the caller of a task mempolicy to hold > task_lock(task) while grabbing the mempolicy and reading it. Callers with > a vma policy store their mempolicy earlier and can simply increment the > reference count so it's guaranteed not to be freed. > > Reported-by: Dave Jones > Signed-off-by: David Rientjes > --- > fs/proc/task_mmu.c |7 +-- > mm/mempolicy.c |5 ++--- > 2 files changed, 7 insertions(+), 5 deletions(-) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, > int is_pid) > struct vm_area_struct *vma = v; > struct numa_maps *md = _priv->md; > struct file *file = vma->vm_file; > + struct task_struct *task = proc_priv->task; > struct mm_struct *mm = vma->vm_mm; > struct mm_walk walk = {}; > struct mempolicy *pol; > @@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, > int is_pid) > walk.private = md; > walk.mm = mm; > > - pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > + task_lock(task); > + pol = get_vma_policy(task, vma, vma->vm_start); > mpol_to_str(buffer, sizeof(buffer), pol, 0); > mpol_cond_put(pol); > + task_unlock(task); > > seq_printf(m, "%08lx %s", vma->vm_start, buffer); > > @@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, > int is_pid) > } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) { > seq_printf(m, " heap"); > } else { > - pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid); > + pid_t tid = vm_is_stack(task, vma, is_pid); > if (tid != 0) { > /* > * Thread stack in /proc/PID/task/TID/maps or > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 0b78fb9..d04a8a5 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, > compat_ulong_t len, > * > * Returns effective policy for a VMA at specified address. > * Falls back to @task or system default policy, as necessary. > - * Current or other task's task mempolicy and non-shared vma policies > - * are protected by the task's mmap_sem, which must be held for read by > - * the caller. > + * Current or other task's task mempolicy and non-shared vma policies must be > + * protected by task_lock(task) by the caller. This is not correct. mmap_sem is needed for protecting vma. task_lock() is needed to close vs exit race only when task != current. In other word, caller must held both mmap_sem and task_lock if task != current. > * Shared policies [those marked as MPOL_F_SHARED] require an extra reference > * count--added by the get_policy() vm_op, as appropriate--to protect against > * freeing by another task. It is the caller's responsibility to free the -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Tue, Oct 16, 2012 at 8:31 PM, David Rientjes rient...@google.com wrote: When reading /proc/pid/numa_maps, it's possible to return the contents of the stack where the mempolicy string should be printed if the policy gets freed from beneath us. This happens because mpol_to_str() may return an error the stack-allocated buffer is then printed without ever being stored. There are two possible error conditions in mpol_to_str(): - if the buffer allocated is insufficient for the string to be stored, and - if the mempolicy has an invalid mode. The first error condition is not triggered in any of the callers to mpol_to_str(): at least 50 bytes is always allocated on the stack and this is sufficient for the string to be written. A future patch should convert this into BUILD_BUG_ON() since we know the maximum strlen possible, but that's not -rc material. The second error condition is possible if a race occurs in dropping a reference to a task's mempolicy causing it to be freed during the read(). The slab poison value is then used for the mode and mpol_to_str() returns -EINVAL. This race is only possible because get_vma_policy() believes that mm-mmap_sem protects task-mempolicy, which isn't true. The exit path does not hold mm-mmap_sem when dropping the reference or setting task-mempolicy to NULL: it uses task_lock(task) instead. Thus, it's required for the caller of a task mempolicy to hold task_lock(task) while grabbing the mempolicy and reading it. Callers with a vma policy store their mempolicy earlier and can simply increment the reference count so it's guaranteed not to be freed. Reported-by: Dave Jones da...@redhat.com Signed-off-by: David Rientjes rient...@google.com --- fs/proc/task_mmu.c |7 +-- mm/mempolicy.c |5 ++--- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) struct vm_area_struct *vma = v; struct numa_maps *md = numa_priv-md; struct file *file = vma-vm_file; + struct task_struct *task = proc_priv-task; struct mm_struct *mm = vma-vm_mm; struct mm_walk walk = {}; struct mempolicy *pol; @@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - pol = get_vma_policy(proc_priv-task, vma, vma-vm_start); + task_lock(task); + pol = get_vma_policy(task, vma, vma-vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); mpol_cond_put(pol); + task_unlock(task); seq_printf(m, %08lx %s, vma-vm_start, buffer); @@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) } else if (vma-vm_start = mm-brk vma-vm_end = mm-start_brk) { seq_printf(m, heap); } else { - pid_t tid = vm_is_stack(proc_priv-task, vma, is_pid); + pid_t tid = vm_is_stack(task, vma, is_pid); if (tid != 0) { /* * Thread stack in /proc/PID/task/TID/maps or diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0b78fb9..d04a8a5 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, * * Returns effective policy for a VMA at specified address. * Falls back to @task or system default policy, as necessary. - * Current or other task's task mempolicy and non-shared vma policies - * are protected by the task's mmap_sem, which must be held for read by - * the caller. + * Current or other task's task mempolicy and non-shared vma policies must be + * protected by task_lock(task) by the caller. This is not correct. mmap_sem is needed for protecting vma. task_lock() is needed to close vs exit race only when task != current. In other word, caller must held both mmap_sem and task_lock if task != current. * Shared policies [those marked as MPOL_F_SHARED] require an extra reference * count--added by the get_policy() vm_op, as appropriate--to protect against * freeing by another task. It is the caller's responsibility to free the -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0b78fb9..d04a8a5 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, * * Returns effective policy for a VMA at specified address. * Falls back to @task or system default policy, as necessary. - * Current or other task's task mempolicy and non-shared vma policies - * are protected by the task's mmap_sem, which must be held for read by - * the caller. + * Current or other task's task mempolicy and non-shared vma policies must be + * protected by task_lock(task) by the caller. This is not correct. mmap_sem is needed for protecting vma. task_lock() is needed to close vs exit race only when task != current. In other word, caller must held both mmap_sem and task_lock if task != current. The comment is specifically addressing non-shared vma policies, you do not need to hold mmap_sem to access another thread's mempolicy. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Tue, Oct 16, 2012 at 9:49 PM, David Rientjes rient...@google.com wrote: On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0b78fb9..d04a8a5 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, * * Returns effective policy for a VMA at specified address. * Falls back to @task or system default policy, as necessary. - * Current or other task's task mempolicy and non-shared vma policies - * are protected by the task's mmap_sem, which must be held for read by - * the caller. + * Current or other task's task mempolicy and non-shared vma policies must be + * protected by task_lock(task) by the caller. This is not correct. mmap_sem is needed for protecting vma. task_lock() is needed to close vs exit race only when task != current. In other word, caller must held both mmap_sem and task_lock if task != current. The comment is specifically addressing non-shared vma policies, you do not need to hold mmap_sem to access another thread's mempolicy. I didn't say old comment is true. I just only your new comment also false. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Tue, Oct 16, 2012 at 05:31:23PM -0700, David Rientjes wrote: -pol = get_vma_policy(proc_priv-task, vma, vma-vm_start); +task_lock(task); +pol = get_vma_policy(task, vma, vma-vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); mpol_cond_put(pol); +task_unlock(task); This seems to cause some fallout for me.. BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 3 locks on stack by trinity-child2/8558: #0: held: (p-lock){+.+.+.}, instance: 88010c9a00b0, at: [8120cd1f] seq_lseek+0x3f/0x120 #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: [81254437] m_start+0xa7/0x190 #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 88011fc64f30, at: [81254f8f] show_numa_map+0x14f/0x610 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 Call Trace: [810ae4ec] __might_sleep+0x14c/0x200 [816bdf4e] mutex_lock_nested+0x2e/0x50 [811c43a3] mpol_shared_policy_lookup+0x33/0x90 [8118d5c3] shmem_get_policy+0x33/0x40 [811c31fa] get_vma_policy+0x3a/0x90 [81254fa3] show_numa_map+0x163/0x610 [81255b10] ? pid_maps_open+0x20/0x20 [81255980] ? pagemap_hugetlb_range+0xf0/0xf0 [81255483] show_pid_numa_map+0x13/0x20 [8120c902] traverse+0xf2/0x230 [8120cd8b] seq_lseek+0xab/0x120 [811e6c0b] sys_lseek+0x7b/0xb0 [816ca088] tracesys+0xe1/0xe6 same problem, different syscall.. BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 21996, name: trinity-child3 3 locks on stack by trinity-child3/21996: #0: held: (p-lock){+.+.+.}, instance: 88008d712c08, at: [8120ce3d] seq_read+0x3d/0x3e0 #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: [81254437] m_start+0xa7/0x190 #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 88011fc64f30, at: [81254f8f] show_numa_map+0x14f/0x610 Pid: 21996, comm: trinity-child3 Not tainted 3.7.0-rc1+ #32 Call Trace: [810ae4ec] __might_sleep+0x14c/0x200 [816bdf4e] mutex_lock_nested+0x2e/0x50 [811c43a3] mpol_shared_policy_lookup+0x33/0x90 [8118d5c3] shmem_get_policy+0x33/0x40 [811c31fa] get_vma_policy+0x3a/0x90 [81254fa3] show_numa_map+0x163/0x610 [81255b10] ? pid_maps_open+0x20/0x20 [81255980] ? pagemap_hugetlb_range+0xf0/0xf0 [81255483] show_pid_numa_map+0x13/0x20 [8120c902] traverse+0xf2/0x230 [8120d14b] seq_read+0x34b/0x3e0 [8120ce00] ? seq_lseek+0x120/0x120 [811e751a] do_loop_readv_writev+0x5a/0x90 [811e7851] do_readv_writev+0x1c1/0x1e0 [810b0de1] ? get_parent_ip+0x11/0x50 [811e7905] vfs_readv+0x35/0x60 [811e7b72] sys_preadv+0xc2/0xe0 [816ca088] tracesys+0xe1/0xe6 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
On Wed, 17 Oct 2012, Dave Jones wrote: BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 3 locks on stack by trinity-child2/8558: #0: held: (p-lock){+.+.+.}, instance: 88010c9a00b0, at: [8120cd1f] seq_lseek+0x3f/0x120 #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: [81254437] m_start+0xa7/0x190 #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 88011fc64f30, at: [81254f8f] show_numa_map+0x14f/0x610 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 Call Trace: [810ae4ec] __might_sleep+0x14c/0x200 [816bdf4e] mutex_lock_nested+0x2e/0x50 [811c43a3] mpol_shared_policy_lookup+0x33/0x90 [8118d5c3] shmem_get_policy+0x33/0x40 [811c31fa] get_vma_policy+0x3a/0x90 [81254fa3] show_numa_map+0x163/0x610 [81255b10] ? pid_maps_open+0x20/0x20 [81255980] ? pagemap_hugetlb_range+0xf0/0xf0 [81255483] show_pid_numa_map+0x13/0x20 [8120c902] traverse+0xf2/0x230 [8120cd8b] seq_lseek+0xab/0x120 [811e6c0b] sys_lseek+0x7b/0xb0 [816ca088] tracesys+0xe1/0xe6 Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This work sif get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? --- fs/proc/task_mmu.c |4 +--- include/linux/mm.h |3 +-- mm/hugetlb.c |4 ++-- mm/mempolicy.c | 41 ++--- 4 files changed, 26 insertions(+), 26 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma-vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); - mpol_cond_put(pol); - task_unlock(task); + __mpol_put(pol); seq_printf(m, %08lx %s, vma-vm_start, buffer); diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -216,8 +216,7 @@ struct vm_operations_struct { * get_policy() op must add reference [mpol_get()] to any policy at * (vma,addr) marked as MPOL_SHARED. The shared policy infrastructure * in mm/mempolicy.c will do this automatically. -* get_policy() must NOT add a ref if the policy at (vma,addr) is not -* marked as MPOL_SHARED. vma policies are protected by the mmap_sem. +* vma policies are protected by the mmap_sem. * If no [shared/vma] mempolicy exists at the addr, get_policy() op * must return NULL--i.e., do not fallback to task or system default * policy. diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -568,13 +568,13 @@ retry_cpuset: } } - mpol_cond_put(mpol); + __mpol_put(mpol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) !page)) goto retry_cpuset; return page; err: - mpol_cond_put(mpol); + __mpol_put(mpol); return NULL; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1536,39 +1536,41 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, * * Returns effective policy for a VMA at specified address. * Falls back to @task or system default policy, as necessary. - * Current or other task's task mempolicy and non-shared vma policies must be - * protected by task_lock(task) by the caller. - * Shared policies [those marked as MPOL_F_SHARED] require an extra reference - * count--added by the get_policy() vm_op, as appropriate--to protect against - * freeing by another task. It is the caller's responsibility to free the - * extra reference for shared policies. + * Increments the reference count of the returned mempolicy, it is the caller's + * responsibility to decrement with __mpol_put(). + * Requires vma-vm_mm-mmap_sem to be held for vma policies and takes + * task_lock(task) for task policy fallback. */ struct mempolicy *get_vma_policy(struct task_struct *task, struct vm_area_struct *vma, unsigned long addr) { - struct mempolicy *pol = task-mempolicy; + struct mempolicy *pol; + + task_lock(task); + pol = task-mempolicy; + mpol_get(pol); + task_unlock(task); if (vma) { if (vma-vm_ops vma-vm_ops-get_policy) { struct mempolicy *vpol = vma-vm_ops-get_policy(vma, addr); -
Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
(2012/10/17 14:24), David Rientjes wrote: On Wed, 17 Oct 2012, Dave Jones wrote: BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 3 locks on stack by trinity-child2/8558: #0: held: (p-lock){+.+.+.}, instance: 88010c9a00b0, at: [8120cd1f] seq_lseek+0x3f/0x120 #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: [81254437] m_start+0xa7/0x190 #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 88011fc64f30, at: [81254f8f] show_numa_map+0x14f/0x610 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 Call Trace: [810ae4ec] __might_sleep+0x14c/0x200 [816bdf4e] mutex_lock_nested+0x2e/0x50 [811c43a3] mpol_shared_policy_lookup+0x33/0x90 [8118d5c3] shmem_get_policy+0x33/0x40 [811c31fa] get_vma_policy+0x3a/0x90 [81254fa3] show_numa_map+0x163/0x610 [81255b10] ? pid_maps_open+0x20/0x20 [81255980] ? pagemap_hugetlb_range+0xf0/0xf0 [81255483] show_pid_numa_map+0x13/0x20 [8120c902] traverse+0xf2/0x230 [8120cd8b] seq_lseek+0xab/0x120 [811e6c0b] sys_lseek+0x7b/0xb0 [816ca088] tracesys+0xe1/0xe6 Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This work sif get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? I think this refcounting is better than using task_lock(). Thanks, -Kame -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/