Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-11-20 Thread Sasha Levin
Ping? Can someone take it before it's lost?

On Wed, Oct 31, 2012 at 2:29 PM, Sasha Levin  wrote:
> On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra  wrote:
>> On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
>>> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra  
>>> wrote:
>>> >
>>> > So I think the below should work, we hold the spinlock over both rb-tree
>>> > modification as sp free, this makes mpol_shared_policy_lookup() which
>>> > returns the policy with an incremented refcount work with just the
>>> > spinlock.
>>> >
>>> > Comments?
>>>
>>> Looks reasonable, if annoyingly complex for something that shouldn't
>>> be important enough for this. Oh well.
>>
>> I agree with that.. Its just that when doing numa placement one needs to
>> respect the pre-existing placement constraints. I've not seen a way
>> around this.
>>
>>> However, please check me on this: the need for this is only for
>>> linux-next right now, correct? All the current users in my tree are ok
>>> with just the mutex, no?
>>
>> Yes, the need comes from the numa stuff and I'll stick this patch in
>> there.
>>
>> I completely missed Mel's patch turning it into a mutex, but I guess
>> that's what -next is for :-).
>
> So I've been fuzzing with it for the past couple of days and it's been
> looking fine with it. Can someone grab it into his tree please?
>
>
> Thanks,
> Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-11-20 Thread Sasha Levin
Ping? Can someone take it before it's lost?

On Wed, Oct 31, 2012 at 2:29 PM, Sasha Levin levinsasha...@gmail.com wrote:
 On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
 On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra pet...@infradead.org 
 wrote:
 
  So I think the below should work, we hold the spinlock over both rb-tree
  modification as sp free, this makes mpol_shared_policy_lookup() which
  returns the policy with an incremented refcount work with just the
  spinlock.
 
  Comments?

 Looks reasonable, if annoyingly complex for something that shouldn't
 be important enough for this. Oh well.

 I agree with that.. Its just that when doing numa placement one needs to
 respect the pre-existing placement constraints. I've not seen a way
 around this.

 However, please check me on this: the need for this is only for
 linux-next right now, correct? All the current users in my tree are ok
 with just the mutex, no?

 Yes, the need comes from the numa stuff and I'll stick this patch in
 there.

 I completely missed Mel's patch turning it into a mutex, but I guess
 that's what -next is for :-).

 So I've been fuzzing with it for the past couple of days and it's been
 looking fine with it. Can someone grab it into his tree please?


 Thanks,
 Sasha
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-31 Thread Sasha Levin
On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra  wrote:
> On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
>> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra  wrote:
>> >
>> > So I think the below should work, we hold the spinlock over both rb-tree
>> > modification as sp free, this makes mpol_shared_policy_lookup() which
>> > returns the policy with an incremented refcount work with just the
>> > spinlock.
>> >
>> > Comments?
>>
>> Looks reasonable, if annoyingly complex for something that shouldn't
>> be important enough for this. Oh well.
>
> I agree with that.. Its just that when doing numa placement one needs to
> respect the pre-existing placement constraints. I've not seen a way
> around this.
>
>> However, please check me on this: the need for this is only for
>> linux-next right now, correct? All the current users in my tree are ok
>> with just the mutex, no?
>
> Yes, the need comes from the numa stuff and I'll stick this patch in
> there.
>
> I completely missed Mel's patch turning it into a mutex, but I guess
> that's what -next is for :-).

So I've been fuzzing with it for the past couple of days and it's been
looking fine with it. Can someone grab it into his tree please?


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-31 Thread Sasha Levin
On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
 On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra pet...@infradead.org wrote:
 
  So I think the below should work, we hold the spinlock over both rb-tree
  modification as sp free, this makes mpol_shared_policy_lookup() which
  returns the policy with an incremented refcount work with just the
  spinlock.
 
  Comments?

 Looks reasonable, if annoyingly complex for something that shouldn't
 be important enough for this. Oh well.

 I agree with that.. Its just that when doing numa placement one needs to
 respect the pre-existing placement constraints. I've not seen a way
 around this.

 However, please check me on this: the need for this is only for
 linux-next right now, correct? All the current users in my tree are ok
 with just the mutex, no?

 Yes, the need comes from the numa stuff and I'll stick this patch in
 there.

 I completely missed Mel's patch turning it into a mutex, but I guess
 that's what -next is for :-).

So I've been fuzzing with it for the past couple of days and it's been
looking fine with it. Can someone grab it into his tree please?


Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-26 Thread Peter Zijlstra
On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra  wrote:
> >
> > So I think the below should work, we hold the spinlock over both rb-tree
> > modification as sp free, this makes mpol_shared_policy_lookup() which
> > returns the policy with an incremented refcount work with just the
> > spinlock.
> >
> > Comments?
> 
> Looks reasonable, if annoyingly complex for something that shouldn't
> be important enough for this. Oh well.

I agree with that.. Its just that when doing numa placement one needs to
respect the pre-existing placement constraints. I've not seen a way
around this.

> However, please check me on this: the need for this is only for
> linux-next right now, correct? All the current users in my tree are ok
> with just the mutex, no?

Yes, the need comes from the numa stuff and I'll stick this patch in
there.

I completely missed Mel's patch turning it into a mutex, but I guess
that's what -next is for :-).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-26 Thread Peter Zijlstra
On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
 On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra pet...@infradead.org wrote:
 
  So I think the below should work, we hold the spinlock over both rb-tree
  modification as sp free, this makes mpol_shared_policy_lookup() which
  returns the policy with an incremented refcount work with just the
  spinlock.
 
  Comments?
 
 Looks reasonable, if annoyingly complex for something that shouldn't
 be important enough for this. Oh well.

I agree with that.. Its just that when doing numa placement one needs to
respect the pre-existing placement constraints. I've not seen a way
around this.

 However, please check me on this: the need for this is only for
 linux-next right now, correct? All the current users in my tree are ok
 with just the mutex, no?

Yes, the need comes from the numa stuff and I'll stick this patch in
there.

I completely missed Mel's patch turning it into a mutex, but I guess
that's what -next is for :-).
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread Linus Torvalds
On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra  wrote:
>
> So I think the below should work, we hold the spinlock over both rb-tree
> modification as sp free, this makes mpol_shared_policy_lookup() which
> returns the policy with an incremented refcount work with just the
> spinlock.
>
> Comments?

Looks reasonable, if annoyingly complex for something that shouldn't
be important enough for this. Oh well.

However, please check me on this: the need for this is only for
linux-next right now, correct? All the current users in my tree are ok
with just the mutex, no?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread David Rientjes
On Thu, 25 Oct 2012, Peter Zijlstra wrote:

> So I think the below should work, we hold the spinlock over both rb-tree
> modification as sp free, this makes mpol_shared_policy_lookup() which
> returns the policy with an incremented refcount work with just the
> spinlock.
> 
> Comments?
> 

It's rather unfortunate that we need to protect modification with a 
spinlock and a mutex but since sharing was removed in commit 869833f2c5c6 
("mempolicy: remove mempolicy sharing") it requires that sp_alloc() is 
blockable to do the whole mpol_new() and rebind if necessary, which could 
require mm->mmap_sem; it's not as simple as just converting all the 
allocations to GFP_ATOMIC.

It looks as though there is no other alternative other than protecting 
modification with both the spinlock and mutex, which is a clever 
solution, so it looks good to me, thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread Sasha Levin
On 10/25/2012 10:39 AM, Peter Zijlstra wrote:
> On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote:
>> On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
>>> Ok, this looks the same but it's actually a different issue: 
>>> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
>>> calls get_vma_policy() which may take the shared policy mutex.  This 
>>> happens while holding page_table_lock from do_huge_pmd_numa_page() but 
>>> also from do_numa_page() while holding a spinlock on the ptl, which is 
>>> coming from the sched/numa branch.
>>>
>>> Is there anyway that we can avoid changing the shared policy mutex back 
>>> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race 
>>> in shared_policy_replace()"])?
>>>
>>> Adding Peter, Rik, and Mel to the cc. 
>>
>> Urgh, crud I totally missed that.
>>
>> So the problem is that we need to compute if the current page is placed
>> 'right' while holding pte_lock in order to avoid multiple pte_lock
>> acquisitions on the 'fast' path.
>>
>> I'll look into this in a bit, but one thing that comes to mind is having
>> both a spnilock and a mutex and require holding both for modification
>> while either one is sufficient for read.
>>
>> That would allow sp_lookup() to use the spinlock, while insert and
>> replace can hold both.
>>
>> Not sure it will work for this, need to stare at this code a little
>> more.
> 
> So I think the below should work, we hold the spinlock over both rb-tree
> modification as sp free, this makes mpol_shared_policy_lookup() which
> returns the policy with an incremented refcount work with just the
> spinlock.
> 
> Comments?
> 
> ---

It made the warnings I've reported go away.


Thanks,
Sasha

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread Peter Zijlstra
On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote:
> On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
> > Ok, this looks the same but it's actually a different issue: 
> > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
> > calls get_vma_policy() which may take the shared policy mutex.  This 
> > happens while holding page_table_lock from do_huge_pmd_numa_page() but 
> > also from do_numa_page() while holding a spinlock on the ptl, which is 
> > coming from the sched/numa branch.
> > 
> > Is there anyway that we can avoid changing the shared policy mutex back 
> > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race 
> > in shared_policy_replace()"])?
> > 
> > Adding Peter, Rik, and Mel to the cc. 
> 
> Urgh, crud I totally missed that.
> 
> So the problem is that we need to compute if the current page is placed
> 'right' while holding pte_lock in order to avoid multiple pte_lock
> acquisitions on the 'fast' path.
> 
> I'll look into this in a bit, but one thing that comes to mind is having
> both a spnilock and a mutex and require holding both for modification
> while either one is sufficient for read.
> 
> That would allow sp_lookup() to use the spinlock, while insert and
> replace can hold both.
> 
> Not sure it will work for this, need to stare at this code a little
> more.

So I think the below should work, we hold the spinlock over both rb-tree
modification as sp free, this makes mpol_shared_policy_lookup() which
returns the policy with an incremented refcount work with just the
spinlock.

Comments?

---
 include/linux/mempolicy.h |1 +
 mm/mempolicy.c|   23 ++-
 2 files changed, 19 insertions(+), 5 deletions(-)

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -133,6 +133,7 @@ struct sp_node {
 
 struct shared_policy {
struct rb_root root;
+   spinlock_t lock;
struct mutex mutex;
 };
 
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2099,12 +2099,20 @@ bool __mpol_equal(struct mempolicy *a, s
  *
  * Remember policies even when nobody has shared memory mapped.
  * The policies are kept in Red-Black tree linked from the inode.
- * They are protected by the sp->lock spinlock, which should be held
- * for any accesses to the tree.
+ *
+ * The rb-tree is locked using both a mutex and a spinlock. Every modification
+ * to the tree must hold both the mutex and the spinlock, lookups can hold
+ * either to observe a stable tree.
+ *
+ * In particular, sp_insert() and sp_delete() take the spinlock, whereas
+ * sp_lookup() doesn't, this so users have choice.
+ *
+ * shared_policy_replace() and mpol_free_shared_policy() take the mutex
+ * and call sp_insert(), sp_delete().
  */
 
 /* lookup first element intersecting start-end */
-/* Caller holds sp->mutex */
+/* Caller holds either sp->lock and/or sp->mutex */
 static struct sp_node *
 sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
 {
@@ -2143,6 +2151,7 @@ static void sp_insert(struct shared_poli
struct rb_node *parent = NULL;
struct sp_node *nd;
 
+   spin_lock(>lock);
while (*p) {
parent = *p;
nd = rb_entry(parent, struct sp_node, nd);
@@ -2155,6 +2164,7 @@ static void sp_insert(struct shared_poli
}
rb_link_node(>nd, parent, p);
rb_insert_color(>nd, >root);
+   spin_unlock(>lock);
pr_debug("inserting %lx-%lx: %d\n", new->start, new->end,
 new->policy ? new->policy->mode : 0);
 }
@@ -2168,13 +2178,13 @@ mpol_shared_policy_lookup(struct shared_
 
if (!sp->root.rb_node)
return NULL;
-   mutex_lock(>mutex);
+   spin_lock(>lock);
sn = sp_lookup(sp, idx, idx+1);
if (sn) {
mpol_get(sn->policy);
pol = sn->policy;
}
-   mutex_unlock(>mutex);
+   spin_unlock(>lock);
return pol;
 }
 
@@ -2295,8 +2305,10 @@ int mpol_misplaced(struct page *page, st
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
pr_debug("deleting %lx-l%lx\n", n->start, n->end);
+   spin_lock(>lock);
rb_erase(>nd, >root);
sp_free(n);
+   spin_unlock(>lock);
 }
 
 static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
@@ -2381,6 +2393,7 @@ void mpol_shared_policy_init(struct shar
int ret;
 
sp->root = RB_ROOT; /* empty tree == default mempolicy */
+   spin_lock_init(>lock);
mutex_init(>mutex);
 
if (mpol) {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread Peter Zijlstra
On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
> Ok, this looks the same but it's actually a different issue: 
> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
> calls get_vma_policy() which may take the shared policy mutex.  This 
> happens while holding page_table_lock from do_huge_pmd_numa_page() but 
> also from do_numa_page() while holding a spinlock on the ptl, which is 
> coming from the sched/numa branch.
> 
> Is there anyway that we can avoid changing the shared policy mutex back 
> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race 
> in shared_policy_replace()"])?
> 
> Adding Peter, Rik, and Mel to the cc. 

Urgh, crud I totally missed that.

So the problem is that we need to compute if the current page is placed
'right' while holding pte_lock in order to avoid multiple pte_lock
acquisitions on the 'fast' path.

I'll look into this in a bit, but one thing that comes to mind is having
both a spnilock and a mutex and require holding both for modification
while either one is sufficient for read.

That would allow sp_lookup() to use the spinlock, while insert and
replace can hold both.

Not sure it will work for this, need to stare at this code a little
more.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread Peter Zijlstra
On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
 Ok, this looks the same but it's actually a different issue: 
 mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
 calls get_vma_policy() which may take the shared policy mutex.  This 
 happens while holding page_table_lock from do_huge_pmd_numa_page() but 
 also from do_numa_page() while holding a spinlock on the ptl, which is 
 coming from the sched/numa branch.
 
 Is there anyway that we can avoid changing the shared policy mutex back 
 into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race 
 in shared_policy_replace()])?
 
 Adding Peter, Rik, and Mel to the cc. 

Urgh, crud I totally missed that.

So the problem is that we need to compute if the current page is placed
'right' while holding pte_lock in order to avoid multiple pte_lock
acquisitions on the 'fast' path.

I'll look into this in a bit, but one thing that comes to mind is having
both a spnilock and a mutex and require holding both for modification
while either one is sufficient for read.

That would allow sp_lookup() to use the spinlock, while insert and
replace can hold both.

Not sure it will work for this, need to stare at this code a little
more.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread Peter Zijlstra
On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote:
 On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
  Ok, this looks the same but it's actually a different issue: 
  mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
  calls get_vma_policy() which may take the shared policy mutex.  This 
  happens while holding page_table_lock from do_huge_pmd_numa_page() but 
  also from do_numa_page() while holding a spinlock on the ptl, which is 
  coming from the sched/numa branch.
  
  Is there anyway that we can avoid changing the shared policy mutex back 
  into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race 
  in shared_policy_replace()])?
  
  Adding Peter, Rik, and Mel to the cc. 
 
 Urgh, crud I totally missed that.
 
 So the problem is that we need to compute if the current page is placed
 'right' while holding pte_lock in order to avoid multiple pte_lock
 acquisitions on the 'fast' path.
 
 I'll look into this in a bit, but one thing that comes to mind is having
 both a spnilock and a mutex and require holding both for modification
 while either one is sufficient for read.
 
 That would allow sp_lookup() to use the spinlock, while insert and
 replace can hold both.
 
 Not sure it will work for this, need to stare at this code a little
 more.

So I think the below should work, we hold the spinlock over both rb-tree
modification as sp free, this makes mpol_shared_policy_lookup() which
returns the policy with an incremented refcount work with just the
spinlock.

Comments?

---
 include/linux/mempolicy.h |1 +
 mm/mempolicy.c|   23 ++-
 2 files changed, 19 insertions(+), 5 deletions(-)

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -133,6 +133,7 @@ struct sp_node {
 
 struct shared_policy {
struct rb_root root;
+   spinlock_t lock;
struct mutex mutex;
 };
 
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2099,12 +2099,20 @@ bool __mpol_equal(struct mempolicy *a, s
  *
  * Remember policies even when nobody has shared memory mapped.
  * The policies are kept in Red-Black tree linked from the inode.
- * They are protected by the sp-lock spinlock, which should be held
- * for any accesses to the tree.
+ *
+ * The rb-tree is locked using both a mutex and a spinlock. Every modification
+ * to the tree must hold both the mutex and the spinlock, lookups can hold
+ * either to observe a stable tree.
+ *
+ * In particular, sp_insert() and sp_delete() take the spinlock, whereas
+ * sp_lookup() doesn't, this so users have choice.
+ *
+ * shared_policy_replace() and mpol_free_shared_policy() take the mutex
+ * and call sp_insert(), sp_delete().
  */
 
 /* lookup first element intersecting start-end */
-/* Caller holds sp-mutex */
+/* Caller holds either sp-lock and/or sp-mutex */
 static struct sp_node *
 sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
 {
@@ -2143,6 +2151,7 @@ static void sp_insert(struct shared_poli
struct rb_node *parent = NULL;
struct sp_node *nd;
 
+   spin_lock(sp-lock);
while (*p) {
parent = *p;
nd = rb_entry(parent, struct sp_node, nd);
@@ -2155,6 +2164,7 @@ static void sp_insert(struct shared_poli
}
rb_link_node(new-nd, parent, p);
rb_insert_color(new-nd, sp-root);
+   spin_unlock(sp-lock);
pr_debug(inserting %lx-%lx: %d\n, new-start, new-end,
 new-policy ? new-policy-mode : 0);
 }
@@ -2168,13 +2178,13 @@ mpol_shared_policy_lookup(struct shared_
 
if (!sp-root.rb_node)
return NULL;
-   mutex_lock(sp-mutex);
+   spin_lock(sp-lock);
sn = sp_lookup(sp, idx, idx+1);
if (sn) {
mpol_get(sn-policy);
pol = sn-policy;
}
-   mutex_unlock(sp-mutex);
+   spin_unlock(sp-lock);
return pol;
 }
 
@@ -2295,8 +2305,10 @@ int mpol_misplaced(struct page *page, st
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
pr_debug(deleting %lx-l%lx\n, n-start, n-end);
+   spin_lock(sp-lock);
rb_erase(n-nd, sp-root);
sp_free(n);
+   spin_unlock(sp-lock);
 }
 
 static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
@@ -2381,6 +2393,7 @@ void mpol_shared_policy_init(struct shar
int ret;
 
sp-root = RB_ROOT; /* empty tree == default mempolicy */
+   spin_lock_init(sp-lock);
mutex_init(sp-mutex);
 
if (mpol) {

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread Sasha Levin
On 10/25/2012 10:39 AM, Peter Zijlstra wrote:
 On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote:
 On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
 Ok, this looks the same but it's actually a different issue: 
 mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
 calls get_vma_policy() which may take the shared policy mutex.  This 
 happens while holding page_table_lock from do_huge_pmd_numa_page() but 
 also from do_numa_page() while holding a spinlock on the ptl, which is 
 coming from the sched/numa branch.

 Is there anyway that we can avoid changing the shared policy mutex back 
 into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race 
 in shared_policy_replace()])?

 Adding Peter, Rik, and Mel to the cc. 

 Urgh, crud I totally missed that.

 So the problem is that we need to compute if the current page is placed
 'right' while holding pte_lock in order to avoid multiple pte_lock
 acquisitions on the 'fast' path.

 I'll look into this in a bit, but one thing that comes to mind is having
 both a spnilock and a mutex and require holding both for modification
 while either one is sufficient for read.

 That would allow sp_lookup() to use the spinlock, while insert and
 replace can hold both.

 Not sure it will work for this, need to stare at this code a little
 more.
 
 So I think the below should work, we hold the spinlock over both rb-tree
 modification as sp free, this makes mpol_shared_policy_lookup() which
 returns the policy with an incremented refcount work with just the
 spinlock.
 
 Comments?
 
 ---

It made the warnings I've reported go away.


Thanks,
Sasha

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread David Rientjes
On Thu, 25 Oct 2012, Peter Zijlstra wrote:

 So I think the below should work, we hold the spinlock over both rb-tree
 modification as sp free, this makes mpol_shared_policy_lookup() which
 returns the policy with an incremented refcount work with just the
 spinlock.
 
 Comments?
 

It's rather unfortunate that we need to protect modification with a 
spinlock and a mutex but since sharing was removed in commit 869833f2c5c6 
(mempolicy: remove mempolicy sharing) it requires that sp_alloc() is 
blockable to do the whole mpol_new() and rebind if necessary, which could 
require mm-mmap_sem; it's not as simple as just converting all the 
allocations to GFP_ATOMIC.

It looks as though there is no other alternative other than protecting 
modification with both the spinlock and mutex, which is a clever 
solution, so it looks good to me, thanks!
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread Linus Torvalds
On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra pet...@infradead.org wrote:

 So I think the below should work, we hold the spinlock over both rb-tree
 modification as sp free, this makes mpol_shared_policy_lookup() which
 returns the policy with an incremented refcount work with just the
 spinlock.

 Comments?

Looks reasonable, if annoyingly complex for something that shouldn't
be important enough for this. Oh well.

However, please check me on this: the need for this is only for
linux-next right now, correct? All the current users in my tree are ok
with just the mutex, no?

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread David Rientjes
On Wed, 24 Oct 2012, KOSAKI Motohiro wrote:

> Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter,
> I guess you commited it, right? If so, may I review your mempolicy
> changes? Now mempolicy has a lot of horrible buggy code and I hope to
> maintain carefully. Which tree should i see?
> 

Check out sched/numa from 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

$ git diff v3.7-rc2.. mm/mempolicy.c | diffstat
 mempolicy.c |  444 +---
 1 file changed, 277 insertions(+), 167 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread KOSAKI Motohiro
On Wed, Oct 24, 2012 at 8:08 PM, David Rientjes  wrote:
> On Wed, 24 Oct 2012, Sasha Levin wrote:
>
>> > This should be fixed by 9e7814404b77 ("hold task->mempolicy while
>> > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading
>> > /proc/pid/numa_maps on that kernel?
>>
>> I was actually referring to the warnings Dave Jones saw when fuzzing
>> with trinity after the
>> original patch was applied.
>>
>> I still see the following when fuzzing:
>>
>> [  338.467156] BUG: sleeping function called from invalid context at
>> kernel/mutex.c:269
>> [  338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: 
>> trinity-main
>> [  338.481199] 2 locks held by trinity-main/6361:
>> [  338.486629]  #0:  (>mmap_sem){++}, at: []
>> __do_page_fault+0x1e4/0x4f0
>> [  338.498783]  #1:  (&(>page_table_lock)->rlock){+.+...}, at:
>> [] handle_pte_fault+0x3f7/0x6a0
>> [  338.511409] Pid: 6361, comm: trinity-main Tainted: GW
>> 3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74
>> [  338.530318] Call Trace:
>> [  338.534088]  [] __might_sleep+0x1c3/0x1e0
>> [  338.539358]  [] mutex_lock_nested+0x29/0x50
>> [  338.545253]  [] mpol_shared_policy_lookup+0x2e/0x90
>> [  338.545258]  [] shmem_get_policy+0x2e/0x30
>> [  338.545264]  [] get_vma_policy+0x5a/0xa0
>> [  338.545267]  [] mpol_misplaced+0x41/0x1d0
>> [  338.545272]  [] handle_pte_fault+0x465/0x6a0
>> [  338.545278]  [] ? __rcu_read_unlock+0x44/0xb0
>> [  338.545282]  [] handle_mm_fault+0x32a/0x360
>> [  338.545286]  [] __do_page_fault+0x480/0x4f0
>> [  338.545293]  [] ? del_timer+0x26/0x80
>> [  338.545298]  [] ? rcu_cleanup_after_idle+0x23/0x170
>> [  338.545302]  [] ? rcu_eqs_exit_common+0x64/0x3a0
>> [  338.545305]  [] ? rcu_eqs_enter_common+0x7c6/0x970
>> [  338.545309]  [] ? rcu_eqs_exit+0x9c/0xb0
>> [  338.545312]  [] do_page_fault+0x26/0x40
>> [  338.545317]  [] do_async_page_fault+0x30/0xa0
>> [  338.545321]  [] async_page_fault+0x28/0x30
>>
>
> Ok, this looks the same but it's actually a different issue:
> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2,
> calls get_vma_policy() which may take the shared policy mutex.  This
> happens while holding page_table_lock from do_huge_pmd_numa_page() but
> also from do_numa_page() while holding a spinlock on the ptl, which is
> coming from the sched/numa branch.
>
> Is there anyway that we can avoid changing the shared policy mutex back
> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race
> in shared_policy_replace()"])?
>
> Adding Peter, Rik, and Mel to the cc.

Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter,
I guess you commited it, right? If so, may I review your mempolicy
changes? Now mempolicy has a lot of horrible buggy code and I hope to
maintain carefully. Which tree should i see?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread David Rientjes
On Wed, 24 Oct 2012, Sasha Levin wrote:

> > This should be fixed by 9e7814404b77 ("hold task->mempolicy while
> > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading
> > /proc/pid/numa_maps on that kernel?
> 
> I was actually referring to the warnings Dave Jones saw when fuzzing
> with trinity after the
> original patch was applied.
> 
> I still see the following when fuzzing:
> 
> [  338.467156] BUG: sleeping function called from invalid context at
> kernel/mutex.c:269
> [  338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: 
> trinity-main
> [  338.481199] 2 locks held by trinity-main/6361:
> [  338.486629]  #0:  (>mmap_sem){++}, at: []
> __do_page_fault+0x1e4/0x4f0
> [  338.498783]  #1:  (&(>page_table_lock)->rlock){+.+...}, at:
> [] handle_pte_fault+0x3f7/0x6a0
> [  338.511409] Pid: 6361, comm: trinity-main Tainted: GW
> 3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74
> [  338.530318] Call Trace:
> [  338.534088]  [] __might_sleep+0x1c3/0x1e0
> [  338.539358]  [] mutex_lock_nested+0x29/0x50
> [  338.545253]  [] mpol_shared_policy_lookup+0x2e/0x90
> [  338.545258]  [] shmem_get_policy+0x2e/0x30
> [  338.545264]  [] get_vma_policy+0x5a/0xa0
> [  338.545267]  [] mpol_misplaced+0x41/0x1d0
> [  338.545272]  [] handle_pte_fault+0x465/0x6a0
> [  338.545278]  [] ? __rcu_read_unlock+0x44/0xb0
> [  338.545282]  [] handle_mm_fault+0x32a/0x360
> [  338.545286]  [] __do_page_fault+0x480/0x4f0
> [  338.545293]  [] ? del_timer+0x26/0x80
> [  338.545298]  [] ? rcu_cleanup_after_idle+0x23/0x170
> [  338.545302]  [] ? rcu_eqs_exit_common+0x64/0x3a0
> [  338.545305]  [] ? rcu_eqs_enter_common+0x7c6/0x970
> [  338.545309]  [] ? rcu_eqs_exit+0x9c/0xb0
> [  338.545312]  [] do_page_fault+0x26/0x40
> [  338.545317]  [] do_async_page_fault+0x30/0xa0
> [  338.545321]  [] async_page_fault+0x28/0x30
> 

Ok, this looks the same but it's actually a different issue: 
mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
calls get_vma_policy() which may take the shared policy mutex.  This 
happens while holding page_table_lock from do_huge_pmd_numa_page() but 
also from do_numa_page() while holding a spinlock on the ptl, which is 
coming from the sched/numa branch.

Is there anyway that we can avoid changing the shared policy mutex back 
into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race 
in shared_policy_replace()"])?

Adding Peter, Rik, and Mel to the cc.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread Sasha Levin
On Wed, Oct 24, 2012 at 7:34 PM, David Rientjes  wrote:
> On Wed, 24 Oct 2012, Sasha Levin wrote:
>
>> I'm not sure about the status of the patch, but it doesn't apply on
>> top of -next, and I still
>> see the warnings when fuzzing on -next.
>>
>
> This should be fixed by 9e7814404b77 ("hold task->mempolicy while
> numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading
> /proc/pid/numa_maps on that kernel?

I was actually referring to the warnings Dave Jones saw when fuzzing
with trinity after the
original patch was applied.

I still see the following when fuzzing:

[  338.467156] BUG: sleeping function called from invalid context at
kernel/mutex.c:269
[  338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
[  338.481199] 2 locks held by trinity-main/6361:
[  338.486629]  #0:  (>mmap_sem){++}, at: []
__do_page_fault+0x1e4/0x4f0
[  338.498783]  #1:  (&(>page_table_lock)->rlock){+.+...}, at:
[] handle_pte_fault+0x3f7/0x6a0
[  338.511409] Pid: 6361, comm: trinity-main Tainted: GW
3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74
[  338.530318] Call Trace:
[  338.534088]  [] __might_sleep+0x1c3/0x1e0
[  338.539358]  [] mutex_lock_nested+0x29/0x50
[  338.545253]  [] mpol_shared_policy_lookup+0x2e/0x90
[  338.545258]  [] shmem_get_policy+0x2e/0x30
[  338.545264]  [] get_vma_policy+0x5a/0xa0
[  338.545267]  [] mpol_misplaced+0x41/0x1d0
[  338.545272]  [] handle_pte_fault+0x465/0x6a0
[  338.545278]  [] ? __rcu_read_unlock+0x44/0xb0
[  338.545282]  [] handle_mm_fault+0x32a/0x360
[  338.545286]  [] __do_page_fault+0x480/0x4f0
[  338.545293]  [] ? del_timer+0x26/0x80
[  338.545298]  [] ? rcu_cleanup_after_idle+0x23/0x170
[  338.545302]  [] ? rcu_eqs_exit_common+0x64/0x3a0
[  338.545305]  [] ? rcu_eqs_enter_common+0x7c6/0x970
[  338.545309]  [] ? rcu_eqs_exit+0x9c/0xb0
[  338.545312]  [] do_page_fault+0x26/0x40
[  338.545317]  [] do_async_page_fault+0x30/0xa0
[  338.545321]  [] async_page_fault+0x28/0x30
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread David Rientjes
On Wed, 24 Oct 2012, Sasha Levin wrote:

> I'm not sure about the status of the patch, but it doesn't apply on
> top of -next, and I still
> see the warnings when fuzzing on -next.
> 

This should be fixed by 9e7814404b77 ("hold task->mempolicy while 
numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading 
/proc/pid/numa_maps on that kernel?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread Sasha Levin
On Wed, Oct 17, 2012 at 1:24 AM, David Rientjes  wrote:
> On Wed, 17 Oct 2012, Dave Jones wrote:
>
>> BUG: sleeping function called from invalid context at kernel/mutex.c:269
>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
>> 3 locks on stack by trinity-child2/8558:
>>  #0: held: (>lock){+.+.+.}, instance: 88010c9a00b0, at: 
>> [] seq_lseek+0x3f/0x120
>>  #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: 
>> [] m_start+0xa7/0x190
>>  #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: 
>> 88011fc64f30, at: [] show_numa_map+0x14f/0x610
>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
>> Call Trace:
>>  [] __might_sleep+0x14c/0x200
>>  [] mutex_lock_nested+0x2e/0x50
>>  [] mpol_shared_policy_lookup+0x33/0x90
>>  [] shmem_get_policy+0x33/0x40
>>  [] get_vma_policy+0x3a/0x90
>>  [] show_numa_map+0x163/0x610
>>  [] ? pid_maps_open+0x20/0x20
>>  [] ? pagemap_hugetlb_range+0xf0/0xf0
>>  [] show_pid_numa_map+0x13/0x20
>>  [] traverse+0xf2/0x230
>>  [] seq_lseek+0xab/0x120
>>  [] sys_lseek+0x7b/0xb0
>>  [] tracesys+0xe1/0xe6
>>
>
> Hmm, looks like we need to change the refcount semantics entirely.  We'll
> need to make get_vma_policy() always take a reference and then drop it
> accordingly.  This work sif get_vma_policy() can grab a reference while
> holding task_lock() for the task policy fallback case.
>
> Comments on this approach?
> ---
[snip]

I'm not sure about the status of the patch, but it doesn't apply on
top of -next, and I still
see the warnings when fuzzing on -next.


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread Sasha Levin
On Wed, Oct 17, 2012 at 1:24 AM, David Rientjes rient...@google.com wrote:
 On Wed, 17 Oct 2012, Dave Jones wrote:

 BUG: sleeping function called from invalid context at kernel/mutex.c:269
 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
 3 locks on stack by trinity-child2/8558:
  #0: held: (p-lock){+.+.+.}, instance: 88010c9a00b0, at: 
 [8120cd1f] seq_lseek+0x3f/0x120
  #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: 
 [81254437] m_start+0xa7/0x190
  #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 
 88011fc64f30, at: [81254f8f] show_numa_map+0x14f/0x610
 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
 Call Trace:
  [810ae4ec] __might_sleep+0x14c/0x200
  [816bdf4e] mutex_lock_nested+0x2e/0x50
  [811c43a3] mpol_shared_policy_lookup+0x33/0x90
  [8118d5c3] shmem_get_policy+0x33/0x40
  [811c31fa] get_vma_policy+0x3a/0x90
  [81254fa3] show_numa_map+0x163/0x610
  [81255b10] ? pid_maps_open+0x20/0x20
  [81255980] ? pagemap_hugetlb_range+0xf0/0xf0
  [81255483] show_pid_numa_map+0x13/0x20
  [8120c902] traverse+0xf2/0x230
  [8120cd8b] seq_lseek+0xab/0x120
  [811e6c0b] sys_lseek+0x7b/0xb0
  [816ca088] tracesys+0xe1/0xe6


 Hmm, looks like we need to change the refcount semantics entirely.  We'll
 need to make get_vma_policy() always take a reference and then drop it
 accordingly.  This work sif get_vma_policy() can grab a reference while
 holding task_lock() for the task policy fallback case.

 Comments on this approach?
 ---
[snip]

I'm not sure about the status of the patch, but it doesn't apply on
top of -next, and I still
see the warnings when fuzzing on -next.


Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread David Rientjes
On Wed, 24 Oct 2012, Sasha Levin wrote:

 I'm not sure about the status of the patch, but it doesn't apply on
 top of -next, and I still
 see the warnings when fuzzing on -next.
 

This should be fixed by 9e7814404b77 (hold task-mempolicy while 
numa_maps scans.) in 3.7-rc2, can you reproduce any issues reading 
/proc/pid/numa_maps on that kernel?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread Sasha Levin
On Wed, Oct 24, 2012 at 7:34 PM, David Rientjes rient...@google.com wrote:
 On Wed, 24 Oct 2012, Sasha Levin wrote:

 I'm not sure about the status of the patch, but it doesn't apply on
 top of -next, and I still
 see the warnings when fuzzing on -next.


 This should be fixed by 9e7814404b77 (hold task-mempolicy while
 numa_maps scans.) in 3.7-rc2, can you reproduce any issues reading
 /proc/pid/numa_maps on that kernel?

I was actually referring to the warnings Dave Jones saw when fuzzing
with trinity after the
original patch was applied.

I still see the following when fuzzing:

[  338.467156] BUG: sleeping function called from invalid context at
kernel/mutex.c:269
[  338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
[  338.481199] 2 locks held by trinity-main/6361:
[  338.486629]  #0:  (mm-mmap_sem){++}, at: [810aa314]
__do_page_fault+0x1e4/0x4f0
[  338.498783]  #1:  ((mm-page_table_lock)-rlock){+.+...}, at:
[8122f017] handle_pte_fault+0x3f7/0x6a0
[  338.511409] Pid: 6361, comm: trinity-main Tainted: GW
3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74
[  338.530318] Call Trace:
[  338.534088]  [8114e393] __might_sleep+0x1c3/0x1e0
[  338.539358]  [83ae5209] mutex_lock_nested+0x29/0x50
[  338.545253]  [8124fc3e] mpol_shared_policy_lookup+0x2e/0x90
[  338.545258]  [81219ebe] shmem_get_policy+0x2e/0x30
[  338.545264]  [8124e99a] get_vma_policy+0x5a/0xa0
[  338.545267]  [8124fce1] mpol_misplaced+0x41/0x1d0
[  338.545272]  [8122f085] handle_pte_fault+0x465/0x6a0
[  338.545278]  [81131e04] ? __rcu_read_unlock+0x44/0xb0
[  338.545282]  [81230baa] handle_mm_fault+0x32a/0x360
[  338.545286]  [810aa5b0] __do_page_fault+0x480/0x4f0
[  338.545293]  [8111a706] ? del_timer+0x26/0x80
[  338.545298]  [811c7313] ? rcu_cleanup_after_idle+0x23/0x170
[  338.545302]  [811ca9a4] ? rcu_eqs_exit_common+0x64/0x3a0
[  338.545305]  [811c8c66] ? rcu_eqs_enter_common+0x7c6/0x970
[  338.545309]  [811cafdc] ? rcu_eqs_exit+0x9c/0xb0
[  338.545312]  [810aa666] do_page_fault+0x26/0x40
[  338.545317]  [810a3a40] do_async_page_fault+0x30/0xa0
[  338.545321]  [83ae9268] async_page_fault+0x28/0x30
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread David Rientjes
On Wed, 24 Oct 2012, Sasha Levin wrote:

  This should be fixed by 9e7814404b77 (hold task-mempolicy while
  numa_maps scans.) in 3.7-rc2, can you reproduce any issues reading
  /proc/pid/numa_maps on that kernel?
 
 I was actually referring to the warnings Dave Jones saw when fuzzing
 with trinity after the
 original patch was applied.
 
 I still see the following when fuzzing:
 
 [  338.467156] BUG: sleeping function called from invalid context at
 kernel/mutex.c:269
 [  338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: 
 trinity-main
 [  338.481199] 2 locks held by trinity-main/6361:
 [  338.486629]  #0:  (mm-mmap_sem){++}, at: [810aa314]
 __do_page_fault+0x1e4/0x4f0
 [  338.498783]  #1:  ((mm-page_table_lock)-rlock){+.+...}, at:
 [8122f017] handle_pte_fault+0x3f7/0x6a0
 [  338.511409] Pid: 6361, comm: trinity-main Tainted: GW
 3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74
 [  338.530318] Call Trace:
 [  338.534088]  [8114e393] __might_sleep+0x1c3/0x1e0
 [  338.539358]  [83ae5209] mutex_lock_nested+0x29/0x50
 [  338.545253]  [8124fc3e] mpol_shared_policy_lookup+0x2e/0x90
 [  338.545258]  [81219ebe] shmem_get_policy+0x2e/0x30
 [  338.545264]  [8124e99a] get_vma_policy+0x5a/0xa0
 [  338.545267]  [8124fce1] mpol_misplaced+0x41/0x1d0
 [  338.545272]  [8122f085] handle_pte_fault+0x465/0x6a0
 [  338.545278]  [81131e04] ? __rcu_read_unlock+0x44/0xb0
 [  338.545282]  [81230baa] handle_mm_fault+0x32a/0x360
 [  338.545286]  [810aa5b0] __do_page_fault+0x480/0x4f0
 [  338.545293]  [8111a706] ? del_timer+0x26/0x80
 [  338.545298]  [811c7313] ? rcu_cleanup_after_idle+0x23/0x170
 [  338.545302]  [811ca9a4] ? rcu_eqs_exit_common+0x64/0x3a0
 [  338.545305]  [811c8c66] ? rcu_eqs_enter_common+0x7c6/0x970
 [  338.545309]  [811cafdc] ? rcu_eqs_exit+0x9c/0xb0
 [  338.545312]  [810aa666] do_page_fault+0x26/0x40
 [  338.545317]  [810a3a40] do_async_page_fault+0x30/0xa0
 [  338.545321]  [83ae9268] async_page_fault+0x28/0x30
 

Ok, this looks the same but it's actually a different issue: 
mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
calls get_vma_policy() which may take the shared policy mutex.  This 
happens while holding page_table_lock from do_huge_pmd_numa_page() but 
also from do_numa_page() while holding a spinlock on the ptl, which is 
coming from the sched/numa branch.

Is there anyway that we can avoid changing the shared policy mutex back 
into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race 
in shared_policy_replace()])?

Adding Peter, Rik, and Mel to the cc.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread KOSAKI Motohiro
On Wed, Oct 24, 2012 at 8:08 PM, David Rientjes rient...@google.com wrote:
 On Wed, 24 Oct 2012, Sasha Levin wrote:

  This should be fixed by 9e7814404b77 (hold task-mempolicy while
  numa_maps scans.) in 3.7-rc2, can you reproduce any issues reading
  /proc/pid/numa_maps on that kernel?

 I was actually referring to the warnings Dave Jones saw when fuzzing
 with trinity after the
 original patch was applied.

 I still see the following when fuzzing:

 [  338.467156] BUG: sleeping function called from invalid context at
 kernel/mutex.c:269
 [  338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: 
 trinity-main
 [  338.481199] 2 locks held by trinity-main/6361:
 [  338.486629]  #0:  (mm-mmap_sem){++}, at: [810aa314]
 __do_page_fault+0x1e4/0x4f0
 [  338.498783]  #1:  ((mm-page_table_lock)-rlock){+.+...}, at:
 [8122f017] handle_pte_fault+0x3f7/0x6a0
 [  338.511409] Pid: 6361, comm: trinity-main Tainted: GW
 3.7.0-rc2-next-20121024-sasha-1-gd95ef01-dirty #74
 [  338.530318] Call Trace:
 [  338.534088]  [8114e393] __might_sleep+0x1c3/0x1e0
 [  338.539358]  [83ae5209] mutex_lock_nested+0x29/0x50
 [  338.545253]  [8124fc3e] mpol_shared_policy_lookup+0x2e/0x90
 [  338.545258]  [81219ebe] shmem_get_policy+0x2e/0x30
 [  338.545264]  [8124e99a] get_vma_policy+0x5a/0xa0
 [  338.545267]  [8124fce1] mpol_misplaced+0x41/0x1d0
 [  338.545272]  [8122f085] handle_pte_fault+0x465/0x6a0
 [  338.545278]  [81131e04] ? __rcu_read_unlock+0x44/0xb0
 [  338.545282]  [81230baa] handle_mm_fault+0x32a/0x360
 [  338.545286]  [810aa5b0] __do_page_fault+0x480/0x4f0
 [  338.545293]  [8111a706] ? del_timer+0x26/0x80
 [  338.545298]  [811c7313] ? rcu_cleanup_after_idle+0x23/0x170
 [  338.545302]  [811ca9a4] ? rcu_eqs_exit_common+0x64/0x3a0
 [  338.545305]  [811c8c66] ? rcu_eqs_enter_common+0x7c6/0x970
 [  338.545309]  [811cafdc] ? rcu_eqs_exit+0x9c/0xb0
 [  338.545312]  [810aa666] do_page_fault+0x26/0x40
 [  338.545317]  [810a3a40] do_async_page_fault+0x30/0xa0
 [  338.545321]  [83ae9268] async_page_fault+0x28/0x30


 Ok, this looks the same but it's actually a different issue:
 mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2,
 calls get_vma_policy() which may take the shared policy mutex.  This
 happens while holding page_table_lock from do_huge_pmd_numa_page() but
 also from do_numa_page() while holding a spinlock on the ptl, which is
 coming from the sched/numa branch.

 Is there anyway that we can avoid changing the shared policy mutex back
 into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race
 in shared_policy_replace()])?

 Adding Peter, Rik, and Mel to the cc.

Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter,
I guess you commited it, right? If so, may I review your mempolicy
changes? Now mempolicy has a lot of horrible buggy code and I hope to
maintain carefully. Which tree should i see?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-24 Thread David Rientjes
On Wed, 24 Oct 2012, KOSAKI Motohiro wrote:

 Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter,
 I guess you commited it, right? If so, may I review your mempolicy
 changes? Now mempolicy has a lot of horrible buggy code and I hope to
 maintain carefully. Which tree should i see?
 

Check out sched/numa from 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

$ git diff v3.7-rc2.. mm/mempolicy.c | diffstat
 mempolicy.c |  444 +---
 1 file changed, 277 insertions(+), 167 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread David Rientjes
On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:

> > Um, this was just changed to a mutex last week in commit b22d127a39dd
> > ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc()
> > can be done with GFP_KERNEL, so I didn't consider reverting that behavior.
> > Are you nacking that patch, which you acked, now?
> 
> Yes, sadly. /proc usage is a corner case issue. It's not worth to
> strike main path.

It also simplifies the fastpath since we can now unconditionally drop the 
reference.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread KOSAKI Motohiro
On Wed, Oct 17, 2012 at 3:50 PM, David Rientjes  wrote:
> On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:
>
>> > I think this refcounting is better than using task_lock().
>>
>> I don't think so. get_vma_policy() is used from fast path. In other
>> words, number of
>> atomic ops is sensible for allocation performance.
>
> There are enhancements that we can make with refcounting: for instance, we
> may want to avoid doing it in the super-fast path when the policy is
> default_policy and then just do
>
> if (mpol != _policy)
> mpol_put(mpol);
>
>> Instead, I'd like
>> to use spinlock
>> for shared mempolicy instead of mutex.
>>
>
> Um, this was just changed to a mutex last week in commit b22d127a39dd
> ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc()
> can be done with GFP_KERNEL, so I didn't consider reverting that behavior.
> Are you nacking that patch, which you acked, now?

Yes, sadly. /proc usage is a corner case issue. It's not worth to
strike main path.
see commit 52cd3b0740 and around patches. That explain why we avoided your
approach.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread David Rientjes
On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:

> > I think this refcounting is better than using task_lock().
> 
> I don't think so. get_vma_policy() is used from fast path. In other
> words, number of
> atomic ops is sensible for allocation performance.

There are enhancements that we can make with refcounting: for instance, we 
may want to avoid doing it in the super-fast path when the policy is 
default_policy and then just do

if (mpol != _policy)
mpol_put(mpol);

> Instead, I'd like
> to use spinlock
> for shared mempolicy instead of mutex.
> 

Um, this was just changed to a mutex last week in commit b22d127a39dd 
("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc() 
can be done with GFP_KERNEL, so I didn't consider reverting that behavior.  
Are you nacking that patch, which you acked, now?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread Dave Jones
On Wed, Oct 17, 2012 at 12:38:55PM -0700, David Rientjes wrote:

 > >  > Sounds good.  Is it possible to verify that policy_cache isn't getting 
 > >  > larger than normal in /proc/slabinfo, i.e. when all processes with a 
 > >  > task mempolicy or shared vma policy have exited, are there still a 
 > >  > significant number of active objects?
 > > 
 > > Killing the fuzzer caused it to drop dramatically.
 > > 
 > Excellent, thanks.  This shows that the refcounting is working properly 
 > and we're not leaking any references as a result of this change causing 
 > the mempolicies to never be freed.  ("numa_policy" turns out to be 
 > policy_cache in the code, so thanks for checking both of them.)
 > 
 > Could I add your tested-by?

Sure. Here's a fresh one I just baked.

Tested-by: Dave Jones 

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread David Rientjes
On Wed, 17 Oct 2012, Dave Jones wrote:

>  > Sounds good.  Is it possible to verify that policy_cache isn't getting 
>  > larger than normal in /proc/slabinfo, i.e. when all processes with a 
>  > task mempolicy or shared vma policy have exited, are there still a 
>  > significant number of active objects?
> 
> Killing the fuzzer caused it to drop dramatically.
> 
> Before:
> (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep 
> policy
> shared_policy_node   2931   2967376   434 : tunables000 : 
> slabdata 69 69  0
> numa_policy 2971   6545464   354 : tunables000 : 
> slabdata187187  0
> 
> After:
> (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep 
> policy
> shared_policy_node  0215376   434 : tunables000 : 
> slabdata  5  5  0
> numa_policy   15175464   354 : tunables000 : 
> slabdata  5  5  0
> 

Excellent, thanks.  This shows that the refcounting is working properly 
and we're not leaking any references as a result of this change causing 
the mempolicies to never be freed.  ("numa_policy" turns out to be 
policy_cache in the code, so thanks for checking both of them.)

Could I add your tested-by?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread Dave Jones
On Wed, Oct 17, 2012 at 12:21:10PM -0700, David Rientjes wrote:
 > On Wed, 17 Oct 2012, Dave Jones wrote:
 > 
 > > On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
 > >  > On Wed, 17 Oct 2012, Dave Jones wrote:
 > >  > 
 > >  > > BUG: sleeping function called from invalid context at 
 > > kernel/mutex.c:269
 > >  > 
 > >  > Hmm, looks like we need to change the refcount semantics entirely.  
 > > We'll 
 > >  > need to make get_vma_policy() always take a reference and then drop it 
 > >  > accordingly.  This work sif get_vma_policy() can grab a reference while 
 > >  > holding task_lock() for the task policy fallback case.
 > >  > 
 > >  > Comments on this approach?
 > > 
 > > Seems to be surviving my testing at least..
 > > 
 > 
 > Sounds good.  Is it possible to verify that policy_cache isn't getting 
 > larger than normal in /proc/slabinfo, i.e. when all processes with a 
 > task mempolicy or shared vma policy have exited, are there still a 
 > significant number of active objects?

Killing the fuzzer caused it to drop dramatically.

Before:
(15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep 
policy
shared_policy_node   2931   2967376   434 : tunables000 : 
slabdata 69 69  0
numa_policy 2971   6545464   354 : tunables000 : 
slabdata187187  0

After:
(15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep 
policy
shared_policy_node  0215376   434 : tunables000 : 
slabdata  5  5  0
numa_policy   15175464   354 : tunables000 : 
slabdata  5  5  0


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread David Rientjes
On Wed, 17 Oct 2012, Dave Jones wrote:

> On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
>  > On Wed, 17 Oct 2012, Dave Jones wrote:
>  > 
>  > > BUG: sleeping function called from invalid context at kernel/mutex.c:269
>  > 
>  > Hmm, looks like we need to change the refcount semantics entirely.  We'll 
>  > need to make get_vma_policy() always take a reference and then drop it 
>  > accordingly.  This work sif get_vma_policy() can grab a reference while 
>  > holding task_lock() for the task policy fallback case.
>  > 
>  > Comments on this approach?
> 
> Seems to be surviving my testing at least..
> 

Sounds good.  Is it possible to verify that policy_cache isn't getting 
larger than normal in /proc/slabinfo, i.e. when all processes with a 
task mempolicy or shared vma policy have exited, are there still a 
significant number of active objects?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread Dave Jones
On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
 > On Wed, 17 Oct 2012, Dave Jones wrote:
 > 
 > > BUG: sleeping function called from invalid context at kernel/mutex.c:269
 > 
 > Hmm, looks like we need to change the refcount semantics entirely.  We'll 
 > need to make get_vma_policy() always take a reference and then drop it 
 > accordingly.  This work sif get_vma_policy() can grab a reference while 
 > holding task_lock() for the task policy fallback case.
 > 
 > Comments on this approach?

Seems to be surviving my testing at least..

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread KOSAKI Motohiro
On Wed, Oct 17, 2012 at 1:42 AM, Kamezawa Hiroyuki
 wrote:
> (2012/10/17 14:24), David Rientjes wrote:
>>
>> On Wed, 17 Oct 2012, Dave Jones wrote:
>>
>>> BUG: sleeping function called from invalid context at kernel/mutex.c:269
>>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
>>> 3 locks on stack by trinity-child2/8558:
>>>   #0: held: (>lock){+.+.+.}, instance: 88010c9a00b0, at:
>>> [] seq_lseek+0x3f/0x120
>>>   #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at:
>>> [] m_start+0xa7/0x190
>>>   #2: held: (&(>alloc_lock)->rlock){+.+...}, instance:
>>> 88011fc64f30, at: [] show_numa_map+0x14f/0x610
>>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
>>> Call Trace:
>>>   [] __might_sleep+0x14c/0x200
>>>   [] mutex_lock_nested+0x2e/0x50
>>>   [] mpol_shared_policy_lookup+0x33/0x90
>>>   [] shmem_get_policy+0x33/0x40
>>>   [] get_vma_policy+0x3a/0x90
>>>   [] show_numa_map+0x163/0x610
>>>   [] ? pid_maps_open+0x20/0x20
>>>   [] ? pagemap_hugetlb_range+0xf0/0xf0
>>>   [] show_pid_numa_map+0x13/0x20
>>>   [] traverse+0xf2/0x230
>>>   [] seq_lseek+0xab/0x120
>>>   [] sys_lseek+0x7b/0xb0
>>>   [] tracesys+0xe1/0xe6
>>>
>>
>> Hmm, looks like we need to change the refcount semantics entirely.  We'll
>> need to make get_vma_policy() always take a reference and then drop it
>> accordingly.  This work sif get_vma_policy() can grab a reference while
>> holding task_lock() for the task policy fallback case.
>>
>> Comments on this approach?
>
>
> I think this refcounting is better than using task_lock().

I don't think so. get_vma_policy() is used from fast path. In other
words, number of
atomic ops is sensible for allocation performance. Instead, I'd like
to use spinlock
for shared mempolicy instead of mutex.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread KOSAKI Motohiro
On Wed, Oct 17, 2012 at 1:42 AM, Kamezawa Hiroyuki
kamezawa.hir...@jp.fujitsu.com wrote:
 (2012/10/17 14:24), David Rientjes wrote:

 On Wed, 17 Oct 2012, Dave Jones wrote:

 BUG: sleeping function called from invalid context at kernel/mutex.c:269
 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
 3 locks on stack by trinity-child2/8558:
   #0: held: (p-lock){+.+.+.}, instance: 88010c9a00b0, at:
 [8120cd1f] seq_lseek+0x3f/0x120
   #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at:
 [81254437] m_start+0xa7/0x190
   #2: held: ((p-alloc_lock)-rlock){+.+...}, instance:
 88011fc64f30, at: [81254f8f] show_numa_map+0x14f/0x610
 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
 Call Trace:
   [810ae4ec] __might_sleep+0x14c/0x200
   [816bdf4e] mutex_lock_nested+0x2e/0x50
   [811c43a3] mpol_shared_policy_lookup+0x33/0x90
   [8118d5c3] shmem_get_policy+0x33/0x40
   [811c31fa] get_vma_policy+0x3a/0x90
   [81254fa3] show_numa_map+0x163/0x610
   [81255b10] ? pid_maps_open+0x20/0x20
   [81255980] ? pagemap_hugetlb_range+0xf0/0xf0
   [81255483] show_pid_numa_map+0x13/0x20
   [8120c902] traverse+0xf2/0x230
   [8120cd8b] seq_lseek+0xab/0x120
   [811e6c0b] sys_lseek+0x7b/0xb0
   [816ca088] tracesys+0xe1/0xe6


 Hmm, looks like we need to change the refcount semantics entirely.  We'll
 need to make get_vma_policy() always take a reference and then drop it
 accordingly.  This work sif get_vma_policy() can grab a reference while
 holding task_lock() for the task policy fallback case.

 Comments on this approach?


 I think this refcounting is better than using task_lock().

I don't think so. get_vma_policy() is used from fast path. In other
words, number of
atomic ops is sensible for allocation performance. Instead, I'd like
to use spinlock
for shared mempolicy instead of mutex.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread Dave Jones
On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
  On Wed, 17 Oct 2012, Dave Jones wrote:
  
   BUG: sleeping function called from invalid context at kernel/mutex.c:269
  
  Hmm, looks like we need to change the refcount semantics entirely.  We'll 
  need to make get_vma_policy() always take a reference and then drop it 
  accordingly.  This work sif get_vma_policy() can grab a reference while 
  holding task_lock() for the task policy fallback case.
  
  Comments on this approach?

Seems to be surviving my testing at least..

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread David Rientjes
On Wed, 17 Oct 2012, Dave Jones wrote:

 On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
   On Wed, 17 Oct 2012, Dave Jones wrote:
   
BUG: sleeping function called from invalid context at kernel/mutex.c:269
   
   Hmm, looks like we need to change the refcount semantics entirely.  We'll 
   need to make get_vma_policy() always take a reference and then drop it 
   accordingly.  This work sif get_vma_policy() can grab a reference while 
   holding task_lock() for the task policy fallback case.
   
   Comments on this approach?
 
 Seems to be surviving my testing at least..
 

Sounds good.  Is it possible to verify that policy_cache isn't getting 
larger than normal in /proc/slabinfo, i.e. when all processes with a 
task mempolicy or shared vma policy have exited, are there still a 
significant number of active objects?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread Dave Jones
On Wed, Oct 17, 2012 at 12:21:10PM -0700, David Rientjes wrote:
  On Wed, 17 Oct 2012, Dave Jones wrote:
  
   On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
 On Wed, 17 Oct 2012, Dave Jones wrote:
 
  BUG: sleeping function called from invalid context at 
   kernel/mutex.c:269
 
 Hmm, looks like we need to change the refcount semantics entirely.  
   We'll 
 need to make get_vma_policy() always take a reference and then drop it 
 accordingly.  This work sif get_vma_policy() can grab a reference while 
 holding task_lock() for the task policy fallback case.
 
 Comments on this approach?
   
   Seems to be surviving my testing at least..
   
  
  Sounds good.  Is it possible to verify that policy_cache isn't getting 
  larger than normal in /proc/slabinfo, i.e. when all processes with a 
  task mempolicy or shared vma policy have exited, are there still a 
  significant number of active objects?

Killing the fuzzer caused it to drop dramatically.

Before:
(15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep 
policy
shared_policy_node   2931   2967376   434 : tunables000 : 
slabdata 69 69  0
numa_policy 2971   6545464   354 : tunables000 : 
slabdata187187  0

After:
(15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep 
policy
shared_policy_node  0215376   434 : tunables000 : 
slabdata  5  5  0
numa_policy   15175464   354 : tunables000 : 
slabdata  5  5  0


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread David Rientjes
On Wed, 17 Oct 2012, Dave Jones wrote:

   Sounds good.  Is it possible to verify that policy_cache isn't getting 
   larger than normal in /proc/slabinfo, i.e. when all processes with a 
   task mempolicy or shared vma policy have exited, are there still a 
   significant number of active objects?
 
 Killing the fuzzer caused it to drop dramatically.
 
 Before:
 (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep 
 policy
 shared_policy_node   2931   2967376   434 : tunables000 : 
 slabdata 69 69  0
 numa_policy 2971   6545464   354 : tunables000 : 
 slabdata187187  0
 
 After:
 (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep 
 policy
 shared_policy_node  0215376   434 : tunables000 : 
 slabdata  5  5  0
 numa_policy   15175464   354 : tunables000 : 
 slabdata  5  5  0
 

Excellent, thanks.  This shows that the refcounting is working properly 
and we're not leaking any references as a result of this change causing 
the mempolicies to never be freed.  (numa_policy turns out to be 
policy_cache in the code, so thanks for checking both of them.)

Could I add your tested-by?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread Dave Jones
On Wed, Oct 17, 2012 at 12:38:55PM -0700, David Rientjes wrote:

 Sounds good.  Is it possible to verify that policy_cache isn't getting 
 larger than normal in /proc/slabinfo, i.e. when all processes with a 
 task mempolicy or shared vma policy have exited, are there still a 
 significant number of active objects?
   
   Killing the fuzzer caused it to drop dramatically.
   
  Excellent, thanks.  This shows that the refcounting is working properly 
  and we're not leaking any references as a result of this change causing 
  the mempolicies to never be freed.  (numa_policy turns out to be 
  policy_cache in the code, so thanks for checking both of them.)
  
  Could I add your tested-by?

Sure. Here's a fresh one I just baked.

Tested-by: Dave Jones da...@redhat.com

Dave
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread David Rientjes
On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:

  I think this refcounting is better than using task_lock().
 
 I don't think so. get_vma_policy() is used from fast path. In other
 words, number of
 atomic ops is sensible for allocation performance.

There are enhancements that we can make with refcounting: for instance, we 
may want to avoid doing it in the super-fast path when the policy is 
default_policy and then just do

if (mpol != default_policy)
mpol_put(mpol);

 Instead, I'd like
 to use spinlock
 for shared mempolicy instead of mutex.
 

Um, this was just changed to a mutex last week in commit b22d127a39dd 
(mempolicy: fix a race in shared_policy_replace()) so that sp_alloc() 
can be done with GFP_KERNEL, so I didn't consider reverting that behavior.  
Are you nacking that patch, which you acked, now?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread KOSAKI Motohiro
On Wed, Oct 17, 2012 at 3:50 PM, David Rientjes rient...@google.com wrote:
 On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:

  I think this refcounting is better than using task_lock().

 I don't think so. get_vma_policy() is used from fast path. In other
 words, number of
 atomic ops is sensible for allocation performance.

 There are enhancements that we can make with refcounting: for instance, we
 may want to avoid doing it in the super-fast path when the policy is
 default_policy and then just do

 if (mpol != default_policy)
 mpol_put(mpol);

 Instead, I'd like
 to use spinlock
 for shared mempolicy instead of mutex.


 Um, this was just changed to a mutex last week in commit b22d127a39dd
 (mempolicy: fix a race in shared_policy_replace()) so that sp_alloc()
 can be done with GFP_KERNEL, so I didn't consider reverting that behavior.
 Are you nacking that patch, which you acked, now?

Yes, sadly. /proc usage is a corner case issue. It's not worth to
strike main path.
see commit 52cd3b0740 and around patches. That explain why we avoided your
approach.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-17 Thread David Rientjes
On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:

  Um, this was just changed to a mutex last week in commit b22d127a39dd
  (mempolicy: fix a race in shared_policy_replace()) so that sp_alloc()
  can be done with GFP_KERNEL, so I didn't consider reverting that behavior.
  Are you nacking that patch, which you acked, now?
 
 Yes, sadly. /proc usage is a corner case issue. It's not worth to
 strike main path.

It also simplifies the fastpath since we can now unconditionally drop the 
reference.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread Kamezawa Hiroyuki

(2012/10/17 14:24), David Rientjes wrote:

On Wed, 17 Oct 2012, Dave Jones wrote:


BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
3 locks on stack by trinity-child2/8558:
  #0: held: (>lock){+.+.+.}, instance: 88010c9a00b0, at: 
[] seq_lseek+0x3f/0x120
  #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: 
[] m_start+0xa7/0x190
  #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: 88011fc64f30, at: 
[] show_numa_map+0x14f/0x610
Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
Call Trace:
  [] __might_sleep+0x14c/0x200
  [] mutex_lock_nested+0x2e/0x50
  [] mpol_shared_policy_lookup+0x33/0x90
  [] shmem_get_policy+0x33/0x40
  [] get_vma_policy+0x3a/0x90
  [] show_numa_map+0x163/0x610
  [] ? pid_maps_open+0x20/0x20
  [] ? pagemap_hugetlb_range+0xf0/0xf0
  [] show_pid_numa_map+0x13/0x20
  [] traverse+0xf2/0x230
  [] seq_lseek+0xab/0x120
  [] sys_lseek+0x7b/0xb0
  [] tracesys+0xe1/0xe6



Hmm, looks like we need to change the refcount semantics entirely.  We'll
need to make get_vma_policy() always take a reference and then drop it
accordingly.  This work sif get_vma_policy() can grab a reference while
holding task_lock() for the task policy fallback case.

Comments on this approach?


I think this refcounting is better than using task_lock().

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread David Rientjes
On Wed, 17 Oct 2012, Dave Jones wrote:

> BUG: sleeping function called from invalid context at kernel/mutex.c:269
> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
> 3 locks on stack by trinity-child2/8558:
>  #0: held: (>lock){+.+.+.}, instance: 88010c9a00b0, at: 
> [] seq_lseek+0x3f/0x120
>  #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: 
> [] m_start+0xa7/0x190
>  #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: 
> 88011fc64f30, at: [] show_numa_map+0x14f/0x610
> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
> Call Trace:
>  [] __might_sleep+0x14c/0x200
>  [] mutex_lock_nested+0x2e/0x50
>  [] mpol_shared_policy_lookup+0x33/0x90
>  [] shmem_get_policy+0x33/0x40
>  [] get_vma_policy+0x3a/0x90
>  [] show_numa_map+0x163/0x610
>  [] ? pid_maps_open+0x20/0x20
>  [] ? pagemap_hugetlb_range+0xf0/0xf0
>  [] show_pid_numa_map+0x13/0x20
>  [] traverse+0xf2/0x230
>  [] seq_lseek+0xab/0x120
>  [] sys_lseek+0x7b/0xb0
>  [] tracesys+0xe1/0xe6
> 

Hmm, looks like we need to change the refcount semantics entirely.  We'll 
need to make get_vma_policy() always take a reference and then drop it 
accordingly.  This work sif get_vma_policy() can grab a reference while 
holding task_lock() for the task policy fallback case.

Comments on this approach?
---
 fs/proc/task_mmu.c |4 +---
 include/linux/mm.h |3 +--
 mm/hugetlb.c   |4 ++--
 mm/mempolicy.c |   41 ++---
 4 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, 
int is_pid)
walk.private = md;
walk.mm = mm;
 
-   task_lock(task);
pol = get_vma_policy(task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
-   mpol_cond_put(pol);
-   task_unlock(task);
+   __mpol_put(pol);
 
seq_printf(m, "%08lx %s", vma->vm_start, buffer);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -216,8 +216,7 @@ struct vm_operations_struct {
 * get_policy() op must add reference [mpol_get()] to any policy at
 * (vma,addr) marked as MPOL_SHARED.  The shared policy infrastructure
 * in mm/mempolicy.c will do this automatically.
-* get_policy() must NOT add a ref if the policy at (vma,addr) is not
-* marked as MPOL_SHARED. vma policies are protected by the mmap_sem.
+* vma policies are protected by the mmap_sem.
 * If no [shared/vma] mempolicy exists at the addr, get_policy() op
 * must return NULL--i.e., do not "fallback" to task or system default
 * policy.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -568,13 +568,13 @@ retry_cpuset:
}
}
 
-   mpol_cond_put(mpol);
+   __mpol_put(mpol);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
return page;
 
 err:
-   mpol_cond_put(mpol);
+   __mpol_put(mpol);
return NULL;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1536,39 +1536,41 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, 
compat_ulong_t len,
  *
  * Returns effective policy for a VMA at specified address.
  * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task.  It is the caller's responsibility to free the
- * extra reference for shared policies.
+ * Increments the reference count of the returned mempolicy, it is the caller's
+ * responsibility to decrement with __mpol_put().
+ * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes
+ * task_lock(task) for task policy fallback.
  */
 struct mempolicy *get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
 {
-   struct mempolicy *pol = task->mempolicy;
+   struct mempolicy *pol;
+
+   task_lock(task);
+   pol = task->mempolicy;
+   mpol_get(pol);
+   task_unlock(task);
 
if (vma) {
if (vma->vm_ops && vma->vm_ops->get_policy) {
struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
addr);
-   if (vpol)
+   if (vpol) {
+   mpol_put(pol);
pol = vpol;
+   if (!mpol_needs_cond_ref(pol))
+  

Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread Dave Jones
On Tue, Oct 16, 2012 at 05:31:23PM -0700, David Rientjes wrote:

 > -pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
 > +task_lock(task);
 > +pol = get_vma_policy(task, vma, vma->vm_start);
 >  mpol_to_str(buffer, sizeof(buffer), pol, 0);
 >  mpol_cond_put(pol);
 > +task_unlock(task);

This seems to cause some fallout for me..

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
3 locks on stack by trinity-child2/8558:
 #0: held: (>lock){+.+.+.}, instance: 88010c9a00b0, at: 
[] seq_lseek+0x3f/0x120
 #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: 
[] m_start+0xa7/0x190
 #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: 88011fc64f30, 
at: [] show_numa_map+0x14f/0x610
Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
Call Trace:
 [] __might_sleep+0x14c/0x200
 [] mutex_lock_nested+0x2e/0x50
 [] mpol_shared_policy_lookup+0x33/0x90
 [] shmem_get_policy+0x33/0x40
 [] get_vma_policy+0x3a/0x90
 [] show_numa_map+0x163/0x610
 [] ? pid_maps_open+0x20/0x20
 [] ? pagemap_hugetlb_range+0xf0/0xf0
 [] show_pid_numa_map+0x13/0x20
 [] traverse+0xf2/0x230
 [] seq_lseek+0xab/0x120
 [] sys_lseek+0x7b/0xb0
 [] tracesys+0xe1/0xe6


same problem, different syscall..


BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 21996, name: trinity-child3
3 locks on stack by trinity-child3/21996:
 #0: held: (>lock){+.+.+.}, instance: 88008d712c08, at: 
[] seq_read+0x3d/0x3e0
 #1: held: (>mmap_sem){++}, instance: 88013956f7c8, at: 
[] m_start+0xa7/0x190
 #2: held: (&(>alloc_lock)->rlock){+.+...}, instance: 88011fc64f30, 
at: [] show_numa_map+0x14f/0x610
Pid: 21996, comm: trinity-child3 Not tainted 3.7.0-rc1+ #32
Call Trace:
 [] __might_sleep+0x14c/0x200
 [] mutex_lock_nested+0x2e/0x50
 [] mpol_shared_policy_lookup+0x33/0x90
 [] shmem_get_policy+0x33/0x40
 [] get_vma_policy+0x3a/0x90
 [] show_numa_map+0x163/0x610
 [] ? pid_maps_open+0x20/0x20
 [] ? pagemap_hugetlb_range+0xf0/0xf0
 [] show_pid_numa_map+0x13/0x20
 [] traverse+0xf2/0x230
 [] seq_read+0x34b/0x3e0
 [] ? seq_lseek+0x120/0x120
 [] do_loop_readv_writev+0x5a/0x90
 [] do_readv_writev+0x1c1/0x1e0
 [] ? get_parent_ip+0x11/0x50
 [] vfs_readv+0x35/0x60
 [] sys_preadv+0xc2/0xe0
 [] tracesys+0xe1/0xe6


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread KOSAKI Motohiro
On Tue, Oct 16, 2012 at 9:49 PM, David Rientjes  wrote:
> On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:
>
>> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> > index 0b78fb9..d04a8a5 100644
>> > --- a/mm/mempolicy.c
>> > +++ b/mm/mempolicy.c
>> > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t 
>> > start, compat_ulong_t len,
>> >   *
>> >   * Returns effective policy for a VMA at specified address.
>> >   * Falls back to @task or system default policy, as necessary.
>> > - * Current or other task's task mempolicy and non-shared vma policies
>> > - * are protected by the task's mmap_sem, which must be held for read by
>> > - * the caller.
>> > + * Current or other task's task mempolicy and non-shared vma policies 
>> > must be
>> > + * protected by task_lock(task) by the caller.
>>
>> This is not correct. mmap_sem is needed for protecting vma. task_lock()
>> is needed to close vs exit race only when task != current. In other word,
>> caller must held both mmap_sem and task_lock if task != current.
>
> The comment is specifically addressing non-shared vma policies, you do not
> need to hold mmap_sem to access another thread's mempolicy.

I didn't say old comment is true. I just only your new comment also false.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread David Rientjes
On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:

> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 0b78fb9..d04a8a5 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t 
> > start, compat_ulong_t len,
> >   *
> >   * Returns effective policy for a VMA at specified address.
> >   * Falls back to @task or system default policy, as necessary.
> > - * Current or other task's task mempolicy and non-shared vma policies
> > - * are protected by the task's mmap_sem, which must be held for read by
> > - * the caller.
> > + * Current or other task's task mempolicy and non-shared vma policies must 
> > be
> > + * protected by task_lock(task) by the caller.
> 
> This is not correct. mmap_sem is needed for protecting vma. task_lock()
> is needed to close vs exit race only when task != current. In other word,
> caller must held both mmap_sem and task_lock if task != current.
> 

The comment is specifically addressing non-shared vma policies, you do not 
need to hold mmap_sem to access another thread's mempolicy.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread KOSAKI Motohiro
On Tue, Oct 16, 2012 at 8:31 PM, David Rientjes  wrote:
> When reading /proc/pid/numa_maps, it's possible to return the contents of
> the stack where the mempolicy string should be printed if the policy gets
> freed from beneath us.
>
> This happens because mpol_to_str() may return an error the
> stack-allocated buffer is then printed without ever being stored.
>
> There are two possible error conditions in mpol_to_str():
>
>  - if the buffer allocated is insufficient for the string to be stored,
>and
>
>  - if the mempolicy has an invalid mode.
>
> The first error condition is not triggered in any of the callers to
> mpol_to_str(): at least 50 bytes is always allocated on the stack and this
> is sufficient for the string to be written.  A future patch should convert
> this into BUILD_BUG_ON() since we know the maximum strlen possible, but
> that's not -rc material.
>
> The second error condition is possible if a race occurs in dropping a
> reference to a task's mempolicy causing it to be freed during the read().
> The slab poison value is then used for the mode and mpol_to_str() returns
> -EINVAL.
>
> This race is only possible because get_vma_policy() believes that
> mm->mmap_sem protects task->mempolicy, which isn't true.  The exit path
> does not hold mm->mmap_sem when dropping the reference or setting
> task->mempolicy to NULL: it uses task_lock(task) instead.
>
> Thus, it's required for the caller of a task mempolicy to hold
> task_lock(task) while grabbing the mempolicy and reading it.  Callers with
> a vma policy store their mempolicy earlier and can simply increment the
> reference count so it's guaranteed not to be freed.
>
> Reported-by: Dave Jones 
> Signed-off-by: David Rientjes 
> ---
>  fs/proc/task_mmu.c |7 +--
>  mm/mempolicy.c |5 ++---
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, 
> int is_pid)
> struct vm_area_struct *vma = v;
> struct numa_maps *md = _priv->md;
> struct file *file = vma->vm_file;
> +   struct task_struct *task = proc_priv->task;
> struct mm_struct *mm = vma->vm_mm;
> struct mm_walk walk = {};
> struct mempolicy *pol;
> @@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, 
> int is_pid)
> walk.private = md;
> walk.mm = mm;
>
> -   pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
> +   task_lock(task);
> +   pol = get_vma_policy(task, vma, vma->vm_start);
> mpol_to_str(buffer, sizeof(buffer), pol, 0);
> mpol_cond_put(pol);
> +   task_unlock(task);
>
> seq_printf(m, "%08lx %s", vma->vm_start, buffer);
>
> @@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, 
> int is_pid)
> } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
> seq_printf(m, " heap");
> } else {
> -   pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid);
> +   pid_t tid = vm_is_stack(task, vma, is_pid);
> if (tid != 0) {
> /*
>  * Thread stack in /proc/PID/task/TID/maps or
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0b78fb9..d04a8a5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, 
> compat_ulong_t len,
>   *
>   * Returns effective policy for a VMA at specified address.
>   * Falls back to @task or system default policy, as necessary.
> - * Current or other task's task mempolicy and non-shared vma policies
> - * are protected by the task's mmap_sem, which must be held for read by
> - * the caller.
> + * Current or other task's task mempolicy and non-shared vma policies must be
> + * protected by task_lock(task) by the caller.

This is not correct. mmap_sem is needed for protecting vma. task_lock()
is needed to close vs exit race only when task != current. In other word,
caller must held both mmap_sem and task_lock if task != current.




>   * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
>   * count--added by the get_policy() vm_op, as appropriate--to protect against
>   * freeing by another task.  It is the caller's responsibility to free the
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread KOSAKI Motohiro
On Tue, Oct 16, 2012 at 8:31 PM, David Rientjes rient...@google.com wrote:
 When reading /proc/pid/numa_maps, it's possible to return the contents of
 the stack where the mempolicy string should be printed if the policy gets
 freed from beneath us.

 This happens because mpol_to_str() may return an error the
 stack-allocated buffer is then printed without ever being stored.

 There are two possible error conditions in mpol_to_str():

  - if the buffer allocated is insufficient for the string to be stored,
and

  - if the mempolicy has an invalid mode.

 The first error condition is not triggered in any of the callers to
 mpol_to_str(): at least 50 bytes is always allocated on the stack and this
 is sufficient for the string to be written.  A future patch should convert
 this into BUILD_BUG_ON() since we know the maximum strlen possible, but
 that's not -rc material.

 The second error condition is possible if a race occurs in dropping a
 reference to a task's mempolicy causing it to be freed during the read().
 The slab poison value is then used for the mode and mpol_to_str() returns
 -EINVAL.

 This race is only possible because get_vma_policy() believes that
 mm-mmap_sem protects task-mempolicy, which isn't true.  The exit path
 does not hold mm-mmap_sem when dropping the reference or setting
 task-mempolicy to NULL: it uses task_lock(task) instead.

 Thus, it's required for the caller of a task mempolicy to hold
 task_lock(task) while grabbing the mempolicy and reading it.  Callers with
 a vma policy store their mempolicy earlier and can simply increment the
 reference count so it's guaranteed not to be freed.

 Reported-by: Dave Jones da...@redhat.com
 Signed-off-by: David Rientjes rient...@google.com
 ---
  fs/proc/task_mmu.c |7 +--
  mm/mempolicy.c |5 ++---
  2 files changed, 7 insertions(+), 5 deletions(-)

 diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
 --- a/fs/proc/task_mmu.c
 +++ b/fs/proc/task_mmu.c
 @@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, 
 int is_pid)
 struct vm_area_struct *vma = v;
 struct numa_maps *md = numa_priv-md;
 struct file *file = vma-vm_file;
 +   struct task_struct *task = proc_priv-task;
 struct mm_struct *mm = vma-vm_mm;
 struct mm_walk walk = {};
 struct mempolicy *pol;
 @@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, 
 int is_pid)
 walk.private = md;
 walk.mm = mm;

 -   pol = get_vma_policy(proc_priv-task, vma, vma-vm_start);
 +   task_lock(task);
 +   pol = get_vma_policy(task, vma, vma-vm_start);
 mpol_to_str(buffer, sizeof(buffer), pol, 0);
 mpol_cond_put(pol);
 +   task_unlock(task);

 seq_printf(m, %08lx %s, vma-vm_start, buffer);

 @@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, 
 int is_pid)
 } else if (vma-vm_start = mm-brk  vma-vm_end = mm-start_brk) {
 seq_printf(m,  heap);
 } else {
 -   pid_t tid = vm_is_stack(proc_priv-task, vma, is_pid);
 +   pid_t tid = vm_is_stack(task, vma, is_pid);
 if (tid != 0) {
 /*
  * Thread stack in /proc/PID/task/TID/maps or
 diff --git a/mm/mempolicy.c b/mm/mempolicy.c
 index 0b78fb9..d04a8a5 100644
 --- a/mm/mempolicy.c
 +++ b/mm/mempolicy.c
 @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, 
 compat_ulong_t len,
   *
   * Returns effective policy for a VMA at specified address.
   * Falls back to @task or system default policy, as necessary.
 - * Current or other task's task mempolicy and non-shared vma policies
 - * are protected by the task's mmap_sem, which must be held for read by
 - * the caller.
 + * Current or other task's task mempolicy and non-shared vma policies must be
 + * protected by task_lock(task) by the caller.

This is not correct. mmap_sem is needed for protecting vma. task_lock()
is needed to close vs exit race only when task != current. In other word,
caller must held both mmap_sem and task_lock if task != current.




   * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
   * count--added by the get_policy() vm_op, as appropriate--to protect against
   * freeing by another task.  It is the caller's responsibility to free the
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread David Rientjes
On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:

  diff --git a/mm/mempolicy.c b/mm/mempolicy.c
  index 0b78fb9..d04a8a5 100644
  --- a/mm/mempolicy.c
  +++ b/mm/mempolicy.c
  @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t 
  start, compat_ulong_t len,
*
* Returns effective policy for a VMA at specified address.
* Falls back to @task or system default policy, as necessary.
  - * Current or other task's task mempolicy and non-shared vma policies
  - * are protected by the task's mmap_sem, which must be held for read by
  - * the caller.
  + * Current or other task's task mempolicy and non-shared vma policies must 
  be
  + * protected by task_lock(task) by the caller.
 
 This is not correct. mmap_sem is needed for protecting vma. task_lock()
 is needed to close vs exit race only when task != current. In other word,
 caller must held both mmap_sem and task_lock if task != current.
 

The comment is specifically addressing non-shared vma policies, you do not 
need to hold mmap_sem to access another thread's mempolicy.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread KOSAKI Motohiro
On Tue, Oct 16, 2012 at 9:49 PM, David Rientjes rient...@google.com wrote:
 On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:

  diff --git a/mm/mempolicy.c b/mm/mempolicy.c
  index 0b78fb9..d04a8a5 100644
  --- a/mm/mempolicy.c
  +++ b/mm/mempolicy.c
  @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t 
  start, compat_ulong_t len,
*
* Returns effective policy for a VMA at specified address.
* Falls back to @task or system default policy, as necessary.
  - * Current or other task's task mempolicy and non-shared vma policies
  - * are protected by the task's mmap_sem, which must be held for read by
  - * the caller.
  + * Current or other task's task mempolicy and non-shared vma policies 
  must be
  + * protected by task_lock(task) by the caller.

 This is not correct. mmap_sem is needed for protecting vma. task_lock()
 is needed to close vs exit race only when task != current. In other word,
 caller must held both mmap_sem and task_lock if task != current.

 The comment is specifically addressing non-shared vma policies, you do not
 need to hold mmap_sem to access another thread's mempolicy.

I didn't say old comment is true. I just only your new comment also false.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread Dave Jones
On Tue, Oct 16, 2012 at 05:31:23PM -0700, David Rientjes wrote:

  -pol = get_vma_policy(proc_priv-task, vma, vma-vm_start);
  +task_lock(task);
  +pol = get_vma_policy(task, vma, vma-vm_start);
   mpol_to_str(buffer, sizeof(buffer), pol, 0);
   mpol_cond_put(pol);
  +task_unlock(task);

This seems to cause some fallout for me..

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
3 locks on stack by trinity-child2/8558:
 #0: held: (p-lock){+.+.+.}, instance: 88010c9a00b0, at: 
[8120cd1f] seq_lseek+0x3f/0x120
 #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: 
[81254437] m_start+0xa7/0x190
 #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 88011fc64f30, 
at: [81254f8f] show_numa_map+0x14f/0x610
Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
Call Trace:
 [810ae4ec] __might_sleep+0x14c/0x200
 [816bdf4e] mutex_lock_nested+0x2e/0x50
 [811c43a3] mpol_shared_policy_lookup+0x33/0x90
 [8118d5c3] shmem_get_policy+0x33/0x40
 [811c31fa] get_vma_policy+0x3a/0x90
 [81254fa3] show_numa_map+0x163/0x610
 [81255b10] ? pid_maps_open+0x20/0x20
 [81255980] ? pagemap_hugetlb_range+0xf0/0xf0
 [81255483] show_pid_numa_map+0x13/0x20
 [8120c902] traverse+0xf2/0x230
 [8120cd8b] seq_lseek+0xab/0x120
 [811e6c0b] sys_lseek+0x7b/0xb0
 [816ca088] tracesys+0xe1/0xe6


same problem, different syscall..


BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 21996, name: trinity-child3
3 locks on stack by trinity-child3/21996:
 #0: held: (p-lock){+.+.+.}, instance: 88008d712c08, at: 
[8120ce3d] seq_read+0x3d/0x3e0
 #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: 
[81254437] m_start+0xa7/0x190
 #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 88011fc64f30, 
at: [81254f8f] show_numa_map+0x14f/0x610
Pid: 21996, comm: trinity-child3 Not tainted 3.7.0-rc1+ #32
Call Trace:
 [810ae4ec] __might_sleep+0x14c/0x200
 [816bdf4e] mutex_lock_nested+0x2e/0x50
 [811c43a3] mpol_shared_policy_lookup+0x33/0x90
 [8118d5c3] shmem_get_policy+0x33/0x40
 [811c31fa] get_vma_policy+0x3a/0x90
 [81254fa3] show_numa_map+0x163/0x610
 [81255b10] ? pid_maps_open+0x20/0x20
 [81255980] ? pagemap_hugetlb_range+0xf0/0xf0
 [81255483] show_pid_numa_map+0x13/0x20
 [8120c902] traverse+0xf2/0x230
 [8120d14b] seq_read+0x34b/0x3e0
 [8120ce00] ? seq_lseek+0x120/0x120
 [811e751a] do_loop_readv_writev+0x5a/0x90
 [811e7851] do_readv_writev+0x1c1/0x1e0
 [810b0de1] ? get_parent_ip+0x11/0x50
 [811e7905] vfs_readv+0x35/0x60
 [811e7b72] sys_preadv+0xc2/0xe0
 [816ca088] tracesys+0xe1/0xe6


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread David Rientjes
On Wed, 17 Oct 2012, Dave Jones wrote:

 BUG: sleeping function called from invalid context at kernel/mutex.c:269
 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
 3 locks on stack by trinity-child2/8558:
  #0: held: (p-lock){+.+.+.}, instance: 88010c9a00b0, at: 
 [8120cd1f] seq_lseek+0x3f/0x120
  #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: 
 [81254437] m_start+0xa7/0x190
  #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 
 88011fc64f30, at: [81254f8f] show_numa_map+0x14f/0x610
 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
 Call Trace:
  [810ae4ec] __might_sleep+0x14c/0x200
  [816bdf4e] mutex_lock_nested+0x2e/0x50
  [811c43a3] mpol_shared_policy_lookup+0x33/0x90
  [8118d5c3] shmem_get_policy+0x33/0x40
  [811c31fa] get_vma_policy+0x3a/0x90
  [81254fa3] show_numa_map+0x163/0x610
  [81255b10] ? pid_maps_open+0x20/0x20
  [81255980] ? pagemap_hugetlb_range+0xf0/0xf0
  [81255483] show_pid_numa_map+0x13/0x20
  [8120c902] traverse+0xf2/0x230
  [8120cd8b] seq_lseek+0xab/0x120
  [811e6c0b] sys_lseek+0x7b/0xb0
  [816ca088] tracesys+0xe1/0xe6
 

Hmm, looks like we need to change the refcount semantics entirely.  We'll 
need to make get_vma_policy() always take a reference and then drop it 
accordingly.  This work sif get_vma_policy() can grab a reference while 
holding task_lock() for the task policy fallback case.

Comments on this approach?
---
 fs/proc/task_mmu.c |4 +---
 include/linux/mm.h |3 +--
 mm/hugetlb.c   |4 ++--
 mm/mempolicy.c |   41 ++---
 4 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, 
int is_pid)
walk.private = md;
walk.mm = mm;
 
-   task_lock(task);
pol = get_vma_policy(task, vma, vma-vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
-   mpol_cond_put(pol);
-   task_unlock(task);
+   __mpol_put(pol);
 
seq_printf(m, %08lx %s, vma-vm_start, buffer);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -216,8 +216,7 @@ struct vm_operations_struct {
 * get_policy() op must add reference [mpol_get()] to any policy at
 * (vma,addr) marked as MPOL_SHARED.  The shared policy infrastructure
 * in mm/mempolicy.c will do this automatically.
-* get_policy() must NOT add a ref if the policy at (vma,addr) is not
-* marked as MPOL_SHARED. vma policies are protected by the mmap_sem.
+* vma policies are protected by the mmap_sem.
 * If no [shared/vma] mempolicy exists at the addr, get_policy() op
 * must return NULL--i.e., do not fallback to task or system default
 * policy.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -568,13 +568,13 @@ retry_cpuset:
}
}
 
-   mpol_cond_put(mpol);
+   __mpol_put(mpol);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie)  !page))
goto retry_cpuset;
return page;
 
 err:
-   mpol_cond_put(mpol);
+   __mpol_put(mpol);
return NULL;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1536,39 +1536,41 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, 
compat_ulong_t len,
  *
  * Returns effective policy for a VMA at specified address.
  * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task.  It is the caller's responsibility to free the
- * extra reference for shared policies.
+ * Increments the reference count of the returned mempolicy, it is the caller's
+ * responsibility to decrement with __mpol_put().
+ * Requires vma-vm_mm-mmap_sem to be held for vma policies and takes
+ * task_lock(task) for task policy fallback.
  */
 struct mempolicy *get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
 {
-   struct mempolicy *pol = task-mempolicy;
+   struct mempolicy *pol;
+
+   task_lock(task);
+   pol = task-mempolicy;
+   mpol_get(pol);
+   task_unlock(task);
 
if (vma) {
if (vma-vm_ops  vma-vm_ops-get_policy) {
struct mempolicy *vpol = vma-vm_ops-get_policy(vma,
addr);
-   

Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-16 Thread Kamezawa Hiroyuki

(2012/10/17 14:24), David Rientjes wrote:

On Wed, 17 Oct 2012, Dave Jones wrote:


BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
3 locks on stack by trinity-child2/8558:
  #0: held: (p-lock){+.+.+.}, instance: 88010c9a00b0, at: 
[8120cd1f] seq_lseek+0x3f/0x120
  #1: held: (mm-mmap_sem){++}, instance: 88013956f7c8, at: 
[81254437] m_start+0xa7/0x190
  #2: held: ((p-alloc_lock)-rlock){+.+...}, instance: 88011fc64f30, at: 
[81254f8f] show_numa_map+0x14f/0x610
Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
Call Trace:
  [810ae4ec] __might_sleep+0x14c/0x200
  [816bdf4e] mutex_lock_nested+0x2e/0x50
  [811c43a3] mpol_shared_policy_lookup+0x33/0x90
  [8118d5c3] shmem_get_policy+0x33/0x40
  [811c31fa] get_vma_policy+0x3a/0x90
  [81254fa3] show_numa_map+0x163/0x610
  [81255b10] ? pid_maps_open+0x20/0x20
  [81255980] ? pagemap_hugetlb_range+0xf0/0xf0
  [81255483] show_pid_numa_map+0x13/0x20
  [8120c902] traverse+0xf2/0x230
  [8120cd8b] seq_lseek+0xab/0x120
  [811e6c0b] sys_lseek+0x7b/0xb0
  [816ca088] tracesys+0xe1/0xe6



Hmm, looks like we need to change the refcount semantics entirely.  We'll
need to make get_vma_policy() always take a reference and then drop it
accordingly.  This work sif get_vma_policy() can grab a reference while
holding task_lock() for the task policy fallback case.

Comments on this approach?


I think this refcounting is better than using task_lock().

Thanks,
-Kame

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/