Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Mel Gorman
On Thu, Aug 29, 2013 at 01:14:19PM +0200, Peter Zijlstra wrote:
> > I intended to say nr_node_ids, the same size as buffers such as the
> > task_numa_buffers. If we ever return a nid > nr_node_ids here then
> > task_numa_fault would corrupt memory. However, it should be possible for
> > node_weight to exceed nr_node_ids except maybe during node hot-remove so
> > it's not the problem.
> 
> The nodemask situation seems somewhat more confused than the cpumask
> case; how would we ever return a nid > nr_node_ids? Corrupt nodemask?
> 

ARGH. I meant impossible. The exact bloody opposite of what I wrote.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2013 at 11:56:57AM +0100, Mel Gorman wrote:
> > I thought it was, we crashed somewhere suspiciously close, but no. You
> > need shared mpols for this to actually trigger and the NUMA stuff
> > doesn't use that.
> > 
> 
> Ah, so this is a red herring?

Yeah, but I still think it's an actual bug. It seems the easiest way to
trigger this would be to:

 - create a task that constantly allocates pages
 - give said task an MPOL_INTERLEAVE task policy
 - put said task into a cpuset
 - using a different task (your shell, for example), flip the cpuset's
   mems_allowed back and forth.

This would have the shell task constantly rebind (in two steps) our
allocating task's INTERLEAVE policy.

> > I used whatever nodemask.h did to detect end-of-bitmap and they use
> > MAX_NUMNODES. See __next_node() and for_each_node() and the like.
> > 
> 
> The check does prevent us going off the end of the bitmap but does not
> necessarily return an online node.

Right, but it's guaranteed to return a 'valid' node. I don't think it
returning an offline node is a problem; we'll find it empty and fail the
page allocation.

> > MAX_NUMNODES doesn't assume contiguous numbers since it's the actual size
> > of the bitmap; nr_online_nodes would, however.
> > 
> 
> I intended to say nr_node_ids, the same size as buffers such as the
> task_numa_buffers. If we ever return a nid > nr_node_ids here then
> task_numa_fault would corrupt memory. However, it should be possible for
> node_weight to exceed nr_node_ids except maybe during node hot-remove so
> it's not the problem.

The nodemask situation seems somewhat more confused than the cpumask
case; how would we ever return a nid > nr_node_ids? Corrupt nodemask?

In the cpumask case we use the runtime limit nr_cpu_ids for all bitmap
operations; arguably we should make the nodemask stuff do the same.

Fewer bits to iterate is always good; a MAX_NUMNODES=64 kernel
(x86_64 defconfig) will still iterate all 64 bits, even though it's
unlikely to have more than 1, let alone more than 8, nodes.

> > So I explicitly didn't use the node_isset() test because that's more
> > likely to trigger than the nid >= MAX_NUMNODES test. It's fine to return
> > a node that isn't actually part of the mask anymore -- a race is a race
> > anyway.
> 
> Yeah, and as long as it's < nr_node_ids it should be OK within the task
> numa fault handling as well.

Right, I'm just a tad confused about how we could ever get a nid >=
nr_node_ids except from a prior bug (a corrupted nodemask).


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Mel Gorman
On Thu, Aug 29, 2013 at 11:43:42AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 29, 2013 at 10:28:29AM +0100, Mel Gorman wrote:
> > On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> > > > So I think this patch is broken (still).
> > 
> > I am assuming the lack of complaints is because it is not a heavily executed
> > path. I expect that you (and Rik) are hitting this as part of automatic
> > NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
> > is the reproduction case.
> 
> I thought it was, we crashed somewhere suspiciously close, but no. You
> need shared mpols for this to actually trigger and the NUMA stuff
> doesn't use that.
> 

Ah, so this is a red herring?

> > > + if (unlikely((unsigned)nid >= MAX_NUMNODES))
> > > + goto again;
> > > +
> > 
> > MAX_NUMNODES is unrelated to anything except that it might prevent a crash
> > and even then nr_online_nodes is probably what you wanted and even that
> > assumes the NUMA node numbering is contiguous. 
> 
> I used whatever nodemask.h did to detect end-of-bitmap and they use
> > MAX_NUMNODES. See __next_node() and for_each_node() and the like.
> 

The check does prevent us going off the end of the bitmap but does not
necessarily return an online node.

> MAX_NUMNODES doesn't assume contiguous numbers since it's the actual size
> of the bitmap; nr_online_nodes would, however.
> 

I intended to say nr_node_ids, the same size as buffers such as the
task_numa_buffers. If we ever return a nid > nr_node_ids here then
task_numa_fault would corrupt memory. However, it should be possible for
node_weight to exceed nr_node_ids except maybe during node hot-remove so
it's not the problem.

> > The real concern is whether
> > the updated mask is an allowed target for the updated memory policy. If
> > it's not then "nid" can be pointing off the deep end somewhere.  With this
> > conversion to a for loop there is a race after you check nnodes where target
> > gets set to 0 and then return a nid of -1 which I suppose will just blow
> > up differently but it's fixable.
> 
> But but but, I did c <= target, which will match when target == 0 so
> you'll get at least a single iteration and nid will be set.
> 

True.

> > This? Untested. Fixes implicit types while it's there. Note the use of
> > first_node() and (i < target) to guarantee nid gets set and that the first
> > potential node is still used as an interleave target.
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 7431001..ae880c3 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
> >  }
> >  
> >  /* Do static interleaving for a VMA with known offset. */
> > -static unsigned offset_il_node(struct mempolicy *pol,
> > +static unsigned int offset_il_node(struct mempolicy *pol,
> > struct vm_area_struct *vma, unsigned long off)
> >  {
> > -   unsigned nnodes = nodes_weight(pol->v.nodes);
> > -   unsigned target;
> > -   int c;
> > -   int nid = -1;
> > +   unsigned int nr_nodes, target;
> > +   int i, nid;
> >  
> > -   if (!nnodes)
> > +again:
> > +   nr_nodes = nodes_weight(pol->v.nodes);
> > +   if (!nr_nodes)
> > return numa_node_id();
> > -   target = (unsigned int)off % nnodes;
> > -   c = 0;
> > -   do {
> > +   target = (unsigned int)off % nr_nodes;
> > +   for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
> > nid = next_node(nid, pol->v.nodes);
> > -   c++;
> > -   } while (c <= target);
> > +
> > +   /* Policy nodemask can potentially update in parallel */
> > +   if (unlikely(!node_isset(nid, pol->v.nodes)))
> > +   goto again;
> > +
> > return nid;
> >  }
> 
> So I explicitly didn't use the node_isset() test because that's more
> likely to trigger than the nid >= MAX_NUMNODES test. It's fine to return
> a node that isn't actually part of the mask anymore -- a race is a race
> anyway.

Yeah, and as long as it's < nr_node_ids it should be OK within the task
numa fault handling as well.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2013 at 11:43:42AM +0200, Peter Zijlstra wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 7431001..ae880c3 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
> >  }
> >  
> >  /* Do static interleaving for a VMA with known offset. */
> > -static unsigned offset_il_node(struct mempolicy *pol,
> > +static unsigned int offset_il_node(struct mempolicy *pol,
> > struct vm_area_struct *vma, unsigned long off)
> >  {
> > -   unsigned nnodes = nodes_weight(pol->v.nodes);
> > -   unsigned target;
> > -   int c;
> > -   int nid = -1;
> > +   unsigned int nr_nodes, target;
> > +   int i, nid;
> >  
> > -   if (!nnodes)
> > +again:
> > +   nr_nodes = nodes_weight(pol->v.nodes);
> > +   if (!nr_nodes)
> > return numa_node_id();
> > -   target = (unsigned int)off % nnodes;
> > -   c = 0;
> > -   do {
> > +   target = (unsigned int)off % nr_nodes;
> > +   for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
> > nid = next_node(nid, pol->v.nodes);
> > -   c++;
> > -   } while (c <= target);
> > +
> > +   /* Policy nodemask can potentially update in parallel */
> > +   if (unlikely(!node_isset(nid, pol->v.nodes)))
> > +   goto again;
> > +
> > return nid;
> >  }
> 
> So I explicitly didn't use the node_isset() test because that's more
> likely to trigger than the nid >= MAX_NUMNODES test. It's fine to return
> a node that isn't actually part of the mask anymore -- a race is a race
> anyway.

Oh, more importantly: if nid does indeed end up being >= MAX_NUMNODES, as
is possible with next_node(), the node_isset() test will be out of bounds
and can crash by itself.


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2013 at 10:28:29AM +0100, Mel Gorman wrote:
> On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> > > So I think this patch is broken (still).
> 
> I am assuming the lack of complaints is because it is not a heavily executed
> path. I expect that you (and Rik) are hitting this as part of automatic
> NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
> is the reproduction case.

I thought it was, we crashed somewhere suspiciously close, but no. You
need shared mpols for this to actually trigger and the NUMA stuff
doesn't use that.

> > +   if (unlikely((unsigned)nid >= MAX_NUMNODES))
> > +   goto again;
> > +
> 
> MAX_NUMNODES is unrelated to anything except that it might prevent a crash
> and even then nr_online_nodes is probably what you wanted and even that
> assumes the NUMA node numbering is contiguous. 

I used whatever nodemask.h did to detect end-of-bitmap and they use
MAX_NUMNODES. See __next_node() and for_each_node() and the like.

MAX_NUMNODES doesn't assume contiguous numbers since it's the actual size
of the bitmap; nr_online_nodes would, however.

> The real concern is whether
> the updated mask is an allowed target for the updated memory policy. If
> it's not then "nid" can be pointing off the deep end somewhere.  With this
> conversion to a for loop there is a race after you check nnodes where target
> gets set to 0 and then return a nid of -1 which I suppose will just blow
> up differently but it's fixable.

But but but, I did c <= target, which will match when target == 0 so
you'll get at least a single iteration and nid will be set.

> This? Untested. Fixes implicit types while it's there. Note the use of
> first_node() and (i < target) to guarantee nid gets set and that the first
> potential node is still used as an interleave target.
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 7431001..ae880c3 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
>  }
>  
>  /* Do static interleaving for a VMA with known offset. */
> -static unsigned offset_il_node(struct mempolicy *pol,
> +static unsigned int offset_il_node(struct mempolicy *pol,
>   struct vm_area_struct *vma, unsigned long off)
>  {
> - unsigned nnodes = nodes_weight(pol->v.nodes);
> - unsigned target;
> - int c;
> - int nid = -1;
> + unsigned int nr_nodes, target;
> + int i, nid;
>  
> - if (!nnodes)
> +again:
> + nr_nodes = nodes_weight(pol->v.nodes);
> + if (!nr_nodes)
>   return numa_node_id();
> - target = (unsigned int)off % nnodes;
> - c = 0;
> - do {
> + target = (unsigned int)off % nr_nodes;
> + for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
>   nid = next_node(nid, pol->v.nodes);
> - c++;
> - } while (c <= target);
> +
> + /* Policy nodemask can potentially update in parallel */
> + if (unlikely(!node_isset(nid, pol->v.nodes)))
> + goto again;
> +
>   return nid;
>  }

So I explicitly didn't use the node_isset() test because that's more
likely to trigger than the nid >= MAX_NUMNODES test. It's fine to return
a node that isn't actually part of the mask anymore -- a race is a race
anyway.


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Mel Gorman
On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> > So I think this patch is broken (still).

I am assuming the lack of complaints is because it is not a heavily executed
path. I expect that you (and Rik) are hitting this as part of automatic
NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
is the reproduction case.

> > Suppose we have an
> > INTERLEAVE mempol like 0x3 and change it to 0xc.
> > 
> > Original:   0x3
> > Rebind Step 1:  0xf /* set bits */
> > Rebind Step 2:  0xc /* clear bits */
> > 
> > Now look at what can happen with offset_il_node() when it's run
> > concurrently with step 2:
> > 
> >   nnodes = nodes_weight(pol->v.nodes); /* observes 0xf and returns 4 */
> > 
> >   /* now we clear the actual bits */
> >   
> >   target = (unsigned int)off % nnodes; /* assume target >= 2 */
> >   c = 0;
> >   do {
> > nid = next_node(nid, pol->v.nodes);
> > c++;
> >   } while (c <= target);
> > 
> >   /* here nid := MAX_NUMNODES */
> > 
> > 
> > This nid is then blindly inserted into node_zonelist() which does a
> > NODE_DATA() array access out of bounds and off we go.
> > 
> > This would suggest we put the whole seqcount thing inside
> > offset_il_node().
> 
> Oh bloody grrr. It's not directly related at all, the patch in question
> fixes a cpuset task_struct::mems_allowed problem while the above is a
> mempolicy issue and of course the cpuset and mempolicy code are
> completely bloody different :/
> 
> So I guess the quick and ugly solution is something like the below. 
> 
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1762,19 +1762,21 @@ unsigned slab_node(void)
>  static unsigned offset_il_node(struct mempolicy *pol,
>   struct vm_area_struct *vma, unsigned long off)
>  {
> - unsigned nnodes = nodes_weight(pol->v.nodes);
> - unsigned target;
> - int c;
> - int nid = -1;
> + unsigned nnodes, target;
> + int c, nid;
>  
> +again:
> + nnodes = nodes_weight(pol->v.nodes);
>   if (!nnodes)
>   return numa_node_id();
> +
>   target = (unsigned int)off % nnodes;
> - c = 0;
> - do {
> + for (c = 0, nid = -1; c <= target; c++)
>   nid = next_node(nid, pol->v.nodes);
> - c++;
> - } while (c <= target);
> +
> + if (unlikely((unsigned)nid >= MAX_NUMNODES))
> + goto again;
> +

MAX_NUMNODES is unrelated to anything except that it might prevent a crash
and even then nr_online_nodes is probably what you wanted and even that
assumes the NUMA node numbering is contiguous. The real concern is whether
the updated mask is an allowed target for the updated memory policy. If
it's not then "nid" can be pointing off the deep end somewhere.  With this
conversion to a for loop there is a race after you check nnodes where target
gets set to 0 and then return a nid of -1 which I suppose will just blow
up differently but it's fixable.

This? Untested. Fixes implicit types while it's there. Note the use of
first_node() and (i < target) to guarantee nid gets set and that the first
potential node is still used as an interleave target.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..ae880c3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1755,22 +1755,24 @@ unsigned slab_node(void)
 }
 
 /* Do static interleaving for a VMA with known offset. */
-static unsigned offset_il_node(struct mempolicy *pol,
+static unsigned int offset_il_node(struct mempolicy *pol,
struct vm_area_struct *vma, unsigned long off)
 {
-   unsigned nnodes = nodes_weight(pol->v.nodes);
-   unsigned target;
-   int c;
-   int nid = -1;
+   unsigned int nr_nodes, target;
+   int i, nid;
 
-   if (!nnodes)
+again:
+   nr_nodes = nodes_weight(pol->v.nodes);
+   if (!nr_nodes)
return numa_node_id();
-   target = (unsigned int)off % nnodes;
-   c = 0;
-   do {
+   target = (unsigned int)off % nr_nodes;
+   for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
nid = next_node(nid, pol->v.nodes);
-   c++;
-   } while (c <= target);
+
+   /* Policy nodemask can potentially update in parallel */
+   if (unlikely(!node_isset(nid, pol->v.nodes)))
+   goto again;
+
return nid;
 }
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Mel Gorman
On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
 On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
  So I think this patch is broken (still).

I am assuming the lack of complaints is that it is not a heavily executed
path. I expect that you (and Rik) are hitting this as part of automatic
NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
is the reproduction case.

  Suppose we have an
  INTERLEAVE mempol like 0x3 and change it to 0xc.
  
  Original:   0x3
  Rebind Step 1:  0xf /* set bits */
  Rebind Step 2:  0xc /* clear bits */
  
  Now look at what can happen with offset_il_node() when its ran
  concurrently with step 2:
  
nnodes = nodes_weight(pol-v.nodes); /* observes 0xf and returns 4 */
  
/* now we clear the actual bits */

target = (unsigned int)off % nnodes; /* assume target = 2 */
c = 0;
do {
  nid = next_node(nid, pol-v.nodes);
  c++;
} while (c = target);
  
/* here nid := MAX_NUMNODES */
  
  
  This nid is then blindly inserted into node_zonelist() which does an
  NODE_DATA() array access out of bounds and off we go.
  
  This would suggest we put the whole seqcount thing inside
  offset_il_node().
 
 Oh bloody grrr. Its not directly related at all, the patch in question
 fixes a cpuset task_struct::mems_allowed problem while the above is a
 mempolicy issue and of course the cpuset and mempolicy code are
 completely bloody different :/
 
 So I guess the quick and ugly solution is something like the below. 
 
 --- a/mm/mempolicy.c
 +++ b/mm/mempolicy.c
 @@ -1762,19 +1762,21 @@ unsigned slab_node(void)
  static unsigned offset_il_node(struct mempolicy *pol,
   struct vm_area_struct *vma, unsigned long off)
  {
 - unsigned nnodes = nodes_weight(pol-v.nodes);
 - unsigned target;
 - int c;
 - int nid = -1;
 + unsigned nnodes, target;
 + int c, nid;
  
 +again:
 + nnodes = nodes_weight(pol-v.nodes);
   if (!nnodes)
   return numa_node_id();
 +
   target = (unsigned int)off % nnodes;
 - c = 0;
 - do {
 + for (c = 0, nid = -1; c = target; c++)
   nid = next_node(nid, pol-v.nodes);
 - c++;
 - } while (c = target);
 +
 + if (unlikely((unsigned)nid = MAX_NUMNODES))
 + goto again;
 +

MAX_NUMNODES is unrelated to anything except that it might prevent a crash
and even then nr_online_nodes is probably what you wanted and even that
assumes the NUMA node numbering is contiguous. The real concern is whether
the updated mask is an allowed target for the updated memory policy. If
it's not then nid can be pointing off the deep end somewhere.  With this
conversion to a for loop there is race after you check nnodes where target
gets set to 0 and then return a nid of -1 which I suppose will just blow
up differently but it's fixable.

This? Untested. Fixes implicit types while it's there. Note the use of
first node and (c  target) to guarantee nid gets set and that the first
potential node is still used as an interleave target.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..ae880c3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1755,22 +1755,24 @@ unsigned slab_node(void)
 }
 
 /* Do static interleaving for a VMA with known offset. */
-static unsigned offset_il_node(struct mempolicy *pol,
+static unsigned int offset_il_node(struct mempolicy *pol,
struct vm_area_struct *vma, unsigned long off)
 {
-   unsigned nnodes = nodes_weight(pol-v.nodes);
-   unsigned target;
-   int c;
-   int nid = -1;
+   unsigned int nr_nodes, target;
+   int i, nid;
 
-   if (!nnodes)
+again:
+   nr_nodes = nodes_weight(pol-v.nodes);
+   if (!nr_nodes)
return numa_node_id();
-   target = (unsigned int)off % nnodes;
-   c = 0;
-   do {
+   target = (unsigned int)off % nr_nodes;
+   for (i = 0, nid = first_node(pol-v.nodes); i  target; i++)
nid = next_node(nid, pol-v.nodes);
-   c++;
-   } while (c = target);
+
+   /* Policy nodemask can potentially update in parallel */
+   if (unlikely(!node_isset(nid, pol-v.nodes)))
+   goto again;
+
return nid;
 }
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2013 at 10:28:29AM +0100, Mel Gorman wrote:
 On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
  On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
   So I think this patch is broken (still).
 
 I am assuming the lack of complaints is that it is not a heavily executed
 path. I expect that you (and Rik) are hitting this as part of automatic
 NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
 is the reproduction case.

I thought it was, we crashed somewhere suspiciously close, but no. You
need shared mpols for this to actually trigger and the NUMA stuff
doesn't use that.

  +   if (unlikely((unsigned)nid = MAX_NUMNODES))
  +   goto again;
  +
 
 MAX_NUMNODES is unrelated to anything except that it might prevent a crash
 and even then nr_online_nodes is probably what you wanted and even that
 assumes the NUMA node numbering is contiguous. 

I used whatever nodemask.h did to detect end-of-bitmap and they use
MAX_NUMNODES. See __next_node() and for_each_node() like.

MAX_NUMNODES doesn't assume contiguous numbers since its the actual size
of the bitmap, nr_online_nodes would hoever.

 The real concern is whether
 the updated mask is an allowed target for the updated memory policy. If
 it's not then nid can be pointing off the deep end somewhere.  With this
 conversion to a for loop there is race after you check nnodes where target
 gets set to 0 and then return a nid of -1 which I suppose will just blow
 up differently but it's fixable.

But but but, I did i = target, which will match when target == 0 so
you'll get at least a single iteration and nid will be set.

 This? Untested. Fixes implicit types while it's there. Note the use of
 first node and (c  target) to guarantee nid gets set and that the first
 potential node is still used as an interleave target.
 
 diff --git a/mm/mempolicy.c b/mm/mempolicy.c
 index 7431001..ae880c3 100644
 --- a/mm/mempolicy.c
 +++ b/mm/mempolicy.c
 @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
  }
  
  /* Do static interleaving for a VMA with known offset. */
 -static unsigned offset_il_node(struct mempolicy *pol,
 +static unsigned int offset_il_node(struct mempolicy *pol,
   struct vm_area_struct *vma, unsigned long off)
  {
 - unsigned nnodes = nodes_weight(pol-v.nodes);
 - unsigned target;
 - int c;
 - int nid = -1;
 + unsigned int nr_nodes, target;
 + int i, nid;
  
 - if (!nnodes)
 +again:
 + nr_nodes = nodes_weight(pol-v.nodes);
 + if (!nr_nodes)
   return numa_node_id();
 - target = (unsigned int)off % nnodes;
 - c = 0;
 - do {
 + target = (unsigned int)off % nr_nodes;
 + for (i = 0, nid = first_node(pol-v.nodes); i  target; i++)
   nid = next_node(nid, pol-v.nodes);
 - c++;
 - } while (c = target);
 +
 + /* Policy nodemask can potentially update in parallel */
 + if (unlikely(!node_isset(nid, pol-v.nodes)))
 + goto again;
 +
   return nid;
  }

So I explicitly didn't use the node_isset() test because that's more
likely to trigger than the nid = MAX_NUMNODES test. Its fine to return
a node that isn't actually part of the mask anymore -- a race is a race
anyway.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2013 at 11:43:42AM +0200, Peter Zijlstra wrote:
  diff --git a/mm/mempolicy.c b/mm/mempolicy.c
  index 7431001..ae880c3 100644
  --- a/mm/mempolicy.c
  +++ b/mm/mempolicy.c
  @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
   }
   
   /* Do static interleaving for a VMA with known offset. */
  -static unsigned offset_il_node(struct mempolicy *pol,
  +static unsigned int offset_il_node(struct mempolicy *pol,
  struct vm_area_struct *vma, unsigned long off)
   {
  -   unsigned nnodes = nodes_weight(pol-v.nodes);
  -   unsigned target;
  -   int c;
  -   int nid = -1;
  +   unsigned int nr_nodes, target;
  +   int i, nid;
   
  -   if (!nnodes)
  +again:
  +   nr_nodes = nodes_weight(pol-v.nodes);
  +   if (!nr_nodes)
  return numa_node_id();
  -   target = (unsigned int)off % nnodes;
  -   c = 0;
  -   do {
  +   target = (unsigned int)off % nr_nodes;
  +   for (i = 0, nid = first_node(pol-v.nodes); i  target; i++)
  nid = next_node(nid, pol-v.nodes);
  -   c++;
  -   } while (c = target);
  +
  +   /* Policy nodemask can potentially update in parallel */
  +   if (unlikely(!node_isset(nid, pol-v.nodes)))
  +   goto again;
  +
  return nid;
   }
 
 So I explicitly didn't use the node_isset() test because that's more
 likely to trigger than the nid = MAX_NUMNODES test. Its fine to return
 a node that isn't actually part of the mask anymore -- a race is a race
 anyway.

Oh more importantly, if nid does indeed end up being = MAX_NUMNODES as
is possible with next_node() the node_isset() test will be out-of-bounds
and can crash itself.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Mel Gorman
On Thu, Aug 29, 2013 at 11:43:42AM +0200, Peter Zijlstra wrote:
 On Thu, Aug 29, 2013 at 10:28:29AM +0100, Mel Gorman wrote:
  On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
   On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
So I think this patch is broken (still).
  
  I am assuming the lack of complaints is that it is not a heavily executed
  path. I expect that you (and Rik) are hitting this as part of automatic
  NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
  is the reproduction case.
 
 I thought it was, we crashed somewhere suspiciously close, but no. You
 need shared mpols for this to actually trigger and the NUMA stuff
 doesn't use that.
 

Ah, so this is a red herring?

   + if (unlikely((unsigned)nid = MAX_NUMNODES))
   + goto again;
   +
  
  MAX_NUMNODES is unrelated to anything except that it might prevent a crash
  and even then nr_online_nodes is probably what you wanted and even that
  assumes the NUMA node numbering is contiguous. 
 
 I used whatever nodemask.h did to detect end-of-bitmap and they use
 MAX_NUMNODES. See __next_node() and for_each_node() like.
 

The check does prevent us going off the end of the bitmap but does not
necessarily return an online node.

 MAX_NUMNODES doesn't assume contiguous numbers since its the actual size
 of the bitmap, nr_online_nodes would hoever.
 

I intended to say nr_node_ids, the same size as buffers such as the
task_numa_buffers. If we ever return a nid  nr_node_ids here then
task_numa_fault would corrupt memory. However, it should be possible for
node_weight to exceed nr_node_ids except maybe during node hot-remove so
it's not the problem.

  The real concern is whether
  the updated mask is an allowed target for the updated memory policy. If
  it's not then nid can be pointing off the deep end somewhere.  With this
  conversion to a for loop there is race after you check nnodes where target
  gets set to 0 and then return a nid of -1 which I suppose will just blow
  up differently but it's fixable.
 
 But but but, I did i = target, which will match when target == 0 so
 you'll get at least a single iteration and nid will be set.
 

True.

  This? Untested. Fixes implicit types while it's there. Note the use of
  first node and (c  target) to guarantee nid gets set and that the first
  potential node is still used as an interleave target.
  
  diff --git a/mm/mempolicy.c b/mm/mempolicy.c
  index 7431001..ae880c3 100644
  --- a/mm/mempolicy.c
  +++ b/mm/mempolicy.c
  @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
   }
   
   /* Do static interleaving for a VMA with known offset. */
  -static unsigned offset_il_node(struct mempolicy *pol,
  +static unsigned int offset_il_node(struct mempolicy *pol,
  struct vm_area_struct *vma, unsigned long off)
   {
  -   unsigned nnodes = nodes_weight(pol-v.nodes);
  -   unsigned target;
  -   int c;
  -   int nid = -1;
  +   unsigned int nr_nodes, target;
  +   int i, nid;
   
  -   if (!nnodes)
  +again:
  +   nr_nodes = nodes_weight(pol-v.nodes);
  +   if (!nr_nodes)
  return numa_node_id();
  -   target = (unsigned int)off % nnodes;
  -   c = 0;
  -   do {
  +   target = (unsigned int)off % nr_nodes;
  +   for (i = 0, nid = first_node(pol-v.nodes); i  target; i++)
  nid = next_node(nid, pol-v.nodes);
  -   c++;
  -   } while (c = target);
  +
  +   /* Policy nodemask can potentially update in parallel */
  +   if (unlikely(!node_isset(nid, pol-v.nodes)))
  +   goto again;
  +
  return nid;
   }
 
 So I explicitly didn't use the node_isset() test because that's more
 likely to trigger than the nid = MAX_NUMNODES test. Its fine to return
 a node that isn't actually part of the mask anymore -- a race is a race
 anyway.

Yeah and as long as it's  nr_node_ids it should be ok within the task
numa fault handling as well.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2013 at 11:56:57AM +0100, Mel Gorman wrote:
> > I thought it was, we crashed somewhere suspiciously close, but no. You
> > need shared mpols for this to actually trigger and the NUMA stuff
> > doesn't use that.
> > 
> 
> Ah, so this is a red herring?

Yeah, but I still think its an actual bug. It seems the easiest way to
trigger this would be to:

 create a task that constantly allocates pages
 have said task have an MPOL_INTERLEAVE task policy
 put said task into a cpuset
 using a different task (your shell for example) flip the cpuset's
   mems_allowed back and forth.

This would have the shell task constantly rebind (in two steps) our
allocating task's INTERLEAVE policy.

> > I used whatever nodemask.h did to detect end-of-bitmap and they use
> > MAX_NUMNODES. See __next_node() and for_each_node() like.
> > 
> 
> The check does prevent us going off the end of the bitmap but does not
> necessarily return an online node.

Right, but its guaranteed to return a 'valid' node. I don't think it
returning an offline node is a problem, we'll find it empty and fail the
page allocation.

> > MAX_NUMNODES doesn't assume contiguous numbers since its the actual size
> > of the bitmap, nr_online_nodes would however.
> > 
> 
> I intended to say nr_node_ids, the same size as buffers such as the
> task_numa_buffers. If we ever return a nid > nr_node_ids here then
> task_numa_fault would corrupt memory. However, it should be possible for
> node_weight to exceed nr_node_ids except maybe during node hot-remove so
> it's not the problem.

The nodemask situation seems somewhat more confused than the cpumask
case; how would we ever return a nid > nr_node_ids? Corrupt nodemask?

In the cpumask case we use the runtime limit nr_cpu_ids for all bitmap
operations, arguably we should make the nodemask stuff do the same.

Less bits to iterate is always good; a MAX_NUMNODES=64
(x86_64-defconfig) will still iterate all 64 bits, even though its
unlikely to have more than 1 let alone more than 8 nodes.

> > So I explicitly didn't use the node_isset() test because that's more
> > likely to trigger than the nid >= MAX_NUMNODES test. Its fine to return
> > a node that isn't actually part of the mask anymore -- a race is a race
> > anyway.
> 
> Yeah and as long as it's < nr_node_ids it should be ok within the task
> numa fault handling as well.

Right, I'm just a tad confused on how we could ever get a nid >=
nr_node_ids except from a prior bug (corrupted nodemask).


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-25 Thread Rik van Riel
On 08/23/2013 02:15 PM, Peter Zijlstra wrote:

> So I guess the quick and ugly solution is something like the below. 

This still crashes :)

> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1762,19 +1762,21 @@ unsigned slab_node(void)
>  static unsigned offset_il_node(struct mempolicy *pol,
>   struct vm_area_struct *vma, unsigned long off)
>  {
> - unsigned nnodes = nodes_weight(pol->v.nodes);
> - unsigned target;
> - int c;
> - int nid = -1;
> + unsigned nnodes, target;
> + int c, nid;
>  
> +again:
> + nnodes = nodes_weight(pol->v.nodes);
>   if (!nnodes)
>   return numa_node_id();
> +
>   target = (unsigned int)off % nnodes;
> - c = 0;
> - do {
> + for (c = 0, nid = -1; c <= target; c++)
>   nid = next_node(nid, pol->v.nodes);
> - c++;
> - } while (c <= target);
> +
> + if (unlikely((unsigned)nid >= MAX_NUMNODES))
> + goto again;

I'll go kick off a compile that replaces the conditional above with:

if (unlikely(!node_online(nid)))
goto again;

>   return nid;
>  }


-- 
All rights reversed


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-23 Thread Peter Zijlstra
On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> So I think this patch is broken (still). Suppose we have an
> INTERLEAVE mempol like 0x3 and change it to 0xc.
> 
> Original: 0x3
> Rebind Step 1:0xf /* set bits */
> Rebind Step 2:0xc /* clear bits */
> 
> Now look at what can happen with offset_il_node() when its ran
> concurrently with step 2:
> 
>   nnodes = nodes_weight(pol->v.nodes); /* observes 0xf and returns 4 */
> 
>   /* now we clear the actual bits */
>   
>   target = (unsigned int)off % nnodes; /* assume target >= 2 */
>   c = 0;
>   do {
>   nid = next_node(nid, pol->v.nodes);
>   c++;
>   } while (c <= target);
> 
>   /* here nid := MAX_NUMNODES */
> 
> 
> This nid is then blindly inserted into node_zonelist() which does an
> NODE_DATA() array access out of bounds and off we go.
> 
> This would suggest we put the whole seqcount thing inside
> offset_il_node().

Oh bloody grrr. Its not directly related at all, the patch in question
fixes a cpuset task_struct::mems_allowed problem while the above is a
mempolicy issue and of course the cpuset and mempolicy code are
completely bloody different :/

So I guess the quick and ugly solution is something like the below. 

--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1762,19 +1762,21 @@ unsigned slab_node(void)
 static unsigned offset_il_node(struct mempolicy *pol,
struct vm_area_struct *vma, unsigned long off)
 {
-   unsigned nnodes = nodes_weight(pol->v.nodes);
-   unsigned target;
-   int c;
-   int nid = -1;
+   unsigned nnodes, target;
+   int c, nid;
 
+again:
+   nnodes = nodes_weight(pol->v.nodes);
if (!nnodes)
return numa_node_id();
+
target = (unsigned int)off % nnodes;
-   c = 0;
-   do {
+   for (c = 0, nid = -1; c <= target; c++)
nid = next_node(nid, pol->v.nodes);
-   c++;
-   } while (c <= target);
+
+   if (unlikely((unsigned)nid >= MAX_NUMNODES))
+   goto again;
+
return nid;
 }
 


Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3

2013-08-23 Thread Peter Zijlstra
On Wed, Mar 07, 2012 at 06:08:52PM +, Mel Gorman wrote:
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 06b145f..013d981 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1843,18 +1843,24 @@ struct page *
>  alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>   unsigned long addr, int node)
>  {
> - struct mempolicy *pol = get_vma_policy(current, vma, addr);
> + struct mempolicy *pol;
>   struct zonelist *zl;
>   struct page *page;
> + unsigned int cpuset_mems_cookie;
> +
> +retry_cpuset:
> + pol = get_vma_policy(current, vma, addr);
> + cpuset_mems_cookie = get_mems_allowed();
>  
> - get_mems_allowed();
>   if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
>   unsigned nid;
>  
>   nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
>   mpol_cond_put(pol);
>   page = alloc_page_interleave(gfp, order, nid);
> - put_mems_allowed();
> + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> + goto retry_cpuset;
> +
>   return page;
>   }
>   zl = policy_zonelist(gfp, pol, node);

So I think this patch is broken (still). Suppose we have an
INTERLEAVE mempol like 0x3 and change it to 0xc.

Original:   0x3
Rebind Step 1:  0xf /* set bits */
Rebind Step 2:  0xc /* clear bits */

Now look at what can happen with offset_il_node() when its ran
concurrently with step 2:

  nnodes = nodes_weight(pol->v.nodes); /* observes 0xf and returns 4 */

  /* now we clear the actual bits */
  
  target = (unsigned int)off % nnodes; /* assume target >= 2 */
  c = 0;
  do {
nid = next_node(nid, pol->v.nodes);
c++;
  } while (c <= target);

  /* here nid := MAX_NUMNODES */


This nid is then blindly inserted into node_zonelist() which does an
NODE_DATA() array access out of bounds and off we go.

This would suggest we put the whole seqcount thing inside
offset_il_node().


