Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Thu, Aug 29, 2013 at 01:14:19PM +0200, Peter Zijlstra wrote:
> > I intended to say nr_node_ids, the same size as buffers such as the
> > task_numa_buffers. If we ever return a nid > nr_node_ids here then
> > task_numa_fault would corrupt memory. However, it should be possible for
> > node_weight to exceed nr_node_ids except maybe during node hot-remove so
> > it's not the problem.
>
> The nodemask situation seems somewhat more confused than the cpumask
> case; how would we ever return a nid > nr_node_ids? Corrupt nodemask?

ARGH. I meant impossible. The exact bloody opposite of what I wrote.

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Thu, Aug 29, 2013 at 11:56:57AM +0100, Mel Gorman wrote:
> > I thought it was, we crashed somewhere suspiciously close, but no. You
> > need shared mpols for this to actually trigger and the NUMA stuff
> > doesn't use that.
>
> Ah, so this is a red herring?

Yeah, but I still think its an actual bug. It seems the easiest way to
trigger this would be to:

 - create a task that constantly allocates pages
 - have said task have an MPOL_INTERLEAVE task policy
 - put said task into a cpuset
 - using a different task (your shell for example) flip the cpuset's
   mems_allowed back and forth.

This would have the shell task constantly rebind (in two steps) our
allocating task's INTERLEAVE policy.

> > I used whatever nodemask.h did to detect end-of-bitmap and they use
> > MAX_NUMNODES. See __next_node() and for_each_node() like.
>
> The check does prevent us going off the end of the bitmap but does not
> necessarily return an online node.

Right, but its guaranteed to return a 'valid' node. I don't think it
returning an offline node is a problem, we'll find it empty and fail the
page allocation.

> > MAX_NUMNODES doesn't assume contiguous numbers since its the actual size
> > of the bitmap, nr_online_nodes would hoever.
>
> I intended to say nr_node_ids, the same size as buffers such as the
> task_numa_buffers. If we ever return a nid > nr_node_ids here then
> task_numa_fault would corrupt memory. However, it should be possible for
> node_weight to exceed nr_node_ids except maybe during node hot-remove so
> it's not the problem.

The nodemask situation seems somewhat more confused than the cpumask
case; how would we ever return a nid > nr_node_ids? Corrupt nodemask?

In the cpumask case we use the runtime limit nr_cpu_ids for all bitmap
operations, arguably we should make the nodemask stuff do the same. Less
bits to iterate is always good; a MAX_NUMNODES=64 (x86_64-defconfig)
will still iterate all 64 bits, even though its unlikely to have more
than 1 let alone more than 8 nodes.

> > So I explicitly didn't use the node_isset() test because that's more
> > likely to trigger than the nid >= MAX_NUMNODES test. Its fine to return
> > a node that isn't actually part of the mask anymore -- a race is a race
> > anyway.
>
> Yeah and as long as it's < nr_node_ids it should be ok within the task
> numa fault handling as well.

Right, I'm just a tad confused on how we could ever get a nid >=
nr_node_ids except from a prior bug (corrupted nodemask).
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Thu, Aug 29, 2013 at 11:43:42AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 29, 2013 at 10:28:29AM +0100, Mel Gorman wrote:
> > On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> > > > So I think this patch is broken (still).
> >
> > I am assuming the lack of complaints is that it is not a heavily executed
> > path. I expect that you (and Rik) are hitting this as part of automatic
> > NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
> > is the reproduction case.
>
> I thought it was, we crashed somewhere suspiciously close, but no. You
> need shared mpols for this to actually trigger and the NUMA stuff
> doesn't use that.

Ah, so this is a red herring?

> > > +	if (unlikely((unsigned)nid >= MAX_NUMNODES))
> > > +		goto again;
> > > +
> >
> > MAX_NUMNODES is unrelated to anything except that it might prevent a crash
> > and even then nr_online_nodes is probably what you wanted and even that
> > assumes the NUMA node numbering is contiguous.
>
> I used whatever nodemask.h did to detect end-of-bitmap and they use
> MAX_NUMNODES. See __next_node() and for_each_node() like.

The check does prevent us going off the end of the bitmap but does not
necessarily return an online node.

> MAX_NUMNODES doesn't assume contiguous numbers since its the actual size
> of the bitmap, nr_online_nodes would hoever.

I intended to say nr_node_ids, the same size as buffers such as the
task_numa_buffers. If we ever return a nid > nr_node_ids here then
task_numa_fault would corrupt memory. However, it should be possible for
node_weight to exceed nr_node_ids except maybe during node hot-remove so
it's not the problem.

> > The real concern is whether
> > the updated mask is an allowed target for the updated memory policy. If
> > it's not then "nid" can be pointing off the deep end somewhere. With this
> > conversion to a for loop there is race after you check nnodes where target
> > gets set to 0 and then return a nid of -1 which I suppose will just blow
> > up differently but it's fixable.
>
> But but but, I did i <= target, which will match when target == 0 so
> you'll get at least a single iteration and nid will be set.

True.

> > This? Untested. Fixes implicit types while it's there. Note the use of
> > first node and (c < target) to guarantee nid gets set and that the first
> > potential node is still used as an interleave target.
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 7431001..ae880c3 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
> >  }
> >
> >  /* Do static interleaving for a VMA with known offset. */
> > -static unsigned offset_il_node(struct mempolicy *pol,
> > +static unsigned int offset_il_node(struct mempolicy *pol,
> >  		struct vm_area_struct *vma, unsigned long off)
> >  {
> > -	unsigned nnodes = nodes_weight(pol->v.nodes);
> > -	unsigned target;
> > -	int c;
> > -	int nid = -1;
> > +	unsigned int nr_nodes, target;
> > +	int i, nid;
> >
> > -	if (!nnodes)
> > +again:
> > +	nr_nodes = nodes_weight(pol->v.nodes);
> > +	if (!nr_nodes)
> >  		return numa_node_id();
> > -	target = (unsigned int)off % nnodes;
> > -	c = 0;
> > -	do {
> > +	target = (unsigned int)off % nr_nodes;
> > +	for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
> >  		nid = next_node(nid, pol->v.nodes);
> > -		c++;
> > -	} while (c <= target);
> > +
> > +	/* Policy nodemask can potentially update in parallel */
> > +	if (unlikely(!node_isset(nid, pol->v.nodes)))
> > +		goto again;
> > +
> >  	return nid;
> >  }
>
> So I explicitly didn't use the node_isset() test because that's more
> likely to trigger than the nid >= MAX_NUMNODES test. Its fine to return
> a node that isn't actually part of the mask anymore -- a race is a race
> anyway.

Yeah and as long as it's < nr_node_ids it should be ok within the task
numa fault handling as well.

--
Mel Gorman
SUSE Labs
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Thu, Aug 29, 2013 at 11:43:42AM +0200, Peter Zijlstra wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 7431001..ae880c3 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
> >  }
> >
> >  /* Do static interleaving for a VMA with known offset. */
> > -static unsigned offset_il_node(struct mempolicy *pol,
> > +static unsigned int offset_il_node(struct mempolicy *pol,
> >  		struct vm_area_struct *vma, unsigned long off)
> >  {
> > -	unsigned nnodes = nodes_weight(pol->v.nodes);
> > -	unsigned target;
> > -	int c;
> > -	int nid = -1;
> > +	unsigned int nr_nodes, target;
> > +	int i, nid;
> >
> > -	if (!nnodes)
> > +again:
> > +	nr_nodes = nodes_weight(pol->v.nodes);
> > +	if (!nr_nodes)
> >  		return numa_node_id();
> > -	target = (unsigned int)off % nnodes;
> > -	c = 0;
> > -	do {
> > +	target = (unsigned int)off % nr_nodes;
> > +	for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
> >  		nid = next_node(nid, pol->v.nodes);
> > -		c++;
> > -	} while (c <= target);
> > +
> > +	/* Policy nodemask can potentially update in parallel */
> > +	if (unlikely(!node_isset(nid, pol->v.nodes)))
> > +		goto again;
> > +
> >  	return nid;
> >  }
>
> So I explicitly didn't use the node_isset() test because that's more
> likely to trigger than the nid >= MAX_NUMNODES test. Its fine to return
> a node that isn't actually part of the mask anymore -- a race is a race
> anyway.

Oh more importantly, if nid does indeed end up being >= MAX_NUMNODES as
is possible with next_node() the node_isset() test will be out-of-bounds
and can crash itself.
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Thu, Aug 29, 2013 at 10:28:29AM +0100, Mel Gorman wrote:
> On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> > > So I think this patch is broken (still).
>
> I am assuming the lack of complaints is that it is not a heavily executed
> path. I expect that you (and Rik) are hitting this as part of automatic
> NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
> is the reproduction case.

I thought it was, we crashed somewhere suspiciously close, but no. You
need shared mpols for this to actually trigger and the NUMA stuff
doesn't use that.

> > +	if (unlikely((unsigned)nid >= MAX_NUMNODES))
> > +		goto again;
> > +
>
> MAX_NUMNODES is unrelated to anything except that it might prevent a crash
> and even then nr_online_nodes is probably what you wanted and even that
> assumes the NUMA node numbering is contiguous.

I used whatever nodemask.h did to detect end-of-bitmap and they use
MAX_NUMNODES. See __next_node() and for_each_node() like.

MAX_NUMNODES doesn't assume contiguous numbers since its the actual size
of the bitmap, nr_online_nodes would hoever.

> The real concern is whether
> the updated mask is an allowed target for the updated memory policy. If
> it's not then "nid" can be pointing off the deep end somewhere. With this
> conversion to a for loop there is race after you check nnodes where target
> gets set to 0 and then return a nid of -1 which I suppose will just blow
> up differently but it's fixable.

But but but, I did i <= target, which will match when target == 0 so
you'll get at least a single iteration and nid will be set.

> This? Untested. Fixes implicit types while it's there. Note the use of
> first node and (c < target) to guarantee nid gets set and that the first
> potential node is still used as an interleave target.
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 7431001..ae880c3 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
>  }
>
>  /* Do static interleaving for a VMA with known offset. */
> -static unsigned offset_il_node(struct mempolicy *pol,
> +static unsigned int offset_il_node(struct mempolicy *pol,
>  		struct vm_area_struct *vma, unsigned long off)
>  {
> -	unsigned nnodes = nodes_weight(pol->v.nodes);
> -	unsigned target;
> -	int c;
> -	int nid = -1;
> +	unsigned int nr_nodes, target;
> +	int i, nid;
>
> -	if (!nnodes)
> +again:
> +	nr_nodes = nodes_weight(pol->v.nodes);
> +	if (!nr_nodes)
>  		return numa_node_id();
> -	target = (unsigned int)off % nnodes;
> -	c = 0;
> -	do {
> +	target = (unsigned int)off % nr_nodes;
> +	for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
>  		nid = next_node(nid, pol->v.nodes);
> -		c++;
> -	} while (c <= target);
> +
> +	/* Policy nodemask can potentially update in parallel */
> +	if (unlikely(!node_isset(nid, pol->v.nodes)))
> +		goto again;
> +
>  	return nid;
>  }

So I explicitly didn't use the node_isset() test because that's more
likely to trigger than the nid >= MAX_NUMNODES test. Its fine to return
a node that isn't actually part of the mask anymore -- a race is a race
anyway.
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> > So I think this patch is broken (still).

I am assuming the lack of complaints is that it is not a heavily executed
path. I expect that you (and Rik) are hitting this as part of automatic
NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
is the reproduction case.

> > Suppose we have an
> > INTERLEAVE mempol like 0x3 and change it to 0xc.
> >
> > Original:       0x3
> > Rebind Step 1:  0xf /* set bits */
> > Rebind Step 2:  0xc /* clear bits */
> >
> > Now look at what can happen with offset_il_node() when its ran
> > concurrently with step 2:
> >
> >   nnodes = nodes_weight(pol->v.nodes); /* observes 0xf and returns 4 */
> >
> >   /* now we clear the actual bits */
> >
> >   target = (unsigned int)off % nnodes; /* assume target >= 2 */
> >   c = 0;
> >   do {
> >   	nid = next_node(nid, pol->v.nodes);
> >   	c++;
> >   } while (c <= target);
> >
> >   /* here nid := MAX_NUMNODES */
> >
> > This nid is then blindly inserted into node_zonelist() which does an
> > NODE_DATA() array access out of bounds and off we go.
> >
> > This would suggest we put the whole seqcount thing inside
> > offset_il_node().
>
> Oh bloody grrr. Its not directly related at all, the patch in question
> fixes a cpuset task_struct::mems_allowed problem while the above is a
> mempolicy issue and of course the cpuset and mempolicy code are
> completely bloody different :/
>
> So I guess the quick and ugly solution is something like the below.
>
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1762,19 +1762,21 @@ unsigned slab_node(void)
>  static unsigned offset_il_node(struct mempolicy *pol,
>  		struct vm_area_struct *vma, unsigned long off)
>  {
> -	unsigned nnodes = nodes_weight(pol->v.nodes);
> -	unsigned target;
> -	int c;
> -	int nid = -1;
> +	unsigned nnodes, target;
> +	int c, nid;
>
> +again:
> +	nnodes = nodes_weight(pol->v.nodes);
>  	if (!nnodes)
>  		return numa_node_id();
> +
>  	target = (unsigned int)off % nnodes;
> -	c = 0;
> -	do {
> +	for (c = 0, nid = -1; c <= target; c++)
>  		nid = next_node(nid, pol->v.nodes);
> -		c++;
> -	} while (c <= target);
> +
> +	if (unlikely((unsigned)nid >= MAX_NUMNODES))
> +		goto again;
> +

MAX_NUMNODES is unrelated to anything except that it might prevent a crash
and even then nr_online_nodes is probably what you wanted and even that
assumes the NUMA node numbering is contiguous. The real concern is whether
the updated mask is an allowed target for the updated memory policy. If
it's not then "nid" can be pointing off the deep end somewhere. With this
conversion to a for loop there is race after you check nnodes where target
gets set to 0 and then return a nid of -1 which I suppose will just blow
up differently but it's fixable.

This? Untested. Fixes implicit types while it's there. Note the use of
first node and (c < target) to guarantee nid gets set and that the first
potential node is still used as an interleave target.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..ae880c3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1755,22 +1755,24 @@ unsigned slab_node(void)
 }
 
 /* Do static interleaving for a VMA with known offset. */
-static unsigned offset_il_node(struct mempolicy *pol,
+static unsigned int offset_il_node(struct mempolicy *pol,
 		struct vm_area_struct *vma, unsigned long off)
 {
-	unsigned nnodes = nodes_weight(pol->v.nodes);
-	unsigned target;
-	int c;
-	int nid = -1;
+	unsigned int nr_nodes, target;
+	int i, nid;
 
-	if (!nnodes)
+again:
+	nr_nodes = nodes_weight(pol->v.nodes);
+	if (!nr_nodes)
 		return numa_node_id();
-	target = (unsigned int)off % nnodes;
-	c = 0;
-	do {
+	target = (unsigned int)off % nr_nodes;
+	for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
 		nid = next_node(nid, pol->v.nodes);
-		c++;
-	} while (c <= target);
+
+	/* Policy nodemask can potentially update in parallel */
+	if (unlikely(!node_isset(nid, pol->v.nodes)))
+		goto again;
+
 	return nid;
 }
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote: On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote: So I think this patch is broken (still). I am assuming the lack of complaints is that it is not a heavily executed path. I expect that you (and Rik) are hitting this as part of automatic NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing is the reproduction case. Suppose we have an INTERLEAVE mempol like 0x3 and change it to 0xc. Original: 0x3 Rebind Step 1: 0xf /* set bits */ Rebind Step 2: 0xc /* clear bits */ Now look at what can happen with offset_il_node() when its ran concurrently with step 2: nnodes = nodes_weight(pol-v.nodes); /* observes 0xf and returns 4 */ /* now we clear the actual bits */ target = (unsigned int)off % nnodes; /* assume target = 2 */ c = 0; do { nid = next_node(nid, pol-v.nodes); c++; } while (c = target); /* here nid := MAX_NUMNODES */ This nid is then blindly inserted into node_zonelist() which does an NODE_DATA() array access out of bounds and off we go. This would suggest we put the whole seqcount thing inside offset_il_node(). Oh bloody grrr. Its not directly related at all, the patch in question fixes a cpuset task_struct::mems_allowed problem while the above is a mempolicy issue and of course the cpuset and mempolicy code are completely bloody different :/ So I guess the quick and ugly solution is something like the below. 
--- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1762,19 +1762,21 @@ unsigned slab_node(void) static unsigned offset_il_node(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long off) { - unsigned nnodes = nodes_weight(pol-v.nodes); - unsigned target; - int c; - int nid = -1; + unsigned nnodes, target; + int c, nid; +again: + nnodes = nodes_weight(pol-v.nodes); if (!nnodes) return numa_node_id(); + target = (unsigned int)off % nnodes; - c = 0; - do { + for (c = 0, nid = -1; c = target; c++) nid = next_node(nid, pol-v.nodes); - c++; - } while (c = target); + + if (unlikely((unsigned)nid = MAX_NUMNODES)) + goto again; + MAX_NUMNODES is unrelated to anything except that it might prevent a crash and even then nr_online_nodes is probably what you wanted and even that assumes the NUMA node numbering is contiguous. The real concern is whether the updated mask is an allowed target for the updated memory policy. If it's not then nid can be pointing off the deep end somewhere. With this conversion to a for loop there is race after you check nnodes where target gets set to 0 and then return a nid of -1 which I suppose will just blow up differently but it's fixable. This? Untested. Fixes implicit types while it's there. Note the use of first node and (c target) to guarantee nid gets set and that the first potential node is still used as an interleave target. diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 7431001..ae880c3 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1755,22 +1755,24 @@ unsigned slab_node(void) } /* Do static interleaving for a VMA with known offset. 
*/ -static unsigned offset_il_node(struct mempolicy *pol, +static unsigned int offset_il_node(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long off) { - unsigned nnodes = nodes_weight(pol-v.nodes); - unsigned target; - int c; - int nid = -1; + unsigned int nr_nodes, target; + int i, nid; - if (!nnodes) +again: + nr_nodes = nodes_weight(pol-v.nodes); + if (!nr_nodes) return numa_node_id(); - target = (unsigned int)off % nnodes; - c = 0; - do { + target = (unsigned int)off % nr_nodes; + for (i = 0, nid = first_node(pol-v.nodes); i target; i++) nid = next_node(nid, pol-v.nodes); - c++; - } while (c = target); + + /* Policy nodemask can potentially update in parallel */ + if (unlikely(!node_isset(nid, pol-v.nodes))) + goto again; + return nid; } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Thu, Aug 29, 2013 at 10:28:29AM +0100, Mel Gorman wrote: On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote: On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote: So I think this patch is broken (still). I am assuming the lack of complaints is that it is not a heavily executed path. I expect that you (and Rik) are hitting this as part of automatic NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing is the reproduction case. I thought it was, we crashed somewhere suspiciously close, but no. You need shared mpols for this to actually trigger and the NUMA stuff doesn't use that. + if (unlikely((unsigned)nid = MAX_NUMNODES)) + goto again; + MAX_NUMNODES is unrelated to anything except that it might prevent a crash and even then nr_online_nodes is probably what you wanted and even that assumes the NUMA node numbering is contiguous. I used whatever nodemask.h did to detect end-of-bitmap and they use MAX_NUMNODES. See __next_node() and for_each_node() like. MAX_NUMNODES doesn't assume contiguous numbers since its the actual size of the bitmap, nr_online_nodes would hoever. The real concern is whether the updated mask is an allowed target for the updated memory policy. If it's not then nid can be pointing off the deep end somewhere. With this conversion to a for loop there is race after you check nnodes where target gets set to 0 and then return a nid of -1 which I suppose will just blow up differently but it's fixable. But but but, I did i = target, which will match when target == 0 so you'll get at least a single iteration and nid will be set. This? Untested. Fixes implicit types while it's there. Note the use of first node and (c target) to guarantee nid gets set and that the first potential node is still used as an interleave target. 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 7431001..ae880c3 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1755,22 +1755,24 @@ unsigned slab_node(void) } /* Do static interleaving for a VMA with known offset. */ -static unsigned offset_il_node(struct mempolicy *pol, +static unsigned int offset_il_node(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long off) { - unsigned nnodes = nodes_weight(pol-v.nodes); - unsigned target; - int c; - int nid = -1; + unsigned int nr_nodes, target; + int i, nid; - if (!nnodes) +again: + nr_nodes = nodes_weight(pol-v.nodes); + if (!nr_nodes) return numa_node_id(); - target = (unsigned int)off % nnodes; - c = 0; - do { + target = (unsigned int)off % nr_nodes; + for (i = 0, nid = first_node(pol-v.nodes); i target; i++) nid = next_node(nid, pol-v.nodes); - c++; - } while (c = target); + + /* Policy nodemask can potentially update in parallel */ + if (unlikely(!node_isset(nid, pol-v.nodes))) + goto again; + return nid; } So I explicitly didn't use the node_isset() test because that's more likely to trigger than the nid = MAX_NUMNODES test. Its fine to return a node that isn't actually part of the mask anymore -- a race is a race anyway. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Thu, Aug 29, 2013 at 11:43:42AM +0200, Peter Zijlstra wrote: diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 7431001..ae880c3 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1755,22 +1755,24 @@ unsigned slab_node(void) } /* Do static interleaving for a VMA with known offset. */ -static unsigned offset_il_node(struct mempolicy *pol, +static unsigned int offset_il_node(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long off) { - unsigned nnodes = nodes_weight(pol-v.nodes); - unsigned target; - int c; - int nid = -1; + unsigned int nr_nodes, target; + int i, nid; - if (!nnodes) +again: + nr_nodes = nodes_weight(pol-v.nodes); + if (!nr_nodes) return numa_node_id(); - target = (unsigned int)off % nnodes; - c = 0; - do { + target = (unsigned int)off % nr_nodes; + for (i = 0, nid = first_node(pol-v.nodes); i target; i++) nid = next_node(nid, pol-v.nodes); - c++; - } while (c = target); + + /* Policy nodemask can potentially update in parallel */ + if (unlikely(!node_isset(nid, pol-v.nodes))) + goto again; + return nid; } So I explicitly didn't use the node_isset() test because that's more likely to trigger than the nid = MAX_NUMNODES test. Its fine to return a node that isn't actually part of the mask anymore -- a race is a race anyway. Oh more importantly, if nid does indeed end up being = MAX_NUMNODES as is possible with next_node() the node_isset() test will be out-of-bounds and can crash itself. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Thu, Aug 29, 2013 at 11:43:42AM +0200, Peter Zijlstra wrote: On Thu, Aug 29, 2013 at 10:28:29AM +0100, Mel Gorman wrote: On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote: On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote: So I think this patch is broken (still). I am assuming the lack of complaints is that it is not a heavily executed path. I expect that you (and Rik) are hitting this as part of automatic NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing is the reproduction case. I thought it was, we crashed somewhere suspiciously close, but no. You need shared mpols for this to actually trigger and the NUMA stuff doesn't use that. Ah, so this is a red herring? + if (unlikely((unsigned)nid = MAX_NUMNODES)) + goto again; + MAX_NUMNODES is unrelated to anything except that it might prevent a crash and even then nr_online_nodes is probably what you wanted and even that assumes the NUMA node numbering is contiguous. I used whatever nodemask.h did to detect end-of-bitmap and they use MAX_NUMNODES. See __next_node() and for_each_node() like. The check does prevent us going off the end of the bitmap but does not necessarily return an online node. MAX_NUMNODES doesn't assume contiguous numbers since its the actual size of the bitmap, nr_online_nodes would hoever. I intended to say nr_node_ids, the same size as buffers such as the task_numa_buffers. If we ever return a nid nr_node_ids here then task_numa_fault would corrupt memory. However, it should be possible for node_weight to exceed nr_node_ids except maybe during node hot-remove so it's not the problem. The real concern is whether the updated mask is an allowed target for the updated memory policy. If it's not then nid can be pointing off the deep end somewhere. With this conversion to a for loop there is race after you check nnodes where target gets set to 0 and then return a nid of -1 which I suppose will just blow up differently but it's fixable. 
But but but, I did i = target, which will match when target == 0 so you'll get at least a single iteration and nid will be set. True. This? Untested. Fixes implicit types while it's there. Note the use of first node and (c target) to guarantee nid gets set and that the first potential node is still used as an interleave target. diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 7431001..ae880c3 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1755,22 +1755,24 @@ unsigned slab_node(void) } /* Do static interleaving for a VMA with known offset. */ -static unsigned offset_il_node(struct mempolicy *pol, +static unsigned int offset_il_node(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long off) { - unsigned nnodes = nodes_weight(pol-v.nodes); - unsigned target; - int c; - int nid = -1; + unsigned int nr_nodes, target; + int i, nid; - if (!nnodes) +again: + nr_nodes = nodes_weight(pol-v.nodes); + if (!nr_nodes) return numa_node_id(); - target = (unsigned int)off % nnodes; - c = 0; - do { + target = (unsigned int)off % nr_nodes; + for (i = 0, nid = first_node(pol-v.nodes); i target; i++) nid = next_node(nid, pol-v.nodes); - c++; - } while (c = target); + + /* Policy nodemask can potentially update in parallel */ + if (unlikely(!node_isset(nid, pol-v.nodes))) + goto again; + return nid; } So I explicitly didn't use the node_isset() test because that's more likely to trigger than the nid = MAX_NUMNODES test. Its fine to return a node that isn't actually part of the mask anymore -- a race is a race anyway. Yeah and as long as it's nr_node_ids it should be ok within the task numa fault handling as well. -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Thu, Aug 29, 2013 at 11:56:57AM +0100, Mel Gorman wrote: I thought it was, we crashed somewhere suspiciously close, but no. You need shared mpols for this to actually trigger and the NUMA stuff doesn't use that. Ah, so this is a red herring? Yeah, but I still think its an actual bug. It seems the easiest way to trigger this would be to: create a task that constantly allocates pages have said task have an MPOL_INTERLEAVE task policy put said task into a cpuset using a different task (your shell for example) flip the cpuset's mems_allowed back and forth. This would have the shell task constantly rebind (in two steps) our allocating task's INTERLEAVE policy. I used whatever nodemask.h did to detect end-of-bitmap and they use MAX_NUMNODES. See __next_node() and for_each_node() like. The check does prevent us going off the end of the bitmap but does not necessarily return an online node. Right, but its guaranteed to return a 'valid' node. I don't think it returning an offline node is a problem, we'll find it empty and fail the page allocation. MAX_NUMNODES doesn't assume contiguous numbers since its the actual size of the bitmap, nr_online_nodes would hoever. I intended to say nr_node_ids, the same size as buffers such as the task_numa_buffers. If we ever return a nid nr_node_ids here then task_numa_fault would corrupt memory. However, it should be possible for node_weight to exceed nr_node_ids except maybe during node hot-remove so it's not the problem. The nodemask situation seems somewhat more confused than the cpumask case; how would we ever return a nid nr_node_ids? Corrupt nodemask? In the cpumask case we use the runtime limit nr_cpu_ids for all bitmap operations, arguably we should make the nodemask stuff do the same. Less bits to iterate is always good; a MAX_NUMNODES=64 (x86_64-defconfig) will still iterate all 64 bits, even though its unlikely to have more than 1 let alone more than 8 nodes. 
> > So I explicitly didn't use the node_isset() test because that's more
> > likely to trigger than the nid >= MAX_NUMNODES test. It's fine to
> > return a node that isn't actually part of the mask anymore -- a race
> > is a race anyway.
>
> Yeah and as long as it's nr_node_ids it should be ok within the task
> numa fault handling as well.

Right, I'm just a tad confused on how we could ever get a nid >=
nr_node_ids except from a prior bug (corrupted nodemask).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On 08/23/2013 02:15 PM, Peter Zijlstra wrote:
> So I guess the quick and ugly solution is something like the below.

This still crashes :)

> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1762,19 +1762,21 @@ unsigned slab_node(void)
>  static unsigned offset_il_node(struct mempolicy *pol,
>  		struct vm_area_struct *vma, unsigned long off)
>  {
> -	unsigned nnodes = nodes_weight(pol->v.nodes);
> -	unsigned target;
> -	int c;
> -	int nid = -1;
> +	unsigned nnodes, target;
> +	int c, nid;
>
> +again:
> +	nnodes = nodes_weight(pol->v.nodes);
>  	if (!nnodes)
>  		return numa_node_id();
> +
>  	target = (unsigned int)off % nnodes;
> -	c = 0;
> -	do {
> +	for (c = 0, nid = -1; c <= target; c++)
>  		nid = next_node(nid, pol->v.nodes);
> -	c++;
> -	} while (c <= target);
> +
> +	if (unlikely((unsigned)nid >= MAX_NUMNODES))
> +		goto again;

I'll go kick off a compile that replaces the conditional above with:

	if (unlikely(!node_online(nid)))
		goto again;

>  	return nid;
>  }

--
All rights reversed
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> So I think this patch is broken (still). Suppose we have an
> INTERLEAVE mempol like 0x3 and change it to 0xc.
>
> Original:      0x3
> Rebind Step 1: 0xf /* set bits */
> Rebind Step 2: 0xc /* clear bits */
>
> Now look at what can happen with offset_il_node() when it's run
> concurrently with step 2:
>
>   nnodes = nodes_weight(pol->v.nodes); /* observes 0xf and returns 4 */
>
>   /* now we clear the actual bits */
>
>   target = (unsigned int)off % nnodes; /* assume target >= 2 */
>   c = 0;
>   do {
>     nid = next_node(nid, pol->v.nodes);
>     c++;
>   } while (c <= target);
>
>   /* here nid := MAX_NUMNODES */
>
> This nid is then blindly inserted into node_zonelist() which does a
> NODE_DATA() array access out of bounds and off we go.
>
> This would suggest we put the whole seqcount thing inside
> offset_il_node().

Oh bloody grrr. It's not directly related at all, the patch in question
fixes a cpuset task_struct::mems_allowed problem while the above is a
mempolicy issue and of course the cpuset and mempolicy code are
completely bloody different :/

So I guess the quick and ugly solution is something like the below.
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1762,19 +1762,21 @@ unsigned slab_node(void)
 static unsigned offset_il_node(struct mempolicy *pol,
 		struct vm_area_struct *vma, unsigned long off)
 {
-	unsigned nnodes = nodes_weight(pol->v.nodes);
-	unsigned target;
-	int c;
-	int nid = -1;
+	unsigned nnodes, target;
+	int c, nid;
 
+again:
+	nnodes = nodes_weight(pol->v.nodes);
 	if (!nnodes)
 		return numa_node_id();
+
 	target = (unsigned int)off % nnodes;
-	c = 0;
-	do {
+	for (c = 0, nid = -1; c <= target; c++)
 		nid = next_node(nid, pol->v.nodes);
-	c++;
-	} while (c <= target);
+
+	if (unlikely((unsigned)nid >= MAX_NUMNODES))
+		goto again;
+
 	return nid;
 }
Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
On Wed, Mar 07, 2012 at 06:08:52PM +, Mel Gorman wrote:
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 06b145f..013d981 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1843,18 +1843,24 @@ struct page *
>  alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		unsigned long addr, int node)
>  {
> -	struct mempolicy *pol = get_vma_policy(current, vma, addr);
> +	struct mempolicy *pol;
>  	struct zonelist *zl;
>  	struct page *page;
> +	unsigned int cpuset_mems_cookie;
> +
> +retry_cpuset:
> +	pol = get_vma_policy(current, vma, addr);
> +	cpuset_mems_cookie = get_mems_allowed();
>
> -	get_mems_allowed();
>  	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
>  		unsigned nid;
>
>  		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
>  		mpol_cond_put(pol);
>  		page = alloc_page_interleave(gfp, order, nid);
> -		put_mems_allowed();
> +		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +			goto retry_cpuset;
> +
>  		return page;
>  	}
>  	zl = policy_zonelist(gfp, pol, node);

So I think this patch is broken (still). Suppose we have an
INTERLEAVE mempol like 0x3 and change it to 0xc.

Original:      0x3
Rebind Step 1: 0xf /* set bits */
Rebind Step 2: 0xc /* clear bits */

Now look at what can happen with offset_il_node() when it's run
concurrently with step 2:

  nnodes = nodes_weight(pol->v.nodes); /* observes 0xf and returns 4 */

  /* now we clear the actual bits */

  target = (unsigned int)off % nnodes; /* assume target >= 2 */
  c = 0;
  do {
    nid = next_node(nid, pol->v.nodes);
    c++;
  } while (c <= target);

  /* here nid := MAX_NUMNODES */

This nid is then blindly inserted into node_zonelist() which does a
NODE_DATA() array access out of bounds and off we go.

This would suggest we put the whole seqcount thing inside
offset_il_node().