Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 20 Nov 2007, Andrew Morton wrote: > uhm, maybe. It's getting toward the time when we should try to get -mm > vaguely compiling and booting on some machines, which means stopping > merging new stuff. I left that too late in the 2.6.23 cycle. Huh? mm1 works fine here. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 20 Nov 2007 12:18:10 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Tue, 20 Nov 2007, Mel Gorman wrote: > > > Went back and revisited this. Allocating them at boot-time is below but > > essentially it is a silly and it makes sense to just have two zonelists > > where one of them is for __GFP_THISNODE. Implementation wise, this involves > > dropping the last patch in the set and the overall result is still a > > reduction > > in the number of zonelists. > > Allright with me. Andrew could we get this patchset merged? uhm, maybe. It's getting toward the time when we should try to get -mm vaguely compiling and booting on some machines, which means stopping merging new stuff. I left that too late in the 2.6.23 cycle. otoh it'd be nice to get mainline fixed up a bit too :( - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (20/11/07 12:18), Christoph Lameter didst pronounce: > On Tue, 20 Nov 2007, Mel Gorman wrote: > > > Went back and revisited this. Allocating them at boot-time is below but > > essentially it is a silly and it makes sense to just have two zonelists > > where one of them is for __GFP_THISNODE. Implementation wise, this involves > > dropping the last patch in the set and the overall result is still a > > reduction > > in the number of zonelists. > > Allright with me. Andrew could we get this patchset merged? > They will not merge cleanly with a recent tree. I expect to be ready to post a new set when regression tests complete later this evening. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 20 Nov 2007, Mel Gorman wrote: > Hold off testing for the moment. Getting all the corner cases right for > __GFP_THISNODE has turned too complicated to be considered as part of a larger > patchset. I believe it makes sense to drop the final patch and settle with > having two zonelists. One of these will be for __GFP_THISNODE allocations. We > can then tackle removing that zonelist at a later date. Ack. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 20 Nov 2007, Mel Gorman wrote: > Went back and revisited this. Allocating them at boot-time is below but > essentially it is a silly and it makes sense to just have two zonelists > where one of them is for __GFP_THISNODE. Implementation wise, this involves > dropping the last patch in the set and the overall result is still a reduction > in the number of zonelists. Allright with me. Andrew could we get this patchset merged? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (20/11/07 10:14), Lee Schermerhorn didst pronounce: > On Tue, 2007-11-20 at 14:19 +, Mel Gorman wrote: > > On (09/11/07 07:45), Christoph Lameter didst pronounce: > > > On Fri, 9 Nov 2007, Mel Gorman wrote: > > > > > > > struct page * fastcall > > > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > > > struct zonelist *zonelist) > > > > { > > > > + /* > > > > +* Use a temporary nodemask for __GFP_THISNODE allocations. If > > > > the > > > > +* cost of allocating on the stack or the stack usage becomes > > > > +* noticable, allocate the nodemasks per node at boot or > > > > compile time > > > > +*/ > > > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > > > + nodemask_t nodemask; > > > > > > Hmmm.. This places a potentially big structure on the stack. nodemask can > > > contain up to 1024 bits which means 128 bytes. Maybe keep an array of > > > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? > > > > Went back and revisited this. Allocating them at boot-time is below but > > essentially it is a silly and it makes sense to just have two zonelists > > where one of them is for __GFP_THISNODE. Implementation wise, this involves > > dropping the last patch in the set and the overall result is still a > > reduction > > in the number of zonelists. > > Hi, Mel: > > I'll try this out [n 24-rc2-mm1 or later]. Hold off testing for the moment. Getting all the corner cases right for __GFP_THISNODE has turned too complicated to be considered as part of a larger patchset. I believe it makes sense to drop the final patch and settle with having two zonelists. One of these will be for __GFP_THISNODE allocations. We can then tackle removing that zonelist at a later date. This will still remove the hack, reduce the number of zonelists and improve how MPOL_BIND policies are applied so it is still an overall win. I should have a revised patchset with the final one dropped posted in a few hours. > I have a series with yet > another rework of policy reference counting that will be easier/cleaner > atop your series w/o the external zonelist hung off 'BIND policies. So, > I'm hoping your series goes into -mm "real soon now". > Sounds good. > Question: Just wondering why you didn't embed the '_THISNODE nodemask > in the pgdat_t--initialized at boot/node-hot-add time? At the time, to save space in the pg_data_t structure but it is not worth the complexity. > Perhaps under > #ifdef CONFIG_NUMA. The size is known at build time, right? And > wouldn't that be way smaller than an additional zonelist? > Yes it would. > Lee > > > > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff > > linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h > > linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h > > --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h > > 2007-11-19 19:27:15.0 + > > +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h > > 2007-11-19 19:28:55.0 + > > @@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i > > extern struct page * > > FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int, > > struct zonelist *, nodemask_t *nodemask)); > > -extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask); > > > > static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, > > unsigned int order) > > @@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n > > if (nid < 0) > > nid = numa_node_id(); > > > > - /* Use a temporary nodemask for __GFP_THISNODE allocations */ > > if (unlikely(gfp_mask & __GFP_THISNODE)) { > > - nodemask_t nodemask; > > - > > return __alloc_pages_nodemask(gfp_mask, order, > > node_zonelist(nid), > > - nodemask_thisnode(nid, )); > > + NODE_DATA(nid)->nodemask_thisnode); > > } > > > > return __alloc_pages(gfp_mask, order, node_zonelist(nid)); > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff > > linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h > > linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h > > --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h > > 2007-11-19 19:27:15.0 + > > +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h > > 2007-11-19 19:28:55.0 + > > @@ -519,6 +519,9 @@ typedef struct pglist_data { > > struct zone node_zones[MAX_NR_ZONES]; > > struct zonelist node_zonelist; > > int nr_zones; > > + > > + /* nodemask suitable for __GFP_THISNODE */ > > + nodemask_t *nodemask_thisnode; > > #ifdef CONFIG_FLAT_NODE_MEM_MAP > > struct page *node_mem_map; > > #endif > > diff -rup -X
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 2007-11-20 at 14:19 +, Mel Gorman wrote: > On (09/11/07 07:45), Christoph Lameter didst pronounce: > > On Fri, 9 Nov 2007, Mel Gorman wrote: > > > > > struct page * fastcall > > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > > struct zonelist *zonelist) > > > { > > > + /* > > > + * Use a temporary nodemask for __GFP_THISNODE allocations. If the > > > + * cost of allocating on the stack or the stack usage becomes > > > + * noticable, allocate the nodemasks per node at boot or compile time > > > + */ > > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > > + nodemask_t nodemask; > > > > Hmmm.. This places a potentially big structure on the stack. nodemask can > > contain up to 1024 bits which means 128 bytes. Maybe keep an array of > > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? > > Went back and revisited this. Allocating them at boot-time is below but > essentially it is a silly and it makes sense to just have two zonelists > where one of them is for __GFP_THISNODE. Implementation wise, this involves > dropping the last patch in the set and the overall result is still a reduction > in the number of zonelists. Hi, Mel: I'll try this out [n 24-rc2-mm1 or later]. I have a series with yet another rework of policy reference counting that will be easier/cleaner atop your series w/o the external zonelist hung off 'BIND policies. So, I'm hoping your series goes into -mm "real soon now". Question: Just wondering why you didn't embed the '_THISNODE nodemask in the pgdat_t--initialized at boot/node-hot-add time? Perhaps under #ifdef CONFIG_NUMA. The size is known at build time, right? And wouldn't that be way smaller than an additional zonelist? Lee > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff > linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h > linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h > --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h > 2007-11-19 19:27:15.0 + > +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h > 2007-11-19 19:28:55.0 + > @@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i > extern struct page * > FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int, > struct zonelist *, nodemask_t *nodemask)); > -extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask); > > static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, > unsigned int order) > @@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n > if (nid < 0) > nid = numa_node_id(); > > - /* Use a temporary nodemask for __GFP_THISNODE allocations */ > if (unlikely(gfp_mask & __GFP_THISNODE)) { > - nodemask_t nodemask; > - > return __alloc_pages_nodemask(gfp_mask, order, > node_zonelist(nid), > - nodemask_thisnode(nid, )); > + NODE_DATA(nid)->nodemask_thisnode); > } > > return __alloc_pages(gfp_mask, order, node_zonelist(nid)); > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff > linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h > linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h > --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h > 2007-11-19 19:27:15.0 + > +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h > 2007-11-19 19:28:55.0 + > @@ -519,6 +519,9 @@ typedef struct pglist_data { > struct zone node_zones[MAX_NR_ZONES]; > struct zonelist node_zonelist; > int nr_zones; > + > + /* nodemask suitable for __GFP_THISNODE */ > + nodemask_t *nodemask_thisnode; > #ifdef CONFIG_FLAT_NODE_MEM_MAP > struct page *node_mem_map; > #endif > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff > linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c > linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c > --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 2007-11-19 > 19:27:15.0 + > +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c > 2007-11-19 19:28:55.0 + > @@ -1695,28 +1695,36 @@ got_pg: > } > > /* Creates a nodemask suitable for GFP_THISNODE allocations */ > -nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask) > +static inline void alloc_node_nodemask_thisnode(pg_data_t *pgdat) > { > - nodes_clear(*nodemask); > - node_set(nid, *nodemask); > + nodemask_t *nodemask_thisnode; > > - return nodemask; > + /* Only a machine with multiple nodes needs the nodemask */ > + if (!NUMA_BUILD || num_online_nodes() == 1) > + return; > + > + /* Allocate the nodemask. Serious if it fails, but not world ending */ > + nodemask_thisnode =
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (09/11/07 07:45), Christoph Lameter didst pronounce: > On Fri, 9 Nov 2007, Mel Gorman wrote: > > > struct page * fastcall > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > struct zonelist *zonelist) > > { > > + /* > > +* Use a temporary nodemask for __GFP_THISNODE allocations. If the > > +* cost of allocating on the stack or the stack usage becomes > > +* noticable, allocate the nodemasks per node at boot or compile time > > +*/ > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > + nodemask_t nodemask; > > Hmmm.. This places a potentially big structure on the stack. nodemask can > contain up to 1024 bits which means 128 bytes. Maybe keep an array of > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? Went back and revisited this. Allocating them at boot-time is below but essentially it is a silly and it makes sense to just have two zonelists where one of them is for __GFP_THISNODE. Implementation wise, this involves dropping the last patch in the set and the overall result is still a reduction in the number of zonelists. diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h 2007-11-19 19:28:55.0 + @@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i extern struct page * FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int, struct zonelist *, nodemask_t *nodemask)); -extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask); static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) @@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n if (nid < 0) nid = numa_node_id(); - /* Use a temporary nodemask for __GFP_THISNODE allocations */ if (unlikely(gfp_mask & __GFP_THISNODE)) { - nodemask_t nodemask; - return __alloc_pages_nodemask(gfp_mask, order, node_zonelist(nid), - nodemask_thisnode(nid, )); + NODE_DATA(nid)->nodemask_thisnode); } return __alloc_pages(gfp_mask, order, node_zonelist(nid)); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h 2007-11-19 19:28:55.0 + @@ -519,6 +519,9 @@ typedef struct pglist_data { struct zone node_zones[MAX_NR_ZONES]; struct zonelist node_zonelist; int nr_zones; + + /* nodemask suitable for __GFP_THISNODE */ + nodemask_t *nodemask_thisnode; #ifdef CONFIG_FLAT_NODE_MEM_MAP struct page *node_mem_map; #endif diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c 2007-11-19 19:28:55.0 + @@ -1695,28 +1695,36 @@ got_pg: } /* Creates a nodemask suitable for GFP_THISNODE allocations */ -nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask) +static inline void alloc_node_nodemask_thisnode(pg_data_t *pgdat) { - nodes_clear(*nodemask); - node_set(nid, *nodemask); + nodemask_t *nodemask_thisnode; - return nodemask; + /* Only a machine with multiple nodes needs the nodemask */ + if (!NUMA_BUILD || num_online_nodes() == 1) + return; + + /* Allocate the nodemask. Serious if it fails, but not world ending */ + nodemask_thisnode = alloc_bootmem_node(pgdat, sizeof(nodemask_t)); + if (!nodemask_thisnode) { + printk(KERN_WARNING + "thisnode nodemask allocation failed." + "There may be sub-optimal NUMA placement.\n"); + return; + } + + /* Initialise the nodemask to only cover the current node */ + nodes_clear(*nodemask_thisnode); + node_set(pgdat->node_id, *nodemask_thisnode); + pgdat->nodemask_thisnode = nodemask_thisnode; } struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { - /* -* Use a temporary nodemask for __GFP_THISNODE allocations. If the -
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (09/11/07 07:45), Christoph Lameter didst pronounce: On Fri, 9 Nov 2007, Mel Gorman wrote: struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* +* Use a temporary nodemask for __GFP_THISNODE allocations. If the +* cost of allocating on the stack or the stack usage becomes +* noticable, allocate the nodemasks per node at boot or compile time +*/ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; Hmmm.. This places a potentially big structure on the stack. nodemask can contain up to 1024 bits which means 128 bytes. Maybe keep an array of gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? Went back and revisited this. Allocating them at boot-time is below but essentially it is a silly and it makes sense to just have two zonelists where one of them is for __GFP_THISNODE. Implementation wise, this involves dropping the last patch in the set and the overall result is still a reduction in the number of zonelists. diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h 2007-11-19 19:28:55.0 + @@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i extern struct page * FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int, struct zonelist *, nodemask_t *nodemask)); -extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask); static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) @@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n if (nid 0) nid = numa_node_id(); - /* Use a temporary nodemask for __GFP_THISNODE allocations */ if (unlikely(gfp_mask __GFP_THISNODE)) { - nodemask_t nodemask; - return __alloc_pages_nodemask(gfp_mask, order, node_zonelist(nid), - nodemask_thisnode(nid, nodemask)); + NODE_DATA(nid)-nodemask_thisnode); } return __alloc_pages(gfp_mask, order, node_zonelist(nid)); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h 2007-11-19 19:28:55.0 + @@ -519,6 +519,9 @@ typedef struct pglist_data { struct zone node_zones[MAX_NR_ZONES]; struct zonelist node_zonelist; int nr_zones; + + /* nodemask suitable for __GFP_THISNODE */ + nodemask_t *nodemask_thisnode; #ifdef CONFIG_FLAT_NODE_MEM_MAP struct page *node_mem_map; #endif diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c 2007-11-19 19:28:55.0 + @@ -1695,28 +1695,36 @@ got_pg: } /* Creates a nodemask suitable for GFP_THISNODE allocations */ -nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask) +static inline void alloc_node_nodemask_thisnode(pg_data_t *pgdat) { - nodes_clear(*nodemask); - node_set(nid, *nodemask); + nodemask_t *nodemask_thisnode; - return nodemask; + /* Only a machine with multiple nodes needs the nodemask */ + if (!NUMA_BUILD || num_online_nodes() == 1) + return; + + /* Allocate the nodemask. Serious if it fails, but not world ending */ + nodemask_thisnode = alloc_bootmem_node(pgdat, sizeof(nodemask_t)); + if (!nodemask_thisnode) { + printk(KERN_WARNING + thisnode nodemask allocation failed. + There may be sub-optimal NUMA placement.\n); + return; + } + + /* Initialise the nodemask to only cover the current node */ + nodes_clear(*nodemask_thisnode); + node_set(pgdat-node_id, *nodemask_thisnode); + pgdat-nodemask_thisnode = nodemask_thisnode; } struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { - /* -* Use a temporary nodemask for __GFP_THISNODE allocations. If the -* cost of allocating on the
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 2007-11-20 at 14:19 +, Mel Gorman wrote: On (09/11/07 07:45), Christoph Lameter didst pronounce: On Fri, 9 Nov 2007, Mel Gorman wrote: struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* + * Use a temporary nodemask for __GFP_THISNODE allocations. If the + * cost of allocating on the stack or the stack usage becomes + * noticable, allocate the nodemasks per node at boot or compile time + */ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; Hmmm.. This places a potentially big structure on the stack. nodemask can contain up to 1024 bits which means 128 bytes. Maybe keep an array of gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? Went back and revisited this. Allocating them at boot-time is below but essentially it is a silly and it makes sense to just have two zonelists where one of them is for __GFP_THISNODE. Implementation wise, this involves dropping the last patch in the set and the overall result is still a reduction in the number of zonelists. Hi, Mel: I'll try this out [n 24-rc2-mm1 or later]. I have a series with yet another rework of policy reference counting that will be easier/cleaner atop your series w/o the external zonelist hung off 'BIND policies. So, I'm hoping your series goes into -mm real soon now. Question: Just wondering why you didn't embed the '_THISNODE nodemask in the pgdat_t--initialized at boot/node-hot-add time? Perhaps under #ifdef CONFIG_NUMA. The size is known at build time, right? And wouldn't that be way smaller than an additional zonelist? Lee diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h 2007-11-19 19:28:55.0 + @@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i extern struct page * FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int, struct zonelist *, nodemask_t *nodemask)); -extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask); static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) @@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n if (nid 0) nid = numa_node_id(); - /* Use a temporary nodemask for __GFP_THISNODE allocations */ if (unlikely(gfp_mask __GFP_THISNODE)) { - nodemask_t nodemask; - return __alloc_pages_nodemask(gfp_mask, order, node_zonelist(nid), - nodemask_thisnode(nid, nodemask)); + NODE_DATA(nid)-nodemask_thisnode); } return __alloc_pages(gfp_mask, order, node_zonelist(nid)); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h 2007-11-19 19:28:55.0 + @@ -519,6 +519,9 @@ typedef struct pglist_data { struct zone node_zones[MAX_NR_ZONES]; struct zonelist node_zonelist; int nr_zones; + + /* nodemask suitable for __GFP_THISNODE */ + nodemask_t *nodemask_thisnode; #ifdef CONFIG_FLAT_NODE_MEM_MAP struct page *node_mem_map; #endif diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c 2007-11-19 19:28:55.0 + @@ -1695,28 +1695,36 @@ got_pg: } /* Creates a nodemask suitable for GFP_THISNODE allocations */ -nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask) +static inline void alloc_node_nodemask_thisnode(pg_data_t *pgdat) { - nodes_clear(*nodemask); - node_set(nid, *nodemask); + nodemask_t *nodemask_thisnode; - return nodemask; + /* Only a machine with multiple nodes needs the nodemask */ + if (!NUMA_BUILD || num_online_nodes() == 1) + return; + + /* Allocate the nodemask. Serious if it fails, but not world ending */ + nodemask_thisnode = alloc_bootmem_node(pgdat, sizeof(nodemask_t)); + if (!nodemask_thisnode) { + printk(KERN_WARNING +
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (20/11/07 10:14), Lee Schermerhorn didst pronounce: On Tue, 2007-11-20 at 14:19 +, Mel Gorman wrote: On (09/11/07 07:45), Christoph Lameter didst pronounce: On Fri, 9 Nov 2007, Mel Gorman wrote: struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* +* Use a temporary nodemask for __GFP_THISNODE allocations. If the +* cost of allocating on the stack or the stack usage becomes +* noticable, allocate the nodemasks per node at boot or compile time +*/ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; Hmmm.. This places a potentially big structure on the stack. nodemask can contain up to 1024 bits which means 128 bytes. Maybe keep an array of gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? Went back and revisited this. Allocating them at boot-time is below but essentially it is a silly and it makes sense to just have two zonelists where one of them is for __GFP_THISNODE. Implementation wise, this involves dropping the last patch in the set and the overall result is still a reduction in the number of zonelists. Hi, Mel: I'll try this out [n 24-rc2-mm1 or later]. Hold off testing for the moment. Getting all the corner cases right for __GFP_THISNODE has turned too complicated to be considered as part of a larger patchset. I believe it makes sense to drop the final patch and settle with having two zonelists. One of these will be for __GFP_THISNODE allocations. We can then tackle removing that zonelist at a later date. This will still remove the hack, reduce the number of zonelists and improve how MPOL_BIND policies are applied so it is still an overall win. I should have a revised patchset with the final one dropped posted in a few hours. I have a series with yet another rework of policy reference counting that will be easier/cleaner atop your series w/o the external zonelist hung off 'BIND policies. So, I'm hoping your series goes into -mm real soon now. Sounds good. Question: Just wondering why you didn't embed the '_THISNODE nodemask in the pgdat_t--initialized at boot/node-hot-add time? At the time, to save space in the pg_data_t structure but it is not worth the complexity. Perhaps under #ifdef CONFIG_NUMA. The size is known at build time, right? And wouldn't that be way smaller than an additional zonelist? Yes it would. Lee diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/gfp.h 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/gfp.h 2007-11-19 19:28:55.0 + @@ -175,7 +175,6 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i extern struct page * FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int, struct zonelist *, nodemask_t *nodemask)); -extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask); static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) @@ -187,13 +186,10 @@ static inline struct page *alloc_pages_n if (nid 0) nid = numa_node_id(); - /* Use a temporary nodemask for __GFP_THISNODE allocations */ if (unlikely(gfp_mask __GFP_THISNODE)) { - nodemask_t nodemask; - return __alloc_pages_nodemask(gfp_mask, order, node_zonelist(nid), - nodemask_thisnode(nid, nodemask)); + NODE_DATA(nid)-nodemask_thisnode); } return __alloc_pages(gfp_mask, order, node_zonelist(nid)); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/include/linux/mmzone.h 2007-11-19 19:27:15.0 + +++ linux-2.6.24-rc2-mm1-045_use_static_nodemask/include/linux/mmzone.h 2007-11-19 19:28:55.0 + @@ -519,6 +519,9 @@ typedef struct pglist_data { struct zone node_zones[MAX_NR_ZONES]; struct zonelist node_zonelist; int nr_zones; + + /* nodemask suitable for __GFP_THISNODE */ + nodemask_t *nodemask_thisnode; #ifdef CONFIG_FLAT_NODE_MEM_MAP struct page *node_mem_map; #endif diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c linux-2.6.24-rc2-mm1-045_use_static_nodemask/mm/page_alloc.c --- linux-2.6.24-rc2-mm1-040_use_one_zonelist/mm/page_alloc.c
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 20 Nov 2007, Mel Gorman wrote: Went back and revisited this. Allocating them at boot-time is below but essentially it is a silly and it makes sense to just have two zonelists where one of them is for __GFP_THISNODE. Implementation wise, this involves dropping the last patch in the set and the overall result is still a reduction in the number of zonelists. Allright with me. Andrew could we get this patchset merged? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 20 Nov 2007, Mel Gorman wrote: Hold off testing for the moment. Getting all the corner cases right for __GFP_THISNODE has turned too complicated to be considered as part of a larger patchset. I believe it makes sense to drop the final patch and settle with having two zonelists. One of these will be for __GFP_THISNODE allocations. We can then tackle removing that zonelist at a later date. Ack. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (20/11/07 12:18), Christoph Lameter didst pronounce: On Tue, 20 Nov 2007, Mel Gorman wrote: Went back and revisited this. Allocating them at boot-time is below but essentially it is a silly and it makes sense to just have two zonelists where one of them is for __GFP_THISNODE. Implementation wise, this involves dropping the last patch in the set and the overall result is still a reduction in the number of zonelists. Allright with me. Andrew could we get this patchset merged? They will not merge cleanly with a recent tree. I expect to be ready to post a new set when regression tests complete later this evening. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 20 Nov 2007 12:18:10 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] wrote: On Tue, 20 Nov 2007, Mel Gorman wrote: Went back and revisited this. Allocating them at boot-time is below but essentially it is a silly and it makes sense to just have two zonelists where one of them is for __GFP_THISNODE. Implementation wise, this involves dropping the last patch in the set and the overall result is still a reduction in the number of zonelists. Allright with me. Andrew could we get this patchset merged? uhm, maybe. It's getting toward the time when we should try to get -mm vaguely compiling and booting on some machines, which means stopping merging new stuff. I left that too late in the 2.6.23 cycle. otoh it'd be nice to get mainline fixed up a bit too :( - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 20 Nov 2007, Andrew Morton wrote: uhm, maybe. It's getting toward the time when we should try to get -mm vaguely compiling and booting on some machines, which means stopping merging new stuff. I left that too late in the 2.6.23 cycle. Huh? mm1 works fine here. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Sun, 11 Nov 2007, Mel Gorman wrote: > If MPOL_BIND is in effect, the allocation will be filtered based on the > current allowed nodemask. If they specify THISNODE and the specified > node or current node is not in the mask, I would expect the allocation > to fail. Is that unexpected to anybody? Currently GFP_THISNODE with MPOL_BIND results an allocation on the first node. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Sun, 11 Nov 2007, Mel Gorman wrote: If MPOL_BIND is in effect, the allocation will be filtered based on the current allowed nodemask. If they specify THISNODE and the specified node or current node is not in the mask, I would expect the allocation to fail. Is that unexpected to anybody? Currently GFP_THISNODE with MPOL_BIND results an allocation on the first node. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (09/11/07 09:26), Christoph Lameter didst pronounce: > On Fri, 9 Nov 2007, Lee Schermerhorn wrote: > > > > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there > > > is no nid to base the allocation on, so we "fallback" to numa_node_id() > > > [ almost like the nid had been specified as -1 ]. > > > > > > So I guess this is logical -- but I wonder, do we have any callers of > > > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when > > > alloc_pages_node() exists? > > > > I don't know if we have any current callers that do this, but absent any > > documentation specifying otherwise, Mel's implementation matches what > > I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. > > However, we could specify that THISNODE is ignored in __alloc_pages() > > and recommend the use of alloc_pages_node() passing numa_node_id() as > > the nid parameter to achieve the behavior. This would eliminate the > > check for 'THISNODE in __alloc_pages(). Just mask it off before calling > > down to __alloc_pages_internal(). > > > > Does this make sense? > > I like consistency. If someone absolutely wants a local page then > specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. > Agreed. > What happens though if an MPOL_BIND policy is in effect? The node used > must then be the nearest node from the policy mask > If MPOL_BIND is in effect, the allocation will be filtered based on the current allowed nodemask. If they specify THISNODE and the specified node or current node is not in the mask, I would expect the allocation to fail. Is that unexpected to anybody? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (09/11/07 09:26), Christoph Lameter didst pronounce: On Fri, 9 Nov 2007, Lee Schermerhorn wrote: On the other hand, if we call alloc_pages() with GFP_THISNODE set, there is no nid to base the allocation on, so we fallback to numa_node_id() [ almost like the nid had been specified as -1 ]. So I guess this is logical -- but I wonder, do we have any callers of alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when alloc_pages_node() exists? I don't know if we have any current callers that do this, but absent any documentation specifying otherwise, Mel's implementation matches what I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. However, we could specify that THISNODE is ignored in __alloc_pages() and recommend the use of alloc_pages_node() passing numa_node_id() as the nid parameter to achieve the behavior. This would eliminate the check for 'THISNODE in __alloc_pages(). Just mask it off before calling down to __alloc_pages_internal(). Does this make sense? I like consistency. If someone absolutely wants a local page then specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. Agreed. What happens though if an MPOL_BIND policy is in effect? The node used must then be the nearest node from the policy mask If MPOL_BIND is in effect, the allocation will be filtered based on the current allowed nodemask. If they specify THISNODE and the specified node or current node is not in the mask, I would expect the allocation to fail. Is that unexpected to anybody? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Fri, 9 Nov 2007, Nishanth Aravamudan wrote: > > Indeed, this probably needs to be validated... Sigh, more interleaving > > of policies and everything else... > > Hrm, more importantly, isn't this an existing issue? Maybe should be > resolved separately from the one zonelist patches. GFP_THISNODE with alloc_pages() currently yields an allocation from the first node of the MPOL_BIND zonelist. So its the lowest node of the set. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.11.2007 [10:16:07 -0800], Nishanth Aravamudan wrote: > On 09.11.2007 [09:26:01 -0800], Christoph Lameter wrote: > > On Fri, 9 Nov 2007, Lee Schermerhorn wrote: > > > > > > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there > > > > is no nid to base the allocation on, so we "fallback" to numa_node_id() > > > > [ almost like the nid had been specified as -1 ]. > > > > > > > > So I guess this is logical -- but I wonder, do we have any callers of > > > > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when > > > > alloc_pages_node() exists? > > > > > > I don't know if we have any current callers that do this, but absent any > > > documentation specifying otherwise, Mel's implementation matches what > > > I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. > > > However, we could specify that THISNODE is ignored in __alloc_pages() > > > and recommend the use of alloc_pages_node() passing numa_node_id() as > > > the nid parameter to achieve the behavior. This would eliminate the > > > check for 'THISNODE in __alloc_pages(). Just mask it off before calling > > > down to __alloc_pages_internal(). > > > > > > Does this make sense? > > > > I like consistency. If someone absolutely wants a local page then > > specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. > > Fair enough. > > > What happens though if an MPOL_BIND policy is in effect? The node used > > must then be the nearest node from the policy mask > > Indeed, this probably needs to be validated... Sigh, more interleaving > of policies and everything else... Hrm, more importantly, isn't this an existing issue? Maybe should be resolved separately from the one zonelist patches. -Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.11.2007 [12:18:52 -0500], Lee Schermerhorn wrote: > On Fri, 2007-11-09 at 08:45 -0800, Nishanth Aravamudan wrote: > > On 09.11.2007 [16:14:55 +], Mel Gorman wrote: > > > On (09/11/07 07:45), Christoph Lameter didst pronounce: > > > > On Fri, 9 Nov 2007, Mel Gorman wrote: > > > > > > > > > struct page * fastcall > > > > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > > > > struct zonelist *zonelist) > > > > > { > > > > > + /* > > > > > + * Use a temporary nodemask for __GFP_THISNODE allocations. If > > > > > the > > > > > + * cost of allocating on the stack or the stack usage becomes > > > > > + * noticable, allocate the nodemasks per node at boot or > > > > > compile time > > > > > + */ > > > > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > > > > + nodemask_t nodemask; > > > > > > > > Hmmm.. This places a potentially big structure on the stack. nodemask > > > > can > > > > contain up to 1024 bits which means 128 bytes. Maybe keep an array of > > > > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? > > > > > > > > > > That is what I was hinting at in the comment as a possible solution. > > > > > > > > + > > > > > + return __alloc_pages_internal(gfp_mask, order, > > > > > + zonelist, nodemask_thisnode(numa_node_id(), > > > > > )); > > > > > > > > Argh GFP_THISNODE must use the nid passed to alloc_pages_node > > > > and *not* the local numa node id. Only if the node specified to > > > > alloc_pages nodes is -1 will this work. > > > > > > > > > > alloc_pages_node() calls __alloc_pages_nodemask() though where in this > > > function if I'm reading it right is called without a node id. Given no > > > other details on the nid, the current one seemed a logical choice. > > > > Yeah, I guess the context here matters (and is a little hard to follow > > because thare are a few places that change in different ways here): > > > > For allocating pages from a particular node (GFP_THISNODE with nid), > > the nid clearly must be specified. This only happens with > > alloc_pages_node(), AFAICT. So, in that interface, the right thing is > > done and the appropriate nodemask will be built. > > I agree. In an earlier patch, Mel was ignoring nid and using > numa_node_id() here. This was causing your [Nish's] hugetlb pool > allocation patches to fail. Mel fixed that ~9oct07. Yep, and that's why I'm on the Cc, I think :) > > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there > > is no nid to base the allocation on, so we "fallback" to numa_node_id() > > [ almost like the nid had been specified as -1 ]. > > > > So I guess this is logical -- but I wonder, do we have any callers of > > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when > > alloc_pages_node() exists? > > I don't know if we have any current callers that do this, but absent any > documentation specifying otherwise, Mel's implementation matches what > I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. > However, we could specify that THISNODE is ignored in __alloc_pages() > and recommend the use of alloc_pages_node() passing numa_node_id() as > the nid parameter to achieve the behavior. This would eliminate the > check for 'THISNODE in __alloc_pages(). Just mask it off before calling > down to __alloc_pages_internal(). > > Does this make sense? The caller could also just use -1 as the nid, since then alloc_pages_node() should do the numa_node_id() for the caller... But I agree, there is no documentation saying GFP_THISNODE is *not* allowed for alloc_pages(), so we should probably handle it -Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.11.2007 [09:26:01 -0800], Christoph Lameter wrote: > On Fri, 9 Nov 2007, Lee Schermerhorn wrote: > > > > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there > > > is no nid to base the allocation on, so we "fallback" to numa_node_id() > > > [ almost like the nid had been specified as -1 ]. > > > > > > So I guess this is logical -- but I wonder, do we have any callers of > > > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when > > > alloc_pages_node() exists? > > > > I don't know if we have any current callers that do this, but absent any > > documentation specifying otherwise, Mel's implementation matches what > > I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. > > However, we could specify that THISNODE is ignored in __alloc_pages() > > and recommend the use of alloc_pages_node() passing numa_node_id() as > > the nid parameter to achieve the behavior. This would eliminate the > > check for 'THISNODE in __alloc_pages(). Just mask it off before calling > > down to __alloc_pages_internal(). > > > > Does this make sense? > > I like consistency. If someone absolutely wants a local page then > specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. Fair enough. > What happens though if an MPOL_BIND policy is in effect? The node used > must then be the nearest node from the policy mask Indeed, this probably needs to be validated... Sigh, more interleaving of policies and everything else... -Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Fri, 9 Nov 2007, Lee Schermerhorn wrote: > > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there > > is no nid to base the allocation on, so we "fallback" to numa_node_id() > > [ almost like the nid had been specified as -1 ]. > > > > So I guess this is logical -- but I wonder, do we have any callers of > > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when > > alloc_pages_node() exists? > > I don't know if we have any current callers that do this, but absent any > documentation specifying otherwise, Mel's implementation matches what > I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. > However, we could specify that THISNODE is ignored in __alloc_pages() > and recommend the use of alloc_pages_node() passing numa_node_id() as > the nid parameter to achieve the behavior. This would eliminate the > check for 'THISNODE in __alloc_pages(). Just mask it off before calling > down to __alloc_pages_internal(). > > Does this make sense? I like consistency. If someone absolutely wants a local page then specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. What happens though if an MPOL_BIND policy is in effect? The node used must then be the nearest node from the policy mask - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Fri, 2007-11-09 at 08:45 -0800, Nishanth Aravamudan wrote: > On 09.11.2007 [16:14:55 +], Mel Gorman wrote: > > On (09/11/07 07:45), Christoph Lameter didst pronounce: > > > On Fri, 9 Nov 2007, Mel Gorman wrote: > > > > > > > struct page * fastcall > > > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > > > struct zonelist *zonelist) > > > > { > > > > + /* > > > > +* Use a temporary nodemask for __GFP_THISNODE allocations. If > > > > the > > > > +* cost of allocating on the stack or the stack usage becomes > > > > +* noticable, allocate the nodemasks per node at boot or > > > > compile time > > > > +*/ > > > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > > > + nodemask_t nodemask; > > > > > > Hmmm.. This places a potentially big structure on the stack. nodemask can > > > contain up to 1024 bits which means 128 bytes. Maybe keep an array of > > > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? > > > > > > > That is what I was hinting at in the comment as a possible solution. > > > > > > + > > > > + return __alloc_pages_internal(gfp_mask, order, > > > > + zonelist, nodemask_thisnode(numa_node_id(), > > > > )); > > > > > > Argh GFP_THISNODE must use the nid passed to alloc_pages_node > > > and *not* the local numa node id. Only if the node specified to > > > alloc_pages nodes is -1 will this work. > > > > > > > alloc_pages_node() calls __alloc_pages_nodemask() though where in this > > function if I'm reading it right is called without a node id. Given no > > other details on the nid, the current one seemed a logical choice. > > Yeah, I guess the context here matters (and is a little hard to follow > because thare are a few places that change in different ways here): > > For allocating pages from a particular node (GFP_THISNODE with nid), > the nid clearly must be specified. This only happens with > alloc_pages_node(), AFAICT. So, in that interface, the right thing is > done and the appropriate nodemask will be built. I agree. In an earlier patch, Mel was ignoring nid and using numa_node_id() here. This was causing your [Nish's] hugetlb pool allocation patches to fail. Mel fixed that ~9oct07. > > On the other hand, if we call alloc_pages() with GFP_THISNODE set, there > is no nid to base the allocation on, so we "fallback" to numa_node_id() > [ almost like the nid had been specified as -1 ]. > > So I guess this is logical -- but I wonder, do we have any callers of > alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when > alloc_pages_node() exists? I don't know if we have any current callers that do this, but absent any documentation specifying otherwise, Mel's implementation matches what I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. However, we could specify that THISNODE is ignored in __alloc_pages() and recommend the use of alloc_pages_node() passing numa_node_id() as the nid parameter to achieve the behavior. This would eliminate the check for 'THISNODE in __alloc_pages(). Just mask it off before calling down to __alloc_pages_internal(). Does this make sense? Lee - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.11.2007 [16:14:55 +], Mel Gorman wrote: > On (09/11/07 07:45), Christoph Lameter didst pronounce: > > On Fri, 9 Nov 2007, Mel Gorman wrote: > > > > > struct page * fastcall > > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > > struct zonelist *zonelist) > > > { > > > + /* > > > + * Use a temporary nodemask for __GFP_THISNODE allocations. If the > > > + * cost of allocating on the stack or the stack usage becomes > > > + * noticable, allocate the nodemasks per node at boot or compile time > > > + */ > > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > > + nodemask_t nodemask; > > > > Hmmm.. This places a potentially big structure on the stack. nodemask can > > contain up to 1024 bits which means 128 bytes. Maybe keep an array of > > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? > > > > That is what I was hinting at in the comment as a possible solution. > > > > + > > > + return __alloc_pages_internal(gfp_mask, order, > > > + zonelist, nodemask_thisnode(numa_node_id(), )); > > > > Argh GFP_THISNODE must use the nid passed to alloc_pages_node > > and *not* the local numa node id. Only if the node specified to > > alloc_pages nodes is -1 will this work. > > > > alloc_pages_node() calls __alloc_pages_nodemask() though where in this > function if I'm reading it right is called without a node id. Given no > other details on the nid, the current one seemed a logical choice. Yeah, I guess the context here matters (and is a little hard to follow because thare are a few places that change in different ways here): For allocating pages from a particular node (GFP_THISNODE with nid), the nid clearly must be specified. This only happens with alloc_pages_node(), AFAICT. So, in that interface, the right thing is done and the appropriate nodemask will be built. On the other hand, if we call alloc_pages() with GFP_THISNODE set, there is no nid to base the allocation on, so we "fallback" to numa_node_id() [ almost like the nid had been specified as -1 ]. So I guess this is logical -- but I wonder, do we have any callers of alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when alloc_pages_node() exists? Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Fri, 9 Nov 2007, Mel Gorman wrote: > > Argh GFP_THISNODE must use the nid passed to alloc_pages_node and > > *not* the local numa node id. Only if the node specified to alloc_pages > > nodes is -1 will this work. > > > > alloc_pages_node() calls __alloc_pages_nodemask() though where in this > function if I'm reading it right is called without a node id. Given no > other details on the nid, the current one seemed a logical choice. If the function only called when the first zone == numa_node_id then the use of numa_node_id here is okay. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (09/11/07 07:45), Christoph Lameter didst pronounce: > On Fri, 9 Nov 2007, Mel Gorman wrote: > > > struct page * fastcall > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > struct zonelist *zonelist) > > { > > + /* > > +* Use a temporary nodemask for __GFP_THISNODE allocations. If the > > +* cost of allocating on the stack or the stack usage becomes > > +* noticable, allocate the nodemasks per node at boot or compile time > > +*/ > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > + nodemask_t nodemask; > > Hmmm.. This places a potentially big structure on the stack. nodemask can > contain up to 1024 bits which means 128 bytes. Maybe keep an array of > gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? > That is what I was hinting at in the comment as a possible solution. > > + > > + return __alloc_pages_internal(gfp_mask, order, > > + zonelist, nodemask_thisnode(numa_node_id(), )); > > Argh GFP_THISNODE must use the nid passed to alloc_pages_node and > *not* the local numa node id. Only if the node specified to alloc_pages > nodes is -1 will this work. > alloc_pages_node() calls __alloc_pages_nodemask() though where in this function if I'm reading it right is called without a node id. Given no other details on the nid, the current one seemed a logical choice. What I did notice when rechecking is I left the warning about THISNODE in by accident :( -- -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Fri, 9 Nov 2007, Mel Gorman wrote: > struct page * fastcall > __alloc_pages(gfp_t gfp_mask, unsigned int order, > struct zonelist *zonelist) > { > + /* > + * Use a temporary nodemask for __GFP_THISNODE allocations. If the > + * cost of allocating on the stack or the stack usage becomes > + * noticable, allocate the nodemasks per node at boot or compile time > + */ > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > + nodemask_t nodemask; Hmmm.. This places a potentially big structure on the stack. nodemask can contain up to 1024 bits which means 128 bytes. Maybe keep an array of gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? > + > + return __alloc_pages_internal(gfp_mask, order, > + zonelist, nodemask_thisnode(numa_node_id(), )); Argh GFP_THISNODE must use the nid passed to alloc_pages_node and *not* the local numa node id. Only if the node specified to alloc_pages nodes is -1 will this work. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Fri, 9 Nov 2007, Mel Gorman wrote: struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* + * Use a temporary nodemask for __GFP_THISNODE allocations. If the + * cost of allocating on the stack or the stack usage becomes + * noticable, allocate the nodemasks per node at boot or compile time + */ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; Hmmm.. This places a potentially big structure on the stack. nodemask can contain up to 1024 bits which means 128 bytes. Maybe keep an array of gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(numa_node_id(), nodemask)); Argh GFP_THISNODE must use the nid passed to alloc_pages_node and *not* the local numa node id. Only if the node specified to alloc_pages nodes is -1 will this work. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (09/11/07 07:45), Christoph Lameter didst pronounce: On Fri, 9 Nov 2007, Mel Gorman wrote: struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* +* Use a temporary nodemask for __GFP_THISNODE allocations. If the +* cost of allocating on the stack or the stack usage becomes +* noticable, allocate the nodemasks per node at boot or compile time +*/ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; Hmmm.. This places a potentially big structure on the stack. nodemask can contain up to 1024 bits which means 128 bytes. Maybe keep an array of gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? That is what I was hinting at in the comment as a possible solution. + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(numa_node_id(), nodemask)); Argh GFP_THISNODE must use the nid passed to alloc_pages_node and *not* the local numa node id. Only if the node specified to alloc_pages nodes is -1 will this work. alloc_pages_node() calls __alloc_pages_nodemask() though where in this function if I'm reading it right is called without a node id. Given no other details on the nid, the current one seemed a logical choice. What I did notice when rechecking is I left the warning about THISNODE in by accident :( -- -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Fri, 9 Nov 2007, Mel Gorman wrote: Argh GFP_THISNODE must use the nid passed to alloc_pages_node and *not* the local numa node id. Only if the node specified to alloc_pages nodes is -1 will this work. alloc_pages_node() calls __alloc_pages_nodemask() though where in this function if I'm reading it right is called without a node id. Given no other details on the nid, the current one seemed a logical choice. If the function only called when the first zone == numa_node_id then the use of numa_node_id here is okay. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.11.2007 [16:14:55 +], Mel Gorman wrote: On (09/11/07 07:45), Christoph Lameter didst pronounce: On Fri, 9 Nov 2007, Mel Gorman wrote: struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* + * Use a temporary nodemask for __GFP_THISNODE allocations. If the + * cost of allocating on the stack or the stack usage becomes + * noticable, allocate the nodemasks per node at boot or compile time + */ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; Hmmm.. This places a potentially big structure on the stack. nodemask can contain up to 1024 bits which means 128 bytes. Maybe keep an array of gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? That is what I was hinting at in the comment as a possible solution. + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(numa_node_id(), nodemask)); Argh GFP_THISNODE must use the nid passed to alloc_pages_node and *not* the local numa node id. Only if the node specified to alloc_pages nodes is -1 will this work. alloc_pages_node() calls __alloc_pages_nodemask() though where in this function if I'm reading it right is called without a node id. Given no other details on the nid, the current one seemed a logical choice. Yeah, I guess the context here matters (and is a little hard to follow because thare are a few places that change in different ways here): For allocating pages from a particular node (GFP_THISNODE with nid), the nid clearly must be specified. This only happens with alloc_pages_node(), AFAICT. So, in that interface, the right thing is done and the appropriate nodemask will be built. On the other hand, if we call alloc_pages() with GFP_THISNODE set, there is no nid to base the allocation on, so we fallback to numa_node_id() [ almost like the nid had been specified as -1 ]. So I guess this is logical -- but I wonder, do we have any callers of alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when alloc_pages_node() exists? Thanks, Nish -- Nishanth Aravamudan [EMAIL PROTECTED] IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Fri, 2007-11-09 at 08:45 -0800, Nishanth Aravamudan wrote: On 09.11.2007 [16:14:55 +], Mel Gorman wrote: On (09/11/07 07:45), Christoph Lameter didst pronounce: On Fri, 9 Nov 2007, Mel Gorman wrote: struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* +* Use a temporary nodemask for __GFP_THISNODE allocations. If the +* cost of allocating on the stack or the stack usage becomes +* noticable, allocate the nodemasks per node at boot or compile time +*/ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; Hmmm.. This places a potentially big structure on the stack. nodemask can contain up to 1024 bits which means 128 bytes. Maybe keep an array of gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? That is what I was hinting at in the comment as a possible solution. + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(numa_node_id(), nodemask)); Argh GFP_THISNODE must use the nid passed to alloc_pages_node and *not* the local numa node id. Only if the node specified to alloc_pages nodes is -1 will this work. alloc_pages_node() calls __alloc_pages_nodemask() though where in this function if I'm reading it right is called without a node id. Given no other details on the nid, the current one seemed a logical choice. Yeah, I guess the context here matters (and is a little hard to follow because thare are a few places that change in different ways here): For allocating pages from a particular node (GFP_THISNODE with nid), the nid clearly must be specified. This only happens with alloc_pages_node(), AFAICT. So, in that interface, the right thing is done and the appropriate nodemask will be built. I agree. In an earlier patch, Mel was ignoring nid and using numa_node_id() here. This was causing your [Nish's] hugetlb pool allocation patches to fail. Mel fixed that ~9oct07. On the other hand, if we call alloc_pages() with GFP_THISNODE set, there is no nid to base the allocation on, so we fallback to numa_node_id() [ almost like the nid had been specified as -1 ]. So I guess this is logical -- but I wonder, do we have any callers of alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when alloc_pages_node() exists? I don't know if we have any current callers that do this, but absent any documentation specifying otherwise, Mel's implementation matches what I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. However, we could specify that THISNODE is ignored in __alloc_pages() and recommend the use of alloc_pages_node() passing numa_node_id() as the nid parameter to achieve the behavior. This would eliminate the check for 'THISNODE in __alloc_pages(). Just mask it off before calling down to __alloc_pages_internal(). Does this make sense? Lee - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Fri, 9 Nov 2007, Lee Schermerhorn wrote: On the other hand, if we call alloc_pages() with GFP_THISNODE set, there is no nid to base the allocation on, so we fallback to numa_node_id() [ almost like the nid had been specified as -1 ]. So I guess this is logical -- but I wonder, do we have any callers of alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when alloc_pages_node() exists? I don't know if we have any current callers that do this, but absent any documentation specifying otherwise, Mel's implementation matches what I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. However, we could specify that THISNODE is ignored in __alloc_pages() and recommend the use of alloc_pages_node() passing numa_node_id() as the nid parameter to achieve the behavior. This would eliminate the check for 'THISNODE in __alloc_pages(). Just mask it off before calling down to __alloc_pages_internal(). Does this make sense? I like consistency. If someone absolutely wants a local page then specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. What happens though if an MPOL_BIND policy is in effect? The node used must then be the nearest node from the policy mask - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.11.2007 [12:18:52 -0500], Lee Schermerhorn wrote: On Fri, 2007-11-09 at 08:45 -0800, Nishanth Aravamudan wrote: On 09.11.2007 [16:14:55 +], Mel Gorman wrote: On (09/11/07 07:45), Christoph Lameter didst pronounce: On Fri, 9 Nov 2007, Mel Gorman wrote: struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* + * Use a temporary nodemask for __GFP_THISNODE allocations. If the + * cost of allocating on the stack or the stack usage becomes + * noticable, allocate the nodemasks per node at boot or compile time + */ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; Hmmm.. This places a potentially big structure on the stack. nodemask can contain up to 1024 bits which means 128 bytes. Maybe keep an array of gfp_thisnode nodemasks (node_nodemask?) and use node_nodemask[nid]? That is what I was hinting at in the comment as a possible solution. + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(numa_node_id(), nodemask)); Argh GFP_THISNODE must use the nid passed to alloc_pages_node and *not* the local numa node id. Only if the node specified to alloc_pages nodes is -1 will this work. alloc_pages_node() calls __alloc_pages_nodemask() though where in this function if I'm reading it right is called without a node id. Given no other details on the nid, the current one seemed a logical choice. Yeah, I guess the context here matters (and is a little hard to follow because thare are a few places that change in different ways here): For allocating pages from a particular node (GFP_THISNODE with nid), the nid clearly must be specified. This only happens with alloc_pages_node(), AFAICT. So, in that interface, the right thing is done and the appropriate nodemask will be built. I agree. In an earlier patch, Mel was ignoring nid and using numa_node_id() here. This was causing your [Nish's] hugetlb pool allocation patches to fail. Mel fixed that ~9oct07. Yep, and that's why I'm on the Cc, I think :) On the other hand, if we call alloc_pages() with GFP_THISNODE set, there is no nid to base the allocation on, so we fallback to numa_node_id() [ almost like the nid had been specified as -1 ]. So I guess this is logical -- but I wonder, do we have any callers of alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when alloc_pages_node() exists? I don't know if we have any current callers that do this, but absent any documentation specifying otherwise, Mel's implementation matches what I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. However, we could specify that THISNODE is ignored in __alloc_pages() and recommend the use of alloc_pages_node() passing numa_node_id() as the nid parameter to achieve the behavior. This would eliminate the check for 'THISNODE in __alloc_pages(). Just mask it off before calling down to __alloc_pages_internal(). Does this make sense? The caller could also just use -1 as the nid, since then alloc_pages_node() should do the numa_node_id() for the caller... But I agree, there is no documentation saying GFP_THISNODE is *not* allowed for alloc_pages(), so we should probably handle it -Nish -- Nishanth Aravamudan [EMAIL PROTECTED] IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.11.2007 [09:26:01 -0800], Christoph Lameter wrote: On Fri, 9 Nov 2007, Lee Schermerhorn wrote: On the other hand, if we call alloc_pages() with GFP_THISNODE set, there is no nid to base the allocation on, so we fallback to numa_node_id() [ almost like the nid had been specified as -1 ]. So I guess this is logical -- but I wonder, do we have any callers of alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when alloc_pages_node() exists? I don't know if we have any current callers that do this, but absent any documentation specifying otherwise, Mel's implementation matches what I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. However, we could specify that THISNODE is ignored in __alloc_pages() and recommend the use of alloc_pages_node() passing numa_node_id() as the nid parameter to achieve the behavior. This would eliminate the check for 'THISNODE in __alloc_pages(). Just mask it off before calling down to __alloc_pages_internal(). Does this make sense? I like consistency. If someone absolutely wants a local page then specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. Fair enough. What happens though if an MPOL_BIND policy is in effect? The node used must then be the nearest node from the policy mask Indeed, this probably needs to be validated... Sigh, more interleaving of policies and everything else... -Nish -- Nishanth Aravamudan [EMAIL PROTECTED] IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Fri, 9 Nov 2007, Nishanth Aravamudan wrote: Indeed, this probably needs to be validated... Sigh, more interleaving of policies and everything else... Hrm, more importantly, isn't this an existing issue? Maybe should be resolved separately from the one zonelist patches. GFP_THISNODE with alloc_pages() currently yields an allocation from the first node of the MPOL_BIND zonelist. So its the lowest node of the set. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.11.2007 [10:16:07 -0800], Nishanth Aravamudan wrote: On 09.11.2007 [09:26:01 -0800], Christoph Lameter wrote: On Fri, 9 Nov 2007, Lee Schermerhorn wrote: On the other hand, if we call alloc_pages() with GFP_THISNODE set, there is no nid to base the allocation on, so we fallback to numa_node_id() [ almost like the nid had been specified as -1 ]. So I guess this is logical -- but I wonder, do we have any callers of alloc_pages(GFP_THISNODE) ? It seems like an odd thing to do, when alloc_pages_node() exists? I don't know if we have any current callers that do this, but absent any documentation specifying otherwise, Mel's implementation matches what I'd expect the behavior to be if I DID call alloc_pages with 'THISNODE. However, we could specify that THISNODE is ignored in __alloc_pages() and recommend the use of alloc_pages_node() passing numa_node_id() as the nid parameter to achieve the behavior. This would eliminate the check for 'THISNODE in __alloc_pages(). Just mask it off before calling down to __alloc_pages_internal(). Does this make sense? I like consistency. If someone absolutely wants a local page then specifying GFP_THISNODE to __alloc_pages is okay. Leave as is I guess. Fair enough. What happens though if an MPOL_BIND policy is in effect? The node used must then be the nearest node from the policy mask Indeed, this probably needs to be validated... Sigh, more interleaving of policies and everything else... Hrm, more importantly, isn't this an existing issue? Maybe should be resolved separately from the one zonelist patches. -Nish -- Nishanth Aravamudan [EMAIL PROTECTED] IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (10/10/07 11:53), Lee Schermerhorn didst pronounce: > On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote: > > > > > Subject: Use specified node ID with GFP_THISNODE if available > > > > It had been assumed that __GFP_THISNODE meant allocating from the local > > node and only the local node. However, users of alloc_pages_node() may also > > specify GFP_THISNODE. In this case, only the specified node should be used. > > This patch will allocate pages only from the requested node when > > GFP_THISNODE > > is used with alloc_pages_node(). > > > > [EMAIL PROTECTED]: Detailed analysis of problem] > > Found-by: Lee Schermerhorn <[EMAIL PROTECTED]> > > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]> > > > > > Mel: I applied this patch [to your v8 series--the most recent, I > think?] and it does fix the problem. However, now I'm tripping over > this warning in __alloc_pages_nodemask: > > /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */ > WARN_ON(gfp_mask & __GFP_THISNODE); > > for each huge page allocated. Rather slow as my console is a virtual > serial line and the warning includes the stack traceback. > > I think we want to just drop this warning, but maybe you have a tighter > condition that you want to warn about? > I should drop the warning. The nature of the comment and the WARN_ON was rooted in my belief that "THISNODE means this node I am running on" and the warning was defensive programming just in case the assumption was broken. Now we know the assumption was wrong and the warning is bogus. Thanks Lee. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 10.10.2007 [11:53:40 -0400], Lee Schermerhorn wrote: > On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote: > > > > > Subject: Use specified node ID with GFP_THISNODE if available > > > > It had been assumed that __GFP_THISNODE meant allocating from the local > > node and only the local node. However, users of alloc_pages_node() may also > > specify GFP_THISNODE. In this case, only the specified node should be used. > > This patch will allocate pages only from the requested node when > > GFP_THISNODE > > is used with alloc_pages_node(). > > > > [EMAIL PROTECTED]: Detailed analysis of problem] > > Found-by: Lee Schermerhorn <[EMAIL PROTECTED]> > > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]> > > > > > Mel: I applied this patch [to your v8 series--the most recent, I > think?] and it does fix the problem. However, now I'm tripping over > this warning in __alloc_pages_nodemask: > > /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */ > WARN_ON(gfp_mask & __GFP_THISNODE); > > for each huge page allocated. Rather slow as my console is a virtual > serial line and the warning includes the stack traceback. > > I think we want to just drop this warning, but maybe you have a tighter > condition that you want to warn about? Sigh, sorry Mel. I see this too on my box. I purely checked the functionality and didn't think to check the logs, as the tests worked :/ I think it's quite clear that the WARN_ON() makes no sense now, since alloc_pages_node() now calls __alloc_pages_nodemask(). -Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote: > > Subject: Use specified node ID with GFP_THISNODE if available > > It had been assumed that __GFP_THISNODE meant allocating from the local > node and only the local node. However, users of alloc_pages_node() may also > specify GFP_THISNODE. In this case, only the specified node should be used. > This patch will allocate pages only from the requested node when GFP_THISNODE > is used with alloc_pages_node(). > > [EMAIL PROTECTED]: Detailed analysis of problem] > Found-by: Lee Schermerhorn <[EMAIL PROTECTED]> > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]> > Mel: I applied this patch [to your v8 series--the most recent, I think?] and it does fix the problem. However, now I'm tripping over this warning in __alloc_pages_nodemask: /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */ WARN_ON(gfp_mask & __GFP_THISNODE); for each huge page allocated. Rather slow as my console is a virtual serial line and the warning includes the stack traceback. I think we want to just drop this warning, but maybe you have a tighter condition that you want to warn about? Lee - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote: snip Subject: Use specified node ID with GFP_THISNODE if available It had been assumed that __GFP_THISNODE meant allocating from the local node and only the local node. However, users of alloc_pages_node() may also specify GFP_THISNODE. In this case, only the specified node should be used. This patch will allocate pages only from the requested node when GFP_THISNODE is used with alloc_pages_node(). [EMAIL PROTECTED]: Detailed analysis of problem] Found-by: Lee Schermerhorn [EMAIL PROTECTED] Signed-off-by: Mel Gorman [EMAIL PROTECTED] snip Mel: I applied this patch [to your v8 series--the most recent, I think?] and it does fix the problem. However, now I'm tripping over this warning in __alloc_pages_nodemask: /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */ WARN_ON(gfp_mask __GFP_THISNODE); for each huge page allocated. Rather slow as my console is a virtual serial line and the warning includes the stack traceback. I think we want to just drop this warning, but maybe you have a tighter condition that you want to warn about? Lee - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 10.10.2007 [11:53:40 -0400], Lee Schermerhorn wrote: On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote: snip Subject: Use specified node ID with GFP_THISNODE if available It had been assumed that __GFP_THISNODE meant allocating from the local node and only the local node. However, users of alloc_pages_node() may also specify GFP_THISNODE. In this case, only the specified node should be used. This patch will allocate pages only from the requested node when GFP_THISNODE is used with alloc_pages_node(). [EMAIL PROTECTED]: Detailed analysis of problem] Found-by: Lee Schermerhorn [EMAIL PROTECTED] Signed-off-by: Mel Gorman [EMAIL PROTECTED] snip Mel: I applied this patch [to your v8 series--the most recent, I think?] and it does fix the problem. However, now I'm tripping over this warning in __alloc_pages_nodemask: /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */ WARN_ON(gfp_mask __GFP_THISNODE); for each huge page allocated. Rather slow as my console is a virtual serial line and the warning includes the stack traceback. I think we want to just drop this warning, but maybe you have a tighter condition that you want to warn about? Sigh, sorry Mel. I see this too on my box. I purely checked the functionality and didn't think to check the logs, as the tests worked :/ I think it's quite clear that the WARN_ON() makes no sense now, since alloc_pages_node() now calls __alloc_pages_nodemask(). -Nish -- Nishanth Aravamudan [EMAIL PROTECTED] IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On (10/10/07 11:53), Lee Schermerhorn didst pronounce: On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote: snip Subject: Use specified node ID with GFP_THISNODE if available It had been assumed that __GFP_THISNODE meant allocating from the local node and only the local node. However, users of alloc_pages_node() may also specify GFP_THISNODE. In this case, only the specified node should be used. This patch will allocate pages only from the requested node when GFP_THISNODE is used with alloc_pages_node(). [EMAIL PROTECTED]: Detailed analysis of problem] Found-by: Lee Schermerhorn [EMAIL PROTECTED] Signed-off-by: Mel Gorman [EMAIL PROTECTED] snip Mel: I applied this patch [to your v8 series--the most recent, I think?] and it does fix the problem. However, now I'm tripping over this warning in __alloc_pages_nodemask: /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */ WARN_ON(gfp_mask __GFP_THISNODE); for each huge page allocated. Rather slow as my console is a virtual serial line and the warning includes the stack traceback. I think we want to just drop this warning, but maybe you have a tighter condition that you want to warn about? I should drop the warning. The nature of the comment and the WARN_ON was rooted in my belief that THISNODE means this node I am running on and the warning was defensive programming just in case the assumption was broken. Now we know the assumption was wrong and the warning is bogus. Thanks Lee. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 9 Oct 2007, Nishanth Aravamudan wrote: > > > And nodemask_thisnode() always gives us a nodemask with only the node > > > the current process is running on set, I think? > > > > > > > Yes, I interpreted THISNODE to mean "this node I am running on". > > Callers seemed to expect this but the memoryless needs it to be "this > > node I am running on unless I specify a node in which case I mean that > > node.". > > I think that is only true (THISNODE = local node) if the callpath is not > via alloc_pages_node(). If the callpath is via alloc_pages_node(), then > it depends on whether the nid parameter is -1 (in which case it is also > local node) or anything (in which case it is the nid specified). Ah, > reading further along, that's exactly what your changelog indicates too > :) Right. THISNODE means the node we are on or the node that we indicated we want to allocate from. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.10.2007 [16:40:53 +0100], Mel Gorman wrote: > First, sorry for being so slow to respond. I was getting ill towards the end > of last week and am worse now. Brain is in total mush as a result. Thanks > Lee for finding this problem and thanks to Nish for investigating it properly. > > Comments and candidate fix to one zonelist are below. > > On (08/10/07 18:11), Nishanth Aravamudan didst pronounce: > > On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote: > > > > > > Two zonelists exist so that GFP_THISNODE allocations will be guaranteed > > > to use memory only from a node local to the CPU. As we can now filter the > > > zonelist based on a nodemask, we filter the standard node zonelist for > > > zones > > > on the local node when GFP_THISNODE is specified. > > > > > > When GFP_THISNODE is used, a temporary nodemask is created with only the > > > node local to the CPU set. This allows us to eliminate the second > > > zonelist. > > > > > > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]> > > > Acked-by: Christoph Lameter <[EMAIL PROTECTED]> > > > > > > > > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff > > > linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h > > > linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h > > > --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h > > > 2007-09-28 15:49:57.0 +0100 > > > +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h > > > 2007-09-28 15:55:03.0 +0100 > > > > [Reordering the chunks to make my comments a little more logical] > > > > > > > > > -static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > > > +static inline struct zonelist *node_zonelist(int nid) > > > { > > > - return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > > > + return _DATA(nid)->node_zonelist; > > > } > > > > > > #ifndef HAVE_ARCH_FREE_PAGE > > > @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n > > > if (nid < 0) > > > nid = numa_node_id(); > > > > > > - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); > > > + return __alloc_pages(gfp_mask, order, node_zonelist(nid)); > > > } > > > > This is alloc_pages_node(), and converting the nid to a zonelist means > > that lower levels (specifically __alloc_pages() here) are not aware of > > nids, as far as I can tell. > > Yep, this is correct. > > > This isn't a change, I just want to make > > sure I understand... > > > > > > > > > struct page * fastcall > > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > > struct zonelist *zonelist) > > > { > > > + /* > > > + * Use a temporary nodemask for __GFP_THISNODE allocations. If the > > > + * cost of allocating on the stack or the stack usage becomes > > > + * noticable, allocate the nodemasks per node at boot or compile time > > > + */ > > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > > + nodemask_t nodemask; > > > + > > > + return __alloc_pages_internal(gfp_mask, order, > > > + zonelist, nodemask_thisnode()); > > > + } > > > + > > > return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); > > > } > > > > > > > > So alloc_pages_node() calls here and for THISNODE allocations, we go ask > > nodemask_thisnode() for a nodemask... > > > > Also correct. > > > > +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask) > > > +{ > > > + /* Build a nodemask for just this node */ > > > + int nid = numa_node_id(); > > > + > > > + nodes_clear(*nodemask); > > > + node_set(nid, *nodemask); > > > + > > > + return nodemask; > > > +} > > > > > > > > And nodemask_thisnode() always gives us a nodemask with only the node > > the current process is running on set, I think? > > > > Yes, I interpreted THISNODE to mean "this node I am running on". Callers > seemed to expect this but the memoryless needs it to be "this node I am > running on unless I specify a node in which case I mean that node.". > > > That seems really wrong -- and would explain what Lee was seeing while > > using my patches for the hugetlb pool allocator to use THISNODE > > allocations. All the allocations would end up coming from whatever node > > the process happened to be running on. This obviously messes up hugetlb > > accounting, as I rely on THISNODE requests returning NULL if they go > > off-node. > > > > I'm not sure how this would be fixed, as __alloc_pages() no longer has > > the nid to set in the mask. > > > > Am I wrong in my analysis? > > > > No, you seem to be right on the ball. Can you review the following patch > please and determine if it fixes the problem in a satisfactory manner? I > think it does and your tests seemed to give proper values with this patch > applied but brain no worky work and a second opinion is needed. > > > Subject: Use specified node ID with GFP_THISNODE if available > > It had been assumed that __GFP_THISNODE meant allocating from the local > node and only the
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.10.2007 [16:40:53 +0100], Mel Gorman wrote: > First, sorry for being so slow to respond. I was getting ill towards the end > of last week and am worse now. Brain is in total mush as a result. Thanks > Lee for finding this problem and thanks to Nish for investigating it properly. > > Comments and candidate fix to one zonelist are below. > > On (08/10/07 18:11), Nishanth Aravamudan didst pronounce: > > On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote: > > > > > > Two zonelists exist so that GFP_THISNODE allocations will be guaranteed > > > to use memory only from a node local to the CPU. As we can now filter the > > > zonelist based on a nodemask, we filter the standard node zonelist for > > > zones > > > on the local node when GFP_THISNODE is specified. > > > > > > When GFP_THISNODE is used, a temporary nodemask is created with only the > > > node local to the CPU set. This allows us to eliminate the second > > > zonelist. > > > > > > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]> > > > Acked-by: Christoph Lameter <[EMAIL PROTECTED]> > > > > > > > > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff > > > linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h > > > linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h > > > --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h > > > 2007-09-28 15:49:57.0 +0100 > > > +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h > > > 2007-09-28 15:55:03.0 +0100 > > > > [Reordering the chunks to make my comments a little more logical] > > > > > > > > > -static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > > > +static inline struct zonelist *node_zonelist(int nid) > > > { > > > - return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > > > + return _DATA(nid)->node_zonelist; > > > } > > > > > > #ifndef HAVE_ARCH_FREE_PAGE > > > @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n > > > if (nid < 0) > > > nid = numa_node_id(); > > > > > > - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); > > > + return __alloc_pages(gfp_mask, order, node_zonelist(nid)); > > > } > > > > This is alloc_pages_node(), and converting the nid to a zonelist means > > that lower levels (specifically __alloc_pages() here) are not aware of > > nids, as far as I can tell. > > Yep, this is correct. > > > This isn't a change, I just want to make > > sure I understand... > > > > > > > > > struct page * fastcall > > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > > struct zonelist *zonelist) > > > { > > > + /* > > > + * Use a temporary nodemask for __GFP_THISNODE allocations. If the > > > + * cost of allocating on the stack or the stack usage becomes > > > + * noticable, allocate the nodemasks per node at boot or compile time > > > + */ > > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > > + nodemask_t nodemask; > > > + > > > + return __alloc_pages_internal(gfp_mask, order, > > > + zonelist, nodemask_thisnode()); > > > + } > > > + > > > return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); > > > } > > > > > > > > So alloc_pages_node() calls here and for THISNODE allocations, we go ask > > nodemask_thisnode() for a nodemask... > > > > Also correct. > > > > +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask) > > > +{ > > > + /* Build a nodemask for just this node */ > > > + int nid = numa_node_id(); > > > + > > > + nodes_clear(*nodemask); > > > + node_set(nid, *nodemask); > > > + > > > + return nodemask; > > > +} > > > > > > > > And nodemask_thisnode() always gives us a nodemask with only the node > > the current process is running on set, I think? > > > > Yes, I interpreted THISNODE to mean "this node I am running on". > Callers seemed to expect this but the memoryless needs it to be "this > node I am running on unless I specify a node in which case I mean that > node.". I think that is only true (THISNODE = local node) if the callpath is not via alloc_pages_node(). If the callpath is via alloc_pages_node(), then it depends on whether the nid parameter is -1 (in which case it is also local node) or anything (in which case it is the nid specified). Ah, reading further along, that's exactly what your changelog indicates too :) > > That seems really wrong -- and would explain what Lee was seeing while > > using my patches for the hugetlb pool allocator to use THISNODE > > allocations. All the allocations would end up coming from whatever node > > the process happened to be running on. This obviously messes up hugetlb > > accounting, as I rely on THISNODE requests returning NULL if they go > > off-node. > > > > I'm not sure how this would be fixed, as __alloc_pages() no longer has > > the nid to set in the mask. > > > > Am I wrong in my analysis? > > > > No, you seem to be right on the ball. Can you review the following patch > please and determine if
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
First, sorry for being so slow to respond. I was getting ill towards the end of last week and am worse now. Brain is in total mush as a result. Thanks Lee for finding this problem and thanks to Nish for investigating it properly. Comments and candidate fix to one zonelist are below. On (08/10/07 18:11), Nishanth Aravamudan didst pronounce: > On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote: > > > > Two zonelists exist so that GFP_THISNODE allocations will be guaranteed > > to use memory only from a node local to the CPU. As we can now filter the > > zonelist based on a nodemask, we filter the standard node zonelist for zones > > on the local node when GFP_THISNODE is specified. > > > > When GFP_THISNODE is used, a temporary nodemask is created with only the > > node local to the CPU set. This allows us to eliminate the second zonelist. > > > > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]> > > Acked-by: Christoph Lameter <[EMAIL PROTECTED]> > > > > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff > > linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h > > linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h > > --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h > > 2007-09-28 15:49:57.0 +0100 > > +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h > > 2007-09-28 15:55:03.0 +0100 > > [Reordering the chunks to make my comments a little more logical] > > > > > -static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > > +static inline struct zonelist *node_zonelist(int nid) > > { > > - return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > > + return _DATA(nid)->node_zonelist; > > } > > > > #ifndef HAVE_ARCH_FREE_PAGE > > @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n > > if (nid < 0) > > nid = numa_node_id(); > > > > - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); > > + return __alloc_pages(gfp_mask, order, node_zonelist(nid)); > > } > > This is alloc_pages_node(), and converting the nid to a zonelist means > that lower levels (specifically __alloc_pages() here) are not aware of > nids, as far as I can tell. Yep, this is correct. > This isn't a change, I just want to make > sure I understand... > > > > > struct page * fastcall > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > struct zonelist *zonelist) > > { > > + /* > > +* Use a temporary nodemask for __GFP_THISNODE allocations. If the > > +* cost of allocating on the stack or the stack usage becomes > > +* noticable, allocate the nodemasks per node at boot or compile time > > +*/ > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > + nodemask_t nodemask; > > + > > + return __alloc_pages_internal(gfp_mask, order, > > + zonelist, nodemask_thisnode()); > > + } > > + > > return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); > > } > > > > So alloc_pages_node() calls here and for THISNODE allocations, we go ask > nodemask_thisnode() for a nodemask... > Also correct. > > +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask) > > +{ > > + /* Build a nodemask for just this node */ > > + int nid = numa_node_id(); > > + > > + nodes_clear(*nodemask); > > + node_set(nid, *nodemask); > > + > > + return nodemask; > > +} > > > > And nodemask_thisnode() always gives us a nodemask with only the node > the current process is running on set, I think? > Yes, I interpreted THISNODE to mean "this node I am running on". Callers seemed to expect this but the memoryless needs it to be "this node I am running on unless I specify a node in which case I mean that node.". > That seems really wrong -- and would explain what Lee was seeing while > using my patches for the hugetlb pool allocator to use THISNODE > allocations. All the allocations would end up coming from whatever node > the process happened to be running on. This obviously messes up hugetlb > accounting, as I rely on THISNODE requests returning NULL if they go > off-node. > > I'm not sure how this would be fixed, as __alloc_pages() no longer has > the nid to set in the mask. > > Am I wrong in my analysis? > No, you seem to be right on the ball. Can you review the following patch please and determine if it fixes the problem in a satisfactory manner? I think it does and your tests seemed to give proper values with this patch applied but brain no worky work and a second opinion is needed. Subject: Use specified node ID with GFP_THISNODE if available It had been assumed that __GFP_THISNODE meant allocating from the local node and only the local node. However, users of alloc_pages_node() may also specify GFP_THISNODE. In this case, only the specified node should be used. This patch will allocate pages only from the requested node when GFP_THISNODE is used with alloc_pages_node(). [EMAIL PROTECTED]: Detailed
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
First, sorry for being so slow to respond. I was getting ill towards the end of last week and am worse now. Brain is in total mush as a result. Thanks Lee for finding this problem and thanks to Nish for investigating it properly. Comments and candidate fix to one zonelist are below. On (08/10/07 18:11), Nishanth Aravamudan didst pronounce: On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote: Two zonelists exist so that GFP_THISNODE allocations will be guaranteed to use memory only from a node local to the CPU. As we can now filter the zonelist based on a nodemask, we filter the standard node zonelist for zones on the local node when GFP_THISNODE is specified. When GFP_THISNODE is used, a temporary nodemask is created with only the node local to the CPU set. This allows us to eliminate the second zonelist. Signed-off-by: Mel Gorman [EMAIL PROTECTED] Acked-by: Christoph Lameter [EMAIL PROTECTED] snip diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 2007-09-28 15:49:57.0 +0100 +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h 2007-09-28 15:55:03.0 +0100 [Reordering the chunks to make my comments a little more logical] snip -static inline struct zonelist *node_zonelist(int nid, gfp_t flags) +static inline struct zonelist *node_zonelist(int nid) { - return NODE_DATA(nid)-node_zonelists + gfp_zonelist(flags); + return NODE_DATA(nid)-node_zonelist; } #ifndef HAVE_ARCH_FREE_PAGE @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n if (nid 0) nid = numa_node_id(); - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); + return __alloc_pages(gfp_mask, order, node_zonelist(nid)); } This is alloc_pages_node(), and converting the nid to a zonelist means that lower levels (specifically __alloc_pages() here) are not aware of nids, as far as I can tell. Yep, this is correct. This isn't a change, I just want to make sure I understand... snip struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* +* Use a temporary nodemask for __GFP_THISNODE allocations. If the +* cost of allocating on the stack or the stack usage becomes +* noticable, allocate the nodemasks per node at boot or compile time +*/ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(nodemask)); + } + return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); } snip So alloc_pages_node() calls here and for THISNODE allocations, we go ask nodemask_thisnode() for a nodemask... Also correct. +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask) +{ + /* Build a nodemask for just this node */ + int nid = numa_node_id(); + + nodes_clear(*nodemask); + node_set(nid, *nodemask); + + return nodemask; +} snip And nodemask_thisnode() always gives us a nodemask with only the node the current process is running on set, I think? Yes, I interpreted THISNODE to mean this node I am running on. Callers seemed to expect this but the memoryless needs it to be this node I am running on unless I specify a node in which case I mean that node.. That seems really wrong -- and would explain what Lee was seeing while using my patches for the hugetlb pool allocator to use THISNODE allocations. All the allocations would end up coming from whatever node the process happened to be running on. This obviously messes up hugetlb accounting, as I rely on THISNODE requests returning NULL if they go off-node. I'm not sure how this would be fixed, as __alloc_pages() no longer has the nid to set in the mask. Am I wrong in my analysis? No, you seem to be right on the ball. Can you review the following patch please and determine if it fixes the problem in a satisfactory manner? I think it does and your tests seemed to give proper values with this patch applied but brain no worky work and a second opinion is needed. Subject: Use specified node ID with GFP_THISNODE if available It had been assumed that __GFP_THISNODE meant allocating from the local node and only the local node. However, users of alloc_pages_node() may also specify GFP_THISNODE. In this case, only the specified node should be used. This patch will allocate pages only from the requested node when GFP_THISNODE is used with alloc_pages_node(). [EMAIL PROTECTED]: Detailed analysis of problem] Found-by: Lee Schermerhorn [EMAIL PROTECTED] Signed-off-by: Mel Gorman [EMAIL PROTECTED] --- include/linux/gfp.h | 10
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.10.2007 [16:40:53 +0100], Mel Gorman wrote: First, sorry for being so slow to respond. I was getting ill towards the end of last week and am worse now. Brain is in total mush as a result. Thanks Lee for finding this problem and thanks to Nish for investigating it properly. Comments and candidate fix to one zonelist are below. On (08/10/07 18:11), Nishanth Aravamudan didst pronounce: On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote: Two zonelists exist so that GFP_THISNODE allocations will be guaranteed to use memory only from a node local to the CPU. As we can now filter the zonelist based on a nodemask, we filter the standard node zonelist for zones on the local node when GFP_THISNODE is specified. When GFP_THISNODE is used, a temporary nodemask is created with only the node local to the CPU set. This allows us to eliminate the second zonelist. Signed-off-by: Mel Gorman [EMAIL PROTECTED] Acked-by: Christoph Lameter [EMAIL PROTECTED] snip diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 2007-09-28 15:49:57.0 +0100 +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h 2007-09-28 15:55:03.0 +0100 [Reordering the chunks to make my comments a little more logical] snip -static inline struct zonelist *node_zonelist(int nid, gfp_t flags) +static inline struct zonelist *node_zonelist(int nid) { - return NODE_DATA(nid)-node_zonelists + gfp_zonelist(flags); + return NODE_DATA(nid)-node_zonelist; } #ifndef HAVE_ARCH_FREE_PAGE @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n if (nid 0) nid = numa_node_id(); - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); + return __alloc_pages(gfp_mask, order, node_zonelist(nid)); } This is alloc_pages_node(), and converting the nid to a zonelist means that lower levels (specifically __alloc_pages() here) are not aware of nids, as far as I can tell. Yep, this is correct. This isn't a change, I just want to make sure I understand... snip struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* + * Use a temporary nodemask for __GFP_THISNODE allocations. If the + * cost of allocating on the stack or the stack usage becomes + * noticable, allocate the nodemasks per node at boot or compile time + */ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(nodemask)); + } + return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); } snip So alloc_pages_node() calls here and for THISNODE allocations, we go ask nodemask_thisnode() for a nodemask... Also correct. +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask) +{ + /* Build a nodemask for just this node */ + int nid = numa_node_id(); + + nodes_clear(*nodemask); + node_set(nid, *nodemask); + + return nodemask; +} snip And nodemask_thisnode() always gives us a nodemask with only the node the current process is running on set, I think? Yes, I interpreted THISNODE to mean this node I am running on. Callers seemed to expect this but the memoryless needs it to be this node I am running on unless I specify a node in which case I mean that node.. I think that is only true (THISNODE = local node) if the callpath is not via alloc_pages_node(). If the callpath is via alloc_pages_node(), then it depends on whether the nid parameter is -1 (in which case it is also local node) or anything (in which case it is the nid specified). Ah, reading further along, that's exactly what your changelog indicates too :) That seems really wrong -- and would explain what Lee was seeing while using my patches for the hugetlb pool allocator to use THISNODE allocations. All the allocations would end up coming from whatever node the process happened to be running on. This obviously messes up hugetlb accounting, as I rely on THISNODE requests returning NULL if they go off-node. I'm not sure how this would be fixed, as __alloc_pages() no longer has the nid to set in the mask. Am I wrong in my analysis? No, you seem to be right on the ball. Can you review the following patch please and determine if it fixes the problem in a satisfactory manner? I think it does and your tests seemed to give proper values with this patch applied but brain no worky work and a second opinion is needed. Subject: Use specified node ID with GFP_THISNODE if available It had been
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 09.10.2007 [16:40:53 +0100], Mel Gorman wrote: First, sorry for being so slow to respond. I was getting ill towards the end of last week and am worse now. Brain is in total mush as a result. Thanks Lee for finding this problem and thanks to Nish for investigating it properly. Comments and candidate fix to one zonelist are below. On (08/10/07 18:11), Nishanth Aravamudan didst pronounce: On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote: Two zonelists exist so that GFP_THISNODE allocations will be guaranteed to use memory only from a node local to the CPU. As we can now filter the zonelist based on a nodemask, we filter the standard node zonelist for zones on the local node when GFP_THISNODE is specified. When GFP_THISNODE is used, a temporary nodemask is created with only the node local to the CPU set. This allows us to eliminate the second zonelist. Signed-off-by: Mel Gorman [EMAIL PROTECTED] Acked-by: Christoph Lameter [EMAIL PROTECTED] snip diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 2007-09-28 15:49:57.0 +0100 +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h 2007-09-28 15:55:03.0 +0100 [Reordering the chunks to make my comments a little more logical] snip -static inline struct zonelist *node_zonelist(int nid, gfp_t flags) +static inline struct zonelist *node_zonelist(int nid) { - return NODE_DATA(nid)-node_zonelists + gfp_zonelist(flags); + return NODE_DATA(nid)-node_zonelist; } #ifndef HAVE_ARCH_FREE_PAGE @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n if (nid 0) nid = numa_node_id(); - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); + return __alloc_pages(gfp_mask, order, node_zonelist(nid)); } This is alloc_pages_node(), and converting the nid to a zonelist means that lower levels (specifically __alloc_pages() here) are not aware of nids, as far as I can tell. Yep, this is correct. This isn't a change, I just want to make sure I understand... snip struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* + * Use a temporary nodemask for __GFP_THISNODE allocations. If the + * cost of allocating on the stack or the stack usage becomes + * noticable, allocate the nodemasks per node at boot or compile time + */ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(nodemask)); + } + return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); } snip So alloc_pages_node() calls here and for THISNODE allocations, we go ask nodemask_thisnode() for a nodemask... Also correct. +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask) +{ + /* Build a nodemask for just this node */ + int nid = numa_node_id(); + + nodes_clear(*nodemask); + node_set(nid, *nodemask); + + return nodemask; +} snip And nodemask_thisnode() always gives us a nodemask with only the node the current process is running on set, I think? Yes, I interpreted THISNODE to mean this node I am running on. Callers seemed to expect this but the memoryless needs it to be this node I am running on unless I specify a node in which case I mean that node.. That seems really wrong -- and would explain what Lee was seeing while using my patches for the hugetlb pool allocator to use THISNODE allocations. All the allocations would end up coming from whatever node the process happened to be running on. This obviously messes up hugetlb accounting, as I rely on THISNODE requests returning NULL if they go off-node. I'm not sure how this would be fixed, as __alloc_pages() no longer has the nid to set in the mask. Am I wrong in my analysis? No, you seem to be right on the ball. Can you review the following patch please and determine if it fixes the problem in a satisfactory manner? I think it does and your tests seemed to give proper values with this patch applied but brain no worky work and a second opinion is needed. Subject: Use specified node ID with GFP_THISNODE if available It had been assumed that __GFP_THISNODE meant allocating from the local node and only the local node. However, users of alloc_pages_node() may also specify GFP_THISNODE. In this case, only the specified node should be used. This patch will allocate pages only from the requested node when GFP_THISNODE is used with alloc_pages_node(). [EMAIL PROTECTED]: Detailed
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Tue, 9 Oct 2007, Nishanth Aravamudan wrote: And nodemask_thisnode() always gives us a nodemask with only the node the current process is running on set, I think? Yes, I interpreted THISNODE to mean this node I am running on. Callers seemed to expect this but the memoryless needs it to be this node I am running on unless I specify a node in which case I mean that node.. I think that is only true (THISNODE = local node) if the callpath is not via alloc_pages_node(). If the callpath is via alloc_pages_node(), then it depends on whether the nid parameter is -1 (in which case it is also local node) or anything (in which case it is the nid specified). Ah, reading further along, that's exactly what your changelog indicates too :) Right. THISNODE means the node we are on or the node that we indicated we want to allocate from. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 08.10.2007 [18:56:05 -0700], Christoph Lameter wrote: > On Mon, 8 Oct 2007, Nishanth Aravamudan wrote: > > > > struct page * fastcall > > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > > struct zonelist *zonelist) > > > { > > > + /* > > > + * Use a temporary nodemask for __GFP_THISNODE allocations. If the > > > + * cost of allocating on the stack or the stack usage becomes > > > + * noticable, allocate the nodemasks per node at boot or compile time > > > + */ > > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > > + nodemask_t nodemask; > > > + > > > + return __alloc_pages_internal(gfp_mask, order, > > > + zonelist, nodemask_thisnode()); > > > + } > > > + > > > return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); > > > } > > > > > > > > So alloc_pages_node() calls here and for THISNODE allocations, we go ask > > nodemask_thisnode() for a nodemask... > > H... nodemask_thisnode needs to be passed the zonelist. > > > And nodemask_thisnode() always gives us a nodemask with only the node > > the current process is running on set, I think? > > Right. > > > > That seems really wrong -- and would explain what Lee was seeing while > > using my patches for the hugetlb pool allocator to use THISNODE > > allocations. All the allocations would end up coming from whatever node > > the process happened to be running on. This obviously messes up hugetlb > > accounting, as I rely on THISNODE requests returning NULL if they go > > off-node. > > > > I'm not sure how this would be fixed, as __alloc_pages() no longer has > > the nid to set in the mask. > > > > Am I wrong in my analysis? > > No you are right on target. The thisnode function must determine the > node from the first zone of the zonelist. It seems like I would zonelist_node_idx() for this, along the lines of: static nodemask_t *nodemask_thisnode(nodemask_t *nodemask, struct zonelist *zonelist) { int nid = zonelist_node_idx(zonelist); /* Build a nodemask for just this node */ nodes_clear(*nodemask); node_set(nid, *nodemask); return nodemask; } But I think I need to check that zonelist->_zonerefs->zone is !NULL, given this definition of zonelist_node_idx() static inline int zonelist_node_idx(struct zoneref *zoneref) { #ifdef CONFIG_NUMA /* zone_to_nid not available in this context */ return zoneref->zone->node; #else return 0; #endif /* CONFIG_NUMA */ } and this comment in __alloc_pages_internal(): z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */ if (unlikely(!z->zone)) { /* * Happens if we have an empty zonelist as a result of * GFP_THISNODE being used on a memoryless node */ return NULL; } ... It seems like zoneref->zone may be NULL in zonelist_node_idx()? Maybe someone else should look into resolving this :) Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Mon, 8 Oct 2007, Nishanth Aravamudan wrote: > > struct page * fastcall > > __alloc_pages(gfp_t gfp_mask, unsigned int order, > > struct zonelist *zonelist) > > { > > + /* > > +* Use a temporary nodemask for __GFP_THISNODE allocations. If the > > +* cost of allocating on the stack or the stack usage becomes > > +* noticable, allocate the nodemasks per node at boot or compile time > > +*/ > > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > > + nodemask_t nodemask; > > + > > + return __alloc_pages_internal(gfp_mask, order, > > + zonelist, nodemask_thisnode()); > > + } > > + > > return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); > > } > > > > So alloc_pages_node() calls here and for THISNODE allocations, we go ask > nodemask_thisnode() for a nodemask... H... nodemask_thisnode needs to be passed the zonelist. > And nodemask_thisnode() always gives us a nodemask with only the node > the current process is running on set, I think? Right. > That seems really wrong -- and would explain what Lee was seeing while > using my patches for the hugetlb pool allocator to use THISNODE > allocations. All the allocations would end up coming from whatever node > the process happened to be running on. This obviously messes up hugetlb > accounting, as I rely on THISNODE requests returning NULL if they go > off-node. > > I'm not sure how this would be fixed, as __alloc_pages() no longer has > the nid to set in the mask. > > Am I wrong in my analysis? No you are right on target. The thisnode function must determine the node from the first zone of the zonelist. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote: > > Two zonelists exist so that GFP_THISNODE allocations will be guaranteed > to use memory only from a node local to the CPU. As we can now filter the > zonelist based on a nodemask, we filter the standard node zonelist for zones > on the local node when GFP_THISNODE is specified. > > When GFP_THISNODE is used, a temporary nodemask is created with only the > node local to the CPU set. This allows us to eliminate the second zonelist. > > Signed-off-by: Mel Gorman <[EMAIL PROTECTED]> > Acked-by: Christoph Lameter <[EMAIL PROTECTED]> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff > linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h > linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h > --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h > 2007-09-28 15:49:57.0 +0100 > +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h > 2007-09-28 15:55:03.0 +0100 [Reordering the chunks to make my comments a little more logical] > -static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > +static inline struct zonelist *node_zonelist(int nid) > { > - return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > + return _DATA(nid)->node_zonelist; > } > > #ifndef HAVE_ARCH_FREE_PAGE > @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n > if (nid < 0) > nid = numa_node_id(); > > - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); > + return __alloc_pages(gfp_mask, order, node_zonelist(nid)); > } This is alloc_pages_node(), and converting the nid to a zonelist means that lower levels (specifically __alloc_pages() here) are not aware of nids, as far as I can tell. This isn't a change, I just want to make sure I understand... > struct page * fastcall > __alloc_pages(gfp_t gfp_mask, unsigned int order, > struct zonelist *zonelist) > { > + /* > + * Use a temporary nodemask for __GFP_THISNODE allocations. If the > + * cost of allocating on the stack or the stack usage becomes > + * noticable, allocate the nodemasks per node at boot or compile time > + */ > + if (unlikely(gfp_mask & __GFP_THISNODE)) { > + nodemask_t nodemask; > + > + return __alloc_pages_internal(gfp_mask, order, > + zonelist, nodemask_thisnode()); > + } > + > return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); > } So alloc_pages_node() calls here and for THISNODE allocations, we go ask nodemask_thisnode() for a nodemask... > +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask) > +{ > + /* Build a nodemask for just this node */ > + int nid = numa_node_id(); > + > + nodes_clear(*nodemask); > + node_set(nid, *nodemask); > + > + return nodemask; > +} And nodemask_thisnode() always gives us a nodemask with only the node the current process is running on set, I think? That seems really wrong -- and would explain what Lee was seeing while using my patches for the hugetlb pool allocator to use THISNODE allocations. All the allocations would end up coming from whatever node the process happened to be running on. This obviously messes up hugetlb accounting, as I rely on THISNODE requests returning NULL if they go off-node. I'm not sure how this would be fixed, as __alloc_pages() no longer has the nid to set in the mask. Am I wrong in my analysis? Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote: Two zonelists exist so that GFP_THISNODE allocations will be guaranteed to use memory only from a node local to the CPU. As we can now filter the zonelist based on a nodemask, we filter the standard node zonelist for zones on the local node when GFP_THISNODE is specified. When GFP_THISNODE is used, a temporary nodemask is created with only the node local to the CPU set. This allows us to eliminate the second zonelist. Signed-off-by: Mel Gorman [EMAIL PROTECTED] Acked-by: Christoph Lameter [EMAIL PROTECTED] snip diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h 2007-09-28 15:49:57.0 +0100 +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h 2007-09-28 15:55:03.0 +0100 [Reordering the chunks to make my comments a little more logical] snip -static inline struct zonelist *node_zonelist(int nid, gfp_t flags) +static inline struct zonelist *node_zonelist(int nid) { - return NODE_DATA(nid)-node_zonelists + gfp_zonelist(flags); + return NODE_DATA(nid)-node_zonelist; } #ifndef HAVE_ARCH_FREE_PAGE @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n if (nid 0) nid = numa_node_id(); - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); + return __alloc_pages(gfp_mask, order, node_zonelist(nid)); } This is alloc_pages_node(), and converting the nid to a zonelist means that lower levels (specifically __alloc_pages() here) are not aware of nids, as far as I can tell. This isn't a change, I just want to make sure I understand... snip struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* + * Use a temporary nodemask for __GFP_THISNODE allocations. If the + * cost of allocating on the stack or the stack usage becomes + * noticable, allocate the nodemasks per node at boot or compile time + */ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(nodemask)); + } + return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); } snip So alloc_pages_node() calls here and for THISNODE allocations, we go ask nodemask_thisnode() for a nodemask... +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask) +{ + /* Build a nodemask for just this node */ + int nid = numa_node_id(); + + nodes_clear(*nodemask); + node_set(nid, *nodemask); + + return nodemask; +} snip And nodemask_thisnode() always gives us a nodemask with only the node the current process is running on set, I think? That seems really wrong -- and would explain what Lee was seeing while using my patches for the hugetlb pool allocator to use THISNODE allocations. All the allocations would end up coming from whatever node the process happened to be running on. This obviously messes up hugetlb accounting, as I rely on THISNODE requests returning NULL if they go off-node. I'm not sure how this would be fixed, as __alloc_pages() no longer has the nid to set in the mask. Am I wrong in my analysis? Thanks, Nish -- Nishanth Aravamudan [EMAIL PROTECTED] IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On Mon, 8 Oct 2007, Nishanth Aravamudan wrote: struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* +* Use a temporary nodemask for __GFP_THISNODE allocations. If the +* cost of allocating on the stack or the stack usage becomes +* noticable, allocate the nodemasks per node at boot or compile time +*/ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(nodemask)); + } + return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); } snip So alloc_pages_node() calls here and for THISNODE allocations, we go ask nodemask_thisnode() for a nodemask... H... nodemask_thisnode needs to be passed the zonelist. And nodemask_thisnode() always gives us a nodemask with only the node the current process is running on set, I think? Right. That seems really wrong -- and would explain what Lee was seeing while using my patches for the hugetlb pool allocator to use THISNODE allocations. All the allocations would end up coming from whatever node the process happened to be running on. This obviously messes up hugetlb accounting, as I rely on THISNODE requests returning NULL if they go off-node. I'm not sure how this would be fixed, as __alloc_pages() no longer has the nid to set in the mask. Am I wrong in my analysis? No you are right on target. The thisnode function must determine the node from the first zone of the zonelist. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
On 08.10.2007 [18:56:05 -0700], Christoph Lameter wrote: On Mon, 8 Oct 2007, Nishanth Aravamudan wrote: struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { + /* + * Use a temporary nodemask for __GFP_THISNODE allocations. If the + * cost of allocating on the stack or the stack usage becomes + * noticable, allocate the nodemasks per node at boot or compile time + */ + if (unlikely(gfp_mask __GFP_THISNODE)) { + nodemask_t nodemask; + + return __alloc_pages_internal(gfp_mask, order, + zonelist, nodemask_thisnode(nodemask)); + } + return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); } snip So alloc_pages_node() calls here and for THISNODE allocations, we go ask nodemask_thisnode() for a nodemask... H... nodemask_thisnode needs to be passed the zonelist. And nodemask_thisnode() always gives us a nodemask with only the node the current process is running on set, I think? Right. That seems really wrong -- and would explain what Lee was seeing while using my patches for the hugetlb pool allocator to use THISNODE allocations. All the allocations would end up coming from whatever node the process happened to be running on. This obviously messes up hugetlb accounting, as I rely on THISNODE requests returning NULL if they go off-node. I'm not sure how this would be fixed, as __alloc_pages() no longer has the nid to set in the mask. Am I wrong in my analysis? No you are right on target. The thisnode function must determine the node from the first zone of the zonelist. It seems like I would zonelist_node_idx() for this, along the lines of: static nodemask_t *nodemask_thisnode(nodemask_t *nodemask, struct zonelist *zonelist) { int nid = zonelist_node_idx(zonelist); /* Build a nodemask for just this node */ nodes_clear(*nodemask); node_set(nid, *nodemask); return nodemask; } But I think I need to check that zonelist-_zonerefs-zone is !NULL, given this definition of zonelist_node_idx() static inline int zonelist_node_idx(struct zoneref *zoneref) { #ifdef CONFIG_NUMA /* zone_to_nid not available in this context */ return zoneref-zone-node; #else return 0; #endif /* CONFIG_NUMA */ } and this comment in __alloc_pages_internal(): z = zonelist-_zonerefs; /* the list of zones suitable for gfp_mask */ if (unlikely(!z-zone)) { /* * Happens if we have an empty zonelist as a result of * GFP_THISNODE being used on a memoryless node */ return NULL; } ... It seems like zoneref-zone may be NULL in zonelist_node_idx()? Maybe someone else should look into resolving this :) Thanks, Nish -- Nishanth Aravamudan [EMAIL PROTECTED] IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/